<div style="background-color: lightgray; padding: 18px;">
    <h1>Learning Python | Day 16</h1>
    
</div>

### Features:
Pandas
- Introduction
- Installation and Import
- Pandas Series
- Operations and Methods
- Exercises

<div style="background-color: lightgreen; padding: 10px;">
    <h2>Introduction to Python Pandas</h2>
</div>

``Pandas`` is an open-source library built on top of the ``NumPy`` library.

It is a Python package that offers various *data structures* and *operations* for manipulating numerical data and time series. It is popular mainly because it makes importing and analyzing data much easier. Pandas is **fast**, and it offers high performance and productivity for its users.

Here is a list of things that we can do using Pandas:
- Data set cleaning, merging, and joining.
- Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
- Columns can be inserted and deleted from DataFrame and higher dimensional objects.
- Powerful group-by functionality for performing split-apply-combine operations on data sets.
- Data visualization.
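
As a rough illustration of a few items from the list above, the sketch below fills a missing value, inserts and deletes a column, and runs a group-by. The `sales` table, its column names, and its numbers are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value (NaN)
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "revenue": [100.0, np.nan, 80.0, 120.0],
})

# Missing data: replace NaN with the mean of the column
sales["revenue"] = sales["revenue"].fillna(sales["revenue"].mean())

# Insert a column, then delete it again
sales["double_revenue"] = sales["revenue"] * 2
sales = sales.drop(columns="double_revenue")

# Split-apply-combine: total revenue per store
totals = sales.groupby("store")["revenue"].sum()
print(totals)
```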

Sources:
- https://pandas.pydata.org/
- https://www.w3schools.com/python/pandas/default.asp
- https://www.geeksforgeeks.org/pandas-tutorial/

<div style="background-color: lightgreen; padding: 10px;">
    <h2>Installation and Import</h2>
</div>

If you have ``Python`` and ``PIP`` already installed on a system, then installation of ``Pandas`` is very easy.

Install it using one of these commands:

In [4]:
# On cmd:
# conda install pandas
# pip install pandas

After pandas has been installed, you need to import the ``library``.
Pandas is usually imported under the ``pd`` alias.

This module is generally imported as follows:

In [3]:
import pandas as pd

In [9]:
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}
print(mydataset)
print(type(mydataset))

print('-'*32)

myvar = pd.DataFrame(mydataset)

print(myvar)
print(type(myvar))

{'cars': ['BMW', 'Volvo', 'Ford'], 'passings': [3, 7, 2]}
<class 'dict'>
--------------------------------
    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2
<class 'pandas.core.frame.DataFrame'>


In [10]:
# Checking pandas version:

import pandas as pd

print(pd.__version__)

1.5.3


<div style="background-color: lightgreen; padding: 10px;">
    <h2>Intro to Pandas Series</h2>
</div>

Pandas provides two main data structures for manipulating data:
- Series
- DataFrame

*Pandas Series*

A ``Pandas Series`` is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index.

A Pandas Series is comparable to a single **column** in an ``Excel sheet``. Labels need not be unique, but they must be of a hashable type. The object supports both integer and label-based indexing and provides a host of methods for performing operations involving the index.
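
To make the point about non-unique labels concrete, here is a small sketch with made-up values. It shows label-based and position-based lookup on a Series whose label `"a"` appears twice.

```python
import pandas as pd

# Labels may repeat: "a" appears twice
s = pd.Series([10, 20, 30], index=["a", "b", "a"])

# Label-based lookup returns every row that carries a repeated label
repeated = s.loc["a"]
print(repeated)

# A label that occurs once gives back a single value
print(s.loc["b"])

# Position-based lookup ignores the labels entirely
print(s.iloc[0])
```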

Sources:
- https://www.geeksforgeeks.org/introduction-to-pandas-in-python/
- https://www.w3schools.com/python/pandas/pandas_series.asp
- https://www.geeksforgeeks.org/python-pandas-series/

In [11]:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

0    1
1    7
2    2
dtype: int64


***Note:*** If nothing else is specified, the values are labeled with their ``index`` number. The first value has index ``0``, the second has index ``1``, and so on.

With the index argument, you can name your own labels.

In [12]:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

x    1
y    7
z    2
dtype: int64


In [13]:
# When you have created labels, you can access an item by referring to the label.

print(myvar["y"])

7


<div style="background-color: lightgreen; padding: 10px;">
    <h2>Pandas Series: Operations and Methods</h2>
</div>

Here is a brief overview of the basic operations that can be performed on a Pandas Series:
- Creating a Series
- Accessing elements of a Series
- Indexing and selecting data in a Series
- Binary operations on Series
- Methods on Series

In practice, a Pandas Series is usually created by loading data from existing storage, such as an SQL database, a CSV file, or an Excel file.
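
As a minimal sketch of loading from storage, the example below reads CSV text with `pd.read_csv` and takes one column as a Series. The CSV content and the column names are invented, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk
csv_text = "product,price\npen,2\nbook,10\n"

df = pd.read_csv(io.StringIO(csv_text))

# Selecting a single column of a DataFrame yields a Series
prices = df["price"]
print(type(prices))
print(prices)
```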

A ``Pandas Series`` can be created from ``lists``, ``dictionaries``, ``scalar values``, and more. Here are some of the ways to create one:

In [1]:
import pandas as pd
import numpy as np
 
# simple array
data = np.array(['g','e','e','k','s'])
 
ser = pd.Series(data)
print(ser)

0    g
1    e
2    e
3    k
4    s
dtype: object


In [2]:
import pandas as pd
 
# a simple list (named "letters" to avoid shadowing the built-in list)
letters = ['g', 'e', 'e', 'k', 's']
  
# create a Series from a list
ser = pd.Series(letters)
print(ser)

0    g
1    e
2    e
3    k
4    s
dtype: object
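
A Series can also be built from a dictionary or from a scalar value, as mentioned earlier. The following sketch uses made-up values to show both cases.

```python
import pandas as pd

# From a dictionary: the keys become the index labels
from_dict = pd.Series({"x": 1, "y": 7, "z": 2})
print(from_dict)

# From a scalar: the value is repeated once for every label in the index
from_scalar = pd.Series(5, index=["a", "b", "c"])
print(from_scalar)
```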


In [7]:
import pandas as pd
import numpy as np
 
# creating simple array
data = np.array(['g','e','e','k','s','f', 'o','r','g','e','e','k','s'])
ser = pd.Series(data)
    
# retrieve the first 7 elements
print(ser[:7])

0    g
1    e
2    e
3    k
4    s
5    f
6    o
dtype: object


In [8]:
# creating simple array
data = np.array(['g','e','e','k','s','f', 'o','r','g','e','e','k','s'])
ser = pd.Series(data,index=[10,11,12,13,14,15,16,17,18,19,20,21,22])
    
# access an element by its index label
print(ser[16])

o


---

We can also select the desired elements using the indexers:

- `iloc`: locates the element by its numerical position (starting from 0).
- `loc`: locates the element by the series' own index labels (which were named at some previous point).

In [9]:
import numpy as np
import pandas as pd

my_list = [10, 20, 30, 40]
serie = pd.Series(my_list, index=['a', 'b', 'c', 'd'], dtype=np.float32)

In [10]:
serie.iloc[2]

30.0

In [11]:
serie.iloc[1:3]

b    20.0
c    30.0
dtype: float32

In [12]:
serie.loc['c']

30.0

In [13]:
serie.loc['a':'c']

a    10.0
b    20.0
c    30.0
dtype: float32
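
One detail worth noticing in the slices above is that `iloc` excludes the end position while `loc` includes the end label. A short sketch, using an illustrative Series `s`, makes the difference explicit.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_position = s.iloc[0:2]   # positions 0 and 1; position 2 is excluded
by_label = s.loc["a":"c"]   # labels a, b and c; the end label is included

print(by_position.tolist())  # [10, 20]
print(by_label.tolist())     # [10, 20, 30]
```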

In [16]:
# It's also possible to filter with a boolean mask, as in NumPy:

mask = serie > 15
mask

a    False
b     True
c     True
d     True
dtype: bool

In [15]:
serie.loc[mask]

b    20.0
c    30.0
d    40.0
dtype: float32

---
Pandas Series Operations:

| FUNCTION | DESCRIPTION |
|----------|-------------|
| add() | Adds a Series (aligned by index label) or a list-like object of the same length to the caller Series |
| sub() | Subtracts a Series (aligned by index label) or a list-like object of the same length from the caller Series |
| mul() | Multiplies the caller Series by a Series (aligned by index label) or a list-like object of the same length |
| div() | Divides the caller Series by a Series (aligned by index label) or a list-like object of the same length |
| sum() | Returns the sum of the values for the requested axis |
| prod() | Returns the product of the values for the requested axis |
| mean() | Returns the mean of the values for the requested axis |
| pow() | Raises each element of the caller Series to the power given by the corresponding element of the passed Series and returns the results |
| abs() | Returns the absolute numeric value of each element in a Series/DataFrame |
| cov() | Computes the covariance between the caller Series and another Series |
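
The binary methods in the table can be sketched as follows. The two Series `a` and `b` are made up and deliberately share only some labels, so the example also shows how values are matched by label and how the `fill_value` parameter handles a label that exists on one side only.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["x", "y", "w"])

# The + operator aligns on labels; a label found in only one Series gives NaN
plain_sum = a + b
print(plain_sum)

# add() does the same, but fill_value substitutes the missing side
filled_sum = a.add(b, fill_value=0)
print(filled_sum)

# div() divides the caller by the argument; pow() raises the caller to it
halves = a.div(2)
squares = a.pow(2)
print(halves)
print(squares)
```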

In [17]:
values = [1, 1, 2, 3, 5, 8, 13]
fibonacci = pd.Series(values)

In [18]:
fibonacci.sum()

33

In [20]:
fibonacci[fibonacci > 4].sum()

26

---
Pandas Series Methods:


| FUNCTION | DESCRIPTION |
|----------|-------------|
| Series() | Constructor that creates a pandas Series. It accepts a variety of inputs |
| combine_first() | Combines two Series into one, taking values from the second Series where the caller has missing values |
| count() | Returns the number of non-NA/null observations in the Series |
| size | Attribute (no parentheses) that returns the number of elements in the underlying data |
| name | Attribute (no parentheses) that holds the name of a Series object, i.e., of the column |
| is_unique | Attribute (no parentheses) that is True if the values in the object are unique |
| idxmax() | Returns the index label of the highest value in a Series |
| idxmin() | Returns the index label of the lowest value in a Series |
| sort_values() | Sorts the values of a Series in ascending or descending order |
| sort_index() | Sorts a pandas Series by its index instead of its values |
| head() | Returns a specified number of rows from the beginning of a Series. The method returns a brand new Series |
| tail() | Returns a specified number of rows from the end of a Series. The method returns a brand new Series |
| get() | Extracts the value for a given label, returning a default if the label is missing. This is an alternative to the traditional bracket syntax |
| unique() | Returns the unique values in a Series |
| nunique() | Returns the number of unique values |
| value_counts() | Counts the number of times each unique value occurs in a Series |
| factorize() | Encodes the values as integer codes by identifying distinct values |
| map() | Maps each value of a Series to another value using a dictionary, a Series, or a function |
| between() | Returns a boolean mask marking which values lie between the first and second argument |
| apply() | Takes a Python function as an argument and applies it to every Series value. This method is helpful for executing custom operations that are not included in pandas or numpy |
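
A few entries from the table can be tried on a small Series. The `scores` data below is invented for illustration. Note that `size`, `name`, and `is_unique` are read as attributes, without parentheses.

```python
import pandas as pd

scores = pd.Series([70, 85, 85, 92],
                   index=["ana", "bob", "cid", "dee"],
                   name="score")

print(scores.size)        # number of elements (attribute)
print(scores.is_unique)   # whether every value is distinct (attribute)
print(scores.idxmax())    # label of the highest value
print(scores.nunique())   # number of distinct values

# Boolean mask; both bounds are included by default
in_range = scores.between(80, 90)
print(in_range)

# Two highest scores, then a custom function applied to every value
print(scores.sort_values(ascending=False).head(2))
print(scores.apply(lambda v: v / 100))
```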

In [21]:
serie.value_counts()

10.0    1
20.0    1
30.0    1
40.0    1
dtype: int64

In [23]:
serie.value_counts(normalize=True)

10.0    0.25
20.0    0.25
30.0    0.25
40.0    0.25
dtype: float64

In [22]:
serie.describe()

count     4.000000
mean     25.000000
std      12.909945
min      10.000000
25%      17.500000
50%      25.000000
75%      32.500000
max      40.000000
dtype: float64

---
When a series is composed of string elements (``str`` in Python), pandas (version 1.5.3, used here) stores it with the ``object`` dtype.

In these cases, the series has a ``str`` accessor that lets us perform string-specific operations on every element of the series.

In [24]:
pronomes = pd.Series(['eu', 'tu', 'ele/ela', 'nós', 'vós', 'eles/elas'])
pronomes

0           eu
1           tu
2      ele/ela
3          nós
4          vós
5    eles/elas
dtype: object

In [25]:
pronomes.str

<pandas.core.strings.accessor.StringMethods at 0x1ea72bf3b10>

In [26]:
pronomes.str.upper()

0           EU
1           TU
2      ELE/ELA
3          NÓS
4          VÓS
5    ELES/ELAS
dtype: object
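
Beyond `upper()`, the `str` accessor exposes other string operations. The sketch below reuses the same `pronomes` data and tries three of them: `len`, `contains`, and `split`.

```python
import pandas as pd

pronomes = pd.Series(['eu', 'tu', 'ele/ela', 'nós', 'vós', 'eles/elas'])

# Number of characters in each element
lengths = pronomes.str.len()

# Boolean mask: True where the element contains a slash
has_slash = pronomes.str.contains('/', regex=False)

# Split on the slash and keep the first piece of each element
first_form = pronomes.str.split('/').str[0]

print(lengths)
print(pronomes[has_slash])
print(first_form)
```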

<div style="background-color: lightgreen; padding: 10px;">
    <h2>Exercises</h2>
</div>

---

#### <font color="blue">Exercise 1</font>
- Maria is keeping track of her daily expenses over a month. For this, she has the following series:

```python
dias_da_semana = ['Seg', 'Ter', 'Qua', 'Qui', 'Sex', 'Sab', 'Dom']
semana1 = pd.Series([50, 66, 55, 55, 23, np.nan, 92], index=dias_da_semana)
semana2 = pd.Series([np.nan, 78, 76, 66, 64, 44, 78], index=dias_da_semana)
semana3 = pd.Series([55, 75, 89, 77, 78, 57, np.nan], index=dias_da_semana)
semana4 = pd.Series([67, 34, np.nan, 25, 45, np.nan, 95], index=dias_da_semana)
```

As you can see, each of these weeks has one or more missing values. That said:

- Fill in each missing value with the mean of the same weekday for the set of 4 weeks.
- For each week, print the total spending, average spending, median spending, maximum spending, minimum spending, and standard deviation.
- For each week, print the total spending on weekdays and the total spending on weekends.

In [27]:
dias_da_semana = ['Seg', 'Ter', 'Qua', 'Qui', 'Sex', 'Sab', 'Dom']
semana1 = pd.Series([50, 66, 55, 55, 23, np.nan, 92], index=dias_da_semana)
semana2 = pd.Series([np.nan, 78, 76, 66, 64, 44, 78], index=dias_da_semana)
semana3 = pd.Series([55, 75, 89, 77, 78, 57, np.nan], index=dias_da_semana)
semana4 = pd.Series([67, 34, np.nan, 25, 45, np.nan, 95], index=dias_da_semana)

In [50]:
# 1:
df = pd.DataFrame([semana1,semana2,semana3,semana4], columns=dias_da_semana)
total_mean = df.mean()
total_mean

Seg    57.333333
Ter    63.250000
Qua    73.333333
Qui    55.750000
Sex    52.500000
Sab    50.500000
Dom    88.333333
dtype: float64

In [51]:
df = df.fillna(total_mean)
df

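| | Seg | Ter | Qua | Qui | Sex | Sab | Dom |
|---|-----|-----|-----|-----|-----|-----|-----|
| 0 | 50.0 | 66.0 | 55.0 | 55.0 | 23.0 | 50.5 | 92.0 |
| 1 | 57.333333 | 78.0 | 76.0 | 66.0 | 64.0 | 44.0 | 78.0 |
| 2 | 55.0 | 75.0 | 89.0 | 77.0 | 78.0 | 57.0 | 88.333333 |
| 3 | 67.0 | 34.0 | 73.333333 | 25.0 | 45.0 | 50.5 | 95.0 |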


In [46]:
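# 2: statistics for each week, computed on the original series
# (NaN values are skipped by these methods)
for semana in [semana1, semana2, semana3, semana4]:
    print('')
    print(semana.sum())       # total spending
    print(semana.mean())      # average spending
    print(semana.median())    # median spending
    print(semana.max())       # maximum spending
    print(semana.min())       # minimum spending
    print(semana.std())       # standard deviation
    print(semana.describe())  # summary of the statistics above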


341.0
56.833333333333336
55.0
92.0
23.0
22.44474697265858
count     6.000000
mean     56.833333
std      22.444747
min      23.000000
25%      51.250000
50%      55.000000
75%      63.250000
max      92.000000
dtype: float64

406.0
67.66666666666667
71.0
78.0
44.0
13.109792777411345
count     6.000000
mean     67.666667
std      13.109793
min      44.000000
25%      64.500000
50%      71.000000
75%      77.500000
max      78.000000
dtype: float64

431.0
71.83333333333333
76.0
89.0
55.0
13.212367943206345
count     6.000000
mean     71.833333
std      13.212368
min      55.000000
25%      61.500000
50%      76.000000
75%      77.750000
max      89.000000
dtype: float64

266.0
53.2
45.0
95.0
25.0
28.146047679914137
count     5.000000
mean     53.200000
std      28.146048
min      25.000000
25%      34.000000
50%      45.000000
75%      67.000000
max      95.000000
dtype: float64


In [44]:
# 3: weekday total (Seg to Sex) for the first week
semana1[0:5].sum()

249.0

In [38]:
# weekend total (Sab and Dom) for the first week; the NaN on Sab is skipped
semana1[5:].sum()

92.0
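
The cells above compute the statistics on the original series, which still contain NaN, and cover only the first week in step 3. One possible way to run all three steps on the filled data is sketched below. The names `semanas`, `dias_uteis`, and `fim_de_semana` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

dias_da_semana = ['Seg', 'Ter', 'Qua', 'Qui', 'Sex', 'Sab', 'Dom']
semanas = [
    pd.Series([50, 66, 55, 55, 23, np.nan, 92], index=dias_da_semana),
    pd.Series([np.nan, 78, 76, 66, 64, 44, 78], index=dias_da_semana),
    pd.Series([55, 75, 89, 77, 78, 57, np.nan], index=dias_da_semana),
    pd.Series([67, 34, np.nan, 25, 45, np.nan, 95], index=dias_da_semana),
]

# Step 1: one row per week; fill each gap with that weekday's mean
df = pd.DataFrame(semanas, columns=dias_da_semana)
df = df.fillna(df.mean())

dias_uteis = dias_da_semana[:5]      # Seg to Sex
fim_de_semana = dias_da_semana[5:]   # Sab and Dom

for i, semana in df.iterrows():
    print(f"Week {i + 1}")
    # Step 2: summary statistics on the filled values
    print(semana.agg(['sum', 'mean', 'median', 'max', 'min', 'std']))
    # Step 3: weekday and weekend totals
    print("weekdays:", semana[dias_uteis].sum())
    print("weekend: ", semana[fim_de_semana].sum())
```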

---

#### <font color="blue">Exercise 2</font>

You have a series called 'clientes' with a thousand entries representing your bank's borrowers, divided into the categories 'new customer', 'non-defaulter customer', and 'defaulter customer'. Consider the code below as a generator for this series:

```python
valores_possiveis = ['cliente novo', 'cliente adimplente', 'cliente inadimplente']
probabilidades = np.random.dirichlet(np.ones(len(valores_possiveis)), size=1).flatten()
clientes = pd.Series(np.random.choice(valores_possiveis, size=1000, p=probabilidades))
```

Find out how the clients in this series are distributed among the three categories, first in absolute values and then in percentage values.

In [53]:
valores_possiveis = ['cliente novo', 'cliente adimplente', 'cliente inadimplente']
probabilidades = np.random.dirichlet(np.ones(len(valores_possiveis)), size=1).flatten()
clientes = pd.Series(np.random.choice(valores_possiveis, size=1000, p=probabilidades))
clientes

0      cliente inadimplente
1      cliente inadimplente
2      cliente inadimplente
3              cliente novo
4      cliente inadimplente
               ...         
995      cliente adimplente
996            cliente novo
997    cliente inadimplente
998            cliente novo
999    cliente inadimplente
Length: 1000, dtype: object

In [54]:
clientes.value_counts()

cliente inadimplente    499
cliente adimplente      324
cliente novo            177
dtype: int64

In [56]:
clientes.value_counts(normalize=True) * 100

cliente inadimplente    49.9
cliente adimplente      32.4
cliente novo            17.7
dtype: float64

---

#### <font color="blue">Exercise 3</font>


João is a manager of an electronics store and is monitoring the inventory of products in his store. He has two pandas series, as shown in the following code: one series containing the names of the products available in stock and another series containing the quantity in stock for each product.

```python
produtos = pd.Series(['Celular', 'Tablet', 'Notebook', 'Fone de Ouvido', 'Smartwatch'])
quantidade_estoque = pd.Series([15, 8, 20, 5, 12])
```

João wants to filter the products that have low stock, i.e., those with fewer than 10 units available. Help him do this using series filtering, where one series is used as a mask for another series.

In [57]:
produtos = pd.Series(['Celular', 'Tablet', 'Notebook', 'Fone de Ouvido', 'Smartwatch'])
quantidade_estoque = pd.Series([15, 8, 20, 5, 12])

In [58]:
produtos[quantidade_estoque<10]

1            Tablet
3    Fone de Ouvido
dtype: object