<table style="float:left">
    <tr>
        <td>
            <img src="images/emlyon.png" style="height:60px; float:left; padding-right:10px; margin-top:5px" />
        </td>
        <td style="padding-bottom:10px;">
            <h1 style="border-bottom: 1px solid #eeeeee;"> AI Booster Week 01 - Python for Data Science </h1>
            <span style="display:inline-block; margin-top:-15px;">
            <a href="https://masters.em-lyon.com/fr/msc-in-data-science-artificial-intelligence-strategy">[Emlyon]</a> MSc in Data Science & Artificial Intelligence Strategy (DSAIS)    
            <br/>
            Sep 2024, Paris | © Saeed VARASTEH
            </span>
        </td>
    </tr>
</table>

### Pandas Library II

---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

---

In [None]:
# Pandas Options
pd.set_option('display.max_columns', 15)
pd.set_option("display.max_rows", 5)

---

<div style='color:gray; font-size:16pt;'> 
Loading Data into Pandas
</div>

#### Read Data

In [None]:
data = pd.read_csv('data/adult.csv', na_values="?")
print( data.shape )

In [None]:
data.head(5)

In [None]:
data.tail(5)

---

<div style='color:gray; font-size:16pt;'> 
Data Cleaning
</div>

__Drop Columns__

Let's start by dropping the columns that we won't be using.

In [None]:
data = data.drop(columns=["fnlwgt","education","relationship","capital-loss"])

In [None]:
print( data.shape )
data.head(5)

__Dealing with NaNs:__

In [None]:
data.isna().sum()

In [None]:
data.isnull().sum()

In [None]:
data.dropna(inplace=True)

In [None]:
print( data.shape )
data.head(5)

__Rename Columns__

In [None]:
data.rename(columns={
        'capital-gain': 'capital', 
        'hours-per-week': 'workinghours'
    },
    inplace=True
)
data.head(5)

In [None]:
data.columns

__Type Conversion__

Notice anything off with the data types?

In [None]:
data.dtypes

Let's fix the column type:

In [None]:
data['workclass'] = data.workclass.astype('category')
data.head(5)

__Creating New Column__

In [None]:
data['above_mean_hours'] = data.workinghours > data.workinghours.mean()
data.head(5)

Using `assign` function:

In [None]:
# assign() returns a new DataFrame, so store the result
data = data.assign(above_mean_hours=lambda x: x.workinghours > x.workinghours.mean())
data.head(5)

<div class="alert-info"> 
    <b>lambda</b> functions: These small, anonymous functions can receive multiple arguments, but can only contain one expression (the return value).
</div>
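As a minimal sketch of the points above, the snippet below uses a toy frame with hypothetical values (not the adult dataset). It shows a lambda with two arguments and a lambda passed to `assign`, which returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Toy data with hypothetical values; not taken from the adult dataset
toy = pd.DataFrame({"workinghours": [20, 40, 60]})

# A lambda can take several arguments but holds a single expression
add = lambda a, b: a + b
total = add(2, 3)

# assign() calls the lambda with the DataFrame and returns a new DataFrame
flagged = toy.assign(above_mean_hours=lambda x: x.workinghours > x.workinghours.mean())
print(flagged)
```

Because `toy` itself is not modified, the result has to be stored in a variable to be kept.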

#### Filtering Data

In [None]:
data.loc[(data['native-country'] == 'United-States') & (data['age'] >=30) & (data['age'] <= 60)]

In [None]:
data.loc[~data['marital-status'].str.contains('married|Married')]

#### Conditional change

In [None]:
data['gender'].value_counts()

In [None]:
# Change 'Female' category of 'gender' to 'F' ----- Inplace
data.loc[ data['gender'] == 'Female','gender' ] = 'F'

In [None]:
data.loc[ data['gender'] == 'Male','gender' ] = 'M'

In [None]:
data.head(5)

#### Sorting values

In [None]:
data.sort_values(['age', 'educational-num'], ascending=[True, False])

__Encodings__

__sklearn__ `LabelEncoder`

In [None]:
data.race.value_counts()

In [None]:
from sklearn.preprocessing import LabelEncoder
race_encoder = LabelEncoder()
data['race'] = race_encoder.fit_transform( data['race'] )

In [None]:
data.head(5)

In [None]:
data.race.value_counts()

In [None]:
print( {k: v for k, v in enumerate(list(race_encoder.classes_))} )

__pandas__ One Hot Encoder

In [None]:
data_encoder = pd.get_dummies(data, columns=["race"])
print( data_encoder.shape )
data_encoder.head()

#### Discretization

Use the `pd.cut()` function to bin values into intervals of equal width:

In [None]:
pd.cut(data["educational-num"], bins=3, labels=['low', 'medium', 'high'])
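To see what equal-width binning does, here is a small sketch on hypothetical values ranging from 1 to 16 (an assumed range, chosen only for illustration). With `bins=3`, the range is split into three intervals of width 5.

```python
import pandas as pd

# Hypothetical values; the 1..16 range is an assumption for illustration
values = pd.Series([1, 4, 7, 9, 12, 16])

# bins=3 splits the range [1, 16] into three intervals of equal width
binned = pd.cut(values, bins=3, labels=["low", "medium", "high"])
print(binned.tolist())
```

Equal width does not mean equal counts: if most values fall in one interval, that bin will hold most of the rows.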

---

<div style='color:gray; font-size:16pt;'> 
Data Exploration
</div>

#### Count total data per country

In [None]:
data["native-country"].unique()

In [None]:
data["native-country"].value_counts()

In [None]:
data.groupby("native-country").count()['age']

In [None]:
data['count'] = 1
data.groupby("native-country").count()['count']

#### Highest capital per country

In [None]:
data.groupby('native-country')['capital'].max()

#### Pivot Table

A pivot table is a table of grouped values that aggregates the individual items of a more extensive table within one or more discrete categories. 

In [None]:
data.head(5)

We can build a pivot table to compare `educational-num` across genders in our dataset:

In [None]:
data.pivot_table(index='educational-num', columns='gender', values='count', aggfunc='sum')
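The same call can be traced by hand on a toy frame. This is only a sketch with hypothetical records; the `count` column of ones mirrors the helper column created earlier in the notebook.

```python
import pandas as pd

# Hypothetical records; 'count' is a helper column of ones
toy = pd.DataFrame({
    "educational-num": [9, 9, 10, 10, 10],
    "gender": ["F", "M", "F", "F", "M"],
    "count": [1, 1, 1, 1, 1],
})

# Rows come from 'educational-num', columns from 'gender';
# each cell sums the 'count' values of the matching records
table = toy.pivot_table(index="educational-num", columns="gender",
                        values="count", aggfunc="sum")
print(table)
```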

#### Crosstabs

The `pd.crosstab()` function provides an easy way to create a frequency table.

In [None]:
pd.crosstab(index=data["educational-num"],columns=data["gender"])
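A short sketch on hypothetical records shows that `pd.crosstab()` counts co-occurrences directly, so no helper column of ones is needed.

```python
import pandas as pd

# Hypothetical records, not taken from the adult dataset
toy = pd.DataFrame({
    "educational-num": [9, 9, 10, 10, 10],
    "gender": ["F", "M", "F", "F", "M"],
})

# Each cell holds the number of rows with that (educational-num, gender) pair
freq = pd.crosstab(index=toy["educational-num"], columns=toy["gender"])
print(freq)
```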

#### Describe

In [None]:
data.describe()

---

<div style='color:gray; font-size:16pt;'> 
Data Visualization
</div>

In this section, we will learn how to visualize data using pandas along with the Matplotlib and Seaborn libraries for additional features. We will create a variety of visualizations that will help us better understand our data.

First, to embed SVG-format plots in the notebook, we call the `%config` and `%matplotlib inline` magics:

In [None]:
%config InlineBackend.figure_formats = ['svg']
%matplotlib inline

In [None]:
data.head(5)

#### Plotting with Pandas

We can create a variety of visualizations using the `plot()` method.

__Line Plot__

<div class="alert-info">
The plot() method returns an Axes object that can be modified further (e.g., to add reference lines, annotations, labels, etc.).
</div>

In [None]:
data.loc[0:10,"age"].plot(title='Ages 0:10', ylabel='Age', alpha=0.8)
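As a sketch of what "modified further" can look like, the snippet below plots a few hypothetical ages, then uses the returned `Axes` to add a reference line at the mean and to set an axis label. The values and label text are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical ages; not taken from the adult dataset
ages = pd.Series([25, 38, 28, 44, 18], name="age")

# plot() returns a Matplotlib Axes object
ax = ages.plot(title="Ages", ylabel="Age", alpha=0.8)

# Modify the Axes after plotting: reference line and x-axis label
ax.axhline(ages.mean(), color="gray", linestyle="--")
ax.set_xlabel("Row index")
```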

__Bar Plot__

For our next example, we will plot vertical bars to compare income levels across marital statuses. Let's start by creating a pivot table with the information we need:

In [None]:
plot_data = data.pivot_table(index='marital-status', columns='income', values='count', aggfunc='sum')
plot_data.head()

Pandas offers other plot types via the `kind` parameter, so we specify `kind='bar'` when calling the `plot()` method. Then, we further format the visualization using the `Axes` object returned by the `plot()` method:

In [None]:
ax = plot_data.plot(
    kind='bar', rot=0, xlabel='', ylabel='adults', fontsize=8,
    figsize=(12, 1.5), title='Income by Marital Status'
)

# customize the legend
ax.legend(title='', loc='center', bbox_to_anchor=(0.5, -0.3), ncol=3, frameon=False)

__Plotting distributions__

Let's now compare the distribution of age across income levels. We will create one subplot per income level, each showing both a histogram and a kernel density estimate (KDE) of the distribution.

Pandas has generated the `Figure` and `Axes` objects for both examples so far, but we can build custom layouts by creating them ourselves with Matplotlib using the `plt.subplots()` function.

While pandas lets us specify that we want subplots and their layout (with the `subplots` and `layout` parameters, respectively), using Matplotlib to create the subplots directly gives us additional flexibility:

In [None]:
fig, axes = plt.subplots(2, 1, sharex=True, sharey=True, figsize=(6, 4))

for income, ax in zip(data.income.unique(), axes):
    plot_data = data[ data['income'] == income ].age
    plot_data.plot(kind='hist', legend=False, density=True, alpha=0.8, ax=ax)
    plot_data.plot(kind='kde', legend=False, color='blue', ax=ax)
    ax.set(title=f'{income} Age Distributions', xlabel='Age')

fig.tight_layout() # handle overlaps

If you're new to the `zip()` function, check out [this article](https://realpython.com/python-zip-function/).
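In short, `zip()` pairs items from several iterables position by position, which is how each income level is matched with its own subplot in the loop above. A minimal sketch, using hypothetical stand-ins for the labels and the axes:

```python
# Hypothetical stand-ins for the income labels and the subplot axes
labels = ["<=50K", ">50K"]
panels = ["top axes", "bottom axes"]

# zip() pairs the items position by position and stops at the shorter input
pairs = list(zip(labels, panels))
for label, panel in pairs:
    print(f"{label} -> {panel}")
```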

__Plotting with Seaborn__

In [None]:
import seaborn as sns

The __Seaborn__ library makes it easy to visualize long-format data without first pivoting it. It also offers additional plot types, again building on top of Matplotlib.

With Seaborn, we can specify plot colors according to the values of a column with the `hue` parameter. When working with functions that generate subplots, we can also specify how to split the subplots by the values of a long-format column with the `col` and `row` parameters. Let's revisit the above plot using Seaborn:

In [None]:
sns.displot( data=data, x='age', col='income', kde=True, height=2.5 )

In [None]:
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(3, 2))
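# distplot() is deprecated; histplot() with a KDE overlay is the replacement
sns.histplot(data["age"], kde=True, stat="density", ax=ax)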
plt.tight_layout()

__Heatmaps__

We can also use Seaborn to visualize pivot tables as heatmaps:

In [None]:
plot_data = data.pivot_table(index='marital-status', columns='income', values='count', aggfunc='sum')
plot_data.head()

In [None]:
ax = sns.heatmap(data=plot_data, cmap='Blues', annot=True, fmt='.1f')
_ = ax.set_title('Income by Marital Status')

__Box Plot__

In [None]:
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(4, 4))
sns.boxplot(y="age", data=data, ax=ax)
plt.tight_layout()

__Correlation Plot__

In [None]:
plt.figure(figsize=(8, 5))
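# numeric_only=True skips text columns, which corr() cannot handle in recent pandas versions
sns.heatmap(data.corr(numeric_only=True).abs(), annot=True)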

---

<div class="alert-info" style="border-bottom: solid 1px lightgray; background-color:#f0ffff;">
    <img src="images/self.png" style="height:60px; float:left; padding-right:10px;" />
    <span style="font-weight:bold; color:#1a8a8a">
        <h4 style="padding-top:25px;"> SELF-STUDY </h4>
    </span>
</div>

## Summary:

### Create Test Objects

| Operator | Description |
|:---- |:---- |
| **`pd.DataFrame(np.random.rand(20,5))`** | **5 columns and 20 rows of random floats** | 
| **`pd.Series(my_list)`** | **Create a series from an iterable my_list** | 
| **`df.index = pd.date_range('1900/1/30', periods=df.shape[0])`** | **Add a date index** | 

### Viewing/Inspecting Data

| Operator | Description |
|:---- |:---- |
| **`df.head(n)`** | **First n rows of the DataFrame** | 
| **`df.tail(n)`** | **Last n rows of the DataFrame** | 
| **`df.shape`** | **Number of rows and columns** | 
| **`df.info()`** | **Index, Datatype and Memory information** | 
| **`df.describe()`** | **Summary statistics for numerical columns** | 
| **`s.value_counts(dropna=False)`** | **View unique values and counts** | 
| **`df.apply(pd.Series.value_counts)`** | **Unique values and counts for all columns** | 

### Selection

| Operator | Description |
|:---- |:---- |
| **`df[col]`** | **Returns column with label col as Series** | 
| **`df[[col1, col2]]`** | **Returns columns as a new DataFrame** | 
| **`s.iloc[0]`** | **Selection by position** | 
| **`s.loc['index_one']`** | **Selection by index** | 
| **`df.iloc[0,:]`** | **First row** | 
| **`df.iloc[0,0]`** | **First element of first column** | 

### Data Cleaning

| Operator | Description |
|:---- |:---- |
| **`df.columns = ['a','b','c']`** | **Rename columns** | 
| **`pd.isnull()`** | **Checks for null values, returns a Boolean array** | 
| **`pd.notnull()`** | **Opposite of pd.isnull()** | 
| **`df.dropna()`** | **Drop all rows that contain null values** | 
| **`df.dropna(axis=1)`** | **Drop all columns that contain null values** | 
| **`df.dropna(axis=1,thresh=n)`** | **Drop all columns that have fewer than n non-null values** | 
| **`df.fillna(x)`** | **Replace all null values with x** | 
| **`s.fillna(s.mean())`** | **Replace all null values with the mean** | 
| **`s.astype(float)`** | **Convert the datatype of the series to float** | 
| **`s.replace(1,'one')`** | **Replace all values equal to 1 with 'one'** | 
| **`s.replace([2,3],['two', 'three'])`** | **Replace all 2 with 'two' and 3 with 'three'** | 
| **`df.rename(columns=lambda x: x + 1)`** | **Mass renaming of columns** | 
| **`df.rename(columns={'old_name': 'new_name'})`** | **Selective renaming** | 
| **`df.set_index('column_one')`** | **Change the index** | 
| **`df.rename(index=lambda x: x + 1)`** | **Mass renaming of index** | 
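The `dropna` and `fillna` variants in the table are easy to mix up, so here is a small sketch on a hypothetical frame. Note that `thresh=n` keeps the rows or columns (depending on `axis`) that have at least `n` non-null values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

rows_kept = df.dropna()                    # keeps only the row without any NaN
cols_kept = df.dropna(axis=1)              # keeps only the columns without any NaN
cols_thresh = df.dropna(axis=1, thresh=2)  # keeps columns with at least 2 non-null values
filled = df["a"].fillna(df["a"].mean())    # NaN in 'a' replaced by the column mean
```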

### Filter, Sort, and Groupby

| Operator | Description |
|:---- |:---- |
| **`df[df[col] > 0.6]`** | **Rows where the column col is greater than 0.6** | 
| **`df[(df[col] > 0.6) & (df[col] < 0.8)]`** | **Rows where 0.8 > col > 0.6** | 
| **`df.sort_values(col1)`** | **Sort values by col1 in ascending order** | 
| **`df.sort_values(col2,ascending=False)`** | **Sort values by col2 in descending order** | 
| **`df.sort_values([col1,col2],ascending=[True,False])`** | **Sort values by col1 in ascending order then col2 in descending order** | 
| **`df.groupby(col)`** | **Returns a groupby object for values from one column** | 
| **`df.groupby([col1,col2])`** | **Returns groupby object for values from multiple columns** | 
| **`df.groupby(col1)[col2].mean()`** | **Returns the mean of the values in col2, grouped by the values in col1** | 
| **`df.pivot_table(index=col1,values=[col2,col3],aggfunc='mean')`** | **Create a pivot table that groups by col1 and calculates the mean of col2 and col3** | 
| **`df.groupby(col1).agg('mean')`** | **Find the average across all columns for every unique col1 group** | 
| **`df.apply(np.mean)`** | **Apply the function np.mean() across each column** | 
| **`df.apply(np.max,axis=1)`** | **Apply the function np.max() across each row** | 
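A brief sketch of the grouping and sorting entries, using a hypothetical frame whose column names follow the table. Selecting a column from a groupby object computes nothing by itself; an aggregation such as `.mean()` has to follow.

```python
import pandas as pd

# Hypothetical frame; column names follow the table above
df = pd.DataFrame({
    "col1": ["x", "x", "y"],
    "col2": [1.0, 3.0, 5.0],
    "col3": [10.0, 20.0, 30.0],
})

selected = df.groupby("col1")["col2"]  # a grouped selection, nothing computed yet
means = selected.mean()                # mean of col2 per col1 group
pivot = df.pivot_table(index="col1", values=["col2", "col3"], aggfunc="mean")
ordered = df.sort_values(["col1", "col2"], ascending=[True, False])
```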

### Join/Combine

| Operator | Description |
|:---- |:---- |
| **`pd.concat([df1, df2])`** | **Add the rows of df2 to the end of df1 (columns should be identical); replaces `df1.append(df2)`, which was removed in pandas 2.0** | 
| **`pd.concat([df1, df2],axis=1)`** | **Add the columns of df2 to the right of df1 (rows should be identical)** | 
| **`df1.join(df2,on=col1, how='inner')`** | **SQL-style join of the columns in df1 with the columns in df2, matching the values of col1 in df1 against the index of df2. The 'how' can be 'left', 'right', 'outer' or 'inner'** | 
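The sketch below illustrates the three ways of combining frames on hypothetical data. It uses `pd.concat` for stacking rows, since `DataFrame.append` was removed in pandas 2.0, and it relies on the fact that `join` matches the `on` column of the left frame against the index of the right frame.

```python
import pandas as pd

# Hypothetical frames; names and values are for illustration only
df1 = pd.DataFrame({"key": ["a", "b"], "left_val": [1, 2]})
df2 = pd.DataFrame({"key": ["b", "c"], "left_val": [3, 4]})
lookup = pd.DataFrame({"right_val": [10, 20]}, index=["b", "c"])

# Rows of df2 placed after the rows of df1
stacked = pd.concat([df1, df2], ignore_index=True)

# Columns of df2 placed to the right of df1 (suffix avoids duplicate names)
side_by_side = pd.concat([df1, df2.add_suffix("_2")], axis=1)

# Inner join: keep rows of df1 whose 'key' appears in the index of lookup
joined = df1.join(lookup, on="key", how="inner")
```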

### Statistics

| Operator | Description |
|:---- |:---- |
| **`df.describe()`** | **Summary statistics for numerical columns** | 
| **`df.mean(numeric_only=True)`** | **Returns the mean of all numeric columns** | 
| **`df.corr(numeric_only=True)`** | **Returns the correlation between numeric columns in a DataFrame** | 
| **`df.count()`** | **Returns the number of non-null values in each DataFrame column** | 
| **`df.max()`** | **Returns the highest value in each column** | 
| **`df.min()`** | **Returns the lowest value in each column** | 
| **`df.median(numeric_only=True)`** | **Returns the median of each numeric column** | 
| **`df.std(numeric_only=True)`** | **Returns the standard deviation of each numeric column** |

### Importing Data

| Operator | Description |
|:---- |:---- |
| **`pd.read_csv(filename)`** | **From a CSV file** | 
| **`pd.read_table(filename)`** | **From a delimited text file (like TSV)** | 
| **`pd.read_excel(filename)`** | **From an Excel file** | 
| **`pd.read_sql(query, connection_object)`** | **Read from a SQL table/database** | 
| **`pd.read_json(json_string)`** | **Read from a JSON formatted string, URL or file.** | 
| **`pd.read_html(url)`** | **Parses an html URL, string or file and extracts tables to a list of dataframes** | 
| **`pd.read_clipboard()`** | **Takes the contents of your clipboard and passes it to read_table()** | 
| **`pd.DataFrame(dict)`** | **From a dict, keys for columns names, values for data as lists** |

### Exporting Data

| Operator | Description |
|:---- |:---- |
| **`df.to_csv(filename)`** | **Write to a CSV file** | 
| **`df.to_excel(filename)`** | **Write to an Excel file** | 
| **`df.to_sql(table_name, connection_object)`** | **Write to a SQL table** | 
| **`df.to_json(filename)`** | **Write to a file in JSON format** |

---