<a href="https://colab.research.google.com/github/zerotodeeplearning/ztdl-masterclasses/blob/master/solutions_do_not_open/Pandas_Matplotlib_Seaborn_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learn with us: www.zerotodeeplearning.com

Copyright © 2021: Zero to Deep Learning ® Catalit LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Documentation links:

- [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb)
- [Numpy](https://docs.scipy.org/doc/)
- [Pandas](https://pandas.pydata.org/docs/getting_started/index.html)
- [Pandas Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- [Matplotlib](https://matplotlib.org/)
- [Matplotlib Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf)
- [Seaborn](https://seaborn.pydata.org/)
- [Scikit-learn](https://scikit-learn.org/stable/user_guide.html)
- [Scikit-learn Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf)
- [Scikit-learn Flow Chart](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

# Pandas Matplotlib Seaborn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Google Colab

Colaboratory is an online environment that allows you to write and execute Python in the browser.

- Zero configuration required
- Free access to GPUs and TPUs
- Easy sharing

It's based on [Jupyter Notebook](https://jupyter.org/).

If you've never used it before it's a good idea to read the [tutorial here](https://colab.research.google.com/notebooks/intro.ipynb).

### Keyboard shortcuts

Here are some of the most common commands. Try them out:


- `⌘/Ctrl+M H` => Open the keyboard shortcut help
- `⌘/Ctrl+M A` => Create a cell above
- `⌘/Ctrl+M B` => Create a cell below
- `⌘/Ctrl+M D` => Delete current cell
- `⌘/Ctrl+M M` => Convert cell to Markdown
- `⌘/Ctrl+M Y` => Convert cell to Code
- `Shift+Enter` => Run cell and select next cell
- `Ctrl+Space`, `Option+Esc` or `Tab` => Autocomplete


### Saving your work

- Colab notebooks are automatically saved in your Google Drive
- You can export them to GitHub too
- You can download them to your local computer

## Pandas

Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Let's explore some of its functionality together.

### Reading and exploring data

In [None]:
url = "https://raw.githubusercontent.com/zerotodeeplearning/ztdl-masterclasses/master/data/"

In [None]:
df = pd.read_csv(url + "titanic-train.csv")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

### Plotting

Matplotlib and Seaborn are great libraries for plotting and exploring data visually.

You can take inspiration from their plot galleries:

- [Matplotlib Gallery](https://matplotlib.org/3.2.1/gallery/index.html)
- [Seaborn Gallery](https://seaborn.pydata.org/examples/index.html)

In [None]:
df[['Age', 'Fare']].plot.scatter(x='Age', y='Fare');

In [None]:
survived_counts = df['Survived'].value_counts()

In [None]:
survived_counts.plot.bar(title='Dead / Survived');

In [None]:
survived_counts.plot.pie(
    figsize=(5, 5),
    explode=[0, 0.15],
    labels=['Dead', 'Survived'],
    autopct='%1.1f%%',
    shadow=True,
    startangle=90,
    fontsize=16);

In [None]:
df['Age'].plot.hist(
    bins=16,
    range=(0, 80),
    title='Passenger age distribution')
plt.xlabel("Age");

In [None]:
sns.pairplot(df[['Age', 'Pclass', 'Fare', 'SibSp', 'Survived']],
             hue='Survived');

In [None]:
sns.jointplot(x='Age', y='Fare', data=df);

### Indexing

Retrieving elements by row, by column or both. Try to understand each of the following statements.

In [None]:
df['Ticket']

In [None]:
df[['Fare', 'Ticket']]

In [None]:
df.iloc[3]

In [None]:
df.iloc[0:4, 4:6]

In [None]:
df.loc[0:4, 'Ticket']

In [None]:
df.loc[0:4, ['Fare', 'Ticket']]
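
The cells above mix `.iloc` and `.loc`, and the two slice differently: `.iloc` works on integer positions and excludes the end of the slice, while `.loc` works on index labels and includes it. Here is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not Titanic data):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {"Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
     "Ticket": ["A1", "B2", "C3", "D4", "E5"]}
)

by_position = toy.iloc[0:4]  # positions 0, 1, 2, 3 (end excluded)
by_label = toy.loc[0:4]      # labels 0, 1, 2, 3, 4 (end included)

print(len(by_position), len(by_label))
```

With the default integer index the labels and positions coincide, which is why the same `0:4` slice returns one more row with `.loc` than with `.iloc`.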

### Selections

Retrieving part of the dataframe based on a condition. Try to understand each of the following statements.

In [None]:
df[df.Age > 70]

In [None]:
df[(df['Age'] == 11) & (df['SibSp'] == 5)]
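
A selection like the ones above happens in two steps: the comparison builds a boolean Series with one value per row, and indexing with that Series keeps the rows where it is `True`. The sketch below spells this out on made-up data (the `toy` frame is an assumption for illustration). Each condition needs its own parentheses because `&` binds more tightly than `==` or `>` in Python.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"Age": [22, 71, 11, 80], "SibSp": [1, 0, 5, 0]})

mask = toy["Age"] > 70   # boolean Series, one value per row
older = toy[mask]        # rows where the mask is True

# Combine conditions element-wise with & (and), | (or), ~ (not)
both = toy[(toy["Age"] == 11) & (toy["SibSp"] == 5)]

print(mask.tolist())
```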

### Distinct elements

In [None]:
df['Embarked'].unique()

### Group-by & Sorting

In [None]:
# Find average age of passengers that survived vs. died
df.groupby('Survived')['Age'].mean()

In [None]:
df.sort_values('Age', ascending=False).head()

### Join (merge)

In [None]:
df1 = df[['PassengerId', 'Survived']]
df2 = df[['PassengerId', 'Age']]

pd.merge(df1, df2, on='PassengerId').head()
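
In the cell above both frames come from the same `df`, so every `PassengerId` finds a match. When the keys differ, the `how` argument of `pd.merge` decides which rows survive. This is a small sketch with made-up frames (`left` and `right` are assumptions for illustration):

```python
import pandas as pd

# Toy data, invented for illustration only
left = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})
right = pd.DataFrame({"PassengerId": [2, 3, 4], "Age": [38.0, 26.0, 35.0]})

# Default how="inner": keep only ids present in both frames
inner = pd.merge(left, right, on="PassengerId")

# how="outer": keep every id, filling the missing side with NaN
outer = pd.merge(left, right, on="PassengerId", how="outer")

print(inner["PassengerId"].tolist())
print(len(outer))
```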

### Pivot Tables

In [None]:
df.pivot_table(index='Pclass', columns='Survived', values='PassengerId', aggfunc='count')

In [None]:
df['Pclass'].value_counts()
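
One way to read a pivot table is as a group-by on two keys whose second key is moved into the columns. The sketch below, on made-up data (the `toy` frame is an assumption for illustration), compares the two forms and checks that the row totals agree with `.value_counts()`:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6, 7],
    "Pclass":      [1, 1, 2, 2, 3, 3, 3],
    "Survived":    [1, 0, 1, 0, 0, 0, 1],
})

pivot = toy.pivot_table(index="Pclass", columns="Survived",
                        values="PassengerId", aggfunc="count")

# Same counts: group by both keys, then move 'Survived' into the columns
grouped = toy.groupby(["Pclass", "Survived"])["PassengerId"].count().unstack()

# Summing across the columns recovers the passengers per class
row_totals = pivot.sum(axis=1)
per_class = toy["Pclass"].value_counts()
```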

### Time series data

In [None]:
dfts = pd.read_csv(url + 'time_series_covid19_confirmed_global.csv')
# numeric_only=True keeps any non-numeric columns out of the per-country sum
df1 = dfts.drop(['Lat', 'Long'], axis=1).groupby('Country/Region').sum(numeric_only=True).transpose()

In [None]:
df1.head()

In [None]:
df1.index.dtype

In [None]:
df1.index = pd.to_datetime(df1.index)

In [None]:
df1.index.dtype

In [None]:
df1[['Italy','US']].plot(logy=True)
plt.title("COVID-19 confirmed Cases");
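
Converting the index with `pd.to_datetime` is what turns the row labels from plain strings into timestamps, which in turn enables date-based slicing and resampling. This is a minimal sketch on made-up data; the `ts` frame and the `m/d/yy` label format are assumptions for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only; labels are plain strings
ts = pd.DataFrame({"cases": [1, 3, 6, 10]},
                  index=["1/22/20", "1/23/20", "1/24/20", "1/25/20"])

# An explicit format avoids ambiguity between month-first and day-first dates
ts.index = pd.to_datetime(ts.index, format="%m/%d/%y")

# With a DatetimeIndex, .loc accepts date strings (both ends included)
subset = ts.loc["2020-01-23":"2020-01-24"]
print(subset)
```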

### Exercise 1:

Use `df`, Pandas and your knowledge of Python to answer the following questions:

- What data type is the object `df`? (hint: use the `type()` function)
- Select passengers that survived
- Calculate the average price paid by survived and dead people in each class (use a pivot table)
- Plot the histogram of Fares

In [None]:
type(df)

In [None]:
df[df['Survived']==1]

In [None]:
df.pivot_table(index='Pclass',
               columns='Survived',
               values='Fare',
               aggfunc='mean').round(2)

In [None]:
df['Fare'].plot.hist(bins=50)
plt.xlabel('Fare');


### Exercise 2:

Here are some additional questions you can use to test your skills. Feel free to try as many as you'd like.


#### Pandas data types:
- What data type is the object `df['Age']`?
- What data type are the values contained in the object `df['Age']`? (hint: use the `.dtype` attribute)
- What data type is the object `df[['Age', 'Fare']]`?

#### Selections
- Select passengers that embarked in port S
- Select male passengers
- Select passengers who paid less than 40 and were in third class
- Find the name of the passenger with `PassengerId` 674

#### Aggregations
- Calculate the average age of passengers using the method `.mean()`
- Count the number of survived and the number of dead passengers using the method `.value_counts()`
- Count the number of survived and dead by Sex (use a double group-by)
- Calculate the average price paid by survived and dead people in each class (use a pivot table)
- (Advanced) Calculate the average survival rate by Sex and Pclass (use a pivot table). Bonus points if you also count the total number in each group.

#### Plots
- Plot the histogram of Fares
- Use the `.value_counts()` method to count how many passengers there are in each class and plot the counts using either a `bar` or a `pie` chart.
- Use a heatmap plot to visualize the survival rates by Sex and Pclass calculated above

#### Time Series
For these questions use `df1` instead of `df`

- Plot the time evolution of another country (not US, not Italy). Bonus point if you add a title.
- Use the `.diff()` method to calculate the number of daily new cases
- Plot some countries (e.g. US and Italy)
- (Advanced) Use the `pd.Grouper` to aggregate the new cases from daily to weekly
- (Advanced) Create a new DataFrame with aligned data by shifting every country to the day when 1000 cases were reported. Plot a few countries to compare the evolution from that day forward.

In [None]:
type(df['Age'])

In [None]:
df['Age'].dtype

In [None]:
type(df[['Age', 'Fare']])

In [None]:
df[df['Embarked'] == 'S']

In [None]:
df[df['Sex'] == 'male']

In [None]:
df[(df['Fare'] < 40) & (df['Pclass'] == 3)]

In [None]:
df.loc[df['PassengerId'] == 674, 'Name']

In [None]:
df['Age'].mean()

In [None]:
df['Survived'].value_counts()

In [None]:
df.groupby(['Sex', 'Survived'])['PassengerId'].count()

In [None]:
surv_rate = df.pivot_table(index='Pclass',
                           columns='Sex',
                           values='Survived',
                           aggfunc=['mean'])
surv_rate.round(2)

In [None]:
df.pivot_table(index='Pclass',
               columns='Sex',
               values='Survived',
               aggfunc=['count', 'sum'])

In [None]:
passengers_per_class = df['Pclass'].value_counts()

In [None]:
passengers_per_class.plot.bar(title='Passengers per Class');

In [None]:
passengers_per_class.plot.pie(title='Passengers per Class');

In [None]:
sns.heatmap(surv_rate, cmap='RdYlGn');

In [None]:
df1['Spain'].plot(logy=True);

In [None]:
new_cases = df1.diff()

In [None]:
new_cases[['US', 'Italy']].plot();

In [None]:
new_cases[['US', 'Italy']].groupby(pd.Grouper(freq='W')).sum().plot();
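
To see what `.diff()` and `pd.Grouper(freq='W')` do step by step, it can help to run them on a series small enough to check by hand. The sketch below uses made-up cumulative counts (the `cumulative` series is an assumption for illustration) that grow by 10 per day over two full weeks:

```python
import pandas as pd

# Toy data, invented for illustration only: 14 days starting on a Monday
dates = pd.date_range("2020-03-02", periods=14, freq="D")
cumulative = pd.Series(range(10, 150, 10), index=dates, name="cases")

# Day-over-day change; the first value is NaN because there is no previous day
daily_new = cumulative.diff()

# freq="W" groups into weeks ending on Sunday; sum() skips the NaN
weekly_new = daily_new.groupby(pd.Grouper(freq="W")).sum()
print(weekly_new)
```

The first week sums to 60 rather than 70 because its first daily value is undefined.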

In [None]:
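# Keep only values above 1000; drop countries that never exceed it
values_above_1000 = df1[df1 > 1000].dropna(how='all', axis=1)

countries = values_above_1000.columns

# Empty frame indexed by the number of days since a country exceeded 1000 cases
values_above_1000_shifted = pd.DataFrame(np.nan,
                                         index=np.arange(len(values_above_1000)),
                                         columns=countries)

for c in countries:
    # Drop the NaNs so that each country's series starts at day 0
    non_null = values_above_1000[c].dropna().values
    n = len(non_null)
    # .loc slicing is label-based and inclusive, hence n-1
    values_above_1000_shifted.loc[:n-1, c] = non_null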

In [None]:
values_above_1000_shifted[['US', 'Italy', 'Spain', 'Korea, South']].plot(logy=True);
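
The same alignment idea can be sketched more compactly by resetting each country's index after filtering and letting `pd.concat` line the columns up. This is one possible alternative, not the only way to do it; the `toy` frame and the `align` helper are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy cumulative counts, invented for illustration only
toy = pd.DataFrame(
    {"A": [500, 1500, 3000, 6000], "B": [100, 400, 1200, 2500]},
    index=pd.date_range("2020-03-01", periods=4, freq="D"),
)

def align(col, threshold=1000):
    """Hypothetical helper: keep values above threshold, renumbered from day 0."""
    kept = col[col > threshold]
    return kept.reset_index(drop=True)

# concat aligns on the new 0, 1, 2, ... index and pads shorter columns with NaN
aligned = pd.concat({name: align(toy[name]) for name in toy.columns}, axis=1)
print(aligned)
```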