In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns
import matplotlib

In [None]:
# Check package version
#!pip list
!pip show seaborn
#dir(sns)

## Two $y$-axes

In [None]:
gapminder = pd.read_csv('data/gapminder-FiveYearData.csv')
gapminder

Let us subset the gapminder data with pandas boolean indexing, keeping only the rows for Australia.

In [None]:
gapminder_subset = gapminder[gapminder.country=="Australia"]
gapminder_subset
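
The cell above filters with a boolean mask. The same filter can also be written as a string with `DataFrame.query`. Below is a minimal sketch on a tiny made-up frame (the name `toy` and its values are hypothetical, not taken from the gapminder file) showing that the two forms select the same rows.

```python
import pandas as pd

# Hypothetical miniature of the gapminder table (values are made up).
toy = pd.DataFrame({
    "country": ["Australia", "Australia", "Austria"],
    "year": [2002, 2007, 2007],
    "lifeExp": [80.0, 81.0, 79.0],
})

# Boolean mask, as in the cell above.
by_mask = toy[toy["country"] == "Australia"]

# The same filter expressed with DataFrame.query.
by_query = toy.query("country == 'Australia'")

print(by_mask.equals(by_query))
```

Which form to use is mostly a matter of taste; the mask is explicit, while `query` can be easier to read when several conditions are combined.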

Naively, let us plot both `lifeExp` and `gdpPercap` against `year` on the same plot with a single y-axis.

In [None]:
# create figure and axis objects with subplots()
fig, ax = plt.subplots()
# both series share one y-axis, so label each line and name both in the axis label
ax.plot(gapminder_subset.year, gapminder_subset.lifeExp, marker="o", label="lifeExp")
ax.plot(gapminder_subset.year, gapminder_subset["gdpPercap"], marker="o", label="gdpPercap")
ax.set_xlabel("Year")
ax.set_ylabel("lifeExp / gdpPercap")
ax.legend()
plt.show()

We can immediately see that this is a bad idea. The `lifeExp` line looks flat and sits near the bottom of the plot. Its values are orders of magnitude smaller than the `gdpPercap` values, so the shared scale hides any variation in it.
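
The scale mismatch can also be checked numerically before plotting. This is a small sketch on made-up numbers (the frame `toy` and its values are hypothetical stand-ins for the Australia subset) that compares the range of each column.

```python
import pandas as pd

# Made-up values standing in for the Australia subset.
toy = pd.DataFrame({
    "year": [1997, 2002, 2007],
    "lifeExp": [78.0, 80.0, 81.0],
    "gdpPercap": [27000.0, 30000.0, 34000.0],
})

# Minimum and maximum of each column, one row per statistic.
ranges = toy[["lifeExp", "gdpPercap"]].agg(["min", "max"])

# Width of each range: a large ratio between the two means that
# one line will look flat when both share a y-axis.
span = ranges.loc["max"] - ranges.loc["min"]
print(ranges)
print(span)
```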

One solution is to give the plot two different y-axes. To do this, we use two axes objects that share the same x-axis; the `twinx()` method of an existing axes object creates the second one.

We first create figure and axis objects and make a first plot. In this example, we plot `year` vs `lifeExp`. And we also set the x and y-axis labels by updating the axis object.

In [None]:
# create figure and axis objects with subplots()
fig,ax = plt.subplots()

# make a plot
ax.plot(gapminder_subset.year, gapminder_subset.lifeExp, color="red", marker="o")

# set x-axis label
ax.set_xlabel("year",fontsize=14)

# set y-axis label
ax.set_ylabel("lifeExp",color="red",fontsize=14)

# twin object for two different y-axes on the same plot
ax2=ax.twinx()

# make a plot with different y-axis using second axis object
ax2.plot(gapminder_subset.year, gapminder_subset["gdpPercap"],color="blue",marker="o")
ax2.set_ylabel("gdpPercap",color="blue",fontsize=14)
plt.show()

# save the plot as a file
fig.savefig('Figure01.jpg',
            format='jpeg',
            dpi=100,
            bbox_inches='tight')
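
The figure above identifies the two lines only through the colors of the axis labels. If a legend is wanted, note that each axes object only knows about its own lines, so the handles have to be collected from both. The following is a sketch on made-up values (the lists `years`, `life_exp` and `gdp` are hypothetical), not part of the original figure.

```python
import matplotlib.pyplot as plt

# Made-up values standing in for the Australia subset.
years = [1997, 2002, 2007]
life_exp = [78.0, 80.0, 81.0]
gdp = [27000.0, 30000.0, 34000.0]

fig, ax = plt.subplots()
ax.plot(years, life_exp, color="red", marker="o", label="lifeExp")

# Second axes object sharing the same x-axis.
ax2 = ax.twinx()
ax2.plot(years, gdp, color="blue", marker="o", label="gdpPercap")

# Collect the handles and labels from both axes and draw one legend.
handles1, labels1 = ax.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
legend = ax2.legend(handles1 + handles2, labels1 + labels2, loc="upper left")
```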

## Sentiment Analysis with `textblob`

**Sentiment Analysis** is the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral.

**Sentiment Labels**:

Each word in the lexicon is labeled in terms of polarity and subjectivity (there are more labels as well, but we're going to ignore them for now). The sentiment of a text is the average of the labels of its words.

- Polarity: How positive or negative a word is. -1 is very negative. +1 is very positive.
- Subjectivity: How subjective, or opinionated a word is. 0 is fact. +1 is very much an opinion.

How do we find the polarity and subjectivity of a document?

`textblob` finds all of the words and phrases that it can assign a polarity and subjectivity to, averages them together, and returns the final polarity and subjectivity.
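
To make the averaging idea concrete, here is a toy sketch in plain Python. It is not the `textblob` implementation: the lexicon `TOY_LEXICON`, its scores, and the function `toy_sentiment` are all made up for illustration, and the sketch ignores refinements such as negation.

```python
# Hypothetical mini-lexicon: word -> (polarity, subjectivity). Scores are made up.
TOY_LEXICON = {
    "good": (0.5, 0.5),
    "awful": (-1.0, 1.0),
}


def toy_sentiment(text):
    """Average the scores of the words that appear in TOY_LEXICON."""
    scores = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    if not scores:
        # No known word: treat the text as neutral and objective.
        return (0.0, 0.0)
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return (polarity, subjectivity)


print(toy_sentiment("good food awful service"))  # (-0.25, 0.75)
```

Words missing from the lexicon ("food", "service") contribute nothing, so only the scored words are averaged.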

In [None]:
#import sys
#!{sys.executable} -m pip install textblob

In [None]:
from textblob import TextBlob

In [None]:
TextBlob("great").sentiment

In [None]:
TextBlob("not great").sentiment

In [None]:
TextBlob("My tea is green.").sentiment

In [None]:
TextBlob("This is a very bad idea.").sentiment

In [None]:
TextBlob("This is not such a bad idea.").sentiment

## Plots in `seaborn`

`Seaborn` is a library for making statistical graphics in Python. It builds on top of `matplotlib` and integrates closely with pandas data structures.

The design of `Seaborn` allows you to explore and understand your data quickly. It takes entire DataFrames or arrays containing all your data and performs the semantic mapping and statistical aggregation needed to turn the data into informative plots.

It abstracts complexity while allowing you to design your plots to your requirements.

In [None]:
flights_data = sns.load_dataset("flights")
flights_data.head()

In [None]:
sns.scatterplot(data=flights_data, x="year", y="passengers");plt.show()

In [None]:
sns.lineplot(data=flights_data, x="year", y="passengers");plt.show()

In [None]:
sns.barplot(data=flights_data, x="year", y="passengers");plt.show()

### Tips

In [None]:
tips_df = sns.load_dataset('tips')
tips_df.head()

Let's add a column to the data set that holds the tip as a fraction of the total bill. Despite the name `tip_percentage`, the values are not multiplied by 100.

In [None]:
tips_df["tip_percentage"] = tips_df["tip"] / tips_df["total_bill"]
tips_df.head()

In [None]:
sns.histplot(tips_df["tip_percentage"], binwidth=0.05,
             kde=True); plt.show()

In the next plot, we pass the full dataset instead of just one column and set the `hue` parameter to the column `time`. This makes the chart use a different color for each value of `time` and adds a legend.

In [None]:
sns.histplot(data=tips_df, 
             x="tip_percentage", 
             binwidth=0.02, 
             hue="time"); plt.show() # try other categories...

**Total of tips per day of the week** is another interesting metric: it shows how much money in tips the staff can expect depending on the day of the week.

In [None]:
sns.barplot(data=tips_df, x="day", y="tip", estimator=np.sum); 
# NOTE: plt.show() is not used here; the trailing ; suppresses the text output after the figure
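
The bar heights in the chart above are per-day sums of `tip`. The same numbers can be computed directly with a pandas `groupby`, which is a handy way to check a chart. This is a sketch on made-up data (the frame `toy` is hypothetical, not the tips dataset).

```python
import pandas as pd

# Made-up tips, two on Saturday and three on Sunday.
toy = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Sun"],
    "tip": [2.0, 3.0, 1.0, 4.0, 5.0],
})

# One total per day, which is what a sum estimator plots as bar heights.
total_by_day = toy.groupby("day")["tip"].sum()
print(total_by_day)
```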

**Impact of table size and day on the tip.** Do the day of the week and the table size affect the tip percentage? To draw the next chart we use the pandas `pivot_table` method to pre-process the data and then draw a heatmap.

In [None]:
pivot = tips_df.pivot_table(
    index=["day"],
    columns=["size"],
    values="tip_percentage",
    aggfunc="mean",
    observed=False)
sns.heatmap(pivot)

In [None]:
pivot
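
To see what `pivot_table` does to the rows, here is a sketch on a tiny made-up frame (the frame `toy` and its values are hypothetical). Rows with the same `day` and `size` are averaged into one cell, and combinations that never occur become `NaN`.

```python
import pandas as pd

# Made-up rows: two Saturday tables of size 2, one of size 4, one Sunday table.
toy = pd.DataFrame({
    "day": ["Sat", "Sat", "Sat", "Sun"],
    "size": [2, 2, 4, 2],
    "tip_percentage": [0.10, 0.20, 0.15, 0.30],
})

# One row per day, one column per table size, mean of tip_percentage in each cell.
toy_pivot = toy.pivot_table(
    index="day",
    columns="size",
    values="tip_percentage",
    aggfunc="mean",
)
print(toy_pivot)
```

The empty cell (Sunday, size 4) is left blank by `sns.heatmap`, which is worth remembering when reading the real chart.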

In [None]:
sns.heatmap(pivot,
            annot=True,       # include annotations
            fmt="0.4f",
            linewidths=.5,    # line gap between cells
            cmap="YlGnBu",    
            center=pivot.loc["Sat", 4],
            xticklabels=1, yticklabels=True,
            cbar=True);

## Visualize missing values (NaN) using the `missingno` library

In [None]:
#!pip show missingno

In [None]:
# If the package already exists in the system,
# installing it again will not upgrade the current version.

# You can
# 1. uninstall the current version and install a new one
# 2. upgrade
# 3. install the version by specifying the exact version number you need, as in the example below:

#import sys
#!{sys.executable} -m pip install missingno==0.5.2
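
Before deciding whether to install or upgrade, the installed version can also be checked from Python with the standard library module `importlib.metadata` (Python 3.8 or later). The helper name `installed_version` below is made up for this sketch.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if missingno is installed, otherwise None.
print(installed_version("missingno"))
```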

In [None]:
import missingno as msno

In [None]:
df = pd.read_csv('data/athlete_events.csv')

In [None]:
msno.matrix(df);

In [None]:
msno.bar(df);

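**Heatmap** shows the correlation of "missingness" between every pair of columns: how strongly the presence or absence of values in one column relates to the presence or absence of values in another.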

In [None]:
msno.heatmap(df);
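
The idea behind such a nullity correlation can be reproduced with plain pandas by correlating the boolean "is missing" masks of the columns. The sketch below uses a made-up frame (`toy` is hypothetical) and assumes that this mask correlation is what the heatmap is based on; it does not call `missingno` itself.

```python
import numpy as np
import pandas as pd

# Made-up frame with deliberately related missingness patterns.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, 3.0, np.nan],  # missing in the same rows as "a"
    "c": [np.nan, 2.0, np.nan, 4.0],  # missing exactly where "a" is present
})

# True where a value is missing, then pairwise correlation of the masks.
nullity_corr = toy.isnull().corr()
print(nullity_corr)
```

A value near 1 means two columns tend to be missing together, and a value near -1 means that one is present when the other is missing.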