# Table of Contents

* [Exploring and Visualising Data; Dealing with Outliers and Missing Data](#Exploring-and-Visualising-Data;-Dealing-with-Outliers-and-Missing-Data)
* [A Couple of Plots on `pima` and `gapminder`](#A-Couple-of-Plots-on-pima-and-gapminder)
* [Univariate Data Distributions, Missing Values, Outliers, Plotting With `seaborn`](#Univariate-Data-Distributions,-Missing-Values,-Outliers,-Plotting-With-seaborn)
* [Histograms, Bin Sizes, KDE; Subplots](#Histograms,-Bin-Sizes,-KDE;-Subplots)
* [Data Distributions and Categories](#Data-Distributions-and-Categories)
* [For `class`, labels rather than 0 and 1?](#For-class,-labels-rather-than-0-and-1?)
* [Bivariate Distributions; Inspecting Possible Relationships](#Bivariate-Distributions;-Inspecting-Possible-Relationships)

## Exploring and Visualising Data; Dealing with Outliers and Missing Data

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
import seaborn as sns
sns.set()
sns.set_context("notebook")

In [None]:
mpl.rcParams['figure.figsize'] = (6.4*1.4, 4.8*1.4)

Besides the Gapminder dataset used earlier, we will also work with the well-known [Pima Indians Diabetes](https://www.kaggle.com/uciml/pima-indians-diabetes-database) dataset.

In [None]:
pima_df = pd.read_csv('../data/pima-indians-diabetes.csv')
pima_df.sample(3)

In [None]:
gm_df = pd.read_csv('../data/gapminder.tsv', sep='\t')
gm_df.sample(3)

## A Couple of Plots on `pima` and `gapminder`

In [None]:
pima_df['class'].value_counts()

In [None]:
pima_df['class'].value_counts().plot.bar();

In [None]:
ax = pima_df['class'].value_counts().plot.bar()
ax.set_xlabel('class (diabetic/non-diabetic)')
ax.set_ylabel('count');

In [None]:
fig, ax = plt.subplots()
fig.tight_layout()
ax.set_xlabel('class (diabetic/non-diabetic)')
ax.set_ylabel('count')
pima_df['class'].value_counts().plot.bar(ax=ax);

In [None]:
gm_df.query('country=="Afghanistan"').set_index('year')['lifeExp'].plot.line();

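We haven't discussed `.query()` and `.set_index()` yet: `.query()` selects rows using a boolean expression given as a string, and `.set_index()` turns a column into the index, so that `year` ends up on the x-axis of the line plot.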
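
As a minimal sketch of what these two methods do, the snippet below uses a tiny hand-made frame. The values are made up for illustration and are not taken from `gapminder.tsv`; `toy_df` and `afg` are hypothetical names.

```python
import pandas as pd

# made-up values for illustration only, not real Gapminder data
toy_df = pd.DataFrame({
    'country': ['Afghanistan', 'Afghanistan', 'Albania', 'Albania'],
    'year':    [2002, 2007, 2002, 2007],
    'lifeExp': [42.0, 44.0, 75.0, 76.0],
})

# .query() keeps the rows for which the expression string is true;
# it is equivalent to toy_df.loc[toy_df['country'] == 'Afghanistan']
afg = toy_df.query('country == "Afghanistan"')

# .set_index() moves a column into the index, so that .plot.line()
# would use it for the x-axis
afg = afg.set_index('year')
print(afg['lifeExp'])
```

The boolean-mask form with `.loc` gives the same rows; `.query()` is mainly a more compact way to write the filter.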

## Univariate Data Distributions, Missing Values, Outliers, Plotting With `seaborn`

* `seaborn` sits on top of `matplotlib`, so one can also make use of the underlying `Axes` and `Figure` methods.
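
A short sketch of that point, assuming an axes-level function such as `sns.boxplot`: the call returns the `matplotlib` `Axes` it drew on, which can then be adjusted with ordinary `matplotlib` methods. The data values and the name `values` are made up for illustration.

```python
import pandas as pd
import seaborn as sns
from matplotlib.axes import Axes

# made-up values for illustration only
values = pd.Series([0, 1, 1, 2, 3, 5, 8], name='pregnant')

# axes-level seaborn functions such as boxplot return the matplotlib Axes
ax = sns.boxplot(x=values)
print(isinstance(ax, Axes))

# ordinary matplotlib methods can then be used to adjust the plot
ax.set_xlabel('number of pregnancies')
ax.set_title('toy example')
ax.figure.set_size_inches(6, 3)   # ax.figure is the parent Figure
```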

In [None]:
# note: distplot is deprecated since seaborn 0.11; histplot(..., kde=True) is its replacement
sns.distplot(pima_df['pregnant']);

In [None]:
sns.distplot(pima_df['diastolic_bp']);

In [None]:
df = pima_df.loc[pima_df['diastolic_bp']>0]
sns.distplot(df['diastolic_bp']);

In [None]:
# pass the column as x= explicitly; newer seaborn versions treat a positional argument as data=
sns.boxplot(x=pima_df['pregnant']);

In [None]:
pima_df.loc[ pima_df['pregnant']>=13 ]

In [None]:
pima_df['pregnant'].describe()

In [None]:
pima_df['pregnant'].value_counts(normalize=True)
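
By default, a box plot draws points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles as individual outlier markers. A minimal sketch of that rule on made-up counts (not the `pima` data; `toy`, `lower` and `upper` are hypothetical names):

```python
import pandas as pd

# made-up counts for illustration only
toy = pd.Series([0, 1, 1, 2, 3, 3, 4, 5, 6, 7, 17])

q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1

# the usual 1.5 * IQR fences
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# values outside the fences are the candidates a box plot shows as points
outliers = toy[(toy < lower) | (toy > upper)]
print(q1, q3, lower, upper)
print(outliers.tolist())
```

Whether such a value is an error or just a rare but genuine observation is a judgement call that the rule itself cannot make.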

## Histograms, Bin Sizes, KDE; Subplots

In [None]:
sns.distplot(gm_df.loc[gm_df['year']==2007, 'lifeExp'])

In [None]:
# histograms with different bin sizes
(_, axs) = plt.subplots(1, 3, figsize=(18,6), sharey=True)
df = gm_df.loc[ gm_df['year'] == 2007 ]
sns.distplot(df['lifeExp'], ax=axs[0])
sns.distplot(df['lifeExp'], bins=np.arange(0, 101, 10), ax=axs[1])
sns.distplot(df['lifeExp'], bins=np.arange(0, 101, 2), ax=axs[2])
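
To see what the `bins` argument changes, the sketch below computes the histogram counts directly with `np.histogram` on synthetic values (randomly generated for illustration, not the Gapminder data; `data` and the other names are hypothetical). The bin edges are the same as in the cell above.

```python
import numpy as np

# synthetic "life expectancy"-like values, for illustration only
rng = np.random.default_rng(0)
data = rng.normal(loc=67, scale=12, size=142).clip(0, 100)

# same data, two bin widths
counts_10, edges_10 = np.histogram(data, bins=np.arange(0, 101, 10))
counts_2, edges_2 = np.histogram(data, bins=np.arange(0, 101, 2))
print(len(counts_10), len(counts_2))   # 10 bins vs 50 bins

# every value lands in exactly one bin, whatever the bin width
print(counts_10.sum(), counts_2.sum())

# with density=True, bar heights are rescaled so that the bar areas sum to 1
dens_2, _ = np.histogram(data, bins=edges_2, density=True)
print((dens_2 * np.diff(edges_2)).sum())
```

Wide bins smooth over detail while narrow bins show more structure but also more noise; the data itself is unchanged.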

## Data Distributions and Categories

In [None]:
sns.boxplot(x=pima_df.loc[pima_df['plasma_glucose']>0, 'plasma_glucose'])

In [None]:
sns.boxplot(x='class', y='plasma_glucose', data=pima_df.loc[pima_df['plasma_glucose']>0])
# sns.boxplot(x='class', y='plasma_glucose', data=pima_df)

In [None]:
sns.boxplot(x='year', y='lifeExp', data=gm_df)

In [None]:
sns.violinplot(x=pima_df['plasma_glucose'])

In [None]:
sns.violinplot(x='class', y='plasma_glucose', data=pima_df)

In [None]:
sns.violinplot(x='year', y='lifeExp', data=gm_df)

In [None]:
sns.stripplot(x='year', y='lifeExp', data=gm_df, hue='continent', jitter=True)

In [None]:
ax = sns.boxplot(x='year', y='lifeExp', data=gm_df)
sns.stripplot(x='year', y='lifeExp', data=gm_df, hue='continent', jitter=True, ax=ax);

## For `class`, labels rather than 0 and 1?

In [None]:
# map the integer codes to string labels so that 0 and 1
# are treated as nominal categories rather than as numbers
pima_df['class_lbl'] = pima_df['class'].map({ 0: 'non-diabetic', 1: 'diabetic'})
df = pima_df.loc[ pima_df['plasma_glucose']>0 ]
df.head()
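
A small sketch of how `.map()` behaves with a dictionary, on made-up codes (`codes` and `labels` are hypothetical names): any value that is not a key of the dictionary becomes `NaN`, which doubles as a quick check that the mapping covers all codes.

```python
import pandas as pd

# made-up class codes; 2 is deliberately not in the mapping
codes = pd.Series([0, 1, 1, 0, 2])

labels = codes.map({0: 'non-diabetic', 1: 'diabetic'})
print(labels.tolist())

# values missing from the dict become NaN, so counting NaN is a cheap sanity check
print(labels.isna().sum())
```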

In [None]:
ax = sns.boxplot(x='plasma_glucose', y='class_lbl', data=df)
sns.stripplot(x='plasma_glucose', y='class_lbl', data=df, jitter=True, ax=ax);

## Bivariate Distributions; Inspecting Possible Relationships

In [None]:
sns.jointplot(x='triceps_sft', y='bmi', data=pima_df, alpha=0.3);

In [None]:
df = pima_df.loc[ pima_df['triceps_sft']>0 ]
sns.jointplot(x='triceps_sft', y='bmi', data=df, alpha=0.3)

In [None]:
df = pima_df.loc[ pima_df['triceps_sft']>0 ]
sns.jointplot(x='triceps_sft', y='bmi', data=df, kind="hex");

In [None]:
pima_df.columns

In [None]:
# pairplot
mask = (pima_df['plasma_glucose']>0) & (pima_df['diastolic_bp']>0) & (pima_df['triceps_sft']>0)
df = pima_df.loc[ mask, [ 'bmi', 'diastolic_bp', 'triceps_sft', 'class_lbl' ] ]
sns.pairplot(data=df, hue='class_lbl', height=3.5)