# Lecture 4: Exploratory Data Analysis
__MATH 3480__ - Dr. Michael Olson

In Exploratory Data Analysis, we need to follow these steps:
1. Obtain and Clean the Data
2. Wrangle the Data
3. Compute summary statistics
4. Graph the data
5. Draw conclusions and make hypotheses from (3) and (4), looking for relationships that we might use

|              | Quantitative Data | Categorical Data |
| :----------- | :---------------- | :--------------- |
| Calculations | Mean, Mode<br>Five-number summary<br>Distributions (count, standard deviation/variance) | Probabilities<br>Expected Values<br>Probability distributions (binomial, etc.) |
| Graphs       | Histogram/KDE (kernel density estimator)<br>Boxplot/Violinplot<br>Scatterplot<br>Timeseries<br>Heatmap | Barplot<br>Pie Chart<br>Venn Diagram<br>Tree Diagram |

The goals of EDA:
* Derive Insights
* Generate Hypotheses

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
mpg = sns.load_dataset('mpg')
mpg.head()

## Data Validation
Look through the data for:
* Missing values
* Incorrect data types (is the year an integer or a float? Is the weight a float or a string?)
* Incorrect categories (is the origin 'usa' or 'uas'?)
* Patterns
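
The first two checks in this list can be sketched on a small made-up table. The `cars` DataFrame and its values below are hypothetical and only stand in for a real dataset; `isna()` and `pd.to_numeric()` are standard pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the real mpg dataset)
cars = pd.DataFrame({
    'mpg': [18.0, 25.0, np.nan, 31.0],           # one missing value
    'weight': ['3504', '2372', '2130', '1985'],  # numbers stored as strings
})

# Count the missing values in each column
missing_per_column = cars.isna().sum()
print(missing_per_column)

# Convert a column that was read in with the wrong data type
cars['weight'] = pd.to_numeric(cars['weight'])
print(cars.dtypes)
```

The same two calls should work on `mpg` itself, for example `mpg.isna().sum()`.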

In [None]:
mpg.dtypes

In [None]:
mpg['origin'].unique()

In [None]:
mpg['origin'].isin(['usa','japan','europe'])

In [None]:
( ~mpg['origin'].isin(['usa','japan','europe']) ).sum()
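
If the count above were not zero, one possible next step is to list the unexpected labels and map each known typo to its correct value. The sketch below uses a hypothetical `origin` Series containing the 'uas' typo mentioned earlier; it is an illustration, not something the `mpg` dataset needs.

```python
import pandas as pd

# Hypothetical column containing one misspelled category
origin = pd.Series(['usa', 'japan', 'uas', 'europe', 'usa'])
valid = ['usa', 'japan', 'europe']

# List the labels that are not in the set of valid categories
bad_labels = origin[~origin.isin(valid)].unique()
print(bad_labels)

# Map the known typo to the correct label, then re-run the check
origin_fixed = origin.replace({'uas': 'usa'})
n_invalid = (~origin_fixed.isin(valid)).sum()
print(n_invalid)
```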

In [None]:
# mpg.groupby('origin').mean()  # Error: the mean of the text column 'name' is undefined
# Drop the text column first, then average every remaining column by origin
mpg.drop('name', axis=1).groupby('origin').mean()
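
An alternative to dropping the text column is to ask `mean()` to skip non-numeric columns with `numeric_only=True`. The sketch below uses a hypothetical `cars` table and assumes a reasonably recent pandas version; it shows that both approaches give the same result.

```python
import pandas as pd

# Hypothetical toy data with one text column ('name')
cars = pd.DataFrame({
    'name': ['car a', 'car b', 'car c', 'car d'],
    'origin': ['usa', 'usa', 'japan', 'japan'],
    'mpg': [18.0, 22.0, 30.0, 34.0],
})

# Option 1: let mean() ignore the non-numeric columns
means_numeric_only = cars.groupby('origin').mean(numeric_only=True)

# Option 2: drop the text column first (the approach used in this lecture)
means_after_drop = cars.drop('name', axis=1).groupby('origin').mean()

print(means_numeric_only)
```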

In [None]:
all_cars_by_year = mpg.drop(['name','origin'], axis=1).groupby('model_year').mean()
all_cars_by_year

In [None]:
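# Build the expression one step at a time:
# mpg[mpg['origin'] == 'usa']                                   # 1. keep only the US cars
# mpg[mpg['origin'] == 'usa'].drop(['name','origin'], axis=1)   # 2. drop the text columns
# 3. average the remaining columns by model year
mpg[mpg['origin'] == 'usa'].drop(['name','origin'], axis=1).groupby('model_year').mean()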

In [None]:
cars_by_year = mpg.drop('name', axis=1).groupby(['origin','model_year']).mean()
cars_by_year

## Calculations

### Quantitative Calculations

In [None]:
mpg.info()

In [None]:
mpg.describe()
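
The `describe()` output includes the five-number summary (min, 25%, 50%, 75%, max). As a minimal sketch, the same numbers can be computed directly with `quantile()`. The values in `mpg_values` are hypothetical, and the labels 'Q1', 'median', and 'Q3' are names chosen here for readability.

```python
import pandas as pd

# Hypothetical gas-mileage values
mpg_values = pd.Series([15.0, 18.0, 21.0, 24.0, 27.0, 30.0, 33.0])

# Five-number summary: minimum, first quartile, median, third quartile, maximum
five_number = mpg_values.quantile([0, 0.25, 0.5, 0.75, 1.0])
five_number.index = ['min', 'Q1', 'median', 'Q3', 'max']
print(five_number)

# Interquartile range: the width of the box in a boxplot
iqr = five_number['Q3'] - five_number['Q1']
print(iqr)
```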

### Categorical Calculations

In [None]:
origin_counts = mpg['origin'].value_counts()
origin_counts
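
Counts can be turned into empirical probabilities by dividing by the total, which `value_counts(normalize=True)` does in one step. The sketch below uses a hypothetical `origin` Series rather than the real column.

```python
import pandas as pd

# Hypothetical categorical column
origin = pd.Series(['usa', 'usa', 'usa', 'japan', 'europe'])

# Proportion of rows in each category (the proportions sum to 1)
origin_props = origin.value_counts(normalize=True)
print(origin_props)
```

On the real data, `mpg['origin'].value_counts(normalize=True)` should give the share of cars from each origin.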

## Graphing
Below are some of the common graphs that we make. Use these graphs to find patterns within the data.

### Categorical plots
Countplots, barplots, pie charts

#### Countplot

In [None]:
sns.countplot(data=mpg, x='origin')
plt.title('Count of cars by origin country')
plt.xlabel('')

#### Bar Graph

In [None]:
sns.barplot(data=mpg, x='origin', y='mpg', errorbar=('ci',90))
plt.title('Gas Mileage by Country of Origin')
plt.xlabel('')
plt.ylabel('Gas Mileage (miles per gallon)')

#### Pie Chart

In [None]:
plt.pie(origin_counts, labels=origin_counts.index, autopct='%.1f%%') 

### Quantitative Plots
Histograms, KDE plots, Boxplots, Violinplots, Pairplots, Scatterplots, Regression plots, Timeseries, Heatmaps

#### Histogram

In [None]:
sns.histplot(data=mpg, x='mpg', binwidth=5)

In [None]:
sns.histplot(data=mpg, y='horsepower', bins=10)

#### KDE (Kernel Density Estimator)

In [None]:
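fig, ax = plt.subplots()
# KDE on the left y-axis (density scale)
sns.kdeplot(data=mpg, x='horsepower', ax=ax)
# Histogram on a second y-axis (count scale) that shares the same x-axis
ax2 = ax.twinx()
sns.histplot(data=mpg, x='horsepower', bins=10, ax=ax2)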
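
To see what a kernel density estimate computes, here is a minimal NumPy sketch: one Gaussian bump is placed on each data point and the bumps are averaged. The function name `gaussian_kde_sketch`, the sample values, and the fixed bandwidth of 15 are all assumptions made for illustration; seaborn chooses its bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centered on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = scaled distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

samples = [70, 90, 95, 100, 150]   # hypothetical horsepower values
grid = np.linspace(0, 250, 501)    # points at which the density is evaluated
density = gaussian_kde_sketch(samples, grid, bandwidth=15.0)

# A density should integrate to about 1 (Riemann-sum approximation)
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual points; a larger one gives a smoother curve that can hide structure.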

In [None]:
sns.kdeplot(data=mpg, x='horsepower', hue='origin')

In [None]:
sns.kdeplot(data=mpg, x='horsepower', hue='cylinders')

In [None]:
sns.kdeplot(data=mpg, x='mpg', hue='cylinders')

#### Boxplot

In [None]:
sns.boxplot(data=mpg, x='mpg', y='origin')
plt.title('Gas Mileage by Origin Country')
plt.xlabel('Gas Mileage (miles per gallon)')
plt.ylabel('')

#### Violinplot

In [None]:
sns.violinplot(data=mpg, x='mpg', y='origin')
plt.title('Distribution of Gas Mileage by Origin Country')
plt.xlabel('Gas Mileage (miles per gallon)')
plt.ylabel('')

#### Pairplot

In [None]:
sns.pairplot(data=mpg, hue='cylinders')

#### Scatterplot

In [None]:
sns.scatterplot(data=mpg, x='mpg', y='horsepower', hue='origin')
plt.title('Comparison of Gas Mileage with Engine Power')
plt.xlabel('Gas Mileage (miles per gallon)')
plt.ylabel('Engine Power (horsepower)')

#### Regression Plot

In [None]:
sns.regplot(data=mpg, x='horsepower', y='weight')

In [None]:
# regplot has no hue option, so draw one regression per origin on the same axes
fig, ax = plt.subplots()
for origin in mpg['origin'].unique():
    sns.regplot(data=mpg[mpg['origin'] == origin], x='horsepower', y='weight', label=origin, ax=ax)

ax.legend()
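
The regression lines can also be summarized numerically. As a rough sketch, assuming a straight-line least-squares fit, `np.polyfit` with `deg=1` returns the slope and intercept for each group. The `cars` table and the `fits` dictionary below are hypothetical and use made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the real mpg dataset)
cars = pd.DataFrame({
    'origin': ['usa', 'usa', 'usa', 'japan', 'japan', 'japan'],
    'horsepower': [100.0, 150.0, 200.0, 60.0, 80.0, 100.0],
    'weight': [3000.0, 4000.0, 5000.0, 1800.0, 2100.0, 2400.0],
})

# Fit weight = slope * horsepower + intercept separately for each origin
fits = {}
for origin, group in cars.groupby('origin'):
    slope, intercept = np.polyfit(group['horsepower'], group['weight'], deg=1)
    fits[origin] = (slope, intercept)

for origin, (slope, intercept) in fits.items():
    print(f'{origin}: weight = {slope:.1f} * horsepower + {intercept:.1f}')
```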

In [None]:
mpg.groupby('origin').max()

#### Timeseries

In [None]:
cars_by_year.loc['usa'].index

In [None]:
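# Each line is the mean mpg per model year; the shaded band is seaborn's default confidence interval
sns.lineplot(data=mpg, x='model_year', y='mpg', hue='origin')
plt.legend(title='Country of Origin')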
plt.title('Gas Mileage over Time by Country of Origin')
plt.xlabel('Model Year\n(76 = 1976)')
plt.ylabel('Gas Mileage (miles per gallon)')

#### Heatmap

In [None]:
# Table of average mpg, with one row per origin and one column per model year
mpg_by_origin_year = mpg.pivot_table(index='origin', columns='model_year', values='mpg', aggfunc='mean')

plt.figure(figsize=(10,3))
sns.heatmap(data=mpg_by_origin_year)

plt.title('Gas Mileage over Time by Country of Origin')
plt.xlabel('Model Year\n(76 = 1976)')
plt.ylabel('')