# Exploration and Correlation Notebook
**Adam Fletcher** <br>
08 APR 19 <br>
Version 1 <br>
<br>
Useful link for making plots in Matplotlib:
> http://scipy-lectures.org/intro/matplotlib/matplotlib.html

**TODO:** 
- Descriptive statistics for univariate analysis (mean, median, mode, variance, SD, SE)
- Display completeness of data (e.g., showing missing values)
- Calculate outliers
- Hypothesis and A/B tests

### Table of Contents
- [Data Import](#Data_Import)
- [Data Exploration](#Data_Exploration)
- [Univariate Analysis](#Univariate_Analysis)
- [Multivariate Analysis](#Multivariate_Analysis)
- [Correlation](#Correlation)

### Prerequisites

In [None]:
# xlrd is needed by pd.read_excel to open .xls files
!pip install pandas matplotlib scipy seaborn xlrd
%matplotlib inline

import random
import datetime
import pandas as pd
import matplotlib.pyplot as plt
import statistics
import numpy as np
import scipy
from scipy import stats
from IPython.display import display, HTML
import seaborn

<a id='Data_Import'></a>
### Data Import
Import the data into a variable named __data__ so that the cells below run unchanged.

In [None]:
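# Data Import
# Two example datasets are shown. The second assignment overwrites the first,
# so comment out whichever one is not needed.

# Trip data (used by the bar charts below)
data = pd.read_csv('/Users/adam/Downloads/trip.csv')

# Concrete data (used by the scatterplots and correlation cells below)
data = pd.read_excel('/Users/adam/Downloads/Concrete_Data.xls')
data.columns = ['cement_component', 'furnace_slag', 'fly_ash', 'water_component', 'superplasticizer', 'coarse_aggregate', 'fine_aggregate', 'age', 'concrete_strength']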


# Print info about the imported data
print("Length of Data Frame:", len(data), "rows and", len(data.columns), "columns")
data.head()

<a id='Data_Exploration'></a>
### Data Exploration

Quickly look at the distribution of **specific variables** <br>

<a id='Univariate_Analysis'></a>
#### Univariate Analysis

TODO: Add Descriptive Statistics
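
As a possible starting point for the descriptive statistics listed in the TODO, the sketch below collects mean, median, mode, variance, SD and SE for one numeric column. The `sample` frame and the `describe_column` helper are hypothetical and only stand in for a column of `data`; the pandas methods used are standard.

```python
import pandas as pd

# Hypothetical toy data standing in for one numeric column of `data`
sample = pd.DataFrame({"concrete_strength": [10.0, 20.0, 20.0, 30.0, 40.0]})


def describe_column(series):
    """Return common univariate statistics for a numeric pandas Series."""
    return pd.Series({
        "mean": series.mean(),
        "median": series.median(),
        "mode": series.mode().iloc[0],  # first mode if there are ties
        "variance": series.var(),       # sample variance (ddof=1)
        "sd": series.std(),             # sample standard deviation (ddof=1)
        "se": series.sem(),             # standard error of the mean
    })


summary = describe_column(sample["concrete_strength"])
print(summary)
```

For a quick overview of every numeric column at once, `data.describe()` gives count, mean, standard deviation and quartiles, although it does not report the mode or the standard error.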

**Bar Charts**

In [None]:
# These columns come from the trip dataset (trip.csv)
selected = ['from_station_id', 'to_station_id', 'usertype', 'gender', 'birthyear']
plt.figure(figsize=(20, 30))

for plot_count, feature in enumerate(selected, start=1):
    plt.subplot(3, 3, plot_count)  # Adjust the grid shape to make room for more plots
    groupby_var = data.groupby(feature).size()  # Count the rows in each category
    groupby_var.plot.bar(title=feature.replace('_', ' ').title())
    plt.ylabel('Number of Occurrences')

plt.subplots_adjust(hspace=0.5)  # Adjust the vertical spacing between plots
plt.show()

<a id='Multivariate_Analysis'></a>
#### Multivariate Analysis

**Scatterplots:** Each variable plotted against one target variable

In [None]:
plt.figure(figsize=(15, 10.5))

# Plot every column against the target; the 3x3 grid fits the 9 concrete columns
for plot_count, feature in enumerate(data.columns, start=1):
    data_ = data[data[feature] != 0]  # Remove zeros from the data
    plt.subplot(3, 3, plot_count)
    # regplot adds a linear regression line; newer seaborn releases expect x and y as keywords
    seaborn.regplot(x=data_[feature], y=data_['concrete_strength'])
    plt.xlabel(feature.replace('_', ' ').title())
    plt.ylabel('Concrete strength')
plt.show()

**Pair Plots:** All variables plotted against one another

In [None]:
data = data.dropna()
seaborn.pairplot(data, kind='reg')
plt.show()

<a id='Correlation'></a>
### Correlation

Three major tests for correlation:

- Pearson's Correlation - Most common; measures linear association between normally distributed **quantitative** variables
>https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php
- Kendall Rank Correlation - Non-parametric, **no normality assumption**, **ordinal or quantitative** variables
>https://statistics.laerd.com/spss-tutorials/kendalls-tau-b-using-spss-statistics.php
- Spearman Rank Correlation - Non-parametric, **no normality assumption**, measures monotonic association between **ordinal or quantitative** variables
>https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php
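
To illustrate how the three tests can differ, here is a minimal sketch using `scipy.stats` on synthetic data. The arrays `x` and `y` are made up for this example (they are not from the notebook's datasets): `y` increases monotonically but non-linearly with `x`, which is a case where the rank-based coefficients tend to be higher than Pearson's.

```python
import numpy as np
from scipy import stats

# Hypothetical synthetic data: y rises monotonically but non-linearly with x
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
y = np.exp(x / 2) + rng.normal(0, 1, size=200)

# Each function returns the coefficient and a two-sided p-value
pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_rho, spearman_p = stats.spearmanr(x, y)
kendall_tau, kendall_p = stats.kendalltau(x, y)

print(f"Pearson r    = {pearson_r:.3f} (p = {pearson_p:.3g})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3g})")
print(f"Kendall tau  = {kendall_tau:.3f} (p = {kendall_p:.3g})")
```

Unlike `data.corr()`, these functions work on one pair of variables at a time, but they also return a p-value for the null hypothesis of no association.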

**Correlation Tables**

`data.corr()` accepts a method argument: 'pearson' (the default), 'kendall' or 'spearman'

In [None]:
data.corr('pearson')
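
To compare the three methods side by side, one option is to build a correlation table per method in a loop. This is only a sketch: `demo` is a hypothetical two-column frame generated here so the example runs on its own, and in the notebook `data` would take its place.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`: strength falls as water content rises
rng = np.random.default_rng(1)
demo = pd.DataFrame({"water": rng.normal(180, 20, 100)})
demo["strength"] = 80 - 0.2 * demo["water"] + rng.normal(0, 3, 100)

# One correlation table per method, keyed by method name
tables = {
    method: demo.corr(method=method)
    for method in ("pearson", "kendall", "spearman")
}

for method, table in tables.items():
    print(method.title())
    print(table.round(3), end="\n\n")
```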

**Correlation over a specific category** <br>

In this example 'age' is treated as a categorical variable, so it is worth exploring how the correlations between the other variables change with age.

In [None]:
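for value in data['age'].unique():
    print("")
    print("Age:", value)
    data_ = data[data['age'] == value]  # Keep only the rows for this age
    # 'age' is constant within each subset, so drop it to avoid a row and column of NaN
    display(data_.drop(columns='age').corr())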
    
## How do I make this occur on just one line??
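
One possible answer to the question above, offered as a sketch and not as the notebook's method: pandas can compute a correlation table per group with `groupby(...).corr()`, which replaces the loop with a single expression. The `demo` frame below is hypothetical toy data that reuses three of the concrete column names so the example runs on its own.

```python
import pandas as pd

# Hypothetical toy data with two age groups
demo = pd.DataFrame({
    "age": [3, 3, 3, 28, 28, 28],
    "water_component": [150.0, 160.0, 170.0, 150.0, 160.0, 170.0],
    "concrete_strength": [20.0, 18.0, 15.0, 45.0, 41.0, 40.0],
})

# One line: a correlation table for every distinct age,
# stacked under a MultiIndex whose first level is the age
by_age = demo.groupby("age").corr()
print(by_age)

# Select the table for a single age from the stacked result
print(by_age.loc[28])
```

In the notebook this would read `data.groupby('age').corr()`. Groups with very few rows give unreliable or missing coefficients, so it may be worth checking `data['age'].value_counts()` first.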