<center><h1>Developing Differently</h1></center>
<center><h1>with</h1></center>
![alt text](https://datascienceinsider.files.wordpress.com/2015/12/jupyter-logo.png)

## What are Notebooks?
* Somewhere between an ***IDE*** and an ***Interpreter***
* Repeatable, interactive & visual
* Self-supporting documentation
* Language-agnostic


![alt text](./images/arch.png)

# Where do Notebooks get used?
* Data Analytics
    * Iterative
    * Visual
    * Connect to distributed computing resources
* Education
    * Descriptive
    * Example => Results
    * Modular, not monolithic
* Presentations
    * Easy to format
    * Run code live (& test beforehand)

![alt text](./images/workflow.png)

# Exploratory Data Analysis

>**In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.**

### Getting started...
* What do I have to work with?

In [None]:
ls

**Other CLI Stuff**
* Run other scripts with `%run`
* Execute OS-specific commands with `!`*my_cool_cmd*

In [None]:
import pandas as pd
df = pd.read_csv('./data/college.csv')
df.columns

* **What's going on above?**
    * Import the Python Data Analysis Library, pandas
    * Read the CSV file into memory
    * Display the column names

* **Try it**
    * What's the difference between `print(df.head())` and `df.head()`?
    * What does `;` at the end of a cell do?

In [None]:
df.head()

In [None]:
df.describe()

* **What's going on above?**
    * Get Summary Statistics
    * Region is not present because it is not numerical
    * Year is being summarized because it is being treated numerically

* **Try it**
    * Change Year to be a Categorical variable
        * `df['year'] = df['year'].astype('category')`
    * What is `df`?
        * `?df`
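
The effect of the categorical conversion on `describe()` can be sketched on a small stand-in frame. The values below are made up for illustration; only the column names are taken from the cells above.

```python
import pandas as pd

# Hypothetical stand-in for the college data (values are invented)
demo = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "year": [2014, 2015, 2014, 2015],
    "Expenditure": [10.0, 12.0, 8.0, 9.0],
    "Income": [20.0, 22.0, 15.0, 17.0],
})

# By default describe() summarizes numeric columns only, so year is included
summarized_before = demo.describe().columns.tolist()

# Once year is categorical it drops out of the numeric summary
demo["year"] = demo["year"].astype("category")
summarized_after = demo.describe().columns.tolist()

print(summarized_before)
print(summarized_after)
```

Passing `include='all'` to `describe()` would bring the categorical and text columns back with count/unique/top/freq rows.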

In [None]:
??df

In [None]:
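# numeric_only=True skips text columns, which newer pandas versions refuse to average
avg_df = df.groupby(['region']).mean(numeric_only=True)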
avg_df.head()

* **What's going on above?**
    * Grouping the data by region, then averaging each numeric column

* **Try it**
    * Rename Columns so we know they are averages
        * `avg_df = avg_df.rename(columns={'Expenditure': 'Avg_Exp', 'Income': 'Avg_Income'})`
    * How many records are there for each year?
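
One possible way to work through the grouping, renaming, and counting steps above, shown on a small made-up frame rather than the real `college.csv`. The data values and the names `demo`, `avg_demo`, and `records_per_year` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the college data (values are invented)
demo = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "year": [2014, 2015, 2014, 2015, 2015],
    "Expenditure": [10.0, 12.0, 8.0, 9.0, 10.0],
    "Income": [20.0, 22.0, 15.0, 17.0, 19.0],
})

# Average the two measures within each region
avg_demo = demo.groupby("region")[["Expenditure", "Income"]].mean()

# Rename so the columns say they hold averages
avg_demo = avg_demo.rename(columns={"Expenditure": "Avg_Exp", "Income": "Avg_Income"})

# Count the rows that fall in each year
records_per_year = demo.groupby("year").size()

print(avg_demo)
print(records_per_year)
```

`demo["year"].value_counts()` gives the same counts, sorted by frequency instead of by year.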

In [None]:
%matplotlib inline

from matplotlib import rcParams
rcParams['figure.figsize'] = 20, 10

import seaborn as sns

ax = sns.boxplot(x="region", y="Expenditure", hue="year", data=df)

* **What's going on above?**
    * Use the magic function to plot inline
    * Set the plot size (so it presents well)
    * Import statistical visualization library - Seaborn
    * Box plot expenditures by region, split by year

* **Try it**
    * Look at incomes

In [None]:
ax = sns.jointplot(x="Income", y="Expenditure", data=df, kind="reg")

In [None]:
ax = sns.lmplot(x="Income", y="Expenditure", col="region", data=df)

In [None]:
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

def eta_squared(aov):
    # Effect sum of squares over the summed sum_sq column; the Residual row (last) stays NaN
    aov['eta_sq'] = aov[:-1]['sum_sq'] / aov['sum_sq'].sum()
    return aov

def omega_squared(aov):
    # Mean squared error comes from the Residual row, which anova_lm lists last
    mse = aov['sum_sq'].iloc[-1] / aov['df'].iloc[-1]
    aov['omega_sq'] = (aov[:-1]['sum_sq'] - aov[:-1]['df'] * mse) / (aov['sum_sq'].sum() + mse)
    return aov

formula = 'Expenditure~region+year+region*year'
model = ols(formula, df).fit()
aov_table = anova_lm(model, typ=2)

eta_squared(aov_table)
omega_squared(aov_table)
aov_table

* **What's going on above?**
    * Importing libraries
    * Defining functions
    * Performing ANOVA
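
Read as formulas, the two helper functions above appear to compute the effect sizes below. The symbol names are mine, not the notebook's: $SS_{effect}$ and $df_{effect}$ are one row's `sum_sq` and `df`, $SS_{sum}$ is the sum of the whole `sum_sq` column (residual included), and $MS_{error}$ is the residual `sum_sq` divided by the residual `df`.

<br>
<center>$\eta^2 = \frac{SS_{effect}}{SS_{sum}}$</center>
<center>$\omega^2 = \frac{SS_{effect} - df_{effect} \cdot MS_{error}}{SS_{sum} + MS_{error}}$</center>
<br>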

We start with a brief introduction to the theory of ANOVA. ANOVA is a means of comparing the ratio of systematic variance to unsystematic variance in an experimental study. Variance in the ANOVA is partitioned into total variance, variance due to groups, and variance due to individual differences.


<img src ="./images/python_anova_theory_partitioning_of_variance.gif" style="width: 200px;"/>
<center>Partitioning of Variance in the ANOVA. SS stands for Sum of Squares.</center>
<br>
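
The partition of variance can be checked by hand with NumPy. This is a minimal sketch on made-up data with three equally sized groups; the variable names and values are assumptions, and the last lines form the ratio of between-group to within-group mean squares, which is the systematic-to-unsystematic comparison described above.

```python
import numpy as np

# Hypothetical data: 3 groups with 4 observations each
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y = np.array([4.0, 5.0, 6.0, 5.0, 7.0, 8.0, 9.0, 8.0, 1.0, 2.0, 3.0, 2.0])

grand_mean = y.mean()
levels = np.unique(groups)

# Total variation around the grand mean
ss_total = ((y - grand_mean) ** 2).sum()

# Variation of the group means around the grand mean (weighted by group size)
ss_between = sum(
    (groups == g).sum() * (y[groups == g].mean() - grand_mean) ** 2 for g in levels
)

# Variation of individual observations around their own group mean
ss_within = sum(((y[groups == g] - y[groups == g].mean()) ** 2).sum() for g in levels)

# Mean squares divide each sum of squares by its degrees of freedom
df_between = len(levels) - 1
df_within = len(y) - len(levels)
f_ratio = (ss_between / df_between) / (ss_within / df_within)

print(ss_total, ss_between, ss_within, f_ratio)
```

For a one-way design like this one, `ss_between + ss_within` reproduces `ss_total` exactly.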

The ratio obtained from this comparison is known as the F-ratio. A one-way ANOVA can be seen as a regression model with a single categorical predictor, which usually has two or more categories. It has a single factor with J levels, and each level corresponds to one group in the independent-measures design. The general form of the model, which is a regression model for a categorical factor with J levels, is:

<br>
<center>$y_i = b_0 + b_1 X_{1,i} + \dots + b_{J-1} X_{J-1,i} + e_i$</center>
<br>
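
To make the regression form concrete, here is a sketch that fits the model above by least squares with NumPy, assuming the $X$ terms are 0/1 dummy variables with the first group as the reference level. The data and variable names are invented for illustration.

```python
import numpy as np

# Hypothetical data: J = 3 groups with 4 observations each
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y = np.array([4.0, 5.0, 6.0, 5.0, 7.0, 8.0, 9.0, 8.0, 1.0, 2.0, 3.0, 2.0])

# Design matrix: an intercept column plus J - 1 dummy columns (group 0 is the reference)
X = np.column_stack([
    np.ones_like(y),
    (groups == 1).astype(float),
    (groups == 2).astype(float),
])

# Ordinary least squares fit of y on X
b, *_ = np.linalg.lstsq(X, y, rcond=None)

group_means = np.array([y[groups == j].mean() for j in range(3)])
print(b)
print(group_means)
```

With this coding, `b[0]` is the mean of the reference group and each remaining coefficient is the difference between that group's mean and the reference mean.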

There is a more elegant way to parametrize the model: the group means are represented as deviations from the grand mean by grouping their coefficients under a single term. I will not go into detail on this equation:

<br>
<center>$y_{ij} = \mu_{grand} + \tau_j + \varepsilon_{ij}$</center>
<br>

As with all parametric tests, the F-statistic is only reliable when its assumptions hold: each group's data should be roughly normally distributed, each experimental condition should have roughly the same variance (i.e., homogeneity of variance), the observations in each group should be independent, and the dependent variable should be measured on at least an interval scale.
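
Two of these assumptions can be screened with SciPy. The sketch below uses the Shapiro-Wilk test for per-group normality and Levene's test for equal variances; both tests are my choice rather than something the text prescribes, and the samples are simulated stand-ins, not the college data.

```python
import numpy as np
from scipy import stats

# Hypothetical samples: three groups drawn from normal distributions with equal spread
rng = np.random.default_rng(0)
samples = [rng.normal(loc=mu, scale=1.0, size=30) for mu in (5.0, 8.0, 2.0)]

# Shapiro-Wilk per group: a small p-value suggests that group departs from normality
shapiro_p = []
for sample in samples:
    _, p_value = stats.shapiro(sample)
    shapiro_p.append(p_value)

# Levene's test across groups: a small p-value suggests the variances differ
levene_stat, levene_p = stats.levene(*samples)

print(shapiro_p)
print(levene_p)
```

Independence and measurement scale cannot be tested this way; they follow from how the data were collected.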


