<center><h1>Developing Differently</h1></center>
<center><h1>with</h1></center>
![Jupyter logo](https://datascienceinsider.files.wordpress.com/2015/12/jupyter-logo.png)

## What are Notebooks?
* Somewhere between an ***IDE*** and an ***Interpreter***
* Repeatable, interactive & visual
* Self-documenting
* Language-agnostic


## Main features of the web application

 - In-browser editing for code, with automatic syntax highlighting, indentation, and tab completion / introspection.
 
 - The ability to execute code from the browser, with the results of computations attached to the code which generated them.
 - Display of computation results using rich media representations such as HTML, LaTeX, PNG, and SVG. For example, publication-quality figures rendered by the matplotlib library can be included inline.
 
 - In-browser editing for rich text using the Markdown markup language, so that commentary on the code is not limited to plain text.
 
 - The ability to easily include mathematical notation within Markdown cells using LaTeX, rendered natively by MathJax.
 
 - Cells come in three types: code cells, Markdown cells, and raw cells. (Older notebook formats also had heading cells; these are now written as Markdown headings.)
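
The rich representations listed above work through a display protocol: when an object defines a method such as `_repr_html_`, the notebook shows that output instead of the plain `repr`. A minimal sketch, using a hypothetical `Swatch` class that is not part of any library:

```python
class Swatch:
    """Hypothetical example class: a named colour."""

    def __init__(self, name, hex_code):
        self.name = name
        self.hex_code = hex_code

    def __repr__(self):
        # Plain-text form, used by a terminal interpreter and by print()
        return f"Swatch({self.name!r}, {self.hex_code!r})"

    def _repr_html_(self):
        # Rich form, picked up by the notebook when the object is the
        # last expression in a cell
        return (
            f'<span style="background:{self.hex_code}; padding:0 1em"></span> '
            f"<b>{self.name}</b>"
        )


swatch = Swatch("teal", "#008080")
plain = repr(swatch)
rich = swatch._repr_html_()
print(plain)
print(rich)
```

In a notebook, ending a cell with `swatch` would render the HTML form; in a plain interpreter only the text form appears.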

![Notebook architecture diagram](./images/arch.png)

# Where do Notebooks get used?
* Data Analytics
    * Interactive
    * Visual
    * Connect to distributed computing resources
* Education
    * Descriptive
    * Example => Results
    * Modular, not monolithic
* Presentations
    * Easy to format
    * Run code live (& test beforehand)

![Notebook workflow diagram](./images/workflow.png)

# Exploratory Data Analysis

>**In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.**

### Getting started...
* What do I have to work with?

In [None]:
ls

**Other CLI Stuff**
* Run other scripts with `%run`
* Execute OS-specific commands with `!my_cool_cmd`
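
Outside a notebook, roughly the same two things can be done in plain Python. The sketch below is an approximation, not how IPython implements these commands: `runpy.run_path` stands in for `%run`, `subprocess.run` stands in for `!`, and the script name `hello.py` is made up for the example.

```python
import runpy
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # A throwaway script (hypothetical name) to run
    script = Path(tmp) / "hello.py"
    script.write_text("greeting = 'hello from a script'\n")

    # Rough stand-in for `%run hello.py`: execute the file and collect its globals
    script_globals = runpy.run_path(str(script))

    # Rough stand-in for `!<command>`: run a command and capture its output
    result = subprocess.run(
        [sys.executable, "-c", "print('hello from a command')"],
        capture_output=True,
        text=True,
        check=True,
    )

print(script_globals["greeting"])
print(result.stdout.strip())
```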

In [None]:
import pandas as pd
df = pd.read_csv('./data/college.csv')
df.columns

* **What's going on above?**
    * Import pandas, the Python Data Analysis Library
    * Read the CSV file into memory as a DataFrame
    * Display the column names
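
`./data/college.csv` is not reproduced here, so the sketch below reads a few made-up rows from an in-memory string instead. The column names match the ones used elsewhere in this notebook; the values are invented for illustration.

```python
import io

import pandas as pd

# Invented rows; only the column names come from this notebook
csv_text = """region,year,Income,Expenditure
North,2014,52000,41000
North,2015,54000,43000
South,2014,47000,39000
South,2015,49000,40000
"""

# read_csv accepts a path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(list(sample_df.columns))
print(sample_df.dtypes)
```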

* **Try it**
    * What's the difference between `print(df.head())` and `df.head()`?
    * What does `;` at the end of a cell do?
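
One way to explore the first question: `print` writes the DataFrame's plain-text form, while a bare `df.head()` at the end of a cell is rendered from its HTML form. The sketch below compares the two strings directly on a small made-up DataFrame.

```python
import pandas as pd

# Made-up data for illustration
small_df = pd.DataFrame({"region": ["North", "South"], "Income": [52000, 47000]})

# What print(small_df.head()) writes
text_form = str(small_df.head())

# What the notebook renders for a bare small_df.head()
html_form = small_df.head()._repr_html_()

print(text_form)
print(html_form[:60])
```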

In [None]:
df.head()

In [None]:
df.describe()

* **What's going on above?**
    * Get summary statistics for each numeric column
    * `region` is missing because it is not numeric
    * `year` is summarized because it is stored as a number

* **Try it**
    * Change Year to be a Categorical variable
        * `df['year'] = df['year'].astype('category')`
    * What is `df`?
        * `?df`
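
A small sketch of the effect of the `astype('category')` step, on made-up data: once `year` is categorical, `describe()` no longer treats it as a number.

```python
import pandas as pd

# Made-up data for illustration
demo_df = pd.DataFrame(
    {
        "region": ["North", "North", "South", "South"],
        "year": [2014, 2015, 2014, 2015],
        "Income": [52000, 54000, 47000, 49000],
    }
)

# year is an integer column here, so it is summarized
before = demo_df.describe()

demo_df["year"] = demo_df["year"].astype("category")

# Category columns are skipped by the default numeric summary
after = demo_df.describe()

print(list(before.columns))
print(list(after.columns))

# A categorical column gets its own kind of summary: count, unique, top, freq
print(demo_df["year"].describe())
```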

In [None]:
??df

In [None]:
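# numeric_only=True skips text columns, which newer pandas versions refuse to average
avg_df = df.groupby('region').mean(numeric_only=True)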
avg_df.head()

* **What's going on above?**
    * Grouping the rows by `region`, then averaging the numeric columns within each group

* **Try it**
    * Rename Columns so we know they are averages
        * `avg_df = avg_df.rename(columns={'Expenditure': 'Avg_Exp', 'Income': 'Avg_Income'})`
    * How many records are there for each year?
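
A possible approach to both exercises, sketched on made-up data (the values are invented; the column names are the ones used in this notebook):

```python
import pandas as pd

# Made-up data for illustration
demo_df = pd.DataFrame(
    {
        "region": ["North", "North", "South", "South", "South"],
        "year": [2014, 2015, 2014, 2015, 2015],
        "Income": [52000.0, 54000.0, 47000.0, 49000.0, 51000.0],
        "Expenditure": [41000.0, 43000.0, 39000.0, 40000.0, 42000.0],
    }
)

# Average the two numeric measures within each region
avg_demo = demo_df.groupby("region")[["Income", "Expenditure"]].mean()
avg_demo = avg_demo.rename(columns={"Expenditure": "Avg_Exp", "Income": "Avg_Income"})

# Number of records for each year
records_per_year = demo_df["year"].value_counts().sort_index()

print(avg_demo)
print(records_per_year)
```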

In [None]:
%matplotlib inline

from matplotlib import rcParams
rcParams['figure.figsize'] = 20, 10  # width, height in inches

import seaborn as sns

ax = sns.boxplot(x="region", y="Expenditure", hue="year", data=df)

* **What's going on above?**
    * Use the `%matplotlib inline` magic to render plots inline
    * Set the plot size (so it presents well)
    * Import the statistical visualization library, Seaborn
    * Box plot expenditures by region, split by year

* **Try it**
    * Look at incomes
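
A box plot is a picture of a few summary numbers per group. As a rough cross-check of what such a figure shows, the sketch below computes the quartiles per region on made-up data; the whiskers and outlier points that seaborn also draws are left out.

```python
import pandas as pd

# Made-up data for illustration
demo_df = pd.DataFrame(
    {
        "region": ["North"] * 5 + ["South"] * 5,
        "Expenditure": [10, 20, 30, 40, 50, 15, 25, 35, 45, 55],
    }
)

# The box spans the 25th to 75th percentile; the line inside it is the median
quartiles = (
    demo_df.groupby("region")["Expenditure"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```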

In [None]:
# Scatter plot with a fitted regression line, plus the distribution of each variable
g = sns.jointplot(x="Income", y="Expenditure", data=df, kind="reg")

In [None]:
# One regression plot per region, side by side
g = sns.lmplot(x="Income", y="Expenditure", col="region", data=df)
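
Both plots draw a fitted straight line through the points. A minimal sketch of that kind of fit, using `numpy.polyfit` on made-up numbers; this is an ordinary least-squares line for illustration and is not taken from seaborn's internals.

```python
import numpy as np

# Made-up data: expenditure is exactly 0.5 * income + 10 here
income = np.array([40.0, 50.0, 60.0, 70.0])
expenditure = 0.5 * income + 10.0

# A degree-1 polynomial fit is a least-squares straight line
slope, intercept = np.polyfit(income, expenditure, deg=1)
print(f"Expenditure ~ {slope:.2f} * Income + {intercept:.2f}")
```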

In [None]:
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

def eta_squared(aov):
    # Each term's share of the total sum of squares; the Residual row is left as NaN
    aov['eta_sq'] = aov['sum_sq'].iloc[:-1] / aov['sum_sq'].sum()
    return aov

def omega_squared(aov):
    # The last row of the ANOVA table is the residual
    mse = aov['sum_sq'].iloc[-1] / aov['df'].iloc[-1]
    effect_ss = aov['sum_sq'].iloc[:-1]
    effect_df = aov['df'].iloc[:-1]
    aov['omega_sq'] = (effect_ss - effect_df * mse) / (aov['sum_sq'].sum() + mse)
    return aov

# region * year expands to region + year + region:year
formula = 'Expenditure ~ region * year'
model = ols(formula, data=df).fit()
aov_table = anova_lm(model, typ=2)

eta_squared(aov_table)
omega_squared(aov_table)
aov_table

* **What's going on above?**
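    * Importing the model-fitting and ANOVA functions from statsmodels
    * Defining two effect-size functions, eta squared and omega squared
    * Fitting a linear model of Expenditure on region, year, and their interaction, then running a Type II ANOVA on it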
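
Written out, the two effect sizes computed by `eta_squared` and `omega_squared` above are as follows. Here $SS_{\text{effect}}$ and $df_{\text{effect}}$ are a term's `sum_sq` and `df`, $SS_{\text{total}}$ is the sum of the whole `sum_sq` column, and $MS_{\text{error}}$ is the residual `sum_sq` divided by the residual `df`:

$$
\eta^2 = \frac{SS_{\text{effect}}}{SS_{\text{total}}},
\qquad
\omega^2 = \frac{SS_{\text{effect}} - df_{\text{effect}} \, MS_{\text{error}}}{SS_{\text{total}} + MS_{\text{error}}}
$$

As a purely hypothetical check of the arithmetic, with $SS_{\text{effect}} = 30$, $df_{\text{effect}} = 2$, $SS_{\text{total}} = 100$ and $MS_{\text{error}} = 5$:

$$
\eta^2 = \frac{30}{100} = 0.30,
\qquad
\omega^2 = \frac{30 - 2 \times 5}{100 + 5} = \frac{20}{105} \approx 0.19
$$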