# Agenda
- ### Python Basics
- ### Data Wrangling & Understanding with Pandas
- ### Data Visualizing with Matplotlib
- ### Model Building with Scikit-Learn

# At This Point:

* ### Anaconda Installed
* ### Able to open Jupyter Notebook
* ### Able to import the following libraries:
  - #### numpy
  - #### pandas
  - #### matplotlib
  - #### sklearn

# Python Basics

### There are many ways to run Python code.
- Command Line Interpreter
    - Usually installed as /usr/local/bin/python
- Command Line File Execution (write a script in a text editor, then run it from the command line)
    - Basic editors: Sublime Text 2, Atom, Notepad++
    - Advanced editors: Vim, Emacs
- IDE (Integrated Development Environment)
    - Canopy
    - PyCharm
    - Spyder
- Jupyter Notebook (formerly IPython Notebook)

### What is Jupyter Notebook?
It is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media.

### What are the key Python libraries for data science?

1.  **Pandas** – R-like dataframes, data cleaning, data munging, data exploration

2.  **Matplotlib** – plotting environment

3.  **NumPy** – fast arrays/matrices and vector operations

4. **Scikit-learn** – machine learning, data mining


### Import Python Libraries

In [None]:
## import an entire library
import matplotlib

## import an entire library with an alias
import numpy as np
import pandas as pd

## import a specific module from a library (with an alias)
from sklearn import datasets 
from matplotlib import pyplot as plt
import matplotlib.pyplot as plt

## import even more specific modules
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split  ## sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
# from sklearn.linear_model import LogisticRegression as LogReg

# Data Wrangling with Pandas

### 1. The Basics

**Pandas DataFrame:**
Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

In [None]:
data1 = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
         'year': [2000, 2001, 2002, 2001, 2002],
         'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

df1 = pd.DataFrame(data1, index = ['a', 'b', 'c', 'd', 'e'])

In [None]:
df1.head()

**Basic CRUD Operations with DataFrames**

##### CREATE

In [None]:
## scalar val
df1['new_var'] = False

## array-like
df1['new_var'] = range(len(df1))

## reference to another column
df1['new_var'] = df1['pop'] * 20
df1

When assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series, it will be instead conformed exactly to the DataFrame’s index, inserting missing values in any holes:
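Here is a minimal sketch of that alignment rule. The toy frame and the names `frame`, `partial`, and `debt` are made up for illustration and are separate from `df1`.

```python
import pandas as pd

# Hypothetical toy frame, separate from df1 above
frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['a', 'b', 'c'])

# A Series that covers only some of the index labels, plus one extra label
partial = pd.Series([10, 30, 99], index=['a', 'c', 'z'])

# Assignment aligns on index labels: 'b' has no match and becomes NaN,
# and the extra label 'z' is ignored
frame['debt'] = partial
print(frame)

# A plain list has no labels to align on, so its length must match exactly
try:
    frame['bad'] = [1, 2]
except ValueError as err:
    print('ValueError: ' + str(err))
```

Note that the `debt` column ends up as floats, because NaN is a floating point value.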

##### READ

In [None]:
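## read as dict-like object
df1['new_var']

## read as attribute (works only when the name does not clash with a DataFrame
## method: `df1.pop` returns the `pop` method, not the column)
# df1.year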

## read multiple columns
# df1[['state', 'new_var']]

In [None]:
df1

In [None]:
## reading rows
df1.loc['a', 'year']
# df1.iloc[3, 2]  ## .ix was removed in pandas 1.0
# df1.iloc[0]

##### UPDATE

In [None]:
df1.state == 'Ohio'

In [None]:
## Same as create (for a column that already exists)
df1['new_var'] = True
df1.new_var = range(5)[::-1]
df1['state_is_ohio'] = df1.state == 'Ohio'

df1

##### DELETE

In [None]:
## Drop a row
df1.drop('d', axis = 0)

## Drop a column
df1.drop('new_var', axis = 1)

## Inplace (modifies df1 directly and cannot be undone; be very sure, and do not do this by accident!)
df1.drop('new_var', axis = 1, inplace = True)

In [None]:
df2 = df1.drop('state_is_ohio', axis = 1)

In [None]:
df2

In [None]:
df1

### 1. The Basics - Exercises

In [None]:
## build a new dataframe (df2)

df2 = pd.DataFrame(data1)

In [None]:
## print the entire DF

df2

In [None]:
## print just the column `[var_name]`



In [None]:
## print just the third row (use a few different methods)

df2.iloc[2]
df2.loc[2]  ## df2 has the default integer index, so the third row has label 2 (not 'c')


In [None]:
## create a new column `new_var` with incrementing values
print(list(range(5)))
print(np.arange(5))



In [None]:
## delete a column

df2.drop('year', axis = 1)

In [None]:
## delete a row

### 2. Select and Filter

In [None]:
df3 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['a', 'b', 'c', 'd'])

In [None]:
df3

In [None]:
df3['a']

df3[['a',]]

Quick Question: What's the difference between `df3['a']` and `df3[['a']]`?
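One way to explore the question is to inspect the type and shape of each result. This sketch rebuilds the same 4x4 frame under the hypothetical name `frame` so it runs on its own.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
                     columns=['a', 'b', 'c', 'd'])

single = frame['a']      # a string key returns a one-dimensional Series
listed = frame[['a']]    # a list key returns a two-dimensional DataFrame

print(type(single).__name__, single.shape)
print(type(listed).__name__, listed.shape)
```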

##### Select A Range of Rows

In [None]:
df3

In [None]:
df3[0:3]

Things to remember about slicing dataframes:
* Python is zero-indexed
* start number is inclusive
* end number is exclusive
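The rules above apply to positional slices. As a small sketch (using a copy of the frame under the hypothetical name `frame`), it is worth contrasting them with label slices through `.loc`, which include the end label.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
                     columns=['a', 'b', 'c', 'd'])

by_position = frame[0:3]             # rows 0, 1, 2; row 3 is excluded
by_iloc = frame.iloc[0:3]            # same positional rule
by_label = frame.loc['Ohio':'Utah']  # label slices include the end label

print(by_position)
print(by_label)
```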

##### Select Using Index Names - .loc

In [None]:
df3.loc[['Ohio', 'Utah'], ['b','d']]

##### Select Using Index Locations - .iloc

In [None]:
df3.iloc[[0, 2], [1,3]]

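##### Select Using A Combination - .loc with positions (formerly .ix)

In [None]:
## .ix was removed in pandas 1.0; translate the row positions to labels and use .loc
df3.loc[df3.index[[0, 2]], ['b', 'd']]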

##### Select Using a Boolean Statement

In [None]:
df3

In [None]:
# df3.a > 5

## ~ negates a boolean mask (same result as comparing the mask to False)
df3[~(df3.a > 5)]

#### Combining Multiple Selection Methods

In [None]:
# Boolean + .loc selection
df3.loc[df3.a > 5, ['b', 'c', 'd']]

In [None]:
series = pd.Series(range(5), index = ['a', 'b', 'c', 'd', 'a'], name = 'California')
series

In [None]:
series.loc['b']

In [None]:
series.loc['a']

In [None]:
df3.index.is_unique

### 2. Select and Filter - Exercises

In [None]:
df4 = pd.DataFrame(np.arange(25).reshape((5, 5)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York', 'California'],
                 columns=['one', 'two', 'three', 'four', 'five'])

In [None]:
## print the rows for New York and California for all columns


In [None]:
## print columns `two` and `three` for all rows


In [None]:
## print all rows where `four` is even


**CHALLENGE 1:** print all rows where `four` is even or `one` is > 10

**CHALLENGE 2:** print all columns where row "Colorado" >= 7.  [Hint](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.transpose.html)

### 3. Joining Dataframes

In [None]:
df5_a = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df5_b = pd.DataFrame({'key': ['a', 'b', 'd'],'data2': range(3)})

In [None]:
df5_a

In [None]:
df5_b

Join on index (note defaults to LEFT JOIN)

In [None]:
df5_a.join(df5_b, lsuffix='a', rsuffix='b')
# df5_a.join(df5_b)

Join on common column (note: defaults to INNER JOIN)

In [None]:
df5_a.merge(df5_b)
df5_a.merge(df5_b, on = 'key')

**Join/Merge Notes:**
* Join combines DFs on index by default
* Join returns a left join by default

* Merge combines DFs on common columns by default (but best practice is to explicitly define using `on` argument)
* Merge returns an inner join by default

**Similar to SQL joins:**
* one to one
* one to many
* many to many
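The notes above can be sketched on two small, made-up frames (`left` and `right` are hypothetical names). The key `b` appears twice on each side, so the merge also shows a many-to-many match.

```python
import pandas as pd

left = pd.DataFrame({'key': ['b', 'b', 'a', 'c'], 'data1': range(4)})
right = pd.DataFrame({'key': ['a', 'b', 'b'], 'data2': range(3)})

# join: matches on the index and keeps every left row (left join)
joined = left.join(right, lsuffix='_l', rsuffix='_r')

# merge: matches on the shared 'key' column and keeps only matching keys (inner join)
merged = left.merge(right, on='key')

print(joined)
print(merged)
```

In `joined`, the last left row has no partner at index 3, so its right-hand columns are NaN. In `merged`, each of the two left `b` rows pairs with each of the two right `b` rows, and `c` disappears because it has no match.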

#### Some More Advanced Merging

In [None]:
df5_a.merge?
# df5_a.merge(df5_b, left_index=True, right_on = 'data2', suffixes = ['_a', '_b'])

### 3. Joining Dataframes - Exercises

In [None]:
df6_a = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df6_b = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})

Join the above dataframes using the join and merge methods to join on indexes and column values.
Define the suffixes for common column names.

In [None]:
df6_a

In [None]:
df6_b

In [None]:
df6_a.merge(df6_b, how = 'outer', left_on = 'lkey', right_on = 'rkey')

# Data Understanding with Pandas

Pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value (like the sum or mean) from a Series or a Series of values from the rows or columns of a DataFrame. Compared with the equivalent methods of vanilla NumPy arrays, they are all built from the ground up to exclude missing data.
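As a quick sketch of what "exclude missing data" means in practice, compare a pandas reduction with the plain NumPy one on a made-up Series containing a NaN.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, np.nan, 4.0])

pandas_mean = values.mean()               # NaN is skipped: (1 + 2 + 4) / 3
numpy_mean = np.mean(values.values)       # plain NumPy propagates the NaN
strict_mean = values.mean(skipna=False)   # opt out of skipping in pandas

print(pandas_mean)
print(numpy_mean)
print(strict_mean)
```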

### 1. Descriptive Statistics

In [None]:
df7 = pd.read_csv('wnba.csv', na_values=-99)

In [None]:
df7.head()

In [None]:
df7.info()

In [None]:
df7.describe()

In [None]:
df7.head()

In [None]:
# df7.sum()
# df7.mean()
# df7.median()
# df7.max()
# df7.min()

df7[['year', 'wins']].mean()

### 1. Descriptive Statistics - Exercises

In [None]:
# use df7 (live data)

In [None]:
## print the summary stats


In [None]:
## print the average


In [None]:
## print the standard deviation


In [None]:
## print the total wins for the New York Liberty in this dataset


### 2. Duplicates

In [None]:
df8 = pd.read_csv('wnba.csv')

In [None]:
df8.head()

By default, `drop_duplicates` removes rows that are complete duplicates of an earlier row, keeping the first occurrence.

In [None]:
df8.drop_duplicates()

The subset argument is used to drop duplicates based on __just__ the specified column(s)

In [None]:
df8.drop_duplicates(subset = ['year'])
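To see the effect of `subset` without the CSV file, here is a sketch on a small made-up frame. The column names and values are hypothetical and are not taken from `wnba.csv`.

```python
import pandas as pd

games = pd.DataFrame({'year': [2010, 2010, 2011, 2011, 2011],
                      'team': ['A', 'B', 'A', 'A', 'B'],
                      'wins': [20, 18, 22, 22, 15]})

# Only the repeated (2011, 'A', 22) row is a complete duplicate
full_rows = games.drop_duplicates()

# With subset, rows count as duplicates when just 'year' repeats
first_per_year = games.drop_duplicates(subset=['year'])
last_per_year = games.drop_duplicates(subset=['year'], keep='last')

print(full_rows)
print(first_per_year)
print(last_per_year)
```

The `keep` argument controls which of the duplicated rows survives, so the row order of the frame matters.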

### 2. Duplicates - Exercises

In [None]:
## print a list of the unique teams listed in the dataset (bonus: there's another way to do this without drop_duplicates)


In [None]:
## create a new DataFrame `df8_unique` that has the top scoring team for each
## season (most points_per_game). Hint: the DataFrame is sorted by season.


### 3. Correlation and Covariance

In [None]:
df9 = pd.read_csv('wnba.csv')

Summary of all values

In [None]:
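df9.corr(numeric_only=True)  ## skips text columns; needed on pandas >= 2.0 (parameter added in pandas 1.5)

In [None]:
df9.cov(numeric_only=True)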

Comparing __just__ specific series

In [None]:
df9.wins.corr(df9.losses)

In [None]:
df9.points_per_game.cov(df9.points_against_per_game)
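Covariance and Pearson correlation are closely related: the correlation is the covariance divided by the product of the two standard deviations. A small sketch on made-up numbers (not the WNBA data), assuming a hypothetical 34-game season so that losses are fully determined by wins:

```python
import pandas as pd

wins = pd.Series([10, 15, 20, 25, 30])
losses = 34 - wins          # every extra win is one fewer loss

cov = wins.cov(losses)      # in units of wins * losses
corr = wins.corr(losses)    # unitless, always between -1 and 1
by_hand = cov / (wins.std() * losses.std())

print(cov)
print(corr)
print(by_hand)
```

Because the relationship is perfectly linear and decreasing, the correlation is -1 (up to floating point rounding), while the covariance depends on the scale of the data.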

### 3. Correlation and Covariance - Exercises

In [None]:
## print the correlation for `wins` and `points_per_game`


In [None]:
## print the covariance for `losses` and `points_against_per_game`


In [None]:
## The default correlation metric is the Pearson correlation. Print the correlation
## matrix using Spearman correlation instead (hint: look at the documentation).


### 4. Handling Missing Data

Missing data is common in most data analysis applications. The default for most commands on pandas objects is to exclude missing data. Pandas uses the floating point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays.
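A short sketch of how this looks on a made-up frame (the column names `visits` and `label` are hypothetical): an integer column with a gap is stored as floats so that it can hold NaN, and `isna` flags the gaps in any column.

```python
import pandas as pd

frame = pd.DataFrame({'visits': [1, 2, None],       # integers plus a gap
                      'label': ['x', None, 'z']})   # strings plus a gap

print(frame.dtypes)    # 'visits' is float64 so that it can hold NaN
print(frame.isna())    # True wherever a value is missing
print(frame.count())   # number of non-missing values per column
```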

In [None]:
df10 = pd.DataFrame([[1., 6.5, 3., np.nan], 
                     [1., np.nan, np.nan, np.nan],
                     [np.nan, np.nan, np.nan, np.nan], 
                     [np.nan, 6.5, 3., np.nan],
                     [4., 5., 1., np.nan]])

# df10 = pd.read_csv('titanic.csv')  ## uncomment to try the same steps on a real dataset

In [None]:
df10.info()

#### Drop Null Values

In [None]:
df10.head()

In [None]:
df10.dropna().head()

By default, dropna() removes every row in which any value is missing. In this example, column 3 is empty in every row, so every row is dropped.

In [None]:
df10.dropna(how = 'all')


By setting how = 'all', dropna() will only remove rows where ALL values are empty. In this example, it is row index 2.

In [None]:
df10.dropna(axis = 1)

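With `axis = 1`, dropna() works on columns instead of rows. The default `how` is still `any`, so every column that contains a missing value is dropped. Now, let's change that to `all` again.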

In [None]:
df10.dropna(axis = 1, how = 'all')

#### Fill Null Values

In [None]:
df10.head()

In [None]:
df10.bfill().head()  ## same as fillna(method='bfill'), which newer pandas versions deprecate

How does having a missing value in the first row affect your choice of fillna method?
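One way to think about the question is to try forward and backward filling on a made-up Series that has gaps at both ends. The name `readings` is hypothetical.

```python
import numpy as np
import pandas as pd

readings = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

forward = readings.ffill()        # copies the previous value down; the first NaN has nothing before it
backward = readings.bfill()       # copies the next value up; the last NaN has nothing after it
constant = readings.fillna(0.0)   # a fixed value fills every gap

print(forward.tolist())
print(backward.tolist())
print(constant.tolist())
```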

### 4. Handling Missing Data - Exercises

In [None]:
## Try a few other fillna methods (hint: look at the documentation)



In [None]:
## Think of a situation where you would favor dropna over fillna. When would you favor fillna?

## What other ways can we fill the missing values without fillna?

# Data Visualizing with Matplotlib

Matplotlib is the most popular Python library for producing plots and other 2D data visualizations. Basic matplotlib graphs are easy to make.  The trickier parts are in the details.

A very solid tutorial on matplotlib: https://github.com/rougier/matplotlib-tutorial

In [None]:
# import matplotlib.pyplot as plt
%matplotlib inline

### 1. Simple Sine and Cosine Graphs

In [None]:
np.linspace(0,1)

In [None]:
X = range(5)
print(list(X))

y = range(5)[::-1]
print(list(y))

plt.scatter(X, y)

In [None]:
X = np.linspace(-np.pi, np.pi, 256, endpoint=True)
C, S = np.cos(X), np.sin(X)

plt.plot(X, C)
plt.plot(X, S, color = '#000000')

plt.title('Sine and Cosine')
plt.xlabel('X values')
plt.ylabel('f (x)')
# plt.show()
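As an example of the "details" part, here is a sketch of the same sine and cosine plot written with matplotlib's object-oriented interface, adding a legend and custom tick labels. The figure size, line style, and tick choices are arbitrary assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

X = np.linspace(-np.pi, np.pi, 256)

# One figure with one set of axes; all further calls go through `ax`
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(X, np.cos(X), label='cosine')
ax.plot(X, np.sin(X), color='#000000', linestyle='--', label='sine')

ax.set_title('Sine and Cosine')
ax.set_xlabel('X values')
ax.set_ylabel('f(x)')
ax.set_xticks([-np.pi, 0, np.pi])
ax.set_xticklabels(['-pi', '0', 'pi'])
ax.legend(loc='best')
```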

### 2. Scatterplot

In [None]:
n = 1024
std = 5

X = np.random.normal(0,std,n)
Y = np.random.normal(0,std,n)

plt.scatter(X,Y, alpha = 0.2);

I feel like I'm not learning anything from this viz... Can we add some formatting to get more out of this scatterplot?

In [None]:
n = 1024

X, c = make_blobs(n_samples=n, n_features=2, centers = 4, cluster_std=1, random_state=935)
X1, X2 = np.hsplit(X, 2)

plt.scatter(X1, X2, s = 50, c = c, alpha = 0.5)
# plt.show()

### 3. Bar Plots

In [None]:
plt.bar?

In [None]:
n = 12
X = np.arange(n)
Y1 = (1-X/float(n)) * np.random.uniform(0.5,1.0,n)
Y2 = (1-X/float(n)) * np.random.uniform(0.2,0.4,n)  ## low bound first, then high

print(X)
print(Y1)
print(Y2)
plt.bar(X, Y1, width = 0.5, facecolor='#9999ff', edgecolor='white')
plt.bar(X +.5 , Y2, width = 0.5, facecolor='red', edgecolor='white')
plt.xticks([])
# plt.show()

## Matplotlib - Exercises

Given the graph below, how do we make it pink, with no borderline edges? Can we add a title and axes labels?

In [None]:
n = 12
X = np.arange(n)
Y2 = (1-X/float(n)) * np.random.uniform(0.5,1.0,n)

plt.bar(X, Y2)
# plt.show()

**CHALLENGE:** Suppose we want to combine the two graphs above and make a graph like the one below. What should we do?

![](http://www.labri.fr/perso/nrougier/teaching/matplotlib/figures/bar_ex.png)

### Quick Detour: Data Visualization WITHOUT Matplotlib

In [None]:
df9.hist();

# Model Building with Scikit-Learn

![hello](http://cacm.acm.org/system/assets/0001/3678/rp-overview.jpg)

###### Source: http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext

**[R2D3's Visual Introduction to Machine Learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)**

### Terminology:
- Features / predictors / variables / X
- Outcome / predictions / Y
- Supervised vs. unsupervised
- Classification / regression 
- Training dataset / testing dataset
- Overfitting
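Several of these terms can be seen together in a small sketch. The synthetic data, the random seeds, and the choice of a degree-9 polynomial are all assumptions made for illustration. The flexible model always fits the training set at least as well as the straight line, and it will usually (though not necessarily) score worse on the held-out test set, which is the signature of overfitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = np.linspace(-3, 3, 40)
y = x + rng.normal(0, 1, 40)    # outcome: a straight line plus noise
X = x.reshape(-1, 1)            # features: one column per predictor

# Training dataset to learn from, testing dataset to evaluate on
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

simple = LinearRegression().fit(X_train, y_train)

# Degree-9 polynomial features give the model more flexibility than the data needs
poly = PolynomialFeatures(degree=9)
flexible = LinearRegression().fit(poly.fit_transform(X_train), y_train)

simple_train = simple.score(X_train, y_train)
simple_test = simple.score(X_test, y_test)
flexible_train = flexible.score(poly.transform(X_train), y_train)
flexible_test = flexible.score(poly.transform(X_test), y_test)

print('simple   train/test R^2:', simple_train, simple_test)
print('flexible train/test R^2:', flexible_train, flexible_test)
```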

![](http://scikit-learn.org/stable/_static/ml_map.png)

Under/Overfitting, Visualized
![Overfitting, Visualized](https://shapeofdata.files.wordpress.com/2013/02/overfitting.png)

In [None]:
from sklearn.datasets import make_blobs
from sklearn import datasets

from sklearn.model_selection import train_test_split  ## sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression

### 1. Regression Model

In [None]:
x = np.linspace(-3, 3, 100)
y = x + np.random.normal(0,1,100)

In [None]:
plt.plot(x, y, 'o')

One of the simplest models is linear regression, which fits a line to the scatterplot data.  This is called ordinary least squares (OLS) linear regression.  The class in sklearn is `LinearRegression`.
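To make "least squares" concrete, this sketch solves the same kind of problem by hand with NumPy and compares the result with scikit-learn. The seeded synthetic data and the names `A` and `model` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = np.linspace(-3, 3, 100)
y = x + rng.normal(0, 1, 100)

# Least squares by hand: add a column of ones for the intercept, then solve
# for the slope and intercept that minimize the sum of squared residuals
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), _, _, _ = np.linalg.lstsq(A, y, rcond=None)

# The same fit through scikit-learn
model = LinearRegression().fit(x.reshape(-1, 1), y)

print(slope, intercept)
print(model.coef_[0], model.intercept_)
```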

In [None]:
## regressions can (and will) have multiple features, so a linear regression expects a vector for each observation
X = x.reshape(len(x), 1)
print(x.shape)
print(X.shape)

In [None]:
## Randomly split our dataset into a training set from which the algorithm
## will learn and a test set on which we'll measure our accuracy.
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

In [None]:
# instantiate the sklearn LinearRegression object and fit the training data
linreg = LinearRegression()
linreg.fit(Xtrain, ytrain)

In [None]:
linreg.coef_, linreg.intercept_

In [None]:
yhat_train = linreg.predict(Xtrain)
# print(list(zip(ytrain, yhat_train)))

In [None]:
# plotting known x, y values
plt.plot(Xtrain, ytrain, 'o', label="data")

# plotting predicted y values
plt.plot(Xtrain, yhat_train, 'o', label="prediction")

# plotting trendline
# Xspace = np.linspace(x.min(), x.max(), 256).reshape(256, 1)
# plt.plot(Xspace, linreg.predict(Xspace), color = 'r', linewidth = 2)

plt.legend(loc='best')

In [None]:
yhat_test = linreg.predict(Xtest)

In [None]:
# plotting known x, y values
plt.plot(Xtest, ytest, 'o', label="data")

# plotting predicted y values
plt.plot(Xtest, yhat_test, 'o', label="prediction")

# plotting trendline
# Xspace = np.linspace(x.min(), x.max(), 256).reshape(256, 1)
# plt.plot(Xspace, linreg.predict(Xspace), color = 'r', linewidth = 2)

plt.legend(loc='best')

To quantitatively evaluate the performance of the model, we use R<sup>2</sup> (the coefficient of determination) or the mean squared error. For regressors, `score` returns R<sup>2</sup>.

In [None]:
print('Train', linreg.score(Xtrain, ytrain))
print('Test', linreg.score(Xtest, ytest))
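Both metrics are also available as functions in `sklearn.metrics`. The sketch below uses seeded synthetic data (an assumption, so that it runs on its own) to show that `r2_score` matches `score` and that the mean squared error is just the average squared residual.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() + rng.normal(0, 1, 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)               # same value as model.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)    # average squared residual, in units of y squared
mse_by_hand = np.mean((y_test - y_pred) ** 2)

print(r2)
print(mse)
```

R<sup>2</sup> is unitless and easy to compare across problems, while the mean squared error is expressed in the squared units of the outcome.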

### 1. Regression Model - Exercise

Fit LinearRegression on the boston housing dataset.

In [None]:
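## note: load_boston was removed in scikit-learn 1.2; on newer versions, use another
## bundled regression dataset such as datasets.load_diabetes()
boston = datasets.load_boston()

In [None]:
X = boston.get('data')
y = boston.get('target')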

### Interactive Demo: Classification Model

In [None]:
# import some data to play with
iris = datasets.load_iris()

dfIris = pd.DataFrame(iris.data[:, :2], columns = iris.feature_names[:2])
dfIris['y'] = iris.get('target')

X = dfIris.iloc[:, :-1].values
y = dfIris.y.values
# print(X.shape, y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y)
# print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
X.shape, X_train.shape, X_test.shape

In [None]:
clf = LogisticRegression()
clf.fit(X_train, y_train)
yhat = clf.predict(X_test)

print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))

### Classification Model - Exercise

Re-run the classification model above, but this time use the entire iris dataset.