# A Rapid and Informal Introduction to Python for Data Science

# - Modeling Packages in Python - Statsmodels & Scikit-Learn

#### Developed by:  Brian Vegetabile, PhD Candidate, University of California, Irvine

This notebook is a supplement to the workshop "A Rapid and Informal Introduction to Python for Data Science".

# Getting Started

Let's run our first piece of code to import the libraries and packages we'll need.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
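import statsmodels.api as sm              # Statsmodels array-based interface
import statsmodels.formula.api as smf     # Statsmodels formula interface, similar to R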
%matplotlib inline
```

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm              # Statsmodels interactive version
import statsmodels.formula.api as smf     # Statsmodels formula version similar to R
%matplotlib inline

# Inferential Models in Python -  Statsmodels

We won't go too in depth into either package, but we did want to provide a few examples of each so that you know they exist.  After this you should have the tools needed to get up and running with Python for data science and statistics.  

http://statsmodels.sourceforge.net/devel/index.html

In the numpy section we introduced linear regression and showed it as an example for matrix manipulation.  We go back to that here and show how both statsmodels and sklearn perform linear regression.  

First let's simulate some data again:

```python
np.random.seed(333)

beta0 = 5
beta1 = 1
n = 1000
X = np.random.normal(size=n)
noise = np.random.normal(size=n)
Y = beta0 + beta1*X + noise
```

This is the same data that was simulated earlier.  


Statsmodels has two ways of analyzing this data.  We'll start with the `numpy` (array-based) version before introducing the formula version.  

Take a look at our data: it doesn't yet have the column of ones needed to estimate an intercept in linear regression.  Statsmodels provides a nice function that can add this for us.  

```python
sm.add_constant(array)
``` 

Let's take a look at what that does:

```python
X = sm.add_constant(X)
print(X)
```

To fit a simple model we can use the `OLS` class from `statsmodels.api` with the format

```python
ols_model = sm.OLS(response, design_matrix)
```

This returns an OLS model object but does not fit anything yet.  To actually fit the model we have to call the `.fit()` method of the OLS object.

```python
model_fit = ols_model.fit()
```

We can then return the results using the `summary` method of a model fit.  

```python
model_fit.summary()
```

###### Mini Exercise - Fit a model with the Y and X data to test out this function.

After you've fit the model, see what else is available on the fitted `OLS` results object.  Most of the statistical summaries that you would want are there. 

### Statsmodels with R Style formulas

Another way to use statsmodels is using `R` style formulas from the `statsmodels.formula.api`.  We'll continue with the same simulated data as before.

```python
np.random.seed(333)

beta0 = 5
beta1 = 1
n = 1000
X = np.random.normal(size=n)
noise = np.random.normal(size=n)
Y = beta0 + beta1*X + noise
```

To use these formulas, we can create a pandas DataFrame (or simply a dictionary) of our data.  This will let us use the formula style calls from `R`.  

```python
dat = pd.DataFrame({'Y':Y, 'X':X})
```

We can then use the following to fit our model:

```python
smf.ols('Y ~ X', data=dat).fit().summary()
```

Notice that `ols` is lowercase in this example.  The `smf` module also exposes the uppercase `OLS`, which works on numpy arrays rather than on a formula.  

# Predictive Models in Python with Scikit-Learn

An alternative to `statsmodels` is the `scikit-learn` package.  

http://scikit-learn.org/stable/

This package is primarily used for predictive modeling and machine learning in Python and isn't really suited for inference, as you'll see shortly.  

Let's fit our linear regression model again.

```python
from sklearn import linear_model
```

In [None]:
from sklearn import linear_model

To create a linear regression model,

- Calling `linear_model.LinearRegression()` creates an object of class `sklearn.linear_model.LinearRegression`
    - Defaults 
        - `fit_intercept = True`: fits an intercept, so a column of ones does not need to be added
        - `normalize = False`: defaults to not normalizing the input predictors (this parameter exists only in older scikit-learn releases; it was later deprecated and removed)
        - `copy_X = True`: defaults to copying X, so the input is not overwritten
        - `n_jobs`: The number of jobs to use for the computation (`1` by default in older releases, `None` in newer ones). If -1 all CPUs are used. This will only provide speedup for n_targets > 1 and sufficiently large problems.
    - Example
        - `lmr = linear_model.LinearRegression()`
- To fit a model, the method `.fit(X,y)` can be used
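    - X must be a two-dimensional array of shape `(n_samples, n_features)`, so a single predictor has to be a column vector
        - This can be accomplished with `X[:, np.newaxis]` or by creating a DataFrame using `pd.DataFrame()`
    - Example
        - `lmr.fit(X, y)`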
- To see the $\beta$ estimates use `.coef_` for the coefficients for the predictors and `.intercept_` for $\beta_0$

On our data this will become 

```python
X_colvec = X[:, np.newaxis]
Y_colvec = Y[:, np.newaxis]

lmr = linear_model.LinearRegression()
model_fit = lmr.fit(X_colvec, Y_colvec)
print(model_fit.coef_, model_fit.intercept_)
```

In [None]:
np.random.seed(333)

beta0 = 5
beta1 = 1
n = 1000
X = np.random.normal(size=n)
noise = np.random.normal(size=n)
Y = beta0 + beta1*X + noise

X = X[:, np.newaxis]
Y = Y[:, np.newaxis]

lmr = linear_model.LinearRegression()
model_fit = lmr.fit(X, Y)
print(model_fit.coef_, model_fit.intercept_)

We see that this is very similar to how the `statsmodels` linear regression works, where we create an object and then subsequently fit a model using a method of that object.  

A major difference between the two is what we get back in the end.  The `statsmodels` results object has many of the inferential quantities that we would want, including the variance of the estimates of the $\beta$ coefficients.  The `sklearn` model does not have these but has many functions for predicting out-of-sample values and evaluating model performance on new data.  They perform similar functions but were developed for completely different tasks.  That is part of the reason for highlighting both of these packages quickly.

# Exercise - Old Faithful Data

Use the dataset `oldfaithful.csv` to perform a linear regression to estimate the `waiting` time until the next eruption based upon the `eruption` time in minutes.  

```python
oldfaithful = pd.read_csv('data/oldfaithful.csv')
```

Choose to practice with either `sklearn` or `statsmodels`.

# Moving on from here... Practice Datasets

Through the `statsmodels` package, many of the same datasets that are available in `R` have been made available in `Python`.

```python
# Example dataset load
import statsmodels.api as sm
sunspots = sm.datasets.sunspots.load_pandas().data
```

Alternatively you can get data from the UCI Machine Learning Repository.

http://archive.ics.uci.edu/ml/

###### Practice, practice, practice