# Models Models Models

## Imports

In [1]:
# Necessary imports
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

%matplotlib inline


# Linear Regression

## Assumptions of Linear Regression (ordinary least squares)

* **Linear Relationship**: there is a linear relationship between the independent and dependent variables
* **Multivariate Normality**: the variables, and above all the residuals, should be approximately normally distributed; check with a histogram or a Q-Q plot
* **No or Little Multicollinearity**: independent variables should not be strongly correlated with one another, because that muddies interpretation of the results and can cause extreme changes in coefficients between only slightly different models
    * Check using a correlation matrix
    * Check using tolerance: T = 1 - R^2, where R^2 comes from regressing one independent variable on all the others; T < 0.1 is a warning sign, and T < 0.01 indicates multicollinearity
    * Check using variance inflation factor (VIF): VIF = 1/T; VIF > 10 is a warning sign, and VIF > 100 indicates multicollinearity
    * Fix by centering the data OR removing the correlated independent variables
* **No Autocorrelation**: residuals should be independent of one another, i.e. no periodicity
    * Check using the Durbin-Watson test: 1.5 < DW < 2.5 suggests no autocorrelation
* **Homoscedasticity**: the residuals should have the same variance all along the regression line
    * If heteroscedasticity is present, a non-linear correction could help
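
The tolerance/VIF and Durbin-Watson checks above can be sketched with plain numpy. This is a minimal illustration on synthetic data (the variables `x1`, `x2`, `x3` and the helpers `r_squared`, `vif`, `durbin_watson` are made up for this example, not part of the notes); statsmodels also ships its own `variance_inflation_factor` and `durbin_watson` functions, which are probably what you'd use in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.05 * rng.normal(size=n)   # nearly a copy of x1, so collinear with it
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

def r_squared(A, b):
    """R^2 of an OLS fit of b on the columns of A (intercept added here)."""
    A1 = np.column_stack([np.ones(len(A)), A])
    coef, *_ = np.linalg.lstsq(A1, b, rcond=None)
    resid = b - A1 @ coef
    return 1.0 - (resid @ resid) / np.sum((b - b.mean()) ** 2)

def vif(X, j):
    """VIF of column j: regress it on the remaining columns, then 1 / tolerance."""
    others = np.delete(X, j, axis=1)
    tolerance = 1.0 - r_squared(others, X[:, j])
    return 1.0 / tolerance

def durbin_watson(resid):
    """Sum of squared successive differences over sum of squared residuals."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

vifs = [vif(X, j) for j in range(X.shape[1])]

# Residuals of the full model, for the autocorrelation check
A1 = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A1, y, rcond=None)
dw = durbin_watson(y - A1 @ coef)

print(vifs)   # x1 and x3 should be flagged, x2 should not
print(dw)     # close to 2 for independent errors
```

Here `x1` and `x3` get very large VIFs because each is almost fully explained by the other, while `x2` stays near 1.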
   

## How to interpret results values from linear regression
* **DF Model**: degrees of freedom of the model; the number of predictor variables
* **DF Residuals**: degrees of freedom of the residuals; (# observations) - (DF Model) - 1; want a larger number
* **R^2**: the coefficient of determination, a goodness-of-fit measure (not a cost function to minimize) that reflects how much of the variability in the data is captured by the model
    * R^2 = 1 - SSE/SST
    * SSE = sum of squared errors = SUM (Y_obs - Y_pred)^2 --> randomness left in model
    * SST = total sum squares = SUM (Y_obs - Y_mean)^2 --> variance in data
* **Adjusted R^2**: more predictor variables typically increase the R^2, but there's the curse of dimensionality! Adjusted R^2 accounts for model complexity/number of predictor variables
* **F-Statistic**: you can use hypothesis testing to guide your model, but it's not an absolute indicator of model quality. The null hypothesis of the F-test is that all of the regression coefficients (apart from the intercept) are zero, i.e. that the predictors explain nothing beyond the mean.
    * If p-value of the F-test is < 0.05, reject the null and assume the model is doing something right
* **Log-likelihood**: the log of the probability of the observed data under the fitted model, calculated under the assumption that the errors follow a normal distribution (slightly unreasonable)
    * Want to maximize this number (maximum likelihood estimator)
* **AIC and BIC**: calculate penalties for model complexity based on the log likelihood
    * AIC: Akaike information criterion; relative estimate of information loss between different models
    * BIC: Bayesian information criterion; derived from a Bayesian argument rather than an information-theoretic one, and penalizes model complexity more heavily than AIC
    * Use these to compare between different models - use the model with the lower value
* **Coefficients**: or weights
* **Standard Error**: estimated standard deviation of each coefficient; the square root of the corresponding diagonal entry of the covariance matrix of the estimated coefficients
* **t-statistic**: the t-test statistic for each estimated coefficient (the coefficient divided by its standard error)
    * Null hypothesis: the coefficient of a given predictor is zero, i.e. it has no effect on the target variable
    * If p-value of t-test < 0.05, reject the null and treat the coefficient as significant
* **Confidence Interval**: range of values in which we'd expect to find the parameter of interest (i.e. the coefficients)
    * Want a narrower range because it indicates a more precise estimate
* **Skewness and Kurtosis**:
* **Omnibus Test**:
* **Durbin-Watson**:
* **Jarque-Bera Test**:
* **Condition Number**:
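
To make the R^2, adjusted R^2, log-likelihood, AIC and BIC definitions above concrete, here is a small numpy sketch on synthetic data. The data and variable names are invented for illustration, and the parameter count `k = p + 1` (coefficients plus intercept) is an assumption about the convention; check it against your library's output before comparing numbers.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3                                  # observations, predictors
X = rng.normal(size=(n, p))
y = 0.5 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)

A = np.column_stack([np.ones(n), X])           # add the intercept column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # ordinary least squares
y_pred = A @ beta

sse = np.sum((y - y_pred) ** 2)                # randomness left in the model
sst = np.sum((y - y.mean()) ** 2)              # variance in the data
r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalize extra predictors

# Gaussian log-likelihood, evaluated at the MLE of the error variance (sse / n)
loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(sse / n) + 1)

k = p + 1                                      # assumed count: coefficients + intercept
aic = 2 * k - 2 * loglik
bic = k * np.log(n) - 2 * loglik               # harsher penalty once ln(n) > 2

print(r2, adj_r2)
print(loglik, aic, bic)
```

Adjusted R^2 is always a little below R^2, and with 100 observations BIC charges ln(100) ≈ 4.6 per parameter against AIC's 2, which is why BIC is the more critical of the two.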

## sm.OLS with statsmodels

* Via the `statsmodels.api` package
* To use this method, you need to generate a **matrix** of **features**, **`X`**, and a **vector** of **targets**, **`y`**, where each row represents a single **observation**. You can do this with a call to **`patsy.dmatrices`**, which also adds the intercept column to `X`
* Given a dataframe with the columns: Y, X1, X2, X3, X4, X5, X6

In [None]:
import patsy
import statsmodels.api as sm

In [None]:
# Create your feature matrix (X) and target vector (y)
y, X = patsy.dmatrices('Y ~ X1 + X2 + X3 + X4 + X5 + X6', data=df, return_type="dataframe")
# Create your model
model = sm.OLS(y, X)
# Fit your model to your training set
fit = model.fit()
# Print summary statistics of the model's performance
fit.summary()

## smf.ols with statsmodels

* Via the `statsmodels.formula.api` package
* The formula approach handles the creation of the `X` and `y` matrices internally, so all you have to do is supply the R-style formula for your model when you create your `ols` object.
* Given a dataframe with the columns: Y, X1, X2, X3, X4, X5, X6

In [None]:
import statsmodels.formula.api as smf

In [None]:
# Define the model
lm1 = smf.ols('Y ~ X1 + X2 + X3 + X4 + X5 + X6', data=df)
# Fit the model
fit1 = lm1.fit()
# Print summary statistics of the model's performance
fit1.summary()
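
Besides `summary()`, the individual values discussed in the interpretation section can be read straight off the fitted results object. A minimal sketch, assuming a small synthetic dataframe (the data and the name `fit_demo` are made up for this example; only `X1` actually drives `Y`):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(300, 2)), columns=['X1', 'X2'])
# Hypothetical ground truth: Y depends on X1 only, X2 is pure noise
df['Y'] = 1.0 + 2.0 * df['X1'] + rng.normal(scale=0.5, size=300)

fit_demo = smf.ols('Y ~ X1 + X2', data=df).fit()

print(fit_demo.params)                           # coefficients, including the Intercept
print(fit_demo.bse)                              # standard errors
print(fit_demo.pvalues)                          # p-values of the per-coefficient t-tests
print(fit_demo.conf_int())                       # 95% confidence intervals by default
print(fit_demo.rsquared, fit_demo.rsquared_adj)  # R^2 and adjusted R^2
print(fit_demo.aic, fit_demo.bic)                # lower is better when comparing models
```

The coefficient on `X1` should come out near 2 with a tiny p-value, while the one on `X2` should sit near zero.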

### Plotting Residuals
* residuals: (actual value - predicted value)
* want errors to be random 
* If residuals look systematic (e.g. missing high for one range and low for another) then we probably are missing the actual functional dependency underlying the data (perhaps it's not really linear)

In [None]:
# Plot the residuals (fit1.resid is a pandas Series, so this is pandas plotting)
fit1.resid.plot(style='o', figsize=(12,8))
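
What "systematic" residuals look like can be shown without a plot. The sketch below is a toy example with made-up data: it fits a straight line to data that is actually quadratic and compares the average residual in different ranges of `x`.

```python
import numpy as np

x = np.linspace(-3, 3, 61)
y = x ** 2                                   # truly quadratic, no noise at all

slope, intercept = np.polyfit(x, y, deg=1)   # best straight-line fit
resid = y - (slope * x + intercept)          # actual value - predicted value

ends = resid[np.abs(x) > 2].mean()           # average residual at the extremes
middle = resid[np.abs(x) < 1].mean()         # average residual near the center

print(ends, middle)
```

The line under-predicts at both ends (positive residuals) and over-predicts in the middle (negative residuals), even though the residuals still average to zero overall. That pattern is the sign that the functional form is wrong.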

## LinearRegression() with sklearn

In [None]:
from sklearn.linear_model import LinearRegression

* with dataframe df.columns = Y, X1, X2, X3, X4, X5, X6

In [None]:
# Create an empty model
lr = LinearRegression()
# Choose the predictor variables, here all but the first which is the response variable
# This model is analogous to the Y ~ X1 + X2 + X3 + X4 + X5 + X6 model
X = df.iloc[:, 1:]
# Choose the response variable(s)
y = df.iloc[:, 0]
# Fit the model to the full dataset
lr.fit(X, y)
# Print out the R^2 for the model against the full dataset
lr.score(X,y)

* using a subset of the dataframe's columns

In [None]:
# Create an empty model
lr1 = LinearRegression()
# Choose the predictor variables, here a subset of the columns selected by name
# This model is analogous to the Y ~ X1 + X3 + X6 model
X = df[['X1', 'X3', 'X6']]
# Choose the response variable(s)
y = df['Y']
# Fit the model to the full dataset
lr1.fit(X, y)
# Print out the R^2 for the model against the full dataset
lr1.score(X, y)

### Output from LinearRegression()

- `fit()`: Fit a model to a set of training data

**After model is fit**
- `score()`: Score the performance of a model on a given sample of data with known ground truth dependent variables
- `predict()`: Predict target/response variables based on a sample of independent variables (features, predictors, etc)
- `intercept_`: our $\beta_0$ intercept in our regression model
- `coef_`: the other $\beta$s in our model

In [None]:
# Assumes X_train, X_test, y_train, y_test come from a train/test split (e.g. train_test_split)
lr.fit(X_train, y_train)

lr.score(X_train, y_train) # the R^2 of the model on the training set
lr.score(X_test, y_test) # the R^2 of the model's test-set predictions against the actual y_test

y_predicted = lr.predict(X_test)

lr.intercept_

lr.coef_
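
A self-contained version of the cell above, as a sketch on synthetic data (the data and the names `lr_demo`, `train_r2`, `test_r2` are invented here), showing where `X_train` and friends come from and what the fitted attributes hold:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
# Hypothetical ground truth: intercept 4, coefficients 3 and -2, small noise
y = 4.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hold out 30% of the rows so the model is scored on data it never saw
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

lr_demo = LinearRegression().fit(X_train, y_train)

train_r2 = lr_demo.score(X_train, y_train)   # R^2 on the training rows
test_r2 = lr_demo.score(X_test, y_test)      # R^2 on the held-out rows
y_predicted = lr_demo.predict(X_test)        # one prediction per test row

print(lr_demo.intercept_, lr_demo.coef_)
print(train_r2, test_r2)
```

With this little noise, `intercept_` and `coef_` should land close to the values used to generate the data.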

## Polynomial Regression with sklearn and Pipelines

We can transform the predictor variables by any type of function we want before inputting them to linear regression.  This is the idea behind [**Polynomial Regression**](https://en.wikipedia.org/wiki/Polynomial_regression) and it allows us (along with similar functional regressions) to essentially model our response variables as any function of our predictor variables that we like.  Viewed in this way, Linear Regression is just a special instance of Polynomial Regression with a polynomial of degree 1.

**Pipelines**
The [***make_pipeline***](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html) function strings together a pipeline of operations: it first transforms our linear features into polynomial features and then runs a linear regression against the resulting polynomial features.

In [None]:
# import PolynomialFeatures and make_pipeline for Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

In [None]:
# Set the degree of our polynomial
degree = 3
# Generate the model type with make_pipeline
# This tells it the first step is to generate 3rd degree polynomial features in the input features and then run
# a linear regression on the resulting features
est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
# Fit our model to the training data
est.fit(X_train, y_train)

# Ground Truth
X_test, y_test

# Prediction
X_test, est.predict(X_test)
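
To see what the polynomial features buy you, here is a sketch that fits the same pipeline with degree 1 and degree 3 to synthetic data generated from a cubic (the data and the `scores` dict are assumptions made for this example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=200)
X = x.reshape(-1, 1)                          # sklearn expects a 2D feature matrix
y = x ** 3 - x + rng.normal(scale=0.2, size=200)

scores = {}
for degree in (1, 3):
    # Same two steps as above: expand the features, then fit a linear regression
    est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    est.fit(X, y)
    scores[degree] = est.score(X, y)          # R^2 on the data it was fit to

print(scores)
```

The degree 1 pipeline is plain linear regression and misses the curvature; the degree 3 pipeline can represent the cubic and scores much higher. These are in-sample scores, so a held-out set is still needed to judge overfitting.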

## Correlations

In [None]:
# Visualizing pairwise correlations with seaborn
sns.pairplot(example_df[['column1','column2', 'column3', 'column4', 'column5']])

# plots correlations as scatter plots of each column vs each column

## Dummy Variables with Patsy

A couple of things:  
1) First, patsy adds an intercept column (it comes for free)  
2) Second, in this case, one 'make' ('alfa-romero' here) is missing from the matrix

This is because patsy knows about [The Dummy Variable Trap](http://www.algosome.com/articles/dummy-variable-trap-regression.html). The idea behind the DVT: take a categorical variable that has two outcomes (for example, boy and girl for the feature 'sex'). We only need one column, "Girl", because we can solve for "Boy" given this column. If we had both columns, we would automatically introduce multicollinearity. This idea extends to categorical features with 20+ categories!

In [None]:
# Let's use patsy to convert the categorical feature 'make' to dummy (indicator) variables

X=patsy.dmatrix('make',data=df,return_type='dataframe')
X.head()
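
The dummy variable trap can be checked numerically. The sketch below swaps in `pd.get_dummies` for patsy (a different tool, used here only to keep the example short) on a made-up three-make dataframe, and compares the rank of the design matrix with and without the dropped level.

```python
import numpy as np
import pandas as pd

# Toy data, invented for this example
toy = pd.DataFrame({'make': ['audi', 'bmw', 'volvo', 'audi', 'bmw', 'volvo']})

# Intercept + one column per level: the dummy columns sum to the intercept column
all_levels = pd.get_dummies(toy['make'], dtype=float)
all_levels.insert(0, 'Intercept', 1.0)

# Intercept + all levels but the first, which is what patsy's default amounts to
dropped = pd.get_dummies(toy['make'], dtype=float, drop_first=True)
dropped.insert(0, 'Intercept', 1.0)

rank_all = np.linalg.matrix_rank(all_levels.values)
rank_dropped = np.linalg.matrix_rank(dropped.values)

print(all_levels.shape[1], rank_all)       # more columns than rank: perfectly collinear
print(dropped.shape[1], rank_dropped)      # columns equal rank: no redundancy
```

With every level kept, the matrix has four columns but only rank three, which is exactly the perfect multicollinearity the trap describes. Dropping one level restores full column rank.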

# Logistic Regression

## KNN - K-Nearest Neighbors

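* lazy learner: no model is built at fit time, the training data is simply stored
* 'fits' fast, predicts slow (every prediction is compared against all the stored training data)
* assuming n training points of d-dimensional data, the straightforward implementation takes O(dn) time per prediction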
* higher memory (saves training set)
* various implementations, including weighted KNN
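
The points above are easiest to see in a brute-force implementation. This is a minimal sketch, not how sklearn does it (sklearn can also use tree-based indexes); the function `knn_predict` and the toy data are invented for illustration.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # One distance per stored point: n points times d dimensions, hence O(dn)
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]            # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# "Fitting" is nothing more than keeping the training data around
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.05, 0.1]))
pred_b = knn_predict(X_train, y_train, np.array([5.0, 5.1]))
print(pred_a, pred_b)
```

All the work happens at prediction time, and the whole training set has to stay in memory. Sorting all the distances adds some overhead on top of the O(dn) distance computation; a partial selection of the k smallest would avoid that.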

In [None]:
# Scaling is crucial to KNN: otherwise distances are dominated by the features with the largest ranges

from sklearn.preprocessing import scale

X = scale(X)
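
One caveat with scaling the whole of `X` up front is that the test rows influence the scaling statistics. A possible alternative, sketched here on the iris dataset bundled with sklearn (a stand-in for your own `X` and `y`), is to put a `StandardScaler` in a pipeline so the mean and standard deviation are learned from the training rows only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)             # stand-in data for this sketch
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# The scaler is fit on X_train only, then applied to whatever is predicted on
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)

test_accuracy = scaled_knn.score(X_test, y_test)
print(test_accuracy)
```

The pipeline can be passed to `cross_val_score` or `GridSearchCV` like any other estimator, in which case the scaler is refit inside each fold.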

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.4, 
                                                    random_state=0)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print(accuracy_score(y_test, y_pred))

### with cross-validation

In [None]:
from sklearn.model_selection import cross_val_score

# Perform 3-fold cross validation with
# n_neighbors = 5 and print out the scores and their means

knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X_train, y_train, 
                         cv=3, scoring='accuracy')

print(scores)
print(scores.mean())

### cross-validation to find optimal k-value

In [None]:
# search for an optimal value of K for KNN

k_scores = []

for k in range(1, 31):

    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=3, scoring='accuracy')
    
    k_scores.append((k, scores.mean()))
    
k_scores = pd.DataFrame(k_scores, columns=['k', 'accuracy']).set_index('k')
k_scores.head()

**Once the optimal k-value is found, re-evaluate the model and run the model on the test data set**
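
That last step might look like the sketch below, again using the bundled iris data as a stand-in (the names `best_k`, `final_knn` and `test_accuracy` are made up here): pick the k with the best mean cross-validation accuracy, refit on the full training set, and only then touch the test set.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)             # stand-in data for this sketch
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Mean 3-fold CV accuracy for each candidate k, using the training set only
k_scores = pd.Series({
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                       X_train, y_train, cv=3, scoring='accuracy').mean()
    for k in range(1, 31)
})

best_k = int(k_scores.idxmax())               # first k with the highest mean accuracy

# Refit on all of the training data, then evaluate once on the held-out test set
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_knn.score(X_test, y_test)

print(best_k, k_scores.max(), test_accuracy)
```

Keeping the test set out of the search is what makes the final accuracy an honest estimate.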

### Parameter Tuning with GridSearchCV and KNN

In [None]:
from sklearn.model_selection import GridSearchCV

# create a parameter grid: map the parameter names to the values that should be searched
# Grid search tries every combination of the listed parameter values

param_grid = {'n_neighbors': range(1, 31)}

grid = GridSearchCV(KNeighborsClassifier(), 
                    param_grid, 
                    cv=3, 
                    scoring='accuracy')

grid = grid.fit(X_train, y_train)

# extensive results provided
grid.cv_results_
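
`cv_results_` is exhaustive, but the fitted search object also exposes the winner directly. A short sketch on the bundled iris data (a stand-in for your own training and test sets):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)             # stand-in data for this sketch
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

grid = GridSearchCV(KNeighborsClassifier(),
                    {'n_neighbors': range(1, 31)},
                    cv=3,
                    scoring='accuracy')
grid.fit(X_train, y_train)

best_k = grid.best_params_['n_neighbors']     # the winning parameter value
best_cv_accuracy = grid.best_score_           # its mean cross-validated accuracy

# With the default refit=True, the best model is refit on all of X_train,
# so the search object can score and predict directly
test_accuracy = grid.score(X_test, y_test)

print(best_k, best_cv_accuracy, test_accuracy)
```

`grid.best_estimator_` holds the refit model itself if you need to pass it elsewhere.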

# Dimensionality Reduction

## SVD

In [4]:
from sklearn.decomposition import TruncatedSVD

In [None]:
svd = TruncatedSVD(n_components=3)
svd_out = svd.fit_transform(user_item_mat)
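
A self-contained sketch of the same call, assuming a small random ratings matrix in place of `user_item_mat` (the matrix and the names `user_item_demo`, `svd_demo`, `user_factors` are invented for this example), showing the shapes that come out:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(5)
# Hypothetical user-item matrix: 20 users x 10 items, ratings from 0 to 5
user_item_demo = rng.integers(0, 6, size=(20, 10)).astype(float)

svd_demo = TruncatedSVD(n_components=3, random_state=0)
user_factors = svd_demo.fit_transform(user_item_demo)

print(user_factors.shape)                    # one 3-dimensional row per user
print(svd_demo.components_.shape)            # one row per component, one column per item
print(svd_demo.explained_variance_ratio_)    # share of variance kept by each component
```

Summing `explained_variance_ratio_` gives a rough sense of how much of the original matrix survives the reduction to three components.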

# Clustering