# Lab1 - Scikit-learn
Author: Christopher DiMattia

## 1. Introduction

The goal of this lab is to become familiar with the scikit-learn library.

You will practice loading example datasets, perform classification and regression with linear scikit-learn models, and investigate the effects of reducing the number of features (columns in X) and the number of samples (rows in X and y).


In [81]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Classification

Using yellowbrick spam - classification  
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

The goal is to investigate `LogisticRegression(max_iter=2000)` and effects of reducing the number of features and number of samples on classification performance.

### 2.1 Implement convenience function

In [82]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def get_classifier_accuracy(model, X, y):
    '''Calculate train and validation accuracy of classifier (model)
        
        Splits feature matrix X and target vector y 
        with sklearn train_test_split() and random_state=956.
        
        model (sklearn classifier): Classifier to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        
        returns: training accuracy, validation accuracy
    
    '''

    # split data into training/validation sets
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=956)

    # fit model on the training split
    model.fit(X_train, y_train)

    # predict on training data
    y_train_predict = model.predict(X_train)
    # compare predictions to training targets to get training accuracy
    accuracy_train = accuracy_score(y_train, y_train_predict)

    # predict on validation data
    y_val_predict = model.predict(X_val)
    # compare predictions to validation targets to get validation accuracy
    accuracy_val = accuracy_score(y_val, y_val_predict)

    # return training and validation accuracies
    return accuracy_train, accuracy_val

### 2.2 Load data

Using the yellowbrick function `load_spam()`, load the spam data set into feature matrix `X` and target vector `y`.

Print size and type of `X` and `y`.


In [83]:
from yellowbrick.datasets.loaders import load_spam

#I'm not sure if you wanted the actual size function or the dimensions so I put both.  I also don't know what is meant by type so I simply stated
# the data types?  I assume type is not referring to classification vs regression as this section is clearly classification.
X, y = load_spam()
print("X size: " + str(X.size) + ".  X types: float & int")
print("y size: " + str(y.size) + ".  Y type is boolean (spam or not spam represented as 0 or 1)")
print("X dimensions: " + str(X.shape))
print("y dimensions: " + str(y.shape))

X size: 262200.  X types: float & int
y size: 4600.  Y type is boolean (spam or not spam represented as 0 or 1)
X dimensions: (4600, 57)
y dimensions: (4600,)
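One reading of "size and type" is the shape of each object plus its Python type and column dtypes. The sketch below shows how those could be printed directly rather than typed by hand. It uses a small made-up DataFrame (`X_demo`, `y_demo` and the column names are hypothetical stand-ins, not the spam data), so it runs without downloading anything.

```python
import pandas as pd

# Hypothetical stand-in for the spam data: 5 rows, 3 feature columns.
X_demo = pd.DataFrame({
    "word_freq_a": [0.0, 0.21, 0.06, 0.0, 0.0],
    "char_freq_b": [0.778, 0.372, 0.276, 0.137, 0.135],
    "capital_run_total": [278, 1028, 2259, 191, 191],
})
y_demo = pd.Series([1, 1, 1, 0, 0], name="is_spam")

# Container type, (rows, columns) shape, and total element count.
for name, obj in [("X_demo", X_demo), ("y_demo", y_demo)]:
    print(f"{name}: type={type(obj).__name__}, shape={obj.shape}, size={obj.size}")

# Per-column dtypes of the feature matrix and the dtype of the target.
print(X_demo.dtypes)
print(y_demo.dtype)
```

Note that `.size` is rows times columns (here 15), which is why `X.size` above reports 262200 for a 4600 x 57 matrix, while `.shape` gives the dimensions.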


Using the sklearn function `train_test_split()` prepare a feature matrix `X_small` and target vector `y_small` that contain only **1%** of the rows. Use `random_state=174`.

Print size and type of `X_small` and `y_small`.

In [84]:
X_small, X_val, y_small, y_val = train_test_split(X, y, random_state=174, train_size=0.01)
print("X small size: " + str(X_small.size))
print("y small size: " + str(y_small.size))

X small size: 2622
y small size: 46
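To see how few rows the model actually trains on after the 1% subsample, the following sketch repeats both splits on synthetic data with the same shape as the spam data. The arrays `X_demo` and `y_demo` are random stand-ins, and the second split assumes the convenience function's default 25% validation share.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the spam data (4600 x 57).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4600, 57))
y_demo = rng.integers(0, 2, size=4600)

# Outer split: keep 1% of the rows, as in the cell above.
X_small, X_rest, y_small, y_rest = train_test_split(
    X_demo, y_demo, train_size=0.01, random_state=174
)

# Inner split: what the convenience function does with the default
# test_size of 0.25.
X_tr, X_va, y_tr, y_va = train_test_split(X_small, y_small, random_state=956)

print("1% subsample:", X_small.shape)
print("training rows:", X_tr.shape[0], "validation rows:", X_va.shape[0])
```

Printing `.shape` instead of `.size` makes the row and column counts explicit, which is usually what "size" is meant to convey for a feature matrix.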


### 2.3 Train and evaluate models

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
4. Call your convenience function `get_classifier_accuracy()` using 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`
5. Add the data size, training and validation accuracy for each call to the `results` DataFrame
6. Print `results`

In [85]:
from sklearn.linear_model import LogisticRegression

#create model
logRegression = LogisticRegression(max_iter=2000)

#get accuracies using the convenience function
X_y = get_classifier_accuracy(logRegression,X,y)
X_y_2col = get_classifier_accuracy(logRegression,X.iloc[:,0:2],y)
X_y_small = get_classifier_accuracy(logRegression,X_small,y_small)

#add labels for each datasize and include difference in accuracy between validation and training to help answer next question
X_y = ("X & y"),X_y[0],X_y[1],X_y[1]-X_y[0]
X_y_2col = ("X & y w/ 2 features"),X_y_2col[0],X_y_2col[1],X_y_2col[1]-X_y_2col[0]
X_y_small = ("X & y, small"),X_y_small[0],X_y_small[1],X_y_small[1]-X_y_small[0]

#list of tuples to input into dataframe
data = [X_y,X_y_2col,X_y_small]

#create dataframe and fill with results
results = pd.DataFrame(data, columns=["DataSize","Training Accuracy","Validation Accuracy","Val Acc minus Train Acc"])
#print/show results
results

| | DataSize | Training Accuracy | Validation Accuracy | Val Acc minus Train Acc |
|---|---|---|---|---|
| 0 | X & y | 0.935072 | 0.917391 | -0.017681 |
| 1 | X & y w/ 2 features | 0.608986 | 0.613043 | 0.004058 |
| 2 | X & y, small | 0.941176 | 0.75 | -0.191176 |


### 2.4 Questions
1. What is the validation accuracy using all data? What is the difference between training and validation accuracy?
2. How does the validation accuracy and difference between training and validation change when only two columns are used? Provide values.
3. How does the validation accuracy and difference between training and validation change when only 1% of the rows are used? Provide values.

Answer
1. Validation accuracy with all data: 91.7%. Training accuracy is 93.5%, which is 1.8 percentage points higher.
2. The validation accuracy drops drastically to 61.3% (a 30.4 percentage point drop), but the difference between training and validation accuracy stays small at 0.4 percentage points, with validation (61.3%) slightly above training (60.9%).
3. The training accuracy stays very similar at 94.1% (only a 0.6 percentage point increase), but the validation accuracy drops to 75.0% (a 16.7 percentage point drop). The difference between training and validation accuracy therefore grows drastically to 19.1 percentage points, with training accuracy greater than validation accuracy.
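The small-sample accuracies in the results table are coarse because they are ratios of very small counts. The sketch below checks which counts reproduce the reported values, assuming the 46 rows were split into 34 training and 12 validation rows by the default `test_size=0.25` (the helper `correct_count` is introduced here only for illustration).

```python
# Assumed split of the 46 small-sample rows with test_size=0.25.
n_train, n_val = 34, 12

def correct_count(accuracy, n):
    """Number of correct predictions implied by an accuracy over n samples."""
    return round(accuracy * n)

# Accuracies copied from the "X & y, small" row of the results table.
train_correct = correct_count(0.941176, n_train)
val_correct = correct_count(0.75, n_val)

print(f"training: {train_correct} of {n_train} correct")
print(f"validation: {val_correct} of {n_val} correct")

# With so few validation rows, one extra error moves the score a lot.
step = 1 / n_val
print(f"one validation sample is worth {step:.1%} accuracy")
```

Under that assumption, each validation row is worth several percentage points of accuracy, so the small-sample validation score should be read as a rough estimate.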

See above

## 3. Regression

Using yellowbrick energy - regression  
https://www.scikit-yb.org/en/latest/api/datasets/energy.html

The goal is to investigate `LinearRegression()` and effects of reducing the number of features and number of samples on regression performance.

### 3.1 Implement convenience function

In [86]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def get_regressor_mse(model, X, y):
    '''Calculate train and validation mean-squared error (mse) of regressor (model)
        
        Splits feature matrix X and target vector y 
        with sklearn train_test_split() and random_state=956.
        
        model (sklearn regressor): Regressor to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        
        returns: training mse, validation mse
    
    '''
    # same as above
    # split data into training/validation sets
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=956)

    # fit model on the training split
    model.fit(X_train, y_train)

    # predict on training data
    y_train_predict = model.predict(X_train)
    # same as classification, except regression is scored with mean-squared error
    mse_train = mean_squared_error(y_train, y_train_predict)

    # predict on validation data
    y_val_predict = model.predict(X_val)
    # compare predictions to validation targets to get validation MSE
    mse_val = mean_squared_error(y_val, y_val_predict)

    # return training and validation MSEs
    return mse_train, mse_val
   

    

### 3.2 Load data

Using the yellowbrick function `load_energy()`, load the energy data set into feature matrix `X` and target vector `y`.

Print dimensions and type of `X` and `y`.

In [87]:
from yellowbrick.datasets.loaders import load_energy

#Similar to above I assume the type refers to the data type of the feature matrix and target vector
X, y = load_energy()
print("X dimensions: " + str(X.shape) + ".  X types: real & int")
print("y dimensions: " + str(y.shape) + ".  y type is float")

X dimensions: (768, 8).  X types: real & int
y dimensions: (768,).  y type is float


Using the sklearn function `train_test_split()` prepare a feature matrix `X_small` and target vector `y_small` that contain only **1%** of the rows. Use `random_state=174`.

Print size and type of `X_small` and `y_small`.

In [88]:
#Same as above.  Split the data into training and validation sets
X_small, X_val, y_small, y_val = train_test_split(X, y, random_state=174, train_size=0.01)
print("X dimensions: " + str(X_small.shape) + ".  X types: real & int")
print("y dimensions: " + str(y_small.shape) + ".  y type is float")

X dimensions: (7, 8).  X types: real & int
y dimensions: (7,).  y type is float


### 3.3 Train and evaluate models

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Create a pandas DataFrame `results` with columns: Data size, training MSE, validation MSE
4. Call your convenience function `get_regressor_mse()` using 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`
5. Add the data size, training and validation MSE for each call to the `results` DataFrame
6. Print `results`

In [80]:
from sklearn.linear_model import LinearRegression

#create model
linearRegression = LinearRegression()

#get MSEs using the convenience function
X_y = get_regressor_mse(linearRegression,X,y)
X_y_2col = get_regressor_mse(linearRegression,X.iloc[:,0:2],y)
X_y_small = get_regressor_mse(linearRegression,X_small,y_small)

#add labels for each datasize and include difference in MSE between validation and training to help answer next question
X_y = ("X & y"),X_y[0],X_y[1],X_y[1]-X_y[0]
X_y_2col = ("X & y w/ 2 features"),X_y_2col[0],X_y_2col[1],X_y_2col[1]-X_y_2col[0]
X_y_small = ("X & y, small"),X_y_small[0],X_y_small[1],X_y_small[1]-X_y_small[0]

#list of tuples to input into dataframe
data = [X_y,X_y_2col,X_y_small]

#create dataframe and fill with results
results = pd.DataFrame(data, columns=["DataSize","Training MSE","Validation MSE","Val MSE minus Train MSE"])
#print/show results
results


| | DataSize | Training MSE | Validation MSE | Val MSE minus Train MSE |
|---|---|---|---|---|
| 0 | X & y | 7.981975 | 10.292306 | 2.310331 |
| 1 | X & y w/ 2 features | 53.60043 | 46.410426 | -7.190004 |
| 2 | X & y, small | 2.284541e-28 | 69.977449 | 69.977449 |
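MSE is in squared units of the target, which can make the magnitudes hard to judge. As a small aid to interpretation, the sketch below converts the validation MSE values from the table into root-mean-squared error (RMSE), which is in the same units as `y`. The dictionary names are introduced here for illustration only.

```python
import math

# Validation MSE values copied from the results table.
validation_mse = {
    "X & y": 10.292306,
    "X & y w/ 2 features": 46.410426,
    "X & y, small": 69.977449,
}

# RMSE is the square root of MSE, so it shares the units of the target.
validation_rmse = {label: math.sqrt(mse) for label, mse in validation_mse.items()}

for label, rmse in validation_rmse.items():
    print(f"{label}: validation RMSE = {rmse:.2f}")
```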


### 3.4 Questions
1. What is the validation MSE using all data? What is the difference between training and validation MSE?
1. How does the validation MSE and difference between training and validation change when only two columns are used? Provide values.
1. How does the validation MSE and difference between training and validation change when only 1% of the rows are used? Provide values.

*YOUR ANSWERS HERE*
1. The MSE for all data is 8.0 for training and 10.3 for validation. The difference between the two is 2.3 (validation minus training).
2. When only two columns are used, both errors get much worse: the training MSE rises from 8.0 to 53.6 and the validation MSE rises from 10.3 to 46.4. The difference changes sign to -7.2, so the validation MSE is now lower than the training MSE.
3. When only 1% of the rows are used, the training MSE improves from 8.0 to effectively 0.0 (2.3e-28), while the validation MSE greatly worsens from 10.3 to 70.0. The difference grows from 2.3 to 70.0.
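A training MSE near machine precision is what a linear model produces when it has at least as many free parameters as training rows. The sketch below reproduces that effect on synthetic data, assuming 5 training rows and 8 features (the 7-row `X_small` split 75/25). The data, `true_coef`, and `make_data` are made up for illustration and are not the energy data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic data: 8 features, target is a noisy linear function of them.
true_coef = rng.normal(size=8)

def make_data(n_rows):
    X = rng.normal(size=(n_rows, 8))
    y = X @ true_coef + rng.normal(scale=1.0, size=n_rows)
    return X, y

X_train, y_train = make_data(5)    # fewer rows than features
X_val, y_val = make_data(200)      # fresh rows from the same process

model = LinearRegression().fit(X_train, y_train)
mse_train = mean_squared_error(y_train, model.predict(X_train))
mse_val = mean_squared_error(y_val, model.predict(X_val))

print(f"training MSE:   {mse_train:.3e}")
print(f"validation MSE: {mse_val:.3f}")
```

The model passes through every training point, noise included, so the training error says nothing about how well it generalizes.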

See above

## 4. Observations/Interpretation

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*
Patterns:
1. More features clearly result in better performance (compare 92% vs 61% validation accuracy for classification and 10.3 vs 46.4 validation MSE for regression).
2. Larger data sets also clearly result in better validation performance (compare 92% vs 75% for classification and 10.3 vs 70.0 MSE for regression).
3. Smaller data sets still retain high training accuracy, because even with a small amount of data a model can fit those few points very well. The regression case is the extreme example: the 7-row `X_small` leaves only 5 training rows for 8 features after the 75/25 split, so the linear model fits them exactly (training MSE of 2.3e-28). In comparison, removing most of the features results in poor training performance (60.9% accuracy and 53.6 MSE). This makes sense: unless the two remaining features happen to be the most predictive ones, they carry only a small part of the signal, and in any complex data set two features are unlikely to capture most of it.
4. Another pattern is that training and validation performance stay close to each other when features are removed (60.9% vs 61.3% accuracy). Again this makes sense: removing features makes the fit "fuzzy" but does not misrepresent the data, whereas removing rows can. By taking only 1% of the available data, sampling bias and the influence of outliers greatly increase, which is why the gap between training and validation performance is so much larger for the small set (94.1% vs 75.0% accuracy) than for the full set (93.5% vs 91.7%).
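The relationship between sample size and the training/validation gap can be examined more systematically with a learning curve. The following is a minimal sketch using scikit-learn's `learning_curve` on synthetic data from `make_classification`. The data set and the chosen training-size fractions are assumptions for illustration, not the spam data or the lab's splits.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic binary classification data (a stand-in, not the spam data).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5, random_state=0
)

# Fit the model on growing fractions of the training folds (5-fold CV).
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000),
    X_demo,
    y_demo,
    train_sizes=[0.02, 0.1, 0.5, 1.0],
    cv=5,
    shuffle=True,
    random_state=0,
)

# Average over the folds and report the train/validation gap per size.
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, mean_train, mean_val):
    print(f"n_train={int(n):5d}  train={tr:.3f}  val={va:.3f}  gap={tr - va:+.3f}")
```

Averaging over several folds also reduces the dependence on a single random split, which matters most at the smallest training sizes.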


## 5. Reflection
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*
Likes
1. I greatly liked the ease with which I could create models and get results from them, given how long the actual mathematical implementation is. I particularly like that sklearn is set up so that I can swap out models with ease to compare different ML model fits.
2. I liked having a large dataset that was non-trivial. While trivial datasets are excellent for learning, it's a nice change of pace to use a larger amount of data. It made me realize that interpreting results will be far more challenging than with simple assignments, and while this assignment was not heavily focused on interpretation, it does make it clear that interpreting results with many features will pose a challenge.

Dislikes
1. Not a dislike about the assignment in particular, but due to how other assignments lined up this week I did not have a chance to test other models for comparison. I want to do more comparisons between different types of models to get a better handle on the advantages and disadvantages of each.
2. While I liked the ease with which the models are implemented, I do think it would be beneficial to create a model from scratch so as to better appreciate the math behind it. I also think it would be interesting to do a "speed" comparison between these highly optimized models and a layman's implementation.
3. I didn't get a sense for the weights that were applied to each feature.  Had I had more time I would have liked to print those and try to see if I could identify or at least get a grasp of what features most impact the results.
4. I'm not sure it would be useful, but perhaps more plotting of results would be interesting for interpretation (to see how useful or useless it is). That said, I can do this on my own.
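On the point about feature weights: a fitted scikit-learn linear model exposes its weights through the `coef_` attribute. The sketch below shows one way they could be listed and ranked, using synthetic data and hypothetical feature names (`feature_0`, ...) rather than the spam data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with hypothetical feature names.
X_arr, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

model = LogisticRegression(max_iter=2000).fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary problem; take the single row.
weights = pd.Series(model.coef_[0], index=feature_names)

# Rank features by absolute weight, largest first.
ranked = weights.reindex(weights.abs().sort_values(ascending=False).index)
print(ranked)
```

Weight magnitudes are only comparable across features when the features are on similar scales, so standardizing the columns first would be a reasonable step before reading much into the ranking.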

Interesting, Confusing, etc
1.  I found the assignment fairly straightforward with little confusion or challenge. The most difficult part was the interpretation, as that required the most critical thought, and I'm still not sure I captured all the main theoretical points. I also found it very interesting and motivating that the accuracy for the spam data was so high (about 92% validation accuracy with all data) given how complex natural language is, especially given that these are probably the simplest ML techniques.