# Lab1 - Scikit-learn
Author: Christopher DiMattia

## 1. Introduction

The goal of this lab is to become familiar with the scikit-learn library.

You will practice loading example datasets, performing classification and regression with linear scikit-learn models, and investigating the effects of reducing the number of features (columns in X) and the number of samples (rows in X and y).


In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Classification

Using yellowbrick spam - classification  
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

The goal is to investigate `LogisticRegression(max_iter=2000)` and the effects of reducing the number of features and the number of samples on classification performance.

### 2.1 Implement convenience function

In [13]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def get_classifier_accuracy(model, X, y):
    '''Calculate train and validation accuracy of classifier (model)
        
        Splits feature matrix X and target vector y 
        with sklearn train_test_split() and random_state=956.
        
        model (sklearn classifier): Classifier to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        
        returns: training accuracy, validation accuracy
    
    '''

    #split data into training/validation
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=956)

    #fit model
    model.fit(X_train,y_train)

    #predict on training data
    y_train_predict = model.predict(X_train)
    #compare predicted to training data to get training accuracy
    accuracy_train = accuracy_score(y_train, y_train_predict)

    #predict on validation data
    y_val_predict = model.predict(X_val)
    #compare predicted to validation data to get validation accuracy
    accuracy_val = accuracy_score(y_val,y_val_predict)

    #return training and validation accuracies
    return(accuracy_train, accuracy_val)
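
Because the function calls `train_test_split()` without `test_size` or `train_size`, scikit-learn's default split of 75% training and 25% validation rows applies. The following is a minimal sketch of the resulting row counts. The arrays `X_demo` and `y_demo` are hypothetical placeholders, not the lab data; only the row count of 4600 is taken from the spam data reported in section 2.2.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (an assumption): 4600 rows, 3 arbitrary feature columns
X_demo = np.arange(4600 * 3).reshape(4600, 3)
y_demo = np.arange(4600) % 2

# No test_size/train_size given, so the default test_size=0.25 is used
X_train, X_val, y_train, y_val = train_test_split(X_demo, y_demo, random_state=956)
print(X_train.shape, X_val.shape)

# Repeating the call with the same random_state reproduces the same split
X_train_again, _, _, _ = train_test_split(X_demo, y_demo, random_state=956)
print(np.array_equal(X_train, X_train_again))
```

The same default applies when the function receives a small subset: the subset itself is split again into 75% training and 25% validation rows.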

### 2.2 Load data

Using the yellowbrick function `load_spam()`, load the spam data set into feature matrix `X` and target vector `y`.

Print size and type of `X` and `y`.


In [14]:
from yellowbrick.datasets.loaders import load_spam

X, y = load_spam()
print("X size: " + str(X.size))
print("y size: " + str(y.size))

X size: 262200
y size: 4600


Using the sklearn function `train_test_split()`, prepare a feature matrix `X_small` and target vector `y_small` that contain only **1%** of the rows. Use `random_state=174`.

Print size and type of `X_small` and `y_small`.

In [21]:
X_small, X_val, y_small, y_val = train_test_split(X, y, random_state=174, train_size=0.01)
print("X small size: " + str(X_small.size))
print("y small size: " + str(y_small.size))

X small size: 2622
y small size: 46
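
Note that `.size` counts all elements (rows times columns), which is why `X` reports 262200 = 4600 × 57 while `y` reports 4600. A short sketch of how `.shape` and `type()` give the dimensions and type directly is shown below. It uses zero-filled placeholder objects (`X_demo`, `y_demo` are hypothetical names) with the same dimensions as the output above, and assumes the features are held in a pandas DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Zero-filled placeholders (an assumption) with the same dimensions as the spam data
X_demo = pd.DataFrame(np.zeros((4600, 57)))
y_demo = pd.Series(np.zeros(4600, dtype=int))

# .size is rows * columns, .shape keeps the two dimensions separate
print("X size:", X_demo.size, "shape:", X_demo.shape, "type:", type(X_demo).__name__)
print("y size:", y_demo.size, "shape:", y_demo.shape, "type:", type(y_demo).__name__)

# train_size=0.01 keeps 1% of the rows; the number of columns is unchanged
X_small_demo, _, y_small_demo, _ = train_test_split(
    X_demo, y_demo, train_size=0.01, random_state=174
)
print("X small shape:", X_small_demo.shape)
print("y small shape:", y_small_demo.shape)
```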


### 2.3 Train and evaluate models

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
4. Call your convenience function `get_classifier_accuracy()` using 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`
5. Add the data size, training and validation accuracy for each call to the `results` DataFrame
6. Print `results`

In [46]:
from sklearn.linear_model import LogisticRegression

#create model
logRegression = LogisticRegression(max_iter=2000)

#get accuracies using the convenience function
X_y = get_classifier_accuracy(logRegression,X,y)
X_y_2col = get_classifier_accuracy(logRegression,X.iloc[:,0:2],y)
X_y_small = get_classifier_accuracy(logRegression,X_small,y_small)

#add labels for each datasize and include difference in accuracy between validation and training to help answer next question
X_y = ("X & y"),X_y[0],X_y[1],X_y[1]-X_y[0]
X_y_2col = ("X & y w/ 2 features"),X_y_2col[0],X_y_2col[1],X_y_2col[1]-X_y_2col[0]
X_y_small = ("X & y, small"),X_y_small[0],X_y_small[1],X_y_small[1]-X_y_small[0]

#list of tuples to input into dataframe
data = [X_y,X_y_2col,X_y_small]

#create dataframe and fill with results
results = pd.DataFrame(data, columns=["DataSize","Training Accuracy","Validation Accuracy","Val Acc minus Train Acc"])
#print/show results
results

| | DataSize | Training Accuracy | Validation Accuracy | Val Acc minus Train Acc |
|---|---|---|---|---|
| 0 | X & y | 0.935072 | 0.917391 | -0.017681 |
| 1 | X & y w/ 2 features | 0.608986 | 0.613043 | 0.004058 |
| 2 | X & y, small | 0.941176 | 0.75 | -0.191176 |


### 2.4 Questions
1. What is the validation accuracy using all data? What is the difference between training and validation accuracy?
2. How do the validation accuracy and the difference between training and validation accuracy change when only two columns are used? Provide values.
3. How do the validation accuracy and the difference between training and validation accuracy change when only 1% of the rows are used? Provide values.

Answer
1. Validation accuracy with all data: 91.7%. Training accuracy (93.5%) is 1.8 percentage points higher.
2. The validation accuracy drops drastically to 61.3% (a 30.4 percentage point drop), but the difference between training and validation accuracy is still small at 0.4 percentage points (training: 60.9%, validation: 61.3%).
3. The validation accuracy drops to 75.0% (a 16.7 percentage point drop), while the training accuracy stays very similar at 94.1% (only a 0.6 percentage point increase). The difference between training and validation accuracy increases drastically to 19.1 percentage points, with training accuracy being greater than validation accuracy.
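
The percentage-point figures can be recomputed from the values in the `results` table. The sketch below copies the reported accuracies by hand into a plain dictionary (the names `reported` and `summary` are illustrative) and derives, for each row, the gap between training and validation accuracy and the change in validation accuracy relative to the full data.

```python
# (training accuracy, validation accuracy) copied from the results table
reported = {
    "X & y": (0.935072, 0.917391),
    "X & y w/ 2 features": (0.608986, 0.613043),
    "X & y, small": (0.941176, 0.75),
}

baseline_val = reported["X & y"][1]
summary = {}
for label, (train_acc, val_acc) in reported.items():
    # Positive gap means training accuracy exceeds validation accuracy
    gap_points = round((train_acc - val_acc) * 100, 1)
    # Change in validation accuracy relative to using all rows and columns
    val_change_points = round((val_acc - baseline_val) * 100, 1)
    summary[label] = (gap_points, val_change_points)
    print(f"{label}: gap = {gap_points} points, validation change = {val_change_points} points")
```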


## 3. Regression

Using yellowbrick energy - regression  
https://www.scikit-yb.org/en/latest/api/datasets/energy.html

The goal is to investigate `LinearRegression()` and the effects of reducing the number of features and the number of samples on regression performance.

### 3.1 Implement convenience function

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def get_regressor_mse(model, X, y):
    '''Calculate train and validation mean-squared error (mse) of regressor (model)
        
        Splits feature matrix X and target vector y 
        with sklearn train_test_split() and random_state=956.
        
        model (sklearn regressor): Regressor to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        
        returns: training mse, validation mse
    
    '''
    #same as above
    #split data into training/validation
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=956)

    #fit model
    model.fit(X_train,y_train)

    #predict on training data
    y_train_predict = model.predict(X_train)
    #regression rather than classification, so compare predictions to training targets with mean squared error
    mse_train = mean_squared_error(y_train, y_train_predict)

    #predict on validation data
    y_val_predict = model.predict(X_val)
    #compare predicted to validation data to get mse
    mse_val = mean_squared_error(y_val,y_val_predict)

    #return training and validation MSEs
    return(mse_train, mse_val)
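
For reference, mean squared error is the average of the squared differences between targets and predictions, so lower values are better and the unit is the square of the target's unit. The sketch below uses made-up numbers (not lab data) to check `mean_squared_error` against a manual computation.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Manual MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# Same quantity computed by scikit-learn
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```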
   

    

### 3.2 Load data

Using the yellowbrick function `load_energy()`, load the energy data set into feature matrix `X` and target vector `y`.

Print dimensions and type of `X` and `y`.

In [None]:
# TODO: ADD YOUR CODE HERE


Using the sklearn function `train_test_split()`, prepare a feature matrix `X_small` and target vector `y_small` that contain only **1%** of the rows. Use `random_state=174`.

Print size and type of `X_small` and `y_small`.

In [None]:
# TODO: ADD YOUR CODE HERE


### 3.3 Train and evaluate models

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Create a pandas DataFrame `results` with columns: Data size, training MSE, validation MSE
4. Call your convenience function `get_regressor_mse()` using 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`
5. Add the data size, training and validation MSE for each call to the `results` DataFrame
6. Print `results`
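
The steps above can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: `make_regression` stands in for the energy data, the sizes (800 rows, 8 columns) are arbitrary, and `split_fit_mse` is a hypothetical stand-in for `get_regressor_mse()` so that the block runs on its own.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def split_fit_mse(model, X, y):
    """Hypothetical stand-in for get_regressor_mse(): split, fit, score."""
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=956)
    model.fit(X_train, y_train)
    mse_train = mean_squared_error(y_train, model.predict(X_train))
    mse_val = mean_squared_error(y_val, model.predict(X_val))
    return mse_train, mse_val

# Synthetic stand-in for the energy data (sizes are assumptions)
X_syn, y_syn = make_regression(n_samples=800, n_features=8, noise=10.0, random_state=0)

# Keep only 1% of the rows
X_syn_small, _, y_syn_small, _ = train_test_split(
    X_syn, y_syn, train_size=0.01, random_state=174
)

model = LinearRegression()
rows = []
for label, X_part, y_part in [
    ("all rows, all columns", X_syn, y_syn),
    ("all rows, first two columns", X_syn[:, :2], y_syn),
    ("1% of rows, all columns", X_syn_small, y_syn_small),
]:
    mse_train, mse_val = split_fit_mse(model, X_part, y_part)
    rows.append((label, mse_train, mse_val))

results_demo = pd.DataFrame(rows, columns=["Data size", "Training MSE", "Validation MSE"])
print(results_demo)
```

When `X` is a pandas DataFrame rather than a NumPy array, the first two columns would be selected with `X.iloc[:, 0:2]`, as in the classification section.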

In [None]:
# TODO: ADD YOUR CODE HERE


### 3.4 Questions
1. What is the validation MSE using all data? What is the difference between training and validation MSE?
1. How do the validation MSE and the difference between training and validation MSE change when only two columns are used? Provide values.
1. How do the validation MSE and the difference between training and validation MSE change when only 1% of the rows are used? Provide values.

*YOUR ANSWERS HERE*



## 4. Observations/Interpretation

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*



## 5. Reflection
Include a sentence or two about:
- what you liked or disliked,
- what you found interesting, confusing, challenging, or motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

