<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Lesson: Linear Regression - Train/Test Split

---

# Introduction

Let's practice train_test_split and k_folds cross validation.


### Here's What We Will Be Doing:

* Working with Boston housing data to predict the value of a home
* Create a test-train split of the data.
* Train each of your models on the training data.
* Evaluate each of the models on the test data.


**Then, we'll try k-folds cross validation.**

* Try a few different splits of data for the same models.
* Perform a k-fold cross-validation and use the cross-validation scores to compare your models. Did this change your rankings?


## Linear Regression Use Case

In this given task, you will be asked to model the median home price of various houses across U.S. Census tracts in the city of Boston. We are predicting a continuous, numeric output (price).

In [2]:
# Regular import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [6]:
# read in the dataset 
boston = pd.read_csv('data/boston_data.csv')

FileNotFoundError: [Errno 2] File data/boston_data.csv does not exist: 'data/boston_data.csv'

### 1. Clean Up Data and Perform Exporatory Data Analysis

Boston data is from scikit-learn, so it ought to be pretty clean, but we should always perform exploratory data analysis.

In [9]:
# Exploratory data analysis in a function

In [68]:
# EDA Function 
def eda(dataframe):
    print(f"top 2 rows:\n {dataframe.head(2)}\n ")
    print(f"dataframe types: {dataframe.info} \n ")
    print(f"missing values: {dataframe.isnull().sum()} \n")
    print(f"dataframe describe: {dataframe.describe()}")
    
    for item in dataframe:
        print(item)
        print(dataframe[item].value_counts(), '\n')
        
    return None

In [69]:
# Run the EDA function 
eda(boston)

top 2 rows:
       CRIM   ZN   INDUS   CHAS    NOX     RM   AGE     DIS  RAD  TAX  PTRATIO  \
0  0.00632  18.0    2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3   
1  0.02731   0.0    7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8   

   LSTAT  MEDV  
0   4.98  24.0  
1   9.14  21.6  
 
dataframe types: <bound method DataFrame.info of         CRIM   ZN   INDUS   CHAS    NOX     RM   AGE     DIS  RAD  TAX  \
0    0.00632  18.0    2.31     0  0.538  6.575  65.2  4.0900    1  296   
1    0.02731   0.0    7.07     0  0.469  6.421  78.9  4.9671    2  242   
2    0.02729   0.0    7.07     0  0.469  7.185  61.1  4.9671    2  242   
3    0.03237   0.0    2.18     0  0.458  6.998  45.8  6.0622    3  222   
4    0.06905   0.0    2.18     0  0.458  7.147  54.2  6.0622    3  222   
..       ...   ...     ...   ...    ...    ...   ...     ...  ...  ...   
501  0.06263   0.0   11.93     0  0.573  6.593  69.1  2.4786    1  273   
502  0.04527   0.0   11.93     0  0.573  6.120  7

## Using `scikit-learn` Linear Regression

### 2. Feature Selection - pick 3-4 predictors (i.e. CRIM, ZN, etc...) that you will use to predict our target variable, MEDV 


In [70]:
boston.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1    ZN      506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    int64  
 10  PTRATIO  506 non-null    float64
 11  LSTAT    506 non-null    float64
 12  MEDV     506 non-null    float64
dtypes: float64(10), int64(3)
memory usage: 51.5 KB


In [72]:
boston.columns

Index(['CRIM', ' ZN ', 'INDUS ', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'LSTAT', 'MEDV'],
      dtype='object')

In [1]:
# make the correlation matrix 
boston.corr()

NameError: name 'boston' is not defined

In [None]:
# create it in Seaborn


In [None]:
# Create X and y with your features

In [73]:
features = ['LSTAT', 'RM', 'PTRATIO']
X = boston[features]
y = boston['MEDV']

### 3. Try a 70/30 train/test split (70% of the data for training - 30% for testing)


In [None]:
## make your first train test split

In [None]:
## instantiate your Linear Regression model

In [None]:
## fit the model

In [None]:
## score the model

### 4. Make a function that accepts X, y, and a float for the % of the data you want to use for training in train_test_split. The function should instantiate, fit, and score the model. Return the score. 

In [36]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression

In [74]:
def tts(X, y, train_percent):
    '''
    Instantiate, fit, and score the model.

    '''
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_percent, random_state=42)
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    score = lr.score(X_test, y_test)
    return score
    

In [75]:
## Call your model to test it

In [76]:
tts(X, y, 0.7)

0.650904156861472

### 5. K-folds cross validation is a safer way to check performance than a single train_test_split. 

Recall that k-fold cross-validation creates a hold-out portion of your dataset for each iteration:

![](http://i.imgur.com/0PFrPXJ.png)

Update: (now k is 5 by default)

#### Perform k-folds cross validation, varying the number of folds from 5 through 10. Which number of folds scores best?

In [77]:
lr = LinearRegression()

In [78]:
len(X)

506

In [83]:
X.head()

Unnamed: 0,LSTAT,RM,PTRATIO
0,4.98,6.575,15.3
1,9.14,6.421,17.8
2,4.03,7.185,17.8
3,2.94,6.998,18.7
4,5.33,7.147,18.7


In [79]:
X_randomized = X.sample(frac=1)

In [81]:
X_randomized.head()

Unnamed: 0,LSTAT,RM,PTRATIO
423,23.29,6.103,20.2
414,36.98,4.519,20.2
255,9.25,5.876,16.4
274,3.53,6.758,17.6
319,12.73,6.113,18.4


In [80]:
len(X_randomized)

506

In [66]:
X_randomized.mean()

LSTAT      12.653063
RM          6.284634
PTRATIO    18.455534
dtype: float64

In [67]:
X.mean()

LSTAT      12.653063
RM          6.284634
PTRATIO    18.455534
dtype: float64

In [64]:
cross_val_score(lr, X_randomized, y)

array([-0.05017149, -0.05798199, -0.9982995 , -0.11376518, -2.43381843])

In [57]:
my_scores = [cross_val_score(lr, X_randomized, y, cv = i) for i in range(5, 11)]
my_scores

[array([-1.83134875e-03, -7.76356285e-02, -1.00273133e+00, -1.32586894e-01,
        -2.47019560e+00]),
 array([-0.05419168, -0.01400422, -1.00125488, -0.49140664, -0.28211465,
        -2.32625481]),
 array([-0.06707312, -0.12080609, -0.53634929, -1.09333212, -0.04538688,
        -0.56941207, -1.84488256]),
 array([-8.38299125e-02, -2.55989189e-03, -8.03980446e-02, -8.26655518e-01,
        -8.39011016e-01, -1.33953202e-02, -6.43583318e+00, -1.52255803e+00]),
 array([-0.09704557, -0.03056011, -0.04505547, -1.03753466, -1.15053257,
        -0.10439082, -0.03171531, -6.02524922, -1.20312061]),
 array([-0.15809222, -0.14352328, -3.49774316, -0.9855243 , -0.4809204 ,
        -1.31290272, -0.05747769, -0.23068777, -5.20955684, -0.91223138])]

In [48]:
list_scores = []

for i in range(5, 11):
    my_score = cross_val_score(lr, X, y, cv = i).mean()
    list_scores.append(my_score)

In [49]:
list_scores

[0.430002983179557,
 0.4719272797945379,
 0.4354041099486394,
 0.3406890399199286,
 0.36923864801176465,
 0.2098837291463725]

Which makes you feel more confident about your model performance - TTS or cross validation? Why?