# Lab 4: Random numbers, splitting data, evaluating model performance

- **Author:** Niall Keleher ([nkeleher@uw.edu](mailto:nkeleher@uw.edu))
- **Date:** 22 Jan 2018
- **Course:** INFX 574: Core Methods in Data Science

### Learning Objectives:
By the end of the lab, you will be able to:
* create dummy variables for use in regressions
* generate random numbers for use in randomization and train-test splits
* identify measures for evaluating regression performance

### Topics:
1. Qualitative/Categorical predictors
2. Generating random numbers 
3. Splitting data into training and test sets
4. Running regressions & generating predictions
5. Model performance

### References: 
* [Pandas - get_dummies()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html)
* [random library](https://docs.python.org/2/library/random.html)
* [scikit-learn train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html)
* [Introduction to Statistical Learning, Lab #5](http://www-bcf.usc.edu/~gareth/ISL/Chapter%205%20Lab.txt)

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
auto_df = pd.read_csv('Auto.csv')
auto_df = auto_df[auto_df.horsepower != '?'] # drop rows where horsepower is missing (coded as '?' in the file)

In [4]:
auto_df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


### 1. Qualitative/Categorical predictors - Generate dummy variables in Python

In [5]:
auto_df.cylinders.value_counts() #frequency table for cylinders

4    199
8    103
6     83
3      4
5      3
Name: cylinders, dtype: int64

In [6]:
pd.get_dummies(auto_df.cylinders).head() # one dummy column per distinct cylinder count
# The first five cars all have 8 cylinders, so head() only shows 1s in the 8 column

Unnamed: 0,3,4,5,6,8
0,0,0,0,0,1
1,0,0,0,0,1
2,0,0,0,0,1
3,0,0,0,0,1
4,0,0,0,0,1


In [7]:
cyl_dummies = pd.get_dummies(auto_df.cylinders, prefix='cyl') # prefix='cyl' gives the dummy columns readable names (cyl_3, cyl_4, ...)
cyl_dummies.head()

Unnamed: 0,cyl_3,cyl_4,cyl_5,cyl_6,cyl_8
0,0,0,0,0,1
1,0,0,0,0,1
2,0,0,0,0,1
3,0,0,0,0,1
4,0,0,0,0,1


In [8]:
auto_df2 = pd.concat([auto_df, cyl_dummies], axis=1) #add in the dummy columns

In [9]:
auto_df2.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,cyl_3,cyl_4,cyl_5,cyl_6,cyl_8
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,0,0,0,0,1
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320,0,0,0,0,1
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite,0,0,0,0,1
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst,0,0,0,0,1
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,0,0,0,0,1
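
When dummy columns feed a regression that also has an intercept, one level is usually dropped to serve as the baseline, because the full set of dummies sums to one in every row. A minimal sketch using the `drop_first` option of `pd.get_dummies`; the toy `cylinders` values are an assumption for illustration, not the Auto data:

```python
import pandas as pd

# Toy stand-in for auto_df.cylinders (hypothetical values)
toy = pd.DataFrame({'cylinders': [8, 4, 6, 4, 3]})

# All levels: one column per distinct value
full = pd.get_dummies(toy['cylinders'], prefix='cyl', dtype=int)

# drop_first=True omits the first level (cyl_3), which becomes the baseline
reduced = pd.get_dummies(toy['cylinders'], prefix='cyl', drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

With the baseline dropped, each remaining coefficient reads as the difference from 3-cylinder cars.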


### 2. Generating random numbers - randomizing treatment assignment

In [10]:
import random

In [11]:
random.random()  # Random float x, 0.0 <= x < 1.0

0.7110364670441853

In [12]:
random.uniform(1,100)  # Random float x, 1.0 <= x <= 100.0

1.3202230612193726

In [13]:
random.randint(1, 10)  # Integer from 1 to 10, endpoints included

3

In [14]:
random.sample([1, 2, 3, 4, 5],  3) # 3 items drawn without replacement, in random order

[4, 2, 3]

In [15]:
random.seed(47653)
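
Setting the seed makes every later draw repeatable, which is what lets someone else reproduce a randomization. A small sketch of that property, reusing the seed value from this lab (the variable names are made up for the example):

```python
import random

random.seed(47653)
first = [random.random() for _ in range(3)]

random.seed(47653)   # reset to the same seed
second = [random.random() for _ in range(3)]

# The two sequences are identical because they start from the same seed
print(first == second)
```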

In [16]:
raw_data = {'first_name': ['Niall', 'Josh', 'Li', 'Lavi', 'Jevin', 'Emma'],  
        'sex': ['male', 'male', 'female', 'male', 'male', 'female']}
df = pd.DataFrame(raw_data, columns = ['first_name', 'sex'])

In [17]:
df

Unnamed: 0,first_name,sex
0,Niall,male
1,Josh,male
2,Li,female
3,Lavi,male
4,Jevin,male
5,Emma,female


In [18]:
df['rand'] = df.apply(lambda row: random.random(), axis=1) # the lambda (anonymous function) draws a fresh random value for each row

In [19]:
df

Unnamed: 0,first_name,sex,rand
0,Niall,male,0.009981
1,Josh,male,0.897681
2,Li,female,0.804464
3,Lavi,male,0.147438
4,Jevin,male,0.942135
5,Emma,female,0.426891


In [20]:
df['treat'] = (df['rand']<.5) # treatment flag: True when the row's random draw is below .5

In [21]:
df

Unnamed: 0,first_name,sex,rand,treat
0,Niall,male,0.009981,True
1,Josh,male,0.897681,False
2,Li,female,0.804464,False
3,Lavi,male,0.147438,True
4,Jevin,male,0.942135,False
5,Emma,female,0.426891,True
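
The threshold rule gives each person an independent 50% chance of treatment, so the two groups are not guaranteed to be the same size. If equal groups matter, one option is to build the labels first and shuffle them. This is a sketch of that alternative, not the method used in the lab; `people` and `labels` are names invented for the example:

```python
import random
import pandas as pd

random.seed(47653)

people = pd.DataFrame({'first_name': ['Niall', 'Josh', 'Li', 'Lavi', 'Jevin', 'Emma']})

# Build exactly n/2 True labels and the rest False, then shuffle their order
n = len(people)
labels = [True] * (n // 2) + [False] * (n - n // 2)
random.shuffle(labels)

people['treat'] = labels
print(people['treat'].sum())  # number of treated rows
```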


### 3. Splitting data into training and test sets

In [22]:
auto_df['rand'] = auto_df.apply(lambda row: random.random(), axis=1) # same idea as the toy example, now applied to the auto data
auto_df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,rand
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,0.880276
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320,0.877639
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite,0.808728
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst,0.659385
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,0.98979


In [23]:
auto_df['train'] = (auto_df['rand']>.33) # select 2/3 of the data to train on (approximately)

In [24]:
len(auto_df)

392

In [25]:
len(auto_df[auto_df['train']]) # boolean indexing: a True/False Series inside [] keeps only the rows where train is True

277

In [26]:
auto_train = auto_df[auto_df['train']] #put them into their own frame
auto_train.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,rand,train
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,0.880276,True
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320,0.877639,True
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite,0.808728,True
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst,0.659385,True
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,0.98979,True
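
The matching test set is the complement of the training mask, which pandas writes with `~`. A minimal sketch on a toy frame (the frame and its values are stand-ins, not the Auto data):

```python
import random
import pandas as pd

random.seed(0)

# Toy frame standing in for auto_df (hypothetical values)
toy = pd.DataFrame({'mpg': [18.0, 15.0, 18.0, 16.0, 17.0, 24.0, 22.0, 26.0]})
toy['rand'] = [random.random() for _ in range(len(toy))]
toy['train'] = toy['rand'] > .33

toy_train = toy[toy['train']]    # rows where train is True
toy_test = toy[~toy['train']]    # ~ flips the mask: rows where train is False

print(len(toy_train), len(toy_test))
```

Every row lands in exactly one of the two frames, so their lengths add up to the original.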


#### Using scikit-learn

In [27]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20



In [28]:
X = auto_df['weight']

In [29]:
y = auto_df['mpg']

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) # shuffles the rows and holds out 33% as the test set; random_state fixes the shuffle

In [31]:
len(X_train)

262

In [32]:
len(y_train)

262

In [33]:
len(X_test)

130

In [34]:
len(y_test)

130
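
The sizes above (262 and 130 out of 392) are consistent with the test share being rounded up: 0.33 × 392 = 129.36. Below is a hand-rolled shuffle split with NumPy that shows the idea. It is a sketch, not scikit-learn's implementation, and the seed and variable names are assumptions:

```python
import math
import numpy as np

n_samples = 392          # rows in auto_df after dropping the '?' rows
test_size = 0.33

rng = np.random.default_rng(42)         # seeded generator; any seed would do
shuffled = rng.permutation(n_samples)   # row positions in random order

n_test = math.ceil(test_size * n_samples)  # round the test share up
test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]

print(len(train_idx), len(test_idx))
```

The position arrays can then select rows with `.iloc`, for example `auto_df.iloc[train_idx]`.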

### 4. Running regressions & generating predictions

In [35]:
auto_df.head(1)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,rand,train
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,0.880276,True


In [36]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error, r2_score

In [37]:
overfit_mod = smf.ols(formula='mpg ~ weight', data = auto_df) # regress mpg on weight
overfit_result = overfit_mod.fit()
print(overfit_result.summary())

# OLS is ordinary least squares regression: it picks the intercept and the
# coefficient on weight that minimize the sum of squared residuals.
# R-squared is the share of the variance in mpg that the model explains. It says
# nothing about direction; the sign of the weight coefficient gives that.

# Note: this model is fit on the entire dataset, with no train/test split,
# so its R-squared is an in-sample measure.

# TODO: plot all the models vs each other

                            OLS Regression Results                            
Dep. Variable:                    mpg   R-squared:                       0.693
Model:                            OLS   Adj. R-squared:                  0.692
Method:                 Least Squares   F-statistic:                     878.8
Date:                Tue, 30 Apr 2019   Prob (F-statistic):          6.02e-102
Time:                        11:14:04   Log-Likelihood:                -1130.0
No. Observations:                 392   AIC:                             2264.
Df Residuals:                     390   BIC:                             2272.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     46.2165      0.799     57.867      0.0

In [38]:
train_mod = smf.ols(formula='mpg ~ weight', data = auto_train)
train_result = train_mod.fit()
print(train_result.summary())


# The R-squared and intercept are close to the full-data fit, so the estimates look stable across samples.
# That alone does not show the model predicts well; for that, score it on the rows held out of training.

                            OLS Regression Results                            
Dep. Variable:                    mpg   R-squared:                       0.705
Model:                            OLS   Adj. R-squared:                  0.704
Method:                 Least Squares   F-statistic:                     657.9
Date:                Tue, 30 Apr 2019   Prob (F-statistic):           6.49e-75
Time:                        11:14:04   Log-Likelihood:                -799.34
No. Observations:                 277   AIC:                             1603.
Df Residuals:                     275   BIC:                             1610.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     47.0463      0.959     49.048      0.0
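
Both summaries report in-sample fit. Evaluating performance usually means fitting on the training rows and scoring predictions on the held-out rows, with mean squared error (MSE) and R-squared as two common measures. This is a minimal sketch using synthetic data and `np.polyfit` as a stand-in for the OLS fit; the numbers are invented and are not estimates from the Auto data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (not the Auto data): y falls roughly linearly with x
x = rng.uniform(1500, 5000, size=300)
y = 45.0 - 0.008 * x + rng.normal(0, 4, size=300)

# Rows are already in random order, so hold out the last third
x_train, x_test = x[:200], x[200:]
y_train, y_test = y[:200], y[200:]

# Fit a straight line on the training rows only
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Predict the held-out rows and score the predictions
y_pred = intercept + slope * x_test
mse = np.mean((y_test - y_pred) ** 2)
r2 = 1 - np.sum((y_test - y_pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)

print(round(mse, 2), round(r2, 3))
```

MSE is in squared units of the response, so its square root is often easier to interpret. Out-of-sample R-squared can fall below the in-sample value.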

### Exercise

#### Use scikit-learn to train a model to predict mpg using weight, horsepower, cylinders, displacement, acceleration, origin and year

Reference: http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares

In [39]:
from sklearn import linear_model

In [40]:
lin_mod = linear_model.LinearRegression()
auto_train.head()
# cylinders, year, and origin are categorical, so they need to be dummy coded

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,rand,train
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,0.880276,True
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320,0.877639,True
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite,0.808728,True
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst,0.659385,True
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,0.98979,True


In [41]:
def add_dummies(vec, df, pre):
    """Return df with dummy columns for vec appended, named with prefix pre."""
    vec_dummies = pd.get_dummies(vec, prefix=pre)
    return pd.concat([df, vec_dummies], axis=1)

# Candidate predictors: horsepower, cylinders, displacement, acceleration, origin, year
# Of these, cylinders and origin are categorical; year is treated as categorical here.
# horsepower and acceleration are continuous, so dummy coding them creates one
# column per distinct value, which is why the result below has 220 columns.

d = add_dummies(auto_df['horsepower'], auto_df, 'horse')
d = add_dummies(auto_df['cylinders'], d, 'cyl')
d = add_dummies(auto_df['acceleration'], d, 'acc')
d = add_dummies(auto_df['origin'], d, 'orig')
d = add_dummies(auto_df['year'], d, 'year')

d.columns



Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin', 'name', 'rand',
       ...
       'year_73', 'year_74', 'year_75', 'year_76', 'year_77', 'year_78',
       'year_79', 'year_80', 'year_81', 'year_82'],
      dtype='object', length=220)
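
To dummy code only the categorical columns and leave the continuous ones alone, `pd.get_dummies` can take the whole frame plus a `columns` list. A sketch on a toy frame (the values are made up, and the choice of which columns count as categorical is an assumption):

```python
import pandas as pd

# Toy frame with two continuous and two categorical columns (hypothetical values)
toy = pd.DataFrame({
    'mpg': [18.0, 15.0, 24.0, 27.0],
    'horsepower': [130, 165, 95, 88],
    'cylinders': [8, 8, 4, 4],
    'origin': [1, 1, 3, 3],
})

# Only the listed columns are replaced by dummies; the rest pass through unchanged
coded = pd.get_dummies(toy, columns=['cylinders', 'origin'],
                       prefix=['cyl', 'orig'], dtype=int)

print(list(coded.columns))
```

This keeps the column count manageable, since continuous predictors stay as single numeric columns.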

In [42]:
# The response variable is y = mpg
y = auto_df['mpg']
# Predictor matrix. Note: weight is named in the exercise but is not included here,
# and cylinders, origin and year enter as plain numbers rather than dummy codes.
X = auto_df[['horsepower','cylinders','displacement','acceleration','origin','year']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

lin_mod.fit(X_train, y_train)
print(lin_mod.coef_)

# Make predictions using the testing set
mpg_pred = lin_mod.predict(X_test)

# Actual and predicted mpg side by side
compare = pd.DataFrame({'actual': y_test, 'predicted': mpg_pred})
compare.head()
mean_squared_error(y_test, mpg_pred)  # argument order is (y_true, y_pred)

X_train.head()

# Known limitation: the model treats year, cylinders and origin as numeric values,
# so each gets a single slope instead of one effect per category.
# The remaining step is to rerun the regression on the dummy-coded columns
# in place of the raw categorical values.



[-0.08662745 -0.67122764 -0.01086368 -0.35671652  2.01384861  0.63673956]


Unnamed: 0,horsepower,cylinders,displacement,acceleration,origin,year
368,88,4,112.0,18.6,1,82
182,86,4,107.0,15.5,2,76
120,112,4,121.0,15.5,2,73
309,76,4,98.0,14.7,2,80
221,145,8,305.0,12.5,1,77
