# Linear Regression with Python!

In the lab, we learned how to conduct linear regression in the context of machine learning with `R`. As you may have guessed, `sklearn` has a built-in module for running linear regression models as well. However, its handling of inputs is a little different (we'll get into this later). Let's continue with the same `mpg` dataset.

### Read in the Data

We can read the `mpg` data from the `datasets/` directory, so let's do that now.

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model # necessary package for linear regression

with open('../../../datasets/mpg.csv') as file:
    df = pd.read_csv(file)
    df = df.drop(columns='Unnamed: 0') # remove id column (keyword form works in current pandas)


In [2]:
df.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
0,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
1,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
2,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
3,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
4,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


### Training and Testing Sets

In the previous module, we covered the topic of training and testing sets. Remember, we want to avoid overfitting our data, so we need to train the model on a subset of the entire data frame and then test it on the remaining, smaller subset. For this practice, we will split `df` into 70% training data and use the other 30% as testing data. We'll go ahead and do that now...

In [3]:
train = df.sample(frac=7/10, random_state = 1)
test = df.drop(train.index)
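To see what `sample` and `drop` are doing here, the following is a minimal sketch on a small made-up data frame (the `toy` frame and its values are hypothetical, not part of the `mpg` data). It checks that the two pieces have the expected sizes and share no rows.

```python
import pandas as pd

# Hypothetical toy data standing in for `df` (10 rows)
toy = pd.DataFrame({
    "displ": [1.8, 2.0, 2.8, 3.1, 4.2, 5.3, 5.7, 6.0, 2.4, 3.5],
    "hwy":   [29, 31, 26, 27, 23, 20, 17, 16, 30, 25],
})

# Draw 70% of the rows at random; random_state makes the draw repeatable
toy_train = toy.sample(frac=0.7, random_state=1)

# Everything whose index label was not sampled becomes the testing set
toy_test = toy.drop(toy_train.index)

n_train, n_test = len(toy_train), len(toy_test)
overlap = toy_train.index.intersection(toy_test.index)
print(n_train, n_test, len(overlap))  # 7 3 0
```

Because `drop` works on index labels, this approach relies on the data frame having a unique index.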

Now we need to define our inputs (the predictors) and the target (outcome). We will continue down the path of the `R` lab and use the `displ` and `class` variables as our inputs and `hwy` as our target. Recall from module 6 that we had to convert these variables to `numpy` arrays.


**Activity 1**: *Create a numpy array for the input variables `train[['displ', 'class']]` and call this object `train_X`. Then create a numpy array for the target variable `train['hwy']` and call this `train_y`. Then do the same thing for the `test` data frame, but be sure to call the new objects `test_X` and `test_y`.*

In [4]:
# Code for Activity 1 goes here
# *****************************

train_X = np.asarray(train[['displ', 'class']])
train_y = np.asarray(train.hwy)

test_X = np.asarray(test[['displ', 'class']])
test_y = np.asarray(test.hwy)

### Creating a Linear Regression Model

Now that we have our inputs and targets defined for both the training and testing sets, the next step is to create the model. First we will create the linear regression object by calling the `LinearRegression()` class from the `linear_model` module. We will call this object `regr`.

In [5]:
# Create linear regression object
regr = linear_model.LinearRegression()

Now we can use the `.fit()` method to train our model on the training variables...or can we? What happens when we run the following line of code?

In [6]:
# Train the model using the training sets
regr.fit(train_X, train_y)

ValueError: could not convert string to float: 'suv'

Uh-oh! Look at that value error. The `LinearRegression()` implementation in `sklearn` needs numeric inputs, and the `class` column is full of strings. In `R`, these are handled for you, but in Python, you have to turn these values into dummy variables yourself. In other words, we will create a column for each value in `class`, and if an observation belongs to that class, it is marked as 1 in that column; otherwise it takes the value 0. Take the following table for example.

Observation | Color
------------|------
1           | 'Red'
2           | 'Green'
3           | 'Red'
4           | 'Blue'

Once we create dummy variables out of the `Color` column, the data will be in the following format...

Observation | Red | Green | Blue
------------|-----|-------|-----
1 | 1.0| 0.0|0.0
2 | 0.0| 1.0|0.0
3 | 1.0| 0.0|0.0
4 | 0.0| 0.0|1.0

We would then use these three columns as part of our input. This is exactly what we need to do with our `df` before we can proceed.

### Creating Dummy Variables

Fortunately, `pandas` provides a convenient function for transforming a variable into dummy variables. See what this looks like when we apply it to the original `df` data frame.

In [7]:
pd.get_dummies(df['class'])

Unnamed: 0,2seater,compact,midsize,minivan,pickup,subcompact,suv
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0
5,0.0,1.0,0.0,0.0,0.0,0.0,0.0
6,0.0,1.0,0.0,0.0,0.0,0.0,0.0
7,0.0,1.0,0.0,0.0,0.0,0.0,0.0
8,0.0,1.0,0.0,0.0,0.0,0.0,0.0
9,0.0,1.0,0.0,0.0,0.0,0.0,0.0
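The same function reproduces the small `Color` table from earlier. This is a minimal sketch; the `colors` frame is made up for illustration. Note that `dtype=float` is passed explicitly, since newer `pandas` versions return boolean columns by default, and that the dummy columns come back in sorted order rather than in order of appearance.

```python
import pandas as pd

# The toy Color table from the explanation above
colors = pd.DataFrame({
    "Observation": [1, 2, 3, 4],
    "Color": ["Red", "Green", "Red", "Blue"],
})

# One column per distinct value; 1.0 where the row has that value, else 0.0
color_dummies = pd.get_dummies(colors["Color"], dtype=float)
print(color_dummies)
```

Each row contains exactly one 1.0, because every observation has exactly one color.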


Now we need to turn this into a separate data frame object and join it to our original data frame `df`.

**Activity 2**: *Create an object of dummy variables from the `df['class']` variable. Call this new data frame object, `class_dummies`.*

In [8]:
# Code for Activity 2 goes here
# *****************************

class_dummies = pd.get_dummies(df['class'])

And now for the joining of the original `df` object to the newly created `class_dummies`. Since each observation of the `df` data frame lines up with each observation of the `class_dummies` data frame, we can use the following method to join the two frames. 

In [10]:
df2 = pd.concat([df, class_dummies], axis=1)
df2.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,2seater,compact,midsize,minivan,pickup,subcompact,suv
0,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,0.0,1.0,0.0,0.0,0.0,0.0,0.0


If everything was created correctly, we should see our original `df` with some extra columns (one for every category that was in the `class` column). What this method does is concatenate data frame objects along axis 1, the column axis (remember, 0 is the row axis).
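As a small illustration of column-wise concatenation, here is a sketch on a made-up three-row frame (`left` and `right` are hypothetical names). `pd.concat` with `axis=1` lines rows up by their index labels, which is why it works when both frames come from the same source.

```python
import pandas as pd

# Hypothetical miniature version of `df`
left = pd.DataFrame({
    "displ": [1.8, 2.0, 5.3],
    "class": ["compact", "compact", "suv"],
})

# Dummy columns built from the same rows, so the index labels match
right = pd.get_dummies(left["class"], dtype=float)

# axis=1 places the frames side by side, matching rows by index label
combined = pd.concat([left, right], axis=1)
print(combined.shape)          # (3, 4)
print(list(combined.columns))  # ['displ', 'class', 'compact', 'suv']
```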

Uh-oh, our old training and testing sets were drawn from `df`, so they do not contain the new dummy columns. We are going to have to redefine them before we can make any progress.

**Activity 3** : *Create a new testing and training set from the new data frame **`df2`**. Use the same parameters to define these two sets as we did for the original training and testing sets above. Call these objects `train` and `test` respectively.*

In [11]:
# Code for Activity 3 goes here
# *****************************


train = df2.sample(frac=7/10, random_state = 1)
test = df2.drop(train.index)

Oh yes, and since we just created new training and testing sets, we have to define the inputs and targets for each set again. Below I will define the new input and target variables for the training set. Remember, the new inputs will include all of the dummy variables that we just created.

In [12]:
train_X = np.asarray(train[['displ', '2seater', 'compact', 'midsize', 'minivan', 'pickup','subcompact','suv']])
train_y = np.asarray(train.hwy)

And now we need to do the same thing for the testing set.

**Activity 4** : *Create the testing inputs and testing target from the `test` data frame. Call these new objects `test_X` and `test_y` respectively.*

In [13]:
# Code for Activity 4 goes here
# *****************************


test_X = np.asarray(test[['displ', '2seater', 'compact', 'midsize', 'minivan', 'pickup','subcompact','suv']])
test_y = np.asarray(test.hwy)
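Typing the same list of column names for both sets is easy to get wrong. One possible way around this, sketched below on made-up data, is to build the list once from the dummy frame's columns and reuse it. The names `toy`, `toy_dummies`, and `feature_cols` are hypothetical and not part of the lab.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature data set
toy = pd.DataFrame({
    "displ": [1.8, 2.0, 5.3, 4.7],
    "class": ["compact", "compact", "suv", "pickup"],
    "hwy":   [29, 31, 20, 17],
})
toy_dummies = pd.get_dummies(toy["class"], dtype=float)
toy2 = pd.concat([toy, toy_dummies], axis=1)

# Build the input column list once: displ plus every dummy column
feature_cols = ["displ"] + list(toy_dummies.columns)

X = np.asarray(toy2[feature_cols])  # 2-D array: rows x features
y = np.asarray(toy2["hwy"])         # 1-D array: one target per row
print(feature_cols)
print(X.shape, y.shape)
```

The same `feature_cols` list can then be used for both the training and the testing inputs, which keeps the column order identical between them.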

Now we can go ahead and train the `regr` model object that we created above on our training data. Will it work this time?

In [33]:
regr.fit(train_X, train_y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Of course it will! Now our `regr` model is trained on our training set. Let's see what the results look like. First we can see the intercept.

In [34]:
regr.intercept_

32.995677220025989

And then the predictor variables' coefficients. Remember, each coefficient tells us how much the predicted `hwy` changes when that variable increases by one unit.

In [35]:
# The coefficients
print('Coefficients: \n', regr.coef_)

Coefficients: 
 [-2.42737226  7.99592858  0.75584649  1.17086032 -1.78384786 -5.48121739
  1.33358906 -3.99115918]


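And we can zip that up to see the name of each variable next to its coefficient. It is important to note the difference between `R` and `Python` here: by default, `R`'s `lm()` treats one class as the reference level and folds it into the intercept, whereas here every class gets its own dummy column and coefficient. To find the predicted highway mpg for a car, take the intercept + the displacement coefficient times the car's displacement + the coefficient of the car's class. The intercept for a particular class is therefore the model intercept + that class's coefficient.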

In [36]:
z = zip(['displ', '2seater', 'compact', 'midsize', 'minivan', 'pickup','subcompact','suv'],regr.coef_)
list(z)

[('displ', -2.4273722565586464),
 ('2seater', 7.9959285758845402),
 ('compact', 0.75584648787547093),
 ('midsize', 1.1708603177117278),
 ('minivan', -1.7838478624497738),
 ('pickup', -5.4812173947250429),
 ('subcompact', 1.3335890606586231),
 ('suv', -3.9911591849555386)]
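To make the arithmetic concrete, the sketch below plugs the intercept and coefficients printed above into the prediction formula by hand. The dictionary `coefs` and the helper `predict_hwy` are hypothetical names introduced only for this illustration.

```python
# Intercept and coefficients copied from the outputs above
intercept = 32.995677220025989
coefs = {
    "displ": -2.4273722565586464,
    "2seater": 7.9959285758845402,
    "compact": 0.75584648787547093,
    "midsize": 1.1708603177117278,
    "minivan": -1.7838478624497738,
    "pickup": -5.4812173947250429,
    "subcompact": 1.3335890606586231,
    "suv": -3.9911591849555386,
}

def predict_hwy(displ, car_class):
    """Predicted hwy = intercept + displ coefficient * displ + class coefficient."""
    return intercept + coefs["displ"] * displ + coefs[car_class]

# A compact car with a 1.8 litre engine
compact_pred = predict_hwy(1.8, "compact")
print(round(compact_pred, 4))  # 29.3823
```

The same value, 29.38225365, appears as the first entry of the prediction array printed at the end of this section.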

### Assessing the Model

We can then assess the model by looking at the R-Squared value on the training set. In `sklearn`, this is the `.score()` method; we call it on the `regr` object and pass the inputs and target as arguments. Take a look below...

In [37]:
# R-Squared: 1 is perfect prediction
print('R-Squared: {}'.format(regr.score(train_X, train_y)))

R-Squared: 0.8045617859827147
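For a regression model, `.score()` returns the coefficient of determination, which can also be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this on synthetic data; the data and the names `model`, `ss_res`, and `ss_tot` are made up for illustration and are not the `mpg` model.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data: a noisy straight line (not the mpg data)
rng = np.random.default_rng(0)
X = rng.uniform(1.5, 7.0, size=(50, 1))
y = 35.0 - 3.0 * X[:, 0] + rng.normal(0.0, 2.0, size=50)

model = linear_model.LinearRegression().fit(X, y)
pred = model.predict(X)

ss_res = np.sum((y - pred) ** 2)      # variation the model fails to explain
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(X, y))   # the two values agree
```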


But we want to see how well our model predicts our testing set, and doing that is simple. Instead of passing the train inputs and target as arguments, pass the test.

**Activity 5** : *Assess the performance of our `regr` by finding the `R-Squared` of our model on the testing set.*

In [38]:
# Code for Activity 5 goes here
# *****************************

print('R-Squared: {}'.format(regr.score(test_X, test_y)))

R-Squared: 0.7596803856444374


There we have it! Not too shabby! 

And finally, if we want to predict some output with our model using our testing set inputs, or any new inputs, we can do so by calling the `predict()` method on the `regr` object and passing our input array as an argument. Take a look at how we do that on the `test_X` object.

In [39]:
regr.predict(test_X)

array([ 29.38225365,  28.89677919,  29.38225365,  29.38225365,
        28.89677919,  27.36989522,  16.13944508,  14.4402845 ,
        27.15558393,  27.15558393,  25.94189781,  25.94189781,
        15.16849617,  25.38613594,  23.20150091,  23.20150091,
        18.53318248,  18.04770802,  16.10581022,  19.53776623,
        17.59586843,  16.3821823 ,  14.68302172,  16.10581022,
        16.10581022,  14.89212409,  13.67843796,  15.89670785,
        19.29502901,  19.29502901,  16.34854745,  23.1633539 ,
        30.44547067,  29.95999622,  28.34084412,  28.34084412,
        29.47452177,  27.77536119,  17.59586843,  15.16849617,
        14.19754727,  18.80955456,  17.83860565,  15.89670785,
        15.89670785,  19.29502901,  17.83860565,  16.86765675,
        27.92583029,  27.92583029,  28.0981069 ,  25.67073464,
        25.67073464,  15.4112334 ,  24.94252296,  24.94252296,
        20.75145236,  17.59586843,  28.34084412,  25.74119526,
        29.38225365,  20.96055473,  20.96055473,  19.26

Remember what these points represent. They are the predicted highway mpg given that we know the displacement and class beforehand. 
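Predicting for a brand-new car works the same way, as long as the input is a 2-D array whose columns are in the same order as the training inputs. Here is a minimal sketch on a made-up model with only three input columns (`displ`, `compact`, `suv`); the data and the names `model` and `new_car` are hypothetical.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical training data; columns are displ, compact, suv
X = np.array([
    [1.8, 1.0, 0.0],
    [2.0, 1.0, 0.0],
    [2.8, 1.0, 0.0],
    [4.0, 0.0, 1.0],
    [5.3, 0.0, 1.0],
    [5.7, 0.0, 1.0],
])
y = np.array([29.0, 31.0, 26.0, 20.0, 17.0, 15.0])

model = linear_model.LinearRegression().fit(X, y)

# One new observation: a 2.4 litre compact. Note the double brackets,
# which give predict() the 2-D shape (1 row, 3 columns) it expects.
new_car = np.array([[2.4, 1.0, 0.0]])
new_pred = model.predict(new_car)
print(new_pred.shape)  # (1,)
```

`predict()` returns one value per input row, so a single new car yields an array of length one.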