In [42]:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

# Further Regression Considerations

- Collinearity
- Cleaning and Preparing Data
- Test Train Split for Assessment


### Collinearity

The notion of independence of variables is related to the notion of collinearity.  Briefly, we find collinearity any time there are strong relationships among the independent (predictor) variables.  As we saw earlier, `newspaper` was interrelated with the other advertising media.  Collinearity can be detected by plotting the variables against one another, examining their correlation coefficients, and calculating the Variance Inflation Factor (VIF) for the different features.
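
One way to quantify collinearity is the variance inflation factor: regress each feature on all of the others and compute `1 / (1 - R^2)`.  The following is a minimal sketch with NumPy on synthetic data; the `vif` helper and the made-up `limit`, `rating`, and `income` arrays are illustrative assumptions, not the `credit` data.  The `variance_inflation_factor` function from `statsmodels` is the library counterpart.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1 / (1 - r2)
    return out

# Synthetic stand-ins: rating is nearly a linear function of limit
rng = np.random.default_rng(0)
limit = rng.normal(5000, 2000, 400)
rating = 0.07 * limit + rng.normal(0, 10, 400)
income = rng.normal(45, 30, 400)  # unrelated to the other two

vifs = vif(np.column_stack([limit, rating, income]))
print(dict(zip(['limit', 'rating', 'income'], vifs.round(1))))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity; here the two related columns should land far above that range while the unrelated one stays near 1.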

In [36]:
credit = pd.read_csv('data/credit.csv')
ads = pd.read_csv('data/ads.csv', index_col = 'Unnamed: 0')

In [37]:
from pandas.plotting import scatter_matrix
scatter_matrix(credit);

Note the relationships among `Limit`, `Rating`, and `Balance`.  Both `Limit` and `Rating` seem to be related to `Balance`; however, they are also strongly related to one another.  This is not to be confused with the relationship between `TV` and `radio` that we saw earlier.  We can see the difference clearly by plotting the two pairs of variables side by side.

In [4]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [4]:
plt.figure(figsize = (5, 5))
plt.subplot(1, 2, 1)
plt.scatter(ads['TV'], ads['radio'], alpha = 0.3);
plt.title("Television and Radio")

plt.subplot(1, 2, 2)
plt.scatter(credit['Rating'], credit['Limit'], alpha = 0.3);
plt.title("Rating and Limit")

Text(0.5,1,'Rating and Limit')

### Collinearity Example

The `longley` dataset, available through the `statsmodels` datasets package, is another example of a highly collinear dataset.  Here, we are interested in fitting a regression that predicts employment (the `Employed` column).

In [5]:
import statsmodels.api as sm

In [6]:
longley = sm.datasets.get_rdataset('longley')

In [7]:
longley.data.head()

Unnamed: 0,GNP.deflator,GNP,Unemployed,Armed.Forces,Population,Year,Employed
1947,83.0,234.289,235.6,159.0,107.608,1947,60.323
1948,88.5,259.426,232.5,145.6,108.632,1948,61.122
1949,88.2,258.054,368.2,161.6,109.773,1949,60.171
1950,89.5,284.599,335.1,165.0,110.929,1950,61.187
1951,96.2,328.975,209.9,309.9,112.075,1951,63.221


In [8]:
print(longley.__doc__)

+---------+-----------------+
| longley | R Documentation |
+---------+-----------------+

Longley's Economic Regression Data
----------------------------------

Description
~~~~~~~~~~~

A macroeconomic data set which provides a well-known example for a
highly collinear regression.

Usage
~~~~~

::

    longley

Format
~~~~~~

A data frame with 7 economical variables, observed yearly from 1947 to
1962 (*n=16*).

``GNP.deflator``
    GNP implicit price deflator (*1954=100*)

``GNP``
    Gross National Product.

``Unemployed``
    number of unemployed.

``Armed.Forces``
    number of people in the armed forces.

``Population``
    ‘noninstitutionalized’ population *≥* 14 years of age.

``Year``
    the year (time).

``Employed``
    number of people employed.

The regression ``lm(Employed ~ .)`` is known to be highly collinear.

Source
~~~~~~

J. W. Longley (1967) An appraisal of least-squares programs from the
point of view of the user. *Journal of the American Statistical
Association* 

In [9]:
long_data = longley.data

In [10]:
long_data.columns

Index(['GNP.deflator', 'GNP', 'Unemployed', 'Armed.Forces', 'Population',
       'Year', 'Employed'],
      dtype='object')

In [11]:
long_data.head()

Unnamed: 0,GNP.deflator,GNP,Unemployed,Armed.Forces,Population,Year,Employed
1947,83.0,234.289,235.6,159.0,107.608,1947,60.323
1948,88.5,259.426,232.5,145.6,108.632,1948,61.122
1949,88.2,258.054,368.2,161.6,109.773,1949,60.171
1950,89.5,284.599,335.1,165.0,110.929,1950,61.187
1951,96.2,328.975,209.9,309.9,112.075,1951,63.221


In [12]:
corr_mat = long_data.corr()

In [13]:
plt.figure()
sns.heatmap(corr_mat)

<matplotlib.axes._subplots.AxesSubplot at 0x1a1550b780>

In [15]:
scatter_matrix(long_data);

### Problem

Return to your example dataset in the `Credit` example.  Remove any features you believe are highly correlated and refit your model.  Discuss performance.
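
As a starting point for this problem, one possible approach is to drop one feature from every highly correlated pair before refitting.  The sketch below is an assumption-laden illustration: the `drop_correlated` helper, the 0.9 threshold, and the synthetic columns are all hypothetical and should be adapted to the real `credit` data.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop the later column of each pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Synthetic stand-in for the credit features
rng = np.random.default_rng(0)
limit = rng.normal(5000, 2000, 400)
toy = pd.DataFrame({
    'Limit': limit,
    'Rating': 0.07 * limit + rng.normal(0, 10, 400),  # nearly collinear with Limit
    'Income': rng.normal(45, 30, 400),
})

reduced, dropped = drop_correlated(toy)
print(dropped)
print(reduced.columns.tolist())
```

Which member of a pair to keep is a modeling decision; this sketch simply keeps whichever column appears first.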

### Feature Engineering and Cleaning


We want to return to our Housing example and consider how to use some of `scikit-learn`'s functionality to deal with missing values.  We want to determine the correct way of dealing with these column by column, and use what we know about the data to inform these decisions.  If a column has missing values, we can either exclude the affected observations or encode the missing values with a placeholder such as a numerical code.
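
The two options described above can be sketched on a tiny made-up frame.  The values below are invented for illustration (only the column names mirror the Ames data), and the median strategy is just one of several choices `SimpleImputer` offers.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented example rows; not the real Ames data
toy = pd.DataFrame({
    'LotFrontage': [65.0, 80.0, np.nan, 60.0],
    'Alley': [np.nan, 'Grvl', np.nan, 'Pave'],
})

# Option 1: exclude observations that are missing a value
dropped = toy.dropna(subset=['LotFrontage'])

# Option 2a: fill a numeric column with its median
imputer = SimpleImputer(strategy='median')
toy['LotFrontage_filled'] = imputer.fit_transform(toy[['LotFrontage']]).ravel()

# Option 2b: treat a missing category as its own level
toy['Alley_filled'] = toy['Alley'].fillna('None')

print(len(dropped))
print(toy)
```

For a column such as `Alley`, a missing value most plausibly means the house has no alley access, so encoding it as its own category keeps those rows usable.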


In [13]:
ames = pd.read_csv('data/ames_housing.csv')

In [14]:
ames.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [20]:
ames.info()
ames.describe  # missing parentheses: this displays the bound method; call ames.describe() for summary statistics

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

<bound method NDFrame.describe of         Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0        1          60       RL         65.0     8450   Pave   NaN      Reg   
1        2          20       RL         80.0     9600   Pave   NaN      Reg   
2        3          60       RL         68.0    11250   Pave   NaN      IR1   
3        4          70       RL         60.0     9550   Pave   NaN      IR1   
4        5          60       RL         84.0    14260   Pave   NaN      IR1   
5        6          50       RL         85.0    14115   Pave   NaN      IR1   
6        7          20       RL         75.0    10084   Pave   NaN      Reg   
7        8          60       RL          NaN    10382   Pave   NaN      IR1   
8        9          50       RM         51.0     6120   Pave   NaN      Reg   
9       10         190       RL         50.0     7420   Pave   NaN      Reg   
10      11          20       RL         70.0    11200   Pave   NaN      Reg   
11      12        

In [17]:
ames['Alley'].value_counts()

Grvl    50
Pave    41
Name: Alley, dtype: int64

In [18]:
ames.Alley.head(10)

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
Name: Alley, dtype: object

In [19]:
#Choose only the rows where the Alley data is not null
ames[ames['Alley'].notna()].head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
21,22,45,RM,57.0,7449,Pave,Grvl,Reg,Bnk,AllPub,...,0,,GdPrv,,0,6,2007,WD,Normal,139400
30,31,70,C (all),50.0,8500,Pave,Pave,Reg,Lvl,AllPub,...,0,,MnPrv,,0,7,2008,WD,Normal,40000
56,57,160,FV,24.0,2645,Pave,Pave,Reg,Lvl,AllPub,...,0,,,,0,8,2009,WD,Abnorml,172500
79,80,50,RM,60.0,10440,Pave,Grvl,Reg,Lvl,AllPub,...,0,,MnPrv,,0,5,2009,WD,Normal,110000
87,88,160,FV,40.0,3951,Pave,Pave,Reg,Lvl,AllPub,...,0,,,,0,6,2009,New,Partial,164500


In [19]:
# Goal: turn Alley into dummy (indicator) columns and concat them to the original df
ames_alley_df = pd.get_dummies(ames.Alley)
ames_alley_df.head(5)

Unnamed: 0,Grvl,Pave
0,0,0
1,0,0
2,0,0
3,0,0
4,0,0
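
The cell above builds the dummy columns but does not attach them to the original frame.  A minimal sketch of that remaining step follows, using an invented four-row frame; the `prefix`, `dummy_na`, and `dtype` arguments are choices made for this illustration rather than anything the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Invented rows standing in for the Ames data
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550],
    'Alley': [np.nan, 'Grvl', np.nan, 'Pave'],
})

# dummy_na=True adds an explicit indicator column for missing values
dummies = pd.get_dummies(toy['Alley'], prefix='Alley', dummy_na=True, dtype=int)

# Attach the indicator columns to the original frame, aligned on the index
combined = pd.concat([toy, dummies], axis=1)
print(combined)
```

Without `dummy_na=True`, rows with a missing `Alley` are simply all zeros, as in the output above.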


In [22]:
#only want to select the columns from the Ames dataset where the values are numeric so that I can use it for my Linear Regression
ames_numbers = ames.select_dtypes(include = 'int64')

In [23]:
ames_numbers.head()

Unnamed: 0,Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
0,1,60,8450,7,5,2003,2003,706,0,150,...,0,61,0,0,0,0,0,2,2008,208500
1,2,20,9600,6,8,1976,1976,978,0,284,...,298,0,0,0,0,0,0,5,2007,181500
2,3,60,11250,7,5,2001,2002,486,0,434,...,0,42,0,0,0,0,0,9,2008,223500
3,4,70,9550,7,5,1915,1970,216,0,540,...,0,35,272,0,0,0,0,2,2006,140000
4,5,60,14260,8,5,2000,2000,655,0,490,...,192,84,0,0,0,0,0,12,2008,250000


In [26]:
ames_numbers.columns

Index(['Id', 'MSSubClass', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars',
       'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
       'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')

In [28]:
#Look at the correlation of the ames_numbers dataset
corr_ames_numbers = ames_numbers.corr()

In [30]:
#Visualize the correlation
plt.figure()
sns.heatmap(corr_ames_numbers)

<matplotlib.axes._subplots.AxesSubplot at 0x1a1a35eb38>

There appears to be a great deal of correlation between `SalePrice` and the following variables:

- `GarageArea`
- `GrLivArea`
- `OverallCond`
- `YearRemodAdd`
- `TotalBsmtSF`
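
Reading correlations off a heatmap is imprecise, so it can help to rank them numerically.  The sketch below does this on synthetic data; the generated columns only borrow Ames column names, and the coefficients used to build them are invented.

```python
import numpy as np
import pandas as pd

# Synthetic data: price depends strongly on living area, weakly on garage area
rng = np.random.default_rng(1)
n = 200
living = rng.normal(1500, 500, n)
garage = 0.3 * living + rng.normal(0, 150, n)
toy = pd.DataFrame({
    'GrLivArea': living,
    'GarageArea': garage,
    'MoSold': rng.integers(1, 13, n),  # unrelated to price by construction
    'SalePrice': 100 * living + 50 * garage + rng.normal(0, 20000, n),
})

# Correlation of every feature with the target, strongest first
ranked = (toy.corr()['SalePrice']
             .drop('SalePrice')
             .abs()
             .sort_values(ascending=False))
print(ranked)
```

Applied to `ames_numbers`, the same expression would give an ordered list to check against the visual impression from the heatmap.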

In [31]:
#Now that I only have the numeric values,I can start to look at linear regressions
#Need to drop the Sales Price because that is what I'm trying to predict!
X = ames_numbers.drop('SalePrice', axis = 1)
y = ames.SalePrice

In [32]:
X.head()

Unnamed: 0,Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
0,1,60,8450,7,5,2003,2003,706,0,150,...,548,0,61,0,0,0,0,0,2,2008
1,2,20,9600,6,8,1976,1976,978,0,284,...,460,298,0,0,0,0,0,0,5,2007
2,3,60,11250,7,5,2001,2002,486,0,434,...,608,0,42,0,0,0,0,0,9,2008
3,4,70,9550,7,5,1915,1970,216,0,540,...,642,0,35,272,0,0,0,0,2,2006
4,5,60,14260,8,5,2000,2000,655,0,490,...,836,192,84,0,0,0,0,0,12,2008


# Finding the best variables to use as indicators

Using all numeric variables as indicators:

In [49]:
lr = LinearRegression()
lr.fit(X,y)
pred = lr.predict(X)
print('The root mean squared error is {:.2f}'.format(np.sqrt(mean_squared_error(y, pred))))

The root mean squared error is 34709.80


Using the variable with the greatest correlation as the indicator

In [60]:
X2 = ames_numbers['GarageArea']
y2 = ames_numbers['SalePrice']
lr = LinearRegression()
lr.fit(X2.values.reshape(-1,1), y2)
pred2 = lr.predict(X2.values.reshape(-1,1))
print('The root mean squared error is {:.2f}'.format(np.sqrt(mean_squared_error(y2, pred2))))

The root mean squared error is 62093.07


In [75]:
# OverallCond has no missing values (1460 non-null), so no fill is needed here;
# filling a numeric column with the string "None" would also break the regression
X3 = ames_numbers['OverallCond']
y3 = ames_numbers['SalePrice']
lr = LinearRegression()
lr.fit(X3.values.reshape(-1,1), y3)
pred3 = lr.predict(X3.values.reshape(-1,1))
print('The root mean squared error is {:.2f}'.format(np.sqrt(mean_squared_error(y3, pred3))))

The root mean squared error is 79174.24


In [110]:
X4 = ames_numbers[['OverallCond', 'GarageArea', 'GrLivArea', 'Fireplaces']]
y4 = ames_numbers['SalePrice']
lr = LinearRegression()
lr.fit(X4, y4)
pred4 = lr.predict(X4)
coef4 = lr.coef_
inter4 = lr.intercept_
print('The root mean squared error is {:.2f}'.format(np.sqrt(mean_squared_error(y4, pred4))))
print('The intercept is {:.2f}' .format(inter4))
print(coef4)

The root mean squared error is 48336.90
The intercept is -9464.11
[ 1318.88355535   135.83006403    70.76244138 18840.11166174]
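
A bare coefficient array is hard to read, since its order silently follows the column order of the feature matrix.  One way to label it is sketched below on synthetic data with a known linear relationship; the toy columns and the numbers 70, 18000, and 10000 are invented for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features and an exactly linear target
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'GrLivArea': rng.uniform(500, 3000, 50),
    'Fireplaces': rng.integers(0, 3, 50),
})
y = 70 * toy['GrLivArea'] + 18000 * toy['Fireplaces'] + 10000

lr = LinearRegression().fit(toy, y)

# Pair each coefficient with the column it belongs to
coefs = pd.Series(lr.coef_, index=toy.columns)
print(coefs)
print(lr.intercept_)
```

The same pattern, `pd.Series(lr.coef_, index=X4.columns)`, would label the four coefficients printed above.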


In [None]:
ames = ames.replace({"BsmtCond": {"No": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}})

In [115]:
ames['BsmtCond'].value_counts()

3.0    1311
4.0      65
2.0      45
1.0       2
Name: BsmtCond, dtype: int64

In [79]:
ames = ames.replace({"BsmtQual" : {"No" : 0, "Po" : 1, "Fa" : 2, "TA": 3, "Gd" : 4, "Ex" : 5}})

In [116]:
ames['BsmtQual'].value_counts()

3.0     649
4.0     618
5.0     121
None     37
2.0      35
Name: BsmtQual, dtype: int64

In [118]:
ames = ames.replace({"GarageCond": {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}})

In [119]:
ames.GarageCond.value_counts()

3    1326
0      81
2      35
4       9
1       7
5       2
Name: GarageCond, dtype: int64

In [120]:
ames = ames.replace({"GarageQual": {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}})

TA      1311
None      81
Fa        48
Gd        14
Ex         3
Po         3
Name: GarageQual, dtype: int64

In [124]:
ames.GarageQual.value_counts()

3    1311
0      81
2      48
4      14
5       3
1       3
Name: GarageQual, dtype: int64
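
The `replace` calls above leave missing values untouched, which is how a column can end up mixing numbers with `None`.  A possible alternative, sketched here on invented values, is to `map` each quality column through one shared scale and then fill anything unmapped with 0.  The `quality_scale` name is hypothetical, and treating a missing value as 0 ("feature absent") is an assumption.

```python
import numpy as np
import pandas as pd

# Shared ordinal scale for the quality/condition columns
quality_scale = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

# Invented rows; only the column names mirror the Ames data
toy = pd.DataFrame({
    'BsmtQual': ['Gd', 'TA', np.nan, 'Ex'],
    'GarageQual': ['TA', np.nan, 'Fa', 'Po'],
})

for col in ['BsmtQual', 'GarageQual']:
    # map() turns unmapped or missing entries into NaN; treat those as 0
    toy[col] = toy[col].map(quality_scale).fillna(0).astype(int)

print(toy)
print(toy.dtypes)
```

Because every column ends up as an integer dtype, later arithmetic such as multiplying two quality columns behaves as expected.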

**PROBLEMS**

Continue to code a few more columns and make sure to replace any `na` values in at least:

- `OverallQual`
- `OverallCond`
- `GarageQual`
- `GarageCond`
- `PoolArea`
- `PoolQC`

In [89]:
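# fillna only changes columns that actually contain missing values.
# OverallQual, OverallCond, and PoolArea are complete numeric columns, so these calls leave them unchanged;
# a numeric column with gaps would become an object column if filled with the string "None".
ames['OverallQual'] = ames['OverallQual'].fillna("None")
ames['OverallCond'] = ames['OverallCond'].fillna("None")
ames['GarageQual'] = ames['GarageQual'].fillna("None")
ames['GarageCond'] = ames['GarageCond'].fillna("None")
ames['PoolArea'] = ames['PoolArea'].fillna("None")
ames['PoolQC'] = ames['PoolQC'].fillna("None")
ames['BsmtQual'] = ames['BsmtQual'].fillna("None")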

In [92]:
ames['BsmtQual'].head(5)

0    4
1    4
2    4
3    3
4    4
Name: BsmtQual, dtype: object

In [81]:
ames = ames.replace({"BsmtQual" : {1: "isntgood"}})

In [125]:
ames['OverallQual'].value_counts()

5     397
6     374
7     319
8     168
4     116
9      43
3      20
10     18
2       3
1       2
Name: OverallQual, dtype: int64

In [127]:
ames.PoolQC.value_counts()

None    1453
Gd         3
Ex         2
Fa         2
Name: PoolQC, dtype: int64

In [128]:
ames = ames.replace({"PoolQC": {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}})

In [129]:
ames.PoolQC.value_counts()

0    1453
4       3
5       2
2       2
Name: PoolQC, dtype: int64

### Adding New Features

We can create many new features to help improve our model's performance.  For example, measures that are split across several related columns can be combined.  Take the `Overall`, `Garage`, and `Pool` columns, for example.  We can create combinations of these columns as follows.

In [96]:
ames.isna  # missing parentheses: this displays the bound method; call ames.isna() for a boolean mask

<bound method DataFrame.isna of         Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0        1          60       RL         65.0     8450   Pave  None      Reg   
1        2          20       RL         80.0     9600   Pave  None      Reg   
2        3          60       RL         68.0    11250   Pave  None      IR1   
3        4          70       RL         60.0     9550   Pave  None      IR1   
4        5          60       RL         84.0    14260   Pave  None      IR1   
5        6          50       RL         85.0    14115   Pave  None      IR1   
6        7          20       RL         75.0    10084   Pave  None      Reg   
7        8          60       RL          NaN    10382   Pave  None      IR1   
8        9          50       RM         51.0     6120   Pave  None      Reg   
9       10         190       RL         50.0     7420   Pave  None      Reg   
10      11          20       RL         70.0    11200   Pave  None      Reg   
11      12          

In [97]:
ames.SaleCondition.value_counts()

Normal     1198
Partial     125
Abnorml     101
Family       20
Alloca       12
AdjLand       4
Name: SaleCondition, dtype: int64

In [130]:
ames[['OverallCond', 'OverallQual', 'GarageCond', 'GarageQual','PoolArea','PoolQC']].describe()

Unnamed: 0,OverallCond,OverallQual,GarageCond,GarageQual,PoolArea,PoolQC
count,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,5.575342,6.099315,2.808904,2.810274,2.758904,0.017808
std,1.112799,1.382997,0.719685,0.722898,40.177307,0.268952
min,1.0,1.0,0.0,0.0,0.0,0.0
25%,5.0,5.0,3.0,3.0,0.0,0.0
50%,5.0,6.0,3.0,3.0,0.0,0.0
75%,6.0,7.0,3.0,3.0,0.0,0.0
max,9.0,10.0,5.0,5.0,738.0,5.0


**PROBLEMS**


Continue to add additional features that combine other existing ones in a sensible way.  Here are a few additional ideas:

```python
ames['OverallGrade'] = ames['OverallQual'] * ames['OverallCond']
ames['GarageOverall'] = ames['GarageQual'] * ames['GarageCond']
ames['PoolOverall'] = ames['PoolArea'] * ames['PoolQC']
```

Be sure you've coded these as numeric vectors before creating columns based on arithmetic involving them.
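
A quick way to confirm that a column is numeric before doing arithmetic with it is `pd.api.types.is_numeric_dtype`.  The sketch below uses invented pool values and a hypothetical `pool_scale` mapping to show the check failing on a string-coded column and passing once it has been coded.

```python
import pandas as pd

# Invented rows: PoolQC is still stored as strings
toy = pd.DataFrame({
    'PoolArea': [0, 512, 0],
    'PoolQC': ['None', 'Gd', 'None'],
})
before = pd.api.types.is_numeric_dtype(toy['PoolQC'])

# Code the quality labels as integers (mapping assumed for this sketch)
pool_scale = {'None': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
toy['PoolQC'] = toy['PoolQC'].map(pool_scale).astype(int)
after = pd.api.types.is_numeric_dtype(toy['PoolQC'])

# Safe to combine now that both columns are numeric
toy['PoolOverall'] = toy['PoolArea'] * toy['PoolQC']
print(before, after)
print(toy)
```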

In [131]:
ames['OverallGrade'] = ames['OverallQual'] * ames['OverallCond']
ames['GarageOverall'] = ames['GarageQual'] * ames['GarageCond']
ames['PoolOverall'] = ames['PoolArea'] * ames['PoolQC']

In [132]:
ames.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,OverallGrade,GarageOverall,PoolOverall
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,,0,2,2008,WD,Normal,208500,35,9,0
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,,0,5,2007,WD,Normal,181500,48,9,0
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,,0,9,2008,WD,Normal,223500,35,9,0
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,,0,2,2006,WD,Abnorml,140000,35,9,0
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,,0,12,2008,WD,Normal,250000,40,9,0


In [133]:
ames_numbers2 = ames.select_dtypes(include = 'int64')

In [136]:
ames_numbers2.describe()

Unnamed: 0,Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,ScreenPorch,PoolArea,PoolQC,MiscVal,MoSold,YrSold,SalePrice,OverallGrade,GarageOverall,PoolOverall
count,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,10516.828082,6.099315,5.575342,1971.267808,1984.865753,443.639726,46.549315,567.240411,...,15.060959,2.758904,0.017808,43.489041,6.321918,2007.815753,180921.19589,33.864384,8.392466,10.167808
std,421.610009,42.300571,9981.264932,1.382997,1.112799,30.202904,20.645407,456.098091,161.319273,441.866955,...,55.757415,40.177307,0.268952,496.123024,2.703626,1.328095,79442.502883,9.219624,2.362042,153.928372
min,1.0,20.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0,1.0,0.0,0.0
25%,365.75,20.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,223.0,...,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0,30.0,9.0,0.0
50%,730.5,50.0,9478.5,6.0,5.0,1973.0,1994.0,383.5,0.0,477.5,...,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0,35.0,9.0,0.0
75%,1095.25,70.0,11601.5,7.0,6.0,2000.0,2004.0,712.25,0.0,808.0,...,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0,40.0,9.0,0.0
max,1460.0,190.0,215245.0,10.0,9.0,2010.0,2010.0,5644.0,1474.0,2336.0,...,480.0,738.0,5.0,15500.0,12.0,2010.0,755000.0,90.0,25.0,2952.0


In [141]:
corr_ames_numbers = ames_numbers2.corr()
plt.figure()
sns.heatmap(corr_ames_numbers)

<matplotlib.axes._subplots.AxesSubplot at 0x1a1ace3eb8>

In [139]:
lm = LinearRegression()
X = ames['GrLivArea']
y = ames['SalePrice']
lm.fit(X.values.reshape(-1,1), y)

#lm.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Scikit-learn Linear Regression

In [143]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [67]:
ads['TVradio'] = ads['TV'] * ads['radio']

In [68]:
ads_X = ads.drop(['sales', 'newspaper'], axis = 1)

In [69]:
ads_label = ads['sales'].copy()

In [144]:
X_train, X_test, y_train, y_test = train_test_split(X4, y4)
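
Called with its defaults, as above, `train_test_split` shuffles the rows and holds out 25% of them for testing, so the scores that follow change from run to run.  The sketch below, on a made-up array, shows the default split sizes and how passing `random_state` makes the split repeatable; the value 0 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 observations, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Default test_size is 0.25; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)

# The same random_state reproduces exactly the same split
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, random_state=0)
print(np.array_equal(X_train, X_train2))
```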

In [145]:
lm = LinearRegression()

In [146]:
lm.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [147]:
lm.coef_

array([ 1589.01939824,   132.23912873,    79.86606689, 18636.62346257])

In [148]:
lm.intercept_

-22261.199624065543

In [149]:
lm.score(X_train, y_train)

0.6637592638196598

In [150]:
lm.score(X_test, y_test)

0.5153806722207619

In [151]:
from sklearn.metrics import mean_squared_error

In [152]:
predictions = lm.predict(X_test)

In [153]:
predictions[:8]

array([104267.60978121, 113593.21163686, 400711.72232861, 277576.08143441,
       141317.6396094 , 252668.77674216, 119089.99837364, 111629.41872443])

In [154]:
y_test[:8]

1249    119000
804     118000
1169    625000
1044    278000
538     158000
618     314813
52      110000
398      67000
Name: SalePrice, dtype: int64

In [155]:
mse = mean_squared_error(y_test, predictions)

In [156]:
rmse = np.sqrt(mse)

In [157]:
print("MSE: ", mse, "\nRMSE: ", rmse)

MSE:  3080645421.953495 
RMSE:  55503.562245620735


**PROBLEM**

Using the `sklearn` implementation of `LinearRegression()`, create a test and train set from your housing data.  To begin, fit a linear model on the **Logarithm** of the sales column with the `GrLivArea` feature.  Use this as your baseline to compare your transformations to.  

Include the transformations from above into a second linear model and try it out on the test set. Did the performance improve with your adjustments and transformations? 

Add polynomial features into the mix and see if you can get better improvement still.
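
One possible shape for this exercise is sketched below on synthetic data: a baseline fit to the logarithm of the price, followed by a `Pipeline` that adds degree-2 polynomial features.  The generated `area` and `price` arrays, the degree, and the scaling step are all assumptions made for illustration; swap in the housing columns to work the actual problem.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for GrLivArea and SalePrice
rng = np.random.default_rng(42)
area = rng.uniform(500, 4000, 300)
price = np.exp(11 + 0.0005 * area + rng.normal(0, 0.1, 300))

X = area.reshape(-1, 1)
y_log = np.log(price)  # model the logarithm of the price
X_train, X_test, y_train, y_test = train_test_split(X, y_log, random_state=0)

# Baseline: one feature, log target
baseline = LinearRegression().fit(X_train, y_train)

# Second model: scale, add polynomial terms, then fit
poly_model = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('lr', LinearRegression()),
]).fit(X_train, y_train)

# Undo the log before computing RMSE so errors are in price units
for name, model in [('baseline', baseline), ('polynomial', poly_model)]:
    pred_price = np.exp(model.predict(X_test))
    rmse = np.sqrt(mean_squared_error(np.exp(y_test), pred_price))
    print(name, round(model.score(X_test, y_test), 3), round(rmse, 2))
```

Comparing both models on the same held-out rows is what makes the comparison fair.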

In [159]:
X_GrLivArea = ames_numbers['GrLivArea']
y_GrLivArea = ames_numbers['SalePrice']

In [160]:
X_train, X_test, y_train, y_test = train_test_split(X_GrLivArea, y_GrLivArea)

In [163]:
lm2= LinearRegression()

In [165]:
lm2.fit(X_train.values.reshape(-1,1), y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [170]:
lm2.coef_

array([112.45369972])

In [171]:
lm2.intercept_

10583.61162081387

In [173]:
lm2.score(X_train.values.reshape(-1,1), y_train)

0.5205926462239013

In [176]:
predictions = lm2.predict(X_test.values.reshape(-1,1))

In [177]:
mse = mean_squared_error(y_test, predictions)

In [178]:
rmse = np.sqrt(mse)

In [179]:
print("MSE: ", mse, "\nRMSE: ", rmse)

MSE:  3995218893.2568393 
RMSE:  63207.74393424305
