# Multi-feature Linear Regression in Python
This week we are going to dive into linear regression in higher dimensions, that is, with more than one input feature. I will be reusing some code from my previous post on **Week 1**.

The first week covers Linear Regression, implementing Gradient Descent, and feature normalization. Let's dive in!

In [1]:
import os
from pathlib import Path

import pandas as pd
import sklearn.linear_model
import sklearn.pipeline
import sklearn.preprocessing
from sklearn_pandas import DataFrameMapper

# The data sits under the parent of the current working directory.
# Joining with `/` avoids the invalid "\O" and "\e" escapes that a
# backslash-separated string would contain, and works on any OS.
p = Path(os.getcwd()).parents[0]
ex1_path = p / 'Octave Code' / 'ex1'
assert ex1_path.exists(), "Check path to data"
os.chdir(ex1_path)

# Note: read_csv uses the first line of the file as the header
# unless header=None is passed.
df = pd.read_csv('ex1data2.txt')

It's always important to check what your data types are and whether you have any null values.

I know from the course that the columns are: 
1. Square Feet
2. Bedrooms
3. Price

In [2]:
df.columns = ['sqft', 'bdrms', 'price']

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 3 columns):
sqft     46 non-null int64
bdrms    46 non-null int64
price    46 non-null int64
dtypes: int64(3)
memory usage: 1.2 KB
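
One thing worth checking here: the summary reports 46 rows. `pd.read_csv` treats the first line of a file as column names by default, so if `ex1data2.txt` has no header row (an assumption, I have not confirmed it here), the first sample would have been consumed as the header. A minimal sketch of the difference, using a made-up three-line stand-in for the file:

```python
import io

import pandas as pd

# Made-up stand-in for a headerless CSV file (illustrative values only).
raw = "2000,3,400000\n1600,3,330000\n2400,3,370000\n"

# Default behaviour: the first line is consumed as the header.
with_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every line as data; names= supplies the column labels.
with_names = pd.read_csv(io.StringIO(raw), header=None,
                         names=['sqft', 'bdrms', 'price'])

print(len(with_default))  # one row fewer than the file contains
print(len(with_names))    # every line kept as a sample
```

Passing `names=` at read time would also make the separate `df.columns = [...]` step unnecessary.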


In [4]:
df.describe()

| | sqft | bdrms | price |
|---|---|---|---|
| count | 46.0 | 46.0 | 46.0 |
| mean | 1998.434783 | 3.173913 | 339119.456522 |
| std | 803.333019 | 0.768963 | 126103.418369 |
| min | 852.0 | 1.0 | 169900.0 |
| 25% | 1429.5 | 3.0 | 249900.0 |
| 50% | 1870.0 | 3.0 | 299900.0 |
| 75% | 2284.5 | 4.0 | 368875.0 |
| max | 4478.0 | 5.0 | 699900.0 |


Good news, we don't have any null entries! When we look at the ranges, though, the features are on very different scales: bdrms runs from 1 to 5, while sqft runs from roughly 850 to 4,500. With a data set this small it hardly matters, but with a large one feature scaling (normalization) would be important, particularly for gradient descent. This is easy to add to our pipeline, so I'm going to give it a shot and see what I can learn! Keep in mind that with our small data set of 46 samples it would be much faster to use the closed-form least-squares solution.
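
To make the scaling and closed-form points concrete, here is a minimal NumPy sketch that standardizes two features, solves the least-squares problem directly, and then reaches the same weights with batch gradient descent. The six data points are made up for illustration (they are not the course file), and the learning rate and iteration count are values I picked by hand.

```python
import numpy as np

# Made-up housing data (illustrative values, not the course file).
sqft = np.array([850.0, 1400.0, 1900.0, 2300.0, 3000.0, 4500.0])
bdrms = np.array([1.0, 3.0, 3.0, 4.0, 4.0, 5.0])
price = np.array([170e3, 250e3, 300e3, 370e3, 450e3, 700e3])

# Standardize each feature: subtract the mean, divide by the standard deviation.
X = np.column_stack([sqft, bdrms])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Prepend a column of ones so the first weight acts as the intercept.
A = np.column_stack([np.ones(len(price)), X_scaled])

# Closed-form least squares.
theta_closed, *_ = np.linalg.lstsq(A, price, rcond=None)

# Batch gradient descent on the mean squared error.
theta_gd = np.zeros(A.shape[1])
learning_rate = 0.1  # hand-picked for this toy problem
for _ in range(5000):
    gradient = A.T @ (A @ theta_gd - price) / len(price)
    theta_gd -= learning_rate * gradient

print(theta_closed)
print(theta_gd)
```

Because the scaled features have zero mean, the intercept comes out equal to the mean price. Running gradient descent on the unscaled columns with the same learning rate would diverge, which is why the scaling step matters.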

I will use pipelines just like we did last time. 

In [5]:
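# Standardize each feature to zero mean and unit variance.
mapper = DataFrameMapper([
    (['sqft'], sklearn.preprocessing.StandardScaler()),
    (['bdrms'], sklearn.preprocessing.StandardScaler())
])

# Chain the scaling step with an ordinary least-squares model.
pl = sklearn.pipeline.Pipeline([
    ('featurize', mapper),
    ('lm', sklearn.linear_model.LinearRegression())
])

# Fit and predict on the features only, without the target column.
X = df.drop('price', axis=1)
pl.fit(X, df.price)
y_pred = pl.predict(X)  # in-sample predictions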
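
The in-sample predictions stored in `y_pred` can be scored against the known prices. A minimal sketch using `sklearn.metrics`, with made-up data and a plain `LinearRegression` standing in for the housing frame and the pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Made-up stand-in data (illustrative values only).
X = np.array([[850, 1], [1400, 3], [1900, 3], [2300, 4], [3000, 4], [4500, 5]],
             dtype=float)
y = np.array([170e3, 250e3, 300e3, 370e3, 450e3, 700e3])

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

# R^2: fraction of the variance in y explained by the model (1.0 is perfect).
r2 = r2_score(y, y_hat)
# RMSE: typical prediction error, in the same units as the price.
rmse = np.sqrt(mean_squared_error(y, y_hat))

print(f"R^2 = {r2:.3f}, RMSE = {rmse:,.0f}")
```

These are in-sample scores, so they are optimistic; holding out data or cross-validating would give a fairer estimate.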



In [6]:
# Build a one-row frame for the house we want to price.
# Note that pred MUST have the columns 'sqft' and 'bdrms' used by the mapper.
pred = pd.DataFrame({'sqft': [1650], 'bdrms': [3]})
pl.predict(pred)



array([ 292195.80095132])

The original prediction was: $293081.46

This is a pretty minor price difference of approximately 0.3% (about $886)!
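
The size of the gap is easy to check with a couple of lines of arithmetic on the two predictions quoted above:

```python
original = 293081.46        # the original prediction quoted above
pipeline = 292195.80095132  # prediction from the scikit-learn pipeline

abs_diff = original - pipeline
rel_diff = abs_diff / original

print(f"{abs_diff:.2f}")  # 885.66
print(f"{rel_diff:.2%}")  # 0.30%
```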

This is pretty cool. It is a very easy process, and the amazing [`Pipeline` class](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) allows us to add new machine learning ideas very easily.
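
As a sketch of what "adding new ideas" can look like, the pipeline below inserts a polynomial-feature step and swaps the linear model for ridge regression. This is my own illustration, not something from the course: the data is made up, the step names and `alpha=1.0` are arbitrary choices, and a plain `StandardScaler` replaces the `DataFrameMapper` to keep the example free of the `sklearn_pandas` dependency.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Made-up stand-in data (illustrative values only).
df_demo = pd.DataFrame({
    'sqft': [850, 1400, 1900, 2300, 3000, 4500],
    'bdrms': [1, 3, 3, 4, 4, 5],
    'price': [170e3, 250e3, 300e3, 370e3, 450e3, 700e3],
})

# Each new idea is one extra (name, estimator) entry in the list.
pl_demo = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('lm', Ridge(alpha=1.0)),
])

X_demo = df_demo[['sqft', 'bdrms']]
pl_demo.fit(X_demo, df_demo['price'])

# Predict for the same house as before: 1650 sqft, 3 bedrooms.
pred_demo = pl_demo.predict(pd.DataFrame({'sqft': [1650], 'bdrms': [3]}))
print(pred_demo)
```

The fit and predict calls stay exactly the same no matter how many steps are added, which is the main appeal of the pipeline approach.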