## Linear Regression
An estimator is any model that tries to estimate a variable y from one or more other variables x, given pairs of data (x<sub>1</sub>,y<sub>1</sub>),(x<sub>2</sub>,y<sub>2</sub>),...,(x<sub>N</sub>,y<sub>N</sub>).

Regression is the case where the targets (y) are quantities, for example the price of a house rather than a label such as cat vs. dog.

An example would be trying to guess how many ice creams people will buy from a shop on a given day from the temperature on that day. Here the target is the number of ice creams sold. It is an integer, but a value such as 14 ice creams does not represent a particular class, so this is treated as a regression problem. In this case x is also a quantity (temperature); however, this is not required, and inputs that are not quantities are tackled by approaches other than linear regression.

To estimate the number of ice creams sold from the temperature, we gather data on various days from various places and get the following data.



We can see that a line (or possibly a curve) could fit the data, but how do we find exactly which line?

#### Line Equation Review
A line in 2D can be parametrized by a slope m and a y-intercept b, which gives the equation

$$y = mx+b$$

In the ice cream example, if the temperature on a given day is 30, our prediction for the number of ice creams sold is m\*30 + b, so we simply need to find m and b.

#### Squared Loss
We need some measure of how good a line is in order to find the "best" line, so we measure the difference between our predictions and the correct values from the data.


Then we square all the differences and add them up. This sum is the value we try to minimize.
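
As a rough illustration of this measurement, the sketch below computes the sum of squared differences for two candidate lines. The temperatures, sales figures, and candidate values of m and b are made up for the example, and `squared_loss` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical data: temperatures (x) and number of ice creams sold (y)
x = np.array([20.0, 25.0, 30.0, 35.0])
y = np.array([10.0, 15.0, 22.0, 25.0])

def squared_loss(m, b, x, y):
    """Sum of squared differences between the line's predictions and the targets."""
    predictions = m * x + b
    differences = predictions - y
    return np.sum(differences ** 2)

# Two candidate lines: the one with the smaller loss fits the data better
loss_a = squared_loss(1.0, -10.0, x, y)
loss_b = squared_loss(0.5, 0.0, x, y)
print(loss_a, loss_b)
```

The first line misses only one point (by 2), so its loss is much smaller than that of the second line.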

#### Note
A direct implementation of the normal equation will probably not work here. Working out why is left to you; let everyone else know when you do, and let's see who figures it out first (it's a tricky issue).
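
For reference, the normal equation mentioned in this note is the closed-form minimizer of the squared loss. The notation below is an assumption introduced here and does not appear in the notebook: X is the data matrix with one row per data point (including the column of ones), y is the vector of targets, and θ is the vector of weights. Setting the gradient of the loss to zero gives:

$$L(\theta) = \lVert X\theta - y \rVert^2$$

$$\nabla_\theta L = 2X^T(X\theta - y) = 0 \;\Rightarrow\; X^T X\,\theta = X^T y$$

$$\theta = (X^T X)^{-1} X^T y$$

The last step is valid only when the inverse of X<sup>T</sup>X exists.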




In [24]:
import numpy as np
import pandas as pd

In [25]:
class Linear_Regression():
    '''
    Linear Regression model created using only NumPy
    
    Attributes
    ----------
    weights: np.array of floats
        All the parameters of the model (including bias)
    prediction: np.array of floats
        Output of the most recent call to predict()
    '''
    def __init__(self):
        self.weights = None
        self.prediction = None
        
    def train(self,data_X,data_y):
        '''
        Train the model using the given data
        
        Parameters
        ----------
        data_X: np.array, shape = (N, num_features)
            Features from the data, each row is one data point. Assumes that a column of ones was added to data_X
        data_y: np.array, shape = (N, num_targets)
            The target values to predict, each row contains the targets for one data point
        '''
        ########################## Insert code here ##########################


        ########################## GRADIENT DESCENT ##########################
        """
        theta = np.zeros(data_X.shape[1])
        iterations = 50000                  # number of gradient steps
        alpha = 0.0000000069                # learning rate
        for _ in range(iterations):
            predictions = data_X.dot(theta)                                   # predict y with the current weights
            errors = predictions - data_y                                     # difference between predictions and targets
            gradient_step = (alpha / data_X.shape[0]) * data_X.T.dot(errors)  # learning rate times the averaged gradient
            theta = theta - gradient_step                                     # update the weights
        self.weights = theta
        """
        ######################### NORMAL EQUATION ###########################

        self.weights = (np.linalg.pinv(data_X)@data_y)

    def predict(self,x_to_predict):
        '''
        Predict using the given value as input
        
        Assumes that self.train(.,.) has been called before calling this method
        
        Parameters
        ----------
        x_to_predict: np.array, shape = (M, num_features)
            A given list of inputs to predict targets for, each row is one input. Assumes that a column of ones was added similar to the training data
        
        Returns
        -------
        np.array of floats, shape = (M, num_targets)
            Predicted values for each input
        '''
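        ########################## Insert code here ##########################
        # Store the predictions (later cells read self.prediction) and return them as documented
        self.prediction = x_to_predict.dot(self.weights)
        return self.prediction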
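
As a quick sanity check of the pseudo-inverse approach used in `train`, here is a minimal sketch on made-up data. The points lie exactly on the line y = 2x + 1, so the recovered weights should be close to a slope of 2 and a bias of 1. The variable names are chosen for this example only.

```python
import numpy as np

# Made-up points that lie exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Feature matrix with a column of ones for the bias, as train() expects
X = np.column_stack([x, np.ones_like(x)])

# Least-squares weights via the Moore-Penrose pseudo-inverse
weights = np.linalg.pinv(X) @ y

# Predictions are a matrix-vector product, as in predict()
predictions = X.dot(weights)
print(weights)
```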
        

### Import the data and remove useless columns

In [26]:
df = pd.read_csv("train.csv")
df.drop(columns=["Id"],inplace=True)
df.head()

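| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |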


### Handle the missing data (NaNs)

In [27]:
df.drop(columns=df.columns[df.isnull().sum().values>200],inplace=True)
df.dropna(inplace=True)
df.isnull().sum().values

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0])
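
The two-step cleanup used here (drop columns that are mostly missing, then drop the rows that still contain a NaN) can be sketched on a toy frame. The column names and the `max_missing` threshold are hypothetical; the notebook itself uses a threshold of 200.

```python
import numpy as np
import pandas as pd

# Toy frame: column "b" is mostly missing, column "c" has a single NaN
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [5.0, np.nan, 7.0, 8.0],
})

max_missing = 2  # hypothetical threshold for this toy example

# Step 1: drop columns with more than max_missing NaNs
too_sparse = toy.columns[toy.isnull().sum().values > max_missing]
toy = toy.drop(columns=too_sparse)

# Step 2: drop the remaining rows that still contain a NaN
toy = toy.dropna()
print(toy)
```

Dropping the sparse columns first keeps more rows, because a row is no longer discarded just for a value that is missing almost everywhere.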

### Replace categorical data (strings) with numerical values

In [28]:
obj_to_replace = df["MSZoning"].dtype

for column in df.columns:
    if df[column].dtype == obj_to_replace:
        uniques = np.unique(df[column].values)
        for idx,item in enumerate(uniques):
            df[column] = df[column].replace(item,idx)
            
df.head()

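| | MSSubClass | MSZoning | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | ... | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 3 | 8450 | 1 | 3 | 3 | 0 | 4 | 0 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 2 | 2008 | 8 | 4 | 208500 |
| 1 | 20 | 3 | 9600 | 1 | 3 | 3 | 0 | 2 | 0 | 24 | ... | 0 | 0 | 0 | 0 | 0 | 5 | 2007 | 8 | 4 | 181500 |
| 2 | 60 | 3 | 11250 | 1 | 0 | 3 | 0 | 4 | 0 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 9 | 2008 | 8 | 4 | 223500 |
| 3 | 70 | 3 | 9550 | 1 | 0 | 3 | 0 | 0 | 0 | 6 | ... | 272 | 0 | 0 | 0 | 0 | 2 | 2006 | 8 | 0 | 140000 |
| 4 | 60 | 3 | 14260 | 1 | 0 | 3 | 0 | 2 | 0 | 15 | ... | 0 | 0 | 0 | 0 | 0 | 12 | 2008 | 8 | 4 | 250000 |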
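
The same kind of integer encoding can also be written with pandas categoricals. This is a minimal sketch on a made-up frame, not the notebook's method: the column names are hypothetical, and it checks for non-numeric columns instead of comparing against the dtype of `MSZoning`.

```python
import numpy as np
import pandas as pd

# Made-up frame with one string column and one numeric column
toy = pd.DataFrame({
    "zone": ["RL", "RM", "RL", "FV"],
    "area": [8450, 9600, 11250, 9550],
})

for column in toy.columns:
    # Only non-numeric columns are encoded; numeric columns are left alone
    if not pd.api.types.is_numeric_dtype(toy[column]):
        # Categories are sorted, so the codes follow the same order as np.unique
        toy[column] = toy[column].astype("category").cat.codes

print(toy)
```

Note that integer codes give the categories an arbitrary order, which a linear model will treat as a numeric magnitude.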


### Add the bias column (column of ones)

In [29]:
df["bias"] = np.ones(df.shape[0])
df.head()

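| | MSSubClass | MSZoning | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | ... | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | bias |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 3 | 8450 | 1 | 3 | 3 | 0 | 4 | 0 | 5 | ... | 0 | 0 | 0 | 0 | 2 | 2008 | 8 | 4 | 208500 | 1.0 |
| 1 | 20 | 3 | 9600 | 1 | 3 | 3 | 0 | 2 | 0 | 24 | ... | 0 | 0 | 0 | 0 | 5 | 2007 | 8 | 4 | 181500 | 1.0 |
| 2 | 60 | 3 | 11250 | 1 | 0 | 3 | 0 | 4 | 0 | 5 | ... | 0 | 0 | 0 | 0 | 9 | 2008 | 8 | 4 | 223500 | 1.0 |
| 3 | 70 | 3 | 9550 | 1 | 0 | 3 | 0 | 0 | 0 | 6 | ... | 0 | 0 | 0 | 0 | 2 | 2006 | 8 | 0 | 140000 | 1.0 |
| 4 | 60 | 3 | 14260 | 1 | 0 | 3 | 0 | 2 | 0 | 15 | ... | 0 | 0 | 0 | 0 | 12 | 2008 | 8 | 4 | 250000 | 1.0 |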


### Divide the data into training and validation sets, X and y

In [30]:
df = df.sample(frac=1).reset_index(drop=True)
training_df = df[:-100]
val_df = df[-100:]
training_y = training_df["SalePrice"].values
training_X = training_df.drop(columns=["SalePrice"]).values
val_y = val_df["SalePrice"].values
val_X = val_df.drop(columns=["SalePrice"]).values

print(training_X.shape)
print(np.mean(training_y))




(1238, 74)
187199.19870759288
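
Because the shuffle is random, the split (and therefore the printed mean and the fitted weights) changes on every run. If repeatable results are wanted, one option is to pass a fixed `random_state` to `sample`. The sketch below shows this on a toy frame; the frame, the seed value, and `n_val` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data
toy = pd.DataFrame({"feature": np.arange(10), "SalePrice": np.arange(10) * 1000})

n_val = 3  # hypothetical validation size; the notebook holds out 100 rows

# A fixed random_state makes the shuffle repeatable between runs
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)
train_df = shuffled[:-n_val]
val_df = shuffled[-n_val:]

# Separate the features from the target, as in the cell above
train_X = train_df.drop(columns=["SalePrice"]).values
train_y = train_df["SalePrice"].values
print(train_X.shape, len(val_df))
```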


### Train the linear regressor

In [45]:
# Create and fit the model
LR_regressor = Linear_Regression()
LR_regressor.train(training_X,training_y)
print("Weights:\n",LR_regressor.weights)


# Calculate Mean Absolute Error (Easier to interpret than MSE)
########################## Insert code here ##########################
LR_regressor.predict(training_X)
predictions = LR_regressor.prediction
errors = np.absolute(np.subtract(predictions, training_y))
MAE = np.mean(errors , dtype=np.float64)
print ("\nMAE:",MAE)

Weights:
 [-1.21894235e+02 -9.86939435e+02  3.39776044e-01  3.73841697e+04
 -1.30540010e+03  4.34321056e+03 -4.97188124e+04  3.05335077e+02
  6.44634443e+03  4.22556961e+02 -5.63039030e+02 -1.15336165e+04
 -1.20903212e+03 -1.65693242e+03  1.25336873e+04  4.78698125e+03
  2.42181101e+02  2.96387310e+01  2.37142596e+03  3.62490837e+03
 -5.93751022e+02 -4.43736522e+01  4.62224756e+03  3.03769233e+01
 -6.86557949e+03 -6.84991467e+02  2.69933823e+03 -9.34947480e+03
  2.82507619e+03 -3.34960782e+03 -1.03418054e+03  1.76372511e+00
  4.39713080e+02  6.76432760e+00 -3.85563911e+00  4.67246417e+00
 -6.19808531e+03 -7.39141172e+02 -6.73360908e+02  1.37046806e+01
  1.92486156e+01  2.17086676e+01 -1.37734330e+01  2.71838586e+01
  6.84557276e+03  2.06155610e+03  1.12506778e+03 -2.71430127e+03
 -4.09594722e+03 -2.16871376e+04 -8.47969440e+03  3.30394092e+03
  3.96989672e+03  4.09008743e+03  5.38324926e+02 -1.72122568e+02
 -7.14151144e+02  1.47333766e+04  5.17854168e-01  7.38890730e+00
  1.71834327e+0
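
The mean absolute error computed above can be wrapped in a small helper so that the same measurement can be reused, for instance on the held-out `val_X` and `val_y`. This is a minimal sketch: `mean_absolute_error` is a hypothetical helper defined here, and the prices are made up rather than taken from the housing data.

```python
import numpy as np

def mean_absolute_error(predictions, targets):
    """Average absolute difference between predictions and targets."""
    return np.mean(np.abs(np.asarray(predictions) - np.asarray(targets)))

# Made-up prices, not taken from the housing data
targets = np.array([200000.0, 150000.0, 300000.0])
predictions = np.array([210000.0, 140000.0, 330000.0])

mae = mean_absolute_error(predictions, targets)
print(mae)
```

The result is in the same unit as the target (here, a price), which is what makes it easier to interpret than a squared error.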

### Train using the sklearn linear regressor

In [46]:
from sklearn.linear_model import LinearRegression

# Create and fit the model
########################## Insert code here ##########################
LR_sk = LinearRegression().fit(training_X,training_y)
print("Weights:\n", LR_sk.coef_)

# Calculate Mean Absolute Error (Easier to interpret than MSE)
########################## Insert code here ##########################
predictions = LR_sk.predict(training_X)
errors = np.absolute(np.subtract(predictions, training_y))
MAE = np.mean(errors , dtype=np.float64)
print ("\nMAE:",MAE)

Weights:
 [-1.21894235e+02 -9.86939435e+02  3.39776044e-01  3.73841697e+04
 -1.30540010e+03  4.34321056e+03 -4.97188124e+04  3.05335077e+02
  6.44634443e+03  4.22556961e+02 -5.63039030e+02 -1.15336165e+04
 -1.20903212e+03 -1.65693242e+03  1.25336873e+04  4.78698125e+03
  2.42181101e+02  2.96387310e+01  2.37142596e+03  3.62490837e+03
 -5.93751022e+02 -4.43736522e+01  4.62224756e+03  3.03769233e+01
 -6.86557949e+03 -6.84991467e+02  2.69933823e+03 -9.34947480e+03
  2.82507619e+03 -3.34960782e+03 -1.03418054e+03  1.76373776e+00
  4.39713080e+02  6.76434026e+00 -3.85562646e+00  4.67245152e+00
 -6.19808531e+03 -7.39141172e+02 -6.73360908e+02  1.37046806e+01
  1.92486177e+01  2.17086697e+01 -1.37734309e+01  2.71838565e+01
  6.84557276e+03  2.06155610e+03  1.12506778e+03 -2.71430127e+03
 -4.09594722e+03 -2.16871376e+04 -8.47969440e+03  3.30394092e+03
  3.96989672e+03  4.09008743e+03  5.38324926e+02 -1.72122568e+02
 -7.14151144e+02  1.47333766e+04  5.17854168e-01  7.38890730e+00
  1.71834327e+0
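
When comparing the two sets of weights, keep in mind that `LinearRegression` fits its own intercept by default and stores it in `intercept_`, separately from `coef_`. The sketch below, on made-up data, assumes the goal is a like-for-like comparison with the NumPy model, so it passes `fit_intercept=False` and lets the column of ones carry the bias.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points on the line y = 2x + 1, with a column of ones for the bias
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
X = np.column_stack([x, np.ones_like(x)])

# NumPy solution via the pseudo-inverse
weights = np.linalg.pinv(X) @ y

# fit_intercept=False makes scikit-learn learn the bias from the ones column,
# so coef_ has the same layout as the NumPy weights
model = LinearRegression(fit_intercept=False).fit(X, y)
print(weights, model.coef_)
```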

### Predict


In [47]:
LR_regressor.predict(val_X)     #replace "val_X" with data to be predicted
print("Predictions:\n", LR_regressor.prediction)

Predictions:
 [237717.17000674 188723.92248138 354303.46848772  90161.1074796
 124774.63407638  84012.56868291 249589.61938117 262371.39851546
 165877.93430858 243737.22798793 258044.82205052 158087.5228898
 180240.31933074  71149.72709    124948.0501268  256021.09313946
 186966.61382005 107346.1750329   96660.13963967 160940.93970255
  89711.5468953  132211.59523708 387696.54156055 248331.26070044
 200536.92034258 112061.45078237 289590.08583475 183771.29776558
 153744.48707692 122267.87510608 259022.59214379 171183.32948935
 142216.2997219  183298.48048121  92978.73179728  96229.30978422
 146715.06408308 195422.71215417  94319.88085912 130377.25798122
 258771.39690977 144854.25358636 318046.64469027 144823.51287336
 212465.94618778 123201.67604591 153691.39229881 295607.83185401
 156617.16278985 110152.79641707 157408.44331726 207788.69213832
 145396.00268599 122370.07820828 223526.88156574 206821.77056479
 198801.63437928 210468.74374412 229520.3489021   76022.55119221
 337065.69213