## Linear Regression
An estimator is any model that tries to estimate a variable y from one or more other variables x, given pairs of data (x<sub>1</sub>,y<sub>1</sub>),(x<sub>2</sub>,y<sub>2</sub>),...,(x<sub>N</sub>,y<sub>N</sub>).

Regression is when the targets (y) are quantities (for example, the price of a house rather than cat vs. dog).

An example would be trying to guess how many people will buy ice cream from a shop on a given day from the temperature on that day. Here the target is the number of ice creams sold. It is an integer, but a value such as 14 ice creams does not represent a particular class, so we treat this as a regression problem. In this case x is also a quantity (temperature); however, x does not have to be a quantity, and that case is tackled by approaches other than linear regression.

To estimate the number of ice creams sold from the temperature, we gather data on various days from various places and get the following data.



We can see that a line (or possibly a curve) could fit the data, but how do we find exactly which line?

#### Line Equation Review
A line in 2D can be parametrized by a slope m and a y-intercept b, which gives the equation

$$y = mx+b$$

In the ice cream example, if the temperature on a given day was 30, our prediction for the number of ice creams sold would be m\*30 + b, so we simply need to find m and b.
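As a small illustration of the prediction step, the sketch below evaluates the line equation for a given temperature. The function name `predict_sales` and the values of `m` and `b` are made up for this example; they are not fitted from any data.

```python
def predict_sales(temperature, m, b):
    """Predict the number of ice creams sold using the line y = m*x + b."""
    return m * temperature + b


# Hypothetical slope and intercept, chosen only for illustration
m = 2.0
b = -30.0

prediction = predict_sales(30, m, b)
print(prediction)  # m*30 + b = 2.0*30 - 30.0 = 30.0
```

Different choices of m and b give different predictions for the same temperature, which is why we need a way to compare lines.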

#### Squared Loss
We need some measurement of how good our line is to be able to find the "best" line, so we will measure the difference between our predictions and the correct values from the data.


Then, we will square all the differences and add them up. This will be the value we try to minimize.
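Written out for a line with slope m and intercept b, the quantity described above is

$$L(m, b) = \sum_{i=1}^{N} (mx_i + b - y_i)^2$$

A minimal NumPy sketch of this computation follows. The function name `squared_loss` and the data points are assumptions made up for illustration, not values from the notebook's dataset.

```python
import numpy as np


def squared_loss(m, b, x, y):
    """Sum of squared differences between predictions m*x + b and targets y."""
    predictions = m * x + b
    differences = predictions - y
    return np.sum(differences ** 2)


# Made-up temperatures and ice cream counts, for illustration only
x = np.array([20.0, 25.0, 30.0])
y = np.array([10.0, 15.0, 22.0])

loss_a = squared_loss(1.0, -10.0, x, y)  # predictions: 10, 15, 20
loss_b = squared_loss(0.0, 0.0, x, y)    # predictions: 0, 0, 0
print(loss_a, loss_b)
```

The first line has a much smaller loss than the second, so by this measurement it is the better fit to these three points.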

#### Note
The normal equation will probably not work. It is left for you to figure out why and let everyone else know; let's see who figures it out first (it's a tricky issue).
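One alternative to the normal equation is gradient descent, which is the approach the `train` method further down takes. The sketch below shows the same update rule on a tiny synthetic dataset. The function name `gradient_descent`, the data, the learning rate, and the iteration count are all assumptions chosen for this toy example, not recommended settings for the housing data.

```python
import numpy as np


def gradient_descent(X, y, alpha, iterations):
    """Minimize the mean squared error of X @ theta against y."""
    theta = np.zeros(X.shape[1])
    n = X.shape[0]
    for _ in range(iterations):
        errors = X @ theta - y
        # Step against the gradient of the (halved) mean squared error
        theta = theta - (alpha / n) * (X.T @ errors)
    return theta


# Synthetic data lying exactly on the line y = 3x + 2
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 2.0
X = np.column_stack([x, np.ones_like(x)])  # second column is the bias column

theta = gradient_descent(X, y, alpha=0.5, iterations=2000)
print(theta)  # approximately [3, 2]
```

The first entry of `theta` plays the role of the slope m and the second the role of the intercept b, because the bias column of ones is the second column of `X`.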




In [None]:
import numpy as np
import pandas as pd

In [None]:
class Linear_Regression():
    '''
    Linear Regression model created using only NumPy
    
    Attributes
    ----------
    weights: np.array of floats
        All the parameters of the model (including bias)
    '''
    def __init__(self):
        self.weights = None
        self.prediction = None
        
    def train(self,data_X,data_y):
        '''
        Train the model using the given data
        
        Parameters
        ----------
        data_X: np.array, shape = (N, num_features)
            Features from the data, each row is one data point. Assumes that a column of ones was added to data_X
        data_y: np.array, shape = (N, num_targets)
            The target values to predict, each row contains the targets for one data point
        '''
        ########################## Insert code here ##########################
        # Batch gradient descent on the squared loss, starting from all-zero weights
        theta = np.zeros(data_X.shape[1])
        iterations = 50000
        alpha = 0.0000000069  # learning rate (step size)
        num_points = data_X.shape[0]
        for _ in range(iterations):
            predictions = data_X.dot(theta)
            errors = predictions - data_y
            # Move the weights against the gradient of the loss
            theta = theta - (alpha / num_points) * data_X.T.dot(errors)
        self.weights = theta

    def predict(self,x_to_predict):
        '''
        Predict using the given value as input
        
        Assumes that self.train(.,.) has been called before calling this method
        
        Parameters
        ----------
        x_to_predict: np.array, shape = (M, num_features)
            A given list of inputs to predict targets for, each row is one input. Assumes that a column of ones was added similar to the training data
        
        Returns
        -------
        np.array of floats, shape = (M, num_targets)
            Predicted values for each input
        '''
        ########################## Insert code here ##########################
        self.prediction = np.dot(x_to_predict,self.weights)
        return self.prediction
        

### Import the data and remove useless columns

In [None]:
df = pd.read_csv("train.csv")
df.drop(columns=["Id"],inplace=True)
df.head()

### Handle the missing data (NaNs)

In [None]:
df.drop(columns=df.columns[df.isnull().sum().values>200],inplace=True)
df.dropna(inplace=True)
df.isnull().sum().values

### Replace categorical data (strings) with numerical values

In [None]:
obj_to_replace = df["MSZoning"].dtype

# Map each string column to integer codes (0, 1, 2, ...) following the sorted order of its unique values
for column in df.columns:
    if df[column].dtype == obj_to_replace:
        df[column] = pd.factorize(df[column], sort=True)[0]
            
df.head()

### Add the bias column (column of ones)

In [None]:
df["bias"] = np.ones(df.shape[0])
df.head()

### Divide the data into training, testing, X, and y

In [None]:
df = df.sample(frac=1).reset_index(drop=True)
training_df = df[:-100]
val_df = df[-100:]
training_y = training_df["SalePrice"].values
training_X = training_df.drop(columns=["SalePrice"]).values
val_y = val_df["SalePrice"].values
val_X = val_df.drop(columns=["SalePrice"]).values

print(training_X.shape)
print(np.mean(training_y))




### Train the linear regressor

In [None]:
# Create and fit the model
LR_regressor = Linear_Regression()
LR_regressor.train(training_X,training_y)
print(LR_regressor.weights)


# Calculate Mean Absolute Error (Easier to interpret than MSE)
########################## Insert code here ##########################
# Computed on the training data
train_predictions = training_X.dot(LR_regressor.weights)
error = np.mean(np.abs(train_predictions - training_y))
print(error)

### Train using the sklearn linear regressor

In [None]:
from sklearn.linear_model import LinearRegression

# Create and fit the model
########################## Insert code here ##########################
LR_sk = LinearRegression().fit(training_X,training_y)
print(LR_sk.coef_)

# Calculate Mean Absolute Error (Easier to interpret than MSE)
########################## Insert code here ##########################
# LinearRegression fits its own intercept by default, so use predict(),
# which includes intercept_, instead of coef_ alone
train_predictions = LR_sk.predict(training_X)
error = np.mean(np.abs(train_predictions - training_y))
print(error)