### Regression
* Predict an output that lies in a continuous domain (a real-valued number rather than a class label)


### Assignment 
* Apply the regressor to scikit-learn's California Housing dataset, with the median house value as the target
* Look into which statistics are available for analyzing the performance of your model
* Using those statistics, attempt to hand-tune your model (don't spend too much time on this!)
* After tuning your model, come up with an explanation for why the model performed as it did
* Be prepared to explain your regressor to the rest of the group

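The statistics used later in this notebook can also be computed by hand, which may help when interpreting them. The following is a minimal sketch; the arrays `y_true` and `y_pred` are made-up values for illustration and are not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

# Made-up values for illustration only (not taken from the housing data).
y_true = np.array([3.0, 1.5, 2.0, 4.5])
y_pred = np.array([2.5, 1.0, 2.5, 4.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)                        # mean squared error
mae = np.mean(np.abs(errors))                     # mean absolute error
mape = np.mean(np.abs(errors) / np.abs(y_true))   # a fraction, not a percent
# R^2 compares the model's squared error with that of always predicting the mean.
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The hand-computed values should agree with scikit-learn's implementations.
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mape, mean_absolute_percentage_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))

print(mse, mae, mape, r2)
```

Two points are worth keeping in mind when reading results: a negative R2 means the model does worse than always predicting the mean of the target, and `mean_absolute_percentage_error` returns a fraction, so 0.21 corresponds to 21%.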
### SGD
* SGD stands for Stochastic Gradient Descent
* The goal is to fit a linear regression model to the data
* Draw a single sample or a small batch of the data at random (the stochastic part)
* Predict on that batch, then compute the loss and the gradient of the loss with respect to the coefficients
* Adjust the coefficients a small step against the gradient, and repeat until the loss stops decreasing

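The SGD steps listed above can be sketched in plain NumPy. This is an illustration of the general idea under stated assumptions, not scikit-learn's implementation: the function name `sgd_linear_regression` and its parameters are hypothetical, the loss is the mean squared error, and the learning rate is held constant. Note that `SGDRegressor` updates on one sample at a time, whereas this sketch uses small batches. The data is synthetic, so the fitted coefficients can be compared with the known ones (`true_w` and an intercept of 3.0).

```python
import numpy as np


def sgd_linear_regression(X, y, lr=0.05, batch_size=16, epochs=50, seed=0):
    """Fit y ~ X @ w + b with mini-batch SGD on squared error (illustrative)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n_samples)           # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]    # one random mini-batch
            residual = X[idx] @ w + b - y[idx]       # prediction error on the batch
            grad_w = 2.0 * X[idx].T @ residual / len(idx)
            grad_b = 2.0 * residual.mean()
            w -= lr * grad_w                         # step against the gradient
            b -= lr * grad_b
    return w, b


# Synthetic data with known coefficients and a little noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 3.0 + rng.normal(scale=0.1, size=500)

w, b = sgd_linear_regression(X, y)
print(w, b)
```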
In [2]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, mean_absolute_percentage_error
import pandas as pd 



In [4]:
data = fetch_california_housing()
X = data.data
y = data.target


In [5]:
df = pd.DataFrame(X, columns = data.feature_names)
df['target'] = y
df.head()

|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 16)

### Loss Functions:
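* Squared error: square the difference between the actual and predicted values
* Huber: quadratic for small errors and linear for large ones, which reduces the influence of outliers
* Epsilon-insensitive: errors smaller than epsilon incur zero loss; larger errors are penalized linearly
* Squared epsilon-insensitive: like epsilon-insensitive, but errors beyond epsilon are penalized quadratically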

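To make the differences between these losses concrete, here is a minimal NumPy sketch of each one as a function of the residual `r` (prediction minus target). These are the textbook forms; constant factors may differ from scikit-learn's internals. The default `eps=0.1` mirrors the default `epsilon` of `SGDRegressor`, and the sample residuals are made up. Evaluating the functions on a small, a medium, and a large residual shows how strongly each loss reacts to an outlier.

```python
import numpy as np


def squared_error(r):
    # Grows quadratically, so large residuals dominate the total loss.
    return r ** 2


def huber(r, eps=0.1):
    # Quadratic inside [-eps, eps], linear outside.
    a = np.abs(r)
    return np.where(a <= eps, 0.5 * r ** 2, eps * a - 0.5 * eps ** 2)


def epsilon_insensitive(r, eps=0.1):
    # Zero inside the eps tube, linear outside.
    return np.maximum(0.0, np.abs(r) - eps)


def squared_epsilon_insensitive(r, eps=0.1):
    # Zero inside the eps tube, quadratic outside.
    return np.maximum(0.0, np.abs(r) - eps) ** 2


# Made-up residuals: small, medium, and outlier-sized.
residuals = np.array([0.05, 0.5, 5.0])
for fn in (squared_error, huber, epsilon_insensitive, squared_epsilon_insensitive):
    print(fn.__name__, fn(residuals))
```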
In [7]:
# Loss functions and learning-rate schedules to compare
lossf = ['squared_error', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive']
learning_rates = ['constant', 'optimal', 'invscaling', 'adaptive']
stats = pd.DataFrame(columns = ["loss function", "learning rate", "MSE", "Absolute_Error_Percentage", "R2"])
for l in lossf: 
    for lr in learning_rates:
        # Note: the features are not standardized here, and SGD is sensitive to feature scale
        model = SGDRegressor(loss = l, learning_rate= lr)
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        r2 = r2_score(y_pred = pred, y_true = y_test)
        # MAPE is returned as a fraction, not a percentage
        mean_abs = mean_absolute_percentage_error(y_test, pred)
        mse = mean_squared_error(y_test, pred)
        # Append one row per (loss, learning rate) combination
        stats.loc[len(stats.index)] = [l, lr, mse, mean_abs, r2]





In [8]:

stats.head(20)



|   | loss function | learning rate | MSE | Absolute_Error_Percentage | R2 |
|---|---|---|---|---|---|
| 0 | squared_error | constant | 1.258596e+32 | 5.861799e+15 | -9.409286e+31 |
| 1 | squared_error | optimal | 1.316212e+30 | 5.306295e+14 | -9.840026e+29 |
| 2 | squared_error | invscaling | 6.12045e+29 | 3.709561e+14 | -4.575661e+29 |
| 3 | squared_error | adaptive | 1.048396e+27 | 1.268396e+12 | -7.837828e+26 |
| 4 | huber | constant | 657999.2 | 393.1735 | -491920.5 |
| 5 | huber | optimal | 14363.77 | 56.73049 | -10737.38 |
| 6 | huber | invscaling | 573.8458 | 12.15897 | -428.0083 |
| 7 | huber | adaptive | 0.8200067 | 0.4069983 | 0.3869614 |
| 8 | epsilon_insensitive | constant | 27760350.0 | 2992.874 | -20753690.0 |
| 9 | epsilon_insensitive | optimal | 720181.7 | 398.9154 | -538408.2 |


### GradientBoostingRegressor

In [9]:
model =  GradientBoostingRegressor()
model.fit(X_train, y_train)
pred = model.predict(X_test)
r2 = r2_score(y_pred = pred, y_true = y_test)
mean_abs = mean_absolute_percentage_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
print(r2)        # R2 score
print(mean_abs)  # mean absolute percentage error (as a fraction)
print(mse)       # mean squared error

0.791824831446877
0.21073877246276435
0.27845722314219257


### AdaBoostRegressor

In [11]:
model =  AdaBoostRegressor()
model.fit(X_train, y_train)
pred = model.predict(X_test)
r2 = r2_score(y_pred = pred, y_true = y_test)
mean_abs = mean_absolute_percentage_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
print(r2)        # R2 score
print(mean_abs)  # mean absolute percentage error (as a fraction)
print(mse)       # mean squared error

0.3891381433190342
0.5904294290245115
0.817095033558045


### SVR

In [7]:
model = SVR(kernel='linear')
model.fit(X_train, y_train)
pred = model.predict(X_test)
r2 = r2_score(y_pred = pred, y_true = y_test)
mean_abs = mean_absolute_percentage_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
print(r2)        # R2 score
print(mean_abs)  # mean absolute percentage error (as a fraction)
print(mse)       # mean squared error
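
The three model cells above repeat the same fit, predict, and score steps. One possible way to collect them is a small helper that returns one row of metrics per model. The name `evaluate_models` is hypothetical, and the demo uses synthetic data with cheap model settings so that it runs quickly; with the notebook's own `X_train`, `X_test`, `y_train`, and `y_test` it could be called in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)
from sklearn.model_selection import train_test_split


def evaluate_models(models, X_train, X_test, y_train, y_test):
    """Hypothetical helper: fit each model and tabulate its test metrics."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rows.append({
            "model": name,
            "R2": r2_score(y_test, pred),
            "MAPE": mean_absolute_percentage_error(y_test, pred),
            "MSE": mean_squared_error(y_test, pred),
        })
    return pd.DataFrame(rows)


# Synthetic demo data with strictly positive targets, so MAPE is well behaved.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(400, 3))
y_demo = 5.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=400)
splits = train_test_split(X_demo, y_demo, test_size=0.25, random_state=16)

results = evaluate_models(
    {
        "linear": LinearRegression(),
        "gradient_boosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
    },
    *splits,  # order: X_train, X_test, y_train, y_test
)
print(results)
```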