# Supervised Learning with scikit-learn

## Regression

#### Regression mechanics
* y = ax + b
  + Simple linear regression uses one feature
    - y = target
    - x = single feature
    - a, b = parameters/coefficients of the model: slope and intercept
* How do we choose a and b?
  + Define an error function for any given line
  + Choose the line that minimizes the error function
* Error function = loss function = cost function
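
The idea of scoring candidate lines with an error function can be sketched in plain Python. The toy data, the candidate (a, b) pairs, and the helper name `sum_squared_error` are assumptions made for illustration. The sum of squared differences is one common choice of error function, not the only one.

```python
# Hypothetical toy data generated from y = 2x + 1 (not the diabetes dataset).
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]


def sum_squared_error(a, b, xs, ys):
    """Sum of squared vertical distances between the line y = a*x + b and the data."""
    return sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(xs, ys))


# Score a few candidate lines and keep the one with the smallest error.
candidates = [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
errors = {(a, b): sum_squared_error(a, b, x, y) for a, b in candidates}
best = min(errors, key=errors.get)

for (a, b), err in errors.items():
    print(f"a={a}, b={b}: error={err}")
print("best candidate:", best)
```

Here the candidate (2.0, 1.0) reproduces the data exactly, so its error is zero and it is selected.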

In [6]:
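import pandas as pd

# Load the cleaned diabetes dataset and preview the first rows
diabetes_df = pd.read_csv("../datasets/diabetes_clean.csv")
display(diabetes_df.head())

# Features: every column except the target; target: the glucose column
X = diabetes_df.drop("glucose", axis=1).values
y = diabetes_df["glucose"].values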

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


#### The loss function

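#### Ordinary Least Squares
* Residual Sum of Squares (RSS) = $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $\hat{y}_i$ is the model's prediction for observation $i$
* Ordinary Least Squares (OLS) = minimize RSS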
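
For a single feature, the a and b that minimize RSS have a well-known closed form: a is the covariance of x and y divided by the variance of x, and b makes the line pass through the point of means. A minimal sketch on made-up data follows; the function names `fit_simple_ols` and `rss` are hypothetical helpers, not part of any library.

```python
def fit_simple_ols(xs, ys):
    """Return (a, b) minimizing RSS for the line y = a*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    a = s_xy / s_xx            # slope
    b = y_mean - a * x_mean    # intercept
    return a, b


def rss(a, b, xs, ys):
    """Residual sum of squares of the line y = a*x + b on the data."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Illustrative data only.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 4.0, 6.0]

a, b = fit_simple_ols(xs, ys)
best_rss = rss(a, b, xs, ys)
print(a, b, best_rss)  # a ≈ 1.6, b ≈ 1.1, RSS ≈ 0.2
```

Nudging either parameter away from the fitted values increases RSS, which is what "minimize RSS" means in practice.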

#### Linear regression in higher dimensions
y = a1x1 + a2x2 + b
* To fit a linear regression model here:
  + Need to specify 3 parameters: a1, a2, b
* In higher dimensions:
  + Known as multiple regression
  + Must specify a coefficient for each feature, plus the intercept b
    - y = a1x1 + a2x2 + a3x3 + ... + anxn + b
* scikit-learn works exactly the same way:
  + Pass two arrays: features and target
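
As a small illustration of passing a feature array and a target array, the sketch below fits `LinearRegression` on synthetic, noise-free data generated from y = 3*x1 - 2*x2 + 5. The data and the names `X_demo`, `y_demo`, and `model` are assumptions for the example. Because there is no noise, the fitted coefficients should recover the generating values up to floating-point error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only: y = 3*x1 - 2*x2 + 5, no noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))                  # features: (n_samples, n_features)
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5   # target: (n_samples,)

model = LinearRegression()
model.fit(X_demo, y_demo)

coefficients = model.coef_     # one coefficient per feature (a1, a2)
intercept = model.intercept_   # the constant term b
print(coefficients, intercept)
```

The same two-array call works for any number of feature columns; only the width of the feature array changes.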

#### Linear regression using all features

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
reg_all = LinearRegression() # performs OLS under the hood
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)
display(y_pred)

array([119.91303675,  95.70325357, 104.63962314, 114.05040231,
       118.58321727, 127.12240463, 101.32326148, 109.68914522,
       114.03971568, 123.60324707, 124.99713549, 112.30131897,
       151.83921306, 120.08016862,  99.38245855, 137.26590828,
        98.92170524, 103.54780373, 124.48277405, 138.98291632,
       114.94730968, 108.01568548,  94.81640369, 104.61880645,
       115.5808436 , 129.44870188,  94.22207529, 105.35660341,
       135.62224377, 116.96796741, 132.2906766 , 152.69963123,
       149.45755335, 128.2848374 , 129.10818568, 144.78293146,
       112.09337022, 133.52916199, 113.45464867, 126.11093124,
       102.62046077, 120.53135451, 114.54168207, 137.65315875,
       107.95068535, 148.70812727, 155.6870585 ,  98.79913503,
       121.11416209, 145.42843867,  94.44116407, 150.979827  ,
       164.44641014, 133.12283053, 116.16435787, 100.29525093,
       120.43548843,  86.16602491, 110.89528052, 136.71958446,
       129.92989495, 100.63738165, 150.5625166 , 162.71

#### R-squared
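* R^2: quantifies the proportion of variance in the target values that is explained by the features
  + Values typically range from 0 to 1; a model that fits worse than always predicting the mean of the target scores below 0
* A high R^2 means the fitted line tracks the data closely; a low R^2 means it explains little of the variation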
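
R^2 can be computed by hand as 1 - RSS / TSS, where TSS is the total sum of squares of the target around its mean. The sketch below uses made-up values, and `r_squared` is a hypothetical helper written for this example.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS / TSS, with TSS measured around the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    tss = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - rss / tss


# Illustrative values only.
y_true = [3.0, 5.0, 7.0, 9.0]
close_fit = [2.5, 5.5, 7.0, 9.0]   # predictions near the true values
mean_only = [6.0, 6.0, 6.0, 6.0]   # always predict the mean of y_true

r2_close = r_squared(y_true, close_fit)  # about 0.975
r2_mean = r_squared(y_true, mean_only)   # 0.0
print(r2_close, r2_mean)
```

Predicting the mean for every observation explains none of the variance, which is why it serves as the zero point of the scale.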

#### R-squared in scikit-learn

In [10]:
display(reg_all.score(X_test, y_test))


0.28280468810375115

#### Mean squared error and root mean squared error

* MSE = $\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $y_i$ is the actual value and $\hat{y}_i$ the predicted value
  + MSE is measured in target units, squared
* RMSE = $\sqrt{MSE}$
  + RMSE is measured in the same units as the target variable
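
A short worked example makes the units point concrete. The values below are invented for illustration. The errors are 6, -8, 0, and 0, so MSE is (36 + 64) / 4 = 25 in squared target units and RMSE is 5 in target units.

```python
import math

# Illustrative values only.
y_true = [100.0, 120.0, 140.0, 160.0]
y_hat = [94.0, 128.0, 140.0, 160.0]

squared_errors = [(yt - yp) ** 2 for yt, yp in zip(y_true, y_hat)]
mse = sum(squared_errors) / len(squared_errors)   # mean of the squared errors
rmse = math.sqrt(mse)                             # back in the units of the target

print(mse, rmse)  # 25.0 5.0
```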

#### RMSE in scikit-learn

In [11]:
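import numpy as np
from sklearn.metrics import mean_squared_error
# squared=False is deprecated in recent scikit-learn releases, so take the square root explicitly
display(np.sqrt(mean_squared_error(y_test, y_pred)))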

26.341459582232265

In [29]:
import numpy as np

numpy1 = np.array([17.2, 20.0, 8.25, 9.50])
numpy2 = np.array([13.0, 24.0, 8.25, 9.0])

display(numpy1[numpy1 == numpy2])

array([8.25])

In [53]:
logins = pd.DataFrame([['a', 'JAN', 7, 2015, 17357], ['b', 'FEB', 8, 2015, 10011]])
logins.columns = ["", "MONTH", "day", 'year', 'session_id']
# Vectorized string method: lowercase every month name without looping over rows
logins['month'] = logins['MONTH'].str.lower()
display(logins)

Unnamed: 0,Unnamed: 1,MONTH,day,year,session_id,month
0,a,JAN,7,2015,17357,jan
1,b,FEB,8,2015,10011,feb
