# Applied Machine Learning with scikit-learn - Regressions

*Adapted from https://github.com/justmarkham*

### Libraries

- [scikit-learn](http://scikit-learn.org/stable/)
- pandas
- matplotlib

In this tutorial we will see some basic examples of Linear Regression for prediction and Logistic Regression for classification.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from pandas.plotting import scatter_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
import seaborn as sns
%matplotlib inline

# Prediction with Linear Regression

|     *            | continuous     | categorical    |
| ---------------- | -------------- | -------------- |
| **supervised**   | **regression** | classification |
| **unsupervised** | dim. reduction | clustering     |

### Motivation

Why are we learning linear regression?
- widely used
- runs fast
- easy to use (not a lot of tuning required)
- highly interpretable
- basis for many other methods


Let's import the dataset:

In [None]:
data = pd.read_csv('Advertising.csv', index_col=0)
data.head()

What are the **features**?
- TV: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
- Radio: advertising dollars spent on Radio
- Newspaper: advertising dollars spent on Newspaper

What is the **response**?
- Sales: sales of a single product in a given market (in thousands of widgets)

In [None]:
# print the shape of the DataFrame
data.shape

In [None]:
# visualize the relationship between the features and the response using scatterplots
fig, axs = plt.subplots(1, 3, sharey=True)
data.plot(kind='scatter', x='TV', y='sales', ax=axs[0], figsize=(16, 8), grid=True)
data.plot(kind='scatter', x='radio', y='sales', ax=axs[1], grid=True)
data.plot(kind='scatter', x='newspaper', y='sales', ax=axs[2], grid=True)

## Estimating ("Learning") Model Coefficients

Generally speaking, coefficients are estimated using the **least squares criterion**, which means we find the line that minimizes the **sum of squared residuals** (or "sum of squared errors"):

<img src="08_estimating_coefficients.png">

What elements are present in the diagram?
- The black dots are the **observed values** of x and y.
- The blue line is our **least squares line**.
- The red lines are the **residuals**, which are the vertical distances between the observed values and the least squares line.

How do the model coefficients relate to the least squares line?
- $\beta_0$ is the **intercept** (the value of $y$ when $x$=0)
- $\beta_1$ is the **slope** (the change in $y$ divided by change in $x$)

Here is a graphical depiction of those calculations:

<img src="08_slope_intercept.png">
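
As a rough illustration of the least squares criterion, the sketch below fits a line to five made-up points using the closed-form formulas for a single feature, then computes the sum of squared residuals. The data and the helper names (`fit_line`, `sum_squared_residuals`) are assumptions for this example; they are not part of the Advertising dataset or of scikit-learn.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least squares line for one feature."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # slope = covariance(x, y) / variance(x)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    # the least squares line always passes through (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up observations, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.2, 7.9, 10.1]

beta_0, beta_1 = fit_line(x, y)
ssr = sum_squared_residuals(x, y, beta_0, beta_1)
print("intercept:", round(beta_0, 3), "slope:", round(beta_1, 3), "SSR:", round(ssr, 3))
```

Any other intercept or slope gives a larger sum of squared residuals on the same points, which is what "least squares" means.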

## Hands on!
Let's create the feature matrix and the response vector (X and y):

In [None]:
feature_cols = ['TV', 'radio', 'newspaper']
X = data[feature_cols]
y = data.sales

X.describe()

**Scikit-learn** provides an easy way to train the model:

In [None]:
linreg = LinearRegression()  # create the model
linreg.fit(X, y)  # train it

Back to the theory! Let's see how the formula looks:

In [None]:
# print each fitted coefficient next to its feature, then the intercept
for name, coef in zip(feature_cols, linreg.coef_):
    print("{0} * {1} + ".format(coef, name))
print(linreg.intercept_)



$$y = \beta_0 + \beta_1 \times TV + \beta_2 \times radio + \beta_3 \times newspaper$$
$$y = 2.938 + 0.045 \times TV + 0.18 \times radio - 0.001 \times newspaper$$
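
To make the formula concrete, here is a small worked example that plugs a budget into the fitted equation. The coefficients are the rounded values from the equation above; the budget itself is hypothetical and chosen only for illustration.

```python
# Rounded coefficients copied from the fitted equation above
intercept = 2.938
coefficients = {"TV": 0.045, "radio": 0.18, "newspaper": -0.001}

# Hypothetical advertising budget in thousands of dollars (made up)
budget = {"TV": 100, "radio": 25, "newspaper": 10}

# y = beta_0 + sum of (coefficient * spend) over the three features
predicted_sales = intercept + sum(
    coefficients[name] * budget[name] for name in coefficients
)
print(round(predicted_sales, 3))
```

With these inputs the TV and radio terms each contribute 4.5, while the newspaper term barely changes the result, which matches its near-zero coefficient.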

Let's plot the predictions and the original values:

In [None]:
lr = LinearRegression()

# cross_val_predict returns an array of the same size as `y` where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(lr, X, y, cv=5)

# Plot the results
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(y, predicted, edgecolors=(0, 0, 0))
ax.plot([min(y), max(y)], [min(y), max(y)], 'r--', lw=4)
ax.set_xlabel('Original')
ax.set_ylabel('Predicted')
plt.show()
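
The plot can be complemented with a couple of numbers. The sketch below computes the root mean squared error and the coefficient of determination from a list of observed values and a list of predictions. The helper names (`rmse`, `r_squared`) and the sample values are assumptions for this example; in the notebook you would pass `y` and `predicted` instead.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: the typical size of a prediction error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """Share of the variance in `actual` that the predictions explain."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up values standing in for `y` and `predicted`
actual_sales = [22.1, 10.4, 9.3, 18.5, 12.9]
predicted_sales = [20.5, 12.3, 12.3, 17.6, 13.2]

print("RMSE:", round(rmse(actual_sales, predicted_sales), 2))
print("R^2: ", round(r_squared(actual_sales, predicted_sales), 2))
```

Points lying exactly on the dashed diagonal would give an RMSE of 0 and an R² of 1.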

# Classification with Logistic Regression

|*|continuous|categorical|
|---|---|---|
|**supervised**|regression|**classification**|
|**unsupervised**|dim. reduction|clustering|

Let's go back to the Titanic dataset. We are interested in predicting the 'survived' variable given the features of a passenger. For the sake of simplicity, we consider only 4 features:

- pclass
- sex
- age
- fare

In [None]:
titanic_raw = pd.read_excel('titanic.xls')
titanic = titanic_raw[['pclass', 'sex', 'age', 'fare', 'survived']].dropna(axis=0, how='any')
titanic.head()

In [None]:
dead = titanic[titanic['survived']==0]
survived = titanic[titanic['survived']==1]

print("Survived {0}, Dead {1}".format(len(survived), len(dead)))

Specify the columns to use as features and the labels for the training:

In [None]:
titanic_features = ['pclass', 'sex', 'age', 'fare']
titanic_class = 'survived'

#### Q: How does the age distribution differ between the two groups?

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5));

dead_age = dead[['age']]
survived_age = survived[['age']]

dead_age.plot.hist(ax=axes[0], ylim=(0, 150), title='Dead - Age')
survived_age.plot.hist(ax=axes[1], ylim=(0, 150), title='Survived - Age')


There is a visible difference for young children.

### Let's prepare the feature vector for the training

The dataset contains a categorical variable: sex (male|female).

We need to convert it to a numeric vector format. Pandas offers the function *get_dummies*, which takes care of this expansion:

In [None]:
# The features vector
X = pd.get_dummies(titanic[titanic_features])
X.head()
# titanic['pclass'] = titanic['pclass'].astype('category')
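
To see what the expansion does, here is a minimal sketch on three made-up passengers with the same four columns. The values are invented for illustration; only `pd.get_dummies` itself comes from the cell above.

```python
import pandas as pd

# Three made-up passengers with the same columns as titanic[titanic_features]
passengers = pd.DataFrame({
    "pclass": [1, 3, 2],
    "sex": ["female", "male", "female"],
    "age": [29.0, 25.0, 40.0],
    "fare": [211.3, 7.9, 26.0],
})

# The text column `sex` is replaced by one indicator column per category
encoded = pd.get_dummies(passengers)
print(list(encoded.columns))
print(encoded)
```

The numeric columns keep their relative order and the indicator columns are appended at the end, so the resulting layout is `pclass, age, fare, sex_female, sex_male`. This is the order to follow when a sample is written as a plain list of numbers.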

The labels used for the training:

In [None]:
y = titanic['survived']

Let's create a new model...

In [None]:
logistic = LogisticRegression(solver='lbfgs')

# for f in range(len(feature_cols)):
#     print("{0} * {1} + ".format(logistic.coef_[f], feature_cols[f]))
print(logistic)

... and evaluate the precision and recall with cross-validation (10 splits).

**Scikit-Learn** offers a convenient function, *cross_val_score*, to split the dataset and evaluate the performance.

In [None]:
precision = cross_val_score(logistic, X, y, cv=10, scoring="precision")
recall = cross_val_score(logistic, X, y, cv=10, scoring="recall")

# Precision: avoid false positives
print("Precision: %0.2f (+/- %0.2f)" % (precision.mean(), precision.std() * 2))
# Recall: avoid false negatives
print("Recall: %0.2f (+/- %0.2f)" % (recall.mean(), recall.std() * 2))
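
As a reminder of what the two scores measure, the sketch below computes them by hand from true and predicted labels. The helper name `precision_recall` and the label lists are made up for this example; the cross-validation above computes the same quantities on each of the 10 splits.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp)  # how many predicted survivors really survived
    recall = tp / (tp + fn)     # how many real survivors were found
    return precision, recall


# Made-up labels: 1 = survived, 0 = dead
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]

precision_value, recall_value = precision_recall(y_true, y_pred)
print("Precision: %0.2f" % precision_value)
print("Recall: %0.2f" % recall_value)
```

Here 3 of the 5 predicted survivors are correct (precision 0.60) and 3 of the 4 real survivors are found (recall 0.75).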

### Explore the model output

Let's train on the full dataset

In [None]:
logistic = LogisticRegression(solver='lbfgs')
logistic.fit(X, y)

Given one sample, logistic regression generates the probability of belonging to the positive class. With **Scikit-Learn** we can access this value with the method *predict_proba*, which returns one probability per class for each sample:

In [None]:
pred = logistic.predict_proba(X)
pred
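
Under the hood, a binary logistic regression turns a weighted sum of the features into a probability with the sigmoid function. The sketch below shows that calculation for one passenger. The intercept and weights are hypothetical values picked for illustration, NOT the coefficients learned by the model above.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical intercept and weights (assumptions, not the fitted values)
intercept = 1.0
weights = {"pclass": -1.0, "age": -0.03, "fare": 0.001,
           "sex_female": 1.2, "sex_male": -1.2}

# One artificial passenger: third class, 25 years old, male
passenger = {"pclass": 3, "age": 25, "fare": 200, "sex_female": 0, "sex_male": 1}

# Linear score, then sigmoid to get the probability of the positive class
score = intercept + sum(weights[name] * passenger[name] for name in weights)
p_survived = sigmoid(score)
p_dead = 1.0 - p_survived
print([p_dead, p_survived])
```

Each row returned by *predict_proba* has the same shape as the printed pair: the columns follow the order of `logistic.classes_`, so here the first value is the probability of class 0 (dead) and the second of class 1 (survived).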

Of course, since we trained on the whole dataset, we don't have new samples to predict, but we can predict the outcome and the corresponding probability for some artificial samples. Would you survive?

In [None]:
X.columns

In [None]:
logistic.predict([[3, 25, 200, 0, 1]])

In [None]:
logistic.predict_proba([[3, 25, 200, 0, 1]])

In [None]:
logistic.predict([[3, 25, 200, 1, 0]])

In [None]:
logistic.predict_proba([[3, 25, 200, 1, 0]])
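
Writing samples as bare lists makes it easy to mix up the column order. One possible alternative, sketched below, is to build the sample as a one-row DataFrame with named columns and reorder it to match the training columns. The tiny training set is made up so that the example runs on its own; in the notebook you would reuse `X`, `y` and the fitted `logistic` instead of `train`, `labels` and `model`.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set with the same column layout as X
train = pd.DataFrame({
    "pclass": [1, 3, 2, 3, 1, 3, 2, 1],
    "age": [29, 25, 40, 19, 52, 33, 8, 61],
    "fare": [211.3, 7.9, 26.0, 8.1, 78.3, 7.2, 29.0, 30.0],
    "sex_female": [1, 0, 1, 0, 1, 0, 1, 0],
    "sex_male": [0, 1, 0, 1, 0, 1, 0, 1],
})
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # made-up outcomes

model = LogisticRegression(solver="lbfgs", max_iter=1000)
model.fit(train, labels)

# Artificial passenger described by name rather than by position
sample = pd.DataFrame([{"sex_male": 1, "sex_female": 0,
                        "fare": 200, "age": 25, "pclass": 3}])
sample = sample[train.columns]  # enforce the training column order

proba = model.predict_proba(sample)
print(model.predict(sample), proba)
```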