<a href="https://colab.research.google.com/github/zerotodeeplearning/ztdl-masterclasses/blob/master/notebooks/Regression_with_Scikit_Learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2020 Catalit LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Documentation links:

- [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb)
- [Numpy](https://docs.scipy.org/doc/)
- [Pandas](https://pandas.pydata.org/docs/getting_started/index.html)
- [Pandas Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- [Matplotlib](https://matplotlib.org/)
- [Matplotlib Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf)
- [Seaborn](https://seaborn.pydata.org/)
- [Scikit-learn](https://scikit-learn.org/stable/user_guide.html)
- [Scikit-learn Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf)
- [Scikit-learn Flow Chart](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

# Regression with Scikit Learn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
url = "https://raw.githubusercontent.com/zerotodeeplearning/ztdl-masterclasses/master/data/"

## Regression

In [None]:
df = pd.read_csv(url + 'weight-height.csv')

In [None]:
df.head()

In [None]:
sns.scatterplot(data=df,
                x='Height',
                y='Weight');

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [None]:
X = df[['Height']].values
y = df['Weight'].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

In [None]:
model = LinearRegression()

In [None]:
model.fit(X_train, y_train)

In [None]:
# R^2 (coefficient of determination) on the training set
model.score(X_train, y_train)

In [None]:
# R^2 on the held-out test set
model.score(X_test, y_test)
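
For a regressor, `score` returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below checks this on synthetic data; the numbers are made up for illustration and are not taken from `weight-height.csv`, and the names `X_demo`, `y_demo` and `demo_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic, illustrative data (NOT the weight-height dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(55, 80, size=(200, 1))
y_demo = 6.0 * X_demo[:, 0] - 250.0 + rng.normal(0, 10, size=200)

demo_model = LinearRegression().fit(X_demo, y_demo)
y_hat = demo_model.predict(X_demo)

# R^2 computed by hand: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# All three values should agree up to floating point error
print(r2_manual, r2_score(y_demo, y_hat), demo_model.score(X_demo, y_demo))
```

A value of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of `y`.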

In [None]:
y_pred_test = model.predict(X_test)

## Plot the line of best fit

In [None]:
plt.scatter(X_test, y_test)
plt.plot(X_test, y_pred_test, color='red')
plt.title("Model coef: {:0.3f}, Intercept: {:0.2f}".format(model.coef_[0], model.intercept_))
plt.xlabel("Height")
plt.ylabel("Weight");

### Compare true and predicted values ($y$ vs $\hat{y}$)

In [None]:
plt.scatter(y_test, y_pred_test)
plt.xlabel("True Values")
plt.ylabel("Predicted Values")

m = y_test.min()
M = y_test.max()

plt.plot((m, M), (m, M), color='red');

### Exercise 1: multiple features

The dataset has four columns: `sqft`, `bdrms`, `age` and `price`. The first three are candidate features; `price` is the target.

- load the dataset `housing-data.csv`
- visualize the data using `sns.pairplot`
- add more columns in the feature definition `X = ...`
- train and evaluate a linear regression model to predict `price`
- compare predictions with actual values
- is your score good?
- change the `random_state` in the train/test split function. Does the score stay stable?
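
As a rough illustration of the multi-feature step, here is a minimal sketch that selects several columns into `X`. It uses a synthetic stand-in DataFrame with the same column names rather than `housing-data.csv`; the values and the relationship between the columns are invented, and `demo_df`, `feature_cols` and `multi_model` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the exercise's column names (values are made up)
rng = np.random.default_rng(0)
n = 100
demo_df = pd.DataFrame({
    "sqft": rng.uniform(800, 4000, n),
    "bdrms": rng.integers(1, 6, n),
    "age": rng.uniform(0, 80, n),
})
# Invented linear relationship plus noise, for illustration only
demo_df["price"] = (100 * demo_df["sqft"] + 5000 * demo_df["bdrms"]
                    - 500 * demo_df["age"] + rng.normal(0, 20000, n))

# Selecting a list of columns gives a 2D feature matrix, one column per feature
feature_cols = ["sqft", "bdrms", "age"]
X_multi = demo_df[feature_cols].values
y_multi = demo_df["price"].values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_multi, y_multi, test_size=0.2, random_state=0)

multi_model = LinearRegression().fit(X_tr, y_tr)
print("train R^2:", multi_model.score(X_tr, y_tr))
print("test R^2:", multi_model.score(X_te, y_te))
```

With several features, `multi_model.coef_` holds one coefficient per column of `X_multi`, in the order given by `feature_cols`.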

### Exercise 2

- Encapsulate the split/train/evaluate steps into a single function with signature:

```python
def train_eval(random_state=0):
  # YOUR CODE HERE

  return train_score, test_score
```
- Compare the performance of the model for several random states

- Bonus points if you plot a histogram of train and test scores
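
One possible shape for such a function is sketched below. It is not the reference solution: it runs on a small synthetic dataset (`X_syn`, `y_syn`, both hypothetical and defined here) instead of the housing data, so only the structure carries over.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Small synthetic dataset (made up), so the split has a visible effect
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(60, 3))
y_syn = X_syn @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, size=60)

def train_eval(random_state=0):
    # Split, fit and score in one place; only the split depends on random_state
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_syn, y_syn, test_size=0.2, random_state=random_state)
    model = LinearRegression().fit(X_tr, y_tr)
    train_score = model.score(X_tr, y_tr)
    test_score = model.score(X_te, y_te)
    return train_score, test_score

# Repeat over several random states and summarize the spread
scores = np.array([train_eval(random_state=rs) for rs in range(20)])
train_scores, test_scores = scores[:, 0], scores[:, 1]
print("train: mean {:0.3f}, std {:0.3f}".format(train_scores.mean(), train_scores.std()))
print("test:  mean {:0.3f}, std {:0.3f}".format(test_scores.mean(), test_scores.std()))
```

The two arrays can be passed to `plt.hist` for the bonus histogram. On small datasets the test score usually varies more than the train score, because the test set contains few points.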

In [None]:
from sklearn.metrics import r2_score

### Exercise 3

Let's see how easy it is to test different models on a larger dataset.

Here we load the California Housing dataset from Scikit Learn. Your goal is to define a function that:
- trains a model
- plots `y_pred` vs `y_true`

You can skip the train/test split for this exercise.

Then compare the performance of the models given below:
- `sklearn.linear_model.LinearRegression`
- `sklearn.ensemble.GradientBoostingRegressor`
- `sklearn.ensemble.RandomForestRegressor`
- `sklearn.linear_model.Ridge`
- `sklearn.linear_model.Lasso`

Function signature:

```python
def train_eval_plot(model):
  # YOUR CODE HERE

```
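
A minimal sketch of what such a function might look like, shown on synthetic data rather than the California Housing dataset. The default arguments `X` and `y`, the returned score, and the names `X_syn` and `y_syn` are assumptions added for illustration; they are not part of the signature above.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (made up) standing in for the real dataset
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 4))
y_syn = X_syn @ np.array([1.5, -2.0, 0.0, 0.7]) + rng.normal(0, 0.5, size=200)

def train_eval_plot(model, X=X_syn, y=y_syn):
    # Fit and score on the same data (no train/test split, as in the exercise)
    model.fit(X, y)
    y_pred = model.predict(X)
    score = model.score(X, y)

    # Predicted vs true values, with the diagonal as the "perfect" reference
    fig, ax = plt.subplots()
    ax.scatter(y, y_pred, alpha=0.5)
    lo, hi = y.min(), y.max()
    ax.plot((lo, hi), (lo, hi), color='red')
    ax.set_xlabel("True Values")
    ax.set_ylabel("Predicted Values")
    ax.set_title("{}: R^2 = {:0.3f}".format(type(model).__name__, score))
    return score

# Any estimator with fit/predict/score can be passed in
for m in [LinearRegression(), Ridge(alpha=1.0)]:
    print(type(m).__name__, train_eval_plot(m))
```

Because the function only relies on `fit`, `predict` and `score`, the same call should work for the other regressors in the list. Scores computed without a held-out set tend to flatter flexible models such as random forests.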

In [None]:
# load_boston was removed in scikit-learn 1.2 and is not used below
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, Lasso

In [None]:
dataset = fetch_california_housing()
y = dataset.target

Xdf = pd.DataFrame(dataset.data, columns=dataset.feature_names)
X = Xdf.values

Xdf.head()

In [None]:
print(dataset.DESCR)