# Statistical Modeling and Machine Learning

This section will cover machine learning with some statistical modeling in Python.

## What Is Machine Learning?
Rather than programming expertise directly into our applications, we program them to learn from data, using machine-learning models to make predictions (predictive analytics).

### Prediction Examples
* Improve **weather forecasting** to save lives, minimize injuries and property damage
* Improve **cancer diagnoses** and **treatment regimens** to save lives
* **Detect fraudulent credit-card purchases** and **insurance claims**
* Predict **customer “churn”**, what prices houses are likely to sell for, ticket sales of new movies, and anticipated revenue of new products and services
* Predict the **best strategies for coaches and players** to use to win more games and championships
* ...and more

## Scikit-Learn 
* Scikit-learn, also called **sklearn**, conveniently packages many widely used machine-learning algorithms as **estimators**. 
* **Algorithms are encapsulated, so you don’t see the intricate details and heavy mathematics of how these algorithms work.**
* You’ll use **scikit-learn** to **train each model** on a subset of your data, then **test each model** on the rest to see how well your model works. 
* Once your models are trained, you’ll put them to work making **predictions** based on **data they have not seen**. 

### Which Scikit-Learn Estimator Should You Choose?
* **It’s difficult to know in advance which model(s) will perform best on your data**, so you typically try several models and compare how well each one performs on that data. 
* A popular approach is to **run many models and pick the best one(s)** based on scoring
* Unless you are super interested and math inclined, **you likely won't know (or need to know) the details of the mathematical algorithms** in the sklearn estimators
* With experience, you’ll **become familiar with which algorithms may be best for particular types of datasets and problems**. 

### [Datasets Bundled with Scikit-Learn](http://scikit-learn.org/stable/datasets/index.html)
* Scikit-learn also provides capabilities for loading datasets from other sources, such as the 20,000+ datasets available at https://openml.org. 

| Datasets bundled with scikit-learn | &nbsp;
| :--- | :---
| **_"Toy" datasets_** | **_Real-world datasets_**
| Boston house prices (removed in scikit-learn 1.2) | Olivetti faces
| Iris plants | 20 newsgroups text
| Diabetes | Labeled Faces in the Wild face recognition
| Optical recognition of handwritten digits | Forest cover types
| Linnerud | RCV1
| Wine recognition | Kddcup 99 
| Breast cancer Wisconsin (diagnostic) | California Housing 
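
As a brief illustration, the "toy" datasets in the table load from files that ship with scikit-learn, so no download is needed. The sketch below loads the Diabetes dataset with `load_diabetes`; the variable name `diabetes` is just an illustrative choice.

```python
from sklearn.datasets import load_diabetes

# Toy datasets ship with scikit-learn, so this call needs no network access
diabetes = load_diabetes()

print(diabetes.data.shape)     # (number of samples, number of features)
print(diabetes.target.shape)   # one target value per sample
print(diabetes.feature_names)  # names of the feature columns
```

The returned object exposes `data`, `target` and `feature_names` attributes, the same ones used with the California Housing data later in this section.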

### Steps in Performing Machine Learning 
* loading the dataset
* exploring the data to get a handle on its contents (e.g. numpy, pandas, visualization, etc.)
* **Make sure your data is clean before you use it for ML!**
* transforming your data (converting non-numeric data to numeric data, because scikit-learn requires numeric data)
* splitting the data for training and testing
* creating the model
* training and testing the model
* tuning the model and evaluating its accuracy
* making predictions on live data that the model hasn’t seen before. 
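
The steps above can be sketched end to end on a small synthetic dataset. This is a minimal sketch rather than a full workflow: the data is generated (roughly y = 3x + 2 plus noise) instead of loaded and cleaned, and names such as `model` and `new_sample` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 1. Load: synthetic data stands in for a real dataset (y is roughly 3x + 2)
rng = np.random.default_rng(11)
X = rng.uniform(0, 10, size=(200, 1))            # one numeric feature, 2-D
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.5, 200)    # target with a little noise

# 2. Split: hold back 25% of the samples for testing (the default)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)

# 3. Create and train the model on the training subset only
model = LinearRegression()
model.fit(X_train, y_train)

# 4. Evaluate on data the model has not seen (R^2 score)
test_score = model.score(X_test, y_test)
print(f'R^2 on test data: {test_score:.3f}')

# 5. Predict for a new, previously unseen sample
new_sample = np.array([[4.0]])
print('Prediction for x=4.0:', model.predict(new_sample)[0])
```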

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, ElasticNet, Lasso, Ridge
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn import metrics

## Time Series and Simple Linear Regression 
* **Linear regression** is the **simplest** regression algorithm
* Given a collection of numeric values representing an **independent variable** and a **dependent variable**, simple linear regression **describes the relationship between these variables with a straight line**, known as the **regression line**
   \begin{equation}y = m x + b\end{equation}
   * *m* is the slope, and *b* is the intercept (the value of *y* when *x* = 0) 

## Example: Predicting High Temperatures (NYC Temperature Data)

The file `ave_hi_nyc_jan_1895-2018.csv` contains New York City average high temperatures for every January from 1895 through 2018, with the following columns:

* Date - A value of the form 'YYYYMM'
* Value - Temperature in Fahrenheit
* Anomaly - Difference between the value for the given date and the average value for all dates

### Data Preparation

In [None]:
# Load the data, rename the 'Value' column to 'Temperature',
# drop the trailing month digits (01) from each date, and display a few rows
df_nyc = pd.read_csv('ave_hi_nyc_jan_1895-2018.csv')
df_nyc.columns = ['Date', 'Temperature', 'Anomaly']
df_nyc['Date'] = df_nyc['Date'].floordiv(100)
df_nyc.head()
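
To see why `floordiv(100)` removes the month, here is a small sketch on a stand-in `DataFrame`. The three values are illustrative only; they follow the same YYYYMM integer form as the `Date` column.

```python
import pandas as pd

# Illustrative values only, in the same YYYYMM integer form as the Date column
dates = pd.DataFrame({'Date': [189501, 189601, 201801]})

# Integer-dividing by 100 drops the two-digit month, leaving the year
dates['Date'] = dates['Date'].floordiv(100)
print(dates['Date'].tolist())   # [1895, 1896, 2018]
```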

### Splitting the Data for Training and Testing
* We’ll use the **`LinearRegression`** estimator from **`sklearn.linear_model`** 
* By default, this estimator uses **all** the **numeric features** in a dataset to perform **multiple linear regression**  
* For **simple linear regression** select **one** feature (the `Date` here) as the **independent variable**
    * Scikit-learn estimators require training and testing data to be **two-dimensional** 
    * We'll transform the **`Series` of _n_ elements** into a two-dimensional array containing **_n_ rows** and **one column**
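
A minimal sketch of that transformation, using a short made-up `Series` of years in place of the real `Date` column:

```python
import pandas as pd

# A Series of n = 4 elements (illustrative years)
years = pd.Series([1895, 1896, 1897, 1898])

one_d = years.values            # 1-D NumPy array, shape (4,)
two_d = one_d.reshape(-1, 1)    # -1 lets NumPy infer the row count: shape (4, 1)

print(one_d.shape, two_d.shape)   # (4,) (4, 1)
```

Passing `-1` for the first dimension tells NumPy to work out the number of rows from the array's length, so the same call works for any _n_.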

In [None]:
df_nyc['Temperature'].values

In [None]:
# Randomly split the arrays into training and testing subsets.
# random_state seeds the shuffle so the same split is reproduced on every run.
X_train, X_test, y_train, y_test = train_test_split(
    df_nyc['Date'].values.reshape(-1, 1),  # 2-D features: n rows, 1 column
    df_nyc['Temperature'].values,          # 1-D target
    random_state=11)

In [None]:
# Confirm the 75%-25% train-test split
total = len(X_train) + len(X_test)
print(len(X_train), 'items --', len(X_train) / total * 100)
print(len(X_test), 'items --', len(X_test) / total * 100)
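
The 75%-25% proportion is the default of `train_test_split`. As a sketch, the `test_size` parameter changes it; the arrays below are placeholders, not the temperature data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # 100 placeholder samples, one feature
y = np.arange(100)

# Default: 25% of the samples go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=11)
print(len(X_tr), len(X_te))         # 75 25

# test_size overrides the default, here an 80%-20% split
X_tr80, X_te20, y_tr80, y_te20 = train_test_split(
    X, y, test_size=0.20, random_state=11)
print(len(X_tr80), len(X_te20))     # 80 20
```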

### Training the Model
* [**LinearRegression default settings**](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
* [**How Do You Know You Have Enough Training Data?**](https://towardsdatascience.com/how-do-you-know-you-have-enough-training-data-ad9b1fd679ee)

In [None]:
linear_regression = LinearRegression()
linear_regression.fit(X=X_train, y=y_train)

* To find the **best-fitting regression line** for the data, the `LinearRegression` estimator uses **ordinary least squares**: it computes the **slope** and **intercept** that **minimize** the **sum of the squares** of the data points’ vertical **distances** from the line.
* The slope and intercept are used to make predictions and plot the regression line
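
For a single feature, the least-squares slope and intercept have a closed form: the slope is the sum of (x - mean of x)(y - mean of y) divided by the sum of (x - mean of x) squared, and the intercept is (mean of y) - slope × (mean of x). The sketch below applies those formulas to five made-up points and checks them against NumPy's `polyfit`; it illustrates the calculation and is not scikit-learn's internal code.

```python
import numpy as np

# Illustrative points that lie near the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Closed-form least-squares estimates for simple linear regression
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# np.polyfit with degree 1 solves the same least-squares problem
check_slope, check_intercept = np.polyfit(x, y, 1)

print(f'slope={slope:.3f}, intercept={intercept:.3f}')   # slope=1.970, intercept=1.060
```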

In [None]:
print(linear_regression.coef_)       # slope: the estimator's coef_ attribute (m in the equation)
print(linear_regression.intercept_)  # intercept: the estimator's intercept_ attribute (b in the equation)

### Testing the Model
* Test the model using the data in **`X_test`** and check some of the **predictions**

In [None]:
expected = y_test
predicted = linear_regression.predict(X_test)

In [None]:
for p, e in zip(predicted[::5], expected[::5]):  # check every 5th element
    print('Predicted: %.2f, Expected: %.2f' % (p, e))

#### Question: What do you think about the predictions?

### Predicting Future Temperatures and Estimating Past Temperatures 
* Use the **coefficient** and **intercept** values to make **predictions** 

In [None]:
# lambda implements y = mx + b; coef_ is a one-element array, so take its single value
predict = (lambda x: linear_regression.coef_[0] * x + linear_regression.intercept_)

In [None]:
print('The prediction for 1890 is', predict(1890))
print('The prediction for 2021 is', predict(2021))

### Visualizing the Dataset with the Regression Line 
* Create a **scatter plot** with a regression line 
* **Cooler** temperatures shown in **darker colors**

In [None]:
axes = sns.scatterplot(data=df_nyc, x='Date', y='Temperature', hue='Temperature', palette='winter', legend=False)  
axes.set_ylim(10, 70)  # scale y-axis 

x = np.array([df_nyc['Date'].min(), df_nyc['Date'].max()])  # endpoints of the line: the first and last years (1895 and 2018)
y = predict(x)

line = plt.plot(x, y)

### Overfitting/Underfitting
* Two common problems that prevent accurate predictions
* When creating a model, a key goal is **making accurate predictions** for **data it has not yet seen** 
* **Underfitting** occurs when a **model is too simple** to capture the patterns in its training data
    * For example, you use a linear model when the problem really requires a non-linear model
* **Overfitting** occurs when your **model is too complex** and fits the noise in its training data
    * The most extreme case is a model that memorizes its training data
    * Such a model scores well on the training data but predicts poorly for data it has not yet seen
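
One way to see both problems is to fit polynomials of different degrees to noisy data drawn from a quadratic curve. The sketch below uses synthetic data and NumPy's `polyfit`; the degrees (1, 2 and 9) and the noise level are arbitrary choices for illustration. Degree 1 is too simple for the curve. Degree 9 has almost as many parameters as there are training points, so its training error is the smallest, and its test error will usually be noticeably larger than its training error.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(x):
    """Noisy samples of a quadratic relationship (synthetic data)."""
    return x ** 2 + rng.normal(0, 0.5, size=x.shape)

x_train = np.linspace(-3, 3, 12)
y_train = make_data(x_train)
x_test = np.linspace(-2.8, 2.8, 50)    # unseen points in the same range
y_test = make_data(x_test)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}   # degree -> (training MSE, test MSE)
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))
    print(f'degree {degree}: train MSE={results[degree][0]:.3f}, '
          f'test MSE={results[degree][1]:.3f}')
```

Comparing the two columns is the useful habit here: a low training error says little on its own, and the gap between training and test error is what signals overfitting.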

### Practice Problem

Using the file `ave_yearly_temp_nyc_1895-2017.csv`, predict what the average yearly temperature will be in 2030, 2040, and 2050. What are the predictions telling you, and what can you infer from them?

## Example: Multiple Linear Regression (California Housing Dataset)
* [**California Housing dataset**](http://lib.stat.cmu.edu/datasets) is one of the datasets available through scikit-learn. It was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (typically with a population of 600 to 3,000 people).
* The dataset has **20,640 samples**—**one per block group**—with **eight features** each:
	* **median income**—in tens of thousands, so 8.37 would represent $83,700
	* **median house age**—in the dataset, the maximum value for this feature is 52
	* **average number of rooms** 
	* **average number of bedrooms** 
	* **block population**
	* **average house occupancy**
	* **house block latitude**
	* **house block longitude**
* **Target**: **median house value** in hundreds of thousands, so 3.55 would represent \$355,000
* Reasonable to expect **more bedrooms**, **more rooms** or **higher income** would mean **higher house value**
* **Combine all numeric features to make predictions**
    * More likely to get **more accurate predictions** than with simple linear regression

### Data Exploration and Preparation

In [None]:
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing()

In [None]:
# Confirm the number of samples and features, and list the feature names
print('Shape', california.data.shape)
print('Feature Names', california.feature_names)

In [None]:
# Create a data frame with a column for median house values
california_df = pd.DataFrame(california.data, columns=california.feature_names)
california_df['MedHouseValue'] = pd.Series(california.target)
california_df.head()

In [None]:
# Calculate some summary statistics
california_df.describe()

### Visualization Exploration
* Helpful to **visualize** data by **plotting the target value** against **each** feature
    * Shows how **median home value** relates to **each feature**
* Use the **`DataFrame`** method **`sample`** to **randomly select 10% of the 20,640 samples** for graphing
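
A small sketch of how `sample` behaves, using a stand-in `DataFrame` of 1,000 rows rather than the housing data:

```python
import numpy as np
import pandas as pd

# A stand-in DataFrame with 1,000 rows (values are just row numbers)
full_df = pd.DataFrame({'value': np.arange(1000)})

# frac=0.1 keeps a random 10% of the rows; random_state makes it repeatable
subset_df = full_df.sample(frac=0.1, random_state=17)
same_subset_df = full_df.sample(frac=0.1, random_state=17)

print(len(subset_df))                     # 100
print(subset_df.equals(same_subset_df))   # True
```

Sampling is done without replacement by default, so no row appears twice in the subset.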

In [None]:
# Display scatter plots of several features. Feature on x-axis and median home value on y-axis

sample_df = california_df.sample(frac=0.1, random_state=17)
sns.set_style('whitegrid')   
for feature in california.feature_names:
    plt.figure(figsize=(15, 7))  # 15"-by-7" Figure
    sns.scatterplot(data=sample_df, x=feature, y='MedHouseValue', hue='MedHouseValue', palette='cool', legend=False)

#### Question: What are the visualizations telling you?

<!-- ![California Housing Dataset scatterplot of Median House Value vs. Median Income](./ch14images/medincome.png "California Housing Dataset scatterplot of Median House Value vs. Median Income")
 ![California Housing Dataset scatterplot of Median House Value vs. House Age](./ch14images/houseage.png "California Housing Dataset scatterplot of Median House Value vs. House Age")
 ![California Housing Dataset scatterplot of Median House Value vs. Average Rooms](./ch14images/averooms.png "California Housing Dataset scatterplot of Median House Value vs. Average Rooms")
 ![California Housing Dataset scatterplot of Median House Value vs. Average Bedrooms](./ch14images/avebedrooms.png "California Housing Dataset scatterplot of Median House Value vs. Average Bedrooms")
 ![California Housing Dataset scatterplot of Median House Value vs. Population](./ch14images/population.png "California Housing Dataset scatterplot of Median House Value vs. Population")
 ![California Housing Dataset scatterplot of Median House Value vs. Average Occupancy](./ch14images/aveoccupancy.png "California Housing Dataset scatterplot of Median House Value vs. Average Occupancy")
 ![California Housing Dataset scatterplot of Median House Value vs. Lattitude](./ch14images/lattitude.png "California Housing Dataset scatterplot of Median House Value vs. Lattitude")
 ![California Housing Dataset scatterplot of Median House Value vs. Longitude](./ch14images/longitude.png "California Housing Dataset scatterplot of Median House Value vs. Longitude")<hr style="height:2px; border:none; color:black; background-color:black;"> -->

### Splitting the Data for Training and Testing

In [None]:
# Split arrays into train and test subsets which will be randomly selected
X_train, X_test, y_train, y_test = train_test_split(california.data, california.target, random_state=11)

In [None]:
# Confirm the 75%-25% train-test split
# len() counts samples (rows); .size would count every element (rows x columns)
total = len(X_train) + len(X_test)
print(len(X_train), 'items --', len(X_train) / total * 100)
print(len(X_test), 'items --', len(X_test) / total * 100)

### Training the Model 
* **`LinearRegression`** tries to use **all** features in a dataset’s `data` array
    * Raises an **error** if any features are **categorical** (non-numeric)  
    * Categorical data must be preprocessed into numerical data or excluded
* **Scikit-learn’s bundled datasets** are already in the **correct format** for training
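
For data that does contain a categorical column, one common preprocessing option is one-hot encoding. The sketch below uses pandas' `get_dummies` on a hypothetical `DataFrame`; the `Rooms` and `Region` columns and their values are invented for illustration and are not part of the California Housing dataset.

```python
import pandas as pd

# Hypothetical data: 'Region' is categorical, 'Rooms' is already numeric
homes = pd.DataFrame({
    'Rooms': [5, 6, 4, 7],
    'Region': ['coast', 'inland', 'coast', 'valley'],
})

# One-hot encode: one 0/1 column per category replaces the text column
encoded = pd.get_dummies(homes, columns=['Region'], dtype=int)
print(encoded.columns.tolist())
# ['Rooms', 'Region_coast', 'Region_inland', 'Region_valley']
```

After encoding, every column is numeric, so the resulting `DataFrame` can be passed to an estimator's `fit` method.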

In [None]:
linear_regression = LinearRegression()
linear_regression.fit(X=X_train, y=y_train)

* **Separate coefficients** for each feature (stored in `coef_`) and **one intercept** (stored in `intercept_`) 
    * **Positive coefficients** &mdash; median house value **increases** as feature value **increases** 
    * **Negative coefficients** &mdash; median house value **decreases** as feature value **increases**
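    * The coefficients for **HouseAge**, **AveOccup** and **Population** are **close to zero**, so these features apparently have little to no effect on **median house value** (keep in mind that a coefficient's size also depends on the units of its feature)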

In [None]:
# Each feature's coefficient is stored in the estimator's coef_ attribute (the m values in the equation)
for i, name in enumerate(california.feature_names):  # enumerate pairs each name with its index
    print(f'{name:>10}: {linear_regression.coef_[i]}')
print()
# Intercept is the estimator's intercept_ attribute (b in the equation)
print(linear_regression.intercept_)

* Can use coefficient values in following equation to **make predictions**:

\begin{equation}
y = m_1 x_1 + m_2 x_2 + ... + m_n x_n + b
\end{equation}

* <em>m</em><sub>1</sub>, <em>m</em><sub>2</sub>, …, <em>m</em><sub><em>n</em></sub> are the **feature coefficients**
* <em>b</em> is the **intercept**
* <em>x</em><sub>1</sub>, <em>x</em><sub>2</sub>, …, <em>x</em><sub><em>n</em></sub> are **feature values** (the **independent variables**)
* <em>y</em> is the **predicted value** (the **dependent variable**)
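
As a sketch, the equation can be evaluated for every sample at once with a matrix-vector product, and the result should match the estimator's own `predict` method. The data here is synthetic (three features with chosen coefficients 2.0, -1.0 and 0.5 and intercept 4.0), not the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 100 samples, 3 features, known coefficients and intercept
rng = np.random.default_rng(11)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(0, 0.1, 100)

model = LinearRegression().fit(X, y)

# y = m_1*x_1 + m_2*x_2 + ... + m_n*x_n + b, evaluated for every row at once
manual = X @ model.coef_ + model.intercept_

print(np.allclose(manual, model.predict(X)))   # True
```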
   

### Testing the Model 

In [None]:
expected = y_test
predicted = linear_regression.predict(X_test)

In [None]:
print('Expected', expected[:5])   # first five targets 
print('Predicted', predicted[:5])  # first 5 predictions

* In **regression**, it’s **tough to get exact predictions**, because you have **continuous outputs**
    * Every possible combination of values for <em>x</em><sub>1</sub>, <em>x</em><sub>2</sub> … <em>x</em><sub><em>n</em></sub> in the equation above produces a predicted value

### Visualizing the Expected vs. Predicted Prices 
* Create a `DataFrame` containing columns for the expected and predicted values:

In [None]:
df = pd.DataFrame()
df['Expected'] = pd.Series(expected)
df['Predicted'] = pd.Series(predicted)
df.head()

* Plot the data as a scatter plot with the **expected (target) prices** along the x-axis and the **predicted prices** along the **y**-axis: 

In [None]:
figure = plt.figure(figsize=(9, 9))
axes = sns.scatterplot(data=df, x='Expected', y='Predicted', hue='Predicted', palette='cool', legend=False)

start = min(expected.min(), predicted.min())
end = max(expected.max(), predicted.max())

axes.set_xlim(start, end)
axes.set_ylim(start, end)

# Note: This is NOT a regression line. It shows what perfect predictions would look like:
# if every predicted value matched its expected value, all dots would lie along the dashed line.
line = plt.plot([start, end], [start, end], 'k--')

#### Question: What do you think this model is suggesting regarding expected median house value?

### Evaluating the Regression Model
* **Metrics for regression estimators** include the **coefficient of determination** (**$R^{2}$ score**), which is at most 1.0
    * **1.0**: the estimator **perfectly predicts** the **dependent variable’s value**, given the independent variables' values
    * **0.0**: the model does **no better than always predicting the mean** of the dependent variable
    * **Negative values** are possible when the model predicts worse than the mean
* Calculate with arrays representing the **expected** and **predicted results**

In [None]:
metrics.r2_score(expected, predicted)
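
To make the score less of a black box, the sketch below computes $R^{2}$ by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with `metrics.r2_score`. The four expected and predicted values are illustrative, and `r2_by_hand` is a helper defined only for this example.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # illustrative expected values
y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # illustrative predictions

def r2_by_hand(expected, predicted):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((expected - predicted) ** 2)
    ss_tot = np.sum((expected - expected.mean()) ** 2)
    return 1 - ss_res / ss_tot

print(r2_by_hand(y_true, y_pred))          # same value as the next line
print(metrics.r2_score(y_true, y_pred))

# Always predicting the mean of the expected values scores 0.0
mean_only = np.full_like(y_true, y_true.mean())
print(metrics.r2_score(y_true, mean_only))

# Predictions that are worse than the mean give a negative score
print(metrics.r2_score(y_true, y_true[::-1]))
```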

### Trying Other Models
* **Try several estimators** to determine whether any **produces better results** than `LinearRegression` 
* [Information about estimators used here](https://scikit-learn.org/stable/modules/linear_model.html)

In [None]:
estimators = {
    'LinearRegression': linear_regression,
    'ElasticNet': ElasticNet(),
    'Lasso': Lasso(),
    'Ridge': Ridge()
}

* Run the estimators using **k-fold cross-validation** (splits the data into k folds; each fold serves once as the test set while the remaining folds are used for training)

In [None]:
# cv takes a cross-validation generator that defines how to split the samples
kfold = KFold(n_splits=10, random_state=11, shuffle=True)
for estimator_name, estimator_object in estimators.items():
    # scoring='r2' returns one R^2 score per fold
    scores = cross_val_score(estimator=estimator_object, X=california.data,
                             y=california.target, cv=kfold, scoring='r2')
    print(f'{estimator_name:>16}: mean of r2 scores={scores.mean():.3f}')
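
To see what the `KFold` object contributes, the sketch below splits ten placeholder sample indices into five folds and prints which indices are used for training and testing in each one. The sizes are chosen only to keep the output short.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)   # ten placeholder sample indices, 0 through 9

kfold = KFold(n_splits=5, shuffle=True, random_state=11)

test_indices_seen = []
for fold_number, (train_idx, test_idx) in enumerate(kfold.split(samples), start=1):
    # Each fold trains on 8 samples and tests on the remaining 2
    print(f'fold {fold_number}: train={train_idx}, test={test_idx}')
    test_indices_seen.extend(test_idx.tolist())
```

Every sample lands in the test set exactly once across the five folds, which is why the mean of the per-fold scores is a steadier estimate than a single train-test split.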