# Mini Project #1: Baseball Analytics

The overall purpose of this mini-project is to predict MLB wins per season by fitting a k-means clustering model and linear regression models to the data.

## Part 3: Analysis/Modeling

In this part of the project, you will analyze the data you processed in Parts 1 & 2. The tasks in this part include:
- K-means Clustering: a pre-modeling step that provides insights into the data;
- Linear Regression: predict Wins (continuous) using a trained linear regression model;
- Logistic Regression: predict Win_bins (categorical) using a trained logistic regression model __on your own__.

Let's get started.

In [None]:
# import dependencies
import pandas as pd
import numpy as np

In [None]:
# read-in required data
# features for analysis
data_features = pd.read_csv('../ba545-data/baseball_analytics_features.csv', header=0, index_col=0)

# continuous target `wins`
wins = pd.read_csv('../ba545-data/baseball_analytics_wins.csv',  index_col=0, names = ['wins'])

# categorical target `Win_bins`
win_bins = pd.read_csv('../ba545-data/baseball_analytics_target.csv',  index_col=0, names = ['win_bins'])

# display if data are read correctly
print(data_features.head())
print(wins.head())
print(win_bins.head())

Check the __data types__ of `data_features`.

In [None]:
## Write your code here


### K-means Clustering

K-means clustering, as a basic clustering technique, can capture internal relationships among your data points. Sometimes we use (k-means) clustering as a pre-modeling step for supervised learning: we use k-means clustering to capture the internal structure of the features, and then encode that structure in an additional feature that is used as an input to a classification/regression model.

One key step in k-means clustering is to determine the value of `k`: how many clusters? If we want to use the clustering results as an additional (categorical) feature, `k` should stay small. Increasing `k` also raises the risk of capturing spurious relationships. The k-means model is provided in `sklearn.cluster`.

In this tutorial, we use **Grid Search** to find the best value of `k`. To conduct Grid Search, you need a range of `k` and a metric that measures the performance under each value of `k`. In this context, we select the metric as the [**silhouette score**](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html) (`s_score`), which is provided in `sklearn.metrics`.
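
The search described above can be sketched on synthetic data. This is a minimal illustration, not the assignment solution: the blob data, the three hand-picked centers, and the names `X_demo`, `demo_scores`, and `best_k` are assumptions made up for the example, and the range of `k` is kept shorter than the one used later in this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (NOT the baseball features):
# 300 points scattered around three well-separated, hand-picked centers
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[(0, 0), (6, 0), (3, 5)],
    cluster_std=0.5,
    random_state=2019,
)

# Try each candidate k and record its silhouette score
demo_scores = {}
for k in range(2, 7):
    km_demo = KMeans(n_clusters=k, n_init=10, random_state=2019)
    demo_labels = km_demo.fit_predict(X_demo)
    demo_scores[k] = silhouette_score(X_demo, demo_labels)

# Under this metric, the preferred k is the one with the highest score
best_k = max(demo_scores, key=demo_scores.get)
print(demo_scores)
print("best k:", best_k)
```

Silhouette scores lie between -1 and 1, and higher values indicate tighter, better-separated clusters.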

In [None]:
# import dependencies
from sklearn.cluster import KMeans
from sklearn import metrics

The silhouette score is a numeric measure of clustering quality, and it is often paired with a visual inspection of the clusters. Thus, we need to import `matplotlib` to visualize the clustering.

In [None]:
# import and initialize matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')

In [None]:
# We need to create a figure that contains one sub-figure per value of `k`
fig = plt.figure(figsize=(20,20))
fig.subplots_adjust(left=0,right=1,bottom=0,top=1,hspace=0.05,wspace=0.5)

#### complete your code below
#### create an empty dictionary `s_score_dict` that we will use to store silhouette scores
#### for different `k` values; use different `k` values as keys, and corresponding
#### silhouette score as values


#### now we create a for-loop that goes through `k` values from 2 to 10 (`range(2, 11)` excludes 11)
for i in range(2,11):
    #### add a sub-figure `ax` to `fig` using `.add_subplot(8,8,i+1,xticks=[],yticks=[])`
    
    # conduct the k-means clustering using `k = i`
    km = KMeans(n_clusters=i, random_state=2019)
    # `fit_transform` fits the model and returns, for each data point, its distance
    # to each of the `k` cluster centers (one column per cluster)
    distances = km.fit_transform(data_features)
    # clustering models will generate `labels` - if you want to create the additional feature
    # as discussed above, you will use `labels` as its values
    labels = km.labels_
    # `fit_predict` refits `km` on `data_features` and returns the cluster label of each data point
    l = km.fit_predict(data_features)
    # Silhouette score is computed between `data_features` and `l`
    s_s = metrics.silhouette_score(data_features, l)
    #### update the `s_score_dict` using `i` as key and `s_s` as value
    
    # we will plot each point by its distances to the first two cluster centers, colored by cluster
    plt.scatter(distances[:,0], distances[:,1], c=labels)
    #### add 'i clusters' as the title of each sub-figure
    
    
#### show plot


Visually, the 2-cluster solution looks the best. Let's double-check the silhouette scores to make sure.

In [None]:
s_score_dict

Consistent with the figure, the 2-cluster model returns the highest silhouette score.

__Rule of thumb__: in practice, however, we normally start the search for `k` at `3`.

So we are going to build a k-means model with `k=3`, and then add its cluster labels as a feature named `label`.

In [None]:
#### complete your code below
#### create a model called `kmeans_model` with `n_clusters = 3` and `random_state = 2019`


#### capture `distances` by fitting (`fit_transform`) `kmeans_model` to `data_features`


#### record labels of clusters in `labels`


#### create a scatter plot (plt.scatter()) to plot the clusters


#### add title to plot as `3-cluster plot`


#### show the plot


Looks pretty good, correct? Now let's add the `labels` to `data_features` as an additional feature so that we can use it in further analysis.
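
To illustrate the idea of turning cluster labels into a feature, here is a small sketch on made-up data. The frame `demo_features` and its column names `feat_a` and `feat_b` are hypothetical stand-ins, not the real `data_features`.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical stand-in for `data_features` (column names are made up)
X_demo, _ = make_blobs(n_samples=60, centers=3, n_features=2, random_state=2019)
demo_features = pd.DataFrame(X_demo, columns=["feat_a", "feat_b"])

# Fit k-means and get one integer cluster label per row
demo_km = KMeans(n_clusters=3, n_init=10, random_state=2019)
demo_labels = demo_km.fit_predict(demo_features)

# The label array has one entry per row, in row order,
# so it can be assigned directly as a new column
demo_features["label"] = demo_labels
print(demo_features.head())
```

Because the labels are assigned only after fitting, the clustering itself never sees the `label` column.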

In [None]:
# look at `labels`
print(labels)
print(len(labels))
print(data_features.shape[0])

In [None]:
#### complete your code below
#### add `labels` to `data_features` as a new column named `label`


#### double check by looking at the first 5 rows of `data_features`


### Linear Regression

We will train linear regression models to predict a continuous target `wins`.

In [None]:
#### complete your code below
#### first we need to create the dataset we will use for the regression model
#### `reg_data` = `data_features` + `wins`


#### double check by looking at the first 5 rows of `reg_data`


In [None]:
#### complete your code below
#### investigate descriptive stats using describe()


Let's import the dependencies for building and evaluating a linear regression model.

In [None]:
# Import `LinearRegression` from `sklearn.linear_model`
from sklearn.linear_model import LinearRegression

# Import `mean_absolute_error` and `mean_squared_error` from `sklearn.metrics`
from sklearn.metrics import mean_absolute_error, mean_squared_error

Then let's define the features and target. There are two ways of doing this. Let's try the first.
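
As a rough illustration of both ways, the sketch below uses a tiny made-up frame. The names `toy_features` and `toy_wins` and all the numbers are invented for the example; only the slicing pattern is the point.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (made-up numbers) for `data_features` and `wins`
toy_features = pd.DataFrame({"runs": [700, 650, 800], "era": [4.1, 4.5, 3.6]})
toy_wins = pd.DataFrame({"wins": [85, 78, 97]})

# Way 1: combine into one frame, take the raw values, then slice columns
toy_reg = pd.concat([toy_features, toy_wins], axis=1)
toy_values = toy_reg.values
X1 = toy_values[:, :-1]  # every column except the last
y1 = toy_values[:, -1]   # the last column (the target)

# Way 2: take the values directly from the separate frames
X2 = toy_features.values
y2 = toy_wins["wins"].values

# Both ways should yield the same features and target
print(np.allclose(X1, X2), np.allclose(y1, y2))
```

The first way depends on the target being the last column of the combined frame, while the second keeps features and target separate from the start.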

In [None]:
#### complete your code below
#### create a variable `reg_values` which are the values in `reg_data`


#### create a variable `X` which contains all columns in `reg_values` besides the last 


#### create a variable `y` which contains the last column in `reg_values`


Here is an alternative method:

In [None]:
#### complete your code below
#### create a variable `Xa` which contains all values in `data_features`


#### create a variable `ya` which contains values in `wins`


Now we need to split our data into training (`X_train`, `y_train`) and testing (`X_test`, `y_test`).
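
For reference, a 75/25 split looks like the following on made-up arrays. The data and the names `X_demo`, `y_demo`, `Xd_train`, and so on are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 4 features, one numeric target
rng = np.random.default_rng(2019)
X_demo = rng.normal(size=(100, 4))
y_demo = rng.normal(size=100)

# test_size=0.25 holds out 25% of the rows;
# random_state makes the shuffle repeatable
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=2019
)
print(Xd_train.shape, Xd_test.shape)
```

Fixing `random_state` means the same rows land in the training and testing sets on every run, which keeps model comparisons fair.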

In [None]:
#### complete your code below
#### import `train_test_split` from `sklearn.model_selection`


#### split X, y into training and testing, using 75/25 split, and set `random_state = 2019`


In [None]:
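# Create Linear Regression model, fit model, and make predictions
# (the `normalize` argument was removed in scikit-learn 1.2; for ordinary least squares
# it does not change the predictions, so it is dropped here)
lr = LinearRegression()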
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

In [None]:
# calculate the MAE
mae = mean_absolute_error(y_test, y_pred)

# Print `mae`
print(mae)

In [None]:
# Calculate the RMSE

#from sklearn.metrics import mean_squared_error

from math import sqrt

rmse = sqrt(mean_squared_error(y_test, y_pred))

print(rmse)

You can inspect your fitted linear regression model through its coefficients and intercept.

In [None]:
lr.coef_

In [None]:
lr.intercept_

We can also train a regularized regression model (ridge regression with a cross-validated penalty) to see whether the results improve.

In [None]:
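# Import `RidgeCV` from `sklearn.linear_model`, plus helpers for feature scaling
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Create Ridge Linear Regression model, fit model, and make predictions
# (`normalize=True` was removed in scikit-learn 1.2; standardizing the features in a
# pipeline is the usual substitute, though the scaling is not numerically identical)
rrm = make_pipeline(StandardScaler(), RidgeCV(alphas=(0.01, 0.1, 1.0, 10.0)))
rrm.fit(X_train, y_train)
predictions_rrm = rrm.predict(X_test)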

# Determine mean absolute error
mae_rrm = mean_absolute_error(y_test, predictions_rrm)
print(mae_rrm)

In [None]:
# Calculate the RMSE
rmse_rrm = sqrt(mean_squared_error(y_test, predictions_rrm))
rmse_rrm 

We can also check how much the `label` feature contributes to the regression model.

In [None]:
#### Complete your code below
#### create a variable `Xb` without `label`
#### you can do it by getting X[:,:-1]

#### create your training and testing data using Xb and y
#### remember that Xb does not contain 'label', use the same parameters as before
#### 75/25 split, and `random_state = 2019`


#### Create Linear Regression model, fit model, and make predictions


#### calculate the MAE


#### Print `mae`


#### calculate and print RMSE


In your analysis, MAE and RMSE are both on the same scale as your target (`y`) variable. We can use the raw MAE/RMSE values to compare models, but when we need to report or interpret how good a model is, we convert the metric to a ratio (_error ratio_, ER):

$$ ER(y,\hat{y}) = \frac{metric}{y_{range}} $$

where $metric$ is the metric you want to use (e.g., MAE or RMSE), and $y_{range} = y_{max} - y_{min}$.
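
A worked example of the ratio on made-up numbers may help. The arrays `y_true_demo` and `y_pred_demo` are invented for illustration and are not results from the baseball data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true and predicted win totals
y_true_demo = np.array([60.0, 80.0, 90.0, 100.0])
y_pred_demo = np.array([64.0, 78.0, 93.0, 99.0])

# Absolute errors are 4, 2, 3, 1
mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)
rmse_demo = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

# Range of the target: y_max - y_min
y_range_demo = y_true_demo.max() - y_true_demo.min()

# Error ratios: the metric divided by the target range
mae_ratio_demo = mae_demo / y_range_demo
rmse_ratio_demo = rmse_demo / y_range_demo
print(mae_ratio_demo, rmse_ratio_demo)
```

Here the MAE is 2.5 and the range is 40, so the MAE ratio is 0.0625: the typical error is about 6% of the spread of the target.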

In [None]:
# calculate the `mae_ratio` and `rmse_ratio` below
# write your code below


### Question: 
__Do you observe an improvement when `label` is excluded from the analysis? In other words, does `label` help with the analysis? Answer in the next block__.

__Double click and type your answer__

### Logistic Regression

You will need to create a logistic regression model __on your own__, using `data_features` as features, and `win_bins` as the target.

If you have any questions, refer to the logistic regression notebook for more help.

__Hint:__ You should consider using `sklearn`'s classification report to evaluate your results, since this is a classification problem. The docs can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).
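
To show what the classification report looks like, here is a minimal sketch on synthetic data. It is not the solution for `win_bins`: the three-class dataset from `make_classification`, the `max_iter` setting, and all variable names are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the features and `win_bins`
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=2019,
)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=2019
)

# Fit a logistic regression classifier and predict on the held-out rows
clf_demo = LogisticRegression(max_iter=1000)
clf_demo.fit(Xc_train, yc_train)
yc_pred = clf_demo.predict(Xc_test)

# Per-class precision, recall, F1-score, and support
report_demo = classification_report(yc_test, yc_pred)
print(report_demo)
```

The report takes the true labels first and the predicted labels second, mirroring the argument order of `mean_absolute_error` used earlier.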