# Use-case for SciKit-Learn (sklearn)
Datafile: "College Basketball Dataset.csv"<br>
        "College Basketball Dataset-2019.csv"<br>
Datafile description:"College basketball-Description.pdf"

The original data are from: https://www.kaggle.com/andrewsundberg/college-basketball-dataset. You can find additional useful materials on that website.

This notebook uses a story style to show some examples of using the sklearn (scikit-learn) package to build prediction models and classification models that help solve interesting problems.  The sklearn website is: https://scikit-learn.org/stable/

Scikit-learn is probably the most useful library for machine learning in Python. The sklearn library contains a lot of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction.


## The Story

Suppose you have a friend who is a fan of NCAA college basketball. Today, he tells you that he recently found an online casino in which he can bet on lots of aspects of NCAA Division I college basketball, and he is anxious to get started. He has some datasets that contain lots of statistics about the Division I league from the past several years, so he wants to use those datasets to help him decide on his bets. 

Unfortunately, he has no idea how to use the datasets. So, when he heard that you are taking this course, he immediately came to see you and asked for help.  You decide to try to use sklearn to help your friend.  Your conversation goes something like:

You--<font color = "red">"So, what do you plan to bet on?"</font>

Your friend--<font color = "blue">"Wins above bubble and their postseason rank if they have one."</font>

You--<font color = "red">"uh...What are those?"</font>

Your friend--<font color = "blue">"Let's see the datasets and I can explain to you. "</font>

You--<font color = "red">"Okay."</font>

### You import some general packages and the dataset

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# the file path - Make sure that you understand this ../ notation!
fname = "../Data/College Basketball Dataset.csv"
# read the data into a pandas dataframe and show the first five lines
df = pd.read_csv(fname)
df.head()

In [None]:
# a simple dataframe information
df.info()

You--<font color = "red">"A lot of columns!"</font>

Your friend--<font color = "blue">"Yeah. The dataset includes statistics of NCAA Division I basketball from 2013 to 2018. Let me tell you what each column represents..."</font>

After a long explanation and discussion, you make a summary of the columns:

- TEAM: The Division I college basketball school<br>
- CONF: The Athletic Conference in which the school participates (more detail can be found in our datafile description)<br>
- G: Number of games played<br>
- W: Number of games won<br>
- ADJOE: Adjusted Offensive Efficiency (An estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average Division I defense)<br>
- ADJDE: Adjusted Defensive Efficiency (An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense)<br>
- BARTHAG: Power Rating (Chance of beating an average Division I team)<br>
- EFG_O: Effective Field Goal Percentage Shot(offense)<br>
- EFG_D: Effective Field Goal Percentage Allowed(defense)<br>
- TOR: Turnover Percentage Allowed (Turnover Rate)<br>
- TORD: Turnover Percentage Committed (Steal Rate)<br>
- ORB: Offensive Rebound Rate<br>
- DRB: Offensive Rebound Rate Allowed<br>
- FTR : Free Throw Rate (How often the given team shoots Free Throws)<br>
- FTRD: Free Throw Rate Allowed<br>
- 2P_O: Two-Point Shooting Percentage<br>
- 2P_D: Two-Point Shooting Percentage Allowed<br>
- 3P_O: Three-Point Shooting Percentage<br>
- 3P_D: Three-Point Shooting Percentage Allowed<br>
- ADJ_T: Adjusted Tempo (An estimate of the tempo (possessions per 40 minutes) a team would have against the team that wants to play at an average Division I tempo)<br>
- WAB: Wins Above Bubble (The bubble refers to the cut off between making the NCAA March Madness Tournament and not making it)<br>
- POSTSEASON: Round where the given team was eliminated or where their season ended (R68 = First Four, R64 = Round of 64, R32 = Round of 32, S16 = Sweet Sixteen, E8 = Elite Eight, F4 = Final Four, 2ND = Runner-up, Champions = Winner of the NCAA March Madness Tournament for that given year)<br>
- SEED: Seed in the NCAA March Madness Tournament<br>
- YEAR: Season<br>

You--<font color = "red">" In some cases, if needed, you can manipulate the dataset and extract the portion of it that you are interested in. Or, you can explore the dataset a little bit to get extra information. "</font>

You--<font color = "red">" For me, as an Auburn student, I'm curious about how Auburn has performed in recent years, so I can just show the records related to Auburn. "</font>

In [None]:
# show records related to Auburn
df[df['TEAM']=='Auburn'].sort_values(by="YEAR")

We can see that the Auburn team was stronger in 2018 than in previous years. <br>
We can run other queries, like showing all 2018 records or all champion teams' records.

In [None]:
# show all 2018 records 
df[df['YEAR']==2018]

In [None]:
# show all champions' records by year
df[df['POSTSEASON']=='Champions'].sort_values(by='YEAR')

In [None]:
# This is the version that you can use to show multiple outputs in one cell. Sometimes this could be more convenient than using multiple cells to show output
from IPython.display import display
display(df[df['YEAR']==2018])
display(df[df['POSTSEASON']=='Champions'])
# You can find the outputs are the same as the outputs of the above two cells

### Now let's focus on Wins Above the Bubble (WAB)
Your friend--<font color = "blue">"Wins Above Bubble (WAB) tells us how often an average 'bubble' team would be expected to win each game, based on opponent and location. A short example: For Kansas’ game Wednesday against Kansas State, a bubble team would be expected to beat K-State 82% of the time. WAB takes that number to credit or debit a team for every game it plays. So if KU loses, it will get -0.82 added to its total for performing that much worse than an average bubble team. If it wins, it gets +0.18 added to its total. Adding up every game of the season, you have the WAB"</font>
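The per-game bookkeeping in that explanation can be written out in a few lines. This is only a sketch of the arithmetic described above: the function name `wins_above_bubble` and the three-game schedule are made up for illustration, and only the 82% figure comes from the example.

```python
def wins_above_bubble(games):
    """Sum per-game credits: +(1 - p) for a win and -p for a loss,
    where p is the chance that an average bubble team wins that game."""
    total = 0.0
    for bubble_win_prob, won in games:
        total += (1.0 - bubble_win_prob) if won else -bubble_win_prob
    return total

# Hypothetical schedule: (bubble team's win probability, did our team win?)
games = [(0.82, True), (0.82, False), (0.50, True)]
wab = wins_above_bubble(games)
print(round(wab, 2))
```

A win in the 82% game is worth +0.18 and a loss costs -0.82, so beating weak opponents adds little while losing to them hurts a lot.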

Your friend--<font color = "blue">"Here is the NCAA basketball March Madness bracket for 2021. From this figure, you can find how the March Madness works and the rank of the teams."</font>

![NCAA basketball March Madness](https://www.ncaa.com/_flysystem/public-s3/styles/original/public-s3/images/2021-04-06/2021-ncaa-tournament-bracket_1.jpg?itok=irVohr9I)

You--<font color = "red">"Okay. Now, it's much clearer."</font>

You--<font color = "red">"Although your dataset has lots of columns, it's probable that only some of the columns are related to your goal. We need to check the _correlations_ and choose variables (columns) that appear to be correlated to our target (WAB) to build a model. "</font>

Your friend--<font color = "blue">"Sure."</font>

### Feature Correlation

Feature correlation is a way to understand the relationship between multiple variables and attributes in your dataset. Using the estimated correlations, you can get some insights such as:

* One or multiple attributes depend on another attribute or are a cause of another attribute.
* One or multiple attributes are associated with other attributes.


#### Correlation is very useful in many aspects:
- Correlation can help in predicting one attribute from another (Great way to impute missing values).
- Correlation can (sometimes) indicate the presence of a causal relationship.
- Correlation is used as a basic quantity for many modelling techniques.
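To make the idea concrete before looking at the basketball data, here is a small sketch on synthetic data. The column names (`x`, `pos`, `neg`, `noise`) are invented for this illustration. It ranks columns by the absolute value of their Pearson correlation with a target column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "pos": 2.0 * x + rng.normal(scale=0.1, size=200),  # strongly positive
    "neg": -x + rng.normal(scale=0.1, size=200),       # strongly negative
    "noise": rng.normal(size=200),                     # unrelated to x
})

# Pearson correlation of every column with x, ranked by absolute value
ranked = toy.corr()["x"].drop("x").abs().sort_values(ascending=False)
print(ranked)
```

Ranking by absolute value matters because a strong negative correlation is just as useful for prediction as a strong positive one.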


In [None]:
# show a simple correlation check 
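# get correlation matrix (numeric_only=True skips text columns such as TEAM;
# recent pandas versions raise an error on them otherwise)
corr = df.corr(numeric_only=True)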
# generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# set the matplotlib figure
fig, ax = plt.subplots(figsize=(20, 12))

# generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

Your friend--<font color = "blue">"Seems not all columns are in this correlation figure?"</font>

You--<font color = "red">"You are right, this figure can only show _numeric_ columns, columns like TEAM are ignored."</font>

Your friend--<font color = "blue">"So, how should we use these correlations?"</font>

You--<font color = "red">"The darker the color, the stronger the correlation (positive or negative) between the two variables. So, for WAB, columns W, ADJOE, ADJDE, and BARTHAG appear to be the four most relevant columns. "</font>

You--<font color = "red">"Because WAB is a numeric value in a continuous space, I think we can use regression in sklearn to build a prediction model.  "</font>

Your friend--<font color = "blue">"Great! So we will use these four columns to build our model , right?"</font>

You--<font color = "red">"It's not quite that easy (yet).  In general, you still need to carefully select the proper variables based on additional knowledge (about the dataset, the techniques you use, etc.). Let's start with no more than two columns.  Which two do you think could be the most relevant? "</font>

Your friend--<font color = "blue">"Uh... I prefer ADJDE and BARTHAG."</font>

You--<font color = "red">"Okay. But before we dive into the actual model building part, it's always wise to check whether we need to do any data pre-processing."</font>

Your friend--<font color = "blue">"What is that?"</font>

You--<font color = "red">"Sometimes a dataset may have missing values, wrong values, or values on different scales. So you may need to do some cleaning or normalization."</font>

Your friend--<font color = "blue">"OK. That makes sense, so how is our dataset?"</font>

You--<font color = "red">"Looks like it's good, no need for additional data pre-processing.  Someone has already done this job for us. That is lucky, because data is almost always 'dirty'."</font>

(Recall what you learned about data pre-processing in class: 

1. Data cleaning -- focus on dealing with missing or wrong values

2. Data normalization -- the goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values)
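The two steps recalled above can be sketched as follows. The toy values are invented, and `MinMaxScaler` is just one possible choice of normalization (scikit-learn also offers `StandardScaler`, for example). Our basketball dataset does not need this step.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented toy data with one missing value
toy = pd.DataFrame({"W": [10.0, 20.0, np.nan, 30.0],
                    "ADJOE": [95.0, 105.0, 110.0, 120.0]})

# 1. cleaning: count missing values per column, then drop incomplete rows
missing = toy.isna().sum()
clean = toy.dropna()

# 2. normalization: rescale every column to the [0, 1] range
scaled = pd.DataFrame(MinMaxScaler().fit_transform(clean),
                      columns=clean.columns, index=clean.index)
print(missing)
print(scaled)
```

Dropping rows is the simplest cleaning strategy. Imputing the missing values is an alternative when you cannot afford to lose observations.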

## 1. Prediction Models

#### The following cells will show multiple simple linear regression models in scikit-learn. We will create several linear regression models to predict WAB using different numbers of variables.

You--<font color = "red">"First, you should know that in general, there is no "correct" model for predicting something. What we are trying to do here is build a model whose predictions are as close as possible to the real values."</font>

You--<font color = "red">"Now, let's start to try the very basic linear regression model."</font>

(The following cells show a simple linear regression model in scikit-learn. We will create a linear regression model to predict WAB using the variable ADJDE. The linear regression model has the form:
$Y_{WAB}=\beta_{0}+\beta_{1}*X_{ADJDE}$

The goal is to find the _intercept_ ($\beta_{0}$) and the variables' _coefficients_ ($\beta_{1}$) in this linear regression model.)
More details about this linear model can be found here: https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares
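As a quick sanity check of what "finding the intercept and coefficient" means, here is a sketch on synthetic data where the true values are known. The numbers 40.0 and -0.5 are made up for this illustration and have nothing to do with the basketball data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(85, 115, size=(300, 1))  # one made-up feature
y = 40.0 - 0.5 * X[:, 0] + rng.normal(scale=0.5, size=300)

model = LinearRegression().fit(X, y)
b0, b1 = model.intercept_, model.coef_[0]

# A prediction is just beta_0 + beta_1 * x
manual = b0 + b1 * X[:, 0]
print(round(b0, 2), round(b1, 3))
print(np.allclose(manual, model.predict(X)))
```

The fitted values should land close to the ones used to generate the data, and `predict` does nothing more than evaluate the formula above.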

In [None]:
# import the tools we will use in this model
from sklearn import linear_model

You--<font color = "red">"Then, we need to create a _model object_."</font>

In general, you need to create a corresponding object in order to use a model from sklearn.

In [None]:
# Create linear regression object
regr = linear_model.LinearRegression()
regr

You--<font color = "red">"So we have a model object now, but it's just an empty model object.  Now, we need to get our training sets so that we can populate the model."</font>

Your friend--<font color = "blue">"What are training sets?"</font>

You--<font color = "red">"Oh, they are the input and output data that you use to train your model. "</font>

We extract the data for our X variables and Y variable. We will use these data to train our linear regression model.

In [None]:
# Get the X variables and Y variable for next training
df_X_train = df[['ADJDE']]
df_Y_train = df['WAB']

# lets have a quick look at the training variables
# the x variables:
display(df_X_train)
# The Y variable (or the target)
display(df_Y_train)

Your friend--<font color = "blue">"How to train the model? Is it complex?"</font>

You--<font color = "red">"Well, for us, it is quite simple. Sklearn provides an easy way for users to train their models. "</font>

This is the actual training process. Many models in scikit-learn use the .fit() function to train the model.

In [None]:
# Train the model
regr.fit(df_X_train, df_Y_train)

You--<font color = "red">"Now, we can get the intercept and coefficients of our linear model."</font>

After training, our linear regression object now contains the intercept and the coefficients of the model. We can check that.

In [None]:
# Show the intercept and the coefficient
print("The intercept is {:.3f}".format(regr.intercept_))
print("The coefficient for variable ADJDE is {:.3f}".format(regr.coef_[0]))

Your friend--<font color = "blue">"Great! Now I can use this for my bet!"</font>

You--<font color = "red">"Wait a minute! Yeah, in general, you could use this model, but, it is better to _evaluate_ the model you built before you use it."</font>

Your friend--<font color = "blue">"OK. So how should we do that?"</font>

You--<font color = "red">"I see you have another dataset that includes the 2019 statistics for the same data. We can use this dataset to evaluate the performance of our model."</font>

Your friend--<font color = "blue">"Cool! Let's do it!"</font>

Now, let's test our linear regression model using the information from 2019.  The data file is "College Basketball Dataset-2019.csv"

In [None]:
# the test file path
tfname = "../Data/College Basketball Dataset-2019.csv"
# read the data into a pandas dataframe and show the first five lines
df_test = pd.read_csv(tfname)
df_test.head()

In [None]:
# some information about the dataset
df_test.info()

We can run the same queries as we did for the original (training) dataset to show additional information or to select a certain portion of the dataset. 

In [None]:
# show Auburn and champion records in the test dataset
display(df_test[df_test['TEAM']=='Auburn'])
display(df_test[df_test['POSTSEASON']=='Champions'])
# You can see that the test dataset includes the same data, but only for one season (2019)
# (btw - they double dribbled.  I was there.)

You--<font color = "red">"We need to provide the input to our model and compute the estimated WAB. Then compare our estimated WAB with the real WAB."</font>

Extract the X variables in our test file and use them to make a prediction

In [None]:
# X variables
df_X_test = df_test[['ADJDE']]
# Make predictions
df_Y_pred = regr.predict(df_X_test)

You--<font color = "red">"It is good to use some visualizations to show our predictions along with the real values.  A _scatter plot_ should do nicely."</font>

Let's use a plot to show the real values of WAB and our predictions. The black dots represent the real values, and the blue dots represent the predicted values. 

In [None]:
plt.rcParams['figure.figsize'] = (20.0, 12.0)
plt.rcParams['xtick.labelsize'] =15
plt.rcParams['ytick.labelsize'] =15
# Use a 2D plot to compare the real values and our predictions
plt.scatter(df_test['ADJDE'], df_test['WAB'],  color='black', label='Real Value')
plt.scatter(df_test['ADJDE'], df_Y_pred, color='blue', linewidth=3, label='Prediction' )
plt.legend(markerscale=2)
plt.xlabel('ADJDE')
plt.ylabel('WAB')
plt.title('Real values VS Predictions')

Your friend--<font color = "blue">"I think it looks good."</font>

You--<font color = "red">"Yeah, the evaluation metric is often based on user's objectives. If the evaluation of a model doesn't meet the user's objectives, the model needs to be improved."</font>

You--<font color = "red">"In scikit-learn, there are lots of metric functions you can use to evaluate your models. I will show you a common one that computes the mean squared error (MSE)."</font>

For more metrics, you can check here: https://scikit-learn.org/stable/modules/model_evaluation.html

In [None]:
# import the module we will use
from sklearn.metrics import mean_squared_error
# compute MSE
mean_squared_error(df_test['WAB'], df_Y_pred)

You--<font color = "red">"In general, the smaller the MSE is, the better the model is."</font>
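For reference, MSE is simply the average of the squared differences between real and predicted values. The sketch below uses four made-up numbers to show that a manual computation matches `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up real values and predictions
y_true = np.array([1.0, -2.0, 0.5, 3.0])
y_pred = np.array([0.5, -1.0, 0.5, 2.0])

mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse_manual)  # same units as the target

print(mse_manual, mse_sklearn, rmse)
```

Because MSE is in squared units of the target, its square root (RMSE) is often easier to interpret: here it would read as a typical error measured in wins.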

Your friend--<font color = "blue">"OK. So is this model good or bad?"</font>

You--<font color = "red">"Well, let's build another linear regression model that uses two variables. Then we can compare them."</font>

Your friend--<font color = "blue">"Let's do it!"</font>

(The following cells show another simple linear regression model in scikit-learn. We will create a linear regression model to predict WAB using the variables ADJDE and BARTHAG. The linear regression model has the form:
$Y_{WAB}=\beta_{0}+\beta_{1}*X_{ADJDE}+\beta_{2}*X_{BARTHAG}$

The goal is to find the _intercept_ ($\beta_{0}$) and the variables' _coefficients_ ($\beta_{1}$ and $\beta_{2}$) in this linear regression model.)
More details about this linear model can be found here: https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares

In [None]:
# Create linear regression object
regr1 = linear_model.LinearRegression()
# Get the X variables and Y variable for next training
df_X_train1 = df[['ADJDE','BARTHAG']]
df_Y_train1 = df['WAB']
# Train the model
regr1.fit(df_X_train1, df_Y_train1)
# Show the intercept and the coefficients
print("The intercept is {:.3f}".format(regr1.intercept_))
print("The coefficient for variable ADJDE is {:.3f}".format(regr1.coef_[0]))
print("The coefficient for variable BARTHAG is {:.3f}".format(regr1.coef_[1]))
# X variables
df_X_test1 = df_test[['ADJDE','BARTHAG']]
# Make predictions
df_Y_pred1 = regr1.predict(df_X_test1)
# compute MSE
print("\nThe MSE for this model is: {:.3f}".format(mean_squared_error(df_test['WAB'], df_Y_pred1)))

You--<font color = "red">"Now you can see the MSE of this model is less than the previous one."</font>

You--<font color = "red">"Because now we have two x variables, the linear model should be a plane in 3D space. We can check that with 3d visualization."</font>

In [None]:
# import plotly package
import plotly.express as px

# form the dataframe for plot
XYZ_R = df_X_test1.copy()
XYZ_R['WAB']=df_test['WAB']
XYZ_R['Label']='Real Value'
XYZ_P = df_X_test1.copy()
XYZ_P['WAB']=df_Y_pred1
XYZ_P['Label']='Prediction'
XYZ = pd.concat([XYZ_R,XYZ_P])
XYZ['Size']=0.1

# 3d plot
fig = px.scatter_3d(XYZ, x='ADJDE', y='BARTHAG', z='WAB',color='Label',size='Size')
fig.update_layout(autosize=False,width=1000,height=1000)
fig.show()

You--<font color = "red">"You see, the red dots are our predictions and they are all located on a plane."</font>

You--<font color = "red">"Also, if you want to test a variety of combinations of the variables, you can write code that automatically decides which variables to use based on your metric."</font>

You--<font color = "red">"For example, I want to find the model that has the lowest MSE and uses no more than three input variables."</font>

In [None]:
# import itertools 
from itertools import combinations
# possible input variables
v_list = ['W','ADJOE','ADJDE','BARTHAG']
# selected variable. initial equal to None
sv = None
# selected metric. initial equal to infinity
s_metric = float('inf')
# number of variables
for i in range(1,4):
    # pick the combination
    for ele in combinations(v_list,i):
        # create linear regression object
        regrC = linear_model.LinearRegression()
        # X variables and Y variable
        df_X_trainC = df[list(ele)]
        df_Y_trainC = df['WAB']
        # Train the model
        regrC.fit(df_X_trainC, df_Y_trainC)
        # test X variables
        df_X_testC = df_test[list(ele)]
        # Make predictions
        df_Y_predC = regrC.predict(df_X_testC)
        # compute MSE
        t_metric = mean_squared_error(df_test['WAB'], df_Y_predC)
        # choose the least MSE
        if t_metric < s_metric:
            s_metric = t_metric
            sv = list(ele)
# See the result
sv, s_metric

Your friend--<font color = "blue">"I get your point. So the last model is better. Now I think I can use this prediction model to help me predict WAB, but how do we deal with the postseason rank? It is not a numeric value."</font>  (See the next cell for the distinct postseason rank values)

You--<font color = "red">"You are right. The postseason rank is a **_categorical value_**, so to predict postseason rank, we need to use a **_classification_** model."</font>

You--<font color = "red">"Oh, I need to mention that even when the values in a column are numeric, sometimes we can still treat them as categorical values."</font>

You--<font color = "red">"Now, let's first check what values we have in the postseason column. And then build our classification models."</font>

In [None]:
# check the unique values in POSTSEASON column
df['POSTSEASON'].unique()

In this postseason column, we have 9 unique values: one for losing in each of the seven rounds of the NCAA tournament, one for the champion, and one (nan, which marks a missing value rather than a real value) for teams that did not reach the tournament.
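When inspecting a categorical column like this, it helps to know how pandas treats the missing marker. The sketch below uses a short invented Series, not the real data, to show that `nunique` and `value_counts` skip NaN unless you pass `dropna=False`.

```python
import numpy as np
import pandas as pd

# Invented labels; NaN stands for teams that missed the tournament
postseason = pd.Series(["R64", "R32", np.nan, "R64", "Champions", np.nan, np.nan])

n_labels = postseason.nunique()                 # NaN is not counted
n_with_nan = postseason.nunique(dropna=False)   # NaN counted as one value
counts = postseason.value_counts(dropna=False)  # frequency of each value, NaN included

print(n_labels, n_with_nan)
print(counts)
```

Checking the frequencies this way also gives an early look at how balanced the classes are, which becomes important later for classification.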

## 2. Classification Model


In this classification model example, we will show you how to use a **_decision tree_** to build classification models. Before we dive into the actual model building process, let's first see what a decision tree is. According to Wikipedia, a decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label. <br>
Suppose we have some data records (first image) and a decision tree (second image) as shown below: <br>
![Data records for showing decision tree](https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2015/01/Decision-Tree-Example-5-Decision-tree-Edureka-768x432.png)
![Decision tree](https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2015/01/Decision-Tree-Example-6-Decision-tree-Edureka-768x432.png)

With the above decision tree, we can take a record and follow the tree from top to bottom to make a decision. 

(The following cells will show simple **_decision tree models_** in scikit-learn. We will create a decision tree classifier to predict POSTSEASON using variables W, ADJOE, ADJDE, BARTHAG, and WAB.


A decision tree classifier is a tree-based model. It considers all the variables in the dataset and splits on them. At each split, it either assigns the data point to a class or splits on a different (or the same) variable. This process continues until there are no more features to split on, or a stopping criterion is reached.

<font color = "red"> You can also find in the following cells that for a decision tree model, many aspects can impact the performance of the decision tree model </font>)
More details about this decision tree model can be found here: https://scikit-learn.org/stable/modules/tree.html#classification
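How does a tree decide where to split? By default, scikit-learn's `DecisionTreeClassifier` scores candidate splits with Gini impurity. The following is a simplified sketch of that score, not scikit-learn's actual implementation; the helper names `gini` and `split_gini` and the toy labels are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = ["R64"] * 4 + ["S16"] * 4
perfect = split_gini(["R64"] * 4, ["S16"] * 4)
useless = split_gini(["R64", "R64", "S16", "S16"], ["R64", "R64", "S16", "S16"])
print(gini(parent), perfect, useless)
```

A split that separates the classes completely drives the weighted impurity to 0, while a split that leaves both children as mixed as the parent brings no improvement. The tree prefers splits with the lowest weighted impurity.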


You--<font color = "red">"So, let's first import the module we will use from the sklearn package!"</font>

In [None]:
# import the tools we will use in this model
from sklearn import tree

You--<font color = "red">"For a decision tree model, we still need to identify the input variables. So, if a team participates in the postseason matches, what statistics do you think may be related to its rank?"</font>

Your friend--<font color = "blue">"Let me think about it ... I guess their rank may have some relation to W, ADJOE, ADJDE, BARTHAG, and WAB."</font>

You--<font color = "red">"Okay. Considering that not all teams participate in the postseason matches, we first need to extract the observations that have a postseason rank."</font>

Based on the dataset information, the POSTSEASON column only has 408 non-null values. So, let's first extract those 408 observations and generate the training set with the target variable and input variables. 

In [None]:
# extract the 408 observations with non-null postseason ranks
OBs = df.dropna(subset=['POSTSEASON'])
OBs

In [None]:
# the target variable
Class = OBs['POSTSEASON']
# the input variables
Train = OBs[['W','ADJOE','ADJDE','BARTHAG','WAB']]
Train

You--<font color = "red">"In general, setting proper parameters is important in model building. Usually, it can dramatically affect the performance of your model."</font>

Your friend--<font color = "blue">"I didn't see you provide any parameters in the linear model building."</font>

You--<font color = "red">"Yes, when you build models using sklearn, each model has some default parameters. If you don't specify the parameters, sklearn will use those default parameters to build the model. "</font>

Your friend--<font color = "blue">"OK, I get it. So, this time we will specify some parameters, right?"</font>

You--<font color = "red">"Uh...Let's first use the default parameters and see what happens."</font>

Now, create a decision tree classifier object. As we showed above, you need to create a corresponding object in order to use a model from scikit-learn. Then, we use the default parameters to train our model.

In [None]:
# decision tree object
clf = tree.DecisionTreeClassifier()
# training model
clf = clf.fit(Train, Class)

You--<font color = "red">"We can also use the 2019 dataset to test this decision tree model. Given the input values of a team, the decision tree model will give an estimated rank for that team."</font>

Let's use the 2019 dataset to predict the post-season rank of the 68 teams that reached the tournament.

In [None]:
# get the 68 teams that have a postseason rank
PostSeasonTeams = df_test[df_test['POSTSEASON'].notna()]
# make the predictions
clf.predict(PostSeasonTeams[['W','ADJOE','ADJDE','BARTHAG','WAB']])

Your friend--<font color = "blue">"Now we need to evaluate the model, right?"</font>

You--<font color = "red">"Exactly."</font>

We can evaluate our prediction with their real post-season rank. Sklearn also provides functions for us to evaluate our model. Here, we will use the accuracy as our evaluation metric.

In [None]:
# import the needed package
from sklearn import metrics
# real rank
real_rank = PostSeasonTeams['POSTSEASON']
# predicted rank
pred_rank = clf.predict(PostSeasonTeams[['W','ADJOE','ADJDE','BARTHAG','WAB']])
# check the accuracy
metrics.accuracy_score(real_rank, pred_rank)

Your friend--<font color = "blue">"The accuracy is just above 40%, I don't think this is a good model."</font>

You--<font color = "red">"Well, this model may just not be good for this 2019 dataset."</font>

Your friend--<font color = "blue">"What do you mean by that?"</font>

You--<font color = "red">"Sometimes, a good model for one dataset may not be a good model for another dataset. If you used the NCAA statistics from 2020, this model might have a higher accuracy."</font>

Your friend--<font color = "blue">"OK. I get your point. BTW, in 2020 NCAA basketball didn't have postseason, so those predictions would be easy! "</font>

You--<font color = "red">"Lol, Okay."</font>

(The above is the process of building, applying and evaluating a decision tree model. You can see that the accuracy of our model for the 2019 dataset is around 40%. If you are satisfied with this accuracy, you may stop here. But if you want to build a <font color = "red">"better model"</font> ( Here, we mean a better model for the 2019 dataset. Sometimes, a good model for one dataset may not be a good model for another dataset), you may need to do some parameter tunings or need to change your training set.)
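One simple form of parameter tuning is to try several values of a parameter and compare the scores. The sketch below does this for `max_depth` on a synthetic dataset generated with `make_classification`. The dataset, the list of depths, and the variable names are assumptions made for illustration, not part of the basketball analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for real data
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {}
for depth in [1, 2, 3, 4, 5, None]:  # None means no depth limit
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = model.score(X_val, y_val)  # accuracy on held-out data

best_depth = max(scores, key=scores.get)
print(scores)
print("best max_depth:", best_depth)
```

If you pick parameters this way, it is safer to choose them on a separate validation split and keep the final test data (here, the 2019 season) for one last check, so the reported accuracy is not overly optimistic.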

Your friend--<font color = "blue">"So, what should we do next, build another decision tree model?"</font>

You--<font color = "red">"Well, do you know what our decision tree looks like? "</font>

Your friend--<font color = "blue">"Nope."</font>

You--<font color = "red">"Okay. So I think it's better to show you a visualization of the decision tree. This might help you judge its quality."</font>

Let's first see what decision tree looks like. The following cells provide three different visualization methods for our decision tree model.

### 1. Use the default plot method to show the tree
Here we don't show the whole tree, only part of it (but if you like, you can change the parameter max_depth to see more or less of the tree)

In [None]:
# plot size
plt.figure(figsize=(20,20))
# plot the decision tree
tree.plot_tree(clf, max_depth=2, fontsize=15);

### 2. Use graphviz package to help us show the tree

In [None]:
# import the package
import graphviz 
# provide the feature names and the class names
featurenames = ['W','ADJOE','ADJDE','BARTHAG','WAB']
classname = clf.classes_
# generate the plot
dot_data = tree.export_graphviz(clf,feature_names=featurenames,
                                class_names = classname,filled=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)  
graph 

### 3. Or, we can show the tree in textual format

In [None]:
r = tree.export_text(clf, feature_names=featurenames)
print(r)

You--<font color = "red">"Now we can build another tree in which we change some of the default parameters."</font>

Let's build another tree model, and this time we limit the max depth of our tree to 3. Then, let's evaluate our new model.

In [None]:
# the decision tree model object
clf = tree.DecisionTreeClassifier(max_depth=3)
# train the model
clf = clf.fit(Train, Class)
# make predictions
pred_rank = clf.predict(PostSeasonTeams[['W','ADJOE','ADJDE','BARTHAG','WAB']])
# check the accuracy
metrics.accuracy_score(real_rank, pred_rank)

Your friend--<font color = "blue">"Seems better than the previous one."</font>

You--<font color = "red">"Yes, in terms of accuracy."</font>

Your friend--<font color = "blue">"So, what's going on with this model?"</font>

You--<font color = "red">"Let's plot this tree and you can find the differences."</font>

You can see that the accuracy of our new model is higher. Let's plot the new tree as well and see what has changed.

In [None]:
# use graphviz to plot the tree
dot_data = tree.export_graphviz(clf,feature_names=featurenames,
                                class_names = classname,filled=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)  
graph 

Your friend--<font color = "blue">"Oh, I see, the decision tree looks like it has been pruned."</font>

You--<font color = "red">"Yes. Because we force the max depth of the tree to 3."</font>

You--<font color = "red">"And you should notice that, unlike in our first tree model, the leaves of the new tree are not always pure. Because we limit the max depth of the tree, the tree assigns a class to a node even if that node is not pure."</font>

Your friend--<font color = "blue">"I find that in our training set, nearly half of the observations are in rank R64. Will this affect our model?"</font>

You--<font color = "red">"It may. Let me show you something interesting."</font>

First, count the number of teams that have the R64 post-season rank in 2019.

In [None]:
# calculate the R64 numbers
(PostSeasonTeams['POSTSEASON']=='R64').sum()

There are 32 teams that got the R64 rank in 2019. So, if we only use observations with the R64 rank to build our model, the model will predict everything as R64. And when we make predictions with that model, we get an accuracy of 32/68 = 47%, which is better than our first tree model!!

In general, if the classes in the training set are unbalanced, the classes with more observations may bias your model (and NOT just decision tree models).
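The "always predict R64" model described above is a majority-class baseline, and scikit-learn ships one as `DummyClassifier`. The sketch below uses invented label arrays (only the 32-out-of-68 count for R64 comes from the text) and placeholder features, since this baseline ignores its inputs.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Invented labels: R64 is the most common class in training,
# and 32 of the 68 test teams are R64
y_train = np.array(["R64"] * 192 + ["R32"] * 96 + ["S16"] * 48)
y_test = np.array(["R64"] * 32 + ["R32"] * 16 + ["S16"] * 8 + ["E8"] * 4
                  + ["F4"] * 2 + ["2ND", "Champions"] + ["R68"] * 4)

# Features are ignored by this baseline, so zeros are enough
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
acc = accuracy_score(y_test, baseline.predict(X_test))
print(round(acc, 3))
```

Comparing any classifier against this kind of baseline is a quick way to tell whether its accuracy reflects real learning or just the class imbalance.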

Your friend--<font color = "blue">"Interesting. I see the influence of the unbalanced training set. So how can we reduce it?"</font>

You--<font color = "red">"We can use only part of the original training set, or add some synthetic observations to expand it. I'll use part of our original training set to build another decision tree model."</font>

Now, let's pick some observations from our original training set to form a new training set. This time, the new training set will have 6 Champions teams, 6 2ND teams, 12 F4 teams, 24 E8 teams, 48 S16 teams, <font color = "red">48 R32 teams, 48 R64 teams</font>, and 24 R68 teams. For the R32 and R64 teams, we will choose the observations randomly.

In [None]:
# let's first extract non-R32 and non-R64 teams in our 2013-2018 dataset
Non_R32_64 = OBs[(OBs['POSTSEASON']!='R64')&(OBs['POSTSEASON']!='R32')]
# fix the seed so that you get the same sample every time you run this cell;
# to get a different sample, change the number in the parentheses
np.random.seed(0)
# 48 samples of the R32 teams
R32_sample = OBs[OBs['POSTSEASON']=='R32'].sample(48)
# 48 samples of the R64 teams
R64_sample = OBs[OBs['POSTSEASON']=='R64'].sample(48)
# create our new training set
NewOBs = pd.concat([Non_R32_64,R32_sample,R64_sample])
NewOBs

We will limit the max depth of the tree model to 4. The following cells build the model, evaluate it, and plot it.

In [None]:
# the target variable
Class = NewOBs['POSTSEASON']
# the input variables
Train = NewOBs[['W','ADJOE','ADJDE','BARTHAG','WAB']]
clf = tree.DecisionTreeClassifier(max_depth=4)
clf = clf.fit(Train, Class)
pred_rank = clf.predict(PostSeasonTeams[['W','ADJOE','ADJDE','BARTHAG','WAB']])
# check the accuracy
metrics.accuracy_score(real_rank, pred_rank)

In [None]:
# plot the tree
dot_data = tree.export_graphviz(clf,feature_names=featurenames,
                                class_names = classname,filled=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)  
graph

Your friend--<font color = "blue">"Based on the accuracy, this new model seems to be worse than the previous one."</font>

You--<font color = "red">"Yes. But when you check the plots, you can see that this one can predict every rank, while the previous model cannot predict Champions, 2ND, or F4."</font>

Your friend--<font color = "blue">"You are right. Uh, it is hard to decide which one I should use."</font>

You--<font color = "red">"Building a model is easy, but judging a model is difficult. Good luck with your bets!"</font>
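
One way to judge a model beyond a single accuracy number is to look at each rank separately. The following is a minimal sketch using hypothetical ranks for ten teams (`toy_real` and `toy_pred` are made-up values, not results from the models above). It shows how `metrics.confusion_matrix` and `metrics.recall_score` reveal which ranks a model never predicts correctly.

```python
from sklearn import metrics

# hypothetical true and predicted ranks for ten teams (made-up values)
labels = ['Champions', 'F4', 'R32', 'R64']
toy_real = ['R64', 'R64', 'R64', 'R64', 'R32', 'R32', 'R32', 'F4', 'F4', 'Champions']
toy_pred = ['R64', 'R64', 'R64', 'R32', 'R64', 'R32', 'R32', 'R32', 'F4', 'R64']

# rows are the true ranks, columns are the predicted ranks, both in `labels` order
cm = metrics.confusion_matrix(toy_real, toy_pred, labels=labels)
# recall per rank: the share of teams with that true rank that were predicted correctly
recall_per_class = metrics.recall_score(toy_real, toy_pred, labels=labels, average=None)

print(cm)
print(dict(zip(labels, recall_per_class)))
```

In this toy example the overall accuracy is 6/10, yet the recall for Champions is 0. A single accuracy score hides that kind of gap.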

### The following cell will show you an example of how to use code to help us find models

In [None]:
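# import gridsearch module
from sklearn.model_selection import GridSearchCV
# combinations is used below to enumerate the subsets of input variables
from itertools import combinations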
# possible input variables
v_list = ['W','ADJOE','ADJDE','BARTHAG','WAB','SEED','TOR','TORD']
# selected variables, initially None
sv = None
# selected model, initially None
sm = None
# selected parameters, initially None
sp = None
# accuracy of the selected model, initially 0
s_accuracy = 0
# number of variables
for i in range(4,9):
    # pick the combination
    for ele in combinations(v_list,i):
        # X variables and Y variable
        TrainC = OBs[list(ele)]
        ClassC = OBs['POSTSEASON']
        # decision tree object
        dtc = tree.DecisionTreeClassifier()
        # parameter grid
        parameters = {'splitter':("best","random"),   
                      'max_depth':[3,4,5],
                      'random_state':[35]}
        # create gridsearch object
        clf_S = GridSearchCV(dtc, parameters, scoring= 'accuracy')
        # train the model
        clf_S.fit(TrainC,ClassC)
        # use the best model to do prediction
        pred_rank = clf_S.best_estimator_.predict(PostSeasonTeams[list(ele)])
        # check prediction accuracy
        accuracy = metrics.accuracy_score(real_rank, pred_rank)
        # choose the highest accuracy model
        if accuracy > s_accuracy:
            s_accuracy = accuracy
            sv = list(ele)
            sp = clf_S.best_params_
            sm = clf_S.best_estimator_
s_accuracy, sv, sp
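
The loop above picks the variable set by its accuracy on the 2019 data, so `s_accuracy` may be an optimistic estimate for seasons the model has not seen. One possible alternative, sketched here on synthetic data rather than on `OBs`, is to compare candidates by the cross-validated score that `GridSearchCV` stores in `best_score_` and to keep the 2019 data for a final check only. The names `X_toy`, `y_toy`, and `best_cv_score` are made up for this illustration.

```python
from itertools import combinations
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# synthetic stand-in data with 4 candidate input columns (assumption: NOT the basketball data)
X_toy, y_toy = make_classification(n_samples=200, n_features=4, n_informative=3,
                                   n_redundant=0, n_classes=3,
                                   n_clusters_per_class=1, random_state=0)
param_grid = {'splitter': ['best', 'random'],
              'max_depth': [3, 4, 5],
              'random_state': [35]}

best_cv_score, best_columns, best_params = 0.0, None, None
# try every subset of 2, 3, or 4 columns
for k in range(2, 5):
    for cols in combinations(range(4), k):
        search = GridSearchCV(tree.DecisionTreeClassifier(), param_grid,
                              scoring='accuracy', cv=5)
        search.fit(X_toy[:, list(cols)], y_toy)
        # best_score_ is the mean cross-validated accuracy of the best parameter setting
        if search.best_score_ > best_cv_score:
            best_cv_score = search.best_score_
            best_columns = cols
            best_params = search.best_params_

print(best_cv_score, best_columns, best_params)
```

With this approach, the held-out season is used only once, to report how the chosen model performs.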

## In this notebook, we just show you some examples of using sklearn to build models. You can use sklearn to develop much more complicated models than the ones in this notebook. But when you need to build a model to solve a problem, always remember to follow these 5 steps:
1. Reading the Data
2. Exploratory Data Analysis
3. Data Pre-processing
4. Model Building
5. Model Evaluation