<a href="https://colab.research.google.com/github/dornercr/INFO371/blob/main/INFO371_Week5_FeatureE_CrossV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INFO 371: Data Mining Applications

## Week 5: Feature Engineering and Model Selection
### Prof. Charles Dorner, EdD (Candidate)
### College of Computing and Informatics, Drexel University

```
import pandas as pd
import numpy as np
from google.colab import files
import matplotlib.pyplot as plt
```

## Collecting the data
The data we will use is the NBA match history for the 2015-2016 season.

The website [http://basketball-reference.com](http://basketball-reference.com) contains a significant number of resources
and statistics collected from the NBA and other leagues.

To download the data set, perform the following steps:
1. Open the [2015-2016 season games page](http://www.basketball-reference.com/leagues/NBA_2016_games.html) in your web browser.
2. Click Share & Export.
3. Click Get table as CSV (for Excel).
4. Copy the data, including the heading, into a text file named nba-2016.csv.
5. Repeat this process for each of the other months, appending the rows to the same file without copying the heading again.

This will give you a CSV file containing the results from each game of this season of the NBA. (My file contains 1316 games and a total of 1317 lines in the file, including the header line.)
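Steps 4 and 5 amount to concatenating the monthly exports while keeping a single header row. The following is a minimal sketch of that idea, assuming each export has already been saved as text. The names `first_export`, `second_export`, and `combine_exports` are hypothetical, and the strings hold only a few rows from the season's opening days rather than the full data.

```python
import csv
import io

HEADER = "Date,Start (ET),Visitor/Neutral,PTS,Home/Neutral,PTS,,,Attend.,Notes\n"

# Hypothetical stand-ins for two exports; each one starts with the same header.
first_export = HEADER + (
    "Tue Oct 27 2015,8:00p,Cleveland Cavaliers,95,Chicago Bulls,97,Box Score,,21957,\n"
    "Tue Oct 27 2015,8:00p,Detroit Pistons,106,Atlanta Hawks,94,Box Score,,19187,\n"
)
second_export = HEADER + (
    "Wed Oct 28 2015,7:00p,Washington Wizards,88,Orlando Magic,87,Box Score,,18846,\n"
)

def combine_exports(exports):
    """Concatenate CSV exports, keeping the header from the first one only."""
    combined = []
    for i, text in enumerate(exports):
        lines = text.strip().splitlines()
        combined.extend(lines if i == 0 else lines[1:])
    return "\n".join(combined) + "\n"

combined_csv = combine_exports([first_export, second_export])
rows = list(csv.reader(io.StringIO(combined_csv)))
print(len(rows) - 1, "games,", len(rows), "lines including the header")
```

Counting lines this way is a quick check that no header was copied twice: the number of lines should be the number of games plus one.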

```
# upload the NBA file
from google.colab import files
files.upload()
```

Saving nba-2016.csv to nba-2016.csv



```
df = pd.read_csv('nba-2016.csv')
df.head()
```

## Parse date when loading and clean the column names

```
df = pd.read_csv('nba-2016.csv', parse_dates=["Date"])
# The seventh column holds the "Box Score" link text and the eighth the overtime marker
df.columns = ["Date", "Start (ET)", "Visitor Team", "VisitorPts",
              "Home Team", "HomePts", "Score Type", "OT?", "Attend.", "Notes"]
df.head()
```
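The dates in the file look like `Tue Oct 27 2015`. With `parse_dates`, pandas works out the format on its own. As a rough illustration of what that conversion produces, the standard library can parse the same string with an explicit format. The format string below is an assumption based on the sample rows, not something pandas requires.

```python
from datetime import datetime

raw_date = "Tue Oct 27 2015"
# %a = abbreviated weekday, %b = abbreviated month, %d = day, %Y = four-digit year
parsed = datetime.strptime(raw_date, "%a %b %d %Y")
print(parsed.date())  # 2015-10-27
```

Once parsed, dates sort chronologically rather than alphabetically, which matters later when the games are sorted by date.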

## Add a target column

```
df["HomeWin"] = df["VisitorPts"] < df["HomePts"]
```

## Extract the target values

```
y_homewin = df["HomeWin"].values
y_homewin
```

## What is the home team advantage in this data set?
```
# The mean of a boolean column is the fraction of True values, i.e. the home win rate
df["HomeWin"].mean()
```

## Predict using team names
```
df['Home Team'].nunique()
```

Encode the team names into one-hot vectors:
```
from sklearn.preprocessing import OneHotEncoder

onehot = OneHotEncoder()
onehot.fit(df[["Home Team"]])
```

List the unique categories corresponding to the one hot values:

```
onehot.categories_
```

Transform the team names into one-hot values:
```
X = onehot.transform(df[["Home Team"]]).toarray()
```
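To see what the encoder produces, here is a minimal hand-written sketch of one-hot encoding on a short list of team names. It is an illustration only, not scikit-learn's implementation; the `teams` list and the `one_hot` helper are made up for the example.

```python
teams = ["Chicago Bulls", "Atlanta Hawks", "Golden State Warriors", "Chicago Bulls"]

# Sorted unique values play the role of onehot.categories_[0]
categories = sorted(set(teams))

def one_hot(team):
    """Return a 0/1 vector with a single 1 at the team's category index."""
    return [1 if team == category else 0 for category in categories]

encoded = [one_hot(team) for team in teams]
for team, vector in zip(teams, encoded):
    print(vector, team)
```

Each row has exactly one 1, and the vector length equals the number of distinct teams.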

Fit a decision tree on the team names:
```
from sklearn.tree import DecisionTreeClassifier
dt_nba = DecisionTreeClassifier()

dt_nba.fit(X, y_homewin)
```

Print the tree in text format:
```
from sklearn.tree import export_text

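# Pass the names as a plain list; some scikit-learn versions do not accept a NumPy array here
print(export_text(dt_nba, feature_names=list(onehot.categories_[0])))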
```

## Visualize the Decision Tree


```
!pip install -q dtreeviz

import dtreeviz

viz = dtreeviz.model(dt_nba,
               X,
               y_homewin,
               target_name='HomeWin',
               feature_names=onehot.categories_[0],
               class_names=["No", "Yes"]
               )

viz.view(scale=1.5)

```

## Evaluation
Evaluation should be done on separate test data. scikit-learn provides a convenient function, `train_test_split`, for splitting arrays or matrices into random train and test subsets.

```
from sklearn.model_selection import train_test_split

# Split the given data set into train and test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y_homewin, test_size=0.2, random_state=42)

```

List the shapes of the training and test data sets:
```
X_train.shape, X_test.shape, y_train.shape, y_test.shape
```
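The idea behind a random split can be sketched in a few lines of plain Python: shuffle the items, then hold out a fraction of them. This is a simplified illustration under stated assumptions, not scikit-learn's implementation; `simple_train_test_split` and `game_ids` are hypothetical names.

```python
import math
import random

def simple_train_test_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of items and hold out a test_size fraction as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    split = len(shuffled) - n_test
    return shuffled[:split], shuffled[split:]

game_ids = list(range(10))
train_ids, test_ids = simple_train_test_split(game_ids)
print(len(train_ids), "training games,", len(test_ids), "test games")
```

Fixing the seed, like `random_state=42` above, makes the split reproducible from run to run.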

Train a decision tree on training data:
```
dt_nba = DecisionTreeClassifier()
dt_nba.fit(X_train, y_train)
```

Predict HomeWin labels on the test data:
```
y_pred = dt_nba.predict(X_test)
```

Evaluate the performance of the decision tree using team names:
```
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)
```

## Cross Validation
Cross-validation is a primary method for evaluating and selecting machine learning models on a limited data set. K-fold cross-validation works as follows:
- Shuffle the data set randomly.
- Split the data set into k groups.
- For each unique group:
  - Take the group as a hold-out (test) data set.
  - Take the remaining groups as a training data set.
  - Fit a model on the training set and evaluate it on the test set.
- Obtain the final score of the model using the scores on individual folds.
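
The steps above can be sketched by generating the index sets for each fold by hand. This is a minimal illustration, assuming a simple shuffle and round-robin assignment to folds; `k_fold_indices` is a hypothetical helper, not part of scikit-learn.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

for train, test in k_fold_indices(n_samples=10, k=5):
    print("train:", sorted(train), "test:", sorted(test))
```

In a real run, a model would be fitted on `train` and scored on `test` inside the loop, and the k scores would then be combined, typically by averaging.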

### Use cross_val_score from scikit-learn

```
from sklearn.model_selection import cross_val_score

dt_nba = DecisionTreeClassifier()

scores = cross_val_score(dt_nba, X, y_homewin, cv=10, scoring='accuracy')

```

Print out the mean accuracy:
```
print("The accuracy of predicting on names: {}".format(np.mean(scores)))
```

Compare to the home team advantage:
```
df["HomeWin"].mean()
```
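This comparison works because a model that always predicts a home win is right exactly as often as the home team wins. Here is a minimal sketch on made-up results; the `home_wins` list is illustrative and not taken from the data set.

```python
home_wins = [True, False, True, True, False]

# Fraction of games won by the home team
home_win_rate = sum(home_wins) / len(home_wins)

# Accuracy of a constant "home team always wins" prediction
always_home = [True] * len(home_wins)
baseline_accuracy = sum(p == a for p, a in zip(always_home, home_wins)) / len(home_wins)

print(home_win_rate, baseline_accuracy)
```

A classifier adds value here only if its cross-validated accuracy is above this baseline.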

## Create a new feature: who won the last game

Initialize two new columns with False (0) values:
```
df['HomeWonLast'] = 0 # did the home team win its last game?
df['VisitorWonLast'] = 0 # did the visitor team win its last game?
```

Use a dictionary to compute the correct values:

```
from collections import defaultdict
won_last = defaultdict(int)
```

Walk through the games in date order, recording whether each team won its previous game:
```
for index, row in df.sort_values("Date").iterrows():
    home_team = row['Home Team']
    visitor_team = row['Visitor Team']
    # Record each team's previous result before the current game is counted
    df.at[index, 'HomeWonLast'] = won_last[home_team]
    df.at[index, 'VisitorWonLast'] = won_last[visitor_team]
    # Update the dictionary with the result of the current game
    won_last[home_team] = int(row['HomeWin'])
    won_last[visitor_team] = 1 - int(row['HomeWin'])
```
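The same bookkeeping can be traced on a tiny, made-up schedule. The team names and results below are illustrative only, and `games` is assumed to be in chronological order already.

```python
from collections import defaultdict

# (home team, visitor team, did the home team win?) in chronological order
games = [
    ("A", "B", True),
    ("C", "D", False),
    ("B", "A", True),
    ("D", "A", True),
]

won_last = defaultdict(int)  # teams not seen yet default to 0
features = []
for home, visitor, home_win in games:
    # Look up previous results before updating them with the current game
    features.append((won_last[home], won_last[visitor]))
    won_last[home] = int(home_win)
    won_last[visitor] = 1 - int(home_win)

for game, feature in zip(games, features):
    print(game, "->", feature)
```

Note that a team's first game of the season gets a 0, the same value as a loss, because the dictionary has no earlier result for it.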

### Predict using HomeWonLast and VisitorWonLast

```
X_lastWon = df[['HomeWonLast', 'VisitorWonLast']]

dt_nba = DecisionTreeClassifier()

scores = cross_val_score(dt_nba, X_lastWon, y_homewin, scoring='accuracy')

print("The accuracy of predicting on who won last: {}".format(np.mean(scores)))

```

### Predict using team names and last-game results

```
name_oneHot = pd.get_dummies(df['Home Team'])

X_nameOneHot_lastWon = pd.concat([name_oneHot, df[['HomeWonLast', 'VisitorWonLast']]], axis=1)
X_nameOneHot_lastWon

```

Fit and evaluate the decision tree:
```
dt_nba = DecisionTreeClassifier()
scores = cross_val_score(dt_nba, X_nameOneHot_lastWon, y_homewin, scoring='accuracy')
print('The accuracy of predicting on names and last won: {}'.format(np.mean(scores)))
```

## Random Forest
A random forest is an ensemble method that combines a large number of individual decision trees.

It uses bagging and feature randomness when building each tree so that the trees in the forest are as uncorrelated as possible.

```
from sklearn.ensemble import RandomForestClassifier


randomForest = RandomForestClassifier(random_state=14)
scores = cross_val_score(randomForest, X_nameOneHot_lastWon, y_homewin, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

```
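The two sources of randomness mentioned above, bagging and feature randomness, can be sketched without any library. This is a simplified illustration with hypothetical names (`bootstrap_sample`, `random_feature_subset`, `feature_names`). scikit-learn draws the candidate features anew at each split; the sketch draws them once per tree only to keep the output short.

```python
import random

rng = random.Random(14)
n_samples = 8
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

def bootstrap_sample(n, rng):
    """Bagging: draw n row indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def random_feature_subset(features, k, rng):
    """Feature randomness: choose k distinct candidate features."""
    return rng.sample(features, k)

for tree_id in range(3):
    rows = bootstrap_sample(n_samples, rng)
    candidates = random_feature_subset(feature_names, 2, rng)
    print(f"tree {tree_id}: rows={rows} candidate features={candidates}")
```

Because rows are drawn with replacement, each tree usually sees some games more than once and misses others entirely, which is what makes the trees differ from one another.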

## Grid search for model selection
- A tuning technique that searches for the best values of a model's hyperparameters.
- An exhaustive search over a specified set of parameter values for a model.

```
from sklearn.model_selection import GridSearchCV

```

Define the grid of parameters for tuning:
```
parameter_space = {
"max_features": ["sqrt", "log2", None],
"n_estimators": [100, 200],
"criterion": ["gini", "entropy"],
"min_samples_leaf": [2, 4, 6],
}

```
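"Exhaustive" means that every combination of the listed values is tried. The following sketch enumerates the combinations of the grid above with `itertools.product`. It only illustrates the size of the search and is not how `GridSearchCV` is implemented.

```python
from itertools import product

parameter_space = {
    "max_features": ["sqrt", "log2", None],
    "n_estimators": [100, 200],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 4, 6],
}

# Build one dictionary per combination of parameter values
names = list(parameter_space)
combinations = [dict(zip(names, values)) for values in product(*parameter_space.values())]

print(len(combinations))  # 3 * 2 * 2 * 3 = 36
print(combinations[0])
```

With k-fold cross-validation each combination is fitted k times, so the cost of the search grows quickly as more values are added to the grid.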

Create a Random Forest and tune its parameters:
```
clf = RandomForestClassifier(random_state=14)

grid = GridSearchCV(clf, parameter_space)

grid.fit(X_nameOneHot_lastWon, y_homewin)
print("Accuracy: {0:.1f}%".format(grid.best_score_ * 100))
```

Print out the best model:
```
print(grid.best_estimator_)
```

## Predict Using the Final Model

```
final_model =  grid.best_estimator_

# Note: these are the same games the model was fitted on, so this accuracy is optimistic
y_pred = final_model.predict(X_nameOneHot_lastWon)

accuracy_score(y_homewin, y_pred)

```
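The accuracy above is computed on the same games that were used to fit and tune the model, so it tends to overstate performance on unseen games. One common alternative is to set aside a test set before the grid search. The sketch below shows that pattern on synthetic data from `make_classification`, which stands in for the NBA features. The small grid and all variable names are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the NBA features; not the real data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=14)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Small illustrative grid so the search runs quickly
small_grid = {"n_estimators": [10, 20], "min_samples_leaf": [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=14), small_grid, cv=3)
search.fit(X_tr, y_tr)  # tuning only ever sees the training portion

train_acc = accuracy_score(y_tr, search.best_estimator_.predict(X_tr))
test_acc = accuracy_score(y_te, search.best_estimator_.predict(X_te))
print(f"training accuracy: {train_acc:.3f}, held-out accuracy: {test_acc:.3f}")
```

The held-out accuracy is the better estimate of how the tuned model would perform on games it has not seen.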