# Predicting Sports Winners with Decision Trees

Here, we will look at predicting the winner of sports matches using a different type of classification algorithm: decision trees. These algorithms have a number of advantages over other algorithms. One of the main advantages is that they are readable by humans, so a decision tree can be used to learn a procedure that could then be given to a human to perform if needed. Another advantage is that they work with a variety of features.

## Loading the dataset

We'll be looking at predicting the winner of games of the **National Basketball Association (NBA)**. Matches in the NBA are often close and can be decided in the last minute, making predicting the winner quite difficult. Many sports share this characteristic, whereby the expected winner could be beaten by another team on the right day.

Various research into predicting the winner suggests that there may be an upper limit to sports outcome prediction accuracy which, depending on the sport, is between 70 percent and 80 percent accuracy. There is a significant amount of research being performed into sports prediction, often through data mining or statistics-based methods.

## Collecting the data

The data we will be using is the match history data for the NBA for the 2013-2014 season. The website http://Basketball-Reference.com contains a significant number of resources and statistics collected from the NBA and other leagues. To download the dataset, perform the following steps:

1. Navigate to http://www.basketball-reference.com/leagues/NBA_2014_games.html in your web browser.
2. Choose the option to get the table as a CSV file.
3. Copy and paste the CSV into a file in your data folder and make a note of the path.

This will download a **CSV** (short for **Comma Separated Values**) file containing the results of the 1,230 games in the regular season for the NBA.

CSV files are simply text files where each line contains a new row and each value is separated by a comma (hence the name). They can be created manually by typing into a text editor and saving with a *.csv* extension. They can be opened in any program that reads text files, and also in Excel as a spreadsheet.
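To make the format concrete, here is a minimal sketch that parses a small piece of CSV text with Python's built-in *csv* module. The column names and the `raw_text` variable are hypothetical and chosen only for illustration; the two rows reuse scores from the season's opening night.

```python
import csv
import io

# Hypothetical sample: a header line followed by two rows, exactly as it
# could be typed into a text editor and saved with a .csv extension
raw_text = (
    "Visitor,VisitorPts,Home,HomePts\n"
    "Orlando Magic,87,Indiana Pacers,97\n"
    "Chicago Bulls,95,Miami Heat,107\n"
)

# DictReader uses the first line as the column names for every later row
reader = csv.DictReader(io.StringIO(raw_text))
rows = list(reader)

for row in rows:
    # Every value is read as a string; converting to numbers is up to us
    print(row["Home"], int(row["HomePts"]) - int(row["VisitorPts"]))
```

Note that the *csv* module leaves every value as a string, which is one reason a library that infers column types is more convenient for analysis.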

We will load the file with the **pandas** (short for **Python Data Analysis**) library, which is an incredibly useful library for manipulating data. Python also contains a built-in library called *csv* that supports reading and writing CSV files. However, we will use pandas, which provides more powerful functions that we will use later in the chapter for creating new features.

In [1]:
DATA = 'data/'
NBA_2014_SEASON = DATA + 'leagues_NBA_2014_games_games.csv'

In [2]:
import numpy as np
import pandas as pd

results = pd.read_csv(NBA_2014_SEASON)
results.head()

Unnamed: 0,Date,Start (ET),Visitor/Neutral,PTS,Home/Neutral,PTS.1,Unnamed: 6,Unnamed: 7,Attend.,LOG,Arena,Notes
0,Tue Oct 29 2013,7:00p,Orlando Magic,87,Indiana Pacers,97,Box Score,,18165,2:17,Bankers Life Fieldhouse,
1,Tue Oct 29 2013,10:30p,Los Angeles Clippers,103,Los Angeles Lakers,116,Box Score,,18997,2:27,STAPLES Center,
2,Tue Oct 29 2013,8:00p,Chicago Bulls,95,Miami Heat,107,Box Score,,19964,2:32,AmericanAirlines Arena,
3,Wed Oct 30 2013,7:00p,Brooklyn Nets,94,Cleveland Cavaliers,98,Box Score,,20562,2:23,Quicken Loans Arena,
4,Wed Oct 30 2013,8:30p,Atlanta Hawks,109,Dallas Mavericks,118,Box Score,,19834,2:14,American Airlines Center,


## Cleaning up the dataset

After looking at the output, we can see a number of problems:
- The date is just a string and not a date object
- From visually inspecting the results, the headings aren't complete or correct
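The first problem matters because strings are compared alphabetically rather than chronologically. The following sketch illustrates this using only the standard library; the `DATE_FORMAT` pattern is an assumption based on the date strings shown in the output above, and the second date is a hypothetical later game.

```python
from datetime import datetime

# Assumed pattern for strings such as "Tue Oct 29 2013"
DATE_FORMAT = "%a %b %d %Y"

earlier = "Tue Oct 29 2013"
later = "Fri Nov 1 2013"  # hypothetical later game, three days on

# As plain strings, the comparison is alphabetical ("T" sorts after "F"),
# so the earlier game is NOT reported as coming first
string_order_correct = earlier < later

# As datetime objects, the comparison is chronological
earlier_date = datetime.strptime(earlier, DATE_FORMAT)
later_date = datetime.strptime(later, DATE_FORMAT)
date_order_correct = earlier_date < later_date

print(string_order_correct, date_order_correct)
print((later_date - earlier_date).days)
```

Date objects also support arithmetic, such as the number of days between two games, which plain strings do not.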

These issues come from the data, and we could fix them by altering the data file itself. However, in doing this, we could forget the steps we took or misapply them, and then we would not be able to replicate our results. As in the previous section, where we used pipelines to track the transformations we made to a dataset, we will use pandas to apply the transformations to the raw data in code.

In [11]:

# Parse the date column as a date
results = pd.read_csv(NBA_2014_SEASON, parse_dates=["Date"])
# Fix the name of the columns
results.columns = ['Date', 'Start (ET)', 'Visitor Team', 'VisitorPts', 'Home Team', 'HomePts', 'Score Type', 'OT?', 'Attend.', 'LOG', 'Arena', 'Notes']
results.head()

Unnamed: 0,Date,Start (ET),Visitor Team,VisitorPts,Home Team,HomePts,Score Type,OT?,Attend.,LOG,Arena,Notes
0,2013-10-29,7:00p,Orlando Magic,87,Indiana Pacers,97,Box Score,,18165,2:17,Bankers Life Fieldhouse,
1,2013-10-29,10:30p,Los Angeles Clippers,103,Los Angeles Lakers,116,Box Score,,18997,2:27,STAPLES Center,
2,2013-10-29,8:00p,Chicago Bulls,95,Miami Heat,107,Box Score,,19964,2:32,AmericanAirlines Arena,
3,2013-10-30,7:00p,Brooklyn Nets,94,Cleveland Cavaliers,98,Box Score,,20562,2:23,Quicken Loans Arena,
4,2013-10-30,8:30p,Atlanta Hawks,109,Dallas Mavericks,118,Box Score,,19834,2:14,American Airlines Center,


Now that we have our dataset, we can compute a **baseline**. A baseline is the accuracy achieved by a simple, easy-to-implement strategy. Any data mining solution should beat this.

In each match, we have two teams: a home team and a visitor team. An obvious baseline, called the chance rate, is 50 percent. Choosing randomly will (over time) result in an accuracy of 50 percent.
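The chance rate can be checked with a quick simulation. This is only a sketch on synthetic data: the number of simulated games and the 60 percent home-win rate are hypothetical values chosen for illustration, not figures from the dataset.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

N_GAMES = 100_000      # hypothetical number of simulated matches
HOME_WIN_RATE = 0.6    # hypothetical; not measured from the NBA data

# True means the home team won the simulated match
actual = [random.random() < HOME_WIN_RATE for _ in range(N_GAMES)]

# Guess the winner of each match with a fair coin flip
guesses = [random.random() < 0.5 for _ in range(N_GAMES)]

# Fraction of matches where the coin flip matched the outcome
accuracy = sum(a == g for a, g in zip(actual, guesses)) / N_GAMES
print(f"Random-guess accuracy: {accuracy:.3f}")
```

The accuracy settles near 50 percent regardless of how often the home team actually wins, because each coin flip is right half the time whichever way the match went.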

## Extracting new features