# Introduction
***
Predicting the outcome of a sports match has always been a challenging and interesting problem, and it has therefore drawn a wide range of research.

Today, with the huge amount of comprehensive data available in sports datasets, data modeling techniques enable us to uncover hidden knowledge that can impact the sports industry.


# The Aim
***
This project focuses on using machine-learning algorithms to build a model for predicting NBA game outcomes, which could be used for team and game analysis to improve performance, or even for more fun: betting!

***

In terms of betting, there are three common bet lines:

-	Moneyline (who wins)
-	Spread (who wins and by how many points)
-	Total (total combined score of both teams)

In this project, the target is the total points scored by both teams in an upcoming match, predicted from the teams' historical data.

<img style="float: center;" src="./Pictures/Bet_Sample.png">


### So, I want to predict if I should bet on Over or Under the score line! 
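
As a minimal sketch of that decision, the predicted total can be compared directly with the bet line. The function name `pick_side` and the "no bet" rule for an exact match are assumptions made for illustration, not part of the project code.

```python
def pick_side(predicted_total, bet_line):
    """Return which side of the total line the prediction falls on."""
    if predicted_total > bet_line:
        return "over"
    if predicted_total < bet_line:
        return "under"
    # Assumption: when the prediction equals the line, skip the game.
    return "no bet"


# Hypothetical numbers, not taken from the project's data.
print(pick_side(212.4, 208.5))
```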

# What's Happening?

Before exploring the problem, it is good to have a broad view of what is happening in the real world of betting.

The actual total points are spread over a wide range, between ~140 and ~270, which makes them hard to predict. The bet lines fall in a narrower range of ~170 to ~240. This means there is a meaningful error in the betting companies' predictions: the games are hard to predict even for them.
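
The bet error used in this section is the difference between the actual total points and the total bet line. A small sketch of how it and its mean absolute value could be computed is shown here; the sample numbers are hypothetical and only illustrate the arithmetic.

```python
from statistics import mean

# Hypothetical sample values, not taken from the project's data set.
total_points = [198, 225, 187, 241]
bet_lines = [204.5, 211.0, 195.5, 218.0]

# Bet error per game: actual total minus the bookmaker's line.
bet_errors = [actual - line for actual, line in zip(total_points, bet_lines)]

# Mean absolute error of the bet line over these games.
mae = mean(abs(error) for error in bet_errors)
print(bet_errors, mae)
```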

#### Distribution of Total Points vs. Total Bet Line :
<img style="float: center;" src="./Pictures/distribution.png">

#### Distribution of Bet Errors (difference between Total Points and Total Bet Line) :

<img style="float: center;" src="./Pictures/beterror.png">

#### Base Line :

<img style="float: center;" src="./Pictures/baseline.png">


# Raw Data
***
### Game Data
  - Game schedule and results for each season (regular seasons 2013 to 2017)
    - Includes statistics for each game
    - Player statistics are aggregated per match
    - Scraped from http://www.basketball-reference.com
    
<img style="float: center;" src="./Pictures/Four_Factors.png">
<img style="float: center;" src="./Pictures/basic.png">
<img style="float: center;" src="./Pictures/advanced.png">

### Bet Data

  - Total points bet line for each game
    - Scraped from  http://www.oddsshark.com

<img style="float: center;" src="./Pictures/bets_log.png">


# Feature Engineering
***
In fact, we have little information about an upcoming match. The game date and time, the location, the home and away team names, and some data about the team arrangement are all we have.

However, we have a lot of information about what happened in past games: the team standings, the physical pressure on players based on their schedule and the distance they traveled, how the two teams previously played against each other, and so on.

I also believe that teams perform differently at home and away, that teams adopt different strategies based on their own and their opponent's standings, and that some key players really change the outcome of a match.


##### The key to success in this project is to find and create features that can represent hidden information about the upcoming match. Some of these features are built by aggregating the historical statistics of past games, and others are created to represent factors that can affect player performance.

### Created Features
This project tries to create heuristic features that can provide information about the upcoming match:

- #### Aggregated data from previous matches:
    - The average performance statistics of the last 5 games played as home or away team, for the home and away teams
    - The average performance statistics of the last 10 games for the home and away teams
    - The average performance statistics of the last 3 games the home and away teams played against each other
- #### Physical abilities :
    - How many rest days before the upcoming match (recovery time)
    - Whether the last match of the home or away team went to overtime (minutes played)
    - How many matches in the last 5 days
    - Distance traveled since the last match (under construction)
    - Distance traveled since the last home match (under construction)
- #### Motivation :
    - How many wins and losses in the current season
    - How many wins and losses in the last 10 matches
    - How many games behind separate the two teams
    - The streak (how many wins or losses in a row)
    - Home or away streak (how many home or away games in a row)
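
Two of the features listed above, a rolling average over previous games and the number of rest days, could be sketched with pandas roughly as follows. The column names (`team`, `date`, `points`), the 3-game window, and the sample rows are all hypothetical; the project's actual schema is not shown in this README.

```python
import pandas as pd

# Hypothetical game log: one row per team per game.
games = pd.DataFrame({
    "team": ["BOS"] * 4 + ["LAL"] * 4,
    "date": pd.to_datetime(
        ["2017-01-01", "2017-01-03", "2017-01-04", "2017-01-08"] * 2
    ),
    "points": [100, 110, 90, 120, 95, 105, 115, 85],
})
games = games.sort_values(["team", "date"]).reset_index(drop=True)

# Average points over the previous 3 games of the same team.
# shift(1) keeps the current game out of its own feature (no leakage).
games["avg_points_last3"] = games.groupby("team")["points"].transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=3).mean()
)

# Rest days: calendar days since the team's previous game.
games["rest_days"] = games.groupby("team")["date"].diff().dt.days

print(games)
```

The `shift(1)` step matters: without it, the result of the game being predicted would be included in its own input features.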
    
### Assumptions

As shown above, I use data from the last 10 games to predict the upcoming match. There are also many fundamental changes between seasons (coach, players, etc.), so it doesn't seem wise to use the first 10 games of each season, and I have ignored them.
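
One way to drop those early-season games is to number each team's games within a season and keep only the rows that already have 10 games behind them. This is a sketch under the assumption that the cutoff is applied per team and per season; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical log: 12 games for one team in one season.
log = pd.DataFrame({
    "season": [2016] * 12,
    "team": ["BOS"] * 12,
    "game_number": list(range(1, 13)),
})

# cumcount() numbers each team's games within a season, starting at 0,
# so it equals the number of games already played before this one.
log["games_played_before"] = log.groupby(["season", "team"]).cumcount()

# Keep only games with at least 10 earlier games in the same season.
usable = log[log["games_played_before"] >= 10]
print(usable)
```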


### Data Structure :

I used three different CSV files to store my data:
- Games and bet raw data, ready for feature engineering
   - Includes all the information about past games.
- Featured data
   - Includes some data about the next match and all the historical data from past games.
   - This data set includes all of my engineered features.
- Predicted data
   - Includes the predictions vs. true values, in order to run EDA on the predictions
   
### Correlation Between Created Features and Basic Statistics (Before Aggregation) With Target
<img style="float: center;" src="./Pictures/corr.png">


# Model Selection 
***
I chose to use regression models to predict the total points of the upcoming match. I can then compare the prediction to the bet line and see how many bets I win or lose. Regression models also give me the chance to review my predictions, find high-value residuals, and try to apply some models to predict these 'outliers' and avoid betting on those games.

I tried : 

- LinearRegression
- Ridge
- SVR
- RandomForestRegressor
- AdaBoostRegressor

Looking at the different models, the best results came from RandomForestRegressor.

>The Grid Search results for RandomForestRegressor :
- n_estimators=1500
- min_samples_split=16
- max_features=None
- max_depth=None
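
A grid search over these four parameters can be sketched with scikit-learn's `GridSearchCV`. The candidate values, the scoring metric, the number of folds, and the synthetic data below are assumptions chosen to keep the example small; the grid actually searched in this project is not listed in the README.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the engineered feature matrix and total points.
X, y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=0)

# Hypothetical, deliberately tiny grid over the same four parameters.
param_grid = {
    "n_estimators": [10, 20],
    "min_samples_split": [2, 16],
    "max_features": [None, "sqrt"],
    "max_depth": [None, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # assumption: optimise MAE
    cv=3,                               # assumption: 3-fold cross-validation
)
search.fit(X, y)

best = search.best_params_
print(best)
```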

I can also compute a confusion matrix with the values "True Over", "False Over", "True Under" and "False Under" to get a better understanding of the results.
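
The four cells can be counted from the predictions, the bet lines, and the actual totals. The sketch below assumes that ties, and games where the prediction equals the line, are left out; the function name and sample numbers are hypothetical.

```python
from collections import Counter


def over_under_confusion(predictions, bet_lines, totals):
    """Count True/False Over and True/False Under bets."""
    counts = Counter()
    for pred, line, total in zip(predictions, bet_lines, totals):
        if total == line or pred == line:
            continue  # assumption: ties and no-opinion games are skipped
        bet = "Over" if pred > line else "Under"
        outcome = "Over" if total > line else "Under"
        label = ("True " if bet == outcome else "False ") + bet
        counts[label] += 1
    return counts


# Hypothetical numbers, one game per cell.
cells = over_under_confusion(
    predictions=[210, 200, 215, 190],
    bet_lines=[205, 205, 210, 195],
    totals=[220, 210, 200, 180],
)
print(cells)
```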

# Model Evaluation
***
## Train and Test Data :

- Data :                 The NBA regular season
- Train Data:            Four seasons (2013,2014,2015,2016)
- Train Data Shape :     (3768, 229)
- Test Data:             Current Season (2017)
- Test Data Shape :      (405, 229)

## Evaluation Method
In terms of betting, profit/loss is the best measure for evaluating the model.
But first, let's define the assumptions :

### Betting Assumptions : 
- If we win a bet, we gain 1.0 point
- If we lose a bet, we lose 1.1 points (the betting company's margin)
- There are some tie games (if the total points equal the bet line, the bet is a tie)
- We bet on all games of the season after the 10th game, with the same stake on every bet

Knowing that 

### $$Count(Win) + Count(Lose) + Count(Tie) = Count(All)$$

So, the profit/loss percentage formula, which shows the model performance, would be :

### $$\alpha = 100*\frac{Count(Win) - 1.1*Count(Lose)}{Count(All) - Count(Tie)} $$

#### If Alpha is a positive number, the model is profitable, and the value of Alpha shows the percentage of profit.
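
The Alpha formula translates directly into a small function. Since Count(All) - Count(Tie) equals Count(Win) + Count(Lose), only the win and loss counts are needed. The function name is hypothetical. Under the betting assumptions above, Alpha is zero when Count(Win) = 1.1 * Count(Lose), which corresponds to a win rate of 1.1 / 2.1, or roughly 52.4% of decided bets.

```python
def alpha_score(wins, losses):
    """Profit/loss percentage over decided (non-tie) bets."""
    decided = wins + losses  # Count(All) - Count(Tie)
    if decided == 0:
        raise ValueError("no decided bets")
    # +1.0 per win, -1.1 per loss, as in the betting assumptions.
    return 100 * (wins - 1.1 * losses) / decided


# Hypothetical season: 55 wins and 45 losses gives an Alpha of about 5.5.
print(round(alpha_score(55, 45), 2))
```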
***
## Actual Results :

#### Mean Absolute Error of Prediction and Mean Absolute Error of Bet Line :
<img style="float: center;" src="./Pictures/mae.png">

#### Base Line :
<img style="float: center;" src="./Pictures/baseline_actual.png">

#### Alpha Score : 
<img style="float: center;" src="./Pictures/win_lose.png">

#### Confusion Matrix :
<img style="float: center;" src="./Pictures/confusion_metrix.png">

#### Comparing predicted values and true values and bet line
<img style="float: center;" src="./Pictures/result1.png">

#### Sample Of Predictions and Residuals
<img style="float: center;" src="./Pictures/result2.png">

#### Most Important Coefficients From Linear Regression (Mean of 1000 Train-Test Splits)
<img style="float: center;" src="./Pictures/coef.png">

#### Feature Importance from Random Forest Regressor
<img style="float: center;" src="./Pictures/featureimportance.png">


# Future Work

### Players Detail Data : 
Injured or banned players can affect the results significantly.
There is also some information about the arrangement available a few hours before the match starts, which could improve predictions by using actual player data.

### More Data :
More data is always better.

### More Features :
Being away has a huge impact on performance. Either travel or a time zone change could affect players.
Playing more than one away game in a row could significantly reduce the physical abilities and concentration of players.

### Using the advantage against bet companies
There is one huge potential advantage for me: I don't have to bet on every single game. So if I can find the games that are risky to predict, I can simply skip them.
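
How to identify risky games is still open. One simple heuristic, shown here purely as an illustrative assumption and not as something this project has implemented or validated, is to bet only when the prediction differs from the bet line by at least some margin. The function name and the 3-point threshold are hypothetical.

```python
def select_bets(predictions, bet_lines, min_edge=3.0):
    """Return (game_index, side) pairs for games worth betting on."""
    picks = []
    for index, (pred, line) in enumerate(zip(predictions, bet_lines)):
        edge = pred - line
        # Skip games where the prediction sits too close to the line.
        if abs(edge) >= min_edge:
            picks.append((index, "over" if edge > 0 else "under"))
    return picks


# Hypothetical numbers: the second game is skipped (edge of only 1 point).
print(select_bets([210, 206, 190], [205, 205, 195]))
```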
