<a href="https://colab.research.google.com/github/bojanpetrushevski/basketball-data-science/blob/master/Intro%20and%20data%20acquisition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Basketball data science**
---
Just as data science has contributed to many other fields, it can also enrich the perspective that basketball strategists have on the game. This project focuses on analysing data for [NBA](https://en.wikipedia.org/wiki/National_Basketball_Association) basketball players who played in the past few seasons. Since today's NBA is very different from the league of decades ago, there are many facts that the naked eye cannot notice. This is where data science can come in handy. My idea is to analyze a dataset in order to predict some useful information about the players.


For instance, I am planning to predict a player's win shares (a player statistic which attempts to divvy up credit for team success to the individuals on the team, per [Basketball Reference](https://www.basketball-reference.com/about/ws.html)) from his statistics during a given season. The prediction could use other statistics about the player, such as the number of shots he takes in a game, his shooting percentage, his points percentage, etc. This information could be useful to the front office executives of every NBA team, such as general managers, scouts, risk managers, assistant managers and other front office staff, so they can see the big picture of which players are crucial for which teams and, even more importantly, how much a player will really contribute on a team level if they expect him to post certain statistics next season. Technically speaking, this output is continuous, so I could use some of the following techniques: linear regression, support vector regression, k-nearest neighbors and decision tree regression. I will try to train models using several of them and compare the results to see which algorithm is the most accurate for this kind of problem.
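
The comparison of the four regression techniques could be sketched as follows. This is only a minimal illustration under stated assumptions: it uses scikit-learn (not named in this notebook), synthetic data from `make_regression` in place of the real player statistics, and mean cross-validated R² as the accuracy measure, which is one possible choice among several.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 200 "players", 5 numeric features,
# and a continuous target playing the role of win shares.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Distance- and margin-based models get scaled inputs; the others are used as-is.
models = {
    "linear regression": LinearRegression(),
    "support vector regression": make_pipeline(StandardScaler(), SVR()),
    "k-nearest neighbors": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "decision tree regression": DecisionTreeRegressor(random_state=0),
}

# Mean R^2 over 5 cross-validation folds for each candidate model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

# Print the models from best to worst score.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean R^2 = {score:.3f}")
```

On the real dataset, `X` would hold the numeric player statistics and `y` the `WS` column; the ranking of the models on synthetic data says nothing about how they would rank there.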


Another idea is to predict a player's position based on his other statistics, since an NBA player can behave as a playmaker even though his [original position](https://www.myactivesg.com/Sports/Basketball/How-To-Play/Basketball-Rules/Basketball-Positions-and-Roles) is forward or maybe even center. Again, this kind of analysis could help the front offices of NBA teams to determine what kind of players (position-wise) they really have and which position holes need to be filled on a given team. Since this output would be categorical, I thought of using techniques like neural networks, decision tree classification, rule-based classification, etc.
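
A minimal sketch of the position-classification idea is shown below, again under assumptions: scikit-learn is used, the data is synthetic (`make_classification` with five classes standing in for the five positions), and the hyperparameters such as `max_depth=5` and `hidden_layer_sizes=(16,)` are arbitrary illustrative values, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Class index -> position label (the mapping order is an assumption).
POSITIONS = ["PG", "SG", "SF", "PF", "C"]

# Synthetic stand-in data (assumption): 500 "players", 6 features, 5 classes.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

# Hold out a quarter of the rows, keeping class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

classifiers = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "neural network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    ),
}

# Fit each classifier and record its accuracy on the held-out rows.
accuracy = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    accuracy[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: accuracy = {accuracy[name]:.3f}")

# Map a predicted class index back to a position label.
first_prediction = POSITIONS[classifiers["decision tree"].predict(X_test[:1])[0]]
print("predicted position for the first test player:", first_prediction)
```

With the real data, the `Pos` column would be encoded as the target. Comparing the predicted position with the listed one is what would reveal players who behave differently from their nominal role.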

In general, my goal is to try as many of the techniques that I was introduced to during the Data Mining course as my time and resources allow. It will be interesting to analyze the final results that different algorithms produce on the same dataset.


First, we need to do some data acquisition in order to create our own datasets. For that purpose, I will be using the BeautifulSoup package to scrape the data.
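
Before hitting the live site, the parsing idea can be tried offline on a tiny inline table. The HTML below is made up for illustration (an assumption, not a copy of a real page). It only imitates the structure the scraping code in the next cell relies on: a header row of `<th>` cells and data rows marked with the `full_table` class, with the ranking held in a `<th>` cell.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up miniature table (assumption) with placeholder players and values.
SAMPLE_HTML = """
<table>
  <tr><th>Rk</th><th>Player</th><th>Pos</th><th>PTS</th></tr>
  <tr class="full_table"><th>1</th><td>Player A</td><td>PG</td><td>20.1</td></tr>
  <tr class="full_table"><th>2</th><td>Player B</td><td>C</td><td>11.4</td></tr>
</table>
"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Header texts from the first row, skipping the ranking column.
headers = [th.get_text() for th in soup.find("tr").find_all("th")][1:]

# One list of cell texts per data row; the rank sits in a <th>, so the <td> cells skip it.
rows = soup.find_all("tr", class_="full_table")
player_stats = [[td.get_text() for td in row.find_all("td")] for row in rows]

sample = pd.DataFrame(player_stats, columns=headers)
print(sample)
```

Note that every value comes out of `get_text()` as a string, including the numeric ones.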

In [6]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
from enum import Enum
# enum for the different types of basketball player statistics
DATA_TYPE = Enum('DATA_TYPE', 'per_game advanced')

# function to merge two pandas dataframes
def merge_dataframes(df1, df2, on, how):
  return pd.merge(df1, df2, on=on, how=how)

# function to concat two pandas dataframes
def concat_dataframes(df1, df2):
  return pd.concat([df1, df2], ignore_index=True)

# class to scrape basketball data from Basketball Reference
class BasketballReferenceScrapper:
    def __init__(self, url):  
        self.url = url

    def scrap_data(self, year, data_type):
        # replace the placeholders in url with the year and data_type
        url_to_open = self.url.format(year, data_type)      
        print(url_to_open)
        html = urlopen(url_to_open)
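        # name the parser explicitly so the result does not depend on which parsers are installed
        soup = BeautifulSoup(html, 'html.parser')
        # the first table row holds the column headers; use getText() to extract them into a list
        headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]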
        # exclude the first column as we will not need the ranking order from Basketball Reference for the analysis
        headers = headers[1:]
        print(headers)

        # get all rows which are actually data; the 'full_table' class excludes the repeated header rows
        rows = soup.findAll('tr', {'class': 'full_table'})
        player_stats = [[td.getText() for td in row.findAll('td')]
                    for row in rows]

        return pd.DataFrame(player_stats, columns = headers)

#creating an instance of the class with the url of the website
scrapper = BasketballReferenceScrapper('https://www.basketball-reference.com/leagues/NBA_{0}_{1}.html')

#static list with the years we will analyze
years = ['2015', '2016', '2017', '2018', '2019', '2020']

data = None

# iterate through the years and scrape data for each one
for year in years:
  # scrape the basic per-game stats
  per_game_stats = scrapper.scrap_data(year, DATA_TYPE.per_game.name)
  # scrape the advanced stats
  advanced_stats = scrapper.scrap_data(year, DATA_TYPE.advanced.name)
  # combine both data frames into one using the previously declared function
  combined_stats = merge_dataframes(per_game_stats, advanced_stats, ['Player', 'Pos', 'Age', 'Tm', 'G'], 'inner')
  # rename some columns 
  combined_stats.rename(columns={'MP_x': 'MPG', 'MP_y': 'MPT'}, inplace=True)
  # append the current year as a column to every row in the dataset
  combined_stats['END_YEAR'] = year
  # if this is the first item from the list, assign the variable; otherwise append it to the current dataframe
  if data is None:
    data = combined_stats
  else:
    data = concat_dataframes(data, combined_stats)

#drop empty columns
data = data.drop(['\xa0'], axis = 1)



https://www.basketball-reference.com/leagues/NBA_2015_per_game.html
['Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']
https://www.basketball-reference.com/leagues/NBA_2015_advanced.html
['Player', 'Pos', 'Age', 'Tm', 'G', 'MP', 'PER', 'TS%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', '\xa0', 'OWS', 'DWS', 'WS', 'WS/48', '\xa0', 'OBPM', 'DBPM', 'BPM', 'VORP']
https://www.basketball-reference.com/leagues/NBA_2016_per_game.html
['Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']
https://www.basketball-reference.com/leagues/NBA_2016_advanced.html
['Player', 'Pos', 'Age', 'Tm', 'G', 'MP', 'PER', 'TS%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'T

As you may have noticed, we used the Basketball Reference website (one of the most popular websites for basketball data). Next, we will look at the info for the dataset, which contains data for all players in the period 2015 - 2020. We will also export the dataset to a CSV file.

In [5]:
data.info()

data.to_csv('/content/drive/My Drive/nba_players_dataset.csv', encoding='utf-8')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3038 entries, 0 to 3037
Data columns (total 51 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Player    3038 non-null   object
 1   Pos       3038 non-null   object
 2   Age       3038 non-null   object
 3   Tm        3038 non-null   object
 4   G         3038 non-null   object
 5   GS        3038 non-null   object
 6   MPG       3038 non-null   object
 7   FG        3038 non-null   object
 8   FGA       3038 non-null   object
 9   FG%       3038 non-null   object
 10  3P        3038 non-null   object
 11  3PA       3038 non-null   object
 12  3P%       3038 non-null   object
 13  2P        3038 non-null   object
 14  2PA       3038 non-null   object
 15  2P%       3038 non-null   object
 16  eFG%      3038 non-null   object
 17  FT        3038 non-null   object
 18  FTA       3038 non-null   object
 19  FT%       3038 non-null   object
 20  ORB       3038 non-null   object
 21  DRB       3038
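
The `data.info()` output reports every column as `object`, because the scraped cells are plain strings. A possible next step, sketched here on a hypothetical two-row frame with made-up values, is to convert the statistic columns with `pd.to_numeric`. The choice of which columns stay as text, and the use of an empty string for a missing value, are assumptions for the example.

```python
import pandas as pd

# Hypothetical frame (assumption) with the same kind of string-typed columns
# that scraping produces; the players and values are made up.
scraped = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Pos": ["PG", "C"],
    "Age": ["25", "31"],
    "FG%": [".451", ""],  # an empty string stands for a missing value
    "END_YEAR": ["2015", "2015"],
})

# Columns that should stay as text; everything else is treated as numeric.
text_columns = ["Player", "Pos"]
numeric_columns = [col for col in scraped.columns if col not in text_columns]

# errors="coerce" turns unparseable cells (such as "") into NaN instead of raising.
for col in numeric_columns:
    scraped[col] = pd.to_numeric(scraped[col], errors="coerce")

print(scraped.dtypes)
```

Coercing silently hides bad cells, so it is worth checking `scraped.isna().sum()` afterwards to see how many values were lost in each column.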

The second dataset I wa