## Baller-Metrics

### Introduction

If you ask me, one of the coolest applications of Data Science is in sports. Beyond summary statistics, the field of Sport Analytics deals with the analysis of sport data to reveal insightful patterns that can improve in-game performance, reduce uncertainty in the outcome of games, or even uncover overlooked talent. The techniques behind sport analytics are some of the most elegant implementations of mathematical modelling, and the results are the stuff of headlines.

In this project you will journey with me as we curate a robust dataset that supports a number of sport analytics techniques. The dataset is based primarily on game-by-game data for each active player. Information on each team and on player attributes will be combined with this dataset to form a database.


In [None]:
import string
import scrapy
import pandas as pd
# Import the CrawlerProcess
from scrapy.crawler import CrawlerProcess

In [None]:
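# Letters a to z, excluding x
letters = string.ascii_lowercase[:23] + "yz"

base_url = "https://www.basketball-reference.com/players/"

# Prepend the base URL to each letter (the loop variable is named `letter`
# so that it does not shadow the imported `string` module)
url_list = [base_url + letter for letter in letters]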


In [None]:
# Create the Spider class
class BasketSpider(scrapy.Spider):
    name = 'BasketSpider'
    # start_requests method
    def start_requests(self):
        # Batched manually by adjusting the slice below
        for url in url_list[10:]:
            yield scrapy.Request(url, callback=self.parse_getplayers)
          
    

    def parse_getplayers(self, response):
        # Get the links of all players whose names appear in bold (<strong>)
        players_links = response.xpath('//tbody//tr//strong/a/@href').extract()
        # Follow each player link
        for player_link in players_links:
            yield response.follow(url=player_link, callback=self.parse_getseasons)
    
    
    def parse_getseasons(self, response):
        # Get the season links listed under the second item of the inner navigation menu
        season_links = response.xpath('//div[@id="inner_nav"]/ul/li[2]/div/ul[1]//@href').extract()
        # Follow each season link to its game log table
        for season_link in season_links:
            yield response.follow(url=season_link, callback=self.parse)
        
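    def parse(self, response):
        # Parse the player name: take the raw <u> element and slice off the
        # leading '<u>' (3 characters) and the last 13 characters
        player_name = response.xpath('//div[@id="inner_nav"]/ul/li[1]/a/u').extract_first()[3:-13]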
        
        game_stats = response.xpath('//table/tbody//tr')
        # ignore the table header row
        for game in game_stats[1:]:
            three_pct = game.xpath('td[15]//text()').extract_first()
            three = game.xpath('td[13]//text()').extract_first()
            threePA = game.xpath('td[14]//text()').extract_first()
            plus_minus = game.xpath('td[29]//text()').extract_first()
            FG_pct = game.xpath('td[12]//text()').extract_first()
            FT_pct = game.xpath('td[18]//text()').extract_first()
            G    =  game.xpath('td[1]//text()').extract_first()
            Date =  game.xpath('td[2]//text()').extract_first()
            Age  =  game.xpath('td[3]//text()').extract_first()
            Team =  game.xpath('td[4]//text()').extract_first()
            At   =  game.xpath('td[5]//text()').extract_first()
            Opp  =  game.xpath('td[6]//text()').extract_first()
            Form =  game.xpath('td[7]//text()').extract_first()
            GS   =  game.xpath('td[8]//text()').extract_first()
            MP   =  game.xpath('td[9]//text()').extract_first()
            FG   =  game.xpath('td[10]//text()').extract_first()
            FGA  =  game.xpath('td[11]//text()').extract_first() 
            FT   =  game.xpath('td[16]//text()').extract_first()
            FTA  =  game.xpath('td[17]//text()').extract_first() 
            ORB  =  game.xpath('td[19]//text()').extract_first()
            DRB  =  game.xpath('td[20]//text()').extract_first()
            TRB  =  game.xpath('td[21]//text()').extract_first()
            AST  =  game.xpath('td[22]//text()').extract_first()
            STL  =  game.xpath('td[23]//text()').extract_first()
            BLK  =  game.xpath('td[24]//text()').extract_first()
            TOV  =  game.xpath('td[25]//text()').extract_first()
            PF   =  game.xpath('td[26]//text()').extract_first()
            PTS  =  game.xpath('td[27]//text()').extract_first()
            GmSc =  game.xpath('td[28]//text()').extract_first()
            name =  player_name
            # Dates are unique for each player, so name + date is a unique key.
            # Rows without a date all share one placeholder key and overwrite each other.
            if Date is None:
                key = name + '0'
            else:
                key = name + " / " + Date
            player_stats[key] = [G,Date,Age,Team,At,Opp,Form,GS,MP,FG,FGA,FG_pct,three,threePA,three_pct,FT,FTA,FT_pct,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,plus_minus]
            

# Module-level dictionary that the spider's parse method fills in
player_stats = {}

# Run the Spider
process = CrawlerProcess()
process.crawl(BasketSpider)
process.start()
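
The keying scheme used in `parse` can be checked in isolation. The sketch below is a minimal illustration and not part of the scraper: `make_key` is a hypothetical helper that mirrors the `name + " / " + Date` logic, and the sample rows are made up. It shows that rows with a date stay distinct, while rows without one collapse onto a single placeholder entry.

```python
# Hypothetical helper mirroring the key logic in BasketSpider.parse
def make_key(name, date):
    # Rows without a date all share the same placeholder key
    if date is None:
        return name + '0'
    return name + " / " + date

# Made-up rows: (player name, date, points)
sample_rows = [
    ("Player A", "2020-01-01", "10"),
    ("Player A", "2020-01-03", "22"),
    ("Player A", None, None),   # a row with no date cell
    ("Player A", None, None),   # overwrites the previous dateless row
]

demo_stats = {}
for name, date, pts in sample_rows:
    demo_stats[make_key(name, date)] = [date, pts]

print(len(demo_stats))  # 3
```

Because dateless rows overwrite each other, each player contributes at most one such row, which can be filtered out later by looking for a missing date.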



In [None]:
len(player_stats)

In [None]:
#convert dictionary of stats into table
df = pd.DataFrame.from_dict(player_stats).transpose()
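
To see what this conversion produces, here is a small sketch on made-up data (the keys and values are illustrative, not scraped). Each dictionary key becomes a row label in the index rather than a data column, and each list element becomes a column. Passing `orient="index"` is an equivalent alternative to transposing.

```python
import pandas as pd

# Made-up stand-in for player_stats: key -> list of stats
toy_stats = {
    "Player A / 2020-01-01": ["1", "2020-01-01", "10"],
    "Player A / 2020-01-03": ["2", "2020-01-03", "22"],
}

# Same approach as the cell above: keys become columns first, then transpose
toy_df = pd.DataFrame.from_dict(toy_stats).transpose()

# Equivalent: let pandas treat each key as a row directly
toy_df_alt = pd.DataFrame.from_dict(toy_stats, orient="index")

print(toy_df.shape)        # (2, 3)
print(list(toy_df.index))  # the dictionary keys
```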


In [None]:
df.head()

In [None]:
# Name the 29 stat columns in the order they were stored;
# the key (player name / date) is already the index, not a column
df.columns = [
 'G','Date','Age','Team','At','Opp','Form',
 'GS','MP','FG','FGA','FG_pct','three',
 'threePA','three_pct','FT','FTA',
 'FT_pct','ORB','DRB','TRB','AST','STL',
 'BLK','TOV','PF','PTS','GmSc','plus_minus'
]

In [None]:
df.to_csv("player_stats2.csv")
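
Every value scraped this way is text (or `None`), so the saved table will likely need type conversion before any analysis. As a hedged sketch of one way to do that, with sample values and a column subset that are assumptions for illustration only, `pd.to_numeric` with `errors="coerce"` turns entries that cannot be parsed into `NaN`:

```python
import pandas as pd

# Made-up rows shaped like a few of the scraped columns (strings or None)
raw = pd.DataFrame(
    {
        "PTS": ["10", "22", None],
        "FG_pct": [".500", ".417", ""],
        "Team": ["AAA", "AAA", "BBB"],
    }
)

# Assumed subset of columns that should be numeric
numeric_cols = ["PTS", "FG_pct"]

clean = raw.copy()
for col in numeric_cols:
    # errors="coerce" replaces values that cannot be parsed with NaN
    clean[col] = pd.to_numeric(clean[col], errors="coerce")

print(clean.dtypes)
```

Text columns such as `Team` are left untouched, and missing values stay visible as `NaN` instead of raising an error.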