# Beginning Web Scraping - May 21st

## Goal

This is the beginning of an NBA data analysis project. The initial goal of this project is to determine the primary factors related to winning basketball games. 

## Content of this notebook

While I'm not sure where this project will lead, I will begin where most data analysis projects do: with data collection.

In this document, I will gather the team stats for Game 3 of the 2018 Western Conference Finals. This game was played between the Houston Rockets and the Golden State Warriors on May 20, 2018. The Warriors won 126-85.

The team stats for this game are gathered in a table at http://www.espn.com/nba/matchup?gameId=401032763 . These stats include FG Made-Attempted, Assists, and Blocks among others, and are based on the total actions made by the whole team. For instance, if Kevin Durant had 10 assists and Stephen Curry had 5 assists while no one else had any assists, the Warriors would have 15 assists as a team.

I will use the requests library to download the page as HTML, the BeautifulSoup library to parse the HTML and locate the table, and the pandas library to store the stats in a DataFrame.

I needed to make two changes to the DataFrame.
1. Three stats were written in the format "(int)-(int)": "FG Made-Attempted", "3PT Made-Attempted", and "FT Made-Attempted". I split each of these into two columns, one for "Made" and one for "Attempted".

2. All of the entries were strings. I converted the percentage entries to floats and all other entries to ints.
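
As a rough illustration of these two changes on single values, the sketch below uses plain Python rather than pandas. The helper name `parse_stat` is hypothetical and is not used elsewhere in this notebook; the sample values are taken from the game's stats table.

```python
def parse_stat(name, value):
    """Convert one scraped stat string into numbers.

    Hypothetical helper: returns a (made, attempted) tuple of ints for
    'Made-Attempted' stats, a float for percentages, and an int otherwise.
    """
    if name.endswith('Made-Attempted'):
        made, attempted = value.split('-')
        return int(made), int(attempted)
    if '%' in name:
        return float(value)
    return int(value)


print(parse_stat('FG Made-Attempted', '32-81'))  # (32, 81)
print(parse_stat('Field Goal %', '39.5'))        # 39.5
print(parse_stat('Assists', '19'))               # 19
```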


By using stats from ESPN, I agree to the Disney terms of use.

## Using requests and BeautifulSoup to read page

I first read in the page as HTML using the requests library. I then use the BeautifulSoup library to parse the HTML. Finally, I print the parsed document in a readable format using the prettify method.

In [1]:
url = "http://www.espn.com/nba/matchup?gameId=401032763"

import requests
page = requests.get(url) #requests.models.Response object

html = page.content #raw HTML of the page, as bytes
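
The request in this cell assumes the download succeeds. A slightly more defensive version might look like the sketch below. The function name `fetch_html` and the 10-second timeout are my own choices for illustration, not anything the site requires.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw bytes of a page, raising an exception if the request fails."""
    page = requests.get(url, timeout=timeout)  # give up if the server is too slow
    page.raise_for_status()  # raise requests.HTTPError on a 4xx or 5xx status
    return page.content


# Example use (not run here, since it needs network access):
# html = fetch_html("http://www.espn.com/nba/matchup?gameId=401032763")
```

Failing loudly at this step avoids parsing an error page as if it were the stats page.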

In [2]:
from bs4 import BeautifulSoup #library for parsing HTML and XML

soup = BeautifulSoup(html, 'lxml') #parse the HTML (requires the lxml package)

In [3]:
#compare the readability of the raw HTML and the prettified soup
print(html[:500])

print('\n' + '-'*50 + '\n')

print(soup.prettify()[:500]) 

b'\n\t<!DOCTYPE html>\n\t<html class="no-icon-fonts">\n\t<head>\n\t\t<meta http-equiv="content-type" content="text/html; charset=UTF-8" />\n<meta http-equiv="x-ua-compatible" content="IE=edge,chrome=1" />\n<meta name="viewport" content="initial-scale=1.0, maximum-scale=1.0, user-scalable=no">\n<link rel="canonical" href="http://www.espn.com/nba/matchup?gameId=401032763" />\n<title>Rockets vs. Warriors - Team Statistics - May 20, 2018 - ESPN</title>\n<meta name="description" content="Get team statistics for the '

--------------------------------------------------

<!DOCTYPE html>
<html class="no-icon-fonts">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=edge,chrome=1" http-equiv="x-ua-compatible"/>
  <meta content="initial-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport"/>
  <link href="http://www.espn.com/nba/matchup?gameId=401032763" rel="canonical"/>
  <title>
   Rockets vs. Warriors - Team Statistics - May 20, 2018

## Finding the team stats table

The HTML contains four tables. Printing the beginning of each one shows that Table 1 stores all of the team stats except the points scored (total and per quarter), which are in Table 0.

In [4]:
#find all tables in file
tables = soup.find_all('table')

print('Document contains {0} HTML Table(s)'.format(len(tables)))

Document contains 4 HTML Table(s)


In [5]:
#print the first 400 characters of each table (to understand its content)
for idx, table in enumerate(tables):
    print('Table {0}'.format(idx))
    print(table.prettify()[:400])
    print('\n' + '-'*50 + '\n')

#Findings:
#Table 0: quarters, team names with points per quarter
#Table 1: all team stats in tbody
#Tables 2, 3: not displayed on the page; regular season win-loss records for other teams (not sure why they are included)


Table 0
<table class="miniTable" id="linescore">
 <thead>
  <tr>
   <th class="network">
   </th>
   <th>
    1
   </th>
   <th>
    2
   </th>
   <th>
    3
   </th>
   <th>
    4
   </th>
   <th class="final-score">
    T
   </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td class="team-name">
    HOU
   </td>
   <td>
    22
   </td>
   <td>
    21
   </td>
   <td>
    24
   </td>
   <td>
    18
   </td>
  

--------------------------------------------------

Table 1
<table class="mod-data">
 <thead>
  <tr class="header">
   <th>
    Matchup
   </th>
   <th>
    <img src="http://a.espncdn.com/combiner/i?img=/i/teamlogos/nba/500/hou.png&amp;h=100&amp;w=100"/>
   </th>
   <th>
    <img src="http://a.espncdn.com/combiner/i?img=/i/teamlogos/nba/500/gs.png&amp;h=100&amp;w=100"/>
   </th>
  </tr>
 </thead>
 <tbody>
  <tr class="highlight" data-stat-attr="fieldGoalsM

--------------------------------------------------

Table 2
<table class="mod-data">
 <thead>
  <tr>
   <th class="left" scope="

In [6]:
#want to extract team stats (found in body of Table 1)
tb = tables[1].tbody

print(tb.prettify()[:400])

<tbody>
 <tr class="highlight" data-stat-attr="fieldGoalsMade-fieldGoalsAttempted">
  <td>
   FG Made-Attempted
  </td>
  <td>
   32-81
  </td>
  <td>
   48-92
  </td>
 </tr>
 <tr class="highlight" data-stat-attr="fieldGoalPct">
  <td>
   Field Goal %
  </td>
  <td>
   39.5
  </td>
  <td>
   52.2
  </td>
 </tr>
 <tr class="highlight" data-stat-attr="threePointFieldGoalsMade-threePointFieldGoalsAtt
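
Selecting the table with `tables[1]` depends on the order of tables on the page. Since the stat rows carry a `data-stat-attr` attribute, one alternative is to look for the table that contains such a row. The sketch below is only an illustration: `find_stats_table` is a hypothetical helper, the sample HTML is a trimmed stand-in for the real page, and I use the built-in `html.parser` so the example does not need lxml.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the real page: a linescore table and a stats table
sample_html = """
<table class="miniTable" id="linescore">
 <tbody><tr><td class="team-name">HOU</td><td>22</td></tr></tbody>
</table>
<table class="mod-data">
 <tbody>
  <tr class="highlight" data-stat-attr="fieldGoalPct">
   <td>Field Goal %</td><td>39.5</td><td>52.2</td>
  </tr>
 </tbody>
</table>
"""

soup_demo = BeautifulSoup(sample_html, 'html.parser')


def find_stats_table(soup):
    """Return the first table containing a row with a data-stat-attr attribute."""
    for table in soup.find_all('table'):
        # attrs value True matches any tag that has the attribute at all
        if table.find('tr', attrs={'data-stat-attr': True}):
            return table
    return None


stats_table = find_stats_table(soup_demo)
print(stats_table['class'])
```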


## Extracting stats from table

I first show how to print each of the team stats. I then store all of the team stats in a DataFrame with 19 columns (one for each stat) and two rows (one each for HOU and GS). Splitting the three Made-Attempted stats in the next section brings the total to 22 columns.

In [7]:
for row in tb.find_all('tr'):
    tdx = row.find_all('td') #the three <td> cells in this row
    
    #1st td: stat header
    stat_name = tdx[0].contents[0].strip()
    
    #2nd td: visitor stat
    visitor_stat = tdx[1].contents[0].strip()
    
    #3rd td: home stat
    home_stat = tdx[2].contents[0].strip()
    
    print(f'Stat {stat_name}: Visitor {visitor_stat}, Home {home_stat}')

Stat FG Made-Attempted: Visitor 32-81, Home 48-92
Stat Field Goal %: Visitor 39.5, Home 52.2
Stat 3PT Made-Attempted: Visitor 11-34, Home 13-32
Stat Three Point %: Visitor 32.4, Home 40.6
Stat FT Made-Attempted: Visitor 10-13, Home 17-18
Stat Free Throw %: Visitor 76.9, Home 94.4
Stat Total Rebounds: Visitor 41, Home 49
Stat Offensive Rebounds: Visitor 10, Home 11
Stat Defensive Rebounds: Visitor 31, Home 38
Stat Assists: Visitor 19, Home 20
Stat Steals: Visitor 3, Home 11
Stat Blocks: Visitor 5, Home 7
Stat Total Turnovers: Visitor 20, Home 8
Stat Points Off Turnovers: Visitor 28, Home 8
Stat Fast Break Points: Visitor 10, Home 23
Stat Points in Paint: Visitor 40, Home 56
Stat Personal Fouls: Visitor 19, Home 16
Stat Technical Fouls: Visitor 2, Home 1
Stat Flagrant Fouls: Visitor 0, Home 0
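
The `.contents[0].strip()` pattern works here because the first child of every cell is plain text. It would not work if a cell wrapped its value in another tag. A possible alternative is `get_text(strip=True)`, sketched below on a small made-up fragment (the values are from this game, but the fragment is not the real page). The names `sample_tbody`, `demo_tb`, and `stats` are only for this illustration.

```python
from bs4 import BeautifulSoup

# Small stand-in for the real tbody, with the same three-cell row layout
sample_tbody = """
<table><tbody>
 <tr>
  <td>
   Assists
  </td>
  <td>
   19
  </td>
  <td>
   20
  </td>
 </tr>
 <tr><td>Steals</td><td>3</td><td>11</td></tr>
</tbody></table>
"""

demo_tb = BeautifulSoup(sample_tbody, 'html.parser').tbody

stats = {}
for row in demo_tb.find_all('tr'):
    # get_text(strip=True) returns the cell's text with surrounding whitespace removed
    name, visitor, home = (td.get_text(strip=True) for td in row.find_all('td'))
    stats[name] = (visitor, home)

print(stats)
```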


In [8]:
#Collect the stat names and each team's values in lists,
#then store them in a DataFrame (one row per team, one column per stat).
#The Made-Attempted stats are split into two columns in the next section.

stat_names = []
visitor_stats = []
home_stats = []

for row in tb.find_all('tr'):
    tdx = row.find_all('td')
    
    stat_names.append(tdx[0].contents[0].strip())
    
    visitor_stats.append(tdx[1].contents[0].strip())
    
    home_stats.append(tdx[2].contents[0].strip())
    
import pandas as pd

stats_df = pd.DataFrame(columns=stat_names)
stats_df.loc[0] = visitor_stats
stats_df.loc[1] = home_stats

stats_df.index = ['HOU', 'GS']

print(stats_df)


    FG Made-Attempted Field Goal % 3PT Made-Attempted Three Point %  \
HOU             32-81         39.5              11-34          32.4   
GS              48-92         52.2              13-32          40.6   

    FT Made-Attempted Free Throw % Total Rebounds Offensive Rebounds  \
HOU             10-13         76.9             41                 10   
GS              17-18         94.4             49                 11   

    Defensive Rebounds Assists Steals Blocks Total Turnovers  \
HOU                 31      19      3      5              20   
GS                  38      20     11      7               8   

    Points Off Turnovers Fast Break Points Points in Paint Personal Fouls  \
HOU                   28                10              40             19   
GS                     8                23              56             16   

    Technical Fouls Flagrant Fouls  
HOU               2              0  
GS                1              0  
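
The DataFrame can also be built in a single call instead of creating an empty frame and filling rows with `.loc`. The sketch below uses three of the stats above as sample data. The `demo_` names are only for this illustration.

```python
import pandas as pd

# Three of the scraped stats, still as strings
demo_names = ['FG Made-Attempted', 'Field Goal %', 'Assists']
demo_visitor = ['32-81', '39.5', '19']
demo_home = ['48-92', '52.2', '20']

# Each inner list becomes one row; index labels the rows by team
demo_df = pd.DataFrame(
    [demo_visitor, demo_home],
    columns=demo_names,
    index=['HOU', 'GS'],
)

print(demo_df)
```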


## Fixing format and types of DataFrame

There are two major changes I want to make to this DataFrame.

1. Some of the stats use a hyphen to separate made shots from attempted shots. I want to split each of these stats (FG, 3PT, and FT) into two separate stats.

2. I want to convert all stats from strings to numbers. The percentage stats will be converted to float, and the other stats will be converted to int.



In [9]:
#In this cell, I will do the first task of transforming one stat into two stats.

import copy

expanded_stat_names = copy.deepcopy(stat_names) #don't want to alter original stat_names
stat_names_copy = copy.deepcopy(stat_names)

#turn 'FG Made-Attempted' into two stats 'FG Made' and 'FG Attempted'
expanded_stat_names.insert(1, 'FG Attempted')
expanded_stat_names[0] = 'FG Made'

#convert 3PT Made-Attempted
expanded_stat_names.insert(4, '3PT Attempted')
expanded_stat_names[3] = '3PT Made'

#convert FT Made-Attempted
expanded_stat_names.insert(7, 'FT Attempted')
expanded_stat_names[6] = 'FT Made'


#create new DataFrame with these column names
expanded_stats_df = pd.DataFrame(columns=expanded_stat_names)

#initialize DataFrame with 0's
row_of_zeros = [0 for stat in expanded_stat_names]
expanded_stats_df.loc['HOU'] = row_of_zeros
expanded_stats_df.loc['GS'] = row_of_zeros

#transfer stats from stats_df that don't involve Made-Attempted 
stats_to_immediately_transfer = stat_names_copy

stats_to_immediately_transfer.remove('FG Made-Attempted')
stats_to_immediately_transfer.remove('3PT Made-Attempted')
stats_to_immediately_transfer.remove('FT Made-Attempted')

expanded_stats_df[stats_to_immediately_transfer] = stats_df[stats_to_immediately_transfer]

def transfer_one_to_two(original_stat_name, new_stat_name_1, new_stat_name_2):
    '''
    Splits a Made-Attempted stat in stats_df into two stats,
    made and attempted, in expanded_stats_df
    
    Inputs:
    original_stat_name: column name '*** Made-Attempted'
    new_stat_name_1: column name '*** Made'
    new_stat_name_2: column name '*** Attempted'
    
    Output:
    None (two columns of expanded_stats_df are changed in place)
    
    '''
    original_stat = stats_df[[original_stat_name]]
    hou_stat = original_stat.iloc[0,0].split('-')
    gs_stat = original_stat.iloc[1,0].split('-')
    expanded_stats_df.loc['HOU', [new_stat_name_1,new_stat_name_2]] = hou_stat
    expanded_stats_df.loc['GS', [new_stat_name_1, new_stat_name_2]] = gs_stat
    
transfer_one_to_two('FG Made-Attempted', 'FG Made', 'FG Attempted')
transfer_one_to_two('3PT Made-Attempted', '3PT Made', '3PT Attempted')
transfer_one_to_two('FT Made-Attempted', 'FT Made', 'FT Attempted')
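
A more compact way to do the same split is pandas' string method `str.split` with `expand=True`, which returns one column per piece. The sketch below is an alternative to `transfer_one_to_two`, not the code used in this notebook, and it runs on a small sample frame (`demo_df`) rather than on `stats_df`.

```python
import pandas as pd

# Sample frame with two Made-Attempted columns and one ordinary column
demo_df = pd.DataFrame(
    {
        'FG Made-Attempted': ['32-81', '48-92'],
        '3PT Made-Attempted': ['11-34', '13-32'],
        'Assists': ['19', '20'],
    },
    index=['HOU', 'GS'],
)

for prefix in ['FG', '3PT']:
    column = f'{prefix} Made-Attempted'
    # expand=True gives a DataFrame with column 0 (made) and column 1 (attempted)
    split = demo_df[column].str.split('-', expand=True)
    demo_df[f'{prefix} Made'] = split[0]
    demo_df[f'{prefix} Attempted'] = split[1]
    demo_df = demo_df.drop(columns=column)

print(demo_df)
```

Note that this version appends the new columns at the end instead of keeping them next to the matching percentage column, and the values are still strings at this point.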

In [10]:
'''
I will now change all of the entries of the DataFrame into ints and floats.
The entries in the columns with %'s (FG, 3PT, and FT) will be cast as floats.
The entries in the other columns will be cast as ints.
'''

#print(expanded_stats_df.loc[:, ['Total Rebounds', 'Offensive Rebounds', 'Defensive Rebounds']])

#columns to cast as floats
float_columns = ['Field Goal %', 'Three Point %', 'Free Throw %']

#columns to cast as ints
int_columns = copy.deepcopy(expanded_stat_names)
for column in float_columns:
    int_columns.remove(column)

#cast % columns as floats
for column in float_columns:
    expanded_stats_df.loc['HOU', column] = float(expanded_stats_df.loc['HOU', column])
    expanded_stats_df.loc['GS', column] = float(expanded_stats_df.loc['GS', column])

#cast other columns as ints
for column in int_columns:
    expanded_stats_df.loc['HOU', column] = int(expanded_stats_df.loc['HOU', column])
    expanded_stats_df.loc['GS', column] = int(expanded_stats_df.loc['GS', column])

In [11]:
#We can now see our final DataFrame (split midway through to show all columns)

print(expanded_stats_df.loc[:, :'Free Throw %'])
print(expanded_stats_df.loc[:, 'Total Rebounds':])

    FG Made FG Attempted Field Goal % 3PT Made 3PT Attempted Three Point %  \
HOU      32           81         39.5       11            34          32.4   
GS       48           92         52.2       13            32          40.6   

    FT Made FT Attempted Free Throw %  
HOU      10           13         76.9  
GS       17           18         94.4  
    Total Rebounds Offensive Rebounds Defensive Rebounds Assists Steals  \
HOU             41                 10                 31      19      3   
GS              49                 11                 38      20     11   

    Blocks Total Turnovers Points Off Turnovers Fast Break Points  \
HOU      5              20                   28                10   
GS       7               8                    8                23   

    Points in Paint Personal Fouls Technical Fouls Flagrant Fouls  
HOU              40             19               2              0  
GS               56             16               1              0  
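
One caveat: converting values one cell at a time may leave each column's dtype as `object`, even though the individual entries are ints and floats. I have not checked `expanded_stats_df.dtypes` here, so treat that as an assumption. A possible way to convert whole columns at once is `DataFrame.astype` with a column-to-type mapping, sketched below on a small sample frame (`demo_df` and `dtype_map` are names used only in this example).

```python
import pandas as pd

# Sample frame holding strings, like the scraped data
demo_df = pd.DataFrame(
    {'FG Made': ['32', '48'], 'Field Goal %': ['39.5', '52.2']},
    index=['HOU', 'GS'],
)

# Percentage columns become floats, everything else becomes ints
float_cols = [col for col in demo_df.columns if '%' in col]
int_cols = [col for col in demo_df.columns if col not in float_cols]

dtype_map = {col: float for col in float_cols}
dtype_map.update({col: int for col in int_cols})

demo_df = demo_df.astype(dtype_map)

print(demo_df.dtypes)
```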
