## NBA Statistics During Golden State Warriors (GSW) Championship Run Project - Data Wrangling

### Introduction:

In this first section of the project, raw data will be scraped from the basketball-reference website for all 30 NBA teams. Once the raw data has been extracted, it will be cleaned, formatted, and saved as a CSV file for the next section of the project: exploratory analysis.

Imported libraries are BeautifulSoup, Requests, and Pandas; in addition, the time module has been imported. 

In [1]:
from bs4 import BeautifulSoup # library to pull data from web and parse data 
import requests # library for http requests to retrieve data
import time # module providing functions to work with time
import pandas as pd # library for data manipulation and analysis

Below are functions that scrape data from the Basketball Reference website and organize the acquired data so it can be converted into a Pandas dataframe for later use.

In [2]:
def get_soup(team_list):
    """
    Request each team's page and parse it using BeautifulSoup; 
    returns a list of soup objects, one per team, in team_list order.
    
    Takes a list of team abbreviations used to build each team's page URL.
    """
    # create dictionary of page responses keyed by team abbreviation
    page_list = {}
    for team in team_list:
        url = "https://www.basketball-reference.com/teams/{}/stats_basic_totals.html".format(team)
        page_list[team] = requests.get(url)
        # time.sleep(15) # time is in seconds; uncomment and adjust to pause between requests
    
    # create list of soup objects parsed from the text of each response
    soup_list = []
    for team in team_list:
        soup_list.append(BeautifulSoup(page_list[team].text, "html.parser"))
        
    return soup_list
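
The commented-out `time.sleep` call in `get_soup()` suggests pausing between requests. The block below is a minimal sketch of that idea, not the notebook's own code: `fetch_pages`, its `fetch` parameter, and `delay_seconds` are hypothetical names introduced here, and the fetch step is passed in as a function so the loop can be tried without touching the network.

```python
import time

# same URL pattern used in get_soup()
BASE_URL = "https://www.basketball-reference.com/teams/{}/stats_basic_totals.html"


def fetch_pages(team_list, fetch, delay_seconds=0.0):
    """Hypothetical helper: call fetch(url) once per team, pausing between calls."""
    pages = {}
    for position, team in enumerate(team_list):
        # pause before every request except the first one
        if position > 0 and delay_seconds > 0:
            time.sleep(delay_seconds)
        pages[team] = fetch(BASE_URL.format(team))
    return pages


def fake_fetch(url):
    """Stand-in for a real request so the sketch runs offline."""
    return "<html>" + url + "</html>"


demo_pages = fetch_pages(["ATL", "GSW"], fake_fetch)
print(list(demo_pages))
```

With the real library, the stand-in could be swapped for something like `lambda url: requests.get(url).text` together with a non-zero `delay_seconds`; a suitable delay depends on the site's usage policy and is not stated in this notebook.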

In [3]:
def get_dataframe(soup_list):
    """
    Extract text data from a list of html soup objects; 
    returns a list of rows, each holding a season label followed by that season's stats.
    
    Takes the list of soup objects returned by get_soup().
    """
    # create list of row elements searching for 'tr' (table row) tag in each NBA team text data
    row_values = [soup.find_all('tr') for soup in soup_list]
        
    # create list of stats elements in each NBA team searching for all 'td' (table data) tag
    stats_value = [row.find_all('td') for rows in row_values for row in rows]
            
    # create list of season elements in each NBA team searching for all 'th' (table header) tag
    season_values = [row.find_all('th') for rows in row_values for row in rows]
            
    # create list of season labels extracting text from the first 'th' tag of each row
    season_nums = [[cells[0].get_text()] for cells in season_values]
        
    # create new list of stats for each year of the NBA teams
    data_list = [[cell.get_text() for cell in item] for item in stats_value]

    # create a new list to filter out the blank '[]' elements in the "data_list" variable
    stats_list = [stats for stats in data_list if stats != []]
    
    # create a new list to filter out ['Season'] elements in the "season_nums" variable
    season_list = [season for season in season_nums if season != ['Season']]
    
    # combine filtered variables ("season_list" and "stats_list") by position to create the rows for the nba dataframe
    dataframe_list = [season + stats for season, stats in zip(season_list, stats_list)]
    
    return dataframe_list
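
`get_dataframe()` builds the season labels and the stats as two separate flattened lists and then pairs them by position. As an alternative sketch (not the method used in this notebook), each table row can be read as a unit so the season label and its stats never drift apart. `SAMPLE_HTML` and `rows_from_soup` are hypothetical names, and the sample table is a cut-down stand-in for the real page.

```python
from bs4 import BeautifulSoup

# tiny stand-in for a team page: one header row and two season rows
SAMPLE_HTML = """
<table>
  <tr><th>Season</th><th>Tm</th><th>W</th></tr>
  <tr><th>2018-19</th><td>ATL</td><td>29</td></tr>
  <tr><th>2017-18</th><td>ATL</td><td>24</td></tr>
</table>
"""


def rows_from_soup(soup):
    """Hypothetical helper: return [season, stat, stat, ...] for each data row."""
    rows = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:
            continue  # header rows contain only 'th' cells
        header = tr.find("th")
        season = header.get_text() if header else ""
        rows.append([season] + [td.get_text() for td in cells])
    return rows


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(rows_from_soup(sample_soup))
```

Reading row by row also skips header rows without a separate filtering step, since those rows have no `td` cells.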

**Script to retrieve data from basketball-reference website**

In [4]:
# list of team abbreviations for url requests
teams = ['ATL','BOS','NJN','CHA','CHI','CLE','DAL','DEN','DET','GSW','HOU','IND','LAC','LAL','MEM','MIA','MIL','MIN','NOH',
         'NYK','OKC','ORL','PHI','PHO','POR','SAC','SAS','TOR','UTA','WAS']

In [5]:
# call get_soup() function to return soup object and assign to variable
soup_list = get_soup(teams)

In [6]:
# call get_dataframe() function to return dataframe list and assign to variable
df_list = get_dataframe(soup_list)

In [7]:
# create list of headers for dataframe
headers = [th.get_text() for th in soup_list[0].find_all('tr')[0].find_all('th')]

# view headers
print(headers)

['Season', 'Lg', 'Tm', 'W', 'L', 'Finish', '\xa0', 'Age', 'Ht.', 'Wt.', '\xa0', 'G', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']


**Note:** There are empty columns as noted by headers *"\xa0"* above.

**Convert dataframe *list* from script to Pandas dataframe**

In [8]:
# create nba data dataframe
nba_df = pd.DataFrame(df_list, columns = (headers)) 

In [9]:
# check shape of data to see dimensions; rows, columns
print(nba_df.shape)

(1513, 34)


In [10]:
# view column names; recall there are "empty" columns
print(nba_df.columns)

Index(['Season', 'Lg', 'Tm', 'W', 'L', 'Finish', ' ', 'Age', 'Ht.', 'Wt.', ' ',
       'G', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%',
       'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV',
       'PF', 'PTS'],
      dtype='object')


In [11]:
# drop irrelevant columns 
nba_df.drop(['Lg', '\xa0', '\xa0', 'Ht.', 'Wt.', 'G', 'MP', 'PF'], axis=1, inplace=True)

# view new columns
print(nba_df.columns.values)

['Season' 'Tm' 'W' 'L' 'Finish' 'Age' 'FG' 'FGA' 'FG%' '3P' '3PA' '3P%'
 '2P' '2PA' '2P%' 'FT' 'FTA' 'FT%' 'ORB' 'DRB' 'TRB' 'AST' 'STL' 'BLK'
 'TOV' 'PTS']


In [12]:
# check for possible na values in dataframe
print(nba_df.isna().sum())

Season    0
Tm        0
W         0
L         0
Finish    0
Age       0
FG        0
FGA       0
FG%       0
3P        0
3PA       0
3P%       0
2P        0
2PA       0
2P%       0
FT        0
FTA       0
FT%       0
ORB       0
DRB       0
TRB       0
AST       0
STL       0
BLK       0
TOV       0
PTS       0
dtype: int64


In [13]:
# take a look at the data types
print(nba_df.dtypes)

Season    object
Tm        object
W         object
L         object
Finish    object
Age       object
FG        object
FGA       object
FG%       object
3P        object
3PA       object
3P%       object
2P        object
2PA       object
2P%       object
FT        object
FTA       object
FT%       object
ORB       object
DRB       object
TRB       object
AST       object
STL       object
BLK       object
TOV       object
PTS       object
dtype: object


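**Note:** All columns are currently object type because the values are still strings parsed from HTML. The numeric columns should be inferred as numeric types once the exported CSV is read back in. The *Season* column, however, would stay an object type because of the dash, so it needs to be converted to a numeric format manually.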
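
If numeric types are wanted before the export step, one option is to convert the stat columns explicitly with `pd.to_numeric`. The block below is a small sketch on a hand-built frame that reuses two rows from the output shown later; `sample_df`, `label_columns`, and `converted_df` are names made up for this illustration.

```python
import pandas as pd

# every value starts as a string, as it does after scraping
sample_df = pd.DataFrame(
    [["2018-19", "ATL", "29", "25.1", ".451"],
     ["2017-18", "ATL", "24", "25.4", ".446"]],
    columns=["Season", "Tm", "W", "Age", "FG%"],
)

# columns that should stay as text
label_columns = ["Season", "Tm"]

converted_df = sample_df.copy()
for col in converted_df.columns:
    if col not in label_columns:
        # raises ValueError if a value cannot be parsed as a number
        converted_df[col] = pd.to_numeric(converted_df[col])

print(converted_df.dtypes)
```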

In [14]:
# view results of dataframe; first 5 rows
print(nba_df.head())

    Season   Tm   W   L Finish   Age    FG   FGA   FG%    3P  ...    FTA  \
0  2018-19  ATL  29  53      5  25.1  3392  7524  .451  1067  ...   1918   
1  2017-18  ATL  24  58      5  25.4  3130  7015  .446   917  ...   1654   
2  2016-17  ATL  43  39      2  27.9  3123  6918  .451   729  ...   2039   
3  2015-16  ATL  48  34      2  28.2  3168  6923  .458   815  ...   1638   
4  2014-15  ATL  60  22      1  27.8  3121  6699  .466   818  ...   1735   

    FT%  ORB   DRB   TRB   AST  STL  BLK   TOV   PTS  
0  .752  955  2825  3780  2118  675  419  1397  9294  
1  .785  743  2693  3436  1946  638  348  1276  8475  
2  .728  842  2793  3635  1938  672  397  1294  8459  
3  .783  679  2772  3451  2100  747  486  1226  8433  
4  .778  715  2611  3326  2111  744  380  1167  8409  

[5 rows x 26 columns]


The analysis focuses on the 2014-15 to 2018-19 NBA seasons, so a new dataframe is created that contains only those seasons.

For purposes of the analysis, a separate dataframe will be created specifically for the Golden State Warriors.

In [15]:
# deep copy original dataframe to a new dataframe; one for all teams and one for GSW
all_teams_df = nba_df[nba_df['Season'] >= '2014-15'].copy(deep = True)
gsw_df = nba_df[(nba_df['Season'] >= '2014-15') & (nba_df['Tm'] == 'GSW')].copy(deep = True)

In [16]:
# reset index to start at 0...n
all_teams_df.index = range(len(all_teams_df.index))
gsw_df.index = range(len(gsw_df.index))

In [17]:
# all data are still strings parsed from html; numeric columns are inferred when the csv is read back in
# "Season" holds labels such as "2018-19"; keep only the ending year, e.g. "2018-19" becomes "2019"
# note: this slice assumes both years share a century, which holds for the 2014-15 season onward
all_teams_df['Season'] = all_teams_df.Season.apply(lambda year: year[:2] + year[5:])
gsw_df['Season'] = gsw_df.Season.apply(lambda year: year[:2] + year[5:])

In [18]:
# convert season to a numeric
all_teams_df['Season'] = pd.to_numeric(all_teams_df['Season'])
gsw_df['Season'] = pd.to_numeric(gsw_df['Season'])
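
The slice used above (`year[:2] + year[5:]`) works for the seasons kept here, but it would turn a label such as "1999-00" into 1900. Should earlier seasons ever be included, a helper along these lines could handle the century rollover; `season_end_year` is a hypothetical name and is not part of the original notebook.

```python
def season_end_year(label):
    """Convert a season label such as '2018-19' to its ending year as an int."""
    start_year = int(label[:4])
    end_suffix = int(label[5:])
    # start from the century of the starting year, e.g. 2000 for 2018
    century = start_year - start_year % 100
    end_year = century + end_suffix
    # a label like '1999-00' ends in the following century
    if end_year < start_year:
        end_year += 100
    return end_year


for label in ["2014-15", "2018-19", "1999-00"]:
    print(label, "->", season_end_year(label))
# 2014-15 -> 2015
# 2018-19 -> 2019
# 1999-00 -> 2000
```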

In [19]:
# recheck data types to make sure "Season" variable is now numeric
print(gsw_df.dtypes)

Season     int64
Tm        object
W         object
L         object
Finish    object
Age       object
FG        object
FGA       object
FG%       object
3P        object
3PA       object
3P%       object
2P        object
2PA       object
2P%       object
FT        object
FTA       object
FT%       object
ORB       object
DRB       object
TRB       object
AST       object
STL       object
BLK       object
TOV       object
PTS       object
dtype: object


**Convert and save dataframes with relevant data for analysis**

In [20]:
# convert new dataframes to csv for analysis
all_teams_df.to_csv("all_teams_df.csv", index=False)
gsw_df.to_csv("gsw_df.csv", index=False)
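
To check how the saved files will look when loaded in the next section, a CSV round trip can be tried in memory. The sketch below uses an `io.StringIO` buffer in place of a file and a hand-built frame with two rows taken from the output above; `sample_df`, `buffer`, and `restored_df` are illustrative names only.

```python
import io

import pandas as pd

# Season is already numeric here; the other stats are still strings
sample_df = pd.DataFrame(
    [[2019, "ATL", "29", ".451"],
     [2018, "ATL", "24", ".446"]],
    columns=["Season", "Tm", "W", "FG%"],
)

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)

# rewind the buffer and read it back; read_csv infers a type for each column
buffer.seek(0)
restored_df = pd.read_csv(buffer)

print(restored_df.dtypes)
```

After the round trip, the win totals and shooting percentages in this sample come back as numeric columns, while the team abbreviation remains text.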