# Saving Game and Team Data in CSVs

In this notebook, we will set ourselves up to apply SQL to our data by saving our data in CSV files. The main tasks we will do are:

1. Change all column names to lowercase with underscores instead of spaces

2. Compile all of the dates and matchup ID's in a DataFrame

3. Save all of the Team tables in a single DataFrame

## Reorganizing season schedule for 2017-2018 Boston Celtics

In the notebook "Scraping all regular seasons- June3", we scraped all of the regular season NBA games since the 2004-2005 season and stored their dates and matchup ID's in 14 CSV files (one for each season). We will collect all of this information in one table. 

We will start by organizing the data from the games played by the Boston Celtics during the 2017-2018 regular season.  

We begin by importing the necessary libraries. 

In [1]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import datetime
import time



In [2]:
#We first import the table from the 2017-2018 season

year = 2018
season_type = 'regular'
base_matchup_csv_name = '{0}-{1}-date-gameIDs' #base file name for the date, matchup ID CSVs

season_df = pd.read_csv(base_matchup_csv_name.format(str(year), season_type), index_col='Unnamed: 0')

#show first five rows of data
season_df.head()

Unnamed: 0,bos,bkn,ny,phi,tor,gs,lac,lal,phx,sac,...,atl,cha,mia,orl,wsh,den,min,okc,por,utah
0,"[datetime.date(2017, 10, 17), '400974437']","[datetime.date(2017, 10, 18), '400974701']","[datetime.date(2017, 10, 19), '400974441']","[datetime.date(2017, 10, 18), '400974439']","[datetime.date(2017, 10, 19), '400974769']","[datetime.date(2017, 10, 17), '400974438']","[datetime.date(2017, 10, 19), '400974442']","[datetime.date(2017, 10, 19), '400974442']","[datetime.date(2017, 10, 18), '400974767']","[datetime.date(2017, 10, 18), '400974768']",...,"[datetime.date(2017, 10, 18), '400974705']","[datetime.date(2017, 10, 18), '400974700']","[datetime.date(2017, 10, 18), '400974702']","[datetime.date(2017, 10, 18), '400974702']","[datetime.date(2017, 10, 18), '400974439']","[datetime.date(2017, 10, 18), '400974766']","[datetime.date(2017, 10, 18), '400974440']","[datetime.date(2017, 10, 19), '400974441']","[datetime.date(2017, 10, 18), '400974767']","[datetime.date(2017, 10, 18), '400974766']"
1,"[datetime.date(2017, 10, 18), '400974703']","[datetime.date(2017, 10, 20), '400974774']","[datetime.date(2017, 10, 21), '400974781']","[datetime.date(2017, 10, 20), '400974772']","[datetime.date(2017, 10, 21), '400974778']","[datetime.date(2017, 10, 20), '400974444']","[datetime.date(2017, 10, 21), '400974788']","[datetime.date(2017, 10, 20), '400974777']","[datetime.date(2017, 10, 20), '400974777']","[datetime.date(2017, 10, 20), '400974776']",...,"[datetime.date(2017, 10, 20), '400974770']","[datetime.date(2017, 10, 20), '400974770']","[datetime.date(2017, 10, 21), '400974780']","[datetime.date(2017, 10, 20), '400974774']","[datetime.date(2017, 10, 20), '400974773']","[datetime.date(2017, 10, 21), '400974786']","[datetime.date(2017, 10, 20), '400974775']","[datetime.date(2017, 10, 21), '400974787']","[datetime.date(2017, 10, 20), '400974771']","[datetime.date(2017, 10, 20), '400974775']"
2,"[datetime.date(2017, 10, 20), '400974772']","[datetime.date(2017, 10, 22), '400974789']","[datetime.date(2017, 10, 24), '400974802']","[datetime.date(2017, 10, 21), '400974778']","[datetime.date(2017, 10, 23), '400974797']","[datetime.date(2017, 10, 21), '400974784']","[datetime.date(2017, 10, 24), '400974805']","[datetime.date(2017, 10, 22), '400974791']","[datetime.date(2017, 10, 21), '400974788']","[datetime.date(2017, 10, 21), '400974786']",...,"[datetime.date(2017, 10, 22), '400974789']","[datetime.date(2017, 10, 23), '400974795']","[datetime.date(2017, 10, 23), '400974793']","[datetime.date(2017, 10, 21), '400974779']","[datetime.date(2017, 10, 23), '400974798']","[datetime.date(2017, 10, 23), '400974798']","[datetime.date(2017, 10, 22), '400974790']","[datetime.date(2017, 10, 22), '400974790']","[datetime.date(2017, 10, 21), '400974785']","[datetime.date(2017, 10, 21), '400974787']"
3,"[datetime.date(2017, 10, 24), '400974802']","[datetime.date(2017, 10, 24), '400974801']","[datetime.date(2017, 10, 27), '400974824']","[datetime.date(2017, 10, 23), '400974792']","[datetime.date(2017, 10, 25), '400974814']","[datetime.date(2017, 10, 23), '400974796']","[datetime.date(2017, 10, 26), '400974819']","[datetime.date(2017, 10, 25), '400974815']","[datetime.date(2017, 10, 23), '400974799']","[datetime.date(2017, 10, 23), '400974799']",...,"[datetime.date(2017, 10, 23), '400974793']","[datetime.date(2017, 10, 25), '400974806']","[datetime.date(2017, 10, 25), '400974810']","[datetime.date(2017, 10, 24), '400974801']","[datetime.date(2017, 10, 25), '400974815']","[datetime.date(2017, 10, 25), '400974806']","[datetime.date(2017, 10, 24), '400974803']","[datetime.date(2017, 10, 25), '400974811']","[datetime.date(2017, 10, 24), '400974804']","[datetime.date(2017, 10, 24), '400974805']"
4,"[datetime.date(2017, 10, 26), '400974818']","[datetime.date(2017, 10, 25), '400974809']","[datetime.date(2017, 10, 29), '400974841']","[datetime.date(2017, 10, 25), '400974808']","[datetime.date(2017, 10, 27), '400974827']","[datetime.date(2017, 10, 25), '400974814']","[datetime.date(2017, 10, 28), '400974835']","[datetime.date(2017, 10, 27), '400974827']","[datetime.date(2017, 10, 25), '400974813']","[datetime.date(2017, 10, 26), '400974820']",...,"[datetime.date(2017, 10, 26), '400974816']","[datetime.date(2017, 10, 27), '400974821']","[datetime.date(2017, 10, 28), '400974829']","[datetime.date(2017, 10, 27), '400974822']","[datetime.date(2017, 10, 27), '400974826']","[datetime.date(2017, 10, 27), '400974823']","[datetime.date(2017, 10, 25), '400974807']","[datetime.date(2017, 10, 27), '400974825']","[datetime.date(2017, 10, 26), '400974819']","[datetime.date(2017, 10, 25), '400974813']"


We will reorganize this data into a DataFrame with columns: 

- `season_start_year`, first year of season (each season spans two calendar years).
- `season_end_year`, second year of season.
- `game_day`, day that the game occurs (integer between 1 and 31).
- `game_month`, month that the game occurs.
- `game_year`, year that the game occurs.
- `team`, one of the two teams that played in the game (each game will occur twice in this DataFrame, once for each team).
- `matchup_id`, Matchup ID of the game.

In [3]:
x = season_df.loc[0,'bos']

print(x)

print(type(x))

[datetime.date(2017, 10, 17), '400974437']
<class 'str'>


Unfortunately, each item in the DataFrame is stored as a string rather than as a list. To recover the data without resorting to the `eval` function, we will split the string and pull out the pieces we need. We first time the split on a single string to check whether doing this for every entry would be costly.

In [4]:
start_time = time.time()

split_date_matchup = x.split(',')

print('Splitting took', str(time.time()-start_time), 'seconds.')

year = int(split_date_matchup[0][-4:])
month = int(split_date_matchup[1][1:])
day = int(split_date_matchup[2][1:-1])
matchup_id = int(split_date_matchup[3].split("'")[1])

print([year, month, day, matchup_id])

print('Total process took', str(time.time()-start_time), 'seconds.')

Splitting took 0.001302957534790039 seconds.
[2017, 10, 17, 400974437]
Total process took 0.00492405891418457 seconds.
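
A single `time.time()` measurement is noisy, so the figures above are only rough. As a cross-check, here is a minimal sketch that averages many runs with the standard-library `timeit` module. The sample string is copied from the output above, and the repeat count is an arbitrary assumption.

```python
import timeit

#sample cell value, copied from the DataFrame output above
sample = "[datetime.date(2017, 10, 17), '400974437']"

#arbitrary number of repetitions (an assumption; raise it for a steadier average)
n_runs = 100000

#total time for n_runs calls to split, then the per-call average
total_seconds = timeit.timeit(lambda: sample.split(','), number=n_runs)
per_call_seconds = total_seconds / n_runs

print('Average split time: {0:.2e} seconds'.format(per_call_seconds))
```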


We see that splitting the string takes a negligible amount of time, so we will not worry about splitting it every time we need a field. We now wrap each step of the date and matchup ID extraction in its own function.

In [5]:
def get_year(date_matchup_str):
    '''
    Retrieves year from a date, matchup ID pair.
    
    Input:
    string (represents list of a date object and matchup ID)
    
    Output:
    int (year of game)
    '''
    
    year = date_matchup_str.split(',')[0][-4:]
    return int(year)

def get_month(date_matchup_str):
    '''
    Retrieves month from a date, matchup ID pair.
    
    Input:
    string (represents list of a date object and matchup ID)
    
    Output:
    int (month of game)
    '''

    month = date_matchup_str.split(',')[1][1:]
    return int(month)


def get_day(date_matchup_str):
    '''
    Retrieves day (int between 1 and 31) from a date, matchup ID pair.
    
    Input:
    string (represents list of a date object and matchup ID)
    
    Output:
    int between 1 and 31 (day of game)
    '''
    
    day = date_matchup_str.split(',')[2][1:-1]
    return int(day)

def get_full_date(date_matchup_str):
    '''
    Retrieves full date from a date, matchup ID pair.
    
    Input:
    string (represents list of a date object and matchup ID)
    
    Output:
    string in form MM/DD/YYYY
    '''
    
    date = '{0}/{1}/{2}'.format(get_month(date_matchup_str),get_day(date_matchup_str),get_year(date_matchup_str))
    return date

def get_matchup_id(date_matchup_str):
    '''
    Retrieves matchup ID from a date, matchup ID pair.
    
    Input:
    string (represents list of a date object and matchup ID)
    
    Output:
    int
    '''
    
    matchup_id = date_matchup_str.split(',')[3].split("'")[1]
    return int(matchup_id) 
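
The functions above rely on fixed character positions after splitting on commas. As an alternative, the following sketch pulls all four fields out in one pass with a regular expression. `parse_date_matchup` is a hypothetical helper that is not used elsewhere in this notebook, and the pattern assumes every cell looks like the sample string shown earlier.

```python
import re

#matches strings such as "[datetime.date(2017, 10, 17), '400974437']"
DATE_MATCHUP_PATTERN = re.compile(r"datetime\.date\((\d+),\s*(\d+),\s*(\d+)\),\s*'(\d+)'")

def parse_date_matchup(date_matchup_str):
    '''
    Hypothetical helper: returns (year, month, day, matchup_id) as ints.
    Raises ValueError if the string does not have the expected layout.
    '''
    match = DATE_MATCHUP_PATTERN.search(date_matchup_str)
    if match is None:
        raise ValueError('Unexpected format: {0}'.format(date_matchup_str))
    year, month, day, matchup_id = (int(group) for group in match.groups())
    return year, month, day, matchup_id

print(parse_date_matchup("[datetime.date(2017, 10, 17), '400974437']"))
#prints (2017, 10, 17, 400974437)
```

Unlike the slicing approach, this version raises an error on a malformed string instead of silently returning the wrong characters.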

We now check our functions on the first game played by Boston.

In [6]:
boston_first_game = season_df.loc[0,'bos']

print("String for Boston's first game: '{0}'".format(boston_first_game))
print('Month: {0}'.format(get_month(boston_first_game)))
print('Day: {0}'.format(get_day(boston_first_game)))
print('Year: {0}'.format(get_year(boston_first_game)))
print('Full date: {0}'.format(get_full_date(boston_first_game)))
print('Matchup ID: {0}'.format(get_matchup_id(boston_first_game)))

String for Boston's first game: '[datetime.date(2017, 10, 17), '400974437']'
Month: 10
Day: 17
Year: 2017
Full date: 10/17/2017
Matchup ID: 400974437


We will now create our desired DataFrame that provides information on all of the games played during the 2017-2018 season.

To do this, we will write a function that takes an abbreviation for a team and returns a DataFrame with columns describing all of the games played by that team:

- `team`, the team's abbreviation.
- `season_start_year`, first year of season (each season spans two calendar years).
- `season_end_year`, second year of season.
- `season_type`, type of game played (either 'preseason', 'regular', or 'postseason').
- `game_month`, month that the game occurs.
- `game_day`, day that the game occurs (integer between 1 and 31).
- `game_year`, year that the game occurs.
- `game_date`, full date of game in form MM/DD/YYYY.
- `matchup_id`, Matchup ID of the game.

In [7]:
def get_team_schedule_info(df, abbrev, yr, season_type):
    '''
    Returns a DataFrame describing all of the games played during a team's season.
    
    Input:
    df: DataFrame containing info for all games of season
    abbrev: string for abbreviation of a team
    yr: int for end year of season (so if yr=2018, then game is part of the 2017-2018 season)
    season_type: string for type of game played (either 'preseason', 'regular', or 'postseason')
    
    Output:
    DataFrame with 9 columns and number of rows equal to number of games played 
    '''
    
    #will have column names as keys
    #used to construct DataFrame
    schedule_info = {}
    
    
    #Series of dates, Matchup ID's for team represented by abbrev
    
    #case that Boston, Indiana only played 81 games during 2012-2013 season
    if yr == 2013 and (abbrev == 'bos' or abbrev == 'ind'):
        team_season = df.loc[:-1,abbrev]
    
    #otherwise, all teams played the same number of games during season
    else:
        team_season = df.loc[:,abbrev]
    
    schedule_info['team'] = pd.Series([abbrev] * team_season.size)
    schedule_info['season_start_year'] = pd.Series([yr - 1] * team_season.size)
    schedule_info['season_end_year'] = pd.Series([yr] * team_season.size)
    schedule_info['season_type'] = pd.Series([season_type] * team_season.size)
    schedule_info['game_month'] = team_season.apply(get_month)
    schedule_info['game_day'] = team_season.apply(get_day)
    schedule_info['game_year'] = team_season.apply(get_year)
    schedule_info['game_date'] = team_season.apply(get_full_date)
    schedule_info['matchup_id'] = team_season.apply(get_matchup_id)
    
    #turn dictionary into DataFrame
    detailed_season_team_df = pd.DataFrame.from_dict(schedule_info)
    
    #rearrange columns of DataFrame
    columns = ['team', 'season_start_year', 'season_end_year', 'season_type',\
               'game_month', 'game_day', 'game_year', 'game_date', 'matchup_id']
    
    return detailed_season_team_df[columns]
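
One detail in the 2012-2013 branch is worth checking: `.loc` slices by index label, while `.iloc` slices by position. The sketch below uses a small made-up column (not the real schedule data) on a default 0, 1, 2, ... index to show how the two differ for the slice `:-1`.

```python
import pandas as pd

#toy column standing in for one team's season (made-up values)
toy = pd.DataFrame({'bos': ['game_a', 'game_b', 'game_c']})

#label-based: keeps rows whose label is at most -1, and there are none
by_label = toy.loc[:-1, 'bos']

#position-based: keeps every row except the last
by_position = toy['bos'].iloc[:-1]

print(len(by_label))     #prints 0
print(len(by_position))  #prints 2
```

On an index like this one, the label-based slice selects no rows at all, so the positional form may be closer to the intent of dropping only the final row.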

We check that this function returns a DataFrame with the right number of rows and that looks correct for the first few games of Boston's season.

In [8]:
boston_season_df = get_team_schedule_info(season_df, 'bos', 2018, 'regular')

print('Number of rows: {0}'.format(boston_season_df.shape[0]))

boston_season_df.head()

Number of rows: 82


Unnamed: 0,team,season_start_year,season_end_year,season_type,game_month,game_day,game_year,game_date,matchup_id
0,bos,2017,2018,regular,10,17,2017,10/17/2017,400974437
1,bos,2017,2018,regular,10,18,2017,10/18/2017,400974703
2,bos,2017,2018,regular,10,20,2017,10/20/2017,400974772
3,bos,2017,2018,regular,10,24,2017,10/24/2017,400974802
4,bos,2017,2018,regular,10,26,2017,10/26/2017,400974818


## Organizing all regular season games since 2004-2005 season 

We will now use our infrastructure to go beyond organizing the data of regular season games played by the Boston Celtics during one season and organize the data of all regular season games of all teams since the 2004-2005 season.

In [9]:
#abbreviations for all teams
abbrevs = season_df.columns.tolist() 

season_type = 'regular'
base_matchup_csv_name = '{0}-{1}-date-gameIDs' #base file name for the date, matchup ID CSVs

#stores all regular season games of all teams since 2004-2005 season
#could be made faster by not reading csv file every iteration
all_regular_season_games = pd.concat([get_team_schedule_info(pd.read_csv(base_matchup_csv_name.format(str(year), season_type)),abbrev,year,season_type)\
                                     for year in range(2005, 2019) for abbrev in abbrevs])
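
The comment in the cell above notes that the CSV is re-read for every team. One way to avoid that, sketched here under stated assumptions, is to load each season once and loop over the teams inside. `build_long_table` and `toy_team_info` are hypothetical names, and the toy frames stand in for the real CSV files so that the block runs on its own.

```python
import pandas as pd

def build_long_table(season_frames, team_info_func, season_type='regular'):
    '''
    season_frames: dict mapping season end year to a wide DataFrame (one column per team)
    team_info_func: function with the same signature as get_team_schedule_info
    '''
    pieces = []
    for end_year, season_df in season_frames.items():
        #each season's DataFrame is loaded once and reused for every team
        for abbrev in season_df.columns:
            pieces.append(team_info_func(season_df, abbrev, end_year, season_type))
    #ignore_index=True gives the combined table a fresh 0..n-1 index
    return pd.concat(pieces, ignore_index=True)

#hypothetical stand-in for get_team_schedule_info, for demonstration only
def toy_team_info(df, abbrev, yr, season_type):
    return pd.DataFrame({'team': abbrev, 'season_end_year': yr,
                         'season_type': season_type, 'game': df[abbrev].values})

#toy data; in the notebook this would be one pd.read_csv call per season,
#with index_col set so that only team columns remain
toy_frames = {2017: pd.DataFrame({'bos': ['a', 'b'], 'ny': ['c', 'd']}),
              2018: pd.DataFrame({'bos': ['e', 'f'], 'ny': ['g', 'h']})}

combined = build_long_table(toy_frames, toy_team_info)
print(combined.shape)  #prints (8, 4)
```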

In [10]:
#total number of games in DataFrame (every game counted twice)
print('Total games: {0}'.format(all_regular_season_games.shape[0]))

all_regular_season_games.head()

Total games: 33796


Unnamed: 0,team,season_start_year,season_end_year,season_type,game_month,game_day,game_year,game_date,matchup_id
0,bos,2004.0,2005.0,regular,11.0,3.0,2004.0,11/3/2004,241103002.0
1,bos,2004.0,2005.0,regular,11.0,5.0,2004.0,11/5/2004,241105002.0
2,bos,2004.0,2005.0,regular,11.0,6.0,2004.0,11/6/2004,241106018.0
3,bos,2004.0,2005.0,regular,11.0,10.0,2004.0,11/10/2004,241110002.0
4,bos,2004.0,2005.0,regular,11.0,12.0,2004.0,11/12/2004,241112002.0


We see that all of the int columns have been converted to type float. We will conclude this notebook by converting all of these columns back to type int and then saving our csv.

In [11]:
columns_to_int = ['season_start_year', 'season_end_year', 'game_month',\
                  'game_day', 'game_year', 'matchup_id']

#cast every listed column back to int in one step
all_regular_season_games = all_regular_season_games.astype({col: 'int64' for col in columns_to_int})
    
all_regular_season_games.head()

Unnamed: 0,team,season_start_year,season_end_year,season_type,game_month,game_day,game_year,game_date,matchup_id
0,bos,2004,2005,regular,11,3,2004,11/3/2004,241103002
1,bos,2004,2005,regular,11,5,2004,11/5/2004,241105002
2,bos,2004,2005,regular,11,6,2004,11/6/2004,241106018
3,bos,2004,2005,regular,11,10,2004,11/10/2004,241110002
4,bos,2004,2005,regular,11,12,2004,11/12/2004,241112002
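
Casting float to int truncates fractional values and raises an error on missing ones, so a quick check before converting can be useful. A minimal sketch on made-up rows shaped like the table above (`toy_games` and `float_cols` are illustrative names):

```python
import pandas as pd

#made-up rows with float columns, mimicking the table before conversion
toy_games = pd.DataFrame({'team': ['bos', 'bos'],
                          'game_month': [11.0, 11.0],
                          'matchup_id': [241103002.0, 241105002.0]})
float_cols = ['game_month', 'matchup_id']

#check that no value is missing and every value is a whole number
all_present = toy_games[float_cols].notna().all().all()
all_whole = (toy_games[float_cols] % 1 == 0).all().all()

#convert only if the cast cannot lose information
if all_present and all_whole:
    toy_games = toy_games.astype({col: 'int64' for col in float_cols})

print(toy_games.dtypes)
```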


In [12]:
all_regular_season_games.to_csv('all_regular_season_games_since_ohfour.csv')
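
Since the goal is to apply SQL to this data, here is a minimal sketch of one possible next step: loading a table into an in-memory SQLite database and querying it. The rows are a made-up stand-in for the saved CSV, and the table name `games` is an assumption. In practice the DataFrame would come from `pd.read_csv` on the file written above.

```python
import sqlite3
import pandas as pd

#made-up stand-in for a few rows of the saved CSV
toy_games = pd.DataFrame({
    'team': ['bos', 'bos', 'ny'],
    'season_end_year': [2018, 2018, 2018],
    'matchup_id': [400974437, 400974703, 400974441],
})

#in-memory database, so nothing is written to disk
conn = sqlite3.connect(':memory:')
toy_games.to_sql('games', conn, index=False)

#count the games listed for each team
query = 'SELECT team, COUNT(*) AS n_games FROM games GROUP BY team ORDER BY team'
games_per_team = pd.read_sql_query(query, conn)
conn.close()

print(games_per_team)
```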