### **Cleaning Data Scraped from Baseball-Reference.com**

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('baseball_reference_2016_scrape.csv')

stripping extraneous characters from columns in the dataframe

In [3]:
# .str methods pass missing values through instead of raising on NaN
df['attendance'] = df['attendance'].str.strip("]'")
df['game_duration'] = df['game_duration'].str.strip(": ")
df['venue'] = df['venue'].str.strip(" :")
# strip() removes a set of characters, not a prefix, so drop the label explicitly
df['start_time'] = df['start_time'].str.replace('Start Time: ', '', regex=False).str.strip()
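
One caveat with this kind of cleanup: `str.strip` treats its argument as a set of characters to trim from both ends, not as a literal prefix. The sketch below illustrates the difference on a made-up value (the sample string is an assumption, not a row from the scrape).

```python
import pandas as pd

# Hypothetical sample value; real rows in the scrape may look different.
times = pd.Series(['Start Time: 1:10 p.m. ET'])

# strip() trims any of the listed characters from both ends,
# so the trailing 'T' of 'ET' is removed as well.
by_charset = times.str.strip('Start Time: ')

# Removing the literal label leaves the rest of the value intact.
by_prefix = times.str.replace('Start Time: ', '', regex=False)

print(by_charset[0])
print(by_prefix[0])
```

Whether the two results differ on the real data depends on how each start time ends, so it is worth spot-checking a few rows.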

splitting columns from the dataframe

In [4]:
df['attendance'] = df['attendance'].str.replace(' ', '', regex=False)
df['attendance'] = df['attendance'].str.replace(',', '', regex=False)
# n must be passed by keyword in pandas 2.0+
df['day_of_week'] = df['date'].str.split(',', n=3, expand=True)[0]
df['game_type_remove'] = df['game_type']
game_type_parts = df['game_type_remove'].str.split(',', n=2, expand=True)
df['game_type'] = game_type_parts[0]
df['field_type'] = game_type_parts[1]
df['field_type'] = df['field_type'].str.replace(' on', 'on', regex=False)
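
To make the splitting step concrete, here is a small sketch of how `str.split` with `expand=True` turns one text column into several. The sample values are assumptions modelled on the 'Day Game' and 'on grass' labels used later in this notebook.

```python
import pandas as pd

# Hypothetical raw values in the shape 'game type, field type'.
raw = pd.Series(['Day Game, on grass', 'Night Game, on turf'])

# expand=True returns a DataFrame with one column per piece;
# n caps the number of splits and is keyword-only in pandas 2.0+.
parts = raw.str.split(',', n=2, expand=True)

game_type = parts[0]
field_type = parts[1].str.strip()  # drop the space left after the comma

print(game_type.tolist())
print(field_type.tolist())
```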

cleaning and adjusting the weather column, currently titled as 'other_info_string'


In [5]:
# split once and reuse the pieces instead of re-splitting inside a loop
info_parts = df['other_info_string'].str.split('</strong> ', n=5, expand=True)
# fall back to the fifth piece when the sixth is missing
df['start_time_weather'] = info_parts[5].fillna(info_parts[4])

df['temperature'] = df['start_time_weather'].str.split('&', n=2, expand=True)[0]
df['start_time_weather1'] = df['start_time_weather'].str.split(', ', n=3, expand=True)[1]
df['start_time_weather2'] = df['start_time_weather'].str.split('Wind ', n=3, expand=True)[1]
df['start_time_weather3'] = df['start_time_weather2'].str.split('.', n=2, expand=True)[0]
# text before 'mph' is the wind speed; the remainder holds direction and sky
wind_parts = df['start_time_weather3'].str.split('mph', n=2, expand=True)
df['wind_speed'] = wind_parts[0]
df['start_time_weather3'] = wind_parts[1]
df['wind_direction'] = df['start_time_weather3'].str.split(', ', n=2, expand=True)[0]
df['sky'] = df['start_time_weather3'].str.split(', ', n=2, expand=True)[1]
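
The chain of splits above can also be expressed as a single regular expression. The following is only a sketch: the sample strings and the pattern are assumptions about what the weather text looks like once the HTML has been split away, and real rows may need a looser pattern.

```python
import pandas as pd

# Hypothetical weather strings in the assumed scraped format.
weather = pd.Series([
    '72&deg; F, Wind 9mph out to Centerfield, Cloudy.',
    '81&deg; F, Wind 12mph from Left to Right, Sunny.',
])

# Named groups become column names; the sky part is treated as optional.
pattern = (
    r'(?P<temperature>\d+)&deg; F, '
    r'Wind (?P<wind_speed>\d+)mph'
    r'(?P<wind_direction>[^,.]*)'
    r'(?:, (?P<sky>[^.]*))?'
)
parsed = weather.str.extract(pattern)
parsed['wind_direction'] = parsed['wind_direction'].str.strip()

print(parsed)
```

The extracted values are still strings, so numeric columns would need the same `astype(float)` conversion used later in the notebook.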

create a column for total runs

In [6]:
df['total_runs'] = df['away_team_runs'] + df['home_team_runs']

adjusting for missing data that caused misalignment in initial scrape

In [7]:
# (row, game duration, venue) for the three games that were misaligned in the scrape
manual_fixes = [
    (220, '3:18', 'Citi Field'),
    (1724, '2:40', 'PNC Park'),
    (1912, '3:10', 'U.S. Cellular Field'),
]
# .loc assigns in place and avoids chained-assignment warnings
for row, duration, venue in manual_fixes:
    df.loc[row, 'attendance'] = np.nan
    df.loc[row, 'game_duration'] = duration
    df.loc[row, 'game_type'] = 'Day Game'
    df.loc[row, 'field_type'] = 'on grass'
    df.loc[row, 'venue'] = venue

changing data types

In [8]:
df['attendance'] = df['attendance'].astype(float)
df['date'] = pd.to_datetime(df['date'])
df['temperature'] = df['temperature'].astype(float)
df['wind_speed'] = df['wind_speed'].astype(float)
# convert 'H:MM' to decimal hours: hours + minutes / 60
duration_parts = df['game_duration'].str.split(':', n=2, expand=True)
df['game_hours_dec'] = duration_parts[0].astype(float) + duration_parts[1].astype(float) / 60
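
As a quick worked example of the duration conversion: a game lasting 3:18 is 3 hours plus 18/60 hours, i.e. 3.3 decimal hours. The helper name `duration_to_hours` below is hypothetical and simply wraps the same split-and-divide arithmetic.

```python
import pandas as pd

def duration_to_hours(durations: pd.Series) -> pd.Series:
    """Convert 'H:MM' strings to decimal hours (hypothetical helper)."""
    parts = durations.str.split(':', n=1, expand=True)
    hours = parts[0].astype(float)
    minutes = parts[1].astype(float)
    return hours + minutes / 60

# Durations taken from the manual fixes earlier in the notebook.
sample = pd.Series(['3:18', '2:40', '3:10'])
print(duration_to_hours(sample).round(2).tolist())
```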

filling in missing data

In [9]:
df['sky'] = df['sky'].astype(object).fillna('Unknown')
df['wind_direction'] = df['wind_direction'].astype(object).fillna('Unknown')

dropping columns from the dataframe that will not be needed

In [10]:
df = df.drop(columns=[
    'boxscore_url',
    'game_duration',
    'game_type_remove',
    'other_info_string',
    'start_time_weather',
    'start_time_weather1',
    'start_time_weather2',
    'start_time_weather3',
])

creating a new field to differentiate between regular season and post season games

In [11]:
# post season: any game after October 2, or any game in November
month = df['date'].dt.month
day = df['date'].dt.day
is_post_season = ((month == 10) & (day > 2)) | (month == 11)
df['season'] = np.where(is_post_season, 'post season', 'regular season')
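
Because the rule amounts to "everything after October 2 is post season", the same labels can be sketched with a single date comparison. This assumes every row belongs to the 2016 season, as the file name suggests; the `regular_season_end` variable and the sample dates are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample dates around the assumed 2016 cutoff.
games = pd.DataFrame({'date': pd.to_datetime(
    ['2016-04-03', '2016-10-02', '2016-10-04', '2016-11-02']
)})

# Assumption: games after October 2, 2016 are post season games.
regular_season_end = pd.Timestamp('2016-10-02')

games['season'] = np.where(
    games['date'] > regular_season_end, 'post season', 'regular season'
)
print(games)
```

A single cutoff would need to change for any other season, whereas the month and day test carries over only if the schedule happens to match.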

create a column to count the number of home team wins

In [12]:
# 1 when the home team scored more runs than the away team, otherwise 0
df['home_team_win'] = (df['home_team_runs'] > df['away_team_runs']).astype(int)

create a column to count the number of home team losses

In [13]:
# 1 when the home team scored fewer runs than the away team, otherwise 0
df['home_team_loss'] = (df['home_team_runs'] < df['away_team_runs']).astype(int)

create a column to state a home team win or loss

In [14]:
# 'Win' when the home team outscored the away team, otherwise 'Loss'
df['home_team_outcome'] = np.where(df['home_team_runs'] > df['away_team_runs'], 'Win', 'Loss')
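
One edge case worth noting: the logic above labels any game the home team did not win as 'Loss', so a tied game (if the data contains one) would be recorded as a loss. A minimal sketch of a three-way label using `np.select` follows; the 'Tie' label and the sample scores are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, including one tied game.
games = pd.DataFrame({
    'home_team_runs': [5, 2, 1],
    'away_team_runs': [3, 4, 1],
})

conditions = [
    games['home_team_runs'] > games['away_team_runs'],
    games['home_team_runs'] < games['away_team_runs'],
]
# 'Tie' is an assumed label for games that match neither condition.
games['home_team_outcome'] = np.select(conditions, ['Win', 'Loss'], default='Tie')
print(games)
```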

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2463 entries, 0 to 2462
Data columns (total 25 columns):
attendance           2460 non-null float64
away_team            2463 non-null object
away_team_errors     2463 non-null int64
away_team_hits       2463 non-null int64
away_team_runs       2463 non-null int64
date                 2463 non-null datetime64[ns]
field_type           2463 non-null object
game_type            2463 non-null object
home_team            2463 non-null object
home_team_errors     2463 non-null int64
home_team_hits       2463 non-null int64
home_team_runs       2463 non-null int64
start_time           2463 non-null object
venue                2463 non-null object
day_of_week          2463 non-null object
temperature          2463 non-null float64
wind_speed           2463 non-null float64
wind_direction       2463 non-null object
sky                  2463 non-null object
total_runs           2463 non-null int64
game_hours_dec       2463 non-null float64
season

export cleaned data to csv

In [16]:
df.to_csv('baseball_reference_2016_clean.csv')
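
Note that `to_csv` writes the row index as an extra unnamed first column by default, and dates come back as plain strings unless they are parsed on load. A sketch of reading such a file back, using an in-memory buffer and a tiny made-up frame in place of the real file:

```python
import io
import pandas as pd

# Tiny stand-in for the cleaned dataframe (values are made up).
clean = pd.DataFrame({
    'date': pd.to_datetime(['2016-04-03', '2016-04-04']),
    'total_runs': [7, 9],
})

buffer = io.StringIO()
clean.to_csv(buffer)  # the index is written as the first column
buffer.seek(0)

# index_col=0 restores the index; parse_dates restores the datetime dtype.
restored = pd.read_csv(buffer, index_col=0, parse_dates=['date'])
print(restored.dtypes)
```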