# Scraping/Cleaning Hitter & Pitcher Statistics

2010-2016

Hitters: Player, R, HR, RBI, SB, AVG, Season/Year, AB

Pitchers: Player, K, W, SV, ERA, WHIP, Season/Year, IP

In [172]:
import pandas as pd
import re

We will first create a pandas DataFrame that holds the following batting statistics for players for the 2010 through 2016 seasons:
- Player Name
- Runs (R)
- Home Runs (HR)
- Runs Batted In (RBI)
- Stolen Bases (SB)
- Batting Average (AVG)
- Season/Year
- At Bats (AB)

In [187]:
# ================================== CREATE DATAFRAME OF ALL HITTER INFORMATION =====================================

# list that collects one DataFrame per season
hitter_frames = []
# list of years to loop through
year_list = [2016, 2015, 2014, 2013, 2012, 2011, 2010]
for year in year_list:
    # open the file PlayerData/<year>Hitters, e.g. PlayerData/2016Hitters
    with open('PlayerData/' + str(year) + 'Hitters') as inFile:
        # convert csv file to pandas DataFrame
        yearDF = pd.read_csv(inFile)
        # create year column
        yearDF["Year"] = year
        # append to master list
        hitter_frames.append(yearDF)
# concatenate master list into one pandas DataFrame
# ignore_index=True gives the combined frame a fresh, non-repeating index
hittersDF = pd.concat(hitter_frames, ignore_index=True)

# remove all except columns of interest for our particular project (files have more stats than we are interested in)
hittersDF = hittersDF[["Name", "Tm", "R", "HR", "RBI", "SB", "BA", "AB", "Pos Summary", "Year"]]

### TO DO FOR HITTING DATA

In [188]:
# need to drop pitchers from this - position == 1 (or has a 1 in it), need to use regex
# be careful of symbols/etc. - see baseball-reference glossary for terms
# what to do about NaNs?
# problem that index values are repeating?
# change datatypes


In [189]:
# clean names by removing everything from the first '*', '\', '#' or '+' onward
# regex=True is required: newer pandas versions treat the pattern as a literal string by default
hittersDF['Name'] = hittersDF['Name'].str.replace(r'[*\\#+].*', '', regex=True)
# change all letters in names to lower case
def lowercase(mystring):
    return str.lower(mystring)
hittersDF["Name"] = hittersDF["Name"].apply(lowercase)
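The name cleanup above can be checked on a few strings without loading any files. This is a minimal sketch using only the standard `re` module: `clean_name` and the sample values are hypothetical, the raw format (a handedness marker followed by a backslash and a player ID) is an assumption about the exported files, and the extra `strip()` is not part of the original cell.

```python
import re

# Characters that start the suffix to discard: '*', '#', '+',
# and the backslash that (by assumption) precedes a player ID.
_SUFFIX = re.compile(r"[*\\#+].*")

def clean_name(raw_name):
    """Drop everything from the first marker onward, then lowercase."""
    return _SUFFIX.sub("", raw_name).strip().lower()

# Illustrative raw values (assumed format, not taken from the data files)
samples = [
    "Bryce Harper*\\harpebr03",
    "Jose Altuve\\altuvjo01",
    "Carlos Santana#\\santaca01",
]
cleaned = [clean_name(s) for s in samples]
print(cleaned)
```

Names that contain none of the marker characters pass through unchanged apart from lowercasing.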

In [190]:
# some names are duplicated within a season because there is an entry for every team (if they were traded)
# the first entry is their season total, so we wish to keep that
# drop duplicates per (Name, Year) pair - keep="first" keeps the first occurrence, which is the total
# this replaces the per-year masked assignment, which left the dropped rows in place as all-NaN rows
hittersDF = hittersDF.drop_duplicates(subset=["Name", "Year"], keep="first")
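To see how the de-duplication treats a traded player, here is a small sketch on made-up rows. The `"TOT"` team label for the season-total row and all values are assumptions for illustration; only the rule "the first row per player and season is the total" comes from the notebook.

```python
import pandas as pd

# Toy rows (made-up values): a traded player has a total row first,
# followed by one row per team, all within the same season.
toy = pd.DataFrame({
    "Name": ["player a", "player a", "player a", "player b", "player a"],
    "Tm":   ["TOT", "NYY", "BOS", "SEA", "NYY"],
    "Year": [2016, 2016, 2016, 2016, 2015],
    "AB":   [500, 300, 200, 450, 520],
})

# Keep the first row for each (Name, Year) pair, i.e. the season total
deduped = toy.drop_duplicates(subset=["Name", "Year"], keep="first")
print(deduped)
```

Including `Year` in `subset` matters: de-duplicating on `Name` alone across the combined frame would discard a player's other seasons.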

In [191]:
# cast position summaries to Python strings so they can be compared with string literals
# (missing values become the string 'nan')
hittersDF["Pos Summary"] = hittersDF["Pos Summary"].astype(str)

In [192]:
# drop players whose position summary is exactly '1' or '/1' (pitchers)
hittersDF = hittersDF[(hittersDF["Pos Summary"] != "1") & (hittersDF["Pos Summary"] != "/1")]
# drop rows that have no player name
hittersDF = hittersDF.dropna(subset=["Name"], axis=0)
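The to-do list above mentions using a regex for the pitcher filter. One possible shape is sketched below with the standard `re` module. It assumes the position summary uses `1` for pitcher and may prefix a position with `*` or `/`; `is_pitcher_only` is a hypothetical helper, and the example strings are illustrative rather than taken from the data.

```python
import re

# Assumed encoding: '1' means pitcher, optionally prefixed by '*' or '/'.
# fullmatch() requires the whole string to match, so a hitter who also
# pitched (for example '9/1') is not flagged.
_PITCHER_ONLY = re.compile(r"[*/]?1")

def is_pitcher_only(pos_summary):
    """Return True when the position summary lists pitcher and nothing else."""
    return _PITCHER_ONLY.fullmatch(str(pos_summary)) is not None

examples = ["1", "/1", "*8/D", "9/1", "nan"]
flags = [is_pitcher_only(p) for p in examples]
print(flags)
```

Applied to the DataFrame, this could look like `hittersDF[~hittersDF["Pos Summary"].apply(is_pitcher_only)]`. Whether a player with a `1` anywhere in the summary should be dropped is a separate decision that this sketch does not make.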

In [193]:
# keep only hitters with more than 200 at bats in the season
hittersDF = hittersDF[(hittersDF["AB"] > 200)]
# the team column is no longer needed
hittersDF = hittersDF.drop("Tm", axis=1)

In [198]:
# no null values
hittersDF.isnull().sum()

Name           0
R              0
HR             0
RBI            0
SB             0
BA             0
AB             0
Pos Summary    0
Year           0
dtype: int64

We now create a pandas DataFrame that holds the following pitching statistics for players (pitchers only) for the 2010 through 2016 seasons:
- Player Name
- Strikeouts (K)
- Wins (W)
- Saves (SV)
- Earned Run Average (ERA)
- Walks plus Hits per Inning Pitched (WHIP)
- Season/Year
- Innings Pitched (IP)

In [204]:
# ================================== CREATE DATAFRAME OF ALL PITCHER INFORMATION =====================================

# list that collects one DataFrame per season
pitcher_frames = []
# loop through years of interest
for year in year_list:
    # open the file PlayerData/<year>Pitchers, e.g. PlayerData/2016Pitchers
    with open('PlayerData/' + str(year) + 'Pitchers') as inFile:
        # convert csv file to pandas DataFrame
        yearDF = pd.read_csv(inFile)
        # create year column
        yearDF["Year"] = year
        # append to master list
        pitcher_frames.append(yearDF)
# concatenate master list into one pandas DataFrame
# ignore_index=True gives the combined frame a fresh, non-repeating index
pitchersDF = pd.concat(pitcher_frames, ignore_index=True)
# remove all except columns of interest for our particular project (files have more stats than we are interested in)
pitchersDF = pitchersDF[["Name", "Tm", "Year", "SO", "W", "SV", "ERA", "WHIP", "IP"]]

### TO DO FOR PITCHING DATA

In [205]:
# what to do about NaNs?
# problem that index values are repeating?
# data types

In [206]:
# clean names by removing everything from the first '*', '\', '#' or '+' onward
# regex=True is required: newer pandas versions treat the pattern as a literal string by default
pitchersDF['Name'] = pitchersDF['Name'].str.replace(r'[*\\#+].*', '', regex=True)

In [208]:
pitchersDF["Name"] = pitchersDF["Name"].apply(lowercase)

In [209]:
# some names are duplicated within a season because there is an entry for every team (if they were traded)
# the first entry is their season total, so we wish to keep that
# drop duplicates per (Name, Year) pair - keep="first" keeps the first occurrence, which is the total
pitchersDF = pitchersDF.drop_duplicates(subset=["Name", "Year"], keep="first")
# drop rows that have no player name
pitchersDF = pitchersDF.dropna(subset=["Name"], axis=0)

In [210]:
# the team column is no longer needed
pitchersDF = pitchersDF.drop("Tm", axis=1)

In [221]:
# drop rows where the pitcher has no ERA or WHIP recorded for the year
# note: notnull() does not remove infinite values, so a row with an infinite ERA
# (0 innings pitched) is not dropped by these two lines
pitchersDF = pitchersDF[pitchersDF.ERA.notnull()]
pitchersDF = pitchersDF[pitchersDF.WHIP.notnull()]
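The difference between a missing and an infinite ERA can be shown on a toy frame. This sketch assumes an undefined ERA would appear as `inf` once loaded; the files are not shown here, so that encoding and all values below are illustrative.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values): one normal pitcher, one with an infinite
# ERA/WHIP (runs allowed in 0 innings), one with no stats at all.
toy = pd.DataFrame({
    "Name": ["pitcher a", "pitcher b", "pitcher c"],
    "ERA":  [3.25, np.inf, np.nan],
    "WHIP": [1.10, np.inf, np.nan],
    "IP":   [180.0, 0.0, 0.0],
})

# notnull() keeps the infinite row, because inf is not a missing value
only_notnull = toy[toy["ERA"].notnull() & toy["WHIP"].notnull()]

# np.isfinite is False for both NaN and +/-inf, so one mask covers both cases
finite_mask = np.isfinite(toy[["ERA", "WHIP"]]).all(axis=1)
finite_only = toy[finite_mask]

print(len(only_notnull), len(finite_only))
```

A minimum innings-pitched threshold would also remove such rows, so an explicit finite check is mainly useful when no such threshold is applied.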

In [223]:
pitchersDF.isnull().sum()

Name    0
Year    0
SO      0
W       0
SV      0
ERA     0
WHIP    0
IP      0
dtype: int64

In [1]:
# keep only pitchers with more than 25 innings pitched in the season
pitchersDF = pitchersDF[pitchersDF["IP"] > 25]

# DREW

# JAKE