# Notebook 1: Cleaning Baseball Reference Data

### Introduction

This notebook is intended as a reference for cleaning data scraped from Baseball Reference. No special libraries are required; all of the cleaning here is done with pandas.

In [1]:
import pandas as pd

### Cleaning Function

All of the major data cleaning can be done in one function. Several characters need to be stripped from the player names, including hidden non-breaking spaces (`\xa0`) that look like regular spaces but aren't. There are also cases where a pitcher plays for more than one team during a season. The data then has one row of stats for each team, as well as a row with totals across all of the teams they played for (marked `TOT` in the `Tm` column). I wanted to keep just the totals, since the team-specific data isn't as important to me. This function takes care of all that.

In [2]:
def clean_bbref_season(csvpath, suffix):
    file = pd.read_csv(csvpath)
    # 'Name' holds the display name and the player ID, separated by a backslash
    name_parts = file['Name'].str.split("\\", expand=False)
    # Strip the '*' marker and swap hidden non-breaking spaces (\xa0) for real spaces
    file['Full_Name'] = name_parts.str[0].str.strip('*').str.replace('\xa0', ' ')
    file['ID'] = name_parts.str[1]
    # Put Full_Name and ID first and drop the first original column
    cols = file.columns.tolist()
    new_cols = [cols[-2], cols[-1]] + cols[1:-2]
    # Players with one row spent the whole season with a single team
    single_row_file = file.groupby('ID').filter(lambda x: len(x) == 1)
    # Players with several rows: keep only the 'TOT' (season total) row
    multi_row_file = file[file.Tm == 'TOT']
    no_duplicates = pd.concat([single_row_file, multi_row_file]).sort_index()[new_cols]
    # Add the season suffix to every column after Full_Name and ID
    new_names = {i: i + suffix for i in no_duplicates.columns[2:]}
    no_duplicates = no_duplicates.rename(columns=new_names)
    return no_duplicates
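
To make the two tricky parts concrete, here is a minimal sketch that runs the same name-parsing and de-duplication steps on a made-up table. The players, IDs, and innings are invented for illustration, and the column layout (`Rk`, `Name`, `Tm`, `IP`) is an assumption about what the scraped CSV looks like.

```python
import pandas as pd

# Invented rows: one single-team pitcher and one pitcher with a TOT row
# plus two team rows. '\xa0' is the hidden non-breaking space.
raw = pd.DataFrame({
    'Rk': [1, 2, 3, 4],
    'Name': ['Al\xa0Able*\\ableal01', 'Bo\xa0Baker\\bakerbo01',
             'Bo\xa0Baker\\bakerbo01', 'Bo\xa0Baker\\bakerbo01'],
    'Tm': ['NYY', 'TOT', 'BOS', 'SEA'],
    'IP': [50.0, 80.0, 30.0, 50.0],
})

# The non-breaking space prints like a space but does not compare equal to one.
print('Al\xa0Able' == 'Al Able')  # False

# Split "name\id", then clean up the display name.
parts = raw['Name'].str.split('\\', expand=False)
raw['Full_Name'] = (parts.str[0].str.strip('*')
                    .str.replace('\xa0', ' ', regex=False))
raw['ID'] = parts.str[1]

# Keep single-team players as they are, and only the TOT row for the rest.
one_team = raw.groupby('ID').filter(lambda g: len(g) == 1)
totals = raw[raw['Tm'] == 'TOT']
deduped = pd.concat([one_team, totals]).sort_index()

print(deduped[['Full_Name', 'ID', 'Tm', 'IP']])
```

Each player ends up with exactly one row: the single-team pitcher keeps his only row, and the multi-team pitcher keeps just the `TOT` row.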
    

### Saving to CSV

Finally, I'll call the function on each season of my scraped data, and save the resulting pandas DataFrames as CSV files.

In [3]:
season_2016_c = clean_bbref_season('../data/season_2016.csv', '_2016')
season_2017_c = clean_bbref_season('../data/season_2017.csv', '_2017')
season_2018_c = clean_bbref_season('../data/season_2018.csv', '_2018')
season_2019_c = clean_bbref_season('../data/season_2019.csv', '_2019')
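
Before writing anything to disk, it can be worth confirming that the cleaning did what it was meant to do. The sketch below is one possible check, not part of the original workflow: `check_cleaned` is a hypothetical helper, and the two small tables are invented so the block can run on its own. It relies only on the `Full_Name` and `ID` columns, which the cleaning function leaves unsuffixed.

```python
import pandas as pd

def check_cleaned(df):
    """Hypothetical helper: list any problems found in a cleaned season table."""
    problems = []
    # After cleaning, every player ID should appear exactly once.
    if df['ID'].duplicated().any():
        problems.append('duplicate player IDs')
    # No hidden non-breaking spaces should remain in the names.
    if df['Full_Name'].str.contains('\xa0', regex=False).any():
        problems.append('non-breaking spaces left in Full_Name')
    # The '*' marker should have been stripped.
    if df['Full_Name'].str.endswith('*').any():
        problems.append("trailing '*' left in Full_Name")
    return problems

# Invented rows, only to show the helper in action.
good = pd.DataFrame({'Full_Name': ['Al Able', 'Bo Baker'],
                     'ID': ['ableal01', 'bakerbo01']})
bad = pd.DataFrame({'Full_Name': ['Al\xa0Able', 'Al\xa0Able'],
                    'ID': ['ableal01', 'ableal01']})

print(check_cleaned(good))  # []
print(check_cleaned(bad))
```

On the real data this would be called as `check_cleaned(season_2016_c)`, and so on for each season.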

In [4]:
# season_2016_c.to_csv('../data/season_2016_c.csv', index = False)
# season_2017_c.to_csv('../data/season_2017_c.csv', index = False)
# season_2018_c.to_csv('../data/season_2018_c.csv', index = False)
# season_2019_c.to_csv('../data/season_2019_c.csv', index = False)
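
The per-season calls above all follow the same pattern, so they could also be folded into a loop. This is a minimal sketch under that assumption: `clean_seasons` is a hypothetical wrapper, not something the notebook defines, and it takes the cleaning function as an argument so that it stays self-contained. Saving is off by default, which matches the commented-out `to_csv` calls.

```python
from pathlib import Path

def clean_seasons(years, clean_fn, data_dir='../data', save=False):
    """Hypothetical wrapper: clean each season and optionally save the result."""
    data_dir = Path(data_dir)
    cleaned = {}
    for year in years:
        # Same naming scheme as above: season_<year>.csv in, season_<year>_c.csv out
        df = clean_fn(data_dir / f'season_{year}.csv', f'_{year}')
        if save:
            df.to_csv(data_dir / f'season_{year}_c.csv', index=False)
        cleaned[year] = df
    return cleaned

# Example call with the notebook's function (needs the scraped CSV files):
# seasons = clean_seasons(range(2016, 2020), clean_bbref_season)
```

The result is a dictionary keyed by year, so `seasons[2016]` would hold the same DataFrame as `season_2016_c`.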