# Clean Pitcher Data

In this notebook, I read the JSON file of pitcher player data into a `pandas` DataFrame, and clean the data so it is ready for analysis.

In [1]:
import pandas as pd
import numpy as np
import pickle
%matplotlib inline

In [2]:
df = pd.read_json('data_pitchers.json')
df.columns

Index(['age', 'era', 'g', 'ip', 'losses', 'player_name', 'position', 'salary',
       'so', 'so9', 'team', 'war', 'win_loss_perc', 'wins', 'year'],
      dtype='object')

Salary data is in string form.  I need to strip off the '$' sign, get rid of commas, and convert to an integer.

In [3]:
df.salary = df.salary.apply(lambda x: int(''.join(x.strip('$ ').split(','))) if x else None)
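The same cleaning step can be written as a small named function, which is easier to test than a one-line lambda. This is only a sketch: `parse_salary` and the sample strings are hypothetical and are not taken from the scraped data.

```python
def parse_salary(raw):
    """Convert a string such as '$1,250,000' to an int; return None for empty input."""
    if not raw:
        # Empty strings and None both mean "no salary recorded".
        return None
    # Strip the dollar sign and surrounding spaces, then remove thousands separators.
    return int(raw.strip('$ ').replace(',', ''))

# Hypothetical sample values for illustration.
samples = ['$1,250,000', '$300,000 ', '', None]
parsed = [parse_salary(s) for s in samples]
```

A function like this could be passed to `df.salary.apply(parse_salary)` in place of the lambda.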

For my analysis, I am only interested in players that were active from 2000 onward.  However, I scraped data for all players, so I need to drop rows that correspond to player-years before 2000.  Maybe I can use the extra data on another project someday.

In [4]:
df.drop(df[df.year < 2000].index, inplace=True)

Several rows do not contain salary data.  These rows would throw my analysis off, so I need to drop them.  This carries some risk: I do not know why a row is missing its salary, and the gaps could be systematic rather than random.  For this project, I will treat these missing values as unusable data.

In [5]:
df = df.dropna(subset=['salary'])

Transform salary to log(salary).

In [6]:
df['log_salary'] = np.log(df.salary)

Since 2000, there have been some team relocations and renamings.  Map old team names to the corresponding current team name.

In [7]:
replacements = {
                "FLA": "MIA",
                "MON": "WSN",
                "ANA": "LAA",
                "TBD": "TBR",
                }

df = df.replace({'team': replacements})
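Passing a nested dictionary to `DataFrame.replace` restricts the substitution to the named column. A minimal sketch on a hypothetical toy frame, where the `note` column is invented purely to show that other columns are left alone:

```python
import pandas as pd

replacements = {"FLA": "MIA", "MON": "WSN", "ANA": "LAA", "TBD": "TBR"}

# Hypothetical rows; 'note' deliberately contains a string that matches an old team code.
toy = pd.DataFrame({
    'team': ['FLA', 'NYY', 'MON', 'TBD'],
    'note': ['MON', 'x', 'y', 'z'],
})

# Only values in the 'team' column are replaced.
toy = toy.replace({'team': replacements})
```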

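Since I'm trying to predict a player's salary in the year after he registers season-level stats, I need to apply a shift to create "next year salary" and "next year log salary" columns.  I do this with a split-apply-combine operation in `pandas`: I group the DataFrame by player, and shift the salary and log salary columns within each group.  A shift of 1 pulls each value from the preceding row of the group, so this yields next year's salary only if each player's rows run from most recent to oldest.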

In [9]:
df = df.reset_index(drop=True)
# Shift within each player's rows; GroupBy.shift keeps the original index, so the result aligns with df.
df[['next_year_salary', 'next_year_log_salary']] = df.groupby('player_name')[['salary', 'log_salary']].shift(1)
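Which direction of shift produces "next year" depends on how each player's rows are ordered, and the notebook does not state that order. The sketch below uses a hypothetical toy frame sorted by ascending year to show what each direction returns within a group.

```python
import pandas as pd

# Hypothetical data: two players, rows sorted by ascending year.
toy = pd.DataFrame({
    'player_name': ['A', 'A', 'A', 'B', 'B'],
    'year': [2001, 2002, 2003, 2001, 2002],
    'salary': [100, 200, 300, 10, 20],
})
toy = toy.sort_values(['player_name', 'year']).reset_index(drop=True)

grouped = toy.groupby('player_name')['salary']
toy['shift_plus_1'] = grouped.shift(1)    # value from the preceding row of the same player
toy['shift_minus_1'] = grouped.shift(-1)  # value from the following row of the same player
```

With ascending years, `shift(-1)` holds the following season's salary and `shift(1)` holds the previous season's; with descending years the roles swap. Either way, the first or last row of each player ends up null.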

The league minimum salary has changed over time.  Create a new column holding the relevant minimum salary for each row.  Note that the dictionary below maps each year to the following year's minimum, so the values line up with the next-year salary columns.

In [10]:
lm_dict = {
            2000: 300000,
            2001: 300000,
            2002: 300000,
            2003: 316000,
            2004: 327000,
            2005: 380000,
            2006: 390000,
            2007: 400000,
            2008: 400000,
            2009: 414000,
            2010: 480000,
            2011: 480000,
            2012: 480000,
            2013: 507500,
            2014: 507500,
            2015: 535000,
            2016: 545000,
            2017: 555000
            }

df['league_min'] = np.zeros(df.shape[0])

for year_, salary_ in lm_dict.items():
    # Assign through a single .loc call to avoid chained assignment on a possible copy.
    df.loc[df.year == year_, 'league_min'] = salary_
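A year-to-value lookup like this can also be done without a loop using `Series.map`. The sketch below uses a two-entry subset of the dictionary and a hypothetical toy frame; note one behavioral difference from the loop: years missing from the dictionary become `NaN` rather than staying at 0.

```python
import pandas as pd

# Subset of the league-minimum dictionary, for illustration only.
lm_subset = {2000: 300000, 2003: 316000}

# Hypothetical rows; 1999 is intentionally absent from the dictionary.
toy = pd.DataFrame({'year': [2000, 2003, 2003, 1999]})

# Look up each row's year in the dictionary; unmatched years map to NaN.
toy['league_min'] = toy['year'].map(lm_subset)
```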

Compute how much a player's next-year salary exceeds the league minimum, and store the result in a new column.

In [12]:
df['salary_over_minimum'] = df['next_year_salary'] - df['league_min']

Most players make the league minimum or slightly more than the league minimum.  I might be interested in analyzing just the subset of players that make more than the league minimum.  For now, I set the threshold at $10,000 above the league minimum, which is not much, and create an indicator variable.
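One way to build such an indicator in a single step is `np.where`. This is a sketch on hypothetical values, not the notebook's data; it also shows that a null `salary_over_minimum` compares as false and therefore receives 0.

```python
import numpy as np
import pandas as pd

threshold = 10000

# Hypothetical amounts over the minimum, including one missing value.
toy = pd.DataFrame({'salary_over_minimum': [0.0, 5000.0, 10000.0, 250000.0, np.nan]})

# 1 when the player is within the threshold of the league minimum, else 0.
toy['player_at_min'] = np.where(toy['salary_over_minimum'] <= threshold, 1, 0)
```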

In [13]:
threshold = 10000

df['player_at_min'] = np.zeros(df.shape[0])
# Assign through a single .loc call; chained assignment raises SettingWithCopyWarning.
df.loc[df.salary_over_minimum <= threshold, 'player_at_min'] = 1

A lot of null values resulted from shifting the salary columns.  Drop rows with null values and reset the index.

In [14]:
df = df.dropna()
df = df.reset_index(drop=True)

The dataframe is ready to be used.  Pickle for future use.

In [17]:
pickle_filename = 'pickled_pitchers.pkl'
with open(pickle_filename, 'wb') as f_obj:
    # Serialize the cleaned DataFrame built above.
    pickle.dump(df, f_obj)
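`pandas` also offers `DataFrame.to_pickle` and `pd.read_pickle`, which wrap the same serialization without an explicit file handle. The round trip below is a sketch on a hypothetical toy frame, written to a temporary directory so it leaves no files behind; the file name is made up.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame.
toy = pd.DataFrame({'player_name': ['A', 'B'], 'salary': [300000, 1250000]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pitchers.pkl')
    toy.to_pickle(path)              # write the frame to disk
    restored = pd.read_pickle(path)  # read it back into a new DataFrame
```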