# Data Analysis Project: Figure Out How Much a Player Should Be Worth
## Research Question: How can we assess the value of a player, based on their past performance and future indicators?
- Figure out how player contracts have increased over time and compare that to inflation
- Deflate each player contract to get its real value
    - This allows player salaries to be compared across eras, independent of inflation
- Calculate the Baseball Price Index, which shows how the price of 1 WAR has risen or fallen over time
- Calculate what a player's contract should be, based on past performance and future predictions
    - Calculate the optimal Dollar/WAR
- Figure out how a player's performance changes with time
    - Graph how a player's performance changes with time
    - Use that to figure out how much a player's total contract should be, given their performance over the past 3 years
    - Account for inflation so that a player's real salary does not erode over the life of the contract
    - Calculate how much a player like Shohei Ohtani should get in free agency
- Also figure out how long a contract should be
    - Graph how contract lengths have changed over time
    - Graph how the length of a contract changes depending on a player's contract
        - Show the relationship between a player's contract length and past WAR
    - Find the optimum between contract length and AAV using the calculated total contract value and the $/WAR
        - This shows whether it is better to have a longer contract with a higher AAV or a shorter contract with a lower AAV
- Create a visualization in Tableau using these numbers
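
As a rough illustration of the Dollar/WAR and price-index steps in the plan above, here is a minimal sketch. The player-seasons are made up, the column names (`real_salary`, `war`) are placeholders, and dividing total dollars by total WAR is only one possible definition of the index, not necessarily the one this project ends up using.

```python
import pandas as pd

# Hypothetical player-seasons: inflation-adjusted salary and WAR (made-up numbers).
seasons = pd.DataFrame({
    'year':        [2021, 2021, 2022, 2022],
    'real_salary': [10_000_000, 30_000_000, 15_000_000, 39_000_000],
    'war':         [1.0, 4.0, 2.0, 4.0],
})

# Market-wide price of a win: total dollars paid divided by total WAR in each year.
totals = seasons.groupby('year')[['real_salary', 'war']].sum()
dollars_per_war = totals['real_salary'] / totals['war']

# Express every year relative to the first year in the data (first year = 100).
price_index = 100 * dollars_per_war / dollars_per_war.iloc[0]
print(price_index)
```

Summing before dividing keeps low-WAR seasons from dominating the result, which can happen if each player's individual $/WAR is averaged instead.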

`pybaseball` library documentation:
https://github.com/jldbc/pybaseball/tree/master/docs

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from openpyxl import Workbook, load_workbook
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource, HoverTool

#importing batting data first
from pybaseball import batting_stats, batting_stats_bref, bwar_bat, statcast_batter, playerid_lookup, statcast_batter_expected_stats, statcast_batter_exitvelo_barrels
from datetime import date, datetime



In [4]:
class Plot:
    def __init__(self, x_column, y_column, data_source):
        self.x_values = x_column.astype(float)
        self.y_values = y_column.astype(float)
        self.x_name = x_column.name
        self.y_name = y_column.name
        
        self.source = ColumnDataSource(data_source)
        
        # Hover tooltips as (label, '@{column name}') pairs; the braces let Bokeh handle names with spaces
        self.tooltips = [
            (f'{x_column.name}', f'@{{{x_column.name}}}'),
            (f'{y_column.name}', f'@{{{y_column.name}}}'),
        ]
        
        self.xy_plot = figure(title = f'Relationship between {x_column.name} and {y_column.name}', x_axis_label = f'{x_column.name}', y_axis_label = f'{y_column.name}', tooltips = self.tooltips)
    
    def scatter(self, regression):
        """Draw the scatter plot; show it immediately unless a regression line will be added."""
        self.xy_plot.circle(f'{self.x_name}', f'{self.y_name}', size=5, source = self.source)
        if not regression:
            output_notebook()
            show(self.xy_plot)
        return self.xy_plot
        
    def regression(self, degree):
        coefficients = np.polyfit(self.x_values, self.y_values, degree)
        poly_function = np.poly1d(coefficients)
        polyline = np.linspace(min(self.x_values), max(self.x_values), 100)

        # Evaluate the fitted polynomial on the evenly spaced grid so the line is drawn smoothly
        self.xy_plot.line(polyline, poly_function(polyline), line_width=2, color='red', legend_label='Polynomial Line')

        output_notebook()
        show(self.xy_plot)

        print(poly_function)
        
        if degree == 1:
            print(f'Slope: {coefficients[0]}')
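
To make the `regression` method easier to follow, the sketch below isolates the `np.polyfit` / `np.poly1d` step without any plotting. The data points are made up and chosen to lie exactly on a known line, so the fitted coefficients are easy to check by eye.

```python
import numpy as np

# Hypothetical points lying exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

# polyfit returns coefficients from the highest power down: [slope, intercept] for degree 1.
coefficients = np.polyfit(x, y, 1)
poly_function = np.poly1d(coefficients)  # callable polynomial built from the coefficients

slope, intercept = coefficients
prediction = poly_function(4.0)  # evaluate the fitted line at a new x value
print(slope, intercept, prediction)
```

This is why `coefficients[0]` is printed as the slope when `degree == 1`.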

In [5]:
cpi_data = pd.read_csv('/Users/gakuueno/Documents/Python/Python Data Analysis Projects/CPI Data.csv')
cpi_data.rename(columns = {'CPILFESL': 'CPI', 'DATE': 'Date'}, inplace = True)
cpi_data.drop(index = 66, inplace = True)

cpi_data['CPI'] = cpi_data['CPI'].astype(float)/100
cpi = cpi_data['CPI']

cpi_data['Date'] = pd.to_datetime(cpi_data['Date'])
year = cpi_data['Date'].dt.year.astype(int)

cpi_data.insert(1, 'Year', year)
cpi_data.insert(2, 'Contract Start Year', year)

cpi_plot = Plot(cpi_data['Year'], cpi_data['CPI'], cpi_data)
cpi_plot.scatter(True)
cpi_plot.regression(1)

'''cpi_data[cpi_data['Year'] == 1991] #remember that the year column is an integer

#Guesstimating a cpi for 2023 to make calculations consistent
cpi_data['Date'] = cpi_data['Date'].astype(str)
cpi_2023 = 3.00
new_row = {'Date':'2023-01-01', 'Year': 2023, 'Contract Start Year': 2023, 'CPI':cpi_2023}
cpi_data = pd.concat([cpi_data, pd.DataFrame([new_row])], ignore_index=True) #DataFrame.append was removed in pandas 2.0
cpi_data['Date'] = pd.to_datetime(cpi_data['Date'])'''

cpi_data

 
0.04262 x - 83.44
Slope: 0.04261597736840266


Unnamed: 0,Date,Year,Contract Start Year,CPI
0,1957-01-01,1957,1957,0.289333
1,1958-01-01,1958,1958,0.295917
2,1959-01-01,1959,1959,0.301750
3,1960-01-01,1960,1960,0.306417
4,1961-01-01,1961,1961,0.310000
...,...,...,...,...
61,2018-01-01,2018,2018,2.575614
62,2019-01-01,2019,2019,2.632075
63,2020-01-01,2020,2020,2.677049
64,2021-01-01,2021,2021,2.772530
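
Dividing the index by 100 expresses real dollars in the index's own base period. If it is more convenient to read results in the dollars of a recent season, the series can be rebased. The sketch below uses three CPI values from the table above; the choice of 2021 as the base year and the `CPI_rebased` column name are assumptions for illustration only.

```python
import pandas as pd

# Three rows copied from the CPI table above.
cpi = pd.DataFrame({
    'Year': [1957, 2018, 2021],
    'CPI':  [0.289333, 2.575614, 2.772530],
})

base_year = 2021  # assumption: express everything in 2021 dollars
base_cpi = cpi.loc[cpi['Year'] == base_year, 'CPI'].iloc[0]
cpi['CPI_rebased'] = cpi['CPI'] / base_cpi  # equals 1.0 in the base year

# A nominal amount from 1957 restated in base-year dollars.
nominal_1957 = 100_000
in_base_year_dollars = nominal_1957 / cpi.loc[cpi['Year'] == 1957, 'CPI_rebased'].iloc[0]
print(round(in_base_year_dollars))
```

Rebasing only rescales every real value by the same constant, so trends and regression signs are unaffected.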


In [6]:
#load workbook
fa = load_workbook('/Users/gakuueno/Documents/Python/Python Data Analysis Projects/MLB Contract Value Project/MLB-Free Agency 1991-2023.xlsx')

#the 2019-2023 sheets have an extra "remaining FA" row above the header
#(np.arange excludes the stop value, so the range has to run up to 2024)
sheets_with_extra_row = [str(y) for y in np.arange(2019, 2024, 1)]

# Iterate through each sheet in the workbook
for sheet_name in fa.sheetnames:
    sheet = fa[sheet_name]
    # Delete the rows above the header: 12 on the 2019-2023 sheets, 11 on the others
    rows_to_delete = 12 if sheet_name in sheets_with_extra_row else 11
    sheet.delete_rows(1, rows_to_delete)
    sheet['C1'] = 'Age'

# Save the modified workbook        
fa.save('/Users/gakuueno/Documents/Python/Python Data Analysis Projects/MLB Contract Value Project/Edited MLB-Free Agency 1991-2023.xlsx')

fa = load_workbook('/Users/gakuueno/Documents/Python/Python Data Analysis Projects/MLB Contract Value Project/Edited MLB-Free Agency 1991-2023.xlsx')
raw_fa_data = pd.DataFrame()

for sheet in fa.sheetnames:
    yearly_fa = pd.read_excel('/Users/gakuueno/Documents/Python/Python Data Analysis Projects/MLB Contract Value Project/Edited MLB-Free Agency 1991-2023.xlsx', sheet_name = sheet, header = 0)
    yearly_fa = yearly_fa[['Player', "Pos'n", 'Age', 'New Club', 'Years', 'Guarantee', 'Term', 'AAV']]
    raw_fa_data = pd.concat([raw_fa_data,yearly_fa], axis = 0)

raw_fa_data

Unnamed: 0,Player,Pos'n,Age,New Club,Years,Guarantee,Term,AAV
0,"Judge, Aaron",rf-dh,31.066,NYA,9.0,400000000.0,2023-31,40000000.0
1,"Turner, Trea",ss,30.001,PHI,11.0,300000000.0,2023-33,27272727.0
2,"Bogaerts, Xander",ss,30.273,SDN,11.0,280000000.0,2023-33,25454545.0
3,"Correa, Carlos",ss,28.282,MIN,6.0,200000000.0,2023-28,33333333.0
4,"deGrom, Jacob",rhp-s,35.012,TEX,5.0,185000000.0,2023-27,37000000.0
...,...,...,...,...,...,...,...,...
96,"Sheets, Larry",of,31.000,dnp,,,1991,
97,"Thurmond, Mark",lhp,34.000,dnp,,,1991,
98,"Tudor, John",lhp-s,37.000,dnp,,,1991,
99,"Ward, Gary",of-1b,37.000,dnp,,,1991,
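
The loop above grows `raw_fa_data` by concatenating inside the loop. A common alternative is to collect the per-sheet frames in a list and concatenate once at the end. The sketch below uses a small in-memory dictionary as a stand-in for what `pd.read_excel(path, sheet_name=None)` returns (a mapping of sheet name to DataFrame); the added `sheet` column is an assumption, included only to show how the source year could be kept.

```python
import pandas as pd

# Stand-in for pd.read_excel(path, sheet_name=None): {sheet name: DataFrame}.
sheets = {
    '2023': pd.DataFrame({'Player': ['Judge, Aaron'], 'Years': [9.0]}),
    '1991': pd.DataFrame({'Player': ['Puhl, Terry'], 'Years': [1.0]}),
}

frames = []
for sheet_name, frame in sheets.items():
    # Remember which sheet each row came from before combining.
    frames.append(frame.assign(sheet=sheet_name))

# One concat at the end avoids copying the growing frame on every iteration.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```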


In [144]:
fa_df = raw_fa_data.copy()
fa_df.rename(columns = {'Player': 'player', "Pos'n":'position', 'Age': 'age_at_contract_sign', 'New Club':'team', 'Years':'length', 'Guarantee':'total_contract_value', 'Term':'contract_term', 'AAV':'aav'}, inplace = True)
# Keep only rows with a signed contract (AAV, term and length all present)
fa_df.dropna(subset = ['aav', 'contract_term', 'length'], inplace = True)
team_mapping = {
    'NYA':'NYY',
    'LAN':'LAD',
    'SDN':'SDP',
    'NYN':'NYM',
    'CHN':'CHC',
    'SFN':'SFG',
    'SLN':'STL',
    'CHA':'CHW',
    'TBA':'TBR',
    'CAL':'LAA',
    'ANA':'LAA',
    'KCA':'KCR',
}

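# Translate the source's club codes (e.g. NYA) to the more common abbreviations (e.g. NYY);
# only the team column needs mapping, and DataFrame.applymap is deprecated since pandas 2.1
fa_df['team'] = fa_df['team'].replace(team_mapping)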

fa_df['player'] = fa_df['player'].str.replace(' ','').str.split(',').str[::-1].apply(lambda x: ' '.join(x))
fa_df['contract_start_year'] = fa_df['contract_term'].astype(str).str.split('-').str[0].astype(int)
fa_df['contract_end_year'] = (fa_df['contract_start_year'] + fa_df['length'] - 1).astype(int)
fa_df['age_at_contract_sign'] = fa_df['age_at_contract_sign'].astype(int)
fa_df = fa_df[fa_df['total_contract_value'] > 100]

fa_df['length'] = fa_df['length'].replace(0, 1)
fa_df = fa_df.reset_index(drop = True)
fa_df

Unnamed: 0,player,position,age_at_contract_sign,team,length,total_contract_value,contract_term,aav,contract_start_year,contract_end_year
0,Aaron Judge,rf-dh,31,NYY,9.0,400000000.0,2023-31,40000000.0,2023,2031
1,Trea Turner,ss,30,PHI,11.0,300000000.0,2023-33,27272727.0,2023,2033
2,Xander Bogaerts,ss,30,SDP,11.0,280000000.0,2023-33,25454545.0,2023,2033
3,Carlos Correa,ss,28,MIN,6.0,200000000.0,2023-28,33333333.0,2023,2028
4,Jacob deGrom,rhp-s,35,TEX,5.0,185000000.0,2023-27,37000000.0,2023,2027
...,...,...,...,...,...,...,...,...,...,...
3016,Max Venable,of,34,LAA,1.0,425000.0,1991,425000.0,1991,1991
3017,Bill Krueger,lhp,33,SEA,1.0,400000.0,1991,400000.0,1991,1991
3018,John Moses,of,33,BOS,1.0,350000.0,1991,350000.0,1991,1991
3019,Terry Puhl,of,34,NYM,1.0,350000.0,1991,350000.0,1991,1991
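
The name and term parsing in the cell above can be tried on a few rows in isolation. This is a minimal sketch using three contracts from the table; it splits on the comma instead of removing every space first, which is a small variation on the cell's approach and keeps any internal spaces in a surname.

```python
import pandas as pd

contracts = pd.DataFrame({
    'player': ['Judge, Aaron', 'deGrom, Jacob', 'Puhl, Terry'],
    'contract_term': ['2023-31', '2023-27', 1991],  # one-year deals are stored as a bare year
    'length': [9.0, 5.0, 1.0],
})

# "Last, First" -> "First Last": split once on the comma and swap the two parts.
parts = contracts['player'].str.split(',', n=1, expand=True)
contracts['player'] = parts[1].str.strip() + ' ' + parts[0].str.strip()

# The start year is whatever precedes the dash (or the whole value if there is no dash).
contracts['contract_start_year'] = (
    contracts['contract_term'].astype(str).str.split('-').str[0].astype(int)
)
# A contract of N seasons starting in year Y ends in year Y + N - 1.
contracts['contract_end_year'] = (
    contracts['contract_start_year'] + contracts['length'] - 1
).astype(int)
print(contracts)
```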


In [145]:
indices_over_multiple_years = fa_df.index.repeat(fa_df['length'])
yearly_df = fa_df.iloc[indices_over_multiple_years]
indices_over_multiple_years

Int64Index([   0,    0,    0,    0,    0,    0,    0,    0,    0,    1,
            ...
            3011, 3012, 3013, 3014, 3015, 3016, 3017, 3018, 3019, 3020],
           dtype='int64', length=5369)
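
Repeating the index as above yields one row per contract season. A possible next step is to number the seasons within each contract so that a calendar year and an age can be attached to each row. The sketch below uses two contracts from the table; the `contract_id`, `year` and `age` column names are assumptions introduced for illustration.

```python
import pandas as pd

contracts = pd.DataFrame({
    'player': ['Jacob deGrom', 'Terry Puhl'],
    'age_at_contract_sign': [35, 34],
    'length': [5, 1],
    'contract_start_year': [2023, 1991],
})

# One row per season; keep the original row label as an explicit contract id.
repeated = contracts.loc[contracts.index.repeat(contracts['length'])]
repeated = repeated.rename_axis('contract_id').reset_index()

# 0 for the first season of each contract, 1 for the second, and so on.
season_offset = repeated.groupby('contract_id').cumcount()

yearly = repeated.assign(
    year=repeated['contract_start_year'] + season_offset,
    age=repeated['age_at_contract_sign'] + season_offset,
)
print(yearly[['player', 'year', 'age']])
```

Grouping on a contract id rather than on the player's name keeps two consecutive contracts signed by the same player from being numbered as one.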

In [None]:
yearly_df = master_fa_df.reindex(master_fa_df.index.repeat(master_fa_df['Years']))
yearly_df.reset_index(drop = True, inplace = True)

# Seasons elapsed within each contract: 0 for the first season, 1 for the second, ...
# (consecutive rows with the same Name are treated as one contract)
season_offset = yearly_df.groupby((yearly_df['Name'] != yearly_df['Name'].shift()).cumsum()).cumcount()
yearly_df['Age'] += season_offset
yearly_df['Year'] = yearly_df['Contract Start Year'] + season_offset

yearly_df['Name'] = pd.Categorical(yearly_df['Name'], categories=yearly_df['Name'].unique(), ordered=True)
yearly_df.sort_values(by=['Name', 'Year', 'Age'], ascending=[True, False, False], inplace=True)

yearly_df.drop(['CPI', 'Real AAV'], axis = 1, inplace = True)

yearly_df.rename(columns = {'Real Guarantee':'Real Guarantee at Contract Start'}, inplace = True)

cpi_data.drop(['Contract Start Year'], axis = 1, inplace = True)
yearly_df = pd.merge(yearly_df, cpi_data, how = 'left', on = 'Year')
yearly_df.drop('Date', axis = 1, inplace = True)

#Adjusting for inflation
real_aav = yearly_df['AAV']/yearly_df['CPI']
yearly_df.insert(12, 'Real AAV', real_aav)

yearly_df.head(20)

In [None]:
#code to adjust for inflation
'''#Adjusting for inflation
fa_df = pd.merge(fa_df, cpi_data, how = 'inner', on = 'Contract Start Year')
fa_df.drop(['Date', 'Year'], axis = 1, inplace = True)
real_guarantee = fa_df['Guarantee']/fa_df['CPI']
fa_df.insert(6, 'Real Guarantee', real_guarantee)
real_aav = fa_df['AAV']/fa_df['CPI']
fa_df.insert(9, 'Real AAV', real_aav)

fa_df.reset_index(drop=True, inplace=True) #drop = True to remove the original indexes

print(fa_df.dtypes)'''
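
The deflation step in the disabled code above uses an inner merge, which silently drops contracts whose start year has no CPI value (the CPI table shown earlier ends at 2021, while contracts run through 2023). The sketch below shows a left merge that keeps those rows visible. The players and AAVs are made up; the two CPI values are copied from the table.

```python
import pandas as pd

cpi = pd.DataFrame({'Year': [2018, 2021], 'CPI': [2.575614, 2.772530]})

# Hypothetical contracts, including one that starts after the CPI series ends.
contracts = pd.DataFrame({
    'player': ['Player A', 'Player B', 'Player C'],
    'contract_start_year': [2018, 2021, 2023],
    'aav': [10_000_000.0, 10_000_000.0, 10_000_000.0],
})

# validate='many_to_one' raises if a year appears twice in the CPI table.
merged = contracts.merge(
    cpi, how='left', left_on='contract_start_year', right_on='Year',
    validate='many_to_one',
)
merged['real_aav'] = merged['aav'] / merged['CPI']  # NaN where no CPI is available

# Contracts that still need a CPI value (for example an estimate for 2023).
missing_cpi = merged[merged['CPI'].isna()]
print(missing_cpi[['player', 'contract_start_year']])
```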

In [None]:
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter'],
    'Contract': ['Employee', 'Consultant', 'Employee'],
    'Years': [1, 3, 2]
})

# "Explode" the DataFrame to duplicate rows based on the 'Years' column
df = df.reindex(df.index.repeat(df['Years']))

# Reset the index
df.reset_index(drop=True, inplace=True)

df


In [None]:
#Plotting Inflation Adjusted Contracts and Time
guarantee_time = Plot(master_fa_df['Contract Start Year'], master_fa_df['Real Guarantee'], master_fa_df)
guarantee_time.scatter(True)
guarantee_time.regression(1)

#Plotting AAV over time
aav_time = Plot(master_fa_df['Contract Start Year'], master_fa_df['Real AAV'], master_fa_df)
aav_time.scatter(True)
aav_time.regression(1)

Overall, free agent contracts have been growing by about $191,112 per year, even after adjusting for inflation. Of course, most players never accrue enough service time to reach free agency, so it is difficult to say whether the average MLB player is better off today than 30 years ago.

In [None]:
bat_stat = batting_stats(2022, qual = 1)
bat_stat = bat_stat[['Name', 'Age','WAR']]
bat_stat
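
The research plan calls for valuing a player on their previous three years of performance. Assuming WAR has been collected into a table with one row per player-season, a trailing three-year total could be sketched as follows. The names, seasons and WAR values are made up, and the `war_prev_3` column is a placeholder.

```python
import pandas as pd

# Hypothetical player-seasons (made-up WAR values).
war = pd.DataFrame({
    'Name':   ['Player A'] * 4 + ['Player B'] * 2,
    'Season': [2019, 2020, 2021, 2022, 2021, 2022],
    'WAR':    [2.0, 3.0, 4.0, 5.0, 1.0, 1.5],
}).sort_values(['Name', 'Season'])

# Sum of the three seasons before the current one, per player.
# shift(1) excludes the current season; rolling(3) needs three prior rows, else NaN.
# This counts rows, so it assumes each player has no missing seasons.
war['war_prev_3'] = (
    war.groupby('Name')['WAR']
       .transform(lambda s: s.shift(1).rolling(3).sum())
)
print(war)
```

Excluding the current season matters here because a contract signed before a season can only be priced on what was known at signing.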