<center> <h2> DS 3000 - Fall 2021</h2> </center>
<center> <h3> DS Report </h3> </center>


<center> <h3>Can We Predict the Winner of the Cy Young Award?</h3> </center>
<center><h4>Daniel Han, Karenna Ng, Hannah Szeto</h4></center>


<hr style="height:2px; border:none; color:black; background-color:black;">

#### Executive Summary:

&emsp;&emsp;Every season, the Cy Young Award is given to the best pitcher in each of Major League Baseball's two leagues. Baseball has perhaps the most comprehensive statistical record in all of professional sports, so analyzing that data should tell us who deserves to win the Cy Young. In this project, we compiled pitching data from 2016-2019 and trained machine learning models to predict who deserves the award based on how previous years' winners won.

&emsp;&emsp;After cleaning and merging data from baseball-reference.com, it was time to train our model. Our imperfect and imbalanced dataset led us to choose cross validation and resampling to train our model. However, despite our best efforts to fit our data properly and tune hyperparameters, the problems with our dataset proved to be too much of an obstacle for the machine learning algorithms. Ultimately, we were unable to make a confident prediction for this year's Cy Young Award winners, but further exploration of our work could possibly yield more pertinent results.



<hr style="height:2px; border:none; color:black; background-color:black;">

## Outline
1. <a href='#1'>INTRODUCTION</a>
2. <a href='#2'>METHOD</a>
3. <a href='#3'>RESULTS</a>
4. <a href='#4'>DISCUSSION</a>

<a id="1"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 1. INTRODUCTION

&emsp;&emsp;Using advanced statistics and data made available by Major League Baseball, we would like to predict this year’s (2021) Cy Young winners. Baseball statistics are detailed and granular, yet opinions and award decisions still vary widely. In other words, shouldn’t the vast statistical record provide objective evidence of who the best pitchers in Major League Baseball are? Through an exploration of data from the last five years of baseball, a clear picture of who deserves the 2021 Cy Young Awards should emerge.


<h4>Problem Statement</h4>

&emsp;&emsp;Every year in Major League Baseball, the best pitcher from each league, American and National, is presented with the Cy Young Award. The criteria for this award are subjective: members of the Baseball Writers' Association of America vote on the best pitcher in each league. However, the pitchers with the best statistics do not win the award every year. So, there should be a way to predict and choose the best pitchers in baseball according to the criteria used in previous years.

&emsp;&emsp;This project attempts to bridge the gap between statistics and who deserves to win the Cy Young Award in 2021, based on previous years' results and data. With that in place, we should be able to make an accurate prediction for this year's award!



&emsp;&emsp;We would like to explore what data truly indicates great performance for a pitcher in Major League Baseball. This means exploring different types of data for pitchers including ERA, WHIP, Wins and Losses, etc. Doing so should deepen our understanding of the game of baseball and help us form informed opinions and decisions regarding the sport!

&emsp;&emsp;The Cy Young Award is given out every year to crown the best pitchers in Major League Baseball. It comes with fame, huge contracts, and of course, the honor of being named one of the best pitchers in baseball. With so much at stake, the award should be given fairly based on available statistics. Although this is not a problem that affects society in some deep manner, an analysis and machine learning driven approach can still help identify the right pitchers for the Cy Young Award.

&emsp;&emsp;Insights from this project could be useful for a variety of applications. First, simply predicting the rightful winner of the Cy Young Award is a fun thing to do! MLB and the voters for the Cy Young Award could also use insights gained from this project to inform their votes or change the decision process for the award itself. Finally, sports bettors could use the insights gained from this project to make more educated decisions when placing bets!

&emsp;&emsp;One such dive into the Cy Young Award is a look at the 2018 Cy Young race. Using XGBoost, a popular machine learning algorithm, Robert Pollack, a sports data scientist, cleaned and analyzed data from 17 seasons to predict the 2018 Cy Young winner. Pollack's model identified 21 of 22 winners and 661 of 662 non-winners, so its precision, recall, and F1 scores were all 0.95 out of 1.00. Notably, though, his model did not predict the 2018 winner itself correctly. Max Scherzer was projected to win according to Pollack's model, but Jacob deGrom won with an overwhelming 29 of 30 first-place votes in reality! So not only was his XGBoost model wrong for the 2018 prediction, it was extremely wrong!

Pollack, R. (2018, November 14). Predicting the 2018 Cy Young Race with Machine Learning. Retrieved from https://tht.fangraphs.com/cy-young-award-2018-blake-snell-jacob-degrom-max-scherzer/ 

<h4>Questions</h4>
  
&emsp;&emsp;While considering the process and purpose behind this project, some significant questions came up. Who are the best pitchers in baseball? What features are most significant when evaluating pitchers? Is this a classification problem? Which machine learning technique (LinearSVC, kNN classification, decision tree) is most accurate and most precise, and can tuning with GridSearchCV improve it?



<a id="2"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 2. METHOD

### 2.1. Data Acquisition
 
&emsp;&emsp; Our data was compiled from https://www.baseball-reference.com/, a comprehensive database for professional American baseball. This site is the main source of statistics for historical and current baseball data. For the purpose of this project, we needed significant amounts of data, so we chose to extract standard and starting pitching data for the years 2016-2021 (excluding 2020 due to the COVID-19 shortened season). We also compiled the Cy Young Award winner data from baseball-reference.com for the years 2016-2021.

&emsp;&emsp; Once the data was collected, it was time to clean and join the data to prepare it for learning and training. The first step was to join the standard and starting pitching data per year. Standard pitching presents every pitcher's pitching statistics, such as ERA (earned run average), WHIP (walks plus hits per inning pitched), and Wins and Losses. Starting pitching data covers only starting pitchers and adds more detailed statistics such as quality starts and adjusted wins. After joining standard and starting pitching data, we removed unnecessary features, guided by the research we had done to determine which features were important. Finally, we added the Cy Young winner data to the joined dataframe as the last step in preparing our data for learning.
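To make the two headline statistics concrete, here is a minimal sketch of how ERA and WHIP follow from the raw counts. It assumes the `IP` column uses the usual baseball notation in which `.1` and `.2` mean one and two outs (thirds of an inning), not decimal fractions. The helper names are hypothetical and are not part of the project code.

```python
def innings_to_float(ip):
    """Convert innings notation (69.2 = 69 and 2/3 innings) to true innings."""
    whole = int(ip)
    outs = round((ip - whole) * 10)  # .1 -> 1 out, .2 -> 2 outs
    return whole + outs / 3


def era(earned_runs, ip):
    """Earned run average: earned runs allowed per nine innings."""
    return 9 * earned_runs / innings_to_float(ip)


def whip(walks, hits, ip):
    """Walks plus hits per inning pitched."""
    return (walks + hits) / innings_to_float(ip)


# Counts taken from the Tim Adleman 2016 row: IP=69.2, ER=31, BB=20, H=64
print(round(era(31, 69.2), 2))
print(round(whip(20, 64, 69.2), 3))
```

Both values should land very close to the ERA and WHIP listed for that row, which is a quick sanity check that the columns mean what we think they mean.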

&emsp;&emsp; So, after cleaning and joining data from different sources on https://www.baseball-reference.com/, we were left with one merged_df with many features and variables. This merged dataframe lists every pitcher from the years 2016-2019, their relevant statistics, and whether or not they won the Cy Young Award in that year. There are 46 columns of statistical data per pitcher, and 156 pitchers' worth of data. More detail about cleaning and the merged_df is given in our Results section.



In [None]:
import pandas as pd

# import datasets
# repo: https://github.com/danielhan60903/DS3000FP 
  
url2016 = 'https://raw.githubusercontent.com/danielhan60903/DS3000FP/main/2016_starting_pitchers.csv'
df2016 = pd.read_csv(url2016)

url2017 = 'https://raw.githubusercontent.com/danielhan60903/DS3000FP/main/2017_starting_pitchers.csv'
df2017 = pd.read_csv(url2017)

url2018 = 'https://raw.githubusercontent.com/danielhan60903/DS3000FP/main/2018_starting_pitchers.csv'
df2018 = pd.read_csv(url2018)

url2019 = 'https://raw.githubusercontent.com/danielhan60903/DS3000FP/main/2019_starting_pitchers.csv'
df2019 = pd.read_csv(url2019)

url2021 = 'https://raw.githubusercontent.com/danielhan60903/DS3000FP/main/2021_starting_pitchers.csv'
df2021 = pd.read_csv(url2021)

stdURL2016 = 'https://raw.githubusercontent.com/danielhan60903/DS3000FP/main/std2016.csv'
std2016 = pd.read_csv(stdURL2016)

stdURL2017 = 'https://raw.githubusercontent.com/danielhan60903/DS3000FP/main/std2017.csv'
std2017 = pd.read_csv(stdURL2017)

stdURL2018 = 'https://raw.githubusercontent.com/danielhan60903/DS3000FP/main/std2018.csv'
std2018 = pd.read_csv(stdURL2018)

stdURL2019 = 'https://raw.githubusercontent.com/danielhan60903/DS3000FP/main/std2019.csv'
std2019 = pd.read_csv(stdURL2019)

stdURL2021 = 'https://raw.githubusercontent.com/danielhan60903/DS3000FP/main/std2021.csv'
std2021 = pd.read_csv(stdURL2021)

urlWinner = 'https://raw.githubusercontent.com/danielhan60903/DS3000FP/main/cy_young_award_winners.csv'
dfWinner = pd.read_csv(urlWinner, header=1, index_col=0)

### 2.2. Data Analysis


&emsp;&emsp;We are going to predict whether a player is a Cy Young winner from various baseball stats including wins, losses, earned run average, saves, hits, runs, earned runs, home runs, strikeouts, hit-by-pitch, balk, wild pitch, batters faced, adjusted earned run average, fielding independent pitching, walks and hits per innings pitched, hits per nine innings, and home runs per nine innings. These are important predictors because they measure how well a pitcher performed, which indicates how likely he is to win the Cy Young Award.

&emsp;&emsp;This is a supervised ML problem because we are feeding labelled data to the algorithm to train it, as opposed to unsupervised learning, where you feed unlabelled data to the algorithm and allow it to find patterns/classes on its own. Additionally, we are tackling a classification problem, because we’re classifying players as either a Cy Young winner or not a Cy Young winner. In regression, the algorithm predicts a continuous value, as opposed to our case, where we are sorting the data into two discrete class labels.

&emsp;&emsp;We are planning on trying Support Vector Machine (SVM), Decision Tree classifier, k-Nearest Neighbors (kNN), and Logistic Regression, since these algorithms all can be applied to a classification problem. An SVM algorithm finds a separating boundary between data points, which works for our data since we are dividing our data into two classes. A decision tree classifier uses a series of if/else statements to split up the data into various classes, which could be useful to determine which features designate a data point as one class or the other. kNN finds the data points most similar to a data point and assigns it a class based on the classification of its nearest neighbors. This could be useful to see how dissimilar the two classes are, but it might be difficult to use as the final algorithm since there are so few “winners” that it might be difficult for the algorithm to successfully classify a player as a winner. Logistic regression models the probability that a player belongs to the winner class as a function of a weighted sum of the features, which gives a simple linear baseline for comparison.

<a id="3"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 3. RESULTS

### 3.1. Data Wrangling


#### _Reformatting Names and Dropping Duplicates_
We noticed that it was very difficult to query the dataframes because there were hidden characters ('\xa0') and asterisks (*) at the ends of some names. We cleaned up these strings to make searching and merging easier.
The dataset also contains multiple entries per year for any player who was traded in the middle of the season. We removed all duplicated players, as we decided that being traded would remove a player from Cy Young Award eligibility for our algorithm.
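The effect of these two cleaning steps can be sketched on a toy table. The rows below are illustrative only ("Traded Pitcher" is a made-up name standing in for someone who appears once per team), and the string handling uses pandas `.str` methods as one possible formulation.

```python
import pandas as pd

# toy rows: one name with an asterisk, one pitcher listed twice (traded mid-season)
toy = pd.DataFrame({
    "Name": ["Andrew\xa0Albers*", "Traded\xa0Pitcher", "Traded\xa0Pitcher", "Tim\xa0Adleman"],
    "Tm": ["MIN", "NYY", "BOS", "CIN"],
})

# strip the asterisk marker and replace non-breaking spaces with regular spaces
toy["Name"] = (
    toy["Name"]
    .str.replace("*", "", regex=False)
    .str.replace("\xa0", " ", regex=False)
)

# keep=False drops every row whose name occurs more than once,
# rather than keeping the first or last occurrence
cleaned = toy.drop_duplicates(subset="Name", keep=False)
print(cleaned["Name"].tolist())
```

Note that `keep=False` removes the traded pitcher entirely, which matches the eligibility decision described above.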

In [None]:
# formats the given string, used for Name column
def format_name(name):
    name = name.replace('*', '')
    name = name.replace('\xa0', ' ')
    return name

In [None]:
format_name('Andrew\xa0Albers*')

'Andrew Albers'

In [None]:
# apply formatting (names, dropping duplicates) inplace to both dfs for each year
def format_df_names(df):
    df["Name"] = df["Name"].map(format_name)
    df.drop_duplicates(subset = 'Name', keep = False, inplace = True)
    return df

In [None]:
df2016.head()

Unnamed: 0,Rk,Name,Age,Tm,IP,G,GS,Wgs,Lgs,ND,Wchp,Ltuf,Wtm,Ltm,tmW-L%,Wlst,Lsv,CG,SHO,QS,QS%,GmScA,Best,Wrst,BQR,BQS,sDR,lDR,RS/GS,RS/IP,IP/GS,Pit/GS,<80,80-99,100-119,≥120,Max
0,1,Tim Adleman,28,CIN,69.2,13,13,4,4,5,3,1,6.0,7.0,0.462,2,2,0,0,5,38%,51.4,65,40,3,1,0,5,4.8,5.2,5.4,85,4,8,1,0,101
1,2,Andrew Albers*,30,MIN,17.0,6,2,0,0,2,0,0,0.0,2.0,0.0,0,0,0,0,0,0%,34.0,41,27,5,1,1,0,6.4,14.1,3.3,73,1,1,0,0,88
2,3,Matt Albers,33,CHW,51.1,58,1,0,0,1,0,0,1.0,0.0,1.0,0,1,0,0,0,0%,53.0,53,53,24,11,1,0,4.2,0.0,2.0,25,1,0,0,0,25
3,4,Raul Alcantara,23,OAK,22.1,5,5,1,3,1,1,0,1.0,4.0,0.2,0,0,0,0,0,0%,40.2,54,22,4,0,0,5,4.2,4.3,4.5,76,3,1,1,0,104
4,5,Brett Anderson*,28,LAD,11.1,4,3,0,2,1,0,0,1.0,2.0,0.333,0,1,0,0,0,0%,26.0,40,16,3,1,0,3,3.8,0.9,3.0,53,3,0,0,0,72


In [None]:
format_df_names(df2016).head()

Unnamed: 0,Rk,Name,Age,Tm,IP,G,GS,Wgs,Lgs,ND,Wchp,Ltuf,Wtm,Ltm,tmW-L%,Wlst,Lsv,CG,SHO,QS,QS%,GmScA,Best,Wrst,BQR,BQS,sDR,lDR,RS/GS,RS/IP,IP/GS,Pit/GS,<80,80-99,100-119,≥120,Max
0,1,Tim Adleman,28,CIN,69.2,13,13,4,4,5,3,1,6.0,7.0,0.462,2,2,0,0,5,38%,51.4,65,40,3,1,0,5,4.8,5.2,5.4,85,4,8,1,0,101
1,2,Andrew Albers,30,MIN,17.0,6,2,0,0,2,0,0,0.0,2.0,0.0,0,0,0,0,0,0%,34.0,41,27,5,1,1,0,6.4,14.1,3.3,73,1,1,0,0,88
2,3,Matt Albers,33,CHW,51.1,58,1,0,0,1,0,0,1.0,0.0,1.0,0,1,0,0,0,0%,53.0,53,53,24,11,1,0,4.2,0.0,2.0,25,1,0,0,0,25
3,4,Raul Alcantara,23,OAK,22.1,5,5,1,3,1,1,0,1.0,4.0,0.2,0,0,0,0,0,0%,40.2,54,22,4,0,0,5,4.2,4.3,4.5,76,3,1,1,0,104
4,5,Brett Anderson,28,LAD,11.1,4,3,0,2,1,0,0,1.0,2.0,0.333,0,1,0,0,0,0%,26.0,40,16,3,1,0,3,3.8,0.9,3.0,53,3,0,0,0,72


#### _Adding the Target Column_
Our target variable is whether or not a player was a Cy Young winner. This data was found in a separate table, and the information is merged into each year's stats as a binary indicator column where 1 indicates a winner and 0 indicates a non-winner. 
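As an illustration of the labeling idea, the indicator column can also be built with a vectorized membership test. This is a sketch on made-up data: `add_winner_flag` and the placeholder pitcher names are hypothetical, and the winners table is assumed to have `Year` and `Name` columns.

```python
import pandas as pd

# placeholder stats table and winners table (names are invented)
stats = pd.DataFrame({"Name": ["Pitcher A", "Pitcher B", "Pitcher C"]})
winners = pd.DataFrame({
    "Year": [2016, 2016, 2017],
    "Name": ["Pitcher B", "Pitcher Z", "Pitcher A"],
})


def add_winner_flag(stats_df, winners_df, year):
    """Return a copy of stats_df with a 0/1 CY_winner column for the given year."""
    names = winners_df.loc[winners_df["Year"] == year, "Name"]
    out = stats_df.copy()  # leave the caller's dataframe untouched
    out["CY_winner"] = out["Name"].isin(names).astype(int)
    return out


flagged = add_winner_flag(stats, winners, 2016)
print(flagged)
```

Returning a copy is a design choice for the sketch; it avoids modifying the input frame as a side effect.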

In [None]:
# adds a column indicating whether a player is a Cy Young Award winner
def cy_winner_column(combined_df, year):
    
    # starts off the CY_winner column with all 0s (non-winner)
    combined_df['CY_winner'] = 0
    
    # creates a list of the winners names
    winner_list = [name for name in dfWinner[dfWinner['Year'] == year]['Name']]
    
    # finds those players in the combined_df and changes their 'CY_winner' value to 1
    for name in winner_list:
        combined_df.loc[combined_df['Name'] == name, 'CY_winner'] = 1
    
    return combined_df

In [None]:
# notice the added CY_winner column
cy_winner_column(df2016, 2016).describe()

Unnamed: 0,Rk,Age,IP,G,GS,Wgs,Lgs,ND,Wchp,Ltuf,Wtm,Ltm,tmW-L%,Wlst,Lsv,CG,SHO,QS,GmScA,Best,Wrst,BQR,BQS,sDR,lDR,RS/GS,RS/IP,IP/GS,Pit/GS,<80,80-99,100-119,≥120,Max,CY_winner
count,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,274.0,274.0,273.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,274.0,274.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0
mean,183.058182,27.050909,97.650182,21.152727,15.781818,5.352727,5.52,4.909091,1.28,1.294545,7.945255,7.890511,0.447985,1.421818,1.981818,0.276364,0.123636,7.469091,47.46,68.698182,23.025455,10.370909,3.28,0.221818,7.974545,4.391971,4.050365,5.178909,86.814545,2.312727,7.730909,5.661818,0.076364,102.487273,0.007273
std,112.110397,3.726649,66.429129,12.381937,11.409899,5.339833,4.053763,3.653179,1.484027,1.539198,6.859174,5.571617,0.232993,1.601857,1.805213,0.771283,0.381107,7.316736,9.314501,16.442329,13.291623,6.649232,2.918254,0.465876,5.645495,1.309982,1.827592,0.984103,12.443088,1.681919,6.071979,6.612736,0.304446,16.845228,0.085125
min,1.0,19.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,-12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.7,25.0,0.0,0.0,0.0,0.0,25.0,0.0
25%,86.5,24.0,36.1,10.0,5.0,1.0,2.0,2.0,0.0,0.0,2.0,3.0,0.333,0.0,0.0,0.0,0.0,1.0,42.55,61.0,14.0,5.0,1.0,0.0,3.0,3.8,3.3,4.7,82.5,1.0,2.0,0.0,0.0,97.0,0.0
50%,180.0,26.0,88.2,22.0,14.0,4.0,5.0,5.0,1.0,1.0,6.0,7.0,0.469,1.0,2.0,0.0,0.0,5.0,48.8,71.0,21.0,11.0,3.0,0.0,7.0,4.4,4.1,5.4,90.0,2.0,7.0,3.0,0.0,107.0,0.0
75%,288.5,29.0,158.1,31.0,26.0,9.0,9.0,8.0,2.0,2.0,13.0,12.0,0.586,2.0,3.0,0.0,0.0,14.0,53.55,80.0,29.0,15.0,5.0,0.0,13.0,5.0,4.8,5.85,94.0,3.0,12.0,9.0,0.0,114.0,0.0
max,368.0,43.0,230.0,64.0,35.0,22.0,19.0,17.0,8.0,7.0,25.0,23.0,1.0,8.0,8.0,6.0,3.0,27.0,70.3,98.0,67.0,30.0,15.0,3.0,21.0,11.0,14.1,7.1,108.0,8.0,24.0,29.0,3.0,124.0,1.0


#### _Aggregating Stats from Multiple Tables_
The single dataset for each year did not have some key features we wanted to use, so we merged each year's starting pitchers dataframe (df*year*) with the standard pitching data from the same year. The data was merged along the Name and Tm columns, increasing our number of features from 36 to 64.

The merging function also addresses some other formatting issues, such as repeat columns and % signs that interfered with the machine learning tasks.
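The merge and the percent conversion can be illustrated on two tiny frames. The pitcher names and values below are placeholders, and only one overlapping column (`GS`) is dropped here to keep the sketch short.

```python
import pandas as pd

# placeholder starter and standard tables sharing Name, Tm and a repeated GS column
starters = pd.DataFrame({
    "Name": ["Pitcher A", "Pitcher B"],
    "Tm": ["CIN", "MIN"],
    "GS": [13, 2],
    "QS%": ["38%", "0%"],
})
standard = pd.DataFrame({
    "Name": ["Pitcher A", "Pitcher B"],
    "Tm": ["CIN", "MIN"],
    "GS": [13, 2],
    "ERA": [4.00, 5.82],
})

# drop the repeated column first so the merge does not create GS_x / GS_y
standard_filtered = standard.drop(columns=["GS"])

# a left join keeps every starter row, matched on both name and team
merged = pd.merge(starters, standard_filtered, how="left", on=["Name", "Tm"])

# "38%" -> 0.38 so the column is numeric
merged["QS%"] = merged["QS%"].str.rstrip("%").astype(float) / 100.0
print(merged)
```

Matching on both `Name` and `Tm` is what keeps two different players with the same name on different teams from being joined together.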

In [None]:
# cleans and merges dataframes for a given year, adds Cy Young information
def aggregate_stats_per_year(starters, standard, year):
    
    # formats names of players in both dfs
    format_df_names(starters)
    format_df_names(standard)
    
    # removes repeat columns from standard so there is no redundancy in merged df
    repeat_columns = ['Rk', 'Age', 'IP', 'G', 'GS', 'CG','SHO']
    standard_filtered = standard.drop(repeat_columns, axis = 1)

    # merges df
    merged_df = pd.merge(starters, standard_filtered, how="left", on=['Name','Tm'])
    
    # adds a year column
    merged_df["Year"] = year
    
    # change string percent to float
    merged_df['QS%'] = merged_df['QS%'].str.rstrip('%').astype('float') / 100.0

    # adds the Cy Young Winner information
    combined_df = cy_winner_column(merged_df, year)
    
    return combined_df

In [None]:
combined_2016 = aggregate_stats_per_year(df2016, std2016, 2016)
combined_2016.describe()

Unnamed: 0,Rk,Age,IP,G,GS,Wgs,Lgs,ND,Wchp,Ltuf,Wtm,Ltm,tmW-L%,Wlst,Lsv,CG,SHO,QS,QS%,GmScA,Best,Wrst,BQR,BQS,sDR,lDR,RS/GS,RS/IP,IP/GS,Pit/GS,<80,80-99,100-119,≥120,Max,CY_winner,W,L,W-L%,ERA,GF,SV,H,R,ER,HR,BB,IBB,SO,HBP,BK,WP,BF,ERA+,FIP,WHIP,H9,HR9,BB9,SO9,SO/W,Year
count,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,274.0,274.0,273.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,274.0,274.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,271.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,275.0,274.0,275.0,275.0,275.0,275.0,275.0,275.0,271.0,275.0
mean,183.058182,27.050909,97.650182,21.152727,15.781818,5.352727,5.52,4.909091,1.28,1.294545,7.945255,7.890511,0.447985,1.421818,1.981818,0.276364,0.123636,7.469091,0.350109,47.46,68.698182,23.025455,10.370909,3.28,0.221818,7.974545,4.391971,4.050365,5.178909,86.814545,2.312727,7.730909,5.661818,0.076364,102.487273,0.007273,5.774545,5.865455,0.430657,4.995818,1.567273,0.112727,97.727273,50.109091,46.472727,13.130909,31.829091,1.36,84.68,3.570909,0.338182,3.378182,416.614545,100.551095,4.694473,1.436658,9.703273,1.385091,3.235273,7.425455,2.651587,2016.0
std,112.110397,3.726649,66.429129,12.381937,11.409899,5.339833,4.053763,3.653179,1.484027,1.539198,6.859174,5.571617,0.232993,1.601857,1.805213,0.771283,0.381107,7.316736,0.236057,9.314501,16.442329,13.291623,6.649232,2.918254,0.465876,5.645495,1.309982,1.827592,0.984103,12.443088,1.681919,6.071979,6.612736,0.304446,16.845228,0.085125,5.156677,3.939331,0.237108,2.49779,3.093144,0.524945,62.197782,30.585332,28.475021,8.774816,20.886639,1.520373,64.042978,3.12359,0.66604,3.273801,273.695747,46.219645,1.486279,0.357688,2.673339,0.828879,1.370343,1.839948,1.380778,0.0
min,1.0,19.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,-12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.7,25.0,0.0,0.0,0.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,13.0,16.0,1.8,0.545,3.0,0.0,0.0,2.6,0.33,2016.0
25%,86.5,24.0,36.1,10.0,5.0,1.0,2.0,2.0,0.0,0.0,2.0,3.0,0.333,0.0,0.0,0.0,0.0,1.0,0.17,42.55,61.0,14.0,5.0,1.0,0.0,3.0,3.8,3.3,4.7,82.5,1.0,2.0,0.0,0.0,97.0,0.0,1.0,2.0,0.317,3.605,0.0,0.0,39.5,22.5,20.5,5.0,15.0,0.0,27.0,1.0,0.0,1.0,166.5,73.0,3.78,1.223,8.25,0.9,2.3,6.3,1.85,2016.0
50%,180.0,26.0,88.2,22.0,14.0,4.0,5.0,5.0,1.0,1.0,6.0,7.0,0.469,1.0,2.0,0.0,0.0,5.0,0.38,48.8,71.0,21.0,11.0,3.0,0.0,7.0,4.4,4.1,5.4,90.0,2.0,7.0,3.0,0.0,107.0,0.0,4.0,5.0,0.444,4.57,0.0,0.0,90.0,51.0,45.0,12.0,29.0,1.0,70.0,3.0,0.0,3.0,378.0,92.5,4.48,1.375,9.3,1.2,3.0,7.5,2.4,2016.0
75%,288.5,29.0,158.1,31.0,26.0,9.0,9.0,8.0,2.0,2.0,13.0,12.0,0.586,2.0,3.0,0.0,0.0,14.0,0.53,53.55,80.0,29.0,15.0,5.0,0.0,13.0,5.0,4.8,5.85,94.0,3.0,12.0,9.0,0.0,114.0,0.0,9.0,9.0,0.579,5.805,2.0,0.0,154.0,75.0,69.0,20.0,46.0,2.0,134.0,5.5,1.0,5.0,668.5,117.0,5.195,1.5675,10.55,1.7,3.9,8.6,3.28,2016.0
max,368.0,43.0,230.0,64.0,35.0,22.0,19.0,17.0,8.0,7.0,25.0,23.0,1.0,8.0,8.0,6.0,3.0,27.0,1.0,70.3,98.0,67.0,30.0,15.0,3.0,21.0,11.0,14.1,7.1,108.0,8.0,24.0,29.0,3.0,124.0,1.0,22.0,19.0,1.0,27.0,16.0,6.0,227.0,124.0,113.0,37.0,86.0,8.0,284.0,17.0,4.0,17.0,951.0,387.0,12.9,3.563,27.0,6.8,9.0,12.5,15.64,2016.0


#### _Combining Data Across the Years_
Each year's data is cleaned as outlined above. Then all of the processed years were merged into one dataframe to use for machine learning. We also determined a set of features to remove from the dataset that we felt were not useful metrics for overall pitcher performance.

In [None]:
# apply to the rest of the years
combined_2017 = aggregate_stats_per_year(df2017, std2017, 2017)

combined_2018 = aggregate_stats_per_year(df2018, std2018, 2018)

combined_2019 = aggregate_stats_per_year(df2019, std2019, 2019)

combined_2021 = aggregate_stats_per_year(df2021, std2021, 2021)

In [None]:
# list of training data
all_data = [combined_2016, combined_2017, combined_2018, combined_2019]

# merging all dataframes
all_data_df = pd.concat(all_data)

# getting rid of rows with NaN
df_merged = all_data_df.dropna()

# get rid of discussed columns to reduce features
df_merged = df_merged.drop(["G", "CG", "SHO", "Best", "Wrst", "BQR", "BQS", "sDR", "lDR", 
                            "RS/IP", "<80", "80-99", "100-119", "≥120", "Max"], axis = 1)

df_merged

Unnamed: 0,Rk,Name,Age,Tm,IP,GS,Wgs,Lgs,ND,Wchp,Ltuf,Wtm,Ltm,tmW-L%,Wlst,Lsv,QS,QS%,GmScA,RS/GS,IP/GS,Pit/GS,CY_winner,Lg,W,L,W-L%,ERA,GF,SV,H,R,ER,HR,BB,IBB,SO,HBP,BK,WP,BF,ERA+,FIP,WHIP,H9,HR9,BB9,SO9,SO/W,Year
0,1,Tim Adleman,28,CIN,69.2,13,4,4,5,3,1,6.0,7.0,0.462,2,2,5,0.38,51.4,4.8,5.4,85,0,NL,4.0,4.0,0.500,4.00,0.0,0.0,64.0,32.0,31.0,13.0,20.0,1.0,47.0,5.0,0.0,0.0,287.0,107.0,5.30,1.206,8.3,1.7,2.6,6.1,2.35,2016
2,3,Matt Albers,33,CHW,51.1,1,0,0,1,0,0,1.0,0.0,1.000,0,1,0,0.00,53.0,4.2,2.0,25,0,AL,2.0,6.0,0.250,6.31,11.0,0.0,67.0,44.0,36.0,10.0,19.0,1.0,30.0,3.0,0.0,4.0,237.0,64.0,5.80,1.675,11.7,1.8,3.3,5.3,1.58,2016
3,4,Raul Alcantara,23,OAK,22.1,5,1,3,1,1,0,1.0,4.0,0.200,0,0,0,0.00,40.2,4.2,4.5,76,0,AL,1.0,3.0,0.250,7.25,0.0,0.0,31.0,18.0,18.0,9.0,4.0,0.0,14.0,4.0,1.0,1.0,103.0,57.0,8.21,1.567,12.5,3.6,1.6,5.6,3.50,2016
4,5,Brett Anderson,28,LAD,11.1,3,0,2,1,0,0,1.0,2.0,0.333,0,1,0,0.00,26.0,3.8,3.0,53,0,NL,1.0,2.0,0.333,11.91,0.0,0.0,25.0,15.0,15.0,4.0,4.0,0.0,5.0,0.0,0.0,2.0,62.0,35.0,7.91,2.559,19.9,3.2,3.2,4.0,1.25,2016
5,6,Chase Anderson,28,MIL,151.2,30,9,11,10,7,1,12.0,18.0,0.400,5,3,6,0.20,48.8,4.7,5.0,87,0,NL,9.0,11.0,0.450,4.39,1.0,0.0,155.0,83.0,74.0,28.0,53.0,0.0,120.0,4.0,0.0,4.0,647.0,97.0,5.09,1.371,9.2,1.7,3.1,7.1,2.26,2016
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
322,426,Ryan Yarbrough,27,TBR,141.2,14,3,5,6,0,0,7.0,7.0,0.500,1,1,7,0.50,54.2,4.0,6.1,88,0,AL,11.0,6.0,0.647,4.13,1.0,0.0,121.0,69.0,65.0,15.0,20.0,2.0,117.0,9.0,1.0,0.0,563.0,106.0,3.55,0.995,7.7,1.0,1.3,7.4,5.85,2019
323,427,Gabriel Ynoa,26,BAL,110.2,13,0,9,4,0,2,1.0,12.0,0.077,0,2,3,0.23,42.9,2.8,4.9,80,0,AL,1.0,10.0,0.091,5.61,3.0,0.0,126.0,77.0,69.0,29.0,26.0,1.0,67.0,3.0,1.0,4.0,480.0,84.0,6.20,1.373,10.2,2.4,2.1,5.4,2.58,2019
324,428,Alex Young,25,ARI,83.1,15,7,5,3,3,1,7.0,8.0,0.467,1,1,5,0.33,51.8,4.3,5.1,84,0,NL,7.0,5.0,0.583,3.56,0.0,0.0,72.0,40.0,33.0,14.0,27.0,4.0,71.0,4.0,0.0,2.0,349.0,125.0,4.81,1.188,7.8,1.5,2.9,7.7,2.63,2019
325,429,T.J. Zeuch,23,TOR,22.2,3,0,2,1,0,0,1.0,2.0,0.333,0,0,0,0.00,45.7,2.8,4.4,81,0,AL,1.0,2.0,0.333,4.76,0.0,0.0,22.0,13.0,12.0,2.0,11.0,0.0,20.0,0.0,0.0,2.0,99.0,96.0,4.05,1.456,8.7,0.8,4.4,7.9,1.82,2019


#### _Scaling and Transformation, Splitting_
Our dataset is very imbalanced, as less than 1% of players are winners of the Cy Young Award. To help address this, we used cross validation and repetition of training to more accurately assess our machine learning models. Scaling and splitting are therefore performed later in this report, so that resampling does not leak test data to the model during training.

### 3.2. Data Exploration
These visualizations characterize our dataset. They show the size disparity of the target classes, and some possibly important metrics to quantify pitcher performance over a season.


<img src="https://i.ibb.co/nsVGn2W/winners-pie-plot.png" alt="winners-pie-plot" border="0">

This pie chart shows how few winners there are compared to the entire population of pitchers. It demonstrates how unbalanced the target variables (classes) are.

<img src="https://i.ibb.co/pXrfTCf/ERA-bar-chart.png" alt="ERA-bar-chart" border="0">

This bar chart shows the ERA (earned run average), which is the average number of earned runs a pitcher allows per 9 innings. It splits the data up by year, and by winners vs non-winners. This was done to show that there is a visible difference between the average ERA of the winners and non-winners for every year.

<img src="https://i.ibb.co/FhFtB7v/wins-violin-plot.png" alt="wins-violin-plot" border="0">

This violin plot shows the distribution of the number of wins for winners and non-winners. The Cy Young winners data is represented by a 1, and the non-winners data is represented by a 0. While the two distributions overlap considerably, this visualization shows that winners clearly tend to have more wins than non-winners.

### 3.3. Model Training


#### _Splitting The Dataset_
Due to the imbalanced nature of our dataset, we use cross validation to build our model. To avoid leaking testing data during training, we first use a percentage split to divide the dataset into a training set and a testing set, and then split each into features and targets.
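With so few winners, a plain random split can leave one partition with no winners at all. One option, shown here only as a sketch and not as what this report does, is to pass `stratify` to `train_test_split` so both partitions keep the same winner share. The toy table below is synthetic.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic stand-in for the merged table: 400 pitchers, 4 of them winners
toy = pd.DataFrame({
    "ERA": [3.0 + 0.01 * i for i in range(400)],
    "CY_winner": [1] * 4 + [0] * 396,
})

# stratify keeps the 1% winner share the same in the training and testing parts
train_part, test_part = train_test_split(
    toy, train_size=0.75, stratify=toy["CY_winner"], random_state=3000
)

print(train_part["CY_winner"].sum(), test_part["CY_winner"].sum())
```

The trade-off is that stratification needs at least a couple of samples per class in the data being split.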

In [None]:
# splitting data into features and target
def features_and_target(df):
    features = df.drop(["CY_winner", "Name", "Tm", "Year", "Lg"], axis = 1)
    target = df["CY_winner"]
    return(features, target)

In [None]:
from sklearn.model_selection import train_test_split

def split_the_dataset(features, target):    
    X_train, X_test, y_train, y_test = train_test_split(features, target, random_state = 3000)
    return(X_train, X_test, y_train, y_test)

In [None]:
# split into training and testing dfs
training_df, testing_df = train_test_split(df_merged, train_size = 0.75, random_state = 3000)

For model development, we will only use *training_df*. 

For model evaluation (3.5), we will use *testing_df*.

In [None]:
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# our classifier models
estimators = {"LinearSVC" : LinearSVC(max_iter=1000000), 
              "Decision Tree" : DecisionTreeClassifier(), 
              "kNN" : KNeighborsClassifier(),
              "Logistic Regression" : LogisticRegression()}

In [None]:
# split the dataset into features and target
train_features, train_target = features_and_target(training_df)

# split into training and testing data
X_train, X_test, y_train, y_test = split_the_dataset(train_features, train_target)

#### _Single Partition of the Data_
Using only one split of the data, we noticed that there was a lot of variation in the accuracy metrics.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import balanced_accuracy_score

def preprocessed_classifier():
    scaler = StandardScaler()

    # preprocess/normalize data
    scaler.fit(X_train) 

    # scale X data
    X_train_scaled = scaler.transform(X_train) 
    X_test_scaled = scaler.transform(X_test) 
    
    # iterate through classifier models
    for estimator_name, estimator_object in estimators.items():

        # create model on scaled data
        clf = estimator_object.fit(X=X_train_scaled, y=y_train)

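        # make predictions on the training set
        # (X_train is unscaled here, although the model was fit on X_train_scaled)
        predicted = estimator_object.predict(X=X_train)

        expected = y_train

        # prediction accuracy (also computed on the unscaled X_train / X_test)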
        print(f"{estimator_name}:")
        print("\tPrediction accuracy on the training data:", format(clf.score(X_train, y_train)*100, ".2f"))
        print("\tPrediction accuracy on the test data:", format(clf.score(X_test, y_test)*100, ".2f"))
        
        # balanced accuracy score 
        balanced_accuracy = balanced_accuracy_score(expected, predicted)
        print("\tBalanced accuracy score:",  format(balanced_accuracy * 100, ".2f"), "\n")
    
    return(X_train_scaled, X_test_scaled)

In [None]:
# test each model's performance
X_train_scaled, X_test_scaled = preprocessed_classifier()

LinearSVC:
	Prediction accuracy on the training data: 0.62
	Prediction accuracy on the test data: 0.92
	Balanced accuracy score: 50.00 

Decision Tree:
	Prediction accuracy on the training data: 99.23
	Prediction accuracy on the test data: 99.08
	Balanced accuracy score: 49.92 

kNN:
	Prediction accuracy on the training data: 92.46
	Prediction accuracy on the test data: 91.71
	Balanced accuracy score: 46.52 

Logistic Regression:
	Prediction accuracy on the training data: 0.62
	Prediction accuracy on the test data: 0.92
	Balanced accuracy score: 50.00 



In [None]:
X_train_scaled, X_test_scaled = preprocessed_classifier()

LinearSVC:
	Prediction accuracy on the training data: 0.62
	Prediction accuracy on the test data: 0.92
	Balanced accuracy score: 50.00 

Decision Tree:
	Prediction accuracy on the training data: 1.23
	Prediction accuracy on the test data: 1.84
	Balanced accuracy score: 50.31 

kNN:
	Prediction accuracy on the training data: 92.46
	Prediction accuracy on the test data: 91.71
	Balanced accuracy score: 46.52 

Logistic Regression:
	Prediction accuracy on the training data: 0.62
	Prediction accuracy on the test data: 0.92
	Balanced accuracy score: 50.00 



Each time we ran this code, the results varied greatly, and it was not obvious which model was best. Prediction accuracy values fluctuated between 1% and 99%. Part of this comes from the imbalanced nature of our dataset: if a model correctly classified most of the non-winners but none of the winners, the accuracy would be 99% without any true positives. Part of it likely comes from the code itself, which fits each model on scaled features but scores it on the unscaled X_train and X_test. To address the imbalance, a balanced accuracy score was also used to quantify model accuracy. This value is close to 50% for all models, indicating that the models are doing no better than chance at classifying players. These models were underfitting the data.
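The gap between plain accuracy and balanced accuracy can be worked through by hand. The sketch below uses made-up labels (99 non-winners, 1 winner) and a model that always predicts "non-winner". Balanced accuracy is computed as the mean of the per-class recalls, which is how scikit-learn's `balanced_accuracy_score` is defined. The helper functions are written out only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall obtained on each class."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


y_true = [0] * 99 + [1]   # 99 non-winners, 1 winner
y_pred = [0] * 100        # always predict non-winner

print(accuracy(y_true, y_pred))           # 0.99
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The classifier finds every non-winner (recall 1.0) and no winners (recall 0.0), so the balanced score averages to 0.5 even though 99% of predictions are correct.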

#### _Cross Validation and Resampling_
Due to the inconsistent results above, we changed our approach to use cross validation and resampling of our dataset. Neither method adds new data; instead, each model is trained and tested on many different partitions of the same data, so a single unlucky split has less influence on the result. From the random resamples, an average balanced accuracy score is returned. This is much more stable than the values we were getting using the previous approach, and is more representative of the true performance of the models.
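As a small, self-contained illustration of the idea, the sketch below runs 5-fold cross validation on synthetic data with a scaler and classifier wrapped in one pipeline, so the scaler is refit on the training folds only. The data, the class shift, and the explicit `StratifiedKFold` with shuffling are assumptions made for this example, not the settings used in this report.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# synthetic imbalanced data: 500 samples, 10 positives (2%)
rng = np.random.default_rng(3000)
X = rng.normal(size=(500, 5))
y = np.zeros(500, dtype=int)
y[:10] = 1
X[:10] += 2.0  # shift the positives so there is some signal to learn

# scaling lives inside the pipeline, so test folds never influence the scaler
clf = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))

# stratified folds guarantee each fold contains some positives
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="balanced_accuracy")

print(f"mean = {scores.mean():.2%}, std = {scores.std():.2%}")
```

Reporting the standard deviation next to the mean shows how much the score still depends on the particular fold.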

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
import numpy

# cross validation and resampling
def classifiers_cross_validation(df):
        
    # iterate through dictionary of different methods, change the estimator each time
    for estimator_name, estimator_object in estimators.items():
        
        # pipeline of scaler and classifier (need to scale and split each time, or else leaking data)
        clf = make_pipeline(StandardScaler(), estimator_object)
        
        # noticed accuracies were very variable and sensitive -> run many times and take average
        all_scores = []
        for n in range(100):
            
            # resample the dataframe for each run
            df_resample = df.sample(frac=1.0, replace=True)
            
            # get features and target dfs
            features, target = features_and_target(df_resample)
            
            # run 5-fold CV to train and test model
            scores = cross_val_score(estimator=clf, X=features, y=target, cv=5, scoring='balanced_accuracy')
            
            all_scores.extend(scores)
            
        print(f"{estimator_name}:") 
        # print average of accuracy scores
        print(f"\tBalanced accuracy mean = {numpy.nanmean(all_scores):.2%}")
        print(f"\tBalanced accuracy std = {numpy.nanstd(all_scores):.2%}")

In [None]:
import warnings
warnings.filterwarnings("ignore")

classifiers_cross_validation(training_df)

From these balanced accuracy averages, the decision tree was the best model, as it had the highest mean balanced accuracy of the four models tested. The baseline decision tree function used for tuning is below.

In [None]:
# cross validation and resampling for selected model (use for tuning)
def classifier_cv(df, model):
        
    # pipeline of scaler and classifier (need to scale and split each time, or else leaking data)
    clf = make_pipeline(StandardScaler(), model)
        
    # noticed accuracies were very variable and sensitive -> run many times and take average
    all_scores = []
    for n in range(100):
            
        # resample the dataframe for each run
        df_resample = df.sample(frac=1.0, replace=True)
            
        # get features and target dfs
        features, target = features_and_target(df_resample)
            
        # run 5-fold CV to train and test model
        scores = cross_val_score(estimator=clf, X=features, y=target, cv=5, scoring='balanced_accuracy')
            
        all_scores.extend(scores)
        
    # print average accuracy scores
    print(f"\tbalanced accuracy mean = {numpy.nanmean(all_scores):.2%}")
    print(f"\tbalanced accuracy std = {numpy.nanstd(all_scores):.2%}")

#### _Feature Selection_
Our dataset has about 50 features, so determining a subset that yields similar results is useful. We used recursive feature elimination (RFE), an iterative feature selection method, to select the most representative features.

Though accuracy scores were determined over many samples and splits of our dataset, feature selection was based on one random split only.

In [None]:
# scale a single train/test partition of the data to use for feature selection
scaler = StandardScaler()
scaler.fit(X_train) 

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

def RFE_feature_selection():
    
    # RFE selector, fit to the training data
    select = RFE(DecisionTreeRegressor(random_state = 3000), n_features_to_select = 20)
    select.fit(X_train_scaled, y_train)
    
    # transform training and testing sets so only the selected features are retained
    X_train_selected = select.transform(X_train_scaled)
    X_test_selected = select.transform(X_test_scaled)

    # fit a classifier on the selected features (the fitted model is not returned)
    model = DecisionTreeClassifier().fit(X=X_train_selected, y=y_train)

    # print selected features and make them into a list
    print("Selected features after RFE:")
    selected_features = []
    for i in range(len(train_features.columns)):
        # support_ is a boolean mask marking the features RFE kept
        if select.support_[i]:
            selected_features.append(train_features.columns[i])
            print("\t", train_features.columns[i], sep="")

    return X_train_selected, X_test_selected, selected_features

In [None]:
X_train_selected, X_test_selected, selected_features = RFE_feature_selection()

In [None]:
id_columns = ["CY_winner", "Name", "Tm", "Year", "Lg"]
selected_features.extend(id_columns)

# isolate the selected features
df_selected = df_merged[selected_features]

# classify using the selected features
classifier_cv(df_selected, DecisionTreeClassifier())

### 3.4. Model Optimization

#### _Hyperparameter Tuning_
For our model, DecisionTreeClassifier, we chose to tune the following parameters:
* criterion: the function to measure the quality of a split
* max_depth: the maximum depth of the tree
* class_weight: the weights associated with the target classes

We are tuning our model to avoid overfitting to the training data. Especially since this dataset is small, placing more restrictions on our model will likely make it more generalizable to new data. In addition, since the dataset is imbalanced, the class_weight parameter will likely help even out the two target classes.
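
To show what `class_weight='balanced'` does, here is a minimal sketch of the weighting rule scikit-learn documents for that setting, `n_samples / (n_classes * count_of_class)`, written in plain Python. The helper name `balanced_class_weights` and the class counts are hypothetical and used only for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# hypothetical target column: 990 non-winners (0) and 10 winners (1)
labels = [0] * 990 + [1] * 10

weights = balanced_class_weights(labels)
print(weights)  # the rare class receives a much larger weight
```

With these counts the winner class is weighted 50.0 and the non-winner class about 0.505, so each class contributes the same total weight when the tree evaluates a split.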

In [None]:
from sklearn.model_selection import GridSearchCV

# None leaves both classes equally weighted; 'balanced' reweights by class frequency
param_grid = {'criterion': ['gini', 'entropy'],
              'class_weight': [None, 'balanced'],
              'max_depth': [5, 10, 15, 20, 30, 40]}

As with model training, we noticed that each run of our hyperparameter tuning produced different output. Here, we run the grid search 10 times with resampling and choose the values that appear most often to evaluate performance.

In [None]:
# noticed the best parameters changed each time -> run a bunch of times
for n in range(10):
    
    # resample the dataframe each time
    df_resample = df_merged.sample(frac=1.0, replace=True)
    
    # get features and target dfs
    features, target = features_and_target(df_resample)
    
    grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, scoring='balanced_accuracy')
    grid_search.fit(X=features, y=target)
    
    # this is the estimator chosen by the search
    print("Best estimator: ", grid_search.best_estimator_)

    # this is the best performance during training (balanced accuracy score)
    print("Best cross-validation score: ", grid_search.best_score_)

    # result of grid search
    print("Best parameters: ", grid_search.best_params_)
    
    print("\n")

The most common parameters deemed best were:
* criterion: 'gini'
* class_weight: 'balanced'
* max_depth: 5
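
Picking the most common values can also be done programmatically rather than by reading the printed output. The sketch below assumes the `best_params_` dictionary from each run has been appended to a list. The list contents here are made-up placeholders, not our actual search results, and `most_common_params` is a hypothetical helper.

```python
from collections import Counter

# placeholder best_params_ dictionaries, standing in for repeated grid searches
collected = [
    {"criterion": "gini", "class_weight": "balanced", "max_depth": 5},
    {"criterion": "entropy", "class_weight": "balanced", "max_depth": 5},
    {"criterion": "gini", "class_weight": "balanced", "max_depth": 10},
    {"criterion": "gini", "class_weight": None, "max_depth": 5},
]


def most_common_params(param_dicts):
    """Return, for each parameter, the value that was chosen most often."""
    result = {}
    for name in param_dicts[0]:
        tally = Counter(d[name] for d in param_dicts)
        result[name] = tally.most_common(1)[0][0]
    return result


consensus = most_common_params(collected)
print(consensus)
# prints: {'criterion': 'gini', 'class_weight': 'balanced', 'max_depth': 5}
```

Note that this tallies each parameter separately. Counting whole parameter combinations instead would be a reasonable alternative if the parameters interact strongly.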

In [None]:
best_estimator = DecisionTreeClassifier(class_weight='balanced', max_depth=5)

# classify using the selected features and best estimator
classifier_cv(df_selected, best_estimator)

### 3.5. Model Testing

Our model was developed using cross validation and resampling on the training split only. Here, we train the model on our selected and scaled training data, and then test on the scaled and selected testing split.

In [None]:
# preparing the training data for retraining the model
selected_training_df = training_df[selected_features]

# split the dataset into features and target
train_features, train_target = features_and_target(selected_training_df)

# fit scaler to training data
scaler.fit(train_features) 

# scale training data
train_features_scaled = scaler.transform(train_features)

In [None]:
# preparing test data based on tuning above
selected_testing_df = testing_df[selected_features]

# split the dataset into features and target
test_features, test_target = features_and_target(selected_testing_df)

# scale testing data based on the scaler fit to the training data
test_features_scaled = scaler.transform(test_features)

In [None]:
# model trained using training data and optimized parameters
model = DecisionTreeClassifier(class_weight='balanced', max_depth=5).fit(X=train_features_scaled, y=train_target)

print("DecisionTree Classifier with Tuned Parameters and Selected Features:")
# training data metrics
print("\tPrediction accuracy on the training data:", format(model.score(train_features_scaled, train_target)*100, ".2f"))

balanced_accuracy_training = balanced_accuracy_score(train_target, model.predict(X=train_features_scaled))
print("\tBalanced accuracy score (training):",  format(balanced_accuracy_training * 100, ".2f"))
print("")

# testing data metrics
print("\tPrediction accuracy on the test data:", format(model.score(test_features_scaled, test_target)*100, ".2f"))

balanced_accuracy_test = balanced_accuracy_score(test_target, model.predict(X=test_features_scaled))
print("\tBalanced accuracy score (testing):",  format(balanced_accuracy_test * 100, ".2f"))

#### _Prediction of 2021 Cy Young Winners Using our Model_

In [None]:
# format 2021 dataset
selected_2021 = combined_2021[selected_features].dropna()

# split the dataset into features and target
features_2021, target_2021 = features_and_target(selected_2021)

# refit the scaler on the training features, then apply it to the 2021 features
scaler.fit(train_features) 

features_2021_scaled = scaler.transform(features_2021)

In [None]:
# model trained using training data and optimized parameters
model = DecisionTreeClassifier(class_weight='balanced', max_depth=5).fit(X=train_features_scaled, y=train_target)

# predict on the scaled 2021 features, matching the scaling used in training
winners = model.predict(X=features_2021_scaled)

# count the players predicted as winners (class 1)
predicted_winners = 0
for p in winners:
    if p == 1:
        predicted_winners += 1
print(predicted_winners)

0


Our model did not predict any Cy Young Award winners for 2021. This could be due to the imbalance of our training data: the model could not learn enough about the characteristics of a "winner", so it assigned every player to the non-winner class.

<a id="4"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 4. DISCUSSION


&emsp;&emsp;We compared a Support Vector Machine (LinearSVC), a Decision Tree classifier, k-Nearest Neighbors (kNN), and Logistic Regression. Decision Tree and LinearSVC both performed better than kNN and Logistic Regression, and Decision Tree had slightly higher and more consistent scores than LinearSVC, so we chose it for our final algorithm. We used grid search with both LinearSVC and Decision Tree to find their best parameters. For LinearSVC, we adjusted C and class_weight: C is the regularization parameter, and class_weight either leaves the classes unweighted or, when set to 'balanced', scales C for each class inversely to that class's frequency. For Decision Tree, we tried different values for criterion, max_depth, and class_weight: criterion is the function that measures the quality of a split, max_depth sets the maximum depth of the tree, and class_weight weights the classes if necessary. Even with tuned parameters for both, Decision Tree still slightly outperformed LinearSVC, so that was the algorithm we used for our predictive model.

&emsp;&emsp; Based on our findings and analysis of this data, we do not believe that we can accurately predict who deserves to win the 2021 Cy Young Awards. One reason is that our model severely overfits the training data (100% accuracy) and cannot be generalized to other years' data. Importantly, our prediction accuracy on the test data was close to 100%, at 98.96%, while our balanced accuracy score was only 49.83%. Thus, a seemingly great prediction accuracy is invalidated by a balanced accuracy score that is no better than chance.

&emsp;&emsp; The ethical implications of this project are not large in scope. The problem that this project attempts to solve relates only to Major League Baseball and the world of professional baseball. Especially since our result is that we cannot form a confident prediction, it is safe to say that our project does not cross any ethical boundaries, either in the questions we are asking or in any potential results.

&emsp;&emsp; Our dataset is biased, however. Each time we ran this code, the prediction accuracy values varied greatly. To address this, a balanced accuracy score was also used to quantify model accuracy. This value is close to 50% for all models, indicating that the models are essentially choosing at random how to classify players. This situation arises from the imbalanced, biased nature of our dataset: there are far more non-winners than winners.

&emsp;&emsp; This project starts an interesting look into the world of sabermetrics. For future work, however, there are some steps that could be taken toward making a confident prediction. One such step would be to gather more years' worth of data. Baseball-reference has many more years of data to scrape. However, since voting trends for the Cy Young have changed over time, going back too far might interfere with making accurate predictions for the current year. Another step would be to try different machine learning algorithms with better parameter tuning. Our tuning runs did not agree on a consistent set of parameters, which led to low confidence in any prediction that our model made. Perhaps other algorithms, paired with different parameter tuning methods, could give us a much more confident prediction.

<a id="5"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

### CONTRIBUTIONS
* Section 1:  Daniel
* Section 2:  Daniel and Hannah
* Section 3:  Karenna and Hannah
* Section 4:  Daniel and Karenna
