# Using Historical Data to Predict Batting Success: Step 2

Authored by: Donna J. Harris (994042890)

Email: harr2890@mylaurier.ca

For: CP640 Machine Learning (S22) with Professor Elham Harirpoush

## Notebook Series

Just a word about the presentation of this project code.

The code is organized into a series of locally executed Jupyter notebooks, one per step, which must be run in sequence. This is `harr2890_project_step2_hof_data_prep`, the second of XXXXX notebooks.   TODO

## *Step 2 - Data Preparation for a Hall of Fame Approach*

This notebook covers the second phase of data preparation: structuring and splitting the data into the state in which the experiments and modelling will be conducted, based on a Hall of Fame approach.

Here, we will combine the main source data with our list of Hall of Fame inductees and prepare the data for exploration and modelling based on various **classification** techniques in a subsequent notebook.

We will also be splitting the data, based on the year 2000, so that later on in the process we can run controlled tests on unseen data which can be used to manually evaluate the results and predictions of the modelling.

## Environment Setup

Import the required libraries and set up the environment for our work, including configuring pandas to display all dataframe columns.

In [1]:
import pandas as pd

pd.set_option('display.max_columns', None)

### Pre-Conditions

Step 1 must be run completely before running this notebook.

The `data` folder must exist with the following prepared data files:
- `./data/core_mlb_dataset.csv`
- `./data/hof_dataset.csv`
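
As a minimal sketch, the pre-conditions above could be checked before any loading is attempted. The helper name `missing_inputs` is hypothetical and not part of the project code; only the two file paths come from the list above.

```python
from pathlib import Path

# Files that Step 1 is expected to have written (paths taken from the list above).
REQUIRED_FILES = [
    Path("./data/core_mlb_dataset.csv"),
    Path("./data/hof_dataset.csv"),
]


def missing_inputs(paths):
    """Return the subset of paths that do not exist as files on disk."""
    return [p for p in paths if not p.is_file()]


missing = missing_inputs(REQUIRED_FILES)
if missing:
    print("Run Step 1 first; missing:", ", ".join(str(p) for p in missing))
else:
    print("All Step 1 outputs found.")
```

Failing early with a clear message is usually easier to diagnose than a `FileNotFoundError` raised part-way through a later cell.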

## Loading Prepared Data Files

Load in the two stored data files: 
- Major League Baseball batting data (`./data/core_mlb_dataset.csv`)
- Baseball Hall of Fame inductee data (`./data/hof_dataset.csv`)

so we can continue with preparing this data.


In [2]:
core_mlb_dataset = "./data/core_mlb_dataset.csv"
df = pd.read_csv(core_mlb_dataset)
df

Unnamed: 0,ID,Player,Tm,Opp,PA,AB,R,H,2B,3B,HR,RBI,BB,SO,HBP,SH,SF,Result,Season
0,delahed01,Ed Delahanty,PHI,BRO,5,4,1,2,0,0,0,0,1,0,0,0,0,L,1901
1,dolanjo02,Joe Dolan,PHI,BRO,5,5,0,1,0,0,0,1,0,0,0,0,0,L,1901
2,childcu01,Cupid Childs,CHC,STL,5,5,1,1,0,0,0,0,0,0,0,0,0,W,1901
3,crolifr01,Fred Crolius,BSN,NYG,4,4,0,0,0,0,0,1,0,0,0,0,0,W,1901
4,delahed01,Ed Delahanty,PHI,BRO,4,4,0,0,0,0,0,0,0,2,0,0,0,L,1901
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3715517,woodfja01,Jake Woodford,STL,CHC,2,1,0,0,0,0,0,0,0,0,0,1,0,L,2021
3715518,yastrmi01,Mike Yastrzemski,SFG,SDP,4,3,1,1,1,0,0,2,1,1,0,0,0,W,2021
3715519,zimmebr01,Bradley Zimmer,CLE,TEX,4,4,1,2,0,0,0,1,0,0,0,0,0,W,2021
3715520,zimmery01,Ryan Zimmerman,WSN,BOS,4,3,0,0,0,0,0,1,1,2,0,0,0,L,2021


In [3]:
hof_dataset = "./data/hof_dataset.csv"
hof = pd.read_csv(hof_dataset)
hof

Unnamed: 0,ID,Inductee
0,hodgegi01,1
1,kaatji01,1
2,minosmi01,1
3,olivato01,1
4,ortizda01,1
...,...,...
263,cobbty01,1
264,johnswa01,1
265,mathech01,1
266,ruthba01,1


## Preprocessing (Continued from the Step 1 Notebook)

### The Hall of Fame Approach

First, a statement of the approach here, with Hall of Fame induction data.

The overarching goal is to use Major League Baseball data to predict batting success.

One possible measure of success is a batter's induction into the Major League Baseball Hall of Fame. With the batting data available, it seems like an interesting approach and problem to attempt to predict batting success based on induction into the Hall of Fame. (**Note:** There are many issues with this approach, which will be explored in more depth within the Step 3 Notebook, as well as the written report.)

For Step 3's Notebook, the goal will be to look at a baseball player's career batting data and predict whether or not that player is in the Hall of Fame.

**Note:** If the player is not yet eligible for induction into the Hall of Fame (because they are currently playing or were recently active), the prediction would represent a strong possibility that they will be inducted in the future, once eligible.

In order to prepare the data for this approach, we will need to do the following:
1.  Arbitrarily split the data, based on the year 2000.
2.  Generate career batting statistics for both sets of player data.
3.  Label the career batting data with the Hall of Fame induction information.

The end result will be two dataframes containing features and labels for:
- players whose career began before 2000
- players whose career began during or after 2000

These will be stored and used in the Step 3 Notebook work. 
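
The three preparation steps listed above could be sketched on toy data roughly as follows. This is an illustration under stated assumptions, not the notebook's actual implementation: the toy values are made up, "career began before 2000" is interpreted as a player's earliest `Season` being less than 2000, career statistics are taken to be simple sums of counting columns, and the helper names `career_totals` and `add_label` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the two loaded frames; the values are made up for illustration.
games = pd.DataFrame({
    "ID":     ["a01", "a01", "b01", "b01", "c01"],
    "Season": [1998,  2001,  2000,  2003,  1995],
    "AB":     [4,     5,     3,     4,     4],
    "H":      [2,     1,     0,     2,     1],
    "HR":     [1,     0,     0,     1,     0],
})
hof_toy = pd.DataFrame({"ID": ["c01"], "Inductee": [1]})

SPLIT_YEAR = 2000

# 1. Split on the season in which each player's career began (assumed definition).
first_season = games.groupby("ID")["Season"].transform("min")
pre_games = games[first_season < SPLIT_YEAR]
post_games = games[first_season >= SPLIT_YEAR]


def career_totals(frame):
    """2. Sum the counting statistics per player to get career totals."""
    return frame.groupby("ID", as_index=False)[["AB", "H", "HR"]].sum()


def add_label(career, inductees):
    """3. Left-join the inductee flag; players absent from the list get 0."""
    labelled = career.merge(inductees[["ID", "Inductee"]], on="ID", how="left")
    labelled["Inductee"] = labelled["Inductee"].fillna(0).astype(int)
    return labelled


pre_2000 = add_label(career_totals(pre_games), hof_toy)
post_2000 = add_label(career_totals(post_games), hof_toy)

print(pre_2000)
print(post_2000)
```

Splitting on each player's first season, rather than on the season of each individual game row, keeps a player's whole career in one of the two frames, which matches the "career began before / during or after 2000" wording above.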

### Splitting the Data by 2000

## Concluding Notebook Comments

**Note:** At this point, we will conclude this notebook for organizational purposes, as it is a logical point for saving and launching into the following experimentation and modelling based on the data prepared here.

Saving the data files in various states makes it easier to re-run parts of the overall project without having to re-run every aspect.
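
As a small sketch of saving an intermediate state, the round trip below writes a toy frame to a temporary folder and reads it back. The filename and values are hypothetical. Passing `index=False` is one way to avoid the extra `Unnamed: 0` column seen in the loaded outputs earlier, which typically appears when a frame is saved together with its default index.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for a prepared dataset; values are illustrative only.
prepared = pd.DataFrame({"ID": ["a01", "c01"], "H": [3, 1], "Inductee": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical filename, not one used by the project.
    out_path = Path(tmp) / "example_prepared.csv"

    # index=False keeps the row index out of the file, so reloading does not
    # produce an extra "Unnamed: 0" column.
    prepared.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
```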

The purpose of this notebook is to continue the data preparation started in Step 1 by structuring and splitting the data into the state in which the experiments and modelling will be conducted.

The *next* notebook in the series is: `harr2890_project_step3_hof_modelling`, where the saved data files will be loaded and ......   XXXXXX  TODO