Introduction:


John Doe is a billionaire investor who enjoys putting his money into different businesses and fields. Recently he has taken an interest in baseball and is looking to buy a team. John is also very competitive and enjoys winning. He was able to obtain pitching statistics from the last 40 years of the Korean league. Knowing that a good pitching staff tends to produce a winning team, he has asked us to predict the top 5 teams with the highest win percentage based on their pitching statistics.


In [1]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import os

hey_batter = pd.read_csv('C:\\Users\\ebent\\OneDrive\\Documents\\GitHub\\Hey Batter Batter!\\Hey-Batter-Batter-\\csv\\kbopitchingdata.csv')
hey_batter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 323 entries, 0 to 322
Data columns (total 34 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   323 non-null    int64  
 1   year                 323 non-null    int64  
 2   team                 323 non-null    object 
 3   average_age          323 non-null    float64
 4   runs_per_game        323 non-null    float64
 5   wins                 323 non-null    int64  
 6   losses               323 non-null    int64  
 7   win_loss_percentage  323 non-null    float64
 8   ERA                  323 non-null    float64
 9   run_average_9        323 non-null    float64
 10  games                323 non-null    int64  
 11  games_started        184 non-null    float64
 12  games_finished       184 non-null    float64
 13  complete_game        323 non-null    int64  
 14  shutouts             323 non-null    int64  
 15  saves                323 non-null    int

We can see we have a lot of different stats here. We most likely won't use all of them, but we should know what each one means. Here is a short description of each column so we know what we are working with:

- year
- team
- average_age: Average pitcher age
- runs_per_game: Runs scored per game
- wins: Total wins per season
- losses: Total losses per season
- win_loss_percentage
- ERA: Pitching ERA: number of earned runs a pitcher allows per 9 innings
- run_average_9: run average per 9 innings
- games: Games played
- games_started: Games started
- games_finished: Games finished
- complete_game: Complete games
- shutouts: No runs allowed and complete games
- saves: pitcher who finishes a game for the winning team (certain prerequisites required)
- innings_pitched
- hits: Hits allowed
- runs: Runs allowed
- earned_runs: Earned runs allowed
- home_runs: Home runs allowed
- walks: Walks allowed
- intentional_walks: Intentional walks allowed
- strikeouts
- hit_batter: Hit batter with pitch
- balks: An illegal act by a pitcher with a runner or runners on base, entitling all runners to advance one base
- wild_pitches: Pitches too errant for the catcher to handle, allowing a runner to advance
- batters_faced
- WHIP: (Walks + Hits) / Total Innings Pitched
- hits_9: Hits per 9 innings
- homeruns_9: Homeruns per 9 innings
- walks_9: Walks per 9 innings
- strikeouts_9: Strikeouts per 9 innings
- strikeout_walk: strikeouts / walks
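
Several of the rate stats in this list follow directly from the counting stats, so they can be recomputed as a sanity check. The sketch below uses a small made-up frame (the numbers are illustrative, not taken from the dataset) and a hypothetical helper, `add_rate_stats`. It assumes `innings_pitched` is stored as a plain decimal number of innings; if the file uses baseball notation such as 1140.1 for 1140 and 1/3 innings, the values would need converting first.

```python
import pandas as pd

# Made-up counting stats for two hypothetical team-seasons (not real data).
sample = pd.DataFrame({
    "innings_pitched": [1000.0, 1200.0],
    "hits": [1000, 1100],
    "walks": [400, 500],
    "strikeouts": [800, 900],
    "earned_runs": [450, 600],
})

def add_rate_stats(df):
    """Return a copy of df with rate stats derived from the counting stats."""
    out = df.copy()
    ip = out["innings_pitched"]
    out["WHIP"] = (out["walks"] + out["hits"]) / ip
    out["hits_9"] = 9 * out["hits"] / ip
    out["walks_9"] = 9 * out["walks"] / ip
    out["strikeouts_9"] = 9 * out["strikeouts"] / ip
    out["strikeout_walk"] = out["strikeouts"] / out["walks"]
    out["ERA"] = 9 * out["earned_runs"] / ip
    return out

rates = add_rate_stats(sample)
print(rates[["WHIP", "hits_9", "strikeout_walk", "ERA"]].round(3))
```

Comparing these recomputed columns against the ones already in the file is a quick way to spot rows where the counting stats and the rate stats disagree.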

So far it's a lot of information, but we can already see that some of these stats are likely to matter more than others. For example, ERA, the number of earned runs a pitcher allows per 9 innings, is probably more important than shutouts. In baseball, the more runs you let the other team score, the more likely your team is to lose, so a low ERA across the pitching staff should be a big part of a team's winning percentage. Shutouts, on the other hand, are rare, so the count is usually low. Having more than average is a good thing, but it probably won't be the main factor in determining a team's win percentage.
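
One way to check this intuition rather than assume it is to rank each stat by how strongly it correlates with win percentage. The sketch below runs on synthetic data generated so that ERA drives the outcome, so it only demonstrates the mechanics and says nothing about the real league. `rank_by_correlation` is a hypothetical helper, not part of this notebook.

```python
import numpy as np
import pandas as pd

def rank_by_correlation(df, target):
    """Pearson correlation of every numeric column with target, strongest first."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    # Sort by absolute value so strong negative correlations are not buried.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Synthetic team-seasons (not real data): win percentage falls as ERA rises,
# while shutouts are pure noise here by construction.
rng = np.random.default_rng(0)
era = rng.uniform(3.0, 6.0, size=200)
demo = pd.DataFrame({
    "ERA": era,
    "shutouts": rng.integers(0, 15, size=200),
    "win_loss_percentage": 0.9 - 0.09 * era + rng.normal(0, 0.02, size=200),
})

ranking = rank_by_correlation(demo, "win_loss_percentage")
print(ranking)
```

On the real data the call would be `rank_by_correlation(hey_batter, 'win_loss_percentage')`. Columns such as `wins` and `losses` would be expected to dominate that ranking, since win percentage is computed from them, so they are worth excluding before reading too much into it.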

In [4]:
hey_batter.head(-5)

Unnamed: 0,id,year,team,average_age,runs_per_game,wins,losses,win_loss_percentage,ERA,run_average_9,...,hit_batter,balks,wild_pitches,batters_faced,WHIP,hits_9,homeruns_9,walks_9,strikeouts_9,strikeout_walk
0,1,2021,LG Twins,26.3,3.90,72,57,0.558,3.57,3.96,...,97,5.0,43.0,5416,1.312,8.0,0.6,3.9,7.6,1.96
1,2,2021,KT Wiz,28.4,4.06,75,59,0.560,3.67,4.17,...,42,1.0,56.0,5359,1.316,8.4,0.6,3.5,7.5,2.16
2,3,2021,Doosan Bears,27.5,4.57,70,65,0.519,4.28,4.66,...,73,7.0,51.0,5596,1.487,9.2,0.7,4.2,7.4,1.77
3,4,2021,Samsung Lions,28.8,4.57,75,59,0.560,4.29,4.70,...,51,3.0,56.0,5496,1.450,9.3,0.9,3.8,7.4,1.96
4,5,2021,NC Dinos,27.7,4.80,67,67,0.500,4.50,4.95,...,77,8.0,74.0,5575,1.476,9.1,0.9,4.2,7.5,1.79
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
313,314,1983,Haitai Tigers,25.3,3.90,55,44,0.556,3.16,3.97,...,45,,,3707,1.232,8.2,0.6,2.9,4.6,1.59
314,315,1983,Samsung Lions,26.1,4.18,46,50,0.479,3.42,4.18,...,58,,,3785,1.299,8.3,0.7,3.4,4.0,1.19
315,316,1983,OB Bears,26.9,4.32,44,55,0.444,3.54,4.36,...,62,,,3789,1.358,9.0,0.8,3.2,3.3,1.04
316,317,1983,Lotte Giants,26.4,4.64,43,56,0.434,3.79,4.65,...,47,,,3827,1.331,9.4,0.9,2.6,4.2,1.60


Initially we see some missing values that look like they correspond to the earlier years of the data (1980s vs. 2010s). I also noticed that the data spans about 40 years. Since we are predicting for 2022, we should keep in mind that most professional athletes don't play for 40 years; a career typically lasts around 5-15 years. We don't have to do anything for now, but we will most likely split the data for training/testing based on this knowledge.
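
A time-based split along these lines could be sketched as follows. The 15-season window and the choice to hold out the latest season are assumptions for illustration, not decisions made in this notebook, and `split_by_year` is a hypothetical helper shown on a tiny made-up frame.

```python
import pandas as pd

def split_by_year(df, window=15, test_seasons=1):
    """Keep the most recent `window` seasons; hold out the latest ones for testing."""
    last = df["year"].max()
    recent = df[df["year"] > last - window]
    test_mask = recent["year"] > last - test_seasons
    return recent[~test_mask], recent[test_mask]

# Made-up frame: one row per season from 1982 to 2021 (not real data).
demo = pd.DataFrame({"year": list(range(1982, 2022)), "ERA": 4.0})

train, test = split_by_year(demo)
print(train["year"].min(), train["year"].max(), test["year"].tolist())
```

Splitting by season rather than at random keeps later seasons out of the training set, which matters when the goal is to predict a season that has not happened yet.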

Let's look a little more closely at our missing values.

In [10]:
# Count missing values per column, most missing first
hey_batter_missing = hey_batter.isnull().sum().to_frame(name='missing')
hey_batter_missing.sort_values(by='missing', ascending=False)

Unnamed: 0,missing
balks,139
intentional_walks,139
games_finished,139
games_started,139
wild_pitches,139
hit_batter,0
home_runs,0
walks,0
strikeouts,0
id,0
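
To see whether these gaps really are confined to the early seasons, the rows with missing values can be listed by year. The sketch below uses a small made-up frame in place of `hey_batter`, and `missing_years` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def missing_years(df, column):
    """Return the sorted years in which `column` has at least one missing value."""
    return sorted(df.loc[df[column].isnull(), "year"].unique().tolist())

# Made-up frame (not real data): balks are missing for the two earliest seasons.
demo = pd.DataFrame({
    "year": [1983, 1983, 1984, 2020, 2021],
    "balks": [np.nan, np.nan, np.nan, 4.0, 5.0],
})

print(missing_years(demo, "balks"))
```

On the real data this would be called as `missing_years(hey_batter, 'balks')` and repeated for the other four columns to confirm they are missing in the same seasons.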


All 5 categories with missing data have the same number of missing values,