<a href="https://colab.research.google.com/github/mtoce/DS-Unit-2-Applied-Modeling/blob/master/module1-define-ml-problems/Assig1_LS_DS_231.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

## Load Data

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [41]:
df = pd.read_csv('astros_bangs_20200127.csv')
print(df.shape)
df.head()

(8274, 29)


Unnamed: 0,game_id,game_pk,game_date,opponent,final_away_runs,final_home_runs,inning,top_bottom,batter,at_bat_event,pitch_type_code,pitch_category,has_bangs,bangs,call_code,description,on_1b,on_2b,on_3b,youtube_id,pitch_youtube_seconds,youtube_url,pitch_datetime,game_pitch_id,event_number,pitch_playid,atbat_playid,away_team_id,home_team_id
0,2017_04_03_seamlb_houmlb_1,490111,3/4/2017,SEA,0,3,1,bottom,George Springer,Home Run,SI,FB,n,,B,Ball,f,f,f,af5e55Cc8ZA,1473,https://www.youtube.com/watch?v=af5e55Cc8ZA&t=...,3/4/2017 19:24:41-05:05,170404002442,32,ca9ed282-a9c3-45a6-ac10-d216fae7ce8b,1421aabe-7063-4902-9a46-4b2c239394cb,136,117
1,2017_04_03_seamlb_houmlb_1,490111,3/4/2017,SEA,0,3,1,bottom,George Springer,Home Run,SI,FB,n,,F,Foul,f,f,f,af5e55Cc8ZA,1489,https://www.youtube.com/watch?v=af5e55Cc8ZA&t=...,3/4/2017 19:24:57-05:05,170404002458,33,7f89900c-faed-485c-a6b6-41c0a2b1c26f,1421aabe-7063-4902-9a46-4b2c239394cb,136,117
2,2017_04_03_seamlb_houmlb_1,490111,3/4/2017,SEA,0,3,1,bottom,George Springer,Home Run,SL,BR,n,,B,Ball,f,f,f,af5e55Cc8ZA,1512,https://www.youtube.com/watch?v=af5e55Cc8ZA&t=...,3/4/2017 19:25:20-05:05,170404002521,34,6c875b0c-b4e3-4521-b1c0-447ee785bff9,1421aabe-7063-4902-9a46-4b2c239394cb,136,117
3,2017_04_03_seamlb_houmlb_1,490111,3/4/2017,SEA,0,3,1,bottom,George Springer,Home Run,SI,FB,n,,E,"In play, run(s)",f,f,f,af5e55Cc8ZA,1529,https://www.youtube.com/watch?v=af5e55Cc8ZA&t=...,3/4/2017 19:25:37-05:05,170404002538,35,1421aabe-7063-4902-9a46-4b2c239394cb,1421aabe-7063-4902-9a46-4b2c239394cb,136,117
4,2017_04_03_seamlb_houmlb_1,490111,3/4/2017,SEA,0,3,1,bottom,Alex Bregman,Single,FF,FB,n,,D,"In play, no out",f,f,f,af5e55Cc8ZA,1580,https://www.youtube.com/watch?v=af5e55Cc8ZA&t=...,3/4/2017 19:26:28-05:05,170404002636,40,3a7c0b8b-3b0c-4832-90d0-5d8425d0b995,3a7c0b8b-3b0c-4832-90d0-5d8425d0b995,136,117


## Choose Target and Decide Approach


Our target is the `has_bangs` column, which records whether the Astros used bangs to tip the upcoming pitch to the hitter. We want to predict whether they will signal a stolen sign under different conditions in the baseball game.


This is a binary (two-class) classification problem: are they cheating or not?


The classes are imbalanced: bangs occur in only **13.8%** of rows.



```
# df['has_bangs'].value_counts(normalize=True)
```


This gives us our baseline. The baseline guess is the majority class: the Astros are NOT stealing signs and banging to pass information to the hitter. Always guessing this class would be right about 86.2% of the time.
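
Because the majority class is so common, accuracy alone can look good while the model finds no bangs at all. The sketch below illustrates this on a small made-up target (not the real data), comparing accuracy with recall and ROC AUC for an always-majority guess. The variable names and the choice of these two extra metrics are assumptions for illustration, not a decision recorded in this notebook.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Hypothetical toy target: 1 = bangs, 0 = no bangs (not the real data)
y_true = pd.Series([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

# Majority-class baseline: always predict the most frequent class
majority_class = y_true.mode()[0]
y_pred = pd.Series([majority_class] * len(y_true))

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)

# A constant score cannot rank positives above negatives
baseline_auc = roc_auc_score(y_true, [0.0] * len(y_true))

print(baseline_accuracy, baseline_recall, baseline_auc)
```

On this toy target the baseline reaches 0.8 accuracy but 0.0 recall and 0.5 ROC AUC, which is why a metric other than accuracy is worth tracking for a target this imbalanced.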


In [0]:
# Define a function to do initial data cleaning
def first_clean(X):
  '''
  Initial data cleaning for a freshly loaded DataFrame
  '''

  X = X.copy()

  # Change game_date to a datetime for ease of use
  # NOTE: game_id (2017_04_03) suggests the raw dates are day-first,
  # so 3/4/2017 is probably 3 April; dayfirst=True may be needed here
  X['game_date'] = pd.to_datetime(X['game_date'], infer_datetime_format=True)

  # Encode the target as numbers: 'n' -> 0, 'y' -> 1
  X['has_bangs'] = X['has_bangs'].replace({'n': 0, 'y': 1})

  # Drop video, timestamp, and ID columns, plus the raw 'bangs' column
  X = X.drop(columns=['youtube_id',
       'pitch_youtube_seconds', 'youtube_url', 'pitch_datetime',
       'game_pitch_id', 'event_number', 'pitch_playid', 'atbat_playid', 'bangs'])

  # Remove outliers
  # Keep only rows from dates on which bangs occurred (about 95% of df), remove the rest
  games_with_bangs_list = X[(X['has_bangs']==1)].game_date.to_list()
  X = X[X['game_date'].isin(games_with_bangs_list)]
  
  return X
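
The first rows above pair `game_id` `2017_04_03_...` with a raw `game_date` of `3/4/2017`, which suggests the raw dates are day-first. A minimal sketch of parsing with an explicit format and cross-checking against `game_id` follows. The two sample rows are hypothetical: the first mirrors `df.head()`, and the raw string `24/9/2017` for the last game is an assumption, since only its parsed value appears in this notebook.

```python
import pandas as pd

# Hypothetical sample; '24/9/2017' is an assumed raw value, not taken from the file
raw = pd.DataFrame({
    'game_id': ['2017_04_03_seamlb_houmlb_1', '2017_09_24_anamlb_houmlb_1'],
    'game_date': ['3/4/2017', '24/9/2017'],
})

# Parse with an explicit day-first format instead of relying on inference
parsed = pd.to_datetime(raw['game_date'], format='%d/%m/%Y')

# Cross-check against the date embedded in the first 10 characters of game_id
from_id = pd.to_datetime(raw['game_id'].str[:10], format='%Y_%m_%d')

print((parsed == from_id).all())
```

If the check fails on the real file, the raw format is not what this sketch assumes and should be inspected before any time-based split.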

In [43]:
# Change game_date to a datetime for ease of use
df['game_date'] = pd.to_datetime(df['game_date'], infer_datetime_format=True)
df['game_date']

0      2017-03-04
1      2017-03-04
2      2017-03-04
3      2017-03-04
4      2017-03-04
          ...    
8269   2017-09-24
8270   2017-09-24
8271   2017-09-24
8272   2017-09-24
8273   2017-09-24
Name: game_date, Length: 8274, dtype: datetime64[ns]

In [44]:
# Check the class balance of the target
df['has_bangs'].value_counts(normalize=True)

n    0.861977
y    0.138023
Name: has_bangs, dtype: float64

In [45]:
df = first_clean(df)
df.head()

Unnamed: 0,game_id,game_pk,game_date,opponent,final_away_runs,final_home_runs,inning,top_bottom,batter,at_bat_event,pitch_type_code,pitch_category,has_bangs,call_code,description,on_1b,on_2b,on_3b,away_team_id,home_team_id
0,2017_04_03_seamlb_houmlb_1,490111,2017-03-04,SEA,0,3,1,bottom,George Springer,Home Run,SI,FB,0,B,Ball,f,f,f,136,117
1,2017_04_03_seamlb_houmlb_1,490111,2017-03-04,SEA,0,3,1,bottom,George Springer,Home Run,SI,FB,0,F,Foul,f,f,f,136,117
2,2017_04_03_seamlb_houmlb_1,490111,2017-03-04,SEA,0,3,1,bottom,George Springer,Home Run,SL,BR,0,B,Ball,f,f,f,136,117
3,2017_04_03_seamlb_houmlb_1,490111,2017-03-04,SEA,0,3,1,bottom,George Springer,Home Run,SI,FB,0,E,"In play, run(s)",f,f,f,136,117
4,2017_04_03_seamlb_houmlb_1,490111,2017-03-04,SEA,0,3,1,bottom,Alex Bregman,Single,FF,FB,0,D,"In play, no out",f,f,f,136,117


In [47]:
df.isnull().sum()

game_id             0
game_pk             0
game_date           0
opponent            0
final_away_runs     0
final_home_runs     0
inning              0
top_bottom          0
batter              0
at_bat_event        0
pitch_type_code     0
pitch_category      0
has_bangs           0
call_code          17
description         0
on_1b               0
on_2b               0
on_3b               0
away_team_id        0
home_team_id        0
dtype: int64

In [48]:
df.nunique()

game_id            58
game_pk            58
game_date          58
opponent           18
final_away_runs    13
final_home_runs    15
inning             13
top_bottom          1
batter             20
at_bat_event       25
pitch_type_code    12
pitch_category      4
has_bangs           2
call_code          13
description        15
on_1b               2
on_2b               2
on_3b               2
away_team_id       18
home_team_id        1
dtype: int64
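
In the counts above, `top_bottom` and `home_team_id` each have a single unique value, so they cannot help separate the classes. One possible way to find and drop such columns is sketched below on a hypothetical toy frame; whether to drop them from the real data is a judgment call that this notebook has not made yet.

```python
import pandas as pd

# Hypothetical toy frame; top_bottom and home_team_id are constant, as in df.nunique()
toy = pd.DataFrame({
    'inning': [1, 1, 2, 3],
    'top_bottom': ['bottom'] * 4,
    'home_team_id': [117] * 4,
    'pitch_category': ['FB', 'BR', 'FB', 'CH'],
})

# Columns with exactly one unique value carry no information for a model
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
toy_reduced = toy.drop(columns=constant_cols)

print(constant_cols)
```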

In [50]:
df['pitch_category'].value_counts()

FB    4959
BR    2518
CH     772
OT      25
Name: pitch_category, dtype: int64

In [52]:
df.dtypes

game_id                    object
game_pk                     int64
game_date          datetime64[ns]
opponent                   object
final_away_runs             int64
final_home_runs             int64
inning                      int64
top_bottom                 object
batter                     object
at_bat_event               object
pitch_type_code            object
pitch_category             object
has_bangs                   int64
call_code                  object
description                object
on_1b                      object
on_2b                      object
on_3b                      object
away_team_id                int64
home_team_id                int64
dtype: object

In [53]:
df.head()

Unnamed: 0,game_id,game_pk,game_date,opponent,final_away_runs,final_home_runs,inning,top_bottom,batter,at_bat_event,pitch_type_code,pitch_category,has_bangs,call_code,description,on_1b,on_2b,on_3b,away_team_id,home_team_id
0,2017_04_03_seamlb_houmlb_1,490111,2017-03-04,SEA,0,3,1,bottom,George Springer,Home Run,SI,FB,0,B,Ball,f,f,f,136,117
1,2017_04_03_seamlb_houmlb_1,490111,2017-03-04,SEA,0,3,1,bottom,George Springer,Home Run,SI,FB,0,F,Foul,f,f,f,136,117
2,2017_04_03_seamlb_houmlb_1,490111,2017-03-04,SEA,0,3,1,bottom,George Springer,Home Run,SL,BR,0,B,Ball,f,f,f,136,117
3,2017_04_03_seamlb_houmlb_1,490111,2017-03-04,SEA,0,3,1,bottom,George Springer,Home Run,SI,FB,0,E,"In play, run(s)",f,f,f,136,117
4,2017_04_03_seamlb_houmlb_1,490111,2017-03-04,SEA,0,3,1,bottom,Alex Bregman,Single,FF,FB,0,D,"In play, no out",f,f,f,136,117


In [55]:
df['at_bat_event'].value_counts()

Strikeout               1851
Groundout               1418
Single                  1138
Walk                     986
Flyout                   782
Lineout                  452
Pop Out                  412
Double                   341
Home Run                 256
Grounded Into DP         156
Forceout                 111
Sac Fly                   83
Field Error               74
Double Play               48
Hit By Pitch              44
Strikeout - DP            27
Runner Out                25
Triple                    20
Catcher Interference      17
Sac Bunt                  15
Fielders Choice Out        7
Fan interference           5
Bunt Groundout             4
Batter Interference        1
Bunt Pop Out               1
Name: at_bat_event, dtype: int64

## Remove Outliers


In [0]:
# Collect the dates of games that had at least one bang in them
games_with_bangs_list = df[(df['has_bangs']==1)].game_date.to_list()

In [65]:
df[df['game_date'].isin(games_with_bangs_list)]

Unnamed: 0,game_id,game_pk,game_date,opponent,final_away_runs,final_home_runs,inning,top_bottom,batter,at_bat_event,pitch_type_code,pitch_category,has_bangs,call_code,description,on_1b,on_2b,on_3b,away_team_id,home_team_id
0,2017_04_03_seamlb_houmlb_1,490111,2017-03-04,SEA,0,3,1,bottom,George Springer,Home Run,SI,FB,0,B,Ball,f,f,f,136,117
1,2017_04_03_seamlb_houmlb_1,490111,2017-03-04,SEA,0,3,1,bottom,George Springer,Home Run,SI,FB,0,F,Foul,f,f,f,136,117
2,2017_04_03_seamlb_houmlb_1,490111,2017-03-04,SEA,0,3,1,bottom,George Springer,Home Run,SL,BR,0,B,Ball,f,f,f,136,117
3,2017_04_03_seamlb_houmlb_1,490111,2017-03-04,SEA,0,3,1,bottom,George Springer,Home Run,SI,FB,0,E,"In play, run(s)",f,f,f,136,117
4,2017_04_03_seamlb_houmlb_1,490111,2017-03-04,SEA,0,3,1,bottom,Alex Bregman,Single,FF,FB,0,D,"In play, no out",f,f,f,136,117
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8269,2017_09_24_anamlb_houmlb_1,492424,2017-09-24,ANA,7,5,9,bottom,Jose Altuve,Strikeout,FF,FB,0,C,Called Strike,f,f,f,108,117
8270,2017_09_24_anamlb_houmlb_1,492424,2017-09-24,ANA,7,5,9,bottom,Jose Altuve,Strikeout,FF,FB,0,C,Called Strike,f,f,f,108,117
8271,2017_09_24_anamlb_houmlb_1,492424,2017-09-24,ANA,7,5,9,bottom,Jose Altuve,Strikeout,FF,FB,0,B,Ball,f,f,f,108,117
8272,2017_09_24_anamlb_houmlb_1,492424,2017-09-24,ANA,7,5,9,bottom,Jose Altuve,Strikeout,FS,FB,0,B,Ball,f,f,f,108,117


In [0]:
df[df['game_date'].isin(games_with_bangs_list)].shape[0]

In [0]:
# So about 95.7% of our rows come from games where there were bangs by the Astros.
games_with_bangs = df[df['game_date'].isin(games_with_bangs_list)].shape[0] / df.shape[0]
games_with_bangs

0.9572153734590283

In [0]:
# We are going to remove the other game rows where there were no bangs
df = df[df['game_date'].isin(games_with_bangs_list)]
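
The filter above matches rows by `game_date`. An alternative sketch, assuming `game_id` uniquely identifies a game, groups by `game_id` instead, which would also behave sensibly if two games ever shared a date. The toy frame and the names `game_has_bangs`, `kept`, and `share_kept` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: g2 has no bangs and should be removed
toy = pd.DataFrame({
    'game_id': ['g1', 'g1', 'g2', 'g2', 'g3'],
    'has_bangs': [0, 1, 0, 0, 1],
})

# True for every row whose game contains at least one bang
game_has_bangs = toy.groupby('game_id')['has_bangs'].transform('max') == 1
kept = toy[game_has_bangs]

# Share of rows that survive the filter
share_kept = len(kept) / len(toy)
print(share_kept)
```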

In [0]:
def feature_engineering(X):
  # TODO: feature engineering is not implemented yet
  pass

In [0]:
def train_test_split(X):
  # TODO: split logic is not implemented yet
  pass

In [0]:
from sklearn.pipeline import make_pipeline

# TODO: pipeline steps have not been chosen yet
pipeline = make_pipeline(
    
)
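
The split function above is still empty. Since the rows are ordered by game date, one option is a time-based split, where the earliest games train the model and the latest games test it. The sketch below is only one way to do that: `time_based_split`, its `train_frac` parameter, and the toy data are hypothetical, and the 80/20 cut is an arbitrary assumption.

```python
import pandas as pd

def time_based_split(X, date_col='game_date', train_frac=0.8):
    """Hypothetical helper: earliest dates go to train, latest to test."""
    # Unique dates in chronological order
    dates = X[date_col].drop_duplicates().sort_values()
    # First date that belongs to the test set
    cutoff = dates.iloc[int(len(dates) * train_frac)]
    train = X[X[date_col] < cutoff]
    test = X[X[date_col] >= cutoff]
    return train, test

# Hypothetical toy data with five distinct game dates
toy = pd.DataFrame({
    'game_date': pd.to_datetime(['2017-04-03', '2017-04-03', '2017-05-01',
                                 '2017-06-10', '2017-07-04', '2017-09-24']),
    'has_bangs': [0, 1, 0, 0, 1, 0],
})

train, test = time_based_split(toy)
print(len(train), len(test))
```

Splitting on whole dates keeps every pitch from a game on the same side of the cut, so the test set never contains pitches from a game the model has already seen.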