# Pre-Processing

With the data explored and the changes from the exploratory process applied, I will now prepare it for modeling. This notebook walks through the model pre-processing steps.

In [1]:
import numpy as np
import pandas as pd
import pickle
import csv

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

### Load the Data

Load in the clean and explored data.

In [4]:
final_df = pd.read_csv('../data/final_for_preprocessing.csv')

In [5]:
final_df.drop('Unnamed: 0', axis=1, inplace=True)
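
The `Unnamed: 0` column typically appears when a data frame is written with `to_csv` including its index and then read back without telling pandas which column holds that index. As a small sketch on made-up data (the toy frame and its values are illustrative only, not the project's data), either `index_col=0` on read or `index=False` on write avoids the extra column:

```python
import io
import pandas as pd

# Toy frame standing in for the real data (hypothetical values).
toy = pd.DataFrame({'launch_speed': [98.1, 87.4], 'zone': [5, 11]})

# Writing with the default index=True stores the index as an unnamed first column.
csv_text = toy.to_csv()

# Reading it back naively produces an 'Unnamed: 0' column.
naive = pd.read_csv(io.StringIO(csv_text))

# Option 1: tell pandas that the first column is the index.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: do not write the index in the first place.
no_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(naive.columns))
print(list(with_index.columns))
print(list(no_index.columns))
```

Dropping the column after loading, as done here, gives the same result; the alternatives just save a step.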

#### Final Data Frame Updates

Data type updates:
- `zone` : is being recognized as an integer value but needs to be an object (categorical)
- `game_year` : is being recognized as an integer value but needs to be an object (categorical)
- `age` : is being recognized as an integer value but needs to be an object (categorical)

During exploration I used the batter's name to explore and review the different features in the data. I will now drop this column so that the model does not receive data that is unnecessary for making predictions.

In [6]:
final_df[['zone','game_year', 'age']] = final_df[['zone','game_year', 'age']].astype(object)

In [7]:
final_df.drop('player_name', axis=1, inplace=True)

In [22]:
final_df.shape

(16381, 35)

### Setup Dummy Variables and Modeling Dataframe

With the final data set complete, we need every column to be a numerical data type so that the regression model can be fit and make predictions. This means converting the categorical features into dummy variables. In pandas, calling `get_dummies` on the data frame creates dummy variables for all object-type columns, and `drop_first=True` drops the first level of each feature so that no dummy column is redundant.

In [8]:
model_df = pd.get_dummies(final_df, drop_first=True)
model_df.shape

(16381, 81)
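
To illustrate why the integer columns were cast to object beforehand, here is a minimal sketch on toy data. The values and the `stand` column are hypothetical stand-ins; only `zone` and `launch_speed` are names taken from this notebook.

```python
import pandas as pd

# Toy data (hypothetical values); 'zone' is an integer code, 'stand' is text.
toy = pd.DataFrame({
    'launch_speed': [98.1, 87.4, 102.3, 91.0],
    'zone': [1, 2, 3, 1],
    'stand': ['L', 'R', 'R', 'L'],
})

# Left as an integer, 'zone' is treated as numeric and passed through unchanged.
as_int = pd.get_dummies(toy, drop_first=True)

# Cast to object first, and 'zone' is encoded like any other categorical column.
toy['zone'] = toy['zone'].astype(object)
as_obj = pd.get_dummies(toy, drop_first=True)

print(list(as_int.columns))
print(list(as_obj.columns))
```

With `drop_first=True`, a feature with k levels yields k - 1 dummy columns, which is why the column count grows from 35 to 81 rather than by the full number of category levels.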

### Setup X and y

With the modeling data frame complete, we need to separate the feature data frame that will be used to train the model (known as X) from the target variable that we are predicting (known as y), which here is `launch_speed`.

In [9]:
X = model_df.drop('launch_speed', axis=1)
y = model_df['launch_speed'].values

### Train / Test / Split

To create separate batches of data for training and then testing the model, we perform a train/test split on the X and y variables identified above.
- I will use the default split of 75% training data and 25% testing data

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [11]:
X_train.shape, X_test.shape

((12285, 80), (4096, 80))

In [12]:
y_train.shape, y_test.shape

((12285,), (4096,))
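
The shapes above follow from the default 25% test size: scikit-learn rounds the test count up and assigns the remaining rows to training. The sketch below reproduces that arithmetic and checks it on placeholder data (`X_demo` and `y_demo` are illustrative arrays, not the real features). Passing `random_state` explicitly is shown as an alternative to relying on the global NumPy seed set at the top of the notebook.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 16381      # rows in model_df, from the shape shown earlier
test_size = 0.25    # default used when neither test_size nor train_size is given

# The test count is rounded up; training receives the remainder.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test
print(n_train, n_test)

# Same check on placeholder data with an explicit random_state.
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.shape, X_te.shape)
```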

### Scale the Data

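Features measured on very different scales can end up over-weighted simply because of their units, so each feature is standardized to the same scale before modeling. The scaler is fit on the training data only and then applied to the test data, so no information from the test set influences the scaling.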

In [13]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)
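
As a rough illustration of what the scaler does, the sketch below uses randomly generated placeholder features (not the real data; every name ending in `_demo` is hypothetical). It checks that the training columns end up with mean 0 and standard deviation 1, and that `transform` on the test set is simply `(x - mean_) / scale_` using the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two placeholder features on very different scales.
X_train_demo = rng.normal(loc=[90.0, 0.5], scale=[10.0, 0.1], size=(200, 2))
X_test_demo = rng.normal(loc=[90.0, 0.5], scale=[10.0, 0.1], size=(50, 2))

ss_demo = StandardScaler()
X_train_demo_sc = ss_demo.fit_transform(X_train_demo)  # learn mean/std from train
X_test_demo_sc = ss_demo.transform(X_test_demo)        # reuse the train statistics

# Training columns are centered at 0 with unit standard deviation.
print(X_train_demo_sc.mean(axis=0).round(6), X_train_demo_sc.std(axis=0).round(6))

# The test transform matches the manual formula with the fitted attributes.
manual = (X_test_demo - ss_demo.mean_) / ss_demo.scale_
print(np.allclose(manual, X_test_demo_sc))
```

The test columns will generally not have a mean of exactly 0, which is expected because they are scaled with the training statistics.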

### CSVs and Pickles

So that the same scaling can be reused with my model in another notebook, I will save CSV copies of the scaled data sets and pickle the StandardScaler object defined above. The unscaled splits and the target arrays are saved as well.

In [14]:
model_df.to_csv('../data/modeling_data.csv')

In [15]:
with open('../data/X_train_sc.csv', 'w', newline='') as f:
    # newline='' is recommended for csv.writer and avoids blank rows on Windows
    csv_writer = csv.writer(f)
    csv_writer.writerows(X_train_sc)

In [16]:
with open('../data/X_test_sc.csv', 'w', newline='') as f:
    # newline='' is recommended for csv.writer and avoids blank rows on Windows
    csv_writer = csv.writer(f)
    csv_writer.writerows(X_test_sc)

In [17]:
X_train.to_csv('../data/X_train.csv', index=False)

In [18]:
X_test.to_csv('../data/X_test.csv', index=False)

In [19]:
with open('../pickles/y_train.pkl', 'wb+') as f:
    pickle.dump(y_train, f)

In [20]:
with open('../pickles/y_test.pkl', 'wb+') as f:
    pickle.dump(y_test, f)

In [21]:
with open('../pickles/standard_scaler.pkl', 'wb+') as f:
    pickle.dump(ss, f)
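
For reference, here is one way the saved files might be read back in the next notebook. This is a sketch under assumptions: it writes to a temporary directory instead of `../data` and `../pickles`, and it uses placeholder data (`X_demo`, `ss_demo`) in place of the real splits. The main detail is `header=None`, since `csv.writer` stores the scaled arrays without column names.

```python
import csv
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))        # placeholder for X_train
ss_demo = StandardScaler().fit(X_demo)
X_demo_sc = ss_demo.transform(X_demo)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'X_train_sc.csv')
    pkl_path = os.path.join(tmp, 'standard_scaler.pkl')

    # Save with csv.writer and pickle, as in the cells above.
    with open(csv_path, 'w', newline='') as f:
        csv.writer(f).writerows(X_demo_sc)
    with open(pkl_path, 'wb') as f:
        pickle.dump(ss_demo, f)

    # header=None because the file contains only rows of numbers.
    X_loaded = pd.read_csv(csv_path, header=None).to_numpy()
    with open(pkl_path, 'rb') as f:
        ss_loaded = pickle.load(f)

print(X_loaded.shape)
print(np.allclose(X_loaded, X_demo_sc))
print(np.allclose(ss_loaded.transform(X_demo), X_demo_sc))
```

Because the scaled CSVs carry no header, the column names can be recovered from `X_train.csv`, which is saved with one.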

##### Now on to modeling: 04-Modeling