# Textbook Coding Lab Review

Here we're going to review everything we did in the coding lab alongside the [sklearn.linear_model](https://scikit-learn.org/stable/api/sklearn.linear_model.html) documentation. This way we can check that our code follows current best practices. It should also deepen your understanding of how the whole process works for these different methodologies.

Our objective is to predict a baseball player's `Salary` based on various metrics associated with performance. Before we do any of that, let's load in the data and inspect it.

# Data Inspection and Preparation

In [361]:
from ISLP import load_data
import numpy as np
import pandas as pd
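import sklearn as skl
# Import the submodules used below explicitly so skl.<submodule> is available
import sklearn.model_selection
import sklearn.preprocessing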

Hitters = load_data("Hitters")

In [362]:
Hitters.head(10)

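| | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | Salary | NewLeague |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 293 | 66 | 1 | 30 | 29 | 14 | 1 | 293 | 66 | 1 | 30 | 29 | 14 | A | E | 446 | 33 | 20 | NaN | A |
| 1 | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | 69 | 321 | 414 | 375 | N | W | 632 | 43 | 10 | 475.0 | N |
| 2 | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | 63 | 224 | 266 | 263 | A | W | 880 | 82 | 14 | 480.0 | A |
| 3 | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | 225 | 828 | 838 | 354 | N | E | 200 | 11 | 3 | 500.0 | N |
| 4 | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | 12 | 48 | 46 | 33 | N | E | 805 | 40 | 4 | 91.5 | N |
| 5 | 594 | 169 | 4 | 74 | 51 | 35 | 11 | 4408 | 1133 | 19 | 501 | 336 | 194 | A | W | 282 | 421 | 25 | 750.0 | A |
| 6 | 185 | 37 | 1 | 23 | 8 | 21 | 2 | 214 | 42 | 1 | 30 | 9 | 24 | N | E | 76 | 127 | 7 | 70.0 | A |
| 7 | 298 | 73 | 0 | 24 | 24 | 7 | 3 | 509 | 108 | 0 | 41 | 37 | 12 | A | W | 121 | 283 | 9 | 100.0 | A |
| 8 | 323 | 81 | 6 | 26 | 32 | 8 | 2 | 341 | 86 | 6 | 32 | 34 | 8 | N | W | 143 | 290 | 19 | 75.0 | N |
| 9 | 401 | 92 | 17 | 49 | 66 | 65 | 13 | 5206 | 1332 | 253 | 784 | 890 | 866 | A | E | 0 | 0 | 0 | 1100.0 | A |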


In [363]:
# Single quotes inside the f-string: nested double quotes are a SyntaxError before Python 3.12
print(f"NaN values: {Hitters['Salary'].isna().sum()}")
Hitters = Hitters.dropna()
print(f"Shape: {Hitters.shape}")
Hitters.dtypes

NaN values: 59
Shape: (263, 20)


AtBat           int64
Hits            int64
HmRun           int64
Runs            int64
RBI             int64
Walks           int64
Years           int64
CAtBat          int64
CHits           int64
CHmRun          int64
CRuns           int64
CRBI            int64
CWalks          int64
League       category
Division     category
PutOuts         int64
Assists         int64
Errors          int64
Salary        float64
NewLeague    category
dtype: object
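The cell above only counts missing values in `Salary`, while `dropna()` drops a row if any column is missing. As a minimal sketch on a made-up toy frame (the values below are illustrative stand-ins, not the real `Hitters` rows), counting per column shows where the dropped rows come from:

```python
import numpy as np
import pandas as pd

# Toy stand-in for Hitters; the values are made up for illustration.
toy = pd.DataFrame({
    "Hits": [66, 81, 130, 141],
    "League": pd.Categorical(["A", "N", "A", "N"]),
    "Salary": [np.nan, 475.0, 480.0, np.nan],
})

# Count missing values in every column, not just Salary.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value.
toy_clean = toy.dropna()
print(toy_clean.shape)
```

On the real data, `Hitters.isna().sum()` before the `dropna()` call would give the same kind of per-column summary.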

Convert the numeric columns to `float32` and the categorical columns to `str` before working with the data any further.

In [364]:
Hitters = Hitters.astype({col: 'float32' for col in Hitters.select_dtypes(include=['int64', 'float64']).columns})
Hitters = Hitters.astype({col: 'str' for col in Hitters.select_dtypes(include=['category']).columns})
Hitters.dtypes

AtBat        float32
Hits         float32
HmRun        float32
Runs         float32
RBI          float32
Walks        float32
Years        float32
CAtBat       float32
CHits        float32
CHmRun       float32
CRuns        float32
CRBI         float32
CWalks       float32
League        object
Division      object
PutOuts      float32
Assists      float32
Errors       float32
Salary       float32
NewLeague     object
dtype: object
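The conversion above selects numeric columns by listing `int64` and `float64` explicitly. A slightly more general sketch, assuming a made-up toy frame in place of `Hitters`, selects on `"number"` so that numeric columns of any width are caught:

```python
import pandas as pd

# Toy stand-in for Hitters; the values are made up for illustration.
toy = pd.DataFrame({
    "AtBat": [293, 315],
    "Salary": [475.0, 480.0],
    "League": pd.Categorical(["A", "N"]),
})

# "number" matches every numeric dtype, whatever its bit width.
numeric_cols = toy.select_dtypes(include="number").columns
category_cols = toy.select_dtypes(include="category").columns

toy = toy.astype({col: "float32" for col in numeric_cols})
toy = toy.astype({col: "str" for col in category_cols})
print(toy.dtypes)
```

Whether `float32` is precise enough is a judgment call; it halves memory relative to `float64`, which matters little for a data set this small.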

## One-Hot Encode Categorical Data

Now we'll one-hot encode the categorical data columns using `sklearn`'s [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

**I'm still a bit uncertain whether I should one-hot encode before or after the data split**, or if the order even matters. However, for simplicity, we're just going to one-hot encode before the data split.

The categorical columns are:
* `League`
* `Division`
* `NewLeague`

In [None]:
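categorical_cols = ["League", "Division", "NewLeague"]
ohe = skl.preprocessing.OneHotEncoder()
# fit_transform returns a SciPy sparse matrix by default
encoded = ohe.fit_transform(Hitters[categorical_cols])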
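On the before-or-after question: the encoder learns its category list during `fit`, so one common pattern is to fit on the training rows only and reuse the fitted encoder on the test rows. The sketch below assumes a made-up toy frame with the same three column names; `handle_unknown="ignore"` is an optional setting (not used in the lab) that encodes a category unseen during `fit` as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the three categorical Hitters columns (made-up rows).
toy = pd.DataFrame({
    "League": ["A", "N", "A", "N", "A", "N", "A", "N"],
    "Division": ["E", "W", "W", "E", "E", "W", "E", "W"],
    "NewLeague": ["A", "N", "A", "N", "N", "A", "A", "N"],
})

train, test = train_test_split(toy, test_size=0.25, random_state=0)

# Learn the categories from the training rows only.
ohe = OneHotEncoder(handle_unknown="ignore")
train_encoded = ohe.fit_transform(train).toarray()

# Reuse the fitted encoder on the test rows; no refitting.
test_encoded = ohe.transform(test).toarray()

print(ohe.categories_)
print(train_encoded.shape, test_encoded.shape)
```

For columns like these, where every category is expected to show up in the training rows, encoding before or after the split should produce the same columns; the fit-on-train pattern mostly matters when rare categories can land only in the test set.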

## Split Data into Train and Test Sets

Now we'll split the data into train and test sets using `sklearn`'s [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [365]:
# Split into train and test data
X, y = Hitters.drop("Salary", axis=1), Hitters["Salary"]
X_train, X_test, y_train, y_test = skl.model_selection.train_test_split(X, y, 
                                                                        test_size=0.2, 
                                                                        random_state=0)

print(f"X_train\nshape: {X_train.shape}\nType: {type(X_train)}")
print()
print(f"y_train\nshape: {y_train.shape}\nType: {type(y_train)}")

X_train
shape: (210, 19)
Type: <class 'pandas.core.frame.DataFrame'>

y_train
shape: (210,)
Type: <class 'pandas.core.series.Series'>
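The 210 training rows follow from how `train_test_split` handles a float `test_size`: the test count is rounded up, and the rest goes to training. A minimal sketch with placeholder arrays (the features and target here are dummy values, only the row count of 263 comes from the output above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 263  # rows left in Hitters after dropna()
X = np.arange(n_samples * 2).reshape(n_samples, 2)  # placeholder features
y = np.arange(n_samples)                            # placeholder target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# ceil(0.2 * 263) = 53 test rows, leaving 210 training rows.
print(len(X_train), len(X_test))

# Reusing the same random_state reproduces the same split.
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
print(np.array_equal(X_train, X_train_again))
```

Fixing `random_state` keeps the split stable across reruns of the notebook, which makes later model comparisons easier to interpret.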
