# Textbook Coding Lab Review

Here we'll review everything we did in the coding lab alongside the [sklearn.linear_model](https://scikit-learn.org/stable/api/sklearn.linear_model.html) documentation. This lets us check that the practices we use match the current API, and it should also deepen your understanding of how the process works for each of these methods.

Our objective is to predict a baseball player's `Salary` based on various metrics associated with performance. Before we do any of that, let's load in the data and inspect it.

# Data Inspection and Preparation

In [150]:
from ISLP import load_data
import numpy as np
import pandas as pd
import sklearn as skl

Hitters = load_data("Hitters")

In [151]:
Hitters.head(10)

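| | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | Salary | NewLeague |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 293 | 66 | 1 | 30 | 29 | 14 | 1 | 293 | 66 | 1 | 30 | 29 | 14 | A | E | 446 | 33 | 20 | NaN | A |
| 1 | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | 69 | 321 | 414 | 375 | N | W | 632 | 43 | 10 | 475.0 | N |
| 2 | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | 63 | 224 | 266 | 263 | A | W | 880 | 82 | 14 | 480.0 | A |
| 3 | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | 225 | 828 | 838 | 354 | N | E | 200 | 11 | 3 | 500.0 | N |
| 4 | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | 12 | 48 | 46 | 33 | N | E | 805 | 40 | 4 | 91.5 | N |
| 5 | 594 | 169 | 4 | 74 | 51 | 35 | 11 | 4408 | 1133 | 19 | 501 | 336 | 194 | A | W | 282 | 421 | 25 | 750.0 | A |
| 6 | 185 | 37 | 1 | 23 | 8 | 21 | 2 | 214 | 42 | 1 | 30 | 9 | 24 | N | E | 76 | 127 | 7 | 70.0 | A |
| 7 | 298 | 73 | 0 | 24 | 24 | 7 | 3 | 509 | 108 | 0 | 41 | 37 | 12 | A | W | 121 | 283 | 9 | 100.0 | A |
| 8 | 323 | 81 | 6 | 26 | 32 | 8 | 2 | 341 | 86 | 6 | 32 | 34 | 8 | N | W | 143 | 290 | 19 | 75.0 | N |
| 9 | 401 | 92 | 17 | 49 | 66 | 65 | 13 | 5206 | 1332 | 253 | 784 | 890 | 866 | A | E | 0 | 0 | 0 | 1100.0 | A |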


In [152]:
nan_values = np.isnan(Hitters["Salary"])
print(f"NaN values: {nan_values.sum()}")
print(f"Data shape: {Hitters.shape}")
Hitters.dtypes

NaN values: 59
Data shape: (322, 20)


AtBat           int64
Hits            int64
HmRun           int64
Runs            int64
RBI             int64
Walks           int64
Years           int64
CAtBat          int64
CHits           int64
CHmRun          int64
CRuns           int64
CRBI            int64
CWalks          int64
League       category
Division     category
PutOuts         int64
Assists         int64
Errors          int64
Salary        float64
NewLeague    category
dtype: object

In [153]:
Hitters = Hitters.dropna()
Hitters.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| AtBat | 263.0 | 403.642586 | 147.307209 | 19.0 | 282.5 | 413.0 | 526.0 | 687.0 |
| Hits | 263.0 | 107.828897 | 45.125326 | 1.0 | 71.5 | 103.0 | 141.5 | 238.0 |
| HmRun | 263.0 | 11.619772 | 8.757108 | 0.0 | 5.0 | 9.0 | 18.0 | 40.0 |
| Runs | 263.0 | 54.745247 | 25.539816 | 0.0 | 33.5 | 52.0 | 73.0 | 130.0 |
| RBI | 263.0 | 51.486692 | 25.882714 | 0.0 | 30.0 | 47.0 | 71.0 | 121.0 |
| Walks | 263.0 | 41.114068 | 21.718056 | 0.0 | 23.0 | 37.0 | 57.0 | 105.0 |
| Years | 263.0 | 7.311787 | 4.793616 | 1.0 | 4.0 | 6.0 | 10.0 | 24.0 |
| CAtBat | 263.0 | 2657.543726 | 2286.582929 | 19.0 | 842.5 | 1931.0 | 3890.5 | 14053.0 |
| CHits | 263.0 | 722.186312 | 648.199644 | 4.0 | 212.0 | 516.0 | 1054.0 | 4256.0 |
| CHmRun | 263.0 | 69.239544 | 82.197581 | 0.0 | 15.0 | 40.0 | 92.5 | 548.0 |
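
Note that `dropna()` with no arguments drops a row if *any* column is missing, while the earlier check only looked at `Salary`. The following is a minimal sketch, on a small made-up frame (`toy` is hypothetical and is not the Hitters data), of counting missing values in every column and restricting the drop to the target column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for Hitters; the values are made up.
toy = pd.DataFrame({
    "Hits": [66, 81, 130, 141],
    "League": pd.Categorical(["A", "N", "A", "N"]),
    "Salary": [np.nan, 475.0, 480.0, np.nan],
})

# Count missing values in every column, not just the target.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop only the rows whose target value is missing.
toy_clean = toy.dropna(subset=["Salary"])
print(toy_clean.shape)
```

In this lab the two forms give the same result: 322 - 59 = 263 rows remain after `dropna()`, which matches the `count` of 263 in the summary, so the rows with a missing `Salary` account for every dropped row.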


In [154]:
# Separate the features (X) from the target (y)
X, y = Hitters.drop("Salary", axis=1), Hitters["Salary"]
print(f"`X` type: {type(X)}")
print(f"`y` type: {type(y)}")

`X` type: <class 'pandas.core.frame.DataFrame'>
`y` type: <class 'pandas.core.series.Series'>


In [155]:
X.head(10)

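| | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | NewLeague |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | 69 | 321 | 414 | 375 | N | W | 632 | 43 | 10 | N |
| 2 | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | 63 | 224 | 266 | 263 | A | W | 880 | 82 | 14 | A |
| 3 | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | 225 | 828 | 838 | 354 | N | E | 200 | 11 | 3 | N |
| 4 | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | 12 | 48 | 46 | 33 | N | E | 805 | 40 | 4 | N |
| 5 | 594 | 169 | 4 | 74 | 51 | 35 | 11 | 4408 | 1133 | 19 | 501 | 336 | 194 | A | W | 282 | 421 | 25 | A |
| 6 | 185 | 37 | 1 | 23 | 8 | 21 | 2 | 214 | 42 | 1 | 30 | 9 | 24 | N | E | 76 | 127 | 7 | A |
| 7 | 298 | 73 | 0 | 24 | 24 | 7 | 3 | 509 | 108 | 0 | 41 | 37 | 12 | A | W | 121 | 283 | 9 | A |
| 8 | 323 | 81 | 6 | 26 | 32 | 8 | 2 | 341 | 86 | 6 | 32 | 34 | 8 | N | W | 143 | 290 | 19 | N |
| 9 | 401 | 92 | 17 | 49 | 66 | 65 | 13 | 5206 | 1332 | 253 | 784 | 890 | 866 | A | E | 0 | 0 | 0 | A |
| 10 | 574 | 159 | 21 | 107 | 75 | 59 | 10 | 4631 | 1300 | 90 | 702 | 504 | 488 | A | E | 238 | 445 | 22 | A |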


In [156]:
y.head(10)

1      475.000
2      480.000
3      500.000
4       91.500
5      750.000
6       70.000
7      100.000
8       75.000
9     1100.000
10     517.143
Name: Salary, dtype: float64

Now we'll inspect the correlations.
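
The cell for this step is not shown here, so what follows is only a sketch of one way to do it, not the lab's original code. It assumes pandas 1.5 or later, where `DataFrame.corr` accepts `numeric_only=True` and therefore skips the categorical columns (`League`, `Division`, `NewLeague`). The helper name `target_correlations` and the `toy` frame are hypothetical and made up for illustration.

```python
import pandas as pd

def target_correlations(df: pd.DataFrame, target: str = "Salary") -> pd.Series:
    """Pearson correlation of each numeric column with `target`, strongest first."""
    corr = df.corr(numeric_only=True)          # non-numeric columns are skipped
    with_target = corr[target].drop(target)    # remove the trivial self-correlation
    order = with_target.abs().sort_values(ascending=False).index
    return with_target.loc[order]

# Hypothetical toy data, made up for illustration only.
toy = pd.DataFrame({
    "Hits": [60, 80, 130, 140, 90],
    "Errors": [20, 10, 14, 3, 4],
    "League": pd.Categorical(["A", "N", "A", "N", "N"]),
    "Salary": [70.0, 100.0, 480.0, 500.0, 91.5],
})
result = target_correlations(toy)
print(result)
```

With the real data, `target_correlations(Hitters)` would rank the numeric predictors by the strength of their linear association with `Salary`.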

Next, we'll split the data into training and test sets using [sklearn's train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [159]:
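# Import the function explicitly: `import sklearn` alone is not guaranteed
# to load the `model_selection` submodule.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)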

In [172]:
print(f"X_train\nshape: {X_train.shape}\nType: {type(X_train)}")
print()
print(f"y_train\nshape: {y_train.shape}\nType: {type(y_train)}")


X_train
shape: (210, 19)
Type: <class 'pandas.core.frame.DataFrame'>

y_train
shape: (210,)
Type: <class 'pandas.core.series.Series'>
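
To see where the 210 comes from: with a fractional `test_size`, scikit-learn rounds the test count up, so 263 rows at `test_size=0.2` give ceil(52.6) = 53 test rows and 263 - 53 = 210 training rows. The sketch below checks this on placeholder arrays of the same length (`X_demo` and `y_demo` are stand-ins, not the Hitters data) and also shows that a fixed `random_state` reproduces the same split.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 263  # rows left in Hitters after dropna()
X_demo = np.arange(n_samples * 2).reshape(n_samples, 2)  # placeholder features
y_demo = np.arange(n_samples)                            # placeholder target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The test count is rounded up; the training set gets the remainder.
expected_test = math.ceil(0.2 * n_samples)
print(X_tr.shape, X_te.shape, expected_test)

# Repeating the call with the same seed yields the same partition.
_, _, y_tr_again, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
same_split = np.array_equal(y_tr, y_tr_again)
print(same_split)
```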
