# Pre-processing and Training
In this notebook I complete step 4 of my capstone project, which requires me to:

Create dummy or indicator features for categorical variables

Standardize the magnitude of numeric features using a scaler

Split your data into testing and training datasets


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

In [2]:
df = pd.read_excel(r'/Users/carlriemann/Documents/GitHub/Capstone-Two-EDA/EDA project1.xlsx')

In [3]:
df.shape

(2851, 30)

In [4]:
print(df.columns)

Index(['Player', 'Age', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA',
       '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB',
       'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Tm', 'Season', 'Pos'],
      dtype='object')


In [5]:
print(df.dtypes)

Player     object
Age         int64
G           int64
GS          int64
MP        float64
FG        float64
FGA       float64
FG%       float64
3P        float64
3PA       float64
3P%       float64
2P        float64
2PA       float64
2P%       float64
eFG%      float64
FT        float64
FTA       float64
FT%       float64
ORB       float64
DRB       float64
TRB       float64
AST       float64
STL       float64
BLK       float64
TOV       float64
PF        float64
PTS       float64
Tm         object
Season     object
Pos        object
dtype: object


The main goal of this step is to clean, transform, and split the data so that it is ready for training machine learning models.

In [7]:
# Additional data cleaning so the data is ready for training.
# Replace 'Unknown' positions with the most common position in the dataset.
most_common_pos = df['Pos'].mode()[0]
df['Pos'] = df['Pos'].replace('Unknown', most_common_pos)
# Drop the Player column: it identifies individual rows and carries no predictive signal.
df = df.drop(columns=['Player'])
# Keep only the starting year of each season as an integer (this avoids errors when fitting the model on a string column).
df['Season'] = df['Season'].apply(lambda x: int(x.split('-')[0]))
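
The cleaning steps above can be checked on a small toy frame. This is a minimal sketch: the rows are made up for illustration and are not taken from the project spreadsheet, and it assumes Season strings look like '2019-20'. Unlike the cell above, it computes the mode over known positions only, so that 'Unknown' can never be chosen as the fill value.

```python
import pandas as pd

# Hypothetical toy data, not from the project spreadsheet.
toy = pd.DataFrame({
    'Pos': ['PG', 'C', 'Unknown', 'PG'],
    'Season': ['2019-20', '2020-21', '2021-22', '2019-20'],
})

# Find the most common position among rows where the position is known.
known_positions = toy.loc[toy['Pos'] != 'Unknown', 'Pos']
most_common = known_positions.mode()[0]
toy['Pos'] = toy['Pos'].replace('Unknown', most_common)

# Keep only the starting year of the season, as an integer.
toy['Season'] = toy['Season'].str.split('-').str[0].astype(int)

print(toy)
```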

In [8]:
#Creating dummy variables
df = pd.get_dummies(df, columns=['Tm', 'Pos'], drop_first=True)

I identified the Tm (team) and Pos (position) columns as the categorical features in my dataset.
To include them in the model, I converted them to dummy/indicator variables with pd.get_dummies(), passing drop_first=True so that one level of each feature acts as the baseline.
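
To show what drop_first=True does, here is a small sketch on made-up data (the toy frame and its values are assumptions for illustration only). With three positions, only two indicator columns are created, and the dropped level is represented by a row where both indicators are 0.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column.
toy = pd.DataFrame({'Pos': ['C', 'PG', 'SG', 'PG'], 'Age': [25, 30, 22, 27]})

# 'C' sorts first among the categories, so its indicator column is dropped.
encoded = pd.get_dummies(toy, columns=['Pos'], drop_first=True)

print(list(encoded.columns))
print(encoded)
```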

In [9]:
# Select the numerical columns to scale
numerical_columns = ['Age', 'G', 'GS', 'MP', 'FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 
                     'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']
# Standardize these columns with StandardScaler so that each has a mean of 0 and a standard deviation of 1.
# Note: PTS (the target) is in this list, and the scaler is fit on the full dataset before the split.
scaler = StandardScaler()
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

# Define the features (X) and the target (y)
X = df.drop(columns=['PTS'])
y = df['PTS']

In [10]:
# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
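
In the cells above, the scaler is fit on the full dataset before the split, so the test rows contribute to the means and standard deviations. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the column values and the names X_toy, X_tr and X_te are made up for illustration and are not part of the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (values are random, not real stats).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'MP': rng.normal(24, 8, 200),
    'AST': rng.normal(3, 2, 200),
    'PTS': rng.normal(10, 5, 200),
})
X_toy = toy.drop(columns=['PTS'])
y_toy = toy['PTS']

# Split first, so the test rows stay unseen during scaling.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Learn the mean and standard deviation from the training rows only,
# then apply the same transformation to the test rows.
toy_scaler = StandardScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)
X_te_scaled = toy_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
```

With this ordering the training columns have mean 0 and standard deviation 1, while the test columns will only be approximately standardized, which is expected.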

In [11]:
# Fit a linear regression model as a quick check of the preprocessed data
# (LinearRegression and mean_squared_error were imported in the first cell)
from sklearn.metrics import r2_score

model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Mean Squared Error: 0.00011804687556539186
R-squared: 0.999887617886216


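I trained a Linear Regression model on the preprocessed data.
On the test set, the model achieved a Mean Squared Error (MSE) of 0.000118 and an R-squared value of 0.9999. Because PTS was standardized along with the other numerical columns, the MSE is in standardized units rather than points per game. The near-perfect fit should be read with caution: the features include 2P, 3P, and FT, from which points are computed directly, so the score says more about that relationship than about the model's predictive accuracy.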
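
To illustrate how a score this close to 1 can arise, here is a minimal sketch on synthetic data (the values and variable names are made up, not drawn from the project spreadsheet). It relies on the basketball scoring rule PTS = 2 * 2P + 3 * 3P + FT and shows that a linear model recovers that rule almost exactly when those columns are available as features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic per-game shot totals (random values, for illustration only).
rng = np.random.default_rng(0)
two_p = rng.uniform(0, 8, 300)
three_p = rng.uniform(0, 4, 300)
ft = rng.uniform(0, 6, 300)

# Points follow directly from made shots: 2 per 2P, 3 per 3P, 1 per FT.
pts = 2 * two_p + 3 * three_p + ft

X_toy = np.column_stack([two_p, three_p, ft])
toy_model = LinearRegression().fit(X_toy, pts)
toy_r2 = r2_score(pts, toy_model.predict(X_toy))

print(toy_model.coef_, toy_r2)
```

One way to test whether the real model has learned anything beyond this identity would be to drop the shot-total columns (FG, 2P, 3P, FT and their attempt counts) from X and refit.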