# Export to Pickle

To deploy our machine learning model in a Streamlit app, we export the Ridge Regression model to a pickle file. Ridge Regression was the best-performing model in our analysis in the `03_Modelling` notebook.

In [6]:
import os
import pandas as pd
import pickle


# Define the path to the data folder
data_folder = r'C:\cloudresume\react\resume\sports-data\data'

# Load the cleaned data from a CSV file in the data folder into a DataFrame
df = pd.read_csv(os.path.join(data_folder, 'cleaned_df.csv'))

# Define the features to be used in the model
features = [
    'number_games_played',
    'total_minutes',
    'avg_goals_per_game',
    'goals',
    'assists',
    'age',
    'avg_games_per_year',
    'avg_goals_per_year', 
    'position'
]

## Drop Rows with Missing Values

For simplicity, we drop every row that contains a missing value instead of imputing it.

In [10]:
# Record the shape before dropping so the two can be compared
shape_before = df.shape

# Drop all rows with any NA values
df = df.dropna()

# Verify the operation by showing the shape before and after dropping NA
print("Shape before dropping NA:", shape_before)
print("Shape after dropping NA:", df.shape)

Shape before dropping NA: (7540, 23)
Shape after dropping NA: (7540, 23)
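
Calling `df.dropna()` with no arguments discards a row if any of its 23 columns is missing, including columns the model never uses. One possible alternative is to restrict the check to the model's columns with the `subset` parameter. The sketch below uses a small made-up DataFrame; the values, the `unused_column` name, and the shortened column list are hypothetical.

```python
import pandas as pd

# Made-up data for illustration only; not taken from cleaned_df.csv
toy = pd.DataFrame({
    'goals': [10, 5, None],
    'position': ['Attack', 'Defender', 'Midfield'],
    'highest_market_value': [1_000_000, 2_000_000, 3_000_000],
    'unused_column': [None, 'x', 'y'],
})

# Columns the model actually needs (a shortened, illustrative list)
model_columns = ['goals', 'position', 'highest_market_value']

# No arguments: a row is dropped if ANY column is missing
dropped_any = toy.dropna()

# With subset: a row is dropped only if one of the listed columns is missing
dropped_subset = toy.dropna(subset=model_columns)

print("Rows kept with dropna():", len(dropped_any))
print("Rows kept with dropna(subset=...):", len(dropped_subset))
```

In the notebook itself, the equivalent call would presumably be `df.dropna(subset=features + [target])`, run after `target` has been defined.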


## Ridge Regression Model

We prepare the dataset by one-hot encoding the categorical `position` feature and passing the numerical features through unscaled (a scaler can be swapped in if needed), then train a Ridge Regression model.

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score


target = 'highest_market_value'

# Split data into features and target
X = df[features]
y = df[target]

# Separate the categorical column from the numerical ones,
# keeping the original order of the features list
categorical_features = ['position']
numerical_features = [f for f in features if f not in categorical_features]

# Preprocessing for numerical data
numerical_transformer = 'passthrough'  # or use StandardScaler(), MinMaxScaler(), etc.

# Preprocessing for categorical data
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Build the pipeline: preprocessing followed by Ridge Regression (default alpha=1.0)
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('regressor', Ridge())])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the model
model.fit(X_train, y_train)

# Predict and evaluate the model
y_pred = model.predict(X_test)
print('Mean squared error (MSE):', mean_squared_error(y_test, y_pred))
print('Coefficient of determination (R^2):', r2_score(y_test, y_pred))

Mean squared error (MSE): 125274289739646.83
Coefficient of determination (R^2): 0.5022027677652146
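
MSE is expressed in squared units of the target, so the figure above is hard to interpret directly. Taking its square root gives the root mean squared error (RMSE), which is in the same units as `highest_market_value`. Here is a minimal sketch that uses only the MSE value printed above; the variable names are illustrative and not part of the notebook.

```python
import math

# MSE as printed in the cell output above
mse = 125274289739646.83

# RMSE is the square root of MSE, in the same units as the target
rmse = math.sqrt(mse)

print(f"Root mean squared error (RMSE): {rmse:,.0f}")
```

This works out to roughly 11.2 million, so a typical prediction error is on that order of magnitude in market-value units.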


## Saving the Model

We will now save the trained Ridge Regression model to a file using pickle. This will allow us to load the model directly in our Streamlit app without having to retrain it.

In [68]:
# Save the model to a file using pickle
with open('ridge_regression_model.pkl', 'wb') as file:
    pickle.dump(model, file)

# Confirm that the model has been saved
print('Model has been saved as "ridge_regression_model.pkl"')

Model has been saved as "ridge_regression_model.pkl"
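
To show the other half of the workflow, here is a minimal sketch of a pickle round trip: save an object, load it back, and confirm it survived. The `load_model` helper is hypothetical. A plain dictionary stands in for the fitted pipeline so that the sketch runs without the training data.

```python
import os
import pickle
import tempfile


def load_model(path):
    """Load a pickled object from disk. Only unpickle files you trust."""
    with open(path, 'rb') as file:
        return pickle.load(file)


# Stand-in for the fitted pipeline so the sketch runs on its own;
# in the notebook this would be the `model` object trained above.
stand_in_model = {'name': 'ridge_regression', 'alpha': 1.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'ridge_regression_model.pkl')

    # Save the object the same way as in the cell above
    with open(path, 'wb') as file:
        pickle.dump(stand_in_model, file)

    # Load it back and confirm the round trip preserved the object
    restored = load_model(path)

print(restored == stand_in_model)
```

In the Streamlit app, the same `pickle.load` call would return the full pipeline, including the preprocessing step, so `predict` would expect a DataFrame with the same feature columns used in training. Pickled scikit-learn models are generally only safe to load with the same scikit-learn version that created them.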
