# Online Gaming Behavior EDA:
## Technical Modeling

dataset: [Kaggle](https://www.kaggle.com/datasets/rabieelkharoua/predict-online-gaming-behavior-dataset/data)

____

In the descriptive and inferential analysis notebook, the findings were fairly simple. The data was very clean, and after running a bunch of questions through the wringer, it seems that whoever curated this dataset did a great job of finding well-balanced samples for each section. The main differences showed up when males and females were compared. As the stereotype goes, males play more games. However, the disparity wasn't extreme, probably close to a 3-to-2 ratio.

The inferential section didn't find any significant effects in the data, at least for the questions I posed. I hope that means I missed something in the initial findings and that I will be able to come up with a model with solid prediction accuracy. Since this dataset is focused on online behavior in the sense of engagement level, we will use that as our target variable. I will create dummy variables from the categorical features and then scale the data so that features measured on different scales are easier to compare.

In [1]:
# Standard Data Science/Analysis Toolkit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt; plt.style.use("ggplot")
import seaborn as sns

# Machine Learning Tools, Utilities, and Scoring Metrics
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from sklearn.pipeline import Pipeline

# Suite of Machine Learning Algorithms
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LinearRegression, LogisticRegression

# Setup to Ignore Version Errors and Deprecations
import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv('data/online_gaming_behavior.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40034 entries, 0 to 40033
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   PlayerID                   40034 non-null  int64  
 1   Age                        40034 non-null  int64  
 2   Gender                     40034 non-null  object 
 3   Location                   40034 non-null  object 
 4   GameGenre                  40034 non-null  object 
 5   PlayTimeHours              40034 non-null  float64
 6   InGamePurchases            40034 non-null  int64  
 7   GameDifficulty             40034 non-null  object 
 8   SessionsPerWeek            40034 non-null  int64  
 9   AvgSessionDurationMinutes  40034 non-null  int64  
 10  PlayerLevel                40034 non-null  int64  
 11  AchievementsUnlocked       40034 non-null  int64  
 12  EngagementLevel            40034 non-null  object 
dtypes: float64(1), int64(7), object(5)
memory usag

We will repeat some of the data cleaning from the previous notebook. The steps are the same, so please see the descriptive analysis notebook for the reasoning.

In [3]:
df = df.drop(['PlayerID', 'InGamePurchases', 'PlayerLevel', 'AchievementsUnlocked'], axis=1)
df = df.rename(columns={'Gender': 'Sex'})
df['AgeCategory'] = pd.cut(df['Age'], [15, 24, 34, 44, 49], labels=['15-24', '25-34', '35-44', '45+'], include_lowest=True)
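To make the binning above concrete, here is a small sketch of how `pd.cut` assigns labels with these edges. The sample ages are made up for illustration. Note that an age above the last edge (49) comes back as NaN, so the binning quietly assumes that no player in the data is older than 49.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([15, 24, 25, 44, 45, 49, 50])

bins = [15, 24, 34, 44, 49]
labels = ['15-24', '25-34', '35-44', '45+']

# Intervals are right-closed: (15, 24], (24, 34], ...
# include_lowest=True also pulls the left edge (15) into the first bin
age_category = pd.cut(ages, bins, labels=labels, include_lowest=True)

print(pd.DataFrame({'Age': ages, 'AgeCategory': age_category}))
```

If the real data ever contains ages outside the 15 to 49 range, a quick `df['AgeCategory'].isna().sum()` after the cut would reveal it.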

In [4]:
df.head(5)

| | Age | Sex | Location | GameGenre | PlayTimeHours | GameDifficulty | SessionsPerWeek | AvgSessionDurationMinutes | EngagementLevel | AgeCategory |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 43 | Male | Other | Strategy | 16.271119 | Medium | 6 | 108 | Medium | 35-44 |
| 1 | 29 | Female | USA | Strategy | 5.525961 | Medium | 5 | 144 | Medium | 25-34 |
| 2 | 22 | Female | USA | Sports | 8.223755 | Easy | 16 | 142 | High | 15-24 |
| 3 | 35 | Male | USA | Action | 5.265351 | Easy | 9 | 85 | Medium | 35-44 |
| 4 | 33 | Male | Europe | Action | 15.531945 | Medium | 2 | 131 | Medium | 25-34 |


We are using EngagementLevel as our target, and we will start with all the other columns as our features. We can tweak the feature set after our shotgun round of tests.

In [5]:
df = pd.get_dummies(df, columns=['Sex', 'Location', 'GameGenre', 'GameDifficulty', 'AgeCategory'], drop_first=True)
# The drop_first=True parameter avoids multicollinearity by dropping the first category in each variable
df

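| | Age | PlayTimeHours | SessionsPerWeek | AvgSessionDurationMinutes | EngagementLevel | Sex_Male | Location_Europe | Location_Other | Location_USA | GameGenre_RPG | GameGenre_Simulation | GameGenre_Sports | GameGenre_Strategy | GameDifficulty_Hard | GameDifficulty_Medium | AgeCategory_25-34 | AgeCategory_35-44 | AgeCategory_45+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 43 | 16.271119 | 6 | 108 | Medium | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 1 | 29 | 5.525961 | 5 | 144 | Medium | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 2 | 22 | 8.223755 | 16 | 142 | High | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 5.265351 | 9 | 85 | Medium | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 33 | 15.531945 | 2 | 131 | Medium | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 40029 | 32 | 20.619662 | 4 | 75 | Medium | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 40030 | 44 | 13.539280 | 19 | 114 | High | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 40031 | 15 | 0.240057 | 10 | 176 | High | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 40032 | 34 | 14.017818 | 3 | 128 | Medium | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |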
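As a small illustration of what `drop_first=True` does in the encoding step, the sketch below encodes a toy `GameDifficulty` column with and without it. The four-row frame is hypothetical; only the category names come from the data above.

```python
import pandas as pd

# Hypothetical mini-frame using category names seen in the data above
toy = pd.DataFrame({'GameDifficulty': ['Easy', 'Medium', 'Hard', 'Easy']})

full = pd.get_dummies(toy, columns=['GameDifficulty'], dtype=int)
reduced = pd.get_dummies(toy, columns=['GameDifficulty'], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())

# With drop_first=True, 'Easy' (first alphabetically) becomes the baseline:
# a row with 0 in both remaining columns is an 'Easy' row
baseline_rows = reduced[reduced.sum(axis=1) == 0]
print(baseline_rows.index.tolist())
```

The dropped category is not lost: it is implied whenever all of the remaining dummy columns for that feature are 0, which is why no `GameDifficulty_Easy` column appears in the encoded frame.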


In [6]:
X = df.drop('EngagementLevel', axis=1)
y = df['EngagementLevel']

In [7]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
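The split above shuffles rows without regard to class. A possible refinement, sketched here on a hypothetical toy target rather than the real `y`, is to pass `stratify` so that the train and test sets keep the same class proportions. The class mix below (half Medium, a quarter each High and Low) is an assumption chosen to make the arithmetic exact.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: half Medium, a quarter each High and Low
y_toy = pd.Series(['Medium'] * 100 + ['High'] * 50 + ['Low'] * 50)
X_toy = pd.DataFrame({'feature': np.arange(len(y_toy))})

# stratify=y_toy keeps the class proportions identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(y_tr.value_counts(normalize=True).sort_index())
print(y_te.value_counts(normalize=True).sort_index())
```

With roughly 40,000 rows, a plain random split will usually land close to the true proportions anyway, so this is a safeguard more than a fix.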

In [8]:
# 1. Logistic Regression
# Chosen because it's a simple and interpretable linear model, good for baseline performance.
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))

Logistic Regression Accuracy: 0.7472317042710849
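The first cell silences all warnings, so if the logistic regression had stopped at its iteration limit on these unscaled features, the message would never have appeared. Whether that happened here is not confirmed by the notebook. The sketch below only shows one way to surface such warnings locally. It uses synthetic data and a deliberately tiny `max_iter=1` to force the situation, both of which are assumptions for illustration.

```python
import warnings
import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real features, stretched to very different scales
X_toy, y_toy = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_classes=3, random_state=42
)
X_toy = X_toy * np.array([1, 10, 100, 1000, 1, 1])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # override any global "ignore" filter
    # max_iter=1 is deliberately too small, to trigger the warning
    LogisticRegression(max_iter=1).fit(X_toy, y_toy)

n_convergence_warnings = sum(
    issubclass(w.category, ConvergenceWarning) for w in caught
)
print("ConvergenceWarning count:", n_convergence_warnings)
```

Wrapping only the `fit` call this way keeps the rest of the notebook quiet while still showing whether the solver finished.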


In [9]:
# 2. Random Forest Classifier
# Selected as it’s an ensemble method, combining multiple decision trees to improve accuracy and reduce overfitting.
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))

Random Forest Accuracy: 0.8746981933227874


In [10]:
# 3. Support Vector Machine (SVM)
# SVM is effective in high-dimensional spaces; here the default RBF kernel lets it model non-linear class boundaries.
svm = SVC()
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))

SVM Accuracy: 0.8687036882857381


In [11]:
# 4. K-Nearest Neighbors (KNN)
# KNN is a simple, instance-based learning method that works well for smaller datasets.
knn = KNN()
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print("KNN Accuracy:", accuracy_score(y_test, y_pred_knn))

KNN Accuracy: 0.8469736075264341


In [12]:
# 5. Gaussian Naive Bayes
# Naive Bayes is fast to train; the Gaussian variant assumes each feature is normally distributed within a class.
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))

Naive Bayes Accuracy: 0.8238281575222712


In [13]:
# 6. Decision Tree Classifier
# Decision Trees are easy to interpret and can handle both numerical and categorical data.
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))

Decision Tree Accuracy: 0.8000999084172842


In [14]:
# 7. Gradient Boosting Classifier
# Gradient Boosting is another ensemble method like Random Forest, but it builds trees sequentially to reduce errors.
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)
print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred_gb))

Gradient Boosting Accuracy: 0.8790275580717676


In [15]:
# Display classification reports
print("\nClassification Report for Logistic Regression:\n", classification_report(y_test, y_pred_lr))
print("----------------------------------------\n")
print("\nClassification Report for Random Forest:\n", classification_report(y_test, y_pred_rf))


Classification Report for Logistic Regression:
               precision    recall  f1-score   support

        High       0.79      0.73      0.76      3132
         Low       0.74      0.64      0.69      3069
      Medium       0.73      0.81      0.77      5810

    accuracy                           0.75     12011
   macro avg       0.75      0.73      0.74     12011
weighted avg       0.75      0.75      0.75     12011

----------------------------------------


Classification Report for Random Forest:
               precision    recall  f1-score   support

        High       0.91      0.85      0.88      3132
         Low       0.85      0.83      0.84      3069
      Medium       0.87      0.91      0.89      5810

    accuracy                           0.87     12011
   macro avg       0.88      0.86      0.87     12011
weighted avg       0.88      0.87      0.87     12011



In [16]:
print("\nClassification Report for SVM:\n", classification_report(y_test, y_pred_svm))
print("----------------------------------------\n")
print("\nClassification Report for KNN:\n", classification_report(y_test, y_pred_knn))


Classification Report for SVM:
               precision    recall  f1-score   support

        High       0.91      0.85      0.88      3132
         Low       0.84      0.80      0.82      3069
      Medium       0.86      0.91      0.89      5810

    accuracy                           0.87     12011
   macro avg       0.87      0.86      0.86     12011
weighted avg       0.87      0.87      0.87     12011

----------------------------------------


Classification Report for KNN:
               precision    recall  f1-score   support

        High       0.88      0.85      0.87      3132
         Low       0.80      0.79      0.80      3069
      Medium       0.85      0.87      0.86      5810

    accuracy                           0.85     12011
   macro avg       0.85      0.84      0.84     12011
weighted avg       0.85      0.85      0.85     12011



In [17]:
print("\nClassification Report for Gaussian Naive Bayes:\n", classification_report(y_test, y_pred_nb))
print("----------------------------------------\n")
print("\nClassification Report for Decision Tree Classifier:\n", classification_report(y_test, y_pred_dt))


Classification Report for Gaussian Naive Bayes:
               precision    recall  f1-score   support

        High       0.93      0.77      0.84      3132
         Low       0.85      0.67      0.75      3069
      Medium       0.78      0.93      0.85      5810

    accuracy                           0.82     12011
   macro avg       0.85      0.79      0.81     12011
weighted avg       0.83      0.82      0.82     12011

----------------------------------------


Classification Report for Decision Tree Classifier:
               precision    recall  f1-score   support

        High       0.79      0.81      0.80      3132
         Low       0.75      0.74      0.75      3069
      Medium       0.83      0.83      0.83      5810

    accuracy                           0.80     12011
   macro avg       0.79      0.79      0.79     12011
weighted avg       0.80      0.80      0.80     12011



In [18]:
print("\nClassification Report for Gradient Booster Classifier:\n", classification_report(y_test, y_pred_gb))


Classification Report for Gradient Booster Classifier:
               precision    recall  f1-score   support

        High       0.90      0.87      0.89      3132
         Low       0.86      0.82      0.84      3069
      Medium       0.88      0.91      0.89      5810

    accuracy                           0.88     12011
   macro avg       0.88      0.87      0.87     12011
weighted avg       0.88      0.88      0.88     12011
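The classification reports show how well each class is recovered, but not which classes get confused with each other. `confusion_matrix` (already imported above) fills that gap. The labels below are hypothetical and are not the notebook's predictions. In the notebook, `y_test` and one of the `y_pred_*` arrays would take their place.

```python
from sklearn.metrics import confusion_matrix

labels = ['Low', 'Medium', 'High']  # explicit order instead of alphabetical

# Hypothetical labels, not taken from the notebook's predictions
y_true = ['Low', 'Low', 'Medium', 'Medium', 'Medium', 'High', 'High', 'Low']
y_pred = ['Low', 'Medium', 'Medium', 'Medium', 'High', 'High', 'High', 'Low']

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

Passing `labels` explicitly keeps the rows and columns in Low, Medium, High order, which is easier to read than the default alphabetical order.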



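## Results from the Unscaled Models

On the unscaled features, Gradient Boosting scored highest at about 0.879 accuracy, followed closely by Random Forest (0.875) and SVM (0.869). KNN (0.847), Naive Bayes (0.824), and Decision Tree (0.800) land in the middle, and Logistic Regression trails at 0.747. In every classification report, the Low class has the weakest recall and f1-score, so that is where the models struggle most.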

Before we select a model to dig deeper into, let's run a pipeline that scales the data first and then runs the prediction tests. 

In [19]:
# Create pipelines for each classifier with StandardScaler
pipelines = {
    'Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression())
    ]),
    'Random Forest': Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', RandomForestClassifier())
    ]),
    'SVM': Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', SVC())
    ]),
    'KNN': Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', KNN())
    ]),
    'Naive Bayes': Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', GaussianNB())
    ]),
    'Decision Tree': Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', DecisionTreeClassifier())
    ]),
    'Gradient Boosting': Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', GradientBoostingClassifier())
    ])
}
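One reason to put the scaler inside a `Pipeline` rather than scaling the whole frame up front is that the scaler's statistics are learned from the training rows only. The sketch below checks this on hypothetical data where train and test deliberately have different means. The arrays and the toy target are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: train and test deliberately have different means
rng = np.random.default_rng(42)
X_tr = rng.normal(loc=10.0, scale=2.0, size=(200, 2))
X_te = rng.normal(loc=50.0, scale=2.0, size=(50, 2))
y_tr = (X_tr[:, 0] > 10.0).astype(int)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])
pipe.fit(X_tr, y_tr)

# The scaler's statistics come from the training rows only
scaler = pipe.named_steps['scaler']
print("scaler mean:", scaler.mean_)
print("train mean: ", X_tr.mean(axis=0))

# predict() reuses those statistics on the test rows; it never refits
pipe.predict(X_te)
print("scaler mean after predict:", scaler.mean_)
```

Because nothing about the test rows reaches the scaler, the test accuracy is not flattered by information leaking across the split.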

In [20]:
model_accuracies = {}
# Train and evaluate each pipeline
for name, pipeline in pipelines.items():
    print(f"\n{name} Pipeline Results:")
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    model_accuracies[name] = {'accuracy': accuracy, 'report': report}
    print(f"Accuracy: {accuracy:.4f}")
    # print(f"Classification Report:\n{classification_report(y_test, y_pred)}")



Logistic Regression Pipeline Results:
Accuracy: 0.8150

Random Forest Pipeline Results:
Accuracy: 0.8762

SVM Pipeline Results:
Accuracy: 0.8623

KNN Pipeline Results:
Accuracy: 0.7301

Naive Bayes Pipeline Results:
Accuracy: 0.8238

Decision Tree Pipeline Results:
Accuracy: 0.7994

Gradient Boosting Pipeline Results:
Accuracy: 0.8790
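KNN drops from about 0.847 unscaled to 0.730 once every column is standardized. One plausible contributor, which the notebook does not test, is that standardizing the 0/1 dummy columns changes how much they weigh in the distance calculation relative to the numeric features. A way to check would be to scale only the numeric columns with `ColumnTransformer`. The toy rows and the choice of `n_neighbors=3` below are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical rows shaped like the encoded frame above
toy = pd.DataFrame({
    'Age': [43, 29, 22, 35, 33, 18],
    'PlayTimeHours': [16.3, 5.5, 8.2, 5.3, 15.5, 2.0],
    'Sex_Male': [1, 0, 0, 1, 1, 0],
    'GameDifficulty_Hard': [0, 0, 0, 0, 0, 1],
})
y_toy = ['Medium', 'Medium', 'High', 'Medium', 'Medium', 'Low']

numeric_cols = ['Age', 'PlayTimeHours']

# Standardize the numeric columns; pass the 0/1 dummies through untouched
preprocess = ColumnTransformer(
    transformers=[('num', StandardScaler(), numeric_cols)],
    remainder='passthrough',
)

knn_pipeline = Pipeline([
    ('preprocess', preprocess),
    ('classifier', KNeighborsClassifier(n_neighbors=3)),
])
knn_pipeline.fit(toy, y_toy)

# Output order: scaled numeric columns first, then the passthrough columns
transformed = knn_pipeline.named_steps['preprocess'].transform(toy)
print(transformed.round(2))
```

Swapping this preprocessing step into the KNN pipeline and comparing accuracies would show whether the dummy columns are behind the drop.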


In [22]:
# Find the model with the highest accuracy
best_model = max(model_accuracies, key=lambda x: model_accuracies[x]['accuracy'])
best_accuracy = model_accuracies[best_model]['accuracy']
best_report = model_accuracies[best_model]['report']


# Print the model with the highest accuracy and its accuracy
print(f"\nBest Model: {best_model} with Accuracy: {best_accuracy:.4f}")
print(f'Classification Report: \n{best_report}')


Best Model: Gradient Boosting with Accuracy: 0.8790
Classification Report: 
              precision    recall  f1-score   support

        High       0.90      0.87      0.89      3132
         Low       0.86      0.82      0.84      3069
      Medium       0.88      0.91      0.89      5810

    accuracy                           0.88     12011
   macro avg       0.88      0.87      0.87     12011
weighted avg       0.88      0.88      0.88     12011
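The ranking above rests on a single 70/30 split, and the top two pipelines differ by less than half a percentage point (0.8790 against 0.8762). Cross-validation with `StratifiedKFold` and `cross_val_score`, both already imported, would show whether that gap is larger than the fold-to-fold noise. The sketch below uses synthetic data and reduced `n_estimators` to keep it quick, so its numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for X and y
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=42
)

candidates = {
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# Same five stratified folds for every candidate, so scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}
for name, clf in candidates.items():
    pipe = Pipeline([('scaler', StandardScaler()), ('classifier', clf)])
    scores = cross_val_score(pipe, X_toy, y_toy, cv=cv, scoring='accuracy')
    cv_results[name] = scores
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

If the standard deviation across folds is larger than the gap between two models, the single-split ranking between them should be read as a tie.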

