# 1. DEFINING THE PROBLEM

You are given a dataset of StarCraft player performance in ranked games. The goal is to develop a model that predicts a player’s rank from the information provided in the dataset.

**Language:** Python 

# 2. LOAD, EXPLORE, CLEAN 

In [None]:
import pandas as pd

## LOAD CSV FILE

In [None]:
df = pd.read_csv("starcraft_player_data.csv") 
# Check the first 5 rows
df.head()

## EXPLORE THE DATA

In [None]:
# Checking general information about data
df.info()

**Key Takeaways:**
1. 3,395 rows and 20 columns.
2. LeagueIndex represents the player's rank.
3. Data types: Age, HoursPerWeek, and TotalHours are stored as strings (object dtype) but should be numeric. If they are not converted, they can hide issues in the analysis.
4. Non-Null Count shows no missing values, but this needs further inspection because of the previous point: missing entries may be stored as non-numeric text.
5. Columns that describe gameplay statistics and could be potential features: APM, MinimapAttacks, WorkersMade, ComplexUnitsMade.
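
Before converting the string columns, it can help to see which raw values actually fail numeric parsing. The sketch below is a minimal example on a toy frame; the `"?"` placeholder and the helper name `non_numeric_tokens` are assumptions for illustration, not something confirmed about the real CSV.

```python
import pandas as pd

# Toy frame standing in for the real data; the "?" placeholder is an assumption
toy = pd.DataFrame({
    "Age": ["16", "21", "?", "44"],
    "HoursPerWeek": ["10", "?", "28", "8"],
    "TotalHours": ["3000", "5000", "?", "200"],
})

def non_numeric_tokens(frame, columns):
    """Return, per column, the distinct raw values that fail numeric parsing."""
    found = {}
    for col in columns:
        parsed = pd.to_numeric(frame[col], errors="coerce")
        # Values that were present in the raw data but became NaN after parsing
        bad = frame.loc[parsed.isna() & frame[col].notna(), col]
        found[col] = sorted(bad.unique().tolist())
    return found

tokens = non_numeric_tokens(toy, ["Age", "HoursPerWeek", "TotalHours"])
print(tokens)
```

Running the same helper on `df` before the conversion would show which placeholder text, if any, hides the missing values.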

**To-Do:**
Data cleaning time!
1. Convert the object types into numeric
2. Check for any duplicates 
3. Explore more for any missing data

## DATA CLEANING

In [None]:
# Convert Dtype from object to numeric
df[['Age', 'HoursPerWeek', 'TotalHours']] = df[['Age', 'HoursPerWeek', 'TotalHours']].apply(pd.to_numeric, errors='coerce')

# Check for duplicates
duplicate_count = df.duplicated().sum()

# Drop duplicates if any
df = df.drop_duplicates()

# Confirm the changes
df.info()
print(f"Number of duplicates removed: {duplicate_count}")

## HANDLING MISSING VALUES

In [None]:
# Re-run to check for missing values
df.isnull().sum()

**Key Takeaways:**
Since only a small amount of data is missing, I can either replace these values with the column mean or delete the affected rows. I will first inspect the missing values further to understand their pattern, because discarding rows could bias the conclusions if the values are not missing at random.

**To-Do:**
1. Create visualization for distribution of missing values
2. Determine how to handle it 
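
One way to check whether missingness depends on rank is to compute the share of missing values per LeagueIndex level. This is a minimal sketch on made-up toy values; the helper name `missing_rate_by_group` is hypothetical and the pattern in the toy data says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up toy values, for illustration only
toy = pd.DataFrame({
    "LeagueIndex": [1, 1, 5, 5, 8, 8],
    "Age": [20.0, 23.0, 21.0, 19.0, np.nan, np.nan],
    "TotalHours": [100.0, 250.0, 900.0, 700.0, np.nan, np.nan],
})

def missing_rate_by_group(frame, group_col, columns):
    """Share of missing values in each column, per level of group_col."""
    return frame[columns].isna().groupby(frame[group_col]).mean()

rates = missing_rate_by_group(toy, "LeagueIndex", ["Age", "TotalHours"])
print(rates)
```

If the rates differ sharply between ranks, mean imputation over the whole column may distort the affected ranks, so dropping or group-wise imputation would be worth considering.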

In [None]:
import missingno as msno

# Missing data visualization
msno.matrix(df)

**Key Takeaways:**
1. Based on this matrix chart, the missing data appears to be missing at random (MAR).
2. Since only a few rows have missing data, we can impute them with the column mean.

In [None]:
# Fill missing values with column mean
df.fillna(df.mean(numeric_only=True), inplace=True)

# Check for nulls after cleaning
df.isnull().sum()

# 3. EXPLORATORY DATA ANALYSIS (EDA)

## Summary Statistics

In [None]:
# Summary stats of updated dataset
df.describe()

**Key Takeaways:**
1. LeagueIndex ranges from 1 to 8, making this a categorical (ordinal) ranking system.
2. Age ranges from 16 to 44 (median: 21).
3. HoursPerWeek varies widely, which suggests possible outliers.
4. TotalHours also has a very wide range, so outliers exist; the maximum of 1,000,000 hours (roughly 114 years) is impossible.
5. APM ranges from 22 to 389, which may indicate that highly skilled players act faster.
6. SelectByHotkeys and AssignToHotkeys have very small values and may need rescaling.
7. Other columns with low averages should be checked for their importance to ranking.

## Data Visualizations

In [None]:
# Data Visualizations
import seaborn as sns
import matplotlib.pyplot as plt

### Histogram 

In [None]:
# Plot histograms for feature distributions
df.hist(figsize=(15, 12), bins=30, edgecolor='black')
plt.suptitle("Feature Distributions", fontsize=16)
plt.show()

**Key Takeaways:**

1. **Skewed Distributions:** TotalHours and HoursPerWeek are heavily right-skewed, indicating extreme values. APM has a right skew, suggesting some players perform many more actions than others.

2. **Normally Distributed Features:** Age appears roughly normal, centered around early 20s. LeagueIndex has a slight central tendency, suggesting more mid-tier players.

3. **Sparse Features:** Columns dealing with Minimap and hotkey usage have very small values and might require scaling.


### Heatmap

In [None]:
# Generates a correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Feature Correlation Heatmap")
plt.show()

**Key Takeaways:**

1. **Strongest Correlations with LeagueIndex:** APM: More actions per minute correlate with higher ranks. MinimapAttacks: Frequent minimap use is linked to better players. TotalMapExplored: More exploration correlates with higher rank. AssignToHotkeys: Using hotkeys is associated with skill.

2. **Weak or No Correlation with LeagueIndex:** GameID is an identifier and has no effect on rank. HoursPerWeek and TotalHours: time spent playing is not a strong predictor.

3. **Multicollinearity Risks:** APM, ActionsInPAC, and ActionLatency are highly correlated; redundant features would need to be removed.
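
Reading pairs off a dense heatmap is error-prone, so the multicollinearity check can also be listed programmatically. The following is a minimal sketch on toy columns; the helper `high_corr_pairs` and the 0.8 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.8):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr(numeric_only=True).abs()
    cols = corr.columns
    rows = []
    # Walk the upper triangle so each pair appears once and the diagonal is skipped
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if value > threshold:
                rows.append((cols[i], cols[j], value))
    result = pd.DataFrame(rows, columns=["feature_a", "feature_b", "abs_corr"])
    return result.sort_values("abs_corr", ascending=False).reset_index(drop=True)

# Hypothetical toy columns: "b" is a linear function of "a", "c" is unrelated
a = np.arange(50, dtype=float)
toy = pd.DataFrame({"a": a, "b": 2 * a + 1, "c": np.tile([1.0, -1.0], 25)})
pairs = high_corr_pairs(toy)
print(pairs)
```

Calling `high_corr_pairs(df.drop(columns=["LeagueIndex"]))` would give a short list of candidate redundant features to review.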

### Boxplot

In [None]:
# Generate boxplots for key features to detect outliers
key_features = ["APM", "TotalHours", "HoursPerWeek", "MinimapAttacks", "TotalMapExplored"]
plt.figure(figsize=(12, 8))

for i, feature in enumerate(key_features, 1):
    plt.subplot(2, 3, i)
    sns.boxplot(y=df[feature])
    plt.title(f"Boxplot of {feature}")

plt.tight_layout()
plt.show()

**Key Takeaways:**
1. **Extreme Outliers:** As mentioned before, some players in **TotalHours** have over 1,000,000 total hours, which is unrealistic. **HoursPerWeek** has extreme values up to 168 hours/week (unrealistic 24/7 playtime).

2. APM (Actions Per Minute): Some players have extremely high APM (>350), which may be valid but should be examined.

3. MinimapAttacks & TotalMapExplored: Outliers are present but less extreme than playtime features.

### Outlier Handling
Based on the findings, I will deal with the outliers by percentile capping (clipping values beyond the 99th percentile).
These outliers seem to be mostly due to entry mistakes.
Capping replaces each extreme value with the 99th-percentile threshold instead of removing the row.
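
To make the effect of capping concrete, here is a minimal sketch on a toy series with one absurd value. The helper `cap_upper` and the toy numbers are assumptions for illustration only.

```python
import pandas as pd

def cap_upper(series, percentile=0.99):
    """Clip values above the given percentile; return the capped series and the limit."""
    upper = series.quantile(percentile)
    return series.clip(upper=upper), upper

# Toy data: 99 ordinary values plus one impossible entry
hours = pd.Series([float(v) for v in range(1, 100)] + [1_000_000.0])
capped, limit = cap_upper(hours)

# Count how many values were altered by the cap
changed = int((capped != hours).sum())
print(f"limit={limit:.2f}, values changed={changed}, new max={capped.max():.2f}")
```

Note that the limit is itself computed from the data, so a single extreme value can still pull the 99th percentile upward on a small sample; checking the new maximum after capping is a useful sanity test.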

In [None]:
# Define capping function (99th percentile)
def cap_outliers(df, columns, percentile=0.99):
    for col in columns:
        upper_limit = df[col].quantile(percentile)
        df[col] = df[col].clip(upper=upper_limit)
    return df

# Apply capping to selected features
outlier_columns = ["TotalHours", "HoursPerWeek", "APM", "MinimapAttacks", "TotalMapExplored"]
df = cap_outliers(df, outlier_columns)

# Verify changes with new boxplots
plt.figure(figsize=(12, 8))
for i, feature in enumerate(outlier_columns, 1):
    plt.subplot(2, 3, i)
    sns.boxplot(y= df[feature])
    plt.title(f"Boxplot of {feature} (Capped)")

plt.tight_layout()
plt.show()

In [None]:
df.info()

# 4. MODEL SELECTION & TRAINING

In [None]:
# Import necessary libraries for model training and evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import SMOTE
from collections import Counter
import numpy as np

## Feature Selection

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute Variance Inflation Factor (VIF)
X_selected = df.drop(columns=['LeagueIndex'])  # Exclude target variable
vif_data = pd.DataFrame()
vif_data["Feature"] = X_selected.columns
vif_data["VIF"] = [variance_inflation_factor(X_selected.values, i) for i in range(len(X_selected.columns))]

print(vif_data)

In [None]:
# Find each feature's correlation with the target
correlations = df.corr(numeric_only=True)['LeagueIndex'].sort_values(ascending=False)
print(correlations)

In [None]:
# Selected features based on correlation and relevance
selected_features = [
    "APM", "GapBetweenPACs", "AssignToHotkeys", "ActionLatency", "MinimapAttacks", "SelectByHotkeys", "TotalHours"
]

# Define target variable
target = "LeagueIndex"

# Define features and target variable
X = df[selected_features]  # Features
y = df[target]  # Target

# Split into 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Confirm data shape
X_train_scaled.shape, X_test_scaled.shape, y_train.shape, y_test.shape

## Cross Validation on Models
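
A minimal sketch of how k-fold cross-validation could compare candidate models is shown below. It uses synthetic data from `make_classification` as a stand-in for the selected features, so the scores say nothing about the StarCraft dataset; the variable names `X_demo`, `y_demo`, and `candidates` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 7 features, 4 classes
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=5, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

# Scaling lives inside the pipeline so it is refit on each training fold
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=500)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

On the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, keeping the test set untouched until the final evaluation.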

## Logistic Regression Model

In [None]:
# Train a Logistic Regression model
log_reg = LogisticRegression(max_iter=500, random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Make predictions
y_pred = log_reg.predict(X_test_scaled)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred, output_dict=True)

print("accuracy:", accuracy)

# Creating dataframe for classification report for better readability
report_df = pd.DataFrame(classification_rep).transpose()
print(report_df)


## Random Forest Model

In [None]:
# Train a Random Forest model
rf_model = RandomForestClassifier(class_weight='balanced', n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test_scaled)

# Evaluate performance
accuracy_rf = accuracy_score(y_test, y_pred_rf)
classification_rep_rf = classification_report(y_test, y_pred_rf, output_dict=True)

print("accuracy:", accuracy_rf)

# Creating dataframe for classification report for better readability
report_rf_df = pd.DataFrame(classification_rep_rf).transpose()
print(report_rf_df)

## Building Classification Model using Random Forest
Although the scores were low, based on my findings and the problem question, I believe this model is the best fit for these reasons:
1. Handling complex relationships: Multiple features interact in complex ways to determine a player's rank.
2. Handling imbalanced classes: As shown before, there is a clear class imbalance, particularly for LeagueIndex 7-8.
3. Handling feature scale: Tree-based models do not depend on feature scaling, and some features in the dataset have a much lower scale than others.
4. Providing feature importance: This helps with feature selection and with understanding how features relate to rank.

## Model Improvements

**To further improve the model:**
1. Feature Selection to find the most important predictors.
2. Balance the dataset (address rank imbalances).
3. Try more models like Gradient Boosting Models (like XGBoost or LightGBM).
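
As a starting point for the third item, the sketch below trains scikit-learn's `GradientBoostingClassifier` (already imported above but not yet used). It runs on synthetic stand-in data, so the accuracy is not a result for this dataset, and the hyperparameters are unturned defaults stated explicitly as assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 7 features, 4 classes
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=5, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

# Stratify so each class keeps the same share in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Untuned settings, written out only to make the assumptions visible
gb_model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
gb_model.fit(X_tr, y_tr)

gb_pred = gb_model.predict(X_te)
gb_accuracy = accuracy_score(y_te, gb_pred)
print("Accuracy:", gb_accuracy)
```

XGBoost and LightGBM expose a similar fit/predict interface through their scikit-learn wrappers, so the same evaluation code could be reused if those libraries are installed.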

## Feature Selection Using Feature Importance

In [None]:
# Train Random Forest on the selected features (X_train holds the 7 unscaled features chosen above)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get feature importance scores
feature_importances = rf.feature_importances_

# Convert to a Pandas DataFrame for easier analysis
importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': feature_importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Display the top features
print(importance_df.head(10))

In [None]:
# Select the top N features; X_train has only 7 columns, so N = 10 keeps all of them
N = 10
top_features = importance_df['Feature'].head(N).tolist()

# Create a new dataset with only the selected features
X_train_selected = X_train[top_features]
X_test_selected = X_test[top_features]


In [None]:
# Train model again with selected features
rf_selected = RandomForestClassifier(n_estimators=100, random_state=42)
rf_selected.fit(X_train_selected, y_train)

# Evaluate performance
y_pred_selected = rf_selected.predict(X_test_selected)

print("Accuracy:", accuracy_score(y_test, y_pred_selected))
print("Classification Report:\n", classification_report(y_test, y_pred_selected))

## Address Imbalances

In [None]:
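# Define features and target from the full dataset so X and y have the same number of rows
# (X_train_selected is only the training subset and would not match the length of y)
X = df[top_features]  # Features
y = df[target]  # Target

# Split into 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)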

# Count instances in each rank
print("Before SMOTE:", Counter(y_train))

# Initialize SMOTE
smote = SMOTE(sampling_strategy='auto', random_state=42)

# Apply SMOTE to generate synthetic samples
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Check new distribution
print("After SMOTE:", Counter(y_train_resampled))


In [None]:
# Train the model on balanced data
rf_model_smote = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model_smote.fit(X_train_resampled, y_train_resampled)

# Make predictions
y_pred_smote = rf_model_smote.predict(X_test)

### Model Performance Score

In [None]:
# Evaluate results
print("Accuracy:", accuracy_score(y_test, y_pred_smote))

report_improve_df = pd.DataFrame(classification_report(y_test, y_pred_smote, output_dict=True)).transpose()
print(report_improve_df)
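
Because LeagueIndex is an ordered rank, accuracy treats a prediction that is one rank off the same as one that is six ranks off. The sketch below shows a few order-aware scores that could complement the report above; the label arrays are made up for illustration and are not model output.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, mean_absolute_error

# Hypothetical labels, made up for illustration only
y_true_demo = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y_pred_demo = np.array([1, 3, 3, 5, 5, 6, 8, 8])

# Share of predictions that match the rank exactly
exact = accuracy_score(y_true_demo, y_pred_demo)

# Average distance, in ranks, between prediction and truth
mae = mean_absolute_error(y_true_demo, y_pred_demo)

# Share of predictions that land within one rank of the truth
within_one = float(np.mean(np.abs(y_true_demo - y_pred_demo) <= 1))

# Agreement score that penalizes larger rank errors more heavily
qwk = cohen_kappa_score(y_true_demo, y_pred_demo, weights="quadratic")

print(f"exact={exact:.3f}, MAE={mae:.3f}, within one rank={within_one:.3f}, QWK={qwk:.3f}")
```

On the real results, `y_test` and `y_pred_smote` would take the place of the toy arrays.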