## Import Libraries and Modules

In [None]:
# Standard data analysis and plotting libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 300 # set the resolution of output plots to 300 dpi

# Standard machine learning libraries and modules
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, root_mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Tensorflow 
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

## Data Preprocessing and EDA

### NBA Players Performance and Salaries
Source: https://www.kaggle.com/datasets/thedevastator/exploring-nba-player-performance-and-salaries-19

#### Dataset guide

`Checked?` indicates whether a column has been converted to its correct data type.

File: salaries_1985to2018.csv

| Column    | Description | Checked?
| -------- | ------- | -------
| league  | The league the player is in. (String) | Yes 
| salary  | The salary of the player. (Integer)    | Yes 
| season  | The season the player is playing in. (String)  | Yes 
| season_end | The end date of the season. (Date) | Yes 
| season_start | The start date of the season. (Date) | Yes  
| team         | The team the player is playing for. (String) | Yes 

File: players.csv

| Column      | Description | Checked?
| ----------- | ----------- | --------
| birthDate   | Date of birth of the player. (Date) | Yes
| birthPlace  | Place of birth of the player. (String) | Yes
| career_AST  | Career assists of the player. (Integer) | Yes
| career_FG%  | Career field goal percentage of the player. (Float) | Yes
| career_FG3% | Career three-point field goal percentage of the player. (Float) | Yes
| career_FT%  | Career free throw percentage of the player. (Float) | Yes
| career_G    | Career games played by the player. (Integer) | Yes
| career_PER  | Career player efficiency rating of the player. (Float) | Yes
| career_PTS  | Career points scored by the player. (Integer) | Yes
| career_TRB  | Career total rebounds of the player. (Integer) | ?
| career_WS   | Career win shares of the player. (Float) | Yes
| career_eFG% | Career effective field goal percentage of the player. (Float) | ?
| college     | College attended by the player. (String) | Yes
| draft_pick  | Draft pick of the player. (Integer) | Yes
| draft_round | Round of the draft the player was selected in. (Integer) | Yes
| draft_team  | Team that drafted the player. (String) | Yes
| draft_year  | Year the player was drafted. (Integer) | Yes
| height      | Height of the player. (Float) | Yes
| highSchool  | High school attended by the player. (String) | Yes
| name        | Name of the player. (String) | Yes
| position    | Position of the player. (String) | Yes
| shoots      | Shooting hand preference of the player. (String) | Yes
| weight      | Weight of the player. (Integer) | Yes



### Read and Merge

In [None]:
salaries = pd.read_csv("salaries_1985to2018.csv")
players = pd.read_csv("players.csv")

In [None]:
salaries.shape

In [None]:
players.shape

The next step is to merge the two datasets into a unified one. <br>
Let's explore the columns and find a key column that we can merge on:

In [None]:
salaries.columns

In [None]:
players.columns

Both `salaries` and `players` carry their own `index` column, but the merged dataset needs only one. So we drop the `index` column from one of the DataFrames (here, from `salaries`):

In [None]:
salaries.drop(columns=['index'], inplace=True)

We observe that `player_id` in `salaries` and `_id` in `players` hold the same player identifiers. Thus, we merge on these id values using an inner join, which keeps only those players for whom salary details are available. Because the key columns have different names, we pass them via `left_on` and `right_on`. We name the merged DataFrame `nba_salary_stats`:

In [None]:
nba_salary_stats = salaries.merge(players, how='inner', left_on='player_id', right_on='_id')
nba_salary_stats.head()
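
Before trusting an inner join, it can help to see which ids fail to match and to confirm that each id appears only once on the `players` side. The sketch below does this on two invented toy tables (`salaries_demo`, `players_demo` and their values are assumptions, not rows from the real files), using the `indicator` and `validate` options of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for the two tables (values are invented for illustration)
salaries_demo = pd.DataFrame({
    "player_id": ["a01", "a01", "b02", "z99"],
    "salary": [1_000_000, 1_200_000, 500_000, 750_000],
})
players_demo = pd.DataFrame({
    "_id": ["a01", "b02", "c03"],
    "name": ["Player A", "Player B", "Player C"],
})

# An outer merge with indicator=True adds a `_merge` column that says
# whether each row came from the left table, the right table, or both
check = salaries_demo.merge(
    players_demo, how="outer", left_on="player_id", right_on="_id", indicator=True
)
print(check["_merge"].value_counts())

# The inner join keeps only ids present in both tables; validate="m:1"
# raises an error if an id occurs more than once in the players table
merged_demo = salaries_demo.merge(
    players_demo, how="inner", left_on="player_id", right_on="_id", validate="m:1"
)
print(merged_demo.shape)
```

Rows flagged `left_only` are salary records without a matching player, and they are exactly the rows an inner join discards.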

Delete one of the redundant id columns, for instance `_id` from `nba_salary_stats`:

In [None]:
nba_salary_stats.drop(columns=['_id'], inplace=True)

Checking again:

In [None]:
nba_salary_stats.columns

From now on, we work with `nba_salary_stats`.

Number of rows and columns, in that order:

In [None]:
nba_salary_stats.shape

Check for NaN values:

In [None]:
nba_salary_stats.isna().sum()

In [None]:
nba_salary_stats.shape

In [None]:
nba_salary_stats["career_FG3%"].isna().sum()

Check number of rows after removing NaNs:

In [None]:
nba_salary_stats.shape

Note that 3370 rows with NaNs are removed
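
The number of rows lost to NaN removal can be checked by comparing row counts before and after a `dropna()` call. A minimal sketch on an invented three-column frame (the `demo` data is an assumption, not the real dataset):

```python
import numpy as np
import pandas as pd

# Invented data with missing values in two different columns
demo = pd.DataFrame({
    "salary": [100, 200, 300, 400],
    "career_FG3%": [0.35, np.nan, 0.40, np.nan],
    "college": ["A", "B", None, "D"],
})

# Rows that contain at least one NaN in any column
rows_with_nan = int(demo.isna().any(axis=1).sum())

# dropna() without inplace=True returns a copy, so `demo` itself is unchanged
rows_before = len(demo)
rows_after = len(demo.dropna())

print(f"Rows with NaN: {rows_with_nan}")
print(f"Rows removed by dropna(): {rows_before - rows_after}")
```

The two printed numbers agree because `dropna()` drops a row as soon as any of its columns is missing.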

Check for duplicated rows:

In [None]:
duplicated_rows = nba_salary_stats.duplicated().sum()
duplicated_rows

The data type of columns:

In [None]:
nba_salary_stats.dtypes

### Column transformation and EDA

We notice several columns of object type that need to be converted to numeric ones. I tried the standard type casting method `.astype()`, but it failed because these columns contain special characters. Instead, I use custom lambda functions to strip off the special characters and keep only the numeric part. The columns `weight` and `height` are a bit more complicated: they contain special characters and, on top of that, each needs to be converted to metric units (kg, cm).

Convert `weight`, `height` and all columns that contain special characters:

In [None]:
# Convert 'weight' from lb format to kg 
nba_salary_stats['weight'] = nba_salary_stats['weight'].apply(
    lambda x: float(str(x).replace('lb', '')) * 0.453592 
)

# Convert 'height' from feet-inches format to cm
nba_salary_stats["height"] = nba_salary_stats["height"].apply(
    lambda x: float(x.split("-")[0]) * 30.48 + float(x.split("-")[1]) * 2.54 if 
    isinstance(x, str) and "-" in x else np.nan  # Handle NaN or unexpected formats
)

# Custom function to remove special characters from other columns and set them into numeric
extract_numeric = lambda series: series.astype(str).str.extract(r'([-+]?\d*\.?\d+)', expand=False).astype(float)

cols_to_convert = ["career_FG%", "career_FG3%", "career_eFG%", "career_FT%", 
                   "career_PER", "draft_year", "career_WS", "draft_pick"]

nba_salary_stats[cols_to_convert] = nba_salary_stats[cols_to_convert].apply(extract_numeric)
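
To make the conversions concrete, here is a small worked example on invented raw strings. The exact formats in the real files may differ, so the values in `raw` are assumptions chosen only to exercise the same arithmetic and the same regular expression:

```python
import numpy as np
import pandas as pd

# Invented raw values in the styles the cell above expects
raw = pd.DataFrame({
    "weight": ["240lb", "185lb"],
    "height": ["6-10", "6-3"],
    "career_FG%": ["45.3", "-"],
    "draft_pick": ["3rd overall", "21st overall"],
})

# Pounds to kilograms: strip the unit, then multiply by 0.453592
weight_kg = raw["weight"].str.replace("lb", "", regex=False).astype(float) * 0.453592

# Feet-inches to centimetres: 1 ft = 30.48 cm, 1 in = 2.54 cm
feet_inches = raw["height"].str.split("-", expand=True).astype(float)
height_cm = feet_inches[0] * 30.48 + feet_inches[1] * 2.54

# The regex keeps the first signed number; strings without digits become NaN
pattern = r"([-+]?\d*\.?\d+)"
fg_pct = raw["career_FG%"].str.extract(pattern, expand=False).astype(float)
draft_pick = raw["draft_pick"].str.extract(pattern, expand=False).astype(float)

print(weight_kg.round(2).tolist())
print(height_cm.round(2).tolist())
print(fg_pct.tolist(), draft_pick.tolist())
```

A placeholder such as `"-"` contains no digit, so the pattern does not match and the value turns into NaN rather than raising an error.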

Convert `birthDate` to datetime object:

In [None]:
# Convert `birthDate` to datetime object:
nba_salary_stats['birthDate'] = pd.to_datetime(
    nba_salary_stats['birthDate'], 
    format='%B %d, %Y'  # matching "MonthName Day, Year" format
)

Split `birthPlace` into `birth_city` and `nationality` columns:

In [None]:
# Split 'birthPlace' into 'birth_city' and 'nationality'
nba_salary_stats[['birth_city', 'nationality']] = nba_salary_stats['birthPlace'].str.split(', ', expand=True)
# Display a few rows to confirm changes
nba_salary_stats[["birthPlace", "birth_city", "nationality"]].head()
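
Assigning the split result to exactly two columns only works when every value contains a single `", "` separator. If some birthplaces had more parts, one hedge would be to split from the right with `n=1`. The sketch below illustrates the difference on invented strings (the `places` values are assumptions, not taken from the dataset):

```python
import pandas as pd

# Invented examples; the second one contains two separators
places = pd.Series([
    "Brooklyn, New York",
    "Washington, D.C., District of Columbia",
    "Belgrade, Serbia",
])

# A plain split yields as many columns as the longest value needs
parts_all = places.str.split(", ", expand=True)

# Splitting once from the right always yields at most two columns:
# everything before the last separator, and the last part
parts_last = places.str.rsplit(", ", n=1, expand=True)

print(parts_all.shape, parts_last.shape)
print(parts_last)
```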

Add a `draft_age` column which is the age that a player is drafted:

In [None]:
# Ensure birthDate is a datetime object
nba_salary_stats["birthDate"] = pd.to_datetime(nba_salary_stats["birthDate"], errors="coerce")

# Convert draft_year to integer
nba_salary_stats["draft_year"] = nba_salary_stats["draft_year"].fillna(0).astype(int)

# Compute draft age
nba_salary_stats["draft_age"] = nba_salary_stats["draft_year"] - nba_salary_stats["birthDate"].dt.year

# Replace negative values in the draft_age column with the median of positive values
# Calculate median of positive draft ages
positive_median = nba_salary_stats.loc[nba_salary_stats["draft_age"] > 0, "draft_age"].median()

# Replace negative values with the median of positive ages
nba_salary_stats["draft_age"] = nba_salary_stats["draft_age"].apply(
    lambda x: positive_median if x < 0 else x
)

# Display a few rows to confirm changes
nba_salary_stats[["birthDate", "draft_year", "draft_age"]].head()
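
The effect of the `fillna(0)` step and the median replacement is easiest to see on a tiny example. The three rows below are invented; the third has no draft year, which is the case that produces a large negative age:

```python
import numpy as np
import pandas as pd

# Invented players; NaN in draft_year stands for a missing draft record
demo = pd.DataFrame({
    "birthDate": pd.to_datetime(["1984-12-30", "1988-03-14", "1975-06-01"]),
    "draft_year": [2003.0, 2009.0, np.nan],
})

# Missing draft years become 0, so the age below turns negative for them
demo["draft_year"] = demo["draft_year"].fillna(0).astype(int)
demo["draft_age"] = demo["draft_year"] - demo["birthDate"].dt.year

# Median of the valid (positive) ages only
positive_median = demo.loc[demo["draft_age"] > 0, "draft_age"].median()

# Keep non-negative ages, replace the rest with that median
demo["draft_age"] = demo["draft_age"].where(demo["draft_age"] >= 0, positive_median)

print(demo)
```

Here the valid ages are 19 and 21, so the third row receives their median, 20. `Series.where` is a vectorized equivalent of the `apply` with a lambda used above.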

Remove NaN values:

In [None]:
nba_salary_stats.dropna(inplace=True)
nba_salary_stats.isna().sum()

In [None]:
nba_salary_stats.columns

DataFrame info:

In [None]:
nba_salary_stats.info()

Summary statistics of numerical values:

In [None]:
nba_salary_stats.describe()

Random sample of 10 rows:

In [None]:
nba_salary_stats.sample(10)

Number of unique players:

In [None]:
nba_salary_stats['player_id'].nunique()

Number of players by state:

In [None]:
# Count the number of players per state/country of birth and select the top 10
top_nationalities = nba_salary_stats['nationality'].value_counts().nlargest(10)

# Plot the bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x=top_nationalities.index, y=top_nationalities.values)

# Customize the plot
plt.xticks(rotation=90)  # Rotate x-axis labels for readability
plt.xlabel('State')
plt.ylabel('Number of Players')
plt.title('Number of players by state')

# Show the plot
plt.show()

Salary Distribution:

In [None]:
# Set up figure with improved proportions
plt.figure(figsize=(12, 6), facecolor='white')  # Wider aspect ratio

# Create histogram with refined styling
ax = sns.histplot(
    nba_salary_stats['salary'],
    bins=50,
    kde=True,
    color='#5F9EA0',  # More sophisticated blue
    edgecolor='white',  # Clean bar edges
    linewidth=0.8,  # Subtle border
    alpha=0.85  # Slight transparency
)

# KDE line styling
for line in ax.lines:  # Access the KDE line
    line.set_color('#d62728')  # Contrasting red
    line.set_linewidth(2)  # Thicker line

# Customize labels with improved typography
plt.xlabel('Salary ($)', fontsize=12, labelpad=10, fontweight='normal')
plt.ylabel('Count of Players', fontsize=12, labelpad=10, fontweight='normal')
plt.title('NBA Salary Distribution', 
          fontsize=14, pad=20, fontweight='bold')

# Improve tick marks
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

# Add light grid for better readability
plt.grid(axis='y', alpha=0.3, linestyle='--')

# Remove top and right spines for cleaner look
sns.despine()

plt.tight_layout()
plt.show()

Draft pick distribution:

In [None]:
# Set up figure with improved proportions
plt.figure(figsize=(12, 6), facecolor='white')  # Wider aspect ratio

# Create histogram with refined styling
ax = sns.histplot(
    nba_salary_stats['draft_pick'],
    bins=50,
    kde=True,
    color='#5F9EA0',  # More sophisticated blue
    edgecolor='white',  # Clean bar edges
    linewidth=0.8,  # Subtle border
    alpha=0.85  # Slight transparency
)

# KDE line styling
for line in ax.lines:  # Access the KDE line
    line.set_color('#d62728')  # Contrasting red
    line.set_linewidth(2)  # Thicker line

# Customize labels with improved typography
plt.xlabel('Draft Pick Number', fontsize=12, labelpad=10, fontweight='normal')
plt.ylabel('Count of Players', fontsize=12, labelpad=10, fontweight='normal')
plt.title('NBA Draft Pick Distribution', 
          fontsize=14, pad=20, fontweight='bold')

# Improve tick marks
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

# Add light grid for better readability
plt.grid(axis='y', alpha=0.3, linestyle='--')

# Remove top and right spines for cleaner look
sns.despine()

plt.tight_layout()
plt.show()

Correlation between height and three-point field goal percentage:

In [None]:
# Set style
sns.set_style("whitegrid")
plt.figure(figsize=(10, 6), facecolor='#f8f9fa')

# Create scatter plot
scatter = sns.scatterplot(
    x='height',
    y='career_FG3%',
    data=nba_salary_stats,
    color='#2a9d8f',  # Attractive teal color
    alpha=0.7,
    s=80,  # Slightly larger points
    edgecolor='white',
    linewidth=0.5
)

# Add regression line
sns.regplot(
    x='height',
    y='career_FG3%',
    data=nba_salary_stats,
    scatter=False,
    color='#e76f51',  # Complementary coral color
    line_kws={'linewidth': 2.5}
)

# Calculate and display correlation
correlation = nba_salary_stats['height'].corr(nba_salary_stats['career_FG3%'])
plt.text(0.05, 0.95, f'Correlation: {correlation:.2f}', 
         transform=plt.gca().transAxes,
         fontsize=12,
         bbox=dict(facecolor='white', alpha=0.8, edgecolor='lightgray'))

# Labels and title
plt.title('Height vs. 3-Point Shooting Accuracy', 
          fontsize=14, pad=15, fontweight='bold')
plt.xlabel('Height (cm)', fontsize=12)
plt.ylabel('3-Point FG%', fontsize=12)

# Adjust ticks
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

# Clean up borders
sns.despine()

plt.tight_layout()
plt.show()

Is salary correlated with three-point field goal percentage?

In [None]:
# Set style
sns.set_style("whitegrid")
plt.figure(figsize=(10, 6), facecolor='#f8f9fa')

# Create scatter plot
scatter = sns.scatterplot(
    y='career_FG3%',
    x='salary',
    data=nba_salary_stats,
    color='#2a9d8f',  # Attractive teal color
    alpha=0.7,
    s=80,  # Slightly larger points
    edgecolor='white',
    linewidth=0.5
)

# Add regression line
sns.regplot(
    y='career_FG3%',
    x='salary',
    data=nba_salary_stats,
    scatter=False,
    color='#e76f51',  # Complementary coral color
    line_kws={'linewidth': 2.5}
)

# Calculate and display correlation
correlation = nba_salary_stats['salary'].corr(nba_salary_stats['career_FG3%'])
plt.text(0.05, 0.95, f'Correlation: {correlation:.2f}', 
         transform=plt.gca().transAxes,
         fontsize=12,
         bbox=dict(facecolor='white', alpha=0.8, edgecolor='lightgray'))

# Labels and title
plt.title('Salary vs. 3-Point Shooting Accuracy', 
          fontsize=14, pad=15, fontweight='bold')
plt.xlabel('Salary ($)', fontsize=12)
plt.ylabel('3-Point FG%', fontsize=12)

# Adjust ticks
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

# Clean up borders
sns.despine()

plt.tight_layout()
plt.show()

Bar chart of age at which the player was drafted:

In [None]:
# Calculate and clean draft_age
positive_median = nba_salary_stats.loc[nba_salary_stats["draft_age"] > 0, "draft_age"].median()
nba_salary_stats["draft_age_clean"] = np.where(nba_salary_stats["draft_age"] < 0, 
                                             positive_median, 
                                             nba_salary_stats["draft_age"])

# Round ages to integers for clean bar labels
nba_salary_stats["draft_age_int"] = nba_salary_stats["draft_age_clean"].round().astype(int)

# Create countplot (bar chart)
plt.figure(figsize=(14, 6))
ax = sns.countplot(data=nba_salary_stats,
                 x="draft_age_int",
                 color="#2a9d8f",
                 edgecolor="white",
                 linewidth=0.7)

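# Add median line. countplot draws the bars at positions 0, 1, 2, ..., so the
# median is shifted by the smallest age (this assumes consecutive integer ages)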
plt.axvline(x=positive_median - nba_salary_stats["draft_age_int"].min(), 
            color='red', 
            linestyle='--',
            linewidth=1.5,
            label=f'Median: {positive_median:.1f}')

# Customize
plt.title("NBA Players by Draft Age", fontsize=16, pad=20)
plt.xlabel("Age at Draft (Years)", fontsize=12)
plt.ylabel("Number of Players", fontsize=12)
plt.xticks(rotation=0)

# Add value labels on bars
for p in ax.patches:
    ax.annotate(f'{p.get_height():.0f}', 
                (p.get_x() + p.get_width() / 2., p.get_height()), 
                ha='center', va='center', 
                xytext=(0, 5), 
                textcoords='offset points',
                fontsize=9)

plt.legend()
sns.despine()
plt.tight_layout()
plt.show()

In [None]:
nba_salary_stats["position"].value_counts()

In [None]:
# Define the given positions
selected_positions = [
    "Center",
    "Point Guard",
    "Power Forward and Center",
    "Shooting Guard",
    "Small Forward",
    "Power Forward",
    "Center and Power Forward",
    "Small Forward and Shooting Guard",
    "Point Guard and Shooting Guard",
    "Shooting Guard and Point Guard",
    "Shooting Guard and Small Forward",
    "Power Forward and Small Forward",
    "Small Forward and Power Forward"
]

# Filter dataset for selected positions
filtered_data = nba_salary_stats[nba_salary_stats["position"].isin(selected_positions)]

# Compute median career points for each position and sort in descending order
sorted_positions = (
    filtered_data.groupby("position")["career_PTS"]
    .median()
    .sort_values(ascending=False)
    .index
)

# Set figure size and style
plt.figure(figsize=(14, 6))
sns.set_style("whitegrid")

# Create a sorted boxplot
ax = sns.boxplot(
    x="position",
    y="career_PTS",
    data=filtered_data,
    order=sorted_positions,  # Sort positions based on median career points
    palette="coolwarm"
)

# Adjust y-axis limits dynamically based on data
plt.ylim(filtered_data["career_PTS"].min() * 0.9, filtered_data["career_PTS"].max() * 1.1)

# Customize labels and title
plt.xlabel("Position", fontsize=12)
plt.ylabel("Career Points", fontsize=12)
plt.title("Career Points Distribution for Most Frequent (> 100) Positions (Sorted by Median)", fontsize=14, fontweight="bold")

# Rotate x-axis labels to prevent overlap
plt.xticks(rotation=90, ha="right")

# Show the plot
plt.show()

Salary vs. Career Points Scatterplot:

In [None]:
# Set style
sns.set_style("whitegrid")
plt.figure(figsize=(10, 6), facecolor='#f8f9fa')

# Create scatter plot
scatter = sns.scatterplot(
    x='salary',
    y='career_PTS',
    data=nba_salary_stats,
    color='#2a9d8f',  # Teal
    alpha=0.7,
    s=80,
    edgecolor='white',
    linewidth=0.5
)

# Add regression line
sns.regplot(
    x='salary',
    y='career_PTS',
    data=nba_salary_stats,
    scatter=False,
    color='#e76f51',  # Coral
    line_kws={'linewidth': 2.5}
)

# Calculate and display correlation
correlation = nba_salary_stats['salary'].corr(nba_salary_stats['career_PTS'])
plt.text(0.05, 0.95, f'Correlation: {correlation:.2f}', 
         transform=plt.gca().transAxes,
         fontsize=12,
         bbox=dict(facecolor='white', alpha=0.8, edgecolor='lightgray'))

# Labels and title
plt.title('Salary vs. Career Points',
          fontsize=14, pad=15, fontweight='bold')
plt.xlabel('Salary ($)', fontsize=12)
plt.ylabel('Career Points', fontsize=12)

# Adjust ticks
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

# Clean up borders
sns.despine()

plt.tight_layout()
plt.show()

Top 10 Positions by Average Salary (Bar Chart)
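
One way to compute the numbers behind such a chart is a group-by on `position` followed by `nlargest`. The sketch below runs on invented data (the `demo` frame and its salaries are assumptions); with the real DataFrame, the resulting Series could be passed to `sns.barplot` in the same way as the other bar charts in this notebook:

```python
import pandas as pd

# Invented salaries for three positions
demo = pd.DataFrame({
    "position": ["Center", "Center", "Point Guard", "Point Guard", "Small Forward"],
    "salary": [4_000_000, 6_000_000, 2_000_000, 3_000_000, 7_000_000],
})

# Mean salary per position, keeping at most the 10 highest averages
top_positions = (
    demo.groupby("position")["salary"]
    .mean()
    .nlargest(10)
)

print(top_positions)
```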

Shooting hand preference frequency table:

In [None]:
nba_salary_stats["shoots"].value_counts()

### Feature Engineering

Checking data shape before removing the outliers:

In [None]:
print(f"Dataset has {nba_salary_stats.shape[0]} rows and {nba_salary_stats.shape[1]} columns.")

Custom function to remove the outliers:
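
The function itself is not shown here. A common choice, assumed for this sketch rather than taken from the notebook, is the 1.5 × IQR rule. The name `remove_outliers_iqr` and its parameter `k` are hypothetical:

```python
import pandas as pd

def remove_outliers_iqr(df, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends by default
    return df[df[column].between(lower, upper)]

# Invented values with one obvious outlier
demo = pd.DataFrame({"salary": [1, 2, 3, 4, 5, 100]})
cleaned = remove_outliers_iqr(demo, "salary")

print(f"Rows before: {len(demo)}, rows after: {len(cleaned)}")
```

For this toy column, Q1 = 2.25 and Q3 = 4.75, so the accepted range is [-1.5, 8.5] and only the value 100 is removed. Salaries are strongly right-skewed, so on real data this rule may discard many legitimate high earners; the threshold `k` is worth choosing deliberately.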

Drop unneeded columns:

In [None]:
nba_salary_stats.columns

In [None]:
nba_salary_stats.drop(columns=["league", "player_id", "season", "season_end", 
                               "index", "birthDate", "birthPlace", "career_TRB",
                               "career_eFG%", "college", "draft_pick", "name",
                               "draft_team", "draft_round", "draft_year",
                               "highSchool", "birth_city", "nationality",
                               "draft_age_clean", "draft_age_int"], inplace=True)

New columns:

In [None]:
nba_salary_stats.columns

Correlation Matrix:

In [None]:
# Calculate correlation matrix
corr_matrix = nba_salary_stats.corr(numeric_only=True)

# Create mask for diagonal AND upper triangle (np.triu keeps the diagonal by default)
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

# Plot heatmap (shows only lower triangle, no diagonal)
plt.figure(figsize=(12, 10))
sns.heatmap(
    corr_matrix,
    mask=mask,  # Hide diagonal + upper triangle
    annot=True,
    fmt=".2f",
    center=0,
    square=True,
    linewidths=0.5,
    cbar_kws={"shrink": 0.8}
)

plt.grid(False)
plt.title("Correlation Matrix", fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

Select highly correlated features (above 50%) and remove multicollinearity: 

In [None]:
# Calculate correlation matrix
corr_matrix = nba_salary_stats.corr(numeric_only=True).abs()

# Get upper triangle to avoid duplicates
upper_triangle = np.triu(np.ones_like(corr_matrix, dtype=bool))

# Find high correlations (≥50%)
high_corr_pairs = (
    corr_matrix
    .where(upper_triangle)
    .stack()
    .loc[lambda x: (x >= 0.5) & (x < 1.0)]
    .reset_index()
    .rename(columns={'level_0': 'Feature_A', 'level_1': 'Feature_B', 0: 'Correlation'})
)

def select_feature(row):
    """Select feature with higher correlation to salary"""
    corr_a = corr_matrix.loc[row["Feature_A"], "salary"]
    corr_b = corr_matrix.loc[row["Feature_B"], "salary"]
    return row["Feature_A"] if corr_a > corr_b else row["Feature_B"]

# Apply selection
high_corr_pairs["Selected_Feature"] = high_corr_pairs.apply(select_feature, axis=1)

# Get final selected features
selected_features = high_corr_pairs["Selected_Feature"].unique()

print("Highly correlated feature pairs (≥50%):")
print(high_corr_pairs[['Feature_A', 'Feature_B', 'Correlation', 'Selected_Feature']])

print("\nSelected features (higher correlation with salary):")
print(selected_features)
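
The cell above reports which feature of each correlated pair is more strongly tied to salary, but it does not drop anything. One possible follow-up, sketched here on synthetic data, is to collect the weaker feature of each pair and remove it. The `demo` frame, the 0.5 threshold reuse, and the choice to exclude `salary` itself from the candidate pairs are all assumptions of this sketch:

```python
import numpy as np
import pandas as pd

# Synthetic data: career_G is built from career_PTS, height is independent
rng = np.random.default_rng(0)
n = 500
pts = rng.normal(size=n)
demo = pd.DataFrame({
    "career_PTS": pts,
    "career_G": pts + rng.normal(scale=0.6, size=n),
    "height": rng.normal(size=n),
})
demo["salary"] = 2.0 * demo["career_PTS"] + rng.normal(scale=0.5, size=n)

corr = demo.corr().abs()

# Only compare predictors with each other, never with the target
features = [c for c in corr.columns if c != "salary"]

to_drop = set()
for i, a in enumerate(features):
    for b in features[i + 1:]:
        if corr.loc[a, b] >= 0.5:
            # Drop whichever feature of the pair is less correlated with salary
            weaker = a if corr.loc[a, "salary"] < corr.loc[b, "salary"] else b
            to_drop.add(weaker)

reduced = demo.drop(columns=sorted(to_drop))
print("Dropped:", sorted(to_drop))
print("Remaining columns:", list(reduced.columns))
```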

## Apply Machine Learning Models

Normalize and One Hot Encode
- One Hot Encode for `team`, `position`, `shoots`
- Normalize the remaining columns using `MinMaxScaler()`

In [None]:
ct = make_column_transformer(
    (MinMaxScaler(), ["season_start", "career_AST", "career_FG%", "career_FG3%", 
                      "career_FT%", "career_G", "career_PER", "career_PTS", "career_WS", 
                      "height", "weight", "draft_age"]), # Normalize these columns
    (OneHotEncoder(handle_unknown="ignore"), ["team", "position", "shoots"]) # One hot encode these columns (team, position, shoots)
)

Separate Features & Target

In [None]:
X = nba_salary_stats.drop("salary", axis=1) # Features
y = nba_salary_stats["salary"] # Target

Separate Training & Testing Data

In [None]:
# Build train (80%) & test (20%) datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Transform Training & Test Data

In [None]:
# Fit the column transformer on the training data only
ct.fit(X_train)

# Transform training and test data with normalization (MinMaxScaler) and one hot encoding (OneHotEncoder)
X_train_normal = ct.transform(X_train)
X_test_normal = ct.transform(X_test)

New Normalized and One Hot Encoded Data

In [None]:
X_train_normal

### Build Machine Learning Models with Training Data

In [None]:
# Define models
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=42),
    "XGBoost Regressor": xgb.XGBRegressor(n_estimators=50, random_state=42)
}

# Train models and evaluate performance
results = {}

for model_name, model in models.items():
    model.fit(X_train_normal, y_train)
    y_pred = model.predict(X_test_normal)

    mae = mean_absolute_error(y_test, y_pred)
    rmse = root_mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    results[model_name] = {'MAE': mae, 'RMSE': rmse, 'R^2': r2}

# Print results
print(pd.DataFrame(results))
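
For reference, the three metrics can be written out in a few lines of NumPy. The arrays below are invented numbers, used only to show what each score measures:

```python
import numpy as np

# Invented targets and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_hat

# MAE: average absolute error, in the same units as the target
mae = np.mean(np.abs(errors))

# RMSE: square root of the average squared error, penalizes large misses more
rmse = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus the ratio of residual variation to total variation around the mean
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE: {mae:.3f}, RMSE: {rmse:.3f}, R²: {r2:.3f}")
```

MAE and RMSE are expressed in dollars when the target is salary, while R² is unitless, which makes it the easier score to compare across datasets.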

In [None]:
# The other libraries used here were already imported at the top of the notebook
from sklearn.metrics import mean_squared_error

# Define models
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=42),
    "XGBoost Regressor": xgb.XGBRegressor(n_estimators=50, random_state=42)
}

# Train models and evaluate performance
results = {}

for model_name, model in models.items():
    model.fit(X_train_normal, y_train)
    y_pred = model.predict(X_test_normal)

    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE calculation
    r2 = r2_score(y_test, y_pred)

    results[model_name] = {'MAE': mae, 'RMSE': rmse, 'R²': r2}

# Convert results to DataFrame
results_df = pd.DataFrame(results).T  # Transpose for readability
print(results_df)

# Sort results for better visualization
sorted_results = sorted(results.items(), key=lambda x: x[1]['MAE'])

# Extract metrics for plotting
model_names = [x[0] for x in sorted_results]
mae_values = [x[1]['MAE'] for x in sorted_results]
rmse_values = [x[1]['RMSE'] for x in sorted_results]
r2_values = [x[1]['R²'] for x in sorted_results]

# Plot the metrics
x = np.arange(len(model_names))  # Label locations
width = 0.35  # Bar width

fig, ax = plt.subplots(figsize=(12, 8))

# Plot MAE and RMSE bars
ax.bar(x - width/2, mae_values, width, label='MAE', color='#87CEEB', edgecolor='black', linewidth=1.2, alpha=0.9)
ax.bar(x + width/2, rmse_values, width, label='RMSE', color='#1D2951', edgecolor='black', linewidth=1.2, alpha=0.9)

# Add R² values as text on top of bars
for i, (mae, rmse, r2) in enumerate(zip(mae_values, rmse_values, r2_values)):
    ax.text(i, rmse + 0.01 * max(rmse_values), f'R²: {r2:.3f}', 
            ha='center', va='bottom', fontsize=10, color='black', fontweight='bold')

# Labels, title, and formatting
ax.set_xlabel('Models')
ax.set_ylabel('MAE / RMSE')
ax.set_title('Model Performance Comparison')
ax.set_xticks(x)
ax.set_xticklabels(model_names, rotation=45, ha='right')
ax.legend(loc='upper left')

plt.tight_layout()
plt.show()

As we can see, even after transforming the data, the well-known machine learning models do not give impressive results. So far, the Random Forest Regressor has the best predictive performance, with R² = 0.76. <br>
In the next part, we develop a neural network model, which may improve the results.

### Build Neural Network with Training Data

In [None]:
# Ensure target variables are NumPy arrays
y_train = np.array(y_train).astype(np.float32).reshape(-1, 1)  # Explicit conversion
y_test = np.array(y_test).astype(np.float32).reshape(-1, 1)

# Set random seed
tf.random.set_seed(66)

# Define the model
nba_model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(1)  # Output layer for regression
])

# Compile the model
nba_model.compile(loss="mae", optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), metrics=["mae"])

# Train the model
fit_data = nba_model.fit(X_train_normal, y_train, epochs=100, verbose=0)

# Make predictions AFTER training
y_pred = nba_model.predict(X_test_normal).flatten()

# Compute evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)  # already the square root of the MSE
r2 = r2_score(y_test, y_pred)

# Print the results
print(f"MAE: {mae}")
print(f"RMSE: {rmse}")
print(f"R²: {r2}")

Evaluate the NBA model:

In [None]:
nba_model_loss, nba_model_mae = nba_model.evaluate(X_test_normal, y_test)

#### Calculate prediction measures (MAE, RMSE, R²)

Plot the training loss to see how it decreases as the number of epochs increases:

In [None]:
# Set figure size and style
plt.figure(figsize=(12, 6))
sns.set_style("whitegrid")

# Plot training loss
sns.lineplot(data=fit_data.history, linewidth=2.5, palette="tab10")

# Customize labels and title
plt.xlabel("Epochs", fontsize=14, fontweight='bold')
plt.ylabel("Loss", fontsize=14, fontweight='bold')
plt.title("Training Loss Over Epochs", fontsize=16, fontweight='bold')

# Improve tick visibility
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Add a legend for better understanding
plt.legend(labels=fit_data.history.keys(), fontsize=12, loc="upper right")

# Show the plot
plt.show()

In [None]:
nba_salary_stats.to_csv("nba_salary_stats.csv", index=False)

### Feature Importance using Random Forest Regressor on the NBA dataset

In [None]:
# Load and prepare the data
X = nba_salary_stats.drop(columns=['salary'])  # Features
y = nba_salary_stats['salary']  # Target variable

# Convert categorical features using one-hot encoding
X = pd.get_dummies(X, drop_first=True)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale numerical features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Extract feature importance scores
feature_importances = rf_model.feature_importances_
feature_names = X.columns

# Create a DataFrame for visualization
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False).head(10)  # Top 10 features

# Plot Feature Importance
plt.figure(figsize=(12, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df, palette='viridis')
plt.xlabel('Feature Importance Score')
plt.ylabel('Feature')
plt.title('Top 10 Most Important Features for NBA Salary Prediction')
plt.show()
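
Because `pd.get_dummies` spreads each categorical column over several dummy columns, the importance of a column such as `team` is split across many small bars. One option is to sum the dummy importances back to their source column. The sketch below does this on synthetic data; the `demo` frame, the target `y_demo`, and the prefix-based mapping are assumptions of this example:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on career_PTS only
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    "career_PTS": rng.normal(size=n),
    "team": rng.choice(["ATL", "BOS", "CHI"], size=n),
})
y_demo = 3.0 * demo["career_PTS"] + rng.normal(scale=0.1, size=n)

X_demo = pd.get_dummies(demo, drop_first=True)
rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

importances = pd.Series(rf.feature_importances_, index=X_demo.columns)

# Map each dummy column back to its source column, e.g. "team_BOS" -> "team"
source = {c: ("team" if c.startswith("team_") else c) for c in X_demo.columns}
grouped = importances.groupby(source).sum().sort_values(ascending=False)

print(grouped)
```

The grouped scores still sum to 1, so they remain comparable with the per-column scores plotted above.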