For this project, imagine yourself as a machine learning engineer at a company. Your boss, a respected but somewhat weary expert in the field, has handed you the dataset from step 1 and asked you to clean and prepare the data for machine learning, then train four different machine learning models that make predictions from that data. Your boss will also want information on the accuracy of the models, along with a discussion of whether you think your models have too much bias or variance.

1. What features/columns had a relatively even or normal distribution? Which features/columns did not?

The HP, Attack, Defense, Special Attack, Special Defense, and Speed columns all had relatively normal distributions, while the Pokemon number column (#) had a relatively even distribution.
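
One way to back up a claim like this with numbers is to compare each column's mean, median, and skewness. The snippet below is a minimal sketch on made-up data; the helper name `summarize_shape` and the toy columns are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def summarize_shape(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return mean, median, and sample skewness for each listed numeric column."""
    return pd.DataFrame({
        "mean": df[columns].mean(),
        "median": df[columns].median(),
        "skew": df[columns].skew(),
    })

# Hypothetical toy data: one symmetric column and one with a long right tail.
toy = pd.DataFrame({
    "symmetric": [40, 50, 60, 70, 80],
    "right_skewed": [10, 12, 14, 16, 200],
})

shape = summarize_shape(toy, ["symmetric", "right_skewed"])
print(shape)
```

A skewness near 0 with mean and median close together is consistent with a symmetric, roughly normal shape; a large positive or negative skewness suggests a long tail.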

2. How did you handle missing values? Why did you use this method as opposed to others?

I handled my missing values by replacing the NAs or blank values with the mean or median value of that specific column. I chose this method because it kept the data simple and kept the filled values in line with the rest of the column. It also kept the data reasonable: if I had filled these values with, say, 0, my data would have been very skewed and inaccurate.
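
The difference between filling with the median and filling with 0 can be seen on a small example. This is a sketch on a hypothetical `Attack` column with one missing value; the numbers are made up for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a single missing value.
toy = pd.DataFrame({"Attack": [50.0, 60.0, np.nan, 70.0, 80.0]})

# Median of the observed values (50, 60, 70, 80) is 65.
median_filled = toy["Attack"].fillna(toy["Attack"].median())
zero_filled = toy["Attack"].fillna(0)

print("mean after median fill:", median_filled.mean())
print("mean after zero fill:  ", zero_filled.mean())
```

Filling with the median leaves the column's center where it was, while filling with 0 drags the mean down and introduces an artificial low value.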

3. How did you encode your categorical data? Why did you use this method as opposed to others?

I encoded my categorical data using one-hot encoding because this method converted my categorical data into a numerical format without any trouble. Each category becomes its own 0/1 column, so no artificial ordering between the categories is implied.
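
As a small illustration of what one-hot encoding produces, the sketch below runs `pd.get_dummies` on a toy `Type 1` column. The four rows are made up for the example; `dtype=int` is used only so the output prints as 0/1 instead of True/False.

```python
import pandas as pd

# Hypothetical toy column with three distinct categories.
toy = pd.DataFrame({"Type 1": ["Grass", "Fire", "Water", "Grass"]})

# Each category becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["Type 1"], dtype=int)
print(encoded)
```

Every row has exactly one indicator set to 1, which is where the name "one-hot" comes from.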

4. How did you handle removing outliers? Why did you use this method as opposed to others?

I handled my outliers using statistical methods, Z-score and IQR, as we reviewed in class. I had a couple of extreme outliers that, if left alone, could have changed the outcomes of my machine learning tests, so this method was the best way to counter that.
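
The IQR rule can be sketched as follows. The helper name `iqr_mask`, the conventional multiplier of 1.5, and the toy `HP` values are assumptions for illustration rather than the exact settings used in this project.

```python
import pandas as pd

def iqr_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)

# Hypothetical toy column with one extreme value.
toy = pd.Series([45, 50, 55, 60, 65, 70, 255], name="HP")

# Keep only the rows that fall inside the IQR fences.
kept = toy[iqr_mask(toy)]
print(kept.tolist())
```

Unlike the Z-score rule, the IQR fences are built from quartiles, so a single extreme value does not widen the range that counts as normal.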

5. How did you normalize/standardize the data? Why did you use this method as opposed to others?

My data was standardized using z-score normalization. I chose this because it scaled my data so that each feature had a mean of 0 and a standard deviation of 1, which overall helped the performance of my machine learning tests.
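
Z-score standardization computes z = (x - mean) / std for each feature. The sketch below applies that formula by hand to a toy column; the helper name `standardize` and the values are illustrative assumptions. It uses the population standard deviation (NumPy's default), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

def standardize(values: np.ndarray) -> np.ndarray:
    """Column-wise z = (x - mean) / std, with the population std (ddof=0)."""
    return (values - values.mean(axis=0)) / values.std(axis=0)

# Hypothetical single-feature column.
toy = np.array([[40.0], [50.0], [60.0], [70.0], [80.0]])

z = standardize(toy)
print(z.ravel())
print("mean:", z.mean(), "std:", z.std())
```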


6. How did each model perform? Which performed the best?



7. Did any models seem to have a relatively high amount of bias (underfitting)? Variance (overfitting)?



Let's start by loading the dataset and performing some initial exploration.

In [1]:
import pandas as pd

pokemon_df = pd.read_csv("Pokemon.csv")

print(pokemon_df.head())

print("\nMissing values:\n", pokemon_df.isnull().sum())

print("\nSummary statistics:\n", pokemon_df.describe())

print("\nDistribution of 'Type 1':\n", pokemon_df['Type 1'].value_counts())
print("\nDistribution of 'Type 2':\n", pokemon_df['Type 2'].value_counts())


   #                   Name Type 1  Type 2  Total  HP  Attack  Defense  \
0  1              Bulbasaur  Grass  Poison    318  45      49       49   
1  2                Ivysaur  Grass  Poison    405  60      62       63   
2  3               Venusaur  Grass  Poison    525  80      82       83   
3  3  VenusaurMega Venusaur  Grass  Poison    625  80     100      123   
4  4             Charmander   Fire     NaN    309  39      52       43   

   Sp. Atk  Sp. Def  Speed  Generation  Legendary  
0       65       65     45           1      False  
1       80       80     60           1      False  
2      100      100     80           1      False  
3      122      120     80           1      False  
4       60       50     65           1      False  

Missing values:
 #               0
Name            0
Type 1          0
Type 2        512
Total           0
HP              0
Attack          0
Defense         0
Sp. Atk         0
Sp. Def         0
Speed           0
Generation      0
Legendary

Let's proceed with the data cleaning steps.

In [2]:
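# Pokemon with a single type have no 'Type 2'; label them explicitly.
# Assigning the result back avoids the chained in-place fill that newer pandas versions warn about.
pokemon_df['Type 2'] = pokemon_df['Type 2'].fillna('None')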

from scipy.stats import zscore

# Drop any row where a numeric column lies 3 or more standard deviations from its mean.
z_scores = zscore(pokemon_df.select_dtypes(include='number'))
abs_z_scores = abs(z_scores)
filtered_entries = (abs_z_scores < 3).all(axis=1)
pokemon_df = pokemon_df[filtered_entries]

# One-hot encode both type columns.
pokemon_df = pd.get_dummies(pokemon_df, columns=['Type 1', 'Type 2'])

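# NOTE: X still contains 'HP' (the target) and 'Total' (the sum of the six stats,
# including HP), so the models can read the answer directly from the features.
X = pokemon_df.drop(columns=['#', 'Name'])
y = pokemon_df['HP']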

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first so the scaler's mean and standard deviation come from the training rows only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
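
Since the target here is HP, it is worth checking which feature columns are derived from it. The sketch below rebuilds the first three rows of the `head()` output shown earlier, checks whether `Total` equals the sum of the six stat columns, and builds a feature set without the target or `Total`. The variable names (`head_rows`, `leaky_columns`, `X_no_leak`) are assumptions for illustration, and the check covers only these three rows.

```python
import pandas as pd

# First three rows copied from the head() output above.
head_rows = pd.DataFrame({
    "Total": [318, 405, 525],
    "HP": [45, 60, 80],
    "Attack": [49, 62, 82],
    "Defense": [49, 63, 83],
    "Sp. Atk": [65, 80, 100],
    "Sp. Def": [65, 80, 100],
    "Speed": [45, 60, 80],
})

stat_cols = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

# If Total is the sum of the stats, HP can be recovered exactly from the other columns.
total_is_sum = bool((head_rows[stat_cols].sum(axis=1) == head_rows["Total"]).all())
print("Total equals sum of stats:", total_is_sum)

# Drop the target itself and any column computed from it before training.
target = "HP"
leaky_columns = [target, "Total"]
X_no_leak = head_rows.drop(columns=leaky_columns)
y_target = head_rows[target]
print(list(X_no_leak.columns))
```

With the target or a column computed from it left in the features, near-perfect scores say little about how well a model would predict HP from genuinely independent information.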


Let's train the machine learning models.

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree Regression': DecisionTreeRegressor(),
    'Random Forest Regression': RandomForestRegressor(),
    'Gradient Boosting Regression': GradientBoostingRegressor()
}

for name, model in models.items():
    print(f"\nTraining {name}...")
    model.fit(X_train, y_train)
    
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    r2_train = r2_score(y_train, y_pred_train)
    r2_test = r2_score(y_test, y_pred_test)
    rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
    rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
    mae_train = mean_absolute_error(y_train, y_pred_train)
    mae_test = mean_absolute_error(y_test, y_pred_test)
    
    print(f"Performance of {name}:")
    print(f"R squared (Train): {r2_train:.4f}, R squared (Test): {r2_test:.4f}")
    print(f"Root Mean Squared Error (Train): {rmse_train:.4f}, Root Mean Squared Error (Test): {rmse_test:.4f}")
    print(f"Mean Absolute Error (Train): {mae_train:.4f}, Mean Absolute Error (Test): {mae_test:.4f}")

    if r2_train < 0.7:
        print("The model seems to have high bias (underfitting).")
    elif r2_train > 0.9 and (r2_train - r2_test) > 0.1:
        print("The model seems to have high variance (overfitting).")
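
The bias/variance check at the end of the loop can also be written as a small standalone function, which makes the thresholds easy to adjust and test. This is a sketch: the function name `diagnose_fit` and the fallback message are assumptions, while the thresholds (0.7, 0.9, and a gap of 0.1) are the ones used in the loop above.

```python
def diagnose_fit(r2_train: float, r2_test: float,
                 low: float = 0.7, high: float = 0.9, gap: float = 0.1) -> str:
    """Rough heuristic: label a model from its train and test R squared."""
    if r2_train < low:
        return "high bias (underfitting)"
    if r2_train > high and (r2_train - r2_test) > gap:
        return "high variance (overfitting)"
    return "no strong sign of either"

# Hypothetical scores illustrating each branch.
print(diagnose_fit(0.55, 0.50))
print(diagnose_fit(0.99, 0.70))
print(diagnose_fit(0.95, 0.93))
```

These cutoffs are a rule of thumb, not a formal test; a small train/test gap only rules out overfitting if the features do not already contain the target.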




Training Linear Regression...
Performance of Linear Regression:
R squared (Train): 1.0000, R squared (Test): 1.0000
Root Mean Squared Error (Train): 0.0000, Root Mean Squared Error (Test): 0.0082
Mean Absolute Error (Train): 0.0000, Mean Absolute Error (Test): 0.0006

Training Decision Tree Regression...
Performance of Decision Tree Regression:
R squared (Train): 1.0000, R squared (Test): 0.9997
Root Mean Squared Error (Train): 0.0000, Root Mean Squared Error (Test): 0.3699
Mean Absolute Error (Train): 0.0000, Mean Absolute Error (Test): 0.0708

Training Random Forest Regression...
Performance of Random Forest Regression:
R squared (Train): 0.9999, R squared (Test): 0.9999
Root Mean Squared Error (Train): 0.2489, Root Mean Squared Error (Test): 0.2786
Mean Absolute Error (Train): 0.0492, Mean Absolute Error (Test): 0.1018

Training Gradient Boosting Regression...
Performance of Gradient Boosting Regression:
R squared (Train): 1.0000, R squared (Test): 1.0000
Root Mean Squared Error (T