Good morning, everyone. My name is Caleb Spikes, and I would like to present my DAT 350 final project. At first I chose to do my project on credit card fraud; however, I found that my ML models just weren't compatible with that dataset, so I had to pivot. I chose instead to explore a dataset from Kaggle.com on popular baby names using several machine learning models. The dataset covers several years and provides a variety of information: Year of Birth, Gender, Ethnicity, Child's First Name, and Count, the number of occurrences of each baby name, which is the column I chose to predict.

The machine learning models I chose were Logistic Regression, Random Forest, Decision Tree, K-Nearest Neighbors, Gradient Boosting, and Naive Bayes. These models differ substantially from one another, which gave me several ways to view the dataset and, in turn, a deeper understanding of what has influenced baby name popularity over time. With this final project, I hoped to learn how to apply different types of machine learning models to real-world situations.

First, we will load our dataset and explore its general structure so that we have a better understanding of it before moving forward.

In [6]:
import pandas as pd

data = pd.read_csv("Popular_Baby_Names.csv")

print(data.head())

   Year of Birth  Gender Ethnicity Child's First Name  Count  Rank
0           2011  FEMALE  HISPANIC          GERALDINE     13    75
1           2011  FEMALE  HISPANIC                GIA     21    67
2           2011  FEMALE  HISPANIC             GIANNA     49    42
3           2011  FEMALE  HISPANIC            GISELLE     38    51
4           2011  FEMALE  HISPANIC              GRACE     36    53
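
Beyond head(), a few extra checks can make the structure clearer before preprocessing. The sketch below is only an illustration: it rebuilds the five rows printed above into a small frame (the name `sample` is hypothetical) rather than reading the full CSV, and reports the shape, column types, missing values, duplicate rows, and the number of distinct Count values.

```python
import pandas as pd

# Hypothetical miniature frame built from the five rows shown by head() above.
# On the real notebook, the same calls would be made on `data` instead.
sample = pd.DataFrame({
    "Year of Birth": [2011, 2011, 2011, 2011, 2011],
    "Gender": ["FEMALE"] * 5,
    "Ethnicity": ["HISPANIC"] * 5,
    "Child's First Name": ["GERALDINE", "GIA", "GIANNA", "GISELLE", "GRACE"],
    "Count": [13, 21, 49, 38, 36],
    "Rank": [75, 67, 42, 51, 53],
})

shape = sample.shape                              # (rows, columns)
missing_per_column = sample.isna().sum()          # missing values in each column
duplicate_rows = int(sample.duplicated().sum())   # fully identical rows
unique_counts = sample["Count"].nunique()         # distinct values of the target

print(shape)
print(sample.dtypes)
print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}, distinct Count values: {unique_counts}")
```

The number of distinct Count values matters later, because a classifier treats each distinct value as its own class.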


Next, we'll handle missing values, encode the categorical data, remove outliers, and standardize the features.

In [9]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.neighbors import LocalOutlierFactor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

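# Drop rows that contain any missing values
data.dropna(inplace=True)

# Encode each categorical column as integer labels, keeping the fitted
# encoders so the original values can be recovered later
label_encoders = {}
for column in ["Gender", "Ethnicity", "Child's First Name"]:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le

# Flag outliers on the encoded categorical columns with Isolation Forest;
# fit_predict returns 1 for inliers and -1 for outliers
iso_forest = IsolationForest()
outliers = iso_forest.fit_predict(data.drop(["Year of Birth", "Count", "Rank"], axis=1))
data = data[outliers == 1]

# Repeat the filtering with Local Outlier Factor on the remaining rows
lof = LocalOutlierFactor()
outliers = lof.fit_predict(data.drop(["Year of Birth", "Count", "Rank"], axis=1))
data = data[outliers == 1]

# Features are all columns except Count (the target) and Rank
X = data.drop(["Count", "Rank"], axis=1)
y = data["Count"]

# Standardize the features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)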
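
In the cell above, the scaler is fitted on all rows before the split, so the test rows influence the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the arrays and names (`X_demo`, `scaler_demo`, and so on) are hypothetical and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first, so the test rows stay unseen during fitting
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then reuse it on the test rows
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # approximately zero by construction
print(X_te_scaled.mean(axis=0))  # close to zero, but not exactly
```

With this ordering, the test-set means are only approximately zero, which is expected: the test rows are scaled with statistics they did not contribute to.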


Now that our dataset is clean and prepared, we can build and run our machine learning models.

In [11]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Naive Bayes": GaussianNB()
}

# Fit each model on the training split and report its test-set score
# (for these classifiers, .score() returns mean accuracy)
for name, model in models.items():
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    print(f"{name} Score: {score}")

Logistic Regression Score: 0.07077326343381389
Random Forest Score: 0.8387942332896461
Decision Tree Score: 0.846002621231979
K-Nearest Neighbors Score: 0.7916120576671035
Gradient Boosting Score: 0.019003931847968544
Naive Bayes Score: 0.047182175622542594


1. Logistic Regression - 0.0708
 - Logistic Regression achieved a low score of 0.0708, meaning it predicted the exact count correctly for only about 7% of the test rows. Because every model here is a classifier, each distinct Count value is treated as a separate class, and the score is classification accuracy. Logistic regression can only draw linear decision boundaries between those classes, so if the relationship between the columns and Count is not linear, it will struggle.

2. Random Forest - 0.8388 
 - Random Forest achieved a score of 0.8388, indicating good performance in predicting the count of popular baby names. Random Forest is an ensemble learning method that combines multiple decision trees, and it performed well in this case, suggesting its reliability. A likely reason is that averaging many trees makes it less prone to overfitting than a single tree.

3. Decision Tree - 0.8460
 - Decision Tree achieved a score of 0.8460, slightly higher than Random Forest. A decision tree models the data using a tree-like structure of successive splits on the features. Decision trees can find underlying patterns in large datasets, even though they are prone to overfitting.

4. K-Nearest Neighbors - 0.7916
 - K-Nearest Neighbors achieved a score of 0.7916, indicating moderate performance. KNN is a non-parametric method used for classification and regression, and it performed reasonably well in this case. KNN relies on rows with similar feature values having similar targets, and that appears to hold for this dataset.

5. Gradient Boosting - 0.0190
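 - Gradient Boosting achieved a very low score of 0.0190, indicating poor performance. Gradient Boosting is an ensemble technique that builds trees sequentially, and in this case it performed poorly compared to the other models. One possible cause is outliers remaining in the data. Another may be that Count has many distinct values, each treated as its own class, which is a difficult setting for a boosted classifier with default settings.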

6. Naive Bayes - 0.0472
 - Naive Bayes achieved a low score of 0.0472, indicating poor performance. Naive Bayes is a probabilistic classifier based on Bayes' theorem with strong (naive) independence assumptions between features. A likely reason for the poor result is that the features in my dataset are not independent of one another, which violates that core assumption.

In summary, Decision Tree and Random Forest are the most reliable models based on their relatively high scores.
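
The scores above come from each classifier's .score() method, which in scikit-learn is mean accuracy; a regressor's .score() is R² instead. The sketch below illustrates the difference on synthetic count-like data, so the two kinds of score are not compared directly. The data and names here (`X_demo`, `clf`, `reg`) are hypothetical and are not the notebook's.

```python
import numpy as np
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Synthetic integer target that loosely resembles a count (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(300, 3)).astype(float)
y_demo = (X_demo[:, 0] * 3 + X_demo[:, 1] + rng.integers(0, 3, size=300)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Same algorithm family, once as a classifier and once as a regressor
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
reg = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

clf_score = clf.score(X_te, y_te)  # fraction of exactly matching predictions
reg_score = reg.score(X_te, y_te)  # R^2, rewards predictions that are close

print(f"Classifier accuracy: {clf_score:.3f}")
print(f"Regressor R^2:       {reg_score:.3f}")
```

Accuracy gives no credit to a prediction that misses the true count by one, while R² does, which is one reason a numeric target like Count is often modeled with regressors.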

Final steps:
 - Hyperparameter optimization
 - Compare different models using cross validation
 - Retrain the best models with the best hyperparameters and evaluate them on a test set that was held out from the rest of the tuning process
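
The three steps above can be sketched as follows. This is a minimal illustration on synthetic regression data, not the notebook's actual run: the dataset, the grid, and the names (`X_dev`, `X_hold`, `search`) are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (illustration only)
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)

# Set aside a hold-out test set before any tuning happens
X_dev, X_hold, y_dev, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Steps 1 and 2: hyperparameter search with 5-fold cross-validation on the
# development rows only
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [None, 5, 10]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_dev, y_dev)

# Step 3: with the default refit=True, best_estimator_ has already been
# retrained on all development rows; evaluate it once on the hold-out set
holdout_mse = mean_squared_error(y_hold, search.best_estimator_.predict(X_hold))

print(f"Best parameters: {search.best_params_}")
print(f"Hold-out MSE: {holdout_mse:.2f}")
```

The hold-out rows are touched only once, at the very end, so the reported error is not biased by the hyperparameter search.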

In [23]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

data = pd.read_csv("Popular_Baby_Names.csv")

In [24]:
X = data.drop(["Count", "Rank"], axis=1)
y = data["Count"]

encoder = OneHotEncoder(handle_unknown='ignore')
X_encoded = encoder.fit_transform(X)
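
One possible variation is to place the encoder inside the pipeline, so that it is refitted on the training portion of every cross-validation fold. The sketch below is an assumption-laden illustration rather than the notebook's method: it uses the five rows printed by head() earlier, passes Year of Birth through as a number instead of one-hot encoding it, and uses a made-up unseen name ("ZOE") to show what handle_unknown='ignore' does.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Small frame built from the rows shown by head() earlier in the notebook
train = pd.DataFrame({
    "Year of Birth": [2011, 2011, 2011, 2011, 2011],
    "Gender": ["FEMALE"] * 5,
    "Ethnicity": ["HISPANIC"] * 5,
    "Child's First Name": ["GERALDINE", "GIA", "GIANNA", "GISELLE", "GRACE"],
    "Count": [13, 21, 49, 38, 36],
})
cat_cols = ["Gender", "Ethnicity", "Child's First Name"]
X_small = train.drop(columns=["Count"])
y_small = train["Count"]

# One-hot encode the categorical columns; pass the year through unchanged
prep = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
    remainder="passthrough",
)
pipe = Pipeline([("prep", prep), ("model", DecisionTreeRegressor(random_state=0))])
pipe.fit(X_small, y_small)

# A hypothetical row whose name was never seen during fitting
new_row = pd.DataFrame({
    "Year of Birth": [2011],
    "Gender": ["FEMALE"],
    "Ethnicity": ["HISPANIC"],
    "Child's First Name": ["ZOE"],
})
fitted_encoder = pipe.named_steps["prep"].named_transformers_["onehot"]
encoded_new = fitted_encoder.transform(new_row[cat_cols]).toarray()
prediction = pipe.predict(new_row)

print(encoded_new)  # the name columns are all zero for the unseen name
print(prediction)
```

With handle_unknown='ignore', an unseen category is encoded as all zeros in its block of columns instead of raising an error, so the pipeline can still produce a prediction.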

In [25]:
pipelines = {
    "Random Forest": Pipeline([
        ("model", RandomForestRegressor())
    ]),
    "Decision Tree": Pipeline([
        ("model", DecisionTreeRegressor())
    ]),
    "Gradient Boosting": Pipeline([
        ("model", GradientBoostingRegressor())
    ]),
    "K-Nearest Neighbors": Pipeline([
        ("model", KNeighborsRegressor())
    ])
}

In [26]:
param_grids = {
    "Random Forest": {
        "model__n_estimators": [50, 100, 200],
        "model__max_depth": [None, 10, 20]
    },
    "Decision Tree": {
        "model__max_depth": [None, 10, 20]
    },
    "Gradient Boosting": {
        "model__n_estimators": [50, 100, 200],
        "model__max_depth": [3, 5, 7]
    },
    "K-Nearest Neighbors": {
        "model__n_neighbors": [3, 5, 7]
    }
}

In [27]:
best_estimators = {}

for name, pipeline in pipelines.items():
    print(f"Performing GridSearchCV for {name}...")
    grid_search = GridSearchCV(pipeline, param_grids[name], cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
    grid_search.fit(X_encoded, y)
    best_estimators[name] = grid_search.best_estimator_
    print(f"Best parameters for {name}: {grid_search.best_params_}")

Performing GridSearchCV for Random Forest...
Best parameters for Random Forest: {'model__max_depth': None, 'model__n_estimators': 100}
Performing GridSearchCV for Decision Tree...
Best parameters for Decision Tree: {'model__max_depth': None}
Performing GridSearchCV for Gradient Boosting...
Best parameters for Gradient Boosting: {'model__max_depth': 7, 'model__n_estimators': 200}
Performing GridSearchCV for K-Nearest Neighbors...
Best parameters for K-Nearest Neighbors: {'model__n_neighbors': 5}


After using GridSearchCV for each of my ML models, I was able to get the best hyperparameters for each:

1. Random Forest:
 - For Random Forest, the best configuration used unlimited depth for the individual decision trees (max_depth=None) and 100 estimators (i.e., the number of decision trees in the forest). Since 100 was chosen over both 50 and 200, adding trees beyond 100 did not improve the cross-validated error on this data, while leaving the depth of each tree unconstrained gave the model the flexibility it needed.

2. Decision Tree: 
 - For Decision Tree, the best model was obtained when the tree was allowed to grow to its maximum depth without any constraint (max_depth=None). This indicates that neither depth limit in the grid (10 or 20) improved the cross-validated error, so fully grown trees fit this dataset best among the options tried.

3. Gradient Boosting:
 - Gradient Boosting achieved the best performance with a maximum tree depth of 7 (max_depth=7) and 200 estimators (n_estimators=200). Both values are the largest in their grids, which suggests that deeper trees and more boosting iterations kept helping, and that the true optimum may lie beyond the ranges searched.

4. K-Nearest Neighbors:
 - K-Nearest Neighbors performed optimally with 5 neighbors (n_neighbors=5). This means that considering the 5 nearest neighbors for each prediction yielded the best results in terms of minimizing the mean squared error.
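
Because the search uses scoring='neg_mean_squared_error', the score GridSearchCV reports is a negated MSE, where values closer to zero are better. The sketch below shows how that score could be converted into an RMSE in the units of the target. It runs on synthetic data with hypothetical names (`X_demo`, `search`), so the numbers it prints say nothing about the baby names results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

search = GridSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": [3, 5, 7]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negated MSE across folds for the best setting
best_rmse = float(np.sqrt(-search.best_score_))

# Mean negated MSE for every candidate, in the order listed in cv_results_
mean_scores = search.cv_results_["mean_test_score"]
for params, score in zip(search.cv_results_["params"], mean_scores):
    print(params, f"RMSE = {np.sqrt(-score):.2f}")

print(f"Best setting: {search.best_params_}, RMSE = {best_rmse:.2f}")
```

Printing the score for every candidate, not just the winner, shows how much the choice of hyperparameter actually matters.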

By tuning the hyperparameters through grid search with cross-validation, we were able to optimize the performance of each machine learning model and gain insight into the most effective configurations for predicting baby name counts.