# P5 - Learning

This project gives you experience with Learning topics.

In [7]:
import pandas as pd 
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_iris
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn import model_selection
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, balanced_accuracy_score, make_scorer

from sklearn.feature_selection import SelectPercentile

from sklearn.preprocessing import MinMaxScaler, StandardScaler

from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn import svm

from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline

from sklearn import metrics

# Fix the NumPy seed for repeatability (numpy is already imported above)
np.random.seed(5550)

import otter
grader = otter.Notebook()

# Problem: Classification - Music Hits 

For this problem, you will work to classify a song’s popularity. Specifically, you will develop methods to predict whether a song will make the Top10 of Billboard’s Hot 100 Chart. The data set consists of songs from the Top10 of Billboard’s Hot 100 Chart from 1990-2010 along with a sampling of other songs that did not make the list.

The data source is adapted from one used in an MIT 15.071 course. The data set was created by scraping Billboard’s Hot 100 and other songs on Billboard, and by using the EchoNest API, now a part of Spotify, to get song information.

The data set includes several descriptors of the song and artist (including song title and ID numbers) and the year the song was released. Additionally, several variables describe the song's attributes: time signature, loudness, tempo, key, energy, pitch, and timbre (measured over different sections of the song). The last variable is a binary indicator of whether the song was in the Top10 or not.

You will use the variables of the song attributes to predict whether the song will be popular or not.

## Q1 - Load and understand the data 

Load in the `music` data. 

You should not use the `year`, `artistname`, `artistID`, `songtitle` or `songID` in the prediction.  
Additionally, remove any variables that are the confidence of another variable, e.g., `timesignature_confidence`, `tempo_confidence`. 


Create an input feature matrix `Xm` and a label vector `ym` that you will use to create your classifiers. 
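As a minimal sketch of the column filtering described above, the confidence columns can be picked out by their `_confidence` suffix instead of being listed by hand. The toy frame below is made up for illustration; only the column names mirror those mentioned in this question, and `toy`, `X_toy`, and `y_toy` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for music.csv: the values are invented, only the column names
# follow the ones mentioned in the question.
toy = pd.DataFrame({
    "year": [2001, 2005],
    "songtitle": ["a", "b"],
    "tempo": [91.5, 140.0],
    "tempo_confidence": [0.9, 0.4],
    "key": [11, 10],
    "key_confidence": [0.5, 0.7],
    "Top10": [0, 1],
})

# Identifier-like columns that must not be used for prediction
id_cols = ["year", "songtitle"]
# Any column that stores the confidence of another variable
conf_cols = [c for c in toy.columns if c.endswith("_confidence")]

X_toy = toy.drop(columns=id_cols + conf_cols + ["Top10"])  # feature matrix
y_toy = toy["Top10"]                                       # label vector
print(list(X_toy.columns))
```

Selecting by suffix assumes every confidence column follows the same naming pattern; an explicit list is safer when that is not guaranteed.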


In [8]:
music = pd.read_csv("music.csv", encoding="ISO-8859-1")

columns_drop = [  # identifiers, release year, and confidence columns excluded from the features
    "year",
    "artistname",
    "artistID",
    "songtitle",
    "songID",
    "timesignature_confidence",
    "tempo_confidence",
    "key_confidence"
]

Xm = music.drop(columns=columns_drop + ["Top10"])
ym = music["Top10"]

Xm.head()

Unnamed: 0,timesignature,loudness,tempo,key,energy,pitch,timbre_0_min,timbre_0_max,timbre_1_min,timbre_1_max,...,timbre_7_min,timbre_7_max,timbre_8_min,timbre_8_max,timbre_9_min,timbre_9_max,timbre_10_min,timbre_10_max,timbre_11_min,timbre_11_max
0,3,-4.262,91.525,11,0.966656,0.024,0.002,57.342,-6.496,171.093,...,-71.127,82.475,-52.025,39.116,-35.368,71.642,-126.44,18.658,-44.77,25.989
1,4,-4.051,140.048,10,0.98471,0.025,0.0,57.414,-37.351,171.13,...,-65.807,106.918,-61.32,35.378,-81.928,74.574,-103.808,121.935,-38.892,22.513
2,4,-3.571,160.512,2,0.9899,0.026,0.003,57.422,-17.222,171.06,...,-67.433,80.621,-59.773,45.979,-46.293,59.904,-108.313,33.3,-43.733,25.744
3,4,-3.815,97.525,1,0.939207,0.013,0.0,57.765,-32.083,220.895,...,-63.667,96.675,-78.66,41.088,-49.194,95.44,-102.676,46.422,-59.439,37.082
4,4,-6.891,80.009,7,0.977862,0.008,0.057,55.404,-6.627,216.684,...,-57.084,77.725,-115.062,49.312,-31.369,40.076,-179.702,18.52,-57.549,22.489


In [9]:
grader.check("q1")

## Q2. Classify Top 10 Hits 

We want to report the results of predicting the top-10 hits using KNN, Decision Trees, and SVMs.  

You will perform grid search to select the best hyperparameters with cross-validation.  However, you may not use `GridSearchCV`.  Instead you must use `StratifiedKFold` and other methodologies shown in class. 

You will do an initial stratified split of your data into training+validation set with 80% of the data and a test set with 20% of the data (`random_state`=5).  Within the train+val data, use 10-fold stratified cross-validation with a `random_state` = 5 and `shuffle` = True. 
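The split described above can be sketched on synthetic labels to see what stratification does. The data here is invented (200 rows with 15% positives) and the `*_demo` names are hypothetical; only the `test_size`, `random_state`, `n_splits`, and `shuffle` settings come from the text.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic, imbalanced data: 30 positives and 170 negatives (made up)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([1] * 30 + [0] * 170)

# 80/20 stratified split: the test set keeps the 15% positive rate
X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5, stratify=y_demo
)

# 10-fold stratified CV inside the train+val portion
cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=5)
fold_pos = [int(y_tv[val_idx].sum()) for _, val_idx in cv_demo.split(X_tv, y_tv)]

print("positives in test:", int(y_te.sum()), "of", len(y_te))
print("positives per validation fold:", fold_pos)
```

Each validation fold ends up with nearly the same number of positives, which keeps the per-fold F1 estimates comparable.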

For each model, you will tune the hyper-parameters:    
* KNN, number of neighbors = [5, 9, 13, 17] and weights = ['uniform', 'distance']
* Decision Trees, maximum depth of the tree = [3, 5, 8, 12] and criterion of ['gini', 'entropy'], set the random_state = 5
* SVM, use an RBF kernel with C = [0.01, 0.1, 1, 10] 
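One way to enumerate such a grid without `GridSearchCV` is `itertools.product`. This is only a sketch of the bookkeeping for the KNN grid listed above; the names `knn_grid_demo`, `scaler_names_demo`, and `combos` are hypothetical.

```python
from itertools import product

# KNN grid from the list above, plus the two scaling options
knn_grid_demo = {"n_neighbors": [5, 9, 13, 17], "weights": ["uniform", "distance"]}
scaler_names_demo = ["StandardScaler", "MinMaxScaler"]

# Every (scaler, n_neighbors, weights) combination as a plain dict
combos = [
    dict(zip(knn_grid_demo, values), scaler=scaler_name)
    for scaler_name in scaler_names_demo
    for values in product(*knn_grid_demo.values())
]

print(len(combos))  # 2 scalers * 4 neighbor counts * 2 weightings
print(combos[0])
```

With 10 folds, each of these combinations is fit 10 times, so the KNN search alone costs 160 model fits.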

In addition, you will want to see which scaling method seems to work best for this dataset and each model: `StandardScaler` or `MinMaxScaler`. 

When selecting the best hyper-parameters, instead of using accuracy you will use the `f1_measure`.  

Make sure to consider how to set up the training and evaluation of your models to avoid overfitting and data leakage. 
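A small illustration of the leakage point, using invented numbers: when the scaler is fit on the training fold only, validation values are free to fall outside the training range, which shows that no validation statistics were used. The arrays and names below are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature; the training fold spans 0..10 (toy values)
X_train_fold = np.array([[0.0], [5.0], [10.0]])
# The validation fold contains values outside that range
X_val_fold = np.array([[12.0], [-2.0]])

# Fit on the training fold only, then apply the same transform to both folds
fold_scaler = MinMaxScaler().fit(X_train_fold)
X_train_scaled = fold_scaler.transform(X_train_fold)
X_val_scaled = fold_scaler.transform(X_val_fold)

print(X_train_scaled.ravel())  # stays within [0, 1]
print(X_val_scaled.ravel())    # may leave [0, 1], since the scaler never saw these rows
```

Fitting the scaler on all rows before splitting would instead squeeze every value into [0, 1] and let validation information shape the training features.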

Report out the best hyperparameters selected. 

Retrain the best model on the train+val data and report the f1-measure on the test data set. 

In [10]:
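# Helper: mean F1 over the cross-validation folds for one model/scaler combination.
# Relies on the global `kf` (StratifiedKFold) defined below, before the first call.
def evaluate_model(model, X, y, scaler=None):
    f1_scores = []
    for train_idx, val_idx in kf.split(X, y):
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
        if scaler is not None:
            # Fit the scaler on the training fold only to avoid leaking validation statistics
            scaler.fit(X_train)
            X_train = scaler.transform(X_train)
            X_val = scaler.transform(X_val)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)
        f1_scores.append(f1_score(y_val, y_pred))  # default binary F1 (positive label 1)
    return sum(f1_scores) / len(f1_scores)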

#split data into train+val and test sets 
X_train_val, X_test, y_train_val, y_test = train_test_split(
    Xm, ym, test_size=0.2, random_state=5, stratify=ym
)

#10-fold stratified cross-validation
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=5)

#needed variables: scalers, hyperparameter grids, and best-so-far trackers
scalers = [StandardScaler(), MinMaxScaler()]
scaler_names = ["StandardScaler", "MinMaxScaler"]
knn_params = {"n_neighbors": [5, 9, 13, 17], "weights": ["uniform", "distance"]}
dt_params = {"max_depth": [3, 5, 8, 12], "criterion": ["gini", "entropy"]}
svm_params = {"C": [0.01, 0.1, 1, 10]}
best_knn = {"f1": 0, "params": None, "scaler": None}
best_dt = {"f1": 0, "params": None, "scaler": None}
best_svm = {"f1": 0, "params": None, "scaler": None}

#KNN
for scaler, scaler_name in zip(scalers, scaler_names):
    for n_neighbors in knn_params["n_neighbors"]:
        for weights in knn_params["weights"]:
            knn = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
            f1 = evaluate_model(knn, X_train_val, y_train_val, scaler)
            if f1 > best_knn["f1"]:
                best_knn = {"f1": f1, "params": (n_neighbors, weights), "scaler": scaler_name}

#Decision Tree
for scaler, scaler_name in zip(scalers, scaler_names):
    for max_depth in dt_params["max_depth"]:
        for criterion in dt_params["criterion"]:
            dt = DecisionTreeClassifier(max_depth=max_depth, criterion=criterion, random_state=5)
            f1 = evaluate_model(dt, X_train_val, y_train_val, scaler)
            if f1 > best_dt["f1"]:
                best_dt = {"f1": f1, "params": (max_depth, criterion), "scaler": scaler_name}

#SVM
for scaler, scaler_name in zip(scalers, scaler_names):
    for C in svm_params["C"]:
        svc = SVC(C=C, kernel="rbf", random_state=5)
        f1 = evaluate_model(svc, X_train_val, y_train_val, scaler)
        if f1 > best_svm["f1"]:
            best_svm = {"f1": f1, "params": C, "scaler": scaler_name}

# Report best hyperparameters
knn_bestNbrs, knn_bestWt = best_knn["params"]
knn_bestScaling = best_knn["scaler"]
dt_bestMaxDepth, dt_bestCrit = best_dt["params"]
dt_bestScaling = best_dt["scaler"]
svm_bestC = best_svm["params"]
svm_bestScaling = best_svm["scaler"]

#Retrain models with best hyperparameters on train+val and evaluate on test
scalers_dict = {"StandardScaler": StandardScaler(), "MinMaxScaler": MinMaxScaler()}

#KNN
scaler = scalers_dict[knn_bestScaling]
scaler.fit(X_train_val)
X_train_val_scaled = scaler.transform(X_train_val)
X_test_scaled = scaler.transform(X_test)
knn = KNeighborsClassifier(n_neighbors=knn_bestNbrs, weights=knn_bestWt)
knn.fit(X_train_val_scaled, y_train_val)
knn_test = f1_score(y_test, knn.predict(X_test_scaled))

#Decision Tree
scaler = scalers_dict[dt_bestScaling]
scaler.fit(X_train_val)
X_train_val_scaled = scaler.transform(X_train_val)
X_test_scaled = scaler.transform(X_test)
dt = DecisionTreeClassifier(max_depth=dt_bestMaxDepth, criterion=dt_bestCrit, random_state=5)
dt.fit(X_train_val_scaled, y_train_val)
dt_test = f1_score(y_test, dt.predict(X_test_scaled))

#SVM
scaler = scalers_dict[svm_bestScaling]
scaler.fit(X_train_val)
X_train_val_scaled = scaler.transform(X_train_val)
X_test_scaled = scaler.transform(X_test)
svc = SVC(C=svm_bestC, kernel="rbf", random_state=5)
svc.fit(X_train_val_scaled, y_train_val)
svm_test = f1_score(y_test, svc.predict(X_test_scaled))

#results
print("knn best hyperparams:  ", knn_bestNbrs, knn_bestWt, knn_bestScaling)
print("dt best hyperparams:   ", dt_bestMaxDepth, dt_bestCrit, dt_bestScaling)
print("svm best hyperparams:  ", svm_bestC, svm_bestScaling)
print("\nBest Performance")
print("  KNN:  ", knn_test)
print("  DT:   ", dt_test)
print("  SVM:  ", svm_test)

#sources
#https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
#https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
#https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
#https://datascience.stackexchange.com/questions/102414/i-am-attempting-to-implement-k-folds-cross-validation-in-python3-what-is-the-be

knn best hyperparams:   5 distance StandardScaler
dt best hyperparams:    12 gini StandardScaler
svm best hyperparams:   10 StandardScaler

Best Performance
  KNN:   0.4150943396226415
  DT:    0.4200913242009132
  SVM:   0.5278450363196125


In [11]:
grader.check("q2")

## Q3. Classify Top 10 Hits with Pipelines 

For this question, you will repeat the analysis from above, but this time you will use pipelines and the `GridSearchCV` method to complete this process. 
  

For each model, you will tune the hyper-parameters:    
* KNN, number of neighbors = [5, 9, 13, 17] and weights = ['uniform', 'distance']
* Decision Trees, maximum depth of the tree = [3, 5, 8, 12] and criterion of ['gini', 'entropy'], set the random_state = 5
* SVM, use an RBF kernel with C = [0.01, 0.1, 1, 10] 

In addition, you will want to see which scaling method seems to work best for this dataset and each model: `StandardScaler` or `MinMaxScaler`. 

Overall, you will construct **three pipelines** to perform this analysis, one for each model: KNN, DT, SVM.  You will do an initial stratified split of your data into a training+validation set with 80% of the data and a test set with 20% of the data (random_state=5).  Use 10-fold stratified cross-validation with a random_state = 5 and shuffle = True. 

Additionally, when selecting the best hyper-parameters, instead of using accuracy you will use the `f1_measure`.  
 
The steps in your pipeline should be called `scaler` for the scaling step, `knn` for the KNN classifier, `dt` for the Decision Tree, and `svm` for the Support Vector Machine. 
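The step names matter because `GridSearchCV` addresses pipeline parameters as `<step>__<parameter>`, and a whole step can be swapped by using the step name alone. A minimal sketch with the `scaler` and `knn` names from above (`demo_pipe` is a hypothetical name):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Pipeline with the required step names
demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# "knn__n_neighbors" reaches into the knn step; "scaler" replaces the whole step
demo_pipe.set_params(knn__n_neighbors=9, scaler=MinMaxScaler())

print(type(demo_pipe.named_steps["scaler"]).__name__)
print(demo_pipe.get_params()["knn__n_neighbors"])
```

The same keys are what a `param_grid` dictionary uses, so a grid entry such as `"scaler": [StandardScaler(), MinMaxScaler()]` tries each scaler in turn.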

One note, we are not using the results here to select a certain model (that would be using the test set for more than just estimating the generalized performance), rather just to report out the results. 
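Because the F1 measure drives model selection, it helps to be clear about which F1 is meant. The sketch below, on invented labels, contrasts the default binary F1 (positive class only, what `scoring="f1"` computes) with the class-weighted average (`average="weighted"`). On imbalanced data the two can differ considerably, so they may select different hyperparameters.

```python
from sklearn.metrics import f1_score

# Toy labels: 8 negatives, 2 positives (made up for illustration)
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Binary F1: positive class only (1 TP, 1 FP, 1 FN)
f1_binary = f1_score(y_true_demo, y_pred_demo)

# Weighted F1: per-class F1 averaged by class frequency
f1_weighted = f1_score(y_true_demo, y_pred_demo, average="weighted")

print("binary F1:  ", f1_binary)
print("weighted F1:", f1_weighted)
```

Here the weighted score is dominated by the majority class, while the binary score reflects only how well the hits are found.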

In [12]:

# Split off the test set 
X_trainval, X_test, y_trainval, y_test = train_test_split(
    Xm, ym, test_size=0.2, random_state=5, stratify=ym
)

# ** KNN **
# Create pipeline, with steps 'scaler' and 'knn'
knn_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier())
])

# specify pipeline steps hyperparameters
knn_param = {
    "scaler": [StandardScaler(), MinMaxScaler()],
    "knn__n_neighbors": [5, 9, 13, 17],
    "knn__weights": ["uniform", "distance"]
}

# Setup cross-validation for repeatability 
cvStrat = StratifiedKFold(n_splits=10, shuffle=True, random_state=5)

# instantiate and run GridSearchCV on pipeline:
knn_grid = GridSearchCV(
    knn_pipe,
    param_grid=knn_param,
    scoring=make_scorer(f1_score, average="weighted"),
    cv=cvStrat
)
knn_grid.fit(X_trainval, y_trainval)

# predictions on final test set 
knn_ytest = knn_grid.predict(X_test)

print(knn_grid.best_params_)

{'knn__n_neighbors': 13, 'knn__weights': 'distance', 'scaler': StandardScaler()}


In [13]:

np.random.seed(5550)

# ** DT ** 
# Create pipeline, with steps 'scaler' and 'dt'
dt_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("dt", DecisionTreeClassifier(random_state=5))
])

dt_param = {
    "scaler": [StandardScaler(), MinMaxScaler()],
    "dt__max_depth": [3, 5, 8, 12],
    "dt__criterion": ["gini", "entropy"]
}

# Setup cross-validation for repeatability 
cvStrat = StratifiedKFold(n_splits=10, shuffle=True, random_state=5)

# instantiate and run GridSearchCV on pipeline:
dt_grid = GridSearchCV(
    dt_pipe,
    param_grid=dt_param,
    scoring=make_scorer(f1_score, average="weighted"),
    cv=cvStrat
)
dt_grid.fit(X_trainval, y_trainval)

# predictions on final test set 
dt_ytest = dt_grid.predict(X_test)


print(dt_grid.best_params_)

{'dt__criterion': 'gini', 'dt__max_depth': 8, 'scaler': StandardScaler()}


In [None]:

# ** SVM ** 
# Create pipeline, with steps 'scaler' and 'svm'
svm_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", random_state=5))
])

svm_param = {
    "scaler": [StandardScaler(), MinMaxScaler()],
    "svm__C": [0.01, 0.1, 1, 10]
}

# Setup cross-validation for repeatability 
cvStrat = StratifiedKFold(n_splits=10, shuffle=True, random_state=5)

# instantiate and run GridSearchCV on pipeline:
svm_grid = GridSearchCV(
    svm_pipe,
    param_grid=svm_param,
    scoring=make_scorer(f1_score, average="weighted"),
    cv=cvStrat
)
svm_grid.fit(X_trainval, y_trainval)

# predictions on final test set
svm_ytest = svm_grid.predict(X_test)

print(svm_grid.best_params_)

In [None]:
grader.check("q3")

## Q4  Table of Results 

Report in a DataFrame the following information for each model (use the models from Q3):
* `Model` type (KNN, DT, SVM), 
* best `Hyper-parameters` for the model, e.g., [(n_neighbors, 7), (weights, 'uniform')], (max_depth, 10), ('C', 0.1), etc.
* `Accuracy`, 
* `Precision`,
* `Recall`, 
* `F1-measure` and 
* `Balanced Acc` - balanced accuracy

The last 5 values should all be calculated on the test set. 
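As a hedged illustration of how these metrics relate, the toy example below (invented labels, hypothetical `*_t` names) computes them by hand-checkable counts. One property worth knowing when reading such a table: recall averaged with `average="weighted"` always equals plain accuracy, so those two columns carry the same information under weighted averaging.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    precision_score,
    recall_score,
)

# Toy labels: 6 negatives, 4 positives (made up)
y_true_t = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_t = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true_t, y_pred_t)                          # 7 of 10 correct
prec_pos = precision_score(y_true_t, y_pred_t)                    # 2 TP / 3 predicted positive
rec_pos = recall_score(y_true_t, y_pred_t)                        # 2 TP / 4 actual positive
rec_weighted = recall_score(y_true_t, y_pred_t, average="weighted")
bal_acc = balanced_accuracy_score(y_true_t, y_pred_t)             # mean of per-class recall

print(acc, prec_pos, rec_pos, rec_weighted, bal_acc)
```

Balanced accuracy averages the per-class recalls without weighting by class size, which is why it drops below accuracy when the minority class is recalled poorly.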

In [None]:
# Build data frame of requested results
results = pd.DataFrame(columns=[
    "Model", "Hyper-parameters", "Accuracy", "Precision", "Recall", "F1-measure", "Balanced Acc."
])

# Helper function to add one row of test-set results to the table
# Source: https://stackoverflow.com/questions/31421413/how-to-compute-precision-recall-accuracy-and-f1-score-for-the-multiclass-case
def add_results_to_table(model_name, best_params, y_true, y_pred):
    # Precision, recall, and F1 are weighted averages over both classes
    results.loc[len(results)] = [
        model_name,
        best_params,
        accuracy_score(y_true, y_pred),
        precision_score(y_true, y_pred, average="weighted"),
        recall_score(y_true, y_pred, average="weighted"),
        f1_score(y_true, y_pred, average="weighted"),
        balanced_accuracy_score(y_true, y_pred),
    ]

#KNN
add_results_to_table("KNN", knn_grid.best_params_, y_test, knn_ytest)

#Decision Tree
add_results_to_table("Decision Tree", dt_grid.best_params_, y_test, dt_ytest)

#SVM
add_results_to_table("SVM", svm_grid.best_params_, y_test, svm_ytest)

results

In [None]:
grader.check("q4")

## Question 5 

Summarize the results.  Write 5-8 sentences about the results observed and the overall performance on the problem.  In particular, call out one of the challenges with this problem. 

* The results indicate that the SVM model outperformed the other two classifiers, with the highest accuracy (0.785), precision (0.771), recall (0.785), F1-measure (0.767), and balanced accuracy (0.669). This suggests that the SVM with a MinMaxScaler and C=10 is the best of the three at identifying both Top 10 hits and non-hits in a balanced manner. The KNN model with 13 neighbors and distance-based weights performed slightly worse, which points to this model's sensitivity to imbalanced data and hyperparameter tuning. The Decision Tree, with a maximum depth of 8 and the Gini criterion, has similar precision and recall but slightly lower balanced accuracy, indicating some difficulty in correctly classifying the underrepresented class.

* One challenge in this problem is the inherent imbalance in the dataset, as Top 10 hits are likely underrepresented compared to non-hits. This affects the models' ability to learn the minority class and calls for metrics such as balanced accuracy and weighted F1-scores to show true performance. Moreover, feature scaling was an important preprocessing step for both SVM and KNN, since these algorithms are sensitive to the magnitude of the input features. Overall, although the models achieved reasonable predictive performance, addressing the class imbalance and exploring the features further may lead to better results.
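The imbalance point can be made concrete with a majority-class baseline. The sketch below uses invented labels with an assumed 15% positive rate (not the actual rate in `music.csv`) and scikit-learn's `DummyClassifier`; the `*_imb` names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Assumed imbalance for illustration only: 85 non-hits, 15 hits
y_imb = np.array([0] * 85 + [1] * 15)
X_imb = np.zeros((100, 1))  # features are irrelevant for this baseline

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_imb, y_imb)
pred_imb = baseline.predict(X_imb)

acc_imb = accuracy_score(y_imb, pred_imb)
bal_imb = balanced_accuracy_score(y_imb, pred_imb)
f1_imb = f1_score(y_imb, pred_imb, zero_division=0)  # no positives predicted

print(acc_imb, bal_imb, f1_imb)
```

A model that never predicts a hit still reaches high accuracy, while balanced accuracy sits at chance level and the binary F1 is zero, which is why accuracy alone is a weak guide here.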

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)