# COGS 118B - Final Project GMM testing

# Insert title here

## Group members

- Allen Phu
- Kevin Yu
- Saksham Rai
- Rodrigo Lizaran-Molina

# Abstract 
This section should be short and clearly stated. It should be a single paragraph <200 words.  It should summarize: 
- what your goal/problem is
- what the data used represents 
- the solution/what you did
- major results you came up with (mention how results are measured) 

__NB:__ this final project form is much more report-like than the proposal and the checkpoint. Think in terms of writing a paper with bits of code in the middle to make the plots/tables

# Background

Fill in the background and discuss the kind of prior work that has gone on in this research area here. **Use inline citation** to specify which references support which statements. You can do that through HTML footnotes (demonstrated here). I used to recommend Markdown footnotes (Google is your friend) because they are simpler, but recently I have had some problems getting them to work, whereas HTML ones have always worked so far. So use the method that works for you, but do use inline citations.

Here is an example of inline citation. After government genocide in the 20th century, real birds were replaced with surveillance drones designed to look just like birds<a name="lorenz"></a>[<sup>[1]</sup>](#lorenznote). Use a minimum of 2 or 3 citations, but we prefer more <a name="admonish"></a>[<sup>[2]</sup>](#admonishnote). You need enough citations to fully explain and back up important facts. 

Remember, you are trying to explain why someone would want to answer your question or why your hypothesis is in the form that you've stated.

# Problem Statement

Clearly describe the problem that you are solving. Avoid ambiguous words. The problem described should be well defined and should have at least one ML-relevant potential solution. Additionally, describe the problem thoroughly such that it is clear that the problem is quantifiable (the problem can be expressed in mathematical or logical terms), measurable (the problem can be measured by some metric and clearly observed), and replicable (the problem can be reproduced and occurs more than once).

# Data

Detail how/where you obtained the data and cleaned it (if necessary)

If the data cleaning process is very long (e.g., elaborate text processing), consider describing it briefly here in text and moving the actual cleaning process to another notebook in your repo (include a link here!). The idea behind this approach: this is a report, and if you blow up the flow of the report to include a lot of code, it becomes hard to read.

Please give the following information for each dataset you are using:
- link/reference to obtain it
- description of the size of the dataset (# of variables, # of observations)
- what an observation consists of
- what some critical variables are, how they are represented
- any special handling, transformations, cleaning, etc you have done should be demonstrated here!
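
As a minimal sketch of how the size and content items in this checklist could be collected, the helper below summarizes a table with one label column. The function name `summarize_dataset` and the toy frame are hypothetical stand-ins, not part of the project code; the `label` column name matches the one used in the cells further down.

```python
import numpy as np
import pandas as pd

def summarize_dataset(df, label_col="label"):
    """Return basic size facts for a table with one label column."""
    features = df.drop(columns=[label_col])
    values = features.to_numpy()
    return {
        "n_observations": len(df),                # rows
        "n_variables": features.shape[1],         # feature columns
        "n_classes": df[label_col].nunique(),     # distinct labels present
        "value_range": (values.min(), values.max()),
        "n_missing": int(df.isna().sum().sum()),  # total missing cells
    }

# Hypothetical stand-in data: 6 rows, 4 pixel columns with values in 0-255.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.integers(0, 256, size=(6, 4)),
    columns=[f"pixel{i}" for i in range(1, 5)],
)
toy.insert(0, "label", [0, 1, 2, 0, 1, 2])

summary = summarize_dataset(toy)
print(summary)
```

Calling the same helper on a frame loaded with `pd.read_csv` would give the numbers to quote in this section.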


In [4]:
# IMPORT LIBRARIES AND QUICK INFO, IMPLEMENTATION OF FUNCTIONS

# adapted from notebook L07 and GPT-generated snippets
# imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.metrics import silhouette_score
from sklearn.metrics import rand_score, adjusted_rand_score
from scipy import stats

def gmm_bic_score(estimator, X):
    """Callable to pass to GridSearchCV that will use the BIC score."""
    # Make it negative since GridSearchCV expects a score to maximize
    return -estimator.bic(X)

# Quick notes for future me:
# - labels are integers in 0-25, mapped one-to-one onto the letters A-Z
#   (the Kaggle Sign Language MNIST release has no cases for 9 = J or 25 = Z,
#   since those letters require motion, so only 24 labels actually occur)
# - training data has 27,455 cases, test data has 7,172 cases

In [5]:
# ESTABLISH DATA

train_data = pd.read_csv("data/sign_mnist_train.csv")
test_data = pd.read_csv("data/sign_mnist_test.csv")

# separate labels from pixel data
true_labels = test_data['label']
train_labels = train_data['label']
predict_data = train_data.drop(['label'], axis=1)  # training features
true_data = test_data.drop(['label'], axis=1)      # test features

# fit a 26-component GMM on the training features
gmm = GaussianMixture(n_components=26, random_state=42)
gmm.fit(predict_data)

# store true labels next to predicted cluster ids (both for the training set)
gmm_pred = pd.DataFrame(columns = ['true', 'pred'])
gmm_pred['true'] = train_data['label']
gmm_pred['pred'] = gmm.predict(predict_data)

In [16]:
# calculate adjusted and non-adjusted Rand index (true labels vs. cluster ids)
adjusted_rand = adjusted_rand_score(gmm_pred['true'], gmm_pred['pred'])
non_adjusted_rand = rand_score(gmm_pred['true'], gmm_pred['pred'])

print("Adjusted Rand Score:", adjusted_rand)
print("Non-Adjusted Rand Score:", non_adjusted_rand)

Adjusted Rand Score: 0.07575039505807556
Non-Adjusted Rand Score: 0.921382880523567
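
The large gap between the two scores above is what one would expect when there are many groups: most pairs of points fall in different groups under both labelings, so the plain Rand index is high even without real agreement. The sketch below illustrates this with two independent random labelings; the sample size and the uniform labels are assumptions made for illustration only, not the project data.

```python
import numpy as np
from sklearn.metrics import rand_score, adjusted_rand_score

rng = np.random.default_rng(0)
n_samples, n_groups = 5000, 26

# Two labelings drawn independently, so they share no real structure.
labels_a = rng.integers(0, n_groups, size=n_samples)
labels_b = rng.integers(0, n_groups, size=n_samples)

ri = rand_score(labels_a, labels_b)
ari = adjusted_rand_score(labels_a, labels_b)

# Chance-level Rand index for two uniform labelings with k groups each:
# a pair agrees if it is split in both (prob (1 - 1/k)^2)
# or joined in both (prob (1/k)^2).
expected_ri = (1 - 1 / n_groups) ** 2 + (1 / n_groups) ** 2

print(f"Rand index: {ri:.3f} (chance level about {expected_ri:.3f})")
print(f"Adjusted Rand index: {ari:.3f}")
```

Under these assumptions the chance-level Rand index is already about 0.93, which suggests that the adjusted score is the more informative of the two for this model.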


In [17]:
# calculate BIC of the fitted model on the training features (lower is better)
bic = gmm.bic(predict_data)  # computed directly here instead of via gmm_bic_score
print(bic)

130209446.69261551


In [18]:
# calculate AIC 
aic = gmm.aic(predict_data)
print(aic)

64273331.26057597
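
The two criteria share the same log-likelihood term and differ only in the penalty: BIC = -2 ln L + p ln n and AIC = -2 ln L + 2p, so BIC - AIC = p (ln n - 2), where p is the number of free parameters and n the number of samples. The sketch below works through that arithmetic. It assumes 784 pixel features (28 x 28 images) and the scikit-learn default `covariance_type="full"`; the helper name `gmm_free_parameters` is hypothetical.

```python
import math

def gmm_free_parameters(n_components, n_features):
    """Free parameters of a full-covariance Gaussian mixture."""
    mean_params = n_components * n_features
    # each symmetric covariance matrix has d * (d + 1) / 2 free entries
    cov_params = n_components * n_features * (n_features + 1) // 2
    # mixture weights sum to 1, so one of them is determined by the rest
    weight_params = n_components - 1
    return mean_params + cov_params + weight_params

# n_features = 784 is an assumption (28 x 28 pixel columns).
n_samples, n_features, n_components = 27455, 784, 26

p = gmm_free_parameters(n_components, n_features)
penalty_gap = p * (math.log(n_samples) - 2)  # BIC - AIC

reported_gap = 130209446.69261551 - 64273331.26057597

print("free parameters:", p)
print("predicted BIC - AIC:", round(penalty_gap, 2))
print("reported BIC - AIC:", round(reported_gap, 2))
```

With roughly 8 million parameters against 27,455 samples, the penalty term accounts for the gap of about 65.9 million between the two printed scores, which is one reason to compare the other covariance types.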


In [19]:
# calculate silhouette coefficient

# silhouette_score takes the feature matrix first, then the cluster labels
silhouette = silhouette_score(predict_data, gmm_pred['pred'])

print(silhouette)

0.05330493070205487
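
The silhouette coefficient needs pairwise distances, so it gets expensive on tens of thousands of rows. `silhouette_score` accepts `sample_size` and `random_state` arguments to estimate it from a random subset. The sketch below compares the full and subsampled values on synthetic blobs; the data and the sample size of 200 are assumptions for illustration, not the project setup.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data: two well-separated 2-D blobs.
X, _ = make_blobs(
    n_samples=600, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=0
)

# Cluster ids from a 2-component mixture fitted on the same data.
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

# Exact score over all rows versus an estimate from 200 sampled rows.
full = silhouette_score(X, labels)
approx = silhouette_score(X, labels, sample_size=200, random_state=0)

print(f"full: {full:.3f}, subsampled: {approx:.3f}")
```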


In [None]:
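# candidate models: 21-26 components, each with every covariance structure
param_grid = {
    "n_components": range(21, 27),
    "covariance_type": ["spherical", "tied", "diag", "full"],
}
# gmm_bic_score returns -BIC, so the search maximizes it to find the lowest BIC;
# note that GridSearchCV evaluates the score on held-out cross-validation folds
grid_search = GridSearchCV(
    GaussianMixture(), param_grid=param_grid, scoring=gmm_bic_score
)
grid_search.fit(predict_data)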


In [None]:
# collect the grid search results into a table
df = pd.DataFrame(grid_search.cv_results_)[
    ["param_n_components", "param_covariance_type", "mean_test_score"]
]
# undo the sign flip from gmm_bic_score so the column holds the actual BIC
df["mean_test_score"] = -df["mean_test_score"]
df = df.rename(
    columns={
        "param_n_components": "Number of components",
        "param_covariance_type": "Type of covariance",
        "mean_test_score": "BIC score",
    }
)
# lowest (best) BIC first
df.sort_values(by="BIC score").head()