In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time
import altair as alt
import nltk
import itertools
import warnings
import os
import seaborn as sns

from itertools import permutations
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.metrics import confusion_matrix,accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.naive_bayes import MultinomialNB

In [2]:
def warn(*args, **kwargs):
    pass
warnings.warn = warn

In [3]:
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

### GitHub

https://github.com/ebern17/dtsa-5510-final

### Introduction

For this project I use a dataset available through Kaggle that contains text collected from reddit.com. The data consists of the columns Id, Comment, and Topic. The Comment column holds the text, and the Topic column records which topic each comment belongs to.

#### Citation

Physics vs Chemistry vs Biology. (2022) Kaggle. https://www.kaggle.com/datasets/vivmankar/physics-vs-chemistry-vs-biology/data 

### Purpose and goals

The purpose of this project is to create a classifier that uses the text data to predict which topic a comment relates to. Working in a scientific field, I would find such a classifier useful for going through articles or research papers to find the ones that relate to a specific topic. I will try multiple unsupervised models to see whether I can create a strong classifier, and I will compare them to a supervised method to determine which type of model is best for this dataset.

## Exploratory Data Analysis

#### Bring in and inspect data files

In [4]:
data_path = os.getcwd()+'/Science_Classification/'
os.listdir(os.getcwd()+'/Science_Classification/')

['test.csv', 'test_df.csv', 'train.csv', 'train_df.csv']

__Training data__

In [5]:
train_df=pd.read_csv(data_path+'train.csv')

In [6]:
train_df.shape

(8695, 3)

In [7]:
train_df.head()

Unnamed: 0,Id,Comment,Topic
0,0x840,A few things. You might have negative- frequen...,Biology
1,0xbf0,Is it so hard to believe that there exist part...,Physics
2,0x1dfc,There are bees,Biology
3,0xc7e,I'm a medication technician. And that's alot o...,Biology
4,0xbba,Cesium is such a pretty metal.,Chemistry


In [8]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8695 entries, 0 to 8694
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Id       8695 non-null   object
 1   Comment  8695 non-null   object
 2   Topic    8695 non-null   object
dtypes: object(3)
memory usage: 203.9+ KB


The training data includes 8695 observations and no null values. The column *Comment* is what I will train the models with and the column *Topic* is what will be predicted.

In [9]:
train_df.duplicated().sum()

0

__Test data__

In [10]:
test_df=pd.read_csv(data_path+'test.csv')

In [11]:
test_df.shape

(1586, 3)

In [12]:
test_df.head()

Unnamed: 0,Id,Comment,Topic
0,0x1aa9,Personally I have no idea what my IQ is. I’ve ...,Biology
1,0x25e,I'm skeptical. A heavier lid would be needed t...,Physics
2,0x1248,I think I have 100 cm of books on the subject....,Biology
3,0x2b9,Is chemistry hard in uni. Ive read somewhere t...,Chemistry
4,0x24af,"In addition to the other comment, you can crit...",Physics


In [13]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1586 entries, 0 to 1585
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Id       1586 non-null   object
 1   Comment  1586 non-null   object
 2   Topic    1586 non-null   object
dtypes: object(3)
memory usage: 37.3+ KB


The test data includes 1586 observations and no null values.

In [14]:
test_df.duplicated().sum()

0

### Distribution of training data

I need to see how the topics being predicted are distributed, so I will plot a bar chart of the counts per topic.

In [15]:
train_df.Topic.unique()

array(['Biology', 'Physics', 'Chemistry'], dtype=object)

In [16]:
alt.Chart(train_df).mark_bar().encode(
    x=alt.X('Topic:N',sort='y'),
    y='count()',
    color='Topic:N').properties(width=200)

There are three classes in the Topic column: Physics, Chemistry, and Biology. There is a slight imbalance between the three classes. I will run everything on the full dataset to see how the models perform. To address the imbalance I will use RandomOverSampler to over-sample the minority classes and perform the modeling on the balanced dataset as well.
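
To make the balancing step concrete, here is a minimal, dependency-free sketch of what random over-sampling does: rows from the smaller classes are duplicated at random until every class matches the largest one. The toy labels and the helper name `random_oversample` are illustrative assumptions, not the imbalanced-learn implementation used in the RandomOverSampler section of this notebook.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=11):
    """Duplicate randomly chosen minority-class rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Indices of the rows that belong to this class.
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        # Draw with replacement to fill the gap to the largest class.
        for i in rng.choices(idx, k=target - n):
            out_rows.append(rows[i])
            out_labels.append(labels[i])
    return out_rows, out_labels


# Hypothetical toy data: 4 Biology, 2 Chemistry, 1 Physics comment.
toy_labels = ["Biology"] * 4 + ["Chemistry"] * 2 + ["Physics"]
toy_rows = [f"comment {i}" for i in range(len(toy_labels))]

bal_rows, bal_labels = random_oversample(toy_rows, toy_labels)
print(Counter(bal_labels))
```

Because the added rows are exact copies of existing comments, over-sampling introduces no new vocabulary; it only changes how much weight each class carries.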

### Text Preprocessing

Text data can be very useful but it requires preprocessing in order to clear out noise in the data. This includes stop-words, punctuation and numbers, and other unhelpful entries.
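
The cleaning steps can be sketched as a single helper. This is a simplified illustration only: it uses a regular-expression tokenizer and a tiny hand-written stop-word set (`TOY_STOPWORDS`, a hypothetical name) in place of NLTK's tokenizer and English stop-word list, so its output will not match the notebook exactly on contractions and similar edge cases.

```python
import re

# Hypothetical, deliberately tiny stop-word set; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"a", "is", "such", "are", "there", "the", "and"}


def clean_comment(text, stopwords=TOY_STOPWORDS):
    """Lowercase, tokenize, keep purely alphabetic tokens, and drop stop-words."""
    # Simplified tokenizer standing in for nltk.tokenize.word_tokenize.
    tokens = re.findall(r"[A-Za-z0-9']+", text.lower())
    # Remove numbers and anything containing non-letters.
    tokens = [t for t in tokens if t.isalpha()]
    # Remove common but uninformative words.
    tokens = [t for t in tokens if t not in stopwords]
    return " ".join(tokens)


print(clean_comment("Cesium is such a pretty metal."))
print(clean_comment("There are 2 bees"))
```

Wrapping the steps in one function also makes it easy to apply identical preprocessing to the training and test sets.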

Tokenize text to make transformation easier

In [17]:
train_df['tokenized'] = train_df.Comment.map(lambda x: nltk.tokenize.word_tokenize(x))
train_df.head()

Unnamed: 0,Id,Comment,Topic,tokenized
0,0x840,A few things. You might have negative- frequen...,Biology,"[A, few, things, ., You, might, have, negative..."
1,0xbf0,Is it so hard to believe that there exist part...,Physics,"[Is, it, so, hard, to, believe, that, there, e..."
2,0x1dfc,There are bees,Biology,"[There, are, bees]"
3,0xc7e,I'm a medication technician. And that's alot o...,Biology,"[I, 'm, a, medication, technician, ., And, tha..."
4,0xbba,Cesium is such a pretty metal.,Chemistry,"[Cesium, is, such, a, pretty, metal, .]"


In [18]:
len(train_df.tokenized.values[0])

277

Lower case all words

In [19]:
train_df['tokenized'] = train_df.tokenized.map(lambda x: [y.lower() for y in x])
train_df.head()

Unnamed: 0,Id,Comment,Topic,tokenized
0,0x840,A few things. You might have negative- frequen...,Biology,"[a, few, things, ., you, might, have, negative..."
1,0xbf0,Is it so hard to believe that there exist part...,Physics,"[is, it, so, hard, to, believe, that, there, e..."
2,0x1dfc,There are bees,Biology,"[there, are, bees]"
3,0xc7e,I'm a medication technician. And that's alot o...,Biology,"[i, 'm, a, medication, technician, ., and, tha..."
4,0xbba,Cesium is such a pretty metal.,Chemistry,"[cesium, is, such, a, pretty, metal, .]"


Remove punctuation and numerical values

In [20]:
train_df['tokenized'] = train_df.tokenized.map(lambda x: [y for y in x if y.isalpha()])
train_df.head()

Unnamed: 0,Id,Comment,Topic,tokenized
0,0x840,A few things. You might have negative- frequen...,Biology,"[a, few, things, you, might, have, frequency, ..."
1,0xbf0,Is it so hard to believe that there exist part...,Physics,"[is, it, so, hard, to, believe, that, there, e..."
2,0x1dfc,There are bees,Biology,"[there, are, bees]"
3,0xc7e,I'm a medication technician. And that's alot o...,Biology,"[i, a, medication, technician, and, that, alot..."
4,0xbba,Cesium is such a pretty metal.,Chemistry,"[cesium, is, such, a, pretty, metal]"


In [21]:
len(train_df.tokenized.values[0])

249

#### Average Word Count

Now that punctuation and numbers have been removed I am going to see what the average word count per topic is

In [22]:
train_df['words'] = train_df.tokenized.apply(lambda x: len(x))

In [23]:
train_df.head(2)

Unnamed: 0,Id,Comment,Topic,tokenized,words
0,0x840,A few things. You might have negative- frequen...,Biology,"[a, few, things, you, might, have, frequency, ...",249
1,0xbf0,Is it so hard to believe that there exist part...,Physics,"[is, it, so, hard, to, believe, that, there, e...",36


In [24]:
train_df.groupby('Topic').agg({'words':'mean'})

Topic,words
Biology,27.3066
Chemistry,26.373973
Physics,32.168498


The average word count is fairly similar across topics (roughly 26 to 32 words per comment). Next I am going to remove stopwords, as this will help clear out common but unimportant words.

In [25]:
stopwords = set(nltk.corpus.stopwords.words("english"))

In [26]:
train_df['tokenized'] = train_df.tokenized.map(lambda x: [y for y in x if y not in stopwords])
train_df.head()

Unnamed: 0,Id,Comment,Topic,tokenized,words
0,0x840,A few things. You might have negative- frequen...,Biology,"[things, might, frequency, dependent, selectio...",249
1,0xbf0,Is it so hard to believe that there exist part...,Physics,"[hard, believe, exist, particulars, ca, detect...",36
2,0x1dfc,There are bees,Biology,[bees],3
3,0xc7e,I'm a medication technician. And that's alot o...,Biology,"[medication, technician, alot, drugs, liver, p...",33
4,0xbba,Cesium is such a pretty metal.,Chemistry,"[cesium, pretty, metal]",6


In [27]:
len(train_df.tokenized.values[0])

131

Convert cleaned text back to string

In [28]:
train_df['tokenized_string'] = train_df.tokenized.map(lambda x: " ".join(x))
train_df.head()

Unnamed: 0,Id,Comment,Topic,tokenized,words,tokenized_string
0,0x840,A few things. You might have negative- frequen...,Biology,"[things, might, frequency, dependent, selectio...",249,things might frequency dependent selection goi...
1,0xbf0,Is it so hard to believe that there exist part...,Physics,"[hard, believe, exist, particulars, ca, detect...",36,hard believe exist particulars ca detect anyth...
2,0x1dfc,There are bees,Biology,[bees],3,bees
3,0xc7e,I'm a medication technician. And that's alot o...,Biology,"[medication, technician, alot, drugs, liver, p...",33,medication technician alot drugs liver probabl...
4,0xbba,Cesium is such a pretty metal.,Chemistry,"[cesium, pretty, metal]",6,cesium pretty metal


In [29]:
train_df.shape

(8695, 6)

In [30]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8695 entries, 0 to 8694
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Id                8695 non-null   object
 1   Comment           8695 non-null   object
 2   Topic             8695 non-null   object
 3   tokenized         8695 non-null   object
 4   words             8695 non-null   int64 
 5   tokenized_string  8695 non-null   object
dtypes: int64(1), object(5)
memory usage: 407.7+ KB


### Perform preprocessing on test data

In [31]:
test_df['tokenized'] = test_df.Comment.map(lambda x: nltk.tokenize.word_tokenize(x))
test_df['tokenized'] = test_df.tokenized.map(lambda x: [y.lower() for y in x])
test_df['tokenized'] = test_df.tokenized.map(lambda x: [y for y in x if y not in stopwords])
test_df['tokenized'] = test_df.tokenized.map(lambda x: [y for y in x if y.isalpha()])
test_df['tokenized_string'] = test_df.tokenized.map(lambda x: " ".join(x))

In [32]:
test_df.head()

Unnamed: 0,Id,Comment,Topic,tokenized,tokenized_string
0,0x1aa9,Personally I have no idea what my IQ is. I’ve ...,Biology,"[personally, idea, iq, never, tested, however,...",personally idea iq never tested however test o...
1,0x25e,I'm skeptical. A heavier lid would be needed t...,Physics,"[skeptical, heavier, lid, would, needed, build...",skeptical heavier lid would needed build press...
2,0x1248,I think I have 100 cm of books on the subject....,Biology,"[think, cm, books, subject, tl, dr, problem, c...",think cm books subject tl dr problem conscious...
3,0x2b9,Is chemistry hard in uni. Ive read somewhere t...,Chemistry,"[chemistry, hard, uni, ive, read, somewhere, h...",chemistry hard uni ive read somewhere hardest ...
4,0x24af,"In addition to the other comment, you can crit...",Physics,"[addition, comment, criticize, theory, without...",addition comment criticize theory without chec...


In [33]:
test_df.shape

(1586, 5)

In [34]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1586 entries, 0 to 1585
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Id                1586 non-null   object
 1   Comment           1586 non-null   object
 2   Topic             1586 non-null   object
 3   tokenized         1586 non-null   object
 4   tokenized_string  1586 non-null   object
dtypes: object(5)
memory usage: 62.1+ KB


### Add column encoding topics as integers

In [35]:
labeler = LabelEncoder()
labeler.fit(train_df.Topic.unique())

In [36]:
topic2int_dict=dict(zip(labeler.classes_,labeler.transform(labeler.classes_)))

In [37]:
train_df['topic_int'] = train_df.Topic.map(topic2int_dict)
test_df['topic_int'] = test_df.Topic.map(topic2int_dict)

In [38]:
train_df.head()

Unnamed: 0,Id,Comment,Topic,tokenized,words,tokenized_string,topic_int
0,0x840,A few things. You might have negative- frequen...,Biology,"[things, might, frequency, dependent, selectio...",249,things might frequency dependent selection goi...,0
1,0xbf0,Is it so hard to believe that there exist part...,Physics,"[hard, believe, exist, particulars, ca, detect...",36,hard believe exist particulars ca detect anyth...,2
2,0x1dfc,There are bees,Biology,[bees],3,bees,0
3,0xc7e,I'm a medication technician. And that's alot o...,Biology,"[medication, technician, alot, drugs, liver, p...",33,medication technician alot drugs liver probabl...,0
4,0xbba,Cesium is such a pretty metal.,Chemistry,"[cesium, pretty, metal]",6,cesium pretty metal,1


In [39]:
int2topic_dict = dict(zip(train_df.topic_int,train_df.Topic))

#### Export progress

In [40]:
# train_df.to_csv(data_path+'train_df.csv',index=False)
# test_df.to_csv(data_path+'test_df.csv',index=False)

## Convert text to feature vectors

I am going to use Term Frequency-Inverse Document Frequency (TF-IDF) to vectorize the text features. This method measures the importance of a word by considering how frequently it occurs in a given document and how many documents in the corpus contain it. This weights words so that common but less informative words don't influence the model too much.
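
As a worked illustration of the weighting, the sketch below computes TF-IDF by hand for a hypothetical three-comment corpus. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalisation, which to my understanding is what `TfidfVectorizer` applies with its default settings; the corpus and the helper name `tfidf_vector` are made up for the example.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-cleaned comments.
docs = [
    "cesium pretty metal",
    "metal reacts water",
    "bees pollinate flowers",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
df = Counter(term for doc in tokenized for term in set(doc))


def tfidf_vector(doc):
    """Return L2-normalised TF-IDF weights for one tokenized document."""
    tf = Counter(doc)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items()):
    print(f"{term:8s} {weight:.3f}")
```

In the first comment, "metal" also appears in a second document, so it receives a lower weight than "cesium" and "pretty", which are unique to that comment.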

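I am going to fit the vectorizer on the combined training and testing text to ensure every word present in either set is part of the vocabulary. The vectorizer is never passed the labels, so no label information leaks, although the vocabulary and IDF weights are still influenced by the test comments.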

In [41]:
full_text = pd.concat([train_df[['tokenized_string']],test_df[['tokenized_string']]])
full_text.head()

Unnamed: 0,tokenized_string
0,things might frequency dependent selection goi...
1,hard believe exist particulars ca detect anyth...
2,bees
3,medication technician alot drugs liver probabl...
4,cesium pretty metal


In [42]:
full_text.duplicated().sum()

1140

In [43]:
full_text[full_text.tokenized_string.isna()]

Unnamed: 0,tokenized_string


In [None]:
# full_text = full_text[full_text.duplicated()==False]

In [None]:
# full_text.duplicated().sum()

In [44]:
#initialize
vectorizer = TfidfVectorizer(stop_words='english')
# Fit text data
tfidf = vectorizer.fit_transform(full_text['tokenized_string'])

In [45]:
train_df.shape

(8695, 7)

In [46]:
test_df.shape

(1586, 6)

In [47]:
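# Rows of tfidf follow the concat order: training rows first, then test rows
n_train = len(train_df)
# subset training portion
tfidf_train = tfidf[:n_train, :]
# subset testing portion
tfidf_test = tfidf[n_train:, :]
# Create y_train
y_train = train_df.topic_int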

In [48]:
tfidf_train

<8695x18004 sparse matrix of type '<class 'numpy.float64'>'
	with 94345 stored elements in Compressed Sparse Row format>

In [49]:
tfidf_test

<1586x18004 sparse matrix of type '<class 'numpy.float64'>'
	with 43112 stored elements in Compressed Sparse Row format>

### RandomOverSampler

In [50]:
smplr = RandomOverSampler(random_state=11)

In [51]:
X_train_bal, y_train_bal = smplr.fit_resample(tfidf_train,train_df.Topic)

In [52]:
y_train_bal = pd.DataFrame(y_train_bal)

In [53]:
alt.Chart(y_train_bal).mark_bar().encode(
    x=alt.X('Topic:N',sort='y'),
    y='count()',
    color='Topic:N').properties(width=200) | alt.Chart(train_df).mark_bar().encode(
    x=alt.X('Topic:N',sort='y'),
    y='count()',
    color='Topic:N').properties(width=200)

In [54]:
y_train_bal['topic_int'] = y_train_bal.Topic.map(topic2int_dict)

### Functions

In [55]:
def label_permute_compare(y_pred):
    """Best accuracy over all cluster-to-label permutations, scored on the training labels."""
    label_int = train_df["topic_int"]
    permutations_list = list(permutations(range(3)))

    best_accuracy = 0
    best_permutation = None

    for permutation in permutations_list:
        # Map each cluster id to the candidate label id
        reordered_labels_int = np.array([permutation[label] for label in y_pred])
        current_accuracy = accuracy_score(label_int, reordered_labels_int)

        if current_accuracy > best_accuracy:
            best_accuracy = current_accuracy
            best_permutation = permutation

    return best_accuracy, best_permutation

In [56]:
def label_permute_compare_bal(y_pred):
    """Best accuracy over all cluster-to-label permutations, scored on the balanced labels."""
    label_int = y_train_bal["topic_int"]
    permutations_list = list(permutations(range(3)))

    best_accuracy = 0
    best_permutation = None

    for permutation in permutations_list:
        # Map each cluster id to the candidate label id
        reordered_labels_int = np.array([permutation[label] for label in y_pred])
        current_accuracy = accuracy_score(label_int, reordered_labels_int)

        if current_accuracy > best_accuracy:
            best_accuracy = current_accuracy
            best_permutation = permutation

    return best_accuracy, best_permutation

In [57]:
def label_permute_compare_test(y_pred):
    """Best accuracy over all cluster-to-label permutations, scored on the test labels."""
    label_int = test_df["topic_int"]
    permutations_list = list(permutations(range(3)))

    best_accuracy = 0
    best_permutation = None

    for permutation in permutations_list:
        # Map each cluster id to the candidate label id
        reordered_labels_int = np.array([permutation[label] for label in y_pred])
        current_accuracy = accuracy_score(label_int, reordered_labels_int)

        if current_accuracy > best_accuracy:
            best_accuracy = current_accuracy
            best_permutation = permutation

    return best_accuracy, best_permutation

In [58]:
def eval_init_ninit(X):
    best_score = 0
    best_init = None
    best_ninit = None
    
    inits = ['random', 'k-means++']
    n_inits = [1,10,15]
    
    for init, n_init in itertools.product(inits,n_inits):
        model = KMeans(n_clusters=3,random_state=17,init=init,n_init=n_init)
        try:
            trained_model = model.fit(X)
            y_pred = trained_model.labels_

            best_accuracy, best_permutation = label_permute_compare(y_pred)
            if best_accuracy > best_score:
                best_score = best_accuracy
                best_init = init
                best_ninit = n_init
        except Exception:
            # Skip parameter combinations that fail to fit
            continue

    return best_score, best_init, best_ninit

In [59]:
def eval_init_loss(X):
    best_score = 0
    best_init = None
    best_beta = None
    
    inits = ['random', 'nndsvd', 'nndsvda', 'nndsvdar']
    betas = ['frobenius', 'kullback-leibler', 'itakura-saito']
    
    for init, beta in itertools.product(inits,betas):
        if beta == 'frobenius':
            solver='cd'
        else:
            solver='mu'
        model = NMF(n_components=3,random_state=17,init=init,beta_loss=beta,solver=solver)
        try:
            trained_model = model.fit_transform(X)
            y_pred = np.argmax(trained_model,axis=1)

            best_accuracy, best_permutation = label_permute_compare(y_pred)
            if best_accuracy > best_score:
                best_score = best_accuracy
                best_init = init
                best_beta = beta
        except Exception:
            # Skip init/loss combinations that the solver cannot fit on this input
            continue

    return best_score, best_init, best_beta

In [60]:
model_res = {}

## Unsupervised Modeling

### KMeans

__Full dataset__

Defaults

In [61]:
km1 = KMeans(n_clusters=3,random_state=17)
km1_train=km1.fit(tfidf_train)

In [62]:
len(km1_train.labels_)

8695

In [63]:
best_accuracy, best_permutation=label_permute_compare(km1_train.labels_)

In [64]:
best_accuracy

0.40966072455434155

In [65]:
model_res['Model 1'] = {'model':'KMeans','data':'full','parameters':'default','accuracy':.4097}

Hyperparameter Tuning

In [66]:
eval_init_ninit(tfidf_train)

(0.4356526739505463, 'k-means++', 1)

Tuning improved the accuracy from 0.41 to 0.44, but that is still really low for a three-class problem.

In [67]:
model_res['Model 2'] = {'model':'KMeans','data':'full','parameters':'tuned','accuracy':.4357}

__Balanced Dataset__

In [68]:
km2 = KMeans(n_clusters=3,random_state=17,n_init=1,init='k-means++')
km2_train=km2.fit(X_train_bal)

In [69]:
best_accuracy, best_permutation=label_permute_compare_bal(km2_train.labels_)

In [70]:
best_accuracy

0.3645224171539961

The balanced dataset performed worse than the original dataset (0.36 versus 0.44).

In [71]:
model_res['Model 3'] = {'model':'KMeans','data':'balanced','parameters':'tuned','accuracy':.3645}

### NMF

__Full dataset__

In [72]:
nmf1 = NMF(n_components=3,random_state=17)

In [73]:
nmf1_train = nmf1.fit_transform(tfidf_train)

In [74]:
nmf1_pred= np.argmax(nmf1_train,axis=1)

In [75]:
nmf_acc, nmf_perm=label_permute_compare(nmf1_pred)
nmf_acc

0.4243818286371478

Hyperparameter tuning

In [76]:
eval_init_loss(tfidf_train)

(0.4243818286371478, 'nndsvda', 'frobenius')

__Balanced dataset__

In [77]:
nmf2 = NMF(n_components=3,random_state=17,init='nndsvda',beta_loss='frobenius')
nmf2_train = nmf2.fit_transform(X_train_bal)

In [78]:
label_permute_compare_bal(np.argmax(nmf2_train,axis=1))

(0.35329063399238836, (2, 0, 1))

Once again the balanced dataset did not improve the accuracy.

In [79]:
model_res['Model 4'] = {'model':'NMF','data':'full','parameters':'default','accuracy':.4244}

In [80]:
model_res['Model 5'] = {'model':'NMF','data':'full','parameters':'tuned','accuracy':.4265}

In [81]:
model_res['Model 6'] = {'model':'NMF','data':'balanced','parameters':'tuned','accuracy':.3533}

### Hierarchical Clustering

In [82]:
clst = AgglomerativeClustering(n_clusters=3)

In [83]:
clst.fit(tfidf_train.toarray())

In [84]:
clst.labels_

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [85]:
clst_accuracy, clst_permutation=label_permute_compare(clst.labels_)
clst_accuracy

0.42507188039102933

The AgglomerativeClustering class from sklearn does not accept the sparse TF-IDF matrix, so it has to be converted to a dense array first, which takes significant computing time. I will not run the balanced dataset through this model, since the balanced data has been outperformed by the original dataset in the previous models.
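
A quick back-of-the-envelope calculation shows why the dense conversion is expensive. The shape and number of stored values are taken from the `tfidf_train` output earlier in the notebook; the 4-byte index size for the sparse arrays is an assumption about the CSR index dtype.

```python
# Shape and number of stored values reported for tfidf_train.
n_rows, n_cols, n_stored = 8695, 18004, 94345

BYTES_FLOAT64 = 8
BYTES_INT32 = 4  # assumed dtype of the CSR index arrays

# Dense array: every cell is stored, zeros included.
dense_bytes = n_rows * n_cols * BYTES_FLOAT64

# CSR: one value and one column index per stored entry,
# plus one row pointer per row (and one extra at the end).
sparse_bytes = (
    n_stored * BYTES_FLOAT64
    + n_stored * BYTES_INT32
    + (n_rows + 1) * BYTES_INT32
)

print(f"dense : {dense_bytes / 1e9:.2f} GB")
print(f"sparse: {sparse_bytes / 1e6:.2f} MB")
print(f"ratio : {dense_bytes / sparse_bytes:.0f}x")
```

Under these assumptions the dense array needs roughly 1.25 GB compared with a little over 1 MB for the sparse matrix, before counting the memory the clustering algorithm itself uses.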

In [86]:
model_res['Model 7'] = {'model':'Hierarchical','data':'full','parameters':'default','accuracy':.4251}

### Test Data

So far I have only fit and scored the unsupervised models on the training data. I want to see how the KMeans model with optimized hyperparameters performs on the test data.

In [87]:
km3 = KMeans(n_clusters=3,random_state=17,n_init=1,init='k-means++')
km3_train=km3.fit(tfidf_train)

In [88]:
km3_pred = km3_train.predict(tfidf_test)

In [89]:
km3_pred

array([0, 1, 1, ..., 1, 1, 0])

In [90]:
label_permute_compare_test(km3_pred)

(0.4287515762925599, (1, 0, 2))

The test prediction accuracy is similar to what the training data provided. Based on what I have run so far, unsupervised modeling does not appear to be sufficient for this dataset. The comments being used for classification vary in quality and length, and they may not contain enough information to make accurate classifications. I will compare with a supervised method to see if a suitable accuracy can be achieved.
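
One caveat: `label_permute_compare_test` picks the best permutation using the test labels themselves, so the test accuracy it reports is an optimistic estimate. A stricter check would choose the cluster-to-label mapping on the training data and reuse it unchanged on the test predictions. The sketch below shows the idea on hypothetical toy arrays; `best_mapping` and `apply_mapping` are illustrative names, not functions from this notebook.

```python
from itertools import permutations


def best_mapping(y_true, cluster_ids, n_clusters=3):
    """Return the cluster-to-label permutation with the highest accuracy, and that accuracy."""
    best_acc, best_perm = -1.0, None
    for perm in permutations(range(n_clusters)):
        hits = sum(perm[c] == t for c, t in zip(cluster_ids, y_true))
        acc = hits / len(y_true)
        if acc > best_acc:
            best_acc, best_perm = acc, perm
    return best_perm, best_acc


def apply_mapping(perm, cluster_ids):
    """Translate cluster ids into label ids using a fixed permutation."""
    return [perm[c] for c in cluster_ids]


# Hypothetical toy data standing in for the true labels and KMeans cluster ids.
train_true = [0, 0, 1, 1, 2, 2]
train_clusters = [1, 1, 2, 2, 0, 0]
test_true = [0, 1, 2, 2]
test_clusters = [1, 2, 0, 1]

# Choose the mapping on training data only, then reuse it on the test set.
perm, train_acc = best_mapping(train_true, train_clusters)
test_pred = apply_mapping(perm, test_clusters)
test_acc = sum(p == t for p, t in zip(test_pred, test_true)) / len(test_true)
print(perm, train_acc, test_acc)
```

With real data the two approaches may well select the same permutation, in which case the reported accuracy would not change.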

In [91]:
model_res['Model 8'] = {'model':'KMeans','data':'test','parameters':'tuned','accuracy':.4288}

## Supervised Modeling

### Naive Bayes

In [92]:
nb1 = MultinomialNB()

In [93]:
nb1.fit(tfidf_train,train_df.topic_int)

In [94]:
nb1_pred = nb1.predict(tfidf_test)

In [95]:
label_permute_compare_test(nb1_pred)

(0.8039092055485498, (0, 1, 2))

Using sklearn's MultinomialNB classifier greatly improved the accuracy, from about 0.43 to 0.80 on the test data. While that accuracy is still not great, the improvement is encouraging and leads me to believe that this dataset is much better suited to a supervised learning model.
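
For a supervised model the predicted integers already are class labels, so no permutation search is needed (the search above returned the identity mapping `(0, 1, 2)`), and accuracy can be computed directly. A confusion matrix additionally shows which topics get mistaken for each other. The sketch below is dependency-free and uses hypothetical toy predictions; in the notebook the inputs would be `test_df.topic_int` and `nb1_pred`.

```python
from collections import Counter

# Label order produced by the encoder: Biology=0, Chemistry=1, Physics=2.
TOPICS = ["Biology", "Chemistry", "Physics"]

# Hypothetical toy labels and predictions.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]

# Count (true, predicted) pairs; rows are true topics, columns are predictions.
pairs = Counter(zip(y_true, y_pred))
k = len(TOPICS)
matrix = [[pairs[(t, p)] for p in range(k)] for t in range(k)]

# Accuracy is the diagonal total divided by the number of samples.
accuracy = sum(matrix[i][i] for i in range(k)) / len(y_true)
# Recall per topic: correct predictions divided by the true count of that topic.
recall = {TOPICS[i]: matrix[i][i] / sum(matrix[i]) for i in range(k)}

for name, row in zip(TOPICS, matrix):
    print(f"{name:10s} {row}")
print("accuracy:", accuracy)
print("recall  :", recall)
```

The same table can be obtained with the `confusion_matrix` function already imported from `sklearn.metrics` at the top of the notebook.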

In [96]:
model_res['Model 9'] = {'model':'NaiveBayes','data':'full','parameters':'default','accuracy':.8039}

## Results

I created a dictionary to store my results and transformed it into the dataframe below. Among the unsupervised models, the strongest was a KMeans model with optimized hyperparameters, but its accuracy was only 0.44, which is very low. The test data performed similarly when using the strongest unsupervised model for prediction. I decided to run a supervised model to try to increase the performance, and a multinomial Naive Bayes model proved to be much stronger than the unsupervised models I ran.

In [97]:
pd.DataFrame.from_dict(model_res,orient='index').sort_values(by='accuracy',ascending=False)

Unnamed: 0,model,data,parameters,accuracy
Model 9,NaiveBayes,full,default,0.8039
Model 2,KMeans,full,tuned,0.4357
Model 8,KMeans,test,tuned,0.4288
Model 5,NMF,full,tuned,0.4265
Model 7,Hierarchical,full,default,0.4251
Model 4,NMF,full,default,0.4244
Model 1,KMeans,full,default,0.4097
Model 3,KMeans,balanced,tuned,0.3645
Model 6,NMF,balanced,tuned,0.3533


## Conclusion

For this project I found that the dataset was not well suited to the unsupervised models I used. One reason could be the quality of the comments and the words used in the text extracted from Reddit. Without labels to train the models, I ended up with underfit classifiers and poor accuracy. Because the number of comments per topic is imbalanced, I also trained the models on a balanced dataset, which proved to be even worse. Supervised learning, while only briefly explored, improved the accuracy greatly.

Due to computational limits I was not able to perform hyperparameter tuning on the agglomerative clustering model. If I were to continue to explore this data I would like to see if I could decrease the computational time needed to run that model in order to tune it. There are many unsupervised models that were not tested and next steps could include further exploration to see if a suitable unsupervised model could be trained. 