# The Unsent Project Topic Modeling

## Topic Modeling

I looked into grouping posts by color because LSA clusters words/topics using the similarities between documents, and without grouping by color the documents are incredibly small and also almost random. (TODO: come back once I have a better understanding of LSA to see if this is actually true; also maybe try grouping posts by time and/or color.) One difference I found between grouping by color and using all the individual documents is that the individual documents produced one distinct topic of hate and pain. If the colors do not correspond significantly to the topics (which they likely don't, because there are so many within each color), then grouping by color may lead to less accuracy.
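
A minimal sketch of what grouping posts by color could look like, using a toy DataFrame in place of the real data. The column names `color` and `message` match the notebook; the rows and the name `color_docs` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the real data; only the column names come from the notebook.
toy = pd.DataFrame({
    "color": ["Yellow", "Yellow", "Tangerine"],
    "message": ["you make me happy", "no matter what", "all I want is you"],
})

# Build one document per color by joining every message that shares that color.
color_docs = (
    toy.groupby("color")["message"]
       .apply(lambda msgs: " ".join(msgs.astype(str)))
)
print(color_docs)
```

Each entry of `color_docs` can then be passed to a vectorizer as a single, longer document.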

--- I learned that I should remove duplicates, because I care less about how often a topic is mentioned than about what is being talked about.
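
One possible way to drop duplicate messages, sketched on toy data. Treating messages that differ only in case or surrounding whitespace as duplicates is an assumption, not something the notebook specifies.

```python
import pandas as pd

# Toy rows; the second message repeats the first apart from case and a trailing space.
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "message": ["i miss you", "I miss you ", "sorry"],
})

# Normalize case and surrounding whitespace so trivially different copies match.
key = toy["message"].astype(str).str.strip().str.lower()

# Keep the first occurrence of each normalized message.
deduped = toy.loc[~key.duplicated()].reset_index(drop=True)
print(deduped)
```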

--- I also learned that I should take the lemmas of words, because different forms of the same word will not help to form meaningful topics. (I revisit this under Lemmatization.)

--- Also, each document (mostly when they're grouped by color; I'm not entirely sure about individual posts) does not really have many significant words that characterize it. LSA assumes that words can be clustered because similar words appear in similar contexts (documents). That assumption is not really true here, so LSA is likely to be less accurate and more random.
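
A rough way to check how much context each document offers is to count tokens per message. This sketch uses a made-up Series and simple whitespace splitting, which is only a proxy for what the vectorizers actually tokenize.

```python
import pandas as pd

# Made-up messages standing in for df['message'].
messages = pd.Series(["i miss you", "sorry", "you make me so incredibly happy"])

# Whitespace token count per message.
lengths = messages.astype(str).str.split().str.len()
print(lengths.describe())
```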

### Import packages

In [2]:
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation, NMF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import spacy
nlp = spacy.load("en_core_web_md")

In [90]:
df = pd.read_csv('D:/Coding/Projects/Python/The Unsent Project/unsent_data_final.csv')

In [44]:
df.head()

Unnamed: 0,name,date,color,message
0,Evan,June 30 2020,Tangerine,All I want is you to be happy. I love you fore...
1,Procureur,June 30 2020,Light Pink,Tu es mon meilleur ami
2,shiv,June 30 2020,Pale Purple,i loved you... more than you ever cared for me...
3,brendan,June 30 2020,Yellow,you make me so incredibly happy & im so in lov...
4,James,June 30 2020,Yellow,There this thing about you. No matter how many...


#### Lemmatization
To uncover meaningful topics, I initially thought that different forms of a word would be unhelpful, but different tenses do indicate different meanings in the topics, so I've chosen to leave the tokens as they are.

#### Stop words
A lot of stop words, like you, him, her, more, them, want, etc., can be somewhat characteristic of certain topics in this situation, so I'm choosing to make my own list, mostly of articles, prepositions, and conjunctions: parts of speech that don't really make a big difference to the topics. I mostly gathered these words from initial runs of topic modeling with the stop words left in, and pulled the words I found to be less helpful.

In [96]:
stop_words = ['to', 'and', 'the', 'a', 'for', 'but', 'so', 'in', 'of', 'it',
              'about', 'as', 'this', 'do', 'doing', 'did', 'be', 'on', 'off',
              'with', 'that', 'been', 'just', 'at']
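
The list above was picked by reading topic output. A complementary check, sketched here on a made-up corpus, is to rank terms by total count and look at the top of the list for stop-word candidates. This is a different technique from the one the notebook used, and `ranked` is a hypothetical name.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus standing in for the messages.
docs = [
    "i miss you and the way you laugh",
    "sorry for the things i did",
    "the best of my life",
]

cv = CountVectorizer()
counts = cv.fit_transform(docs)

# Total count of each vocabulary term across all documents.
totals = np.asarray(counts.sum(axis=0)).ravel()

# cv.vocabulary_ maps term -> column index; rank terms by their total count.
ranked = sorted(cv.vocabulary_, key=lambda t: totals[cv.vocabulary_[t]], reverse=True)
print(ranked[:5])
```

Note that the default tokenizer ignores single-character tokens, so words like "i" and "a" never reach the vocabulary in the first place.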

In [13]:
n_topics = 10
n_words = 10

In [35]:
def display_topic_words(model, vocab):
    # Print the n_words highest-weighted terms of each topic (n_words is the global set above).
    for i, component in enumerate(model.components_):
        terms_comp = zip(vocab, component)
        sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:n_words]
        print("Topic " + str(i), ' '.join([term for term, comp in sorted_terms]))
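
A variant of the same idea that returns the top terms instead of printing them, and takes `n_words` as a parameter rather than reading a global. The function name `top_words` and the toy weights are hypothetical.

```python
import numpy as np

def top_words(components, vocab, n_words=10):
    """Return the n_words highest-weighted terms for each topic row."""
    vocab = np.asarray(vocab)
    topics = []
    for row in np.asarray(components):
        # argsort is ascending, so reverse it and keep the first n_words indices.
        top_idx = np.argsort(row)[::-1][:n_words]
        topics.append(vocab[top_idx].tolist())
    return topics

# Toy topic-term weights: 2 topics over a 3-word vocabulary.
components = np.array([[0.1, 0.9, 0.5],
                       [0.7, 0.2, 0.3]])
vocab = ["love", "miss", "sorry"]
print(top_words(components, vocab, n_words=2))
```

For LSA the component weights can be negative, so sorting by raw value (as both versions do) only surfaces the largest positive weights; terms with large negative weights are not shown.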

### LSA - Latent Semantic Analysis

In [84]:
vectorizer = TfidfVectorizer(stop_words=stop_words)  # use_idf=True, smooth_idf=True are the defaults
documents = df['message'].apply(lambda x: str(x))  # cast every message to str
vectors = vectorizer.fit_transform(documents)

In [85]:
vectors.shape

(199931, 32857)

In [86]:
lsa_model = TruncatedSVD(n_components=n_topics)

In [87]:
lsa_model.fit(vectors)

TruncatedSVD(algorithm='randomized', n_components=10, n_iter=5,
             random_state=None, tol=0.0)

In [88]:
vocab = vectorizer.get_feature_names()  # on scikit-learn >= 1.0, use get_feature_names_out()

In [89]:
display_topic_words(lsa_model, vocab)

Topic 0 you me love my still miss know wish we never
Topic 1 love always will still first you my much forever ll
Topic 2 me you loved why hurt feel made like hate broke
Topic 3 miss you much still everyday hate come lot talking sometimes
Topic 4 my heart first were friend best life you always broke
Topic 5 don know think still want re can if hope anymore
Topic 6 still think me my were when broke time all every
Topic 7 sorry was im enough not wasn love why good really
Topic 8 always will sorry me still never ll we think miss
Topic 9 sorry think much still loved you how never wish will


In [None]:
lsa_model.singular_values_
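
The cell above was not run, so here is a small sketch of what inspecting the singular values could look like, on a made-up corpus. `explained_variance_ratio_` is shown alongside as another way to judge how much each component contributes.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus standing in for the messages.
docs = [
    "i miss you so much",
    "i still love you",
    "sorry i was not enough",
    "you broke my heart",
    "i hope you are happy",
    "we were best friends",
]

X = TfidfVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=3, random_state=0).fit(X)

# Larger singular values mean the component captures more of the matrix.
print(svd.singular_values_)
# Fraction of variance attributed to each component.
print(svd.explained_variance_ratio_)
```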

In [36]:
display_topic_words(svd_model, vocab)  # svd_model is the LSA model from an earlier run; it is not defined in the cells above

Topic 0 you me love my still miss know wish we never
Topic 1 love always will first you still my much forever ll
Topic 2 me you loved why hurt hate much made feel broke
Topic 3 miss you much still everyday hope come lot sometimes talking
Topic 4 my heart first were friend best life always miss broke
Topic 5 don know think still want if re can anymore even
Topic 6 still think were me my we first time re broke
Topic 7 sorry was enough im why love wasn not were good
Topic 8 sorry always still will think never was enough im ll
Topic 9 me we were miss love why like what first made
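
The two LSA listings above are close but not identical. The fitted model's repr shows `algorithm='randomized'` with `random_state=None`, so solver randomness is one plausible contributor (the earlier run may also have used different data or settings, which cannot be determined from the notebook). A sketch, on a made-up corpus, of fixing the seed so repeated fits match:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus standing in for the messages.
docs = [
    "i miss you so much",
    "i still love you",
    "sorry i was not enough",
    "you broke my heart",
    "i hope you are happy",
    "we were best friends",
]

X = TfidfVectorizer().fit_transform(docs)

# Two independent fits with the same seed.
run_a = TruncatedSVD(n_components=2, random_state=1).fit(X)
run_b = TruncatedSVD(n_components=2, random_state=1).fit(X)

# With a fixed random_state the learned components should be identical.
same = np.allclose(run_a.components_, run_b.components_)
print(same)
```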


### LDA - Latent Dirichlet Allocation

In [91]:
vectorizer = CountVectorizer(stop_words=stop_words)
documents = df['message'].apply(lambda x: str(x))
vectors = vectorizer.fit_transform(documents)

In [92]:
lda_model = LatentDirichletAllocation(n_components=n_topics, random_state=1)

In [93]:
lda_model.fit(vectors) # this takes a little while

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=10, n_jobs=None,
                          perp_tol=0.1, random_state=1, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In [94]:
vocab = vectorizer.get_feature_names()  # on scikit-learn >= 1.0, use get_feature_names_out()

In [95]:
display_topic_words(lda_model, vocab)

Topic 0 you dont know we im want like if what miss
Topic 1 you my heart your me when at back all broke
Topic 2 you me love much still how dont never hate will
Topic 3 my you were life first youre best love person friend
Topic 4 you we always will love ill our years its together
Topic 5 me fuck thank you ur bitch fucking up shit youre
Topic 6 your you im miss sorry still me my its not
Topic 7 hope you youre really good are happy out we enough
Topic 8 you me was more love loved never than im wish
Topic 9 you me wish would if was have didnt when could
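
Besides the top words per topic, LDA also gives a topic mixture for each document. A minimal sketch on a made-up corpus; the names `doc_topics` and `dominant` are hypothetical, and two topics are used only to keep the toy example small.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus standing in for the messages.
docs = [
    "i miss you so much",
    "i still love you",
    "sorry i was not enough",
    "you broke my heart",
    "i hope you are happy",
    "we were best friends",
]

X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=1).fit(X)

# Each row is one document's distribution over topics.
doc_topics = lda.transform(X)

# Index of the highest-probability topic for each document.
dominant = doc_topics.argmax(axis=1)
print(dominant)
```

Grouping messages by `dominant` is one way to read a few example posts per topic and check whether a topic such as Topic 5 above is as coherent as its top words suggest.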


### NMF - Non-negative Matrix Factorization

In [97]:
vectorizer = CountVectorizer(stop_words=stop_words)
documents = df['message'].apply(lambda x: str(x))
vectors = vectorizer.fit_transform(documents)

In [98]:
nmf_model = NMF(n_components=n_topics, random_state=1)

In [99]:
nmf_model.fit(vectors)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=10, random_state=1, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

In [100]:
vocab = vectorizer.get_feature_names()  # on scikit-learn >= 1.0, use get_feature_names_out()

In [101]:
display_topic_words(nmf_model, vocab)

Topic 0 you still much are loved hope were hate miss think
Topic 1 me like made feel why hurt make how when what
Topic 2 we were still together had other each friends miss when
Topic 3 my heart life were first best friend youre all is
Topic 4 love always will still first much you forever ill thank
Topic 5 im sorry not youre still happy now its scared like
Topic 6 know dont want if like think what its how even
Topic 7 was when like what were all didnt because had one
Topic 8 your miss hope is still its when youre all smile
Topic 9 have never wish could will how always would much back
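
NMF factors the document-term matrix into document-topic weights and topic-term weights, both non-negative. A minimal sketch on a made-up corpus showing both factors; `W` and `H` are conventional names, not names from the notebook, and `max_iter=500` is an arbitrary choice for the toy data.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus standing in for the messages.
docs = [
    "i miss you so much",
    "i still love you",
    "sorry i was not enough",
    "you broke my heart",
    "i hope you are happy",
    "we were best friends",
]

X = CountVectorizer().fit_transform(docs)
nmf = NMF(n_components=2, random_state=1, max_iter=500)

W = nmf.fit_transform(X)   # document-topic weights, shape (n_docs, n_topics)
H = nmf.components_        # topic-term weights, shape (n_topics, n_terms)

# Strongest topic per document; unlike LDA, rows of W are not probabilities.
dominant = W.argmax(axis=1)
print(dominant)
```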
