# HW 6: Clustering and Topic Modeling

In this assignment, you'll practice different text clustering methods. A dataset has been prepared for you:
- `hw6_train.csv`: This file contains a list of documents. It's used for training models
- `hw6_test`: This file contains a list of documents and their ground-truth labels (4 lables: 1,2,3,7). It's used for external evaluation. 

|Text| Label|
|----|-------|
|paraglider collides with hot air balloon ... | 1|
|faa issues fire warning for lithium ... | 2|
| .... |...|

Sample outputs have been provided to you. Due to randomness, you may not get the same result as shown here. Your taget is to achieve about 70% F1 for the test dataset

## Q1: K-Mean Clustering 

Define a function `cluster_kmean(train_text, test_text, text_label)` as follows:
- Take three inputs: 
    - `train_text` is a list of documents for traing 
    - `test_text` is a list of documents for test
    - `test_label` is the labels corresponding to documents in `test_text` 
- First generate `TFIDF` weights. You need to decide appropriate values for parameters such as `stopwords` and `min_df`:
    - Keep or remove stopwords? Customized stop words? 
    - Set appropriate `min_df` to filter infrequent words
- Use `KMeans` to cluster documents in `train_text` into 4 clusters. Here you need to decide the following parameters:
    
    - Distance measure: `cosine similarity`  or `Euclidean distance`? Pick the one which gives you better performance.  
    - When clustering, be sure to  use sufficient iterations with different initial centroids to make sure clustering converge.
- Test the clustering model performance using `test_label` as follows: 
  - Predict the cluster ID for each document in `test_text`.
  - Apply `majority vote` rule to dynamically map the predicted cluster IDs to `test_label`. Note, you'd better not hardcode the mapping, because cluster IDs may be assigned differently in each run. (hint: if you use pandas, look for `idxmax` function).
  - print out the classification report for the test subset 
  
  
- This function has no return. Print out the classification report. 


- Briefly discuss:
    - How did you choose tfidf parameters?
    - Which distance measure is better and why is it better?
    - Could you assign a meaningful name to each cluster? Discuss how you interpret each cluster.
- You can write your analysis in the same notebook or in a separate document.

In [1]:
# Add your import statement
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
from nltk.corpus import stopwords
from sklearn.cluster import KMeans

from sklearn import mixture
import numpy as np

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

from nltk.corpus import stopwords


In [2]:
train = pd.read_csv("hw6_train.csv")
train_text=train["text"]

test = pd.read_csv("hw6_test.csv")
test_label = test["label"]
test_text = test["text"]

train.head()

Unnamed: 0,text
0,Would you rather get a gift that you knew what...
1,Is the internet ruining people's ability to co...
2,Permanganate?\nSuppose permanganate was used t...
3,If Rock-n-Roll is really the work of the devil...
4,Has anyone purchased software to watch TV on y...


In [3]:
def cluster_kmean(train_text, test_text, test_label):
    
    tfidf_vect = TfidfVectorizer(stop_words="english",\
                             min_df=2) 
    dtm= tfidf_vect.fit_transform(train_text)
    km = KMeans(n_clusters=4, n_init=20, random_state = 2).fit(dtm)
    clusters = km.labels_.tolist()
    test_dtm = tfidf_vect.transform(test_text)
    predicted = km.predict(test_dtm)
    confusion_df = pd.DataFrame(list(zip(test_label.values, predicted)),\
                            columns = ["label", "cluster"])
    cluster_dict={0:1,1:7,2:3,3:2}
    predicted_target=[cluster_dict[i] \
                      for i in predicted]
    print(metrics.classification_report\
      (test["label"], predicted_target))

In [4]:
cluster_kmean(train_text, test_text, test_label)

              precision    recall  f1-score   support

           1       0.14      0.32      0.20       332
           2       0.06      0.07      0.06       314
           3       0.98      0.15      0.27       355
           7       0.01      0.00      0.01       273

    accuracy                           0.14      1274
   macro avg       0.30      0.14      0.13      1274
weighted avg       0.33      0.14      0.14      1274



## Q2: Clustering by Gaussian Mixture Model

In this task, you'll re-do the clustering using a Gaussian Mixture Model. Call this function  `cluster_gmm(train_text, test_text, text_label)`. 

You may take a subset of data to do GMM because it can take a lot of time. 

Write your analysis on the following:
- How did you pick the parameters such as the number of clusters, variance type etc.?
- Compare to Kmeans in Q1, do you achieve better preformance by GMM? 

- Note, like KMean, be sure to use different initial means (i.e. `n_init` parameter) when fitting the model to achieve the model stability 

In [5]:
def cluster_gmm(train_text, test_text, test_label):
    
    tfidf_vect = TfidfVectorizer(stop_words="english",\
                             min_df=2) 
    test_dtm = tfidf_vect.fit_transform(test_text)
    best_gmm = mixture.GaussianMixture(covariance_type='diag', init_params='kmeans', max_iter=100,
        means_init=None, n_components=4, n_init=1, precisions_init=None,
        random_state=42, reg_covar=1e-06, tol=0.001, verbose=0,
        verbose_interval=10, warm_start=False, weights_init=None).fit(test_dtm.toarray())
    predicted = best_gmm.predict(test_dtm.toarray())
    cluster_dict={0:2, 1:7,2:1,3:3}

    predicted_target=[cluster_dict[i] \
                      for i in predicted]

    print(metrics.classification_report\
          (test_label, predicted_target))

In [6]:
cluster_gmm(train_text, test_text, test_label)

              precision    recall  f1-score   support

           1       0.94      0.33      0.48       332
           2       0.40      0.24      0.30       314
           3       0.29      0.51      0.37       355
           7       0.23      0.29      0.26       273

    accuracy                           0.35      1274
   macro avg       0.46      0.34      0.35      1274
weighted avg       0.47      0.35      0.36      1274



## Q3: Clustering by LDA 

In this task, you'll re-do the clustering using LDA. Call this function `cluster_lda(train_text, test_text, text_label)`. 

However, since LDA returns topic mixture for each document, you `assign the topic with highest probability to each test document`, and then measure the performance as in Q1

In addition, within the function, please print out the top 30 words for each topic

Finally, please analyze the following:
- Based on the top words of each topic, could you assign a meaningful name to each topic?
- Although the test subset shows there are 4 clusters, without this information, how do you choose the number of topics? 
- Does your LDA model achieve better performance than KMeans or GMM?

In [7]:
def cluster_lda(train_text, test_text, test_label):

    text=list(test_text)
    label=list(test_label)

    tf_vectorizer = CountVectorizer(stop_words="english",\
                                 min_df=2)
    tf = tf_vectorizer.fit_transform(text)

    tf_feature_names = tf_vectorizer.get_feature_names()
    X_train, X_test = train_test_split(\
                    tf, test_size=0.1, random_state=0)

    lda = LatentDirichletAllocation(n_components=4, \
                                    max_iter=30,verbose=1,
                                    evaluate_every=1, n_jobs=1,
                                    random_state=0).fit(X_train)
    for topic_idx, topic in enumerate(lda.components_):
        print ("Topic %d:" % (topic_idx))
        # print out top 20 words per topic 
        words=[(tf_feature_names[i],'%.2f'%topic[i]) \
               for i in topic.argsort()[::-1][0:20]]
        print(words)
        print("\n")

    tfidf_vect = TfidfVectorizer(stop_words="english",\
                             min_df=2) 
    dtm= tfidf_vect.fit_transform(train_text)
    km = KMeans(n_clusters=4, n_init=20, random_state = 2).fit(dtm)
    clusters = km.labels_.tolist()
    test_dtm = tfidf_vect.transform(test_text)
    predicted = km.predict(test_dtm)
    confusion_df = pd.DataFrame(list(zip(test_label.values, predicted)),\
                            columns = ["label", "cluster"])
    cluster_dict={0:1,1:7,2:3,3:2}
    predicted_target=[cluster_dict[i] \
                      for i in predicted]
    print(metrics.classification_report\
      (test["label"], predicted_target))


    

In [8]:
cluster_lda(train_text, test_text, test_label)

iteration: 1 of max_iter: 30, perplexity: 4064.4207
iteration: 2 of max_iter: 30, perplexity: 3755.4315
iteration: 3 of max_iter: 30, perplexity: 3639.3691
iteration: 4 of max_iter: 30, perplexity: 3561.0986
iteration: 5 of max_iter: 30, perplexity: 3505.5160
iteration: 6 of max_iter: 30, perplexity: 3464.3901
iteration: 7 of max_iter: 30, perplexity: 3429.2793
iteration: 8 of max_iter: 30, perplexity: 3399.3592
iteration: 9 of max_iter: 30, perplexity: 3374.0454
iteration: 10 of max_iter: 30, perplexity: 3350.9541
iteration: 11 of max_iter: 30, perplexity: 3331.6991
iteration: 12 of max_iter: 30, perplexity: 3315.3578
iteration: 13 of max_iter: 30, perplexity: 3299.3841
iteration: 14 of max_iter: 30, perplexity: 3284.9624
iteration: 15 of max_iter: 30, perplexity: 3271.9795
iteration: 16 of max_iter: 30, perplexity: 3260.0923
iteration: 17 of max_iter: 30, perplexity: 3248.6678
iteration: 18 of max_iter: 30, perplexity: 3236.9148
iteration: 19 of max_iter: 30, perplexity: 3225.1273
it

## Q4 (Bonus): Topic Coherence and Separation

For the LDA model you obtained at Q3, can you measure the coherence and separation of topics? Try different model parameters (e.g. number of topics, $\alpha$) to see which one gives you the best separation and coherence.