<a href="https://colab.research.google.com/github/vaccine-lang/facebook-data/blob/main/Topic_Modeling_and_Clustering_Facebook_Data_(Week_8).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling and Clustering Facebook Data

![Flounder saying "This is this and that is that."](https://i.imgur.com/fi5fh1C.gif)

This week, we will apply topic modeling and clustering to the Facebook data set to see what sorts of "topics" emerge. Conceptually, the primary difference between topic modeling and clustering is that topic models allow for overlap: a single document can feature many topics. In clustering, however, a document is assigned to a single cluster. 
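
As a rough illustration of that difference, the sketch below contrasts the shape of the two outputs. The weights, the `threshold` value, and the cluster labels are all invented for illustration; they are not taken from the Facebook data.

```python
import numpy as np

# Hypothetical document-topic weights for three posts and three topics,
# shaped like the matrix a topic model returns (values are made up).
doc_topic = np.array([
    [0.6, 0.3, 0.1],
    [0.0, 0.9, 0.1],
    [0.4, 0.1, 0.5],
])

# Topic modeling: keep every topic whose weight clears a (hypothetical)
# threshold, so a single post can carry several topics.
threshold = 0.25
topics_per_post = [np.flatnonzero(row >= threshold).tolist() for row in doc_topic]

# Clustering: each post gets exactly one label (hypothetical labels,
# shaped like the output of a clustering algorithm).
cluster_labels = np.array([0, 1, 2])

print(topics_per_post)
print(cluster_labels.tolist())
```

The first and third posts end up with two topics each, while every post has exactly one cluster label.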

In both cases, we choose the number of topics or clusters ourselves and can adjust it for a best fit. Likewise, both approaches yield lists of words that are distinctive to a specific topic or cluster, which may help with identifying the different "genres" of vaccine hesitancy.

A good rule of thumb for topics is that the term weights should drop off rapidly. If the drop-off is slow, it suggests an indistinct topic. With clusters, on the other hand, you want the clusters to be similar in size (within an order of magnitude), because otherwise you get a handful of useful clusters and one giant "misc" cluster.
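
Both rules of thumb can be turned into quick checks. The sketch below is one possible way to do it; the helper names `dropoff_ratio` and `sizes_within_order_of_magnitude`, the `top_n=5` cutoff, and the sample weights and labels are assumptions made for illustration.

```python
import numpy as np

def dropoff_ratio(weights, top_n=5):
    """Ratio of the largest weight to the top_n-th largest weight."""
    top = np.sort(np.asarray(weights, dtype=float))[::-1][:top_n]
    return top[0] / top[-1]

def sizes_within_order_of_magnitude(labels):
    """True when the largest cluster is at most 10x the smallest one."""
    counts = np.bincount(np.asarray(labels))
    counts = counts[counts > 0]  # ignore label values that never occur
    return bool(counts.max() <= 10 * counts.min())

# Made-up term weights: one topic with a sharp drop-off, one that is nearly flat
sharp_topic = [0.40, 0.10, 0.05, 0.03, 0.02, 0.01]
flat_topic = [0.12, 0.11, 0.11, 0.10, 0.10, 0.09]

# Made-up cluster labels: one balanced run, one with a giant "misc" cluster
balanced = [0] * 50 + [1] * 40 + [2] * 30
lopsided = [0] * 500 + [1] * 5 + [2] * 3

print(dropoff_ratio(sharp_topic), dropoff_ratio(flat_topic))
print(sizes_within_order_of_magnitude(balanced), sizes_within_order_of_magnitude(lopsided))
```

A high ratio points to a distinct topic and a ratio near 1 to an indistinct one; where to draw the line is a judgment call.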

In [None]:
# Import common libraries
import pandas as pd
import numpy as np
import os

# Import our language libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.en.stop_words import STOP_WORDS as stopwords

# Install and import gensim
#!pip install --upgrade gensim

# Install Levenshtein
!pip install python-Levenshtein

In [None]:
# Import data files from GitHub

# Set remote (GitHub) and local paths for the data files
GITHUB_ROOT = "https://raw.githubusercontent.com/vaccine-lang/facebook-data/main"
BASE_DIR = "/"
print(f'Files will be downloaded from "{GITHUB_ROOT}"')
print(f'Files will be downloaded to "{BASE_DIR}".')

# Download the concatenated file
file_names = ["concatenated_raw_Facebook_data_w_metadata_stripped_out_text_only"]
print("Downloading data")
for name in file_names:
  cmd = " ".join(['wget', '-P', os.path.dirname(BASE_DIR + name + ".csv"), GITHUB_ROOT + "/data/" + name + ".csv"])
  print("!"+cmd)
  if os.system(cmd) != 0:
    print('  ~~> ERROR')

In [None]:
# Read the file from the directory it was downloaded to (BASE_DIR), not the working directory
df = pd.read_csv(os.path.join(BASE_DIR, file_names[0] + ".csv"))
print(len(df)) # check to make sure the number of lines is about right: ~180k.
df.head()

The data is imported and converted into a table of ~180k snippets of text. Let's vectorize it, then.

In [None]:
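# Let's initialize our vectorizer. Incidentally, this uses the default token pattern,
# r"(?u)\b\w\w+\b", which keeps tokens of two or more word characters.
# min_df=5 drops terms found in fewer than 5 posts; max_df=0.7 drops terms found in more than 70% of posts.
# Recent scikit-learn releases expect stop_words as a list, so convert spaCy's set.
tfidf_text = TfidfVectorizer(stop_words=list(stopwords), min_df=5, max_df=0.7)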
tfidf_text_vectors = tfidf_text.fit_transform(df["text"])
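
To see what the default token pattern mentioned above does, here is a small sketch that applies the same regular expression with Python's `re` module. The `tokenize` helper and the sample sentence are made up for illustration, and the sketch covers only lowercasing and tokenization; stop-word removal and the `min_df`/`max_df` filtering happen afterwards inside the vectorizer.

```python
import re

# The default token pattern: runs of two or more word characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    # The vectorizer lowercases text by default before applying the pattern
    return re.findall(TOKEN_PATTERN, text.lower())

tokens = tokenize("I got a flu shot in 2021!")
print(tokens)
```

Single-character tokens such as "I" and "a" are dropped, as is the punctuation, while the number is kept as a token.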

## Topic Modeling

There are many ways to model topics, and each method also has tweakable parameters. We'll use Non-negative Matrix Factorization (NMF):

![W x H ~ V](https://upload.wikimedia.org/wikipedia/commons/f/f9/NMF.png)

_V_ is the document-term matrix of our corpus. Each row is a Facebook post, and each column is a word. This is a sparse matrix, because most words will _not_ appear in any given post.

_W_ in this example reduces the corpus to four posts (rows) and two topics (columns).

_H_, on the other hand, reduces the corpus to two topics (rows) and six words (columns).

In other words, the two topics are the dimension that _W_ and _H_ share: multiplying the posts-by-topics matrix _W_ by the topics-by-words matrix _H_ approximately reconstructs the original corpus _V_. The topics are like an adapter.
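
The shapes can be checked on a toy example. The sketch below factorizes a made-up 4 x 6 matrix with the classic multiplicative update rules, which are used here only because they fit in a few lines of NumPy; scikit-learn's `NMF` uses a different solver by default. The matrix values, the iteration count, and the small `eps` constant are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy document-term matrix V: 4 posts (rows) by 6 words (columns), non-negative
V = np.array([
    [2, 1, 0, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 1],
    [1, 0, 0, 0, 1, 2],
], dtype=float)

n_topics = 2
W = rng.random((V.shape[0], n_topics))  # posts x topics
H = rng.random((n_topics, V.shape[1]))  # topics x words

eps = 1e-9  # guards against division by zero
initial_error = np.linalg.norm(V - W @ H)
for _ in range(200):
    # Multiplicative updates keep every entry of W and H non-negative
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
final_error = np.linalg.norm(V - W @ H)

print(W.shape, H.shape, (W @ H).shape)
print(initial_error, final_error)
```

The product `W @ H` has the same shape as `V`, and the reconstruction error shrinks as the updates run, which is the sense in which `W x H ~ V`.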


In [None]:
# Set some initial values and define a function for displaying topics
topics = 10
seed = 42
def display_topics(model, features, no_top_words=5):
  for topic, word_vector in enumerate(model.components_):
    total = word_vector.sum()
    largest = word_vector.argsort()[::-1] # word indices, sorted by descending weight
    print("\nTopic %02d" % topic)
    for i in range(0, no_top_words):
      print(" %s (%2.2f)" % (features[largest[i]], word_vector[largest[i]]*100.0/total))

In [None]:
# Start with NMF topic modeling
from sklearn.decomposition import NMF

nmf_text_model = NMF(n_components=topics, random_state=seed)
W_text_matrix = nmf_text_model.fit_transform(tfidf_text_vectors)
H_text_matrix = nmf_text_model.components_



In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
display_topics(nmf_text_model, tfidf_text.get_feature_names_out())

## What should we do with these results?