<a href="https://colab.research.google.com/github/vaccine-lang/facebook-data/blob/main/Topic_Modeling_and_Clustering_Facebook_Data_(Week_8).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling and Clustering Facebook Data

This week, we will apply topic modeling and clustering to the Facebook data set to see what sorts of "topics" emerge. Conceptually, the primary difference between topic modeling and clustering is that topic models allow for overlap: a single document can feature many topics. In clustering, however, a document is assigned to a single cluster. 
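To make the distinction concrete, here is a minimal sketch on a hypothetical four-document toy corpus (not the Facebook data). It assumes scikit-learn's `NMF` as the topic model and `KMeans` as the clustering algorithm; the `toy_*` names, `doc_topic`, and `labels` are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not drawn from the Facebook data
toy_docs = [
    "vaccine side effects and vaccine safety",
    "mask mandate and wearing a mask in school",
    "vaccine mandate and mask mandate protest",
    "school mask rules for children",
]

toy_vectors = TfidfVectorizer().fit_transform(toy_docs)

# Topic model: each row holds one non-negative weight per topic
toy_nmf = NMF(n_components=2, random_state=42, max_iter=500)
doc_topic = toy_nmf.fit_transform(toy_vectors)

# Clustering: each document receives exactly one cluster label
toy_kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = toy_kmeans.fit_predict(toy_vectors)

print(np.round(doc_topic, 2))  # one row per document, one column per topic
print(labels)                  # one integer per document
```

A row of `doc_topic` can carry weight on both topics at once, whereas `labels` holds a single cluster index per document.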

In both cases, however, we set the number of topics/clusters and can adjust for a best fit. Similarly, we can get wordlists of words that are distinct to a specific topic/cluster, which may help with identifying the different "genres" of vaccine hesitancy.

A good rule of thumb for topics is that term weights should drop off rapidly. If the drop-off is slow, it suggests an indistinct topic. With clusters, on the other hand, you want sizes that are similar (within an order of magnitude of one another), because otherwise you get a handful of useful clusters and one giant "misc" cluster.
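Both rules of thumb can be checked numerically. The sketch below is one possible way to do so, using hypothetical helper names (`top_term_shares`, `cluster_size_ratio`) and made-up weights and labels. The percentage share matches the one printed by `display_topics` later in this notebook; reading "within an order of magnitude" as a largest-to-smallest size ratio below 10 is an assumption.

```python
import numpy as np

def top_term_shares(word_vector, n=5):
    """Percentage share of a topic's total weight held by its top n terms."""
    word_vector = np.asarray(word_vector, dtype=float)
    top = np.sort(word_vector)[::-1][:n]  # n largest weights, descending
    return top * 100.0 / word_vector.sum()

def cluster_size_ratio(labels):
    """Ratio of the largest to the smallest cluster size."""
    sizes = np.bincount(np.asarray(labels))
    sizes = sizes[sizes > 0]  # ignore label values that never occur
    return sizes.max() / sizes.min()

# Hypothetical topic weights: one sharp topic, one flat topic
sharp = top_term_shares([60, 20, 10, 5, 5], n=3)
flat = top_term_shares([22, 21, 20, 19, 18], n=3)
print(sharp, sharp[0] / sharp[-1])  # steep drop-off
print(flat, flat[0] / flat[-1])     # slow drop-off, likely indistinct

# Hypothetical cluster labels: balanced versus one giant "misc" cluster
balanced = [0] * 50 + [1] * 40 + [2] * 10
lopsided = [0] * 500 + [1] * 5
print(cluster_size_ratio(balanced) < 10)
print(cluster_size_ratio(lopsided) < 10)
```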

In [17]:
# Import common libraries
import pandas as pd
import numpy as np
import os

# Import our language libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.en.stop_words import STOP_WORDS as stopwords

# Install and import gensim
#!pip install --upgrade gensim

# Install Levenshtein
!pip install python-Levenshtein

Collecting python-Levenshtein
  Downloading python-Levenshtein-0.12.2.tar.gz (50 kB)
Building wheels for collected packages: python-Levenshtein
  Building wheel for python-Levenshtein (setup.py) ... done
  Created wheel for python-Levenshtein: filename=python_Levenshtein-0.12.2-cp37-cp37m-linux_x86_64.whl size=149865 sha256=d8f69ba81cbdb0adc4901ab6c33606afa041f7491849e58554c5596fd8c1867d
  Stored in directory: /root/.cache/pip/wheels/05/5f/ca/7c4367734892581bb5ff896f15027a932c551080b2abd3e00d
Successfully built python-Levenshtein
Installing collected packages: python-Levenshtein
Successfully installed python-Levenshtein-0.12.2


In [10]:
# Import data files from GitHub

# Set remote (GitHub) and local paths for the data files
GITHUB_ROOT = "https://raw.githubusercontent.com/vaccine-lang/facebook-data/main"
BASE_DIR = "/"
print(f'Files will be downloaded from "{GITHUB_ROOT}"')
print(f'Files will be downloaded to "{BASE_DIR}".')

# Download the concatenated file
file_names = ["concatenated_raw_Facebook_data_w_metadata_stripped_out_text_only"]
print("Downloading data")
for name in file_names:
  cmd = " ".join(['wget', '-P', os.path.dirname(BASE_DIR + name + ".csv"), GITHUB_ROOT + "/data/" + name + ".csv"])
  print("!"+cmd)
  if os.system(cmd) != 0:
    print('  ~~> ERROR')

Files will be downloaded from "https://raw.githubusercontent.com/vaccine-lang/facebook-data/main"
Files will be downloaded to "/".
Downloading data
!wget -P / https://raw.githubusercontent.com/vaccine-lang/facebook-data/main/data/concatenated_raw_Facebook_data_w_metadata_stripped_out_text_only.csv


In [13]:
# Read the file from the directory it was downloaded to (BASE_DIR)
df = pd.read_csv(os.path.join(BASE_DIR, "concatenated_raw_Facebook_data_w_metadata_stripped_out_text_only.csv"))
print(len(df)) # check to make sure the number of lines is about right: ~180k.
df.head()

186822


| Unnamed: 0.1 | Unnamed: 0 | text |
|---|---|---|
| 0 | 0 | #ATTENTION Federal Election is Coming #QUESTIO... |
| 1 | 1 | Doctors & Nurses are disregarding sound medica... |
| 2 | 2 | 🤬 SouthernDude82 If your still in doubt that ... |
| 3 | 3 | VICE “What is being built is the architectur... |
| 4 | 4 | WORLDWIDE RALLY FOR FREEDOM [MELBOURNE] On M... |


In [19]:
# scikit-learn expects stop_words as a list, so convert spaCy's set
tfidf_text = TfidfVectorizer(stop_words=list(stopwords), min_df=5, max_df=0.7)
tfidf_text_vectors = tfidf_text.fit_transform(df["text"])



In [20]:
topics = 10  # number of topics to fit
seed = 42    # fixed random seed for reproducibility

def display_topics(model, features, no_top_words=5):
  # Print each topic's top terms with their percentage share of the topic's total weight
  for topic, word_vector in enumerate(model.components_):
    total = word_vector.sum()
    largest = word_vector.argsort()[::-1] # term indices sorted by descending weight
    print("\nTopic %02d" % topic)
    for i in range(no_top_words):
      print(" %s (%2.2f)" % (features[largest[i]], word_vector[largest[i]]*100.0/total))

In [21]:
# Start with NMF topic modeling
from sklearn.decomposition import NMF

nmf_text_model = NMF(n_components=topics, random_state=seed)
W_text_matrix = nmf_text_model.fit_transform(tfidf_text_vectors)
H_text_matrix = nmf_text_model.components_



In [23]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
display_topics(nmf_text_model, tfidf_text.get_feature_names_out())


Topic 00
 post (37.98)
 share (37.71)
 photos (0.81)
 bodek (0.38)
 peter (0.30)

Topic 01
 covid (5.25)
 19 (4.46)
 deaths (0.83)
 cases (0.60)
 test (0.54)

Topic 02
 https (4.38)
 com (3.98)
 www (3.69)
 youtube (2.01)
 watch (1.96)

Topic 03
 people (1.08)
 know (0.58)
 don (0.54)
 like (0.47)
 need (0.43)

Topic 04
 vaccine (5.97)
 vaccines (1.65)
 pfizer (0.82)
 coronavirus (0.71)
 gates (0.57)

Topic 05
 mask (6.09)
 masks (4.14)
 wear (2.53)
 wearing (2.31)
 face (1.77)

Topic 06
 trump (6.34)
 president (3.05)
 donald (2.34)
 biden (1.39)
 election (0.87)

Topic 07
 canada (3.81)
 trudeau (3.69)
 justin (1.52)
 canadians (1.07)
 government (1.06)

Topic 08
 facebook (11.48)
 com (5.66)
 posts (4.57)
 https (4.42)
 www (3.91)

Topic 09
 state (0.93)
 health (0.87)
 new (0.65)
 coronavirus (0.59)
 governor (0.55)
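
The listing above shows the top terms per topic but not how much of the corpus each topic accounts for. One way to estimate that is to sum the document-topic weights per topic and normalize. The sketch below does so on a hypothetical toy matrix standing in for `W_text_matrix`; the helper name `topic_shares` is illustrative, and treating summed weights as "topic size" is an assumption rather than the only reasonable measure.

```python
import numpy as np

def topic_shares(doc_topic_matrix):
    """Percentage of the total document-topic weight carried by each topic."""
    doc_topic_matrix = np.asarray(doc_topic_matrix, dtype=float)
    return doc_topic_matrix.sum(axis=0) / doc_topic_matrix.sum() * 100.0

# Hypothetical 3-document, 2-topic matrix standing in for W_text_matrix
toy_W = np.array([[0.3, 0.1],
                  [0.1, 0.3],
                  [0.2, 0.0]])
shares = topic_shares(toy_W)
print(shares)
```

In the notebook itself, calling `topic_shares(W_text_matrix)` would give one percentage per topic, which can help flag topics that absorb a disproportionate share of the corpus.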
