<a href="https://colab.research.google.com/github/vaccine-lang/facebook-data/blob/main/Topic_Modeling_and_Clustering_Facebook_Data_(Week_10).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling and Clustering Facebook Data

Last week, we tried topic modeling our Facebook data set and decided that we need to think more carefully about how we tokenize the data in order to get more usable topics.

In [None]:
# Install textacy
!pip install textacy

# Import common libraries
import pandas as pd
import numpy as np
import os
import re
import unicodedata
import sys

# Import our language libraries
import textacy
from textacy import preprocessing as tprep
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.en.stop_words import STOP_WORDS as stopwords

# Install and import gensim
#!pip install --upgrade gensim

# Install Levenshtein
#!pip install python-Levenshtein

# Install spaCy model
!{sys.executable} -m spacy download en_core_web_lg

# Import data files from GitHub

# Set remote (GitHub) and local paths for the data files
GITHUB_ROOT = "https://raw.githubusercontent.com/vaccine-lang/facebook-data/main"
BASE_DIR = "/"
print(f'Files will be downloaded from "{GITHUB_ROOT}"')
print(f'Files will be downloaded to "{BASE_DIR}".')

# Download the concatenated file
file_names = ["concatenated_raw_Facebook_data_w_metadata_stripped_out_text_only"]
print("Downloading data")
for name in file_names:
  cmd = " ".join(['wget', '-P', os.path.dirname(BASE_DIR + name + ".csv"), GITHUB_ROOT + "/data/" + name + ".csv"])
  print("!"+cmd)
  if os.system(cmd) != 0:
    print('  ~~> ERROR')

# Read the file from the directory it was downloaded to
df = pd.read_csv(os.path.join(BASE_DIR, file_names[0] + ".csv")).drop(['Unnamed: 0'], axis=1)

## Work with a sample

Let's make a sample of the data so we can more easily see what's going on. A sample of 20 posts should catch enough of the edge cases we stumbled upon last week.

In [None]:
s = df.sample(20, random_state=8)
print(s.to_markdown())

Things are looking good, but we can again see how prevalent non-word items are in the texts, even though they carry some sort of semantic meaning. Let's see how we can use tokenization to understand the data better.

## Tokenizing

The simplest tokenizer for English just splits the text on spaces. This may also help us understand what "non-English" tokens there are. So let's create an array of every space-separated token in the dataset.

In [None]:
# Create a new column for space-split tokens.
df["naive_tokens"] = df["text"].str.split(" ")
# Collapse the Series of lists into one long Series, drop missing values, and get the unique tokens.
unique_tokens = pd.unique(df["naive_tokens"].explode().dropna())
# Keep only the tokens that contain no ASCII letters at all.
nonword_tokens = [token for token in unique_tokens if re.search(r"^[^a-zA-Z]+$", token)]
print(len(nonword_tokens))

In [None]:
# Most of the nonword tokens look like numbers, so let's remove every token that contains a digit or number-like punctuation (: / . , $)
nonword_nonnumber_tokens = [non_number for non_number in nonword_tokens if re.search(r"^[^0-9:/.,$]+$", non_number)]
print(len(nonword_nonnumber_tokens))

In [None]:
# What kinds of tokens remain?
nonword_nonnumber_tokens

Because of how we set up our filters, we're still catching words in this net, but they aren't English words. We also see how many of the non-word, non-number tokens are emoji. Some combination of dropping non-English words and emoji would be helpful. Here's where Textacy can help, especially with its URL processing.
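As a minimal sketch of why those two filters let non-English words and emoji through, here they are applied to a handful of made-up tokens. The token list is an illustrative assumption, not something drawn from the dataset.

```python
import re

# Hypothetical tokens for illustration only; not taken from the Facebook data.
tokens = ["vaccine", "2021", "$5.00", "12:30", "😂", "вакцина", "!!!", "covid19"]

# Filter 1: keep tokens that contain no ASCII letters at all.
nonword = [t for t in tokens if re.search(r"^[^a-zA-Z]+$", t)]

# Filter 2: of those, keep tokens with no digits or number-like punctuation.
nonword_nonnumber = [t for t in nonword if re.search(r"^[^0-9:/.,$]+$", t)]

print(nonword)
print(nonword_nonnumber)
```

The Cyrillic word survives both filters because the first pattern only tests for the ASCII letters a-z and A-Z, and the emoji survives because it is neither a letter nor a digit.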

In [None]:
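def process_emoji(emoji):
  # Turn an emoji into a word-like token built from its Unicode name,
  # e.g. "face_with_tears_of_joy_emoji ". Unnamed characters become "unknown".
  text = unicodedata.name(emoji, "unknown").replace(" ", "_").lower()
  return text + "_emoji "

def process_url(url):
  # Strip any scheme, take the host, then keep the site name
  # (the second-to-last dotted part, e.g. "facebook" in "www.facebook.com").
  host = re.sub(r"^[a-z]+://", "", url.lower()).split("/")[0]
  parts = host.split(".")
  domain = parts[-2] if len(parts) > 1 else parts[0]
  return domain + "_url "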

preproc = tprep.make_pipeline(
    tprep.normalize.unicode,
    tprep.normalize.whitespace,
    tprep.normalize.quotation_marks,
    tprep.replace.phone_numbers
)

def process_text(text):
  processed_text = preproc(text)
  processed_text = tprep.normalize.repeating_chars(processed_text, chars="!")
  processed_text = tprep.replace.emojis(processed_text, lambda reMatch: process_emoji(reMatch[0]))
  processed_text = tprep.replace.urls(processed_text, lambda reMatch: process_url(reMatch[0]))
  return processed_text
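To make the two replacement helpers concrete, here is a rough standalone sketch of the same string manipulation, using only the standard library. The emoji and the URL are hypothetical inputs chosen for illustration.

```python
import unicodedata

# Emoji: look up the official Unicode name and turn it into one token.
emoji = "😂"
emoji_token = unicodedata.name(emoji).replace(" ", "_").lower() + "_emoji "

# URL: when the URL has a scheme, element 2 of the "/"-split is the host;
# the second-to-last dotted part of the host is the site name.
url = "https://www.facebook.com/groups/example"  # hypothetical URL
host = url.lower().split("/")[2]
url_token = host.split(".")[-2] + "_url "

print(repr(emoji_token))  # 'face_with_tears_of_joy_emoji '
print(repr(url_token))    # 'facebook_url '
```

The trailing space in each token keeps the replacement from fusing with whatever text follows it. Note that this sketch assumes the URL starts with a scheme such as `https://`, and that `unicodedata.name` raises a `ValueError` for a character that has no name unless a default is supplied.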

In [None]:
s["processed"] = s.apply(lambda row: process_text(row["text"]), axis=1)
print(s["processed"].to_markdown())




Let's try to use a more robust tokenizer than the one we used [last time](https://colab.research.google.com/github/vaccine-lang/facebook-data/blob/main/Topic_Modeling_and_Clustering_Facebook_Data_(Week_8).ipynb).

So far, we haven't used any tokenizer; Textacy's preprocessing functions all work with regular expressions and the like. Now we need a real tokenizer, so we'll load spaCy's large English model.

In [None]:
nlp = spacy.load("en_core_web_lg")

In [None]:
s["doc"] = s.apply(lambda row: nlp(row["processed"]), axis=1)
print(s["doc"])
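To get a feel for how spaCy's tokenizer differs from splitting on spaces, here is a small sketch on a made-up sentence. It assumes spaCy is installed and uses `spacy.blank("en")`, a tokenizer-only English pipeline, as a lighter stand-in for `en_core_web_lg`; the English tokenization rules should be the same, but the tagger, parser, and vectors are absent.

```python
import spacy

# A blank English pipeline contains only the tokenizer, so no model download is needed.
# (The notebook itself loads en_core_web_lg; this is a lighter stand-in.)
nlp_blank = spacy.blank("en")

text = "I can't wait!!"  # hypothetical example sentence

# Naive approach from earlier: split on single spaces.
naive_tokens = text.split(" ")

# spaCy approach: the tokenizer separates punctuation and splits contractions.
spacy_tokens = [token.text for token in nlp_blank(text)]

print(naive_tokens)
print(spacy_tokens)
```

The space split leaves punctuation glued to words, while spaCy separates the exclamation marks and splits the contraction into pieces such as `n't`.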

## Vectorizing

Next, let's look at Textacy's [vectorizer](https://textacy.readthedocs.io/en/latest/api_reference/representations.html#vectorizers) and see how we can build our vectors. We used the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) last time, so let's compare our results.

In [None]:
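# Incidentally, scikit-learn's TfidfVectorizer uses this default token pattern:
# r"(?u)\b\w\w+\b"
# Recent scikit-learn versions expect stop_words to be a list, so convert spaCy's set.
tfidf_text = TfidfVectorizer(stop_words=list(stopwords), min_df=5, max_df=0.7)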
tfidf_text_vectors = tfidf_text.fit_transform(df["text"])
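Because the cell above vectorizes the raw text column, it is worth seeing what that default token pattern does to raw versus preprocessed text. The following is a rough sketch using only `re`; the two example strings are hypothetical, and the second one imitates the kind of output the preprocessing step earlier in this notebook produces.

```python
import re

# The default token_pattern documented for scikit-learn's text vectorizers.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical post, before and after emoji/URL replacement.
raw = "I can't wait!! 😂 https://www.facebook.com/groups/example"
processed = "I can't wait! face_with_tears_of_joy_emoji facebook_url"

# The vectorizer lowercases by default before tokenizing, so do the same here.
raw_tokens = re.findall(TOKEN_PATTERN, raw.lower())
processed_tokens = re.findall(TOKEN_PATTERN, processed.lower())

print(raw_tokens)
print(processed_tokens)
```

On the raw text the emoji disappears and the URL is broken into fragments such as `https`, `www`, and `com`. On the preprocessed text the underscore-joined replacements survive as single tokens, because `\w` matches underscores. Single-character tokens like `i` and the `t` of `can't` are dropped in both cases.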