# Analyzing tagged tweets

In [4]:
# Import packages
import pickle
from nltk.stem import PorterStemmer
from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [5]:
# Import data
tweets = pickle.load(open("../data/id_tweets",'rb'))

In [6]:
# Define variables
tweets_lower = []
tweets_stemmed = []
tweets_dict = dict()

## Text cleaning

In [7]:
# Lower tweets and remove stopwords    
for tweet in tweets:
    if "http" in tweet:
        tweet = tweet.split("http")[0]
        tweet = " ".join([x for x in tweet.split() if x != "\n" and x not in  set(stopwords.words('english'))])
        tweets_lower.append(tweet.lower())

In [8]:
# Stem Tweets    
stemmer = PorterStemmer()
for tweet in tweets_lower:
        tweets_stemmed.append([stemmer.stem(x) for x in tweet.split()])    

## TF-IDF vectorization

In [9]:
# Generate a dictionary of terms and frequencies
tweets_dict = corpora.Dictionary(tweets_stemmed)

In [10]:
# Create a bag-of-words corpus
corpus = [tweets_dict.doc2bow(doc) for doc in tweets_stemmed]

In [11]:
# Run TF-IDF model
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

## Topic modeling

In [12]:
# Run the LDA model and generate 10 topics
lda_model = models.LdaModel(corpus, num_topics=5, id2word=tweets_dict)

In [13]:
# Print topics and their associated words
lda_model.print_topics()


[(0,
  '0.022*"walk" + 0.022*"#unitetheright" + 0.016*"white" + 0.016*"long" + 0.016*"supremacist" + 0.014*"look" + 0.013*"except" + 0.012*"man." + 0.012*"square," + 0.012*"face"'),
 (1,
  '0.042*"#unitetheright2" + 0.038*"#shutitdowndc" + 0.020*"#alloutdc" + 0.019*"chant" + 0.019*"we" + 0.018*"white" + 0.018*"freedom" + 0.016*"plaza" + 0.015*"lafayett" + 0.015*"hundr"'),
 (2,
  '0.040*"#shutitdowndc" + 0.037*"#unitetheright2" + 0.027*"#alloutdc" + 0.024*"nazi" + 0.021*"jason" + 0.019*"kessler" + 0.016*"march" + 0.016*"&amp;" + 0.016*"white" + 0.015*"supremacist"'),
 (3,
  '0.064*"#unitetheright2" + 0.020*"foggi" + 0.018*"torch" + 0.016*"what?" + 0.015*"tiki" + 0.014*"metro" + 0.014*"polic" + 0.013*"station" + 0.013*"bottom" + 0.012*"lafayett"'),
 (4,
  '0.058*"#shutitdowndc" + 0.047*"#unitetheright2" + 0.028*"replac" + 0.017*"flag" + 0.017*"protest" + 0.015*"arriv" + 0.014*"unit" + 0.013*"#blacklivesmatt" + 0.012*"hate" + 0.012*"ralli"')]

The output generated by the LDA models package is a list of topics. Each topic consists of words and probabilities of each word being assigned to that topics. In the first topic, also known as the first item in the list, the hashtag #unitetheright2 shows up as having the highest probability of being assigned to that topic. The topics can be characterized as below:

* Topic 1: #unitetheright2, #shutitdowndc, lafayette, white, #alloutdc, freedom, falli, hundr, kessler, jason
* Topic 2: #shutitdowndc, nazi, #alloutdc, walk, #defenddc, guess, #unitetheright, @unitetheright2 look, get
* Topic 3: we, chant, no, white, around, supremacist, #maga, thi, #unittheright, trump
* Topic 4: #unitetheright2, #shutitdowndc, replac, organ, white, nazis, wear, flag, you, protest
* Topic 5: #unitetheright2, polic, foggi, torch, what?, stupid, protest, fuck, metro, the

The topics themselves don't appear to correspond to a particular theme. Each topic appears to show a different aspect of the discussion on the protest. The hashtag #unitetheright2 shows up in almost every topic. Weighting the words differently might produce differerent results, so I use the TF-IDF vectorized corpus in the LDA model below.

In [14]:
# LDA model with 
tfidf_lda_model = models.LdaModel(corpus_tfidf, num_topics=5, id2word=tweets_dict)
lda_model.print_topics()

[(0,
  '0.022*"walk" + 0.022*"#unitetheright" + 0.016*"white" + 0.016*"long" + 0.016*"supremacist" + 0.014*"look" + 0.013*"except" + 0.012*"man." + 0.012*"square," + 0.012*"face"'),
 (1,
  '0.042*"#unitetheright2" + 0.038*"#shutitdowndc" + 0.020*"#alloutdc" + 0.019*"chant" + 0.019*"we" + 0.018*"white" + 0.018*"freedom" + 0.016*"plaza" + 0.015*"lafayett" + 0.015*"hundr"'),
 (2,
  '0.040*"#shutitdowndc" + 0.037*"#unitetheright2" + 0.027*"#alloutdc" + 0.024*"nazi" + 0.021*"jason" + 0.019*"kessler" + 0.016*"march" + 0.016*"&amp;" + 0.016*"white" + 0.015*"supremacist"'),
 (3,
  '0.064*"#unitetheright2" + 0.020*"foggi" + 0.018*"torch" + 0.016*"what?" + 0.015*"tiki" + 0.014*"metro" + 0.014*"polic" + 0.013*"station" + 0.013*"bottom" + 0.012*"lafayett"'),
 (4,
  '0.058*"#shutitdowndc" + 0.047*"#unitetheright2" + 0.028*"replac" + 0.017*"flag" + 0.017*"protest" + 0.015*"arriv" + 0.014*"unit" + 0.013*"#blacklivesmatt" + 0.012*"hate" + 0.012*"ralli"')]

Using a TF-IDF vectorized corpus produces similar topics to the first LDA model. However, I can ascertain a few trends within each topic. With words such as "white", "supermacist", "look", "man", and "square", Topic 1 appears to refer to a particular person or group of people (perhaps Unite the Right 2 organizers). Topic 3 also appears to refer to Topic 4 appears to refer to travel to the rally ("foggi", "bottom", "metro").

LDA generates clusters that capture the latent structure of data assigned to each topic. The components of each topic suggest that the collected tweets exemplify general discussion about the event, discussions that are slightly left-leaning. Hashtags such as #alloutdc and #shutitdowndc signal more explicitly left-leaning responses to the rally. I think the best use of this method is to verify that uncategorized scraped tweets indeed discuss the Unite the Right rally.