# LDA Topic Analysis of Tweets made to Female MPs during the June 2017 General Election

The notebook below adapts the standard LDA example from scikit-learn to analyse tweets. It is tuned specifically for the 140-character format, and produces interesting results!

## Data Ingestion

In [1]:
import pandas as pd
import numpy as np
import re
import pyLDAvis.sklearn
from datetime import datetime as dt

In [2]:
df_all = pd.read_pickle('./tweet_data/aggregated.pkl')

In [3]:
df_abuse = pd.read_pickle('./tweet_data/abusive.pkl')

The rows below were malformed by the API and need to be dropped:

In [4]:
df_all.loc[df_all.Posts.isnull()].head()

Unnamed: 0,GUID,Date (GMT),URL,Contents,Author,Name,Country,State/Region,City/Urban Area,Category,Emotion,Source,Klout Score,Gender,Posts,Followers,Following
2267654,817337076572102656,,http://twitter.com/Solutionprovida/status/8173...,http://twitter.com/Solutionprovida/status/8173...,,,United Kingdom,North West,Liverpool,,,Twitter,51.0,M,,,
2267655,817352877421248512,,http://twitter.com/shaancheema/status/81735287...,http://twitter.com/shaancheema/status/81735287...,,,United Kingdom,Greater London,London,,,Twitter,53.0,,,,
2267656,817452492174790656,,http://twitter.com/AWarwickThomps1/status/8174...,http://twitter.com/AWarwickThomps1/status/8174...,,,United Kingdom,,,,,Twitter,42.0,,,,
2267657,817330843609874432,,http://twitter.com/martytechno1/status/8173308...,http://twitter.com/martytechno1/status/8173308...,,,,,,,,Twitter,43.0,M,,,
2267658,817494566832078848,,http://twitter.com/achairukdpc/status/81749456...,http://twitter.com/achairukdpc/status/81749456...,,,United Kingdom,,,,,Twitter,35.0,F,,,


## Data Processing

We write the following functions to strip artefacts (e.g. retweet flags, hashtags, mentions and URLs) from the body of each tweet, since they add no information:

In [5]:
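def remove_handles(text):
    # Drop @mentions: everything from '@' up to the next whitespace
    return re.sub(r'@\S+', '', text)

def remove_hashtags(text):
    # Drop hashtags: everything from '#' up to the next whitespace
    return re.sub(r'#\S+', '', text)

def remove_RT(text):
    # Drop the retweet flag, but only at the start of the tweet
    return re.sub(r'^RT ', '', text)

def remove_url(text):
    # Drop links: everything from 'http' up to the next whitespace
    return re.sub(r'http\S+', '', text)

def process_text(text):
    # Full clean-up, including hashtags
    return remove_url(remove_RT(remove_hashtags(remove_handles(text)))).strip()

def process_text_ht(text):
    # Same clean-up, but hashtags are kept
    return remove_url(remove_RT(remove_handles(text))).strip()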

Note that two wrapper functions exist: one which removes hashtags (process_text) and one which keeps them (process_text_ht).

We only need to run this for the 'raw' file as the 'abuse' file already has these processed.
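As a quick illustration of what the two wrappers do, the sketch below applies equivalent regular expressions to a made-up retweet. The tweet text and the handle are invented for this example; the functions mirror the ones defined above.

```python
import re

def remove_handles(text):
    return re.sub(r'@\S+', '', text)

def remove_hashtags(text):
    return re.sub(r'#\S+', '', text)

def remove_RT(text):
    return re.sub(r'^RT ', '', text)

def remove_url(text):
    return re.sub(r'http\S+', '', text)

def process_text(text):
    return remove_url(remove_RT(remove_hashtags(remove_handles(text)))).strip()

def process_text_ht(text):
    return remove_url(remove_RT(remove_handles(text))).strip()

# Hypothetical tweet, invented for this example
sample = "RT @some_mp: Great day on the #labourdoorstep https://t.co/abc123"

without_hashtags = process_text(sample)
with_hashtags = process_text_ht(sample)

print(without_hashtags)  # Great day on the
print(with_hashtags)     # Great day on the #labourdoorstep
```

Note that the handle pattern also swallows the trailing colon in '@some_mp:', because it matches everything up to the next whitespace.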

In [6]:
df_all['StrippedHasHashtag'] = df_all['Contents'].map(process_text_ht)

Remove the malformed tweets found above from the 'raw' data:

In [7]:
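# Keep only rows with a date; .copy() avoids a SettingWithCopyWarning when the column is modified later
df_all = df_all[df_all['Date (GMT)'].notnull()].copy()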

Convert string to datetime for the 'raw' data:

In [8]:
df_all['Date (GMT)'] = pd.to_datetime(df_all['Date (GMT)'], format='%d/%m/%Y %H:%M')

In [9]:
# 18 April 2017, written in ISO format so that day and month cannot be confused
ge_date = pd.to_datetime('2017-04-18')

Now extract tweets which were made after the general election was announced:

In [10]:
df_all_ge = df_all[df_all['Date (GMT)'] > ge_date]
df_abuse_ge = df_abuse[df_abuse['date'] > ge_date]
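The drop, parse and filter steps above can be sketched on a miniature frame. This is only an illustration: the rows are invented, and only the 'Date (GMT)' and 'Contents' column names and the date format are taken from the notebook.

```python
import pandas as pd

# Hypothetical miniature of the raw data; one row has no date,
# like the malformed rows shown earlier
toy = pd.DataFrame({
    'Date (GMT)': ['17/04/2017 23:59', None, '18/04/2017 09:30', '08/06/2017 22:00'],
    'Contents': ['before', 'malformed', 'announcement day', 'polling day'],
})

# Drop rows without a date, then parse day-first strings explicitly
toy = toy[toy['Date (GMT)'].notnull()].copy()
toy['Date (GMT)'] = pd.to_datetime(toy['Date (GMT)'], format='%d/%m/%Y %H:%M')

# A date-only timestamp means midnight at the start of that day
ge_date = pd.Timestamp('2017-04-18')
toy_ge = toy[toy['Date (GMT)'] > ge_date]

print(toy_ge['Contents'].tolist())  # ['announcement day', 'polling day']
```

Because the cut-off is midnight at the start of 18 April, the strict comparison keeps every tweet from that day onwards except one stamped at exactly 00:00.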

## Performing LDA

The code below is adapted from the scikit-learn LDA example by Grisel, Buitinck and Yau. It has been modified to work better with tweets and to remove the non-LDA analysis:

In [11]:
###################################
##### LDA Topic Analysis ##########
###################################
######### Philip Ball #############
###################################

# Original Authors: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>

from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# The below works well for our data set. Note that for larger datasets you may be able to push these values up

n_features = 1000
n_topics = 20
n_top_words = 10

# Print the n_top_words highest-weighted words for each topic in the fitted model

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()


# Main function to fit LDA. It also calls the function above and reports timings
    
def get_LDA(input_data):
    # Use tf (raw term count) features for LDA.
    print("Extracting tf features for LDA...")
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df = 2, max_features=n_features,
                                    stop_words='english')
    t0 = time()
    tf = tf_vectorizer.fit_transform(input_data)
    
    print("done in %0.3fs." % (time() - t0))
    # tf is a sparse matrix; .max()/.min() work on it directly without densifying it
    print('Max number of times a word appears in a sentence is %d, min is %d.\n' % (tf.max(), tf.min()))
    print("Fitting LDA models with tf features, "
          "n_samples=%d and n_features=%d..."
          % (len(input_data), n_features))
    
    # The hyperparameters below are tuned to work with Twitter data.
    # Note: 'n_components' was called 'n_topics' before scikit-learn 0.19.
    
    lda = LatentDirichletAllocation(n_components=n_topics, max_iter=50,
                                    learning_method='online',
                                    learning_offset=50.,
                                    random_state=0, doc_topic_prior=0.001)
    t0 = time()
    lda.fit(tf)
    print("done in %0.3fs." % (time() - t0))
    print("\nTopics in LDA model:")
    # On scikit-learn >= 1.0, use get_feature_names_out() instead
    tf_feature_names = tf_vectorizer.get_feature_names()
    print_top_words(lda, tf_feature_names, n_top_words)
    return tf_vectorizer, tf, lda
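To make the moving parts of get_LDA easier to follow, here is a minimal sketch of the same pipeline on a toy corpus. The documents are invented, the topic count is arbitrary, and the sketch assumes a scikit-learn version that accepts n_components; it is not the configuration used for the results in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, invented for this example
docs = [
    "nhs cuts hurt schools and police",
    "police cuts and school cuts hurt",
    "great debate tonight on the bbc",
    "the leaders debate on bbc tonight",
    "nhs funding and schools funding",
    "watching the debate tonight",
]

# Raw term counts: one row per document, one column per word
vectorizer = CountVectorizer(stop_words='english')
tf = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, learning_method='online',
                                random_state=0, doc_topic_prior=0.001)
lda.fit(tf)

# vocabulary_ maps word -> column index; sorting by index recovers column order
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

n_top = 3
for k, topic in enumerate(lda.components_):
    # argsort is ascending, so this reversed slice yields the n_top largest
    # weights, biggest first (same as topic.argsort()[::-1][:n_top])
    top_idx = topic.argsort()[:-n_top - 1:-1]
    print("Topic #%d:" % k, ", ".join(feature_names[i] for i in top_idx))

# Each row is one document's distribution over topics
doc_topics = lda.transform(tf)
print(doc_topics.round(3))
```

Each row of lda.components_ holds one topic's word weights, which is why print_top_words only needs to sort each row and look up the column names.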

In [12]:
data_samples_all = df_all_ge.StrippedHasHashtag.sample(n=10000,random_state=1234)
data_samples_abuse = df_abuse_ge['Clean Contents'].sample(n=10000,random_state=1234)

In [13]:
tf_vec_all, tf_all, lda_all = get_LDA(data_samples_all)

Extracting tf features for LDA...
done in 0.523s.
Max number of times a word appears in a sentence is 5, min is 0.

Fitting LDA models with tf features, n_samples=10000 and n_features=1000...
done in 147.015s.

Topics in LDA model:
Topic #0:
great, ge2017, votesnp, today, support, ge17, thanks, campaigning, votelabour, labourdoorstep
Topic #1:
theresa, think, time, rights, record, workers, weak, trust, shame, debate
Topic #2:
just, need, know, don, voted, did, hope, doesn, labour, does
Topic #3:
cuts, home, education, say, tories, secretary, police, tory, hit, schools
Topic #4:
tory, pm, manifesto, right, voters, day, won, says, labour, hard
Topic #5:
labour, make, britain, better, ll, win, stand, children, forthemany, policies
Topic #6:
good, public, debate, leader, bbcdebate, british, leaders, asking, heard, respects
Topic #7:
bbcqt, want, tax, marr, tories, said, strong, care, brilliant, attack
Topic #8:
nhs, going, women, ukip, 10, tory, deal, brexit, people, given
Topic #9:
year, 

In [14]:
tf_vec_abuse, tf_abuse, lda_abuse = get_LDA(data_samples_abuse)

Extracting tf features for LDA...
done in 0.501s.
Max number of times a word appears in a sentence is 5, min is 0.

Fitting LDA models with tf features, n_samples=10000 and n_features=1000...
done in 161.458s.

Topics in LDA model:
Topic #0:
make, won, real, read, mate, wtf, news, embarrassment, tv, pay
Topic #1:
stupid, fool, hell, bloody, got, arrogant, ignorant, time, woman, run
Topic #2:
vote, fat, pathetic, head, arse, slag, shame, twat, talking, crack
Topic #3:
mouth, sick, disgrace, rape, paid, little, old, tax, clause, working
Topic #4:
corbyn, kill, years, saying, shoot, poor, terrorists, thought, clue, care
Topic #5:
stop, crap, ass, hole, digging, like, taught, world, nan, cock
Topic #6:
shit, diane, ve, abbott, did, racist, just, complete, way, resign
Topic #7:
bullshit, tories, let, disgusting, does, labour, clueless, seriously, better, votelabour
Topic #8:
idiot, think, right, stupid, like, need, labour, getting, woman, boris
Topic #9:
tory, party, labour, life, worst, bi

We now turn to LDAvis, a fantastic visualisation tool for LDA which has been ported into Python. Very little documentation exists for this, so the below was a bit of trial and error.

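Effectively, pyLDAvis measures how far apart the topics are (by default using the Jensen-Shannon divergence between their word distributions), projects those distances onto two principal coordinates, and plots where each LDA topic sits in this new orthogonal coordinate system.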
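For intuition, the projection can be sketched with NumPy alone. This is a rough approximation of the idea (pairwise Jensen-Shannon divergence between topics followed by classical multidimensional scaling), not pyLDAvis's actual code. The random topic-word matrix is a stand-in for a row-normalised lda.components_, and the helper names are made up for this example.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def classical_mds(dist, n_dims=2):
    """Embed points in n_dims dimensions from a pairwise distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]  # largest eigenvalues first
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))

# Stand-in for normalised topic-word distributions: 5 topics over 50 words
rng = np.random.default_rng(0)
topic_word = rng.dirichlet(np.ones(50), size=5)

n_topics = topic_word.shape[0]
dist = np.zeros((n_topics, n_topics))
for i in range(n_topics):
    for j in range(n_topics):
        dist[i, j] = js_divergence(topic_word[i], topic_word[j])

coords = classical_mds(dist)  # one (x, y) pair per topic
print(coords.shape)
```

Topics with similar word distributions end up close together in coords, which is what the overlapping bubbles in the plot indicate.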

In [15]:
####### Prepare for LDAvis ##########

prepare_all = pyLDAvis.sklearn.prepare(lda_model=lda_all, vectorizer=tf_vec_all, dtm=tf_all)
prepare_abuse = pyLDAvis.sklearn.prepare(lda_model=lda_abuse, vectorizer=tf_vec_abuse, dtm=tf_abuse)

We have now prepared our data for pyLDAvis. Let's plot the results for all the tweets and for the abusive tweets:

### All Tweets

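As we can see below, there are many interesting topic clusters. I can identify three groups with relative ease. (Note that, by default, pyLDAvis numbers topics from 1 in order of size, so the topic numbers here do not match the indices printed by the model above.)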

* Topic 1: General 'General Election' chat
* Topics 6, 11, 16, 17: Chat about the TV appearances
* Topics 3, 5, 7, 12, 13: Chat about public services and pro-Labour sentiment

In [16]:
pyLDAvis.display(prepare_all)

### Abuse Tweets

Some of the topics are fairly obvious:

* Topic 16 is abuse aimed at the current PM
* Topic 6 is abuse aimed at the main parties
* Topic 18 is abuse aimed at Corbyn

The most interesting area, however, is Topics 5, 9 and 15. These seem to be abuse aimed specifically at women, which can be seen by highlighting the last word in the top-30 list!

Perhaps in future we can identify abusive tweets aimed at women by filtering for tweets which fall into these clusters!

In [17]:
pyLDAvis.display(prepare_abuse)
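The idea of filtering for tweets that fall into particular topics could be sketched as follows. This is a minimal illustration on an invented corpus: the flagged topic indices are arbitrary placeholders, and assigning each document to its single highest-weighted topic is an assumption, not something the analysis above does. Since pyLDAvis renumbers topics for display, topic numbers read off the plot would first need mapping back to the model's own indices (the prepared object's topic_order field appears to hold this mapping, but check it for your version).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, invented for this example
docs = [
    "nhs cuts hurt schools and police",
    "great debate tonight on the bbc",
    "vote labour for better schools",
    "the leaders debate on bbc tonight",
    "police cuts and school cuts hurt",
    "vote today and support your candidate",
]

vectorizer = CountVectorizer(stop_words='english')
tf = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(tf)

# Placeholder: model indices of the topics we want to flag
flagged_topics = [0, 2]

# Assign each document to its highest-weighted topic
doc_topics = lda.transform(tf)
dominant = doc_topics.argmax(axis=1)

# Keep only the documents whose dominant topic is flagged
flagged = np.isin(dominant, flagged_topics)
flagged_docs = [doc for doc, keep in zip(docs, flagged) if keep]
print(flagged_docs)
```

A threshold on the topic weight (for example, only flagging a tweet when the flagged topics carry most of its weight) might be a more cautious rule than taking the argmax.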