### Climate change tweets from data world.
#### Based on code by James at the Coding Club.
#### Modified for Topic Modeling Workshop at Northwestern University, August, 2019.
#### [https://github.com/nuitrcs/topic-modeling-workshop](https://github.com/nuitrcs/topic-modeling-workshop)

In this example we'll look at a series of tweets
discussing climate change issues. Tweets are short texts,
like the ABC News headlines. Unlike the headlines, tweets
contain a variety of special tags which we'll need to
process.

We'll use non-negative matrix factorization (NMF) to extract topics
once we've preprocessed the tweets.

Since tweets are short texts, we might expect NMF to perform 
better than LDA.  You'll be asked to check this by comparing 
the results obtained using Latent Dirichlet Allocation with those 
obtained using NMF.

The input data consists of 6090 tweets plus a column header "tweet".

Here are two sample tweets.  

* RT @our_codingclub: Can @you find #all the #hashtags?
* Not a retweet. All views @my own

In these tweets:

* *RT* indicates a retweet.
* *@something* indicates "something" is a Twitter handle.
* *#hashtag* indicates a hashtag.
 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import re

**Load climate tweets.**

In [None]:
df = pd.read_csv( 'data/climate_tweets.csv' )

# df (dataframe) now contains the text of the tweets, one per row.

**Make a new 'is_retweet' column to highlight retweets.**

In [None]:
df['is_retweet'] = df['tweet'].apply(lambda x: x[:2]=='RT')

**Count the number of retweets.**

In [None]:
df['is_retweet'].sum()  # number of retweets


**Get the number of unique retweets.**

In [None]:
df.loc[df['is_retweet']].tweet.unique().size
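
As a rough illustration of what the two counts above measure, the sketch below repeats them on a four-row toy frame. The `toy` data is made up for this example, and `str.startswith` and `nunique` are used here as assumed equivalents of the slicing and `unique().size` calls in the cells above.

```python
import pandas as pd

# Hypothetical tweets: one retweet appears twice, one once, plus one original tweet.
toy = pd.DataFrame({'tweet': ['RT @a_b: hello',
                              'RT @a_b: hello',
                              'hello',
                              'RT @c_d: bye']})

# Flag rows whose text starts with 'RT'.
toy['is_retweet'] = toy['tweet'].str.startswith('RT')

n_retweets = int(toy['is_retweet'].sum())                  # all retweet rows
n_unique = toy.loc[toy['is_retweet'], 'tweet'].nunique()   # distinct retweet texts

print(n_retweets, n_unique)
```

The gap between the two numbers is the number of rows that repeat an earlier retweet.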

**Find the ten most repeated tweets.**

In [None]:
df.groupby(['tweet']).size().reset_index(name='counts')\
  .sort_values('counts', ascending=False).head(10) 

**Count number of times each tweet appears.**

In [None]:
counts = df.groupby(['tweet']).size()\
           .reset_index(name='counts')\
           .counts

**Define bins for histogram of counts.**

In [None]:
my_bins = np.arange(0,counts.max()+2, 1)-0.5
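
The half-unit shift is what centers each bin on an integer count. A small sketch with made-up values (`toy_counts` is an assumption for illustration) shows the effect using `np.histogram`, which bins data the same way `plt.hist` does:

```python
import numpy as np

toy_counts = np.array([1, 1, 2, 4])   # hypothetical copies-per-tweet values

# Edges at -0.5, 0.5, 1.5, ... so that every integer falls in the middle of a bin.
toy_bins = np.arange(0, toy_counts.max() + 2, 1) - 0.5

hist, edges = np.histogram(toy_counts, bins=toy_bins)

print(edges)
print(hist)
```

Each entry of `hist` then counts exactly one integer value: none at 0, two at 1, one at 2, none at 3, one at 4.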

**Generate a histogram of tweet counts.**

In [None]:
plt.figure()
plt.hist(counts, bins = my_bins)
plt.xlabel('copies of each tweet')
plt.ylabel('frequency')
plt.yscale('log', nonpositive='clip')  # 'nonposy' was renamed to 'nonpositive' in matplotlib 3.3
plt.show()

**Define functions to extract twitter handles and hashtags.**

In [None]:
def find_retweeted(tweet):
    '''Extract the twitter handles of retweeted people'''
    return re.findall(r'(?<=RT\s)(@[A-Za-z]+[A-Za-z0-9-_]+)', tweet)

def find_mentioned(tweet):
    '''Extract the twitter handles of people mentioned in the tweet'''
    return re.findall(r'(?<!RT\s)(@[A-Za-z]+[A-Za-z0-9-_]+)', tweet)

def find_hashtags(tweet):
    '''Extract hashtags'''
    return re.findall(r'(#[A-Za-z]+[A-Za-z0-9-_]+)', tweet)
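
A quick way to check the three helpers is to run them on the two sample tweets from the introduction. The sketch below redefines the functions so that it runs on its own; the variable names `sample_1` and `sample_2` are introduced here for illustration only.

```python
import re

def find_retweeted(tweet):
    # Handles that directly follow 'RT '.
    return re.findall(r'(?<=RT\s)(@[A-Za-z]+[A-Za-z0-9-_]+)', tweet)

def find_mentioned(tweet):
    # Handles that are not directly preceded by 'RT '.
    return re.findall(r'(?<!RT\s)(@[A-Za-z]+[A-Za-z0-9-_]+)', tweet)

def find_hashtags(tweet):
    # Words introduced by '#'.
    return re.findall(r'(#[A-Za-z]+[A-Za-z0-9-_]+)', tweet)

sample_1 = 'RT @our_codingclub: Can @you find #all the #hashtags?'
sample_2 = 'Not a retweet. All views @my own'

for sample in (sample_1, sample_2):
    print(find_retweeted(sample), find_mentioned(sample), find_hashtags(sample))
```

Note that each pattern requires at least two characters after the `@` or `#`, so a one-letter handle or hashtag would not be matched.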

**Create new columns for retweeted usernames, mentioned usernames and hashtags.**

In [None]:
df['retweeted'] = df.tweet.apply(find_retweeted)
df['mentioned'] = df.tweet.apply(find_mentioned)
df['hashtags'] = df.tweet.apply(find_hashtags)

**Take the rows from the hashtag columns where there are actually hashtags.**

In [None]:
hashtags_list_df = df.loc[
                       df.hashtags.apply(
                           lambda hashtags_list: hashtags_list !=[]
                       ),['hashtags']].copy()  # copy, so that columns can be added later without a SettingWithCopyWarning

**Create a dataframe where each use of a hashtag gets its own row.**

In [None]:
flattened_hashtags_df = pd.DataFrame(
    [hashtag for hashtags_list in hashtags_list_df.hashtags
    for hashtag in hashtags_list],
    columns=['hashtag'])
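
The same flattening can also be sketched with `Series.explode`, available in pandas 0.25 and later. This is an alternative to the list comprehension above, not the approach the workshop code uses, and the `toy` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical hashtag lists: two tags, none, and one tag.
toy = pd.DataFrame({'hashtags': [['#a', '#b'], [], ['#a']]})

flat = (toy['hashtags']
        .explode()                  # one row per list element; empty lists become NaN
        .dropna()                   # discard tweets that had no hashtags
        .reset_index(drop=True)
        .to_frame(name='hashtag'))

hashtag_counts = flat['hashtag'].value_counts()   # appearances per hashtag

print(flat)
print(hashtag_counts)
```

Because empty lists turn into `NaN` and are dropped, this version does not need the earlier step that filters out tweets without hashtags.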

   
**Calculate number of unique hashtags.**

In [None]:
flattened_hashtags_df['hashtag'].unique().size

**Count number of appearances for each hashtag.**

In [None]:
popular_hashtags = flattened_hashtags_df.groupby('hashtag').size()\
                                        .reset_index(name='counts')\
                                        .sort_values('counts', ascending=False)\
                                        .reset_index(drop=True)

**Number of times each hashtag appears.**

In [None]:
counts = flattened_hashtags_df.groupby(['hashtag']).size()\
                              .reset_index(name='counts')\
                              .counts

**Define bins for histogram of hashtag counts.**

In [None]:
my_bins = np.arange(0,counts.max()+2, 5)-0.5

**Produce histogram of hashtag counts.**

In [None]:
plt.figure()
plt.hist(counts, bins = my_bins)
plt.xlabel('Number of appearances for hashtags')
plt.ylabel('Frequency')
plt.yscale('log', nonpositive='clip')  # 'nonposy' was renamed to 'nonpositive' in matplotlib 3.3
plt.show()

**Get hashtags which appear at least twenty times.**
**We'll consider these "popular" hashtags.**

In [None]:
min_appearance = 20

**Find popular hashtags.**

In [None]:
popular_hashtags_set = set(popular_hashtags[
                           popular_hashtags.counts >= min_appearance
                           ]['hashtag'])

**Create new column with only the popular hashtags.**

In [None]:
hashtags_list_df['popular_hashtags'] = hashtags_list_df.hashtags.apply(
            lambda hashtag_list: [hashtag for hashtag in hashtag_list
                                  if hashtag in popular_hashtags_set])

**Drop rows which do not contain at least one popular hashtag.**

In [None]:
popular_hashtags_list_df = hashtags_list_df.loc[
            hashtags_list_df.popular_hashtags.apply( \
            lambda hashtag_list: hashtag_list !=[])]
 

**Create a new dataframe with the popular hashtags.**

In [None]:
hashtag_vector_df = \
    popular_hashtags_list_df.loc[:, ['popular_hashtags']].copy()

**Create columns to record presence of hashtags.**

In [None]:
for hashtag in popular_hashtags_set:
    hashtag_vector_df['{}'.format(hashtag)] = \
        hashtag_vector_df.popular_hashtags.apply(
        lambda hashtag_list: int(hashtag in hashtag_list))

hashtag_matrix = hashtag_vector_df.drop('popular_hashtags', axis=1)

**Calculate a hashtag correlation matrix.**
**This tells us which hashtags tend to appear together.**

In [None]:
correlations = hashtag_matrix.corr()
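
To see how the indicator columns turn into correlations, here is a minimal sketch on four made-up tweets. The names `toy_lists`, `toy_tags` and `toy_matrix` are assumptions for this example.

```python
import pandas as pd

# Hypothetical hashtag lists: '#a' and '#b' co-occur, '#c' only appears without '#a'.
toy_lists = pd.Series([['#a', '#b'], ['#a', '#b'], ['#c'], ['#a']])
toy_tags = ['#a', '#b', '#c']

# One 0/1 column per hashtag, one row per tweet.
toy_matrix = pd.DataFrame(
    {tag: toy_lists.apply(lambda tags, tag=tag: int(tag in tags))
     for tag in toy_tags})

# Pearson correlation between the indicator columns.
toy_corr = toy_matrix.corr()

print(toy_matrix)
print(toy_corr.round(2))
```

Hashtags that tend to appear in the same tweets get a positive correlation, while hashtags that never share a tweet get a negative one.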

**Plot correlation matrix.**

In [None]:
plt.figure(figsize=(10,10))

sns.heatmap(correlations,
    cmap='RdBu',
    vmin=-1,
    vmax=1,
    square = True,
    cbar_kws={'label':'correlation'})
plt.show()

**Define methods to clean tweets for further processing.**
**We'll use nltk for natural language processing.**

In [None]:
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer

# Create a lemmatizer to fold inflected word forms together.

lemmatizer = WordNetLemmatizer()

nltk.download( 'stopwords', quiet=True )
nltk.download( 'wordnet', quiet=True )

tag_dict = {"J": wordnet.ADJ,
            "N": wordnet.NOUN,
            "V": wordnet.VERB,
            "R": wordnet.ADV
           }

# Get part of speech for a word. 

def get_wordnet_pos( word ):
    tag = nltk.pos_tag( [word] )[0][1][0].upper()
    return tag_dict.get( tag , wordnet.NOUN )

# Remove web links from a tweet.

def remove_links(tweet):
    '''Takes a string and removes web links from it'''
    tweet = re.sub(r'http\S+', '', tweet) # remove http links
    tweet = re.sub(r'bit\.ly/\S+', '', tweet) # remove bitly links
    tweet = tweet.replace('[link]', '') # remove literal '[link]' placeholders
    return tweet

# Remove retweet and @user information from a tweet.

def remove_users(tweet):
    '''Takes a string and removes retweet and @user information'''
    
    # remove retweet
    
    tweet = re.sub(r'(RT\s@[A-Za-z]+[A-Za-z0-9-_]+)', '', tweet)
    
    # remove tweeted at

    tweet = re.sub(r'(@[A-Za-z]+[A-Za-z0-9-_]+)', '', tweet)
    return tweet

# Get the default English stopwords list from nltk.

my_stopwords = nltk.corpus.stopwords.words('english')

# Get a list of punctuation to clean from tweet.

my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'

**Clean a tweet.**

The resulting cleaned version of a tweet is placed in the column "clean_tweet" in the dataframe.


In [None]:
def clean_tweet(tweet, bigrams=False):
    tweet = remove_users(tweet)  #remove users (handles)
    tweet = remove_links(tweet)  # remove web links
    tweet = tweet.lower()        # convert tweet to lower case
    tweet = re.sub('['+my_punctuation + ']+', ' ', tweet) # strip punctuation
    tweet = re.sub(r'\s+', ' ', tweet) # collapse runs of whitespace
    tweet = re.sub('([0-9]+)', '', tweet) # remove numbers

    # tokenize the tweet.
    
    tweet_token_list = [word for word in tweet.strip().split(' ')
                            if len(word) > 0 ]

    # lemmatize the words in the tweet.
    
    tweet_token_list = [lemmatizer.lemmatize( word )
        for word in tweet_token_list]

    # remove stop words from lemma list.
    tweet_token_list = [word for word in tweet_token_list
                       if word not in my_stopwords]

    # deal with bigrams if requested.
    
    if bigrams:
        tweet_token_list = tweet_token_list+[tweet_token_list[i]+'_'+tweet_token_list[i+1]
                                            for i in range(len(tweet_token_list)-1)]

    # join the processed words in the tweet back to a string with a blank separating the words.

    tweet = ' '.join(tweet_token_list)

    # return the cleaned tweet.
    
    return tweet

df['clean_tweet'] = df.tweet.apply(clean_tweet)
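
To make the effect of the regex steps visible, the sketch below applies only those steps to one sample string. It deliberately skips lemmatization and stop word removal so that it runs without any NLTK data, which means its output is not identical to what `clean_tweet` returns. The function name `strip_noise`, the sample text and the use of `re.escape` are assumptions for this illustration.

```python
import re

PUNCTUATION = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'

def strip_noise(tweet):
    '''Regex-only part of the cleaning: handles, links, case, punctuation, numbers.'''
    tweet = re.sub(r'RT\s@[A-Za-z]+[A-Za-z0-9-_]+', '', tweet)   # retweet marker and handle
    tweet = re.sub(r'@[A-Za-z]+[A-Za-z0-9-_]+', '', tweet)       # remaining handles
    tweet = re.sub(r'http\S+', '', tweet)                        # web links
    tweet = tweet.lower()
    tweet = re.sub('[' + re.escape(PUNCTUATION) + ']+', ' ', tweet)  # punctuation to spaces
    tweet = re.sub(r'[0-9]+', '', tweet)                         # numbers
    return ' '.join(tweet.split())                               # collapse whitespace

sample = 'RT @our_codingclub: Can @you find #all the #hashtags? http://example.com 2019'
cleaned = strip_noise(sample)
print(cleaned)
```

The `#` character is not in the punctuation list, so hashtags survive this stage as tokens of their own.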

**Use scikit-learn to perform topic modeling on the cleaned tweets.**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

**Create a vectorizer object to transform the text to vector form.**
**Ignore words that occur in fewer than 25 documents (min_df=25) or in more than**
**90% of the documents (max_df=0.9).**

**The token pattern indicates what a token looks like.**  

In [None]:
vectorizer = CountVectorizer(max_df=0.9, min_df=25, 
                             token_pattern=r'\w+|\$[\d\.]+|\S+')
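
The two thresholds are easier to follow on a tiny corpus. The sketch below uses four made-up documents and much smaller thresholds than the workshop code (`min_df=2`, `max_df=0.7`), with scikit-learn's default token pattern. All names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: 'climate' is in 3 of 4 documents, 'change' in 2, the rest in 1.
toy_docs = ['climate change',
            'climate action',
            'climate policy change',
            'weather']

# Keep terms found in at least 2 documents and in at most 70% of them.
toy_vectorizer = CountVectorizer(min_df=2, max_df=0.7)
toy_tf = toy_vectorizer.fit_transform(toy_docs)

kept = sorted(toy_vectorizer.vocabulary_)   # terms that survived both filters
print(kept, toy_tf.shape)
```

Here 'climate' is dropped for being too common (3 of 4 documents is above 70%), and the terms that appear in one document are dropped for being too rare, leaving only 'change'.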

**Apply transformation to the cleaned tweets.**

In [None]:
tf = vectorizer.fit_transform(df['clean_tweet']).toarray()

**tf_feature_names tells what word each column in the matrix represents.**

In [None]:
tf_feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

**Load both the Latent Dirichlet Allocation and Non-negative matrix factorization modules.**

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.decomposition import NMF

**We'll extract ten topics.**

In [None]:
number_of_topics = 10

**Extract topics using non-negative matrix factorization.**
**NMF often works better than LDA for short texts like tweets.**
**As usual we initialize the extraction using non-negative double singular value decomposition**
**(init='nndsvd').**

In [None]:
nmfModel = NMF( n_components=number_of_topics, init='nndsvd' )
nmfModel.fit( tf )
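
As a rough picture of what the fit produces: NMF approximates the document-term matrix as a product of two non-negative matrices, one holding document-topic weights and one holding topic-term weights. The sketch below shows the shapes on a made-up 4 x 4 matrix. The names `toy_tf`, `toy_nmf`, `W` and `H` and the `max_iter=500` setting are assumptions for this example.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical 4-document, 4-term count matrix with two obvious word groups.
toy_tf = np.array([[3, 2, 0, 0],
                   [2, 3, 0, 0],
                   [0, 0, 4, 1],
                   [0, 0, 1, 4]], dtype=float)

toy_nmf = NMF(n_components=2, init='nndsvd', max_iter=500)

W = toy_nmf.fit_transform(toy_tf)   # document-topic weights, shape (4, 2)
H = toy_nmf.components_             # topic-term weights, shape (2, 4)

approx = W @ H                      # low-rank approximation of toy_tf

print(W.shape, H.shape)
print(approx.round(1))
```

The rows of `H` play the role of `model.components_` in the workshop code: each row is one topic, and its largest entries point to that topic's top words.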

**Display the top ten words in each topic.**

In [None]:
def display_topics(model, feature_names, no_top_words):
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        topic_dict["Topic %d words" % (topic_idx)]= ['{}'.format(feature_names[i])
                        for i in topic.argsort()[:-no_top_words - 1:-1]]
        topic_dict["Topic %d weights" % (topic_idx)]= ['{:.1f}'.format(topic[i])
                        for i in topic.argsort()[:-no_top_words - 1:-1]]
    return pd.DataFrame(topic_dict)
    
num_top_words = 10

print( "Topics extracted using Non-negative matrix factorization:" )

display_topics( nmfModel, tf_feature_names, num_top_words ) 
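
The slice `topic.argsort()[:-no_top_words - 1:-1]` in `display_topics` is compact but easy to misread. A small sketch with made-up weights and terms (`toy_topic`, `toy_terms` and `n_top` are assumptions for illustration) shows what it selects:

```python
import numpy as np

toy_topic = np.array([0.1, 2.0, 0.5, 1.5])                  # hypothetical term weights
toy_terms = np.array(['ice', 'climate', 'sea', 'change'])   # hypothetical vocabulary
n_top = 2

order = toy_topic.argsort()       # indices sorted by ascending weight
top = order[:-n_top - 1:-1]       # walk backwards from the end, taking n_top indices

top_terms = toy_terms[top].tolist()
print(order.tolist(), top.tolist(), top_terms)
```

The result is the indices of the `n_top` largest weights, listed from largest to smallest.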


# Assignment:  Compare NMF and LDA topics.

**The following code extracts topics using Latent Dirichlet Allocation. Compare the topics extracted by LDA with those extracted with NMF.**
**Which appears to make more sense? Or are they about equally interpretable?**

In [None]:
ldaModel = LatentDirichletAllocation( n_components=number_of_topics,
    random_state=32767 )

ldaModel.fit( tf )

print( "Topics extracted using Latent Dirichlet Allocation:" )

display_topics( ldaModel, tf_feature_names, num_top_words )