# Latent Dirilech Allocation

 Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use. Topic modeling is a type of statistical modeling for discovering the main topics in a collection of documents. The number of topics could be analyzed similar as a number of clusters.

### Data acquisition

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import string
import nltk                                  
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 
import gensim

In [2]:
dataset= pd.read_csv('Tweets.csv', sep=',')
dataset.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [3]:
tweet_df = dataset[dataset['airline_sentiment'] != 'neutral'] #Removing the tweets associated with neutral reviews
tweet_df=tweet_df[['text','airline_sentiment']]
tweet_df.head()

Unnamed: 0,text,airline_sentiment
1,@VirginAmerica plus you've added commercials t...,positive
3,@VirginAmerica it's really aggressive to blast...,negative
4,@VirginAmerica and it's a really big bad thing...,negative
5,@VirginAmerica seriously would pay $30 a fligh...,negative
6,"@VirginAmerica yes, nearly every time I fly VX...",positive


### Preprocessing

In [4]:
tweet = tweet_df.text.to_list()

In [5]:
def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet
    
    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean

In [6]:
process_tweet(tweet[0])

['plu', 'ad', 'commerci', 'experi', '...', 'tacki']

In [7]:
text_data = []
for i in range(len(tweet)):
    text_data.append(process_tweet(tweet[i]))

In [8]:
text_data[0:3]

[['plu', 'ad', 'commerci', 'experi', '...', 'tacki'],
 ['realli',
  'aggress',
  'blast',
  'obnoxi',
  'entertain',
  'guest',
  'face',
  'littl',
  'recours'],
 ['realli', 'big', 'bad', 'thing']]

In [9]:
dictionary = gensim.corpora.Dictionary(text_data)

#### Bag of words -Gensim doc2bow


Filtering out the tokens that appear in less than 15 documents or in more of 0.5 of the documents. Keeping just the more freqient 100.000 tokens.

In [10]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

In [11]:
bow_corpus = [dictionary.doc2bow(doc) for doc in text_data]
bow_corpus[4310]

[(12, 1),
 (125, 1),
 (210, 1),
 (229, 1),
 (240, 1),
 (247, 1),
 (256, 1),
 (258, 1),
 (280, 1),
 (333, 1),
 (341, 2),
 (423, 1),
 (607, 1)]

In [12]:
bow_doc_4310 = bow_corpus[2210]
for i in range(len(bow_doc_4310)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_4310[i][0], 
                                               dictionary[bow_doc_4310[i][0]], 
bow_doc_4310[i][1]))

Word 11 ("fli") appears 1 time.
Word 44 ("think") appears 1 time.
Word 135 ("problem") appears 1 time.
Word 151 ("end") appears 1 time.
Word 159 ("airlin") appears 1 time.
Word 208 ("like") appears 1 time.
Word 476 ("continu") appears 1 time.
Word 533 ("resolut") appears 1 time.
Word 979 ("especi") appears 1 time.
Word 981 ("decid") appears 1 time.


### Topic Modeling

In [13]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=5, id2word=dictionary, passes=15)
for idx, topic in lda_model.print_topics(num_words=4):
    print('Topic: {} Words: {}'.format(idx, topic))

Topic: 0 Words: 0.086*"flight" + 0.034*"cancel" + 0.026*"delay" + 0.021*"plane"
Topic: 1 Words: 0.051*"servic" + 0.050*"custom" + 0.021*"fli" + 0.019*"airlin"
Topic: 2 Words: 0.032*"call" + 0.032*"hour" + 0.030*"hold" + 0.026*"get"
Topic: 3 Words: 0.080*"flight" + 0.017*"late" + 0.014*"cancel" + 0.014*"need"
Topic: 4 Words: 0.074*"thank" + 0.029*"flight" + 0.018*"delay" + 0.017*"great"


#### Checking the topic for one specific document

In [14]:
for index, score in sorted(lda_model[bow_corpus[2210]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.4769130349159241	 
Topic: 0.051*"servic" + 0.050*"custom" + 0.021*"fli" + 0.019*"airlin" + 0.015*"..." + 0.013*"worst" + 0.013*"ever" + 0.013*"never" + 0.012*"love" + 0.012*"guy"

Score: 0.2836836278438568	 
Topic: 0.074*"thank" + 0.029*"flight" + 0.018*"delay" + 0.017*"great" + 0.014*"u" + 0.013*"time" + 0.013*"gate" + 0.012*"help" + 0.011*"still" + 0.010*"make"

Score: 0.20219534635543823	 
Topic: 0.086*"flight" + 0.034*"cancel" + 0.026*"delay" + 0.021*"plane" + 0.019*"flightl" + 0.017*"get" + 0.014*"go" + 0.014*"miss" + 0.013*"hour" + 0.012*"us"

Score: 0.018702760338783264	 
Topic: 0.080*"flight" + 0.017*"late" + 0.014*"cancel" + 0.014*"need" + 0.013*"get" + 0.012*"look" + 0.012*"time" + 0.011*"next" + 0.010*"help" + 0.010*"us"

Score: 0.018505269661545753	 
Topic: 0.032*"call" + 0.032*"hour" + 0.030*"hold" + 0.026*"get" + 0.020*"wait" + 0.020*"tri" + 0.019*"flight" + 0.019*"help" + 0.019*"phone" + 0.016*"can't"


Comparing the scores, this document is highly associated with the first topic.

## References:
* https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d
* https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24
* https://github.com/AprendizajeProfundo/Diplomado/blob/master/Temas/Módulo%208-%20Aprendizaje%20Profundo%20II/1.%20Procesamiento%20de%20Lenguaje%20natural/Cuadernos/nlp_Introduccion.ipynb