# NLP: basic topic modelling for Twitter data
In this notebook, we use the cleaned dataset that we created in the second notebook.

We are going to do some basic Natural Language Processing (NLP): we will try topic modelling on the `extended_tweet_cleaned` column.

## Goal: 
Learn what topic modelling (LDA) is, try applying a topic modelling algorithm to different subsets of the input data, and learn how to visualize topics.

## Introduction to LDA
[LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (Latent Dirichlet Allocation) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar (Wikipedia).


LDA divides documents (tweets) into topics (clusters).  
It assumes that:
 - every topic (e.g. sport) has some representative words (e.g. soccer, play, ball, etc.),
 - every document (tweet) has some representative topics (e.g. sport, summer, etc.).

Topics can overlap: the same word can belong to multiple topics.  
Documents (tweets) can have multiple topics.
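
To make the "generative" assumption more concrete, here is a toy sketch. It is not part of the original notebook and not how gensim is implemented. The two topics, their word weights, and the tweet's topic mix are all invented for illustration. Each word is produced by first picking a topic from the document's mix and then picking a word from that topic.

```python
import random

# Hypothetical topics: each topic is a probability distribution over words.
# Note that "play" belongs to both topics.
topic_words = {
    "sport":  {"soccer": 0.4, "play": 0.3, "ball": 0.3},
    "summer": {"beach": 0.5, "sun": 0.3, "play": 0.2},
}

# Hypothetical tweet: a mixture of topics (70% sport, 30% summer).
doc_topics = {"sport": 0.7, "summer": 0.3}

def generate_document(doc_topics, topic_words, n_words, seed=0):
    """Draw each word by first picking a topic, then a word from that topic."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(doc_topics), weights=list(doc_topics.values()))[0]
        dist = topic_words[topic]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append(word)
    return words

toy_tweet = generate_document(doc_topics, topic_words, n_words=8)
print(toy_tweet)
```

LDA works in the opposite direction: it sees only the words and infers topic mixtures and word distributions that could plausibly have generated them.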

### Topic modelling process:
 - Subset the data
 - Prepare data for topic modelling (tokenize, remove stop words, lemmatize)
 - Apply topic modelling
 - Visualize and explore
 - Repeat if needed

#### Let's load the Python libraries first

Additional libraries that we are going to use in this notebook are:
 - [gensim](https://pypi.org/project/gensim/) -  Python library for topic modelling
 - [nltk](https://www.nltk.org/) - natural language toolkit, library to work with language
 - [pyLDAvis](https://pypi.org/project/pyLDAvis/) - library for interactive topic model visualization.

In [1]:
import urllib.request  # urllib is part of the Python standard library, so no pip install is needed
    
try:
    import pandas as pd
except ImportError:
    !pip install  --user  pandas
    import pandas as pd

try:
    import gensim
    import gensim.corpora as corpora
    from gensim.utils import simple_preprocess
except ImportError:
    !pip install  --user  gensim
    import gensim
    import gensim.corpora as corpora
    from gensim.utils import simple_preprocess
    
try:
    import pyLDAvis.gensim
except ImportError:
    # Note: pyLDAvis 3.x renamed this module to pyLDAvis.gensim_models
    !pip install  --user pyldavis
    import pyLDAvis.gensim

try:
    import nltk
    from nltk.corpus import wordnet as wn
except ImportError:
    !pip install  --user  nltk
    import nltk
    from nltk.corpus import wordnet as wn

#### Download the cleaned dataset from the object store and display the first 5 rows
There is a copy of the cleaned dataset saved to the object store. You can download it or use your local copy, created in the second notebook.  
We read the CSV file into a pandas DataFrame and print the first 5 rows.

In [2]:
target_url="https://swift-yeg.cloud.cybera.ca:8080/v1/AUTH_233e84cd313945c992b4b585f7b9125d/geeky-summit/tweets_cleaned1.csv"
file_name="tweets_cleaned1.csv"
urllib.request.urlretrieve(target_url, file_name) ## comment out this line to use your local copy 

tweets = pd.read_csv(file_name,parse_dates=['created_at_date'])  ## reading 'created_at_date' column as timestamp
pd.set_option('max_colwidth', 20)
tweets.head()

Unnamed: 0,created_at_date,hashtags_string,user_string,user_location,longitude,latitude,name,screen_name,extended_tweet,extended_tweet_cleaned
0,2018-11-02 21:01:56,,Symin16,Toronto ✈ Calgary,,,♠,jessmayumba85,@Symin16 I’d lik...,I’d like to kn...
1,2018-11-02 21:02:01,,TwoCanSamAdams,YYC,,,hannahrae cuddle...,thimblewad,@TwoCanSamAdams ...,Legit. There a...
2,2018-11-02 21:02:05,job Calgary Supp...,,Calgary,51.004583,-114.007914,TMJ - CAL Manuf ...,tmj_cal_manuf,Can you recommen...,Can you recommen...
3,2018-11-02 21:02:10,,,🌎📱,,,Sunny Rai,TheSunsRay,Kids See Ghosts:...,Kids See Ghosts:...
4,2018-11-02 21:02:13,Calgary job,,Calgary,50.997882,-114.074005,TMJ-CAL Retail Jobs,tmj_cal_retail,See our latest #...,See our latest ...


### Step 1: Subsetting the data
To make the LDA algorithm run faster, we will use only a subset of the data instead of the entire dataset.  
Let's try subsetting by day first:

In [3]:
tweets_subset_nov5=tweets.loc[tweets["created_at_date"].dt.day==5]
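
Note that `.dt.day == 5` matches the fifth day of any month and year. The sketch below uses a small made-up DataFrame (the rows are invented; only the column names come from this notebook) to show one possible way to select a single calendar date instead, in case your data spans more than one month.

```python
import pandas as pd

# Made-up timestamps for illustration only.
toy = pd.DataFrame({
    "created_at_date": pd.to_datetime([
        "2018-11-05 09:00:00",
        "2018-11-06 10:30:00",
        "2018-12-05 12:00:00",
    ]),
    "extended_tweet_cleaned": ["first", "second", "third"],
})

# Day-of-month filter: keeps both Nov 5 and Dec 5.
by_day = toy.loc[toy["created_at_date"].dt.day == 5]

# Full-date filter: normalize() sets the time to midnight,
# so only rows from 2018-11-05 are kept.
target = pd.Timestamp("2018-11-05")
by_date = toy.loc[toy["created_at_date"].dt.normalize() == target]

print(len(by_day), len(by_date))
```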

### Step 2: Preparing data for LDA

#### [Tokenization](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html) is the task of chopping a sentence up into pieces called tokens (in our case, we will split tweets into words).  
First, we transform the `extended_tweet_cleaned` column into a list of strings (tweets).  
This is the only column we will be working with, so we don't need the entire dataset.

In [4]:
data=tweets_subset_nov5["extended_tweet_cleaned"].tolist()
print(data[1])  # printing the element at index 1 (the second tweet in the list)

this is the most beautiful picture i have ever seen


Second, we go through every string (tweet) in the list and transform it into another list: a list of words (tokens).  
We are going to use the [gensim.utils.simple_preprocess](https://radimrehurek.com/gensim/utils.html#gensim.utils.simple_preprocess) function, which converts a document into a list of lowercase tokens, ignoring tokens that are too short or too long.

In [5]:
def string_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(sentence,min_len=4, max_len=15))  ## will ignore words shorter than 4 and longer than 15 characters

data_tokens = list(string_to_words(data))

print(data_tokens[1])  # printing the element at index 1 after tokenization

['this', 'most', 'beautiful', 'picture', 'have', 'ever', 'seen']
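
To see roughly what this step does, here is a simplified stand-in written with the standard library only. It is an approximation for illustration, not gensim's actual implementation, and the function name `simple_tokenize` is made up. It lowercases the text, keeps runs of letters, and drops tokens outside the length bounds.

```python
import re

def simple_tokenize(text, min_len=4, max_len=15):
    """Lowercase, keep runs of letters, drop tokens outside the length bounds."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [w for w in words if min_len <= len(w) <= max_len]

toy_tokens = simple_tokenize("this is the most beautiful picture i have ever seen")
print(toy_tokens)
# ['this', 'most', 'beautiful', 'picture', 'have', 'ever', 'seen']
```

Lowering `min_len` to 2 would let short words such as "is" through, which is one of the knobs worth experimenting with later.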


#### Removing [stop words](https://pythonspot.com/nltk-stop-words/): words that add little value and can be excluded from the analysis
The NLTK library has some predefined stop words; we will download and examine them:

In [6]:
nltk.download('stopwords')

stop_words = set(nltk.corpus.stopwords.words('english'))
#stop_words.add('calgary')
print(stop_words)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
{'out', 'or', 'between', "aren't", "didn't", 'be', 're', 'wasn', 'them', 'by', 'it', 'hasn', 'does', 'where', 's', 'my', "she's", 'me', "that'll", 'a', 'couldn', 'mightn', "won't", 'did', "should've", "shan't", 'yours', 'from', 'both', 'have', 'as', 'just', 'who', 'him', 'all', 've', 'll', 'm', 'off', 'why', 'any', 'same', 'and', 'some', 'on', 'myself', 'this', 'again', 'in', "you're", 'won', 'o', 'aren', 'theirs', 'there', 'up', 'nor', 'she', 'ain', "isn't", 't', 'having', 'was', 'into', "don't", 'the', 'if', 'not', 'each', 'has', 'your', 'an', 'before', 'only', 'so', 'been', 'of', 'her', 'are', 'didn', "weren't", 'its', 'is', 'doing', 'for', 'shouldn', 'below', 'own', 'most', 'y', 'hadn', 'that', 'i', 'don', 'needn', 'wouldn', 'doesn', 'will', 'against', 'now', 'when', "you'll", 'herself', 'do', 'ourselves', 'his', 'then', 'down', 'shan', 'bei
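
The commented-out `stop_words.add('calgary')` line hints that you can extend the list with domain-specific words. A minimal sketch of that idea follows, using a tiny hand-picked set instead of NLTK's full list. The extra words and the helper name `remove_stop_words` are hypothetical choices for illustration.

```python
# A tiny hand-picked stop word set; the notebook itself uses NLTK's full English list.
toy_stop_words = {"this", "most", "have"}

# Domain-specific additions (hypothetical choices for a Calgary tweet dataset).
toy_stop_words.update({"calgary", "alberta"})

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop word set."""
    return [t for t in tokens if t not in stop_words]

tokens_in = ["this", "most", "beautiful", "picture", "calgary", "have", "ever", "seen"]
filtered = remove_stop_words(tokens_in, toy_stop_words)
print(filtered)
# ['beautiful', 'picture', 'ever', 'seen']
```

Words that appear in almost every tweet of a dataset tend to show up in every topic, so removing them can make topics easier to tell apart.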

#### [Lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)  - reducing word forms and unifying them to one main form.
For example:
>is $\Rightarrow$ be   
>car, cars $\Rightarrow$ car

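We are going to use the `wn.morphy()` function to get lemmas for words. When `wn.morphy()` cannot find a lemma it returns `None`, so our helper falls back to the original word.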


In [7]:
nltk.download('wordnet')

def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

[nltk_data] Downloading package wordnet to /home/jupyter/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [8]:
print(get_lemma("is"))
print(get_lemma("cars"))

be
car


In [9]:
def tokens_to_lda(data_tokens):
    for data_token in data_tokens:
        tokens = [token for token in data_token if token not in stop_words]
        tokens = [get_lemma(token) for token in tokens]
        yield tokens
tokens = list(tokens_to_lda(data_tokens))
print(tokens[1])  # printing the element at index 1 after removing stop words and lemmatization

['beautiful', 'picture', 'ever', 'see']


#### Creating dictionary and corpus objects for the LDA model
We will create a dictionary and a corpus for the LDA model.  

**[Dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary)**: a mapping between words and their integer ids.  
**Corpus**: for every document, a list of (word_id, word_frequency) pairs.

We will be using the [corpora.Dictionary()](https://radimrehurek.com/gensim/corpora/dictionary.html) class and its
[doc2bow()](https://kite.com/python/docs/gensim.corpora.dictionary.Dictionary.doc2bow) method.

In [10]:
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(text) for text in tokens]

print("First 10 words in the dictionary: ", list(dictionary.token2id)[:10])

print("Tweet at index 1, tokens: ",tokens[1])
print("Tweet at index 1, corpus: ",corpus[1])

First 10 words in the dictionary:  ['beautiful', 'ever', 'picture', 'see', 'advance', 'another', 'assume', 'assumption', 'calgarians', 'days']
Tweet at index 1, tokens:  ['beautiful', 'picture', 'ever', 'see']
Tweet at index 1, corpus:  [(0, 1), (1, 1), (2, 1), (3, 1)]
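
Conceptually, the dictionary and bag-of-words steps can be sketched in plain Python as below. This is a simplified illustration, not gensim's implementation. The helper names and the second toy document are made up, and visiting tokens in sorted order is an assumption chosen to mirror the ids in the notebook output above.

```python
from collections import Counter

def build_dictionary(docs):
    """Map each distinct token to an integer id, in order of first appearance.

    Tokens within a document are visited in sorted order (an assumption made
    to mirror the ids seen in the notebook output).
    """
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc_to_bow(doc, token2id):
    """Turn a token list into sorted (word_id, word_frequency) pairs."""
    counts = Counter(t for t in doc if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

toy_docs = [["beautiful", "picture", "ever", "see"], ["see", "snow", "snow"]]
token2id = build_dictionary(toy_docs)
print(token2id)
print(doc_to_bow(toy_docs[0], token2id))  # [(0, 1), (1, 1), (2, 1), (3, 1)]
print(doc_to_bow(toy_docs[1], token2id))  # [(3, 1), (4, 2)]
```

Word order inside a tweet is lost in this representation; only which words occur, and how often, is passed on to LDA.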


### Step 3: Building an LDA model with 7 topics and displaying the top 8 words for every topic

We are going to use [gensim.models.ldamodel.LdaModel()](https://radimrehurek.com/gensim/models/ldamodel.html) with a predefined number of topics (7) and print the top 8 words in every topic.

In [11]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=7, 
                                           random_state=100)

topics = ldamodel.print_topics(num_words=8)
for topic in topics:
    print(topic)

(0, '0.021*"calgary" + 0.016*"like" + 0.012*"want" + 0.011*"work" + 0.010*"click" + 0.009*"latest" + 0.007*"alberta" + 0.006*"opening"')
(1, '0.011*"anyone" + 0.009*"would" + 0.009*"recommend" + 0.009*"make" + 0.008*"need" + 0.008*"work" + 0.006*"calgary" + 0.006*"check"')
(2, '0.008*"scotty" + 0.007*"alberta" + 0.007*"open" + 0.006*"calgary" + 0.005*"request" + 0.005*"think" + 0.005*"best" + 0.005*"ever"')
(3, '0.012*"request" + 0.011*"close" + 0.008*"people" + 0.008*"good" + 0.007*"love" + 0.006*"make" + 0.006*"snow" + 0.006*"apply"')
(4, '0.010*"great" + 0.007*"time" + 0.007*"happy" + 0.007*"close" + 0.006*"birthday" + 0.006*"year" + 0.006*"amaze" + 0.006*"request"')
(5, '0.011*"know" + 0.010*"great" + 0.009*"interest" + 0.008*"could" + 0.007*"want" + 0.005*"year" + 0.005*"time" + 0.005*"make"')
(6, '0.008*"like" + 0.006*"right" + 0.006*"need" + 0.005*"know" + 0.005*"canadian" + 0.005*"back" + 0.005*"question" + 0.004*"idea"')


### Step 4: Visualizing the model
We are going to use [pyLDAvis.gensim.prepare](https://pyldavis.readthedocs.io/en/latest/modules/API.html#pyLDAvis.prepare).

In [12]:
lda_display = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary,sort_topics=True)
pyLDAvis.display(lda_display)

#### Exercise: try different days, try modifying the stop words, the number of topics, and min_len/max_len in `gensim.utils.simple_preprocess`

##  Conclusion
