# Building LDA Models

In this notebook, we build the various LDA models used for our preliminary testing in the main **03-rrp-topic-modelling** notebook. We build them here because they are time-intensive (approx. 2hr per model). 

We save the models to disk (and to the GitHub repo) in this notebook and then import them in the main Topic Modelling notebook for evaluation and further use.

# 1. Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import gensim
from gensim import corpora, models
from gensim.models import nmf
from gensim.models import CoherenceModel
import matplotlib.pyplot as plt

In [2]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# 2. Importing Dataset

In [3]:
%%time
df = pd.read_parquet('/Users/richard/Desktop/data_cap3/processed/df_unique_tweets_hashtags_lemmatized_050521.parquet',
                     engine='pyarrow')

CPU times: user 26.4 s, sys: 8.04 s, total: 34.4 s
Wall time: 37.5 s


In [4]:
df.shape

(6145783, 4)

# 3. Pre-Processing for LDA

APPROACH
1. Create array containing documents only
2. Create bag of words
3. Map docs to BOW
4. Create Tf-Idf
5. Run LDA
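Step 4 is listed above but not carried out in this notebook: the models below are trained on raw BOW counts. As a rough illustration of what a Tf-Idf step would do to a BOW corpus, here is a plain-Python sketch on a toy corpus. The function name `tfidf_weight`, the toy vectors, and the weighting choice (`tf * log2(N / df)` followed by L2 normalisation) are assumptions for illustration only. In practice gensim's `models.TfidfModel` would be applied to `bow_corpus` instead.

```python
import math
from collections import Counter


def tfidf_weight(bow_corpus):
    """Re-weight BOW vectors [(token_id, count), ...] by tf * log2(N / df), then L2-normalise."""
    n_docs = len(bow_corpus)
    # document frequency: number of documents each token id appears in
    df = Counter(token_id for doc in bow_corpus for token_id, _ in doc)
    weighted = []
    for doc in bow_corpus:
        vec = [(tid, cnt * math.log2(n_docs / df[tid])) for tid, cnt in doc]
        # tokens that occur in every document get weight 0 and are dropped
        vec = [(tid, w) for tid, w in vec if w > 0]
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted


# toy corpus (hypothetical): token 0 appears in every document
toy_bow = [
    [(0, 1), (1, 2), (2, 1)],
    [(0, 1), (2, 1)],
    [(0, 3), (3, 1)],
]
toy_tfidf = tfidf_weight(toy_bow)
for vec in toy_tfidf:
    print(vec)
```

Tokens that appear in every document drop out entirely, while rare tokens dominate each vector. This is why BOW vs. Tf-Idf input is worth comparing later.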

## 3.1. Create Docs Array

In [5]:
docs = df.tweet_text.to_numpy()

In [6]:
docs.shape

(6145783,)

In [7]:
docs[100]

array(['ٱِنْطَلَق_1', 'مَرْكَز_1', 'ٱِتِّصال_1', 'مُوَحَّد_1',
       'مُسْتَهْلِك_1', 'بَيِّن_1', 'إِنْشاء_1', 'لَجْنَة_1', 'دائِم_1',
       'حِمايَة_1', 'مُسْتَهْلِك_1', 'بِناء_1', 'قَرار_1'], dtype=object)

## 3.2. Create BOW Dictionary

In [8]:
%%time
# create BOW dictionary
dictionary = gensim.corpora.Dictionary(docs)

CPU times: user 2min 33s, sys: 2.92 s, total: 2min 36s
Wall time: 2min 37s


In [9]:
# show first 11 words in dictionary
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 أَنْتُم_1
1 أَيّ_2
2 اللَّه_1
3 بَرَكَة_1
4 تَأَخَّر_1
5 خَيْر_1
6 رَحْمَة_1
7 سَلام_1
8 ظِرّ_1
9 عَلَى_1
10 عَمِيل_1


In [10]:
len(dictionary)

890544

### 3.2.1. Filter Extreme Cases

**NOTE: These cut-offs are only a first guess and still need a lot of experimentation.**

In [11]:
# filter extreme cases out of dictionary
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

In [12]:
len(dictionary)

82457

In [13]:
vocab_length = len(dictionary)
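To make the effect of the three cut-offs easier to reason about, here is a plain-Python sketch of the filtering rule on a toy corpus: keep tokens that occur in at least `no_below` documents and in at most a `no_above` fraction of all documents, then keep only the `keep_n` most frequent survivors. This illustrates the rule and is not gensim's implementation; `filter_vocab` and the toy documents are hypothetical.

```python
from collections import Counter


def filter_vocab(docs, no_below, no_above, keep_n):
    """Return the tokens that survive document-frequency filtering."""
    n_docs = len(docs)
    # count each token at most once per document
    df = Counter(tok for doc in docs for tok in set(doc))
    kept = [t for t, c in df.items() if no_below <= c <= no_above * n_docs]
    # of the survivors, keep the keep_n tokens with the highest document frequency
    kept.sort(key=lambda t: (-df[t], t))
    return kept[:keep_n]


toy_docs = [
    ["watch", "price", "offer"],
    ["watch", "price", "brand"],
    ["watch", "bank", "loan"],
    ["watch", "bank", "price"],
]
vocab = filter_vocab(toy_docs, no_below=2, no_above=0.75, keep_n=2)
print(vocab)  # ['price', 'bank']
```

Here `watch` is removed for being too common (4 of 4 documents), while `offer`, `brand` and `loan` are removed for being too rare (1 document each).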

## 3.3. Map Docs to BOW

In [14]:
%%time
# map docs to bag of words
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

CPU times: user 2min 7s, sys: 2min 5s, total: 4min 12s
Wall time: 4min 48s


In [15]:
# inspect
bow_doc_300 = bow_corpus[300]

for token_id, token_count in bow_doc_300:
    print("Word {} (\"{}\") appears {} time(s).".format(token_id,
                                                     dictionary[token_id],
                                                     token_count))

Word 175 ("تويتر_0") appears 1 time(s).
Word 253 ("بَرْنامَج_1") appears 2 time(s).
Word 912 ("تَخَلُّص_1") appears 1 time(s).
Word 1113 ("تَنْحِيف_1") appears 1 time(s).
Word 1242 ("حَقِيقِيّ_1") appears 1 time(s).
Word 1243 ("مُسْتَعِير_1") appears 1 time(s).
Word 1244 ("ٱِسْم_1") appears 1 time(s).
Word 1331 ("وَزْن_1") appears 1 time(s).
Word 1348 ("كِيلُو_1") appears 1 time(s).
Word 1676 ("الكورس_0") appears 1 time(s).
Word 1677 ("تَثْبِيت_1") appears 1 time(s).
Word 1680 ("وَرْس_1") appears 1 time(s).
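The mapping step amounts to counting, per document, the tokens that are still in the filtered dictionary. A minimal plain-Python sketch of that idea follows; `to_bow` and the small `token2id` mapping are hypothetical stand-ins for `dictionary.doc2bow` and the real dictionary.

```python
from collections import Counter


def to_bow(doc, token2id):
    """Map a tokenised document to a sorted list of (token_id, count) pairs."""
    # tokens missing from the vocabulary are silently skipped
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


token2id = {"bank": 0, "price": 1, "watch": 2}  # hypothetical filtered vocabulary
doc = ["watch", "price", "watch", "rare_token"]
bow = to_bow(doc, token2id)
print(bow)  # [(1, 1), (2, 2)]
```

Because filtered-out tokens are dropped, a short tweet made up only of rare or very common tokens ends up with an empty BOW vector.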


# 4. Running LDA

APPROACH
1. Run LDA using 3 / 5 / 7 / 10 / 15 topics
   - print top 20 words in topics
2. Evaluate models
   - Eyeballing Topic Contents using pyLDAvis 
   - Computing Topic Coherence using function
3. Select best LDA model
4. Refine performance using parameters
   - adjust dictionary filtering cut-offs
   - use different inputs (lemmatized vs. stemmed / bigrams vs. trigrams vs. no-grams) 
   - use BOW vs. Tf-Idf
   - save models for future use
   - adjust LDA parameters: alpha, beta (called `eta` in gensim), random_state

### Interpreting Topic Coherence

[This SO thread](https://stackoverflow.com/questions/54762690/what-is-the-meaning-of-coherence-score-0-4-is-it-good-or-bad) gives some useful background.
- Per the thread, the overall coherence score of a topic is the average of the distances between its words. As a rough rule of thumb:
    - .3 is bad
    - .4 is low
    - .55 is okay
    - .65 might be as good as it is going to get
    - .7 is nice
    - .8 is unlikely and
    - .9 is probably wrong
- Adjust parameters (alpha, beta, random_state) to get better performance.

## 4.1. Running LDA Models

### 4.1.1. LDA with 5 Topics

In [16]:
%%time
lda_model_5 = gensim.models.LdaMulticore(bow_corpus, 
                                         num_topics=5, 
                                         id2word=dictionary, 
                                         passes=4, 
                                         workers=2,
                                         random_state=21)

CPU times: user 38min 40s, sys: 10min 42s, total: 49min 22s
Wall time: 1h 1min 24s


In [17]:
# Save 5-topic model to disk in github repo
lda_model_5.save("/Users/richard/Desktop/springboard_repo/capstones/three/models/LDA_5_below15_above50_top100k.model")

In [18]:
# for each topic, print the words occurring in that topic
for idx, topic in lda_model_5.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.013*"الله_1_0" + 0.007*"رياض_1_0" + 0.006*"ل_1_0" + 0.005*"قلب_3_0" + 0.005*"شركة_1_0" + 0.005*"دعم_1_0" + 0.004*"تنظيف_1_0" + 0.004*"أهلي_1_0" + 0.004*"رب_1_0" + 0.004*"سعودي_1_0"
Topic: 1 
Words: 0.032*"متابع_1_0" + 0.030*"بيع_1_0" + 0.029*"تويتر_0_0" + 0.029*"رتويت_0_0" + 0.024*"زيادة_1_0" + 0.023*"ٱنتصاب_1_0" + 0.020*"منتج_2_0" + 0.020*"قذف_1_0" + 0.016*"علاج_1_0" + 0.016*"تنفيذ_1_0"
Topic: 2 
Words: 0.080*"ساعة_1_0" + 0.029*"ماركة_1_0" + 0.025*"رياض_1_0" + 0.024*"نسائي_1_0" + 0.024*"مظلة_1_0" + 0.021*"كيلو_1_0" + 0.020*"أبى-a_1_0" + 0.018*"خارج_1_0" + 0.017*"طقم_1_0" + 0.017*"علبة_1_0"
Topic: 3 
Words: 0.090*"دون_1_0" + 0.059*"سداد_1_0" + 0.055*"مكيفات_1_0" + 0.050*"غسيل_1_0" + 0.049*"قرض_1_0" + 0.047*"بنك_1_0" + 0.045*"تسديد_1_0" + 0.041*"متعثر_1_0" + 0.039*"قلم_1_0" + 0.035*"قمة_1_0"
Topic: 4 
Words: 0.114*"جديد_1_0" + 0.069*"سكس_0_0" + 0.037*"فيلم_1_0" + 0.030*"ني_1_0" + 0.027*"محرم_1_0" + 0.025*"خادم_1_0" + 0.024*"قديم_1_0" + 0.023*"مترجم_1_0" + 0.023*"دفع_1

- Topic 0: ...
- Topic 1: ...
- Topic 2: ...
- Topic 3: ...
- Topic 4: ...

### 4.1.2. LDA with 3 Topics

In [19]:
%%time
lda_model_3 = gensim.models.LdaMulticore(bow_corpus, 
                                         num_topics=3, 
                                         id2word=dictionary, 
                                         passes=4, 
                                         workers=2,
                                         random_state=21)

CPU times: user 37min 32s, sys: 9min 52s, total: 47min 24s
Wall time: 58min 57s


In [20]:
# Save 3-topic model to disk in github repo
lda_model_3.save("/Users/richard/Desktop/springboard_repo/capstones/three/models/LDA_3_below15_above50_top100k.model")

In [21]:
# for each topic, print the words occurring in that topic
for idx, topic in lda_model_3.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.015*"الله_1_0" + 0.006*"ل_1_0" + 0.006*"قلب_3_0" + 0.005*"دعم_1_0" + 0.005*"أهلي_1_0" + 0.005*"رب_1_0" + 0.004*"سعودي_1_0" + 0.004*"اللهم_1_0" + 0.004*"أن_1_0" + 0.004*"أنا_1_0"
Topic: 1 
Words: 0.042*"رياض_1_0" + 0.036*"تنظيف_1_0" + 0.035*"شركة_1_0" + 0.018*"أثاث_1_0" + 0.015*"مجلس_1_0" + 0.014*"واصل_1_0" + 0.012*"طبيعي_1_0" + 0.011*"نقل_1_0" + 0.011*"مظلة_1_0" + 0.010*"مكيفات_1_0"
Topic: 2 
Words: 0.041*"ساعة_1_0" + 0.028*"سعر_1_0" + 0.021*"جديد_1_0" + 0.017*"عرض_1_0" + 0.016*"ماركة_1_0" + 0.015*"خاص_1_0" + 0.015*"ضمان_1_0" + 0.013*"رجل_1_0" + 0.012*"سنة_1_0" + 0.012*"نسائي_1_0"


### 4.1.3. LDA with 7 Topics

In [22]:
%%time
lda_model_7 = gensim.models.LdaMulticore(bow_corpus, 
                                         num_topics=7, 
                                         id2word=dictionary, 
                                         passes=4, 
                                         workers=2,
                                         random_state=21)

CPU times: user 38min 18s, sys: 11min 29s, total: 49min 47s
Wall time: 59min 11s


In [23]:
# Save 7-topic model to disk in github repo
lda_model_7.save("/Users/richard/Desktop/springboard_repo/capstones/three/models/LDA_7_below15_above50_top100k.model")

In [24]:
# for each topic, print the words occurring in that topic
for idx, topic in lda_model_7.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.063*"شيك_1_0" + 0.048*"قهوة_1_0" + 0.036*"مول_1_0" + 0.032*"لذيذ_1_0" + 0.031*"إجازة_1_0" + 0.029*"ميل_1_0" + 0.025*"شاي_1_0" + 0.024*"ٱستمتع_1_0" + 0.023*"لذة_1_0" + 0.015*"كافي_1_0"
Topic: 1 
Words: 0.057*"توفيرحجوزات_0_0" + 0.050*"اوراوا_0_0" + 0.026*"هواري_1_0" + 0.025*"دندراوي_0_0" + 0.014*"الحبسي_0_0" + 0.014*"هيريرا_0_0" + 0.014*"انييستا_0_0" + 0.014*"تاعبك_0_0" + 0.014*"توحا_0_0" + 0.013*"ربيروف_0_0"
Topic: 2 
Words: 0.383*"الميكرسكوب_0_0" + 0.047*"بشرة_1_0" + 0.033*"نخبة_1_0" + 0.029*"ستار_1_0" + 0.027*"ياباني_1_0" + 0.025*"ٱستعادة_1_0" + 0.017*"عظمي_1_0" + 0.016*"سقطري_0_0" + 0.015*"مارسيل_0_0" + 0.010*"إيجابيات_1_0"
Topic: 3 
Words: 0.101*"كوبلاي_0_0" + 0.031*"هاذا_0_0" + 0.022*"سيرناي_0_0" + 0.021*"رأس-ai_1_0" + 0.015*"شاة_1_0" + 0.015*"بينار_0_0" + 0.014*"لايك_1_0" + 0.013*"كار_1_0" + 0.012*"ماه-w_1_0" + 0.011*"وه_0_0"
Topic: 4 
Words: 0.060*"ايي_0_0" + 0.024*"ربيي_0_0" + 0.022*"تهبلل_0_0" + 0.016*"استرونج_0_0" + 0.014*"والطقعه_0_0" + 0.014*"تهببل_0_0" +

### 4.1.4. LDA with 10 Topics

In [25]:
%%time
lda_model_10 = gensim.models.LdaMulticore(bow_corpus, 
                                          num_topics=10,
                                          id2word=dictionary,
                                          passes=4,
                                          workers=2,
                                          random_state=21)

CPU times: user 39min 40s, sys: 12min 10s, total: 51min 50s
Wall time: 1h 10s


In [26]:
# Save 10-topic model to disk in github repo
lda_model_10.save("/Users/richard/Desktop/springboard_repo/capstones/three/models/LDA_10_below15_above50_top100k.model")

In [27]:
# for each topic, print the words occurring in that topic
for idx, topic in lda_model_10.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.000*"سِعْر_1" + 0.000*"ساعَة_1" + 0.000*"طَلَب_1" + 0.000*"واتس_0" + 0.000*"جَدِيد_1" + 0.000*"سَنَة_1" + 0.000*"رِياض_1" + 0.000*"تَواصُل_1" + 0.000*"بَيْع_1" + 0.000*"تَسابّ_1"
Topic: 1 
Words: 0.000*"هِلال_3" + 0.000*"أَهْلِيّ_1" + 0.000*"رتويت_0" + 0.000*"سَجَّل_1" + 0.000*"نادِي_1" + 0.000*"إِعْلان_1" + 0.000*"حِساب_2" + 0.000*"نَصْر_2" + 0.000*"دَوْر_1" + 0.000*"تابَع_1"
Topic: 2 
Words: 0.013*"الله_1_0" + 0.007*"رياض_1_0" + 0.005*"ل_1_0" + 0.005*"قلب_3_0" + 0.005*"شركة_1_0" + 0.005*"دعم_1_0" + 0.004*"أهلي_1_0" + 0.004*"تنظيف_1_0" + 0.004*"رب_1_0" + 0.004*"سعودي_1_0"
Topic: 3 
Words: 0.208*"ماكا_0_0" + 0.123*"مالتي_0_0" + 0.075*"فياغرا_1_0" + 0.072*"منوي_1_0" + 0.041*"مستخلص_1_0" + 0.035*"جانبيهفعال_0_0" + 0.034*"والارداف_0_0" + 0.026*"أَيّ_2" + 0.017*"إذابة_1_0" + 0.016*"الملتي_0_0"
Topic: 4 
Words: 0.003*"اللَّه_1" + 0.001*"رَبّ_1" + 0.001*"اللّٰهُمَّ_1" + 0.001*"لِ_1" + 0.000*"خَيْر_1" + 0.000*"جَعَل-a_1" + 0.000*"قَلْب_3" + 0.000*"سَعادَة_1" + 0.000*"رَحِم-

Topic 3 and Topic 8 look interesting here.
- thank you
- salman
- mohammed
- support
- Grob
- king
- prince

TOPICS:

- Topic 0: Domestic services
- Topic 1: Stop words
- Topic 2: Service / Used / Buy
- Topic 3: Political: Saudi / Qatar / republic
- Topic 4: Weight loss
- Topic 5: Banks
- Topic 6: Religious / Ramadan
- Topic 7: Watch offers
- Topic 8: Political Saudi
- Topic 9: Religious / love

### 4.1.5. LDA with 15 Topics

In [28]:
%%time
lda_model_15 = gensim.models.LdaMulticore(bow_corpus, 
                                          num_topics=15,
                                          id2word=dictionary,
                                          passes=4,
                                          workers=2,
                                          random_state=21)

CPU times: user 40min 46s, sys: 12min 55s, total: 53min 42s
Wall time: 58min 36s


In [29]:
# Save 15-topic model to disk in github repo
lda_model_15.save("/Users/richard/Desktop/springboard_repo/capstones/three/models/LDA_15_below15_above50_top100k.model")

In [30]:
# for each topic, print the words occurring in that topic
for idx, topic in lda_model_15.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.000*"أثاث_1_0" + 0.000*"أفريقيا_1_0" + 0.000*"سياسي_1_0" + 0.000*"مظلة_1_0" + 0.000*"حزام_1_0" + 0.000*"سكس_0_0" + 0.000*"شراء_1_0" + 0.000*"مستعمل_1_0" + 0.000*"عالمي_1_0" + 0.000*"زد_0_0"
Topic: 1 
Words: 0.000*"سِعْر_1" + 0.000*"واتس_0" + 0.000*"طَلَب_1" + 0.000*"رتويت_0" + 0.000*"تَواصُل_1" + 0.000*"عَمَل_1" + 0.000*"بَيْع_1" + 0.000*"مَوْقِع_1" + 0.000*"إِعْلان_1" + 0.000*"رَجُل_1"
Topic: 2 
Words: 0.000*"طَرِيق_1" + 0.000*"واحِد_1" + 0.000*"ٱِحْتاج_1" + 0.000*"آخَر_1" + 0.000*"قِيمَة_1" + 0.000*"فُرْصَة_1" + 0.000*"ٱِنْتَهَى_1" + 0.000*"نِهايَة_1" + 0.000*"نَظَر_1" + 0.000*"سِنّ_1"
Topic: 3 
Words: 0.004*"اللَّه_1" + 0.001*"قال-u_1" + 0.001*"ناس_1" + 0.001*"رَبّ_1" + 0.001*"اللّٰهُمَّ_1" + 0.000*"لِ_1" + 0.000*"رَحِم-a_1" + 0.000*"جَنَّة_1" + 0.000*"خَيْر_1" + 0.000*"حَمْد_2"
Topic: 4 
Words: 0.001*"لَيّ_1" + 0.001*"حُبّ_1" + 0.000*"قَلْب_3" + 0.000*"لَيْل_1" + 0.000*"أَنا_1" + 0.000*"عَيْن_3" + 0.000*"يِن_1" + 0.000*"بُعْد_1" + 0.000*"غِياب_1" + 0.000*"شَوْق_1