# Detecting Arabic-Language Misinformation on Twitter

## 1. Importing Libraries

In [60]:
# basic data processing libraries
import pandas as pd
import numpy as np
import re

In [61]:
# arabic nlp libraries
from camel_tools.tokenizers.word import simple_word_tokenize
from camel_tools.utils.normalize import normalize_alef_maksura_ar
from camel_tools.utils.normalize import normalize_alef_ar
from camel_tools.utils.normalize import normalize_teh_marbuta_ar
from camel_tools.utils.dediac import dediac_ar
from camel_tools.morphology.database import MorphologyDB
from camel_tools.morphology.analyzer import Analyzer
from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.tokenizers.morphological import MorphologicalTokenizer

# topic modelling libraries
import gensim
from gensim import corpora, models
from gensim.models import CoherenceModel
from gsdmm import MovieGroupProcess

## 2. Load Clean Twitter Data

The cleaned dataset contains about 6 million tweets (roughly 650 MB).

The cell below reads the Parquet file with pandas (`pd.read_parquet`), restricted to the columns we need; only the tweet contents are then used for preprocessing.

- `df_full` contains the unlemmatized unique tweets (works with sklearn hopefully)
- `df` contains the lemmatized tweets (works with Gensim)

In [3]:
# read in cleaned, full-text data (only the columns we need)
df_full = pd.read_parquet(
    's3://coiled-datasets/arabic-tweets/unique_tweets_whole.parquet',
    columns=['tweet_text', 'hashtags', 'is_retweet', 'retweet_tweetid'],
)

In [62]:
df_full.head()

Unnamed: 0,tweet_text,hashtags,is_retweet,retweet_tweetid
0,السلام عليكم ورحمة الله وبركاته مرحبا عملاء م...,,True,9.986493e+17
1,للتأجير لبيع النطيطات زحاليق مائيه صابونية مل...,"[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",True,9.996373e+17
2,مظلات وسواتر آفاق الرياض مظلات استراحات مظلات...,"[مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا...",True,9.993939e+17
3,فيديو شاهد مواطن يوثق بالفيديو كميات كبيرة من...,,True,9.983516e+17
4,أستغفر الله العظيم وأتوب إليه,,False,


## 3. Run Arabic Preprocessing

In [63]:
# subset tweet texts only
tweets = df_full['tweet_text']

### Remove Repeating Characters
Using the regex pattern below, we replace any character that appears three or more times in a row with a single instance of that character. This accounts for informal text input such as (the Arabic equivalents of) "yeeeees" or "haaaahaaa".

In [64]:
# define function
def remove_repeating_char(text):
    return re.sub("(.)\\1{2,}", "\\1", text)

In [65]:
# show what preprocessing function does
remove_repeating_char("ههههه")

'ه'
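To see where the pattern draws the line, here is a small self-contained check. The Latin-script sample strings are illustrative and not taken from the dataset. The quantifier `{2,}` requires the captured character to be followed by at least two more copies, so doubled letters are left alone.

```python
import re

def collapse_repeats(text):
    # (.) captures one character; \1{2,} requires two or more further copies,
    # so only runs of three or more identical characters are collapsed
    return re.sub(r"(.)\1{2,}", r"\1", text)

# "\u0647" is the Arabic letter heh, written as an escape to avoid
# right-to-left rendering issues in the source
samples = ["yeeeees", "haaaahaaa", "hello", "\u0647" * 5]
collapsed = [collapse_repeats(s) for s in samples]
print(collapsed)
```

Note that legitimate triple letters, if any occur, would be collapsed as well; that trade-off is accepted here.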

In [66]:
%%time
tweets = tweets.apply(remove_repeating_char)
tweets

CPU times: user 22.2 s, sys: 796 ms, total: 23 s
Wall time: 24.8 s


0           السلام عليكم ورحمة الله وبركاته مرحبا عملاء م...
1           للتأجير لبيع النطيطات زحاليق مائيه صابونية مل...
2           مظلات وسواتر آفاق الرياض مظلات استراحات مظلات...
3           فيديو شاهد مواطن يوثق بالفيديو كميات كبيرة من...
4                             أستغفر الله العظيم وأتوب إليه 
                                 ...                        
6145778          وأنا بقلّب في تركى آل شيخ لقيت التايم لاين 
6145779     اختي جوزها شافها وهي طالعه من مسجد بالعشر الا...
6145780     رمضان كريم الدحيل العين القدس عاصمه فلسطين ال...
6145781     قال رسول الله إنَّ في الجمُعةِ لساعَةٌ لا يوا...
6145782     إنجازات شخصية للأعضاء فقط شنو انجازات المجلس ...
Name: tweet_text, Length: 6145783, dtype: object

### Orthographic Normalization
Let's now move on to normalize spellings to account for inconsistencies across dialects and common spelling 'mistakes'. This will reduce data sparsity.

In [67]:
def ortho_normalize(text):
    text = normalize_alef_maksura_ar(text)
    text = normalize_alef_ar(text)
    text = normalize_teh_marbuta_ar(text)
    return text

`camel-tools` does this by removing particular symbols from particular letters (e.g. the dots from the teh-marbuta and the hamza from the alef). For more details see [the documentation](https://camel-tools.readthedocs.io/en/latest/api/utils/normalize.html).

In [68]:
# show what preprocessing function does
ortho_normalize("أحمر حمرة")

'احمر حمره'
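For intuition, the three normalizers can be approximated with a plain character translation table. This is a sketch of the idea, not the `camel-tools` implementation; the exact set of alef variants covered is an assumption, and `ortho_normalize_sketch` is a hypothetical name.

```python
# Code points are written as escapes to avoid right-to-left rendering issues.
ALEF_VARIANTS = "\u0622\u0623\u0625\u0671"  # alef with madda, hamza above, hamza below, wasla

ORTHO_TABLE = str.maketrans({
    **{ch: "\u0627" for ch in ALEF_VARIANTS},  # alef variants -> bare alef
    "\u0649": "\u064a",                        # alef maksura -> yeh
    "\u0629": "\u0647",                        # teh marbuta -> heh
})

def ortho_normalize_sketch(text):
    # one pass over the string, replacing each mapped character
    return text.translate(ORTHO_TABLE)

# same two words as in the cell above
normalized = ortho_normalize_sketch("\u0623\u062d\u0645\u0631 \u062d\u0645\u0631\u0629")
print(normalized)
```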

In [69]:
%%time
tweets = tweets.apply(ortho_normalize)

CPU times: user 5.94 s, sys: 176 ms, total: 6.12 s
Wall time: 6.37 s


### Dediacritization
Now let's remove the diacritics, again to significantly reduce data sparsity.

*NB: diacritics are, loosely put, the Arabic equivalent of vowels. They are symbols written above or below the main characters that change the pronunciation (and possibly the meaning) of the word. This means that, technically speaking, different words can look the same once the diacritics are removed. However, fluent Arabic speakers can ascertain the correct meaning of a word from context. For example, most Arabic newspapers are written without diacritics.*

We use the dediac_ar function included in the camel_tools library.

In [70]:
# show what preprocessing function does
dediac_ar("حَرَكَات")

'حركات'
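Conceptually, dediacritization is a deletion of combining marks. The sketch below uses a regular expression over an assumed range of code points (U+064B to U+0652 plus U+0670); `dediac_ar` may cover a different set, so treat this as an illustration rather than a replacement.

```python
import re

# Assumed range: fathatan..sukun (U+064B-U+0652) plus superscript alef (U+0670)
DIACRITICS_RE = re.compile("[\u064b-\u0652\u0670]")

def strip_diacritics(text):
    # delete every character that falls in the diacritic range
    return DIACRITICS_RE.sub("", text)

# the diacritized example word from the cell above, written with escapes
word = "\u062d\u064e\u0631\u064e\u0643\u064e\u0627\u062a"
stripped = strip_diacritics(word)
print(stripped, len(word), len(stripped))
```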

In [71]:
%%time
tweets = tweets.apply(dediac_ar)

CPU times: user 4.16 s, sys: 205 ms, total: 4.36 s
Wall time: 4.58 s


We have now done the basic NLP preprocessing. Let's save this intermediate file containing the **clean, unlemmatized tweets**.

In [31]:
tweets.sample(5)

4742556     كم نسبتك بالثانويه تسديد القروض البنكيه الراج...
5075488     الابتعاد عن الاشخاص الذين يتعمدوا في تعكير مز...
674847      مناحل ابو سلطان اقسم بالله عسل سدر بلدي طبيعي...
287904     اللهم اجعل امي لاتشكي هما ولا تتالم وجعا واسعد...
5211478     سقوط جرحي من المدنيين اثر قصف مدفعي لقوات الن...
Name: tweet_text, dtype: object

In [32]:
df_tweets = pd.DataFrame(data=tweets)

In [42]:
# write intermediary file to S3
df_tweets.to_parquet("s3://coiled-datasets/arabic-tweets/tweets_cleaned_unlemmatized.parquet")

We're now set to input these documents into a `Tf-Idf Vectorizer` and then perform LDA Topic Modelling with `scikit-learn`.

**Alternatively** you can continue performing morphological disambiguation (incl. lemmatization) below. This means you'll have to use Gensim instead of Scikit-Learn for LDA as Gensim works better with lemmatized tokens. Scikit-learn performs the tokenization (as well as stopword removal) as part of the `Tf-Idf Vectorizer`.

### Alternative route: Morphological Disambiguation (incl. Lemmatization)
Arabic has a very rich inflectional system. A verb could have up to 5400 inflections (compared to 6 in English and 1 in Chinese). So the trick is knowing...what does a word mean? Especially when stripped of its diacritics?

CAMeL Tools allows us to perform analysis against a morphological database to get all of that word's possible meanings. We can then select one.

In [72]:
# First, we need to load a morphological database.
# Here, we load the default database which is used for analyzing
# Modern Standard Arabic. 
db = MorphologyDB.builtin_db()

analyzer = Analyzer(db)

analyses = analyzer.analyze('سيحاسب')

for analysis in analyses:
    print(analysis, '\n')

{'diac': 'سَيُحاسِب', 'lex': 'حاسَب', 'bw': 'سَ/FUT_PART+يُ/IV3MS+حاسِب/IV', 'gloss': 'will_+_he;it+hold_responsible;get_even_with', 'pos': 'verb', 'prc3': '0', 'prc2': '0', 'prc1': 'sa_fut', 'prc0': '0', 'per': '3', 'asp': 'i', 'vox': 'a', 'mod': 'i', 'stt': 'na', 'cas': 'na', 'enc0': '0', 'rat': 'n', 'source': 'lex', 'form_gen': 'm', 'form_num': 's', 'd3seg': 'سَ+_يُحاسِب', 'caphi': 's_a_y_u_7_aa_s_i_b', 'd1tok': 'سَيُحاسِب', 'd2tok': 'سَ+_يُحاسِب', 'pos_logprob': -1.023208, 'd3tok': 'سَ+_يُحاسِب', 'd2seg': 'سَ+_يُحاسِب', 'pos_lex_logprob': -5.099521, 'num': 's', 'ud': 'AUX+VERB', 'gen': 'm', 'catib6': 'PRT+VRB', 'root': 'ح.س.ب', 'bwtok': 'سَ+_يُ+_حاسِب', 'pattern': 'سَيُ1ا2ِ3', 'lex_logprob': -5.099521, 'atbtok': 'سَ+_يُحاسِب', 'atbseg': 'سَ+_يُحاسِب', 'd1seg': 'سَيُحاسِب', 'stem': 'حاسِب', 'stemgloss': 'hold_responsible;get_even_with', 'stemcat': 'IV_yu'} 
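One simple way to "select one" is to keep the candidate with the highest `pos_lex_logprob`, a field visible in the output above. The sketch below illustrates that rule on hypothetical stand-in dictionaries; `pick_analysis` is not part of CAMeL Tools, and this notebook relies on the MLE disambiguator further down rather than on this rule.

```python
def pick_analysis(analyses, key="pos_lex_logprob"):
    # return the analysis with the highest log-probability, or None if empty
    if not analyses:
        return None
    return max(analyses, key=lambda a: a.get(key, float("-inf")))

# Hypothetical stand-ins shaped like analyzer output (only two fields kept)
candidates = [
    {"lex": "lemma_a", "pos_lex_logprob": -5.1},
    {"lex": "lemma_b", "pos_lex_logprob": -9.0},
]
best = pick_analysis(candidates)
print(best["lex"])
```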



### Simple Word Tokenize
Before we can perform Morphological Disambiguation (selecting a particular meaning and form of each word from the range of possibilities), we need to split the tweets into word tokens that we can feed into the disambiguation algorithm.

While testing this tool, we discovered that the word يارب was not being tokenized correctly. It is, in fact, two words, but because some tweets include it as one word it was getting processed incorrectly. Therefore, let's first split the instances of يارب and insert a whitespace in between them so that it's tokenized properly.

In [16]:
# define variables with strings to avoid problems with right-to-left order in .replace() call
yarab = 'يارب'
ya_rab = 'يا رب'

In [17]:
def split_yarab(text):
    text = text.replace(yarab, ya_rab)
    return text

In [18]:
%%time
tweets = tweets.apply(split_yarab)

CPU times: user 1.17 s, sys: 41.8 ms, total: 1.21 s
Wall time: 1.21 s


Let's now use the `simple_word_tokenize` function to tokenize our tweets.

In [19]:
%%time
tokens = tweets.apply(simple_word_tokenize)

CPU times: user 18.7 s, sys: 10.7 s, total: 29.4 s
Wall time: 35.9 s


### Removing Stop Words
Using [this GitHub text file](https://github.com/mohataher/arabic-stop-words), we will define our set of Arabic stop words to remove from the tokenized tweet_text column.

In [20]:
# define stopwords
with open('/Users/rpelgrim/Desktop/data/arabic-stopwords.txt', 'r') as file:
    stopwords = file.read()
    stopwords_list = stopwords.split('\n')

In [21]:
def remove_stopwords(tokenized_text):
    tokens_without_sw = [word for word in tokenized_text if word not in stopwords_list]
    return tokens_without_sw

In [22]:
%%time
tokens_nostop = tokens.apply(remove_stopwords)

CPU times: user 8min 40s, sys: 1min 3s, total: 9min 43s
Wall time: 21min 14s
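Much of the wall time above is plausibly spent on `word not in stopwords_list`, which scans a Python list for every token; this has not been profiled. A minimal sketch of the same filter with a `set`, where membership checks take constant time on average, could look like this. The function names and the tiny stop word sample are illustrative.

```python
def load_stopwords(lines):
    # strip whitespace and drop empty entries (e.g. from a trailing newline)
    return {line.strip() for line in lines if line.strip()}

def remove_stopwords_fast(tokens, stopword_set):
    # set membership is O(1) on average; list membership scans the whole list
    return [tok for tok in tokens if tok not in stopword_set]

# illustrative subset standing in for the lines of the stop word file
stopword_set = load_stopwords(["في", "من", "على", ""])
filtered = remove_stopwords_fast(["نجح", "بايدن", "في", "الانتخابات"], stopword_set)
print(filtered)
```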


In [23]:
df_tokens_nostop = pd.DataFrame(data=tokens_nostop)

In [43]:
# write intermediary file to S3
df_tokens_nostop.to_parquet("s3://coiled-datasets/arabic-tweets/tweets_tokenized_nostopwords.parquet")

### Morphological Disambiguation
The next and final step is to conduct morphological disambiguation: to reduce the range of possible forms and meanings of the words in our Arabic text (which has been dediacritized and therefore can have multiple meanings) to a single form and meaning.

For this project we will also use this step to directly lemmatize our tokens. There are many different ways to create 'morphological tokens' (using 9 different schemas built into the CAMeL Morphological Disambiguator). But since we will be conducting Topic Modelling on the text, the lemmas will suffice for our purposes.

In [44]:
# instantiate the Maximum Likelihood Disambiguator
mle = MLEDisambiguator.pretrained()

Let's run on a sample sentence to see how it works:

In [45]:
# The disambiguator expects pre-tokenized text
sentence = simple_word_tokenize('نجح بايدن في الانتخابات')

disambig = mle.disambiguate(sentence)

# For each disambiguated word d in disambig, d.analyses is a list of analyses
# sorted from most likely to least likely. Therefore, d.analyses[0] would
# be the most likely analysis for a given word. Below we extract different
# features from the top analysis of each disambiguated word into separate lists.
diacritized = [d.analyses[0].analysis['diac'] for d in disambig]
pos_tags = [d.analyses[0].analysis['pos'] for d in disambig]
lemmas = [d.analyses[0].analysis['lex'] for d in disambig]

# Print the combined feature values extracted above
for triplet in zip(diacritized, pos_tags, lemmas):
    print(triplet)

# print lemmas
print(lemmas)

('نَجَحَ', 'verb', 'نَجَح')
('بايدن', 'noun_prop', 'بايدن')
('فِي', 'prep', 'فِي')
('الاِنْتِخاباتِ', 'noun', 'ٱِنْتِخاب')
['نَجَح', 'بايدن', 'فِي', 'ٱِنْتِخاب']


The above example from the CAMeL documentation works perfectly.

Let's now adapt so that we can get just the lemmas.

**NOTE**: We included the try/except clauses because some list indexing was throwing an 'out of range' error. **The function now returns NaN if it can't lemmatize a token.**

In [46]:
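def get_lemmas(tokenized_text):
    disambig = mle.disambiguate(tokenized_text)
    try:
        # take the lemma ('lex') of the most likely analysis for each token
        lemmas = [d.analyses[0].analysis['lex'] for d in disambig]
        return lemmas
    except IndexError:
        # a token without any analysis makes d.analyses[0] fail
        return np.nan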

In [None]:
%%time
# NOTE: this cell takes a long time to run (>1 hour on 8-core Macbook Pro)
lemmas = tokens_nostop.apply(get_lemmas)
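As written, `get_lemmas` turns the whole tweet into NaN as soon as one token has no analysis. A possible alternative is to fall back to the surface form for that token only. The sketch below uses namedtuple stand-ins for the disambiguator output; the `word` field and the helper name `lemmas_with_fallback` are assumptions for illustration and should be checked against the installed CAMeL Tools version.

```python
from collections import namedtuple

# Stand-ins mirroring the fields used above (d.analyses[i].analysis) plus an
# assumed d.word field holding the original token
ScoredAnalysis = namedtuple("ScoredAnalysis", ["score", "analysis"])
DisambiguatedWord = namedtuple("DisambiguatedWord", ["word", "analyses"])

def lemmas_with_fallback(disambig):
    lemmas = []
    for d in disambig:
        if d.analyses:
            # most likely analysis comes first
            lemmas.append(d.analyses[0].analysis["lex"])
        else:
            # no analysis available: keep the surface form instead of failing
            lemmas.append(d.word)
    return lemmas

fake_output = [
    DisambiguatedWord("word_a", [ScoredAnalysis(1.0, {"lex": "lemma_a"})]),
    DisambiguatedWord("word_b", []),
]
result = lemmas_with_fallback(fake_output)
print(result)
```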

Awesome -- we've now got our lemmatized tokens and are ready to continue on to our Topic Modelling.

## 4. Topic Modelling: LDA with Gensim

In [6]:
# load in cleaned AND lemmatized data
df = pd.read_parquet(
    's3://coiled-datasets/arabic-tweets/arabic_twitter_clean.parquet',
)

In [73]:
df.head()

Unnamed: 0,tweet_text,hashtags,is_retweet,retweet_tweetid,timestamp_first,user_reference_id
0,"[سَلام_1, عَلَى_1, رَحْمَة_1, اللَّه_1, بَرَكَ...",,True,9.986493e+17,2018-05-25 00:15:00,58
1,"[تَأْجِير_1, بَيْع_1, النطيطات_0, زحاليق_0, ما...","[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",True,9.996373e+17,2018-04-17 12:22:00,0
2,"[مِظَلَّة_1, ساتِر_1, أُفُق_1, رِياض_1, مِظَلّ...","[مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا...",True,9.993939e+17,2018-05-25 00:15:00,58
3,"[فِيدْيُو_1, شاهَد_1, مُواطِن_1, وَثِق-ia_1, ف...",,True,9.983516e+17,2018-05-25 13:06:00,1
4,"[ٱِسْتَغْفَر_1, اللَّه_1, عَظِيم_2, تاب-u_1]",,False,,2014-04-12 03:34:00,657


In [75]:
df.tweet_text[0]

array(['سَلام_1', 'عَلَى_1', 'رَحْمَة_1', 'اللَّه_1', 'بَرَكَة_1',
       'مَرْحَباً_1', 'عَمِيل_1', 'مَتْجَر_1', 'وَنّ_1', 'أَيّ_2',
       'كُلّ_1', 'أَنْتُم_1', 'خَيْر_1', 'ٱِعْتَذَر_1', 'لِ_1',
       'تَأَخَّر_1', 'عَوْدَة_1', 'ظِرّ_1'], dtype=object)

In [9]:
# get only tweet content
docs = df.tweet_text

In [10]:
docs

0          [سَلام_1, عَلَى_1, رَحْمَة_1, اللَّه_1, بَرَكَ...
1          [تَأْجِير_1, بَيْع_1, النطيطات_0, زحاليق_0, ما...
2          [مِظَلَّة_1, ساتِر_1, أُفُق_1, رِياض_1, مِظَلّ...
3          [فِيدْيُو_1, شاهَد_1, مُواطِن_1, وَثِق-ia_1, ف...
4               [ٱِسْتَغْفَر_1, اللَّه_1, عَظِيم_2, تاب-u_1]
                                 ...                        
6145778    [أنا_1_0, قلب_3_0, تركي_1_0, ال_1_0, شيخ_2_0, ...
6145779    [أخت_1_0, جوز_2_0, شافه_1_0, طالع_1_0, مسجد_1_...
6145780    [رمضان_1_0, كريم_1_0, الدحيل_0_0, عين_1_0, قدس...
6145781    [رسول_1_0, الله_1_0, جمعة_1_0, ساعة_1_0, وافق_...
6145782    [إنجاز_2_0, شخص_1_0, عضو_1_0, شنو_0_0, إنجاز_2...
Name: tweet_text, Length: 6145783, dtype: object

### Create BOW Dictionary with Gensim

In [76]:
%%time
# create BOW dictionary
dictionary = gensim.corpora.Dictionary(docs)

CPU times: user 59.8 s, sys: 921 ms, total: 1min
Wall time: 1min 3s


In [17]:
# filter extreme cases out of dictionary
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
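The three arguments act on document frequency: `no_below` is an absolute count, `no_above` is a fraction of the corpus, and `keep_n` caps the vocabulary size. The pure-Python sketch below reimplements that logic for illustration, assuming gensim's documented semantics; the helper name and the toy corpus are hypothetical, and tie-breaking may differ from gensim.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below, no_above, keep_n):
    # document frequency: number of documents containing each token
    df = Counter(tok for doc in docs for tok in set(doc))
    max_df = int(no_above * len(docs))
    kept = [tok for tok, n in df.items() if no_below <= n <= max_df]
    # keep the keep_n most frequent survivors (ties broken alphabetically here)
    kept.sort(key=lambda tok: (-df[tok], tok))
    return kept[:keep_n]

toy_docs = [["a", "b"], ["a", "c"], ["a", "b", "d"], ["a", "b"]]
# "a" is in all 4 docs (too common), "c" and "d" in 1 doc each (too rare)
kept = filter_extremes_sketch(toy_docs, no_below=2, no_above=0.75, keep_n=10)
print(kept)
```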

In [18]:
%%time
# map docs to bag of words
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

CPU times: user 36.5 s, sys: 1.13 s, total: 37.6 s
Wall time: 38.2 s


Let's test our Bag-of-Words for good measure.

In [19]:
# inspect
bow_doc_300 = bow_corpus[300]

for i in range(len(bow_doc_300)):
    print("Word {} (\"{}\") appears {} time(s).".format(bow_doc_300[i][0], 
                                                     dictionary[bow_doc_300[i][0]],
                                                     bow_doc_300[i][1]))

Word 175 ("تويتر_0") appears 1 time(s).
Word 253 ("بَرْنامَج_1") appears 2 time(s).
Word 912 ("تَخَلُّص_1") appears 1 time(s).
Word 1113 ("تَنْحِيف_1") appears 1 time(s).
Word 1242 ("حَقِيقِيّ_1") appears 1 time(s).
Word 1243 ("مُسْتَعِير_1") appears 1 time(s).
Word 1244 ("ٱِسْم_1") appears 1 time(s).
Word 1331 ("وَزْن_1") appears 1 time(s).
Word 1348 ("كِيلُو_1") appears 1 time(s).
Word 1676 ("الكورس_0") appears 1 time(s).
Word 1677 ("تَثْبِيت_1") appears 1 time(s).
Word 1680 ("وَرْس_1") appears 1 time(s).
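Each entry printed above is a `(token_id, count)` pair. A minimal sketch of how such a bag-of-words is built, using only the standard library, is shown below; the id assignment (order of first appearance) is an assumption and need not match gensim's ids.

```python
from collections import Counter

def build_vocab(docs):
    # assign ids in order of first appearance (gensim's actual ids may differ)
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def doc2bow_sketch(doc, vocab):
    # count token ids, skipping tokens that are not in the vocabulary
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

toy_docs = [["برنامج", "وزن", "برنامج"], ["وزن", "كيلو"]]
vocab = build_vocab(toy_docs)
bow = doc2bow_sketch(toy_docs[0], vocab)
print(bow)
```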


### Run LDA with Gensim

Experimentation in [a separate notebook](https://github.com/rrpelgrim/portfolio/blob/master/0_FINAL_CAPSTONE_Identifying_Politiical_Misinformation/notebooks/03-rrp-topic-modelling.ipynb) showed that the LDA Model with 15 topics performed the best out of 5 tested options. Below, we provide a summary of our in-depth analysis of the LDA Visualisation Report of this 15-Topic LDA Model.

- LDA Visualisation shows Top 30 words that occur in each Topic
- This more in-depth view of the topics confirms our initial 'First-Glance Analysis':
  - There are 2 clearly political clusters
  - There is 1 cluster mixing political with misc. content
- The 15 clusters are not evenly distributed throughout the clustering space. Instead, there is one cluster on one side, and all 14 other clusters are overlapping on the other (see screenshot). This may be a sign that this clustering is not functioning entirely as it should.

In [12]:
%%time
lda_model_15 = gensim.models.LdaMulticore(bow_corpus, 
                                         num_topics=15, 
                                         id2word=dictionary, 
                                         passes=2, 
                                         workers=7,
                                         random_state=21)

CPU times: user 1min 40s, sys: 5min 22s, total: 7min 2s
Wall time: 7min 50s


In [13]:
# save model
lda_model_15.save("LDA_15.model")

In [14]:
lda_model_15 = models.LdaModel.load("LDA_15.model")

In [20]:
# evaluate model using Topic Coherence score
cm_15 = CoherenceModel(model=lda_model_15, corpus=bow_corpus, texts=docs, coherence='c_v')
coherence_15 = cm_15.get_coherence()  # get coherence value

In [21]:
coherence_15

0.6013252652994153

### A Note on Coherence Models

We use Topic Coherence as an additional, more objective check on the eyeballed assessment above. Because a coherence score can give a (tempting) illusion of objectivity, I found it helpful to balance it with some sobering scepticism from [this Stack Overflow thread](https://stackoverflow.com/questions/54762690/evaluation-of-topic-modeling-how-to-understand-a-coherence-value-c-v-of-0-4):

- 0.3 is bad
- 0.4 is low
- 0.55 is okay
- 0.65 might be as good as it is going to get
- 0.7 is nice
- 0.8 is unlikely and
- 0.9 is probably wrong
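To place the score of 0.6013 on this scale, the rule of thumb can be written as a small lookup. The thresholds are transcribed from the list above; the lookup rule (use the label of the highest threshold that the score reaches) is an assumption, and `describe_cv` is a hypothetical helper.

```python
# (threshold, label) pairs transcribed from the list above
CV_SCALE = [
    (0.3, "bad"),
    (0.4, "low"),
    (0.55, "okay"),
    (0.65, "might be as good as it is going to get"),
    (0.7, "nice"),
    (0.8, "unlikely"),
    (0.9, "probably wrong"),
]

def describe_cv(score):
    # assumed rule: label of the highest threshold that the score reaches
    label = "below the scale"
    for threshold, name in CV_SCALE:
        if score >= threshold:
            label = name
    return label

verdict = describe_cv(0.6013252652994153)
print(verdict)
```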

### Visualize LDA with pyLDAvis

In [22]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

In [23]:
%%time
# prepare visualisation data
vis_data = gensimvis.prepare(lda_model_15, bow_corpus, dictionary)


CPU times: user 4min 6s, sys: 3.1 s, total: 4min 9s
Wall time: 4min 12s


In [24]:
# create filepath to save HTML visualisation
filepath = "/Users/rpelgrim/Desktop/LDA_5.html"

In [25]:
# save visualisation to HTML in repo
pyLDAvis.save_html(vis_data, filepath)

In [26]:
from IPython.display import HTML
HTML(filename='/Users/rpelgrim/Desktop/LDA_5.html')

### Subset Political from LDA Output

In [27]:
# define function to get topics
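def get_topics_LDA_15(row):
    index = int(row.name)
    # keep only topics with probability >= 0.4
    topics = lda_model_15.get_document_topics(bow_corpus[index], minimum_probability=0.4)
    if not topics:
        return np.nan
    # pick the most probable topic; sorting the (id, prob) tuples would rank by topic id
    return max(topics, key=lambda t: t[1])[0]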

In [28]:
tweets = pd.DataFrame(docs)
tweets

Unnamed: 0,tweet_text
0,"[سَلام_1, عَلَى_1, رَحْمَة_1, اللَّه_1, بَرَكَ..."
1,"[تَأْجِير_1, بَيْع_1, النطيطات_0, زحاليق_0, ما..."
2,"[مِظَلَّة_1, ساتِر_1, أُفُق_1, رِياض_1, مِظَلّ..."
3,"[فِيدْيُو_1, شاهَد_1, مُواطِن_1, وَثِق-ia_1, ف..."
4,"[ٱِسْتَغْفَر_1, اللَّه_1, عَظِيم_2, تاب-u_1]"
...,...
6145778,"[أنا_1_0, قلب_3_0, تركي_1_0, ال_1_0, شيخ_2_0, ..."
6145779,"[أخت_1_0, جوز_2_0, شافه_1_0, طالع_1_0, مسجد_1_..."
6145780,"[رمضان_1_0, كريم_1_0, الدحيل_0_0, عين_1_0, قدس..."
6145781,"[رسول_1_0, الله_1_0, جمعة_1_0, ساعة_1_0, وافق_..."


In [29]:
%%time
# assign topic labels 
tweets['topic'] = tweets.apply(get_topics_LDA_15, axis=1)

CPU times: user 6min 30s, sys: 2.48 s, total: 6min 33s
Wall time: 6min 33s


In [30]:
tweets.head()

Unnamed: 0,tweet_text,topic
0,"[سَلام_1, عَلَى_1, رَحْمَة_1, اللَّه_1, بَرَكَ...",3.0
1,"[تَأْجِير_1, بَيْع_1, النطيطات_0, زحاليق_0, ما...",1.0
2,"[مِظَلَّة_1, ساتِر_1, أُفُق_1, رِياض_1, مِظَلّ...",9.0
3,"[فِيدْيُو_1, شاهَد_1, مُواطِن_1, وَثِق-ia_1, ف...",
4,"[ٱِسْتَغْفَر_1, اللَّه_1, عَظِيم_2, تاب-u_1]",3.0
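The missing label in row 3 is most likely the result of the 0.4 threshold: when no topic reaches it, there is nothing to pick. The sketch below shows the selection logic on plain `(topic_id, probability)` pairs, the shape returned by `get_document_topics`; the probabilities are made up for illustration and `dominant_topic` is a hypothetical helper.

```python
def dominant_topic(topic_probs, min_prob=0.4):
    # keep only topics at or above the threshold
    candidates = [(tid, p) for tid, p in topic_probs if p >= min_prob]
    if not candidates:
        return None  # no topic is confident enough
    # compare on probability, not on topic id
    return max(candidates, key=lambda pair: pair[1])[0]

confident = dominant_topic([(3, 0.72), (9, 0.11)])
ambiguous = dominant_topic([(1, 0.25), (5, 0.3), (11, 0.3)])
two_strong = dominant_topic([(2, 0.55), (7, 0.42)])
print(confident, ambiguous, two_strong)
```

The last case is the one to watch: two topics can both pass a 0.4 threshold, and the comparison has to use the probability rather than the id.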


In [31]:
# for each topic, print words occurring in that topic
for idx, topic in lda_model_15.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.035*"ساعَة_1" + 0.023*"سِعْر_1" + 0.019*"سَنَة_1" + 0.015*"طَلَب_1" + 0.014*"واتس_0" + 0.012*"جَدِيد_1" + 0.011*"آب_1" + 0.011*"لَوْن_1" + 0.011*"رَجُل_1" + 0.010*"عَرْض_1"
Topic: 1 
Words: 0.018*"رتويت_0" + 0.014*"تابَع_1" + 0.011*"حِساب_2" + 0.011*"عَمَل_1" + 0.008*"ريتويت_0" + 0.008*"مُتابِع_1" + 0.008*"إِعْلان_1" + 0.008*"تَطْبِيق_1" + 0.007*"أَضاف_1" + 0.006*"تَغْرِيد_1"
Topic: 2 
Words: 0.020*"أَنَّ_1" + 0.018*"حَياة_1" + 0.016*"شَيْء_1" + 0.015*"شَخْص_1" + 0.013*"كان_1" + 0.011*"نَفْس_1" + 0.008*"ناس_1" + 0.008*"آخَر_1" + 0.008*"عَرَف-i_1" + 0.007*"راوَنْد_1"
Topic: 3 
Words: 0.149*"اللَّه_1" + 0.014*"اللّٰهُمَّ_1" + 0.013*"رَبّ_1" + 0.010*"سَلَّم_1" + 0.010*"حَمْد_2" + 0.010*"وَلِي-i_1" + 0.010*"صَلاة_1" + 0.010*"ٱِسْتَغْفَر_1" + 0.009*"لِ_1" + 0.009*"قال-u_1"
Topic: 4 
Words: 0.026*"قَلْب_3" + 0.022*"حُبّ_1" + 0.018*"أَحَبّ_1" + 0.018*"لِ_1" + 0.014*"قال-u_1" + 0.013*"أَنا_1" + 0.011*"لَيّ_1" + 0.011*"أَنْتَ_1" + 0.010*"عَيْن_3" + 0.009*"عَيْن_2"
Topic: 5 
W

Topics 5 and 11 contain political content.

Let's subset those and feed that into GSDMM for further sorting.

In [32]:
# filter tweets with political topics
tweets_pol = tweets[tweets.topic.isin([5,11])]

In [33]:
# get shape
tweets_pol.shape

(371832, 2)

Extracting just the tweets labelled with Topics 5 and 11 yields a dataframe of **just over 370K political tweets**.

### Run GSDMM on LDA Output

Run `python -m pip install git+https://github.com/rwalk/gsdmm` to install [this GSDMM package](https://github.com/rwalk/gsdmm)

In [34]:
# create array of documents
docs_pol = tweets_pol.tweet_text.to_numpy()

In [35]:
%%time
# create BOW dictionary
dictionary_pol = gensim.corpora.Dictionary(docs_pol)

CPU times: user 3.06 s, sys: 84.9 ms, total: 3.15 s
Wall time: 3.15 s


In [36]:
# get vocab length
vocab_length_pol = len(dictionary_pol)

In [37]:
%%time
# map docs to bag of words
bow_corpus_pol = [dictionary_pol.doc2bow(doc) for doc in docs_pol]

CPU times: user 1.73 s, sys: 39.2 ms, total: 1.77 s
Wall time: 1.77 s


In [38]:
# instantiate GSDMM
gsdmm_pol = MovieGroupProcess(K=30, alpha=0.4, beta=1, n_iters=12)

In [39]:
%%time
y_pol = gsdmm_pol.fit(docs_pol, vocab_length_pol)

In stage 0: transferred 357549 clusters with 30 clusters populated
In stage 1: transferred 184582 clusters with 30 clusters populated
In stage 2: transferred 69815 clusters with 26 clusters populated
In stage 3: transferred 54569 clusters with 23 clusters populated
In stage 4: transferred 49953 clusters with 19 clusters populated
In stage 5: transferred 48056 clusters with 18 clusters populated
In stage 6: transferred 48075 clusters with 17 clusters populated
In stage 7: transferred 48176 clusters with 17 clusters populated
In stage 8: transferred 48477 clusters with 17 clusters populated
In stage 9: transferred 48915 clusters with 17 clusters populated
In stage 10: transferred 48409 clusters with 17 clusters populated
In stage 11: transferred 47698 clusters with 17 clusters populated
CPU times: user 35min 31s, sys: 16.7 s, total: 35min 48s
Wall time: 35min 42s


Let's get the top words in each of these topics.

In [40]:
doc_count_pol = np.array(gsdmm_pol.cluster_doc_count)
print('Number of documents per topic :', doc_count_pol)
print('*'*20)

# Topics sorted by the number of documents they are allocated to
top_index_pol = doc_count_pol.argsort()[-21:][::-1]
print('Most important clusters (by number of docs inside):', top_index_pol)

Number of documents per topic : [ 2861   127     0   594 48029     0 77379  3623   314 37276    65 33837
    41     0 40711    13   790     0     0     0 93144     0     0    67
     0     0 32961     0     0     0]
********************
Most important clusters (by number of docs inside): [20  6  4 14  9 11 26  7  0 16  3  8  1 23 10 12 15 24  2 27  5]
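Note that `argsort()[-21:]` returns 21 indices although only 17 clusters are populated, so the last four entries (24, 2, 27, 5) are empty clusters. A small standard-library sketch that ranks only the populated clusters, using the counts printed above:

```python
# cluster_doc_count values copied from the output above
doc_counts = [2861, 127, 0, 594, 48029, 0, 77379, 3623, 314, 37276, 65, 33837,
              41, 0, 40711, 13, 790, 0, 0, 0, 93144, 0, 0, 67,
              0, 0, 32961, 0, 0, 0]

# indices of non-empty clusters, largest first
populated = sorted(
    (i for i, n in enumerate(doc_counts) if n > 0),
    key=lambda i: doc_counts[i],
    reverse=True,
)
total_docs = sum(doc_counts)
print(populated)
print(len(populated), "populated clusters,", total_docs, "documents")
```

The totals agree with the run log (17 clusters populated in the final stages) and with the 371,832 rows of `tweets_pol`.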


In [41]:
# define function to get top words per topic
def top_words(cluster_word_distribution, top_cluster, values):
    for cluster in top_cluster:
        sort_dicts = sorted(cluster_word_distribution[cluster].items(), key=lambda k: k[1], reverse=True)[:values]
        print("\nCluster %s : %s"%(cluster, sort_dicts))

In [42]:
# get top words in topics
top_words(gsdmm_pol.cluster_word_distribution, top_index_pol, 20)


Cluster 20 : [('رَئِيس_1', 18750), ('سَعُودِيّ_1', 15312), ('وَزِير_1', 15017), ('مِصْر_1', 10795), ('عَرَبِيّ_1', 10150), ('مَجْلِس_1', 9807), ('دَوْلَة_1', 8149), ('السيسي_0', 6795), ('خارِجِيّ_1', 5627), ('أَمارَة_1', 5605), ('أَمْرِيكِيّ_1', 5487), ('عاجِل_1', 5377), ('عامّ_1', 5100), ('مِصْرِيّ_1', 4879), ('سِياسِيّ_1', 4769), ('مُحَمَّد_1', 4643), ('حُكُومَة_1', 4566), ('دَوْلِيّ_1', 4433), ('نائِب_1', 4379), ('بَلَد_1', 4341)]

Cluster 6 : [('قَطَر_1', 24540), ('إِرْهاب_1', 18415), ('إِخْوَة_1', 13521), ('تُركِيا_1', 11526), ('إِرْهابِيّ_1', 11403), ('عَرَبِيّ_1', 10961), ('إِيران_1', 10743), ('اردوغان_0', 9574), ('دَوْلَة_1', 8988), ('سَعُودِيّ_1', 8429), ('قَطَرِيّ_1', 6489), ('شَعْب_1', 5692), ('تُرْكِيّ_1', 5494), ('تَمِيم_1', 5348), ('جَماعَة_1', 5155), ('اللَّه_1', 4438), ('لِيبِيا_1', 4396), ('دَعْم_1', 4332), ('مِصْر_1', 4326), ('تَنْظِيم_1', 4326)]

Cluster 4 : [('لِيبِيا_1', 17995), ('لِيبِيّ_1', 12100), ('جَيْش_1', 12012), ('قُوَّة_1', 10612), ('طَرابُلُس_1', 9264), 

### Analysis 

Clusters 20, 6, 4, 14, 9 and 26 are definitely Political Content. 

The others are definitely not:
- commercials for eye-corrections
- social media promotions
- domestic services
- publishing services
- misc. / stopwords
- ...

## Get Topic Labels

In [54]:
def create_topics_dataframe(docs, mgp):
    result = pd.DataFrame(columns=['Lemma-text', 'Topic'])
    for i, text in enumerate(docs):
        result.at[i, 'Lemma-text'] = text
        # choose_best_label returns a (label, probability) tuple; keep the label
        label, _ = mgp.choose_best_label(text)
        result.at[i, 'Topic'] = label
    return result

In [55]:
%%time
df_pol_topics = create_topics_dataframe(docs=docs_pol, mgp=gsdmm_pol)

CPU times: user 53min 28s, sys: 1min 41s, total: 55min 9s
Wall time: 55min 14s


In [56]:
df_pol_topics

Unnamed: 0,Lemma-text,Topic
0,"[عاجِل_1, نَبَأ_1, دِبلُوماسِيّ_2, تَلّ_1, أَب...",11
1,"[عُنْصُر_1, مِيلِيشِيا_1, الحوثي_0, قَتَل-u_1,...",4
2,"[اللوء_0, رُكْن_2, مانِع_2, عُمَر_1, ابالعلاء_...",20
3,"[رَمَضان_1, كَرِيم_1, أَنْتَ_1, ٱِسْتَأْهَل_1,...",7
4,"[جاد-u_1, عَرْض_1, شَهْر_1, الجود_0, أَوَّل_2,...",7
...,...,...
371827,"[رِيف_1, دِمَشْق_1, دارِي_1, ٱِشْتِباك_1, ثائِ...",11
371828,"[مميش_0, إِنْهاء_1, عَمَل_1, التكريك_0, قَناة_...",20
371829,"[قَناة_1, سُوَيْس_1, جَدِيد_1, حُلْم_1, حَقِيق...",6
371830,"[غَرَد_1, صُورَة_1, مُبادَرَة_1, مُؤَسِّس_1, أ...",14


## Subset Only Political

In [58]:
pol_tweets = df_pol_topics[df_pol_topics.Topic.isin([20,6,4,14,9,26])]

In [79]:
pol_tweets.sample(20)

Unnamed: 0,Lemma-text,Topic
305873,"[ياعليك_0, كِذْب_1, سُرْعَة_1, تَلْفِيق_1, خُب...",6
150013,"[قِيام_1, مشيليات_0, سِراج_2, ٱِحْتِجاز_1, مُه...",4
51497,"[مُؤامَرَة_1, بتنكشف_0]",6
64645,"[حَقّ_2, عامِل_2, نِظام_1, سَعُودِيّ_1, ما_1, ...",20
207336,"[طَرابُلُس_1, هَواء_1, تَرْكِيبَة_1, مُدِير_1,...",4
88466,"[صُور_1, مُؤَسِّس_1, خَلِيفَة_2, زايِد_1, ال_1...",9
51168,"[سَيِّء_1, سِياسَة_1, تَنْظِيم_1, حَمْدَيْن_1,...",6
32028,"[مُحَمَّد_1, زايِد_1, بَحَث-a_1, هاتِفِيّ_1, ر...",20
360620,"[خَطَر_1, مَجُوس_1, يَمَن_2, وَضْع_1, خَطَر_1,...",6
9714,"[طَرابُلُس_1, عَمِيد_1, بَلَد_1, تاجوراء_0, حُ...",4


In [77]:
len(pol_tweets)

330439

## And now...?

* Detect misinformation vs not misinformation?
* Assume all political in here is misinformation?
* Use dataset to train misinformation detector and test?

## 4. Tf-Idf Vectorizer with Sklearn

In [19]:
# turn tweets into list of strings
docs = list(tweets)

In [20]:
# define stopwords
with open('/Users/rpelgrim/Desktop/data/arabic-stopwords.txt', 'r') as file:
    stopwords = file.read()
    stopwords_list = stopwords.split('\n')

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [22]:
vectorizer = TfidfVectorizer(stop_words=stopwords_list)

In [23]:
%%time
X = vectorizer.fit_transform(docs)



CPU times: user 51.8 s, sys: 1.06 s, total: 52.9 s
Wall time: 53.2 s


## 5. LDA with Sklearn

In [24]:
from sklearn.decomposition import LatentDirichletAllocation

In [29]:
lda = LatentDirichletAllocation(
    n_components=5,
    random_state=42,
    n_jobs=-1
)

In [26]:
import joblib

In [None]:
%%time
with joblib.parallel_backend("dask"):
    lda.fit(X)

^^ This raises a "module not found scipy.sparse..." error.

- Still not working, even after updating the Coiled software environment to explicitly include scipy and scikit-learn.

In [28]:
%%time
lda.fit(X)

CPU times: user 53min 33s, sys: 15.2 s, total: 53min 48s
Wall time: 53min 46s


LatentDirichletAllocation(n_components=5, random_state=42)

In [30]:
%%time
lda.fit(X)

CPU times: user 1min 47s, sys: 21.4 s, total: 2min 8s
Wall time: 21min 56s


LatentDirichletAllocation(n_components=5, n_jobs=-1, random_state=42)

## Gensim

## SKlearn

### Vectorize

Vectorizing isn't possible at the moment because the cleaned dataframe contains numpy arrays of the lemmas. The `Vectorizers` expect a string per document. 

**TO DO: Try loading in the untokenized, cleaned tweet texts and Vectorizing those directly. NO >> Arabic-specific preprocessing to do. OR find a way to write custom preprocessor and tokenizers.**

To do that I'll probably have to:
- input custom preprocessors/tokenizers.
- input the list of stop words (we have it somewhere)
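One possible way around the "string per document" expectation is to pass a callable as `analyzer`, which makes scikit-learn skip its own preprocessing and tokenization. The sketch below uses two toy documents shaped like the lemmatized tweets; `identity_analyzer` is a hypothetical helper, and stop word removal would have to happen inside it (or beforehand), since `stop_words` is not applied when a callable analyzer is used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def identity_analyzer(tokens):
    # documents are already lists/arrays of lemmas, so hand the tokens
    # back unchanged instead of letting sklearn tokenize a string
    return list(tokens)

# toy pre-tokenized documents standing in for the lemmatized tweets
toy_docs = [["سَلام_1", "اللَّه_1"], ["سَلام_1", "رَبّ_1"]]

vectorizer = TfidfVectorizer(analyzer=identity_analyzer)
X_toy = vectorizer.fit_transform(toy_docs)
print(X_toy.shape)
```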

In [9]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
CountVec = CountVectorizer(ngram_range=(2,2))
Count_data = CountVec.fit_transform(docs)

## Dask-ML

The array of documents `X` is only 47MB. Doesn't make sense to use Dask-ML for this. Instead use `sklearn` tf-idf vectorizer and then train LDA in parallel with Dask backend.

In [10]:
# vectorize contents
from dask_ml.feature_extraction.text import HashingVectorizer
from dask_ml.feature_extraction.text import CountVectorizer

### Hashing Vectorizer

In [11]:
vect = HashingVectorizer(lowercase=False)

In [12]:
X = df_full['tweet_text'].to_dask_array(lengths=True)

In [18]:
X

Unnamed: 0,Array,Chunk
Bytes,46.89 MiB,46.89 MiB
Shape,"(6145783,)","(6145783,)"
Count,3 Tasks,1 Chunks
Type,object,numpy.ndarray


In [17]:
X[1].compute()

' للتأجير لبيع النطيطات زحاليق مائيه صابونية ملاعب صابونيه زحاليق في جدة ألعاب أولاد بنات بالرياض '

In [34]:
docs_vect = vect.fit_transform(docs)

In [35]:
docs_vect.compute_chunk_sizes()

TypeError: cannot use a string pattern on a bytes-like object

In [None]:
docs_local = docs_vect.compute().toarray()

## X. Preprocessing with Dask Bags (not working)

In [35]:
# cast tweet texts into a Dask bag
bag = df_full['tweet_text'].to_bag(index=False)

In [19]:
# get number of items in bag
bag.count().compute()

6145783

In [36]:
t = bag.take(1)

In [37]:
t

(' السلام عليكم ورحمة الله وبركاته مرحبا عملاء متجر ون واي وكل عام وانتم بخير نعتذر لكم عن تاخرنا في العودة بسبب بعض الظر ',)

In [38]:
type(t)

tuple

In [39]:
t[0]

' السلام عليكم ورحمة الله وبركاته مرحبا عملاء متجر ون واي وكل عام وانتم بخير نعتذر لكم عن تاخرنا في العودة بسبب بعض الظر '

In [40]:
# extract value from tuple
def get_tweets(element):
    return element[0]

In [41]:
tweets = bag.map(get_tweets)

In [42]:
tweets.take(1)

(' ',)

In [27]:
t[0]

' السلام عليكم ورحمة الله وبركاته مرحبا عملاء متجر ون واي وكل عام وانتم بخير نعتذر لكم عن تاخرنا في العودة بسبب بعض الظر '

In [25]:
type(t)

tuple

I think there's an issue with how the values are cast into the Bag. Seems like they're being cast as tuples when I actually just want the value. Is that what's tripping up the `bag.apply` and killing workers?
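A possible explanation, sketched without Dask: `take(n)` returns a tuple of the first `n` elements, but the elements stored in the bag are still plain strings. `bag.map(get_tweets)` would then apply `element[0]` to each string and return its first character, which would explain the `(' ',)` output above, since the tweets start with a space. The tuple below is a stand-in for the result of `bag.take(1)`.

```python
def get_tweets(element):
    return element[0]

tweet = " السلام عليكم ورحمة الله"  # note the leading space, as in the data

taken = (tweet,)       # stand-in for the 1-tuple returned by bag.take(1)
first_item = taken[0]  # indexing the tuple gives the full string back

# a per-element map receives the string itself, so [0] is its first character
mapped_value = get_tweets(tweet)
print(repr(mapped_value))
```

If this reading is right, the bag already holds the values directly and the `get_tweets` step can be dropped.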

### Remove Repeating Characters

In [44]:
# collapse any character that appears three or more times in a row
def remove_repeating_char(text):
    return re.sub("(.)\\1{2,}", "\\1", text)

In [46]:
# apply regex function to contents of Dask bag
# (bag.map avoids relying on a `db` alias for dask.bag, which is never imported here)
bag2 = bag.map(remove_repeating_char)

dask.bag<remove_repeating_char, npartitions=4>

In [47]:
bag2.take(1)

(' السلام عليكم ورحمة الله وبركاته مرحبا عملاء متجر ون واي وكل عام وانتم بخير نعتذر لكم عن تاخرنا في العودة بسبب بعض الظر ',)