# Say What Now? 
## Detecting Arabic-Language Political Misinformation on Twitter

# Introduction to Project & Notebook
This is the first notebook of the project "Say What Now? Identifying Arabic-Language Political Misinformation on Twitter". The project uses a dataset of censored Twitter accounts containing more than 36 million tweets to:
1. **Build a machine-learning classifier** to identify political misinformation tweets
2. **Conduct network analysis** to identify the underlying relationships between the accounts

In addition, this project sets out to tackle 2 technical challenges:
1. **Use distributed computing** to process the larger-than-memory dataset, specifically using the new (Beta) Coiled interface to run AWS clusters
2. **Conduct research** into NLP for low-resource languages such as Arabic, specifically by using semi-supervised learning to circumvent the challenges of scarce labelled datasets.

## About the Dataset
The data was downloaded on February 26, 2021 from [the Information Operations page](https://transparency.twitter.com/en/reports/information-operations.html) of the Twitter Transparency Center.

Notes about the dataset:
- What's Included? "Platform manipulation that we can reliably attribute to a government or state linked actor is considered an information operation and is prohibited by the Twitter Rules."
- "These datasets include profile information, Tweets and media (e.g., images and videos) from accounts we believe are connected to state linked information operations. Tweets and media which were deleted [by the user] are not included in the datasets."
- "For accounts with fewer than 5,000 followers, we have hashed certain identifying fields (such as user ID and screen name) in the publicly-available version of the datasets. While we’ve taken every possible precaution to ensure there are no false positives in these datasets, we’ve hashed these fields to reduce the potential negative impact on authentic or compromised accounts — while still enabling longitudinal research, network analysis, and assessment of the underlying content created by these accounts."


Twitter's [statement](https://twitter.com/TwitterSafety/status/1245682431975460864?s=20) about this particular dataset: 
- "A network of accounts associated with Saudi Arabia and operating out of multiple countries including KSA, Egypt and UAE, were amplifying content praising Saudi leadership, and critical of Qatar and Turkish activity in Yemen. A total of 5,350 accounts were removed."

## Notebook Outline

0. Introduction to Project and Notebook


1. Import Libraries


2. Configure Coiled AWS Clusters


3. Import Data


4. Data Inspection & Basic Data Cleaning
 - Remove faulty and NaN rows


5. Text Cleaning
 - Subset Arabic Tweets
 - Remove URLs
 - Remove emoji
 - Move RT usernames to new column & Remove RT
 - Move hashtags to new column & Remove hashtags
 - Delete Rows with Empty Tweet_Text


6. Subset Unique Tweets
 - Create a dataframe with only unique tweets and a reset index
 - Replace tweets in original dataframe with indices from new (unique tweets) dataframe


7. NLP Pre-Processing of Unique Tweets
 - Dediacritization
 - Tokenization
 - Orthographic Normalisation
 - Morphological Disambiguation
 - Lemmatization

# 1. Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import coiled
import dask
from dask import distributed
from dask.distributed import Client, progress
import dask.dataframe as dd

import emoji
import lxml

from nltk.corpus import stopwords
from camel_tools.tokenizers.word import simple_word_tokenize
from camel_tools.utils.normalize import normalize_alef_maksura_ar
from camel_tools.utils.normalize import normalize_alef_ar
from camel_tools.utils.normalize import normalize_teh_marbuta_ar
from camel_tools.utils.dediac import dediac_ar
from camel_tools.morphology.database import MorphologyDB
from camel_tools.morphology.analyzer import Analyzer
from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.tokenizers.morphological import MorphologicalTokenizer

# 2. Configuring Coiled Clusters
Below, we spin up our Coiled AWS clusters in 2 simple steps:

1. Spinning up our cluster using the configurations set up in notebook **99-rrp-cluster-configurations**
2. Connecting our cluster to a Dask client

In the Cluster Configuration notebook we have:
1. Created a software environment that will be distributed to all the workers on our clusters using a local .yml file
2. Created a cluster configuration containing specs of our scheduler and workers

In [3]:
%%time
# create coiled cluster
cluster = coiled.Cluster(
    name='wrangling-cluster',
    shutdown_on_close=False,
    configuration="cap3-wrangling-s2_8-w4_16",
    n_workers=50,
    scheduler_options={"idle_timeout": "2 hours"},
)

CPU times: user 29 s, sys: 7.68 s, total: 36.7 s
Wall time: 4min 34s


In [4]:
# connect cluster to Dask
client = Client(cluster)
print('Dashboard:', client.dashboard_link)

Dashboard: http://ec2-3-16-114-53.us-east-2.compute.amazonaws.com:8787


# 3. Importing Arabic Twitter Data

Working with AWS clusters requires our data to be stored in an AWS S3 bucket. Below, we read in our data from an S3 bucket, selecting only the columns relevant to our analysis: basic information about the users, the content of the tweets and any information about interactions (retweets, replies) between users.

In [5]:
# read s3 data into dask dataframe
ddf = dd.read_csv(
    "s3://twitter-saudi-us-east-2/sa_eg_ae_022020_tweets_csv_hashed_*.csv",
    blocksize="64MiB",
    usecols=[
        'tweetid',
        'userid',
        'user_screen_name',
        'follower_count', 
        'following_count',
        'tweet_language',
        'tweet_text',
        'tweet_time', 
        'tweet_client_name', 
        'is_retweet',
        'retweet_userid',
        'retweet_tweetid'],
    engine='python',
    error_bad_lines=False,
    na_values='None',
    dtype={
        "tweetid": "object",
        "userid": "object",
        "user_screen_name": "object",
        "follower_count": "object",
        "following_count": "object",
        "tweet_language": "object",
        "tweet_text": "object",
        "tweet_time": "object",
        "tweet_client_name": "object",
        "is_retweet": "object",
        "retweet_userid": "object",
        "retweet_tweetid": "object"
    }
).persist()

Skipping line 410: unexpected end of data
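
The `error_bad_lines` argument used above was deprecated in pandas 1.3 and removed in pandas 2.0 in favor of `on_bad_lines`. Below is a minimal sketch of the newer spelling, using plain pandas on a small in-memory CSV. The sample rows are made up for illustration, and the sketch assumes pandas 1.3 or later.

```python
import io
import pandas as pd

# made-up sample: the third line has too many fields, like a row broken by a stray delimiter
raw = (
    "tweetid,userid,tweet_text\n"
    "1,a,hello\n"
    "2,b,too,many,fields\n"
    "3,c,world\n"
)

# on_bad_lines="skip" drops rows with too many fields instead of raising a parser error
sample = pd.read_csv(
    io.StringIO(raw),
    dtype="object",
    na_values="None",
    on_bad_lines="skip",
)
print(sample)
```

Since `dd.read_csv` forwards extra keyword arguments to `pandas.read_csv`, the same keyword should carry over to the Dask call on a recent pandas version.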


# 4. Data Inspection & Basic Data Cleaning

Let's have a look at what's in our dataframe and what we'll need to do to get this into a shape we can work with.

In [6]:
# inspect
ddf

npartitions=369,tweetid,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid
,object,object,object,object,object,object,object,object,object,object,object,object
,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...


In [7]:
# get number of rows of our dask dataframe
n_rows_all = ddf.shape[0].compute()
n_rows_all

36524387

We have more than 36.5 million tweets.

We had some trouble setting the data types directly upon import. From earlier experimentation not presented in this notebook, we know that this is most likely due to NaNs and/or faulty rows in which entries have shifted across columns (e.g. content of 'username' is in 'retweet_id' column). Let's investigate and resolve this first before moving on.

## 4.1 Start by Creating Index
When working with a Dask dataframe, it's best practice to set an index at the start of your operations. Because Dask dataframes are partitioned, positional row indices (the kind accessed with .iloc) are not straightforward to use. Setting the index resolves this by creating one clear index across all partitions. It is an expensive operation, though, so it is best used sparingly.

In [8]:
%%time
# set index to tweetid
ddf = ddf.set_index('tweetid', drop=True).persist()

CPU times: user 2.5 s, sys: 127 ms, total: 2.62 s
Wall time: 8.84 s


Done.

We will now proceed to inspect each column of the dataframe to check for NaNs / faulty entries and clean up as needed.

## 4.2 Inspecting Tweet_Text Column
Let's proceed to inspect the tweet_text column to see what's going on with the faulty entries and see if we can remove them.

In [10]:
# get a subset of our dataframe to experiment with
df_temp = ddf.partitions[9].compute()

In [11]:
df_temp

tweetid,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid
1040243117105065984,993642585892818944,rahil_76,12576,12682,ar,RT @alasirihoney_: 1_اقسم بالله انه عسل سدر طب...,2018-09-13 14:17,Twitter for Android,true,,1040088191351640065
1040243122406678534,993642585892818944,rahil_76,12576,12682,ar,RT @alasirihoney_: 1_عسل ابيض مجرى | بلدي طبيع...,2018-09-13 14:17,Twitter for Android,true,,1040087978578722816
1040243124407291904,993642585892818944,rahil_76,12576,12682,ar,RT @alasirihoney_: 1_عسل سدر عسل طبيعي عسل ابو...,2018-09-13 14:17,Twitter for Android,true,,1040087959792504832
1040243161019412480,993642585892818944,rahil_76,12576,12682,ar,RT @dddhhh284: #خزنات # حمامات # مسابح \nحل ار...,2018-09-13 14:17,Twitter for Android,true,,1039887208356958208
1040243165003956224,993642585892818944,rahil_76,12576,12682,ar,RT @dddhhh284: #خزنات # حمامات # مسابح \nحل ار...,2018-09-13 14:17,Twitter for Android,true,,1039886180030898176
...,...,...,...,...,...,...,...,...,...,...,...
1044651472749826048,448366650,hassin1937,30529,7265,ar,RT @mom999900: لاعدمتك ياهناء قلبي وخله💚ياللي ...,2018-09-25 18:14,Twitter for iPhone,true,,1044622714944794624
1044651483885637632,448366650,hassin1937,30529,7265,ar,RT @do__x: •\n\n.\n.\nأحياناً . .\nيتوّجب عليك...,2018-09-25 18:14,Twitter for iPhone,true,,1044629681792782339
1044651498024591361,rJ9LKF5+KW7TRiUemWEc2o7f2Yir2yMc+oxuoHToyR0=,rJ9LKF5+KW7TRiUemWEc2o7f2Yir2yMc+oxuoHToyR0=,616,1668,ar,دونالد #ترامب رئيس #الولايات_المتحدة الأمريكية...,2018-09-25 18:15,TweetDeck,false,,
1044651498259402753,raSzN6PrYMCSDxoqTBRxJ6+gadIVulhn0NA0dn10A=,raSzN6PrYMCSDxoqTBRxJ6+gadIVulhn0NA0dn10A=,800,359,ar,دونالد #ترامب رئيس #الولايات_المتحدة الأمريكية...,2018-09-25 18:15,TweetDeck,false,,


The tweet_text column contains faulty entries that are:
- only digits (shifted rows due to faulty import)
- NaNs

Let's first remove NaNs so that we can then proceed to look at the digit entries.

### 4.2.1. Drop NaNs from ddf

In [12]:
# drop rows for which tweet_text is NaN
ddf = ddf.dropna(subset=['tweet_text'])

In [13]:
%%time
ddf.shape[0].compute()

CPU times: user 101 ms, sys: 12.3 ms, total: 114 ms
Wall time: 1.98 s


36523971

In [14]:
n_rows_all - 36523971

416

Dropping the rows with NaN in tweet_text still leaves us with >36.5mln tweets. We have dropped 416 rows.

### 4.2.2. Drop Faulty Rows

Now let's see how many rows contain only digits in tweet_text. These rows are faulty imports: their contents have shifted across columns, so that, for example, tweetid contains the tweet text.

In [15]:
# subset ddf to get the number of entries for which tweet_text is entirely numeric
df_numeric = ddf[ddf.tweet_text.str.isnumeric()].compute()
df_numeric.shape[0]

373

In [16]:
df_numeric.sample(5)

tweetid,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid
1112427432743788546,NInrrL2fY2whh740LuCYLkW+1YQoDijuRzl2I3AGoZQ=,NInrrL2fY2whh740LuCYLkW+1YQoDijuRzl2I3AGoZQ=,4601.0,3861,und,٠,2019-03-31 18:52,Twitter for Android,False,,
"لان اليوم خميس واجازه وكذا انشرولي ابي اوصل ١٠٠٠ واكون لكم من الشاكرين 😊""",2018-08-16 18:33,,,1030109875140026369,0,0,0,0,,,
"🐟شر…""",2019-08-03 20:21,,,1157488424955387905,0,0,0,0,,,
"#اليوم_العالمي_…""",2017-09-30 13:41,,,913719629171658754,0,0,0,0,,,
"تنظيم حفل زواج لاحد الموقوفين بحضور اهالي العروسين في سجن ذهبان في جده مع التكفل التام من قبل #أمن_…""",2019-10-04 15:59,,,1180117429835501568,0,0,0,0,,,


There are 373 rows that contain only digits in the tweet_text column. This includes some tweets that are not shifted but simply consist of numbers. We can remove those as well, since they are unlikely to be informative.

Dask does not support dropping rows directly. Instead, we keep the inverse selection: every row whose tweet_text is not entirely numeric.

In [17]:
# get all rows that are NOT all numeric
ddf = ddf[~ddf.tweet_text.str.isnumeric()]
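
The order of the two cleaning steps matters, and `.str.isnumeric()` is broader than it may look. The sketch below uses plain pandas on made-up values (not rows from the dataset) to show why the NaNs were dropped first and that digits in other scripts, such as Arabic-Indic digits, also count as numeric.

```python
import pandas as pd

# made-up sample of tweet_text values, including Arabic-Indic digits and a missing value
tweet_text = pd.Series(["123456", "٠", "١٠٠٠", "RT @user: hello 2021", None])

# missing values must go first: .str.isnumeric() returns NaN for them,
# which cannot be used directly as a boolean mask
tweet_text = tweet_text.dropna()

# True only when every character is numeric, in any script
is_numeric = tweet_text.str.isnumeric()

# keep the inverse of the rows we want to drop
kept = tweet_text[~is_numeric]
print(kept.tolist())
```

The same mask-and-invert pattern is what the Dask cell above applies partition by partition.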

In [18]:
ddf.shape[0].compute()

36523598

In [19]:
n_rows_all - 36523598

789

After removing tweets with all numeric entries in tweet_text we have removed 789 tweets in total (i.e. including the NaNs removed above). 

Let's make sure we have removed all the instances where userid contains the date (i.e. shifted columns).

In [20]:
# define function to pass to map_partitions
def get_faulty_ids(df):
    # keep rows whose userid starts with a year prefix,
    # i.e. rows where a shifted tweet_time ended up in userid
    return df[df.userid.str.startswith('201', na=False)]

In [21]:
%%time
# pass faulty_ids function to each partition
df_faultyrows = ddf.map_partitions(get_faulty_ids).compute()
df_faultyrows

CPU times: user 737 ms, sys: 71.3 ms, total: 808 ms
Wall time: 14.3 s


tweetid,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid


Yes, we have removed all entries for which the userid was a date.

Let's see if we can now convert our columns to the right datatypes. That should give us confidence that all our rows contain the right entries and that no faulty imports / shifted rows remain.

In [22]:
ddf.columns

Index(['userid', 'user_screen_name', 'follower_count', 'following_count',
       'tweet_language', 'tweet_text', 'tweet_time', 'tweet_client_name',
       'is_retweet', 'retweet_userid', 'retweet_tweetid'],
      dtype='object')

## 4.3 Inspecting is_retweet Column

Let's inspect the is_retweet column. This column should contain only Boolean values: True or False.

In [23]:
is_retweet = ddf.is_retweet

In [24]:
is_retweet.value_counts().compute()

true     33107687
false     3415910
[]              1
Name: is_retweet, dtype: int64

There's one unexpected value, '[]'. Let's check that out.

In [25]:
ddf[ddf.is_retweet == "[]"].compute()

tweetid,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid
اسبيلت,اسبيلت,2019-07-11 09:23,,True,absent,absent,0,0,[],['3069869876'],


This is a shifted row too, so let's drop it.

In [25]:
ddf = ddf[ddf.is_retweet != "[]"]

In [26]:
ddf.shape[0].compute()

36523597

In [27]:
n_rows_all - 36523597

790

We've now dropped one more row, totalling 790 rows dropped.

Let's now convert this column to boolean. To do that, we map the strings 'true' and 'false' to their corresponding Boolean values.

In [28]:
# create dictionary to map strings to booleans
mapping = {'true': True, 'false': False}

In [29]:
# map strings to booleans
ddf.is_retweet = ddf.is_retweet.map(mapping)

In [30]:
# inspect counts
ddf.is_retweet.value_counts().compute()

True     33107687
False     3415910
Name: is_retweet, dtype: int64

In [31]:
# cast column to boolean data type
ddf.is_retweet = ddf.is_retweet.astype(bool)

In [32]:
# verify
ddf.is_retweet.value_counts().compute()

True     33107687
False     3415910
Name: is_retweet, dtype: int64
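
One subtlety in this step: `map` turns any value that is missing from the dictionary into NaN, and `astype(bool)` then silently turns NaN into True. That is why the '[]' row had to be removed before the cast. Below is a minimal pandas sketch of the pitfall and one possible guard, on made-up values.

```python
import pandas as pd

mapping = {'true': True, 'false': False}

# made-up sample that still contains the stray '[]' value
raw = pd.Series(['true', 'false', '[]'])

# '[]' is not in the mapping, so it becomes NaN
mapped = raw.map(mapping)

# number of values the mapping did not cover
unmapped = mapped.isna().sum()

# casting without a guard turns the NaN into True
naive = mapped.astype(bool)

# guard: keep only values the mapping covers, then map and cast
clean = raw[raw.isin(list(mapping))].map(mapping).astype(bool)
print(unmapped, naive.tolist(), clean.tolist())
```

Checking that the count of unmapped values is zero before casting is a cheap way to catch leftover faulty rows.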

In [33]:
ddf.dtypes

userid               object
user_screen_name     object
follower_count       object
following_count      object
tweet_language       object
tweet_text           object
tweet_time           object
tweet_client_name    object
is_retweet             bool
retweet_userid       object
retweet_tweetid      object
dtype: object

Excellent, that has worked as expected.

## 4.4. Inspecting Following and Follower Columns
Let's have a look at the following_count and follower_count columns.

We'll look at:
- number of non-numeric entries
- number of NaNs
- any strange one-time occurrences that don't belong

In [34]:
# get rows where following_count is non-numeric
faulty_following = ddf[~ddf.following_count.str.isnumeric()].compute()

In [35]:
faulty_following

tweetid,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid


In [36]:
# get rows with NaNs
following_NaN = ddf[ddf.following_count.isnull()].compute()

In [37]:
following_NaN

tweetid,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid


In [38]:
# get value counts
following_value_counts = ddf.following_count.value_counts()

In [39]:
# check tail to spot any strange one-time occurrences
following_value_counts.tail(10)

726     5
480     5
304     4
183     2
812     2
239     2
384     1
1113    1
173     1
851     1
Name: following_count, dtype: int64

No issues here.

Let's look at follower_count:

In [40]:
# get rows where follower_count is non-numeric
faulty_follower = ddf[~ddf.follower_count.str.isnumeric()].compute()

In [41]:
faulty_follower

tweetid,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid


In [42]:
# get rows with NaNs
follower_NaN = ddf[ddf.follower_count.isnull()].compute()

In [43]:
follower_NaN

tweetid,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid


In [44]:
# get value counts
follower_value_counts = ddf.follower_count.value_counts()

In [45]:
# check tail to spot any strange one-time occurrences
follower_value_counts.tail(10)

7988     29
1353     28
1607     28
3790     25
2196     24
1465     24
1210     22
23195    22
821      14
3731     10
Name: follower_count, dtype: int64

No issues here either.

This means we should be able to cast these columns to int64 data type.

In [46]:
# cast columns to integer dtypes
ddf.following_count = ddf.following_count.astype('int64')
ddf.follower_count = ddf.follower_count.astype('int64')

In [47]:
ddf.dtypes

userid               object
user_screen_name     object
follower_count        int64
following_count       int64
tweet_language       object
tweet_text           object
tweet_time           object
tweet_client_name    object
is_retweet             bool
retweet_userid       object
retweet_tweetid      object
dtype: object

Perfect. Both columns cast to int64.

## 4.5. Inspecting tweet_language Column
Let's have a look at the tweet_language column next. This column should contain only strings indicating the language of the tweets in tweet_text column.

In [48]:
# get local copy of tweet_language column
lang_value_counts = ddf.tweet_language.value_counts().compute()

In [49]:
# inspect values
lang_value_counts.index

Index(['ar', 'und', 'en', 'fa', 'tr', 'ko', 'eu', 'in', 'fi', 'tl', 'vi', 'fr',
       'ur', 'es', 'pt', 'ja', 'ca', 'cs', 'ht', 'de', 'et', 'id', 'zh', 'it',
       'ru', 'pl', 'nl', 'cy', 'ckb', 'sk', 'hi', 'sl', 'sv', 'uk', 'da', 'hu',
       'sd', 'lt', 'no', 'is', 'ro', 'lv', 'th', 'bo', 'ta', 'kn', 'hr', 'ps',
       'iw', 'bg', 'bn', 'bs', 'ug', 'el', 'am', 'hy', 'ml', 'ka', 'sr', 'ne',
       'chr', 'pa', 'dv', 'iu', 'sn', 'mr', 'he'],
      dtype='object')

That looks fine, too. No strange numbers or tweet texts in here that would point to faulty imports.

This column is already of type 'object' so we can proceed without altering anything.

## 4.6. Inspecting tweet_client_name Column
Let's take a look at the tweet_client_name column next.

In [50]:
# get a local copy of the column
client_name_value_counts = ddf.tweet_client_name.value_counts().compute()

In [51]:
# inspect value counts
client_name_value_counts

Twitter for iPhone     17746118
Twitter for Android    11781531
Twitter for iPad        3063280
Twitter Web App         1468719
Twitter Web Client       534684
                         ...   
erased14602063                1
PicCollage                    1
erased14118499                1
Plays Now                     1
   Fancy                      1
Name: tweet_client_name, Length: 457, dtype: int64

In [52]:
client_name_value_counts.sample(20)

مكتبة تغريدات                        9519
WatchClient                             1
Twuffer                                44
Appy Pie App                            1
Fãs - Só que ao Contrário old           5
Crazy HeliumBooth HD Free on iOS        1
TellyApp                                2
tweetie                                15
erased994719                         3287
UberSocial for BlackBerry             324
Publish Live                            5
TLampApp                                1
تطبيق كتابي                           610
CallApp                                 3
erased7925788                          18
Shorty Awards                           8
RULES OF SURVIVAL                       1
Telly Android                           3
Mobile Web (M2)                     25148
Samsung Mobile                         88
Name: tweet_client_name, dtype: int64

This looks in order. Client names in Arabic are read from right to left, which is why the value counts are in the left column.

The column is already of type 'object', so no change necessary here.

## 4.7. Inspecting retweet_tweetid Column

In [53]:
# get local copy of column
retweetids = ddf.retweet_tweetid

In [54]:
retweetids.head()

tweetid
1000000000447930368    998649277479088128
1000000030391095297    999637296612855808
1000000039362662400    999393857438699520
1000000054911033344    998351563839148032
1000000204865789954                   NaN
Name: retweet_tweetid, dtype: object

OK, so we have a null value here. Let's see how many in total.

In [55]:
ddf.retweet_tweetid.isnull().sum().compute()

3415910

That corresponds exactly to the number of non-retweets, so that's fine.
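
Matching totals are a good sign, but they do not prove that the missing IDs sit on the non-retweet rows. A row-wise comparison is a slightly stronger check. The sketch below shows the idea in plain pandas on a made-up sample with the same two column names.

```python
import pandas as pd

# made-up sample with the same two columns as the dataset
sample = pd.DataFrame({
    "is_retweet": [True, True, False, True, False],
    "retweet_tweetid": ["998649277479088128", "999637296612855808", None,
                        "998351563839148032", None],
})

# a row should lack a retweet_tweetid exactly when it is not a retweet
missing_id = sample["retweet_tweetid"].isnull()
consistent = (missing_id == ~sample["is_retweet"]).all()
print(consistent)
```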

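Because NaN cannot be stored in a NumPy integer column, we cast this column to **float**, not integer. Note that float64 represents integers exactly only up to 2^53, so IDs of this size can lose their last digits in the cast.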

In [56]:
# cast column to float dtypes
ddf.retweet_tweetid = ddf.retweet_tweetid.astype('float64')

In [57]:
ddf.dtypes

userid                object
user_screen_name      object
follower_count         int64
following_count        int64
tweet_language        object
tweet_text            object
tweet_time            object
tweet_client_name     object
is_retweet              bool
retweet_userid        object
retweet_tweetid      float64
dtype: object
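
Tweet IDs are 18 to 19 digits long, which is more than float64 can hold exactly. The sketch below shows the effect on one tweetid taken from the output above, and one possible alternative: pandas' nullable `Int64` dtype, which keeps exact integers next to missing values. This is an illustration in plain pandas, not a step the notebook performs.

```python
import pandas as pd

# a tweetid from the output above, used here as a stand-in for a retweet_tweetid
tweet_id = "1000000204865789954"

# float64 has a 53-bit significand, so integers this large get rounded
as_float = pd.Series([tweet_id]).astype("float64")
round_trip = int(as_float.iloc[0])

# the nullable Int64 dtype stores the exact integer alongside missing values
as_int = pd.array([int(tweet_id), pd.NA], dtype="Int64")

print(round_trip == int(tweet_id))  # comparison after the float round trip
print(as_int[0] == int(tweet_id))   # comparison with the nullable integer
```

If the IDs are only ever used as join keys, leaving the column as 'object' (strings) is another way to avoid the rounding.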

## 4.8. Inspecting retweet_userid Column

This column should contain hashed and unhashed userids, i.e. strings.

Let's just check any NaNs.

In [58]:
ddf.retweet_userid.isnull().sum().compute()

35839843

Wow. 35.8 mln NaNs for retweet_userid. That is worrisome considering the large number of retweets in the dataset.

Let's have a closer look.

In [59]:
retweet_userids_notNaN = ddf[ddf.retweet_userid.notnull()]

In [60]:
retweet_userids_notNaN.head(15)

tweetid,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid
1000061011880284161,np1eOxdxZnVz8KR9WRdXAaxOCMVgkHZlqKMZpQKTq08=,np1eOxdxZnVz8KR9WRdXAaxOCMVgkHZlqKMZpQKTq08=,4607,4843,ar,RT @jLrAA2gkQM83CtpEH6YyeV3eb+f56UcVcdvWBXKt0Q...,2018-05-25 17:08,Twitter for Android,True,974777604812263425,1.000053e+18
1000061016393252864,np1eOxdxZnVz8KR9WRdXAaxOCMVgkHZlqKMZpQKTq08=,np1eOxdxZnVz8KR9WRdXAaxOCMVgkHZlqKMZpQKTq08=,4607,4843,ar,RT @jLrAA2gkQM83CtpEH6YyeV3eb+f56UcVcdvWBXKt0Q...,2018-05-25 17:08,Twitter for Android,True,974777604812263425,1.000053e+18
1000061023066492934,np1eOxdxZnVz8KR9WRdXAaxOCMVgkHZlqKMZpQKTq08=,np1eOxdxZnVz8KR9WRdXAaxOCMVgkHZlqKMZpQKTq08=,4607,4843,ar,RT @jLrAA2gkQM83CtpEH6YyeV3eb+f56UcVcdvWBXKt0Q...,2018-05-25 17:08,Twitter for Android,True,974777604812263425,1.000052e+18
1000061031132131329,np1eOxdxZnVz8KR9WRdXAaxOCMVgkHZlqKMZpQKTq08=,np1eOxdxZnVz8KR9WRdXAaxOCMVgkHZlqKMZpQKTq08=,4607,4843,ar,RT @jLrAA2gkQM83CtpEH6YyeV3eb+f56UcVcdvWBXKt0Q...,2018-05-25 17:08,Twitter for Android,True,974777604812263425,1.000052e+18
1000061714824679424,XLCNthF4y5Q18iYqRrj7hStchSPp26kfx7ly0SPVt8=,XLCNthF4y5Q18iYqRrj7hStchSPp26kfx7ly0SPVt8=,2370,3492,ar,RT @2sTX4IADf8dmtSoDvHdalKZ9Wh2InPrVINyxL3ZJOU...,2018-05-25 17:11,Twitter for Android,True,2sTX4IADf8dmtSoDvHdalKZ9Wh2InPrVINyxL3ZJOUc=,1.00004e+18
1000061720243720192,XLCNthF4y5Q18iYqRrj7hStchSPp26kfx7ly0SPVt8=,XLCNthF4y5Q18iYqRrj7hStchSPp26kfx7ly0SPVt8=,2370,3492,ar,RT @2sTX4IADf8dmtSoDvHdalKZ9Wh2InPrVINyxL3ZJOU...,2018-05-25 17:11,Twitter for Android,True,2sTX4IADf8dmtSoDvHdalKZ9Wh2InPrVINyxL3ZJOUc=,1.00004e+18
1000061725092319235,XLCNthF4y5Q18iYqRrj7hStchSPp26kfx7ly0SPVt8=,XLCNthF4y5Q18iYqRrj7hStchSPp26kfx7ly0SPVt8=,2370,3492,ar,RT @2sTX4IADf8dmtSoDvHdalKZ9Wh2InPrVINyxL3ZJOU...,2018-05-25 17:11,Twitter for Android,True,2sTX4IADf8dmtSoDvHdalKZ9Wh2InPrVINyxL3ZJOUc=,1.00004e+18
1000061730414825472,XLCNthF4y5Q18iYqRrj7hStchSPp26kfx7ly0SPVt8=,XLCNthF4y5Q18iYqRrj7hStchSPp26kfx7ly0SPVt8=,2370,3492,ar,RT @2sTX4IADf8dmtSoDvHdalKZ9Wh2InPrVINyxL3ZJOU...,2018-05-25 17:11,Twitter for Android,True,2sTX4IADf8dmtSoDvHdalKZ9Wh2InPrVINyxL3ZJOUc=,1.00004e+18
1000061736110690304,XLCNthF4y5Q18iYqRrj7hStchSPp26kfx7ly0SPVt8=,XLCNthF4y5Q18iYqRrj7hStchSPp26kfx7ly0SPVt8=,2370,3492,ar,RT @2sTX4IADf8dmtSoDvHdalKZ9Wh2InPrVINyxL3ZJOU...,2018-05-25 17:11,Twitter for Android,True,2sTX4IADf8dmtSoDvHdalKZ9Wh2InPrVINyxL3ZJOUc=,1.00004e+18
1000071220707151873,CGglXh1nikyRGa29EEe1F2FAhhkOQ0Z6OlKB5vmS8=,CGglXh1nikyRGa29EEe1F2FAhhkOQ0Z6OlKB5vmS8=,4540,4732,ar,RT @2sTX4IADf8dmtSoDvHdalKZ9Wh2InPrVINyxL3ZJOU...,2018-05-25 17:48,Twitter for Android,True,2sTX4IADf8dmtSoDvHdalKZ9Wh2InPrVINyxL3ZJOUc=,1.000046e+18


Yes, we are missing the retweet_userid for the majority of our tweets.

However, it seems the tweet_text contains the retweet @username. Let's see if this is also the case for the entries with retweet_userid = NaN. That would be great.

In [61]:
retweet_userids_NaN = ddf[ddf.retweet_userid.isnull()]

In [62]:
retweet_userids_NaN.head(15)

tweetid,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid
1000000000447930368,948302862098092034,y_44a_,9007,8821,ar,RT @oneway_market: السلام عليكم ورحمة الله وبر...,2018-05-25 13:05,Twitter for iPhone,True,,9.986493e+17
1000000030391095297,948302862098092034,y_44a_,9007,8821,ar,RT @games4marah: 🌻#للتأجير 🌻#لبيع_النطيطات 🌻\n...,2018-05-25 13:06,Twitter for iPhone,True,,9.996373e+17
1000000039362662400,948302862098092034,y_44a_,9007,8821,ar,RT @mzlatksa: #مظلات وسواتر #آفاق_الرياض\n#مظل...,2018-05-25 13:06,Twitter for iPhone,True,,9.993939e+17
1000000054911033344,iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=,iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=,168,408,ar,RT @videohat_1: فيديو\nشاهد.. مواطن يوثق بالفي...,2018-05-25 13:06,Twitter for iPhone,True,,9.983516e+17
1000000204865789954,Gj+bihYSO0L5Ht1+f9OEqP42KbnJWtNK4qv0WJr0cs=,Gj+bihYSO0L5Ht1+f9OEqP42KbnJWtNK4qv0WJr0cs=,1623,2022,ar,أستغفر الله العظيم وأتوب إليه https://t.co/Dn3...,2018-05-25 13:06,غرد بصدقة,False,,
1000000215598891008,948302862098092034,y_44a_,9007,8821,ar,RT @danat_almesk: #تخفيضات 50% على جميع الأصنا...,2018-05-25 13:06,Twitter for iPhone,True,,9.997592e+17
1000000242165714944,948302862098092034,y_44a_,9007,8821,ar,RT @756870fda1544b6: ✅علاج السرطان في الهند عن...,2018-05-25 13:06,Twitter for iPhone,True,,9.997301e+17
1000000262315094022,948302862098092034,y_44a_,9007,8821,ar,RT @m3asafarah: دورة #مع_السفرة السادسة عشر\nا...,2018-05-25 13:06,Twitter for iPhone,True,,9.997669e+17
1000000271492288512,948302862098092034,y_44a_,9007,8821,ar,RT @Ayed72044978: #تسديد_قروض\n♋الراجحي\n♋الاه...,2018-05-25 13:06,Twitter for iPhone,True,,9.99761e+17
1000000325586169856,2SJuOzyE6GQOsmW9ukY3ChH8rl049x6mDNZi3EM=,2SJuOzyE6GQOsmW9ukY3ChH8rl049x6mDNZi3EM=,1850,1594,ar,لا إله إلا أنت سبحانك إني كنت من الظالمين \n♻️...,2018-05-25 13:07,تطبيق زاد المسلم,False,,


The entries with NaN for retweet_userid DO contain a retweet handle in tweet_text, but it is formatted as a username / screen name rather than a userid.

This **may be because** these users (without userids) are not included as users themselves in this dataset, i.e. the entries that DO have retweet_userids may be retweeting other users in this dataset. Just a hunch.

Possible solution: create a reference table of ALL users (those who (re)tweeted and those who were being retweeted) with a unique user index. This table would have the columns: 

1. Unique ID generated by me 
2. Screen Name 
3. Twitter UserID if available
4. number of followers/following, if available
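
A rough sketch of how such a reference table could be assembled, in plain pandas on a made-up sample. The `user_index` and `screen_name` column names and the regular expression for the "RT @name:" prefix are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# made-up sample with the relevant columns
tweets = pd.DataFrame({
    "userid": ["u1", "u2", "u1"],
    "user_screen_name": ["alice", "bob", "alice"],
    "follower_count": [10, 20, 10],
    "following_count": [5, 7, 5],
    "tweet_text": ["RT @carol: hi", "RT @alice: yo", "plain tweet"],
})

# accounts that authored tweets, one row per screen name
authors = (
    tweets[["user_screen_name", "userid", "follower_count", "following_count"]]
    .drop_duplicates(subset="user_screen_name")
    .rename(columns={"user_screen_name": "screen_name"})
)

# screen names that were retweeted, parsed from the "RT @name:" prefix;
# [^:\s]+ also matches hashed names containing '+', '/' or '='
retweeted = (
    tweets["tweet_text"]
    .str.extract(r"^RT @([^:\s]+):", expand=False)
    .dropna()
    .drop_duplicates()
    .rename("screen_name")
    .to_frame()
)

# outer merge keeps users seen on either side; unknown fields stay NaN
users = authors.merge(retweeted, on="screen_name", how="outer")
users = users.sort_values("screen_name").reset_index(drop=True)
users.insert(0, "user_index", range(len(users)))
print(users)
```

Users who only appear as retweet targets end up with a screen name and a generated index but no userid or follower counts, which matches the "if available" columns in the list above.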

Besides this, no sign of faulty imports / shifted rows here. The column is already of type 'object' so no change necessary.

## 4.9. Inspecting tweetid Column

We've already set tweetid to be the index. As a sanity check, let's confirm that every index value is numeric by temporarily copying the index into a column.

In [63]:
ddf['tweetid'] = ddf.index

In [64]:
ddf.head()

tweetid,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid,tweetid
1000000000447930368,948302862098092034,y_44a_,9007,8821,ar,RT @oneway_market: السلام عليكم ورحمة الله وبر...,2018-05-25 13:05,Twitter for iPhone,True,,9.986493e+17,1000000000447930368
1000000030391095297,948302862098092034,y_44a_,9007,8821,ar,RT @games4marah: 🌻#للتأجير 🌻#لبيع_النطيطات 🌻\n...,2018-05-25 13:06,Twitter for iPhone,True,,9.996373e+17,1000000030391095297
1000000039362662400,948302862098092034,y_44a_,9007,8821,ar,RT @mzlatksa: #مظلات وسواتر #آفاق_الرياض\n#مظل...,2018-05-25 13:06,Twitter for iPhone,True,,9.993939e+17,1000000039362662400
1000000054911033344,iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=,iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=,168,408,ar,RT @videohat_1: فيديو\nشاهد.. مواطن يوثق بالفي...,2018-05-25 13:06,Twitter for iPhone,True,,9.983516e+17,1000000054911033344
1000000204865789954,Gj+bihYSO0L5Ht1+f9OEqP42KbnJWtNK4qv0WJr0cs=,Gj+bihYSO0L5Ht1+f9OEqP42KbnJWtNK4qv0WJr0cs=,1623,2022,ar,أستغفر الله العظيم وأتوب إليه https://t.co/Dn3...,2018-05-25 13:06,غرد بصدقة,False,,,1000000204865789954


In [65]:
ddf.tweetid.str.isnumeric().sum().compute()

36523597

In [66]:
ddf.shape[0].compute()

36523597

In [67]:
ddf = ddf.drop('tweetid', axis=1)

In [68]:
ddf.head()

tweetid,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid
1000000000447930368,948302862098092034,y_44a_,9007,8821,ar,RT @oneway_market: السلام عليكم ورحمة الله وبر...,2018-05-25 13:05,Twitter for iPhone,True,,9.986493e+17
1000000030391095297,948302862098092034,y_44a_,9007,8821,ar,RT @games4marah: 🌻#للتأجير 🌻#لبيع_النطيطات 🌻\n...,2018-05-25 13:06,Twitter for iPhone,True,,9.996373e+17
1000000039362662400,948302862098092034,y_44a_,9007,8821,ar,RT @mzlatksa: #مظلات وسواتر #آفاق_الرياض\n#مظل...,2018-05-25 13:06,Twitter for iPhone,True,,9.993939e+17
1000000054911033344,iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=,iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=,168,408,ar,RT @videohat_1: فيديو\nشاهد.. مواطن يوثق بالفي...,2018-05-25 13:06,Twitter for iPhone,True,,9.983516e+17
1000000204865789954,Gj+bihYSO0L5Ht1+f9OEqP42KbnJWtNK4qv0WJr0cs=,Gj+bihYSO0L5Ht1+f9OEqP42KbnJWtNK4qv0WJr0cs=,1623,2022,ar,أستغفر الله العظيم وأتوب إليه https://t.co/Dn3...,2018-05-25 13:06,غرد بصدقة,False,,


Excellent. All rows have a tweetid that is entirely numeric.
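
The numeric check confirms the format of the IDs but says nothing about duplicates. As a small illustration of the difference, here is a plain-Python sketch on made-up ID lists (the variable names are hypothetical). On the full dataset the same comparison would need a distributed count of distinct index values against the row count.

```python
# three IDs taken from the sample output above
tweet_ids = ["1000000000447930368", "1000000030391095297", "1000000039362662400"]

# a list that is entirely numeric but contains a duplicate
tweet_ids_with_dupe = tweet_ids + ["1000000000447930368"]


def all_numeric(ids):
    """True if every ID consists of digits only."""
    return all(i.isnumeric() for i in ids)


def all_unique(ids):
    """True if no ID occurs more than once."""
    return len(set(ids)) == len(ids)


print(all_numeric(tweet_ids), all_unique(tweet_ids))
print(all_numeric(tweet_ids_with_dupe), all_unique(tweet_ids_with_dupe))
```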

This gives us enough confidence that our columns contain all the right values (i.e. no faulty import / shifted rows) and we can proceed to change the data types.

## 4.10. Inspecting tweet_time Column

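Let's see if we can cast this column to datetime64; if the cast succeeds, that verifies that the column contains only dates.

In [69]:
# recent pandas versions require an explicit unit, hence 'datetime64[ns]'
ddf.tweet_time = ddf.tweet_time.astype('datetime64[ns]')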

In [70]:
ddf.tweet_time.max().compute()

Timestamp('2020-01-22 06:02:00')

That worked. We should be all set now.

Let's just confirm below.

## Verifying Data Types

In [71]:
ddf.dtypes

userid                       object
user_screen_name             object
follower_count                int64
following_count               int64
tweet_language               object
tweet_text                   object
tweet_time           datetime64[ns]
tweet_client_name            object
is_retweet                     bool
retweet_userid               object
retweet_tweetid             float64
dtype: object

Excellent. We have converted all the columns to their appropriate datatypes.

Let's confirm the number of rows in our dataframe after this first pass of filtering out NaN rows and faulty imports.

In [72]:
ddf.shape[0].compute()

36523597

In [73]:
# persist this corrected version of ddf to cluster memory
ddf = ddf.persist()

This still leaves us with more than 36.5 million tweets, which is ample material to work with.

Let's now proceed to subset the dataframe to include only Arabic tweets.

# 5. Cleaning Text Data

In this section, we proceed to clean up the data in the tweet_text column for use in the NLP topic-modelling stage of this project. 

## 5.1. Subsetting Arabic Tweets
We are interested only in Arabic-language tweets so the first step is to filter out any tweets that are not in Arabic.

### 5.1.1. Percentage of Tweets in Arabic
For a first sanity check, let's evaluate the percentage of tweets in our dataset that are in Arabic.

In [74]:
%%time
# get value counts of tweet language column
ddf.tweet_language.value_counts().compute()

CPU times: user 136 ms, sys: 8.42 ms, total: 145 ms
Wall time: 2.02 s


ar     34249868
und     1682128
en       287225
fa        26917
tr        15486
         ...   
dv            1
iu            1
sn            1
mr            1
he            1
Name: tweet_language, Length: 67, dtype: int64

In [75]:
# get percentage of arabic language tweets
ddf.tweet_language.value_counts().compute().loc['ar'] / n_rows_all * 100

93.77260185092223

In [76]:
# get percentage of tweets with undefined language
ddf.tweet_language.value_counts().compute().loc['und'] / n_rows_all * 100

4.605492762958622

In [77]:
# get percentage of english language tweets
ddf.tweet_language.value_counts().compute().loc['en'] / n_rows_all * 100

0.786392390377421

- We have a clear majority of Arabic language tweets: almost 94%.
- 4.6% of the tweets have language 'undefined'.
- Less than 1% of the tweets are in English.
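
The three cells above each recompute the full value counts. A minimal sketch of computing the counts once and deriving all shares from them, using the counts printed above. As an assumption, it divides by the current row count (36,523,597) rather than by `n_rows_all`, which is defined earlier in the notebook, so the later decimals may differ slightly from the outputs above.

```python
# counts copied from the value_counts() output above
counts = {"ar": 34249868, "und": 1682128, "en": 287225}

# assumption: use the row count reported by ddf.shape[0].compute()
n_rows = 36523597

# percentage share per language, computed from a single set of counts
shares = {lang: n / n_rows * 100 for lang, n in counts.items()}

for lang, pct in shares.items():
    print(f"{lang}: {pct:.1f}%")
```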

### 5.1.2. Exploring 'Undefined' Category
Let's explore the 'undefined' category a little further. I wonder if this is mostly Arabic.

In [78]:
%%time
# get tweet texts that have language = 'und'
ddf[ddf.tweet_language == 'und'].tweet_text.sample(frac=0.01).head(15)

CPU times: user 15.4 ms, sys: 4.01 ms, total: 19.4 ms
Wall time: 289 ms


tweetid
1000737104392081408    RT @ssa3tt: #ليفربول_ريال_مدريد\n#تخفيضات_رمضا...
1000012400513478657    RT @joo_roody: #رمضان_كريم\n#مركز_طرق_الجمال\n...
1000616288740413440         RT @54W3j0rWTQXoONS: https://t.co/hcEHcLYYYd
1000112656043429888                       😭😭😭😭😭😭 https://t.co/yoiKYBV9Qf
1001593342973632513    RT @abwfhdalswady: #مظلات_وسواتر_الخالدي053444...
1001474778044215297                         🔥🔥🔥🔥 https://t.co/monyq4OWKK
1000286266133569536    RT @t2030d1: #عمان_تتكاتف_كلها \n#ساعه_استجابه...
1000011997159854080    RT @ooaoaoao: #هي_وحدة #وش_خطتك_في_رمضان #حزب_...
1000431922831675393         RT @OJmz7UlUDh5Cbsd: https://t.co/3iyJrE16Zf
1000476676193751040    RT @naql3afshriyad: #ليفربول_ريال_مدريد\n#شركة...
1000790558498017280     RT @rufaqaa: #غرد_بموعظه https://t.co/zK4an1rS4n
1000826364616937477    RT @almohydb4: .\n#رمضان #غرد_بذكر_الله \n#ابت...
1001647595943579650    RT @Riyadh_mmm: #شركة_تنظيف_بالرياض 💯✔      #ت...
1001852875767762944    RT @vip_u_s111: #شرك

Running the cell above a number of times reveals that the 'undefined' category seems to be made up of:
- tweets in Arabic
- tweets with only URLs
- tweets with only emojis

This means we should be able to retrieve some more Arabic tweets from this category to increase the total number of rows in our Arabic-only dataframe.

### 5.1.3. Filtering Arabic from Undefined

Let's do this by:
1. Filtering out tweets with language = 'ar' into separate ddf: ddf_ar
2. Filtering out tweets with language = 'und' into separate ddf: ddf_und
3. Deleting tweets that are URLs or only emoji from ddf_und
4. Appending ddf_und to ddf_ar
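
For step 3, one possible way to decide whether a tweet holds any Arabic text is to look for characters in the Unicode Arabic block (U+0600 to U+06FF). This is a rough sketch, not the method used later in the notebook: the helper `contains_arabic` is hypothetical, and other languages written in Arabic script (e.g. Persian) would also pass the test.

```python
import re

# any character from the basic Unicode Arabic block
ARABIC_CHARS = re.compile(r"[\u0600-\u06FF]")


def contains_arabic(text):
    """True if the text contains at least one character from the Arabic block."""
    return bool(ARABIC_CHARS.search(text))


# examples modelled on the 'und' sample above
print(contains_arabic("RT @rufaqaa: #غرد_بموعظه https://t.co/zK4an1rS4n"))
print(contains_arabic("😭😭😭😭😭😭 https://t.co/yoiKYBV9Qf"))
```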

In [79]:
ddf_ar = ddf[ddf.tweet_language == 'ar']
ddf_und = ddf[ddf.tweet_language == 'und']

In [80]:
# verify
ddf_und.shape[0].compute()

1682128

The undefined ddf should be small enough to work with locally as a pandas dataframe. Let's call a compute: df_und

In [81]:
# turn ddf_und into local pandas dataframe
df_und = ddf_und.compute()

In [82]:
# # save locally for future reference
# df_und.to_csv('/Users/richard/Desktop/springboard_repo/capstones/three/df_undefined.csv')

In [83]:
df_und.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1682128 entries, 1000000675722547200 to 999999622281121792
Data columns (total 11 columns):
 #   Column             Non-Null Count    Dtype         
---  ------             --------------    -----         
 0   userid             1682128 non-null  object        
 1   user_screen_name   1682128 non-null  object        
 2   follower_count     1682128 non-null  int64         
 3   following_count    1682128 non-null  int64         
 4   tweet_language     1682128 non-null  object        
 5   tweet_text         1682128 non-null  object        
 6   tweet_time         1682128 non-null  datetime64[ns]
 7   tweet_client_name  1682128 non-null  object        
 8   is_retweet         1682128 non-null  bool          
 9   retweet_userid     21326 non-null    object        
 10  retweet_tweetid    1492599 non-null  float64       
dtypes: bool(1), datetime64[ns](1), float64(1), int64(2), object(6)
memory usage: 142.8+ MB


This is looking good. We have no NaNs for the first 9 columns and the expected many NaNs for retweet_userid and some NaNs for retweet_tweetid.

**Important:** some of these are **re-tweets** of tweets containing only emojis or URLs. 

This means it's better to:
1. first move the retweet usernames to the new column retweet_user_screen_name (to account for the lack of userids)
2. then remove all "RT"s, mentions, and URLs, and
3. then come back and filter out Arabic tweets from undefined.

What we can do at this point is **remove all the tweets with tweet_languages other than 'ar' and 'und'**.

In [84]:
# subset ddf to include only 'ar' and 'und' entries and persist to memory
ddf = ddf[(ddf.tweet_language == 'ar') | (ddf.tweet_language == 'und')].persist()

In [85]:
ddf.shape[0].compute()

35931996

In [86]:
n_rows_arabic_und = ddf.shape[0].compute()

After removing all tweets in languages other than 'ar' (Arabic) and 'und' (undefined) we are left with just under 36 million tweets.

We still have to filter out the tweets in 'undefined' containing only emojis and URLs but we'll do that at a later stage.

## 5.2. Move Retweet Usernames to New Column

As mentioned above, the next step is to move the retweet usernames to a new column. We are doing this because the retweet_userid column is almost entirely empty. This lets us recover the valuable information contained in the tweet_text column, namely who is retweeting whom. As mentioned in Section 4, the new column ('retweet_user_screen_name') will become part of a reference table that allows us to keep track of which users are retweeting one another.

Let's work with df_und first locally, to get a feel of how to do this.

In [87]:
df_und.tweet_text.sample(20, random_state=25)

tweetid
866627763641495552                RT @kbria2016: https://t.co/dwgHXb62uz
1122476147743313920    RT @a_abdawhab2022: @DiwaniyaOfPoets #مسابقة_د...
1057630368596033536    RT @a__r__z: #اسال_تميسه\n #الاختبارات\n #ضعف_...
749742946380083200     RT @eostudy: :Three sentences for getting Succ...
1077159302887018496               RT @LahaleboS: https://t.co/RPUpyjdv5i
1046470719549386752                RT @sluttyx1: https://t.co/qTo3IbcUCc
1115381123931168768    RT @Jood_Elevators: #مصاعد_جود 920008419 https...
1198462139033423872           RT @hawa_Lbalq8op: https://t.co/TgDuMWNLZi
793890843207798784     RT @hmodee2050: #سكري #محشي #للوز #تين_مشمش #ص...
872185405420040192                    RT @8iiit: https://t.co/fsUcZBX2U2
1141980248483008513             RT @ward_altaif: https://t.co/5Yam41mmG4
1205223104798175247    RT @AdryDrive: 🅼┏━━┓#H0MEL3ND ⭐️⭐️⭐️\n🆄┃━┳┻┳┓┏...
1202307740841971715                RT @JEA__: 🖤🔥 https://t.co/CixoI4KliB
762087051940491264          RT @am_dema2015

In [88]:
df_und.iloc[0]

userid                     ytmVN9opEFMM7Uk+0O0XgSuOpIRlok5Xqu+jel9qyM=
user_screen_name           ytmVN9opEFMM7Uk+0O0XgSuOpIRlok5Xqu+jel9qyM=
follower_count                                                    3928
following_count                                                   4273
tweet_language                                                     und
tweet_text           RT @Abo_ody103: #الشفا\n#الحزم\n#الفواز\n#الدا...
tweet_time                                         2018-05-25 13:08:00
tweet_client_name                                  Twitter for Android
is_retweet                                                        True
retweet_userid                                                     NaN
retweet_tweetid                                   999699981744340992.0
Name: 1000000675722547200, dtype: object

Let's define a regex pattern to find the user screen name - located between "RT @" and ":".

We will then use this regex in a function that returns only the user_screen_name when present and NaN otherwise.

In [89]:
pattern_RT_mention = r"RT @([^:]+):"

In [90]:
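def search_RT_mention(tweet_text):
    try:
        return re.search(pattern_RT_mention, tweet_text).group(1)
    except (AttributeError, TypeError):
        # AttributeError: no match (re.search returned None); TypeError: non-string entry
        return np.nan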

In [91]:
df_und['retweet_user_screen_name'] = df_und.tweet_text.apply(search_RT_mention)

In [92]:
df_und.retweet_user_screen_name.sample(20)

tweetid
1084914415139463168           qawafel0
1040951141721415680         Sexfire911
896032239615713282               uo_uj
653726989996695552            Hanoy666
706594987430039552               hl_4o
806487693689585664             mmayz_1
1086358281222799362            KAlobid
831091702999896064              3z_r0q
1142068113242017799    laila_al_ohaidb
1188047037738213376             dr52kk
409646074807455744                 NaN
841161647020507138          mxxzzxxzz1
733597818594566144             Tuwath_
1118600863977750530           alaotify
833549244434939904            mousaa__
1059235615378677761          sefgcx123
756096541685903362          albader_77
807181247478988800           0808Decor
632226038240575488             eostudy
1187256296908701696                NaN
Name: retweet_user_screen_name, dtype: object

Excellent, this has worked. By passing 1 to the .group() call we pull out only the first capture group, i.e. the characters between the @ and the :, which is the user_screen_name.
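
One thing to be aware of: the pattern is not anchored, so it would also match an "RT @" that occurs in the middle of a tweet and capture everything up to the next colon. Whether such cases occur in this dataset has not been checked. The sketch below compares the notebook's pattern with a hypothetical stricter variant (`pattern_strict`, an assumption, not part of the notebook) that must match at the start of the tweet and allows no whitespace in the handle. Hashed handles contain characters such as `+`, `/` and `=`, so the stricter pattern deliberately does not restrict the handle to word characters.

```python
import re

# pattern used in the notebook: anything between "RT @" and the next colon
pattern_loose = re.compile(r"RT @([^:]+):")

# hypothetical stricter variant: anchored at the start, handle without whitespace
pattern_strict = re.compile(r"^RT @([^\s:]+):")


def extract(pattern, text):
    """Return the captured handle, or None if the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


samples = [
    "RT @ghj__123: https://t.co/FeWYAPyJZF",   # regular retweet
    "RT @AbC+dEf/123=: text",                  # made-up hashed-style handle
    "so much for RT @user being nice: sad",    # made-up mid-sentence "RT @"
]
for s in samples:
    print(repr(extract(pattern_loose, s)), repr(extract(pattern_strict, s)))
```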

Let's now apply this to our entire ddf by writing a function to pass to ddf.map_partitions. This function operates on each partition of the dask dataframe (which is a df) and will use the search_RT_mention function defined above.

In [93]:
def create_retweet_user_column(df):
    df['retweet_user_screen_name'] = df.tweet_text.apply(search_RT_mention)
    return df

In [94]:
# map to all partitions and persist new dataframe to memory
ddf = ddf.map_partitions(create_retweet_user_column).persist()

In [95]:
ddf.partitions[0].sample(frac=0.0002, random_state=26).compute()

tweetid,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid,retweet_user_screen_name
1000401223508512768,meAFVoFFDbqur3udXGEVzD0JyLWeeDfVsLXuq3IWpI=,meAFVoFFDbqur3udXGEVzD0JyLWeeDfVsLXuq3IWpI=,3576,5464,ar,"#مسابقه_نوف_بنت_عبدالعزيز\n""\n""\nقال تعالي:\n\...",2018-05-26 15:40:00,Twitter for iPhone,False,,,
1000193719377768448,CGglXh1nikyRGa29EEe1F2FAhhkOQ0Z6OlKB5vmS8=,CGglXh1nikyRGa29EEe1F2FAhhkOQ0Z6OlKB5vmS8=,4540,4732,und,RT @ghj__123: https://t.co/FeWYAPyJZF,2018-05-26 01:55:00,Twitter for Android,True,,1.0001e+18,ghj__123
1001596706948034562,CGglXh1nikyRGa29EEe1F2FAhhkOQ0Z6OlKB5vmS8=,CGglXh1nikyRGa29EEe1F2FAhhkOQ0Z6OlKB5vmS8=,4540,4732,ar,RT @tsded_k: تسديد_القروض\n💯📘🔙الراجحي \n📕🔙الأه...,2018-05-29 22:50:00,Twitter for Android,True,,1.00129e+18,tsded_k
1001117219043119109,948302862098092034,y_44a_,9007,8821,ar,RT @yosefah50019586: مع ملتي ماكا 💊⤵\n💥زيادة ق...,2018-05-28 15:05:00,Twitter for iPhone,True,,1.001054e+18,yosefah50019586
1001347238025678848,XLCNthF4y5Q18iYqRrj7hStchSPp26kfx7ly0SPVt8=,XLCNthF4y5Q18iYqRrj7hStchSPp26kfx7ly0SPVt8=,2370,3492,ar,RT @honeyTaif2018: #اربح_بيت_مع_مصعب\n#وحدك_ام...,2018-05-29 06:19:00,Twitter for Android,True,,1.001055e+18,honeyTaif2018
1000064166135255045,np1eOxdxZnVz8KR9WRdXAaxOCMVgkHZlqKMZpQKTq08=,np1eOxdxZnVz8KR9WRdXAaxOCMVgkHZlqKMZpQKTq08=,4607,4843,ar,RT @waam2040: #تسديد_قروض_بنكية \n🌙الراجحي\n🌙ا...,2018-05-25 17:20:00,Twitter for Android,True,,9.99901e+17,waam2040
1001812856994713600,988400378797535232,maram__2002,16937,16823,und,RT @mark0547070481: #شركه_رش_مبيدات #شركة_مكاف...,2018-05-30 13:09:00,Twitter for iPhone,True,,1.001374e+18,mark0547070481
1000972929595789314,vczripOcvrQ82Or6UC35LhdwjXqWZPGKl+mzXb5dQ64=,vczripOcvrQ82Or6UC35LhdwjXqWZPGKl+mzXb5dQ64=,4468,148,ar,RT @Abotalaal72: حساب هلاليه تستحق المتابعه\nد...,2018-05-28 05:31:00,Twitter for iPhone,True,,1.00094e+18,Abotalaal72
1000330356258754560,948302862098092034,y_44a_,9007,8821,ar,RT @1l5l1__: #تسديد_قروض\n#سداد_قروض بنكية \n#...,2018-05-26 10:58:00,Twitter for iPhone,True,,1.000204e+18,1l5l1__


Excellent, this has worked as expected.

We can now proceed to clean the tweet_text column.

## 5.3. Cleaning tweet_text Column

We will be removing the following from tweet_text:
- retweet mentions ("RT @user_screen_name:")
- any remaining mentions
- emoji
- line breaks ("\n" etc.)
- URLs

Hashtags and underscores are dealt with separately in Section 5.4.

### 5.3.1. Remove Retweet Mentions

The function below uses a regex pattern to remove the "RT @user_screen_name:" substrings from any tweet_text entries containing a retweet mention.

In [96]:
def remove_RT_mentions(tweet_text):
    try:
        return re.sub(pattern_RT_mention, "", tweet_text)
    except TypeError:
        # non-string entry (e.g. NaN): return it unchanged instead of None
        return tweet_text

In [97]:
def clean_tweet_text_column_RT(df):
    df.tweet_text = df.tweet_text.apply(remove_RT_mentions)
    return df

In [98]:
ddf = ddf.map_partitions(clean_tweet_text_column_RT).persist()

In [99]:
ddf.tweet_text.sample(frac=0.000005).compute()

tweetid
1035744319632822272     مفارقة؛\n\nيتم تشجيع ٥٠٠٠ شخص على ريادة الاعم...
1047354863032496128     شركة تنظيف خزانات بالرياض0537315337\n💫شركة غس...
1052329431539310592     📢 اعلان عن توفر وظيفة \n👩‍🎓فني/مهندس صيانة\n👩...
1060135231213240320    @youm7 @dandrawy_hawary أن الاوان لان تفعل الق...
1062354330408402944     #عسل_الضرم يستخدم في علاج الجروح والحروق وتخف...
                                             ...                        
956402769719713792      #تجميل #مكياج #تسريحات #شعر https://t.co/nYlN...
964457950004826112      عيش بـ #صراحه #طيبه \n\nوتصير لـ #أجلك معظم #...
978396048778461195      مؤسسةالتميز الرائدةفي عالم التظليل والمقاولات...
984870080017203200      #ساعه_استجابه\n\n#قرعه_دوري_الابطال\n\nحقين \...
991331720044609536      @M0VOWuYUH34R3rjz0aoiTA4kY6lN7gsukt2O3+jV2Yo=...
Name: tweet_text, Length: 160, dtype: object

Works like a charm!

### 5.3.2. Removing Remaining Mentions

The tweet_text column also occasionally contains mentions that are not retweet-related. We will remove those below.

In [100]:
pattern_mentions = r'@\S+'

In [101]:
def remove_mentions(tweet_text):
    return re.sub(pattern_mentions, "", tweet_text)

In [102]:
def clean_tweet_text_column_mentions(df):
    df.tweet_text = df.tweet_text.apply(remove_mentions)
    return df

In [103]:
# inspect ddf before removing mentions
ddf.tweet_text.sample(frac=0.000004, random_state=5).compute()

tweetid
1157717594297774080     #الملك_سلمان : من دافع عن ديننا قبل كل شي وعن...
1184320597272100865     تقارب الارواح قلب وصوت / ترى عليه اثبات علميا...
1186000635356962818     ساعة رولكس أوتوماتيك  1\n\nالتوصيل خلال ساعه ...
496762459214979072     يا ليت بعض القلوب ؟ تشوفها عيني\nاعـرف وفـاهـا...
666496653205991424      ⭕️تشكيلة راقية من مقابض الابواب والكوالين وبا...
715975659294756866      💢💯💢💯💢💯\nرولكس ستيل نسائي درجه اولى هاي كواليت...
726760920618307584      لَا أُرِيد وَضْع عَلامة إسْتِفهامٍ لمَا سَبق\...
779723954965639172      أبسط أيآمي معك كآنت جميلة\nوالبسيط آللي يجي م...
Name: tweet_text, dtype: object

In [104]:
# apply cleaning function to all partitions
ddf = ddf.map_partitions(clean_tweet_text_column_mentions).persist()

In [105]:
# inspect ddf after removing mentions
ddf.tweet_text.sample(frac=0.000004, random_state=5).compute()

tweetid
1157717594297774080     #الملك_سلمان : من دافع عن ديننا قبل كل شي وعن...
1184320597272100865     تقارب الارواح قلب وصوت / ترى عليه اثبات علميا...
1186000635356962818     ساعة رولكس أوتوماتيك  1\n\nالتوصيل خلال ساعه ...
496762459214979072     يا ليت بعض القلوب ؟ تشوفها عيني\nاعـرف وفـاهـا...
666496653205991424      ⭕️تشكيلة راقية من مقابض الابواب والكوالين وبا...
715975659294756866      💢💯💢💯💢💯\nرولكس ستيل نسائي درجه اولى هاي كواليت...
726760920618307584      لَا أُرِيد وَضْع عَلامة إسْتِفهامٍ لمَا سَبق\...
779723954965639172      أبسط أيآمي معك كآنت جميلة\nوالبسيط آللي يجي م...
Name: tweet_text, dtype: object

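The displayed portion of this particular sample contains no mentions, so the output looks the same before and after. Let's proceed.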
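
Since the sample does not show the effect, here is a small self-contained check of the same pattern on toy strings. The first string is modelled on a tweet seen earlier; the second is a made-up example showing a side effect: anything following an "@" up to the next whitespace is removed, including the domain part of an e-mail address.

```python
import re

pattern_mentions = r"@\S+"


def remove_mentions(tweet_text):
    """Remove every '@' followed by non-whitespace characters."""
    return re.sub(pattern_mentions, "", tweet_text)


before = "@youm7 @dandrawy_hawary أن الاوان"
after = remove_mentions(before)
print(repr(after))

# side effect on e-mail-like text (made-up example)
print(repr(remove_mentions("mail me: user@example.com")))
```

Note that the whitespace around the removed mentions is left in place.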

### 5.3.3. Removing Emoji

We now proceed to remove all emoji from the tweet_text. Technically, we could save these to a separate column and use them as features (for example, to augment our sentiment analysis). But to maintain a realistic scope for this project we will not do so here.

We will simply use the *emoji* library, which keeps an up-to-date record of all emoji Unicode code points, to remove the emoji from the tweet text bodies.

In [106]:
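def remove_emoji(tweet_text):
    # NB: get_emoji_regexp() is only available in older releases of the emoji library;
    # newer releases offer emoji.replace_emoji(tweet_text, replace='') instead
    try:
        return emoji.get_emoji_regexp().sub('', tweet_text)
    except TypeError:
        # non-string entry (e.g. NaN): return it unchanged instead of None
        return tweet_text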

In [107]:
def clean_tweet_text_column_emoji(df):
    df.tweet_text = df.tweet_text.apply(remove_emoji)
    return df

We had some trouble running this function on the entire ddf. After extensive debugging, a repartitioning with max. partition_size set to 50MB seems to have solved the problem, so we will conduct that here before continuing to pass the function to .map_partitions().

In [108]:
ddf

npartitions=369,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid,retweet_user_screen_name
"✴️نسدد المتعثرات لدى س…""",object,object,int64,int64,object,object,datetime64[ns],object,bool,object,float64,float64
1001882935384002560,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...
"تنظيم حفل زواج لاحد الموقوفين بحضور اهالي العروسين في سجن ذهبان في جده مع التكفل التام من قبل #أمن_…""",...,...,...,...,...,...,...,...,...,...,...,...
"🙊٥ - أخلا…""",...,...,...,...,...,...,...,...,...,...,...,...


In [109]:
ddf = ddf.repartition(partition_size="50MB")

In [110]:
ddf.shape[0].compute() == n_rows_arabic_und

True

In [111]:
ddf

npartitions=769,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid,retweet_user_screen_name
,object,object,int64,int64,object,object,datetime64[ns],object,bool,object,float64,float64
,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...


OK, ddf now has 769 partitions of max. 50MB each.

Let's run the removing-emoji function.

In [118]:
# apply function to all partitions
# NB: this takes ca. 10 min (with 50 workers) to run
# persist() returns immediately and the work runs in the background,
# which slows down subsequent operations until it has finished
ddf = ddf.map_partitions(clean_tweet_text_column_emoji).persist()

In [119]:
ddf.shape[0].compute() == n_rows_arabic_und

True

Excellent. Emoji: bye, bye!
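
For reference, a rough dependency-free approximation of this step can be written with the standard library's `unicodedata` module. This is only a sketch under stated assumptions and not a replacement for the *emoji* library: it drops every character in the Unicode categories "Symbol, other" (So) and "Symbol, modifier" (Sk), plus the variation selector and zero-width joiner used in emoji sequences. It will miss emoji that fall into other categories and will also remove non-emoji symbols such as `^`. The helper name `strip_symbols` is hypothetical.

```python
import unicodedata

# characters that glue emoji sequences together
EMOJI_JOINERS = {"\ufe0f", "\u200d"}  # variation selector-16, zero-width joiner


def strip_symbols(text):
    """Drop characters in categories So/Sk and emoji joiner characters."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) not in ("So", "Sk")
        and ch not in EMOJI_JOINERS
    )


print(repr(strip_symbols("🔥🔥 عسل ✅")))
print(repr(strip_symbols("a♻️b")))
```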

### 5.3.4. Removing Line Breaks

Let's replace line breaks ("\n") with a single space.

In [121]:
def remove_linebreaks(tweet_text):
    return re.sub(r'[\n]', " ", tweet_text)

In [122]:
def clean_tweet_text_column_linebreaks(df):
    df.tweet_text = df.tweet_text.apply(remove_linebreaks)
    return df

In [123]:
# inspect sample of ddf before removing linebreaks 
ddf.tweet_text.sample(frac=0.00005, random_state=15).compute()

Series([], Name: tweet_text, dtype: object)

In [124]:
# apply cleaning function to all partitions
ddf = ddf.map_partitions(clean_tweet_text_column_linebreaks).persist()

In [125]:
# inspect sample of ddf after removing linebreaks 
ddf.tweet_text.sample(frac=0.00005, random_state=15).compute()

tweetid
1001076400189857792     تميزي مع هاي كلاس باطقم السهره الرائعة ح   سا...
1000708048661504002    يدخل فريق #مونبلييه_الفرنسي فى مواجهة قوية مع ...
1002840636079460358    (فانتقمنا منهم فانظر كيف كان عاقبة المكذبين) [...
1002269876118056960         اطقم نسائي ناعمه   7    بأشكال مميززه  ول...
1003644144127021056     َ من عامل الله بالتقوى والطاعة في حال رخائه ،...
                                             ...                        
994701086123610114         برفووو ابوفيصل_سيبوني خلوني احللل  هههههههههه
997035000913645568      معمول أم صالح طبخ بيت √بالتمر √طعم لذيذ √ دقي...
997258402555285504      ساعة باتيك فيليب رجالي a  التوصيل خلال ساعه ا...
999115886828113925      #وزنك_زاد_ولا_باقي #مسابقه_زياد_الجهني4 #الدم...
998820191315296256      اقمشة دنهل   السعر 180 ريال  للتواصل واتس على...
Name: tweet_text, Length: 1770, dtype: object

Done.

### 5.3.5. Removing URLs
Let's remove all URLs from the tweet_text column.

In [126]:
pattern_URLs = r'http\S+'

In [127]:
def remove_URLs(tweet_text):
    return re.sub(pattern_URLs, "", tweet_text)

In [128]:
def clean_tweet_text_column_URLs(df):
    df.tweet_text = df.tweet_text.apply(remove_URLs)
    return df

In [129]:
# apply cleaning function to all partitions
ddf = ddf.map_partitions(clean_tweet_text_column_URLs).persist()

In [130]:
# inspect sample of ddf after removing URLs
ddf.tweet_text.sample(frac=0.00005, random_state=19).compute()

tweetid
1001247484260376578     Mail on Sunday | مانشستر سيتي يقترب من التوقي...
1000895874069823488     من فوايد علاج تدليك الاعصاب والضغط علي مسارات...
1002028867471757312     توفر من جديد6 لحاف ٩ قطعه مطرز نفرين لحاف لحا...
1002238276533653504     زعيم الأمة وقائدها خادم الحرمين الشريفين المل...
1004465349415395328    #ولي_العهد الأمير #محمد_بن_سلمان يستقبل ولي عه...
                                             ...                        
995058150373036032      اللي ينفع أحكيلهم مش هيفهموا واللي هيفهموا ما...
997275163912794112      تعتبرالخلوة التقنية الأكثر فائدة للوصول إلى ا...
996947883260874752     اللّهم  أنت الحليم فلا تعجل، وأنت الجواد فلا ت...
998913241244893184     عن أبي هريرة رضي الله عنه أن الرسول صلى الله ع...
999835329266962434       سنابك احلى  لديك سناب وتبى مشاهدات عاليه واض...
Name: tweet_text, Length: 1770, dtype: object

Done.
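
Each of the regex-based steps above runs as its own pass over all partitions. As an alternative sketch, the same substitutions could be chained in a single function so that the text is only traversed once per tweet. The function `clean_text` is hypothetical; it reuses the notebook's patterns in the notebook's order, leaves out the emoji step, and adds one step that is not in the notebook (collapsing runs of whitespace), which is labelled as such in the code.

```python
import re

PATTERN_RT_MENTION = re.compile(r"RT @([^:]+):")
PATTERN_MENTIONS = re.compile(r"@\S+")
PATTERN_URLS = re.compile(r"http\S+")
PATTERN_WHITESPACE = re.compile(r"\s+")


def clean_text(text):
    """Apply the regex cleaning steps of Section 5.3 in one go."""
    text = PATTERN_RT_MENTION.sub("", text)   # retweet mentions
    text = PATTERN_MENTIONS.sub("", text)     # remaining mentions
    text = text.replace("\n", " ")            # line breaks
    text = PATTERN_URLS.sub("", text)         # URLs
    # extra step, not in the notebook: collapse whitespace runs and trim
    return PATTERN_WHITESPACE.sub(" ", text).strip()


sample = "RT @ghj__123: hello\nworld https://t.co/FeWYAPyJZF @friend"
print(clean_text(sample))
```

A function like this could be passed to `.apply()` inside a single partition-level function, in the same way as the individual cleaning functions above.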

## Progress Check
For an intermediate sanity check, let's sample our dask dataframe to see how we're doing.

In [131]:
df_temp = ddf.tweet_text.sample(frac=0.0001, random_state=1).compute()

In [132]:
df_temp.shape

(3589,)

In [133]:
df_temp.sample(25, random_state=3)

tweetid
934789324620607489      #الكدس_ليس_كضيتنا #الهلال_اوراورا  السلام علي...
1175890834987847684     بمناسبة #اليوم_الوطني_٨٩ بادر بالاشتراك بعروض...
932723901469220864       عروض مميزة  بوفيه مفتوح    افطار يومي 52 ريا...
1151400910397906945     وصل الأمر بنشأت الديهي وقناة ten عمل حلقة كام...
1081511593001795584     لاننسي هذا البطل أبداً سياده المشير حسين طنطا...
1107306700493737986     إذا كنت تبحث عن #تصميم #سيرة_ذاتية فألق نظرة ...
811126400774242304      وداعاً لمشاكل #القولون #الانتفاخات اتصل بنا 0...
729138548767637504      عسل ربيع نجد نادر جدا وقد بذلت جهدا كبيرا من ...
939134949755707392     اللهم إني أعوذ بك من الهم والحزن والعجز والكسل...
1115653481686286337     خادمات وجميع المهن المنزلية من الفلبين وصول ٤...
840745144731238400      ززيت الشعر الافغاني الأصلي٪  ينعم يكثف يطول ي...
671989097423118336      قبلة الصباح كفيله .. أن تجعل من المستحيل ممكن...
1191035821819871232     بحمد الله وتوفيقة تم توقيع شراكة استراتيجية م...
612385868062416897      خدم البلاط يقدم للم

Still to clean:
- digits
- some non-character icons
- some random latin characters
- hashtags

We could also try removing any non-Arabic text, but that would remove the hashtag symbols and underscores too, and we need to process those first.
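
As an illustration of what such a filter might look like once the hashtags have been dealt with, here is a sketch that keeps only characters from the Unicode Arabic block, whitespace, `#` and `_`. The helper `strip_non_arabic` is hypothetical. Note that Arabic-Indic digits (٠ to ٩) and Arabic punctuation sit inside that block and would therefore be kept.

```python
import re

# everything that is NOT Arabic-block, '#', '_' or whitespace
NON_ARABIC = re.compile(r"[^\u0600-\u06FF#_\s]")


def strip_non_arabic(text):
    """Remove Latin letters, Western digits and other non-Arabic characters."""
    return NON_ARABIC.sub("", text)


cleaned = strip_non_arabic("عرض خاص10 قماش X #بلاك_فايتر 180")
print(cleaned.split())
```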

## 5.4. Hashtags

How to deal with the hashtags? This is not a straightforward problem, and there are several ways we could approach it. We could: 

1. Remove the hashtag symbols and underscores and keep the words of the hashtags in the tweet_text body
2. Move the hashtags out of the tweet_text body and save them as a separate feature

We will opt for a combination of the two. We will first copy the hashtags to a separate column and save them as a separate feature, i.e. each tweet will have a feature that is a list of the hashtags (text only) that were included in that tweet. We will then remove the hashtag symbols and underscores from the tweet_text body **but maintain the words themselves**.

We are opting for this approach for two reasons:
1. To accommodate instances in which hashtags double as part of the tweet sentence itself, e.g. "I didn't know #machinelearning worked in #arabic!"
2. Because we are ultimately interested in topic modelling and the text in the hashtags is likely to contain a strong signal about the topic each tweet belongs to.

To play devil's advocate here - and illustrate that there's always more than one way to approach a problem - we could also think of a good reason *not* to keep the hashtag text in the tweet_text column: hashtags are, in some sense, a rudimentary form of topic modelling already. By including them in the tweet_text, we may be overpowering the signal that's in the tweet_text body itself. In other words, our topic modelling algorithm might simply cluster tweets according to the hashtags they contain.

We will be aware of this moving forward. Should this problem arise, we can always remove the hashtag texts from the tweet_text column at a later stage, since we have the hashtag texts saved in a separate column.

In [134]:
# inspect sample of ddf  
df_temp = ddf.partitions[0].sample(frac=0.001, random_state=1).compute()

In [135]:
df_temp.tweet_text.sample(15, random_state=6)

tweetid
1000621618530529281     !#تسديد_قروض الراجحي الاهلي 22 راتب سامبا الا...
1000378419681644544     #استقدام_عامله_منزليه #فيتنام 20 يوم #الفلبين...
1000690962451124224    أهم ما جاء من عناوين في الصحف السعودية الصادرة...
1000487400529891328     مجموعة #العناية_بالشعر منتجات تحتوي على #زبدة...
1001152483434160128     جديدناا️5 اطقم رجاليه حديد ساعه رجالي حديد فخ...
1000903312978382849               اللهم أعذنا من عذاب القبر وعذاب جهنم  
1000803590825488384     تنفيذ المشبات الحجر والرخام أفران_قرميد_طوب-ش...
1000975849192075264     كريم #بلاك_فايتر الاصلي  لتاخير القذف بدون اي...
1001191420852690944     يارب إن للصائم دعوة مستجابة عند فطره ، فأجعله...
1000523587747500032     #الاتحاد_اكبر_عقد_رعايه #جوايز_السعوديه9 #ليف...
1000440820821057537     عرض خاص10 قماش دنهال مع علبه وكيس دنهال صناعه...
1000859060005154816     شاحن واير لس من انكر يدعم الايفون 8 و ايفون X...
1001414469136904193     دنيا و مامتها بيعيدوا علي جورج سيدهم علشان عي...
1000385385208332289     ساعة باتيك فيليب سا

In [136]:
df_temp.tweet_text.sample(15, random_state=7)

tweetid
1001191420852690944     يارب إن للصائم دعوة مستجابة عند فطره ، فأجعله...
1001125595206561792         ┊┊┊┊┊مبدع ┊┊┊┊مميز ┊┊┊مغرد ┊┊متألق       ...
1000975849192075264     كريم #بلاك_فايتر الاصلي  لتاخير القذف بدون اي...
1001511046102822913     البخاخ الالماني #دراجون مؤخر القذف من اول مره...
1000050679422029826                                                     
1000803590825488384     تنفيذ المشبات الحجر والرخام أفران_قرميد_طوب-ش...
1000808319865499649     شاحن متعدد معه ماطور هواء للكفر يشحن  جميع ال...
1000621618530529281     !#تسديد_قروض الراجحي الاهلي 22 راتب سامبا الا...
1000729335903514624     ترا المجامل بعض الاحيان مقبول يدخـل تحت منهـج...
1001443851427606528     #تريكة_أسطورة_عربية  #صيفنا_في_تركيا_اخطر  #ع...
1000859060005154816     شاحن واير لس من انكر يدعم الايفون 8 و ايفون X...
1001560860110802944     ☜أقوى برنامج تخسيس في أوروبا والشرق الأوسط  ك...
1000848835395059712                                ملامح في عُمق القلب*.
1000440820821057537     عرض خاص10 قماش دنها

In [138]:
df_temp.loc['1001443851427606528'].tweet_text

' #تريكة_أسطورة_عربية  #صيفنا_في_تركيا_اخطر  #عرس_الجميله_والوحش  #شركة _غسيل_سجاد #شركة_تنظيف_مجالس #شركة_تنظيف_كنب تعقيم و…'

In [139]:
df_temp.loc['1000975849192075264'].tweet_text

' كريم #بلاك_فايتر الاصلي  لتاخير القذف بدون اي آثار جانبيه للطلب 0542835579 السعر 180 ريال '

**NOTE:** Some of these hashtags are quite long. In the first example above, up to 6 words, separated by underscores.

### 5.4.1. Move Hashtags to New Column

We will use a regex to find the hashtags and move them to a new column.

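Although Arabic is displayed from right to left, the characters of a string are stored in reading order, so the `#` still comes immediately before the hashtag text. The regex is therefore the same as it would be for English text: a `#` that is not preceded by a non-whitespace character, followed by one or more non-whitespace characters.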

In [140]:
# pattern to match hashtags: a '#' at the start of a token, followed by non-whitespace characters
pattern_hashtags = r'((?<!\S)#\S+)+'
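
As a quick sanity check, here is a minimal sketch of what the pattern does and does not match. The sample strings are made up for illustration (the Arabic one borrows words from a tweet in our data); only `pattern_hashtags` comes from the notebook.

```python
import re

pattern_hashtags = r'((?<!\S)#\S+)+'

# hypothetical sample strings, one Arabic and one English
arabic_sample = 'انهيار الليرة التركية #تركيا #أردوغان'
english_sample = 'price#1 is #good_value'

# findall returns the captured group, i.e. the full hashtag including '#'
arabic_tags = re.findall(pattern_hashtags, arabic_sample)
english_tags = re.findall(pattern_hashtags, english_sample)

print(arabic_tags)
print(english_tags)
```

The lookbehind `(?<!\S)` only accepts a `#` at the start of a token, so `price#1` is not picked up, while the same pattern handles both the Arabic and the English hashtags.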

In [141]:
re.sub(pattern_hashtags, 'hash', df_temp.tweet_text.loc['1000975849192075264'])

' كريم hash الاصلي  لتاخير القذف بدون اي آثار جانبيه للطلب 0542835579 السعر 180 ريال '

This pattern works. Let's apply it to our df_temp:

In [142]:
def find_hashtags(tweet_text):
    hashtag_list = re.findall(pattern_hashtags, tweet_text)
    if len(hashtag_list) == 0:
        return np.nan
    else:
        return hashtag_list

In [143]:
df_temp['hashtags'] = df_temp.tweet_text.apply(find_hashtags)

In [144]:
df_temp.hashtags.sample(15)

tweetid
1000975849192075264                                        [#بلاك_فايتر]
1000164119008497664    [#مصمم_ديكور, #ديكورات, #نيو_كلاسيك, #تصميم_دا...
1000523587747500032    [#الاتحاد_اكبر_عقد_رعايه, #جوايز_السعوديه9, #ل...
1001241675736469504                                                  NaN
1000050679422029826                                                  NaN
1001167653975789569                                  [#سروال_التورمالين]
1000162954510327810                                                  NaN
1001303244759535617    [#القولون_الهضمي, #الإمساك, #جرثومة_المعدة, #ا...
1000839996901871617                                                  NaN
1000487400529891328               [#العناية_بالشعر, #زبدة_الشيا, #الشيا]
1001560860110802944                                                  NaN
1000690962451124224    [#الارصاد, #اعصار_مكونو, #السعودية, #الصحف_الس...
1001443851427606528    [#تريكة_أسطورة_عربية, #صيفنا_في_تركيا_اخطر, #ع...
1001152483434160128                        

In [243]:
df_temp.loc['1003015938629951488'].tweet_text

'انهيار الليرة التركية بعد تدخلات أردوغان.. وحجم الديون يتزايد! #تركيا #أردوغان #الليرة_التركية #الاقتصاد_التركي #هاشتاجات_العرب '

In [244]:
df_temp.loc['1003015938629951488'].hashtags

['#تركيا',
 '#أردوغان',
 '#الليرة_التركية',
 '#الاقتصاد_التركي',
 '#هاشتاجات_العرب']

**NOTE:** The formatting of .sample() and .head() is off, so that the # appears on the left of the word instead of on the right. This is just a display issue and not an actual problem with the data, but it is important to keep in mind.

**NOTE:** These two randomly extracted tweets illustrate important things about the contents of the tweets in our dataset. The first one is an ad for laser eye-correction, the second is clearly political and critical of Erdogan. It translates roughly as 

> *The collapse of the Turkish lira after Erdogan's interventions ... and the volume of debt is increasing!*

Great, this is working for df_temp. Let's expand and apply to our entire ddf.

In [146]:
def extract_hashtags(df):
    df['hashtags'] = df.tweet_text.apply(find_hashtags)
    return df

In [147]:
ddf = ddf.map_partitions(extract_hashtags).persist()
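
To make the per-partition idea concrete: `map_partitions` calls the function once per partition, and each partition is an ordinary pandas dataframe. The sketch below imitates this with plain pandas on made-up data, without Dask or a cluster. The names `toy` and `chunks` are hypothetical, and this variant uses `assign` and returns `None` (instead of `np.nan`) for tweets without hashtags, so it is an illustration rather than the exact code above.

```python
import re
import pandas as pd

pattern_hashtags = r'((?<!\S)#\S+)+'

def find_hashtags(tweet_text):
    # return the list of hashtags, or None when the tweet has none
    found = re.findall(pattern_hashtags, tweet_text)
    return found if found else None

def extract_hashtags(df):
    # assign returns a new dataframe and leaves the incoming chunk untouched
    return df.assign(hashtags=df['tweet_text'].apply(find_hashtags))

# hypothetical toy data
toy = pd.DataFrame({'tweet_text': ['#a_b first', 'no tags', 'x #c', '#d #e']})

# stand-in for Dask partitions: two pandas chunks, processed independently
chunks = [toy.iloc[:2], toy.iloc[2:]]
result = pd.concat([extract_hashtags(chunk) for chunk in chunks])
print(result)
```

Returning a new dataframe rather than modifying the partition in place keeps the function free of side effects, which is generally the safer choice for functions that a scheduler may run more than once.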

In [148]:
ddf.sample(frac=0.00001, random_state=1).compute()

tweetid,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid,retweet_user_screen_name,hashtags
1009346032596070401,CGglXh1nikyRGa29EEe1F2FAhhkOQ0Z6OlKB5vmS8=,CGglXh1nikyRGa29EEe1F2FAhhkOQ0Z6OlKB5vmS8=,4540,4732,ar,#عروض #الليزك #اللازك #تصحيح_النظر #تشطيب_الن...,2018-06-20 08:03:00,Twitter for Android,True,,1.009167e+18,o4g_vv,"[#عروض, #الليزك, #اللازك, #تصحيح_النظر, #تشطيب..."
1011524952007331840,urO+NoCFhGNMfK3o4hfFvmKE9W3swxGbRkMLvanBUz8=,urO+NoCFhGNMfK3o4hfFvmKE9W3swxGbRkMLvanBUz8=,1693,629,ar,لا,2018-06-26 08:21:00,Twitter for Android,False,,,,
1015996262838087680,nU+LdjgIh5dXaoPkSqQn1BdPRpiqC6mXXCYaCRSRI8=,nU+LdjgIh5dXaoPkSqQn1BdPRpiqC6mXXCYaCRSRI8=,2701,4232,ar,“وَلَوْ أهدَوْا كنوز الكونِ .. فلا تعني لنا أ...,2018-07-08 16:29:00,Twitter for iPhone,True,,1.015167e+18,ad_asd85,[#تويت_فلسطينى]
1018691265460875266,765562015180001280,3366Zss,5774,6543,ar,• - قال العلامة ابن عثيمين • - عليه رحمات رب ...,2018-07-16 02:58:00,Twitter for iPhone,True,,1.018450e+18,13_mrm,
1019273039018692613,xSOhWuOuq1LzRUfZ9Mm3SS2fmHBeDktWxBVDBRzNvQ=,xSOhWuOuq1LzRUfZ9Mm3SS2fmHBeDktWxBVDBRzNvQ=,866,49,ar,#أحيانا تحتاج قلوبنا گل فترة أن ننفضها مثل م...,2018-07-17 17:30:00,Twitter for iPhone,True,,1.019259e+18,hanan_13_DZ,[#أحيانا]
...,...,...,...,...,...,...,...,...,...,...,...,...,...
960941436261863425,480758910,mutfaell1,814247,663230,ar,تعاني من سمنة مفرطة لا تستطيع تحمل أنظمة الح...,2018-02-06 18:21:00,Twitter for Android,True,,9.609338e+17,qh4O9oDZSorckZH,
981247832148664320,948302862098092034,y_44a_,9007,8821,ar,#تسديد_قروض_بنكية #الرياض_جميع_المناطق الراج...,2018-04-03 19:11:00,Twitter for iPhone,True,,9.806776e+17,tsdeed_qd,"[#تسديد_قروض_بنكية, #الرياض_جميع_المناطق, #سدا..."
984131614698541056,QnYdRpeAMpwbn7LLxUNW7jtCKAwjAK8Li0MotarE324=,QnYdRpeAMpwbn7LLxUNW7jtCKAwjAK8Li0MotarE324=,3967,5079,ar,#التمر يحتوي على كالسيلينيوم والمنغنيز والنحا...,2018-04-11 18:10:00,Twitter for iPhone,True,,9.838218e+17,lcb6s,[#التمر]
985640394670174215,448366650,hassin1937,30529,7265,ar,المهذري يهذري مع شديد الصخونه سبعة سنين الاه...,2018-04-15 22:05:00,Twitter for iPhone,True,,9.855993e+17,yaser_515,


This looks like it worked.

Let's double-check by having a look at the tweet_text for entries that have NaN in the hashtags column:

In [149]:
text = ddf.loc['1005851382082166785'].tweet_text.compute()

In [150]:
text.iloc[0]

' اللهم انت ربي لا اله الا انت خلقتني وانا عبدك وانا على عهدك ووعدك ما استطعت اعوذ بك من شر ماصنعت ابوء لك بنعمتك علي وابوء بذن…'

In [151]:
re.findall(r'#', text.iloc[0])

[]

Perfect.
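
The spot check above looks at a single tweet. A broader check, sketched here with plain pandas on made-up rows, is to select every row without hashtags and look for any remaining `#`. One thing this would surface: our pattern only matches a `#` at the start of a token, so a `#` in the middle of a word is not extracted. The dataframe and its contents are hypothetical.

```python
import pandas as pd

# hypothetical rows: the hashtags column holds a list or a missing value
df = pd.DataFrame({
    'tweet_text': ['no tags here', 'see #tag', 'mid#word only'],
    'hashtags': [None, ['#tag'], None],
})

# rows for which no hashtag was extracted
missing = df[df['hashtags'].isna()]

# of those, rows whose text still contains a literal '#'
suspicious = missing[missing['tweet_text'].str.contains('#', regex=False)]
print(suspicious)
```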

### 5.4.2. Removing Hashtags and Underscores 

Let's replace any hash symbols (#) and underscores with a space in the tweet_text column. In the hashtags column, we remove the # and replace only the underscore with a space.

This way we keep the words of the hashtags (to go into our topic modelling) but remove the non-letter characters.

In [152]:
# define function to sub underscores and hashtags with space
def remove_hashtags_underscores(tweet_text):
    return re.sub(r'[_#]', " ", tweet_text)
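
A small sketch of the two treatments side by side, using a phrase from one of the tweets above. The helper `clean_hashtag` is a hypothetical name introduced only for this illustration.

```python
import re

def remove_hashtags_underscores(tweet_text):
    # in the tweet text, both '#' and '_' become a space
    return re.sub(r'[_#]', " ", tweet_text)

def clean_hashtag(hashtag):
    # in the hashtags column, '#' is dropped and '_' becomes a space
    return hashtag.replace('#', '').replace('_', ' ')

text = 'كريم #بلاك_فايتر الاصلي'
cleaned_text = remove_hashtags_underscores(text)
cleaned_tag = clean_hashtag('#بلاك_فايتر')

print(cleaned_text)
print(cleaned_tag)
```

Because `#` is replaced by a space rather than removed, the cleaned text can contain double spaces. That is harmless for tokenisation on whitespace.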

In [153]:
# define function to clean each partition
def clean_hashtags_underscores(df):
    # replace '#' and '_' with spaces in the tweet text
    df.tweet_text = df.tweet_text.apply(remove_hashtags_underscores)

    # clean each hashtag list in place; rows without hashtags hold NaN and are skipped
    for list_hashtags in df.hashtags:
        if isinstance(list_hashtags, list):
            for i, hashtag in enumerate(list_hashtags):
                # drop the '#' and turn underscores into spaces
                list_hashtags[i] = hashtag.replace('_', ' ').replace('#', '')
    return df

Let's test on a subset of ddf first.

In [154]:
df_temp = ddf.sample(frac=0.0001, random_state=1).compute()

In [155]:
df_temp2 = df_temp.copy()
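
One caveat when testing on a copy: `DataFrame.copy()` copies the column data, but Python objects stored in a column (such as our hashtag lists) are not copied recursively, so both dataframes still point at the same lists. The sketch below shows this on made-up data, together with a non-mutating alternative. `clean_tags` and the toy dataframes are hypothetical names for illustration.

```python
import pandas as pd

df = pd.DataFrame({'hashtags': [['#حمام_مغربي'], None]})
df_copy = df.copy()

# edit the list in place through the copy
df_copy['hashtags'].iloc[0][0] = 'حمام مغربي'

# both dataframes hold a reference to the very same list object
shared = df['hashtags'].iloc[0] is df_copy['hashtags'].iloc[0]
print(shared, df['hashtags'].iloc[0])

def clean_tags(tags):
    # build a new list instead of editing the existing one
    if isinstance(tags, list):
        return [t.replace('#', '').replace('_', ' ') for t in tags]
    return tags

fresh = pd.DataFrame({'hashtags': [['#حمام_مغربي'], None]})
cleaned = fresh.assign(hashtags=fresh['hashtags'].apply(clean_tags))
print(fresh['hashtags'].iloc[0], cleaned['hashtags'].iloc[0])
```

For the test here this does not change the outcome, since we only inspect the cleaned copy. It matters if the original sample is reused afterwards.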

In [158]:
df_temp2.loc['859571653029879808']

userid                                                              499971841
user_screen_name                                               Alwatansupport
follower_count                                                         115985
following_count                                                        102456
tweet_language                                                             ar
tweet_text                   شارك معنا بالسناب واربح جلسة مساج مجاناً يوم ...
tweet_time                                                2017-05-03 00:53:00
tweet_client_name                                         Twitter for Android
is_retweet                                                               True
retweet_userid                                                            NaN
retweet_tweetid                                          777445506121433088.0
retweet_user_screen_name                                         knights_rest
hashtags                         [#مساج, #حمام_مغربي, #مسابقة_تص

In [159]:
clean_hashtags_underscores(df_temp2);

In [161]:
df_temp2.loc['859571653029879808']

userid                                                              499971841
user_screen_name                                               Alwatansupport
follower_count                                                         115985
following_count                                                        102456
tweet_language                                                             ar
tweet_text                   شارك معنا بالسناب واربح جلسة مساج مجاناً يوم ...
tweet_time                                                2017-05-03 00:53:00
tweet_client_name                                         Twitter for Android
is_retweet                                                               True
retweet_userid                                                            NaN
retweet_tweetid                                          777445506121433088.0
retweet_user_screen_name                                         knights_rest
hashtags                             [مساج, حمام مغربي, مسابقة ت

In [162]:
df_temp2.loc['859571653029879808'].hashtags[2]

'مسابقة تصوير'

OK, this works. Let's map to our ddf partitions:

In [163]:
ddf = ddf.map_partitions(clean_hashtags_underscores).persist()

In [164]:
ddf.partitions[0].hashtags.head(10)

tweetid
1000000000447930368                                                  NaN
1000000030391095297    [للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...
1000000039362662400    [مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا...
1000000054911033344                                                  NaN
1000000204865789954                                                  NaN
1000000215598891008    [تخفيضات, دانة المسك للعود, الرياض, الدائري ال...
1000000242165714944                  [سرطان الثدي, سرطان البنكرياس, سر…]
1000000262315094022                                   [مع السفرة, حلقا…]
1000000271492288512                                         [تسديد قروض]
1000000325586169856                                                  NaN
Name: hashtags, dtype: object

Done.

## 5.5. Final Cleaning Tweet_Text Column

Now that we've dealt with hashtags, we still have to clean:
- digits
- some non-letter icons
- some random Latin characters

We could also try doing this all in one go by removing any non-Arabic text.
Let's try that on a subset of ddf first.

In [165]:
df_temp = ddf.sample(frac=0.0001, random_state=21).compute()

In [166]:
df_temp.tweet_text.sample(15, random_state=2)

tweetid
1093296731670921217     اللَّهُمَ أنْت رَبي لا إلَهَ إلا أَنْت خَلَقت...
926446895194615808      عروض بداية العام الجديد   ⁧ الضرم750⁩ريال  1-...
1113744665583472640     خادمات وجميع المهن المنزلية من الفلبين وصول ٤...
1157840291140968448     كل شْيُء أتَرُكه خِلْفيُ لْأَ أعود إلْيُهُ ، ...
708589958542573568      مخيم VIP  بروضة خريم   للحجز : 0533303874   ر...
505748895985463296      هناك أموات لم تمت كلماتهم وهناك أحياء لم نسمع...
1060597280389115904     بمناسبة قدوم فصل الشتاء  عزل اسطح خزانات مساب...
1104441862587322370     جعلوا من الخوف طاعة .. ومن الجوع قناعة .! ظلا...
806384247909076993      عملائنا الكرام  الآن ولفترة محدودةاستفيدوا من...
899288848701808641      طرق بسيطة وفعالة في خلق جو من الاختلاف في منز...
835842101141192704       شركة تنظيف فلل0509760904   شركة تنظيف شقق   ...
698613681282772994                  نشالين الترند حفله السهرة يلا شاركو 
738825767203840001      كفر LuMee للايفون 6 و ايفون 6p  متوفر بالاسود...
1058114883923906562     وداعاً لمشاكل  القو

We're using a regex pattern found in [this SO thread](https://stackoverflow.com/questions/11323596/regular-expression-for-arabic-language) to eliminate all non-Arabic characters from the tweet_text bodies.

In [167]:
pattern_notArabic = r'[^\u0620-\u065f]+'
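
To see what this range keeps and drops, here is a minimal sketch on three short strings taken from tweets above. The range U+0620 to U+065F covers the core Arabic letters and the diacritics, but not the Arabic-Indic digits (U+0660 to U+0669) or extended letters such as گ (U+06AF), so those are replaced as well.

```python
import re

pattern_notArabic = r'[^\u0620-\u065f]+'

def strip_non_arabic(text):
    # every run of characters outside the range collapses into one space
    return re.sub(pattern_notArabic, ' ', text)

latin_and_digit = strip_non_arabic('كفر LuMee للايفون 6')
arabic_indic_digit = strip_non_arabic('وصول ٤')
extended_letter = strip_non_arabic('گل')

print(repr(latin_and_digit))
print(repr(arabic_indic_digit))
print(repr(extended_letter))
```

The last case is worth keeping in mind: a word written with a letter outside the range loses that letter, which slightly changes the token.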

In [168]:
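def remove_nonArabic_characters(tweet_text):
    # replace every run of non-Arabic characters with a single space
    try:
        return re.sub(pattern_notArabic, ' ', tweet_text)
    except TypeError:
        # non-string input (e.g. NaN) cannot be cleaned; return None
        return None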


In [169]:
def tweet_text_finalsweep(df):
    df.tweet_text = df.tweet_text.apply(remove_nonArabic_characters)
    return df

In [170]:
# remove all non-Arabic characters
df_temp.tweet_text = df_temp.tweet_text.apply(remove_nonArabic_characters)

In [171]:
df_temp.tweet_text.sample(15, random_state=2)

tweetid
1093296731670921217     اللَّهُمَ أنْت رَبي لا إلَهَ إلا أَنْت خَلَقت...
926446895194615808      عروض بداية العام الجديد الضرم ريال السدر ريال...
1113744665583472640     خادمات وجميع المهن المنزلية من الفلبين وصول ي...
1157840291140968448     كل شْيُء أتَرُكه خِلْفيُ لْأَ أعود إلْيُهُ قُ...
708589958542573568       مخيم بروضة خريم للحجز روضة خريم الاهلي القادسيه
505748895985463296      هناك أموات لم تمت كلماتهم وهناك أحياء لم نسمع...
1060597280389115904     بمناسبة قدوم فصل الشتاء عزل اسطح خزانات مسابح...
1104441862587322370     جعلوا من الخوف طاعة ومن الجوع قناعة ظلام في ا...
806384247909076993      عملائنا الكرام الآن ولفترة محدودةاستفيدوا من ...
899288848701808641      طرق بسيطة وفعالة في خلق جو من الاختلاف في منز...
835842101141192704      شركة تنظيف فلل شركة تنظيف شقق تنظيف مجالس تنظ...
698613681282772994                  نشالين الترند حفله السهرة يلا شاركو 
738825767203840001      كفر للايفون و ايفون متوفر بالاسود والذهبي وال...
1058114883923906562     وداعاً لمشاكل القول

In [172]:
df_temp.tweet_text.sample(15, random_state=14)

tweetid
818754471006236672      ليس النجاح القوي ضحية ظروفه فهو يخلق الظروف ا...
1170898381453090817     هذا الـ فيديو يستحق ان يراه كل موظف وكل مدير ...
1113414510348525570     المبدعونللبحث العلمي للاستفسار اضغط بروبوزال ...
1102568060810977280     اقسم بالله انه عسل سدر طبيعي من منحلي الخاص ف...
848816215049084928      لدينا تمويل عقاري بدون دفعة أولى وسداد المديو...
1200824869224628224     ياشاري الدلّه وبيض الفناجيل دوّر لها مجلس على...
1026775850195275776     الشخصية المحورية في الحملة الكندية التي استهد...
1105573619760025600     لانكوم لا نوت تريزور أ لا فولي لو عطر دافئ بر...
560025686904012803      بعض السعوديين اللي بالتويتر يشوفون الفلو من ب...
1104485410053459968     سهلنا عليك اختيارك العائلي خدمات فندقية جلسات...
1125522733784285185     ساعة هوبلو رجالي التوصيل خلال ساعه الرياض خار...
793699789636505600      تسديد قروض بنكية الراجحي راتب الاهلي راتب الع...
1104146812078837760     سؤال السحب على السيارة الثانية مع شرط متابعة ...
825413554232840193      الحين تَرجع لي وتَن

Done, now let's apply to ddf:

In [174]:
ddf = ddf.map_partitions(tweet_text_finalsweep).persist()

In [175]:
ddf.tweet_text.sample(frac=0.001, random_state=7).head(10)

tweetid
1000858610753327104      دعاء الجلسة بين السجدتين رب اغفر لي رب اغفر لي 
1000646766260256768     شركة تنظيف منازل بالسليمانيه شركه غسيل خزانات...
1000542120820822016     عيونك أجمل وأحلي بدون نظارة طبية مع تصحيح الن...
1000296152816865281     حساب مهتم بالسياحه بالمملكه و خاصه ابها وظايف...
1001182832788729857     العيد مابقى عليه الا اقل من شهر مو متحمسه تطل...
1000170398829498368     كفرات لويس فيتون ماركة وجودة عاليه جدا لجميع ...
1000896205193469957     مهما حصلك في حياتك خليك دايما واثق ان ربنا شا...
1000929696480223232     لدينا طاقم من أفضل الفنيين والمعدات في خدمتكم...
1000214068442132482    اللهم إني أسألك علماً نافعاً ورزقاً طيباً وعمل...
1000972633847058432     الدعاء لجنودنا في شهر رمضان شركة تنظيف قصور ش...
Name: tweet_text, dtype: object

## 5.6. Drop Rows with Empty Tweet_Text Columns

We have removed all:
- emojis
- mentions
- URLs
- hashtags
- digits
- non-Arabic characters

We can now proceed to drop any rows that have an empty tweet_text column. These should be the entries that have tweet_language = 'und' (undefined) and that contained only emojis, mentions and/or URLs.

These entries will contain only spaces (rather than totally empty strings). Let's write a regex pattern that matches any entry consisting only of zero or more whitespace characters.

In [176]:
# regex pattern
pattern_whitespaces = r'^\s*$'
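
A quick sketch of how this pattern compares with a plain equality check against a single space. The sample strings are made up. The point is that `== ' '` only catches one specific case, while the pattern also covers empty strings and longer runs of whitespace.

```python
import re

pattern_whitespaces = r'^\s*$'

# hypothetical samples: empty, one space, several spaces, tab + space, real text
samples = ['', ' ', '   ', '\t ', ' نص ']

only_whitespace = [bool(re.match(pattern_whitespaces, s)) for s in samples]
single_space = [s == ' ' for s in samples]

print(only_whitespace)
print(single_space)
```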

Let's test on a partition first.

In [177]:
# test on a single partition (index 1)
df_0 = ddf.partitions[1].compute()

In [178]:
df_0.tweet_text.isnull().sum()

0

In [179]:
df_0[df_0.tweet_text == ' ']

tweetid,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid,retweet_user_screen_name,hashtags
1001597814097567745,988400378797535232,maram__2002,16937,16823,und,,2018-05-29 22:55:00,Twitter for iPhone,True,,9.971469e+17,badre72,
1001599017569193984,kpeiRc5FjzPFD6lb5jNnUMi9wNYGNh2U3cjP6JEj3I=,kpeiRc5FjzPFD6lb5jNnUMi9wNYGNh2U3cjP6JEj3I=,1777,765,und,,2018-05-29 22:59:00,Twitter for iPhone,False,,,,
1001599162415280134,kpeiRc5FjzPFD6lb5jNnUMi9wNYGNh2U3cjP6JEj3I=,kpeiRc5FjzPFD6lb5jNnUMi9wNYGNh2U3cjP6JEj3I=,1777,765,und,,2018-05-29 23:00:00,Twitter for iPhone,False,,,,
1001600008381222912,N7vyR3Dc33h4hsFz7AN8UsMWS9KFrO6+xwoaY2PCEnE=,N7vyR3Dc33h4hsFz7AN8UsMWS9KFrO6+xwoaY2PCEnE=,77,663,und,,2018-05-29 23:03:00,Twitter for iPhone,False,,,,
1001600512259698688,OjC1wjnPAgRjJBgqeQHJL+OLlbzy0OdfKtWW0FVX4q0=,OjC1wjnPAgRjJBgqeQHJL+OLlbzy0OdfKtWW0FVX4q0=,3526,4287,und,,2018-05-29 23:05:00,Twitter for iPhone,True,,1.001586e+18,othmanalkamees,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1003366640430600192,948302862098092034,y_44a_,9007,8821,und,,2018-06-03 20:03:00,Twitter for iPhone,True,,9.609943e+17,Albait_alraqi,
1003366651948097536,948302862098092034,y_44a_,9007,8821,und,,2018-06-03 20:03:00,Twitter for iPhone,True,,9.988897e+17,alzawq,
1003366657149022208,948302862098092034,y_44a_,9007,8821,und,,2018-06-03 20:03:00,Twitter for iPhone,True,,1.000184e+18,alzawq,
1003367764478787584,urO+NoCFhGNMfK3o4hfFvmKE9W3swxGbRkMLvanBUz8=,urO+NoCFhGNMfK3o4hfFvmKE9W3swxGbRkMLvanBUz8=,1693,629,und,,2018-06-03 20:08:00,Twitter for Android,False,,,,


In [180]:
# create mask that contains all entries where tweet text matches regex pattern, i.e. only whitespaces
mask = df_0.tweet_text.str.contains(pattern_whitespaces)

In [181]:
# use mask to maintain only entries that do NOT contain only whitespaces
df_0 = df_0[~mask]

In [182]:
# verify
df_0[df_0.tweet_text == ' ']

tweetid,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid,retweet_user_screen_name,hashtags


This works. Let's apply to ddf:

In [183]:
# define function
def remove_empty_tweets(df):
    mask = df.tweet_text.str.contains(pattern_whitespaces)
    df = df[~mask]
    return df

In [184]:
# apply across all partitions
ddf = ddf.map_partitions(remove_empty_tweets).persist()

In [185]:
# verify
ddf[ddf.tweet_text == ' '].compute()

tweetid,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid,retweet_user_screen_name,hashtags


Excellent, this has worked.

Let's see how many rows we have left now.

In [186]:
ddf.shape[0].compute()

35347002

Just over 35 million tweets left. After the final sweep, their text contains only Arabic-script characters. That's still more than enough to work with.

Let's save this cleaned df to our s3 bucket as parquet.

In [187]:
ddf.to_parquet('s3://twitter-saudi-us-east-2/interim/ddf_clean_without_empty_tweet_text_rows.parquet',
               engine='pyarrow')

# 6. Create Username Reference Table

We will create a reference table containing:
- unique users (indexed from 0 to n_users)
- their Twitter user_id, if available
- their user_screen_name
- their number of followers, if available (!!)
- their number of following, if available (!!)
- whether the user was flagged by Twitter or not

(!!) These won't be available for users who have only been retweeted by flagged users but who were not themselves flagged and removed by Twitter (at least not in this batch).


In [188]:
ddf.columns

Index(['userid', 'user_screen_name', 'follower_count', 'following_count',
       'tweet_language', 'tweet_text', 'tweet_time', 'tweet_client_name',
       'is_retweet', 'retweet_userid', 'retweet_tweetid',
       'retweet_user_screen_name', 'hashtags'],
      dtype='object')

Steps are as follows:
1. Subset ddf to include only userid, user_screen_name, follower_count and following_count
2. Drop duplicates (subset='user_screen_name', keeping the first occurrence)
3. Get unique usernames from retweet_user_screen_name
4. Append the two sets together
5. Set flagged = 0 for all rows with NaN in userid, and flagged = 1 for all others
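
The steps above can be sketched end to end on a toy dataframe. This is plain pandas on made-up data (`tweets`, `alice`, `bob` and `carol` are hypothetical), not the Dask code we run below, and it uses `pd.concat` for the append step.

```python
import pandas as pd

# hypothetical tweets: alice and bob are flagged authors, carol was only retweeted
tweets = pd.DataFrame({
    'userid': ['u1', 'u1', 'u2'],
    'user_screen_name': ['alice', 'alice', 'bob'],
    'follower_count': [10, 10, 5],
    'following_count': [3, 3, 8],
    'retweet_user_screen_name': ['carol', None, 'alice'],
})

# steps 1 and 2: one row per flagged author
flagged = (
    tweets[['user_screen_name', 'userid', 'follower_count', 'following_count']]
    .drop_duplicates(subset='user_screen_name', keep='first')
)

# step 3: unique retweeted usernames, renamed to match the first table
retweeted = (
    tweets[['retweet_user_screen_name']]
    .dropna()
    .drop_duplicates()
    .rename(columns={'retweet_user_screen_name': 'user_screen_name'})
)

# step 4: append, keeping the flagged row when a name appears in both tables
users = (
    pd.concat([flagged, retweeted], ignore_index=True)
    .drop_duplicates(subset='user_screen_name', keep='first')
    .reset_index(drop=True)
)

# step 5: users with a userid come from the flagged accounts
users['flagged'] = users['userid'].notna().astype(int)
print(users)
```

Here `alice` appears both as an author and as a retweeted user. Because the flagged rows come first in the concatenation, `keep='first'` retains her row with the account details.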

## 6.1. Get Unique Usernames and Related Data of Flagged Users

In [264]:
ddf_user_screen_name = ddf[['user_screen_name', 'userid', 'follower_count', 'following_count']].persist()

In [265]:
ddf_user_screen_name.head()

tweetid,user_screen_name,userid,follower_count,following_count
1000000000447930368,y_44a_,948302862098092034,9007,8821
1000000030391095297,y_44a_,948302862098092034,9007,8821
1000000039362662400,y_44a_,948302862098092034,9007,8821
1000000054911033344,iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=,iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=,168,408
1000000204865789954,Gj+bihYSO0L5Ht1+f9OEqP42KbnJWtNK4qv0WJr0cs=,Gj+bihYSO0L5Ht1+f9OEqP42KbnJWtNK4qv0WJr0cs=,1623,2022


In [266]:
ddf_user_screen_name.shape[0].compute()

35347002

Excellent.

If we do a groupby, we can also get the number of times each user has tweeted, which might be useful information too.

In [267]:
# get tweet count per user as separate dataframe 
df_tweet_counts = ddf_user_screen_name.groupby(['user_screen_name']).userid.count().compute()
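
A possible shortcut, sketched on made-up data with plain pandas: `groupby(...).size()` counts rows per user directly, and `rename` plus `reset_index` turn the result into a dataframe with a `tweet_count` column in one chain. Unlike `.userid.count()`, `size()` also counts rows whose userid is missing, which should make no difference as long as every row has a userid.

```python
import pandas as pd

# hypothetical data: user 'a' tweeted twice, user 'b' once
df = pd.DataFrame({
    'user_screen_name': ['a', 'a', 'b'],
    'userid': ['1', '1', '2'],
})

tweet_counts = (
    df.groupby('user_screen_name')
    .size()                  # number of rows per user
    .rename('tweet_count')   # name the resulting Series
    .reset_index()           # back to a two-column dataframe
)
print(tweet_counts)
```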

In [268]:
df_tweet_counts.shape

(4273,)

In [269]:
df_tweet_counts.head()

user_screen_name
++NKgOjGxSjksF8OoE2pc279lQPgPw+2VAEDAQFFs=       289
+H3e4huQnfPabIqJjpda8W933waxcGgJsnY0w+WcoI=     7159
+ZPMEkaCpxlnSug88JSojnxg+AE7p8viodFn5CRg=      28168
+iPsIpKQMg6RBLUH7rz9RquJvqGFh0At9B9cBPVq8o=    13404
0IODAjn8rdCiEhCJAgw2x8Ey0z9biPDat3xFjXYVb4=        1
Name: userid, dtype: int64

Let's now proceed to drop all duplicates in ddf_user_screen_name.

In [270]:
ddf_user_screen_name = ddf_user_screen_name.drop_duplicates(subset=['user_screen_name'])

In [271]:
ddf_user_screen_name.head(10)

tweetid,user_screen_name,userid,follower_count,following_count
1000000000447930368,y_44a_,948302862098092034,9007,8821
1000000054911033344,iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=,iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=,168,408
1000000204865789954,Gj+bihYSO0L5Ht1+f9OEqP42KbnJWtNK4qv0WJr0cs=,Gj+bihYSO0L5Ht1+f9OEqP42KbnJWtNK4qv0WJr0cs=,1623,2022
1000000325586169856,2SJuOzyE6GQOsmW9ukY3ChH8rl049x6mDNZi3EM=,2SJuOzyE6GQOsmW9ukY3ChH8rl049x6mDNZi3EM=,1850,1594
1000000475503132673,ytmVN9opEFMM7Uk+0O0XgSuOpIRlok5Xqu+jel9qyM=,ytmVN9opEFMM7Uk+0O0XgSuOpIRlok5Xqu+jel9qyM=,3928,4273
1000000623381753862,rJ9LKF5+KW7TRiUemWEc2o7f2Yir2yMc+oxuoHToyR0=,rJ9LKF5+KW7TRiUemWEc2o7f2Yir2yMc+oxuoHToyR0=,616,1668
1000001073459945472,m7m_aq,2904198326,42835,42860
1000001135242104832,Oyts8f+bvCC1gTv8xwa5wUbtFnEAImBG2bKeoR+ayRk=,Oyts8f+bvCC1gTv8xwa5wUbtFnEAImBG2bKeoR+ayRk=,952,524
1000001283313545216,uvY9IQjUPH4qmW4RI5URaQTKFFszPWdCw4nTUPolPQ=,uvY9IQjUPH4qmW4RI5URaQTKFFszPWdCw4nTUPolPQ=,553,566
1000002504942346240,lLdMGNWHZQHieTbvbBinuzNilXcZXihu5EIKw+GqbIQ=,lLdMGNWHZQHieTbvbBinuzNilXcZXihu5EIKw+GqbIQ=,1526,621


That looks good. Let's bring that in as a local pandas dataframe now.

In [272]:
df_user_screen_name = ddf_user_screen_name.compute()

In [273]:
df_user_screen_name.shape

(4273, 4)

In [274]:
df_user_screen_name.sample(20, random_state=2)

tweetid,user_screen_name,userid,follower_count,following_count
1136144650950103042,HpLtcR38EHiDwl+SMFeiZJfjJYhhXXSqrjSZmO2L8=,HpLtcR38EHiDwl+SMFeiZJfjJYhhXXSqrjSZmO2L8=,1360,2107
1195443645480161280,iolNitXbFp48iizuBln3aM+RFk2CQpaLBTG2lELHI=,iolNitXbFp48iizuBln3aM+RFk2CQpaLBTG2lELHI=,12,76
1138528120259186689,U6uJY1DUNiWiPQdumvwTc8aLgBXr4zo+8OOHfZWb8=,U6uJY1DUNiWiPQdumvwTc8aLgBXr4zo+8OOHfZWb8=,210,453
763527967347793921,+uKxjQyby7fwl4aj6LZs691EKlQwFgiImkzmasxQ9Ls=,+uKxjQyby7fwl4aj6LZs691EKlQwFgiImkzmasxQ9Ls=,18,7
673322036949839872,X+hmavIgL84Zl2PS103dQz7cNGYPpaOACCSJhwSUCTE=,X+hmavIgL84Zl2PS103dQz7cNGYPpaOACCSJhwSUCTE=,32,93
1165721449337950208,malkk23,1147517086123593728,7104,6603
1102130141008936960,rhR0hzSb6ltboeLaqA+9DiGhGIhP6kLzwj59mzXpcE=,rhR0hzSb6ltboeLaqA+9DiGhGIhP6kLzwj59mzXpcE=,2627,2787
1062201700025389058,O6Pr8bRNPEUttX3lkQUjWK6sdSITczyQUjfbCz+Ym4=,O6Pr8bRNPEUttX3lkQUjWK6sdSITczyQUjfbCz+Ym4=,5,72
590783574661533696,luEpRfvffUb3Zin5EhdNdWMXiDQIaB4cqKcYp52JSH4=,luEpRfvffUb3Zin5EhdNdWMXiDQIaB4cqKcYp52JSH4=,10,74
1121098481274368001,loghVlPG9BZXJshg1JskiMyKBCZvpA4kWJ4wpJxuvA=,loghVlPG9BZXJshg1JskiMyKBCZvpA4kWJ4wpJxuvA=,554,802


Let's join df_user_screen_name with df_tweet_counts.

In [275]:
# convert df_tweet_counts to dataframe
df_tweet_counts = pd.DataFrame(df_tweet_counts)

In [276]:
# reset the index
df_tweet_counts.reset_index(drop=False, inplace=True)

In [277]:
# rename the tweet_count column
df_tweet_counts = df_tweet_counts.rename(columns={'userid': 'tweet_count'})

In [278]:
df_tweet_counts.head()

Unnamed: 0,user_screen_name,tweet_count
0,++NKgOjGxSjksF8OoE2pc279lQPgPw+2VAEDAQFFs=,289
1,+H3e4huQnfPabIqJjpda8W933waxcGgJsnY0w+WcoI=,7159
2,+ZPMEkaCpxlnSug88JSojnxg+AE7p8viodFn5CRg=,28168
3,+iPsIpKQMg6RBLUH7rz9RquJvqGFh0At9B9cBPVq8o=,13404
4,0IODAjn8rdCiEhCJAgw2x8Ey0z9biPDat3xFjXYVb4=,1


We're now ready to execute the join.

In [280]:
df_users = df_user_screen_name.merge(
    df_tweet_counts, 
    how='left', 
    on='user_screen_name'
)
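
Since both tables should contain exactly one row per user_screen_name, the merge itself can check that for us. A minimal sketch on made-up data, using the `validate` argument of pandas' `merge`. The toy dataframes are hypothetical.

```python
import pandas as pd

users = pd.DataFrame({'user_screen_name': ['a', 'b'], 'userid': ['1', '2']})
counts = pd.DataFrame({'user_screen_name': ['a', 'b'], 'tweet_count': [3, 1]})

# validate raises a MergeError if either side has duplicate keys
merged = users.merge(
    counts,
    how='left',
    on='user_screen_name',
    validate='one_to_one',
)

# with a left join, users without a match would show up as NaN here
missing = merged['tweet_count'].isna().sum()
print(merged)
print(missing)
```

This complements the manual lookups of individual users: the lookups confirm that values match, while `validate` and the NaN count confirm that no rows were duplicated or left unmatched.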

In [281]:
df_users.head()

Unnamed: 0,user_screen_name,userid,follower_count,following_count,tweet_count
0,y_44a_,948302862098092034,9007,8821,258339
1,iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=,iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=,168,408,4982
2,Gj+bihYSO0L5Ht1+f9OEqP42KbnJWtNK4qv0WJr0cs=,Gj+bihYSO0L5Ht1+f9OEqP42KbnJWtNK4qv0WJr0cs=,1623,2022,59605
3,2SJuOzyE6GQOsmW9ukY3ChH8rl049x6mDNZi3EM=,2SJuOzyE6GQOsmW9ukY3ChH8rl049x6mDNZi3EM=,1850,1594,14833
4,ytmVN9opEFMM7Uk+0O0XgSuOpIRlok5Xqu+jel9qyM=,ytmVN9opEFMM7Uk+0O0XgSuOpIRlok5Xqu+jel9qyM=,3928,4273,17643


In [282]:
df_tweet_counts[df_tweet_counts.user_screen_name == 'y_44a_']

Unnamed: 0,user_screen_name,tweet_count
256,y_44a_,258339


In [283]:
df_tweet_counts[df_tweet_counts.user_screen_name == 'iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=']

Unnamed: 0,user_screen_name,tweet_count
184,iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=,4982


In [284]:
df_users.tail()

Unnamed: 0,user_screen_name,userid,follower_count,following_count,tweet_count
4268,7pOdmjKqHcJVTSok+qfVEYNov31rbO9CCKCwnQP7vzk=,7pOdmjKqHcJVTSok+qfVEYNov31rbO9CCKCwnQP7vzk=,11,80,1
4269,G+jasJ4rse9GSGy4XauAEx3JwWOR4JUmSwQzZ3rL4I=,G+jasJ4rse9GSGy4XauAEx3JwWOR4JUmSwQzZ3rL4I=,0,21,1
4270,w8DugCJZldmy0PZWJwFgvvm56mm5o5fBBqHgYxMXG8A=,w8DugCJZldmy0PZWJwFgvvm56mm5o5fBBqHgYxMXG8A=,0,0,1
4271,o+r+FeR4oE5hHYFTsleFXNUDpyA5RmzHYkvEPxYu7Ck=,o+r+FeR4oE5hHYFTsleFXNUDpyA5RmzHYkvEPxYu7Ck=,7,9,1
4272,sB6soAYD34Fcwj2BXikt3QTJkBBVe1J9N10QslhlnTg=,sB6soAYD34Fcwj2BXikt3QTJkBBVe1J9N10QslhlnTg=,10,5,1


Excellent, that worked. These are the **4273 unique users** who have been flagged and removed by Twitter.

## 6.2. Get Unique Usernames of Retweeted Users (not flagged)

Let's now proceed to append the remaining users whose tweets were retweeted by some of these 4273 flagged users. These are stored in the retweet_user_screen_name column.

We will:
1. get all unique usernames from that column
2. append them to df_users
3. then drop duplicates (keep='first'), which leaves only those retweeted usernames that are not already among the 4273 flagged users in df_users.

In [338]:
ddf_retweet_usernames = ddf[['retweet_user_screen_name']].persist()

In [339]:
ddf_retweet_usernames = ddf_retweet_usernames.drop_duplicates()

In [340]:
ddf_retweet_usernames.shape[0].compute()

334100

In [341]:
# bring in as pandas dataframe
df_retweet_usernames = ddf_retweet_usernames.compute()

In [342]:
df_retweet_usernames.head()

tweetid,retweet_user_screen_name
1000000000447930368,oneway_market
1000000030391095297,games4marah
1000000039362662400,mzlatksa
1000000054911033344,videohat_1
1000000204865789954,


In [343]:
df_retweet_usernames.retweet_user_screen_name.isnull().sum()

1

In [344]:
df_retweet_usernames.dropna(inplace=True)

In [345]:
df_retweet_usernames.reset_index(drop=True, inplace=True)

In [346]:
df_retweet_usernames.head()

Unnamed: 0,retweet_user_screen_name
0,oneway_market
1,games4marah
2,mzlatksa
3,videohat_1
4,danat_almesk


In [347]:
df_retweet_usernames.shape

(334099, 1)

That's a lot of extra usernames. Seems like there are plenty of tweets included that were not originally authored by the flagged users in this dataset. 

**NOTE:** Would be interesting to investigate what those are, because - technically speaking - those should be perfectly harmless Tweets. Since Twitter didn't flag them, that is...

In [348]:
df_retweet_usernames = df_retweet_usernames.rename(columns={'retweet_user_screen_name': 'user_screen_name'})

In [349]:
df_retweet_usernames['flagged'] = 0

In [350]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df_users = pd.concat([df_users, df_retweet_usernames])

In [352]:
df_users.iloc[4270:4280]

Unnamed: 0,user_screen_name,userid,follower_count,following_count,tweet_count,flagged
4270,w8DugCJZldmy0PZWJwFgvvm56mm5o5fBBqHgYxMXG8A=,w8DugCJZldmy0PZWJwFgvvm56mm5o5fBBqHgYxMXG8A=,0.0,0.0,1.0,
4271,o+r+FeR4oE5hHYFTsleFXNUDpyA5RmzHYkvEPxYu7Ck=,o+r+FeR4oE5hHYFTsleFXNUDpyA5RmzHYkvEPxYu7Ck=,7.0,9.0,1.0,
4272,sB6soAYD34Fcwj2BXikt3QTJkBBVe1J9N10QslhlnTg=,sB6soAYD34Fcwj2BXikt3QTJkBBVe1J9N10QslhlnTg=,10.0,5.0,1.0,
0,oneway_market,,,,,0.0
1,games4marah,,,,,0.0
2,mzlatksa,,,,,0.0
3,videohat_1,,,,,0.0
4,danat_almesk,,,,,0.0
5,756870fda1544b6,,,,,0.0
6,m3asafarah,,,,,0.0


Excellent, we just need to:

1. drop duplicate user_screen_name values (keeping the first occurrence)
2. reset the index
3. set the 'flagged' column to 1 for all rows where it is currently NaN (these are the flagged users)

In [353]:
# drop duplicates
df_users.drop_duplicates(subset='user_screen_name', keep='first', inplace=True)

In [354]:
df_users.shape

(336755, 6)

In [355]:
df_users.iloc[4270:4280]

Unnamed: 0,user_screen_name,userid,follower_count,following_count,tweet_count,flagged
4270,w8DugCJZldmy0PZWJwFgvvm56mm5o5fBBqHgYxMXG8A=,w8DugCJZldmy0PZWJwFgvvm56mm5o5fBBqHgYxMXG8A=,0.0,0.0,1.0,
4271,o+r+FeR4oE5hHYFTsleFXNUDpyA5RmzHYkvEPxYu7Ck=,o+r+FeR4oE5hHYFTsleFXNUDpyA5RmzHYkvEPxYu7Ck=,7.0,9.0,1.0,
4272,sB6soAYD34Fcwj2BXikt3QTJkBBVe1J9N10QslhlnTg=,sB6soAYD34Fcwj2BXikt3QTJkBBVe1J9N10QslhlnTg=,10.0,5.0,1.0,
0,oneway_market,,,,,0.0
1,games4marah,,,,,0.0
2,mzlatksa,,,,,0.0
3,videohat_1,,,,,0.0
4,danat_almesk,,,,,0.0
5,756870fda1544b6,,,,,0.0
6,m3asafarah,,,,,0.0


Alright, after appending the two sets of usernames and dropping duplicates, we are left with close to 337k users, of which 4273 have been flagged by Twitter (and contain additional information like following_count, etc.).

Let's reset the index.

In [356]:
df_users.reset_index(drop=True, inplace=True)

In [357]:
df_users.iloc[4270:4280]

Unnamed: 0,user_screen_name,userid,follower_count,following_count,tweet_count,flagged
4270,w8DugCJZldmy0PZWJwFgvvm56mm5o5fBBqHgYxMXG8A=,w8DugCJZldmy0PZWJwFgvvm56mm5o5fBBqHgYxMXG8A=,0.0,0.0,1.0,
4271,o+r+FeR4oE5hHYFTsleFXNUDpyA5RmzHYkvEPxYu7Ck=,o+r+FeR4oE5hHYFTsleFXNUDpyA5RmzHYkvEPxYu7Ck=,7.0,9.0,1.0,
4272,sB6soAYD34Fcwj2BXikt3QTJkBBVe1J9N10QslhlnTg=,sB6soAYD34Fcwj2BXikt3QTJkBBVe1J9N10QslhlnTg=,10.0,5.0,1.0,
4273,oneway_market,,,,,0.0
4274,games4marah,,,,,0.0
4275,mzlatksa,,,,,0.0
4276,videohat_1,,,,,0.0
4277,danat_almesk,,,,,0.0
4278,756870fda1544b6,,,,,0.0
4279,m3asafarah,,,,,0.0


In [358]:
df_users.tail()

Unnamed: 0,user_screen_name,userid,follower_count,following_count,tweet_count,flagged
336750,nice_moon2030,,,,,0.0
336751,Shaibah11,,,,,0.0
336752,EqA50Heq8z0Jaff,,,,,0.0
336753,e540w,,,,,0.0
336754,dina_50_0,,,,,0.0


## 6.3. Set 'Flagged' Column
Last but not least, let's set the 'flagged' column to 1 for all rows where it is currently anything other than 0 (i.e. the NaN rows, which belong to the flagged users).
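A short sketch of why a `!= 0` comparison picks up the missing values: NaN compares unequal to every number, so the mask is True for those rows. The series below is hypothetical toy data, and `fillna` is shown only as an alternative that gives the same result here.

```python
import pandas as pd

# Hypothetical 'flagged' column: NaN for flagged accounts, 0.0 for retweeted-only accounts
flagged = pd.Series([float("nan"), float("nan"), 0.0, 0.0])

# NaN compares unequal to everything, so `!= 0` is True for the NaN rows
mask = flagged != 0
via_mask = flagged.copy()
via_mask.loc[mask] = 1.0

# Same result, arguably more explicit: fill the missing values directly
via_fillna = flagged.fillna(1.0)
```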

In [376]:
df_users.loc[df_users.flagged != 0, "flagged"] = 1.0

In [377]:
df_users.iloc[4270:4280]

Unnamed: 0,user_screen_name,userid,follower_count,following_count,tweet_count,flagged
4270,w8DugCJZldmy0PZWJwFgvvm56mm5o5fBBqHgYxMXG8A=,w8DugCJZldmy0PZWJwFgvvm56mm5o5fBBqHgYxMXG8A=,0.0,0.0,1.0,1.0
4271,o+r+FeR4oE5hHYFTsleFXNUDpyA5RmzHYkvEPxYu7Ck=,o+r+FeR4oE5hHYFTsleFXNUDpyA5RmzHYkvEPxYu7Ck=,7.0,9.0,1.0,1.0
4272,sB6soAYD34Fcwj2BXikt3QTJkBBVe1J9N10QslhlnTg=,sB6soAYD34Fcwj2BXikt3QTJkBBVe1J9N10QslhlnTg=,10.0,5.0,1.0,1.0
4273,oneway_market,,,,,0.0
4274,games4marah,,,,,0.0
4275,mzlatksa,,,,,0.0
4276,videohat_1,,,,,0.0
4277,danat_almesk,,,,,0.0
4278,756870fda1544b6,,,,,0.0
4279,m3asafarah,,,,,0.0


This looks like it worked. Let's confirm by counting the rows where 'flagged' equals 1; the result should be 4273.

In [378]:
df_users.flagged[df_users.flagged == 1].count()

4273

Excellent. Let's save as local and cloud-based parquet.

In [379]:
# # save this dataframe locally
# df_users.to_parquet('/Users/richard/Desktop/data_cap3/interim/df_unique_users.parquet',
#                     engine='pyarrow')

In [380]:
# # save this dataframe to s3 bucket
# df_users.to_parquet('s3://twitter-saudi-us-east-2/interim/df_unique_users.parquet',
#                     engine='pyarrow')

## 6.4. Substitute Userid / User_Screen_Name with Unique User ID

We are now ready to simplify our main dataframe ddf by substituting the user_screen_name and userid with our new, unique user ID index values. We will do the same for the usernames in the retweet_user_screen_name column. 

Following this, we can drop 4 columns:
- userid
- user_screen_name
- retweet_userid
- retweet_user_screen_name

In [14]:
# turn df_users into dask dataframe
ddf_users = dd.from_pandas(df_users, npartitions=1).persist()

In [15]:
ddf_users

Unnamed: 0_level_0,user_screen_name,userid,follower_count,following_count,tweet_count,flagged
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,object,object,float64,float64,float64,float64
336754,...,...,...,...,...,...


Let's turn the index of ddf_users into a column, to prepare for our join.

In [16]:
ddf_users['reference_userid'] = ddf_users.index

In [17]:
ddf_users.head()

Unnamed: 0,user_screen_name,userid,follower_count,following_count,tweet_count,flagged,reference_userid
0,y_44a_,948302862098092034,9007.0,8821.0,258339.0,1.0,0
1,iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=,iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=,168.0,408.0,4982.0,1.0,1
2,Gj+bihYSO0L5Ht1+f9OEqP42KbnJWtNK4qv0WJr0cs=,Gj+bihYSO0L5Ht1+f9OEqP42KbnJWtNK4qv0WJr0cs=,1623.0,2022.0,59605.0,1.0,2
3,2SJuOzyE6GQOsmW9ukY3ChH8rl049x6mDNZi3EM=,2SJuOzyE6GQOsmW9ukY3ChH8rl049x6mDNZi3EM=,1850.0,1594.0,14833.0,1.0,3
4,ytmVN9opEFMM7Uk+0O0XgSuOpIRlok5Xqu+jel9qyM=,ytmVN9opEFMM7Uk+0O0XgSuOpIRlok5Xqu+jel9qyM=,3928.0,4273.0,17643.0,1.0,4


In [18]:
ddf_users.columns

Index(['user_screen_name', 'userid', 'follower_count', 'following_count',
       'tweet_count', 'flagged', 'reference_userid'],
      dtype='object')

In [19]:
ddf_users.known_divisions

True

Let's do the same for ddf.

In [20]:
ddf.columns

Index(['userid', 'user_screen_name', 'follower_count', 'following_count',
       'tweet_language', 'tweet_text', 'tweet_time', 'tweet_client_name',
       'is_retweet', 'retweet_userid', 'retweet_tweetid',
       'retweet_user_screen_name', 'hashtags'],
      dtype='object')

In [21]:
ddf.known_divisions

True

In [22]:
ddf['tweetid'] = ddf.index

In [23]:
ddf.head(2)

Unnamed: 0_level_0,userid,user_screen_name,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid,retweet_user_screen_name,hashtags,tweetid
tweetid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
1000000000447930368,948302862098092034,y_44a_,9007,8821,ar,السلام عليكم ورحمة الله وبركاته مرحبا عملاء م...,2018-05-25 13:05:00,Twitter for iPhone,True,,9.986493e+17,oneway_market,,1000000000447930368
1000000030391095297,948302862098092034,y_44a_,9007,8821,ar,للتأجير لبيع النطيطات زحاليق مائيه صابونية مل...,2018-05-25 13:06:00,Twitter for iPhone,True,,9.996373e+17,games4marah,"[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",1000000030391095297


Excellent.

We are now ready to perform our joins:
1. Joining on 'user_screen_name' column to bring in the respective unique User IDs
2. Joining on 'retweet_user_screen_name' column to bring in the respective unique User IDs there, too

After these two joins we will:
- delete the redundant columns
- maintain 'tweetid' as a column, not as index
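The two joins can be sketched in plain pandas on toy data. The frames `tweets` and `users` and their values are hypothetical. Unlike the cells that follow, this sketch renames the lookup columns before each merge, which is one way to avoid the `_x` / `_y` suffixes; it is an illustration, not the notebook's method.

```python
import pandas as pd

# Hypothetical stand-ins for ddf (tweets) and ddf_users (lookup table)
tweets = pd.DataFrame({
    "tweetid": [11, 12, 13],
    "user_screen_name": ["user_a", "user_a", "user_b"],
    "retweet_user_screen_name": ["user_c", None, "user_c"],
})
users = pd.DataFrame({
    "user_screen_name": ["user_a", "user_b", "user_c"],
    "reference_userid": [0, 1, 2],
})

# Join 1: author screen name -> author reference id
author_ids = users.rename(columns={"reference_userid": "user_reference_id"})
tweets = tweets.merge(author_ids, how="left", on="user_screen_name")

# Join 2: retweeted screen name -> retweeted-user reference id.
# Renaming the lookup columns first avoids the _x/_y suffixes.
retweet_ids = users.rename(columns={
    "user_screen_name": "retweet_user_screen_name",
    "reference_userid": "retweet_user_reference_id",
})
tweets = tweets.merge(retweet_ids, how="left", on="retweet_user_screen_name")

# Drop the now-redundant name columns
tweets = tweets.drop(columns=["user_screen_name", "retweet_user_screen_name"])
print(tweets)
```

Rows that are not retweets have no retweeted user, so their `retweet_user_reference_id` stays missing and the column becomes float, just as in the notebook output.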

In [25]:
%%time
# execute left join
ddf = ddf.merge(
    ddf_users[["reference_userid", "user_screen_name"]],
    how="left",
    on="user_screen_name"
).persist()

CPU times: user 584 ms, sys: 102 ms, total: 686 ms
Wall time: 864 ms


Great, let's verify by comparing user_screen_name and ID pairs in ddf to df_users.

In [27]:
ddf[['reference_userid', 'user_screen_name']].head(10)

Unnamed: 0,reference_userid,user_screen_name
0,0,y_44a_
1,0,y_44a_
2,0,y_44a_
3,1,iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=
4,2,Gj+bihYSO0L5Ht1+f9OEqP42KbnJWtNK4qv0WJr0cs=
5,0,y_44a_
6,0,y_44a_
7,0,y_44a_
8,0,y_44a_
9,3,2SJuOzyE6GQOsmW9ukY3ChH8rl049x6mDNZi3EM=


In [28]:
df_users.head(4)

Unnamed: 0,user_screen_name,userid,follower_count,following_count,tweet_count,flagged
0,y_44a_,948302862098092034,9007.0,8821.0,258339.0,1.0
1,iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=,iXwa1+qxYAH2hEJ9nDG11qo6nmcpl89IQKhDRDqpfU4=,168.0,408.0,4982.0,1.0
2,Gj+bihYSO0L5Ht1+f9OEqP42KbnJWtNK4qv0WJr0cs=,Gj+bihYSO0L5Ht1+f9OEqP42KbnJWtNK4qv0WJr0cs=,1623.0,2022.0,59605.0,1.0
3,2SJuOzyE6GQOsmW9ukY3ChH8rl049x6mDNZi3EM=,2SJuOzyE6GQOsmW9ukY3ChH8rl049x6mDNZi3EM=,1850.0,1594.0,14833.0,1.0


Perfect, this worked.
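Beyond eyeballing a few rows, a join like this can also be checked programmatically. The sketch below uses the pandas `merge` options `validate` and `indicator` on hypothetical toy frames; whether the Dask version used in this notebook accepts the same options has not been checked here.

```python
import pandas as pd

# Hypothetical toy frames; 'user_x' has no entry in the lookup
tweets = pd.DataFrame({"user_screen_name": ["user_a", "user_a", "user_b", "user_x"]})
users = pd.DataFrame({
    "user_screen_name": ["user_a", "user_b"],
    "reference_userid": [0, 1],
})

# validate= raises MergeError if a screen name appears twice in the lookup;
# indicator= adds a '_merge' column recording where each row found a match
joined = tweets.merge(
    users,
    how="left",
    on="user_screen_name",
    validate="many_to_one",
    indicator=True,
)

# A left join against a unique lookup must not change the row count
same_length = len(joined) == len(tweets)

# Rows without a partner in the lookup
unmatched = joined.loc[joined["_merge"] == "left_only", "user_screen_name"].tolist()
```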

Let's now just proceed to do a similar join for the retweet_user_screen_name column and then we're done.

In [30]:
%%time
# execute left join
ddf = ddf.merge(
    ddf_users[["reference_userid", "user_screen_name"]],
    how="left",
    left_on="retweet_user_screen_name",
    right_on="user_screen_name"
).persist()

CPU times: user 450 ms, sys: 363 ms, total: 813 ms
Wall time: 1.06 s


In [31]:
ddf.head(3)

Unnamed: 0,userid,user_screen_name_x,follower_count,following_count,tweet_language,tweet_text,tweet_time,tweet_client_name,is_retweet,retweet_userid,retweet_tweetid,retweet_user_screen_name,hashtags,tweetid,reference_userid_x,reference_userid_y,user_screen_name_y
0,948302862098092034,y_44a_,9007,8821,ar,السلام عليكم ورحمة الله وبركاته مرحبا عملاء م...,2018-05-25 13:05:00,Twitter for iPhone,True,,9.986493e+17,oneway_market,,1000000000447930368,0,4273.0,oneway_market
1,948302862098092034,y_44a_,9007,8821,ar,للتأجير لبيع النطيطات زحاليق مائيه صابونية مل...,2018-05-25 13:06:00,Twitter for iPhone,True,,9.996373e+17,games4marah,"[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",1000000030391095297,0,4274.0,games4marah
2,948302862098092034,y_44a_,9007,8821,ar,مظلات وسواتر آفاق الرياض مظلات استراحات مظلات...,2018-05-25 13:06:00,Twitter for iPhone,True,,9.993939e+17,mzlatksa,"[مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا...",1000000039362662400,0,4275.0,mzlatksa


In [32]:
# double-check
df_users.iloc[4273], df_users.iloc[4274]

(user_screen_name    oneway_market
 userid                       None
 follower_count                NaN
 following_count               NaN
 tweet_count                   NaN
 flagged                       0.0
 Name: 4273, dtype: object,
 user_screen_name    games4marah
 userid                     None
 follower_count              NaN
 following_count             NaN
 tweet_count                 NaN
 flagged                     0.0
 Name: 4274, dtype: object)

Excellent, this worked. Let's just remove and rename some columns.

In [33]:
ddf.columns

Index(['userid', 'user_screen_name_x', 'follower_count', 'following_count',
       'tweet_language', 'tweet_text', 'tweet_time', 'tweet_client_name',
       'is_retweet', 'retweet_userid', 'retweet_tweetid',
       'retweet_user_screen_name', 'hashtags', 'tweetid', 'reference_userid_x',
       'reference_userid_y', 'user_screen_name_y'],
      dtype='object')

In [34]:
ddf = ddf.drop(columns=['userid', 
                       'user_screen_name_x', 
                       'retweet_userid', 
                       'retweet_user_screen_name', 
                       'user_screen_name_y']).persist()

Now let's rename the two reference_userid columns.

In [35]:
ddf = ddf.rename(columns={
        'reference_userid_x': 'user_reference_id',
        'reference_userid_y': 'retweet_user_reference_id'
}).persist()

Finally let's just reshuffle the order of the columns.

In [37]:
ddf = ddf[['tweetid',
           'user_reference_id', 
           'follower_count', 
           'following_count', 
           'tweet_text',
           'hashtags',
           'tweet_language',
           'tweet_time',
           'tweet_client_name',
           'is_retweet',
           'retweet_tweetid',
           'retweet_user_reference_id']].persist()

In [38]:
ddf.head()

Unnamed: 0,tweetid,user_reference_id,follower_count,following_count,tweet_text,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id
0,1000000000447930368,0,9007,8821,السلام عليكم ورحمة الله وبركاته مرحبا عملاء م...,,ar,2018-05-25 13:05:00,Twitter for iPhone,True,9.986493e+17,4273.0
1,1000000030391095297,0,9007,8821,للتأجير لبيع النطيطات زحاليق مائيه صابونية مل...,"[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.996373e+17,4274.0
2,1000000039362662400,0,9007,8821,مظلات وسواتر آفاق الرياض مظلات استراحات مظلات...,"[مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا...",ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.993939e+17,4275.0
3,1000000054911033344,1,168,408,فيديو شاهد مواطن يوثق بالفيديو كميات كبيرة من...,,ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.983516e+17,4276.0
4,1000000204865789954,2,1623,2022,أستغفر الله العظيم وأتوب إليه,,ar,2018-05-25 13:06:00,غرد بصدقة,False,,


In [40]:
ddf.shape[0].compute()

35347002

Fantastic. 

Let's save this now.

In [41]:
# check if divisions are known
ddf.known_divisions

False

In [42]:
# set index to tweetid column
ddf = ddf.set_index('tweetid', drop=True).persist()

In [45]:
# let's drop tweetid from the columns
ddf = ddf.drop(columns=['tweetid']).persist()

In [46]:
ddf.head()

Unnamed: 0_level_0,user_reference_id,follower_count,following_count,tweet_text,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id
tweetid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1000000000447930368,0,9007,8821,السلام عليكم ورحمة الله وبركاته مرحبا عملاء م...,,ar,2018-05-25 13:05:00,Twitter for iPhone,True,9.986493e+17,4273.0
1000000030391095297,0,9007,8821,للتأجير لبيع النطيطات زحاليق مائيه صابونية مل...,"[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.996373e+17,4274.0
1000000039362662400,0,9007,8821,مظلات وسواتر آفاق الرياض مظلات استراحات مظلات...,"[مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا...",ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.993939e+17,4275.0
1000000054911033344,1,168,408,فيديو شاهد مواطن يوثق بالفيديو كميات كبيرة من...,,ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.983516e+17,4276.0
1000000204865789954,2,1623,2022,أستغفر الله العظيم وأتوب إليه,,ar,2018-05-25 13:06:00,غرد بصدقة,False,,


In [47]:
ddf.to_parquet('s3://twitter-saudi-us-east-2/interim/ddf_clean_userids_substituted.parquet',
               engine='pyarrow')

Excellent.

We now have:
- a reference dataframe containing all the information about the unique users in this dataset
- a slimmed-down version of our main dataframe with user ids and names replaced with just their unique identifiers.

Let's now proceed to create a similar reference dataframe for the **unique tweets** in the dataset.

# 7. Identifying Unique Tweets

We will now identify the unique tweets in the dataset. Since there are so many retweets, we hope this will reduce the size of the dataset enough that we can work with the tweet text locally using pandas, rather than distributed using Dask.

The goal is to get a relational database of:
- unique tweets with unique identifiers, generated by us
- the original dataset with unique tweet IDs, representing unique instances of potentially duplicated content

The collection of unique tweet content is what we will feed into our topic modelling algorithm.
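To make the target structure concrete, here is a minimal pandas sketch on made-up data. It uses `pd.factorize` to assign one identifier per distinct text, which is a swapped-in shortcut for illustration and not the drop-duplicates-and-join route taken in the sections below. The toy values are assumptions.

```python
import pandas as pd

# Hypothetical toy tweets; three of them share the same text
tweets = pd.DataFrame({
    "tweetid": [11, 12, 13, 14],
    "tweet_text": ["hello", "world", "hello", "hello"],
})

# factorize assigns one integer code per distinct text, in order of first appearance
codes, uniques = pd.factorize(tweets["tweet_text"])
tweets["unique_tweet_id"] = codes

# Reference table: one row per distinct text
unique_tweets = pd.DataFrame({
    "unique_tweet_id": range(len(uniques)),
    "tweet_text": uniques,
})
```

The main table keeps one row per tweet instance plus a `unique_tweet_id` key, while `unique_tweets` holds each distinct text exactly once.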

## 7.1. Drop All Rows with Duplicate Tweet_Text
Let's proceed to drop all rows from ddf that have duplicate contents for the tweet_text column. This will leave us with the set of original tweets; i.e. no duplicates due to re-tweeting.

In [11]:
# create dask dataframe with only unique entries in tweet_text
ddf_unique = ddf.drop_duplicates(subset=['tweet_text']).persist()

In [12]:
ddf_unique

Unnamed: 0_level_0,user_reference_id,follower_count,following_count,tweet_text,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
,int64,int64,int64,object,object,object,datetime64[ns],object,bool,float64,float64
,...,...,...,...,...,...,...,...,...,...,...


In [13]:
# get number of rows in ddf_unique
ddf_unique.shape[0].compute()

6153341

OK. That is more than expected. Because we had 3.4 million rows with is_retweet = False, we expected roughly that many unique tweets. We have ended up with almost double that.

Let's investigate this further before continuing. We need to make sure we can trust this set of 'unique' tweets.

In [14]:
ddf_unique.head()

Unnamed: 0_level_0,user_reference_id,follower_count,following_count,tweet_text,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id
tweetid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1000000000447930368,0,9007,8821,السلام عليكم ورحمة الله وبركاته مرحبا عملاء م...,,ar,2018-05-25 13:05:00,Twitter for iPhone,True,9.986493e+17,4273.0
1000000030391095297,0,9007,8821,للتأجير لبيع النطيطات زحاليق مائيه صابونية مل...,"[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.996373e+17,4274.0
1000000039362662400,0,9007,8821,مظلات وسواتر آفاق الرياض مظلات استراحات مظلات...,"[مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا...",ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.993939e+17,4275.0
1000000054911033344,1,168,408,فيديو شاهد مواطن يوثق بالفيديو كميات كبيرة من...,,ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.983516e+17,4276.0
1000000204865789954,2,1623,2022,أستغفر الله العظيم وأتوب إليه,,ar,2018-05-25 13:06:00,غرد بصدقة,False,,


In [15]:
# inspect entries for which is_retweet = True
ddf_unique[ddf_unique.is_retweet == True].head()

Unnamed: 0_level_0,user_reference_id,follower_count,following_count,tweet_text,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id
tweetid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1000000000447930368,0,9007,8821,السلام عليكم ورحمة الله وبركاته مرحبا عملاء م...,,ar,2018-05-25 13:05:00,Twitter for iPhone,True,9.986493e+17,4273.0
1000000030391095297,0,9007,8821,للتأجير لبيع النطيطات زحاليق مائيه صابونية مل...,"[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.996373e+17,4274.0
1000000039362662400,0,9007,8821,مظلات وسواتر آفاق الرياض مظلات استراحات مظلات...,"[مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا...",ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.993939e+17,4275.0
1000000054911033344,1,168,408,فيديو شاهد مواطن يوثق بالفيديو كميات كبيرة من...,,ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.983516e+17,4276.0
1000000215598891008,0,9007,8821,تخفيضات على جميع الأصناف لدى دانة المسك للعود...,"[تخفيضات, دانة المسك للعود, الرياض, الدائري ال...",ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.997592e+17,4277.0


We might be able to use the retweet_tweetid feature to get all entries which have retweeted the same tweet.

Let's try this by finding duplicates in retweet_tweetid column. 

In [16]:
# check first if we even have duplicates by doing a value_counts()
ddf_unique.retweet_tweetid.value_counts().head(15)

1.096888e+18    5
4.185034e+17    4
1.085309e+18    4
4.198996e+17    4
1.012492e+18    4
1.088175e+18    4
1.090611e+18    4
1.119623e+18    4
1.088173e+18    4
1.085316e+18    4
9.870564e+17    4
1.093834e+18    4
1.153985e+18    4
9.870569e+17    4
1.139251e+18    3
Name: retweet_tweetid, dtype: int64

It seems we do have retweets of the same tweets in this dataset. That would seem to mean that not all the tweets in this reduced dataset are truly unique.

Maybe there are minor changes like spaces, hashtags, etc.?

Let's get just the duplicated rows.
Dask does not support the .duplicated() method, so let's create a ddf with value_counts > 1 as a mask.
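The value_counts-as-mask idea can be illustrated in plain pandas on toy data. The frame below is hypothetical, and the pandas calls are only meant to show the logic of the Dask cells that follow.

```python
import pandas as pd

# Hypothetical toy frame; None marks an original (non-retweet) row
df = pd.DataFrame({
    "tweet_text": ["a", "a!", "b", "c", "d"],
    "retweet_tweetid": [101.0, 101.0, 202.0, None, None],
})

# value_counts ignores NaN by default, so originals never count as duplicates
counts = df["retweet_tweetid"].value_counts()
repeated_ids = counts[counts > 1].index

# Keep every row whose retweet_tweetid occurs more than once
duplicated_rows = df[df["retweet_tweetid"].isin(repeated_ids)]
```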

In [17]:
# create mask ddf
ddf_filter = ddf_unique['retweet_tweetid'].value_counts().map(lambda x: x > 1)

In [18]:
# use mask to get duplicated rows
ddf_duplicated_retweetids = ddf_unique[ddf_unique['retweet_tweetid'].isin(list(ddf_filter[ddf_filter].index))]

In [19]:
# set index and sort dataframe
ddf_duplicated_retweetids = ddf_duplicated_retweetids.set_index('retweet_tweetid')

In [20]:
# inspect the rows with duplicated retweet_tweetids
ddf_duplicated_retweetids.head(15)

Unnamed: 0_level_0,user_reference_id,follower_count,following_count,tweet_text,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_user_reference_id
retweet_tweetid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1.373095e+17,176,650,561,المهم ان اسامة لن يلعب في السعودية لغير الهلا...,,ar,2014-10-18 13:37:00,Twitter for Android,True,171264.0
1.373095e+17,176,650,561,المهم ان اسامة لن يلعب في السعودية لغير الهلا...,,ar,2016-02-24 12:35:00,Twitter for iPhone,True,5660.0
2.348015e+17,3562,652879,421198,التـمـيز فـيـك لـو غـيرك تـمـيز فـارقه ميزتك ...,,ar,2016-08-11 10:24:00,Twitter for iPad,True,235215.0
2.348015e+17,3557,814247,663230,التـمـيز فـيـك لـو غـيرك تـمـيز فـارقه ميزتك ...,,ar,2017-10-28 12:44:00,Twitter for Android,True,302905.0
2.348034e+17,3562,652879,421198,يآ عووونكـ ماقلتي عووونك ولا قلت وش فيكــ وش ...,,ar,2016-08-11 10:24:00,Twitter for iPad,True,235215.0
2.348034e+17,3557,814247,663230,يآ عووونكـ ماقلتي عووونك ولا قلت وش فيكــ وش ...,,ar,2017-10-28 19:54:00,Twitter for Android,True,302905.0
2.873282e+17,3587,138546,122416,الل م انّي أسألك نفحةً ن نفحآت رحمتك تلك التي...,,ar,2017-12-09 10:06:00,Twitter for iPhone,True,302905.0
2.873282e+17,3617,187212,175594,الل م انّي أسألك نفحةً ن نفحآت رحمتك تلك التي...,,ar,2016-07-27 16:06:00,Twitter for iPhone,True,235215.0
2.943013e+17,3587,138546,122416,ربنا اجعلنا لك ذكارين لك شكارين إليك أواهين م...,,ar,2017-12-11 12:46:00,Twitter for iPhone,True,302905.0
2.943013e+17,3373,870549,564628,ربنا اجعلنا لك ذكارين لك شكارين إليك أواهين م...,,ar,2016-07-20 13:02:00,Twitter for iPad,True,235215.0


OK. We do have duplicates here, it seems.

We could proceed to drop rows with duplicate retweet_tweetid. But be careful to keep the rows with "NaN" in retweet_tweetid!

I am curious, though, why these weren't dropped when we dropped duplicates in tweet_text. Minor differences due to preprocessing maybe (additional white space for example)? Let's investigate a little further.

In [21]:
# get a local copy to use as pandas dataframe for easier wrangling
df_duplicated_retweetids = ddf_duplicated_retweetids.compute()

In [22]:
df_duplicated_retweetids.iloc[0:2]

Unnamed: 0_level_0,user_reference_id,follower_count,following_count,tweet_text,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_user_reference_id
retweet_tweetid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1.373095e+17,176,650,561,المهم ان اسامة لن يلعب في السعودية لغير الهلا...,,ar,2014-10-18 13:37:00,Twitter for Android,True,171264.0
1.373095e+17,176,650,561,المهم ان اسامة لن يلعب في السعودية لغير الهلا...,,ar,2016-02-24 12:35:00,Twitter for iPhone,True,5660.0


Just looking at these first two entries is fascinating:
- same user
- retweeting the same message twice
- about 16 months apart
- but the retweet_user_reference_id (the account being retweeted) is different

In [23]:
df_duplicated_retweetids.tweet_text.iloc[0]

' المهم ان اسامة لن يلعب في السعودية لغير الهلال واحتمالية بقاءه حتى نهاية الموسم واردة وعرض الهلال بالنسبة له مجزي وهو م '

In [24]:
df_duplicated_retweetids.tweet_text.iloc[1]

' المهم ان اسامة لن يلعب في السعودية لغير الهلال واحتمالية بقاءه حتى نهاية الموسم واردة وعرض الهلال بالنسبة له مجزي وهو مق '

There is a one-character difference between these two tweets, which is why neither was dropped by .drop_duplicates().

The safest way forward is to drop any rows with duplicate values in the retweet_tweetid column. We'll do that after we've inspected the contents of some tweets to see if we have any tweets that are almost identical.
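As an aside, near-identical texts like the pair above could also be detected with a similarity ratio. The sketch below uses `difflib.SequenceMatcher` from the standard library; the example strings and the 0.95 threshold are assumptions for illustration. Pairwise comparison grows quadratically with the number of rows, so it would not scale to millions of tweets without some form of pre-grouping.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the stripped strings are identical."""
    return SequenceMatcher(None, a.strip(), b.strip()).ratio()


# Hypothetical strings that differ only by one trailing character
first = "offer valid until the end of the season m"
second = "offer valid until the end of the season mq"
unrelated = "completely different content"

THRESHOLD = 0.95  # assumed cut-off; would need tuning on real tweets

near_duplicate = similarity(first, second) >= THRESHOLD
```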

### 7.1.1. Finding Matching Substrings
Let's try matching a part of the text to find entries that are almost identical.

In [25]:
# get tweet content
ddf_unique.loc['1000000030391095297'].tweet_text.compute()

tweetid
1000000030391095297     للتأجير لبيع النطيطات زحاليق مائيه صابونية مل...
Name: tweet_text, dtype: object

In [26]:
# find all entries
ddf_unique[ddf_unique.tweet_text.str.contains('للتأجير لبيع النطيطات')].head()

Unnamed: 0_level_0,user_reference_id,follower_count,following_count,tweet_text,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id
tweetid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1000000030391095297,0,9007,8821,للتأجير لبيع النطيطات زحاليق مائيه صابونية مل...,"[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.996373e+17,4274.0


Interesting. A small substring doesn't return any matches other than the one which has is_retweet marked as True. This means **it must be a retweet of a tweet not contained in this dataset** (i.e. of a user not marked as engaging in political misinformation). This is very interesting and worth exploring further in our EDA.

Let's double-check a couple more just to be sure.

In [27]:
# get tweet content
result = ddf_unique.loc['1000000054911033344']['tweet_text'].compute()

In [28]:
result.loc['1000000054911033344']

' فيديو شاهد مواطن يوثق بالفيديو كميات كبيرة من الطعام تهدر من خيمة إفطار الصائمين '

In [29]:
# find all entries
ddf_unique[ddf_unique.tweet_text.str.contains('مواطن يوثق بالفيديو كميات كبيرة')].head()

Unnamed: 0_level_0,user_reference_id,follower_count,following_count,tweet_text,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id
tweetid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1000000054911033344,1,168,408,فيديو شاهد مواطن يوثق بالفيديو كميات كبيرة من...,,ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.983516e+17,4276.0
998364569184751621,6,42835,42860,مواطن يوثق بالفيديو كميات كبيرة من الطعام تهد...,,ar,2018-05-21 00:47:00,Twitter for Android,True,9.983592e+17,17138.0


In [30]:
# get tweet content
result2 = ddf_unique.loc['998364569184751621'].tweet_text.compute()

In [31]:
result2.loc['998364569184751621']

' مواطن يوثق بالفيديو كميات كبيرة من الطعام تهدر من خيمة إفطار الصائمين '

Interesting. Here we have 2 tweets which **are both retweets** and which have very similar content - except that one has two extra words at the beginning of the tweet (which may well be a hashtag). So here we have a tweet being retweeted with a minor customization.

**Importantly** the original tweet is not included in the dataset and comes from a user not identified as 'compromised'.

Let's try one more.

In [32]:
# get tweet content
result3 = ddf_unique.loc['1000000215598891008']['tweet_text'].compute()

In [33]:
result3.loc['1000000215598891008']

' تخفيضات على جميع الأصناف لدى دانة المسك للعود الرياض الدائري الشرقي حي الروابي بين مخرج و مقابل أسواق العثيم '

In [34]:
# find all entries
ddf_unique[ddf_unique.tweet_text.str.contains(' دانة المسك للعود الرياض الدائري الشرقي حي الروابي بين')].head()

Unnamed: 0_level_0,user_reference_id,follower_count,following_count,tweet_text,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id
tweetid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1000000215598891008,0,9007,8821,تخفيضات على جميع الأصناف لدى دانة المسك للعود...,"[تخفيضات, دانة المسك للعود, الرياض, الدائري ال...",ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.997592e+17,4277.0
1031068213684183040,148,72720,69585,دهن مخلط الدايموند من دانة المسك للعود الرياض...,"[دهن مخلط الدايموند, دانة المسك, للعود, الرياض...",ar,2018-08-19 06:39:00,Twitter for Android,True,1.020977e+18,4277.0
960302987188023297,3558,117314,117272,دهن مخلط ملكي من دانة المسك للعود الرياض الدا...,"[دهن مخلط ملكي, دانة المسك, للعود, الرياض, الد...",ar,2018-02-05 00:04:00,Twitter for iPhone,True,9.602409e+17,4277.0
963889899517677568,3556,600516,512111,عود خشب طبيعي كلمنتان المتميز برائحته الجذابة...,"[عود خشب طبيعي, كلمنتان, دانة المسك للعود, الر...",ar,2018-02-14 21:37:00,Twitter for iPhone,True,9.638707e+17,4277.0
970147097067638784,3941,204788,158132,دهن مخلط السيوف من دانة المسك للعود الرياض ال...,"[دهن مخلط السيوف, دانة المسك, للعود, الرياض, ا...",ar,2018-03-04 04:01:00,Twitter for iPhone,True,9.700389e+17,4277.0


No exact matches, just other tweets advertising the same shop at the same address.

We know enough now to continue:
- We should **definitely not** filter simply by is_retweet = False, because many of the retweets are retweeting messages that are not included in this dataset
- We should be wary of the possibility of entries that are **almost identical** -- i.e. retweets with a minor alteration.
- We will drop any rows with duplicate retweet_tweetids. This should remove most - if not all - of the almost identical tweets. We will do this using a **mask** rather than a **.drop_duplicates()** call to avoid losing the NaNs.

That said, this creates a new problem: when we perform the join, some rows will not match exactly with their respective 'unique tweet' - because there are minor differences between them. 

Solution:
- Check NaN values in 'unique_tweet_id' column of ddf AFTER the join.
- Fill these using a JOIN on the retweet_tweetid column (so KEEP that column in ddf_unique).
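The two-stage lookup described in this solution can be sketched in pandas as follows. All frames and values are hypothetical, and `Series.map` stands in for the second join; it behaves like a left join on a unique key.

```python
import pandas as pd

# Hypothetical reference table of unique tweets (retweet_tweetid is kept)
unique_tweets = pd.DataFrame({
    "unique_tweet_id": [0, 1],
    "tweet_text": ["sale today m", "good morning"],
    "retweet_tweetid": [101.0, None],
})

# Hypothetical main frame; tweet 12 differs from its unique tweet by one character
tweets = pd.DataFrame({
    "tweetid": [11, 12, 13],
    "tweet_text": ["sale today m", "sale today mq", "good morning"],
    "retweet_tweetid": [101.0, 101.0, None],
})

# Stage 1: exact match on the text
joined = tweets.merge(
    unique_tweets[["unique_tweet_id", "tweet_text"]],
    how="left",
    on="tweet_text",
)

# Stage 2: for rows still unmatched, look the id up via retweet_tweetid
by_retweet = (
    unique_tweets.dropna(subset=["retweet_tweetid"])
    .set_index("retweet_tweetid")["unique_tweet_id"]
)
fallback = joined["retweet_tweetid"].map(by_retweet)
joined["unique_tweet_id"] = joined["unique_tweet_id"].fillna(fallback)
```

The fallback only works if each retweet_tweetid appears at most once in the reference table, which is what dropping the duplicate retweet_tweetids is meant to guarantee.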

### 7.1.2. Drop Duplicate Retweet_Tweetids

In this section, we will use the mask / ddf_filter created above to drop all rows with more than one instance of the same **retweet_tweetid** while keeping the NaNs.
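The pitfall with the NaNs is easiest to see in plain pandas, where `.duplicated()` is available. The sketch below uses a hypothetical toy frame and assumes the intent is to keep one row per retweet_tweetid plus every original tweet.

```python
import pandas as pd

# Hypothetical toy frame; None marks an original (non-retweet) row
df = pd.DataFrame({
    "tweet_text": ["a", "a!", "b", "c", "d"],
    "retweet_tweetid": [101.0, 101.0, 202.0, None, None],
})

# duplicated() treats NaNs as equal to each other, so a plain
# drop_duplicates(subset="retweet_tweetid") would keep only one original
is_original = df["retweet_tweetid"].isna()
is_repeat = df.duplicated(subset="retweet_tweetid", keep="first")

# Keep all originals, plus the first row for each retweet_tweetid
deduplicated = df[is_original | ~is_repeat]
```

On this toy frame, the plain `drop_duplicates` call would return three rows and silently lose original tweet "d", while the mask keeps four.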

In [35]:
# check value counts
ddf_unique.retweet_tweetid.value_counts().head(10)

1.096888e+18    5
4.185034e+17    4
1.085309e+18    4
4.198996e+17    4
1.012492e+18    4
1.088175e+18    4
1.090611e+18    4
1.119623e+18    4
1.088173e+18    4
1.085316e+18    4
Name: retweet_tweetid, dtype: int64

In [36]:
ddf_unique.shape[0].compute()

6153341

Let's pull out a tweet_text that appears in the dataset both:
- as an original tweet (**retweet_tweetid** = NaN)
- as retweets with **multiple retweet_tweetids**

This should help us verify that we're doing the right thing.
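One caveat worth keeping in mind: the retweet_tweetid column is stored as float64 in these dataframes, and float64 cannot represent every 19-digit integer. The sketch below uses the tweet ID from the lookup that follows to show that a neighbouring (hypothetical) ID maps to the same float. Whether this actually merges distinct retweet IDs in this dataset has not been checked here.

```python
# Tweet ID used in the lookup that follows
tweet_id = 1157138758036123648

# float64 carries 53 bits of precision; above 2**53 neighbouring integers share one float
exactly_representable_limit = 2 ** 53

# A different (hypothetical) ID one higher maps to the same float64 value
collides = float(tweet_id) == float(tweet_id + 1)

# Keeping IDs as integers (or strings) avoids the collision
distinct_as_int = tweet_id != tweet_id + 1
```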

In [37]:
same_retweetid = ddf[ddf.retweet_tweetid == 1157138758036123648].compute()

In [38]:
same_text = same_retweetid.iloc[0].tweet_text

In [39]:
ddf[ddf.tweet_text == same_text].compute()

Unnamed: 0_level_0,user_reference_id,follower_count,following_count,tweet_text,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id
tweetid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1146251429457354752,2117,1364,1716,مناحل ابو سلطان عرض خاص لمده اسبوع كيلو سدر ج...,[مناحل ابو سلطان],ar,2019-07-03 02:56:00,Twitter for Android,False,,
1146253210107224065,284,8968,8010,مناحل ابو سلطان عرض خاص لمده اسبوع كيلو سدر ج...,[مناحل ابو سلطان],ar,2019-07-03 03:03:00,Twitter for Android,False,,
1146258605601099777,2137,2162,2874,مناحل ابو سلطان عرض خاص لمده اسبوع كيلو سدر ج...,[مناحل ابو سلطان],ar,2019-07-03 03:25:00,Twitter for Android,True,1.146253e+18,22598.0
1146266513801994241,2132,2351,2397,مناحل ابو سلطان عرض خاص لمده اسبوع كيلو سدر ج...,[مناحل ابو سلطان],ar,2019-07-03 03:56:00,Twitter for Android,True,1.146253e+18,22598.0
1146266908557238273,284,8968,8010,مناحل ابو سلطان عرض خاص لمده اسبوع كيلو سدر ج...,[مناحل ابو سلطان],ar,2019-07-03 03:58:00,Twitter for Android,True,1.146252e+18,100422.0
...,...,...,...,...,...,...,...,...,...,...,...
1175134986422251520,179,16937,16823,مناحل ابو سلطان عرض خاص لمده اسبوع كيلو سدر ج...,[مناحل ابو سلطان],ar,2019-09-20 19:49:00,Twitter for iPhone,True,1.174895e+18,92440.0
1175145494022098946,1522,4294,3682,مناحل ابو سلطان عرض خاص لمده اسبوع كيلو سدر ج...,[مناحل ابو سلطان],ar,2019-09-20 20:31:00,Twitter for Android,True,1.174895e+18,92440.0
1175145591619411972,1522,4294,3682,مناحل ابو سلطان عرض خاص لمده اسبوع كيلو سدر ج...,[مناحل ابو سلطان],ar,2019-09-20 20:31:00,Twitter for Android,True,1.174894e+18,2117.0
1175184889097986049,210,3690,5504,مناحل ابو سلطان عرض خاص لمده اسبوع كيلو سدر ج...,[مناحل ابو سلطان],ar,2019-09-20 23:08:00,Twitter for iPhone,True,1.174895e+18,92440.0


In [40]:
# inspect df with duplicated retweet ids
df_duplicated_retweetids.head()

Unnamed: 0_level_0,user_reference_id,follower_count,following_count,tweet_text,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_user_reference_id
retweet_tweetid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1.373095e+17,176,650,561,المهم ان اسامة لن يلعب في السعودية لغير الهلا...,,ar,2014-10-18 13:37:00,Twitter for Android,True,171264.0
1.373095e+17,176,650,561,المهم ان اسامة لن يلعب في السعودية لغير الهلا...,,ar,2016-02-24 12:35:00,Twitter for iPhone,True,5660.0
2.348015e+17,3562,652879,421198,التـمـيز فـيـك لـو غـيرك تـمـيز فـارقه ميزتك ...,,ar,2016-08-11 10:24:00,Twitter for iPad,True,235215.0
2.348015e+17,3557,814247,663230,التـمـيز فـيـك لـو غـيرك تـمـيز فـارقه ميزتك ...,,ar,2017-10-28 12:44:00,Twitter for Android,True,302905.0
2.348034e+17,3562,652879,421198,يآ عووونكـ ماقلتي عووونك ولا قلت وش فيكــ وش ...,,ar,2016-08-11 10:24:00,Twitter for iPad,True,235215.0


In [41]:
# get indices 
duplicated_retweetids = df_duplicated_retweetids.index.unique()

In [42]:
# check how many
len(duplicated_retweetids)

3709

In [43]:
# turn into list
duplicated_retweetids = list(duplicated_retweetids)

In [44]:
# use mask to filter out duplicate retweetids
ddf_unique_dropped = ddf_unique[~(ddf_unique.retweet_tweetid.isin(duplicated_retweetids))].persist()

In [46]:
# check n_rows dropped
ddf_unique.shape[0].compute() - ddf_unique_dropped.shape[0].compute()

7558

In [47]:
# inspect by looking up tweet_text
ddf_unique_dropped[ddf_unique_dropped.tweet_text == same_text].compute()

Unnamed: 0_level_0,user_reference_id,follower_count,following_count,tweet_text,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id
tweetid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1146251429457354752,2117,1364,1716,مناحل ابو سلطان عرض خاص لمده اسبوع كيلو سدر ج...,[مناحل ابو سلطان],ar,2019-07-03 02:56:00,Twitter for Android,False,,


In [49]:
ddf_unique = ddf_unique_dropped.persist()

In [50]:
# verify by checking value counts again
ddf_unique.retweet_tweetid.value_counts().head(10)

4.517189e+08    1
1.072091e+18    1
1.072091e+18    1
1.072091e+18    1
1.072091e+18    1
1.072091e+18    1
1.072091e+18    1
1.072091e+18    1
1.072090e+18    1
1.072103e+18    1
Name: retweet_tweetid, dtype: int64

In [51]:
ddf_unique.shape[0].compute()

6145783

Excellent, that did the trick.

That has reduced our set of unique tweets to 6.15mln.

Now that this is done, we can save the ddf with unique tweets to our s3 bucket.

In [52]:
# save ddf with unique tweets to s3 bucket as parquet
ddf_unique.to_parquet('s3://twitter-saudi-us-east-2/interim/ddf_unique_all_columns.parquet',
               engine='pyarrow')

### 7.1.3. Get Local Copy of Unique Tweets

The file in our s3 bucket (saved as parquet above) totals 1.4GB. That means we should definitely be able to read it in as a local pandas dataframe, especially if we only use the tweet_text and hashtags columns.

Let's try that now.

In [53]:
ddf_unique[['tweet_text', 'hashtags', 'is_retweet', 'retweet_tweetid']].head()

Unnamed: 0_level_0,tweet_text,hashtags,is_retweet,retweet_tweetid
tweetid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1000000000447930368,السلام عليكم ورحمة الله وبركاته مرحبا عملاء م...,,True,9.986493e+17
1000000030391095297,للتأجير لبيع النطيطات زحاليق مائيه صابونية مل...,"[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",True,9.996373e+17
1000000039362662400,مظلات وسواتر آفاق الرياض مظلات استراحات مظلات...,"[مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا...",True,9.993939e+17
1000000054911033344,فيديو شاهد مواطن يوثق بالفيديو كميات كبيرة من...,,True,9.983516e+17
1000000204865789954,أستغفر الله العظيم وأتوب إليه,,False,


In [54]:
%%time
# bring copy of only unique tweet text bodies, hashtags and retweet data to local machine as pandas dataframe
df_tweets_hashtags = ddf_unique[['tweet_text', 'hashtags', 'is_retweet', 'retweet_tweetid']].compute()

CPU times: user 1min 42s, sys: 3min 14s, total: 4min 57s
Wall time: 18min 43s


In [55]:
df_tweets_hashtags.head()

Unnamed: 0_level_0,tweet_text,hashtags,is_retweet,retweet_tweetid
tweetid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1000000000447930368,السلام عليكم ورحمة الله وبركاته مرحبا عملاء م...,,True,9.986493e+17
1000000030391095297,للتأجير لبيع النطيطات زحاليق مائيه صابونية مل...,"[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",True,9.996373e+17
1000000039362662400,مظلات وسواتر آفاق الرياض مظلات استراحات مظلات...,"[مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا...",True,9.993939e+17
1000000054911033344,فيديو شاهد مواطن يوثق بالفيديو كميات كبيرة من...,,True,9.983516e+17
1000000204865789954,أستغفر الله العظيم وأتوب إليه,,False,


Excellent. That works and we can continue to work locally from here.

Let's reset the index and then save this to our s3 bucket as well for later reference.

In [56]:
# reset the index to be consecutive integers starting from 0
df_tweets_hashtags.reset_index(drop=True, inplace=True)

In [57]:
# create column from index for later use
df_tweets_hashtags.reset_index(drop=False,inplace=True)
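The two consecutive `reset_index` calls can look redundant, so here is a small sketch of what each one does. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {"tweet_text": ["x", "y", "z"]},
    index=pd.Index([1000, 1003, 1007], name="tweetid"),
)

# First reset: discard the tweetid index, leaving a 0..n-1 RangeIndex.
toy = toy.reset_index(drop=True)

# Second reset: copy that RangeIndex into a regular column called "index".
toy = toy.reset_index(drop=False)

print(toy.columns.tolist())
print(toy["index"].tolist())
```

The result has an `index` column holding consecutive integers, which is what later serves as the unique tweet identifier.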

In [58]:
df_tweets_hashtags.head(10)

Unnamed: 0,index,tweet_text,hashtags,is_retweet,retweet_tweetid
0,0,السلام عليكم ورحمة الله وبركاته مرحبا عملاء م...,,True,9.986493e+17
1,1,للتأجير لبيع النطيطات زحاليق مائيه صابونية مل...,"[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",True,9.996373e+17
2,2,مظلات وسواتر آفاق الرياض مظلات استراحات مظلات...,"[مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا...",True,9.993939e+17
3,3,فيديو شاهد مواطن يوثق بالفيديو كميات كبيرة من...,,True,9.983516e+17
4,4,أستغفر الله العظيم وأتوب إليه,,False,
5,5,تخفيضات على جميع الأصناف لدى دانة المسك للعود...,"[تخفيضات, دانة المسك للعود, الرياض, الدائري ال...",True,9.997592e+17
6,6,علاج السرطان في الهند عند افضل مستشفى مختص با...,"[سرطان الثدي, سرطان البنكرياس, سر…]",True,9.997301e+17
7,7,دورة مع السفرة السادسة عشر احصائية الربع الاو...,"[مع السفرة, حلقا…]",True,9.997669e+17
8,8,تسديد قروض الراجحي الاهلي راتب وجميع البنوك ب...,[تسديد قروض],True,9.99761e+17
9,9,لا إله إلا أنت سبحانك إني كنت من الظالمين,,False,


Excellent, let's save, both locally and to s3 bucket.

In [59]:
# save this dataframe locally
df_tweets_hashtags.to_parquet('/Users/richard/Desktop/data_cap3/interim/df_unique_tweets_hashtags_reset_index_030521.parquet',
                          engine='pyarrow')

In [60]:
%%time
# save this dataframe to s3 bucket (ca. 1GB upload)
df_tweets_hashtags.to_parquet('s3://twitter-saudi-us-east-2/interim/df_unique_tweets_hashtags_reset_index.parquet',
                              engine='pyarrow')

CPU times: user 48.7 s, sys: 53.7 s, total: 1min 42s
Wall time: 16min 14s


## 7.2. Replace Tweet_Text in Full Table with Index

Now that we have a Dataframe with the unique tweet_text contents and hashtags (and a reset index), we can go ahead and replace the tweet_text bodies in the original, full dataset (containing >35mln rows) with just an integer referring to the index of the unique tweet_text content. We'll do this using a JOIN.

This will significantly reduce the size of the original, full dataset and may mean we can work with that locally, too.

Because we will be executing a join here, let's bring in the pandas dataframe with unique tweets as a dask dataframe. That will make the join easier to execute.


In [4]:
ddf_tweets_hashtags = dd.read_parquet('s3://twitter-saudi-us-east-2/interim/df_unique_tweets_hashtags_reset_index.parquet',
                                      engine='pyarrow').persist()

In [5]:
ddf_tweets_hashtags.head()

Unnamed: 0,index,tweet_text,hashtags,is_retweet,retweet_tweetid
0,0,السلام عليكم ورحمة الله وبركاته مرحبا عملاء م...,,True,9.986493e+17
1,1,للتأجير لبيع النطيطات زحاليق مائيه صابونية مل...,"[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",True,9.996373e+17
2,2,مظلات وسواتر آفاق الرياض مظلات استراحات مظلات...,"[مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا...",True,9.993939e+17
3,3,فيديو شاهد مواطن يوثق بالفيديو كميات كبيرة من...,,True,9.983516e+17
4,4,أستغفر الله العظيم وأتوب إليه,,False,


In [6]:
ddf_tweets_hashtags.shape[0].compute()

6145783

In [7]:
ddf_tweets_hashtags.tail()

Unnamed: 0,index,tweet_text,hashtags,is_retweet,retweet_tweetid
6145778,6145778,وأنا بقلّب في تركى آل شيخ لقيت التايم لاين,,False,
6145779,6145779,اختي جوزها شافها وهي طالعه من مسجد بالعشر الا...,,True,9.993118e+17
6145780,6145780,رمضان كريم الدحيل العين القدس عاصمه فلسطين ال...,"[رمضان كريم, الدحيل العين, Ramadan, القدس عاصم...",True,9.968081e+17
6145781,6145781,قال رسول الله إنَّ في الجمُعةِ لساعَةٌ لا يوا...,,True,7.741362e+17
6145782,6145782,إنجازات شخصية للأعضاء فقط شنو انجازات المجلس ...,[شنو انجازات المجلس الامه],True,9.992911e+17


Looks like that saved OK.

It's only 1 partition, though, so let's repartition to 50MB per partition.

In [8]:
# repartition
ddf_tweets_hashtags = ddf_tweets_hashtags.repartition(partition_size='50MB').persist()

In [9]:
# check if divisions are known
ddf_tweets_hashtags.known_divisions

False

In [10]:
# set index
ddf_tweets_hashtags = ddf_tweets_hashtags.set_index('index', drop=False).persist()

In [11]:
ddf_tweets_hashtags.head()

Unnamed: 0_level_0,index,tweet_text,hashtags,is_retweet,retweet_tweetid
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,0,السلام عليكم ورحمة الله وبركاته مرحبا عملاء م...,,True,9.986493e+17
1,1,للتأجير لبيع النطيطات زحاليق مائيه صابونية مل...,"[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",True,9.996373e+17
2,2,مظلات وسواتر آفاق الرياض مظلات استراحات مظلات...,"[مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا...",True,9.993939e+17
3,3,فيديو شاهد مواطن يوثق بالفيديو كميات كبيرة من...,,True,9.983516e+17
4,4,أستغفر الله العظيم وأتوب إليه,,False,


In [12]:
ddf_tweets_hashtags.known_divisions

True

### 7.2.1. Executing Join

We're now ready to execute the join. We will:
1. Bring in the unique_tweet_id by joining on the 'tweet_text' column
2. For any rows with slightly mismatched tweet_texts, we will join the unique_tweet_id on the 'retweet_tweetid' column
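As a rough illustration of step 1, the sketch below does the same left join in plain pandas on made-up data. The frames `unique` and `full` are hypothetical stand-ins for the unique-tweets table and the full dataset.

```python
import pandas as pd

# Hypothetical lookup table of unique texts with their integer index.
unique = pd.DataFrame({"index": [0, 1], "tweet_text": ["hello", "world"]})

# Hypothetical full table; the last row has a slightly mismatched text.
full = pd.DataFrame({
    "tweetid": [11, 12, 13, 14],
    "tweet_text": ["hello", "world", "hello", "hello!"],
})

# The left join keeps every row of `full`; rows without a match get NaN.
merged = full.merge(unique[["index", "tweet_text"]], how="left", on="tweet_text")

# Replace the text body with the integer reference.
slim = merged.drop(columns="tweet_text").rename(columns={"index": "unique_tweetid"})
print(slim)
```

Because unmatched rows receive NaN, the joined column becomes a float column. That is why the identifiers in the outputs below show up as `317.0` rather than `317`.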

In [13]:
ddf = dd.read_parquet('s3://twitter-saudi-us-east-2/interim/ddf_clean_userids_substituted.parquet',
                      engine='pyarrow').persist()

In [14]:
# check if divisions are known
ddf.known_divisions

True

In [15]:
ddf.head()

Unnamed: 0_level_0,user_reference_id,follower_count,following_count,tweet_text,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id
tweetid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1000000000447930368,0,9007,8821,السلام عليكم ورحمة الله وبركاته مرحبا عملاء م...,,ar,2018-05-25 13:05:00,Twitter for iPhone,True,9.986493e+17,4273.0
1000000030391095297,0,9007,8821,للتأجير لبيع النطيطات زحاليق مائيه صابونية مل...,"[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.996373e+17,4274.0
1000000039362662400,0,9007,8821,مظلات وسواتر آفاق الرياض مظلات استراحات مظلات...,"[مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا...",ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.993939e+17,4275.0
1000000054911033344,1,168,408,فيديو شاهد مواطن يوثق بالفيديو كميات كبيرة من...,,ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.983516e+17,4276.0
1000000204865789954,2,1623,2022,أستغفر الله العظيم وأتوب إليه,,ar,2018-05-25 13:06:00,غرد بصدقة,False,,


In [16]:
# turn tweetid into column before executing join
ddf['tweetid'] = ddf.index
# persist() returns a new collection, so assign the result back
ddf = ddf.persist()

In [17]:
# verify that both divisions are known
ddf_tweets_hashtags.known_divisions, ddf.known_divisions

(True, True)

OK, now we should be all set to execute the join.

In [18]:
%%time
# execute left join
# takes a minute or two to run on the cluster
ddf_merged = ddf.merge(
    ddf_tweets_hashtags[
        ["index", 'tweet_text']
    ],
    how="left",
    on="tweet_text"
).persist()

CPU times: user 6.13 s, sys: 188 ms, total: 6.31 s
Wall time: 7.4 s


Let's inspect the new dataframe to see if that executed OK.

In [19]:
ddf_merged.head()

Unnamed: 0,user_reference_id,follower_count,following_count,tweet_text,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id,tweetid,index
0,11,51215,7289,اللهم لا تصدّ عنا وجهك يوم أن نلقاك اللّهم لا ...,,ar,2018-05-25 14:15:00,erased14591474,False,,,1000017501831548928,317.0
1,16,3395,393,اللهم لا تصدّ عنا وجهك يوم أن نلقاك اللّهم لا ...,,ar,2018-05-25 15:52:00,erased14591474,False,,,1000041961783529473,317.0
2,58,2370,3492,احباط هجوم إرهابي بأبها زيادة قوة وصلابة الإن...,[احباط هجوم إرهابي بأبها],ar,2018-05-26 21:02:00,Twitter for Android,True,1.000423e+18,5212.0,1000482347123437571,5288.0
3,189,802,282,اللهم يا من بلغتنا رمضان أكرمنا بالعتق من الن...,,ar,2018-05-26 22:23:00,Twitter for Android,True,9.994315e+17,5620.0,1000502743893671936,5490.0
4,58,2370,3492,اجعلنا منهم يا الله,,ar,2018-05-26 23:55:00,Twitter for iPhone,True,1.00042e+18,5913.0,1000525934754136064,5744.0


In [20]:
ddf_merged[['tweet_text', 'index']].head()

Unnamed: 0,tweet_text,index
0,اللهم لا تصدّ عنا وجهك يوم أن نلقاك اللّهم لا ...,317.0
1,اللهم لا تصدّ عنا وجهك يوم أن نلقاك اللّهم لا ...,317.0
2,احباط هجوم إرهابي بأبها زيادة قوة وصلابة الإن...,5288.0
3,اللهم يا من بلغتنا رمضان أكرمنا بالعتق من الن...,5490.0
4,اجعلنا منهم يا الله,5744.0


OK, we have 2 identical tweet_texts showing up as having the same index. That's a good sign.

In [21]:
# verify max() value of index column
ddf_merged['index'].max().compute() == (ddf_tweets_hashtags.shape[0].compute() - 1)

True

In [22]:
# verify
ddf_tweets_hashtags[ddf_tweets_hashtags['index'] == 5744].compute()

Unnamed: 0_level_0,index,tweet_text,hashtags,is_retweet,retweet_tweetid
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
5744,5744,اجعلنا منهم يا الله,,True,1.00042e+18


Perfect. This worked.

We can now drop the tweet_text column.

In [23]:
# drop tweet_text column
ddf_merged_dropped = ddf_merged.drop(columns='tweet_text').persist()
ddf_merged_dropped

Unnamed: 0_level_0,user_reference_id,follower_count,following_count,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id,tweetid,index
npartitions=763,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
,int64,int64,int64,object,object,datetime64[ns],object,bool,float64,float64,object,int64
,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...


In [24]:
ddf_merged_dropped.head()

Unnamed: 0,user_reference_id,follower_count,following_count,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id,tweetid,index
0,11,51215,7289,,ar,2018-05-25 14:15:00,erased14591474,False,,,1000017501831548928,317.0
1,16,3395,393,,ar,2018-05-25 15:52:00,erased14591474,False,,,1000041961783529473,317.0
2,58,2370,3492,[احباط هجوم إرهابي بأبها],ar,2018-05-26 21:02:00,Twitter for Android,True,1.000423e+18,5212.0,1000482347123437571,5288.0
3,189,802,282,,ar,2018-05-26 22:23:00,Twitter for Android,True,9.994315e+17,5620.0,1000502743893671936,5490.0
4,58,2370,3492,,ar,2018-05-26 23:55:00,Twitter for iPhone,True,1.00042e+18,5913.0,1000525934754136064,5744.0


In [25]:
ddf_merged_dropped.shape[0].compute()

35347002

Excellent, looking good. Number of rows is still **35347002**.

We can now proceed to change the column name 'index' to 'unique_tweetid'.

Let's also reshuffle the order of the columns.

In [26]:
ddf_merged_dropped = ddf_merged_dropped.rename(columns={
    'index': 'unique_tweetid',
    'tweetid': 'twitter_tweetid'
}
).persist()

In [27]:
ddf_merged_dropped.columns

Index(['user_reference_id', 'follower_count', 'following_count', 'hashtags',
       'tweet_language', 'tweet_time', 'tweet_client_name', 'is_retweet',
       'retweet_tweetid', 'retweet_user_reference_id', 'twitter_tweetid',
       'unique_tweetid'],
      dtype='object')

In [28]:
ddf_merged_dropped = ddf_merged_dropped[['unique_tweetid',
                                         'twitter_tweetid',
                                         'user_reference_id', 
                                         'follower_count', 
                                         'following_count', 
                                         'hashtags',
                                         'tweet_language', 
                                         'tweet_time', 
                                         'tweet_client_name', 
                                         'is_retweet',
                                         'retweet_tweetid', 
                                         'retweet_user_reference_id']].persist()

In [29]:
ddf_merged_dropped.head()

Unnamed: 0,unique_tweetid,twitter_tweetid,user_reference_id,follower_count,following_count,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id
0,317.0,1000017501831548928,11,51215,7289,,ar,2018-05-25 14:15:00,erased14591474,False,,
1,317.0,1000041961783529473,16,3395,393,,ar,2018-05-25 15:52:00,erased14591474,False,,
2,5288.0,1000482347123437571,58,2370,3492,[احباط هجوم إرهابي بأبها],ar,2018-05-26 21:02:00,Twitter for Android,True,1.000423e+18,5212.0
3,5490.0,1000502743893671936,189,802,282,,ar,2018-05-26 22:23:00,Twitter for Android,True,9.994315e+17,5620.0
4,5744.0,1000525934754136064,58,2370,3492,,ar,2018-05-26 23:55:00,Twitter for iPhone,True,1.00042e+18,5913.0


In [30]:
# interim save to s3 bucket
ddf_merged_dropped.to_parquet('s3://twitter-saudi-us-east-2/interim/ddf_merged_dropped_BEFORE_join_on_retweetids.parquet',
                              engine='pyarrow')

### 7.2.2. Joining Unique Tweet IDs of Mismatched Retweets

Only one thing left to do: the rows with NaN in the **unique_tweetid** column are slightly mismatched retweets. We will bring in the unique tweetid of these tweets by joining on the **retweet_tweetid** column.

In [31]:
ddf_tweets_hashtags.head()

Unnamed: 0_level_0,index,tweet_text,hashtags,is_retweet,retweet_tweetid
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,0,السلام عليكم ورحمة الله وبركاته مرحبا عملاء م...,,True,9.986493e+17
1,1,للتأجير لبيع النطيطات زحاليق مائيه صابونية مل...,"[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",True,9.996373e+17
2,2,مظلات وسواتر آفاق الرياض مظلات استراحات مظلات...,"[مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا...",True,9.993939e+17
3,3,فيديو شاهد مواطن يوثق بالفيديو كميات كبيرة من...,,True,9.983516e+17
4,4,أستغفر الله العظيم وأتوب إليه,,False,


In [32]:
# check if divisions are known
ddf_tweets_hashtags.known_divisions

True

In [33]:
# same for main ddf
ddf_merged_dropped.known_divisions

False

It's important to set the index for ddf_merged_dropped AND to maintain twitter_tweetid as a separate column so we retain it after the join.
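A small pandas sketch of why the separate column matters. The frames `left` and `right` are hypothetical, and this shows pandas behaviour; the `ddf_final` output further down suggests the Dask merge behaves the same way, but treat that as an assumption.

```python
import pandas as pd

left = pd.DataFrame({
    "twitter_tweetid": ["t1", "t2"],
    "retweet_tweetid": [10.0, 20.0],
})
right = pd.DataFrame({"index": [0, 1], "retweet_tweetid": [10.0, 20.0]})

# drop=False keeps twitter_tweetid as a column in addition to the index.
kept = left.set_index("twitter_tweetid", drop=False)
merged_kept = kept.merge(right, how="left", on="retweet_tweetid")

# drop=True moves it into the index only.
moved = left.set_index("twitter_tweetid", drop=True)
merged_moved = moved.merge(right, how="left", on="retweet_tweetid")

# A column-on-column merge ignores the existing index and returns 0..n-1.
print(merged_kept.index.tolist())
print("twitter_tweetid" in merged_kept.columns)
print("twitter_tweetid" in merged_moved.columns)
```

With `drop=True` the tweet ids would be lost after the merge, because the merge result gets a fresh integer index.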

In [34]:
# set index for ddf_merged_dropped
ddf_merged_dropped = ddf_merged_dropped.set_index('twitter_tweetid', drop=False).persist()

In [35]:
# same for main ddf
ddf_merged_dropped.known_divisions

True

In [36]:
ddf_merged_dropped.head()

Unnamed: 0_level_0,unique_tweetid,twitter_tweetid,user_reference_id,follower_count,following_count,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id
twitter_tweetid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1000000000447930368,0.0,1000000000447930368,0,9007,8821,,ar,2018-05-25 13:05:00,Twitter for iPhone,True,9.986493e+17,4273.0
1000000030391095297,1.0,1000000030391095297,0,9007,8821,"[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.996373e+17,4274.0
1000000039362662400,2.0,1000000039362662400,0,9007,8821,"[مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا...",ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.993939e+17,4275.0
1000000054911033344,3.0,1000000054911033344,1,168,408,,ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.983516e+17,4276.0
1000000204865789954,4.0,1000000204865789954,2,1623,2022,,ar,2018-05-25 13:06:00,غرد بصدقة,False,,


In [37]:
ddf_merged_dropped.shape[0].compute()

35347002

In [38]:
ddf_tweets_hashtags.shape[0].compute()

6145783

In [39]:
ddf_tweets_hashtags.known_divisions

True

One more thing to do before we can perform the join:

Replace the NaNs in ddf_merged_dropped's **retweet_tweetid** column with a placeholder value. This prevents the join from matching rows on NaN values.
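To see why this matters, here is a minimal pandas sketch with made-up data. In pandas, rows whose join key is NaN on both sides are matched against each other; I am assuming the Dask merge follows the same rule. The placeholder 999 is the one used in the cell below and relies on no real tweet id having that value.

```python
import numpy as np
import pandas as pd

SENTINEL = 999  # placeholder used in the notebook; assumed not to be a real id

left = pd.DataFrame({
    "twitter_tweetid": [1, 2, 3],
    "retweet_tweetid": [np.nan, np.nan, 10.0],
})
right = pd.DataFrame({
    "index": [0, 1, 2],
    "retweet_tweetid": [np.nan, np.nan, 10.0],
})

# Naive join: each NaN on the left matches every NaN on the right.
naive = left.merge(right, how="left", on="retweet_tweetid")

# Filling the left-hand NaNs first means they no longer match anything.
filled = left.assign(retweet_tweetid=left["retweet_tweetid"].fillna(SENTINEL))
safe = filled.merge(right, how="left", on="retweet_tweetid")

print(len(left), len(naive), len(safe))
```

The naive join inflates the row count, while the filled version keeps one output row per input row.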

In [40]:
# fill NaNs with 999
ddf_merged_dropped.retweet_tweetid = ddf_merged_dropped.retweet_tweetid.fillna(999).persist()

In [41]:
ddf_merged_dropped.retweet_tweetid.isnull().sum().compute()

0

In [42]:
ddf_merged_dropped.shape[0].compute()

35347002

In [43]:
# retweet column still dtype float?
ddf_merged_dropped.dtypes

unique_tweetid                        int64
twitter_tweetid                      object
user_reference_id                     int64
follower_count                        int64
following_count                       int64
hashtags                             object
tweet_language                       object
tweet_time                   datetime64[ns]
tweet_client_name                    object
is_retweet                             bool
retweet_tweetid                     float64
retweet_user_reference_id           float64
dtype: object

In [44]:
ddf_tweets_hashtags.dtypes

index                int64
tweet_text          object
hashtags            object
is_retweet            bool
retweet_tweetid    float64
dtype: object

OK, ready to execute join.

In [45]:
%%time
# execute left join
# takes a minute or two to run on the cluster
ddf_final = ddf_merged_dropped.merge(
                    ddf_tweets_hashtags[
                        ["index", 'retweet_tweetid']
                    ],
                    how="left",
                    on="retweet_tweetid"
).persist()

CPU times: user 6.56 s, sys: 287 ms, total: 6.85 s
Wall time: 7.76 s


In [46]:
ddf_final.head()

Unnamed: 0,unique_tweetid,twitter_tweetid,user_reference_id,follower_count,following_count,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id,index
0,275.0,1000011728917291009,10,4962,4547,"[الخليج التجارى, دبى, القناة المائية]",ar,2018-05-25 13:52:00,Twitter for iPhone,True,9.997086e+17,4381.0,275.0
1,981.0,1000069946448273408,0,9007,8821,,ar,2018-05-25 17:43:00,Twitter for iPhone,True,9.99827e+17,4341.0,981.0
2,1156.0,1000074572362866689,88,4540,4732,"[الرياض, الرياض الآن, الدمام, الدمام الآن, الق...",ar,2018-05-25 18:02:00,Twitter for Android,True,9.99287e+17,4719.0,
3,1249.0,1000077564696768518,0,9007,8821,"[جيوشيلد, حماية, رمضان, عروض رمضان]",ar,2018-05-25 18:14:00,Twitter for iPhone,True,9.978252e+17,4347.0,1249.0
4,1249.0,1000099542241628160,66,4607,4843,"[جيوشيلد, حماية, رمضان, عروض رمضان]",ar,2018-05-25 19:41:00,Twitter for Android,True,9.978252e+17,4347.0,1249.0


In [47]:
# check shape
ddf_final.shape[0].compute()

35347002

Let's check a sample of our ddf_final for correspondence between **unique_tweetid** and the newly joined **index**. These should be the same.
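Beyond eyeballing a sample, the same check can be expressed programmatically. This is a sketch on a hypothetical frame `toy`; on the real data the comparison would run on the Dask dataframe and need a `.compute()`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "unique_tweetid": [5.0, 7.0, 9.0, np.nan],
    "index": [5.0, np.nan, 9.0, 3.0],
})

# Only rows where both joins produced a value can be compared.
both = toy.dropna(subset=["unique_tweetid", "index"])
all_agree = bool((both["unique_tweetid"] == both["index"]).all())

print(len(both), all_agree)
```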

In [48]:
# check sample 
ddf_final[['unique_tweetid', 'twitter_tweetid', 'retweet_tweetid', 'index']].sample(frac=0.001, random_state=27).head(10)

Unnamed: 0,unique_tweetid,twitter_tweetid,retweet_tweetid,index
37559,5447465.0,885864270797393921,8.856847e+17,5447465.0
9977,1580981.0,1143030293374394368,1.142064e+18,1580981.0
40477,5837503.0,955915527427411969,9.558111e+17,
31653,4845687.0,791917788285722624,7.918077e+17,
35908,23.0,855872250574245888,8.558364e+17,
3071,777588.0,1089483720405516288,1.083514e+18,777588.0
26802,4344023.0,718102313014063105,7.174532e+17,
1856,373738.0,1068849280960606209,1.051647e+18,373738.0
40294,5842468.0,948042603714539520,9.480423e+17,5842468.0
18543,1789480.0,1195388577699356672,1.19522e+18,


The merge worked.

Now replace all NaN values in 'unique_tweetid' with the corresponding 'index' value.
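The fill step relies on `fillna` accepting another Series, in which case values are taken row by row from that Series. A minimal pandas sketch with made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "unique_tweetid": [5.0, np.nan, np.nan],
    "index": [5.0, 7.0, np.nan],
})

# Rows that already have a value keep it; NaNs take the value from "index".
toy["unique_tweetid"] = toy["unique_tweetid"].fillna(toy["index"])

remaining_null = int(toy["unique_tweetid"].isna().sum())
print(toy["unique_tweetid"].tolist(), remaining_null)
```

Rows where neither join found a match remain NaN after the fill, which is the situation behind the leftover null count below.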

In [49]:
# check number of tweets with no unique_tweetid
ddf_final.unique_tweetid[ddf_final.unique_tweetid.isnull()].shape[0].compute()

65349

In [50]:
# replace NaNs in unique_tweetid with value in index column
ddf_final.unique_tweetid = ddf_final.unique_tweetid.fillna(ddf_final['index']).persist()

In [51]:
# check number of tweets with no unique_tweetid
n_rows_final_null = ddf_final.unique_tweetid[ddf_final.unique_tweetid.isnull()].shape[0].compute()
print(n_rows_final_null)

65167


In [52]:
n_rows_final_null / ddf_final.shape[0].compute() * 100

0.18436358478153253

Hm, we still have 65k tweets with no unique tweet_id. Just over 0.18%.

I've spent too much time on this and need to move on. I will drop these rows for now.

We can also drop the 'index' column.

In [53]:
ddf_final_dropped = ddf_final.dropna(subset=['unique_tweetid'])
ddf_final_dropped = ddf_final_dropped.drop(columns=['index']).persist()
ddf_final_dropped

Unnamed: 0_level_0,unique_tweetid,twitter_tweetid,user_reference_id,follower_count,following_count,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id
npartitions=763,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
,int64,object,int64,int64,int64,object,object,datetime64[ns],object,bool,float64,float64
,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...


In [55]:
ddf_final_dropped.head(3)

Unnamed: 0,unique_tweetid,twitter_tweetid,user_reference_id,follower_count,following_count,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id
0,275.0,1000011728917291009,10,4962,4547,"[الخليج التجارى, دبى, القناة المائية]",ar,2018-05-25 13:52:00,Twitter for iPhone,True,9.997086e+17,4381.0
1,981.0,1000069946448273408,0,9007,8821,,ar,2018-05-25 17:43:00,Twitter for iPhone,True,9.99827e+17,4341.0
2,1156.0,1000074572362866689,88,4540,4732,"[الرياض, الرياض الآن, الدمام, الدمام الآن, الق...",ar,2018-05-25 18:02:00,Twitter for Android,True,9.99287e+17,4719.0


In [56]:
ddf_final.shape[0].compute() - ddf_final_dropped.shape[0].compute()

65167

Excellent. 

Let's set the index to **twitter_tweetid** column and then save this to our s3 bucket.

In [57]:
# check if divisions are known
ddf_final_dropped.known_divisions

False

In [58]:
# set index and drop twitter_tweetid column
ddf_final_dropped = ddf_final_dropped.set_index('twitter_tweetid', drop=True).persist()

In [59]:
ddf_final_dropped.known_divisions

True

In [60]:
ddf_final_dropped.head(3)

Unnamed: 0_level_0,unique_tweetid,user_reference_id,follower_count,following_count,hashtags,tweet_language,tweet_time,tweet_client_name,is_retweet,retweet_tweetid,retweet_user_reference_id
twitter_tweetid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1000000000447930368,0.0,0,9007,8821,,ar,2018-05-25 13:05:00,Twitter for iPhone,True,9.986493e+17,4273.0
1000000030391095297,1.0,0,9007,8821,"[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.996373e+17,4274.0
1000000039362662400,2.0,0,9007,8821,"[مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا...",ar,2018-05-25 13:06:00,Twitter for iPhone,True,9.993939e+17,4275.0


In [61]:
ddf_final_dropped.shape[0].compute()

35281835

Great, let's save this.

We still have more than 35M tweets to work with.

In [62]:
# save to s3 as parquet
ddf_final_dropped.to_parquet('s3://twitter-saudi-us-east-2/interim/ddf_complete_with_tweets_as_indices.parquet',
                              engine='pyarrow')

# 8. Pre-Processing Unique Tweets

In order to process our tweets using a topic-modelling algorithm (LDA), we will have to conduct the necessary pre-processing steps. Many of these steps are specific to Arabic NLP. As such, I'll provide a brief motivation for why each step is necessary.

For an in-depth explanation, please see [my Medium article](https://towardsdatascience.com/arabic-nlp-unique-challenges-and-their-solutions-d99e8a87893d) on the topic. 

0. Remove repeating characters and stop words (technically part of wrangling steps above but will do it here)
1. Orthographic Normalisation: necessary to account for alternative spellings and common spelling inconsistencies across dialects
2. Dediacritization: removal of diacritics (small symbols above/below characters) to reduce data sparsity.
3. Remove stopwords
4. Morphological Disambiguation: identifying the most likely meaning and form of the word and its lemma(s) (see note below)
5. Tokenization

Note on step 4: Arabic words have on average ~7 diacritizations and ~3 lemmas *per word*.

In [64]:
# drop index column
ddf_uniq = ddf_tweets_hashtags.drop(columns=['index']).persist()

In [65]:
ddf_uniq.head()

Unnamed: 0_level_0,tweet_text,hashtags,is_retweet,retweet_tweetid
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,السلام عليكم ورحمة الله وبركاته مرحبا عملاء م...,,True,9.986493e+17
1,للتأجير لبيع النطيطات زحاليق مائيه صابونية مل...,"[للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,...",True,9.996373e+17
2,مظلات وسواتر آفاق الرياض مظلات استراحات مظلات...,"[مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا...",True,9.993939e+17
3,فيديو شاهد مواطن يوثق بالفيديو كميات كبيرة من...,,True,9.983516e+17
4,أستغفر الله العظيم وأتوب إليه,,False,


In [66]:
ddf_uniq.shape[0].compute()

6145783

Alright, let's go ahead and preprocess our 6.1M unique tweets.

### 8.1. Remove Repeating Characters

This is something we could have done earlier in the data wrangling phase, but let's do it now before continuing. Using the regex pattern below, we will replace any run of three or more identical consecutive characters with a single instance of that character. Runs of two (such as doubled letters) are left untouched. This accounts for informal text input such as (the Arabic equivalents of) "yeeeees" or "haaaahaaa".
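To make the pattern's behaviour concrete on Latin text first, here is a short sketch. The helper name `collapse_repeats` is hypothetical; the pattern is the same one used in the function below.

```python
import re

# (.) captures one character; \1{2,} requires at least two more copies of it.
REPEATS = re.compile(r"(.)\1{2,}")

def collapse_repeats(text: str) -> str:
    """Collapse runs of three or more identical characters to one."""
    return REPEATS.sub(r"\1", text)

for word in ["yeeeees", "haaaahaaa", "hello", "cool"]:
    print(word, "->", collapse_repeats(word))
```

Note that "hello" and "cool" pass through unchanged because their doubled letters are runs of only two.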

In [67]:
# define function
def remove_repeating_char(text):
    return re.sub("(.)\\1{2,}", "\\1", text)

In [68]:
# test
remove_repeating_char('اللة هههه')

'اللة ه'

Great, this is working as expected. Let's apply it to our Dask dataframe.

In [69]:
# define function map across partitions
def map_remove_rep_char(df):
    df.tweet_text = df.tweet_text.apply(remove_repeating_char)
    return df

In [70]:
ddf_uniq = ddf_uniq.map_partitions(map_remove_rep_char).persist()
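As a quick sanity check on the pattern, the sketch below runs the same regular expression on a few extra inputs. The English strings and the `collapse_repeats` name are illustrative only and are not part of the notebook. It shows that only runs of three or more identical characters are collapsed, so legitimate doubled letters survive.

```python
import re

# Same pattern as above: one character followed by two or more copies of itself
REPEAT_RE = re.compile(r"(.)\1{2,}")

def collapse_repeats(text):
    """Collapse any run of 3+ identical characters to a single character."""
    return REPEAT_RE.sub(r"\1", text)

for sample in ["yeeeees", "haaaahaaa", "good", "1000"]:
    print(sample, "->", collapse_repeats(sample))
```

One side effect worth keeping in mind: the pattern is not limited to letters, so a number such as `1000` is also shortened. For topic modelling on tweet text this is probably acceptable, but it would matter if numeric values were needed later.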

### 8.2. Orthographic Normalization
Let's now move on to normalize spellings to account for inconsistencies across dialects and common spelling 'mistakes'. This will reduce data sparsity.

In [72]:
def ortho_normalize(text):
    text = normalize_alef_maksura_ar(text)
    text = normalize_alef_ar(text)
    text = normalize_teh_marbuta_ar(text)
    return text

In [73]:
def map_ortho(df):
    df.tweet_text = df.tweet_text.apply(ortho_normalize)
    return df

In [74]:
# map across partitions
ddf_uniq = ddf_uniq.map_partitions(map_ortho).persist()

In [75]:
ddf_uniq.tweet_text.sample(frac=0.0001, random_state=21).head(10)

index
915        عرض خاص مركز طرق الجمال الطبي تنظيف الاسنان ت...
48117      كاس العالم علي بي اوت رتويت شهري اشتراك رتويت...
74594      لحم برمه علي كيف كيفك السعوديه الدمام الخبر م...
135340     نفسيتي محتاجه ادعس علي شواربك ابوك لاابو اللي...
48656      واكيد هتشجع منتخب الاورجواي الشقيق ضد منتخبنا...
151018     صفَت لك الايــام والا ماصفت المجد غـايه للرجـ...
3647       اللهم ادخل السرور والحبور في نفوس العرب الشرف...
89270      اللهم احتويني برحمتك وتولَّني بقُدرتك ولا تخز...
79808     هيكل سمكه القنفذ سم هذه السمكه علي الكائنات ال...
17187      مهندس مصري محكوم عليه بالاعدام في السعوديه يس...
Name: tweet_text, dtype: object
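To make the effect of these three calls concrete, here is a rough standard-library approximation. The character mapping is an assumption inferred from the sample output above (for example `علي` and `السعوديه`), not a copy of the camel_tools implementation, and `ortho_normalize_sketch` is a hypothetical name.

```python
# Rough approximation of the three normalisations (assumed mapping,
# inferred from the sample output; not the camel_tools source).
ORTHO_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maksura -> yeh
    "\u0629": "\u0647",  # teh marbuta -> heh
})

def ortho_normalize_sketch(text):
    """Map variant spellings of alef, alef maksura and teh marbuta to one form."""
    return text.translate(ORTHO_MAP)

# "إلى السعودية" written with explicit code points to avoid copy/paste issues
before = "\u0625\u0644\u0649 \u0627\u0644\u0633\u0639\u0648\u062F\u064A\u0629"
after = ortho_normalize_sketch(before)
print(before, "->", after)
```

The point of the mapping is that spellings which differ only in these characters end up as the same token, which is what reduces sparsity in the vocabulary.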

### 8.3. Dediacritization
Now let's proceed to remove the diacritics, again to significantly reduce data sparsity.

*NB: diacritics are, loosely put, the Arabic equivalent of vowels. They are symbols written above or below the main characters that change the pronunciation (and possibly the meaning) of the word. This means that, technically speaking, different words can **look** the same once we remove the diacritics. However, fluent Arabic speakers can ascertain the correct meaning of the word from context. For example, most Arabic newspapers are written without diacritics.*

We use the **dediac_ar** function included in the **camel_tools** library.

In [76]:
def map_dediac(df):
    df.tweet_text = df.tweet_text.apply(dediac_ar)
    return df

In [77]:
ddf_uniq = ddf_uniq.map_partitions(map_dediac).persist()

In [78]:
ddf_uniq.tweet_text.sample(frac=0.0001, random_state=2).head(10)

index
151704                قروب ريم الملكيه اهداء لنونا الغامدي 
75433      خل الوجيه المظلمه والكئيبه اللي بها الهجران ك...
8750                                      الف مبروك يا بطل 
40080      هديتك مع فلل بريرا حطين حاب تشتغل وتزود دخلك ...
25114      احسن الله عزاءكم وجبر مصابكم وغفرا الله لها و...
92645      للاعلانات التواصل علي الواتساب مستشفيات بيع م...
42559      المشكله الحقيقيه انها بتاخد اكتر من وقتها علي...
40959      بس بقول شو شعورك الامارات رهف تناشد الملك سلم...
50978     انتي اللي احلي واجمل عيد اونتي حتتي انتي احلي ...
68706      هاشتاج السيسي مش هيرحل يقترب من رقم مليون مشا...
Name: tweet_text, dtype: object
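The ambiguity described in the note above can be illustrated with a small standard-library sketch. The code-point range is an assumption that covers the common short-vowel marks; `dediac_ar` may handle additional characters, and `strip_diacritics` is a hypothetical name.

```python
import re

# Fathatan..sukun (U+064B-U+0652) plus dagger alef (U+0670).
# Assumed range for illustration; dediac_ar may cover more characters.
DIACRITICS_RE = re.compile("[\u064B-\u0652\u0670]")

def strip_diacritics(text):
    """Remove Arabic diacritic marks, leaving the base letters untouched."""
    return DIACRITICS_RE.sub("", text)

knew = "\u0639\u064E\u0644\u0650\u0645\u064E"  # diacritized verb "he knew"
knowledge = "\u0639\u0650\u0644\u0652\u0645"   # diacritized noun "knowledge"

# Both words share the same three base letters once the marks are removed
print(strip_diacritics(knew), strip_diacritics(knowledge))
print(strip_diacritics(knew) == strip_diacritics(knowledge))
```

Two different words collapsing to one surface form is exactly why a disambiguation step is needed later in the pipeline.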

### 8.4. Morphological Analysis
Arabic has a very rich inflectional system. A verb could have up to 5400 inflections (compared to 6 in English and 1 in Chinese). So the trick is knowing...what does a word mean? Especially when stripped of its diacritics?

CAMeL Tools allows us to perform analysis against a morphological database to get all of that word's possible meanings. We can then select one.

In [79]:
# First, we need to load a morphological database.
# Here, we load the default database which is used for analyzing
# Modern Standard Arabic. 
db = MorphologyDB.builtin_db()

analyzer = Analyzer(db)

analyses = analyzer.analyze('سيحاسب')

for analysis in analyses:
    print(analysis, '\n')

{'diac': 'سَيُحاسِب', 'lex': 'حاسَب_1', 'bw': 'سَ/FUT_PART+يُ/IV3MS+حاسِب/IV', 'gloss': 'will_+_he;it+hold_responsible;get_even_with', 'pos': 'verb', 'prc3': '0', 'prc2': '0', 'prc1': 'sa_fut', 'prc0': '0', 'per': '3', 'asp': 'i', 'vox': 'a', 'mod': 'i', 'stt': 'na', 'cas': 'na', 'enc0': '0', 'rat': 'n', 'source': 'lex', 'form_gen': 'm', 'form_num': 's', 'pattern': 'سَيُ1ا2ِ3', 'root': 'ح.س.ب', 'catib6': 'PRT+VRB', 'ud': 'AUX+VERB', 'd1seg': 'سَيُحاسِب', 'd1tok': 'سَيُحاسِب', 'atbseg': 'سَ+_يُحاسِب', 'd3seg': 'سَ+_يُحاسِب', 'd2seg': 'سَ+_يُحاسِب', 'd2tok': 'سَ+_يُحاسِب', 'atbtok': 'سَ+_يُحاسِب', 'd3tok': 'سَ+_يُحاسِب', 'bwtok': 'سَ+_يُ+_حاسِب', 'pos_lex_logprob': -5.099521, 'caphi': 's_a_y_u_7_aa_s_i_b', 'pos_logprob': -1.023208, 'gen': 'm', 'lex_logprob': -5.099521, 'num': 's', 'stem': 'حاسِب', 'stemgloss': 'hold_responsible;get_even_with', 'stemcat': 'IV_yu'} 



The analyzer works on one word at a time and returns every candidate analysis for it.

Since we are working with full tweets (more than 6 million of them, in fact), we will instead use Morphological Disambiguation, which ranks the analyses of each word from most likely to least likely so that we can simply select the first one; see below.

### 8.5. Simple Word Tokenize

Before we can perform Morphological Disambiguation (selecting a particular meaning and form of each word from the range of possibilities), we need to run a simple word tokenizer so that we can feed the tokens into the disambiguation algorithm.

While testing this tool, we discovered that the word يارب was not being tokenized correctly.  It is, in fact, two words, but because some tweets include it as one word it was getting processed incorrectly. Therefore, let's first split the instances of يارب and insert a whitespace in between them so that it's tokenized properly.

In [80]:
# testing on a single sentence
sentence = ' يارب سوره بالتوفيق يارب اوقاف القران'
sentence

' يارب سوره بالتوفيق يارب اوقاف القران'

In [81]:
# define variables with strings to avoid problems with right-to-left order in .replace() call
yarab = 'يارب'
ya_rab = 'يا رب'

In [82]:
sentence.replace(yarab, ya_rab)

' يا رب سوره بالتوفيق يا رب اوقاف القران'

In [83]:
def split_yarab(text):
    text = text.replace(yarab, ya_rab)
    return text

In [84]:
def map_split_yarab(df):
    df.tweet_text = df.tweet_text.apply(split_yarab)
    return df

In [85]:
# map across partitions
ddf_uniq = ddf_uniq.map_partitions(map_split_yarab).persist()

Done. Let's now apply the simple_word_tokenizer across partitions.

In [86]:
def map_simple_tokenizer(df):
    df.tweet_text = df.tweet_text.apply(simple_word_tokenize)
    return df

In [87]:
# map across partitions
ddf_uniq = ddf_uniq.map_partitions(map_simple_tokenizer).persist()
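For readers without camel_tools installed, the behaviour of a simple word tokenizer can be approximated with a regular expression. This is only a sketch of the general idea (split on whitespace and separate punctuation marks); it is not the `simple_word_tokenize` implementation, and `simple_tokenize_sketch` is a hypothetical name.

```python
import re

# Approximation: a token is either a run of word characters
# or a single non-space punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def simple_tokenize_sketch(text):
    """Split text into word tokens and standalone punctuation tokens."""
    return TOKEN_RE.findall(text)

# The test sentence from above, after splitting the fused vocative
print(simple_tokenize_sketch(" يا رب سوره بالتوفيق يا رب اوقاف القران"))
```

Note that this sketch assumes the text has already been dediacritized: Python's `\w` does not match combining diacritic marks, so diacritized words would be split apart.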

In [88]:
ddf_uniq.head()

| index | tweet_text | hashtags | is_retweet | retweet_tweetid |
|---|---|---|---|---|
| 0 | [السلام, عليكم, ورحمه, الله, وبركاته, مرحبا, ع... |  | True | 9.986493e+17 |
| 1 | [للتاجير, لبيع, النطيطات, زحاليق, مائيه, صابون... | [للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,... | True | 9.996373e+17 |
| 2 | [مظلات, وسواتر, افاق, الرياض, مظلات, استراحات,... | [مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا... | True | 9.993939e+17 |
| 3 | [فيديو, شاهد, مواطن, يوثق, بالفيديو, كميات, كب... |  | True | 9.983516e+17 |
| 4 | [استغفر, الله, العظيم, واتوب, اليه] |  | False |  |


### 8.6. Removing Stop Words

Using [this GitHub text file](https://github.com/mohataher/arabic-stop-words), we will define our list of Arabic stop words to remove from the tokenized tweet_text column.

In [89]:
# define stopwords
with open('/Users/richard/Desktop/springboard_repo/capstones/three/supporting_files/arabic-stopwords.txt', 'r') as file:
    stopwords = file.read()
    stopwords_list = stopwords.split('\n')
    
print(stopwords_list)

['،', 'ء', 'ءَ', 'آ', 'آب', 'آذار', 'آض', 'آل', 'آمينَ', 'آناء', 'آنفا', 'آه', 'آهاً', 'آهٍ', 'آهِ', 'أ', 'أبدا', 'أبريل', 'أبو', 'أبٌ', 'أجل', 'أجمع', 'أحد', 'أخبر', 'أخذ', 'أخو', 'أخٌ', 'أربع', 'أربعاء', 'أربعة', 'أربعمئة', 'أربعمائة', 'أرى', 'أسكن', 'أصبح', 'أصلا', 'أضحى', 'أطعم', 'أعطى', 'أعلم', 'أغسطس', 'أفريل', 'أفعل به', 'أفٍّ', 'أقبل', 'أكتوبر', 'أل', 'ألا', 'ألف', 'ألفى', 'أم', 'أما', 'أمام', 'أمامك', 'أمامكَ', 'أمد', 'أمس', 'أمسى', 'أمّا', 'أن', 'أنا', 'أنبأ', 'أنت', 'أنتم', 'أنتما', 'أنتن', 'أنتِ', 'أنشأ', 'أنه', 'أنًّ', 'أنّى', 'أهلا', 'أو', 'أوت', 'أوشك', 'أول', 'أولئك', 'أولاء', 'أولالك', 'أوّهْ', 'أى', 'أي', 'أيا', 'أيار', 'أيضا', 'أيلول', 'أين', 'أيّ', 'أيّان', 'أُفٍّ', 'ؤ', 'إحدى', 'إذ', 'إذا', 'إذاً', 'إذما', 'إذن', 'إزاء', 'إلى', 'إلي', 'إليكم', 'إليكما', 'إليكنّ', 'إليكَ', 'إلَيْكَ', 'إلّا', 'إمّا', 'إن', 'إنَّ', 'إى', 'إياك', 'إياكم', 'إياكما', 'إياكن', 'إيانا', 'إياه', 'إياها', 'إياهم', 'إياهما', 'إياهن', 'إياي', 'إيهٍ', 'ئ', 'ا', 'ا?', 'ا?ى', 'االا', 'االتى', 'اب

Let's now proceed to remove the stopwords.

In [90]:
def remove_stopwords(tokenized_text):
    tokens_without_sw = [word for word in tokenized_text if word not in stopwords_list]
    return tokens_without_sw

In [91]:
def map_stopwords(df):
    df.tweet_text = df.tweet_text.apply(remove_stopwords)
    return df

In [92]:
# map across partitions
ddf_uniq = ddf_uniq.map_partitions(map_stopwords).persist()

In [93]:
ddf_uniq.head()

| index | tweet_text | hashtags | is_retweet | retweet_tweetid |
|---|---|---|---|---|
| 0 | [السلام, عليكم, ورحمه, الله, وبركاته, مرحبا, ع... |  | True | 9.986493e+17 |
| 1 | [للتاجير, لبيع, النطيطات, زحاليق, مائيه, صابون... | [للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,... | True | 9.996373e+17 |
| 2 | [مظلات, وسواتر, افاق, الرياض, مظلات, استراحات,... | [مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا... | True | 9.993939e+17 |
| 3 | [فيديو, شاهد, مواطن, يوثق, بالفيديو, كميات, كب... |  | True | 9.983516e+17 |
| 4 | [استغفر, الله, العظيم, واتوب] |  | False |  |


Done.
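A possible refinement, sketched here under the assumption that the stop word file has one entry per line: storing the stop words in a `set` makes each membership test constant-time instead of a scan through the whole list, which adds up over millions of tweets. The function names and the tiny stop word list below are hypothetical.

```python
def load_stopwords(lines):
    """Build a set of stop words from an iterable of lines, skipping blanks."""
    return {line.strip() for line in lines if line.strip()}

def remove_stopwords_fast(tokens, stopword_set):
    """Keep only the tokens that are not in the stop word set."""
    return [tok for tok in tokens if tok not in stopword_set]

# Tiny illustrative stop word list (hypothetical contents, not the real file)
stopword_set = load_stopwords(["اليه", "علي", "", "في"])
tokens = ["استغفر", "الله", "العظيم", "واتوب", "اليه"]
print(remove_stopwords_fast(tokens, stopword_set))
```

Because the tweets have already been normalised and dediacritized at this point, a stop word can only match if it is stored in the same normalised form. Entries such as `إلى` will only remove the normalised token `الي` if that spelling also appears in the list, so it may be worth passing the stop words through the same normalisation functions first.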

Let's do an interim save to our s3 bucket before continuing to run the lemmatization function.

In [94]:
# save to s3 before lemmatizing
ddf_uniq.to_parquet('s3://twitter-saudi-us-east-2/interim/ddf_unique_BEFORE_lemmatization.parquet',
                             engine='pyarrow')

### 8.7. Morphological Disambiguation

The next and final step is to conduct **morphological disambiguation**: to reduce the range of possible forms and meanings of the words in our Arabic text (which has been dediacritized and therefore can have multiple meanings) to a single form and meaning. 

For this project we will also use this step to directly **lemmatize** our tokens. There are many different ways to create 'morphological tokens' (using 9 different schemas built into the CAMeL Morphological Disambiguator). But since we will be conducting Topic Modelling on the text, the lemmas will suffice for our purposes.

In [6]:
# instantiate the Maximum Likelihood Disambiguator
mle = MLEDisambiguator.pretrained()

Let's run it on a sample sentence to see how it works:

In [7]:
# The disambiguator expects pre-tokenized text
sentence = simple_word_tokenize('نجح بايدن في الانتخابات')

disambig = mle.disambiguate(sentence)

# For each disambiguated word d in disambig, d.analyses is a list of analyses
# sorted from most likely to least likely. Therefore, d.analyses[0] would
# be the most likely analysis for a given word. Below we extract different
# features from the top analysis of each disambiguated word into separate lists.
diacritized = [d.analyses[0].analysis['diac'] for d in disambig]
pos_tags = [d.analyses[0].analysis['pos'] for d in disambig]
lemmas = [d.analyses[0].analysis['lex'] for d in disambig]

# Print the combined feature values extracted above
for triplet in zip(diacritized, pos_tags, lemmas):
    print(triplet)

# print lemmas
print(lemmas)

('نَجَحَ', 'verb', 'نَجَح-a_1')
('بايدن', 'noun_prop', 'بايدن_0')
('فِي', 'prep', 'فِي_1')
('الاِنْتِخاباتِ', 'noun', 'ٱِنْتِخاب_1')
['نَجَح-a_1', 'بايدن_0', 'فِي_1', 'ٱِنْتِخاب_1']


The above example from the CAMeL documentation works perfectly.

Let's now adapt it so that we get just the lemmas.

**NOTE:** We included the try/except clause because some list indexing was throwing an 'out of range' error. The function now returns NaN for the whole tweet if it can't lemmatize one of its tokens. It is therefore **very important** to check the number of NaNs in the ddf after mapping this function across all partitions.

In [8]:
def get_lemmas(tokenized_text):
    disambig = mle.disambiguate(tokenized_text)
    try:
        lemmas = [d.analyses[0].analysis['lex'] for d in disambig]
        return lemmas
    except IndexError:  # raised when a token has no analyses to index into
        return np.nan
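An alternative worth considering, sketched below with stand-in objects rather than camel_tools itself: fall back per token instead of discarding the whole tweet. The sketch assumes each disambiguated item exposes the original token as `.word` alongside `.analyses`; the `Scored` and `Word` tuples are stubs that only mimic those attributes, and `lemmas_with_fallback` is a hypothetical name.

```python
from collections import namedtuple

# Stubs mimicking the attributes used above (.word, .analyses, .analysis);
# they are for illustration only and are not camel_tools classes.
Scored = namedtuple("Scored", ["score", "analysis"])
Word = namedtuple("Word", ["word", "analyses"])

def lemmas_with_fallback(disambiguated):
    """Return one lemma per token, keeping the surface form if no analysis exists."""
    lemmas = []
    for d in disambiguated:
        if d.analyses:
            lemmas.append(d.analyses[0].analysis["lex"])
        else:
            lemmas.append(d.word)  # fallback: keep the unanalysed token as-is
    return lemmas

fake_output = [
    Word("token_a", [Scored(1.0, {"lex": "lemma_a_1"})]),
    Word("token_b", []),  # a token for which no analysis was found
]
print(lemmas_with_fallback(fake_output))
```

With this approach a single unanalysable token no longer turns the entire tweet into NaN, at the cost of mixing lemmas and raw tokens in the output.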

Let's try it on a subset of ddf_uniq.

In [9]:
df_sample = ddf_uniq.sample(frac=0.0001, random_state=21).compute()

In [10]:
df_sample.shape

(624, 4)

In [11]:
df_sample.head()

| index | tweet_text | hashtags | is_retweet | retweet_tweetid |
|---|---|---|---|---|
| 915 | [عرض, خاص, مركز, طرق, الجمال, الطبي, تنظيف, ال... | [عرض خاص] | True | 9.998049e+17 |
| 48117 | [كاس, العالم, بي, اوت, رتويت, شهري, اشتراك, رت... | [كاس العالم علي بي اوت, رتويت, اشتراك, اصوات, ... | True | 1.006832e+18 |
| 74594 | [لحم, برمه, كيفك, السعوديه, الدمام, الخبر, مطا... | [على كيف كيفك, السعودية, الدمام, الخبر, مطاعم ... | True | 1.01125e+18 |
| 135340 | [نفسيتي, محتاجه, ادعس, شواربك, ابوك, لاابو, ال... | [نفسيتي محتاجه, نفسيتك محتاجة اية, القوة الغاشمة] | True | 1.018902e+18 |
| 48656 | [واكيد, هتشجع, منتخب, الاورجواي, الشقيق, منتخب... |  | False |  |


In [12]:
%%time
df_sample['lemmas'] = df_sample.tweet_text.apply(get_lemmas)

CPU times: user 8.66 s, sys: 408 ms, total: 9.07 s
Wall time: 12.2 s


In [13]:
df_sample.head()

| index | tweet_text | hashtags | is_retweet | retweet_tweetid | lemmas |
|---|---|---|---|---|---|
| 915 | [عرض, خاص, مركز, طرق, الجمال, الطبي, تنظيف, ال... | [عرض خاص] | True | 9.998049e+17 | [عَرَض-i_1, خاصّ_1, مَرْكَز_1, طَرِيق_1, جَمال... |
| 48117 | [كاس, العالم, بي, اوت, رتويت, شهري, اشتراك, رت... | [كاس العالم علي بي اوت, رتويت, اشتراك, اصوات, ... | True | 1.006832e+18 | [كاس_1, عالَم_1, بِي_1, أَوَى-i_1, رتويت_0, شَ... |
| 74594 | [لحم, برمه, كيفك, السعوديه, الدمام, الخبر, مطا... | [على كيف كيفك, السعودية, الدمام, الخبر, مطاعم ... | True | 1.01125e+18 | [لَحْم_1, رُمَّة_1, كيفك_0, سَعُودِيّ_1, دَمّا... |
| 135340 | [نفسيتي, محتاجه, ادعس, شواربك, ابوك, لاابو, ال... | [نفسيتي محتاجه, نفسيتك محتاجة اية, القوة الغاشمة] | True | 1.018902e+18 | [نَفْسِيّ_1, مُحْتاج_1, دَعَس-a_1, شارِب_3, أَ... |
| 48656 | [واكيد, هتشجع, منتخب, الاورجواي, الشقيق, منتخب... |  | False |  | [أَكِيد_1, هتشجع_0, مُنْتَخَب_1, أُورُجواي_1, ... |


In [14]:
df_sample.loc[915].tweet_text

array(['عرض', 'خاص', 'مركز', 'طرق', 'الجمال', 'الطبي', 'تنظيف', 'الاسنان',
       'تبيض', 'الاسنان', 'بالزوم', 'تبيض', 'الاسنان', 'المنزلي',
       'القوالب'], dtype=object)

In [15]:
df_sample.loc[915].lemmas

['عَرَض-i_1',
 'خاصّ_1',
 'مَرْكَز_1',
 'طَرِيق_1',
 'جَمال_2',
 'طِبِّيّ_1',
 'تَنْظِيف_1',
 'سِنّ_1',
 'باض-i_1',
 'سِنّ_1',
 'زُوم_1',
 'باض-i_1',
 'سِنّ_1',
 'مَنْزِلِيّ_1',
 'قالِب_1']

Great, that worked. This randomly extracted tweet is an ad for dental hygiene services. Very politically compromised ;)

Let's now apply on the whole ddf_uniq.

### 8.8. Lemmatization

I had quite some trouble running the lemmatization function on my Coiled / AWS cluster. The tricky part is that lemmatization requires every worker to have access to the **morphology database** (~19MB). This slowed the distributed processing down considerably because the serialized data had to be transferred between workers multiple times: running the lemmatization function on a Dask dataframe of just 36 rows took more than 9 minutes (!!).

I got in touch with the Dask maintainers, and Gabe Joseph wrote a hack / work-around, found [here](https://github.com/gjoseph92/once-per-worker). The package has been pip-installed in our software environments, so we will import it here and proceed.

In [11]:
from once_per_worker import once_per_worker

In [12]:
loaded_disambiguator = once_per_worker(lambda: MLEDisambiguator.pretrained())
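The general idea behind this kind of work-around, loading an expensive object once per process and reusing it for every task instead of shipping it along with each one, can be sketched with the standard library alone. This is not how `once_per_worker` is implemented; the names and the placeholder object below are hypothetical.

```python
from functools import lru_cache

load_count = 0  # tracks how often the expensive load actually runs

@lru_cache(maxsize=None)
def get_disambiguator():
    """Load the expensive resource on first use, then reuse it in this process."""
    global load_count
    load_count += 1
    return {"name": "stand-in for a large model"}  # placeholder object

def process_partition(rows):
    """Process one batch of rows using the cached resource."""
    model = get_disambiguator()  # cheap after the first call
    return [(row, model["name"]) for row in rows]

for batch in (["a", "b"], ["c"], ["d", "e"]):
    process_partition(batch)

print("loads:", load_count)
```

However many partitions a process handles, the loader runs only once there, which is the property that avoids repeatedly serializing and transferring the morphology database.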

In [13]:
def map_lemmas(df, disambiguator):
    
    def get_lemmas_nested(tokenized_text):
        
        disambig = disambiguator.disambiguate(tokenized_text)

        try:
            lemmas = [d.analyses[0].analysis['lex'] for d in disambig]
            return lemmas
        except IndexError:  # raised when a token has no analyses to index into
            return np.nan
    
    df.tweet_text = df.tweet_text.apply(get_lemmas_nested)
    return df

Mapping this function to our Dask Dataframe only works if the **number of partitions** in the dataframe is **less than the number of workers**. This is most likely because the camel-tools disambiguator is not thread-safe.

Below we run the lemmatisation function on batches of partitions and then append them all together to get back to our complete dataset of **6145783 unique tweets**.

In [8]:
ddf_0_4 = ddf_uniq.partitions[0:5].persist()

In [9]:
ddf_0_4

| npartitions=5 | tweet_text | hashtags | is_retweet | retweet_tweetid |
|---|---|---|---|---|
| 0 | object | object | bool | float64 |
| 157584 | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| 630336 | ... | ... | ... | ... |
| 787920 | ... | ... | ... | ... |


In [14]:
# run test on cluster
ddf_0_4_lem = ddf_0_4.map_partitions(map_lemmas,
                                  loaded_disambiguator,
                                  meta=ddf_uniq
).copy().persist()

In [15]:
%%time
ddf_0_4_lem.head()

CPU times: user 2.33 s, sys: 518 ms, total: 2.85 s
Wall time: 20min 39s


| index | tweet_text | hashtags | is_retweet | retweet_tweetid |
|---|---|---|---|---|
| 0 | [سَلام_1, عَلَى_1, رَحْمَة_1, اللَّه_1, بَرَكَ... |  | True | 9.986493e+17 |
| 1 | [تَأْجِير_1, بَيْع_1, النطيطات_0, زحاليق_0, ما... | [للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,... | True | 9.996373e+17 |
| 2 | [مِظَلَّة_1, ساتِر_1, أُفُق_1, رِياض_1, مِظَلّ... | [مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا... | True | 9.993939e+17 |
| 3 | [فِيدْيُو_1, شاهَد_1, مُواطِن_1, وَثِق-ia_1, ف... |  | True | 9.983516e+17 |
| 4 | [ٱِسْتَغْفَر_1, اللَّه_1, عَظِيم_2, تاب-u_1] |  | False |  |


In [19]:
ddf_5_14 = ddf_uniq.partitions[5:15].persist()

In [20]:
# run mapping function on batch
ddf_5_14_lem = ddf_5_14.map_partitions(map_lemmas,
                                  loaded_disambiguator,
                                  meta=ddf_uniq
).copy().persist()

In [21]:
ddf_5_14_lem.shape[0].compute()

1575842

In [22]:
ddf_15_24 = ddf_uniq.partitions[15:25].persist()

In [23]:
# run mapping function on batch
ddf_15_24_lem = ddf_15_24.map_partitions(map_lemmas,
                                  loaded_disambiguator,
                                  meta=ddf_uniq
).copy().persist()

In [24]:
%%time
ddf_15_24_lem.shape[0].compute()

CPU times: user 4.29 s, sys: 875 ms, total: 5.16 s
Wall time: 40min 44s


1575842

In [42]:
ddf_25_39 = ddf_uniq.partitions[25:].persist()

In [43]:
# run mapping function on batch
ddf_25_39_lem = ddf_25_39.map_partitions(map_lemmas,
                                  loaded_disambiguator,
                                  meta=ddf_uniq
).copy().persist()

In [44]:
%%time
ddf_25_39_lem.shape[0].compute()

CPU times: user 16.1 s, sys: 3.14 s, total: 19.3 s
Wall time: 2h 25min 49s


2206179
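The batches above were sliced by hand. As a sketch, the slice boundaries could also be generated, which makes it harder to skip or duplicate a partition. The batch size of 10 is an arbitrary choice for illustration, 39 is the partition count reported for the combined dataframe, and `batch_bounds` is a hypothetical helper.

```python
def batch_bounds(n_partitions, batch_size):
    """Yield (start, stop) index pairs that together cover range(n_partitions)."""
    for start in range(0, n_partitions, batch_size):
        yield start, min(start + batch_size, n_partitions)

# Print the slice each batch would take from the full dataframe
for start, stop in batch_bounds(39, 10):
    print(f"ddf_uniq.partitions[{start}:{stop}]")
```

Each pair can then be used to slice `ddf_uniq.partitions` in a loop, keeping every batch small enough for the cluster to handle.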

Let's append these partial lemmatized ddf's into one big one.

In [46]:
# concatenate partial ddfs together
# (dd.concat replaces DataFrame.append, which is no longer available in recent Dask releases)
import dask.dataframe as dd
ddf_full_lem = dd.concat([ddf_0_4_lem, ddf_5_14_lem, ddf_15_24_lem, ddf_25_39_lem])
ddf_full_lem = ddf_full_lem.persist()
ddf_full_lem

| npartitions=39 | tweet_text | hashtags | is_retweet | retweet_tweetid |
|---|---|---|---|---|
|  | object | object | bool | float64 |
|  | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
|  | ... | ... | ... | ... |
|  | ... | ... | ... | ... |


In [48]:
# check n_rows to verify
ddf_full_lem.shape[0].compute() == 6145783

True

In [49]:
# check number of NaNs
ddf_full_lem.tweet_text.isnull().sum().compute()

0

In [50]:
ddf_full_lem.tweet_text.sample(frac=0.001, random_state=21).compute()

index
915        [عَرَض-i_1, خاصّ_1, مَرْكَز_1, طَرِيق_1, جَمال...
48117      [كاس_1, عالَم_1, بِي_1, أَوَى-i_1, رتويت_0, شَ...
74594      [لَحْم_1, رُمَّة_1, كيفك_0, سَعُودِيّ_1, دَمّا...
135340     [نَفْسِيّ_1, مُحْتاج_1, دَعَس-a_1, شارِب_3, أَ...
48656      [أَكِيد_1, هتشجع_0, مُنْتَخَب_1, أُورُجواي_1, ...
                                 ...                        
6114091    [ٱتساق_1_0, قمر_1_0, سياحة_1_0, فندق_1_0, مكة_...
6067753    [الحمدلل_0_0, أغاث_1_0, روح_1_0, الحمدلل_0_0, ...
5993520    [بيع_1_0, ألمنيوم_1_0, طائرة_1_0, ألف_1_0, طن_...
6089414    [أرسل_1_0, رمز_1_0, رائح_1_0, رائح_1_0, روسيا_...
6031645             [دان-i_1_0, هديلك_0_0, درس_1_0, أول_2_0]
Name: tweet_text, Length: 6162, dtype: object

Looking good. The number of rows in our appended dataframe is correct and all tweets have been lemmatised.
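The row counts can also be cross-checked by hand. The notebook prints the sizes of the last three batches but not the first, so the sketch below infers the first batch as the remainder; that inferred value is an assumption, although it coincides with the last division (787920) shown for the first five partitions.

```python
# Row counts printed above for the last three batches
reported = {"5:15": 1_575_842, "15:25": 1_575_842, "25:": 2_206_179}
total_rows = 6_145_783  # number of unique tweets

# The first batch (partitions 0:5) is not printed, so infer it as the remainder
first_batch = total_rows - sum(reported.values())
print("rows in first batch:", first_batch)
```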

Let's bring the result into local memory and save a copy to disk.

In [94]:
# bring local copy to save
df_full_lem = ddf_full_lem.compute()

In [95]:
df_full_lem.head()

| index | tweet_text | hashtags | is_retweet | retweet_tweetid |
|---|---|---|---|---|
| 0 | [سَلام_1, عَلَى_1, رَحْمَة_1, اللَّه_1, بَرَكَ... |  | True | 9.986493e+17 |
| 1 | [تَأْجِير_1, بَيْع_1, النطيطات_0, زحاليق_0, ما... | [للتأجير, لبيع النطيطات, زحاليق مائيه صابونية,... | True | 9.996373e+17 |
| 2 | [مِظَلَّة_1, ساتِر_1, أُفُق_1, رِياض_1, مِظَلّ... | [مظلات, آفاق الرياض, مظلات استراحات, مظلات مسا... | True | 9.993939e+17 |
| 3 | [فِيدْيُو_1, شاهَد_1, مُواطِن_1, وَثِق-ia_1, ف... |  | True | 9.983516e+17 |
| 4 | [ٱِسْتَغْفَر_1, اللَّه_1, عَظِيم_2, تاب-u_1] |  | False |  |


In [96]:
df_full_lem.shape

(6145783, 4)

In [97]:
df_full_lem.to_parquet('/Users/richard/Desktop/data_cap3/processed/df_unique_tweets_hashtags_lemmatized_050521.parquet',
                       engine='pyarrow')

# Conclusions

In this notebook, we have conducted **basic data cleaning**, including:
- removing faulty rows (<1000 = <0.001%)
- subsetting the dataset to include only Arabic tweets (ca. 96%)
- cleaning the tweet_text column of URLs, emoji, RT symbols, usernames and hashtags
- creating a new column containing the hashtags
- creating a new column containing the retweet usernames


We have also created a number of **reference tables** to reduce the size of the main dataframe and to begin constructing a relational database, including:
- creating a dataframe with **only unique tweets** and a unique index (from 0 to n_tweets)
- replacing tweet texts in original dataframe with indices from new (unique tweets) dataframe
- creating a dataframe with **unique user screen names** and a new, unique index (from 0 to n_users)
- replacing user screen names in original dataframe with indices from new (unique users) dataframe


Finally, we have conducted extensive Arabic-specific NLP pre-processing of the unique tweets, including:
 - removing repeating characters
 - orthographic normalisation
 - dediacritization
 - tokenization
 - stop word removal
 - lemmatization


Our data is now ready for the next stages of the project: EDA, Topic Modelling and Clustering. As a reminder, we have 3 main dataframes at this point:
1. ddf: a distributed Dask dataframe containing the full dataset with tweet texts and user screen names as indices
2. df_unique: a Pandas dataframe containing the tweet texts and hashtags of the ca. 6.1 million unique tweets
3. df_users: a Pandas dataframe containing the user screen names and Twitter user IDs (when available) of all unique users in the dataset.
 