Parler Data Preprocessing, Part 1

Data source: Aliapoulios, M., Bevensee, E., Blackburn, J., Bradlyn, B., De Cristofaro, E., Stringhini, G., & Zannettou, S. (2021, May). A Large Open Dataset from the Parler Social Network. In ICWSM (pp. 943-951).

1. Read the NDJSON file with Parler data into a pandas dataframe
2. Remove unneeded columns (column reduction) -> only [body] and [createdAtformatted] remain
3. Filter out parleys whose [body] is empty or null (row reduction)
4. Filter out parleys outside the relevant time period (row reduction) -> only Nov 2020, Dec 2020 and Jan 2021 remain
5. Filter out parleys whose [body] is not in English (row reduction)
6. Remove the time-of-day information from the [createdAtformatted] column -> only the date remains
7. Save the reduced pandas dataframe (without the index) as a CSV file


In [1]:
import json
import pandas as pd 
import numpy as np

In [2]:
# 1. Read the NDJSON file into a pandas dataframe

# Raw string so the backslashes in the Windows path are not treated as escape sequences
parler_df = pd.read_json(r'D:\bachelors_thesis\Datasets\parler_data\parler_data000000000041.ndjson', lines=True)
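
With more than a million rows and 38 columns, loading the whole file at once can be memory-hungry. A minimal sketch of a chunked read follows, assuming only the two columns kept in step 2 are needed. The in-memory `sample_ndjson` records and the `chunksize` value are made up for illustration and are not part of the dataset.

```python
import io
import pandas as pd

# Three hypothetical NDJSON records standing in for the real Parler file
sample_ndjson = io.StringIO(
    '{"body": "hello", "createdAtformatted": "2020-11-05 15:45:56 UTC", "upvotes": 3}\n'
    '{"body": "", "createdAtformatted": "2019-12-22 05:13:59 UTC", "upvotes": 0}\n'
    '{"body": "world", "createdAtformatted": "2021-01-08 23:17:40 UTC", "upvotes": 1}\n'
)

keep = ['body', 'createdAtformatted']

# With lines=True and a chunksize, read_json yields one dataframe per chunk,
# so the unneeded columns can be discarded before the chunks are combined.
chunks = [chunk[keep] for chunk in pd.read_json(sample_ndjson, lines=True, chunksize=2)]
sample_df = pd.concat(chunks, ignore_index=True)

print(sample_df.shape)
```

For the real file, the path from the cell above would take the place of `sample_ndjson`, and a much larger `chunksize` would be appropriate.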

In [3]:
# 2. Remove unneeded columns (column reduction) -> only [body] and [createdAtformatted] remain

print(parler_df.columns)
print('Dimension of whole dataframe: ' + str(parler_df.shape) + '\n') # df.shape -> (rows, columns)

# Drop the trailing columns by position, then the remaining unneeded ones by name
parler_df.drop(columns=parler_df.columns[5:38], inplace=True)                     # positions 5..37 ('creator' .. 'isPrimary')
parler_df.drop(columns=['comments', 'bodywithurls', 'createdAt'], inplace=True)   # original positions 0, 2 and 3

print(parler_df.columns)
print('Dimension of dataframe after removing columns: ' + str(parler_df.shape) + '\n')

Index(['comments', 'body', 'bodywithurls', 'createdAt', 'createdAtformatted',
       'creator', 'datatype', 'depth', 'depthRaw', 'followers', 'following',
       'hashtags', 'id', 'lastseents', 'links', 'media', 'parent', 'posts',
       'sensitive', 'upvotes', 'urls', 'username', 'verified', 'article',
       'impressions', 'preview', 'reposts', 'state', 'shareLink', 'color',
       'commentDepth', 'controversy', 'conversation', 'downvotes', 'post',
       'replyingTo', 'score', 'isPrimary'],
      dtype='object')
Dimension of whole dataframe: (1095543, 38)

Index(['body', 'createdAtformatted'], dtype='object')
Dimension of dataframe after removing columns: (1095543, 2)
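
Dropping by position depends on the column order of this particular file. A minimal sketch of a name-based selection is shown here as an alternative, assuming every file in the dataset carries the two column names printed above. `demo_df`, `wanted` and `slim_df` are illustrative names.

```python
import pandas as pd

# Small stand-in dataframe with a few of the original column names
demo_df = pd.DataFrame({
    'comments': [0, 2],
    'body': ['first post', 'second post'],
    'createdAtformatted': ['2020-11-05 15:45:56 UTC', '2020-12-04 23:23:36 UTC'],
    'upvotes': [1, 5],
})

wanted = ['body', 'createdAtformatted']

# Fail early with a clear message if a file lacks one of the expected columns
missing = [col for col in wanted if col not in demo_df.columns]
if missing:
    raise KeyError(f'Expected columns not found: {missing}')

# .copy() gives an independent dataframe that is safe to modify in later steps
slim_df = demo_df[wanted].copy()
print(slim_df.columns.tolist())
```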



In [4]:
# 3. Filter out parleys whose [body] is empty or null (row reduction)

print('Dimension of dataframe: ' + str(parler_df.shape)) 
print(parler_df) 

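# Treat empty strings as missing, then drop every row without a body.
# Assigning the result back avoids calling replace(..., inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the dataframe.
parler_df['body'] = parler_df['body'].replace("", np.nan)
parler_df.dropna(subset=['body'], inplace=True)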
# parler_df.reset_index(drop=True, inplace=True)
# parler_df.drop_duplicates(subset = ['body'], inplace=True, keep='first'/'last'/False)

print('\n'  + 'Dimension of dataframe after filtering out null values: ' + str(parler_df.shape)) 
print(parler_df)

Dimension of dataframe: (1095543, 2)
        body       createdAtformatted
0             2019-12-22 05:13:59 UTC
1             2020-11-05 15:45:56 UTC
2             2020-12-04 23:23:36 UTC
3             2020-11-11 18:36:20 UTC
4             2021-01-08 23:17:40 UTC
...      ...                      ...
1095538       2020-12-22 21:51:11 UTC
1095539       2020-11-02 00:56:02 UTC
1095540       2019-09-23 05:07:10 UTC
1095541       2020-11-11 00:53:16 UTC
1095542       2020-12-20 15:18:33 UTC

[1095543 rows x 2 columns]

Dimension of dataframe after filtering out null values: (637020, 2)
                                                      body  \
19       We have the Sorriest FBI in the history of any...   
48       Agree 100% #wakeupamerica #obamamostcorruptpot...   
176      This is massive this is big this is what you f...   
290                        #trump2020 #bestpresidentever45   
...                                                    ...   
1095508  While that’s a good thing, it
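
The cell above removes only exact empty strings and nulls. If bodies consisting solely of spaces or line breaks should also count as empty (an assumption, the notebook does not say so), a mask like the following sketch would catch them. `demo_df` and `has_text` are illustrative names.

```python
import pandas as pd

demo_df = pd.DataFrame({'body': ['', '   ', None, 'real text', '\n']})

# Replace nulls with '', strip surrounding whitespace, and keep rows that still have content
has_text = demo_df['body'].fillna('').str.strip().ne('')
demo_df = demo_df[has_text].copy()

print(demo_df)
```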

In [5]:
# 4. Filter out parleys outside the relevant time period (row reduction) -> only Nov 2020, Dec 2020 and Jan 2021 remain

def check_date(string):
    # Timestamps look like '2020-11-05 15:45:56 UTC', so the first two '-'-separated fields are year and month
    year, month = string.split('-')[:2]
    return (year, month) in {('2020', '11'), ('2020', '12'), ('2021', '01')}   # Nov 2020, Dec 2020, Jan 2021

print('Dimension of dataframe before: ' + str(parler_df.shape)) 
print(parler_df)

# Keep only rows from the relevant months; .copy() avoids a SettingWithCopyWarning in later steps
parler_df = parler_df[parler_df['createdAtformatted'].apply(check_date)].copy()

print('Dimension of dataframe with parleys from relevant time period only: ' + str(parler_df.shape)) 
print(parler_df)

Dimension of dataframe before: (637020, 2)
                                                      body  \
19       We have the Sorriest FBI in the history of any...   
48       Agree 100% #wakeupamerica #obamamostcorruptpot...   
176      This is massive this is big this is what you f...   
290                        #trump2020 #bestpresidentever45   
...                                                    ...   
1095508  While that’s a good thing, it makes me wonder ...   
1095524                          Please put this guy away.   
1095531                     Hey it totally works for them.   
1095534                    See? Unity works... #TakeAStand   
1095536                        I am ready first the truth.   

              createdAtformatted  
19       2020-10-03 11:25:19 UTC  
48       2020-06-30 19:52:34 UTC  
49       2020-10-25 22:25:28 UTC  
176      2020-12-23 05:00:55 UTC  
290      2020-06-30 21:23:14 UTC  
...                          ...  
1095508  2020-07-02 21:24:06 
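
Because the timestamps start with a fixed-width 'YYYY-MM' prefix, the same filter can also be written without a row-wise `apply`. This is a minimal sketch on made-up rows; `relevant_months`, `in_period` and `filtered_df` are illustrative names.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'body': ['a', 'b', 'c', 'd'],
    'createdAtformatted': [
        '2020-10-03 11:25:19 UTC',   # too early
        '2020-12-23 05:00:55 UTC',   # kept
        '2021-01-03 05:36:11 UTC',   # kept
        '2021-02-01 00:00:00 UTC',   # too late
    ],
})

relevant_months = ['2020-11', '2020-12', '2021-01']

# The first seven characters are 'YYYY-MM'; isin() compares them for the whole column at once
in_period = demo_df['createdAtformatted'].str[:7].isin(relevant_months)
filtered_df = demo_df[in_period].copy()

print(filtered_df)
```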

In [6]:
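# 5. Filter out parleys whose [body] is not in English (row reduction)

import fasttext
model = fasttext.load_model("lid.176.ftz")   # pretrained fastText language identification model

def fast_detect(msg):
    # Labels look like '__label__en'; splitting on '__' leaves the language code at position 2.
    # Any failure (for example, predict() rejects text that contains a newline) yields None,
    # so such parleys are removed together with the non-English ones below.
    try:
        ln = model.predict(msg)[0][0].split("__")[2] 
    except Exception:
        ln = None
    return ln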

print('Dimension of dataframe before: ' + str(parler_df.shape)) 
parler_df['language'] = parler_df['body'].apply(fast_detect)
print('Dimension of dataframe after adding the language column: ' + str(parler_df.shape)) 

print(parler_df)
parler_df.drop(parler_df[parler_df['language'] != 'en'].index, inplace=True)   # keep English parleys only
parler_df.drop(['language'], inplace = True, axis = 1)          
print(parler_df) 

Dimension of dataframe before: (353028, 2)
Dimension of dataframe after adding the language column: (353028, 3)
                                                      body  \
176      This is massive this is big this is what you f...   
345      #biden is wearing a bullet proof vest 🤣😂🤣😂 Ful...   
561           Absofreakinglutely!!! Tear that shit down!!!   
686      @Joebwinner And you are correct.. There are li...   
688      @factsRus I guess that’s why 15 CIA operative ...   
...                                                    ...   
1095448                          The reckoning approaches.   
1095450  Fuck this piece of shit. And fuck anyone who d...   
1095463                           🙌🏼🙌🏼🙌🏼🙌🏼🙌🏼🙌🏼🇺🇸🇺🇸🇺🇸🇺🇸🇺🇸🇺🇸   
1095465  Thank you Ricky !! ❤️You may loose some no goo...   
1095536                        I am ready first the truth.   

              createdAtformatted language  
176      2020-12-23 05:00:55 UTC       en  
345      2020-12-02 23:36:57 UTC     None  
561      2021-01
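
Row 345 above receives the language `None`, which means `fast_detect` hit its `except` branch. One likely cause is a line break in the body, since fastText's Python `predict` raises a `ValueError` for input containing `'\n'`. The sketch below collapses whitespace before predicting. So that it runs without the model file, it uses a hypothetical `stub_predict` that mimics the return shape of `model.predict`; `parse_label` and `make_detector` are illustrative helpers, not part of fastText.

```python
def parse_label(label):
    """Turn a fastText-style label such as '__label__en' into 'en'."""
    prefix = '__label__'
    return label[len(prefix):] if label.startswith(prefix) else label

def make_detector(predict):
    """Wrap a predict(text) -> (labels, scores) callable so multi-line text is handled."""
    def detect(msg):
        if not isinstance(msg, str) or not msg.strip():
            return None                          # nothing to classify
        single_line = ' '.join(msg.split())      # collapse newlines and repeated spaces
        labels, _scores = predict(single_line)
        return parse_label(labels[0])
    return detect

def stub_predict(text):
    """Hypothetical stand-in for model.predict: rejects newlines, always answers English."""
    if '\n' in text:
        raise ValueError("predict processes one line at a time (remove '\\n')")
    return (('__label__en',), [0.99])

detect = make_detector(stub_predict)
print(detect('first line\nsecond line'))
```

With the real model loaded, `make_detector(model.predict)` would take the place of the stub. Note that this would keep multi-line English parleys that the cell above drops, so the row counts would differ from the outputs shown here.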

In [7]:
# 6. Remove the time-of-day information from the [createdAtformatted] column -> only the date remains

print(parler_df['createdAtformatted'])
# Split on the first whitespace and keep the leading 'YYYY-MM-DD' part
parler_df['createdAtformatted'] = parler_df['createdAtformatted'].str.split(n=1).str[0]
print(parler_df['createdAtformatted'])

176        2020-12-23 05:00:55 UTC
561        2021-01-03 05:36:11 UTC
686        2020-11-24 01:03:33 UTC
688        2020-11-30 05:49:11 UTC
691        2020-11-01 23:36:45 UTC
                    ...           
1095400    2020-12-15 13:43:19 UTC
1095448    2020-12-05 17:37:29 UTC
1095450    2020-12-24 05:15:29 UTC
1095465    2020-11-21 15:14:57 UTC
1095536    2020-12-20 03:57:34 UTC
Name: createdAtformatted, Length: 294295, dtype: object
176        2020-12-23
561        2021-01-03
686        2020-11-24
688        2020-11-30
691        2020-11-01
              ...    
1095400    2020-12-15
1095448    2020-12-05
1095450    2020-12-24
1095465    2020-11-21
1095536    2020-12-20
Name: createdAtformatted, Length: 294295, dtype: object
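
String splitting keeps whatever precedes the first space, even if a timestamp is malformed. As a sketch of a stricter variant, the timestamps can be parsed first, which raises an error on values that do not match the expected layout. It assumes every value ends in the literal suffix ' UTC', as in the output above. `stamps`, `parsed` and `dates_only` are illustrative names.

```python
import pandas as pd

stamps = pd.Series(['2020-12-23 05:00:55 UTC', '2021-01-03 05:36:11 UTC'])

# Remove the literal ' UTC' suffix, then parse with an explicit format
parsed = pd.to_datetime(stamps.str.replace(' UTC', '', regex=False),
                        format='%Y-%m-%d %H:%M:%S')

# Back to plain 'YYYY-MM-DD' strings, matching what the cell above produces
dates_only = parsed.dt.strftime('%Y-%m-%d')
print(dates_only.tolist())
```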


In [8]:
# 7. Save the reduced pandas dataframe (without the index) as a CSV file

parler_df.to_csv('parler_df_041_dates_before.csv', index=False)
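
Parley bodies can contain commas, quotes and line breaks, so a quick round trip is a cheap way to check that the CSV reads back unchanged. This is a minimal sketch using an in-memory buffer instead of a file; `demo_df`, `buffer` and `restored` are illustrative names and the rows are made up.

```python
import io
import pandas as pd

demo_df = pd.DataFrame({
    'body': ['Line one\nline two', 'He said "hi", then left'],
    'createdAtformatted': ['2020-12-23', '2021-01-03'],
})

# Write without the index, as in the cell above, then read the same text back
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Dates come back as plain strings unless parse_dates is passed to read_csv
print(restored['body'].tolist() == demo_df['body'].tolist())
```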