
Gab DATA PREPROCESSING 

Gab DATA SAMPLE FORMATION

Data source: Zannettou, S., Bradlyn, B., De Cristofaro, E., Kwak, H., Sirivianos, M., Stringini, G., & Blackburn, J. (2018, April). What is gab: A bastion of free speech or an alt-right echo chamber. In Companion Proceedings of the The Web Conference 2018 (pp. 1007-1014).

 0. Upload NDJSON file with posts to Google Colab 

 1. Read NDJSON file as pandas dataframe

 2. Remove unneeded columns = dimension reduction (columns) -> only {body} of [post] left

 3. Remove URLs 

 4. Remove user mentions (@tags)

 5. Convert to lowercase

 6. Remove emojis (demoji library)

 7. Expand contractions (contractions library) (ex: you’re => you are) 

 8. Remove punctuation (using string.punctuation)

 9. Remove numbers (and any other remaining non-letter characters)

10. Lemmatization (using WordNetLemmatizer from nltk) (ex: says => say) 
 
11. Remove words shorter than 3 characters

12. Remove English and Spanish stopwords (using stopwords from nltk corpus)

13. Filter out non-English posts = dimension reduction (rows)

14. Filter out null values = dimension reduction (rows)

15. Drop duplicates = dimension reduction (rows)

16. Drop posts with 3 words or fewer

17. Filter out null values = dimension reduction (rows)

18. Save pandas dataframe without index after preprocessing as CSV file

19. Create dataframe for training the sentence transformer with 100,000 randomly selected gabs

20. Save the final dataframe without index as CSV file
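
The regex- and string-based steps (3, 4, 5, 8, 9 and 11) can be sketched as a single standard-library function. This is only a minimal sketch under stated assumptions, not the notebook's own code: the name `clean_text`, the `min_len` parameter and the example URL are hypothetical, and the library-dependent steps (emoji removal, contraction expansion, lemmatization, stopword removal, language filtering) are left out on purpose. Without contraction expansion, a word such as "hasn't" simply becomes "hasnt".

```python
import re
import string

URL_RE = re.compile(r"http\S+")          # same pattern as step 3
MENTION_RE = re.compile(r"@\S+")         # same pattern as step 4
NON_LETTER_RE = re.compile(r"[^a-zA-Z]+")  # same pattern as step 9
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str, min_len: int = 3) -> str:
    """Hypothetical helper covering steps 3, 4, 5, 8, 9 and 11 only."""
    text = URL_RE.sub(" ", text)         # step 3: remove URLs
    text = MENTION_RE.sub(" ", text)     # step 4: remove @mentions
    text = text.lower()                  # step 5: lowercase
    text = text.translate(PUNCT_TABLE)   # step 8: delete punctuation characters
    text = NON_LETTER_RE.sub(" ", text)  # step 9: drop digits and other non-letters
    # step 11: keep only words with at least min_len characters
    words = [w for w in text.split() if len(w) >= min_len]
    return " ".join(words)


# The URL below is a made-up placeholder.
print(clean_text("@CZAR Nice! See http://example.com for 2 more #tags"))
```

Applied to a dataframe, such a helper would be used as `gab_df['body'].apply(clean_text)`; the notebook instead runs each step in its own cell so that the intermediate results can be inspected.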

In [None]:
import json
import pandas as pd 
import numpy as np
import io
import re
import nltk
nltk.download('all') # nltk.download('wordnet')

In [None]:
%%capture 
!pip install demoji
!pip install contractions
!pip install nltk

In [None]:
%%capture 
# https://fasttext.cc/docs/en/language-identification.html https://www.youtube.com/watch?v=JJdJePbmCyw
!pip install fasttext
!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz
!ls

In [None]:
# 0. Upload NDJSON file with posts to Google Colab 
from google.colab import files
uploaded = files.upload()

In [None]:
# 1. Read NDJSON file as pandas dataframe
gab_df_complete = pd.read_json(io.StringIO(uploaded.get('gab_posts_jan_2018_030.json').decode('utf-8')), lines = True)
gab_df_complete

In [None]:
# 2. Remove unneeded columns = dimension reduction (columns) -> only {body} of [post] left
# gab_df = pd.DataFrame([x.get('body') for x in gab_df_complete['post']]) 
# gab_df = gab_df[gab_df['body'].str.split().str.len().gt(1)] # delete all 1 word sentences -> no other types apart from string
# gab_df.to_csv('gab_df_posts_body.csv', index=False)
# gab_df = pd.read_csv(io.BytesIO(uploaded['gab_df_posts_body(500 MB).csv']))
# gab_df.rename(columns={'0': 'body'}, inplace=True)
gab_df = pd.read_csv(io.BytesIO(uploaded['gab_df_posts_body.csv']))
gab_df

Unnamed: 0,body
0,@CZAR Nice!
1,@JohnRivers There hasn't been any real monopol...
2,@Sinisin Moore wouldn't know a real racist if ...
3,@ShineALight Thank you.
4,http://davidshurter.com/?p=7215\nIf you haven'...
...,...
216454,Was the truck an Auto or Manual ?
216455,#AllahSnackbar\n#BanIslam\n#Terrorism
216456,This lady knows.\n\n#islam #deathcult
216457,https://voiceofeurope.com/2017/10/2278/\n#GabF...


In [None]:
# 3. Remove URLs 
gab_df['body'] = gab_df['body'].apply(lambda x: ' '.join(re.sub(r"http\S+", " ", x).split()))
gab_df

Unnamed: 0,body
0,@CZAR Nice!
1,@JohnRivers There hasn't been any real monopol...
2,@Sinisin Moore wouldn't know a real racist if ...
3,@ShineALight Thank you.
4,"If you haven't read this book, I would recomme..."
...,...
216454,Was the truck an Auto or Manual ?
216455,#AllahSnackbar #BanIslam #Terrorism
216456,This lady knows. #islam #deathcult
216457,#GabFam #CanFam #News #Politics #MAGA #NewRigh...


In [None]:
# 4. Remove user mentions (@tags)
gab_df['body'] = gab_df['body'].apply(lambda x: ' '.join(re.sub(r"@\S+", " ", x).split()))
gab_df

Unnamed: 0,body
0,Nice!
1,There hasn't been any real monopoly enforcemen...
2,Moore wouldn't know a real racist if he was po...
3,Thank you.
4,"If you haven't read this book, I would recomme..."
...,...
216454,Was the truck an Auto or Manual ?
216455,#AllahSnackbar #BanIslam #Terrorism
216456,This lady knows. #islam #deathcult
216457,#GabFam #CanFam #News #Politics #MAGA #NewRigh...


In [None]:
# 5. Convert to lowercase
gab_df['body'] = gab_df['body'].apply(lambda x: ' '.join([w.lower() for w in x.split()]))
gab_df

Unnamed: 0,body
0,nice!
1,there hasn't been any real monopoly enforcemen...
2,moore wouldn't know a real racist if he was po...
3,thank you.
4,"if you haven't read this book, i would recomme..."
...,...
216454,was the truck an auto or manual ?
216455,#allahsnackbar #banislam #terrorism
216456,this lady knows. #islam #deathcult
216457,#gabfam #canfam #news #politics #maga #newrigh...


In [None]:
# 6. Remove emojis (demoji library)
import demoji
gab_df['body'] = gab_df['body'].apply(lambda x: demoji.replace(x, ""))
gab_df

Unnamed: 0,body
0,nice!
1,there hasn't been any real monopoly enforcemen...
2,moore wouldn't know a real racist if he was po...
3,thank you.
4,"if you haven't read this book, i would recomme..."
...,...
216454,was the truck an auto or manual ?
216455,#allahsnackbar #banislam #terrorism
216456,this lady knows. #islam #deathcult
216457,#gabfam #canfam #news #politics #maga #newrigh...


In [None]:
# 7. Expand contractions (contractions library) (ex: you’re => you are) 
import contractions
gab_df['body'] = gab_df['body'].apply(lambda x: ' '.join([contractions.fix(word) for word in x.split()]))
gab_df

Unnamed: 0,body
0,nice!
1,there has not been any real monopoly enforceme...
2,moore would not know a real racist if he was p...
3,thank you.
4,"if you have not read this book, i would recomm..."
...,...
216454,was the truck an auto or manual ?
216455,#allahsnackbar #banislam #terrorism
216456,this lady knows. #islam #deathcult
216457,#gabfam #canfam #news #politics #maga #newrigh...


In [None]:
# 8. Remove punctuation (using string.punctuation)
import string 
gab_df['body'] = gab_df['body'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation]))
gab_df

Unnamed: 0,body
0,nice
1,there has not been any real monopoly enforceme...
2,moore would not know a real racist if he was p...
3,thank you
4,if you have not read this book i would recomme...
...,...
216454,was the truck an auto or manual
216455,allahsnackbar banislam terrorism
216456,this lady knows islam deathcult
216457,gabfam canfam news politics maga newright geop...


In [None]:
# 9. Remove numbers (the pattern replaces every run of non-letter characters with a space)
gab_df['body'] = gab_df['body'].apply(lambda x: ' '.join(re.sub(r"[^a-zA-Z]+", " ", x).split()))
gab_df

Unnamed: 0,body
0,nice
1,there has not been any real monopoly enforceme...
2,moore would not know a real racist if he was p...
3,thank you
4,if you have not read this book i would recomme...
...,...
216454,was the truck an auto or manual
216455,allahsnackbar banislam terrorism
216456,this lady knows islam deathcult
216457,gabfam canfam news politics maga newright geop...


In [None]:
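# 10. Lemmatization (using WordNetLemmatizer from nltk) (ex: says => say) 
from nltk.stem import WordNetLemmatizer
# Create the lemmatizer once instead of once per word.
# Without a POS tag, lemmatize() treats every word as a noun, hence 'has' => 'ha' and 'was' => 'wa' below.
lemmatizer = WordNetLemmatizer()
gab_df['body'] = gab_df['body'].apply(lambda x: ' '.join([lemmatizer.lemmatize(w) for w in x.split()]))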
gab_df

Unnamed: 0,body
0,nice
1,there ha not been any real monopoly enforcemen...
2,moore would not know a real racist if he wa po...
3,thank you
4,if you have not read this book i would recomme...
...,...
216454,wa the truck an auto or manual
216455,allahsnackbar banislam terrorism
216456,this lady know islam deathcult
216457,gabfam canfam news politics maga newright geop...


In [None]:
# 11. Remove words shorter than 3 characters
gab_df['body'] = gab_df['body'].apply(lambda x: ' '.join([w.strip() for w in x.split() if len(w.strip()) >= 3]))
gab_df

Unnamed: 0,body
0,nice
1,there not been any real monopoly enforcement d...
2,moore would not know real racist poked his goo...
3,thank you
4,you have not read this book would recommend bu...
...,...
216454,the truck auto manual
216455,allahsnackbar banislam terrorism
216456,this lady know islam deathcult
216457,gabfam canfam news politics maga newright geop...


In [None]:
# 12. Remove English and Spanish stopwords (using stopwords from nltk corpus)
from nltk.corpus import stopwords
# Keep the negations 'not' and 'no' out of the English list; sets make the membership tests fast
english_stop_words = {sw for sw in stopwords.words('english') if sw not in ('not', 'no')}
english_stop_words.update(['from', 'subject', 're', 'edu', 'use'])
# print(english_stop_words)
# Note: the Spanish list is applied unfiltered, so any word it shares with English (such as 'no') is still removed
spanish_stop_words = set(stopwords.words('spanish'))
# print(spanish_stop_words)
gab_df['body'] = gab_df['body'].apply(lambda x: ' '.join([w for w in x.split() if w not in english_stop_words]))
gab_df['body'] = gab_df['body'].apply(lambda x: ' '.join([w for w in x.split() if w not in spanish_stop_words]))
gab_df

Unnamed: 0,body
0,nice
1,not real monopoly enforcement done since att s...
2,moore would not know real racist poked good ey...
3,thank
4,not read book would recommend buying kindle cu...
...,...
216454,truck auto manual
216455,allahsnackbar banislam terrorism
216456,lady know islam deathcult
216457,gabfam canfam news politics maga newright geop...


In [None]:
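# 13. Filter out non-English posts = dimension reduction (rows)
import fasttext
model = fasttext.load_model("lid.176.ftz")
def fast_detect(msg):
    try:
        # predict() returns labels such as '__label__en'; keep the first two characters of the language code
        ln = (str(model.predict(msg)[0]).split("__")[2])[0:2]
    except Exception as e:
        # Posts that cannot be classified get no language and are dropped below
        ln = None
        # print(msg)
    return ln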
gab_df['language'] = gab_df['body'].apply(fast_detect)
print(gab_df[gab_df['language'] != 'en'])
gab_df.drop(gab_df[gab_df['language'] != 'en'].index, inplace=True)
gab_df.drop(['language'], inplace = True, axis = 1)          
print(gab_df) 



                                                     body language
20                                   franke laine rawhide       fr
33      skol bradford wright yard pas pat viking lead ...       nd
37                     left half part tundra grow lesueur       fr
38           skol bradfordrudolph connection yard pas pat       fr
66                                   blood sweat tear die       de
...                                                   ...      ...
216361                                         fuck islam       es
216373  not shocking bil deblasio sadiq khan unholy al...       fi
216387                           good boy neva dun nutten       gl
216443                         taste rainbow hide rainbow       it
216444                deportthemall banislam buildthewall       fi

[9826 rows x 2 columns]
                                                     body
0                                                    nice
1       not real monopoly enforcement done since att s.
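
The string slicing in `fast_detect` keeps only the first two characters of the predicted code, so a three-letter code would be shortened (this may explain the `nd` entry in the output above, assuming the model returned a code such as `nds`). A minimal sketch of a more explicit way to read the label follows. The helper names `label_to_code` and `first_language` are hypothetical, and the tuple passed in the example only imitates the `(labels, probabilities)` shape that `model.predict` returns; no model is loaded here.

```python
LABEL_PREFIX = "__label__"


def label_to_code(label: str) -> str:
    """Strip the fastText label prefix, e.g. '__label__en' -> 'en'."""
    if label.startswith(LABEL_PREFIX):
        return label[len(LABEL_PREFIX):]
    return label


def first_language(prediction):
    """Return the top language code from a (labels, probabilities) pair, or None."""
    labels = prediction[0]
    if not labels:
        return None
    return label_to_code(labels[0])


# Hand-written stand-in for a model.predict(...) result (assumption, not real output).
fake_prediction = (("__label__en",), [0.93])
print(first_language(fake_prediction))
print(label_to_code("__label__nds"))
```

Since the notebook only keeps rows whose code equals `en`, the truncation does not change which posts survive; it only affects the codes shown for the dropped rows.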

In [None]:
# 14. Filter out null values = dimension reduction (rows)
print('Dimension of dataframe: ' + str(gab_df.shape)) 
gab_df['body'] = gab_df['body'].replace("", np.nan)  # assign back instead of a chained in-place replace
gab_df.dropna(inplace=True)
print('\n'  + 'Dimension of dataframe after filtering out null values: ' + str(gab_df.shape)) 
gab_df

Dimension of dataframe: (206633, 1)

Dimension of dataframe after filtering out null values: (205410, 1)


Unnamed: 0,body
0,nice
1,not real monopoly enforcement done since att s...
2,moore would not know real racist poked good ey...
3,thank
4,not read book would recommend buying kindle cu...
...,...
216454,truck auto manual
216455,allahsnackbar banislam terrorism
216456,lady know islam deathcult
216457,gabfam canfam news politics maga newright geop...


Data Sample Formation Gab (train)

In [None]:
# 15. Drop duplicates = dimension reduction (rows)
gab_df.drop_duplicates(inplace=True)
gab_df = gab_df.reset_index(drop=True)
gab_df

Unnamed: 0,body
0,nice
1,not real monopoly enforcement done since att s...
2,moore would not know real racist poked good ey...
3,thank
4,not read book would recommend buying kindle cu...
...,...
182285,nyc terrorist came diversity visa program spon...
182286,mayor bill blasio declares incident lone wolf ...
182287,time eradicate barbarian candle praying stay s...
182288,truck auto manual


In [None]:
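# 16. Drop posts with 3 words or fewer
# Posts with at most 3 words are set to NaN here; the NaN rows are removed in step 17
gab_df['body'] = gab_df['body'].where(gab_df['body'].str.split().str.len().gt(3))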
gab_df

Unnamed: 0,body
0,
1,not real monopoly enforcement done since att s...
2,moore would not know real racist poked good ey...
3,
4,not read book would recommend buying kindle cu...
...,...
182285,nyc terrorist came diversity visa program spon...
182286,mayor bill blasio declares incident lone wolf ...
182287,time eradicate barbarian candle praying stay s...
182288,


In [None]:
# 17. Filter out null values = dimension reduction (rows)
print('Dimension of dataframe: ' + str(gab_df.shape)) 
gab_df['body'] = gab_df['body'].replace("", np.nan)  # assign back instead of a chained in-place replace
gab_df.dropna(inplace=True)
print('\n'  + 'Dimension of dataframe after filtering out null values: ' + str(gab_df.shape)) 
gab_df

Dimension of dataframe: (182290, 1)

Dimension of dataframe after filtering out null values: (160041, 1)


Unnamed: 0,body
1,not real monopoly enforcement done since att s...
2,moore would not know real racist poked good ey...
4,not read book would recommend buying kindle cu...
5,well oyster stew fresh fish anyone suppose hel...
7,recognizing continued company would long term ...
...,...
182284,antifa bamn nambla connected many antifa membe...
182285,nyc terrorist came diversity visa program spon...
182286,mayor bill blasio declares incident lone wolf ...
182287,time eradicate barbarian candle praying stay s...
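
Steps 15 to 17 could also be written as one direct boolean-mask filter, which skips the intermediate NaN rows. The sketch below is an illustration under stated assumptions, not the notebook's code: `toy_df` and its rows are made up for the example, and only the column name `body` is taken from the notebook.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from the Gab dataset).
toy_df = pd.DataFrame({"body": [
    "truck auto manual",                      # 3 words  -> dropped
    "not read book would recommend buying",   # 6 words  -> kept
    "not read book would recommend buying",   # duplicate -> dropped
    "thank",                                  # 1 word   -> dropped
    "well oyster stew fresh fish",            # 5 words  -> kept
]})

# Step 15: drop exact duplicates.
deduped = toy_df.drop_duplicates()

# Steps 16-17: keep only posts with more than 3 words, then renumber the rows.
word_counts = deduped["body"].str.split().str.len()
filtered = deduped[word_counts.gt(3)].reset_index(drop=True)

print(filtered)
```

The difference from the cells above is that the surviving rows are renumbered from 0, whereas the notebook keeps the original index values after dropping the NaN rows.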


In [None]:
# 18. Save pandas dataframe without index after preprocessing as CSV file
gab_df.to_csv('gab_df_posts_body_preprocessed.csv', index=False)

In [None]:
# 19. Create dataframe for training the sentence transformer with 100,000 randomly selected posts
gab_train = gab_df.sample(n=100000, random_state=1, ignore_index=True) 
gab_train

Unnamed: 0,body
0,traitor alert lindsey graham revealed secret p...
1,mean let not upset goober
2,dnc obey law dnc worker suing party failing pa...
3,thus reason altright right whole develop psych...
4,sweden police scared good reason yeah jesus de...
...,...
99995,delete duckduckgo messed computer
99996,coming alttech project take note
99997,next news network youtube view edward snowden ...
99998,would rather burn abandoned christian church l...


In [None]:
# 20. Save the final dataframe without index as CSV file
gab_train.to_csv('gab_train.csv', index=False)