Parler DATA PREPROCESSING Part 2

8. Read CSV file containing only useful info (no index) as pandas dataframe
9. Convert [body] of parleys to lowercase
10. Remove emojis from [body] of parleys (demoji library)
11. Remove English and Spanish stopwords from [body] of parleys (using stopwords from nltk corpus)
12. Expand contractions from [body] of parleys (contractions library) (ex: you’re => you are) 
13. Remove punctuation from [body] of parleys (using string.punctuation)
14. Remove numbers and any remaining non-letter characters from [body] of parleys (re = regular expression library)
15. Lemmatization of [body] of parleys (using WordNetLemmatizer from nltk) (ex: says => say) 
16. Remove words shorter than 3 characters from [body] of parleys
17. Filter out null values found in [body] of parleys once more (bodies left empty by the steps above), reducing the number of rows
18. Save pandas dataframe without index after preprocessing as CSV file
19. Save pandas dataframes without index for each month after preprocessing as CSV file

In [1]:
import pandas as pd 
import numpy as np
import nltk

In [2]:
# 8. Read CSV file containing only useful info (no index) as pandas dataframe

parler_df = pd.read_csv('C:\\Users\\cosmi\\Desktop\\ANDREEA\\bachelors-thesis\\parler_df_041_dates_before.csv')
print(parler_df)

                                                     body createdAtformatted
0       This is massive this is big this is what you f...         2020-12-23
1            Absofreakinglutely!!! Tear that shit down!!!         2021-01-03
2       @Joebwinner And you are correct.. There are li...         2020-11-24
3       @factsRus I guess that’s why 15 CIA operative ...         2020-11-30
4                            @JennieWrennnn Your welcome.         2020-11-01
...                                                   ...                ...
294290  Ok, your beautiful wife who is currently livin...         2020-12-15
294291                          The reckoning approaches.         2020-12-05
294292  Fuck this piece of shit. And fuck anyone who d...         2020-12-24
294293  Thank you Ricky !! ❤️You may loose some no goo...         2020-11-21
294294                        I am ready first the truth.         2020-12-20

[294295 rows x 2 columns]


In [3]:
# 9. Convert [body] of parleys to lowercase

print(parler_df['body'])
parler_df['body'] = parler_df['body'].apply(lambda x: ' '.join([w.lower() for w in x.split()]))
print(parler_df['body'])

0         This is massive this is big this is what you f...
1              Absofreakinglutely!!! Tear that shit down!!!
2         @Joebwinner And you are correct.. There are li...
3         @factsRus I guess that’s why 15 CIA operative ...
4                              @JennieWrennnn Your welcome.
                                ...                        
294290    Ok, your beautiful wife who is currently livin...
294291                            The reckoning approaches.
294292    Fuck this piece of shit. And fuck anyone who d...
294293    Thank you Ricky !! ❤️You may loose some no goo...
294294                          I am ready first the truth.
Name: body, Length: 294295, dtype: object
0         this is massive this is big this is what you f...
1              absofreakinglutely!!! tear that shit down!!!
2         @joebwinner and you are correct.. there are li...
3         @factsrus i guess that’s why 15 cia operative ...
4                              @jenniewrennnn your welcome.

In [4]:
# 10. Remove emojis from [body] of parleys (demoji library)

import demoji
print(parler_df['body'])
parler_df['body'] = parler_df['body'].apply(lambda x: demoji.replace(x, ""))
print(parler_df['body'])

0         this is massive this is big this is what you f...
1              absofreakinglutely!!! tear that shit down!!!
2         @joebwinner and you are correct.. there are li...
3         @factsrus i guess that’s why 15 cia operative ...
4                              @jenniewrennnn your welcome.
                                ...                        
294290    ok, your beautiful wife who is currently livin...
294291                            the reckoning approaches.
294292    fuck this piece of shit. and fuck anyone who d...
294293    thank you ricky !! ❤️you may loose some no goo...
294294                          i am ready first the truth.
Name: body, Length: 294295, dtype: object
0         this is massive this is big this is what you f...
1              absofreakinglutely!!! tear that shit down!!!
2         @joebwinner and you are correct.. there are li...
3         @factsrus i guess that’s why 15 cia operative ...
4                              @jenniewrennnn your welcome.

In [5]:
# 11. Remove English and Spanish stopwords from [body] of parleys (using stopwords from nltk corpus)

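from nltk.corpus import stopwords
# nltk.download('stopwords')  # run once if the corpora are not installed
# nltk.download('wordnet')

# Keep the negations 'not' and 'no' in the text; sets make the membership tests below fast
english_stop_words = {sw for sw in stopwords.words('english') if sw not in ('not', 'no')}
english_stop_words.update(['from', 'subject', 're', 'edu', 'use'])
# print(english_stop_words)
# Note: the Spanish list itself contains 'no', so the second pass below still removes it
spanish_stop_words = set(stopwords.words('spanish'))
# print(spanish_stop_words)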

print(parler_df['body'])

parler_df['body'] = parler_df['body'].apply(lambda x: ' '.join([w for w in x.split() if w not in english_stop_words]))
parler_df['body'] = parler_df['body'].apply(lambda x: ' '.join([w for w in x.split() if w not in spanish_stop_words]))

print(parler_df['body'])

0         this is massive this is big this is what you f...
1              absofreakinglutely!!! tear that shit down!!!
2         @joebwinner and you are correct.. there are li...
3         @factsrus i guess that’s why 15 cia operative ...
4                              @jenniewrennnn your welcome.
                                ...                        
294290    ok, your beautiful wife who is currently livin...
294291                            the reckoning approaches.
294292    fuck this piece of shit. and fuck anyone who d...
294293    thank you ricky !! you may loose some no good ...
294294                          i am ready first the truth.
Name: body, Length: 294295, dtype: object
0         massive big find really dig! incriminating evi...
1                   absofreakinglutely!!! tear shit down!!!
2         @joebwinner correct.. literally conservatives ...
3         @factsrus guess that’s 15 cia operative killed...
4                                   @jenniewrennnn welcome.
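The output above hints at an ordering effect: stopwords are matched against whitespace-separated tokens while punctuation is still attached, so a token such as `down!!!` is not recognised as `down`. The sketch below illustrates this with a tiny, hypothetical stopword set (a stand-in, not the NLTK lists) and a made-up sentence. It only shows why filtering after punctuation removal may catch more stopwords; it is not the notebook's pipeline.

```python
import string

# Hypothetical, tiny stand-in for the NLTK stopword lists (assumption for illustration).
STOP_WORDS = {"that", "down", "is", "the"}


def remove_stopwords(text, stop_words=STOP_WORDS):
    """Drop whitespace-separated tokens that exactly match a stopword."""
    return " ".join(w for w in text.split() if w not in stop_words)


def strip_punctuation(text):
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


raw = "tear that wall down!!!"

# Order used in the notebook: stopwords first, punctuation later.
stop_first = strip_punctuation(remove_stopwords(raw))
# Alternative order: punctuation first, stopwords later.
punct_first = remove_stopwords(strip_punctuation(raw))

print(stop_first)   # tear wall down  ("down!!!" did not match "down")
print(punct_first)  # tear wall
```

Running a second stopword pass after punctuation removal would be another way to get the same effect without reordering the earlier steps.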

In [6]:
# 12. Expand contractions from [body] of parleys (contractions library) (ex: you’re => you are) 

import contractions
print(parler_df['body'])
parler_df['body'] = parler_df['body'].apply(lambda x: ' '.join([contractions.fix(word) for word in x.split()]))
print(parler_df['body'])

0         massive big find really dig! incriminating evi...
1                   absofreakinglutely!!! tear shit down!!!
2         @joebwinner correct.. literally conservatives ...
3         @factsrus guess that’s 15 cia operative killed...
4                                   @jenniewrennnn welcome.
                                ...                        
294290    ok, beautiful wife currently living free count...
294291                                reckoning approaches.
294292    fuck piece shit. fuck anyone disrespects count...
294293    thank ricky !! may loose good liberals hollywo...
294294                                   ready first truth.
Name: body, Length: 294295, dtype: object
0         massive big find really dig! incriminating evi...
1                   absofreakinglutely!!! tear shit down!!!
2         @joebwinner correct.. literally conservatives ...
3         @factsrus guess that is 15 cia operative kille...
4                                   @jenniewrennnn welcome.
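Because contractions are expanded after stopword removal, the expansion can reintroduce stopwords: in the output above, `that’s` becomes `that is`. A minimal sketch of this interaction, assuming a hypothetical two-entry contraction map and a hypothetical stopword set (neither is the data shipped with the `contractions` library or NLTK):

```python
# Hypothetical stand-ins for the contractions library and the NLTK stopword list.
CONTRACTIONS = {"that’s": "that is", "you’re": "you are"}
STOP_WORDS = {"that", "is", "you", "are"}


def expand_contractions(text):
    """Replace each whitespace-separated token found in the map by its expansion."""
    return " ".join(CONTRACTIONS.get(word, word) for word in text.split())


def remove_stopwords(text):
    """Drop tokens that exactly match a stopword."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


sample = "guess that’s why"

stop_then_expand = expand_contractions(remove_stopwords(sample))
expand_then_stop = remove_stopwords(expand_contractions(sample))

print(stop_then_expand)  # guess that is why
print(expand_then_stop)  # guess why
```

In the notebook, the later step that removes words shorter than 3 characters drops `is`, but a three-letter-or-longer stopword such as `that` remains in the text.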

In [7]:
# 13. Remove punctuation from [body] of parleys (using string.punctuation)

import string 
print(parler_df['body'])
parler_df['body'] = parler_df['body'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation]))
print(parler_df['body'])

0         massive big find really dig! incriminating evi...
1                   absofreakinglutely!!! tear shit down!!!
2         @joebwinner correct.. literally conservatives ...
3         @factsrus guess that is 15 cia operative kille...
4                                   @jenniewrennnn welcome.
                                ...                        
294290    ok, beautiful wife currently living free count...
294291                                reckoning approaches.
294292    fuck piece shit. fuck anyone disrespects count...
294293    thank ricky !! may loose good liberals hollywo...
294294                                   ready first truth.
Name: body, Length: 294295, dtype: object
0         massive big find really dig incriminating evid...
1                         absofreakinglutely tear shit down
2         joebwinner correct literally conservatives bur...
3         factsrus guess that is 15 cia operative killed...
4                                     jenniewrennnn welcome
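Deleting punctuation character by character joins words that were separated only by punctuation. The sketch below, using made-up sample text, contrasts deletion with replacement by a space, both via `str.translate`. It also shows that `string.punctuation` covers ASCII characters only, so typographic quotes such as `’` are not affected by this step.

```python
import string

# Translation tables built once: one deletes ASCII punctuation, one replaces it with a space.
DELETE_PUNCT = str.maketrans("", "", string.punctuation)
SPACE_PUNCT = str.maketrans({ch: " " for ch in string.punctuation})


def delete_punctuation(text):
    """Remove punctuation characters, as the notebook step does."""
    return text.translate(DELETE_PUNCT)


def space_punctuation(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return " ".join(text.translate(SPACE_PUNCT).split())


sample = "ok,fine...really?"

print(delete_punctuation(sample))  # okfinereally
print(space_punctuation(sample))   # ok fine really

# string.punctuation is ASCII-only, so the typographic apostrophe is not in it.
print("’" in string.punctuation)   # False
```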

In [8]:
# 14. Remove numbers and any other non-letter characters from [body] of parleys (re = regular expression library)

import re
print(parler_df['body'])
parler_df['body'] = parler_df['body'].apply(lambda x: ' '.join(re.sub("[^a-zA-Z]+", " ", x).split()))
print(parler_df['body'])

0         massive big find really dig incriminating evid...
1                         absofreakinglutely tear shit down
2         joebwinner correct literally conservatives bur...
3         factsrus guess that is 15 cia operative killed...
4                                     jenniewrennnn welcome
                                ...                        
294290    ok beautiful wife currently living free countr...
294291                                 reckoning approaches
294292    fuck piece shit fuck anyone disrespects countr...
294293    thank ricky  may loose good liberals hollywood...
294294                                    ready first truth
Name: body, Length: 294295, dtype: object
0         massive big find really dig incriminating evid...
1                         absofreakinglutely tear shit down
2         joebwinner correct literally conservatives bur...
3         factsrus guess that is cia operative killed china
4                                     jenniewrennnn welcome
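The pattern `[^a-zA-Z]+` removes more than digits: every character outside the ASCII letters is replaced, which includes accented letters. Since Spanish stopwords are handled in step 11, Spanish text presumably occurs in the data, so the difference may matter. The sketch below compares the notebook's pattern with a digits-only alternative on a made-up sample; the alternative is an assumption for illustration, not the notebook's method.

```python
import re


def keep_ascii_letters(text):
    """Same pattern as step 14: every run of characters outside a-z/A-Z becomes a space."""
    return " ".join(re.sub("[^a-zA-Z]+", " ", text).split())


def drop_digits_only(text):
    """Alternative (assumption): remove only digit runs and keep all other characters."""
    return " ".join(re.sub(r"\d+", " ", text).split())


# "\u00f1" is the precomposed letter n with tilde, so the sample reads "año 2020 niño".
sample = "a\u00f1o 2020 ni\u00f1o"

print(keep_ascii_letters(sample))  # a o ni o
print(drop_digits_only(sample))    # año niño
```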

In [9]:
# 15. Lemmatization of [body] of parleys (using WordNetLemmatizer from nltk) (ex: says => say) 

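from nltk.stem import WordNetLemmatizer

# Create the lemmatizer once instead of once per word.
# lemmatize() defaults to pos='n' (noun), so verb forms such as 'killed' are left unchanged.
lemmatizer = WordNetLemmatizer()

print(parler_df['body'])
parler_df['body'] = parler_df['body'].apply(lambda x: ' '.join([lemmatizer.lemmatize(w) for w in x.split()]))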
print(parler_df['body'])

0         massive big find really dig incriminating evid...
1                         absofreakinglutely tear shit down
2         joebwinner correct literally conservatives bur...
3         factsrus guess that is cia operative killed china
4                                     jenniewrennnn welcome
                                ...                        
294290    ok beautiful wife currently living free countr...
294291                                 reckoning approaches
294292    fuck piece shit fuck anyone disrespects countr...
294293    thank ricky may loose good liberals hollywood ...
294294                                    ready first truth
Name: body, Length: 294295, dtype: object
0         massive big find really dig incriminating evid...
1                         absofreakinglutely tear shit down
2         joebwinner correct literally conservative burn...
3         factsrus guess that is cia operative killed china
4                                     jenniewrennnn welcome

In [10]:
# 16. Remove words shorter than 3 characters from [body] of parleys

print(parler_df['body'])
parler_df['body'] = parler_df['body'].apply(lambda x: ' '.join([w for w in x.split() if len(w) >= 3]))
print(parler_df['body'])

0         massive big find really dig incriminating evid...
1                         absofreakinglutely tear shit down
2         joebwinner correct literally conservative burn...
3         factsrus guess that is cia operative killed china
4                                     jenniewrennnn welcome
                                ...                        
294290    ok beautiful wife currently living free countr...
294291                                   reckoning approach
294292    fuck piece shit fuck anyone disrespect country...
294293    thank ricky may loose good liberal hollywood s...
294294                                    ready first truth
Name: body, Length: 294295, dtype: object
0         massive big find really dig incriminating evid...
1                         absofreakinglutely tear shit down
2         joebwinner correct literally conservative burn...
3            factsrus guess that cia operative killed china
4                                     jenniewrennnn welcome

In [11]:
# 17. Filter out null values found in [body] of parleys once more = dimension reduction (rows)

print('Dimension of dataframe: ' + str(parler_df.shape)) 
print(parler_df['body']) 

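# Bodies emptied by the previous steps become NaN so that dropna can remove those rows;
# assigning the result back avoids an inplace edit on a column selection
parler_df['body'] = parler_df['body'].replace("", np.nan)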
parler_df.dropna(subset=['body'], inplace=True)

print('\n'  + 'Dimension of dataframe after preprocessing: ' + str(parler_df.shape)) 
print(parler_df['body'])

Dimension of dataframe: (294295, 2)
0         massive big find really dig incriminating evid...
1                         absofreakinglutely tear shit down
2         joebwinner correct literally conservative burn...
3            factsrus guess that cia operative killed china
4                                     jenniewrennnn welcome
                                ...                        
294290    beautiful wife currently living free country t...
294291                                   reckoning approach
294292    fuck piece shit fuck anyone disrespect country...
294293    thank ricky may loose good liberal hollywood s...
294294                                    ready first truth
Name: body, Length: 294295, dtype: object

Dimension of dataframe after preprocessing: (286237, 2)
0         massive big find really dig incriminating evid...
1                         absofreakinglutely tear shit down
2         joebwinner correct literally conservative burn...
3            factsrus guess that cia operative killed china
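The two shapes printed above give the number of rows whose body was left empty by the preprocessing steps. A short worked calculation, using only the figures reported in the output:

```python
rows_before = 294295  # rows reported before filtering empty bodies
rows_after = 286237   # rows reported after filtering empty bodies

rows_removed = rows_before - rows_after
share_removed = rows_removed / rows_before

print(rows_removed)            # 8058
print(f"{share_removed:.2%}")  # 2.74%
```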

In [12]:
# 18. Save pandas dataframe without index after preprocessing as CSV file

parler_df.to_csv('parler_df_041_dates.csv', index=False)

In [13]:
# 19. Save pandas dataframes without index for each month after preprocessing as CSV file

def check_month(date):
    year, month = date.split('-')[:2]
    if year == '2020' and month == '11':    # November 2020
        return 'nov'
    elif year == '2020' and month == '12':  # December 2020
        return 'dec'
    else:
        return 'jan'  # January 2021 (any other date also falls through to here)
       

print('Dimension of whole dataframe after preprocessing: ' + str(parler_df.shape)) 
print(parler_df)

parler_df['month'] = parler_df['createdAtformatted'].apply(check_month)

# Drop the helper column on the filtered result instead of inplace on a slice
parler_df_nov = parler_df[parler_df['month'] == 'nov'].drop(columns=['month'])
print('Dimension of November dataframe after preprocessing: ' + str(parler_df_nov.shape)) 
print(parler_df_nov)

# Drop the helper column on the filtered result instead of inplace on a slice
parler_df_dec = parler_df[parler_df['month'] == 'dec'].drop(columns=['month'])
print('Dimension of December dataframe after preprocessing: ' + str(parler_df_dec.shape)) 
print(parler_df_dec)

# Drop the helper column on the filtered result instead of inplace on a slice
parler_df_jan = parler_df[parler_df['month'] == 'jan'].drop(columns=['month'])
print('Dimension of January dataframe after preprocessing: ' + str(parler_df_jan.shape)) 
print(parler_df_jan)

parler_df_nov.to_csv('parler_df_041_dates_nov.csv', index=False)
parler_df_dec.to_csv('parler_df_041_dates_dec.csv', index=False)
parler_df_jan.to_csv('parler_df_041_dates_jan.csv', index=False)

Dimension of whole dataframe after preprocessing: (286237, 2)
                                                     body createdAtformatted
0       massive big find really dig incriminating evid...         2020-12-23
1                       absofreakinglutely tear shit down         2021-01-03
2       joebwinner correct literally conservative burn...         2020-11-24
3          factsrus guess that cia operative killed china         2020-11-30
4                                   jenniewrennnn welcome         2020-11-01
...                                                   ...                ...
294290  beautiful wife currently living free country t...         2020-12-15
294291                                 reckoning approach         2020-12-05
294292  fuck piece shit fuck anyone disrespect country...         2020-12-24
294293  thank ricky may loose good liberal hollywood s...         2020-11-21
294294                                  ready first truth         2020-12-20

[286237 rows x 2 columns]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


Dimension of November dataframe after preprocessing: (164708, 2)
                                                     body createdAtformatted
2       joebwinner correct literally conservative burn...         2020-11-24
3          factsrus guess that cia operative killed china         2020-11-30
4                                   jenniewrennnn welcome         2020-11-01
5                                         trumppin boomer         2020-11-23
6       aintchaprecious captain debating team talk int...         2020-11-20
...                                                   ...                ...
294275                   devil worshipper michaelmichelle         2020-11-21
294276  obviously buffoon idea electoral process work ...         2020-11-12
294286                                    great awakening         2020-11-02
294289                                               need         2020-11-26
294293  thank ricky may loose good liberal hollywood s...         2020-11-21

[164708 rows x 2 columns]
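In `check_month`, the `else` branch labels every date that is not in November or December 2020 as `'jan'`, so a stray date from another month would silently end up in the January file. The sketch below is one possible stricter variant, assuming dates always follow the `YYYY-MM-DD` format seen in `createdAtformatted`. The `MONTH_LABELS` table and the choice to raise `ValueError` are illustrative assumptions, not part of the original notebook.

```python
from datetime import datetime

# Hypothetical lookup table: (year, month) -> label used in the output file names.
MONTH_LABELS = {
    (2020, 11): "nov",  # November 2020
    (2020, 12): "dec",  # December 2020
    (2021, 1): "jan",   # January 2021
}


def check_month(date):
    """Return the month label for a YYYY-MM-DD string, rejecting unexpected months."""
    parsed = datetime.strptime(date, "%Y-%m-%d")
    key = (parsed.year, parsed.month)
    if key not in MONTH_LABELS:
        raise ValueError(f"date outside November 2020 to January 2021: {date}")
    return MONTH_LABELS[key]


print(check_month("2020-11-24"))  # nov
print(check_month("2021-01-03"))  # jan
```

It can be applied in the same way as the original, for example `parler_df['createdAtformatted'].apply(check_month)`, and would stop with an error if the data contained a date outside the three expected months.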