DATA PREPROCESSING Part 2

8. Read CSV file containing only useful info (no index) as pandas dataframe
9. Convert [body] of tweets to lowercase
10. Remove emojis from [body] of tweets (demoji library)
11. Remove English and Spanish stopwords from [body] of tweets (using stopwords from nltk corpus)
12. Expand contractions from [body] of tweets (contractions library) (ex: you’re => you are) 
13. Remove punctuation from [body] of tweets (using string.punctuation)
14. Remove numbers from [body] of tweets (re = regular expression library)
15. Lemmatization of [body] of tweets (using WordNetLemmatizer from nltk) (ex: says => say) 
16. Remove words shorter than 3 characters from [body] of tweets
17. Filter out null values found in [body] of tweets once more = dimension reduction (rows)
18. Save pandas dataframe without index after preprocessing as CSV file
19. Save pandas dataframes without index for each month after preprocessing as CSV file

In [1]:
import pandas as pd 
import numpy as np
import nltk

In [2]:
# 8. Read CSV file containing only useful info (no index) as pandas dataframe

parler_df = pd.read_csv('C:\\Users\\cosmi\\Desktop\\ANDREEA\\bachelors-thesis\\parler_df_030_dates_before.csv')
print(parler_df)

                                                     body createdAtformatted
0        Professor my ass more like a pot head gone meth.         2020-12-13
1                    Need To Spread The News, Jo Ann !!..         2020-12-10
2               Fuck yes fire this loser non American !!!         2020-11-12
3                                     Hang that pedophile         2020-12-25
4       @Ivanka2020 You must be a sex and love the dep...         2020-12-08
...                                                   ...                ...
294636  This ugly alcoholic should just sit down at th...         2020-12-23
294637                                              Boom!         2020-11-02
294638  Exactly what I think every time I see John Rob...         2020-12-25
294639                            Twat, the term is twat!         2020-11-18
294640                                            Love it         2020-12-20

[294641 rows x 2 columns]


In [3]:
# 9. Convert [body] of tweets to lowercase

print(parler_df['body'])
parler_df['body'] = parler_df['body'].apply(lambda x: ' '.join([w.lower() for w in x.split()]))
print(parler_df['body'])

0          Professor my ass more like a pot head gone meth.
1                      Need To Spread The News, Jo Ann !!..
2                 Fuck yes fire this loser non American !!!
3                                       Hang that pedophile
4         @Ivanka2020 You must be a sex and love the dep...
                                ...                        
294636    This ugly alcoholic should just sit down at th...
294637                                                Boom!
294638    Exactly what I think every time I see John Rob...
294639                              Twat, the term is twat!
294640                                              Love it
Name: body, Length: 294641, dtype: object
0          professor my ass more like a pot head gone meth.
1                      need to spread the news, jo ann !!..
2                 fuck yes fire this loser non american !!!
3                                       hang that pedophile
4         @ivanka2020 you must be a sex and love the dep..

In [4]:
# 10. Remove emojis from [body] of tweets (demoji library)

import demoji
print(parler_df['body'])
parler_df['body'] = parler_df['body'].apply(lambda x: demoji.replace(x, ""))
print(parler_df['body'])

0          professor my ass more like a pot head gone meth.
1                      need to spread the news, jo ann !!..
2                 fuck yes fire this loser non american !!!
3                                       hang that pedophile
4         @ivanka2020 you must be a sex and love the dep...
                                ...                        
294636    this ugly alcoholic should just sit down at th...
294637                                                boom!
294638    exactly what i think every time i see john rob...
294639                              twat, the term is twat!
294640                                              love it
Name: body, Length: 294641, dtype: object
0          professor my ass more like a pot head gone meth.
1                      need to spread the news, jo ann !!..
2                 fuck yes fire this loser non american !!!
3                                       hang that pedophile
4         @ivanka2020 you must be a sex and love the dep..

In [5]:
# 11. Remove English and Spanish stopwords from [body] of tweets (using stopwords from nltk corpus)

from nltk.corpus import stopwords
# nltk.download('wordnet')

english_stop_words = [sw for sw in nltk.corpus.stopwords.words('english') if sw not in ['not', 'no']]
english_stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
# print(english_stop_words)
spanish_stop_words = stopwords.words('spanish')
# print(spanish_stop_words)

print(parler_df['body'])

parler_df['body'] = parler_df['body'].apply(lambda x: ' '.join([w for w in x.split() if w not in english_stop_words]))
parler_df['body'] = parler_df['body'].apply(lambda x: ' '.join([w for w in x.split() if w not in spanish_stop_words]))

print(parler_df['body'])

0          professor my ass more like a pot head gone meth.
1                      need to spread the news, jo ann !!..
2                 fuck yes fire this loser non american !!!
3                                       hang that pedophile
4         @ivanka2020 you must be a sex and love the dep...
                                ...                        
294636    this ugly alcoholic should just sit down at th...
294637                                                boom!
294638    exactly what i think every time i see john rob...
294639                              twat, the term is twat!
294640                                              love it
Name: body, Length: 294641, dtype: object
0                    professor ass like pot head gone meth.
1                             need spread news, jo ann !!..
2                      fuck yes fire loser non american !!!
3                                            hang pedophile
4         @ivanka2020 must sex love deprived human being..

In [6]:
# 12. Expand contractions from [body] of tweets (contractions library) (ex: you’re => you are) 

import contractions
print(parler_df['body'])
parler_df['body'] = parler_df['body'].apply(lambda x: ' '.join([contractions.fix(word) for word in x.split()]))
print(parler_df['body'])

0                    professor ass like pot head gone meth.
1                             need spread news, jo ann !!..
2                      fuck yes fire loser non american !!!
3                                            hang pedophile
4         @ivanka2020 must sex love deprived human being...
                                ...                        
294636    ugly alcoholic sit bar keep picking last guys ...
294637                                                boom!
294638    exactly think every time see john roberts smir...
294639                                     twat, term twat!
294640                                                 love
Name: body, Length: 294641, dtype: object
0                    professor ass like pot head gone meth.
1                             need spread news, jo ann !!..
2                      fuck yes fire loser non american !!!
3                                            hang pedophile
4         @ivanka2020 must sex love deprived human being..

In [7]:
# 13. Remove punctuation from [body] of tweets (using string.punctuation)

import string 
print(parler_df['body'])
parler_df['body'] = parler_df['body'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation]))
print(parler_df['body'])

0                    professor ass like pot head gone meth.
1                             need spread news, jo ann !!..
2                      fuck yes fire loser non american !!!
3                                            hang pedophile
4         @ivanka2020 must sex love deprived human being...
                                ...                        
294636    ugly alcoholic sit bar keep picking last guys ...
294637                                                boom!
294638    exactly think every time see john roberts smir...
294639                                     twat, term twat!
294640                                                 love
Name: body, Length: 294641, dtype: object
0                     professor ass like pot head gone meth
1                                  need spread news jo ann 
2                         fuck yes fire loser non american 
3                                            hang pedophile
4         ivanka2020 must sex love deprived human being ..

In [8]:
# 14. Remove numbers from [body] of tweets (re = regular expression library)

import re
print(parler_df['body'])
parler_df['body'] = parler_df['body'].apply(lambda x: ' '.join(re.sub("[^a-zA-Z]+", " ", x).split()))
print(parler_df['body'])

0                     professor ass like pot head gone meth
1                                  need spread news jo ann 
2                         fuck yes fire loser non american 
3                                            hang pedophile
4         ivanka2020 must sex love deprived human being ...
                                ...                        
294636    ugly alcoholic sit bar keep picking last guys ...
294637                                                 boom
294638    exactly think every time see john roberts smir...
294639                                       twat term twat
294640                                                 love
Name: body, Length: 294641, dtype: object
0                     professor ass like pot head gone meth
1                                   need spread news jo ann
2                          fuck yes fire loser non american
3                                            hang pedophile
4         ivanka must sex love deprived human being ever..

In [9]:
# 15. Lemmatization of [body] of tweets (using WordNetLemmatizer from nltk) (ex: says => say) 

from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *

print(parler_df['body'])
parler_df['body'] = parler_df['body'].apply(lambda x: ' '.join([WordNetLemmatizer().lemmatize(w) for w in x.split()]))
print(parler_df['body'])

0                     professor ass like pot head gone meth
1                                   need spread news jo ann
2                          fuck yes fire loser non american
3                                            hang pedophile
4         ivanka must sex love deprived human being ever...
                                ...                        
294636    ugly alcoholic sit bar keep picking last guys ...
294637                                                 boom
294638    exactly think every time see john roberts smir...
294639                                       twat term twat
294640                                                 love
Name: body, Length: 294641, dtype: object
0                      professor as like pot head gone meth
1                                   need spread news jo ann
2                          fuck yes fire loser non american
3                                            hang pedophile
4         ivanka must sex love deprived human being ever..

In [10]:
# 16. Remove words shorter than 3 characters from [body] of tweets

print(parler_df['body'])
parler_df['body'] = parler_df['body'].apply(lambda x: ' '.join([w.strip() for w in x.split() if len(w.strip()) >= 3]))
print(parler_df['body'])

0                      professor as like pot head gone meth
1                                   need spread news jo ann
2                          fuck yes fire loser non american
3                                            hang pedophile
4         ivanka must sex love deprived human being ever...
                                ...                        
294636    ugly alcoholic sit bar keep picking last guy t...
294637                                                 boom
294638    exactly think every time see john robert smirk...
294639                                       twat term twat
294640                                                 love
Name: body, Length: 294641, dtype: object
0                         professor like pot head gone meth
1                                      need spread news ann
2                          fuck yes fire loser non american
3                                            hang pedophile
4         ivanka must sex love deprived human being ever..

In [11]:
# 17. Filter out null values found in [body] of tweets once more = dimension reduction (rows)

print('Dimension of dataframe: ' + str(parler_df.shape)) 
print(parler_df['body']) 

parler_df['body'].replace("", np.nan, inplace=True)
parler_df.dropna(subset=['body'], inplace=True)

print('\n'  + 'Dimension of dataframe after preprocessing: ' + str(parler_df.shape)) 
print(parler_df['body'])

Dimension of dataframe: (294641, 2)
0                         professor like pot head gone meth
1                                      need spread news ann
2                          fuck yes fire loser non american
3                                            hang pedophile
4         ivanka must sex love deprived human being ever...
                                ...                        
294636    ugly alcoholic sit bar keep picking last guy t...
294637                                                 boom
294638    exactly think every time see john robert smirk...
294639                                       twat term twat
294640                                                 love
Name: body, Length: 294641, dtype: object

Dimension of dataframe after preprocessing: (286569, 2)
0                         professor like pot head gone meth
1                                      need spread news ann
2                          fuck yes fire loser non american
3                        

In [12]:
# 18. Save pandas dataframe without index after preprocessing as CSV file

parler_df.to_csv('parler_df_030_dates.csv', index=False)

In [15]:
# 19. Save pandas dataframes without index for each month after preprocessing as CSV file

def check_month(date):
    if (((date.split('-'))[0] == '2020') and ((date.split('-'))[1] == '11')):    # November 2020
        return 'nov'
    elif (((date.split('-'))[0] == '2020') and ((date.split('-'))[1] == '12')):  # December 2020
        return 'dec'
    else:
        return 'jan' # January  2021
       

print('Dimension of whole dataframe after preprocessing: ' + str(parler_df.shape)) 
print(parler_df)

parler_df['month'] = parler_df['createdAtformatted'].apply(check_month)

parler_df_nov = parler_df[parler_df['month'] == 'nov']
parler_df_nov.drop(['month'], inplace = True, axis = 1) 
print('Dimension of November dataframe after preprocessing: ' + str(parler_df.shape)) 
print(parler_df_nov)

parler_df_dec = parler_df[parler_df['month'] == 'dec']
parler_df_dec.drop(['month'], inplace = True, axis = 1)  
print('Dimension of December dataframe after preprocessing: ' + str(parler_df.shape)) 
print(parler_df_dec)

parler_df_jan = parler_df[parler_df['month'] == 'jan']  
parler_df_jan.drop(['month'], inplace = True, axis = 1) 
print('Dimension of January dataframe after preprocessing: ' + str(parler_df_jan.shape)) 
print(parler_df_jan)

parler_df_nov.to_csv('parler_df_030_dates_nov.csv', index=False)
parler_df_dec.to_csv('parler_df_030_dates_dec.csv', index=False)
parler_df_jan.to_csv('parler_df_030_dates_jan.csv', index=False)

Dimension of whole dataframe after preprocessing: (286569, 3)
                                                     body createdAtformatted  \
0                       professor like pot head gone meth         2020-12-13   
1                                    need spread news ann         2020-12-10   
2                        fuck yes fire loser non american         2020-11-12   
3                                          hang pedophile         2020-12-25   
4       ivanka must sex love deprived human being ever...         2020-12-08   
...                                                   ...                ...   
294636  ugly alcoholic sit bar keep picking last guy t...         2020-12-23   
294637                                               boom         2020-11-02   
294638  exactly think every time see john robert smirk...         2020-12-25   
294639                                     twat term twat         2020-11-18   
294640                                               love 

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


Dimension of November dataframe after preprocessing: (286569, 3)
                                                     body createdAtformatted
2                        fuck yes fire loser non american         2020-11-12
5                                      redparty send info         2020-11-25
7       jenelleeason better not tell commie trash frie...         2020-11-15
8       metamorphys take blocking win closed minded he...         2020-11-07
10              metroswift mama nazi whore that paid xbox         2020-11-24
...                                                   ...                ...
294629                      fredo fredo fredo broke heart         2020-11-13
294632              today november still feel slow reason         2020-11-07
294634                                      report parler         2020-11-21
294637                                               boom         2020-11-02
294639                                     twat term twat         2020-11-18

[164948 ro