# Data cleaning for modeling and feature engineering

## Setup

In [24]:
# import the usual suspects / basics
import pandas as pd
import numpy as np
import re
import pickle
import os

# tqdm
from tqdm import tqdm
tqdm.pandas()

# spaCy
import spacy

!python -m spacy download en_core_web_sm

# fastText
import fasttext

# display all df columns (default is 20)
pd.options.display.max_columns = None

# show all data in columns so that full comment is visible
pd.options.display.max_colwidth = None

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)

## Load data

In [3]:
df = pd.read_csv('data/undersampled_data_60_40.csv')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 399455 entries, 0 to 399454
Data columns (total 3 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   comment_text  399455 non-null  object 
 1   toxicity      399455 non-null  float64
 2   toxic         399455 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 9.1+ MB


## Optional: Create a smaller sample to speed things up while experimenting

In [5]:
sample_size = None

# uncomment to create sample of desired size
#sample_size = 10_000

if sample_size is not None:
    # ratio toxic/nontoxic
    tox_perc = 0.4
    nontox_perc = 0.6

    # number of toxic/nontoxic rows
    sample_size_tox = int(sample_size * tox_perc)
    sample_size_nontox = int(sample_size * nontox_perc)

    sample_tox = df[df['toxic'] == 1].sample(sample_size_tox,
                                             random_state=42)
    sample_nontox = df[df['toxic'] == 0].sample(sample_size_nontox,
                                                random_state=42)

    df = pd.concat([sample_tox, sample_nontox])
    print(f'Using sample ({df.shape[0]} rows).')

else:
    print(f'Using full data ({df.shape[0]} rows).')

Using full data (399455 rows).


## Create corpus

In [6]:
corp = df['comment_text']

## Data cleaning

### Show data size before cleaning

In [7]:
# count 'words' (rough regex method)
num_words_before = corp.str.count(r'\S+', flags=re.I).sum()

print(f'Number of words in corpus before cleaning: {num_words_before:,}')

Number of words in corpus before cleaning: 20,096,710


### Remove anchor HTML tags (\<a\>)

TODO: Do this with an HTML parser like Beautiful Soup.

In [8]:
regex = r'<a .*?>|</a>' # *? for non-greedy repetition

# count matches
print(corp.str.count(regex, flags=re.I).sum())

# show some rows containing the pattern
corp[corp.str.contains(regex, na=False, case=False)].head()

65


345      yaah this is good posting for women  <a href="http://www.adultvibes.in/index.php?main_page=page&id=9l">sexual</a>  feeeling
12528    Rakhi is the traditional Indian festival where a sister ties Rakhi string around her brother's wrist. \nLike many other Indian festivals, this too is a gift-giving occasion when brother and sisters exchange their token of love.\nThere are many quotes are available for sibling in our article......\nFor More....\nPlz visit:- <a href="http://www.dooiitt.com/raksha-bandhan-quotes-images-wishes/">Raksha Bandhan Quotes</a>
66222

In [9]:
# replace pattern
corp = corp.str.replace(regex, '', regex=True, case=False)

# count matches again, should be 0
print(corp.str.count(regex, flags=re.I).sum())

0
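
The TODO above suggests an HTML parser. As a rough sketch of that idea, the block below uses `html.parser` from the Python standard library rather than Beautiful Soup (a deliberate swap to avoid an extra dependency). The names `AnchorStripper` and `strip_anchor_tags` are made up for illustration and are not part of this notebook. Unlike the regex, the parser also decodes character references such as `&amp;` and drops HTML comments, so the result is not guaranteed to be identical.

```python
from html.parser import HTMLParser


class AnchorStripper(HTMLParser):
    """Re-emit the input while skipping <a ...> and </a> tags (hypothetical helper)."""

    def __init__(self):
        # default convert_charrefs=True: character references arrive decoded in handle_data
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # tag names are lower-cased by the parser, so <A HREF=...> is covered too
        if tag != 'a':
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # self-closing tags such as <br/> are kept verbatim
        if tag != 'a':
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != 'a':
            self.parts.append(f'</{tag}>')

    def handle_data(self, data):
        self.parts.append(data)


def strip_anchor_tags(text):
    parser = AnchorStripper()
    parser.feed(text)
    parser.close()  # flush any text the parser is still buffering
    return ''.join(parser.parts)


print(strip_anchor_tags('visit <a href="http://example.com/x">Quotes</a> now'))
```

With a pandas Series this could be applied as `corp.map(strip_anchor_tags)`, at the cost of being slower than the vectorized regex.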


### Remove URLs

In [10]:
regex = r'https?://\S+'
print(corp.str.count(regex, flags=re.I).sum())
corp[corp.str.contains(regex, na=False, case=False)].head()

10820


7    Odd that Saunders doesn't mention the two terrorist attacks, the one in Manchester less than three weeks ago and the one in London last week, and what  effect they may have had on the election.  The Tories' drop in the polls, and Labour's rise, starts after the Manchester massacre: https://en.wikipedia.org/wiki/Opinion_polling_for_the_United_Kingdom_general_election,_2017#2017. Of course that may just be coincidence, but may

In [11]:
corp = corp.str.replace(regex, '', regex=True, case=False)
print(corp.str.count(regex, flags=re.I).sum())

0
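
To make the behavior of this pattern concrete, here is a small sketch on plain strings. The helper name `remove_urls` and the shortened example URL are illustrative assumptions. Two properties are worth knowing: `\S+` also consumes punctuation that directly follows the URL, and addresses without a scheme (such as `www.example.com`) are left untouched.

```python
import re

URL_RE = re.compile(r'https?://\S+', flags=re.I)


def remove_urls(text):
    """Delete every http(s) URL, including any punctuation glued to its end."""
    return URL_RE.sub('', text)


sample = 'starts after Manchester: https://en.wikipedia.org/wiki/Opinion_polling#2017. Of course'
print(remove_urls(sample))                  # the '.' after the URL disappears as well
print(remove_urls('see www.example.com'))   # no scheme, so nothing is removed
```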


### Remove whitespace except for spaces

`\r` causes an error when the saved CSV file is loaded with `read_csv()` using the default C engine (the Python engine works).  
`\u2028` is the Unicode line separator.

In [12]:
regex = r'[\t\n\r\f\v\u2028]'
print(corp.str.count(regex, flags=re.I).sum())
corp[corp.str.contains(regex, na=False, case=False)].head()

435299


1    The Jones Act was immediately lifted to help Texas and Florida.\n\nIt took the nation two weeks of shaming Trump before he acted to help Puerto Rico.\n\nHe spent that time making lame and nonsensical excuses for why he couldn't lift the ban.\n\nIn other news:\n\nTrump continues to shore up his racist base by dropping more racial dog whistles. Now he says NFL owners are 'afraid' of their black players.\n\nYep, the plantation is under threat by the uppitys all over again.  Trump is a racist.\n\nAnd a traitor.
3

In [13]:
corp = corp.str.replace(regex, ' ', regex=True, case=False)
print(corp.str.count(regex, flags=re.I).sum())

0
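
The same character class can be tried on a plain string. This is only a sketch; `flatten_whitespace` is a hypothetical helper name. Every matched character becomes exactly one space, so a run like `\n\n` leaves two spaces behind. The last line hints at why `\r` and `\u2028` are troublesome in line-oriented files: Python's own `str.splitlines()` treats both as line boundaries.

```python
import re

WS_RE = re.compile(r'[\t\n\r\f\v\u2028]')


def flatten_whitespace(text):
    """Replace each tab, line break, form feed, vertical tab or U+2028 with one space."""
    return WS_RE.sub(' ', text)


raw = 'a\n\nb\u2028c\rd'
print(repr(flatten_whitespace(raw)))
print('a\rb\u2028c'.splitlines())  # both characters end a line for splitlines()
```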


### Remove numbers

In [14]:
regex = r'\d+'
print(corp.str.count(regex, flags=re.I).sum())
corp[corp.str.contains(regex, na=False, case=False)].head()

186580


2    As long as the Church keeps preventing the Lord from calling women to the sacramental priesthood, there is a fundamental imbalance, driven by patriarchal gender ideology, that is harmful to the entire body of Christ, male and female.  The vocation crisis is not about women, just as it is not about men.  It is about letting go of a patriarchal culture that is passing away, and allowing the Lord to call those he wants here and now, men and women, to all vocations, including the sacramental priesthood and the episcopate, without imposing artificial gender walls that are heritage from the Old Law (not the New Law!) and no longer make sense.  The sacramental priesthood is about service, not genitals. Ordained priests do #2 sitting down.  Why in the world is it that they cannot do #1 sitting down as well?  Allow the Lord to call women to the ministerial priesthood, and the life of the Church will be much better!

In [15]:
corp = corp.str.replace(regex, ' ', regex=True, case=False)
print(corp.str.count(regex, flags=re.I).sum())

0
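
Removing only the digits leaves attached symbols behind, which is visible later in the cleaned text ("# sitting down", "% of the"). The sketch below reproduces that effect on plain strings and shows one possible variant that also drops a leading `#` or trailing `%`. The helper `remove_numbers` and the second pattern are assumptions for illustration, not what this notebook runs.

```python
import re


def remove_numbers(text, pattern=r'\d+'):
    """Replace every match of `pattern` with a single space."""
    return re.sub(pattern, ' ', text)


# pattern used in the notebook: the '%' and '#' survive
print(repr(remove_numbers('claim that 95% of')))
print(repr(remove_numbers('do #2 sitting')))

# hypothetical variant: also swallow a '#' prefix or '%' suffix
print(repr(remove_numbers('claim that 95% of', r'#?\d+%?')))
print(repr(remove_numbers('do #2 sitting', r'#?\d+%?')))
```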


### Manually "unmask" the most frequent swear words, insults, etc. (e.g. f*ck, cr@p)

Also correct some deliberate misspellings that mimic pronunciation, e.g. "huuuge", "stooopid".

TODO: Implement autocorrection.

In [16]:
# search patterns used to create list of replacements (see next cell)

regex = r'\S*\*\S+'
#regex = r'\S*@\S+'
#regex = r'\S*#\S+'
#regex = r'\S*a{3,}\S*'
#regex = r'\S*e{3,}\S*'
#regex = r'\S*i{3,}\S*'
#regex = r'\S*o{3,}\S*'
#regex = r'\S*u{3,}\S*'

print(corp.str.count(regex, flags=re.I).sum())
all_matches = corp.str.findall(regex, flags=re.I).value_counts()
all_matches[all_matches > 5]

4365


comment_text
[]                395926
[sh*t]                91
[***]                 54
[a**]                 50
[s**t]                38
[*****]               35
[p***y]               34
[****]                34
[f***]                28
[f**k]                23
[h*ll]                21
[p****]               20
[p*ssy]               19
[F***]                18
[s***]                18
[a**.]                17
[pu**y]               16
[sh**]                15
[cr*p]                15
[*not*]               14
[sh*t.]               14
[*sigh*]              14
[b*tch]               13
[h***]                13
[*any*]               13
[f*ck]                12
[*is*]                12
[f***ing]             11
[F*ck]                11
[*you*]               10
[bullsh*t]            10
[*I]                  10
[a**es]                9
[*ss]                  9
[*&^%]                 8
[d*mn]                 8
[*are*]                8
[A**]                  8
[p*ss]                 7
[b*tch.]    
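
Note that `value_counts()` on the result of `findall()` counts whole per-row lists, which is why the empty list `[]` dominates the output and a comment with two matches forms its own entry. A token-level tally can be sketched with the standard library as below; `count_masked_tokens` is a hypothetical helper, and lower-casing the matches is an added assumption so that `F***` and `f***` are counted together. In pandas, something along the lines of `corp.str.findall(regex).explode().value_counts()` should give a comparable per-token count.

```python
import re
from collections import Counter

MASK_RE = re.compile(r'\S*\*\S+')


def count_masked_tokens(comments):
    """Count every masked token individually, ignoring case."""
    counts = Counter()
    for comment in comments:
        counts.update(match.lower() for match in MASK_RE.findall(comment))
    return counts


demo = ['sh*t and Sh*t', 'no match here', 'f**k sh*t.']
print(count_masked_tokens(demo).most_common())
```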

In [17]:
# masked spellings to search for; '(?i)' makes each pattern case-insensitive
# '*' is escaped so that it matches literally, while '+' stays a regex quantifier
match_list = '(?i)f*ck, (?i)sh*t, (?i)s**t, (?i)f***, (?i)p***y, (?i)b*tch, (?i)f**k, (?i)p*ssy, (?i)p****, (?i)s***, (?i)a**, (?i)h*ll, (?i)h***, (?i)sh*t, (?i)pu**y, (?i)sh**, (?i)cr*p, (?i)@ss, (?i)cr@p, (?i)b@lls, (?i)f@ck, (?i)waaay, (?i)waaaay, (?i)riiiight, (?i)soo+, (?i)stooooopid, (?i)huu+ge, (?i)yuu+ge, (?i)suu+re'\
    .replace('*', r'\*').split(', ')
# replacements, in the same order as match_list
replace_list = 'fuck, shit, shit, fuck, pussy, bitch, fuck, pussy, pussy, shit, ass, hell, hell, shit, pussy, shit, crap, ass, crap, balls, fuck, way, way, right, so, stupid, huge, huge, sure'\
    .split(', ')

# apply all pattern/replacement pairs to the corpus in place
corp.replace(match_list, replace_list, regex=True, inplace=True)
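
One caveat: the patterns are not anchored to word boundaries, so `soo+` also matches inside ordinary words ("soon" becomes "son"). The sketch below shows one possible way to guard against that with lookarounds. It uses only a small subset of the pairs, and the names `PAIRS`, `compile_pair` and `unmask` are illustrative assumptions rather than part of this notebook.

```python
import re

# (masked spelling, replacement): a small illustrative subset of the lists above
PAIRS = [('f*ck', 'fuck'), ('sh*t', 'shit'), ('cr@p', 'crap'),
         ('soo+', 'so'), ('huu+ge', 'huge')]


def compile_pair(masked):
    # escape only '*', so quantifiers such as '+' keep their regex meaning
    body = masked.replace('*', r'\*')
    # no word character directly before, no word character or '*' directly after
    return re.compile(rf'(?<!\w){body}(?![\w*])', flags=re.I)


COMPILED = [(compile_pair(masked), repl) for masked, repl in PAIRS]


def unmask(text):
    """Apply all replacements in order; matches are replaced in lower case."""
    for pattern, repl in COMPILED:
        text = pattern.sub(repl, text)
    return text


print(unmask('Soooo stupid, see you soon'))
print(re.sub(r'(?i)soo+', 'so', 'see you soon'))  # unanchored pattern for comparison
```

The lookahead also excludes a following `*`, so a shorter mask does not fire inside a longer one.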

### Remove multiple spaces

In [18]:
regex = r' {2,}'
print(corp.str.count(regex, flags=re.I).sum())
corp[corp.str.contains(regex, na=False, case=False)].head()

702621


0    Trudeau with a brain?  I assume you are taking about Pierre. Can't imagine anyone else.
1

In [19]:
corp = corp.str.replace(regex, ' ', regex=True, case=False)
print(corp.str.count(regex, flags=re.I).sum())

0
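
The pattern ` {2,}` collapses runs of spaces but does not touch a single leading or trailing space, which can remain after a URL or tag at the edge of a comment has been removed. A common alternative, sketched here on plain strings, is to split on whitespace and re-join. The helper name `normalize_spaces` is hypothetical, and this variant is broader than the regex because `str.split()` treats every Unicode whitespace character as a separator.

```python
def normalize_spaces(text):
    """Collapse any whitespace run to one space and trim both ends."""
    return ' '.join(text.split())


print(repr(normalize_spaces('  Trudeau with a brain?  I assume   so. ')))
```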


### Show data size after cleaning

In [20]:
num_words_after = corp.str.count(r'\S+', flags=re.I).sum()

print(f'Number of words in corpus after cleaning: {num_words_after:,} (before: {num_words_before:,})')

Number of words in corpus after cleaning: 20,050,329 (before: 20,096,710)


## Preprocess data with spaCy

See: https://realpython.com/natural-language-processing-spacy-python/


In [25]:
# load English language model
nlp = spacy.load('en_core_web_sm')

### Tokenize, remove punctuation, make lower case, lemmatize, remove stop words

In [26]:
def preprocess(s):
    doc = nlp(s)
    
    tokens = [token.text.lower()
              for token in doc
              if not token.is_punct]
    
    tokens_lemma = [token.lemma_.lower()
              for token in doc
              if not token.is_punct]
    
    tokens_lemma_stop = [token.lemma_.lower()
              for token in doc
              if not token.is_punct and not token.is_stop]
    
    # convert lists to space-separated strings and return as Series
    return pd.Series([' '.join(tokens),
                      ' '.join(tokens_lemma),
                      ' '.join(tokens_lemma_stop)],
                      index=['clean_pp',
                             'clean_pp_lemma',
                             'clean_pp_lemma_stop'])

In [27]:
corp_pp = corp.progress_apply(preprocess)
corp_pp.head()

100%|████████████████████████████████████████████████████████████████████████████████████| 399455/399455 [15:55:20<00:00,  6.97it/s]


Unnamed: 0,clean_pp,clean_pp_lemma,clean_pp_lemma_stop
0,trudeau with a brain i assume you are taking about pierre ca n't imagine anyone else,trudeau with a brain i assume you be take about pierre can not imagine anyone else,trudeau brain assume take pierre imagine
1,the jones act was immediately lifted to help texas and florida it took the nation two weeks of shaming trump before he acted to help puerto rico he spent that time making lame and nonsensical excuses for why he could n't lift the ban in other news trump continues to shore up his racist base by dropping more racial dog whistles now he says nfl owners are afraid of their black players yep the plantation is under threat by the uppitys all over again trump is a racist and a traitor,the jones act be immediately lift to help texas and florida it take the nation two week of shame trump before he act to help puerto rico he spend that time make lame and nonsensical excuse for why he could not lift the ban in other news trump continue to shore up his racist base by drop more racial dog whistle now he say nfl owner be afraid of their black player yep the plantation be under threat by the uppitys all over again trump be a racist and a traitor,jones act immediately lift help texas florida take nation week shame trump act help puerto rico spend time make lame nonsensical excuse lift ban news trump continue shore racist base drop racial dog whistle say nfl owner afraid black player yep plantation threat uppitys trump racist traitor
2,as long as the church keeps preventing the lord from calling women to the sacramental priesthood there is a fundamental imbalance driven by patriarchal gender ideology that is harmful to the entire body of christ male and female the vocation crisis is not about women just as it is not about men it is about letting go of a patriarchal culture that is passing away and allowing the lord to call those he wants here and now men and women to all vocations including the sacramental priesthood and the episcopate without imposing artificial gender walls that are heritage from the old law not the new law and no longer make sense the sacramental priesthood is about service not genitals ordained priests do sitting down why in the world is it that they can not do sitting down as well allow the lord to call women to the ministerial priesthood and the life of the church will be much better,as long as the church keep prevent the lord from call woman to the sacramental priesthood there be a fundamental imbalance drive by patriarchal gender ideology that be harmful to the entire body of christ male and female the vocation crisis be not about woman just as it be not about man it be about let go of a patriarchal culture that be pass away and allow the lord to call those he want here and now man and woman to all vocation include the sacramental priesthood and the episcopate without impose artificial gender wall that be heritage from the old law not the new law and no long make sense the sacramental priesthood be about service not genital ordain priest do sit down why in the world be it that they can not do sit down as well allow the lord to call woman to the ministerial priesthood and the life of the church will be much well,long church keep prevent lord call woman sacramental priesthood fundamental imbalance drive patriarchal gender ideology harmful entire body christ male female vocation crisis woman man let patriarchal culture pass away allow lord want man woman vocation include 
sacramental priesthood episcopate impose artificial gender wall heritage old law new law long sense sacramental priesthood service genital ordain priest sit world sit allow lord woman ministerial priesthood life church well
3,climate change in the sense discussed in the pope 's encyclical is a dubious theory at best also your claim that of the scientific community supports the theory is patently false in any event for a pope to give his imprimatur to a scientific theory is unprecedented he has no competence in this area and should not lend credibility to what is essentially a prudential matter there is nothing wrong with trump 's approach to russia and there is no reason for the catholic church at this time and under current current circumstances to advise anyone let alone the president on the matter again this is a prudential matter relations between nations have nothing to do with what a country has done in terms of suppressing democracy what planet do you live on as for as poverty is concerned the pope should speak out against the corrupt governments that keep people in poverty and ignorance thinking that america can pay to save the world misses the point,climate change in the sense discuss in the pope 's encyclical be a dubious theory at well also your claim that of the scientific community support the theory be patently false in any event for a pope to give his imprimatur to a scientific theory be unprecedented he have no competence in this area and should not lend credibility to what be essentially a prudential matter there be nothing wrong with trump 's approach to russia and there be no reason for the catholic church at this time and under current current circumstance to advise anyone let alone the president on the matter again this be a prudential matter relation between nation have nothing to do with what a country have do in term of suppress democracy what planet do you live on as for as poverty be concern the pope should speak out against the corrupt government that keep people in poverty and ignorance think that america can pay to save the world miss the point,climate change sense discuss pope encyclical dubious theory well claim scientific community support theory 
patently false event pope imprimatur scientific theory unprecedented competence area lend credibility essentially prudential matter wrong trump approach russia reason catholic church time current current circumstance advise let president matter prudential matter relation nation country term suppress democracy planet live poverty concern pope speak corrupt government people poverty ignorance think america pay save world miss point
4,fake news now she is lying figures she is making her millions and gosh darn her detractors do n't scare her so there she needs to shut up herself,fake news now she be lie figure she be make her million and gosh darn her detractor do not scare she so there she need to shut up herself,fake news lie figure make million gosh darn detractor scare need shut
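
The run above took close to 16 hours because `preprocess` calls `nlp` once per comment and iterates over each document three times. spaCy's `nlp.pipe()` processes texts in batches and is usually faster. The sketch below is a hedged illustration of that approach: the function name `preprocess_batch` and the batch size of 256 are assumptions, and `nlp` is expected to be the pipeline loaded earlier. It returns plain tuples instead of a DataFrame to stay free of dependencies.

```python
def preprocess_batch(texts, nlp, batch_size=256):
    """Return (clean_pp, clean_pp_lemma, clean_pp_lemma_stop) for each text.

    `nlp` is assumed to be a loaded spaCy pipeline, e.g. spacy.load('en_core_web_sm').
    """
    rows = []
    for doc in nlp.pipe(texts, batch_size=batch_size):
        # drop punctuation once, then reuse the filtered tokens for all three variants
        kept = [token for token in doc if not token.is_punct]
        rows.append((
            ' '.join(token.text.lower() for token in kept),
            ' '.join(token.lemma_.lower() for token in kept),
            ' '.join(token.lemma_.lower() for token in kept if not token.is_stop),
        ))
    return rows
```

A possible use would be `pd.DataFrame(preprocess_batch(corp.tolist(), nlp), index=corp.index, columns=['clean_pp', 'clean_pp_lemma', 'clean_pp_lemma_stop'])`. Loading the model with `disable=['parser', 'ner']` may reduce the runtime further, since none of the three outputs needs those components; this has not been measured here.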


## Create new df with raw + cleaned + preprocessed comments + target

In [29]:
df_new = pd.concat([df['comment_text'],
                    corp,
                    corp_pp['clean_pp'],
                    corp_pp['clean_pp_lemma'],
                    corp_pp['clean_pp_lemma_stop'],
                    df['toxic']], axis=1)

# column names
df_new.columns = ['raw',
                  'clean',
                  'clean_pp',
                  'clean_pp_lemma',
                  'clean_pp_lemma_stop',
                  'toxic']

df_new.head()

Unnamed: 0,raw,clean,clean_pp,clean_pp_lemma,clean_pp_lemma_stop,toxic
0,Trudeau with a brain? I assume you are taking about Pierre. Can't imagine anyone else.,Trudeau with a brain? I assume you are taking about Pierre. Can't imagine anyone else.,trudeau with a brain i assume you are taking about pierre ca n't imagine anyone else,trudeau with a brain i assume you be take about pierre can not imagine anyone else,trudeau brain assume take pierre imagine,1
1,"The Jones Act was immediately lifted to help Texas and Florida.\n\nIt took the nation two weeks of shaming Trump before he acted to help Puerto Rico.\n\nHe spent that time making lame and nonsensical excuses for why he couldn't lift the ban.\n\nIn other news:\n\nTrump continues to shore up his racist base by dropping more racial dog whistles. Now he says NFL owners are 'afraid' of their black players.\n\nYep, the plantation is under threat by the uppitys all over again. Trump is a racist.\n\nAnd a traitor.","The Jones Act was immediately lifted to help Texas and Florida. It took the nation two weeks of shaming Trump before he acted to help Puerto Rico. He spent that time making lame and nonsensical excuses for why he couldn't lift the ban. In other news: Trump continues to shore up his racist base by dropping more racial dog whistles. Now he says NFL owners are 'afraid' of their black players. Yep, the plantation is under threat by the uppitys all over again. Trump is a racist. And a traitor.",the jones act was immediately lifted to help texas and florida it took the nation two weeks of shaming trump before he acted to help puerto rico he spent that time making lame and nonsensical excuses for why he could n't lift the ban in other news trump continues to shore up his racist base by dropping more racial dog whistles now he says nfl owners are afraid of their black players yep the plantation is under threat by the uppitys all over again trump is a racist and a traitor,the jones act be immediately lift to help texas and florida it take the nation two week of shame trump before he act to help puerto rico he spend that time make lame and nonsensical excuse for why he could not lift the ban in other news trump continue to shore up his racist base by drop more racial dog whistle now he say nfl owner be afraid of their black player yep the plantation be under threat by the uppitys all over again trump be a racist and a traitor,jones act immediately lift help texas 
florida take nation week shame trump act help puerto rico spend time make lame nonsensical excuse lift ban news trump continue shore racist base drop racial dog whistle say nfl owner afraid black player yep plantation threat uppitys trump racist traitor,1
2,"As long as the Church keeps preventing the Lord from calling women to the sacramental priesthood, there is a fundamental imbalance, driven by patriarchal gender ideology, that is harmful to the entire body of Christ, male and female. The vocation crisis is not about women, just as it is not about men. It is about letting go of a patriarchal culture that is passing away, and allowing the Lord to call those he wants here and now, men and women, to all vocations, including the sacramental priesthood and the episcopate, without imposing artificial gender walls that are heritage from the Old Law (not the New Law!) and no longer make sense. The sacramental priesthood is about service, not genitals. Ordained priests do #2 sitting down. Why in the world is it that they cannot do #1 sitting down as well? Allow the Lord to call women to the ministerial priesthood, and the life of the Church will be much better!","As long as the Church keeps preventing the Lord from calling women to the sacramental priesthood, there is a fundamental imbalance, driven by patriarchal gender ideology, that is harmful to the entire body of Christ, male and female. The vocation crisis is not about women, just as it is not about men. It is about letting go of a patriarchal culture that is passing away, and allowing the Lord to call those he wants here and now, men and women, to all vocations, including the sacramental priesthood and the episcopate, without imposing artificial gender walls that are heritage from the Old Law (not the New Law!) and no longer make sense. The sacramental priesthood is about service, not genitals. Ordained priests do # sitting down. Why in the world is it that they cannot do # sitting down as well? 
Allow the Lord to call women to the ministerial priesthood, and the life of the Church will be much better!",as long as the church keeps preventing the lord from calling women to the sacramental priesthood there is a fundamental imbalance driven by patriarchal gender ideology that is harmful to the entire body of christ male and female the vocation crisis is not about women just as it is not about men it is about letting go of a patriarchal culture that is passing away and allowing the lord to call those he wants here and now men and women to all vocations including the sacramental priesthood and the episcopate without imposing artificial gender walls that are heritage from the old law not the new law and no longer make sense the sacramental priesthood is about service not genitals ordained priests do sitting down why in the world is it that they can not do sitting down as well allow the lord to call women to the ministerial priesthood and the life of the church will be much better,as long as the church keep prevent the lord from call woman to the sacramental priesthood there be a fundamental imbalance drive by patriarchal gender ideology that be harmful to the entire body of christ male and female the vocation crisis be not about woman just as it be not about man it be about let go of a patriarchal culture that be pass away and allow the lord to call those he want here and now man and woman to all vocation include the sacramental priesthood and the episcopate without impose artificial gender wall that be heritage from the old law not the new law and no long make sense the sacramental priesthood be about service not genital ordain priest do sit down why in the world be it that they can not do sit down as well allow the lord to call woman to the ministerial priesthood and the life of the church will be much well,long church keep prevent lord call woman sacramental priesthood fundamental imbalance drive patriarchal gender ideology harmful entire body christ male 
female vocation crisis woman man let patriarchal culture pass away allow lord want man woman vocation include sacramental priesthood episcopate impose artificial gender wall heritage old law new law long sense sacramental priesthood service genital ordain priest sit world sit allow lord woman ministerial priesthood life church well,0
3,"Climate change, in the sense discussed in the Pope's encyclical, is a dubious theory at best. Also, your claim that 95% of the ""scientific community"" supports the theory is patently false. In any event for a pope to give his imprimatur to a scientific theory is unprecedented. He has no competence in this area and should not \nlend credibility to what is essentially a prudential matter. \n\nThere is nothing wrong with Trump's approach to Russia, and there is no reason for the Catholic Church at this time and under current current circumstances to advise anyone, let alone the President on the matter. Again this is a prudential matter. \n\nRelations between nations have nothing to do with what a country has done in terms of ""Suppressing democracy"". What planet do you live on. \n\nAs for as poverty is concerned, the Pope should speak out against the corrupt governments that keep people in poverty and ignorance. Thinking that America can pay to save the world misses the point.","Climate change, in the sense discussed in the Pope's encyclical, is a dubious theory at best. Also, your claim that % of the ""scientific community"" supports the theory is patently false. In any event for a pope to give his imprimatur to a scientific theory is unprecedented. He has no competence in this area and should not lend credibility to what is essentially a prudential matter. There is nothing wrong with Trump's approach to Russia, and there is no reason for the Catholic Church at this time and under current current circumstances to advise anyone, let alone the President on the matter. Again this is a prudential matter. Relations between nations have nothing to do with what a country has done in terms of ""Suppressing democracy"". What planet do you live on. As for as poverty is concerned, the Pope should speak out against the corrupt governments that keep people in poverty and ignorance. 
Thinking that America can pay to save the world misses the point.",climate change in the sense discussed in the pope 's encyclical is a dubious theory at best also your claim that of the scientific community supports the theory is patently false in any event for a pope to give his imprimatur to a scientific theory is unprecedented he has no competence in this area and should not lend credibility to what is essentially a prudential matter there is nothing wrong with trump 's approach to russia and there is no reason for the catholic church at this time and under current current circumstances to advise anyone let alone the president on the matter again this is a prudential matter relations between nations have nothing to do with what a country has done in terms of suppressing democracy what planet do you live on as for as poverty is concerned the pope should speak out against the corrupt governments that keep people in poverty and ignorance thinking that america can pay to save the world misses the point,climate change in the sense discuss in the pope 's encyclical be a dubious theory at well also your claim that of the scientific community support the theory be patently false in any event for a pope to give his imprimatur to a scientific theory be unprecedented he have no competence in this area and should not lend credibility to what be essentially a prudential matter there be nothing wrong with trump 's approach to russia and there be no reason for the catholic church at this time and under current current circumstance to advise anyone let alone the president on the matter again this be a prudential matter relation between nation have nothing to do with what a country have do in term of suppress democracy what planet do you live on as for as poverty be concern the pope should speak out against the corrupt government that keep people in poverty and ignorance think that america can pay to save the world miss the point,climate change sense discuss pope encyclical 
dubious theory well claim scientific community support theory patently false event pope imprimatur scientific theory unprecedented competence area lend credibility essentially prudential matter wrong trump approach russia reason catholic church time current current circumstance advise let president matter prudential matter relation nation country term suppress democracy planet live poverty concern pope speak corrupt government people poverty ignorance think america pay save world miss point,0
4,"Fake news...now she is lying. figures....she is making her millions and gosh darn, her detractors don't scare her, so there!!! She needs to shut up, herself.","Fake news...now she is lying. figures....she is making her millions and gosh darn, her detractors don't scare her, so there!!! She needs to shut up, herself.",fake news now she is lying figures she is making her millions and gosh darn her detractors do n't scare her so there she needs to shut up herself,fake news now she be lie figure she be make her million and gosh darn her detractor do not scare she so there she need to shut up herself,fake news lie figure make million gosh darn detractor scare need shut,1


## Drop rows with NaNs

In [32]:
# convert empty strings to NaN
# (np.nan rather than the np.NaN alias, which was removed in NumPy 2.0)
df_new.replace('', np.nan, inplace=True)

In [33]:
# count NaNs per column (not displayed: only a cell's last expression is echoed)
df_new.isna().sum()

# drop rows containing NaN and report how many were removed
rows_before = df_new.shape[0]
print('Rows before dropping:', rows_before)
df_new.dropna(inplace=True)
df_new.reset_index(drop=True, inplace=True)
rows_after = df_new.shape[0]
print('Rows after dropping:', rows_after)
print('Rows dropped:', rows_before - rows_after)

Rows before dropping: 398434
Rows after dropping: 398434
Rows dropped: 0
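
The replacement above only matches cells that are exactly the empty string, so a comment that cleaning reduced to spaces would not be dropped. The following is a minimal sketch of a stricter variant, assuming a toy frame that borrows the column names used here; the helper name `blank_to_nan` is hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd


def blank_to_nan(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which empty or whitespace-only strings become NaN."""
    # with regex=True the pattern matches cells made only of whitespace
    return frame.replace(r'^\s*$', np.nan, regex=True)


# toy data (made up for illustration): rows 1 and 2 carry no usable text
toy = pd.DataFrame({
    'clean_pp': ['some text', '', '   ', 'more text'],
    'toxic': [0, 1, 0, 1],
})

cleaned = blank_to_nan(toy).dropna().reset_index(drop=True)
print(cleaned)
```

Whether whitespace-only comments can occur at this point depends on the earlier cleaning steps, so this may well be a no-op on the real data.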


## Create fastText vectors

In [34]:
# # create temp file for fastText
# df_new.comment_clean_preproc.to_csv('data/fasttext_training_data_tmp.csv',
#                                     index=False, header=False)

# # run unsupervised learning to get embeddings
# ft = fasttext.train_unsupervised('data/fasttext_training_data_tmp.csv')

# # delete temp file
# os.remove('data/fasttext_training_data_tmp.csv')

In [None]:
# # add fastText vectors to df
# df_new['ft_vector'] = df_new['comment_clean_preproc']\
#     .map(ft.get_sentence_vector)
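
The commented-out cells write a temporary training file, train on it, and delete it afterwards. If training raises, the delete step never runs and the file is left behind. Below is a hedged sketch of the same pattern with guaranteed cleanup, using only the standard library. The names `train_on_temp_corpus` and `count_lines` are hypothetical; `count_lines` is a stand-in trainer so the example runs without fastText installed. In real use one would presumably pass `fasttext.train_unsupervised`, which takes the input file path as its first argument.

```python
import os
import tempfile
from typing import Callable, Iterable, List, TypeVar

T = TypeVar('T')


def train_on_temp_corpus(texts: Iterable[str], train_fn: Callable[[str], T]) -> T:
    """Write one text per line to a temp file, call train_fn(path), then delete it."""
    tmp = tempfile.NamedTemporaryFile(
        'w', suffix='.txt', delete=False, encoding='utf-8')
    try:
        with tmp:  # closes the file before training starts
            for text in texts:
                # collapse internal whitespace so each comment stays on one line
                tmp.write(' '.join(text.split()) + '\n')
        return train_fn(tmp.name)
    finally:
        os.remove(tmp.name)  # runs even if writing or training raises


seen_paths: List[str] = []


def count_lines(path: str) -> int:
    """Stand-in for a real trainer: records the path and counts its lines."""
    seen_paths.append(path)
    with open(path, encoding='utf-8') as fh:
        return sum(1 for _ in fh)


n_lines = train_on_temp_corpus(['fake news lie', 'jones act\nlift'], count_lines)
print(n_lines)
```

Writing plain lines instead of going through `to_csv` also avoids CSV quoting being added around comments that contain commas or quote characters.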

In [35]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398434 entries, 0 to 398433
Data columns (total 6 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   raw                  398434 non-null  object
 1   clean                398434 non-null  object
 2   clean_pp             398434 non-null  object
 3   clean_pp_lemma       398434 non-null  object
 4   clean_pp_lemma_stop  398434 non-null  object
 5   toxic                398434 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 18.2+ MB


## Save CSV file

In [36]:
df_new.to_csv('data/data_usampl_60_40_cleaned.csv', index=False)

In [37]:
df_check = pd.read_csv('data/data_usampl_60_40_cleaned.csv')
df_check.head()

Unnamed: 0,raw,clean,clean_pp,clean_pp_lemma,clean_pp_lemma_stop,toxic
0,Trudeau with a brain? I assume you are taking about Pierre. Can't imagine anyone else.,Trudeau with a brain? I assume you are taking about Pierre. Can't imagine anyone else.,trudeau with a brain i assume you are taking about pierre ca n't imagine anyone else,trudeau with a brain i assume you be take about pierre can not imagine anyone else,trudeau brain assume take pierre imagine,1
1,"The Jones Act was immediately lifted to help Texas and Florida.\n\nIt took the nation two weeks of shaming Trump before he acted to help Puerto Rico.\n\nHe spent that time making lame and nonsensical excuses for why he couldn't lift the ban.\n\nIn other news:\n\nTrump continues to shore up his racist base by dropping more racial dog whistles. Now he says NFL owners are 'afraid' of their black players.\n\nYep, the plantation is under threat by the uppitys all over again. Trump is a racist.\n\nAnd a traitor.","The Jones Act was immediately lifted to help Texas and Florida. It took the nation two weeks of shaming Trump before he acted to help Puerto Rico. He spent that time making lame and nonsensical excuses for why he couldn't lift the ban. In other news: Trump continues to shore up his racist base by dropping more racial dog whistles. Now he says NFL owners are 'afraid' of their black players. Yep, the plantation is under threat by the uppitys all over again. Trump is a racist. And a traitor.",the jones act was immediately lifted to help texas and florida it took the nation two weeks of shaming trump before he acted to help puerto rico he spent that time making lame and nonsensical excuses for why he could n't lift the ban in other news trump continues to shore up his racist base by dropping more racial dog whistles now he says nfl owners are afraid of their black players yep the plantation is under threat by the uppitys all over again trump is a racist and a traitor,the jones act be immediately lift to help texas and florida it take the nation two week of shame trump before he act to help puerto rico he spend that time make lame and nonsensical excuse for why he could not lift the ban in other news trump continue to shore up his racist base by drop more racial dog whistle now he say nfl owner be afraid of their black player yep the plantation be under threat by the uppitys all over again trump be a racist and a traitor,jones act immediately lift help texas 
florida take nation week shame trump act help puerto rico spend time make lame nonsensical excuse lift ban news trump continue shore racist base drop racial dog whistle say nfl owner afraid black player yep plantation threat uppitys trump racist traitor,1
2,"As long as the Church keeps preventing the Lord from calling women to the sacramental priesthood, there is a fundamental imbalance, driven by patriarchal gender ideology, that is harmful to the entire body of Christ, male and female. The vocation crisis is not about women, just as it is not about men. It is about letting go of a patriarchal culture that is passing away, and allowing the Lord to call those he wants here and now, men and women, to all vocations, including the sacramental priesthood and the episcopate, without imposing artificial gender walls that are heritage from the Old Law (not the New Law!) and no longer make sense. The sacramental priesthood is about service, not genitals. Ordained priests do #2 sitting down. Why in the world is it that they cannot do #1 sitting down as well? Allow the Lord to call women to the ministerial priesthood, and the life of the Church will be much better!","As long as the Church keeps preventing the Lord from calling women to the sacramental priesthood, there is a fundamental imbalance, driven by patriarchal gender ideology, that is harmful to the entire body of Christ, male and female. The vocation crisis is not about women, just as it is not about men. It is about letting go of a patriarchal culture that is passing away, and allowing the Lord to call those he wants here and now, men and women, to all vocations, including the sacramental priesthood and the episcopate, without imposing artificial gender walls that are heritage from the Old Law (not the New Law!) and no longer make sense. The sacramental priesthood is about service, not genitals. Ordained priests do # sitting down. Why in the world is it that they cannot do # sitting down as well? 
Allow the Lord to call women to the ministerial priesthood, and the life of the Church will be much better!",as long as the church keeps preventing the lord from calling women to the sacramental priesthood there is a fundamental imbalance driven by patriarchal gender ideology that is harmful to the entire body of christ male and female the vocation crisis is not about women just as it is not about men it is about letting go of a patriarchal culture that is passing away and allowing the lord to call those he wants here and now men and women to all vocations including the sacramental priesthood and the episcopate without imposing artificial gender walls that are heritage from the old law not the new law and no longer make sense the sacramental priesthood is about service not genitals ordained priests do sitting down why in the world is it that they can not do sitting down as well allow the lord to call women to the ministerial priesthood and the life of the church will be much better,as long as the church keep prevent the lord from call woman to the sacramental priesthood there be a fundamental imbalance drive by patriarchal gender ideology that be harmful to the entire body of christ male and female the vocation crisis be not about woman just as it be not about man it be about let go of a patriarchal culture that be pass away and allow the lord to call those he want here and now man and woman to all vocation include the sacramental priesthood and the episcopate without impose artificial gender wall that be heritage from the old law not the new law and no long make sense the sacramental priesthood be about service not genital ordain priest do sit down why in the world be it that they can not do sit down as well allow the lord to call woman to the ministerial priesthood and the life of the church will be much well,long church keep prevent lord call woman sacramental priesthood fundamental imbalance drive patriarchal gender ideology harmful entire body christ male 
female vocation crisis woman man let patriarchal culture pass away allow lord want man woman vocation include sacramental priesthood episcopate impose artificial gender wall heritage old law new law long sense sacramental priesthood service genital ordain priest sit world sit allow lord woman ministerial priesthood life church well,0
3,"Climate change, in the sense discussed in the Pope's encyclical, is a dubious theory at best. Also, your claim that 95% of the ""scientific community"" supports the theory is patently false. In any event for a pope to give his imprimatur to a scientific theory is unprecedented. He has no competence in this area and should not \nlend credibility to what is essentially a prudential matter. \n\nThere is nothing wrong with Trump's approach to Russia, and there is no reason for the Catholic Church at this time and under current current circumstances to advise anyone, let alone the President on the matter. Again this is a prudential matter. \n\nRelations between nations have nothing to do with what a country has done in terms of ""Suppressing democracy"". What planet do you live on. \n\nAs for as poverty is concerned, the Pope should speak out against the corrupt governments that keep people in poverty and ignorance. Thinking that America can pay to save the world misses the point.","Climate change, in the sense discussed in the Pope's encyclical, is a dubious theory at best. Also, your claim that % of the ""scientific community"" supports the theory is patently false. In any event for a pope to give his imprimatur to a scientific theory is unprecedented. He has no competence in this area and should not lend credibility to what is essentially a prudential matter. There is nothing wrong with Trump's approach to Russia, and there is no reason for the Catholic Church at this time and under current current circumstances to advise anyone, let alone the President on the matter. Again this is a prudential matter. Relations between nations have nothing to do with what a country has done in terms of ""Suppressing democracy"". What planet do you live on. As for as poverty is concerned, the Pope should speak out against the corrupt governments that keep people in poverty and ignorance. 
Thinking that America can pay to save the world misses the point.",climate change in the sense discussed in the pope 's encyclical is a dubious theory at best also your claim that of the scientific community supports the theory is patently false in any event for a pope to give his imprimatur to a scientific theory is unprecedented he has no competence in this area and should not lend credibility to what is essentially a prudential matter there is nothing wrong with trump 's approach to russia and there is no reason for the catholic church at this time and under current current circumstances to advise anyone let alone the president on the matter again this is a prudential matter relations between nations have nothing to do with what a country has done in terms of suppressing democracy what planet do you live on as for as poverty is concerned the pope should speak out against the corrupt governments that keep people in poverty and ignorance thinking that america can pay to save the world misses the point,climate change in the sense discuss in the pope 's encyclical be a dubious theory at well also your claim that of the scientific community support the theory be patently false in any event for a pope to give his imprimatur to a scientific theory be unprecedented he have no competence in this area and should not lend credibility to what be essentially a prudential matter there be nothing wrong with trump 's approach to russia and there be no reason for the catholic church at this time and under current current circumstance to advise anyone let alone the president on the matter again this be a prudential matter relation between nation have nothing to do with what a country have do in term of suppress democracy what planet do you live on as for as poverty be concern the pope should speak out against the corrupt government that keep people in poverty and ignorance think that america can pay to save the world miss the point,climate change sense discuss pope encyclical 
dubious theory well claim scientific community support theory patently false event pope imprimatur scientific theory unprecedented competence area lend credibility essentially prudential matter wrong trump approach russia reason catholic church time current current circumstance advise let president matter prudential matter relation nation country term suppress democracy planet live poverty concern pope speak corrupt government people poverty ignorance think america pay save world miss point,0
4,"Fake news...now she is lying. figures....she is making her millions and gosh darn, her detractors don't scare her, so there!!! She needs to shut up, herself.","Fake news...now she is lying. figures....she is making her millions and gosh darn, her detractors don't scare her, so there!!! She needs to shut up, herself.",fake news now she is lying figures she is making her millions and gosh darn her detractors do n't scare her so there she needs to shut up herself,fake news now she be lie figure she be make her million and gosh darn her detractor do not scare she so there she need to shut up herself,fake news lie figure make million gosh darn detractor scare need shut,1


In [38]:
df_check.isna().sum()

raw                    0
clean                  0
clean_pp               0
clean_pp_lemma         0
clean_pp_lemma_stop    0
toxic                  0
dtype: int64
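
The NaN check after reloading is worth keeping because a CSV round trip can introduce NaNs that were not in the frame before saving: an empty string is written as an empty field, which `pd.read_csv` reads back as NaN by default. A small sketch on made-up toy data, using an in-memory buffer instead of a file:

```python
import io

import pandas as pd

# toy data (made up): the second comment lost every token during cleaning
toy = pd.DataFrame({
    'raw': ['Some text!', '!!!'],
    'clean_pp_lemma_stop': ['text', ''],
    'toxic': [0, 1],
})

# write to an in-memory CSV and read it back with default settings
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# the empty string comes back as NaN
print(reloaded.isna().sum())

# with keep_default_na=False the empty field is kept as an empty string
buffer.seek(0)
reloaded_keep = pd.read_csv(buffer, keep_default_na=False)
print(reloaded_keep['clean_pp_lemma_stop'].tolist())
```

In this notebook the empty strings were converted to NaN and dropped before saving, which is consistent with the reloaded frame showing zero missing values.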