# Text preprocessing

Steps:
- Remove usernames and HTML artifacts
- Split into train, test, and validation sets
- Create a word count matrix and remove stop words with CountVectorizer (demonstration only; to be repeated with the same random state for each step)
- Export the cleaned train, test, and validation sets and the count-vectorized matrix
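
The first step is not implemented in this notebook yet. A minimal sketch of what it could look like, assuming usernames appear as `@mentions` and HTML artifacts are leftover tags and entities (the patterns and the `clean_post` helper are hypothetical, not taken from the dataset documentation):

```python
import html
import re

# Hypothetical patterns; adjust once the real artifacts have been inspected.
USERNAME_RE = re.compile(r'@\w+')       # @mentions such as @some_user
HTML_TAG_RE = re.compile(r'<[^>]+>')    # leftover tags such as <br>
WHITESPACE_RE = re.compile(r'\s+')      # runs of spaces, tabs, newlines


def clean_post(text):
    """Strip HTML entities and tags, drop @usernames, collapse whitespace."""
    text = html.unescape(text)          # &amp; -> &, &quot; -> "
    text = HTML_TAG_RE.sub(' ', text)
    text = USERNAME_RE.sub(' ', text)
    return WHITESPACE_RE.sub(' ', text).strip()


example = clean_post('@some_user Ok, great,<br>now be like Pa! &amp; more\n')
```

On the real data this would be applied column-wise, for example `df['body'] = df['body'].map(clean_post)`, after missing bodies have been dropped.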

In [27]:
import numpy as np 
import pandas as pd 
import re 

from sklearn.feature_extraction.text import CountVectorizer

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_colwidth', 500)

## Load text data

In [28]:
df = pd.read_csv('../data/posts_01_raw_sample.csv')

In [29]:
df.head(2)

Unnamed: 0,comments,body,bodywithurls,createdAt,createdAtformatted,creator,datatype,depth,depthRaw,followers,following,hashtags,id,lastseents,links,media,posts,sensitive,shareLink,upvotes,urls,username,verified,article,impressions,preview,reposts,state,parent,color,commentDepth,controversy,downvotes,post,score,isPrimary,conversation,replyingTo
0,1,,\n,20201110000000.0,2020-11-05 17:26:57 UTC,f4a01e9315834d20be428c0492425a00,posts,1.0,1.0,24000.0,43000.0,[],4aa59d1fb1464508a1d1f7e2bbcdb733,2020-12-28T02:13:32.167350+00:00,[],8.0,33000.0,0.0,,34.0,[],RealCinders2,0.0,0.0,2100.0,,29.0,4.0,6fcdf40cf0d84327b551612df233625f,,,,,,,,,
1,0,"Ok, great, now be like Pa!","Ok, great, now be like Pa!\n",20201130000000.0,2020-11-30 04:30:22 UTC,cbee6ccfb5ef4b81b01a72f8079b513a,posts,1.0,1.0,48.0,36.0,[],4fd51eb8a15a45eab6a1ba0030c9d340,2020-12-24T03:43:41.332392+00:00,[],0.0,71.0,0.0,https://parler.com/post/4fd51eb8a15a45eab6a1ba0030c9d340,0.0,[],Kmsclyde,0.0,0.0,30.0,"Ok, great, now be like Pa!",0.0,4.0,bc708c931d9547edb910513d7f24ee2e,,,,,,,,,


In [30]:
df = df[['body']]

In [31]:
# TODO: move this dropna into the import step (on the entire df), then REMOVE it here
df = df.dropna().reset_index(drop=True)

In [32]:
df.tail(20)

Unnamed: 0,body
57996,"Yeah, the DNC is gonna sequester the senile old dementia patient. Then the old bait and switch comes...."
57997,Ik dacht even dat Omroep Zwart al in de lucht was naar het blijkt gelukkig Opsporing Verzocht te zijn.
57998,"Big question...I understand the concept of ""care, custody & control"" but, WHO ACTUALLY OWNS ALL OF THE SCHOOL BUILDINGS THAT THESE union members INTEND TO BLOCK ACCESS TO?????????????????????"
57999,At some point the Lord is going to have to catch up to the number of deaths that Hillary and Bill are responsible for😂I lost count at 500￼
58000,Welcome. Great to have you. Follow us for your favorite commentary.
58001,This is inevitable and VERY soon.
58002,💕✝️🙏🏻✝️💕
58003,❤️❤️
58004,US20090062677A1 - Method of Recording and Saving of Human Soul for Human Immortality and Installation for it - Google Patents
58005,Typical


In [35]:
df.tail(15)

Unnamed: 0,body
58001,This is inevitable and VERY soon.
58002,💕✝️🙏🏻✝️💕
58003,❤️❤️
58004,US20090062677A1 - Method of Recording and Saving of Human Soul for Human Immortality and Installation for it - Google Patents
58005,Typical
58006,Glad to see you here on Parler where free speech is actually alive and well. Looking forward to mixing it up with you here and keeping the truth alive in the face of the constant bias we face. Let's Go!
58007,He escaped from a Lovecraft movie.
58008,Amen.
58009,YES!!! It’s about TIME someone reigned in that little fascist DICK-tator Cuomo!
58010,What? This punk hasn’t had his ass kicked yet?


In [36]:
df.shape

(58016, 1)

In [37]:
df.isna().sum()

body    0
dtype: int64

## CountVectorize and remove stop words

In [38]:
default_words = list(CountVectorizer(stop_words='english').get_stop_words())

# fragments left behind when the default tokenizer splits contractions (e.g. "you've" -> "you", "ve")
contractions = ['ve', 're']

# add custom lists to default sklearn 'english' stopwords 
custom_stopwords = default_words + contractions
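
To see where fragments like `ve` and `re` come from, the analyzer can be run on a single string. This is a small sketch with made-up sentences; it also suggests why tokens such as `youre` and `youve` appear in the vocabulary further down, assuming those posts use curly apostrophes (which `strip_accents='ascii'` drops):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same preprocessing as the vectorizer in this notebook, minus stop words.
analyzer = CountVectorizer(strip_accents='ascii').build_analyzer()

# Straight apostrophe: the default token pattern splits on it.
straight = analyzer("you've seen they're here")

# Curly apostrophe (U+2019): removed as non-ASCII, so the halves are joined.
curly = analyzer("you\u2019ve seen they\u2019re here")

print(straight)
print(curly)
```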

In [39]:
X = df['body']

In [40]:
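cv = CountVectorizer(
    min_df=10,                    # keep n-grams that appear in at least 10 posts
    max_df=.5,                    # drop n-grams that appear in more than 50% of posts
    stop_words=custom_stopwords,
    ngram_range=(1, 3),           # unigrams, bigrams, and trigrams
    strip_accents='ascii',
)
cv_text = cv.fit_transform(X)     # learn the vocabulary and count in one pass

cv_df = pd.DataFrame(cv_text.toarray(), columns=cv.get_feature_names_out())
cv_df.head(2)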

Unnamed: 0,00,000,000 000,000 people,000 votes,06,08,10,10 000,10 million,10 years,100,100 000,100 correct,100 million,1000,11,12,12 years,13,14,14th,15,15 years,150,16,17,1776,18,19,1984,1a,1a 2a,1st,1st amendment,20,20 years,200,2000,2008,2015,2016,2016 election,2017,2018,2019,202,2020,2020 election,2020 presidential,...,yea,yeah,yeah im,yeah right,year,year old,years,years ago,years old,years trump,yelling,yellow,yep,yes,yes agree,yes did,yes sir,yes yes,yesterday,yikes,yo,york,york city,york post,york times,yorkers,youd,youll,young,young man,younger,youre,youre doing,youre going,youre just,youre right,youth,youtube,youtube page,youtube page videos,youve,youve got,yr,yr old,yrs,yup,zero,zombies,zone,zuckerberg
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [46]:
cv_df.columns[3050:3100]

Index(['holdtheline', 'hole', 'holes', 'holiday', 'holidays', 'hollywood',
       'holocaust', 'holy', 'holy spirit', 'home', 'homeless', 'homes',
       'homework', 'honest', 'honestly', 'honesty', 'honey', 'hong',
       'hong kong', 'honor', 'hook', 'hope', 'hope does', 'hope enjoy',
       'hope enjoy new', 'hope pray', 'hope right', 'hope true', 'hope trump',
       'hope youre', 'hopefully', 'hopes', 'hoping', 'horrible', 'horrific',
       'horror', 'horse', 'horses', 'hospital', 'hospitals', 'host', 'hot',
       'hotel', 'hour', 'hours', 'house', 'house representatives',
       'house senate', 'houses', 'housing'],
      dtype='object')

In [47]:
[each for each in cv_df.columns if 'parler' in each]

['aboard parler',
 'aboard parler just',
 'best parler',
 'best parler experience',
 'best rob parlerconcierge',
 'check hashtag parlerconcierge',
 'check parler',
 'check parler youtube',
 'choosing parler',
 'choosing parler free',
 'congratulations choosing parler',
 'ensure best parler',
 'espanol tambien parlerconcierge',
 'glad parler',
 'glad parler free',
 'glad parler truly',
 'hashtag parlerconcierge',
 'hashtag parlerconcierge learn',
 'join parler',
 'joined parler',
 'joined parler looking',
 'just joined parler',
 'know questions parlerconcierge',
 'like parler',
 'like parler accepts',
 'looking parler',
 'looking parler tips',
 'newuser truefreespeech parlerconcierge',
 'parler',
 'parler accepts',
 'parler accepts right',
 'parler comments',
 'parler comments leave',
 'parler experience',
 'parler experience puedo',
 'parler free',
 'parler free speech',
 'parler help',
 'parler help make',
 'parler hope',
 'parler hope enjoy',
 'parler just',
 'parler just wanted',
 '

In [50]:
cv_df.columns[5000:5050]

Index(['preserve', 'presidency', 'president', 'president country',
       'president donald', 'president donald trump', 'president elect',
       'president going', 'president history', 'president just',
       'president lifetime', 'president president', 'president trump',
       'president trump president', 'president trump won', 'president trumps',
       'president united', 'president united states', 'president years',
       'presidente', 'presidential', 'presidential election', 'presidents',
       'presidenttrump', 'press', 'press conference', 'pressure', 'pretend',
       'pretending', 'pretty', 'pretty good', 'pretty sure', 'prevail',
       'prevent', 'preventing', 'previous', 'price', 'priceless', 'prices',
       'prick', 'pride', 'priest', 'primary', 'prime', 'prince', 'principle',
       'principles', 'print', 'printed', 'printing'],
      dtype='object')

In [51]:
cv_df.shape

(58016, 7297)

(58016, 5220)
(58016, 51194)


#### After the initial CountVectorizer run, decided to go back to the original compiled raw dataset and:
- remove all posts containing the word "parler" (see parler_bot below for examples of n-grams containing "parler")
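
A sketch of how that filter might look, assuming a case-insensitive substring match on `body` is what is wanted (the toy `posts` frame stands in for the raw dataset; note that this also drops genuine posts that merely mention Parler, not just the automated welcome messages):

```python
import pandas as pd

# Toy stand-in for the compiled raw dataset.
posts = pd.DataFrame({'body': [
    'Welcome aboard Parler! Check out #parlerconcierge',
    'This is inevitable and VERY soon.',
    'Just joined PARLER, looking forward to it',
    'Typical',
]})

# Case-insensitive match; na=False treats missing bodies as non-matches.
mentions_parler = posts['body'].str.contains('parler', case=False, na=False)

filtered = posts.loc[~mentions_parler].reset_index(drop=True)
print(filtered)
```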

In [None]:
parler_bot = ['aboard parler', 'aboard parler just', 'best parler', 'best parler experience', 'best rob parlerconcierge', 'check hashtag parlerconcierge', 'check parler', 'check parler youtube', 'choosing parler', 'choosing parler free', 'congratulations choosing parler', 'ensure best parler', 'espanol tambien parlerconcierge', 'glad parler', 'glad parler free', 'glad parler truly', 'hashtag parlerconcierge', 'hashtag parlerconcierge learn', 'joined parler', 'joined parler looking', 'just joined parler', 'know questions parlerconcierge', 'like parler', 'like parler accepts', 'looking parler', 'looking parler tips', 'newuser truefreespeech parlerconcierge', 'parler', 'parler accepts', 'parler accepts right', 'parler comments', 'parler comments leave', 'parler experience', 'parler experience puedo', 'parler free', 'parler free speech', 'parler help', 'parler help make', 'parler hope', 'parler hope enjoy', 'parler just', 'parler just wanted', 'parler looking', 'parler looking forward', 'parler parlerusa', 'parler pro', 'parler pro tips', 'parler step', 'parler step taking', 'parler tips', 'parler tips tos', 'parler truly', 'parler truly trump', 'parler youtube', 'parler youtube page', 'parler101', 'parler101 channel', 'parler101 channel good', 'parlerconcierge', 'parlerconcierge learn', 'parlerconcierge learn ropes', 'parlerksa', 'parleruk', 'parlerusa', 'people like parler', 'people looking parler', 'places parler101', 'places parler101 channel', 'questions parlerconcierge', 'rob parlerconcierge', 'share parler', 'share parler pro', 'tambien parlerconcierge', 'tos check parler', 'truefreespeech parlerconcierge', 'videos places parler101', 'wanted share parler', 'welcome aboard parler', 'welcome glad parler', 'welcome parler', 'welcome parler comments', 'welcome parler help', 'welcome parler hope', 'welcome parler step']

## Check top n-grams and refine stop words 

In [22]:
# print(default_words)

In [23]:
# remove numbers? 
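
If numbers are removed, one option is a custom `token_pattern` that only accepts tokens made of non-digit word characters. This is a sketch, not a decision: it drops any token containing a digit, so mixed tokens such as `1a`, `14th`, and `parler101` would disappear along with `2020` and `000`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tokens of 2+ word characters, none of which may be a digit.
letters_only = r'(?u)\b[^\W\d]{2,}\b'

cv_no_digits = CountVectorizer(token_pattern=letters_only)
cv_no_digits.fit(['10 000 votes in 2020', 'the 1st amendment and 1776'])

vocab = sorted(cv_no_digits.vocabulary_)
print(vocab)
```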

## Split into validation vs modeling (modeling will be further split into train and test)
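
This section is still empty. A possible two-stage split is sketched below; the 20% validation share, the 75/25 train/test ratio, and the seed value are all placeholders rather than choices made in this notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42  # placeholder seed; reuse the same value in every step

# Toy stand-in for the cleaned posts.
posts = pd.DataFrame({'body': [f'post {i}' for i in range(100)]})

# Stage 1: hold out a validation set that is never used for model fitting.
modeling, validation = train_test_split(
    posts, test_size=0.2, random_state=RANDOM_STATE
)

# Stage 2: split the remaining modeling set into train and test.
train, test = train_test_split(
    modeling, test_size=0.25, random_state=RANDOM_STATE
)
print(len(train), len(test), len(validation))
```

With this layout, CountVectorizer would be fit on `train` only and then used to transform `test` and `validation`, so that vocabulary and document-frequency cutoffs do not leak from the held-out sets.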

## Export cleaned validation and modeling text datasets

In [13]:
# df.to_csv('../data/posts_01_cleaned_sample.csv', index=False)