# Clean users and posts compiled datasets 

For both the users and posts datasets:
- Pare down features
- Drop nulls and deal with missing data
- Remove usernames and HTML artifacts
- Join users to posts dataset
- Export cleaned data as two separate CSVs: modeling (for train and test) and validation

In [1]:
import numpy as np
import pandas as pd
import re

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_colwidth', 500)

In [2]:
posts = pd.read_csv('data/posts_01_raw_sample.csv')
posts.shape

(100000, 38)

In [3]:
posts.head(3)

Unnamed: 0,comments,body,bodywithurls,createdAt,createdAtformatted,creator,datatype,depth,depthRaw,followers,following,hashtags,id,lastseents,links,media,posts,sensitive,shareLink,upvotes,urls,username,verified,article,impressions,preview,reposts,state,parent,color,commentDepth,controversy,downvotes,post,score,isPrimary,conversation,replyingTo
0,0,Possibly........\n\n•Chynaa.\n•Soros.\n•Globalists.\n•Deep State.\n\nAnd a whole host of brain washed Idiots - investing in their own future safety.,Possibly........\n\n•Chynaa.\n•Soros.\n•Globalists.\n•Deep State.\n\nAnd a whole host of brain washed Idiots - investing in their own future safety.\n,20200900000000.0,2020-09-01 18:05:07 UTC,e0ccf0acef0a43fa9ea7a447debdc781,comments,2.0,2.0,701.0,525.0,[],08511f2c61514e0f805e299e467f4727,2020-12-26T12:16:52.766453+00:00,[],168.0,1100.0,0.0,https://parler.com/comment/08511f2c61514e0f805e299e467f4727,0.0,[],Dd061973,0.0,,,,,,692baa94e4c845df829fa5b333a0e61b,#808080,1.0,0.0,0.0,19a9db6ce1c040f1accff10028d90cb8,0.0,0.0,,
1,0,Right!,Right!\n,20200720000000.0,2020-07-24 20:58:59 UTC,781e9ee94ab242f294627d69ee1e74ac,comments,1.0,1.0,10.0,12.0,[],52059b3edb194960b7f1db5fa577f2d9,2021-01-09T18:36:02.804212+00:00,[],0.0,18.0,0.0,https://parler.com/comment/52059b3edb194960b7f1db5fa577f2d9,0.0,[],AlisonHMcvay,0.0,,,,,,77528f6960b34bb691405135b27f9782,#a60303,0.0,0.0,0.0,77528f6960b34bb691405135b27f9782,0.0,1.0,,
2,0,Cuomo is an egotistical asshole. His day is coming.,Cuomo is an egotistical asshole. His day is coming.\n,20201130000000.0,2020-11-29 16:25:55 UTC,9c46ba5cdb7445b28d1e301ad873bb75,comments,1.0,1.0,3100.0,5700.0,[],5d3df500ce124a99bf93d15518b732ee,2021-01-09T16:02:03.575467+00:00,[],1.0,7900.0,0.0,https://parler.com/comment/5d3df500ce124a99bf93d15518b732ee,1.0,[],Mlaster206,0.0,,,,,,7c187094ac5c4ed2a408d6288f923b4e,#a60303,0.0,0.0,0.0,7c187094ac5c4ed2a408d6288f923b4e,1.0,1.0,,


## Drop columns

In [4]:
posts = posts.drop(columns=['bodywithurls', 'createdAt', 'color', 'shareLink', 'urls'])

In [5]:
# lowercase column names
posts.columns = posts.columns.str.lower()

## Clean up text with regex 

In [6]:
# remove @username mentions (raw string avoids the invalid "\@" escape warning)
posts['body'] = posts['body'].map(lambda x: re.sub(r"@[a-zA-Z0-9]*", ' ', str(x)))
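A quick check of what this pattern does on made-up strings may help. This is only a sketch: `strip_usernames` is a hypothetical helper, not part of the notebook, and the sample texts are invented. Two behaviours are worth knowing: the character class has no underscore, so part of a handle like `@jane_doe` is left behind, and `str(x)` turns a missing body into the literal text `nan`.

```python
import re

# Same pattern as the username cell, written as a raw string.
USERNAME_PATTERN = re.compile(r"@[a-zA-Z0-9]*")

def strip_usernames(text):
    """Replace each @mention with a single space (hypothetical helper)."""
    return USERNAME_PATTERN.sub(" ", str(text))

# Letters and digits after "@" are removed together with the "@".
print(repr(strip_usernames("hi @JohnDoe42!")))

# The match stops at the underscore, so "_doe" survives.
print(repr(strip_usernames("@jane_doe")))

# A missing value (NaN) is stringified rather than kept as a null.
print(repr(strip_usernames(float("nan"))))
```

If handles with underscores matter, a wider pattern such as `r"@\w+"` would be one option, assuming `\w` covers the characters that usernames are allowed to contain.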

In [7]:
# remove newline characters and HTML artifacts
posts['body'] = posts['body'].map(lambda x: re.sub("\n|\r|&amp;#x200B;|&amp;", ' ', str(x)))
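Each match is replaced by a space, so consecutive newlines or artifacts leave runs of spaces in the text. A minimal sketch of the same substitution followed by an extra whitespace-collapsing step; that extra step and the `clean_body` helper are additions for illustration, not something the notebook does, and the sample string is made up.

```python
import re

# Pattern copied from the cell above, written as a raw string.
ARTIFACTS = re.compile(r"\n|\r|&amp;#x200B;|&amp;")

def clean_body(text):
    """Replace artifacts with spaces, then collapse repeated spaces (hypothetical helper)."""
    no_artifacts = ARTIFACTS.sub(" ", str(text))
    # Extra step, not in the notebook: squeeze runs of spaces and trim the ends.
    return re.sub(r" {2,}", " ", no_artifacts).strip()

sample = "Line one\n\nLine two &amp; more&amp;#x200B;"
print(repr(clean_body(sample)))
```

The longer alternative `&amp;#x200B;` is listed before `&amp;` in the pattern, so it is tried first and the `#x200B;` tail is not left behind.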

## Remove rows where the body text contains any of the following
- parler/Parler
- Welcome/welcome
- Non-English languages (Arabic)
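No cell in this section filters on language. One possible approach, assuming "Arabic" here means any character from the Unicode Arabic block (U+0600 to U+06FF), is a character-range check. `contains_arabic` is a hypothetical helper and the sample strings are made up.

```python
import re

# Assumption: a post counts as Arabic if it has at least one
# character in the Unicode Arabic block (U+0600 to U+06FF).
ARABIC_CHARS = re.compile(r"[\u0600-\u06FF]")

def contains_arabic(text):
    """Return True if the text has any Arabic-block character (hypothetical helper)."""
    return bool(ARABIC_CHARS.search(str(text)))

samples = ["Right!", "مرحبا بالعالم", "mixed مرحبا text"]
flags = [contains_arabic(s) for s in samples]
print(flags)
```

On the real frame this could be applied as `posts[~posts['body'].map(contains_arabic)]`. It would not catch other non-English languages, such as the Spanish posts that appear later in the combined output.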

In [8]:
len(posts)

100000

In [9]:
# drop rows whose body contains 'parler' (case-sensitive)
posts = posts[~posts['body'].str.contains('parler')]

In [10]:
len(posts)

98737

In [11]:
# drop rows whose body contains 'Parler' (case-sensitive)
posts = posts[~posts['body'].str.contains('Parler')]

In [12]:
len(posts)

89751

In [13]:
# drop rows whose body contains 'welcome' (case-sensitive)
posts = posts[~posts['body'].str.contains('welcome')]

In [14]:
len(posts)

89593

In [15]:
# drop rows whose body contains 'Welcome' (case-sensitive)
posts = posts[~posts['body'].str.contains('Welcome')]

In [16]:
len(posts)

87788
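The four keyword filters could be folded into a single case-insensitive match. A minimal sketch on made-up rows; the `toy` frame is illustrative only and stands in for `posts`.

```python
import pandas as pd

# Made-up rows standing in for posts['body'].
toy = pd.DataFrame({
    "body": [
        "Welcome to Parler!",
        "join parler today",
        "Right!",
        "You are welcome",
        "Election news",
    ]
})

# One regex with alternation; case=False matches any capitalisation.
mask = toy["body"].str.contains("parler|welcome", case=False, regex=True)
filtered = toy[~mask]
print(filtered["body"].tolist())
```

Because `case=False` also matches spellings such as `PARLER` or `WELCOME`, which the four case-sensitive filters leave in, the row count on the real data may differ slightly from the counts shown above.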

In [17]:
posts.head(2)

Unnamed: 0,comments,body,createdatformatted,creator,datatype,depth,depthraw,followers,following,hashtags,id,lastseents,links,media,posts,sensitive,upvotes,username,verified,article,impressions,preview,reposts,state,parent,commentdepth,controversy,downvotes,post,score,isprimary,conversation,replyingto
0,0,Possibly........ •Chynaa. •Soros. •Globalists. •Deep State. And a whole host of brain washed Idiots - investing in their own future safety.,2020-09-01 18:05:07 UTC,e0ccf0acef0a43fa9ea7a447debdc781,comments,2.0,2.0,701.0,525.0,[],08511f2c61514e0f805e299e467f4727,2020-12-26T12:16:52.766453+00:00,[],168.0,1100.0,0.0,0.0,Dd061973,0.0,,,,,,692baa94e4c845df829fa5b333a0e61b,1.0,0.0,0.0,19a9db6ce1c040f1accff10028d90cb8,0.0,0.0,,
1,0,Right!,2020-07-24 20:58:59 UTC,781e9ee94ab242f294627d69ee1e74ac,comments,1.0,1.0,10.0,12.0,[],52059b3edb194960b7f1db5fa577f2d9,2021-01-09T18:36:02.804212+00:00,[],0.0,18.0,0.0,0.0,AlisonHMcvay,0.0,,,,,,77528f6960b34bb691405135b27f9782,0.0,0.0,0.0,77528f6960b34bb691405135b27f9782,0.0,1.0,,


In [18]:
(posts.isna().sum()/len(posts) * 100).sort_values()

comments               0.000000
verified               0.000000
username               0.000000
sensitive              0.000000
posts                  0.000000
media                  0.000000
links                  0.000000
lastseents             0.000000
id                     0.000000
upvotes                0.000000
following              0.000000
followers              0.000000
depthraw               0.000000
depth                  0.000000
datatype               0.000000
creator                0.000000
createdatformatted     0.000000
body                   0.000000
hashtags               0.000000
parent                12.720417
score                 22.778740
post                  22.778740
downvotes             22.778740
controversy           22.778740
commentdepth          22.778740
isprimary             33.623046
preview               77.221260
reposts               77.221260
impressions           77.221260
article               77.983323
state                 78.821707
conversa

In [19]:
posts = posts.drop(columns=['conversation', 'replyingto'])
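The cell above drops two columns by name. A threshold-based variant is sketched below on a made-up frame; the 90% cutoff and the column names in `toy` are assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame: one complete column, one partly missing, one fully missing.
toy = pd.DataFrame({
    "body": ["a", "b", "c", "d"],
    "score": [1.0, np.nan, 3.0, 4.0],
    "mostly_missing": [np.nan] * 4,
})

# isna().mean() is the null fraction per column, same as isna().sum() / len(df).
null_pct = toy.isna().mean() * 100

threshold = 90  # hypothetical cutoff, in percent
to_drop = null_pct[null_pct > threshold].index.tolist()
trimmed = toy.drop(columns=to_drop)
print(to_drop, list(trimmed.columns))
```

Dropping by name, as the notebook does, keeps the choice explicit; a threshold is more convenient when the column list changes between samples.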

## Export cleaned posts csv

In [20]:
posts.to_csv('data/posts_01_cleaned_sample.csv', index=False)

## Create a per-user table of combined body text

In [21]:
posts.shape

(87788, 31)

In [22]:
users_posts = posts[['body', 'username']]

In [23]:
users_posts.shape

(87788, 2)

In [24]:
users_posts['username'].value_counts().sort_values().tail(25)

Ot00              29
TTexasRepublic    29
KimmyKesler       29
DC11546736        29
NotJustTrying     29
1DRACARYS         30
baconguns         31
Jakebe            32
arhumes3          33
SamRiddle         33
Tinagcrawley      33
Simonbrennen      33
moon52            33
91q               34
DancrDave         37
ThomasFox         39
Oldmanrant        40
Cobrarick98       41
Mikewin           43
chucknellis       43
jenniev101        48
Klonokid          49
doutingthomas1    50
LibertyElaine     54
Gayle7753         55
Name: username, dtype: int64

In [25]:
# code for combining rows https://stackoverflow.com/questions/36392735/how-to-combine-multiple-rows-into-a-single-row-with-pandas
combined_posts = users_posts.groupby('username')['body'].apply(' '.join).reset_index()
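To see what the groupby and join produce, here is the same call on a small made-up frame (the usernames and texts in `toy` are invented):

```python
import pandas as pd

# Made-up posts: "ann" has two rows, "bob" has one.
toy = pd.DataFrame({
    "username": ["ann", "bob", "ann"],
    "body": ["first post", "only post", "second post"],
})

# Concatenate each user's posts into one space-separated string.
combined = toy.groupby("username")["body"].apply(" ".join).reset_index()
print(combined)
```

The result has one row per username, which is why `value_counts()` on the combined table returns 1 for every user. Note that `' '.join` needs every `body` value to be a string; that holds here because the earlier cleaning cells pass each value through `str(x)`.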

In [26]:
combined_posts.sample(10)

Unnamed: 0,username,body
55954,laderechadiario,"🇧🇷 | #Brasil: de qué se trata el proyecto de ley que presentó el diputado federal Eduardo Bolsonaro, hijo del presidente Jair Bolsonaro, para combatir el nazismo y el comunismo y asegurar un futuro libre de autoritarismos en el país. 🇻🇪 | #Venezuela: el dictador Nicolás Maduro anunció el pasado sábado sus intenciones de comprar misiles al régimen iraní, poniendo en alerta a los países vecinos, quienes están preparando sus tropas ante un posible conflicto en la zona."
2869,Apothecariostore,Leftists are ‘Ridic’
46920,TammyBarrier2,Disregard for the nation is how we ended up here. The deep state players could not let it be a fair election but have no fear the schemes will all be brought to light
19155,Henry74,🤣
31159,Maury19763,This election is not Trump vs Biden. It is Pro America vs Anti America.
101,1751Migrant,"When he shares the advice with the court, GAME OVER. It’s not about gender, it’s about emotion, which you make abundantly clear. Reading is fundamental. You should try it."
20392,Irvincooper75,"YES! WHORONO IS AN IDIOT! They will destroy our country and turn us into another Hong Kong. A 5year old child would make more sense then any CNN, MSNBC reporter ever would and would put out the truth. The MSM reporters do not know what the word TRUTH even means."
43595,Shawndudka,You know what they say about snitches
57093,privitae,Lol
52304,Wmoseley48,?????


In [27]:
combined_posts['username'].value_counts()

-generaldebellis    1
Ranger503           1
RangerLiebowitz     1
RangerWay           1
Rangerhondo         1
                   ..
Horny3000           1
Hornykraken         1
HorseBranch         1
Horsefanextreme     1
zundel              1
Name: username, Length: 58797, dtype: int64

In [28]:
combined_posts.loc[combined_posts['username'] == 'John']

Unnamed: 0,username,body
23393,John,"Nobody owns me. They do this every week. No that was a photoshopped picture.... Fox never confirmed. So if I promise to create 19,000 jobs can I have an extremely low interest loan of 5.5b and give myself a massive bonus? Should we keep giving them this money for forever? Maybe we should have the government pay every business’s expenses for perpetuity so nobody looses a job again? Basically. Probably Most of it."


In [29]:
combined_posts.loc[combined_posts['username'] == 'ronpaul']

Unnamed: 0,username,body


In [30]:
combined_posts.loc[combined_posts['username'] == 'TeamTrump']

Unnamed: 0,username,body


In [31]:
combined_posts.loc[combined_posts['username'] == 'FetchingFeline']

Unnamed: 0,username,body
15577,FetchingFeline,"I’ve seen a few men on here crying that they were scammed by this account or that one, that their bank accounts or identity’s were stolen. Well now, how did that happen ? You gave your bank account info to some picture of boobs on here? You deserve what you got then.... I feel ya.... it’s getting real. Why yes, yes I am ! ♥️ ....China. Notice how, with ONE tweet calling out #ElijahCummings, was able to completely shut down any further media coverage of Mueller’s embarrassing testimony?..."


In [32]:
combined_posts.loc[combined_posts['username'] == 'ThomasFox']

Unnamed: 0,username,body
48459,ThomasFox,"🚨⭐️🇺🇸President Trump's Approval with Black Voters Soars to 46% After Debate Win Over Joe ""Predator"" Biden🇺🇸⭐️🚨 Virginia: 1,000+ Voters Receive Two Absentee Ballots Dem Sen. Blumenthal: I Won't Meet with Barrett, 'It Would Treat This Process as Legitimate' Miami lifts COVID-19 curfew after legal win by local jiggle joint Yep Hypocrisy Biased and sticking to their propaganda RNC vs. dnc Facebook is paying people to shut down their accounts ahead of the election - The Verge Coronavirus vaccine ..."


In [33]:
combined_posts.loc[combined_posts['username'] == 'WashTimesOpEd']

Unnamed: 0,username,body
51575,WashTimesOpEd,"Kay Coles James: “Lessons from past elections were lost on many, and now the mistakes — made more challenging in 2020 by mail-in ballots — have multiplied across the nation as a result.” #WashTimesOpEd"


In [35]:
combined_posts.loc[combined_posts['username'] == 'Gayle7753']

Unnamed: 0,username,body
16965,Gayle7753,REPORT: N. Korea: Trump's suggestion to meet Kim at DMZ is 'very interesting' Joe Buck and Troy Aikman caught in hot mic moment ridiculing the military flyover before NFL game - TheBlaze BREAKING: Federal Reserve approves its first rate cut since 2008 LGBT TERROR: Transgender 'Lady' Viciously Beats Elderly Christian Preacher on NYC Subway - Big League Politics CBN NEWS EXCLUSIVE - Mitch McConnell on Reshaping the Courts and Pro-Life Legislation: 'Leave No Vacancy Behind' | CBN News San Fran ...


In [36]:
combined_posts.loc[combined_posts['username'] == 'LibertyElaine']

Unnamed: 0,username,body
28046,LibertyElaine,"AMEN Yes. Biden role models: BRZEZINSKI and BYRD GENOCIDAL and RACIST DICTATOR AMEN COMMUNISM Poop Not a REAL POPE If you love someone TELL THEM TO TAKE ▪︎HCQ AND ▪︎ ZINC IF THEY GET SICK with COVID-19 If one pharmacy denies your doctors order, GO TO ANOTHER pharmacy. GET DR. SIMONE GOLD's map that shows where you can get HCQ Americas Front Line Doctors. Dr Simone Gold DUMMIES MOCKINGBIRDS Losers COVID-19 AND THE EMPERORS BARTENDER for Soros. SENSELESS drivel. SETH RIC..."


In [37]:
combined_posts.loc[combined_posts['username'] == 'doutingthomas1']

Unnamed: 0,username,body
54457,doutingthomas1,AAA test finds some automotive pedestrian detection systems don't work at night and need improvement Explore the Fox News apps that are right for you at I wonder what kind a Parfum RBG WEARS IS SHE INDEED IS STILL AMONG THE LIVING? AuDe Corpse? 😬🍸 👍👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏 Trump announces increased tariffs on China in latest trade war salvo Explore the Fox News apps that are right for you at 'Shark Tank' star Daymond John tried to sell Florida N95 masks at an inflated price: report Explo...


In [38]:
combined_posts.loc[combined_posts['username'] == 'Klonokid']

Unnamed: 0,username,body
26401,Klonokid,"US charges 4 Chinese military members in Equifax breach One of busiest roads in Baghdad was blocked by Kataib Hizbulla members, holding pictures of Iranian supreme leader Khamenei to intimidate drivers passing by #Iraq via Merkel on second wave: Europe must show it has learned its lesson Riyadh Daily Watch: Pete Buttigieg Runs away from Reporters' Questions in Spin Room | Breitbart Sharad Pawar, Nephew Ajit Pawar Named In Rs 25,000-Crore Money Laundering Case Ahead Of Maharashtra Election..."


LibertyElaine and doutingthomas1 look like they might be bots, judging by their combined post text above.

In [39]:
combined_posts.to_csv('data/posts_by_use_sample.csv', index=False)