# Cleaning and Feature Engineering:

This section cleans the Submissions and Comments datasets: it drops unnecessary columns, generates binary indicator variables, checks for duplicate rows, and merges the r/conspiracytheories and r/science data into single Submission and Comment CSVs for modeling.

## Subreddit Submissions: 

In [390]:
# Import Libraries:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import plotly as iplot
from nltk.stem import WordNetLemmatizer
import matplotlib.pyplot as plt
import seaborn as sns

In [319]:
# Read in data:

conspire = pd.read_csv('../data/conspire_pull_submissions.csv')
science = pd.read_csv('../data/science_pull_submissions.csv')

In [320]:
# Merge into single dataframe
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):

df_subs = pd.concat([conspire, science])

In [321]:
df_subs.shape

(24000, 5)

In [322]:
df_subs.head()

,Unnamed: 0,created_utc,subreddit,selftext,title
0,0,1587078122,conspiracytheories,Who else has noticed how the baby is being hit...,The coronavirus
1,1,1587076875,conspiracytheories,We are all pawns in this political game. Relea...,"COVID-19 Power, Control, and Profit!"
2,2,1587075153,conspiracytheories,,Dr. Andrew Kaufman disputes COVID19
3,3,1587074999,conspiracytheories,[removed],Do someone remember an suppost radio transmiss...
4,4,1587074240,conspiracytheories,"Hey frendos, \n\nSomeone close to me has asser...",Question: Bill Gates Malaria Vaccine Mutates A...


In [323]:
# Drop Unnamed: 0 column:

df_subs = df_subs.drop(columns='Unnamed: 0')

In [324]:
# Drop 'selftext' column due to high volume of missing values:

df_subs = df_subs.drop(columns='selftext')
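
Before dropping a column for missingness, it can help to quantify how much of it is unusable. The following is a minimal sketch on a toy frame (the rows are made up for illustration, not taken from the project data). Counting Reddit's `[removed]` placeholder as missing is an assumption about how "missing" might be defined here.

```python
import pandas as pd

# Toy stand-in for df_subs; values are illustrative only.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "selftext": ["some text", None, "[removed]", None],
})

# Treat NaN and the '[removed]' placeholder as unusable text.
unusable = toy["selftext"].isna() | toy["selftext"].eq("[removed]")
share_unusable = unusable.mean()
print(f"{share_unusable:.0%} of selftext values are unusable")
```

Running the same two lines on `df_subs` before the drop would give the actual share for this dataset.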

### Generate Indicator Variable for Subreddit Source (Submissions):

In [325]:
# Map Conspiracy Theory Subreddit to 0, Science to 1:

df_subs['subreddit'] = df_subs['subreddit'].map({'conspiracytheories': 0,'science': 1})
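
`Series.map` with a dict returns `NaN` for any value that is not a key, so a quick check for unmapped labels can catch spelling or casing differences. A small sketch with made-up labels (the `"Science"` row is a deliberately mismatched example, not something observed in the data):

```python
import pandas as pd

toy = pd.DataFrame({"subreddit": ["conspiracytheories", "science", "Science"]})
label_map = {"conspiracytheories": 0, "science": 1}

encoded = toy["subreddit"].map(label_map)

# Values missing from the dict become NaN, so count them.
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)
```

If the count is zero on the real frame, the column can safely be treated as a binary indicator.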

## Clean Subreddit Comments: 

In [326]:
# Read in data:

conspire_comments = pd.read_csv('../data/conspire_pull_comments.csv')
science_comments = pd.read_csv('../data/science_pull_comments.csv')

In [327]:
# Merge into single dataframe
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):

df_comments = pd.concat([conspire_comments, science_comments])

In [328]:
# Drop Unnamed: 0 column:

df_comments = df_comments.drop(columns='Unnamed: 0')

### Generate Indicator Variable for Subreddit Source (Comments):

In [329]:
# Map Conspiracy Theory Subreddit to 0, Science to 1:

df_comments['subreddit'] = df_comments['subreddit'].map({'conspiracytheories': 0,'science': 1})

In [330]:
df_comments.shape

(50000, 3)

In [331]:
df_comments.head()

,created_utc,subreddit,body
0,1587081288,0,"Trump was part of Epstein's ring too, he's the..."
1,1587081284,0,If china was going to cripple us with a virus ...
2,1587081205,0,Damn...exactly
3,1587081197,0,Only if you’re not going to explain what you t...
4,1587081171,0,The pollution levels have gone down because we...


In [332]:
df_comments['body'].value_counts()

# Comments that have been removed and deleted will be dropped.
# There are some bot posts that may require later attention, 
# but ~200 among ~33,000 is probably acceptable.

[removed]    8099
[deleted]
...

In [333]:
# Drop '[removed]' comments:

df_comments = df_comments.drop(df_comments[df_comments.body == '[removed]'].index)

In [334]:
# Drop '[deleted]' comments:

df_comments = df_comments.drop(df_comments[df_comments.body == '[deleted]'].index)

In [335]:
# Remaining comments:

df_comments.shape

(33672, 3)
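
One thing worth checking here: the merged frame keeps each source file's original index (the `info()` output further down shows 33672 entries with labels running from 0 to 24999), so index labels repeat across the two subreddits. With repeated labels, `drop(...index)` removes every row that carries a flagged label, not only the flagged row itself. The toy example below (made-up rows) sketches the difference against a boolean mask. Whether this affected the counts above would need to be confirmed on the real data.

```python
import pandas as pd

a = pd.DataFrame({"body": ["[removed]", "keep me"]})    # index 0, 1
b = pd.DataFrame({"body": ["also keep", "[removed]"]})  # index 0, 1
combined = pd.concat([a, b])                            # index 0, 1, 0, 1

# Label-based drop: removes every row whose label matches a flagged row.
by_label = combined.drop(combined[combined["body"] == "[removed]"].index)

# Boolean mask: removes only the flagged rows themselves.
by_mask = combined[~combined["body"].isin(["[removed]", "[deleted]"])]

print(len(by_label), len(by_mask))
```

Passing `ignore_index=True` to `pd.concat` is another way to avoid repeated labels, at the cost of losing the original row numbers.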

## Drop Duplicates:

In [336]:
# Drop duplicates from comments
# (drop_duplicates() returns a new DataFrame, so the result is assigned back).
# Note: the validation output below still reports 33672 rows,
# the same count as before this cell.

df_comments = df_comments.drop_duplicates()

In [337]:
# Drop duplicates from Submissions
# (drop_duplicates() returns a new DataFrame, so the result is assigned back).
# Note: the validation output below still reports 24000 rows,
# the same count as before this cell.

df_subs = df_subs.drop_duplicates()
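
`DataFrame.drop_duplicates()` compares all columns by default, so two comments with identical text but different timestamps are not treated as duplicates. A short sketch on made-up rows contrasts the default with a `subset` restricted to the label and text columns. Which definition suits the modeling step is a judgment call, not something the source settles.

```python
import pandas as pd

toy = pd.DataFrame({
    "created_utc": [1, 1, 2, 3],
    "subreddit": [0, 0, 0, 1],
    "body": ["same", "same", "same", "other"],
})

# Exact-row duplicates: every column must match.
exact = toy.drop_duplicates()

# Duplicates judged on label and text only, ignoring the timestamp.
by_text = toy.drop_duplicates(subset=["subreddit", "body"])

print(len(toy), len(exact), len(by_text))
```

The `subset` variant is the one that would catch repeated bot posts made at different times.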

### Validate before Export:

In [338]:
df_subs.head()

,created_utc,subreddit,title
0,1587078122,0,The coronavirus
1,1587076875,0,"COVID-19 Power, Control, and Profit!"
2,1587075153,0,Dr. Andrew Kaufman disputes COVID19
3,1587074999,0,Do someone remember an suppost radio transmiss...
4,1587074240,0,Question: Bill Gates Malaria Vaccine Mutates A...


In [339]:
df_comments.head()

,created_utc,subreddit,body
0,1587081288,0,"Trump was part of Epstein's ring too, he's the..."
1,1587081284,0,If china was going to cripple us with a virus ...
2,1587081205,0,Damn...exactly
3,1587081197,0,Only if you’re not going to explain what you t...
4,1587081171,0,The pollution levels have gone down because we...


In [309]:
df_subs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24000 entries, 0 to 11999
Data columns (total 3 columns):
created_utc    24000 non-null int64
subreddit      24000 non-null int64
title          24000 non-null object
dtypes: int64(2), object(1)
memory usage: 750.0+ KB


In [310]:
df_comments.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33672 entries, 0 to 24999
Data columns (total 3 columns):
created_utc    33672 non-null int64
subreddit      33672 non-null int64
body           33672 non-null object
dtypes: int64(2), object(1)
memory usage: 1.0+ MB


In [317]:
df_subs['subreddit'].value_counts()

1    12000
0    12000
Name: subreddit, dtype: int64

In [316]:
df_comments['subreddit'].value_counts()

1    16836
0    16836
Name: subreddit, dtype: int64
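
The manual checks above could also be gathered into one helper so they run the same way on both frames. This is a sketch only: `validate` is a hypothetical function, not part of the original notebook, and the toy rows are made up.

```python
import pandas as pd

def validate(df, text_col):
    """Return basic pre-export checks as a dict (hypothetical helper)."""
    placeholders = ["[removed]", "[deleted]"]
    return {
        "no_nulls": not df.isna().any().any(),
        "binary_label": set(df["subreddit"].unique()) <= {0, 1},
        "no_placeholders": not df[text_col].isin(placeholders).any(),
        "class_counts": df["subreddit"].value_counts().to_dict(),
    }

toy = pd.DataFrame({
    "created_utc": [1, 2, 3],
    "subreddit": [0, 1, 1],
    "body": ["a", "b", "c"],
})
report = validate(toy, "body")
print(report)
```

On the real data this would be called as `validate(df_subs, "title")` and `validate(df_comments, "body")`.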

### Export merged and cleaned data to CSV:

In [313]:
# Export Submissions Data:

df_subs.to_csv('../data/merged_submissions.csv', index=False)

In [314]:
# Export Comments Data:

df_comments.to_csv('../data/merged_comments.csv', index=False)
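
Passing `index=False` keeps the row index out of the file. Without it, pandas writes the index as an unnamed first column, which comes back as `Unnamed: 0` on the next `read_csv`. That is likely how the `Unnamed: 0` column dropped earlier arose, though this is an assumption about how the pull files were saved. A small in-memory round trip on a toy frame illustrates the difference:

```python
import io
import pandas as pd

toy = pd.DataFrame({"subreddit": [0, 1], "title": ["x", "y"]})

# Default index=True: the index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False: only the data columns are written.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```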