# Read in and Explore Data:

The cells below explore the Submissions and Comments data using standard EDA techniques, checking that the paired datasets agree in shape and data types and counting missing values. The 'selftext' field from the Submissions data will not be used in the final analysis because most of its values are missing (roughly 52% for conspiracy theories and 98% for science). The Comments data will require cleaning but has no missing values.

### Explore Title Submissions:

In [139]:
# Import Libraries:

import pandas as pd

In [140]:
# Read in conspiracy theory data:

conspire = pd.read_csv('../data/conspire_pull_submissions.csv')

In [141]:
# Read in science data:

science = pd.read_csv('../data/science_pull_submissions.csv')

In [142]:
# Conspire is 12k rows x 5 columns

conspire.shape

(12000, 5)

In [143]:
# Science is 12k rows by 5 columns

science.shape

(12000, 5)
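The shape comparison above is done by eye. As a minimal sketch, the same congruence check can be made explicit in code. The helper name `check_congruence` and the small stand-in frames are assumptions for illustration; in this notebook the arguments would be `conspire` and `science`.

```python
import pandas as pd

# Stand-in frames (hypothetical data); substitute `conspire` and `science`.
left = pd.DataFrame({"created_utc": [1, 2], "subreddit": ["a", "a"], "title": ["x", "y"]})
right = pd.DataFrame({"created_utc": [3, 4], "subreddit": ["b", "b"], "title": ["z", "w"]})

def check_congruence(a, b):
    """Return True when both frames share shape, column order, and dtypes."""
    same_shape = a.shape == b.shape
    same_columns = list(a.columns) == list(b.columns)
    same_dtypes = list(a.dtypes) == list(b.dtypes)
    return same_shape and same_columns and same_dtypes

congruent = check_congruence(left, right)
print(congruent)
```

A check like this fails loudly if one pull returns a different set of fields, which matters when the two frames are concatenated later.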

### Explore Title Submissions Missing Data:

In [144]:
conspire.head()

,Unnamed: 0,created_utc,subreddit,selftext,title
0,0,1587078122,conspiracytheories,Who else has noticed how the baby is being hit...,The coronavirus
1,1,1587076875,conspiracytheories,We are all pawns in this political game. Relea...,"COVID-19 Power, Control, and Profit!"
2,2,1587075153,conspiracytheories,,Dr. Andrew Kaufman disputes COVID19
3,3,1587074999,conspiracytheories,[removed],Do someone remember an suppost radio transmiss...
4,4,1587074240,conspiracytheories,"Hey frendos, \n\nSomeone close to me has asser...",Question: Bill Gates Malaria Vaccine Mutates A...


In [145]:
science.head()

,Unnamed: 0,created_utc,subreddit,selftext,title
0,0,1587081052,science,,Tens to Hundreds of Gigahertz Stable Pulsed Al...
1,1,1587079564,science,,"iPhone 11 64G,Double Card Double Wait, Genuine..."
2,2,1587078047,science,[deleted],Dornaz alfa koronavirüs tedavisinde kullanılab...
3,3,1587078029,science,[deleted],Finally a solution!
4,4,1587077467,science,[deleted],An analysis of radiation from the Big Bang sug...


In [146]:
conspire.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 5 columns):
Unnamed: 0     12000 non-null int64
created_utc    12000 non-null int64
subreddit      12000 non-null object
selftext       5773 non-null object
title          12000 non-null object
dtypes: int64(2), object(3)
memory usage: 468.9+ KB


In [147]:
science.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 5 columns):
Unnamed: 0     12000 non-null int64
created_utc    12000 non-null int64
subreddit      12000 non-null object
selftext       248 non-null object
title          12000 non-null object
dtypes: int64(2), object(3)
memory usage: 468.9+ KB
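The `Unnamed: 0` column in both `info()` listings is most likely a row index that was written to the CSV when the data was saved; that is an assumption, since the export step is not shown here. A minimal sketch of how `index_col=0` in `pd.read_csv` avoids the extra column, using a small in-memory CSV with made-up rows:

```python
import io
import pandas as pd

# Tiny stand-in for a CSV that was saved with its index (hypothetical rows).
csv_text = (
    ",created_utc,subreddit,title\n"
    "0,1587078122,conspiracytheories,First title\n"
    "1,1587076875,conspiracytheories,Second title\n"
)

# Default read: the unnamed first column becomes 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

Dropping the column after loading with `drop(columns="Unnamed: 0")` would work equally well; either way the column carries no information beyond the row position.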


In [148]:
conspire.isnull().sum()

Unnamed: 0        0
created_utc       0
subreddit         0
selftext       6227
title             0
dtype: int64

In [149]:
science.isnull().sum()

Unnamed: 0         0
created_utc        0
subreddit          0
selftext       11752
title              0
dtype: int64

In [150]:
# The Conspire dataset is missing about 52% of 'selftext' entries (6,227 of 12,000).
# The Science dataset is missing about 98% of 'selftext' entries (11,752 of 12,000).

# We'll likely need to use 'title' as a predictor of thread membership, 
# given the high volume of missing values.
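The missing-value percentages can be computed directly rather than by hand. A minimal sketch, using stand-in columns rebuilt from the null counts reported by `isnull().sum()`; in the notebook the same expression would be applied to `conspire['selftext']` and `science['selftext']`.

```python
import pandas as pd

# Stand-in columns rebuilt from the reported null counts
# (6,227 and 11,752 missing out of 12,000); the text values are dummies.
conspire_selftext = pd.Series([None] * 6227 + ["text"] * 5773)
science_selftext = pd.Series([None] * 11752 + ["text"] * 248)

# The mean of a boolean mask is the fraction of True values.
conspire_pct = conspire_selftext.isnull().mean() * 100
science_pct = science_selftext.isnull().mean() * 100

print(f"conspire selftext missing: {conspire_pct:.1f}%")
print(f"science selftext missing: {science_pct:.1f}%")
```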

### Explore Comments:

In [151]:
# Read in conspiracy theory comments data:

conspire_comments = pd.read_csv('../data/conspire_pull_comments.csv')

In [152]:
# Read in science comments data:

science_comments = pd.read_csv('../data/science_pull_comments.csv')

In [153]:
# Conspire comments is 25k rows x 4 columns

conspire_comments.shape

(25000, 4)

In [154]:
# Science comments is 25k rows x 4 columns

science_comments.shape

(25000, 4)

### Explore Comments Missing Data:

In [156]:
conspire_comments.head()

,Unnamed: 0,created_utc,subreddit,body
0,0,1587081288,conspiracytheories,"Trump was part of Epstein's ring too, he's the..."
1,1,1587081284,conspiracytheories,If china was going to cripple us with a virus ...
2,2,1587081205,conspiracytheories,Damn...exactly
3,3,1587081197,conspiracytheories,Only if you’re not going to explain what you t...
4,4,1587081171,conspiracytheories,The pollution levels have gone down because we...


In [157]:
science_comments.head()

,Unnamed: 0,created_utc,subreddit,body
0,0,1587081425,science,There goes my appetite for seafood. Had honest...
1,1,1587081414,science,Which one?
2,2,1587081373,science,Probably similar. However... they say they're ...
3,3,1587081313,science,You’re such a liar. At 40 years old I know and...
4,4,1587081304,science,"Thats a little bit overkill, there are easier ..."


In [158]:
conspire_comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 4 columns):
Unnamed: 0     25000 non-null int64
created_utc    25000 non-null int64
subreddit      25000 non-null object
body           25000 non-null object
dtypes: int64(2), object(2)
memory usage: 781.4+ KB


In [159]:
science_comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 4 columns):
Unnamed: 0     25000 non-null int64
created_utc    25000 non-null int64
subreddit      25000 non-null object
body           25000 non-null object
dtypes: int64(2), object(2)
memory usage: 781.4+ KB


In [160]:
conspire_comments.isnull().sum()

Unnamed: 0     0
created_utc    0
subreddit      0
body           0
dtype: int64

In [161]:
science_comments.isnull().sum()

Unnamed: 0     0
created_utc    0
subreddit      0
body           0
dtype: int64

In [162]:
# Neither comment dataset contains null values.
# Cleaning the text documents and generating indicator variables
# will take place in the next notebook.
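A zero null count does not guarantee that every comment has usable text: the submissions data contains placeholder strings such as `[removed]` and `[deleted]`, which `isnull()` does not flag. Whether the comment bodies contain the same placeholders is an assumption that has not been checked here. A minimal sketch of such a check on made-up comment bodies:

```python
import pandas as pd

# Hypothetical comment bodies; the placeholder strings mirror those
# seen in the submissions 'selftext' column.
comments = pd.DataFrame(
    {"body": ["Which one?", "[removed]", "[deleted]", "   ", "Damn...exactly"]}
)

placeholders = {"[removed]", "[deleted]"}
stripped = comments["body"].str.strip()

is_placeholder = stripped.isin(placeholders)  # moderator or user deletions
is_blank = stripped.eq("")                    # whitespace-only bodies

print(is_placeholder.sum(), is_blank.sum())

# Keep only rows with real text.
usable = comments[~(is_placeholder | is_blank)]
```

The same mask could be applied to `conspire_comments['body']` and `science_comments['body']` during the cleaning step.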