
## Data Cleaning

In this section we inspect the raw data and keep only what is required for EDA.

In [1]:
# Import necessary libraries
import pandas as pd
from bs4 import BeautifulSoup  # import the parser class itself, not the bs4 module under that name
pd.set_option("display.max_rows", 120)

In [2]:
# Import data
df1 = pd.read_csv('../datasets/adhd_210311.csv')
# Dataset used for modeling: df1 = pd.read_csv('../datasets/adhd_210311.csv')

In [3]:
# Collect each column name alongside its count of missing values
dict_meta = {'columns': list(df1.columns), 'isna()': list(df1.isna().sum())}

In [4]:
# How many posts do we have?
df1.shape

(884, 109)

In [5]:
df1_meta = pd.DataFrame(dict_meta)

In [6]:
df1_meta.sort_values(by='isna()', ascending=False)

|    | columns                | isna() |
|----|------------------------|--------|
| 54 | banned_by              | 884    |
| 60 | suggested_sort         | 884    |
| 35 | category               | 884    |
| 40 | approved_by            | 884    |
| 44 | author_flair_css_class | 884    |
| 47 | content_categories     | 884    |
| 49 | mod_note               | 884    |
| 53 | removed_by_category    | 884    |
| 17 | top_awarded_type       | 884    |
| 16 | thumbnail_height       | 884    |
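
The same missing-value summary can be built without the intermediate dictionary. The sketch below is only an illustration on a made-up frame (`toy` and its values are assumptions, not our data); it chains `isna().sum()` into a two-column table with the same `columns` and `isna()` headers.

```python
import pandas as pd

# Toy frame standing in for df1; the values are illustrative only.
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "selftext": ["x", None, "z"],
    "banned_by": [None, None, None],
})

# Count missing values per column, then turn the resulting Series
# into a table with one row per column, sorted by missing count.
toy_meta = (
    toy.isna()
    .sum()
    .rename_axis("columns")
    .reset_index(name="isna()")
    .sort_values(by="isna()", ascending=False)
)
print(toy_meta)
```

Columns whose missing count equals the number of rows carry no information at all, which is what the top of our real table shows.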


Looking at our data, many columns contain Reddit-specific metadata, and several of them are entirely empty (884 missing values out of 884 rows). For the scope of our project, however, we will only look at the text fields: 'title' and 'selftext'.

The 'title' of a Reddit post is self-explanatory. The 'selftext' is the body of the post, written by its original author. Reddit posts also typically have multi-threaded comments/replies, and the resulting discussion may be valuable for understanding the context of the original post.

Comments are valuable when they include advice from others with similar experiences, but many are irrelevant (memes or jokes meant for entertainment). After observation and discussion, we decided that the valuable information in the comments was few and far between. Extracting it would have required a lot of EDA and likely a separate model (one trained to determine whether a comment thread is relevant). For these reasons, we exclude comments entirely from the scope of this project.

**Where to find Title and Selftext on Reddit**
![Title and Selftext](../assets/selftext.png)  
  
**Comments on Reddit (under the Selftext area)**
![Comments](../assets/comments.png)  

In [7]:
# Keep only the target variable ('subreddit') and the predictor variables 'title' and 'selftext'
df1 = df1[['subreddit', 'selftext', 'title']]

In [8]:
df1.shape

(884, 3)

In [9]:
# Is any data duplicated? Count repeats by title and by selftext.
(df1.duplicated(subset='title', keep='first').sum(), df1.duplicated(subset='selftext', keep='first').sum())

(54, 54)

It looks like the Reddit API returns some duplicated data. This could be due to us getting 'caught' requesting a large number of posts, or because the subreddit has only that many posts in the first place. We hope to have at least 800 posts for each of our datasets. Since our first dataset has only 54 duplicates, dropping them still leaves 830 of the 884 posts, so we can safely remove them.
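
To make the duplicate counts concrete, here is a minimal sketch on a made-up frame (the `toy` data is an assumption for illustration). It shows how `duplicated(..., keep='first')` counts every repeat after the first occurrence, and that dropping duplicated titles does not necessarily remove every duplicated selftext.

```python
import pandas as pd

# Four made-up posts: one repeated title, two repeated bodies.
toy = pd.DataFrame({
    "title": ["t1", "t1", "t2", "t3"],
    "selftext": ["s1", "s1", "s2", "s2"],
})

# keep='first' flags every repeat after the first occurrence.
n_dup_title = int(toy.duplicated(subset="title", keep="first").sum())
n_dup_text = int(toy.duplicated(subset="selftext", keep="first").sum())

# Dropping by title removes the repeated "t1" row only.
deduped = toy.drop_duplicates(subset="title")
n_dup_text_after = int(deduped.duplicated(subset="selftext").sum())

print(n_dup_title, n_dup_text, n_dup_text_after)
```

This is why it is worth re-checking both fields after the drop rather than assuming one pass covers both.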

In [10]:
df1 = df1.drop_duplicates(subset='title')

In [11]:
df1.shape

(830, 3)

In [12]:
# Verify duplicates are removed
(df1.duplicated(subset='title', keep='first').sum(), df1.duplicated(subset='selftext', keep='first').sum())

(0, 0)

Let's repeat this process for our r/OCD dataset.

In [13]:
df2 = pd.read_csv('../datasets/ocd_210311.csv')
# Dataset used for modeling: df2 = pd.read_csv('../datasets/ocd_210311.csv')

In [14]:
df2 = df2[['subreddit', 'selftext', 'title']]

In [15]:
df2.shape

(994, 3)

In [16]:
(df2.duplicated(subset='title', keep='first').sum(), df2.duplicated(subset='selftext', keep='first').sum())

(5, 108)

The r/OCD dataset has fewer duplicated titles (5) but many more duplicated selftexts (108), so we remove duplicates on both fields.

In [17]:
df2 = df2.drop_duplicates(subset='title')

In [18]:
df2 = df2.drop_duplicates(subset='selftext', keep='first')

In [19]:
# Verify duplicates are removed
(df2.duplicated(subset='title', keep='first').sum(), df2.duplicated(subset='selftext', keep='first').sum())

(0, 0)
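
Since both datasets go through the same steps, the cleaning could be wrapped in one function. The sketch below is a hypothetical helper (`clean_posts` is not part of this notebook) that keeps the three columns and drops duplicates on both text fields. Note that it applies the selftext pass to every dataset, whereas above we only needed it for r/OCD.

```python
import pandas as pd

def clean_posts(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the three columns used in this project and drop repeated posts.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    kept = df[["subreddit", "selftext", "title"]]
    kept = kept.drop_duplicates(subset="title", keep="first")
    kept = kept.drop_duplicates(subset="selftext", keep="first")
    return kept

# Made-up raw data with an extra metadata column and some repeats.
raw = pd.DataFrame({
    "subreddit": ["OCD"] * 4,
    "selftext": ["s1", "s1", "s2", "s3"],
    "title": ["t1", "t2", "t3", "t3"],
    "score": [1, 2, 3, 4],
})
cleaned = clean_posts(raw)
print(cleaned.shape)
```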

In [20]:
data = pd.concat([df1, df2], axis=0)

In [21]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1713 entries, 0 to 993
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  1713 non-null   object
 1   selftext   1712 non-null   object
 2   title      1713 non-null   object
dtypes: object(3)
memory usage: 53.5+ KB
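
The summary above reports 1713 entries labelled 0 to 993, which suggests that `pd.concat` carried over each frame's original row labels, so some labels appear twice. This does not affect the saved file when the index is not written out, but label-based lookups on the combined frame could return two rows. A minimal sketch of the difference, using made-up frames `a` and `b`:

```python
import pandas as pd

# Two small frames whose row labels overlap, as ours do after deduplication.
a = pd.DataFrame({"title": ["a1", "a2"]}, index=[0, 3])
b = pd.DataFrame({"title": ["b1", "b2"]}, index=[0, 1])

# Default behaviour: original labels are preserved, so label 0 appears twice.
stacked = pd.concat([a, b], axis=0)

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([a, b], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```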


The summary above shows one post with a missing selftext (1712 non-null out of 1713). For our current dataset, a NaN selftext is acceptable as long as the title has content. Selftext is sometimes NaN if the entire post consists of the title and either an image or a video. As long as the 'title' contains non-duplicated text, the post is relevant to our model.
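
One caveat worth checking: pandas treats missing values as equal to each other when looking for duplicates, so deduplicating on 'selftext' keeps only the first title-only post and drops the rest. Whether that removed posts we would have wanted is not visible from the outputs above. The sketch below, on made-up data, contrasts the plain call with a NaN-aware variant that only drops a row when its selftext is present and repeated.

```python
import numpy as np
import pandas as pd

# Two title-only posts (NaN body) and one text post that was reposted.
toy = pd.DataFrame({
    "title": ["photo post 1", "photo post 2", "text post", "text repost"],
    "selftext": [np.nan, np.nan, "same body", "same body"],
})

# Plain drop_duplicates treats the two NaN bodies as equal and drops one.
plain = toy.drop_duplicates(subset="selftext", keep="first")

# NaN-aware variant: flag a row only if its selftext exists AND is a repeat.
is_repeat = toy["selftext"].notna() & toy.duplicated(subset="selftext", keep="first")
nan_aware = toy[~is_repeat]

print(len(plain), len(nan_aware))
```

The NaN-aware version keeps both title-only posts while still removing the reposted body.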

In [22]:
data.shape

(1713, 3)

We will now save our data and continue working on it in the EDA section.

In [23]:
from datetime import date
# Stamp the output file with today's date (YYMMDD), matching the input file names
today = date.today().strftime("%y%m%d")
data.to_csv(f'../datasets/adhd_ocd_{today}.csv', index=False)