# Subreddit Classification Using Language Processing
## Part 2 of 4: Exploratory Data Analysis and Cleaning

#### Notebooks
- [01_data_collection](./01_data_collection.ipynb)
- [02_eda_and_cleaning](./02_eda_and_cleaning.ipynb)
- [03_visualizing](./03_visualizing.ipynb)
- [04_modeling](./04_modeling.ipynb)

#### This Notebook's Contents
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Cleaning the Data](#Cleaning-the-Data)
- [Cleaning the Text](#Cleaning-the-Text)

*NOTE: 'Data Science' may be abbreviated to DS below. 'Artificial Intelligence' may be abbreviated to AI below.*

# Exploratory Data Analysis

In [1]:
# Import the required libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib_venn import venn2, venn2_circles

import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text

# Hide non-critical warnings appearing in Jupyter notebook.
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Read in the DS data as a dataframe.
ds_df = pd.read_csv('../data/datascience_og.csv')

In [3]:
# Read in the AI data as a dataframe.
ai_df = pd.read_csv('../data/artificial_og.csv')

In [4]:
# Check the shape of the DS dataframe.
ds_df.shape

(31649, 7)

In [5]:
# Display information about the DS dataframe.
ds_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31649 entries, 0 to 31648
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        31649 non-null  object
 1   created_utc  31649 non-null  int64 
 2   selftext     23641 non-null  object
 3   subreddit    31649 non-null  object
 4   author       31649 non-null  object
 5   media_only   31585 non-null  object
 6   permalink    31649 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.7+ MB


In [6]:
# Check the shape of the AI dataframe.
ai_df.shape

(18754, 7)

In [7]:
# Display information about the AI dataframe.
ai_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18754 entries, 0 to 18753
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        18754 non-null  object
 1   created_utc  18754 non-null  int64 
 2   selftext     5422 non-null   object
 3   subreddit    18754 non-null  object
 4   author       18754 non-null  object
 5   media_only   18672 non-null  object
 6   permalink    18754 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.0+ MB


In [8]:
# Display the NaN values in the DS dataframe.
ds_df.isna().sum()

title             0
created_utc       0
selftext       8008
subreddit         0
author            0
media_only       64
permalink         0
dtype: int64

In [9]:
# Display the NaN values in the AI dataframe.
ai_df.isna().sum()

title              0
created_utc        0
selftext       13332
subreddit          0
author             0
media_only        82
permalink          0
dtype: int64

In [10]:
# 'selftext' holds the body text of a post.
# Some posts have a title but no body text.

# Display two rows of the DS dataframe where selftext is missing.
ds_df[ds_df['selftext'].isna()].head(2)

| | title | created_utc | selftext | subreddit | author | media_only | permalink |
|---|---|---|---|---|---|---|---|
| 2 | The Most Popular Programming Languages - 1965/... | 1601811209 | NaN | datascience | accappatoiviola | False | /r/datascience/comments/j4xiyx/the_most_popula... |
| 5 | 8 Ways to Drop Columns in Pandas \| A Detailed ... | 1601792947 | NaN | datascience | thatascience | False | /r/datascience/comments/j4ueix/8_ways_to_drop_... |


In [11]:
# Display two rows of the AI dataframe where selftext is missing.
ai_df[ai_df['selftext'].isna()].head(2)

| | title | created_utc | selftext | subreddit | author | media_only | permalink |
|---|---|---|---|---|---|---|---|
| 2 | AR/VR Can Impact the Future of Healthcare | 1601925920 | NaN | artificial | stayhealthy_1 | False | /r/artificial/comments/j5q0bb/arvr_can_impact_... |
| 3 | A Reddit user was caught today posting many co... | 1601924278 | NaN | artificial | Wiskkey | False | /r/artificial/comments/j5ph1h/a_reddit_user_wa... |


In [12]:
# Display the count of [removed] posts in the DS dataframe.
len(ds_df[ds_df['selftext'] == '[removed]'])

4788

In [13]:
# Display the count of [deleted] posts in the DS dataframe.
len(ds_df[ds_df['selftext'] == '[deleted]'])

120

In [14]:
# Display the count of [removed] posts in the AI dataframe.
len(ai_df[ai_df['selftext'] == '[removed]'])

911

In [15]:
# Display the count of [deleted] posts in the AI dataframe.
len(ai_df[ai_df['selftext'] == '[deleted]'])

53
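The four counts above can also be gathered in one pass per dataframe. A minimal sketch, assuming a hypothetical helper `placeholder_counts` and a small stand-in dataframe (`toy_df`) rather than the real data:

```python
import pandas as pd

# Placeholder strings Reddit leaves in 'selftext', as counted above.
PLACEHOLDERS = ['[removed]', '[deleted]']

def placeholder_counts(frame, column='selftext'):
    """Count each placeholder string in `column` (hypothetical helper)."""
    counts = frame[column].value_counts()
    # reindex keeps a fixed order and reports 0 for placeholders that never occur.
    return counts.reindex(PLACEHOLDERS, fill_value=0)

# Stand-in data for illustration only.
toy_df = pd.DataFrame({'selftext': ['[removed]', 'real post', None, '[deleted]', '[removed]']})
result = placeholder_counts(toy_df)
print(result)
```

Calling the helper once on `ds_df` and once on `ai_df` would return both counts for each subreddit.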

# Cleaning the Data

## Address [removed] and [deleted] posts
#### Drop [removed] and [deleted] posts so they are not factored into the NLP model as a signal

In [16]:
# Drop the [removed] posts in the data science dataframe.
ds_df = ds_df.drop(ds_df[ds_df['selftext'] == '[removed]'].index)

In [17]:
# Drop the [deleted] posts in the data science dataframe.
ds_df = ds_df.drop(ds_df[ds_df['selftext'] == '[deleted]'].index)

In [18]:
# Drop the [removed] posts in the AI dataframe.
ai_df = ai_df.drop(ai_df[ai_df['selftext'] == '[removed]'].index)

In [19]:
# Drop the [deleted] posts in the AI dataframe.
ai_df = ai_df.drop(ai_df[ai_df['selftext'] == '[deleted]'].index)
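The same rows can be removed with a single boolean mask instead of one `drop` call per placeholder. A sketch on a stand-in dataframe (the name `toy_df` is illustrative and not part of the project):

```python
import pandas as pd

# Stand-in data for illustration only.
toy_df = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'selftext': ['[removed]', 'kept text', '[deleted]', None],
})

# Flag rows whose selftext is one of the placeholder strings.
# isin() returns False for missing values, so those rows are not flagged here.
mask = toy_df['selftext'].isin(['[removed]', '[deleted]'])

# Keep everything that is not flagged.
filtered = toy_df[~mask]
print(filtered)
```

Rows with missing `selftext` survive this filter, so they still need to be handled separately.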

## Address NaN text values

In [20]:
# Drop the rows with missing post text in the DS dataframe.
ds_df = ds_df.drop(ds_df[ds_df['selftext'].isna()].index)

In [21]:
# Drop the rows with missing post text in the AI dataframe.
ai_df = ai_df.drop(ai_df[ai_df['selftext'].isna()].index)
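pandas also offers `dropna` with a `subset` argument, which expresses the same step more directly. A short sketch on stand-in data (`toy_df` is illustrative):

```python
import pandas as pd

# Stand-in data for illustration only.
toy_df = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'selftext': ['text', None, 'more text'],
})

# Remove rows where 'selftext' is missing; other columns are not checked.
cleaned = toy_df.dropna(subset=['selftext'])
print(cleaned.shape)
```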

In [22]:
# Display the shape of the DS dataframe.
ds_df.shape

(18733, 7)

In [23]:
# Display the shape of the AI dataframe.
ai_df.shape

(4458, 7)

In [24]:
# Drop DS posts created before Unix timestamp 1586000000 (in seconds).
# This trims the DS dataframe to a size comparable to the AI dataframe.
ds_df = ds_df.drop(ds_df[ds_df['created_utc'] < 1586000000].index)

In [25]:
# Display the shape of the DS dataframe.
ds_df.shape

(4973, 7)
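`created_utc` stores Unix timestamps in seconds, so the cutoff can be converted to a calendar date with `pd.to_datetime`. A short sketch (the constant name `CUTOFF` is illustrative):

```python
import pandas as pd

CUTOFF = 1586000000  # same value as in the filter above

# unit='s' interprets the integer as seconds since 1970-01-01 UTC.
cutoff_date = pd.to_datetime(CUTOFF, unit='s', utc=True)
print(cutoff_date)  # 2020-04-04 11:33:20+00:00
```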

#### The data science dataframe has 4,973 entries. The artificial intelligence dataframe has 4,458 entries.

## Feature engineering

In [26]:
# Create a new column based on all text in both dataframes.
ai_df['all_text'] = ai_df['title'] + ' ' + ai_df['selftext']
ds_df['all_text'] = ds_df['title'] + ' ' + ds_df['selftext']
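Concatenating string columns with `+` yields a missing value whenever either side is missing, which is one reason the rows without `selftext` had to be handled first. The sketch below shows this on stand-in data, along with an alternative (not used in this notebook) that would keep title-only posts:

```python
import pandas as pd

# Stand-in data for illustration only.
toy_df = pd.DataFrame({
    'title': ['Title only', 'Title and body'],
    'selftext': [None, 'Body text'],
})

# Plain concatenation: a missing selftext makes the whole result missing.
plain = toy_df['title'] + ' ' + toy_df['selftext']

# Alternative: fill missing bodies with '' first, then trim the trailing space.
filled = (toy_df['title'] + ' ' + toy_df['selftext'].fillna('')).str.strip()

print(plain)
print(filled)
```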

In [27]:
# Create target columns.
# Assign 1 to data science, 0 to artificial intelligence.
ds_df['data_science'] = 1
ai_df['data_science'] = 0

## Concatenate the dataframes

In [28]:
# Concatenate the AI and DS dataframes into one.
df = pd.concat([ai_df, ds_df])

In [29]:
# Display the first two rows of the dataframe.
df.head(2)

| | title | created_utc | selftext | subreddit | author | media_only | permalink | all_text | data_science |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Noob question about the limitations of AI | 1601948655 | Hello, so my friends and I were talking about ... | artificial | HaeL756 | False | /r/artificial/comments/j5wg5t/noob_question_ab... | Noob question about the limitations of AI Hell... | 0 |
| 1 | [R] Google AI Helps Sign Language ‘Take the Fl... | 1601930050 | To enable signers to “take the floor” in such ... | artificial | Yuqing7 | False | /r/artificial/comments/j5raur/r_google_ai_help... | [R] Google AI Helps Sign Language ‘Take the Fl... | 0 |
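Once the target column exists, the class balance is worth a quick check, because the majority-class share is the accuracy a model would reach by always predicting that class. A sketch that rebuilds the target from the row counts reported above (4,973 DS and 4,458 AI) instead of loading the real data:

```python
import pandas as pd

# Synthetic target column built from the row counts reported above.
target = pd.Series([1] * 4973 + [0] * 4458, name='data_science')

# normalize=True returns class shares instead of raw counts.
balance = target.value_counts(normalize=True)
print(balance.round(3))
```

On the combined dataframe, the equivalent call would be `df['data_science'].value_counts(normalize=True)`.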


In [30]:
# Export the dataframe to csv.
df.to_csv('../data/combined.csv', index=False)

# Cleaning the Text

## Use regex to clean the text

In [31]:
# Define a function to clean the text.
# https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python

def clean_text(text):
    
    # Use regex to replace URLs with a single space.
    text = re.sub(r"\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*", ' ', text)
    
    # Replace single-letter post title tags with a space.
    # The alternatives are grouped so that \W* applies to every tag, not only the first and last.
    text = re.sub(r"\W*\[(?:D|P|R|N)\]\W*", ' ', text)
    
    # Replace longer post title tags with a space, ignoring case.
    text = re.sub(r"\W*\[(?:Advice|Repost|UPDATE)\]\W*", ' ', text, flags=re.I)

    # Replace further post title tags with a space, ignoring case.
    text = re.sub(r"\W*\[(?:Discussion|News|PROJECT)\]\W*", ' ', text, flags=re.I)
    
    # Replace special characters with a space.
    # The hyphen is placed last so it is matched literally; written as +-: it forms
    # a range that also strips digits and '/'.
    text = re.sub(r"[#@?¿.$%_\[\]()+:*\",'-]", ' ', text)
    
    # Collapse runs of whitespace (including tabs and line breaks) into a single space.
    text = re.sub(r"\s+", ' ', text)
    
    return text
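Two regex details are easy to get wrong in a cleaning function like this one: alternation binds more loosely than concatenation, and a hyphen between two characters inside a character class defines a range. A small sketch on made-up strings (all pattern and variable names are illustrative):

```python
import re

# 1) Alternation: without a group, the leading \W* belongs only to the first alternative.
ungrouped = r"\W*\[D\]|\[P\]"
grouped = r"\W*\[(?:D|P)\]"

tag_sample = "intro [P] project"
out_ungrouped = re.sub(ungrouped, '', tag_sample)  # space before the tag is kept
out_grouped = re.sub(grouped, '', tag_sample)      # space before the tag is consumed
print(repr(out_ungrouped))  # 'intro  project'
print(repr(out_grouped))    # 'intro project'

# 2) Character classes: '+-:' is the range from '+' to ':', which covers
#    ',', '-', '.', '/' and the digits 0-9. A trailing hyphen is literal.
range_class = r"[+-:]"
literal_class = r"[+:-]"

char_sample = "GPT-3 scored 95+"
out_range = re.sub(range_class, ' ', char_sample)      # digits are removed too
out_literal = re.sub(literal_class, ' ', char_sample)  # only '+', ':' and '-' are removed
print(repr(out_range))
print(repr(out_literal))  # 'GPT 3 scored 95 '
```

Running a cleaning function on a few hand-picked titles like these is a cheap way to confirm that numbers and neighboring words survive as intended.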

In [32]:
# Apply the text cleaning function to the 'all_text' column.
df['all_text'] = df['all_text'].apply(clean_text)

In [33]:
# Export the final dataframe to csv.
df.to_csv('../data/combined_cleaned_ai_ds.csv', index=False)

In [34]:
# Apply the text cleaning function to the AI dataframe.
ai_df['all_text'] = ai_df['all_text'].apply(clean_text)

In [35]:
# Export the cleaned AI dataframe to csv.
ai_df.to_csv('../data/artificial_cleaned.csv', index=False)

In [36]:
# Apply the text cleaning function to the DS dataframe.
ds_df['all_text'] = ds_df['all_text'].apply(clean_text)

In [37]:
# Export the cleaned DS dataframe to csv.
ds_df.to_csv('../data/datascience_cleaned.csv', index=False)