## Data Cleaning

A text-heavy classification problem requires substantial preprocessing up front. The approach taken was to build a data frame holding the combined text of each post (the title plus the post body) in a handful of variants that differ slightly in how the text is processed and tokenized.

Imports needed for text cleaning:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

The initial four datasets collected from PushShift are read in and combined below to form what becomes our main dataframe:

In [2]:
df = pd.concat([pd.read_csv('../datasets/Catholicism.csv'), 
                pd.read_csv('../datasets/Catholicism2.csv'), 
                pd.read_csv('../datasets/OrthodoxChristianity.csv'), 
                pd.read_csv('../datasets/OrthodoxChristianity2.csv')]).reset_index(drop = True)

Rows with a null post body are dropped, as are entries whose body is the placeholder `[removed]` or `[deleted]`, meaning the post was deleted or removed by an administrator:

In [3]:
df = df[df['selftext'].notnull()]
df = df[df['selftext'] != '[removed]']
df = df[df['selftext'] != '[deleted]']
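As a minimal sketch of the same filter, the three conditions can be folded into one boolean mask. The toy frame and the names `toy`, `placeholders`, and `kept` are illustrative assumptions, not part of the project data:

```python
import pandas as pd

# Toy frame standing in for the PushShift export (hypothetical rows).
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'selftext': ['real post', None, '[removed]', '[deleted]'],
})

# Keep rows whose body is present and is not one of the placeholder strings.
placeholders = ['[removed]', '[deleted]']
keep = toy['selftext'].notnull() & ~toy['selftext'].isin(placeholders)
kept = toy[keep].reset_index(drop=True)

print(kept)  # only the 'real post' row survives
```

A single mask avoids creating three intermediate frames, though the result should be the same as the step-by-step version.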

Title and post body are combined:

In [4]:
df['alltext'] = df['title'] + ' ' + df['selftext']

Here the new combined text feature is run through a series of string replacements that either remove markup noise (line breaks, tabs, quote characters) or convert HTML entities (for example, the escaped form of `>`) back to the characters they stand for. URLs are also removed.

In [5]:
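# Literal (non-regex) replacements; regex=False keeps behavior consistent across pandas versions.
df['alltext'] = df['alltext'].str.replace('\n', ' ', regex=False)
df['alltext'] = df['alltext'].str.replace('\t', ' ', regex=False)
df['alltext'] = df['alltext'].str.replace('\r', ' ', regex=False)
df['alltext'] = df['alltext'].str.replace("'", "", regex=False)
df['alltext'] = df['alltext'].str.replace('"', ' ', regex=False)
# HTML entities end in ';', so match it too; otherwise a stray semicolon is left behind.
df['alltext'] = df['alltext'].str.replace('&gt;', '>', regex=False)
df['alltext'] = df['alltext'].str.replace('&ge;', '>=', regex=False)
df['alltext'] = df['alltext'].str.replace('&lt;', '<', regex=False)
df['alltext'] = df['alltext'].str.replace('&le;', '<=', regex=False)
df['alltext'] = df['alltext'].map(lambda x: re.sub(r"http\S+", "", x))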
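One possible alternative to replacing entities one at a time is the standard library's `html.unescape`, which decodes all named entities in a single pass. The sketch below is an assumption about how the cleaning could be packaged, not the notebook's method: `clean_text` is a hypothetical helper, and the final whitespace collapse is an extra step the cell above does not perform.

```python
import html
import re

def clean_text(text):
    """Hypothetical helper: decode entities, drop URLs and quotes, tidy whitespace."""
    text = html.unescape(text)               # e.g. '&gt;' becomes '>'
    text = re.sub(r'http\S+', '', text)      # same URL pattern as the cell above
    text = text.replace("'", '').replace('"', ' ')
    return re.sub(r'\s+', ' ', text).strip() # extra step: collapse runs of whitespace

sample = "&gt; quoted line\n\nSee https://example.com/page for more, it's \"fine\""
cleaned = clean_text(sample)
print(cleaned)  # > quoted line See for more, its fine
```

On a data frame this could be applied with `df['alltext'].map(clean_text)`.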

The text is converted to lowercase, and a second column holds the lowercased text with stopwords removed:

In [6]:
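df['lowtext'] = df['alltext'].str.lower()
# Build the stopword set once; calling stopwords.words() for every word is very slow.
stop_words = set(stopwords.words('english'))
df['unstopped'] = df['lowtext'].map(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))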
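Because the filter splits on whitespace before any tokenization happens, a stopword with punctuation attached is not recognized as a stopword. The sketch below illustrates this with a tiny hand-written stopword set, an assumption standing in for NLTK's English list; `remove_stopwords` is a hypothetical name:

```python
# Tiny stand-in for NLTK's English stopword list (assumption for illustration).
stop = {'the', 'is', 'a', 'of'}

def remove_stopwords(text, stop_set):
    """Drop whitespace-separated words that appear in stop_set."""
    return ' '.join(word for word in text.split() if word not in stop_set)

result = remove_stopwords('the pope is the bishop of rome', stop)
print(result)    # pope bishop rome

# 'the,' carries a trailing comma, so it does not match 'the' and is kept.
attached = remove_stopwords('of the, church', stop)
print(attached)  # the, church
```

If this matters for the model, one option would be to tokenize first and remove stopwords afterwards.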

Below, two tokenizers are created, along with a lemmatizer and a stemmer, for use in the preprocessing that follows.

In [7]:
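# Keeps only runs of word characters (letters, digits, underscore).
tok1 = RegexpTokenizer(r'\w+')
# Also keeps dollar amounts and any leftover non-whitespace (e.g. punctuation) as tokens.
tok2 = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')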

lemmatizer = WordNetLemmatizer()
p_stemmer = PorterStemmer()
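To see how the two patterns differ, the sketch below applies them with `re.findall`, which should behave like `RegexpTokenizer` in its default (non-gap) mode for simple input such as this. The sample sentence is made up for illustration:

```python
import re

SIMPLE = r'\w+'
COMPLEX = r'\w+|\$[\d\.]+|\S+'

sample = 'pray 3 times, costs $5.00!'

simple_tokens = re.findall(SIMPLE, sample)
complex_tokens = re.findall(COMPLEX, sample)

print(simple_tokens)   # ['pray', '3', 'times', 'costs', '5', '00']
print(complex_tokens)  # ['pray', '3', 'times', ',', 'costs', '$5.00', '!']
```

The simple pattern splits `$5.00` into `5` and `00`, while the second pattern keeps the amount whole and also emits punctuation marks as separate tokens.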

Here, four columns are created, each with a slightly different version of the entry's text. Creating them at the start makes it possible to track how each preprocessing choice affects the model results as early as possible. The versions are:

- the words of the combined title and post body, lemmatized
- the words of the combined title and post body, stemmed
- the words and potentially significant numbers of the combined title and post body, lemmatized
- the words and potentially significant numbers of the combined title and post body, stemmed

In [8]:
df['simp_lem'] = df['unstopped'].map(lambda x: ' '.join(tok1.tokenize(x)))
df['simp_lem'] = df['simp_lem'].map(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))

df['simp_stem'] = df['unstopped'].map(lambda x: ' '.join(tok1.tokenize(x)))
df['simp_stem'] = df['simp_stem'].map(lambda x: ' '.join([p_stemmer.stem(word) for word in x.split()]))

df['comp_lem'] = df['unstopped'].map(lambda x: ' '.join(tok2.tokenize(x)))
df['comp_lem'] = df['comp_lem'].map(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))

df['comp_stem'] = df['unstopped'].map(lambda x: ' '.join(tok2.tokenize(x)))
df['comp_stem'] = df['comp_stem'].map(lambda x: ' '.join([p_stemmer.stem(word) for word in x.split()]))
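The four columns above repeat the same tokenize-then-normalize pattern, so it could be factored into one helper. This is a sketch under assumptions: `build_variant` is a hypothetical name, `re.findall` stands in for the NLTK tokenizers, and `crude_stem` is a deliberately simplistic placeholder, not a real stemmer:

```python
import re

def build_variant(text, pattern, normalize):
    """Tokenize text with a regex, normalize each token, and rejoin with spaces."""
    tokens = re.findall(pattern, text)
    return ' '.join(normalize(tok) for tok in tokens)

def crude_stem(word):
    # Toy stand-in for a real stemmer: strips a single trailing 's' only.
    return word[:-1] if word.endswith('s') and len(word) > 3 else word

text = 'saints, icons & 2 prayers'
simple = build_variant(text, r'\w+', crude_stem)
complex_ = build_variant(text, r'\w+|\$[\d\.]+|\S+', crude_stem)

print(simple)    # saint icon 2 prayer
print(complex_)  # saint , icon & 2 prayer
```

With NLTK available, `normalize` could instead be `lemmatizer.lemmatize` or `p_stemmer.stem` from the earlier cell.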

The target feature is binarized:

In [9]:
df['subreddit'] = df['subreddit'].map({'Catholicism' : 1, 'OrthodoxChristianity' : 0})
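One thing worth checking after a dictionary `map` is that every label was matched, since values missing from the dictionary become `NaN`. The labels in this sketch are made-up examples, including a deliberately miscased one:

```python
import pandas as pd

# Hypothetical labels; the last one is miscased on purpose.
labels = pd.Series(['Catholicism', 'OrthodoxChristianity', 'catholicism'])
mapped = labels.map({'Catholicism': 1, 'OrthodoxChristianity': 0})

# Any label not found in the dictionary shows up as NaN.
unmapped = int(mapped.isnull().sum())
print(unmapped)  # 1
```

A count of zero would confirm that the target column contains only 0 and 1.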

The result is written to a `.csv` file in the `datasets` directory:

In [10]:
df.to_csv('../datasets/main.csv')
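By default `to_csv` also writes the index, which comes back as an extra `Unnamed: 0` column when the file is read again. If the index is not needed downstream, passing `index=False` avoids this. The sketch uses an in-memory buffer and a toy frame, both assumptions for illustration:

```python
import io
import pandas as pd

toy = pd.DataFrame({'alltext': ['some post'], 'subreddit': [1]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)

# index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_default.columns))  # ['Unnamed: 0', 'alltext', 'subreddit']
print(list(reloaded_clean.columns))    # ['alltext', 'subreddit']
```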