# Project 3: Advanced Running Retargeting using NLP

---


## Part 2: Initial Cleaning

This section covers the initial cleaning of the raw Reddit datasets. The raw r/AdvancedRunning dataset contained 13454 rows and 92 columns, and the raw r/C25K dataset contained 15037 rows and 96 columns.

Posts whose text was '[removed]' were dropped from both datasets. A new DataFrame was then created for each subreddit containing 'author' and a single 'post' column, built by merging 'title' and 'selftext'. Rows with missing values were dropped, followed by duplicate posts (keeping the first occurrence). Subreddit moderators were identified and their posts were dropped from each dataset.

After this cleaning, the r/AdvancedRunning dataset contained 8440 rows and the r/C25K dataset contained 5132 rows. A target column, 'is_advanced', was added to each dataset, where 1 indicates r/AdvancedRunning and 0 indicates r/C25K. The two cleaned datasets were then concatenated into one dataset to be used for pre-processing in the following section.

In [28]:
#imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [26]:
#Read in raw csv

raw_advanced_running=pd.read_csv('../data/raw_advanced_running.csv', low_memory=False)
raw_couch_5k=pd.read_csv('../data/raw_couch_5k.csv', low_memory=False)

#### Initial Data Inspection

In [27]:
#Look at data
raw_advanced_running.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,author_id,brand_safe,media,media_embed,secure_media,secure_media_embed,approved_at_utc,banned_at_utc,view_count,author_created_utc
0,[],False,Brojadyn2006,,[],,text,t2_73m74dx2,False,False,...,,,,,,,,,,
1,[],False,Tea-reps,plain,[],"28F, 17:59 5K / 36:33 10K",text,t2_co242mdb,False,False,...,,,,,,,,,,
2,[],False,Caffeinated262,,[],,text,t2_lbkrpopb,False,False,...,,,,,,,,,,
3,[],False,blueheeler9,,[],,text,t2_mkehpd7,False,False,...,,,,,,,,,,
4,[],False,zzach_519,,[],,text,t2_4homla04,False,False,...,,,,,,,,,,


In [5]:
#Look at data
raw_couch_5k.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,updated_utc,og_description,og_title,rte_mode,brand_safe,banned_by,approved_at_utc,author_created_utc,banned_at_utc,suggested_sort
0,[],False,Maleficent-Sock-6800,,[],,text,t2_lnrcxghk,False,False,...,,,,,,,,,,
1,[],False,BbNowSayMyNamebB,done,[],DONE!,text,t2_3wjl49d,False,False,...,,,,,,,,,,
2,[],False,FeeValuable22,,[],,text,t2_bjlogeog,False,False,...,,,,,,,,,,
3,[],False,Resident_Ad_4004,,[],,text,t2_7jljjuv5,False,False,...,,,,,,,,,,
4,[],False,C25k_bot,,[],,text,t2_148nft,False,False,...,,,,,,,,,,


In [25]:
#View shape of data
print(raw_advanced_running.shape)
print(raw_couch_5k.shape)

(13454, 92)
(15037, 96)


#### Removing posts whose text is '[removed]'

In [7]:
#Filtering out [removed] posts and re-assigning to the same df
raw_advanced_running=raw_advanced_running.loc[raw_advanced_running['selftext']!='[removed]']
raw_couch_5k=raw_couch_5k.loc[raw_couch_5k['selftext']!='[removed]']
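
One detail of this filter: rows whose 'selftext' is missing are kept, because comparing NaN to a string with `!=` evaluates to True. A minimal sketch of that behavior, using a made-up `toy` frame (hypothetical data, not taken from the project files):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a raw subreddit dump
toy = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'selftext': ['some text', '[removed]', np.nan],
})

# Same style of filter as the cell above: only the literal '[removed]' row is dropped
kept = toy.loc[toy['selftext'] != '[removed]']

# The row with missing selftext survives this step
n_kept = len(kept)
n_missing = int(kept['selftext'].isna().sum())
print(n_kept, n_missing)
```

Those missing-text rows are handled later in this notebook, when rows with missing values are dropped.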

#### Creating new DataFrames with wanted columns
1. Author, Title and Post
2. Making Title and Post into one all-text column 

In [8]:
#Making new dataframes
raw_advanced=pd.DataFrame()
raw_couch5k=pd.DataFrame()

In [9]:
#Creating dataframes with wanted information 
raw_advanced['author']=raw_advanced_running['author']
raw_advanced['post']=raw_advanced_running['title'] + ' ' + raw_advanced_running['selftext']
raw_couch5k['author']=raw_couch_5k['author']
raw_couch5k['post']=raw_couch_5k['title'] + ' ' + raw_couch_5k['selftext']

In [10]:
#Checking 
raw_advanced.head()

Unnamed: 0,author,post
0,Brojadyn2006,"Further college running Hello, so I was wonder..."
1,Tea-reps,Race Report: Big breakthrough at the Boston Ha...
2,Caffeinated262,Garden of Life Palm Beaches Marathon I have th...
3,blueheeler9,2022 BAA Half Marathon | Wet &amp; Glorious 1:...
4,zzach_519,2022 Berkeley Half race report ### Race Inform...


In [11]:
#Checking
raw_couch5k.head()

Unnamed: 0,author,post
0,Maleficent-Sock-6800,"The ones going from outside to a treadmill, Wa..."
1,BbNowSayMyNamebB,
2,FeeValuable22,
3,Resident_Ad_4004,Advice Needed So I started C25K 6 weeks early ...
4,C25k_bot,[WEEKLY THREAD] RANT WEDNESDAYS Things that ma...
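
The blank 'post' values in rows 1 and 2 are likely a result of how pandas concatenates strings: if either 'title' or 'selftext' is NaN, the combined value is NaN. The sketch below illustrates this on made-up data (the `toy` frame is hypothetical). It also shows one possible alternative, filling missing text with an empty string so that title-only posts are kept; this is an assumption for illustration and not what this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second post has a title but no body text
toy = pd.DataFrame({
    'title': ['W1D1 done', 'Photo post'],
    'selftext': ['Felt great today', np.nan],
})

# Concatenation as in the notebook: any NaN operand makes the result NaN
post_strict = toy['title'] + ' ' + toy['selftext']

# Hypothetical alternative: treat missing selftext as empty so the title survives
post_lenient = (toy['title'] + ' ' + toy['selftext'].fillna('')).str.strip()

print(post_strict.isna().tolist())
print(post_lenient.tolist())
```

With the notebook's approach, title-only posts become missing values and are removed in the next step, so every remaining post has body text.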


#### Checking for Null Values and Data Types

In [12]:
#Checking for nulls and datatypes 
raw_advanced.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13454 entries, 0 to 15087
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   author  13454 non-null  object
 1   post    11710 non-null  object
dtypes: object(2)
memory usage: 315.3+ KB


In [13]:
#Checking for nulls and datatypes 
raw_couch5k.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15037 entries, 0 to 15078
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   author  15037 non-null  object
 1   post    9804 non-null   object
dtypes: object(2)
memory usage: 352.4+ KB


#### Dropping Missing Values 

In [14]:
#Removing rows with missing values 
raw_advanced.dropna(inplace=True)
raw_couch5k.dropna(inplace=True)

In [15]:
#Checking
raw_advanced.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11710 entries, 0 to 15087
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   author  11710 non-null  object
 1   post    11710 non-null  object
dtypes: object(2)
memory usage: 274.5+ KB


In [16]:
#Checking
raw_couch5k.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9804 entries, 0 to 15078
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   author  9804 non-null   object
 1   post    9804 non-null   object
dtypes: object(2)
memory usage: 229.8+ KB


#### Dropping duplicate posts in each dataset, keeping the first occurrence of each duplicate

In [17]:
#Dropping duplicates 
raw_advanced.drop_duplicates(subset='post', keep='first', inplace=True)

raw_couch5k.drop_duplicates(subset='post', keep='first', inplace=True)

In [18]:
#Checking shape after dropping duplicates
print(raw_advanced.shape)
print(raw_couch5k.shape)

(9238, 2)
(5155, 2)
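
As a small illustration of what `drop_duplicates(subset='post', keep='first')` does, the sketch below uses a made-up `toy` frame (hypothetical authors and text). Rows are compared on the exact 'post' string only, so reposts by different authors count as duplicates, while posts that differ in case or whitespace do not.

```python
import pandas as pd

# Hypothetical toy data: two authors posted identical text
toy = pd.DataFrame({
    'author': ['u1', 'u2', 'u3'],
    'post': ['same text', 'same text', 'different text'],
})

# Keep the first row for each distinct 'post' value, drop later repeats
deduped = toy.drop_duplicates(subset='post', keep='first')

print(deduped['author'].tolist())
```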


#### Dropping moderator posts from each dataset
- Moderators were identified in each subreddit and rows containing moderator posts were dropped from both datasets

In [19]:
#Dropping mods by selecting rows whose author is a mod in both df
advanced_mods=['brwalkernc', 'CatzerzMcGee', 'aewillia', 'AutoModerator', 'ruinawish']
couch5k_mods=['AshKals', 'cainunable', 'C25k_bot']

raw_advanced.drop(raw_advanced.index[raw_advanced['author'].isin(advanced_mods)], inplace=True)
raw_couch5k.drop(raw_couch5k.index[raw_couch5k['author'].isin(couch5k_mods)], inplace=True)

#### Checking final shape for each dataframe
- Both subreddits retained enough posts for modeling. The classes are imbalanced (8440 vs. 5132 posts), so extra care will be taken during modeling

In [20]:
#Checking final shape of both dfs
print(raw_advanced.shape)
print(raw_couch5k.shape)

(8440, 2)
(5132, 2)
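
To put the imbalance in numbers, the sketch below computes each subreddit's share of the combined data from the two row counts printed above. The variable names are illustrative only.

```python
import pandas as pd

# Row counts taken from the shapes printed above
counts = pd.Series({'r/AdvancedRunning': 8440, 'r/C25K': 5132})

# Share of each class in the combined dataset
shares = (counts / counts.sum()).round(3)

print(shares)
```

The split is roughly 62% to 38%, so a model that always predicts r/AdvancedRunning would already be about 62% accurate. That figure is a useful baseline when evaluating classifiers later.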


#### Creating a binary target column for classification

In [21]:
#Creating target column for classification: 1 for r/AdvancedRunning, 0 for r/C25K
raw_advanced['is_advanced']=1
raw_couch5k['is_advanced']=0

#### Concatenating both datasets into one dataset for pre-processing

In [24]:
#concat subreddits together 
clean_runners=pd.concat([raw_advanced, raw_couch5k], axis=0)
clean_runners.tail()

Unnamed: 0,author,post,is_advanced
10974,underblueskies,"Oh yeah, this is where I used to walk (W7D1) :...",0
10975,AngryCanadian89,W2D1 Down After taking an almost 2 month hiatu...,0
10976,[deleted],I want to run...but not with my phone! Hi ever...,0
10977,kilted79,Unscheduled.... Looks like my body wants to me...,0
10978,AlexOfCanada,How and when do you stretch? Right now I stret...,0


In [23]:
#Saving cleaned dataset
clean_runners.to_csv('clean_runners.csv', index=False)

---
#### Next Section: Part 3 - Pre-Processing