# Text Cleaning
---

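### This notebook accomplishes the following:

- Reads in the four datasets created in the first notebook
- Drops removed, deleted, and null posts and comments
- Drops bot messages
- Isolates the `subreddit`, `title`, and `text` columns
- Strips links, digits, punctuation, and other stray characters from the text
- Saves the cleaned data for modeling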

---

In [1]:
# Imports
import pandas as pd
import numpy as np
import re

---
### Read in the data files
We can read in the `.csv` files we created in the first notebook. We're working with four datasets in total, so each step in the cleaning process is applied four times. To reduce clutter, I later collect the dataframes in a list and loop through it.

In [2]:
dmacademy_df = pd.read_csv('../data/dmacademy.csv')
truezelda_df = pd.read_csv('../data/truezelda.csv')
poli_dis_2012_df = pd.read_csv('../data/poli_dis_2012.csv')
poli_dis_2020_df = pd.read_csv('../data/poli_dis_2020.csv')
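
As a possible alternative to four separate `read_csv` calls, here is a minimal sketch that loads the files into a dictionary keyed by name. The helper `load_frames` is hypothetical (it is not part of this notebook), and it assumes each file is named `<name>.csv` inside one data directory.

```python
from pathlib import Path

import pandas as pd


def load_frames(data_dir, names):
    """Read <data_dir>/<name>.csv for each name into a dict of dataframes.

    Hypothetical helper: assumes every dataset lives in one directory
    and follows the '<name>.csv' naming used in this notebook.
    """
    data_dir = Path(data_dir)
    return {name: pd.read_csv(data_dir / f"{name}.csv") for name in names}


# Example call (left commented out because it needs the real files):
# frames = load_frames('../data', ['dmacademy', 'truezelda',
#                                  'poli_dis_2012', 'poli_dis_2020'])
# for name, frame in frames.items():
#     print(name, frame.shape)
```

A dictionary keeps each name attached to its dataframe, so a separate list of names is not needed when printing per-dataset summaries.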

---
### Check for removed or deleted posts

One column that caught my eye in these dataframes is the `removed_by_category` column. Each row of this column contains one of six unique values: 'reddit', 'moderator', 'author', 'deleted', 'automod_filtered', or a null value. Rows where this column contains anything other than a null are posts that have been removed from the subreddit. This means the selftext contains only the placeholder `[removed]` or `[deleted]`, or is null. 

Let's see how many rows in each dataframe might be posts that have been removed or deleted.

In [3]:
# Inspect one of our dataframes
dmacademy_df

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,link_flair_css_class,post_hint,preview,author_flair_background_color,author_flair_text_color,removed_by_category,gilded,author_cakeday,suggested_sort,banned_by
0,[],False,Vashael,,[],,text,t2_7p5eovfe,False,False,...,,,,,,,,,,
1,[],False,Atarihero76,,[],,text,t2_48oeurxu,False,False,...,Guide,self,"{'enabled': False, 'images': [{'id': '9XSNOgfA...",,,,,,,
2,[],False,JethroBuldean,,[],,text,t2_1plx2r4z,False,False,...,,,,,,,,,,
3,[],False,Mechaaniac,,[],,text,t2_a04c8tpe,False,False,...,,,,,,,,,,
4,[],False,Hungerforhuman,,[],,text,t2_a2ouksrx,False,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,[],False,AngelsJos,,[],,text,t2_9571c28u,False,True,...,,,,,,,,,,
4996,[],False,Light_of_Avalon,,[],,text,t2_871qw,False,False,...,,,,,,,,,,
4997,[],False,Randoff-Runemaker,,[],,text,t2_ab1yxy0e,False,False,...,,,,,,,,,,
4998,[],False,Shatyel,,[],,text,t2_1v4sccw1,False,False,...,,,,,,,,,,


In [4]:
dmacademy_df.rename(columns={'selftext': 'text'}, inplace=True)
truezelda_df.rename(columns={'selftext': 'text'}, inplace=True)
poli_dis_2012_df.rename(columns={'body': 'text'}, inplace=True)
poli_dis_2020_df.rename(columns={'body': 'text'}, inplace=True)

In [5]:
for df in [dmacademy_df, truezelda_df]:
    print(df['removed_by_category'].unique())
    print(df[df['removed_by_category'].notnull()]['text'].unique())
    print('\n')

[nan 'deleted' 'moderator' 'reddit']
['[deleted]' '[removed]' nan]


[nan 'reddit' 'automod_filtered' 'deleted' 'moderator']
['[removed]' '[deleted]' nan]
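
The `unique()` calls above show which values occur but not how often. A minimal sketch of one way to get counts, and to cross-check each category against the text it leaves behind, is below. The dataframe `toy_df` is made-up stand-in data, not the real Reddit pull.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one of the post dataframes (values are made up)
toy_df = pd.DataFrame({
    'text': ['A real post', '[removed]', '[deleted]', np.nan, 'Another real post'],
    'removed_by_category': [np.nan, 'moderator', 'deleted', 'deleted', np.nan],
})

# Count every category, keeping nulls (posts that are still up)
category_counts = toy_df['removed_by_category'].value_counts(dropna=False)
print(category_counts)

# Cross-tabulate category against text, with nulls made visible as labels
category = toy_df['removed_by_category'].fillna('<still up>')
text = toy_df['text'].fillna('<null>')
category_vs_text = pd.crosstab(category, text)
print(category_vs_text)
```

On the real data, any non-null category paired with ordinary text would be worth a closer look before dropping rows.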




Based on this information, I decided to remove all rows where `selftext` or `body` is deleted, removed, or null. Unfortunately, this means dropping hundreds of rows, but I intentionally pulled more data than I thought necessary to plan for this possibility. I also renamed both of these columns to `text` earlier to make the dataframes easier to iterate over later! 

Below, I print out examples of nulls, deleted, and removed posts in the DMAcademy data. All of these rows still have titles, but including the titles will give us the impression later that our classes are more balanced than they truly are, so I'd rather drop them, too.

In [6]:
dmacademy_df[dmacademy_df['text'].isnull()][['text','title','removed_by_category']].head()

Unnamed: 0,text,title,removed_by_category
775,,Giving depth to your random encounters,deleted
937,,New DM,deleted
3190,,How should I balance low stake/high stake enco...,deleted
3318,,"Тhеrе is а Yоutubеr cаlled ""Lеon Farаdаy"" that...",deleted


In [7]:
dmacademy_df.loc[dmacademy_df['text']=='[deleted]'][['text','title','removed_by_category']].head()

Unnamed: 0,text,title,removed_by_category
21,[deleted],I need help can I ask some assistance.,deleted
38,[deleted],What to do with an unhelpful player.,deleted
62,[deleted],(popular question) Playing after LMoP for newb...,deleted
78,[deleted],how should i go about this session if there I'...,deleted
81,[deleted],Menacing music for Villains?,deleted


In [8]:
dmacademy_df.loc[dmacademy_df['text']=='[removed]'][['text','title','removed_by_category']].head()

Unnamed: 0,text,title,removed_by_category
47,[removed],Who is the Queen of Air and Darkness? What hap...,moderator
89,[removed],A helpful video about how to design a balanced...,moderator
224,[removed],Character backstory assist.,moderator
227,[removed],New player to Dnd/ build help pls,moderator
241,[removed],My friend asked me if I knew much about the Ba...,moderator


First, let's see just how many rows we'll be dropping from each dataframe.

In [9]:
df_names_list = ['dmacademy_df', 'truezelda_df', 'poli_dis_2012_df', 'poli_dis_2020_df']
df_list = [dmacademy_df, truezelda_df, poli_dis_2012_df, poli_dis_2020_df]

In [10]:
for i, dataframe in enumerate(df_list):    
    # add up number of rows that contain 'deleted' or 'removed' 
    num_nulls = 0
    num_nulls += len(dataframe.loc[(dataframe['text']=='[deleted]') | (dataframe['text']=='[removed]')])
    num_nulls += dataframe['text'].isnull().sum()
    
    print(f'{df_names_list[i]} contains {num_nulls} empty posts')

dmacademy_df contains 113 empty posts
truezelda_df contains 884 empty posts
poli_dis_2012_df contains 267 empty posts
poli_dis_2020_df contains 238 empty posts


---
### Drop rows with removed, deleted, or null posts or comments

Now that we feel confident we've identified all the empty posts, let's drop them.

In [11]:
for i, df in enumerate(df_list):
    
    # Drop rows where posts are nulls
    df.dropna(axis=0, inplace=True, subset=['text'])

    # Drop rows where posts were deleted
    deleted_rows = df.loc[df['text']=='[deleted]'].index
    df.drop(deleted_rows, inplace=True, axis=0)
    
    # Drop rows where posts were removed
    removed_rows = df.loc[df['text']=='[removed]'].index
    df.drop(removed_rows, inplace=True, axis=0)
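
The three drops above could also be expressed as one boolean mask. The sketch below is an alternative, not the code this notebook ran. `drop_empty_posts` is a hypothetical helper, and it returns a new dataframe instead of modifying the original in place.

```python
import numpy as np
import pandas as pd

# Placeholder strings seen in the removed/deleted rows above
PLACEHOLDERS = ['[deleted]', '[removed]']


def drop_empty_posts(df, text_col='text'):
    """Return a copy of df without null, '[deleted]', or '[removed]' text."""
    is_empty = df[text_col].isna() | df[text_col].isin(PLACEHOLDERS)
    return df.loc[~is_empty].copy()


# Toy data standing in for a real dataframe
toy_df = pd.DataFrame({
    'text': ['Keep me', '[deleted]', np.nan, '[removed]', 'Keep me too'],
})
cleaned = drop_empty_posts(toy_df)
print(len(toy_df) - len(cleaned), 'rows dropped')
```

Because this version does not use `inplace=True`, any list holding the old dataframes would need to be rebuilt from the returned copies.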

In [12]:
for i, dataframe in enumerate(df_list):    
    # add up number of rows that contain 'deleted' or 'removed' 
    num_nulls = 0
    num_nulls += len(dataframe.loc[(dataframe['text']=='[deleted]') | (dataframe['text']=='[removed]')])
    num_nulls += dataframe['text'].isnull().sum()
    
    print(f'{df_names_list[i]} contains {num_nulls} empty posts')

dmacademy_df contains 0 empty posts
truezelda_df contains 0 empty posts
poli_dis_2012_df contains 0 empty posts
poli_dis_2020_df contains 0 empty posts


In [13]:
for df in df_list:
    print(len(df))

4887
4112
4733
4762


Our dataframes are significantly smaller now, but so much cleaner! Yay!

---
### Check for bot messages
I don't anticipate finding any bot messages in the truezelda and DMAcademy posts, but I've already noticed some in the political discussion comments. 

Unfortunately, I don't know all the forms that bot messages can take, so I may not be able to find them all. Let's remove what we can find so that our model isn't informed by repetitive bot messages. 

In [14]:
for i, df in enumerate(df_list):
    print(df_names_list[i])
    print(df.loc[df['text'].str.contains('Hello, /u/')]['text'], '\n')
    print(df.loc[df['text'].str.contains('I am a bot')]['text'], '\n')

dmacademy_df
Series([], Name: text, dtype: object) 

Series([], Name: text, dtype: object) 

truezelda_df
Series([], Name: text, dtype: object) 

Series([], Name: text, dtype: object) 

poli_dis_2012_df
Series([], Name: text, dtype: object) 

Series([], Name: text, dtype: object) 

poli_dis_2020_df
37      Hello, /u/PMmeURsprintPROGRAMS. Thanks for con...
38      Hello, /u/pleasedontbullyme_. Thanks for contr...
39      Hello, /u/chickenman86. Thanks for contributin...
40      Hello, /u/dumbirds. Thanks for contributing! U...
67      Hello, /u/Theduder89. Thanks for contributing!...
                              ...                        
4987    Hello, /u/cc_hk. Thanks for contributing! Unfo...
4994    Hello, /u/Realtalkdo3. Thanks for contributing...
4995    Hello, /u/Gransanto102. Thanks for contributin...
4996    Hello, /u/SocialObserver3802. Thanks for contr...
4997    Hello, /u/SthenicFreeze. Thanks for contributi...
Name: text, Length: 182, dtype: object 

1       [A reminder f

Let's investigate a couple of these messages to make sure they're just bots.

In [15]:
poli_dis_2020_df['text'][37]



In [16]:
poli_dis_2020_df['text'][1]

"[A reminder for everyone](https://www.reddit.com/r/PoliticalDiscussion/comments/4479er/rules_explanations_and_reminders/). This is a subreddit for genuine discussion:\n\n* Don't post low effort comments like joke threads, memes, slogans, or links without context.\n* Help prevent this subreddit from becoming an echo chamber. Please don't downvote comments with which you disagree.\n* The downvote and report buttons are not disagree buttons.  Please don't use them that way.\n\nViolators will be fed to the bear.\n\n---\n\n*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/PoliticalDiscussion) if you have any questions or concerns.*"

Yup, looks like nonsense that we don't want! Let's drop these rows.

In [17]:
for i, df in enumerate(df_list):

    # Drop automated removal notices that start with 'Hello, /u/<username>'
    hello_bots = df.loc[df['text'].str.contains('Hello, /u/')]['text'].index
    df.drop(hello_bots, inplace=True, axis=0)
    
    # Drop messages that identify themselves with 'I am a bot'
    iamabot_bots = df.loc[df['text'].str.contains('I am a bot')]['text'].index
    df.drop(iamabot_bots, inplace=True, axis=0)
    
    print(f'{df_names_list[i]} is {len(df)} rows')

dmacademy_df is 4887 rows
truezelda_df is 4112 rows
poli_dis_2012_df is 4733 rows
poli_dis_2020_df is 4309 rows


Here are the final sizes of our datasets!
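
If more bot phrases turn up later, the two searches above could be folded into one pattern. This is a minimal sketch under that assumption. `BOT_PHRASES` and `drop_bot_messages` are hypothetical names, and the toy comments are made up.

```python
import re

import pandas as pd

# The two phrases searched for above; re.escape keeps them literal
BOT_PHRASES = ['Hello, /u/', 'I am a bot']
BOT_PATTERN = '|'.join(re.escape(phrase) for phrase in BOT_PHRASES)


def drop_bot_messages(df, text_col='text'):
    """Return a copy of df without rows whose text matches a bot phrase."""
    # na=False treats missing text as "not a bot" instead of raising
    is_bot = df[text_col].str.contains(BOT_PATTERN, regex=True, na=False)
    return df.loc[~is_bot].copy()


toy_df = pd.DataFrame({'text': [
    'Hello, /u/example_user. Thanks for contributing! Unfortunately...',
    'I think the turnout numbers matter more here.',
    '*I am a bot, and this action was performed automatically.*',
]})
no_bots = drop_bot_messages(toy_df)
print(len(no_bots), 'rows kept')
```

One caveat with any phrase-based filter: a human comment that happens to quote one of these phrases would be dropped too.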

---
### Isolate the subreddit, title, and text columns 
After dropping empty posts and bot messages, our dataframes still contain a lot of data we don't need in all of those extra columns. Since this is a project on Natural Language Processing (NLP), I'm only going to keep the `subreddit` label plus the title and main text (stored in the `title` and `text` columns, respectively) for every post. The political discussion data consists of comments, so for those dataframes I keep only `subreddit` and `text`. We can isolate these columns in new dataframes so that we aren't working with so much extraneous data.

In [18]:
dmacademy_df = dmacademy_df[['subreddit', 'title', 'text']].copy()
truezelda_df = truezelda_df[['subreddit', 'title', 'text']].copy()
poli_dis_2012_df = poli_dis_2012_df[['subreddit', 'text']].copy()
poli_dis_2020_df = poli_dis_2020_df[['subreddit', 'text']].copy()

In [19]:
dmacademy_df

Unnamed: 0,subreddit,title,text
0,DMAcademy,Seeking seasoned DMs to be guests on interview...,**Edit (UPDATE): Thank you for the robust resp...
1,DMAcademy,"TERRAIN, and Using it Effectively","TERRAIN, and using it Effectively – DM Tips\n\..."
2,DMAcademy,Know the exact location of something,The players are planning on dropping an evil a...
3,DMAcademy,How to run military basic as a session,I am running a campaign for all intents and pu...
4,DMAcademy,Best time to take breaks/how long they should be,Hey just a newbie DM .My sessions are usually ...
...,...,...,...
4995,DMAcademy,🐟 Urgently need DM support to appropriately en...,"Hi all, I am hoping for some advice please! I ..."
4996,DMAcademy,Do we help with plot here?,Mouseketeers go away\n\nSo i’m stuck on a plot...
4997,DMAcademy,Where to go next? Open to suggestions.,I have my first group about to finish Lost Min...
4998,DMAcademy,NPCs - Playing against type,I'm pretty new to DMing and take very long to ...


---
### Inspect text for odd characters

There are a lot of funky characters and strings in our text that we might want to consider removing before we tokenize and vectorize our data for model fitting. This includes typical relics like `\n` and `&amp;`, but there are also some unseemly strings like `#x200B;`, as well as a lot of YouTube links. Some of these might be filtered out under the hood in our models, but I'm not an expert in NLP classification models, so I'm going to do a bit of cleaning myself. Otherwise I won't be able to say with confidence that this text data was properly processed. Since I'm not an expert, though, there is probably something I've missed.

First, let's print a few posts from a single dataframe to skim for funky strings.

In [20]:
print(dmacademy_df['text'][1])

TERRAIN, and using it Effectively – DM Tips

Using terrain Effectively Video:

[https://youtu.be/AnpNtWTIX2Q](https://youtu.be/AnpNtWTIX2Q)

Hey folks, I’d like to share with you some advice, in video and written form, on the use of Terrain in your Tabletop RPGs.

I see a lot of questions and suggestions on adding terrain to your combats and skill challenges, but just plopping down some environment features is not the end of the technique, it is the beginning. Here I will spell out definitions and techniques for how to actually go about making terrain that is effective and will add drama to your encounters/scenes.

The official dictionary definition of TERRAIN is: a stretch of land, especially with regard to its physical features

How that relates to DnD (and other RPG settings) in my mind, is to redefine Terrain as: Anything in a scene or setting that is not a creature, (but also sometimes a creature can be a terrain feature...).

So Terrain in an RPG is not only the swamps, treetops,

The post above showcases many of the examples I was describing before, including several YouTube links and one `&amp;#x200B;`. Let's use some regular expressions to replace these with empty strings.

In [21]:
poli_dis_2012_df['text'][14]

'Because this man was part of an invading force. If Russia were occupying our country and all you\'d already heard stories of torture and murder of innocents, had relatives kicked out of their houses while the military searched for people who weren\'t there, and heard of entire towns blown up as part of a "clearing exercise", you wouldn\'t blame the invading force instead of just the individual that killed your family?\n\nEdit: fixed a typo'

Here we have newline characters (`\n`), and tabs (`\t`) show up in the data too. 
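
Skimming individual posts only goes so far. A minimal sketch of how one might survey a whole column for HTML entities and links is below. The patterns `ENTITY_RE` and `URL_RE` are my own rough assumptions about what these strings look like, and `toy_text` is made-up data.

```python
from collections import Counter

import pandas as pd

# Rough patterns (assumptions): entities like &amp; or &amp;#x200B;, and bare URLs
ENTITY_RE = r'&\w+;(?:#\w+;)?'
URL_RE = r'https?://\S+'

toy_text = pd.Series([
    'Tips &amp; tricks: https://youtu.be/AnpNtWTIX2Q',
    '&amp;#x200B;\n\nSecond paragraph',
    'Nothing odd here',
])

# Tally every entity-like string found anywhere in the column
entity_counts = Counter(
    match for matches in toy_text.str.findall(ENTITY_RE) for match in matches
)
# Count how many texts contain at least one link
n_with_urls = int(toy_text.str.contains(URL_RE).sum())

print(entity_counts.most_common())
print(n_with_urls, 'texts contain a link')
```

Run on a real `text` column, the most common entries give a short list of strings to target with regular expressions.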

---
### Remove links, digits, punctuation, and many other small things 

In [22]:
# I still have a lot to learn about regular expressions and string processing, 
# so I relied on my classmate Amir and instructor John Hazard to help me with this code!

# Rebuild the list: the names now point to the smaller dataframes created in In [18]
df_list = [dmacademy_df, truezelda_df, poli_dis_2012_df, poli_dis_2020_df]

# Iterate over each dataframe
# Note: df.replace acts on every column, so `subreddit` and `title` are processed too
for df in df_list:
    
    # remove links (raw strings avoid invalid-escape warnings)
    # https://stackoverflow.com/questions/51994254/removing-url-from-a-column-in-pandas-dataframe/51994366
    df.replace(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', ' ', regex=True, inplace=True)
    
    # remove digits, '&amp;', and forward slashes
    # https://stackoverflow.com/questions/41719259/how-to-remove-numbers-from-string-terms-in-a-pandas-dataframe
    df.replace(r'\d+', '', regex=True, inplace=True) # help from Amir! 
    df.replace(r'&amp;', ' ', regex=True, inplace=True)
    df.replace(r'/', ' ', regex=True, inplace=True)
    
    # remove tab, newline, and carriage return characters,
    # both literal backslash sequences and the real control characters
    # https://gist.github.com/smram/d6ded3c9028272360eb65bcab564a18a
    df.replace(to_replace=[r'\\t|\\n|\\r', '\t|\n|\r'], value=[' ',' '], regex=True, inplace=True)
    
    # remove punctuation and any other character that is not an ASCII letter or a space
    # From Amir! 
    df.replace(r'[^a-zA-Z ]\s?',' ',regex=True, inplace=True)

In [23]:
print(dmacademy_df['text'][1])

TERRAIN and using it Effectively  DM Tips  Using terrain Effectively Video     Hey folks I d like to share with you some advice in video and written form on the use of Terrain in your Tabletop RPGs  I see a lot of questions and suggestions on adding terrain to your combats and skill challenges but just plopping down some environment features is not the end of the technique it is the beginning Here I will spell out definitions and techniques for how to actually go about making terrain that is effective and will add drama to your encounters scenes  The official dictionary definition of TERRAIN is a stretch of land especially with regard to its physical features  How that relates to DnD  and other RPG settings in my mind is to redefine Terrain as Anything in a scene or setting that is not a creature  but also sometimes a creature can be a terrain feature      So Terrain in an RPG is not only the swamps treetops bushes boulders weather it is anything and every physical thing in a scene Thi

In [24]:
poli_dis_2012_df['text'][14]

'Because this man was part of an invading force If Russia were occupying our country and all you d already heard stories of torture and murder of innocents had relatives kicked out of their houses while the military searched for people who weren t there and heard of entire towns blown up as part of a  clearing exercise  you wouldn t blame the invading force instead of just the individual that killed your family  Edit fixed a typo'

This may not be a perfect job but it looks MUCH better! 
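
To see where the imperfections come from, it can help to trace the same steps on one short string. The sketch below reuses the patterns from the cleaning cell with plain `re.sub`. The sample string is made up, and `trace_cleaning` is a hypothetical helper for illustration only.

```python
import re

# Same patterns as the cleaning cell, applied in the same order
STEPS = [
    ('links', r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', ' '),
    ('digits', r'\d+', ''),
    ('&amp;', r'&amp;', ' '),
    ('slashes', r'/', ' '),
    ('escaped whitespace', r'\\t|\\n|\\r', ' '),
    ('whitespace', r'\t|\n|\r', ' '),
    ('non-letters', r'[^a-zA-Z ]\s?', ' '),
]


def trace_cleaning(text):
    """Apply each step in turn and return a list of (step name, text) pairs."""
    history = []
    for name, pattern, replacement in STEPS:
        text = re.sub(pattern, replacement, text)
        history.append((name, text))
    return history


# Made-up sample with a link, a contraction, an entity, and a digit
sample = "I'd watch https://youtu.be/AnpNtWTIX2Q\n\n&amp;#x200B;\n\n3 tips"
history = trace_cleaning(sample)
for name, text in history:
    print(f'{name:>20}: {text!r}')

final_tokens = history[-1][1].split()
```

In this trace, the contraction `I'd` ends up as the two tokens `I` and `d`, and `&amp;#x200B;` leaves a stray `xB` behind because the digits are stripped before the entity is handled. Removing entities before digits is one possible tweak, though I haven't tested its effect on the models.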

---
### Save processed data for modeling

In [25]:
dmacademy_df.to_csv('../data/clean_dmacademy.csv', index=False)
truezelda_df.to_csv('../data/clean_truezelda.csv', index=False)

poli_dis_2012_df.to_csv('../data/clean_poli_dis_2012.csv', index=False)
poli_dis_2020_df.to_csv('../data/clean_poli_dis_2020.csv', index=False)
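
One thing that may be worth checking around this step: a text made up only of digits or symbols can end up empty after cleaning, and an empty string is read back from a CSV as a null. The sketch below illustrates this on made-up data with an in-memory buffer instead of a real file.

```python
import io

import pandas as pd

# Toy cleaned data: one normal text, one empty string, one whitespace-only
toy_df = pd.DataFrame({
    'subreddit': ['DMAcademy', 'DMAcademy', 'DMAcademy'],
    'text': ['Terrain tips for new DMs', '', '   '],
})

# Texts with nothing left but whitespace after cleaning
is_blank = toy_df['text'].str.strip().eq('')
print(int(is_blank.sum()), 'blank rows')

# Round trip through CSV in memory to see what comes back
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded['text'].isna().sum(), 'null texts after reloading')

# Optionally keep only rows that still contain some text
non_blank = toy_df.loc[~is_blank].copy()
```

Dropping blank rows before saving, or calling `dropna` after loading in the next notebook, would both avoid passing nulls to a vectorizer.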

Now we're ready to move on to notebook 3, creating a model to predict which subreddit a post came from!