# Text Cleaning
---

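### This notebook accomplishes the following:

- Reads in the two subreddit data files
- Isolates the subreddit, title, and selftext columns
- Removes digits, HTML entities, YouTube links, and newline characters from the text
- Checks for and replaces null values
- Saves the processed data for modeling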

---

In [40]:
# Imports
import pandas as pd
import numpy as np

### Read in the data files

In [41]:
dmacademy_df = pd.read_csv('../data/dmacademy.csv')
truezelda_df = pd.read_csv('../data/truezelda.csv')

In [42]:
dmacademy_df['full_link'].nunique()

5000

In [43]:
truezelda_df['full_link'].nunique()

4996

In [44]:
# frames = [dmacademy_df, truezelda_df]
# master_df = pd.concat(frames)

In [45]:
dmacademy_df

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,link_flair_css_class,post_hint,preview,author_flair_background_color,author_flair_text_color,removed_by_category,gilded,author_cakeday,suggested_sort,banned_by
0,[],False,Vashael,,[],,text,t2_7p5eovfe,False,False,...,,,,,,,,,,
1,[],False,Atarihero76,,[],,text,t2_48oeurxu,False,False,...,Guide,self,"{'enabled': False, 'images': [{'id': '9XSNOgfA...",,,,,,,
2,[],False,JethroBuldean,,[],,text,t2_1plx2r4z,False,False,...,,,,,,,,,,
3,[],False,Mechaaniac,,[],,text,t2_a04c8tpe,False,False,...,,,,,,,,,,
4,[],False,Hungerforhuman,,[],,text,t2_a2ouksrx,False,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,[],False,AngelsJos,,[],,text,t2_9571c28u,False,True,...,,,,,,,,,,
4996,[],False,Light_of_Avalon,,[],,text,t2_871qw,False,False,...,,,,,,,,,,
4997,[],False,Randoff-Runemaker,,[],,text,t2_ab1yxy0e,False,False,...,,,,,,,,,,
4998,[],False,Shatyel,,[],,text,t2_1v4sccw1,False,False,...,,,,,,,,,,


### Isolate the subreddit, title, and selftext columns in a new dataframe

In [47]:
# Named dm_text_df to match the cells that follow
dm_text_df = dmacademy_df[['subreddit', 'title', 'selftext']].copy()
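
A minimal sketch of how the two frames might be stacked before isolating the text columns, assuming a combined frame like the `text_df` used later in this notebook. The toy frames and the names `dm_toy`, `zelda_toy`, and `text_toy` are hypothetical stand-ins, not the scraped data.

```python
import pandas as pd

# Toy stand-ins for the two scraped frames (hypothetical data).
dm_toy = pd.DataFrame({
    "subreddit": ["DMAcademy", "DMAcademy"],
    "title": ["Terrain tips", "Taking breaks"],
    "selftext": ["Use cover.", "Every two hours."],
    "author": ["a", "b"],
})
zelda_toy = pd.DataFrame({
    "subreddit": ["truezelda"],
    "title": ["Timeline question"],
    "selftext": [None],
    "author": ["c"],
})

# Stack the frames, then keep only the columns needed for modeling.
# .copy() avoids modifying a view of the combined frame later on.
combined = pd.concat([dm_toy, zelda_toy])
text_toy = combined[["subreddit", "title", "selftext"]].copy()

print(text_toy.shape)
```

Without `ignore_index=True`, each frame keeps its own row labels, so index values repeat in the combined frame.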

In [52]:
for row in dm_text_df['selftext'][:1]:
    print(row)

**Edit (UPDATE): Thank you for the robust response! I've had some great applications so far and I'm hyped for the conversations we will have. I am currently going through interview applications and preparing to send out follow-ups. Thanks for your patience.**

Hello, seasoned dungeon masters!

I'm Nick, the host of an upcoming GM interview podcast, and I want to talk with you on my show! Isn't it about time your epic campaign gets shared with the TTRPG community?

I want you, the non-celebrity GM (or hidden GM, if you will), to share your stories and wisdom with the community. I think we, as a community, have something to learn from your experience behind the GM screen.

What makes this podcast really fun is that we'll be getting your wisdom out there to other GMs just like you. This is the outlet you've been waiting for to talk about the campaign you poured your heart and soul into. I know how painful it is to keep all the cool secrets of your campaign and hidden lore from your PCs. M

### Remove digits, HTML entities, YouTube links, and newline characters

In [39]:
import re

# Remove markdown-wrapped YouTube links first: the pattern expects an 11-character
# video ID, which would no longer match once digits have been stripped.
dm_text_df['selftext'] = dm_text_df['selftext'].str.replace(r'\(https://www\.youtube\.com/watch\?v=.{11}\)', '', regex=True)

# Remove digits (help from Amir!). regex=True is required in pandas >= 2.0.
# https://stackoverflow.com/questions/41719259/how-to-remove-numbers-from-string-terms-in-a-pandas-dataframe
dm_text_df['selftext'] = dm_text_df['selftext'].str.replace(r'\d+', '', regex=True)

# Remove HTML-escaped ampersands (could replace with 'and')
dm_text_df['selftext'] = dm_text_df['selftext'].str.replace('&amp;', '', regex=False)

# Replace escaped and literal tabs, newlines, and carriage returns with spaces
# https://stackoverflow.com/questions/44227748/removing-newlines-from-messy-strings-in-pandas-dataframe-cells
dm_text_df.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=[" "," "], regex=True, inplace=True)

# Unused alternative for stripping bare URLs:
# https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python/11332580
# text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)



In [10]:
for row in text_df['selftext'][:5]:
    print(row)

**Edit (UPDATE): Thank you for the robust response! I've had some great applications so far and I'm hyped for the conversations we will have. I am currently going through interview applications and preparing to send out follow-ups. Thanks for your patience.**  Hello, seasoned dungeon masters!  I'm Nick, the host of an upcoming GM interview podcast, and I want to talk with you on my show! Isn't it about time your epic campaign gets shared with the TTRPG community?  I want you, the non-celebrity GM (or hidden GM, if you will), to share your stories and wisdom with the community. I think we, as a community, have something to learn from your experience behind the GM screen.  What makes this podcast really fun is that we'll be getting your wisdom out there to other GMs just like you. This is the outlet you've been waiting for to talk about the campaign you poured your heart and soul into. I know how painful it is to keep all the cool secrets of your campaign and hidden lore from your PCs. M
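
The cleaning steps in this section can be sketched as a single reusable function. This is a sketch under assumptions, not the notebook's exact code: the function name `clean_text` and the sample string are hypothetical, and the YouTube link is removed before digits so that the 11-character video ID is still intact when the pattern runs.

```python
import pandas as pd

def clean_text(series: pd.Series) -> pd.Series:
    """Apply the section's cleaning steps to a Series of post bodies."""
    # Markdown-style link target: (https://www.youtube.com/watch?v=<11 chars>)
    youtube_link = r"\(https://www\.youtube\.com/watch\?v=.{11}\)"
    return (
        series
        .str.replace(youtube_link, "", regex=True)   # links first, while IDs are intact
        .str.replace(r"\d+", "", regex=True)         # digits
        .str.replace("&amp;", "", regex=False)       # HTML-escaped ampersand
        .str.replace(r"\\t|\\n|\\r|\t|\n|\r", " ", regex=True)  # escaped and literal whitespace
    )

sample = pd.Series(
    ["See [video](https://www.youtube.com/watch?v=dQw4w9WgXcQ) for 3 tips &amp; tricks\nBye"]
)
cleaned = clean_text(sample)
print(cleaned.iloc[0])
```

Null entries pass through the `.str` methods unchanged, so this can run before the null-handling step.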

In [11]:
text_df['title'].nunique()

9899

In [12]:
text_df['selftext'].nunique()

8942

### Check for null values
Some posts are title-only, meaning the selftext column contains a null value. We don't want to pass nulls to our models, so we replace them with empty strings.

Again, this is why it was important to choose text-heavy subreddits during data collection. Had we chosen subreddits where most posts were images or links to other websites, we would have more null values and less text, leaving our models less informed. With only 211 nulls across 9,996 posts (about 2%), there is no need to switch to a more text-rich subreddit.

In [13]:
text_df.isnull().sum()

subreddit      0
title          0
selftext     211
dtype: int64

We can also see which subreddit contains most of the nulls. Almost all of them (207 of 211) are in the truezelda subreddit, so that subreddit will contribute slightly less text to our model, but the difference is small enough that it is unlikely to have a significant impact on model performance.

In [14]:
# see how many nulls were from which subreddits
dmacademy_df['selftext'].isnull().sum()

4

In [15]:
truezelda_df['selftext'].isnull().sum()

207
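
The per-subreddit null counts can also be computed in one step on a combined frame. A minimal sketch, assuming a frame with `subreddit` and `selftext` columns; the toy data and the name `nulls_by_sub` are hypothetical.

```python
import pandas as pd

# Toy combined frame (hypothetical data).
toy = pd.DataFrame({
    "subreddit": ["DMAcademy", "DMAcademy", "truezelda", "truezelda"],
    "selftext": ["text", None, None, None],
})

# Flag nulls, then sum the flags within each subreddit.
nulls_by_sub = toy["selftext"].isnull().groupby(toy["subreddit"]).sum()
print(nulls_by_sub)
```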

If we dropped these rows, our classes would be slightly imbalanced, though only by a few percent, which I'd be willing to accept. Note, however, that empty strings written to a CSV are read back in as nulls by default, so even though we replace the nulls here, they will have to be replaced again in each modeling notebook.
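
The CSV round-trip behavior described above can be reproduced with a small in-memory example. This is a sketch on hypothetical toy data. `keep_default_na=False` is one possible way to keep empty strings when loading; calling `fillna("")` after loading, as the modeling notebooks would otherwise need to do, is another.

```python
import io
import pandas as pd

# One title-only post whose selftext is already an empty string (hypothetical data).
toy = pd.DataFrame({"title": ["Title-only post"], "selftext": [""]})

# Write to an in-memory CSV instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: the empty field comes back as NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# With keep_default_na=False the empty field stays an empty string.
buffer.seek(0)
kept_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["selftext"].isnull().sum())
print(kept_read["selftext"].isnull().sum())
```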

Finally, replace those nulls with empty strings.

In [16]:
text_df.fillna("", inplace=True)

In [17]:
text_df.isnull().sum()

subreddit    0
title        0
selftext     0
dtype: int64

### Map our target variable to integer values

In [18]:
text_df.dtypes

subreddit    object
title        object
selftext     object
dtype: object

In [19]:
text_df['subreddit']

0       DMAcademy
1       DMAcademy
2       DMAcademy
3       DMAcademy
4       DMAcademy
          ...    
4991    truezelda
4992    truezelda
4993    truezelda
4994    truezelda
4995    truezelda
Name: subreddit, Length: 9996, dtype: object

In [20]:
# this step is optional
#text_df['subreddit'] = text_df['subreddit'].map({'truezelda':1,'DMAcademy':0})
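
A minimal sketch of the optional mapping step on toy labels, using the same 0/1 assignment as the commented-out line above. The names `labels`, `label_map`, and `encoded` are hypothetical.

```python
import pandas as pd

# Toy target column (hypothetical data).
labels = pd.Series(["DMAcademy", "truezelda", "DMAcademy"], name="subreddit")

# Same assignment as the commented-out line: truezelda -> 1, DMAcademy -> 0.
label_map = {"truezelda": 1, "DMAcademy": 0}
encoded = labels.map(label_map)

# .map() yields NaN for any value missing from the dictionary,
# so a quick check guards against misspelled subreddit names.
if encoded.isnull().any():
    raise ValueError("Found subreddit values that are not in label_map")

print(encoded.tolist())
```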

In [21]:
text_df['subreddit']

0       DMAcademy
1       DMAcademy
2       DMAcademy
3       DMAcademy
4       DMAcademy
          ...    
4991    truezelda
4992    truezelda
4993    truezelda
4994    truezelda
4995    truezelda
Name: subreddit, Length: 9996, dtype: object

In [22]:
text_df

| | subreddit | title | selftext |
|---|---|---|---|
| 0 | DMAcademy | Seeking seasoned DMs to be guests on interview... | **Edit (UPDATE): Thank you for the robust resp... |
| 1 | DMAcademy | TERRAIN, and Using it Effectively | TERRAIN, and using it Effectively – DM Tips U... |
| 2 | DMAcademy | Know the exact location of something | The players are planning on dropping an evil a... |
| 3 | DMAcademy | How to run military basic as a session | I am running a campaign for all intents and pu... |
| 4 | DMAcademy | Best time to take breaks/how long they should be | Hey just a newbie DM .My sessions are usually ... |
| ... | ... | ... | ... |
| 4991 | truezelda | Should games such as Tri Force Heroes and Four... | The Zelda timeline is infamous for being convo... |
| 4992 | truezelda | Review score predictions for Skyward HD? | Title says it all. How will it fare? With the... |
| 4993 | truezelda | Should I watch or play Majora’s Mask? | [deleted] |
| 4994 | truezelda | Can anyone help me find a normally priced zeld... | My daughter loves to play super smash bros and... |


### Save processed data for modeling

In [23]:
text_df.to_csv('../data/clean_dnd_zelda.csv', index=False)

Now we're ready to move on to notebook 3, creating a model to predict which subreddit a post came from!

In [24]:
text_df.isnull().sum()

subreddit    0
title        0
selftext     0
dtype: int64