# Project 3 - Reddit NLP
## 02-Data Cleaning



In this section, I will begin by inspecting the data. I will then look for:
- Duplicated values to remove
- Removed posts that were included in the web scrape
- Missing values to impute
- Special characters to remove using regex

Then I will engineer a few features to capture word count. Once this is complete, I will save the dataframe as a CSV to be used in the EDA and Modeling sections.

In [1]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import re

In [2]:
# read in data
df = pd.read_csv('./data/legal_nsq.csv')

In [3]:
df.head(2)

Unnamed: 0,title,created_utc,selftext,subreddit,author,media_only,permalink
0,My ex gf refuses to reclaim her items from my ...,1601524380,About a week ago I ended a moderately long rel...,legaladvice,Gtormund51,False,/r/legaladvice/comments/j31aqq/my_ex_gf_refuse...
1,"A car is advertised for $18,000 lower than MSR...",1601524283,So I'm looking for a new car and I stumbled up...,legaladvice,hustlegoat,False,/r/legaladvice/comments/j319w1/a_car_is_advert...


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        60000 non-null  object
 1   created_utc  60000 non-null  int64 
 2   selftext     50677 non-null  object
 3   subreddit    60000 non-null  object
 4   author       60000 non-null  object
 5   media_only   60000 non-null  bool  
 6   permalink    60000 non-null  object
dtypes: bool(1), int64(1), object(5)
memory usage: 2.8+ MB


### Check for and remove Duplicates


In [5]:
# Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html
len(df[df.duplicated(subset='title')])

669

In [6]:
df.drop_duplicates(subset='title', keep='first', inplace=True)
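
To make the effect of the two calls above concrete, here is a minimal sketch on made-up rows (the `toy` dataframe and its values are hypothetical, not taken from the scrape). It also shows one consequence of matching on `title` alone: two different authors who post the same title count as duplicates.

```python
import pandas as pd

# Hypothetical toy data: two posts share a title but have different authors
toy = pd.DataFrame({
    'title': ['Is this legal?', 'Is this legal?', 'Landlord question'],
    'author': ['a', 'b', 'c'],
})

# duplicated() flags every repeat after the first occurrence
n_dupes = toy.duplicated(subset='title').sum()

# keep='first' retains the first occurrence of each title
deduped = toy.drop_duplicates(subset='title', keep='first')

print(n_dupes, len(deduped))
```

If same-title posts by different authors should be kept, passing `subset=['title', 'author']` would be one option; the notebook itself deduplicates on title only.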

### Check for removed Posts

Some posts in this dataset were removed by a moderator; their "selftext" contains only the placeholder "[removed]". Posts can be removed for various reasons. Here I will assume that if a post was removed from the subreddit, it should also be removed from our dataset.

In [7]:
df[df['selftext']=='[removed]'].shape

(3618, 7)

In [8]:
df.drop(df[df['selftext']=='[removed]'].index, inplace=True)
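
The same filter can also be written as a boolean mask, which extends easily to more than one placeholder. The sketch below uses made-up rows. Including `[deleted]` (the marker Reddit shows when an author deletes their own post) is an assumption: the notebook only checks for `[removed]`, and it does not show whether `[deleted]` occurs in this dataset.

```python
import pandas as pd

# Hypothetical toy data, including a missing body
toy = pd.DataFrame({
    'selftext': ['Real question here', '[removed]', '[deleted]', None]
})

# '[deleted]' is an assumed extra placeholder, not one the notebook checks
placeholders = ['[removed]', '[deleted]']

# True for rows whose body is exactly one of the placeholders
mask = toy['selftext'].isin(placeholders)

# Keep everything else; missing bodies are left for the imputation step
kept = toy[~mask]

print(mask.sum(), list(kept.index))
```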

### Handling empty values

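In the raw data, 9,323 of the 60,000 rows have no "selftext" (only 50,677 are non-null in the `df.info()` output above). Upon inspection of a sample of these submissions, this is typically because the entire question is posed in the title and there is no body. Here, I will impute the string "blank". Note that this placeholder is itself a word, so these posts will have a "selftext_word_count" of 1 rather than 0.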

In [9]:
# fill missing post bodies with the placeholder string
df['selftext'] = df['selftext'].fillna('blank')

### Removal of special characters

After an initial pass with CountVectorizer, it became apparent that the posts contain a number of special characters. I will use regex to create "clean" versions of the 'selftext' and 'title' columns. The pattern keeps alphanumeric characters, apostrophes, periods, and question marks, and replaces every run of other characters with a single space.

In [10]:
# Reference: https://stackoverflow.com/questions/21492621/regex-to-keep-specific-characters-from-string
clean_text = [re.sub(r"[^a-zA-Z0-9\'.?]+", ' ', entry) for entry in df['selftext']]
clean_text[:2]

['About a week ago I ended a moderately long relationship with my then gf. She refuses to get her stuff from my house amp is holding certain items of mine hostage as well. One of these items being an access key to my property. She never received mail here amp did not contribute to utility bills nor did she pay any rent. 1.How long do I have to legally keep her stuff? 2.Do her possessions HAVE to be stored on my property until she claims it or can I place the items in a secure storage unit? 3.What can I legally do to reclaim my possessions? Thanks in advance for any credible advice ',
 "So I'm looking for a new car and I stumbled upon a nice car on a very well known car sales site. The car is listed for around 2.2k but when I look up the car it goes for around 22k. The car has very low miles and doesn't appear to have any blatant issues. If I walked into the dealership tomorrow and said I'd like to buy the car for 2.2k does the dealer have to honor the advertised price online? State is 

In [11]:
df['clean_text'] = clean_text

In [12]:
clean_title = [re.sub(r"[^a-zA-Z0-9\'.?]+", ' ', entry) for entry in df['title']]
clean_title[:2]

['My ex gf refuses to reclaim her items from my home. What are my options? Alabama amp no she not a relative lol ',
 'A car is advertised for 18 000 lower than MSRP on a website does the dealer have to honor that price?']

In [13]:
df['clean_title'] = clean_title
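
The stray `amp` tokens in both outputs above look like what is left of the HTML entity `&amp;` once `&` and `;` are replaced with spaces. A minimal sketch of one way to avoid this, assuming the scraped text does contain HTML entities: decode them with the standard library's `html.unescape` before applying the same pattern. The `clean` helper and the `raw` string are hypothetical and not part of the notebook.

```python
import html
import re

# Same pattern as in the notebook: anything outside this set becomes a space
PATTERN = r"[^a-zA-Z0-9\'.?]+"

def clean(text):
    """Decode HTML entities, then replace runs of disallowed characters with one space."""
    return re.sub(PATTERN, ' ', html.unescape(text))

# Hypothetical example string containing an HTML-escaped ampersand
raw = "my house &amp; is holding items (mine!) hostage. What now?"

without_decoding = re.sub(PATTERN, ' ', raw)  # leaves a stray 'amp' token
with_decoding = clean(raw)                    # the ampersand is dropped cleanly

print(without_decoding)
print(with_decoding)
```

Without decoding, `amp` would be counted as a word by the vectorizer and by the word count columns.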

In [14]:
df = df.drop(columns=['title', 'selftext'])

### Feature Engineering

Here I create an "alltext_clean" column that concatenates the two cleaned text columns. I also create word count columns for the title, the body, and the combined text, which I will use for EDA.

In [15]:
# creating a column that combines our two text columns, title and selftext
df['alltext_clean'] = df['clean_title'] + df['clean_text']

In [16]:
df['alltext_word_count'] = df['alltext_clean'].str.split().str.len()

In [17]:
df['title_word_count'] = df['clean_title'].str.split().str.len()

In [18]:
df['selftext_word_count'] = df['clean_text'].str.split().str.len()

In [19]:
df.head(2)

Unnamed: 0,created_utc,subreddit,author,media_only,permalink,clean_text,clean_title,alltext_clean,alltext_word_count,title_word_count,selftext_word_count
0,1601524380,legaladvice,Gtormund51,False,/r/legaladvice/comments/j31aqq/my_ex_gf_refuse...,About a week ago I ended a moderately long rel...,My ex gf refuses to reclaim her items from my ...,My ex gf refuses to reclaim her items from my ...,136,23,113
1,1601524283,legaladvice,hustlegoat,False,/r/legaladvice/comments/j319w1/a_car_is_advert...,So I'm looking for a new car and I stumbled up...,A car is advertised for 18 000 lower than MSRP...,A car is advertised for 18 000 lower than MSRP...,105,21,85
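
One detail is visible in the second row above: the combined count (105) is one less than the title and body counts added together (21 + 85 = 106). Joining two strings with `+` inserts no separator, so when the title does not end in whitespace, its last word fuses with the first word of the body. A minimal sketch on made-up text shows the effect and a possible fix with an explicit space; the `toy` dataframe and the `word_count` helper are hypothetical.

```python
import pandas as pd

# Hypothetical toy row: the title does not end with a space
toy = pd.DataFrame({
    'clean_title': ['does the dealer have to honor that price?'],
    'clean_text': ["So I'm looking for a new car"],
})

def word_count(series):
    """Count whitespace-separated tokens in each string."""
    return series.str.split().str.len()

# Plain concatenation: 'price?' and 'So' fuse into the single token 'price?So'
fused = toy['clean_title'] + toy['clean_text']

# An explicit space keeps the two tokens separate
spaced = toy['clean_title'] + ' ' + toy['clean_text']

print(word_count(fused).iloc[0], word_count(spaced).iloc[0])
```

With the separator, the combined word count always equals the sum of the two individual counts.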


### Save Cleaned Data

A special note here: due to the size of the dataset used, I will also create a smaller file with a random sampling of rows that will be uploaded to the git repository. The full dataset will not be included in the upload, though the full set was used for all of the following notebooks.

In [20]:
df.to_csv('./data/legal_nsq_clean.csv', index=False)

In [22]:
sample_df = df.sample(n=3000)

In [23]:
sample_df.to_csv('./data/legal_nsq_clean_sample.csv', index=False)
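
As written, `df.sample(n=3000)` draws a different set of rows on every run. If the uploaded sample should be reproducible, pandas accepts a `random_state` seed. The sketch below uses a made-up dataframe, and the seed value 42 is an arbitrary assumption, not something the notebook used.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataframe
toy = pd.DataFrame({'id': range(100)})

# The same seed yields the same rows on every run
sample_a = toy.sample(n=10, random_state=42)
sample_b = toy.sample(n=10, random_state=42)

print(sample_a.equals(sample_b))
```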