# Reddit NLP Classifier

## Data Cleaning (2/4)

## Contents
- [Data Cleaning](#Data-Cleaning)

## Data Cleaning

### All libraries

In [1]:
# Import libraries
import numpy as np
import pandas as pd

In [2]:
# Show every row and column when displaying DataFrames (no truncation)
# Reference: https://kakakakakku.hatenablog.com/entry/2021/04/19/090229
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

### Data Import

In [3]:
# Read in the data
df = pd.read_csv('../data/subreddits_combined.csv')
df.head()

| | subreddit | body |
|---|---|---|
| 0 | malefashionadvice | Definitely agree there’s personality there. Se... |
| 1 | malefashionadvice | You're looking for high fashion designers bro.... |
| 2 | malefashionadvice | Yeah, I’d add photos if I knew exactly what I ... |
| 3 | malefashionadvice | [cool cardigan](https://i.pinimg.com/736x/80/6... |
| 4 | malefashionadvice | [deleted] |


In [4]:
# Check data shape
df.shape

(5571, 2)

### Handle `NaN` / [removed] values

Rows containing `NaN` values are dropped from the dataset.

In [5]:
# Review any missing values 
df.isnull().sum().sort_values(ascending = False)

body         1
subreddit    0
dtype: int64

In [6]:
# Drop rows with `NaN` values
df.dropna(inplace=True)

# Confirm that no missing values remain
df.isnull().sum()

subreddit    0
body         0
dtype: int64

Rows whose `body` column is `[removed]` are also dropped from the dataset.

In [7]:
# Review the number of rows with [removed] body comments
df[df['body'] == '[removed]'].shape

(2, 2)

In [8]:
# Remove [removed] data
df = df[df['body']!='[removed]']

# Check data shape
df.shape

(5568, 2)
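
The two filtering steps above can be tried on a small example. This is a minimal sketch on made-up data: the `toy` frame and its contents are hypothetical, and `dropna` is restricted to the `body` column here as an illustrative choice rather than what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real comments
toy = pd.DataFrame({
    "subreddit": ["a", "a", "b", "b"],
    "body": ["nice jacket", np.nan, "[removed]", "love those boots"],
})

# Drop rows with a missing body, then rows whose body is the placeholder text
cleaned = toy.dropna(subset=["body"])
cleaned = cleaned[cleaned["body"] != "[removed]"].reset_index(drop=True)

print(cleaned.shape)
```

Resetting the index afterwards is optional; it only keeps the row labels contiguous once rows have been removed.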

### Handle duplicate posts

Any duplicate posts are removed from the dataset.

In [9]:
# Check the number of duplicated rows in body
# Reference: https://note.nkmk.me/python-pandas-duplicated-drop-duplicates/
df['body'].duplicated().value_counts()

False    5563
True        5
Name: body, dtype: int64

In [10]:
# Drop rows with a duplicated body, keeping the last occurrence of each
df = df.drop_duplicates(subset='body', keep='last')
df.shape

(5563, 2)
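
The effect of the `keep` argument is easiest to see on a tiny example. The sketch below uses a hypothetical `toy` frame, not the project data, and compares `keep='last'` with the default `keep='first'`.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 share the same body text
toy = pd.DataFrame({
    "subreddit": ["a", "b", "a"],
    "body": ["same text", "other text", "same text"],
})

# duplicated() flags the second and later occurrences by default
n_dups = int(toy["body"].duplicated().sum())

# keep='last' retains the final occurrence; keep='first' retains the earliest
kept_last = toy.drop_duplicates(subset="body", keep="last")
kept_first = toy.drop_duplicates(subset="body", keep="first")

print(n_dups, kept_last.index.tolist(), kept_first.index.tolist())
```

Both variants remove the same number of rows; they differ only in which copy survives, which matters if the copies carry different `subreddit` labels.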

### Balance the number of rows between the two subreddits

In [11]:
# Count the number of rows per subreddit
df['subreddit'].value_counts()

femalefashionadvice    2791
malefashionadvice      2772
Name: subreddit, dtype: int64

In [12]:
# Drop the last 19 (2791 - 2772) rows so both subreddits have the same number of rows.
# This relies on the trailing rows all belonging to femalefashionadvice.
# Reference: https://sparkbyexamples.com/pandas/pandas-drop-last-n-rows-from-dataframe
df.drop(df.tail(2791-2772).index, inplace=True)
df['subreddit'].value_counts()

malefashionadvice      2772
femalefashionadvice    2772
Name: subreddit, dtype: int64
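
Dropping trailing rows balances the classes only when the surplus rows sit at the end of the frame. As an alternative sketch, not the approach used in this notebook, the size of the smaller class can be computed and each group truncated to it with `groupby(...).head(n)`. The `toy` frame and the names `n_min` and `balanced` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with an imbalanced label column
toy = pd.DataFrame({
    "subreddit": ["m", "m", "f", "f", "f"],
    "body": ["c1", "c2", "c3", "c4", "c5"],
})

# Size of the smallest class
n_min = int(toy["subreddit"].value_counts().min())

# Keep the first n_min rows of every class, preserving the original row order
balanced = toy.groupby("subreddit", sort=False).head(n_min)

counts = balanced["subreddit"].value_counts().to_dict()
print(counts)
```

This avoids hard-coding the class counts and does not depend on how the rows are ordered.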

### Save dataset

In [13]:
# Save dataset to csv
df.to_csv('../data/subreddits_combined_clean.csv', index = False)
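
Passing `index=False` keeps the row index out of the saved file. The sketch below illustrates the difference using in-memory buffers instead of files on disk; the `toy` frame is hypothetical.

```python
import io

import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"subreddit": ["a", "b"], "body": ["x", "y"]})

# Default behaviour: the index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with, cols_without)
```

When the index is written, `read_csv` brings it back as a column named `Unnamed: 0` unless `index_col=0` is passed.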