# Clean Data
---
#### Import libraries and read data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
import pickle
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
df = pd.read_csv('../data/skincare.csv')
df.head(2)

Unnamed: 0,author,title,selftext,num_comments,score,subreddit,is_ab
0,laurtay7166,[Routine Help] Suggestions for dehydrated to n...,,1,1,skincareaddiction,0
1,atrevz,[B&amp;A] Did the Fifty Shades of Snail sebace...,,1,1,skincareaddiction,0


## Exclude removed and deleted posts
---
I'm excluding removed and deleted posts from my data because they add no value: their content is no longer on the forum, so they would misrepresent the subreddits in my analysis and modeling.

After this elimination, I have 64,008 posts (64%) of my data left, which still leaves me with an ample number of observations to work with.

In [3]:
df = df[(df['selftext'] != '[removed]') & (df['selftext'] != '[deleted]')]

print(f'I have {df.shape[0]} ({round((df.shape[0])/100_000*100)}%) left of my data to work with.')

I have 64008 (64%) left of my data to work with.
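The same progress message is printed after every cleaning step, so it could be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: `report_remaining` and `ORIGINAL_ROWS` are hypothetical names, and the toy frame stands in for the real data.

```python
import pandas as pd

# Hypothetical constant mirroring the 100_000 hard-coded in the cells of this notebook
ORIGINAL_ROWS = 100_000

def report_remaining(frame, original_rows=ORIGINAL_ROWS):
    """Build a message with the remaining row count and its share of the original."""
    remaining = len(frame)
    pct = round(remaining / original_rows * 100)
    return f'I have {remaining} ({pct}%) left of my data to work with.'

# Toy frame standing in for the real data
toy = pd.DataFrame({'selftext': ['a', '[removed]', 'b', '[deleted]']})

# Keep rows whose selftext is neither '[removed]' nor '[deleted]'
toy = toy[~toy['selftext'].isin(['[removed]', '[deleted]'])]

message = report_remaining(toy, original_rows=4)
print(message)
```

Like the `!=` comparison used above, `isin` leaves rows with missing `selftext` in place, so those are still handled in the next section.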


##### Sanity check:

In [4]:
df[df['selftext'] == '[removed]']

Unnamed: 0,author,title,selftext,num_comments,score,subreddit,is_ab


## Handle missing values
---

There are missing `selftext`s in my data.

After investigating each subreddit to get more context, I found that the two subreddits have different posting cultures. Some users in the AsianBeauty community post product reviews as a title only, with the written review in the comments ([example](https://www.reddit.com/r/AsianBeauty/comments/g3kd2a/review_11_cosrx_products/)). These posts show up as missing values in my data. However, since there are other [product reviews](https://www.reddit.com/r/AsianBeauty/?f=flair_name%3A%22Review%22) that do include a selftext, I will eliminate these posts from my data.

Similarly, in the SkincareAddiction subreddit, the posts without a selftext are usually product questions for the community to chime in on, as indicated by the tag and title ([example](https://www.reddit.com/r/SkincareAddiction/comments/dd3za5/product_question_cleansers_and_toners_that_help/)). Considering the myriad of [product questions](https://www.reddit.com/r/SkincareAddiction/?f=flair_name%3A%22Product%20Question%22) with a selftext that can represent this type of post, I will also remove these from my data.

After excluding posts with missing `selftext`s, I have 32,226 posts (32%) of my data left, which is still an ample amount of data to work with. Class balance is to be determined.

In [5]:
df.isna().sum()

author              0
title               0
selftext        31782
num_comments        0
score               0
subreddit           0
is_ab               0
dtype: int64

In [6]:
df = df.dropna()

print(f'I have {df.shape[0]} ({round((df.shape[0])/100_000*100)}%) left of my data to work with.')

I have 32226 (32%) left of my data to work with.
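Since `selftext` is the only column with missing values here, `df.dropna()` and a drop restricted to that column give the same result. The sketch below, on a made-up toy frame, shows the more explicit variant with `subset`, which would keep working as intended if another column later gained missing values.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values) with one missing selftext
toy = pd.DataFrame({
    'title': ['review', 'question', 'routine'],
    'selftext': ['long review', np.nan, 'my routine'],
    'score': [1, 1, 1],
})

# Count missing values per column before dropping anything
missing_before = toy.isna().sum()

# Drop only the rows where selftext is missing
toy_clean = toy.dropna(subset=['selftext'])

print(missing_before)
print(toy_clean)
```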


##### Sanity check:

In [7]:
df.isna().sum()

author          0
title           0
selftext        0
num_comments    0
score           0
subreddit       0
is_ab           0
dtype: int64

## Check data types
---
All data types are correct, no changes made.

In [8]:
df.dtypes

author          object
title           object
selftext        object
num_comments     int64
score            int64
subreddit       object
is_ab            int64
dtype: object

## Check for duplicates
---

After investigating further by browsing the duplicate posts on Reddit.com, I concluded that most of them are either spam posts that have long been removed or deleted, or scheduled posts from an automated moderator (see Fig 1 - 4 at the bottom of this notebook).

Words from duplicate posts would be over-counted and skew my analysis, so I will remove the duplicates from my data.

After removing these duplicates, I have 31,618 posts (32%) of my data left, which is an ample number of observations to work with.

##### Look at duplicate data

In [9]:
df[df.duplicated()].sample(10)

Unnamed: 0,author,title,selftext,num_comments,score,subreddit,is_ab
55677,AutoModerator,"Daily Deals, Fluff, and Hauls","Post all of your deals, memes, gifs, hauls, sh...",6,1,asianbeauty,1
52983,AutoModerator,"Daily Deals, Fluff, and Hauls","Post all of your deals, memes, gifs, hauls, sh...",3,1,asianbeauty,1
70699,AutoModerator,"Daily Deals, Fluff, and Hauls","Post all of your deals, memes, gifs, hauls, sh...",8,3,asianbeauty,1
64817,AutoModerator,"Daily Deals, Fluff, and Hauls","Post all of your deals, memes, gifs, hauls, sh...",2,6,asianbeauty,1
66820,AutoModerator,"Daily Deals, Fluff, and Hauls","Post all of your deals, memes, gifs, hauls, sh...",6,1,asianbeauty,1
67938,AutoModerator,"Daily Deals, Fluff, and Hauls","Post all of your deals, memes, gifs, hauls, sh...",2,1,asianbeauty,1
65788,AutoModerator,"Daily Deals, Fluff, and Hauls","Post all of your deals, memes, gifs, hauls, sh...",18,1,asianbeauty,1
83263,AutoModerator,Daily Fluff and Hauls,"Post your meme trash, gifs, hauls, sheet mask ...",13,5,asianbeauty,1
55774,AutoModerator,"Daily Deals, Fluff, and Hauls","Post all of your deals, memes, gifs, hauls, sh...",0,1,asianbeauty,1
62069,AutoModerator,"Daily Deals, Fluff, and Hauls","Post all of your deals, memes, gifs, hauls, sh...",1,3,asianbeauty,1


##### Investigate these common titles on Reddit.com

See the screenshots of my findings in Fig 1 - 4.

In [10]:
df[df['title']=='DISCOUNT CODE: "KINDNESS" for an additional discount on yesstyle.com :D &lt;3'][['subreddit']]

Unnamed: 0,subreddit
55126,asianbeauty
55497,asianbeauty
55803,asianbeauty
56221,asianbeauty
56324,asianbeauty
56326,asianbeauty
56565,asianbeauty
56608,asianbeauty


In [11]:
df[df['title']=='when to apply overnight exfoliator?'][['subreddit']]

Unnamed: 0,subreddit
45355,skincareaddiction
45358,skincareaddiction


In [12]:
df[df['title']=='[Skin Concerns] Red, itchy skin on body'][['subreddit']]

Unnamed: 0,subreddit
8801,skincareaddiction
8866,skincareaddiction
8867,skincareaddiction


In [13]:
df[df['title']=='Daily Fluff and Hauls'][['subreddit']].head()

Unnamed: 0,subreddit
79272,asianbeauty
79313,asianbeauty
79349,asianbeauty
79396,asianbeauty
79430,asianbeauty


##### Remove duplicates

In [14]:
df = df.drop_duplicates()

print(f'I have {df.shape[0]} ({round((df.shape[0])/100_000*100)}%) left of my data to work with.')

I have 31618 (32%) left of my data to work with.


##### Sanity check:

In [15]:
df.duplicated().sum()

0
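To see which titles account for the duplicates before dropping them, the duplicated rows can be grouped by title. This is a sketch on invented data. Restricting the comparison with `subset` to the text columns is an assumption of this example; the cells above compare all columns.

```python
import pandas as pd

# Toy frame (invented values): one scheduled post repeated, plus two distinct posts
toy = pd.DataFrame({
    'author': ['AutoModerator', 'AutoModerator', 'user1', 'user2'],
    'title': ['Daily Fluff and Hauls', 'Daily Fluff and Hauls', 'help', 'help'],
    'selftext': ['Post your hauls', 'Post your hauls', 'text a', 'text b'],
})

text_cols = ['author', 'title', 'selftext']

# keep=False flags every member of a duplicated group, not only the repeats
dupes = toy[toy.duplicated(subset=text_cols, keep=False)]

# How many duplicated rows each title contributes
title_counts = dupes['title'].value_counts()

# Keep the first row of each duplicated group
deduped = toy.drop_duplicates(subset=text_cols)

print(title_counts)
```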

## Remove posts from AutoModerator
---
I became aware of automated posts while investigating duplicate observations in my data. However, not all AutoModerator posts are flagged as duplicates, because their titles often include the posting date (see the code below).

Posts by AutoModerator are meant to catalyze community engagement: users share their skincare thoughts and concerns in the comments rather than in the post itself. I will therefore remove these posts so that my insights reflect the community rather than the moderator.

After removing AutoModerator posts, I have 29,153 posts (29%) of my data left.

##### Get a sense of AutoModerator's post `title`s

In [16]:
def automoderator_posts(col_name, str_keywords):
    for post in df[(df['author'] == 'AutoModerator') & (df[col_name].str.contains(str_keywords))][col_name]:
        print(post)

In [17]:
automoderator_posts('title', "It's Casual Friday!")

[Personal] It's Casual Friday! General Chat thread - Apr 17, 2020
[Personal] It's Casual Friday! General Chat thread - Apr 10, 2020
[Personal] It's Casual Friday! General Chat thread - Apr 03, 2020
[Personal] It's Casual Friday! General Chat thread - Mar 27, 2020
[Personal] It's Casual Friday! General Chat thread - Mar 20, 2020
[Personal] It's Casual Friday! General Chat thread - Mar 13, 2020
[Personal] It's Casual Friday! General Chat thread - Mar 06, 2020
[Personal] It's Casual Friday! General Chat thread - Feb 28, 2020
[Personal] It's Casual Friday! General Chat thread - Feb 21, 2020
[Personal] It's Casual Friday! General Chat thread - Feb 14, 2020
[Personal] It's Casual Friday! General Chat thread - Feb 07, 2020
[Personal] It's Casual Friday! General Chat thread - Jan 31, 2020
[Personal] It's Casual Friday! General Chat thread - Jan 24, 2020
[Personal] It's Casual Friday! General Chat thread - Jan 17, 2020
[Personal] It's Casual Friday! General Chat thread - Jan 10, 2020
[Personal]

In [18]:
automoderator_posts('title', "Anti-Haul")

Anti-Haul Monthly April 23, 2020
Anti-Haul Monthly March 26, 2020
Anti-Haul Monthly February 27, 2020
Anti-Haul Monthly January 23, 2020
Anti-Haul Monthly December 26, 2019
Anti-Haul Monthly November 28, 2019
Anti-Haul Monthly October 24, 2019
Anti-Haul Monthly September 26, 2019
Anti-Haul Monthly August 22, 2019
Anti-Haul Monthly July 25, 2019
Anti-Haul Monthly June 27, 2019
Anti-Haul Monthly May 23, 2019
Anti-Haul Monthly April 25, 2019
Anti-Haul Monthly March 28, 2019
Anti-Haul Monthly February 28, 2019


In [19]:
df = df[df['author'] != 'AutoModerator']

print(f'I have {df.shape[0]} ({round((df.shape[0])/100_000*100)}%) left of my data to work with.')

I have 29153 (29%) left of my data to work with.


##### Sanity check:

In [20]:
df[df['author'] == 'AutoModerator']

Unnamed: 0,author,title,selftext,num_comments,score,subreddit,is_ab
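One detail worth noting about the title search above: `str.contains` treats its pattern as a regular expression by default, so keywords containing characters such as `[` or `?` would need escaping. The sketch below passes `regex=False` for a literal match and returns the titles instead of printing them. `automoderator_titles` is a hypothetical variant of the helper, and the third row of the toy frame is invented.

```python
import pandas as pd

def automoderator_titles(frame, keyword):
    """Return AutoModerator titles that contain keyword as a literal substring."""
    mask = (frame['author'] == 'AutoModerator') & frame['title'].str.contains(keyword, regex=False)
    return frame.loc[mask, 'title'].tolist()

toy = pd.DataFrame({
    'author': ['AutoModerator', 'AutoModerator', 'user1'],
    'title': [
        "[Personal] It's Casual Friday! General Chat thread - Apr 17, 2020",
        'Anti-Haul Monthly April 23, 2020',
        "It's Casual Friday! my haul",
    ],
})

friday = automoderator_titles(toy, "It's Casual Friday!")

# Same filter as the cell above: drop every AutoModerator post
without_bot = toy[toy['author'] != 'AutoModerator']

print(friday)
```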


## Check and handle class imbalance
---

After cleaning my data this far, the split in my target variable, `is_ab`, is roughly 30 - 70 (AsianBeauty to SkincareAddiction), which means that there is class imbalance in my data.

I'm keeping 9,000 random samples from the SkincareAddiction subreddit to approximately balance the 8,871 observations I have from AsianBeauty.

I have 17,871 posts (18%) of my data left, which is just enough observations to work with.

##### Check class imbalance

In [21]:
df['subreddit'].value_counts()

skincareaddiction    20282
asianbeauty           8871
Name: subreddit, dtype: int64

##### Keep only 9K random samples from majority class

In [22]:
skin = df[df['is_ab'] == 0].sample(9000)
azn = df[df['is_ab'] == 1]

df = pd.concat([azn, skin], axis=0)

print(f'I have {df.shape[0]} ({round((df.shape[0])/100_000*100)}%) left of my data to work with.')

I have 17871 (18%) left of my data to work with.


##### Sanity check:

In [23]:
df['subreddit'].value_counts(normalize=True)

skincareaddiction    0.503609
asianbeauty          0.496391
Name: subreddit, dtype: float64
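Because `sample` draws different rows on each run, the balanced data set changes whenever the notebook is re-executed. A reproducible variant might look like the sketch below. The seed value 42 and the choice to match the minority class size exactly are assumptions of this example, not what the cell above does.

```python
import pandas as pd

# Toy frame (invented): 7 majority rows (is_ab == 0) and 3 minority rows (is_ab == 1)
toy = pd.DataFrame({
    'is_ab': [0] * 7 + [1] * 3,
    'title': [f'post {i}' for i in range(10)],
})

minority = toy[toy['is_ab'] == 1]

# Downsample the majority class to the minority size; random_state fixes the draw
majority = toy[toy['is_ab'] == 0].sample(n=len(minority), random_state=42)

balanced = pd.concat([minority, majority], axis=0)
shares = balanced['is_ab'].value_counts(normalize=True)

print(shares)
```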

## Clean text variables
---
Here, I'm cleaning both the `title` and `selftext` variables of unnecessary words, characters, numbers, and even URLs to help filter out the noise from the data.

I initially removed only the preset NLTK stopwords. However, after seeing how often trivial and obvious words appeared in EDA and among the strongest coefficient weights in my first Logistic Regression model, I added a list of custom stopwords. Removing these noise words gives deeper and more meaningful insights about the subreddit communities in both EDA and modeling.

### Create custom stopwords list

I'm loading a pickled list of the top coefficient weights from the first Logistic Regression model (Notebook: `../code/004_model_lg1.ipynb`).

In [24]:
with open('../assets/sorted_features_lg_model1.pkl', 'rb') as f:
    sorted_coef_features = pickle.load(f)

##### Create custom stopwords list 

Create a list of custom stopwords based on the list of top coefficient weights.

In [25]:
stopwords_to_add = [
    # Noise
    'discussion',
    'beauty',
    'http',
    'do',
    'you',
    'nice',
    'favourite',
    'where',
    'care',
    'routine',
    'your',
    'edit',
    'tried',
    'product',
    'products',
    'shop',
    'fluff',
    'items',
    'post',
    'pack',
    'power',
    'com',
    'www',
    'black',
    'friday',
    'free',
    'shipping',
    'help',
    'removed',
    'guys',
    'buy',
    'like',
    'really',
    've',
    'use',
    'help',
    'just',
    'using',
    'don',
    'know',
    'hg',
    'ebay',
    'skin',
    'face',
    'amp',
    
    # Obvious signal
    'ab',
    'abers',
    'asian','asians',
    'korean',
    'japanese',
    'korea',
    'japan',
    'yesstyle',
    'jolse',
    'hong','kong',
    'soko','glam']

##### Compile the stopword lists, including custom, NLTK, and CountVectorizer stopwords

In [26]:
cvec_stopwords = list(CountVectorizer(stop_words = 'english').get_stop_words())

nltk_stopwords = stopwords.words('english')

custom_stopwords = stopwords_to_add + cvec_stopwords + nltk_stopwords

##### Pickle compiled list for ease of recall in other notebooks

In [27]:
file_name = '../assets/custom_stopwords.pkl'

with open(file_name, 'wb') as f:
    pickle.dump(custom_stopwords, f)
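The compiled list contains repeats (for example, `'help'` appears twice in the custom list, and the NLTK and CountVectorizer lists overlap). That is harmless once the list is turned into a set, but it can be deduplicated before pickling. The sketch below uses short stand-in lists and a temporary file path, both of which are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Stand-in lists; the real ones come from the cells above
stopwords_to_add = ['help', 'skin', 'help']
other_stopwords = ['the', 'skin', 'a']

# dict.fromkeys drops duplicates while preserving first-seen order
combined = list(dict.fromkeys(stopwords_to_add + other_stopwords))

# Round-trip through a pickle file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'custom_stopwords.pkl')
with open(path, 'wb') as f:
    pickle.dump(combined, f)
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored)
```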

### Clean data from irrelevant characters and noise words

##### Control text with stopword: 'asian'

In [28]:
print(df.loc[68619,'selftext'])

Belong to us Asians :

- https://old.reddit.com/r/awcmovement/comments/9uc9i5/only_asians_look_like_anime_characters_only/

- https://old.reddit.com/r/awcmovement/comments/9uc93h/only_asians_look_like_anime_characters_only/

- https://old.reddit.com/r/awcmovement/comments/9uc8mo/only_asians_look_like_anime_characters_only/

- https://old.reddit.com/r/awcmovement/comments/9uc84q/only_asians_look_like_anime_characters_only/

- https://old.reddit.com/r/awcmovement/comments/9uc7b8/only_asians_look_like_anime_characters_only/

- And hundreds millions others out there waiting for us Asian guys to play with them.

The biggest evidence we Asian guys got the best girls can be directly seen on how they look like, the only reason they can be/look like that is because they inherit our Asian genetics when their mother, mother's mothers and so on chose to be inseminated by us Asians guys, of course their mother, mother's mother also not look much different from them when they are young. If our Asian

##### Define a function that cleans our text data

In [29]:
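def clean_text(raw_text):
    # Strip HTML markup and decode entities such as &amp;
    text = BeautifulSoup(raw_text, 'html.parser').get_text()
    # Remove lines that start with a URL; this must run before punctuation is stripped
    text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    # Replace every non-letter character with a space
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    # Set membership is faster than searching a list
    stops = set(custom_stopwords)
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)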

# 5.03 Lecture
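To make the order of the cleaning steps concrete, here is a standard-library-only sketch of the same idea. It is not the function used in this notebook: `clean_text_sketch` is a hypothetical name, `html.unescape` stands in for BeautifulSoup, the URL pattern matches URLs anywhere in the text rather than only at the start of a line, and the stopword set is a tiny made-up one.

```python
import html
import re

# Assumed pattern: any run of non-space characters that starts with http:// or https://
URL_PATTERN = re.compile(r'https?://\S+')

def clean_text_sketch(raw_text, stops):
    text = html.unescape(raw_text)           # decode entities such as &amp;
    text = URL_PATTERN.sub(' ', text)        # drop URLs while '://' is still intact
    text = re.sub(r'[^a-zA-Z]', ' ', text)   # keep letters only
    words = text.lower().split()
    return ' '.join(w for w in words if w not in stops)

sample = 'B&amp;A: see https://old.reddit.com/r/example/ for my 3 tips!'
cleaned = clean_text_sketch(sample, stops={'for', 'my', 'see'})
print(cleaned)  # b a tips
```

The URL step has to come before the letters-only step: once `:` and `/` are replaced by spaces, a URL pattern can no longer match.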

##### Define a function that returns cleaned text columns

In [30]:
def clean_text_columns(df, col_name):
    # Apply clean_text to every value in the column (modifies df in place)
    df[col_name] = df[col_name].apply(clean_text)
    return df[col_name]

##### Clean `title`

In [31]:
clean_text_columns(df, 'title')

  ' that document to Beautiful Soup.' % decoded_markup


50009    working seasoned estheticians hundreds consult...
50014                                           wash water
50019                                 favorite repurchased
50020                                              times w
50034    recommendations minimal ingredient effective h...
                               ...                        
49252                   misc skincare new year resolutions
23438    concerns tips facemasks tightening discolorati...
31458                       improve skincare pores redness
2257                             fixed damaged barrier yay
22390    question new olay body wash b hyaluronic point...
Name: title, Length: 17871, dtype: object

##### Clean `selftext`

In [32]:
clean_text_columns(df, 'selftext')

  ' that document to Beautiful Soup.' % decoded_markup

50009    project glowism https glowism friend female en...
50014    hello wondering wash water usually refrained w...
50019    personal favorites mugwort mask calms helps fi...
50020    recently purchased cosrx advanced snail cream ...
50034                                                title
                               ...                        
49252    hello sca thought interesting start skincare n...
23438    heard benefits facemasks used im year old male...
31458    imgur jufiyr think pictures speak redness zone...
2257     type concerns year old lighter tan think combi...
22390    trying body wash dullness tone arms legs affor...
Name: selftext, Length: 17871, dtype: object

##### Sanity check:

This is the same post as the control text above, after cleaning; the stopword 'asian' no longer appears.

In [33]:
print(df.loc[68619,'selftext'])

belong https old reddit r awcmovement comments uc look anime characters https old reddit r awcmovement comments uc h look anime characters https old reddit r awcmovement comments uc mo look anime characters https old reddit r awcmovement comments uc q look anime characters https old reddit r awcmovement comments uc b look anime characters hundreds millions waiting play biggest evidence got best girls directly seen look reason look inherit genetics mother mother mothers chose inseminated course mother mother mother look different young girls chose inseminated western males losers african males males look uncute masculine look old western girls non girls eye rolling banana explosion occur girl gives guy intense pleasure affected attractive attractive especially innocent girl powerful banana explode st fundriser mod status r awcmovement article slot antiwesterncosplayers blogspot chance mod fundrisers contribute yen fundrising mod status limited approve remove make flair deleting sub arti

## Reset index
---
Resetting indices of the cleaned data.

In [34]:
df = df.reset_index(drop=True)

## Save as .csv file
---
After saving my cleaned data as a .csv file, I ran into missing values in my EDA. As the code below shows, these are caused by the text cleaning, which reduced posts made up entirely of noise words or URLs to an empty string, `''`.

You will see that I drop missing values after reading my data in the subsequent notebook(s), since these empty strings are inevitably read back as `NaN`.

##### Save cleaned data to .csv

In [35]:
df.to_csv('../data/cleaned_skincare.csv', index = False)

#### Confirm missing values are blank cells

In [36]:
check = pd.read_csv('../data/cleaned_skincare.csv')
check.shape

(17871, 7)

##### Checking missing `title`s

In [40]:
check[check['title'].isna()]

Unnamed: 0,author,title,selftext,num_comments,score,subreddit,is_ab
64,lilsozy,,seen changes country orders got changed australia,1,1,asianbeauty,1
65,NamakaJewelry,,namaka jonathan sayeb kristina ganina jonathan...,1,1,asianbeauty,1
94,liachen03,,canada means lot specific hard b n love magic ...,0,1,asianbeauty,1
124,talarkadeh,,img lpxctm https talarkadeh articles wedding g...,1,1,asianbeauty,1
142,Outrageous-World,,ingredients look different curel intensive moi...,0,1,asianbeauty,1
...,...,...,...,...,...,...,...
17493,Mpos072,,social media seeing people aha bha peeling sol...,16,1,skincareaddiction,0
17497,acikwofi,,hi received azelaic acid plant derived hemi sq...,2,1,skincareaddiction,0
17508,Universalhoed,,suffering texture hate skins texture exactly a...,3,1,skincareaddiction,0
17613,heathaleatha,,months postpartum breastfeeding trying revamp ...,1,1,skincareaddiction,0


In [54]:
df.loc[64, 'title']

''

##### Checking missing `selftext`s

In [52]:
check[check['selftext'].isna()][:3]

Unnamed: 0,author,title,selftext,num_comments,score,subreddit,is_ab
51,-daifuku-,kikumasamune hadalabo premium lotion,,37,1,asianbeauty,1
317,JuliaOphelia,alternative curology asia southeast asia,,1,1,asianbeauty,1
990,pizzaoven12,quaaludes,,1,1,asianbeauty,1


In [53]:
df.loc[51, 'selftext']

''
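The round trip can be reproduced on a small example. The sketch below writes a toy frame with empty strings to an in-memory CSV and reads it back twice: once with the defaults, and once with `keep_default_na=False`, which keeps blank cells as `''`. The second read is only an alternative for illustration; this project drops the resulting `NaN` rows instead.

```python
import io
import pandas as pd

# Toy frame (invented values) with one empty title and one empty selftext
toy = pd.DataFrame({
    'title': ['', 'wash water'],
    'selftext': ['seen changes', ''],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default behavior: blank cells are parsed as NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False: blank cells stay as empty strings
buffer.seek(0)
kept_blank = pd.read_csv(buffer, keep_default_na=False)

print(default_read.isna().sum())
```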

## Screenshots for reference
---

##### Fig 1.
>DISCOUNT CODE: "KINDNESS" for an additional discount on yesstyle.com :D <3

<img src="../assets/duplicate1.png" width="50%" height="50%">

##### Fig 2.
>when to apply overnight exfoliator?

<img src="../assets/duplicate2.png" width="50%" height="50%">

##### Fig 3.
>[Skin Concerns] Red, itchy skin on body

<img src="../assets/duplicate3.png" width="50%" height="50%">

##### Fig 4.
>Daily Fluff and Hauls

<img src="../assets/duplicate4.png" width="50%" height="50%">