# Project 3

## Part 1: Data Cleaning, Feature Engineering, EDA

I have the scraped post content; the next step is to inspect it and clean it up where necessary.

### 0. Imports and Preliminaries

In [1]:
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize, regexp_tokenize
import string # for punctuation
import ipynb_utils as ipyutils  # custom variables and utility functions

In [2]:
# load data
data_path = '../data/scrapes.json'
post_df = pd.read_json(data_path, orient='index')
post_df.shape

(2187, 6)

In [3]:
post_df.head()

|  | uid | time | title | body-text | media | comments |
|---|---|---|---|---|---|---|
| 0 | 214SaturnReturnMEGATHREAD-we'vebeengettingalot... | 2022-02-01 | Saturn Return MEGATHREAD - we've been getting ... |  |  | 330 |
| 1 | 51MERCURYRXINFOGRAPHIC:Taurus/Gemini,Apr-Jun20220 | 2022-06-01 | MERCURY RX INFOGRAPHIC: Taurus/Gemini, Apr-Jun... |  |  | 22 |
| 2 | 17CHANIappissues?221 | 2022-08-30 | CHANI app issues? | I just downloaded the CHANI app to try out and... |  | 4 |
| 3 | 124Makeeverymilecountnextseason—getthelatestcr... |  | Make every mile count next season—get the late... |  | True | 0 |
| 4 | 86IsMercuryinAquariusinthe6thHouseaspowerfulas... | 2022-08-30 | Is Mercury in Aquarius in the 6th House as pow... | Not new to the deeper parts of astrology but t... |  | 8 |


### 1. EDA and Cleaning

In [4]:
post_df[post_df['time'] == ''].head() # are these all ads?

|  | uid | time | title | body-text | media | comments |
|---|---|---|---|---|---|---|
| 3 | 124Makeeverymilecountnextseason—getthelatestcr... |  | Make every mile count next season—get the late... |  | True | 0 |
| 9 | 254Yougottabereadyforanythingifyouwanttokeepup... |  | You gotta be ready for anything if you want to... |  | True | 0 |
| 15 | 237[MEGATHREAD]Reddit,meettheNextGenerationGMC... |  | [MEGATHREAD] Reddit, meet the Next Generation ... | TL;DR: GMC has unveiled the incredibly capable... |  | 0 |
| 21 | 145OkReddit–ifwegaveyouanicechestfulloficecold... |  | Ok Reddit – if we gave you an ice chest full o... |  | True | 5 |
| 27 | 171Thedaysofmissingnotificationsareover.Withit... |  | The days of missing notifications are over. Wi... |  | True | 0 |


In [5]:
# save a filter for these rows that might be ads
adfilter = post_df['time'] == ''

In [6]:
# inspect these rows in a bit more detail
post_df[adfilter]['title']

3       Make every mile count next season—get the late...
9       You gotta be ready for anything if you want to...
15      [MEGATHREAD] Reddit, meet the Next Generation ...
21      Ok Reddit – if we gave you an ice chest full o...
27      The days of missing notifications are over. Wi...
                              ...                        
2182    This Labor Day, enjoy a $45 promotional rate f...
2183    Even Jason, a dedicated triathlete, struggled ...
2184    Get a grip on your future by serving your coun...
2185    TRUST The Structure: Anyone can edit. Anyone C...
2186                   Full speed. Full Send. #TeamToyota
Name: title, Length: 215, dtype: object

In [7]:
# Conclusion: the posts with no timestamp look like ads.
# Even if a few of them are not, I'm willing to forgo this small subset
# (215 of 2187 rows), since most of the titles read like ad copy.
post_df.drop(post_df[adfilter].index, inplace=True)
post_df.shape

(1972, 6)
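The drop above can also be written as a boolean-mask selection that returns a new frame instead of mutating in place. The following is a minimal sketch on a toy frame; the names `toy`, `ad_mask`, and `kept` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame: rows with an empty 'time' stand in for suspected ads.
toy = pd.DataFrame({
    'time': ['2022-02-01', '', '2022-08-30', ''],
    'title': ['Saturn Return', 'Some ad copy', 'CHANI app issues?', 'More ad copy'],
})

# True where the timestamp is missing (same test as the notebook's adfilter).
ad_mask = toy['time'] == ''

# Keep only the rows that are NOT flagged; .copy() avoids chained-assignment
# warnings if new columns are added to the result later.
kept = toy[~ad_mask].copy()

print(ad_mask.sum(), 'rows flagged;', len(kept), 'rows kept')
```

Keeping the original index (0 and 2 here) mirrors what the notebook does, where the row labels skip the dropped entries.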

In [8]:
# convert media column to a 0/1 integer flag (1 where the scraped value is the string 'True')
post_df['media'] = (post_df['media'] == 'True').astype(int)
post_df.dtypes

uid          object
time         object
title        object
body-text    object
media         int64
comments      int64
dtype: object

Columns are about as clean as can be for now.
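One caveat on the `media` conversion: comparing against the string `'True'` only matches if the JSON stored the value as text. If some rows held a real boolean `True`, they would be mapped to 0. The sketch below shows one way to handle both cases; the helper `to_flag` and the toy Series are assumptions for illustration, not something the notebook defines.

```python
import pandas as pd

# Hypothetical toy column mixing the kinds of values a scrape might yield.
media = pd.Series(['True', '', True, None, 'true'])

def to_flag(value) -> int:
    """Return 1 for boolean True or the text 'true' (any case), else 0."""
    return int(str(value).strip().lower() == 'true')

media_flag = media.map(to_flag)
print(media_flag.tolist())
```

Checking `post_df['media'].value_counts()` before and after the conversion would confirm whether this extra care is needed for the actual data.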

### 2. Feature Engineering

Some useful features to have at my fingertips are the character count and word count of each title and each post body (`title-cc`, `title-wc`, `body-cc`, `body-wc`).

In [9]:
post_df.head()

|  | uid | time | title | body-text | media | comments |
|---|---|---|---|---|---|---|
| 0 | 214SaturnReturnMEGATHREAD-we'vebeengettingalot... | 2022-02-01 | Saturn Return MEGATHREAD - we've been getting ... |  | 0 | 330 |
| 1 | 51MERCURYRXINFOGRAPHIC:Taurus/Gemini,Apr-Jun20220 | 2022-06-01 | MERCURY RX INFOGRAPHIC: Taurus/Gemini, Apr-Jun... |  | 0 | 22 |
| 2 | 17CHANIappissues?221 | 2022-08-30 | CHANI app issues? | I just downloaded the CHANI app to try out and... | 0 | 4 |
| 4 | 86IsMercuryinAquariusinthe6thHouseaspowerfulas... | 2022-08-30 | Is Mercury in Aquarius in the 6th House as pow... | Not new to the deeper parts of astrology but t... | 0 | 8 |
| 5 | 37Whatistheproperorbforasextile?224 | 2022-08-30 | What is the proper orb for a sextile? | What is the proper and respective orb for a se... | 0 | 8 |


In [10]:
ipyutils.PAT_TOKEN

"[\\w\\d']+"

In [11]:
# title word count
post_df['title-wc'] = [len(regexp_tokenize(t, ipyutils.PAT_TOKEN)) 
                       for t in post_df['title']]
post_df['title-wc'].head(2)

0    36
1     8
Name: title-wc, dtype: int64
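The word counts depend entirely on what the pattern treats as a token. As a rough illustration, the sketch below applies the same pattern string with the standard library's `re.findall`, which should behave like `regexp_tokenize` for a pattern with no capturing groups. `PAT_TOKEN` is redefined locally here on the assumption that it equals the value printed above; `count_tokens` is a hypothetical helper.

```python
import re

# Assumption: identical to the string the notebook prints for ipyutils.PAT_TOKEN.
PAT_TOKEN = r"[\w\d']+"

def count_tokens(text: str) -> int:
    """Count runs of word characters, digits, and apostrophes."""
    return len(re.findall(PAT_TOKEN, text))

# Punctuation such as '?' is dropped; contractions stay in one piece.
sample_tokens = re.findall(PAT_TOKEN, "CHANI app issues?")
print(sample_tokens)
print(count_tokens("we've been getting a lot"))
```

Because `\w` already matches digits, the `\d` in the pattern is redundant but harmless.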

In [12]:
# title character count
post_df['title-cc'] = [len(t) for t in post_df['title']]
post_df['title-cc'].head(2)

0    214
1     51
Name: title-cc, dtype: int64

In [13]:
# body word count
post_df['body-wc'] = [len(regexp_tokenize(t, ipyutils.PAT_TOKEN)) 
                      for t in post_df['body-text']]
post_df['body-wc'].head()

0     0
1     0
2    50
4    59
5    45
Name: body-wc, dtype: int64

In [14]:
# body character count
post_df['body-cc'] = [len(t) for t in post_df['body-text']]
post_df['body-cc'].head()

0      0
1      0
2    221
4    314
5    224
Name: body-cc, dtype: int64
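The four list comprehensions above could also be expressed with pandas string methods, which keeps the work column-wise. This is a sketch on a toy frame rather than the notebook's data; `toy` and the locally defined `PAT_TOKEN` are assumptions, and `Series.str.count` is used as a stand-in for counting `regexp_tokenize` matches.

```python
import pandas as pd

# Assumption: same pattern string as ipyutils.PAT_TOKEN.
PAT_TOKEN = r"[\w\d']+"

# Hypothetical toy frame with the same text column names as post_df.
toy = pd.DataFrame({
    'title': ['CHANI app issues?', 'What is the proper orb for a sextile?'],
    'body-text': ['', 'I just downloaded the app'],
})

for col, prefix in [('title', 'title'), ('body-text', 'body')]:
    # Number of non-overlapping pattern matches per string (word count).
    toy[f'{prefix}-wc'] = toy[col].str.count(PAT_TOKEN)
    # Number of characters per string (character count).
    toy[f'{prefix}-cc'] = toy[col].str.len()

print(toy[['title-wc', 'title-cc', 'body-wc', 'body-cc']])
```

An empty body yields 0 for both counts, matching the zeros in the notebook output for posts without body text.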

### 3. Write Clean Data To Disk

In [15]:
# reorganize columns and remove uid column - only needed it to find duplicates
post_df = post_df[['time','title','body-text',
                  'title-cc','title-wc','body-cc','body-wc',
                  'media','comments']]
post_df.columns

Index(['time', 'title', 'body-text', 'title-cc', 'title-wc', 'body-cc',
       'body-wc', 'media', 'comments'],
      dtype='object')

In [16]:
post_df.to_json('../data/scrapes-clean.json', orient='index')
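A quick round-trip check can confirm that a frame written with `orient='index'` reads back unchanged. The sketch below serializes a toy frame to an in-memory string instead of a file; `toy`, `json_text`, and `restored` are hypothetical names, and `convert_dates=False` is passed so the `time` strings are left as text.

```python
import io
import pandas as pd

# Hypothetical toy frame with a gap in the index, as in the cleaned post_df.
toy = pd.DataFrame(
    {
        'time': ['2022-02-01', '2022-08-30'],
        'title': ['Saturn Return', 'CHANI app issues?'],
        'body-text': ['', 'I just downloaded the app'],
        'comments': [330, 4],
    },
    index=[0, 2],
)

# With no path argument, to_json returns the JSON text.
json_text = toy.to_json(orient='index')

# Read it back the same way the notebook loads its input file.
restored = pd.read_json(io.StringIO(json_text), orient='index',
                        convert_dates=False)

print(restored.shape)
```

Comparing `restored` with the original on the real data (shape, columns, and a few spot-checked rows) would catch any silent type changes before Part 2 picks up the file.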

In [17]:
# END