# Project 3

## Part 1: Data Cleaning, Feature Engineering, EDA

I have my post content; now I need to inspect it, clean it up where necessary, generate train-test splits, and generate word counts for analysis.

### 0. Imports and Preliminaries

In [10]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
import string # for punctuation

In [11]:
# load data
data_path = '../data/scrapes.json'
post_df = pd.read_json(data_path, orient='index')
post_df.shape

(2187, 6)

In [12]:
post_df.head()

| | uid | time | title | body-text | media | comments |
|---|---|---|---|---|---|---|
| 0 | 214SaturnReturnMEGATHREAD-we'vebeengettingalot... | 2022-02-01 | Saturn Return MEGATHREAD - we've been getting ... | | | 330 |
| 1 | 51MERCURYRXINFOGRAPHIC:Taurus/Gemini,Apr-Jun20220 | 2022-06-01 | MERCURY RX INFOGRAPHIC: Taurus/Gemini, Apr-Jun... | | | 22 |
| 2 | 17CHANIappissues?221 | 2022-08-30 | CHANI app issues? | I just downloaded the CHANI app to try out and... | | 4 |
| 3 | 124Makeeverymilecountnextseason—getthelatestcr... | | Make every mile count next season—get the late... | | True | 0 |
| 4 | 86IsMercuryinAquariusinthe6thHouseaspowerfulas... | 2022-08-30 | Is Mercury in Aquarius in the 6th House as pow... | Not new to the deeper parts of astrology but t... | | 8 |


### 1. EDA and Cleaning

In [13]:
post_df[post_df['time'] == ''].head() # are these all ads?

| | uid | time | title | body-text | media | comments |
|---|---|---|---|---|---|---|
| 3 | 124Makeeverymilecountnextseason—getthelatestcr... | | Make every mile count next season—get the late... | | True | 0 |
| 9 | 254Yougottabereadyforanythingifyouwanttokeepup... | | You gotta be ready for anything if you want to... | | True | 0 |
| 15 | 237[MEGATHREAD]Reddit,meettheNextGenerationGMC... | | [MEGATHREAD] Reddit, meet the Next Generation ... | TL;DR: GMC has unveiled the incredibly capable... | | 0 |
| 21 | 145OkReddit–ifwegaveyouanicechestfulloficecold... | | Ok Reddit – if we gave you an ice chest full o... | | True | 5 |
| 27 | 171Thedaysofmissingnotificationsareover.Withit... | | The days of missing notifications are over. Wi... | | True | 0 |


In [14]:
# save a filter for these rows that might be ads
adfilter = post_df['time'] == ''

In [15]:
# inspect these rows in a bit more detail
post_df[adfilter]['title']

3       Make every mile count next season—get the late...
9       You gotta be ready for anything if you want to...
15      [MEGATHREAD] Reddit, meet the Next Generation ...
21      Ok Reddit – if we gave you an ice chest full o...
27      The days of missing notifications are over. Wi...
                              ...                        
2182    This Labor Day, enjoy a $45 promotional rate f...
2183    Even Jason, a dedicated triathlete, struggled ...
2184    Get a grip on your future by serving your coun...
2185    TRUST The Structure: Anyone can edit. Anyone C...
2186                   Full speed. Full Send. #TeamToyota
Name: title, Length: 215, dtype: object

In [16]:
# conclusion - all the posts with no timestamp are ads.
# Even if they aren't all ads, I'm willing to forgo 215 entries.
post_df.drop(post_df[adfilter].index, inplace=True)
post_df.shape

(1972, 6)

In [17]:
# convert time column to datetime format (can probably do more with this
# in that format)
dateformat = '%Y-%m-%d'
post_df['time'] = pd.to_datetime(post_df['time'], format=dateformat)
post_df.dtypes

uid                  object
time         datetime64[ns]
title                object
body-text            object
media                object
comments              int64
dtype: object
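To illustrate what the datetime format makes possible, here is a minimal sketch using the `.dt` accessor on a few toy dates. The dates are borrowed from the preview rows above; the derived names (`months`, `weekdays`, `posts_per_day`) are hypothetical and not used elsewhere in this notebook.

```python
import pandas as pd

# Toy timestamps in the same '%Y-%m-%d' format as the 'time' column.
times = pd.to_datetime(
    pd.Series(["2022-02-01", "2022-06-01", "2022-08-30", "2022-08-30"]),
    format="%Y-%m-%d",
)

months = times.dt.month          # calendar month as an integer
weekdays = times.dt.day_name()   # day-of-week label for each post

# Number of posts per calendar date, in chronological order.
posts_per_day = times.value_counts().sort_index()
```

Any of these could be attached to `post_df` as extra columns if posting time turns out to matter for the analysis.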

In [21]:
# convert media column to a 0/1 integer flag; casting to str first handles
# both the string 'True' and a boolean True
post_df['media'] = post_df['media'].astype(str).eq('True').astype(int)
post_df.dtypes

uid                  object
time         datetime64[ns]
title                object
body-text            object
media                 int64
comments              int64
dtype: object

Columns are about as clean as can be for now.

### 2. Feature Engineering

Some useful features to have at my fingertips would be title-length, title-word-count, post-length, and post-word-count.
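A minimal sketch of how these four features could be computed is shown below. It assumes lengths are measured in characters, words are split on whitespace, and the new columns reuse the feature names above; the helper `add_length_features` and the toy frame are hypothetical stand-ins for `post_df`.

```python
import pandas as pd

# Toy stand-in for post_df with the two text columns used here.
toy_df = pd.DataFrame({
    "title": ["CHANI app issues?", "What is the proper orb for a sextile?"],
    "body-text": ["I just downloaded the CHANI app", ""],
})

def add_length_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with character and word counts for title and body."""
    out = df.copy()
    # Treat missing text as an empty string so every count is an integer.
    title = out["title"].fillna("").astype(str)
    body = out["body-text"].fillna("").astype(str)
    out["title-length"] = title.str.len()
    out["title-word-count"] = title.str.split().str.len()
    out["post-length"] = body.str.len()
    out["post-word-count"] = body.str.split().str.len()
    return out

features = add_length_features(toy_df)
```

Splitting on whitespace is the simplest possible word count; swapping in `word_tokenize` from the imports would also split punctuation into its own tokens, which changes the counts.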

In [9]:
post_df.head()

| | uid | time | title | body-text | media | comments |
|---|---|---|---|---|---|---|
| 0 | 214SaturnReturnMEGATHREAD-we'vebeengettingalot... | 2022-02-01 | Saturn Return MEGATHREAD - we've been getting ... | | | 330 |
| 1 | 51MERCURYRXINFOGRAPHIC:Taurus/Gemini,Apr-Jun20220 | 2022-06-01 | MERCURY RX INFOGRAPHIC: Taurus/Gemini, Apr-Jun... | | | 22 |
| 2 | 17CHANIappissues?221 | 2022-08-30 | CHANI app issues? | I just downloaded the CHANI app to try out and... | | 4 |
| 4 | 86IsMercuryinAquariusinthe6thHouseaspowerfulas... | 2022-08-30 | Is Mercury in Aquarius in the 6th House as pow... | Not new to the deeper parts of astrology but t... | | 8 |
| 5 | 37Whatistheproperorbforasextile?224 | 2022-08-30 | What is the proper orb for a sextile? | What is the proper and respective orb for a se... | | 8 |
