# Scraping subreddits


## Problem statement
---
We are producing a trivia-focused game show in the style of _Jeopardy_ or _Who Wants to Be a Millionaire_*, where we want the audience to guess the source of the movie details. Everything is scrambled together! The task: figure out whether the details came from good movie details, or whether the movie production team took a shortcut and landed in Sh**ty Movie Details!

*Trademarks of their respective productions

To do this, we are going to solve the problem using a classifier trained on some subreddit data.

_Thanks to Gwen Rathgeber for some inspiration during the quest for a compelling problem statement!_

![](https://www.bigraildiversity.co.uk/wp-content/uploads/2018/10/Night-at-the-Movies-Converted-900x600.png)


## Data sources and references

---

1. GENERAL NOTE ON ATTRIBUTION: The work throughout this Report, including many if not most coding techniques, relies on and borrows heavily from code discussed in Riley Dallas, Sophie "Sonya" Tabac, Charlie Rice, Heather Robbins, and Gwen Rathgeber's class lectures, notebooks, GitHub repositories, tips on techniques, and troubleshooting help. The code is adapted to solve our specific problem.


2. The following subReddits were scraped:

    * [Movie Details](https://www.reddit.com/r/MovieDetails/)
    * [Sh**ty Movie Details](https://www.reddit.com/r/shittymoviedetails/)

---

### Note on the data and style

Strong language may appear in various Reddit posts in raw form. To the extent possible, it shall be cleaned in the course of the project. There also may be some humor used throughout the presentation of the analysis.

---

## Webscrape - let's fetch some data!

We'll import all our required libraries up here.

In [1]:
#import libraries
import pandas as pd, numpy as np, requests, time, nltk, datetime as dt

#NLP
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

import gensim.downloader as api #allows us to get the word2vec and GloVe embeddings that we need
from gensim.models.word2vec import Word2Vec
from transformers import pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


#classifiers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, plot_confusion_matrix, \
                             recall_score, precision_score)
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.dummy import DummyClassifier


# easier to see full text with a bigger maxwidth:
pd.options.display.max_colwidth = 200

In [2]:
#pip install python-Levenshtein
#this is for the gensim.similarities.levenshtein submodule

Next, we will use the PushShift API to collect the subReddit data we need.

More information about the API can be found on the [Pushshift repo](https://github.com/pushshift/api).

### Set up single post scrape

In [3]:
#Query the PushShift API for the subReddit data we need
url = 'https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails'
url2 = 'https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails'

_PRO-TIP: We can paste the URL in browser to preview the JSON format._

Next, we need to define some parameters and then confirm that we get a response from the API for both subReddits. (NOTE: No key is needed for these Pushshift API requests.)

In [4]:
#define what we need to get from the subreddit
params = {
    'subreddit': 'MovieDetails',
    'size' : 100 #100 seems to be the max, even if we request a larger size
#    'before': ''
}

In [5]:
#define what we need to get from the subreddit
params2 = {
    'subreddit': 'shittymoviedetails',
    'size' : 100 #100 seems to be the max, even if we request a larger size
#    'before': ''
}
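
The two parameter dictionaries differ only in the subreddit name, so they could come from one small helper. This is a minimal sketch: `make_params` is a hypothetical name that is not part of the original notebook, and the optional `before` argument mirrors the commented-out key above.

```python
def make_params(subreddit, size=100, before=None):
    """Build the query parameters for one subreddit (hypothetical helper)."""
    query = {'subreddit': subreddit, 'size': size}
    if before is not None:
        # only send a cutoff when one was actually supplied
        query['before'] = before
    return query


params = make_params('MovieDetails')
params2 = make_params('shittymoviedetails')
```

Both dictionaries match the hand-written ones above, so the rest of the notebook can use them unchanged.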

Let's check that our response is valid -- we are looking for a 200 response code.

In [6]:
res = requests.get(url, params) 
res.status_code

200

In [7]:
res2 = requests.get(url2, params2)
res2.status_code

200

Let's look at the content we fetched, for a single post:

In [8]:
print(type(res)) #check attrs

data = res.json() #get the content in JSON format

orig_posts = data['data'][0] #fetch the first post (a dict) from the list of posts

#verify what we got
print(orig_posts)

print(type(orig_posts))

print(data.keys())

<class 'requests.models.Response'>
{'all_awardings': [], 'allow_live_comments': False, 'author': 'ambisinister_gecko', 'author_flair_css_class': None, 'author_flair_richtext': [], 'author_flair_text': None, 'author_flair_type': 'text', 'author_fullname': 't2_640pfb72', 'author_patreon_flair': False, 'author_premium': False, 'awarders': [], 'can_mod_post': False, 'contest_mode': False, 'created_utc': 1619336854, 'domain': 'self.MovieDetails', 'full_link': 'https://www.reddit.com/r/MovieDetails/comments/my3gul/mortal_kombat_2021_has_a_fight_scene_in_a_trailer/', 'gildings': {}, 'id': 'my3gul', 'is_crosspostable': True, 'is_meta': False, 'is_original_content': False, 'is_reddit_media_domain': False, 'is_robot_indexable': True, 'is_self': True, 'is_video': False, 'link_flair_background_color': '#0079d3', 'link_flair_css_class': 'Image', 'link_flair_richtext': [{'e': 'text', 't': '🥚 Easter Egg'}], 'link_flair_template_id': 'b968d6fa-5f2d-11e7-8e64-0e8c19beb1c6', 'link_flair_text': '🥚 Easter

Everything we do, we have to repeat for our second subReddit.

In [9]:
#repeat for the 2nd subreddit
data2 = res2.json() #get the content in JSON format
posts2 = data2['data'][0] #fetch the first post (a dict) from the list of posts
posts2

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'Sarke1',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_4ta1n',
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1619335786,
 'domain': 'i.imgur.com',
 'full_link': 'https://www.reddit.com/r/shittymoviedetails/comments/my392g/the_i_am_legend_script_originally_featured_aliens/',
 'gildings': {},
 'id': 'my392g',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': False,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_comments': 0,
 'num_crossposts': 0,
 'over_18': False,
 'parent

In [10]:
len(orig_posts)

61

Let's take a look at what we pulled in a more visually accessible format.

In [11]:
#create df
results_df = pd.DataFrame(data['data'])
results_df2 = pd.DataFrame(data2['data'])
results_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,removed_by_category,media,media_embed,secure_media,secure_media_embed,gallery_data,is_gallery,media_metadata,author_flair_template_id,author_flair_text_color
0,[],False,ambisinister_gecko,,[],,text,t2_640pfb72,False,False,...,,,,,,,,,,
1,[],False,PrivateEducation,,[],,text,t2_b0a3h,False,False,...,,,,,,,,,,
2,[],False,PrivateEducation,,[],,text,t2_b0a3h,False,False,...,moderator,,,,,,,,,
3,[],False,advent_precursor,,[],,text,t2_2ghlqap4,False,False,...,,,,,,,,,,
4,[],False,stalwartGRUNT,,[],,text,t2_1k03g353,False,False,...,,,,,,,,,,


In [12]:
results_df.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_css_class',
       'link_flair_richtext', 'link_flair_template_id', 'link_flair_text',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers',
       'subreddit_type', 'suggested_

There is a lot of metadata here. We probably will not need most of it.

Here are the important tags we plan to use:

* subreddit
* selftext
* title
* created_utc
* author
* is_self (to filter out link-only posts)
* score
* num_comments

Let's grab just the metadata we need.

In [13]:
subfields = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 'is_self', \
'score', 'num_comments']

In [14]:
results_df = results_df[subfields]
results_df2 = results_df2[subfields]
results_df.head(2)

Unnamed: 0,title,selftext,subreddit,created_utc,author,is_self,score,num_comments
0,Mortal Kombat (2021) has a fight scene in a trailer home between &lt;spoiler&gt; and &lt;spoiler&gt;. The scene is meant to mirror the trailer fight scene in Kill Bill Vol 2.,"The mortal kombat fight is between Sonya and Kano, didn't want to leave spoilers in the title.\n\nBoth scenes take place in a trailer home. Both scenes involve fighting a one-eyed (arguably, in th...",MovieDetails,1619336854,ambisinister_gecko,True,1,1
1,George Harrison can be seen in the background of the ExLeper scene of Life of Brian (1979 ) wearing a golden crown with red headwear. Cant find anyone else posting about him being in this scene on...,,MovieDetails,1619336852,PrivateEducation,False,1,1
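
Selecting with `results_df[subfields]` raises a `KeyError` if a pull happens to lack one of the columns. One possible alternative, sketched here on a made-up one-row frame (the `toy` data is illustrative, not scraped), is `reindex`, which fills any missing column with `NaN` instead:

```python
import pandas as pd

subfields = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 'is_self',
             'score', 'num_comments']

# illustrative row that is missing 'selftext' and carries one extra column
toy = pd.DataFrame([{
    'title': 'A detail', 'subreddit': 'MovieDetails', 'created_utc': 1619336854,
    'author': 'someone', 'is_self': True, 'score': 1, 'num_comments': 0,
    'pinned': False,
}])

# keep exactly the columns in subfields, in that order; absent ones become NaN
trimmed = toy.reindex(columns=subfields)
```

Whether a silent `NaN` column is preferable to a loud error depends on how much we trust each pull, so this is a design choice rather than a fix.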


Let's remove any duplicate posts.

In [15]:
#dupes
results_df.drop_duplicates(inplace=True)
results_df2.drop_duplicates(inplace=True)

We also only want original text content here.

In [16]:
#filter only for self posts
results_df = results_df[results_df["is_self"] == True]
results_df2 = results_df2[results_df2['is_self']==True]
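
Self posts can still carry placeholder text such as `[removed]` or `[deleted]` in `selftext`, as the scrape output further down shows. A minimal sketch of filtering those out as well, on a made-up frame (the `toy` data and the `placeholders` list are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'selftext': ['real text', '[removed]', '[deleted]', None],
    'is_self': [True, True, True, False],
})

placeholders = ['[removed]', '[deleted]']

# keep self posts only, then drop rows whose body is a moderation placeholder
self_posts = toy[toy['is_self']]
kept = self_posts[~self_posts['selftext'].isin(placeholders)]
```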

We'll convert each post's timestamp to a readable date, so that we can later set `before` and `after` cutoffs and pull our desired volume of posts.

In [17]:
#convert timestamp to a format we understand
results_df['timestamp'] = results_df['created_utc'].map(dt.date.fromtimestamp)
results_df2['timestamp'] = results_df2['created_utc'].map(dt.date.fromtimestamp)
results_df['timestamp']

0     2021-04-25
5     2021-04-24
6     2021-04-24
8     2021-04-24
19    2021-04-24
26    2021-04-24
27    2021-04-24
50    2021-04-24
63    2021-04-23
64    2021-04-23
65    2021-04-23
66    2021-04-23
68    2021-04-23
72    2021-04-23
88    2021-04-23
95    2021-04-23
98    2021-04-23
Name: timestamp, dtype: object

In [18]:
results_df.shape

(17, 9)

Looks like we get a couple days' worth in a single pull.
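
One caveat: `dt.date.fromtimestamp` converts using the local time zone of the machine running the notebook, while `created_utc` is in UTC, so dates near midnight can shift by a day. A sketch of a UTC-based conversion with `pd.to_datetime`, using two `created_utc` values that appear in this notebook's outputs:

```python
import pandas as pd

toy = pd.DataFrame({'created_utc': [1619336854, 1614153597]})

# interpret the integers as seconds since the epoch, in UTC, then keep the date part
toy['timestamp'] = pd.to_datetime(toy['created_utc'], unit='s', utc=True).dt.date
```

The second value falls on 2021-02-24 in UTC, whereas the scrape output further down renders it as 2021-02-23, which is consistent with a local time zone behind UTC.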

### Set up to pull at a certain frequency

Let's set up our pull to be a bit more dynamic.

In [19]:
#define endpoint
base_url = 'https://api.pushshift.io/reddit/search/submission'

#update params
subreddit = 'MovieDetails'
size = 100

#construct URL
stem = f'{base_url}?subreddit={subreddit}&size={size}'
stem

'https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=100'

In [20]:
stem == url

False

Let's add in a time component to loop over multiple days' worth of posts.

In [21]:
days = 30
url = f'{stem}&after={days}d'
url

'https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=100&after=30d'

In [22]:
#verify output

res=requests.get(url)
assert res.status_code == 200
json_data = res.json()

results_df = pd.DataFrame(json_data['data'])[subfields]

results_df[
    'created_utc'].map(dt.date.fromtimestamp).head()

0    2021-03-27
1    2021-03-27
2    2021-03-27
3    2021-03-27
4    2021-03-27
Name: created_utc, dtype: object

We see something from as far back as a month ago, which is what we were expecting.

In [23]:
results_df.shape

(100, 8)
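
The URL above is assembled by hand with f-strings. As an alternative sketch, the standard library's `urlencode` can build the same query string and would also escape any special characters. `build_url` and its `after_days` argument are hypothetical names introduced for this example:

```python
from urllib.parse import urlencode

base_url = 'https://api.pushshift.io/reddit/search/submission'


def build_url(subreddit, size=100, after_days=None):
    """Return the search URL for one subreddit (hypothetical helper)."""
    query = {'subreddit': subreddit, 'size': size}
    if after_days is not None:
        # same relative-time format as the hand-built URL, e.g. '30d'
        query['after'] = f'{after_days}d'
    return f'{base_url}?{urlencode(query)}'


example_url = build_url('MovieDetails', after_days=30)
```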

### Automate pull

Let's set up a function that fetches posts and appends them to a dataframe.

We will then run this on our desired subReddits.

In [24]:
#loop to iterate through the pulls

def fetch_posts(subreddit, #specify subreddit
                kind='submission', is_video=False, #grab posts; partial ref to Riley Robertson's query (is_video is currently unused)
                day_window=30, n=500): #n = number of requests to make; each request yields one df, and all of them get concatenated

    #establish the endpoint and the fixed part of the query
    base_url = f'https://api.pushshift.io/reddit/search/{kind}'

    stem = f'{base_url}?subreddit={subreddit}&size=500' #the API seems to cap each response at 100 posts regardless

    posts = []

    for i in range(1, n+1): #that many iterations

        URL = '{}&after={}d'.format(stem, day_window * i)

        print('Query from timeframe: ' + URL)

        try: #partial ref to Ben Mathis's query to handle exception
            res = requests.get(URL)
            assert res.status_code == 200

        except (requests.RequestException, AssertionError): #skip failed requests without hiding unrelated errors
            continue

        json_dict = res.json()['data'] #grab the list of post dicts stored under the 'data' key

        df = pd.DataFrame.from_dict(json_dict) #convert the list of post dicts to a df

        posts.append(df) #append the df from each iteration to our posts list

        total_scraped = sum(len(x) for x in posts) #running count of the posts fetched so far

        print(total_scraped)

        if total_scraped > 10_000:
            break

        time.sleep(1) #wait 1 s between requests

    #merge list of dfs from our requests
    full_df = pd.concat(posts, sort=False)

    #TODO: for submissions, also drop rows whose selftext is '[removed]'

    full_df = full_df[subfields] #only want the specific columns
    full_df.drop_duplicates(inplace=True) #de-dupe posts
    full_df = full_df.loc[full_df['is_self']==True] #only grab original submissions

    print(full_df.shape)

    full_df['timestamp'] = full_df['created_utc'].map(dt.date.fromtimestamp) #convert time

    return full_df.reset_index(drop=True)
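
The function above steps through fixed day windows. Another possible approach, hinted at by the commented-out `before` key earlier, is to page backwards by passing the oldest `created_utc` seen so far as the next cutoff. The sketch below shows only that paging logic and runs offline: `collect_posts`, `fetch_page`, and the fake data are all hypothetical, and it assumes the API serves newest posts first and accepts an epoch value for `before`.

```python
def collect_posts(fetch_page, max_posts=1000, start_before=None):
    """Page backwards in time until max_posts is reached or a page is empty.

    fetch_page is a hypothetical callable: it takes a `before` epoch (or None)
    and returns a list of post dicts that each carry 'created_utc'.
    """
    posts, before = [], start_before
    while len(posts) < max_posts:
        page = fetch_page(before)
        if not page:
            break  # nothing older is left
        posts.extend(page)
        # the next request asks only for posts older than the oldest one seen
        before = min(post['created_utc'] for post in page)
    return posts[:max_posts]


# offline stand-in for the API: 250 fake posts, newest first, 100 per page
fake_posts = [{'id': i, 'created_utc': 1_000_000 - i} for i in range(250)]


def fake_fetch(before):
    eligible = [p for p in fake_posts if before is None or p['created_utc'] < before]
    return eligible[:100]


collected = collect_posts(fake_fetch)
```

Because the cutoff is strict, posts that share the exact same `created_utc` as the oldest post on a page could be skipped; a real implementation would need to decide how to handle that.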

In [25]:
movie_deets = fetch_posts('MovieDetails')
movie_deets.head()

Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=30d
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=60d
100
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=90d
200
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=120d
300
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=150d
400
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=180d
500
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=210d
600
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=240d
700
Query from timeframe: https://api.pushshift.io/reddit/s

6273
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=2100d
6373
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=2130d
6473
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=2160d
6573
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=2190d
6673
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=2220d
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=2250d
6773
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=2280d
6873
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=2310d
6973
Query from timeframe: https://ap

12273
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=4140d
12373
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=4170d
12473
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=4200d
12573
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=4230d
12673
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=4260d
12773
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=4290d
12873
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=4320d
12973
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=4350d
13073
Query from timefra

18773
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=6180d
18873
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=6210d
18973
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=6240d
19073
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=6270d
19173
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=6300d
19273
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=6330d
19373
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=6360d
19473
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=6390d
19573
Query from timefra

Unnamed: 0,title,selftext,subreddit,created_utc,author,is_self,score,num_comments,timestamp
0,Free Movie Streaming Sites 2021 | No Signup &amp; Download,[removed],MovieDetails,1614153597,Wafer_Jaded,True,1,0,2021-02-23
1,All time Comedy Movie Tier List,[removed],MovieDetails,1614172795,TrustedMarketing,True,1,2,2021-02-24
2,Comedy Movies Tier List photo from (2008) Step Brothers,[removed],MovieDetails,1614172875,TrustedMarketing,True,1,4,2021-02-24
3,The little things (2021) theory,[deleted],MovieDetails,1614192363,[deleted],True,1,2,2021-02-24
4,The little things (2021) *shoes*,[removed],MovieDetails,1614192478,r35p30t,True,2,4,2021-02-24


In [None]:
movie_deets.shape

### Repeat this process for our second subreddit

In [29]:
bad_deets = fetch_posts('shittymoviedetails')
print(bad_deets.shape)
bad_deets.head()

Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=30d
100
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=60d
200
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=90d
300
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=120d
400
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=150d
500
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=180d
600
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=210d
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=240d
700
Query f

6275
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=1980d
6375
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=2010d
6475
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=2040d
6575
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=2070d
6675
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=2100d
6775
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=2130d
6875
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=2160d
6975
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=

Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=3930d
12475
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=3960d
12575
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=3990d
12675
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=4020d
12775
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=4050d
12875
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=4080d
12975
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=4110d
13075
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&siz

18775
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=5880d
18875
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=5910d
18975
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=5940d
19075
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=5970d
19175
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=6000d
19275
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=6030d
19375
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails&size=500&after=6060d
19475
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetai

Unnamed: 0,title,selftext,subreddit,created_utc,author,is_self,score,num_comments,timestamp
0,"Question, is this a sub to make fun of movies ? Sarcastically?",,shittymoviedetails,1616877480,ExpensiveIngenuity1,True,1,1,2021-03-27
1,"If you ever feel useless, just remember...","In Finding Nemo, the fish students get on a jellyfish’s back like a bus. Not like they can’t swim.",shittymoviedetails,1616914691,k1ckstarteer,True,1,3,2021-03-27
2,"In the second episode of The Falcon and the Winter Soldier, the new Captain America throws his shield to break his sidekick's fall off of a speeding truck. This is because his sidekick is from IKE...",,shittymoviedetails,1616953328,sometimesavowel,True,1,0,2021-03-28
3,Kubrick's The Shining (1980) Was Named So Due to the Incessant Glare from Nicholson's Balding Head.,"For only $25.85, you too can 'cosplay' Jack Torrance.\n\nhttps://www.ebay.com/itm/Brown-Bob-Wig-Bald-Head-for-Cosplay-The-Shining-Jack-Nicholson-Costume-HM-664-/183269829335?mkevt=1&amp;mkcid=1&am...",shittymoviedetails,1616961963,5pez__A,True,0,0,2021-03-28
4,"Christopher walken awakens from a coma with the ability of second sight in the movie ""the dead zone"", in the movie he talks about the book, ""the legend of sleepy hollow"".","This is because he really does have second sight and knew that in the 1999 release of the movie, based on the book, he would be playing the part of the headless horseman and wanted to create some ...",shittymoviedetails,1617011798,40andbored,True,1,0,2021-03-29


In [None]:
bad_deets.shape

### Output the dataframes!

In [31]:
movie_deets.to_csv('../data_outputs/good_movie_deets.csv', index=False)
bad_deets.to_csv('../data_outputs/bad_movie_deets.csv', index=False)
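
`to_csv` fails with an `OSError` if the target folder does not exist yet. A small sketch of a guard, where `save_frame` is a hypothetical helper and the temporary directory stands in for `../data_outputs`:

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_frame(df, path):
    """Create the parent folder if needed, then write the CSV (hypothetical helper)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


toy = pd.DataFrame({'title': ['a'], 'subreddit': ['MovieDetails']})
out_dir = Path(tempfile.mkdtemp())  # stand-in location for this example only
saved = save_frame(toy, out_dir / 'data_outputs' / 'good_movie_deets.csv')
reloaded = pd.read_csv(saved)
```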

## Cleaning and EDA

Let's look at basic stats about our data.

In [32]:
#<df>.isnull().sum()

Let's gauge the density and usefulness of our content by analyzing the volume of comments and the richness of the text-heavy columns we expect to use.

We will also check whether there are any interesting prospective features for training our models.

In [33]:
#<df>['num_comments'].value_counts()

In [34]:
#check for post density
#<df>['selftext'].value_counts()

In [35]:
#<df>['title'].value_counts()

In [36]:
#<df>['author'].value_counts() #get only unique
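
The placeholder cells above could be filled in along these lines. This is a sketch on a made-up frame, not on the scraped data, so the numbers mean nothing beyond showing what each call returns:

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['a short title', 'b', 'c', 'd'],
    'selftext': ['some text', '[removed]', None, '[removed]'],
    'author': ['u1', 'u1', 'u2', '[deleted]'],
    'num_comments': [0, 3, 0, 1],
})

missing = toy.isnull().sum()                     # nulls per column
comment_counts = toy['num_comments'].value_counts()
top_selftext = toy['selftext'].value_counts()    # repeated bodies, e.g. '[removed]'
n_authors = toy['author'].nunique()              # unique authors only
title_words = toy['title'].str.split().str.len() # rough title length in words
```

A high count for a single `selftext` value is a quick signal that many bodies are placeholders rather than real text.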

## NLP & feature engineering

### Word analysis

In [37]:
#word freq
#vectorize self-text
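
A minimal sketch of the word-frequency step using the `CountVectorizer` imported at the top. The three sentences are invented for illustration; on the real data the input would be the `selftext` (or `title`) column.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    'The director hid a detail in the opening scene',
    'The scene repeats a detail from the first movie',
    'This movie has a scene',
]

cvec = CountVectorizer(stop_words='english')
counts = cvec.fit_transform(texts)              # sparse document-term matrix

# total occurrences of each term across all documents
totals = np.asarray(counts.sum(axis=0)).ravel()

# pair each term with its total, most frequent first (ties broken alphabetically)
word_freq = sorted(
    ((term, int(totals[idx])) for term, idx in cvec.vocabulary_.items()),
    key=lambda pair: (-pair[1], pair[0]),
)
```

`vocabulary_` maps each term to its column index, which keeps the sketch independent of the feature-name accessor that changed between scikit-learn versions.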

In [38]:
#sentiment