# Scraping subreddits


## Problem statement
---
We are producing a trivia focused, _Jeopardy_ or _Who Wants to Be a Millionaire_* style game show, where we want the audience to guess the source of the movie details. Everything is scrambled together! The task: figure out if the details came from good movie details, or whether the movie production team took a shorcut and landed in Sh**ty Movie Details!

*Trademarks of their appropiate productions

To do this, we are going to solve the problem using a classifer trained on some subreddit data.

_Thanks to Gwen Rathgeber for some inspiration during the quest for a compelling problem statement!_

![](https://www.bigraildiversity.co.uk/wp-content/uploads/2018/10/Night-at-the-Movies-Converted-900x600.png)


## Data sources and references

---

1. GENERAL NOTE ON ATTRIBUTION: The work throughout this Report, including many if not most coding techniques, rely and borrow heavily from code discussed in Riley Dallas, Sophie "Sonya" Tabac, Charlie Rice, Heather Robbins, and Gwen Rathgeber's class lectures, Notebooks, GitHub repositories, tips on techniques and troubleshooting help. The code is adapted to solve our specific problem.


2. The following subReddits were scraped:

    * [Movie Details](https://www.reddit.com/r/MovieDetails/)
    * [Sh**ty Movie Details](https://www.reddit.com/r/shittymoviedetails/)

---

### Note on the data and style

Strong language may appear in various Reddit posts in raw form. To the extent possible, it shall be cleaned in the course of the project. There also may be some humor used throughout the presentation of the analysis.

---

## Webscrape - let's fetch some data!

We'll import all our required libraries up here.

In [1]:
#import libraries
import pandas as pd, numpy as np, requests, time, nltk, datetime as dt

#NLP
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

import gensim.downloader as api #allows us to get word2vec anf glove embeddins that we need
from gensim.models.word2vec import Word2Vec
from transformers import pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


#classifiers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, plot_confusion_matrix, \
                             recall_score, precision_score)
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.dummy import DummyClassifier


# easier to see full text with a bigger maxwidth:
pd.options.display.max_colwidth = 200



Next, we will use the PushShift API to collect the subReddit data we need.

More information about the API can be found on the [Pushshift repo](https://github.com/pushshift/api).

### Set up single post scrape

In [2]:
#Query the PushShift API for the subReddit data we need
url = 'https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails'
url2 = 'https://api.pushshift.io/reddit/search/submission?subreddit=shittymoviedetails'

_PRO-TIP: We can paste the URL in browser to preview the JSON format._

Next, we need to define some parameters and then confirm that we get a response from the API, from both the subReddits. (NOTE: There is no key needed to access the Reddit API.)

In [9]:
#define what we need to get from the subreddit
params = {
    'subreddit': 'MovieDetails',
    'size' : 100 #100 seems to be the max, even if change this to a greater size
#    'before': ''
}

In [10]:
#define what we need to get from the subreddit
params2 = {
    'subreddit': 'shittymoviedetails',
    'size' : 100 #100 seems to be the max, even if change this to a greater size
#    'before': ''
}

Let's check that our response is valid -- we are looking for a 200 response code.

In [11]:
res = requests.get(url, params) 
res.status_code

200

In [12]:
res2 = requests.get(url2, params2)
res2.status_code

200

Let's look at the content we fetched, for a single post:

In [18]:
print(type(res)) #check attrs

data = res.json() #get the content in JSON format

posts = data['data'][0] #Fetch the list of first posts

#verify what we got
print(posts)

print(type(posts))

print(data.keys())

<class 'requests.models.Response'>
{'all_awardings': [], 'allow_live_comments': False, 'author': 'October23rd2077', 'author_flair_css_class': None, 'author_flair_richtext': [], 'author_flair_text': None, 'author_flair_type': 'text', 'author_fullname': 't2_70rlnux8', 'author_patreon_flair': False, 'author_premium': False, 'awarders': [], 'can_mod_post': False, 'contest_mode': False, 'created_utc': 1619315293, 'domain': 'reddit.com', 'full_link': 'https://www.reddit.com/r/MovieDetails/comments/mxy6dk/in_cars_2006_the_king_strip_weathers_crashes_in/', 'gallery_data': {'items': [{'id': 40772735, 'media_id': 'n2933i7868v61'}, {'id': 40772736, 'media_id': '75dqji7868v61'}]}, 'gildings': {}, 'id': 'mxy6dk', 'is_crosspostable': True, 'is_gallery': True, 'is_meta': False, 'is_original_content': False, 'is_reddit_media_domain': False, 'is_robot_indexable': True, 'is_self': False, 'is_video': False, 'link_flair_background_color': '#0079d3', 'link_flair_css_class': 'Image', 'link_flair_richtext': 

Everything we do, we have to repeat for our second subReddit.

In [21]:
#repeat for the 2nd subreddit
data2 = res2.json() #get the content in JSON format
posts2 = data2['data'][0] #Fetch the list of first post
posts2

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'invertedparadX',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_8i9wni58',
 'author_patreon_flair': False,
 'author_premium': True,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1619315795,
 'domain': 'i.redd.it',
 'full_link': 'https://www.reddit.com/r/shittymoviedetails/comments/mxybxi/the_nineties_was_weird_bruh_goro_mortal_kombat/',
 'gildings': {},
 'id': 'mxybxi',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': True,
 'is_robot_indexable': True,
 'is_self': False,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_comments': 0,
 'num_crossposts': 0,
 'over_18': False,
 'pa

In [22]:
len(posts)

67

Let's take a look at what we pulled in a more visually accessible format.

In [23]:
#create df
results_df = pd.DataFrame(data['data'])
results_df2 = pd.DataFrame(data2['data'])
results_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,wls,removed_by_category,post_hint,preview,media,media_embed,secure_media,secure_media_embed,author_flair_template_id,author_flair_text_color
0,[],False,October23rd2077,,[],,text,t2_70rlnux8,False,False,...,6,,,,,,,,,
1,[],False,coolstevenn,,[],,text,t2_4g0p2x9c,False,False,...,6,,,,,,,,,
2,[],False,Letywolf,,[],,text,t2_housnk8,False,False,...,6,,,,,,,,,
3,[],False,thewiseone91,,[],,text,t2_71kxf,False,False,...,6,,,,,,,,,
4,[],False,thewiseone91,,[],,text,t2_71kxf,False,False,...,6,,,,,,,,,


In [24]:
results_df.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gallery_data', 'gildings', 'id',
       'is_crosspostable', 'is_gallery', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_css_class',
       'link_flair_richtext', 'link_flair_template_id', 'link_flair_text',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_metadata',
       'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_su

There is a lot of metadata here. We probably will not need most of it.

Here are the important tags we plan to use:

* subreddit
* selftext
* title
* created_utc
* author
* is_self (to filter out link-only posts)
* score
* num_comments

Let's grab just the metadata we need.

In [25]:
subfields = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 'is_self', \
'score', 'num_comments']

In [26]:
results_df = results_df[subfields]
results_df2 = results_df2[subfields]
results_df.head(2)

Unnamed: 0,title,selftext,subreddit,created_utc,author,is_self,score,num_comments
0,"In Cars (2006), “The King” Strip Weathers crashes in his final race, but makes it to the finish line. In 1992, NASCAR’s “King,” Richard Petty (the voice of Strip Weathers in the Cars movies), cras...",,MovieDetails,1619315293,October23rd2077,False,1,1
1,"In Shrek 2 (2004) there's a portrait in Fairy Godmother's cottage depicting the cyclops from The Poison Apple with the caption ""After."" This suggests that there's another portrait depicting ""Befor...",,MovieDetails,1619314737,coolstevenn,False,1,1


Let's remove any duplicate posts.

In [27]:
#dupes
results_df.drop_duplicates(inplace=True)
results_df2.drop_duplicates(inplace=True)

We also only want original text content here.

In [28]:
#filter only for self posts
results_df = results_df[results_df["is_self"] == True]
results_df2 = results_df2[results_df2['is_self']==True]

We'll grab the timestamp so that we can set before and after the post we want, to pull in our desired volume of posts.

In [29]:
#convert timestamp to a format we understand
results_df['timestamp'] = results_df['created_utc'].map(dt.date.fromtimestamp)
results_df2['timestamp'] = results_df2['created_utc'].map(dt.date.fromtimestamp)
results_df['timestamp']

10    2021-04-24
17    2021-04-24
18    2021-04-24
41    2021-04-24
54    2021-04-23
55    2021-04-23
56    2021-04-23
57    2021-04-23
59    2021-04-23
63    2021-04-23
79    2021-04-23
86    2021-04-23
89    2021-04-23
91    2021-04-23
92    2021-04-23
Name: timestamp, dtype: object

In [30]:
results_df.shape

(15, 9)

Looks like we get a couple days' worth in a single pull.

### Set up to pull @certain freq

Let's set up our pull to be a bit more dynamic.

In [31]:
#define endpoint
base_url = 'https://api.pushshift.io/reddit/search/submission'

#update params
subreddit = 'MovieDetails'
size = 100

#construct URL
stem = f'{base_url}?subreddit={subreddit}&size={size}'
stem

'https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=100'

In [32]:
stem == url

False

Let's add in a time component to loop over multiple days' worth of posts.

In [36]:
days = 30
url = f'{stem}&after={days}d'
url

'https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=100&after=30d'

In [38]:
#verify output

res=requests.get(url)
assert res.status_code == 200
json_data = res.json()

results_df = pd.DataFrame(json_data['data'])[subfields]

results_df[
    'created_utc'].map(dt.date.fromtimestamp).head()

0    2021-03-27
1    2021-03-27
2    2021-03-27
3    2021-03-27
4    2021-03-27
Name: created_utc, dtype: object

We see somthing from as far back as a month ago, which is what we were expecting.

In [39]:
results_df.shape

(100, 8)

In [59]:
#for loop to iterate through the pulls

def fetch_posts(subreddit):

#est. params
    subreddit = 'MovieDetails'
    
    kind = 'submission' #post vs. comment -- Riley
    
#est url

    base_url = f'https://api.pushshift.io/reddit/search/{kind}'

    stem = f'{base_url}?subreddit={subreddit}&size=500'

#vars

    day_window =7
    
    n=3 #number of iters, after which the loop will stop;
    #it will create n result df's
    
    posts = []

    for i in range(1, n+1):
        
        URL = f'{stem}&after={i * day_window}d'
        
        print('Query from timeframe: ' + URL)
        
        res = requests.get(URL)
        
        assert res.status_code == 200
        
        json = res.json()['data']
        
        df = pd.DataFrame(json)[subfields]
        
        posts.append(df)
        
        time.sleep(1) #wait 1 s between requests
        
        #check post length
        
        print(len(posts))
        
    return df

In [60]:
type(posts)

dict

In [61]:
type(df)

NameError: name 'df' is not defined

In [52]:
df.columns

NameError: name 'df' is not defined

In [62]:
df1 = fetch_posts(subreddit)
df1.head()

Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=7d
1
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=14d
2
Query from timeframe: https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails&size=500&after=21d
3


Unnamed: 0,title,selftext,subreddit,created_utc,author,is_self,score,num_comments
0,The Honda Civic ninjas in Cars 2(2011) are a nod to the Heist Civics seen in the beginning of The Fast and The Furious(2001),,MovieDetails,1617504536,Ranchdippingsauc,False,1,5
1,In 'Under The Skin' (2013) the van Scarlett Johansson uses to hunt for men was equipped with multiple hidden cameras. So she actually drove around Glasgow picking up random strangers who had no id...,,MovieDetails,1617506890,Schlappydog,False,1,33
2,"In City of God (2002), one of the actors, who used to be in a real gang, asked director Fernando Meirelles if the group was not going to pray like they always did before a shootout. Meirelles told...",,MovieDetails,1617507107,lanzevedo,False,1,8
3,Borderlands (film),,MovieDetails,1617510858,alizamessy,False,1,0
4,"My girlfriend realized that why Murph says ""It's traditional"" after saying ""Eureka"" (to which no one, including Topher Grace's character understands) is because the history books in their society ...",,MovieDetails,1617512113,axc356,False,1,2


We will concatenate all the output dataframes with our posts into one massive dataframe of all the collected posts.

In [63]:
df1.columns

Index(['title', 'selftext', 'subreddit', 'created_utc', 'author', 'is_self',
       'score', 'num_comments'],
      dtype='object')

In [64]:
combined = pd.concat(df1, sort=False)
print(combined.shape)
combined.head()

TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"

### Repeat this process for our second subreddit

## Cleaning and EDA

In [None]:
df = pd.DataFrame(posts) #Get the list of posts into a dataframe!

In [None]:
#see, what we got!
df.head()

Let's look at basic stats about our data.

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
df.columns

Let's understand the density and usefulness of our content, by analyzing the volume of comments and looking for the richness of the text-heavy columns we are expecting.

We are also going to see, whether there are any interesting prospective features for training our models.

In [None]:
df['is_original_content'].value_counts()

In [None]:
df['num_comments'].value_counts()

In [None]:
df['media_only'].value_counts()

In [None]:
df['created_utc'].value_counts().unique #check for time stamps; replace this w/ groupby()

In [None]:
#check for post density
df['selftext'].value_counts()

In [None]:
df['selftext'].value_counts().sum()

In [None]:
df['title'].value_counts()

In [None]:
df['media'].value_counts().sum()

In [None]:
df['subreddit_subscribers'].value_counts().unique

In [None]:
df['upvote_ratio'].value_counts()

## NLP & feature eng