# Scraping subreddits


## Problem statement
---
We are producing a trivia-focused game show in the style of _Jeopardy_ or _Who Wants to Be a Millionaire_*, where the audience guesses the source of movie details. Everything is scrambled together! The task: figure out whether a detail came from good movie details, or whether the movie production team took a shortcut and landed in Sh**ty Movie Details!

*Trademarks of their respective productions

To do this, we are going to train a classifier on data scraped from two subreddits.


## Data sources and references

---

1. GENERAL NOTE ON ATTRIBUTION: The work throughout this report, including many if not most coding techniques, relies on and borrows heavily from code discussed in Riley Dallas, Sophie "Sonya" Tabac, Charlie Rice, Heather Robbins, and Gwen Rathgeber's class lectures, notebooks, GitHub references, technique tips, and troubleshooting help. The code is adapted to solve our specific problem.


2. The following subreddits were scraped:

    * [Movie Details](https://www.reddit.com/r/MovieDetails/)
    * [Sh**ty Movie Details](https://www.reddit.com/r/shittymoviedetails/)

---

### Note on the data and style

Strong language may appear in various Reddit posts in raw form. To the extent possible, it shall be cleaned in the course of the project. There also may be some humor used throughout the presentation of the analysis.

---

## Web scrape: let's fetch some data for our first subreddit!

In [28]:
# import libraries
import time

import nltk
import numpy as np
import pandas as pd
import requests

from nltk.stem import WordNetLemmatizer

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# note: plot_confusion_matrix was removed in scikit-learn 1.2;
# newer versions provide ConfusionMatrixDisplay instead
from sklearn.metrics import (accuracy_score, confusion_matrix, plot_confusion_matrix,
                             recall_score, precision_score)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.dummy import DummyClassifier

#also grab c-val, all the metrics

In [3]:
# Pushshift API endpoint for the subreddit data we need
url = 'https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails'

_PRO-TIP: Paste https://api.pushshift.io/reddit/search/submission?subreddit=MovieDetails in browser to analyze the JSON._

In [4]:
# define what we need to get from the subreddit
params = {
    'subreddit': 'MovieDetails',  # must match the subreddit name used in the URL
    'size': 100,  # 100 seems to be the max, even if this is set higher
    # 'before': ''
}
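The commented-out `before` key hints at one way past the 100-post cap: request a page, take the oldest `created_utc` in it, and pass that value as `before` on the next request. The following is a minimal sketch of that loop, assuming the endpoint still accepts `before` as an epoch timestamp and is still reachable. `fetch_posts` and `get_page` are hypothetical names, and `get_page` is passed in so the loop can be tried without touching the network.

```python
import time


def fetch_posts(get_page, subreddit, pages=3, size=100, pause=1.0):
    """Collect up to pages * size posts by walking backwards in time.

    get_page is any callable that takes a params dict and returns a list of
    post dicts (a hypothetical helper, not part of any library).
    """
    params = {"subreddit": subreddit, "size": size}
    collected = []
    for _ in range(pages):
        page = get_page(dict(params))
        if not page:  # nothing older is left
            break
        collected.extend(page)
        # Next request: only posts older than the oldest one seen so far.
        params["before"] = min(post["created_utc"] for post in page)
        time.sleep(pause)  # pause between requests to go easy on the API
    return collected
```

With `requests`, `get_page` could be something like `lambda p: requests.get('https://api.pushshift.io/reddit/search/submission', params=p).json()['data']`, so that `fetch_posts(get_page, 'MovieDetails', pages=5)` would gather up to 500 posts.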

In [5]:
res = requests.get(url, params=params)
res.status_code  # check that our response is valid; we are looking for a 200 code

200
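Checking the status code by eye works for a single request. For repeated scrapes, a small helper that retries on anything other than 200 may be handy. This is only a sketch: `get_with_retry` and its parameters are hypothetical, and `send` stands for any zero-argument callable that returns an object with a `status_code` attribute, for example `lambda: requests.get(url, params=params)`.

```python
import time


def get_with_retry(send, attempts=3, wait=2.0):
    """Call send() until it returns a 200 response or the attempts run out."""
    for attempt in range(1, attempts + 1):
        res = send()
        if res.status_code == 200:
            return res
        if attempt < attempts:
            time.sleep(wait * attempt)  # wait a little longer after each failure
    raise RuntimeError(
        f"no 200 response after {attempts} attempts (last status: {res.status_code})"
    )
```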

In [6]:
data = res.json() #get the content in JSON format

In [7]:
posts = data['data'] #Fetch the list of first posts

In [8]:
len(posts)

100

In [9]:
type(posts)

list

In [10]:
posts #view our first set of fetched posts as a list of dictionaries

[{'all_awardings': [],
  'allow_live_comments': False,
  'author': 'weird_YT_channel',
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_text': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_3oijkjwj',
  'author_patreon_flair': False,
  'author_premium': False,
  'awarders': [],
  'can_mod_post': False,
  'contest_mode': False,
  'created_utc': 1619032108,
  'domain': 'self.MovieDetails',
  'full_link': 'https://www.reddit.com/r/MovieDetails/comments/mvmxua/swiss_army_man/',
  'gildings': {},
  'id': 'mvmxua',
  'is_crosspostable': False,
  'is_meta': False,
  'is_original_content': False,
  'is_reddit_media_domain': False,
  'is_robot_indexable': False,
  'is_self': True,
  'is_video': False,
  'link_flair_background_color': '#373c3f',
  'link_flair_richtext': [{'e': 'text', 't': '⏱️ Continuity'}],
  'link_flair_template_id': 'ffaa9c56-0fad-11ea-9f47-0e370252fd8d',
  'link_flair_text': '⏱️ Continuity',
  'link_flair_text_color': 'light',
 

## Repeat this process for our second subreddit
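Repeating the scrape for the second subreddit yields a second list of post dictionaries. Before the two lists are combined, each post needs a class label. A minimal sketch follows; the function name `label_posts`, the `target` key, and the choice of fields to keep are all assumptions made for illustration.

```python
def label_posts(posts, target):
    """Keep the text fields of each post and attach a binary class label."""
    labeled = []
    for post in posts:
        labeled.append({
            "subreddit": post.get("subreddit"),
            "title": post.get("title", ""),
            "selftext": post.get("selftext", ""),
            "target": target,  # hypothetical label: 1 = the second subreddit
        })
    return labeled


# Tiny stand-in lists; in the notebook these would be the two scraped lists.
first_posts = [{"subreddit": "MovieDetails", "title": "detail A", "selftext": ""}]
second_posts = [{"subreddit": "shittymoviedetails", "title": "detail B"}]

combined = label_posts(first_posts, 0) + label_posts(second_posts, 1)
```

Passing `combined` to `pd.DataFrame` would then give a single frame with the label in its own column.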

## Cleaning and EDA

In [11]:
df = pd.DataFrame(posts) #Get the list of posts into a dataframe!

In [12]:
# see what we got!
df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,thumbnail_width,url_overridden_by_dest,media,media_embed,secure_media,secure_media_embed,gallery_data,media_metadata,author_flair_background_color,author_flair_text_color
0,[],False,weird_YT_channel,,[],,text,t2_3oijkjwj,False,False,...,,,,,,,,,,
1,[],False,Stainless001,,[],,text,t2_w8e62,False,False,...,140.0,https://www.reddit.com/gallery/mvmtmf,,,,,,,,
2,[],False,Stainless001,,[],,text,t2_w8e62,False,False,...,,https://www.reddit.com/gallery/mvmsla,,,,,,,,
3,[],False,yasshost77,,[],,text,t2_bdkalzsx,False,False,...,140.0,https://www.xvidxx.com/video/197/are-you-serio...,,,,,,,,
4,[],False,Kris_Huf,,[],,text,t2_46fknvbc,False,False,...,140.0,https://i.redd.it/wn1cww9peku61.jpg,,,,,,,,


Let's look at basic stats about our data.

In [13]:
df.shape

(100, 76)

In [14]:
df.isnull().sum()

all_awardings                     0
allow_live_comments               0
author                            0
author_flair_css_class           99
author_flair_richtext             0
                                 ..
secure_media_embed               85
gallery_data                     96
media_metadata                   95
author_flair_background_color    99
author_flair_text_color          99
Length: 76, dtype: int64
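Several columns are null in nearly every row. One way to act on these counts is to drop every column whose share of nulls exceeds a cutoff. The sketch below assumes a cutoff of 0.9, which is an arbitrary choice, and `drop_sparse_columns` is a hypothetical helper name.

```python
import pandas as pd


def drop_sparse_columns(df, max_null_share=0.9):
    """Return a copy of df without the columns that are mostly null."""
    null_share = df.isnull().mean()  # fraction of nulls per column
    keep = null_share[null_share <= max_null_share].index
    return df[keep].copy()


# Small stand-in frame; the real df has 100 rows and 76 columns.
sparse_demo = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "author_flair_css_class": [None, None, None, None],
    "media": [None, None, None, "x"],
})
kept = drop_sparse_columns(sparse_demo, max_null_share=0.9)
print(kept.columns.tolist())
```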

In [15]:
df.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_template_id', 'link_flair_text', 'link_flair_text_color',
       'link_flair_type', 'locked', 'media_only', 'no_follow', 'num_comments',
       'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink',
       'pinned', 'post_hint', 'preview', 'pwls', 'removed_by_category',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers',
       'subre

Let's understand the density and usefulness of our content by analyzing the volume of comments and checking how rich the text-heavy columns we expect to use really are.

We will also look for any interesting prospective features for training our models.
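A column that holds the same value in every row cannot help a model tell the two subreddits apart, so one quick screen is to list such columns. The sketch below assumes the candidate columns contain hashable values (columns of lists or dicts would need to be excluded first); `constant_columns` is a hypothetical helper name.

```python
import pandas as pd


def constant_columns(df, columns):
    """List the given columns that hold a single value across all rows."""
    return [col for col in columns if df[col].nunique(dropna=False) <= 1]


# Small stand-in frame with the same column names as the scraped data.
feature_demo = pd.DataFrame({
    "is_original_content": [False, False, False],
    "media_only": [False, False, False],
    "num_comments": [2, 3, 0],
})
print(constant_columns(feature_demo, feature_demo.columns))
```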

In [16]:
df['is_original_content'].value_counts()

False    100
Name: is_original_content, dtype: int64

In [17]:
df['num_comments'].value_counts()

2      26
3      15
0      13
4       7
1       6
5       6
6       6
7       5
8       2
15      2
12      2
241     1
9       1
209     1
13      1
16      1
18      1
19      1
153     1
524     1
11      1
Name: num_comments, dtype: int64

In [18]:
df['media_only'].value_counts()

False    100
Name: media_only, dtype: int64

In [19]:
df['created_utc'].value_counts() #check for time stamps; replace this w/ groupby()

1618838598    1
1618929752    1
1618871101    1
1618868680    1
1618847548    1
             ..
1618929992    1
1618972851    1
1618844086    1
1618881464    1
1618912768    1
Name: created_utc, Length: 100, dtype: int64
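The inline comment mentions replacing the raw counts with a group-by. One possible version converts the epoch seconds to calendar dates and counts posts per day. This is a sketch only: `posts_per_day` is a hypothetical name, and treating the timestamps as UTC is an assumption based on the column name.

```python
import pandas as pd


def posts_per_day(created_utc):
    """Count posts per calendar day (UTC) from epoch-second timestamps."""
    days = pd.to_datetime(created_utc, unit="s", utc=True).dt.date
    return days.value_counts().sort_index()


# Three of the timestamps that appear in this notebook.
sample = pd.Series([1619032108, 1618838598, 1618929752], name="created_utc")
print(posts_per_day(sample))
```

Calling `posts_per_day(df['created_utc'])` on the full frame would show how the 100 posts spread across days.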

In [42]:
#check for post density
df['selftext'].value_counts()

[removed]

In [21]:
df['selftext'].value_counts().sum()

100

In [22]:
df['title'].value_counts()

visite this site to watch porn vidéos                                                                                                                                                                                                                 6
🔴🔴 visite this link to watch porn vidéos 👇👇                                                                                                                                                                                                           5
In the opening titles for Vanquish (2021), the newspaper has the generic "lorem ipsum" placeholder text, indicating potential under-investment from the production.                                                                                   2
In The Truman Show (1998) when Truman is examining old pictures of his father, Truman brushes aside a small baggy of marijuana also in the box.                                                                                                       1
During t
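The title counts show repeated posts and spam mixed in with genuine details. A rough first pass at cleaning could drop titles that match a spam pattern and then drop duplicates. The pattern below is an illustrative assumption based on the spam titles seen here, not a complete filter, and `clean_titles` is a hypothetical helper name.

```python
import pandas as pd


def clean_titles(df, spam_pattern=r"watch porn"):
    """Drop rows with spam-like or repeated titles (illustrative pattern only)."""
    is_spam = df["title"].str.contains(spam_pattern, case=False, regex=True, na=False)
    return df[~is_spam].drop_duplicates(subset="title").reset_index(drop=True)


# Stand-in titles; the spam line is copied from the counts in this notebook.
title_demo = pd.DataFrame({
    "title": [
        "detail A",
        "visite this site to watch porn vidéos",
        "detail A",
        "detail B",
    ]
})
cleaned = clean_titles(title_demo)
print(cleaned["title"].tolist())
```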

In [23]:
df['media'].value_counts().sum()

15

In [24]:
df['subreddit_subscribers'].value_counts()

2279837    2
2280303    2
2281465    2
2279824    2
2279991    1
          ..
2282332    1
2283101    1
2279902    1
2280304    1
2281686    1
Name: subreddit_subscribers, Length: 96, dtype: int64

In [25]:
df['upvote_ratio'].value_counts()

1.0    99
0.5     1
Name: upvote_ratio, dtype: int64

## NLP & feature engineering