## My data sources

---

1. GENERAL NOTE ON ATTRIBUTION: The work throughout this notebook, including its coding techniques, relies on and borrows heavily from code discussed in Riley Dallas's class lectures. All of these lectures can currently be found on YouTube and the course GitHub:

-- https://www.youtube.com/watch?v=AcrjEWsMi_E&feature=youtu.be

-- Lecture recordings (GitHub Enterprise DSIR-1214EC course)

---

2. The following subreddits were considered, but not necessarily analyzed in detail, for this project:

Subreddit for recipes: https://www.reddit.com/r/recipes/

Subreddit for breakfast: https://www.reddit.com/r/breakfast/

Subreddit for dessert: https://www.reddit.com/r/DessertPorn/ 

Subreddit for cars: https://www.reddit.com/r/cars/

Subreddit for architecture: https://www.reddit.com/r/architecture/

---
## My problem statement and questions posed for research

Can we predict which subreddit a post came from, based on its topic and the chatter around specific cuisines?

(One challenge I expect: posts in r/recipes may be more verbose and text-heavy than posts in some of the other subreddits, which seem to consist mostly of pictures.)

---

## Webscrape - fetch some data for our first Subreddit!

In [7]:
#import libraries
import pandas as pd
import numpy as np

import requests
import time

In [1]:
#Pushshift API endpoint for the subreddit data we need
url = 'https://api.pushshift.io/reddit/search/submission?subreddit=cars'

_Paste https://api.pushshift.io/reddit/search/submission?subreddit=cars in browser to analyze the JSON._

In [19]:
#define the query parameters for the request
params = {
    'subreddit': 'cars',
    'size': 100,  #100 seems to be the maximum, even if a larger size is requested
    #'before': ''
}

In [20]:
res = requests.get(url, params=params)
res.status_code #check that our response is valid; we are looking for a 200 code

200

In [21]:
data = res.json() #get the content in JSON format

In [22]:
posts = data['data'] #Fetch the list of first posts

In [23]:
len(posts)

100
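Since each request seems to cap at 100 posts, one way to build a larger sample is to page backwards through time with the `before` parameter that is commented out in `params`. The sketch below is only an illustration of that idea, not code from the lectures: `get_page` and `fetch_posts` are hypothetical helper names, the one-second pause is an arbitrary courtesy delay, and it assumes the endpoint is reachable and accepts an epoch timestamp for `before`.

```python
import time


def get_page(url, params):
    """Fetch one page of posts and return the list stored under 'data'."""
    import requests  # imported here so the paging logic can be exercised offline

    res = requests.get(url, params=params)
    res.raise_for_status()
    return res.json()['data']


def fetch_posts(url, subreddit, n_pages=3, size=100, pause=1.0, get_page=get_page):
    """Collect up to n_pages * size posts by paging backwards with 'before'."""
    posts = []
    before = None
    for _ in range(n_pages):
        params = {'subreddit': subreddit, 'size': size}
        if before is not None:
            params['before'] = before
        batch = get_page(url, params)
        if not batch:  # nothing older left to fetch
            break
        posts.extend(batch)
        # the next request asks only for posts older than the oldest one in this batch
        before = min(post['created_utc'] for post in batch)
        time.sleep(pause)  # be polite to the API between requests
    return posts
```

A call such as `fetch_posts('https://api.pushshift.io/reddit/search/submission', 'cars', n_pages=5)` would then aim for roughly 500 posts. Passing `get_page` as an argument keeps the loop testable with a stand-in function.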

In [17]:
type(posts)

list

In [25]:
posts #view our first set of fetched posts as a list of dictionaries

[{'all_awardings': [],
  'allow_live_comments': False,
  'author': 'KittiesHavingSex',
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_text': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_7wyt9',
  'author_patreon_flair': False,
  'author_premium': False,
  'awarders': [],
  'can_mod_post': False,
  'contest_mode': False,
  'created_utc': 1611949415,
  'domain': 'self.cars',
  'full_link': 'https://www.reddit.com/r/cars/comments/l821b8/why_dont_we_see_ev_track_comparisons/',
  'gildings': {},
  'id': 'l821b8',
  'is_crosspostable': True,
  'is_meta': False,
  'is_original_content': False,
  'is_reddit_media_domain': False,
  'is_robot_indexable': True,
  'is_self': True,
  'is_video': False,
  'link_flair_background_color': '',
  'link_flair_richtext': [],
  'link_flair_text_color': 'dark',
  'link_flair_type': 'text',
  'locked': False,
  'media_only': False,
  'no_follow': False,
  'num_comments': 0,
  'num_crossposts': 0,
  'over_18':
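Dumping every post in full is hard to scan. A lighter way to inspect the list is to print only a few keys per post. This is a minimal sketch: `summarize` and `keys_of_interest` are names introduced here, and the sample post reuses a handful of values from the output above with all other keys omitted.

```python
# One abbreviated post; values copied from the output above, most keys left out
sample_post = {
    'id': 'l821b8',
    'author': 'KittiesHavingSex',
    'created_utc': 1611949415,
    'domain': 'self.cars',
    'is_self': True,
    'num_comments': 0,
}

keys_of_interest = ['id', 'author', 'created_utc', 'num_comments']


def summarize(post, keys):
    """Return only the requested keys; keys missing from the post come back as None."""
    return {key: post.get(key) for key in keys}


print(summarize(sample_post, keys_of_interest))
```

On the real list this could be run as `[summarize(p, keys_of_interest) for p in posts[:5]]`.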

## Cleaning and EDA

In [26]:
df = pd.DataFrame(posts) #Get the list of posts into a dataframe!

In [35]:
#see what we got!
df.head()

| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_patreon_flair | author_premium | ... | media_embed | secure_media | secure_media_embed | thumbnail_height | thumbnail_width | link_flair_css_class | link_flair_text | author_flair_template_id | author_flair_text_color | author_flair_background_color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | KittiesHavingSex | | [] | | text | t2_7wyt9 | False | False | ... | | | | | | | | | | |
| 1 | [] | False | Gr0wlerz | | [] | | text | t2_1uvtcn1s | False | False | ... | | | | | | | | | | |
| 2 | [] | False | Strandern74 | | [] | | text | t2_7w0uetro | False | False | ... | | | | | | | | | | |
| 3 | [] | False | wraithyouu | | [] | | text | t2_5umfwbkm | False | False | ... | | | | | | | | | | |
| 4 | [] | False | MrPoopyButthole6911 | | [] | | text | t2_6n5zr3sw | False | False | ... | | | | | | | | | | |
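With this many columns, `df.head()` hides most of the frame behind the `...`. One option is to look at a narrow slice instead. The sketch below assumes the text columns `title` and `selftext`, plus a few context columns, are the ones that matter for the classification question; `df_demo`, `keep_cols`, and `df_small` are names introduced here, and the demo rows are a made-up stand-in for `df`.

```python
import pandas as pd

# Tiny stand-in for `df`; titles and timestamps are borrowed from this notebook's
# outputs, but the pairing of values in each row is made up for illustration
df_demo = pd.DataFrame({
    'subreddit': ['cars', 'cars'],
    'title': ['Porsche 911 Restoration', 'Do you want to play a VIN game?'],
    'selftext': ['[removed]', ''],
    'created_utc': [1611949415, 1611896063],
    'num_comments': [0, 2],
    'author_flair_css_class': [None, None],
})

keep_cols = ['subreddit', 'title', 'selftext', 'created_utc', 'num_comments']

# .copy() gives an independent frame, so later edits do not touch the original
df_small = df_demo[keep_cols].copy()
print(df_small.head())
```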


Let's look at basic stats about our data.

In [52]:
df.shape

(100, 72)

In [55]:
df.isnull().sum()

all_awardings                      0
allow_live_comments                0
author                             0
author_flair_css_class           100
author_flair_richtext              1
                                ... 
link_flair_css_class              88
link_flair_text                   88
author_flair_template_id          85
author_flair_text_color           83
author_flair_background_color     98
Length: 72, dtype: int64
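Raw null counts are easier to act on as shares of the rows. The following is a minimal sketch, assuming that a column which is empty in every row (such as `author_flair_css_class`, with 100 nulls in 100 rows) carries no signal and can be dropped; `df_demo`, `null_share`, `empty_cols`, and `df_trimmed` are names introduced here, and the demo values are made up.

```python
import pandas as pd

# Made-up stand-in for `df` with one fully empty and one mostly empty column
df_demo = pd.DataFrame({
    'author': ['a', 'b', 'c', 'd'],
    'author_flair_css_class': [None, None, None, None],
    'link_flair_text': [None, None, None, 'Video'],
})

# fraction of missing values per column, largest first
null_share = df_demo.isnull().mean().sort_values(ascending=False)

# columns that are missing in every single row
empty_cols = null_share[null_share == 1.0].index.tolist()

df_trimmed = df_demo.drop(columns=empty_cols)
print(null_share)
```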

In [36]:
df.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers',
       'subreddit_type', 'thumbnail', 'title', 'total_awards_received',
       'treatment_tags', 'upvote_ratio',

Let's gauge the density and usefulness of our content by analyzing the volume of comments and checking how rich the text-heavy columns we expect to rely on actually are.

We will also see whether there are any promising candidate features for training our models.

In [43]:
df['is_original_content'].value_counts()

False    100
Name: is_original_content, dtype: int64

In [45]:
df['num_comments'].value_counts()

0     36
1     32
2     23
3      2
61     1
31     1
22     1
14     1
9      1
5      1
4      1
Name: num_comments, dtype: int64
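The distribution above is heavily skewed toward posts with little or no discussion. As a quick worked check, the counts can be turned into shares; `comment_counts` simply restates the output above, and the cut-off of two comments for a "quiet" post is an arbitrary choice made for this illustration.

```python
# number of comments -> number of posts, copied from the value_counts() output
comment_counts = {0: 36, 1: 32, 2: 23, 3: 2, 61: 1, 31: 1, 22: 1, 14: 1, 9: 1, 5: 1, 4: 1}

total_posts = sum(comment_counts.values())
no_comment_share = comment_counts[0] / total_posts

# posts with at most two comments
quiet_posts = sum(n for k, n in comment_counts.items() if k <= 2)
quiet_share = quiet_posts / total_posts

print(total_posts, no_comment_share, quiet_share)
```

So 36 of the 100 posts have no comments at all and 91 have two or fewer, which suggests comment volume alone may be a weak feature for this sample.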

In [44]:
df['media_only'].value_counts()

False    100
Name: media_only, dtype: int64

In [67]:
df['created_utc'].value_counts() #check the time stamps; TODO: replace this with groupby()

1611896063    1
1611922006    1
1611938895    1
1611915074    1
1611924040    1
             ..
1611933604    1
1611947941    1
1611939411    1
1611928745    1
1611941122    1
Name: created_utc, Length: 100, dtype: int64
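Every raw timestamp is distinct, so counting them directly says little. The cell above carries a note to switch to `groupby()`; one possible reading of that note is to convert the epoch seconds to datetimes and count posts per hour. This is a sketch under that assumption, with `created`, `created_dt`, and `posts_per_hour` as names introduced here.

```python
import pandas as pd

# Five of the timestamps shown in the output above
created = pd.Series(
    [1611896063, 1611922006, 1611938895, 1611915074, 1611924040],
    name='created_utc',
)

# created_utc is in seconds since the Unix epoch
created_dt = pd.to_datetime(created, unit='s', utc=True)

# number of posts in each hour of the day (UTC)
posts_per_hour = created_dt.groupby(created_dt.dt.hour).size()
print(posts_per_hour)
```

On the full frame the same idea would read `pd.to_datetime(df['created_utc'], unit='s', utc=True)`.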

In [42]:
#check for post density
df['selftext'].value_counts()

[removed]

In [58]:
df['selftext'].value_counts().sum()

100
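Note that `value_counts()` skips nulls, so the total of 100 only says that every row has some `selftext` value, including the `[removed]` placeholder. A rough way to measure how much text is actually usable is sketched below. It assumes `[removed]`, `[deleted]`, and empty strings are all placeholders (`[deleted]` does not appear in the output above and is an assumption), and the demo series is made up.

```python
import pandas as pd

# Made-up stand-in for df['selftext']
selftext = pd.Series([
    '[removed]',
    '',
    'Looking for advice on my first project car.',
    None,
    '[deleted]',
])

placeholders = ['[removed]', '[deleted]', '']

# treat nulls as empty text and ignore surrounding whitespace
cleaned = selftext.fillna('').str.strip()

has_text = ~cleaned.isin(placeholders)
usable_share = has_text.mean()
print(usable_share)
```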

In [57]:
df['title'].value_counts()

Porsche 911 Restoration                                                                                                                                                                                                                                                       2
2006 Saturn ion, accelerated forward while breaking.                                                                                                                                                                                                                          1
(Teaser Video) 2022 Nissan Pathfinder Reveal Coming February 4th 2021 at 1 PM EST                                                                                                                                                                                             1
Do you want to play a VIN game?
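A title that shows up twice, like "Porsche 911 Restoration", is probably a repost. If duplicates are not wanted for modelling (an assumption, since the notebook does not say), they could be inspected and dropped roughly as follows; `df_demo`, `dupes`, and `df_unique` are names introduced here, and the timestamps are paired with titles arbitrarily.

```python
import pandas as pd

# Made-up stand-in for `df`; titles come from the output above
df_demo = pd.DataFrame({
    'title': [
        'Porsche 911 Restoration',
        'Do you want to play a VIN game?',
        'Porsche 911 Restoration',
    ],
    'created_utc': [1611949415, 1611938895, 1611896063],
})

# every row whose title occurs more than once
dupes = df_demo[df_demo.duplicated(subset='title', keep=False)]

# keep the first occurrence of each title
df_unique = df_demo.drop_duplicates(subset='title', keep='first').reset_index(drop=True)
print(df_unique)
```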

In [51]:
df['media'].value_counts().sum()

19

In [65]:
df['subreddit_subscribers'].value_counts()

2218526    2
2216003    2
2216408    2
2218772    2
2217645    1
          ..
2217082    1
2215898    1
2217179    1
2215513    1
2216960    1
Name: subreddit_subscribers, Length: 96, dtype: int64

In [48]:
df['upvote_ratio'].value_counts()

1.00    99
0.99     1
Name: upvote_ratio, dtype: int64
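Several of the columns checked above hold a single value in every row (`is_original_content` and `media_only` are both always `False`), so they cannot help a model tell subreddits apart. Rather than checking columns one by one, constant columns can be flagged in a single pass. This is a minimal sketch with made-up demo data; `n_unique` and `constant_cols` are names introduced here. The cast to `str` is needed because some columns hold lists, which `nunique()` cannot hash.

```python
import pandas as pd

# Made-up stand-in for `df`, including a list-valued column like all_awardings
df_demo = pd.DataFrame({
    'is_original_content': [False, False, False],
    'media_only': [False, False, False],
    'num_comments': [0, 1, 61],
    'all_awardings': [[], [], []],
})

# astype(str) makes list-valued cells hashable so nunique() does not raise
n_unique = df_demo.astype(str).nunique()

# columns with at most one distinct value carry no information
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)
```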

## NLP & feature engineering