# Project 3: Web APIs & NLP - Data Gathering from Subreddits
- Project 3 done by Anand Ramchandani

## Problem Statement
- Collect posts from two subreddits, r/beer and r/wine, and use NLP to train a classifier that predicts which of the two subreddits a post came from

# Contents

- [Background](#Background)
- [Plan to achieve project goals](#Plan-to-achieve-project-goals)
- [Data Gathering from Subreddits](#Data-Gathering-from-Subreddits)
- [Data Import & Cleaning](#Data-Import-and-Cleaning)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Tokenization and Lemmatization](#Tokenization-and-Lemmatization)
- [Word Cloud](#Word-Cloud)
- [Data Modelling](#Data-Modelling)
- [Model Evaluation](#Model-Evaluation)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)

# Background

- The goal is to select two subreddits from Reddit, the American social news aggregation, content rating, and discussion website, and then train a machine learning model that accurately classifies new posts into their respective subreddits  
- For this project we selected r/beer and r/wine. Both cover popular alcoholic beverages with a lower alcohol content than hard liquor, so they are widely and recreationally consumed and have large followings on Reddit. The two subreddits are similar in subject matter yet differentiated enough to train a machine learning model  

# Plan to achieve project goals

1. Gather data using Reddit's API  
2. Clean Data   
3. Data Exploration  
4. Identify relevant features and conduct feature engineering
5. Vectorize the words using CountVectorizer and TF-IDF
6. Modelling using Logistic Regression and Multinomial Naive Bayes
7. Refinements and Hyperparameter Tuning
8. Re-evaluate and select best model
9. Evaluate on new test data
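
Steps 5 and 6 of the plan can be sketched roughly as follows. This is a minimal illustration only, not the final model: the toy texts and labels are invented for the example, and the choices `stop_words="english"` and `max_iter=1000` are assumptions rather than tuned settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the scraped post text
toy_texts = [
    "hazy ipa from a small brewery",
    "stout on tap at the local pub",
    "lager and pilsner recommendations",
    "a bottle of pinot noir from burgundy",
    "chardonnay pairing for dinner",
    "vintage bordeaux in the wine cellar",
]
toy_labels = ["beer", "beer", "beer", "wine", "wine", "wine"]

# Vectorize the words with TF-IDF, then fit a logistic regression classifier
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(toy_texts, toy_labels)

# Predict the subreddit of an unseen post
preds = pipe.predict(["looking for a crisp lager from a brewery"])
print(preds)
```

Swapping `TfidfVectorizer` for `CountVectorizer`, or `LogisticRegression` for `MultinomialNB`, only changes the corresponding pipeline step, which is what makes the later `GridSearchCV` tuning straightforward.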

# Import Libraries

In [17]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 100)
sns.set_style("whitegrid")

import requests
import time
import string
import random

from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [18]:
# changing our pandas settings so that we can view all columns 
#pd.set_option('max_columns', 999)
#pd.set_option('max_rows', 999)

# Data Gathering from Subreddits

## Use Reddit API to Gather Posts

### Gather Posts from Beer Subreddit

In [19]:
# Set an arbitrary User-agent header to avoid HTTP 429 (Too Many Requests) errors
headers = {'User-agent':'Lawrence of Arabia'}

# List for storing posts
beer_posts=[] 

# Empty parameter for first iteration
after=None

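# Reddit returns 25 posts per request by default

for num in range(40): 
    
    # Print progress on every 10th request
    if num % 10 == 0: 
        print(f'scrape {num} in progress...') 
        
    # The first request has no 'after' token; later requests continue from the last post seen
    if after is None:
        param={} 
    else:
        param={'after':after}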
        
    url = 'https://www.reddit.com/r/beer.json'
    
    results = requests.get(url,params=param,headers=headers)
    
    #Checking if the requests are successful
    if results.status_code==200: 
        d_json=results.json()
        beer_posts.extend(d_json['data']['children']) 
        after=d_json['data']['after']
    else:
        print(results.status_code)
        break
        
    # Sleep timing
    time.sleep(1) #seconds to sleep

scrape 0 in progress...
scrape 10 in progress...
scrape 20 in progress...
scrape 30 in progress...


In [20]:
len(beer_posts)

988

In [21]:
# Checking keys of each post
beer_posts[0]['data'].keys()

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'thumbnail_height', 'top_awarded_type', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'upvote_ratio', 'author_flair_background_color', 'subreddit_type', 'ups', 'total_awards_received', 'media_embed', 'thumbnail_width', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'is_created_from_ads_ui', 'author_premium', 'thumbnail', 'edited', 'author_flair_css_class', 'author_flair_richtext', 'gildings', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'removed_by_category', 'banned_by', 'author_flair_type', 'domain', 'allow_live_comments', 'selftext_html', 

In [22]:
beer_df = pd.DataFrame([post['data'] for post in beer_posts])
beer_df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,thumbnail_height,top_awarded_type,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,thumbnail_width,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,is_created_from_ads_ui,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,...,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,author_is_blocked,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,post_hint,preview,call_to_action,link_flair_template_id,url_overridden_by_dest,crosspost_parent_list,crosspost_parent,poll_data,author_cakeday
0,,beer,"Hi, Howdy, Hello! No doubt you’re here because...",t2_7zg7o,False,,0,False,Beer Suggestions on r/beer And You: So You Wan...,[],r/beer,False,6,,0,,,False,t3_i0we2n,False,dark,0.99,,public,131,0,{},,,False,[],,False,False,,{},,False,131,,False,False,self,False,,[],{},,True,,...,False,False,False,[],[],False,False,False,True,,[],False,,,,t5_2qhg1,False,,,,i0we2n,True,,botulizard,,11,True,all_ads,False,[],False,,/r/beer/comments/i0we2n/beer_suggestions_on_rb...,all_ads,True,https://www.reddit.com/r/beer/comments/i0we2n/...,445080,1596151000.0,0,,False,,,,,,,,,
1,,beer,Do you have questions about beer? We have answ...,t2_6l4z3,False,,0,False,No Stupid Questions Wednesday - ask anything a...,[],r/beer,False,6,,0,,,False,t3_wkwty3,False,dark,0.81,,public,3,0,{},,,False,[],,False,False,,{},,False,3,,False,True,self,False,,[],{},,True,,...,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qhg1,False,,,,wkwty3,True,,AutoModerator,,7,True,all_ads,False,[],False,,/r/beer/comments/wkwty3/no_stupid_questions_we...,all_ads,True,https://www.reddit.com/r/beer/comments/wkwty3/...,445080,1660136000.0,0,,False,self,{'images': [{'source': {'url': 'https://extern...,,,,,,,
2,,beer,I'm talking about a brewery or taproom that ha...,t2_3sl7r,False,,0,False,What is the best thematic brewery/taproom?,[],r/beer,False,6,,0,0.0,,False,t3_wmrj0l,False,dark,0.92,,public,60,0,{},0.0,,False,[],,False,False,,{},,False,60,,False,False,self,False,pint3,[],{},,True,,...,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qhg1,False,,,,wmrj0l,True,,DocGerbil256,,103,True,all_ads,False,[],False,dark,/r/beer/comments/wmrj0l/what_is_the_best_thema...,all_ads,False,https://www.reddit.com/r/beer/comments/wmrj0l/...,445080,1660327000.0,0,,False,self,{'images': [{'source': {'url': 'https://extern...,,,,,,,
3,,beer,"I’ve looked at lists on Untapped, Beeradvocate...",t2_57uglhd3,False,,0,False,Asheville NC brewery recommendations?,[],r/beer,False,6,,0,,,True,t3_wn213p,False,dark,0.83,,public,8,0,{},,,False,[],,False,False,,{},,False,8,,False,False,self,False,,[],{},,True,,...,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qhg1,False,,,,wn213p,True,,Woah-Big-Gulps-Huh,,12,True,all_ads,False,[],False,,/r/beer/comments/wn213p/asheville_nc_brewery_r...,all_ads,False,https://www.reddit.com/r/beer/comments/wn213p/...,445080,1660355000.0,0,,False,,,,,,,,,
4,,beer,,t2_kakjaf9z,False,,0,False,"I am mexican and I’m genuinely curious, what w...","[{'e': 'text', 't': 'Discussion'}]",r/beer,False,6,,0,,,False,t3_wm8rnq,False,dark,0.94,,public,204,1,{},,,False,[],,False,False,,{},Discussion,False,204,,False,False,self,False,,[],{},,True,,...,False,False,False,"[{'giver_coin_reward': None, 'subreddit_id': N...",[],False,False,False,False,,[],False,,,,t5_2qhg1,False,,,,wm8rnq,True,,psychologicalprowler,,398,True,all_ads,False,[],False,,/r/beer/comments/wm8rnq/i_am_mexican_and_im_ge...,all_ads,False,https://www.reddit.com/r/beer/comments/wm8rnq/...,445080,1660269000.0,0,,False,,,,adacd18c-88fc-11e3-8c8a-12313b0ce8a6,,,,,


### Gather Posts from Wine Subreddit

In [23]:
# Set an arbitrary User-agent header to avoid HTTP 429 (Too Many Requests) errors
headers = {'User-agent':'Lawrence of Arabia'}

# List for storing posts
wine_posts=[] 

# Empty parameter for first iteration
after=None

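# Reddit returns 25 posts per request by default

for num in range(40): 
    
    # Print progress on every 10th request
    if num % 10 == 0: 
        print(f'scrape {num} in progress...') 
        
    # The first request has no 'after' token; later requests continue from the last post seen
    if after is None:
        param={} 
    else:
        param={'after':after}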
        
    url = 'https://www.reddit.com/r/wine.json'
    
    results = requests.get(url,params=param,headers=headers)
    
    #Checking if the requests are successful
    if results.status_code==200: 
        d_json=results.json()
        wine_posts.extend(d_json['data']['children']) 
        after=d_json['data']['after']
    else:
        print(results.status_code)
        break
        
    # Sleep timing
    time.sleep(1) #seconds to sleep

scrape 0 in progress...
scrape 10 in progress...
scrape 20 in progress...
scrape 30 in progress...
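
The beer and wine loops above differ only in the URL, so the paging logic could be shared. The sketch below is one possible refactoring, not the code that produced the outputs in this notebook. The names `scrape_listing`, `reddit_fetcher` and `fetch_page` are hypothetical. Unlike the loops above, it also stops when Reddit returns no `after` token, since an empty token would send the next request back to the first page.

```python
import time


def scrape_listing(fetch_page, pages=40, pause=1.0):
    """Collect posts page by page.

    fetch_page(after) is a hypothetical callable that returns a tuple
    (status_code, payload), where payload is the decoded JSON listing.
    """
    posts = []
    after = None  # no token on the first request
    for _ in range(pages):
        status, payload = fetch_page(after)
        if status != 200:
            print(f"request failed with status {status}")
            break
        posts.extend(payload["data"]["children"])
        after = payload["data"]["after"]
        if after is None:
            # No further pages; another request would start again at page one
            break
        time.sleep(pause)
    return posts


def reddit_fetcher(subreddit, user_agent="Lawrence of Arabia"):
    """Build a fetch_page callable for https://www.reddit.com/r/<subreddit>.json."""
    import requests  # imported here so scrape_listing can be used without it

    url = f"https://www.reddit.com/r/{subreddit}.json"
    headers = {"User-agent": user_agent}

    def fetch_page(after):
        params = {} if after is None else {"after": after}
        response = requests.get(url, params=params, headers=headers)
        payload = response.json() if response.status_code == 200 else None
        return response.status_code, payload

    return fetch_page
```

With these helpers, the two scrapes would reduce to `beer_posts = scrape_listing(reddit_fetcher("beer"))` and `wine_posts = scrape_listing(reddit_fetcher("wine"))`. Keeping the HTTP call separate from the paging loop also lets the loop be tested with a fake `fetch_page` and no network access.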


In [24]:
len(wine_posts)

999

In [25]:
len(set([post['data']['name'] for post in wine_posts]))

972
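
The two counts above differ (999 posts, 972 unique `name` values), so some posts were fetched more than once. A minimal sketch of removing such repeats with pandas is shown below. The frame `toy_posts` is a hypothetical stand-in for the scraped data, and keeping the first occurrence is an assumption about which copy to retain.

```python
import pandas as pd

# Hypothetical stand-in for a scraped frame with a repeated post
toy_posts = pd.DataFrame({
    "name": ["t3_a", "t3_b", "t3_a", "t3_c"],
    "title": ["first", "second", "first", "third"],
})

# 'name' is Reddit's unique post identifier, so use it as the duplicate key
deduped = toy_posts.drop_duplicates(subset="name", keep="first").reset_index(drop=True)

print(len(toy_posts), len(deduped))
```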

In [26]:
wine_df = pd.DataFrame([post['data'] for post in wine_posts])
wine_df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,thumbnail_height,top_awarded_type,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,thumbnail_width,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,is_created_from_ads_ui,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,...,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,author_is_blocked,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,post_hint,url_overridden_by_dest,preview,is_gallery,media_metadata,gallery_data,call_to_action,crosspost_parent_list,crosspost_parent,author_cakeday
0,,wine,Want to know how much that bottle of 1945 Chât...,t2_39hfp,False,,0,False,[MEGA THREAD] - How Much is My Wine Worth?,[],r/wine,False,6,,0,,,False,t3_r7lf76,False,dark,0.95,,public,114,0,{},,,False,[],,False,False,,{},,False,114,,False,False,self,False,,[],{},,True,,...,False,False,[],[],False,False,False,False,Wine Pro - Curator,[],False,,,,t5_2qhs8,False,,,,r7lf76,True,,cheezerman,,491,False,all_ads,False,[],False,dark,/r/wine/comments/r7lf76/mega_thread_how_much_i...,all_ads,True,https://www.reddit.com/r/wine/comments/r7lf76/...,182099,1638491000.0,0,,False,,,,,,,,,,
1,,wine,"Bottle porn without notes, random musings, off...",t2_6l4z3,False,,0,False,Free Talk Friday,[],r/wine,False,6,,0,,,False,t3_wmciew,False,dark,1.0,,public,7,0,{},,,False,[],,False,False,,{},,False,7,,False,True,self,False,,[],{},,True,,...,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qhs8,False,,,,wmciew,True,,AutoModerator,,30,True,all_ads,False,[],False,,/r/wine/comments/wmciew/free_talk_friday/,all_ads,True,https://www.reddit.com/r/wine/comments/wmciew/...,182099,1660280000.0,0,,False,,,,,,,,,,
2,,wine,,t2_884pwcjw,False,,0,False,Starting our wine cellar project,[],r/wine,False,6,,0,140.0,,False,t3_wmuxwp,False,dark,0.98,,public,61,0,{},140.0,,False,[],,True,False,,{},,False,61,,False,False,https://a.thumbs.redditmedia.com/6UwnEvTZkeiBL...,False,,[],{},,False,,...,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qhs8,False,,,,wmuxwp,True,,Defiant_Day8427,,8,True,all_ads,False,[],False,,/r/wine/comments/wmuxwp/starting_our_wine_cell...,all_ads,False,https://i.redd.it/f15szo9tcch91.jpg,182099,1660336000.0,0,,False,image,https://i.redd.it/f15szo9tcch91.jpg,{'images': [{'source': {'url': 'https://previe...,,,,,,,
3,,wine,,t2_8i26t,False,,0,False,Advice on this 1971 Haut-Brion,[],r/wine,False,6,,0,140.0,,False,t3_wmqfvf,False,dark,0.96,,public,47,0,{},140.0,,False,[],,True,False,,{},,False,47,,False,False,https://b.thumbs.redditmedia.com/loxMVR4MBawCb...,False,,[],{},,False,,...,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qhs8,False,,,,wmqfvf,True,,Boney3147,,22,True,all_ads,False,[],False,,/r/wine/comments/wmqfvf/advice_on_this_1971_ha...,all_ads,False,https://i.redd.it/4g2hfn52ebh91.jpg,182099,1660324000.0,0,,False,image,https://i.redd.it/4g2hfn52ebh91.jpg,{'images': [{'source': {'url': 'https://previe...,,,,,,,
4,,wine,"Hey everyone, was out to dinner and had an int...",t2_3p0my3ou,False,,0,False,"Ordered Sancerre, was brought Pouilly-Fumé",[],r/wine,False,6,,0,,,False,t3_wms859,False,dark,0.95,,public,33,0,{},,,False,[],,False,False,,{},,False,33,,False,False,self,False,,[],{},,True,,...,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qhs8,False,,,,wms859,True,,chicfan51,,23,True,all_ads,False,[],False,,/r/wine/comments/wms859/ordered_sancerre_was_b...,all_ads,False,https://www.reddit.com/r/wine/comments/wms859/...,182099,1660329000.0,0,,False,,,,,,,,,,


In [27]:
beer_select_df = beer_df[['name', 'author','title','selftext','subreddit']]
beer_select_df.head()

Unnamed: 0,name,author,title,selftext,subreddit
0,t3_i0we2n,botulizard,Beer Suggestions on r/beer And You: So You Wan...,"Hi, Howdy, Hello! No doubt you’re here because...",beer
1,t3_wkwty3,AutoModerator,No Stupid Questions Wednesday - ask anything a...,Do you have questions about beer? We have answ...,beer
2,t3_wmrj0l,DocGerbil256,What is the best thematic brewery/taproom?,I'm talking about a brewery or taproom that ha...,beer
3,t3_wn213p,Woah-Big-Gulps-Huh,Asheville NC brewery recommendations?,"I’ve looked at lists on Untapped, Beeradvocate...",beer
4,t3_wm8rnq,psychologicalprowler,"I am mexican and I’m genuinely curious, what w...",,beer


In [28]:
beer_select_df.dtypes

name         object
author       object
title        object
selftext     object
subreddit    object
dtype: object

In [29]:
wine_select_df = wine_df[['name', 'author','title','selftext','subreddit']]
wine_select_df.head()

Unnamed: 0,name,author,title,selftext,subreddit
0,t3_r7lf76,cheezerman,[MEGA THREAD] - How Much is My Wine Worth?,Want to know how much that bottle of 1945 Chât...,wine
1,t3_wmciew,AutoModerator,Free Talk Friday,"Bottle porn without notes, random musings, off...",wine
2,t3_wmuxwp,Defiant_Day8427,Starting our wine cellar project,,wine
3,t3_wmqfvf,Boney3147,Advice on this 1971 Haut-Brion,,wine
4,t3_wms859,chicfan51,"Ordered Sancerre, was brought Pouilly-Fumé","Hey everyone, was out to dinner and had an int...",wine


In [30]:
wine_select_df.dtypes

name         object
author       object
title        object
selftext     object
subreddit    object
dtype: object

# Save Datasets for Beer and Wine after Scraping

In [31]:
# Exporting Beer Scraped Dataset pre-EDA
beer_select_df.to_csv('../project_3/datasets/beer.csv')

In [32]:
# Exporting Wine Scraped Dataset pre-EDA
wine_select_df.to_csv('../project_3/datasets/wine.csv')
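
By default `to_csv` also writes the DataFrame index, which comes back as an `Unnamed: 0` column when the file is read again. If that column is not wanted downstream, passing `index=False` avoids it. The sketch below shows the difference on a hypothetical toy frame written to a temporary directory, so it does not touch the project's `datasets` folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for beer_select_df
toy_df = pd.DataFrame({"name": ["t3_a", "t3_b"], "subreddit": ["beer", "beer"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "datasets")
    os.makedirs(out_dir, exist_ok=True)  # make sure the target folder exists

    with_index = os.path.join(out_dir, "with_index.csv")
    without_index = os.path.join(out_dir, "without_index.csv")

    toy_df.to_csv(with_index)                  # default: index is written
    toy_df.to_csv(without_index, index=False)  # index is left out

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```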