<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Executive-Summary" data-toc-modified-id="Executive-Summary-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Executive Summary</a></span><ul class="toc-item"><li><span><a href="#Problem-statement" data-toc-modified-id="Problem-statement-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Problem statement</a></span></li><li><span><a href="#Business-relevance" data-toc-modified-id="Business-relevance-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Business relevance</a></span></li><li><span><a href="#How-it-will-be-carried-out" data-toc-modified-id="How-it-will-be-carried-out-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>How it will be carried out</a></span></li></ul></li><li><span><a href="#Data-Gathering:-Extracting-data-from-subreddits-using-json" data-toc-modified-id="Data-Gathering:-Extracting-data-from-subreddits-using-json-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Gathering: Extracting data from subreddits using json</a></span><ul class="toc-item"><li><span><a href="#How-reddit's-json-files-are-structured" data-toc-modified-id="How-reddit's-json-files-are-structured-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>How reddit's json files are structured</a></span></li><li><span><a href="#Exploring-how-to-retrieve-1000-rows-from-each-subreddit" data-toc-modified-id="Exploring-how-to-retrieve-1000-rows-from-each-subreddit-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Exploring how to retrieve 1000 rows from each subreddit</a></span></li><li><span><a href="#Extracting-json-files-and-exporting-into-csv" data-toc-modified-id="Extracting-json-files-and-exporting-into-csv-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Extracting json files and exporting into csv</a></span></li></ul></li></ul></div>

## Executive Summary

### Problem statement

The objective is to pick two subreddits from Reddit and train a machine learning model that will be able to classify new posts into the correct subreddit. 

I identified r/whisky and r/wine as my two topics of interest because, based on real-world knowledge, the two topics share enough similarities to provide a good challenge, but are also differentiated enough that it should be possible to train a machine learning model to tell them apart.

**Examples of similarities**
* Both topics are part of the alcohol industry
* Flavour profile vocabulary e.g. rich, lush, citrus, toasty, full-bodied, length, finish
* Production jargon e.g. yeast, ferment, years, oak, stainless steel
* Periphery jargon e.g. bottle, collection, region, drink, flight, ABV

**Examples of differences**
* Ingredient terms e.g. rye, grain, malt vs. grapes, pinot noir, chardonnay
* Flavour profile vocabulary e.g. smoky, peaty vs. tannins
* Geographical regions e.g. Scotland, Islay, Kentucky vs. Australia, Burgundy
* Periphery terms e.g. dram, age, distillery, cask vs. glass, vintage, winery, barrel

### Business relevance
**Search Engine Optimization**

Wine and whisky are the kinds of luxury products that can feel intimidating because the domains are often thought of as requiring a lot of specialised knowledge and vocabulary. While such jargon certainly shows up a lot in brochures and specialist magazines, the terms people use to talk about whiskies and wines in casual conversation tend to be more down-to-earth and comprehensible. With drinks companies increasingly looking at e-commerce, knowing the everyday words that people use to discuss or search for whiskies and wines can help these companies improve their SEO.

**Email filtering**

For some reason, many smaller retailers in the alcohol sector are either very resistant to or slow in launching full-fledged e-commerce sites. Instead, their idea of retailing online is often to upload the product catalogue in PDF form, so that prospective customers have to email them to make enquiries or place purchase orders. This in turn means that when an email lands in the company's enquiries/sales inbox, someone has to direct the email to the correct category manager. A good classification system would therefore help to automate much of this grunt work.


### How it will be carried out

* Data gathering via Reddit's API
* Data cleaning (e.g. rows with null values, duplicate rows)
* Data exploration to understand features of interest
* Data pre-processing to remove markup elements, numerals, emojis etc
* Vectorize words into numerical data using Count Vectorizer and TF-IDF
* Basic modeling with Logistic Regression and Multinomial Naive Bayes
* Iterate on pre-processing and modeling with further refinement
    * spaCy to lemmatize text
    * Tune hyper-parameters for Count Vectorizer and TF-IDF
    * Tune Logistic Regression, Multinomial Naive Bayes and Random Forest Classifier
    * Pick the best model (i.e. the one that overfits the least)
    * Evaluate on a new test data set
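
To make the vectorizing step in the plan above concrete, here is a minimal pure-Python sketch of what count vectorization and TF-IDF weighting do to a handful of made-up post titles. It is an illustration only: the sample titles, the `tokenize`, `count_vectors` and `tfidf_vectors` helpers and the plain `log(N / df)` weighting are all assumptions for this example. scikit-learn's `CountVectorizer` and `TfidfVectorizer` apply their own tokenization, smoothing and normalization, so their numbers will differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters only (hypothetical tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def count_vectors(docs):
    """Return (vocabulary, list of count vectors), one vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def tfidf_vectors(vocab, vectors):
    """Weight raw counts by log(N / document frequency); unsmoothed, for illustration."""
    n_docs = len(vectors)
    doc_freq = [sum(1 for vec in vectors if vec[i] > 0) for i in range(len(vocab))]
    return [
        [count * math.log(n_docs / doc_freq[i]) for i, count in enumerate(vec)]
        for vec in vectors
    ]


# made-up titles, not taken from the scraped data
sample_docs = [
    "Smoky peaty dram from Islay",
    "Pinot noir with soft tannins",
    "Oak finish on this dram",
]

vocab, counts = count_vectors(sample_docs)
weights = tfidf_vectors(vocab, counts)

dram = vocab.index("dram")
print([vec[dram] for vec in counts])              # raw counts of "dram" per title
print([round(vec[dram], 3) for vec in weights])   # weighted values of "dram" per title
```

Words that appear in many documents get a smaller weight than words confined to one document, which is why TF-IDF can help separate shared vocabulary (e.g. "bottle") from subreddit-specific vocabulary (e.g. "peaty").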

## Data Gathering: Extracting data from subreddits using json

In [2]:
# Importing libraries
import pandas as pd

from bs4 import BeautifulSoup
import requests
import time


In [3]:
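# changing our pandas settings so that we can view all columns and rows
# (using the full option names, since the short form 'max_columns' can match more than one option in newer pandas versions)
pd.set_option('display.max_columns', 999)
pd.set_option('display.max_rows', 999)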

In [4]:
# Setting target subreddits

# Henceforth, using 'w' in variable names to denote "whisky" and 'v' in variable names to denote "wine" (because wine = vino)

w_url = "https://www.reddit.com/r/whisky.json"
v_url = "https://www.reddit.com/r/wine.json"


In [5]:
# Setting arbitrary user agent to circumvent 429 Too Many Requests errors
headers = {"User-agent": "Fish" }

# Establishing connection to whisky subreddit
w_res = requests.get(w_url, headers=headers)

# Ex., 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found.
print(w_res.status_code)

w_html = w_res.text

200


In [6]:
# Establishing connection to wine subreddit
v_res = requests.get(v_url, headers=headers)

# Ex., 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found.
print(v_res.status_code)

v_html = v_res.text

200


In [7]:
# Taking a look at the json for r/whisky
w_json = w_res.json()
w_json

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 25,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'whisky',
     'selftext': '',
     'author_fullname': 't2_pwox5',
     'saved': False,
     'mod_reason_title': None,
     'gilded': 0,
     'clicked': False,
     'title': 'The Fascinating History of Whiskey in North America',
     'link_flair_richtext': [],
     'subreddit_name_prefixed': 'r/whisky',
     'hidden': False,
     'pwls': None,
     'link_flair_css_class': None,
     'downs': 0,
     'thumbnail_height': 73,
     'hide_score': False,
     'name': 't3_dlttu2',
     'quarantine': False,
     'link_flair_text_color': 'dark',
     'author_flair_background_color': None,
     'subreddit_type': 'public',
     'ups': 42,
     'total_awards_received': 0,
     'media_embed': {},
     'thumbnail_width': 140,
     'author_flair_template_id': None,
     'is_original_content': False,
     'user_reports': [],
     'secure_media': None,
     'i

In [8]:
# finding out the keys to the json dictionary
# this is the "first level" of the json file
# we can assume that the keys are the same for both subreddits
sorted(w_json.keys())

['data', 'kind']

In [9]:
# taking a look at the second level of data within the key "data"
w_json["data"]


{'modhash': '',
 'dist': 25,
 'children': [{'kind': 't3',
   'data': {'approved_at_utc': None,
    'subreddit': 'whisky',
    'selftext': '',
    'author_fullname': 't2_pwox5',
    'saved': False,
    'mod_reason_title': None,
    'gilded': 0,
    'clicked': False,
    'title': 'The Fascinating History of Whiskey in North America',
    'link_flair_richtext': [],
    'subreddit_name_prefixed': 'r/whisky',
    'hidden': False,
    'pwls': None,
    'link_flair_css_class': None,
    'downs': 0,
    'thumbnail_height': 73,
    'hide_score': False,
    'name': 't3_dlttu2',
    'quarantine': False,
    'link_flair_text_color': 'dark',
    'author_flair_background_color': None,
    'subreddit_type': 'public',
    'ups': 42,
    'total_awards_received': 0,
    'media_embed': {},
    'thumbnail_width': 140,
    'author_flair_template_id': None,
    'is_original_content': False,
    'user_reports': [],
    'secure_media': None,
    'is_reddit_media_domain': False,
    'is_meta': False,
    'cate

In [10]:
# finding out the keys to the next level (3rd) 
# from the results above we most likely want to go into children
sorted(w_json["data"].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [11]:
w_json["data"]["children"]

[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'whisky',
   'selftext': '',
   'author_fullname': 't2_pwox5',
   'saved': False,
   'mod_reason_title': None,
   'gilded': 0,
   'clicked': False,
   'title': 'The Fascinating History of Whiskey in North America',
   'link_flair_richtext': [],
   'subreddit_name_prefixed': 'r/whisky',
   'hidden': False,
   'pwls': None,
   'link_flair_css_class': None,
   'downs': 0,
   'thumbnail_height': 73,
   'hide_score': False,
   'name': 't3_dlttu2',
   'quarantine': False,
   'link_flair_text_color': 'dark',
   'author_flair_background_color': None,
   'subreddit_type': 'public',
   'ups': 42,
   'total_awards_received': 0,
   'media_embed': {},
   'thumbnail_width': 140,
   'author_flair_template_id': None,
   'is_original_content': False,
   'user_reports': [],
   'secure_media': None,
   'is_reddit_media_domain': False,
   'is_meta': False,
   'category': None,
   'secure_media_embed': {},
   'link_flair_text': None,
   'c

The key "children" seems to be returning a list of posts for r/whisky.
Now to check how many posts we get for each call.

In [12]:
len(w_json["data"]["children"])

25

In [13]:
# looking at the structure of each post
w_json["data"]["children"][2]

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'whisky',
  'selftext': "I'll start by saying hello, everyone! First post on this subreddit, long-time lurker.\n\nI love a good whisky. I also do not have much of an income, so I have a tendency to take a week or two to make decisions on the nest bottle I purchase. As such I was hoping y'all could help me make a decision.\n\nI love Islay whisky, more than anything else I've tried; my favorite whisky ever is Ardbeg's Corryvreckan; I love its sweet and brackish charred flavor, simply fantastic. I also always have a bottle of Ardbeg 10 and Lagavulin 16 in my cupboard.\n\nOutside of peated whisky, I really enjoy ones with a thick, sweet, woody taste. The last couple I tried that I have really liked have been Balvenie's Carribean cask, and Nikka from the Barrel; the former being unavailable here and the latter I recently purchased.\n\nI'm interested in anything fitting the above profiles; scotch or bourbon. I'm looking to spen

In [14]:
# since this level is again a dictionary, we want to use the key "data" to access the details and content of each post.
w_json["data"]["children"][2]["data"]

{'approved_at_utc': None,
 'subreddit': 'whisky',
 'selftext': "I'll start by saying hello, everyone! First post on this subreddit, long-time lurker.\n\nI love a good whisky. I also do not have much of an income, so I have a tendency to take a week or two to make decisions on the nest bottle I purchase. As such I was hoping y'all could help me make a decision.\n\nI love Islay whisky, more than anything else I've tried; my favorite whisky ever is Ardbeg's Corryvreckan; I love its sweet and brackish charred flavor, simply fantastic. I also always have a bottle of Ardbeg 10 and Lagavulin 16 in my cupboard.\n\nOutside of peated whisky, I really enjoy ones with a thick, sweet, woody taste. The last couple I tried that I have really liked have been Balvenie's Carribean cask, and Nikka from the Barrel; the former being unavailable here and the latter I recently purchased.\n\nI'm interested in anything fitting the above profiles; scotch or bourbon. I'm looking to spend anywhere up to $150. Any

### How reddit's json files are structured

Here's what we have established so far:
* `w_json["data"]["children"]` gives us a list of posts. Each json file typically contains 25 posts.
* Within each post, we need to use the key "data" to access the actual details about the post.
* This new level is also organised as a dictionary.
* These are the keys we would probably be interested in extracting data from, so that we can build a list of predictor features:
    * `'title'`
    * `'selftext'` (there is also `'selftext_html'`, but the former is more useful i.e. less data cleaning)
    * `'subreddit'` (technically, we already know which subreddit it is from the json file we requested, but this might come in useful)
    * `'name'`: id for each post (the prefix 't3' corresponds to the type of item, i.e. a link/post, whereas 'Listing' is the kind of the wrapper that holds the posts)
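
As a minimal sketch of the structure summarised above, the snippet below pulls the four keys of interest out of a listing. The `sample_listing` dictionary is a hand-made stand-in that only mimics the nesting of Reddit's json (its values are trimmed from the outputs shown earlier), and `extract_posts` and `KEYS_OF_INTEREST` are hypothetical names used for illustration.

```python
# hand-made stand-in with the same nesting as Reddit's listing json
sample_listing = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"name": "t3_dlttu2", "subreddit": "whisky",
                                    "title": "The Fascinating History of Whiskey in North America",
                                    "selftext": "", "ups": 42}},
            {"kind": "t3", "data": {"name": "t3_dlk3fi", "subreddit": "whisky",
                                    "title": "Whisky tasting night",
                                    "selftext": "I'm hosting a Whisky tasting night at a local real ale/craft beer bar in November.",
                                    "ups": 9}},
        ],
    },
}

KEYS_OF_INTEREST = ["name", "subreddit", "title", "selftext"]


def extract_posts(listing, keys=KEYS_OF_INTEREST):
    """Return one flat dictionary per post, keeping only the requested keys."""
    posts = []
    for child in listing["data"]["children"]:
        details = child["data"]  # the details of each post sit under its own "data" key
        posts.append({key: details.get(key) for key in keys})
    return posts


for post in extract_posts(sample_listing):
    print(post["name"], "|", post["title"])
```

Using `details.get(key)` rather than `details[key]` means a post that lacks one of the keys yields `None` instead of raising an error, which may or may not be the behaviour wanted for real data.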
    
### Exploring how to retrieve 1000 rows from each subreddit

We want to find out how we can build a dataset of 1000 rows (the maximum that reddit's API allows) for each subreddit.



In [15]:
# let's see what that "after" key from the second level of the json file was 
w_json["data"]["after"]

't3_dh31zt'

In [16]:
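# let's see if that value can be found anywhere within the json file
# note: len(w_json) counts the two top-level keys ('data', 'kind'), so this index is 1, i.e. the second post;
# the final row would be w_json["data"]["children"][-1]
w_json["data"]["children"][len(w_json)-1]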

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'whisky',
  'selftext': "I'm hosting a Whisky tasting night at a local real ale/craft beer bar in November. First time I've done anything like this, was wondering what you guys thought of my line up for the night and the pricing etc.\n\nGreen Spot\n\nMidleton Method &amp; Madness Hungarian Oak\n\nTimorous Beastie 10yr\n\nClynelish 14yr\n\nGlenallachie 12yr\n\nBladnoch 17yr Californian Red Wine Cask\n\nLedaig 10yr\n\nOctomore 08.1\n\n£25 entry, 25ml pours.\n\nThink that's a nice progression through Irish single pot still, blended malt and Scottish single malt. Plenty to talk about there as well.",
  'author_fullname': 't2_ouo8fii',
  'saved': False,
  'mod_reason_title': None,
  'gilded': 0,
  'clicked': False,
  'title': 'Whisky tasting night',
  'link_flair_richtext': [],
  'subreddit_name_prefixed': 'r/whisky',
  'hidden': False,
  'pwls': None,
  'link_flair_css_class': None,
  'downs': 0,
  'thumbnail_height': None,
 


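`w_json["data"]["after"]` corresponds to the name (i.e. id) of the very last post in the current json file. Reddit uses it as a pagination cursor: passing it back as the `after` parameter should return the next set of posts.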
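
A quick way to test this observation is to compare the `after` value with the `name` of the last item in `children`. The sketch below does so on a hand-made dictionary that keeps only the fields needed for the check; `after_matches_last_post` is a hypothetical helper, and running it against the real `w_json` is what would actually confirm the pattern.

```python
def after_matches_last_post(listing):
    """True if the listing's "after" value equals the name of its final post."""
    children = listing["data"]["children"]
    if not children:
        return False
    # index -1 selects the final post regardless of how many posts were returned
    return listing["data"]["after"] == children[-1]["data"]["name"]


# hand-made stand-in: only the fields needed for the check
toy_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_dh31zt",
        "children": [
            {"kind": "t3", "data": {"name": "t3_dlttu2"}},
            {"kind": "t3", "data": {"name": "t3_dlk3fi"}},
            {"kind": "t3", "data": {"name": "t3_dh31zt"}},
        ],
    },
}

print(after_matches_last_post(toy_listing))
```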

In [17]:
w_param = {"after": w_json["data"]["after"] }
w_param

{'after': 't3_dh31zt'}

In [18]:
w_res = requests.get(w_url, headers=headers, params=w_param)

# Ex., 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found.
print(w_res.status_code)

w_html = w_res.text

200


In [19]:
w_json2 = w_res.json()

# getting the length of our next json set
print(len(w_json2["data"]["children"]))

w_json2

25


{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 25,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'whisky',
     'selftext': '',
     'author_fullname': 't2_p04xvsi',
     'saved': False,
     'mod_reason_title': None,
     'gilded': 0,
     'clicked': False,
     'title': 'The value Aberlour!',
     'link_flair_richtext': [],
     'subreddit_name_prefixed': 'r/whisky',
     'hidden': False,
     'pwls': None,
     'link_flair_css_class': None,
     'downs': 0,
     'thumbnail_height': 140,
     'hide_score': False,
     'name': 't3_dgmpfi',
     'quarantine': False,
     'link_flair_text_color': 'dark',
     'author_flair_background_color': None,
     'subreddit_type': 'public',
     'ups': 52,
     'total_awards_received': 0,
     'media_embed': {},
     'thumbnail_width': 140,
     'author_flair_template_id': None,
     'is_original_content': False,
     'user_reports': [],
     'secure_media': None,
     'is_reddit_media_domain': True,

In [481]:
{"after": w_json2["data"]["after"] }

{'after': 't3_dbzuvg'}
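
The paging pattern explored above can be rehearsed offline before making real requests. The sketch below is an illustration under stated assumptions rather than the function used in this report: `iter_posts`, `fake_fetch` and `FAKE_PAGES` are hypothetical names, and the fake pages only imitate the `children`/`after` layout, with `after` set to `None` on the last page, as Reddit's listings do when there is nothing further to fetch.

```python
def iter_posts(fetch, max_calls=40):
    """Yield post dictionaries page by page, following the "after" cursor.

    `fetch` is any callable that takes a params dict and returns a listing dict.
    """
    params = {}
    for _ in range(max_calls):  # 40 calls x 25 posts = 1000 rows at most
        listing = fetch(params)
        for child in listing["data"]["children"]:
            yield child["data"]
        after = listing["data"]["after"]
        if not after:  # no cursor means there is no further page
            break
        params = {"after": after}


# three tiny fake pages, keyed by the cursor that leads to them
FAKE_PAGES = {
    None: {"data": {"children": [{"data": {"name": "t3_a"}}, {"data": {"name": "t3_b"}}], "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"name": "t3_c"}}, {"data": {"name": "t3_d"}}], "after": "t3_d"}},
    "t3_d": {"data": {"children": [{"data": {"name": "t3_e"}}], "after": None}},
}


def fake_fetch(params):
    """Stand-in for an API call: look the page up by its "after" cursor."""
    return FAKE_PAGES[params.get("after")]


names = [post["name"] for post in iter_posts(fake_fetch)]
print(names)
```

With real requests, `fetch` could be a small wrapper around `requests.get(w_url, headers=headers, params=params).json()`, ideally with a pause between calls. Injecting the fetcher keeps the paging logic testable without touching the network.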

### Extracting json files and exporting into csv

Now that we've figured out how to keep accessing each new set of 25 rows, we can create a function `ext_subr()` that will
* make multiple API calls to extract json files
* loop through each json file to build our list of posts
* convert that list of posts into a DataFrame object with `'name'` as the first column
* save the DataFrame as a csv file in the current folder
* return the DataFrame


In [20]:
# function to extract a subreddit
def ext_subr(url, csv_name):

    headers = {"User-agent": "Fish"}
    param = {}  # no "after" parameter on the first call

    this_list = []

    n_per_call = 25  # known number of rows an API call typically provides
    n_calls = 1000 // n_per_call  # number of times we need to do API calls to get 1000 rows of data

    n_rows = 0  # to help us keep track of the number of rows we've extracted from the subreddit

    for calls in range(n_calls):

        print(f"Running batch #{calls+1}")

        res = requests.get(url, headers=headers, params=param)

        # to 'catch' incidences when we have trouble establishing a connection with the server, and to break out of the for loop
        if res.status_code != 200:
            print("Error retrieving json")
            break

        this_json = res.json()
        children = this_json["data"]["children"]
        print("size of json file is:", len(children))
        n_rows = n_rows + len(children)

        # each child is a dictionary; the details of the post sit under its "data" key
        for child in children:
            this_list.append(child["data"])

        print("---\n")

        # just in case we don't have 1000 rows of data to work with:
        # "after" is empty (None) on the last "page" of the subreddit, so we need to break out of the for loop
        param = {"after": this_json["data"]["after"]}
        if not param["after"]:
            break

        # taking a pause before the next API call to prevent overloading the server (or getting flagged as a DDOS attack!)
        time.sleep(1)

    print("TOTAL rows retrieved:", n_rows)

    df = pd.DataFrame(this_list)

    # we want to re-order the columns so that we can look at the most useful columns first
    cols_to_order = ["name", "author", "created", "title", "selftext", "selftext_html", "url", "media", "subreddit"]
    new_columns = cols_to_order + (df.columns.drop(cols_to_order).tolist())
    df = df[new_columns]

    df.to_csv(csv_name, index=False)

    print(f"Returning DataFrame created from {url}")
    print(f"{csv_name} created in current folder.")

    return df

In [21]:
w_url = "https://www.reddit.com/r/whisky.json"
v_url = "https://www.reddit.com/r/wine.json"


In [22]:
raw_vino = ext_subr(v_url, "raw_vino.csv")

Running batch #1
size of json file is: 27
---

Running batch #2
size of json file is: 25
---

Running batch #3
size of json file is: 25
---

Running batch #4
size of json file is: 25
---

Running batch #5
size of json file is: 25
---

Running batch #6
size of json file is: 25
---

Running batch #7
size of json file is: 25
---

Running batch #8
size of json file is: 25
---

Running batch #9
size of json file is: 25
---

Running batch #10
size of json file is: 25
---

Running batch #11
size of json file is: 25
---

Running batch #12
size of json file is: 25
---

Running batch #13
size of json file is: 25
---

Running batch #14
size of json file is: 25
---

Running batch #15
size of json file is: 25
---

Running batch #16
size of json file is: 25
---

Running batch #17
size of json file is: 25
---

Running batch #18
size of json file is: 25
---

Running batch #19
size of json file is: 25
---

Running batch #20
size of json file is: 25
---

Running batch #21
size of json file is: 25
---

R

In [23]:
print(raw_vino.shape)
print()
raw_vino.head(10)

(991, 104)



Unnamed: 0,name,author,created,title,selftext,selftext_html,url,media,subreddit,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,awarders,banned_at_utc,banned_by,can_gild,can_mod_post,category,clicked,content_categories,contest_mode,created_utc,crosspost_parent,crosspost_parent_list,discussion_type,distinguished,domain,downs,edited,gilded,gildings,hidden,hide_score,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,likes,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_text,link_flair_text_color,link_flair_type,locked,media_embed,media_metadata,media_only,mod_note,mod_reason_by,mod_reason_title,mod_reports,no_follow,num_comments,num_crossposts,num_reports,over_18,parent_whitelist_status,permalink,pinned,post_hint,preview,pwls,quarantine,removal_reason,report_reasons,saved,score,secure_media,secure_media_embed,send_replies,spoiler,steward_reports,stickied,subreddit_id,subreddit_name_prefixed,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,thumbnail_height,thumbnail_width,total_awards_received,ups,user_reports,view_count,visited,whitelist_status,wls
0,t3_dibnii,PhoenixRising20,1571191000.0,**Monthly Wine Challenge - October 2019 Tastin...,Hi Everyone! I'm on time this month! And ea...,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",https://www.reddit.com/r/wine/comments/dibnii/...,,wine,[],False,,,False,,,[],,Wino,dark,text,t2_b8geq,False,[],,,False,False,,False,,False,1571162000.0,,,,,self.wine,0,False,0,{},False,False,dibnii,False,False,False,False,True,True,False,,,,[],,dark,text,False,{},,False,,,,[],False,16,0,,False,all_ads,/r/wine/comments/dibnii/monthly_wine_challenge...,False,self,{'images': [{'source': {'url': 'https://extern...,6,False,,,False,19,,{},True,False,[],True,t5_2qhs8,r/wine,95627,public,,self,,,0,19,[],,False,all_ads,6
1,t3_djmwhi,CondorKhan,1571431000.0,Free Talk Friday,"Bottle porn without notes, random musings, off...","&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",https://www.reddit.com/r/wine/comments/djmwhi/...,,wine,[],True,,,False,,,[],,,,text,t2_89rd5,False,[],,,False,False,,False,,False,1571402000.0,,,,moderator,self.wine,0,False,0,{},False,False,djmwhi,False,False,False,False,True,True,False,,,,[],,dark,text,False,{},,False,,,,[],False,34,0,,False,all_ads,/r/wine/comments/djmwhi/free_talk_friday/,False,,,6,False,,,False,8,,{},True,False,[],True,t5_2qhs8,r/wine,95627,public,,self,,,0,8,[],,False,all_ads,6
2,t3_dmk4y8,nickro5,1571969000.0,Any info on this wine? Found in a cellar that'...,,,https://i.redd.it/bq6v03jf3ju31.jpg,,wine,[],False,,,False,,,[],,,,text,t2_18xfrbfj,False,[],,,False,False,,False,,False,1571940000.0,,,,,i.redd.it,0,False,0,{},False,True,dmk4y8,False,False,False,True,True,False,False,,,,[],,dark,text,False,{},,False,,,,[],False,3,0,,False,all_ads,/r/wine/comments/dmk4y8/any_info_on_this_wine_...,False,image,{'images': [{'source': {'url': 'https://previe...,6,False,,,False,13,,{},True,False,[],False,t5_2qhs8,r/wine,95627,public,,https://a.thumbs.redditmedia.com/OVEmqD7HSdpq9...,140.0,140.0,0,13,[],,False,all_ads,6
3,t3_dmhv2s,unusualbehavior,1571959000.0,Seeking recommendations for wine(s) to pair wi...,Hi r/wine! I’m making [this](https://thewander...,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",https://www.reddit.com/r/wine/comments/dmhv2s/...,,wine,[],False,,,False,,,[],,,,text,t2_c1ykjt0,False,[],,,False,False,,False,,False,1571930000.0,,,,,self.wine,0,False,0,{},False,False,dmhv2s,False,False,False,False,True,True,False,,,,[],,dark,text,False,{},,False,,,,[],False,12,0,,False,all_ads,/r/wine/comments/dmhv2s/seeking_recommendation...,False,self,{'images': [{'source': {'url': 'https://extern...,6,False,,,False,11,,{},True,False,[],False,t5_2qhs8,r/wine,95627,public,,self,,,0,11,[],,False,all_ads,6
4,t3_dm31kn,noodlemen2,1571882000.0,"Despite a last minute pants crisis, I passed!",,,https://i.redd.it/czkgyltbybu31.jpg,,wine,[],True,,,False,,,[],,,,text,t2_12mfgd,False,[],,,False,False,,False,,False,1571854000.0,,,,,i.redd.it,0,False,0,{},False,False,dm31kn,False,False,False,True,True,False,False,,,,[],,dark,text,False,{},,False,,,,[],False,48,0,,False,all_ads,/r/wine/comments/dm31kn/despite_a_last_minute_...,False,image,{'images': [{'source': {'url': 'https://previe...,6,False,,,False,454,,{},True,False,[],False,t5_2qhs8,r/wine,95627,public,,https://b.thumbs.redditmedia.com/KPd9gSlElgyDX...,94.0,140.0,0,454,[],,False,all_ads,6
5,t3_dmd3c6,Hyrman,1571932000.0,Wine bottle stopper fresh off the lathe,,,https://i.redd.it/0nenw89r1gu31.jpg,,wine,[],False,,,False,,,[],,,,text,t2_1i5tlrci,False,[],,,False,False,,False,,False,1571903000.0,,,,,i.redd.it,0,False,0,{},False,False,dmd3c6,False,False,False,True,True,False,False,,,,[],,dark,text,False,{},,False,,,,[],False,1,0,,False,all_ads,/r/wine/comments/dmd3c6/wine_bottle_stopper_fr...,False,image,{'images': [{'source': {'url': 'https://previe...,6,False,,,False,32,,{},True,False,[],False,t5_2qhs8,r/wine,95627,public,,https://b.thumbs.redditmedia.com/dBHsEYJ6mx5dr...,140.0,140.0,0,32,[],,False,all_ads,6
6,t3_dm6cqp,JulienMiquel,1571896000.0,Cabernet Franc wines from the Loire can be inc...,,,https://i.redd.it/76vzqwin2du31.jpg,,wine,[],False,,,False,,,[],9670473a-bb06-11e1-aa05-12313b088941,Wine Pro,dark,text,t2_114s99is,False,[],,,False,False,,False,,False,1571867000.0,,,,,i.redd.it,0,False,0,{},False,False,dm6cqp,False,False,False,True,True,False,False,,,,[],,dark,text,False,{},,False,,,,[],False,19,0,,False,all_ads,/r/wine/comments/dm6cqp/cabernet_franc_wines_f...,False,image,{'images': [{'source': {'url': 'https://previe...,6,False,,,False,136,,{},True,False,[],False,t5_2qhs8,r/wine,95627,public,,https://b.thumbs.redditmedia.com/46xfnMfVeGAhW...,105.0,140.0,0,136,[],,False,all_ads,6
7,t3_dm9jdr,shenglih,1571911000.0,Excited to see the two giants in person today!,,,https://i.redd.it/8xf16c2v9eu31.jpg,,wine,[],False,,,False,,,[],,,,text,t2_10lazs,False,[],,,False,False,,False,,False,1571882000.0,,,,,i.redd.it,0,False,0,{},False,False,dm9jdr,False,False,False,True,True,False,False,,,,[],,dark,text,False,{},,False,,,,[],False,7,0,,False,all_ads,/r/wine/comments/dm9jdr/excited_to_see_the_two...,False,image,{'images': [{'source': {'url': 'https://previe...,6,False,,,False,48,,{},True,False,[],False,t5_2qhs8,r/wine,95627,public,,https://b.thumbs.redditmedia.com/sCpsgG1L3Iw_P...,105.0,140.0,0,48,[],,False,all_ads,6
8,t3_dm9rs2,milovat,1571912000.0,And so it begins. Hopefully in 4-6 months I’ll...,,,https://i.redd.it/p24n1bq9deu31.png,,wine,[],False,,,False,,,[],,,,text,t2_jwtaxci,False,[],,,False,False,,False,,False,1571883000.0,,,,,i.redd.it,0,False,0,{},False,False,dm9rs2,False,False,False,True,True,False,False,,,,[],,dark,text,False,{},,False,,,,[],False,5,0,,False,all_ads,/r/wine/comments/dm9rs2/and_so_it_begins_hopef...,False,image,{'images': [{'source': {'url': 'https://previe...,6,False,,,False,39,,{},True,False,[],False,t5_2qhs8,r/wine,95627,public,,https://b.thumbs.redditmedia.com/N0G4I06y8lnWF...,104.0,140.0,0,39,[],,False,all_ads,6
9,t3_dm9pv6,had-me-at-bi-weekly,1571911000.0,Celebratory East Bench Ridge,,,https://i.redd.it/se78nujjceu31.jpg,,wine,[],True,,,False,,,[],,,,text,t2_q3x01ln,False,[],,,False,False,,False,,False,1571883000.0,,,,,i.redd.it,0,False,0,{},False,False,dm9pv6,False,False,False,True,True,False,False,,,,[],,dark,text,False,{},,False,,,,[],False,9,0,,False,all_ads,/r/wine/comments/dm9pv6/celebratory_east_bench...,False,image,{'images': [{'source': {'url': 'https://previe...,6,False,,,False,37,,{},True,False,[],False,t5_2qhs8,r/wine,95627,public,,https://b.thumbs.redditmedia.com/fHQheAW66qcik...,140.0,140.0,0,37,[],,False,all_ads,6


In [24]:
# now to create the csv file and DataFrame for the whisky subreddit

raw_whisky = ext_subr(w_url, "raw_whisky.csv")

Running batch #1
size of json file is: 25
---

Running batch #2
size of json file is: 25
---

Running batch #3
size of json file is: 25
---

Running batch #4
size of json file is: 25
---

Running batch #5
size of json file is: 25
---

Running batch #6
size of json file is: 25
---

Running batch #7
size of json file is: 25
---

Running batch #8
size of json file is: 25
---

Running batch #9
size of json file is: 25
---

Running batch #10
size of json file is: 25
---

Running batch #11
size of json file is: 25
---

Running batch #12
size of json file is: 25
---

Running batch #13
size of json file is: 25
---

Running batch #14
size of json file is: 25
---

Running batch #15
size of json file is: 25
---

Running batch #16
size of json file is: 25
---

Running batch #17
size of json file is: 25
---

Running batch #18
size of json file is: 25
---

Running batch #19
size of json file is: 25
---

Running batch #20
size of json file is: 25
---

Running batch #21
size of json file is: 25
---

R

In [25]:
print(raw_whisky.shape)
print()
raw_whisky.head()


(997, 105)



Unnamed: 0,name,author,created,title,selftext,selftext_html,url,media,subreddit,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,awarders,banned_at_utc,banned_by,can_gild,can_mod_post,category,clicked,content_categories,contest_mode,created_utc,crosspost_parent,crosspost_parent_list,discussion_type,distinguished,domain,downs,edited,gilded,gildings,hidden,hide_score,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,likes,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_text,link_flair_text_color,link_flair_type,locked,media_embed,media_metadata,media_only,mod_note,mod_reason_by,mod_reason_title,mod_reports,no_follow,num_comments,num_crossposts,num_reports,over_18,parent_whitelist_status,permalink,pinned,post_hint,preview,pwls,quarantine,removal_reason,report_reasons,saved,score,secure_media,secure_media_embed,send_replies,spoiler,steward_reports,stickied,subreddit_id,subreddit_name_prefixed,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,thumbnail_height,thumbnail_width,total_awards_received,ups,user_reports,view_count,visited,whitelist_status,wls
0,t3_dlttu2,alexwblack,1571832000.0,The Fascinating History of Whiskey in North Am...,,,https://www.primermagazine.com/2019/learn/nort...,,whisky,[],False,,,False,,,,[],,,,text,t2_pwox5,False,[],,,False,False,,False,,False,1571803000.0,,,,,primermagazine.com,0,False,0,{},False,False,dlttu2,False,False,False,False,True,False,False,,,,[],,dark,text,False,{},,False,,,,[],False,2,0,,False,,/r/whisky/comments/dlttu2/the_fascinating_hist...,False,link,{'images': [{'source': {'url': 'https://extern...,,False,,,False,39,,{},False,False,[],False,t5_2qryn,r/whisky,20272,public,,https://b.thumbs.redditmedia.com/pCLlwR0UiIQLC...,73.0,140.0,0,39,[],,False,,
1,t3_dlk3fi,UisgeLobos,1571789000.0,Whisky tasting night,I'm hosting a Whisky tasting night at a local ...,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",https://www.reddit.com/r/whisky/comments/dlk3f...,,whisky,[],False,,,False,,,,[],,,,text,t2_ouo8fii,False,[],,,False,False,,False,,False,1571761000.0,,,,,self.whisky,0,False,0,{},False,False,dlk3fi,False,False,False,False,True,True,False,,,,[],,dark,text,False,{},,False,,,,[],False,9,0,,False,,/r/whisky/comments/dlk3fi/whisky_tasting_night/,False,,,,False,,,False,9,,{},True,False,[],False,t5_2qryn,r/whisky,20272,public,,self,,,0,9,[],,False,,
2,t3_dlgoqh,the_wordless_one,1571773000.0,Could anyone help an Islay fanatic choose a ne...,"I'll start by saying hello, everyone! First po...","&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",https://www.reddit.com/r/whisky/comments/dlgoq...,,whisky,[],False,,,False,,,,[],,,,text,t2_178lf4,False,[],,,False,False,,False,,False,1571744000.0,,,,,self.whisky,0,1.5718e+09,0,{},False,False,dlgoqh,False,False,False,False,True,True,False,,,,[],,dark,text,False,{},,False,,,,[],False,49,0,,False,,/r/whisky/comments/dlgoqh/could_anyone_help_an...,False,,,,False,,,False,12,,{},True,False,[],False,t5_2qryn,r/whisky,20272,public,,self,,,0,12,[],,False,,
3,t3_dloix7,Peatysmokeygoodness,1571807000.0,IAMA Student at University of Illinois conduct...,Please tell me why you love whiskey and what a...,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",https://www.reddit.com/r/whisky/comments/dloix...,,whisky,[],False,,,False,,,,[],,,,text,t2_4uopsi6y,False,[],,,False,False,,False,,False,1571779000.0,,,,,self.whisky,0,False,0,{},False,False,dloix7,False,False,False,False,True,True,False,,,,[],,dark,text,False,{},,False,,,,[],True,6,0,,False,,/r/whisky/comments/dloix7/iama_student_at_univ...,False,self,{'images': [{'source': {'url': 'https://extern...,,False,,,False,0,,{},True,False,[],False,t5_2qryn,r/whisky,20272,public,,self,,,0,0,[],,False,,
4,t3_dl6boj,sjaakarie,1571718000.0,Has anyone already tried this?,,,https://i.redd.it/aenpzbpvdyt31.jpg,,whisky,[],False,,,False,,,,[],,,,text,t2_1124re,False,[],,,False,False,,False,,False,1571689000.0,,,,,i.redd.it,0,False,0,{},False,False,dl6boj,False,False,False,True,True,False,False,,,,[],,dark,text,False,{},,False,,,,[],False,26,0,,False,,/r/whisky/comments/dl6boj/has_anyone_already_t...,False,image,{'images': [{'source': {'url': 'https://previe...,,False,,,False,19,,{},True,False,[],False,t5_2qryn,r/whisky,20272,public,,https://b.thumbs.redditmedia.com/YZTIuHXLnNc2U...,105.0,140.0,0,19,[],,False,,


To make it easier for the rest of the technical report to be run en masse without having to wait through the data-extraction process, the technical report has been split into two components.

Please continue to Project_3_part_two.ipynb