# DSI19 Project 3 - Data Extraction
---

## Table of Contents

* [1. Initial Data Exploration](#chapter1)
    * [1.1 Initial Data Scrape](#chapter1_1)
    * [1.2 Examine Data](#chapter1_2)
    * [1.3 Observations](#chapter1_3)
    * [1.4 Features Selected](#chapter1_4)
* [2. Data Scraping](#chapter2)
    * [2.1 Data Extraction Function](#chapter2_1)
    * [2.2 Pull from Subreddits](#chapter2_2)

In [5]:
# Library imports
import requests
import time
import pandas as pd
import json
from IPython.display import clear_output

## 1. Initial Data Exploration <a class="anchor" id="chapter1"></a>
---

The objective is to extract posts from two subreddits, `/r/tifu` and `/r/confessions`, in order to build a classification model. Before the model can be built, the data returned by the Reddit API is explored to see how it is structured.

### 1.1 Initial Data Scrape <a class="anchor" id="chapter1_1"></a>

In [10]:
# Defining the subreddit to be scraped
url = 'https://www.reddit.com/r/tifu/.json'

# Creating a header parameter for the API
headers = {'user-agent':'dnys'}

# Requesting data through API
res = requests.get(url, headers = headers)
print(res.status_code)

200


A status code of 200 indicates that the request to the URL succeeded.
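As a small aid for reading status codes, the sketch below uses the standard library's `http.HTTPStatus` to attach the standard reason phrase to a code. The helper names `describe_status` and `is_success` are hypothetical and are not part of the notebook's own code.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the status code together with its standard reason phrase."""
    return f'{code} {HTTPStatus(code).phrase}'

def is_success(code):
    """True for any 2xx status code."""
    return 200 <= code < 300

print(describe_status(200))  # 200 OK
print(describe_status(429))  # 429 Too Many Requests
```

In the notebook, `res.status_code` could be passed to either helper after each request.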

### 1.2 Examine Data <a class="anchor" id="chapter1_2"></a>

In [11]:
# Creating data object to store extracted data
data = res.json()

In [12]:
print(sorted(data.keys())) # Examining keys of extracted data
print(data['kind'])

['data', 'kind']
Listing


The extracted data is a dictionary with 2 keys.
- `data`: where the required information is stored
- `kind`: a string value, here `Listing`

In [91]:
print(sorted(data['data'].keys())) # Examining keys within data['data']

['after', 'before', 'children', 'dist', 'modhash']


In [13]:
print(data['data']['after']) # Examining values of 'after' key
print(data['data']['before']) # Examining values of 'before' key

t3_kv3p0o
None
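The `after` value has the form of a type prefix, an underscore, and an identifier. The sketch below simply splits such a string into its two parts; `split_fullname` is a hypothetical helper, and the reading of `t3` as the prefix for posts is based on the `'kind': 't3'` field that appears on each post in the pull.

```python
def split_fullname(fullname):
    """Split a value such as 't3_kv3p0o' into (prefix, identifier)."""
    kind, _, identifier = fullname.partition('_')
    return kind, identifier

kind, post_id = split_fullname('t3_kv3p0o')
print(kind, post_id)
```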


In [14]:
# Observing first item in extracted data
display(data['data']['children'][0])

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'tifu',
  'selftext': 'Not a long story but as a new years resolution I told myself I would make an effort to compliment people because it always makes me feel good when I receive one.\nSo this woman comes in and she\'s really nice and friendly and as I\'m making her coffee we are having a chat and she has this huge belt that is absolutely covered in those sparkly jewel thingies. So I say. "I really like your belt it\'s really shiny" she smiles and says "thanks" so I go. "I wish I could pull it off" not thinking that it could mean something else other than I wish it looked good on me. she blushes and looks down and when I finish her coffee she looks at me and smiles and says "thanks" so my assistant manager comes over and she\'s like "did you just tell her you wanted to take her belt off?" My face goes red as I realise what I said. I explained it to my assistant manager who just laughed and told me to think about what i s

In [15]:
# Testing if the 'after' key contains marker for the last post within that API pull
print(data['data']['children'][24]['data']['name'] == data['data']['after'])

True


In [16]:
# Extracting all the keys within the data key of data['data']['children']
print(sorted(data['data']['children'][0]['data'].keys()))

['all_awardings', 'allow_live_comments', 'approved_at_utc', 'approved_by', 'archived', 'author', 'author_flair_background_color', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_template_id', 'author_flair_text', 'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders', 'banned_at_utc', 'banned_by', 'can_gild', 'can_mod_post', 'category', 'clicked', 'content_categories', 'contest_mode', 'created', 'created_utc', 'discussion_type', 'distinguished', 'domain', 'downs', 'edited', 'gilded', 'gildings', 'hidden', 'hide_score', 'id', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video', 'likes', 'link_flair_background_color', 'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id', 'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked', 'media', 'media_embed', 'media_only', 'mod_note', 'mod_reason_by', 'mod_reas

In [17]:
# Determining number of posts returned by each API pull
print(len(data['data']['children']))

25


### 1.3 Observations <a class="anchor" id="chapter1_3"></a>

- Information required lives in `data['data']['children']`
- Subreddit is contained within `['subreddit']`
- Title of post is contained within `['title']`
- Body of post is contained within `['selftext']`
- `data['data']['after']` contains the `name` of the last post within that API pull, so it can be used as a marker to continue from
- Each pull extracts 25 posts
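The observations above suggest a simple pagination pattern: read the posts from `children`, then pass `after` to the next request. The sketch below illustrates that pattern without touching the network. `paginate`, `fake_fetch`, and the two made-up pages in `FAKE_PAGES` are hypothetical, and stopping when `after` is `None` is an assumption about how the end of a listing is signalled.

```python
def paginate(fetch_page, max_pages):
    """Collect posts by following the 'after' marker from page to page.

    fetch_page is a hypothetical callable that takes the current marker
    (or None for the first page) and returns a Listing-shaped dict.
    """
    posts, after = [], None
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(child['data'] for child in listing['data']['children'])
        after = listing['data']['after']
        if after is None:  # Assumed to mean there are no further pages
            break
    return posts

# Made-up two-page listing, shaped like the data explored above
FAKE_PAGES = {
    None: {'kind': 'Listing', 'data': {'after': 't3_b', 'children': [
        {'kind': 't3', 'data': {'name': 't3_a', 'title': 'first'}},
        {'kind': 't3', 'data': {'name': 't3_b', 'title': 'second'}}]}},
    't3_b': {'kind': 'Listing', 'data': {'after': None, 'children': [
        {'kind': 't3', 'data': {'name': 't3_c', 'title': 'third'}}]}},
}

def fake_fetch(after):
    """Stand-in for an API request; looks the page up by its marker."""
    return FAKE_PAGES[after]

names = [post['name'] for post in paginate(fake_fetch, max_pages=5)]
print(names)
```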

### 1.4 Features Selected <a class="anchor" id="chapter1_4"></a>

The model will be trained on the text in the title and body of each post in order to classify the post as belonging to one of the two subreddits. In addition, other features that may provide useful insights will also be extracted.
- `name`: unique identifier of the post (the same value that appears in `after`)
- `title`: title of the post
- `selftext`: body of the post
- `ups`: upvotes of the post
- `num_comments`: number of comments made for the post
- `subreddit`: subreddit the post belongs to
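A minimal sketch of how the selected features could be pulled out of a single post is shown below. The sample post is made up, `select_features` and `combined_text` are hypothetical helpers, and joining the title and body into one string is only one possible way to prepare the text for a model.

```python
FEATURES = ['name', 'title', 'selftext', 'ups', 'num_comments', 'subreddit']

def select_features(child, features=FEATURES):
    """Keep only the selected features from one item of 'children'."""
    return {feature: child['data'][feature] for feature in features}

def combined_text(row):
    """Join title and body into a single string (an assumed preprocessing step)."""
    return f"{row['title']} {row['selftext']}".strip()

# Made-up post with one extra key that should be dropped
sample_child = {'kind': 't3', 'data': {
    'name': 't3_example', 'title': 'TIFU by testing', 'selftext': 'It went fine.',
    'ups': 3, 'num_comments': 1, 'subreddit': 'tifu', 'author': 'someone'}}

row = select_features(sample_child)
print(sorted(row.keys()))
print(combined_text(row))
```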

## 2. Data Scraping <a class="anchor" id="chapter2"></a>
---

Given the selected features, the required data is scraped through the Reddit JSON API, 25 posts per request.

### 2.1 Data Extraction Function <a class="anchor" id="chapter2_1"></a>

In [18]:
# Define function to extract posts from a subreddit, given the url and the number of targeted posts
def subred_extraction(url, number):

    headers = {'user-agent': 'dnys'} # Custom user agent to prevent 429 status code
    posts = [] # List to store extracted posts
    after = None # Marker to continue extracting after each pull of 25 posts
    features = ['name', 'title', 'selftext', 'ups', 'num_comments', 'subreddit'] # Features to be extracted

    for i in range(int(round(number / 25))): # One request per 25 requested posts
        print(f'Scraping {(i+1)*25} posts.') # Print status as posts are extracted
        params = {} if after is None else {'after': after} # Pass the marker from the second request onwards
        res = requests.get(url, params=params, headers=headers) # Requesting data using API with marker

        if res.status_code == 200: # Testing for a successful request
            the_json = res.json()

            for child in the_json['data']['children']: # Looping through each post returned by this request
                # Keeping only the required features of the post
                posts.append({feature: child['data'][feature] for feature in features})

            after = the_json['data']['after'] # Setting new marker
            print(len(posts))
        else:
            print(f'Request failed. Status code: {res.status_code}.') # Error message for a failed request
            break
        clear_output(wait=True)
        time.sleep(1) # Creating time interval between each API request
        print(f'Successfully extracted {len(posts)} posts.')
    return posts
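The function above derives the number of requests by rounding `number / 25`, which works for multiples of 25 but rounds other targets to the nearest page. The sketch below compares that rounding with rounding up; `requests_rounded` and `requests_needed` are hypothetical helpers written for illustration and are not used in the notebook.

```python
import math

PAGE_SIZE = 25  # Posts per request, as observed in section 1.3

def requests_rounded(number, page_size=PAGE_SIZE):
    """Request count using the same rounding as the extraction function."""
    return int(round(number / page_size))

def requests_needed(number, page_size=PAGE_SIZE):
    """Request count that always covers the target by rounding up."""
    return math.ceil(number / page_size)

for target in (1000, 30, 12):
    print(target, requests_rounded(target), requests_needed(target))
```

For the 1000 posts requested in this notebook both versions give 40 requests, so the difference only matters for targets that are not multiples of 25.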

### 2.2 Pull from Subreddits <a class="anchor" id="chapter2_2"></a>

Subreddits are:
- /r/tifu
- /r/confessions

In [19]:
# Creating the URL for each subreddit
url1 = 'https://www.reddit.com/r/tifu/new/.json'
url2 = 'https://www.reddit.com/r/confessions/new/.json'

In [20]:
# Extracting data for TIFU subreddit, requesting 1000 posts and saving to .csv file
tifu_posts = subred_extraction(url1,1000)
pd.DataFrame(tifu_posts).to_csv('../data/tifu_subreddit.csv',index=False)

Successfully extracted 997 posts.


In [21]:
# Extracting data for Confessions subreddit, requesting 1000 posts and saving to .csv file
confessions_posts = subred_extraction(url2,1000)
pd.DataFrame(confessions_posts).to_csv('../data/confessions_subreddit.csv',index=False)

Successfully extracted 992 posts.
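Both pulls returned slightly fewer than the 1000 posts requested (997 and 992). A sanity check that could be run on the extracted lists is sketched below: it looks for repeated `name` values, assuming `name` uniquely identifies a post. The helpers `duplicate_names` and `drop_duplicates` and the sample rows are hypothetical.

```python
from collections import Counter

def duplicate_names(posts):
    """Return the 'name' values that occur more than once."""
    counts = Counter(post['name'] for post in posts)
    return sorted(name for name, n in counts.items() if n > 1)

def drop_duplicates(posts):
    """Keep the first occurrence of each 'name', preserving order."""
    seen, unique = set(), []
    for post in posts:
        if post['name'] not in seen:
            seen.add(post['name'])
            unique.append(post)
    return unique

# Made-up rows with one repeated post
sample_posts = [{'name': 't3_a'}, {'name': 't3_b'}, {'name': 't3_a'}]
print(duplicate_names(sample_posts))
print(len(drop_duplicates(sample_posts)))
```

The same functions could be applied to `tifu_posts` and `confessions_posts` before saving them to `.csv`.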


In [22]:
# Read in extracted data
tifu_df = pd.read_csv('../data/tifu_subreddit.csv')
confessions_df = pd.read_csv('../data/confessions_subreddit.csv')

In [24]:
display(tifu_df.describe())

| | ups | num_comments |
|---|---|---|
| count | 997.0 | 997.0 |
| mean | 999.558676 | 57.308927 |
| std | 5177.492488 | 243.364587 |
| min | 0.0 | 0.0 |
| 25% | 7.0 | 4.0 |
| 50% | 18.0 | 9.0 |
| 75% | 58.0 | 19.0 |
| max | 66113.0 | 3214.0 |


In [26]:
display(confessions_df.describe())

| | ups | num_comments |
|---|---|---|
| count | 992.0 | 992.0 |
| mean | 75.271169 | 12.740927 |
| std | 551.937176 | 52.051345 |
| min | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 |
| 50% | 3.0 | 4.0 |
| 75% | 7.0 | 9.0 |
| max | 7390.0 | 1011.0 |
