# Scraping Reddit:

### Overview:
- **Part 1:** Connecting to Reddit
- **Part 2:** Taking a look at the `.json` file
- **Part 3:** Scraping Reddit!
- **Part 4:** Brief look at the data

In [2]:
import csv
import requests
import json
import time

from bs4 import BeautifulSoup
import numpy as np
import seaborn as sns
import pandas as pd

## Connecting to Reddit:


### API - The code below requests Reddit's API to establish a connection.

- Success!
- The request returned HTTP `status_code` 200 ("OK"), meaning Reddit accepted the request and sent back the listing.

In [3]:
url = "https://www.reddit.com/r/politics/.json"

In [4]:
headers = {'User-agent': 'adelr'}
res = requests.get(url, headers=headers)  # reuse the url defined above
res.status_code

200
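A status code is easier to read with its standard reason phrase next to it. The sketch below uses the standard library's `http.HTTPStatus` enum; the helper name `describe_status` is made up for illustration and is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the code with its standard reason phrase, e.g. '200 OK'."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Not a status code the standard library knows about.
        return f"{code} (unknown status)"

print(describe_status(200))  # 200 OK
print(describe_status(429))  # 429 Too Many Requests
```

With the response above, `describe_status(res.status_code)` would give `200 OK`. A 429 typically signals that requests are arriving too quickly.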

## Taking a look at the `.json` content

In [5]:
res.content;

In [6]:
the_json = res.json()

#### Taking a look at the keys:

In [7]:
sorted(the_json.keys())

['data', 'kind']

In [8]:
the_json['kind']

'Listing'

#### Taking a look at the second layer of keys:

In [9]:
sorted(the_json['data'].keys()) 

['after', 'before', 'children', 'dist', 'modhash']

In [13]:
the_json['data']['children']; # Taking a look at the posts.

[{'data': {'approved_at_utc': None,
   'approved_by': None,
   'archived': False,
   'author': 'therealdanhill',
   'author_flair_css_class': None,
   'author_flair_template_id': None,
   'author_flair_text': None,
   'banned_at_utc': None,
   'banned_by': None,
   'can_gild': False,
   'can_mod_post': False,
   'clicked': False,
   'contest_mode': False,
   'created': 1528061667.0,
   'created_utc': 1528032867.0,
   'distinguished': 'moderator',
   'domain': 'self.politics',
   'downs': 0,
   'edited': False,
   'gilded': 0,
   'hidden': False,
   'hide_score': False,
   'id': '8o8k49',
   'is_crosspostable': False,
   'is_reddit_media_domain': False,
   'is_self': True,
   'is_video': False,
   'likes': None,
   'link_flair_css_class': None,
   'link_flair_text': None,
   'locked': False,
   'media': None,
   'media_embed': {},
   'media_only': False,
   'mod_note': None,
   'mod_reason_by': None,
   'mod_reason_title': None,
   'mod_reports': [],
   'name': 't3_8o8k49',
   'no_follo

In [11]:
len(the_json['data']['children'])

26

#### Converting the `json` into a DataFrame to take a better look:

#### The `.json` response is a nested dictionary: the posts sit in a list under `data`, in `children`.

- Each `child` has a `data` dictionary (the post's fields) and a `kind` (`t3` for every post here)

In [12]:
pd.DataFrame(the_json['data']['children'])

| | data | kind |
|---|---|---|
| 0 | {'is_crosspostable': False, 'subreddit_id': 't... | t3 |
| 1 | {'is_crosspostable': False, 'subreddit_id': 't... | t3 |
| 2 | {'is_crosspostable': False, 'subreddit_id': 't... | t3 |
| 3 | {'is_crosspostable': False, 'subreddit_id': 't... | t3 |
| 4 | {'is_crosspostable': False, 'subreddit_id': 't... | t3 |
| 5 | {'is_crosspostable': False, 'subreddit_id': 't... | t3 |
| 6 | {'is_crosspostable': False, 'subreddit_id': 't... | t3 |
| 7 | {'is_crosspostable': False, 'subreddit_id': 't... | t3 |
| 8 | {'is_crosspostable': False, 'subreddit_id': 't... | t3 |
| 9 | {'is_crosspostable': False, 'subreddit_id': 't... | t3 |
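Because each post's fields stay packed inside the `data` column, the table is hard to read. One way to unpack it, sketched here on a small hand-trimmed stand-in for `the_json` (only a few fields are kept, and `flatten_children` is a hypothetical helper name), is to merge each child's `kind` with the fields under its `data`:

```python
# A tiny stand-in for `the_json`, trimmed to a few fields for illustration.
sample_json = {
    "kind": "Listing",
    "data": {
        "after": "t3_8n8qnw",
        "children": [
            {"kind": "t3", "data": {"name": "t3_8n7f58", "author": "ToadProphet", "num_comments": 808}},
            {"kind": "t3", "data": {"name": "t3_8n7tbv", "author": "fuzzyshorts", "num_comments": 536}},
        ],
    },
}

def flatten_children(listing):
    """Return one flat dict per child: its `kind` plus every field under `data`."""
    rows = []
    for child in listing["data"]["children"]:
        row = {"kind": child["kind"]}
        row.update(child["data"])
        rows.append(row)
    return rows

rows = flatten_children(sample_json)
print(rows[0]["author"])  # ToadProphet
```

Passing `rows` to `pd.DataFrame` would then give one column per field instead of a single `data` column.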


#### Taking a look at the first `child`

In [13]:
the_json['data']['children'][0]

{'data': {'approved_at_utc': None,
  'approved_by': None,
  'archived': False,
  'author': 'ToadProphet',
  'author_flair_css_class': 'newyork-flag',
  'author_flair_template_id': 'bd600206-8e72-11e6-bf04-0ee844677561',
  'author_flair_text': 'New York',
  'banned_at_utc': None,
  'banned_by': None,
  'can_gild': False,
  'can_mod_post': False,
  'clicked': False,
  'contest_mode': False,
  'created': 1527704019.0,
  'created_utc': 1527675219.0,
  'distinguished': None,
  'domain': 'washingtonpost.com',
  'downs': 0,
  'edited': False,
  'gilded': 0,
  'hidden': False,
  'hide_score': False,
  'id': '8n7f58',
  'is_crosspostable': False,
  'is_reddit_media_domain': False,
  'is_self': False,
  'is_video': False,
  'likes': None,
  'link_flair_css_class': None,
  'link_flair_text': None,
  'locked': False,
  'media': None,
  'media_embed': {},
  'media_only': False,
  'mod_note': None,
  'mod_reason_by': None,
  'mod_reason_title': None,
  'mod_reports': [],
  'name': 't3_8n7f58',
  'no

In [14]:
the_json['data']['children'][0]['data']

{'approved_at_utc': None,
 'approved_by': None,
 'archived': False,
 'author': 'ToadProphet',
 'author_flair_css_class': 'newyork-flag',
 'author_flair_template_id': 'bd600206-8e72-11e6-bf04-0ee844677561',
 'author_flair_text': 'New York',
 'banned_at_utc': None,
 'banned_by': None,
 'can_gild': False,
 'can_mod_post': False,
 'clicked': False,
 'contest_mode': False,
 'created': 1527704019.0,
 'created_utc': 1527675219.0,
 'distinguished': None,
 'domain': 'washingtonpost.com',
 'downs': 0,
 'edited': False,
 'gilded': 0,
 'hidden': False,
 'hide_score': False,
 'id': '8n7f58',
 'is_crosspostable': False,
 'is_reddit_media_domain': False,
 'is_self': False,
 'is_video': False,
 'likes': None,
 'link_flair_css_class': None,
 'link_flair_text': None,
 'locked': False,
 'media': None,
 'media_embed': {},
 'media_only': False,
 'mod_note': None,
 'mod_reason_by': None,
 'mod_reason_title': None,
 'mod_reports': [],
 'name': 't3_8n7f58',
 'no_follow': False,
 'num_comments': 808,
 'num_cro

#### Name of every post:

In [14]:
post_names = [post['data']['name'] for post in the_json['data']['children']] 
post_names

['t3_8o8k49',
 't3_8o88op',
 't3_8o9fa6',
 't3_8o85qa',
 't3_8o8w2f',
 't3_8oaari',
 't3_8o8als',
 't3_8o991g',
 't3_8oa34y',
 't3_8o7zsx',
 't3_8o8umj',
 't3_8o9bt9',
 't3_8o86qs',
 't3_8o9ux8',
 't3_8o8g3q',
 't3_8o98d9',
 't3_8o7zel',
 't3_8o8ozj',
 't3_8o9n8g',
 't3_8o8eaw',
 't3_8o8dct',
 't3_8o97dh',
 't3_8o9l5z',
 't3_8o7rhx',
 't3_8o87og',
 't3_8o4ws6']
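Each of these names is a Reddit "fullname": the `kind` prefix (`t3` for posts), an underscore, and the post's `id`. A minimal sketch of splitting one apart, with `split_fullname` as a hypothetical helper name:

```python
def split_fullname(fullname):
    """Split a Reddit fullname such as 't3_8o8k49' into (kind, id)."""
    kind, _, post_id = fullname.partition("_")
    return kind, post_id

kind, post_id = split_fullname("t3_8o8k49")
print(kind, post_id)  # t3 8o8k49
```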

#### The `after` value is the anchor for paging through Reddit's API: pass it as the `after` parameter to request the next page.

In [16]:
the_json['data']['after']

't3_8n8qnw'

In [17]:
param = {'after': the_json['data']['after']}  # use the anchor from the last response

In [18]:
requests.get(url, params = param, headers = headers)

<Response [200]>
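The paging idea can be sketched offline. In the example below, `FAKE_PAGES` and `fetch_page` are hypothetical stand-ins for the real API (a real call would be `requests.get(url, params=..., headers=...)`), and the anchors and post labels are invented. The loop feeds each response's `after` value into the next request and stops when no anchor comes back.

```python
# Hypothetical stand-in for the API: three "pages", keyed by the `after`
# value that requests them. None means "first page".
FAKE_PAGES = {
    None: {"data": {"children": ["post_a", "post_b"], "after": "t3_page2"}},
    "t3_page2": {"data": {"children": ["post_c", "post_d"], "after": "t3_page3"}},
    "t3_page3": {"data": {"children": ["post_e"], "after": None}},
}

def fetch_page(after=None):
    """Return the fake listing that follows the given anchor."""
    return FAKE_PAGES[after]

def collect(max_pages):
    """Follow the `after` anchor until max_pages is reached or it runs out."""
    posts, after = [], None
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:  # no further pages
            break
    return posts

print(collect(max_pages=2))   # ['post_a', 'post_b', 'post_c', 'post_d']
print(collect(max_pages=10))  # ['post_a', 'post_b', 'post_c', 'post_d', 'post_e']
```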

------------
## Scraping Reddit through the API:

#### The function below scrapes posts from a subreddit, one page at a time:

In [19]:
def get_posts(sub='all', num_pages=1, avoid_distinguished=True, attached=None):
    """
    Returns a list of posts from a subreddit,
    or None if a request fails.

    ===========================
    ======= Parameters ========
    ===========================

    sub = 'all' (default): type = string
        The subreddit you want to query.
        https://reddit.com/r/{sub}/
    -------------------------------------------------------------
    num_pages = 1 (default): type = int
        Number of pages to read from.
        The function sleeps 2 seconds between pages,
        so expect roughly 2 * num_pages seconds to run.
    -------------------------------------------------------------
    avoid_distinguished = True (default): type = bool
        Whether or not to skip stickied, archived,
        and distinguished (moderator/admin) posts
    -------------------------------------------------------------
    attached = None (default): type = list
        The list that you are appending new data onto.
        Defaults to making a new list.

    ===========================
    ========  Example =========
    ===========================

    the_posts = get_posts(sub='jokes',
                          num_pages=1,
                          avoid_distinguished=True)

    the_posts = get_posts(sub='nosleep',
                          num_pages=1,
                          avoid_distinguished=True,
                          attached=the_posts)

    >>> Returns a list of ~25 posts from reddit.com/r/jokes and
                    ~25 posts from reddit.com/r/nosleep
    """
    # Append to the caller's list if one was given, otherwise start fresh.
    posts = attached if attached is not None else []
    after = None
    for _ in range(num_pages):
        # The first request has no anchor; later ones continue after the last post seen.
        params = {} if after is None else {'after': after}
        # `headers` is the User-agent dict defined in an earlier cell.
        res = requests.get(f'https://reddit.com/r/{sub}/.json', params=params, headers=headers)
        if res.status_code != 200:
            print(f'request for r/{sub} failed with status {res.status_code}')
            return None
        the_json = res.json()
        if avoid_distinguished:
            page = [child for child in the_json['data'].get('children')
                    if not child['data']['stickied'] and not child['data']['archived']
                    and not child['data']['distinguished']]
        else:
            page = the_json['data'].get('children')
        posts.extend(page)
        after = the_json['data']['after']
        if after is None:
            # No further pages; stop instead of requesting the first page again.
            break
        time.sleep(2)
    return posts
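The `avoid_distinguished` filter can be tried without any network access. The sketch below applies the same three checks to two hand-written records: the `name`, `archived`, and `distinguished` values come from the outputs shown earlier, while the `stickied` values are assumed because that part of the output is truncated. `is_regular_post` is a hypothetical helper name.

```python
def is_regular_post(child):
    """True when a post is not stickied, archived, or distinguished."""
    data = child["data"]
    return not (data.get("stickied") or data.get("archived") or data.get("distinguished"))

sample_children = [
    # Moderator post: `distinguished` is set, so it is filtered out.
    {"data": {"name": "t3_8o8k49", "stickied": False, "archived": False, "distinguished": "moderator"}},
    # Ordinary link post: nothing is set, so it is kept.
    {"data": {"name": "t3_8n7f58", "stickied": False, "archived": False, "distinguished": None}},
]

kept = [child["data"]["name"] for child in sample_children if is_regular_post(child)]
print(kept)  # ['t3_8n7f58']
```

Using `.get` instead of direct indexing means a record missing one of the three keys is treated as not flagged rather than raising a `KeyError`; that is a design choice of this sketch, not something the function above does.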

#### Scraping the Data from two Subreddits: `Politics` & `Europe` 

In [20]:
poli = get_posts(sub='politics',
                 num_pages=1000,
                 avoid_distinguished=True)

In [21]:
poli = get_posts(sub='europe',
                 num_pages=1000,
                 avoid_distinguished=True,
                 attached=poli)

## Taking a look at the Data Collected:

#### Converting the data into a `DataFrame`:

In [44]:
def posts_as_DataFrame(posts, features=('subreddit', 'author', 'title', 'selftext',
                                        'created_utc', 'num_comments', 'ups', 'downs', 'score',
                                        'domain', 'id', 'subreddit_id')):
    # A tuple default avoids sharing one mutable list between calls.
    # Build one flat dict per post, keeping only the requested features.
    rows = [{feat: post['data'][feat] for feat in features} for post in posts]
    return pd.DataFrame(rows)

In [45]:
df = posts_as_DataFrame(poli)
df.head()

| | author | created_utc | domain | downs | id | num_comments | score | selftext | subreddit | subreddit_id | title | ups |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ToadProphet | 1527675000.0 | washingtonpost.com | 0 | 8n7f58 | 808 | 15418 | | politics | t5_2cneq | Federal prosecutors poised to get more than 1 ... | 15418 |
| 1 | fuzzyshorts | 1527680000.0 | axios.com | 0 | 8n7tbv | 536 | 6809 | | politics | t5_2cneq | Hurricane Maria killed more people than 9/11 o... | 6809 |
| 2 | today_okay | 1527686000.0 | haaretz.com | 0 | 8n8fvu | 254 | 3295 | | politics | t5_2cneq | It's been 467 days since Trump held his last p... | 3295 |
| 3 | Usawasfun | 1527681000.0 | thehill.com | 0 | 8n7ydg | 246 | 4010 | | politics | t5_2cneq | Fox News’s Napolitano: Trump’s ‘Spygate’ claim... | 4010 |
| 4 | Usawasfun | 1527685000.0 | axios.com | 0 | 8n8crl | 647 | 3139 | | politics | t5_2cneq | Trump: "I wish" I didn't pick Jeff Sessions as... | 3139 |
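The `created_utc` column holds Unix timestamps (seconds since 1970-01-01 UTC), which are hard to read as raw numbers. A minimal sketch of converting one with the standard library, using the unrounded `created_utc` value of the first post from the raw output earlier (`utc_datetime` is a hypothetical helper name):

```python
from datetime import datetime, timezone

def utc_datetime(created_utc):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)

print(utc_datetime(1527675219.0).isoformat())  # 2018-05-30T10:13:39+00:00
```

For the whole column, `pd.to_datetime(df['created_utc'], unit='s', utc=True)` should do the same conversion in one step.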


### Saving the Data into a `.CSV`:

In [47]:
df.to_csv('poli_2.csv')
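By default `DataFrame.to_csv` also writes the row index as an unnamed first column; passing `index=False` leaves it out. The sketch below shows what an index-free file looks like, using the standard library's `csv` module on two rows trimmed from the table above. It writes to an in-memory buffer rather than to `poli_2.csv`, so it is an illustration of the output shape and not a replacement for the cell above.

```python
import csv
import io

# Two rows trimmed to three fields for illustration.
rows = [
    {"author": "ToadProphet", "id": "8n7f58", "num_comments": 808},
    {"author": "fuzzyshorts", "id": "8n7tbv", "num_comments": 536},
]

buffer = io.StringIO()  # in-memory file, so nothing is written to disk
writer = csv.DictWriter(buffer, fieldnames=["author", "id", "num_comments"])
writer.writeheader()    # header row holds the field names only, no index column
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```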