# Project 3: Web APIs & Classification

### Problem statement

We are a team of political advisors, and our job is to provide insights to political and business groups on the current political landscape, and how they should steer their policies and strategies.

Reddit posts and comments from the subreddits of two dominant political groups, Democrats and Conservatives, were downloaded and studied. The goal of this project is to build a classifier that predicts which subreddit a given post came from. In addition, by studying the texts that are popular among Republican and Democrat supporters, we expect to extract key words that point to the major issues facing each group.





### Data Collection
Reddit is the only data source. About 1000 posts, together with their selftexts and comments, were downloaded from each subreddit by web scraping.

Two subreddits are used:
- r/Conservative
- r/democrats

### Data Preprocessing
The data is then preprocessed in the following step to remove features that are undesirable from a natural language processing perspective:
- replace every character other than letters, digits, spaces, newlines and full stops with a space, so that the text is saved in csv properly
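
The character filter can be sketched as a small helper. The name `clean_text` is hypothetical (the notebook applies `re.sub` inline), and the pattern mirrors the one used in the cells below: anything other than letters, digits, spaces, newlines and full stops becomes a space.

```python
import re

# Same character class as the notebook's inline re.sub calls, written as a raw string.
NON_ALLOWED = re.compile(r"[^a-zA-Z0-9 \n.]")

def clean_text(text):
    """Replace every character outside the allowed set with a single space."""
    return NON_ALLOWED.sub(" ", text)

sample = 'He said, "vote!" (twice).'
cleaned = clean_text(sample)
print(repr(cleaned))
```

Because each removed character is replaced one-for-one, the cleaned text keeps its original length, and commas and quotes (the characters most likely to interfere with csv parsing) no longer appear.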

## Step 1: Downloading data - Democrats

### Importing libraries

In [1]:
import pandas as pd
import numpy as np
import requests
import time
import re

### Crawling the Democrats' posts

#### First level of crawling - list of posts

In [2]:
# url - Reddit API via .json
url = "https://www.reddit.com/r/democrats.json"

In [3]:
# Function to make HTTP requests to Reddit and receive the listing in json format
# Each page has ~25 posts, so the loop runs 40 times to get ~1000 posts
# Each page contains information such as a post's name (similar to an id), title, url of the individual post etc
# Request status 200 means success
# Add a latency of 0.2 seconds after each request

def extract_posts(url):
    
    headers = {"User-agent": "Bot DSI"}
    posts = []
    after = None

    # loop for 40 times
    for i in range(40):
        print(i)
        if after is None:
            params = {}
        else:
            # Field "after" refers to the last name (id) of the post on the current page
            params = {"after": after}
        res = requests.get(url, params=params, headers=headers)
        if res.status_code == 200:
            json_text = res.json()
            posts.extend(json_text["data"]["children"])
            after = json_text["data"]["after"]
        else:
            print(res.status_code)
            break
        time.sleep(0.2)
    return posts
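
One property of the loop above is worth illustrating: if a page comes back with an empty `after` value, the next request is sent without parameters and would fetch the first page again. The sketch below shows the same pagination pattern with an early stop and a duplicate check. It assumes the listing signals its last page with an empty `after`; `paginate`, `fetch_page` and the fake pages are hypothetical stand-ins for the HTTP request, used only so the loop can run offline.

```python
def paginate(fetch_page, max_pages=40):
    """Collect posts page by page, following the "after" token.

    fetch_page is a hypothetical callable that takes the current token
    (or None) and returns (children, next_after).
    """
    posts = []
    seen = set()
    after = None
    for _ in range(max_pages):
        children, after = fetch_page(after)
        for child in children:
            name = child["data"]["name"]
            if name not in seen:   # skip posts that were already collected
                seen.add(name)
                posts.append(child)
        if after is None:          # no further page: stop instead of restarting
            break
    return posts


# Fake three-page listing, used only to exercise the loop.
FAKE_PAGES = {
    None: ([{"data": {"name": "t3_a"}}, {"data": {"name": "t3_b"}}], "t3_b"),
    "t3_b": ([{"data": {"name": "t3_c"}}], "t3_c"),
    "t3_c": ([{"data": {"name": "t3_d"}}], None),
}

calls = []

def fake_fetch(after):
    calls.append(after)
    return FAKE_PAGES[after]

collected = paginate(fake_fetch)
names = [post["data"]["name"] for post in collected]
print(names, len(calls))
```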

In [4]:
# Calling the function to extract posts (first level of crawling)
posts = extract_posts(url)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39


#### Processing data from the first level of crawling

In [5]:
# Function to process the key information crawled from the first level of crawling
# To save the text in a csv file properly, every character other than letters, digits, spaces, newlines and full stops is replaced by a space
def trim_posts(posts):

    url_root = "https://www.reddit.com"

    posts_trimmed = []
    for post in posts:
        new_post = {}
        new_post["name"] = post['data']["name"]
        # Raw string avoids the invalid-escape warning that '\.' triggers in a normal string
        new_post["title"] = re.sub(r'[^a-zA-Z0-9 \n.]', ' ', post['data']["title"])
        new_post["url_comments"] = url_root + post['data']['permalink']
        posts_trimmed.append(new_post)

    return posts_trimmed

In [6]:
posts_trimmed = trim_posts(posts)

In [7]:
posts_trimmed[0]

{'name': 't3_kwqigy',
 'title': 'House reaches enough votes to impeach President Trump for the second time',
 'url_comments': 'https://www.reddit.com/r/democrats/comments/kwqigy/house_reaches_enough_votes_to_impeach_president/'}
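
The `name` field is what Reddit calls a fullname: a type prefix (`t3` for a post), an underscore, and the post's id. A minimal sketch of that split, using the sample record above; `split_fullname` is a hypothetical helper, not part of the notebook.

```python
def split_fullname(fullname):
    """Split a Reddit fullname such as "t3_kwqigy" into (type_prefix, base_id)."""
    prefix, _, base_id = fullname.partition("_")
    return prefix, base_id

prefix, post_id = split_fullname("t3_kwqigy")
permalink = "https://www.reddit.com/r/democrats/comments/kwqigy/house_reaches_enough_votes_to_impeach_president/"

# The id part of the name is the same id that appears in the comments URL.
print(prefix, post_id, post_id in permalink.split("/"))
```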

#### Saving results of first level of crawling in csv

In [8]:
df_posts = pd.DataFrame(posts_trimmed)

In [9]:
df_posts.head()

| | name | title | url_comments |
|---|---|---|---|
| 0 | t3_kwqigy | House reaches enough votes to impeach Presiden... | https://www.reddit.com/r/democrats/comments/kw... |
| 1 | t3_kwxf3v | Aged like milk x2 | https://www.reddit.com/r/democrats/comments/kw... |
| 2 | t3_kx07lk | Classic | https://www.reddit.com/r/democrats/comments/kx... |
| 3 | t3_kwjpk7 | Republicans Actual Argument Against Impeachmen... | https://www.reddit.com/r/democrats/comments/kw... |
| 4 | t3_kwz1uh | about sums it up | https://www.reddit.com/r/democrats/comments/kw... |


In [None]:
df_posts.to_csv("../data/democrats.csv", index=False)

#### Reading csv files with permalinks for second level of crawling - extract comments under the post

In [10]:
df_posts = pd.read_csv("../data/democrats.csv")

In [11]:
df_posts.head()

| | name | title | url_comments |
|---|---|---|---|
| 0 | t3_kv4lr7 | House Democrats launch second impeachment of T... | https://www.reddit.com/r/democrats/comments/kv... |
| 1 | t3_kvg3xu | Do I have to | https://www.reddit.com/r/democrats/comments/kv... |
| 2 | t3_kv4oca | Camp Auschwitz guy identified | https://www.reddit.com/r/democrats/comments/kv... |
| 3 | t3_kvekkt | No Crawling Back | https://www.reddit.com/r/democrats/comments/kv... |
| 4 | t3_kvfqwa | Use the 14th Amendment to ban Trump | https://www.reddit.com/r/democrats/comments/kv... |


In [13]:
# add new columns to store selftext and comments
df_posts["comments"] = ""
df_posts["selftext"] = ""

In [14]:
df_posts.shape

(997, 5)

In [15]:
df_posts.head()

| | name | title | url_comments | comments | selftext |
|---|---|---|---|---|---|
| 0 | t3_kv4lr7 | House Democrats launch second impeachment of T... | https://www.reddit.com/r/democrats/comments/kv... | | |
| 1 | t3_kvg3xu | Do I have to | https://www.reddit.com/r/democrats/comments/kv... | | |
| 2 | t3_kv4oca | Camp Auschwitz guy identified | https://www.reddit.com/r/democrats/comments/kv... | | |
| 3 | t3_kvekkt | No Crawling Back | https://www.reddit.com/r/democrats/comments/kv... | | |
| 4 | t3_kvfqwa | Use the 14th Amendment to ban Trump | https://www.reddit.com/r/democrats/comments/kv... | | |


In [16]:
# Function to extract selftext and comments based on permalinks to the "comments" page

def extract_selftext_comments(url):
    headers = {"User-agent": "Bot DSI"}
    res = requests.get(url, headers=headers)

    # Initialised before the status check so that a failed request still returns empty strings
    comments_array = []
    selftext_parts = []

    if res.status_code == 200:
        json_text = res.json()

        # Gather the top-level comments; entries without a "body" are skipped
        for child in json_text[1]["data"]["children"]:
            try:
                comments_array.append(child["data"]["body"])
            except KeyError:
                pass

        # selftext does not necessarily exist - skipped if not available
        try:
            selftext_parts.append(json_text[0]["data"]["children"][0]["data"]["selftext"])
        except (KeyError, IndexError):
            pass

        # For crossposts, also take the selftext of the first parent post
        try:
            selftext_parts.append(json_text[0]["data"]["children"][0]["data"]["crosspost_parent_list"][0]["selftext"])
        except (KeyError, IndexError):
            pass

    # Strings have no append(), so the parts are collected in a list and joined here
    output = {"selftext": " ".join(part for part in selftext_parts if part),
              "comments": " ".join(comments_array)}

    return output
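
The nested structure that the function walks through can be tried offline on a small hand-made payload. The sketch below is illustrative only: `parse_comment_page` is a hypothetical helper, and `sample_page` is invented data that mimics just the fields the notebook reads (element 0 holds the post, element 1 holds the comment listing).

```python
def parse_comment_page(json_text):
    """Pull the selftext and top-level comment bodies out of a comments-page payload."""
    post = json_text[0]["data"]["children"][0]["data"]

    selftext_parts = [post.get("selftext") or ""]
    parents = post.get("crosspost_parent_list") or []
    if parents:  # crossposts keep their text in the first parent post
        selftext_parts.append(parents[0].get("selftext") or "")

    comments = [
        child["data"]["body"]
        for child in json_text[1]["data"]["children"]
        if "body" in child["data"]  # entries without a body are skipped
    ]
    return {
        "selftext": " ".join(part for part in selftext_parts if part),
        "comments": " ".join(comments),
    }


# Invented payload with two comments and one entry that has no "body".
sample_page = [
    {"data": {"children": [{"kind": "t3", "data": {"selftext": "Original post text"}}]}},
    {"data": {"children": [
        {"kind": "t1", "data": {"body": "First comment"}},
        {"kind": "more", "data": {"count": 12}},
        {"kind": "t1", "data": {"body": "Second comment"}},
    ]}},
]

parsed = parse_comment_page(sample_page)
print(parsed)
```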


#### Processing the data from second level of crawling

#### Removing non-alphanumeric characters in order to save text in csv files properly

In [None]:
for i in df_posts.index:
    print(i)
    url = df_posts.loc[i, "url_comments"][0:-1] + ".json"
    print(url)
    selftext_comments = extract_selftext_comments(url)
    selftext = selftext_comments["selftext"]
    comments = selftext_comments["comments"]
    # Raw strings avoid the invalid-escape warning that '\.' triggers in a normal string
    selftext = re.sub(r'[^a-zA-Z0-9 \n.]', ' ', selftext)
    comments = re.sub(r'[^a-zA-Z0-9 \n.]', ' ', comments)
    df_posts.loc[i, "selftext"] = selftext
    df_posts.loc[i, "comments"] = comments
    time.sleep(0.1)
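
The loop builds each request URL by dropping the last character of the permalink and appending `.json`, which relies on every permalink ending in a slash. A slightly more defensive variant is sketched below; `comments_json_url` is a hypothetical helper name.

```python
def comments_json_url(permalink_url):
    """Turn a post's comments URL into its .json endpoint, with or without a trailing slash."""
    return permalink_url.rstrip("/") + ".json"

with_slash = comments_json_url("https://www.reddit.com/r/democrats/comments/kwqigy/house_reaches_enough_votes_to_impeach_president/")
without_slash = comments_json_url("https://www.reddit.com/r/democrats/comments/kwqigy/house_reaches_enough_votes_to_impeach_president")
print(with_slash == without_slash)
```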

In [None]:
df_posts.head()

#### Saving the information from second level of crawling - ~1000 posts and related comments

In [None]:
df_posts.to_csv("../data/democrats_comments.csv", index=False)
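
One detail is worth checking when this file is read back: by default `pd.read_csv` parses empty fields as missing values, so a post whose `selftext` or `comments` is an empty string comes back as `NaN`. The sketch below shows this with an in-memory buffer and made-up rows (not the project data); `keep_default_na=False` is one way to keep the empty strings.

```python
import io

import pandas as pd

# Made-up rows: the second post has no selftext.
df = pd.DataFrame({"name": ["t3_a", "t3_b"], "selftext": ["some text", ""]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)

buffer.seek(0)
default_read = pd.read_csv(buffer)                      # empty field becomes NaN

buffer.seek(0)
kept_read = pd.read_csv(buffer, keep_default_na=False)  # empty field stays ""

print(default_read["selftext"].isna().tolist())
print(kept_read["selftext"].tolist())
```

Calling `fillna("")` on the text columns after a default read is an alternative with the same effect for these columns.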