# Problem Statement

As of 2016, [roughly one third](https://www.pewresearch.org/internet/2016/10/25/political-content-on-social-media/) of Americans comment, discuss, or post about politics on social media. As a result, political campaigns have begun to invest heavily in political advertising on social media. To craft targeted ads, political advertising agencies are devoting significant resources to [identifying the populations of social media users discussing politics](https://www.americanbar.org/groups/crsj/publications/human_rights_magazine_home/voting-in-2020/political-advertising-on-social-media-platforms/). The goal of this project is to build a text classification model that can differentiate between casual and political conversations on social media, helping advertising agencies target particular users with specific ads.

# Background

My models are trained on text data from Reddit, with the submissions coming from two subreddits:
1. r/CasualConversation
2. r/PoliticalDiscussion

The CasualConversation subreddit is a forum dedicated to having fun conversations "about anything that is on your mind". With 1.44 million members, it covers a huge variety of conversation topics, from advice to dinner plans to favorite memories. As a result, it serves as a great baseline signal for what average conversation looks like on social media.

The PoliticalDiscussion subreddit is a forum focused solely on posing questions about current politics, with US politics as its core topic. The subreddit is home to 1.91 million redditors who have vigorous debates about political strategy and share opinions on recent political news.

# Methodology


* Collect data from 10,000 posts from each of r/CasualConversation and r/PoliticalDiscussion using the Pushshift API.
    * I collect an equal number of posts from each subreddit so that the baseline accuracy is equivalent to the flip of a coin: 50/50.
* Clean the text data, engineer new features with NLP, and lemmatize the text so that each word is reduced to its base form while keeping its meaning.
* Split the data into training and testing datasets to validate model performance. Each model is fit on the training data, and its accuracy is then measured on the testing data.
* Vectorize the text data using TF-IDF so that very common words are down-weighted and each word is weighted according to its importance to the post it appears in.
* Fit a logistic regression model and a random forest classifier to the data. Compare their classification metrics to determine which model performed better.
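
As a rough illustration of the TF-IDF step above, the sketch below computes TF-IDF weights by hand on a three-document toy corpus. The corpus and the helper name `tf_idf` are made up for illustration. The formula is the plain textbook variant (term frequency times log(N / document frequency)); library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the scraped data)
docs = [
    "the senate passed the bill",
    "the weather is nice today",
    "the senate debated the budget",
]

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
```

A word that appears in every document, such as "the", receives a weight of zero, while a word confined to a single document, such as "weather", receives a positive weight.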

# Data

### Reddit NLP - Data Collection

[Pushshift](https://github.com/pushshift/api) is a service created by the /r/datasets mod team to provide enhanced search capabilities for Reddit data. The Pushshift RESTful API supports higher-level searching and querying of comments and submissions, aiding data collection for analysis and modeling. I query the API with the `requests` library; it returns a JSON response that can then be parsed for the data of interest.
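
To make the shape of that response concrete, here is a minimal sketch that parses a hand-written JSON payload laid out like a submission response. The payload values are invented for illustration; the top-level `data` key and the `selftext` and `created_utc` fields are the ones the scraper later in this notebook relies on, and real responses carry many more fields per post.

```python
import json

# Hand-written payload (hypothetical values) mimicking the response layout
raw = '''
{"data": [
    {"selftext": "What is everyone cooking tonight?", "created_utc": 1600000200},
    {"selftext": "[removed]", "created_utc": 1600000100}
]}
'''

payload = json.loads(raw)   # analogous to calling .json() on a response
posts = payload["data"]     # list of dictionaries, one per submission

# Pull out the fields of interest from each post
texts = [post["selftext"] for post in posts]
timestamps = [post["created_utc"] for post in posts]
print(texts)
print(timestamps)
```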

In [41]:
import pandas as pd
import numpy as np
import time
import requests

### Query Syntax

Here I set the query URL to the Pushshift API endpoint for subreddit submissions. In the context of my subreddits, the submissions themselves are more verbose than the comments and provide a greater indication of the topic of conversation.

In [43]:
url = 'https://api.pushshift.io/reddit/search/submission'

Here I am setting the query parameters, specifying the subreddits I want to collect submissions from and the number of submissions I want to collect per query.

In [44]:
casual_params = {
    'subreddit': 'casualconversation',
    'size': 100
}

political_params = {
    'subreddit': 'politicaldiscussion',
    'size': 100
}
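
For reference, the sketch below shows how a parameter dictionary like the ones above becomes the query string of the request URL. It uses the standard library's `urllib.parse.urlencode` purely to illustrate the encoding; `requests` performs an equivalent step internally when it is handed the dictionary.

```python
from urllib.parse import urlencode

url = 'https://api.pushshift.io/reddit/search/submission'
casual_params = {
    'subreddit': 'casualconversation',
    'size': 100
}

# Key/value pairs are joined as key=value and separated by '&'
full_url = url + '?' + urlencode(casual_params)
print(full_url)
```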

In [None]:
response_1 = requests.get(url, casual_params)
print(response_1.status_code)

response_2 = requests.get(url, political_params)
print(response_2.status_code)

The status code of 200 tells us that the query was accepted for both subreddits.

### Scraping the Subreddits

To collect as much text per post as possible from each subreddit, I wrote a function that scrapes each subreddit and keeps only posts that have not been removed or deleted and whose text is not empty. This ensures that I have the most text-dense posts possible for training my models.
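
The filtering rule described above can be sketched on plain dictionaries, independent of the scraper that follows. The helper name `is_usable` and the sample posts are hypothetical; the placeholder strings are the ones the scraper checks for.

```python
# Placeholder texts that indicate a post with no usable body
UNUSABLE = {'[removed]', '[deleted]', '', '.'}

def is_usable(post):
    """Return True if the post has real body text worth keeping."""
    text = post.get('selftext')
    return text is not None and text not in UNUSABLE

# Hypothetical sample posts
sample_posts = [
    {'selftext': 'Anyone else love rainy mornings?'},
    {'selftext': '[removed]'},
    {'selftext': ''},
    {},  # missing text entirely
]

kept = [post for post in sample_posts if is_usable(post)]
print(len(kept))
```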

In [1]:
def scraper(url, api_params):
    
    '''Requests subreddit submission data from the Pushshift API. 
    ---
    Returns:
    type: DataFrame
        Composed of post submission data. Only includes submissions 
        that were not removed, deleted, or had empty text posts. 
    
    ---
    Parameters:
    url
        type: String. 
        Base Pushshift API url
    
    api_params:
        type: Dictionary. 
        Subreddit-specific query parameters. 
    '''
    # Running count of usable posts collected so far
    i = 0
    
    # Copy the parameters so the caller's dictionary is not mutated
    # when the 'before' key is added inside the loop
    new_params = dict(api_params)
    df_list = []
    
    # Placeholder texts that mark a removed, deleted, or empty post
    unusable = ['[removed]', '[deleted]', '', '.']
    
    # Iterating over subreddit submission data until 10,000 posts collected
    while i < 10_000:
        
        # Request with subreddit-specific parameters
        res = requests.get(url, params=new_params)
        
        # Fail loudly on an HTTP error instead of parsing a bad response
        res.raise_for_status()
        
        # Gathering submission-specific data from the JSON response
        posts = res.json()['data']
        
        # Stop early if there are no older posts left to collect
        if not posts:
            break
        
        # Setting query parameter to gather older posts in next query.
        # Taken from the raw batch so a fully filtered batch cannot break it.
        new_params['before'] = posts[-1]['created_utc']
        
        # Creating dataframe of all post data from query
        df = pd.DataFrame(posts)
        
        # Dropping posts with null text values
        df = df.dropna(subset=['selftext'])
        
        # Removing posts that were removed, deleted, or empty
        df = df[~df['selftext'].isin(unusable)]
        
        # Appending remaining query data to list of DataFrames
        df_list.append(df)
        
        # Progressing counter forward by the number of posts kept
        i += len(df)
        
        # Limiting number of requests per second
        time.sleep(2)
        
    # Concatenating dataframes of all the query data
    return pd.concat(df_list, ignore_index=True)
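
The `before` parameter is what lets the loop walk backward through a subreddit's history: each request asks only for posts older than the oldest one already seen. The sketch below imitates that pattern against an in-memory list instead of the live API. The functions `fake_fetch` and `collect` and their data are stand-ins invented for illustration, and the sketch assumes results come back newest first, as the scraper above does.

```python
# Stand-in for the API: 10 posts with descending timestamps (hypothetical data)
ALL_POSTS = [{'id': n, 'created_utc': 1000 - n} for n in range(10)]

def fake_fetch(params):
    """Return up to params['size'] posts older than params.get('before'), newest first."""
    before = params.get('before')
    older = [p for p in ALL_POSTS if before is None or p['created_utc'] < before]
    return older[:params['size']]

def collect(params, target):
    """Page backward through posts until `target` posts are collected or none remain."""
    params = dict(params)          # copy so the caller's dictionary is untouched
    collected = []
    while len(collected) < target:
        batch = fake_fetch(params)
        if not batch:              # nothing older left to fetch
            break
        collected.extend(batch)
        # Next request starts just before the oldest post in this batch
        params['before'] = batch[-1]['created_utc']
    return collected

original = {'size': 4}
posts = collect(original, target=9)
print(len(posts))
print(original)
```

Because whole batches are appended, the result can overshoot the target, and copying the parameters keeps the caller's dictionary free of the `before` key.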


In [2]:
# Scraping submissions in CasualConversation subreddit and storing in one df
casual_df = scraper(url, casual_params)

In [3]:
# Scraping submissions in PoliticalDiscussion subreddit and storing in one df
political_df = scraper(url, political_params)

Converting PoliticalDiscussion dataframe into csv \
`political_df.to_csv('./Data/politics.csv', index=False)`

Converting CasualConversation dataframe into csv\
`casual_df.to_csv('./Data/casual_convo.csv', index=False)`

# Please continue to Notebook 2