# Project 3: Differentiate Reddit Bioinformatics and Data Science Subreddits 

In this project, I will build a model that tells posts from the Reddit Bioinformatics subreddit apart from posts in the Data Science subreddit. My baseline accuracy is 67%. Because the classes are imbalanced, I will use F1 as my primary metric and accuracy as a secondary metric when choosing the best model.

---

## Data Collection

In this notebook, I will collect posts from the Reddit Bioinformatics and Data Science subreddits using the Pushshift API.

In [9]:
# Imports
import requests
import pandas as pd

In [10]:
# Pushshift API endpoint for Reddit submissions
url = 'https://api.pushshift.io/reddit/search/submission'

### Functions

In [11]:
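def get_subreddit_data(subreddit, limit, until=None):
    '''
    Request one page of submissions from Pushshift.

    Input:
    subreddit - subreddit name
    limit - maximum number of posts to request
    until - upper bound on created_utc (default None)

    Output: DataFrame built from the 'data' field of the response
    '''
    params = {
        'subreddit': subreddit,
        'limit': limit,
        # Comma-separated list of fields to return
        'filter': 'subreddit,selftext,title,created_utc',
        # requests leaves out parameters whose value is None
        'until': until
    }

    res = requests.get(url, params=params)

    print(f'Status code {res.status_code}')

    return pd.DataFrame(res.json()['data'])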
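
To make the request parameters above more concrete, here is a minimal sketch that builds the query URL with the standard library instead of sending it. The helper name `build_query_url` and the example timestamp are hypothetical and only for illustration; the endpoint and field names are the ones used in the notebook.

```python
from urllib.parse import urlencode

# Same endpoint as in the notebook
BASE_URL = 'https://api.pushshift.io/reddit/search/submission'


def build_query_url(subreddit, limit, until=None):
    """Return the URL that a GET request with these parameters would target."""
    params = {
        'subreddit': subreddit,
        'limit': limit,
        'filter': 'subreddit,selftext,title,created_utc',
        'until': until,
    }
    # Drop parameters that are None, so `until` is omitted on the first request
    params = {key: value for key, value in params.items() if value is not None}
    return f'{BASE_URL}?{urlencode(params)}'


first_page_url = build_query_url('bioinformatics', 100)
# Made-up timestamp, used only to show where `until` ends up in the URL
next_page_url = build_query_url('bioinformatics', 100, until=1700000000)
print(first_page_url)
print(next_page_url)
```

The first URL has no `until` parameter, while the second one ends with it. This is the only difference between the first request and the follow-up requests.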

In [12]:
def get_data_n_times(n, subreddit, limit):
    '''
    Input:
    n - number of requests to make
    subreddit - subreddit name
    limit - limit per request

    Output: DataFrame with all collected data
    '''
    data = get_subreddit_data(subreddit, limit, until=None)
    # Take created_utc of the last collected row as a scalar, not a one-row Series
    last_created_utc = int(data['created_utc'].iloc[-1])
    for _ in range(n - 1):
        page = get_subreddit_data(subreddit, limit, until=last_created_utc)
        data = pd.concat([data, page], ignore_index=True)
        last_created_utc = int(data['created_utc'].iloc[-1])
    return data
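
The loop in `get_data_n_times` pages through a subreddit by passing the timestamp of the last collected post as `until`. Below is a minimal offline sketch of the same idea. It swaps the real API call for an in-memory stand-in so it can run without a network connection. The names `collect_pages` and `make_fake_fetcher` are hypothetical, and the sketch assumes that posts come back newest first and that `until` excludes the given timestamp; neither assumption is confirmed by the notebook.

```python
def collect_pages(fetch_page, n_pages):
    """Collect up to n_pages pages, moving the `until` cursor after each one.

    fetch_page(until) is a hypothetical callable that returns a list of posts
    (dicts with a 'created_utc' key), newest first.
    """
    posts = []
    until = None
    for _ in range(n_pages):
        page = fetch_page(until)
        if not page:
            # Nothing older is left, so stop early
            break
        posts.extend(page)
        # Under the newest-first assumption, the last post on the page is the oldest
        until = page[-1]['created_utc']
    return posts


def make_fake_fetcher(timestamps, page_size):
    """Build an offline stand-in for the API from made-up timestamps."""
    ordered = sorted(timestamps, reverse=True)

    def fetch_page(until):
        # Assumption: `until` is exclusive, so only strictly older posts are returned
        older = [t for t in ordered if until is None or t < until]
        return [{'created_utc': t} for t in older[:page_size]]

    return fetch_page


# 250 made-up posts, requested in pages of 100
fake_fetch = make_fake_fetcher(range(1, 251), page_size=100)
collected = collect_pages(fake_fetch, n_pages=7)
print(len(collected))
```

Unlike the notebook function, this sketch stops as soon as a page comes back empty. That avoids indexing into an empty result when a subreddit has fewer posts than requested.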

---

## Collecting the Data from Reddit Using Pushshift

In [13]:
# Collecting data from the Bioinformatics subreddit
bioinformatics = get_data_n_times(7, 'bioinformatics', 100)

Status code 200
Status code 200
Status code 200
Status code 200
Status code 200
Status code 200
Status code 200


In [14]:
# Checking the amount of data available 
bioinformatics.shape

(683, 4)

In [32]:
# Collecting data from the Data Science subreddit
datascience = get_data_n_times(14, 'datascience', 100)

Status code 200
Status code 200
Status code 200
Status code 200
Status code 200
Status code 200
Status code 200
Status code 200
Status code 200
Status code 200
Status code 200
Status code 200
Status code 200
Status code 200


In [33]:
# Checking the amount of data available 
datascience.shape

(1400, 4)

In [34]:
# Combine data from the two subreddits
reddit = pd.concat([bioinformatics, datascience], ignore_index=True)

In [35]:
reddit.shape

(2083, 4)
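
The row counts above can be used to check the class balance. The sketch below recomputes the share of each subreddit from the reported shapes. It assumes that the 67% baseline from the introduction is the accuracy of always predicting the larger class; the notebook does not state how the baseline was calculated.

```python
# Row counts reported by the .shape outputs above
n_bioinformatics = 683
n_datascience = 1400
n_total = n_bioinformatics + n_datascience

# Share of each class in the combined data
share_bioinformatics = n_bioinformatics / n_total
share_datascience = n_datascience / n_total

# Assumption: the baseline is the accuracy of always predicting the larger class
baseline_accuracy = max(share_bioinformatics, share_datascience)

print(n_total)
print(f'{share_bioinformatics:.1%} bioinformatics, {share_datascience:.1%} datascience')
print(f'Baseline accuracy: {baseline_accuracy:.1%}')
```

Under that assumption, the result agrees with the 67% baseline, and the roughly one-to-two class ratio is the reason to prefer F1 over accuracy alone.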

In [36]:
# Save collected data to file
reddit.to_csv('../data/reddit.csv', index=False)

---

I gathered enough data for both subreddits: 683 Bioinformatics posts and 1,400 Data Science posts (2,083 rows in total). Next, I will clean, explore, and prepare the data for modeling.