# Predicting Subreddit Subjects: Exploring the Capabilities of Machine Learning Models

## Problem Statement

Despite the abundance of online forums and social media platforms, categorizing user-generated content remains a challenge. Recent advances in machine learning have enabled models that can predict the subject of a subreddit from the subreddit's own data. Even so, the limitations, accuracy, and potential applications of these models in various domains still need to be explored. As a data scientist at T-Mobile, my focus is on trying different machine learning algorithms to find the best model for telling apart two subreddits about our devices (Apple Watch and Samsung Galaxy Watch) based on subreddit content.

## Background

Online forums and social media platforms like Reddit have become valuable sources of information and feedback for businesses and individuals alike. With millions of users sharing their thoughts, opinions, and experiences on a wide range of topics, these platforms provide rich data that can be used to gain insights into customer needs and preferences. However, analyzing this data manually can be time-consuming and inefficient, especially when dealing with large volumes of posts and comments.

In recent years, machine learning algorithms have emerged as powerful tools for analyzing and classifying text data. These algorithms can learn to identify patterns, relationships, and trends in large datasets and make predictions based on those patterns. One area where these algorithms could show great promise is in predicting the subreddit to which a post belongs based on its content.

In this project, we present an approach for collecting and preprocessing the data from Apple Watch and Samsung Galaxy Watch subreddits and using natural language processing techniques to extract features from the text data. We will then train and test machine learning algorithms to determine the most accurate algorithm for predicting the subreddit based on the content of the post.
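As a rough, self-contained illustration of the idea (turn post text into word counts, then let a classifier pick the more likely subreddit), the sketch below implements a tiny multinomial Naive Bayes classifier with the standard library only. The training titles are made up for illustration and are not taken from the collected data, and Naive Bayes is only one of the algorithms that could be compared; the helper names `tokenize`, `train_naive_bayes`, and `predict` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Made-up example titles, for illustration only (not from the collected data)
TRAIN = [
    ("new band for my series 8", "applewatch"),
    ("watchos update drains battery", "applewatch"),
    ("wear os update on my galaxy watch", "galaxywatch"),
    ("samsung health sleep tracking issue", "galaxywatch"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train_naive_bayes(examples):
    """Count word frequencies per label and the number of documents per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in examples:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def predict(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        # Start from the log prior, then add one log-likelihood term per token
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train_naive_bayes(TRAIN)
print(predict("battery life after watchos update", model))
print(predict("galaxy watch sleep tracking", model))
```

With real data, the same two steps (feature extraction, then classification) would more likely be handled by an established library, with held-out posts used to measure accuracy.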

By leveraging the insights from our analysis, we can gain a better understanding of the language people use when discussing the Apple Watch and Samsung Galaxy Watch. This information can inform our product development, marketing, and customer service strategies, helping us focus on products that better meet the needs and preferences of our customers.

## Contents:

- [Problem Statement](#Problem-Statement)
- [Background](#Background)
- [Pushshift Reddit API](#Pushshift-Reddit-API)
- [Imports](#Imports)
- [Functions](#Functions)
- [Fetch Data](#Fetch-Data)
- [Save Datasets](#Save-Datasets)

# Pushshift Reddit API

The [pushshift.io](https://github.com/pushshift/api) Reddit API was designed and created by the /r/datasets mod team to provide enhanced search capabilities for Reddit comments and submissions.
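As a rough illustration of what a query against this API looks like, the sketch below builds the request URL by hand using only the standard library. The endpoint and the `subreddit`, `size`, and `before` parameters are the ones used later in this notebook; the helper name `build_query_url` and the example timestamp are illustrative assumptions, and no request is actually sent.

```python
from urllib.parse import urlencode

# Submission search endpoint used later in this notebook
BASE_URL = "https://api.pushshift.io/reddit/search/submission"


def build_query_url(subreddit, size, before=None):
    """Return the full query URL (hypothetical helper, for illustration only)."""
    params = {"subreddit": subreddit, "size": size}
    # Only include 'before' when a cutoff timestamp (epoch seconds) is given
    if before is not None:
        params["before"] = before
    return f"{BASE_URL}?{urlencode(params)}"


# Example: up to 500 posts from r/applewatch created before an arbitrary timestamp
print(build_query_url("applewatch", 500, before=1680000000))
```

`requests.get` performs the same kind of encoding when given a dictionary of parameters, and it leaves out keys whose value is `None`, so the functions in this notebook can pass the dictionary directly.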

## Imports

In [None]:
import pandas as pd
import numpy as np
import requests

## Functions

In [None]:
# Define a function that fetches posts from a subreddit and returns a dataframe with the time of the last post
def fetch_data(url, subreddit, size, before=None):

    # Define the parameters that will be sent to the API
    params = {
        
        # Subreddit name
        'subreddit': subreddit,
        
        # Number of posts (the Pushshift API can't return more than 500 posts per request)
        'size': size,
        
        # Only return posts created before this time (None means no cutoff)
        'before': before
    }
    
    # Send a GET request to the specified URL with the defined parameters
    res = requests.get(url, params=params)
    
    # Raise an error for unsuccessful responses (4xx or 5xx status codes)
    res.raise_for_status()
    
    # Extract the JSON data from the response
    data = res.json()
    
    # Get the list of posts out of the data
    posts = data['data']
    
    # If no posts came back, return an empty dataframe and leave the cutoff unchanged
    if not posts:
        return pd.DataFrame(), before
    
    # Make a dataframe from the posts
    df = pd.DataFrame(posts)
    
    # Return the dataframe with the time of the last post in that dataframe
    return df, posts[-1]['created_utc']

# Define a function that calls fetch_data repeatedly, passing each call's last post time as the next 'before'
def fetch_multiple_data(url, subreddit, size, num_iterations):
    dfs = []
    before = None
    
    # Repeat the request up to num_iterations times
    for _ in range(num_iterations):
        df, before = fetch_data(url, subreddit, size, before)
        
        # Stop early if a request returns no posts
        if df.empty:
            break
        dfs.append(df)
    
    # Nothing was fetched, so there is nothing to concatenate
    if not dfs:
        return pd.DataFrame()
    
    # Concatenate the dataframes and renumber the rows from 0
    combined_df = pd.concat(dfs, axis=0, ignore_index=True)
    return combined_df
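To make the pagination logic easier to follow, here is a small offline sketch of the same `before`-cursor idea. It replaces the real API with a hypothetical in-memory function, `fake_fetch`, that serves made-up posts newest first; the post ids and timestamps are invented purely for illustration, and `fetch_all` is an illustrative name rather than part of this notebook.

```python
# Ten made-up posts, newest first, standing in for API results
FAKE_POSTS = [{"id": i, "created_utc": 1000 - i} for i in range(10)]


def fake_fetch(size, before=None):
    """Return up to `size` posts older than `before` (hypothetical stand-in for the API)."""
    older = [p for p in FAKE_POSTS if before is None or p["created_utc"] < before]
    return older[:size]


def fetch_all(size, num_iterations):
    """Page backwards in time, using the last timestamp seen as the next cursor."""
    collected = []
    before = None
    for _ in range(num_iterations):
        page = fake_fetch(size, before)
        if not page:
            break  # nothing older is left
        collected.extend(page)
        before = page[-1]["created_utc"]  # last (oldest) post in this page
    return collected


posts = fetch_all(size=4, num_iterations=5)
print([p["id"] for p in posts])
```

In this sketch the cursor is exclusive, so two posts sharing the exact timestamp at a page boundary could be missed. Whether that matters for the real API is not something this notebook checks.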

## Fetch Data

In [None]:
# Get 3,500 posts from the applewatch subreddit
url = "https://api.pushshift.io/reddit/search/submission"
subreddit = "applewatch"
size = 500
num_iterations = 7

apple = fetch_multiple_data(url, subreddit, size, num_iterations)

In [None]:
# Get 3,500 posts from the galaxywatch subreddit
url = "https://api.pushshift.io/reddit/search/submission"
subreddit = "galaxywatch"
size = 500
num_iterations = 7

samsung = fetch_multiple_data(url, subreddit, size, num_iterations)

## Save Datasets

In [None]:
apple.to_csv("../data/apple.csv")

In [None]:
samsung.to_csv("../data/samsung.csv")