<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Part 1 — Subreddit Posts Extraction

## Background

[*Twitch*](https://www.twitch.tv/p/en/about/) is an American interactive livestreaming service for content spanning gaming, entertainment, sports, music, and more.

Like many other online social platforms, we are always looking for new ways to engage our users and expand our product and service offerings, increasing customer stickiness through a better user experience and improving both our top and bottom lines.

_**Why Gaming?**_<br>
- **Video Gaming Industry**: ~$178.73 Billion in 2021 (increase of 14.4% from 2020) ([*source*](https://www.wepc.com/news/video-game-statistics/))

- **Global eSports market**: ~$1.08 Billion in 2021 (increase of ~50% from 2020) ([*source*](https://www.statista.com/statistics/490522/global-esports-market-revenue/))

- **eSports industry's global market revenue**: Forecasted to grow to as much as $1.62 Billion in 2024. 

- China alone accounts for almost 1/5 of this market. 

In recent months, we started a pilot program that gives a subset of our most active users access to a new beta forum, which has sparked many discussions among our gaming users.

This has resulted in heightened traffic, with frequent posts and comments every day. Our business development and marketing counterparts also realised that these gaming users predominantly focus on two games, namely [***Dota 2***](https://www.dota2.com/home) and [***League of Legends (LoL)***](https://leagueoflegends.com). 

Our business development and marketing colleagues see great potential in tapping into this group of active gamers and the associated data. However, since there is only a single beta gaming forum thread, users have to sift through many posts to find topics that interest them or are relevant to them, which can lead to a poor user experience. Additionally, it would be more effective and efficient to target each game's user base separately by designing sales and marketing campaigns that better meet the needs of that user base.

## Problem Statement

- Our business development and marketing colleagues have asked us, the Data Science team, to design an **AI model** that **correctly classifies posts** from the single beta gaming forum thread into two separate threads, one for Dota 2 and another for League of Legends (LoL). The model should achieve an **accuracy of at least 85%** and identify the **Top 10 Predictors for each subreddit**. This would improve user experience and make it easier to design targeted sales and marketing campaigns that better meet each user base's needs.

**Datasets to be scraped:** _(Refer to **Part 2 — Subreddit Posts Classification** for Data cleaning and Modeling code, etc.)_
 - **`dota2_raw.csv`**: Dota 2 dataset
 - **`lol_raw.csv`**: League of Legends (LoL) dataset
<br>

**Brief Description of Datasets selected:** 
- The two datasets above, each comprising roughly 4,000 records (40 requests of 100 posts each), were scraped using the code below with the [*pushshift API*](https://github.com/pushshift/api) from these subreddits: 
  - [***r/DotA2***](https://www.reddit.com/r/DotA2/); and 
  - [***r/leagueoflegends***](https://www.reddit.com/r/leagueoflegends/)
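
The extraction code in this notebook pages backwards through a subreddit by passing the timestamp of the last post received as the `before` parameter of the next request. The sketch below illustrates that cursor logic only, without any network access. `paginate_before`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names made up for this illustration, and the sketch assumes each page arrives newest first, so the last item on a page is the oldest one.

```python
from typing import Callable, Dict, List, Optional


def paginate_before(fetch_page: Callable[[Optional[int]], List[Dict]],
                    n_pages: int) -> List[Dict]:
    """Collect up to n_pages pages, walking backwards in time via a 'before' cursor."""
    posts: List[Dict] = []
    before: Optional[int] = None  # None means "start from the newest posts"
    for _ in range(n_pages):
        page = fetch_page(before)
        if not page:  # stop early once there are no older posts
            break
        posts.extend(page)
        before = page[-1]["created_utc"]  # oldest timestamp seen so far
    return posts


# Hypothetical in-memory stand-in for the API: 250 posts, newest first
FAKE_POSTS = [{"id": i, "created_utc": 1_000_000 - i} for i in range(250)]


def fake_fetch(before: Optional[int], size: int = 100) -> List[Dict]:
    """Return up to `size` posts strictly older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if before is None or p["created_utc"] < before]
    return older[:size]


collected = paginate_before(fake_fetch, n_pages=40)
print(len(collected))  # 250: two full pages, one partial page, then an empty page
```

Because the loop stops on an empty page, asking for 40 pages from a source that holds fewer posts simply returns what is available.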

In [1]:
import datetime
import pandas as pd
from random import randrange
import requests
import time

In [2]:
# Function to extract posts from a subreddit in batches of 100 posts per request

def extract(subreddit, n):
    
    url = 'https://api.pushshift.io/reddit/search/submission'
    
    # Collect the dataframe from each request, then combine them at the end
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    frames = []
    last_row_timestamp = None
    
    # Loop n times to retrieve the required number of posts (100 posts per request)
    #     i.e. n = 40 extracts up to 4000 posts in total
    for i in range(n):
        
        # The first request starts from the current time;
        # later requests start from the 'created_utc' of the last post received
        params = {'subreddit': subreddit, 'size': 100}
        if last_row_timestamp is not None:
            params['before'] = last_row_timestamp
            
        result = requests.get(url, params=params)
        result.raise_for_status()    # Stop on HTTP errors instead of parsing an error page
        data = pd.DataFrame(result.json()['data'])
        
        # Stop early if the API returns no more posts
        if data.empty:
            break
        frames.append(data)
        
        # Initialize next extraction loop using the 'created_utc' of the last data row
        last_row_timestamp = data['created_utc'].iloc[-1]
        
        # Pause 1 to 4 seconds before the next request
        time.sleep(randrange(1, 5))
        
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()




# Function to compare all rows in dataset and report duplicated rows
def duplicate(df):
    
    duplicated_rows = df[df.duplicated()]
    
    if duplicated_rows.empty:
        print('This dataset has no duplicated rows.')
    else:
        print(f'Duplicated row(s):\n {duplicated_rows}')  # List duplicated row(s) in dataset
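
The `extract` function paginates on `created_utc`, which holds seconds since the Unix epoch. As a minimal sketch using hypothetical sample values rather than scraped data, the snippet below converts such timestamps to readable UTC datetimes and checks that they run newest first, which is the ordering the pagination relies on.

```python
import datetime

# Hypothetical 'created_utc' values (seconds since the Unix epoch), newest first
created_utc = [1640995200, 1640991600, 1640988000]


def to_utc(ts: int) -> datetime.datetime:
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc)


newest, oldest = to_utc(created_utc[0]), to_utc(created_utc[-1])

# Each timestamp should be no later than the one before it
is_newest_first = all(a >= b for a, b in zip(created_utc, created_utc[1:]))

print(newest.isoformat(), oldest.isoformat(), is_newest_first)
```

The same check can be applied to the `created_utc` column of an extracted dataframe to see which date range the posts cover.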

In [3]:
# Extract 4000 rows of posts from "Dota 2" subreddit
# https://www.reddit.com/r/DotA2/
dota2 = extract('DotA2', 40)
dota2.shape

(4000, 87)

In [4]:
# Save the dota2 dataframe (Raw version) as a csv file into the same folder as the original data files
dota2.to_csv('../datasets/dota2_raw.csv', index = False)

In [5]:
# Extract 4000 rows of posts from "League of Legends" subreddit
# https://www.reddit.com/r/leagueoflegends/
lol = extract('leagueoflegends', 40)
lol.shape

(3999, 80)

In [6]:
# Save the lol dataframe (Raw version) as a csv file into the same folder as the original data files
lol.to_csv('../datasets/lol_raw.csv', index = False)

In [None]:
# Remove duplicated rows in the dota2 dataset, keeping the first occurrence, and reset the index accordingly
dota2.drop_duplicates(inplace = True, ignore_index = True)
dota2.shape

In [None]:
# Save the dota2 dataframe as a csv file into the same folder as the original data files
dota2.to_csv('../datasets/dota2.csv', index = False)

In [None]:
# Remove duplicated rows in the lol dataset, keeping the first occurrence, and reset the index accordingly
lol.drop_duplicates(inplace = True, ignore_index = True)
lol.shape

In [None]:
# Save the lol dataframe as a csv file into the same folder as the original data files
lol.to_csv('../datasets/lol.csv', index = False)
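
Whole-row duplicate checks need every value to be hashable, so they can fail if any column in the raw data holds lists or dicts. A possible alternative, sketched here on hypothetical sample rows, is to deduplicate on a single identifier column. The sketch assumes each post carries a unique `id` field; the `awards` column is invented for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for raw posts; 'awards' holds lists,
# which cannot be hashed for a whole-row duplicate check
posts = pd.DataFrame({
    "id": ["a1", "b2", "a1"],
    "title": ["first", "second", "first"],
    "awards": [[], ["gold"], []],
})

# Compare rows on the 'id' column only, keeping the first occurrence
deduped = posts.drop_duplicates(subset="id", keep="first", ignore_index=True)
print(deduped.shape)  # (2, 3)
```

Restricting the comparison to one column also avoids treating two copies of the same post as different when a volatile field, such as a score, changed between requests.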

### Refer to Part 2 — Subreddit Posts Classification for Data cleaning and Modeling code, etc.