# Problem Statement

#### Oh no! Reddit is in shambles. Reddit's data science team needs help accurately classifying posts by the subreddit they came from. To prove that we can help, we will scale the project down to just the Nike and Adidas subreddits and build a model that determines whether a given post is Nike-related.

#### We will build several types of classification models: a Logistic Regression model, a Multinomial Naive Bayes model, a Decision Tree model, a Bagged Trees model, an Extra Trees model, and a Random Forest model. We will tune each model so that its `score` is as close to `1` as possible. With an accurate model in hand, Reddit should be able to apply it to sort the rest of the posts that are currently in limbo.

# Data Collection

#### We will query Reddit's JSON API for samples of posts from the Nike and Adidas subreddits. The collection steps are the same for both subreddits.
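
#### Because the steps are identical for both subreddits, the paging logic could be factored into one helper. The block below is only a minimal sketch, not the code this notebook ran: `collect_posts`, `fetch_page`, `FAKE_PAGES`, and `fake_fetch` are hypothetical names, the fake pages are made-up data so the example runs offline, and stopping when `after` comes back as `None` assumes that this is how the API signals the last page.

```python
import time


def collect_posts(fetch_page, max_pages=40, pause=2.0):
    """Collect the 'children' of successive listing pages.

    fetch_page is a hypothetical callable: it receives the current `after`
    cursor (None for the first page) and returns the parsed JSON of one
    page, or None if the request failed.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        if page is None:                   # Request failed: stop early
            break
        posts.extend(page['data']['children'])
        after = page['data']['after']
        if after is None:                  # Assumed end-of-listing signal
            break
        time.sleep(pause)                  # Pause between requests
    return posts


# Made-up pages standing in for the API so the sketch runs without a network.
FAKE_PAGES = {
    None: {'data': {'children': [{'data': {'title': 'a'}},
                                 {'data': {'title': 'b'}}],
                    'after': 't3_x'}},
    't3_x': {'data': {'children': [{'data': {'title': 'c'}}],
                      'after': None}},
}


def fake_fetch(after):
    # Look up the fake page that follows the given cursor.
    return FAKE_PAGES.get(after)


demo_posts = collect_posts(fake_fetch, pause=0)
print(len(demo_posts))
```

#### With `requests`, `fetch_page` could wrap a `requests.get` call that passes `params={'after': after}` and the `headers` dictionary, returning `res.json()` only when the status code is 200. The same helper would then serve both subreddits.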

In [1]:
import requests
import time 
import pandas as pd

In [2]:
nike_url = 'https://www.reddit.com/r/Nike/hot/.json'             # Set Nike url

adidas_url = 'https://www.reddit.com/r/adidas/hot/.json'         # Set Adidas url

In [3]:
headers = {'User-agent': 'danhyunkim'}                           # Set a custom user agent; requests with the default one are often rejected (e.g. HTTP 429)

In [4]:
nike_res = requests.get(nike_url, headers=headers)               # Set Nike get request

adidas_res = requests.get(adidas_url, headers=headers)           # Set Adidas get request

In [5]:
# Check the status code of both requests to make sure neither returned a 4xx or 5xx error
print(nike_res.status_code)

print(adidas_res.status_code)

200
200


### Nike data collection

In [6]:
nike_json = nike_res.json()

In [7]:
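nike_posts = []
after = None                                   # Pagination cursor returned by the previous page
for i in range(40):
    print(i)                                   # Progress indicator
    if after is None:
        params = {}                            # First request: no cursor yet
    else:
        params = {'after': after}              # Later requests: continue after the last post seen
    NIKE_url = 'https://www.reddit.com/r/Nike/hot/.json'
    res = requests.get(NIKE_url, params=params, headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        nike_posts.extend(the_json['data']['children'])
        after = the_json['data']['after']
    else:
        print(res.status_code)                 # Report the failing status code and stop
        break
    time.sleep(2)                              # Pause between requests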

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39


In [8]:
nike_df = pd.DataFrame(nike_posts)
nike_df.shape

(999, 2)

In [9]:
nike_cells = []
for post in nike_posts:
    data = post['data']                        # Fields of a single post
    nike_cells.append({'title': data['title'],
                       'subreddit': data['subreddit'],
                       'comments': data['num_comments'],
                       'text': data['selftext'],
                       'subscribers': data['subreddit_subscribers'],
                       'url': data['url'],
                       'created': data['created_utc'],
                       'author': data['author'],
                       'score': data['score']})

nike_df = pd.DataFrame(nike_cells)
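
#### Since the same fields are extracted for both subreddits, the flattening step above could be written once and reused. The following is a hedged sketch rather than the notebook's own code: `FIELDS`, `flatten_posts`, and the `sample` post are hypothetical, and the sample values are made up. Unlike the loop above, it uses `.get`, so a missing key yields `None` instead of raising a `KeyError`.

```python
# Output column name -> key inside each post's 'data' dictionary.
FIELDS = {
    'title': 'title',
    'subreddit': 'subreddit',
    'comments': 'num_comments',
    'text': 'selftext',
    'subscribers': 'subreddit_subscribers',
    'url': 'url',
    'created': 'created_utc',
    'author': 'author',
    'score': 'score',
}


def flatten_posts(posts, fields=FIELDS):
    """Return one flat dict per post, reading every value from that post's own 'data'."""
    return [{column: post['data'].get(key) for column, key in fields.items()}
            for post in posts]


# Made-up post shaped like one entry of the 'children' list.
sample = [{'kind': 't3',
           'data': {'title': 'Example title', 'subreddit': 'Nike',
                    'num_comments': 3, 'selftext': '',
                    'subreddit_subscribers': 100,
                    'url': 'https://example.com', 'created_utc': 0.0,
                    'author': 'someone', 'score': 5}}]

rows = flatten_posts(sample)
print(rows[0]['title'], rows[0]['comments'])
```

#### The resulting list of dictionaries can be passed straight to `pd.DataFrame`, exactly like `nike_cells`. Because each row is built from a single `post`, values from one subreddit cannot be mixed into another subreddit's rows.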

In [10]:
nike_df.head()

| Unnamed: 0 | comments | created | subreddit | subscribers | text | title | url |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1456352000.0 | Nike | 11031 | Please note that any product from a previous s... | Tips for identifying Nike Product | https://www.reddit.com/r/Nike/comments/47fex4/... |
| 1 | 6 | 1545143000.0 | Nike | 11031 | | Nike IDs, 🔥or🗑? | https://i.redd.it/kop0xjccq1521.jpg |
| 2 | 0 | 1545150000.0 | Nike | 11031 | | 3D printed Jordan 1s by me | https://i.redd.it/j4oaab0za2521.jpg |
| 3 | 7 | 1545082000.0 | Nike | 11031 | | Hardshell Backpack / Drone Camera Bag INFO PLE... | https://i.redd.it/2bk8fp7rnw421.jpg |
| 4 | 0 | 1545109000.0 | Nike | 11031 | | Another Redditor asked me too find the name of... | https://i.redd.it/6kqs52x3xy421.jpg |


In [11]:
nike_df.shape

(999, 7)
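
#### The `created` column holds `created_utc` values, which are Unix timestamps in seconds (UTC). As a small illustration, they can be converted to readable datetimes with the standard library. `to_utc` is a hypothetical helper, and the input is the value displayed for the first Nike row, which may have been rounded for display.

```python
from datetime import datetime, timezone


def to_utc(created):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created, tz=timezone.utc)


first_created = to_utc(1456352000.0)   # Value shown for the first Nike post
print(first_created.isoformat())
```

#### On a DataFrame, `pd.to_datetime(nike_df['created'], unit='s')` should perform the same conversion for the whole column.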

In [12]:
nike_df.to_csv('./Nike.csv')

### Adidas data collection

In [13]:
adidas_json = adidas_res.json()

In [14]:
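adidas_posts = []
after = None                                   # Pagination cursor returned by the previous page
for i in range(40):
    print(i)                                   # Progress indicator
    if after is None:
        params = {}                            # First request: no cursor yet
    else:
        params = {'after': after}              # Later requests: continue after the last post seen
    ADIDAS_url = 'https://www.reddit.com/r/adidas/hot/.json'
    res = requests.get(ADIDAS_url, params=params, headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        adidas_posts.extend(the_json['data']['children'])
        after = the_json['data']['after']
    else:
        print(res.status_code)                 # Report the failing status code and stop
        break
    time.sleep(2)                              # Pause between requests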

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39


In [15]:
adidas_df = pd.DataFrame(adidas_posts)
adidas_df.shape

(992, 2)

In [16]:
adidas_cells = []
for post in adidas_posts:
    data = post['data']                        # Fields of a single post
    adidas_cells.append({'title': data['title'],
                         'subreddit': data['subreddit'],
                         'comments': data['num_comments'],
                         'text': data['selftext'],
                         'subscribers': data['subreddit_subscribers'],
                         'url': data['url'],
                         'created': data['created_utc'],
                         'author': data['author'],   # Read from the Adidas post, not from nike_posts
                         'score': data['score']})    # Read from the Adidas post, not from nike_posts

adidas_df = pd.DataFrame(adidas_cells)

In [17]:
adidas_df.head()

| Unnamed: 0 | comments | created | subreddit | subscribers | text | title | url |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1513752000.0 | adidas | 9658 | With holidays coming around, and a lot of big ... | Just a friendly holiday reminder about CS ques... | https://www.reddit.com/r/adidas/comments/7kzpb... |
| 1 | 11 | 1545102000.0 | adidas | 9658 | Hey Everyone!\n\nI'm doing a group project for... | Adidas Brand Survey | https://www.reddit.com/r/adidas/comments/a7704... |
| 2 | 5 | 1545108000.0 | adidas | 9658 | | 3D printed YEEZY | https://i.redd.it/q14x6l4vuy421.jpg |
| 3 | 2 | 1545120000.0 | adidas | 9658 | | Awkward flex . | https://i.redd.it/wln7dfhruz421.jpg |
| 4 | 2 | 1545114000.0 | adidas | 9658 | | I Just Had To Pick These Up | https://i.redd.it/mcb80ukwaz421.jpg |


In [18]:
adidas_df.shape

(992, 7)

In [19]:
adidas_df.to_csv('./Adidas.csv')