## Classifying subreddit posts using Natural Language Processing (NLP)

#### Problem Statement

Is there a significant difference between what the NASA and SpaceX subreddits discuss that could be used to target advertising at the fans of each organization?

#### Description

Using NLP on the titles of posts from the SpaceX and NASA subreddits, I will fit classification models that predict which subreddit a given post came from. From these models we can then infer which topics are discussed in each subreddit and, if possible, identify how to advertise specifically to fans of SpaceX or to fans of NASA.

In [1]:
import requests
import json
import pandas as pd
import numpy as np
import time

### Step # 1: Obtain the raw data (NASA)

#### Reddit API

Using the public Reddit API, I will scrape both the NASA and SpaceX subreddits to obtain the JSON data (i.e. each individual post) to which NLP will be applied.

#### Description

First, I will check that the URL for the JSON data can be reached and returns a successful status code, so that I know the JSON data can be pulled.

In [12]:
url_nasa = 'https://www.reddit.com/r/nasa.json'

In [13]:
headers = {'User-agent': 'Nelson 0.1'}

In [14]:
res = requests.get(url_nasa, headers=headers)

In [15]:
res.status_code

200
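The same check can be wrapped in a small helper so it is reusable for both subreddits. This is only a minimal sketch: `check_endpoint` and `fake_get` are hypothetical names that do not appear in the notebook, and `fake_get` stands in for `requests.get` so the example runs without network access.

```python
from types import SimpleNamespace


def check_endpoint(url, get, headers=None):
    """Return True when `get(url, headers=...)` answers with HTTP 200."""
    res = get(url, headers=headers or {})
    return res.status_code == 200


# Hypothetical offline stand-in for requests.get: any URL ending in
# '.json' is treated as reachable, everything else as 404.
def fake_get(url, headers=None):
    return SimpleNamespace(status_code=200 if url.endswith('.json') else 404)


ok = check_endpoint('https://www.reddit.com/r/nasa.json', fake_get,
                    headers={'User-agent': 'Nelson 0.1'})
print(ok)
```

In the notebook itself, passing `requests.get` as the `get` argument would perform the real request.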

#### Request to Reddit

Next, using the for loop below, I will send repeated requests to Reddit to scrape the posts in the NASA subreddit. Each response arrives as JSON holding one page of posts; the loop keeps the `data` entry of each item under `['data']['children']`, appends those entries to a list, and passes the `after` token to the next request to fetch the following page.

In [16]:
posts_nasa = []
after = None  # pagination token; None requests the first page
for i in range(50):
    print(i)
    if after is None:
        params = {}
    else:
        params = {'after': after}
    url_nasa = 'https://www.reddit.com/r/nasa.json'
    res = requests.get(url_nasa, params=params, headers=headers)
    if res.status_code == 200:
        nasa_json = res.json()
        # keep only the post payload from each item in the listing
        current_posts = [p['data'] for p in nasa_json['data']['children']]
        posts_nasa.extend(current_posts)
        # token that points at the next page of results
        after = nasa_json['data']['after']
    else:
        print(res.status_code)
        break
    time.sleep(3)  # pause between requests

0
1
2
...
49
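One detail of the loop above is worth spelling out: whenever `after` is `None`, the request carries no parameters and so asks for the first page. If Reddit returns a null `after` once the listing is exhausted (an assumption about the API's behavior, not something verified in this notebook), the loop would start over from the top, which may be where the duplicate posts come from. Below is a minimal sketch of the same pagination that stops at that point. `collect_posts`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names, and the fake pages replace real HTTP calls so the example runs offline.

```python
def collect_posts(fetch_page, max_pages=50):
    """Follow `after` tokens until the listing ends or max_pages is hit.

    `fetch_page(after)` must return a dict shaped like Reddit's listing
    JSON: {'data': {'children': [{'data': {...}}, ...], 'after': token}}.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(child['data'] for child in listing['data']['children'])
        after = listing['data']['after']
        if after is None:  # no further page: stop instead of restarting
            break
    return posts


# Two fake pages keyed by the `after` token that requests them.
FAKE_PAGES = {
    None: {'data': {'children': [{'data': {'name': 't3_a'}},
                                 {'data': {'name': 't3_b'}}],
                    'after': 't3_b'}},
    't3_b': {'data': {'children': [{'data': {'name': 't3_c'}}],
                      'after': None}},
}


def fake_fetch(after):
    return FAKE_PAGES[after]


posts = collect_posts(fake_fetch)
print([p['name'] for p in posts])
```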


#### Received from Reddit

The loop returned 1,227 posts from the NASA subreddit, of which 952 are unique once duplicate posts are accounted for.

In [17]:
len(posts_nasa)

1227

In [18]:
len(set([p['name'] for p in posts_nasa]))

952
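The cells above count the unique post names but do not remove the repeats from the list. A minimal sketch of dropping them while keeping the first occurrence of each post follows; `unique_by_name` and the sample data are hypothetical and stand in for `posts_nasa`.

```python
def unique_by_name(posts):
    """Return posts with duplicate 'name' values removed, order preserved."""
    seen = set()
    unique = []
    for post in posts:
        if post['name'] not in seen:
            seen.add(post['name'])
            unique.append(post)
    return unique


# Small stand-in for posts_nasa containing one repeated post.
sample = [
    {'name': 't3_a', 'title': 'First post'},
    {'name': 't3_b', 'title': 'Second post'},
    {'name': 't3_a', 'title': 'First post'},
]

deduped = unique_by_name(sample)
print(len(sample), len(deduped))
```

Applied to the real list, the length of the result should match the `len(set(...))` count shown above.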

### Step # 2: Set up a DataFrame

With the raw JSON data now stored in a list, I will convert it into a DataFrame for easier exploratory data analysis and modeling.

In [19]:
nasa_df = pd.DataFrame(posts_nasa)

In [21]:
nasa_df.head()

|  | approved_at_utc | approved_by | archived | author | author_cakeday | author_flair_background_color | author_flair_css_class | author_flair_richtext | author_flair_template_id | author_flair_text | ... | thumbnail_height | thumbnail_width | title | ups | url | user_reports | view_count | visited | whitelist_status | wls |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 |  |  | False | I_DR_NOW |  |  |  | [] |  |  | ... | 93.0 | 140.0 | NASA Television to Air Launch, Capture of Japa... | 306 | https://www.nasa.gov/press-release/nasa-televi... | [] |  | False | all_ads | 6 |
| 1 |  |  | False | Lalalauren582 |  |  |  | [] |  |  | ... | 140.0 | 140.0 | I went to NASA Goddard Space Center for the fi... | 121 | https://i.redd.it/p5fs387daok11.jpg | [] |  | False | all_ads | 6 |
| 2 |  |  | False | moon-worshiper |  |  |  | [] |  |  | ... | 78.0 | 140.0 | Exoplanet report recommends development of lar... | 15 | https://spacenews.com/exoplanet-report-recomme... | [] |  | False | all_ads | 6 |
| 3 |  |  | False | wbgamer |  |  |  | [] |  |  | ... | 93.0 | 140.0 | NASA's ASPIRE project will conduct its 3rd fli... | 5 | https://www.nasa.gov/wallops/2018/feature/nasa... | [] |  | False | all_ads | 6 |
| 4 |  |  | False | TokathSorbet |  |  |  | [] |  |  | ... |  |  | STS TAL abort vehicle recovery? | 5 | https://www.reddit.com/r/nasa/comments/9dklj0/... | [] |  | False | all_ads | 6 |


### Step # 3: Save the files

I save the newly created DataFrame of NASA subreddit posts to CSV, along with the raw JSON received from Reddit.

In [22]:
nasa_df.to_csv('../data/nasa_df.csv')

In [23]:
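# Note: nasa_json holds only the last page returned by the loop above;
# the complete set of scraped posts lives in posts_nasa.
with open('../data/nasa_json.json', 'w+') as f:
    json.dump(nasa_json, f)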
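Because `nasa_json` is reassigned on every pass of the scraping loop, the file written above contains only the final page of results. If the goal is to keep every scraped post, one option is to dump the accumulated list instead. The sketch below assumes that goal; `save_posts`, `load_posts`, and the file name are hypothetical, and a temporary directory is used so nothing under `../data` is touched.

```python
import json
import os
import tempfile


def save_posts(posts, path):
    """Write the full list of post dicts to `path` as one JSON array."""
    with open(path, 'w') as f:
        json.dump(posts, f)


def load_posts(path):
    """Read the list of post dicts back from `path`."""
    with open(path) as f:
        return json.load(f)


# Small stand-in for posts_nasa.
sample_posts = [{'name': 't3_a', 'title': 'First'},
                {'name': 't3_b', 'title': 'Second'}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'nasa_posts.json')
    save_posts(sample_posts, path)
    restored = load_posts(path)

print(restored == sample_posts)
```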

### Step # 4: Obtain the raw data (SpaceX)

#### Description

This time I will check that the URL for the SpaceX subreddit's JSON data can be reached and returns a successful status code, so that I know the JSON data can be pulled.

In [24]:
url_spacex = 'https://www.reddit.com/r/spacex.json'

In [25]:
headers = {'User-agent': 'Nelson 0.1'}

In [26]:
res_spacex = requests.get(url_spacex, headers=headers)

In [27]:
res_spacex.status_code

200

#### Request to Reddit

Next, using the for loop below, I will send repeated requests to Reddit to scrape the posts in the SpaceX subreddit. Each response arrives as JSON holding one page of posts; the loop keeps the `data` entry of each item under `['data']['children']`, appends those entries to a list, and passes the `after` token to the next request to fetch the following page.

In [28]:
posts_spacex = []
after = None  # pagination token; None requests the first page
for i in range(50):
    print(i)
    if after is None:
        params = {}
    else:
        params = {'after': after}
    url_spacex = 'https://www.reddit.com/r/spacex.json'
    res_spacex = requests.get(url_spacex, params=params, headers=headers)
    if res_spacex.status_code == 200:
        spacex_json = res_spacex.json()
        # keep only the post payload from each item in the listing
        current_posts = [p['data'] for p in spacex_json['data']['children']]
        posts_spacex.extend(current_posts)
        # token that points at the next page of results
        after = spacex_json['data']['after']
    else:
        print(res_spacex.status_code)
        break
    time.sleep(3)  # pause between requests

0
1
2
...
49
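The NASA and SpaceX loops differ only in the subreddit name, so the URL and query parameters could be produced by one small function. This is a sketch of that idea rather than what the notebook does; `build_request` is a hypothetical helper, and the token value in the example is made up.

```python
def build_request(subreddit, after=None):
    """Return the (url, params) pair for one page of a subreddit listing."""
    url = f'https://www.reddit.com/r/{subreddit}.json'
    # An empty params dict requests the first page; otherwise pass the token.
    params = {} if after is None else {'after': after}
    return url, params


first_page = build_request('spacex')
next_page = build_request('spacex', after='t3_example')  # made-up token
print(first_page)
print(next_page)
```

The returned pair can be handed to `requests.get(url, params=params, headers=headers)` exactly as in the loops above.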


#### Received from Reddit

The loop returned 1,247 posts from the SpaceX subreddit, of which 995 are unique once duplicate posts are accounted for.

In [29]:
len(posts_spacex)

1247

In [30]:
len(set([p['name'] for p in posts_spacex]))

995

### Step # 5: Set up a DataFrame

With the raw JSON data now stored in a list, I will convert it into a DataFrame for easier exploratory data analysis and modeling.

In [31]:
spacex_df = pd.DataFrame(posts_spacex)

### Step # 6: Save the files

I save the newly created DataFrame of SpaceX subreddit posts to CSV, along with the raw JSON received from Reddit.

In [32]:
spacex_df.to_csv('../data/spacex_df.csv')

In [33]:
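# Note: spacex_json holds only the last page returned by the loop above;
# the complete set of scraped posts lives in posts_spacex.
with open('../data/spacex_json.json', 'w+') as f:
    json.dump(spacex_json, f)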