# Amanda Jo Russell
# Project 3: Web APIs & Classification

# Data Science Problem

Using text collected from two different subreddits, this project seeks to predict which subreddit a given post came from. The process covers data acquisition via the Reddit API and natural language processing (NLP) methods, and several classification models are implemented so that their performance can be compared. The two subreddits chosen for this project pertain to the ride-sharing companies Lyft and Uber. I chose them out of genuine curiosity about the two companies, which are practically interchangeable from a user perspective despite their notable distinctions from a stakeholder perspective. A nearly equal number of posts was collected from each subreddit (701 from Lyft and 700 from Uber), and the models are trained and tested primarily on the post title and core text. The observations and insights gained from the data and models will be used to address business implications and recommendations for the respective companies.

# Executive Summary

• Data Science Problem Statement <br>
• Data Collection <br>
• Data Cleaning & EDA <br>
• Preprocessing & Modeling <br>
• Evaluation and Conceptual Understanding | Conclusion and Recommendations

# Data Collection

##### Imported libraries.

In [1]:
import requests
import json
import time
import pandas as pd

##### Imported list of posts from first subreddit.

In [2]:
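first_posts = []
headers = {'user-agent': 'moi'}  # custom user agent sent with every request
after = None  # pagination token returned with each page
for i in range(28):
    print(i)
    # The first request has no token; later requests resume after the last post seen.
    if after is None:
        params = {}
    else:
        params = {'after': after}
    first_url = 'https://www.reddit.com/r/Lyft/.json'
    first_res = requests.get(first_url, params=params, headers=headers)
    if first_res.status_code == 200:
        first_json = first_res.json()
        first_posts.extend(first_json['data']['children'])
        after = first_json['data']['after']
    else:
        # Report the failing status code and stop requesting pages.
        print(first_res.status_code)
        break
    time.sleep(2)  # pause between requests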

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27


##### Imported list of posts from second subreddit.

In [3]:
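second_posts = []
headers = {'user-agent': 'moi'}  # custom user agent sent with every request
after = None  # pagination token returned with each page
for i in range(28):
    print(i)
    # The first request has no token; later requests resume after the last post seen.
    if after is None:
        params = {}
    else:
        params = {'after': after}
    second_url = 'https://www.reddit.com/r/uber/.json'
    second_res = requests.get(second_url, params=params, headers=headers)
    if second_res.status_code == 200:
        second_json = second_res.json()
        second_posts.extend(second_json['data']['children'])
        after = second_json['data']['after']
    else:
        # Report the failing status code and stop requesting pages.
        print(second_res.status_code)
        break
    time.sleep(2)  # pause between requests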

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
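
The two collection cells above differ only in the subreddit name, so the loop could be factored into one helper. The following is a minimal sketch, not the code used to gather the data: `collect_posts` and `make_reddit_fetcher` are hypothetical names, and the early exit when `after` comes back empty is an added assumption (the original loops attempt all 28 requests unless one fails). It also raises on an error response instead of printing the status code.

```python
import time


def collect_posts(fetch_page, max_pages=28, pause=2.0):
    """Collect listing children across pages by following the 'after' token.

    fetch_page is any callable that takes a params dict and returns the
    decoded JSON of one listing page.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        # First request has no token; later ones resume after the last post.
        params = {} if after is None else {'after': after}
        page = fetch_page(params)
        posts.extend(page['data']['children'])
        after = page['data']['after']
        if after is None:
            # Assumption: an empty token means no further page, so stop
            # rather than requesting the first page again.
            break
        time.sleep(pause)
    return posts


def make_reddit_fetcher(subreddit, user_agent='moi'):
    """Build a fetch_page callable using the same URL pattern as the cells above."""
    import requests  # imported here so collect_posts can be exercised offline

    url = f'https://www.reddit.com/r/{subreddit}/.json'

    def fetch_page(params):
        res = requests.get(url, params=params, headers={'user-agent': user_agent})
        res.raise_for_status()  # raise on 4xx/5xx responses
        return res.json()

    return fetch_page
```

With this in place, each collection cell would reduce to a single call such as `first_posts = collect_posts(make_reddit_fetcher('Lyft'))`. Passing the fetcher in as an argument keeps the pagination logic separate from the network call, which makes it easy to check against canned pages.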


##### Observed the full length of each list, then checked how many unique values would remain after removing duplicates.

In [4]:
print(len(first_posts))
print(len(set([p['data']['name'] for p in first_posts])))

701
701


In [5]:
print(len(second_posts))
print(len(set([p['data']['name'] for p in second_posts])))

700
700
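
Comparing `len(posts)` with the size of the set of names shows whether duplicates exist, but not which posts are repeated. As a small sketch under that same approach, a hypothetical helper `duplicate_names` (not part of the original notebook) could list the repeated `name` codes and how often each appears:

```python
from collections import Counter


def duplicate_names(posts):
    """Return {name: count} for every post name that appears more than once."""
    counts = Counter(p['data']['name'] for p in posts)
    return {name: n for name, n in counts.items() if n > 1}
```

An empty dictionary from `duplicate_names(first_posts)` would correspond to the matching counts printed above.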


##### Turned each list into a dataframe with only the columns we need.

In [18]:
title_list = [p['data']['title'] for p in first_posts]
selftext_list = [p['data']['selftext'] for p in first_posts]
name_list = [p['data']['name'] for p in first_posts]
subreddit_name = [p['data']['subreddit'] for p in first_posts]

first_df = pd.DataFrame({'title':title_list, 'selftext':selftext_list, 'name':name_list, 'subreddit':subreddit_name})
first_df.head()

| | title | selftext | name | subreddit |
|---|---|---|---|---|
| 0 | PROMO CODE THREAD - Post your promo codes here... | Looks like we need another code thread. \n\nIf... | t3_8so63i | Lyft |
| 1 | Does this seem fair? | | t3_b92sgl | Lyft |
| 2 | I got accused of being high/drunk while drivin... | So I’ve been driving for Lyft for 6 months. It... | t3_b983ym | Lyft |
| 3 | Is holding lost phones captive for big money r... | | t3_b96ipv | Lyft |
| 4 | Is it just me or is the Lyft Driver app glitch... | Every single ride request I get has the estima... | t3_b95rnx | Lyft |


In [19]:
title_list = [p['data']['title'] for p in second_posts]
selftext_list = [p['data']['selftext'] for p in second_posts]
name_list = [p['data']['name'] for p in second_posts]
subreddit_name = [p['data']['subreddit'] for p in second_posts]

second_df = pd.DataFrame({'title':title_list, 'selftext':selftext_list, 'name':name_list, 'subreddit':subreddit_name})
second_df.head()

| | title | selftext | name | subreddit |
|---|---|---|---|---|
| 0 | Reserved Uber for 4am-4:15am in advance. Car a... | | t3_b8vqcn | uber |
| 1 | Uber taking tip money now? What’s up with thes... | | t3_b93eo4 | uber |
| 2 | Uber Deactivated my account? | So I order UberEATS around twice a day and the... | t3_b97uli | uber |
| 3 | Its just frustrating at times when the whole w... | | t3_b95g4z | uber |
| 4 | jump bike - How do you know if you’re in a no ... | I’m interested in taking a jump bike to work b... | t3_b959gr | uber |
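
Cells 18 and 19 repeat the same four list comprehensions for each subreddit. One possible way to express that step once is sketched below; `posts_to_frame` and `COLUMNS` are hypothetical names introduced for illustration, and the function assumes every post carries all four keys, as the posts above do.

```python
import pandas as pd

# Fields kept from each post, in the same order as the dataframes above.
COLUMNS = ['title', 'selftext', 'name', 'subreddit']


def posts_to_frame(posts, columns=COLUMNS):
    """Build a dataframe with one row per post, keeping only the given fields."""
    rows = [{col: p['data'][col] for col in columns} for p in posts]
    return pd.DataFrame(rows, columns=columns)
```

Under those assumptions, `first_df = posts_to_frame(first_posts)` and `second_df = posts_to_frame(second_posts)` would produce the same columns as the cells above.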


##### Filtered through all rows and removed any duplicates (determined by the unique 'name' code).

In [23]:
first_df = first_df.drop_duplicates(['name'])
second_df = second_df.drop_duplicates(['name'])

##### Confirmed that each shape matches the unique value count above.

In [24]:
print('first', first_df.shape)
print('second', second_df.shape)

first (701, 4)
second (700, 4)


##### Merged both dataframes together, shuffled rows, and saved as a csv file.

In [29]:
merge_posts = pd.concat([first_df, second_df], axis=0)

In [57]:
all_posts = merge_posts.sample(frac=1).reset_index(drop=True)
all_posts.head()

| | title | selftext | name | subreddit |
|---|---|---|---|---|
| 0 | Nightlife Playlist | I usually drive people going out to clubs. I w... | t3_b5xkl4 | Lyft |
| 1 | Uber should add this feature to their app to m... | A woman was killed after mistaking someone els... | t3_b83pf5 | uber |
| 2 | Interest in starting but I drive a Jeep | I want to start as a side business. I provide ... | t3_auo6v4 | uber |
| 3 | Uber or Lyft while vacationing | So, I would like to start doing Uber or Lyft i... | t3_b96qf0 | Lyft |
| 4 | I got co-erced into getting someone a Lyft, wh... | To make a long story short, I was co-erced and... | t3_anh5bz | Lyft |


In [58]:
all_posts.to_csv('./datasets/lyft_uber.csv', index=False)
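
The shuffle in cell 57 is unseeded, so rerunning the notebook would produce a different row order each time. If a repeatable order is wanted, one option is to pass `random_state` to `sample`. The sketch below illustrates this; the helper name `shuffle_rows` and the seed value 42 are arbitrary choices, not taken from the original notebook.

```python
import pandas as pd


def shuffle_rows(df, seed=42):
    """Return the rows of df in a shuffled but repeatable order with a fresh index."""
    return df.sample(frac=1, random_state=seed).reset_index(drop=True)
```

For example, `all_posts = shuffle_rows(merge_posts)` would give the same ordering on every run, and `all_posts['subreddit'].value_counts()` could then confirm that both classes are still present in the expected proportions before the file is saved.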