# Project 3: Web APIs & NLP  

### Problem Statement:  

Reddit has been a popular place for consumers and brand owners alike to share the unvarnished truth about a product or service. Reddit incentivizes authors through karma points: posts with more upvotes (karma) rise to the top of the page, which lets Reddit show users the most relevant content.

In today's market, it is imperative for businesses to be customer-centric. Customers who advocate for brands widely use social media to share their opinions and experiences. In this project, I will create a model that predicts whether a post originated from the Uber or the Lyft subreddit.  

Lyft and Uber are two competitors in the rideshare market. In order to promote their brand, both companies must be able to deliver optimal service to riders and provide a healthy environment for workers as well. 

In [1]:
import requests
import pandas as pd

### Lyft Subreddit Data Gathering

As the first step, I use the Pushshift API to pull posts from a specific subreddit. Pushshift returns at most 100 posts per request, so I wrote a function that accepts the subreddit name and a starting UTC timestamp and requests successive batches of older posts. The function returns a dataframe with `created_utc`, `subreddit`, `selftext`, and `title` as features. Run on both subreddits, it yields 6,800 posts in total (3,400 each).

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'
params = {
    'subreddit': 'Lyft',
    'size': 100
}

res = requests.get(url, params=params)
res.status_code

200

In [3]:
data = res.json()
posts = data['data']

In [4]:
df = pd.DataFrame(posts)
df.shape

(100, 74)

In [5]:
df[['created_utc','subreddit','selftext','title']].head()

| | created_utc | subreddit | selftext | title |
|---|---|---|---|---|
| 0 | 1616000327 | Lyft | It’s very snowy in my area so i paid for a rid... | How to get help from Lyft |
| 1 | 1615993961 | Lyft | Hey all, I got charged for my first ever Lyft ... | Clarifying Question |
| 2 | 1615992565 | Lyft | So I signed on the lift today, if you ride bac... | I’ll be there I promise. |
| 3 | 1615937479 | Lyft | So i’m new to lyft actually this is my first r... | Scheduled a ride. Should i be worried? |
| 4 | 1615876814 | Lyft | Ordered Lyft, driver accepts, then goes to Bur... | Getting Burger King after accepting ride |


In [6]:
df[['created_utc','subreddit','selftext','title']].tail()

| | created_utc | subreddit | selftext | title |
|---|---|---|---|---|
| 95 | 1613998501 | Lyft | Normally my ride to work only runs me about 13... | Price increasing |
| 96 | 1613947838 | Lyft | [removed] | Free Car |
| 97 | 1613946579 | Lyft | | Lyft misses payment to renew operating license... |
| 98 | 1613937961 | Lyft | | Starting a Union in Sarasota Florida, for High... |
| 99 | 1613933043 | Lyft | Is their a ride service with car seats? I’m no... | In Seattle with kids |


I noticed that the `created_utc` values of the first and last posts in a 100-post batch differ by approximately 2,000,000 seconds.

In [7]:
# Difference in created_utc between the first and last post
1615876814 - 1613893077

1983737
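Since `created_utc` is a Unix timestamp in seconds, the gap above can be read as a time span. The following is a minimal sketch that converts the two values from the subtraction into dates and days; the variable names are my own and are not used elsewhere in the notebook.

```python
import pandas as pd

# Timestamps copied from the subtraction above (Unix epoch seconds).
newest_utc = 1615876814
oldest_utc = 1613893077

# Interpret the integers as seconds since 1970-01-01 UTC.
newest = pd.to_datetime(newest_utc, unit="s", utc=True)
oldest = pd.to_datetime(oldest_utc, unit="s", utc=True)

span = newest - oldest                    # a pandas Timedelta
span_seconds = int(span.total_seconds())
span_days = span.total_seconds() / 86400  # 86,400 seconds per day

print(newest, oldest)
print(span_seconds, round(span_days, 1))
```

Read this way, one batch of 100 posts covers a little under 23 days of activity on the subreddit.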

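The function below accepts the subreddit and a starting UTC timestamp as parameters. It requests up to 100 posts before each of a series of timestamps spaced 3,000,000 seconds apart.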

In [8]:
def reddit(subreddit, utc):
    """Collect posts from `subreddit`, stepping back in time from the `utc` timestamp."""
    url = 'https://api.pushshift.io/reddit/search/submission'
    columns = ['created_utc', 'subreddit', 'selftext', 'title']
    frames = []
    # 34 requests, each asking for up to 100 posts before a timestamp 3,000,000 s earlier
    for before in range(utc, utc - 10**8, -3 * 10**6):
        params = {
            'subreddit': subreddit,
            'size': 100,
            'before': before}
        res = requests.get(url, params=params)
        posts = res.json()['data']
        # Keep only the four columns used later for modeling
        frames.append(pd.DataFrame(posts, columns=columns))
    return pd.concat(frames)
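A fixed step of 3,000,000 seconds can skip posts whenever a batch of 100 covers less time than the step. One alternative, sketched here rather than taken from the notebook, is to use the oldest `created_utc` of each batch as the next `before` value. The sketch assumes `before` returns only posts strictly older than the given timestamp. `collect_posts`, `fetch_page`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the API so the example runs offline.

```python
import pandas as pd

COLUMNS = ["created_utc", "subreddit", "selftext", "title"]


def collect_posts(fetch_page, subreddit, before, max_pages=34):
    """Page backwards in time, using the oldest post of each page as the next cursor.

    fetch_page is a hypothetical callable (subreddit, before) -> list of post dicts,
    for example a thin wrapper around requests.get(...).json()["data"].
    """
    frames = []
    for _ in range(max_pages):
        posts = fetch_page(subreddit, before)
        if not posts:  # no older posts left
            break
        page = pd.DataFrame(posts, columns=COLUMNS)
        frames.append(page)
        # The next request asks only for posts older than the oldest one seen so far.
        before = int(page["created_utc"].min())
    if not frames:
        return pd.DataFrame(columns=COLUMNS)
    return pd.concat(frames, ignore_index=True)


def fake_fetch(subreddit, before, page_size=100):
    """Offline stand-in for the API: one made-up post every 1000 seconds."""
    stamps = [t for t in range(249_000, 0, -1000) if t < before][:page_size]
    return [
        {"created_utc": t, "subreddit": subreddit, "selftext": "", "title": f"post {t}"}
        for t in stamps
    ]


demo = collect_posts(fake_fetch, "Lyft", before=250_000)
print(demo.shape, demo["created_utc"].is_unique)
```

Because each request starts where the previous batch ended, the batches neither overlap nor leave gaps, at the cost of not knowing in advance how far back in time the collection will reach.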

In [9]:
lyft = reddit('Lyft', 1615876814)

In [10]:
lyft.shape

(3400, 4)

In [11]:
# Count unique values per column to check the 3,400 rows for duplicate posts
lyft.nunique()

created_utc    3399
subreddit         1
selftext       2217
title          3350
dtype: int64

### Uber Subreddit Data Gathering

In [12]:
url = 'https://api.pushshift.io/reddit/search/submission'
params = {
    'subreddit': 'Uber',
    'size': 100
}
res = requests.get(url, params=params)
res.status_code

200

In [13]:
data_uber = res.json()
posts = data_uber['data']

In [14]:
df_uber = pd.DataFrame(posts)
df_uber.shape

(100, 72)

In [15]:
df_uber[['created_utc','subreddit','selftext','title']].head()

| | created_utc | subreddit | selftext | title |
|---|---|---|---|---|
| 0 | 1616000783 | uber | I'm new to uber. Have only used it a hand full... | No cars available, but still being charged? (h... |
| 1 | 1615988041 | uber | | UBER DIETY AMOST OVER 12000! |
| 2 | 1615986645 | uber | I have a high passenger rating and can usually... | Trouble getting rides on any rideshare app? |
| 3 | 1615946943 | uber | | Uber to pay drivers a minimum wage, holiday pa... |
| 4 | 1615939820 | uber | Are you partners with r/Uberdrivers because yo... | Uh... |


In [16]:
df_uber[['created_utc','subreddit','selftext','title']].tail()

| | created_utc | subreddit | selftext | title |
|---|---|---|---|---|
| 95 | 1614881203 | uber | So I have had 2 drivers show up to give rides ... | Two rides showed up that weren't requested. |
| 96 | 1614874100 | uber | | Yo, get your free skin with code "free100cc" |
| 97 | 1614821848 | uber | I have had this problem before- I get an email... | Someone had access to my Uber account |
| 98 | 1614821845 | uber | [deleted] | Am I doing the right thing as a passenger? |
| 99 | 1614812318 | uber | | LOL |


In [17]:
# Difference in created_utc between the first and last post
1615870847 - 1614729872

1140975

In [18]:
uber = reddit('Uber',1615924925)

In [19]:
uber.shape

(3400, 4)

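I noticed that the Uber data contains duplicates: the 3,400 rows have only 3,398 unique `created_utc` values. These still need to be dropped, since the shape reported in the cells below remains (3400, 4).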

In [20]:
uber.nunique()

created_utc    3398
subreddit         1
selftext       2061
title          3349
dtype: int64
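One way to remove repeated posts is `DataFrame.drop_duplicates`. The sketch below uses a small made-up frame with the same four columns rather than the real `uber` data, and the names `sample` and `deduped` are illustrative. It also shows a common reason a shape stays unchanged: `drop_duplicates` returns a new frame, so the result has to be assigned.

```python
import pandas as pd

# Small stand-in frame with the same four columns as `uber`; the rows are made up.
sample = pd.DataFrame({
    "created_utc": [3, 2, 2, 1],
    "subreddit": ["uber"] * 4,
    "selftext": ["a", "b", "b", ""],
    "title": ["t3", "t2", "t2", "t1"],
})

# Count rows that repeat an earlier row in every column.
n_duplicates = int(sample.duplicated().sum())

# drop_duplicates returns a new frame; the original is left untouched.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates, sample.shape, deduped.shape)
```

Applied to the real data, the same two calls would go on `uber` (and `lyft`) before the frames are combined.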

In [21]:
uber.shape

(3400, 4)

In [22]:
uber.nunique()

created_utc    3398
subreddit         1
selftext       2061
title          3349
dtype: int64

In [24]:
rideshare = pd.concat([lyft,uber])

In [25]:
rideshare.to_csv('./data/rideshare.csv', index=False)
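Before modeling, it can help to confirm that the combined frame holds both classes in the expected proportions. The sketch below uses two tiny made-up frames in place of `lyft` and `uber`. It lowercases the labels before counting because the outputs above show the subreddit recorded as `Lyft` and `uber`. `ignore_index=True` is an optional choice that gives the combined frame a fresh 0..n-1 index.

```python
import pandas as pd

# Made-up stand-ins for the `lyft` and `uber` frames.
lyft_demo = pd.DataFrame({"subreddit": ["Lyft"] * 3, "title": ["a", "b", "c"]})
uber_demo = pd.DataFrame({"subreddit": ["uber"] * 2, "title": ["d", "e"]})

# Stack the two frames and renumber the rows from 0.
combined = pd.concat([lyft_demo, uber_demo], ignore_index=True)

# Normalize label casing, then count posts per subreddit.
counts = combined["subreddit"].str.lower().value_counts()

print(combined.shape)
print(counts)
```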