# Project 3: Advanced Running Retargeting using NLP

---

## Overview 

"Couch to 5k" is a popular, free running program for beginners, created by Josh Clark, that helps non-runners become runners within 9 weeks. The program combines walking, running and cross-training in manageable increments, allowing people with little to no fitness base to start running. Not surprisingly, "Couch to 5k" has grown in popularity among beginner runners and now has an active subreddit where users share tips, questions, wins and experiences as they begin their running journey. Advanced runners also use Reddit as a community, discussing topics such as marathon training, endurance running and race results. [r/C25K](https://www.reddit.com/r/C25K/) and [r/AdvancedRunning](https://www.reddit.com/r/AdvancedRunning/) are the two subreddits analyzed in this project.

---
## Problem Statement

A mobile application company serving runners of all levels is looking to launch a retargeting marketing campaign to convert advanced runners who have downloaded the application into paid members. When the app is downloaded, users select their skill level. To understand these customers' needs, the "Couch-to-5k (C25K)" and "Advanced Running" subreddits will be analyzed and modeled using Natural Language Processing (NLP) with Random Forest and Logistic Regression classification techniques. This project aims to identify how the language used by beginner and advanced runners differs, in order to provide retargeting ad recommendations for the advanced-runner customer segment.

The data used for this project are from the following sources: [r/C25K](https://www.reddit.com/r/C25K/), [r/AdvancedRunning](https://www.reddit.com/r/AdvancedRunning/).

---

## Part 1: Data Wrangling/Web-Scraping

The data for this project came from web-scraping two subreddits. r/AdvancedRunning and r/BeginnersRunning were the originally intended data sources. After initial cleaning of r/BeginnersRunning, it was evident that there was not enough usable data to run through a classification model, so r/C25K was selected as the "beginner's running" data source instead. The following code sends requests to the Pushshift API (api.pushshift.io), a third-party archive of Reddit data, for the chosen subreddits. A delay of 2 seconds was added between requests to avoid overloading the server. Scraping continued until at least 15,000 posts had been collected from each subreddit. The scraped posts were then concatenated into a DataFrame and saved to CSV files for use in the next sections.

In [1]:
#imports
import numpy as np
import pandas as pd
import datetime, time
import json
import requests

#### Web-Scraping r/AdvancedRunning

In [2]:
#Credit to Gwen Rathgeber/Ben Mathis
subreddits = ['AdvancedRunning', 'BeginnersRunning']
kind = "submission"  # pull submissions (posts) rather than comments

# Establish URL base
BASE_URL = f"https://api.pushshift.io/reddit/search/{kind}" # also known as the "API endpoint"

last_date = datetime.datetime.utcfromtimestamp(time.time())     #current UTC time; reference point for the first day increment
posts = {}  #maps each subreddit name to a list of DataFrames, one per request
for subreddit in subreddits:
    posts[subreddit] = []
    day = 7                       #initial offset, in days, passed to the `after` filter
    cumulative_posts = 0
    while cumulative_posts < 15000:                           #collect 15,000 b/c the minimum is 10,000 and some scraped posts will be junk
        stem = f"{BASE_URL}?subreddit={subreddit}&size=100"   #base query; requests up to 100 posts per call
        URL = f"{stem}&after={day}d"                           #restrict results to posts created after `day` days ago
        print("Querying from: " + URL)
        try:                                                  #web requests fail often, so catch errors and keep looping
            res = requests.get(URL)
            assert res.status_code == 200
            data = res.json()['data']                         #named `data` so the imported json module is not shadowed
            df = pd.DataFrame(data)
            posts[subreddit].append(df)
            cumulative_posts += df.shape[0]
            final_date_pulled = datetime.datetime.utcfromtimestamp(df.iloc[-1, df.columns.get_loc('created_utc')])
            increment = (last_date - final_date_pulled).days + 1
            increment = increment if increment > 0 else 1
            day += increment
            last_date = final_date_pulled
            print('successful')
        except Exception as err:                              #catch request/parsing errors without swallowing KeyboardInterrupt
            print(f'Scrape for {URL}, {day} failed: {err}')

        time.sleep(2)                    #this is a delay in between scrapes

print("Query complete!")

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=AdvancedRunning&size=100&after=7d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=AdvancedRunning&size=100&after=10d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=AdvancedRunning&size=100&after=13d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=AdvancedRunning&size=100&after=16d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=AdvancedRunning&size=100&after=20d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=AdvancedRunning&size=100&after=24d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=AdvancedRunning&size=100&after=30d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=AdvancedRunning&size=100&after=35d
successful
Querying from: https://api.pushsh

#### Saving r/AdvancedRunning Posts to a DataFrame and CSV

In [4]:
raw_advanced = pd.concat(posts['AdvancedRunning'])
#raw_beginners = pd.concat(posts['BeginnersRunning'])

raw_advanced.to_csv('raw_advanced_running.csv', index=False)
#raw_beginners.to_csv('raw_beginners_running.csv', index=False)
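
The decision to drop r/BeginnersRunning rested on how many usable posts remained after cleaning. The sketch below shows one way such a check could look; it is not the cleaning code used in this project. The helper name `count_usable_posts` and the sample data are hypothetical, and it assumes the scraped DataFrame has `id` and `selftext` columns and that `[removed]` and `[deleted]` mark unusable post bodies.

```python
import pandas as pd

def count_usable_posts(df: pd.DataFrame) -> int:
    """Count unique posts whose body text is present and not a placeholder.

    Hypothetical helper: assumes `id` and `selftext` columns exist.
    """
    placeholders = {"", "[removed]", "[deleted]"}
    # Overlapping queries can return the same post twice, so dedupe on id
    unique = df.drop_duplicates(subset="id")
    # Treat missing bodies as empty strings and trim whitespace
    body = unique["selftext"].fillna("").str.strip()
    return int((~body.isin(placeholders)).sum())

# Tiny made-up sample standing in for a scraped subreddit
sample = pd.DataFrame({
    "id": ["a1", "a1", "b2", "c3", "d4"],
    "selftext": ["First 5k done!", "First 5k done!", "[removed]", None, "Week 3 was hard"],
})
usable = count_usable_posts(sample)
print(f"Usable unique posts: {usable}")
```

Running the same check on `raw_advanced` (and on a candidate beginner subreddit) would give a quick sense of whether enough text remains for a classifier before any further cleaning.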

#### Web-Scraping r/C25K

In [12]:
#Needed to pick a different subreddit as there were not enough unique posts in r/BeginnersRunning

#Credit to Gwen Rathgeber/Ben Mathis
subreddits = ['C25k']
kind = "submission"  # pull submissions (posts) rather than comments

# Establish URL base
BASE_URL = f"https://api.pushshift.io/reddit/search/{kind}" # also known as the "API endpoint"

last_date = datetime.datetime.utcfromtimestamp(time.time())     #current UTC time; reference point for the first day increment
posts = {}  #maps each subreddit name to a list of DataFrames, one per request
for subreddit in subreddits:
    posts[subreddit] = []
    day = 7                       #initial offset, in days, passed to the `after` filter
    cumulative_posts = 0
    while cumulative_posts < 15000:                           #collect 15,000 b/c the minimum is 10,000 and some scraped posts will be junk
        stem = f"{BASE_URL}?subreddit={subreddit}&size=100"   #base query; requests up to 100 posts per call
        URL = f"{stem}&after={day}d"                           #restrict results to posts created after `day` days ago
        print("Querying from: " + URL)
        try:                                                  #web requests fail often, so catch errors and keep looping
            res = requests.get(URL)
            assert res.status_code == 200
            data = res.json()['data']                         #named `data` so the imported json module is not shadowed
            df = pd.DataFrame(data)
            posts[subreddit].append(df)
            cumulative_posts += df.shape[0]
            final_date_pulled = datetime.datetime.utcfromtimestamp(df.iloc[-1, df.columns.get_loc('created_utc')])
            increment = (last_date - final_date_pulled).days + 1
            increment = increment if increment > 0 else 1
            day += increment
            last_date = final_date_pulled
            print('successful')
        except Exception as err:                              #catch request/parsing errors without swallowing KeyboardInterrupt
            print(f'Scrape for {URL}, {day} failed: {err}')

        time.sleep(2)                    #this is a delay in between scrapes

print("Query complete!")

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=C25k&size=100&after=7d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=C25k&size=100&after=8d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=C25k&size=100&after=9d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=C25k&size=100&after=10d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=C25k&size=100&after=11d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=C25k&size=100&after=12d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=C25k&size=100&after=14d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=C25k&size=100&after=18d
successful
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=C25k&size=100&after=23d
successful
Querying from

#### Saving r/C25K Posts to a DataFrame and CSV

In [13]:
raw_couch_5k = pd.concat(posts['C25k'])

raw_couch_5k.to_csv('raw_couch_5k.csv', index=False)
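
Both scraping loops above advance the `after` offset using the gap between the previous reference date and the timestamp of the last post returned. The sketch below isolates that arithmetic so it can be checked without network calls. The function name `next_day_offset` and the example dates are hypothetical; the formula mirrors the `increment` lines in the loops.

```python
import datetime

def next_day_offset(day, last_date, final_date_pulled):
    """Return the updated day offset and the new reference date.

    Mirrors the loop logic: advance by the whole-day gap plus one,
    and always move forward by at least one day.
    """
    increment = (last_date - final_date_pulled).days + 1
    increment = max(increment, 1)
    return day + increment, final_date_pulled

# Made-up timestamps: the last post pulled is 2 days and 6 hours older
last_date = datetime.datetime(2021, 1, 10, 12, 0)
final_date_pulled = datetime.datetime(2021, 1, 8, 6, 0)

day, last_date = next_day_offset(7, last_date, final_date_pulled)
print(day)  # 7 + (2 + 1) = 10
```

The `max(increment, 1)` guard matters when the last post pulled is newer than the reference date: the whole-day gap is then negative, and without the guard the offset would stall or move backwards.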

---

#### Next Section: Part 2 - Initial Data Cleaning