# Project 3: Webscraping, NLP and classification modelling

## Background: 

I am a marketing data analyst looking to optimize advertising efficiency. My target audience for this project is people who work in tech support, and my goal is to reach this audience effectively by using keywords that are relevant to them.

To this end, I have decided to explore Reddit in order to classify posts, based on natural language processing. For this project, I will be focusing on primarily text-based subreddits to enable more accurate text analysis. To increase the complexity, I will aim to classify posts from similar subreddits to tease out the nuances that make them different.

Thus, the subreddits I have chosen are  
1) https://www.reddit.com/r/talesfromtechsupport/  
2) https://www.reddit.com/r/talesfromcallcenters/

Combined, the two subreddits have a total of 800,000 members. Both serve a similar purpose, i.e., venting by support staff. It will therefore be interesting to see what sets them apart, and whether machine learning models can accurately classify posts as belonging to one or the other.

For the sake of this project, I will be focusing on accurately classifying posts that belong to the tech support group.

Therefore, from a data science perspective, the metric my model should optimize is accuracy.

## Problem Statement: 


### What are the indicative words that help to effectively target advertising to a niche user group (tech support staff)? 

## Data Science Challenge: 

### How can I accurately identify posts that belong to 2 subreddits that are very similar in purpose but show differences of nuance?

#### Overview of technical analysis: 

1) Data Scraping using Reddit API   
2) Exploratory Data Analysis  
3) Natural Language Processing and Classification Modelling (Logistic Regression and Naive Bayes)  
4) Advanced Modelling with tree-based ensembles (Random Forest, Extra Trees, AdaBoost, Gradient Boosting) and a Support Vector Machine, plus Optimization  
5) Fresh Reddit scrape for 'true' unseen data  
6) Final Modelling with optimized parameters  

#### In addition to the GA project requirement, I have also looked into a personal area of interest:

7) Sentiment Analysis and Topic Modelling Visualization (Latent Dirichlet Allocation)   



#### For ease of viewing, I have split each section into its own Jupyter notebook.

# Imports and Data Scraping

In [2]:
# library imports
import requests
import time
import pandas as pd
import ast # to convert string data to indexable list of dictionaries
from tqdm import tqdm

In [39]:
# create header parameter for API
headers_dict = {'User-agent':'hello-reddit-i-am-totally-not-a-bot'}

In [42]:
# instantiate API variables
url = 'https://reddit.com/'
sub1_url = url + 'r/talesfromcallcenters'           # setting sub1 
sub2_url = url + 'r/talesfromtechsupport'        # setting sub2 

limit_num = 100     # API 'limit' parameter

sub1_after = None  # instantiate empty counters for API 'after' parameter
sub2_after = None

sub1_pages = []    # instantiate empty lists to save API results
sub2_pages = []

for i in range(20): # pull from API 20 times
    
    # add 'after' parameters if an id has been saved - starts as None
    if sub1_after and sub2_after:
        # create full API url for sub1
        sub1_after_url = sub1_url + '.json?limit=' \
                            + str(limit_num) + '&after=' \
                            + sub1_after
        print(sub1_after_url)
        
        # create full API url for sub2
        sub2_after_url = sub2_url + '.json?limit=' \
                            + str(limit_num) + '&after=' \
                            + sub2_after
        print(sub2_after_url)
    
    # if one after is logged and the other is not
    elif bool(sub1_after) != bool(sub2_after):
        print('After reference out of sync.')
        break
    
    else:
        # create first run url
        sub1_after_url = sub1_url + '.json?limit=' + str(limit_num)
        sub2_after_url = sub2_url + '.json?limit=' + str(limit_num)
    
    # pull json from sub1
    sub1_res = requests.get(sub1_after_url, headers=headers_dict)
    print(i, sub1_res.status_code)
    
    # if sub1 connection is established
    if sub1_res.status_code == 200:
        # parse the response once and reuse it
        sub1_data = sub1_res.json()['data']

        # add page to list
        sub1_pages.append(sub1_data)
        print('sub1_pages length: ', len(sub1_pages))
        
        # set 'after' parameter for next run
        sub1_after = sub1_data['after']
        print('sub01_after: ', sub1_after)
        
    else:        
        print('Connection failed.\n')
    
    # sleep one second
    time.sleep(1)
    
    # pull json from sub2
    sub2_res = requests.get(sub2_after_url, headers=headers_dict)
    print(i, sub2_res.status_code)
    
    # if sub2 connection is established
    if sub2_res.status_code == 200:
        # parse the response once and reuse it
        sub2_data = sub2_res.json()['data']

        # add page to list
        sub2_pages.append(sub2_data)
        print('sub2_pages length: ', len(sub2_pages))
        
        # set 'after' parameter for next run
        sub2_after = sub2_data['after']
        print('sub2_after: ', sub2_after)
    else:
        print('Connection failed.\n')
        
    # sleep 2 seconds    
    time.sleep(2)

0 200
sub1_pages length:  1
sub01_after:  t3_jpe44f
0 200
sub2_pages length:  1
sub2_after:  t3_jl1a98
https://reddit.com/r/talesfromcallcenters.json?limit=100&after=t3_jpe44f
https://reddit.com/r/talesfromtechsupport.json?limit=100&after=t3_jl1a98
1 200
sub1_pages length:  2
sub01_after:  t3_jcxa01
1 200
sub2_pages length:  2
sub2_after:  t3_j5r758
https://reddit.com/r/talesfromcallcenters.json?limit=100&after=t3_jcxa01
https://reddit.com/r/talesfromtechsupport.json?limit=100&after=t3_j5r758
2 200
sub1_pages length:  3
sub01_after:  t3_j2r3sl
2 200
sub2_pages length:  3
sub2_after:  t3_ipjvct
https://reddit.com/r/talesfromcallcenters.json?limit=100&after=t3_j2r3sl
https://reddit.com/r/talesfromtechsupport.json?limit=100&after=t3_ipjvct
3 200
sub1_pages length:  4
sub01_after:  t3_it4gmh
3 200
sub2_pages length:  4
sub2_after:  t3_icnreb
https://reddit.com/r/talesfromcallcenters.json?limit=100&after=t3_it4gmh
https://reddit.com/r/talesfromtechsupport.json?limit=100&after=t3_icnreb
4 20
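
The loop above repeats the same request logic for each subreddit. One way to factor it out is sketched below. This is only a sketch: `fetch_listing_pages` and its `get` and `pause` parameters are hypothetical names introduced here and are not part of the original notebook. The `get` argument defaults to `requests.get`, and the query string is passed through its `params` argument instead of being concatenated by hand. The sketch also assumes that a missing `after` token means there are no further pages.

```python
import time


def fetch_listing_pages(sub_url, n_pages=20, limit=100, headers=None,
                        get=None, pause=1.0):
    """Collect up to n_pages listing pages from a subreddit's .json endpoint.

    Hypothetical helper: `get` is any callable with the same signature as
    requests.get; it falls back to requests.get when not supplied.
    """
    if get is None:
        import requests  # imported lazily so a stub can be passed instead
        get = requests.get

    pages, after = [], None
    for _ in range(n_pages):
        # build the query string from a dict rather than by concatenation
        params = {'limit': limit}
        if after:
            params['after'] = after

        res = get(sub_url + '.json', params=params, headers=headers)
        if res.status_code != 200:
            print('Connection failed:', res.status_code)
            break

        data = res.json()['data']  # parse the response once
        pages.append(data)

        after = data['after']
        if after is None:          # assumed to mean: no further pages
            break
        time.sleep(pause)          # stay polite between requests
    return pages
```

With the variables defined earlier, a call such as `fetch_listing_pages(sub1_url, headers=headers_dict)` would play the role of the `sub1_pages` list. Because each subreddit is fetched independently, the "after reference out of sync" check is no longer needed.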

In [43]:
# create DataFrames from posting lists
sub1_df = pd.DataFrame(sub1_pages)
sub2_df = pd.DataFrame(sub2_pages)

In [44]:
# save API data to files
sub1_df.to_csv('../datasets/sub1_scrape_27nov.csv', index=False)
sub2_df.to_csv('../datasets/sub2_scrape_27nov.csv', index=False)

# Checkpoint after data scraping

In [3]:
sub1_df = pd.read_csv('../datasets/sub1_scrape_27nov.csv')
sub2_df = pd.read_csv('../datasets/sub2_scrape_27nov.csv')

In [4]:
# the CSV stores each page's list of posts as a string; parse it back into Python objects
sub1_df['children'] = sub1_df['children'].map(ast.literal_eval)
sub2_df['children'] = sub2_df['children'].map(ast.literal_eval)
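
The `ast.literal_eval` step above is needed because a CSV file can only hold text: a list of dictionaries is written out as its string representation and comes back as a plain string. The sketch below illustrates that round trip on made-up data. As an assumption for the sake of a dependency-free example, it uses the standard-library `csv` module as a stand-in for `to_csv` / `read_csv`; the toy `children` value is invented.

```python
import ast
import csv
import io

# Toy stand-in for one page's 'children' value (made-up data).
children = [{'data': {'title': 'Hello', 'ups': 3, 'approved_at_utc': None}}]

# Write the value to an in-memory CSV; non-string cells are stored via str().
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['children'])
writer.writerow([children])

# Read it back: the cell is now a string, not a list.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
as_read = rows[0]['children']

# literal_eval safely rebuilds the Python objects from that string.
parsed = ast.literal_eval(as_read)

print(type(as_read).__name__, type(parsed).__name__)
print(parsed[0]['data']['title'])
```

`ast.literal_eval` only accepts Python literals (strings, numbers, lists, dicts, `None`, and so on), which makes it a safer choice than `eval` for text read from a file.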

# Converting the scrape into dataframes

In [5]:

# keep each page's list of post dictionaries as a Series

sub1 = sub1_df['children']
sub2 = sub2_df['children']

In [6]:
sub1[0][0]['data']

{'approved_at_utc': None,
 'subreddit': 'talesfromcallcenters',
 'selftext': 'I get that tempers are shorter these days, but I am having a hard time lately with people taking out their frustrations on me, because I happen to be convenient.\n\nI work in an inbound support call centre, and we try very hard not to release the call if possible.  But I spoke to a real “Special” person today.\n\nShe’s going to file a complaint because I pointed out that she violated our terms of use.\n\nI am so close to being done with all of this....  (the job, I mean)\n\n\nThanks for the awards!\nThanks, everyone, for your support.  Definitely feeling better.',
 'author_fullname': 't2_3zqnh8ue',
 'saved': False,
 'mod_reason_title': None,
 'gilded': 0,
 'clicked': False,
 'title': 'Just need to vent...',
 'link_flair_richtext': [],
 'subreddit_name_prefixed': 'r/talesfromcallcenters',
 'hidden': False,
 'pwls': 6,
 'link_flair_css_class': 'short',
 'downs': 0,
 'thumbnail_height': None,
 'top_awarded_type'

In [7]:
#create list of titles
sub1_titles = [sub1[i][j]['data']['title'] for i in range(len(sub1))
            for j in range(len(sub1[i]))]


sub2_titles = [sub2[i][j]['data']['title'] for i in range(len(sub2)) 
            for j in range(len(sub2[i]))]

In [8]:
# create list of post bodies (selftext) using nested comprehensions
sub1_posts = [sub1[i][j]['data']['selftext'] for i in range(len(sub1)) 
            for j in range(len(sub1[i]))]

sub2_posts = [sub2[i][j]['data']['selftext'] for i in range(len(sub2)) 
            for j in range(len(sub2[i]))]

In [9]:
# create list of upvotes using nested comprehensions
sub1_ups = [sub1[i][j]['data']['ups'] for i in range(len(sub1)) 
            for j in range(len(sub1[i]))]

sub2_ups = [sub2[i][j]['data']['ups'] for i in range(len(sub2)) 
            for j in range(len(sub2[i]))]

In [10]:
# create list of gilded counts using nested comprehensions
sub1_gilded = [sub1[i][j]['data']['gilded'] for i in range(len(sub1)) 
            for j in range(len(sub1[i]))]

sub2_gilded = [sub2[i][j]['data']['gilded'] for i in range(len(sub2)) 
            for j in range(len(sub2[i]))]

In [11]:
# compile lists into DataFrame
sub1_df = pd.DataFrame([sub1_titles, sub1_posts, sub1_ups, sub1_gilded], index=['title','post','upvotes','gilded'])

In [12]:
#transpose DF
sub1_df = sub1_df.T

In [13]:
sub1_df.head()

|   | title | post | upvotes | gilded |
|---|---|---|---|---|
| 0 | Just need to vent... | I get that tempers are shorter these days, but... | 230 | 0 |
| 1 | Reverse call center post | On mobile so I hope I do this right. \n\nI ha... | 39 | 0 |
| 2 | "So you're willing to lose a customer for $3 d... | I work for a car rental company as a specialis... | 763 | 0 |
| 3 | Free Talk Friday - Nov 27 | Welcome to Free Talk Friday! We are suspending... | 0 | 0 |
| 4 | Accidentally Exposed a Family Fraud | I work for a small local ISP. One of the thin... | 958 | 0 |


In [14]:
# compile lists into DataFrame
sub2_df = pd.DataFrame([sub2_titles, sub2_posts, sub2_ups, sub2_gilded], index=['title','post','upvotes', 'gilded'])

In [15]:
#transpose DF
sub2_df = sub2_df.T

In [16]:
# binarize the classifier: 'belongs_to_sub2' 
sub1_df['belongs_to_sub2'] = 0
sub2_df['belongs_to_sub2'] = 1
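
The four nested comprehensions above each walk the same page/post nesting. As an alternative, the fields can be collected in a single pass. The sketch below shows one way to do that; `flatten_posts` is a hypothetical helper, not part of the original notebook, and the toy input is made up.

```python
def flatten_posts(pages, label):
    """Turn pages of Reddit 'children' into one flat list of row dicts.

    Hypothetical helper: `pages` is an iterable of lists of post
    dictionaries, shaped like sub1 / sub2 above.
    """
    rows = []
    for page in pages:
        for child in page:
            post = child['data']
            rows.append({
                'title': post['title'],
                'post': post['selftext'],
                'upvotes': post['ups'],
                'gilded': post['gilded'],
                'belongs_to_sub2': label,  # 0 for sub1, 1 for sub2
            })
    return rows


# Toy input with the same nesting as the scraped data (values are made up).
toy_pages = [
    [{'data': {'title': 'A', 'selftext': 'first', 'ups': 3, 'gilded': 0}}],
    [{'data': {'title': 'B', 'selftext': 'second', 'ups': 7, 'gilded': 1}}],
]
rows = flatten_posts(toy_pages, label=0)
print(len(rows), rows[1]['title'])
```

Passing the result to `pd.DataFrame(...)`, for example `pd.DataFrame(flatten_posts(sub1, label=0))`, yields one row per post with named columns, so no transpose is needed and numeric columns such as `upvotes` keep a numeric dtype instead of `object`.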

In [17]:
sub1_df.to_csv('../datasets/sub1_df_27_nov.csv', index=False)
sub2_df.to_csv('../datasets/sub2_df_27_nov.csv', index=False)

In [19]:
# combine the two subs (original row indices are kept, so index labels repeat across subs)
df = pd.concat([sub1_df, sub2_df])

In [20]:
# replace missing post bodies with a blank so the title + post concatenation does not yield NaN
df['post'] = df['post'].fillna(' ')

In [21]:
df['title_x_post'] = df['title'] + ' ' + df['post']

In [22]:
df.belongs_to_sub2.value_counts()

1    901
0    842
Name: belongs_to_sub2, dtype: int64

In [23]:
#check distribution of target variable
df.belongs_to_sub2.value_counts(normalize= True)

1    0.516925
0    0.483075
Name: belongs_to_sub2, dtype: float64
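
Since accuracy is the metric being optimized, the class distribution above also sets the benchmark: a model that always predicts the majority class would already reach the majority share. The short calculation below works that out from the counts printed above; the variable names are introduced here for illustration only.

```python
# Class counts taken from the value_counts() output above.
counts = {1: 901, 0: 842}

total = sum(counts.values())                   # all posts in the dataset
majority_class = max(counts, key=counts.get)   # label with the most posts
baseline_accuracy = counts[majority_class] / total

print(total, majority_class, round(baseline_accuracy, 6))
```

Any classifier trained on this data should score above this baseline accuracy (about 0.517) to be considered useful.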

In [24]:
df.head()

|   | title | post | upvotes | gilded | belongs_to_sub2 | title_x_post |
|---|---|---|---|---|---|---|
| 0 | Just need to vent... | I get that tempers are shorter these days, but... | 230 | 0 | 0 | Just need to vent... I get that tempers are sh... |
| 1 | Reverse call center post | On mobile so I hope I do this right. \n\nI ha... | 39 | 0 | 0 | Reverse call center post On mobile so I hope I... |
| 2 | "So you're willing to lose a customer for $3 d... | I work for a car rental company as a specialis... | 763 | 0 | 0 | "So you're willing to lose a customer for $3 d... |
| 3 | Free Talk Friday - Nov 27 | Welcome to Free Talk Friday! We are suspending... | 0 | 0 | 0 | Free Talk Friday - Nov 27 Welcome to Free Talk... |
| 4 | Accidentally Exposed a Family Fraud | I work for a small local ISP. One of the thin... | 958 | 0 | 0 | Accidentally Exposed a Family Fraud I work for... |


In [25]:
df.tail()

|   | title | post | upvotes | gilded | belongs_to_sub2 | title_x_post |
|---|---|---|---|---|---|---|
| 896 | TFTS Top Tales - April 2020 | Hi Everybody!\n\nHere's another month of Top T... | 37 | 0 | 1 | TFTS Top Tales - April 2020 Hi Everybody!\n\nH... |
| 897 | I'm a corporate developer i know what im talki... | Hi, This is another story about working Tech S... | 187 | 0 | 1 | I'm a corporate developer i know what im talki... |
| 898 | "How did my contract details get on this websi... | So I am the youngest person at my work and som... | 585 | 0 | 1 | "How did my contract details get on this websi... |
| 899 | No ma'am I cant turn that off... | I work as a printer technician, but specifical... | 1310 | 0 | 1 | No ma'am I cant turn that off... I work as a p... |
| 900 | Tycho Electric Anomaly | Thirty-nine years ago I did telephone tech sup... | 401 | 0 | 1 | Tycho Electric Anomaly Thirty-nine years ago I... |


# Final Save

In [26]:
df.to_csv('../datasets/combined_27_nov_df.csv', index=False)