<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Problem-Statement" data-toc-modified-id="Problem-Statement-1">Problem Statement</a></span></li><li><span><a href="#Executive-Summary" data-toc-modified-id="Executive-Summary-2">Executive Summary</a></span></li><li><span><a href="#Imports" data-toc-modified-id="Imports-3">Imports</a></span></li><li><span><a href="#Logging" data-toc-modified-id="Logging-4">Logging</a></span></li><li><span><a href="#Web-scraping" data-toc-modified-id="Web-scraping-5">Web scraping</a></span><ul class="toc-item"><li><span><a href="#Automated-scraping-functions" data-toc-modified-id="Automated-scraping-functions-5.1">Automated scraping functions</a></span></li><li><span><a href="#Today-I-Learned-Scrape" data-toc-modified-id="Today-I-Learned-Scrape-5.2">Today I Learned Scrape</a></span></li><li><span><a href="#Shower-Thoughts-Scrape" data-toc-modified-id="Shower-Thoughts-Scrape-5.3">Shower Thoughts Scrape</a></span></li></ul></li><li><span><a href="#Concat-&amp;-Save-Joint-DataFrames" data-toc-modified-id="Concat-&amp;-Save-Joint-DataFrames-6">Concat &amp; Save Joint DataFrames</a></span></li></ul></div>

# Problem Statement
A plagiarism-detection company has asked us to identify plagiarized concepts rather than just copy-pasted words. Using NLP, we can look for clues in language structure and content to differentiate between original and learned ideas.

![pikachudetective](../Images/pikachu.jpg)

To tackle this problem, we scrape Reddit for learned and original content using the subreddits Today I Learned and Shower Thoughts, respectively. We then use predictive modeling to classify the content as learned or original, and evaluate the results so we can deploy an appropriate separation threshold.

Simple Unit Testing and Logging are included for deployment.

# Executive Summary
To build an idea-plagiarism model, we used **Reddit data** from **ShowerThoughts** and **Today I Learned**, containing **1504 observations**. The data consists of post titles pulled from these subreddits on October 17, 2019.


We fit several models to classify this data into the two categories, the best one being an **MLPClassifier**. To optimize this model, we ran an extensive grid search over FeatureUnion vectorizers (tfidf and cvec).


The resulting train and test scores were 0.84 and **0.86**, respectively. At the 0.5 threshold, the **precision was 0.83** and the **recall was 0.92**.

We then set up a simple threshold modifier to suit the client's needs. For instance, we might want to be more severe with papers submitted for publication than with high school students' work.
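
The threshold idea can be illustrated with a minimal sketch. It assumes the model exposes a positive-class probability per item (for example, one column of `predict_proba`); the function name `classify_with_threshold` and the example probabilities are hypothetical and not taken from the project code.

```python
def classify_with_threshold(probabilities, threshold=0.5):
    """Return 1 for each probability at or above the threshold, else 0."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical positive-class probabilities for four texts
example_probs = [0.35, 0.55, 0.72, 0.91]

at_default = classify_with_threshold(example_probs, threshold=0.5)
at_higher = classify_with_threshold(example_probs, threshold=0.8)

print(at_default)
print(at_higher)
```

Raising the threshold flags fewer items as positive and lowering it flags more, so the client can trade precision against recall without retraining the model.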

# Imports

In [1]:
import requests
import time
import pandas as pd
import numpy as np
import regex as re
import nltk
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
import pytest
import logging
import sys

from nltk.stem                       import WordNetLemmatizer
from nltk.tokenize                   import RegexpTokenizer
from bs4                             import BeautifulSoup  
from nltk.corpus                     import stopwords
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.pipeline                import Pipeline,FeatureUnion
from sklearn.model_selection         import cross_val_score, train_test_split, GridSearchCV
from sklearn.ensemble                import VotingClassifier, AdaBoostClassifier
from sklearn.feature_selection       import VarianceThreshold
from sklearn.linear_model            import LogisticRegression
from sklearn.tree                    import DecisionTreeClassifier
from sklearn.naive_bayes             import MultinomialNB, GaussianNB
from sklearn.neural_network          import MLPClassifier
from sklearn.preprocessing           import StandardScaler
from sklearn.svm                     import SVC
from sklearn.metrics                 import confusion_matrix,classification_report

warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

# Logging

Since we want to deploy the scraper on live data, we log our actions so that the project can be monitored.

In [2]:
# Configure the root logger to print to stdout; the log file handler is attached in the next cell
logging.basicConfig(format='%(asctime)s | %(levelname)s : %(message)s',
                     level=logging.ERROR, stream=sys.stdout)
logger = logging.getLogger()

#To avoid long web scraping outputs
logging.getLogger("urllib3").setLevel(logging.WARNING)

In [3]:
# Attach a file handler with a readable format so messages are also appended to mylog.log
fhandler = logging.FileHandler(filename='mylog.log', mode='a')
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fhandler.setFormatter(formatter)
logger.addHandler(fhandler)
logger.setLevel(logging.DEBUG)
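
As a rough sketch of how the scraping steps could report to this logger instead of printing, the helper below records the outcome of a request. The name `log_response_status` is hypothetical, and treating any 2xx code as success is an assumption that follows the "should be 2xx" comment in the request function.

```python
import logging


def log_response_status(status_code, log=None):
    """Log a request outcome and return True when the status code is 2xx."""
    if log is None:
        log = logging.getLogger(__name__)
    ok = 200 <= status_code < 300
    if ok:
        log.info("Request succeeded with status %s", status_code)
    else:
        log.error("Request failed with status %s", status_code)
    return ok
```

A call such as `log_response_status(res.status_code, logger)` could stand in for the bare `print(res.status_code)`, so failures end up in the log file as well as on screen.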

# Web scraping

## Automated scraping functions

In [4]:
#web requests
def request(url_base, user):
    #User
    user_agent = {"User-agent": user}      # header to prevent 429 error
    #Requests
    res = requests.get(url = url_base,
                   headers = user_agent)
    #check status - should be 2xx
    print(res.status_code)
    return res, res.json()

In [5]:
#pulling
def pulling(url_base, user_agent):
    # empty list
    posts = []
    
    # Instantiate after string
    after = None
    
    for pull_num in range(30):
        
        # Count message
        print(f'Pull number {pull_num+1}')
        
        # url updates
        if after is None:
            new_url = url_base
        else:
            new_url = url_base+'?after='+after
        
        # request
        res = requests.get(url = new_url,
                           headers = user_agent)
    
        # data extraction
        if res.status_code == 200:
            json_data = res.json()                     
            posts.extend(json_data['data']['children']) 
            
            # update string
            after = json_data['data']['after']
        else:
            print("We've run into an error. The status code is:", res.status_code)
            break
    
        # wait
        time.sleep(2)
        print("We have:", len(set([p['data']['name'] for p in posts])), "posts in this subreddit")
    return posts
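
The pagination logic above can be sketched without any network access. This is an illustrative variant, not the notebook's method: `collect_pages`, `fetch_page`, and `FAKE_PAGES` are hypothetical names, and stopping when the `after` token comes back as `None` is an added assumption about how the end of a listing is signalled.

```python
def collect_pages(fetch_page, max_pulls=30):
    """Gather items page by page until the listing ends or max_pulls is reached.

    fetch_page is a hypothetical callable that takes the current `after`
    token (None for the first page) and returns (items, next_after).
    """
    items, seen, after = [], set(), None
    for _ in range(max_pulls):
        page, after = fetch_page(after)
        for item in page:
            name = item["data"]["name"]
            if name not in seen:  # skip posts already collected
                seen.add(name)
                items.append(item)
        if after is None:  # no further page to request
            break
    return items


# Stand-in for the network call: two pages, then the listing ends
FAKE_PAGES = {
    None: ([{"data": {"name": "t3_a"}}, {"data": {"name": "t3_b"}}], "t3_b"),
    "t3_b": ([{"data": {"name": "t3_b"}}, {"data": {"name": "t3_c"}}], None),
}

posts = collect_pages(lambda after: FAKE_PAGES[after])
names = [p["data"]["name"] for p in posts]
print(names)
```

Separating the loop from the request makes the paging behaviour easy to unit test, and de-duplicating on `name` keeps the collected list consistent with the unique-post count the pull loop prints.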

In [6]:
# Make DataFrame
def scrape_df(posts, file_name, save_path):
    # Collect the subreddit, title and body text of every post
    redits = []
    titles = []
    texts = []
    for post in posts:
        redits.append(post['data']['subreddit'])
        titles.append(post['data']['title'])
        texts.append(post['data']['selftext'])
    df = pd.DataFrame({'subred': redits, 'title': titles, 'text': texts})
    # Save to CSV and return the DataFrame for further use
    df.to_csv(f'{save_path}/{file_name}')
    return df
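
The field extraction can also be sketched as a plain function that tolerates missing keys. The name `posts_to_records` and the sample posts are hypothetical; the column names mirror those used in `scrape_df`, and defaulting absent fields to an empty string is an assumption.

```python
def posts_to_records(posts):
    """Flatten raw post dicts into rows with subred, title and text fields."""
    records = []
    for post in posts:
        data = post.get("data", {})
        records.append({
            "subred": data.get("subreddit", ""),
            "title": data.get("title", ""),
            "text": data.get("selftext", ""),  # empty for title-only posts
        })
    return records


# Hypothetical sample in the same shape as the scraped JSON
sample_posts = [
    {"data": {"subreddit": "todayilearned", "title": "TIL something", "selftext": ""}},
    {"data": {"subreddit": "Showerthoughts", "title": "A thought"}},
]

rows = posts_to_records(sample_posts)
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`, which would produce the same three columns.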

## Today I Learned Scrape

In [7]:
res, json = request(url_base="https://www.reddit.com/r/todayilearned.json", user='ambar')

200


In [8]:
til_posts = pulling(url_base="https://www.reddit.com/r/todayilearned.json", user_agent={"User-agent": 'ambar'})

Pull number 1
We have: 25 posts in this subreddit
Pull number 2
We have: 50 posts in this subreddit
Pull number 3
We have: 75 posts in this subreddit
Pull number 4
We have: 100 posts in this subreddit
Pull number 5
We have: 125 posts in this subreddit
Pull number 6
We have: 150 posts in this subreddit
Pull number 7
We have: 175 posts in this subreddit
Pull number 8
We have: 200 posts in this subreddit
Pull number 9
We have: 225 posts in this subreddit
Pull number 10
We have: 250 posts in this subreddit
Pull number 11
We have: 275 posts in this subreddit
Pull number 12
We have: 300 posts in this subreddit
Pull number 13
We have: 325 posts in this subreddit
Pull number 14
We have: 350 posts in this subreddit
Pull number 15
We have: 375 posts in this subreddit
Pull number 16
We have: 400 posts in this subreddit
Pull number 17
We have: 425 posts in this subreddit
Pull number 18
We have: 450 posts in this subreddit
Pull number 19
We have: 475 posts in this subreddit
Pull number 20
We have: 

In [9]:
til_df=scrape_df(til_posts, 'til_posts.csv', '../Data');

## Shower Thoughts Scrape

In [10]:
res, json = request(url_base="https://www.reddit.com/r/Showerthoughts.json", user='ambar')

200


In [11]:
st_posts = pulling(url_base="https://www.reddit.com/r/Showerthoughts.json", user_agent={"User-agent": 'ambar'})

Pull number 1
We have: 27 posts in this subreddit
Pull number 2
We have: 52 posts in this subreddit
Pull number 3
We have: 77 posts in this subreddit
Pull number 4
We have: 102 posts in this subreddit
Pull number 5
We have: 127 posts in this subreddit
Pull number 6
We have: 152 posts in this subreddit
Pull number 7
We have: 177 posts in this subreddit
Pull number 8
We have: 202 posts in this subreddit
Pull number 9
We have: 227 posts in this subreddit
Pull number 10
We have: 252 posts in this subreddit
Pull number 11
We have: 277 posts in this subreddit
Pull number 12
We have: 302 posts in this subreddit
Pull number 13
We have: 327 posts in this subreddit
Pull number 14
We have: 352 posts in this subreddit
Pull number 15
We have: 377 posts in this subreddit
Pull number 16
We have: 402 posts in this subreddit
Pull number 17
We have: 427 posts in this subreddit
Pull number 18
We have: 452 posts in this subreddit
Pull number 19
We have: 477 posts in this subreddit
Pull number 20
We have: 

In [12]:
st_df=scrape_df(st_posts, 'st_posts.csv', '../Data');

# Concat & Save Joint DataFrames 

In [13]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
til_df = pd.concat([til_df, st_df])
FullRedditScrape = til_df.reset_index(drop=True)
FullRedditScrape.to_csv('../Data/FullRedditScrape.csv')