Title: What's a life pro tip, anyways?
Date: 2019-05-12 12:00
Tags: python
Slug: reddit

What's the difference between a life pro tip and one that is a bit more questionable? The difference can be subtle, or fall into a moral grey area that is hard to judge, even for humans.

Life Pro Tip: 
> A concise and specific tip that improves life for you and those around you in a specific and significant way.

<br>

_Example_: "If you want to learn a new language, figure out the 100 most frequently used words and start with them. Those words make up about 50% of everyday speech, and should be a very solid basis."

> An Unethical Life Pro Tip is a tip that improves your life in a meaningful way, perhaps at the expense of others and/or with questionable legality. Due to their nature, do not actually follow any of these tips–they're just for fun. 

<br>

_Example_: "Save business cards of people you don't like. If you ever hit a parked car accidentally, just write 'sorry' on the back and leave it on the windshield."

Let's collect posts (web scraping) from the two subreddits and build a machine learning model using Natural Language Processing (NLP) to classify which subreddit a particular post belongs to. Can my model pick up on sarcasm, internet 'trolling', or the tongue-in-cheek semantics of a sentence? Probably not, but let's try. I hope you have as much fun playing with this as I did making it.

If you're feeling lucky, visit my app for a Life Pro Tip!

---

# Reddit API 

Fortunately, Reddit provides a public JSON endpoint, so we can easily consume that format and load it into a Pandas DataFrame. Simply add `.json` to the end of the URL.

If you plan to run your own `get` requests, keep in mind that Reddit returns 25 posts per request by default. Page through the results in a `for` loop and call `time.sleep()` between requests (or something equivalent) to avoid a 429 Too Many Requests error.
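
Here is a minimal sketch of that loop. The helper names (`listing_url`, `collect_posts`) and the injected `fetch_json` callable are my own assumptions for illustration, not the code in reddit_garry.py. It relies on the `after` token that each listing page carries in its JSON payload to request the next page.

```python
import time
from typing import Callable, Optional


def listing_url(subreddit: str, after: Optional[str] = None, limit: int = 25) -> str:
    """Build the URL for one page of a subreddit's public JSON listing."""
    url = f"https://www.reddit.com/r/{subreddit}.json?limit={limit}"
    if after:
        url += f"&after={after}"  # token pointing at the next page
    return url


def collect_posts(subreddit: str, fetch_json: Callable[[str], dict],
                  pages: int = 3, pause: float = 2.0) -> list:
    """Collect posts page by page, sleeping between requests."""
    posts, after = [], None
    for page in range(pages):
        if page:
            time.sleep(pause)  # be polite: wait before every request after the first
        payload = fetch_json(listing_url(subreddit, after))
        posts.extend(child["data"] for child in payload["data"]["children"])
        after = payload["data"]["after"]
        if after is None:  # no more pages
            break
    return posts
```

With the `requests` package installed, `fetch_json` could be something like `lambda url: requests.get(url, headers={"User-Agent": "my-scraper/0.1"}).json()` (the user agent string is a placeholder), and `pd.DataFrame(collect_posts("LifeProTips", fetch_json))` would then give one row per post.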



## Data dictionary

We are interested in the following features:

<table class="table table-responsive table-bordered">
<thead>
</thead>
<tbody>
<tr>
<td><b>Target variable, y </b></td>
<td> <p> subreddit (str) </p> </td>
</tr>
    
<tr>
<td><b>Design matrix, X </b></td>
<td> 
<p>title (str)</p>
<p>score (int)</p>
<p>num_comments (int)</p>
<p>author (str)</p>
<p>name (str)</p>
</td></tr>   

</tbody>
</table>

--- 
# Pre-processing data


First, I have to pre-process the data and use a natural language processing package to tokenize strings into individual words. We will be using Python's Natural Language Toolkit (nltk) package.

Follow along if you want to create your own classifier, otherwise, skip to the results. All code is available on GitHub. Refer to reddit_garry.py for scraping.

In [65]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder, StandardScaler, LabelBinarizer
from sklearn_pandas import DataFrameMapper
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV 
from sklearn.linear_model import LogisticRegression
from sklearn import svm
import pickle
from sklearn.pipeline import Pipeline
from sklearn.exceptions import DataConversionWarning
import warnings
warnings.filterwarnings(action='ignore', category=DataConversionWarning)
warnings.filterwarnings(action='ignore', category=FutureWarning) 
np.set_printoptions(suppress=True)
pd.set_option('display.max_colwidth', None)  # None = no truncation (-1 is deprecated since pandas 1.0)
from IPython.display import HTML

In [66]:
# encoding utf-8 for special characters
raw_lpt = pd.read_csv("./lpt.csv", encoding='utf-8')
raw_ulpt = pd.read_csv("./ulpt.csv", encoding='utf-8')

Merge the two datasets, then split them into training and test sets.

Notice how there is "ULPT" or "LPT" in the title, which is clearly target leakage. To prevent it, I will use regular expressions (regex) to match case variants of LPT and ULPT (e.g. lpt, ulpt) and remove them.
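
As a quick sanity check, here is the same pattern used in the cells below, applied to a few made-up titles. Because the pattern is not anchored to the start of the title, it also eats "ulpt" in the middle of a word. The `ANCHORED` variant is my own suggestion (an assumption, not what the notebook uses) for matching the prefix only.

```python
import re

# Pattern from the notebook: optional "U", then "LPT", optional spaces and colons.
PREFIX = re.compile(r"[uU]*[lL][pP][tT]\s*:*")

# Hypothetical stricter variant: only match the tag at the very start of the title.
ANCHORED = re.compile(r"^\s*u?lpt\b[\s:]*", re.IGNORECASE)

titles = [
    "LPT: Save your receipts.",
    "ULPT: Save business cards.",
    "lpt Drink water.",
]

cleaned = [PREFIX.sub("", t).strip() for t in titles]
print(cleaned)

# The unanchored pattern also fires inside ordinary words.
print(PREFIX.sub("", "sculpts"))
print(ANCHORED.sub("", "sculpts are fun"))
```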

In [67]:
df = pd.merge(raw_lpt, raw_ulpt, how='outer')
HTML(df.sample(2).to_html(classes="table table-responsive table-striped table-bordered"))

Unnamed: 0,author,name,num_comments,score,subreddit,title
1287,UnfairCorner,t3_a76prs,285,9174,UnethicalLifeProTips,"ULPT: If you have two pets of the same breed and colour, you only have to get pet insurance for one."
416,mnkymnk,t3_aet1r2,609,25580,LifeProTips,"LPT: Take a videocamera and spend 10min filming every room and every item in your house. Upload footage to the cloud. If you are ever in the unfortunate situation of a house-fire, this will make the insurance claim thousand times easier."


In [79]:
y = df.subreddit
X = df.drop(["subreddit",'name'],axis=1) # drop name, it's a unique identifier, not predictive
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [114]:
def post_to_words(raw_post):
    '''Returns a string of space-separated words ready for classification, by tokenizing,
    removing punctuation, setting to lower case and removing stop words.'''
    tokenizer = RegexpTokenizer(r'[a-z]+')
    words = tokenizer.tokenize(raw_post.lower())
    stop_words = set(stopwords.words('english'))  # build the set once, not once per word
    meaningful_words = [w for w in words if w not in stop_words]
    return " ".join(meaningful_words)

In [80]:
X_train.loc[:,"title_clean"] = X_train["title"].apply(lambda row : re.sub(r"[uU]*[lL][pP][tT]\s*:*", '', row)).apply(lambda row: post_to_words(row))

In [115]:
X_test.loc[:,"title_clean"] = X_test["title"].apply(lambda row : re.sub(r"[uU]*[lL][pP][tT]\s*:*", '', row))



In [89]:
X_test.loc[:,"title_clean"] = X_test["title_clean"].apply(lambda row: post_to_words(row))  # clean the regex-stripped column, not the raw title

---
# Modeling

## CountVectorizer

Let's start simple. CountVectorizer is a bag-of-words model: it ignores the structure of a sentence and simply counts the occurrences of specific words or word combinations.
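
To make "bag of words" concrete, here is a hand-rolled sketch on two made-up documents. It only mimics the idea behind CountVectorizer (whitespace tokenization is my simplification; the real class also lowercases and drops one-letter tokens by default).

```python
from collections import Counter

docs = ["save money save time", "save your receipts"]

# Vocabulary: every distinct token, sorted alphabetically
# (CountVectorizer also orders its columns alphabetically).
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word.
# Word order inside a document is thrown away; only counts remain.
matrix = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)   # ['money', 'receipts', 'save', 'time', 'your']
print(matrix)  # [[1, 0, 2, 1, 0], [0, 1, 1, 0, 1]]
```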

In [7]:
def my_vectorizer(vectorizer,X_train,X_test,y_train,y_test,stop=None):
    '''Takes a vectorizer, fits the model, learns the vocabulary,
    transforms data and returns the transformed matrices'''
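    # transform text
    # note: this fits on the raw `title` column; pass `title_clean` instead
    # to keep the LPT/ULPT prefixes out of the vocabulary
    vect = vectorizer(stop_words=stop) 
    train_data_features = vect.fit_transform(X_train.title)
    test_data_features = vect.transform(X_test.title)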
    le = LabelEncoder()
    target_train = le.fit_transform(y_train)
    target_test = le.transform(y_test)
    
    # transform non text
    mapper = DataFrameMapper([
    ("author", LabelBinarizer()),
    (["num_comments"], StandardScaler()),
    (["score"], StandardScaler())], df_out=True)
    Z_train = mapper.fit_transform(X_train)
    Z_test = mapper.transform(X_test)
    print(f' Learned distinct training vocabulary is {train_data_features.shape[1]}')
    print(f' Remember: 0 -> {le.classes_[0]}, 1 -> {le.classes_[1]}')
    
    # Baseline model
    print(f' Baseline model that guessed all LPT -> {round(1-sum(target_test)/len(target_test),2)} accurate')
    
    # Combine both df columns together
    a = pd.DataFrame(train_data_features.todense())
    b = Z_train
    c = pd.DataFrame(test_data_features.todense())
    d = Z_test

    # reset indices in order to merge
    a = a.reset_index(drop=True)
    b = b.reset_index(drop=True)
    c = c.reset_index(drop=True)
    d = d.reset_index(drop=True)
    
    Z_train = pd.merge(a,b, left_index=True, right_index=True)
    Z_test = pd.merge(c,d, left_index=True, right_index=True)
    return (Z_train, Z_test, target_train, target_test)

---
## Classification

With my data ready in array format, I can now apply binary classifiers. I'll try:

<ul>
<li>Logistic Regression</li>
<li>Naive Bayes Multinomial</li>
<li>Support Vector Machine</li>
</ul>

In [104]:
my_tuple = my_vectorizer(CountVectorizer,X_train,X_test,y_train,y_test,stop='english')
Z_train = my_tuple[0]
Z_test = my_tuple[1]
target_train = my_tuple[2]
target_test = my_tuple[3]

 Learned distinct training vocabulary is 5208
 Remember: 0 -> LifeProTips, 1 -> UnethicalLifeProTips
 Baseline model that guessed all LPT -> 0.5 accurate


In [182]:
def results(model):
    '''Fit the model, print train/test accuracy and return a sample of 2 wrong predictions'''
    model.fit(Z_train, target_train)
    print(f' Training accuracy: {model.score(Z_train, target_train)}')
    print(f' Test accuracy: {model.score(Z_test, target_test)}')
    predictions = model.predict(Z_test)
    predictions = np.where(predictions==0,"LifeProTips","UnethicalLifeProTips")
    proba = model.predict_proba(Z_test) 
    # proba[:,0] probability that it's 0
    final = pd.DataFrame(list(zip(predictions, proba[:,0], y_test, X_test.title_clean, X_test.num_comments, X_test.score)), columns=['prediction', 'proba_lpt','label', 'title_clean', "num_comments", "score"])   
    # final.to_csv("final.csv",index=False) # export for my app
    wrong = final[final.prediction!=final.label]
    return HTML(wrong.sample(min(2, len(wrong))).to_html(classes="table table-responsive table-striped table-bordered"))  # guard against fewer than 2 errors

In [183]:
results(LogisticRegression())

 Training accuracy: 1.0
 Test accuracy: 0.99581589958159


Unnamed: 0,prediction,proba_lpt,label,title_clean,num_comments,score
326,UnethicalLifeProTips,0.327096,LifeProTips,"Hey All, While We Appreciate All The Pokémon Go Tips, Please Keep Them To /R/PokemonGO. Thank You!",0,12728
451,UnethicalLifeProTips,0.441197,LifeProTips,"Just Because The Election Is Over, Does Not Mean That This Subreddit Will Be Accepting Politics or Politic Related Tips. We Will Still Not Accept Them. Keep Those Posts To Their Proper Subreddits. - Thank you.",0,15175


In [106]:
model = LogisticRegression()
model.fit(Z_train, target_train);

# create a pipeline and serialize model to file (for my app)
pipe = Pipeline([("model", model)])
with open("pipe.pkl", "wb") as f:  # context manager closes the file handle
    pickle.dump(pipe, f)

Let's look at the largest coefficients, which correspond to num_comments (col 6452), score (col 6453), and columns 653 and 3218, and peek at the words behind the latter two.

In [None]:
pd.options.display.float_format = '{:.2f}'.format
# my_coef = pd.DataFrame(list(zip(Z_train.columns,abs(model.coef_[0]))),columns=["x","coef"]).sort_values(by="coef",ascending=False).head()                       

In [None]:
my_coef = pd.DataFrame(list(zip(Z_train.columns,model.coef_[0])),columns=["x","coef"])
large_coef = my_coef[(my_coef["x"]=="num_comments") | (my_coef["x"]=="score") | (my_coef["x"]==653) | (my_coef["x"]==3218)] 
HTML(large_coef.to_html(classes="table table-responsive table-striped table-bordered"))

In [None]:
cvect = CountVectorizer(stop_words='english') 
train_data_features = cvect.fit_transform(X_train.title)

In [None]:
print(cvect.get_feature_names_out()[3218])  # on scikit-learn < 1.0, use get_feature_names()
print(cvect.get_feature_names_out()[653])

In [None]:
f'A one-standard-deviation increase in the number of comments multiplies the odds of being an Unethical Life Pro Tip by {np.exp(-4.26):.3f}.'

In [None]:
f'A one-standard-deviation increase in the score (upvotes - downvotes) multiplies the odds of being an Unethical Life Pro Tip by {np.exp(-1.89):.3f}.'
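
Why `np.exp` of a coefficient? In logistic regression, the log-odds are linear in the features, so raising a feature by one unit multiplies the odds by `exp(coef)`. Here is a small check of that identity in plain Python; the intercept and the starting value `x` are arbitrary numbers I picked for illustration, and only the coefficient comes from the cell above. Keep in mind that num_comments and score went through StandardScaler, so "one unit" means one standard deviation.

```python
import math


def sigmoid(z):
    """Logistic function: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


coef = -4.26      # coefficient quoted above for num_comments
intercept = 0.3   # hypothetical, any value gives the same ratio
x = 0.5           # hypothetical starting value of the scaled feature

odds_before = odds(sigmoid(intercept + coef * x))
odds_after = odds(sigmoid(intercept + coef * (x + 1)))
ratio = odds_after / odds_before

# The ratio of odds equals exp(coef), whatever the intercept or x.
print(ratio, math.exp(coef))
```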

The model performs very well out of the box, with 99.9% training accuracy and 99.6% test accuracy.

There are many more false positives than false negatives. One could argue that false positives are the less harmful error: you don't want to heed the advice of a bad tip, whereas missing a Life Pro Tip is not as damaging. If we wanted to be stricter about what gets flagged as unethical, we could raise the decision threshold so that only posts with a predicted probability above 75% are classified as UnethicalLifeProTips, trading fewer false positives for more false negatives.
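
A rough sketch of what moving the threshold looks like. The function name and the probabilities are made up for illustration. With a fitted scikit-learn classifier, the input would be the column of `predict_proba` that belongs to UnethicalLifeProTips (index 1 with the label encoding used here).

```python
def label_with_threshold(proba_unethical, threshold=0.5):
    """Label a post as unethical only when its probability exceeds the threshold."""
    return [
        "UnethicalLifeProTips" if p > threshold else "LifeProTips"
        for p in proba_unethical
    ]


# Hypothetical predicted probabilities of the unethical class for three posts.
proba_unethical = [0.95, 0.60, 0.33]

default = label_with_threshold(proba_unethical, threshold=0.5)
strict = label_with_threshold(proba_unethical, threshold=0.75)

print(default)
print(strict)
```

Raising the threshold flips the borderline post (0.60) back to LifeProTips, which is how false positives fall while false negatives rise.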

"Give the same perfume to your wife and your girlfriend. It could save your ass one day." 🙅🏻‍♂️ - _Not a Life Pro Tip_ but was predicted a _Pro Tip_ 

* The more 'popular' a post is, i.e. the more comments and the higher its score, the greater the likelihood that it is unethical. Controversial posts tend to gain more popularity.
 
* If your document includes the words 'pay' or 'business', the odds of it being unethical are about 3x higher. There's probably a lot of unethical comments around payment and businesses!

---
## Term Frequency Inverse Document Frequency (TF-IDF)

Compared to CountVectorizer, the TF-IDF vectorizer tells us which words are most discriminating between documents.
Words that occur often in one document but rarely in the others are important and carry a great deal of discriminating power. Note that TF-IDF values lie in [0, 1]. The score is based on how often a word appears in a given document compared with how many other documents contain it.
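
Here is a hand-computed sketch of the idea on three tiny, made-up documents. I am assuming the smoothed formula that scikit-learn documents as its default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each document; treat it as an illustration rather than a drop-in replacement for TfidfVectorizer.

```python
import math
from collections import Counter

# Three tokenized toy documents; "save" appears in all of them.
docs = [["save", "money"], ["save", "time"], ["save", "receipts"]]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = Counter(word for doc in docs for word in set(doc))


def tfidf(doc):
    """TF-IDF weights for one tokenized document, L2-normalised."""
    tf = Counter(doc)
    raw = {w: tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}


weights = tfidf(docs[0])
print(weights)
```

In the first document, "money" ends up with a larger weight than "save", because "save" shows up everywhere and so says little about any single document.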

In [23]:
my_tuple = my_vectorizer(TfidfVectorizer,X_train,X_test,y_train,y_test,stop='english')
Z_train = my_tuple[0]
Z_test = my_tuple[1]
target_train = my_tuple[2]
target_test = my_tuple[3]

 Learned distinct training vocabulary is 5208
 Remember: 0 -> LifeProTips, 1 -> UnethicalLifeProTips
 Baseline model that guessed all LPT -> 0.5 accurate


In [31]:
model = LogisticRegression()
model.fit(Z_train, target_train);

In [32]:
results(LogisticRegression())

 Training accuracy: 0.9790648988136776
 Test accuracy: 0.895397489539749


Unnamed: 0,prediction,label,title,num_comments,score
439,UnethicalLifeProTips,LifeProTips,"LPT: When applying for jobs online, save a copy of the job responsibilities and requirements. This information is usually not available after they stop accepting applications and will be useful when preparing for the interviews.",143,12465
192,UnethicalLifeProTips,LifeProTips,LPT: Save your PowerPoint presentations with a .pps extension instead of .ppt. They'll open directly in presentation mode and PowerPoint will close when the slideshow is over.,324,23122


Test accuracy decreases to 89.5%, well below the 97.9% training accuracy (overfitting) and lower than with CountVectorizer. There are probably not as many discriminating words: common, high-frequency words that are predictive of one of the classes help distinguish them, and TF-IDF down-weights words that appear in many documents.

It's not entirely clear which words are the most influential; some words might indicate sarcasm. Overall, it's impressive that a logistic regression model is already so powerful. Let's try a few more algorithms.

---
## Naive Bayes Classifier
The multinomial Naive Bayes classifier is suited to classification with discrete features (e.g., word counts for text classification); in practice it also works with fractional counts such as TF-IDF.

Note that this classifier accepts only non-negative values, so I have run the abs function on my scaled features. While I have the option to supply a prior, I have opted to let scikit-learn estimate it from the training data directly, as I don't have a strong prior belief that a particular post is in one subreddit over the other.
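
One thing to be aware of: `abs` folds below-average values onto above-average ones, so a very unpopular post and a very popular post can end up with the same feature value. A possible alternative (my suggestion, not what this notebook does) is min-max scaling, which is what scikit-learn's MinMaxScaler does per column and which keeps the ordering intact. A tiny sketch with made-up standardized values:

```python
def min_max(values):
    """Rescale values linearly to [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical standardized feature: below average, average, above average.
standardized = [-1.2, 0.0, 1.2]

folded = [abs(v) for v in standardized]
rescaled = min_max(standardized)

print(folded)    # lowest and highest values collide
print(rescaled)  # ordering preserved, all values non-negative
```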

In [34]:
model = MultinomialNB()
Z_train = my_tuple[0]
Z_test = my_tuple[1]
target_train = my_tuple[2]
target_test = my_tuple[3]

Z_train.num_comments = abs(Z_train.num_comments)
Z_train.score = abs(Z_train.score)
model.fit(Z_train, target_train)

print(f' Training accuracy: {model.score(Z_train, target_train)}')
print(f' Test accuracy: {model.score(Z_test, target_test)}')

predictions = model.predict(Z_test)
predictions = np.where(predictions==0,"LifeProTips","UnethicalLifeProTips")
final = pd.DataFrame(list(zip(predictions, y_test, X_test.title, X_test.num_comments, X_test.score)), columns=['prediction', 'label', 'title', "num_comments", "score"])
wrong = final[final.prediction!=final.label] 
HTML(wrong.sample(2).to_html(classes="table table-responsive table-striped table-bordered"))

 Training accuracy: 0.9993021632937893
 Test accuracy: 0.9121338912133892


Unnamed: 0,prediction,label,title,num_comments,score
271,UnethicalLifeProTips,LifeProTips,"LPT: If you're an impulse spender or find it too easy to drop money on something, translate the price of an item into the hours that you have worked to make that amount of money. It really puts in perspective the cash value of an item against the value of your time.",900,25095
42,UnethicalLifeProTips,LifeProTips,LPT If you’re heading to a busy cafe or restaurant that doesn’t take bookings arrange meet your friends at 10 minutes to the hour (9:50am or 6:50pm) instead of on the hour. You’ll beat out everyone who arranged to meet on the hour and get seated much sooner.,649,27442


In [35]:
from sklearn.metrics import confusion_matrix
predictions = model.predict(my_tuple[1])
print(confusion_matrix(my_tuple[3], predictions))
tn, fp, fn, tp = confusion_matrix(my_tuple[3], predictions).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

[[206  35]
 [  7 230]]
True Negatives: 206
False Positives: 35
False Negatives: 7
True Positives: 230


While Naive Bayes also has a high training accuracy, it is overfitting: test accuracy is noticeably lower. In this case, there are more false positives than false negatives. It tended to predict that certain posts were unethical life pro tips when in fact they were not!
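
The counts in the confusion matrix printed above (35 false positives, 7 false negatives) can be turned into the usual summary metrics with a bit of arithmetic. The positive class here is UnethicalLifeProTips, since it is encoded as 1.

```python
# Counts copied from the confusion matrix output above.
tn, fp, fn, tp = 206, 35, 7, 230

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of the posts flagged unethical, how many really were
recall = tp / (tp + fn)      # of the truly unethical posts, how many were caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.912 precision=0.868 recall=0.970
```

Recall is high and precision is lower, which matches the pattern of the errors: the model over-flags posts as unethical.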

---
## Support Vector Machines

- Exceptional performance
- Effective in high-dimensional data
- Low risk of overfitting, but a black box method

In [36]:
my_tuple = my_vectorizer(CountVectorizer,X_train,X_test,y_train,y_test,stop='english')
Z_train = my_tuple[0]
Z_test = my_tuple[1]
target_train = my_tuple[2]
target_test = my_tuple[3]

results(svm.SVC(probability=True))  # results() calls predict_proba, which SVC only provides with probability=True

 Learned distinct training vocabulary is 5208
 Remember: 0 -> LifeProTips, 1 -> UnethicalLifeProTips
 Baseline model that guessed all LPT -> 0.5 accurate
 Training accuracy: 0.7662247034193999
 Test accuracy: 0.7573221757322176


Unnamed: 0,prediction,label,title,num_comments,score
117,UnethicalLifeProTips,LifeProTips,"LPT Whenever you receive a greeting card with money in it for your birthday (or any other special day), always act like you don't see the money and read the card out loud first. After that, then thank them for the money. People really appreciate when you take the time to enjoy their greeting cards.",1077,25192
367,UnethicalLifeProTips,LifeProTips,"LPT: People want someone to tell them what to do in emergency situations. For example while performing CPR on someone don't say ""Someone call an ambulance"" instead talk to one person and ask him/her to call an ambulance directly.",706,20378


In [37]:
params = {'C': [1,3],'gamma': ["scale"]}

grid_search = GridSearchCV(svm.SVC(), param_grid=params, cv=5)
grid_search.fit(Z_train, target_train)

print(grid_search.best_score_)
print(grid_search.best_params_)
print(grid_search.score(Z_test,target_test))

0.9979064898813678
{'C': 1, 'gamma': 'scale'}
0.99581589958159


In [42]:
results(svm.SVC(C=1, gamma="scale", probability=True))  # C must be passed by keyword in recent scikit-learn

 Training accuracy: 1.0
 Test accuracy: 0.99581589958159


Unnamed: 0,prediction,label,title,num_comments,score
326,UnethicalLifeProTips,LifeProTips,"Hey All, While We Appreciate All The Pokémon Go Tips, Please Keep Them To /R/PokemonGO. Thank You!",0,12728
451,UnethicalLifeProTips,LifeProTips,"Just Because The Election Is Over, Does Not Mean That This Subreddit Will Be Accepting Politics or Politic Related Tips. We Will Still Not Accept Them. Keep Those Posts To Their Proper Subreddits. - Thank you.",0,15175


Out of the box, SVC does not perform well here, so we have to tune hyperparameters to improve the accuracy. Recall that if C is large, we do not regularize much (margin violations are penalized more heavily), leading to a closer fit to our training data. Of course, the trade-off is a risk of overfitting and greater error due to higher variance. A larger gamma lowers bias at the cost of higher variance, while a smaller gamma does the opposite. Gamma = "scale", which uses 1 / (`n_features` * `X.var()`), tends to work well.
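
To get a feel for what gamma = "scale" works out to, here is the documented scikit-learn formula, 1 / (`n_features` * `X.var()`), computed by hand on a tiny made-up matrix and plugged into the RBF kernel. This is a sketch in plain Python, not a call into scikit-learn itself.

```python
import math
from statistics import pvariance

# Tiny hypothetical design matrix: 3 samples, 2 features.
X = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
n_features = len(X[0])

# X.var() in NumPy is the population variance over all entries.
flat = [v for row in X for v in row]
gamma = 1.0 / (n_features * pvariance(flat))


def rbf(a, b, gamma):
    """RBF kernel: similarity decays with squared distance, faster for larger gamma."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


print(gamma)
print(rbf(X[0], X[1], gamma), rbf(X[0], X[2], gamma))
```

A larger gamma makes the kernel value drop off faster with distance, so each training point influences a smaller neighbourhood and the decision boundary can bend more.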

<table class="table table-striped table-responsive table-bordered">
<thead>
</thead>
<tbody>
    
<tr>
<td><b> Classification Model </b></td>
<td><b> Training Accuracy </b></td>
<td><b> Test Accuracy </b></td>
</tr>

<tr>
<td><b>Baseline </b></td>
<td> 0.5 </td>
<td> 0.5 </td>
</tr>   


<tr>
<td><b>Logistic </b></td>
<td> 0.999 </td>
<td> 0.996</td>
</tr>   

<tr>
<td><b>Naive Bayes </b></td>
<td> 0.999 </td>
<td> 0.912 </td>
</tr>  

<tr>
<td><b>Support Vector Machines </b></td>
<td> 0.998 </td>
<td> 0.996 </td>
</tr>   

</tbody>
</table>

Given these results, my selected production model will be the logistic regression model with CountVectorizer as the vectorizer. The logistic model is as performant as the SVM, while providing much more interpretability.

In conclusion:

* The more 'popular' a post is, i.e. the more comments and the higher its score, the greater the likelihood that it is unethical. Controversial posts tend to gain more popularity.

* If your document includes the words 'pay' or 'business', the odds of it being unethical are about 3x higher. There's probably a lot of unethical comments around payment and businesses!

---
# Wrap up

I was able to create an app using Natural Language Processing to classify which subreddit a particular post belongs to.

While this was a fun use case of NLP, this kind of analysis is widely applicable to other areas, such as politics (classifying fake news vs. real news) or e-commerce (sentiment analysis of user reviews, i.e. polarity classification: positive, negative or neutral). Further, many virtual assistants (Amazon Alexa, Google Assistant) use NLP to understand a person's question and provide the appropriate response.

As you can imagine, a false positive or false negative would carry far greater consequences in those settings, so tuning the model's decision threshold is critical.

As a next step, I hope to investigate other open source NLP packages such as [spaCy](https://spacy.io/)!