# Project 3
# Using NLP to classify posts to one of two subreddits

Now that I have scraped the data, I can move into the NLP/modeling phase. First, I'll load the datasets.

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
import re

import warnings
warnings.filterwarnings('ignore')

In [2]:
dead = pd.read_csv("./dead.csv")
phish = pd.read_csv("./phish.csv")

In [3]:
dead.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,Hey friends! If you’re heading to Bob Weir and...,,gratefuldead,1551565481,RollLikeACantaloupe,0,3,True,2019-03-02
1,UJB 94-02-27,"Dagnabbit, I’ve never heard them stretch it ou...",gratefuldead,1551566171,Basil1229,0,1,True,2019-03-02
2,Dead On Ice,https://liveforlivemusic.com/news/detroit-red-...,gratefuldead,1551569146,Brightwings73,10,1,True,2019-03-02
3,Olin Arageed should be suggested more often fo...,Dark stars still the king in my mind but man t...,gratefuldead,1551572220,calliopewoman,8,23,True,2019-03-02
4,Bob weir and the wolf bros merch?,While someone’s there can we get a picture of ...,gratefuldead,1551574511,Stratengar,7,3,True,2019-03-02


In [4]:
phish.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,Phishy LA (Westside) band needs bass player.,"Though we play mostly originals, we sit right ...",phish,1551562683,russellspurlock,6,10,True,2019-03-02
1,VELVET SEA CHALLANGE..please continue along..,I took a moment from my day\nAnd wrapped it up...,phish,1551566136,designworksgarage,3,0,True,2019-03-02
2,"A Favorite of Mine - Manchester, NH - 10-26-2010","This show hit me in a very special place, and ...",phish,1551567384,Cactus_Bomb,12,10,True,2019-03-02
3,10/30/10 Chalkdust&gt;Whole Lotta Love&gt;Chal...,,phish,1551571181,save-the-tigers,7,14,True,2019-03-02
4,Anyone else who uses LastFM know if it is poss...,,phish,1551578757,ReturnOfTheFox,0,2,True,2019-03-02


On first inspection, I notice a few things that I'll address:
- null entries in selftext cells
- URL text
- "\n" appearing repeatedly in selftext

Null Values:

In [5]:
dead.isnull().sum()

title             0
selftext        533
subreddit         0
created_utc       0
author            0
num_comments      0
score             0
is_self           0
timestamp         0
dtype: int64

In [6]:
phish.isnull().sum()

title             0
selftext        479
subreddit         0
created_utc       0
author            0
num_comments      0
score             0
is_self           0
timestamp         0
dtype: int64

The null values only occur in the selftext column. Since both the title column and the selftext column contain text, I'll combine them into a single "text" column. For the purposes of NLP, there is no need to keep the titles separate from the selftext, so a post with no selftext simply contributes its title. This way, we do not need to delete data.

First, I'll fill the null entries with a blank string.

In [7]:
dead.fillna("", inplace=True)
phish.fillna("", inplace=True)

In [8]:
dead["text"] = dead["title"] + " " + dead["selftext"]
phish["text"] = phish["title"] + " " + phish["selftext"]
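
The fill step matters because concatenating a string with a missing value yields a missing value, which would discard the title as well. A minimal sketch on a made-up two-row frame (the `toy` frame and its contents are illustrative, not taken from the scraped posts):

```python
import pandas as pd

# Toy frame: the second post has no selftext, like a title-only post.
toy = pd.DataFrame({"title": ["Dead On Ice", "UJB 94-02-27"],
                    "selftext": ["some body text", None]})

# Without filling, title + " " + <missing> is missing, so the title is lost.
unfilled = toy["title"] + " " + toy["selftext"]

# Filling with "" first keeps the title (followed by a trailing space).
filled = toy["title"] + " " + toy["selftext"].fillna("")

print(unfilled.isnull().sum())
print(filled.tolist())
```

The trailing space left on title-only posts is harmless here, since the vectorizers split on token boundaries.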

Since I am going to build an NLP model, the only information that I need is the text column and the subreddit column. I will drop the other columns and then combine the two dataframes into one.

In [9]:
dead.drop(columns = [col for col in dead.columns if col not in ["text", "subreddit"]],
          inplace=True)

phish.drop(columns = [col for col in phish.columns if col not in ["text", "subreddit"]],
          inplace=True)

In [10]:
jam = pd.concat([dead, phish], axis=0)
jam.reset_index(inplace=True, drop=True)

In [11]:
jam["subreddit"] = jam["subreddit"].map({"gratefuldead": 0,
                                         "phish": 1})

In [12]:
jam.to_csv("./jam0.csv", index=False)

To establish a baseline performance for my models, I saved this initial dataframe to a CSV (jam0.csv) above. In a different notebook, I go through the process of count vectorizing and modeling it.
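
The baseline notebook itself is not shown here. As a rough sketch of what count vectorizing followed by modeling can look like, assuming a default `CountVectorizer` feeding a default `LogisticRegression` (the toy `texts` and `labels` below are made up and stand in for jam["text"] and jam["subreddit"]):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for the real data saved in jam0.csv.
texts = ["dark star at the fillmore", "jerry and bob on tour",
         "trey and mike at msg", "tweezer reprise to close the show"]
labels = [0, 0, 1, 1]  # 0 = gratefuldead, 1 = phish

# Bag-of-words counts followed by a logistic regression classifier,
# both with default hyperparameters (an assumption for this sketch).
baseline = Pipeline([
    ("cvec", CountVectorizer()),
    ("logreg", LogisticRegression()),
])
baseline.fit(texts, labels)

preds = baseline.predict(["dark star jam"])
print(preds)
```

On the real data, the frame would be split into training and test sets first, so that the baseline score reflects posts the model has not seen.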

Next, I'll focus on cleaning the text data to remove text that doesn't reflect actual language. For instance, I mentioned above that I noticed recurrences of the substring "\n". This is the newline character, which marks a line break rather than a word, so it doesn't pertain to our NLP analysis. Similarly, many of the posts contain URLs. I will try to delete this type of text so that I end up close to a dataset that only contains language-related text, and I will try to do this with methods that can be applied generally.

### "\n"

The substring "\n" seems to be present in several of the posts. How many text entries contain it?

In [13]:
len(jam[jam["text"].str.contains("\n")])

4165

I'll replace each "\n" with a space.

In [14]:
jam["text"] = jam["text"].map(lambda x: x.replace("\n", " "))

### [deleted]

Another issue with many of the posts is the phrase "[deleted]", which denotes text that has retroactively been removed from posts. How many posts contain this substring?

In [15]:
len(jam[jam["text"].str.contains(r"\[deleted\]")])

537

I'll replace each "[deleted]" with a space.

In [16]:
jam["text"] = jam["text"].map(lambda x: re.sub(r"\[deleted\]", " ", x))

In [17]:
len(jam[jam["text"].str.contains(r"\[deleted\]")])

0

### [removed]

Similarly, I'll replace each instance of "[removed]" with a space.

In [18]:
len(jam[jam["text"].str.contains(r"\[removed\]")])

76

In [19]:
jam["text"] = jam["text"].map(lambda x: re.sub(r"\[removed\]", " ", x))

In [20]:
len(jam[jam["text"].str.contains(r"\[removed\]")])

0
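
Both placeholders can also be handled in a single pass. A minimal sketch, assuming a hypothetical helper named `strip_placeholders` (not part of the notebook) and a made-up sample string; `re.escape` takes care of the square brackets so they are matched literally:

```python
import re

# Hypothetical helper: strips both Reddit placeholders in one pass.
PLACEHOLDERS = ["[deleted]", "[removed]"]

# re.escape turns "[deleted]" into a pattern that matches the literal brackets.
placeholder_re = re.compile("|".join(re.escape(p) for p in PLACEHOLDERS))

def strip_placeholders(text):
    """Replace each placeholder with a single space."""
    return placeholder_re.sub(" ", text)

sample = "Great show [deleted] last night [removed]"
cleaned = strip_placeholders(sample)
print(cleaned.split())
```

Applied to the dataframe, this would be `jam["text"].map(strip_placeholders)`.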

In [21]:
jam.to_csv("./jam1.csv", index=False)

### urls

The function below splits a string on whitespace and drops every token that contains a URL fragment, an HTML entity, or an embed attribute (such as "http", "&amp", or "iframe"). By mapping this function over the jam["text"] column, I will filter URLs out of the posts. Note that the whole token is dropped, not just the matching fragment. Some unconventional URLs might slip past these filters, but this should catch the vast majority.

In [22]:
# I referred to this stackoverflow page for help on this:
# https://stackoverflow.com/questions/8122079/
# python-how-to-check-a-string-for-substrings-from-a-list

def drop_url(text):
    """Drop every whitespace-separated token that contains a URL-like tag."""
    text_list = text.split()
    url_tags = ["http", ".com", "www.", ".org", ".net", "&amp", "width=", "size=",
                "height=", "style=", "scrolling=", "allowFullScreen=", "frameborder=",
                "allowTransparency=", "iframe", "&gt", "&lt"]

    # Keep a token only if none of the tags occurs anywhere inside it.
    filtered_list = [word for word in text_list if not any(tag in word for tag in url_tags)]

    return " ".join(filtered_list)

In [23]:
jam["text"] = jam["text"].map(drop_url)
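
One side effect of token-level filtering is that a word fused to an HTML entity (for example "Chalkdust&gt;Whole") is dropped along with the entity. A possible variation, sketched here as an assumption and not as what this notebook does, is to decode entities with `html.unescape` from the standard library before filtering. The function name `drop_url_tokens`, the shortened tag list, and the sample string are all illustrative:

```python
import html

# Shortened, illustrative subset of the tag list used in drop_url.
URL_TAGS = ["http", "www.", ".com", "&amp", "&gt", "&lt"]

def drop_url_tokens(text, tags=URL_TAGS):
    """Drop every whitespace-separated token that contains any tag."""
    return " ".join(tok for tok in text.split()
                    if not any(tag in tok for tag in tags))

raw = "Chalkdust&gt;Whole Lotta Love see https://example.com/show now"

# Filtering the raw text removes the entity-bearing token entirely.
filtered_raw = drop_url_tokens(raw)

# Decoding entities first ("&gt;" becomes ">") lets the words survive.
filtered_unescaped = drop_url_tokens(html.unescape(raw))

print(filtered_raw)
print(filtered_unescaped)
```

Whether the surviving "Chalkdust>Whole" is useful depends on the tokenizer applied afterwards; one that splits on punctuation would recover both words.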

In [24]:
jam[jam["text"] == ""]

Unnamed: 0,subreddit,text
416,0,
832,0,
8698,1,
8701,1,
9508,1,


In [25]:
jam = jam.drop(jam[jam["text"] == ""].index)
jam.reset_index(inplace=True, drop=True)

I'll save the dataframe again, and reassess the performance of the model.

In [26]:
jam.to_csv("./jam2.csv", index=False)

### Lemmatization

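I want to preserve contractions like "you're" and "don't", so the tokenizer below keeps apostrophes (both straight and typographic) inside words while dropping all other punctuation. The text is lowercased before tokenizing, and each token is then lemmatized.

In [27]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# Keep runs of word characters plus straight (') and typographic (’) apostrophes,
# so contractions survive as single tokens; all other punctuation is dropped.
tokenizer = RegexpTokenizer(r"[\w'’]+")

tokens = [tokenizer.tokenize(text.lower()) for text in jam["text"]]

# Lemmatize each token (WordNetLemmatizer treats words as nouns by default).
lem = WordNetLemmatizer()

tokens_lem = [[lem.lemmatize(token) for token in line] for line in tokens]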
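
In its default mode, `RegexpTokenizer` returns the non-overlapping matches of its pattern, so the effect of a pattern can be previewed with `re.findall` from the standard library. A minimal sketch, assuming a pattern that keeps both straight and typographic apostrophes (the sample sentence is made up):

```python
import re

# Word characters plus straight (') and typographic (’) apostrophes.
pattern = r"[\w'’]+"

sample = "You’re sure it don't stop, right?!"

# Lowercase first, then collect every match of the pattern.
tokens = re.findall(pattern, sample.lower())
print(tokens)
```

Both contractions come through as single tokens, while the comma, question mark, and exclamation mark are discarded.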

Now that I have the list of lemmatized token lists, I'll join each list back into a single text string and assign the result to the jam["text"] column.

In [28]:
jam["text"] = [" ".join(word_list) for word_list in tokens_lem]

I'll check for empty cells again, since posts made up only of symbols or emojis will have been filtered down to nothing.

In [29]:
jam[jam["text"] == ""]

Unnamed: 0,subreddit,text
4456,0,
5440,0,
7594,1,


In [30]:
jam = jam.drop(jam[jam["text"] == ""].index)
jam.reset_index(inplace=True, drop=True)

At this point, I'll move my focus to trying different models and tuning hyperparameters.

In [31]:
jam.to_csv("./jam3.csv", index=False)