# Text Preprocessing for Reddit Comment Classification

Took some steps from these articles:
https://www.dataquest.io/blog/natural-language-processing-with-python/
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
https://mlwhiz.com/blog/2019/01/17/deeplearning_nlp_preprocess/?utm_campaign=shareaholic&utm_medium=reddit&utm_source=news

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.tag import pos_tag_sents

comment_data = pd.read_csv('../Data/reddit_train.csv')
print(comment_data)

          id                                           comments  \
0          0  Honestly, Buffalo is the correct answer. I rem...   
1          1  Ah yes way could have been :( remember when he...   
2          2  https://youtu.be/6xxbBR8iSZ0?t=40m49s\n\nIf yo...   
3          3  He wouldn't have been a bad signing if we woul...   
4          4  Easy. You use the piss and dry technique. Let ...   
...      ...                                                ...   
69995  69995  Thank you, you confirm Spain does have nice pe...   
69996  69996  Imagine how many he would have killed with a r...   
69997  69997  Yes. Only. As in the guy I was replying to was...   
69998  69998  Looking for something light-hearted or has a v...   
69999  69999  I love how I never cry about casters because I...   

            subreddits  
0               hockey  
1                  nba  
2      leagueoflegends  
3               soccer  
4                funny  
...                ...  
69995           euro

### Tokenization and Tagging

In [2]:
# tokenize comments using Twitter-specific tokenizer (couldn't find a Reddit one)
tt = TweetTokenizer()
comment_data['comments_tokens'] = comment_data['comments'].apply(tt.tokenize)

# tag comments
print(comment_data['comments'].tolist())
comment_data['comments_tagged'] = pos_tag_sents(comment_data['comments_tokens'].tolist())

# TODO count totals of each part of speech (noun, adjective, etc)

# TODO typo correction

print(comment_data)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



          id                                           comments  \
0          0  Honestly, Buffalo is the correct answer. I rem...   
1          1  Ah yes way could have been :( remember when he...   
2          2  https://youtu.be/6xxbBR8iSZ0?t=40m49s\n\nIf yo...   
3          3  He wouldn't have been a bad signing if we woul...   
4          4  Easy. You use the piss and dry technique. Let ...   
...      ...                                                ...   
69995  69995  Thank you, you confirm Spain does have nice pe...   
69996  69996  Imagine how many he would have killed with a r...   
69997  69997  Yes. Only. As in the guy I was replying to was...   
69998  69998  Looking for something light-hearted or has a v...   
69999  69999  I love how I never cry about casters because I...   

            subreddits                                    comments_tokens  \
0               hockey  [Honestly, ,, Buffalo, is, the, correct, answe...   
1                  nba  [Ah, yes, way, co

### Word Count, N-Grams, Stopword Removal

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

# in the bag of words matrix, remove punctuation and stopwords
count_vectorizer = CountVectorizer(tokenizer=tt.tokenize, stop_words="english", ngram_range=(1,3))
counts = count_vectorizer.fit_transform(comment_data.comments)
count_vectorizer.vocabulary_

{'honestly': 1206519,
 ',': 88817,
 'buffalo': 612966,
 'correct': 751050,
 'answer': 477920,
 '.': 198352,
 'remember': 1899247,
 'people': 1710572,
 '(': 40846,
 'somewhat': 2070287,
 ')': 61095,
 'joking': 1336483,
 "buffalo's": 613061,
 'mantra': 1524434,
 'starting': 2104889,
 'goalies': 1089794,
 '"': 4983,
 'win': 2406401,
 'game': 1056729,
 'traded': 2269186,
 'think': 2219798,
 "edmonton's": 910412,
 'office': 1657019,
 'travesty': 2274803,
 'better': 562153,
 '10': 319453,
 'years': 2451133,
 'systematic': 2161829,
 'destruction': 823849,
 'term': 2191808,
 "'": 35016,
 'competitive': 722087,
 'responsible': 1913422,
 'change': 663297,
 'draft': 884753,
 'lottery': 1490053,
 'honestly ,': 1206528,
 ', buffalo': 103706,
 'buffalo correct': 613008,
 'correct answer': 751200,
 'answer .': 477987,
 '. remember': 270526,
 'remember people': 1900339,
 'people (': 1710664,
 '( somewhat': 58942,
 'somewhat )': 2070293,
 ') joking': 68844,
 "joking buffalo's": 1336522,
 "buffalo's man

In [22]:
print(counts.shape)

(70000, 2479317)


### Reducing Dimensionality
Pick the important information out of the large matrix generated above.

TODO: instead of using chi2 (which requires the subreddit list to be converted to binary values), use a score function that works for multi-class classification

In [33]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# iterate through the 20 possible classes
for subreddit in comment_data.subreddits.unique():
    print(subreddit)
    # use a binary value to represent whether this comment belongs to this subreddit
    subreddits = np.array(comment_data["subreddits"].copy(deep=True))
    subreddit_match = subreddits == subreddit
    subreddit_nomatch = subreddits != subreddit
    subreddits[subreddit_match] = 1
    subreddits[subreddit_nomatch] = 0
    
#     # select the k=1000 most useful words/n-grams (k can be varied)
#     selector = SelectKBest(chi2, k=1000)
#     selector.fit(counts, subreddits)
#     top_words = selector.get_support().nonzero()

    # Pick only the most informative columns in the data.
#     chi_matrix = full_matrix[:,top_words[0]]

hockey
(70000,)


ValueError: Unknown label type: (array([1, 0, 0, ..., 0, 0, 0], dtype=object),)

### Meta-Features
Find attributes like comment length, amount of punctuation, average word length, etc.

In [37]:
from pymfe.mfe import MFE

# Extract all measures
mfe = MFE()
mfe.fit(comment_data['comments'].tolist(), comment_data['subreddits'].tolist())
ft = mfe.extract()
print(ft)

# Extract general, statistical and information-theoretic measures
mfe = MFE(groups=["general", "statistical", "info-theory"])
mfe.fit(X, y)
ft = mfe.extract()
print(ft)

MemoryError: Unable to allocate array with shape (69158, 69158) and data type float64