1. You can find the dataset controversial-comments.jsonl for this exercise in the Weekly Resources: Week 2 Data Files.

Pre-processing Text: For this part, you will start by reading the controversial-comments.jsonl file into a DataFrame. Then:

A. Convert all text to lowercase letters.

B. Remove all punctuation from the text.

C. Remove stop words.

D. Apply NLTK’s PorterStemmer.

2. Now that the data is pre-processed, you will apply three different techniques to get it into a usable form for model-building. Apply each of the following steps (individually) to the pre-processed data.

A. Convert each text entry into a word-count vector (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook).

B. Convert each text entry into a part-of-speech tag vector (see section 6.7 in the Machine Learning with Python Cookbook).

C. Convert each entry into a term frequency-inverse document frequency (tfidf) vector (see section 6.9 in the Machine Learning with Python Cookbook).

Follow-Up Question

For the three techniques in problem (2) above, give an example where each would be useful.

NOTE

Running these steps on all of the data can take a while, so feel free to cut down on the number of texts (maybe 50,000) if your program takes too long to run. But be sure to select the text entries randomly!
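
One way to follow this note is to draw the random subset with pandas and fix a seed so the same rows come back on every run. This is a minimal sketch: the toy frame, the `sample_comments` helper, and the seed value are made up for illustration and are not part of the assignment.

```python
import pandas as pd

def sample_comments(df, n, seed=42):
    """Return n randomly chosen rows; the seed makes the draw repeatable."""
    n = min(n, len(df))  # never ask for more rows than exist
    return df.sample(n=n, random_state=seed)

# Toy stand-in for the real DataFrame (columns mirror con/txt).
toy = pd.DataFrame({"con": [0, 1, 0, 0, 1, 0],
                    "txt": ["a", "b", "c", "d", "e", "f"]})

subset = sample_comments(toy, n=3)
print(subset)
```

Reusing one seeded sample for all three techniques keeps their results comparable, since each is then computed on the same rows.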

In [44]:
# 1. You can find the dataset controversial-comments.jsonl for this exercise in the Weekly Resources: Week 2 Data Files.
# Pre-processing Text: For this part, you will start by reading the controversial-comments.jsonl file into a DataFrame. Then,
import pandas as pd

def read_file(path='controversial-comments.jsonl'):
    """Load the JSON Lines file (one JSON object per line) into a DataFrame."""
    print("START READING")
    df = pd.read_json(path, lines=True)
    print("END READING")
    return df

df = read_file()
df

START READING
END READING


Unnamed: 0,con,txt
0,0,Well it's great that he did something about th...
1,0,You are right Mr. President.
2,0,You have given no input apart from saying I am...
3,0,I get the frustration but the reason they want...
4,0,I am far from an expert on TPP and I would ten...
...,...,...
949995,0,I genuinely can't understand how anyone can su...
949996,0,"As a reminder, this subreddit [is for civil di..."
949997,0,K. Don't explain why or anything.
949998,0,[deleted]


In [45]:
# A. Convert all text to lowercase letters.
df['txt'] = df['txt'].str.lower()
df

Unnamed: 0,con,txt
0,0,well it's great that he did something about th...
1,0,you are right mr. president.
2,0,you have given no input apart from saying i am...
3,0,i get the frustration but the reason they want...
4,0,i am far from an expert on tpp and i would ten...
...,...,...
949995,0,i genuinely can't understand how anyone can su...
949996,0,"as a reminder, this subreddit [is for civil di..."
949997,0,k. don't explain why or anything.
949998,0,[deleted]


In [46]:
# B. Remove all punctuation from the text.

import string
# strip off punctuation characters as shown below
# '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
df['txt'] = df['txt'].str.translate(str.maketrans('','',string.punctuation))
df

Unnamed: 0,con,txt
0,0,well its great that he did something about tho...
1,0,you are right mr president
2,0,you have given no input apart from saying i am...
3,0,i get the frustration but the reason they want...
4,0,i am far from an expert on tpp and i would ten...
...,...,...
949995,0,i genuinely cant understand how anyone can sup...
949996,0,as a reminder this subreddit is for civil disc...
949997,0,k dont explain why or anything
949998,0,deleted


In [47]:
# C. Remove stop words.
import nltk
from nltk.corpus import stopwords

# First time only: nltk.download('stopwords')
# Alternatively, from the command line: python -m nltk.downloader stopwords

# A set makes the membership test below a constant-time lookup instead of a list scan.
stop_words = set(stopwords.words('english'))

# Split each comment into words, drop the stop words, and join the rest back together.
df['txt'] = df['txt'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

In [48]:
df

Unnamed: 0,con,txt
0,0,well great something beliefs office doubt trum...
1,0,right mr president
2,0,given input apart saying wrong argument clearly
3,0,get frustration reason want way foundation com...
4,0,far expert tpp would tend agree lot problems u...
...,...,...
949995,0,genuinely cant understand anyone support point...
949996,0,reminder subreddit civil discussionhttpswwwred...
949997,0,k dont explain anything
949998,0,deleted
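
The output above still contains `dont` (row 949997). This suggests that stripping punctuation first turns "don't" into a token the stop-word list no longer matches. The sketch below illustrates that ordering effect on one comment from the data. `STOP_WORDS` is a small hypothetical list used only for this illustration, not NLTK's list, and the helper names are made up.

```python
import string

# Hypothetical mini stop-word list for illustration only.
STOP_WORDS = {"don't", "why", "or"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Delete every character listed in string.punctuation."""
    return text.translate(PUNCT_TABLE)

def drop_stop_words(text):
    """Keep only the whitespace-separated tokens that are not stop words."""
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

comment = "k. don't explain why or anything."

# Order used in the notebook: punctuation first, then stop words.
punct_first = drop_stop_words(strip_punctuation(comment))
# Alternative order: stop words first, then punctuation.
stops_first = strip_punctuation(drop_stop_words(comment))

print(punct_first)  # k dont explain anything
print(stops_first)  # k explain anything
```

The assignment fixes the order of the steps, so this is only something to keep in mind when reading the word counts later on.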


In [49]:
# D. Apply NLTK’s PorterStemmer.
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
df['txt'] = df['txt'].apply(lambda x: ' '.join([porter.stem(word) for word in x.split()]))

In [50]:
df

Unnamed: 0,con,txt
0,0,well great someth belief offic doubt trump wou...
1,0,right mr presid
2,0,given input apart say wrong argument clearli
3,0,get frustrat reason want way foundat complex p...
4,0,far expert tpp would tend agre lot problem und...
...,...,...
949995,0,genuin cant understand anyon support point ok ...
949996,0,remind subreddit civil discussionhttpswwwreddi...
949997,0,k dont explain anyth
949998,0,delet


In [51]:
# Save a copy of this in csv as backup for future steps
df.to_csv("cleaned_file.csv")

In [86]:
# 2. Now that the data is pre-processed, you will apply three different techniques 
# to get it into a usable form for model-building. Apply each of the following steps 
# (individually) to the pre-processed data.
# A. Convert each text entry into a word-count vector 
# (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook).

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Create the bag-of-words vectorizer
count = CountVectorizer()

# A dense array for the full data does not fit in memory:
#   MemoryError: Unable to allocate array with shape (950000, 226148) and data type int64
# Self Notes - https://stackoverflow.com/questions/57507832/unable-to-allocate-array-with-shape-and-data-type
# A 5% sample also failed to run, so use a 3% random sample instead.
dfsample = df.sample(frac=0.03)
dfsample

Unnamed: 0,con,txt
76868,0,assur gop hous would vote anyway
290295,0,aint brexit damn isnt exactli fuck
122412,0,like ive said there crazi work crazi cant penc...
828059,0,remind subreddit civil discussionhttpswwwreddi...
705587,0,hack trump corner ever sinc privat meet trump ...
...,...,...
6134,1,realli ruthless would gain money though
280107,0,get stump guis
343985,0,lol vote clinton shut fuck vox
677357,0,he go declar presid regardless elect day resul...
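
To see why the full data set fails, it helps to work out the size of the dense array named in the MemoryError. The sketch below does that arithmetic. The `dense_bytes` helper is made up for illustration, and the 8 bytes per cell follows from the int64 type in the error message.

```python
def dense_bytes(rows, cols, itemsize=8):
    """Bytes needed for a dense rows x cols array (int64 uses 8 bytes per cell)."""
    return rows * cols * itemsize

# Shape taken from the MemoryError message in the notes above.
full = dense_bytes(950_000, 226_148)
print(f"{full:,} bytes (about {full / 1e12:.2f} TB)")
# 1,718,724,800,000 bytes (about 1.72 TB)
```

`CountVectorizer.fit_transform` returns a SciPy sparse matrix that stores only the non-zero counts, so the error most likely comes from the later `.toarray()` call. Keeping the matrix sparse is an alternative to shrinking the sample.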


In [87]:
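# Fit the vocabulary and build the sparse word-count matrix
bag_of_words = count.fit_transform(np.array(dfsample['txt']))
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(bag_of_words.toarray(), columns=count.get_feature_names_out())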

Unnamed: 0,00,0000000,00000013,00001,001,002,00224,005,01,010,...,zuck,zuckerburg,zumwalt,zwick,zyxsubject,zz,zzlist,ˈperəˌdīm,кофе,яepublican
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28495,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
28496,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
28497,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
28498,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
# B. Convert each text entry into a part-of-speech tag vector 
# (see section 6.7 in the Machine Learning with Python Cookbook).
import nltk
import pandas as pd
from nltk import pos_tag
from nltk import word_tokenize
df = pd.read_csv("cleaned_file.csv", index_col=0)
# Any comment left empty by the cleaning steps is read back from the CSV as NaN
df['txt'] = df['txt'].fillna('')
dfsample = df.sample(frac=0.03)

# First time only:
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')


In [30]:
def tokenizer(texts):
    """Tokenize each comment and tag every token with its part of speech."""
    tagged = []
    for text in texts:
        tagged.append(pos_tag(word_tokenize(str(text))))
    return tagged

dfsample['tokenized'] = tokenizer(dfsample['txt'])
dfsample

Unnamed: 0,con,txt,tokenized
605483,0,im normal dont interest polit typic wait ive w...,"[(im, NN), (normal, JJ), (dont, JJ), (interest..."
504854,0,longer hour work,"[(longer, RB), (hour, NN), (work, NN)]"
471043,0,jew vote trump think would show muslim black p...,"[(jew, JJ), (vote, NN), (trump, NN), (think, N..."
425409,0,berni trump like broken clock agre correct ide...,"[(berni, JJ), (trump, NN), (like, IN), (broken..."
831703,0,understood he huge trump apologist ahol,"[(understood, NN), (he, PRP), (huge, JJ), (tru..."
...,...,...,...
517425,0,import differ though hillari go turn even coun...,"[(import, NN), (differ, NN), (though, IN), (hi..."
2499,0,remov,"[(remov, NN)]"
55077,0,im okay,"[(im, NN), (okay, NN)]"
427710,0,seced state requir said state approv congress ...,"[(seced, VBN), (state, NN), (requir, NN), (sai..."
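
The `tokenized` column holds (word, tag) pairs, which is not yet a numeric vector. One possible way to get there is to count how often each tag occurs per comment. The sketch below does that with three tagged comments copied from the output above. The `tag_count_vector` helper is made up for illustration, and the Cookbook may encode the tags differently.

```python
from collections import Counter

# Tagged comments copied from the output above.
tagged_comments = [
    [("longer", "RB"), ("hour", "NN"), ("work", "NN")],
    [("remov", "NN")],
    [("im", "NN"), ("okay", "NN")],
]

# Fixed, sorted tag vocabulary so every vector has the same column order.
tag_vocab = sorted({tag for comment in tagged_comments for _, tag in comment})

def tag_count_vector(tagged, vocab):
    """Count how often each tag in vocab occurs in one tagged comment."""
    counts = Counter(tag for _, tag in tagged)
    return [counts[tag] for tag in vocab]

vectors = [tag_count_vector(c, tag_vocab) for c in tagged_comments]
print(tag_vocab)  # ['NN', 'RB']
print(vectors)    # [[2, 1], [1, 0], [2, 0]]
```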


In [31]:
# C. Convert each entry into a term frequency-inverse document frequency (tfidf) vector 
# (see section 6.9 in the Machine Learning with Python Cookbook).
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

In [32]:
feature_matrix = tfidf.fit_transform(dfsample['txt'].values.astype('U'))
feature_matrix.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [33]:
for i, (k, v) in enumerate(tfidf.vocabulary_.items()):
    print(k, v)
    if i > 25:
        break

im 11398
normal 14974
dont 6283
interest 11842
polit 16423
typic 21966
wait 23074
ive 12086
watch 23175
three 21341
debat 5524
make 13391
inform 11689
decis 5561
would 23712
vote 23002
elect 6716
someth 19715
els 6769
yesterday 23877
point 16383
done 6267
due 6459
dilig 5941
confid 4774
clinton 4409
far 7399


### Follow-Up Question
#### For the three techniques in problem (2) above, give an example where each would be useful.
A. A bag-of-words model records which known words occur in a document and how often. Each text becomes a 1 × (number of words in the vocabulary) array whose entries count how many times each word appears in that text. It ignores meaning, context, and word order. The underlying intuition is that similar documents have similar word counts: the more words two documents share, the more similar they are likely to be.

Usage-wise, this can be applied to detecting spam email.
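
The idea in (A) can be sketched in a few lines of plain Python. The first document is taken from the stemmed output earlier in the notebook. The other two documents and the `count_vector` helper are made up for illustration.

```python
from collections import Counter

docs = ["right mr presid", "vote vote elect", "mr presid vote"]

# Sorted vocabulary: one column per distinct word across all documents.
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc, vocab):
    """Return a 1 x len(vocab) list of word counts for one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

matrix = [count_vector(doc, vocab) for doc in docs]
print(vocab)   # ['elect', 'mr', 'presid', 'right', 'vote']
print(matrix)  # [[0, 1, 1, 1, 0], [1, 0, 0, 0, 2], [0, 1, 1, 0, 1]]
```

Shuffling the words in a document leaves its vector unchanged, which is what "ignores word order" means in practice.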

B. POS tagging is useful when words or tokens can take more than one part of speech. For instance, the word "google" can be used as both a noun and a verb, depending on the context. When processing natural language, it is important to capture this difference.

Usage-wise, we might try to infer the gender or demographics of the people conversing. For example, by analyzing chat texts we may be able to identify a person's gender or which part of the world they are from.

C. The term frequency-inverse document frequency (tfidf) approach weights each word by how often it appears in a document, scaled down by how many documents in the collection contain it. A word that appears many times in one document but in few others gets a high weight, which is evidence that the document is about that topic. This helps identify the context of the document.

Usage-wise, we can identify the context of a discussion or document. For example, if we are looking for a specific context, we could combine the bag-of-words approach with tfidf to decide which conversations in a chat need further analysis and which can be discarded.
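
To make the weighting in (C) concrete, here is a small worked example using the textbook formula tf × ln(N / df). The three documents and the `tfidf` helper are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed variant with L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

docs = ["vote trump", "vote clinton", "vote vote elect"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Map each term to tf * ln(N / df), the unsmoothed textbook weighting."""
    counts = Counter(tokens)
    return {term: tf * math.log(n_docs / doc_freq[term])
            for term, tf in counts.items()}

weights = [tfidf(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    print(doc, w)
```

Here "vote" appears in every document, so its weight is zero everywhere, even where it occurs twice. "trump", "clinton", and "elect" each appear in only one document and get the weight ln(3).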
