CSC480 Assignment 1 tutorial
Damien Trunkey, Emily Lucas, Sophia Parrett, Fernando Valdivia, Divya Satrawada, Rey Ortiz
Social Justice League!

## Introduction
    
This is a tutorial on how to preprocess data to make it ready to train a model. 

You can get data in many ways: by [web scraping](https://realpython.com/python-web-scraping-practical-introduction/#reader-comments), by downloading a dataset from [Kaggle Datasets](https://www.kaggle.com/datasets), or by calling an API that exposes a large amount of data. We chose to use the Reddit API to collect posts from the Reddit forums.
    


## Imports

We use pandas and NumPy, two data science libraries for Python. pandas lets us create DataFrames (labeled tables of rows and columns), and NumPy provides many mathematical functions. We also import Counter from the standard library to count words.

In [1]:
# Imports
import pandas as pd
import numpy as np
from collections import Counter
import sys
import time

In [2]:
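# Load the data into a DataFrame so we can manipulate and preprocess it
dataFile = 'AmItheAsshole_subreddit.csv'
df = pd.read_csv(dataFile)
df = df.head(7)                # keep the first 7 posts
df = df[['selftext', 'ups']]   # keep the post text and its upvote count
df = df[2:7].copy()            # keep rows 2-6; copy so later column assignment is safe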
ground_truth = df['ups']
print(ground_truth)
display(df)

2    14569
3     3022
4     3507
5     1835
6    11710
Name: ups, dtype: int64


Unnamed: 0,selftext,ups
2,Hello.\r\n\r\nI am one of the many in the midw...,14569
3,I (31f) married my husband (35m) when I gradua...,3022
4,\r\nI am a college student and in high school ...,3507
5,\r\n\r\nI (29F) got set up (again) for a blind...,1835
6,My wife (39F) and I (36M) got married and move...,11710


## Normalization

Normalization is the process of simplifying text to make it easier to work with. It takes several steps. First, convert every word to lower case with .lower(). Without this step, "They" and "they" would be counted as two different words even though they should be counted as one.

Next, replace every character that is neither alphanumeric nor whitespace with a space, using .replace(r"[^\w\s]", " ", regex=True). The regular expression [^\w\s] matches any single character that is not a word character (a letter, digit, or underscore) and not whitespace, so this step removes punctuation. Punctuation carries little information in a bag-of-words model, so removing it simplifies the text further.

Replacing punctuation with spaces also prepares the text for the final step: splitting it on whitespace with .split(). This method turns the string into a list of separate words. Called with no arguments, it treats any run of spaces, tabs, or newlines as a single separator, so extra whitespace between words disappears. To summarize, normalization applies .lower(), then .replace(r"[^\w\s]", " ", regex=True), and finally .split().
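
To see the three normalization steps on a single string, here is a minimal sketch. It uses `re.sub` from the standard library as a stand-in for the pandas string method, and the sample sentence and the `normalize` helper are made up for illustration.

```python
import re

def normalize(text):
    """Lowercase, replace punctuation with spaces, then split on whitespace."""
    lowered = text.lower()
    # [^\w\s] matches any single character that is neither a word
    # character (letter, digit, underscore) nor whitespace
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    # split() with no arguments collapses runs of spaces, tabs, and newlines
    return no_punct.split()

sample = "Hello.\r\n\r\nThey said: they're FINE!"  # hypothetical sample text
tokens = normalize(sample)
print(tokens)
# ['hello', 'they', 'said', 'they', 're', 'fine']
```

Note that the apostrophe is treated as punctuation, so "they're" becomes the two tokens "they" and "re".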

In [3]:
bag_of_words = (
    df['selftext'].
    str.lower().                  # convert all letters to lowercase
    str.replace(r"[^\w\s]", " ", regex=True).  # replace punctuation (non-word, non-space characters) with a space
    str.split()                   # split on whitespace
)

In [4]:
# Count raw word frequencies; these are turned into TF-IDF vectors later
raw_frequency = bag_of_words.apply(Counter)

df['selftext'] = raw_frequency

# One row per post, one column per word; a word missing from a post gets 0
tf = pd.DataFrame(list(raw_frequency), index=raw_frequency.index)
columns = list(tf.columns)
tf = tf.fillna(0)
display(tf)
display(df)

Unnamed: 0,hello,i,am,one,of,the,many,in,midwest,hit,...,respects,admires,okay,feels,berate,allowing,happen,rebuild,boundaries,ashamed
2,1.0,40,2.0,2.0,13,14,1.0,5,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,27,2.0,0.0,6,12,0.0,13,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,65,1.0,1.0,7,14,2.0,16,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,20,0.0,1.0,3,8,0.0,1,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,21,0.0,0.0,10,12,0.0,7,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Unnamed: 0,selftext,ups
2,"{'hello': 1, 'i': 40, 'am': 2, 'one': 2, 'of':...",14569
3,"{'i': 27, '31f': 1, 'married': 1, 'my': 22, 'h...",3022
4,"{'i': 65, 'am': 1, 'a': 17, 'college': 6, 'stu...",3507
5,"{'i': 20, '29f': 1, 'got': 5, 'set': 1, 'up': ...",1835
6,"{'my': 8, 'wife': 4, '39f': 1, 'and': 17, 'i':...",11710
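
To make the shape of the term-frequency table concrete, here is a small sketch that builds one from two made-up token lists. The documents and the names `docs` and `tf_small` are hypothetical and not taken from the dataset.

```python
from collections import Counter
import pandas as pd

# Two made-up tokenized documents, indexed like rows 2 and 3 of the sample
docs = pd.Series([["i", "am", "here"], ["i", "think", "i", "am"]], index=[2, 3])

# Count the words in each document
counts = docs.apply(Counter)

# Each Counter becomes a row; a word absent from a document is NaN until filled
tf_small = pd.DataFrame(list(counts), index=counts.index).fillna(0)
print(tf_small)
```

A column that had a missing entry is stored as floats, which is why the table above mixes values such as 40 and 1.0.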


## Remove Stop Words 
Next, we remove stop words. Stop words are very common words (such as "a" and "the") that carry little meaning on their own. Removing them keeps the focus on the words that say more about the topic of a post. To do this, we read a list of stop words from 'stopwords-short.txt' and drop every column of our dataframe whose name is a stop word.

In [5]:
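stopFile = 'stopwords-short.txt'
stop_words = []
# Read every line; each line is assumed to hold comma-separated, quoted words
with open(stopFile, "r") as f:
    for line in f:
        for word in line.split(','):
            # strip quotes and surrounding whitespace (including the newline)
            word = word.replace('"', "").strip().lower()
            if word:
                stop_words.append(word)

# Drop every column whose name is a stop word
for col in columns:
    if col in stop_words:
        tf = tf.drop([col], axis=1)
display(tf)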

Unnamed: 0,hello,am,one,many,midwest,hit,snowpocolypse,think,we,got,...,respects,admires,okay,feels,berate,allowing,happen,rebuild,boundaries,ashamed
2,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,3,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,4,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,1.0,1.0,2.0,0.0,0.0,0.0,0.0,4,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,3,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
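
The same filtering can also be written without a loop over `drop`. The sketch below assumes a tiny, made-up term-frequency table and stop-word set (`tf_small` and `stop_words_small` are hypothetical names), and keeps only the columns that are not stop words.

```python
import pandas as pd

# Hypothetical stand-ins for the real table and stop-word list
tf_small = pd.DataFrame(
    {"i": [40, 27], "the": [14, 12], "midwest": [1, 0], "married": [0, 1]},
    index=[2, 3],
)
stop_words_small = {"i", "the", "a", "of"}

# Keep only the columns whose name is not a stop word
kept = [col for col in tf_small.columns if col not in stop_words_small]
filtered = tf_small[kept]
print(list(filtered.columns))
# ['midwest', 'married']
```

Storing the stop words in a set makes each membership check fast, which matters once the vocabulary grows to thousands of columns.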


## TF-IDF Vectors

TF-IDF stands for term frequency-inverse document frequency. It measures how important a term is to a document within a collection of documents (the corpus).
Term frequency (TF) measures how often a word appears in a single document. Inverse document frequency (IDF) measures how rare a term is across the corpus: a word that appears in only a few documents is considered rare and therefore carries more weight. With t the term, d a document, D the corpus, and N the number of documents in D, IDF is defined as:
$$\mathrm{idf}(t, D) = \log\frac{N}{|\{d \in D : t \in d\}|}$$
IDF is important because it gives common English words less weight and less common words more impact.
Multiplying TF by IDF gives the final TF-IDF value, which grows with how often a term appears in a document and shrinks with how many documents contain it. A higher value means the term is more important to that document; a value close to 0 means it is less relevant.
$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$
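
As a worked example, the sketch below evaluates the IDF formula for the five-document sample used in this notebook, assuming the natural logarithm (which is what `np.log` computes). The helper name `idf` and its parameters are chosen here for illustration.

```python
import math

N = 5  # number of documents in the tutorial's sample

def idf(doc_count, n_docs=N):
    """Natural-log IDF: log(N / number of documents containing the term)."""
    return math.log(n_docs / doc_count)

# IDF for a term that appears in 1, 2, 3, 4, or all 5 documents
for doc_count in range(1, N + 1):
    print(doc_count, round(idf(doc_count), 6))
# 1 1.609438
# 2 0.916291
# 3 0.510826
# 4 0.223144
# 5 0.0

# TF-IDF for a term that occurs twice in a document and appears in 4 of 5 documents
tf_idf_example = 2 * idf(4)
print(round(tf_idf_example, 6))
# 0.446287
```

A term found in every document gets an IDF of 0, so its TF-IDF is 0 no matter how often it occurs. The last value corresponds to the word "got" in post 2, which occurs twice there and appears in four of the five posts.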

In [6]:
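# Get document frequencies
# (How many documents does each word appear in?)
# Note: this rebinds df, replacing the original DataFrame of posts
df = (tf > 0).sum(axis=0)

# Get IDFs (np.log is the natural logarithm)
idf = np.log(len(tf) / df)
idf.sort_values()  # returns a sorted copy; it is not stored or displayed here

# Calculate TF-IDFs (each column of tf is scaled by that word's IDF)
tf_idf = tf * idf
tf_idf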

Unnamed: 0,hello,am,one,many,midwest,hit,snowpocolypse,think,we,got,...,respects,admires,okay,feels,berate,allowing,happen,rebuild,boundaries,ashamed
2,1.609438,1.021651,1.021651,0.916291,1.609438,1.609438,1.609438,0.916291,0.0,0.446287,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.021651,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.669431,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.510826,0.510826,1.832581,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.510826,0.0,0.0,0.0,0.0,0.916291,0.0,1.115718,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.223144,...,1.609438,1.609438,1.609438,1.609438,1.609438,1.609438,1.609438,1.609438,1.609438,1.609438
