# Part 1: Data Processing


## Task 1: Data processing on news_sample

Here, each preprocessing step is tested on its own before the steps are combined and applied in a simple pipeline.

### Tokenization
Tokenization is done with the nltk library, using word_tokenize. This tokenizer is an improved version of the TreebankWordTokenizer.

This tokenizer requires an additional download (the punkt package), but is more robust and accurate. For example, it correctly treats words like D'Artagnan, state-of-the-art and wish'd, and abbreviations like I.B.M., as single tokens. Simpler tokenizers, like wordpunct_tokenize, split only on punctuation and whitespace, so they break such words apart, which can destroy the meaning of words like state-of-the-art.
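
To illustrate what punctuation-based splitting does to such words, here is a minimal sketch using only the standard library. It assumes the pattern `\w+|[^\w\s]+`, which to our understanding is what NLTK's WordPunctTokenizer is built on. The helper name punct_split is introduced here for illustration and is not part of the notebook.

```python
import re

# Assumed pattern: runs of word characters, or runs of characters that are
# neither word characters nor whitespace.
PUNCT_SPLIT = re.compile(r"\w+|[^\w\s]+")

def punct_split(text):
    """Split text on whitespace and punctuation (illustrative helper)."""
    return PUNCT_SPLIT.findall(text)

# Hyphenated and apostrophised words are broken into fragments.
print(punct_split("state-of-the-art"))
print(punct_split("D'Artagnan"))
```

Every hyphen and apostrophe becomes its own token, so the original word can no longer be recovered from any single token.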

Placeholders like <NUMBER> are separated from their chevron brackets during tokenization. There doesn't seem to be a tokenizer that keeps them together with the chevrons as one token. However, they are still distinguished from other words by being uppercase. The chevrons therefore mainly serve as separators from adjacent words or placeholders.

All tokens with at least one ASCII letter are kept; everything else is discarded. This removes punctuation and other junk.
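
The filter can be tried in isolation on a hand-written token list. This is a small sketch: the token list is made up for illustration, and has_ascii_letter is a hypothetical helper name.

```python
import string

def has_ascii_letter(token):
    """True if the token contains at least one of a-z or A-Z."""
    return any(ch in string.ascii_letters for ch in token)

# Made-up tokens, including the chevrons that get split off placeholders.
tokens = ["<", "NUMBER", ">", "'s", "...", "2024", "I.B.M", "ñ"]
kept = [t for t in tokens if has_ascii_letter(t)]
print(kept)
```

Note that the possessive 's survives this filter because it contains one letter, while bare chevrons, digits and punctuation do not.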

In [1]:
# Using 'word_tokenize', then keeping only tokens that contain an ASCII letter
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
import string
def tokenize_data(data):
    tokenized_data = word_tokenize(data)
    # drop tokens without any ASCII letter (punctuation, bare numbers, other junk)
    tok_punct_data = [word for word in tokenized_data if any(char in string.ascii_letters for char in word)]
    return tok_punct_data

text = '''<NUMBER> <URL> D'Artagnan vel'koz bel'veth kai'sa I.B.M. trump's state-of-the-art To be, or not to be: that is the question:
Whether ’tis nobler in the mind to suffer...;'''
print(tokenize_data(text))

['NUMBER', 'URL', "D'Artagnan", "vel'koz", "bel'veth", "kai'sa", 'I.B.M', 'trump', "'s", 'state-of-the-art', 'To', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question', 'Whether', 'tis', 'nobler', 'in', 'the', 'mind', 'to', 'suffer']


[nltk_data] Downloading package punkt to /home/zeyu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Cleaning the text


In [2]:
import pandas as pd
news_sample = pd.read_csv("../news_sample.csv")
news_sample_content = ' '.join(news_sample['content'])

### Our own regex_clean function
Tested on news_sample, our own regex_clean and the cleantext module give approximately the same reduction in text and in the number of unique words.

Our regex_clean can discern various forms of dates, while cleantext just treats them as multiple numbers. cleantext, on the other hand, offers extra cleaning functionality, like removing currency symbols and transliterating to ASCII. In the end, we have chosen to use our own regex_clean, as it is more customizable and therefore easier to optimize. For example, it recognizes 100,000 as one number instead of two, and it recognizes multiple forms of URLs, like google.com as well as https://www.google.com, something cleantext cannot do.
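
The point about 100,000 can be shown with a small sketch that compares a naive digit pattern with a grouped alternation of the kind used in regex_clean. The names NAIVE_NUMBER and GROUPED_NUMBER and the sample sentence are introduced here for illustration only.

```python
import re

# Naive approach: every run of digits is its own number.
NAIVE_NUMBER = re.compile(r"\d+")
# Grouped approach: digits joined by ',', '.' or ':' stay in one match.
GROUPED_NUMBER = re.compile(r"(?:\d:\d|\d,\d|\d\.\d|\d)+")

sample = "it cost 100,000 dollars at 12:30"
naive = NAIVE_NUMBER.sub("<NUM>", sample)
grouped = GROUPED_NUMBER.sub("<NUM>", sample)
print(naive)    # it cost <NUM>,<NUM> dollars at <NUM>:<NUM>
print(grouped)  # it cost <NUM> dollars at <NUM>
```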

There is also an issue with cleantext: selecting lower=True also lowercases the placeholders, so <NUMBER> becomes <number>. This is unacceptable, as all placeholders must be uppercase and everything else lowercase for placeholders to be distinguishable during tokenization, stopword removal, and stemming.
Therefore, the text must be lowercased with .lower() before it is passed to cleantext, with lower=False.
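
The order of the two operations is what matters here. The sketch below reproduces the effect with plain regular expressions rather than cleantext itself. Both function names and the sample sentence are hypothetical.

```python
import re

NUMBER = re.compile(r"\d+")

def replace_then_lower(text):
    """Wrong order: the placeholder is inserted first, then lowercased."""
    return NUMBER.sub("<NUMBER>", text).lower()

def lower_then_replace(text):
    """Order used here: lowercase first, then insert the placeholder."""
    return NUMBER.sub("<NUMBER>", text.lower())

sample = "Trump Won 42 States"
print(replace_then_lower(sample))  # trump won <number> states
print(lower_then_replace(sample))  # trump won <NUMBER> states
```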

In [3]:
import re
def regex_clean(text):
    # lowercase
    pattern_lowercase = re.compile('[A-Z]')
    cleaned_text = re.sub(pattern_lowercase, lambda x: x.group(0).lower(), text)
    # whitespace
    pattern_whitespace = re.compile(' {2,}')
    cleaned_text = re.sub(pattern_whitespace, " ", cleaned_text)
    # newline
    pattern_newline = re.compile('\n+')
    cleaned_text = re.sub(pattern_newline, "\n", cleaned_text)
    # tab
    pattern_tab = re.compile('\t+')
    cleaned_text = re.sub(pattern_tab, "\t", cleaned_text)
    # emails
    pattern_email = re.compile('''([^,|\"|\|| |\t|\n|'|\]|\[]*@[^,|\"|\|| |\t|\n|'|\]|\[]*\.(com|org|edu|uk|net|gov))''')
    cleaned_text = re.sub(pattern_email, "<EMAIL>", cleaned_text)
    # URL's
    pattern_URL1 = re.compile('''([^,|\"|\|| |\t|\n|'|\]|\[]*\.(com|org|edu|uk|net|gov)[^,|\"|\|| |\t|\n|'|\]|\[]*)''')         # top-level domains
    pattern_URL2 = re.compile('''https?:\/\/[^,|\"|\|| |\t|\n|'|\]|\[]*''')                                                           # http(s) 
    cleaned_text = re.sub(pattern_URL1, "<URL>", cleaned_text)
    cleaned_text = re.sub(pattern_URL2, "<URL>", cleaned_text)
    # dates 
    pattern_dates = re.compile('''(((0[1-9]|[1-2]\d|3[0-1])(\-|\/|\.|\,| ){1,2}(0[1-9]|1[1-2]|[a-z]{3,9})(\-|\/|\.|\,| ){1,2}(\d{2,}))|((0[1-9]|1[1-2]|[a-z]{3,9})(\-|\/|\.|\,| )(0[1-9]|[1-2]\d|3[0-1])(\-|\/|\.|\,| ){1,2}(\d{2,}))|((\d{2,}))(\-|\/|\.|\,| )(0[1-9]|1[1-2]|[a-z]{3,9})(\-|\/|\.|\,| ){1,2}(0[1-9]|[1-2]\d|3[0-1])|((jan|january|feb|febuary|apr|april|may|jun|june|aug|august|sep|september|oct|october|nov|november|dec|december)(\-|\/|\.|\,| ){1,2}(0[1-9]|[1-2]\d|3[0-1])))''')
    cleaned_text = re.sub(pattern_dates, '<DATE>', cleaned_text)
    # numbers (also covers grouped forms like 100,000, 3.14 and 12:30)
    pattern_numbers = re.compile(r'((\d:\d|\d,\d|\d\.\d|\d)+)')
    cleaned_text = re.sub(pattern_numbers, '<NUM>', cleaned_text)
    return cleaned_text

regex_cleaned = regex_clean(news_sample_content)
regex_tokenized = tokenize_data(regex_cleaned)
regex_clean_vocab= len(set(regex_tokenized))
original_vocab = len(set(tokenize_data(news_sample_content)))
print(f"Number of unique tokens (words, punctuations and everything else) before cleaning using regex:{original_vocab}")
print(f"Number of unique tokens after cleaning using regex:{regex_clean_vocab}")
print(f"Reduction rate of unique tokens using regex_clean: {round(1-regex_clean_vocab/original_vocab,2)}")
#print(regex_cleaned)

Number of unique tokens (words, punctuations and everything else) before cleaning using regex:20251
Number of unique tokens after cleaning using regex:16670
Reduction rate of unique tokens using regex_clean: 0.18


### Cleaning using cleantext

In [4]:
from cleantext import clean
def cleantext_clean(data):
    cleaned_data = data.lower()
    cleaned_data = clean.clean(cleaned_data,
        fix_unicode=False,               # fix various unicode errors
        to_ascii=False,                  # transliterate to closest ASCII representation
        lower=False,                     # lowercase text
        no_line_breaks=False,           # fully strip line breaks as opposed to only normalizing them
        no_urls=True,                  # replace all URLs with a special token
        no_emails=True,                # replace all email addresses with a special token
        no_phone_numbers=False,         # replace all phone numbers with a special token
        no_numbers=True,               # replace all numbers with a special token
        no_digits=True,                # replace all digits with a special token
        no_currency_symbols=False,      # replace all currency symbols with a special token
        no_punct=False,                 # remove punctuations
        replace_with_punct="",          # instead of removing punctuations you may replace them
        replace_with_url="<URL>",
        replace_with_email="<EMAIL>",
        replace_with_phone_number="<PHONE>",
        replace_with_number="<NUMBER>",
        replace_with_digit="<DIGIT>",
        replace_with_currency_symbol="<CUR>",
        lang="en"                       # set to 'de' for German special handling
    )   
    return cleaned_data



cleantext_cleaned = cleantext_clean(news_sample_content)
cleantext_tokenized = tokenize_data(cleantext_cleaned)
cleantext_vocab = len(set(cleantext_tokenized))
original_vocab = len(set(tokenize_data(news_sample_content)))
print(f"Number of unique tokens (words, punctuations and everything else) before cleaning using cleantext:{original_vocab}")
print(f"Number of unique tokens after cleaning using cleantext:{cleantext_vocab}")
print(f"Reduction rate of unique tokens using cleantext_clean: {round(1-cleantext_vocab/original_vocab,2)}")
#print(cleantext_cleaned)

Number of unique tokens (words, punctuations and everything else) before cleaning using cleantext:20251
Number of unique tokens after cleaning using cleantext:16757
Reduction rate of unique tokens using cleantext_clean: 0.17
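
The reduction rates reported for the two cleaners can be reproduced from the printed token counts. A small sketch follows. reduction_rate is a helper name introduced here for illustration and is not part of the notebook.

```python
def reduction_rate(before, after):
    """Fraction of the vocabulary removed, rounded to two decimals."""
    return round(1 - after / before, 2)

# Counts taken from the printed outputs for news_sample.
regex_rate = reduction_rate(20251, 16670)      # our regex_clean
cleantext_rate = reduction_rate(20251, 16757)  # cleantext
print(regex_rate, cleantext_rate)  # 0.18 0.17
```

The two cleaners differ by fewer than a hundred unique tokens on this sample, which is why the choice between them was made on customizability rather than on reduction rate.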


### Removing stopwords
Stopword removal is done with a combined pool of stopwords from nltk.corpus.stopwords and http://members.unine.ch/jacques.savoy/clef/. Both are good sources of stopwords, and combining them increases the number of stopwords removed.

Furthermore, since all tokenizers seem to separate the possessive 's, and it is not covered by the stopword lists, 's becomes one of the most frequent tokens. Therefore, tokens with only one ASCII letter are removed. This effectively prevents that from happening, and costs nothing else, since all other one-letter words are stopwords anyway.
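
The combined effect of the stopword set and the one-letter rule can be sketched on a toy example. The tiny STOP_WORDS set and the token list are made up for illustration. The real pipeline uses the combined nltk and Savoy lists.

```python
import string

# Hypothetical, tiny stopword set standing in for the combined lists.
STOP_WORDS = {"the", "of", "a", "is"}

def ascii_letter_count(token):
    """Number of ASCII letters in the token."""
    return sum(1 for ch in token if ch in string.ascii_letters)

def remove_stopwords(tokens):
    """Drop stopwords and tokens with fewer than two ASCII letters."""
    return [t for t in tokens
            if t not in STOP_WORDS and ascii_letter_count(t) > 1]

tokens = ["trump", "'s", "state-of-the-art", "is", "a", "x", "URL"]
print(remove_stopwords(tokens))
```

The possessive 's is removed by the letter count alone, without having to be added to any stopword list, and the uppercase placeholder passes through untouched.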

In [5]:
# using NLTK's in-built collection of stopwords 
from nltk.corpus import stopwords
import os 
# Stopwords from nltk
stop_words_nltk = set(stopwords.words('english'))
# collecting more stopwords from website: http://members.unine.ch/jacques.savoy/clef/ given in lecture
stop_words_extra_path = os.path.join(os.getcwd(), "../stopwords_extra.txt")
with open(stop_words_extra_path, "r") as f:
    stop_words_extra = set(f.read().split("\n"))
stop_words = stop_words_nltk | stop_words_extra

# removing stopwords
def stopwords_data(data):
    #stopwords are removed
    stopword_data = [word for word in data if word not in stop_words and sum(1 for char in word if char in string.ascii_letters) > 1]
    return stopword_data

stopworded_data = stopwords_data(regex_tokenized)
stopworded_vocab = len(set(stopworded_data))
print(f"Vocabulary size before removing stopwords: {regex_clean_vocab}")
print(f"Vocabulary size after removing stopwords: {stopworded_vocab}")
print(f"Vocabulary reduction rate from tokenizing to stopwords: {round(1 - stopworded_vocab/regex_clean_vocab, 2)}")
print(stopworded_data)

Vocabulary size before removing stopwords: 16670
Vocabulary size after removing stopwords: 16137
Vocabulary reduction rate from tokenizing to stopwords: 0.03


### Stemming
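Stemming is done with a stemmer from nltk.stem. The choice between PorterStemmer and SnowballStemmer is still undecided; the code below currently uses SnowballStemmer. Placeholders are excluded from stemming so that they stay uppercase.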
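
Placeholders need to bypass the stemmer because a stemmer would otherwise normalize them like any other word (as far as we can tell, NLTK's stemmers lowercase their input). The sketch below shows the guard only. toy_stem is a deliberately crude stand-in so that the example runs without NLTK, and it is not the stemmer used in the notebook.

```python
PLACEHOLDERS = {"NUM", "URL", "EMAIL", "DATE"}

def toy_stem(word):
    """Crude stand-in for a real stemmer: lowercase, strip a few suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stem_tokens(tokens, stem=toy_stem):
    """Stem every token except the uppercase placeholders."""
    return [t if t in PLACEHOLDERS else stem(t) for t in tokens]

print(stem_tokens(["URL", "walking", "sermons", "DATE"]))
print(toy_stem("URL"))  # without the guard, the placeholder is lowercased
```

Any callable that maps a word to its stem can be passed as the stem argument, for example the stem method of an NLTK stemmer.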

In [6]:
# Stemming
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
# initialize the stemmer
#stemmer = PorterStemmer()
stemmer = SnowballStemmer("english")
placeholders = ['NUM', 'URL', 'EMAIL', 'DATE']
# stemming
def stem_data(data):
    stemmed_data =  [stemmer.stem(word) if word not in placeholders else word for word in data]
    return stemmed_data

stemmed_data = stem_data(stopworded_data)
stemmed_vocab = len(set(stemmed_data))
print(f"Vocabulary size before stemming: {stopworded_vocab}")
print(f"Vocabulary size after stemming: {stemmed_vocab }")
print(f"Reduction rate = {round(1 - stemmed_vocab / stopworded_vocab, 2)}")
print(stemmed_data)

Vocabulary size before stemming: 16137
Vocabulary size after stemming: 10997
Reduction rate = 0.32
['power', 'christma', 'make', 'wild', 'wonder', 'thing', 'holi', 'triniti', 'posit', 'power', 'good', 'simpl', 'act', 'give', 'receiv', 'lost', 'day', 'worri', 'money', 'success', 'hold', 'back', 'give', 'congreg', 'ohio', 'move', 'action', 'power', 'sermon', 'church', 'christma', 'eve', 'pastor', 'grand', 'lake', 'unit', 'methodist', 'church', 'celina', 'ohio', 'gave', 'emot', 'sermon', 'import', 'understand', 'messag', 'jesus', 'religi', 'peopl', 'messag', 'jesus', 'make', 'peopl', 'suffer', 'enjoy', 'life', 'bit', 'sermon', 'generos', 'live', 'jesus', 'live', 'long', 'time', 'ago', 'act', 'generous', 'fashion', 'time', 'generous', 'act', 'time', 'focus', 'sermon', 'potenc', 'sermon', 'lost', 'congreg', 'move', 'action', 'sermon', 'end', 'congreg', 'decid', 'offer', 'bowl', 'pass', 'room', 'pitch', 'christma', 'eve', 'word', 'sermon', 'ring', 'ear', 'offer', 'member', 'congreg', 'drove'

### Data preprocessing
This function combines cleaning, tokenization, stopword removal and stemming. It takes the filepath of a .csv file as input. The dataset is read into a pandas DataFrame and processed. The result is written to a new file in the same directory, with _cleaned appended to the original filename.
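
How the output location is derived from the input path can be sketched with the standard library alone. cleaned_path is a hypothetical helper name. The notebook performs the same steps inline.

```python
import os

def cleaned_path(filepath):
    """Return the path of the cleaned copy next to the input file."""
    directory, filename = os.path.split(filepath)
    base, ext = os.path.splitext(filename)
    return os.path.join(directory, f"{base}_cleaned{ext}")

print(cleaned_path("../news_sample.csv"))
```

Because the directory part is reused unchanged, the cleaned file always lands next to the input file, whatever the working directory is.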

This is the simplest form of our preprocessing pipeline. It currently uses our own regex_clean for the cleaning step.

In [7]:
import pandas as pd
import os
from collections import Counter
import string
import matplotlib.pyplot as plt

def data_preprocessing(filepath):
    # Initialize the output file
    directory, filename = os.path.split(filepath)
    base, ext = os.path.splitext(filename)
    clean_filename = f"{base}_cleaned{ext}"
    clean_file_path = os.path.join(directory, clean_filename)
    print("new cleaned dataset:", clean_file_path)

    #read file
    df = pd.read_csv(filepath)
    #process data and gather vocabulary info.
    df.loc[:, 'content']=df['content'].apply(regex_clean)
    df.loc[:, 'content']=df['content'].apply(tokenize_data)
    cleaned_vocab = Counter()
    cleaned_vocab.update(token for token_list in df['content'] for token in token_list)
    df.loc[:, 'content']=df['content'].apply(stopwords_data)
    stopworded_vocab = Counter()
    stopworded_vocab.update(token for token_list in df['content'] for token in token_list)
    df.loc[:, 'content']=df['content'].apply(stem_data)
    stemmed_vocab = Counter()
    stemmed_vocab.update(token for token_list in df['content'] for token in token_list)
    #write to output file
    df.to_csv(clean_file_path, header=False, index=False, mode='w')
    return cleaned_vocab, stopworded_vocab, stemmed_vocab
# run the pipeline on news_sample and collect the vocabulary at each stage
%matplotlib inline
news_sample = "news_sample.csv"
cleaned_vocab, stopworded_vocab, stemmed_vocab= data_preprocessing("../news_sample.csv")

#calculate and print reduction statistics
print(f"\n Vocabulary before stemming and removing stopwords:{len(cleaned_vocab)}")
print(f"Vocabulary after removing stopwords:{len(stopworded_vocab)}")
print(f"Reduction rate of vocabulary size from removing stopwords: {round(1-len(stopworded_vocab)/len(cleaned_vocab),2)}")
print(f"Vocabulary after stemming:{len(stemmed_vocab)}")
print(f"Reduction rate of vocabulary size from stemming: {round(1-len(stemmed_vocab)/len(stopworded_vocab),2)}")


new cleaned dataset: ../news_sample_cleaned.csv

 Vocabulary before stemming and removing stopwords:16669
Vocabulary after removing stopwords:16136
Reduction rate of vocabulary size from removing stopwords: 0.03
Vocabulary after stemming:10996
Reduction rate of vocabulary size from stemming: 0.32
