# Exercise 2: NLP and feature engineering
----

In this exercise, you can use yesterday's dataset (news articles) or your own data.

Today, we will use this data for analysis and feature extraction using NLP, focusing on pre-processing steps. 

These are important components of feature engineering: moving from textual data to a feature set that can be used in a classification model.

Remember: the goal in the end is to produce an optimal 'bag of words' model (i.e., a document-term matrix) as input for a machine learning model (in this case, a classification model).
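
To make that target concrete, here is a minimal sketch of a document-term matrix built with the standard library only. The two toy documents and the helper name `build_dtm` are assumptions made for illustration; they are not part of the exercise data, and in practice a library vectorizer would usually handle this step.

```python
from collections import Counter

def build_dtm(tokenized_docs):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    # a sorted vocabulary gives a stable column order
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # term frequencies for this document
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# hypothetical toy corpus, already tokenized
toy_docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]
vocabulary, dtm = build_dtm(toy_docs)
print(vocabulary)
for row in dtm:
    print(row)
```

Every pre-processing decision in this exercise changes which columns end up in this matrix, and therefore what the classifier gets to see.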

In [None]:
#PACKAGES -
#NOTE: you should import all packages you need here, but for now, we will do this in-line.


#INITIALIZATION - 
#NOTE: change this path to the folder where you stored the datasets.
datadir = "/Users/rupertkiddle/Desktop/teach/2024/Introduction to Machine Learning (GESIS)/3_datasets"



### 1. Read in the data

You can use the code you wrote yesterday as a starting point. Again, try your code on a small sample of the data first and scale up later, once you're confident that it works as intended.

In [6]:
from glob import glob #for getting filepaths:

#let's define a function to read the data:
def read_data(listofoutlets):
    texts = []
    labels = []
    for label in listofoutlets:
        for file in glob(datadir+f'/articles/*/{label}/*'):
            with open(file) as f:
                texts.append(f.read())
                labels.append(label)
    return texts, labels

#let's read the articles and set the corresponding labels
articles, labels = read_data(['Infowars', 'BBC', 'The Guardian']) #choose your own news-outlets

#report outcome:
print(f"Number of documents: {len(articles)}")
print(f"Number of labels: {len(set(labels))}")

Number of documents: 6000
Number of labels: 3


### Sample the data (for testing)

In [10]:
import random

#zip the articles and labels together:
zipped_data = list(zip(articles, labels))

#sample 1% of the data:
sample_size = int(len(zipped_data) * 0.01)
sampled_data = random.sample(zipped_data, sample_size)

#unzip the sampled data:
articles, labels = zip(*sampled_data)

#convert back to lists (unzipping with zip(*...) returns tuples):
articles = list(articles)
labels = list(labels)
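
Because the sample is random, each run of the cell above selects different articles, which makes results hard to compare between runs. The sketch below shows one way to make the sample reproducible with a seeded generator. The function name `sample_pairs`, the seed value, and the toy data are assumptions for illustration only.

```python
import random

def sample_pairs(texts, labels, fraction, seed=42):
    """Draw a reproducible random sample of (text, label) pairs."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    pairs = list(zip(texts, labels))
    k = max(1, int(len(pairs) * fraction))  # always keep at least one document
    sampled = rng.sample(pairs, k)
    sampled_texts, sampled_labels = zip(*sampled)
    return list(sampled_texts), list(sampled_labels)

# hypothetical stand-in for the real articles and labels
toy_texts = [f"document {i}" for i in range(200)]
toy_labels = ["BBC" if i % 2 == 0 else "Infowars" for i in range(200)]

first = sample_pairs(toy_texts, toy_labels, 0.05)
second = sample_pairs(toy_texts, toy_labels, 0.05)
print(len(first[0]), first == second)  # same seed, same sample
```

Note that a simple random sample does not guarantee that every outlet is equally represented, so it is worth checking the label counts of a small sample.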

### 2. Examples of Pre-processing Steps - 


a.  lowercasing  
b.  tokenization  
c.  stopword removal  
d.  stemming and/or lemmatizing  
e.  cleaning: removing punctuation, line breaks, double spaces  

#### Lowercasing:

In [24]:
#pre-check:
print("Sample text before lowercasing:")
print(articles[:1])

#lowercase all the articles:
articles_lower_cased = [art.lower() for art in articles]

#post-check:
print("Sample text after lowercasing:")
print(articles_lower_cased[:1])

Sample text before lowercasing:
['Formula 1 has returned this weekend with one big difference - there are no grid girls.\n\nIn the past, you would have seen the likes of professional glamour model Nikki Lee on the tracks ahead of the races.\n\nBut following a backlash in the off-season, this will be the first F1 season in decades that won\'t feature grid girls.\n\n"Looks like I\'ll be going to the job centre now," Nikki told Newsbeat when the ban was announced.\n\nShe said F1\'s decision to axe grid girls has put her livelihood on the line and that the work she does is dying out.\n\nNikki says most of the criticism comes from other women.\n\n"When I went to Le Mans [car race], I had beer bottles pelted at me by the wives and girlfriends.\n\n"I don\'t see where women get off on telling us how to dress. I don\'t care what other people do as long as you\'re not harming me," she says.\n\nOther grid girls have also hit out at the decision to axe their profession.\n\nThe decision to scrap pr

#### Tokenization:

In [25]:
#TOKENIZATION, SIMPLE - 

#pre-check:
print("Sample text before tokenization:")
print(articles[:1])

#basic solution, using the string method `.split()`. 
articles_tokenized = [art.split() for art in articles]

#post-check:
print("Sample text after tokenization:")
print(articles_tokenized[:1])

Sample text before tokenization:
['Formula 1 has returned this weekend with one big difference - there are no grid girls.\n\nIn the past, you would have seen the likes of professional glamour model Nikki Lee on the tracks ahead of the races.\n\nBut following a backlash in the off-season, this will be the first F1 season in decades that won\'t feature grid girls.\n\n"Looks like I\'ll be going to the job centre now," Nikki told Newsbeat when the ban was announced.\n\nShe said F1\'s decision to axe grid girls has put her livelihood on the line and that the work she does is dying out.\n\nNikki says most of the criticism comes from other women.\n\n"When I went to Le Mans [car race], I had beer bottles pelted at me by the wives and girlfriends.\n\n"I don\'t see where women get off on telling us how to dress. I don\'t care what other people do as long as you\'re not harming me," she says.\n\nOther grid girls have also hit out at the decision to axe their profession.\n\nThe decision to scrap p
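
A whitespace split leaves punctuation attached to the neighbouring word, so `girls.` and `girls` would become different features. The sketch below contrasts `.split()` with a simple regular-expression tokenizer on a made-up sentence. The pattern is an illustrative assumption, not the rule set that NLTK uses.

```python
import re

sentence = 'He said: "I can\'t go, sorry!"'

# str.split() only cuts on whitespace, so punctuation stays glued to words
whitespace_tokens = sentence.split()

# simple regex alternative: runs of word characters (with an optional
# internal apostrophe part), or any single punctuation character
pattern = re.compile(r"\w+(?:'\w+)?|[^\w\s]")
regex_tokens = pattern.findall(sentence)

print(whitespace_tokens)
print(regex_tokens)
```

This pattern keeps contractions such as `can't` in one piece, which is a design choice; other tokenizers split them into separate tokens.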

In [15]:
#TOKENIZATION, ADVANCED -

#more sophisticated solution, using the NLTK library.
#NOTE: TbWT separates punctuation (world!), handles contractions (can't), and splits clitics (it's).
from nltk.tokenize import TreebankWordTokenizer #for tokenization

#pre-check:
print("Sample text before advanced tokenization:")
print(articles[:1])

#tokenize the articles:
articles_tokenized = [TreebankWordTokenizer().tokenize(art) for art in articles]

#post-check:
print("Sample text after advanced tokenization:")
print(articles_tokenized[:1])

Sample text before advanced tokenization:
['The Trump administration has announced criminal charges and sanctions against nine Iranians accused of participating in a government-sponsored hacking scheme to steal sensitive information from hundreds of universities, private companies and US government agencies.\n\nThe nine defendants, accused of working at the behest of the Iranian government-tied Islamic Revolutionary Guard Corps, hacked the computer systems of about 320 universities in the United States and abroad to steal expensive research that was then used or sold for profit, prosecutors said.\n\nThe hackers are also accused of breaking into the networks of dozens of government organizations, such as the Department of Labor and Federal Energy Regulatory Commission, and companies, including law firms and biotechnology corporations.\n\nThe Department of Justice said the hackers were affiliated with an Iranian company called the Mabna Institute, which prosecutors say contracted since a

In [None]:
#TOKENIZATION, MORE ADVANCED -
import regex #for regular expressions
import nltk #for natural language processing 

#create your own tokenizer that first splits the text into sentences. This way, `TreebankWordTokenizer` works better - 

#nltk.download("punkt_tab") #uncomment this if needed.

#let's define a class for our (custom) tokenizer:
#NOTE: what does 'self' mean in this context? 
class MyTokenizer:
    def tokenize(self, text): #it takes a string as input.
        tokenizer = TreebankWordTokenizer() #initialize the tokenizer
        result = [] #initialize the result list
        word = r"\p{letter}" #this is a regex pattern for letters
        for sent in nltk.sent_tokenize(text): #split the text into sentences
            tokens = tokenizer.tokenize(sent)   #tokenize the sentence (with TbWT) 
            tokens = [t for t in tokens if regex.search(word, t)] #NOTE: what is this doing?
            result += tokens #add the (valid) tokens to the result list
        return result #return the result list

#instantiate the tokenizer:
mytokenizer = MyTokenizer()

#pre-check:
print("Sample text before our custom tokenization:")
print(articles[:1])

#run the tokenizer on our articles: 
articles_tokenized = [mytokenizer.tokenize(art) for art in articles]

#post-check:
print("Sample text after our custom tokenization:")
print(articles_tokenized[:1])

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\stolw010\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Sample text before our custom tokenization:
['The Trump administration has announced criminal charges and sanctions against nine Iranians accused of participating in a government-sponsored hacking scheme to steal sensitive information from hundreds of universities, private companies and US government agencies.\n\nThe nine defendants, accused of working at the behest of the Iranian government-tied Islamic Revolutionary Guard Corps, hacked the computer systems of about 320 universities in the United States and abroad to steal expensive research that was then used or sold for profit, prosecutors said.\n\nThe hackers are also accused of breaking into the networks of dozens of government organizations, such as the Department of Labor and Federal Energy Regulatory Commission, and companies, including law firms and biotechnology corporations.\n\nThe Department of Justice said the hackers were affiliated with an Iranian company called the Mabna Institute, which prosecutors say contracted since

#### Stopwords:

In [None]:
#STOPWORDS, SIMPLE - 

#let's use NLTK to get a list of stopwords:
from nltk.corpus import stopwords
#nltk.download("stopwords") #NOTE: uncomment this if needed. 

#let's create a list of stopwords:
mystopwords = stopwords.words("english")
mystopwords.extend(["add", "more", "words"]) # it's just a list, so we can add more words if we want to.

print(mystopwords) #just to demonstrate the point...

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\stolw010\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [18]:
# now, remove stopwords from the corpus:

#pre-check:
print("Sample text before removing stopwords:")
print(articles[:1])

#stopword removal:
articles_without_stopwords = [] #initialize an empty list
for article in articles: #for each article
    articles_no_stop = "" #initialize an empty string
    for word in article.lower().split(): #for each word in the list of lowered words from the article
        if word not in mystopwords: #if the word is not a stopword
            articles_no_stop = articles_no_stop + " " + word #add the word to the empty string
    articles_without_stopwords.append(articles_no_stop) #add the article sans stopwords to the list

#post-check:
print("Sample text after removing stopwords:")
print(articles_without_stopwords[:1])

Sample text before removing stopwords:
['The Trump administration has announced criminal charges and sanctions against nine Iranians accused of participating in a government-sponsored hacking scheme to steal sensitive information from hundreds of universities, private companies and US government agencies.\n\nThe nine defendants, accused of working at the behest of the Iranian government-tied Islamic Revolutionary Guard Corps, hacked the computer systems of about 320 universities in the United States and abroad to steal expensive research that was then used or sold for profit, prosecutors said.\n\nThe hackers are also accused of breaking into the networks of dozens of government organizations, such as the Department of Labor and Federal Energy Regulatory Commission, and companies, including law firms and biotechnology corporations.\n\nThe Department of Justice said the hackers were affiliated with an Iranian company called the Mabna Institute, which prosecutors say contracted since at l

In [19]:
# same solution, but with list comprehension:
# NOTE: can you explain how this works? 
articles_without_stopwords = [" ".join([w for w in article.lower().split() if w not in mystopwords]) for article in articles]

In [20]:
#STOPWORDS, ADVANCED -

# more elegant solution: write a function and call it in a list comprehension:
def remove_stopwords(article, stopwordlist):
    cleantokens = []
    for word in article:
        if word.lower() not in stopwordlist:
            cleantokens.append(word)
    return cleantokens

#pre-check:
print("Sample text before removing stopwords:")
print(articles_tokenized[:1])

articles_without_stopwords = [remove_stopwords(art, mystopwords) for art in articles_tokenized]

#post-check:
print("Sample text after removing stopwords:")
print(articles_without_stopwords[:1])

Sample text before removing stopwords:
[['The', 'Trump', 'administration', 'has', 'announced', 'criminal', 'charges', 'and', 'sanctions', 'against', 'nine', 'Iranians', 'accused', 'of', 'participating', 'in', 'a', 'government-sponsored', 'hacking', 'scheme', 'to', 'steal', 'sensitive', 'information', 'from', 'hundreds', 'of', 'universities', 'private', 'companies', 'and', 'US', 'government', 'agencies', 'The', 'nine', 'defendants', 'accused', 'of', 'working', 'at', 'the', 'behest', 'of', 'the', 'Iranian', 'government-tied', 'Islamic', 'Revolutionary', 'Guard', 'Corps', 'hacked', 'the', 'computer', 'systems', 'of', 'about', 'universities', 'in', 'the', 'United', 'States', 'and', 'abroad', 'to', 'steal', 'expensive', 'research', 'that', 'was', 'then', 'used', 'or', 'sold', 'for', 'profit', 'prosecutors', 'said', 'The', 'hackers', 'are', 'also', 'accused', 'of', 'breaking', 'into', 'the', 'networks', 'of', 'dozens', 'of', 'government', 'organizations', 'such', 'as', 'the', 'Department', '

#### Stemming and lemmatization:

In [21]:
#STEMMING AND LEMMATIZATION -

from nltk.stem.snowball import SnowballStemmer #this stems by removing suffixes (e.g. -ing, -ed)

stemmer = SnowballStemmer("english") #initialize the stemmer

#pre-check:
print("Sample text before stemming:")
print(articles[:1])

stemmed_text = [] #initialize an empty list
for article in articles: #for each article
    stemmed_words = "" #initialize an empty string
    for word in article.lower().split(): #for each word in the list of lowered words from the article
        stemmed_words = stemmed_words + " " + stemmer.stem(word) #stem the word and add it to the empty string
    stemmed_text.append(stemmed_words.strip()) #add the stemmed article to the list

#post-check:
print("Sample text after stemming:")
print(stemmed_text[:1])

Sample text before stemming:
['The Trump administration has announced criminal charges and sanctions against nine Iranians accused of participating in a government-sponsored hacking scheme to steal sensitive information from hundreds of universities, private companies and US government agencies.\n\nThe nine defendants, accused of working at the behest of the Iranian government-tied Islamic Revolutionary Guard Corps, hacked the computer systems of about 320 universities in the United States and abroad to steal expensive research that was then used or sold for profit, prosecutors said.\n\nThe hackers are also accused of breaking into the networks of dozens of government organizations, such as the Department of Labor and Federal Energy Regulatory Commission, and companies, including law firms and biotechnology corporations.\n\nThe Department of Justice said the hackers were affiliated with an Iranian company called the Mabna Institute, which prosecutors say contracted since at least 2013 

In [22]:
# same solution, but with list comprehension:
# NOTE: why should we use list comprehension here?
stemmed_text  = [" ".join([stemmer.stem(w) for w in article.lower().split()]) for article in articles]

In [35]:
# compare tokenization and lemmatization using `spaCy`:

import spacy 
#spacy.cli.download("en_core_web_sm") #uncomment this if needed.
nlp = spacy.load("en_core_web_sm") #load the small english model.

#pre-check:
print("Sample text before lemmatization:")
print(articles[:1])

#let's lemmatize the articles:
lemmatized_articles = [[token.lemma_ for token in nlp(art)] for art in articles]

#post-check:
print("Sample text after lemmatization:")
print(lemmatized_articles[:1])

Sample text before lemmatization:
['Formula 1 has returned this weekend with one big difference - there are no grid girls.\n\nIn the past, you would have seen the likes of professional glamour model Nikki Lee on the tracks ahead of the races.\n\nBut following a backlash in the off-season, this will be the first F1 season in decades that won\'t feature grid girls.\n\n"Looks like I\'ll be going to the job centre now," Nikki told Newsbeat when the ban was announced.\n\nShe said F1\'s decision to axe grid girls has put her livelihood on the line and that the work she does is dying out.\n\nNikki says most of the criticism comes from other women.\n\n"When I went to Le Mans [car race], I had beer bottles pelted at me by the wives and girlfriends.\n\n"I don\'t see where women get off on telling us how to dress. I don\'t care what other people do as long as you\'re not harming me," she says.\n\nOther grid girls have also hit out at the decision to axe their profession.\n\nThe decision to scrap 

#### Cleaning: 

In [37]:
#### CLEANING: removing punctuation, line breaks, double spaces

n = random.randint(0, 9)
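articles[n] #NOTE: only the last expression in a cell is displayed; wrap this in print() to inspect the article before cleaning.

## Typical cleaning up steps:
from string import punctuation
#NOTE: replacing '\n\n' with '' glues words together across paragraph breaks (see 'differenceIn' in the output below); replace with ' ' to keep them separate.
articles = [art.replace('\n\n', '') for art in articles] # remove line breaks
articles = ["".join([ch for ch in art if ch not in punctuation]) for art in articles] # remove punctuation, character by character
articles = [" ".join(art.split()) for art in articles] # remove double spaces by splitting the strings into words and joining these words again
articles[n] # display the same article to see whether the changes are in line with what you want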


'For the past two election cycles thriceelected New York city mayor Michael Bloomberg has toyed with a bid for the US presidency With 2020 approaching the financial news data billionaire is again looking at his chances but with one crucial differenceIn 2012 and 2016 Bloomberg considered running as an independent each time he concluded that he could not win and ran the risk of splitting the Democratic vote and helping the Republican candidate to win officeBut in 2020 sources close to the finance mogul have told the Guardian if the now 76yearold candidate does eventually jump into the race he plans to run as a Democrat But after two and now three election cycles in which Bloomberg has teased his interest and poured over polling data there are still questions about his ultimate commitment to a runIn previous flirtations Bloomberg has explained that he dropped the effort to avoid splitting the Democratic vote and risking a Republican presidencyWhen he pulled out from formally entering the 
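
The output above contains fused words such as `differenceIn`, because the paragraph breaks were deleted instead of being turned into spaces. The sketch below is one possible alternative that normalizes whitespace first and only then removes punctuation. The function name `clean_text` and the sample string are assumptions for illustration.

```python
import re
from string import punctuation

def clean_text(text):
    """Remove ASCII punctuation and collapse all whitespace into single spaces."""
    # turn every run of whitespace (including '\n\n') into one space first,
    # so words on either side of a paragraph break stay separate
    text = re.sub(r"\s+", " ", text)
    # drop ASCII punctuation characters in one pass
    text = text.translate(str.maketrans("", "", punctuation))
    # removing punctuation can leave double spaces behind (e.g. around ' - ')
    return " ".join(text.split())

raw = "one crucial difference.\n\nIn 2012 and 2016 - he said - it's over."
print(clean_text(raw))
```

`string.punctuation` only covers ASCII characters, so typographic quotes and dashes in news text would survive this step and may need separate handling.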

#### N-grams:

- Think about what type of n-grams you want to add to your feature set. Extract and inspect n-grams and/or collocations, and add them to your feature set if you think this is relevant.

In [38]:
articles_bigrams = [["_".join(tup) for tup in nltk.ngrams(art.split(),2)] for art in articles] # creates bigrams
articles_bigrams[7][:5] # inspect the results...

# maybe we want both unigrams and bigrams in the feature set?
assert len(articles)==len(articles_bigrams)

articles_uniandbigrams = []
for a,b in zip([art.split() for art in articles],articles_bigrams):
    articles_uniandbigrams.append(a + b)

#and let's inspect the outcomes again.
articles_uniandbigrams[7]
len(articles_uniandbigrams[7]),len(articles_bigrams[7]),len(articles[7].split())


#Or, if you want to inspect collocations:
text = [nltk.Text(tkn for tkn in art.split()) for art in articles ]
text[7].collocations(num=10)


San Francisco; StarSpangled Banner; Francisco 49ers
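
To see what `nltk.ngrams` does under the hood, the following sketch builds n-grams with plain Python. The helper name `make_ngrams` and the toy token list are assumptions for illustration; the sketch mirrors the underscore-joined format used above and is not NLTK's implementation.

```python
def make_ngrams(tokens, n):
    """Return all n-grams of a token list, joined with underscores."""
    # zip over n shifted copies of the list; zip stops at the shortest copy
    return ["_".join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]

tokens = ["san", "francisco", "49ers", "won"]
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
print(len(tokens), len(bigrams), len(trigrams))
```

A document with k tokens yields k - n + 1 n-grams, so adding bigrams roughly doubles the number of features per document.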


### 3. Extract entities and other meaningful information (enrich your feature set)

Depending on your interests and the nature of the data, you could:

- use regular expressions to distinguish relevant from irrelevant texts, or to extract substrings
- use NLP techniques such as Named Entity Recognition to extract entities that occur.
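
As a minimal sketch of the first option, the example below uses Python's `re` module to filter texts by topic and to extract substrings. The toy sentences and both patterns are assumptions chosen for illustration; adapt them to whatever is relevant in your own data.

```python
import re

# hypothetical mini-corpus
toy_articles = [
    "The Department of Justice charged nine hackers in 2018.",
    "Formula 1 has returned this weekend.",
    "Hackers stole data from 320 universities since 2013.",
]

# distinguish relevant from irrelevant texts:
# keep texts containing any word that starts with 'hack' (case-insensitive)
hack_pattern = re.compile(r"\bhack\w*", flags=re.IGNORECASE)
relevant = [art for art in toy_articles if hack_pattern.search(art)]

# extract substrings: four-digit years starting with 19 or 20
year_pattern = re.compile(r"\b(?:19|20)\d{2}\b")
years = [year_pattern.findall(art) for art in toy_articles]

print(len(relevant))
print(years)
```

The outcome of such a match (a yes/no flag, a count, or the extracted string) can be added to the feature set alongside the bag-of-words features.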

#### POS-tagging:

In [40]:
#tokenize and POS-tag with NLTK:
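#NOTE: `articles` was overwritten by the cleaning step above, so the tagger sees text without punctuation (e.g. 'wont' and 'girlsIn' in the output).
tokens = [nltk.word_tokenize(art) for art in articles] #one token list per article
tagged = [nltk.pos_tag(art_tokens) for art_tokens in tokens] #one list of (token, tag) pairs per article
print(tagged[0]) # inspect the first article's POS tags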

[('Formula', 'NN'), ('1', 'CD'), ('has', 'VBZ'), ('returned', 'VBN'), ('this', 'DT'), ('weekend', 'NN'), ('with', 'IN'), ('one', 'CD'), ('big', 'JJ'), ('difference', 'NN'), ('there', 'EX'), ('are', 'VBP'), ('no', 'DT'), ('grid', 'JJ'), ('girlsIn', 'VBZ'), ('the', 'DT'), ('past', 'NN'), ('you', 'PRP'), ('would', 'MD'), ('have', 'VB'), ('seen', 'VBN'), ('the', 'DT'), ('likes', 'NNS'), ('of', 'IN'), ('professional', 'JJ'), ('glamour', 'NN'), ('model', 'NN'), ('Nikki', 'NNP'), ('Lee', 'NNP'), ('on', 'IN'), ('the', 'DT'), ('tracks', 'NNS'), ('ahead', 'RB'), ('of', 'IN'), ('the', 'DT'), ('racesBut', 'NN'), ('following', 'VBG'), ('a', 'DT'), ('backlash', 'NN'), ('in', 'IN'), ('the', 'DT'), ('offseason', 'NN'), ('this', 'DT'), ('will', 'MD'), ('be', 'VB'), ('the', 'DT'), ('first', 'JJ'), ('F1', 'NNP'), ('season', 'NN'), ('in', 'IN'), ('decades', 'NNS'), ('that', 'IN'), ('wont', 'JJ'), ('feature', 'NN'), ('grid', 'JJ'), ('girlsLooks', 'NNS'), ('like', 'IN'), ('Ill', 'NNP'), ('be', 'VB'), ('goin

#### Entity detection:

In [None]:
#detect named entities with spaCy:
nlp = spacy.load('en_core_web_sm')

#let's get the named entities:
docs = [nlp(art) for art in articles] #one spaCy Doc per article
for doc in docs:
    for ent in doc.ents:
        if ent.label_ == 'PERSON': #keep only entities tagged as people
            print(ent.text, ent.label_)

### 4. Create your own pre-processing and feature extraction pipeline:

Combine the methods above to produce a list of preprocessed texts (features) for your classification model.
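
One possible shape for such a pipeline is sketched below, using only the standard library so that it runs without downloads. The function name `preprocess`, the tiny stopword set, and the regex tokenizer are all assumptions for illustration; in your own pipeline you would likely plug in the NLTK or spaCy components from the sections above.

```python
import re

# tiny illustrative stopword set; swap in the `mystopwords` list built earlier
TOY_STOPWORDS = {"the", "a", "of", "and", "to", "in", "has"}

def preprocess(text, stopwords=TOY_STOPWORDS, add_bigrams=False):
    """Lowercase, tokenize, drop stopwords, and optionally append bigrams."""
    # lowercasing + tokenization (punctuation is dropped by the pattern)
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())
    # stopword removal
    tokens = [tok for tok in tokens if tok not in stopwords]
    # optional n-gram step
    if add_bigrams:
        tokens = tokens + ["_".join(pair) for pair in zip(tokens, tokens[1:])]
    # return one string per document, ready for a vectorizer
    return " ".join(tokens)

docs = ["The Trump administration has announced criminal charges.",
        "Formula 1 has returned this weekend."]
features = [preprocess(doc, add_bigrams=True) for doc in docs]
for feat in features:
    print(feat)
```

In this sketch the bigrams are built after stopword removal, so they can join words that were not adjacent in the original text. Whether that is desirable is one of the decisions to reflect on.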

### 5. Train a supervised classifier

Use your code from yesterday's assignment to train a classifier. 

Perform the same classification task, but this time carefully consider which feature set you want to use.

Reflect on the options listed above, and extract features that you think are relevant to include. 

Carefully consider **pre-processing steps**: what type of features will you feed your algorithm? Do you, for example, want to manually remove stopwords, or include ngrams? 

Use these features as input for your classifier, and investigate how they affect its performance. 

Note that the purpose is not to build the perfect classifier, but to inspect the effects of different feature engineering decisions on the outcomes of your classification algorithm.
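
Before retraining, it can help to see how each decision changes the feature set itself. The sketch below loops over every on/off combination of three choices and reports the resulting vocabulary size on a toy corpus. The function `featurize`, the stopword set, and the two sentences are assumptions for illustration. Vocabulary size is only a rough proxy here, not a measure of classifier performance.

```python
import re
from itertools import product

TOY_STOPWORDS = {"the", "a", "of", "and", "to", "in", "has", "this"}

def featurize(text, lowercase, remove_stopwords, use_bigrams):
    """Turn one text into a list of features under the given settings."""
    if lowercase:
        text = text.lower()
    tokens = re.findall(r"\w+", text)
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in TOY_STOPWORDS]
    if use_bigrams:
        tokens = tokens + ["_".join(p) for p in zip(tokens, tokens[1:])]
    return tokens

# hypothetical mini-corpus
toy_docs = ["The hackers hacked the networks.", "The networks of the hackers."]

# try every combination of (lowercase, remove_stopwords, use_bigrams)
vocab_sizes = {}
for config in product([False, True], repeat=3):
    vocab = {feat for doc in toy_docs for feat in featurize(doc, *config)}
    vocab_sizes[config] = len(vocab)
    print(config, len(vocab))
```

The same loop structure could wrap the training and evaluation code from yesterday, so that each configuration is scored on the same train/test split.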