# Exercise 2: NLP and feature engineering
----

In this exercise, you can use one of yesterday's datasets (IMDB or the newspaper data). 

Today, we will use this data for analysis and feature extraction using NLP. 

These are important components of feature engineering: moving from textual data to a feature set that can be used in a classification model.

In [24]:
%pip install spacy nltk
!python -m spacy download en_core_web_sm

Note: you may need to restart the kernel to use updated packages.
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
from glob import glob
import os
import random
import re
from nltk.stem import PorterStemmer
import spacy
nlp = spacy.load("en_core_web_sm")  # install the language model first if needed; as of spaCy v3.0, use its full name instead of just 'en'


data_dir = r"C:/Data Management/Gesis IML/articles-small" #adjust this to your data directory

### 1. Read in the data

You can use the code you wrote yesterday as a starting point. Again, try your code on a small sample of the data first, and scale up later, once you're confident that your code works as intended.

In [2]:
infowarsfiles = glob(os.path.join(data_dir, 'articles/*/Infowars/*'))
infowarsarticles = []
for filename in infowarsfiles:
    with open(filename) as f:
        infowarsarticles.append(f.read())

### 2. First analyses and pre-processing steps

- Perform some first analyses on the data using string methods and regular expressions.
Techniques you can try out include:

a.  lowercasing  
b.  tokenization  
c.  stopword removal  
d.  stemming and/or lemmatizing  
e.  cleaning: removing punctuation, line breaks, double spaces  
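
As a starting point for step e, cleaning can also be done on the raw string before tokenizing. This is a minimal sketch, assuming you are happy to drop all punctuation; the function name `clean_text` and the example sentence are made up for illustration.

```python
import re

def clean_text(text):
    """Lowercase, drop punctuation, and collapse line breaks and repeated spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    text = re.sub(r"\s+", " ", text)     # line breaks, tabs, double spaces -> one space
    return text.strip()

# Hypothetical example sentence
print(clean_text("Charges  have been\ndropped, in Washington, D.C.!"))
```

Note that removing punctuation before tokenizing merges strings such as "D.C." into "dc", so consider whether that is what you want for your feature set.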

In [3]:
# taking a random sample of the articles for practice purposes
articles = random.sample(infowarsarticles, 10)

In [4]:
#get a list of words in this sample of articles:
words = []
for article in articles:
    words.extend(article.split())
print(words[:20])

['Charges', 'have', 'been', 'dropped', 'against', '11', 'members', 'of', 'Turkish', 'President', 'Recep', 'Tayyip', 'Erdogans', 'security', 'detail', 'that', 'were', 'accused', 'of', 'beating']


In [5]:
#now make sure all are lower case:
words_lower = [word.lower() for word in words]
print(words_lower[:20])

['charges', 'have', 'been', 'dropped', 'against', '11', 'members', 'of', 'turkish', 'president', 'recep', 'tayyip', 'erdogans', 'security', 'detail', 'that', 'were', 'accused', 'of', 'beating']


In [None]:
#tokenize the words:
# \w+ matches a word,
#'\w+ matches a contraction (like 't, 's, etc.),
#[^\w\s] matches any single character that is not a word character or whitespace.
# The | operator tells the regex engine to match any one of these alternatives at each position in the text.

def tokenize(text):
    # This regex splits words and contractions into separate tokens, and handles punctuation
    pattern = r"\w+|\'\w+|[^\w\s]"
    return re.findall(pattern, text.lower())

# Example usage on the first article:
tokens = tokenize(articles[0])
print("example: tokens 10-29 of first article:", tokens[10:30])  # this range may need to differ for you, depending on your random sample

tokenized_words = [tokenize(article) for article in articles]
print("example: tokens of first article:", tokenized_words[:1]) # gives all tokens for first article


example: tokens 10-29 of first article: ['recep', 'tayyip', 'erdogans', 'security', 'detail', 'that', 'were', 'accused', 'of', 'beating', 'protesters', 'in', 'washington', ',', 'd', '.', 'c', '.', 'federal', 'prosecutors']
example: tokens of first article: [['charges', 'have', 'been', 'dropped', 'against', '11', 'members', 'of', 'turkish', 'president', 'recep', 'tayyip', 'erdogans', 'security', 'detail', 'that', 'were', 'accused', 'of', 'beating', 'protesters', 'in', 'washington', ',', 'd', '.', 'c', '.', 'federal', 'prosecutors', 'made', 'the', 'decision', 'to', 'drop', 'the', 'charges', 'against', '11', 'of', 'out', 'the', '15', 'security', 'members', 'in', 'connection', 'with', 'the', 'incident', '.', 'police', 'originally', 'announced', 'charges', 'against', '16', 'people', 'in', 'connection', 'with', 'the', 'violent', 'clashes', 'in', 'june', '.', 'the', 'scuffle', 'took', 'place', 'last', 'may', 'after', 'roughly', 'two', 'dozen', 'protesters', 'gathered', 'outside', 'of', 'th

In [14]:
#stopword removal (list created for demonstration by GPT4.1)
stopwords = set([
    "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
    "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers",
    "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves",
    "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are",
    "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does",
    "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until",
    "while", "of", "at", "by", "for", "with", "about", "against", "between", "into",
    "through", "during", "before", "after", "above", "below", "to", "from", "up", "down",
    "in", "out", "on", "off", "over", "under", "again", "further", "then", "once",
    "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",
    "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own",
    "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should",
    "now"
])

def remove_stopwords(tokens):
    return [token for token in tokens if token not in stopwords]


filtered_tokenized_words = [remove_stopwords(article) for article in tokenized_words]
print("example: tokens of first article:", filtered_tokenized_words[:1])


example: tokens of first article: [['charges', 'dropped', '11', 'members', 'turkish', 'president', 'recep', 'tayyip', 'erdogans', 'security', 'detail', 'accused', 'beating', 'protesters', 'washington', ',', 'd', '.', 'c', '.', 'federal', 'prosecutors', 'made', 'decision', 'drop', 'charges', '11', '15', 'security', 'members', 'connection', 'incident', '.', 'police', 'originally', 'announced', 'charges', '16', 'people', 'connection', 'violent', 'clashes', 'june', '.', 'scuffle', 'took', 'place', 'last', 'may', 'roughly', 'two', 'dozen', 'protesters', 'gathered', 'outside', 'turkish', 'embassy', 'protest', 'erdogans', 'policies', 'visit', 'washington', '.']]


In [20]:
#stem the tokens
stemmer = PorterStemmer()
def stem_tokens(tokens):
    return [stemmer.stem(token) for token in tokens]

stemmed_tokenized_words = [stem_tokens(article) for article in filtered_tokenized_words]
print("example: stemmed tokens of first article:", stemmed_tokenized_words[:1])


example: stemmed tokens of first article: [['charg', 'drop', '11', 'member', 'turkish', 'presid', 'recep', 'tayyip', 'erdogan', 'secur', 'detail', 'accus', 'beat', 'protest', 'washington', ',', 'd', '.', 'c', '.', 'feder', 'prosecutor', 'made', 'decis', 'drop', 'charg', '11', '15', 'secur', 'member', 'connect', 'incid', '.', 'polic', 'origin', 'announc', 'charg', '16', 'peopl', 'connect', 'violent', 'clash', 'june', '.', 'scuffl', 'took', 'place', 'last', 'may', 'roughli', 'two', 'dozen', 'protest', 'gather', 'outsid', 'turkish', 'embassi', 'protest', 'erdogan', 'polici', 'visit', 'washington', '.']]


In [30]:
[" ".join(article) for article in filtered_tokenized_words][:1]

['charges dropped 11 members turkish president recep tayyip erdogans security detail accused beating protesters washington , d . c . federal prosecutors made decision drop charges 11 15 security members connection incident . police originally announced charges 16 people connection violent clashes june . scuffle took place last may roughly two dozen protesters gathered outside turkish embassy protest erdogans policies visit washington .']

In [34]:
#use the code from the slides for lemmatization:
#first need to join tokens into a string for spacy to process it:
print("example string:", [" ".join(article) for article in filtered_tokenized_words][:1])

#use this input to lemmatize using spacy:
lemmatized_tokens = [[token.lemma_ for token in nlp(" ".join(article))] for article in filtered_tokenized_words]
print("example: lemmatized tokens of first article:", lemmatized_tokens[:1])


example string: ['charges dropped 11 members turkish president recep tayyip erdogans security detail accused beating protesters washington , d . c . federal prosecutors made decision drop charges 11 15 security members connection incident . police originally announced charges 16 people connection violent clashes june . scuffle took place last may roughly two dozen protesters gathered outside turkish embassy protest erdogans policies visit washington .']
example: lemmatized tokens of first article: [['charge', 'drop', '11', 'member', 'turkish', 'president', 'recep', 'tayyip', 'erdogan', 'security', 'detail', 'accuse', 'beat', 'protester', 'washington', ',', 'd', '.', 'c', '.', 'federal', 'prosecutor', 'make', 'decision', 'drop', 'charge', '11', '15', 'security', 'member', 'connection', 'incident', '.', 'police', 'originally', 'announce', 'charge', '16', 'people', 'connection', 'violent', 'clash', 'june', '.', 'scuffle', 'take', 'place', 'last', 'may', 'roughly', 'two', 'dozen', 'protest

In [35]:
#now define a regex to remove punctuation, line breaks and spaces:
pattern = r'[^\w\s]'  # matches any character that is not a word character or whitespace

cleaned_tokens = []
for article in lemmatized_tokens:
    article_clean = [re.sub(pattern, '', token) for token in article]  # remove punctuation
    article_clean = [token.strip() for token in article_clean if token.strip()]  # strip whitespace and drop tokens that are now empty
    cleaned_tokens.append(article_clean)

print("example: lemmatized tokens of first article after cleaning:", cleaned_tokens[:1])

example: lemmatized tokens of first article after cleaning: [['charge', 'drop', '11', 'member', 'turkish', 'president', 'recep', 'tayyip', 'erdogan', 'security', 'detail', 'accuse', 'beat', 'protester', 'washington', 'd', 'c', 'federal', 'prosecutor', 'make', 'decision', 'drop', 'charge', '11', '15', 'security', 'member', 'connection', 'incident', 'police', 'originally', 'announce', 'charge', '16', 'people', 'connection', 'violent', 'clash', 'june', 'scuffle', 'take', 'place', 'last', 'may', 'roughly', 'two', 'dozen', 'protester', 'gather', 'outside', 'turkish', 'embassy', 'protest', 'erdogan', 'policy', 'visit', 'washington']]


### 3. N-grams

- Think about what type of n-grams you want to add to your feature set. Extract and inspect n-grams and/or collocations, and add them to your feature set if you think this is relevant.
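
One possible way to extract and inspect n-grams uses only the standard library. This is a sketch rather than the only approach; the helper name `ngrams` and the toy token list are assumptions for illustration. In the exercise you would pass in one of your own token lists instead.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) from a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical toy input; replace with one of your own cleaned token lists
example_tokens = ["charge", "drop", "member", "turkish", "president",
                  "security", "detail", "charge", "drop"]

bigrams = ngrams(example_tokens, 2)
bigram_counts = Counter(bigrams)
print(bigram_counts.most_common(3))  # inspect the most frequent bigrams

# Join each n-gram with an underscore so it can sit next to the unigrams as a feature
features = example_tokens + ["_".join(gram) for gram in bigrams]
```

Whether bigrams help depends on the task: they capture fixed expressions, but they also make the feature set much larger and sparser.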

### 4. Extract entities and other meaningful information

Try to extract meaningful information from your texts. Depending on your interests and the nature of the data, you could:

- use regular expressions to distinguish relevant from irrelevant texts, or to extract substrings
- use NLP techniques such as Named Entity Recognition to extract entities that occur.
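
Both options can be sketched as follows. The example text, the keyword pattern, and the function names `is_relevant` and `named_entities` are hypothetical; `named_entities` expects a loaded spaCy pipeline (such as the `nlp` object created at the top of this notebook) as its second argument.

```python
import re

# Hypothetical example text, loosely modelled on the sample article
text = ("Charges were dropped against 11 of the 15 security members. "
        "Police originally announced charges against 16 people in June.")

# Extract substrings: every standalone number in the text
numbers = re.findall(r"\b\d+\b", text)
print(numbers)

# Distinguish relevant from irrelevant texts with a keyword pattern
relevant_pattern = re.compile(r"\b(police|prosecutors?|charges?)\b", re.IGNORECASE)

def is_relevant(doc_text):
    """True if the text mentions at least one of the keywords."""
    return relevant_pattern.search(doc_text) is not None

def named_entities(doc_text, nlp):
    """Return (entity text, label) pairs; `nlp` is a loaded spaCy pipeline."""
    return [(ent.text, ent.label_) for ent in nlp(doc_text).ents]
```

You could call `named_entities(articles[0], nlp)` on an original article. Which entities are found depends on the model, and Named Entity Recognition will probably work better on the original text than on lowercased or stopword-filtered text, because capitalization and context carry information.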

### 5. Train a supervised classifier

Go back to your code from yesterday's assignment. Perform the same classification task, but this time carefully consider which feature set you want to use. Reflect on the options listed above, and extract features that you think are relevant to include. Carefully consider **pre-processing steps**: what type of features will you feed your algorithm? Do you, for example, want to manually remove stopwords, or include n-grams? Use these features as input for your classifier, and investigate how they affect the classifier's performance. Note that the purpose is not to build the perfect classifier, but to inspect the effects of different feature engineering decisions on the outcomes of your classification algorithm.
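
To make such comparisons systematic, it can help to put the feature engineering decisions behind parameters. The sketch below is a plain-Python illustration, not the classifier itself: the function `build_features`, the toy document, and the toy stopword list are all hypothetical, and it only shows how each decision changes the feature set.

```python
from collections import Counter

def build_features(tokens, use_bigrams=False, stopwords=frozenset()):
    """Turn one tokenized document into a bag of features (feature -> count)."""
    kept = [tok for tok in tokens if tok not in stopwords]
    feats = Counter(kept)
    if use_bigrams:
        # add bigrams of the remaining tokens, joined with an underscore
        feats.update("_".join(pair) for pair in zip(kept, kept[1:]))
    return feats

# Hypothetical toy document and stopword list, for illustration only
doc = ["the", "police", "dropped", "the", "charges", "against", "the", "members"]
toy_stopwords = {"the", "against"}

configs = {
    "unigrams": dict(),
    "unigrams, no stopwords": dict(stopwords=toy_stopwords),
    "uni+bigrams, no stopwords": dict(use_bigrams=True, stopwords=toy_stopwords),
}
for name, kwargs in configs.items():
    feats = build_features(doc, **kwargs)
    print(f"{name}: {len(feats)} distinct features")
```

If yesterday's code used a vectorizer from a library, the same decisions are usually exposed as parameters there; in scikit-learn's `CountVectorizer`, for example, these are `stop_words` and `ngram_range`. Train and evaluate the classifier once per configuration and compare the scores.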

## BONUS

- Compare that bottom-up approach with a top-down (keyword- or regular-expression-based) approach.
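
A top-down approach could look like the sketch below, assuming you define the categories and keywords yourself. The labels and keyword lists here are hypothetical; replace them with ones that match the labels in your own classification task.

```python
import re

# Hypothetical labels and keyword lists; choose terms that fit your own categories
keyword_patterns = {
    "politics": re.compile(r"\b(president|embassy|elections?|senat\w*)\b", re.IGNORECASE),
    "crime": re.compile(r"\b(police|charges?|prosecutors?|arrest\w*)\b", re.IGNORECASE),
}

def keyword_classify(text):
    """Assign the label whose pattern matches most often; None if nothing matches."""
    hits = {label: len(pattern.findall(text)) for label, pattern in keyword_patterns.items()}
    best_label = max(hits, key=hits.get)  # on a tie, the first label in the dict wins
    return best_label if hits[best_label] > 0 else None

print(keyword_classify("Police announced charges against 16 people."))
```

Applying this function to your test set gives predictions you can score with the same metrics as your supervised classifier, which makes the two approaches directly comparable.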