# Exercise 2: NLP and feature engineering
----

In this exercise, you can use one of yesterday's datasets (IMDB or the newspaper data). 

Today, we will use this data for analysis and feature extraction using NLP. 

These are important components of feature engineering: moving from textual data to a feature set that can be used in a classification model.

In [1]:
#PACKAGES -
#NOTE: you should import all packages you need here, but for now, we will do this in-line.


#DATA - 
datadir = "/Users/rupertkiddle/Desktop/teach/2024/Introduction to Machine Learning (GESIS)/3_datasets"

### 1. Read in the data

You can use the code you wrote yesterday as a starting point. Again, try your code on a small sample of the data first, and scale up later, once you're confident that your code works as intended.
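
One way to work on a small sample is to sample the file paths *before* reading, so that only a handful of files are opened. The sketch below is a minimal illustration under that assumption; the function name `read_sample` and its parameters are hypothetical and not part of yesterday's code.

```python
import random
from pathlib import Path

def read_sample(paths, k, seed=42):
    """Read at most k randomly chosen files and return their contents as strings."""
    paths = list(paths)
    rng = random.Random(seed)  # local RNG, so the same seed gives the same sample
    chosen = rng.sample(paths, min(k, len(paths)))
    # errors="replace" is an assumption: it avoids crashes on odd encodings
    return [Path(p).read_text(encoding="utf-8", errors="replace") for p in chosen]
```

With the newspaper data, this could be called as `read_sample(glob(datadir + '/articles/*/Infowars/*'), k=10)`; increase `k` once the rest of the code works.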

In [None]:
#get filepaths for the articles:
from glob import glob #for filepaths
infowarsfiles = glob(datadir+'/articles/*/Infowars/*')

#initialize an empty list to store the articles:
infowarsarticles = []

#loop through the filepaths, open them, and append the articles to the list:
for filename in infowarsfiles:
    with open(filename) as f:
        infowarsarticles.append(f.read())

#how many articles do we have?
len(infowarsarticles)

In [3]:
#let's take a random sample of 10 articles:
import random #for RNG
articles = random.sample(infowarsarticles, 10)

#just a sanity check:
assert len(articles) == 10

### 2. First analyses and pre-processing steps

- Perform some first analyses on the data using string methods and regular expressions.
Techniques you can try out include:

a.  lowercasing  
b.  tokenization  
c.  stopword removal  
d.  stemming and/or lemmatizing  
e.  cleaning: removing punctuation, line breaks, double spaces  
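
Before going through these steps one by one, it may help to see how several of them chain together. The sketch below uses only the standard library; `TOY_STOPWORDS` and `preprocess` are hypothetical names, and the stopword list is a tiny stand-in for a real one (such as NLTK's). Stemming and lemmatizing are left out because they need an external library.

```python
from string import punctuation

# Tiny illustrative stopword list (an assumption); use a real list in practice.
TOY_STOPWORDS = {"the", "a", "an", "and", "of", "is", "in", "to"}

def preprocess(text, stopwords=TOY_STOPWORDS):
    """Lowercase, strip ASCII punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", punctuation))  # remove punctuation
    tokens = text.split()  # splitting on whitespace also drops line breaks and double spaces
    return [t for t in tokens if t not in stopwords]

print(preprocess("The  cat sat,\nand the dog barked!"))
# → ['cat', 'sat', 'dog', 'barked']
```

Note that the order matters: lowercasing comes before stopword removal, because the stopword list is lowercase.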

In [4]:
#LOWERCASING - 

#lowercase all the articles:
articles_lower_cased = [art.lower() for art in articles]

In [5]:
#TOKENIZATION, SIMPLE - 

#basic solution, using the string method `.split()`. 
articles_split = [art.split() for art in articles]

In [6]:
#TOKENIZATION, ADVANCED -

#more sophisticated solution, using the NLTK library.
from nltk.tokenize import TreebankWordTokenizer #for tokenization
articles_tokenized = [TreebankWordTokenizer().tokenize(art) for art in articles]

In [None]:
#TOKENIZATION, MORE ADVANCED -
import regex #for regular expressions
import nltk #for natural language processing 

#even more sophisticated: create your own tokenizer that first splits the text into sentences. This way, `TreebankWordTokenizer` works better.
nltk.download("punkt_tab") 

class MyTokenizer:
    def tokenize(self, text):
        tokenizer = TreebankWordTokenizer()
        result = []
        word = r"\p{letter}"
        for sent in nltk.sent_tokenize(text):
            tokens = tokenizer.tokenize(sent)    
            tokens = [t for t in tokens
                      if regex.search(word, t)]
            result += tokens
        return result

mytokenizer = MyTokenizer()

print(mytokenizer.tokenize(articles[0]))

In [None]:
#STOPWORDS, SIMPLE - 

# define your stopwordlist:
from nltk.corpus import stopwords
#nltk.download("stopwords")
mystopwords = stopwords.words("english")
mystopwords.extend(["add", "more", "words"]) # manually add more stopwords to your list if needed
print(mystopwords) #let's see what's inside 

In [17]:
# now, remove stopwords from the corpus:
articles_without_stopwords = []
for article in articles:
    articles_no_stop = ""
    for word in article.lower().split():
        if word not in mystopwords:
            articles_no_stop = articles_no_stop + " " + word
    articles_without_stopwords.append(articles_no_stop)

In [None]:
# same solution, but with list comprehension:
articles_without_stopwords = [" ".join([w for w in article.lower().split() if w not in mystopwords]) for article in articles]

In [18]:
#STOPWORDS, ADVANCED -

# different--probably more sophisticated--solution, by writing a function and calling it in a list comprehension:
def remove_stopwords(article, stopwordlist):
    cleantokens = []
    for word in article:
        if word.lower() not in stopwordlist: # use the function argument, not the global variable
            cleantokens.append(word)
    return cleantokens

articles_without_stopwords = [remove_stopwords(art, mystopwords) for art in articles_tokenized]

In [None]:
#NOTE: it's good practice to frequently inspect the results of your code, to make sure you are not making mistakes, and the results make sense. 
# for example, compare your results to some random articles from the original sample:

n = random.randint(0, 9)
print(articles[n][:100])
print("-----------------")
print(" ".join(articles_without_stopwords[n])[:100])

In [23]:
#STEMMING AND LEMMATIZATION -

from nltk.stem.snowball import SnowballStemmer #for stemming
stemmer = SnowballStemmer("english")

stemmed_text = []
for article in articles:
    stemmed_words = ""
    for word in article.lower().split():
        stemmed_words = stemmed_words + " " + stemmer.stem(word)
    stemmed_text.append(stemmed_words.strip())

In [24]:
# same solution, but with list comprehension:

stemmed_text  = [" ".join([stemmer.stem(w) for w in article.lower().split()]) for article in articles]

In [28]:
# compare tokenization and lemmatization using `spaCy`:

import spacy #for nlp
#spacy.cli.download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")
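#NOTE: if the model is missing, download it first: uncomment the `spacy.cli.download` line above, or run `python -m spacy download en_core_web_sm` in a terminal.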
lemmatized_articles = [[token.lemma_ for token in nlp(art)] for art in articles]

In [None]:
# again, frequently inspect your code, and for example compare the results to the original articles:

#pick a random article:
n = random.randint(0, 9)

print(articles[n][:100])
print("-----------------")
print(stemmed_text[n][:100])
print("-----------------")
print(" ".join(lemmatized_articles[n])[:100])

In [None]:
#### CLEANING: removing punctuation, line breaks, double spaces

n = random.randint(0, 9)
print(articles[n]) # print a random article to inspect.

## Typical cleaning up steps:
from string import punctuation
articles = [art.replace('\n', ' ') for art in articles] # replace line breaks with spaces, so words on either side are not glued together
articles = ["".join([w for w in art if w not in punctuation]) for art in articles] # remove punctuation
articles = [" ".join(art.split()) for art in articles] # remove double spaces by splitting the strings into words and joining these words again
articles[n] # print the same article to see whether the changes are in line with what you want


### 3. N-grams

- Think about what type of n-grams you want to add to your feature set. Extract and inspect n-grams and/or collocations, and add them to your feature set if you think this is relevant.
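
To make the idea concrete, here is a minimal sketch of n-gram extraction in plain Python. The helper `ngrams` is a hypothetical stand-in for `nltk.ngrams`, and the example sentence is invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as underscore-joined strings."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "new york is not york new york".split()
bigrams = ngrams(tokens, 2)

# Unigrams and bigrams combined into one feature list:
features = tokens + bigrams

# Frequent bigrams are candidates for meaningful multi-word features:
print(Counter(bigrams).most_common(1))
# → [('new_york', 2)]
```

Counting n-grams like this is a quick way to judge whether they carry information ("new_york") or mostly noise ("is_not") before adding them to the feature set.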

In [None]:
articles_bigrams = [["_".join(tup) for tup in nltk.ngrams(art.split(),2)] for art in articles] # creates bigrams
print(articles_bigrams[7][:5]) # inspect the results...

# maybe we want both unigrams and bigrams in the feature set?
assert len(articles)==len(articles_bigrams)

articles_uniandbigrams = []
for a,b in zip([art.split() for art in articles],articles_bigrams):
    articles_uniandbigrams.append(a + b)

#and let's inspect the outcomes again.
print(articles_uniandbigrams[7])
print(len(articles_uniandbigrams[7]), len(articles_bigrams[7]), len(articles[7].split()))


#Or, if you want to inspect collocations:
text = [nltk.Text(tkn for tkn in art.split()) for art in articles ]
text[7].collocations(num=10)


### 4. Extract entities and other meaningful information

Try to extract meaningful information from your texts. Depending on your interests and the nature of the data, you could:

- use regular expressions to distinguish relevant from irrelevant texts, or to extract substrings
- use NLP techniques such as Named Entity Recognition to extract entities that occur.
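
The first option can be sketched with the standard `re` module. The patterns and example texts below are invented for illustration; which keywords count as "relevant" is an assumption that depends on your research question.

```python
import re

# Hypothetical relevance pattern: "pandemic", or any word starting with "vaccin".
RELEVANT = re.compile(r"\b(vaccin\w*|pandemic)\b", re.IGNORECASE)
# Rough pattern for four-digit years between 1900 and 2099.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

texts = [
    "The vaccine rollout began in 2021 after the pandemic hit in 2020.",
    "The football season opens next week.",
]

# Keep only texts that mention at least one keyword:
relevant = [t for t in texts if RELEVANT.search(t)]

# Extract substrings (here: years) from the relevant texts:
years = [YEAR.findall(t) for t in relevant]
print(years)
# → [['2021', '2020']]
```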

In [None]:
#tokenize and POS-tag with NLTK:
tokens = [nltk.word_tokenize(sentence) for sentence in articles]
tagged = [nltk.pos_tag(sentence) for sentence in tokens]
print(tagged[0])


#detect named entities with Spacy:
nlp = spacy.load('en_core_web_sm')

doc = [nlp(sentence) for sentence in articles]
for i in doc:
    for ent in i.ents:
        if ent.label_ == 'PERSON':
            print(ent.text, ent.label_ )

In [None]:
#TODO: integrate these solutions - 

#removing stopwords:

mystopwords = set(stopwords.words('english')) # use default NLTK stopword list; alternatively:
#mystopwords = set(open('mystopwordfile.txt').read().split())  #read stopword list from a textfile with one stopword per line (`.readlines()` would keep the trailing newlines)
documents = [" ".join([w for w in doc.split() if w not in mystopwords]) for doc in documents]
print(documents[7])


#using N-grams as features:
documents_bigrams = [["_".join(tup) for tup in nltk.ngrams(doc.split(),2)] for doc in documents] # creates bigrams
print(documents_bigrams[7][:5]) # inspect the results...

#maybe we want both unigrams and bigrams in the feature set?
assert len(documents)==len(documents_bigrams)

documents_uniandbigrams = []
for a,b in zip([doc.split() for doc in documents],documents_bigrams):
    documents_uniandbigrams.append(a + b)

#and let's inspect the outcomes again.
print(documents_uniandbigrams[7])
print(len(documents_uniandbigrams[7]), len(documents_bigrams[7]), len(documents[7].split()))


#or, if you want to inspect collocations:
text = [nltk.Text(tkn for tkn in doc.split()) for doc in documents ]
text[7].collocations(num=10)

#NOTE: if you want to include n-grams as feature input, pass the following argument to your vectorizer:
from sklearn.feature_extraction.text import CountVectorizer
myvectorizer = CountVectorizer(analyzer=lambda x:x)

### 5. Train a supervised classifier

Go back to your code from yesterday's assignment. Perform the same classification task, but this time carefully consider which feature set you want to use. Reflect on the options listed above, and extract the features that you think are relevant. Carefully consider the **pre-processing steps**: what type of features will you feed your algorithm? Do you, for example, want to manually remove stopwords, or include n-grams? Use these features as input for your classifier, and investigate how they affect its performance. Note that the purpose is not to build the perfect classifier, but to inspect the effects of different feature engineering decisions on the outcomes of your classification algorithm.
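
One possible way to organize such a comparison is to loop over several vectorizer settings and score each with the same classifier. The sketch below assumes `scikit-learn` is installed; the toy documents and labels are invented for illustration and should be replaced with your own `documents` and `labels`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (invented); replace with your own texts and labels.
docs = [
    "the team scored a late goal in the match",
    "the striker missed a penalty in the final",
    "the coach praised the team after the match",
    "fans cheered as the team won the league",
    "the minister presented the new budget",
    "parliament voted on the election law",
    "the minister answered questions in parliament",
    "voters went to the polls for the election",
]
labels = ["sports"] * 4 + ["politics"] * 4

# Each entry is one feature engineering decision to evaluate.
settings = {
    "unigrams": CountVectorizer(),
    "unigrams, no stopwords": CountVectorizer(stop_words="english"),
    "unigrams + bigrams": CountVectorizer(ngram_range=(1, 2)),
}

results = {}
for name, vectorizer in settings.items():
    pipeline = make_pipeline(vectorizer, MultinomialNB())
    # cv=2 only because the toy data is tiny; use more folds on real data.
    scores = cross_val_score(pipeline, docs, labels, cv=2)
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {results[name]:.2f}")
```

Putting the vectorizer inside the pipeline means it is refitted on the training part of each fold, so the vocabulary never leaks from the held-out texts.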

In [54]:
### using manually crafted features as input for supervised machine learning with `sklearn`
import nltk # for NLP
import random # for RNG
from glob import glob # for filepaths
from sklearn.model_selection import train_test_split # for creating train-test splits


#define a function to read the data:
def read_data(listofoutlets):
    texts = []
    labels = []
    for label in listofoutlets:
        for file in glob(datadir+f'/articles/*/{label}/*'):
            with open(file) as f:
                texts.append(f.read())
                labels.append(label)
    return texts, labels

#execute, returning corresponding lists of texts and labels:
documents, labels = read_data(['Infowars', 'BBC'])

In [60]:
#create bigrams and combine with unigrams  
documents_bigrams = [["_".join(tup) for tup in nltk.ngrams(doc.split(),2)] for doc in documents] # creates bigrams

# maybe we want both unigrams and bigrams in the feature set?
assert len(documents)==len(documents_bigrams)

documents_uniandbigrams = []
for a,b in zip([doc.split() for doc in documents],documents_bigrams):
    documents_uniandbigrams.append(a + b)

#and let's inspect the outcomes again.
#documents_uniandbigrams[7]


In [61]:
#some sanity checks:
print(len(documents_uniandbigrams[7]), len(documents_bigrams[7]), len(documents[7].split()))
assert len(documents_uniandbigrams) == len(labels)

In [None]:
#now let's fit a `sklearn` vectorizer on the manually crafted feature set:

from sklearn.feature_extraction.text import CountVectorizer
X_train,X_test,y_train,y_test=train_test_split(documents_uniandbigrams, labels, test_size=0.3)

#NOTE: we do *not* want scikit-learn to tokenize a string into a list of tokens,
# after all, we already *have* a list of tokens. lambda x:x is just a fancy way of saying: do nothing!
myvectorizer= CountVectorizer(analyzer=lambda x:x)



#fit the vectorizer, and transform:
X_features_train = myvectorizer.fit_transform(X_train)
X_features_test = myvectorizer.transform(X_test)


#inspect the vocabulary and its id mappings:
myvectorizer.vocabulary_

In [None]:
#finally, run the model again
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

model = MultinomialNB()
model.fit(X_features_train, y_train)
y_pred = model.predict(X_features_test)

print(f"Accuracy : {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))



###final remark on n-grams in scikit-learn

#Of course, you do not *have* to do all of this if you just want to use n-grams. Alternatively, you can simply use
#
#    myvectorizer = CountVectorizer(ngram_range=(1,2))
#    X_features_train = myvectorizer.fit_transform(X_train)
#
#provided that X_train holds the *untokenized* texts (plain strings). The two lines are commented out here
#because X_train above holds lists of tokens, which the default analyzer cannot process.

#what this little example illustrates, though, is that you can use *any* manually crafted feature set as input for scikit-learn.


## BONUS

- Compare the bottom-up (machine learning) approach from this exercise with a top-down (keyword or regular-expression based) approach.
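
As a starting point, a top-down classifier can be as simple as counting keyword hits per class. The sketch below is one possible design, not a reference solution; the keyword lists, class names and the function `keyword_classify` are all hypothetical.

```python
import re

# Hypothetical keyword lists per class; in a real project these come from theory or a codebook.
KEYWORDS = {
    "sports": re.compile(r"\b(match|goal|team|league)\w*\b", re.IGNORECASE),
    "politics": re.compile(r"\b(parliament|minister|election|budget)\w*\b", re.IGNORECASE),
}

def keyword_classify(text, default="unknown"):
    """Return the label whose keywords match most often; ties and zero hits give `default`."""
    counts = {label: len(pattern.findall(text)) for label, pattern in KEYWORDS.items()}
    best = max(counts, key=counts.get)
    top = counts[best]
    if top == 0 or list(counts.values()).count(top) > 1:
        return default
    return best

print(keyword_classify("The team scored a late goal"))
# → sports
```

Applying such a function to the test texts and passing the predictions to `accuracy_score` and `classification_report` allows a direct comparison with the supervised model.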