# BLU09 - Exercises

Welcome to the exercises of the BLU09! Should you get stuck on an exercise take a look at the hints or at the learning notebook in order to get some clues. Good luck!

In [1]:
import json
import pandas as pd
import numpy as np
from hashlib import sha256
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
import spacy
from spacy.matcher import Matcher

## The Goal
In this learning unit you are going to investigate how we can extract features from our textual data to determine if an e-mail is 'spam' or 'not spam' (also known as ham). You will start by building some basic features, then go on to build more complex ones, and finally putting it all together. You should be able to have a working classifier by the end of the notebook. 

## The Dataset
You are going to use a very well known Kaggle dataset for spam detection - [Kaggle Spam Collection](https://www.kaggle.com/uciml/sms-spam-collection-dataset). This is the same dataset that you looked at in the learning notebooks.


In [3]:
df = pd.read_csv('datasets/spam.csv', encoding='latin1')
df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1,inplace=True)
df.rename(columns={"v1":"label", "v2":"message"},inplace=True)
df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
# load the medium-sized SpaCy model
nlp = spacy.load('en_core_web_md')

In [5]:
# Create a list of SpaCy "Docs" by leveraging the SpaCy pipeline
docs = list(nlp.pipe(df[:3000].message))

## Q1 - Text exploration with SpaCy 

Your team's productivity is plummeting: there is just way too many spam emails jamming everyone's inboxes. You decide to take action. In order to look cool in front of everyone and save the team from disgrace, you decide to build a spam classifier that will filter out spam emails. 

You decide to start simple and perform some exploration using `SpaCy`.

### Q1.a) Create a simple matcher
You suspect that the words "FREE", "WIN", and "URGENT" often occur in spam emails. Take advantage of SpaCy's `Matcher` to count the total number *exact* of matches of these words. Looking at the below figure should help you choose the pattern to use for this purpose.

![](media/token_attributes.png)

In [15]:
# Count the number of total exact matches of the words "FREE", "WIN", and "URGENT" using the SpaCy Matcher and assign it to "count"
matcher = Matcher(nlp.vocab)
                            
for w in ["FREE", "WIN", "URGENT"]:
    pattern = [{'ORTH':w}]
    matcher.add(w,None,pattern)

count =0
for doc in docs:
    matches = matcher(doc)
    count += len(matches)

# YOUR CODE HERE

[]
0
[]
0
[]
0
[]
0
[]
0
[]
0
[]
0
[]
0
[]
0
[(9248010018089469450, 28, 29)]
1
[]
1
[]
1
[(11683445893799161430, 0, 1), (9248010018089469450, 8, 9)]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[]
3
[(9248010018089469450, 27, 28)]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[]
4
[(11683445893799161430, 0, 1)]
5
[]
5
[]
5
[]
5
[]
5
[]
5
[]
5
[]
5
[]
5
[]
5
[]
5
[]
5
[]
5
[]
5
[]
5
[]
5
[]
5
[]
5
[(9248010018089469450, 13, 14)]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]
6
[]


[]
50
[]
50
[]
50
[]
50
[]
50
[(11683445893799161430, 0, 1)]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[]
51
[(11683445893799161430, 0, 1)]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]
52
[]

In [16]:
count_hash = '7b1a278f5abe8e9da907fc9c29dfd432d60dc76e17b0fabab659d2a508bc65c4'
assert sha256(str(count).encode()).hexdigest() == count_hash

### Q1.b) Extract URLs

You also suspect that spam messages have many URLs. Build a matcher to extract all URLs from the text and store the URLs (and only the URLs!) in a list. Assign the result to `url_list`.

In [27]:
#Extract all the URLs in the text and store them in a list called url_list
matcher = Matcher(nlp.vocab)

url_list = []

pattern = [ {"LIKE_URL": True}]
matcher.add('url',None,pattern)

for doc in docs:
    matches = matcher(doc)
    for match_id, start, end in matches:
        url_list.append(doc[start:end]) 
        
# YOUR CODE HERE



In [26]:
list_hash = '141ad9ecb8c7ee6199cd399cf154f13ca99858aa6a71ff8a227dccd28f909350'
assert len(url_list) == 88
assert sha256(','.join(map(str, url_list)).encode()).hexdigest() == list_hash

### Q1.c) Extract Part of Speech features

You also think that spam messages may have many verbs (luring you to take action), and so you decide to extract the verbs from the text. 

To help you, here's the list of PoS available in SpaCy:

![](media/pos_helper.png)

To complete this exercise you should build a matcher to extract verbs. Use this matcher to create a list containing the number of verbs for each document. Store the verb counts in a list called `verb_counts`.

In [64]:
#Store the verb counts (according to SpaCy's matcher) per document in a list called verb_counts
#Hint: using the function "len(...)" might help
#
matcher = Matcher(nlp.vocab)
pattern = [{"POS":"VERB"}]
matcher.add("V",None,pattern)

verb_counts = []
for doc in docs:
    matches = matcher(doc)
    verb_counts.append(len(matches))


# YOUR CODE HERE

verb_counts

[2,
 1,
 2,
 2,
 4,
 7,
 3,
 4,
 5,
 3,
 8,
 3,
 3,
 11,
 1,
 3,
 1,
 5,
 1,
 3,
 2,
 3,
 2,
 3,
 2,
 9,
 0,
 11,
 6,
 3,
 8,
 9,
 1,
 2,
 9,
 7,
 4,
 1,
 1,
 11,
 5,
 7,
 4,
 2,
 2,
 1,
 2,
 1,
 3,
 10,
 2,
 3,
 3,
 17,
 1,
 4,
 4,
 1,
 2,
 1,
 5,
 1,
 1,
 2,
 6,
 6,
 6,
 2,
 3,
 2,
 2,
 1,
 2,
 3,
 2,
 4,
 3,
 1,
 2,
 3,
 1,
 2,
 1,
 2,
 0,
 4,
 5,
 7,
 1,
 1,
 4,
 8,
 7,
 4,
 6,
 7,
 1,
 4,
 6,
 4,
 3,
 5,
 4,
 10,
 2,
 1,
 2,
 3,
 6,
 4,
 1,
 1,
 5,
 7,
 4,
 6,
 6,
 3,
 4,
 4,
 5,
 1,
 8,
 6,
 2,
 2,
 12,
 1,
 3,
 2,
 1,
 3,
 1,
 3,
 6,
 2,
 3,
 2,
 7,
 5,
 2,
 1,
 5,
 5,
 1,
 3,
 7,
 4,
 1,
 1,
 5,
 2,
 1,
 4,
 14,
 1,
 2,
 10,
 4,
 6,
 5,
 1,
 4,
 7,
 4,
 2,
 7,
 4,
 3,
 1,
 1,
 6,
 1,
 4,
 4,
 6,
 1,
 7,
 7,
 4,
 5,
 2,
 3,
 1,
 1,
 8,
 1,
 4,
 4,
 6,
 2,
 10,
 1,
 7,
 6,
 0,
 2,
 2,
 6,
 1,
 1,
 6,
 6,
 2,
 2,
 5,
 6,
 1,
 2,
 2,
 5,
 2,
 0,
 0,
 2,
 1,
 2,
 3,
 2,
 13,
 1,
 10,
 1,
 3,
 4,
 5,
 6,
 0,
 5,
 1,
 4,
 2,
 2,
 1,
 4,
 3,
 1,
 2,
 7,
 8,
 0,
 1,
 6,
 9,
 2,
 6,
 2,


In [34]:
hash_count = '073a8f81f495d87e3ebf393c8db29d2f8bb0767a18aff2d3aa55ffd10e8787bf'
assert sha256(str(sum(verb_counts)).encode()).hexdigest() == hash_count

### Q1.d) Extract entities

You also think it would be useful to extract some Organizations from the text in order to check for any re-occurring patterns.

Build a `Matcher` to match organizations in the text and extract the top 5 most common ones. Assign them to `most_common_ents`.

*hint: Use [Counter](https://docs.python.org/3/library/collections.html#collections.Counter) to extract the most common elements (check the most_common(n) method). You will need to feed it strings (not SpaCy spans)*

*note: in a real-case scenario we would perform some text preprocessing first and build a better entity recognizer, but let us not worry about that here*


In [35]:
a = dict()

In [36]:
a["ze"]

KeyError: 'ze'

In [97]:
# Build a matcher to extract the organization-type entities from the text and assign them to most_common_ents
#
matcher = Matcher(nlp.vocab)
pattern = [{"ENT_TYPE":"ORG"}]
matcher.add("ORGAN",None,pattern)

orgs = []
for doc in docs:
    matches = matcher(doc)
    for match_id, start, end in matches:
#         if doc[start:end] in orgs:
#             print(doc[start:end])
        orgs.append(str(doc[start:end]))

        #print(doc[start:end])

from collections import Counter  # available in Python 2.7 and newer

counts = Counter(orgs)    
    
"most_common_ents = ..."
# YOUR CODE HERE
#sorted(counts,reverse=True)
#for key in counts.keys():
    #print(counts[key])
most_common_ents =  counts.most_common(5)
most_common_ents

[('&', 67), ('/', 32), ('Nokia', 25), ('lt;#&gt', 24), (' ', 21)]

In [98]:
ent_hash = "1d45ae99abcc02002be90eabecf61d0ce0613d1de5f0c37ddd7bbbd7e8198cf5"  # sha256('Nokia')
assert len(most_common_ents) == 5
assert ent_hash in (sha256(ent.encode()).hexdigest() for ent, count in most_common_ents)

## Q2 - Extracting features

You decide to start extracting more complex features in the hope of extracting more information from the text.
Let us now work with the full Kaggle dataset.

In [95]:
all_docs = list(nlp.pipe(df['message']))

### Q2.a) Create "complex" patterns
You suspect that spammers write emails that compel you to take action by talking about things which are great. So, you decide to investigate the ocurrences of an adjective followed by an entity.

Build a `Matcher` that matches all occurrences of an adjective followed by an organization and store the 5 most common ocurrences in a list named `most_common_adj_ents`.

In [104]:
#Build a `Matcher` that matches all ocurrences of an adjective followed by
#and organization and store the 5 most common ocurrences in a list named most_common_adj_ents
#
matcher = Matcher(nlp.vocab)

pattern = [{'POS': 'ADJ'}, {"ENT_TYPE":"ORG"}]
matcher.add("adjorg",None,pattern)

adj_ents = []
for doc in docs:
    matches = matcher(doc)
    for match_id, start, end in matches:
        adj_ents.append(str(doc[start:end]))

most_common_adj_ents = Counter(adj_ents).most_common(5)

# YOUR CODE HERE


In [105]:
adj_ent_hash = "4dbad11fe79c962e80f0ad3dc3ef788eaa4b7412c6c99f1d1fe78586486e4afe"  # sha256('top Sony')
assert len(most_common_adj_ents) == 5
assert adj_ent_hash in (sha256(adj_ent.encode()).hexdigest() for adj_ent, count in most_common_adj_ents)

### Q2.b) Create Numerical Features

You start thinking what features could actually be useful for solving your problem. One possible factor that may help is to know the number of adjectives and verbs used, the number of entities, and the length of the messages. 

Add extra fields to the `df` dataframe with the count of the number of adjectives and verbs, entities (all types), and the length of the document. For this consider the adjectives, verbs, and entities as those identified by SpaCy and the length of the text as the count of characters. 

*note: the number of verbs and adjectives should be summed in a single variable*

Assign the number of verbs and adjectives to a new column called `n_verb_adjs`, the number of entities to a column called `n_ents`, and the length of the message to a column called `len_message`.

In [111]:
#Assign the number of verbs and adjectives to a new column called n_verbs_adjs,
#the number of entities to a column called n_ents,
#and the lenght of the message to a column called len_message
#
# Hint: you can iterate over the tokens in Spacy doc to inspect them 
# for doc in docs:
#     print(doc.ents)
#     for token in doc:
#         print(token.pos_)

n_ents = []
n_verb_adjs = []
len_message = []
#
for doc in all_docs:
    count_ents = 0
    count_verb_adjs = 0
    for token in doc:
        count_ents += 1
        if token.pos_ == "ADJ":
            count_verb_adjs += 1

    n_ents.append(count_ents)
    n_verb_adjs.append(count_verb_adjs)
    len_message.append(len(doc))
#    
df['n_verb_adjs'] = n_verb_adjs
df['n_ents'] = n_ents
df['len_message'] = len_message

# YOUR CODE HERE


In [112]:
assert all(col in df.columns for col in ('n_verb_adjs', 'n_ents', 'len_message'))
assert np.allclose(df.n_verb_adjs.sum(), 23979, 20)
assert np.allclose(df.n_ents.sum(), 5537, 20)

## Q3 - Pipelines and Feature Unions
It is now time for you to leverage on your newly built features and construct pipelines that can be fed to a classifier. You decide to use a [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) as you hear from industry experts it tends to work well for text classification problems.

In [113]:
# split data into train and test sets
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

In [114]:
class Selector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a column from the dataframe to perform additional transformations on
    """ 
    def __init__(self, key):
        self.key = key
        
    def fit(self, X, y=None):
        return self
    

class TextSelector(Selector):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on text columns in the data
    """
    def transform(self, X):
        return X[self.key]
    
    
class NumberSelector(Selector):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on numeric columns in the data
    """
    def transform(self, X):
        return X[[self.key]]

    
def get_accuracy(feats, train_data, test_data):
    """
    Return the accuracy on the test_data by using a RandomForestClassifier trained on the 
    train_data with the features described by feats
    """

    pipeline = Pipeline([
        ('features',feats),
        ('classifier', RandomForestClassifier(random_state = 42, n_estimators=10)),
    ])

    pipeline.fit(train_data, train_data.label)

    preds = pipeline.predict(test_data)
    accuracy = np.mean(preds == test_data.label)
    
    print("Accuracy: {:.4f}".format(accuracy))
    
    return accuracy

### Q3.a) Build a Feature Union
You hypothesize that combining the text and numerical features could help you build a strong classifier. 

Use `FeatureUnion` to join the text features extracted from a standard sklearn `TfidfVectorizer` (with $ngram\_range=(1,2)$) and the numeric feature of the length of the messages scaled to zero mean and unit variance *[hint](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)*. Assign them to a variable named `feats`.

In [123]:
text_pipe = Pipeline([
                ('selector', TextSelector("message")),
                ('tfidf', TfidfVectorizer())
            ])

len_pipe =  Pipeline([
                ('selector', NumberSelector("len_message")),
                ('standard', StandardScaler())
            ])

feats = FeatureUnion([('text',text_pipe),
                      ('numb',len_pipe)
                    ])

# YOUR CODE HERE

In [124]:
assert isinstance(feats, FeatureUnion)
assert any(isinstance(obj, Selector) for obj in feats.transformer_list[0][1])
assert any(isinstance(obj, TfidfVectorizer) for obj in feats.transformer_list[0][1])
assert np.allclose(get_accuracy(feats, train_data, test_data), 0.9668, 0.01)

Accuracy: 0.9731


### Q3.b) Add more features
You decide to try adding the number of verbs and adjectives to your features to see if they can improve the performance of your classifier. 

Add the number of verbs and adjs `n_verb_adjs` that you computed in Q2.b to your features. Assign your features to `feats_v2`.

In [127]:
verbs = Pipeline([
                ('selector', NumberSelector("n_verb_adjs")),
                ('standard', StandardScaler())
            ])
#...
feats_v2 = FeatureUnion([('feats',feats),
                         ('verbs',verbs)
                        ])

# YOUR CODE HERE

In [128]:
accuracy = get_accuracy(feats_v2, train_data, test_data)
assert np.allclose(accuracy, 0.9704, 0.01)

Accuracy: 0.9695


In [125]:
train_data

Unnamed: 0,label,message,n_verb_adjs,n_ents,len_message
1978,ham,No I'm in the same boat. Still here at my moms...,2,25,25
3989,spam,(Bank of Granite issues Strong-Buy) EXPLOSIVE ...,0,44,44
3935,ham,They r giving a second chance to rahul dengra.,1,10,10
4078,ham,O i played smash bros &lt;#&gt; religiously.,0,12,12
4086,spam,PRIVATE! Your 2003 Account Statement for 07973...,2,26,26
4919,ham,"G says you never answer your texts, confirm/deny",0,11,11
2268,spam,88066 FROM 88066 LOST 3POUND HELP,0,6,6
4696,ham,"Okey dokey, iÛ÷ll be over in a bit just sorti...",0,15,15
3653,ham,Why i come in between you people,0,7,7
70,ham,Wah lucky man... Then can save money... Hee...,1,11,11


### Q3.c) Add the entities feature
You try to improve your model even further by including the number of entities `n_ents` feature that you created in Q2.b above. 

Add the number of entities to your features and assign the result to `feats_v3` (**no need to scale** the features this time).

In [131]:
ents = Pipeline([
                ('selector', NumberSelector("n_ents")),
                ('standard', StandardScaler())
            ])
feats_v3 = FeatureUnion([
                ('prev', feats_v2),
                ('ents', ents)
            ])

# YOUR CODE HERE

In [132]:
accuracy = get_accuracy(feats_v3, train_data, test_data)
assert np.allclose(accuracy, 0.9659, 0.01)

Accuracy: 0.9722


You realize that your accuracy actually decreased, which reminds you that more features does not necessarily mean better results.

## Conclusion

You realize you can get fairly high accuracy on the spam problem using a fairly simple solution. You know there are many things you could improve and many further paths you could choose in order to try to take your classifier to the next level, but you decide to leave that challenge for another day. 