In this notebook we will process the synthetic Austen/food reviews data and convert it into feature vectors. In later notebooks these feature vectors will be the inputs to models which we will train and eventually use to identify spam. 

This notebook uses [term frequency-inverse document frequency](https://en.wikipedia.org/wiki/Tf–idf), or tf-idf, to generate feature vectors. Tf-idf is commonly used to summarise text data, and it aims to capture how important different words are within a set of documents. It combines a normalized word count (the term frequency) with the inverse document frequency (a measure of how rare a word is across all documents) in order to identify words, or terms, which are 'interesting' or important within a given document.


We begin by loading in the data:

In [None]:
import pandas as pd
import os.path

df = pd.read_parquet(os.path.join("data", "training.parquet"))

To illustrate the computation of tf-idf vectors we will first implement the method on a sample of three of the documents we just loaded.   

In [None]:
import numpy as np

np.random.seed(0xc0ffeeee)
df_samp = df.sample(3)

In [None]:
pd.set_option('display.max_colwidth', None)  # ensures that all the text is visible (pandas >= 1.0; older versions used -1)
df_samp

We begin by computing the term frequency ('tf') of the words in the three texts above. We use the `token_pattern` parameter to keep only tokens that start with a letter and are at least two characters long, which excludes purely numeric values. We limit the number of words (`max_features`) to 20, so that we can easily inspect the output. This means that only the 20 words which appear most frequently across the three texts will be represented.
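
To see what this pattern does and does not match, we can try it with Python's built-in `re` module. This is only a small sketch: the sample sentence is made up, and `CountVectorizer` also lower-cases the text by default before applying the pattern.

```python
import re

# The same regular expression that is passed to token_pattern in the next cell.
TOKEN_PATTERN = r"(?u)\b[A-Za-z]\w+\b"

# A made-up sentence mixing words, numbers and mixed tokens.
sample_sentence = "I ate 2 eggs at 9am with a1 sauce"

# Tokens must start with a letter and contain at least two characters.
sample_tokens = re.findall(TOKEN_PATTERN, sample_sentence)
print(sample_tokens)
```

Note that single-letter words ("I"), pure numbers ("2") and tokens that start with a digit ("9am") are dropped, while a token such as "a1" is kept because it starts with a letter.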

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(token_pattern=r'(?u)\b[A-Za-z]\w+\b', max_features=20)
counts = vectorizer.fit_transform(df_samp["text"])

In [None]:
vectorizer.get_feature_names_out()  # shows all the words used as features (scikit-learn >= 1.0)

In [None]:
counts

In [None]:
print(counts.toarray()) 

Each row of the array corresponds to one of the texts, whilst the columns relate to the words considered in this vectorizer. (You can confirm that 'all' appears once in the first two texts, and twice in the third text, and so on.)
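
To make the row and column layout concrete, here is a minimal pure-Python sketch of how such a count matrix can be built. The three short texts are made-up stand-ins for the sampled reviews and the `toy_*` names are ours; the code mimics, rather than reproduces, what `CountVectorizer` does internally.

```python
import re
from collections import Counter

TOKEN_PATTERN = r"(?u)\b[A-Za-z]\w+\b"

# Hypothetical stand-ins for the three sampled texts.
toy_docs = [
    "the tea was good",
    "the cake was very good",
    "the tea and the cake",
]

# Tokenise each document (lower-cased, as CountVectorizer does by default).
toy_tokens = [re.findall(TOKEN_PATTERN, doc.lower()) for doc in toy_docs]

# The vocabulary is the sorted set of tokens; a word's position is its column.
toy_vocab = sorted({tok for toks in toy_tokens for tok in toks})

# One row per document, one column per vocabulary word.
toy_counts = []
for toks in toy_tokens:
    tally = Counter(toks)
    toy_counts.append([tally[word] for word in toy_vocab])

print(toy_vocab)
for row in toy_counts:
    print(row)
```

Here "the" is the fifth entry of the vocabulary, so the fifth column holds its counts: 1, 1 and 2 for the three texts.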

The next stage of the process is to use the results of the term frequency matrix to compute the tf-idf. 

The inverse document frequency (idf) for a particular word, or feature, is computed as the logarithm of the ratio of the number of documents in the corpus to the number of documents which contain that feature (up to some additive smoothing constants).
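
As a worked example, the sketch below computes tf-idf by hand on a tiny made-up count matrix. It assumes the scikit-learn defaults as we understand them from the `TfidfTransformer` documentation (`smooth_idf=True`, `norm='l2'`), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each row is then scaled to unit length. The matrix and the `toy_*` names are ours.

```python
import math

# Hypothetical term-count matrix: 3 documents (rows) x 4 terms (columns).
toy_tf = [
    [1, 1, 0, 0],
    [1, 0, 2, 0],
    [1, 0, 0, 3],
]

n_docs = len(toy_tf)
n_terms = len(toy_tf[0])

# Document frequency: the number of documents in which each term occurs.
doc_freq = [sum(1 for row in toy_tf if row[j] > 0) for j in range(n_terms)]

# Smoothed idf, assuming the TfidfTransformer(smooth_idf=True) formula.
toy_idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]


def l2_normalise(row):
    """Scale a row to unit Euclidean length (all-zero rows are left alone)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row


# Weight each count by its idf, then normalise each document's vector.
toy_tfidf = [
    l2_normalise([tf * w for tf, w in zip(row, toy_idf)]) for row in toy_tf
]

print([round(w, 3) for w in toy_idf])
for row in toy_tfidf:
    print([round(v, 3) for v in row])
```

The first term occurs in every document, so it receives the smallest idf weight (exactly 1), while the terms that occur in a single document are weighted more heavily.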

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
df_tfidf = tfidf_transformer.fit_transform(counts)

In [None]:
print(df_tfidf.toarray())

Each row of the object above is the desired tf-idf vector for the relevant document. 

A major disadvantage of this kind of vectorizer is that it depends on the dictionary of words it sees when it is 'fit' to the data. As such, if we are presented with a new passage of text and wish to compute a feature vector for that text, we need to know which word maps to which position in the vector. Keeping track of such a dictionary is impractical and leads to inefficiency.

Furthermore, there are only "spaces" in the vectorizer for words that have been seen in the fitting stage. If a new text sample contains a word which was not present when the vectorizer was first fit, there will be no place in the feature vectors to count that word. 
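
The short sketch below illustrates this limitation with a `CountVectorizer` fitted on two made-up sentences (using its default token pattern rather than the one from earlier in the notebook). The `toy_*` names are ours.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training texts; the vocabulary is fixed when fit() is called.
toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(["the tea was good", "the cake was good"])

# "scones" and "and" were never seen during fitting, so they have no column.
toy_new_counts = toy_vectorizer.transform(["the scones and the tea"])

print(sorted(toy_vectorizer.vocabulary_))
print(toy_new_counts.toarray())
```

Only "the" and "tea" are counted in the new text; the unseen words are silently ignored.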

With that in mind, we consider using a [hashing vectorizer](https://en.wikipedia.org/wiki/Feature_hashing). Each word is hashed to a bucket, and that bucket's count is incremented. This gives us a counts matrix like the one we saw above, from which we can compute a tf-idf matrix without needing to keep track of which column any given word maps to.

One disadvantage of this approach is that collisions will occur: with a finite set of buckets, multiple words will hash to the same bucket. As such, we are no longer computing an exact tf-idf matrix.

Furthermore, we will not be able to recover the word (or words) associated with a bucket at a later time if we need them. (For our application this won't be needed.)
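
The idea can be sketched in a few lines of pure Python. This is an illustration only: it uses MD5 from `hashlib` as a stand-in for the MurmurHash3 function that scikit-learn's `HashingVectorizer` uses, and a deliberately tiny number of buckets so that collisions are guaranteed. The function and variable names are ours.

```python
import hashlib
import re

TOKEN_PATTERN = r"(?u)\b[A-Za-z]\w+\b"


def bucket_for(word, n_buckets):
    """Map a word to a bucket index with a stable hash (MD5 as a stand-in)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hashed_counts(text, n_buckets):
    """Count tokens per bucket; no vocabulary is stored anywhere."""
    vec = [0] * n_buckets
    for tok in re.findall(TOKEN_PATTERN, text.lower()):
        vec[bucket_for(tok, n_buckets)] += 1
    return vec


toy_text = "the tea was good and the cake was very good"
toy_vec = hashed_counts(toy_text, n_buckets=4)
print(toy_vec)

# Seven distinct words share only four buckets, so some must collide.
toy_words = sorted(set(re.findall(TOKEN_PATTERN, toy_text)))
print({word: bucket_for(word, 4) for word in toy_words})
```

Any new text, including one with previously unseen words, can be vectorised the same way without fitting anything first. The price is that the bucket counts alone no longer tell us which words produced them.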

We fix the number of buckets at 2<sup>10</sup> = 1024, but you can try using a different number of buckets and see how the spam detection models are affected.

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer
BUCKETS=1024

hv = HashingVectorizer(norm=None, token_pattern=r'(?u)\b[A-Za-z]\w+\b', n_features=BUCKETS, alternate_sign=False)
hv


In [None]:
hvcounts = hv.fit_transform(df["text"])
hvcounts

We can then go on to compute the "approximate" tf-idf matrix for this, by applying the tf-idf transformer to the hashed counts matrix.

In [None]:
tfidf_transformer = TfidfTransformer()
hvdf_tfidf = tfidf_transformer.fit_transform(hvcounts)

In [None]:
hvdf_tfidf

These vectors have far too many dimensions for us to easily picture them as points in space. [Principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis), or PCA, is a statistical technique that is over a century old; it takes observations in a high-dimensional space and maps them to a (potentially much) smaller number of dimensions. We'll see a closely related technique in action now: scikit-learn's `TruncatedSVD`, which, unlike its [PCA implementation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA), does not mean-center the data and so can work directly on our sparse tf-idf matrix.

(To learn a little more about PCA and an alternative technique, visit [the visualisation notebook](01-vectors-and-visualization.ipynb).)
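
For intuition about how a projection like this works, the sketch below carries out PCA by hand with NumPy: center the data, take a singular value decomposition, and project onto the first two directions. The random data and the `toy_*` names are ours. The cell that follows uses `TruncatedSVD`, which performs a similar decomposition but without the centering step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 observations in 5 dimensions.
toy_X = rng.normal(size=(50, 5))

# PCA subtracts each column's mean before decomposing. TruncatedSVD skips
# this step, which is what lets it operate on sparse matrices.
toy_Xc = toy_X - toy_X.mean(axis=0)

# Rows of toy_Vt are the principal directions, ordered by singular value.
toy_U, toy_S, toy_Vt = np.linalg.svd(toy_Xc, full_matrices=False)

# Project onto the first two directions to obtain 2-D coordinates.
toy_2d = toy_Xc @ toy_Vt[:2].T
print(toy_2d.shape)
```

Each observation is now described by two coordinates, chosen so that they retain as much of the variance in the centered data as any two linear directions can.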

In [None]:
# Project to two dimensions (TruncatedSVD, a PCA-like method) so that the output can be visualised

import sklearn.decomposition

DIMENSIONS = 2

pca2 = sklearn.decomposition.TruncatedSVD(DIMENSIONS)

pca_a = pca2.fit_transform(hvdf_tfidf)

In [None]:
pca_df = pd.DataFrame(pca_a, columns=["x", "y"])
pca_df.sample(10)

In [None]:
plot_data = pd.concat([df.reset_index(), pca_df], axis=1)

In [None]:
import altair as alt

alt.Chart(plot_data.sample(1000)).encode(x="x", y="y", color="label").mark_point().interactive()

We want to be able to easily compute feature vectors using the hashing tf-idf workflow laid out above. The `Pipeline` facility in scikit-learn streamlines this workflow by making it easy to pass data through multiple transforms. In the next cell we set up our pipeline.

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
import os

vect = HashingVectorizer(norm=None, token_pattern=r'(?u)\b[A-Za-z]\w+\b', n_features=BUCKETS, alternate_sign=False)
tfidf = TfidfTransformer()

feat_pipeline = Pipeline([
    ('vect', vect),
    ('tfidf', tfidf)
])

We can then use the `fit_transform` method to apply the pipeline to our data frame. This produces a sparse matrix (only nonzero entries are recorded). We convert this to a dense array using the `toarray()` method, then append the index and labels to aid readability.

In [None]:
feature_vecs = feat_pipeline.fit_transform(df["text"]).toarray()
labeled_vecs = pd.concat([df.reset_index()[["index", "label"]],
                                pd.DataFrame(feature_vecs)], axis=1)
labeled_vecs.columns = labeled_vecs.columns.astype(str)

In [None]:
labeled_vecs.sample(10)

We save the feature vectors to a parquet file.

In [None]:
labeled_vecs.to_parquet(os.path.join("data", "features.parquet"))

We will then serialize our pipeline to a file on disk so that we can reuse the document frequencies we've observed on training data to weight term vectors.
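
To illustrate why this is useful, the sketch below fits a small pipeline, round-trips it through Python's `pickle` module in memory, and checks that the restored copy produces the same vector for a new text. This is an assumption-laden illustration: we do not know how `util.serialize_to` stores the pipeline, and the toy texts, the 16-bucket setting and the variable names are ours.

```python
import pickle

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# A small stand-in for the notebook's pipeline (16 buckets instead of 1024).
toy_pipeline = Pipeline([
    ("vect", HashingVectorizer(norm=None, n_features=16, alternate_sign=False)),
    ("tfidf", TfidfTransformer()),
])
toy_pipeline.fit(["the tea was good", "the cake was very good"])

# Round-trip through pickle in memory; writing the bytes to a file is analogous.
toy_restored = pickle.loads(pickle.dumps(toy_pipeline))

# New text is passed through transform(), not fit_transform(), so that the
# document frequencies learned from the training data are reused.
toy_new_text = ["the scones were good"]
vec_before = toy_pipeline.transform(toy_new_text).toarray()
vec_after = toy_restored.transform(toy_new_text).toarray()
print((vec_before == vec_after).all())
```

Calling `fit_transform` on new data instead would recompute the document frequencies from that data alone, and the resulting vectors would no longer be comparable with those used in training.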

In [None]:
from mlworkflows import util
util.serialize_to(feat_pipeline, "feature_pipeline.sav")

Now that we have a feature engineering approach, the next step is to train a model. Again, you have two choices for your next step: [click here](04-model-logistic-regression.ipynb) for a model based on *logistic regression*, or [click here](04-model-random-forest.ipynb) for a model based on *ensembles of decision trees.*