# Vectorizer
In order to classify our headlines into categories, we first need to convert them into a vector representation that we can feed to a classifier.

## CountVectorizer
A count vectorizer is one way to convert a sentence into a vector.

### Preprocessing
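The CountVectorizer (just like a lot of other vectorizers) starts with preprocessing the data, e.g. by converting everything to lowercase, stripping accents, etc. Punctuation is removed as well; in scikit-learn this happens one step later, during tokenization.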
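As a small illustration of the preprocessing step, the sketch below asks a default `CountVectorizer` for its preprocessing function via `build_preprocessor()` and applies it to one of the example sentences. It assumes the default settings (`lowercase=True`, no accent stripping).

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_preprocessor() returns the function the vectorizer applies to each
# raw document before tokenization (default settings assumed).
preprocess = CountVectorizer().build_preprocessor()

cleaned = preprocess("Ik ben Bob of ben ik Bob?")
print(cleaned)
```

With the defaults, the text is only lowercased here; the question mark is still present and is dropped later by the tokenizer.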

### Tokenization
It also tokenizes the data, e.g. by splitting the sentence into words. This gives word n-grams. Often those word n-grams are of length 1 (single words), but you can set an n-gram range to get n-grams of different lengths (e.g. `(1, 3)` will generate all sequences of 1 to 3 consecutive words). Besides word n-grams, you can also generate character n-grams (where the n-grams are built from characters instead of words).
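To make the difference between word and character n-grams concrete, here is a minimal sketch that uses `build_analyzer()`, which returns the combined preprocessing and tokenization function of a vectorizer. The inputs and the chosen ranges are only examples.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Word n-grams of length 1 and 2.
word_analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
word_ngrams = word_analyzer("Hey Arno")
print(word_ngrams)

# Character n-grams of length 2 (bigrams of characters).
char_analyzer = CountVectorizer(analyzer="char", ngram_range=(2, 2)).build_analyzer()
char_ngrams = char_analyzer("Hey")
print(char_ngrams)
```

The first call yields the two single words plus the pair "hey arno"; the second yields the character pairs "he" and "ey".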

### Vectorization
The vectorization step is the last step in the CountVectorizer. It converts the tokenized data into a vector. Each position in this vector represents one of our tokens. The number at that position is how many times the token occurs in the given sentence.

In [25]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(["Hallo ik ben Arno", "Hey Arno", "Ik ben Bob of ben ik Bob?"])

count_vect.get_feature_names_out()

array(['arno', 'ben', 'bob', 'hallo', 'hey', 'ik', 'of'], dtype=object)

In [26]:
train_counts.toarray()

array([[1, 1, 0, 1, 0, 1, 0],
       [1, 0, 0, 0, 1, 0, 0],
       [0, 2, 2, 0, 0, 2, 1]], dtype=int64)
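The same counts can be reproduced by hand, which may help to see what the vectorizer does internally. The sketch below is a simplified re-implementation, not scikit-learn's actual code: it lowercases, keeps runs of two or more word characters (mirroring the default token pattern), sorts the vocabulary alphabetically and counts.

```python
import re
from collections import Counter

docs = ["Hallo ik ben Arno", "Hey Arno", "Ik ben Bob of ben ik Bob?"]

# Preprocess (lowercase) and tokenize (words of at least 2 characters).
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]

# The vocabulary fixes which token belongs to which vector position.
vocab = sorted({token for tokens in tokenized for token in tokens})

# One count vector per document, in vocabulary order.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[token] for token in vocab])

print(vocab)
print(vectors)
```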

## TfidfVectorizer
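Tf-idf stands for "Term Frequency"-"Inverse Document Frequency". We already determined the term frequency for each of the terms (tokens) using our count vectorizer. To go from the token count to the tf-idf value, you weight it by the inverse document frequency: the more documents a token appears in, the lower its weight, so tokens that occur everywhere count for less than tokens that are specific to a few documents. ([src](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)) Both the counting and the weighting are done by the TfidfVectorizer. By default it also normalizes each resulting vector to unit length, which is why the values below are fractions rather than whole counts.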



In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer()
train_counts = tfidf_vect.fit_transform(["Hallo ik ben Arno", "Hey Arno", "Ik ben Bob of ben ik Bob?"])


tfidf_vect.get_feature_names_out()

array(['arno', 'ben', 'bob', 'hallo', 'hey', 'ik', 'of'], dtype=object)

In [28]:
train_counts.toarray()

array([[0.45985353, 0.45985353, 0.        , 0.60465213, 0.        ,
        0.45985353, 0.        ],
       [0.60534851, 0.        , 0.        , 0.        , 0.79596054,
        0.        , 0.        ],
       [0.        , 0.4902234 , 0.64458485, 0.        , 0.        ,
        0.4902234 , 0.32229243]])

This gives us a vector representation for each of the headlines, which we can feed to our classifier.
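The tf-idf numbers above can be recomputed step by step from the raw counts. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, `norm="l2"`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`, with `n` the number of documents and `df` the number of documents containing the token.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["Hallo ik ben Arno", "Hey Arno", "Ik ben Bob of ben ik Bob?"]

# Term frequency: the raw counts from the count vectorizer.
counts = CountVectorizer().fit_transform(docs).toarray()

# Document frequency: in how many documents does each token occur?
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn default).
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale every row to unit (L2) length.
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

# Compare with what TfidfVectorizer computes directly.
expected = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(tfidf, expected)
print(matches)
```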

## Informative features
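A vectorizer determines a value for every token in a headline. While training, the model learns its own weights, which it combines with the values from the vectorizer to make predictions. For a linear model there is one weight per token, and the tokens with the largest weights (positive or negative) influence the prediction the most. In the util files, you can find a function `show_most_informative_features` which shows the most important words for a specific model/vectorizer.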

## Stemmer
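A stemmer converts similar words to the same stem, usually by stripping suffixes. E.g. when you have the words "run", "running" and "runs", the stemmer will convert them all to "run". Irregular forms such as "ran" are generally not covered by a rule-based stemmer; mapping those to "run" requires lemmatization instead. Stemming is useful because it reduces the number of features (tokens) in our vector representation.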
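As a quick illustration, the sketch below runs a few forms of "run" through a stemmer. It assumes NLTK is installed and uses its `PorterStemmer`; the text above does not say which stemmer is used, so treat this choice as an example. How forms like "runner" and "ran" come out depends on the stemmer.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["run", "running", "runs", "runner", "ran"]

# Map every word to the stem the Porter algorithm produces for it.
stems = {word: stemmer.stem(word) for word in words}
print(stems)
```

A stemmer like this can be combined with a vectorizer by passing a custom `tokenizer` or `analyzer` function that stems each token before counting.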