# Understanding vectorizers

In the following code examples, we will experiment with vectorizers to understand a bit better how they work. Feel free to adjust the code, and try things out yourself.

For now, we will practice with `sklearn`'s vectorizers. However, packages such as `gensim` offer their own built-in functionality to vectorize the data. We will start working with `gensim` in week 4.


In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

## Example 1: Inspect the output of a vectorizer in a dense format

The following code cell will fit and transform three documents using a `Count`-based vectorizer. Next, the output is transformed to a *dense* matrix, and printed. 

1. Do you understand the output?
2. Is it smart to transform output to a dense format? What will happen if you work with millions of documents, rather than 3 short sentences?
3. What happens if you replace `CountVectorizer()` with `TfidfVectorizer()`?

In [3]:
texts = ["hello students!", "how are you today?", "what?", "hello hello everybody"]
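vect = CountVectorizer()  # initialize the vectorizer

X = vect.fit_transform(texts)  # fit the vectorizer and transform the documents in one go
# convert the sparse output to a dense array and print it (rows = documents, columns = tokens)
print(pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out()).to_string())
# transposed view (rows = tokens, columns = documents), kept for further inspection
df = pd.DataFrame(X.toarray().transpose(), index=vect.get_feature_names_out())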

   are  everybody  hello  how  students  today  what  you
0    0          0      1    0         1      0     0    0
1    1          0      0    1         0      1     0    1
2    0          0      0    0         0      0     1    0
3    0          1      2    0         0      0     0    0


## Example 2: Inspect the output of a vectorizer in a sparse format

Internally, `sklearn` represents the data in a *sparse* format: only the non-zero entries are stored, which is computationally more efficient and requires less memory.
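
To get a rough feeling for the difference, the sketch below compares the number of cells a dense matrix needs with the number of entries the sparse matrix stores. It assumes the output is a SciPy CSR matrix (as the notebook output further down reports) and adds up the bytes of its three underlying arrays. The exact byte counts depend on your platform and library versions, so treat them as an illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["hello students!", "how are you today?", "what?", "hello hello everybody"]
X = CountVectorizer().fit_transform(texts)  # sparse matrix in CSR format

# a dense matrix needs one cell per (document, token) pair
n_cells = X.shape[0] * X.shape[1]
# the sparse matrix only stores the non-zero counts
n_stored = X.nnz

# a CSR matrix consists of three arrays: the values, their column indices, and row pointers
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = X.toarray().nbytes

print(f"dense cells: {n_cells}, stored entries: {n_stored}")
print(f"sparse bytes: {sparse_bytes}, dense bytes: {dense_bytes}")
```

With four short sentences the gap is small. With millions of documents and a vocabulary of hundreds of thousands of tokens, almost every cell is zero, and the dense version may no longer fit in memory.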


In [4]:
texts = ["hello students!", "how are you today?", "what?", "hello hello everybody"]
count_vec = CountVectorizer()  # initialize the vectorizer
count_vec_fit = count_vec.fit_transform(texts)  # fit the vectorizer and transform the documents in one go

    1. Inspect the shape of the transformed texts. We can see that we have a 4x8 sparse matrix, meaning that we have 4 
    rows (= documents) and 8 columns (= unique tokens, such as words or numbers).


In [5]:
count_vec_fit

<4x8 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

    2. Get the feature names. This returns the tokens that are in the vocabulary of the vectorizer.

In [6]:
list(count_vec.get_feature_names_out())

['are', 'everybody', 'hello', 'how', 'students', 'today', 'what', 'you']

    3. Inspect the mapping from each token to its index. The numbers do *not* represent word counts, but the column position of each token in the matrix.

In [7]:
count_vec.vocabulary_ 

{'hello': 2,
 'students': 4,
 'how': 3,
 'are': 0,
 'you': 7,
 'today': 5,
 'what': 6,
 'everybody': 1}
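
To make the difference between an index and a count concrete, the following sketch looks up the column of one token in `vocabulary_` and then reads the counts from that column. The variable names `col` and `hello_counts` are only chosen for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["hello students!", "how are you today?", "what?", "hello hello everybody"]
count_vec = CountVectorizer()
count_vec_fit = count_vec.fit_transform(texts)

# vocabulary_ gives the column index of the token, not how often it occurs
col = count_vec.vocabulary_["hello"]

# the actual counts are found in that column, one value per document
hello_counts = count_vec_fit[:, col].toarray().ravel()

print(col)
print(hello_counts)
```

Here `col` is 2, and the counts in that column are 1, 0, 0 and 2: "hello" occurs once in the first document and twice in the last one.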

    4. Get the sparse representation at the document level.

In [8]:
for i, document in zip(count_vec_fit, texts):
    print(document)
    print(i)
    print()

hello students!
  (0, 2)	1
  (0, 4)	1

how are you today?
  (0, 3)	1
  (0, 0)	1
  (0, 7)	1
  (0, 5)	1

what?
  (0, 6)	1

hello hello everybody
  (0, 2)	2
  (0, 1)	1



a. Do you understand the output printed above?  
b. What happens if you change the `count` to a `tfidf` vectorizer?  
c. Adjust the code using the slides of [this week](https://github.com/annekroon/CCS-2/blob/main/week02/week02-lecture.pdf), page 40. More specifically, try removing stopwords and pruning the vocabulary, and see how your results are affected.
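
As a possible starting point for b. and c., the sketch below shows three variations. It assumes that the slides refer to the `stop_words`, `min_df` and `max_df` arguments of `sklearn`'s vectorizers. The threshold used here is arbitrary and only chosen so that something visibly changes on these four tiny documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["hello students!", "how are you today?", "what?", "hello hello everybody"]

# variation 1: tf-idf weights instead of raw counts
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(texts)

# variation 2: remove stopwords using sklearn's built-in English list
stop_vec = CountVectorizer(stop_words="english")
stop_vec.fit(texts)

# variation 3: pruning, keep only tokens that occur in at least 2 documents
pruned_vec = CountVectorizer(min_df=2)
pruned_vec.fit(texts)

print(X_tfidf.shape)
print(sorted(stop_vec.vocabulary_))
print(sorted(pruned_vec.vocabulary_))
```

Compare the three vocabularies with the original eight tokens. With `min_df=2`, only "hello" is kept, because it is the only token that appears in more than one document. Similarly, `max_df` can be used to drop tokens that occur in too many documents.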