# [Bag of Words](https://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation)

### Import libraries

In [1]:
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer

# [Count Vectorizer](https://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage)

- `CountVectorizer` builds a matrix of shape `(n_documents, n_tokens)`.
- Each entry of this matrix is the number of times a token occurs in a document.
- Each term found by the analyzer during `fit` is assigned a unique integer index, which corresponds to a column of the resulting matrix.
- Its input is a list of strings; each string is one document.
- **Words that were not seen in the training corpus are ignored completely in later calls to `transform`.**

In [2]:
vectorizer = CountVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
X

<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

### Get the matrix representation of the CountVectorizer output

In [3]:
matrix_representation = X.toarray()
matrix_representation

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
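
The count matrix above can be reproduced in plain Python. This is only a minimal sketch, not scikit-learn's implementation: it assumes the default analyzer lowercases the text and keeps tokens of two or more word characters, and the names `tokenize`, `count_row`, and `manual_matrix` are made up for illustration.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

def tokenize(text):
    # Assumption: approximates the default analyzer
    # (lowercase, tokens of 2+ word characters, punctuation dropped).
    return re.findall(r"\b\w\w+\b", text.lower())

# "fit": collect every distinct term, sort them, and give each a column index.
all_terms = sorted({term for doc in corpus for term in tokenize(doc)})
vocabulary = {term: idx for idx, term in enumerate(all_terms)}

def count_row(doc):
    # "transform": count each vocabulary term in one document.
    counts = Counter(tokenize(doc))
    return [counts[term] for term in all_terms]

manual_matrix = [count_row(doc) for doc in corpus]
for row in manual_matrix:
    print(row)
```

Each printed row should match the corresponding row of `X.toarray()`, and `vocabulary` maps every term to its column.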

### Get the feature names (tokens)

In [4]:
rows = []
for i in range(matrix_representation.shape[0]):
    rows.append("doc_{}".format(i))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
cols = vectorizer.get_feature_names_out().tolist()
print("Rows: {}".format(rows))
print("Cols: {}".format(cols))

Rows: ['doc_0', 'doc_1', 'doc_2', 'doc_3']
Cols: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


In [5]:
pd.DataFrame(matrix_representation, columns = cols, index = rows)

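| | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| doc_0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| doc_1 | 0 | 1 | 0 | 1 | 0 | 2 | 1 | 0 | 1 |
| doc_2 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| doc_3 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |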


# What happens when a document contains new words that did not appear in training?

`CountVectorizer` ignores words that are not part of the vocabulary, which is built from the training data.

In [6]:
new_doc = ["these words did not appear before"]
X_test = vectorizer.transform(new_doc)
X_test.toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)
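
The all-zero row follows from how out-of-vocabulary words are handled: only terms that already have a column can be counted. A rough plain-Python sketch of that behaviour, assuming the same tokenization as the default analyzer; the helper `transform_counts` and the second example sentence are hypothetical.

```python
import re
from collections import Counter

# The nine terms learned from the training corpus, in column order.
vocabulary = ['and', 'document', 'first', 'is', 'one',
              'second', 'the', 'third', 'this']

def transform_counts(doc, vocabulary):
    # Assumption: lowercase and keep tokens of 2+ word characters.
    counts = Counter(re.findall(r"\b\w\w+\b", doc.lower()))
    # Only known terms have a column; unseen words are simply dropped.
    return [counts[term] for term in vocabulary]

all_new = transform_counts("these words did not appear before", vocabulary)
partly_new = transform_counts("this document is brand new", vocabulary)

print(all_new)     # every word is unseen, so every count is zero
print(partly_new)  # 'brand' and 'new' are dropped; the known words are counted
```

A document that mixes known and unknown words keeps the counts of the known ones, so the row is only all zeros when none of its words are in the vocabulary.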

# [TF-IDF](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn-feature-extraction-text-tfidfvectorizer)

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [8]:
tfidf_vectorizer = TfidfVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = tfidf_vectorizer.fit_transform(corpus)
X

<4x9 sparse matrix of type '<class 'numpy.float64'>'
	with 19 stored elements in Compressed Sparse Row format>

### Get the matrix representation of the TfidfVectorizer output

In [9]:
matrix_representation = X.toarray()
matrix_representation

array([[0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674],
       [0.        , 0.27230147, 0.        , 0.27230147, 0.        ,
        0.85322574, 0.22262429, 0.        , 0.27230147],
       [0.55280532, 0.        , 0.        , 0.        , 0.55280532,
        0.        , 0.28847675, 0.55280532, 0.        ],
       [0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674]])
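
The weights in the first row can be checked by hand. The sketch below assumes the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the idf of a term is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit Euclidean length. The variable names are illustrative only.

```python
import math

# Vocabulary order, raw counts for doc_0, and the number of documents
# (out of 4) that contain each term.
terms = ['and', 'document', 'first', 'is', 'one',
         'second', 'the', 'third', 'this']
doc_0_counts = [0, 1, 1, 1, 0, 0, 1, 0, 1]
doc_freq = [1, 3, 2, 3, 1, 1, 4, 1, 3]
n_docs = 4

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

# Term frequency times idf, then L2 normalization of the row.
raw = [tf * weight for tf, weight in zip(doc_0_counts, idf)]
norm = math.sqrt(sum(value * value for value in raw))
doc_0_tfidf = [value / norm for value in raw]

for term, value in zip(terms, doc_0_tfidf):
    print("{:10s} {:.8f}".format(term, value))
```

Note that `the` appears in every document, so its idf is exactly 1 and it gets the lowest non-zero weight in the row, while `first` appears in only two documents and gets the highest.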

### Get the feature names (tokens)

In [10]:
rows = []
for i in range(matrix_representation.shape[0]):
    rows.append("doc_{}".format(i))
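# Read the names from the TF-IDF vectorizer fitted above, not the earlier CountVectorizer.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
cols = tfidf_vectorizer.get_feature_names_out().tolist()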
print("Rows: {}".format(rows))
print("Cols: {}".format(cols))

Rows: ['doc_0', 'doc_1', 'doc_2', 'doc_3']
Cols: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


In [11]:
pd.DataFrame(matrix_representation, columns = cols, index = rows)

| | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| doc_0 | 0.0 | 0.438777 | 0.541977 | 0.438777 | 0.0 | 0.0 | 0.358729 | 0.0 | 0.438777 |
| doc_1 | 0.0 | 0.272301 | 0.0 | 0.272301 | 0.0 | 0.853226 | 0.222624 | 0.0 | 0.272301 |
| doc_2 | 0.552805 | 0.0 | 0.0 | 0.0 | 0.552805 | 0.0 | 0.288477 | 0.552805 | 0.0 |
| doc_3 | 0.0 | 0.438777 | 0.541977 | 0.438777 | 0.0 | 0.0 | 0.358729 | 0.0 | 0.438777 |
