# TEXT VECTORIZATION

- This is the process of transforming a text into a numeric vector
- I present two algorithms:
 - TFIDF
 - Hashing

In [1]:
corpus = [
    "comprovada estabilidade e permanencia de grupo voltado a mercancia de drogas, cabivel e o enquadramento no crime de associacao para o trafico",
    "a periculosidade do paciente, evidenciada pela acentuada quantidade de droga apreendida e fundamento idôneo para a decretacao de prisao",
    "a mera quantidade da droga ou insumo, ainda que elevada, por si so, nao legitima o afastamento do redutor previsto no art. 33",
]

### TFIDF

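- The vector for each text has one dimension per term in the vocabulary learned from the corpus, so its size grows with the number of distinct terms
- The size of the n-grams can be changed with the parameter ngram_range
- The vectorizer has to be fitted on the whole corpus, so the corpus must fit in the computer's RAM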
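
To make the first two points concrete, here is a minimal sketch on a hypothetical two-sentence toy corpus (`toy_corpus` is my own example, not part of the corpus above). It only shows how the number of columns follows the vocabulary and how it grows when bigrams are added.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Unigrams only: one column per distinct word.
unigram_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X_unigram = unigram_vectorizer.fit_transform(toy_corpus)

# Unigrams and bigrams: columns for words and for pairs of adjacent words.
bigram_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_bigram = bigram_vectorizer.fit_transform(toy_corpus)

print("Unigram vocabulary size:", len(unigram_vectorizer.vocabulary_))
print("Unigram matrix shape:", X_unigram.shape)
print("Unigram + bigram matrix shape:", X_bigram.shape)
```

The number of columns always equals the vocabulary size, which is why the vector length is not known before fitting.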

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_tfidf = TfidfVectorizer(ngram_range=(1,1))
X_tfidf = vectorizer_tfidf.fit_transform(corpus)
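# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print("First ten words found ", vectorizer_tfidf.get_feature_names_out()[:10].tolist())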
print("Matrix shape ", X_tfidf.shape)

First ten words found  ['33', 'acentuada', 'afastamento', 'ainda', 'apreendida', 'art', 'associacao', 'cabivel', 'comprovada', 'crime']
Matrix shape  (3, 45)


In [12]:
X_tfidf.toarray()[0]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.2333648 , 0.2333648 , 0.2333648 , 0.2333648 ,
       0.        , 0.53243984, 0.        , 0.        , 0.        ,
       0.2333648 , 0.        , 0.2333648 , 0.2333648 , 0.        ,
       0.        , 0.2333648 , 0.        , 0.        , 0.        ,
       0.        , 0.2333648 , 0.        , 0.17747995, 0.        ,
       0.        , 0.17747995, 0.        , 0.        , 0.2333648 ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.2333648 , 0.2333648 ])
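
The weights above come from scikit-learn's default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The sketch below recomputes them by hand on a hypothetical toy corpus (`toy_corpus` and the variable names are my own) and compares the result with TfidfVectorizer. It is meant as an illustration of the formula, not as a replacement for the library.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
toy_corpus = [
    "drug quantity seized",
    "drug trafficking crime",
    "quantity of drug",
]

# Term counts per document (rows) and per vocabulary term (columns).
counts = CountVectorizer().fit_transform(toy_corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears.
df = (counts > 0).sum(axis=0)

# Smoothed IDF, as used by scikit-learn when smooth_idf=True (the default).
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts and scale each row to unit L2 norm.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(toy_corpus).toarray()
print("Manual result matches TfidfVectorizer:", np.allclose(manual, reference))
```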

### HASHING

- The size of the vector is fixed and chosen in advance by the user through the parameter n_features. I suggest a value between 2^12 and 2^15
- The size of the n-grams can be changed with the parameter ngram_range
- The vectorizer is stateless, so it doesn't need to be fitted on the corpus
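
Because no fitting is needed, texts can be vectorized in batches as they arrive. The sketch below assumes a hypothetical list of batches (`batches` is my own example) and calls transform directly, without any previous call to fit.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical batches of texts, e.g. read from disk one chunk at a time.
batches = [
    ["first document about drugs", "second document about crime"],
    ["third document arriving later"],
]

vectorizer = HashingVectorizer(n_features=2**12)

for batch in batches:
    # transform works without fit: each token is hashed to a column index.
    X_batch = vectorizer.transform(batch)
    print("Batch shape:", X_batch.shape)
```

Every batch has the same number of columns, so the matrices can be stacked or fed to an incremental model. The trade-off is that different tokens may be hashed to the same column, and the original words cannot be recovered from the column indices.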

In [4]:
from sklearn.feature_extraction.text import HashingVectorizer

n_features = 2**12
vectorizer_hash = HashingVectorizer(n_features=n_features, ngram_range=(1, 1))
# No fitting is required: transform hashes each token directly
X_hash = vectorizer_hash.transform(corpus)
print("Matrix shape ", X_hash.shape)

Matrix shape  (3, 4096)


In [14]:
X_hash.toarray()[0][:100]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
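
The first 100 entries are all zero because only a handful of the 4096 columns are used by a short text. A dense slice is therefore not very informative. A possible alternative, sketched below on a hypothetical sentence of my own, is to inspect the sparse matrix directly and list only the stored entries.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical single text, used only for illustration.
text = ["quantity of drug seized"]

X = HashingVectorizer(n_features=2**12).transform(text)

# X is a sparse matrix in CSR format with a single row,
# so its attributes describe that row directly.
print("Matrix shape:", X.shape)
print("Stored entries:", X.nnz)
print("Column indices:", X.indices)
print("Values:", X.data)
```

Some values can be negative because, by default, HashingVectorizer alternates the sign of the hashed features (parameter alternate_sign) to reduce the effect of collisions.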