
# **Natural Language Processing [2022.Q2]**
Prof. Alexandre Donizeti Alves

# **Text Representation**

---

### ***One-Hot* Encoding**

In [1]:
#--------------------------
# One Hot Encoding of text 
#--------------------------

documents = ["Cachorro morde homem.",
             "Homem morde cachorro.", 
             "Cachorro come carne.", 
             "Homem come comida."]

processed_docs = [doc.lower().replace(".","") for doc in documents]
processed_docs

['cachorro morde homem',
 'homem morde cachorro',
 'cachorro come carne',
 'homem come comida']

In [2]:
# build the vocabulary
vocab = {}
count = 0
for doc in processed_docs:
    for word in doc.split():
        if word not in vocab:
            count = count +1
            vocab[word] = count
print(vocab)

{'cachorro': 1, 'morde': 2, 'homem': 3, 'come': 4, 'carne': 5, 'comida': 6}


In [3]:
# get one hot representation for any string based on this vocabulary
# if the word exists in the vocabulary, its representation is returned
# if not, a list of zeroes is returned for that word
def get_onehot_vector(somestring):
    onehot_encoded = []
    for word in somestring.split():
        temp = [0]*len(vocab)
        if word in vocab:
            # -1 is to take care of the fact indexing in array starts from 0 and not 1
            temp[vocab[word]-1] = 1 
        onehot_encoded.append(temp)

    return onehot_encoded

In [4]:
print(processed_docs[0])
get_onehot_vector(processed_docs[0])

cachorro morde homem


[[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]

In [5]:
#one hot representation for a random text, using the above vocabulary
get_onehot_vector("homem e cachorro são bons") 

[[0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0]]

In [6]:
get_onehot_vector("homem e homem são bons")

[[0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0]]
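
The function above returns one vector per word, so texts of different lengths produce outputs of different sizes. One simple way to obtain a single fixed-length vector per text is to sum the per-word vectors element-wise, which yields word counts and leads to the bag-of-words idea of the next section. The sketch below is illustrative only: it re-declares the vocabulary printed earlier, and `onehot_vectors` and `sum_vectors` are hypothetical helper names, not part of the original notebook.

```python
# Vocabulary copied from the output above (1-based, in order of first occurrence).
vocab = {'cachorro': 1, 'morde': 2, 'homem': 3, 'come': 4, 'carne': 5, 'comida': 6}

def onehot_vectors(text):
    """Return one one-hot vector per word; unknown words give an all-zero vector."""
    vectors = []
    for word in text.split():
        vec = [0] * len(vocab)
        if word in vocab:
            vec[vocab[word] - 1] = 1  # vocabulary indices start at 1
        vectors.append(vec)
    return vectors

def sum_vectors(vectors):
    """Element-wise sum of equally sized vectors (hypothetical helper)."""
    return [sum(column) for column in zip(*vectors)]

doc_vector = sum_vectors(onehot_vectors("homem e homem são bons"))
print(doc_vector)  # [0, 0, 2, 0, 0, 0]
```

Words outside the vocabulary ("e", "são", "bons") contribute nothing to the sum, so the result only records that "homem" occurred twice.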

### ***Bag of Words***

In [7]:
#---------------------
# Bag of Words - BoW
#---------------------

documents = ["Cachorro morde homem.",
             "Homem morde cachorro.", 
             "Cachorro come carne.", 
             "Homem come comida."]

processed_docs = [doc.lower().replace(".","") for doc in documents]
processed_docs

['cachorro morde homem',
 'homem morde cachorro',
 'cachorro come carne',
 'homem come comida']

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

# look at the documents list
print("Corpus: ", processed_docs)

count_vect = CountVectorizer()

# build a BOW representation for the corpus
bow_rep = count_vect.fit_transform(processed_docs)

# look at the vocabulary mapping
print("Vocabulary: ", count_vect.vocabulary_)

# see the BOW rep for first 2 documents
print("BoW representation for 'cachorro morde homem': ", bow_rep[0].toarray())
print("BoW representation for 'homem morde cachorro': ", bow_rep[1].toarray())

# get the representation using this vocabulary, for a new text
temp = count_vect.transform(["cachorro e cachorro são amigos"])
print("BoW representation for 'cachorro e cachorro são amigos':", temp.toarray())

Corpus:  ['cachorro morde homem', 'homem morde cachorro', 'cachorro come carne', 'homem come comida']
Vocabulary:  {'cachorro': 0, 'morde': 5, 'homem': 4, 'come': 2, 'carne': 1, 'comida': 3}
BoW representation for 'cachorro morde homem':  [[1 0 0 0 1 1]]
BoW representation for 'homem morde cachorro':  [[1 0 0 0 1 1]]
BoW representation for 'cachorro e cachorro são amigos': [[2 0 0 0 0 0]]


In the code above, we represented each text by its word frequencies. Sometimes, however, the frequency matters less than whether a word occurs in a text at all. In that case, each document is represented as a vector of 0s and 1s. We use the `binary=True` option of *CountVectorizer* for this purpose.

In [10]:
# BoW with binary vectors
count_vect = CountVectorizer(binary=True)
bow_rep_bin = count_vect.fit_transform(processed_docs)
temp = count_vect.transform(["cachorro e cachorro são amigos"])
print("BoW representation for 'cachorro e cachorro são amigos':", temp.toarray())

BoW representation for 'cachorro e cachorro são amigos': [[1 0 0 0 0 0]]
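
To make the relationship between the two variants concrete, the following minimal sketch builds both vectors by hand in plain Python. It assumes simple whitespace tokenization and an alphabetically ordered vocabulary, which only approximates the behaviour of *CountVectorizer*; the names `count_vector` and `binary_vector` are hypothetical.

```python
from collections import Counter

# Alphabetically ordered vocabulary of the toy corpus (an assumption of this sketch).
vocabulary = ['cachorro', 'carne', 'come', 'comida', 'homem', 'morde']

def count_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

def binary_vector(text):
    """Clip every count to 1: only presence or absence is kept."""
    return [min(count, 1) for count in count_vector(text)]

text = "cachorro e cachorro são amigos"
print(count_vector(text))   # [2, 0, 0, 0, 0, 0]
print(binary_vector(text))  # [1, 0, 0, 0, 0, 0]
```

The binary vector is just the count vector with every non-zero entry replaced by 1.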


### ***Bag of N-grams***

In [11]:
# corpus
documents = ["Cachorro morde homem.", 
             "Homem morde cachorro.", 
             "Cachorro come carne.", 
             "Homem come comida."]

processed_docs = [doc.lower().replace(".","") for doc in documents]
processed_docs

['cachorro morde homem',
 'homem morde cachorro',
 'cachorro come carne',
 'homem come comida']

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

# Ngram vectorization example with count vectorizer and uni, bi, trigrams
count_vect = CountVectorizer(ngram_range=(1,3))

# Build a BOW representation for the corpus
bow_rep = count_vect.fit_transform(processed_docs)

# Look at the vocabulary mapping
print("Vocabulary: ", count_vect.vocabulary_)

# see the BOW rep for first 2 documents
print("BoW representation for 'cachorro morde homem': ", bow_rep[0].toarray())
print("BoW representation for 'homem morde cachorro': ", bow_rep[1].toarray())

# get the representation using this vocabulary, for a new text
temp = count_vect.transform(["cachorro e cachorro são amigos"])

print("BoW representation for 'cachorro e cachorro são amigos':", temp.toarray())

Vocabulary:  {'cachorro': 0, 'morde': 15, 'homem': 10, 'cachorro morde': 3, 'morde homem': 17, 'cachorro morde homem': 4, 'homem morde': 13, 'morde cachorro': 16, 'homem morde cachorro': 14, 'come': 6, 'carne': 5, 'cachorro come': 1, 'come carne': 7, 'cachorro come carne': 2, 'comida': 9, 'homem come': 11, 'come comida': 8, 'homem come comida': 12}
BoW representation for 'cachorro morde homem':  [[1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0 1]]
BoW representation for 'homem morde cachorro':  [[1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 1 0]]
BoW representation for 'cachorro e cachorro são amigos': [[2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
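
The first two sentences contain exactly the same words, so unigram counts alone cannot tell them apart; the bigrams and trigrams are what capture the word order. The sketch below generates word n-grams by hand to show this. It assumes whitespace tokenization, and `word_ngrams` is a hypothetical helper rather than a scikit-learn function.

```python
def word_ngrams(text, min_n=1, max_n=3):
    """Return all word n-grams of the text for n in [min_n, max_n]."""
    tokens = text.split()
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the sentence.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

a = word_ngrams("cachorro morde homem")
b = word_ngrams("homem morde cachorro")

print(a)
# ['cachorro', 'morde', 'homem', 'cachorro morde', 'morde homem', 'cachorro morde homem']

# Only the unigrams are shared; every bigram and trigram differs.
print(sorted(set(a) & set(b)))
# ['cachorro', 'homem', 'morde']
```

The price of this extra context is a larger vocabulary: 18 features here instead of 6 for the same four sentences.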


### ***TF-IDF***

In [13]:
# corpus
documents = ["Cachorro morde homem.", 
             "Homem morde cachorro.", 
             "Cachorro come carne.", 
             "Homem come comida."]

processed_docs = [doc.lower().replace(".","") for doc in documents]
processed_docs

['cachorro morde homem',
 'homem morde cachorro',
 'cachorro come carne',
 'homem come comida']

In [15]:
# Scikit-Learn
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
bow_rep_tfidf = tfidf.fit_transform(processed_docs)

# IDF for all words in the vocabulary
print("IDF for all words in the vocabulary",tfidf.idf_)
print("-"*10)

# all words in the vocabulary.
print("All words in the vocabulary",tfidf.get_feature_names_out())
print("-"*10)

#TFIDF representation for all documents in our corpus 
print("TFIDF representation for all documents in our corpus\n",bow_rep_tfidf.toarray()) 
print("-"*10)

temp = tfidf.transform(["cachorro e homem sao amigos"])
print("Tfidf representation for 'cachorro e homem sao amigos':\n", temp.toarray())

IDF for all words in the vocabulary [1.22314355 1.91629073 1.51082562 1.91629073 1.22314355 1.51082562]
----------
All words in the vocabulary ['cachorro' 'carne' 'come' 'comida' 'homem' 'morde']
----------
TFIDF representation for all documents in our corpus
 [[0.53256952 0.         0.         0.         0.53256952 0.65782931]
 [0.53256952 0.         0.         0.         0.53256952 0.65782931]
 [0.44809973 0.70203482 0.55349232 0.         0.         0.        ]
 [0.         0.         0.55349232 0.70203482 0.44809973 0.        ]]
----------
Tfidf representation for 'cachorro e homem sao amigos':
 [[0.70710678 0.         0.         0.         0.70710678 0.        ]]
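
The numbers above can be reproduced by hand. With its default settings, `TfidfVectorizer` uses a smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing term `t`, and then scales each document vector to unit L2 norm. The following plain-Python sketch follows that recipe under the assumption of whitespace tokenization; `tfidf_vector` is a hypothetical helper name.

```python
import math
from collections import Counter

docs = ['cachorro morde homem', 'homem morde cachorro',
        'cachorro come carne', 'homem come comida']
vocabulary = sorted({word for doc in docs for word in doc.split()})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = {w: sum(w in doc.split() for doc in docs) for w in vocabulary}

# Smoothed IDF, as used by scikit-learn's defaults.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}

def tfidf_vector(text):
    """Raw term count times IDF, then L2 normalization."""
    counts = Counter(text.split())
    raw = [counts[w] * idf[w] for w in vocabulary]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]

print([round(idf[w], 4) for w in vocabulary])
# [1.2231, 1.9163, 1.5108, 1.9163, 1.2231, 1.5108]
print([round(x, 4) for x in tfidf_vector(docs[0])])
# [0.5326, 0.0, 0.0, 0.0, 0.5326, 0.6578]
```

Words that appear in more documents ("cachorro", "homem") receive a lower IDF than rarer ones ("carne", "comida"), which is why "morde" outweighs the other two words in the first document.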
