# Word Embedding Explained, a comparison and code tutorial
Disponível em https://medium.com/@dcameronsteinke/tf-idf-vs-word-embedding-a-comparison-and-code-tutorial-5ba341379ab0

Outras fontes interessantes:
- Paper Mikolov: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
- https://medium.com/explorations-in-language-and-learning/online-learning-of-word-embeddings-7c2889c99704
- https://medium.com/explorations-in-language-and-learning/understanding-word-vectors-f5f9e9fdef98

### Dicts de word embeddings

Para testes feitos com o arquivo compscience_papers.csv, usaremos aqui o dicionário de word embeddings em inglês da FastText, desenvolvido pelo centro de Pesquisa em AI do Facebook. Usaremos o wiki-news-300d-1M.vec lookup dictionary, o qual contém mapeamento 300-dimensional de 1 milhão de palavras únicas - para economizar memória, os testes foram feitos usando apenas as 100.000 palavras mais comuns, sendo as restantes ignoradas. O arquivo está disponível para download em https://fasttext.cc/docs/en/english-vectors.html.

In [2]:
# Loading the data file from local download
path_fastText = 'wiki-news-300d-1M.vec'
dictionary = open(path_fastText, 'r', encoding='utf-8',
                  newline='\n', errors='ignore')
embeds = {}
for line in dictionary:
    tokens = line.rstrip().split(' ')
    embeds[tokens[0]] = [float(x) for x in tokens[1:]]
    
    if len(embeds) == 100000:
        break
print(embeds['car'])

[-0.016, -0.0003, -0.1684, 0.0899, -0.02, -0.0093, 0.0482, -0.0308, -0.0451, 0.0006, 0.168, 0.0965, 0.3061, -0.0411, 0.0296, -0.0463, 0.0325, -0.0703, 0.0222, -0.1404, -0.2638, -0.0134, 0.1277, 0.1227, 0.1803, -0.0192, 0.0353, 0.1214, 0.1509, -0.0861, 0.0976, -0.0255, -0.0276, -0.1556, -0.0739, 0.0543, -0.067, -0.003, 0.1515, 0.0608, 0.033, 0.0747, 0.0009, 0.055, 0.0048, -0.0132, -0.0262, -0.1804, 0.0805, 0.0464, -0.0159, -0.0302, -0.6785, 0.1632, 0.0103, 0.0655, -0.0843, 0.0227, 0.0335, -0.0356, -0.0638, -0.1111, -0.0017, 0.0978, 0.0565, -0.0352, 0.0395, 0.1867, 0.079, -0.1234, 0.0186, 0.089, 0.1631, 0.0783, 0.0561, 0.1447, -0.0251, 0.1376, -0.0079, -0.0239, 0.0218, 0.1494, -0.0191, -0.2479, -0.0499, 0.0516, -0.1298, -0.0648, 0.2738, 0.0078, 0.0171, -0.0372, 0.077, -0.1167, -0.0377, -0.0432, 0.0186, 0.0209, -0.0167, 0.0345, -0.1472, 0.0122, -0.053, -0.0073, 0.1029, 0.0283, -0.1264, 0.0066, -0.0579, 0.1004, -0.1225, 0.0247, 0.0808, -0.0399, -0.0108, 0.0043, 0.0184, 0.0488, -0.174, -0.3

Para os testes feitos com os dados fornecidos pelo paper Inferring the source of official texts: can SVM beat ULMFiT?, disponível em https://cic.unb.br/~teodecampos/KnEDLe/propor2020/, usaremos o dicionário de embeddings em língua portuguesa fornecido pelo NILC - http://www.nilc.icmc.usp.br/embeddings. Usaremos o FastText CBOW 300 dimensões.

In [3]:
# Loading the data file from local download
path_fastTextPT = 'cbow_s300.txt'
dictionary = open(path_fastTextPT, 'r', encoding='utf-8',
                  newline='\n', errors='ignore')
embedsPT = {}
for line in dictionary:
    tokens = line.rstrip().split(' ')
    embedsPT[tokens[0]] = [float(x) for x in tokens[1:]]
    
    if len(embedsPT) == 100000:
        break
print(embedsPT['carro'])

[3.1796, 0.19209, 0.22359, -2.6106, -0.12218, 0.032726, 2.3504, 0.22662, 1.4124, 1.0825, -0.85295, 0.29017, 1.9458, 1.0684, -0.35541, 2.2881, -0.69868, 0.087266, 0.83, 2.3799, -1.0629, -0.67116, 0.075786, -2.1696, 2.9464, 0.10431, 0.39483, 0.34337, 0.62104, -0.44857, -0.99443, -0.089758, -0.52677, -2.332, -1.5584, 0.31164, -1.3988, -3.7487, -1.0172, -0.37468, -1.3402, 1.6693, -1.4098, 1.4323, -0.87958, -1.0398, 2.4174, -0.2133, 2.1066, 1.0972, -0.2813, 0.57358, -1.2151, -1.3656, 1.809, 0.21582, 1.0618, -0.34813, -1.7238, 0.27825, 0.10802, 0.83614, -3.3179, 0.12296, 1.3057, -0.15216, 0.58171, -1.5202, 0.35706, -2.6664, -2.9915, 0.67336, -1.2807, -0.30292, -1.1132, 0.62384, -2.7854, -0.87867, -2.2291, 1.9846, -2.3804, 0.562, 2.6336, -0.56135, -3.8398, -2.2169, -2.1987, -0.23831, -1.0293, 0.54459, -0.081219, -1.1411, 1.8342, -1.2903, 0.065219, 1.5573, -1.4097, 5.0478, -0.86079, -0.29849, -0.80884, 0.58545, -0.58787, -1.7649, 1.0095, 0.54806, 3.687, 3.1301, 1.2827, -0.78299, -0.30358, 0.09

### Import de ambos os arquivos csv:
 - train.csv, do trabalho de Pedro e
 - compscience_papers.csv, do repositório de testes da equipe DataAnnotation.

In [4]:
import pandas as pd
import numpy as np

In [5]:
path_to_text = '/home/mstauffer/Documentos/UnB/9º Semestre/KnEDle/sprints/5_27_maio-03_junho/luz_de_araujo_etal_propor2020/data/clean/train.csv'
data_peter = pd.read_csv(path_to_text, encoding='utf8')#[['v1', 'v2']]
# Creating the feature set and label set
textPT = data_peter['text']
labelPT = data_peter['label']
print(data_peter[10:14])

                                                label  \
10  SECRETARIA DE ESTADO DE DESENVOLVIMENTO URBANO...   
11  SECRETARIA DE ESTADO DE FAZENDA, PLANEJAMENTO,...   
12                      SECRETARIA DE ESTADO DE SAÚDE   
13          SECRETARIA DE ESTADO DE SEGURANÇA PÚBLICA   

                                                 text  is_valid  
10  O Termo de Recebimento Definitivo declarará fo...     False  
11  O DISTRITO FEDERAL, por intermédio da Diretori...     False  
12  O SECRETÁRIO DE ESTADO DE SAÚDE DO DISTRITO FE...     False  
13  O DIRETOR-GERAL DO DEPARTAMENTO DE TRÂNSITO DO...     False  


In [6]:
path_to_text = '/home/mstauffer/Documentos/UnB/9º Semestre/KnEDle/sprints/compscience_papers/compscience_papers.csv'
data = pd.read_csv(path_to_text, encoding='utf8')#[['v1', 'v2']]
# Creating the feature set and label set
text = data['text']
label = data['label']
print(data[10:14])

                                                 text  label
10  INFORMATION EXTRACTION AS A BASIS FOR PORTABLE...      3
11  Footprint-Based Retrieval\n\nBarry Smyth & Eli...      1
12  CATEGORY: Artificial Intelligence Learning Hum...      3
13  Distributed Energy­conserving Routing Protocol...      3


### Pré-processamento embeddings
Para tratar as word embeddings, usaremos um método de pré-processamento da lib Keras que converte um texto em uma sequência de palavras (os famosos tokens). Pontuação e capitalização também serão removidas.

- Tratamento dados Pedro (Em português)

In [8]:
from keras.preprocessing.text import text_to_word_sequence
array_length = 20 * 300
embedding_featuresPT = pd.DataFrame()
for document in textPT:
    # Saving the first 20 words of the document as a sequence
    words = text_to_word_sequence(document)[0:20] 
    
    # Retrieving the vector representation of each word and 
    # appending it to the feature vector 
    feature_vector = []
    for word in words:
        try:
            feature_vector = np.append(feature_vector, 
                                       np.array(embedsPT[word]))
        except KeyError:
            # In the event that a word is not included in our 
            # dictionary skip that word
            pass
    # If the text has less then 20 words, fill remaining vector with
    # zeros
    zeroes_to_add = array_length - len(feature_vector)
    feature_vector = np.append(feature_vector, 
                               np.zeros(zeroes_to_add)
                               ).reshape((1,-1))
    
    # Append the document feature vector to the feature table
    embedding_featuresPT = embedding_featuresPT.append( 
                                     pd.DataFrame(feature_vector))
print(embedding_featuresPT.shape)

(717, 6000)


- Tratamento dados Data Annotation (Em inglês)

In [9]:
from keras.preprocessing.text import text_to_word_sequence
array_length = 20 * 300
embedding_features = pd.DataFrame()
for document in text:
    # Saving the first 20 words of the document as a sequence
    words = text_to_word_sequence(document)[0:20] 
    
    # Retrieving the vector representation of each word and 
    # appending it to the feature vector 
    feature_vector = []
    for word in words:
        try:
            feature_vector = np.append(feature_vector, 
                                       np.array(embeds[word]))
        except KeyError:
            # In the event that a word is not included in our 
            # dictionary skip that word
            pass
    # If the text has less then 20 words, fill remaining vector with
    # zeros
    zeroes_to_add = array_length - len(feature_vector)
    feature_vector = np.append(feature_vector, 
                               np.zeros(zeroes_to_add)
                               ).reshape((1,-1))
    
    # Append the document feature vector to the feature table
    embedding_features = embedding_features.append( 
                                     pd.DataFrame(feature_vector))
print(embedding_features.shape)

(682, 6000)


### Criamos tabelas TFIDF para baseline de comparação.

- TFIDF dados Pedro (Em português)

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = list(textPT)
tfidfPT = TfidfVectorizer(max_features = 6000) 
tfidfPT.fit(corpus)
tfidf_featuresPT = tfidfPT.transform(corpus)
print(tfidf_featuresPT.shape)

(717, 6000)


- TFIDF dados Data Annotation (Em inglês)

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = list(text)
tfidf = TfidfVectorizer(max_features = 6000) 
tfidf.fit(corpus)
tfidf_features = tfidf.transform(corpus)
print(tfidf_features.shape)

(682, 6000)


### Tratamento dos labels via sklearn label encoder.

- Labels dados Pedro (Em português)

In [12]:
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_fscore_support
# Converting the labels from strings to binary
le = LabelEncoder()
le.fit(labelPT)
labelPT = le.transform(labelPT)

- Labels dados Data Annotation (Em inglês)

In [13]:
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_fscore_support
# Converting the labels from strings to binary
le = LabelEncoder()
le.fit(label)
label = le.transform(label)

### Split dos dados em treino e teste e instanciação dos modelos

In [14]:
# Taking 70/30 train test split
train_percent = 0.7
train_cutoff = int(np.floor(train_percent*len(text)))

- Split e instanciação dos modelos - dados Pedro (Em português)

In [15]:
# Word Embedding
embeded_modelPT = LinearSVC(max_iter=20000)
embeded_modelPT.fit(embedding_featuresPT[0 : train_cutoff], 
                  labelPT[0 : train_cutoff])
embeded_predictionPT = embeded_modelPT.predict(
                   embedding_featuresPT[train_cutoff + 1 : len(text)])
# TF-IDF table
tfidf_modelPT = LinearSVC(max_iter=20000)
tfidf_modelPT.fit(tfidf_featuresPT[0 : train_cutoff], 
                  labelPT[0 : train_cutoff])
tfidf_predictionPT = tfidf_modelPT.predict(
                  tfidf_featuresPT[train_cutoff + 1 : len(text)])



- Split e instaciação do modelo - dados Data Annotation (Em inglês)

In [17]:
# Taking 70/30 train test split
train_percent = 0.7
train_cutoff = int(np.floor(train_percent*len(text) ) )
# Word Embedding
embeded_model = LinearSVC(max_iter=20000)
embeded_model.fit(embedding_features[0 : train_cutoff], 
                  label[0 : train_cutoff])
embeded_prediction = embeded_model.predict(
                   embedding_features[train_cutoff + 1 : len(text)])
# TF-IDF table
tfidf_model = LinearSVC(max_iter=20000)
tfidf_model.fit(tfidf_features[0 : train_cutoff], 
                  label[0 : train_cutoff])
tfidf_prediction = tfidf_model.predict(
                  tfidf_features[train_cutoff + 1 : len(text)])

### Represantação dos resultados em tabelas

- Tabela dados Pedro (Em português)

In [18]:
resultsPT = pd.DataFrame(index = ['Word Embedding', 'TF-IDF'], 
          columns = ['Precision', 'Recall', 'F1 score', 'support']
          )
resultsPT.loc['Word Embedding'] = precision_recall_fscore_support(
          labelPT[train_cutoff + 1 : len(text)], 
          embeded_predictionPT, 
          average = 'weighted'
          )
resultsPT.loc['TF-IDF'] = precision_recall_fscore_support(
          labelPT[train_cutoff + 1 : len(text)], 
          tfidf_predictionPT, 
          average = 'weighted'
          )

  _warn_prf(average, modifier, msg_start, len(result))


- Tabela dados Data Annotation (Em inglês)

In [19]:
results = pd.DataFrame(index = ['Word Embedding', 'TF-IDF'], 
          columns = ['Precision', 'Recall', 'F1 score', 'support']
          )
results.loc['Word Embedding'] = precision_recall_fscore_support(
          label[train_cutoff + 1 : len(text)], 
          embeded_prediction, 
          average = 'weighted'
          )
results.loc['TF-IDF'] = precision_recall_fscore_support(
          label[train_cutoff + 1 : len(text)], 
          tfidf_prediction, 
          average = 'weighted'
          )

### Tabelas de resultados

- Resultados dados Pedro (Em português)

In [20]:
resultsPT.head()

Unnamed: 0,Precision,Recall,F1 score,support
Word Embedding,0.835766,0.833333,0.828399,
TF-IDF,0.904991,0.897059,0.894488,


- Resultados dados Data Annotation (Em inglês)

In [21]:
results.head()

Unnamed: 0,Precision,Recall,F1 score,support
Word Embedding,0.673325,0.681373,0.675881,
TF-IDF,0.966429,0.965686,0.965066,
