# A Set of Robust Classifiers in NLP

__________________________________

![Alt Text](./img/preproc.png)

# Bag of Words

*Example taken from https://scikit-learn.org/ and adapted for the course*

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?',]

***CountVectorizer:*** implements both tokenization and occurrence counting

In [4]:
vectorizer = CountVectorizer()

***Training***

In [5]:
X = vectorizer.fit_transform(corpus)

***Detected words***

In [6]:
# On scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2
print(vectorizer.get_feature_names())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


**Processed sentences**
* This is the first document.
* This document is the second document.
* And this is the third one.
* Is this the first document?

Each term found is assigned a unique integer **index** that corresponds to a column in the resulting matrix.
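
As a rough illustration of this index assignment, the following pure-Python sketch rebuilds the vocabulary and the count matrix for the English corpus without scikit-learn. The names `toy_corpus`, `tokenize`, `term_index`, and `to_counts` are illustrative and not part of the course code. The token pattern is assumed to match the default behaviour of `CountVectorizer` (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

toy_corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Assumed to mirror CountVectorizer's default analyzer: lowercase the text
# and keep tokens made of at least two word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


# Each distinct term gets one integer index, assigned in alphabetical order.
terms = sorted({token for doc in toy_corpus for token in tokenize(doc)})
term_index = {term: idx for idx, term in enumerate(terms)}


def to_counts(text):
    """Return one row of the count matrix; terms outside the vocabulary are ignored."""
    row = [0] * len(term_index)
    for token, count in Counter(tokenize(text)).items():
        if token in term_index:
            row[term_index[token]] = count
    return row


count_matrix = [to_counts(doc) for doc in toy_corpus]
print(term_index['this'])
print(count_matrix[1])
```

A sentence made only of unseen words maps to an all-zero row, which is the same behaviour the notebook shows for `'Something completely new.'`.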

In [7]:
print(X.toarray())

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


In [8]:
vectorizer.vocabulary_.get('this')

8

In [9]:
vectorizer.transform(['Something completely new.']).toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

In [10]:
vectorizer.transform(['Document one new.']).toarray()

array([[0, 1, 0, 0, 1, 0, 0, 0, 0]], dtype=int64)

***Spanish***

In [11]:
corpus = ['Este es el primer documento.',
          'Este documento es el segundo documento.',
          'Y este es el tercero.',
          '¿Es este el primer documento?']

In [12]:
vectorizer = CountVectorizer()

In [13]:
X = vectorizer.fit_transform(corpus)

In [14]:
print(vectorizer.get_feature_names())

['documento', 'el', 'es', 'este', 'primer', 'segundo', 'tercero']


In [15]:
print(X.toarray())

[[1 1 1 1 1 0 0]
 [2 1 1 1 0 1 0]
 [0 1 1 1 0 0 1]
 [1 1 1 1 1 0 0]]


In [16]:
vectorizer.vocabulary_.get('este')

3

In [17]:
vectorizer.transform(['Algo completamente nuevo.']).toarray()

array([[0, 0, 0, 0, 0, 0, 0]], dtype=int64)

In [18]:
vectorizer.transform(['Primer documento nuevo.']).toarray()

array([[1, 0, 0, 0, 1, 0, 0]], dtype=int64)

____________________

***Disadvantage:*** the model loses context information, so an affirmative sentence and its interrogative counterpart can end up being treated as similar.

***Bigrams:*** to some extent, this can be mitigated by using bigrams
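
The sketch below makes the point concrete on two sentences from the corpus: their single-word counts are identical, while their bigram counts differ. It is a pure-Python approximation, not the scikit-learn implementation; `ngram_counts` is a hypothetical helper and the token pattern is assumed to match the `CountVectorizer` default.

```python
import re
from collections import Counter

# Assumed to match CountVectorizer's default tokenization
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def ngram_counts(text, n):
    """Count the word n-grams of size n in a lowercased sentence."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    ngrams = [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(ngrams)


statement = 'This is the first document.'
question = 'Is this the first document?'

# With single words, the two sentences are indistinguishable.
same_unigrams = ngram_counts(statement, 1) == ngram_counts(question, 1)
# With bigrams, word order starts to matter.
same_bigrams = ngram_counts(statement, 2) == ngram_counts(question, 2)

only_in_statement = sorted(set(ngram_counts(statement, 2)) - set(ngram_counts(question, 2)))
only_in_question = sorted(set(ngram_counts(question, 2)) - set(ngram_counts(statement, 2)))

print(same_unigrams, same_bigrams)
print(only_in_statement, only_in_question)
```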

In [19]:
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?',]

In [20]:
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))

In [21]:
X2 = vectorizer2.fit_transform(corpus)

In [22]:
print(vectorizer2.get_feature_names())

['and this', 'document is', 'first document', 'is the', 'is this', 'second document', 'the first', 'the second', 'the third', 'third one', 'this document', 'this is', 'this the']


**Processed sentences**
* This is the first document.
* This document is the second document.
* And this is the third one.
* Is this the first document?

In [23]:
print(X2.toarray())

[[0 0 1 1 0 0 1 0 0 0 0 1 0]
 [0 1 0 1 0 1 0 1 0 0 1 0 0]
 [1 0 0 1 0 0 0 0 1 1 0 1 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 1]]


__________________________

***Longer text to model***

In [24]:
from dividing_into_sentences import read_text_file, preprocess_text, divide_into_sentences_nltk

In [25]:
def get_sentences(filename):
    sherlock_holmes_text = read_text_file(filename)
    sherlock_holmes_text = preprocess_text(sherlock_holmes_text)
    sentences = divide_into_sentences_nltk(sherlock_holmes_text)
    return sentences

In [26]:
def get_new_sentence_vector(sentence, vectorizer):
    new_sentence_vector = vectorizer.transform([sentence])
    return new_sentence_vector

In [27]:
def create_vectorizer(sentences):
    vectorizer = CountVectorizer(max_df=0.6)  # Ignore terms that appear in more than 60% of the sentences
    X = vectorizer.fit_transform(sentences)
    return (vectorizer, X)    
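
The `max_df` threshold used in `create_vectorizer` works on document frequency: a term is dropped when the share of documents containing it exceeds the threshold. The pure-Python sketch below illustrates that rule on the small English corpus; `filter_by_max_df` is a hypothetical helper written for this note, and the tokenization is assumed to match the `CountVectorizer` default.

```python
import re

toy_corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Assumed to match CountVectorizer's default tokenization
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def filter_by_max_df(docs, max_df):
    """Keep the terms whose document-frequency ratio is at most max_df."""
    doc_tokens = [set(TOKEN_PATTERN.findall(doc.lower())) for doc in docs]
    vocabulary = sorted(set().union(*doc_tokens))
    kept = []
    for term in vocabulary:
        # Number of documents that contain the term at least once
        document_frequency = sum(term in tokens for tokens in doc_tokens)
        if document_frequency / len(docs) <= max_df:
            kept.append(term)
    return kept


kept_terms = filter_by_max_df(toy_corpus, max_df=0.6)
print(kept_terms)
```

Here `this`, `is`, and `the` occur in every sentence and `document` in three out of four, so all of them exceed the 0.6 threshold and are removed.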

In [28]:
def create_bigram_vectorizer(sentences):
    bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = bigram_vectorizer.fit_transform(sentences)
    return (bigram_vectorizer, X)

In [29]:
sentences = get_sentences("sherlock_holmes_1.txt")

In [30]:
(vectorizer, X) = create_vectorizer(sentences)

In [31]:
print(X)

  (0, 113)	1
  (0, 98)	1
  (0, 46)	1
  (0, 97)	1
  (0, 53)	1
  (0, 10)	1
  (0, 0)	1
  (0, 123)	1
  (1, 38)	1
  (1, 94)	1
  (1, 40)	1
  (1, 43)	1
  (1, 63)	1
  (1, 41)	1
  (1, 115)	1
  (1, 11)	1
  (1, 78)	1
  (1, 69)	1
  (2, 97)	1
  (2, 41)	1
  (2, 47)	1
  (2, 45)	1
  (2, 28)	1
  (2, 24)	1
  (2, 87)	1
  :	:
  (9, 85)	1
  (9, 56)	1
  (9, 14)	1
  (9, 66)	1
  (9, 20)	1
  (9, 106)	1
  (9, 102)	1
  (9, 70)	1
  (10, 113)	1
  (10, 123)	2
  (10, 43)	1
  (10, 108)	1
  (10, 75)	1
  (10, 118)	2
  (10, 107)	1
  (10, 52)	1
  (10, 4)	1
  (10, 76)	1
  (10, 15)	1
  (10, 126)	1
  (10, 109)	1
  (10, 55)	1
  (10, 23)	1
  (10, 88)	1
  (10, 60)	1


In [32]:
print(X.todense())

[[1 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 0 1]]


In [33]:
print(len(X.todense()))

11


In [34]:
print(vectorizer.get_feature_names())

['_the_', 'abhorrent', 'actions', 'adjusted', 'adler', 'admirable', 'admirably', 'admit', 'akin', 'all', 'always', 'any', 'as', 'balanced', 'be', 'but', 'cold', 'crack', 'delicate', 'distracting', 'disturbing', 'doubt', 'drawing', 'dubious', 'eclipses', 'emotion', 'emotions', 'excellent', 'eyes', 'factor', 'false', 'felt', 'finely', 'for', 'from', 'gibe', 'grit', 'has', 'have', 'he', 'heard', 'her', 'high', 'him', 'himself', 'his', 'holmes', 'in', 'instrument', 'into', 'introduce', 'intrusions', 'irene', 'is', 'it', 'late', 'lenses', 'love', 'lover', 'machine', 'memory', 'men', 'mental', 'mention', 'might', 'mind', 'more', 'most', 'motives', 'name', 'nature', 'never', 'not', 'observer', 'observing', 'of', 'one', 'or', 'other', 'own', 'particularly', 'passions', 'perfect', 'placed', 'position', 'power', 'precise', 'predominates', 'questionable', 'reasoner', 'reasoning', 'results', 'save', 'seen', 'seldom', 'sensitive', 'sex', 'she', 'sherlock', 'sneer', 'softer', 'spoke', 'strong', 'suc

In [35]:
new_sentence = "And yet there was but one woman to him, and that woman was the late Irene Adler, of dubious and questionable memory."
new_sentence_vector = get_new_sentence_vector(new_sentence, vectorizer)

In [36]:
analyze = vectorizer.build_analyzer()
print(analyze(new_sentence))

['and', 'yet', 'there', 'was', 'but', 'one', 'woman', 'to', 'him', 'and', 'that', 'woman', 'was', 'the', 'late', 'irene', 'adler', 'of', 'dubious', 'and', 'questionable', 'memory']


In [37]:
print(new_sentence_vector)

  (0, 4)	1
  (0, 15)	1
  (0, 23)	1
  (0, 43)	1
  (0, 52)	1
  (0, 55)	1
  (0, 60)	1
  (0, 75)	1
  (0, 76)	1
  (0, 88)	1
  (0, 107)	1
  (0, 108)	1
  (0, 109)	1
  (0, 113)	1
  (0, 118)	2
  (0, 123)	2
  (0, 126)	1


In [38]:
print(new_sentence_vector.todense())

[[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  1 1 0 0 0 1 0 0 0 0 2 0 0 0 0 2 0 0 1]]


In [39]:
print(vectorizer.get_feature_names())

['_the_', 'abhorrent', 'actions', 'adjusted', 'adler', 'admirable', 'admirably', 'admit', 'akin', 'all', 'always', 'any', 'as', 'balanced', 'be', 'but', 'cold', 'crack', 'delicate', 'distracting', 'disturbing', 'doubt', 'drawing', 'dubious', 'eclipses', 'emotion', 'emotions', 'excellent', 'eyes', 'factor', 'false', 'felt', 'finely', 'for', 'from', 'gibe', 'grit', 'has', 'have', 'he', 'heard', 'her', 'high', 'him', 'himself', 'his', 'holmes', 'in', 'instrument', 'into', 'introduce', 'intrusions', 'irene', 'is', 'it', 'late', 'lenses', 'love', 'lover', 'machine', 'memory', 'men', 'mental', 'mention', 'might', 'mind', 'more', 'most', 'motives', 'name', 'nature', 'never', 'not', 'observer', 'observing', 'of', 'one', 'or', 'other', 'own', 'particularly', 'passions', 'perfect', 'placed', 'position', 'power', 'precise', 'predominates', 'questionable', 'reasoner', 'reasoning', 'results', 'save', 'seen', 'seldom', 'sensitive', 'sex', 'she', 'sherlock', 'sneer', 'softer', 'spoke', 'strong', 'suc

________________

In [40]:
(bigram_vectorizer, X) = create_bigram_vectorizer(sentences)

In [41]:
print(X)

  (0, 269)	1
  (0, 229)	1
  (0, 118)	1
  (0, 226)	1
  (0, 136)	1
  (0, 20)	1
  (0, 0)	1
  (0, 299)	1
  (0, 275)	1
  (0, 230)	1
  (0, 119)	1
  (0, 228)	1
  (0, 137)	1
  (0, 21)	1
  (0, 1)	1
  (1, 93)	1
  (1, 221)	1
  (1, 101)	1
  (1, 108)	1
  (1, 156)	1
  (1, 103)	1
  (1, 278)	1
  (1, 31)	1
  (1, 190)	1
  (1, 167)	1
  :	:
  (10, 307)	1
  (10, 261)	1
  (10, 141)	1
  (10, 60)	1
  (10, 210)	1
  (10, 151)	1
  (10, 30)	1
  (10, 308)	1
  (10, 262)	1
  (10, 285)	1
  (10, 45)	1
  (10, 187)	1
  (10, 300)	1
  (10, 271)	1
  (10, 109)	1
  (10, 251)	1
  (10, 301)	1
  (10, 288)	1
  (10, 253)	1
  (10, 142)	1
  (10, 8)	1
  (10, 180)	1
  (10, 61)	1
  (10, 27)	1
  (10, 211)	1


In [42]:
print(X.todense())

[[1 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 0 1 1]]


In [43]:
print(len(X.todense()))

11


In [44]:
print(bigram_vectorizer.get_feature_names())

['_the_', '_the_ woman', 'abhorrent', 'abhorrent to', 'actions', 'adjusted', 'adjusted temperament', 'adler', 'adler of', 'admirable', 'admirable things', 'admirably', 'admirably balanced', 'admit', 'admit such', 'akin', 'akin to', 'all', 'all emotions', 'all his', 'always', 'always _the_', 'and', 'and actions', 'and finely', 'and observing', 'and predominates', 'and questionable', 'and sneer', 'and that', 'and yet', 'any', 'any emotion', 'any other', 'as', 'as his', 'as lover', 'balanced', 'balanced mind', 'be', 'be more', 'but', 'but admirably', 'but as', 'but for', 'but one', 'cold', 'cold precise', 'crack', 'crack in', 'delicate', 'delicate and', 'distracting', 'distracting factor', 'disturbing', 'disturbing than', 'doubt', 'doubt upon', 'drawing', 'drawing the', 'dubious', 'dubious and', 'eclipses', 'eclipses and', 'emotion', 'emotion akin', 'emotion in', 'emotions', 'emotions and', 'excellent', 'excellent for', 'eyes', 'eyes she', 'factor', 'factor which', 'false', 'false positio

In [45]:
new_sentence = "I had seen little of Holmes lately."
new_sentence_vector = bigram_vectorizer.transform([new_sentence])

In [46]:
print(new_sentence_vector)
print(new_sentence_vector.todense())

  (0, 118)	1
  (0, 179)	1
  (0, 219)	1
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


In [47]:
new_sentence1 = " And yet there was but one woman to him, and that woman was the late Irene Adler, of dubious and questionable memory."
new_sentence_vector1 = vectorizer.transform([new_sentence1])

In [48]:
print(new_sentence_vector1)
print(new_sentence_vector1.todense())

  (0, 4)	1
  (0, 15)	1
  (0, 23)	1
  (0, 43)	1
  (0, 52)	1
  (0, 55)	1
  (0, 60)	1
  (0, 75)	1
  (0, 76)	1
  (0, 88)	1
  (0, 107)	1
  (0, 108)	1
  (0, 109)	1
  (0, 113)	1
  (0, 118)	2
  (0, 123)	2
  (0, 126)	1
[[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  1 1 0 0 0 1 0 0 0 0 2 0 0 0 0 2 0 0 1]]


______________________________________

# CBOW

***Continuous Bag-of-Words Word2Vec:*** an architecture for learning word vectors that uses both the preceding and the following words as context. The CBOW objective function is:

![Alt Text](./img/clow.png)
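
For readers who cannot see the image, the CBOW objective is commonly written as shown below. This is the standard textbook formulation and is only assumed to match the figure; $T$ denotes the number of words in the training text and $n$ the number of context words taken on each side of the target word $w_t$.

```latex
J_\theta = \frac{1}{T} \sum_{t=1}^{T} \log p\left(w_t \mid w_{t-n}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+n}\right)
```

Training maximizes this average log-probability of each word given its surrounding context.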

![Alt Text](./img/word2vec_diagrams.png)

***Paper:*** [Efficient Estimation of Word Representations in Vector Space](https://paperswithcode.com/method/cbow-word2vec)

*Code adapted for the class: ***fasttext*** quick start guide*

In [49]:
import numpy as np
np.random.seed(13)

In [50]:
import gensim

In [51]:
import random
from IPython.display import Image

In [54]:
import keras.backend as K
from keras.models import Sequential, Model
from keras.layers import Embedding, Reshape, Activation, Input, Lambda, Dense
from keras.preprocessing import sequence
from keras.layers.merge import Dot
from keras.utils import np_utils
from keras.utils.data_utils import get_file
from keras.utils.vis_utils import model_to_dot
from keras.preprocessing.text import Tokenizer


In [55]:
window_size = 4

In [56]:
def skipgrams(sequence, vocabulary_size,
              window_size=window_size, negative_samples=1., shuffle=True,
              categorical=False, sampling_table=None, seed=None):
    couples = []
    labels = []
    for i, wi in enumerate(sequence):
        if not wi:
            continue
        if sampling_table is not None:
            if sampling_table[wi] < random.random():
                continue

        window_start = max(0, i - window_size)
        window_end = min(len(sequence), i + window_size + 1)
        for j in range(window_start, window_end):
            if j != i:
                wj = sequence[j]
                if not wj:
                    continue
                couples.append([wi, wj])
                if categorical:
                    labels.append([0, 1])
                else:
                    labels.append(1)

    if negative_samples > 0:
        num_negative_samples = int(len(labels) * negative_samples)
        words = [c[0] for c in couples]
        random.shuffle(words)

        couples += [[words[i % len(words)],
                    random.randint(1, vocabulary_size - 1)]
                    for i in range(num_negative_samples)]
        if categorical:
            labels += [[1, 0]] * num_negative_samples
        else:
            labels += [0] * num_negative_samples

    if shuffle:
        if seed is None:
            seed = random.randint(0, 10**7)  # integer bound; a float such as 10e6 is rejected by recent Python versions
        random.seed(seed)
        random.shuffle(couples)
        random.seed(seed)
        random.shuffle(labels)
        
    return couples, labels

def generate_data_for_cbow(corpus, window_size, V):
    maxlen = window_size*2
    corpus = tokenizer.texts_to_sequences(corpus)
    for words in corpus:
        L = len(words)
        for index, word in enumerate(words):
            contexts = []
            labels   = []            
            s = index - window_size
            e = index + window_size + 1
            
            contexts.append([words[i] for i in range(s, e) if 0 <= i < L and i != index])
            labels.append(word)
            x = sequence.pad_sequences(contexts, maxlen=maxlen)
            y = np_utils.to_categorical(labels, V)
            yield (x, y)
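
To see what `generate_data_for_cbow` produces without Keras, the following sketch builds the same kind of (context, target) pairs from a list of token ids. `cbow_pairs` and `sentence_ids` are hypothetical names used only here. The left padding with zeros is meant to imitate the default behaviour of `pad_sequences`, and the one-hot encoding of the target is left out for brevity.

```python
def cbow_pairs(token_ids, window_size):
    """Yield (padded_context, target) pairs for one tokenized sentence."""
    context_length = 2 * window_size
    for index, target in enumerate(token_ids):
        start = index - window_size
        end = index + window_size + 1
        # Neighbours inside the window, skipping the target position itself
        context = [token_ids[i] for i in range(start, end)
                   if 0 <= i < len(token_ids) and i != index]
        # Left-pad with 0 so that every context has the same length
        padded = [0] * (context_length - len(context)) + context
        yield padded, target


sentence_ids = [5, 8, 2, 9, 4]
pairs = list(cbow_pairs(sentence_ids, window_size=2))
for context, target in pairs:
    print(context, '->', target)
```

Each pair asks the model to predict the middle word from up to `2 * window_size` neighbours, which is the input shape the `cbow` model expects.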

In [57]:
path = './data/alice.txt'
corpus = open(path, encoding="utf-8").readlines()

In [58]:
corpus = [sentence for sentence in corpus if sentence.count(' ') >= 2]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
V=len(tokenizer.word_index) + 1

In [59]:
embedding_dim = 100
# inputs
w_inputs = Input(shape=(1, ), dtype='int32')
w = Embedding(V, embedding_dim)(w_inputs)

# context
c_inputs = Input(shape=(1, ), dtype='int32')
c = Embedding(V, embedding_dim)(c_inputs)
o = Dot(axes=2)([w, c])
o = Reshape((1,), input_shape=(1, 1))(o)
o = Activation('sigmoid')(o)

sg_model = Model(inputs=[w_inputs, c_inputs], outputs=o)
sg_model.compile(loss='binary_crossentropy', optimizer='adam')

In [60]:
cbow = Sequential()
cbow.add(Embedding(input_dim=V, output_dim=embedding_dim, input_length=window_size*2))
cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(embedding_dim,)))
cbow.add(Dense(V, activation='softmax'))
cbow.compile(loss='categorical_crossentropy', optimizer='adadelta')

***Training***

In [61]:
for ite in range(5):
    loss = 0.
    for i, doc in enumerate(tokenizer.texts_to_sequences(corpus)):
        data, labels = skipgrams(sequence=doc, vocabulary_size=V, window_size=5, negative_samples=5.)
        x = [np.array(x) for x in zip(*data)]
        y = np.array(labels, dtype=np.int32)
        if x:
            loss += sg_model.train_on_batch(x, y)

    print(ite, loss)

0 1110.177969634533
1 758.7284075319767
2 702.7573907524347
3 674.9359678328037
4 651.4569097012281


In [62]:
for ite in range(5):
    loss  =  0.
    for  x, y in generate_data_for_cbow(corpus, window_size, V):
        loss += cbow.train_on_batch(x, y)

    print(ite, loss)

0 250922.2698507309
1 250793.67269039154
2 250664.94546985626
3 250535.95819091797
4 250406.54448890686


***Saving the generated vectors***

In [63]:
with open('./data/sg_vectors.txt' ,'w') as f:
    f.write('{} {}\n'.format(V-1, embedding_dim))
    vectors = sg_model.get_weights()[0]
    for word, i in tokenizer.word_index.items():
        f.write('{} {}\n'.format(word, ' '.join(map(str, list(vectors[i, :])))))

In [64]:
with open('./data/cbow_vectors.txt' ,'w') as f:
    f.write('{} {}\n'.format(V-1, embedding_dim))
    vectors = cbow.get_weights()[0]
    for word, i in tokenizer.word_index.items():
        f.write('{} {}\n'.format(word, ' '.join(map(str, list(vectors[i, :])))))
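
The two cells above write the embeddings in the plain-text word2vec layout: a header line with the number of words and the vector size, followed by one line per word. The header uses `V-1` because the Keras `Tokenizer` starts `word_index` at 1 and reserves index 0, so `V` counts one more row than there are words. The sketch below is a minimal round trip of that layout using an in-memory buffer; `write_word2vec_text`, `read_word2vec_text`, and `toy_vectors` are illustrative names, not part of the notebook.

```python
import io


def write_word2vec_text(handle, vectors_by_word):
    """Write a 'count dim' header, then one 'word v1 v2 ...' line per word."""
    dim = len(next(iter(vectors_by_word.values())))
    handle.write('{} {}\n'.format(len(vectors_by_word), dim))
    for word, vector in vectors_by_word.items():
        handle.write('{} {}\n'.format(word, ' '.join(str(v) for v in vector)))


def read_word2vec_text(handle):
    """Parse the same layout back into a dict of word -> list of floats."""
    count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for _ in range(count):
        parts = handle.readline().split()
        vector = [float(x) for x in parts[1:]]
        if len(vector) != dim:
            raise ValueError('unexpected vector size for ' + parts[0])
        vectors[parts[0]] = vector
    return vectors


# Made-up vectors, for illustration only
toy_vectors = {'queen': [0.1, -0.2, 0.3], 'king': [0.0, 0.5, -1.0]}

buffer = io.StringIO()
write_word2vec_text(buffer, toy_vectors)
text = buffer.getvalue()

buffer.seek(0)
restored = read_word2vec_text(buffer)
print(text)
```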

***Loading the vectors***

In [65]:
sg_model = gensim.models.KeyedVectors.load_word2vec_format( open('./data/sg_vectors.txt', 'r'), binary=False)
cbow_model = gensim.models.KeyedVectors.load_word2vec_format(open('./data/cbow_vectors.txt', 'r'), binary=False)

In [66]:
sg_model.most_similar(positive=['queen'])

[('hearts', 0.7938040494918823),
 ('became', 0.7249054312705994),
 ('cook', 0.7186954617500305),
 ('verse', 0.689565122127533),
 ('impatiently', 0.6888817548751831),
 ('king', 0.6803977489471436),
 ('lobsters', 0.67889004945755),
 ('wildly', 0.6765428185462952),
 ('footman', 0.6761890649795532),
 ('duchess’s', 0.6755886673927307)]

In [67]:
cbow_model.most_similar(positive=['queen'])

[('date', 0.3321659564971924),
 ('impatiently', 0.32813021540641785),
 ('sir—”', 0.3218364715576172),
 ('wish', 0.31456565856933594),
 ('rule', 0.311204195022583),
 ('he’ll', 0.3093535304069519),
 ('energetic', 0.30847007036209106),
 ('gravely', 0.3012162744998932),
 ('elbow', 0.28901851177215576),
 ('eats', 0.2877623438835144)]

In [68]:
sg_model.most_similar(positive=['alice'])

[('thought', 0.6757129430770874),
 ('poor', 0.6078629493713379),
 ('“it', 0.5983031988143921),
 ('“i’m', 0.5915191173553467),
 ('rather', 0.5902792811393738),
 ('curious', 0.5878759622573853),
 ('doubtfully', 0.5791394114494324),
 ('girl', 0.5791118144989014),
 ('“that’s', 0.5784812569618225),
 ('“i’d', 0.5708906650543213)]

In [69]:
cbow_model.most_similar(positive=['alice'])

[('produced', 0.36737269163131714),
 ('feeling', 0.3586391806602478),
 ('stay', 0.3092033863067627),
 ('“who', 0.29888805747032166),
 ('absurd', 0.29571735858917236),
 ('book', 0.2882564961910248),
 ('sad', 0.27913543581962585),
 ('bread', 0.2776448428630829),
 ('future', 0.27288612723350525),
 ('daisy', 0.27287036180496216)]

In [70]:
sg_model.most_similar(positive=['the'])

[('hearts', 0.6624538898468018),
 ('queen', 0.6542152166366577),
 ('queen’s', 0.653067946434021),
 ('king', 0.6318836212158203),
 ('under', 0.5902504324913025),
 ('laws', 0.5705169439315796),
 ('professor', 0.5625193119049072),
 ('march', 0.5595977902412415),
 ('singers', 0.5542469024658203),
 ('became', 0.5508366823196411)]

In [71]:
cbow_model.most_similar(positive=['the'])

[('preserve', 0.34494319558143616),
 ('fact', 0.3368389904499054),
 ('brought', 0.33671051263809204),
 ('carrier', 0.3263804316520691),
 ('by', 0.31738507747650146),
 ('answered', 0.29890382289886475),
 ('royalty', 0.2966153621673584),
 ('soldiers', 0.28517836332321167),
 ('dreamy', 0.2848338484764099),
 ('whistle', 0.2828007936477661)]

_______________________

# Term Frequency-Inverse Document Frequency (TF-IDF)

In a large text corpus, some words are very frequent (for example "the", "a", and "is" in English) and therefore ***carry very little meaningful information about the actual content of the document***. If the raw counts were fed directly to a classifier, those very frequent terms would overshadow the frequencies of rarer but more informative terms.
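
The following pure-Python sketch shows how TF-IDF weighting reduces the influence of such frequent terms. It assumes the smoothed IDF variant documented for scikit-learn's `TfidfVectorizer` (`smooth_idf=True`), namely `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. The helper names `idf` and `tfidf` and the toy corpus are illustrative only.

```python
import math
import re
from collections import Counter

toy_corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Assumed to match the default word tokenization of the scikit-learn vectorizers
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in toy_corpus]
n_docs = len(tokenized)

# Number of documents in which each term appears at least once
document_frequency = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + document_frequency[term])) + 1


def tfidf(tokens):
    """Weight raw term counts by IDF, then scale the result to unit L2 norm."""
    weights = {term: count * idf(term) for term, count in Counter(tokens).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


weights = tfidf(tokenized[1])  # 'This document is the second document.'
for term, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f'{term:10s} {weight:.3f}')
```

In this sentence, `second` appears once, exactly like `the`, yet it receives a larger weight because it occurs in only one of the four documents.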

In [72]:
import nltk
import string

In [73]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [74]:
from nltk.stem.snowball import SnowballStemmer

In [75]:
from removing_stopwords import read_in_csv

In [76]:
stemmer = SnowballStemmer('english')
stopwords_file_path = "./data/stopwords.csv"
sentences = get_sentences("sherlock_holmes_1.txt")

In [77]:
def tokenize_and_stem(sentence):
    tokens = nltk.word_tokenize(sentence)
    filtered_tokens = [t for t in tokens if t not in string.punctuation]
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

In [78]:
def create_char_vectorizer(sentences):
    #Create TF-IDF object
    tfidf_char_vectorizer = TfidfVectorizer(analyzer='char_wb', max_df=0.90, max_features=200000,
                                        min_df=0.05, use_idf=True, ngram_range=(1,3))
    tfidf_char_vectorizer = tfidf_char_vectorizer.fit(sentences)
    tfidf_matrix = tfidf_char_vectorizer.transform(sentences)
    print(tfidf_matrix)
    dense_matrix = tfidf_matrix.todense()
    print(dense_matrix)
    print(tfidf_char_vectorizer.get_feature_names())
    analyze = tfidf_char_vectorizer.build_analyzer()
    print(analyze("To Sherlock Holmes she is always _the_ woman."))
    return (tfidf_char_vectorizer, tfidf_matrix)

In [79]:
def create_vectorizer(sentences):
    #Create TF-IDF object
    stopword_list = read_in_csv(stopwords_file_path)
    stemmed_stopwords = [tokenize_and_stem(stopword)[0] for stopword in stopword_list]
    stopword_list = stopword_list + stemmed_stopwords
    tfidf_vectorizer = TfidfVectorizer(max_df=0.90, max_features=200000,
                                        min_df=0.05, stop_words=stopword_list,
                                        use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))
    tfidf_vectorizer = tfidf_vectorizer.fit(sentences)
    tfidf_matrix = tfidf_vectorizer.transform(sentences)
    print(tfidf_matrix)
    dense_matrix = tfidf_matrix.todense()
    print(dense_matrix)
    print(tfidf_vectorizer.get_feature_names())
    return (tfidf_vectorizer, tfidf_matrix)

In [80]:
(vectorizer, matrix) = create_vectorizer(sentences)

FileNotFoundError: [Errno 2] No such file or directory: './data/stopwords.csv'

In [None]:
analyze = vectorizer.build_analyzer()
print(analyze("To Sherlock Holmes she is always _the_ woman."))

___________________________

In [None]:
from sklearn.datasets import fetch_20newsgroups

In [None]:
newsgroups_train = fetch_20newsgroups(subset='train')

In [None]:
newsgroups_test = fetch_20newsgroups(subset='test')

In [None]:
x_train = newsgroups_train.data

In [None]:
x_test = newsgroups_test.data

In [None]:
y_train = newsgroups_train.target

In [None]:
y_test = newsgroups_test.target

In [None]:
print("Categories of the 20 newsgroups:")
print(newsgroups_train.target_names)
print("___________________________")
print("Example of an email:")
print(x_train[0])
print("___________________________")
print("Example target:")
print(y_train[0])
print(newsgroups_train.target_names[y_train[0]])

______________

In [None]:
import nltk

In [None]:
import string

In [None]:
import pandas as pd

In [None]:
from nltk.corpus import stopwords

In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
from nltk import pos_tag

In [None]:
from nltk.stem import PorterStemmer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
def preprocessing(text):
    text2 = " ".join("".join([" " if ch in string.punctuation else ch for ch in text]).split())
    tokens = [word for sent in nltk.sent_tokenize(text2) for word in nltk.word_tokenize(sent)]
    tokens = [word.lower() for word in tokens]
    stopwds = stopwords.words('english')
    tokens = [token for token in tokens if token not in stopwds]
    tokens = [word for word in tokens if len(word)>=3]
    stemmer = PorterStemmer()

    try:
        tokens = [stemmer.stem(word) for word in tokens]
    except Exception:
        # If stemming fails, keep the unstemmed tokens
        pass
        
    tagged_corpus = pos_tag(tokens)    
    Noun_tags = ['NN','NNP','NNPS','NNS']
    Verb_tags = ['VB','VBD','VBG','VBN','VBP','VBZ']
    lemmatizer = WordNetLemmatizer()

    def prat_lemmatize(token,tag):
        if tag in Noun_tags:
            return lemmatizer.lemmatize(token,'n')
        elif tag in Verb_tags:
            return lemmatizer.lemmatize(token,'v')
        else:
            return lemmatizer.lemmatize(token,'n')
    
    pre_proc_text =  " ".join([prat_lemmatize(token,tag) for token,tag in tagged_corpus])             

    return pre_proc_text

In [None]:
x_train_preprocessed  = []

In [None]:
for i in x_train:
    x_train_preprocessed.append(preprocessing(i))

In [None]:
x_test_preprocessed = []

In [None]:
for i in x_test:
    x_test_preprocessed.append(preprocessing(i))

In [None]:
vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2),  stop_words='english', 
                             max_features= 10000,strip_accents='unicode',  norm='l2')

In [None]:
x_train_2 = vectorizer.fit_transform(x_train_preprocessed).todense()

In [None]:
x_test_2 = vectorizer.transform(x_test_preprocessed).todense()

**Deep Learning modules**

In [None]:
import numpy as np

In [None]:
from keras.models import Sequential

In [None]:
from keras.layers.core import Dense, Dropout, Activation

In [None]:
from keras.optimizers import Adadelta,Adam,RMSprop

In [None]:
from keras.utils import np_utils

In [None]:
from sklearn.metrics import accuracy_score,classification_report

***Hyper parameters***

In [None]:
np.random.seed(1337) 
nb_classes = 20
batch_size = 64
nb_epochs = 20

In [None]:
Y_train = np_utils.to_categorical(y_train, nb_classes)

In [None]:
model = Sequential()

In [None]:
model.add(Dense(1000,input_shape= (10000,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))

In [None]:
model.add(Dense(500))
model.add(Activation('relu'))
model.add(Dropout(0.5))

In [None]:
model.add(Dense(50))
model.add(Activation('relu'))
model.add(Dropout(0.5))

In [None]:
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
print (model.summary())

***Model Training***

In [None]:
model.fit(x_train_2, Y_train, batch_size=batch_size, epochs=nb_epochs,verbose=1)

***Model Prediction***

In [None]:
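# predict_classes was removed from recent Keras releases; take the argmax of the softmax output instead
y_train_predclass = np.argmax(model.predict(x_train_2, batch_size=batch_size), axis=-1)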

In [None]:
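# predict_classes was removed from recent Keras releases; take the argmax of the softmax output instead
y_test_predclass = np.argmax(model.predict(x_test_2, batch_size=batch_size), axis=-1)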

In [None]:
print("Train accuracy: {}".format(round(accuracy_score(y_train, y_train_predclass), 3)))
print("Test accuracy: {}".format(round(accuracy_score(y_test, y_test_predclass), 3)))

In [None]:
print ("Test Classification Report\n")
print (classification_report(y_test,y_test_predclass))

__________________________________

# Word Embeddings

Word embeddings are vectors obtained by training a ***neural network*** to predict a word from the other words in the sentence. The resulting vectors are similar for words that occur in similar contexts.
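
"Similar" is usually measured with cosine similarity between vectors. The sketch below uses hand-picked three-dimensional vectors to show the idea. The values in `toy_embeddings` are hypothetical and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, chosen by hand for illustration only
toy_embeddings = {
    'apple': [0.9, 0.1, 0.0],
    'banana': [0.8, 0.2, 0.1],
    'computer': [0.0, 0.1, 0.9],
}

fruit_similarity = cosine_similarity(toy_embeddings['apple'], toy_embeddings['banana'])
mixed_similarity = cosine_similarity(toy_embeddings['apple'], toy_embeddings['computer'])
print(round(fruit_similarity, 3), round(mixed_similarity, 3))
```

With real embeddings the same comparison is what methods such as `most_similar` rely on: words used in similar contexts end up with a cosine similarity close to 1.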

In [None]:
from gensim.models import KeyedVectors

In [None]:
import numpy as np

In [None]:
w2vec_model_path = "./models/40/model.bin"

In [None]:
def load_model(path):
    # Use the path argument rather than the global variable
    model = KeyedVectors.load_word2vec_format(path, binary=True)
    return model

In [None]:
def get_sentence_vector(word_vectors):
    matrix = np.array(word_vectors)
    centroid = np.mean(matrix[:,:], axis=0)
    return centroid

In [None]:
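def get_word_vectors(sentence, model):
    word_vectors = []
    # Split into words first; iterating over the string directly would yield single characters
    for word in sentence.split():
        try:
            # Lowercase and drop surrounding punctuation before the lookup
            word_vector = model.get_vector(word.lower().strip('.,;:!?'))
            word_vectors.append(word_vector)
        except KeyError:
            continue
    return word_vectors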

In [None]:
model = load_model(w2vec_model_path)
print(model['holmes'])
print(model.most_similar(['holmes'], topn=15))

sentence = "It was not that he felt any emotion akin to love for Irene Adler."
word_vectors = get_word_vectors(sentence, model)
sentence_vector = get_sentence_vector(word_vectors)
words = ['banana', 'apple', 'computer', 'strawberry']
print(model.doesnt_match(words))

word = "cup"
words = ['glass', 'computer', 'pencil', 'watch']
print(model.most_similar_to_given(word, words))

________________

# Word2vec Model

The ***word2vec*** algorithm uses a neural network model to learn word associations from a large text corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence.

![Alt Text](./img/word2vec_translation.png)

#### Source: Python Deep Learning Projects, curated for this course

In [None]:
import requests
import os
import re
import multiprocessing

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import nltk

In [None]:
import gensim.models.word2vec as w2v

In [None]:
import sklearn.manifold

In [None]:
import tensorflow as tf

In [None]:
#nltk.download("punkt")
#nltk.download("stopwords")

In [None]:
def sentence_to_wordlist(raw):
    clean = re.sub("[^a-zA-Z]", " ", raw)
    words = clean.split()
    return list(map(lambda x: x.lower(), words))

#### Preprocessing applied to the text

![Alt Text](./img/learning-word-vectors-1.png)

***Text taken from Principles of Geology by Sir Charles Lyell, Project Gutenberg***

In [None]:
filepath = 'http://www.gutenberg.org/files/33224/33224-0.txt'
corpus_raw = requests.get(filepath).text

In [None]:
# Clean text
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
raw_sentences = tokenizer.tokenize(corpus_raw)

In [None]:
# Sentence where each word is tokenized
sentences = (sentence_to_wordlist(raw) for raw in raw_sentences if raw)
sentences = list(sentences)
token_count = sum([len(sentence) for sentence in sentences])
print(f'The book corpus contains {token_count} tokens.')

#### Model definition

In [None]:
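# Dimensionality of the word vectors
num_features = 300

# Minimum number of occurrences for a word to be kept
min_word_count = 3

# Number of parallel worker threads
num_workers = multiprocessing.cpu_count()

# Context window size
context_size = 7

# Downsampling threshold for frequent words
downsampling = 1e-3

# Random seed
seed = 1

# vector_size is the gensim >= 4 name of the dimensionality parameter (size in older releases)
model2vec = w2v.Word2Vec(sg=1, seed=seed, workers=num_workers, vector_size=num_features,
                         min_count=min_word_count, window=context_size, sample=downsampling)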

In [None]:
model2vec.build_vocab(list(sentences))

***Note:*** Increasing the number of dimensions can lead to better generalization, but it also adds computational cost.

***Note:*** The ***context_size*** parameter sets the maximum distance between the current word and the predicted word within a sentence.
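
The effect of the window can be sketched by listing the (center, context) pairs that fall within a given distance. This is a simplified pure-Python illustration, not the gensim implementation; `context_pairs` and the example sentence are hypothetical.

```python
def context_pairs(tokens, context_size):
    """Return (center, context) pairs whose positions differ by at most context_size."""
    pairs = []
    for i, center in enumerate(tokens):
        first = max(0, i - context_size)
        last = min(len(tokens), i + context_size + 1)
        pairs.extend((center, tokens[j]) for j in range(first, last) if j != i)
    return pairs


example_tokens = ['the', 'earth', 'orbits', 'the', 'sun']
narrow = context_pairs(example_tokens, context_size=1)
wide = context_pairs(example_tokens, context_size=2)
print(len(narrow), len(wide))
```

A larger window yields more training pairs per sentence and links words that are further apart, at the cost of more computation.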

#### Training

In [None]:
model2vec.train(sentences, total_examples=model2vec.corpus_count, epochs=10)

if not os.path.exists(os.path.join('trained', 'sample')):
    os.makedirs(os.path.join('trained', 'sample'))

model2vec.save(os.path.join('trained', 'sample', 'sample.w2v'))

#### Model evaluation

In [None]:
print('Similar to "earth":')
for sWord in model2vec.wv.most_similar("earth"):
    print(sWord)

print('\nSimilar to "human":')
for sWord in model2vec.wv.most_similar("human"):
    print(sWord)

print('\nPositive and negative contributions:')
for sWord in model2vec.wv.most_similar_cosmul(positive=['earth', 'moon'], negative=['orbit']):
    print(sWord)
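
The ranking behind a call such as `most_similar` can be sketched with plain cosine similarity. The toy vectors and the helper `most_similar_sketch` below are invented for illustration; a trained model has hundreds of dimensions and thousands of words.

```python
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_sketch(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]

# Hypothetical 3-dimensional embeddings
toy_vectors = {
    "earth": [0.9, 0.1, 0.0],
    "globe": [0.8, 0.2, 0.1],
    "moon": [0.6, 0.6, 0.0],
    "human": [0.0, 0.2, 0.9],
}
neighbours = most_similar_sketch("earth", toy_vectors)
for neighbour in neighbours:
    print(neighbour)
```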

__________________

# Train Word2Vec

In [121]:
import gensim
from gensim.test.utils import datapath
from gensim.models import KeyedVectors
import pickle
from os import listdir
from os.path import isfile, join
from tokenization import tokenize_nltk

In [139]:
from bag_of_words import get_sentences

In [158]:
word2vec_model_path = "models/word2vec.model"
books_dir = "books/"
evaluation_file = "questions-words.txt"
pretrained_model_path = "models/40/model.bin"

In [159]:
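def train_word2vec(words, word2vec_model_path):
    # Alternative configuration tried earlier: size=50, window=7, min_count=1, workers=10
    # Passing the corpus to the constructor builds the vocabulary and runs a first training pass
    model = gensim.models.Word2Vec(words, window=5, min_count=5)
    # Continue training for 200 more epochs over the same corpus
    model.train(words, total_examples=len(words), epochs=200)
    # Persist the trained model with pickle, closing the file afterwards
    with open(word2vec_model_path, 'wb') as model_file:
        pickle.dump(model, model_file)
    return model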

In [180]:
def get_all_book_sentences(directory):
    # Collect every .rtf file in the directory
    text_files = [join(directory, f) for f in listdir(directory)
                  if isfile(join(directory, f)) and f.endswith(".rtf")]
    all_sentences = []
    for text_file in text_files:
        all_sentences.extend(get_sentences(text_file))
    return all_sentences

In [194]:
def test_model(w1):
    # Load the pickled model and print the 10 words most similar to w1
    with open(word2vec_model_path, 'rb') as model_file:
        model = pickle.load(model_file)
    words = model.wv.most_similar(w1, topn=10)
    print(words)

In [182]:
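def evaluate_model(model, filename):
    # KeyedVectors.accuracy() is not available in gensim >= 4.0;
    # evaluate_word_analogies() replaces it and returns (overall_score, sections)
    return model.wv.evaluate_word_analogies(filename)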
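
The analogy file used for evaluation contains section headers starting with `:` followed by lines of four words `a b c d`, where the model must predict `d` from `b - a + c`. The sketch below shows that scoring loop on toy data; `analogy_accuracy`, the vectors, and the questions are assumptions for illustration, and gensim's own evaluation has extra options (vocabulary restriction, case handling) that are not reproduced here.

```python
from math import sqrt

def unit(vector):
    """Scale a vector to length 1."""
    norm = sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def analogy_accuracy(lines, vectors):
    """Return the fraction of correctly answered questions per section."""
    unit_vectors = {word: unit(vec) for word, vec in vectors.items()}
    tallies = {}  # section name -> [correct, total]
    section = "unnamed"
    for line in lines:
        line = line.strip().lower()
        if not line:
            continue
        if line.startswith(":"):
            section = line[1:].strip()
            continue
        a, b, c, expected = line.split()
        if any(word not in unit_vectors for word in (a, b, c, expected)):
            continue  # skip questions with out-of-vocabulary words
        target = [vb - va + vc for va, vb, vc in
                  zip(unit_vectors[a], unit_vectors[b], unit_vectors[c])]
        # The three question words are never allowed as the answer
        candidates = [word for word in unit_vectors if word not in (a, b, c)]
        predicted = max(candidates, key=lambda word: sum(
            x * y for x, y in zip(unit_vectors[word], target)))
        tally = tallies.setdefault(section, [0, 0])
        tally[0] += int(predicted == expected)
        tally[1] += 1
    return {name: correct / total for name, (correct, total) in tallies.items()}

# Hypothetical embeddings and questions
toy_vectors = {
    "man": [1, 0, 0], "woman": [1, 1, 0],
    "king": [1, 0, 1], "queen": [1, 1, 1],
    "apple": [0, 0.1, -1],
}
questions = [
    ": royalty",
    "man woman king queen",
    "man king woman apple",
    ": unknown-words",
    "man woman paris rome",
]
scores = analogy_accuracy(questions, toy_vectors)
print(scores)
```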

In [183]:
sentences = get_all_book_sentences(books_dir)

['books/Adventures_of_Huckleberry_Finn_by_Mark_Twain.rtf', 'books/Alices_Adventures_in_Wonderland_by_Lewis_Carroll.rtf', 'books/A_Dolls_House_by_Henrik_Ibsen.rtf', 'books/A_Tale_of_Two_Cities_by_Charles_Dickens.rtf', 'books/Dracula_by_Bram_Stoker.rtf', 'books/Emma_by_Jane_Austen.rtf', 'books/Frankenstein_by_Mary_Shelley.rtf', 'books/Great_Expectations_by_Charles_Dickens.rtf', 'books/Grimms_Fairy_Tales_by_The_Brothers_Grimm.rtf', 'books/Metamorphosis_by_Franz_Kafka.rtf', 'books/Pride_and_Prejudice_by_Jane_Austen.rtf', 'books/The_Adventures_of_Sherlock_Holmes_by_Arthur_Conan_Doyle.rtf', 'books/The_Adventures_of_Tom_Sawyer_by_Mark_Twain.rtf', 'books/The_Count_of_Monte_Cristo_by_Alexandre_Dumas.rtf', 'books/The_Importance_of_Being_Earnest_by_Oscar_Wilde.rtf', 'books/The_Picture_of_Dorian_Gray_by_Oscar_Wilde.rtf', 'books/The_Prince_by_Nicolo_Machiavelli.rtf', 'books/The_Romance_of_Lust_by_Anonymous.rtf', 'books/The_Yellow_Wallpaper_by_Charlotte_Perkins_Gilman.rtf', 'books/Ulysses_by_James_J

In [185]:
sentences = [tokenize_nltk(s.lower()) for s in sentences]

In [187]:
model = train_word2vec(sentences, word2vec_model_path)

In [195]:
oneWord = "river"
test_model(oneWord)

[('banks', 0.6442744731903076), ('stream', 0.6146652102470398), ('woods', 0.6124129891395569), ('hill', 0.6082079410552979), ('illinois', 0.6081189513206482), ('shore', 0.6002070307731628), ('mountains', 0.5959462523460388), ('avenue', 0.5911605954170227), ('corso', 0.5810585021972656), ('raft', 0.5785054564476013)]


In [196]:
model = pickle.load(open(word2vec_model_path, 'rb'))

__________________________________

# Word2Vec in Spanish

[Training word2vec in Spanish, example adapted for the class](https://github.com/dccuchile/spanish-word-embeddings)

In [103]:
from gensim.models.keyedvectors import KeyedVectors

Alternatively, pre-trained vectors are available at [Word vectors for 157 languages](https://fasttext.cc/docs/en/crawl-vectors.html)

In [104]:
wordvectors_file_vec = './data/fasttext-sbwc.3.6.e20.vec'

In [105]:
cantidad = 100000

In [106]:
wordvectors = KeyedVectors.load_word2vec_format(wordvectors_file_vec, limit=cantidad)
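
A `.vec` file in word2vec text format starts with a header line holding the vocabulary size and the vector dimension, followed by one line per word. The sketch below parses that layout and shows what a `limit` argument does; `load_vec_sketch` and the three-word sample are made up for this example and are not gensim's loader.

```python
import io

def load_vec_sketch(stream, limit=None):
    """Read up to `limit` vectors from a word2vec text stream."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    if limit is not None:
        vocab_size = min(vocab_size, limit)
    vectors = {}
    for _ in range(vocab_size):
        parts = stream.readline().rstrip().split(" ")
        if len(parts) != dim + 1:
            raise ValueError("malformed line in vector file")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Hypothetical file content: 3 words, 2 dimensions
sample = io.StringIO("3 2\nde 0.1 0.2\nla 0.3 0.4\nque 0.5 0.6\n")
loaded = load_vec_sketch(sample, limit=2)
print(loaded)  # {'de': [0.1, 0.2], 'la': [0.3, 0.4]}
```

Such files are typically sorted by descending word frequency, so a limit keeps the most common words and reduces memory use and loading time.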

#### Finding analogies or words that share a similar context

[Reference: KeyedVectors](https://radimrehurek.com/gensim/models/keyedvectors.html)

In [107]:
wordvectors.most_similar_cosmul(positive=['rey','mujer'],negative=['hombre'])

[('reina', 0.9141532778739929),
 ('infanta', 0.8582409620285034),
 ('berenguela', 0.8470728993415833),
 ('princesa', 0.8445042371749878),
 ('consorte', 0.835599422454834),
 ('emperatriz', 0.8247664570808411),
 ('regente', 0.8239888548851013),
 ('infantas', 0.8104740381240845),
 ('hermanastra', 0.8072930574417114),
 ('regencia', 0.8037239909172058)]
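
The multiplicative objective (3CosMul) behind `most_similar_cosmul` can be sketched as follows: cosine similarities are shifted to the range [0, 1], the similarities to the positive words are multiplied, and the result is divided by the product of the similarities to the negative words. The helper `cosmul_scores`, the toy vectors, and the small `eps` constant are assumptions for illustration.

```python
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def cosmul_scores(positive, negative, vectors, eps=1e-6):
    """Rank candidate words with the 3CosMul objective."""
    exclude = set(positive) | set(negative)
    scores = {}
    for word, vec in vectors.items():
        if word in exclude:
            continue
        numerator = 1.0
        for p in positive:
            numerator *= (1 + cosine(vec, vectors[p])) / 2
        denominator = 1.0
        for n in negative:
            denominator *= (1 + cosine(vec, vectors[n])) / 2
        scores[word] = numerator / (denominator + eps)  # eps avoids division by zero
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical 3-dimensional embeddings
toy_vectors = {
    "hombre": [1, 0, 0], "mujer": [1, 1, 0],
    "rey": [1, 0, 1], "reina": [1, 1, 1],
    "manzana": [0, 0.1, -1],
}
ranking = cosmul_scores(["rey", "mujer"], ["hombre"], toy_vectors)
print(ranking)
```

Because the score is a ratio rather than a cosine, it is not bounded by 1.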

In [108]:
wordvectors.most_similar_cosmul(positive=['actor','mujer'],negative=['hombre'], topn=10)

[('actriz', 0.9687139391899109),
 ('compositora', 0.855713427066803),
 ('cantante', 0.8482002019882202),
 ('actrices', 0.845941424369812),
 ('dramaturga', 0.8354867696762085),
 ('presentadora', 0.8346402645111084),
 ('bailarina', 0.8301039934158325),
 ('coprotagonista', 0.8284398317337036),
 ('guionista', 0.828334629535675),
 ('cantautora', 0.8273791670799255)]

In [109]:
wordvectors.most_similar_cosmul(positive=['hijo','mujer'],negative=['hombre'], topn=5)

[('hija', 0.9641352295875549),
 ('esposa', 0.911634087562561),
 ('madre', 0.9057635068893433),
 ('nieta', 0.8976945877075195),
 ('hermanastra', 0.8958925604820251)]

In [110]:
wordvectors.most_similar_cosmul(positive=['yerno','mujer'],negative=['hombre'])

[('nuera', 0.8991931080818176),
 ('cuñada', 0.8967029452323914),
 ('esposa', 0.8791162967681885),
 ('hija', 0.8787108659744263),
 ('suegra', 0.8752366304397583),
 ('sobrina', 0.8678680658340454),
 ('hermanastra', 0.8615662455558777),
 ('viuda', 0.8587483167648315),
 ('yernos', 0.8577941656112671),
 ('nieta', 0.8574916124343872)]

In [111]:
wordvectors.most_similar_cosmul(positive=['jugar','canta'],negative=['cantar'])

[('juega', 0.927038848400116),
 ('jugará', 0.9030497670173645),
 ('juegue', 0.8957996368408203),
 ('jugando', 0.8832089304924011),
 ('juegan', 0.868077278137207),
 ('jugado', 0.8658615946769714),
 ('jugó', 0.8645128607749939),
 ('juegas', 0.8533657789230347),
 ('jugaría', 0.8508267402648926),
 ('jugara', 0.8470849394798279)]

In [112]:
wordvectors.most_similar_cosmul(positive=['jugar','cantaría'],negative=['cantar'])

[('jugaría', 1.002570629119873),
 ('jugarían', 0.9512909650802612),
 ('jugara', 0.9422452449798584),
 ('disputaría', 0.918655276298523),
 ('jugará', 0.908361554145813),
 ('jugaran', 0.8989545106887817),
 ('jugase', 0.8874877095222473),
 ('disputarían', 0.8822468519210815),
 ('jugó', 0.8740343451499939),
 ('ficharía', 0.8733251094818115)]

In [113]:
wordvectors.most_similar_cosmul(positive=['ir','jugando'],negative=['jugar'])

[('yendo', 0.907002329826355),
 ('ido', 0.8450857996940613),
 ('saliendo', 0.832144021987915),
 ('caminando', 0.8135581612586975),
 ('yéndose', 0.8133329153060913),
 ('acercando', 0.8035196661949158),
 ('iremos', 0.8023999333381653),
 ('marchando', 0.8001841902732849),
 ('parando', 0.7995682954788208),
 ('irá', 0.7987060546875)]

In [114]:
wordvectors.most_similar_cosmul(positive=['santiago','venezuela'],negative=['chile'])

[('caracas', 0.9048638343811035),
 ('barinas', 0.871845543384552),
 ('brión', 0.8565776944160461),
 ('cojedes', 0.851475715637207),
 ('cumaná', 0.8507834076881409),
 ('guanare', 0.8507249355316162),
 ('maturín', 0.8474243879318237),
 ('mariño', 0.8468520641326904),
 ('barquisimeto', 0.8451403379440308),
 ('falcón', 0.8430415987968445)]

In [115]:
wordvectors.most_similar_cosmul(positive=['habana','chile'],negative=['santiago'])

[('cuba', 0.9638005495071411),
 ('venezuela', 0.8891815543174744),
 ('colombia', 0.876230001449585),
 ('cubana', 0.8471046686172485),
 ('nicaragua', 0.8443881273269653),
 ('cubanos', 0.8370179533958435),
 ('ecuador', 0.8361554145812988),
 ('brasil', 0.8355840444564819),
 ('cubano', 0.8315702080726624),
 ('panamá', 0.8302189111709595)]

In [116]:
wordvectors.most_similar_to_given('santiago', ['cuba','chile', 'brasil'] )

'chile'

In [117]:
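# n_similarity expects two lists of words; a bare string is iterated character by character,
# so a word-level comparison would be wordvectors.similarity('santiago', 'chile')
wordvectors.n_similarity('santiago', 'chile')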

0.7982443

In [118]:
# Strings passed to n_similarity are treated as sequences of characters, not as single words
wordvectors.n_similarity('chile', 'brasil')

0.9034706
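
Set-to-set similarity of the kind computed by `n_similarity` can be sketched as the cosine between the mean vectors of two word lists. The helper names and toy vectors below are hypothetical; note that both arguments are lists, even when they contain a single word.

```python
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def mean_vector(words, vectors):
    """Average the vectors of a list of words, dimension by dimension."""
    dim = len(vectors[words[0]])
    return [sum(vectors[w][k] for w in words) / len(words) for k in range(dim)]

def n_similarity_sketch(words_a, words_b, vectors):
    """Cosine similarity between the mean vectors of two word lists."""
    return cosine(mean_vector(words_a, vectors), mean_vector(words_b, vectors))

# Hypothetical 2-dimensional embeddings
toy_vectors = {
    "santiago": [0.9, 0.1],
    "chile": [0.8, 0.3],
    "brasil": [0.2, 0.9],
}
print(n_similarity_sketch(["santiago"], ["chile"], toy_vectors))
print(n_similarity_sketch(["santiago", "chile"], ["brasil"], toy_vectors))
```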

#### Word that is farthest from the rest of the words in the list

In [119]:
wordvectors.doesnt_match(['blanco','azul','rojo','chile'])

'chile'

In [120]:
wordvectors.doesnt_match(['sol','luna','almuerzo','jupiter'])

'almuerzo'
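
One way to find the odd word out, sketched below, is to normalize every vector, average them, and return the word whose vector is least similar to that average. The function `doesnt_match_sketch` and the toy vectors are assumptions for illustration and only approximate what the library does internally.

```python
from math import sqrt

def unit(vector):
    """Scale a vector to length 1."""
    norm = sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def doesnt_match_sketch(words, vectors):
    """Return the word whose unit vector is farthest from the mean direction."""
    unit_vectors = [unit(vectors[w]) for w in words]
    dim = len(unit_vectors[0])
    mean = unit([sum(v[k] for v in unit_vectors) / len(unit_vectors)
                 for k in range(dim)])
    similarity_to_mean = [sum(a * b for a, b in zip(v, mean)) for v in unit_vectors]
    return words[similarity_to_mean.index(min(similarity_to_mean))]

# Hypothetical 2-dimensional embeddings: three colours and one country
toy_vectors = {
    "blanco": [0.9, 0.1],
    "azul": [0.8, 0.2],
    "rojo": [0.85, 0.15],
    "chile": [0.1, 0.9],
}
odd_one = doesnt_match_sketch(["blanco", "azul", "rojo", "chile"], toy_vectors)
print(odd_one)  # chile
```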

_____________________

# Bidirectional Encoder Representations from Transformers - BERT

***BERT*** is built on the Transformer, an attention-based architecture that learns contextual relationships between the words of a text. In its basic form, a Transformer has two separate components: an encoder that reads the input text and a decoder that produces a prediction for the task. BERT uses only the encoder, since its goal is to produce contextual representations of the input.
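
The core operation that lets a Transformer encoder relate every word to every other word is scaled dot-product self-attention. The sketch below is a deliberately simplified version: it uses the input embeddings directly as queries, keys, and values, whereas a real encoder applies learned projection matrices, several attention heads, and feed-forward layers. The function names and toy embeddings are assumptions for illustration.

```python
from math import exp, sqrt

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    """Scaled dot-product attention with queries = keys = values = embeddings."""
    dim = len(embeddings[0])
    contextual = []
    for query in embeddings:
        scores = [sum(q * k for q, k in zip(query, key)) / sqrt(dim)
                  for key in embeddings]
        weights = softmax(scores)
        # Each output vector is a weighted mix of all token vectors
        contextual.append([sum(w * value[d] for w, value in zip(weights, embeddings))
                           for d in range(dim)])
    return contextual

# Hypothetical embeddings for a three-token sentence
token_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
for row in self_attention(token_embeddings):
    print([round(x, 3) for x in row])
```

Each output row depends on the whole sentence, which is what makes the resulting representations contextual.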

In [2]:
from sentence_transformers import SentenceTransformer

In [1]:
from dividing_into_sentences import read_text_file, divide_into_sentences_nltk

In [3]:
text = read_text_file("sherlock_holmes.txt")

In [4]:
sentences = divide_into_sentences_nltk(text)

In [5]:
model = SentenceTransformer('bert-base-nli-mean-tokens')



In [6]:
sentence_embeddings = model.encode(["the beautiful lake"])
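
The model name `bert-base-nli-mean-tokens` suggests that the sentence vector is obtained by mean pooling, that is, averaging the token embeddings produced by BERT while ignoring padding positions. A minimal sketch of that pooling step, with a hypothetical `mean_pool` helper and made-up token vectors, could look like this:

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Hypothetical token embeddings: two real tokens and one padding position
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

The result has the same dimension as a single token vector regardless of sentence length, which is what allows sentences of different lengths to be compared directly.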

In [7]:
print("Sentence embeddings:")
print(sentence_embeddings)

Sentence embeddings:
[[-7.61979818e-02 -5.74670196e-01  1.08264279e+00  7.36554265e-01
   5.51345468e-01 -9.39117730e-01 -2.80430317e-01 -5.41625619e-01
   7.50949025e-01 -4.40971524e-01  5.31526685e-01 -5.41883469e-01
   1.92792729e-01  3.44117790e-01  1.50266433e+00 -6.26989603e-01
  -2.42828488e-01 -3.66734445e-01  5.57459652e-01 -2.21802518e-01
  -9.69591320e-01 -4.38950866e-01 -7.93552220e-01 -5.84923148e-01
  -1.55690819e-01  2.12004021e-01  4.02013242e-01 -2.63063669e-01
   6.21910155e-01  5.97237468e-01  9.78126079e-02  7.20052540e-01
  -4.66322601e-01  3.86450440e-01 -8.24903786e-01  1.09985721e+00
  -3.59134972e-01 -4.31918532e-01  2.56567001e-02  5.73159456e-01
   2.40237385e-01 -7.67571270e-01  9.38899636e-01 -3.60024512e-01
  -8.77114773e-01 -2.47680500e-01 -8.65839303e-01  1.04203498e+00
   3.65989506e-01 -6.47719055e-02 -7.04246700e-01  5.91062289e-03
  -8.04807484e-01  2.21369982e-01 -1.79775178e-01  8.04759383e-01
  -4.44357067e-01 -4.46379125e-01  7.55990148e-02 -2.17