# Natural Language Processing

One way to detect plagiarism would be to identify exact matches between texts or, alternatively, to identify *similarities* such as synonyms and similar expressions. The latter approach is more complex.
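As a baseline for the first approach, here is a minimal sketch of exact-match detection that counts the three-word sequences two texts have in common. The helper names (`word_ngrams`, `shared_ngrams`) and the example sentences are hypothetical and chosen only for illustration. It uses just the standard library.

```python
def word_ngrams(text, n=3):
    """Return the set of n-word sequences in `text` (lowercased, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngrams(text_a, text_b, n=3):
    """Return the n-word sequences that appear verbatim in both texts."""
    return word_ngrams(text_a, n) & word_ngrams(text_b, n)


# Hypothetical example sentences, not taken from any corpus.
original = "trouble with having an open mind is that people come along and put things in it"
copied = "the trouble with having an open mind is obvious"
reworded = "the problem with possessing a receptive brain is that folks approach and insert stuff"

print(len(shared_ngrams(original, copied)))    # verbatim copy: several shared sequences
print(len(shared_ngrams(original, reworded)))  # synonym rewrite: nothing shared
```

Exact matching flags the verbatim copy but misses the reworded sentence entirely, which is why the rest of this notebook turns to measuring similarity between words.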

In [5]:
import numpy as np

def euclidean(vec1, vec2):
    """Return the Euclidean (straight-line) distance between two vectors."""
    difference = np.array(vec1) - np.array(vec2)
    squared_sum = np.sum(difference ** 2)
    return np.sqrt(squared_sum)

## word2vec

We'll use data that comes from a large collection of natural language text, also called a corpus. A corpus may be a collection of books, newspaper articles, research papers, theatrical plays, or blog posts, or a mix of these. The important point is that it consists of natural language: phrases and sentences that were put together by humans and reflect the way humans speak and write. Once we have our natural language corpus, we can look at how to use it to quantify the meanings of words.

In [1]:
import gensim.downloader as api

In [2]:
vectors = api.load('word2vec-google-news-300')

In [4]:
vectors['sword']

array([ 0.51953125,  0.1875    ,  0.31445312, -0.20605469, -0.0078125 ,
        0.375     ,  0.22558594, -0.02441406, -0.06445312,  0.27929688,
        0.02746582, -0.24511719, -0.21582031,  0.13574219, -0.27148438,
       -0.09130859, -0.06884766, -0.08349609,  0.14160156, -0.14160156,
        0.24316406, -0.23730469,  0.32421875, -0.00582886, -0.12792969,
        0.0201416 ,  0.07617188, -0.10742188,  0.16894531, -0.12988281,
        0.07958984,  0.2265625 ,  0.11035156,  0.12792969,  0.02856445,
        0.01965332, -0.06933594,  0.21875   , -0.06738281, -0.04370117,
        0.23046875,  0.07714844,  0.49804688, -0.14550781,  0.23632812,
       -0.10009766,  0.02893066, -0.16699219,  0.09814453, -0.24804688,
       -0.09082031,  0.3515625 , -0.00439453, -0.29296875,  0.00793457,
       -0.140625  , -0.10888672,  0.00212097, -0.13476562, -0.02575684,
       -0.02148438,  0.10888672,  0.07324219,  0.15332031, -0.06835938,
       -0.01831055,  0.08544922, -0.39257812,  0.03979492,  0.12

In [6]:
print(euclidean(vectors['sword'],vectors['knife'])) 
print(euclidean(vectors['sword'],vectors['herring'])) 
print(euclidean(vectors['car'],vectors['van']))

3.2766972
4.9384727
2.608656


In [7]:
def dot_product(vector1, vector2):
    """Sum of the element-wise products of two equal-length vectors."""
    return np.sum([vector1[k] * vector2[k] for k in range(len(vector1))])

def vector_norm(vector):
    """Length (L2 norm) of a vector."""
    return np.sqrt(dot_product(vector, vector))

def cosine_similarity(vector1, vector2):
    """Cosine of the angle between two vectors, rounded to 4 decimal places."""
    thecosine = dot_product(vector1, vector2) / (vector_norm(vector1) * vector_norm(vector2))
    return np.round(thecosine, 4)

In [8]:
print(cosine_similarity(vectors['sword'],vectors['knife'])) 
print(cosine_similarity(vectors['sword'],vectors['herring'])) 
print(cosine_similarity(vectors['car'],vectors['van']))

0.5576
0.0529
0.6116


## Manipulating vectors mathematically

In [9]:
king = vectors['king'] 
queen = vectors['queen'] 
man = vectors['man'] 
woman = vectors['woman']

Let's write a symbolic expression relating the words *king, queen, man, and woman*:
> king - man + woman = queen

In ordinary language this is impossible: words cannot be added or subtracted. In our vector model, however, we can compute the left-hand side and expect to obtain a vector similar or equal to the queen vector.

In [10]:
newvector = king-man+woman

In [11]:
print(cosine_similarity(newvector,queen)) 
print(euclidean(newvector,queen))

0.7301
2.298658
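To build some intuition for why this arithmetic can land near *queen*, here is a toy sketch with hand-made 3-dimensional vectors. The values in `toy_vectors` are invented for illustration (roughly "royalty", "male", "female") and are not taken from the word2vec model. The `cosine` helper is a standard-library stand-in for the `cosine_similarity` function defined earlier.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings: [royalty, male, female]. Invented values.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

# king - man + woman, computed component by component.
target = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

# Rank every toy word by how closely it points in the same direction as the target.
ranked = sorted(toy_vectors, key=lambda word: cosine(target, toy_vectors[word]), reverse=True)
print(ranked)
```

In this toy setup, subtracting *man* removes the "male" component and adding *woman* supplies the "female" one, so *queen* ranks first. Real embeddings have 300 dimensions with no such clean interpretation, which is one plausible reason the match in the cell above is only approximate. For a search over the whole vocabulary, gensim's `most_similar(positive=[...], negative=[...])` method on the loaded vectors performs a related lookup.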


In [12]:
print(cosine_similarity(vectors['fish'],vectors['herring'])) 
print(euclidean(vectors['fish'],vectors['herring']))

0.6992
2.7537737


## Detecting plagiarism

In [13]:
print(cosine_similarity(vectors['the'],vectors['the'])) 
print(euclidean(vectors['having'],vectors['having']))

1.0
0.0


In [15]:
print(cosine_similarity(vectors['trouble'],vectors['problem'])) 
print(euclidean(vectors['come'],vectors['approach'])) 
print(cosine_similarity(vectors['put'],vectors['insert']))

0.5327
2.9844923
0.3435
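The word-level comparisons above could be combined into a passage-level score. The following is a deliberately naive sketch, not an established method: it assumes the two passages have the same number of words in the same order, and it averages the cosine similarity of each aligned pair. The function name `aligned_similarity`, the `lookup` parameter, and the 2-dimensional vectors in `toy_lookup` are all hypothetical. In this notebook, `lookup` could presumably be the loaded `vectors` object, since it is indexed by word in the cells above.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def aligned_similarity(words_a, words_b, lookup):
    """Average cosine similarity of words at the same position in two word lists.

    `lookup` is any mapping from a word to its vector. Identical words score 1.0
    without a lookup. Assumes both lists have the same length.
    """
    if len(words_a) != len(words_b):
        raise ValueError("word lists must have the same length")
    scores = [
        1.0 if a == b else cosine(lookup[a], lookup[b])
        for a, b in zip(words_a, words_b)
    ]
    return sum(scores) / len(scores)


# Hypothetical 2-dimensional embeddings, invented for illustration only.
toy_lookup = {
    "trouble": [1.0, 0.2],
    "problem": [0.9, 0.3],
    "come": [0.1, 1.0],
    "approach": [0.2, 0.9],
    "herring": [-0.5, 0.1],
    "sword": [0.9, -0.4],
}

source = ["trouble", "come"]
paraphrase = ["problem", "approach"]
unrelated = ["herring", "sword"]

print(aligned_similarity(source, paraphrase, toy_lookup))
print(aligned_similarity(source, unrelated, toy_lookup))
```

A high average for the paraphrase and a low one for the unrelated words is the signal a similarity-based detector would look for. Choosing a threshold, and handling insertions, deletions, and reordering, are left open here.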
