# NLP with Python: Nearest Neighbors Search

Source: https://sanjayasubedi.com.np/nlp/nlp-with-python-nearest-neighbor-search/

In [1]:
import numpy as np
import pandas as pd

## Introduction

Nearest neighbor (NN) search is used to find objects that are similar to each other. Given an input, NN search returns the objects in our database that are closest to that input under some distance measure. As a simple example, if you had a database of news articles and wanted to retrieve news similar to your query, you would perform a nearest neighbors search for your input query against all articles in the database and return the top 10 results.
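To make the idea concrete, here is a minimal brute-force sketch of a nearest neighbor search using cosine distance on a tiny hand-made matrix. The function name `cosine_nearest` and the toy vectors are made up for illustration; the rest of this notebook uses scikit-learn instead of this hand-rolled version.

```python
import numpy as np

def cosine_nearest(query, database, k=2):
    """Return indices and cosine distances of the k rows of `database` closest to `query`."""
    # Scale every vector to unit length so a dot product equals cosine similarity.
    db_unit = database / np.linalg.norm(database, axis=1, keepdims=True)
    q_unit = query / np.linalg.norm(query)
    # Cosine distance is 1 minus cosine similarity: 0 means "same direction".
    distances = 1.0 - db_unit @ q_unit
    # Sort all rows by distance and keep the k smallest.
    order = np.argsort(distances)[:k]
    return order, distances[order]

# Hypothetical 3-dimensional "documents"; each row is one item in the database.
database = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])

indices, distances = cosine_nearest(query, database, k=2)
print(indices, distances)
```

This version compares the query against every row, which is fine for a toy example but grows linearly with the size of the database.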

## Data Preprocessing

In [2]:
from sklearn.datasets import fetch_20newsgroups

bunch = fetch_20newsgroups(remove=['headers', 'footers'])
print(type(bunch))
print(bunch.keys())

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


<class 'sklearn.utils.Bunch'>
dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])


In [3]:
print(bunch.data[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(max_features=10000)
features = vec.fit_transform(bunch.data)
print(features.shape)

(11314, 10000)
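As a small aside, the sketch below shows what `TfidfVectorizer` does on a toy corpus, and how `transform` treats words that were not seen during `fit`. The names `toy_corpus`, `toy_vec`, and the made-up word `zzzunknown` are purely illustrative and are not part of the notebook's pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A hypothetical three-document corpus, only for illustration.
toy_corpus = [
    "the car has a small door",
    "the engine of the car is loud",
    "ftp sites for mac software",
]

toy_vec = TfidfVectorizer()
toy_features = toy_vec.fit_transform(toy_corpus)
# One row per document, one column per vocabulary term.
# With the default tokenizer, single-character tokens such as "a" are dropped.
print(toy_features.shape)

# Words that are not in the fitted vocabulary are silently ignored at transform time.
query_features = toy_vec.transform(["my car zzzunknown"])
print("car" in toy_vec.vocabulary_, "zzzunknown" in toy_vec.vocabulary_)
print(query_features.nnz)  # number of non-zero entries in the query vector
```

The practical consequence is that a query is matched only on the words the vectorizer already knows, so a query made up entirely of unseen words becomes an all-zero vector.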


## Model Training

In [6]:
from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(n_neighbors=10, metric='cosine')
knn.fit(features)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                 radius=1.0)

## Results

In [8]:
knn.kneighbors(features[0], return_distance=False)

array([[   0,  958, 8013, 8266,  659, 5553, 3819, 5282, 2554, 7993]])

In [17]:
knn.kneighbors(features[1], return_distance=True)

(array([[0.        , 0.69439209, 0.74732411, 0.76815797, 0.77770994,
         0.77935143, 0.78204508, 0.78326044, 0.82015529, 0.82390838]]),
 array([[   1, 6399, 9130, 4693, 1270, 5509, 9921, 2116, 5541, 5097]]))
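In both results above, the first neighbor is the query document itself at distance 0, because the queries were taken from the fitted data. The sketch below shows one way to drop that self-match. It uses small random data rather than the newsgroups features, and the names `toy_features`, `toy_knn`, and `query_index` are assumptions made for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical dense features: 20 items with 5 dimensions each.
rng = np.random.default_rng(0)
toy_features = rng.random((20, 5))

# Ask for one extra neighbor, since the query itself will be among the results.
toy_knn = NearestNeighbors(n_neighbors=4, metric='cosine')
toy_knn.fit(toy_features)

query_index = 3
distances, neighbors = toy_knn.kneighbors(toy_features[query_index:query_index + 1])
# kneighbors returns 2-D arrays with one row per query; take the only row.
distances, neighbors = distances[0], neighbors[0]

# Keep every neighbor except the query's own index.
mask = neighbors != query_index
other_neighbors = neighbors[mask]
other_distances = distances[mask]
print(other_neighbors, other_distances)
```

Filtering by index rather than by `distance == 0` avoids relying on exact floating-point equality and keeps genuine duplicates of the query in the results.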

In [10]:
input_texts = ["any recommendations for good ftp sites?", "i need to clean my car"]
input_features = vec.transform(input_texts)

In [11]:
D, N = knn.kneighbors(input_features, n_neighbors=2, return_distance=True)

In [26]:
for input_text, distances, neighbors in zip(input_texts, D, N):
    print('=' * 200)
    print('Input Text:\n' + input_text, '\n')
    for distance, neighbor in zip(distances, neighbors):
        print('-' * 200)
        print(f"Distance: {distance:.3f}, Neighbor Index: {neighbor}", '\n')
        print(bunch.data[neighbor], '\n\n\n')

Input Text:
any recommendations for good ftp sites? 

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Distance: 0.512, Neighbor Index: 7665 

Hi!

I am looking for ftp sites (where there are freewares or sharewares)
for Mac. It will help a lot if there are driver source codes in those 
ftp sites. Any information is appreciated. 

Thanks in advance. 



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Distance: 0.587, Neighbor Index: 89 

I would like to experiment with the INTEL 8051 family.  Does anyone out  
there know of any good FTP sites that might have compiliers, assemblers,  
etc.?
 



Input Text:
i need to clean my car 

--------------------------------------------------------

In [68]:
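def get_response(input_text):
    """Print the two documents in the corpus that are closest to input_text."""
    # vec.transform expects an iterable of documents, so wrap the single query in a list.
    input_list = [input_text]
    input_feature = vec.transform(input_list)
    D, N = knn.kneighbors(input_feature, n_neighbors=2, return_distance=True)
    print('=' * 100)
    print(f"Your Input:\n{input_text}\n\n")
    # For a single query, kneighbors returns arrays of shape (1, 2); flatten them to 1-D.
    D = D.reshape(-1)
    N = N.reshape(-1)
    for distance, neighbor in zip(D, N):
        print('-' * 100, '\n')
        print(f"Distance: {distance}, Neighbor Index: {neighbor}")
        # neighbor is a row index into the fitted features, which lines up with bunch.data.
        print(bunch.data[neighbor], '\n\n')
        print('_' * 100)
    print('=' * 100)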

In [69]:
query = input('What do you want from me? ')
get_response(query)

What do you want from me? I love Sekardayu
Your Input:
I love Sekardayu


---------------------------------------------------------------------------------------------------- 

Distance: 0.4477827755240442, Neighbor Index: 7978
Above all, love each other deeply, because love covers over a multitude of
sins.  


____________________________________________________________________________________________________
---------------------------------------------------------------------------------------------------- 

Distance: 0.4596248110541047, Neighbor Index: 4884
davem@bnr.ca (Dave Mielke) writes,

>  However, God's love is qualified.  The Bible says:
> 
>      The way of the wicked is an abomination unto the LORD:  but he
>      loveth him that followeth after righteousness.   Proverbs 15:9
> 
>      For  the LORD knoweth the way of the righteous: but the way of
>      the ungodly shall perish.                            Psalm 1:6
 
  
I am extremely uncomfortable with this way of phras