## Searching the 20newsgroups corpus (TF-IDF)

In [1]:
from sklearn.datasets import fetch_20newsgroups
import scipy.sparse

In [2]:
corpus = fetch_20newsgroups()

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus.data) # Transform corpus into tf-idf matrix.

In [5]:
def top_10(tfidf_matrix, query):
    
    top_10 = 10 * [[-1, -1]] # [document_index, relevance]  
    tfidf_query = tfidf.transform([query]) # Transform query into tf-idf form.
    sparse_query = scipy.sparse.find(tfidf_query) # (rows, columns, values) of the nonzero elements only.
    
    for document_index in range(tfidf_matrix.shape[0]): # For each document in corpus:
        current_relevance = 0
        
        # Compute the inner product of the tf-idf vectors for the document and the query (vectors are already normalized).
        
        for nonzero_element in range(len(sparse_query[0])): # To make it faster we iterate only over nonzero elements
                                                            # in tf-idf vector of the query.
            current_relevance += tfidf_matrix[document_index, sparse_query[1][nonzero_element]]*sparse_query[2][nonzero_element]
        
        # Insert the document in the top 10 if it is relevant enough.
        
        for i in range(10):
            if current_relevance > top_10[i][1]:
                for j in range(9, i, -1):
                    top_10[j] = top_10[j-1]
                top_10[i] = [document_index, current_relevance] 
                break
        
    return top_10
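
The loop above scores each document by the inner product of its tf-idf row with the query vector. Because `TfidfVectorizer` L2-normalizes rows by default, that inner product is the cosine similarity. The same scores can be obtained with a single sparse matrix product, which avoids indexing the matrix element by element. The following is only a sketch of that alternative, not the notebook's original method: the name `top_k_vectorized` and the parameter `k` are hypothetical, and the function assumes the query has already been transformed into a 1 x n_features sparse row.

```python
import numpy as np
import scipy.sparse


def top_k_vectorized(tfidf_matrix, tfidf_query, k=10):
    """Return [document_index, relevance] pairs for the k best-scoring documents.

    Hypothetical helper: tfidf_matrix is an (n_docs x n_features) sparse matrix,
    tfidf_query is a (1 x n_features) sparse row in the same feature space.
    """
    # One sparse product gives the inner product of every document row
    # with the query row; the result is an (n_docs x 1) sparse column.
    scores = (tfidf_matrix @ tfidf_query.T).toarray().ravel()

    # Never ask for more documents than the corpus contains.
    k = min(k, scores.shape[0])

    # Stable sort on the negated scores: descending relevance,
    # ties resolved in favour of the lower document index.
    order = np.argsort(-scores, kind="stable")[:k]
    return [[int(i), float(scores[i])] for i in order]
```

With the objects defined earlier, a call such as `top_k_vectorized(tfidf_matrix, tfidf.transform(["drug diller"]))` should yield the same ranking as `top_10`, up to floating-point rounding in the scores.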

### Example query

In [6]:
top_10_list = top_10(tfidf_matrix, "drug diller")
print("Top ten of the most relevant texts ([text index, text relevance]):\n")
print(top_10_list)
print("\n\nThe most relevant text:\n")
print(corpus.data[top_10_list[0][0]])

Top ten of the most relevant texts ([text index, text relevance]):

[[9181, 0.44704529770661872], [10207, 0.26131776878165341], [253, 0.23207575625032334], [124, 0.22929265123565615], [10297, 0.22401445394500974], [5866, 0.22339331038960109], [3069, 0.22234969611743896], [1220, 0.21559544801720507], [6104, 0.19960331242929308], [1978, 0.19491586921319926]]


The most relevant text:

From: kennejs@a.cs.okstate.edu (KENNEDY JAMES SCOT)
Subject: We're winning the war on drugs.  Not!
Organization: Oklahoma State University, Computer Science, Stillwater
Keywords: drugs DEA WOD legalization
Lines: 140

The DEA and other organizations would have the American people
believe that we are winning the "war on drugs".  I'm going to
dispel the propaganda that the DEA is putting out by showing
you the drug war's *real* status. To help prove my assertions
I've also posted two articles from USA Today that clearly
demonstrate that drug use among certain age groups *is* on the
rise.  If WOD is working, a