In [1]:
from sklearn.datasets import fetch_20newsgroups
from pprint import pprint
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import nltk
from collections import defaultdict
import string
from nltk.stem.porter import PorterStemmer

## 20newsgroups dataset

In [2]:
newsgroups = fetch_20newsgroups(categories=['comp.graphics','comp.os.ms-windows.misc','rec.autos','rec.sport.hockey'],
                                subset='all', shuffle=True, random_state=1)

List of newsgroup categories

In [3]:
pprint(list(newsgroups.target_names))

['comp.graphics', 'comp.os.ms-windows.misc', 'rec.autos', 'rec.sport.hockey']


Corpus size

In [4]:
print("%d documents" % len(newsgroups.data))
print("%d categories" % len(newsgroups.target_names))

3947 documents
4 categories


## Tokenizer

In [5]:
#nltk.download()

In [6]:
# taken from Stack Overflow
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    tokens = [i for i in tokens if i not in string.punctuation]
    stems = stem_tokens(tokens, stemmer)
    return stems
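
To see what the stemming step does on its own, here is a minimal sketch that runs the Porter stemmer on two words. It skips `nltk.word_tokenize`, so no NLTK data download is needed; the name `demo_stemmer` is introduced only for this illustration.

```python
from nltk.stem.porter import PorterStemmer

# Hypothetical name, used only in this sketch.
demo_stemmer = PorterStemmer()

# Stemming maps inflected forms onto a shared stem, so the vocabulary
# built by the vectorizer contains stems rather than surface words.
for word in ["computer", "vision"]:
    print(word, "->", demo_stemmer.stem(word))
```

This is why the vocabulary entries printed later in the notebook are stems such as `comput` rather than the original words.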

Compute the TF-IDF matrix for the corpus

In [7]:
vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenize)
tfidf_data = vectorizer.fit_transform(newsgroups.data)
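
For intuition about what the matrix contains, the sketch below recomputes TF-IDF weights in plain Python on a toy corpus. It assumes the default `TfidfVectorizer` settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization of each document). The function `tfidf_vectors` and the toy documents are made up for this illustration and are not part of the notebook.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """TF-IDF with smoothed IDF and L2 normalization (assumed sklearn defaults)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for doc in tokenized_docs:
        # Raw term count times IDF.
        weights = {term: tf * idf[term] for term, tf in Counter(doc).items()}
        # Scale each document to unit Euclidean length (guard for empty docs).
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Toy, already tokenized documents (illustrative only).
toy_docs = [["hockey", "team", "win"], ["window", "driver"], ["hockey", "playoff"]]
toy_vectors = tfidf_vectors(toy_docs)
for vector in toy_vectors:
    print(vector)
```

In the first toy document, `hockey` gets a lower weight than `team` because it also occurs in another document.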

TF-IDF for a query

Example:

In [8]:
query = 'computer vision'
tfidf_query = vectorizer.transform([query])

Print the TF-IDF value of each word in the query

In [9]:
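# get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
feature_names = vectorizer.get_feature_names()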
for word in tfidf_query.nonzero()[1]:
    print(feature_names[word], ' - ', tfidf_query[0, word])

vision  -  0.907611197309
comput  -  0.419811760817
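
The two weights above form a unit-length vector, which is what L2 normalization should produce. A quick check, with the values copied from the output above:

```python
import math

# Weights copied from the printed output for the query 'computer vision'.
weights = {"vision": 0.907611197309, "comput": 0.419811760817}

# Euclidean length of the query vector; it should be very close to 1.
norm = math.sqrt(sum(w * w for w in weights.values()))
print(norm)
```

Because both the document rows and the query are normalized this way, their dot product is already the cosine similarity.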


## Query Search

The function takes a query, ranks the documents by cosine similarity to it, and prints the top_count best matches

In [10]:
def query_results(query_string, top_count):
    tfidf_query = vectorizer.transform([query_string])
    cosine_similarities = defaultdict(float)  # similarity score for every document
    count = 0
    for doc in tfidf_data:  # compute the cosine similarity between each document in the corpus and the query
        # multiply the vectors in matrix form
        # TfidfVectorizer L2-normalizes each row, so there is no need to divide by the vector lengths
        cosine_similarity = doc*(tfidf_query[0].transpose())
        if not cosine_similarity:
            cosine_similarity = 0.0
        else:
            # the matrix product is a [1,1] matrix; take its single element
            cosine_similarity = cosine_similarity[0,0]
        # store the score in the dictionary
        cosine_similarities[newsgroups.data[count]] = cosine_similarity
        count += 1
    # sort the dictionary by value and print the requested number of documents
    for key, value in sorted(cosine_similarities.items(), reverse=True, key=lambda x:x[1])[:top_count]:
        print('Similarity value = ', value, '\n\n', key )
        print('----------------------------------------------------------------------')
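
The row-by-row loop can also be written as a single sparse matrix product. The sketch below is one possible alternative, not the notebook's method: `top_matches` and the toy corpus are hypothetical names and data introduced here. It returns row indices rather than document texts, so two identical posts would not collapse into one dictionary key.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_matches(tfidf_matrix, query_vector, top_count):
    """Return (row_index, similarity) pairs for the most similar rows."""
    # One sparse product yields every document's score at once; rows are
    # L2-normalized, so the dot product equals the cosine similarity.
    scores = (tfidf_matrix @ query_vector.T).toarray().ravel()
    # Stable sort on negated scores: descending order, ties keep corpus order.
    order = np.argsort(-scores, kind="stable")[:top_count]
    return [(int(i), float(scores[i])) for i in order]

# Toy corpus for illustration only (not taken from 20newsgroups).
docs = [
    "the hockey team won the championship",
    "new graphics card drivers for windows",
    "hockey playoffs start tonight",
]
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(docs)
toy_query = toy_vectorizer.transform(["hockey championship"])

results = top_matches(toy_matrix, toy_query, 2)
for index, score in results:
    print(index, round(score, 3), docs[index])
```

With the notebook's objects, the equivalent call would be `top_matches(tfidf_data, vectorizer.transform([query_string]), top_count)`, followed by a lookup of `newsgroups.data[index]` for printing.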

## Query examples

In [11]:
query_results("hockey champion", 3)

Similarity value =  0.345276332877 

 Organization: University of Maine System
From: The Always Fanatical: Patrick Ellis <IO11330@MAINE.MAINE.EDU>
Subject: Re: 1993 NHL Draft
 <1993Apr20.184627.4585@newshub.ariel.yorku.ca>
 <1993Apr21.064605.24531@CSD-NewsHost.Stanford.EDU>
Lines: 19

>San Jose will then get Kariya

    ya know that kind of funny cause I've seen Kariya on Campus
with a Sharks hat on.......



             Pat Ellis


P.S.  GO BRUINS    GO UMAINE BLACK BEARS    42-1-2       NUMBER 1......

                   HOCKEY EAST REGULARS SEASON CHAMPIONS.....
                   HOCKEY EAST TOURNAMENT CHAMPIONS>......
                   PAUL KARIYA, HOBEY BAKER AWARD WINNER.......
         NCAA DIV. 1 HOCKEY TOURNAMENT CHAMPIONS!!!!!!!!!!!!!!!!!!!


                    M-A-I-N-E      GGGGOOOOOOO    BBBLLLUUEEEE!

----------------------------------------------------------------------
Similarity value =  0.328511619126 

 Organization: University of Maine System
From: The Always Fa

In [12]:
query_results("auto speed", 5)

Similarity value =  0.272982173716 

 From: aas7@po.CWRU.Edu (Andrew A. Spencer)
Subject: Re: SHO and SC
Organization: Case Western Reserve University, Cleveland, OH (USA)
Lines: 53
Reply-To: aas7@po.CWRU.Edu (Andrew A. Spencer)
NNTP-Posting-Host: slc5.ins.cwru.edu


In a previous article, a207706@moe.dseg.ti.com (Robert Loper) says:

>In article <C5L8rE.28@constellation.ecn.uoknor.edu> callison@uokmax.ecn.uoknor.edu (James P. Callison) writes:
>>In article <1993Apr15.232412.2261@ganglion.ann-arbor.mi.us> david@ganglion.ann-arbor.mi.us (David Hwang) writes:
>>>In article <5214@unisql.UUCP> wrat@unisql.UUCP (wharfie) writes:
>>>>In article <chrissC587qB.D1B@netcom.com> chriss@netcom.com (Chris Silvester) writes:
>>>>
>>
>>Why anyone would order an SHO with an automatic transmission is
>>beyond me; if you can't handle a stick, you should stick with a
>>regular Taurus and leave the SHO to real drivers. That is not to
>>say that there aren't real drivers who can't use the stick (eg
>>disab

In [13]:
query_results("windows", 1)

Similarity value =  0.455260792194 

 From: tomh@metrics.com (Tom Haapanen)
Subject: RFD: comp.os.ms-windows.nt.{misc,setup}
Organization: Software Metrics Inc.
Lines: 76
NNTP-Posting-Host: rodan.uu.net

This is the official Request for Discussion (RFD) for the creation of two
new newsgroups for Microsoft Windows NT.  This is a second RFD, replacing
the one originally posted in January '93 (and never taken to a vote).  The
proposed groups are described below:

NAME: 	 comp.os.ms-windows.nt.setup
STATUS:  Unmoderated.
PURPOSE: Discussions about setting up and installing Windows NT, and about
	 system and peripheral compatability issues for Windows NT.

NAME:	 comp.os.ms-windows.nt.misc
STATUS:	 Unmoderated.
PURPOSE: Miscellaneous non-programming discussions about using Windows NT,
	 including issues such as security, networking features, console
	 mode and Windows 3.1 (Win16) compatability.

RATIONALE:
	Microsoft NT is the newest member of the Microsoft Windows family
	of operating syst