# Finding Similar Items

## 3.1 Applications of Near-Neighbor Search

### 3.1.1 Jaccard Similarity of Sets

$$\text{Jaccard similarity}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

In [4]:
def jaccard_sim(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    return len(a.intersection(b)) / len(a.union(b))

In [7]:
a = set([1, 2, 3, 4, 5])
b = set([3, 4, 7, 8])

a_intersection_b = a.intersection(b)
a_union_b = a.union(b)

print("Intersection: {0}\nUnion: {1}\nJaccard Similarity = {2}".format(a_intersection_b, 
                                                                       a_union_b,
                                                                       jaccard_sim(a, b)))

Intersection: {3, 4}
Union: {1, 2, 3, 4, 5, 7, 8}
Jaccard Similarity = 0.2857142857142857


### 3.1.2 Similarity of Documents

### 3.1.3 Collaborative Filtering as a Similar-Sets Problem




**See the introduction to recommender systems**

In [42]:
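def jaccard_bag_sim(a, b):
    """Jaccard similarity of two bags (multisets) given as Counters."""
    # Bag intersection keeps the minimum count of each element
    intersection_sum = sum((a & b).values())
    # Bag union is taken as the sum of the counts in both bags
    union_sum = sum(a.values()) + sum(b.values())

    return intersection_sum / union_sum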

In [43]:
from collections import Counter

a = Counter('aaab')
b = Counter('aabbc')

In [44]:
jaccard_bag_sim(a, b)

0.3333333333333333
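
The value above can be checked by hand. Writing $a_x$ and $b_x$ for the count of element $x$ in each bag (notation introduced here only for illustration), the function computes:

$$\frac{\sum_x \min(a_x, b_x)}{\sum_x a_x + \sum_x b_x} = \frac{\min(3, 2) + \min(1, 2) + \min(0, 1)}{4 + 5} = \frac{2 + 1 + 0}{9} = \frac{1}{3}$$

Because the union is taken as the sum of the counts, two identical bags score 1/2 rather than 1 under this definition.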

## 3.2 Shingling of Documents

### 3.2.1 k-Shingles
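
As a minimal sketch of the definition, a document's k-shingles can be taken as the set of all length-k substrings it contains. The helper name `k_shingles` is an assumption made for illustration and is not defined elsewhere in this notebook; the cells that follow shingle by word bigrams rather than by characters.

```python
def k_shingles(text, k):
    """Return the set of all length-k substrings (character k-shingles) of text."""
    # A text shorter than k has no k-shingles: range() is empty and the result is an empty set
    return {text[i:i + k] for i in range(len(text) - k + 1)}


shingles = k_shingles("abcdabd", 2)
print(sorted(shingles))  # ['ab', 'bc', 'bd', 'cd', 'da']
```

The shingle `ab` occurs twice in the string but appears once in the result, since the shingles form a set.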

In [159]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import codecs

In [172]:
with codecs.open('dom_casmurro.txt', 'r', 'utf-8') as f:
    
    dom_casmurro = f.read()
    
with codecs.open('perto_do_coracao_selvagem.txt', 'r', 'utf-8') as f:
    
    perto_coracao = f.read()

In [191]:
cv = CountVectorizer(analyzer='word',       # word n-grams
                     ngram_range=(2, 2),    # bigrams only
                     token_pattern=r'\w+')  # a 'word' is one or more word characters

cv.fit([dom_casmurro, perto_coracao])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(2, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='\\w+', tokenizer=None,
        vocabulary=None)

In [192]:
cv_bigrams = cv.transform([dom_casmurro])

In [193]:
# On scikit-learn >= 1.0 use cv.get_feature_names_out(); get_feature_names() was removed in 1.2
pd.DataFrame(cv_bigrams.todense(), columns=cv.get_feature_names()).T.sample(10)

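| Bigram | `0` (count in `dom_casmurro`) |
|---|---|
| dita a | 1 |
| gostava de | 8 |
| também em | 0 |
| palavras era | 2 |
| disso ela | 0 |
| mais composto | 1 |
| compassos assimétricos | 0 |
| repisar o | 1 |
| sua palavra | 0 |
| esperar à | 1 |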


### 3.2.3 Hashing Shingles

In [176]:
from sklearn.feature_extraction.text import HashingVectorizer

In [189]:
hv = HashingVectorizer(analyzer='word',       # word n-grams
                       ngram_range=(2, 2),    # bigrams only
                       token_pattern=r'\w+')  # a 'word' is one or more word characters

hv.fit([dom_casmurro, perto_coracao])

HashingVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
         lowercase=True, n_features=1048576, ngram_range=(2, 2),
         non_negative=False, norm='l2', preprocessor=None, stop_words=None,
         strip_accents=None, token_pattern='\\w+', tokenizer=None)

In [190]:
hv_bigrams = hv.transform([dom_casmurro])

In [198]:
import sys

# Note: sys.getsizeof on a dict counts only the dict object itself, not its keys and values
print("Perto do Coração Selvagem size: {0} megabytes".format(sys.getsizeof(perto_coracao) / 10**6))
print("Dom Casmurro size: {0} megabytes".format(sys.getsizeof(dom_casmurro) / 10**6))
print("CountVectorizer vocabulary size: {0} megabytes".format(sys.getsizeof(cv.vocabulary_) / 10**6))

Perto do Coração Selvagem size: 0.591586 megabytes
Dom Casmurro size: 0.75784 megabytes
CountVectorizer vocabulary size: 2.621544 megabytes
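
The idea behind hashing shingles can also be sketched without scikit-learn: each shingle is mapped to a bucket number, and the document is then represented by the set of bucket numbers instead of the strings, so no vocabulary has to be stored. The sketch below assumes `zlib.crc32` as the hash and 2**20 buckets (the same count as the `n_features=1048576` shown above). Both choices, and the helper name `hash_shingles`, are illustrative assumptions and not what `HashingVectorizer` does internally.

```python
import zlib


def hash_shingles(shingles, n_buckets=2**20):
    """Map each shingle (a string) to a bucket number in [0, n_buckets)."""
    # crc32 returns an unsigned 32-bit integer in Python 3; the modulo folds it into the bucket range
    return {zlib.crc32(s.encode("utf-8")) % n_buckets for s in shingles}


buckets = hash_shingles({"gostava de", "sua palavra", "repisar o"})
print(len(buckets), all(0 <= b < 2**20 for b in buckets))
```

Distinct shingles can collide in the same bucket, so the result may hold fewer entries than the input set.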


## 3.3 Similarity-Preserving Summaries of Sets

### 3.3.1 Matrix Representation of Sets

In [199]:
from sklearn.feature_extraction import DictVectorizer

In [203]:
dv = DictVectorizer(sparse=False)

D = [{'a':1, 'd':1}, {'c': 1}, {'b': 1, 'd': 1, 'e': 1}, {'a': 1, 'c': 1, 'd': 1}]

X = dv.fit_transform(D)
X  # unlike the book's notation, the sets are represented by rows, not columns

array([[ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  1.,  0.,  1.,  1.],
       [ 1.,  0.,  1.,  1.,  0.]])

In [204]:
dv.inverse_transform(X)

[{'a': 1.0, 'd': 1.0},
 {'c': 1.0},
 {'b': 1.0, 'd': 1.0, 'e': 1.0},
 {'a': 1.0, 'c': 1.0, 'd': 1.0}]
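
With the sets stored as indicator rows, Jaccard similarity can be read directly off the matrix. The following is a minimal sketch that rebuilds the matrix shown above as a boolean NumPy array; the helper name `jaccard_rows` is an assumption made for illustration.

```python
import numpy as np

# Rows are the four sets and columns are the elements a..e, as in X above
X = np.array([[1, 0, 0, 1, 0],
              [0, 0, 1, 0, 0],
              [0, 1, 0, 1, 1],
              [1, 0, 1, 1, 0]], dtype=bool)


def jaccard_rows(x, y):
    """Jaccard similarity of two sets given as boolean indicator rows."""
    intersection = np.logical_and(x, y).sum()  # positions where both rows are 1
    union = np.logical_or(x, y).sum()          # positions where at least one row is 1
    return intersection / union


print(jaccard_rows(X[0], X[3]))  # {a, d} vs {a, c, d}: 2/3
print(jaccard_rows(X[0], X[2]))  # {a, d} vs {b, d, e}: 1/4
```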

### 3.3.2 Minhashing

See: http://mccormickml.com/2015/06/12/minhash-tutorial-with-python-code/
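
A minimal sketch of the idea, under the assumption that a permutation is given as an ordered list of row labels: the minhash of a set is the first label in that order that belongs to the set. The sets are the four from section 3.3.1; the function name `minhash` and the particular order are illustrative choices.

```python
def minhash(s, permuted_rows):
    """Return the first element of permuted_rows that belongs to the set s."""
    for row in permuted_rows:
        if row in s:
            return row
    return None  # s shares no element with the permuted universe


sets = {"S1": {"a", "d"}, "S2": {"c"}, "S3": {"b", "d", "e"}, "S4": {"a", "c", "d"}}
order = ["b", "e", "a", "d", "c"]  # one assumed permutation of the rows a..e

print({name: minhash(s, order) for name, s in sets.items()})
# {'S1': 'a', 'S2': 'c', 'S3': 'b', 'S4': 'a'}
```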

### 3.3.3 Minhashing and Jaccard Similarity
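
The link between the two can be checked by brute force on a small universe: over all permutations of the rows, the fraction on which two sets get the same minhash value equals their Jaccard similarity. The sketch below enumerates every permutation of five rows, which is only feasible for toy inputs; the function names are assumptions made for illustration, and the sets are assumed to be non-empty subsets of the universe.

```python
from fractions import Fraction
from itertools import permutations


def minhash(s, permuted_rows):
    """First row in the permuted order that belongs to s (s must be non-empty)."""
    return next(row for row in permuted_rows if row in s)


def agreement_probability(s1, s2, universe):
    """Exact fraction of permutations of universe on which the minhashes of s1 and s2 agree."""
    perms = list(permutations(universe))
    agree = sum(minhash(s1, p) == minhash(s2, p) for p in perms)
    return Fraction(agree, len(perms))


s1, s4 = {"a", "d"}, {"a", "c", "d"}
jaccard = Fraction(len(s1 & s4), len(s1 | s4))
print(agreement_probability(s1, s4, "abcde"), jaccard)  # 2/3 2/3
```

Rows that belong to neither set never decide the outcome, which is why only the first row of the union matters.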

### 3.3.4 Minhash Signatures
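
A signature collects the minhash values of one set under several permutations, and the fraction of positions on which two signatures agree estimates the Jaccard similarity of the underlying sets. The sketch below assumes 100 random permutations drawn with a fixed seed; the function names, the seed, and the number of permutations are all illustrative choices.

```python
import random


def minhash_signature(s, perms):
    """Signature of the non-empty set s: its minhash value under each permutation in perms."""
    return [next(row for row in p if row in s) for p in perms]


def signature_similarity(sig1, sig2):
    """Fraction of positions at which two equal-length signatures agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


rng = random.Random(0)  # fixed seed so the run is repeatable
universe = list("abcde")
perms = [rng.sample(universe, len(universe)) for _ in range(100)]

sig1 = minhash_signature({"a", "d"}, perms)
sig4 = minhash_signature({"a", "c", "d"}, perms)
print(signature_similarity(sig1, sig4))  # an estimate of the true Jaccard similarity, 2/3
```

More permutations give a tighter estimate at the cost of longer signatures.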

### 3.3.5 Computing Minhash Signatures
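
Explicit permutations are impractical for large matrices, so a common approach is to simulate them with hash functions on the row numbers and keep, for each set, the minimum hash value seen over its rows. The sketch below is one way to write that single pass. The sets are the four from section 3.3.1 with rows 0..4 standing for a..e, and the two hash functions are small examples chosen for illustration; the function name `compute_signatures` is an assumption.

```python
def compute_signatures(sets, n_rows, hash_funcs):
    """One pass over the rows; sig[i][c] is the minimum of hash_funcs[i](r) over rows r in set c."""
    inf = float("inf")
    sig = [[inf] * len(sets) for _ in hash_funcs]
    for r in range(n_rows):
        hashes = [h(r) for h in hash_funcs]  # hash the row number once per function
        for c, s in enumerate(sets):
            if r in s:
                for i, value in enumerate(hashes):
                    sig[i][c] = min(sig[i][c], value)
    return sig


# Rows 0..4 stand for the elements a..e
sets = [{0, 3}, {2}, {1, 3, 4}, {0, 2, 3}]
hash_funcs = [lambda x: (x + 1) % 5, lambda x: (3 * x + 1) % 5]

print(compute_signatures(sets, 5, hash_funcs))  # [[1, 3, 0, 1], [0, 2, 0, 0]]
```

With only two hash functions the estimate is crude: the first and fourth columns agree in both positions, suggesting similarity 1, while the true Jaccard similarity of those sets is 2/3.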

## 3.4 Locality-Sensitive Hashing for Documents
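
A minimal sketch of the banding technique, assuming every signature has exactly `bands * rows` entries: each signature is split into bands, each band is used as a bucket key, and any two documents that share a bucket in at least one band become a candidate pair. The function names and the toy signatures are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, bands, rows):
    """Return pairs of document ids whose signatures are identical in at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            # The band's slice of the signature is used directly as the bucket key
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        for doc_ids in buckets.values():
            candidates.update(combinations(sorted(doc_ids), 2))
    return candidates


def candidate_probability(s, bands, rows):
    """Probability that two documents with Jaccard similarity s become a candidate pair."""
    return 1 - (1 - s ** rows) ** bands


signatures = {
    "d1": [1, 3, 0, 1, 2, 2],
    "d2": [1, 3, 4, 0, 2, 2],
    "d3": [0, 2, 4, 1, 5, 3],
}
print(sorted(lsh_candidate_pairs(signatures, bands=3, rows=2)))  # [('d1', 'd2')]
print(round(candidate_probability(0.8, bands=20, rows=5), 4))
```

Raising `rows` makes each band harder to match and so lowers false positives, while raising `bands` gives similar pairs more chances to collide.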