# NaiveVectorizer: Test

In this notebook, we test our own implementation of a vectorizer. The main goal is to check that we understand how the vectorization works.

`NaiveVectorizer` should behave exactly like scikit-learn's `TfidfVectorizer` with its default options, except for
`analyzer='char'`, `sublinear_tf=True` and `use_idf=False` (`norm='l2'` is already the default).
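
To make these settings concrete, here is a minimal pure-Python sketch of the weighting they imply: count character n-grams, replace each count `tf` with `1 + ln(tf)`, then scale the vector to unit L2 length. The function name `char_ngram_weights` and its `vocabulary` argument are ours, for illustration only; this is neither the `NaiveVectorizer` code nor scikit-learn's, and the preprocessing (lowercasing, collapsing runs of whitespace) is our understanding of the `analyzer='char'` defaults.

```python
import math
import re
from collections import Counter


def char_ngram_weights(text, n=3, vocabulary=None):
    """Sublinear-tf, L2-normalised weights for the character n-grams of `text`."""
    # Assumed preprocessing: lowercase, then collapse runs of whitespace.
    text = re.sub(r"\s\s+", " ", text.lower())
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    if vocabulary is not None:
        # Keep only n-grams retained at fit time (e.g. by max_features).
        counts = {g: c for g, c in counts.items() if g in vocabulary}
    # sublinear_tf=True: each raw count tf becomes 1 + ln(tf).
    weights = {g: 1.0 + math.log(c) for g, c in counts.items()}
    # norm='l2': divide by the Euclidean length of the weight vector.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {g: w / norm for g, w in weights.items()} if norm else {}


weights = char_ngram_weights("banana")
for ngram in sorted(weights):
    print("%s: %.3f" % (ngram, weights[ngram]))
# ana: 0.767
# ban: 0.453
# nan: 0.453
```

With `use_idf=False` there is no document-frequency term, so the weights of a document depend only on its own n-gram counts and on which n-grams are in the vocabulary.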


One difference, though: our implementation adds an option called `ignore_non_words` (default: `True`), which automatically discards n-grams that don't contain _at least one letter_.
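
As an illustration of the "at least one letter" rule, the sketch below filters a list of candidate trigrams. The helpers `contains_letter` and `drop_non_words` are hypothetical names, and we assume that "letter" means any character for which `str.isalpha()` is true; the actual `NaiveVectorizer` may define it differently (for example, ASCII letters only).

```python
def contains_letter(ngram):
    """Return True if the n-gram has at least one alphabetic character."""
    return any(ch.isalpha() for ch in ngram)


def drop_non_words(ngrams):
    """Keep only the n-grams that contain at least one letter."""
    return [g for g in ngrams if contains_letter(g)]


candidates = ["the", " 12", "...", "a. ", "42!", " é "]
print(drop_non_words(candidates))
# ['the', 'a. ', ' é ']
```

N-grams made only of digits, punctuation or spaces are dropped, which is why this option has to be switched off when comparing against `TfidfVectorizer`.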

## Loading data

In [1]:
%run notebook_utils.py

In [2]:
X_train, X_test, y_train, y_test = load_split_data()

## Importing the langid package

In [3]:
cd ..

/Users/Lin/git/SwigSpot/language-detection


In [4]:
from langid import NaiveVectorizer

In [5]:
cd -

/Users/Lin/git/SwigSpot/language-detection/notebooks


## Testing the NaiveVectorizer

To compare the results of the two implementations, it is important to use the same `ngram_range` and `max_features` parameters and to turn off the `ignore_non_words` functionality of the `NaiveVectorizer`.

### Creation and fitting

Surprisingly, our implementation is slightly faster at fitting time (about 1.6 s versus 2.0 s of wall time in this single run).

In [6]:
options = dict(ngram_range=(3,3), max_features=1000)

In [7]:
%%time
nv = NaiveVectorizer(ignore_non_words=False, **options)
nv.fit(X_train, y_train)

CPU times: user 1.44 s, sys: 178 ms, total: 1.62 s
Wall time: 1.62 s


In [8]:
%%time
from sklearn.feature_extraction.text import TfidfVectorizer
tiv = TfidfVectorizer(analyzer='char', use_idf=False, sublinear_tf=True, **options)
tiv.fit(X_train, y_train)

CPU times: user 1.94 s, sys: 66.2 ms, total: 2.01 s
Wall time: 2.01 s
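
A single `%%time` measurement is noisy, so the difference above should be read with caution. One way to get a steadier comparison is to repeat each fit several times and keep the best time. The helper below is a sketch using only the standard library; `best_fit_time` is a hypothetical name, and it assumes that the object returned by `make_vectorizer()` exposes a `fit(X, y)` method, as both vectorizers do here.

```python
import timeit


def best_fit_time(make_vectorizer, X, y=None, repeats=5):
    """Best wall-clock time (in seconds) over `repeats` fits of a fresh vectorizer."""
    timings = timeit.repeat(
        lambda: make_vectorizer().fit(X, y),  # new instance on every run
        number=1,
        repeat=repeats,
    )
    return min(timings)


# Possible usage in this notebook (not executed here):
# nv_time = best_fit_time(lambda: NaiveVectorizer(ignore_non_words=False, **options), X_train, y_train)
# tiv_time = best_fit_time(lambda: TfidfVectorizer(analyzer='char', use_idf=False, sublinear_tf=True, **options), X_train, y_train)
```

Taking the minimum rather than the mean reduces the influence of unrelated background activity on the machine.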


### Comparing the two

In [9]:
print("Comparing features: ")
nv_feat = nv._feature_names
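# Note: on scikit-learn >= 1.0, use tiv.get_feature_names_out(); get_feature_names() was removed in 1.2
tiv_feat = tiv.get_feature_names()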

print(" Are the same ?", sorted(nv_feat) == sorted(tiv_feat))
print(" Difference:", set(nv_feat).difference(set(tiv_feat)))

Comparing features: 
 Are the same ? True
 Difference: set()


In [10]:
print("Comparing results: \n")
test_sentence = "The heart wants what the hear wants boy."

nv_result = nv.transform([test_sentence])[0]
tiv_result = tiv.transform([test_sentence])[0]

nv_fw  = [(nv_feat[i], nv_result[0, i]) for i in nv_result.nonzero()[1]]
tiv_fw = [(tiv_feat[i], tiv_result[0, i]) for i in tiv_result.nonzero()[1]]

nv_fw.sort(key=lambda t: t[0])
tiv_fw.sort(key=lambda t: t[0])

print("ngram   NV    TFIDF")
print("=========================")
for n, t in zip(nv_fw, tiv_fw):
    assert n[0] == t[0]
    print("%s:  %.3f   %.3f   (%s)" % (n[0], n[1], t[1], 'ok' if n[1] == t[1] else 'not ok'))

Comparing results: 

ngram   NV    TFIDF
=========================
 bo:  0.161   0.161   (ok)
 he:  0.272   0.272   (ok)
 th:  0.161   0.161   (ok)
 wa:  0.272   0.272   (ok)
 wh:  0.161   0.161   (ok)
ant:  0.272   0.272   (ok)
ar :  0.161   0.161   (ok)
art:  0.161   0.161   (ok)
at :  0.161   0.161   (ok)
e h:  0.272   0.272   (ok)
ear:  0.272   0.272   (ok)
hat:  0.161   0.161   (ok)
he :  0.272   0.272   (ok)
nts:  0.272   0.272   (ok)
r w:  0.161   0.161   (ok)
rt :  0.161   0.161   (ok)
s b:  0.161   0.161   (ok)
s w:  0.161   0.161   (ok)
t t:  0.161   0.161   (ok)
t w:  0.161   0.161   (ok)
the:  0.272   0.272   (ok)
ts :  0.272   0.272   (ok)
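
The two distinct values in this table can be checked by hand. Of the 22 trigrams listed, 13 occur once in the test sentence and 9 occur twice (for instance `the` and `ear`, the latter from "heart" and "hear"). Assuming these 22 are the only trigrams of the sentence that belong to the 1000-feature vocabulary, sublinear tf followed by L2 normalisation gives:

$$
\begin{aligned}
\lVert v \rVert_2 &= \sqrt{13 \cdot 1^2 + 9 \cdot (1 + \ln 2)^2} \approx \sqrt{13 + 9 \cdot 2.867} \approx 6.229 \\
w(tf = 1) &= \frac{1 + \ln 1}{\lVert v \rVert_2} \approx \frac{1}{6.229} \approx 0.161 \\
w(tf = 2) &= \frac{1 + \ln 2}{\lVert v \rVert_2} \approx \frac{1.693}{6.229} \approx 0.272
\end{aligned}
$$

Both values agree with the `NV` and `TFIDF` columns.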
