
In [0]:
docs = [
        'it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory',
        'it is fast to pickle and un-pickle as it holds no state besides the constructor parameters',
        'it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.'
]
users = ['user1', 'user2', 'user1']

# Hash encoding


When the text corpus is large, representing it with bag-of-words or TF-IDF becomes impractical, because both require building and storing a vocabulary that maps every token to a column.
An alternative is to compute a hash of each token/n-gram and use it as the feature index.

**Example: Spam filtering**


*   0.4 million users
*   3.2 million emails
*   40 million unique tokens

We map each token/n-gram to an index (a number) with a hash function such as $\phi(x)=hash(x)\% 2^b$. With $b = 22$ this gives $2^{22} \approx 4$ million features.
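
As a minimal sketch of this mapping, the snippet below uses `zlib.crc32` as a stand-in hash function. This is an assumption made for reproducibility: Python's built-in `hash` is salted per process for strings, and scikit-learn's `HashingVectorizer` uses MurmurHash3 instead. The helper name `hashed_index` is hypothetical.

```python
import zlib

def hashed_index(token, b=22):
    """Map a token to a bucket in [0, 2**b) via hash(token) % 2**b."""
    # crc32 is a deterministic stand-in hash (assumption, not the paper's choice)
    return zlib.crc32(token.encode('utf-8')) % (2 ** b)

n_features = 2 ** 22
print(n_features)  # 4194304, i.e. roughly 4 million buckets

for token in ['memory', 'pickle', 'streaming']:
    print(token, hashed_index(token))
```

No vocabulary is stored: the index of a token is recomputed from the token itself every time it is needed.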

Hash functions introduce collisions, but [experiments](https://arxiv.org/pdf/0902.2206.pdf) show that these collisions have little effect on the quality of the model in practice.
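
To make collisions concrete, the sketch below hashes the distinct whitespace-separated tokens of one sentence into a deliberately tiny table of 8 buckets and lists the buckets that receive more than one token. The bucket count and the use of `zlib.crc32` as the hash are assumptions chosen for illustration only.

```python
import zlib
from collections import defaultdict

text = ('it is very low memory scalable to large datasets as there is '
        'no need to store a vocabulary dictionary in memory')
vocab = sorted(set(text.split()))

n_buckets = 8  # deliberately tiny so that collisions are easy to see

buckets = defaultdict(list)
for token in vocab:
    # crc32 is a deterministic stand-in hash (assumption)
    buckets[zlib.crc32(token.encode('utf-8')) % n_buckets].append(token)

# Buckets shared by two or more distinct tokens are collisions
collisions = {idx: toks for idx, toks in buckets.items() if len(toks) > 1}
print(len(vocab), 'distinct tokens in', n_buckets, 'buckets')
print(collisions)
```

With more distinct tokens than buckets, at least one collision is guaranteed. The number of buckets is therefore a trade-off between memory and collision rate.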







In [0]:
from sklearn.feature_extraction.text import HashingVectorizer

In [15]:
# ngram_range is passed as a tuple, the form documented by scikit-learn
vectorizer = HashingVectorizer(analyzer='word', stop_words='english', ngram_range=(1, 1), n_features=10)
X = vectorizer.fit_transform(docs).toarray()
X.shape

(3, 10)

## Personalized token trick

Combine each user with their tokens. The motivation is that an email one user considers spam may not be spam for another. To learn this per-user preference, we hash every token twice, once on its own and once prefixed with the user:


*   $\phi_o(token)=hash(token)\%2^b$
*   $\phi_u(user +''\_''+token) = hash(user +''\_''+token) \%2^b$
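*   $\phi(user, token) = \phi_o(token) + \phi_u(user +''\_''+token)$

Here the sum is over feature vectors, not over indices: each token contributes to both its global bucket $\phi_o$ and its user-specific bucket $\phi_u$.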
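
The three definitions above can be sketched in plain Python as follows. The helper names `bucket` and `personalized_features`, the choice of `zlib.crc32` as the hash, and $b = 4$ are all assumptions for illustration. The sketch uses plain counts, unlike `HashingVectorizer`, which by default also applies alternating signs and L2 normalization.

```python
import zlib

def bucket(key, b):
    # crc32 is a deterministic stand-in hash (assumption)
    return zlib.crc32(key.encode('utf-8')) % (2 ** b)

def personalized_features(user, tokens, b=4):
    """Count vector in which every token hits a global and a user-specific bucket."""
    x = [0] * (2 ** b)
    for token in tokens:
        x[bucket(token, b)] += 1               # phi_o: shared across all users
        x[bucket(user + '_' + token, b)] += 1  # phi_u: specific to this user
    return x

tokens = 'it holds no state'.split()
x1 = personalized_features('user1', tokens)
x2 = personalized_features('user2', tokens)
print(x1)
print(x2)
```

The global part of the two vectors is identical because the tokens are the same, while the user-specific part depends on the user prefix. A linear model trained on such vectors can learn shared weights and per-user corrections at the same time.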



In [25]:
users_docs = []
for user, doc in zip(users, docs):
  # Prefix every whitespace-separated token with its user id, e.g. 'user1_memory'
  user_doc = [user + '_' + token for token in doc.split()]
  users_docs.append(' '.join(user_doc))
users_docs

['user1_it user1_is user1_very user1_low user1_memory user1_scalable user1_to user1_large user1_datasets user1_as user1_there user1_is user1_no user1_need user1_to user1_store user1_a user1_vocabulary user1_dictionary user1_in user1_memory',
 'user2_it user2_is user2_fast user2_to user2_pickle user2_and user2_un-pickle user2_as user2_it user2_holds user2_no user2_state user2_besides user2_the user2_constructor user2_parameters',
 'user1_it user1_can user1_be user1_used user1_in user1_a user1_streaming user1_(partial user1_fit) user1_or user1_parallel user1_pipeline user1_as user1_there user1_is user1_no user1_state user1_computed user1_during user1_fit.']

In [28]:
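# HashingVectorizer is stateless, so the same instance can encode both inputs
vectorizer = HashingVectorizer(analyzer='word', stop_words='english', ngram_range=(1, 1), n_features=10)
phi_o = vectorizer.fit_transform(docs).toarray()        # global token features
# Prefixed tokens such as 'user1_is' no longer match the stop-word list, so they are hashed too
phi_u = vectorizer.fit_transform(users_docs).toarray()  # user-specific token features
X = phi_o + phi_u
X.shape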

(3, 10)