__Hashing with HashingVectorizer in NLP__

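HashingVectorizer is a feature-extraction method in NLP that applies a hash function to each token and maps text to a fixed-length numerical vector. Because it does not keep a vocabulary in memory, it has a lower memory footprint than CountVectorizer and TfidfVectorizer, which makes it useful for large datasets. The trade-off is that hashed columns cannot be mapped back to the original tokens, and distinct tokens can collide in the same column.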

__CountVectorizer:__ A vectorizer that converts text to a matrix of token counts.

__TfidfVectorizer:__ A vectorizer that converts text to a matrix of TF-IDF features.

__HashingVectorizer:__ A vectorizer that converts text to a matrix of hashed features.

__Word2Vec:__ A neural network-based vectorizer that learns word embeddings from a corpus of text.

__GloVe:__ A model that learns word embeddings from global word co-occurrence statistics; pre-trained vectors are widely available.

__Doc2Vec:__ A neural network-based vectorizer that learns document embeddings from a corpus of text.

### Count Vectorizer 

from sklearn.feature_extraction.text import CountVectorizer

# Example documents
documents = ["This is the first document.", "This is the second second document.", "And the third one.", "Is this the first document?"]

# Create the CountVectorizer object
vectorizer = CountVectorizer()

# Learn the vocabulary and build the document-term count matrix
matrix = vectorizer.fit_transform(documents)

# vocabulary_ maps each token to its column index in the matrix
vocabulary = vectorizer.vocabulary_

# Print each token and its column index
print(vocabulary)

# Print the matrix of token counts
print(matrix.toarray())


{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
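
To make the count matrix above less opaque, here is a minimal pure-Python sketch of the same idea: tokenize, sort the tokens to assign column indices, and count occurrences per document. The helper names `tokenize` and `count_vector` are hypothetical, and the regular expression is an assumption meant to approximate the default tokenizer (lowercased tokens of two or more word characters); it is not the library's implementation.

```python
import re
from collections import Counter

documents = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

def tokenize(text):
    # Assumption: lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Column order: unique tokens sorted alphabetically
all_tokens = {tok for doc in documents for tok in tokenize(doc)}
vocabulary = {tok: i for i, tok in enumerate(sorted(all_tokens))}

def count_vector(text):
    # One row of the document-term matrix: how often each token occurs
    row = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(text)).items():
        row[vocabulary[tok]] = n
    return row

count_matrix = [count_vector(doc) for doc in documents]
for row in count_matrix:
    print(row)
```

On these four documents the sketch should give the same token-to-column mapping and the same counts as the output printed above, including the 2 for the repeated "second" in the second document.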


### Tfidf Vectorizer 

from sklearn.feature_extraction.text import TfidfVectorizer

# Define the corpus of text documents
corpus = ['This is the first document.', 
          'This is the second second document.', 
          'And the third one.', 
          'Is this the first document?']

# Create a TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the vectorizer on the corpus of text documents
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# vocabulary_ maps each token to its column index in the matrix
vocabulary = tfidf_vectorizer.vocabulary_

# Print the matrix of TF-IDF scores
print(tfidf_matrix.toarray())

# Print the token-to-column mapping
print(vocabulary)


[[0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]
 [0.         0.27230147 0.         0.27230147 0.         0.85322574
  0.22262429 0.         0.27230147]
 [0.55280532 0.         0.         0.         0.55280532 0.
  0.28847675 0.55280532 0.        ]
 [0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]]
{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
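
The scores above can be reproduced by hand, which helps show where they come from. The sketch below assumes scikit-learn's default settings: raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The helper names `tokenize` and `tfidf_row` are hypothetical, and the tokenizer regex is an approximation of the default, not the library's code.

```python
import math
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

def tokenize(text):
    # Assumption: lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in corpus]
n_docs = len(corpus)

# Document frequency: number of documents that contain each token
df = Counter(tok for toks in tokenized for tok in set(toks))

# Smoothed inverse document frequency
idf = {tok: math.log((1 + n_docs) / (1 + d)) + 1 for tok, d in df.items()}

def tfidf_row(tokens):
    # Term count times IDF, then scale the row to unit L2 length
    weights = {tok: n * idf[tok] for tok, n in Counter(tokens).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()}

first_row = tfidf_row(tokenized[0])
for tok in sorted(first_row):
    print(f"{tok}: {first_row[tok]:.8f}")
```

Under those assumptions, the values for the first document should agree with the first row of the matrix above: "the" appears in every document and gets the lowest weight, while "first" appears in only two documents and gets the highest.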


### Hashing Vectorizer

from sklearn.feature_extraction.text import HashingVectorizer

# Sample documents
doc1 = "The cat in the hat."
doc2 = "The cat is out of the bag."
doc3 = "The dog ate my homework."

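# Create the HashingVectorizer; n_features sets the fixed output width
# (10 keeps the output readable here; the default is 2**20)
vectorizer = HashingVectorizer(n_features=10)

# HashingVectorizer is stateless, so transform() needs no prior fit()
X = vectorizer.transform([doc1, doc2, doc3])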

# Print the feature vectors
print(X.toarray())


[[ 0.          0.          0.          0.          0.33333333  0.
   0.          0.         -0.66666667  0.66666667]
 [-0.30151134  0.          0.          0.          0.          0.
   0.60302269  0.30151134 -0.60302269  0.30151134]
 [ 0.37796447  0.37796447  0.         -0.37796447  0.          0.
   0.          0.         -0.75592895  0.        ]]
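
The negative and fractional values above follow from two HashingVectorizer defaults: `alternate_sign=True`, which uses the sign of the hash to decide whether a token adds or subtracts, and `norm='l2'`, which scales each row to unit length. The following is a rough pure-Python sketch of that hashing trick. Note the assumptions: the function name `hash_vectorize` is hypothetical, and MD5 is used as a stand-in hash, whereas scikit-learn uses MurmurHash3, so the column indices will not match the output above.

```python
import hashlib
import math
import re

def hash_vectorize(doc, n_features=10):
    """Map a document to a fixed-length vector without storing a vocabulary."""
    vec = [0.0] * n_features
    for token in re.findall(r"\b\w\w+\b", doc.lower()):
        # Stand-in hash (assumption): first 4 bytes of MD5 as a signed 32-bit integer
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "little", signed=True)
        # The hash picks the column; its sign decides whether to add or subtract
        index = abs(h) % n_features
        vec[index] += 1.0 if h >= 0 else -1.0
    # Scale to unit L2 length, leaving an all-zero vector unchanged
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

for doc in ["The cat in the hat.", "The cat is out of the bag.", "The dog ate my homework."]:
    print([round(v, 3) for v in hash_vectorize(doc)])
```

With only 10 columns, several tokens are likely to share a column, which is why a small `n_features` is suitable for demonstration only. A larger value lowers the chance of collisions at the cost of a wider (but still sparse) matrix.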
