# Bag of Words Model

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text
texts = ["small dog", "cute cat", "cute dog"]

# Convert to Bag of Words
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Vocabulary and representation
print("Vocabulary:", vectorizer.get_feature_names_out())
print("Bag of Words Representation:\n", X.toarray())

Vocabulary: ['cat' 'cute' 'dog' 'small']
Bag of Words Representation:
 [[0 0 1 1]
 [1 1 0 0]
 [0 1 1 0]]
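
To make the matrix above easier to read, the following is a minimal pure-Python sketch of the same counting idea. It assumes plain whitespace tokenization, which is a simplification: `CountVectorizer` by default uses a regular-expression tokenizer that also drops single-character tokens. The names `tokenized`, `vocabulary`, and `bow_matrix` are illustrative and not part of scikit-learn.

```python
from collections import Counter

texts = ["small dog", "cute cat", "cute dog"]

# Lowercase and split on whitespace (simplified tokenization, an assumption)
tokenized = [text.lower().split() for text in texts]

# The vocabulary is the sorted set of unique tokens across all documents
vocabulary = sorted({token for doc in tokenized for token in doc})

# Build one row per document: how often each vocabulary term occurs in it
bow_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    bow_matrix.append([counts[term] for term in vocabulary])

print("Vocabulary:", vocabulary)
for row in bow_matrix:
    print(row)
```

Each column corresponds to one vocabulary term in alphabetical order, so the first row marks `dog` and `small` for the document "small dog".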


# TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert to TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(texts)

# view the TF-IDF values for the first document
feature_names = tfidf_vectorizer.get_feature_names_out()
first_document_vector = X_tfidf[0]

print("Feature names:", feature_names)
print("TF-IDF values for the first document:")
print(first_document_vector.toarray())

Feature names: ['cat' 'cute' 'dog' 'small']
TF-IDF values for the first document:
[[0.         0.         0.60534851 0.79596054]]
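
The values above can be recomputed by hand. The sketch below assumes the `TfidfVectorizer` defaults: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper `tfidf_row` and the variable names are hypothetical, written only for this illustration.

```python
import math

texts = ["small dog", "cute cat", "cute dog"]
docs = [text.split() for text in texts]
vocabulary = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = {term: sum(term in doc for doc in docs) for term in vocabulary}

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_row(doc):
    """Return the L2-normalized TF-IDF vector for one tokenized document."""
    raw = [doc.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw] if norm else raw

first_row = tfidf_row(docs[0])
for term, value in zip(vocabulary, first_row):
    print(f"{term}: {value:.6f}")
```

In the first document, `small` scores higher than `dog` because `small` appears in only one document while `dog` appears in two.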


# Word Embedding

In [None]:
from gensim.models import Word2Vec

# Example sentences
sentences = [["I", "love", "NLP"], ["NLP", "is", "awesome"]]

# Train Word2Vec model
model = Word2Vec(
    sentences=sentences,    # Input data
    vector_size=10,         # Dimension of word embeddings
    window=3,               # Context window size
    min_count=1,            # Minimum word frequency
    workers=4,              # Number of CPU threads
    sg=1                    # Use Skip-Gram (1) or CBOW (0)
)

# Get vector for a word
print("Vector for 'NLP':", model.wv['NLP'])

Vector for 'NLP': [-0.00536227  0.00236431  0.0510335   0.09009273 -0.0930295  -0.07116809
  0.06458873  0.08972988 -0.05015428 -0.03763372]
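
To illustrate what `window` and `sg=1` mean, here is a rough sketch of how Skip-Gram training pairs can be formed: each word is paired with every other word within `window` positions of it. The function `skipgram_pairs` is hypothetical and is not how gensim is implemented internally; real implementations add details such as negative sampling and randomly shrinking the window, which this sketch omits.

```python
def skipgram_pairs(sentence, window):
    """Return (center, context) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(sentence):
        # Context positions lie within `window` steps on either side
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs

sentences = [["I", "love", "NLP"], ["NLP", "is", "awesome"]]
for sentence in sentences:
    print(skipgram_pairs(sentence, window=3))
```

With sentences this short and `window=3`, every word is paired with every other word in its sentence, which is one reason a toy corpus like this yields vectors that carry little real meaning.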


# Comparison: CountVectorizer, TF-IDF, and Word2Vec

| **Feature**           | **CountVectorizer**                                       | **TF-IDF**                                            | **Word2Vec**                                         |
|-----------------------|-----------------------------------------------------------|------------------------------------------------------|-----------------------------------------------------|
| **What It Does**      | Converts text into a matrix of word counts.               | Converts text into a matrix with term frequencies adjusted by their inverse document frequency. | Converts words into dense vectors that capture semantic meaning. |
| **Representation**    | Sparse matrix of word counts.                            | Sparse matrix, but adjusts word frequencies based on their importance. | Dense vector representations of words. |
| **Contextual Awareness** | No context; based only on word occurrence.               | No context; focuses on word importance in documents.   | Learns from the words surrounding each term, but assigns each word a single static vector. |
| **When to Use**       | When you need simple frequency-based features.            | When you want to weigh words by their relevance to the document and corpus. | When you need to capture word meanings and relationships. |
| **Best for**          | Text classification, keyword extraction, basic NLP tasks.  | Document classification, keyword extraction, search engines. | Sentiment analysis, language modeling, machine translation, word similarity. |
| **Complexity**        | Simple, easy to understand and implement.                 | Slightly more complex due to inverse document frequency (IDF). | More complex; requires large datasets or pre-trained embeddings. |
| **Example NLP Tasks** | Document classification (e.g., spam vs. non-spam).       | Information retrieval (e.g., search engines).        | Sentiment analysis, text generation, word similarity. |
| **Dataset Considerations** | Small to medium datasets.                           | Corpora where the importance of words needs to be weighted. | Large datasets where semantic meaning is important. |
| **Advantages**        | Simple and interpretable.                                | Highlights terms that are important to a document.   | Encodes semantic meaning and relationships.           |
| **Disadvantages**     | Ignores word order and context; produces sparse matrices. | Ignores word order and context; still sparse.        | Requires large data to train effectively; harder to interpret. |

---

### Example Use Cases

| **Task**                             | **CountVectorizer**                                   | **TF-IDF**                                          | **Word2Vec**                                        |
|--------------------------------------|-------------------------------------------------------|----------------------------------------------------|----------------------------------------------------|
| **Text Classification (e.g., spam detection)** | Use word counts to classify documents as spam or not.   | Use weighted word counts for better classification, emphasizing rare terms. | Use embeddings to classify text based on semantic meaning. |
| **Keyword Extraction**               | Identify frequently occurring words.                  | Extract important keywords based on document relevance. | Identify similar words based on vector proximity. |
| **Sentiment Analysis**               | Not ideal due to lack of context.                     | Can be used for sentiment analysis by weighing important words. | Often a better fit, since embeddings capture word meaning that a downstream classifier can use. |
| **Document Retrieval**               | Simple search based on word frequency.                | Improve search by using word importance (TF-IDF).   | Semantic search based on word meanings. |
| **Clustering Text Data**             | Group similar documents based on word counts.         | Group documents based on significant terms with TF-IDF scores. | Group documents by similarity in semantic space (using Word2Vec embeddings). |
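
One common way to compare whole documents with word embeddings is to average the word vectors in each document and measure cosine similarity between the averages. The sketch below assumes this averaging approach, and the 3-dimensional vectors in `word_vectors` are invented for illustration only; in practice they would come from a trained model such as `model.wv`. The helpers `document_vector` and `cosine_similarity` are hypothetical names.

```python
import math

# Hypothetical word vectors, made up for this example (not trained values)
word_vectors = {
    "cute": [0.9, 0.1, 0.0],
    "small": [0.8, 0.2, 0.1],
    "dog": [0.1, 0.9, 0.2],
    "cat": [0.2, 0.8, 0.3],
}
DIMENSIONS = 3

def document_vector(tokens):
    """Average the vectors of all known tokens; zeros if none are known."""
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return [0.0] * DIMENSIONS
    return [sum(column) / len(known) for column in zip(*known)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc_a = document_vector(["small", "dog"])
doc_b = document_vector(["cute", "cat"])
print("Similarity:", cosine_similarity(doc_a, doc_b))
```

Unlike the count-based rows, these two documents share no words yet can still score as similar, because the comparison happens in the embedding space rather than over exact term matches.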