In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

In [2]:
df = pd.read_csv(r"C:\Users\Aditya kumar Dubey\OneDrive\Apps\Documents\Desktop\Data\Titanic-Dataset.csv")

In [3]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# use single brackets around "Name" because CountVectorizer expects 1-D input
X = df['Name']
y = df['Survived']

In [5]:
pipe = make_pipeline(CountVectorizer(), MultinomialNB())

In [6]:
# cross-validate the pipeline using default parameters
from sklearn.model_selection import cross_val_score
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

0.8001820350260498

In [7]:
# specify parameter values to search (use a distribution for any continuous parameters)
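from scipy import stats  # import the submodule explicitly; "import scipy" alone may not load scipy.stats
params = {}
params['countvectorizer__min_df'] = [1, 2, 3, 4]
params['countvectorizer__lowercase'] = [True, False]
# alpha is continuous, so sample it uniformly from the interval [0, 1]
params['multinomialnb__alpha'] = stats.uniform(scale=1)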

In [8]:
# try "n_iter" random combinations of those parameter values
from sklearn.model_selection import RandomizedSearchCV
rand = RandomizedSearchCV(pipe, params, n_iter=10, cv=5, scoring='accuracy', random_state=1)
rand.fit(X, y);
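
To see what "random combinations" means in practice, the candidate settings can be drawn without fitting any model. The sketch below uses scikit-learn's `ParameterSampler` on the same search space; the variable names `search_space` and `candidates` are illustrative. With the same `random_state`, the draws should mirror the candidates the search evaluates, though that is an assumption about how the search samples rather than something this notebook verifies.

```python
from scipy import stats
from sklearn.model_selection import ParameterSampler

# Same search space as the randomized search: two discrete lists
# and one continuous distribution for alpha.
search_space = {
    "countvectorizer__min_df": [1, 2, 3, 4],
    "countvectorizer__lowercase": [True, False],
    "multinomialnb__alpha": stats.uniform(scale=1),
}

# Draw 10 candidate settings; random_state makes the draw repeatable.
candidates = list(ParameterSampler(search_space, n_iter=10, random_state=1))
for candidate in candidates:
    print(candidate)
```

Each printed dictionary is one combination: list-valued parameters are picked uniformly from their lists, while `alpha` is sampled from the uniform distribution on [0, 1].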

In [9]:
# what was the best score found during the search?
rand.best_score_

0.8080534806352395

In [10]:
# which combination of parameters produced the best score?
rand.best_params_

{'countvectorizer__lowercase': False,
 'countvectorizer__min_df': 3,
 'multinomialnb__alpha': 0.1981014890848788}

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text corpus
documents = [
    "I love machine learning",
    "Machine learning is amazing",
    "Deep learning and machine learning are popular"
]

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Convert to array
print(X.toarray())

# Get the feature names (vocabulary)
print(vectorizer.get_feature_names_out())


[[0 0 0 0 0 1 1 1 0]
 [1 0 0 0 1 1 0 1 0]
 [0 1 1 1 0 2 0 1 1]]
['amazing' 'and' 'are' 'deep' 'is' 'learning' 'love' 'machine' 'popular']


In [13]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 13 stored elements and shape (3, 9)>
  Coords	Values
  (0, 6)	1
  (0, 7)	1
  (0, 5)	1
  (1, 7)	1
  (1, 5)	1
  (1, 4)	1
  (1, 0)	1
  (2, 7)	1
  (2, 5)	2
  (2, 3)	1
  (2, 1)	1
  (2, 2)	1
  (2, 8)	1


Count vectorization (sometimes written "counter-vectorization") is the process of converting text into numerical vectors in which each word or token is represented by the number of times it occurs in a document. In scikit-learn this is done with **CountVectorizer**, which builds a document-term matrix: each entry is the count of one word in one document.

The idea is to represent text as a collection of **word counts** rather than raw text, making it suitable for machine learning algorithms that work with numerical data.

### **Key Points of Count Vectorization**
1. **Tokenization**: The process of breaking the text into individual words (tokens).
2. **Vocabulary Creation**: Building a list of unique words (the vocabulary) that appear in the dataset.
3. **Frequency Count**: For each document, counting how often each word in the vocabulary appears.
4. **Document-Term Matrix**: A matrix where rows represent documents and columns represent words, with each value being the frequency of that word in the respective document.
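
The four steps above can be sketched in plain Python. This is a minimal illustration using only the standard library, not how scikit-learn implements it; the function name `count_vectorize` and the simple `\w+` tokenizer are assumptions made for the example.

```python
import re
from collections import Counter

def count_vectorize(documents):
    """Return (vocabulary, matrix) for a list of strings."""
    # 1. Tokenization: lowercase, then keep runs of word characters.
    tokenized = [re.findall(r"\w+", doc.lower()) for doc in documents]
    # 2. Vocabulary creation: sorted unique tokens across all documents.
    vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
    # 3. Frequency count: one Counter per document.
    counts = [Counter(tokens) for tokens in tokenized]
    # 4. Document-term matrix: one row per document, one column per term.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

docs = ["the cat sat", "the cat saw the dog"]
vocab, dtm = count_vectorize(docs)
print(vocab)  # ['cat', 'dog', 'sat', 'saw', 'the']
for row in dtm:
    print(row)  # [1, 0, 1, 0, 1] then [1, 1, 0, 1, 2]
```

The second row has a 2 in the last column because "the" occurs twice in the second document.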

---

### **Example of Count Vectorization**
Imagine you have the following three sentences:
1. "I love data science."
2. "Data science is amazing."
3. "Machine learning and data science are important."

**Step 1: Tokenization**
- Sentence 1: ["I", "love", "data", "science"]
- Sentence 2: ["Data", "science", "is", "amazing"]
- Sentence 3: ["Machine", "learning", "and", "data", "science", "are", "important"]

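**Step 2: Vocabulary Creation**
- Lowercase every token so that "Data" and "data" share one entry, then collect the unique words in order of first appearance.
- Vocabulary: ['i', 'love', 'data', 'science', 'is', 'amazing', 'machine', 'learning', 'and', 'are', 'important']

**Step 3: Frequency Count (Document-Term Matrix)**
|         | i | love | data | science | is | amazing | machine | learning | and | are | important |
|---------|---|------|------|---------|----|---------|---------|----------|-----|-----|-----------|
| **Doc 1** | 1 | 1    | 1    | 1       | 0  | 0       | 0       | 0        | 0   | 0   | 0         |
| **Doc 2** | 0 | 0    | 1    | 1       | 1  | 1       | 0       | 0        | 0   | 0   | 0         |
| **Doc 3** | 0 | 0    | 1    | 1       | 0  | 0       | 1       | 1        | 1   | 1   | 1         |

Note that scikit-learn's `CountVectorizer`, with its default settings, sorts the columns alphabetically and drops one-character tokens such as "i", so its output for these sentences has 10 columns rather than 11.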

---

### **Use Case of Count Vectorization**
Count vectorization is particularly useful for tasks like:
- **Text Classification** (spam detection, sentiment analysis)
- **Topic Modeling**
- **Clustering** (grouping similar documents)
  
It is a simple yet effective way to convert text into a numerical form that machine learning models can work with.
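
As a small illustration of the clustering use case, count vectors can be compared with cosine similarity to judge how alike two documents are. This is a standard-library sketch; the helper names `count_vector` and `cosine_similarity` and the third sample sentence are assumptions made for the example.

```python
import math
import re
from collections import Counter

def count_vector(text):
    """Map a document to a bag-of-words Counter (lowercased word tokens)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two count vectors stored as Counters."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "I love machine learning",
    "Machine learning is amazing",
    "The weather is sunny today",
]
vectors = [count_vector(d) for d in docs]
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(round(sim_01, 3), round(sim_02, 3))  # 0.5 0.0
```

The first two documents share "machine" and "learning", so they score 0.5, while the first and third share no words and score 0.0. A clustering algorithm would use distances like these to group the first two together.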
