# **Feature Extraction from Text**

Machine learning algorithms require numerical data to function, so raw text must be converted into numerical vectors through a process called feature extraction. This process transforms unstructured text into a structured format that algorithms can interpret. Here are some common techniques for feature extraction:

#### 1. **Bag of Words (BoW)**
The Bag of Words model represents text data as a matrix of token counts. Each document is converted into a vector where the value for each dimension corresponds to the frequency of a specific word in the document. This method disregards word order but captures word occurrence.

#### 2. **TF-IDF (Term Frequency-Inverse Document Frequency)**
TF-IDF is an extension of the Bag of Words model that weights word counts by their importance. It combines two metrics:
- **Term Frequency (TF):** Measures how frequently a word occurs in a document.
- **Inverse Document Frequency (IDF):** Measures how rare a word is across all documents; words that appear in many documents receive a lower weight.

The resulting vectors reflect both the frequency and importance of words, making TF-IDF useful for identifying significant terms.

#### 3. **Word Embeddings**
Word embeddings are dense vector representations of words where similar words have similar vectors. These methods capture semantic relationships learned from the contexts in which words appear, providing a richer representation than simple counts.

   - **Word2Vec:** A popular word embedding technique that uses neural networks to learn word associations from large text corpora. It creates vectors where words with similar meanings are close in the vector space.

   - **GloVe (Global Vectors for Word Representation):** Another word embedding method, trained on the global word-word co-occurrence matrix of a corpus. GloVe derives its vectors from how often words co-occur across the entire dataset.
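To make "close in the vector space" concrete, here is a minimal sketch that compares words by cosine similarity. The three-dimensional vectors are made-up illustrative values, not output from Word2Vec or GloVe; real embeddings typically have many more dimensions.

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration only.
toy_embeddings = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "apple": [0.10, 0.20, 0.95],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_apple = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])

print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs apple: {sim_king_apple:.3f}")
```

With these toy values, the related pair scores much higher than the unrelated pair, which is the property that trained embeddings aim to provide.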

These feature extraction techniques help transform textual data into a format suitable for machine learning models, allowing algorithms to perform tasks such as classification, clustering, and sentiment analysis more effectively.



## **Bag of Words (BoW)**

**Bag of Words (BoW)** is a widely used technique for converting text into numerical features, making it compatible with machine learning algorithms. It offers a straightforward and versatile approach to text representation, which is helpful for various natural language processing (NLP) tasks. Below is a detailed overview:

#### **What is Bag of Words?**

BoW transforms text into numerical vectors by counting how often each word occurs in a document. The key idea is to represent text by which words it contains and how often, without considering the order in which they appear.

**Key Components:**
1. **Vocabulary:** A collection of all unique words found in the text corpus (the entire set of documents).
2. **Feature Matrix:** A table where each row corresponds to a document and each column represents a word from the vocabulary. The values in the matrix denote the frequency of each word in the corresponding document.

#### **How Does It Work?**

1. **Creating the Vocabulary:**
   - Identify all distinct words in the corpus.
   - For instance, given the following reviews:
     - "Good hotel"
     - "Not a good hotel"
     - "Best hotel in the area"
   - The vocabulary would include: `good`, `hotel`, `not`, `a`, `best`, `in`, `the`, `area`.

2. **Building the Feature Matrix:**
   - Each row in the matrix represents a document, and each column corresponds to a word in the vocabulary.
   - The matrix is filled with word counts or binary values indicating the presence of words.

   Here’s an example matrix for the provided reviews:


|              | good | hotel | not | a | best | in | the | area |
|--------------|------|-------|-----|---|------|----|-----|------|
| **First Review**  | 1    | 1     | 0   | 0 | 0    | 0  | 0   | 0    |
| **Second Review** | 1    | 1     | 1   | 1 | 0    | 0  | 0   | 0    |
| **Third Review**  | 0    | 1     | 0   | 0 | 1    | 1  | 1   | 1    |
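The two steps above can be sketched in plain Python. This is a minimal illustration that assumes a deliberately naive tokenizer (lowercasing and splitting on whitespace); the names `tokenized`, `vocabulary`, and `feature_matrix` are chosen here for the example.

```python
reviews = ["Good hotel", "Not a good hotel", "Best hotel in the area"]

# Lowercase and split on whitespace (a deliberately naive tokenizer).
tokenized = [review.lower().split() for review in reviews]

# Step 1: vocabulary of unique words, in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Step 2: feature matrix with one row per review and one count per word.
feature_matrix = [
    [tokens.count(word) for word in vocabulary] for tokens in tokenized
]

print(vocabulary)
for row in feature_matrix:
    print(row)
```

The printed rows correspond to the three rows of the table, with columns in the same order as the vocabulary.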


#### **Drawbacks of Bag of Words:**

1. **Dominance of Frequent Words:**
   - Common words such as "the" or "is" receive the largest counts and can overshadow less frequent but more informative words, skewing the feature representation.

2. **Loss of Word Order:**
   - BoW disregards the sequence of words, which means information about the syntax and semantics related to word order is lost.

3. **Large and Sparse Matrices:**
   - With a vast vocabulary, the feature matrix can become very large and mostly empty (sparse), leading to high memory usage and computational inefficiency.
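The loss of word order is easy to demonstrate. The sketch below uses a small helper, `bag_of_words`, which is a hypothetical name introduced for this example and counts whitespace-separated tokens. Two sentences with opposite meanings end up with the same representation.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens; positions are discarded."""
    return Counter(text.lower().split())


bow_a = bag_of_words("the dog bit the man")
bow_b = bag_of_words("the man bit the dog")

print(bow_a)
print(bow_a == bow_b)  # identical counts, even though the meanings differ
```

Because only counts are stored, a model trained on these features cannot tell the two sentences apart.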

Despite these limitations, Bag of Words remains a fundamental technique in text processing due to its simplicity and effectiveness for many text classification tasks.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus
corpus = [
    'I love programming in Python.',
    'Python is great for data analysis.',
    'Machine learning with Python is fun.'
]

# Initialize the CountVectorizer.
# By default it lowercases the text and keeps only tokens of two or more
# characters, so the single-letter word "I" does not appear in the vocabulary.
count_vectorizer = CountVectorizer()

# Fit and transform the corpus
bag_of_words = count_vectorizer.fit_transform(corpus)

# Get feature names
feature_names = count_vectorizer.get_feature_names_out()

# Create a DataFrame with the bag-of-words matrix
df_bow = pd.DataFrame(bag_of_words.toarray(), columns=feature_names)

print(df_bow)
```

Output:

```
   analysis  data  for  fun  great  in  is  learning  love  machine  \
0         0     0    0    0      0   1   0         0     1        0
1         1     1    1    0      1   0   1         0     0        0
2         0     0    0    1      0   0   1         1     0        1

   programming  python  with
0            1       1     0
1            0       1     0
2            0       1     1
```


## **Term Frequency–Inverse Document Frequency (TF-IDF)**

**Term Frequency–Inverse Document Frequency (TF-IDF)** is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). TF-IDF helps to identify words that are frequent within a document but rare across the rest of the corpus. This method addresses some limitations of the Bag of Words (BoW) model by emphasizing words that are rare and more informative.

#### **How TF-IDF Works**

TF-IDF combines two components, which are then multiplied together:

1. **Term Frequency (TF):**
   - **Term Frequency** measures how often a term appears in a document. It can be seen as the probability that a word picked at random from that document is the term.
   - Formula:
     ```
     TF = (Number of times term t appears in a document) / (Total number of terms in the document)
     ```
   - For example, if a document contains 100 words and the word "cat" appears 3 times, the TF for "cat" would be:
     ```
     TF = 3 / 100 = 0.03
     ```

2. **Inverse Document Frequency (IDF):**
   - **Inverse Document Frequency** measures how rare a term is across the entire corpus. It decreases as the term appears in more documents and increases as it appears in fewer, which down-weights common words and up-weights rare ones.
   - Formula (using a base-10 logarithm):
     ```
     IDF = log10((Total number of documents D) / (Number of documents containing term t))
     ```
   - For instance, if there are 10 million documents in the corpus and the word "cat" appears in 1,000 of them, the IDF would be:
     ```
     IDF = log10(10,000,000 / 1,000) = log10(10,000) = 4
     ```

3. **TF-IDF Calculation:**
   - The TF-IDF weight is the product of TF and IDF, representing the term's importance in the document relative to the entire corpus.
   - Formula:
     ```
     TF-IDF = TF * IDF
     ```
   - For the word "cat" in the previous example, the TF-IDF weight would be:
     ```
     TF-IDF = 0.03 * 4 = 0.12
     ```
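The worked "cat" example can be checked with a few lines of Python. This sketch assumes a base-10 logarithm, which is what the IDF result of 4 implies; the function names are chosen here for illustration.

```python
import math


def term_frequency(term_count, total_terms):
    """Share of the document's terms that are the given term."""
    return term_count / total_terms


def inverse_document_frequency(total_docs, docs_with_term):
    """Base-10 log of (total documents / documents containing the term)."""
    return math.log10(total_docs / docs_with_term)


tf = term_frequency(3, 100)
idf = inverse_document_frequency(10_000_000, 1_000)
tfidf = tf * idf

print(f"TF = {tf:.2f}, IDF = {idf:.2f}, TF-IDF = {tfidf:.2f}")
```

Switching to a natural logarithm would change the absolute values but not the ranking of terms, since it only rescales every IDF by the same constant factor.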

#### **Advantages Over Bag of Words (BoW)**

- **Addressing Dominance of Frequent Words:**
  - Unlike BoW, which treats all words equally, TF-IDF adjusts the weight of each word by considering its frequency in the document and its rarity across the corpus. This reduces the dominance of frequently occurring words and highlights words that are significant but less common.

#### **Example Application**

Applied to the hotel reviews from the Bag of Words section, TF-IDF would down-weight words shared by every review and highlight words unique to one. For instance, "hotel" appears in all three reviews, so its IDF is log(3/3) = 0 and it contributes nothing, while "best" appears only in the third review and receives a positive score there.
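As a rough sketch, the plain TF * IDF formulas from this section can be applied to the three hotel reviews. The code assumes whitespace tokenization and a base-10 logarithm, and the helper names `tf`, `idf`, and `tf_idf` are introduced here for illustration.

```python
import math

reviews = ["Good hotel", "Not a good hotel", "Best hotel in the area"]
documents = [review.lower().split() for review in reviews]


def tf(term, tokens):
    """Occurrences of the term divided by the document length."""
    return tokens.count(term) / len(tokens)


def idf(term, docs):
    """Base-10 log of (number of documents / documents containing the term).

    Only call this for terms that occur somewhere in the corpus,
    otherwise the division fails.
    """
    docs_with_term = sum(1 for tokens in docs if term in tokens)
    return math.log10(len(docs) / docs_with_term)


def tf_idf(term, tokens, docs):
    return tf(term, tokens) * idf(term, docs)


third_review = documents[2]
for term in third_review:
    print(f"{term:>6}: {tf_idf(term, third_review, documents):.3f}")
```

In this sketch "hotel" scores zero because it occurs in every review, while the words that occur only in the third review share the same positive score.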



```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample corpus
corpus = [
    'I love programming in Python.',
    'Python is great for data analysis.',
    'Machine learning with Python is fun.'
]

# Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the corpus
values = tfidf_vectorizer.fit_transform(corpus)

# Get feature names
feature_names = tfidf_vectorizer.get_feature_names_out()

# Create a DataFrame with the TF-IDF matrix
df_tfidf = pd.DataFrame(values.toarray(), columns=feature_names)

print(df_tfidf)
```

Output:

```
   analysis      data       for       fun     great        in       is  \
0  0.000000  0.000000  0.000000  0.000000  0.000000  0.546454  0.00000
1  0.450504  0.450504  0.450504  0.000000  0.450504  0.000000  0.34262
2  0.000000  0.000000  0.000000  0.450504  0.000000  0.000000  0.34262

   learning      love   machine  programming    python      with
0  0.000000  0.546454  0.000000     0.546454  0.322745  0.000000
1  0.000000  0.000000  0.000000     0.000000  0.266075  0.000000
2  0.450504  0.000000  0.450504     0.000000  0.266075  0.450504
```
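These values do not match the plain TF * IDF formula given earlier: "python" occurs in every document yet still has a non-zero weight. With its default settings, `TfidfVectorizer` uses the raw count as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below reproduces the first row in plain Python under those assumptions; the token lists are written out by hand to mirror the default tokenization, and the helper names are illustrative.

```python
import math

# Tokens kept for each document under the default tokenization:
# lowercased, punctuation removed, single-character tokens such as "I" dropped.
documents = [
    ["love", "programming", "in", "python"],
    ["python", "is", "great", "for", "data", "analysis"],
    ["machine", "learning", "with", "python", "is", "fun"],
]
n_docs = len(documents)


def smoothed_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    df = sum(1 for tokens in documents if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(tokens):
    """Raw count times smoothed IDF, then L2-normalized over the row."""
    weights = {
        term: tokens.count(term) * smoothed_idf(term) for term in set(tokens)
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


row0 = tfidf_row(documents[0])
for term in sorted(row0):
    print(f"{term:>12}: {row0[term]:.6f}")
```

The smoothing term keeps the IDF of a word that appears in every document at 1 rather than 0, which is why "python" keeps a small weight instead of vanishing.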
