<a href="https://colab.research.google.com/github/ghassenov/ML-notebooks/blob/main/Feature_Extraction_of_Text_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Feature extraction from text data:
* Text data cannot be used directly in machine learning models, because the models require numerical inputs.
* Feature extraction converts text into numerical representations.
* Two popular methods are Bag of Words (BOW) and TF-IDF.

Bag of Words (BOW)

---

* Represents text as a bag of words, ignoring grammar and word order.
* Each document is converted into a vector of word counts.
* Builds a vocabulary of all unique words in the dataset.

How does it work?
1. Tokenization: split the text into words.
2. Vocabulary creation: collect all unique words.
3. Vectorization: count the occurrences of each word in each document.
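
The three steps can be sketched by hand in plain Python. This is only an illustration, not how scikit-learn is implemented internally. It assumes the tokenization rule used by `CountVectorizer` by default (lowercase the text, keep tokens of two or more word characters), so the single-letter word "I" is dropped. The helper names `tokenize` and `vectorize` are made up for this sketch.

```python
import re
from collections import Counter

documents = [
    "I love machine learning",
    "I hate bad movies",
    "Machine learning is fun",
]

def tokenize(text):
    # Step 1: lowercase, then keep tokens of 2+ word characters
    # (assumed rule, mirroring CountVectorizer's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]

# Step 2: the vocabulary is the sorted set of all unique tokens.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

def vectorize(tokens):
    # Step 3: count how often each vocabulary word occurs in one document.
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

bow_vectors = [vectorize(tokens) for tokens in tokenized]

print(vocabulary)
for vector in bow_vectors:
    print(vector)
```

Each vector has one entry per vocabulary word, in alphabetical order, so all documents share the same columns.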

Implementation of BOW

In [4]:
# Import libraries (TfidfVectorizer is used in the TF-IDF cell further down)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [5]:
# Define sample text data
documents = [
    "I love machine learning",
    "I hate bad movies",
    "Machine learning is fun"
]

In [6]:
# BOW implementation
# Initialize CountVectorizer
bow_vectorizer = CountVectorizer()

# Fit and transform the documents
bow_matrix = bow_vectorizer.fit_transform(documents)

# Get feature names (vocabulary)
print("BOW Vocabulary:", bow_vectorizer.get_feature_names_out())

# Convert to DataFrame for better visualization
import pandas as pd
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=bow_vectorizer.get_feature_names_out())
print("\nBag of Words (BOW) Representation:")
print(bow_df)

BOW Vocabulary: ['bad' 'fun' 'hate' 'is' 'learning' 'love' 'machine' 'movies']

Bag of Words (BOW) Representation:
   bad  fun  hate  is  learning  love  machine  movies
0    0    0     0   0         1     1        1       0
1    1    0     1   0         0     0        0       1
2    0    1     0   1         1     0        1       0


TF-IDF (Term Frequency-Inverse Document Frequency)

---
* Improves on BOW by weighting words according to their importance.

* TF (Term Frequency): How often a word appears in a document.

* IDF (Inverse Document Frequency): Penalizes words that appear too frequently across all documents (e.g., "the", "is").

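How does it work?
* $\mathrm{TF}(t, d) = \dfrac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$

* $\mathrm{IDF}(t) = \log\left(\dfrac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$
* $\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$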
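
As a worked example, the formulas can be evaluated directly for the first sample sentence. This is a minimal sketch under two assumptions: the logarithm is the natural log (the formulas do not fix a base), and tokenization lowercases the text and drops one-character tokens. The function names `tf`, `idf`, and `tf_idf` are illustrative only.

```python
import math
import re

documents = [
    "I love machine learning",
    "I hate bad movies",
    "Machine learning is fun",
]

def tokenize(text):
    # Assumed rule: lowercase, keep tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]

def tf(term, tokens):
    # Occurrences of the term divided by the document length.
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # Assumes the term occurs in at least one document of the corpus.
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

for term in tokenized[0]:
    print(term, round(tf_idf(term, tokenized[0], tokenized), 4))
# love 0.3662
# machine 0.1352
# learning 0.1352
```

"love" appears in only one document, so it scores higher than "machine" and "learning", which each appear in two.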

In [7]:
# TF-IDF implementation
# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Convert to DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print("\nTF-IDF Representation:")
print(tfidf_df)


TF-IDF Representation:
       bad       fun     hate        is  learning      love   machine   movies
0  0.00000  0.000000  0.00000  0.000000  0.517856  0.680919  0.517856  0.00000
1  0.57735  0.000000  0.57735  0.000000  0.000000  0.000000  0.000000  0.57735
2  0.00000  0.562829  0.00000  0.562829  0.428046  0.000000  0.428046  0.00000
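
These values do not come from the plain textbook formulas. With its default settings, `TfidfVectorizer` uses the raw term count as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below reproduces the table in plain Python under those assumptions. The helper names `smoothed_idf` and `tfidf_row` are made up for illustration, and this is not scikit-learn's actual code.

```python
import math
import re

documents = [
    "I love machine learning",
    "I hate bad movies",
    "Machine learning is fun",
]

def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters, as the default vectorizer does.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(tokenized)

def smoothed_idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    # Raw count times smoothed IDF, then L2-normalize the row.
    weights = [tokens.count(term) * smoothed_idf(term) for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

tfidf_rows = [tfidf_row(tokens) for tokens in tokenized]
for row in tfidf_rows:
    print([round(value, 6) for value in row])
```

Because of the L2 normalization, the second row contains three equally rare words and each gets the weight 1/√3 ≈ 0.57735.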
