# Write a Program to Extract Features from Text

## Why do we need feature extraction in NLP?
Machine learning models cannot work with raw text directly.

They operate on numbers.

So, before applying Machine Learning / Deep Learning models, we must convert text → numerical features.

Feature extraction captures meaningful information from the text, such as word counts, word importance, and context.

## Common Feature Extraction Techniques

### Bag of Words (BoW)

Bag of Words treats a text as an unordered collection of words.

It ignores grammar and word order.

Each text is represented by counting how many times each vocabulary word appears in it.

Example:

Texts: ["I love NLP", "NLP loves Python"]

Vocabulary (in this order) = [I, love, NLP, loves, Python]

Features (each position counts the corresponding vocabulary word) =

[1, 1, 1, 0, 0] (for "I love NLP")

[0, 0, 1, 1, 1] (for "NLP loves Python")
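The hand-worked example above can be reproduced in a few lines of plain Python. This is only a minimal sketch: it assumes whitespace tokenization with case preserved, and the names `texts`, `vocabulary`, and `features` are illustrative rather than part of any library.

```python
from collections import Counter

texts = ["I love NLP", "NLP loves Python"]

# Simplifying assumption: tokens are whitespace-separated and case is kept.
tokenized = [text.split() for text in texts]

# Build the vocabulary in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# For each text, count how often each vocabulary word occurs.
features = []
for tokens in tokenized:
    counts = Counter(tokens)
    features.append([counts[word] for word in vocabulary])

print("Vocabulary:", vocabulary)
for text, row in zip(texts, features):
    print(row, "for", repr(text))
```

Note that scikit-learn's `CountVectorizer`, with its default settings, lowercases the text and ignores single-character tokens such as "I", so its vocabulary for these two texts would differ from the one built here.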

### TF–IDF (Term Frequency – Inverse Document Frequency)

Not all words are equally important.

Words like "the" and "is" appear in almost every document, so they add little meaning.

TF–IDF reduces the weight of words that are common across documents and increases the weight of rarer, more distinctive words.

Formula:

TF (Term Frequency) = (Number of times the word appears in the document) / (Total number of words in the document)

IDF (Inverse Document Frequency) = log(Total number of documents / Number of documents containing the word)

TF-IDF = TF × IDF
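To make the formula concrete, here is a small sketch that applies it literally. It assumes a natural logarithm (the formula does not specify a base) and that every queried word occurs in at least one document. The toy corpus `docs` and the helper names `tf`, `idf`, and `tf_idf` are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical toy corpus, already split into lowercase tokens.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

def tf(word, doc):
    """Share of the document's tokens that are `word`."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """log(total documents / documents containing `word`), natural log assumed."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# Score every word of the first document.
for word in docs[0]:
    print(word, round(tf_idf(word, docs[0], docs), 3))
```

In this sketch "the" occurs in every document, so its IDF is log(3/3) = 0 and its score is zero, while "sat", which occurs in only one document, gets the highest score of the three.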

In [13]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [14]:
# Small example corpus: three short documents
documents = [
    "Natural Language Processing makes machines understand text.",
    "Text mining and NLP are used in machine learning.",
    "Feature extraction converts text into numerical form."
]

In [16]:
# Bag of Words: count how often each vocabulary word occurs in each document
print("------Bag of Words(BOW)-------")
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(documents)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW Matrix:\n", X_bow.toarray())

------Bag of Words(BOW)-------
Vocabulary: ['and' 'are' 'converts' 'extraction' 'feature' 'form' 'in' 'into'
 'language' 'learning' 'machine' 'machines' 'makes' 'mining' 'natural'
 'nlp' 'numerical' 'processing' 'text' 'understand' 'used']
BoW Matrix:
 [[0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 1 1 1 0]
 [1 1 0 0 0 0 1 0 0 1 1 0 0 1 0 1 0 0 1 0 1]
 [0 0 1 1 1 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0]]

In [20]:
# TF-IDF: weight each word count by how rare the word is across the documents
print("\n----- TF-IDF -----")
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(documents)
print("Vocabulary:", tfidf.get_feature_names_out())
print("TF-IDF Matrix:\n", X_tfidf.toarray())


----- TF-IDF -----
Vocabulary: ['and' 'are' 'converts' 'extraction' 'feature' 'form' 'in' 'into'
 'language' 'learning' 'machine' 'machines' 'makes' 'mining' 'natural'
 'nlp' 'numerical' 'processing' 'text' 'understand' 'used']
TF-IDF Matrix:
 [[0.         0.         0.         0.         0.         0.
  0.         0.         0.39687454 0.         0.         0.39687454
  0.39687454 0.         0.39687454 0.         0.         0.39687454
  0.2344005  0.39687454 0.        ]
 [0.34608857 0.34608857 0.         0.         0.         0.
  0.34608857 0.         0.         0.34608857 0.34608857 0.
  0.         0.34608857 0.         0.34608857 0.         0.
  0.20440549 0.         0.34608857]
 [0.         0.         0.39687454 0.39687454 0.39687454 0.39687454
  0.         0.39687454 0.         0.         0.         0.
  0.         0.         0.         0.         0.39687454 0.
  0.2344005  0.         0.        ]]
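The values in this matrix do not follow directly from the simple TF × IDF formula given earlier. With its default settings, `TfidfVectorizer` uses the raw count as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit Euclidean (L2) length. The sketch below recomputes the first row by hand under those defaults. The document frequencies are read off the BoW matrix above, and the names `doc_freq`, `smoothed_idf`, and `weights` are illustrative only.

```python
import math

n_docs = 3

# Document frequency of each term in the first document, read off the BoW matrix:
# "text" occurs in all three documents, the other six terms in one document each.
doc_freq = {
    "language": 1, "machines": 1, "makes": 1, "natural": 1,
    "processing": 1, "text": 3, "understand": 1,
}

def smoothed_idf(df, n):
    """Smoothed IDF as used by TfidfVectorizer when smooth_idf=True (the default)."""
    return math.log((1 + n) / (1 + df)) + 1

# Every term occurs exactly once in the first document, so the raw count (TF) is 1.
raw = {term: 1 * smoothed_idf(df, n_docs) for term, df in doc_freq.items()}

# L2-normalize the row, as norm="l2" (the default) does.
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
weights = {term: value / norm for term, value in raw.items()}

# These should agree with the first row of the matrix above.
print("text:   ", round(weights["text"], 4))
print("natural:", round(weights["natural"], 4))
```

Because "text" appears in every document, it ends up with the smallest weight in each row, which matches the intuition that widely shared words carry less information.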
