Feature extraction is the process of transforming raw data into a set of meaningful attributes or features that can be used for machine learning, data analysis, or pattern recognition. The goal is to extract the most relevant information from the raw data while reducing its complexity, making it easier for algorithms to analyze and process.

In machine learning, raw data (such as images, text, or time-series data) may contain many variables or characteristics, but not all of them are useful for prediction or analysis. Feature extraction helps to identify and isolate the most important aspects of the data that contribute to the task at hand (e.g., classification, regression, clustering).


# Converting text to a numerical format

We can apply two kinds of techniques:
1. One-Hot Encoding
2. Bag of Words

One-hot encoding represents every word as its own vocabulary-sized vector, so the complexity and dimensionality of the data grow quickly. Bag of words is usually the wiser choice, especially for sentiment analysis, because it is based on word frequency.
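
To make the size difference concrete, here is a minimal pure-Python sketch that encodes one sentence both ways. The helper names `one_hot` and `bag_of_words` are hypothetical and written only for this illustration; they are not library functions, and the whitespace tokenization is a simplifying assumption.

```python
# Toy corpus taken from the example used in this notebook
docs = [
    "people watch dswithbappy",
    "dswithbappy watch dswithbappy",
]

# Vocabulary: every distinct word, sorted so the column order is deterministic
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def one_hot(doc):
    """Hypothetical helper: one row per word, with a single 1 in that word's column."""
    rows = []
    for word in doc.split():
        row = [0] * len(vocab)
        row[index[word]] = 1
        rows.append(row)
    return rows

def bag_of_words(doc):
    """Hypothetical helper: one fixed-length vector of word counts per document."""
    counts = [0] * len(vocab)
    for word in doc.split():
        counts[index[word]] += 1
    return counts

one_hot_doc = one_hot(docs[1])
bow_doc = bag_of_words(docs[1])

print(vocab)
print(one_hot_doc)  # number of rows grows with the length of the sentence
print(bow_doc)      # always a single row of len(vocab) counts
```

The one-hot form needs a matrix of (words in the sentence) x (vocabulary size) for each document, while bag of words keeps a single vocabulary-sized vector regardless of sentence length.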

In [2]:
import numpy as np
import pandas as pd

In [3]:
df = pd.DataFrame({"text":["people watch dswithbappy",
"dswithbappy watch dswithbappy",
"people write comment",
"dswithbappy write comment"],"output":[1,1,0,0]})


In [4]:
df

| | text | output |
|---|---|---|
| 0 | people watch dswithbappy | 1 |
| 1 | dswithbappy watch dswithbappy | 1 |
| 2 | people write comment | 0 |
| 3 | dswithbappy write comment | 0 |


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()

In [7]:
bow=cv.fit_transform(df['text'])

In [9]:
print(cv.vocabulary_)

{'people': 2, 'watch': 3, 'dswithbappy': 1, 'write': 4, 'comment': 0}


In [10]:
bow.toarray()

array([[0, 1, 1, 1, 0],
       [0, 2, 0, 1, 0],
       [1, 0, 1, 0, 1],
       [1, 1, 0, 0, 1]], dtype=int64)

In Natural Language Processing (NLP), **n-grams** refer to contiguous sequences of `n` items (usually words) in a text. **Unigrams**, **bigrams**, and **trigrams** are specific types of n-grams where `n` is 1, 2, and 3, respectively. These are used as features in a **Bag of Words** (BoW) model to represent the text in a structured form.

### 1. **Unigrams (1-grams)**
Unigrams are individual words from the text. For example, the sentence:
- "I love coding"

Would be broken down into the following unigrams:
- **["I", "love", "coding"]**

### 2. **Bigrams (2-grams)**
Bigrams are pairs of consecutive words in a text. For example, the same sentence:
- "I love coding"

Would be broken down into the following bigrams:
- **["I love", "love coding"]**

### 3. **Trigrams (3-grams)**
Trigrams are triples of consecutive words. For the sentence:
- "I love coding"

The trigrams would be:
- **["I love coding"]**

### Bag of Words (BoW) Representation
The **Bag of Words** model treats a text as an unordered collection of words or n-grams (unigrams, bigrams, trigrams, etc.), without considering the sequence in which they appear. In a BoW approach, you count how many times each n-gram occurs in the text.

For example:
- **Text 1**: "I love coding"
- **Text 2**: "I love machine learning"

N-grams are extracted from each text separately, so no n-gram spans two texts. The distinct unigrams, bigrams, and trigrams across both texts are:

- **Unigrams**: ["I", "love", "coding", "machine", "learning"]
- **Bigrams**: ["I love", "love coding", "love machine", "machine learning"]
- **Trigrams**: ["I love coding", "I love machine", "love machine learning"]

Then, you can represent the frequency of each n-gram in a matrix or vector format.
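
As a rough illustration of the extraction and counting steps, the sketch below builds n-grams with plain Python. The `ngrams` helper is hypothetical (written for this example, not a library function) and assumes simple whitespace tokenization.

```python
from collections import Counter

def ngrams(text, n):
    """Hypothetical helper: return the contiguous n-word sequences of one text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

texts = ["I love coding", "I love machine learning"]

# N-grams are built per text, so none of them spans two different texts
unigrams_first = ngrams(texts[0], 1)
bigram_counts = Counter(bg for t in texts for bg in ngrams(t, 2))
trigram_counts = Counter(tg for t in texts for tg in ngrams(t, 3))

print(unigrams_first)
print(bigram_counts)
print(trigram_counts)
```

The `Counter` objects hold the frequency of each n-gram across both texts, which is the information a bag-of-words matrix stores column by column.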

### Usage in Machine Learning:
N-grams (unigrams, bigrams, trigrams) are widely used in feature extraction for text classification tasks, such as sentiment analysis, spam detection, etc. By representing text as vectors of n-gram frequencies, machine learning algorithms can process the textual data in a more structured form.

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus (list of sentences)
corpus = [
    "I love coding",
    "I love machine learning",
    "I enjoy solving problems with coding"
]

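# Create a CountVectorizer instance for unigrams and bigrams (ngram_range=(1, 2))
# Note: the default tokenizer lowercases the text and keeps only tokens of two or
# more characters, so "I" does not appear in the vocabulary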
vectorizer = CountVectorizer(ngram_range=(1, 2))

# Fit and transform the corpus to get n-grams
X = vectorizer.fit_transform(corpus)
print(vectorizer.vocabulary_)


{'love': 4, 'coding': 0, 'love coding': 5, 'machine': 7, 'learning': 3, 'love machine': 6, 'machine learning': 8, 'enjoy': 1, 'solving': 11, 'problems': 9, 'with': 13, 'enjoy solving': 2, 'solving problems': 12, 'problems with': 10, 'with coding': 14}


# TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in text processing to evaluate the importance of a word (or term) within a document relative to a corpus (a collection of documents). It combines two components: Term Frequency (TF) and Inverse Document Frequency (IDF).

1. Term Frequency (TF): measures how frequently a term appears in a document.

   TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)

2. Inverse Document Frequency (IDF): measures the importance of the term across the entire corpus. If a term appears in many documents, it is considered less important, and vice versa.

   IDF(t) = log ( (Total number of documents) / (Number of documents containing the term t) )

3. TF-IDF Score: the product of TF and IDF.

   TF-IDF(t, d) = TF(t, d) * IDF(t)
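
The three formulas can be checked by hand with a short sketch. The functions `tf`, `idf`, and `tf_idf` below are hypothetical helpers that follow the definitions above literally, using the natural logarithm and whitespace tokenization as assumptions. The corpus is the four-sentence example from earlier in this notebook.

```python
import math

docs = [
    "people watch dswithbappy",
    "dswithbappy watch dswithbappy",
    "people write comment",
    "dswithbappy write comment",
]
tokenized = [doc.split() for doc in docs]

def tf(term, tokens):
    """Occurrences of the term in the document divided by the document length."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """log(total documents / documents containing the term).

    Assumes the term occurs in at least one document; otherwise this divides by zero.
    """
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, tokens, corpus):
    """Product of TF and IDF, as in the formula above."""
    return tf(term, tokens) * idf(term, corpus)

# "watch" appears once among 3 words, and in 2 of the 4 documents
score_watch = tf_idf("watch", tokenized[0], tokenized)
# "dswithbappy" appears twice among 3 words, and in 3 of the 4 documents
score_bappy = tf_idf("dswithbappy", tokenized[1], tokenized)

print(round(score_watch, 4))
print(round(score_bappy, 4))
```

Even though "dswithbappy" occurs twice in its sentence, it scores lower than "watch" because it appears in more documents. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its values will differ from this literal version.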
