<a href="https://colab.research.google.com/github/expeditive/machine-learning/blob/main/feature_extracton.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Feature extraction** in machine learning (ML) is the process of transforming raw data into numerical features that a model can process, while preserving as much of the information in the original data set as possible. It's a key part of **feature engineering** and often has a large effect on how well your model performs.

---

### 🔍 What is Feature Extraction?
Feature extraction involves:
- Selecting important aspects of data.
- Converting data into a format that is suitable for ML models.
- Reducing data dimensionality (optional but common).
- Improving model accuracy and reducing training time.

---

### ✅ Why is Feature Extraction Important?
- Helps models learn better from data.
- Reduces noise and irrelevant data.
- Speeds up training and improves accuracy.
- Makes it possible to use complex data types (like images, text, audio).

---

### 📦 Common Feature Extraction Techniques

#### 1. **For Numerical Data**
- **Statistical features**: mean, standard deviation, min, max, etc.
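- **Polynomial features**: adding powers of variables and interaction terms between them (e.g., x₁², x₁·x₂).
- **Normalization/Standardization**: rescaling data to a fixed range (normalization, e.g., [0, 1]) or to zero mean and unit variance (standardization).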

#### 2. **For Text Data**
- **Bag of Words (BoW)**
- **TF-IDF (Term Frequency-Inverse Document Frequency)**
- **Word Embeddings**: Word2Vec and GloVe (static vectors per word), BERT (contextual vectors that depend on the surrounding sentence)

#### 3. **For Image Data**
- **Pixel intensity values**
- **Edge detection**: using filters like Sobel, Canny
- **CNN features**: using layers of a Convolutional Neural Network (deep learning)
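
To make edge detection concrete, here is a small NumPy-only sketch of the Sobel filter applied to a synthetic image. In practice a library such as `OpenCV` would normally be used; the helper name `sobel_magnitude` and the 6×6 test image are assumptions made for this example.

```python
import numpy as np

def sobel_magnitude(image: np.ndarray) -> np.ndarray:
    """Return the Sobel gradient magnitude for the interior pixels of a 2-D image."""
    kx = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)   # responds to horizontal intensity changes
    ky = kx.T                                   # responds to vertical intensity changes
    h, w = image.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    # Slide the 3x3 kernels over the image (cross-correlation, no padding)
    for i in range(3):
        for j in range(3):
            patch = image[i:i + h - 2, j:j + w - 2]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    return np.hypot(gx, gy)

# Synthetic image: dark left half, bright right half (one vertical edge)
img = np.zeros((6, 6))
img[:, 3:] = 1.0

edges = sobel_magnitude(img)
print(edges[0])   # [0. 4. 4. 0.]: strong response only around the edge
```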

#### 4. **For Audio Data**
- **MFCC (Mel Frequency Cepstral Coefficients)**
- **Chroma features**
- **Spectrograms**
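
A spectrogram can be sketched with NumPy alone: split the signal into overlapping frames, apply a window, and take the magnitude of each frame's FFT. The function name, frame length, hop size, and the 1 kHz test tone below are illustrative assumptions, not fixed conventions. MFCCs and chroma features are typically derived from this kind of time-frequency representation.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: rows are time frames, columns are frequency bins."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 8000                                # sample rate in Hz (assumed)
t = np.arange(sr) / sr                   # one second of samples
tone = np.sin(2 * np.pi * 1000 * t)      # synthetic 1 kHz sine wave

S = spectrogram(tone)
freqs = np.fft.rfftfreq(256, d=1 / sr)   # centre frequency of each bin
peak = freqs[S.mean(axis=0).argmax()]    # bin with the most energy on average

print(S.shape)   # (61, 129)
print(peak)      # 1000.0
```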

---

### 🧠 Feature Extraction vs Feature Selection
- **Feature Extraction** = creating new features from raw data (e.g., PCA, TF-IDF)
- **Feature Selection** = choosing a subset of existing features that are most relevant (e.g., using correlation, mutual information)
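
The difference shows up clearly in code. In this sketch (using the Iris data set bundled with `scikit-learn`, chosen only for convenience), PCA produces two new columns that mix all four original measurements, while `SelectKBest` returns two of the original columns unchanged.

```python
from functools import partial

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)   # 150 samples, 4 features

# Extraction: 2 new features, each a linear combination of all 4 originals
X_pca = PCA(n_components=2).fit_transform(X)

# Selection: keep the 2 original columns with the highest mutual information with y
# (random_state is fixed so the estimate is reproducible)
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func, k=2).fit(X, y)
X_sel = selector.transform(X)
kept = selector.get_support(indices=True)

print(X_pca.shape, X_sel.shape)             # (150, 2) (150, 2)
print(np.array_equal(X_sel, X[:, kept]))    # True: selected columns are untouched originals
```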

---

### 🛠️ Tools/Libraries Used
- **Python**: `scikit-learn`, `pandas`, `NumPy`
- **For NLP**: `NLTK`, `spaCy`, `transformers`
- **For images**: `OpenCV`, `TensorFlow`, `PyTorch`

---

### 🧠 TF-IDF Vectorizer in Machine Learning

**TF-IDF (Term Frequency - Inverse Document Frequency)** is a popular technique for **feature extraction from text data**. It transforms text into numerical vectors that reflect how important a word is to a document **relative to a collection (corpus)**.

---

### 📚 Breakdown of TF-IDF

#### ✅ 1. **Term Frequency (TF)**  
Measures how frequently a term appears in a document.

\[
\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total terms in document } d}
\]

#### ✅ 2. **Inverse Document Frequency (IDF)**  
Measures how rare a term is across the **entire corpus**. Terms that appear in few documents get higher scores; a term that appears in every document gets a score of zero.

\[
\text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents with term } t}\right)
\]

#### ✅ 3. **TF-IDF Score**  
\[
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
\]
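
The three formulas can be checked by hand with a few lines of plain Python. This is a minimal sketch: the two tokenized documents are made up, the helper names are hypothetical, and the natural logarithm is assumed since the formula does not fix a base.

```python
import math

# Two made-up documents, already split into tokens
docs = [
    "machine learning is fascinating".split(),
    "deep learning is a subset of machine learning".split(),
]

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(N / df); assumes the term occurs in at least one document."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "learning" appears in every document, so IDF = log(2/2) = 0
print(tf_idf("learning", docs[1], docs))      # 0.0
# "fascinating" appears in one of two documents: (1/4) * log(2)
print(tf_idf("fascinating", docs[0], docs))
```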

---

### 🛠️ Using `TfidfVectorizer` in Python (from `sklearn`)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
corpus = [
    "Machine learning is fascinating",
    "Learning algorithms are essential for machine learning",
    "Deep learning is a subset of machine learning"
]

# Initialize vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

# Show feature names (vocabulary)
print(vectorizer.get_feature_names_out())

# Convert to dense array and view TF-IDF scores
print(X.toarray())
```
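
The numbers printed by `TfidfVectorizer` will not match the textbook formulas exactly. With its default settings, scikit-learn uses raw term counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then L2-normalizes each document vector. The sketch below reproduces those defaults by hand on the same corpus; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "Machine learning is fascinating",
    "Learning algorithms are essential for machine learning",
    "Deep learning is a subset of machine learning",
]

counts = CountVectorizer().fit_transform(corpus).toarray()   # raw term counts
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                                 # documents containing each term

idf = np.log((1 + n_docs) / (1 + df)) + 1                     # smoothed IDF
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)   # L2-normalize rows

sk = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, sk))   # True
```

Because of the `+ 1` term, words that occur in every document (here "machine" and "learning") keep a small non-zero weight instead of dropping to zero.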

---

### ✅ When to Use TF-IDF:
- For **text classification** (e.g., spam detection, sentiment analysis)
- To down-weight **common words** (e.g., "the", "and") that carry little information about a document
- For **search engines** and **document similarity** tasks
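
As one example of the document-similarity use case, TF-IDF vectors can be compared with cosine similarity. The three short documents here are invented for the sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up documents: two on the same topic, one unrelated
docs = [
    "machine learning with python",
    "python machine learning tutorial",
    "baking sourdough bread at home",
]

X = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(X)    # sim[i, j] = similarity between document i and j

print(sim.round(2))
# Documents 0 and 1 share most of their words, so sim[0, 1] is high;
# documents 0 and 2 share none, so sim[0, 2] is 0.
```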

---

### 🧠 Tips:
- Combine with dimensionality reduction (e.g., TruncatedSVD for LSA)
- Use **stop-word** removal to ignore common words
- Adjust parameters like `max_df`, `min_df`, `ngram_range` for better performance
