# **Feature Engineering**

## 1. What is a parameter?
A **parameter** is an internal value of a machine learning model that is learned from the training data. For example, in linear regression, the slope (weight) and the intercept (bias) are parameters.
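
As a minimal sketch of this idea, the snippet below fits a linear regression and reads back the learned parameters. The data is made up for illustration (generated from y = 2x + 1), so the slope and intercept are known in advance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical, noise-free data generated from y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()
model.fit(X, y)

# The parameters are learned during fit(), not set by hand
slope = model.coef_[0]        # learned weight
intercept = model.intercept_  # learned bias
print(slope, intercept)
```

Because the data lies exactly on a line, the learned values should match the slope and intercept used to generate it, up to floating-point error.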

---

## 2. What is correlation?
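**Correlation** is a statistical measure that expresses the extent to which two variables are linearly related. The most common measure, the Pearson correlation coefficient, ranges from -1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship).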

---

## 3. What does negative correlation mean?
**Negative correlation** means that as one variable increases, the other tends to decrease. For example, as temperature increases, heater sales decrease.
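
The heater example can be sketched with a few made-up numbers. The temperatures and sales figures below are assumptions chosen for illustration, not real data.

```python
import numpy as np

# Hypothetical observations: warmer days, fewer heaters sold
temperature_c = np.array([5, 10, 15, 20, 25, 30])
heater_sales = np.array([120, 95, 80, 50, 30, 10])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# Pearson correlation between the two variables
r = np.corrcoef(temperature_c, heater_sales)[0, 1]
print(round(r, 3))
```

The coefficient comes out close to -1, which signals a strong negative linear relationship.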

---

## 4. Define Machine Learning. What are the main components in Machine Learning?
**Machine Learning** is a subset of AI that enables systems to learn from data and improve performance without being explicitly programmed.

**Main components:**
- Data
- Model
- Loss Function
- Optimizer
- Evaluation Metrics

---

## 5. How does loss value help in determining whether the model is good or not?
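The **loss value** measures the error between the model's predictions and the actual outputs. A lower loss on held-out (validation or test) data indicates a better-performing model; a low training loss paired with a much higher validation loss is a sign of overfitting.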
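
As a rough illustration, the sketch below computes mean squared error (one common loss for regression) for two hypothetical sets of predictions. The helper name `mean_squared_error` and all the numbers are assumptions for this example.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0]
preds_a = [2.5, 5.0, 7.5]  # hypothetical model A: close to the targets
preds_b = [1.0, 8.0, 4.0]  # hypothetical model B: far from the targets

loss_a = mean_squared_error(y_true, preds_a)
loss_b = mean_squared_error(y_true, preds_b)
print(loss_a, loss_b)
```

Model A has the smaller loss, which matches the intuition that its predictions are closer to the actual values.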

---

## 6. What are continuous and categorical variables?
- **Continuous variables** are numerical and can take any value within a range (e.g., height, weight).
- **Categorical variables** represent categories (e.g., gender, city).

---

## 7. How do we handle categorical variables in Machine Learning? What are the common techniques?
**1. Label Encoding**
- How it works: Converts each category to a number.

  - Example: {Male: 0, Female: 1}

- Use when: The categorical variable is ordinal (e.g., Low < Medium < High).

- Pros: Simple, compact.

- Cons: Implies an order even if there isn't one, so it is not ideal for nominal variables.

**2. One-Hot Encoding**
- How it works: Creates a new binary column for each category.

  - Example: City = Delhi, Mumbai, Kolkata →

    - Delhi: [1, 0, 0]

    - Mumbai: [0, 1, 0]

    - Kolkata: [0, 0, 1]

- Use when: The variable is nominal (no inherent order).

- Pros: No assumptions about ordering.

- Cons: Can lead to high dimensionality when there are many unique categories (the "curse of dimensionality").
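
One way to sketch one-hot encoding is with `pandas.get_dummies`. The small `City` column below is made up for illustration.

```python
import pandas as pd

# Hypothetical nominal feature
df = pd.DataFrame({"City": ["Delhi", "Mumbai", "Kolkata", "Delhi"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans.
# Columns are ordered alphabetically, not in order of appearance.
one_hot = pd.get_dummies(df["City"], prefix="City", dtype=int)
print(one_hot)
```

Each row has exactly one `1`, in the column for that row's city.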

**3. Ordinal Encoding**
- Similar to Label Encoding but explicitly assigns numbers based on a known order.

  - Example: Low = 1, Medium = 2, High = 3

- Use when: You have ordinal data.
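
A minimal sketch with scikit-learn's `OrdinalEncoder`, passing the known order explicitly through `categories`. The sample values are assumptions for illustration. Note that the encoder numbers categories from 0, not from 1 as in the example above.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature, one row per sample
levels = [["Low"], ["High"], ["Medium"], ["Low"]]

# Spell out the order so that Low < Medium < High is preserved
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
encoded = encoder.fit_transform(levels)
print(encoded.ravel())  # Low -> 0, Medium -> 1, High -> 2
```

Without the `categories` argument the encoder would sort the values alphabetically (High, Low, Medium), which would not reflect the intended order.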

**4. Binary Encoding**
- Converts categories to binary numbers and then splits into separate columns.

  - E.g., Category A = 1 → 001, B = 2 → 010

- Pros: Less dimensionality than one-hot encoding.

- Use when: You have many categories but want to keep dimensionality low.
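
Binary encoding can be sketched by hand with pandas and NumPy. This is an illustrative implementation under the numbering used above (A = 1, B = 2, ...), not a specific library's API; the column names `bit_2`, `bit_1`, `bit_0` are assumptions.

```python
import pandas as pd

# Hypothetical feature with five distinct categories
categories = pd.Series(["A", "B", "C", "D", "E", "A"])

# Step 1: integer code per category, starting at 1 (A=1, B=2, ...)
codes = categories.astype("category").cat.codes.to_numpy().astype(int) + 1

# Step 2: number of bits needed for the largest code (5 needs 3 bits)
n_bits = int(codes.max()).bit_length()

# Step 3: one column per bit, most significant bit first
binary = pd.DataFrame(
    {f"bit_{shift}": (codes >> shift) & 1 for shift in range(n_bits - 1, -1, -1)}
)
print(binary)
```

Five categories fit in three columns here, whereas one-hot encoding would need five.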

**5. Target Encoding (Mean Encoding)**
- Replaces each category with the mean of the target variable for that category.

  - Example (in a regression task): If houses in "Mumbai" have a mean price of ₹50L, encode "Mumbai" as 50.

- Use when: You have many categories and are doing supervised learning.

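- Caution: Can leak the target into the features and lead to overfitting, especially for rare categories. Compute the means on the training data only, ideally out-of-fold or with smoothing.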
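
A minimal sketch of target encoding with pandas, assuming a made-up housing table. The means are computed on the training rows only, and a category not seen in training ("Pune" here) falls back to the global training mean; that fallback is one common choice, not the only one.

```python
import pandas as pd

# Hypothetical training data: price in lakhs of rupees
train = pd.DataFrame({
    "City": ["Mumbai", "Mumbai", "Delhi", "Delhi", "Kolkata"],
    "Price": [52.0, 48.0, 40.0, 44.0, 30.0],
})
test = pd.DataFrame({"City": ["Delhi", "Mumbai", "Pune"]})

# Learn the mapping from the training data only
city_means = train.groupby("City")["Price"].mean()
global_mean = train["Price"].mean()

train["City_encoded"] = train["City"].map(city_means)
# Unseen categories map to NaN, so fill them with the global mean
test["City_encoded"] = test["City"].map(city_means).fillna(global_mean)
print(test)
```

Encoding the training rows with means that include their own targets is still optimistic; out-of-fold encoding or smoothing toward the global mean is a more robust variant.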

**6. Frequency or Count Encoding**
- Replace each category with its frequency/count.

  - E.g., if "Male" appears 100 times, encode as 100.

- Use when: Simplicity is preferred, or as a baseline.
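
Count and frequency encoding can both be sketched with `value_counts` and `map`. The sample column is made up for illustration.

```python
import pandas as pd

# Hypothetical categorical feature
gender = pd.Series(["Male", "Female", "Male", "Male", "Female"])

# Count encoding: replace each category with how often it occurs
counts = gender.value_counts()
count_encoded = gender.map(counts)

# Frequency encoding: same idea, as a share of all rows
frequencies = gender.value_counts(normalize=True)
freq_encoded = gender.map(frequencies)
print(count_encoded.tolist(), freq_encoded.tolist())
```

Categories that happen to share the same count become indistinguishable after this encoding, which is one reason it is often treated as a baseline.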

---

## 8. What do you mean by training and testing a dataset?
- The **training** dataset is used to fit the model, i.e., to learn its parameters.
- The **testing** dataset is used to evaluate model performance on unseen data.

---

## 9. What is sklearn.preprocessing?
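**sklearn.preprocessing** is a module in Scikit-learn that provides functions and classes to prepare your data for machine learning models.

Raw data is often messy, with inconsistent scales or categorical variables. The preprocessing module provides scalers (e.g., `StandardScaler`, `MinMaxScaler`) and encoders (e.g., `OneHotEncoder`, `OrdinalEncoder`, `LabelEncoder`) that transform it into a format ML algorithms can use effectively. Missing-value imputation lives in the separate `sklearn.impute` module.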

---

## 10. What is a Test set?
A **test set** is a portion of the dataset not seen by the model during training and used only to assess its final performance.

---

## 11. How do we split data for model fitting (training and testing) in Python?
```python
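from sklearn.model_selection import train_test_split

# Features must be 2D (one row per sample); targets are 1D
X = [[1], [2], [3], [4], [5]]
y = [1, 4, 9, 16, 25]

# Hold out 20% of the samples (here, 1 of 5) for testing;
# random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)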
```

---

## 12. How do you approach a Machine Learning problem?
1. Understand the problem
2. Collect and clean data
3. Perform EDA (Exploratory Data Analysis)
4. Preprocess data
5. Select and train model
6. Evaluate and tune
7. Deploy

---

## 13. Why do we have to perform EDA before fitting a model to the data?
**1. Understand the Data**
- Know what you're working with: types of variables, distributions, data types, etc.
- Helps identify target and feature relationships.

**2. Identify Missing or Incorrect Values**
- Many models cannot handle missing values, and placeholder entries (like 'N/A', '??', or a 0 standing in for "unknown") silently distort results.
- EDA helps you spot and handle them before training.

**3. Detect Outliers**
- Outliers can skew your model's predictions, especially in linear models.
- Visual tools like box plots help catch them.

**4. Visualize Distributions & Relationships**
- Use histograms, pair plots, heatmaps to understand feature distributions and correlations.

**5. Feature Engineering Ideas**
- EDA often reveals patterns that help create new features or transform existing ones.
- E.g., binning ages, log-transforming skewed data, etc.

**6. Choose the Right Model & Preprocessing**
- Some models require scaling (e.g., SVM), others don’t.
- If variables are highly correlated, you might choose to drop or combine them.

**7. Avoid Garbage-In, Garbage-Out**
- If the data is messy and misunderstood, the model will give poor results, no matter how advanced it is.
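
Several of the checks above can be sketched in a few lines of pandas. The table below is made up for illustration, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a missing value, a placeholder, and an outlier
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51, 29],
    "income": [30_000, 42_000, 58_000, 61_000, 1_000_000, 35_000],
    "city": ["Delhi", "Mumbai", "??", "Delhi", "Kolkata", "Mumbai"],
})

# 1. Understand the data: column types and summary statistics
print(df.dtypes)
print(df.describe())

# 2. Missing or incorrect values
missing_per_column = df.isna().sum()
placeholder_rows = df["city"].isin(["??", "N/A"]).sum()

# 3. Outliers via the 1.5 * IQR rule
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

# 4. Relationships between numeric features
corr = df[["age", "income"]].corr()
print(missing_per_column, placeholder_rows, outliers, corr, sep="\n")
```

In a notebook, plots such as histograms and box plots would normally accompany these numeric checks.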

---

## 14. How can you find correlation between variables in Python?
```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
correlation_matrix = df.corr()
correlation_matrix
```

---

## 15. What is causation? Explain difference between correlation and causation with an example.
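- **Causation** means that a change in one variable directly produces a change in another.
- **Correlation** means two variables move together; by itself it does not imply that either one causes the other.

Example:
- Correlation: Ice cream sales and drowning deaths increase together.
- Not causation: Ice cream sales do not cause drownings. Hot weather is a common cause (a confounder) that independently increases both.
- Causation: Hot weather leads more people to swim, which increases drownings.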
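
The ice cream example can be simulated to show how a shared cause produces correlation without causation. Everything below is synthetic: the coefficients, noise levels, and the helper `residuals` are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Temperature drives both quantities; neither affects the other
temperature = rng.normal(25, 5, n)
ice_cream_sales = 10 * temperature + rng.normal(0, 20, n)
drownings = 0.5 * temperature + rng.normal(0, 1, n)

# The raw correlation between the two effects is strongly positive
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(y, x):
    """Part of y left over after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# After removing the effect of temperature, the correlation vanishes
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]
print(round(raw_corr, 2), round(partial_corr, 2))
```

The raw correlation is high, but once temperature is accounted for, the remaining correlation is close to zero.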

---

## 16. What is an Optimizer? What are different types of optimizers? Explain each with an example.
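An **optimizer** is the algorithm that updates a model's parameters to minimize the loss function.

Common types:
- **SGD (Stochastic Gradient Descent)**: updates weights using the gradient computed on a single sample or a small mini-batch, rather than on the full dataset.
- **Adam**: combines momentum with per-parameter adaptive learning rates.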

```python
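from tensorflow.keras.optimizers import Adam, SGD

# Adam with a learning rate of 0.01
optimizer = Adam(learning_rate=0.01)
# Plain SGD with momentum, for comparison
sgd_optimizer = SGD(learning_rate=0.01, momentum=0.9)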
```
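
To make the update rule concrete, here is a rough from-scratch sketch of mini-batch SGD fitting a line with NumPy. The data (y = 3x + 2 plus noise), learning rate, batch size, and epoch count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: y = 3x + 2 plus a little noise
X = rng.uniform(-1, 1, size=200)
y = 3.0 * X + 2.0 + rng.normal(0, 0.1, size=200)

w, b = 0.0, 0.0        # parameters to learn
learning_rate = 0.1
batch_size = 16

for epoch in range(100):
    order = rng.permutation(len(X))          # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        error = (w * X[idx] + b) - y[idx]
        # Gradients of the mean squared error on this mini-batch
        grad_w = 2 * np.mean(error * X[idx])
        grad_b = 2 * np.mean(error)
        # Step against the gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))
```

The learned `w` and `b` should land near 3 and 2. Adam follows the same loop but rescales each step using running averages of the gradients and their squares.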

---

## 17. What is sklearn.linear_model?
**sklearn.linear_model** is a module in Scikit-learn that provides linear models for regression and classification.

It includes some of the most widely used algorithms in machine learning, like Linear Regression, Logistic Regression, Ridge, Lasso, and more.

---

## 18. What does model.fit() do? What arguments must be given?
**model.fit()** is the method that trains your machine learning model. It takes your features (X) and target/labels (y) and learns the patterns in the data.

```python
model.fit(X_train, y_train)
```
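Arguments:
  - Features (X): array-like of shape (n_samples, n_features)
  - Target (y): array-like of shape (n_samples,); omitted for unsupervised estimators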

---

## 19. What does model.predict() do? What arguments must be given?
**model.predict()** is used after training a model with **fit()**. It takes new input data (X) and uses the learned patterns to make predictions (outputs).
```python
predictions = model.predict(X_test)
```
Arguments:
  - Features to predict on (X), with the same columns, in the same order, as the training data
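
Putting `fit()` and `predict()` together, a minimal sketch with `LogisticRegression` might look like this. The data is hypothetical: hours studied as the single feature and pass (1) or fail (0) as the label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: hours studied -> fail (0) / pass (1)
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)          # learn from features and labels

# New samples must be 2D as well: one row per sample
X_new = np.array([[1.5], [8.5]])
predictions = model.predict(X_new)   # one predicted label per row
print(predictions)
```

`predict()` returns one output per input row, so `predictions` has the same length as `X_new`.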

---

## 20. What is feature scaling? How does it help in Machine Learning?
Feature scaling is the process of transforming your features (input variables) so that they are all on a similar scale, so that no feature dominates the others simply because of its magnitude.

For example:

- Height in cm: [150, 170, 180]

- Salary in INR: [20,000, 60,000, 1,00,000]

Even though both are important, salary will dominate distance and gradient computations because of its larger scale. That is where scaling helps.

Feature scaling helps in Machine Learning because it:
- Improves the performance of models that are sensitive to scale:
  - K-Nearest Neighbors (KNN)

  - Support Vector Machines (SVM)

  - Gradient Descent-based algorithms

  - Principal Component Analysis (PCA)

- Makes training faster and more stable (especially for neural nets and optimization-based models).

- Makes features easier to compare when plotting them together.

---

## 21. How do we perform scaling in Python?

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_X = scaler.fit_transform([[1, 2], [3, 4], [5, 6]])
scaled_X
```
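
In practice the scaler is usually fitted on the training data only and then reused on the test data, so that test statistics do not leak into training. A small sketch, reusing the height and salary values from the earlier example plus one made-up test row:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: height in cm, salary in INR
X_train = np.array([[150.0, 20_000.0], [170.0, 60_000.0], [180.0, 100_000.0]])
X_test = np.array([[160.0, 40_000.0]])  # hypothetical unseen sample

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training data
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses those training statistics; it does not refit
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, so height and salary contribute on comparable scales.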

---

## 22. Explain data encoding?
**Data encoding** is converting categorical data to numerical form so that it can be used by ML algorithms.

Types:
- Label Encoding
- One-Hot Encoding
- Binary Encoding

```python
# Label Encoding
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
# Classes are sorted alphabetically: cat -> 0, dog -> 1, mouse -> 2
labels = label_encoder.fit_transform(['cat', 'dog', 'dog', 'mouse'])
labels  # array([0, 1, 1, 2])
```