# Feature Engineering

1. What is a parameter?
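   - A parameter is a value passed into a function, method, or procedure to customize its behavior. Inside the function, parameters act as variables that receive input from the caller. In machine learning, the term also refers to the internal values a model learns from data during training, such as the weights and bias of a linear model.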

2. What is correlation?
   - Correlation is a statistical measure that describes the relationship between two variables. It tells us whether and how strongly two variables are related.

   - What does negative correlation mean?
   - Negative correlation means that as one variable increases, the other variable decreases. In other words, they move in opposite directions.

3. Define Machine Learning. What are the main components in Machine Learning?
   - Machine Learning is a branch of artificial intelligence (AI) that enables systems to learn from data and make predictions or decisions without being explicitly programmed. Instead of following strict rules, ML algorithms improve their performance over time by recognizing patterns in data.

   - Main components of Machine Learning:
     - Data
     - Features (Input Variables)
     - Model
     - Training
     - Evaluation (Testing & Validation)
     - Hyperparameters & Optimization
     - Deployment

4. How does loss value help in determining whether the model is good or not?
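   - The loss value is a numerical measure of how far a machine learning model's predictions are from the actual target values. Lower loss on held-out data generally indicates a better model, while higher loss suggests poor performance. A low loss on the training data alone is not enough, because the model may simply have memorized that data.

   - How loss helps in evaluating a model:
     - Measures prediction error: the loss summarizes the gap between predictions and targets in a single number.
     - Guides model optimization: the optimizer adjusts the model's parameters to reduce the loss.
     - Helps detect overfitting or underfitting: a training loss far below the validation loss suggests overfitting, while a high loss on both suggests underfitting.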
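As a minimal sketch of how a loss value separates a better model from a worse one, the snippet below computes mean squared error (MSE) for two sets of predictions against the same targets. The helper `mse` and the arrays `preds_a` and `preds_b` are illustrative assumptions, not the output of a real model.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


y_true = [3.0, 5.0, 7.0, 9.0]
preds_a = [2.5, 5.5, 7.0, 8.5]   # hypothetical model A: close to the targets
preds_b = [1.0, 8.0, 4.0, 12.0]  # hypothetical model B: far from the targets

loss_a = mse(y_true, preds_a)
loss_b = mse(y_true, preds_b)

print(f"Model A loss: {loss_a}")  # 0.1875
print(f"Model B loss: {loss_b}")  # 7.75
```

Model A has the lower loss, so on this data it is the better of the two.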

5. What are continuous and categorical variables?
   - In data analysis and machine learning, variables are classified into two main types: continuous and categorical variables.

   - Continuous Variables:
     - A continuous variable can take an infinite number of values within a given range. These values are usually numerical and can be measured rather than counted.
   - Categorical Variables:
     - A categorical variable represents distinct groups or categories. These variables are usually non-numeric (but can be represented numerically, like 0 or 1) and are typically countable.

6. How do we handle categorical variables in Machine Learning? What are the common techniques?
   - Since most machine learning models work with numerical data, categorical variables must be converted into a numerical format before training.
   
   - The common techniques for encoding categorical variables are:
     - Label Encoding
     - One-Hot Encoding
     - Ordinal Encoding
     - Target Encoding (Mean Encoding)
     - Hashing Encoding
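The sketch below illustrates two of these techniques with Scikit-Learn: ordinal encoding for a feature whose categories have a natural order, and one-hot encoding (one binary column per category) for a feature without one. The toy `sizes` and `colors` arrays are assumptions made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordered categories: pass the order explicitly so small < medium < large
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal.fit_transform(sizes)
print(size_codes.ravel())  # [0. 2. 1. 0.]

# Unordered categories: one binary column per category (sorted alphabetically)
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
onehot = OneHotEncoder(handle_unknown="ignore")
color_matrix = onehot.fit_transform(colors).toarray()
print(onehot.categories_[0])  # ['blue' 'green' 'red']
print(color_matrix)
```

One-hot encoding avoids implying an order between categories, at the cost of adding one column per category.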

7. What do you mean by training and testing a dataset?
   - In machine learning, we split a dataset into two main parts: training and testing datasets. This helps evaluate how well a model can generalize to unseen data.

   - Training Dataset:
     - The training dataset is used to train the machine learning model.
     - The model learns patterns from the input data by adjusting its internal parameters.
     - It usually makes up 70-80% of the total dataset.

   - Testing Dataset:
     - The testing dataset is used to evaluate the trained model.
     - It contains new data that the model has never seen before, so it can be used to check the model's performance.
     - It usually makes up 20-30% of the total dataset.

8. What is sklearn.preprocessing?
   - sklearn.preprocessing is a module in Scikit-Learn that provides various techniques for scaling, encoding, and transforming data to prepare it for machine learning models.

9. What is a Test set?
   - A test set is a portion of the dataset used to evaluate the performance of a trained machine learning model. It contains new data that the model has never seen before to check how well it generalizes to unseen data.

10. How do we split data for model fitting (training and testing) in Python?
    - To train and evaluate a machine learning model, we need to split our dataset into the two sets below. In Python, this is typically done with `train_test_split` from `sklearn.model_selection`.
      - Training Set → Used to train the model (typically 70-80% of data).
      - Test Set → Used to evaluate the model (typically 20-30% of data).

    - How do you approach a Machine Learning problem?
      - Solving a machine learning (ML) problem involves these systematic steps:
        1. Understand the Problem
        2. Collect and Explore Data (EDA - Exploratory Data Analysis)
        3. Data Preprocessing & Feature Engineering
        4. Split Data for Training & Testing
        5. Choose and Train a Model
        6. Evaluate Model Performance
        7. Hyperparameter Tuning & Model Optimization
        8. Deploy the Model
        9. Monitor and Improve
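The core of this workflow (preprocess, split, train, evaluate) can be sketched in a few lines. The example below is one possible arrangement, not a prescribed recipe: the choice of the bundled Iris dataset, `StandardScaler`, and `LogisticRegression` are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Collect data: a small dataset that ships with Scikit-Learn
X, y = load_iris(return_X_y=True)

# Split before any fitting so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocess and train: the pipeline fits the scaler on the training data only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate on the held-out test set
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.2f}")
```

Wrapping the scaler and the model in one pipeline keeps test data out of the preprocessing step, which is a common source of overly optimistic results.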

11. Why do we have to perform EDA before fitting a model to the data?
    - Exploratory Data Analysis (EDA) is critical before fitting a model because it helps you understand your data, detect issues, and make informed preprocessing decisions. Skipping EDA can lead to poor model performance, misleading results, and overfitting.

12. What is correlation?
    - Correlation is a statistical measure that describes the strength and direction of a relationship between two variables. It helps us understand how one variable changes in relation to another.

13. What does negative correlation mean?
    - A negative correlation means that as one variable increases, the other decreases. In other words, they move in opposite directions.

14. How can you find correlation between variables in Python?
    - Python provides several ways to compute the correlation between variables. The most common are `DataFrame.corr()` in Pandas, which returns a pairwise correlation matrix, and `numpy.corrcoef()` in NumPy.
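A short sketch of both approaches follows. The column names and values are made-up illustrative data; `hours_gaming` is chosen so that it falls as `hours_studied` rises, giving a negative correlation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 71, 80],
    "hours_gaming": [9, 8, 6, 5, 2],
})

# Pandas: pairwise Pearson correlation matrix for all numeric columns
corr_matrix = df.corr()
print(corr_matrix)

# NumPy: correlation coefficient between two specific variables
r = np.corrcoef(df["hours_studied"], df["hours_gaming"])[0, 1]
print(f"hours_studied vs hours_gaming: {r:.3f}")  # close to -1: strong negative correlation
```

Values near +1 indicate a strong positive relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship.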

15. What is causation? Explain the difference between correlation and causation with an example.
    - Causation (also called cause-and-effect) means that one variable directly influences another. If A causes B, then changing A will directly lead to a change in B.

    - Correlation:
      - A statistical relationship between two variables, which does not by itself show that one causes the other.
      - Example: ice cream sales ↑ and drowning deaths ↑. The two rise together, but neither causes the other; hot summer weather drives both.

    - Causation:
      - A relationship in which a change in one variable directly produces a change in the other.
      - Example: studying more → higher grades.
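The ice cream example can be simulated to show how a shared cause produces correlation without causation. This is a sketch on synthetic data: the coefficients, noise levels, and the helper `residuals` are assumptions chosen for illustration. Temperature drives both quantities, and once its effect is removed, the remaining correlation is close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hidden common cause: daily temperature
temperature = rng.uniform(10, 35, size=n)

# Both quantities depend on temperature, but not on each other
ice_cream_sales = 3.0 * temperature + rng.normal(0, 5, size=n)
drownings = 0.2 * temperature + rng.normal(0, 1, size=n)

raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]


def residuals(y, x):
    """Remove the linear effect of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Correlation that remains after controlling for temperature
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"Raw correlation: {raw_corr:.2f}")
print(f"After controlling for temperature: {partial_corr:.2f}")
```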

16. What is an Optimizer? What are different types of optimizers? Explain each with an example?
    - An optimizer is an algorithm used in Machine Learning and Deep Learning to adjust model parameters (weights and biases) to minimize the loss function. The goal of an optimizer is to improve model performance by reducing errors.

    - Types of Optimizers in Machine Learning:
      - Gradient Descent (GD)
      - Stochastic Gradient Descent (SGD)
      - Mini-Batch Gradient Descent
      - Momentum
      - AdaGrad (Adaptive Gradient Algorithm)
      - RMSprop (Root Mean Square Propagation)
      - Adam (Adaptive Moment Estimation)

  

Gradient Descent (GD): Example

```python
import numpy as np

# Cost function: f(x) = x^2
def cost_function(x):
    return x**2

# Gradient: f'(x) = 2x
def gradient(x):
    return 2*x

# Gradient Descent Algorithm
x = 10  # Initial point
learning_rate = 0.1
for i in range(10):  # 10 iterations
    x = x - learning_rate * gradient(x)
    print(f"Iteration {i+1}: x = {x}, Cost = {cost_function(x)}")
```

Output:

```text
Iteration 1: x = 8.0, Cost = 64.0
Iteration 2: x = 6.4, Cost = 40.96000000000001
Iteration 3: x = 5.12, Cost = 26.2144
Iteration 4: x = 4.096, Cost = 16.777216
Iteration 5: x = 3.2768, Cost = 10.73741824
Iteration 6: x = 2.62144, Cost = 6.871947673600001
Iteration 7: x = 2.0971520000000003, Cost = 4.398046511104002
Iteration 8: x = 1.6777216000000004, Cost = 2.8147497671065613
Iteration 9: x = 1.3421772800000003, Cost = 1.801439850948199
Iteration 10: x = 1.0737418240000003, Cost = 1.1529215046068475
```


Stochastic Gradient Descent (SGD): Example

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Example data following y = 2x
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

# SGD model with a constant learning rate. No random_state is set,
# so the fitted values vary slightly from run to run.
sgd = SGDRegressor(learning_rate="constant", eta0=0.1, max_iter=1000)
sgd.fit(X, y)

print(f"Weight: {sgd.coef_}, Bias: {sgd.intercept_}")
```

Sample output from one run:

```text
Weight: [1.90349378], Bias: [0.2170783]
```


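Mini-Batch Gradient Descent:

Example: used in deep learning frameworks such as TensorFlow and PyTorch, where each parameter update is computed on a small batch of samples rather than on the full dataset or a single sample.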
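To make the idea concrete without a deep learning framework, here is a minimal NumPy sketch of mini-batch gradient descent fitting a line. The synthetic data (generated around y = 2x + 1), the batch size, and the learning rate are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data around the line y = 2x + 1
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.05, size=200)

w, b = 0.0, 0.0
learning_rate = 0.1
batch_size = 20

for epoch in range(100):
    order = rng.permutation(len(X))  # reshuffle the samples every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        x_batch, y_batch = X[idx, 0], y[idx]

        # Gradient of the mean squared error on this batch only
        error = w * x_batch + b - y_batch
        grad_w = 2.0 * np.mean(error * x_batch)
        grad_b = 2.0 * np.mean(error)

        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")  # close to the true values 2 and 1
```

Each update uses 20 samples, which is cheaper than a full pass over the data and less noisy than updating on one sample at a time.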

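Momentum Optimization:

Example: available in deep learning libraries (Keras, PyTorch, TensorFlow), typically as a `momentum` argument on the SGD optimizer. Momentum accumulates a running "velocity" of past gradients and uses it to update the parameters.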
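The update rule can be sketched by hand on the same cost function f(x) = x^2 used in the gradient descent example. The hyperparameter values below (`momentum = 0.9`, 300 iterations) are illustrative assumptions.

```python
# Gradient of the cost function f(x) = x^2
def gradient(x):
    return 2 * x

x = 10.0          # initial point
velocity = 0.0    # running combination of past gradients
learning_rate = 0.1
momentum = 0.9    # fraction of the previous velocity that is kept

for i in range(300):
    # Blend the previous velocity with the current gradient step
    velocity = momentum * velocity - learning_rate * gradient(x)
    x = x + velocity

print(f"x after 300 iterations: {x:.6f}")  # very close to the minimum at 0
```

On this simple bowl-shaped function the iterate overshoots and oscillates around the minimum before settling; momentum tends to pay off more on surfaces with long, narrow valleys.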

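RMSprop (Root Mean Square Propagation):

Example: available in TensorFlow and Keras. RMSprop divides each update by the square root of a running average of recent squared gradients, so the effective step size adapts to the gradient's magnitude.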
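A hand-written sketch of the RMSprop update on f(x) = x^2 is shown below. It is a simplified single-variable version under assumed hyperparameters (`decay = 0.9`, `epsilon = 1e-8`), not the exact implementation of any library.

```python
import math

# Gradient of the cost function f(x) = x^2
def gradient(x):
    return 2 * x

x = 10.0
avg_sq_grad = 0.0   # running average of squared gradients
learning_rate = 0.1
decay = 0.9
epsilon = 1e-8      # avoids division by zero

for i in range(500):
    g = gradient(x)
    # Update the running average of squared gradients
    avg_sq_grad = decay * avg_sq_grad + (1 - decay) * g ** 2
    # Scale the step by the root of that average
    x -= learning_rate * g / (math.sqrt(avg_sq_grad) + epsilon)

print(f"x after 500 iterations: {x}")  # ends near the minimum at 0
```

Because the gradient is divided by its own recent magnitude, the step size depends mostly on the learning rate rather than on how steep the cost function is.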

Adam (Adaptive Moment Estimation): Example

```python
# Requires TensorFlow to be installed
from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=0.001)
```


17. What is sklearn.linear_model?
    - sklearn.linear_model is a module in Scikit-Learn that provides various linear models for regression and classification tasks. It includes algorithms like Linear Regression, Logistic Regression, Ridge, Lasso, and more.

18. What does model.fit() do? What arguments must be given?
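    - In Scikit-Learn, model.fit() is used to train a machine learning model on a given dataset. It learns the patterns from the training data and updates the model's internal parameters (e.g., weights in linear regression, decision tree splits, etc.).

    - Arguments required for fit():
      - `X`: the training features, an array-like of shape (n_samples, n_features).
      - `y`: the target values, required for supervised estimators. Unsupervised estimators and transformers (e.g., KMeans, StandardScaler) need only `X`.
    - Optional parameters for fit():
      - Many estimators accept `sample_weight` to weight individual samples. Other optional arguments depend on the estimator.
    - When NOT to use fit():
      - Do not call fit() on the test set. Fit on the training data only, then use predict() (or transform() for preprocessing steps) on the test data.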
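A minimal sketch of a `fit()` call, assuming a `LinearRegression` estimator and toy data that follows y = 2x (both chosen only for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# X must be 2-D: shape (n_samples, n_features)
X = np.array([[1], [2], [3], [4], [5]])
# y holds one target value per sample
y = np.array([2, 4, 6, 8, 10])

model = LinearRegression()
fitted = model.fit(X, y)  # fit() returns the estimator itself

# Learned parameters are stored in attributes ending with an underscore
print(f"Weight: {model.coef_}, Bias: {model.intercept_}")  # roughly [2.] and 0.0
```

Because `fit()` returns the estimator, calls can be chained, for example `LinearRegression().fit(X, y).predict(X)`.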

19. What does model.predict() do? What arguments must be given?
    - model.predict() is used to make predictions after a machine learning model has been trained with model.fit(). It takes new input data and returns the model’s predicted outputs based on learned patterns.

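    - Required arguments for predict():
      - `X`: the samples to predict on, with the same number of features (columns) that was used during fit().
    - How predict() works internally:
      - It applies the parameters learned during fit() to the new inputs. For example, a linear regression model computes the dot product of the inputs and its coefficients, then adds the intercept.
    - Common errors in predict():
      - Calling predict() before fit() raises `NotFittedError`.
      - Passing data with a different number of features than the training data, or a 1-D array instead of a 2-D one, raises `ValueError`.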
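The sketch below shows a normal `predict()` call alongside two typical failure modes: predicting before the model is fitted, and passing the wrong number of features. The `LinearRegression` estimator and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression

X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([2, 4, 6, 8, 10])  # y = 2x
X_new = np.array([[6], [7]])

model = LinearRegression()

# Error 1: predicting before the model has been fitted
try:
    model.predict(X_new)
    raised_not_fitted = False
except NotFittedError:
    raised_not_fitted = True

# Normal use: fit first, then predict on new samples
model.fit(X_train, y_train)
predictions = model.predict(X_new)
print(predictions)  # approximately [12. 14.]

# Error 2: two features passed to a model trained on one
try:
    model.predict(np.array([[6, 1]]))
    raised_shape_error = False
except ValueError:
    raised_shape_error = True

print(raised_not_fitted, raised_shape_error)  # True True
```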
    
20. What are continuous and categorical variables?
    - In machine learning and statistics, variables are classified into continuous and categorical based on the type of data they represent.

    - Continuous Variables:
      - These variables can take any numerical value within a range and can be measured (e.g., weight, height, temperature).
    - Categorical Variables:
      - These variables represent distinct groups or categories (e.g., color, country, product type). Any numbers used to label the categories carry no numerical meaning.

21. What is feature scaling? How does it help in Machine Learning?
    - Feature Scaling is a technique used in machine learning to normalize or standardize numerical features so that they are on the same scale. It ensures that no single feature dominates others due to differences in magnitude.

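    - How does it help in Machine Learning?
      - Improves model performance for algorithms that are sensitive to feature magnitude.
      - Speeds up training: gradient-based optimizers tend to converge faster when features share a similar scale.
      - Prevents features with large ranges from dominating features with small ranges.
      - Essential for distance-based algorithms such as k-nearest neighbors and k-means.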

22. How do we perform scaling in Python?
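    - Feature scaling can be done with the `sklearn.preprocessing` module in Scikit-Learn. The two most common techniques are:
      - Min-Max Scaling (Normalization), via `MinMaxScaler`: rescales each feature to a fixed range, [0, 1] by default.
      - Standardization (Z-score Normalization), via `StandardScaler`: rescales each feature to zero mean and unit variance.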
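Both techniques can be sketched on a small array whose two columns differ in magnitude by a factor of 100. The data is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Min-Max scaling: each column is mapped to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)
print(X_minmax)  # rows: [0, 0], [0.5, 0.5], [1, 1]

# Standardization: each column gets mean 0 and standard deviation 1
X_standard = StandardScaler().fit_transform(X)
print(X_standard.mean(axis=0), X_standard.std(axis=0))
```

In a real project the scaler would usually be fitted on the training set only (`scaler.fit(X_train)`) and then applied to the test set with `scaler.transform(X_test)`, so that no information from the test data influences the scaling.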

23. What is sklearn.preprocessing?
    - sklearn.preprocessing is a module in Scikit-Learn that provides tools for feature transformation and scaling in Machine Learning. It helps in normalizing, encoding, and standardizing data to improve model performance.

24. How do we split data for model fitting (training and testing) in Python?
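    - When building a Machine Learning model, we need to split the dataset into the two sets below. In Python, this is typically done with `train_test_split` from `sklearn.model_selection`.
      - Training Set → Used to train the model
      - Test Set → Used to evaluate model performance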
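A minimal sketch of the split using `train_test_split` from `sklearn.model_selection`. The toy arrays, the 80/20 ratio, and the fixed `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 10 samples, 2 features, balanced binary labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% of the samples go to the test set
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y,        # keep the class proportions equal in both sets
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Setting `random_state` makes the split repeatable, and `stratify=y` is mainly useful for classification problems with imbalanced classes.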

25. Explain data encoding.
    - Data Encoding is the process of converting categorical data (text or labels) into numerical form so that machine learning models can process it.
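As a small illustration of turning text categories into numbers, the sketch below uses `pandas.get_dummies` to create one 0/1 column per category. The `city` column and its values are made-up example data.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Tokyo", "Paris", "Lima"]})

# One 0/1 column per distinct city; dtype=int gives integers instead of booleans
encoded = pd.get_dummies(df, columns=["city"], dtype=int)
print(encoded)
```

Each row has exactly one 1, in the column that matches its original category.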

