# ***Feature Engineering***

* Theory questions


1. What is a Parameter?

 *   A parameter is a variable internal to the model whose value is estimated from the training data. Parameters are the quantities the learning algorithm optimizes during training to improve model performance. They are not set manually but are learned automatically.

They directly affect the model’s predictions by shaping the learned patterns or relationships in the data. For instance, in a linear regression model, the slope and intercept are parameters that define the equation of the line fitted to the data.

Parameters differ from hyperparameters, which are set before training and control the learning process itself (like learning rate, number of trees, etc.).
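
As a minimal sketch of this distinction, the snippet below fits a linear regression on made-up data generated from y = 2x + 1. The data and variable names are assumptions chosen for illustration only; fit_intercept is set before training (a hyperparameter), while coef_ and intercept_ are learned from the data (parameters).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 1 (no noise); illustrative only.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

# fit_intercept is a hyperparameter: it is chosen before training.
model = LinearRegression(fit_intercept=True)
model.fit(X, y)

# coef_ (slope) and intercept_ are parameters: estimated from the data.
slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)
```

On this noise-free data the learned slope and intercept should recover the values used to generate it, up to floating-point error.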

2.  What is correlation?
   
  What does negative correlation mean?

 *  Correlation is a statistical measure that expresses the extent to which two variables are linearly related. It indicates whether an increase in one variable tends to be accompanied by an increase or a decrease in the other; it does not by itself show that one variable causes the other to change. The value of correlation lies between -1 and +1, where:

 *  +1 indicates a perfect positive correlation

 *  -1 indicates a perfect negative correlation

 *  0 indicates no linear correlation


There are several types of correlation:

1) Pearson correlation: Measures linear correlation.

2) Spearman rank correlation: Measures monotonic relationships (not necessarily linear).

3) Kendall Tau: A rank-based measure of association that compares the number of concordant and discordant pairs of observations.


---



*  A negative correlation means that the two variables move in opposite directions. In other words, when the value of one variable increases, the value of the other tends to decrease. This is represented by a correlation coefficient below 0, with –1 as the lower limit.

The closer the coefficient is to –1, the stronger the negative correlation. A value of –1 indicates a perfect inverse relationship.

Negative correlation is especially important when analyzing features in a dataset. For example, in health-related datasets, you might observe that as physical activity increases, body fat percentage decreases, showing a negative correlation.





3.  Define Machine Learning. What are the main components in Machine Learning?
 *  Machine Learning (ML) is a subset of artificial intelligence (AI) that enables systems to learn patterns from data and make decisions or predictions without being explicitly programmed for every task. It focuses on developing algorithms that improve automatically through experience.

Rather than hardcoding logic for every possible scenario, machine learning systems identify complex patterns in data and adapt their behavior over time.

At its core, ML is about building mathematical models based on data to perform specific tasks like classification, regression, clustering, and more.


---


Main Components in Machine Learning:

* Dataset: The collection of data used for training and testing. It includes both input features (independent variables) and outputs or labels (dependent variables in supervised learning).

* Features (Input Variables): These are the measurable properties or characteristics of the data used to make predictions. Feature engineering often enhances the quality of these inputs.

* Model (Algorithm): The mathematical structure or function used to map inputs to outputs. Examples include Decision Trees, Linear Regression, Neural Networks, etc.

* Loss Function: A function that measures how far the model’s prediction is from the actual target value. It quantifies error, guiding the model during training.

* Optimizer: An algorithm that updates the model’s parameters to minimize the loss function. Common optimizers include Gradient Descent and its variants.

* Training Process: The stage where the model learns patterns from the input data by adjusting its internal parameters to reduce error.

* Evaluation Metrics: Metrics used to assess the performance of the trained model, such as accuracy, precision, recall, mean squared error, etc.

* Testing and Validation: The phase where the trained model is applied to unseen data to check how well it generalizes.




4.  How does loss value help in determining whether the model is good or not?
 *   The loss value in machine learning is a numerical representation of how far off the model’s predictions are from the actual target values. It is calculated using a loss function, which measures the difference between predicted outputs and the actual outputs for a given set of inputs.

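A lower loss value means the model’s predictions are closer to the actual values, while a higher loss value indicates poorer predictions. The loss is therefore a direct indicator of how well the model fits the data it is computed on. To judge whether the model is actually good, compare the training loss with the loss on validation or test data: a low training loss paired with a much higher validation loss signals overfitting, while a high loss on both signals underfitting.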
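
To make the idea concrete, here is a small sketch that computes mean squared error (one common loss function) for two sets of predictions. The helper name mse and the numbers are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0]
good_pred = [2.5, 5.0, 7.5]   # close to the targets
poor_pred = [0.0, 9.0, 2.0]   # far from the targets

good_loss = mse(y_true, good_pred)   # (0.25 + 0 + 0.25) / 3
poor_loss = mse(y_true, poor_pred)   # (9 + 16 + 25) / 3
print(good_loss, poor_loss)
```

The predictions that sit closer to the targets produce the smaller loss, which is the comparison the loss value is meant to support.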

5. What are Continuous and Categorical Variables?

 *  In machine learning and statistics, variables are the attributes or characteristics of data. These variables can be classified into two main types based on the nature of their values:

* Continuous Variables: These are numerical variables that can take an infinite number of values within a given range. They are measurable quantities and can have decimal values. Examples include height, temperature, weight, age, and price. They support mathematical operations like addition, subtraction, averaging, etc.

* Categorical Variables: These are variables that represent categories or groups. They contain a finite number of distinct values and are often non-numerical. Categorical data can be:

1)  Nominal (no inherent order, e.g., gender, color, city)

2) Ordinal (ordered categories, e.g., low, medium, high)



6. How do we handle categorical variables in Machine Learning? What are the common techniques?

 *  Handling categorical variables in machine learning refers to the process of converting non-numeric data (categories or labels) into a numerical format that machine learning algorithms can understand and use effectively.

Since most ML models require numeric input, this transformation is essential for training and prediction. The chosen method depends on the type of categorical data—nominal (unordered) or ordinal (ordered).


Common Techniques to Handle Categorical Variables:

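* Label Encoding: Converts each category into a unique integer (0, 1, 2…). Because the integers imply an order, it suits ordinal variables (or tree-based models) better than nominal ones. Note that scikit-learn's LabelEncoder assigns integers in sorted order of the labels and is intended for target labels rather than input features.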

* One-Hot Encoding: Creates binary (0/1) columns for each category. Best for nominal data.

* Ordinal Encoding: Manually assigns numbers to categories based on order or hierarchy.

* Binary Encoding / Hashing / Target Encoding (Advanced): Used for high-cardinality features (e.g., hundreds of categories). Balances dimensionality and information content.
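
A short sketch of two of these techniques follows, assuming a made-up DataFrame with a nominal column (color) and an ordinal column (size). It uses pandas get_dummies for one-hot encoding and scikit-learn's OrdinalEncoder with an explicitly supplied category order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: "color" is nominal, "size" is ordinal.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["low", "high", "medium", "low"],
})

# One-hot encoding: one 0/1 column per category, no order implied.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Ordinal encoding: pass the order explicitly so low < medium < high.
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
size_codes = encoder.fit_transform(df[["size"]]).ravel()

print(one_hot)
print(size_codes)
```

Supplying the category order by hand matters here: without it, the encoder would order the labels alphabetically, which would not match low < medium < high.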



7. What do you mean by training and testing a dataset?

 *  A dataset is typically divided into two main parts: the training dataset and the testing dataset. This separation helps evaluate the model's ability to generalize to unseen data.

* Training Dataset: This is the portion of the data used to train the machine learning model. The algorithm learns patterns, relationships, and features from this subset by adjusting internal parameters (like weights).

* Testing Dataset: This is a separate portion of the data that is not seen by the model during training. It is used to evaluate the model’s performance and check how well it generalizes to new, unseen inputs.




8.  What is sklearn.preprocessing?

 *   The sklearn.preprocessing module is part of the popular machine learning library scikit-learn and provides essential tools for transforming and preparing data for use with machine learning models. The primary aim of preprocessing is to improve the performance and accuracy of machine learning algorithms by ensuring that the data is in an appropriate format. This module includes functionality such as scaling numerical features, encoding categorical variables, binarizing or discretizing values, and creating polynomial features. Missing-value imputation is provided by the separate sklearn.impute module.

9.  What is a Test set?
 *   A test set is a subset of the dataset that is used to evaluate the performance of a trained machine learning model. The test set contains data that the model has never seen before during the training process. Its primary purpose is to assess how well the model generalizes to new, unseen data, providing a more accurate reflection of the model's ability to perform in real-world scenarios.

The test set serves as a final checkpoint after training and validation. While the training set is used to train the model and the validation set is used to tune hyperparameters, the test set is strictly reserved for evaluating the model's final performance. This separation helps reveal whether the model has overfitted to the training data and whether it can make predictions effectively on new, unseen data.

10.  How do we split data for model fitting (training and testing) in Python?
 *   Splitting the data into training and testing sets is a fundamental step that ensures the model is evaluated on data it hasn't seen before, helping to assess its generalization ability. The process typically involves dividing the dataset into at least two subsets: the training set and the test set.

 * Training Set: This subset is used to train the machine learning model. The model learns the underlying patterns and relationships in the data from this set.

* Test Set: This subset is used to evaluate the performance of the model after it has been trained. The goal is to ensure the model performs well on unseen data, which helps to understand how it will generalize to real-world data.


The most common method for splitting data is the train_test_split function from the sklearn.model_selection module. This function randomly splits the dataset into training and testing subsets.
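
A minimal sketch of such a split is shown below. The arrays are made up, and the 80/20 ratio and random_state value are assumptions chosen for illustration rather than recommended settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 10 samples, 2 features, binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows for testing; random_state makes the
# shuffle reproducible from run to run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

With 10 samples and test_size=0.2, 8 rows go to the training set and 2 to the test set.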

11. How do you approach a Machine Learning Problem?

 *  Approaching a machine learning problem involves several critical steps that guide the process from problem understanding to model deployment. The steps can vary depending on the problem, but they typically follow a general workflow:

* Problem Definition: Clearly define the problem you are trying to solve.

* Data Collection and Understanding: Collect the data that will be used to train and evaluate the model.

* Data Preprocessing: Clean and preprocess the data by handling missing values, encoding categorical variables, scaling or normalizing features, and addressing any data imbalances or outliers.

* Data Splitting: Split the data into training, validation, and test sets.

* Model Selection: Choose a suitable machine learning algorithm based on the problem type.

* Model Training: Train the model on the training data.

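* Model Evaluation: Evaluate the model on the validation set (or with cross-validation) using metrics suited to the problem.

* Hyperparameter Tuning: Adjust the hyperparameters of the model to improve its validation performance, then use the held-out test set once for a final check of how well it generalizes to unseen data.

* Model Interpretation and Deployment: Interpret the model’s results, understand its behavior, and check for overfitting or underfitting before deploying the model and monitoring it in use.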

12. Why do we have to perform EDA before fitting a model to the data?
 *  Exploratory Data Analysis (EDA) is the process of thoroughly examining, visualizing, and summarizing a dataset before applying any machine learning model. The main objective of EDA is to understand the structure, patterns, and quality of the data in order to make informed decisions about data preprocessing, feature engineering, and model selection.

EDA acts as a bridge between raw data and model building. It helps identify underlying trends, anomalies, missing values, outliers, and relationships between variables. Without performing EDA, we risk feeding incorrect or misleading information into the model, which can lead to inaccurate predictions and poor generalization on unseen data.
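
The sketch below shows a few typical first EDA checks with pandas on a small made-up DataFrame: summary statistics, missing-value counts, category frequencies, and an outlier check using the 1.5 × IQR rule. The column names and values are hypothetical, and the IQR rule is only one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one extreme income.
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51],
    "income": [30000, 42000, 58000, 61000, 1000000],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
})

summary = df.describe()                    # count, mean, std, quartiles
missing = df.isna().sum()                  # missing values per column
city_counts = df["city"].value_counts()    # frequency of each category

# Flag incomes outside 1.5 * IQR of the quartiles as potential outliers.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

print(summary)
print(missing)
print(outliers)
```

Checks like these surface the missing age and the extreme income before any model sees the data, which informs the later choices about imputation and scaling.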

13.  What is correlation?
*  Correlation is a statistical measure that expresses the extent to which two variables are linearly related. It indicates whether an increase in one variable tends to be accompanied by an increase or a decrease in the other; it does not by itself show that one variable causes the other to change. The value of correlation lies between -1 and +1, where:

 *  +1 indicates a perfect positive correlation

 *  -1 indicates a perfect negative correlation

 *  0 indicates no linear correlation


There are several types of correlation:

1) Pearson correlation: Measures linear correlation.

2) Spearman rank correlation: Measures monotonic relationships (not necessarily linear).

3) Kendall Tau: A rank-based measure of association that compares the number of concordant and discordant pairs of observations.

14.  What does negative correlation mean?
  *  A negative correlation means that the two variables move in opposite directions. In other words, when the value of one variable increases, the value of the other tends to decrease. This is represented by a correlation coefficient below 0, with –1 as the lower limit.

The closer the coefficient is to –1, the stronger the negative correlation. A value of –1 indicates a perfect inverse relationship.

Negative correlation is especially important when analyzing features in a dataset. For example, in health-related datasets, you might observe that as physical activity increases, body fat percentage decreases, showing a negative correlation.


15. How can you find correlation between variables in Python?

 *  Common ways to find the correlation between variables in Python:

* Using the pandas DataFrame.corr() method: This method computes pairwise correlations between the columns of a DataFrame, using the Pearson coefficient by default. It can also compute Kendall and Spearman correlations through the method parameter.

* Visualizing Correlation with a Heatmap: A heatmap helps visually identify strong or weak correlations between variables using color intensities.

* Choosing the Correlation Method:

  * method='pearson': Measures linear correlation (most common)

  * method='spearman': Based on rank, useful for non-linear but monotonic relationships

  * method='kendall': Also based on rank, better for small datasets or ordinal data
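
As a brief sketch, the snippet below calls DataFrame.corr() on a made-up DataFrame. The column names and values are hypothetical; x_cubed is included because it is perfectly monotonic in activity_hours but not linear, which shows where Pearson and Spearman differ. Kendall's tau is selected the same way, with method='kendall'.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "activity_hours": [1, 2, 3, 4, 5],
    "body_fat_pct": [30, 27, 25, 21, 18],
    "x_cubed": [1, 8, 27, 64, 125],   # activity_hours ** 3
})

pearson = df.corr(method="pearson")     # linear relationship
spearman = df.corr(method="spearman")   # rank-based, monotonic relationship

print(pearson.round(3))
print(spearman.round(3))
```

Here the Spearman coefficient between activity_hours and x_cubed is exactly 1 because the ranks agree perfectly, while the Pearson coefficient is below 1 because the relationship is not a straight line.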

16.  What is causation? Explain difference between correlation and causation with an example.
 *  Causation refers to a direct relationship between two variables, where one variable is responsible for causing a change in the other. In other words, when a change in variable A directly results in a change in variable B, we say there is a cause-and-effect relationship between them.


Difference between correlation and causation:

* Direction of Relationship: Correlation means that two variables move together in some way, either positively or negatively, whereas causation means that one variable directly causes the change in another.

* Dependency: In correlation, variables are statistically related but may not depend on each other; in causation, one variable is dependent on the other due to a direct influence.

* Evidence: Correlation can be measured using statistical formulas like Pearson’s coefficient, but causation requires scientific proof, experiments, or theoretical reasoning to confirm a cause-effect link.

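* Possibility of Misinterpretation: Correlation can occur due to coincidence or a third variable (confounder), while causation must be established by ruling out external influences.

Example: A commonly cited illustration is that ice cream sales and drowning incidents both rise in summer, so the two are positively correlated, yet buying ice cream does not cause drowning; hot weather is a confounder that drives both. By contrast, pressing a car's accelerator directly makes the car speed up, which is causation.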



17.  What is an Optimizer? What are different types of optimizers? Explain each with an example.

 * An optimizer is a crucial component in machine learning and deep learning algorithms, especially during the training of models. It is a mathematical function or algorithm that adjusts the model’s internal parameters (weights and biases) to minimize the loss function, which measures how far the model’s predictions are from the actual values.

In other words, the optimizer's job is to improve the model's performance by iteratively updating the weights in such a way that the loss (error) is reduced as much as possible. This process is typically performed using techniques based on gradient descent or its variations.

There are several types of optimizers used in machine learning which are:

* Gradient Descent: This is the most basic form of optimization algorithm. It computes the gradient of the loss function using the entire training dataset and updates the model weights accordingly.

Example: A teacher evaluates the performance of an entire class before giving collective feedback. This approach gives an overall picture but is slow and resource-intensive, especially if the class is huge.

* Stochastic Gradient Descent: SGD updates model parameters using only one data point at a time. Each update is much cheaper than a full pass over the dataset, so training progresses faster, but the individual updates are noisy.

Example: A coach corrects an athlete after every single move instead of waiting for the whole session to end. Feedback is quick and frequent, but might sometimes be based on noisy or misleading observations.

* Mini-Batch Gradient Descent: A middle-ground between batch and SGD, this method uses small chunks or batches of data (like 32 or 64 samples) to compute updates. It balances accuracy and speed.

Example: A teacher divides the class into small groups and gives feedback to each group separately. This makes learning efficient and avoids overwhelming either the teacher or students.

*  Momentum Optimizer: Momentum builds on previous updates by adding a “memory” of past gradients. It helps the model move faster in the right direction and prevents it from getting stuck in small dips or oscillating.

Example: Imagine rolling a ball downhill. Even if it hits a small bump, its momentum helps it roll past the obstacle instead of stopping or turning around.

* AdaGrad: AdaGrad adjusts the learning rate for each parameter individually. It reduces the learning rate over time for frequently updated parameters and gives more focus to rarely updated ones.

Example: Suppose you're studying different subjects. If you're good at math but weak in history, you start spending less time on math and more on history to balance your overall performance.

*  RMSProp: RMSProp improves AdaGrad by keeping a moving average of squared gradients, preventing the learning rate from becoming too small.

Example: Think of a smart manager who doesn’t overreact to every single piece of feedback but makes decisions based on a running average of employee performance, allowing for better long-term improvement.

* Adam: Adam is one of the most widely used optimizers today. It combines the advantages of Momentum and RMSProp, making it efficient, stable, and adaptable to different problems.

Example: A top-tier mentor keeps track of both the direction of your recent progress (momentum) and how large your recent corrections have been (squared-gradient magnitude), and customizes advice based on both. This leads to balanced, fast, and stable improvements.
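
To illustrate the update rules behind two of these optimizers, the sketch below minimizes a toy one-dimensional loss f(w) = (w - 3)^2 with plain gradient descent and with momentum. The loss, the function names, and the learning-rate and beta values are assumptions chosen for illustration; real training code would normally rely on a library's optimizer implementations.

```python
def grad(w):
    # Gradient of the toy loss f(w) = (w - 3)^2, which is minimized at w = 3.
    return 2.0 * (w - 3.0)

def gradient_descent(w0, lr=0.1, steps=200):
    """Plain gradient descent: step against the current gradient."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def momentum_descent(w0, lr=0.1, beta=0.9, steps=200):
    """Momentum: keep a decaying memory (velocity) of past gradients."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        velocity = beta * velocity - lr * grad(w)
        w += velocity
    return w

w_gd = gradient_descent(w0=0.0)
w_mom = momentum_descent(w0=0.0)
print(w_gd, w_mom)
```

Both runs should end close to the minimizer w = 3. The momentum version overshoots and oscillates on the way there, which is the "rolling ball" behavior described above.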

18.  What is sklearn.linear_model ?

 *  sklearn.linear_model is a module in the Scikit-learn (sklearn) library that provides a collection of linear models used for both regression and classification tasks. These models work on the principle that the output or target variable can be expressed as a linear combination of input features. The module contains classes that implement various types of linear algorithms, allowing developers and researchers to apply mathematical models that assume a linear relationship between input and output.

19. What does model.fit() do? What arguments must be given?
 *  In machine learning using libraries like Scikit-learn, the model.fit() function is used to train a machine learning model on a given dataset. This is one of the most important steps in the model-building process.

When you create a model (for example, using LinearRegression(), DecisionTreeClassifier(), or any other algorithm), the model is only a blank container—it doesn't know anything about your data yet. The fit() method fills that container by feeding in the data, allowing the model to learn the relationships or patterns between input features and target outputs.

Necessary arguments are:

*  X (Features/Input Data): A 2D array-like structure (such as list of lists, NumPy array, or pandas DataFrame) containing independent variables or features.
Each row is one data point; each column is one feature.

* y (Target/Labels): A 1D array-like structure (such as list, NumPy array, or pandas Series) containing the dependent variable or the value we want to predict.

It should have the same number of entries as rows in X, where each entry corresponds to the correct output for the input in the same row.

20.  What does model.predict() do? What arguments must be given?
 *  After a machine learning model has been trained using the model.fit() method, the model.predict() function is used to make predictions on new, unseen data. This method allows the model to apply the patterns it has learned during training to estimate or classify outcomes for new inputs.

 Necessary arguments are:

*  X_new (Input Data for Prediction):

This is the only required argument.

It must be a 2D array-like structure (e.g., a list of lists, NumPy array, or pandas DataFrame), similar in structure to the training data used in model.fit().

It should include the same number of features/columns as the training data, but can have any number of rows (i.e., data points).
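
The following sketch puts fit() and predict() together, assuming made-up training data generated from y = 2*x1 + x2 + 1. The array names and values are hypothetical; the point is the shapes: X is 2D, y is 1D with one entry per row of X, and the new inputs have the same number of columns as the training data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data: 4 rows (data points), 2 columns (features).
X_train = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])
# Targets from y = 2*x1 + x2 + 1, one entry per row of X_train.
y_train = X_train @ np.array([2.0, 1.0]) + 1.0

model = LinearRegression()
model.fit(X_train, y_train)        # learn coefficients from (X, y)

# New inputs: any number of rows, but the same 2 feature columns.
X_new = np.array([[5.0, 1.0], [0.0, 0.0]])
predictions = model.predict(X_new)
print(predictions)
```

Because the toy targets are an exact linear function of the features, the predictions for the new rows match that function up to floating-point error.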



21. What are Continuous and Categorical Variables?

 *  In machine learning and statistics, variables are the attributes or characteristics of data. These variables can be classified into two main types based on the nature of their values:

* Continuous Variables: These are numerical variables that can take an infinite number of values within a given range. They are measurable quantities and can have decimal values. Examples include height, temperature, weight, age, and price. They support mathematical operations like addition, subtraction, averaging, etc.

* Categorical Variables: These are variables that represent categories or groups. They contain a finite number of distinct values and are often non-numerical. Categorical data can be:

1)  Nominal (no inherent order, e.g., gender, color, city)

2) Ordinal (ordered categories, e.g., low, medium, high)



22.  What is feature scaling? How does it help in Machine Learning?
 *  Feature scaling is a technique in machine learning used to standardize or normalize the range of independent variables (features) in a dataset. In many real-world datasets, features often have different units, magnitudes, or ranges. For example, age may range from 0 to 100, while income may range from thousands to millions. This difference in scale can create imbalance and lead to biased results during model training.

Feature scaling transforms features to a common scale without distorting their underlying relationships. The goal is to ensure that all features contribute equally to the learning process, especially in algorithms that are sensitive to the magnitude of values.

Its importance in machine learning:

* Improves Convergence and Accuracy: In algorithms like logistic regression, support vector machines, and neural networks, large differences in feature magnitudes can slow down or mislead the optimization process. Feature scaling helps the model converge faster and learn more effectively.

* Removes Bias Toward Higher-Scale Features: Algorithms such as K-Nearest Neighbors (KNN) or K-Means clustering use distance metrics like Euclidean distance. If features are not scaled, the algorithm may be biased toward features with larger numerical ranges, even if those features are less important.

* Enhances Interpretability of Coefficients:
In linear models, scaled features allow for better interpretation and comparison of feature importance.

* Prevents Numerical Instability:
Unscaled data can result in unstable gradients or matrix operations (like covariance matrices in PCA), leading to poor training results or even runtime errors.

23.  How do we perform scaling in Python?
 *  Feature scaling is commonly performed using the sklearn.preprocessing module from the Scikit-learn library. This module provides various tools to scale or normalize numerical data so that features contribute equally to a machine learning model.

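Scaling is generally performed after splitting the data into training and testing sets and before feeding the data into the model. The scaler should be fitted on the training set only and then applied to both sets, so that no information from the test set leaks into training. The purpose is to bring all features to a comparable scale, ensuring better performance for algorithms that are sensitive to feature magnitudes.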

Some methods in python to perform feature scaling:

* Min-Max Scaling (Normalization) with MinMaxScaler
This method scales features to a fixed range, [0, 1] by default.

* Standardization (Z-score Normalization) with StandardScaler
This method transforms features to have a mean of 0 and standard deviation of 1.

* MaxAbs Scaling with MaxAbsScaler
This scales each feature by its maximum absolute value, useful for sparse data.

* Robust Scaling with RobustScaler
This method uses the median and interquartile range, making it robust to outliers.
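
A minimal sketch of scaling with scikit-learn follows, assuming a made-up two-feature array (age and income). It fits StandardScaler and MinMaxScaler on the training rows only and reuses the fitted scaler to transform the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales: age and income.
X = np.array([
    [25, 30000], [32, 42000], [47, 58000],
    [51, 61000], [62, 90000], [38, 45000],
], dtype=float)

X_train, X_test = train_test_split(X, test_size=2, random_state=0)

# Standardization: fit on the training rows only, then reuse for the test rows.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Min-max scaling to the default [0, 1] range, again fitted on training data.
minmax = MinMaxScaler()
X_train_mm = minmax.fit_transform(X_train)

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_train_mm.min(axis=0), X_train_mm.max(axis=0))
```

After standardization each training column has mean 0 and standard deviation 1, and after min-max scaling each training column spans 0 to 1. The transformed test rows can fall slightly outside those ranges because the scalers never saw them during fitting.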

24.  What is sklearn.preprocessing?

 *   The sklearn.preprocessing module is part of the popular machine learning library scikit-learn and provides essential tools for transforming and preparing data for use with machine learning models. The primary aim of preprocessing is to improve the performance and accuracy of machine learning algorithms by ensuring that the data is in an appropriate format. This module includes functionality such as scaling numerical features, encoding categorical variables, binarizing or discretizing values, and creating polynomial features. Missing-value imputation is provided by the separate sklearn.impute module.

25.  How do we split data for model fitting (training and testing) in Python?
 *   Splitting the data into training and testing sets is a fundamental step that ensures the model is evaluated on data it hasn't seen before, helping to assess its generalization ability. The process typically involves dividing the dataset into at least two subsets: the training set and the test set.

 * Training Set: This subset is used to train the machine learning model. The model learns the underlying patterns and relationships in the data from this set.

* Test Set: This subset is used to evaluate the performance of the model after it has been trained. The goal is to ensure the model performs well on unseen data, which helps to understand how it will generalize to real-world data.


The most common method for splitting data is the train_test_split function from the sklearn.model_selection module. This function randomly splits the dataset into training and testing subsets.

26.  Explain data encoding?
 *  Data encoding is the process of converting categorical data into numerical values so that it can be used effectively by machine learning algorithms. Since most ML models work with numerical data, categorical variables—such as names, labels, or categories—must be encoded into a numerical format before model training.

Encoding ensures that the algorithm understands and processes categorical inputs correctly without misinterpreting them as ordinal or mathematically related unless they truly are. It plays a crucial role in feature engineering, especially when handling textual or label-based data columns.