# **Feature Engineering**


1.  What is a parameter?
   - A parameter is an internal value of a machine learning model that is learned from the training data, such as the weights and bias of a linear model. The model uses its learned parameters to make predictions.
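
As a minimal sketch of what "learned from the training data" means, the snippet below fits a scikit-learn `LinearRegression` on made-up data generated from y = 3x + 2 and reads back the learned parameters. The data and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 3x + 2 (made up for illustration)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0

model = LinearRegression()
model.fit(X, y)

# coef_ and intercept_ are the parameters the model learned from the data
learned_slope = float(model.coef_[0])
learned_intercept = float(model.intercept_)
print(f"slope={learned_slope:.3f}, intercept={learned_intercept:.3f}")
```

Because the toy data is noiseless, the learned slope and intercept should land very close to the values used to generate it.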

2.  What is correlation? What does negative correlation mean?
    - Correlation is a statistical measure that describes the strength and direction of the relationship between two variables. It quantifies how changes in one variable are associated with changes in another. The most common measure is the Pearson correlation coefficient, which ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear relationship.
    - A negative correlation means that as one variable increases, the other tends to decrease, and vice versa; the variables move in opposite directions. For example, a correlation coefficient of -0.8 between hours spent watching TV and exam scores suggests that more TV time is associated with lower exam scores.

3.  Define Machine Learning. What are the main components in Machine Learning?
   - Machine Learning is a subset of artificial intelligence that enables computers to learn from data and improve their performance on a task without being explicitly programmed for it. The main components of machine learning are:
     - Data: The raw information used for training and testing models.
     - Algorithms: Mathematical procedures or rules that process data to find patterns.
     - Model: The output of the learning process, which can make predictions or decisions.
     - Training: The process of feeding data to the algorithm to learn patterns.
     - Evaluation: Assessing the model's performance on unseen data.

4.  How does loss value help in determining whether the model is good or not?
   - The loss value, computed by a loss function, quantifies the difference between the model's predictions and the actual values. A lower loss means the predictions are closer to the true values, which suggests better performance. Monitoring the loss on both the training data and held-out validation data shows whether the model is learning effectively: if the validation loss starts rising while the training loss keeps falling, the model is beginning to overfit and training should stop.
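
To make the idea concrete, here is a small sketch that computes one common loss, mean squared error, for two sets of predictions. The target values and both prediction lists are hypothetical numbers chosen for illustration.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0]
close_predictions = [2.5, 5.0, 7.5]  # hypothetical output of a well-fit model
far_predictions = [1.0, 8.0, 4.0]    # hypothetical output of a poorly fit model

loss_close = mean_squared_error(y_true, close_predictions)
loss_far = mean_squared_error(y_true, far_predictions)
print(loss_close, loss_far)
```

The predictions that sit closer to the targets produce the smaller loss, which is the signal used to compare models or training steps.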

5.  What are continuous and categorical variables?
   - Continuous variables are numerical variables that can take any value within a range (e.g., height, weight, temperature).
   - Categorical variables are variables that represent categories or groups (e.g., gender, color, type of animal). They can be nominal (no order) or ordinal (ordered categories).


6.  How do we handle categorical variables in Machine Learning? What are the common techniques?
   - Most machine learning algorithms require numerical input, so categorical variables must be converted into numbers first. Common techniques include:
     - Label Encoding: Assigns a unique integer to each category. The integers imply an order, so this is best suited to target labels or tree-based models.
     - One-Hot Encoding: Creates one binary column per category (1 if present, 0 otherwise).
     - Ordinal Encoding: Assigns ordered integers to categories with a natural order.
     - Binary Encoding: Converts each category's integer code into binary digits and stores each digit in its own column.
   - Label, one-hot, and ordinal encoding are available in scikit-learn and pandas; binary encoding is provided by the third-party category_encoders package.
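
The following sketch shows three of these techniques on a tiny, made-up DataFrame, using `pandas.get_dummies`, `OrdinalEncoder`, and `LabelEncoder`. The column names and category values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical data with a nominal feature, an ordinal feature, and a target
df = pd.DataFrame({
    "color": ["red", "green", "blue"],
    "size": ["small", "large", "medium"],
    "label": ["cat", "dog", "cat"],
})

# One-hot encoding: one indicator column per color
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding: integers that follow an explicit small < medium < large order
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal.fit_transform(df[["size"]]).ravel()

# Label encoding: integer codes for a target column (classes sorted alphabetically)
label_codes = LabelEncoder().fit_transform(df["label"])

print(one_hot)
print(size_codes, label_codes)
```

Passing the category order explicitly to `OrdinalEncoder` matters here; without it the categories would be ordered alphabetically rather than by size.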

7.  What do you mean by training and testing a dataset?
   - Training means fitting the model on one portion of the data (the training set) so that it learns patterns and relationships.
   - Testing means evaluating the trained model on a separate, unseen portion of the data (the test set) to assess how well it generalizes.


8.  What is sklearn.preprocessing?
   - sklearn.preprocessing is a module in scikit-learn that provides functions and classes for preparing data before modeling. It includes tools for scaling, normalizing, and encoding categorical variables, among others. (Missing-value imputation lives in the separate sklearn.impute module.) Preprocessing puts data into a format suitable for machine learning algorithms.

9.  What is a Test set?
   - A test set is a subset of the data that is kept separate from the training data. It is used exclusively to evaluate the final performance of a trained model, providing an unbiased estimate of how the model will perform on new, unseen data.

10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?
    - In Python, you can use the train_test_split function from sklearn.model_selection to split your data into training and testing sets. The code cell at the end of this answer holds out 20% of the rows for testing.
    - A typical approach to a machine learning problem includes:
      - Understanding the problem and data.
      - Data collection and preprocessing (cleaning, encoding, scaling).
      - Exploratory Data Analysis (EDA) to uncover patterns and relationships.
      - Feature engineering to create or select relevant features.
      - Model selection and training.
      - Model evaluation using appropriate metrics.
      - Hyperparameter tuning and optimization.
      - Deployment and monitoring.

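In [None]:
# X (feature matrix) and y (target vector) are assumed to be defined already
from sklearn.model_selection import train_test_split
# test_size=0.2 holds out 20% of the rows; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)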
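
The split and the core workflow steps from this answer can be sketched end to end as follows. This is only an illustration: it assumes the bundled iris dataset as a stand-in for your own data and picks `LogisticRegression` and accuracy arbitrarily as the model and metric.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small bundled dataset (stand-in for your own X and y)
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows; stratify keeps class proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train on the training set only
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the unseen test set
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(X_train.shape, X_test.shape, round(test_accuracy, 3))
```

The test rows are never seen during fitting, so the accuracy printed at the end is an estimate of performance on new data rather than a measure of memorization.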

11.  Why do we have to perform EDA before fitting a model to the data?
   - Exploratory Data Analysis (EDA) helps you understand the data's structure, detect anomalies such as missing values and outliers, identify relationships between variables, and select relevant features. Doing this before modeling helps confirm that the data is suitable for the model and reduces the risk of issues like overfitting, underfitting, or data leakage.
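
A few pandas calls cover much of a first EDA pass. The sketch below runs them on a tiny, made-up DataFrame; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and one categorical column
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51],
    "income": [40000, 52000, 81000, 60000, 90000],
    "segment": ["a", "b", "a", "b", "a"],
})

summary = df.describe()                         # count, mean, std, quartiles for numeric columns
missing_per_column = df.isna().sum()            # number of missing values in each column
segment_counts = df["segment"].value_counts()   # frequency of each category
numeric_corr = df[["age", "income"]].corr()     # pairwise Pearson correlation

print(summary)
print(missing_per_column)
print(segment_counts)
print(numeric_corr)
```

Checks like these surface missing values and skewed categories before they silently affect a fitted model.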

12.  What is correlation?
   - Correlation is a statistical measure that describes the strength and direction of the relationship between two variables. It quantifies how changes in one variable are associated with changes in another. The most common measure is the Pearson correlation coefficient, which ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear relationship.

13.  How can you find correlation between variables in Python?
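   - You can use the corr() method of a pandas DataFrame to compute the pairwise correlation matrix of its columns (Pearson by default), as in the cell below.
   - For a specific pair of variables, you can use scipy.stats.pearsonr, which also returns a p-value, or numpy.corrcoef.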

In [2]:
import pandas as pd

# Sample data (replace with your actual data)
data = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}

df = pd.DataFrame(data)
correlation_matrix = df.corr()

# Display the correlation matrix
display(correlation_matrix)

|      | col1 | col2 |
|------|------|------|
| col1 | 1.0  | 1.0  |
| col2 | 1.0  | 1.0  |
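
For a single pair of variables, `numpy.corrcoef` and `scipy.stats.pearsonr` can be used as sketched below. The TV-hours and exam-score values are invented for illustration and are not real measurements.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical observations for five students
hours_tv = np.array([1, 2, 3, 4, 5], dtype=float)
exam_scores = np.array([90, 85, 70, 65, 50], dtype=float)

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the pairwise coefficient
r_numpy = np.corrcoef(hours_tv, exam_scores)[0, 1]

# pearsonr returns the coefficient together with a p-value
r_scipy, p_value = pearsonr(hours_tv, exam_scores)

print(f"r = {r_numpy:.3f}, p-value = {p_value:.4f}")
```

Both functions compute the same Pearson coefficient; here it is strongly negative because the scores fall as TV hours rise.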


14.  What is causation? Explain difference between correlation and causation with an example.
   - Causation means that a change in one variable directly produces a change in another. Correlation only indicates that two variables move together, not that one causes the other.
   - Example: Ice cream sales and drowning incidents are correlated (both increase in summer), but eating ice cream does not cause drowning. The underlying cause is hot weather, which increases both swimming and ice cream consumption.


15.  What is an Optimizer? What are different types of optimizers? Explain each with an example.
   - An optimizer is an algorithm that adjusts the parameters of a model to minimize the loss function during training. Common types include:
     - Gradient Descent: Computes the gradient of the loss over the full training set and updates the parameters in the direction of the negative gradient.
     - Stochastic Gradient Descent (SGD): Estimates the gradient from a single random sample, or in the widely used mini-batch variant from a small random subset, for each update, which makes each step much cheaper on large datasets.
     - Adam: Combines momentum and per-parameter adaptive learning rates for efficient optimization.
     - RMSprop: Scales each parameter's learning rate by a moving average of its squared gradients.
   - Example: In neural networks, Adam is often used for its efficiency and ability to handle sparse gradients.
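
The first two optimizers can be illustrated with a short NumPy sketch that fits a line y = w*x + b by minimizing mean squared error. This is a simplified illustration rather than how any library implements its optimizers; the function name `fit_line`, the learning rate, and the synthetic data are all assumptions. Adam and RMSprop are not shown.

```python
import numpy as np

def fit_line(X, y, lr=0.1, n_steps=500, batch_size=None, seed=0):
    """Fit y = w*x + b by (mini-batch) gradient descent on mean squared error.

    batch_size=None uses the full dataset on every step (plain gradient descent);
    an integer batch_size uses a random subset per step (mini-batch SGD).
    """
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        if batch_size is None:
            xb, yb = X, y
        else:
            idx = rng.choice(len(X), size=batch_size, replace=False)
            xb, yb = X[idx], y[idx]
        error = w * xb + b - yb
        # Gradients of mean((w*x + b - y)^2) with respect to w and b
        grad_w = 2.0 * np.mean(error * xb)
        grad_b = 2.0 * np.mean(error)
        # Step in the direction of the negative gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic, noiseless data generated from y = 2x + 1
data_rng = np.random.default_rng(0)
X = data_rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * X + 1.0

w_gd, b_gd = fit_line(X, y)                    # full-batch gradient descent
w_sgd, b_sgd = fit_line(X, y, batch_size=10)   # mini-batch SGD
print(w_gd, b_gd, w_sgd, b_sgd)
```

Both variants should recover values close to w = 2 and b = 1 on this toy problem; the mini-batch version touches only 10 rows per step, which is the property that matters on large datasets.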

16.  What is sklearn.linear_model ?
   - sklearn.linear_model is a module in scikit-learn that provides classes for linear models, such as linear regression (LinearRegression), logistic regression (LogisticRegression), and ridge regression (Ridge). These models are used for regression and classification tasks.

17.  What does model.fit() do? What arguments must be given?
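   - model.fit() trains the model on the provided data. For supervised scikit-learn models, the required arguments are:
     - X: Feature matrix (input data).
     - y: Target vector (labels or values).
   - Unsupervised estimators, such as clustering models, need only X.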

In [None]:
# model is assumed to be an already-created scikit-learn estimator, e.g. LinearRegression()
model.fit(X_train, y_train)
# Many scikit-learn estimators also accept an optional sample_weight argument in fit()

18.  What does model.predict() do? What arguments must be given?
   - model.predict() uses the trained model to make predictions on new data and returns one prediction per row. The only required argument is:
     - X: Feature matrix of new/unseen data, with the same columns as the training features.

In [None]:
predictions = model.predict(X_test)
# No target values are needed for prediction
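
Putting `fit()` and `predict()` together, a minimal self-contained sketch might look like this. The hours-studied data and the choice of `LogisticRegression` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied (one feature) and pass (1) / fail (0) labels
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # shape (n_samples, n_features)
y_train = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)                  # learning step: needs features and targets

X_new = np.array([[1.5], [5.5]])             # new rows with the same number of features
predictions = model.predict(X_new)           # prediction step: needs features only
probabilities = model.predict_proba(X_new)   # class probabilities, one row per sample

print(predictions)
print(probabilities.round(3))
```

Note that both `X_train` and `X_new` are two-dimensional, with one row per sample, even though there is only a single feature.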

19.  What are continuous and categorical variables?
   - Continuous variables: Numbers that can take any value in a range (e.g., weight, height, price).
   - Categorical variables: Values grouped into categories (e.g., colors, gender, grade level).


20.  What is feature scaling? How does it help in Machine Learning?
   - Feature scaling is the process of normalizing or standardizing the range of independent variables (features) so that they are on comparable scales. It helps distance-based algorithms (e.g., k-NN, SVM), where large-valued features would otherwise dominate, and helps models trained with gradient descent converge faster.
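
The effect on distance calculations can be seen in a small NumPy sketch. The three rows below are hypothetical (age, income) pairs, and `min_max_scale` is a hand-written helper for illustration, not a library function.

```python
import numpy as np

# Hypothetical rows: [age in years, income in dollars]
X = np.array([
    [25.0, 40000.0],
    [50.0, 42000.0],
    [26.0, 90000.0],
])

def min_max_scale(X):
    """Rescale each column to the range [0, 1]."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

def distance(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

raw_d01 = distance(X[0], X[1])  # rows 0 and 1 differ mostly in age
raw_d02 = distance(X[0], X[2])  # rows 0 and 2 differ mostly in income

X_scaled = min_max_scale(X)
scaled_d01 = distance(X_scaled[0], X_scaled[1])
scaled_d02 = distance(X_scaled[0], X_scaled[2])

print(raw_d01, raw_d02)
print(scaled_d01, scaled_d02)
```

Before scaling, the income column dominates the distance simply because its values are larger; after scaling, a large age gap counts as much as a large income gap.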


21.  How do we perform scaling in Python?
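   - We use scaler classes from sklearn.preprocessing, such as StandardScaler (rescales each feature to zero mean and unit variance) or MinMaxScaler (rescales each feature to a fixed range, [0, 1] by default).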

Example using MinMaxScaler:

In [None]:
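from sklearn.preprocessing import MinMaxScaler

# X is assumed to be a numeric feature matrix defined earlier
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)  # each column is rescaled to the [0, 1] range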
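
When the data is split into training and test sets, a common practice is to fit the scaler on the training rows only and reuse its statistics on the test rows. The sketch below shows this with `StandardScaler` on synthetic data; the feature values are randomly generated for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic feature matrix: two columns on very different scales
X = np.column_stack([
    rng.normal(50, 10, size=200),
    rng.normal(50000, 15000, size=200),
])

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on test rows

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

Calling `transform` rather than `fit_transform` on the test set keeps information about the test rows from leaking into preprocessing.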


22.  Explain data encoding?
   - Data encoding is the process of converting categorical variables into a numerical format suitable for machine learning algorithms. Common encoding techniques include label encoding, one-hot encoding, and ordinal encoding. Encoding ensures that algorithms can interpret and process categorical data effectively.