# Feature Engineering

1. What is a parameter?
   - In general programming, a parameter is a value passed into a function, method, or system to control its behavior, which makes the function flexible and reusable across different inputs. In machine learning, the term usually refers to the values a model learns from data during training, such as the weights and bias of a linear model. Settings chosen before training, such as the learning rate, are called hyperparameters instead.

2. What is correlation? What does negative correlation mean?
   - Correlation refers to the statistical relationship between two or more variables, where changes in one variable are associated with changes in another. In simpler terms, it measures how closely two variables move in relation to each other.

      A negative correlation means that as one variable increases, the other variable tends to decrease (and vice versa). In other words, the two variables move in opposite directions. If there's a strong negative correlation, the relationship between the variables is predictable: when one goes up, the other goes down, and the pattern is consistent.

3.  Define Machine Learning. What are the main components in Machine Learning?
    - Machine Learning (ML) is a subset of Artificial Intelligence (AI) that enables systems to learn from data, improve their performance over time, and make predictions or decisions without being explicitly programmed with a fixed set of rules. Instead, ML algorithms use data to detect patterns and optimize their behavior automatically. The main components are typically the data (features and, for supervised learning, labels), a model, a loss function that measures error, an optimization algorithm that adjusts the model's parameters, and an evaluation procedure.

4. How does loss value help in determining whether the model is good or not?
    - The loss value measures how far the model's predictions are from the actual target values, so it quantifies the error the model makes. Smaller loss values generally indicate a better fit. A low loss on the training data alone is not enough, however: if the loss on held-out validation or test data is much higher, the model is likely overfitting.
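
As a minimal sketch of how a loss value summarizes prediction error, the snippet below computes mean squared error (MSE), one common regression loss, for two sets of predictions. The helper name `mean_squared_error` and all numbers are illustrative assumptions, not taken from a real dataset.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 7.0]
close_preds = [2.5, 5.0, 7.5]  # hypothetical model A: small errors
far_preds = [0.0, 9.0, 3.0]    # hypothetical model B: large errors

loss_a = mean_squared_error(y_true, close_preds)
loss_b = mean_squared_error(y_true, far_preds)
print(loss_a, loss_b)
```

The model whose predictions sit closer to the targets gets the smaller loss, which is the comparison the loss value is meant to support.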

5. What are continuous and categorical variables?
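   - Continuous and categorical variables are two fundamental types of variables in data analysis and machine learning. A continuous variable is numeric and can take any value within a range (e.g., height, temperature, price). A categorical variable takes one of a limited set of labels or groups (e.g., color, country, yes/no), and can be nominal (no inherent order) or ordinal (ordered, such as low/medium/high). The type determines how the data are treated or processed.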

6. How do we handle categorical variables in Machine Learning? What are the common techniques?
   - Handling categorical variables is a critical step in preparing data for machine learning because most algorithms work with numerical data. Therefore, categorical variables need to be converted into numerical representations before they can be used in a model.

      Here are common techniques for handling categorical variables in machine learning, summarized briefly:

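      1.   Label Encoding: assigns each category an integer code.
      2.   One-Hot Encoding: creates one binary indicator column per category.
      3.   Binary Encoding: writes each category's integer code in binary, using one column per bit.
      4.   Frequency / Count Encoding: replaces each category with how often it occurs in the data.
      5.   Target Encoding: replaces each category with a statistic of the target (typically the mean) for that category.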
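
Three of the techniques listed above can be sketched with pandas alone. This is a minimal illustration under stated assumptions: the toy DataFrame and the column name `color` are made up, and the integer codes come from pandas' categorical dtype rather than from a dedicated encoder class.

```python
import pandas as pd

# Hypothetical toy data; the column name "color" is an assumption.
df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red", "green"]})

# Label encoding: map each category to an integer code
# (pandas orders string categories alphabetically).
df["color_label"] = df["color"].astype("category").cat.codes

# Frequency (count) encoding: replace each category with its number of occurrences.
counts = df["color"].value_counts()
df["color_count"] = df["color"].map(counts)

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

print(df)
print(one_hot)
```

Label codes imply an order that may not exist in the data, so one-hot encoding is often the safer choice for nominal categories with few distinct values.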

7. What do you mean by training and testing a dataset?
   - Training a model means fitting its parameters on one portion of the data (the training set). Testing means evaluating the trained model on a separate portion (the test set) that it did not see during training. Keeping the two apart helps check that the model generalizes to new, unseen data and doesn't just memorize the training data (a problem known as overfitting).

8. What is sklearn.preprocessing?
   - sklearn.preprocessing is a module in scikit-learn (a popular Python library for machine learning) that provides a collection of utilities for data preprocessing. Preprocessing refers to the steps of transforming and preparing data before using it to train a machine learning model.

9. What is a Test set?
   - A test set is a portion of your dataset that is used to evaluate the performance of a machine learning model after it has been trained on a training set.

10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?
    - In Python, the most common way to split data for model fitting (i.e., splitting data into training and testing sets) is the train_test_split function from scikit-learn (sklearn.model_selection). By default it shuffles the data and splits each array passed to it into a training part and a test part.

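       Approaching a machine learning problem involves a systematic, step-by-step process: define the problem and the metric, collect and explore the data, clean and preprocess it, split it into training and test sets, train one or more models, evaluate them on held-out data, and iterate. While each problem is unique, this general approach remains fairly consistent whether you're working on a classification, regression, or unsupervised task.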
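
One possible shape of such a workflow is sketched below with scikit-learn. The choices here are assumptions made for illustration only: the bundled iris dataset stands in for real data, and logistic regression with standardization stands in for whatever model and preprocessing a real problem would call for.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Load data (the bundled iris dataset is only a stand-in).
X, y = load_iris(return_X_y=True)

# 2. Hold out a test set; stratify keeps class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 3. Combine preprocessing and model so the scaler is fit on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 4. Train on the training set.
model.fit(X_train, y_train)

# 5. Evaluate on the held-out test set.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")
```

Wrapping the scaler and the model in one pipeline is a design choice that keeps test data out of the preprocessing step.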

11. Why do we have to perform EDA before fitting a model to the data?
    - Exploratory Data Analysis (EDA) is a preliminary step that helps you understand the dataset and the relationships within it before modeling. It reveals issues such as missing values, outliers, skewed distributions, and strongly correlated features, and these findings guide data cleaning, feature engineering, and the choice of model.

12. What is correlation?
    - Correlation is a statistical measure that describes the strength and direction of the relationship between two variables. In simpler terms, it tells you whether and how strongly two variables are related.

13. What does negative correlation mean?
    - Negative correlation refers to a relationship between two variables in which one variable increases while the other decreases, and vice versa. In other words, when one variable goes up, the other goes down. The strength of this inverse relationship is represented by a negative correlation coefficient (between -1 and 0).

14. How can you find correlation between variables in Python?
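    - To find the correlation between variables in Python, you can use the pandas library. Calling the corr() method on a DataFrame computes the pairwise correlation coefficients between its columns (Pearson by default), and calling corr() on a Series with another Series as the argument gives the coefficient for just that pair.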
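
A minimal sketch of the pandas approach follows. The DataFrame, its column names, and its values are hypothetical and chosen so that one pair of columns rises together while another pair moves in opposite directions.

```python
import pandas as pd

# Hypothetical data: column names and values are made up for illustration.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 71, 80],
    "hours_gaming": [9, 8, 6, 5, 2],
})

# Pairwise Pearson correlation between all columns.
corr_matrix = df.corr()
print(corr_matrix)

# Correlation between two specific columns.
r = df["hours_studied"].corr(df["hours_gaming"])
print(r)
```

In the resulting matrix, values near +1 indicate a strong positive relationship and values near -1 a strong negative one.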

15. What is causation? Explain difference between correlation and causation with an example.
    - Causation refers to a cause-and-effect relationship between two variables, where one variable (the cause) directly influences or brings about a change in another variable (the effect). In other words, causation implies that a change in the independent variable (the cause) leads to a change in the dependent variable (the effect).

      Correlation refers to a statistical relationship in which two variables move together, but it doesn't imply that one causes the other. Causation means that one variable directly influences the other. For example, ice cream sales and swimming accidents are correlated because both rise in hot weather, yet buying ice cream does not cause accidents; by contrast, pressing a car's accelerator causes the car to speed up.
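
The simulation below sketches how a shared driver can produce a strong correlation between two variables that do not influence each other. Everything in it is an assumption made for illustration: the variable names, the coefficients, and the noise levels are invented, not measured.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: daily temperature drives both quantities below.
temperature = rng.normal(loc=25.0, scale=5.0, size=1000)

# Neither variable is computed from the other; both depend only on temperature.
ice_cream_sales = 10.0 * temperature + rng.normal(scale=10.0, size=1000)
swimming_accidents = 0.5 * temperature + rng.normal(scale=0.5, size=1000)

r = np.corrcoef(ice_cream_sales, swimming_accidents)[0, 1]
print(f"correlation: {r:.2f}")
```

The two series are strongly correlated even though the code never lets one affect the other, which is why a correlation coefficient alone cannot establish causation.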

16. What is an Optimizer? What are different types of optimizers? Explain each with an example.
    - An optimizer in machine learning is an algorithm or method used to minimize or maximize a loss function during training. The goal of an optimizer is to adjust the model's parameters (like weights in a neural network) to reduce the error between the predicted output and the actual output, improving the model's performance over time.

      There are several types of optimizers, each with its own advantages and use cases. Here's an overview of the most common ones:

      1. **Gradient Descent (GD):** Gradient descent is the most basic optimization method. It updates the parameters by moving in the direction of the negative gradient of the loss function, i.e., the direction that reduces the loss. When the gradient is computed over the entire training set at each step, it is also called batch gradient descent.

          Example: In linear regression, gradient descent iteratively adjusts the weights of the model to minimize the difference between the predicted and actual outputs.

      2. **Stochastic Gradient Descent (SGD):** SGD updates the model's parameters using just one data point at a time. Each update is much cheaper than in batch GD, but the updates are noisy, and the algorithm might oscillate around the minimum.

          Example: In neural networks, SGD can be used to update the weights after each training sample, speeding up the learning process.

      3. **Mini-batch Gradient Descent:** A compromise between Batch Gradient Descent and Stochastic Gradient Descent. Mini-batch GD uses a small subset of data (mini-batch) for each parameter update, improving efficiency and stability.

          Example: In deep learning, mini-batch gradient descent is used with deep neural networks, where the dataset is divided into mini-batches (e.g., 32 or 64 samples per batch) for each update.

      4. **Momentum:** Momentum is an extension of gradient descent. It helps accelerate the updates in consistent directions, leading to faster convergence. It adds a fraction of the previous update to the current update, giving the optimizer “momentum” in the direction of the previous gradients.

          Example: In training deep neural networks, momentum helps improve convergence by smoothing the updates, especially when the gradient changes direction frequently.

      5. **RMSprop (Root Mean Square Propagation):** RMSprop is an adaptive learning rate optimizer. It adjusts the learning rate for each parameter based on the magnitudes of recent gradients: it divides the learning rate by the square root of an exponentially decaying average of squared gradients, which dampens oscillations along directions with large gradients.

          Example: RMSprop is often used for training recurrent neural networks (RNNs) due to its ability to handle gradients that vary over time.
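
The update rules described above can be sketched in plain Python on a toy one-parameter loss, f(w) = (w - 3)^2. This is an illustration under assumptions, not a reference implementation: the function names, learning rates, and step counts are arbitrary choices. SGD and mini-batch GD use the same update as gradient descent and differ only in how much data is used to estimate the gradient, so they are not shown separately.

```python
import math


def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)**2, which is minimized at w = 3."""
    return 2.0 * (w - 3.0)


def gradient_descent(w, lr=0.1, steps=200):
    for _ in range(steps):
        w -= lr * grad(w)  # step against the gradient
    return w


def momentum(w, lr=0.1, beta=0.9, steps=200):
    velocity = 0.0
    for _ in range(steps):
        # Keep a fraction of the previous update and add the new gradient step.
        velocity = beta * velocity - lr * grad(w)
        w += velocity
    return w


def rmsprop(w, lr=0.01, decay=0.9, eps=1e-8, steps=2000):
    avg_sq = 0.0
    for _ in range(steps):
        g = grad(w)
        # Exponentially decaying average of squared gradients.
        avg_sq = decay * avg_sq + (1.0 - decay) * g * g
        # Divide the step by the root of that average.
        w -= lr * g / (math.sqrt(avg_sq) + eps)
    return w


results = {
    "gradient_descent": gradient_descent(0.0),
    "momentum": momentum(0.0),
    "rmsprop": rmsprop(0.0),
}
for name, w_final in results.items():
    print(name, w_final)
```

All three start from w = 0 and end near the minimum at w = 3; in practice these optimizers are taken from a deep learning library rather than written by hand.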

17. What is sklearn.linear_model?
    - sklearn.linear_model is a module within the scikit-learn library in Python, which contains several classes for linear models used in machine learning tasks. These models are typically used for regression, classification, and other supervised learning tasks. Linear models attempt to model the relationship between input variables (features) and the target variable (output) through a linear equation.

18. What does model.fit() do? What arguments must be given?
    - The model.fit() method in scikit-learn trains a machine learning model on the provided training data. It learns the parameters (such as weights or coefficients) that minimize the error or loss function on the training set. For supervised estimators, you must pass the feature matrix X (typically of shape (n_samples, n_features)) and the target values y; unsupervised estimators and transformers need only X.

19. What does model.predict() do? What arguments must be given?
    - The model.predict() method in scikit-learn makes predictions with a trained model. After the model has been trained with model.fit(), model.predict() takes one required argument, the feature matrix X for the new, unseen samples (with the same number of features used in training), and returns the predicted target values based on the learned parameters.
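
A minimal sketch of the fit and predict calls follows, using `LinearRegression` from sklearn.linear_model. The training data are hypothetical and were constructed to follow y = 2x + 1 exactly so the learned parameters are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data following y = 2*x + 1 exactly.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])  # shape (n_samples, n_features)
y_train = np.array([1.0, 3.0, 5.0, 7.0])          # shape (n_samples,)

model = LinearRegression()
model.fit(X_train, y_train)  # learns coef_ and intercept_ from the data

X_new = np.array([[4.0], [10.0]])
predictions = model.predict(X_new)  # one prediction per row of X_new
print(model.coef_, model.intercept_, predictions)
```

Note that X is two-dimensional even with a single feature; passing a flat one-dimensional array for X raises an error in scikit-learn.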

20. What are continuous and categorical variables?
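    - In data science and machine learning, variables (or features) in a dataset are classified into two main types. Continuous variables are numeric measurements that can take any value in a range (e.g., age, income), while categorical variables take values from a fixed set of groups (e.g., gender, product type). Understanding the difference is important because it affects how you analyze the data, which preprocessing you apply (scaling versus encoding), and the type of model you use.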

21. What is feature scaling? How does it help in Machine Learning?
    - Feature scaling is the process of normalizing or standardizing the range of independent variables or features in your data. The purpose is to ensure that features with different units or scales (e.g., height in cm and salary in dollars) do not disproportionately affect the model. This matters most for algorithms that rely on distances between data points (e.g., k-NN, SVM) and for models trained with gradient descent (e.g., linear or logistic regression).

      Feature scaling helps in machine learning by ensuring that all features contribute equally to the model, preventing features with larger ranges from dominating the learning process. It improves the performance and convergence speed of algorithms that rely on distance metrics (like k-NN, SVM) or optimization (like gradient descent). Without scaling, models may become biased or inefficient, leading to suboptimal predictions.

22. How do we perform scaling in Python?
    - In Python, feature scaling can be performed with the sklearn.preprocessing module of scikit-learn. Common scaling methods include:
      - Min-Max Scaling (MinMaxScaler), which rescales each feature to the range [0, 1] by default.
      - Standardization (StandardScaler), which centers each feature around 0 with a standard deviation of 1.
      - Robust Scaling (RobustScaler), which uses the median and interquartile range and is therefore less sensitive to outliers.
      - MaxAbs Scaling (MaxAbsScaler), which divides each feature by its maximum absolute value so that values fall in the range [-1, 1].

      These techniques put features on comparable scales so that no single feature dominates because of its units.
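
As a minimal sketch, the snippet below applies two of these scalers to a small array. The data are hypothetical (height in cm and salary in dollars, echoing the example used earlier), and only `MinMaxScaler` and `StandardScaler` are shown.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features: column 0 is height in cm, column 1 is salary in dollars.
X = np.array([
    [150.0, 30000.0],
    [160.0, 45000.0],
    [170.0, 60000.0],
    [180.0, 90000.0],
])

# Min-max scaling: each column is rescaled to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: each column gets mean 0 and standard deviation 1.
X_standard = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_standard)
```

With a train/test split, the usual practice is to call `fit` (or `fit_transform`) on the training data only and then `transform` the test data with the same fitted scaler.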

23. What is sklearn.preprocessing?
    - sklearn.preprocessing is a module in scikit-learn (a popular Python library for machine learning) that provides a collection of utilities for data preprocessing. Preprocessing refers to the steps of transforming and preparing data before using it to train a machine learning model.

24. How do we split data for model fitting (training and testing) in Python?
    - In Python, you can split data into training and testing sets using the train_test_split function from the scikit-learn library. This helps ensure that the model is trained on one portion of the data (training set) and evaluated on a separate portion (test set) to assess its performance.
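
A minimal sketch of the call is shown below. The arrays are synthetic, and the `test_size` and `random_state` values are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 samples with 2 features each, plus a binary label.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the samples for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Every sample ends up in exactly one of the two parts, so the test set stays unseen during training.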

25. Explain data encoding?
    - Data encoding is the process of converting categorical variables (non-numeric values) into numeric formats so they can be used by machine learning algorithms. Most machine learning models require numerical input, as they cannot process categorical data directly. Encoding is essential for ensuring that the algorithm can interpret and learn from categorical features.
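
The sketch below shows two scikit-learn encoders from sklearn.preprocessing applied to a single categorical column. The size labels and their ordering are assumptions made for the example, and `handle_unknown="ignore"` is one possible way to deal with categories that did not appear during fitting.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical feature with a natural order.
X_train = np.array([["small"], ["medium"], ["large"], ["medium"]], dtype=object)

# Ordinal encoding with an explicit order: small -> 0, medium -> 1, large -> 2.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
X_ordinal = ordinal.fit_transform(X_train)

# One-hot encoding: one column per category (sorted: large, medium, small).
one_hot = OneHotEncoder(handle_unknown="ignore")
X_one_hot = one_hot.fit_transform(X_train).toarray()

# A category not seen during fit ("huge") becomes an all-zero row.
X_new = np.array([["large"], ["huge"]], dtype=object)
X_new_one_hot = one_hot.transform(X_new).toarray()

print(X_ordinal.ravel())
print(X_one_hot)
print(X_new_one_hot)
```

Fitting the encoder once and reusing it for new data keeps the column layout consistent between training and prediction.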
