# Feature Engineering

**1. What is a parameter?**  

Ans - A parameter is a variable in a machine learning model that is learned from the training data. Examples include weights in a linear regression model or connection strengths in a neural network.

**2. What is correlation? What does negative correlation mean?**  

Ans - Correlation is a statistical measure of how strongly two variables move in relation to each other. It is usually reported as a coefficient between -1 and +1, where 0 means no linear relationship.
Negative correlation means that as one variable increases, the other tends to decrease (a coefficient below 0). For example, as the price of a product increases, its demand may decrease.
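
As a rough illustration of the price and demand example, the sketch below computes the Pearson coefficient with NumPy. The numbers are invented for illustration and are not taken from any real dataset.

```python
import numpy as np

# Hypothetical price and demand figures, invented for illustration
price = np.array([10, 12, 14, 16, 18, 20])
demand = np.array([95, 90, 82, 75, 71, 60])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient r
r = np.corrcoef(price, demand)[0, 1]
print(f"Pearson r = {r:.3f}")  # close to -1: strong negative correlation
```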

**3. Define Machine Learning. What are the main components in Machine Learning?**  
Ans - **Machine Learning (ML)** is a subset of artificial intelligence that enables computers to learn from data and make predictions without being explicitly programmed.  
   * **Main components:**  
     1. **Data** - The input that is used for training and testing.  
     2. **Features** - Variables used to make predictions.  
     3. **Model** - The algorithm that learns from data.  
     4. **Loss Function** - Measures how well the model predicts outcomes.  
     5. **Optimization Algorithm** - Adjusts parameters to minimize the loss.  
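
The five components above can be seen together in a very small example. The following is a minimal sketch, assuming synthetic data generated from `y = 2x + 1`; the variable names and the learning rate are illustrative choices, not a prescribed recipe.

```python
import numpy as np

# Data: synthetic points generated from y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)                      # feature
y = 2.0 * x + 1.0 + rng.normal(0, 0.05, size=100)    # target

# Model: a straight line with two learnable parameters
w, b = 0.0, 0.0

def predict(x, w, b):
    return w * x + b

# Loss function: mean squared error
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Optimization algorithm: plain gradient descent on the MSE
learning_rate = 0.5
initial_loss = mse(y, predict(x, w, b))
for _ in range(2000):
    error = predict(x, w, b) - y
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
final_loss = mse(y, predict(x, w, b))

print(f"w={w:.2f}, b={b:.2f}, loss {initial_loss:.3f} -> {final_loss:.4f}")
```

The learned `w` and `b` should end up close to the values used to generate the data, and the loss should fall as training proceeds.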

**4. How does loss value help in determining whether the model is good or not?**  
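Ans - The loss measures the gap between the model's predictions and the actual values, so a lower loss generally indicates better performance. It should be checked on data the model was not trained on (a validation or test set): a loss that is low on the training data but much higher on held-out data is a sign of overfitting.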

**5. What are continuous and categorical variables?**  

Ans - Variables are commonly grouped into two kinds:
   * **Continuous variables**: Numeric values that can take any number within a range (e.g., height, temperature).  
   * **Categorical variables**: Discrete values that represent categories (e.g., colors, gender, product types).

**6. How do we handle categorical variables in Machine Learning? What are the common techniques?**  

Ans - Handling categorical variables in ML involves converting them into numerical form. Common techniques include:  

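1. **One-Hot Encoding** - Creates one binary column per category (best for nominal data with few distinct categories).  
2. **Label Encoding** - Assigns a unique integer to each category. The integers are identifiers rather than ranks; scikit-learn's `LabelEncoder` assigns them in sorted order and is intended for target labels.  
3. **Ordinal Encoding** - Maps categories to integers that follow a defined order (for ordinal variables such as low < medium < high).  
4. **Frequency Encoding** - Replaces each category with how often it occurs in the data.  
5. **Target Encoding** - Replaces each category with the mean of the target variable for that category (useful for high-cardinality features, but it should be computed on training data only to avoid target leakage).  
6. **Binary Encoding** - Writes each category's integer code in binary digits, one column per bit (needs fewer columns than one-hot encoding).  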
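
Several of these techniques can be sketched with plain pandas. The example below is only an illustration: the DataFrame, its column names, and the `size_order` mapping are made up for the purpose.

```python
import pandas as pd

# Toy data invented for illustration
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "red"],
    "size": ["small", "large", "medium", "small", "large", "medium"],
    "bought": [1, 0, 1, 0, 0, 1],
})

# One-hot encoding: one binary column per color
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding: integers that follow an explicit order
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_ordinal"] = df["size"].map(size_order)

# Frequency encoding: how many times each color appears
df["color_freq"] = df["color"].map(df["color"].value_counts())

# Target encoding: mean of the target for each color
# (in practice, compute these means on the training split only)
df["color_target"] = df["color"].map(df.groupby("color")["bought"].mean())

print(one_hot.columns.tolist())
print(df)
```

In this toy data the target-encoded column predicts `bought` perfectly, which is exactly the kind of leakage that appears when the encoding is computed on the same rows the model is evaluated on.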

**7. What do you mean by training and testing a dataset?**  

Ans - A dataset is usually divided into two parts:
   * **Training dataset**: Used to train the machine learning model.  
   * **Testing dataset**: Used to evaluate the performance of the trained model.

**8. What is sklearn.preprocessing?**  

Ans - It is a module in **scikit-learn** that provides functions for data preprocessing, including scaling, encoding, and normalization.

**9. What is a Test set?**  

Ans - A test set is a portion of the dataset used to evaluate a model's performance after training.

**10. How do we split data for model fitting (training and testing) in Python?**

Ans - In Python, we typically split data into training and testing sets using train_test_split from sklearn.model_selection. This ensures our model is trained on one part and tested on unseen data.
  
Syntax:

    from sklearn.model_selection import train_test_split  
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Example:

    from sklearn.model_selection import train_test_split  
    # Sample data  
    X = [[1], [2], [3], [4], [5]]  # Features  
    y = [0, 1, 0, 1, 0]  # Target  

    # Split into 80% training, 20% testing  
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  

    print(X_train, X_test, y_train, y_test)

**How do you approach a Machine Learning problem? [Question number not given]**  

Ans - Approaching a Machine Learning problem involves a structured workflow. Here's a simple step-by-step process:  

**1. Understand the Problem**  
   * Define the objective (e.g., classification, regression).  
   * Identify key constraints and success metrics.  

**2. Collect and Explore Data**  
   * Gather data from reliable sources.  
   * Perform **Exploratory Data Analysis (EDA)** (visualization, summary statistics).  

**3. Data Preprocessing**  
   * Handle missing values.  
   * Encode categorical variables.  
   * Normalize/scale numerical features.  
   * Split data into train and test sets.  

**4. Choose a Model**  
   * Select a baseline algorithm (e.g., Logistic Regression, Decision Trees).  
   * Try more advanced models (Random Forest, XGBoost, Neural Networks).  

**5. Train and Tune the Model**  
   * Fit the model on training data.  
   * Optimize hyperparameters (e.g., `GridSearchCV` or `RandomizedSearchCV` in scikit-learn).

**6. Evaluate Performance**  
   * Use metrics like Accuracy, RMSE, F1-score, etc.  
   * Perform cross-validation to check generalization.  

**7. Deploy and Monitor**  
   * Deploy the model to production.  
   * Continuously monitor performance and update as needed.  
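
Steps 3 to 6 of this workflow can be sketched end to end with scikit-learn. This is a minimal example under stated assumptions: the built-in iris dataset stands in for real data, and the grid of `C` values is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: a small classification problem with data already collected
X, y = load_iris(return_X_y=True)

# Step 3: split first, so preprocessing is fitted on training data only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 4: a baseline model, with scaling inside the pipeline
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step 5: tune one hyperparameter with 5-fold cross-validation
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Step 6: evaluate once on the held-out test set
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(search.best_params_, round(test_accuracy, 3))
```

Putting the scaler inside the pipeline means each cross-validation fold fits it on its own training portion, which keeps information from the validation fold out of the preprocessing.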

**11. Why do we have to perform EDA before fitting a model to the data?**

Ans - EDA helps us understand the data distribution, detect missing values, outliers, and correlations, and make informed decisions about feature selection and preprocessing before a model is fitted.
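
A few pandas calls cover most of these checks. The sketch below uses a toy DataFrame invented for illustration, and the 1.5 × IQR rule is one common convention for flagging outliers, not the only one.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration, with one missing value and one outlier
df = pd.DataFrame({
    "age": [25, 32, 47, 51, np.nan, 38],
    "income": [30_000, 42_000, 58_000, 61_000, 45_000, 1_000_000],
})

summary = df.describe()        # count, mean, std, min, quartiles, max
missing = df.isna().sum()      # number of missing values per column

# Flag outliers in income with the 1.5 * IQR rule
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

correlations = df.corr()       # pairwise Pearson correlations

print(summary, missing, outliers, correlations, sep="\n\n")
```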

**12. What is correlation?**

Ans - **[Repeated question, same as Q2.]**

**13. What does negative correlation mean?**

Ans - **[Repeated question, same as Q2.]**

**14. How can you find correlation between variables in Python?**  

Ans - We can find the correlation between variables in Python using **Pandas** (for computation) and **Seaborn** (for visualization).

**Using Pandas `.corr()`** - The default method is Pearson correlation. Use `df.corr(method='spearman')` for Spearman correlation.

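Example:

    import pandas as pd

    # Small illustrative DataFrame; replace with your own data
    df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [8, 6, 5, 1]})
    correlation_matrix = df.corr()  # Pearson by default
    print(correlation_matrix)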
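
The difference between the two methods shows up when a relationship is monotonic but not linear. The following sketch uses made-up data where `y` is the cube of `x`.

```python
import pandas as pd

# y grows with x, but not along a straight line
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8]})
df["y"] = df["x"] ** 3

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")   # below 1, the relationship is not linear
print(f"Spearman: {spearman:.3f}")  # 1.000, the ranks agree perfectly
```

For visualization, the matrix returned by `df.corr()` can be passed to `seaborn.heatmap`.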

**15. What is causation? Explain difference between correlation and causation with an example.**  

Ans - Causation means that one variable directly influences another.

**Difference between correlation and causation:**  
* Causation (or causality) means that one event directly causes another.
* Correlation means that two variables move together but may not have a direct cause-and-effect relationship.

**Example:**

* **Correlation:** Ice cream sales and drowning rates are correlated: they increase together in summer.
→ But ice cream does not cause drowning. Instead, hot weather influences both.

* **Causation:** Smoking causes lung cancer.
→ There is a direct cause-and-effect relationship.
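
The ice cream example can be simulated to show how a shared cause produces correlation without causation. This is a sketch under an assumed data-generating process: the coefficients and noise levels are invented, and temperature is the only thing that drives both variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Assumed process: temperature drives both quantities,
# and neither one influences the other
temperature = rng.normal(25, 5, size=n)
ice_cream_sales = 10 * temperature + rng.normal(0, 20, size=n)
drownings = 0.3 * temperature + rng.normal(0, 1, size=n)

raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(y, x):
    """Remove the part of y explained by a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation that remains once temperature is accounted for
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw: {raw_corr:.2f}, controlling for temperature: {partial_corr:.2f}")
```

The raw correlation is clearly positive, while the correlation left after removing the effect of temperature is close to zero.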
     
**16. What is an Optimizer? What are different types of optimizers? Explain each with an example.**  

Ans - An optimizer is an algorithm that adjusts the model's parameters (weights and biases) to minimize the loss. It decides the direction and size of each update, which affects how quickly and how reliably training converges.

**Types of optimizers:**  
* **Gradient Descent (GD)** - Updates weights in the direction of the steepest decrease in loss, using the gradient of the loss.  
  Example: Batch GD computes each update from the whole dataset.
* **Stochastic Gradient Descent (SGD)** - Updates weights after each data point (or small mini-batch), which makes each step cheaper but noisier.  
  Example: Used in online learning.
* **RMSProp (Root Mean Square Propagation)** - Scales each parameter's learning rate by a running average of its recent squared gradients, which helps avoid overshooting and suits non-stationary objectives.  
  Example: Often used for RNNs.
* **Adam (Adaptive Moment Estimation)** - Combines momentum with RMSProp-style adaptive learning rates.  
  Example: Widely used for training neural networks in TensorFlow/Keras.

Example in Python (Adam optimizer in TensorFlow):

    from tensorflow.keras.optimizers import Adam  
    optimizer = Adam(learning_rate=0.001)
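
To make the update rules concrete, here is a from-scratch sketch of three of these optimizers in NumPy, applied to the toy loss `(w - 3)^2`. The function names, learning rates, and step counts are illustrative assumptions; library implementations differ in details.

```python
import numpy as np

def grad(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, lr=0.1, steps=500):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def rmsprop(w, lr=0.02, decay=0.9, eps=1e-8, steps=500):
    avg_sq = 0.0  # running average of squared gradients
    for _ in range(steps):
        g = grad(w)
        avg_sq = decay * avg_sq + (1 - decay) * g ** 2
        w -= lr * g / (np.sqrt(avg_sq) + eps)
    return w

def adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = 0.0, 0.0  # running averages of the gradient and its square
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

for name, optimizer in [("GD", gradient_descent), ("RMSProp", rmsprop), ("Adam", adam)]:
    print(f"{name}: w = {optimizer(0.0):.4f}")
```

All three move `w` from 0 toward 3. With a constant learning rate, RMSProp tends to hover near the minimum rather than settle exactly on it, which is why it is often paired with a decaying learning rate.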

**17. What is sklearn.linear_model?**  

Ans - A module in `scikit-learn` that provides linear models like Linear Regression and Logistic Regression.
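
A minimal sketch of using the module, assuming a tiny dataset whose points lie exactly on the line `y = 2x + 1`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Points that lie exactly on the line y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)   # slope and intercept, approximately 2 and 1
print(model.predict([[4.0]]))          # approximately 9
```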

**18. What does model.fit() do? What arguments must be given?**  

Ans - `model.fit()` trains a model on the training data. Both scikit-learn and TensorFlow/Keras models provide it; in Keras, it:
* Feeds training data (features & labels) into the model.
* Performs forward & backward propagation to adjust weights.
* Optimizes the model using a loss function and optimizer.

Syntax:

    model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val))

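**Required arguments:**  
* `X_train`: Features (input training data).  
* `y_train`: Target values for training.

**Optional arguments (Keras):**  
* `epochs`: Number of passes over the training data (defaults to 1).
* `batch_size`: Number of samples per gradient update (defaults to 32 when not specified).
* `validation_data`: Held-out data evaluated at the end of each epoch.

scikit-learn estimators take only the data, as in `model.fit(X_train, y_train)`; they do not accept `epochs` or `batch_size`.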

**19. What does model.predict() do? What arguments must be given?**  

Ans - The `model.predict()` function is used to generate predictions from a trained model based on new input data.

**Required Arguments:**  
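* `X`: The input features to predict on (for example `X_test`), with the same columns and preprocessing as the training data.  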
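
The pairing of `fit()` and `predict()` can be sketched with a scikit-learn classifier. Note that this uses scikit-learn's `LogisticRegression` rather than a Keras model, and the one-feature dataset is invented for illustration.

```python
from sklearn.linear_model import LogisticRegression

# Toy one-feature dataset: small values are class 0, large values are class 1
X_train = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y_train = [0, 0, 0, 1, 1, 1]

model = LogisticRegression()
model.fit(X_train, y_train)                  # required: features and targets

X_new = [[0.5], [4.5]]
labels = model.predict(X_new)                # required: features only
probabilities = model.predict_proba(X_new)   # class probabilities per row

print(labels)
print(probabilities.round(2))
```

`predict()` returns one label per input row, while `predict_proba()` returns one column per class.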

**20. What are continuous and categorical variables?**

Ans - **[Repeated question, same as Q5.]**

**21. What is feature scaling? How does it help in Machine Learning?**  

Ans - Feature scaling is a technique used in Machine Learning to standardize the range of independent variables (features) in a dataset. It ensures that all features have comparable scales, preventing certain features from dominating others due to their larger numerical ranges.

Feature scaling helps in Machine Learning by:  

1. **Faster Convergence** - Gradient-based algorithms (e.g., Gradient Descent) work efficiently with scaled data.  
2. **Improved Accuracy** - Distance-based models (e.g., k-NN, K-Means, SVM) perform better when features have similar scales.  
3. **Prevents Bias** - Ensures no single feature dominates due to larger numerical values.  
4. **Better Model Interpretation** - With standardized features, the coefficients of linear models are on a comparable scale, which makes them easier to compare.

**22. How do we perform scaling in Python?**  

Ans - We can perform feature scaling in Python using scikit-learn. Two common methods are:

**1. Min-Max Scaling (Normalization):** Scales values to a fixed range (usually 0 to 1).  
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
```

**2. Standardization (Z-score Normalization):** Scales data to have **mean = 0** and **standard deviation = 1**.  
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
```
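
In practice a scaler is usually fitted on the training data and then reused on the test data, so that no information from the test set leaks into preprocessing. The sketch below shows this with toy numbers invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy single-feature data invented for illustration
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[40.0]])

min_max = MinMaxScaler().fit(X_train)       # learns min = 10 and max = 30
standard = StandardScaler().fit(X_train)    # learns mean = 20 and std of about 8.165

print(min_max.transform(X_train).ravel())   # 0.0, 0.5, 1.0
print(min_max.transform(X_test).ravel())    # 1.5, outside the [0, 1] range
print(standard.transform(X_train).ravel())  # mean 0, unit variance
print(standard.transform(X_test).ravel())
```

The test value maps to 1.5 because it lies beyond the largest training value. That is expected: the scaler only knows the range it saw during fitting.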

**23. What is sklearn.preprocessing?**  
Ans - **[Repeated question, same as Q8.]**


**24. How do we split data for model fitting (training and testing) in Python?**  
Ans - **[Repeated question, same as Q10.]**

**25. Explain data encoding?**  

Ans - Data encoding transforms categorical variables into numerical format for machine learning models.

**Types of Data Encoding:**

1. **One-hot encoding**
   * Converts each category into its own binary column.
   * Used for nominal data (where order does not matter).
2. **Label encoding**
   * Converts categories into integer values.
   * The integers are arbitrary identifiers (scikit-learn's `LabelEncoder` assigns them in sorted order and is intended for target labels), so they should not be read as a ranking.
3. **Ordinal encoding**
   * Assigns integers according to a specified order, so the encoded values preserve rank.
   * Used when categories have a meaningful order or ranking, even if the differences between them are not uniform.
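
The three encodings can be compared side by side with scikit-learn. This is a minimal sketch on a made-up `sizes` column; the category order passed to `OrdinalEncoder` is an assumption about what the ranking should be.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

sizes = np.array([["low"], ["high"], ["medium"], ["low"]])

# One-hot: one column per category (columns follow sorted category order)
one_hot = OneHotEncoder().fit_transform(sizes).toarray()

# Label encoding: integers assigned in sorted (alphabetical) order
label = LabelEncoder().fit_transform(sizes.ravel())

# Ordinal encoding: integers follow the order we specify
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]]).fit_transform(sizes)

print(one_hot)
print(label)            # high=0, low=1, medium=2
print(ordinal.ravel())  # low=0, medium=1, high=2
```

The label-encoded values put "high" below "low" because the categories are sorted alphabetically, which is why an explicit order is needed when rank matters.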