## Feature Engineering Assignment

### 1. What is a parameter?

A parameter is an internal variable of a model that is learned from the training data. For example, in linear regression, the slope (weights) and intercept are parameters.


### 2. What is correlation?

Correlation measures the strength and direction of a linear relationship between two variables. It ranges from -1 to +1.


* What does negative correlation mean?

Negative correlation means that as one variable increases, the other tends to decrease. A correlation value close to -1 indicates a strong negative relationship.
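To make the definition concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The function name `pearson_r` and the exercise/heart-rate numbers are made up for illustration; they are not part of the assignment.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: hours of exercise per week vs. resting heart rate
hours = [1, 2, 3, 4, 5]
heart_rate = [80, 76, 73, 69, 66]

r = pearson_r(hours, heart_rate)
print(round(r, 3))  # a value close to -1: strong negative correlation
```

Because heart rate falls almost linearly as exercise hours rise, the coefficient lands very near -1.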


### 3. Define Machine Learning. What are the main components in Machine Learning?

Machine Learning (ML) is a field of AI that enables systems to learn patterns from data and make predictions.

**Main components:**

* Data
* Model
* Loss Function
* Optimizer
* Evaluation Metrics


### 4. How does loss value help in determining whether the model is good or not?

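The loss value quantifies the difference between the predicted and actual outputs, so a lower loss means the predictions are closer to the targets. A low loss on the training data alone is not enough: if the loss on held-out data is much higher, the model is overfitting.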
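As an illustration, the sketch below compares two sets of predictions using mean squared error, one common loss for regression. The helper `mean_squared_error` is written by hand here, and the numbers are invented for the example.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and the predictions of two candidate models
y_true = [3.0, 5.0, 7.0, 9.0]
preds_a = [2.5, 5.5, 7.0, 8.0]
preds_b = [1.0, 8.0, 4.0, 12.0]

loss_a = mean_squared_error(y_true, preds_a)
loss_b = mean_squared_error(y_true, preds_b)

print(loss_a)  # 0.375
print(loss_b)  # 7.75
```

Model A has the lower loss, so its predictions sit closer to the actual values on this data.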


### 5. What are continuous and categorical variables?

* Continuous: Numeric variables that can take any value within a range (e.g., age, salary).
* Categorical: Variables that take one of a limited set of labels (e.g., gender, color).


### 6. How do we handle categorical variables in Machine Learning? What are the common techniques?

In Machine Learning, categorical variables are variables that contain label values rather than numeric values. Since most algorithms work only with numerical input, we must convert these categorical variables into a suitable numerical format. This process is known as encoding. There are several common techniques used to handle categorical variables:

* Label Encoding - Assigns an integer to each category.
* One-Hot Encoding - Creates one binary column per category.
* Ordinal Encoding - Maps ordered categories to integers that preserve the order.


### 7. What do you mean by training and testing a dataset?

* Training: Data used to teach the model.
* Testing: Data used to evaluate model performance on unseen data.


### 8. What is sklearn.preprocessing?

A module in Scikit-learn that provides functions for data preprocessing such as scaling, encoding, and normalization.


### 9. What is a Test set?

A test set is a subset of data used only for evaluating the performance of a trained machine learning model.


### 10. How do we split data for model fitting (training and testing) in Python?

Using `train_test_split()` from `sklearn.model_selection`:

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

* How do you approach a Machine Learning problem?

1. Understand the problem and data
2. Perform Exploratory Data Analysis (EDA)
3. Preprocess data
4. Split into training/test sets
5. Choose and train model
6. Evaluate performance
7. Tune hyperparameters
8. Deploy the model


### 11. Why do we have to perform EDA before fitting a model to the data?

EDA helps:

* Understand data distribution
* Detect outliers and missing values
* Discover relationships between variables
* Guide feature engineering


### 12. What is correlation? (repeated)

See Answer #2 above.

### 13. What does negative correlation mean? (repeated)

See Answer #2 above.

### 14. How can you find correlation between variables in Python?

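Using the pandas `DataFrame.corr()` method, which returns the pairwise Pearson correlation between columns by default. If the DataFrame also contains non-numeric columns, pass `numeric_only=True`: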

```python
df.corr()
```

### 15. What is causation? Explain difference between correlation and causation with an example.

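Causation means that a change in one variable directly produces a change in another. Correlation only means that two variables move together; it does not say why.

* Correlation without causation: Ice cream sales and drowning rates both increase in summer. Neither causes the other; hot weather drives both.
* Causation: Smoking causes lung cancer.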


### 16. What is an Optimizer? What are different types of optimizers? Explain each with an example.

An optimizer is an algorithm that updates the model parameters to minimize the loss.

Types:

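* SGD (Stochastic Gradient Descent): Updates the parameters using the gradient from a single sample (or a small mini-batch) at a time, e.g., adjusting the slope and intercept of a linear regression after each data point.
* Adam: Combines momentum with per-parameter adaptive learning rates.
* RMSprop: Adapts the learning rate of each parameter using a moving average of its recent squared gradients.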
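The SGD idea can be sketched in a few lines of plain Python. This is a toy example under stated assumptions: the data, the learning rate of 0.02, and the 500 epochs are all chosen for illustration, and real libraries add many refinements.

```python
import random

# Hypothetical noise-free data generated from y = 2x + 1
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]

w, b = 0.0, 0.0          # parameters to learn (slope and intercept)
learning_rate = 0.02     # hyperparameter chosen for this toy example
rng = random.Random(0)   # fixed seed so the run is repeatable

for epoch in range(500):
    rng.shuffle(data)    # visit the samples in a new random order each epoch
    for x, y in data:
        error = (w * x + b) - y
        # Gradients of the squared error (error ** 2) for this single sample
        grad_w = 2 * error * x
        grad_b = 2 * error
        # Step against the gradient to reduce the loss
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))  # should end up near slope 2 and intercept 1
```

Each update uses one sample only, which is what makes the method "stochastic". Adam and RMSprop keep the same loop but rescale each step using running statistics of past gradients.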


### 17. What is sklearn.linear\_model?

A Scikit-learn module that provides linear models like:

* `LinearRegression()`
* `LogisticRegression()`


### 18. What does model.fit() do? What arguments must be given?

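It trains the model: the estimator learns its parameters from the data. For supervised models, two arguments are required: the feature matrix `X` and the target values `y`.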

```python
model.fit(X_train, y_train)
```

### 19. What does model.predict() do? What arguments must be given?

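It uses the trained model to predict outcomes for new input data. It takes one required argument: the feature matrix `X`, with the same features that were used in `fit()`.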

```python
predictions = model.predict(X_test)
```

### 20. What are continuous and categorical variables? (repeated)

See Answer #5 above.

### 21. What is feature scaling? How does it help in Machine Learning?

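Feature scaling brings all features onto a comparable range, so that a feature measured in large units (e.g., salary) does not dominate one measured in small units (e.g., age). It helps gradient-based models converge faster and improves the performance of distance-based models such as k-nearest neighbors and SVMs.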
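A minimal sketch of one common scaling method, standardization (subtract the mean, divide by the standard deviation), written by hand. The helper `standardize` and the age/salary values are hypothetical.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# Hypothetical features on very different scales
ages = [20, 30, 40, 50, 60]
salaries = [30_000, 45_000, 60_000, 75_000, 90_000]

ages_scaled = standardize(ages)
salaries_scaled = standardize(salaries)

print([round(v, 3) for v in ages_scaled])
print([round(v, 3) for v in salaries_scaled])
```

After scaling, both features have mean 0 and standard deviation 1, so neither dominates purely because of its units.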

### 22. How do we perform scaling in Python?

Using Scikit-learn:

```python
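from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit on the training data only, then reuse the same statistics on the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)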
```

### 23. What is sklearn.preprocessing? (repeated)

See Answer #8 above.


### 24. How do we split data for model fitting (training and testing) in Python?

To evaluate a machine learning model fairly, the dataset is typically split into two or more parts: **training** and **testing** sets. The training set is used to teach the model, while the testing set evaluates how well the model performs on unseen data.

In Python, the most common way to do this is `train_test_split()` from `sklearn.model_selection`.

#### Code Example:

```python
from sklearn.model_selection import train_test_split

# X is the feature matrix, y is the target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

#### Explanation:

* `test_size=0.2` means 20% of the data will be used for testing.
* `random_state` ensures reproducibility of the split.
* This results in four datasets: `X_train`, `X_test`, `y_train`, `y_test`.
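Conceptually, a random split is a shuffle followed by a slice. The sketch below shows that idea in plain Python; `simple_train_test_split` is a hypothetical helper written for this note, not scikit-learn's implementation, and it will not reproduce the exact rows that `train_test_split()` selects.

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices, then use the first `test_size` fraction as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)   # seeded so the split is reproducible
    n_test = round(len(rows) * test_size)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))  # stand-in for 10 data rows
train, test = simple_train_test_split(rows)

print(len(train), len(test))  # 8 2
```

Every row ends up in exactly one of the two sets, and the same seed always gives the same split.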


### 25. Explain Data Encoding

Data encoding is the process of converting categorical (non-numeric) data into a numeric format that can be used by machine learning algorithms, which generally require numerical input.

There are several common encoding techniques:

#### 1. Label Encoding

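* Assigns a unique integer to each category.
* Scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order, so "High", "Low", "Medium" become 0, 1, 2 rather than following their natural order. It is intended mainly for encoding target labels.
* Not ideal for nominal input features, because a model may misread the integers as an order. For ordinal features, use Ordinal Encoding with an explicit category order.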

```python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Category'] = le.fit_transform(df['Category'])
```

#### 2. One-Hot Encoding

* Creates a new binary column for each category.
* Best suited for nominal (unordered) categories.

```python
import pandas as pd
df = pd.get_dummies(df, columns=['Category'])
```

#### 3. Ordinal Encoding

* Similar to label encoding but specifically for ordered categories.
* Explicit order can be defined during encoding.

```python
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
df[['Level']] = encoder.fit_transform(df[['Level']])
```

#### 4. Target Encoding (advanced)

* Replaces each category with the average target value for that category.
* Can overfit; often used with regularization or cross-validation.
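A minimal sketch of the basic idea, without any regularization. The city labels and target values are invented for illustration; in practice the category means should be computed from the training data only.

```python
from collections import defaultdict

# Hypothetical categorical feature and binary target
cities = ["A", "B", "A", "C", "B", "A"]
target = [1, 0, 0, 1, 1, 1]

# Accumulate the target sum and row count for each category
sums = defaultdict(float)
counts = defaultdict(int)
for city, t in zip(cities, target):
    sums[city] += t
    counts[city] += 1

# Mean target value per category
means = {city: sums[city] / counts[city] for city in sums}

# Replace each category with its mean target value
encoded = [means[city] for city in cities]
print(means)
```

Category "C" appears only once, so its encoding is just that single row's target value. This is the overfitting risk mentioned above, and it is why smoothing or cross-validation is often added.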


Encoding ensures that the model interprets categorical variables correctly, improving training efficiency and accuracy. Choosing the right encoding method depends on whether the categories are ordinal or nominal and on the type of machine learning algorithm used.
