# **Feature Engineering Assignment**

## 1. What is a parameter?

In machine learning, a **parameter** is an internal variable whose value the model learns from the training data; together, the parameters determine how the model turns inputs into predictions. For example, in linear regression the coefficients and the intercept are parameters: in a line equation like `y = mx + c`, `m` and `c` are the parameters, and training searches for the values that best fit the data.

---

## 2. What is correlation?

Correlation is a statistical measure that expresses the extent to which two variables are linearly related. It measures both the strength and direction of the relationship between two variables. Correlation values range from -1 to +1, where:

- +1 indicates a perfect positive correlation
- 0 indicates no linear correlation
- -1 indicates a perfect negative correlation

---

## 3. What does negative correlation mean?

Negative correlation means that as one variable increases, the other variable tends to decrease. For example, if there's a negative correlation between study hours and exam failure rate, it means that as study hours increase, the failure rate tends to decrease. Graphically, points in a scatter plot would form a pattern from the upper left to the lower right.
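As a small illustration of the study-hours example, the sketch below computes the Pearson correlation with NumPy. The numbers are made-up toy values (an assumption for illustration, not real measurements); the point is only that the coefficient comes out negative when one variable falls as the other rises.

```python
import numpy as np

# Hypothetical toy data: weekly study hours and the share of students failing
study_hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
failure_rate = np.array([0.62, 0.55, 0.47, 0.41, 0.30, 0.24, 0.15, 0.10])

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is the pairwise value
r = np.corrcoef(study_hours, failure_rate)[0, 1]
print(f"Pearson r = {r:.3f}")  # negative: failure rate falls as hours rise
```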

---

## 4. Define Machine Learning. What are the main components?

Machine Learning is a subset of artificial intelligence that enables computers to learn from data and improve from experience without being explicitly programmed. The machine "learns" patterns from data and makes predictions or decisions based on what it has learned.

Main components of Machine Learning:

1. Data: The information used to train the model
2. Features: The variables or attributes in the data
3. Algorithm: The mathematical approach used for learning patterns
4. Model: The representation learned from the data
5. Training: The process of learning patterns from data
6. Evaluation: Assessing how well the model performs
7. Hyperparameters: Configuration settings for the algorithm

---

## 5. How does loss value help in determining whether the model is good or not?

The loss value measures how far the model's predictions are from the actual values. A lower loss value indicates better performance. When training a model, we aim to minimize this loss value.

The loss value helps determine if a model is good by:

- Tracking improvement during training
- Comparing different models
- Identifying overfitting (when training loss decreases but validation loss increases)
- Setting a benchmark for acceptable performance

A good model will have a low loss value on both training and testing data.
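To make the idea of "distance between predictions and actual values" concrete, here is a minimal sketch that computes mean squared error (one common loss) for two sets of predictions. The arrays `pred_a` and `pred_b` are hypothetical outputs of two models, chosen only for illustration.

```python
import numpy as np

# Actual target values and two hypothetical sets of predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
pred_a = np.array([2.5, 5.5, 7.0, 9.5])   # close to the targets
pred_b = np.array([1.0, 8.0, 4.0, 12.0])  # far from the targets


def mse(y, y_hat):
    """Mean squared error: the average of the squared prediction errors."""
    return float(np.mean((y - y_hat) ** 2))


loss_a = mse(y_true, pred_a)
loss_b = mse(y_true, pred_b)
print(f"Loss A: {loss_a}, Loss B: {loss_b}")  # the lower loss is the better fit
```

In practice the same comparison would be made on both training and validation data, since a low training loss alone can hide overfitting.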

---

## 6. What are continuous and categorical variables?

Continuous variables:

- Can take any numerical value within a range
- Examples: height, weight, temperature, income
- Can be measured on a scale and can have decimal values

Categorical variables:

- Take on a limited set of discrete values representing categories or groups
- Examples: gender, country, color, yes/no responses
- Arithmetic on them is not meaningful; some categories (ordinal) have a natural order, while others (nominal) do not

---

## 7. How do we handle categorical variables in Machine Learning? What are the common techniques?

Common techniques for handling categorical variables:

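### 1. Label Encoding:

- Converts each category to a unique integer
- Suitable for ordinal categories (where order matters), provided the integers follow that order; scikit-learn's `LabelEncoder` numbers classes in sorted (alphabetical) order and is intended mainly for target labels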

```python
# Import required libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Create a sample dataset
data = pd.DataFrame({
    'category': ['apple', 'banana', 'apple', 'orange', 'banana', 'orange', 'apple']
})

# Initialize the LabelEncoder
encoder = LabelEncoder()

# Fit and transform the 'category' column
data['category_encoded'] = encoder.fit_transform(data['category'])

# Show the original and encoded data
print(data)

# Optional: Display the mapping of labels
label_mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))
print("\nLabel Mapping:")
print(label_mapping)
```


```text
  category  category_encoded
0    apple                 0
1   banana                 1
2    apple                 0
3   orange                 2
4   banana                 1
5   orange                 2
6    apple                 0

Label Mapping:
{'apple': np.int64(0), 'banana': np.int64(1), 'orange': np.int64(2)}
```


### 2. One-Hot Encoding:

- Creates binary columns for each category
- Avoids implying ordinal relationships

```python
import pandas as pd

# `data` is the DataFrame created in the label-encoding example
encoded_data = pd.get_dummies(data, columns=['category'])
encoded_data
```

```text
   category_encoded  category_apple  category_banana  category_orange
0                 0            True            False            False
1                 1           False             True            False
2                 0            True            False            False
3                 2           False            False             True
4                 1           False             True            False
5                 2           False            False             True
6                 0            True            False            False
```


### 3. Binary Encoding:

- Represents categories as binary code
- More memory-efficient than one-hot for many categories

### 4. Target Encoding:

- Replaces categories with the mean target value for that category
- Useful for high-cardinality features

### 5. Frequency Encoding:

- Replaces categories with their frequency in the dataset
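Frequency encoding can be sketched with plain pandas. The example below is a minimal illustration that reuses the fruit categories from the earlier cells; the column name `category_freq` is a name chosen here, not part of any library.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['apple', 'banana', 'apple', 'orange', 'banana', 'orange', 'apple']
})

# Relative frequency of each category (count / number of rows)
freq = df['category'].value_counts(normalize=True)

# Replace each category with its relative frequency
df['category_freq'] = df['category'].map(freq)
print(df)
```

One caveat of this approach: categories that occur equally often (here `banana` and `orange`) receive the same encoded value and become indistinguishable to the model.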

---

## 8. What is training and testing a dataset?

**Training (fitting the model on the training set)**:

- Using a portion of the data to teach the model patterns and relationships
- The model learns from this data by adjusting its parameters
- The goal is for the model to learn general patterns, not memorize the data

**Testing (evaluating the model on the test set)**:

- Using a separate portion of data (unseen during training) to evaluate model performance
- Tests the model's ability to generalize to new, unseen data
- Provides an unbiased evaluation of the model's performance

This split helps detect overfitting (a model that memorizes the training data will score noticeably worse on the test data) and gives a more realistic estimate of how the model will perform in real-world scenarios.

---

## 9. What is `sklearn.preprocessing`?

`sklearn.preprocessing` is a module in the scikit-learn library that provides functions and classes for data preprocessing. These tools help transform raw data into a format that is more suitable for machine learning algorithms.

Key features include:

- Scaling: `StandardScaler`, `MinMaxScaler`, `RobustScaler`
- Normalization: `Normalizer`
- Encoding: `OneHotEncoder`, `OrdinalEncoder`, `LabelEncoder`
- Transformation: `PolynomialFeatures`, `PowerTransformer`

Note that imputation of missing values is not part of this module: `SimpleImputer` lives in `sklearn.impute`.

Example:

```python
# Feature Scaling using StandardScaler

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Sample dataset
data = pd.DataFrame({
    'height': [160, 170, 180, 190],
    'weight': [55, 65, 75, 85],
    'age': [20, 25, 30, 35]
})

# Display original data
print("Original Data:")
print(data)

# Apply Standard Scaling (mean = 0, std = 1)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Convert back to DataFrame for better readability
scaled_df = pd.DataFrame(scaled_data, columns=data.columns)

print("\nScaled Data (StandardScaler):")
print(scaled_df)
```


```text
Original Data:
   height  weight  age
0     160      55   20
1     170      65   25
2     180      75   30
3     190      85   35

Scaled Data (StandardScaler):
     height    weight       age
0 -1.341641 -1.341641 -1.341641
1 -0.447214 -0.447214 -0.447214
2  0.447214  0.447214  0.447214
3  1.341641  1.341641  1.341641
```

---

## 10. What is a Test set?

A test set is a portion of the original dataset that is set aside and not used during the training process. It represents new, unseen data that the model will encounter in real-world applications.

The test set serves several important purposes:

- Evaluates how well the model generalizes to new data
- Provides an unbiased evaluation of the final model's performance
- Helps detect overfitting (if the model performs significantly worse on test data than on training data)
- Simulates real-world deployment scenarios

A good test set should be representative of the overall data distribution and large enough to give a reliable performance estimate.

---

## 11. How do we split data for model fitting (training and testing) in Python?

We use the `train_test_split()` function from **scikit-learn** to split our dataset into two parts:
- One part for **training the model**
- Another part for **testing how well the model performs**




```python
# Train-test split with model training and prediction

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Step 1: Create a sample dataset
data = pd.DataFrame({
    'experience': [1, 2, 3, 4, 5, 6, 7, 8],
    'salary': [30000, 35000, 40000, 45000, 50000, 55000, 60000, 65000]
})

# Step 2: Define features (X) and target (y)
X = data[['experience']]
y = data['salary']

# Step 3: Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,       # 20% test set
    random_state=42      # Reproducible results
)

# Step 4: Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 5: Predict on test set
predictions = model.predict(X_test)

# Step 6: Evaluate the model
print("Predictions:", predictions)
print("Actual:", y_test.values)
print("Mean Squared Error:", mean_squared_error(y_test, predictions))
```


```text
Predictions: [35000. 55000.]
Actual: [35000 55000]
Mean Squared Error: 0.0
```


- The `test_size` parameter controls what proportion goes to the test set (typically 20-30%).
- The `random_state` parameter ensures reproducibility by using the same random split each time.

The error is exactly zero here only because the toy salary data is perfectly linear; real data will almost never give a zero test error.

For time series data or when random splitting isn't appropriate, other methods like `TimeSeriesSplit` can be used.
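As a rough sketch of how `TimeSeriesSplit` differs from a random split, the example below applies it to ten time-ordered samples. The data is a placeholder (`np.arange`), assumed only for illustration; the thing to observe is that every training index comes before every test index in each fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten observations in time order (placeholder values)
X = np.arange(10).reshape(-1, 1)

# Each fold trains on an expanding window of the past and tests on what follows
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```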

---

## 12. How do you approach a ML problem?

A systematic approach to a machine learning problem:

1. Problem Definition

- Clearly define the problem and objectives
- Identify what success looks like
- Determine appropriate evaluation metrics

2. Data Collection

- Gather relevant data from various sources
- Ensure data quality and quantity

3. Exploratory Data Analysis (EDA)

- Understand data distributions and relationships
- Identify patterns, outliers, and missing values
- Visualize key insights

4. Data Preprocessing

- Handle missing values
- Convert categorical variables
- Scale/normalize features (fit scalers and encoders on the training data only, to avoid leaking test information)

5. Feature Selection/Engineering

- Select relevant features
- Create new features to improve model performance
- Reduce dimensionality if necessary

6. Model Selection

- Choose appropriate algorithms based on the problem
- Consider interpretability vs. performance trade-offs

7. Training and Validation

- Split data into training, validation, and test sets
- Train multiple models
- Use cross-validation for robust evaluation

8. Hyperparameter Tuning

- Optimize model hyperparameters
- Use grid search or random search

9. Model Evaluation

- Assess performance on test data
- Analyze errors and edge cases

10. Deployment and Monitoring

- Implement the model in production
- Monitor performance over time
- Retrain as needed
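The preprocessing, training, tuning, and evaluation steps above can be sketched in a few lines with scikit-learn. This is only a minimal illustration under stated assumptions: the data is synthetic (`make_classification`), and the choice of logistic regression and of the `C` values in the grid is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Hold out a test set before any fitting happens
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Preprocessing and model in one pipeline, so scaling is refit inside each CV fold
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyperparameter tuning with 5-fold cross-validation on the training data only
search = GridSearchCV(pipeline, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Final evaluation on the untouched test set
test_accuracy = search.score(X_test, y_test)
print("Best hyperparameters:", search.best_params_)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Wrapping the scaler in the pipeline is a design choice that keeps test and validation data out of the preprocessing statistics.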

---

## 13. Why do EDA before fitting a model?

Exploratory Data Analysis (EDA) is crucial before model fitting because:

1. Understanding the Data: EDA helps you understand the structure, patterns, and characteristics of your data. This understanding guides feature selection and engineering.

2. Identifying Issues: EDA reveals missing values, outliers, imbalanced classes, or skewed distributions that could negatively impact model performance.

3. Feature Relationships: EDA uncovers relationships between features and the target variable, helping identify which features might be most important.

4. Data Quality Assessment: EDA helps detect data quality issues like duplicate records, inconsistent formats, or data entry errors.

5. Informing Preprocessing: EDA guides decisions about scaling, transformation, encoding, and other preprocessing steps.

6. Hypothesis Generation: EDA helps generate hypotheses about what factors influence the target variable.

7. Preventing Surprises: EDA helps avoid unexpected issues during model training that could waste time and computational resources.

8. Guiding Model Selection: Understanding data characteristics helps choose appropriate models (e.g., linear vs. non-linear).

By performing thorough EDA, you set a strong foundation for successful modeling and avoid potential pitfalls.
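A few of these checks can be sketched with pandas alone. The tiny DataFrame below is hypothetical and deliberately contains a missing value in two columns, one duplicated row, and an implausible age, so that each check has something to find.

```python
import numpy as np
import pandas as pd

# Hypothetical data with deliberate quality problems
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 32, 250],
    "city": ["Pune", "Delhi", "Delhi", None, "Delhi", "Pune"],
    "purchased": [0, 1, 1, 0, 1, 0],
})

# Summary statistics: the maximum age stands out as a likely data-entry error
print(df.describe())

# Missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Fully duplicated rows
n_duplicates = int(df.duplicated().sum())
print("Duplicate rows:", n_duplicates)

# Class balance of the target
class_balance = df["purchased"].value_counts(normalize=True)
print(class_balance)
```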

---
## 14. What is correlation?

Correlation is a statistical measure that indicates the extent to which two variables change together. It measures both the strength and direction of the linear relationship between two variables.

Key points about correlation:

- It ranges from -1 to +1
- A value close to +1 indicates a strong positive relationship
- A value close to -1 indicates a strong negative relationship
- A value near 0 indicates little or no linear relationship

The most common measure is the Pearson correlation coefficient, which specifically measures linear relationships. Other types include Spearman's rank correlation (for non-linear monotonic relationships) and Kendall's tau.

📌 **Example**:  
When temperature increases, ice cream sales also increase — this is **positive correlation**.

---

## 15. What does negative correlation mean?

**Negative correlation** means that when one value increases, the other one tends to decrease.  
They move in **opposite directions**.

📌 **Example**:  
As the number of hours you spend watching TV goes up, your exam score might go down.  
This is a **negative correlation**.


---

## 16. How can you find correlation between variables in Python?

In Python, there are several ways to find correlations between variables:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load sample data
df = pd.DataFrame({
    'feature1': np.random.normal(0, 1, 100),
    'feature2': np.random.normal(0, 1, 100),
    'feature3': np.random.normal(0, 1, 100)
})

# Create a correlated feature
df['feature4'] = df['feature1'] * 2 + np.random.normal(0, 0.5, 100)

# Method 1: Correlation matrix
correlation_matrix = df.corr()
print(correlation_matrix)

# Method 2: Visual representation with heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()

# Method 3: Pairplot to visualize relationships
sns.pairplot(df)
plt.show()

# Method 4: Calculate specific correlation
specific_corr = df['feature1'].corr(df['feature4'])
print(f"Correlation between feature1 and feature4: {specific_corr:.2f}")

```

For relationships that are monotonic but not linear, use Spearman's rank correlation instead:

```python
spearman_corr = df.corr(method='spearman')
print(spearman_corr)
```


---

## 17. What is causation? Explain difference between correlation and causation with an example.

Causation refers to a relationship where one event (the cause) directly influences another event (the effect). It implies that a change in one variable directly causes a change in another variable.

Difference between correlation and causation:

1. Correlation: Measures how variables change together; doesn't imply that one causes the other
2. Causation: Indicates that changes in one variable directly cause changes in another

Example:

There is a strong positive correlation between ice cream sales and drowning deaths: both increase during the summer months. But eating ice cream does not cause drowning, so the correlation does not reflect causation.

- Correlation perspective: Ice cream sales and drowning deaths are positively correlated.
- Causation reality: Ice cream sales don't cause drownings, nor do drownings cause ice cream sales.
- Actual explanation: The hidden variable is hot weather/summer season, which independently causes both increased ice cream consumption and more people swimming (leading to more drowning incidents).

This example demonstrates why "correlation does not imply causation" is a fundamental principle in data analysis. To establish causation, you typically need controlled experiments or more sophisticated causal inference techniques.
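The ice cream example can be simulated to show how a hidden variable produces correlation without causation. Everything below is synthetic: the coefficients and noise levels are assumptions picked for illustration, and removing the temperature effect with a straight-line fit is just one simple way to control for a confounder.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hidden common cause: daily temperature
temperature = rng.normal(25, 5, n)

# Both quantities depend on temperature, but not on each other
ice_cream = 10.0 * temperature + rng.normal(0, 20, n)
drownings = 0.5 * temperature + rng.normal(0, 1, n)

# Raw correlation is strongly positive
raw_corr = np.corrcoef(ice_cream, drownings)[0, 1]


def residuals(values, confounder):
    """Remove the linear effect of the confounder from the values."""
    coefs = np.polyfit(confounder, values, 1)
    return values - np.polyval(coefs, confounder)


# After controlling for temperature, the correlation largely disappears
partial_corr = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"Raw correlation: {raw_corr:.2f}")
print(f"Correlation after removing temperature: {partial_corr:.2f}")
```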

---

## 18. What is an Optimizer?

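An **optimizer** is the algorithm that updates the model's parameters during training so that the loss decreases. Most optimizers are gradient-based: at each step they move the parameters a small amount in the direction that reduces the loss.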

Types:
- **SGD (Stochastic Gradient Descent)**
- **Adam**
- **RMSprop**

Example in Keras (assuming `model` is an already-built Keras model):
```python
model.compile(optimizer='adam', loss='mse')
```
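To show what an optimizer does under the hood, here is a minimal sketch of plain gradient descent fitting `y = mx + c` by minimizing mean squared error. It is not how Keras implements Adam or SGD; the data, learning rate, and step count are assumptions chosen so the loop converges.

```python
import numpy as np

# Toy data generated from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Parameters start at zero and are improved step by step
m, c = 0.0, 0.0
learning_rate = 0.05

for step in range(2000):
    error = (m * x + c) - y              # prediction error for every sample
    grad_m = 2.0 * np.mean(error * x)    # d(MSE)/dm
    grad_c = 2.0 * np.mean(error)        # d(MSE)/dc
    m -= learning_rate * grad_m          # move against the gradient
    c -= learning_rate * grad_c

final_loss = float(np.mean((m * x + c - y) ** 2))
print(f"m = {m:.3f}, c = {c:.3f}, loss = {final_loss:.6f}")
```

Optimizers such as Adam and RMSprop follow the same loop but adapt the step size per parameter.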

---

## 19. What is `sklearn.linear_model`?

`sklearn.linear_model` is the scikit-learn module that contains linear models, including `LinearRegression`, `LogisticRegression`, and regularized variants such as `Ridge` and `Lasso`.

---

## 20. What does model.fit() do? What arguments must be given?

The model.fit() method trains a machine learning model on the provided data. It's where the model learns from the data by adjusting its parameters to minimize the loss function.

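Required arguments (for supervised estimators in scikit-learn):

- X: The feature data (independent variables), typically a 2D array or DataFrame
- y: The target data (dependent variable), typically a 1D array or Series

Unsupervised estimators such as `KMeans` are fitted with `X` alone.

Example: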
```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Create sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X, y)

# Now the model has learned the parameters
print(f"Coefficient: {model.coef_}")
print(f"Intercept: {model.intercept_}")
```

---

## 21. What does model.predict() do? What arguments must be given?

The model.predict() method uses a trained machine learning model to make predictions on new data. It applies the patterns learned during training to generate output for unseen data.

Required arguments:

- X: The feature data for which you want predictions, in the same format (same columns, same order) as the training data

Example:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Create and train a model
X_train = np.array([[1], [2], [3], [4]])
y_train = np.array([2, 4, 6, 8])
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on new data
X_new = np.array([[5], [6]])
predictions = model.predict(X_new)
print(f"Predictions: {predictions}")  # Approximately [10. 12.] (floats)
```

---

## 22. What are continuous and categorical variables?

Continuous variables:

- Can take any numerical value within a range
- Represent measurements where the difference between values is meaningful
- Examples: height, weight, temperature, income, age
- Can be further divided into interval and ratio variables

Categorical variables:

- Take on discrete values that represent categories or groups
- Examples: gender, blood type, country, product category
- Can be further divided into:
  - Nominal: Categories with no natural order (e.g., colors, blood types)
  - Ordinal: Categories with a meaningful order (e.g., education level, satisfaction ratings)

In machine learning, different handling techniques are required for these variable types because most algorithms expect numerical inputs.

---

## 23. What is feature scaling? How does it help in Machine Learning?

Feature scaling is the process of bringing independent variables onto a comparable scale, for example by rescaling each feature to the range 0 to 1 (min-max scaling) or by standardizing it to mean 0 and standard deviation 1.

How feature scaling helps in Machine Learning:

1. Improves convergence speed: Algorithms like gradient descent converge faster with scaled features.

2. Prevents dominance of large-scale features: Ensures that features with larger values don't dominate those with smaller values.

3. Essential for distance-based algorithms: Algorithms like k-NN, k-means, and SVM that use distance calculations require scaling for proper functioning.

4. Improves regularization effectiveness: L1/L2 regularization works more effectively when features are on similar scales.

5. Necessary for PCA and neural networks: These methods are highly sensitive to feature scaling.

Common scaling techniques include Min-Max Scaling (to 0-1 range) and Standardization (mean=0, std=1).
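The two techniques reduce to short formulas: min-max scaling computes `(x - min) / (max - min)`, and standardization computes `(x - mean) / std`. The sketch below applies both by hand with NumPy to a small set of example heights (values assumed for illustration).

```python
import numpy as np

heights = np.array([160.0, 170.0, 180.0, 190.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1
min_max = (heights - heights.min()) / (heights.max() - heights.min())

# Standardization: subtract the mean, divide by the (population) standard deviation
standardized = (heights - heights.mean()) / heights.std()

print("Min-max:", min_max)
print("Standardized:", standardized)
```

NumPy's `std()` uses the population standard deviation by default, which is also what scikit-learn's `StandardScaler` uses.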

---

## 24. How to perform scaling in Python?

In Python, we can perform feature scaling with scikit-learn. The two most common scalers are `StandardScaler` and `MinMaxScaler`:

1. StandardScaler (Standardization):
```python
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Sample data
data = pd.DataFrame({
    'height': [165, 170, 180, 190],
    'weight': [60, 70, 80, 90],
    'age': [20, 30, 40, 50]
})

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
scaled_data = scaler.fit_transform(data)

# Convert back to DataFrame for better viewing
scaled_df = pd.DataFrame(scaled_data, columns=data.columns)
print(scaled_df)
```

2. MinMaxScaler (Normalization):

```python
from sklearn.preprocessing import MinMaxScaler

# Initialize the scaler
min_max_scaler = MinMaxScaler()

# Fit and transform the data
normalized_data = min_max_scaler.fit_transform(data)

# Convert back to DataFrame
normalized_df = pd.DataFrame(normalized_data, columns=data.columns)
print(normalized_df)
```
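When the data is also split into training and test sets, a common practice is to fit the scaler on the training portion only and reuse it on the test portion, so that no test-set statistics leak into training. The sketch below illustrates this with randomly generated features (an assumption; any numeric array would do).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix: 100 samples, 2 features
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 2))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuse them; do not refit on test data

print("Means learned from training data:", scaler.mean_)
```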


---

## 25. What is sklearn.preprocessing?

sklearn.preprocessing is a module in the scikit-learn library that provides various functions and classes for data preprocessing before machine learning model training. It helps in transforming raw data into a format suitable for ML algorithms.

---

## 26. How do we split data for model fitting (training and testing) in Python?

In Python, we typically use the `train_test_split` function from scikit-learn to split data into training and testing sets:

```python
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Create sample dataset
np.random.seed(42)  # for reproducibility
data = pd.DataFrame({
    'feature1': np.random.normal(0, 1, 1000),
    'feature2': np.random.normal(0, 1, 1000),
    'feature3': np.random.normal(0, 1, 1000)
})
target = np.random.randint(0, 2, 1000)  # Binary target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    data,  # Features
    target,  # Target variable
    test_size=0.2,  # Use 20% for testing, 80% for training
    random_state=42,  # For reproducible results
    stratify=target  # Maintain same class distribution in train and test sets
)

# Check the split sizes
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

# Now we can train a model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

# And evaluate on the test set (the target here is random noise,
# so accuracy close to 0.5 is expected)
accuracy = model.score(X_test, y_test)
print(f"Model accuracy: {accuracy:.2f}")
```

---

## 27. What is data encoding?

Data encoding is the process of converting categorical variables (non-numeric) into a numerical format that machine learning algorithms can understand. Since most ML algorithms require numerical input, encoding is a critical preprocessing step.

Common data encoding techniques:

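1. Label Encoding:

- Assigns a unique integer to each category
- Maintains a single column
- Implies an ordinal relationship between categories
- Best used when categories have a natural order; note that scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order, so use `OrdinalEncoder` with explicit categories when a specific order must be preserved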

Example:
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data
data = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'blue']})

# Apply label encoding
encoder = LabelEncoder()
data['color_encoded'] = encoder.fit_transform(data['color'])

print(data)
print(f"Categories mapping: {dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))}")
```

2. One-Hot Encoding:

- Creates binary columns for each category
- No implied ordering between categories
- Increases dimensionality (more columns)
- Best for nominal categorical variables

```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Sample data
data = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'blue']})

# Method 1: Using OneHotEncoder
# (`sparse_output` replaced the older `sparse` argument in scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(data[['color']])
encoded_df = pd.DataFrame(
    encoded,
    columns=encoder.get_feature_names_out(['color'])
)
print(encoded_df)

# Method 2: Using pandas get_dummies
dummies = pd.get_dummies(data['color'], prefix='color')
print(dummies)
```

3. Binary Encoding:

- Represents categories as binary digits
- More memory-efficient than one-hot for high-cardinality features
- Middle ground between label and one-hot encoding

```python
# Requires the third-party package: pip install category_encoders
import pandas as pd
from category_encoders import BinaryEncoder

# Sample data
data = pd.DataFrame({'color': ['red', 'blue', 'green', 'yellow', 'purple']})

# Apply binary encoding
encoder = BinaryEncoder(cols=['color'])
binary_encoded = encoder.fit_transform(data)

print(binary_encoded)
```

4. Target Encoding:

- Replaces categories with the mean target value for that category
- Useful for high-cardinality features
- Can lead to overfitting if not carefully implemented

```python
import pandas as pd

# Manual implementation of target encoding
data = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
    'target': [1, 0, 1, 0, 1, 1, 0]
})

# Calculate mean target value per category
target_means = data.groupby('category')['target'].mean()

# Apply encoding
data['category_encoded'] = data['category'].map(target_means)

print(data)
```
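One common way to reduce the overfitting risk mentioned above is to smooth each category mean toward the global mean, so that rare categories are not encoded from only a handful of rows. The sketch below is one possible variant, not the only one; the smoothing weight `m = 2` is an arbitrary assumption and would normally be tuned.

```python
import pandas as pd

data = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
    'target': [1, 0, 1, 0, 1, 1, 0]
})

m = 2  # smoothing weight (assumed): higher values pull harder toward the global mean
global_mean = data['target'].mean()

# Per-category mean and number of rows
stats = data.groupby('category')['target'].agg(['mean', 'count'])

# Weighted average of the category mean and the global mean
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

data['category_smoothed'] = data['category'].map(smoothed)
print(data)
```

On real data the encoding should also be computed from the training rows only (or out-of-fold), since it uses the target.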

5. Ordinal Encoding:

- Assigns integers based on the order of categories
- Used when categories have a meaningful order


```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample data with ordered categories
data = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium', 'small']})

# Define the ordering
size_ordering = [['small', 'medium', 'large']]

# Apply ordinal encoding (fit_transform returns a 2D array, so flatten it to one column)
encoder = OrdinalEncoder(categories=size_ordering)
data['size_encoded'] = encoder.fit_transform(data[['size']]).ravel()

print(data)
```

Choosing the right encoding method depends on the nature of the categorical variable and the specific requirements of the machine learning algorithm you're using.
