##Q 1. What is a parameter?
**Ans** - In machine learning generally, a parameter is a value a model learns from data, such as a weight or coefficient. In feature engineering, the term also covers any numerical or categorical value that controls how features are transformed, selected, or generated from raw data. Some of these values are learned from the data (for example, the minimum and maximum used in scaling), while others are settings chosen by the practitioner (for example, the number of bins).

**Examples of Parameters**
1. Scaling Parameters - In Min-Max Scaling, the feature values are transformed using:
        X' = (X-Xmin)/(Xmax-Xmin)
Here, Xmin and Xmax are parameters.

2. Polynomial Degree - In Polynomial Feature Engineering, the degree of the polynomial is a parameter that determines how many higher-order terms are created.
3. Number of Bins - In Binning, the number of bins is a parameter that decides how the continuous data is split into discrete intervals.
4. Encoding Parameters - In One-Hot Encoding, the set of categories found in the data is the parameter: it determines how many binary (0/1) columns are created and which category each column represents.
5. Window Size in Moving Averages - In Time Series Feature Engineering, the window size for calculating rolling statistics is a parameter.
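
As a minimal sketch of the examples above, the snippet below applies min-max scaling, binning, and a rolling mean to a small made-up series. The values and the variable names (`n_bins`, `window`) are illustrative assumptions, not part of the original notes.

```python
import pandas as pd

values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])  # made-up data

# Min-max scaling: x_min and x_max are parameters taken from the data
x_min, x_max = values.min(), values.max()
scaled = (values - x_min) / (x_max - x_min)

# Binning: the number of equal-width bins is a parameter chosen by the user
n_bins = 2
binned = pd.cut(values, bins=n_bins, labels=False)

# Rolling mean: the window size is a parameter chosen by the user
window = 3
rolling_mean = values.rolling(window=window).mean()

print(scaled.tolist())        # scaled values between 0 and 1
print(binned.tolist())        # integer bin index per value
print(rolling_mean.tolist())  # first window-1 entries are NaN
```

Changing `n_bins` or `window` changes the resulting features without touching the raw data, which is what makes them parameters of the transformation.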

##Q 2. What is correlation? What does negative correlation mean?
**Ans** - Correlation is a statistical measure that describes the relationship between two variables. It quantifies how one variable changes in relation to another. Correlation is commonly measured using Pearson's correlation coefficient (r), which ranges from -1 to 1:
* r = 1 -> Perfect positive correlation.
* r = 0 -> No correlation.
* r = -1 -> Perfect negative correlation.

Mathematically, Pearson's correlation coefficient is given by:

$$r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2}\,\sqrt{\sum_i (Y_i - \bar{Y})^2}}$$

where:
* $X_i$ and $Y_i$ are individual values,
* $\bar{X}$ and $\bar{Y}$ are the means of the variables.

**What Negative Correlation Means**

A negative correlation means that as one variable increases, the other decreases.

Examples of Negative Correlation:
1. Temperature vs. Sweater Sales - As temperature increases, sweater sales decrease.
2. Exercise vs. Body Fat Percentage - More exercise leads to lower body fat.
3. Car Speed vs. Travel Time - Higher speed reduces travel time.

If the correlation coefficient (r) is closer to -1, the negative correlation is stronger. If it is closer to 0, the relationship is weak.
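
To make the definition concrete, here is a small sketch that computes Pearson's r directly from its definition (covariance term divided by the product of the two root-sum-of-squares terms) and checks it against NumPy's built-in `np.corrcoef`. The exercise and body-fat numbers are invented for illustration.

```python
import numpy as np

# Made-up values: hours of exercise per week vs. body fat percentage
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([30.0, 27.0, 26.0, 22.0, 20.0])

# Deviations from the mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r: sum of products over the square root of the product of sums of squares
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# NumPy returns a 2x2 matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

Both values agree and are close to -1, which matches the interpretation of a strong negative correlation.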

##Q 3. Define Machine Learning. What are the main components in Machine Learning?
**Ans** - Machine Learning is a subset of artificial intelligence that enables computers to learn from data and make decisions or predictions without being explicitly programmed. Instead of following predefined rules, ML algorithms identify patterns in data and improve their performance over time.

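For example, using Tom Mitchell's framing (a program learns from experience E with respect to a task T and a performance measure P if its performance at T, as measured by P, improves with E):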
* Task T: Predicting house prices
* Experience E: Historical data of house prices
* Performance P: Accuracy of price predictions

**Main Components of Machine Learning**

1. Data
  * The foundation of ML; can be structured or unstructured.
  * Needs preprocessing.
2. Features
  * Relevant attributes extracted from data to improve model performance.
  * Example: In a house price model, features could be size, location, and number of bedrooms.
3. Model
  * The mathematical function that learns patterns from data.
  * Examples: Linear Regression, Decision Trees, Neural Networks.
4. Training Algorithm
  * The method used to adjust model parameters based on training data.
  * Example: Gradient Descent optimizes weights in neural networks.
5. Loss Function
  * Measures the difference between predicted and actual values.
  * Example: Mean Squared Error for regression.
6. Optimization Algorithm
  * Adjusts the model to minimize the loss function.
  * Example: Stochastic Gradient Descent.
7. Evaluation Metrics
  * Used to assess model performance on unseen data.
  * Examples:
    * Accuracy, Precision, Recall
    * Mean Absolute Error, R² score
8. Training & Testing Data Split
  * Data is split into:
    * Training set
    * Test set
    * Validation set
9. Hyperparameters
  * Configurations set before training.
  * Example: Learning rate, number of hidden layers in a neural network.
10. Deployment & Inference
  * Once trained, the model is deployed to make real-world predictions.
  * Example: A recommendation system suggesting products on Amazon.

##Q 4. How does loss value help in determining whether the model is good or not?
**Ans** - The loss value quantifies how well or poorly a machine learning model is performing. It represents the difference between the model's predictions and the actual ground truth values. A lower loss value generally indicates a better-performing model, while a higher loss suggests that the model is making significant errors.

1. **Role of Loss in Model Evaluation**
* Indicates Model Accuracy: A lower loss means the model is making better predictions.
* Guides Optimization: The model updates its parameters to minimize loss.
* Helps Identify Overfitting or Underfitting:
  * High training loss & high validation loss - Underfitting
  * Low training loss & high validation loss - Overfitting

2. **Common Loss Functions**

Loss functions differ based on the type of task:

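For Regression Problems:
1. Mean Squared Error

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

* Penalizes larger errors more than smaller ones, because each error is squared.
* Good for continuous output problems.
2. Mean Absolute Error

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

* Measures absolute differences, making it less sensitive to outliers than MSE.

Here $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.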
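
The difference in outlier sensitivity between the two regression losses can be seen with a few made-up numbers. This is only an illustrative sketch; the helper names `mse` and `mae` are defined here and are not library functions.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_good = np.array([2.5, 5.5, 7.0, 9.0])      # small errors only
y_pred_outlier = np.array([2.5, 5.5, 7.0, 19.0])  # same, plus one large miss

def mse(y, y_hat):
    """Mean squared error."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean absolute error."""
    return np.mean(np.abs(y - y_hat))

print("MSE:", mse(y_true, y_pred_good), "->", mse(y_true, y_pred_outlier))
print("MAE:", mae(y_true, y_pred_good), "->", mae(y_true, y_pred_outlier))
```

A single large miss inflates MSE far more than MAE, because the error is squared before averaging.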

##Q 5. What are continuous and categorical variables?
**Ans** - In data science and statistics, variables are classified into different types based on their nature and the values they take. The two main types are continuous and categorical variables.

1. **Continuous Variables**

A continuous variable is a variable that can take any numerical value within a given range. These values are measured and can have decimal points.

**Characteristics of Continuous Variables:**
* Can take infinitely many values within a range
* Can have decimal points
* Often measured rather than counted

**Examples:**
* Height (e.g., 5.8 ft, 175.3 cm)
* Weight (e.g., 65.5 kg, 150.2 lbs)
* Temperature (e.g., 36.6 °C, 98.4 °F)
* Time (e.g., 2.45 sec, 5.67 hrs)

**Types of Continuous Variables:**
* Interval Variable - Has no true zero (e.g., temperature in Celsius/Fahrenheit).
* Ratio Variable - Has a true zero (e.g., weight, height, distance).

2. **Categorical Variables**

A categorical variable is a variable that represents categories or groups rather than numerical values. These values are counted, not measured.

**Characteristics of Categorical Variables:**
* Represent distinct groups or labels
* Do not have numerical meaning
* Can be nominal or ordinal

**Examples:**
* Gender (Male, Female, Other)
* Blood Type (A, B, AB, O)
* Color (Red, Blue, Green)
* Education Level (High School, Bachelor's, Master's, Ph.D.)

**Types of Categorical Variables:**
* Nominal Variables - No meaningful order (e.g., eye color, car brand).
* Ordinal Variables - Have a meaningful order but no fixed difference (e.g., education levels, satisfaction ratings: Low, Medium, High).

**Differences Between Continuous and Categorical Variables**

|Feature	|Continuous Variable	|Categorical Variable|
|-|-|-|
|Definition	|Measured values that can take any number in a range	|Represents groups or categories|
|Numerical?	|Yes	|No|
|Decimals?	|Yes	|No|
|Examples	|Height, Weight, Temperature	|Gender, Car Brand, Blood Type|
|Subtypes	|Interval, Ratio	|Nominal, Ordinal|
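
In pandas, this distinction shows up in column dtypes. The sketch below uses a small invented table (the name `people` and its columns are assumptions) to mark one column as nominal and one as ordinal, then separates continuous from categorical columns.

```python
import pandas as pd

people = pd.DataFrame({
    "height_cm": [175.3, 162.0, 180.5],                       # continuous
    "blood_type": ["A", "O", "AB"],                           # nominal
    "education": ["Bachelor's", "High School", "Master's"],   # ordinal
})

# Nominal: categories without any order
people["blood_type"] = people["blood_type"].astype("category")

# Ordinal: categories with an explicit order
levels = ["High School", "Bachelor's", "Master's", "Ph.D."]
people["education"] = pd.Categorical(people["education"], categories=levels, ordered=True)

continuous_cols = people.select_dtypes(include="number").columns.tolist()
categorical_cols = people.select_dtypes(include="category").columns.tolist()

print(continuous_cols)
print(categorical_cols)
print(people["education"].cat.codes.tolist())  # position of each value in `levels`
```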

##Q 6. How do we handle categorical variables in Machine Learning? What are the common techniques?
**Ans** - Categorical variables need to be converted into a numerical format before they can be used in machine learning models. There are several techniques to handle categorical data, and the best choice depends on the type of categorical variable and the algorithm used.

**1. Encoding Techniques for Categorical Variables**

(a) One-Hot Encoding
* Converts each category into a separate binary column (0 or 1).
* Best for nominal categorical variables.
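* Commonly used with linear models and neural networks; tree-based models can use it too, although many of them also work with integer-encoded categories.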

**Example:**

In [None]:
Color: ['Red', 'Blue', 'Green']

* Pros: Works well with many machine learning models.
* Cons: Can create too many columns for high-cardinality features.

**Implementation in Python:**

In [None]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
# `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and later removed
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['Color']])
print(encoded)

(b) Label Encoding
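* Assigns a unique integer to each category.
* In scikit-learn, LabelEncoder is intended for encoding target labels (y) and assigns integers in sorted order, not in a meaningful rank order.
* For ordinal input features, Ordinal Encoding with an explicit category order is the safer choice.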

Example:

In [None]:
Education: ['High School', 'Bachelor', 'Master', 'PhD']

* Pros: Simple and memory-efficient.
* Cons: Can mislead models that assume numerical relationships where none exist.

Implementation in Python:

In [None]:
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Education': ['High School', 'Bachelor', 'Master', 'PhD']})
# LabelEncoder sorts the labels first: Bachelor=0, High School=1, Master=2, PhD=3
encoder = LabelEncoder()
df['Education_encoded'] = encoder.fit_transform(df['Education'])
print(df)

(c) Ordinal Encoding
* Similar to Label Encoding but explicitly defines an order.
* Used when categories have a meaningful ranking.

Example:

In [None]:
Size: ['Small', 'Medium', 'Large']

Implementation in Python:

In [None]:
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large']})
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_encoded'] = encoder.fit_transform(df[['Size']])
print(df)

(d) Target Encoding
* Replaces categories with the mean of the target variable.
* Works well for high-cardinality categorical variables.
* Commonly used in regression problems.

Example:
For a dataset predicting house prices:

In [None]:
Neighborhood: ['A', 'B', 'C']
Average Price: [200k, 250k, 300k]

* Pros: Reduces dimensionality compared to One-Hot Encoding.
* Cons: Can cause data leakage if applied incorrectly.

Implementation in Python:

In [None]:
import pandas as pd

df = pd.DataFrame({'Neighborhood': ['A', 'B', 'A', 'C'], 'Price': [200, 250, 220, 300]})
df['Neighborhood_encoded'] = df.groupby('Neighborhood')['Price'].transform('mean')
print(df)
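
The leakage risk mentioned above comes from computing category means on rows that are later used for evaluation. A minimal sketch of one way to reduce it, assuming a separate train/test split already exists, is to learn the means from the training rows only and map them onto the test rows. The fallback to the global training mean for unseen categories is an assumption of this sketch, not a fixed rule.

```python
import pandas as pd

train = pd.DataFrame({"Neighborhood": ["A", "B", "A", "C"],
                      "Price": [200, 250, 220, 300]})
test = pd.DataFrame({"Neighborhood": ["A", "C", "D"]})  # "D" never appears in training

# Learn the encoding from the training data only
category_means = train.groupby("Neighborhood")["Price"].mean()
global_mean = train["Price"].mean()

train["Neighborhood_encoded"] = train["Neighborhood"].map(category_means)

# Apply the learned mapping to the test data; unseen categories get the global mean
test["Neighborhood_encoded"] = test["Neighborhood"].map(category_means).fillna(global_mean)

print(test)
```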

(e) Frequency Encoding
* Replaces categories with their occurrence count in the dataset.
* Useful when some categories appear much more frequently than others.

Example:

In [None]:
City: ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago']

* Pros: Keeps useful information about category importance.
* Cons: May not always capture meaningful patterns.

Implementation in Python:

In [None]:
df = pd.DataFrame({'City': ['NY', 'LA', 'Chicago', 'NY', 'Chicago']})
df['City_encoded'] = df['City'].map(df['City'].value_counts())
print(df)

2. **Handling High-Cardinality Categorical Variables**

When a categorical feature has many unique values, encoding it efficiently is important:
* Target Encoding
* Frequency Encoding
* Feature Hashing - Converts categories into a fixed number of hash-based numerical columns.

Example using Feature Hashing:

In [None]:
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({'Product_ID': ['A123', 'B456', 'C789', 'A123']})
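hasher = FeatureHasher(n_features=3, input_type='string')
# With input_type='string', each sample must be an iterable of strings, so wrap each ID in a list
hashed_features = hasher.transform([[product_id] for product_id in df['Product_ID']])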
print(hashed_features.toarray())

3. **Choosing the Right Encoding Technique**

|Encoding Type	|Use Case|
|-|-|
|One-Hot Encoding (OHE)	|Small number of categories (nominal data)|
|Label Encoding	|Target labels; ordinal features only if the sorted order matches the real order|
|Ordinal Encoding	|When categories have a specific order|
|Target Encoding	|High-cardinality categorical data (regression problems)|
|Frequency Encoding	|When category frequency matters|
|Feature Hashing	|Large-scale categorical features|

##Q 7. What do you mean by training and testing a dataset?
**Ans** - In machine learning, data is typically split into two main sets:
1. Training Dataset - Used to train the model.
2. Testing Dataset - Used to evaluate the model's performance on unseen data.

**1. Training Dataset**
* The training dataset is the portion of data that the model learns from.
* It contains input features and corresponding target labels.
* The model identifies patterns and adjusts parameters based on this data.
* The goal is to minimize loss during training.

**Example:**

A dataset predicting house prices based on features like size and location.

|Size (sq ft)	|Location	|Price ($1000s)|
|-|-|-|
|1500	|Urban	|300|
|1800	|Suburban	|250|
|1200	|Rural	|200|

* The model learns from this data and tries to find the relationship between size, location, and price.

**2. Testing Dataset**
* The testing dataset is used to evaluate how well the trained model generalizes to new, unseen data.
* It helps detect overfitting.
* The model does not learn from this dataset—only performance is measured.

**Example:**

|Size (sq ft)	|Location	|Price ($1000s)|
|-|-|-|
|1600	|Urban	|???|
|1400	|Suburban	|???|

* The model predicts the price for the test data and compares it with actual values.

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd

data = pd.DataFrame({
    'Size': [1500, 1800, 1200, 1600, 1400],
    'Location': ['Urban', 'Suburban', 'Rural', 'Urban', 'Suburban'],
    'Price': [300, 250, 200, 275, 230]
})

train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

print("Training Set:\n", train_set)
print("\nTesting Set:\n", test_set)

##Q 8. What is sklearn.preprocessing?
**Ans** - sklearn.preprocessing is a module in Scikit-Learn that provides tools for transforming raw data into a format suitable for machine learning models. It includes classes for scaling, normalizing, encoding, and binarizing data. Imputation of missing values lives in the separate sklearn.impute module (SimpleImputer), which is usually used together with it.

**1. Why Use sklearn.preprocessing?**
* Improves model performance by ensuring features are properly scaled.
* Handles categorical variables through encoding.
* Works alongside sklearn.impute, which deals with missing data by imputing values.
* Standardizes features to ensure models converge faster.

**2. Common sklearn.preprocessing Techniques**

(a) Standardization
* Scales data to have mean = 0 and standard deviation = 1.
* Helps models like Logistic Regression, SVMs, and Neural Networks perform better.

(b) Min-Max Scaling
* Scales data between 0 and 1.
* Used in algorithms like KNN and Neural Networks.

(c) Encoding Categorical Variables

One-Hot Encoding
* Converts categories into binary columns.

Label Encoding
* Assigns unique integer values to categories.

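(d) Imputation
* Fills missing values using mean, median, or most frequent values. In current scikit-learn this is provided by SimpleImputer in sklearn.impute rather than sklearn.preprocessing.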

(e) Binarization
* Converts numerical values into 0s and 1s based on a threshold.

**3. Choosing the Right Preprocessing Method**

|Task	|Method|
|-|-|
|Standardizing data	|StandardScaler|
|Normalizing data	|MinMaxScaler|
|Encoding categorical data	|OneHotEncoder, LabelEncoder|
|Handling missing values	|SimpleImputer (from sklearn.impute)|
|Binarizing data	|Binarizer|
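
As a quick sketch of what three of these transformers do, the snippet below runs them on a single made-up column. The array `values` and the threshold of 3.0 are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Binarizer

values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # one feature, five samples

standardized = StandardScaler().fit_transform(values)          # mean 0, standard deviation 1
min_maxed = MinMaxScaler().fit_transform(values)               # rescaled to the range [0, 1]
binarized = Binarizer(threshold=3.0).fit_transform(values)     # 1 where value > 3.0, else 0

print(standardized.ravel())
print(min_maxed.ravel())
print(binarized.ravel())
```

Note that `Binarizer` maps values strictly greater than the threshold to 1, so 3.0 itself becomes 0.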

##Q 9. What is a Test set?
**Ans** - A test set is a portion of a dataset that is not used for training the model but is instead used to evaluate the model's performance on unseen data. It helps assess how well the model generalizes to new, real-world data.

**1. Purpose of a Test Set**
* Evaluates model performance on unseen data.
* Detects overfitting.
* Compares different models before selecting the best one.
* Ensures unbiased performance measurement before deployment.

##Q 10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?
**Ans** - **1. Split Data for Model Training and Testing in Python**

In machine learning, we split data into two sets:
* Training Set - Used to train the model (usually 70-80%).
* Test Set - Used to evaluate model performance (20-30%).

Using 'train_test_split' from sklearn

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd

data = pd.DataFrame({
    'Feature1': [10, 20, 30, 40, 50, 60, 70, 80],
    'Feature2': [5, 15, 25, 35, 45, 55, 65, 75],
    'Label': [1, 0, 1, 0, 1, 0, 1, 0]
})

train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

print("Training Set:\n", train_set)
print("\nTesting Set:\n", test_set)

* test_size = 0.2 - Allocates 20% of data for testing.
* random_state = 42 - Ensures reproducibility.

Splitting Data into Training, Validation, and Test Sets

If we need a validation set:

In [None]:
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

train_set, val_set = train_test_split(train_set, test_size=0.25, random_state=42)

print("Training Set:\n", train_set)
print("\nValidation Set:\n", val_set)
print("\nTesting Set:\n", test_set)

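* 0.25 of the remaining 80% is 20% of the original data, so this gives roughly 60% train, 20% validation, 20% test. With very small datasets such as the 8-row example above, rounding changes the exact counts.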
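
The proportions are easier to verify on a dataset whose size divides evenly. The following sketch uses 100 synthetic rows (the names `X_demo` and `y_demo` are made up for this example) and adds the optional `stratify` argument so each split keeps the same class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(100).reshape(-1, 1)  # 100 synthetic samples
y_demo = np.arange(100) % 2             # balanced binary labels

# First split: 80% temporary set, 20% test set
X_temp, X_test_demo, y_temp, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Second split: 25% of the temporary set becomes validation (20% of the original)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```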

**2. Approach to Machine Learning Problem**

To build an effective ML model, we follow these steps:

**Step 1: Define the Problem**
* Understand the problem statement.
* Identify the type of problem:
  * Regression
  * Classification
  * Clustering

Example: Predict house prices based on size and location.

**Step 2: Collect & Explore Data**
* Load the dataset.
* Check for missing values & outliers.
* Visualize data.

Example in Python:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("house_prices.csv")

print(df.isnull().sum())

df.hist(figsize=(8, 6))
plt.show()

**Step 3: Preprocess & Clean Data**
* Handle missing values.
* Convert categorical variables.
* Scale numerical features.

Example in Python:

In [None]:
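from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# Note: in a real project, fit these transformers on the training split only to avoid leakage

# Fill missing sizes with the column mean
imputer = SimpleImputer(strategy="mean")
df['Size'] = imputer.fit_transform(df[['Size']]).ravel()

# One-hot encode Location and replace the original text column with the new columns
encoder = OneHotEncoder(sparse_output=False)  # `sparse_output` requires scikit-learn >= 1.2
encoded_location = pd.DataFrame(
    encoder.fit_transform(df[['Location']]),
    columns=encoder.get_feature_names_out(['Location']),
    index=df.index,
)
df = pd.concat([df.drop(columns=['Location']), encoded_location], axis=1)

# Scale the numeric feature only; the target (Price) stays in its original units
scaler = StandardScaler()
df[['Size']] = scaler.fit_transform(df[['Size']])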

**Step 4: Split Data (Train-Test)**
* Use train_test_split() to divide data.
* Optionally, create a validation set for hyperparameter tuning.

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=['Price'])
y = df['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**Step 5: Choose & Train a Model**
* Select an appropriate machine learning algorithm.
* Train the model using fit().

Example using Linear Regression:

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

**Step 6: Evaluate the Model**
* Use test data to check model performance.
* Metrics depend on the problem type:
  * Regression: RMSE, MAE, R²
  * Classification: Accuracy, Precision, Recall, F1-score

Example for Regression:

In [None]:
from sklearn.metrics import mean_squared_error

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
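
The other regression metrics listed above can be computed in the same way. This is a self-contained sketch with made-up prices; the names `y_true_demo` and `y_pred_demo` are assumptions chosen so the snippet does not depend on the earlier cells.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true_demo = np.array([300.0, 250.0, 200.0, 275.0])  # actual prices (made up)
y_pred_demo = np.array([290.0, 260.0, 210.0, 270.0])  # predicted prices (made up)

mse_demo = mean_squared_error(y_true_demo, y_pred_demo)
rmse_demo = np.sqrt(mse_demo)  # RMSE is in the same units as the target
mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)
r2_demo = r2_score(y_true_demo, y_pred_demo)

print("RMSE:", rmse_demo)
print("MAE:", mae_demo)
print("R²:", r2_demo)
```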

**Step 7: Tune Hyperparameters (Optional)**
* Use GridSearchCV or RandomizedSearchCV to find the best hyperparameters.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'fit_intercept': [True, False]}
grid_search = GridSearchCV(LinearRegression(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)

**Step 8: Deploy the Model**
* Save the trained model using joblib or pickle.
* Deploy it using Flask, FastAPI, or cloud services.

Saving Model:

In [None]:
import joblib

joblib.dump(model, 'house_price_model.pkl')

**Summary of Machine Learning Workflow**

|Step	|Action|
|-|-|
|Define Problem	|Identify task|
|Collect Data	|Load dataset, check for missing values|
|Preprocess Data	|Encode categorical, scale numerical data|
|Split Data	|Train-test split (80-20 or 70-30)|
|Train Model	|Select ML algorithm, fit model|
|Evaluate Model	|Use metrics like RMSE, accuracy|
|Hyperparameter Tuning	|Optimize parameters for better performance|
|Deploy Model	|Save model and deploy|
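
One way to tie the preprocessing, splitting, and training steps together is a scikit-learn Pipeline, which fits the imputer, scaler, and encoder on the training split only. The sketch below is one possible arrangement, not the only one; the `houses` data and all variable names are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with one missing Size value
houses = pd.DataFrame({
    "Size": [1500, 1800, 1200, 1600, 1400, np.nan, 2000, 1700],
    "Location": ["Urban", "Suburban", "Rural", "Urban",
                 "Suburban", "Rural", "Urban", "Suburban"],
    "Price": [300, 250, 200, 275, 230, 190, 350, 260],
})

X_h = houses[["Size", "Location"]]
y_h = houses["Price"]
X_tr, X_te, y_tr, y_te = train_test_split(X_h, y_h, test_size=0.25, random_state=42)

# Numeric column: impute missing values, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Route each column type to its own preprocessing
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["Size"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Location"]),
])

pipeline = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
pipeline.fit(X_tr, y_tr)            # all preprocessing is fitted on training rows only
predictions = pipeline.predict(X_te)
print(predictions)
```

Because the whole pipeline is a single estimator, it can also be passed to GridSearchCV or saved with joblib as one object.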

##Q 11. Why do we have to perform EDA before fitting a model to the data?
**Ans** - Exploratory Data Analysis is a crucial step in machine learning where we analyze, visualize, and preprocess data before fitting a model. Skipping EDA can lead to poor model performance, biased predictions, or incorrect conclusions.

**1. Understand Data Structure & Quality**
* Check data types.
* Identify missing values, duplicates, and inconsistencies.
* Detect outliers that may affect model training.

**Example: Checking data types & missing values**

In [None]:
import pandas as pd

df = pd.read_csv("house_prices.csv")

print(df.dtypes)

print(df.isnull().sum())

**2. Detect & Handle Missing Values**
* Missing values can bias model predictions.
* Solutions:
  * Drop rows/columns with too many missing values.
  * Impute missing values using mean, median, mode, or predictive models.

Example: Imputing missing values

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")
df['Size'] = imputer.fit_transform(df[['Size']])

**3. Identify Outliers & Handle Them**
* Outliers can distort model learning.
* Use boxplots, histograms, or z-scores to detect them.

Example: Detecting Outliers Using Boxplot

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.boxplot(x=df['Price'])
plt.show()

**Solutions:**
* Remove extreme outliers.
* Apply log transformation or clipping to reduce impact.
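
A numeric counterpart to the boxplot is the 1.5 × IQR rule. The sketch below, using made-up prices, flags values outside that range and then shows clipping and a log transform as two ways to reduce their impact.

```python
import numpy as np

prices = np.array([200.0, 210.0, 220.0, 230.0, 240.0, 250.0, 260.0, 1000.0])  # made up

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (prices < lower) | (prices > upper)

# Option 1: clip extreme values to the fences
clipped = np.clip(prices, lower, upper)

# Option 2: compress the scale with a log transform
log_prices = np.log1p(prices)

print(prices[is_outlier])
print(clipped)
```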

**4. Detect Feature Relationships & Multicollinearity**
* Check correlation between features.
* Multicollinearity can confuse models.

Example: Correlation Matrix

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8,6))
# numeric_only=True skips text columns such as Location (needs pandas >= 1.5)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()

* Solution for Multicollinearity:
  * Remove redundant features.
  * Use Principal Component Analysis.
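
Removing redundant features can be done programmatically from the correlation matrix. This is a rough sketch on synthetic data: the 0.9 threshold and the column names are assumptions, and which column of a correlated pair gets dropped simply depends on column order.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
size_sqft = rng.normal(1500, 300, 200)
features = pd.DataFrame({
    "size_sqft": size_sqft,
    "size_sqm": size_sqft * 0.092903,     # same information in different units
    "age_years": rng.uniform(0, 50, 200),
})

# Absolute correlations, keeping only the upper triangle so each pair appears once
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = features.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```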

**5. Choose the Right Feature Engineering Approach**
* Categorical features need encoding.
* Numerical features need scaling.

Example: Encoding Categorical Variables

In [None]:
from sklearn.preprocessing import OneHotEncoder

# `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and later removed
encoder = OneHotEncoder(sparse_output=False)
encoded_features = encoder.fit_transform(df[['Location']])

Example: Scaling Numerical Features

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Size', 'Price']] = scaler.fit_transform(df[['Size', 'Price']])

**6. Avoid Data Leakage & Bias**
* Ensure that no data from the test set leaks into training.
* Balance classes in imbalanced datasets (e.g., fraud detection).

Example: Handling Imbalanced Data Using SMOTE

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)

**Reasons for performing EDA**

|EDA Task	|Why It Matters?|
|-|-|
|Check Missing Values	|Prevents biased models|
|Identify Outliers	|Avoids extreme values affecting learning|
|Feature Correlation	|Removes redundant variables|
|Encode Categorical Data	|Ensures models can process categorical variables|
|Normalize Features	|Helps algorithms converge faster|
|Avoid Data Leakage	|Prevents unfair performance estimation|

##Q 12. What is correlation?
**Ans** - Correlation is a statistical measure that describes the strength and direction of the relationship between two variables. It tells us how one variable changes in relation to another.
* Positive Correlation: If one variable increases, the other also increases.
* Negative Correlation: If one variable increases, the other decreases.
* No Correlation: No relationship between the variables.

**1. Pearson's Correlation Coefficient (r)**
The most common measure of correlation is Pearson’s Correlation Coefficient (r), which ranges from -1 to 1:

$$r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2}\,\sqrt{\sum_i (Y_i - \bar{Y})^2}}$$

Interpretation of r:

|Value of r	|Interpretation|
|-|-|
|r = +1	|Perfect positive correlation|
|r = 0.7 to 0.99	|Strong positive correlation|
|r = 0.3 to 0.69	|Moderate positive correlation|
|r = 0	|No correlation|
|r = -0.3 to -0.69	|Moderate negative correlation|
|r = -0.7 to -0.99	|Strong negative correlation|
|r = -1	|Perfect negative correlation|

**2. Example of Correlation**

Positive Correlation Example
* Height vs. Weight
  * Taller people tend to weigh more.
  * r ≈ +0.8

Negative Correlation Example
* Temperature vs. Hot Coffee Sales
  * When the temperature increases, coffee sales decrease.
  * r ≈ -0.7

No Correlation Example
* Shoe Size vs. Exam Scores
  * Shoe size has no effect on exam scores.
  * r ≈ 0

**3. Visualizing Correlation**

Example in Python

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'Height': [150, 160, 170, 180, 190], 'Weight': [50, 60, 70, 80, 90]})

correlation = df.corr()
print(correlation)

sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.show()

##Q 13. What does negative correlation mean?
**Ans** - A negative correlation means that as one variable increases, the other decreases.

Mathematically, the Pearson correlation coefficient (r) for negative correlation is between -1 and 0:
* r = -1 -> Perfect negative correlation.
* r = -0.7 to -0.99 -> Strong negative correlation.
* r = -0.3 to -0.69 -> Moderate negative correlation.
* r = 0 -> No correlation.

**Examples of Negative Correlation**
* Temperature vs. Hot Coffee Sales - As temperature increases, coffee sales decrease.
* Exercise vs. Body Fat Percentage - More exercise leads to lower body fat.
* Speed vs. Travel Time - As car speed increases, travel time decreases.

**Visualizing Negative Correlation**

Example in Python

In [None]:
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5])
y = np.array([10, 8, 6, 4, 2])

plt.scatter(x, y, color='red')
plt.xlabel("Exercise Hours")
plt.ylabel("Body Fat Percentage")
plt.title("Negative Correlation Example")
plt.show()

##Q 14. How can you find correlation between variables in Python?
**Ans** - In Python, we can compute correlation using Pandas, NumPy, and Seaborn to analyze relationships between numerical variables.

**1. Using corr() in Pandas**

Pandas provides the .corr() method to compute correlation between numerical columns.

Example: Compute Correlation Matrix

In [None]:
import pandas as pd

data = {'Height': [150, 160, 170, 180, 190],
        'Weight': [50, 60, 70, 80, 90],
        'Age': [20, 25, 30, 35, 40]}

df = pd.DataFrame(data)

correlation_matrix = df.corr()
print(correlation_matrix)

* By default, .corr() calculates Pearson’s correlation.
* The closer the value is to +1 or -1, the stronger the correlation.

**2. Using NumPy’s corrcoef()**

NumPy's corrcoef() calculates correlation between two variables.

In [None]:
import numpy as np

height = np.array([150, 160, 170, 180, 190])
weight = np.array([50, 60, 70, 80, 90])

correlation = np.corrcoef(height, weight)
print(correlation)

* The result is a correlation matrix where correlation[0,1] gives the correlation between the two variables.

**3. Visualizing Correlation with Seaborn & Matplotlib**

A heatmap helps visualize correlation between multiple variables.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

correlation_matrix = df.corr()

plt.figure(figsize=(6,4))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()

* Red/Blue shades indicate strong positive/negative correlation.
* Lighter shades suggest weak or no correlation.

**4. Compute Spearman & Kendall Correlation**

Besides Pearson's correlation, you can also compute:
* Spearman's Rank Correlation
* Kendall's Tau Correlation

In [None]:
# Wrap in print(): a notebook cell only displays its last expression
print(df.corr(method='spearman'))
print(df.corr(method='kendall'))

**Summary of Methods**

|Method	|Use Case|
|-|-|
|.corr() (Pandas)	|Compute correlation matrix for all numerical variables|
|corrcoef() (NumPy)	|Compute correlation between two specific variables|
|heatmap() (Seaborn)	|Visualize correlations in a heatmap|
|method='spearman'	|Use for ordinal/ranked data|
|method='kendall'	|Use for small datasets with ranked variables|
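
The choice of method matters when a relationship is monotonic but not linear. In this small sketch (the `curve` data is invented), y grows as the cube of x, so the rank-based Spearman coefficient is 1 while Pearson's r is lower.

```python
import pandas as pd

curve = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8]})
curve["y"] = curve["x"] ** 3  # strictly increasing, but not a straight line

pearson_r = curve.corr(method="pearson").loc["x", "y"]
spearman_r = curve.corr(method="spearman").loc["x", "y"]

print("Pearson:", pearson_r)
print("Spearman:", spearman_r)
```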

##Q 15. What is causation? Explain difference between correlation and causation with an example.
**Ans** - Causation means that one event directly affects another. If X causes Y, then changing X will result in a change in Y.

For example:
* More exercise - Weight loss.
* Smoking - Lung cancer.

**Difference Between Correlation and Causation**
* Correlation means two variables are related, but one does NOT necessarily cause the other.
* Causation means one variable directly affects the other.

**Example: Ice Cream Sales & Drowning Cases**
* Observation: Ice cream sales and drowning cases are positively correlated.
* Does this mean eating ice cream causes drowning? No!
* Real Cause: Hot Weather increases both ice cream sales and swimming activity, leading to more drowning cases.
* This is correlation, NOT causation.

**Key Differences**

|Feature	|Correlation	|Causation|
|-|-|-|
|Definition	|A statistical relationship between two variables.	|One variable directly causes a change in another.|
|Directionality	|No clear direction.	|X directly affects Y.|
|Example	|Higher ice cream sales & drowning cases.	|More smoking causes lung cancer.|
|Proves Cause?	|No	|Yes|
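
The ice cream example can be simulated. In the sketch below, all numbers are synthetic and the coefficients are arbitrary assumptions: temperature drives both variables, which makes them strongly correlated, but once the effect of temperature is removed from each (by correlating the regression residuals) the relationship almost disappears.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

temperature = rng.normal(size=n)                     # the hidden common cause
ice_cream = 2.0 * temperature + rng.normal(size=n)   # depends on temperature only
drownings = 1.5 * temperature + rng.normal(size=n)   # depends on temperature only

raw_r = np.corrcoef(ice_cream, drownings)[0, 1]

def residuals(target, predictor):
    """Part of `target` that a straight-line fit on `predictor` cannot explain."""
    slope, intercept = np.polyfit(predictor, target, 1)
    return target - (slope * predictor + intercept)

# Correlation after removing the temperature effect from both variables
partial_r = np.corrcoef(residuals(ice_cream, temperature),
                        residuals(drownings, temperature))[0, 1]

print("Raw correlation:", raw_r)
print("After controlling for temperature:", partial_r)
```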

##Q 16. What is an Optimizer? What are different types of optimizers? Explain each with an example.
**Ans** - An optimizer is an algorithm that adjusts the parameters of a machine learning model to minimize the loss function and improve model performance. It helps the model learn patterns from data by updating weights in the right direction during training.

**Types of Optimizers in Machine Learning**

Optimizers can be broadly classified into two categories:
1. Gradient Descent-Based Optimizers
2. Non-Gradient-Based Optimizers

**1. Gradient Descent-Based Optimizers**

Gradient Descent is an optimization technique that updates model parameters by computing the gradient of the loss function and adjusting weights accordingly.

**(a) Batch Gradient Descent**
* Uses all training data to compute the gradient before updating weights.
* Converges smoothly but is slow for large datasets.

Formula for Weight Update:

    W = W - α*(dL/dW)
where:
* W = weights
* α = learning rate
* dL/dW = gradient of the loss function

Example in Python:

In [None]:
import numpy as np

learning_rate = 0.01
W = 2
gradient = 5

W = W - learning_rate * gradient
print(W)

* Pros: Works well for small datasets.
* Cons: Computationally expensive for large datasets.

**(b) Stochastic Gradient Descent**
* Updates weights after each training example.
* Used in online learning scenarios.

Example in Python:

In [None]:
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(learning_rate='constant', eta0=0.01)

* Pros: Faster updates, good for large datasets.
* Cons: High variance in updates.

**(c) Mini-Batch Gradient Descent**
* A compromise between BGD and SGD.
* Updates weights after a small batch of training examples.
* Used in deep learning.

Example in Python:

In [None]:
from tensorflow.keras.optimizers import SGD

# The mini-batch size itself is set when training, e.g. model.fit(..., batch_size=32)
optimizer = SGD(learning_rate=0.01)

* Pros: Faster than BGD, more stable than SGD.
* Cons: Requires careful tuning of batch size.
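
The three variants differ only in how many examples are used per weight update. The following NumPy sketch fits a line to noiseless synthetic data; the function name `gradient_descent`, the learning rate, and the epoch count are assumptions chosen for this toy problem. Setting `batch_size` to the dataset size gives Batch Gradient Descent, 1 gives Stochastic Gradient Descent, and anything in between gives Mini-Batch Gradient Descent.

```python
import numpy as np

def gradient_descent(X, y, batch_size, learning_rate=0.2, epochs=500, seed=0):
    """Fit y = w * x + b by minimizing MSE, updating once per batch."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                 # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = (w * X[idx] + b) - y[idx]
            grad_w = 2 * np.mean(error * X[idx])   # dL/dw on this batch
            grad_b = 2 * np.mean(error)            # dL/db on this batch
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

x_train = np.linspace(0, 1, 20)
y_line = 3.0 * x_train + 1.0   # noiseless line: true w = 3, b = 1

w_batch, b_batch = gradient_descent(x_train, y_line, batch_size=len(x_train))
w_mini, b_mini = gradient_descent(x_train, y_line, batch_size=5)
w_sgd, b_sgd = gradient_descent(x_train, y_line, batch_size=1)

print(w_batch, b_batch)
print(w_mini, b_mini)
print(w_sgd, b_sgd)
```

On real, noisy data the stochastic variants would fluctuate around the solution rather than settle exactly, which is the "high variance in updates" noted above.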

**2. Non-Gradient-Based Optimizers**

Some optimization techniques do not rely on gradient calculations:

(d) Genetic Algorithms (GA)
* Inspired by natural evolution.
* Used for complex, non-differentiable problems.

Example in Python:

In [None]:
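import numpy as np
from geneticalgorithm import geneticalgorithm as ga

# The library minimizes, so negate the objective to maximize -x**2 + 4*x (peak at x = 2)
def f(x): return -(-x[0]**2 + 4*x[0])

varbound = np.array([[0, 4]])  # search range for the single real variable

algorithm_param = {'max_num_iteration': 100, 'population_size': 100,
                   'mutation_probability': 0.1, 'elit_ratio': 0.01,
                   'crossover_probability': 0.5, 'parents_portion': 0.3,
                   'crossover_type': 'uniform', 'max_iteration_without_improv': None}
model = ga(function=f, dimension=1, variable_type='real',
           variable_boundaries=varbound, algorithm_parameters=algorithm_param)
model.run()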

* Pros: Works well for non-convex problems.
* Cons: Computationally expensive.

**Summary: Which Optimizer to Use?**

|Optimizer	|Best Used For|
|-|-|
|SGD	|Large-scale learning|
|Momentum	|Faster convergence, avoiding local minima|
|RMSprop	|Recurrent Neural Networks|
|Adam	|General-purpose deep learning models|
|Genetic Algorithm	|Optimization without gradients|
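
Momentum and Adam appear in the table but are not shown in code above. As a rough sketch, here are their standard update rules applied to a toy one-dimensional loss, L(w) = (w - 3)². The loss, learning rates, and step counts are assumptions picked for illustration; in practice a framework optimizer would be used.

```python
import numpy as np

def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)

# SGD with momentum: accumulate a velocity from past gradients
w_momentum, velocity = 0.0, 0.0
lr, beta = 0.05, 0.9
for _ in range(300):
    velocity = beta * velocity + grad(w_momentum)
    w_momentum -= lr * velocity

# Adam: running averages of the gradient (m) and squared gradient (v), with bias correction
w_adam, m, v = 0.0, 0.0, 0.0
lr_adam, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w_adam -= lr_adam * m_hat / (np.sqrt(v_hat) + eps)

print(w_momentum, w_adam)  # both should end up near the minimum at w = 3
```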

##Q 17. What is sklearn.linear_model ?
**Ans** - sklearn.linear_model is a module in Scikit-Learn that provides linear models for regression and classification tasks. It includes algorithms like Linear Regression, Logistic Regression, Ridge, Lasso, and SGD-based models.

**1. Linear Models for Regression**

Regression models predict continuous values.

(a) Linear Regression
* Fits a straight line to the data using Ordinary Least Squares.

Example: Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

model = LinearRegression()
model.fit(X, y)

y_pred = model.predict([[6]])
print("Prediction for X=6:", y_pred[0])

* Pros: Simple, interpretable.
* Cons: Assumes linear relationship.

(b) Ridge Regression
* Prevents overfitting by adding L2 penalty to the loss function.
* Useful when features are highly correlated.

Example: Ridge Regression

In [None]:
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)
ridge.fit(X, y)

(c) Lasso Regression
* Prevents overfitting and performs feature selection by adding L1 penalty.
* Shrinks some coefficients to zero, effectively removing irrelevant features.

Example: Lasso Regression

In [None]:
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
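
The difference between the L2 and L1 penalties is easier to see with more than one feature. The sketch below assumes synthetic data in which only the first of three features influences the target; the names `X_demo` and `y_demo` are made up for this illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))   # three features, synthetic
y_demo = 3.0 * X_demo[:, 0]          # only the first feature matters

ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)

# Ridge shrinks coefficients but typically keeps them non-zero;
# Lasso can set the coefficients of irrelevant features exactly to zero.
print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)
print("Non-zero Lasso coefficients:", np.count_nonzero(lasso.coef_))
```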

**2. Linear Models for Classification**

Classification models predict discrete labels.

(d) Logistic Regression
* Used for binary or multiclass classification.

Example: Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 1, 1, 1])

log_reg = LogisticRegression()
log_reg.fit(X, y)

print("Prediction for X=2.5:", log_reg.predict([[2.5]]))

* Pros: Simple, efficient for classification.
* Cons: Not ideal for complex, nonlinear data.

(e) Stochastic Gradient Descent Classifier
* Efficient for large-scale datasets.
* Uses SGD to optimize models like Logistic Regression and SVMs.

Example: SGD Classifier

In [None]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(loss='log_loss', max_iter=1000)
sgd_clf.fit(X, y)

* Pros: Fast, works well with big data.
* Cons: Sensitive to hyperparameters.

**Summary**

|Model	|Use Case|
|-|-|
|Linear Regression	|Predicting continuous values (e.g., house prices)|
|Ridge Regression	|Handling multicollinearity in regression|
|Lasso Regression	|Feature selection in regression|
|Logistic Regression	|Binary/multiclass classification|
|SGD Classifier	|Large-scale classification problems|

##Q 18. What does model.fit() do? What arguments must be given?
**Ans** - model.fit() is a method in Scikit-Learn that trains a machine learning model by learning patterns from the input data.
* It adjusts model parameters based on the training data.
* It minimizes the loss function to improve model accuracy.
* It is used in both regression and classification tasks.

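**How model.fit() Works Internally**
1. Takes input features (X) and target labels (y).
2. Initializes model parameters.
3. Applies the estimator's optimization algorithm to minimize the loss.
4. For iterative solvers, repeats the updates until convergence or the maximum number of iterations is reached; closed-form solvers, such as ordinary least squares in LinearRegression, solve for the parameters directly.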
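
The steps above can be sketched with a small hand-written class. This is a teaching sketch only: `TinyLinearRegressor` is a hypothetical name, it is not how Scikit-Learn implements LinearRegression (which uses a least-squares solver), and the learning rate, iteration limit, and tolerance are assumed values.

```python
import numpy as np

class TinyLinearRegressor:
    """Hypothetical teaching class (not a Scikit-Learn estimator)."""

    def __init__(self, learning_rate=0.05, max_iter=5000, tol=1e-10):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.tol = tol

    def fit(self, X, y):
        # Step 1: take input features and targets
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Step 2: initialize parameters
        self.coef_ = np.zeros(X.shape[1])
        self.intercept_ = 0.0
        # Steps 3-4: gradient descent on mean squared error until converged
        for _ in range(self.max_iter):
            error = X @ self.coef_ + self.intercept_ - y
            grad_w = 2.0 * X.T @ error / len(y)
            grad_b = 2.0 * error.mean()
            if max(np.abs(grad_w).max(), abs(grad_b)) < self.tol:
                break
            self.coef_ -= self.learning_rate * grad_w
            self.intercept_ -= self.learning_rate * grad_b
        return self

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

tiny = TinyLinearRegressor().fit(X, y)
print("Coefficient:", tiny.coef_, "Intercept:", tiny.intercept_)
```

Returning `self` from `fit()` mirrors the Scikit-Learn convention and allows chained calls such as `TinyLinearRegressor().fit(X, y)`.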

Example: Training a Linear Regression Model

In [None]:
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

model = LinearRegression()

model.fit(X, y)

print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)

* The model learns a coefficient of 2 and an intercept of approximately 0, which corresponds to the equation y = 2x.

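**Arguments Required by fit()**

model.fit(X, y, sample_weight=None)

|Argument	|Description|
|-|-|
|X	|Input features, shape (n_samples, n_features). Required.|
|y	|Target values or labels. Required for supervised estimators.|
|sample_weight	|Optional per-sample weights; supported by many, but not all, estimators.|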
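
To illustrate the optional `sample_weight` argument, the sketch below fits LinearRegression twice on toy data with one deliberate outlier, once without weights and once with the outlier given a weight of zero. The data values are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3]])
y = np.array([2, 4, 100])            # the last target is a deliberate outlier

# Without weights, the outlier pulls the fitted line upward
unweighted = LinearRegression().fit(X, y)

# A weight of 0 removes the outlier's influence on the fit
weights = np.array([1.0, 1.0, 0.0])
weighted = LinearRegression().fit(X, y, sample_weight=weights)

print("Unweighted slope:", unweighted.coef_[0])
print("Weighted slope:", weighted.coef_[0])
```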

**Example: Training a Logistic Regression Model**

For classification problems, fit() learns how to separate data into classes.

In [None]:
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

print("Prediction for X=3:", clf.predict([[3]]))

* The model classifies X=3 based on learned probabilities.

**What Happens If You Don’t Call fit()?**

If you try to use model.predict() without training (fit()), it will throw an error:
* Error: NotFittedError: This LinearRegression instance is not fitted yet.
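
A short sketch of this behavior, catching the exception so the cell does not stop. It assumes only the public `NotFittedError` class from `sklearn.exceptions`; the `outcome` variable is a hypothetical name used for illustration.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression

model = LinearRegression()   # created but never fitted

try:
    model.predict(np.array([[6]]))
    outcome = "predicted"
except NotFittedError as exc:
    outcome = "not fitted"
    print("Caught:", type(exc).__name__)

print(outcome)
```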

**Summary**

|Task	|What fit() Does|
|-|-|
|Regression (Linear, Ridge, Lasso, etc.)	|Learns best-fit line by minimizing error|
|Classification (Logistic Regression, SVM, etc.)	|Learns decision boundaries to separate classes|
|Neural Networks (MLP, CNN, etc.)	|Updates weights using backpropagation|

##Q 19. What does model.predict() do? What arguments must be given?
**Ans** - model.predict() is a method in Scikit-Learn that makes predictions using a trained machine learning model. It takes input features (X) and outputs predicted values (ŷ).

* For Regression Models - Predicts a continuous value.
* For Classification Models - Predicts class labels.

**How predict() Works Internally**
1. Takes input features (X).
2. Uses the model parameters learned during fit().
3. Applies the model’s mathematical function to X.
4. Outputs predictions (ŷ).
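
For a linear model these steps reduce to one line of arithmetic. The sketch below, using the same toy data as the other examples in this answer, reproduces `predict()` by hand from the learned `coef_` and `intercept_` attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([2, 4, 6, 8, 10])
model = LinearRegression().fit(X_train, y_train)

X_new = np.array([[6], [7]])

# Apply the learned linear function manually: y_hat = X @ coef + intercept
manual = X_new @ model.coef_ + model.intercept_

print("Manual:   ", manual)
print("predict():", model.predict(X_new))
```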

**Example: predict() in Linear Regression**

In [None]:
from sklearn.linear_model import LinearRegression
import numpy as np

X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([2, 4, 6, 8, 10])

model = LinearRegression()
model.fit(X_train, y_train)

X_test = np.array([[6], [7]])
y_pred = model.predict(X_test)

print("Predictions:", y_pred)

* Output: approximately [12. 14.] (since the fitted line is y = 2x).

**Example: predict() in Classification (Logistic Regression)**

For classification tasks, predict() returns class labels (0, 1, etc.).

In [None]:
from sklearn.linear_model import LogisticRegression

X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X_train, y_train)

X_test = np.array([[2.5], [3.5]])
predictions = clf.predict(X_test)

print("Predicted Class Labels:", predictions)

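* Expected output: class 1 for X = 3.5. X = 2.5 lies close to the decision boundary between the two classes, so its predicted label depends on the fitted coefficients (which are affected by the default regularization); run the cell to confirm.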
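
Class labels come from the predicted probabilities. As a sketch for the binary case, `predict_proba()` returns one column per class, and thresholding the class-1 probability at 0.5 gives the same labels as `predict()`. The variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X_train, y_train)

X_test = np.array([[2.5], [3.5]])

# Column 1 holds the estimated probability of class 1
proba_class1 = clf.predict_proba(X_test)[:, 1]
labels_from_proba = (proba_class1 > 0.5).astype(int)

print("P(class 1):", proba_class1)
print("Labels from threshold:", labels_from_proba)
print("Labels from predict():", clf.predict(X_test))
```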

**Arguments Required by predict()**

model.predict(X)

|Argument	|Description|
|-|-|
|X	|Input features (same format as training data)|

* Important: X must have the same number of features as the model was trained on.
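
A brief sketch of what happens when the feature count does not match. It assumes a reasonably recent Scikit-Learn version, where fitted estimators expose `n_features_in_`; the exact wording of the error message may differ between versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Trained on a single feature
model = LinearRegression().fit(np.array([[1], [2], [3]]), np.array([2, 4, 6]))
print("Features seen during fit:", model.n_features_in_)

try:
    model.predict(np.array([[6, 7]]))   # two features instead of one
    mismatch_message = None
except ValueError as exc:
    mismatch_message = str(exc)

print(mismatch_message)
```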

**What If You Call predict() Before fit()?**

In [None]:
model = LinearRegression()
model.predict([[6]])

* Error: NotFittedError: This LinearRegression instance is not fitted yet.
* Solution: Always train the model using fit() before calling predict().

##Q 20. What are continuous and categorical variables?
**Ans** - **Continuous vs. Categorical Variables**

In data science and machine learning, variables are classified based on the type of data they represent. The two main types are continuous and categorical variables.

**1. Continuous Variables**

A continuous variable is a variable that can take any numerical value within a given range.

**Characteristics of Continuous Variables:**
* Can take infinitely many values within a range.
* Can have decimal points.
* Are measured rather than counted.

**Examples:**
* Height (e.g., 5.8 ft, 175.3cm)
* Weight (e.g., 65.5kg, 150.2lbs)
* Temperature (e.g., 36.6°C, 98.4°F)
* Time (e.g., 2.45sec, 5.67hrs)
* Price of a Product (e.g., $10.99, $99.50)

**Types of Continuous Variables:**
1. Interval Variable - Has no true zero (e.g., temperature in Celsius/Fahrenheit).
2. Ratio Variable - Has a true zero (e.g., weight, height, distance).

**2. Categorical Variables**

A categorical variable represents categories or groups rather than numerical quantities. Observations are assigned to a category and can be counted per category, but they are not measured on a numeric scale.

**Characteristics of Categorical Variables:**
* Represent distinct groups or labels.
* Do not have numerical meaning.
* Can be nominal or ordinal.

**Examples:**
* Gender (Male, Female, Other)
* Blood Type (A, B, AB, O)
* Color (Red, Blue, Green)
* Education Level (High School, Bachelor's, Master's, PhD)
* Marital Status (Single, Married, Divorced)

**Types of Categorical Variables:**
1. Nominal Variables - No meaningful order (e.g., eye color, car brand).
2. Ordinal Variables - Have a meaningful order but no fixed difference (e.g., education levels, satisfaction ratings: Low, Medium, High).

**3. Differences Between Continuous and Categorical Variables**

|Feature	|Continuous Variable	|Categorical Variable|
|-|-|-|
|Definition	|Measured values that can take any number in a range	|Represents groups or categories|
|Numerical?	|Yes	|No|
|Decimals?	|Yes	|No|
|Examples	|Height, Weight, Temperature	|Gender, Car Brand, Blood Type|
|Subtypes	|Interval, Ratio	|Nominal, Ordinal|
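
The nominal/ordinal distinction can be expressed directly in pandas. The sketch below assumes an education column with the order High School < Bachelor < Master < PhD; the order is supplied explicitly because pandas cannot infer it from the strings.

```python
import pandas as pd

education = pd.Series(["Bachelor", "PhD", "High School", "Master"])

# Declare the categories and their order explicitly (ordinal variable)
levels = ["High School", "Bachelor", "Master", "PhD"]
ordered_dtype = pd.CategoricalDtype(categories=levels, ordered=True)
education_cat = education.astype(ordered_dtype)

# Integer codes follow the declared order, and comparisons become meaningful
codes = education_cat.cat.codes.tolist()
above_bachelor = (education_cat > "Bachelor").tolist()

print(codes)           # [1, 3, 0, 2]
print(above_bachelor)  # [False, True, False, True]
```

For a nominal variable such as blood type, `ordered=False` (the default) would be used and comparisons like `>` are not defined.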

**4. Identifying Continuous vs. Categorical Variables in Python**

Example using Pandas

In [None]:
import pandas as pd

data = pd.DataFrame({
    'Height': [170, 165, 180, 175],
    'Weight': [65.5, 70.2, 80.1, 75.8],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Education': ['Bachelor', 'Master', 'PhD', 'High School']
})

# dtype is only a heuristic: an integer column may still encode categories
categorical_cols = data.select_dtypes(exclude='number').columns
continuous_cols = data.select_dtypes(include='number').columns

print("Categorical Variables:", categorical_cols.tolist())
print("Continuous Variables:", continuous_cols.tolist())

##Q 21. What is feature scaling? How does it help in Machine Learning?
**Ans** - Feature scaling is a data preprocessing technique used to normalize or standardize numerical features so they have a consistent scale. This ensures that all features contribute equally to a machine learning model, preventing dominance by features with larger ranges.

**Why is Feature Scaling Important?**
* Improves Model Performance - Ensures that features with larger values do not overshadow smaller ones.
* Accelerates Model Convergence - Helps gradient-based optimization algorithms converge faster.
* Prevents Numerical Instability - Avoids large variations in weight updates during training.
* Required for Distance-Based Algorithms - Essential for models like KNN, SVM, PCA, K-Means, etc.

**Types of Feature Scaling**

There are three common types of feature scaling:

**1. Min-Max Scaling**
* Scales values between 0 and 1.
* Preserves the shape of the original distribution (it is a linear rescaling).
* Commonly used for neural networks and other models that expect bounded inputs.

Example in Python:

In [None]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[50], [100], [150], [200]])
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)

print(scaled_data)

* Pros: Maintains the original data distribution.
* Cons: Sensitive to outliers.
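
The transformation can be checked by hand against the formula X' = (X - Xmin) / (Xmax - Xmin). A minimal sketch using the same four values, with the result compared to MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[50.0], [100.0], [150.0], [200.0]])

# Per-column minimum and maximum are the learned parameters
x_min = data.min(axis=0)
x_max = data.max(axis=0)
manual_minmax = (data - x_min) / (x_max - x_min)

sklearn_minmax = MinMaxScaler().fit_transform(data)

print(manual_minmax.ravel())
print("Matches MinMaxScaler:", np.allclose(manual_minmax, sklearn_minmax))
```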

**2. Standardization (Z-Score Scaling)**
* Scales data to have mean = 0 and standard deviation = 1.
* Commonly used for models that are sensitive to feature scale (e.g., regularized Linear Regression, Logistic Regression, SVM); it does not require the data to be normally distributed.

Example in Python:

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

print(scaled_data)

* Pros: Works well when data follows a normal distribution.
* Cons: Transformed values are not in a fixed range.
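
Standardization computes z = (x - mean) / std for each column. The sketch below reproduces it by hand; note the assumption that the population standard deviation (`ddof=0`, NumPy's default) is used, which is what makes the result agree with StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[50.0], [100.0], [150.0], [200.0]])

mean = data.mean(axis=0)
std = data.std(axis=0)          # population standard deviation (ddof=0)
manual_z = (data - mean) / std

sklearn_z = StandardScaler().fit_transform(data)

print(manual_z.ravel())
print("Matches StandardScaler:", np.allclose(manual_z, sklearn_z))
```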

**3. Robust Scaling (Handles Outliers)**
* Uses median and interquartile range (IQR) instead of mean and standard deviation.
* Less sensitive to outliers compared to Min-Max and Standardization.

Example in Python:

In [None]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
scaled_data = scaler.fit_transform(data)

print(scaled_data)

* Pros: Works well when data has outliers.
* Cons: Offers little advantage over standardization when the data has no significant outliers.
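
The effect of an outlier is easiest to see side by side. In the sketch below, an extreme value (10000, invented for illustration) is appended to the sample data; Min-Max scaling squeezes the ordinary values into a tiny interval, while robust scaling, computed as (x - median) / IQR, keeps them spread out.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

data = np.array([[50.0], [100.0], [150.0], [200.0], [10000.0]])  # last value is an outlier

minmax_scaled = MinMaxScaler().fit_transform(data).ravel()
robust_scaled = RobustScaler().fit_transform(data).ravel()

# Manual robust scaling: subtract the median, divide by the interquartile range
median = np.median(data)
q1, q3 = np.percentile(data, [25, 75])
manual_robust = ((data - median) / (q3 - q1)).ravel()

print("Min-Max:", minmax_scaled)
print("Robust: ", robust_scaled)
print("Manual matches RobustScaler:", np.allclose(manual_robust, robust_scaled))
```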

**Choosing the Right Scaling Method**

|Scaling Method	|Use Case|
|-|-|
|Min-Max Scaling	|When the data distribution is not normal (Neural Networks, KNN)|
|Standardization (Z-Score Scaling)	|When the data follows a normal distribution (Regression, SVM)|
|Robust Scaling	|When the data contains outliers|

##Q 22. How do we perform scaling in Python?
**Ans** - Feature scaling can be done using Scikit-Learn’s preprocessing module, which provides different scaling techniques such as Min-Max Scaling, Standardization, and Robust Scaling.

**1. Min-Max Scaling (Normalization)**
* Scales features between 0 and 1 (or any custom range).
* Commonly used for Neural Networks and KNN; tree-based models such as Decision Trees are largely insensitive to feature scale.

Example:

In [None]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[50], [100], [150], [200]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(data)

print("Min-Max Scaled Data:\n", scaled_data)

* Pros: Preserves original distribution.
* Cons: Sensitive to outliers (small changes in min/max values can affect scaling).

**2. Standardization (Z-Score Scaling)**
* Scales data to have mean = 0 and standard deviation = 1.
* Best for Linear Regression, Logistic Regression, SVM.

Example:

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

print("Standard Scaled Data:\n", scaled_data)

* Pros: Works well when data follows a normal distribution.
* Cons: Does not bound values between 0 and 1.

**3. Robust Scaling (Handles Outliers)**
* Uses median and interquartile range (IQR) instead of mean & standard deviation.
* Best when data contains outliers.

Example:

In [None]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
scaled_data = scaler.fit_transform(data)

print("Robust Scaled Data:\n", scaled_data)

* Pros: Less sensitive to outliers.
* Cons: Does not bound values to a fixed range; extreme outliers remain extreme after scaling.

**4. Scaling a Full Dataset (Multiple Features)**

Example: Scaling Multiple Columns in Pandas DataFrame

In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000]
})

scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(df_scaled)

* All numerical columns are scaled at once.
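
When the data is later split into training and test sets, a common practice is to fit the scaler on the training rows only and reuse it on the test rows, so that test data does not influence the learned minimum and maximum. A minimal sketch, assuming the first four rows are the training set (an arbitrary split chosen for illustration):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000]
})

train_df = df.iloc[:4]   # rows used to learn min and max
test_df = df.iloc[4:]    # unseen row

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_df)   # learn parameters from training data
test_scaled = scaler.transform(test_df)         # reuse them; do not refit

print(test_scaled)
```

The test row scales to values above 1 because it exceeds the training maximum; this is expected and shows that the scaler's parameters came from the training data alone.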

**5. When to Use Different Scaling Methods?**

|Scaling Method	|Best Used For|
|-|-|
|Min-Max Scaling	|Deep Learning, KNN|
|Standardization (Z-Score)	|Regression, SVM, PCA|
|Robust Scaling	|Handling Outliers|

##Q 23. What is sklearn.preprocessing?
**Ans** - sklearn.preprocessing is a module in Scikit-Learn that provides tools for scaling, normalizing, encoding, and transforming data before feeding it into a machine learning model. It helps in improving model performance by making features suitable for training.

**Why Use sklearn.preprocessing?**
* Ensures numerical features are on the same scale (important for models like SVM, KNN, and Gradient Descent-based algorithms).
* Encodes categorical variables so they can be used in models.
* Transforms skewed data (e.g., with PowerTransformer); missing values are handled by the companion sklearn.impute module.

**Key Functions in sklearn.preprocessing**

**1. Feature Scaling & Normalization**

Scaling helps ensure all numerical features contribute equally.

(a) Min-Max Scaling (Normalization)
* Scales values between 0 and 1.
* Commonly used for Neural Networks and KNN; tree-based models such as Decision Trees are largely insensitive to feature scale.

Example:

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform([[50], [100], [150], [200]])
print(scaled_data)

(b) Standardization (Z-Score Scaling)
* Scales data to have mean = 0, standard deviation = 1.
* Best for Regression, SVM, PCA.

Example:

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform([[50], [100], [150], [200]])
print(scaled_data)

(c) Robust Scaling (Handles Outliers)
* Uses median and interquartile range (IQR) instead of mean.
* Best when data contains outliers.

Example:

In [None]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
scaled_data = scaler.fit_transform([[50], [100], [150], [200]])
print(scaled_data)

**2. Encoding Categorical Variables**

Categorical features need to be converted into numerical values.

(a) One-Hot Encoding (OHE)
* Converts categories into binary columns.
* Used when categories have no order (Nominal).

Example:

In [None]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
encoder = OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2; older versions use sparse=False
encoded = encoder.fit_transform(df[['Color']])
print(encoded)

(b) Label Encoding
* Assigns a unique integer to each category.
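* Designed for encoding target labels (y). Integers follow the sorted (alphabetical) order of the labels, so for ordered features use OrdinalEncoder with an explicit category order.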

Example:

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(['High', 'Medium', 'Low', 'High'])
print(encoded)

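**3. Handling Missing Values**
* Fills missing values using mean, median, or mode.
* Note: SimpleImputer is provided by sklearn.impute rather than sklearn.preprocessing; it is included here because imputation is a common preprocessing step.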

Example:

In [None]:
from sklearn.impute import SimpleImputer
import numpy as np

data = np.array([[1, 2], [np.nan, 3], [7, 6]])
imputer = SimpleImputer(strategy="mean")
imputed_data = imputer.fit_transform(data)
print(imputed_data)

**Summary of sklearn.preprocessing Functions**

|Function	|Purpose|
|-|-|
|MinMaxScaler()	|Normalize data between 0 and 1|
|StandardScaler()	|Standardize data (mean=0, std=1)|
|RobustScaler()	|Scale using median (handles outliers)|
|OneHotEncoder()	|Convert categorical variables into binary columns|
|LabelEncoder()	|Convert categories into numerical values|
|SimpleImputer() (in sklearn.impute)	|Fill missing values using mean, median, etc.|

##Q 24. How do we split data for model fitting (training and testing) in Python?
**Ans** - In machine learning, we split data into two (or three) sets:
* Training Set – Used to train the model (70-80% of data).
* Test Set – Used to evaluate the model’s performance (20-30% of data).
* (Optional) Validation Set – Used for hyperparameter tuning (10-20% of data).

**1. Splitting Data Using train_test_split()**

The train_test_split() function from sklearn.model_selection is used to randomly divide data into training and testing sets.

Example: Train-Test Split (80%-20%)

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.DataFrame({
    'Feature1': [10, 20, 30, 40, 50, 60, 70, 80],
    'Feature2': [5, 15, 25, 35, 45, 55, 65, 75],
    'Label': [1, 0, 1, 0, 1, 0, 1, 0]
})

train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

print("Training Set:\n", train_set)
print("\nTesting Set:\n", test_set)

* test_size=0.2 → 20% of data is reserved for testing.
* random_state=42 → Ensures reproducibility.

**2. Splitting Features and Labels (X and y)**

Example: Splitting X (features) and y (target labels)

In [None]:
X = df.drop(columns=['Label'])
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train:\n", X_train)
print("y_train:\n", y_train)

* The model will train on X_train, y_train and be evaluated on X_test, y_test.

**3. Splitting Data Into Training, Validation, and Test Sets**

If we need a validation set for hyperparameter tuning, we can split the data further.

Example: 60% Train, 20% Validation, 20% Test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

print("Training Set Size:", len(X_train))
print("Validation Set Size:", len(X_val))
print("Testing Set Size:", len(X_test))

* This results in:
  * 60% Training
  * 20% Validation
  * 20% Testing
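
The second split uses test_size=0.25 because 25% of the remaining 80% equals 20% of the full dataset. The sketch below checks the resulting sizes on 100 synthetic samples (invented for illustration), where the percentages map exactly onto counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_all = np.arange(100).reshape(-1, 1)   # 100 synthetic samples
y_all = np.arange(100) % 2

# First split: hold out 20% for testing
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80% becomes the validation set
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # 60 20 20
```

On very small datasets the sizes are rounded to whole samples, so the realized proportions can deviate from 60/20/20.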

**4. Stratified Sampling for Imbalanced Datasets**

If the dataset is imbalanced (e.g., fraud detection), use stratify=y to maintain the same class proportions.

Example: Stratified Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

* Preserves the class proportions of y in both the training and test sets.
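
A quick way to verify the effect of stratification is to compare the share of the minority class before and after splitting. The sketch below assumes a synthetic label array with 10% positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 90 negatives, 10 positives
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb)

# Share of the positive class in each subset
print("Overall:", y_imb.mean())
print("Train:  ", y_tr.mean())
print("Test:   ", y_te.mean())
```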

**5. When to Use Different Splits?**

|Scenario	|Train-Test Split|
|-|-|
|Small Dataset (< 1,000 samples)	|80% Train, 20% Test|
|Large Dataset (> 10,000 samples)	|70% Train, 20% Test, 10% Validation|
|Deep Learning	|60% Train, 20% Validation, 20% Test|
|Imbalanced Data	|Use stratify=y|

##Q 25. Explain data encoding
**Ans** - Data encoding is the process of converting categorical variables into numerical format so that machine learning models can process them. Since most ML algorithms work with numerical data, categorical features need to be transformed appropriately.

**Why is Data Encoding Important?**
* Machine Learning models require numerical inputs.
* Categorical features need to be converted into a usable form.
* Ensures the model understands relationships between categories.

**Types of Data Encoding**

**1. Label Encoding (Ordinal Encoding)**
* Assigns a unique integer to each category.
* Suitable for ordinal data (categories with a meaningful order).

Example:

|Education Level	|Encoded Value|
|-|-|
|High School	|0|
|Bachelor's	|1|
|Master's	|2|
|PhD	|3|

Python Implementation:

In [None]:
from sklearn.preprocessing import LabelEncoder

data = ['High School', 'Bachelor', 'Master', 'PhD']
encoder = LabelEncoder()
encoded_data = encoder.fit_transform(data)

print(encoded_data)

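* Note: LabelEncoder assigns integers in sorted (alphabetical) order, so the code above yields Bachelor=0, High School=1, Master=2, PhD=3 rather than the ordered mapping in the table. Use OrdinalEncoder with an explicit category list to get that mapping.
* Pros: Simple and memory-efficient.
* Cons: Models might assume numerical relationships where none exist.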
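
To check which integer each category actually received, the fitted encoder's `classes_` attribute can be inspected, and `inverse_transform()` maps the integers back to labels. A short sketch (the `mapping` dictionary is a hypothetical helper built for display):

```python
from sklearn.preprocessing import LabelEncoder

data = ['High School', 'Bachelor', 'Master', 'PhD']
encoder = LabelEncoder()
encoded_data = encoder.fit_transform(data)

# classes_ is sorted, and each label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
decoded = list(encoder.inverse_transform(encoded_data))

print(mapping)   # {'Bachelor': 0, 'High School': 1, 'Master': 2, 'PhD': 3}
print(decoded)
```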

**2. One-Hot Encoding (OHE)**
* Converts categories into binary columns (0s and 1s).
* Suitable for nominal data (categories with no order).

Example:

|Color	|Red	|Blue	|Green|
|-|-|-|-|
|Red	|1	|0	|0|
|Blue	|0	|1	|0|
|Green	|0	|0	|1|

Python Implementation:

In [None]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
encoder = OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2; older versions use sparse=False
encoded = encoder.fit_transform(df[['Color']])

print(encoded)

* Pros: No assumption of numerical order.
* Cons: Can create too many columns for high-cardinality features.

**3. Ordinal Encoding**
* Similar to Label Encoding, but we specify a meaningful order.

Example:

|Size	|Encoded Value|
|-|-|
|Small	|0|
|Medium	|1|
|Large	|2|

Python Implementation:

In [None]:
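import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large']})
# Pass the categories in their intended order: Small < Medium < Large
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
# fit_transform returns a 2D array; ravel() flattens it into a single column
df['Size_encoded'] = encoder.fit_transform(df[['Size']]).ravel()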

print(df)

* Pros: Preserves ordinal relationships.
* Cons: Should only be used when order matters.

**4. Target Encoding (Mean Encoding)**
* Replaces categories with the mean of the target variable.
* Useful for high-cardinality categorical features.

Example: Predicting house prices

|Neighborhood	|Average Price ($1000s)|
|-|-|
|A	|200|
|B	|250|
|C	|300|

Python Implementation:

In [None]:
import pandas as pd

df = pd.DataFrame({'Neighborhood': ['A', 'B', 'A', 'C'], 'Price': [200, 250, 220, 300]})
df['Neighborhood_encoded'] = df.groupby('Neighborhood')['Price'].transform('mean')

print(df)

* Pros: Reduces dimensionality.
* Cons: Can lead to data leakage if applied incorrectly.
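
One way to reduce the leakage risk is to compute the category means on the training rows only and then map them onto new rows. The sketch below is one simple variant, not the only approach; the `test` frame, the unseen neighborhood 'D', and the fallback to the global training mean are assumptions made for illustration.

```python
import pandas as pd

train = pd.DataFrame({'Neighborhood': ['A', 'B', 'A', 'C'],
                      'Price': [200, 250, 220, 300]})
test = pd.DataFrame({'Neighborhood': ['A', 'C', 'D']})   # 'D' is unseen

# Learn the encoding from training data only
category_means = train.groupby('Neighborhood')['Price'].mean()
global_mean = train['Price'].mean()

# Apply it to the test data; unseen categories fall back to the global mean
test['Neighborhood_encoded'] = test['Neighborhood'].map(category_means).fillna(global_mean)

print(test)
```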

**5. Frequency (Count) Encoding**
* Replaces categories with their occurrence count in the dataset.
* Useful when some categories appear much more frequently.

Example:

|City	|Count|
|-|-|
|New York	|2|
|Los Angeles	|1|
|Chicago	|2|

Python Implementation:

In [None]:
df = pd.DataFrame({'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago']})
df['City_encoded'] = df['City'].map(df['City'].value_counts())

* Pros: Keeps useful information about category importance.
* Cons: May not always capture meaningful patterns.

**Choosing the Right Encoding Technique**

|Encoding Type	|Use Case|
|-|-|
|Label Encoding	|Ordinal categories (e.g., Education Level)|
|One-Hot Encoding	|Small number of categories (e.g., Colors)|
|Ordinal Encoding	|When order matters (e.g., Small, Medium, Large)|
|Target Encoding	|High-cardinality categorical features (e.g., Zip Codes)|
|Frequency Encoding	|When frequency matters (e.g., Popularity of locations)|