Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.


ans.

### What Are Missing Values in a Dataset?

**Missing values** are entries in a dataset where no data value is stored for a variable in an observation. This can happen due to various reasons such as:

* Errors in data collection
* Data corruption
* Incomplete data entry
* Sensor or system failures
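
Whatever the cause, the first step is usually to measure how much is missing. A minimal sketch with pandas, using a made-up sensor log (the column names and values are illustrative assumptions, not data from this notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical sensor log with gaps (illustrative values only)
readings = pd.DataFrame({
    'sensor_id': ['a', 'b', 'c', 'd'],
    'temperature': [21.5, np.nan, 19.0, np.nan],
    'humidity': [40.0, 42.0, np.nan, 45.0],
})

missing_per_column = readings.isna().sum()   # number of missing entries per column
missing_share = readings.isna().mean()       # fraction of missing entries per column

print(missing_per_column)
print(missing_share)
```

Columns with a large missing share are candidates for closer inspection before any rows are dropped or filled.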



###  Why is it Essential to Handle Missing Values?

Handling missing values is crucial because:

1. **Bias Prevention**: When values are missing systematically rather than at random, statistics computed on the remaining data can be biased.
2. **Model Accuracy**: Many machine learning algorithms cannot accept missing values; they raise an error or, after naive filling, give misleading results.
3. **Data Integrity**: Handling missing values explicitly helps ensure that conclusions drawn from the data are valid.
4. **Performance**: Unhandled missing values can degrade model training and prediction quality.



###  Algorithms That Are Not Affected by Missing Values:

Some algorithms can **handle missing data internally** or are **less sensitive** to it. Examples include:

1. **XGBoost (Extreme Gradient Boosting)**
2. **LightGBM (Light Gradient Boosting Machine)**
3. **CatBoost**
4. **k-Nearest Neighbors (k-NN)**, but only when paired with an imputation step or a NaN-aware distance (a standard k-NN model does not accept missing values)
5. **Some decision tree-based models** (e.g., CART can handle missing values by surrogate splits)

>  Note: While some models tolerate missing values, **it's still good practice** to analyze and impute or address them appropriately.





Q2: List down techniques used to handle missing data.  Give an example of each with python code.

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample DataFrame with missing values
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', np.nan, 'David'],
    'Age': [25, np.nan, 30, 40]
})

# 1. Remove rows with missing data
df_dropped = df.dropna()

# 2. Mean/Median/Mode Imputation
df['Age_mean'] = df['Age'].fillna(df['Age'].mean())  # Mean
df['Age_median'] = df['Age'].fillna(df['Age'].median())  # Median

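# 3. Forward Fill / Backward Fill
# (fillna(method=...) is deprecated in recent pandas; use ffill()/bfill())
df['Age_ffill'] = df['Age'].ffill()  # Forward fill: copy the previous valid value
df['Age_bfill'] = df['Age'].bfill()  # Backward fill: copy the next valid value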

# 4. Interpolation
df['Age_interp'] = df['Age'].interpolate()  # Linear interpolation

# 5. Constant Imputation
df['Name_const'] = df['Name'].fillna('Unknown')

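# 6. Predictive Imputation (Linear Regression)
# Encode Name as integer codes (a missing Name becomes -1).
# The codes are arbitrary labels, so this only illustrates the mechanics.
df['Name_enc'] = df['Name'].astype('category').cat.codes
train = df[df['Age'].notnull()]  # rows with a known Age
test = df[df['Age'].isnull()]    # rows whose Age must be predicted
model = LinearRegression()
model.fit(train[['Name_enc']], train['Age'])
# Age_predicted is filled only for the rows where Age was missing
df.loc[df['Age'].isnull(), 'Age_predicted'] = model.predict(test[['Name_enc']])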

print(df)


    Name   Age   Age_mean  Age_median  Age_ffill  Age_bfill  Age_interp  \
0  Alice  25.0  25.000000        25.0       25.0       25.0        25.0   
1    Bob   NaN  31.666667        30.0       25.0       30.0        27.5   
2    NaN  30.0  30.000000        30.0       30.0       30.0        30.0   
3  David  40.0  40.000000        40.0       40.0       40.0        40.0   

  Name_const  Name_enc  Age_predicted  
0      Alice         0            NaN  
1        Bob         1      34.285714  
2    Unknown        -1            NaN  
3      David         2            NaN  
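
Predictive imputation can also be delegated to a ready-made imputer. The sketch below assumes scikit-learn's `KNNImputer`, which is not used elsewhere in this notebook; the small numeric matrix is made up for illustration. Each missing entry is filled with the average of that column over the nearest rows, with distances computed on the columns that are present.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Illustrative matrix: the second column is roughly twice the first
X_demo = np.array([
    [1.0, 2.0],
    [2.0, np.nan],   # value to impute
    [3.0, 6.0],
    [4.0, 8.0],
])

imputer = KNNImputer(n_neighbors=2)   # average over the 2 nearest rows
X_filled = imputer.fit_transform(X_demo)
print(X_filled)
```

Here the two nearest rows to the incomplete one are the first and third, so the gap is filled with the mean of their second-column values.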




Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

ans.

Imbalanced data refers to a situation in classification problems where the number of observations in each class is significantly different. For example, in a binary classification problem for fraud detection:

* Class 0 (Non-fraud): 98%
* Class 1 (Fraud): 2%

This imbalance can bias the model toward the majority class.

### What Happens if Imbalanced Data Is Not Handled?

If you don't handle imbalanced data:

1. **Misleading Accuracy**: The model may show high accuracy just by predicting the majority class. Example: predicting "Non-fraud" for every case gives 98% accuracy but is useless for finding fraud.
2. **Poor Recall for Minority Class**: The model fails to correctly identify important minority class events like fraud, disease, defects, etc.
3. **Biased Predictions**: The model becomes biased toward the majority class, ignoring the minority class patterns.

### Why Is It Critical?

In many applications, such as:

* Fraud detection
* Medical diagnosis
* Spam detection

the minority class is more important, and missing it can lead to serious consequences.
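
The 98% / 2% fraud example can be checked with a few lines of NumPy. This is a sketch with synthetic labels: a "model" that always predicts non-fraud reaches 98% accuracy while catching no fraud at all.

```python
import numpy as np

# 98 non-fraud cases (0) and 2 fraud cases (1), as in the example above
y_true = np.array([0] * 98 + [1] * 2)

# A trivial baseline that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
fraud_recall = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print("Accuracy:", accuracy)
print("Recall on fraud class:", fraud_recall)
```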










Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and downsampling are required.

In [5]:
import pandas as pd
from sklearn.utils import resample

# Example dataframe
data = {
    'feature': [1,2,3,4,5,6,7,8,9,10],
    'class':   [0,0,0,0,0,1,1,1,1,1]  # 0 plays the majority role, 1 the minority (5 rows each in this toy data)
}
df = pd.DataFrame(data)

# Separate majority and minority classes
majority = df[df['class'] == 0]
minority = df[df['class'] == 1]

# Down-sample class 0 to the size of class 1.
# The sizes already match in this toy data, so every row is kept and class 0 is only reshuffled.
majority_downsampled = resample(majority, replace=False, n_samples=len(minority), random_state=42)

# Combine downsampled majority and minority class
df_downsampled = pd.concat([majority_downsampled, minority])

print("Downsampled dataset:")
print(df_downsampled)


Downsampled dataset:
   feature  class
1        2      0
4        5      0
2        3      0
0        1      0
3        4      0
5        6      1
6        7      1
7        8      1
8        9      1
9       10      1
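
Down-sampling removes rows from the majority class until the classes are balanced; it suits large datasets where discarding majority rows is affordable. Up-sampling adds rows to the minority class (by repeating existing rows or generating new ones); it suits small datasets where every row matters. The cell above uses equally sized classes, so the sketch below repeats the idea on a made-up 8-versus-2 frame and shows both directions. The frame and variable names are illustrative assumptions.

```python
import pandas as pd
from sklearn.utils import resample

# Illustrative imbalanced frame: 8 rows of class 0, 2 rows of class 1
toy = pd.DataFrame({
    'feature': list(range(1, 11)),
    'label':   [0] * 8 + [1] * 2,
})
majority_rows = toy[toy['label'] == 0]
minority_rows = toy[toy['label'] == 1]

# Up-sampling: draw minority rows with replacement until they match the majority
minority_upsampled = resample(minority_rows, replace=True,
                              n_samples=len(majority_rows), random_state=42)
toy_upsampled = pd.concat([majority_rows, minority_upsampled])

# Down-sampling: keep only as many majority rows as there are minority rows
majority_reduced = resample(majority_rows, replace=False,
                            n_samples=len(minority_rows), random_state=42)
toy_downsampled = pd.concat([majority_reduced, minority_rows])

print(toy_upsampled['label'].value_counts())
print(toy_downsampled['label'].value_counts())
```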


Q5: What is data Augmentation? Explain SMOTE.

In [8]:
from imblearn.over_sampling import SMOTE
from collections import Counter
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6],  # Majority class
              [6, 7], [7, 8]])                          # Minority class
y = np.array([0, 0, 0, 0, 0, 1, 1])  # 0 = majority, 1 = minority

print('Original dataset shape:', Counter(y))

# k_neighbors must be smaller than the number of minority samples (2 here), so use 1
smote = SMOTE(random_state=42, k_neighbors=1)
X_resampled, y_resampled = smote.fit_resample(X, y)

print('Resampled dataset shape:', Counter(y_resampled))


Original dataset shape: Counter({np.int64(0): 5, np.int64(1): 2})
Resampled dataset shape: Counter({np.int64(0): 5, np.int64(1): 5})
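
Data augmentation means creating additional training examples from the ones already available. SMOTE does this for the minority class: it picks a minority point, picks one of its nearest minority neighbours, and places a synthetic point somewhere on the line segment between them. The sketch below illustrates only that interpolation step with NumPy; `smote_like_sample` is a hypothetical helper written for this explanation and is not the imbalanced-learn implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
minority = np.array([[6.0, 7.0], [7.0, 8.0]])  # the two minority points from the cell above

def smote_like_sample(points, rng):
    """Return one synthetic point between a random point and its nearest neighbour."""
    i = rng.integers(len(points))
    base = points[i]
    dists = np.linalg.norm(points - base, axis=1)
    dists[i] = np.inf                      # exclude the point itself
    neighbour = points[np.argmin(dists)]   # nearest other minority point
    lam = rng.random()                     # random position in [0, 1) along the segment
    return base + lam * (neighbour - base)

synthetic = np.array([smote_like_sample(minority, rng) for _ in range(3)])
print(synthetic)
```

Every synthetic point lies between the two original minority points, which is why SMOTE needs at least `k_neighbors + 1` minority samples to work with.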


Q6: What are outliers in a dataset? Why is it essential to handle outliers?

ans.

**Outliers:**  
Outliers are data points that differ significantly from other observations in the dataset. They lie far away from the majority of the data, either much higher or much lower than most values.

**Why handle outliers?**

* **Distort statistical analysis:** Outliers can skew mean, variance, and correlation, leading to misleading conclusions.
* **Affect model performance:** Algorithms that minimize squared error or rely on distances (for example, linear regression, k-means, and k-NN) are sensitive to extreme values, which can pull the fit away from the bulk of the data.
* **Impact visualization:** Outliers can stretch scales in plots, hiding the overall data pattern.
* **Data quality issues:** Outliers may indicate data entry errors, measurement errors, or rare events that need special attention.

Handling outliers ensures more reliable, robust, and interpretable data analysis and modeling.
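
A short sketch of the first point, using made-up numbers: one extreme value shifts the mean far more than the median. The 1.5 × IQR rule used to flag the outlier is a common heuristic chosen here for illustration; it is an assumption, not a method prescribed by the question.

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 300])  # 300 is the suspicious entry

mean_value = values.mean()          # pulled upward by the extreme value
median_value = np.median(values)    # barely affected

# Flag points outside 1.5 * IQR from the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print("Mean:", mean_value, "Median:", median_value)
print("Flagged outliers:", outliers)
```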


Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?


Ans.
1. **Remove missing data:**

   * Remove rows or columns with missing values if the missingness is small or not important.
   * Example: `df.dropna()` removes rows with any missing values.

2. **Imputation:**

   * Fill missing values with a specific value, such as mean, median, or mode.
   * Example: `df['age'] = df['age'].fillna(df['age'].mean())`

3. **Predictive imputation:**

   * Use models (like regression, k-NN) to predict missing values based on other features.

4. **Using indicators:**

   * Create a new boolean feature indicating where data was missing.

5. **Forward or backward fill:**

   * Fill missing values using previous or next valid value (useful in time series).
   * Example: `df.ffill()`

6. **Model-based handling:**

   * Use algorithms that handle missing data internally, e.g., XGBoost or LightGBM. Whether a Random Forest accepts missing values depends on the implementation and version.

7. **Treat missing data as a separate category:**

   * For categorical variables, consider missing as its own category.



Selecting the right technique depends on the data nature, amount of missingness, and analysis goals.
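
Several of these techniques combine naturally. The sketch below, on a made-up customer frame (column names and values are assumptions for illustration), adds a missing-value indicator, fills a numeric column with its median, and treats a missing category as its own level.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with gaps
customers = pd.DataFrame({
    'age': [25, np.nan, 40, 31],
    'segment': ['a', None, 'b', 'a'],
})

# Technique 4: record where the value was missing before filling it
customers['age_missing'] = customers['age'].isna()

# Technique 2: fill the numeric column with its median
customers['age'] = customers['age'].fillna(customers['age'].median())

# Technique 7: treat a missing category as its own level
customers['segment'] = customers['segment'].fillna('Missing')

print(customers)
```

Creating the indicator before imputing matters: once the gap is filled, the information that it was ever missing is gone.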


Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?



In [9]:
"""
To determine whether data is Missing Completely at Random (MCAR), Missing at Random (MAR), or Not Missing at Random (NMAR, also written MNAR), you can use the following strategies:

1. Visual Inspection:
   - Use heatmaps or visualization tools to check for missing data patterns.
   Example:
   import seaborn as sns
   sns.heatmap(df.isnull(), cbar=False)

   import missingno as msno
   msno.matrix(df)
   msno.heatmap(df)

2. Summary Statistics Comparison:
   - Compare distributions (mean, median, etc.) of other columns grouped by missingness.
   Example:
   df.groupby(df['age'].isnull())['income'].mean()

3. Correlation with Missingness:
   - Create an indicator variable for missing values and find correlation with other columns.
   Example:
   df['age_missing'] = df['age'].isnull().astype(int)
   print(df.corr(numeric_only=True)['age_missing'])

4. Little’s MCAR Test:
   - A statistical test (available in R and in some Python packages) to test if data is MCAR.

5. Domain Knowledge:
   - Use contextual understanding to identify reasons for missingness. For example, income data might be missing for unemployed individuals.

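Summary:
- MCAR: Missingness is independent of both observed and unobserved data → dropping rows does not bias results, though it reduces sample size.
- MAR: Missingness depends only on observed data → imputation that uses those observed variables is appropriate.
- NMAR: Missingness depends on the unobserved values themselves → difficult to handle; may require explicitly modeling the missingness mechanism.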

Understanding the type of missing data helps in choosing the right imputation or data-cleaning strategy.
"""

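
The grouped-statistics and correlation checks described above can be tried on a small made-up frame. In this sketch the `age` values are deliberately missing mostly for high-income rows, so both checks should point to a pattern; the data is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Illustrative data: age is missing mainly where income is high
people = pd.DataFrame({
    'income': [20, 25, 30, 35, 80, 85, 90, 95],
    'age': [25, 30, 35, 40, np.nan, 50, np.nan, np.nan],
})

# Indicator of missingness (1 = age is missing)
people['age_missing'] = people['age'].isna().astype(int)

# Strategy 2: compare another column across missing / non-missing groups
income_by_missingness = people.groupby('age_missing')['income'].mean()

# Strategy 3: correlate the indicator with another column
corr_with_missingness = people[['income', 'age_missing']].corr().loc['income', 'age_missing']

print(income_by_missingness)
print("Correlation between income and age_missing:", corr_with_missingness)
```

A clear gap between the group means, or a correlation well away from zero, argues against MCAR. It cannot distinguish MAR from NMAR, which still requires domain knowledge.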

Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?



In [10]:
"""
When dealing with imbalanced datasets in medical diagnosis (where false negatives can be costly), traditional accuracy may be misleading. Use the following evaluation strategies:

1. Confusion Matrix:
   - Gives a detailed breakdown of TP, TN, FP, and FN.
   - Helps understand how many positive cases are missed.

2. Precision, Recall, and F1-Score:
   - Precision = TP / (TP + FP): Focuses on minimizing false positives.
   - Recall = TP / (TP + FN): Important when false negatives are costly (e.g., undiagnosed disease).
   - F1-Score = Harmonic mean of Precision and Recall.

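3. ROC Curve and AUC (Area Under Curve):
   - The ROC curve plots TPR against FPR across all thresholds; AUC summarizes it in one threshold-independent number.
   - Can look optimistic under heavy imbalance, because FPR is dominated by the large number of true negatives.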

4. Precision-Recall Curve:
   - More informative than ROC when the dataset is highly imbalanced.
   - Focuses on performance for the minority class.

5. Stratified K-Fold Cross-Validation:
   - Ensures each fold maintains the same class distribution as the overall dataset.

6. Use Class Weights:
   - A training-time adjustment rather than a metric: pass `class_weight='balanced'` to classifiers like LogisticRegression or RandomForestClassifier, then evaluate with the metrics above.

7. Threshold Tuning:
   - Adjust the decision threshold (default is 0.5) to increase sensitivity towards minority class.

Example (in sklearn):
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:,1]))

Conclusion:
Choose evaluation metrics that reflect the cost of misclassification, especially focusing on recall and F1-score in medical applications.
"""


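
The effect of threshold tuning on recall can be seen on a tiny hand-made example. This is a sketch under stated assumptions: the labels and scores below are invented (8 healthy patients, 2 with the condition), and `evaluate` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels (1 = has the condition) and model scores
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.10, 0.20, 0.20, 0.30, 0.35, 0.60, 0.40, 0.80])

def evaluate(threshold):
    """Turn scores into labels at the given threshold and summarize the errors."""
    y_pred = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        'tn': int(tn), 'fp': int(fp), 'fn': int(fn), 'tp': int(tp),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
    }

default = evaluate(0.5)    # default decision threshold
lowered = evaluate(0.35)   # more sensitive threshold

print("Threshold 0.50:", default)
print("Threshold 0.35:", lowered)
```

Lowering the threshold removes the missed positive case at the cost of one extra false alarm, which is often the preferred trade in diagnosis.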

Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?



In [11]:
"""
To balance the dataset and ensure fair learning, especially when most customers report being satisfied, you can apply the following methods:

1. **Down-Sampling the Majority Class:**
   - Randomly remove samples from the majority class (e.g., 'satisfied') to match the number of minority class instances.

   Example using Pandas and sklearn:
   ----------------------------------
   from sklearn.utils import resample
   import pandas as pd

   # Example DataFrame
   df = pd.DataFrame({
       'satisfaction': ['satisfied'] * 90 + ['unsatisfied'] * 10,
       'score': list(range(100))
   })

   # Separate majority and minority
   df_majority = df[df.satisfaction == 'satisfied']
   df_minority = df[df.satisfaction == 'unsatisfied']

   # Downsample majority class
   df_majority_downsampled = resample(df_majority,
                                      replace=False,     # without replacement
                                      n_samples=len(df_minority),  # to match minority
                                      random_state=42)

   # Combine minority class with downsampled majority class
   df_balanced = pd.concat([df_majority_downsampled, df_minority])

   print("Balanced dataset:\n", df_balanced['satisfaction'].value_counts())

2. **Other Techniques (if down-sampling isn’t ideal):**
   - Up-Sampling the Minority Class: Duplicate or synthetically generate new samples (e.g., with SMOTE).
   - Use of Class Weights: Adjust algorithm to penalize misclassification of minority class more heavily.
   - Ensemble Techniques: Use ensemble models from imbalanced-learn such as BalancedRandomForestClassifier or EasyEnsembleClassifier.

Conclusion:
Down-sampling helps prevent model bias towards the majority class but may lead to loss of information. Always compare model performance using proper metrics (F1-score, recall) after balancing.
"""


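
The down-sampling example above lives inside a string and is never executed. A runnable pandas-only variant, assuming pandas 1.1 or later for `groupby(...).sample`, draws the same number of rows from every class. The frame mirrors the 90 / 10 split used above; the variable names are illustrative.

```python
import pandas as pd

# Same 90 / 10 split as in the example above
survey = pd.DataFrame({
    'satisfaction': ['satisfied'] * 90 + ['unsatisfied'] * 10,
    'score': list(range(100)),
})

# Size of the smallest class
minority_size = int(survey['satisfaction'].value_counts().min())

# Draw minority_size rows from each class without replacement
balanced = survey.groupby('satisfaction').sample(n=minority_size, random_state=42)
class_counts = balanced['satisfaction'].value_counts()

print(class_counts)
```

Because the minority class is sampled at its full size, all of its rows are kept and only the majority class is reduced.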

Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?




In [12]:
"""
To handle class imbalance when the rare event (minority class) is underrepresented, **up-sampling** the minority class can help balance the dataset. Below are effective techniques:

1. **Random Over-Sampling:**
   - Replicate samples from the minority class until it matches the size of the majority class.

   Example:
   --------
   from sklearn.utils import resample
   import pandas as pd

   # Create an imbalanced dataset
   df = pd.DataFrame({
       'event': ['no'] * 90 + ['yes'] * 10,
       'value': list(range(100))
   })

   # Separate minority and majority classes
   df_majority = df[df.event == 'no']
   df_minority = df[df.event == 'yes']

   # Upsample minority class
   df_minority_upsampled = resample(df_minority,
                                    replace=True,              # Sample with replacement
                                    n_samples=len(df_majority),# Match majority class
                                    random_state=42)

   # Combine majority and upsampled minority
   df_balanced = pd.concat([df_majority, df_minority_upsampled])

   print("Balanced dataset:\n", df_balanced['event'].value_counts())

2. **SMOTE (Synthetic Minority Over-sampling Technique):**
   - Generates synthetic samples for the minority class using k-nearest neighbors.

   Example:
   --------
   from imblearn.over_sampling import SMOTE
   from collections import Counter
   from sklearn.datasets import make_classification

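   # Sample imbalanced data
   # (n_informative / n_redundant are set explicitly: the defaults need more than 2 features)
   X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0,
                              weights=[0.9, 0.1], random_state=42)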

   print("Original class distribution:", Counter(y))

   sm = SMOTE(random_state=42)
   X_resampled, y_resampled = sm.fit_resample(X, y)

   print("Resampled class distribution:", Counter(y_resampled))

3. **Use Class Weighting in Algorithms:**
   - Specify `class_weight='balanced'` in classifiers like `LogisticRegression`, `RandomForestClassifier`, etc.

Conclusion:
Up-sampling helps reduce the model's bias toward the majority class. Apply it to the training split only, so that copies of the same rows do not leak into the test set, and choose the method based on your dataset size and whether synthetic generation (like SMOTE) makes sense contextually.
"""


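
Class weighting (technique 3) has no example above, so here is a small sketch of what `'balanced'` means numerically, using scikit-learn's `compute_class_weight` on made-up labels with a 90 / 10 split. Each class receives the weight `n_samples / (n_classes * class_count)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels: 90 common events (0) and 10 rare events (1)
labels = np.array([0] * 90 + [1] * 10)
classes = np.unique(labels)

weights = compute_class_weight(class_weight='balanced', classes=classes, y=labels)
weight_by_class = {int(c): float(w) for c, w in zip(classes, weights)}

print(weight_by_class)
```

The rare class ends up weighted nine times more heavily than the common one, which has a similar effect on the loss as replicating each rare row nine times, without changing the dataset.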