## Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

Missing values in a dataset refer to the absence of data or information for specific observations or variables. They can occur for various reasons, such as data collection errors, data corruption, or simply because certain information was not collected or recorded.

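Handling missing values is essential for several reasons:
1. Avoid biased analysis: if values are missing systematically, statistics computed on the remaining data can be skewed.
2. Maintain data integrity: gaps can make records inconsistent or hard to compare.
3. Improve model performance: many algorithms cannot train on incomplete rows, and careless handling discards useful information.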

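Algorithms that can tolerate missing values (depending on the implementation):
1. Decision trees (for example, CART can route missing values using surrogate splits)
2. Naive Bayes (a missing feature can be left out of the probability product)
3. Random Forest (inherits the handling of its underlying trees)
4. Gradient-boosted trees such as XGBoost and LightGBM (missing values are sent down a learned default branch)

K-Nearest Neighbour is sometimes listed here as well, but its distance computation needs complete feature vectors, so it usually requires imputation first.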


## Q2: List down techniques used to handle missing data. Give an example of each with python code.

1. Deletion of Missing Values:

This involves removing rows or columns with missing values. It's a simple approach but can result in a loss of valuable data.

In [1]:
import pandas as pd

data = {'A': [1, 2, None, 4],
        'B': [5, None, 7, 8]}
df = pd.DataFrame(data)

df_cleaned = df.dropna()
print(df_cleaned)


     A    B
0  1.0  5.0
3  4.0  8.0


2. Mean/Median/Mode Imputation:

Fill missing values with the mean, median, or mode of the respective column.

In [2]:
import pandas as pd

data = {'A': [1, 2, None, 4],
        'B': [5, None, 7, 8]}
df = pd.DataFrame(data)

df_filled = df.fillna(df.mean())
print(df_filled)


          A         B
0  1.000000  5.000000
1  2.000000  6.666667
2  2.333333  7.000000
3  4.000000  8.000000


3. Interpolation:

Interpolate missing values based on the values of neighboring data points.

In [3]:
import pandas as pd

data = {'A': [1, 2, None, 4],
        'B': [5, None, 7, None]}
df = pd.DataFrame(data)

df_interpolated = df.interpolate()
print(df_interpolated)


     A    B
0  1.0  5.0
1  2.0  6.0
2  3.0  7.0
3  4.0  7.0


## Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

Imbalanced data refers to a situation in a classification problem where the distribution of classes (or categories) is not roughly equal. Instead, one class has significantly fewer instances (minority class), while another class dominates with a larger number of instances (majority class).

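Problems that arise if imbalanced data is not handled:
1. Biased models: the model favours the majority class, because predicting it minimises the overall error.
2. Poor generalization: the minority class is learned from too few examples.
3. Misleading evaluation metrics: accuracy can look high even when the minority class is rarely predicted correctly.
4. Loss of important information: the minority class is often the one of interest (for example, fraud or disease).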


## Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

1. Up-sampling:

Up-sampling involves increasing the number of instances in the minority class by adding copies of existing instances or generating synthetic examples.

Example:

Consider a fraud detection dataset where only 2% of the transactions are fraudulent (minority class). To up-sample, randomly select instances from the fraudulent class and duplicate them until the class distribution is balanced.

2. Down-sampling:

Down-sampling involves reducing the number of instances in the majority class to match the minority class's size. This technique aims to create a balanced class distribution by removing some instances from the majority class.

Example:

In a customer churn prediction dataset, where only 10% of customers have churned (minority class), randomly select and remove instances from the non-churning customers until both classes have equal representation.

When to use up-sampling and down-sampling:

1. Up-sampling:

- Use up-sampling when the minority class has only a small amount of data and we want to avoid losing information by keeping all instances.
- It's useful when the dataset is moderate in size and we can afford to create additional samples.

2. Down-sampling:

- Use down-sampling when the majority class is significantly larger and reducing its size will help balance the class distribution.
- It's suitable when we want to reduce computational overhead and training time, especially with large datasets.
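
Both techniques can be sketched with pandas alone. The snippet below is a minimal illustration, assuming a made-up table with a binary `label` column; the column names, values, and `random_state` are invented for the example.

```python
import pandas as pd

# Toy dataset (invented for illustration): 8 majority rows (label 0), 2 minority rows (label 1).
df = pd.DataFrame({
    "amount": [10, 12, 11, 13, 9, 14, 10, 12, 250, 300],
    "label":  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Up-sampling: draw minority rows WITH replacement until they match the majority count.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
df_up = pd.concat([majority, minority_up], ignore_index=True)

# Down-sampling: draw majority rows WITHOUT replacement down to the minority count.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
df_down = pd.concat([majority_down, minority], ignore_index=True)

print(df_up["label"].value_counts())
print(df_down["label"].value_counts())
```

Resampling should normally be applied to the training split only, so that the test set keeps the real class distribution.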

## Q5: What is Data Augmentation? Explain SMOTE.

Data augmentation is a technique used in machine learning and computer vision to artificially increase the size of a dataset by creating new training examples from the existing ones. The goal is to improve the model's performance, generalization, and robustness by introducing variations in the training data.


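SMOTE (Synthetic Minority Over-sampling Technique) steps:

1. Identify the minority class.

2. Select neighbours: for a chosen minority instance, find its k nearest neighbours within the minority class.

3. Create synthetic instances: pick one of those neighbours at random and place a new point at a random position on the line segment between the instance and that neighbour.

4. Repeat until the desired number of synthetic instances has been generated.

5. Combine the synthetic instances with the original data.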
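
The steps above can be sketched in plain NumPy. This is a simplified illustration rather than a reference implementation: the function name `smote_sketch`, its parameters, and the sample points are all hypothetical, and a maintained library such as imbalanced-learn would normally be used in practice.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points from minority samples X_min (shape: n x d)."""
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = np.random.default_rng(seed)
    k = min(k, n - 1)

    # Step 2: k nearest neighbours of every minority sample (Euclidean distance).
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Steps 3 and 4: interpolate between a sample and one random neighbour, n_new times.
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                   # pick a minority sample
        nb = neighbours[base, rng.integers(k)]   # pick one of its neighbours
        gap = rng.random()                       # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Step 1: the minority class (invented points).
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
X_new = smote_sketch(X_minority, n_new=6)

# Step 5: combine synthetic and original minority samples.
X_minority_augmented = np.vstack([X_minority, X_new])
print(X_minority_augmented.shape)  # (10, 2)
```

Because every synthetic point lies on a segment between two real minority points, the new data stays inside the region the minority class already occupies.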


## Q6: What are outliers in a dataset? Why is it essential to handle outliers?

Outliers are data points in a dataset that significantly differ from the majority of other data points. They are extreme values that are either much larger or much smaller than the typical values in the dataset.

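Reasons to handle outliers:

1. Impact on statistical analysis: outliers distort measures such as the mean and standard deviation.
2. Impact on ML models: models that are sensitive to extreme values, such as linear regression, can be pulled toward them.
3. Misleading visualization: a few extreme points can compress the scale of a plot and hide the main pattern.
4. Bias in decision making: conclusions drawn from distorted statistics can be wrong.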
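
To make the first point concrete, the sketch below shows how one extreme value shifts the mean, and flags it with the common 1.5 × IQR rule. The data is made up, and the IQR rule and clipping are only one of several reasonable choices.

```python
import pandas as pd

# Invented order values; 500 is an obvious outlier.
values = pd.Series([20, 22, 19, 21, 23, 20, 500])

# The mean is pulled toward the outlier, while the median barely moves.
print(values.mean(), values.median())

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)
print(values[is_outlier].tolist())  # [500]

# One possible treatment: clip values to the IQR fences instead of dropping rows.
clipped = values.clip(lower=lower, upper=upper)
print(clipped.mean())
```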


## Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

There are several techniques to handle missing data:
1. Data imputation: replace missing values with the mean, median, or mode of the respective column.
2. Deletion of the missing rows/columns: this technique is suitable if the amount of missing data is small.
3. Interpolation: estimate the missing value from the neighbouring data points.
4. Manual entry: if the data can be obtained by other means, it can be entered manually.
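
For customer data, the choice between mean, median, and mode usually depends on the column type. The sketch below assumes a hypothetical `customers` table; the column names and values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table (invented values).
customers = pd.DataFrame({
    "age":     [34, np.nan, 45, 29, np.nan],
    "income":  [52000, 61000, np.nan, 48000, 1_000_000],
    "segment": ["retail", None, "retail", "business", "retail"],
})

imputed = customers.copy()
# Mean for a roughly symmetric numeric column.
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())
# Median for a skewed numeric column, since it is less sensitive to extreme values.
imputed["income"] = imputed["income"].fillna(imputed["income"].median())
# Mode (most frequent value) for a categorical column.
imputed["segment"] = imputed["segment"].fillna(imputed["segment"].mode()[0])
print(imputed)

# Deletion alternative: keep only the rows with no missing values.
complete_rows = customers.dropna()
print(len(complete_rows), "of", len(customers), "rows are complete")
```

Here deletion would discard three of the five rows, which is why it is only recommended when little data is missing.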

## Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

1. Exploratory Data Analysis (EDA): Start by conducting exploratory data analysis to visualize and understand the distribution of missing values. Create summary statistics, histograms, and heatmaps to examine patterns in missing data.

2. Missing Data Mechanism Tests:

    a. Little's MCAR Test: checks whether the data is missing completely at random (MCAR). A p-value below the chosen significance level suggests that the data is not MCAR.

    b. Chi-Square Test: for categorical data, chi-square tests can examine the relationship between missingness and a categorical variable.
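
A quick way to start is to compare missingness rates across groups and compute a chi-square statistic from the contingency table. The sketch below uses synthetic data in which `income` is deliberately made more likely to be missing for one group, so the pattern is known in advance; all names and probabilities are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: income is more often missing for the "new" group (by construction).
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "group": rng.choice(["new", "returning"], size=n),
    "age": rng.integers(18, 80, size=n).astype(float),
    "income": rng.normal(50_000, 10_000, size=n),
})
p_missing = np.where(df["group"] == "new", 0.30, 0.05)
df.loc[rng.random(n) < p_missing, "income"] = np.nan

# 1. Share of missing values per column.
print(df.isna().mean())

# 2. Missingness rate of income within each group.
#    Similar rates would be consistent with MCAR; clearly different rates suggest a pattern.
income_missing = df["income"].isna()
rates = income_missing.groupby(df["group"]).mean()
print(rates)

# 3. Chi-square statistic for missingness vs. group, computed from the 2x2 table.
observed = pd.crosstab(income_missing, df["group"]).to_numpy()
expected = (observed.sum(axis=1, keepdims=True)
            * observed.sum(axis=0, keepdims=True) / observed.sum())
chi2 = ((observed - expected) ** 2 / expected).sum()
# With 1 degree of freedom, a value above about 3.84 is significant at the 5% level.
print(chi2)
```

A significant result only shows that missingness depends on an observed variable; it cannot rule out dependence on the unobserved values themselves.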

## Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

Evaluation Metrics:

1. Precision and Recall: Precision measures the proportion of true positive predictions among all positive predictions. Recall (Sensitivity) measures the proportion of true positive predictions among all actual positives. Focusing on these metrics is crucial when dealing with imbalanced datasets.

2. F1 Score: The F1 score is the harmonic mean of precision and recall and provides a balance between the two metrics. It's especially useful when you want to find a balance between minimizing false positives and false negatives.

3. Area Under the Receiver Operating Characteristic Curve (AUC-ROC): ROC and AUC provide a comprehensive view of model performance by considering different thresholds. AUC-ROC is valuable for binary classification problems.

4. Area Under the Precision-Recall Curve (AUC-PR): AUC-PR focuses on the precision-recall trade-off, making it a suitable metric for imbalanced datasets.
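
The threshold-based metrics can be computed directly from the confusion-matrix counts, which also shows why accuracy alone is misleading. The labels and predictions below are invented: 10 of 100 patients have the condition, and a hypothetical model finds 6 of them while raising 4 false alarms.

```python
import numpy as np

# Invented labels: 1 = has the condition (rare), 0 = does not.
y_true = np.array([0] * 90 + [1] * 10)
# Hypothetical predictions: 86 true negatives, 4 false positives, 6 true positives, 4 false negatives.
y_pred = np.array([0] * 86 + [1] * 4 + [1] * 6 + [0] * 4)

tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted positives, how many are real
recall = tp / (tp + fn)      # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

# A model that always predicts "no condition" would still reach 90% accuracy here.
baseline_accuracy = float(np.mean(y_true == 0))

print(accuracy, precision, recall, f1, baseline_accuracy)
```

The model's accuracy (0.92) is barely above the trivial baseline (0.90), while precision and recall (both 0.6) show how much of the minority class is actually handled. In practice these values are usually obtained from `sklearn.metrics` (for example `precision_score`, `recall_score`, `f1_score`, `roc_auc_score`, `average_precision_score`).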


Ensemble Methods:

Use ensemble methods like Random Forests or Gradient Boosting with appropriate class weights to handle imbalanced data effectively.
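
One common way to choose class weights is to make them inversely proportional to class frequency. The sketch below computes such weights by hand for an invented label vector; the formula mirrors the heuristic scikit-learn documents for `class_weight="balanced"`.

```python
import numpy as np

# Invented labels: 950 negatives, 50 positives.
y = np.array([0] * 950 + [1] * 50)

classes, counts = np.unique(y, return_counts=True)
# "Balanced" heuristic: weight = n_samples / (n_classes * class_count).
weights = len(y) / (len(classes) * counts)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

A dictionary like this can be passed to estimators that accept a `class_weight` argument, so that errors on the rare class cost more during training.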

Stratified Sampling:

When splitting the dataset into training and testing sets, use stratified sampling to maintain the class distribution in both sets.
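
A stratified split can be sketched with pandas by sampling the same fraction from each class. The data frame below is made up; with scikit-learn, `train_test_split(..., stratify=y)` achieves the same effect.

```python
import pandas as pd

# Invented dataset: 90 negatives, 10 positives.
df = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 90 + [1] * 10,
})

# Sample 20% of the rows within each class for the test set.
test = df.groupby("label", group_keys=False).sample(frac=0.2, random_state=0)
train = df.drop(test.index)

# Both splits keep the original 10% positive rate.
print(train["label"].mean(), test["label"].mean())
```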

## Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

1. Random Under-sampling:

Randomly select a subset of instances from the majority class to match the size of the minority class. This is a straightforward and commonly used down-sampling technique.

2. Cluster-based Under-sampling:

Cluster the majority class instances and then down-sample by selecting a representative instance from each cluster. This approach preserves diversity within the majority class.

3. Synthetic Data Generation (SMOTE):

While SMOTE is often used for oversampling, we can use it in combination with under-sampling to achieve a balanced dataset. First, oversample the minority class using SMOTE, and then apply random under-sampling to the majority class.
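
Cluster-based under-sampling (technique 2 above) can be sketched as follows. This is an illustration under several assumptions: the tiny `kmeans` helper is hypothetical and written only for this example, the data is random, and the number of clusters is set to the minority class size so that the classes end up roughly balanced.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Very small k-means (Lloyd's algorithm); returns the k centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (empty clusters stay put).
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

rng = np.random.default_rng(1)
X_majority = rng.normal(size=(200, 2))          # invented "satisfied" customers
X_minority = rng.normal(loc=3.0, size=(20, 2))  # invented "dissatisfied" customers

# One cluster per minority instance, then keep the real majority point nearest each centroid.
centroids = kmeans(X_majority, k=len(X_minority))
dists = np.linalg.norm(X_majority[:, None, :] - centroids[None, :, :], axis=2)
keep = np.unique(dists.argmin(axis=0))  # may be slightly fewer than k if two centroids share a point
X_majority_down = X_majority[keep]

print(len(X_majority_down), len(X_minority))
```

Keeping real points near the centroids, rather than the centroids themselves, ensures every retained row is an actual observation.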

## Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

1. Random Over-sampling:

Randomly duplicate instances from the minority class to match the size of the majority class. This is a straightforward and commonly used up-sampling technique.

2. SMOTE (Synthetic Minority Over-sampling Technique):

Generate synthetic examples for the minority class by interpolating between existing instances and their k-nearest neighbors. SMOTE helps create more diverse and meaningful synthetic data points.

3. Cluster-based Over-sampling:

Cluster the minority class instances and then up-sample by generating synthetic instances within each cluster.