### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

Missing values in a dataset are data entries that are absent or null. Handling missing values is crucial because they can lead to biased estimates, reduce statistical power, and affect the performance of machine learning models.

Some algorithms that are not significantly affected by missing values include:
- **Decision Trees**
- **Random Forests**
- **XGBoost**

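Tree-based methods can cope with missing values in different ways: classical CART-style Decision Trees can use surrogate splits, and XGBoost learns a default direction for missing values at each split. Support depends on the implementation, though. Some library versions of Decision Trees and Random Forests still require imputation first, so check the documentation of the library and version you use.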

### Q2: List down techniques used to handle missing data. Give an example of each with python code.

Here are some common techniques used to handle missing data, along with examples in Python:

#### Removing Missing Values

In [1]:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [None, 2, 3, 4]
})

# Drop rows with any missing values
df_dropped = df.dropna()
print(df_dropped)


     A    B
1  2.0  2.0
3  4.0  4.0


In [2]:
# Drop columns with any missing values
df_dropped_columns = df.dropna(axis=1)
print(df_dropped_columns)

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]


#### Imputation with Mean/Median/Mode

In [5]:
from sklearn.impute import SimpleImputer

# Impute missing values with the mean
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

In [6]:
df_imputed

          A    B
0  1.000000  3.0
1  2.000000  2.0
2  2.333333  3.0
3  4.000000  4.0


In [7]:
# Impute missing values with the median
imputer = SimpleImputer(strategy='median')
df_imputed_median = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

In [8]:
df_imputed_median

     A    B
0  1.0  3.0
1  2.0  2.0
2  2.0  3.0
3  4.0  4.0


In [9]:
# Impute missing values with the mode
imputer = SimpleImputer(strategy='most_frequent')
df_imputed_mode = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

In [10]:
df_imputed_mode

     A    B
0  1.0  2.0
1  2.0  2.0
2  1.0  3.0
3  4.0  4.0


#### Imputation with Forward Fill/Backward Fill

In [13]:
# Forward fill missing values
# (df.ffill() replaces the deprecated fillna(method='ffill'))
df_filled = df.ffill()

In [14]:
df_filled

     A    B
0  1.0  NaN
1  2.0  2.0
2  2.0  3.0
3  4.0  4.0


In [16]:
# Backward fill missing values
# (df.bfill() replaces the deprecated fillna(method='bfill'))
df_filled_bfill = df.bfill()

In [17]:
df_filled_bfill

     A    B
0  1.0  2.0
1  2.0  2.0
2  4.0  3.0
3  4.0  4.0


### Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

An imbalanced dataset is one in which one class has many more data points than the other, for example 900 data points in the positive class and 100 data points in the negative class.

Problem: if the imbalance is not handled, the model becomes biased towards the majority class (here, the positive class) because it sees far more examples of it, and it tends to perform poorly on the minority class.
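The effect of the bias can be illustrated with the 900/100 split from the example. The sketch below uses a stand-in "model" (an assumption made for illustration) that always predicts the majority class. It still reaches high accuracy while never identifying a single minority-class point.

```python
import numpy as np

# 900 positive (majority) labels and 100 negative (minority) labels.
y_true = np.array([1] * 900 + [0] * 100)

# Stand-in "model" that ignores its input and always predicts the majority class.
y_pred = np.ones_like(y_true)

accuracy = (y_pred == y_true).mean()
minority_recall = (y_pred[y_true == 0] == 0).mean()

print(f"accuracy: {accuracy:.2f}")                # 0.90
print(f"minority recall: {minority_recall:.2f}")  # 0.00
```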

### Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

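Up-sampling techniques increase the number of minority-class data points until the classes are balanced. The simplest form, available through `resample` in scikit-learn (`sklearn.utils`), draws existing minority rows at random with replacement. Techniques such as SMOTE instead create new synthetic minority data points.

Down-sampling techniques remove data points from the majority class until it matches the size of the minority class.

With either technique, the goal is for both classes to end up with the same number of data points. For example, with 900 majority and 100 minority data points, up-sampling grows the minority class to 900, while down-sampling shrinks the majority class to 100.

Up-sampling is usually preferred because down-sampling discards a lot of data. Down-sampling is mainly reasonable when the dataset is large enough that plenty of majority data points remain afterwards.

A common technique for up-sampling is SMOTE.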
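Both approaches can be sketched with `resample` from `sklearn.utils`. The DataFrame, column names, and class sizes below are made up for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up imbalanced data: 900 rows of class 1, 100 rows of class 0.
df_imb = pd.DataFrame({
    "feature": range(1000),
    "target": [1] * 900 + [0] * 100,
})
majority = df_imb[df_imb["target"] == 1]
minority = df_imb[df_imb["target"] == 0]

# Up-sampling: draw minority rows with replacement up to the majority count.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
df_up = pd.concat([majority, minority_up])

# Down-sampling: draw majority rows without replacement down to the minority count.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
df_down = pd.concat([majority_down, minority])

print(df_up["target"].value_counts())
print(df_down["target"].value_counts())
```

After up-sampling both classes have 900 rows; after down-sampling both have 100, which shows how much majority data is discarded.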

### Q5: What is data Augmentation? Explain SMOTE.

Data augmentation is a technique used to increase the diversity and quantity of training data without actually collecting new data.

SMOTE (Synthetic Minority Over-sampling Technique) is an up-sampling technique used to address imbalanced datasets in which the minority class has significantly fewer instances than the majority class. For a minority data point, SMOTE picks one of its nearest minority-class neighbors and adds a new synthetic data point somewhere on the line segment joining the two.
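The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation: the function name `smote_sketch`, its parameters, and the toy points are all hypothetical. In practice a maintained implementation, such as `SMOTE` from the `imblearn.over_sampling` module, would normally be used.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))   # pick a minority sample
        nb = rng.choice(neighbors[base])  # pick one of its k nearest neighbors
        gap = rng.random()                # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority-class points in two dimensions.
X_minority = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 1.0], [1.2, 1.8]])
X_new = smote_sketch(X_minority, n_new=5)
print(X_new)
```

Each synthetic point lies between two existing minority points, so the new data stays inside the region the minority class already occupies.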

### Q6: What are outliers in a dataset? Why is it essential to handle outliers?

Outliers are data points that significantly deviate from the rest of the dataset, either being much higher or lower than the majority of the data.

#### Importance of Handling Outliers
1. **Improved Model Performance**: Outliers can distort model training, leading to poor generalization and inaccurate predictions.
2. **Accurate Statistical Measures**: Outliers can skew mean, variance, and other statistical measures, leading to misleading interpretations.
3. **Robust Analysis**: Handling outliers ensures that analyses and models reflect the true characteristics of the data.

Handling outliers is essential to maintain the integrity and accuracy of data analysis and machine learning models.
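A small sketch of how a single outlier skews the mean, using made-up values. The 1.5 × IQR rule used to flag the outlier is one common convention, assumed here for illustration. It is not the only way to detect outliers.

```python
import numpy as np

# Made-up measurements; 95 is far from the rest of the data.
values = np.array([10, 12, 11, 13, 12, 11, 95], dtype=float)

# The mean is pulled towards the outlier, the median is not.
print("mean:", values.mean(), "median:", np.median(values))

# Flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

cleaned = values[~is_outlier]
print("flagged:", values[is_outlier])
print("mean without flagged points:", cleaned.mean())
```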

### Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

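- **Delete rows with missing values**: the easiest option, but it can lose a lot of data.
- **Delete columns that contain NaN values**: only reasonable when the column is mostly empty or not important for the analysis.
- **Imputation techniques**:
  1. **Mean value imputation**: plot histograms or KDE plots to check how the data is distributed, then replace NaN with the mean. It works well when the data is approximately normally distributed.
  2. **Median value imputation**: used when the dataset has outliers, since the median is less sensitive to them than the mean.
  3. **Mode value imputation**: can be used for categorical variables.
  4. **Random sample imputation**: replace each NaN with a value drawn at random from the observed values of the same column.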
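Random sample imputation can be sketched with pandas as follows. The column name `age` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up customer data with two missing ages.
df_rs = pd.DataFrame({"age": [22.0, np.nan, 35.0, 41.0, np.nan, 29.0]})

missing = df_rs["age"].isna()

# Draw replacement values from the observed entries of the same column.
sampled = df_rs["age"].dropna().sample(
    n=int(missing.sum()), replace=True, random_state=0
)
# Re-index the sampled values so they line up with the missing rows.
sampled.index = df_rs.index[missing]

df_rs["age_imputed"] = df_rs["age"]
df_rs.loc[missing, "age_imputed"] = sampled
print(df_rs)
```

Because the replacements come from the observed values, the imputed column keeps roughly the same distribution as the original, unlike mean imputation, which piles values up at a single point.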

### Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

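To determine whether data is missing at random or whether there is a pattern to the missingness, you can use the following strategies:

- **Visual inspection**: plot where the missing values fall (for example, a heatmap of `df.isna()`) to spot columns or row ranges where they cluster.
- **Statistical tests**: compare rows with and without the missing value on other variables (for example, with a t-test or chi-square test). A significant difference suggests the missingness is not completely random.
- **Correlation analysis**: build a 0/1 missingness indicator for a column and correlate it with the other variables.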
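A minimal sketch of the indicator-based checks, using synthetic data. The columns `age` and `income`, and the rule that income is missing more often for older customers, are assumptions built into the example so that there is a pattern to find.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
df_m = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(50_000, 10_000, size=n),
})

# Assumed pattern: income is dropped for about half of the customers over 60.
drop = (df_m["age"] > 60) & (rng.random(n) < 0.5)
df_m.loc[drop, "income"] = np.nan

# Missingness indicator: 1 where income is missing, 0 otherwise.
df_m["income_missing"] = df_m["income"].isna().astype(int)

# Compare another variable across the two groups.
age_by_missing = df_m.groupby("income_missing")["age"].mean()
print(age_by_missing)

# Correlate the indicator with another variable.
corr = df_m["income_missing"].corr(df_m["age"])
print(corr)
```

A clear difference in mean age between the groups, or a correlation far from zero, suggests the missingness depends on `age` rather than occurring purely by chance.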


### Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

To evaluate the performance of a machine learning model on an imbalanced dataset, we can use the following strategies:

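- **Use appropriate metrics**:
  - **Precision and recall**: evaluate how well the model identifies positive cases. Precision is the share of predicted positives that are correct, and recall is the share of actual positives that are found.
  - **F1-score**: the harmonic mean of precision and recall, balancing the two.
  - **ROC-AUC**: assesses the trade-off between true positive rate and false positive rate across thresholds.
  - **Precision-recall AUC**: often more informative than ROC-AUC for imbalanced datasets.
- **Confusion matrix**: shows the counts of true and false positives and negatives, so errors on the minority class stay visible.
- **Cross-validation with stratified sampling**: keeps the class ratio the same in every fold.
- **Adjust class weights**: give minority-class errors more weight during training so the model is not rewarded for ignoring the rare class.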
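These strategies can be combined in a short scikit-learn sketch. The dataset is synthetic (about 5% positives), and the choice of logistic regression and a 0.5 threshold are assumptions made for illustration, not recommendations for a real medical dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             f1_score, precision_score, recall_score,
                             roc_auc_score)
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic imbalanced data: roughly 95% negatives, 5% positives.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

# Class weights counteract the imbalance during training.
model = LogisticRegression(class_weight="balanced", max_iter=1000)

# Stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
pred = (proba >= 0.5).astype(int)

cm = confusion_matrix(y, pred)
precision = precision_score(y, pred)
recall = recall_score(y, pred)
f1 = f1_score(y, pred)
roc_auc = roc_auc_score(y, proba)
pr_auc = average_precision_score(y, proba)  # summarizes the precision-recall curve

print(cm)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"ROC-AUC={roc_auc:.3f} PR-AUC={pr_auc:.3f}")
```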