## Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

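Missing values are entries for which no data was recorded in a particular field or observation. They need to be handled because many algorithms cannot train on incomplete rows, and ignoring or carelessly dropping them can bias the results of data analysis and machine learning models. Some tree-based implementations tolerate missing values without prior imputation, for example gradient boosting libraries such as XGBoost and LightGBM, and CART-style decision trees that use surrogate splits. Whether a particular decision tree or random forest accepts missing values depends on the library.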

## Q2: List down techniques used to handle missing data. Give an example of each with python code.


- Interpolation
- Forward/Backward Fill
- Mean/Median Imputation
- Mode Imputation
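A minimal sketch of each technique with pandas follows. The two series are hypothetical toy data invented for illustration, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a numeric series and a categorical series with gaps.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
colors = pd.Series(["red", "blue", None, "red", None])

# Interpolation: estimate each gap from its neighbours (linear by default).
interpolated = s.interpolate()

# Forward fill carries the last seen value forward; backward fill does the reverse.
forward_filled = s.ffill()
backward_filled = s.bfill()

# Mean / median imputation: replace gaps with a summary statistic of the column.
mean_imputed = s.fillna(s.mean())
median_imputed = s.fillna(s.median())

# Mode imputation: replace gaps with the most frequent value (suits categorical data).
mode_imputed = colors.fillna(colors.mode()[0])
```

Interpolation and forward/backward fill assume the rows are ordered (for example, a time series), while mean, median, and mode imputation ignore row order.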

## Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?


Imbalanced data refers to a classification dataset in which one class has far more observations than another, for example a binary problem with 95% negatives and 5% positives. If the imbalance is not handled, a model can become biased towards the majority class: it may reach high overall accuracy simply by predicting the majority class while performing poorly on the minority class. This can be addressed by oversampling the minority class, undersampling the majority class, or using synthetic sampling techniques such as SMOTE.

## Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.


Up-sampling randomly duplicates observations from the minority class (sampling with replacement) to increase their share of the dataset. It is useful when the minority class has too few observations to train the model effectively and discarding majority rows would waste data, for example a fraud dataset with only a few hundred fraudulent transactions.

Down-sampling randomly removes observations from the majority class to reduce their share of the dataset. It is useful when the majority class is large enough that dropping part of it still leaves plenty of data, and its sheer size would otherwise bias the model towards it.

In [None]:
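import pandas as pd
from sklearn.utils import resample

# Assumes df is a DataFrame with a binary 'class' column (1 = minority, 0 = majority).
df_minority = df[df['class'] == 1]
df_majority = df[df['class'] == 0]

# Up-sample: draw minority rows with replacement until they match the majority count.
df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=42)
df_upsampled = pd.concat([df_majority, df_minority_upsampled])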


In [None]:
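import pandas as pd
from sklearn.utils import resample

# Assumes df is a DataFrame with a binary 'class' column (1 = minority, 0 = majority).
df_minority = df[df['class'] == 1]
df_majority = df[df['class'] == 0]

# Down-sample: draw majority rows without replacement until they match the minority count.
df_majority_downsampled = resample(df_majority, replace=False, n_samples=len(df_minority), random_state=42)
df_downsampled = pd.concat([df_majority_downsampled, df_minority])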


## Q5: What is data Augmentation? Explain SMOTE.

Data augmentation is a technique used to increase the size of a dataset by creating new synthetic samples from the existing data. SMOTE (Synthetic Minority Over-sampling Technique) is a popular technique of this kind for imbalanced data. For a given minority class observation, SMOTE finds its k nearest neighbors within the minority class, picks one of them at random, and creates a new observation at a random point on the line segment between the two. Repeating this produces synthetic minority samples that are similar to, but not exact copies of, the originals.
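The interpolation step can be sketched in a few lines of NumPy. This is an illustrative simplification, not the reference implementation: the function name `smote_sketch`, its parameters, and the toy points are all assumptions made for this example. In practice the `SMOTE` class from the imbalanced-learn library is the usual choice.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points from minority samples X_min (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of each sample.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))    # pick a minority sample
        nb = rng.choice(neighbours[base])  # pick one of its k neighbours
        gap = rng.random()                 # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority-class points (corners of a unit square).
X_min = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
new_points = smote_sketch(X_min, n_new=6)
```

Because every synthetic point is a convex combination of two existing minority points, it always lies between them, which is why SMOTE fills in the minority region rather than extending it.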

## Q6: What are outliers in a dataset? Why is it essential to handle outliers?

Outliers are data points that lie far from the majority of the other data points in a dataset. It is essential to handle them because they can significantly affect statistical analysis and machine learning models: a few extreme values can shift the mean, inflate the variance, and skew the distribution, which leads to incorrect assumptions about the data and, in turn, to erroneous predictions or classifications. Handling outliers appropriately can improve the accuracy and reliability of the model.
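As a small illustration, the sketch below shows how one extreme value distorts the mean and how it can be flagged. The 1.5 × IQR rule used here is one common rule of thumb chosen for this example, and the measurements are hypothetical.

```python
import pandas as pd

# Hypothetical measurements with one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

# A single extreme point pulls the mean far more than the median.
mean_all, median_all = values.mean(), values.median()

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Keep only the points that were not flagged.
cleaned = values[~is_outlier]
```

Here the mean of all values is roughly twice the median, while the mean of `cleaned` sits close to the median again.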

## Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?


There are several techniques to handle missing data in your analysis, including:

- Deleting rows with missing data.
- Replacing missing values with a summary statistic of the column (mean, median, or mode) or with a fixed constant.
- Replacing missing values with the previous or next value in the sequence.
- Using interpolation techniques to estimate missing values (e.g., linear interpolation, cubic spline interpolation).
- Using machine learning algorithms to predict missing values.
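Two of the options in the list above, row deletion and model-based imputation, can be sketched as follows. The customer table and its column names are hypothetical, and `KNNImputer` from scikit-learn is just one possible model-based imputer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical customer table; column names are made up for illustration.
customers = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 38.0],
    "monthly_spend": [120.0, 150.0, 140.0, np.nan, 200.0],
})

# Option 1: drop every row that has at least one missing value.
dropped = customers.dropna()

# Option 2: model-based imputation. KNNImputer fills each gap with the
# average of that column over the k most similar rows.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(customers), columns=customers.columns)
```

Deletion is simplest but shrinks the dataset (here from five rows to three), whereas imputation keeps every row at the cost of introducing estimated values.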

## Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

Strategies for checking whether the data is missing at random or follows a pattern include:

- Creating a missing data indicator variable.
- Examining the patterns of missing data to identify any potential relationships between the missing data and other variables in the dataset.
- Using statistical tests, such as Little's MCAR test, to test the hypothesis that the data is missing completely at random.
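The first two strategies can be sketched with pandas. The data below is hypothetical and deliberately constructed so that `income` is missing only for the youngest respondents.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: income is missing only for respondents under 30.
df = pd.DataFrame({
    "age": [22, 25, 28, 38, 45, 52, 60, 67],
    "income": [np.nan, np.nan, np.nan, 48.0, 55.0, 61.0, 63.0, 58.0],
})

# Missing-data indicator: 1 where income is absent, 0 otherwise.
df["income_missing"] = df["income"].isna().astype(int)

# Share of missing values per column.
missing_share = df[["age", "income"]].isna().mean()

# Compare another variable across the two groups; a large gap suggests the
# data is not missing completely at random.
age_by_missing = df.groupby("income_missing")["age"].mean()
```

In this toy example the average age is much lower where income is missing, which points to a pattern rather than purely random gaps.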

## Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

Some strategies you can use to evaluate the performance of your machine learning model on an imbalanced dataset include:

- Using performance metrics that remain informative under class imbalance, such as precision, recall, F1 score, or area under the receiver operating characteristic curve (AUC-ROC), instead of relying on accuracy alone.
- Evaluating on a held-out test set that keeps the original class distribution. If you resample to balance the classes (oversampling, undersampling, or a combination of both), apply the resampling to the training data only.
- Training models that can account for imbalance, such as decision trees, random forests, or support vector machines with class weighting, and comparing them on the metrics above.
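The sketch below shows why accuracy alone is misleading on an imbalanced problem and how the other metrics are computed with scikit-learn. All labels, predictions, and scores are hypothetical values chosen for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical labels: 1 = has the condition (rare), 0 = does not.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A trivial model that predicts "no condition" for everyone.
y_all_negative = [0] * 10

# A model that finds one of the two positive cases and raises one false alarm.
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Predicted probabilities for the positive class, needed for AUC-ROC.
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.45]

# The trivial model looks good on accuracy but has zero recall.
accuracy_trivial = accuracy_score(y_true, y_all_negative)
recall_trivial = recall_score(y_true, y_all_negative)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)
```

The trivial model reaches 80% accuracy without detecting a single patient, which recall exposes immediately.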

## Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

To balance the dataset and down-sample the majority class, you can use techniques such as:

- Random under-sampling: randomly removing examples from the majority class until the class distribution is balanced.
- Cluster-based under-sampling: grouping examples in the majority class and then removing examples from each cluster until the class distribution is balanced.
- Tomek links: identifying pairs of examples from different classes that are each other's nearest neighbors and removing the majority-class example from each pair.
- Edited nearest neighbors: removing majority-class examples whose label disagrees with the majority of their k nearest neighbors.
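As an illustration of one of these methods, the following NumPy sketch finds Tomek links by brute force. The function name `tomek_link_majority_indices` and the one-dimensional toy data are assumptions made for this example. For real projects, the imbalanced-learn library provides ready-made under-samplers.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label=0):
    """Return indices of majority-class points that belong to a Tomek link."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Pairwise Euclidean distances; a point is never its own neighbour.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)

    to_remove = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i  # i and j are each other's nearest neighbour
        if mutual and y[i] != y[j] and y[i] == majority_label:
            to_remove.append(i)
    return np.array(to_remove, dtype=int)

# Hypothetical 1-D data: label 0 is the majority class, label 1 the minority.
X = np.array([[0.0], [1.0], [2.0], [2.2], [4.0], [6.0], [6.5]])
y = np.array([0, 0, 0, 1, 0, 1, 1])

link_idx = tomek_link_majority_indices(X, y)

# Drop the majority-class members of the Tomek links.
X_clean = np.delete(X, link_idx, axis=0)
y_clean = np.delete(y, link_idx)
```

Removing Tomek links cleans up the class boundary rather than fully balancing the classes, so it is often combined with random under-sampling.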

## Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?