#### `Q1`: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.


* Missing values in a dataset refer to the absence of a value in a particular observation or record. Missing values can occur due to various reasons, such as data entry errors, data corruption, or data that was not collected or recorded.

* It is essential to handle missing values because they can have a significant impact on the accuracy and reliability of any data analysis or modeling. Ignoring missing values can result in biased estimates, reduced statistical power, and even incorrect conclusions. Therefore, it is crucial to handle missing values effectively to ensure the quality of the data and the validity of any analysis or modeling.

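* Few algorithms are truly unaffected by missing values, but some can handle them without a separate imputation step, depending on the implementation:

1. Decision Trees (for example, CART with surrogate splits, or C4.5)
2. Random Forest (in implementations that support missing values natively)
3. Gradient-boosted trees such as XGBoost and LightGBM, which learn a default direction for missing values at each split
4. Naive Bayes (a missing feature can simply be left out of the likelihood product)

K-Nearest Neighbors and Support Vector Machines (SVM), by contrast, rely on distances or inner products over complete feature vectors, so they generally need missing values to be imputed or removed first.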

#### `Q2`: List down techniques used to handle missing data. Give an example of each with python code.


* There are several techniques that can be used to handle missing data. Here are three common techniques along with their Python code examples:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'p': [9, np.nan, 7, 8],
    'q': [10, np.nan, np.nan, 8],
    'r': [3, 4, 6, 5]
})
df
```

|   | p | q | r |
|---|---|---|---|
| 0 | 9.0 | 10.0 | 3 |
| 1 | NaN | NaN | 4 |
| 2 | 7.0 | NaN | 6 |
| 3 | 8.0 | 8.0 | 5 |


1. Deletion:
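* Deletion is the simplest technique to handle missing data. In this method, we delete the missing data from the dataset. Deletion can be done in two ways: row-wise (listwise) deletion, which drops every row that contains a missing value, and column-wise deletion, which drops every column that contains one. The call below drops rows.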

```python
df.dropna()
```

|   | p | q | r |
|---|---|---|---|
| 0 | 9.0 | 10.0 | 3 |
| 3 | 8.0 | 8.0 | 5 |
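Beyond the default row-wise call, `dropna` takes a few parameters that control what gets deleted. The sketch below rebuilds the same small DataFrame so that it runs on its own; the result variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'p': [9, np.nan, 7, 8],
    'q': [10, np.nan, np.nan, 8],
    'r': [3, 4, 6, 5],
})

# Column-wise deletion: drop every column that contains a missing value.
cols_dropped = df.dropna(axis=1)

# Keep only the rows that have at least 2 non-missing values.
rows_thresh = df.dropna(thresh=2)

# Drop a row only when a specific column ('p') is missing.
rows_subset = df.dropna(subset=['p'])

print(cols_dropped.columns.tolist())
print(rows_thresh.index.tolist())
print(rows_subset.index.tolist())
```

Column-wise deletion leaves only `r` here, which shows how quickly deletion can discard information when missing values are spread across columns.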


2. Mean Imputation:
* In this method, we replace each missing value with the mean of the same column. The median or the mode of the column can be used in the same way. The call below uses the mean.

```python
df.fillna(df.mean())
```

|   | p | q | r |
|---|---|---|---|
| 0 | 9.0 | 10.0 | 3 |
| 1 | 8.0 | 9.0 | 4 |
| 2 | 7.0 | 9.0 | 6 |
| 3 | 8.0 | 8.0 | 5 |
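The text above mentions the median and the mode as alternatives to the mean. A minimal sketch of both follows; the `city` column is hypothetical and is added only to show mode imputation on a categorical variable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'p': [9, np.nan, 7, 8],
    'q': [10, np.nan, np.nan, 8],
    'r': [3, 4, 6, 5],
})

# Median imputation: less sensitive to extreme values than the mean.
df_median = df.fillna(df.median())

# Mode imputation, typically used for categorical columns.
# 'city' is a hypothetical column introduced for this illustration.
city = pd.Series(['Pune', None, 'Pune', 'Delhi'], name='city')
city_filled = city.fillna(city.mode().iloc[0])

print(df_median)
print(city_filled.tolist())
```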


3. Machine Learning-Based Imputation:
* In this method, we estimate the missing value using machine learning algorithms.

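```python
from sklearn.impute import KNNImputer

# KNN imputation: each missing value is replaced by the mean of that
# feature over the 2 nearest rows in which the feature is observed.
imputer = KNNImputer(n_neighbors=2)

# fit_transform returns a NumPy array, so pass the column names back in.
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
df_imputed
```

|   | p | q | r |
|---|---|---|---|
| 0 | 9.0 | 10.0 | 3.0 |
| 1 | 8.5 | 9.0 | 4.0 |
| 2 | 7.0 | 9.0 | 6.0 |
| 3 | 8.0 | 8.0 | 5.0 |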


#### `Q3`: Explain the imbalanced data. What will happen if imbalanced data is not handled?



* Imbalanced data refers to a dataset where the number of instances in one class is significantly higher or lower than the number of instances in another class. In other words, the distribution of the target variable is not uniform.

* If **imbalanced data is not handled**, it can lead to several problems, including:

* **Biased model performance**: In the case of imbalanced data, a model may be biased towards the majority class because it has more data to learn from. This can result in poor performance on the minority class, which may be of greater importance in certain applications.

* **Overfitting**: In the case of imbalanced data, a model may overfit to the majority class, for example by learning to predict it almost always. Accuracy then looks high while the model generalizes poorly to minority-class cases.
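The effect on model performance can be made concrete with a small sketch. The labels below are hypothetical (950 majority and 50 minority samples), and the "model" is a stand-in that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 950 negatives (majority) and 50 positives (minority).
y_true = np.array([0] * 950 + [1] * 50)

# A degenerate "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)       # share of correct predictions
minority_recall = recall_score(y_true, y_pred)  # share of positives found

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
```

The accuracy is 0.95 even though no minority-class sample is ever detected, which is why accuracy alone is a poor guide on imbalanced data.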

#### `Q4`: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.



* Up-sampling and down-sampling are techniques used to handle imbalanced datasets.

1. **Up-sampling:**
* Up-sampling involves increasing the number of instances in the minority class to balance the dataset. This can be done by duplicating existing instances, generating new synthetic instances, or both.
* Up-sampling is required when the minority class is under-represented, and we need to boost its representation in the dataset to train a more balanced model.

2. **Down-sampling:**
* Down-sampling involves reducing the number of instances in the majority class to balance the dataset. This can be done by randomly removing instances or selecting a subset of instances.
* Down-sampling is required when the majority class is over-represented, and we need to reduce its representation to achieve a more balanced model.

> For example, in a binary classification problem, we have 1000 samples of Class A and 100 samples of Class B. The dataset is imbalanced, and we need to balance it before training our model. In this case, we can use up-sampling to increase the number of Class B samples to, say, 500 by generating synthetic instances or duplicating existing instances. Alternatively, we can use down-sampling to reduce the number of Class A samples to 500 by randomly removing instances.
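The example above can be sketched with `sklearn.utils.resample`. The DataFrame, its `feature` column, and the variable names are assumptions for illustration. The sketch applies both steps together so that each class ends up with 500 rows, although either step could also be used on its own.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Hypothetical dataset: 1000 samples of Class A and 100 of Class B.
rng = np.random.default_rng(42)
data = pd.DataFrame({
    'feature': rng.normal(size=1100),
    'label': ['A'] * 1000 + ['B'] * 100,
})
class_a = data[data['label'] == 'A']
class_b = data[data['label'] == 'B']

# Up-sampling: draw Class B rows with replacement until there are 500.
class_b_up = resample(class_b, replace=True, n_samples=500, random_state=42)

# Down-sampling: draw 500 Class A rows without replacement.
class_a_down = resample(class_a, replace=False, n_samples=500, random_state=42)

balanced = pd.concat([class_a_down, class_b_up])
print(balanced['label'].value_counts())
```

Up-sampling with replacement repeats existing rows, so the up-sampled class contains duplicates, while down-sampling discards 500 of the original Class A rows.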

#### `Q5`: What is data Augmentation? Explain SMOTE.


* **Data augmentation** is a technique used to increase the size and diversity of a dataset by applying various transformations to existing data samples, such as rotations, translations, scaling, flipping, and adding noise. The aim of data augmentation is to generate new variations of the data that are still representative of the underlying data distribution.

* **SMOTE:** SMOTE stands for Synthetic Minority Over-sampling Technique. It handles imbalanced datasets by generating synthetic samples of the minority class instead of duplicating existing ones. SMOTE selects a random sample from the minority class, finds its k nearest minority-class neighbors, picks one of them, and creates a new synthetic sample at a random point on the line segment between the two. By generating new synthetic samples, SMOTE increases the representation of the minority class in the dataset, which can improve the performance of machine learning models on that class.
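The interpolation step of SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation: the function name `smote_sketch`, its parameters, and the toy points are all assumptions made for this example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from the minority samples X_min."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other points

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                 # random minority sample
        nb = neighbors[base, rng.integers(k)]  # one of its k nearest neighbors
        gap = rng.random()                     # position along the segment
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Toy minority-class points (hypothetical).
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [2.5, 2.5], [3.0, 1.5], [1.2, 1.8]])
X_new = smote_sketch(X_min, n_new=10, k=3)
print(X_new)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already covered by the minority class.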

#### `Q6`: What are outliers in a dataset? Why is it essential to handle outliers?


* Outliers are data points that are significantly different from the other observations in the dataset. They can be caused by measurement errors, data entry errors, or rare events in the data. Outliers can have a significant impact on the statistical properties of the dataset, such as the mean and variance, and can skew the distribution of the data.

* **It is essential to handle outliers because** they can affect the accuracy of machine learning models by causing overfitting, underfitting, or bias in the model. Outliers can also distort the results of statistical analyses, leading to incorrect conclusions.
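To illustrate how a single outlier shifts the mean, the sketch below uses a small made-up sample and flags outliers with the 1.5 × IQR rule. That rule is one common convention chosen here as an assumption; the text above does not prescribe a specific detection method.

```python
import numpy as np

# Hypothetical measurements with one extreme value.
values = np.array([10, 12, 11, 13, 12, 11, 95])

# The outlier pulls the mean far from the bulk of the data; the median barely moves.
mean_value = values.mean()
median_value = np.median(values)

# 1.5 * IQR rule: flag points far outside the middle 50% of the data.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(f"mean={mean_value:.2f}, median={median_value}, outliers={outliers.tolist()}")
```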

#### `Q7`: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?


There are several techniques that can be used to handle missing data in customer data analysis:

* **Deletion:** One technique is to simply delete the rows or columns with missing data. This is only recommended if the missing data is a small percentage of the total dataset and if the data is missing completely at random (MCAR). If the missing data is a large percentage of the dataset, this technique can result in a significant loss of information.

* **Imputation:**  Another technique is to impute the missing values. This can be done using statistical methods such as mean imputation, median imputation, or mode imputation. Alternatively, more advanced techniques such as k-nearest neighbor imputation or multiple imputation can be used.

* **Machine learning-based methods:** Machine learning-based methods can also be used to handle missing data. For example, regression-based methods can be used to predict missing values based on other variables in the dataset.

* **Domain knowledge:** Domain knowledge can also be used to handle missing data. For example, if certain values are missing for a customer, but it is known that all customers with certain characteristics have the same value, then that value can be imputed for the missing values.
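Regression-based imputation, mentioned above, can be sketched with scikit-learn's `IterativeImputer`, which models each column with missing values as a function of the other columns. The customer table and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Hypothetical customer data; column names are made up for this sketch.
customers = pd.DataFrame({
    'age': [25, 32, 47, 51, 38, 29, 60, 44],
    'income': [30.0, 42.0, np.nan, 80.0, 55.0, np.nan, 95.0, 66.0],
    'spend': [1.2, 1.9, 3.1, np.nan, 2.4, 1.5, 4.0, 2.9],
})

# Each missing value is predicted from the other columns by a regression
# model (BayesianRidge by default), repeated for up to max_iter rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(customers), columns=customers.columns)
print(imputed.round(2))
```

`IterativeImputer` is still marked experimental in scikit-learn, which is why the explicit `enable_iterative_imputer` import is needed.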

#### `Q8`: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?


* **Missing data visualization:** Plotting the data can reveal patterns of missingness. For example, missing data may cluster in specific regions of the data space or be correlated with other variables in the dataset.

* **Correlation analysis:** Correlation analysis can be used to examine the relationship between missingness and other variables in the dataset. If missingness is correlated with other variables, it may indicate a non-random pattern of missingness.

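* **Statistical tests**: Statistical tests can be used to check whether the data is missing completely at random (MCAR). The missing values themselves cannot be inspected, so one approach is to split the rows by whether a variable is missing and compare the means and variances of the other, observed variables between the two groups (for example with a t-test), or to run Little's MCAR test. If there are no significant differences, the result is consistent with missingness that is completely at random, although it does not prove it.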

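* **Imputation:** Imputation can be used to fill in missing data based on other variables in the dataset. Comparing the distribution of the imputed values with that of the observed values gives a rough check: a clear shift suggests that missingness depends on other variables, while similar distributions are consistent with, but do not prove, missingness at random.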

* **Expert knowledge:** Expert knowledge of the data collection process can provide insight into the patterns of missingness. For example, missingness may be due to data collection errors, survey non-response, or other factors.
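The correlation and group-comparison ideas above can be sketched with a missingness indicator in pandas. The data is simulated under a stated assumption, namely that `income` tends to be missing for older customers, so the pattern the checks reveal is known in advance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
age = rng.integers(20, 70, size=n)
income = rng.normal(50, 10, size=n)

# Simulated non-random pattern (an assumption of this sketch):
# income is missing for about 60% of customers older than 55.
drop = (age > 55) & (rng.random(n) < 0.6)
income[drop] = np.nan
df = pd.DataFrame({'age': age, 'income': income})

# Missingness indicator: 1 where income is missing, 0 otherwise.
df['income_missing'] = df['income'].isna().astype(int)

# Compare an observed variable between rows with and without missing income.
age_by_group = df.groupby('income_missing')['age'].mean()

# Correlation between the indicator and the observed variable.
corr = df['income_missing'].corr(df['age'])

print(age_by_group)
print(f"correlation with age: {corr:.2f}")
```

A clear difference in mean age between the two groups, or a non-trivial correlation, points to missingness that depends on `age` rather than being completely at random.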

#### `Q9`: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?


1. **Confusion Matrix:** A confusion matrix is a table that summarizes the performance of a classifier on a dataset. It displays the number of true positives, false positives, true negatives, and false negatives. A confusion matrix can help to evaluate the performance of the model, especially when dealing with imbalanced datasets.

2. **Precision, Recall, and F1-score:** Precision, recall, and F1-score are metrics that are commonly used to evaluate the performance of machine learning models on imbalanced datasets. Precision is the fraction of true positive predictions among all positive predictions, while recall is the fraction of true positive predictions among all actual positive instances. F1-score is the harmonic mean of precision and recall. These metrics are useful on imbalanced datasets because they focus on the positive (minority) class: precision penalizes false positives, recall penalizes false negatives, and neither is inflated by the large number of true negatives.

3. **ROC Curve and AUC:** The ROC (Receiver Operating Characteristic) curve is a plot of the true positive rate against the false positive rate across classification thresholds. AUC (Area Under the ROC Curve) summarizes the curve in a single number: 0.5 corresponds to random ranking and 1.0 to perfect separation. A high AUC suggests that the model ranks positive cases above negative ones well, but with a very rare positive class it can look optimistic, so it is best read together with precision and recall.

4. **Resampling techniques:** Resampling techniques such as oversampling the minority class or undersampling the majority class can be used to balance the training data. The model should still be evaluated on a test set that keeps the original class distribution, using precision, recall, F1-score, and ROC AUC. Accuracy measured on an artificially balanced set does not reflect how the model will perform on real patients.
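The metrics above can be computed with scikit-learn as in the following sketch. The synthetic dataset (about 5% positives), the logistic regression model, and the variable names are assumptions for illustration. The area under the precision-recall curve (`average_precision_score`) is included as an extra metric that is not discussed above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 95% negatives and 5% positives.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

# A stratified split keeps the class ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard labels for the matrix
y_score = model.predict_proba(X_test)[:, 1]  # scores for ranking metrics

cm = confusion_matrix(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_score)
pr_auc = average_precision_score(y_test, y_score)

print(cm)
print(classification_report(y_test, y_pred, digits=3))
print(f"ROC AUC={roc_auc:.3f}, PR AUC={pr_auc:.3f}")
```

The test set here is left at its original class ratio, so the reported metrics reflect the imbalance the model would face in practice.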

#### `Q10`: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?


1. **Random undersampling:** This method involves randomly selecting a subset of observations from the majority class to match the size of the minority class. This can be done using techniques such as RandomUnderSampler from the imblearn library in Python.

2. **Synthetic minority oversampling technique (SMOTE):** This method involves generating synthetic observations for the minority class. It can be combined with a step that removes majority-class observations, so that neither class has to change drastically. For example, `SMOTETomek` from the imblearn library in Python applies SMOTE and then removes Tomek links, which are pairs of nearest neighbors that belong to opposite classes.
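Random undersampling can also be sketched with pandas alone, which avoids an extra dependency. The survey data and column names below are hypothetical, and `groupby(...).sample` requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: 900 satisfied and 100 dissatisfied customers.
rng = np.random.default_rng(1)
survey = pd.DataFrame({
    'score': rng.integers(1, 11, size=1000),
    'satisfied': [1] * 900 + [0] * 100,
})

# Size of the smallest class.
n_min = survey['satisfied'].value_counts().min()

# Randomly keep n_min rows from every class, without replacement.
balanced = survey.groupby('satisfied').sample(n=n_min, random_state=1)
print(balanced['satisfied'].value_counts())
```

This keeps all 100 dissatisfied customers and a random 100 of the 900 satisfied ones, so most of the majority-class rows are discarded.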

#### `Q11`: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

1. **Random oversampling:** This method involves randomly duplicating observations from the minority class to match the size of the majority class. This can be done using techniques such as RandomOverSampler from the imblearn library in Python.

2. **Synthetic minority oversampling technique (SMOTE):** This method involves generating synthetic observations for the minority class, by interpolating between existing minority samples and their nearest minority-class neighbors, to match the size of the majority class. Unlike random oversampling, it does not create exact duplicates. This can be done using SMOTE from the imblearn library in Python.

3. **Adaptive synthetic sampling (ADASYN):** This method is similar to SMOTE but focuses on generating more synthetic observations for the minority class samples that are harder to learn. This can be done using techniques such as ADASYN from the imblearn library in Python.