## ANSWER-1
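Missing values are entries for which no data is recorded in one or more columns of a dataset. They can arise for various reasons, such as data collection errors, data corruption, or data preprocessing issues.

Algorithms commonly named as usable with missing values, either because some implementations tolerate them or because they can be used to predict the missing entries (many libraries still require imputation first), are-
1. Decision Tree
2. K-Nearest Neighbours
3. Support Vector Machines
4. Random Forest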

## ANSWER-2
#### Techniques to handle missing data are-
1. **Deletion:** Deletion involves removing the rows or columns that contain missing values from the dataset. This technique is only recommended when the amount of missing data is small relative to the size of the dataset, and the missing data is missing completely at random (MCAR).

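```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [8, np.nan, np.nan, 8],
                   'C': [5, 10, 11, 2]})

# axis=1 drops every column that contains at least one missing value
df.dropna(axis=1, inplace=True)

df
```

|   | C |
|---|---|
| 0 | 5 |
| 1 | 10 |
| 2 | 11 |
| 3 | 2 |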


2. **Imputation:** Imputation involves replacing the missing values with estimated values based on the available data. There are several methods of imputation, including mean imputation, median imputation, and mode imputation.

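```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [8, np.nan, np.nan, 8],
                   'C': [5, 10, 11, 2]})

# Replace each missing value with the median of its own column
df.fillna(df.median(), inplace=True)
df
```

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 8.0 | 5 |
| 1 | 2.0 | 8.0 | 10 |
| 2 | 2.0 | 8.0 | 11 |
| 3 | 4.0 | 8.0 | 2 |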


## ANSWER-3
Imbalanced data refers to a dataset where the number of instances in one class is significantly higher or lower than the number of instances in another class. In other words, the distribution of the target variable is not uniform.

If imbalanced data is not handled, it can lead to several problems, including:

* **Biased model performance:** A model may be biased towards the majority class because it has more data to learn from. This can result in poor performance on the minority class, which may be of greater importance in certain applications.

* **False positives and false negatives:** A model may reach high overall accuracy simply by predicting the majority class while performing poorly on the minority class. When the minority class is the positive class, this shows up mainly as a high number of false negatives.

* **Overfitting:** A model may overfit to the majority class, resulting in poor generalization performance.
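
The gap between overall accuracy and minority-class performance can be seen in a small sketch. The labels below are hypothetical (950 majority, 50 minority), and the "model" is a stand-in that always predicts the majority class; it is an illustration, not a real classifier.

```python
# Hypothetical labels: 950 majority (0) and 50 minority (1) instances.
y_true = [0] * 950 + [1] * 50

# A stand-in "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Overall accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: detected minority cases / all minority cases.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_pos / sum(t == 1 for t in y_true)

print(f"accuracy        = {accuracy:.2f}")
print(f"minority recall = {minority_recall:.2f}")
```

Accuracy looks high even though no minority instance is ever detected, which is why accuracy alone is a weak measure on imbalanced data.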

## ANSWER-4
**Up-sampling** refers to the process of increasing the number of instances in the minority class. Random oversampling does this by randomly duplicating existing minority instances, while SMOTE (Synthetic Minority Over-sampling Technique) generates new synthetic instances instead of copies.

  **For example**, suppose we have a dataset with 1000 instances, out of which 900 belong to class A and 100 belong to class B. Since the dataset is imbalanced, we can up-sample the minority class B by randomly duplicating its instances, resulting in a new dataset with 1800 instances, out of which 900 belong to class A and 900 belong to class B.
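
The 900/100 example can be sketched with pandas as follows. The column names `feature` and `label` are made up for illustration, and random duplication via `DataFrame.sample(replace=True)` is only one possible way to up-sample.

```python
import pandas as pd

# Toy dataset mirroring the example: 900 rows of class A, 100 rows of class B.
df = pd.DataFrame({
    "feature": range(1000),
    "label": ["A"] * 900 + ["B"] * 100,
})

majority = df[df["label"] == "A"]
minority = df[df["label"] == "B"]

# Sample the minority class WITH replacement until it matches the majority size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

# Combine and shuffle so the classes are not stored in contiguous blocks.
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)

print(balanced["label"].value_counts())
```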

**Down-sampling** refers to the process of decreasing the number of instances in the majority class. Random under-sampling does this by randomly removing majority instances, while Tomek links removes majority instances that form nearest-neighbour pairs with minority instances.

  **For example**, suppose we have a dataset with 1000 instances, out of which 900 belong to class A and 100 belong to class B. Since the dataset is imbalanced, we can down-sample the majority class A by randomly removing some of its instances, resulting in a new dataset with 200 instances, out of which 100 belong to class A and 100 belong to class B.
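
A minimal NumPy sketch of the same 900/100 example, using random under-sampling. The feature column `X` is a placeholder and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Labels mirroring the example: 900 instances of class A, 100 of class B.
y = np.array(["A"] * 900 + ["B"] * 100)
X = np.arange(1000).reshape(-1, 1)  # placeholder feature column

idx_majority = np.flatnonzero(y == "A")
idx_minority = np.flatnonzero(y == "B")

# Keep a random subset of the majority class, WITHOUT replacement,
# equal in size to the minority class.
idx_kept = rng.choice(idx_majority, size=len(idx_minority), replace=False)

idx_balanced = np.concatenate([idx_kept, idx_minority])
X_balanced, y_balanced = X[idx_balanced], y[idx_balanced]

print(len(y_balanced), (y_balanced == "A").sum(), (y_balanced == "B").sum())
```

Note that 800 majority instances are discarded here, so down-sampling trades information for balance.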

## ANSWER-5
**Data augmentation** is a technique used to increase the size of a dataset by creating new, synthetic data from the original data. This is often done to address problems of overfitting, improve the generalization performance of machine learning models, or to balance an imbalanced dataset.

**SMOTE (Synthetic Minority Over-sampling Technique)** is a popular data augmentation technique. It creates synthetic instances of the minority class by interpolating between existing instances of that class. Specifically, for each selected minority instance, SMOTE finds its k nearest neighbors within the minority class (typically k=5) and creates a new instance by linearly interpolating between that instance and one of those neighbors. The interpolation factor is chosen randomly between 0 and 1, and the new instance is added to the dataset.

SMOTE is effective for handling imbalanced data because the synthetic instances are similar, but not identical, to existing minority instances, which can help a machine learning model generalize better to the minority class.
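
The interpolation step can be sketched in plain NumPy. This is a simplified illustration, not the imblearn implementation: the function name `smote_sketch` is hypothetical, base points are picked at random rather than visited one by one, and the toy minority data is randomly generated.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points from minority samples X_min (2-D array)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                  # pick a minority sample
        nb = neighbours[base, rng.integers(k)]  # pick one of its k neighbours
        gap = rng.random()                      # interpolation factor in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

X_min = np.random.default_rng(1).normal(size=(20, 2))  # toy minority class
X_new = smote_sketch(X_min, n_new=30)
print(X_new.shape)
```

Each synthetic point lies on the line segment between a minority sample and one of its neighbours, so it stays inside the region already occupied by the minority class.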


## ANSWER-6
**Outliers** are data points in a dataset that significantly deviate from the rest of the data points. They can be caused by errors in data collection, measurement errors, or they may represent actual extreme values in the population.

It is essential to handle outliers because they can have a significant impact on the performance of machine learning models. Outliers can reduce the accuracy and generalization performance of a model by pulling it towards extreme values, leading to overfitting or underfitting. They can also distort summary statistics such as the mean, variance, and standard deviation, leading to inaccurate or misleading conclusions.
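
A small sketch of how a single extreme value shifts summary statistics. The numbers are made up, and the value `100.0` stands in for a hypothetical erroneous reading; the median is printed only as a point of comparison.

```python
import numpy as np

values = np.array([10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 9.0])
with_outlier = np.append(values, 100.0)  # hypothetical erroneous reading

for name, data in [("without outlier", values), ("with outlier", with_outlier)]:
    print(f"{name}: mean={data.mean():.2f}, "
          f"std={data.std():.2f}, median={np.median(data):.2f}")
```

The mean and standard deviation move substantially, while the median does not change for this data.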

## ANSWER-7
There are several techniques that can be used to handle missing data in customer data analysis:

1. **Deletion:** One technique is to simply delete the rows or columns with missing data. This is only recommended if the missing data is a small percentage of the total dataset and if the missing data is completely at random. However, if the missing data is a large percentage of the dataset, this technique can result in a significant loss of information.
2. **Imputation:** Another technique is to impute the missing values. This can be done using statistical methods such as mean imputation, median imputation, or mode imputation. Alternatively, more advanced techniques such as k-nearest neighbor imputation or multiple imputation can be used.
3. **ML-based methods:** ML-based methods can also be used to handle missing data. For example, regression-based methods can be used to predict missing values based on other variables in the dataset.
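
As a rough illustration of the regression-based approach, the sketch below fits a straight line on the complete rows and uses it to fill the gaps. The columns `age` and `annual_spend` and their values are hypothetical, and the toy data is deliberately linear so the result is easy to check.

```python
import numpy as np

# Hypothetical customer columns: age is complete, annual_spend has gaps.
age = np.array([22.0, 35.0, 41.0, 29.0, 53.0, 47.0])
annual_spend = np.array([220.0, np.nan, 410.0, 290.0, np.nan, 470.0])

observed = ~np.isnan(annual_spend)

# Fit spend ~ slope * age + intercept on the rows where spend is observed.
slope, intercept = np.polyfit(age[observed], annual_spend[observed], deg=1)

# Replace only the missing entries with the regression prediction.
imputed = annual_spend.copy()
imputed[~observed] = slope * age[~observed] + intercept

print(imputed)
```

On real customer data the relationship would be noisier, so the imputed values carry uncertainty that a single prediction does not show.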

## ANSWER-8
1. Visualization
2. Summary statistics
3. Imputation
4. Statistical tests
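
Assuming the list above is about checking whether missing values follow a pattern, the summary-statistics step might look like the sketch below. The columns `age` and `income` and their values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "income" is missing for some rows.
df = pd.DataFrame({
    "age":    [23, 45, 31, 52, 38, 61, 27, 49],
    "income": [30.0, np.nan, 42.0, np.nan, 50.0, np.nan, 35.0, 58.0],
})

# Share of missing values per column.
missing_share = df.isna().mean()
print(missing_share)

# Compare another variable between rows with and without a missing income.
# A large gap hints that missingness may depend on age (i.e. not MCAR).
age_by_missing = df.groupby(df["income"].isna())["age"].mean()
print(age_by_missing)
```

A gap like this is only a hint; a statistical test on a larger sample would be needed before drawing a conclusion.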

## ANSWER-9

1. **Confusion Matrix:** A confusion matrix is a table that summarizes the performance of a classifier on a dataset. It displays the number of true positives, false positives, true negatives, and false negatives. A confusion matrix can help to evaluate the performance of the model, especially when dealing with imbalanced datasets.
2. **Precision, Recall, and F1-score:** Precision, recall, and F1-score are metrics that are commonly used to evaluate the performance of machine learning models on imbalanced datasets. Precision is the fraction of true positive predictions among all positive predictions, while recall is the fraction of true positive predictions among all actual positive instances. F1-score is the harmonic mean of precision and recall. These metrics are useful on imbalanced datasets because, unlike accuracy, they are driven by false positives and false negatives rather than by the many true negatives of the majority class.
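
The definitions above can be worked through by hand in a few lines. The labels and predictions below are made up for illustration, with `1` as the positive (minority) class.

```python
# Hypothetical ground truth and predictions; 1 is the positive (minority) class.
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]

# Count the four cells of the confusion matrix.
pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Accuracy for this toy example is 0.8, noticeably higher than the precision and recall of about 0.667, because the true negatives dominate the count.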

## ANSWER-10
1. **SMOTE:** This method involves generating synthetic observations for the minority class to match the size of the majority class. It can be combined with an under-sampling or cleaning step; for example, the `SMOTETomek` class from the imblearn library in Python applies SMOTE and then removes Tomek links.

2. **Random undersampling:** This method involves randomly selecting a subset of observations from the majority class to match the size of the minority class. This can be done using the `RandomUnderSampler` class from the imblearn library in Python.

## ANSWER-11
1. **Random oversampling:** This method involves randomly duplicating observations from the minority class to match the size of the majority class. This can be done using the `RandomOverSampler` class from the imblearn library in Python.
2. **SMOTE:** This method involves generating synthetic observations for the minority class to match the size of the majority class. It can be used on its own or alongside random oversampling to balance the dataset. This can be done using the `SMOTE` class from the imblearn library in Python.