Q1. What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

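ANS : Missing values are entries in a dataset for which no value is recorded for a variable in a given observation (in pandas they usually appear as NaN or None). They can occur for various reasons, such as data entry errors, system malfunctions, or non-response by participants.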

It is essential to handle missing values because they can lead to biased estimates, reduce the representativeness of the sample, and affect the accuracy of the analysis.

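Some algorithms that can handle missing values natively, depending on the implementation, include: decision trees that use surrogate splits (e.g. CART), gradient-boosted trees such as XGBoost and LightGBM, and Naive Bayes variants that skip missing attributes. Distance- and margin-based methods such as K-nearest neighbors and support vector machines generally require missing values to be imputed or removed first.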

Q2. List down techniques used to handle missing data. Give an example of each with python code.

In [1]:
# Deletion methods remove observations or variables that contain missing values.
# Listwise deletion drops every row that has at least one missing value;
# pairwise deletion keeps a row for any calculation whose variables are all present.
import pandas as pd
# load dataset with missing values
df = pd.read_csv('services.csv')
# use listwise deletion: dropna() drops every row containing at least one NaN
df_listwise = df.dropna()
# print the number of observations before and after listwise deletion
print("Number of observations before listwise deletion:", len(df))
print("Number of observations after listwise deletion:", len(df_listwise))

# Imputation methods fill in the missing values with plausible values.
# Common types: mean imputation, regression imputation, multiple imputation.
# use mean imputation; numeric_only=True restricts the means to numeric columns,
# so missing values in non-numeric or entirely empty columns are left unfilled
df_mean = df.fillna(df.mean(numeric_only=True))
# print the number of missing values before and after mean imputation
print("Number of missing values before mean imputation:", df.isna().sum().sum())
print("Number of missing values after mean imputation:", df_mean.isna().sum().sum())

Number of observations before listwise deletion: 23
Number of observations after listwise deletion: 0
Number of missing values before mean imputation: 221
Number of missing values after mean imputation: 221
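
Mean imputation is only one option. The sketch below uses a small made-up DataFrame (the name `toy`, its columns, and its values are assumptions for illustration, not taken from services.csv) to show median imputation for numeric columns, mode imputation for a categorical column, and pairwise deletion as done by `DataFrame.corr`, which excludes a row only for the column pairs where one of the two values is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from services.csv
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0, np.nan, 50.0],
    "income": [30.0, 45.0, np.nan, 50.0, 60.0, 65.0],
    "city": ["Pune", None, "Delhi", "Pune", "Pune", "Delhi"],
})

imputed = toy.copy()
# median imputation for the numeric columns
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["income"] = imputed["income"].fillna(imputed["income"].median())
# mode imputation for the categorical column
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

# pairwise deletion: corr() uses every row where both columns are present
pairwise_corr = toy[["age", "income"]].corr()

print("Missing values after imputation:", imputed.isna().sum().sum())
print(pairwise_corr)
```

Median imputation is usually preferred over the mean when a column contains extreme values, since the median is less affected by them.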




Q3. Explain imbalanced data. What will happen if imbalanced data is not handled?

ANS :
Imbalanced data refers to a dataset where the classes or categories are not represented equally. For example, a dataset might have 90% of the observations in one class and only 10% in another. 
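Imbalanced data can be a problem because many machine learning algorithms optimize overall accuracy and therefore implicitly favor the majority class.
This can lead to biased models that look accurate overall but perform poorly on the minority class: a model that always predicts the majority class in the example above is 90% accurate yet never identifies a single minority observation.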

If imbalanced data is not handled, the resulting models may have poor predictive performance on the minority class, which is often the class of interest. 
This can result in missed opportunities for identifying important patterns or anomalies in the data.
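
The effect can be illustrated with a minimal sketch. The labels below are made up (90 observations of class 0 and 10 of class 1), and the "model" is simply a rule that always predicts the majority class; both are assumptions for illustration.

```python
# Hypothetical labels: 90 majority (0) and 10 minority (1) observations
y_true = [0] * 90 + [1] * 10
# A trivial "model" that always predicts the majority class
y_pred = [0] * len(y_true)

# share of all predictions that are correct
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# share of minority observations that were correctly identified
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy = {accuracy:.2f}")                # accuracy = 0.90
print(f"minority recall = {minority_recall:.2f}")  # minority recall = 0.00
```

High accuracy here says nothing about the class of interest, which is why the minority class needs separate attention.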

Q4. What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

ANS : Up-sampling and down-sampling are techniques used to handle imbalanced data in a dataset.
Down-sampling involves reducing the number of observations in the majority class to balance it with the minority class. 
For example, if we have a dataset with 1000 observations and 900 of them belong to class A and 100 belong to class B, we can down-sample class A to 100 observations by randomly selecting 100 observations from class A.

Up-sampling involves increasing the number of observations in the minority class to balance it with the majority class.
For example, if we have a dataset with 1000 observations and 100 of them belong to class A and 900 belong to class B, we can up-sample class A to 900 observations by sampling from class A with replacement, so that each original observation appears about nine times on average.

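Whether to use up-sampling or down-sampling depends on the nature of the dataset and the goals of the analysis.
If the minority class has few observations and discarding majority data would lose useful information (for example, a handful of fraudulent transactions among many legitimate ones), we might up-sample it. On the other hand, if the majority class is so large that it overwhelms the model or makes training slow, we might down-sample it. In either case, only the training data should be resampled, so that the test set keeps the original class distribution.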
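
Both techniques can be sketched with `DataFrame.sample` from pandas. The DataFrame `data` and its columns are made up to mirror the 900/100 split used in the examples above.

```python
import pandas as pd

# Hypothetical data: 900 rows of class "A" and 100 rows of class "B"
data = pd.DataFrame({
    "feature": range(1000),
    "label": ["A"] * 900 + ["B"] * 100,
})
majority = data[data["label"] == "A"]
minority = data[data["label"] == "B"]

# Down-sampling: draw 100 majority rows without replacement
down = pd.concat([majority.sample(n=len(minority), random_state=0), minority])

# Up-sampling: draw 900 minority rows with replacement
up = pd.concat([majority, minority.sample(n=len(majority), replace=True, random_state=0)])

print(down["label"].value_counts().to_dict())
print(up["label"].value_counts().to_dict())
```

The `random_state` argument only makes the draw reproducible; any fixed integer would do.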

Q5. What is data Augmentation? Explain SMOTE.

ANS : 
Data augmentation is a technique used to increase the amount of training data by generating new examples from the existing data. 
This technique can be useful in scenarios where the available data is limited or imbalanced.

SMOTE (Synthetic Minority Over-sampling Technique) is a popular data augmentation technique used to address class imbalance by generating synthetic examples of the minority class. 
For example, suppose we have a dataset with 1000 observations and only 100 of them belong to the minority class. 
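We can use SMOTE to generate new synthetic examples for the minority class by selecting one of the minority class observations and finding its k-nearest neighbors within the minority class. SMOTE then picks one of those neighbors at random and creates a new point at a random position on the line segment between the two observations, repeating this until enough synthetic examples have been generated.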
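
The core idea can be sketched in a few lines of NumPy. This is a simplified illustration rather than a full SMOTE implementation: the function name `smote_sketch`, its parameters, and the random minority data are assumptions, and in practice a library implementation such as `SMOTE` from imbalanced-learn would normally be used.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority samples X_min (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)           # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))       # pick a minority sample
        nb = rng.choice(neighbors[base])      # pick one of its k nearest neighbors
        gap = rng.random()                    # random position along the segment
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority class: 20 points with 2 features
X_minority = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_sketch(X_minority, n_new=30)
print(X_new.shape)  # (30, 2)
```

Because each synthetic point lies between two real minority points, the new examples stay inside the region already occupied by the minority class instead of being exact duplicates.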

Q6. What are outliers in a dataset? Why is it essential to handle outliers?

ANS : Outliers are data points in a dataset that are significantly different from the other observations. 
These data points can occur due to measurement errors, data entry errors, or genuine anomalies in the data.

It is essential to handle outliers in a dataset because they can have a significant impact on the analysis and the resulting model. 
Outliers can skew the distribution of the data and affect the estimates of statistical measures such as mean and variance.
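
A minimal sketch of this effect, using the common 1.5 × IQR rule to flag outliers. The measurements are made up, and the IQR rule is one convention among several, chosen here as an assumption.

```python
import statistics

# Hypothetical measurements with one extreme value
values = [10, 12, 11, 13, 12, 11, 10, 95]

# quartiles and interquartile range
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in values if v < lower or v > upper]
cleaned = [v for v in values if lower <= v <= upper]

print("outliers:", outliers)
print("mean with / without outliers:", statistics.mean(values), statistics.mean(cleaned))
print("median with / without outliers:", statistics.median(values), statistics.median(cleaned))
```

The single extreme value roughly doubles the mean, while the median barely moves, which is why robust statistics are often reported alongside the mean.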

Q7. You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

ANS : There are several techniques that can be used to handle missing data in an analysis. Some of them are:

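1. Deletion: One technique is to delete the rows (or, if a variable is mostly empty, the columns) that contain missing values. This is simple but discards information and can bias the results if the data is not missing completely at random.
2. Mean/Mode/Median Imputation: Another technique is to replace the missing values with the mean or median (for numeric variables) or the mode (for categorical variables) of the available data.
3. Regression Imputation: Regression imputation is a technique that involves using regression models to predict the missing values from the other variables in the dataset.
4. Multiple Imputation: Multiple imputation is a technique that involves creating several imputed datasets, analyzing each one, and combining the results so that the estimates reflect the uncertainty of the imputation.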
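
As an illustration of regression imputation, the sketch below fits a straight line on the complete rows and uses it to fill the gaps. The DataFrame `customers` and its columns `visits` and `spend` are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data; "spend" is missing for two customers
customers = pd.DataFrame({
    "visits": [1, 2, 3, 4, 5, 6],
    "spend": [10.0, 20.0, np.nan, 40.0, np.nan, 60.0],
})

# fit spend = slope * visits + intercept on the rows where spend is known
known = customers.dropna(subset=["spend"])
slope, intercept = np.polyfit(known["visits"], known["spend"], deg=1)

# predict spend for the rows where it is missing
missing = customers["spend"].isna()
customers.loc[missing, "spend"] = slope * customers.loc[missing, "visits"] + intercept

print(customers)
```

Values imputed this way fall exactly on the fitted line, so they understate the natural variability of the data; multiple imputation is one way to account for that.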

Q8. You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

ANS : There are several strategies that can be used to determine if the missing data is missing at random or if there is a pattern to the missing data. Some of them are:

1. Visual inspection: One approach is to plot where the missing values occur (for example, a heatmap of the missing-value mask) and look for patterns, such as variables that are missing together or only within certain ranges.
2. Correlation analysis: Another approach is to create an indicator for whether each value is missing and check whether that indicator is correlated with the other variables.
3. Missing data tests: There are statistical tests, such as Little's MCAR test, that can be used to check whether the data is missing completely at random.
4. Imputation: Another strategy is to use imputation techniques to fill in the missing data and check whether the imputed values are consistent with the distribution of the observed data.
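
The correlation approach can be sketched with a missingness indicator in pandas. The DataFrame `survey` is made up so that income is missing only for the oldest respondents; the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income is missing only for the oldest respondents
survey = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52, 60, 67],
    "income": [28.0, 32.0, 40.0, 47.0, 55.0, np.nan, np.nan, np.nan],
})

# 1 where income is missing, 0 where it is observed
survey["income_missing"] = survey["income"].isna().astype(int)

# compare an observed variable between rows with and without missing income
age_by_missing = survey.groupby("income_missing")["age"].mean()
# correlation between the missingness indicator and an observed variable
missing_age_corr = survey["income_missing"].corr(survey["age"])

print(age_by_missing)
print("correlation with age:", round(missing_age_corr, 2))
```

A clear difference between the two groups, or a strong correlation, suggests that the missingness depends on another variable and is therefore not completely at random.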

Q9. Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

ANS : 
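When dealing with an imbalanced dataset, evaluating the performance of a machine learning model can be challenging, because overall accuracy is dominated by the majority class. Some strategies are:
1. Confusion matrix: A confusion matrix is a table that summarizes the performance of a classification model by counting true positives, false positives, true negatives, and false negatives. Precision, recall (sensitivity), and the F1 score are derived from it and focus on the class of interest.
2. ROC curve and AUC: The Receiver Operating Characteristic (ROC) curve is a plot that shows the trade-off between sensitivity and specificity for different threshold values. The Area Under the Curve (AUC) is a metric that summarizes the ROC curve in a single number.
3. Stratified k-fold cross-validation: In this technique, the dataset is split into k folds that each preserve the original class proportions, and the model is trained on k-1 folds and tested on the remaining fold, repeating until every fold has been used once for testing.
4. Resampling techniques: Resampling techniques involve modifying the training data to create a balanced dataset. The model should still be evaluated on data with the original class distribution.
5. Cost-sensitive learning: Cost-sensitive learning involves assigning different misclassification costs to different classes, so that missing a patient who has the condition is penalized more heavily than a false alarm.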
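
The confusion matrix and the metrics derived from it, such as precision, recall, and F1, can be computed by hand. The labels and predictions below are made up for illustration: 95 patients without the condition, 5 with it, and a hypothetical model that raises 3 false alarms and detects 2 of the 5 real cases.

```python
# Hypothetical test set: 95 patients without the condition (0) and 5 with it (1)
y_true = [0] * 95 + [1] * 5
# Hypothetical predictions: 3 false alarms, and 2 of the 5 real cases detected
y_pred = [0] * 92 + [1] * 3 + [1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # condition present, predicted present
fp = sum(t == 0 and p == 1 for t, p in pairs)  # condition absent, predicted present
fn = sum(t == 1 and p == 0 for t, p in pairs)  # condition present, predicted absent
tn = sum(t == 0 and p == 0 for t, p in pairs)  # condition absent, predicted absent

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Accuracy is 0.94 in this example even though the model finds fewer than half of the patients who have the condition, which is the gap that precision and recall make visible.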

Q10. When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

ANS : When dealing with an imbalanced dataset with the majority class heavily outnumbering the minority class, one technique that can be used to balance the dataset is down-sampling the majority class.

Here are some methods to down-sample the majority class: Random under-sampling, Cluster-based under-sampling, Tomek links, NearMiss, Edited Nearest Neighbors.

Q11. You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

ANS : 
When dealing with an imbalanced dataset with a minority class heavily outnumbered by the majority class, one technique that can be used to balance the dataset is up-sampling the minority class. 

Here are some methods to up-sample the minority class:

Random over-sampling
Synthetic Minority Over-sampling Technique (SMOTE)
Adaptive Synthetic Sampling (ADASYN)
Borderline-SMOTE