Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some
algorithms that are not affected by missing values.

Ans: Missing values are entries for which no data is stored for a given variable or observation. They arise from incomplete data entry, equipment malfunctions, lost files, data corruption, and similar failures to record data.

Handling missing data is an important preprocessing step because many machine learning algorithms do not accept missing values as input.

Most ML algorithms require missing values to be imputed before fitting, but some can handle NULL or missing values directly. Commonly cited examples are XGBoost, LightGBM, Naive Bayes, KNN, and Random Forest. XGBoost and LightGBM handle missing values natively; for Naive Bayes, KNN, and Random Forest, support depends on the implementation and library version.

Q2: List down techniques used to handle missing data. Give an example of each with python code.

Ans: Five ways to handle missing values in a dataset:

1) Drop every row that contains a null value
2) Mean value imputation
3) Median value imputation (use when the data contains outliers)
4) Mode imputation (used with categorical values)
5) Random sample imputation

In [None]:
# 1) Drop every row that contains a null value and check the remaining shape
df.dropna().shape

In [2]:
# 2) Mean value imputation
df['age_mean'] = df['age'].fillna(df['age'].mean())

In [None]:
# 3) Median value imputation (use when the data contains outliers)
df['age_median'] = df['age'].fillna(df['age'].median())

In [None]:
# 4) Mode imputation (used with categorical values)
# mode() ignores missing values by default, so no explicit filtering is needed
mode_val = df['embarked'].mode()[0]
df['embarked_mode'] = df['embarked'].fillna(mode_val)
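
The fifth technique in the list, random sample imputation, has no cell above. A minimal sketch follows, assuming a toy frame with made-up `age` values in place of the notebook's `df`; the `age_random` column name and the fixed seed are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the notebook's `df`; the values are made up.
df = pd.DataFrame({"age": [22.0, np.nan, 35.0, np.nan, 58.0, 41.0]})

# 5) Random sample imputation
missing = df["age"].isna()

# Draw one observed value per missing row (with replacement, fixed seed).
sampled = df["age"].dropna().sample(
    n=int(missing.sum()), replace=True, random_state=0
)
# Re-label the draws so they line up with the rows that need filling.
sampled.index = df.index[missing]

df["age_random"] = df["age"]
df.loc[missing, "age_random"] = sampled
```

Unlike mean or median imputation, this keeps the spread of the observed values, because every filled value is itself an observed value.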

Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

Ans: An imbalanced dataset is one in which the classes are distributed very unequally, so the dataset is biased toward one class. An algorithm trained on such data tends to inherit that bias: it favors the majority class and performs poorly on the minority class, even when overall accuracy looks high.

For example, in a credit card fraud detection dataset, most transactions are legitimate and only a very small fraction are fraudulent.

Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

Ans: Upsampling: You increase the frequency of the samples, such as from minutes to seconds.
In a neural network, an upsampling layer restores the resolution of an earlier layer.

* Ex: Some FFT algorithms work best when the number of samples is a power of 2, so it may be necessary to upsample a signal before applying them.

Downsampling: You decrease the frequency of the samples, such as from days to months.
Downsampling reduces the dimensionality of the features at the cost of losing some information, and it saves computation.

* Ex: Suppose your acquisition electronics include an ADC digitizing 40M samples per second to study a heart rate of 70 beats per minute. Most of those samples carry no useful information, so it is better to down-sample the signal.
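
Changing the sampling frequency of a time series can be sketched with pandas `resample`. The series, its timestamps, and the choice of linear interpolation and bin averaging below are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical series: one reading per minute for four minutes.
idx = pd.date_range("2024-01-01 00:00", periods=4, freq="min")
signal = pd.Series([10.0, 20.0, 30.0, 40.0], index=idx)

# Up-sampling: minute -> 30-second resolution; new slots are linearly interpolated.
upsampled = signal.resample("30s").interpolate()

# Down-sampling: minute -> 2-minute resolution; each bin is averaged.
downsampled = signal.resample("2min").mean()

print(upsampled)
print(downsampled)
```

Up-sampling has to invent values for the new time slots (here by interpolation), while down-sampling has to aggregate several readings into one (here by the mean), which is where information is lost.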

Q5: What is data Augmentation? Explain SMOTE.

Ans: Data augmentation is a technique of artificially increasing the training set by creating modified copies of a dataset using existing data. It includes making minor changes to the dataset or using deep learning to generate new data points.

##### SMOTE
SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.
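
The steps described above can be sketched in plain NumPy. This is an illustrative sketch, not the reference implementation from any library; the function name `smote_sketch`, its parameters, and the sample points are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points from minority samples X_min (shape [n, d]).

    Assumes k is smaller than the number of minority samples.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(len(X_min))   # random minority instance a
        b = rng.choice(neighbors[a])   # one of its k nearest neighbors b
        lam = rng.random()             # position along the segment from a to b
        synthetic[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return synthetic

# Made-up minority-class points in a 2-D feature space.
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
new_points = smote_sketch(X_min, n_new=5)
print(new_points)
```

Because each synthetic point lies on a segment between two real minority points, it always falls inside the region already spanned by the minority class.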

#### Before applying SMOTE
![image.png](attachment:c40102ed-49e8-42ff-be5c-666ecbb87c81.png)

#### After applying SMOTE
![image.png](attachment:3c336145-7238-4673-bf86-05e0a2349cad.png)

Q6: What are outliers in a dataset? Why is it essential to handle outliers?

Ans: Outliers are observations in a dataset that deviate significantly from the rest of the data. In any data science project, it is essential to identify and handle outliers, because they can strongly distort many statistics, such as the mean and standard deviation, and degrade the performance of ML models.
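
A small made-up example shows how much a single outlier can move these statistics. The measurements below are invented for illustration.

```python
import numpy as np

values = np.array([48.0, 50.0, 51.0, 49.0, 52.0])  # made-up measurements
with_outlier = np.append(values, 500.0)             # add one extreme observation

# The mean jumps from 50.0 to 125.0 ...
print(values.mean(), with_outlier.mean())
# ... while the median barely moves, from 50.0 to 50.5.
print(np.median(values), np.median(with_outlier))
# The standard deviation also inflates sharply.
print(values.std(), with_outlier.std())
```

This is also why median imputation is preferred over mean imputation when outliers are present.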

Q7: You are working on a project that requires analyzing customer data. However, you notice that some of
the data is missing. What are some techniques you can use to handle the missing data in your analysis?

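Ans: Handling missing values generally falls into two categories.
1) Deletion: One of the most common ways of dealing with missing data is to delete the rows or columns that contain it. With axis=1, dropna removes every column that has a missing value; use axis=0 to remove rows instead.

df.dropna(axis=1, inplace=True)

2) Imputation: This method replaces each missing value with a specific value, such as the column mean. Note that fillna needs a fill value to be passed.

df.fillna(df.mean(numeric_only=True), inplace=True)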

* Regression imputation
The regression imputation method includes creating a model to predict the observed value of a variable based on another variable. Then you use the model to fill in the missing value of that variable.
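
A minimal sketch of regression imputation with a straight-line fit. The `Price` column and all values are hypothetical, and `np.polyfit` stands in for whatever regression model is actually used.

```python
import numpy as np
import pandas as pd

# Made-up data: `Distance` is the predictor, `Price` (hypothetical) has gaps.
df = pd.DataFrame({
    "Distance": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Price":    [10.0, 20.0, np.nan, 40.0, np.nan],
})

observed = df["Price"].notna()

# Fit Price = slope * Distance + intercept on the complete rows only.
slope, intercept = np.polyfit(
    df.loc[observed, "Distance"].to_numpy(),
    df.loc[observed, "Price"].to_numpy(),
    deg=1,
)

# Predict the missing prices from their Distance values.
df.loc[~observed, "Price"] = slope * df.loc[~observed, "Distance"] + intercept
print(df)
```
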

* Simple Imputation
This method fills each gap with a summary statistic of the variable in which the missing value occurs, namely a measure of central tendency such as the mean, median, or mode.

import pandas as pd
from sklearn.impute import SimpleImputer

# Replace missing values with the column median
fea_transformer = SimpleImputer(strategy="median")
values = fea_transformer.fit_transform(df[["Distance"]])
pd.DataFrame(values)

* KNN Imputation
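KNN imputation is a more refined alternative to simple imputation. It replaces each missing value with the mean of that feature over the nearest neighbors, where the neighbors are found using the other features.

from sklearn.impute import KNNImputer

# Neighbors are found from the other features, so pass several numeric columns;
# with a single column there is nothing to measure distance on and the
# imputer falls back to the column mean
num_cols = df.select_dtypes(include="number").columns
fea_transformer = KNNImputer(n_neighbors=3)
values = fea_transformer.fit_transform(df[num_cols])
pd.DataFrame(values)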

Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are
some strategies you can use to determine if the missing data is missing at random or if there is a pattern
to the missing data?

Ans:

| Type of missing data | Imputation method |
| --- | --- |
| Missing Completely At Random | Mean, median, mode, or any other imputation method |
| Missing At Random | Multiple imputation, regression imputation |
| Missing Not At Random | Pattern substitution, maximum likelihood estimation |
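
One common way to check whether missingness follows a pattern is to build a missing-value indicator and compare the other variables across the two groups. The sketch below uses made-up `age` and `income` columns. A clear difference between groups suggests the data is not missing completely at random. This check cannot confirm Missing Not At Random, because that depends on the values that were never observed.

```python
import numpy as np
import pandas as pd

# Made-up data in which `income` is missing mostly for older customers.
df = pd.DataFrame({
    "age":    [25, 30, 35, 40, 60, 65, 70, 75],
    "income": [30.0, 35.0, 40.0, 45.0, np.nan, np.nan, np.nan, 50.0],
})

# 1 where income is missing, 0 where it is observed.
df["income_missing"] = df["income"].isna().astype(int)

# Compare another variable across the two groups.
age_by_flag = df.groupby("income_missing")["age"].mean()
print(age_by_flag)

# Correlation between the missingness flag and the observed variable.
flag_age_corr = df["income_missing"].corr(df["age"])
print(flag_age_corr)
```

Here the rows with missing income have a much higher mean age than the rest, so the missingness is related to `age` rather than being completely random.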

Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the
dataset do not have the condition of interest, while a small percentage do. What are some strategies you
can use to evaluate the performance of your machine learning model on this imbalanced dataset?

Ans:
* Try Resampling Your Dataset

You can change the dataset that you use to build your predictive model to have more balanced data.

This change is called sampling your dataset, and there are two main methods that you can use to even up the classes:

* You can add copies of instances from the under-represented class, called over-sampling (or more formally, sampling with replacement), or
* You can delete instances from the over-represented class, called under-sampling.

These approaches are often very easy to implement and fast to run. They are an excellent starting point.

In fact, I would advise you to always try both approaches on all of your imbalanced datasets, just to see if it gives you a boost in your preferred accuracy measures.

Some rules of thumb:

* Consider testing under-sampling when you have a lot of data (tens or hundreds of thousands of instances or more).
* Consider testing over-sampling when you don't have a lot of data (tens of thousands of records or fewer).
* Consider testing random and non-random (e.g. stratified) sampling schemes.
* Consider testing different resampled ratios (e.g. you don't have to target a 1:1 ratio in a binary classification problem; try other ratios).

Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is
unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to
balance the dataset and down-sample the majority class?

Ans: We should use random under-sampling.

The simplest under-sampling technique removes random records from the majority class, which can cause loss of information.
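
Random under-sampling can be sketched with pandas alone. The survey frame, its column names, and the 1:1 target ratio below are assumptions made for illustration.

```python
import pandas as pd

# Made-up survey: 8 satisfied (majority) vs 2 unsatisfied (minority) customers.
df = pd.DataFrame({
    "score": range(10),
    "satisfied": [1] * 8 + [0] * 2,
})

majority = df[df["satisfied"] == 1]
minority = df[df["satisfied"] == 0]

# Randomly keep only as many majority rows as there are minority rows.
majority_down = majority.sample(n=len(minority), random_state=42)

# Combine and shuffle so the classes are not grouped together.
balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)
print(balanced["satisfied"].value_counts())
```

Six of the eight majority rows are discarded here, which is the information loss mentioned above.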

Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a
project that requires you to estimate the occurrence of a rare event. What methods can you employ to
balance the dataset and up-sample the minority class?

Ans: We should use random over-sampling (up-sampling).

The simplest implementation of over-sampling duplicates random records from the minority class, which can cause overfitting.
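
Random over-sampling can be sketched the same way. The frame and the `rare_event` column are made up, and the 1:1 target ratio is an assumption.

```python
import pandas as pd

# Made-up data: 8 ordinary rows (majority) vs 2 rare events (minority).
df = pd.DataFrame({
    "feature": range(10),
    "rare_event": [0] * 8 + [1] * 2,
})

majority = df[df["rare_event"] == 0]
minority = df[df["rare_event"] == 1]

# Duplicate minority rows at random (with replacement) up to the majority size.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)

balanced = pd.concat([majority, minority_up], ignore_index=True)
print(balanced["rare_event"].value_counts())
```

Every added row is an exact copy of one of the two original minority rows, which is why this approach can overfit. Resampling is usually applied to the training split only, so that the test set keeps the real class distribution.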