### Question 1.

Missing data occurs when no value is stored or observed for the variable under study.

It is essential to address missing values properly because, if left unhandled, they can:
1. Bias the results, leading to inaccurate conclusions and predictions.
2. Reduce model performance.
3. Produce misleading visualisations.
4. Cause model instability, leading to unpredictable behaviour, overfitting or underfitting, and hence compromising the model's ability to generalize.

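Some algorithms can tolerate missing values, for example k-NN and tree-based methods such as Random Forest, although whether they accept missing values directly depends on the implementation.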

### Question 2.

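Techniques used to handle missing data:
1. dropna: removes the rows (or columns) that contain missing values.
2. fillna: replaces missing values with a specified value.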


In [2]:
import pandas as pd

In [3]:
df = pd.DataFrame({
    'key1':[1,2,3,None,5],
    'key2':[2,3,4,None,None],
    'key3':[3,4,5,6,7]
})

In [4]:
df

| | key1 | key2 | key3 |
|---|---|---|---|
| 0 | 1.0 | 2.0 | 3 |
| 1 | 2.0 | 3.0 | 4 |
| 2 | 3.0 | 4.0 | 5 |
| 3 | NaN | NaN | 6 |
| 4 | 5.0 | NaN | 7 |


In [5]:
df.dropna()

| | key1 | key2 | key3 |
|---|---|---|---|
| 0 | 1.0 | 2.0 | 3 |
| 1 | 2.0 | 3.0 | 4 |
| 2 | 3.0 | 4.0 | 5 |


In [6]:
df.fillna(0)

| | key1 | key2 | key3 |
|---|---|---|---|
| 0 | 1.0 | 2.0 | 3 |
| 1 | 2.0 | 3.0 | 4 |
| 2 | 3.0 | 4.0 | 5 |
| 3 | 0.0 | 0.0 | 6 |
| 4 | 5.0 | 0.0 | 7 |


### Question 3.

Imbalanced data occurs when the classes in your target variable are not represented equally. This is common in scenarios such as fraud detection, medical diagnoses, and rare event predictions, where the number of instances of one class (e.g., fraudulent transactions) is much smaller than the number of instances of the other class (e.g., non-fraudulent transactions).

Challenges with Imbalanced Data
1. Bias Towards Majority Class: Machine learning models may become biased towards the majority class and perform poorly on the minority class.
2. Evaluation Metrics: Standard accuracy is not a good metric for imbalanced datasets. For example, if 95% of the data belongs to one class, a model that always predicts that class will have 95% accuracy, but it won't be useful for identifying the minority class.
3. Underrepresentation of Minority Class: The minority class may not have enough examples to learn from, leading to poor generalization.
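
The accuracy problem in point 2 can be checked with a few lines of plain Python. This is only an illustrative sketch: the labels are made up (95 negatives, 5 positives) and the "model" is a hypothetical one that always predicts the majority class.

```python
# Hypothetical labels: 95 majority-class (0) and 5 minority-class (1) examples.
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true labels.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: fraction of true positives that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy = {accuracy:.2f}")       # accuracy = 0.95
print(f"minority recall = {recall:.2f}")  # minority recall = 0.00
```

The accuracy looks high even though not a single minority-class example is detected.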

### Question 4.

Up-sampling and down-sampling are techniques used to address the problem of imbalanced datasets, where one class is underrepresented compared to the others.

Up-sampling involves increasing the number of instances in the minority class to match the number of instances in the majority class.
E.g. Suppose we are building a model to detect fraudulent transactions in a dataset where 98% of transactions are legitimate and only 2% are fraudulent. This significant imbalance can cause the model to be biased towards predicting transactions as legitimate. In this case, up-sampling the fraudulent transactions can help balance the dataset.

Down-sampling involves reducing the number of instances in the majority class to match the number of instances in the minority class.
E.g. Consider a scenario where you have a dataset for predicting machine failure, with 95% non-failure instances and 5% failure instances. The imbalance can lead to a model that is biased towards predicting non-failures. In this case, down-sampling the non-failure instances can help balance the dataset.
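
Both ideas can be sketched with pandas sampling. The toy frame, its column names (`feature`, `label`) and the 95/5 split are assumptions made for illustration; this is one simple way to resample at random, not the only one.

```python
import pandas as pd

# Hypothetical toy dataset: 95 majority rows (label 0) and 5 minority rows (label 1).
data = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 95 + [1] * 5,
})

majority = data[data["label"] == 0]
minority = data[data["label"] == 1]

# Up-sampling: draw minority rows WITH replacement until they match the majority count.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up], ignore_index=True)

# Down-sampling: draw majority rows WITHOUT replacement down to the minority count.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
downsampled = pd.concat([majority_down, minority], ignore_index=True)

print(upsampled["label"].value_counts().to_dict())
print(downsampled["label"].value_counts().to_dict())
```

After up-sampling both classes have 95 rows; after down-sampling both have 5. Resampling would normally be applied to the training split only, so that the test set keeps the original class distribution.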

### Question 5.

Data augmentation is a technique used to increase the diversity of your training data without actually collecting new data. It's commonly used in various domains, particularly in computer vision, but also in tabular data for imbalanced datasets. The goal is to improve the robustness and performance of machine learning models by artificially generating new examples from the existing data.
###### Examples:
1. In computer vision, rotating images by small angles.
2. In text data, synonym replacement, i.e. replacing words with their synonyms.

SMOTE (Synthetic Minority Over-sampling Technique) is a popular technique for addressing imbalanced datasets in the context of tabular data. It works by creating synthetic examples of the minority class, by interpolating between existing minority class instances, rather than simply duplicating existing ones. SMOTE reduces the risk of overfitting that can occur with simple duplication of minority class examples. It helps the model to learn the characteristics of the minority class better, improving predictive performance.
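
The interpolation idea behind SMOTE can be sketched with NumPy. This is a simplified illustration under stated assumptions, not the reference algorithm or any library's implementation: the function name `smote_like`, the sample points and the choice `k=2` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class samples with two features each.
minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]])

def smote_like(samples, n_new, k=2, rng=rng):
    """Create n_new synthetic points by interpolating toward one of the k nearest neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(samples))
        # Distances from sample i to every sample; after sorting, position 0 is i itself.
        dists = np.linalg.norm(samples - samples[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        # Pick a random point on the line segment between sample i and neighbour j.
        gap = rng.random()
        synthetic.append(samples[i] + gap * (samples[j] - samples[i]))
    return np.array(synthetic)

new_points = smote_like(minority, n_new=6)
print(new_points.shape)  # (6, 2)
```

Each synthetic point lies on a segment between two real minority samples, which is why the new examples are more varied than exact duplicates.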

### Question 6.

Outliers are data points that differ significantly from the bulk of the data in a dataset. They can be much higher or lower than the rest of the values. Outliers can occur due to variability in the data, measurement errors, or data entry errors, or they might indicate something inherently unusual about the data point.

Handling outliers is crucial for several reasons:
1. Outliers can skew mean and standard deviation, leading to a misleading representation of the data distribution. They can affect the results of hypothesis tests and confidence intervals.
2. Regression Models: Outliers can disproportionately influence the fit of the model, resulting in biased parameter estimates.
3. Clustering Algorithms: Outliers can distort the formation of clusters, affecting the overall clustering results.
4. Outliers may indicate data entry or measurement errors that need to be corrected to ensure data quality.
5. Handling outliers can improve the performance and robustness of machine learning models, leading to better predictions and insights.
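
As a small illustration of point 1, the sketch below shows how one extreme value shifts the mean but not the median, and then flags it with the 1.5 × IQR rule. The IQR rule is a common convention assumed here, not something prescribed above, and the values are made up.

```python
import pandas as pd

# Hypothetical measurements with one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

# The single extreme value pulls the mean far more than the median.
print(values.mean(), values.median())

# IQR rule (assumed convention): flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
cleaned = values[~is_outlier]

print(values[is_outlier].tolist())  # [95]
```

Whether a flagged point should be removed, capped or kept depends on whether it is an error or a genuine but unusual observation.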

### Question 7.

##### Techniques to handle missing data
1. Filling missing data using fillna: Depending on the distribution of the data, the missing values can be filled with the mean, median or mode. For normally distributed data, we may use the mean. For skewed data, we may use the median. For categorical data, we may use the mode.
2. Filling missing data from neighbouring values using ffill and bfill, where ffill (forward fill) propagates the last valid value forward and bfill (backward fill) uses the next valid value. Interpolation, e.g. linear interpolation between neighbouring values, is a related option.
3. If only a small number of rows have missing data, we can drop those rows.
4. If a certain column (which may or may not be important) has a large number of missing values, we may drop the entire column.
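
The four techniques above might look as follows in pandas. The frame `people`, its columns and the 50% threshold for dropping a column are assumptions made up for this sketch.

```python
import pandas as pd

# Hypothetical frame: symmetric numeric, skewed numeric, categorical, and a mostly empty column.
people = pd.DataFrame({
    "height": [170.0, None, 165.0, 175.0],
    "income": [30.0, 32.0, None, 500.0],
    "city": ["Pune", None, "Pune", "Delhi"],
    "notes": [None, None, None, "x"],
})

filled = people.copy()
filled["height"] = people["height"].fillna(people["height"].mean())    # symmetric: mean
filled["income"] = people["income"].fillna(people["income"].median())  # skewed: median
filled["city"] = people["city"].fillna(people["city"].mode()[0])       # categorical: mode

# Forward fill / backward fill propagate neighbouring values (useful for ordered data).
forward = people["height"].ffill()
backward = people["height"].bfill()

# Drop rows that contain any missing value.
rows_dropped = people.dropna()

# Drop columns where more than half of the values are missing (assumed threshold).
cols_dropped = people.loc[:, people.isna().mean() <= 0.5]

print(filled[["height", "income", "city"]])
print(list(cols_dropped.columns))  # ['height', 'income', 'city']
```

Note that the median (32.0) is a far more typical fill value for `income` than the mean, which is dominated by the single large value.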

### Question 8.

Strategies to Determine the Nature of Missing Data
1. MCAR (Missing Completely at Random): Missingness is independent of any observed or unobserved data. Testing for MCAR often involves Little’s MCAR test, which checks if the missing data is unrelated to the values of any variables.
2. MAR (Missing at Random): Missingness depends on observed data but not on the missing data itself. To test for MAR, analyze whether the missingness is related to other observed variables.
3. MNAR (Missing Not at Random): Missingness depends on the unobserved data. This is the most complex case, often requiring domain knowledge to hypothesize and test.
4. Correlation with Missingness: Compute the correlation between the presence of missing values (binary indicator) and other variables. Significant correlations may suggest that the data is MAR.
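
The correlation check in point 4 can be sketched with a binary missingness indicator. The `survey` data below is invented so that income tends to be missing for older respondents; the column names are assumptions.

```python
import pandas as pd

# Hypothetical survey data: income is missing more often for older respondents.
survey = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52, 60, 67],
    "income": [28.0, 31.0, 35.0, 40.0, None, 48.0, None, None],
})

# Binary indicator: 1 where income is missing, 0 where it is observed.
income_missing = survey["income"].isna().astype(int)

# Correlation between the missingness indicator and an observed variable.
corr = income_missing.corr(survey["age"])
print(round(corr, 2))

# Compare the observed variable between rows with and without missing income.
print(survey.groupby(income_missing)["age"].mean())
```

A clear relationship with an observed variable such as `age` is consistent with MAR, but it cannot rule out MNAR, since the unobserved values themselves are not available to test against.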

### Question 9.

When dealing with an imbalanced dataset in a medical diagnosis project, where the condition of interest is rare, it’s crucial to use strategies that properly evaluate the performance of your machine learning model and reduce bias towards the majority class.
Here the majority class consists of the cases that do not have the condition of interest.
1. We can use SMOTE, an oversampling technique that generates synthetic samples for the minority class.
2. Under-sampling can be applied to the majority class to reduce its size, but this leads to loss of information.
3. The model should be evaluated with metrics that focus on the minority class, such as precision, recall and F1-score, because accuracy alone is misleading on imbalanced data.

### Question 10.

Methods to Down-sample the Majority Class
1. Random under-sampling is a technique used to balance an imbalanced dataset by reducing the number of instances in the majority class. This method helps prevent the model from being biased toward the majority class and ensures that it learns to recognize the minority class effectively.
2. Cluster-based under-sampling is an advanced technique used to handle imbalanced datasets by strategically selecting representative samples from the majority class. This method involves clustering the majority class data and then sampling points from each cluster to form a balanced dataset. The idea is to retain the diversity of the majority class while reducing its size.

### Question 11.

Methods to Up-sample the minority class, i.e. the rare event
1. Random over-sampling involves randomly duplicating instances from the minority class to match the number of instances in the majority class.
2. SMOTE generates synthetic samples for the minority class by interpolating between existing minority class instances. This helps create more diverse examples in the minority class.