In [None]:
# Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some
# algorithms that are not affected by missing values.
Missing values in a dataset refer to the absence of data for certain observations or variables. 
They are represented as null, NaN, or some other marker depending on the data format. 

It is essential to handle missing values for several reasons:
Biased Analysis: If missing values are not handled, they can introduce bias into your analysis. 
Models trained on data with missing values may produce inaccurate or skewed results.

Reduced Sample Size: Missing values reduce the effective sample size, potentially leading 
to a loss of precision in your analysis.

Model Performance: Many machine learning algorithms cannot handle missing values and will throw
errors if they encounter them during training or inference. Handling missing values allows you to 
use a broader range of models.

Data Integrity: Handling missing values ensures the integrity and reliability of your dataset, 
making it more suitable for reporting and decision-making.

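Algorithms that can tolerate missing values, depending on the implementation, include:
Decision Trees and Random Forests (in implementations that support missing values natively)
Gradient-boosted trees such as XGBoost and LightGBM, which handle missing values natively
Naive Bayes, which can skip a missing feature when computing class probabilities
Note that k-Nearest Neighbors (k-NN) relies on distance calculations, so in its standard form it
cannot use rows with missing values unless they are imputed first.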

In [6]:
import pandas as pd
data = {'A': [1, 2, None, 4],
        'B': [5, None, None, 8]}
df = pd.DataFrame(data)
df

     A    B
0  1.0  5.0
1  2.0  NaN
2  NaN  NaN
3  4.0  8.0
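Before choosing a strategy it helps to measure how much is missing. The snippet below is a minimal sketch on the same toy DataFrame; the variable names `missing_per_column` and `rows_with_missing` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, None, 8]})

# Number of missing entries in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Fraction of rows that contain at least one missing value
rows_with_missing = df.isna().any(axis=1).mean()
print(rows_with_missing)
```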


In [4]:
# Q2: List down techniques used to handle missing data. Give an example of each with python code.
# 1. Drop rows or columns that contain missing data. Only suitable when a small share of the data is missing.

# Create a DataFrame with missing values
data = {'A': [1, 2, None, 4],
        'B': [5, None, None, 8]}
df = pd.DataFrame(data)

# Remove rows with any missing values
df_cleaned = df.dropna()
print(df_cleaned)


     A    B
0  1.0  5.0
3  4.0  8.0


In [7]:
# 2. Fill missing values with the mean, median, or mode of the respective column.

data = {'A': [1, 2, None, 4],
        'B': [5, None, None, 8]}
df = pd.DataFrame(data)

# Replace missing values with the mean of column A
df['A'] = df['A'].fillna(df['A'].mean())
print(df)


          A    B
0  1.000000  5.0
1  2.000000  NaN
2  2.333333  NaN
3  4.000000  8.0


In [8]:
# 3. Fill missing values with a constant.

data = {'A': [1, 2, None, 4],
        'B': [5, None, None, 8]}
df = pd.DataFrame(data)

# Replace missing values with a constant (e.g., 0)
df.fillna(0, inplace=True)
print(df)


     A    B
0  1.0  5.0
1  2.0  0.0
2  0.0  0.0
3  4.0  8.0


In [9]:
# 4. Fill missing values using interpolation methods like linear, polynomial, or spline.

data = {'A': [1, None, 3, None, 5]}
df = pd.DataFrame(data)

# Interpolate missing values using linear interpolation
df['A'] = df['A'].interpolate(method='linear')
print(df)


     A
0  1.0
1  2.0
2  3.0
3  4.0
4  5.0


In [None]:
# Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?
Imbalanced data refers to a situation in a classification problem where the distribution of classes
is highly unequal. In other words, one class has significantly more instances than the other(s). 
For example, in a binary classification problem where you are predicting whether a transaction is 
fraudulent or not, you might have 95% of the transactions as non-fraudulent and only 5% as fraudulent.
This is an imbalanced dataset.

If imbalanced data is not handled:
1. Biased model: the model performs well on the majority class but poorly on the minority class.
2. Minority-class instances are frequently misclassified as the majority class.
3. Accuracy becomes misleading: a model that always predicts "non-fraudulent" in the example above
still scores 95% accuracy.
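To make the point about accuracy concrete, here is a small sketch using the 95% / 5% fraud split from the example. The labels and the always-majority "model" are made up for illustration.

```python
# Hypothetical labels: 95 legitimate transactions (0) and 5 fraudulent ones (1)
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class
y_pred = [0] * 100

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the minority class: detected frauds / actual frauds
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(accuracy)  # 0.95
print(recall)    # 0.0
```

The accuracy looks good even though not a single fraudulent transaction is detected.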

In [None]:
# Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-
# sampling are required.
1. Up-sampling (Over-sampling):
Up-sampling involves increasing the number of instances in the minority class to balance the 
class distribution.
Ex: Suppose you have a fraud detection dataset with 95% non-fraudulent transactions and only 
    5% fraudulent ones. To balance the dataset, you can up-sample the fraudulent transactions.
    
2. Down-sampling (Under-sampling):
Down-sampling involves reducing the number of instances in the majority class to balance the 
class distribution.
Ex: Consider a medical dataset where you are predicting a rare disease. You have a large dataset
    with 95% non-disease cases and 5% disease cases. In this case, you might choose to down-sample
    the non-disease cases to match the number of disease cases.
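Both ideas can be sketched with plain pandas. This is a minimal example on a made-up dataset; the column names `amount` and `label` and the 95/5 split are assumptions for illustration.

```python
import pandas as pd

# Hypothetical dataset: 95 majority rows (label 0) and 5 minority rows (label 1)
df = pd.DataFrame({'amount': range(100),
                   'label': [0] * 95 + [1] * 5})
majority = df[df['label'] == 0]
minority = df[df['label'] == 1]

# Up-sampling: draw minority rows with replacement until they match the majority
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
df_up = pd.concat([majority, minority_up])

# Down-sampling: draw majority rows without replacement down to the minority size
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
df_down = pd.concat([majority_down, minority])

print(df_up['label'].value_counts().to_dict())
print(df_down['label'].value_counts().to_dict())
```

In practice, resampling is normally applied to the training split only, so that the test set keeps the real class distribution.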

In [None]:
# Q5: What is data Augmentation? Explain SMOTE.
Data augmentation is a technique used in machine learning and data preprocessing to artificially 
increase the size of a dataset by creating variations of the existing data. 

SMOTE (Synthetic Minority Over-sampling Technique) is a data augmentation technique used to address
class imbalance in classification tasks. For each selected minority sample, it picks one of that sample's
k nearest minority-class neighbors and creates a new point at a random position on the line segment
between the two. Because the new samples are synthetic rather than exact copies, the minority class gains
some diversity, which makes the model less likely to overfit than when minority samples are simply
duplicated. This can result in improved model performance.
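The core idea of SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation: the function name `smote_sketch` and its parameters are hypothetical, and a library such as imbalanced-learn (`imblearn.over_sampling.SMOTE`) would normally be used instead.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=2, seed=0):
    """Generate n_new synthetic points from the minority samples X_min."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))        # pick a minority sample
        nb = neighbors[base, rng.integers(k)]  # pick one of its k nearest neighbors
        gap = rng.random()                     # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Four hypothetical minority samples at the corners of a unit square
X_min = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
X_new = smote_sketch(X_min, n_new=6)
print(X_new.shape)
```

Every synthetic point lies on a segment between two existing minority samples, so it stays inside the region the minority class already occupies.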

In [None]:
# Q6: What are outliers in a dataset? Why is it essential to handle outliers?
Outliers are data points in a dataset that significantly deviate from the majority of the data points.
They are extreme values that lie far away from the central tendency of a distribution.
It is essential to handle outliers because:
1. Outliers can distort basic statistical measures such as the mean and standard deviation, leading 
to inaccurate summaries of the data.
2. Outliers can have a significant impact on data visualizations like histograms, box plots, and 
scatter plots. They stretch the axes and compress the rest of the data, which makes the plots hard to interpret.
3. They can negatively affect the performance of machine learning models. 
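A short sketch of the first point, using made-up measurements with one extreme value. The 1.5 * IQR rule shown here is one common convention for flagging outliers, not the only one.

```python
import numpy as np

# Hypothetical measurements with one extreme value
values = np.array([10, 12, 11, 13, 12, 11, 10, 95])

mean_with = values.mean()         # pulled up by the extreme value
median_with = np.median(values)   # barely affected

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

# Mean after dropping the flagged points
mean_without = values[(values >= lower) & (values <= upper)].mean()

print(mean_with, median_with, mean_without)
print(outliers)
```

A single value of 95 moves the mean from about 11.3 to 21.75, while the median stays at 11.5.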

In [None]:
# Q7: You are working on a project that requires analyzing customer data. However, you notice that some of
# the data is missing. What are some techniques you can use to handle the missing data in your analysis?
Deletion:
Listwise Deletion (Complete-Case Analysis): Remove every row that contains at least one missing value.
This is suitable when missing values are minimal and do not significantly impact the analysis. However, 
it may lead to a loss of information.

Imputation:
Mean/Median Imputation: Replace missing values with the mean (for continuous data) or median 
(for ordinal or skewed data) of the respective variable.
Mode Imputation: For categorical data, replace missing values with the mode of the variable.
Constant Imputation: Replace missing values with a predefined constant (e.g., 0 or -1).
Interpolation: Use interpolation methods like linear or spline interpolation to estimate missing
values based on neighboring data points in time series or ordered data.
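Median and mode imputation can be sketched as follows. The customer columns `age` and `plan` are hypothetical and only serve to show one numeric and one categorical variable.

```python
import pandas as pd

# Hypothetical customer data with gaps in a numeric and a categorical column
df = pd.DataFrame({
    'age': [25, None, 40, 31, None],
    'plan': ['basic', 'premium', None, 'basic', 'basic'],
})

# Median imputation for the numeric column
df['age'] = df['age'].fillna(df['age'].median())

# Mode imputation for the categorical column
# (mode() returns a Series because ties are possible; take the first value)
df['plan'] = df['plan'].fillna(df['plan'].mode()[0])

print(df)
```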

In [None]:
# Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are
# some strategies you can use to determine if the missing data is missing at random or if there is a pattern
# to the missing data?

Visual Inspection:
Create graphical representations of the missing data patterns. Plot missing data indicators across variables
or rows to visually identify any patterns or clusters of missing values.

Summary Statistics:
Calculate summary statistics for variables with missing data and compare them to those without missing data.
If there are systematic differences in summary statistics, it may indicate that the data is not missing 
completely at random.

Correlation Analysis:
Examine the correlations between a missingness indicator for each variable and the other variables in the dataset.
If missingness in one variable is correlated with the values of another variable, the data is unlikely to be missing
completely at random (MCAR); it may be missing at random (MAR) or missing not at random (MNAR).

Data Imputation and Comparison:
Impute missing data using different imputation methods and compare the results. If the choice of imputation 
method significantly impacts the results, it may suggest that the missing data is not random.
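The summary-statistics and correlation checks can be sketched with a missingness indicator column. The data below is invented so that `income` is missing only for younger customers; the column names are assumptions.

```python
import pandas as pd

# Hypothetical data: income is missing only for the youngest customers
df = pd.DataFrame({
    'age':    [22, 24, 23, 45, 50, 38, 41, 25],
    'income': [None, None, 30, 60, 72, 55, 58, None],
})

# Indicator column: 1 where income is missing, 0 otherwise
df['income_missing'] = df['income'].isna().astype(int)

# Compare another variable across the missing / observed groups
age_by_missing = df.groupby('income_missing')['age'].mean()
print(age_by_missing)

# Correlation between the missingness indicator and age
corr = df['income_missing'].corr(df['age'])
print(corr)
```

A clear difference between the groups argues against MCAR. Observed data alone cannot separate MAR from MNAR, since MNAR depends on the unobserved values themselves.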

In [None]:
# Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the
# dataset do not have the condition of interest, while a small percentage do. What are some strategies you
# can use to evaluate the performance of your machine learning model on this imbalanced dataset?
This is an imbalanced dataset. Before evaluating a model on it, the imbalance is usually addressed during training:
1. Resample the training dataset using undersampling or oversampling techniques (the test set keeps its original distribution).
2. Use ensemble techniques to combine multiple models and generate results.

Use evaluation metrics other than accuracy:
Precision: of the instances predicted positive, how many are actually positive.
Recall/Sensitivity: of the actual positive instances, how many are predicted positive.
Specificity: of the actual negative instances, how many are predicted negative.
F1 score: harmonic mean of precision and recall.
MCC: correlation coefficient between the observed and predicted binary classifications.
AUC: area under the ROC curve, which plots the true-positive rate against the false-positive rate.
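The threshold-based metrics can all be computed from the four confusion-matrix counts. The counts below are hypothetical numbers for a rare-disease classifier, chosen only to show how the metrics diverge from accuracy.

```python
import math

# Hypothetical confusion-matrix counts for a rare-disease classifier
tp, fn = 30, 20      # patients with the condition: detected / missed
fp, tn = 10, 940     # patients without it: false alarms / correctly cleared

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # also called sensitivity
specificity = tn / (tn + fp)       # true-negative rate
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [('accuracy', accuracy), ('precision', precision),
                    ('recall', recall), ('specificity', specificity),
                    ('f1', f1), ('mcc', mcc)]:
    print(f'{name}: {value:.3f}')
```

Accuracy is 0.97 here, yet recall is only 0.60, meaning 40% of the patients with the condition are missed.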

In [None]:
# Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is
# unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to
# balance the dataset and down-sample the majority class?
Downsampling or Undersampling can be used to balance the dataset.
Methods to down-sample are:
    1. Random : Majority-class samples are removed at random, with or without replacement, until the classes are
    balanced. This can discard useful samples as well.
        
    2. NearMiss : Majority-class samples are chosen by distance rather than at random. For example, NearMiss-1 keeps
    the majority samples whose average distance to their nearest minority samples is smallest and drops the rest.
    
    3. Tomek links : A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbor.
    The majority-class member of each link is removed, which cleans up the boundary between the classes.
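The Tomek-link idea can be sketched with NumPy as below. This is a simplified single-pass illustration with a hypothetical helper `tomek_majority_indices`; in practice a library such as imbalanced-learn (`imblearn.under_sampling`, which provides `RandomUnderSampler`, `NearMiss` and `TomekLinks`) would typically be used.

```python
import numpy as np

def tomek_majority_indices(X, y, majority_label):
    """Return indices of majority samples that take part in a Tomek link."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)      # ignore distance of a point to itself
    nearest = dists.argmin(axis=1)       # nearest neighbor of every sample
    to_remove = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i         # i and j are each other's nearest neighbor
        if mutual and y[i] != y[j] and y[i] == majority_label:
            to_remove.append(i)
    return to_remove

# Hypothetical one-feature dataset: label 0 is the majority class
X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.05]])
y = np.array([0, 0, 0, 1, 1, 0])

remove = tomek_majority_indices(X, y, majority_label=0)
X_res, y_res = np.delete(X, remove, axis=0), np.delete(y, remove)
print(remove)
print(y_res)
```

Only majority samples sitting right next to a minority sample are dropped, so this method cleans the boundary but does not by itself guarantee a balanced dataset.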

In [None]:
# Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a
# project that requires you to estimate the occurrence of a rare event. What methods can you employ to
# balance the dataset and up-sample the minority class?
Upsampling techniques are as follows:
    1. Random : Minority-class samples are drawn at random with replacement, and the copies are added to the
    training data until the classes are balanced.
    
    2. SMOTE (Synthetic minority oversampling technique) : It synthesises new minority instances between existing
    minority instances. It generates the virtual training records by linear interpolation for the minority class. 
    