In [None]:
Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some
algorithms that are not affected by missing values.

Missing values are entries that are not stored (or not present) for one or more variables in
a dataset. In the Titanic dataset, for example, the columns 'Age' and 'Cabin' have some
missing values.

It is important to handle missing values as they can lead to inaccurate conclusions about the data, which
can significantly impact the accuracy of the analysis. There are several methods available to handle missing 
values, such as removal, imputation, flagging, etc.

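Some algorithms can tolerate missing values. The k-NN algorithm can ignore a column in its distance
measure when a value is missing, and Naive Bayes can skip a missing feature when making a prediction.
These algorithms can be used when the dataset contains null or missing values, although whether a
particular library implementation accepts them should be checked first.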



In [None]:
Q2: List down techniques used to handle missing data. Give an example of each with python code.
The techniques to handle missing data are:
•	Deleting Rows with missing values.
•	Impute missing values for continuous variable.
•	Impute missing values for categorical variable.
•	Other Imputation Methods.
•	Using Algorithms that support missing values.
•	Prediction of missing values.
Examples with Python code follow.

1. Delete Rows with Missing Values
One way of handling missing values is to delete the rows or columns that contain null values.
If more than half of the values in a column are null, you can drop the entire column.
In the same way, rows can be dropped if one or more of their column values are null. Before
using this method, keep in mind that we should not lose information: if the data being deleted
contributes to the output value, deleting it will affect our results, so this method should not
be used.
When to delete the rows/columns in a dataset?
•	If a certain column has many missing values, you can choose to drop the entire column.
•	When you have a huge dataset, deleting e.g. 2-3 rows/columns will not make much difference.
•	The output results do not depend on the deleted data.
Note: Deletion is one of the quickest ways to deal with missing values, but it is generally
not recommended because it discards information.
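As a rough sketch of the deletion rules above, the snippet below drops columns in which more than half of the values are missing and then drops the rows that still contain a null. The small DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'c' is mostly missing, 'a' has a single gap
df_demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10, 20, 30, 40],
    "c": [np.nan, np.nan, np.nan, 1.0],
})

# Fraction of missing values per column
missing_ratio = df_demo.isna().mean()

# Keep only the columns with at most half of their values missing
kept = df_demo.loc[:, missing_ratio <= 0.5]

# Drop the rows that still contain at least one missing value
cleaned = kept.dropna()
print(cleaned)
```

The 0.5 threshold mirrors the "more than half" rule of thumb in the text; a different cut-off may suit other datasets.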

2. Replacing With Arbitrary Value
Missing values can be replaced with an arbitrary value using fillna().
Ex. In the code further below, we replace the missing values with 0. The missing values of
any particular column can be replaced with an arbitrary value in the same way.
•	Replacing with previous value – Forward fill
We can impute the values with the previous value by using forward fill. It is mostly used in time series data.
Syntax: df.ffill() (older code uses df.fillna(method='ffill'), which is deprecated in recent pandas releases)
 
•	Replacing with next value – Backward fill
In backward fill, the missing value is imputed using the next value. It is mostly used in time series data.
Syntax: df.bfill()
3. Interpolation
Missing values can also be imputed using interpolation. The pandas interpolate method can replace
the missing values using different interpolation methods such as 'polynomial', 'linear' and 'quadratic'.
The default method is 'linear'.
Syntax: df.interpolate(method='linear')
For a time-series variable, it makes sense to interpolate a missing value from the values before
and after its timestamp. Interpolation is often a good way to fill missing values in ordered data,
but it is less meaningful when the row order carries no information.
Handling missing values with Python code:
We use the Titanic dataset, which is freely available at kaggle.com, because it has missing values.

1. Reading the data
import pandas as pd
df = pd.read_csv("train.csv", usecols=['Age','Fare','Survived'])
df
 
The dataset is read using only three columns: 'Age', 'Fare' and 'Survived'.
 
2. Checking if there are missing values
df.isnull().sum()
 
Output:
Survived      4
Age         179
Fare          2
dtype: int64
 
3. Filling missing values with 0
new_df = df.fillna(0)
new_df
 
 
 
4. Filling NaN values with forward fill value
new_df = df.ffill()  # replaces the deprecated df.fillna(method="ffill")
new_df
Forward fill simply carries the previous non-missing value forward into every NaN that follows it.
5. Setting forward fill limit to 1
new_df = df.ffill(limit=1)  # replaces the deprecated df.fillna(method="ffill", limit=1)
new_df
 
With the limit set to 1, a value is copied downward at most once. In this case the column
Survived had three consecutive NaN values, and only the first of them was filled because
the limit is 1.
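The effect of the limit can be reproduced on a tiny Series. This is only an illustrative sketch with made-up values, using the `ffill` method form of forward fill:

```python
import numpy as np
import pandas as pd

# Three consecutive gaps between two known values (made-up data)
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# With limit=1 only the first NaN after a valid value is filled
filled = s.ffill(limit=1)
print(filled.tolist())
```

Two of the three gaps stay NaN, which matches the behaviour described for the Survived column.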
6. Filling NaN values in Backward Direction
new_df = df.bfill()  # replaces the deprecated df.fillna(method="bfill")
new_df
 
7. Interpolation of missing values
new_df = df.interpolate()
new_df
 
Here we had two values, 22 and 26, with a NaN between them. Linear interpolation fills that NaN
with the midpoint of its neighbours, (22 + 26) / 2 = 24. The other NaN values were computed in the same way.
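The arithmetic can be checked on a small Series. The values 22 and 26 come from the text; the second Series is a made-up example showing how a run of two NaN values is spread evenly between its neighbours.

```python
import numpy as np
import pandas as pd

# One gap between 22 and 26: linear interpolation gives the midpoint
single_gap = pd.Series([22.0, np.nan, 26.0]).interpolate()

# Two gaps between 10 and 16: values are spaced evenly (made-up data)
double_gap = pd.Series([10.0, np.nan, np.nan, 16.0]).interpolate()

print(single_gap.tolist())
print(double_gap.tolist())
```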
8. Dropna()
new_df = df.dropna()
new_df     
 
Previously we had 891 rows; after running this code we are left with 710 rows, because the
rows containing NaN values were dropped.
9. Deleting the rows having all NaN values
new_df = df.dropna(how='all')
new_df      
Only the rows in which all the values are NaN are deleted. A row that has even one
non-missing value is not dropped.
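The technique list also mentions imputing continuous and categorical variables, which the walk-through does not show. A minimal sketch, assuming a made-up DataFrame with one numeric and one categorical column: the median fills the numeric gaps and the most frequent value (mode) fills the categorical gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one continuous and one categorical column
df_demo = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 40.0],
    "embarked": ["S", "C", None, "S"],
})

imputed = df_demo.copy()

# Continuous variable: fill with the median (the mean is a common alternative)
imputed["age"] = imputed["age"].fillna(imputed["age"].median())

# Categorical variable: fill with the most frequent category
imputed["embarked"] = imputed["embarked"].fillna(imputed["embarked"].mode()[0])

print(imputed)
```

The median is less sensitive to extreme values than the mean, which is why it is used in this sketch.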



In [None]:
Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

A classification data set with skewed class proportions is called imbalanced. Classes that make up a large
proportion of the data set are called majority classes. Those that make up a smaller proportion are minority classes.
When a dataset is imbalanced, several issues may arise. Models may exhibit bias toward the majority class,
resulting in poor predictions for the minority class. Accuracy as an evaluation metric can be misleading, 
as it may appear high while the model's performance on the minority class is poor.
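To illustrate why accuracy can mislead, consider a made-up dataset with 90 majority and 10 minority labels, and a "model" that always predicts the majority class. This is a sketch, not a real classifier.

```python
import numpy as np

# Made-up labels: 90 samples of class 0 (majority), 10 of class 1 (minority)
y_true = np.array([0] * 90 + [1] * 10)

# A trivial baseline that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Overall accuracy looks good
accuracy = (y_pred == y_true).mean()

# Recall on the minority class: share of true 1s that were predicted as 1
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(accuracy, minority_recall)
```

The baseline is right on 90% of the samples yet finds none of the minority cases.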



In [None]:
Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-
sampling are required.

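Down-sampling (under-sampling) reduces the number of rows in the majority class, usually by randomly
removing samples until it matches the size of the minority class. It suits datasets that are large enough
to spare the discarded rows, for example a fraud dataset with far more legitimate than fraudulent transactions.
(In signal processing the same term means removing samples from a signal while keeping its length in time:
a 10-second signal sampled at 1024 Hz has 10 x 1024 = 10240 samples.)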

df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

from sklearn.utils import resample
df_majority_downsampled=resample(df_majority,replace=False, #Sample without replacement
         n_samples=len(df_minority),
         random_state=42
        )
df_majority_downsampled.shape
df_downsampled=pd.concat([df_minority,df_majority_downsampled])
df_downsampled['target'].value_counts()

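Up-sampling (over-sampling) increases the number of rows in the minority class, usually by sampling
existing minority rows with replacement until it matches the size of the majority class. It suits small
datasets where discarding majority rows would lose too much information, for example a rare-disease
dataset with only a few positive cases. (In signal processing, upsampling means inserting zero-valued
samples between original samples to increase the sampling rate, sometimes called "zero-stuffing"; this
adds undesired spectral images centered on multiples of the original sampling rate.)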

df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

from sklearn.utils import resample
df_minority_upsampled=resample(df_minority,replace=True, #Sample With replacement
         n_samples=len(df_majority),
         random_state=42
        )

df_minority_upsampled.shape
df_minority_upsampled.head()
df_upsampled=pd.concat([df_majority,df_minority_upsampled])
df_upsampled['target'].value_counts()




In [None]:
Q5: What is data Augmentation? Explain SMOTE.
Data augmentation is a technique in machine learning used to reduce overfitting when training a machine
learning model, by training models on several slightly-modified copies of existing data.

SMOTE (Synthetic Minority Over-sampling Technique) is a technique used in machine learning to address 
imbalanced datasets where the minority class has significantly fewer instances than the majority class.
SMOTE involves generating synthetic instances of the minority class by interpolating between existing instances.
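The interpolation step can be sketched in plain NumPy. This is a simplified illustration of the idea, not the imbalanced-learn implementation: the points, the helper name `smote_like_sample` and the choice of k are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]])

def smote_like_sample(points, k, rng):
    """Create one synthetic point between a random point and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # position 0 is the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                        # interpolation factor in [0, 1)
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Each synthetic point lies on the line segment between two existing minority points, so new samples stay inside the region the minority class already occupies.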

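import pandas as pd
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE

# final_df is assumed to be an existing DataFrame with columns 'f1', 'f2' and 'target'
oversample=SMOTE()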
X,y=oversample.fit_resample(final_df[['f1','f2']],final_df['target'])
X.shape
y.shape
len(y[y==0])
len(y[y==1])
df1=pd.DataFrame(X,columns=['f1','f2'])
df2=pd.DataFrame(y,columns=['target'])
oversample_df=pd.concat([df1,df2],axis=1)
plt.scatter(oversample_df['f1'],oversample_df['f2'],c=oversample_df['target'])




In [None]:
Q6: What are outliers in a dataset? Why is it essential to handle outliers?
An outlier in statistics is an observation that lies an abnormal distance from other values in a random sample
from a population. There is, of course, a degree of ambiguity: it is up to the analyst or model to decide
what counts as abnormal and what to do with such data points.
Handling outliers is essential because they can have a large influence on statistics derived from the dataset.
For example, the mean intake of energy or some nutrient may be skewed upward or downward
by one or a few extreme values.
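A small sketch of that influence, using made-up numbers: one extreme value pulls the mean far from the bulk of the data while the median barely moves. The 1.5 x IQR rule used to flag the outlier is a common convention, not something prescribed by the text above.

```python
import numpy as np

# Made-up measurements with one extreme value
values = np.array([10, 12, 11, 13, 12, 11, 95])

mean_value = values.mean()
median_value = np.median(values)

# Flag points outside 1.5 * IQR from the quartiles (conventional rule of thumb)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(mean_value, median_value, outliers)
```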



In [None]:
Q7: You are working on a project that requires analyzing customer data. However, you notice that some 
of the data is missing. What are some techniques you can use to handle the missing data in your analysis?


When dealing with missing data, data scientists can use two primary approaches: imputation
or removal of the affected data. Imputation develops reasonable guesses for the missing values.
It is most useful when the percentage of missing data is low.

Common imputation techniques include:
1. Mean imputation
2. Substitution
3. Hot deck imputation
4. Cold deck imputation
5. Regression imputation
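Two of these techniques can be compared on a toy customer table. This is a sketch under assumptions: the column names `visits` and `spend` and all values are made up, and the regression is a simple straight-line fit with `np.polyfit`.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data: 'spend' is missing for the last customer
customers = pd.DataFrame({
    "visits": [1, 2, 3, 4, 5],
    "spend": [20.0, 40.0, 60.0, 80.0, np.nan],
})
known = customers.dropna(subset=["spend"])
missing = customers["spend"].isna()

# Mean imputation: every gap gets the average of the observed values
mean_fill = known["spend"].mean()

# Regression imputation: fit spend ~ visits on complete rows, then predict the gaps
slope, intercept = np.polyfit(known["visits"], known["spend"], deg=1)
regression_fill = slope * customers.loc[missing, "visits"] + intercept

imputed = customers.copy()
imputed.loc[missing, "spend"] = regression_fill
print(mean_fill, imputed["spend"].tolist())
```

Mean imputation ignores the relationship with `visits`, whereas the regression estimate follows the trend in the complete rows.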


In [None]:
Q8: You are working with a large dataset and find that a small percentage of the data is missing. 
What are some strategies you can use to determine if the missing data is missing at random or if
there is a pattern to the missing data?
Type of missing data              Imputation method
Missing Completely At Random      Mean, median, mode, or any other imputation method
Missing At Random                 Multiple imputation, regression imputation
Missing Not At Random             Pattern substitution, maximum likelihood estimation
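One simple way to look for a pattern is to flag the rows where a value is missing and compare the other columns across the two groups. The sketch below uses made-up `age` and `income` columns; a clear difference between the groups suggests the data is not missing completely at random, though it cannot prove which mechanism applies.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which income tends to be missing for older customers
df_demo = pd.DataFrame({
    "age":    [25, 30, 35, 40, 60, 65, 70, 75],
    "income": [30, 35, np.nan, 45, np.nan, np.nan, 80, np.nan],
})

# Flag each row by whether income is missing
df_demo["income_status"] = np.where(df_demo["income"].isna(), "missing", "present")

# Overall share of missing values
missing_share = (df_demo["income_status"] == "missing").mean()

# Compare another variable between the two groups
age_by_status = df_demo.groupby("income_status")["age"].mean()
print(missing_share)
print(age_by_status)
```

Here the rows with missing income are noticeably older on average, which hints that missingness depends on age.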



In [None]:
Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the
dataset do not have the condition of interest, while a small percentage do. What are some strategies you 
can use to evaluate the performance of your machine learning model on this imbalanced dataset?

Resampling can be used to up-sample the minority class or down-sample the majority class. With an
imbalanced dataset, we can oversample the minority class by sampling with replacement; this is called
oversampling. Similarly, we can randomly delete rows from the majority class to match them with the
minority class which is called undersampling. After sampling the data we can get a balanced dataset 
for both majority and minority classes. So, when both classes have a similar number of records present 
in the dataset, we can assume that the classifier will give equal importance to both classes.
An example of this technique using the sklearn library’s resample() is shown below for illustration 
purposes. Here, Is_Lead is our target variable. Let’s see the distribution of the classes in the target.


 
It has been observed that our target class has an imbalance. So, we’ll try to upsample the data so 
that the minority class matches with the majority class.

import pandas as pd
from sklearn.utils import resample
# create two different dataframes for the majority and minority classes
df_majority = df_train[(df_train['Is_Lead']==0)] 
df_minority = df_train[(df_train['Is_Lead']==1)] 
# upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,    # sample with replacement
                                 n_samples=len(df_majority), # to match majority class (131177 rows here)
                                 random_state=42)  # reproducible results
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_minority_upsampled, df_majority])
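Since the question asks about evaluation, it helps to look at metrics that focus on the minority class rather than accuracy alone. The sketch below computes precision, recall and F1 for the positive class by hand; the labels are made up and stand in for predictions on a held-out test set.

```python
# Made-up test labels: 1 = has the condition (minority), 0 = does not
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Confusion-matrix counts for the positive class
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # how many predicted positives are real
recall = tp / (tp + fn)      # how many real positives were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Accuracy is 0.8 here, while precision, recall and F1 for the condition are all 0.5, which gives a more honest picture of performance on the patients who matter most.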




In [None]:
Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is
unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to
balance the dataset and down-sample the majority class?

import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0
n_class_0,n_class_1

## CREATE MY DATAFRAME WITH IMBALANCED DATASET
class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})
df=pd.concat([class_0,class_1]).reset_index(drop=True)
df.tail()
df['target'].value_counts()
## downsampling the majority class
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]
from sklearn.utils import resample
df_majority_downsampled=resample(df_majority,replace=False, #Sample without replacement
         n_samples=len(df_minority),
         random_state=42
        )
df_majority_downsampled.shape
df_downsampled=pd.concat([df_minority,df_majority_downsampled])
df_downsampled['target'].value_counts()


In [None]:
Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while
working on a project that requires you to estimate the occurrence of a rare event. 
What methods can you employ to balance the dataset and up-sample the minority class?

import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

df = pd.concat([class_0, class_1]).reset_index(drop=True)

# Check the class distribution
print(df['target'].value_counts())
## upsampling the minority class
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]
from sklearn.utils import resample
df_minority_upsampled=resample(df_minority,replace=True, #Sample with replacement
         n_samples=len(df_majority),
         random_state=42
        )
df_minority_upsampled.shape
df_upsampled=pd.concat([df_majority,df_minority_upsampled])
df_upsampled['target'].value_counts()
