In [None]:
Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some
algorithms that are not affected by missing values.


In [None]:
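Missing values are entries in a dataset for which no value is recorded in one or more fields (for example NaN, None, or an empty cell). Handling them is essential because many algorithms cannot accept missing inputs at all, and ignoring or carelessly dropping them can shrink the sample and bias the results. Algorithms that can tolerate missing values include tree-based methods such as Decision Trees and Random Forest (depending on the implementation) and Naive Bayes, which can skip a missing feature when computing class probabilities.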
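
Before choosing how to handle missing values, it helps to measure how many there are. The following is a minimal sketch using pandas; the DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; pandas treats both np.nan and None as missing
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", "Delhi", None, "Mumbai"],
    "income": [50000, 62000, np.nan, np.nan],
})

missing_per_column = df.isna().sum()    # number of missing cells in each column
missing_share = df.isna().mean()        # fraction of missing cells in each column
rows_with_missing = int(df.isna().any(axis=1).sum())  # rows with at least one gap

print(missing_per_column)
print(f"rows with at least one missing value: {rows_with_missing}")
```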

In [None]:
Q2: List down techniques used to handle missing data. Give an example of each with python code.


In [None]:
a) Deleting: Dropping the rows or columns that contain missing values.
df_rows = df.dropna() # returns a copy without the rows that have any missing value
df_cols = df.dropna(axis=1) # returns a copy without the columns that have any missing value

b) Imputation: Filling in the missing values with a substitute value such as the column mean.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean') # 'median' and 'most_frequent' are also available
df[['column_name']] = imputer.fit_transform(df[['column_name']])

c) Prediction: Using machine learning algorithms to predict the missing values based on other variables in the dataset.
from sklearn.experimental import enable_iterative_imputer # must be imported before IterativeImputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(random_state=0)
num_cols = ['column_1', 'column_2', 'column_3'] # placeholder names; pass several numeric columns so each can be predicted from the others
df[num_cols] = imputer.fit_transform(df[num_cols])


In [None]:
Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?


In [None]:
Imbalanced data refers to a situation in which the distribution of classes in a dataset is uneven, with one class having significantly fewer observations than the others. If the imbalance is not handled, the model tends to be biased toward the majority class: it can reach a high overall accuracy by predicting the majority class almost every time, while the minority class, which is often the class of interest, is ignored or under-predicted.
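
The effect can be illustrated with a small sketch in plain Python. The labels below are invented (95 negatives, 5 positives), and the "model" is simply a rule that always predicts the majority class.

```python
# Hypothetical labels: 95 majority-class samples (0) and 5 minority-class samples (1)
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the majority class
y_pred = [0] * 100

# Accuracy: share of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the minority class: share of true positives that were found
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
recall = true_pos / actual_pos

print(f"accuracy={accuracy:.2f}, minority recall={recall:.2f}")
```

Accuracy looks high even though not a single minority sample is detected, which is why accuracy alone is a poor guide on imbalanced data.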

In [None]:

Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and
down-sampling are required.


In [None]:
Up-sampling and down-sampling are techniques used to balance imbalanced datasets.

Up-sampling involves adding more instances of the minority class to the dataset to increase the representation of that class.

Down-sampling involves reducing the number of instances in the majority class to balance the dataset.

An example of when up-sampling and down-sampling are required is credit card fraud detection, where fraud is rare compared to non-fraudulent transactions. In such a scenario, up-sampling the minority (fraud) class can help the model learn the patterns that distinguish fraudulent transactions from non-fraudulent ones. Down-sampling the majority class is an option when the dataset is so large that training is computationally expensive and discarding part of the majority class loses little information.

In [None]:

Q5: What is data Augmentation? Explain SMOTE.


In [None]:
Data augmentation is a technique used to increase the size of a dataset by creating new synthetic examples from the existing data. SMOTE (Synthetic Minority Over-sampling Technique) applies this idea to the minority class: for a minority sample, it picks one of that sample's k nearest minority-class neighbors and creates a new point at a random position on the line segment between the two.
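
The interpolation idea behind SMOTE can be sketched by hand with NumPy. This is a simplified illustration, not the imbalanced-learn implementation; the points and the helper name `smote_like_sample` are hypothetical, and the sketch assumes the points are distinct.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in 2-D
minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [3.0, 3.0]])

def smote_like_sample(points, k, rng):
    """Return one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))                       # pick a minority sample at random
    dists = np.linalg.norm(points - points[i], axis=1)  # distances to all minority samples
    neighbours = np.argsort(dists)[1:k + 1]             # skip position 0, the sample itself
    j = rng.choice(neighbours)                          # pick one of the k neighbours
    lam = rng.random()                                  # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Each synthetic point lies on a segment between two existing minority points, so the new samples stay inside the region already covered by the minority class.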


In [None]:
Q6: What are outliers in a dataset? Why is it essential to handle outliers?


In [None]:
Outliers in a dataset are extreme values that deviate significantly from other values in the dataset. Handling outliers is important because they can skew the results of statistical analyses and machine learning models. Outliers can be detected using statistical methods like the Z-score or visual methods like box plots or scatter plots.
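
As a rough sketch, the two checks can be written with NumPy as follows. The values are invented, the 1.5 x IQR rule is the one box plots use for their whiskers, and the Z-score cutoff of 2 is chosen only because the sample is tiny (3 is a common choice on larger data).

```python
import numpy as np

# Hypothetical measurements with one obviously extreme value
values = np.array([10, 12, 11, 13, 12, 11, 10, 95], dtype=float)

# IQR rule: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score rule: flag points far from the mean in units of standard deviation
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 2]

print("IQR outliers:", iqr_outliers)
print("Z-score outliers:", z_outliers)
```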

In [None]:
Q7: You are working on a project that requires analyzing customer data. However, you notice that some of
the data is missing. What are some techniques you can use to handle the missing data in your analysis?


In [None]:
Some techniques for handling missing data in customer data analysis include imputing missing values with mean or median values, using predictive modeling to fill in missing data, or removing the rows with missing data.
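
A minimal pandas sketch of simple imputation on customer data might look like this. The table and its column names (`age`, `plan`) are hypothetical; the median is used for the numeric column and the most frequent value for the categorical one.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table with gaps in a numeric and a categorical column
customers = pd.DataFrame({
    "age": [34, np.nan, 29, 51, np.nan],
    "plan": ["basic", "premium", None, "basic", "basic"],
})

# Keep a flag recording which ages were originally missing
customers["age_was_missing"] = customers["age"].isna()

# Numeric column: fill with the median, which is robust to extreme values
customers["age"] = customers["age"].fillna(customers["age"].median())

# Categorical column: fill with the most frequent category
customers["plan"] = customers["plan"].fillna(customers["plan"].mode()[0])

print(customers)
```

Keeping the indicator column is a design choice: it lets a later model use the fact that a value was missing, which can itself be informative.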

In [None]:
Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are
some strategies you can use to determine if the missing data is missing at random or if there is a pattern
to the missing data?


In [None]:
To determine if the missing data is missing at random or if there is a pattern to the missing data, some strategies include analyzing the patterns of missing values across the dataset, looking for correlations between missing data and other variables in the dataset, or using statistical tests like Little's MCAR test.
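
One informal way to look for such a pattern is to build a missingness indicator and compare it against the other variables. The sketch below uses pandas on invented data; it is an exploratory check, not a formal test such as Little's MCAR test.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which income tends to be missing for older respondents
df = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52, 60, 67],
    "income": [30.0, 32.0, 41.0, 48.0, np.nan, np.nan, 55.0, np.nan],
})

# Indicator: True where income is missing
df["income_missing"] = df["income"].isna()

# Compare the mean age of rows with and without a recorded income
mean_age_missing = df.loc[df["income_missing"], "age"].mean()
mean_age_observed = df.loc[~df["income_missing"], "age"].mean()

# Correlation between the missingness indicator and age
corr = df["income_missing"].astype(int).corr(df["age"])

print(f"mean age (income missing):  {mean_age_missing:.1f}")
print(f"mean age (income observed): {mean_age_observed:.1f}")
print(f"correlation with age:       {corr:.2f}")
```

A clear difference between the two groups, or a sizeable correlation, suggests the values are not missing completely at random.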

In [None]:
Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the
dataset do not have the condition of interest, while a small percentage do. What are some strategies you
can use to evaluate the performance of your machine learning model on this imbalanced dataset?


In [None]:
Strategies for evaluating the performance of machine learning models on imbalanced datasets include using metrics like precision, recall, F1 score, and ROC-AUC rather than accuracy alone. Additionally, techniques like resampling, modifying the class weights, or using different

In [None]:
Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is
unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to
balance the dataset and down-sample the majority class?


In [None]:
Random Under-Sampling: This involves randomly removing samples from the majority class until it is balanced with the minority class.
import pandas as pd
from sklearn.utils import resample
# Separate majority and minority classes
majority_class = df[df['satisfaction'] == 'satisfied']
minority_class = df[df['satisfaction'] == 'unsatisfied']
# Downsample majority class
downsampled_majority = resample(majority_class, replace=False, n_samples=len(minority_class), random_state=42)
# Combine minority class and downsampled majority class
balanced_data = pd.concat([downsampled_majority, minority_class])

Cluster-Centroids: This method under-samples by clustering the majority class (with K-means by default) and replacing the original majority samples with the cluster centroids, so the majority class is represented by fewer, synthetic points.
from imblearn.under_sampling import ClusterCentroids
# Define the ClusterCentroids model
cc = ClusterCentroids(random_state=42)
# Resample the majority class
X_resampled, y_resampled = cc.fit_resample(X, y)

NearMiss: This method keeps only the majority-class samples that are closest to the minority class. With version=1, it selects the majority samples whose average distance to their nearest minority-class neighbors is smallest.
from imblearn.under_sampling import NearMiss
# Define the NearMiss model
nm = NearMiss(version=1, n_neighbors=3)
# Resample the majority class
X_resampled, y_resampled = nm.fit_resample(X, y)


In [None]:
Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a
project that requires you to estimate the occurrence of a rare event. What methods can you employ to
balance the dataset and up-sample the minority class?

In [None]:
Random Over-Sampling: This involves randomly replicating samples from the minority class until it is balanced with the majority class.
import pandas as pd
from sklearn.utils import resample
# Separate majority and minority classes
majority_class = df[df['rare_event'] == 0]
minority_class = df[df['rare_event'] == 1]
# Upsample minority class
upsampled_minority = resample(minority_class, replace=True, n_samples=len(majority_class), random_state=42)
# Combine majority class and upsampled minority class
balanced_data = pd.concat([majority_class, upsampled_minority])


SMOTE (Synthetic Minority Over-sampling Technique): This method creates new synthetic samples for the minority class. For each selected minority sample it finds the k nearest minority-class neighbors, picks one of them, and places a new sample at a random point on the line between the two.
from imblearn.over_sampling import SMOTE
# Define the SMOTE model
smote = SMOTE(random_state=42)
# Resample the minority class
X_resampled, y_resampled = smote.fit_resample(X, y)

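ADASYN (Adaptive Synthetic Sampling): This method is similar to SMOTE, but it generates more synthetic samples around the minority samples that are harder to learn, that is, those with more majority-class points among their nearest neighbors, which are typically near the decision boundary.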
from imblearn.over_sampling import ADASYN
# Define the ADASYN model
adasyn = ADASYN(random_state=42)
# Resample the minority class
X_resampled, y_resampled = adasyn.fit_resample(X, y)
