
##Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values

Missing values are a common occurrence in datasets: they are the absence of data for one or more variables in a particular observation. They can occur for various reasons, such as data collection errors, data corruption, or fields intentionally left blank due to privacy concerns.

It is essential to handle missing values in a dataset because they can affect the accuracy of the analysis, predictions, and machine learning models. Ignoring or deleting the missing values can lead to biased results, reduced statistical power, and decreased predictive performance.

Few algorithms are truly unaffected by missing values, but some can tolerate them, depending on the implementation:

`Decision Trees:`

Some decision tree implementations (CART, for example) handle missing values with surrogate splits: when the feature used by a split is missing, a correlated backup feature decides which branch the observation follows.

`Random Forests:`

Random forests aggregate the predictions of many decision trees, so they inherit whatever missing-value handling the underlying trees provide.

`K-Nearest Neighbors:`

K-NN can handle missing values by computing distances over only the features that are present in both observations, or by first imputing the gaps with the mean or the most frequent value.

`Naive Bayes:`

Naive Bayes treats features as conditionally independent, so a missing feature can simply be left out of the probability calculation for that observation.

`Support Vector Machines:`

Standard SVMs need complete input vectors, so they do not tolerate missing values on their own. The usual approach is to drop the affected rows or to replace the missing entries with the mean or median of the feature before training.

In conclusion, it is essential to handle missing values in a dataset to ensure the accuracy and reliability of the analysis and models. Several algorithms can handle missing values by either ignoring them or imputing them with a suitable alternative.
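
As a rough illustration of how a distance-based method such as K-NN can ignore missing entries, the sketch below computes a Euclidean distance over only the coordinates observed in both vectors. The function name `nan_euclidean` and the rescaling by the share of usable coordinates are assumptions made for this example (the rescaling is one common convention, not the only one).

```python
import numpy as np

def nan_euclidean(a, b):
    """Euclidean distance over coordinates present in both vectors.

    The squared distance is rescaled by (total coordinates / usable coordinates)
    so that vectors with many gaps are not made to look artificially close.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    both = ~np.isnan(a) & ~np.isnan(b)   # coordinates observed in both vectors
    if not both.any():
        return np.nan                    # nothing to compare
    sq = np.sum((a[both] - b[both]) ** 2)
    return np.sqrt(sq * a.size / both.sum())

x = [3.0, np.nan, 0.0]
y = [0.0, 1.0, 4.0]
d = nan_euclidean(x, y)        # uses coordinates 0 and 2 only
full = nan_euclidean([0.0, 0.0], [3.0, 4.0])  # no gaps: ordinary distance
print(d, full)
```

With no missing entries the function reduces to the ordinary Euclidean distance, so it can be dropped into a nearest-neighbour search without changing the behaviour on complete rows.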

##Q2: List down techniques used to handle missing data.  Give an example of each with python code.

There are several techniques that can be used to handle missing data in a dataset. Here are some commonly used techniques with an example in Python:

1. Removal of missing data: The simplest approach is to remove the missing data from the dataset. However, this approach can lead to loss of valuable information.

In [None]:
import pandas as pd

# create a sample dataframe with missing values
df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8],
                   'C': [9, 10, 11, None]})

# drop rows with missing values
df.dropna(inplace=True)

print(df)


     A    B    C
0  1.0  5.0  9.0


2. Imputation: This involves filling in missing data with substitute values. This can be done using various techniques such as mean, median, mode, or even machine learning algorithms.

In [None]:
import seaborn as sns
df=sns.load_dataset('titanic')

In [None]:
# Missing Value check
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [None]:
# age is numeric (float) and deck is categorical
# Mean and median imputation for age
df['age_mean']=df['age'].fillna(df['age'].mean())
df['age_median']=df['age'].fillna(df['age'].median())

In [None]:
df[['age_median','age_mean','age']]

     age_median   age_mean   age
0          22.0  22.000000  22.0
1          38.0  38.000000  38.0
2          26.0  26.000000  26.0
3          35.0  35.000000  35.0
4          35.0  35.000000  35.0
..          ...        ...   ...
886        27.0  27.000000  27.0
887        19.0  19.000000  19.0
888        28.0  29.699118   NaN
889        26.0  26.000000  26.0


In [None]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
age_mean         0
age_median       0
dtype: int64

In [None]:
# Now check the deck column data
df['deck'].unique()

[NaN, 'C', 'E', 'G', 'D', 'A', 'B', 'F']
Categories (7, object): ['A', 'B', 'C', 'D', 'E', 'F', 'G']

In [None]:
df['deck'].isnull().sum()

688

In [None]:
df[df['age'].notna()]

     survived  pclass     sex   age  sibsp  parch     fare  embarked   class    who  adult_male  deck  embark_town  alive  alone  age_mean  age_median
  0         0       3    male  22.0      1      0   7.2500         S   Third    man        True   NaN  Southampton     no  False      22.0        22.0
  1         1       1  female  38.0      1      0  71.2833         C   First  woman       False     C    Cherbourg    yes  False      38.0        38.0
  2         1       3  female  26.0      0      0   7.9250         S   Third  woman       False   NaN  Southampton    yes   True      26.0        26.0
  3         1       1  female  35.0      1      0  53.1000         S   First  woman       False     C  Southampton    yes  False      35.0        35.0
  4         0       3    male  35.0      0      0   8.0500         S   Third    man        True   NaN  Southampton     no   True      35.0        35.0
...       ...     ...     ...   ...    ...    ...      ...       ...     ...    ...         ...   ...          ...    ...    ...       ...         ...
885         0       3  female  39.0      0      5  29.1250         Q   Third  woman       False   NaN   Queenstown     no  False      39.0        39.0
886         0       2    male  27.0      0      0  13.0000         S  Second    man        True   NaN  Southampton     no   True      27.0        27.0
887         1       1  female  19.0      0      0  30.0000         S   First  woman       False     B  Southampton    yes   True      19.0        19.0
889         1       1    male  26.0      0      0  30.0000         C   First    man        True     C    Cherbourg    yes   True      26.0        26.0


In [None]:
# mode() already ignores NaN, so filtering on another column is not needed
mode=df['deck'].mode()[0]

In [None]:
mode

'C'

In [None]:
df['deck_new']=df['deck'].fillna(mode)

In [None]:
df[['deck_new','deck']]

    deck_new deck
0          C  NaN
1          C    C
2          C  NaN
3          C    C
4          C  NaN
..       ...  ...
886        C  NaN
887        B    B
888        C  NaN
889        C    C


In [None]:
df['embarked'].isnull().sum()

2

In [None]:
df['embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [None]:
# mode() already ignores NaN, so filtering on another column is not needed
mode=df['embarked'].mode()[0]

In [None]:
df['embarked_mode']=df['embarked'].fillna(mode)

In [None]:
# Keep only the columns that have no missing values left
df1=df.dropna(axis=1)
df.isnull().sum()

survived           0
pclass             0
sex                0
age              177
sibsp              0
parch              0
fare               0
embarked           2
class              0
who                0
adult_male         0
deck             688
embark_town        2
alive              0
alone              0
age_mean           0
age_median         0
deck_new           0
embarked_mode      0
dtype: int64

In [None]:
# The missing values were handled with mean, median and mode imputation; confirm that df1 has none left.
df1.isnull().sum()

survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
age_mean         0
age_median       0
deck_new         0
embarked_mode    0
dtype: int64

In [24]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   survived       891 non-null    int64   
 1   pclass         891 non-null    int64   
 2   sex            891 non-null    object  
 3   sibsp          891 non-null    int64   
 4   parch          891 non-null    int64   
 5   fare           891 non-null    float64 
 6   class          891 non-null    category
 7   who            891 non-null    object  
 8   adult_male     891 non-null    bool    
 9   alive          891 non-null    object  
 10  alone          891 non-null    bool    
 11  age_mean       891 non-null    float64 
 12  age_median     891 non-null    float64 
 13  deck_new       891 non-null    category
 14  embarked_mode  891 non-null    object  
dtypes: bool(2), category(2), float64(3), int64(4), object(4)
memory usage: 80.7+ KB


##Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

Imbalanced data refers to a situation where the number of observations in each class of a classification problem is not equal. In other words, one class has a significantly larger number of observations than the other(s). For example, in a medical dataset, the number of patients with a particular rare disease may be much smaller than those without the disease.

If imbalanced data is not handled, it can lead to biased and inaccurate models. The machine learning algorithms may prioritize the majority class and ignore the minority class, leading to poor performance for the minority class. The resulting model may have high accuracy for the majority class, but poor accuracy for the minority class, which is often the class of interest in applications such as fraud detection, disease diagnosis, or rare event detection.

Moreover, a model trained on imbalanced data can reach high overall accuracy simply by predicting the majority class most of the time, without learning the patterns that distinguish the minority class. Such a model generalizes poorly to new minority-class cases.

Therefore, it is essential to handle imbalanced data by employing techniques such as undersampling, oversampling, or a combination of both. Undersampling involves removing some observations from the majority class to balance the class distribution, while oversampling involves duplicating minority-class observations or creating new synthetic ones. Other techniques such as cost-sensitive learning, ensemble methods, and anomaly detection can also be used to handle imbalanced data.
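
The effect described above can be seen with a tiny, made-up example. The labels below are hypothetical (950 negatives and 50 positives), and the "model" is just a rule that always predicts the majority class.

```python
import numpy as np

# Hypothetical labels: 950 majority-class cases (0) and 50 minority-class cases (1).
y_true = np.array([0] * 950 + [1] * 50)

# A "model" that ignores its input and always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)
# Recall on the minority class: share of true positives that were found.
minority_recall = np.mean(y_pred[y_true == 1] == 1)

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
# prints: accuracy=0.95, minority recall=0.00
```

The accuracy looks excellent even though not a single minority case is detected, which is why accuracy alone is a poor guide on imbalanced data.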

##Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and downsampling are required.

Up-sampling and down-sampling are techniques used in machine learning to handle imbalanced datasets, where one class has significantly more samples than the other(s).

Down-sampling involves randomly removing samples from the majority class to balance the dataset. This can lead to loss of information and reduced accuracy, especially when the dataset is already small.

Up-sampling, on the other hand, involves increasing the number of samples in the minority class by replicating existing samples or generating synthetic samples using techniques such as SMOTE (Synthetic Minority Over-sampling Technique). This can help to balance the dataset and improve the performance of the machine learning model.

Here's an example to illustrate when up-sampling and down-sampling might be required:

Suppose we have a dataset with 1000 samples, of which 900 belong to Class A and 100 belong to Class B. This is a highly imbalanced dataset, with Class A being the majority class and Class B being the minority class. If we train a machine learning model on this dataset without balancing the classes, the model will likely perform poorly on Class B, as it has very few samples to learn from.

In this scenario, we might consider using up-sampling techniques to generate more samples for Class B. We could use SMOTE to generate synthetic samples based on the existing samples in Class B, increasing the number of samples to 500, for example. This would help to balance the dataset and improve the model's performance on Class B.

Alternatively, we could use down-sampling techniques to randomly remove samples from Class A, reducing the number of samples to, say, 500. This would balance the dataset, but could also result in loss of information and reduced accuracy, especially if the dataset is already small.

In summary, up-sampling and down-sampling are techniques used to handle imbalanced datasets in machine learning, and their use depends on the specific requirements of the machine learning model and the nature of the data.
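
A minimal sketch of both techniques on a dataset shaped like the example above (900 rows of Class A, 100 of Class B). The column names `feature` and `label` are placeholders, and the up-sampling shown here is plain random duplication rather than SMOTE.

```python
import pandas as pd

# Hypothetical dataset mirroring the example: 900 rows of class A, 100 of class B.
df = pd.DataFrame({
    "feature": range(1000),
    "label": ["A"] * 900 + ["B"] * 100,
})
majority = df[df["label"] == "A"]
minority = df[df["label"] == "B"]

# Up-sampling: draw minority rows with replacement until they match the majority.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up])

# Down-sampling: keep a random subset of majority rows, without replacement.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["label"].value_counts().to_dict())
print(downsampled["label"].value_counts().to_dict())
```

Up-sampling leaves 1800 rows and repeats minority rows many times, while down-sampling leaves only 200 rows; which trade-off is acceptable depends on how much data there is to begin with.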

##Q5: What is data Augmentation? Explain SMOTE.

Data augmentation is a technique used to increase the size of a training dataset by creating new synthetic samples from the original data. Data augmentation is particularly useful when the original dataset is small, imbalanced, or lacks diversity.

One popular method of data augmentation is SMOTE (Synthetic Minority Over-sampling Technique). SMOTE creates new synthetic samples of the minority class by interpolating between existing minority class samples. The algorithm selects a random minority class sample and finds its k nearest neighbors in the feature space. SMOTE then creates new synthetic samples by randomly selecting one of the k nearest neighbors and interpolating between the selected neighbor and the original minority class sample.

The position of each synthetic sample on the line segment between the two points is set by a random factor between 0 and 1. How many synthetic samples are generated is controlled by a user-defined parameter, the amount of over-sampling (often expressed as a sampling ratio), together with the number of neighbors k. Variants such as Borderline-SMOTE and ADASYN (Adaptive Synthetic Sampling) concentrate the synthetic samples near the class boundary or in regions where the minority class is harder to learn.

SMOTE is a powerful technique for handling imbalanced datasets as it can increase the diversity of the minority class and reduce the overfitting caused by duplicating existing samples. SMOTE has been shown to improve the performance of various machine learning algorithms on imbalanced datasets, including decision trees, support vector machines, and neural networks. However, it is essential to use SMOTE with care and avoid overfitting to the minority class, which can lead to reduced performance on new and unseen data.
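
The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration, not a reference implementation: the function name `smote_sketch`, its parameters, and the sample points are all made up for the example, and it only handles numeric features.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)           # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                    # random minority sample
        nb = neighbours[base, rng.integers(k)]    # one of its k nearest neighbours
        gap = rng.random()                        # interpolation factor in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority-class points with two features.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [1.1, 1.3], [0.9, 0.7], [1.3, 1.0]])
X_new = smote_sketch(X_min, n_new=10, k=3)
print(X_new.shape)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already occupied by the minority class.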

##Q6: What are outliers in a dataset? Why is it essential to handle outliers?

Outliers are data points that lie far away from the majority of the other data points in a dataset. They can be caused by measurement errors, data entry errors, or other anomalies in the data. Outliers can have a significant impact on statistical analyses and machine learning models, as they can skew the results and reduce the accuracy of the models.

It is essential to handle outliers for several reasons:

Outliers can significantly impact statistical analyses such as mean, standard deviation, and correlation. Outliers can skew the results of these analyses and lead to incorrect conclusions.

Outliers can also have a significant impact on machine learning models. Many machine learning algorithms are sensitive to outliers, and outliers can negatively impact the accuracy of the models.

Outliers can also affect data visualization. If the outliers are not handled, they can lead to distorted graphs and visualizations, which can be misleading.

Handling outliers is important to improve the accuracy of statistical analyses and machine learning models. There are several ways to handle outliers, including:

Removing outliers: This involves removing the outliers from the dataset. However, this approach should be used with caution, as removing too many outliers can result in a loss of information and potentially biased results.

Transforming the data: Transforming the data can help to reduce the impact of outliers. For example, using log transformations can help to reduce the impact of extreme values.

Imputing missing values: If the outliers are actually invalid entries or placeholder codes standing in for missing data, marking them as missing and imputing them can help to reduce their impact.

In summary, outliers are data points that lie far away from the majority of the other data points in a dataset. Handling outliers is essential to improve the accuracy of statistical analyses and machine learning models. There are several ways to handle outliers, including removing outliers, transforming the data, and imputing missing values.
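
The sketch below shows one way to flag and treat outliers. The data is made up, and the 1.5 × IQR rule used to define the bounds is a common convention assumed for this example, not something prescribed above.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one extreme point.
s = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 250], dtype=float)

# Interquartile-range bounds (the 1.5 multiplier is a convention, not a law).
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (s < lower) | (s > upper)

removed = s[~is_outlier]        # option 1: drop the outliers
capped = s.clip(lower, upper)   # option 2: cap values at the bounds
logged = np.log1p(s)            # option 3: compress extreme values with a log transform

print(f"mean with outlier: {s.mean():.1f}, without: {removed.mean():.1f}")
print(f"median with outlier: {s.median():.1f}, without: {removed.median():.1f}")
```

The mean changes sharply when the extreme point is removed while the median barely moves, which illustrates why outliers distort statistics such as the mean and standard deviation.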

##Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

Dealing with missing data is an important step in data analysis to ensure that your results are accurate and reliable. Here are some techniques you can use to handle missing data in your analysis:

Delete missing data: If the missing data is limited to a small portion of the dataset, you can simply delete the rows with missing data. However, this approach can lead to a reduction in sample size, which may impact the accuracy of your analysis.

Impute missing data: Imputation is the process of filling in missing data with estimated values. There are several methods for imputing missing data, including mean imputation, mode imputation, and regression imputation. Mean imputation replaces missing values with the mean of the non-missing values. Mode imputation replaces missing values with the mode of the non-missing values. Regression imputation predicts missing values using a regression model based on the non-missing values.

Use multiple imputation: Multiple imputation generates several completed datasets, each with the missing values filled in by a different plausible draw from an imputation model. Each dataset is analyzed separately and the results are then pooled. This approach accounts for the uncertainty in the imputed values and can produce more accurate results than a single imputation.


It is important to note that the choice of technique for handling missing data depends on the characteristics of the dataset and the research question being addressed.
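
To make the difference between mean imputation and regression imputation concrete, here is a small sketch on invented customer data. The column names `visits` and `spend` are hypothetical, and a straight-line fit with `np.polyfit` stands in for whatever regression model would suit the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data: 'spend' is missing for two customers.
df = pd.DataFrame({
    "visits": [1, 2, 3, 4, 5, 6],
    "spend": [10.0, 20.0, np.nan, 40.0, np.nan, 60.0],
})

observed = df["spend"].notna()

# Regression imputation: fit spend = slope * visits + intercept on complete rows.
slope, intercept = np.polyfit(
    df.loc[observed, "visits"].to_numpy(),
    df.loc[observed, "spend"].to_numpy(),
    deg=1,
)
df["spend_reg"] = df["spend"].fillna(slope * df["visits"] + intercept)

# Mean imputation, for comparison: every gap gets the same value.
df["spend_mean"] = df["spend"].fillna(df["spend"].mean())

print(df)
```

Regression imputation uses the relationship with `visits` to fill each gap differently, whereas mean imputation fills every gap with the same value and so flattens that relationship.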

##Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

There are several strategies that you can use to determine if the missing data is missing at random or if there is a pattern to the missing data:

* Check for patterns in the missing data: Look for any patterns in the missing data, such as missing data being concentrated in certain variables or groups. This could indicate that the missing data is not random.

* Analyze missing data mechanisms: There are three main types of missing data mechanisms. Under missing completely at random (MCAR), missingness is unrelated to any variable; under missing at random (MAR), it depends only on observed variables; under missing not at random (MNAR), it depends on the unobserved values themselves. Working out which mechanism is plausible tells you whether the missing data can be treated as random.

* Impute missing values and compare: Impute the missing data using different methods such as mean, median, or mode, then compare the results. Significant differences show that the conclusions are sensitive to how the gaps are filled, which could indicate that the missing data is not random.

* Correlate missingness with other variables: Look for correlations between missingness and other variables in the dataset. For example, if missingness is correlated with income, it could indicate that the missing data is not random.

* Use statistical tests: Use a statistical test such as Little's MCAR test to check whether the data is missing completely at random. The test compares the means of the observed variables across the different missing-data patterns; a significant result suggests that the data is not MCAR.
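
Checking whether missingness is related to another variable can be sketched as follows. The data is simulated, and the assumption that `income` goes missing more often for customers over 60 is built in on purpose so that there is a pattern to find; the column names are placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(50_000, 10_000, size=n),
})

# Assumption for the demo: income is missing far more often for customers over 60.
p_missing = np.where(df["age"] > 60, 0.40, 0.05)
df.loc[rng.random(n) < p_missing, "income"] = np.nan

# Turn missingness into a 0/1 indicator and relate it to the observed variable.
df["income_missing"] = df["income"].isna().astype(int)
df["age_group"] = np.where(df["age"] > 60, "over_60", "60_or_under")

rate_by_group = df.groupby("age_group")["income_missing"].mean()
corr_with_age = df["income_missing"].corr(df["age"])

print(rate_by_group)
print(f"correlation between missingness and age: {corr_with_age:.2f}")
```

A clearly higher missing rate in one group, or a non-trivial correlation between the indicator and an observed variable, is evidence against MCAR. It cannot distinguish MAR from MNAR, because that depends on the values that were never observed.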

##Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

Imbalanced datasets, where one class is underrepresented, are a common problem in machine learning, particularly in medical diagnosis projects. Here are some strategies you can use to evaluate the performance of your machine learning model on an imbalanced dataset:

`Use evaluation metrics that account for class imbalance:`

The standard accuracy metric can be misleading on imbalanced datasets, because a model that always predicts the majority class still scores highly. Instead, use evaluation metrics such as precision, recall, F1-score, or the area under the receiver operating characteristic (ROC) curve. Precision, recall, and F1 focus on the positive (minority) class, while the ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) across decision thresholds.

`Use resampling techniques: `

Resampling techniques such as oversampling and undersampling can help balance the training data. Oversampling involves duplicating or synthesizing data points for the minority class, while undersampling involves removing data points from the majority class. However, oversampling may lead to overfitting, and undersampling may discard useful information. Resample only the training split, so that the evaluation data keeps the real class distribution.

`Use ensemble methods:`

 Ensemble methods such as bagging, boosting, and stacking can be used to improve the performance of the model on imbalanced datasets. These methods combine the predictions of multiple models to produce a more accurate prediction.

`Adjust the decision threshold:`

 The decision threshold is the value used to classify an observation into a particular class. By adjusting the decision threshold, you can control the trade-off between precision and recall. For example, lowering the decision threshold can increase the recall at the cost of lower precision.

It is important to note that the choice of strategy for handling imbalanced datasets depends on the specific characteristics of the dataset and the research question being addressed. It is also important to properly evaluate the performance of the model using appropriate evaluation metrics and to avoid overfitting to the imbalanced dataset.
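
The metrics and the threshold trade-off can be worked through by hand on a toy example. The labels and predicted scores below are invented, and `precision_recall_f1` is a helper written for this sketch rather than a library function.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical patients: 8 without the condition (0), 2 with it (1),
# plus the model's predicted probability of having the condition.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.60, 0.40, 0.80])

results = {}
for threshold in (0.5, 0.3):
    y_pred = (scores >= threshold).astype(int)
    results[threshold] = precision_recall_f1(y_true, y_pred)
    p, r, f1 = results[threshold]
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Lowering the threshold from 0.5 to 0.3 raises recall from 0.50 to 1.00 while precision falls from 0.50 to 0.33, which is the trade-off described above. In a diagnosis setting a missed case is usually costlier than a false alarm, so the lower threshold may be the better choice.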

##Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

When dealing with an unbalanced dataset where the majority class dominates, the following methods can be employed to balance the dataset and down-sample the majority class:

`Random under-sampling:` 

This involves randomly removing data from the majority class until the dataset is balanced. However, this method may lead to a loss of important information if the dataset is already small.

`Cluster-based under-sampling:` 

This involves clustering the majority class data and selecting only the representative data points from each cluster to balance the dataset. This method can help preserve the information from the majority class.

`Tomek links:`

A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbors. This method removes the majority class sample from each such pair, which helps create a more distinct boundary between the two classes.

`Synthetic minority over-sampling technique (SMOTE):` 

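This method creates synthetic samples of the minority class by interpolating between an existing minority sample and one of its nearest minority-class neighbors. It is an over-sampling method rather than a down-sampling one, but it is often combined with under-sampling of the majority class. This helps increase the minority class size while maintaining its diversity.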

`Ensemble techniques:`

 Ensemble techniques involve combining multiple models to improve the classification performance. For example, the Balanced Random Forest (BRF) algorithm combines random under-sampling with random forest algorithm to balance the dataset and improve classification performance.

It is important to note that while balancing the dataset can improve model performance, it may also lead to a loss of information. Therefore, it is important to carefully consider the trade-off between balancing the dataset and preserving information.
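
Of the methods above, Tomek links are the least obvious to picture, so here is a small brute-force sketch. The function name `tomek_majority_indices` and the one-dimensional sample points are made up for illustration, and the pairwise distance matrix used here would be too slow for a large dataset.

```python
import numpy as np

def tomek_majority_indices(X, y, majority_label):
    """Return indices of majority-class samples that are part of a Tomek link."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)   # a sample is not its own neighbour
    nn = dists.argmin(axis=1)         # nearest neighbour of every sample

    to_remove = []
    for i, j in enumerate(nn):
        # Tomek link: i and j are mutual nearest neighbours with different labels.
        if nn[j] == i and y[i] != y[j] and y[i] == majority_label:
            to_remove.append(i)
    return np.array(to_remove, dtype=int)

# Hypothetical 1-D data: class 0 is the majority ("satisfied"), class 1 the minority.
X = np.array([[0.0], [1.1], [2.0], [3.0], [3.4], [6.0]])
y = np.array([0, 0, 0, 0, 1, 1])

removed = tomek_majority_indices(X, y, majority_label=0)
keep = np.ones(len(X), dtype=bool)
keep[removed] = False
X_clean, y_clean = X[keep], y[keep]
print(removed, y_clean)
```

Only the majority sample at 3.0 is dropped, because it and the minority sample at 3.4 are each other's nearest neighbours. Tomek links therefore clean the class boundary rather than equalize the class counts, so they are often paired with random under-sampling.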

##Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

When dealing with a dataset that is unbalanced with a low percentage of occurrences, it is important to balance the dataset before training any machine learning models. Here are some methods that can be employed to balance the dataset and up-sample the minority class:

`Oversampling:` 

Oversampling is a technique where the minority class is up-sampled by adding samples. Random oversampling duplicates existing minority samples, drawn at random with replacement, while the synthetic minority over-sampling technique (SMOTE) creates new synthetic samples by interpolating between neighboring minority class samples.

`Undersampling:` 

Undersampling is a technique where the majority class is down-sampled by removing samples randomly. This method can be useful when there is a large number of majority class samples, and it may lead to faster training times.

`Combination of oversampling and undersampling:`

 A combination of oversampling and undersampling can be employed to balance the dataset. This can be done by oversampling the minority class and undersampling the majority class until a desired balance is achieved.

`Using different evaluation metrics:`

In cases where it is difficult to balance the dataset, evaluation metrics such as precision, recall, and F1-score that account for class imbalance can be used to evaluate the performance of the model.

It is important to note that each of these methods has its advantages and disadvantages, and the choice of the method depends on the specifics of the dataset and the research question being addressed. Additionally, it is important to avoid overfitting to the up-sampled minority class and to perform proper evaluation of the model using appropriate metrics.