
### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

- Missing Values: 
  Missing values are instances where no data value is stored for a variable in an observation. Missing values can occur due to various reasons such as data entry errors, equipment malfunctions, or respondents skipping questions.

- Importance of Handling Missing Values:
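  Handling missing values is essential because they can lead to biased estimates, reduce the representativeness of the sample, and decrease the accuracy of machine learning models. Many estimators, such as scikit-learn's linear and logistic regression, also raise an error when the input contains NaN.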

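- Algorithms That Can Tolerate Missing Values (support depends on the implementation):
  - Decision Trees (for example CART with surrogate splits; scikit-learn accepts NaN in decision trees from version 1.3)
  - Random Forest (scikit-learn accepts NaN from version 1.4; older versions raise an error)
  - XGBoost (learns a default split direction for missing values)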
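
Before choosing how to handle missing values, it usually helps to measure how many there are. The following is a minimal sketch on a hypothetical toy DataFrame (the column names `A`, `B`, `C` and the values are illustrative assumptions, not data from this notebook):

```python
import pandas as pd

# Hypothetical toy data; None becomes NaN in numeric columns.
df = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6], 'C': [7, 8, 9]})

missing_counts = df.isna().sum()    # number of missing cells per column
missing_share = df.isna().mean()    # fraction of missing cells per column
rows_with_missing = int(df.isna().any(axis=1).sum())  # rows with at least one gap

print(missing_counts)
print(missing_share.round(2))
print(rows_with_missing)
```

A column with only a few gaps may be a candidate for imputation, while a column that is mostly empty may be better dropped.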



### Q2: List down techniques used to handle missing data. Give an example of each with Python code.

- Techniques to Handle Missing Data:
  - Remove Missing Data:
    python
    import pandas as pd

    df = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6]})
    df.dropna(inplace=True)
    print(df)
    

  - Impute Missing Data (Mean/Median/Mode Imputation):
    python
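    df = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6]})
    # Assign the result back: fillna(inplace=True) on a column selection
    # triggers a warning in recent pandas versions and may leave df unchanged.
    df['A'] = df['A'].fillna(df['A'].mean())    # mean imputation
    df['B'] = df['B'].fillna(df['B'].median())  # median imputation
    # Mode imputation works the same way with df['A'].mode()[0].
    print(df)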
    

  - Impute Missing Data (Using a Predictive Model):
    python
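    from sklearn.impute import KNNImputer

    df = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6]})
    # SimpleImputer(strategy='mean') is plain mean imputation, not a model.
    # KNNImputer is used here instead: it fills each gap from the most similar rows.
    imputer = KNNImputer(n_neighbors=1)
    df[['A', 'B']] = imputer.fit_transform(df[['A', 'B']])
    print(df)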
    

  - Using Algorithms That Handle Missing Values:
    python
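    from sklearn.ensemble import HistGradientBoostingClassifier

    df = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6], 'C': [1, 0, 1]})
    X = df[['A', 'B']]
    y = df['C']
    # HistGradientBoostingClassifier accepts NaN in X directly (scikit-learn 1.0 or later).
    # RandomForestClassifier only does so from scikit-learn 1.4; older versions raise a ValueError.
    model = HistGradientBoostingClassifier()
    model.fit(X, y)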
    



### Q3: Explain imbalanced data. What will happen if imbalanced data is not handled?

- Imbalanced Data: 
  Imbalanced data occurs when the classes in the dataset are not represented equally. For example, in a binary classification problem, if 95% of the samples belong to one class and only 5% belong to the other class, the dataset is imbalanced.

- Consequences of Not Handling Imbalanced Data:
  - The model may become biased towards the majority class.
  - Poor predictive performance for the minority class.
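  - Skewed evaluation metrics, leading to misleading conclusions about model performance. In the 95%/5% example, a model that always predicts the majority class scores 95% accuracy while detecting no minority samples.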
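
The effect on evaluation metrics can be shown with a small sketch. The labels below are hypothetical (95 negatives, 5 positives), and the "model" is simply a rule that always predicts the majority class:

```python
# Hypothetical labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100

# Share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Share of actual positives that were detected.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(accuracy)         # 0.95
print(minority_recall)  # 0.0
```

High accuracy here says nothing about the minority class, which is why recall, precision, or similar per-class metrics are usually reported alongside it.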



### Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

- Up-sampling: 
  Increasing the number of instances in the minority class by replicating them or generating synthetic samples.

- Down-sampling: 
  Reducing the number of instances in the majority class by randomly removing samples.

- Example:
  - Up-sampling:
    python
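    import pandas as pd
    from sklearn.utils import resample

    # Assumes df is a pandas DataFrame with a binary 'target' column (0 = majority, 1 = minority).
    df_majority = df[df['target'] == 0]
    df_minority = df[df['target'] == 1]

    # Draw minority rows with replacement until both classes have the same size.
    df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=123)
    df_upsampled = pd.concat([df_majority, df_minority_upsampled])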
    

  - Down-sampling:
    python
    df_majority_downsampled = resample(df_majority, replace=False, n_samples=len(df_minority), random_state=123)
    df_downsampled = pd.concat([df_majority_downsampled, df_minority])
    

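- When Required: Both techniques apply when the class distribution is heavily skewed. Up-sampling suits cases where the minority class is small and every record is valuable, for example fraud detection where only about 1% of transactions are fraudulent. Down-sampling suits cases where the majority class is so large that discarding part of it still leaves enough data, for example click prediction with millions of non-click records. Resample only the training split, so that test metrics reflect the real class distribution.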
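
The two resampling steps can be checked end to end on a toy frame. This is a minimal sketch with hypothetical data (8 majority rows, 2 minority rows); it uses pandas' own `DataFrame.sample` instead of `sklearn.utils.resample` so that it runs with pandas alone:

```python
import pandas as pd

# Hypothetical toy data: 8 majority rows (target 0) and 2 minority rows (target 1).
df = pd.DataFrame({'feature': range(10), 'target': [0] * 8 + [1] * 2})
df_majority = df[df['target'] == 0]
df_minority = df[df['target'] == 1]

# Up-sampling: draw minority rows with replacement until both classes match.
df_upsampled = pd.concat([
    df_majority,
    df_minority.sample(n=len(df_majority), replace=True, random_state=123),
])

# Down-sampling: keep a random subset of the majority rows, without replacement.
df_downsampled = pd.concat([
    df_majority.sample(n=len(df_minority), replace=False, random_state=123),
    df_minority,
])

up_counts = df_upsampled['target'].value_counts().to_dict()
down_counts = df_downsampled['target'].value_counts().to_dict()
print(up_counts)
print(down_counts)
```

Up-sampling keeps all original rows but repeats minority rows, while down-sampling shrinks the dataset to twice the minority size.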



### Q5: What is data augmentation? Explain SMOTE.

- Data Augmentation: 
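  A technique used to increase the size of the training dataset by creating modified versions of existing data. It is commonly used in image processing, where images are flipped, rotated, cropped, or shifted in brightness to produce new training examples.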

- SMOTE (Synthetic Minority Over-sampling Technique):
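  SMOTE is a technique used to generate synthetic samples for the minority class. For each new sample it picks a minority point, picks one of that point's k nearest minority-class neighbors, and places the new sample at a random position on the line segment between the two.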
  python
  from imblearn.over_sampling import SMOTE

  smote = SMOTE(random_state=123)
  X_resampled, y_resampled = smote.fit_resample(X, y)
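
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions, not the imbalanced-learn implementation; the function name `smote_like_samples` and the sample points are hypothetical:

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))   # pick a minority sample
        nb = rng.choice(neighbours[base])      # pick one of its k nearest neighbours
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[nb] - X_minority[base])
    return synthetic

# Hypothetical minority-class points in two dimensions.
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]])
new_points = smote_like_samples(X_min, n_new=5)
print(new_points.shape)
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region the minority class already occupies.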
  



### Q6: What are outliers in a dataset? Why is it essential to handle outliers?

- Outliers: 
  Outliers are data points that are significantly different from the rest of the data. They can be caused by measurement errors, data entry errors, or genuine variability in the data.

- Importance of Handling Outliers:
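  - Outliers can distort summary statistics such as the mean and standard deviation.
  - They can pull the fit of models that minimize squared error, such as linear regression, and reduce predictive accuracy.
  - Removing or treating outliers (for example by capping or transforming values) can lead to more robust and reliable models, but genuine extreme values may carry signal and should be examined before they are dropped.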
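
One common way to flag outliers is the interquartile range (IQR) rule. The sketch below uses hypothetical values and the conventional 1.5 × IQR fences; the threshold is a convention, not a requirement:

```python
import numpy as np

# Hypothetical measurements; 95 is a deliberate outlier.
values = np.array([10, 12, 11, 13, 12, 11, 95], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # fences of the IQR rule

is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]
cleaned = values[~is_outlier]

print(outliers)
print(values.mean(), cleaned.mean())   # the mean shifts noticeably once the outlier is removed
```

Comparing the two means shows how strongly a single extreme value can move a summary statistic.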



### Q7: Techniques to Handle Missing Data in Customer Data Analysis

- Removing Missing Data:
  python
  df.dropna(inplace=True)
  

- Imputing Missing Data:
  python
  df['column'] = df['column'].fillna(df['column'].mean())
  

- Using Predictive Models:
  python
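  from sklearn.impute import KNNImputer

  # KNNImputer estimates each gap from similar rows; SimpleImputer(strategy='mean') would only fill in the column mean.
  numeric_cols = df.select_dtypes(include='number').columns
  imputer = KNNImputer(n_neighbors=5)
  df[numeric_cols] = imputer.fit_transform(df[numeric_cols])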
  



### Q8: Strategies to Determine if Missing Data is Random or Patterned

- Visual Inspection:
  python
  import seaborn as sns
  sns.heatmap(df.isnull(), cbar=False)
  

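- Statistical Tests:
  python
  from scipy.stats import ttest_ind

  # 'other_column' is a hypothetical numeric column; a small p-value suggests missingness in 'column' depends on it (not MCAR).
  is_missing = df['column'].isnull()
  stat, p_value = ttest_ind(df.loc[is_missing, 'other_column'], df.loc[~is_missing, 'other_column'], nan_policy='omit')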
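
A simple complementary check is to compare the missing rate across groups of another variable. The sketch below uses hypothetical survey data (the columns `age_group` and `income` are illustrative assumptions); a large difference between groups hints that the data is not missing completely at random:

```python
import pandas as pd

# Hypothetical survey data: 'income' is missing more often for the older group.
df = pd.DataFrame({
    'age_group': ['young'] * 4 + ['old'] * 4,
    'income': [30, 35, None, 40, None, None, 55, None],
})

# 1 where income is missing, 0 otherwise.
df['income_missing'] = df['income'].isna().astype(int)

# Share of missing income values within each age group.
missing_rate = df.groupby('age_group')['income_missing'].mean()
print(missing_rate)
```

With real data the group sizes would be larger, and the difference in rates would normally be confirmed with a formal test before drawing conclusions.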
  



### Q9: Strategies for Evaluating Models on Imbalanced Medical Diagnosis Data

- Use Precision-Recall Curve:
  python
  from sklearn.metrics import precision_recall_curve

  # y_score: predicted probabilities or decision scores for the positive class, not hard labels.
  precision, recall, thresholds = precision_recall_curve(y_true, y_score)
  

- Use ROC-AUC Score:
  python
  from sklearn.metrics import roc_auc_score

  # y_score: predicted probabilities or decision scores for the positive class.
  roc_auc = roc_auc_score(y_true, y_score)
  

- Use Confusion Matrix:
  python
  from sklearn.metrics import confusion_matrix

  # y_pred: hard class labels, for example from model.predict(X_test).
  cm = confusion_matrix(y_true, y_pred)
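
For a diagnosis task, the four cells of the confusion matrix translate directly into the metrics clinicians care about. The sketch below computes them by hand on hypothetical screening results (4 patients with the disease, 16 without), assuming 1 marks the positive class:

```python
# Hypothetical screening results: 1 = disease present, 0 = absent.
y_true = [1] * 4 + [0] * 16
y_pred = [1, 1, 1, 0] + [1, 1] + [0] * 14

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # sick, flagged
fn = sum(t == 1 and p == 0 for t, p in pairs)  # sick, missed
fp = sum(t == 0 and p == 1 for t, p in pairs)  # healthy, flagged
tn = sum(t == 0 and p == 0 for t, p in pairs)  # healthy, cleared

recall = tp / (tp + fn)        # sensitivity: share of sick patients detected
specificity = tn / (tn + fp)   # share of healthy patients correctly cleared
precision = tp / (tp + fp)     # share of flagged patients who are actually sick
f1 = 2 * precision * recall / (precision + recall)

print(tp, fn, fp, tn)
print(recall, specificity, precision, round(f1, 3))
```

In screening settings a missed case (false negative) is often costlier than a false alarm, so recall is typically weighted more heavily than precision when choosing a decision threshold.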
  



### Q10: Methods to Balance Dataset and Down-sample Majority Class for Customer Satisfaction

- Random Down-sampling:
  python
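  from sklearn.utils import resample

  # Assumes df_majority and df_minority were split by class label, as in Q4.
  df_majority_downsampled = resample(df_majority, replace=False, n_samples=len(df_minority), random_state=123)
  df_downsampled = pd.concat([df_majority_downsampled, df_minority])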
  

- Use of Under-sampling Techniques:
  python
  from imblearn.under_sampling import RandomUnderSampler

  rus = RandomUnderSampler(random_state=123)
  X_resampled, y_resampled = rus.fit_resample(X, y)
  



### Q11: Methods to Balance Dataset and Up-sample Minority Class for Rare Event Occurrences

- Random Up-sampling:
  python
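  from sklearn.utils import resample

  # Assumes df_majority and df_minority were split by class label, as in Q4.
  df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=123)
  df_upsampled = pd.concat([df_majority, df_minority_upsampled])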
  

- Use of SMOTE:
  python
  from imblearn.over_sampling import SMOTE

  smote = SMOTE(random_state=123)
  X_resampled, y_resampled = smote.fit_resample(X, y)
  

