#### Assignment 1: Handling Missing Data in a Tour Bookings Dataset
Objective:
To analyze and clean a travel booking dataset by identifying and handling missing values using various imputation techniques.
Instructions:
1. Load the provided dataset into pandas.
2. Identify missing data:
   - Use `isna()` and `info()` to detect missing values.
   - Compute the percentage of missing values for each column.
3. Analyze missing data patterns:
   - Determine whether the data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR).
   - Visualize missing data patterns using `seaborn.heatmap()`.
4. Handle missing values by applying different imputation techniques:
   - Mean/median imputation for numerical columns (e.g., Package_Price).
   - Mode imputation for categorical columns (e.g., Destination).
   - Forward fill or backward fill for date-related fields.
   - K-Nearest Neighbors (KNN) imputation for complex cases.
5. Evaluate the impact:
   - Compare summary statistics before and after imputation.
   - Visualize the imputed values using histograms or boxplots.
6. Prepare a report:
   - Document findings, methods used, and final observations.
   - Submit a Jupyter Notebook with the cleaned dataset.
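
The "identify missing data" step can be sketched as follows. This is a minimal illustration on a made-up frame, not the real bookings file: `Package_Price` and `Destination` come from the assignment text, while `Booking_Date`, the values, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the bookings data (values are invented).
toy = pd.DataFrame({
    "Package_Price": [1200.0, np.nan, 950.0, np.nan, 1500.0],
    "Destination": ["Goa", "Paris", None, "Goa", "Bali"],
    "Booking_Date": ["2024-01-05", None, "2024-01-09", "2024-01-12", "2024-01-15"],
})

# Number of missing cells in each column.
missing_count = toy.isna().sum()

# isna().mean() is the fraction of missing rows; times 100 gives a percentage.
missing_pct = toy.isna().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary.sort_values("percent", ascending=False))

# info() lists dtypes and non-null counts, a quick cross-check of the table.
toy.info()
```

Sorting by percentage puts the most affected columns first, which helps when deciding which ones need more than a simple fill.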


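For the pattern analysis, one rough probe is to check whether the missing rate of a column differs across groups of another, fully observed column. The sketch below assumes hypothetical `Channel` and `Travellers` columns and a hypothetical helper `plot_missing_heatmap`; a large difference between groups hints at MAR rather than MCAR, while MNAR generally cannot be confirmed from the observed data alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Package_Price is missing only for "Agent" bookings,
# which mimics a MAR pattern.
toy = pd.DataFrame({
    "Channel": ["Web", "Agent", "Web", "Agent", "Web", "Agent"],
    "Package_Price": [1200.0, np.nan, 950.0, np.nan, 1500.0, 1100.0],
    "Travellers": [2, 4, 1, 3, 2, 5],
})

# Boolean mask: True where a value is missing.
mask = toy.isna()

# Missing rate of Package_Price within each channel. Very different rates
# suggest the missingness depends on an observed column.
rate_by_channel = mask["Package_Price"].groupby(toy["Channel"]).mean()
print(rate_by_channel)


def plot_missing_heatmap(frame):
    """Draw one cell per value, highlighted where the value is missing."""
    # Imported here so the rest of the sketch runs without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(frame.isna(), cbar=False, yticklabels=False)
    ax.set_title("Missing values by row and column")
    plt.show()
```

Calling `plot_missing_heatmap(toy)` in a notebook shows the same mask visually; columns whose gaps line up on the same rows are worth a closer look.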
In [4]:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

# Load dataset
df = pd.read_csv('Tours_and_Travels.csv')

# Handle missing values
# Forward fill date-related fields first, before the mode fill below can reach them
date_cols = [col for col in df.columns if 'date' in col.lower()]
df[date_cols] = df[date_cols].ffill()

# Mode imputation for the remaining gaps in categorical (object) columns
for col in df.select_dtypes(include=['object']):
    modes = df[col].mode()
    if not modes.empty:  # mode() is empty when every value in the column is missing
        df[col] = df[col].fillna(modes[0])

# KNN imputation for numeric columns, run while their NaNs are still present
# (columns with no observed values are skipped because KNNImputer drops them)
num_cols = [col for col in df.select_dtypes(include='number') if df[col].notna().any()]
if num_cols:
    knn_imputer = KNNImputer(n_neighbors=5)
    df[num_cols] = knn_imputer.fit_transform(df[num_cols])

# Median imputation as a fallback for any numeric gaps that are still open
df = df.fillna(df.median(numeric_only=True))

# Save the cleaned dataset to a separate file so the raw data is not overwritten
df.to_csv('Tours_and_Travels_cleaned.csv', index=False)
print("Cleaned dataset saved successfully.")


Cleaned dataset saved successfully.

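For the "evaluate the impact" step, a before/after comparison of summary statistics might look like the sketch below. The price values are invented and median imputation is used only as an example; the same pattern applies to any of the techniques listed in the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical price column with two gaps and one large outlier.
before = pd.Series(
    [900.0, 1000.0, np.nan, 1100.0, np.nan, 5000.0], name="Package_Price"
)

# Median imputation, one of the techniques listed in the assignment.
after = before.fillna(before.median())

# Side-by-side summary statistics; describe() ignores NaN in "before".
comparison = pd.DataFrame({"before": before.describe(), "after": after.describe()})
comparison["change"] = comparison["after"] - comparison["before"]
print(comparison.round(2))
```

In this toy case the mean and standard deviation both shrink after imputation, because every gap is filled with the same central value. A histogram or boxplot of the two series makes that narrowing visible.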