# Extended Data Cleansing

Let's use the "Adult Income" dataset from the UCI Machine Learning Repository, which is a larger dataset with both numeric and categorical attributes. You can find it __[here](https://archive.ics.uci.edu/ml/datasets/adult)__.

First, download the dataset and load it into a DataFrame:

In [None]:
import pandas as pd
import numpy as np

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]

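# skipinitialspace strips the space after each comma, so the missing-value marker is a bare '?'
df = pd.read_csv(url, header=None, names=column_names, na_values='?', skipinitialspace=True)
df.head()



The raw file separates fields with a comma followed by a space and marks missing values with a question mark. `skipinitialspace=True` strips the space after each comma, so `na_values='?'` matches the remaining `?` and pandas reads it as `NaN`.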
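
A quick way to confirm that the placeholders were parsed as missing is to count the `NaN` entries per column. The sketch below does this on a small made-up sample in the same layout as `adult.data`. The three rows and the reduced column list are assumptions for illustration, not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical sample in the raw adult.data layout: fields separated by ", ",
# missing categorical values written as "?".
sample = io.StringIO(
    "39, State-gov, Bachelors, <=50K\n"
    "54, ?, HS-grad, <=50K\n"
    "32, Private, ?, >50K\n"
)
cols = ["age", "workclass", "education", "income"]

# skipinitialspace drops the space after each comma, so the parsed token is "?"
sample_df = pd.read_csv(sample, header=None, names=cols,
                        na_values="?", skipinitialspace=True)

# Number of missing values per column
missing_counts = sample_df.isna().sum()
print(missing_counts)
```

The same `isna().sum()` call works on the full `df`. If it reports zero for every column, the placeholder string passed to `na_values` did not match what the parser saw.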

Now let's introduce some missing values into the numeric columns for practice:


In [None]:
np.random.seed(42)

# Numeric columns that will receive artificial gaps
numeric_cols = ["age", "education-num", "capital-gain", "capital-loss", "hours-per-week"]

# Cast to float first so the columns can hold NaN without an implicit dtype change
df[numeric_cols] = df[numeric_cols].astype(float)

# Blank out 200 distinct, randomly chosen rows in each column
for col in numeric_cols:
    df.loc[np.random.choice(df.index, 200, replace=False), col] = np.nan
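
Because the rows are drawn independently for each column, a single row can end up with more than one gap. The sketch below illustrates this on a synthetic DataFrame. The column values, the size of 1000 rows, and the 50 gaps per column are arbitrary assumptions, chosen only to keep the example self-contained.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two numeric columns (values are arbitrary)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "age": rng.integers(17, 90, size=1000).astype(float),
    "hours-per-week": rng.integers(1, 99, size=1000).astype(float),
})

# Blank out 50 distinct rows per column; replace=False means no row
# is picked twice within the same column
for col in demo.columns:
    rows = rng.choice(demo.index.to_numpy(), size=50, replace=False)
    demo.loc[rows, col] = np.nan

injected = demo.isna().sum()                          # gaps per column
rows_with_any = int(demo.isna().any(axis=1).sum())    # rows with at least one gap
print(injected)
print("Rows with at least one missing value:", rows_with_any)
```

On the Adult data, `df.isna().sum()` should report 200 for each of the five numeric columns after the injection step, while the number of affected rows can be slightly below 1000 where draws overlap.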


Now we can perform the data cleaning operations as before and observe their impact on the statistics of the numeric columns.


In [None]:
# Drop every row that has a missing value in any column (numeric or categorical)
df_cleaned = df.dropna()
print("Removed Missing Data:\n", df_cleaned.describe())

# Replace missing numeric values with the column mean
# (categorical columns have no mean, so their gaps stay)
df_mean = df.fillna(df.mean(numeric_only=True))
print("Replaced with Mean:\n", df_mean.describe())

# Replace missing numeric values with the column median
df_median = df.fillna(df.median(numeric_only=True))
print("Replaced with Median:\n", df_median.describe())

# Forward fill: copy the previous row's value into each gap
# (ffill() replaces the deprecated fillna(method='ffill'))
df_ffill = df.ffill()
print("Replaced with Forward Fill:\n", df_ffill.describe())

# Backward fill: copy the next row's value into each gap
# (bfill() replaces the deprecated fillna(method='bfill'))
df_bfill = df.bfill()
print("Replaced with Backward Fill:\n", df_bfill.describe())



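The `describe()` method reports summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column in the DataFrame. Comparing these statistics across the cleaning methods shows how each one changes the distribution. For example, filling with the mean leaves the column mean unchanged but lowers the standard deviation, because every filled value sits exactly at the center.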
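
Reading five separate `describe()` printouts side by side is awkward, so it can help to collect a few statistics into one table. The following is a minimal sketch on a synthetic column. The normal distribution, its parameters, and the 200 injected gaps are assumptions for illustration, not the Adult data.

```python
import numpy as np
import pandas as pd

# Synthetic numeric column with gaps (a stand-in for a column such as "age")
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=40, scale=12, size=1000), name="age")
values.iloc[rng.choice(1000, size=200, replace=False)] = np.nan

# One cleaned version of the column per strategy
cleaned = {
    "dropna": values.dropna(),
    "mean": values.fillna(values.mean()),
    "median": values.fillna(values.median()),
    "ffill": values.ffill(),
    "bfill": values.bfill(),
}

# One row per strategy, one column per statistic
summary = pd.DataFrame(
    {name: s.agg(["count", "mean", "std"]) for name, s in cleaned.items()}
).T
print(summary.round(2))
```

In this table the `mean` row keeps the same mean as the `dropna` row but shows a smaller standard deviation. Forward and backward fill can leave a gap at the very start or end of the column, which shows up as a count below the full length.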



Extra exercises:

1. Compare the mean and standard deviation for each numeric column after applying different missing value handling techniques. What are the advantages and disadvantages of each method?

2. Investigate the impact of missing value handling techniques on the correlation between numeric columns. Use the `corr()` method to compute the correlation matrix for the original dataset and each cleaned version. How do different methods affect the correlation?

3. Identify columns with a high percentage of missing values. Experiment with different threshold values and decide whether to remove or impute these columns.

4. Explore advanced imputation methods, such as k-Nearest Neighbors or regression-based imputation, using the `sklearn.impute` module. Compare their performance with the basic methods covered in this notebook.
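
As a possible starting point for exercise 4, the sketch below compares `SimpleImputer` (mean) with `KNNImputer` from `sklearn.impute` by hiding known values and measuring how well each method recovers them. The data is synthetic, and the column names `a` and `b`, the 50 hidden values, and `n_neighbors=5` are assumptions for illustration. On the Adult data you would apply the imputers to the numeric columns only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Synthetic, strongly correlated numeric data (arbitrary values, not the Adult dataset)
rng = np.random.default_rng(1)
x = rng.normal(size=500)
data = pd.DataFrame({"a": x, "b": 2 * x + rng.normal(scale=0.1, size=500)})

# Hide 50 values of "b" and remember the true values for scoring
hidden = rng.choice(500, size=50, replace=False)
truth = data.loc[hidden, "b"].to_numpy()
data.loc[hidden, "b"] = np.nan

def impute_error(imputer):
    """Mean absolute error of the imputed values against the hidden truth."""
    filled = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
    return float(np.abs(filled.loc[hidden, "b"].to_numpy() - truth).mean())

mean_mae = impute_error(SimpleImputer(strategy="mean"))
knn_mae = impute_error(KNNImputer(n_neighbors=5))
print(f"Mean imputation MAE: {mean_mae:.3f}")
print(f"KNN imputation MAE:  {knn_mae:.3f}")
```

KNN does well here because `b` is almost fully determined by `a`. With weakly correlated columns the gap between the two methods will be smaller. For regression-based imputation, scikit-learn offers `IterativeImputer`, which is still marked experimental and must be enabled with `from sklearn.experimental import enable_iterative_imputer` before it can be imported.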