<img src="data/images/div/lecture-notebook-header.png" />

# Data Preprocessing

It is almost a natural law in data mining that no real-world dataset can be used "as is" for common tasks while still yielding high-quality results. Data preprocessing plays a crucial role in data mining and is an essential step in the overall data analysis process. Its purpose is to transform raw data into a clean, consistent, and well-structured format that is suitable for further analysis and mining. The main objectives of data preprocessing in data mining are as follows:

* **Data Cleaning:** Raw data often contains errors, missing values, inconsistencies, or outliers. Data cleaning involves techniques to handle these issues, such as removing or imputing missing values, correcting errors, and dealing with outliers to ensure the quality and integrity of the data.

* **Data Integration:** In many cases, data used for mining is collected from multiple sources. Data integration involves combining data from different sources into a unified dataset. It requires resolving schema and format differences, identifying and merging duplicate records, and establishing relationships between different data sources. Since the simple dataset used in this notebook stems from a single source, we ignore data integration here.

* **Data Transformation:** Data transformation involves converting the data into a suitable format for mining. This may include changing the scale of numerical data, normalizing or standardizing variables, reducing dimensionality, and transforming categorical variables into numerical representations.

* **Feature Selection/Extraction:** In data mining, not all features or attributes may be relevant or contribute significantly to the analysis. Feature selection involves identifying the most informative and relevant features for the mining task, while feature extraction aims to create new features or representations that capture the underlying patterns or reduce the dimensionality of the data.

* **Data Reduction:** Large datasets can be computationally expensive and may hinder the performance of mining algorithms. Data reduction techniques, such as sampling, aggregation, or instance selection, are employed to reduce the size of the dataset without losing critical information, thus improving efficiency and speeding up the mining process.

* **Handling Noisy Data:** Noise refers to irrelevant or misleading information present in the data. Preprocessing techniques, such as smoothing or filtering, can be applied to remove noise and improve the accuracy of mining results.

By performing these preprocessing steps, the data is prepared in a more suitable form, enhancing the effectiveness of data mining algorithms and improving the quality of the mined patterns, relationships, or insights. While some preprocessing steps are almost always either required or best practice, many others are less obvious and may not be advisable for every dataset and task. Which steps to perform during data preprocessing is to a large extent informed by the results of the Exploratory Data Analysis (EDA).

This notebook covers a series of basic but common preprocessing steps for data mining. More specifically, we focus on data cleaning and data transformation, as they are the most common steps. Some of the other steps will be explicitly or implicitly addressed in subsequent notebooks, such as the notebooks about Dimensionality Reduction.

## Setting up the Notebook

### Specify how plots get rendered

In [None]:
%matplotlib inline

### Import Required Packages

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from tqdm import tqdm

---

## Loading the Data

We will use the dataset that we "messed up" in the EDA notebook. Recall that the original dataset was taken from [Kaggle](https://www.kaggle.com/sulianova/cardiovascular-disease-dataset) and contains 70,000 records of patient data for predicting whether a patient suffers from a cardiovascular disease based on personal and health-related information.

As is often the case, we first use `pandas` to load the file content into a dataframe.

In [None]:
# Load file into pandas dataframe
df = pd.read_csv('data/datasets/cardio/cardio_train_messy.csv', sep=';')

# Let's have a look at the first 5 rows
df.head()

In [None]:
num_records, num_attributes = df.shape

print("There are {} data points, each with {} attributes.".format(num_records, num_attributes))

Data preprocessing steps might involve removing data points and/or attributes. So it can be useful to keep an eye on that. Intuitively, we want to preserve as much of the data as possible.

---

## Data Cleaning

In the context of data mining or data analysis, data cleaning refers to the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in a dataset. It is an essential step in the data preparation phase before conducting any meaningful analysis or modeling. Data cleaning involves various tasks, including:

* **Handling missing data:** Identifying missing values in the dataset and deciding on an appropriate strategy to handle them, such as imputation (replacing missing values with estimated values) or deletion.

* **Removing duplicates:** Identifying and removing duplicate records from the dataset to ensure data integrity and avoid biased analysis.

* **Correcting inconsistencies:** Identifying and resolving inconsistencies in the data, such as conflicting values or formatting errors. This may involve standardizing data formats, reconciling conflicting information, or updating outdated values.

* **Handling outliers:** Identifying extreme or erroneous values that deviate significantly from the expected range and deciding on the appropriate treatment. Outliers may be corrected, removed, or treated separately depending on the context and analysis goals.

* **Standardizing data:** Converting data into a consistent format or scale to facilitate analysis. This may involve normalizing numerical data, converting units of measurement, or standardizing categorical variables.

* **Validating data:** Verifying the accuracy and correctness of the data by performing logical checks, cross-referencing with external sources, or comparing against known benchmarks.

By performing these data cleaning tasks, analysts and data scientists try to ensure that the dataset is accurate, reliable, and suitable for analysis -- at least as far as is practically possible or reasonable. Clean data improves the quality of subsequent analyses, reduces biases, and helps in generating more reliable insights and predictions.

In the following, we go through a few basic but common data cleaning steps. However, the required and recommended data cleaning steps will typically depend heavily on the dataset and the task. So do not consider the steps performed here as either complete or always required.

### Handle Missing Values

#### NA values

If only a few data points have NA values for some attributes, the most straightforward approach is often to simply remove those data points. Replacing/filling NA values is not always applicable and generally requires making assumptions that must hold. The simplest way to remove all rows with NA values from a data frame is the [`pandas.DataFrame.dropna`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method. So we could do:

In [None]:
df_dropna_all = df.dropna()

num_records, num_attributes = df_dropna_all.shape

print("There are {} data points, each with {} attributes.".format(num_records, num_attributes))

However, we don't really care about NA values for attributes we don't need for our analysis. For example, our dataset has an `id` column. This is just an artificial attribute that is not useful. So let's first remove this column, and then remove all rows with NA values.

In [None]:
df = df.drop(columns=['id'])

df_dropna_needed = df.dropna()

We can again check the size of the filtered dataset.

In [None]:
num_records, num_attributes = df_dropna_needed.shape

print("There are {} data points, each with {} attributes.".format(num_records, num_attributes))

Note that this removes fewer data points than doing the removal without first dropping `id`. Of course, we could and should remove all columns we decided not to use for our analysis. While the difference in this example is not that great -- since we removed attribute values at random -- in practice, this can make a big difference.

**Important:** Just because we do not use an attribute for our analysis, does not mean that this attribute or its values may not be important. For example, an NA value for `id` might indicate an invalid record even if the other attribute values look "normal". So if we are skeptical about these records, we might want to remove them after all. As with many steps, this is a context-dependent design decision.
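
To make such a decision, it helps to see where the NA values actually are. The following is a minimal sketch on a small toy data frame (the values are made up for illustration and are not taken from the cardio dataset); it counts NA values per column and contrasts dropping rows on all columns with dropping rows only on the columns we need.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with NA entries in different columns
toy = pd.DataFrame({
    'id':     [1, np.nan, 3, 4],
    'height': [170, 165, np.nan, 180],
    'weight': [70.0, 60.0, 80.0, np.nan],
})

# Number of NA values per column
na_counts = toy.isna().sum()
print(na_counts)

# Drop a row only if one of the attributes we actually need is NA
toy_needed = toy.dropna(subset=['height', 'weight'])
print(toy_needed.shape)

# Drop a row if ANY column (including id) is NA
toy_all = toy.dropna()
print(toy_all.shape)
```

Using `subset` keeps the `id` column around for inspection while ignoring its NA values during filtering, which is one way to postpone the decision about "suspicious" records.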

In [None]:
# Let's continue with the data frame without the id column
df = df_dropna_needed

#### Default Values

In line with what we have already done during the EDA, let's plot the distribution of `weight` values using a simple histogram -- obviously, this should be done for most, if not all, relevant attributes, since in practice we would not know that `weight` has been tampered with, as we did on purpose in the EDA notebook.

In [None]:
plt.figure()
plt.tick_params(labelsize=14)
plt.hist(df['weight'].to_numpy(), bins=75)
plt.xlabel('weight', fontsize=16)
plt.ylabel('count', fontsize=16)
plt.tight_layout()
plt.show()

In this example, it's relatively easy to see that `weight=0` represents some kind of "odd" value. Firstly, it is far off the expected distribution, and secondly, we know semantically that a person cannot have a weight of zero. Let's check how many data points actually have a weight of 0 (this should, of course, match the height of the bar in the previous plot). `pandas` provides a powerful syntax to make this very easy.

In [None]:
print(df[df.weight == 0].shape)

This output means that 6,210 records (out of 70k) have a value of `0` for the attribute `weight` -- we also now have only 12 attributes after removing `id`.

The question now is, of course, how to handle this. The easy way would be to simply ignore the attribute `weight` and remove it from the dataframe. However, the assumption that a person's weight has no effect at all on this person's risk of suffering from a cardiovascular disease arguably does not hold. So we should *not* remove this attribute. The straightforward alternative is to remove all data points with `weight=0` as follows:

In [None]:
# Keep only the rows where weight != 0
df_no_default = df[df.weight != 0]

Let's check the shape of the dataframe again:

In [None]:
num_records, num_attributes = df_no_default.shape

print("There are {} data points, each with {} attributes.".format(num_records, num_attributes))

Well, we would lose another couple of thousand data points, which seems not too bad. Again, there are no hard rules or thresholds to decide when a loss of data is acceptable or not. The approach of removing data is generally the "safest" solution, since correcting or estimating missing values generally requires making some assumption that may or may not hold.

**Side note:** There are cases where missing values can be filled unambiguously by deriving them from existing information. This is typically the case if 2 or more attributes are dependent. For example, let's assume that a dataset contains both the age as well as the date of birth of a person -- we ignore here that this is generally not a common data management strategy. In this case, we could always derive a person's age as long as the information about that person's date of birth is available.
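
As a minimal sketch of this idea, the snippet below fills a missing age from the date of birth. The column names `date_of_birth` and `age`, the toy values, and the fixed reference date are all assumptions made for illustration; our cardio dataset contains no date of birth.

```python
import pandas as pd

# Toy data (hypothetical): the second record has no age
toy = pd.DataFrame({
    'date_of_birth': pd.to_datetime(['1980-05-17', '1995-11-02', '1970-01-30']),
    'age': [44.0, None, 54.0],
})

# Fixed reference date at which the age is measured (an assumption)
reference_date = pd.Timestamp('2025-01-01')

# Has the birthday already occurred in the reference year?
dob = toy['date_of_birth']
had_birthday = (dob.dt.month < reference_date.month) | (
    (dob.dt.month == reference_date.month) & (dob.dt.day <= reference_date.day)
)

# Age in full years: year difference, minus 1 if the birthday is still to come
derived_age = reference_date.year - dob.dt.year - (~had_birthday).astype(int)

# Only fill the gaps; existing age values are left untouched
toy['age'] = toy['age'].fillna(derived_age)
print(toy)
```

Here no estimation is involved: the filled value follows directly from another attribute, so no assumption about the data distribution is needed.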

In the following, we look at a concrete approach of replacing missing values -- NA or default values -- with (hopefully) good estimates. The challenge here is to find a meaningful strategy to come up with good estimates. Here, we use the following simple (over-simplified?) strategy to set the weight of a data point `x` with `weight=0` as follows:

* Find `k` data points with heights most similar to the height of `x`
* Calculate the median weight of those `k` data points
* Set `weight` of `x` to the calculated median

This strategy assumes that there is a "reasonably" good correlation between `height` and `weight`. We know, of course, that this correlation is far from perfect, and in practice we would need to check whether this strategy actually yields better results in the end than simply removing all records with `weight=0`. Such questions make good data mining highly non-trivial.

First, for the calculation of median weights, we naturally want to only consider valid data points (no missing values) for `height` and `weight`. After all, we don't want NA or default values in the list we use to calculate the median.

In [None]:
# Remove rows that have NA values for height or weight
df_valid = df.dropna(subset=['height', 'weight'])

df_valid = df_valid[df_valid.weight != 0]

heights = df_valid['height'].to_numpy()
weights = df_valid['weight'].to_numpy()

print(df_valid.shape)

Now we can implement the loop that implements our proposed strategy; see the code cell below. Note that the value for `k` is another parameter for which we have to decide on the value. Intuitively, the larger the value for `k` the more the estimate reflects an average. In contrast, (very) small values of `k` might result in the estimates being highly affected by some records with rather extreme values -- although using the median instead of the mean makes this step a bit more robust. But again, using the mean might yield better results for a concrete dataset after all.

In [None]:
# Set k as the number of nearest data points we want to consider
k = 100

# Loop over each data point in our dataset
for idx, row in tqdm(df.iterrows(), total=len(df)):
    
    # Get the weight and height of the current row
    w, h = row['weight'], row['height']
    
    # If the weight is not the default value 0, do nothing and go to the next row
    if w > 0:
        continue
    
    # Calculate the absolute differences between the current height and all other heights
    # (the data points we are interested in have an absolute difference of 0 or as small as possible)
    diff = np.abs(heights - h)
    
    # Find the k indices with the smallest difference values
    # (these are the k data points with a height most similar to the current row)
    indices = np.argsort(diff)[:k]

    # Select the k weights w.r.t. the indices
    k_weights = weights[indices]
    
    # Calculate the median of the k selected weights
    # np.mean(k_weights) is a valid and maybe even better alternative (it's a design decision)
    median = np.median(k_weights)
    
    # Set the weight value of the current row to the median
    df.at[idx, 'weight'] = median

Let's see how many rows with `weight=0` there are now:

In [None]:
print(df[df.weight == 0].shape)

The result should be `(0, 12)` indicating that the data frame has 0 records with `weight=0`.

We can also plot the distribution of the values for `weight` again:

In [None]:
plt.figure()
plt.tick_params(labelsize=14)
plt.hist(df['weight'].to_numpy(), bins=75)
plt.xlabel('weight', fontsize=16)
plt.ylabel('count', fontsize=16)
plt.tight_layout()
plt.show()

In general, replacing missing values with estimates always relies on some assumptions about the data. As we do not know the true values, we can never evaluate how good our estimates are. The hope is that the underlying assumptions to generate the estimates are "reasonable enough" that any potential errors in the estimates are still preferable compared to simply removing the data points.
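
Although the true values of the missing entries remain unknown, one rough sanity check is to hide some *known* weights, estimate them with the same strategy, and compare the estimates with the hidden values. The sketch below does this on synthetic data; the data generation, the helper `estimate_weight`, and the comparison against a global median are assumptions for illustration, not part of the notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic records (an assumption, not the notebook's dataset):
# weight depends loosely on height plus random noise
n = 2000
heights = rng.normal(165, 8, size=n)
weights = 0.9 * (heights - 100) + 15 + rng.normal(0, 5, size=n)

def estimate_weight(h, ref_heights, ref_weights, k=100):
    """Median weight of the k reference records with the most similar height."""
    nearest = np.argsort(np.abs(ref_heights - h))[:k]
    return np.median(ref_weights[nearest])

# Pretend the weights of 200 randomly chosen records are missing
hidden = rng.choice(n, size=200, replace=False)
known = np.ones(n, dtype=bool)
known[hidden] = False

estimates = np.array([
    estimate_weight(h, heights[known], weights[known]) for h in heights[hidden]
])

# Mean absolute error of the estimates vs. a global-median baseline
mae_knn = np.mean(np.abs(estimates - weights[hidden]))
mae_baseline = np.mean(np.abs(np.median(weights[known]) - weights[hidden]))
print(f"height-based estimate: {mae_knn:.2f}, global median: {mae_baseline:.2f}")
```

Such a check only tells us how well the strategy recovers values that were hidden at random; if values are missing for a systematic reason, the real errors may look different.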

---

## Data Transformation

The purpose of data transformation in data mining is to prepare the raw data for analysis and improve the quality and suitability of the data for mining tasks. Data transformation serves several important purposes:

* **Data normalization:** Transformation techniques like normalization and standardization bring data into a consistent scale and range. This ensures that variables with different units or scales are treated equally during analysis, preventing one variable from dominating the results. Normalization also helps in reducing the impact of outliers and extreme values.

* **Enhancing algorithm performance:** Many data mining algorithms assume certain data distributions or relationships. Data transformation can help meet these assumptions and improve the performance of the algorithms. For example, transforming skewed variables using logarithmic transformation can make them more normally distributed, which is often beneficial for many statistical and machine learning models.

* **Reducing data redundancy:** Data transformation can help reduce redundancy in the dataset. Redundant or highly correlated variables can negatively impact the performance of data mining algorithms, leading to overfitting or inflated importance of certain variables. By applying techniques like principal component analysis (PCA), data transformation can reduce the dimensionality of the data while retaining the most relevant information.

* **Handling missing data:** Data transformation techniques can be used to handle missing values in the dataset. By imputing missing values or assigning appropriate values to missing data points, the dataset becomes more complete, enabling effective analysis and mining.

* **Enabling the use of algorithms:** Some data mining algorithms require specific types of data or data structures. For instance, decision trees typically work best with categorical data, while distance-based algorithms like k-means clustering require numerical data. Data transformation techniques, such as one-hot encoding or feature scaling, can convert the data into the required format, making it compatible with the chosen algorithms.

* **Simplifying data representation:** Data transformation techniques like discretization can simplify the data representation by converting continuous variables into categorical ones. This can make the analysis easier and more interpretable, particularly for certain types of algorithms or when dealing with large datasets.

Let's have a look at some common techniques.

### Aggregation

Aggregation refers to the process of combining multiple data values into a single summary value. It involves grouping data based on specific criteria and applying an aggregate function to derive meaningful insights or reduce the data volume. Aggregation is commonly used to transform detailed or granular data into a more summarized form, which can be beneficial for data mining tasks. It helps in reducing the complexity and size of the dataset while preserving important information. Aggregating data can reveal patterns, trends, or statistics at higher levels of granularity or provide a more holistic view of the data. It is particularly useful when dealing with large datasets, as it helps reduce the computational complexity and memory requirements for subsequent data mining tasks. It also allows for a more concise representation of data, making it easier to identify patterns or trends that may not be apparent in the original granular data.

For a simple example, in the code cell below, we convert `age` from days to years, which maps all values that round to the same year to a single value. This significantly reduces the number of unique values for this attribute. This can speed up methods like Decision Trees that need to find good threshold values for each attribute; see the later Decision Tree notebook for more details.

In [None]:
df.age = np.round(df.age / 365)

df.head()
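
The age conversion is a very light form of aggregation. The more general pattern described at the beginning of this section, grouping records and applying an aggregate function per group, might look like the following sketch. The toy values and the column names `age_days` and `age_years` are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical values): age in days and weight in kg
toy = pd.DataFrame({
    'age_days': [18393, 18250, 20228, 20100, 17474],
    'weight':   [62.0, 85.0, 64.0, 82.0, 56.0],
})

# Coarsen age from days to (rounded) years
toy['age_years'] = (toy['age_days'] / 365).round().astype(int)

# Group by the coarser attribute and summarize each group
summary = toy.groupby('age_years')['weight'].agg(['count', 'mean'])
print(summary)
```

The five detailed records are reduced to three summary rows, one per age in years, each with the number of records and their mean weight.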

### Binning & Smoothing

Binning and smoothing are techniques used to transform continuous data into discrete representations or to reduce noise in the data. These techniques can help simplify the data representation, handle outliers, and uncover patterns or trends. Binning, also known as discretization, is the process of dividing a continuous variable into a set of intervals or bins. It involves grouping similar values together to create a discrete representation of the data. Binning can be done in several ways such as equal-width binning, equal-frequency binning, or custom binning. Smoothing is a technique used to reduce noise or variability in the data, particularly in time series or sequential data. It aims to identify underlying trends or patterns by removing random fluctuations or outliers. Smoothing techniques include moving average, exponential smoothing, and kernel smoothing.
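
As a rough illustration of these variants, the sketch below applies equal-width binning, equal-frequency binning, and a simple moving average to a handful of made-up values. It relies on `pandas.cut`, `pandas.qcut`, and `Series.rolling`; the numbers and the choice of 3 bins and a window of 3 are arbitrary assumptions.

```python
import pandas as pd

# Toy values (hypothetical), already sorted for readability
values = pd.Series([150, 152, 155, 160, 161, 163, 170, 175, 190])

# Equal-width binning: 3 intervals of equal length over the value range
equal_width = pd.cut(values, bins=3)

# Equal-frequency binning: 3 intervals holding (roughly) the same number of values
equal_freq = pd.qcut(values, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())

# Moving-average smoothing: each value becomes the mean of itself and its neighbors
# (the first and last value have no full window and become NaN)
smoothed = values.rolling(window=3, center=True).mean()
print(smoothed)
```

With skewed data like this, equal-width bins end up with very different numbers of values, while equal-frequency bins are balanced but cover value ranges of different lengths.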

To easily see the effect of binning and smoothing, let's first look at the data points with lowest values for height.

In [None]:
df.sort_values(by=['height']).head()

Let's bin and smooth `height` with a small bin size of 3, just to keep it simple. Since each bin holds (roughly) the same number of data points, this is equal-frequency binning. We then smooth each bin by replacing all values in the bin with its median.

In [None]:
# Convert all heights into a numpy array
heights = df['height'].to_numpy()

# Get the indices w.r.t. the sorting of the heights
# (e.g., indices[0] is the index of the data point with the smallest height)
indices = np.argsort(heights)

[`numpy.array_split`](https://numpy.org/doc/stable/reference/generated/numpy.array_split.html) is a useful function to split a given array into a given number of chunks of (nearly) equal length; if the array cannot be divided evenly, the first few chunks each receive one extra element.

In [None]:
# Set the bin size
bin_size = 3

# Define an array containing all zeros to hold the new height values
# (float dtype so that a median such as 156.5 is not truncated)
new_heights = np.zeros_like(heights, dtype=float)

for b in np.array_split(indices, len(indices) // bin_size):
    # Get all the heights w.r.t. the current bin indices
    bin_heights = heights[b]
    
    # Calculate median (again, mean/average is a valid alternative)
    median = np.median(bin_heights)
    
    # Set the height of ALL the data points in the bin to the median
    new_heights[b] = median


Now we only have to set the `height` attribute to the new array of heights.

In [None]:
df.height = new_heights

We can see the difference immediately when looking at the same sample as above.

In [None]:
df.sort_values(by=['height']).head()

Compare the output with the one before the binning. Using a bin size of 3 and the median makes the effect very easy to understand here. For example, while the smallest height values were initially 28, 32 and 43, they are now all set to the median of this set, i.e., 32.
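
To see the mechanics in isolation, here is the same bin-and-smooth procedure on a tiny made-up array whose length is not a multiple of the bin size. The values are hypothetical; the point of the sketch is to show how `numpy.array_split` sizes the chunks and how the medians are written back.

```python
import numpy as np

# Toy heights (hypothetical values); 7 values do not divide evenly into bins of 3
heights = np.array([160, 150, 171, 155, 180, 165, 158])
bin_size = 3

# Indices that would sort the heights
indices = np.argsort(heights)

# 7 // 3 = 2 chunks; the leftover element goes to the FIRST chunk
chunks = np.array_split(indices, len(indices) // bin_size)
print([len(c) for c in chunks])  # [4, 3]

# Replace every value in a bin with the bin's median
# (float dtype, since the median of an even-sized bin may not be an integer)
smoothed = np.zeros_like(heights, dtype=float)
for c in chunks:
    smoothed[c] = np.median(heights[c])

print(smoothed)  # [156.5 156.5 171.  156.5 171.  171.  156.5]
```

Because the medians are written back via the original indices, the smoothed array keeps the original order of the data points.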

### Normalization

Normalization is an important data preprocessing technique in data mining with the purpose of bringing data into a consistent scale and range. It ensures that variables with different units, magnitudes, or distributions contribute equally to the analysis and modeling process. The primary purposes of normalization in data preprocessing are as follows:

* **Equalizing variable scales:** Different variables in a dataset often have different scales. For example, one variable might range from 0 to 100, while another variable ranges from 1000 to 100000. Normalization rescales the values of variables to a common scale, typically between 0 and 1 or -1 and 1. By doing so, it ensures that no single variable dominates the analysis due to its larger magnitude.

* **Preventing bias:** Some data mining algorithms are sensitive to the scales of variables. Algorithms that rely on distance measures, such as k-means clustering or nearest neighbor methods, can be biased towards variables with larger scales. Normalizing the data eliminates such biases, enabling a fairer analysis where all variables contribute proportionally.

* **Improving algorithm performance:** Normalization can enhance the performance of certain data mining algorithms. Many algorithms, such as neural networks or support vector machines, are based on mathematical optimization techniques that assume input data to be on a similar scale. Failure to normalize the data may result in slower convergence, suboptimal performance, or difficulty in comparing the importance of different variables.

* **Handling outliers:** Outliers, extreme values that deviate significantly from the majority of the data, can disproportionately impact the analysis. Normalization can help mitigate the influence of outliers by compressing the range of data values. By mapping the data into a limited range, the impact of extreme values is reduced, making the analysis more robust.

* **Interpretability and feature importance:** Normalizing data makes the variables more easily interpretable and comparable. When variables are on the same scale, it becomes simpler to understand their relative importance or contributions in the analysis. It facilitates the identification of significant features and simplifies the interpretation of model coefficients or feature importance measures.

It's worth noting that normalization is not always necessary or suitable for every data mining task. Certain algorithms, such as decision trees or ensemble methods like random forests, are invariant to variable scales. In such cases, normalization may not have a significant impact. However, for many other algorithms and analyses, normalization plays a crucial role in ensuring accurate, fair, and robust results.

In the following, we normalize both `weight` and `height`, although we already applied binning and smoothing to `height`. While performing both steps is perfectly fine, one would commonly perform the normalization step first, depending on the smoothing strategy used.

#### Standardization

Standardization or z-score normalization assumes that the attribute values are (roughly) normally distributed. The standardized value (z-score) reflects the distance of the original attribute value from the overall mean, measured in standard deviations. Accordingly, the formula for an attribute value $x_i$ is:

$$z_i  =\frac{x_i - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation over all attribute values in the dataset.

Using `pandas` and `numpy` we can implement this for `weight` and `height` very easily. Note that we don't need a loop that goes over each data point.

In [None]:
df['weight'] = (df['weight']-df['weight'].mean()) / df['weight'].std()
df['height'] = (df['height']-df['height'].mean()) / df['height'].std()

Let's check the results. Since we subtract the mean from all values, we now have heights and weights with negative values. While a negative height and weight might not "make sense" intuitively, it does not have to in the context of most data mining techniques as the order and relative distances between the values are still preserved.

In [None]:
df.sort_values(by=['height']).head()

We can check the standardization step by calculating the new means for `weight` and `height`.

In [None]:
mean_weight = np.mean(df['weight'])
mean_height = np.mean(df['height'])

print('Mean of weight: {}'.format(mean_weight))
print('Mean of height: {}'.format(mean_height))

Apart from issues with floating point precision, both means should be zero.
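
The same kind of check can be done for the standard deviation, which should be one after standardization. The sketch below uses a few made-up values; note that `pandas` computes the sample standard deviation (`ddof=1`) by default, whereas `numpy.std` defaults to the population standard deviation (`ddof=0`), so the two only agree when `ddof` is set consistently.

```python
import numpy as np
import pandas as pd

# Toy weights (hypothetical values)
values = pd.Series([55.0, 62.0, 70.0, 81.0, 95.0])

# Standardize with pandas (Series.std uses ddof=1 by default)
z = (values - values.mean()) / values.std()

print('Mean: {}'.format(z.mean()))
print('Std (pandas, ddof=1): {}'.format(z.std()))

# numpy's default (ddof=0) gives a slightly smaller value for the same data
print('Std (numpy, ddof=0): {}'.format(np.std(z.to_numpy())))
print('Std (numpy, ddof=1): {}'.format(np.std(z.to_numpy(), ddof=1)))
```

For a dataset with tens of thousands of records, the difference between the two conventions is negligible, but it explains why a check with `np.std` may not return exactly one.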

#### Scaling using Min-Max Normalization

Min-Max normalization scales all attribute values into a specified range. For a range of [0,1], the formula is as follows:

$$x^{norm}_i =\frac{x_i - MIN}{MAX - MIN} $$ 

where $MIN$ is the minimum and $MAX$ is the maximum among all attribute values. As a result, the smallest value gets scaled to 0, and the largest value gets scaled to 1.

**Note:** In this example, we scale after standardizing. In practice, you would choose only one of the two methods.

In [None]:
df['weight'] = (df['weight']-df['weight'].min()) / (df['weight'].max()-df['weight'].min())
df['height'] = (df['height']-df['height'].min()) / (df['height'].max()-df['height'].min())

Now all the values for `height` and `weight` are within [0, 1]. In the output of the code cell below, you can again see the effect of binning and smoothing `height`: the first 3 values for `height` (assuming sorting) share the same, now lowest, value of `0`.

In [None]:
df.sort_values(by=['height']).head()
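
Since Min-Max normalization can target any specified range, the formula generalizes to a range $[a, b]$ as $x^{norm}_i = a + \frac{(x_i - MIN)(b - a)}{MAX - MIN}$. A minimal sketch of this generalization is shown below; the helper name `min_max_scale` and the toy values are assumptions for illustration.

```python
import numpy as np

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so that the smallest maps to new_min and the largest to new_max."""
    values = np.asarray(values, dtype=float)
    old_min, old_max = values.min(), values.max()
    if old_max == old_min:
        # All values are identical: the formula would divide by zero
        raise ValueError("Cannot scale an attribute with a single distinct value.")
    return new_min + (values - old_min) * (new_max - new_min) / (old_max - old_min)

# Toy weights (hypothetical values)
weights = [50, 60, 80, 100]

print(min_max_scale(weights))          # range [0, 1]
print(min_max_scale(weights, -1, 1))   # range [-1, 1]
```

The explicit check for identical minimum and maximum is a design choice; constant attributes carry no information and are typically dropped before scaling anyway.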

### Discretization

Similar to binning and smoothing, discretization is a technique used to transform continuous variables into categorical or discrete representations. The purpose of discretization is to simplify the data, handle outliers, and enable the use of techniques designed for categorical variables. Discretization serves several important purposes:

* **Simplifying analysis:** Discretization reduces the complexity of continuous data by converting it into a smaller number of categories or bins. This simplification can make the analysis and interpretation of data more manageable and intuitive, especially when dealing with large datasets or complex relationships.

* **Handling non-linear relationships:** Discretization can be particularly useful when the relationship between the target variable and a predictor variable is non-linear. By converting the continuous variable into categories, it allows for capturing potential non-linear associations between variables that may not be easily modeled using continuous data.

* **Dealing with outliers:** Discretization can mitigate the impact of outliers by assigning them to appropriate bins. Outliers in continuous variables can distort the analysis or affect the performance of certain algorithms. By grouping outliers with other data points based on their value range, discretization can help reduce the influence of extreme values.

* **Enabling certain algorithms:** Some algorithms, such as association rules or decision tree algorithms, are specifically designed for categorical variables. Discretization allows for applying these algorithms to continuous variables by converting them into discrete representations. This expands the range of techniques that can be employed for analysis.

* **Preserving privacy:** Discretization can also play a role in preserving privacy and confidentiality. By converting continuous variables into categorical ones, sensitive information in the data can be obfuscated, reducing the risk of re-identification.

There are various methods of discretization, including the aforementioned equal-width and equal-frequency binning, as well as more sophisticated ones such as k-means clustering, entropy-based methods, and decision tree-based approaches. The choice of discretization method depends on the specific requirements of the analysis, the nature of the data, and the goals of the data mining task.

However, it's important to note that discretization is a transformation that introduces some loss of information. The choice of the number of bins or categories and the algorithm used can impact the analysis results. Therefore, it is essential to carefully consider the trade-off between simplicity, interpretability, and the potential loss of information when applying discretization in data preprocessing.

To provide an illustrative example, let's use the original dataset without any missing values etc. We also ignore a whole bunch of columns to make the representation easier.

In [None]:
df = pd.read_csv('data/datasets/cardio/cardio_train.csv', sep=';')

# Drop all columns except for id, age, gender, and weight
df = df.drop(columns=['height', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio'])

df.head()

Let's assume we want to discretize `weight` into three classes: `light`, `average`, and `heavy`.

In [None]:
labels = ['light', 'average', 'heavy']

`pandas` has nifty methods to map numeric values to predefined intervals; check [`pandas.IntervalIndex.from_tuples`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.IntervalIndex.from_tuples.html) and [`pandas.cut`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html)

In [None]:
# For this example, let's consider
# weights in (0, 70] == light,
# weights in (70, 90] == average,
# weights in (90, 999] == heavy (999 just represents a max upper boundary)
# (intervals created by from_tuples are closed on the right by default)
bins = pd.IntervalIndex.from_tuples([(0, 70), (70, 90), (90, 999)])

# Create a new colum weight_interval
df['weight_interval'] = pd.cut(df['weight'], bins=bins)

df.head()

We can now map these 3 different intervals to our 3 weight labels; again, `pandas` has handy methods for this.

In [None]:
# Connect the 3 bins with the 3 labels defined above
d = dict(zip(bins, labels))

# Create a new column weight_class; weight_interval already holds the binned values
df['weight_class'] = df['weight_interval'].map(d)

df.head()

Lastly, we can remove the original attribute `weight` and the auxiliary attribute `weight_interval`.

In [None]:
df = df.drop(columns=['weight', 'weight_interval'])

df.head()
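
As a possible shortcut, `pandas.cut` also accepts a `labels` argument, which assigns the class names directly and avoids the auxiliary interval column. The sketch below uses a small made-up `weights` series as a stand-in for `df['weight']`.

```python
import pandas as pd

# Hypothetical toy weights standing in for df['weight']
weights = pd.Series([52.0, 70.0, 70.5, 90.0, 110.0])

# The bin edges 0 < 70 < 90 < 999 yield the right-closed intervals
# (0, 70], (70, 90], (90, 999]; `labels` names them in the same order
weight_class = pd.cut(
    weights,
    bins=[0, 70, 90, 999],
    labels=['light', 'average', 'heavy'],
)

print(weight_class.tolist())
# ['light', 'light', 'average', 'average', 'heavy']
```

Note that boundary values such as 70 and 90 fall into the lower class because the intervals are closed on the right.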

### One-Hot Encoding

One-hot encoding (often used interchangeably with dummy encoding, although dummy encoding in the strict sense drops one of the resulting columns) is a technique used to represent categorical variables as binary vectors. The purpose of one-hot encoding is to transform categorical data into a format that can be effectively used by machine learning algorithms. It serves several important purposes:

* **Handling categorical variables:** Machine learning algorithms typically require numerical input data. However, many real-world datasets contain categorical variables, such as gender, color, or product type. One-hot encoding allows these categorical variables to be represented as binary vectors, where each category is encoded as a separate binary feature.

* **Maintaining variable independence:** By one-hot encoding categorical variables, each category becomes a separate binary feature. This encoding scheme ensures that no inherent order or magnitude is implied among the categories. It helps to maintain the independence of categorical variables during analysis, preventing any false assumptions of ordinality or continuity.

* **Avoiding numerical misinterpretation:** If categorical variables are directly encoded as integers (e.g., assigning numerical labels to categories), the numerical values might be misinterpreted as having a mathematical relationship or order. The attributes `cholesterol` and `gluc` in our dataset are examples of this. One-hot encoding eliminates this issue by representing each category with a distinct binary feature, ensuring that the encoded representation is not misconstrued as having a numerical relationship.

* **Enabling distance calculations:** One-hot encoding allows the calculation of distances or similarities between categorical variables using metrics like Hamming distance. This enables the use of clustering algorithms or distance-based classifiers on categorical data.

* **Facilitating model interpretation:** One-hot encoding produces binary features, which can make it easier to interpret the model's weights or coefficients. Each feature represents the presence or absence of a particular category, making it straightforward to identify which categories are influential in the model's predictions.

* **Expanding the feature space:** One-hot encoding expands the feature space by creating additional binary features for each category. This can capture the potential relationships or interactions between categories, providing more information for the model to learn from.
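
The points about numerical misinterpretation and distance calculations can be illustrated with a small sketch. The colour values and their integer codes are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical categorical values; the integer codes are arbitrary labels
colors = pd.Series(['red', 'green', 'blue'])
codes = colors.map({'red': 1, 'green': 2, 'blue': 3})

# With integer codes, red appears "closer" to green than to blue
dist_red_green = abs(codes.iloc[0] - codes.iloc[1])  # 1
dist_red_blue = abs(codes.iloc[0] - codes.iloc[2])   # 2

# With one-hot vectors, the Hamming distance (number of differing positions)
# is the same for every pair of distinct categories
onehot = pd.get_dummies(colors, dtype=int)
hamming_red_green = int((onehot.iloc[0] != onehot.iloc[1]).sum())  # 2
hamming_red_blue = int((onehot.iloc[0] != onehot.iloc[2]).sum())   # 2

print(dist_red_green, dist_red_blue)
print(hamming_red_green, hamming_red_blue)
```

The integer codes suggest an ordering that does not exist in the data, whereas the one-hot representation treats all categories as equally different from each other.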

It's important to note that one-hot encoding can lead to a significant increase in the dimensionality of the data, especially when dealing with categorical variables with many distinct categories. This expansion of features can impact computational complexity and memory requirements. Therefore, it's advisable to consider the trade-off between feature space expansion and the resources available for analysis.
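
A rough sketch of this effect, assuming a made-up data frame with one low-cardinality column (`gender`) and one high-cardinality column (`zip_code`); both columns and their values are hypothetical:

```python
import pandas as pd

# Hypothetical data: 2 distinct genders, 100 distinct zip codes
toy = pd.DataFrame({
    'gender': ['f', 'm'] * 50,
    'zip_code': [str(10000 + i) for i in range(100)],
})

# Number of distinct categories per column
n_categories = toy.nunique().to_dict()

# One-hot encode both columns: one binary column per distinct category
encoded = pd.get_dummies(toy)

print(n_categories)   # {'gender': 2, 'zip_code': 100}
print(toy.shape)      # (100, 2)
print(encoded.shape)  # (100, 102)
```

Checking the number of distinct values with `nunique()` before encoding is a cheap way to spot attributes that would blow up the feature space.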

For a simple example, let's use the previous output and encode `weight_class`. This is a very common preprocessing step since many algorithms do not support categorical attributes. With [`pandas.get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html), this takes only a single line.

In [None]:
weight_class_dummies = pd.get_dummies(df.weight_class, prefix='weight_class')

weight_class_dummies.head()

Now we only have to replace the `weight_class` attribute with the new dummy attributes.

In [None]:
# Concatenate initial data frame with dummy data frame
df = pd.concat([df, weight_class_dummies], axis=1)

# Remove the weight_class column
df = df.drop(columns=['weight_class'])

df.head()
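
Alternatively, `pandas.get_dummies()` can encode and replace a column in one step when it is given the whole data frame together with the `columns` argument. The sketch below uses a small made-up data frame rather than the cardio data.

```python
import pandas as pd

# Hypothetical stand-in for the data frame from the previous step
toy = pd.DataFrame({
    'id': [0, 1, 2],
    'weight_class': ['light', 'heavy', 'average'],
})

# Encode weight_class and drop the original column in a single call;
# the column name is used as the prefix of the new dummy columns
encoded = pd.get_dummies(toy, columns=['weight_class'], dtype=int)

print(encoded.columns.tolist())
# ['id', 'weight_class_average', 'weight_class_heavy', 'weight_class_light']
```

Passing `dtype=int` yields 0/1 columns; without it, recent `pandas` versions return boolean columns instead.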

---

## Summary

Data preprocessing plays a crucial role in data mining, as it directly impacts the quality, effectiveness, and efficiency of the mining process. The importance of data preprocessing can be summarized as follows:

* Data preprocessing helps improve data quality by addressing issues such as missing values, outliers, and inconsistencies. By handling missing data through imputation techniques, removing or correcting outliers, and addressing data inconsistencies, the preprocessing stage ensures that the data used for mining is more accurate, reliable, and suitable for analysis.

* Data preprocessing reduces data complexity by transforming the data into a more suitable format for analysis. Techniques like normalization and discretization simplify the data representation, while feature selection reduces the dimensionality of the data by eliminating redundant or irrelevant features. This simplification enhances the efficiency of mining algorithms, reduces computational complexity, and facilitates the identification of meaningful patterns or relationships.

* Data preprocessing enables the successful application of various data mining techniques and algorithms. Different algorithms have specific requirements regarding data type, scale, or distribution. Preprocessing techniques like one-hot encoding, normalization, or binning ensure that the data is prepared appropriately for different algorithms, facilitating accurate analysis and improving the performance of the mining process.

However, data preprocessing also poses several challenges. Cleaning and transforming large, complex datasets can be time-consuming and resource-intensive. Dealing with missing data, outliers, and inconsistencies requires careful consideration and appropriate techniques. Selecting the most suitable preprocessing techniques for a particular dataset or mining task requires domain knowledge and expertise. Additionally, data preprocessing introduces a level of subjectivity, as different preprocessing choices can lead to varying results. It's crucial to carefully analyze the impact of preprocessing decisions on the final analysis and ensure that they align with the goals of the mining task.

Overall, while data preprocessing presents challenges, its importance cannot be overstated. Properly preprocessing data addresses data quality issues, simplifies data representation, and enables the successful application of mining algorithms. By investing time and effort in effective data preprocessing, analysts can enhance the accuracy, reliability, and efficiency of the data mining process, leading to more meaningful and actionable insights from the data.