# Data Cleaning

Data cleaning is a crucial step in data preprocessing, which involves identifying and correcting or removing errors, inconsistencies, and inaccuracies in the data. The following are the key processes involved in data cleaning:

## Handling duplicates

Duplicates are observations that have identical values for all the variables in the dataset, or for a chosen subset of key columns. Handling duplicates involves identifying them and deciding on an appropriate strategy. This matters because duplicate values can bias an analysis and lead to inaccurate model predictions. Common strategies include:

- **Dropping duplicates:** The simplest approach is to drop them with the pandas `drop_duplicates()` method. For example, if a dataframe `df` has duplicate values in column `'A'`, `df.drop_duplicates(subset=['A'])` keeps one row per unique value, as shown in the code cell below.
- **Aggregating duplicates:** Sometimes it is more useful to combine duplicates by calculating their mean, sum, or count, using the pandas `groupby()` method. For example, to calculate the mean of column `'B'` for each unique value in column `'A'`, group by `'A'` and take the mean, as in the cell below.
- **Keeping the first or last occurrence:** Sometimes only the first or last occurrence of a duplicate value should be kept. Set the `keep` parameter of `drop_duplicates()` to `'first'` (the default) or `'last'`. The cell below keeps the first, and then the last, occurrence of each unique value in column `'A'`.
- **Marking duplicates:** Duplicates can also be flagged instead of removed, which is useful when you want to keep track of them. The pandas `duplicated()` method returns a boolean Series that is `True` for every row repeating an earlier one. The cell below stores this flag in a new column `'duplicate'`.

In [1]:
import pandas as pd

# Create a sample dataset with duplicate values
data = {'A': ['foo', 'bar', 'foo', 'baz', 'qux', 'bar', 'foo'],
        'B': [1, 2, 3, 4, 5, 6, 7],
        'C': [10, 20, 30, 40, 50, 60, 70]}
df = pd.DataFrame(data)

# Print the original dataframe
print('Original dataframe:')
print(df)

# Drop duplicates (returns a new dataframe; df is left unchanged
# so that the later steps still see the duplicate rows)
dedup_df = df.drop_duplicates(subset=['A'])
print('\nDataframe after dropping duplicates:')
print(dedup_df)

# Aggregate duplicates: mean of B and C for each unique value of A
agg_df = df.groupby(['A']).mean()
print('\nDataframe after aggregating duplicates:')
print(agg_df)

# Keep the first occurrence
first_df = df.drop_duplicates(subset=['A'], keep='first')
print('\nDataframe after keeping the first occurrence:')
print(first_df)

# Keep the last occurrence
last_df = df.drop_duplicates(subset=['A'], keep='last')
print('\nDataframe after keeping the last occurrence:')
print(last_df)

# Mark duplicates: True for every repeat of an earlier value in A
df['duplicate'] = df.duplicated(subset=['A'])
print('\nDataframe after marking duplicates:')
print(df)

Original dataframe:
     A  B   C
0  foo  1  10
1  bar  2  20
2  foo  3  30
3  baz  4  40
4  qux  5  50
5  bar  6  60
6  foo  7  70

Dataframe after dropping duplicates:
     A  B   C
0  foo  1  10
1  bar  2  20
3  baz  4  40
4  qux  5  50

Dataframe after aggregating duplicates:
            B          C
A                       
bar  4.000000  40.000000
baz  4.000000  40.000000
foo  3.666667  36.666667
qux  5.000000  50.000000

Dataframe after keeping the first occurrence:
     A  B   C
0  foo  1  10
1  bar  2  20
3  baz  4  40
4  qux  5  50

Dataframe after keeping the last occurrence:
     A  B   C
3  baz  4  40
4  qux  5  50
5  bar  6  60
6  foo  7  70

Dataframe after marking duplicates:
     A  B   C  duplicate
0  foo  1  10      False
1  bar  2  20      False
2  foo  3  30       True
3  baz  4  40      False
4  qux  5  50      False
5  bar  6  60       True
6  foo  7  70       True
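
Before choosing one of these strategies, it can help to look at every row that belongs to a duplicate group and at the size of each group. The following is a minimal sketch of that inspection step, not part of the original cell; the names `all_dupes`, `summary`, `B_mean`, and `n_rows` are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'baz', 'qux', 'bar', 'foo'],
    'B': [1, 2, 3, 4, 5, 6, 7],
})

# keep=False flags every member of a duplicate group, not only the repeats
all_dupes = df[df.duplicated(subset=['A'], keep=False)]

# Named aggregation: mean of B and number of rows for each value of A
summary = df.groupby('A', as_index=False).agg(
    B_mean=('B', 'mean'),
    n_rows=('B', 'size'),
)

print(all_dupes)
print(summary)
```

Groups with `n_rows` greater than 1 are the ones affected by dropping or aggregating.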


## Handling incorrect data

This involves identifying and correcting data that are incorrect or inconsistent with the other data in the dataset. This can include correcting typographical errors, converting inconsistent data into a standardized format, or removing incorrect observations from the dataset. Here are some common steps for handling incorrect data:

- **Identify incorrect data:** The first step is to identify which data is incorrect. This can be done by visual inspection, summary statistics, or data profiling. Incorrect data can take many forms, such as missing values, out-of-range values, duplicate values, inconsistent values, or format errors. For example, we can calculate the minimum and maximum values of the 'age' column and check if they are reasonable.
- **Determine the cause of incorrect data:** Once you have identified the incorrect data, the next step is to determine the cause of the error. Common causes include data entry mistakes and errors introduced during processing, storage, or transfer. For example, negative values in the 'age' column may be caused by data entry errors, while values greater than 100 may be caused by a misunderstanding of the input format.
- **Decide on the appropriate action:** Depending on the cause and severity of the error, you may take different actions. Common actions include imputation, deletion, correction, and validation. Imputation fills in missing values with estimates based on other data. Deletion removes incorrect data from the dataset. Correction fixes errors directly, such as spelling or formatting errors. Validation checks the data against external sources or rules. For example, we can decide to delete any rows whose 'age' value is zero, negative, or greater than 100.
- **Apply the action to the data:** Once you have decided on the appropriate action, you can apply it to the data using techniques such as filtering, sorting, grouping, or transforming. The goal is a clean dataset that is ready for further analysis. For example, we can apply the deletion action using the `drop()` method of a pandas dataframe.

In [2]:
import pandas as pd
import numpy as np

# Create a dummy dataset
data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['John', 'Jane', 'Bob', 'Alice', 'David'],
    'age': [25, -31, 42, 19, 37],
    'gender': ['M', 'F', 'M', 'F', 'Z'],
    'income': [50000, 70000, np.nan, '30000', 60000]
})
print("Dataset before Handling Incorrect Values")
print(data)

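# Identify incorrect data
# Ages must fall in the range 1-100
incorrect_age = data[(data['age'] <= 0) | (data['age'] > 100)]
# Only 'M' and 'F' are accepted gender codes in this example
incorrect_gender = data[~data['gender'].isin(['M', 'F'])]
# Flag incomes stored as non-numeric types (here the string '30000').
# NaN is a float, so the missing income passes this check and is kept.
incorrect_income = data[~data['income'].apply(lambda x: isinstance(x, (int, float)))]

# Delete incorrect data
data = data.drop(incorrect_age.index)
data = data.drop(incorrect_gender.index)
data = data.drop(incorrect_income.index)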

# Display the cleaned dataset
print("\nDataset after Handling Incorrect Values")
print(data)

Dataset before Handling Incorrect Values
   id   name  age gender income
0   1   John   25      M  50000
1   2   Jane  -31      F  70000
2   3    Bob   42      M    NaN
3   4  Alice   19      F  30000
4   5  David   37      Z  60000

Dataset after Handling Incorrect Values
   id  name  age gender income
0   1  John   25      M  50000
2   3   Bob   42      M    NaN
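
Deletion is only one of the actions listed above. As a rough sketch of the "correction" and "validation" alternatives, the snippet below converts the income stored as a string into a number and flags the out-of-range age without dropping the row. The column name `age_valid` is a hypothetical addition, and the 1 to 100 range simply mirrors the check used in the cell above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'age': [25, -31, 42, 19, 37],
    'income': [50000, 70000, np.nan, '30000', 60000],
})

# Identify: summary statistics expose the out-of-range age
print(data['age'].agg(['min', 'max']))

# Correct: parse numeric strings; entries that cannot be parsed become NaN
data['income'] = pd.to_numeric(data['income'], errors='coerce')

# Validate: keep the row but record whether the age passes the range rule
data['age_valid'] = data['age'].between(1, 100)

print(data)
```

This keeps all five rows, so the decision to delete or impute can be made later with full information.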


## Handling inconsistent data

Inconsistent data can arise when the same data is represented in different ways across the dataset. This can include inconsistent date formats, inconsistent units of measurement, or inconsistent naming conventions. Handling inconsistent data involves identifying the inconsistencies and standardizing the data to a consistent format. It ensures that the data is consistent across all records and that the results of any analysis or modeling are accurate and reliable. Here are some steps you can take to handle inconsistent data:

- **Identify the inconsistent data:** Inconsistent data can come in many forms, such as misspelled words, different representations of the same data (e.g., 'USA', 'U.S.A.', 'United States'), or conflicting values for the same variable. You can use techniques such as data profiling, data visualization, or statistical analysis to identify inconsistent data.
- **Define a set of rules for resolving the inconsistencies:** Once you have identified the inconsistent data, you need to decide how to resolve it. You can define a set of rules or procedures for standardizing the data and ensuring consistency. For example, you may decide to replace all misspelled words with the correct spelling, convert all country names to ISO codes, or resolve conflicts by choosing the most recent or most accurate value.
- **Apply the rules to the data:** Once you have defined the rules, you need to apply them to the data. You can do this using various techniques, such as string matching and comparison, regular expressions, or data transformation functions in a programming language or data cleaning tool.
- **Verify the results:** After you have applied the rules, you should verify the results to ensure that the data is now consistent and that the rules have been applied correctly. You can do this by performing data profiling or visualizing the data to check for consistency and accuracy.

In [3]:
import pandas as pd
import numpy as np

# Create a dummy dataset
data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'country': ['USA', 'U.S.A.', 'United States', 'Canada', 'Mexico'],
    'date': ['2021-01-01', '01/01/2021', '2021-01-01', '2021-01-01', '01-01-2021']
})

# Define rules for resolving inconsistencies
country_codes = {
    'USA': 'US',
    'U.S.A.': 'US',
    'United States': 'US',
    'Canada': 'CA',
    'Mexico': 'MX'
}
print("Inconsistent Data")
print(data)

def parse_date(date_str):
    if '/' in date_str:
        # MM/DD/YYYY -> YYYY-MM-DD
        month, day, year = date_str.split('/')
        return f'{year}-{month}-{day}'
    parts = date_str.split('-')
    if len(parts[0]) == 4:
        # Already in YYYY-MM-DD format
        return date_str
    # DD-MM-YYYY -> YYYY-MM-DD
    day, month, year = parts
    return f'{year}-{month}-{day}'

# Apply rules to the data
data['country'] = data['country'].apply(lambda x: country_codes.get(x, x))
data['date'] = data['date'].apply(parse_date)

# Verify the results
print("\nConsistent Data")
print(data)

Inconsistent Data
   id        country        date
0   1            USA  2021-01-01
1   2         U.S.A.  01/01/2021
2   3  United States  2021-01-01
3   4         Canada  2021-01-01
4   5         Mexico  01-01-2021

Consistent Data
   id country        date
0   1      US  2021-01-01
1   2      US  2021-01-01
2   3      US  2021-01-01
3   4      CA  2021-01-01
4   5      MX  2021-01-01


In this code, we start by creating a dummy dataset with three columns: `'id'`, `'country'`, and `'date'`. The `'country'` and `'date'` columns contain inconsistent data in various formats, such as different spellings of `'USA'` and different date formats.

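We then define a set of rules for resolving the inconsistencies. For the `'country'` column, we define a dictionary `country_codes` that maps inconsistent country names to ISO codes. For the `'date'` column, we define a function `parse_date` that converts dates in different formats to a standardized format of `'YYYY-MM-DD'`. The function assumes that slash-separated dates are written as `MM/DD/YYYY` and that dash-separated dates that do not start with a four-digit year are written as `DD-MM-YYYY`.

Finally, we apply both rules with `apply()` and print the result, so that any value that still deviates from the standard format can be spotted.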
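
Printing the dataframe works for five rows, but the "verify the results" step can also be expressed as explicit checks. The sketch below is one possible way to do this, under the assumption that the only valid country codes are `US`, `CA`, and `MX`. The dataframe `cleaned` is hypothetical and deliberately contains one row that was not standardized.

```python
import pandas as pd

# Hypothetical cleaned output, plus one row that slipped through
cleaned = pd.DataFrame({
    'country': ['US', 'US', 'CA', 'MX', 'Mexico'],
    'date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-01', '01-01-2021'],
})

allowed_codes = {'US', 'CA', 'MX'}

# Rule 1: every country must be one of the expected codes
bad_country = ~cleaned['country'].isin(allowed_codes)

# Rule 2: every date must parse as YYYY-MM-DD; failures become NaT
parsed = pd.to_datetime(cleaned['date'], format='%Y-%m-%d', errors='coerce')
bad_date = parsed.isna()

# Rows that violate at least one rule
print(cleaned[bad_country | bad_date])
```

An empty result means every row satisfies both rules.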

## Handling data normalization

Data normalization is the process of transforming data into a common scale or range, to eliminate differences in magnitude and make the data more comparable and interpretable. Normalization is an important step in data preprocessing, as it can improve the accuracy and performance of machine learning models and other data analysis techniques.

There are several methods of data normalization, depending on the nature and distribution of the data. Here are some common methods:

1. **Min-Max normalization:** This method scales the data to a fixed range, typically between 0 and 1. The formula for min-max normalization is:

```python
x_norm = (x - x_min) / (x_max - x_min)
```

where `x` is the original value, `x_min` and `x_max` are the minimum and maximum values in the data, respectively, and `x_norm` is the normalized value.

2. **Z-score normalization:** This method scales the data to have zero mean and unit variance. The formula for z-score normalization is:

```python
x_norm = (x - mean) / std
```

where `x` is the original value, `mean` and `std` are the mean and standard deviation of the data, respectively, and `x_norm` is the normalized value.

3. **Log transformation:** This method applies a logarithmic function to the data, to reduce the range of values and make right-skewed data more symmetric. It is only defined for strictly positive values. The formula for log normalization is:

```python
x_norm = log(x)
```

where `x` is the original value, and `x_norm` is the normalized value.

4. **Power transformation:** This method applies a power function to the data, to adjust the skewness and kurtosis of the distribution and make the data more symmetric and normally distributed. The formula for power normalization is:

```python
x_norm = sign(x) * abs(x) ** a
```

where `x` is the original value, `a` is the power parameter (typically between 0 and 1), `sign` is the sign function that returns the sign of x (+1 or -1), and `abs` is the absolute value function. The normalized value `x_norm` is obtained by raising the absolute value of `x` to the power of `a`, and then multiplying it by the sign of `x` to preserve the direction of the data.

In [4]:
import pandas as pd
import numpy as np

# Create a dummy dataset
data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'value': [10, 20, 30, 40, 50]
})

# Perform min-max normalization
data['value_norm_minmax'] = (data['value'] - data['value'].min()) / (data['value'].max() - data['value'].min())

# Perform z-score normalization
data['value_norm_zscore'] = (data['value'] - data['value'].mean()) / data['value'].std()

# Perform log normalization
data['value_norm_log'] = np.log(data['value'])

# Perform power normalization with a = 0.5
data['value_norm_power'] = np.sign(data['value']) * np.power(np.abs(data['value']), 0.5)

# Print the resulting dataframe
print(data)

   id  value  value_norm_minmax  value_norm_zscore  value_norm_log  \
0   1     10               0.00          -1.264911        2.302585   
1   2     20               0.25          -0.632456        2.995732   
2   3     30               0.50           0.000000        3.401197   
3   4     40               0.75           0.632456        3.688879   
4   5     50               1.00           1.264911        3.912023   

   value_norm_power  
0          3.162278  
1          4.472136  
2          5.477226  
3          6.324555  
4          7.071068  


In this code, we start by creating a dummy dataset with two columns: `'id'` and `'value'`. The `'value'` column contains numerical data with different magnitudes.

We then perform `min-max` normalization on the `'value'` column, using the formula `(x - min(x)) / (max(x) - min(x))`. We store the normalized values in a new column called `'value_norm_minmax'`.

Next, we perform `z-score` normalization on the `'value'` column, using the formula `(x - mean(x)) / std(x)`. We use the `mean()` and `std()` methods of the pandas Series. Note that pandas computes the sample standard deviation (`ddof=1`) by default, whereas NumPy's `np.std` defaults to the population standard deviation (`ddof=0`). We store the normalized values in a new column called `'value_norm_zscore'`.

After that, we perform `log` normalization on the `'value'` column, using the formula `x_norm = log(x)` with `x` as the original value. We store the normalized values in a new column called `'value_norm_log'`.

Finally, we perform power normalization on the `'value'` column, using the formula `x_norm = sign(x) * abs(x) ** a` with `a = 0.5` as the power parameter. We use the `np.sign`, `np.abs`, and `np.power` functions from the `numpy` library to perform the power normalization. We store the normalized values in a new column called `'value_norm_power'`.
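
A quick way to check these transformations is to test the properties they are supposed to have. The sketch below is illustrative only and uses hypothetical variable names. It confirms the range of the min-max result, compares the two standard-deviation conventions for the z-score, and shows `np.log1p` as one common workaround when the data contains zeros, for which `log(x)` is undefined.

```python
import numpy as np
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50], dtype=float)

# Min-max: the result should span exactly [0, 1]
minmax = (values - values.min()) / (values.max() - values.min())

# Z-score with the sample standard deviation (pandas default, ddof=1)
z_sample = (values - values.mean()) / values.std()
# Z-score with the population standard deviation (NumPy default, ddof=0)
z_population = (values - values.mean()) / values.std(ddof=0)

# log1p computes log(1 + x), which stays defined when x is 0
with_zero = pd.Series([0.0, 9.0, 99.0])
log_shifted = np.log1p(with_zero)

print(minmax.min(), minmax.max())
print(z_sample.mean(), z_sample.std())
print(z_population.std(ddof=0))
print(log_shifted.tolist())
```

Whichever `ddof` is chosen, the same convention should be used when the scaling is later applied to new data.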