# Dealing with Data Inconsistencies in Machine Learning

| **Inconsistency Type**           | **Description**                                                             | **Solution** |
|----------------------------------|-----------------------------------------------------------------------------|--------------|
| **Missing Values**               | Data points that are absent in the dataset.                                 | Imputation (mean, median, mode, KNN), Removal of rows/columns with missing values. |
| **Duplicate Entries**            | Repeated rows in the dataset which can skew analysis.                       | Remove duplicates using `drop_duplicates()` method in pandas. |
| **Incorrect Data Types**         | Columns having inappropriate data types.                                    | Convert data types using `astype()` method or `pd.to_datetime()` for dates. |
| **Outliers**                     | Data points that differ significantly from other observations.              | Handle outliers using IQR method, z-score method, or remove if justified. |
| **Inconsistent Categorical Labels** | Variations in categorical labels (e.g., 'Male' vs 'male').                 | Standardize labels using `replace()` or `str.lower()` methods in pandas. |
| **Typographical Errors**         | Mistakes in data entry leading to incorrect values.                        | Correct errors using `replace()`, data validation, or regular expressions. |
| **Data Range Errors**            | Values that fall outside a plausible range (e.g., negative ages).           | Validate and correct using conditional checks, replace with NaN, or apply domain knowledge. |
| **Date/Time Format Issues**      | Inconsistent date/time formats.                                             | Standardize dates using `pd.to_datetime()`, ensure uniform format across the dataset. |
| **Unit Inconsistencies**         | Different units used within the same column (e.g., 'kg' vs 'lbs').          | Convert all values to a common unit using a conversion factor. |
| **Inconsistent Data Formats**    | Variation in data presentation (e.g., '1000' vs '1,000').                   | Standardize formats using `str.replace()`, regular expressions, or parsing functions. |
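
Several rows of the table (formats, data types, range errors, units, outliers) can be illustrated with a small sketch. The `raw` frame, its column names, the 0 to 120 age range, and the 1.5 × IQR rule are assumptions chosen for illustration, not part of the dataset used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical raw data, invented for illustration only
raw = pd.DataFrame({
    'price': ['1000', '1,000', '2,500', '300'],   # mixed number formats, stored as text
    'age': [25, -3, 40, 31],                      # -3 is outside the plausible range
    'weight': [70.0, 154.0, 80.0, 176.0],
    'weight_unit': ['kg', 'lbs', 'kg', 'lbs'],    # mixed units in one column
})

# Inconsistent formats / incorrect types: strip thousands separators, then convert
raw['price'] = pd.to_numeric(raw['price'].str.replace(',', '', regex=False))

# Range errors: replace implausible ages with NaN (assumed valid range: 0 to 120)
raw['age'] = raw['age'].where(raw['age'].between(0, 120))

# Unit inconsistencies: convert pounds to kilograms (1 lb = 0.45359237 kg)
is_lbs = raw['weight_unit'] == 'lbs'
raw.loc[is_lbs, 'weight'] = raw.loc[is_lbs, 'weight'] * 0.45359237
raw['weight_unit'] = 'kg'

# Outliers: flag prices more than 1.5 * IQR beyond the quartiles
q1, q3 = raw['price'].quantile([0.25, 0.75])
iqr = q3 - q1
raw['price_outlier'] = ~raw['price'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(raw)
```

Flagging outliers in a separate column, rather than dropping them immediately, leaves the "remove if justified" decision to a later review step.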




In [75]:
import pandas as pd

In [76]:
data = {
    'date': ['2021-12-01', '01-12-2022', '2022/12/01', '12-01-2021'],
    'country': ['USA', 'U.S.A.', 'America', 'United States'],
    'name': ['Aammar', 'Amaar', 'Hamza', 'Hazma'],
    'sales_2020': [100, 200, None, 200],
    'sales_2021': [None, 150, 300, 150]
}
# make pandas dataframe
df = pd.DataFrame(data)

In [77]:
df.head()

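|   | date       | country       | name   | sales_2020 | sales_2021 |
|---|------------|---------------|--------|------------|------------|
| 0 | 2021-12-01 | USA           | Aammar | 100.0      | NaN        |
| 1 | 01-12-2022 | U.S.A.        | Amaar  | 200.0      | 150.0      |
| 2 | 2022/12/01 | America       | Hamza  | NaN        | 300.0      |
| 3 | 12-01-2021 | United States | Hazma  | 200.0      | 150.0      |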


In [78]:
from datetime import datetime

# Formats to try, in order; the first one that parses wins
DATE_FORMATS = ('%Y-%m-%d', '%d-%m-%Y', '%Y/%m/%d')

def parse_and_standardize_date(date_str):
    """Return date_str as 'YYYY-MM-DD', or unchanged if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(date_str, fmt).strftime('%Y-%m-%d')
        except ValueError:
            continue
    print(f"Error: Date format not recognized for '{date_str}'")
    return date_str  # Return original string if format cannot be parsed

# Apply the function and reassign the 'date' column.
# Dashed dates are read day-first, so '12-01-2021' becomes 12 January 2021.
df['date'] = df['date'].apply(parse_and_standardize_date)

# Display the updated DataFrame
df

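|   | date       | country       | name   | sales_2020 | sales_2021 |
|---|------------|---------------|--------|------------|------------|
| 0 | 2021-12-01 | USA           | Aammar | 100.0      | NaN        |
| 1 | 2022-12-01 | U.S.A.        | Amaar  | 200.0      | 150.0      |
| 2 | 2022-12-01 | America       | Hamza  | NaN        | 300.0      |
| 3 | 2021-01-12 | United States | Hazma  | 200.0      | 150.0      |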


In [79]:
# Convert to datetime (unparseable values become NaT), then format back to 'YYYY-MM-DD' strings
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['date'] = df['date'].dt.strftime('%Y-%m-%d')
df.head()

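|   | date       | country       | name   | sales_2020 | sales_2021 |
|---|------------|---------------|--------|------------|------------|
| 0 | 2021-12-01 | USA           | Aammar | 100.0      | NaN        |
| 1 | 2022-12-01 | U.S.A.        | Amaar  | 200.0      | 150.0      |
| 2 | 2022-12-01 | America       | Hamza  | NaN        | 300.0      |
| 3 | 2021-01-12 | United States | Hazma  | 200.0      | 150.0      |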
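
The last row shows why dashed dates need care: `'12-01-2021'` is valid both day-first and month-first, and the string alone does not say which was intended. The sketch below parses the same string both ways. Which reading is correct for a given source is an assumption that has to come from knowledge of how the data was entered.

```python
from datetime import datetime

ambiguous = '12-01-2021'

# Day-first reading, the one used by the parsing function above
day_first = datetime.strptime(ambiguous, '%d-%m-%Y')
# Month-first reading, common in US-style data
month_first = datetime.strptime(ambiguous, '%m-%d-%Y')

print(day_first.strftime('%Y-%m-%d'))    # 2021-01-12
print(month_first.strftime('%Y-%m-%d'))  # 2021-12-01
```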


In [80]:
# Harmonize the country names
country_mapping = {'USA': 'United States', 'U.S.A.': 'United States', 'America': 'United States'}
df['country'] = df['country'].replace(country_mapping)
df.head()

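|   | date       | country       | name   | sales_2020 | sales_2021 |
|---|------------|---------------|--------|------------|------------|
| 0 | 2021-12-01 | United States | Aammar | 100.0      | NaN        |
| 1 | 2022-12-01 | United States | Amaar  | 200.0      | 150.0      |
| 2 | 2022-12-01 | United States | Hamza  | NaN        | 300.0      |
| 3 | 2021-01-12 | United States | Hazma  | 200.0      | 150.0      |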
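
An exact-match mapping like the one above misses variants that differ only in case, whitespace, or punctuation. One possible refinement, sketched here with a made-up `countries` series and a hypothetical `canonical` dictionary, is to normalize the labels first and then map the normalized form.

```python
import pandas as pd

# Illustrative labels with case, whitespace, and punctuation variations
countries = pd.Series(['USA', ' u.s.a. ', 'America', 'UNITED STATES'])

# Normalize: trim whitespace, lowercase, drop periods
normalized = (
    countries.str.strip()
             .str.lower()
             .str.replace('.', '', regex=False)
)

# Map normalized variants to one canonical label (assumed mapping)
canonical = {
    'usa': 'United States',
    'america': 'United States',
    'united states': 'United States',
}

# Values without a mapping fall back to their original spelling
harmonized = normalized.map(canonical).fillna(countries)
print(harmonized.tolist())
```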


In [82]:
# Correct the typographical mistakes in the names
df['name'] = df['name'].replace({'Amaar': 'Aammar', 'Hazma': 'Hamza'})
df.head()

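|   | date       | country       | name   | sales_2020 | sales_2021 |
|---|------------|---------------|--------|------------|------------|
| 0 | 2021-12-01 | United States | Aammar | 100.0      | NaN        |
| 1 | 2022-12-01 | United States | Aammar | 200.0      | 150.0      |
| 2 | 2022-12-01 | United States | Hamza  | NaN        | 300.0      |
| 3 | 2021-01-12 | United States | Hamza  | 200.0      | 150.0      |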
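
A hand-written replacement dictionary works for four rows but does not scale. When a list of valid spellings is available, the standard library's `difflib.get_close_matches` can suggest corrections. In this sketch, `known_names`, the helper `closest_known`, and the 0.6 cutoff are illustrative assumptions. Fuzzy matches can be wrong, so the suggestions are worth reviewing before they are applied.

```python
from difflib import get_close_matches

import pandas as pd

names = pd.Series(['Aammar', 'Amaar', 'Hamza', 'Hazma'])
known_names = ['Aammar', 'Hamza']  # assumed list of valid spellings

def closest_known(name, choices=known_names, cutoff=0.6):
    """Return the closest known spelling, or the original name if none is close enough."""
    matches = get_close_matches(name, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else name

corrected = names.apply(closest_known)
print(corrected.tolist())
```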


In [83]:
# Remove duplicates by name, keeping the first row for each name
df = df.drop_duplicates(subset="name")
df.head()

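|   | date       | country       | name   | sales_2020 | sales_2021 |
|---|------------|---------------|--------|------------|------------|
| 0 | 2021-12-01 | United States | Aammar | 100.0      | NaN        |
| 2 | 2022-12-01 | United States | Hamza  | NaN        | 300.0      |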
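
Dropping on `subset="name"` discards rows whose dates and sales differ from the row that is kept, so it can help to inspect the duplicate groups first. The sketch below uses a small `people` frame that mirrors the relevant columns at this step. It is rebuilt here only so the example is self-contained.

```python
import pandas as pd

people = pd.DataFrame({
    'date': ['2021-12-01', '2022-12-01', '2022-12-01', '2021-01-12'],
    'name': ['Aammar', 'Aammar', 'Hamza', 'Hamza'],
    'sales_2020': [100, 200, None, 200],
})

# keep=False marks every member of a duplicate group, not only the later rows
dupes = people[people.duplicated(subset='name', keep=False)]
print(dupes.sort_values('name'))

# Checking all columns instead: no row is an exact copy of another
exact_dupes = people.duplicated().sum()
print(exact_dupes)
```

Since none of the rows are exact copies, whether the first row per name is the right one to keep is a judgment about the data, not something `drop_duplicates()` can decide.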


In [84]:
# Resolve contradictory data: drop rows where 2021 sales do not exceed 2020 sales
df = df.drop(df[df['sales_2021'] <= df['sales_2020']].index)
df.head()


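|   | date       | country       | name   | sales_2020 | sales_2021 |
|---|------------|---------------|--------|------------|------------|
| 0 | 2021-12-01 | United States | Aammar | 100.0      | NaN        |
| 2 | 2022-12-01 | United States | Hamza  | NaN        | 300.0      |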
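
The output is unchanged because each remaining row has a missing value in one of the two sales columns, and any comparison against NaN evaluates to False. The sketch below reproduces this with a small `sales` frame that mirrors those two columns, then shows one of the imputation options from the table (column median). Treating the median as a suitable fill value is an assumption; with so few rows it is shown only to demonstrate the call.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    'sales_2020': [100.0, np.nan],
    'sales_2021': [np.nan, 300.0],
}, index=[0, 2])

# Any comparison against NaN is False, so neither row is flagged
contradictory = sales['sales_2021'] <= sales['sales_2020']
print(contradictory.tolist())  # [False, False]

# Rows where the rule could not be evaluated because a value is missing
unchecked = sales.isna().any(axis=1)
print(unchecked.tolist())      # [True, True]

# One possible imputation: fill each column with its own median
imputed = sales.fillna(sales.median())
print(imputed)
```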
