# 8. Practical Example: Cleaning a Real Dataset

In this section, we will apply the techniques learned so far to clean and preprocess a real-world dataset. The process involves loading the dataset, exploring its structure, and performing step-by-step cleaning operations.

## Dataset Overview
We will use a dataset of COVID-19 case records for Indonesia (`Data_COVID19_Indonesia.csv`), with one row per date and location.

### Steps:
1. Load the dataset.
2. Explore the dataset using `head()`, `info()`, and `describe()`.



In [None]:
import pandas as pd

# Load the dataset
file_path = '../DataSets/Data_COVID19_Indonesia.csv' 
covid_data = pd.read_csv(file_path)

# Explore the dataset
print('First few rows of the dataset:')
print(covid_data.head())

print('Dataset Info:')
covid_data.info()  # info() prints its report directly and returns None

print('Summary Statistics:')
print(covid_data.describe())

## Step-by-Step Cleaning

### 1. Address Missing Values
- Identify missing values using `isnull().sum()`.
- Fill missing values or drop rows/columns as necessary.



In [None]:
# Identify missing values
print('Missing values in each column:')
print(covid_data.isnull().sum())

# Fill missing growth factors with the column mean
growth_col = 'Growth Factor of New Cases'
covid_data[growth_col] = covid_data[growth_col].fillna(covid_data[growth_col].mean())

# Drop rows that have no time zone
covid_data = covid_data.dropna(subset=['Time Zone'])
print('Dataset after handling missing values:')
covid_data.info()  # info() prints its report directly and returns None
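
Whether to fill with the mean depends on how the column is distributed. The sketch below uses a small made-up DataFrame (the `toy` frame and its `Growth Factor` column are hypothetical, not taken from the COVID-19 file) to show how a single outlier pulls a mean fill upward while a median fill stays close to the typical value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a skewed numeric column.
toy = pd.DataFrame({
    "Location": ["A", "A", "B", "B", "B"],
    "Growth Factor": [1.0, np.nan, 1.2, 50.0, np.nan],
})

# Share of missing values per column (0.0 to 1.0).
missing_share = toy.isna().mean()

# The mean fill is pulled toward the outlier (50.0); the median fill is not.
mean_filled = toy["Growth Factor"].fillna(toy["Growth Factor"].mean())
median_filled = toy["Growth Factor"].fillna(toy["Growth Factor"].median())

print(missing_share)
print(mean_filled.tolist())
print(median_filled.tolist())
```

If a column such as a growth factor contains extreme values, comparing `mean()` and `median()` before filling is a quick way to decide which is the safer default.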

In [None]:
sr = covid_data.isnull().sum()  # already a Series, so no pd.Series() wrapper is needed
sr

In [None]:
covid_data.columns

### 2. Convert Columns to Proper Data Types
- Ensure numeric columns are of the correct type.
- Convert date columns to datetime format.



In [None]:
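# Convert data types
# The format string must match how dates are written in the CSV; with errors='coerce',
# any date that does not match silently becomes NaT, so adjust the format if needed.
covid_data['Date'] = pd.to_datetime(covid_data['Date'], format='%m-%d-%Y', errors='coerce')
count_cols = ['Total Cases', 'Total Deaths', 'Total Recovered', 'Total Active Cases']
covid_data[count_cols] = covid_data[count_cols].astype(int)  # raises if any value is missing
print('Dataset after converting data types:')
covid_data.info()  # info() prints its report directly and returns None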
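
A plain `astype(int)` raises an error when a column still contains missing or non-numeric values. The following is a minimal sketch of a more forgiving conversion, assuming a made-up `toy` frame and a hypothetical `%m/%d/%Y` date layout (neither is confirmed by the real file): bad dates become `NaT`, bad numbers become `<NA>`, and the column keeps an integer dtype.

```python
import pandas as pd

# Hypothetical toy data: counts stored as text, with one bad value and one gap.
toy = pd.DataFrame({
    "Date": ["3/1/2020", "3/2/2020", "not a date"],
    "Total Cases": ["2", "n/a", None],
})

# Unparseable dates become NaT instead of raising.
toy["Date"] = pd.to_datetime(toy["Date"], format="%m/%d/%Y", errors="coerce")

# Unparseable numbers become NaN; the nullable Int64 dtype then stores them
# as <NA>, where a plain astype(int) would raise.
toy["Total Cases"] = pd.to_numeric(toy["Total Cases"], errors="coerce").astype("Int64")

print(toy.dtypes)
print(toy)
```

Counting `NaT` and `<NA>` values right after the conversion (for example with `toy.isna().sum()`) shows how many entries failed to parse.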

### 3. Handle Inconsistent and Duplicate Data
- Remove duplicate rows using `drop_duplicates()`.
- Normalize text fields for consistency.



In [None]:
# Remove duplicates
covid_data = covid_data.drop_duplicates()
print('Dataset after removing duplicates:')
covid_data.info()  # info() prints its report directly and returns None

# Normalize text fields
covid_data['Location'] = covid_data['Location'].str.strip().str.title()
print('Dataset after normalizing text fields:')
print(covid_data.head())
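
The order of these two operations matters: rows that differ only in whitespace or letter case are not seen as duplicates until the text is normalized. A small sketch with made-up rows (the `toy` frame is hypothetical) illustrates the effect.

```python
import pandas as pd

# Hypothetical toy data: the same two locations written inconsistently.
toy = pd.DataFrame({
    "Location": ["Jakarta", " jakarta ", "BALI", "Bali"],
    "New Cases": [10, 10, 5, 5],
})

# Before normalization every row looks unique.
before = len(toy.drop_duplicates())

# Strip surrounding whitespace and apply title case, then deduplicate again.
toy["Location"] = toy["Location"].str.strip().str.title()
after = len(toy.drop_duplicates())

print(before, after)
```

Note that `str.title()` lowercases every letter after the first in each word, so an acronym such as `DKI` becomes `Dki`. If the location names contain acronyms, `str.strip()` alone may be the safer choice.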

### 4. Rename Columns and Reorder as Needed
- Rename ambiguous columns for clarity.
- Reorder columns for better readability.



In [None]:
# Rename columns
covid_data.rename(columns={'Total Deaths per Million': 'per Million Death'}, inplace=True)

# Reorder columns
columns_order = [
    'Date', 'Location', 'Location ISO Code', 'Location Level', 'City or Regency',
    'Province', 'Country', 'Continent', 'Island', 'Longitude', 'Latitude',
    'Time Zone', 'Special Status', 'Population', 'Population Density', 'Area (km2)',
    'Total Regencies', 'Total Cities', 'Total Districts', 'Total Urban Villages',
    'Total Rural Villages', 'New Cases', 'New Deaths', 'New Recovered',
    'New Active Cases', 'Total Cases', 'Total Cases per Million', 'New Cases per Million',
    'Total Deaths', 'per Million Death', 'Total Deaths per 100rb',
    'New Deaths per Million', 'Case Fatality Rate', 'Total Recovered',
    'Case Recovered Rate', 'Total Active Cases', 'Growth Factor of New Cases',
    'Growth Factor of New Deaths'
]

covid_data = covid_data[columns_order]
print('Dataset after renaming and reordering columns:')
print(covid_data.head())
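
Selecting columns from a hard-coded list raises a `KeyError` if any name is misspelled or absent, and silently drops any column that is not listed. One way to guard against both, sketched here on a made-up frame (the `toy` data and the `desired_order` list are hypothetical), is to check the list against the actual columns first.

```python
import pandas as pd

# Hypothetical toy data with one column that the desired order does not mention.
toy = pd.DataFrame({"b": [1], "a": [2], "extra": [3]})
desired_order = ["a", "b", "missing_col"]

# Names requested but not present in the DataFrame.
missing = [col for col in desired_order if col not in toy.columns]

# Columns present but not requested; keep them at the end instead of dropping them.
leftover = [col for col in toy.columns if col not in desired_order]

ordered = [col for col in desired_order if col in toy.columns] + leftover
toy = toy[ordered]

print("Missing from DataFrame:", missing)
print("Final column order:", list(toy.columns))
```

Printing the `missing` list makes a typo in a column name visible immediately rather than surfacing as an exception later.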

### 5. Save the Cleaned Dataset to a File
Export the cleaned dataset to a new file for future use.



In [None]:
# Save cleaned dataset
cleaned_file_path = '../tmp/covid_data_cleaned.csv'  # Replace with desired path
covid_data.to_csv(cleaned_file_path, index=False)
print(f'Cleaned dataset saved to {cleaned_file_path}')
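
CSV files do not store data types, so a datetime column comes back as plain text unless it is parsed again on load. The sketch below checks a round trip using an in-memory buffer in place of a real file (the `toy` frame is hypothetical); the same `parse_dates` argument applies when reading the cleaned file from disk.

```python
import io

import pandas as pd

# Hypothetical toy data with a datetime column and an integer column.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-01", "2020-03-02"]),
    "Total Cases": [2, 4],
})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Re-parse the date column while loading; otherwise it is read back as text.
reloaded = pd.read_csv(buffer, parse_dates=["Date"])

print(reloaded.dtypes)
print(reloaded.shape == toy.shape)
```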