# Data Cleaning

In this notebook, we clean the COVID-19 dataset to prepare it for analysis. The steps are: checking for and handling missing values, standardizing the date format, and saving the cleaned data to a new CSV file.

In [1]:
import pandas as pd

# Load the dataset
data = pd.read_csv('../data/covid19_data.csv')

# Display the first few rows of the dataset
data.head()

In [2]:
# Check for missing values
missing_values = data.isnull().sum()
missing_values[missing_values > 0]
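
Raw counts are easier to judge next to the share of rows they affect. The following is a minimal sketch of such a report. The `toy` DataFrame and the `missing_report` helper are hypothetical stand-ins for illustration, not part of the dataset or of pandas.

```python
import pandas as pd

# Hypothetical toy frame standing in for the COVID-19 data (column names are assumptions)
toy = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02", None, "2020-03-04"],
    "cases": [10, None, 25, None],
    "country": ["A", "A", "B", "B"],
})

def missing_report(df):
    """Return the count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have missing values
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

report = missing_report(toy)
print(report)
```

A column that is missing in a large share of rows is usually a poor candidate for row-wise dropping, since most of the dataset would be discarded with it.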

In [3]:
# Handle missing values
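# For simplicity, we drop every row that contains at least one missing value.
# .copy() makes data_cleaned an independent DataFrame, so the later assignment
# to data_cleaned['date'] does not trigger a SettingWithCopyWarning.
data_cleaned = data.dropna().copy()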

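# Alternatively, impute missing values instead of dropping rows, depending on the context.
# Assign the result back: recent pandas versions warn about fillna(..., inplace=True)
# on a selected column, and it may not update the DataFrame.
# data['column_name'] = data['column_name'].fillna(value)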

# Display the shape of the cleaned dataset
data_cleaned.shape
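
If dropping rows discards too much data, imputation is the alternative mentioned above. The sketch below shows one possible approach on a hypothetical toy frame: the median for a numeric column and an explicit placeholder for a categorical one. The column names and the choice of median and `"Unknown"` are assumptions for illustration; the right strategy depends on the dataset.

```python
import pandas as pd

# Hypothetical toy frame (column names are assumptions, not the real dataset schema)
toy = pd.DataFrame({
    "cases": [10.0, None, 30.0, 50.0],
    "country": ["A", None, "B", "B"],
})

# Work on a copy so the original frame stays untouched
imputed = toy.copy()

# Numeric column: fill gaps with the median of the observed values
imputed["cases"] = imputed["cases"].fillna(imputed["cases"].median())

# Categorical column: fill gaps with an explicit placeholder label
imputed["country"] = imputed["country"].fillna("Unknown")

print(imputed)
```

Unlike `dropna()`, this keeps every row, at the cost of introducing values that were never observed.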

In [4]:
# Standardize date formats
data_cleaned['date'] = pd.to_datetime(data_cleaned['date'])

# Verify the date format
data_cleaned['date'].head()
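
By default, `pd.to_datetime` raises an error when it meets a value it cannot parse. A minimal sketch of a more defensive variant, assuming the dates are expected in `YYYY-MM-DD` form (an assumption about the data, not something the dataset guarantees), converts unparseable entries to `NaT` and counts them:

```python
import pandas as pd

# Hypothetical raw date column with one malformed entry and one missing entry
raw = pd.Series(["2020-03-01", "2020-03-02", "not a date", None])

# errors="coerce" turns values that do not match the format into NaT instead of raising
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Entries that were present in the input but could not be parsed
n_bad = int(parsed.isna().sum() - raw.isna().sum())
print(f"{n_bad} value(s) could not be parsed as dates")
```

Counting the coerced values makes it visible when a date column contains more malformed entries than expected, rather than silently losing them.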

In [5]:
# Save the cleaned dataset for further analysis
data_cleaned.to_csv('../data/covid19_data_cleaned.csv', index=False)

# Display a message indicating that the cleaning process is complete
print('Data cleaning complete. Cleaned data saved as covid19_data_cleaned.csv.')
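
CSV files store dates as plain text, so the datetime type is not preserved on disk. The sketch below shows a round trip on a hypothetical toy frame, using an in-memory buffer in place of the real file, and restores the type with `parse_dates` when reading the data back:

```python
import io

import pandas as pd

# Hypothetical cleaned frame (column names are assumptions)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-02"]),
    "cases": [10, 25],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates converts the text column back to datetime on load
reloaded = pd.read_csv(buffer, parse_dates=["date"])
print(reloaded.dtypes)
```

The same `parse_dates=['date']` argument can be passed when loading `covid19_data_cleaned.csv` in a later analysis step.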