# Data Cleaning

Raw data often contains problems such as inconsistencies, missing values and duplicate entries. Tackling these issues makes the data easier to analyse. In this tutorial, we will explore the common techniques used in pandas to clean and preprocess data. These include handling missing values, detecting and removing duplicates, converting data types, working with categorical and string data, and performing basic data transformations.

This section will cover key functions:
- Handling null values:  `isnull`, `dropna`, `fillna`, `ffill`
- Handling duplicate values:  `duplicated`, `drop_duplicates`
- Handling data types:  `astype`, `to_numeric`
- String operations:  `str`, `replace`, `re`

## Identify and handle missing data

In [None]:
import pandas as pd
from pathlib import Path

data = pd.read_excel(Path.cwd() / "data/Financial Sample.xlsx", header=2)

# Print rows that contain null values
null_rows = data[data.isnull().any(axis=1)]

print(null_rows)

In [None]:
# Print columns that contain null values
null_columns = data.columns[data.isnull().any()].tolist()

print(null_columns)
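To see how many values are missing in each column, you can sum the boolean mask returned by `isnull`. The sketch below uses a small made-up DataFrame (`toy`, with hypothetical columns) rather than the financial sample.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "Product": ["A", "B", None, "D"],
    "Discounts": [5.0, None, None, 2.5],
    "Sales": [100, 200, 300, 400],
})

# isnull() gives True/False per cell; summing counts the True values per column
null_counts = toy.isnull().sum()

print(null_counts)
```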

You can drop rows that contain null values using `dropna`.

In [None]:
data_no_nulls = data.dropna()

null_rows = data_no_nulls[data_no_nulls.isnull().any(axis=1)]

print(null_rows)
print(data_no_nulls)

It is often useful to pass a threshold to `dropna` using the `thresh` parameter: this is the minimum number of non-null values a row must have in order to be kept. When reading in xlsx files, notes often appear below the actual data, so `dropna` is very useful for getting rid of these rows.
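As a minimal sketch of the `thresh` parameter, the toy DataFrame below (hypothetical data, not the financial sample) has a final row that mimics a note placed beneath a table. With `thresh=2`, a row is kept only if it has at least two non-null values.

```python
import pandas as pd

# Hypothetical toy data: the last row mimics a note placed below the table
toy = pd.DataFrame({
    "Country": ["France", "Germany", "Source: internal report"],
    "Units Sold": [10.0, None, None],
    "Sales": [100.0, 250.0, None],
})

# Keep only rows that have at least 2 non-null values
cleaned = toy.dropna(thresh=2)

print(cleaned)
```

The "Germany" row survives despite one missing value, while the note row, which has only one non-null cell, is dropped.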

There are scenarios where, instead of removing these entries completely, you wish to replace them with something else. You can do so using `fillna`. Here we are considering discounts, so it is reasonable to replace null with 0. Note that calling `fillna(0)` on the whole DataFrame fills nulls in every column, not just `Discounts`.

In [None]:
data = pd.read_excel(Path.cwd() / "data/Financial Sample.xlsx", header=2)
data = data.fillna(0)

print(data[data["Discounts"] == 0])
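If only one column should be filled, `fillna` also accepts a dictionary mapping column names to fill values. The following is a small sketch on made-up data (`toy` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical toy data with nulls in two columns
toy = pd.DataFrame({
    "Product": ["A", None, "C"],
    "Discounts": [5.0, None, None],
})

# Fill nulls only in the "Discounts" column; "Product" is left untouched
filled = toy.fillna({"Discounts": 0})

print(filled)
```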

A common situation I have encountered is where we have a column for year and a column for quarter and, because the data is in order, only Q1 has a year. In this scenario we want to do a forward fill with `ffill`.

In [None]:
data = {
    'Year': [2000, None, None, None, 2001, None, None, None, 2002, None, None, None, 2003, None, None],
    'Quarter': ['Q1', 'Q2', 'Q3', 'Q4', 'Q1', 'Q2', 'Q3', 'Q4', 'Q1', 'Q2', 'Q3', 'Q4', 'Q1', 'Q2', 'Q3']
}
df = pd.DataFrame(data)

print(df)

In [None]:
df_filled = df.ffill()

print(df_filled)

Similarly, there are other fill functions such as `bfill`, which fills backwards. Please see the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.bfill.html
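For comparison, here is a sketch of `bfill` on a hypothetical layout where the year is recorded only against the last quarter, so each missing value is filled from the next valid value below it.

```python
import pandas as pd

# Hypothetical layout: the year appears only on the Q4 row
toy = pd.DataFrame({
    "Year": [None, None, None, 2000, None, None, None, 2001],
    "Quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
})

# Backward fill: each null takes the next non-null value below it
toy_filled = toy.bfill()

print(toy_filled)
```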

## Identify and handle duplicate data

You can identify duplicates using the built-in `duplicated` function. By default, the function checks for duplicates across all columns.

In [None]:
data = pd.read_csv(Path.cwd() / "data/customers-1000.csv")

duplicates = data[data.duplicated()]
print(duplicates)

This data contains no fully duplicated rows. You can restrict the check to particular columns by listing them.

In [None]:
duplicates = data[data.duplicated(["City"])]
print(duplicates)

To remove duplicates you can use the built-in `drop_duplicates` function. By default, the first occurrence of each duplicate is kept.

In [None]:
data = data.drop_duplicates(subset=["Country"])

print(data)
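Both `duplicated` and `drop_duplicates` take a `keep` parameter that controls which occurrence is treated as the original. The sketch below uses hypothetical customer data to show `keep=False` (flag every repeated row) and `keep="last"` (retain the final occurrence).

```python
import pandas as pd

# Hypothetical toy data with repeated countries
toy = pd.DataFrame({
    "Customer": ["Ann", "Bob", "Cara", "Dev"],
    "Country": ["Chile", "Peru", "Chile", "Peru"],
})

# keep=False flags every row whose "Country" appears more than once
all_repeats = toy[toy.duplicated(subset=["Country"], keep=False)]

# keep="last" retains the final occurrence of each "Country"
last_only = toy.drop_duplicates(subset=["Country"], keep="last")

print(all_repeats)
print(last_only)
```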

## Identify and handle data types

Data types are very important because they affect which functions you can use: if pandas has misinterpreted your data as the wrong type, some operations will fail or give unexpected results. For example, for many things we will look at later, such as plotting charts and using `pivot_table`, we need to ensure that numeric data is interpreted as numeric. The `info` function is really useful for getting an overview of a dataframe.

In [None]:
data = pd.read_excel(Path.cwd() / "data/Financial Sample.xlsx", usecols=[0, 1, 2, 4], header=2)

print(data)
# info() prints its summary directly and returns None, so it does not need print()
data.info()

The main data types we will look at are:
   - `int`: Represents integers (e.g., `x = 10`)
   - `float`: Represents floating-point numbers (e.g., `x = 3.14`)
   - `str`: String, sequence of characters (e.g., `text = "Hello, World!"`)
   - `bool`: Boolean value (`True` or `False`, e.g., `flag = True`)
   - `NoneType`: Represents the absence of value (e.g., `x = None`)

We can set the data types either by specifying them when we read in the data or by applying `astype` to a column later. `object` is the default dtype for strings and mixed types.

In [None]:
dtype_dict = {"Customer Id": str}

data = pd.read_csv(Path.cwd() / "data/customers-1000.csv", usecols=[0, 1, 2], dtype=dtype_dict)

print(data.dtypes)

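We can then enforce a string type by applying `astype` to the column. The next two cells show two options: `astype(str)` uses Python's `str` type, while `astype("string")` uses pandas' dedicated `string` dtype. Compare the `dtypes` output of each.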

In [None]:
data["First Name"] = data["First Name"].astype(str)

print(data.dtypes)

In [None]:
data["First Name"] = data["First Name"].astype("string")

print(data.dtypes)
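One practical property of the dedicated `string` dtype is that missing values stay missing (shown as `<NA>`) rather than being turned into text. A small sketch on a hypothetical Series of names:

```python
import pandas as pd

# Hypothetical names with one missing entry
names = pd.Series(["Ann", None, "Cara"])

# Convert to the dedicated pandas string dtype
as_string = names.astype("string")

print(as_string)
print(as_string.dtype)

# The missing entry is still recognised as missing after conversion
print(as_string.isna().sum())
```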

A really useful pandas function is `to_numeric`, which converts a column to a numeric type (integer or float). It can handle non-numeric values such as strings or nulls. The default is `errors="raise"`, which raises an error when there are non-numeric values. More usefully, `errors="coerce"` converts non-numeric values to `NaN`; this is useful when raw data contains values such as "-" where data is missing or suppressed.

In [None]:
data = pd.DataFrame({'col1': ['10', '20', 'thirty', '40', 'NaN']})

# Convert to numeric using pd.to_numeric(), non-numeric values will become NaN
data['col1_numeric'] = pd.to_numeric(data['col1'], errors='coerce')

print(data)

In [None]:
data['col1_numeric'] = data['col1'].astype(float)

The cell above raises a `ValueError` because "thirty" cannot be converted to a float. In conclusion, `astype` is useful for establishing types when all the data can readily be converted, but `to_numeric` gives you a nice way of dealing with invalid data and handling errors quickly.
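After coercing, it can be worth checking which raw values failed to convert, so that nothing is silently lost. The following sketch (with a hypothetical `raw` Series) compares the nulls before and after conversion.

```python
import pandas as pd

# Hypothetical raw column containing text, a genuine null and a "-" placeholder
raw = pd.Series(["10", "20", "thirty", "40", None, "-"])

converted = pd.to_numeric(raw, errors="coerce")

# Values that were present in the raw data but became NaN after conversion
failed = raw[converted.isna() & raw.notna()]

print(failed.tolist())
```

Here the genuine null is excluded from `failed`, so only the values that could not be parsed are reported.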

## String Operations

String operations are incredibly useful for cleaning up data or putting things into a particular format. One useful operation you can do with strings is character slicing.

In [None]:
data = pd.read_excel(Path.cwd() / "data/Financial Sample.xlsx", header=2, usecols=[14, 15])

# You must have str entries to apply character slicing
data["Year"] = data["Year"].astype(str)

# This combines the first 3 characters from "Month Name" and the last 2 characters from "Year"
data["Date"] = data["Month Name"].str[:3] + "-" + data["Year"].str[-2:]

print(data)

The `str` accessor allows you to perform string operations in a vectorized manner, as here, where the operation is applied to a whole column at once. Another extremely useful string operation is `replace`.

In [None]:
data["Date"] = data["Date"].str.replace("-", " ")

print(data)

For more unusual situations, regular expressions (regex) can be extremely useful for extracting particular information. I find that writing regex patterns is a really good use of ChatGPT. An example would be extracting the year of the data from a URL. Using regex in Python requires the built-in `re` module.

In [None]:
import re

# Example URL
url = "https://example.com/2021/learn-how-to-extract-year"

# Find a 4-digit number starting with 1 or 2 that sits between two slashes
match = re.search(r'/([1-2][0-9]{3})/', url)

# Keep the match only if it falls within the range 1900-2024
year = match.group(1) if match and 1900 <= int(match.group(1)) <= 2024 else "No valid year found"
print(year)
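The same pattern can be applied to a whole column through the `str` accessor using `str.extract`. This is a sketch on a hypothetical Series of URLs; rows without a match come back as missing values.

```python
import pandas as pd

# Hypothetical URLs, one of which contains no year
urls = pd.Series([
    "https://example.com/2021/learn-how-to-extract-year",
    "https://example.com/about",
    "https://example.com/1999/archive",
])

# Extract the captured group; expand=False returns a Series rather than a DataFrame
years = urls.str.extract(r"/([12][0-9]{3})/", expand=False)

# Convert the extracted text to numbers; rows with no match stay missing
years = pd.to_numeric(years, errors="coerce")

print(years)
```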