[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/databyjp/AcademyXI_DA/blob/main/notebooks/AcademyXi_DA_Module_6_data_cleaning_workshop.ipynb)

## AcademyXi Data Analysis - Data Manipulation
### Workshop - Data cleaning with Python
In this workshop module, we will use Python to perform data cleaning tasks.  

Using Python makes it very easy to clean data in a consistent, repeatable manner, as well as to automate the process. This makes it possible to clean large quantities of data in a very short amount of time. 

### Preparation

This section prepares the notebook by installing the required packages and loading the data.

In [None]:
%%capture tmp
# The %%capture cell magic must be the first line of the cell
# Install additional libraries required (fsspec and s3fs) to load files through AWS S3
!pip install fsspec s3fs

# Import libraries to be used
import plotly.express as px
import numpy as np
import pandas as pd

In [None]:
# Load data from S3
df = pd.read_csv("s3://databyjp/academyxi/wk6_missing_data_example_MajorPowerStations_v2.csv")

In [None]:
# Check that the file has been properly loaded
df.head()

## Identify missing values

pandas includes many powerful tools to inspect and clean your data. The `.info` method shows how many non-null values are in each column, as well as each column's data type:

In [None]:
df.info()

At a more granular level, pandas includes many functions to identify rows or cells with missing data.

For example, the `.isna` method can be used to produce Boolean values indicating whether the data is missing.

In [None]:
df[["GENERATORNUMBER"]].isna()

A series of Boolean values produced this way can be used to filter the entire dataframe. For example, the code below filters the dataframe to show only the rows with a missing value in the `"PRIMARYSUBFUELTYPE"` column:

In [None]:
df[df["PRIMARYSUBFUELTYPE"].isna()]
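As a minimal sketch of the same idea on a small, made-up dataframe (the `toy` dataframe, its column names and its values are illustrative assumptions, not part of the workshop dataset), `isna` can be combined with `sum` to count missing values per column, and the Boolean mask can be used to filter rows:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (not the workshop dataset)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "fuel": ["Coal", np.nan, "Gas", np.nan],
    "units": [1.0, 2.0, np.nan, 4.0],
})

# True counts as 1, so summing the Boolean mask counts missing cells per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows where "fuel" is missing
rows_missing_fuel = toy[toy["fuel"].isna()]
print(rows_missing_fuel)
```

Counting missing values per column first is often a quick way to decide which columns need attention.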

This can be used to assign the resulting clean(er) dataframes to a new variable.

For example, we can invert the selection, or use the `notna` method to exclude rows with missing data.


In [None]:
df_a = df[~df["GENERATORNUMBER"].isna()]  # "~" inverts a Boolean mask
df_b = df[df["GENERATORNUMBER"].notna()]
df_a.equals(df_b)  # Method to check if two dataframes are identical

To drop rows or columns containing **any** missing data, pandas' `dropna` method can be used (rows are dropped by default). Note that this results in a much smaller dataframe than the original.

In [None]:
df.dropna()
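The sketch below illustrates a few `dropna` options on made-up data (the `toy` dataframe and its columns are assumptions for illustration). It shows the default row-wise behaviour, restricting the check to one column with `subset`, and dropping columns instead of rows with `axis`:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (not the workshop dataset)
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "fuel": ["Coal", np.nan, "Gas"],
    "units": [1.0, 2.0, np.nan],
})

rows_all_present = toy.dropna()                  # drop rows with any missing value
rows_fuel_present = toy.dropna(subset=["fuel"])  # only check the "fuel" column
cols_all_present = toy.dropna(axis="columns")    # drop columns with any missing value

print(len(toy), len(rows_all_present), len(rows_fuel_present))
print(list(cols_all_present.columns))
```

Note that `dropna` returns a new dataframe and leaves `toy` unchanged, so the result needs to be assigned to a variable to be kept.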

## Cleaning erroneous text

We learned earlier about different data types such as strings (text) and integers (whole numbers). A mixture of data types can be more problematic in programming languages than in Excel, which often silently and dynamically converts data types.

Take a look below, where we use the `.unique` method to show all unique values in the `GENERATORNUMBER` column.

In [None]:
df["GENERATORNUMBER"].unique()

It includes a `nan` value (for not a number). The quotation marks around values indicate that the numbers are actually saved as strings. This is due to the last value, where the text `<Null>` has somehow found its way into the dataset. 

Let's clean up these values, replacing `<Null>` with an actual null value.

In [None]:
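# Replace the "<Null>" placeholder text with a true missing value
df["GENERATORNUMBER"] = df["GENERATORNUMBER"].replace("<Null>", np.nan)

Now, we can change the data type to an integer. Because the column still contains missing values, we use pandas' nullable integer type `"Int64"`, as a plain `int` column cannot hold missing values:

In [None]:
# Convert the text to numbers, then to nullable integers (missing values become <NA>)
df["GENERATORNUMBER"] = pd.to_numeric(df["GENERATORNUMBER"]).astype("Int64")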

And now, if we view the unique values, we see the following:

In [None]:
df["GENERATORNUMBER"].unique()

We can also inspect the data type of the column:

In [None]:
df["GENERATORNUMBER"].dtype
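To illustrate why missing values matter when converting to integers, here is a small sketch on made-up values (the `raw` series is an assumption for illustration). It shows that a plain `int` conversion fails when missing values are present, whereas pandas' nullable `"Int64"` type keeps them as `<NA>`:

```python
import numpy as np
import pandas as pd

# Toy values for illustration only: numbers stored as text, plus two kinds of "missing"
raw = pd.Series(["1", "2", "<Null>", np.nan])

cleaned = raw.replace("<Null>", np.nan)  # turn the placeholder text into a true missing value
numeric = pd.to_numeric(cleaned)         # becomes float64, because NaN is a float
nullable_int = numeric.astype("Int64")   # integers, with <NA> marking missing values

try:
    numeric.astype(int)                  # a plain int column cannot represent NaN
    plain_int_ok = True
except (ValueError, TypeError):
    plain_int_ok = False

print(nullable_int)
print("Plain int conversion succeeded:", plain_int_ok)
```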

## Imputing values

Pandas provides multiple default methods with which missing data may be filled ([Documentation on filling missing values](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#filling-missing-values-fillna)).

One method is to simply fill the missing cells with a scalar value, such as a median:



In [None]:
median_gen = int(round(df["GENERATORNUMBER"].median()))  # generator numbers are whole numbers
df["GENERATORNUMBER"] = df["GENERATORNUMBER"].fillna(median_gen)

Note that a median can only be calculated for a column of numbers. If we had not cleaned the `GENERATORNUMBER` column by removing the `"<Null>"` text value and converting the column to integers, it would not have been possible to determine this median value.

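Another way of filling data is to forward fill or backward fill it with the `ffill()` or `bfill()` methods (written as `fillna(method='ffill')` or `fillna(method='bfill')` in older versions of pandas, a form that is now deprecated). Forward fill copies the last non-missing value before the gap, and backward fill copies the next non-missing value after it. These may be appropriate where a value is missing in the middle of a series of data, such as stock prices.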
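A small sketch of forward and backward filling on a made-up price series (the `prices` values are assumptions for illustration), using pandas' `ffill` and `bfill` methods:

```python
import numpy as np
import pandas as pd

# Toy price series for illustration only, with gaps in the middle and at the end
prices = pd.Series([10.0, np.nan, np.nan, 13.0, np.nan])

forward = prices.ffill()   # carry the last known value forward into each gap
backward = prices.bfill()  # pull the next known value backward into each gap

print(forward.tolist())   # [10.0, 10.0, 10.0, 13.0, 13.0]
print(backward.tolist())  # [10.0, 13.0, 13.0, 13.0, nan]
```

Note that the final value stays missing after a backward fill, because there is no later value to copy from.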

## Standardising categorical variables

Using pandas' `unique` method, we can list the distinct values of a categorical column in sorted order, which reveals a number of very similar items:

In [None]:
df["GENERATIONTYPE"] = df["GENERATIONTYPE"].fillna("UNKNOWN")
for i in np.sort(df["GENERATIONTYPE"].unique()):
  print(i)

We see the `"<Null>"` value again, so let's clean it.

In [None]:
df["GENERATIONTYPE"] = df["GENERATIONTYPE"].replace("<Null>", "UNKNOWN")

We can take one of a few different approaches to programmatically clean this column and group these items together. 

One approach is to do so manually, by matching a substring that all of the related values share. The following code maps all of these values to "Cogeneration" in a new column:
- Cogeneration
- Cogeneration - Spark Ignition Reciprocat
- Cogeneration - Steam Subcritical


In [None]:
df = df.assign(simple_gen_type="UNKNOWN")  # Create a new column, assign value "UNKNOWN" to all as default
df.loc[
       df["GENERATIONTYPE"].str.contains("Cogeneration"), "simple_gen_type"
] = "Cogeneration"  # Where the "GENERATIONTYPE" column contains the string "Cogeneration", assign "Cogeneration" to the "simple_gen_type" column

Taking a look at just the `GENERATIONTYPE` and `simple_gen_type` columns, we see that one value covers all of these types:

In [None]:
df[df["simple_gen_type"] == "Cogeneration"][["GENERATIONTYPE", "simple_gen_type"]]

The same can be done with strings such as "Hydroelectric", and so on.

In [None]:
df.loc[
       df["GENERATIONTYPE"].str.contains("Hydroelectric"), "simple_gen_type"
] = "Hydroelectric"  # Where the "GENERATIONTYPE" column contains the string "Hydroelectric", assign "Hydroelectric" to the "simple_gen_type" column

In [None]:
df[(df["simple_gen_type"] == "Cogeneration") | (df["simple_gen_type"] == "Hydroelectric")][["GENERATIONTYPE", "simple_gen_type"]]
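When there are many keywords, the same pattern can be wrapped in a loop. The sketch below uses made-up data (the `toy` dataframe, the `keywords` list and the labels marked as made-up are assumptions for illustration) and applies the substring match once per keyword:

```python
import pandas as pd

# Toy data for illustration only (not the workshop dataset)
toy = pd.DataFrame({"GENERATIONTYPE": [
    "Cogeneration",
    "Cogeneration - Steam Subcritical",
    "Hydroelectric - Run of River",  # made-up label
    "Wind Turbine",                  # made-up label
    "UNKNOWN",
]})

keywords = ["Cogeneration", "Hydroelectric"]  # hypothetical list of group names

toy = toy.assign(simple_gen_type="UNKNOWN")  # default value for every row
for keyword in keywords:
    # regex=False treats the keyword as literal text rather than a pattern
    matches = toy["GENERATIONTYPE"].str.contains(keyword, regex=False)
    toy.loc[matches, "simple_gen_type"] = keyword

print(toy)
```

Rows that match none of the keywords keep the default `"UNKNOWN"` value, which makes them easy to find and review afterwards.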

Other approaches involve some form of language processing, which is itself quite complex. Some simple methods include:
- Grabbing the first "word" (i.e. the characters before the first space), or
- Grabbing the first n characters.

Here are quick demonstrations of each:

In [None]:
df = df.assign(gentype_firstword=df["GENERATIONTYPE"].apply(lambda x: x.split(" ")[0]))

In [None]:
df[["GENERATIONTYPE", "gentype_firstword"]]

In [None]:
df = df.assign(gentype_firstfive=df["GENERATIONTYPE"].str[:5])

In [None]:
df[["GENERATIONTYPE", "gentype_firstfive"]]

Other methods might include grouping unique text values by their similarity 'distance' to each other. This quickly becomes complex, both in terms of how natural language processing defines and measures similarity and in terms of the Python implementation, so we will not get into it here.

-----

But this may help you to get started:

https://stackoverflow.com/questions/67240893/how-to-group-data-frame-with-similar-text-in-python

And an explanation of Levenshtein distance can be found here:

https://en.wikipedia.org/wiki/Levenshtein_distance
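For a flavour of the idea only, here is a rough sketch using `difflib` from Python's standard library. Note that `SequenceMatcher.ratio` is a different similarity measure from Levenshtein distance, and that the `labels`, the `similarity` and `group_similar` helpers, and the `0.8` threshold are all assumptions chosen for illustration:

```python
from difflib import SequenceMatcher

# Made-up labels for illustration only
labels = [
    "Cogeneration",
    "Cogeneration - Steam Subcritical",
    "Hydroelectric",
    "Hydro-electric",
    "Wind Turbine",
]

def similarity(a: str, b: str) -> float:
    """Return a score between 0 (nothing in common) and 1 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group_similar(values, threshold=0.8):
    """Greedily place each value in the first group whose first member is similar enough."""
    groups = []
    for value in values:
        for group in groups:
            if similarity(value, group[0]) >= threshold:
                group.append(value)
                break
        else:  # no existing group was similar enough, so start a new one
            groups.append([value])
    return groups

groups = group_similar(labels)
for group in groups:
    print(group)
```

In this sketch the two spellings of "Hydroelectric" end up together, but the two "Cogeneration" labels do not, because their lengths differ so much. This is one reason similarity-based grouping usually needs tuning and manual review.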

-----

As you know by now, data cleaning can involve many different types of tasks, some of which can be very time-consuming. Using a programming language such as R or Python can help to automate them, or even to tackle tasks which might otherwise be impossible.