### Preparation

This will prepare our notebook including installing required packages and loading the data.

In [1]:
# Install additional libraries required (fsspec and s3fs) to load files through AWS S3
%%capture tmp
!pip install fsspec s3fs

# Import libraries to be used
import plotly.express as px
import numpy as np
import pandas as pd

In [2]:
# Load data from S3
df = pd.read_csv("s3://databyjp/academyxi/wk6_missing_data_example_MajorPowerStations_v2.csv")

In [3]:
# Check that the file has been properly loaded
df.head()

Unnamed: 0,OBJECTID,FEATURETYPE,DESCRIPTION,CLASS,FID,NAME,OPERATIONALSTATUS,OWNER,GENERATIONTYPE,PRIMARYFUELTYPE,PRIMARYSUBFUELTYPE,GENERATIONMW,GENERATORNUMBER,SUBURB,STATE,SPATIALCONFIDENCE,REVISED,COMMENT,LATITUDE,LONGITUDE
0,1,Power Station,A facility used for the generation of electric...,Renewable,120.0,Repulse,Operational,Hydro-Electric Corporation (Tasmania),Hydroelectric (Gravity),Water,,28.0,1,Ouse,Tasmania,5,20171211,Hydro,-42.507695,146.64696
1,2,Power Station,A facility used for the generation of electric...,Renewable,143.0,Gordon,Operational,Hydro-Electric Corporation (Tasmania),Hydroelectric (Gravity),Water,,432.0,3,Southwest,Tasmania,3,20171211,Hydro-Underground,-42.740518,145.982832
2,3,Power Station,A facility used for the generation of electric...,Renewable,134.0,John Butters,Operational,Hydro-Electric Corporation (Tasmania),Hydroelectric (Gravity),Water,,144.0,1,Queenstown,Tasmania,4,20171211,Hydro,-42.154835,145.534477
3,4,Power Station,A facility used for the generation of electric...,Renewable,128.0,Tribute,Operational,Hydro-Electric Corporation (Tasmania),Hydroelectric (Gravity),Water,,82.8,1,West Coast,Tasmania,2,20171211,Hydro-Underground,-41.81287,145.65382
4,5,Power Station,A facility used for the generation of electric...,Renewable,137.0,Bastyan,Operational,Hydro-Electric Corporation (Tasmania),Hydroelectric (Gravity),Water,,79.9,1,Tullah,Tasmania,5,20171211,Hydro,-41.735983,145.532112


## Identify missing values

Pandas includes many powerful tools to inspect and clean your data. The `.info` method will show for each column how many values are missing, as well as the data type: 

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 489 entries, 0 to 488
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   OBJECTID            489 non-null    int64  
 1   FEATURETYPE         489 non-null    object 
 2   DESCRIPTION         489 non-null    object 
 3   CLASS               489 non-null    object 
 4   FID                 486 non-null    float64
 5   NAME                489 non-null    object 
 6   OPERATIONALSTATUS   489 non-null    object 
 7   OWNER               474 non-null    object 
 8   GENERATIONTYPE      397 non-null    object 
 9   PRIMARYFUELTYPE     489 non-null    object 
 10  PRIMARYSUBFUELTYPE  187 non-null    object 
 11  GENERATIONMW        489 non-null    float64
 12  GENERATORNUMBER     403 non-null    object 
 13  SUBURB              471 non-null    object 
 14  STATE               489 non-null    object 
 15  SPATIALCONFIDENCE   489 non-null    int64  
 16  REVISED 

And at a more granular level, pandas can include many functions to identify rows or cells with missing data.

For example, the `.isna` method can be used to produce Boolean values indicating whether the data is missing.

In [5]:
df[["GENERATORNUMBER"]].isna()

Unnamed: 0,GENERATORNUMBER
0,False
1,False
2,False
3,False
4,False
...,...
484,True
485,False
486,False
487,False


The resulting series of Boolean values can be used to filter the entire dataframe. For example, the below filters the dataframe to only show rows containing missing values in the `"PRIMARYSUBFUELTYPE"` column:

In [6]:
df[df["PRIMARYSUBFUELTYPE"].isna()]

Unnamed: 0,OBJECTID,FEATURETYPE,DESCRIPTION,CLASS,FID,NAME,OPERATIONALSTATUS,OWNER,GENERATIONTYPE,PRIMARYFUELTYPE,PRIMARYSUBFUELTYPE,GENERATIONMW,GENERATORNUMBER,SUBURB,STATE,SPATIALCONFIDENCE,REVISED,COMMENT,LATITUDE,LONGITUDE
0,1,Power Station,A facility used for the generation of electric...,Renewable,120.0,Repulse,Operational,Hydro-Electric Corporation (Tasmania),Hydroelectric (Gravity),Water,,28.0,1,Ouse,Tasmania,5,20171211,Hydro,-42.507695,146.646960
1,2,Power Station,A facility used for the generation of electric...,Renewable,143.0,Gordon,Operational,Hydro-Electric Corporation (Tasmania),Hydroelectric (Gravity),Water,,432.0,3,Southwest,Tasmania,3,20171211,Hydro-Underground,-42.740518,145.982832
2,3,Power Station,A facility used for the generation of electric...,Renewable,134.0,John Butters,Operational,Hydro-Electric Corporation (Tasmania),Hydroelectric (Gravity),Water,,144.0,1,Queenstown,Tasmania,4,20171211,Hydro,-42.154835,145.534477
3,4,Power Station,A facility used for the generation of electric...,Renewable,128.0,Tribute,Operational,Hydro-Electric Corporation (Tasmania),Hydroelectric (Gravity),Water,,82.8,1,West Coast,Tasmania,2,20171211,Hydro-Underground,-41.812870,145.653820
4,5,Power Station,A facility used for the generation of electric...,Renewable,137.0,Bastyan,Operational,Hydro-Electric Corporation (Tasmania),Hydroelectric (Gravity),Water,,79.9,1,Tullah,Tasmania,5,20171211,Hydro,-41.735983,145.532112
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
482,483,Power Station,A facility used for the generation of electric...,Non Renewable,486.0,Elliott,Operational,Territory Generation,,Diesel,,1.0,,Elliott,Northern Territory,5,20170605,No details available as of 30/11/2016,-17.550368,133.539297
484,485,Power Station,A facility used for the generation of electric...,Non Renewable,499.0,Roy Hill,Operational,Alinta Energy,Spark Ignition Reciprocating Engine,Distillate,,6.0,,Newman,Western Australia,5,20171105,,-22.481238,119.947187
486,487,Power Station,A facility used for the generation of electric...,Renewable,,Gullen Range Wind Farm,Operational,Tianrun Australia,Wind Turbine,Wind,,165.5,73,Bannister,New South Wales,3,20171211,,-34.614917,149.459743
487,488,Power Station,A facility used for the generation of electric...,Renewable,,Mortons Lane Wind Farm,Operational,China Guangdong Nuclear Wind Energy Company,Wind Turbine,Wind,,20.0,13,Nareeb,Victoria,3,20171211,,-37.841342,142.465313


This can be used to assign the resulting clean(er) dataframes to a new variable.

For example, we can invert the selection, or use the `notna` method to exclude rows with missing data.


In [7]:
df_a = df[-df["GENERATORNUMBER"].isna()]
df_b = df[df["GENERATORNUMBER"].notna()]
df_a.equals(df_b)  # Method to check if two dataframes are identical

True

To drop rows or columns containing **any** missing data, Pandas' `dropna` method can be used. Note that this results in a very small dataframe compared to the original dataframe.

In [8]:
df.dropna()

Unnamed: 0,OBJECTID,FEATURETYPE,DESCRIPTION,CLASS,FID,NAME,OPERATIONALSTATUS,OWNER,GENERATIONTYPE,PRIMARYFUELTYPE,PRIMARYSUBFUELTYPE,GENERATIONMW,GENERATORNUMBER,SUBURB,STATE,SPATIALCONFIDENCE,REVISED,COMMENT,LATITUDE,LONGITUDE
29,30,Power Station,A facility used for the generation of electric...,Renewable,83.0,Claytons,Operational,Energy Developments LFG (Victoria) Pty Ltd,Spark Ignition Reciprocating Engine,Biogas,Landfill Methane,11.000,11,Clayton South,Victoria,5,20171211,Landfill Gas,-37.950200,145.118722
30,31,Power Station,A facility used for the generation of electric...,Renewable,71.0,Springvale,Operational,Energy Developments LFG (Victoria) Pty Ltd,Spark Ignition Reciprocating Engine,Biogas,Landfill Methane,4.200,6,South,Victoria,4,20171211,Landfill Gas,-37.973297,145.139635
31,32,Power Station,A facility used for the generation of electric...,Renewable,77.0,Hallam Road,Operational,LMS Energy Generation Pty Ltd,Spark Ignition Reciprocating Engine,Biogas,Landfill Methane,8.984,2,Hampton Park,Victoria,4,20171211,Landfill Gas,-38.053453,145.269950
32,33,Power Station,A facility used for the generation of electric...,Non Renewable,100.0,Yallourn,Operational,TRUenergy,Steam Subcritical,Coal,Brown Coal,1480.000,4,Yallourn,Victoria,5,20171211,Coal,-38.177015,146.342783
35,36,Power Station,A facility used for the generation of electric...,Non Renewable,146.0,Loy Yang A,Operational,GEAC Great Energy Alliance Corporation,Steam Subcritical,Coal,Brown Coal,2180.000,4,Loy Yang,Victoria,5,20171211,Coal,-38.253600,146.574569
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
466,467,Power Station,A facility used for the generation of electric...,Renewable,469.0,Alice Springs Airport,Operational,Alice Springs Airport,Solar Photovoltaic,Solar,Photovoltaic,0.240,28,Connellan,Northern Territory,5,20140107,Location photo on www.alicesolarcity.com.au/si...,-23.796570,133.897121
467,468,Power Station,A facility used for the generation of electric...,Renewable,470.0,Uterne Solar,Operational,Epuron,Solar Photovoltaic,Solar,Photovoltaic,1.000,3048,Arumbera,Northern Territory,5,20140107,NT Power and Water Corporation News item state...,-23.768517,133.868210
469,470,Power Station,A facility used for the generation of electric...,Renewable,472.0,Ballarat Solar Park,Operational,Central Victoria Solar City Consortium,Solar Photovoltaic,Solar,Photovoltaic,0.300,2000,Mitchell Park,Victoria,5,20140107,"Solar Park is at Ballarat Airport, visible Goo...",-37.514371,143.785473
470,471,Power Station,A facility used for the generation of electric...,Renewable,473.0,Liddell Solar Thermal,Operational,Macquarie Generation,Solar Thermal,Solar,Thermal,9.300,1000,Liddell,New South Wales,5,20140107,Macquarie Generation website verifies/updates ...,-32.376069,150.979667


## Cleaing erroneous text

We learned earlier about different data types such as strings (text) and integers (whole numbers). A mixture of data types can be a particular problem in programming languages more than in Excel. 

Take a look below, where we use the `.unique` method to show all unique values in the `GENERATORNUMBER` column.

In [9]:
df["GENERATORNUMBER"].unique()

array(['1', '3', '2', '4', '6', '14', '20', '140', nan, '128', '35', '11',
       '23', '5', '30', '37', '55', '45', '12', '34', '33', '9', '48',
       '111', '54', '15', '47', '29', '25', '32', '63', '27', '7', '31',
       '8', '67', '53', '10', '22', '16', '17', '24', '18', '13', '40',
       '36', '900', '1350', '56', '5000', '755', '210', '420', '1328',
       '1800', '1584', '574', '3500', '1260', '1326', '28', '3048',
       '2000', '1000', '73', '46', '<Null>'], dtype=object)

It includes a `nan` value (for not a number). The quotation marks around values indicate that the numbers are actually saved as strings. This is due to the last value, where the text `<Null>` has somehow found its way into the dataset. 

Let's clean up these values, replacing `<Null>` with an actuall null value.

In [10]:
df["GENERATORNUMBER"] = df["GENERATORNUMBER"].replace("<Null>", np.nan)
df["GENERATORNUMBER"] = df["GENERATORNUMBER"].replace(np.nan, None)

Now, we can change the data type to an interger

In [11]:
df["GENERATORNUMBER"] = df["GENERATORNUMBER"].astype(int)

And now, if we view the unique values, we see the following:

In [12]:
df["GENERATORNUMBER"].unique()

array([   1,    3,    2,    4,    6,   14,   20,  140,  128,   35,   11,
         23,    5,   30,   37,   55,   45,   12,   34,   33,    9,   48,
        111,   54,   15,   47,   29,   25,   32,   63,   27,    7,   31,
          8,   67,   53,   10,   22,   16,   17,   24,   18,   13,   40,
         36,  900, 1350,   56, 5000,  755,  210,  420, 1328, 1800, 1584,
        574, 3500, 1260, 1326,   28, 3048, 2000, 1000,   73,   46])

We can inspect the data type also:

In [13]:
type(df["GENERATORNUMBER"][0])

numpy.int64

## Imputing values

Pandas provides multiple default methods with which missing data may be filled ([Documentation on filling missing values](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#filling-missing-values-fillna)).

One method is to simply fill the missing cells with a scalar value, such as a median:



In [14]:
df["GENERATORNUMBER"] = df["GENERATORNUMBER"].fillna(df["GENERATORNUMBER"].median())

Note that a median value is only able to be calculated for a column of numbers. If we had not cleaned the `GENERATORNUMBER` column by removing the `"<Null>"` text value and converted the column to a set of integers, this median value would not have been possible to determine.

Another way of filling data is to forward/backward fill data `fillna(method='ffill')` or `fillna(method='bfill')`, where the last non-blank value forward or backward of the blank value is used. These may be appropriate where a value is missing in the middle of a series of data, such as stock prices.

## Standarding categorical variables

Using pandas' `unique` method, we can sort the list of categorical variables like so, which will show a number of very similar items:

In [15]:
df["GENERATIONTYPE"] = df["GENERATIONTYPE"].fillna("UNKNOWN")
for i in np.sort(df["GENERATIONTYPE"].unique()):
  print(i)

<Null>
Cogeneration
Cogeneration - Spark Ignition Reciprocat
Cogeneration - Steam Subcritical
Combined Cycle
Combined Cycle Gas Turbine
Compression Reciprocating Engine
Gas Turbine
Hydroelectric (Gravity)
Hydroelectric (Pumped Storage)
Hydroelectric (Run of River)
Open Cycle
Open Cycle Gas Turbine
Reciprocating Engine
Reciprocating Engines
Solar Photovoltaic
Solar Thermal
Spark Ignition Reciprocating Engine
Steam Sub-Critical
Steam Subcritical
Steam Super Critical
Steam Turbine
UNKNOWN
Wind Turbine


We see the `"<Null>"` value again, so let's clean it.

In [16]:
df["GENERATIONTYPE"] = df["GENERATIONTYPE"].replace("<Null>", "UNKNOWN")

We can take one of a few different approaches to programmatically clean this column and group these items together. 

One is to manually do so, by using a common string which is used by all common column values. The following code will simplify:
- Cogeneration
- Cogeneration - Spark Ignition Reciprocat
- Cogeneration - Steam Subcritical
All to "Cogeneration" in a new column


In [17]:
df = df.assign(simple_gen_type="UNKNOWN")  # Create a new column, assign value "UNKNOWN" to all as default
df.loc[
       df["GENERATIONTYPE"].str.contains("Cogeneration"), "simple_gen_type"
] = "Cogeneration"  # Where the "GENERATIONTYPE" column contains the string "Cogeneration", assign "Cogeneration" to the "simple_gen_type" column

Taking a look at just the `GENERATIONTYPE` and `simple_gen_type` columns, we see that one values covers all of these types:

In [18]:
df[df["simple_gen_type"] == "Cogeneration"][["GENERATIONTYPE", "simple_gen_type"]]

Unnamed: 0,GENERATIONTYPE,simple_gen_type
68,Cogeneration,Cogeneration
120,Cogeneration - Steam Subcritical,Cogeneration
123,Cogeneration - Steam Subcritical,Cogeneration
127,Cogeneration - Steam Subcritical,Cogeneration
141,Cogeneration - Spark Ignition Reciprocat,Cogeneration
178,Cogeneration - Steam Subcritical,Cogeneration
192,Cogeneration - Steam Subcritical,Cogeneration
269,Cogeneration - Steam Subcritical,Cogeneration
275,Cogeneration,Cogeneration
311,Cogeneration,Cogeneration


The same can be done with strings such as "Hydroelectric", and so on.

In [19]:
df.loc[
       df["GENERATIONTYPE"].str.contains("Hydroelectric"), "simple_gen_type"
] = "Hydroelectric"  # Where the "GENERATIONTYPE" column contains the string "Hydroelectric", assign "Hydroelectric" to the "simple_gen_type" column

In [20]:
df[(df["simple_gen_type"] == "Cogeneration") | (df["simple_gen_type"] == "Hydroelectric")][["GENERATIONTYPE", "simple_gen_type"]]

Unnamed: 0,GENERATIONTYPE,simple_gen_type
0,Hydroelectric (Gravity),Hydroelectric
1,Hydroelectric (Gravity),Hydroelectric
2,Hydroelectric (Gravity),Hydroelectric
3,Hydroelectric (Gravity),Hydroelectric
4,Hydroelectric (Gravity),Hydroelectric
...,...,...
340,Hydroelectric (Gravity),Hydroelectric
441,Cogeneration,Cogeneration
442,Cogeneration,Cogeneration
443,Cogeneration,Cogeneration


Other approaches to this will involves some form of langugage processing, which in itself is quite complex. Some simple methods might include:
- Grabbing the first "word" (i.e. characters before a space), 
- Grabbing the firt n characters, or

Here are quick demonstrations of each:

In [21]:
df = df.assign(gentype_firstword=df["GENERATIONTYPE"].apply(lambda x: x.split(" ")[0]))

In [22]:
df[["GENERATIONTYPE", "gentype_firstword"]]

Unnamed: 0,GENERATIONTYPE,gentype_firstword
0,Hydroelectric (Gravity),Hydroelectric
1,Hydroelectric (Gravity),Hydroelectric
2,Hydroelectric (Gravity),Hydroelectric
3,Hydroelectric (Gravity),Hydroelectric
4,Hydroelectric (Gravity),Hydroelectric
...,...,...
484,Spark Ignition Reciprocating Engine,Spark
485,Solar Photovoltaic,Solar
486,Wind Turbine,Wind
487,Wind Turbine,Wind


In [23]:
df = df.assign(gentype_firstfive=df["GENERATIONTYPE"].str[:5])

In [24]:
df[["GENERATIONTYPE", "gentype_firstfive"]]

Unnamed: 0,GENERATIONTYPE,gentype_firstfive
0,Hydroelectric (Gravity),Hydro
1,Hydroelectric (Gravity),Hydro
2,Hydroelectric (Gravity),Hydro
3,Hydroelectric (Gravity),Hydro
4,Hydroelectric (Gravity),Hydro
...,...,...
484,Spark Ignition Reciprocating Engine,Spark
485,Solar Photovoltaic,Solar
486,Wind Turbine,Wind
487,Wind Turbine,Wind


In [None]:
Other methods might include grouping unique texts by their similarity 'distance' to each other. This begins to become quite complex both in terms of the natural langugage processing definition of how to measure similarity as well as python implementation, so we will not get into it here. 

-----

But this may help you to get started:

https://stackoverflow.com/questions/67240893/how-to-group-data-frame-with-similar-text-in-python 

And an explanation of Levenshtein distance can be found here:

https://en.wikipedia.org/wiki/Levenshtein_distance 

