We rarely get our data in just the form we want.  In this worksheet we will investigate some techniques for making relatively simple adjustments that help us get _from our input data into a data frame that is ready for us to use_.

We have some data in the file `iris.csv` which has been hand-transcribed from notebooks and contains a few errors.  To clean the data we will:
  * Load up the data
  * Understand the data
  * Remove empty values
  * Fix format errors
  * Fix incorrect data
  * Remove duplicates

After this, we are ready to visualise the data (which we will learn more about next week).

In [1]:
import pandas as pd

# load data - notice that iris data has no index, so we will use a fresh one
iris = pd.read_csv("data/iris.csv")
iris


Unnamed: 0,sepal length,sepal width,petal length,petal width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
147,6.7,3.0,5.2,2.3,Iris-virginica
148,6.3,2.5,5.0,1.9,Iris-virginica
149,6.5,3.0,5.2,2.0,Iris-virginica
150,6.2,3.4,5.4,2.3,Iris-virginica


Jupyter always gives us a preview of our data when we print it, but we can specifically ask for the `head` rows, the `tail` rows and `info` about any dataframe.  `info` in particular gives us very useful information about our data.  Pay close attention to the "non-null" count and the data type of each column.


In [6]:
print(iris.head(3))
print("")
print(iris.tail(4))
print("")
print(iris.info())

   sepal length  sepal width  petal length  petal width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa

     sepal length  sepal width  petal length  petal width           class
148           6.3          2.5           5.0          1.9  Iris-virginica
149           6.5          3.0           5.2          2.0  Iris-virginica
150           6.2          3.4           5.4          2.3  Iris-virginica
151           5.9          3.0           5.1          1.8  Iris-virginica

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152 entries, 0 to 151
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal length  152 non-null    float64
 1   sepal width   151 non-null    float64
 2   petal length  152 non-null    float64
 3   petal width   152 non-null  



The first issue that shows up is the missing value in "sepal width".  Every other column has 152 values, but this one has just 151.  Missing values in a table are called "null" values and they can mess with our analysis if we are not aware of them.  Sometimes we do want to leave them there, but often we want to exclude that data.

**YOU MUST NEVER MODIFY YOUR SOURCE DATA**

If there is "junk" in your input file, never make changes directly in the file, for the following reasons:
  * data is often audited and modification of official data could be an infringement
  * one person's junk is another person's treasure
  * pandas can adjust the data for you easily so you can have a "clean" version without adjusting the original
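
A minimal sketch of that last point, assuming a tiny made-up CSV held in a string (the name `raw_text` and its three rows are illustrative and are not taken from `iris.csv`).  The cleaned copy lives in a new variable and is written somewhere else, while the source text is left alone.

```python
import io
import pandas as pd

# hypothetical stand-in for a source file: three rows, one missing width
raw_text = "length,width\n5.1,3.5\n4.9,\n4.7,3.2\n"

raw = pd.read_csv(io.StringIO(raw_text))

# cleaning returns a new data frame; `raw` itself is not changed
clean = raw.dropna()

# write the clean version somewhere else (here an in-memory buffer,
# in practice a file with a different name from the source)
clean_buffer = io.StringIO()
clean.to_csv(clean_buffer, index=False)

print(len(raw), "rows in the source,", len(clean), "rows in the clean copy")
```

The important habit is that the source is only ever read, never written.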

Let's start by finding that null value.  We know it is in the "sepal width" series.


We can't see it in the preview, but pandas can filter a series to keep only the null values.  To do that, we will need to understand indexing with masks.

Firstly, we get the "sepal width" series from the data frame.  Then we call a method on that series which converts every value to true or false: false if the original value was not null, true if it was.  You will have to believe me, but there is a single "True" in there.

In [13]:
mask = iris["sepal width"].isnull()

print(mask.info())

mask

<class 'pandas.core.series.Series'>
RangeIndex: 152 entries, 0 to 151
Series name: sepal width
Non-Null Count  Dtype
--------------  -----
152 non-null    bool 
dtypes: bool(1)
memory usage: 284.0 bytes
None


0      False
1      False
2      False
3      False
4      False
       ...  
147    False
148    False
149    False
150    False
151    False
Name: sepal width, Length: 152, dtype: bool

Recall that we call a series that is all booleans a "mask". Only indexes with a "True" in the mask result in a row being included in the result.
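
To see a mask at work on something small enough to read in full, here is a sketch on a made-up three-row frame (`toy` and its values are illustrative, not part of the iris data).  Summing a mask counts its True values, which is a quick way to check how many rows it will select.

```python
import pandas as pd

# hypothetical toy data with one missing width
toy = pd.DataFrame({"width": [3.5, None, 3.2], "name": ["a", "b", "c"]})

toy_mask = toy["width"].isnull()

# True counts as 1 and False as 0, so the sum is the number of nulls
null_count = toy_mask.sum()

# indexing with the mask keeps only the rows where the mask is True
null_rows = toy[toy_mask]

print(null_count)
print(null_rows)
```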

In [21]:
iris[mask]

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
128,6.2,,4.8,1.8,Iris-virginica


And thus we can see our problem.  There is a number missing in row 128.  Check the original dataframe to see for yourself.  Now that we have identified the problem, we can also use `loc` or `iloc` to see the rows in this vicinity.

In [32]:
iris.loc[120:130,:]

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
120,7.7,2.6,6.9,2.3,Iris-virginica
121,6.0,2.2,5.0,1.5,Iris-virginica
122,6.9,3.2,5.7,2.3,Iris-virginica
123,5.6,2.8,4.9,2.0,Iris-virginica
124,7.7,2.8,6.7,2.0,6.3
125,6.3,2.7,4.9,1.8,Iris-virginica
126,6.7,3.3,5.7,2.1,Iris-virginica
127,7.2,3.2,6.0,1.8,Iris-virginica
128,6.2,,4.8,1.8,Iris-virginica
129,6.1,3.0,4.9,1.8,Iris-virginica
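
`loc` and `iloc` are easy to confuse, so here is a small sketch of the difference, assuming a made-up frame whose labels start at 10 (the frame `toy` is illustrative only).  `loc` selects by index label and includes the end label, while `iloc` selects by position and excludes the end position.

```python
import pandas as pd

# hypothetical frame whose labels start at 10, so labels and positions differ
toy = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]}, index=[10, 11, 12, 13])

by_label = toy.loc[11:12]      # labels 11 and 12, end label included
by_position = toy.iloc[1:2]    # position 1 only, end position excluded

print(by_label)
print(by_position)
```

In the iris frame the labels and positions happen to be the same numbers, which is why the distinction is easy to miss there.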


# Exercise - wide petals

Identify all rows where the "petal width" is greater than 5.  How many are there?
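
As a reminder of how comparison masks are built, here is a sketch on made-up data that has nothing to do with the iris file (the `toy` frame and its column names are hypothetical).  A comparison on a series produces a mask, and `&` combines two masks row by row.

```python
import pandas as pd

# hypothetical measurements, unrelated to the iris file
toy = pd.DataFrame({
    "height": [1.2, 7.5, 0.9, 8.1],
    "colour": ["red", "red", "blue", "blue"],
})

tall = toy["height"] > 5                         # comparison gives a boolean mask
tall_and_red = tall & (toy["colour"] == "red")   # & combines masks row by row

print(toy[tall])
print(len(toy[tall_and_red]), "tall red rows")
```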

In [24]:
put your answer in this code block



## Fixing the error

What to do about the error is up to you.  Regardless of your decision, you should create a "clean" data frame, separate from the original.  Possible choices are:
  * remove that whole row
  * remove that whole column
  * choose a value for the missing entry

We will demonstrate each.


In [24]:
# remove that row

# we need the opposite mask. The `~` operator inverts every value in a boolean series
# (comparing with `mask == False` gives the same result)
mask2 = ~mask

# we can then use this as a mask to get only the rows we want.
clean_iris = iris[mask2]
clean_iris

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
147,6.7,3.0,5.2,2.3,Iris-virginica
148,6.3,2.5,5.0,1.9,Iris-virginica
149,6.5,3.0,5.2,2.0,Iris-virginica
150,6.2,3.4,5.4,2.3,Iris-virginica


In [25]:
# alternative!  Since we know exactly what row to drop, we can use the drop function

clean_iris = iris.drop(index=128)
clean_iris

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
147,6.7,3.0,5.2,2.3,Iris-virginica
148,6.3,2.5,5.0,1.9,Iris-virginica
149,6.5,3.0,5.2,2.0,Iris-virginica
150,6.2,3.4,5.4,2.3,Iris-virginica


In [26]:
# alternative - dropna does _all_ the hard work for us
clean_iris = iris.dropna()
clean_iris

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
147,6.7,3.0,5.2,2.3,Iris-virginica
148,6.3,2.5,5.0,1.9,Iris-virginica
149,6.5,3.0,5.2,2.0,Iris-virginica
150,6.2,3.4,5.4,2.3,Iris-virginica


In [27]:
# remove whole column with drop

clean_iris = iris.drop(columns="sepal width")
clean_iris

Unnamed: 0,sepal length,petal length,petal width,class
0,5.1,1.4,0.2,Iris-setosa
1,4.9,1.4,0.2,Iris-setosa
2,4.7,1.3,0.2,Iris-setosa
3,4.6,1.5,0.2,Iris-setosa
4,5.0,1.4,0.2,Iris-setosa
...,...,...,...,...
147,6.7,5.2,2.3,Iris-virginica
148,6.3,5.0,1.9,Iris-virginica
149,6.5,5.2,2.0,Iris-virginica
150,6.2,5.4,2.3,Iris-virginica


In [26]:
# choose a value for the missing entry with `fillna`

# option 1: fill with zero
clean_iris = iris.fillna(0)

# option 2: fill with the mean sepal width of all flowers
average_width = iris['sepal width'].mean()
iris_updated = iris.fillna(average_width)
display(iris_updated.loc[128])

# option 3: fill with the mean sepal width of the Iris-virginica flowers only
mask_class_virginica = iris['class'] == 'Iris-virginica'
iris_virginica_only = iris[mask_class_virginica]
average_virginica_sepal_width = iris_virginica_only['sepal width'].mean()
iris_updated = iris.fillna(average_virginica_sepal_width)
display(iris_updated.loc[128])

sepal length               6.2
sepal width           3.064238
petal length               4.8
petal width                1.8
class           Iris-virginica
Name: 128, dtype: object

sepal length               6.2
sepal width            2.98125
petal length               4.8
petal width                1.8
class           Iris-virginica
Name: 128, dtype: object

We used a couple of helper functions. [`dropna` will remove any rows with empty data](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) while [`fillna` will replace any empty values with some other value](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html).
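
Both helpers act on every column by default.  The sketch below, on a made-up two-column frame (`toy` is illustrative), shows the `subset` parameter of `dropna` and the dictionary form of `fillna`, which limit the effect to the columns you name.

```python
import pandas as pd

# hypothetical frame with a missing value in each column
toy = pd.DataFrame({
    "width": [3.5, None, 3.2],
    "note": ["ok", "ok", None],
})

# drop a row only when "width" is missing; the missing "note" is tolerated
dropped = toy.dropna(subset=["width"])

# fill each column with its own replacement value
filled = toy.fillna({"width": toy["width"].mean(), "note": "unknown"})

print(dropped)
print(filled)
```

This matters as soon as more than one column has gaps, because a single fill value such as a mean width makes no sense in a text column.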

I tend not to fill in missing values; I usually drop the whole row.  That should be your default option, but don't do it without paying attention to what is dropped and why.  Too many people just apply `dropna` without thinking.
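
One way to pay attention is to look at the rows that `dropna` would remove before removing them.  This is a sketch on made-up data (`toy` is illustrative), using `isnull().any(axis=1)` to build a mask of rows with at least one missing value.

```python
import pandas as pd

# hypothetical frame with missing values in two different rows
toy = pd.DataFrame({
    "length": [5.1, 4.9, None],
    "width": [3.5, None, 3.2],
})

# True for every row that has at least one missing value
has_null = toy.isnull().any(axis=1)

# look at what is about to go before committing to the drop
about_to_drop = toy[has_null]
print(about_to_drop)

clean = toy.dropna()
print(len(toy) - len(clean), "rows dropped")
```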

# Exercise - find errors

Identify the erroneous data in the `class` column and remove that row.

**Advanced** Imagine you had not seen the error.  What type of pandas code could you construct to find such an error?  I would suggest extracting that column as a series, then getting all the unique values in the series (`drop_duplicates` will help).

In [28]:
# find the erroneous value in the `class` column

print("put your solution here")

# both of these list the distinct values in the column
print(clean_iris['class'].drop_duplicates())
print(clean_iris['class'].unique())

# '6.3' is not a flower name, so find the row that contains it
mask_63 = clean_iris['class'] == '6.3'
display(clean_iris[mask_63])

# one option is to remove the row:
# clean_iris = clean_iris.drop(index=124)
display(clean_iris.loc[120:130])

# here we correct the value instead, working on a copy
even_cleaner_iris = clean_iris.copy()
# use .loc to set the value in a single step (this avoids chained assignment)
even_cleaner_iris.loc[124, 'class'] = 'Iris-virginica'
display(even_cleaner_iris.loc[120:130])

put your solution here
0          Iris-setosa
52     Iris-versicolor
102     Iris-virginica
124                6.3
Name: class, dtype: object
['Iris-setosa' 'Iris-versicolor' 'Iris-virginica' '6.3']


Unnamed: 0,sepal length,sepal width,petal length,petal width,class
124,7.7,2.8,6.7,2.0,6.3


Unnamed: 0,sepal length,sepal width,petal length,petal width,class
120,7.7,2.6,6.9,2.3,Iris-virginica
121,6.0,2.2,5.0,1.5,Iris-virginica
122,6.9,3.2,5.7,2.3,Iris-virginica
123,5.6,2.8,4.9,2.0,Iris-virginica
124,7.7,2.8,6.7,2.0,6.3
125,6.3,2.7,4.9,1.8,Iris-virginica
126,6.7,3.3,5.7,2.1,Iris-virginica
127,7.2,3.2,6.0,1.8,Iris-virginica
128,6.2,0.0,4.8,1.8,Iris-virginica
129,6.1,3.0,4.9,1.8,Iris-virginica


Unnamed: 0,sepal length,sepal width,petal length,petal width,class
120,7.7,2.6,6.9,2.3,Iris-virginica
121,6.0,2.2,5.0,1.5,Iris-virginica
122,6.9,3.2,5.7,2.3,Iris-virginica
123,5.6,2.8,4.9,2.0,Iris-virginica
124,7.7,2.8,6.7,2.0,Iris-virginica
125,6.3,2.7,4.9,1.8,Iris-virginica
126,6.7,3.3,5.7,2.1,Iris-virginica
127,7.2,3.2,6.0,1.8,Iris-virginica
128,6.2,0.0,4.8,1.8,Iris-virginica
129,6.1,3.0,4.9,1.8,Iris-virginica


# Exercise - fishing

**Advanced** Find any other erroneous data and fix it.

In [30]:
print("put your solution here")

# Idea: list the unique values in every column, one flower class at a time,
# and look for values that sit far outside the range of the others.

kinds_of_flowers = even_cleaner_iris['class'].unique()  # array of the distinct flower classes
display(kinds_of_flowers)

for flower_type in kinds_of_flowers:
    print(" -- Data for: " + flower_type + " -- ")
    mask_for_specific_flower = even_cleaner_iris['class'] == flower_type
    iris_filtered_by_flower = even_cleaner_iris[mask_for_specific_flower]

    for column_name in iris_filtered_by_flower.columns:
        print(iris_filtered_by_flower[column_name].unique())

# once a suspicious value has been spotted, a mask will show the row it is in:
# mask_22 = even_cleaner_iris['petal width'] == 22
# display(even_cleaner_iris[mask_22])
 

put your solution here


array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

 -- Data for: Iris-setosa -- 
[5.1 4.9 4.7 4.6 5.  5.4 4.4 4.8 4.3 5.8 5.7 5.2 5.5 4.5 5.3]
[3.5 3.  3.2 3.1 3.6 3.9 3.4 2.9 3.7 4.  4.4 3.8 3.3 4.1 4.2 2.3]
[1.4 1.3 1.5 1.7 1.6 1.1 1.2 1.  1.9]
[0.2 0.4 0.3 0.1 0.5 0.6]
['Iris-setosa']
 -- Data for: Iris-versicolor -- 
[7.  6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.  5.9 6.  6.1 5.6 6.7 5.8 6.2
 6.8 5.4 5.1]
[3.2 3.1 2.3 2.8 3.3 2.4 2.9 2.7 2.  3.  2.2 2.5 2.6 3.4]
[4.7 4.5 4.9 4.  4.6 3.3 3.9 3.5 4.2 3.6 4.4 4.1 4.8 4.3 5.  3.8 3.7 5.1
 3. ]
[1.4 1.5 1.3 1.6 1.  1.1 1.8 1.2 1.7]
['Iris-versicolor']
 -- Data for: Iris-virginica -- 
[6.3 5.8 7.1 6.5 7.6 4.9 7.3 6.7 7.2 6.4 6.8 5.7 7.7 6.  6.9 5.6 6.2 6.1
 7.4 7.9 5.9]
[3.3 2.7 3.  2.9 2.5 3.6 3.2 2.8 3.8 2.6 2.2 0.  3.4 3.1]
[6.  5.1 5.9 5.6 5.8 6.6 4.5 6.3 6.1 5.3 5.5 5.  6.7 6.9 5.7 4.9 4.8 6.4
 5.4 5.2]
[ 2.5  1.9  2.1  1.8  2.2  1.7  2.   2.4  2.3  1.5  1.6 22.   1.4]
['Iris-virginica']
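
Reading long lists of unique values gets tiring.  A more compact view, sketched here on made-up data (`toy` is illustrative), is the smallest and largest value of a column within each class, using `groupby` and `agg`.

```python
import pandas as pd

# hypothetical data: class "b" contains one value that is ten times too large
toy = pd.DataFrame({
    "class": ["a", "a", "a", "b", "b", "b"],
    "petal width": [0.2, 0.4, 0.3, 2.1, 22.0, 1.9],
})

# smallest and largest value of the column within each class
ranges = toy.groupby("class")["petal width"].agg(["min", "max"])
print(ranges)
```

A maximum that is far from the rest of its class, like the 22 here, is a candidate for a misplaced decimal point.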


# Exercise

There is a very useful method available on dataframes called "describe".  Below is an example of its use.

In [31]:
lithgow = pd.read_csv("data/rainfall/lithgow.csv")
lithgow.describe()


Experiment with this method on data you know well, and [check the documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html).  How do you think this method can help you find erroneous data in your DataFrames?

# Concept Summary
  * We can index into a table with a mask
  * We can use boolean operators to create useful masks
  * `head`, `tail`, and `info` are useful for learning about your data
  * `dropna`, `drop_duplicates`, `duplicated`, `fillna` are useful for cleaning your data
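
Removing duplicates follows the same pattern as removing nulls.  This is a minimal sketch on a made-up frame (`toy` is illustrative): `duplicated` gives a mask of repeated rows, and `drop_duplicates` returns a new frame without them.

```python
import pandas as pd

# hypothetical frame where the third row repeats the first
toy = pd.DataFrame({
    "length": [5.1, 4.9, 5.1],
    "width": [3.5, 3.0, 3.5],
})

# True for each row that repeats an earlier row exactly
repeat_mask = toy.duplicated()
print(toy[repeat_mask])

# a new frame with the repeats removed; `toy` is unchanged
unique_rows = toy.drop_duplicates()
print(unique_rows)
```

Whether a repeated row is a transcription error or a genuine repeat measurement is a judgement call, so inspect the repeats before dropping them.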

# Python concepts
  * `head`, `tail`, `info`, `dropna`, `drop_duplicates`, `duplicated`, `fillna` are all _methods_ on the data frame object
  * most of the methods we used returned entirely fresh values which we needed to capture in a new variable.  The original data frame was not changed by methods like `fillna`.
  * the `drop` method of a dataframe can take different parameters to do different things.  `index=` will drop a row, `columns=` will drop a column.  Both versions return a whole new DataFrame, leaving the original untouched.
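
A short sketch of that last point, assuming a made-up frame (`toy` is illustrative): both forms of `drop` hand back a new DataFrame and the shape of the original stays the same.

```python
import pandas as pd

# hypothetical frame with three rows and two columns
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

without_row = toy.drop(index=1)          # row with label 1 removed
without_column = toy.drop(columns="b")   # column "b" removed

print(toy.shape, without_row.shape, without_column.shape)
```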
