# Data Cleaning Example Using the Foreign Exchange Rate Data

Datafile: "14_Foreign_Exchange_Rates _WithErrors.csv" - Note that this is a version of the same dataset that we've used before except that we added some errors.

2020-10-20 - Jingwei Liu
<br>2022-10-17 - Jeff Smith

In [None]:
# Import the tools: numpy, pandas, and matplotlib
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## In the beginning, understanding the data is very important
If you know the meaning of each column, the expected data type of each column, and the range of values for each column (if applicable), cleaning the data becomes much easier.

## First, let's read the data as a pandas dataframe
Generally, a pandas dataframe provides an easy way to check the data. After learning about pandas dataframes, you should know that the columns of a dataframe are homogeneous (in type), which means all cells in the same column have the same data type.
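
As a small illustration of why this matters for cleaning, the sketch below uses a made-up dataframe (the column names `rate` and `count` and all values are hypothetical). A single stray string is enough to make pandas store the whole column as text rather than as numbers.

```python
import pandas as pd

# Toy data (hypothetical): one stray string among otherwise numeric-looking values
toy = pd.DataFrame({"rate": ["1.59", "1.60", "ND"], "count": [1, 2, 3]})

# The column with the stray string is not numeric; the clean column is
rate_is_numeric = pd.api.types.is_numeric_dtype(toy["rate"])
count_is_numeric = pd.api.types.is_numeric_dtype(toy["count"])
print(toy.dtypes)
print(rate_is_numeric, count_is_numeric)
```

Checking `df.dtypes` right after reading a file is therefore a quick first test for columns that may contain bad values.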

In [None]:
# Read the dataset as a dataframe
fname = "../data/14_Foreign_Exchange_Rates _WithErrors.csv"
df = pd.read_csv(fname)
df.head()

We see that the first column should be an index and the second column should be a date. All other columns are numbers because they are monetary exchange rates.

## Then, let's check whether all cells in the exchange rate columns are numbers, as expected
Generally, we should clean the data column by column rather than row by row. Let's take the column "AUSTRALIA - AUSTRALIAN DOLLAR/US$" as an example.

In [None]:
# Check the data type of the column. We can just check the first element in the column.
type(df['AUSTRALIA - AUSTRALIAN DOLLAR/US$'][0])

In [None]:
# Or use the describe() function to check the data type
df['AUSTRALIA - AUSTRALIAN DOLLAR/US$'].describe()

Based on our understanding of the data, the data type of each cell should be numeric (float) rather than string (str). In addition, the result shows that some cells contain non-number-string values (ND). Let's first find those non-number-strings.

## So, how do we find the non-number-strings?
### Assume we have two strings: '123' and 'abc'. How can we distinguish them?
In Python, the string '123' can be converted to a number using the float() function, while the string 'abc' cannot.

In [None]:
str1 = '123'
str2 = 'abc'

In [None]:
float(str1)

In [None]:
float(str2)

The error is useful in interactive mode, but we need something that indicates the problem without failing. One option is exception handling. You can find a good overview of this method here: https://docs.python.org/3/tutorial/errors.html

In [None]:
# Simple exception handling
try:
    float(str2)
except ValueError:
    print('could not convert string {} to float'.format(str2))

Note that the error (called an *exception*) was "caught", so execution continued instead of stopping.

Using this method, we can use the float() function to distinguish non-number-strings without triggering an error. The "try except" structure is very powerful in Python (see the link above for a more detailed introduction).
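
The same idea can be wrapped in a small reusable test. This is only a sketch: the helper name `is_number` is hypothetical and not part of the original notebook.

```python
def is_number(text):
    """Return True if float() can convert text, False otherwise."""
    try:
        float(text)
        return True
    except (ValueError, TypeError):
        # ValueError: a string that is not a number; TypeError: e.g. None
        return False

checks = [is_number(s) for s in ["123", "abc", "1.5e3", "ND"]]
print(checks)  # [True, False, True, False]
```

Keep in mind that float() also accepts strings such as 'nan' and 'inf', so a test like this treats them as numbers.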

## Now, let's try to locate the cells that contain non-number-strings
Let's define a function that returns the indices of the cells that contain non-number-strings:

In [None]:
# Define a function that returns the indices of the non-number-strings
# column : a column from a dataframe (assumes the default 0, 1, 2, ... index)
def CheckIfNumInCol(column):
    # create an empty list to store the indices
    indexlist = []
    # loop over every cell; len() counts all cells, whereas count() skips missing ones
    for i in range(len(column)):
        # check whether float() can convert the cell
        try:
            float(column[i])
        except ValueError:
            # float() failed, so the cell contains a non-number-string; record its index
            indexlist.append(i)
    return indexlist

In [None]:
# Use the above function and get the indices of non-number strings 
resultlist = CheckIfNumInCol(df['AUSTRALIA - AUSTRALIAN DOLLAR/US$'])
resultlist

In [None]:
# Now filter the original dataframe using the list of indices
df['AUSTRALIA - AUSTRALIAN DOLLAR/US$'][resultlist]

Based on the result, we find that there are 199 cells that contain non-number-strings.
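
As an alternative to the explicit loop, pandas can attempt the conversion for the whole column at once. The sketch below uses a small made-up Series rather than the real column; `pd.to_numeric` with `errors="coerce"` turns every value it cannot convert into NaN, and the NaN positions are then the cells of interest.

```python
import pandas as pd

# Toy data (hypothetical values) standing in for the exchange rate column
col = pd.Series(["1.59", "ND", "1.61", "ABC", "1.60"])

# Values that cannot be converted become NaN instead of raising an error
as_numbers = pd.to_numeric(col, errors="coerce")

# Indices of the cells that failed to convert
bad_index = as_numbers[as_numbers.isna()].index.tolist()
print(bad_index)  # [1, 3]
print(col[bad_index].tolist())
```

One caveat: cells that were already missing before the conversion would also show up as NaN, so this approach does not distinguish them from non-number-strings.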

## Before we move on, you should think about this question: Do these non-number-strings have a particular meaning? Are they supposed to be number-strings? These questions are important because the answers will affect how you deal with the non-number-strings.

For this particular data, if you open the csv file and find the rows that contain 'ND', you'll find that the corresponding dates are all holidays. So, there are two potential situations: 1. There is no data for holidays, or 2. The data provider doesn't record exchange rates during holidays. You need to make a decision in order to "fix" the dataset.

In addition to the ND values, the 89th row contains 'ABC'. Since we don't see any logical reason for this, we assume that it is a "typo" and that the value should be a number-string originally.
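
If you decide that ND simply marks a missing observation, one option is to let pandas treat it as missing while reading the file. The sketch below uses made-up CSV text with hypothetical column names `date` and `rate`; `na_values` and `parse_dates` are standard `read_csv` options. A stray value such as 'ABC' is not covered by this and would still keep the column as text.

```python
import io
import pandas as pd

# Made-up CSV text (hypothetical column names and values)
csv_text = "date,rate\n2000-01-03,1.5172\n2000-01-17,ND\n2000-01-18,1.5178\n"

# Treat the marker 'ND' as a missing value and parse the date column
toy = pd.read_csv(io.StringIO(csv_text), na_values=["ND"], parse_dates=["date"])

missing_rows = toy[toy["rate"].isna()]
print(toy.dtypes)
print(missing_rows)
```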

## After finding the non-number-strings, all the remaining values are number-strings and can easily be converted into a numeric datatype. But how can we know that the cells that contain number-strings are correct?

There is no universal rule for judging "correctness" that applies to every dataset. As indicated at the beginning, understanding the data will help you make the decision. For this exchange rate data, we will plot the rate values to check whether there are seemingly abnormal values (i.e., *outliers*).

Let's first convert those number-strings to float 

In [None]:
# get the index list of the dataframe
dflist = list(df.index)

In [None]:
# subtract the non-number-string indices from the above list
leftlist = [item for item in dflist if item not in resultlist]

In [None]:
# an alternative way to find the number-string indices
# (sets are unordered, so sort the result to keep the original row order)
leftlist = sorted(set(dflist).difference(resultlist))

In [None]:
# convert the number-strings to numbers
leftdf = df['AUSTRALIA - AUSTRALIAN DOLLAR/US$'][leftlist]
leftdf = leftdf.apply(lambda x: float(x))
# check the data type 
leftdf.dtype

In [None]:
# or just have a quick look
leftdf

### Now we can do some simple plots to check the values for outliers

In [None]:
# modify the figure size
plt.rcParams['figure.figsize'] = (10.0, 5.0)
# do plot
plt.plot(leftdf,'o');
# add x and y axis label
plt.xlabel('Observation Index');
plt.ylabel('Value');

From the plot, you can see that the vast majority of values are around 2, but between index 0 and 1000 there seem to be several values greater than 16. We can also use a histogram to check this.

In [None]:
# histogram plot
plt.hist(leftdf, bins = 50);
# add x and y axis label
plt.xlabel('Value');
plt.ylabel('Frequency');

In [None]:
# find the values greater than 16
abnormal = leftdf > 16
leftdf[abnormal]

Since we know what this data represents (exchange rates for Australian dollars vs. the US dollar), we assume that these are data errors.
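
The threshold of 16 was read off the plot. As a rough sketch of a data-driven alternative, the example below flags values that are far above the median of a made-up Series. The factor of 5 is an arbitrary assumption chosen for illustration, not a rule from the original analysis.

```python
import pandas as pd

# Toy data (hypothetical values): mostly around 1.6, with two suspicious entries
rates = pd.Series([1.59, 1.60, 16.1, 1.58, 1.61, 17.0])

# The median is robust to a few extreme values
median = rates.median()

# Flag values more than 5 times the median (the factor 5 is an arbitrary choice)
suspect = rates[rates > 5 * median]
print(suspect)
```

Whatever rule is used, the flagged values still need to be judged against what the data represents.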

## Now, we've identified all of the incorrect values (non-number-strings and abnormal numbers). What should we do next?

### Two options: Replace them with other values OR just get rid of the records with wrong values
To deal with the wrong values, you can generally choose either of the above methods. Each has its drawbacks:

When replacing wrong values, you need to decide carefully what values to put into the cells. Without careful consideration, the cleaned data may have a negative influence on your future analysis.

When getting rid of the wrong values, you may end up with too few observations left for your analysis.

Here, we will show both methods.

## Replacing methods

In [None]:
# Replacing wrong values with a constant value. We will replace those wrong values with 1.6
# let's define a function that can help us do this
# cell is the value in that cell
# value is the constant value
def ReplaceWithConstantValue(cell, value = 1.6):
    try:
        newcell = float(cell)
        if newcell > 16:
            return value
        return newcell
    except (ValueError, TypeError):
        # the cell could not be converted to a number
        return value


In [None]:
# apply the function
newcol = df['AUSTRALIA - AUSTRALIAN DOLLAR/US$'].apply(lambda x: ReplaceWithConstantValue(x))
# check the resulting data
newcol

### Now we can plot the new column to check

In [None]:
# plot new column
plt.plot(newcol)

### We see that the values are in a reasonable range, but the values are not necessarily consistent. 
The inconsistency seems to be caused by replacing the ND values with a fixed value of 1.6.  Why does this seem inconsistent?  We have to understand the data to know that.

Next, we show another way: replacing each wrong value with a dynamic (imputed) value, namely the last correct value that came before it.

In [None]:
# Again, first define a function to help us do the replacement
# This function replaces each wrong value with the last correct value
# cell is the value in that cell
# value is used if the first cell is a wrong value
# Note: lastvalue is stored on the function itself, so it persists between calls to apply()
def ReplaceWithDynamicValue(cell, value = 1.6):
    # initialize the lastvalue if it doesn't exist yet
    if not hasattr(ReplaceWithDynamicValue, "lastvalue"):
        ReplaceWithDynamicValue.lastvalue = value
    try:
        newcell = float(cell)
        if newcell > 16:
            newcell = ReplaceWithDynamicValue.lastvalue
    except (ValueError, TypeError):
        newcell = ReplaceWithDynamicValue.lastvalue

    ReplaceWithDynamicValue.lastvalue = newcell
    return newcell

In [None]:
# apply the function
newcol = df['AUSTRALIA - AUSTRALIAN DOLLAR/US$'].apply(lambda x: ReplaceWithDynamicValue(x))
# plot the new column
plt.plot(newcol)

Now, the plot looks consistent with our expectations. Note that this doesn't necessarily make it "correct."
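
The same "carry the last correct value forward" idea can also be sketched with built-in pandas operations, which avoids keeping state inside a function. The example uses a made-up Series; the fallback value 1.6 and the threshold 16 mirror the choices made above.

```python
import pandas as pd

# Toy data (hypothetical values) with non-number-strings and one outlier
col = pd.Series(["ND", "1.59", "ND", "16.2", "1.61", "ABC"])

# Non-number-strings become NaN
numeric = pd.to_numeric(col, errors="coerce")
# Values above the threshold also become NaN
numeric = numeric.mask(numeric > 16)
# Fill each NaN with the last valid value; use 1.6 if there is none yet
filled = numeric.ffill().fillna(1.6)
print(filled.tolist())
```

On this toy input the result matches what ReplaceWithDynamicValue would produce when started fresh: the leading ND becomes 1.6 and every later wrong value takes the preceding correct value.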

Next, we will show the second method: getting rid of the wrong values.

## Getting rid of the incorrect values

Here, we'll start with the original dataset that has the non-numeric values (ND and ABC) and the outliers.

In [None]:
# Define a function to generate a mask that selects the correct values
def MaskCorrectValue(cell):
    try:
        newcell = float(cell)
        if newcell > 16:
            return False
        return True
    except (ValueError, TypeError):
        # the cell could not be converted to a number
        return False

In [None]:
mask = df['AUSTRALIA - AUSTRALIAN DOLLAR/US$'].apply(lambda x: MaskCorrectValue(x))
mask

In [None]:
newcol2 = df['AUSTRALIA - AUSTRALIAN DOLLAR/US$'][mask]
newcol2

In [None]:
# Now we only have "good" values, but they're still strings
newcol2 = newcol2.apply(lambda x: float(x))
newcol2.describe()

In [None]:
# The original column.
df['AUSTRALIA - AUSTRALIAN DOLLAR/US$'].describe()

So, we find that we removed 207 observations (199 non-numeric values and 8 numeric values greater than 16).

In [None]:
# And looking at the new plot
plt.plot(newcol2)
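
For comparison, the removal step can be sketched without a helper function by building the mask from a coerced numeric column. The Series below is made up, so the counts it prints are not those of the real dataset.

```python
import pandas as pd

# Toy data (hypothetical values): two good values, two non-number-strings, one outlier
col = pd.Series(["1.59", "ND", "16.2", "1.61", "ABC"])

# Non-number-strings become NaN
numeric = pd.to_numeric(col, errors="coerce")
# Keep only cells that converted successfully and are within the expected range
keep = numeric.notna() & (numeric <= 16)
cleaned = numeric[keep]

removed = len(col) - len(cleaned)
print(cleaned)
print(removed)  # 3
```

The cleaned Series keeps its original index labels, which makes it easy to trace each remaining value back to its row in the source data.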