Allison Forte

DSC 540

Project: Milestone 3

July 17, 2022

# Cleaning and formatting the website data
Perform at least 5 data transformation and/or cleansing steps to your flat file data. Label each transformation (Step #1, Step #2, etc.) in your code and describe what it is doing in 1-2 sentences.

In [112]:
# Import libraries (lxml and html5lib are parsers that pd.read_html can use)

import lxml
import html5lib
import pandas as pd


# Read the website data

url1 = 'https://www.iban.com/exchange-rates'
dfs1 = pd.read_html(url1)
df_convert = dfs1[0]
print('\nThe shape of this dataset is ', df_convert.shape)
df_convert.head()


The shape of this dataset is  (31, 4)


  Currency  Currency Name  Exchange Rate = 1 EUR  Convert
0      USD      US dollar                 1.0245      NaN
1      JPY   Japanese yen               141.0100      NaN
2      BGN  Bulgarian lev                 1.9558      NaN
3      CZK   Czech koruna                24.5550      NaN
4      DKK   Danish krone                 7.4449      NaN


In [113]:
# Add a row for EUR, since every rate in the table is quoted against 1 EUR

# DataFrame.append was removed in pandas 2.0, so build a one-row frame and concatenate it
new_row = pd.DataFrame([{'Currency': 'EUR', 'Currency Name': 'Euro',
                         'Exchange Rate = 1 EUR': 1.0, 'Convert': float('nan')}])
df_convert = pd.concat([df_convert, new_row], ignore_index=True)

print('\nThe new shape of this dataset is ', df_convert.shape)
df_convert.head()


The new shape of this dataset is  (32, 4)


  Currency  Currency Name  Exchange Rate = 1 EUR  Convert
0      USD      US dollar                 1.0245      NaN
1      JPY   Japanese yen               141.0100      NaN
2      BGN  Bulgarian lev                 1.9558      NaN
3      CZK   Czech koruna                24.5550      NaN
4      DKK   Danish krone                 7.4449      NaN


In [114]:
# Load a second source, the currency-code table, to map countries to currency codes.
# Its 'Number' column is the ISO 4217 numeric code, not an exchange rate, so both sources are needed.

url2 = 'https://www.iban.com/currency-codes'
dfs2 = pd.read_html(url2)
df_codes = dfs2[0]
print('\nThe shape of this dataset is ', df_codes.shape)
df_codes.head()


The shape of this dataset is  (269, 4)


          Country        Currency  Code  Number
0     AFGHANISTAN         Afghani   AFN   971.0
1   ÅLAND ISLANDS            Euro   EUR   978.0
2         ALBANIA             Lek   ALL     8.0
3         ALGERIA  Algerian Dinar   DZD    12.0
4  AMERICAN SAMOA       US Dollar   USD   840.0


# Step 1: Change headers
Change the headers so they are consistent across the two data frames and clearly describe each column.

In [115]:
# Change headers in df_convert. Doing this in place keeps it simple.

df_convert.rename(columns = {'Currency':'Currency_code', 'Currency Name':'Currency_name', 
                             'Exchange Rate = 1 EUR':'Exchange_rate_1EUR'}, inplace = True)


# Change headers in df_codes. Doing this in place keeps it simple.

df_codes.rename(columns = {'Currency':'Currency_name', 'Code':'Currency_code'}, inplace = True)


# Show the results

print(df_convert.head())

print(df_codes.head())

  Currency_code  Currency_name  Exchange_rate_1EUR Convert
0           USD      US dollar              1.0245     NaN
1           JPY   Japanese yen            141.0100     NaN
2           BGN  Bulgarian lev              1.9558     NaN
3           CZK   Czech koruna             24.5550     NaN
4           DKK   Danish krone              7.4449     NaN
          Country   Currency_name Currency_code  Number
0     AFGHANISTAN         Afghani           AFN   971.0
1   ÅLAND ISLANDS            Euro           EUR   978.0
2         ALBANIA             Lek           ALL     8.0
3         ALGERIA  Algerian Dinar           DZD    12.0
4  AMERICAN SAMOA       US Dollar           USD   840.0
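
As a quick check that the renaming did what Step 1 intends, the sketch below renames two tiny stand-in frames and lists the column names they end up sharing. The frames `rates` and `codes` and their single rows are sample values assumed for illustration, not the scraped data, and it uses the non-in-place form of `rename` only to show the alternative.

```python
import pandas as pd

# Tiny stand-in frames with the original header names (sample values only)
rates = pd.DataFrame({'Currency': ['USD'], 'Currency Name': ['US dollar'],
                      'Exchange Rate = 1 EUR': [1.0245]})
codes = pd.DataFrame({'Country': ['AMERICAN SAMOA'], 'Currency': ['US Dollar'],
                      'Code': ['USD']})

# Without inplace=True, rename() returns a new frame that is assigned back
rates = rates.rename(columns={'Currency': 'Currency_code',
                              'Currency Name': 'Currency_name',
                              'Exchange Rate = 1 EUR': 'Exchange_rate_1EUR'})
codes = codes.rename(columns={'Currency': 'Currency_name', 'Code': 'Currency_code'})

# Columns the two frames now share; these are the candidate merge keys
shared = sorted(set(rates.columns) & set(codes.columns))
print(shared)
```

Both frames now expose `Currency_code` and `Currency_name`, which is why the later merge has to choose one key and handle the other overlapping column.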


# Step 2: Drop unnecessary columns
Remove columns that are unnecessary to the final goal (a dataframe with country name, currency name, currency code, and exchange rate).

In [116]:
# Drop unneeded columns from df_convert

df_convert.drop(columns = ['Convert'], inplace = True)


# Drop unneeded columns from df_codes

df_codes.drop(columns=['Number'], inplace = True)


# Show the results

print(df_convert.head())

print(df_codes.head())

  Currency_code  Currency_name  Exchange_rate_1EUR
0           USD      US dollar              1.0245
1           JPY   Japanese yen            141.0100
2           BGN  Bulgarian lev              1.9558
3           CZK   Czech koruna             24.5550
4           DKK   Danish krone              7.4449
          Country   Currency_name Currency_code
0     AFGHANISTAN         Afghani           AFN
1   ÅLAND ISLANDS            Euro           EUR
2         ALBANIA             Lek           ALL
3         ALGERIA  Algerian Dinar           DZD
4  AMERICAN SAMOA       US Dollar           USD


# Step 3: Check for and fix missing/bad data and duplicates
Rows with missing values cannot be matched to an exchange rate, and duplicate rows add nothing, so both should be removed.

In [117]:
# Missing data check on df_convert

df_convert.isnull().sum()

Currency_code         0
Currency_name         0
Exchange_rate_1EUR    0
dtype: int64

No missing data to remove from df_convert

In [118]:
# Missing data check on df_codes

df_codes.isnull().sum()

Country          0
Currency_name    0
Currency_code    3
dtype: int64

In [119]:
# Fixing missing data in df_codes

# thresh and subset stay at their defaults; newer pandas versions reject passing how and thresh together
df_codes.dropna(axis = 0, how = 'any', inplace = True)

df_codes.isnull().sum()

Country          0
Currency_name    0
Currency_code    0
dtype: int64

Identified and removed 3 rows with a missing Currency_code.
Inspection of the original data set revealed that these countries have no universal currency.
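
The inspection mentioned above can be done in code before anything is dropped. The following is a minimal sketch on a made-up three-row frame (`sample_codes`, including the placeholder country 'EXAMPLE TERRITORY', is hypothetical); it filters the rows whose `Currency_code` is missing and then drops only those.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real df_codes comes from the currency-codes page
sample_codes = pd.DataFrame({
    'Country': ['ALBANIA', 'EXAMPLE TERRITORY', 'ALGERIA'],
    'Currency_name': ['Lek', 'No universal currency', 'Algerian Dinar'],
    'Currency_code': ['ALL', np.nan, 'DZD'],
})

# Look at the affected rows first, so the reason for the gap is visible
missing_rows = sample_codes[sample_codes['Currency_code'].isna()]
print(missing_rows)

# Drop only rows where the merge key itself is missing, then renumber the index
cleaned = sample_codes.dropna(subset=['Currency_code']).reset_index(drop=True)
print(cleaned)
```

Restricting `dropna` to `subset=['Currency_code']` is a design choice in this sketch: it removes only rows that could never match during the merge and leaves rows with gaps in other columns alone.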

In [120]:
# Duplicate check on df_convert

convert_dup = df_convert.duplicated().any()

print('There are duplicates in df_convert: {}'.format(convert_dup))

There are duplicates in df_convert: False


In [121]:
# Duplicate check on df_codes

codes_dup = df_codes.duplicated().any()

print('There are duplicates in df_codes: {}'.format(codes_dup))

There are duplicates in df_codes: False
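
`duplicated()` with no arguments only flags rows that match in every column. Since the frames are later joined on `Currency_code`, it may also be worth checking that key on its own in the exchange-rate table. The sketch below uses an invented frame (`sample_rates`) in which two rows share a code but differ in the capitalisation of the name.

```python
import pandas as pd

# Hypothetical sample: rows 0 and 2 share a code but differ in the name's capitalisation
sample_rates = pd.DataFrame({
    'Currency_code': ['USD', 'JPY', 'USD'],
    'Currency_name': ['US dollar', 'Japanese yen', 'US Dollar'],
    'Exchange_rate_1EUR': [1.0245, 141.01, 1.0245],
})

# Whole-row comparison: no two rows are identical in every column
full_row_dup = bool(sample_rates.duplicated().any())

# Key-only comparison: the code 'USD' appears twice
key_dup = bool(sample_rates.duplicated(subset=['Currency_code']).any())

print('Full-row duplicates:', full_row_dup)
print('Duplicate currency codes:', key_dup)
```

A repeated code in the rate table would multiply the matching country rows in the merge, which a whole-row check would not reveal.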


# Step 4: Check data type and correct if needed
Confirm the exchange rate is a number rather than a string and fix if needed.
Confirm no other surprise data types.

In [122]:
# Check data types for df_convert

print(df_convert.dtypes)

Currency_code          object
Currency_name          object
Exchange_rate_1EUR    float64
dtype: object


In [123]:
# Check data types for df_codes

print(df_codes.dtypes)

Country          object
Currency_name    object
Currency_code    object
dtype: object


Exchange rate is a number and the other data types are not unexpected or problematic!
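
No conversion was needed here, but had the rate column been read as text, one possible fix is `pd.to_numeric`. The sketch below runs on a made-up frame (`sample`, with the placeholder code 'XXX' and value 'n/a') and uses `errors='coerce'` so that unparseable entries become NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical sample in which the rates arrived as strings
sample = pd.DataFrame({
    'Currency_code': ['USD', 'JPY', 'XXX'],
    'Exchange_rate_1EUR': ['1.0245', '141.01', 'n/a'],
})

# Convert to numbers; values that cannot be parsed are set to NaN
sample['Exchange_rate_1EUR'] = pd.to_numeric(sample['Exchange_rate_1EUR'], errors='coerce')

print(sample.dtypes)
print(sample['Exchange_rate_1EUR'].isna().sum(), 'value(s) could not be converted')
```

Any NaN values this introduces would then be handled by the same missing-data check used in Step 3.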

# Step 5: Combine the 2 data frames into one
Match on 'Currency_code', add 'Exchange_rate_1EUR' to df_codes

In [125]:
# Merge the 2 data sets

final_df = df_codes.merge(df_convert, on = 'Currency_code', suffixes=('', '_drop'))


# Drop the second currency_name column

final_df.drop(columns=['Currency_name_drop'], inplace = True)


# Show the final dataframe

final_df

           Country     Currency_name  Currency_code  Exchange_rate_1EUR
0    ÅLAND ISLANDS              Euro            EUR              1.0000
1          ANDORRA              Euro            EUR              1.0000
2          AUSTRIA              Euro            EUR              1.0000
3          BELGIUM              Euro            EUR              1.0000
4           CYPRUS              Euro            EUR              1.0000
...            ...               ...            ...                 ...
101        ROMANIA      Romanian Leu            RON              4.9395
102      SINGAPORE  Singapore Dollar            SGD              1.4269
103         SWEDEN     Swedish Krona            SEK             10.4964
104       THAILAND              Baht            THB             37.4920


Including the EUR row gives the final dataframe 106 rows; without it, the merge would have returned only 71.
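
The effect of the EUR row, and of the default inner join, can be reproduced on a small scale. The sketch below is an illustration on invented frames (`codes` and `rates` hold sample rows only); it uses the `validate` and `indicator` options of `merge` to confirm each code has at most one rate and to list the countries left without a match.

```python
import pandas as pd

# Hypothetical samples standing in for df_codes and df_convert
codes = pd.DataFrame({
    'Country': ['AUSTRIA', 'BELGIUM', 'SWEDEN', 'AFGHANISTAN'],
    'Currency_name': ['Euro', 'Euro', 'Swedish Krona', 'Afghani'],
    'Currency_code': ['EUR', 'EUR', 'SEK', 'AFN'],
})
rates = pd.DataFrame({
    'Currency_code': ['SEK', 'EUR'],
    'Exchange_rate_1EUR': [10.4964, 1.0],
})

# Inner join (the default); validate raises if a code appears twice in rates
inner = codes.merge(rates, on='Currency_code', validate='many_to_one')

# Left join with an indicator column shows which countries found no rate
left = codes.merge(rates, on='Currency_code', how='left', indicator=True)
unmatched = left.loc[left['_merge'] == 'left_only', 'Country'].tolist()

# Same inner join without the EUR row, to show how many countries it accounts for
without_eur = codes.merge(rates[rates['Currency_code'] != 'EUR'], on='Currency_code')

print(len(inner), 'rows with EUR;', len(without_eur), 'without')
print('No rate available for:', unmatched)
```

In this sample the EUR row alone accounts for two of the three matched countries, mirroring on a small scale why adding it raised the row count of the final dataframe.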