# Data Cleaning | Countries metadata

#### This notebook cleans the countries metadata file.
#### The data was retrieved from http://www.geonames.org/countries/ on 06/12/21 at 12:58 PM.

This dataset lets us observe trends across continents, since our other datasets do not record which continent a country is in.

The first attempt used data downloaded from https://www.kaggle.com/statchaitya/country-to-continent; however, there seemed to be an issue with the file's encoding, as reading it failed with the error 'invalid continuation byte'.
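
An 'invalid continuation byte' error usually means the file is not UTF-8 encoded. One possible workaround, not tried in this notebook, is to retry the read with a few other encodings. The helper name `read_csv_with_fallback` and the list of candidate encodings below are assumptions for illustration only.

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1"), **kwargs):
    """Try each encoding in turn; return (DataFrame, encoding_used)."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding, **kwargs), encoding
        except UnicodeDecodeError as error:
            # Remember the failure and move on to the next candidate.
            last_error = error
    if last_error is None:
        raise ValueError("No encodings were supplied")
    raise last_error
```

Because 'latin-1' can decode any byte sequence, it works as a last resort, but the decoded text should still be inspected for garbled characters.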

Similar data was also found at GeoNames. Whilst the website provides a tab-delimited txt file, this proved rather cumbersome, so it was quicker to copy and paste the main table on their homepage into an Excel file, adjust the headers and save it as a CSV.
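
For reference, a tab-delimited file can often be read directly with `pd.read_csv`. The sketch below is a hypothetical illustration: the inline sample, the leading `#` comment line, the absence of a header row and the column names are all assumptions, and the real file's layout may differ.

```python
import io

import pandas as pd

# Hypothetical stand-in for a tab-delimited export (two rows taken from the table below).
sample = (
    "# free-text comment lines describing the file\n"
    "AD\tAND\t20\tAN\tAndorra\tAndorra la Vella\t468.0\t77006\tEU\n"
    "AE\tARE\t784\tAE\tUnited Arab Emirates\tAbu Dhabi\t82880.0\t9630959\tAS\n"
)

# Assumed column names; "other_code" is a placeholder for the unlabelled fourth column.
columns = [
    "alpha2", "alpha3", "numeric", "other_code", "name",
    "capital_city", "area_km2", "population", "continent_code",
]

tab_df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",        # tab-delimited
    comment="#",     # skip everything after '#' on a line
    header=None,     # the sample has no header row
    names=columns,
)
```

Note that `comment="#"` also truncates any data line that contains a `#`, so it is only safe when that character never appears in the values.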

## Import Libraries

In [24]:
import pandas as pd
import numpy as np

## Import CSV

In [25]:
country_continent = pd.read_csv('raw_geo_names.csv', sep=";") #keep_default_na=False)
# The commented-out keep_default_na=False is there because pd.read_csv() was reading the continent code for North America, 'NA', as a null.
# Enabling it may change how other nulls in the dataset are read, so it was left commented out.
country_continent

Unnamed: 0,ISO-3166 alpha2,ISO-3166 alpha3,ISO-3166 numeric,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8
0,AD,AND,20,AN,Andorra,Andorra la Vella,468.0,77006,EU
1,AE,ARE,784,AE,United Arab Emirates,Abu Dhabi,82880.0,9630959,AS
2,AF,AFG,4,AF,Afghanistan,Kabul,647500.0,37172386,AS
3,AG,ATG,28,AC,Antigua and Barbuda,St. John's,443.0,96286,
4,AI,AIA,660,AV,Anguilla,The Valley,102.0,13254,
...,...,...,...,...,...,...,...,...,...
245,YE,YEM,887,YM,Yemen,Sanaa,527970.0,28498687,AS
246,YT,MYT,175,MF,Mayotte,Mamoudzou,374.0,279471,AF
247,ZA,ZAF,710,SF,South Africa,Pretoria,1219912.0,57779622,AF
248,ZM,ZMB,894,ZA,Zambia,Lusaka,752614.0,17351822,AF


In [26]:
country_continent.dtypes

ISO-3166 alpha2     object
ISO-3166 alpha3     object
ISO-3166 numeric     int64
Unnamed: 3          object
Unnamed: 4          object
Unnamed: 5          object
Unnamed: 6          object
Unnamed: 7          object
Unnamed: 8          object
dtype: object

## Understand the data

- We need to change some of the headers. 
- There are some nulls in 'Unnamed: 8' which are actually where the continent is named 'NA' (standing for North America) but this has been read as a null. 
- There may be nulls elsewhere. 
- We don't need all these columns.
- We don't have the full name of the continents.
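
The 'NA' problem noted above can also be handled at read time. A minimal sketch, assuming a small inline sample rather than the real file: passing `keep_default_na=False` together with `na_values=['']` tells pandas to treat only empty fields as null, so the string 'NA' survives while genuinely missing values are still detected. This is an alternative to the approach taken in this notebook.

```python
import io

import pandas as pd

# Hypothetical three-column sample using the same ';' separator as the raw file.
sample = (
    "name;capital_city;continent_code\n"
    "Canada;Ottawa;NA\n"
    "Antarctica;;AN\n"
)

# Default behaviour: 'NA' and the empty field are both read as nulls.
default_df = pd.read_csv(io.StringIO(sample), sep=";")

# Only empty fields are treated as nulls, so 'NA' stays a string.
strict_df = pd.read_csv(
    io.StringIO(sample),
    sep=";",
    keep_default_na=False,
    na_values=[""],
)

print(default_df["continent_code"].isna().sum())
print(strict_df["continent_code"].tolist())
```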

## Removing unnecessary columns

In [27]:
country_continent = country_continent.drop(columns=['ISO-3166 alpha2', 'ISO-3166 numeric', 'Unnamed: 3'])

## Header Names

In [28]:
country_continent = country_continent.rename(columns={"ISO-3166 alpha3": "country_code", "Unnamed: 4": "name", "Unnamed: 5": "capital_city", "Unnamed: 6": "area_km2", "Unnamed: 7": "population", "Unnamed: 8": "continent_code"})
country_continent

Unnamed: 0,country_code,name,capital_city,area_km2,population,continent_code
0,AND,Andorra,Andorra la Vella,468.0,77006,EU
1,ARE,United Arab Emirates,Abu Dhabi,82880.0,9630959,AS
2,AFG,Afghanistan,Kabul,647500.0,37172386,AS
3,ATG,Antigua and Barbuda,St. John's,443.0,96286,
4,AIA,Anguilla,The Valley,102.0,13254,
...,...,...,...,...,...,...
245,YEM,Yemen,Sanaa,527970.0,28498687,AS
246,MYT,Mayotte,Mamoudzou,374.0,279471,AF
247,ZAF,South Africa,Pretoria,1219912.0,57779622,AF
248,ZMB,Zambia,Lusaka,752614.0,17351822,AF


## Nulls

In [29]:
# Counting the nulls
country_continent.isnull().sum(axis = 0)

country_code       0
name               0
capital_city       6
area_km2           0
population         0
continent_code    41
dtype: int64

In [30]:
# North America is not a null.
# First double-check that all of these rows are meant to be 'NA' and are not genuine nulls.
country_continent[country_continent['continent_code'].isna()]
# The rows shown all appear to be in North America.

Unnamed: 0,country_code,name,capital_city,area_km2,population,continent_code
3,ATG,Antigua and Barbuda,St. John's,443.0,96286,
4,AIA,Anguilla,The Valley,102.0,13254,
13,ABW,Aruba,Oranjestad,193.0,105845,
17,BRB,Barbados,Bridgetown,431.0,286641,
25,BLM,Saint Barthélemy,Gustavia,21.0,845,
26,BMU,Bermuda,Hamilton,53.0,63968,
29,BES,"Bonaire, Sint Eustatius, and Saba",,328.0,18012,
31,BHS,Bahamas,Nassau,13940.0,38564,
36,BLZ,Belize,Belmopan,22966.0,383071,
37,CAN,Canada,Ottawa,9984670.0,37058856,


In [31]:
# Change nulls to the string 'NA'.
country_continent['continent_code'] = country_continent['continent_code'].fillna(value='NA')

In [32]:
# Check that it worked.
fillna_result = country_continent['continent_code'].isnull().sum()
if fillna_result != 0:
    raise Exception("Error filling North America nulls")

In [33]:
country_continent

Unnamed: 0,country_code,name,capital_city,area_km2,population,continent_code
0,AND,Andorra,Andorra la Vella,468.0,77006,EU
1,ARE,United Arab Emirates,Abu Dhabi,82880.0,9630959,AS
2,AFG,Afghanistan,Kabul,647500.0,37172386,AS
3,ATG,Antigua and Barbuda,St. John's,443.0,96286,NA
4,AIA,Anguilla,The Valley,102.0,13254,NA
...,...,...,...,...,...,...
245,YEM,Yemen,Sanaa,527970.0,28498687,AS
246,MYT,Mayotte,Mamoudzou,374.0,279471,AF
247,ZAF,South Africa,Pretoria,1219912.0,57779622,AF
248,ZMB,Zambia,Lusaka,752614.0,17351822,AF


In [34]:
#Nulls in capital cities.
country_continent[country_continent['capital_city'].isna()]

Unnamed: 0,country_code,name,capital_city,area_km2,population,continent_code
8,ATA,Antarctica,,14000000.0,0,AN
29,BES,"Bonaire, Sint Eustatius, and Saba",,328.0,18012,NA
33,BVT,Bouvet Island,,49.0,0,AN
95,HMD,Heard and McDonald Islands,,412.0,0,AN
219,TKL,Tokelau,,10.0,1466,OC
231,UMI,U.S. Outlying Islands,,0.0,0,OC


In [35]:
# These probably don't have capitals, so we fill them with a single-space string ' ' rather than leaving nulls.
country_continent['capital_city'] = country_continent['capital_city'].fillna(value=' ')

In [36]:
# No more nulls.
country_continent.isnull().sum(axis = 0)

country_code      0
name              0
capital_city      0
area_km2          0
population        0
continent_code    0
dtype: int64

## Continent Names

In [37]:
country_continent['continent_code'].unique()

array(['EU', 'AS', 'NA', 'AF', 'AN', 'SA', 'OC'], dtype=object)

In [38]:
cont_dict = {'EU' : 'Europe',
             'AS' : 'Asia',
             'AF' : 'Africa',
             'NA' : 'North America',
             'AN' : 'Antarctica',
             'SA' : 'South America',
             'OC' : 'Oceania'}

country_continent['continent_name'] = country_continent['continent_code'].map(cont_dict)
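
`Series.map` returns NaN for any value that has no key in the dictionary, so a missing code would silently become a null continent name. A minimal sketch of a coverage check, using a hypothetical series that contains an unmapped code 'XX':

```python
import pandas as pd

# Hypothetical codes; 'XX' has no entry in the dictionary on purpose.
codes = pd.Series(["EU", "NA", "AN", "XX"], name="continent_code")
names = {"EU": "Europe", "NA": "North America", "AN": "Antarctica"}

mapped = codes.map(names)

# Codes whose mapped value is null were not found in the dictionary.
unmapped = sorted(codes[mapped.isna()].unique())
print(unmapped)
```

An empty `unmapped` list would indicate that every code in the column has a full name.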

In [39]:
country_continent

Unnamed: 0,country_code,name,capital_city,area_km2,population,continent_code,continent_name
0,AND,Andorra,Andorra la Vella,468.0,77006,EU,Europe
1,ARE,United Arab Emirates,Abu Dhabi,82880.0,9630959,AS,Asia
2,AFG,Afghanistan,Kabul,647500.0,37172386,AS,Asia
3,ATG,Antigua and Barbuda,St. John's,443.0,96286,NA,North America
4,AIA,Anguilla,The Valley,102.0,13254,NA,North America
...,...,...,...,...,...,...,...
245,YEM,Yemen,Sanaa,527970.0,28498687,AS,Asia
246,MYT,Mayotte,Mamoudzou,374.0,279471,AF,Africa
247,ZAF,South Africa,Pretoria,1219912.0,57779622,AF,Africa
248,ZMB,Zambia,Lusaka,752614.0,17351822,AF,Africa


## Renaming Attributes
For consistency with other datasets.

In [40]:
country_continent = country_continent.rename(columns={"country_code": "Country Code",
                                                                "name": "Name",
                                                                "capital_city": "Capital City",
                                                                "area_km2": "Area Km2",
                                                                "population": "Population",
                                                                "continent_code": "Continent Code",
                                                                "continent_name": "Continent Name"})

## Export Data to CSV

In [41]:
country_continent.to_csv('clean_country_metadata.csv')
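
Two things are worth keeping in mind when the exported file is read back: by default `to_csv` also writes the index, which reappears as an 'Unnamed: 0' column, and the string 'NA' is again parsed as a null. The sketch below illustrates both effects on a hypothetical two-row frame and an in-memory buffer; `index=False` and the `keep_default_na`/`na_values` arguments are one possible way to avoid them, not what this notebook does.

```python
import io

import pandas as pd

# Hypothetical miniature of the cleaned data.
clean = pd.DataFrame({"Name": ["Canada", "Andorra"], "Continent Code": ["NA", "EU"]})

# Default export and import: index column and 'NA' both cause surprises.
with_index = io.StringIO()
clean.to_csv(with_index)
with_index.seek(0)
naive = pd.read_csv(with_index)

# Export without the index, then read back treating only empty fields as null.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, keep_default_na=False, na_values=[""])

print(list(naive.columns))
print(reloaded["Continent Code"].tolist())
```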