# COUNTRIES DATABASE

### General instructions
- Create a folder 'downloads' as a sibling of 'pygen'
- Unzip the data file [https://island.ricerca.di.unimi.it/~alfio/gendata.zip](https://island.ricerca.di.unimi.it/~alfio/gendata.zip) into 'downloads', so that its contents are accessible as '../../downloads/[folder]/[file]'
- Install pycountry with <code>pip install pycountry</code>

In [3]:
import numpy as np
import pandas as pd
import pycountry

## Data

### Countries and regions

In [12]:
it = pycountry.countries.get(alpha_2='IT')
print (it)

Country(alpha_2='IT', alpha_3='ITA', name='Italy', numeric='380', official_name='Italian Republic')


In [13]:
s1 = list(pycountry.subdivisions.get(country_code=it.alpha_2))[0]
print (s1)

Subdivision(code='IT-CS', country_code='IT', name='Cosenza', parent='78', parent_code='IT-78', type='Province')


In [14]:
s2 = pycountry.subdivisions.get(code=s1.parent_code)
print (s2)

Subdivision(code='IT-78', country_code='IT', name='Calabria', parent_code=None, type='Region')


### Cities

In [24]:
city_file = '../../downloads/countries/worldcitiespop.csv'
city_names = pd.read_csv(city_file, dtype={
    'Country': str, 'City': str, 'AccentCity': str, 'Region': object, 
    'Population': np.float64,
    'Latitude': np.float64, 'Longitude': np.float64
})

In [25]:
city_names.head()

Unnamed: 0,Country,City,AccentCity,Region,Population,Latitude,Longitude
0,ad,aixas,Aixàs,6,,42.483333,1.466667
1,ad,aixirivali,Aixirivali,6,,42.466667,1.5
2,ad,aixirivall,Aixirivall,6,,42.466667,1.5
3,ad,aixirvall,Aixirvall,6,,42.466667,1.5
4,ad,aixovall,Aixovall,6,,42.466667,1.483333


### Stats

In [26]:
stats_files = {
    2015: '../../downloads/countries/2015.csv',
    2016: '../../downloads/countries/2016.csv',
    2017: '../../downloads/countries/2017.csv',
}
S2015 = pd.read_csv(stats_files[2015])

In [27]:
S2015.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204
3,Norway,Western Europe,4,7.522,0.0388,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176


# TASK 1: Clean data
The <code>city_names</code> dataset contains duplicates, because the <code>AccentCity</code> name may vary for the same city.
- Define an <code>ID</code> hash column using <code>Latitude</code> and <code>Longitude</code> (see [https://github.com/vinsci/geohash](https://github.com/vinsci/geohash)).
- Use <code>ID</code> as an index for the dataframe.
- Create a new dataframe of cities that keeps only the first row of <code>city_names</code> for each group of rows sharing the same <code>ID</code>.
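
The steps above can be sketched as follows. This is only one possible approach: instead of installing the linked geohash package, it uses a small pure-Python encoder written for this example, and the helper names <code>geohash_encode</code> and <code>deduplicate_cities</code> as well as the default precision of 12 characters are assumptions, not part of the assignment. The sample rows are the five rows shown by <code>city_names.head()</code>.

```python
import pandas as pd

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=12):
    """Encode a latitude/longitude pair as a geohash string (hypothetical helper)."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, value = [], 0, 0
    use_lon = True  # bits alternate between the two axes, starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        value <<= 1
        if coord >= mid:
            value |= 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits form one base-32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def deduplicate_cities(city_names, precision=12):
    """Index the frame by a geohash ID and keep the first row for each ID."""
    ids = [
        geohash_encode(lat, lon, precision)
        for lat, lon in zip(city_names["Latitude"], city_names["Longitude"])
    ]
    with_id = city_names.assign(ID=ids).set_index("ID")
    return with_id[~with_id.index.duplicated(keep="first")]


# The five rows shown by city_names.head() above
sample = pd.DataFrame({
    "Country": ["ad"] * 5,
    "City": ["aixas", "aixirivali", "aixirivall", "aixirvall", "aixovall"],
    "Latitude": [42.483333, 42.466667, 42.466667, 42.466667, 42.466667],
    "Longitude": [1.466667, 1.5, 1.5, 1.5, 1.483333],
})
cities = deduplicate_cities(sample)
print(cities["City"].tolist())
```

Filtering with <code>index.duplicated(keep="first")</code> keeps the first row as a whole, whereas <code>groupby("ID").first()</code> would take the first non-null value of each column separately. On the full dataset the same call would be <code>deduplicate_cities(city_names)</code>.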

# TASK 2: Prepare stats
- Combine the stats of 2015, 2016, and 2017 into a single dataframe, adding a column for the year.
- Use <code>pycountry</code> to add a column with the ISO 3166-1 alpha-2 code of each country, so that stats, cities, and country data can be linked easily.
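
A minimal sketch of these two steps is given below. The column names <code>Year</code> and <code>ISO2</code> and the helper names are assumptions made for this example. The lookup relies on <code>pycountry.countries.lookup</code>, which raises <code>LookupError</code> for names it does not recognise; such names are left empty here and would need a manual mapping. If the column names differ between the yearly files, they would have to be harmonised before stacking.

```python
import pandas as pd


def combine_stats(frames_by_year):
    """Stack the yearly frames, recording each frame's year in a 'Year' column."""
    parts = [df.assign(Year=year) for year, df in sorted(frames_by_year.items())]
    return pd.concat(parts, ignore_index=True, sort=False)


def iso2_from_name(name):
    """Return the ISO 3166-1 alpha-2 code for a country name, or None if unknown."""
    import pycountry  # imported lazily: only needed when this lookup is used

    try:
        return pycountry.countries.lookup(name).alpha_2
    except LookupError:
        return None


def add_iso2(stats, to_iso2=iso2_from_name):
    """Add an 'ISO2' column derived from the 'Country' column."""
    return stats.assign(ISO2=stats["Country"].map(to_iso2))
```

With the notebook's variables, the frames could be loaded as <code>{year: pd.read_csv(path) for year, path in stats_files.items()}</code> and passed to <code>add_iso2(combine_stats(...))</code>.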

# TASK 3: Database
- Create a relational database schema (2nd normal form) for storing data about countries, country subdivisions, cities, and stats.
- Populate the database using pandas.
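
One way to approach this task is sketched below, assuming SQLite as the database engine. The table names, column names, and the single stats column are illustrative choices, not a prescribed solution. Each non-key attribute depends on the whole key of its table (for <code>stat</code>, the pair country and year), which is what second normal form requires. The demo rows come from the outputs shown earlier in this notebook, except for the city ID, which is a placeholder standing in for the geohash from Task 1.

```python
import sqlite3

import pandas as pd

# Hypothetical schema: names and columns are assumptions for this sketch.
SCHEMA = """
CREATE TABLE country (
    iso2 TEXT PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE subdivision (
    code         TEXT PRIMARY KEY,
    country_iso2 TEXT NOT NULL REFERENCES country(iso2),
    parent_code  TEXT REFERENCES subdivision(code),
    name         TEXT NOT NULL,
    type         TEXT
);
CREATE TABLE city (
    id           TEXT PRIMARY KEY,
    country_iso2 TEXT NOT NULL REFERENCES country(iso2),
    name         TEXT NOT NULL,
    population   REAL,
    latitude     REAL NOT NULL,
    longitude    REAL NOT NULL
);
CREATE TABLE stat (
    country_iso2    TEXT NOT NULL REFERENCES country(iso2),
    year            INTEGER NOT NULL,
    happiness_score REAL,
    PRIMARY KEY (country_iso2, year)
);
"""


def build_database(conn, tables):
    """Create the schema, then append each dataframe to its table."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    # Parent tables first, so that foreign keys always point at existing rows
    for name in ("country", "subdivision", "city", "stat"):
        tables[name].to_sql(name, conn, if_exists="append", index=False)


demo_tables = {
    "country": pd.DataFrame({
        "iso2": ["IT", "AD", "CH"],
        "name": ["Italy", "Andorra", "Switzerland"],
    }),
    "subdivision": pd.DataFrame({  # parent row listed before its child
        "code": ["IT-78", "IT-CS"],
        "country_iso2": ["IT", "IT"],
        "parent_code": [None, "IT-78"],
        "name": ["Calabria", "Cosenza"],
        "type": ["Region", "Province"],
    }),
    "city": pd.DataFrame({
        "id": ["placeholder-id"],  # placeholder, not a real geohash
        "country_iso2": ["AD"],
        "name": ["Aixàs"],
        "population": [None],
        "latitude": [42.483333],
        "longitude": [1.466667],
    }),
    "stat": pd.DataFrame({
        "country_iso2": ["CH"],
        "year": [2015],
        "happiness_score": [7.587],
    }),
}

conn = sqlite3.connect(":memory:")
build_database(conn, demo_tables)
print(pd.read_sql_query(
    "SELECT c.name, s.year, s.happiness_score "
    "FROM stat AS s JOIN country AS c ON c.iso2 = s.country_iso2",
    conn,
))
```

Creating the tables explicitly and loading with <code>if_exists="append"</code> keeps the keys and foreign keys declared in the schema; letting <code>to_sql</code> create the tables itself would not declare those constraints.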