<a href="https://colab.research.google.com/github/cstar-industries/python-3-beginner/blob/master/998-Solutions/003-Data-Structures/Data Structures - Workshop - Solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a>

## 1. Parsing CSV

CSV (comma-separated values) is one of the most well-known file formats for saving structured data.

In its simplest form, it resembles an Excel spreadsheet, where columns are separated by commas (`,`), and rows by a line break (`\n`). Often, the first line serves as a header.

For example, the following table:

|First name|Last name|Year of birth|
|-|-|-|
|Guido|van Rossum|1956|
|Rob|Pike|1956|
|Dennis|Ritchie|1941|

Would be represented as follows in CSV:

```
First name,Last name,Year of birth
Guido,van Rossum,1956
Rob,Pike,1956
Dennis,Ritchie,1941
```
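
As a minimal sketch of how these two separators map onto Python, `str.split` can cut the text first into rows and then into cells. The names `sample`, `rows` and `cells` are illustrative only, and the approach assumes that no value contains a comma or a line break itself.

```python
# The three-row example from above, stored as one multi-line string.
sample = '''First name,Last name,Year of birth
Guido,van Rossum,1956
Rob,Pike,1956
Dennis,Ritchie,1941'''

# Splitting on the line break yields one string per row.
rows = sample.split('\n')

# Splitting each row on the comma yields one list of cells per row.
cells = [row.split(',') for row in rows]

print(cells[0])  # header row
print(cells[1])  # first data row
```

Note that every cell is still a string at this point, including the years.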

Use Python to parse a long string of CSV data into a `list`. Each item of the `list` should correspond to one row of the CSV data (excluding the header row) and should be a `dict` mapping each header name to the row's value for that column. If a value consists only of digits, it should be stored in the `dict` as a number.

Using the example above, this is the expected output:

```python
[{'First name': 'Guido', 'Last name': 'van Rossum', 'Year of birth': 1956},
 {'First name': 'Rob', 'Last name': 'Pike', 'Year of birth': 1956},
 {'First name': 'Dennis', 'Last name': 'Ritchie', 'Year of birth': 1941}]
```
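
The "only digits" rule can be sketched as a small helper. The name `convert_value` is hypothetical and not part of the exercise. It assumes that "numeric" means a non-empty run of decimal digits, so negative numbers, decimals and placeholders are left as strings.

```python
def convert_value(text):
    """Return an int when text holds only decimal digits, else text unchanged."""
    # str.isdecimal() is False for the empty string and for signs or dots.
    if text.isdecimal():
        return int(text)
    return text

example_headers = ['First name', 'Last name', 'Year of birth']
example_row = ['Guido', 'van Rossum', '1956']

# zip pairs each header with the value in the same position.
record = {header: convert_value(value)
          for header, value in zip(example_headers, example_row)}
print(record)
```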

In [0]:
# This block contains a very long string of CSV data. The first line corresponds
# to the header names. Each row following the first is an entry. Transform this
# data following the exercise.
csv_data = '''Country Name,Country Code,Population
Afghanistan,AFG,37172386
Albania,ALB,2866376
Algeria,DZA,42228429
American Samoa,ASM,55465
Andorra,AND,77006
Angola,AGO,30809762
Antigua and Barbuda,ATG,96286
Argentina,ARG,44494502
Armenia,ARM,2951776
Aruba,ABW,105845
Australia,AUS,24982688
Austria,AUT,8840521
Azerbaijan,AZE,9939800
The Bahamas,BHS,385640
Bahrain,BHR,1569439
Bangladesh,BGD,161356039
Barbados,BRB,286641
Belarus,BLR,9483499
Belgium,BEL,11433256
Belize,BLZ,383071
Benin,BEN,11485048
Bermuda,BMU,63973
Bhutan,BTN,754394
Bolivia,BOL,11353142
Bosnia and Herzegovina,BIH,3323929
Botswana,BWA,2254126
Brazil,BRA,209469333
British Virgin Islands,VGB,29802
Brunei Darussalam,BRN,428962
Bulgaria,BGR,7025037
Burkina Faso,BFA,19751535
Burundi,BDI,11175378
Cabo Verde,CPV,543767
Cambodia,KHM,16249798
Cameroon,CMR,25216237
Canada,CAN,37057765
Cayman Islands,CYM,64174
Central African Republic,CAF,4666377
Chad,TCD,15477751
Channel Islands,CHI,170499
Chile,CHL,18729160
China,CHN,1392730000
Colombia,COL,49648685
Comoros,COM,832322
Dem. Rep. Congo,COD,84068091
Rep. Congo,COG,5244363
Costa Rica,CRI,4999441
Cote d'Ivoire,CIV,25069229
Croatia,HRV,4087843
Cuba,CUB,11338138
Curacao,CUW,159800
Cyprus,CYP,1189265
Czech Republic,CZE,10629928
Denmark,DNK,5793636
Djibouti,DJI,958920
Dominica,DMA,71625
Dominican Republic,DOM,10627165
Ecuador,ECU,17084357
Arab Rep. Egypt,EGY,98423595
El Salvador,SLV,6420744
Equatorial Guinea,GNQ,1308974
Eritrea,ERI,..
Estonia,EST,1321977
Eswatini,SWZ,1136191
Ethiopia,ETH,109224559
Faroe Islands,FRO,48497
Fiji,FJI,883483
Finland,FIN,5515525
France,FRA,66977107
French Polynesia,PYF,277679
Gabon,GAB,2119275
The Gambia,GMB,2280102
Georgia,GEO,3726549
Germany,DEU,82905782
Ghana,GHA,29767108
Gibraltar,GIB,33718
Greece,GRC,10731726
Greenland,GRL,56025
Grenada,GRD,111454
Guam,GUM,165768
Guatemala,GTM,17247807
Guinea,GIN,12414318
Guinea-Bissau,GNB,1874309
Guyana,GUY,779004
Haiti,HTI,11123176
Honduras,HND,9587522
China Hong Kong SAR,HKG,7451000
Hungary,HUN,9775564
Iceland,ISL,352721
India,IND,1352617328
Indonesia,IDN,267663435
Islamic Rep. Iran,IRN,81800269
Iraq,IRQ,38433600
Ireland,IRL,4867309
Isle of Man,IMN,84077
Israel,ISR,8882800
Italy,ITA,60421760
Jamaica,JAM,2934855
Japan,JPN,126529100
Jordan,JOR,9956011
Kazakhstan,KAZ,18272430
Kenya,KEN,51393010
Kiribati,KIR,115847
Dem. People’s Rep. Korea,PRK,25549819
Rep. Korea,KOR,51606633
Kosovo,XKX,1845300
Kuwait,KWT,4137309
Kyrgyz Republic,KGZ,6322800
Lao PDR,LAO,7061507
Latvia,LVA,1927174
Lebanon,LBN,6848925
Lesotho,LSO,2108132
Liberia,LBR,4818977
Libya,LBY,6678567
Liechtenstein,LIE,37910
Lithuania,LTU,2801543
Luxembourg,LUX,607950
China Macao SAR,MAC,631636
Madagascar,MDG,26262368
Malawi,MWI,18143315
Malaysia,MYS,31528585
Maldives,MDV,515696
Mali,MLI,19077690
Malta,MLT,484630
Marshall Islands,MHL,58413
Mauritania,MRT,4403319
Mauritius,MUS,1265303
Mexico,MEX,126190788
Fed. Sts. Micronesia,FSM,112640
Moldova,MDA,2706049
Monaco,MCO,38682
Mongolia,MNG,3170208
Montenegro,MNE,622227
Morocco,MAR,36029138
Mozambique,MOZ,29495962
Myanmar,MMR,53708395
Namibia,NAM,2448255
Nauru,NRU,12704
Nepal,NPL,28087871
Netherlands,NLD,17231624
New Caledonia,NCL,284060
New Zealand,NZL,4841000
Nicaragua,NIC,6465513
Niger,NER,22442948
Nigeria,NGA,195874740
North Macedonia,MKD,2082958
Northern Mariana Islands,MNP,56882
Norway,NOR,5311916
Oman,OMN,4829483
Pakistan,PAK,212215030
Palau,PLW,17907
Panama,PAN,4176873
Papua New Guinea,PNG,8606316
Paraguay,PRY,6956071
Peru,PER,31989256
Philippines,PHL,106651922
Poland,POL,37974750
Portugal,PRT,10283822
Puerto Rico,PRI,3195153
Qatar,QAT,2781677
Romania,ROU,19466145
Russian Federation,RUS,144478050
Rwanda,RWA,12301939
Samoa,WSM,196130
San Marino,SMR,33785
Sao Tome and Principe,STP,211028
Saudi Arabia,SAU,33699947
Senegal,SEN,15854360
Serbia,SRB,6982604
Seychelles,SYC,96762
Sierra Leone,SLE,7650154
Singapore,SGP,5638676
Sint Maarten (Dutch part),SXM,40654
Slovak Republic,SVK,5446771
Slovenia,SVN,2073894
Solomon Islands,SLB,652858
Somalia,SOM,15008154
South Africa,ZAF,57779622
South Sudan,SSD,10975920
Spain,ESP,46796540
Sri Lanka,LKA,21670000
St. Kitts and Nevis,KNA,52441
St. Lucia,LCA,181889
St. Martin (French part),MAF,37264
St. Vincent and the Grenadines,VCT,110210
Sudan,SDN,41801533
Suriname,SUR,575991
Sweden,SWE,10175214
Switzerland,CHE,8513227
Syrian Arab Republic,SYR,16906283
Tajikistan,TJK,9100837
Tanzania,TZA,56318348
Thailand,THA,69428524
Timor-Leste,TLS,1267972
Togo,TGO,7889094
Tonga,TON,103197
Trinidad and Tobago,TTO,1389858
Tunisia,TUN,11565204
Turkey,TUR,82319724
Turkmenistan,TKM,5850908
Turks and Caicos Islands,TCA,37665
Tuvalu,TUV,11508
Uganda,UGA,42723139
Ukraine,UKR,44622516
United Arab Emirates,ARE,9630959
United Kingdom,GBR,66460344
United States,USA,326687501
Uruguay,URY,3449299
Uzbekistan,UZB,32955400
Vanuatu,VUT,292680
RB Venezuela,VEN,28870195
Vietnam,VNM,95540395
Virgin Islands (U.S.),VIR,106977
West Bank and Gaza,PSE,4569087
Rep. Yemen,YEM,28498687
Zambia,ZMB,17351822
Zimbabwe,ZWE,14439018'''

In [0]:
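lines = csv_data.split('\n')

# Split the first line to get the header names
headers = lines[0].split(',')

# Split every remaining line into its list of values
rows = [line.split(',') for line in lines[1:]]

data = []

for row in rows:
  for i, value in enumerate(row):
    # isdecimal() is True only for strings made of decimal digits, so
    # placeholders such as '..' stay strings and int() cannot fail
    if value.isdecimal():
      row[i] = int(value)
  # Pair each header with the value in the same column
  data.append(dict(zip(headers, row)))

print(data)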
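
For comparison, Python's standard library ships a `csv` module that handles the splitting and the header mapping. The following is a sketch of an alternative, not the workshop's intended solution; `parse_csv` and `sample` are names made up for this example. `csv.DictReader` yields every value as a string, so the numeric conversion is still done by hand, and the sketch assumes every row has as many values as there are headers.

```python
import csv
import io

def parse_csv(text):
    """Parse CSV text into a list of dicts, converting digit-only values to int."""
    # DictReader reads from a file-like object and uses the first row as keys.
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: int(value) if value.isdecimal() else value
         for key, value in row.items()}
        for row in reader
    ]

sample = 'Country Name,Country Code,Population\nAndorra,AND,77006\nEritrea,ERI,..'
print(parse_csv(sample))
```

Unlike the plain `split` approach, the `csv` module also copes with quoted values that contain commas.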

## 2. Broken JSONs

A list of contacts and addresses was exported from a database to a JSON file. The database was not well-maintained and some data was corrupted. We wish to clean this up to get a new, improved contact database.

The first block uses [Requests](https://requests.readthedocs.io/en/master/), a widely used library for handling HTTP in Python, to download the JSON file from the Internet. You don't need to understand what is happening, but if you are curious, feel free to dig into it.

After running the first block, the `db` variable contains a dictionary with two keys: `contacts` and `addresses`.

A contact consists of the following information:

* ID
* first name
* last name
* e-mail
* address ID (to be found in the addresses table)

An address consists of the following information:

* ID
* Street number
* Street name
* ZIP code
* City
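
To make the structure concrete, here is a small hand-written sample of what `db` might look like. The key names are taken from the solution code further down; the values, and the exact types of the IDs, are assumptions made for illustration and are not taken from the real file.

```python
# Hypothetical miniature of the downloaded data, for illustration only.
sample_db = {
    'contacts': [
        {'id': 1, 'first_name': 'Sean', 'last_name': 'Price',
         'email': 'sean.price@duckdown.com', 'address': 10},
    ],
    'addresses': [
        {'id': 10, 'street_number': '1303', 'street_name': 'Bergen Street',
         'zip_code': '11216', 'city': 'Brooklyn, New York'},
    ],
}

# A contact refers to its address through the address ID.
contact = sample_db['contacts'][0]
address = next(a for a in sample_db['addresses'] if a['id'] == contact['address'])
print(address['city'])
```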

Once your program is run, the following variables should contain this information:

* `contacts`: a dictionary mapping contact IDs to each _valid_ contact. A valid contact is a contact with all its data filled in and a valid address (an address with all its data filled in). Expected format:

```python
{
  1: {
    'first_name': 'Sean',
    'last_name': 'Price',
    'email': 'sean.price@duckdown.com',
    'address': {
      'street_number': '1303',
      'street_name': 'Bergen Street',
      'zip_code': '11216',
      'city': 'Brooklyn, New York'
    }
  },
  ...
}
```
* `incomplete_contacts`: a `list` of contact IDs that do not have complete data
* `incomplete_addresses`: a `list` of address IDs that do not have complete data
* `missing_addresses`: a `list` of address IDs that appear as the `address` field of a contact, but do not appear in the addresses list
* `dangling_addresses`: a `list` of address IDs for all addresses with no corresponding contact

  > This is a good time to start digging into the [docs](https://docs.python.org/3/library/stdtypes.html). You can do many things with Python's built-in types. Hope you find what you need!
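
Sets are one built-in type that fits this exercise well. The sketch below shows the subset test (`<=`) and the difference (`-`) on a few made-up IDs and field names; none of these values come from the real database.

```python
required = {'id', 'email'}
record = {'id': 7, 'first_name': 'Ada'}

# set(record) holds the dict's keys; <= tests whether required is a subset.
has_all_fields = required <= set(record)
print(has_all_fields)  # False, because 'email' is missing

referenced_ids = {1, 2, 3}   # IDs that contacts point to
known_ids = {2, 3, 4}        # IDs present in the address list

# Difference keeps the elements of the left set that are absent from the right.
print(referenced_ids - known_ids)  # {1}
print(known_ids - referenced_ids)  # {4}
```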

In [0]:
# Run this block to download the JSON file and parse it into the `db` variable.
import requests

res = requests.get('https://chrales.dev/python-3-beginner/contactdb.json')
db = res.json()

In [0]:
contact_fields = {'id', 'first_name', 'last_name', 'email', 'address'}
address_fields = {'id', 'street_number', 'street_name', 'zip_code', 'city'}

# Incomplete contacts: the IDs of all contacts whose set of keys does not
# contain every expected contact field
incomplete_contacts = [c['id'] for c in db['contacts'] if not contact_fields <= set(c)]
# Incomplete addresses: same check, using the expected address fields
incomplete_addresses = [a['id'] for a in db['addresses'] if not address_fields <= set(a)]

# The set of all unique address IDs attached to a contact
contact_addresses = {c['address'] for c in db['contacts'] if 'address' in c}
# The set of all address IDs 
address_ids = {a['id'] for a in db['addresses']}

# missing addresses: the address IDs that are attached to a contact but not in
# the address list
missing_addresses = list(contact_addresses - address_ids)
# dangling addresses: the address IDs that are in the address list but not
# attached to a contact
dangling_addresses = list(address_ids - contact_addresses)
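
The completeness check above only detects missing keys. If the corrupted records instead carry empty values (an assumption, since the shape of the damage is not described here), a stricter test could be sketched as follows. The helper name `is_complete` and the sample records are hypothetical.

```python
def is_complete(record, fields):
    """True when every field is present and holds a non-empty value."""
    # record.get returns None for a missing key; None and '' both count as empty.
    return all(record.get(field) not in (None, '') for field in fields)

fields = {'id', 'email'}
print(is_complete({'id': 1, 'email': 'a@example.com'}, fields))  # True
print(is_complete({'id': 2, 'email': ''}, fields))               # False
print(is_complete({'id': 3}, fields))                            # False
```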

In [0]:
contacts = {}

for c in db['contacts']:
  contact_id = c['id']

  # If c is an incomplete contact, bail early and skip to the next contact
  if contact_id in incomplete_contacts:
    continue

  # Try to find the address with ID = c['address']
  addr_id = c['address']
  for a in db['addresses']:
    if a['id'] == addr_id:
      break
  else:
    # The loop finished without a break: no address has this ID
    a = None

  # If the address is missing or incomplete, the contact is not valid:
  # bail early and skip to the next contact
  if a is None or addr_id in incomplete_addresses:
    continue

  c = c.copy()              # Copy, to avoid modifying the original data
  del c['id']               # Remove the ID from the contact, as per the specification
  a = a.copy()              # Same for the address
  del a['id']
  c['address'] = a          # Replace the address ID with the address itself
  contacts[contact_id] = c

print(contacts)
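
The inner loop scans the whole address list once per contact. A possible variation, sketched here on made-up data, indexes the addresses by ID first so that each lookup is a single `dict.get` call. The names `sample_addresses` and `addresses_by_id` are illustrative.

```python
# Hypothetical address records, for illustration only.
sample_addresses = [
    {'id': 10, 'city': 'Brooklyn, New York'},
    {'id': 11, 'city': 'Paris'},
]

# Build the index once: address ID -> address dict.
addresses_by_id = {a['id']: a for a in sample_addresses}

# dict.get returns None instead of raising KeyError when the ID is unknown.
print(addresses_by_id.get(11)['city'])  # Paris
print(addresses_by_id.get(99))          # None
```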
  