# Building the `images-by-country-id.json` dataset

This notebook is a step-by-step guide to creating and updating the `images-by-country-id.json` dataset.

## Requirements

### Download SkyScanner supported Countries
```sh
wget 'http://partners.api.skyscanner.net/apiservices/geo/v1.0?apikey=prtl6749387986743898559646983194' \
  --header 'content-type: application/json' \
  -O geo.json
```

### Download and extract Unsplash Research dataset
```sh
wget https://unsplash.com/data/lite/latest -O unsplash-research-dataset-lite-latest.zip
unzip unsplash-research-dataset-lite-latest.zip -d unsplash-research-dataset-lite-latest
```

## Loading libraries

In [1]:
import numpy as np
import pandas as pd
import glob
import json

## Loading the datasets in Pandas

Make sure that `path` points to the directory where you extracted the dataset.

In [2]:
path = './unsplash-research-dataset-lite-latest/'
documents = ['photos', 'keywords', 'collections', 'conversions', 'colors']
datasets = {}

for doc in documents:
  files = glob.glob(path + doc + ".tsv*")

  subsets = []
  for filename in files:
    df = pd.read_csv(filename, sep='\t', header=0)
    subsets.append(df)

  datasets[doc] = pd.concat(subsets, axis=0, ignore_index=True)
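
If `path` is wrong, `glob.glob` returns an empty list and `pd.concat` fails with an error that does not mention the path. The sketch below wraps the same loading logic in a hypothetical helper, `load_tsv_parts` (not part of the original notebook), that raises a clearer error when no files match.

```python
import glob
import os

import pandas as pd


def load_tsv_parts(path, doc):
    """Read every `<doc>.tsv*` file under `path` into a single DataFrame."""
    files = sorted(glob.glob(os.path.join(path, doc + ".tsv*")))
    if not files:
        # Fail early with the path in the message instead of letting
        # pd.concat complain about an empty list.
        raise FileNotFoundError(f"no {doc}.tsv* files found under {path!r}")
    subsets = [pd.read_csv(filename, sep="\t", header=0) for filename in files]
    return pd.concat(subsets, axis=0, ignore_index=True)
```

With this helper, the loop above could be written as `datasets = {doc: load_tsv_parts(path, doc) for doc in documents}`.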

## Exploring the datasets

Here are the first couple of rows from each dataset, as an example.

Enjoy the exploration!

In [3]:
datasets['photos'].head()

Unnamed: 0,photo_id,photo_url,photo_image_url,photo_submitted_at,photo_featured,photo_width,photo_height,photo_aspect_ratio,photo_description,photographer_username,...,photo_location_longitude,photo_location_country,photo_location_city,stats_views,stats_downloads,ai_description,ai_primary_landmark_name,ai_primary_landmark_latitude,ai_primary_landmark_longitude,ai_primary_landmark_confidence
0,2Q8zDWkj0Yw,https://unsplash.com/photos/2Q8zDWkj0Yw,https://images.unsplash.com/photo-141520117961...,2014-11-05 15:26:26.678711,t,4192,2794,1.5,,lanceanderson,...,,,,15854580,146388,school of jellyfish swimming in body of water,,,,
1,tsBDNuCJiLg,https://unsplash.com/photos/tsBDNuCJiLg,https://images.unsplash.com/photo-141768928330...,2014-12-04 10:31:44.647012,t,4324,2880,1.5,Surfer in the ocean,thehungryjpeg,...,,,,3157825,8247,surfer walking on body of water during daytime,,,,
2,A93gsuMxVcE,https://unsplash.com/photos/A93gsuMxVcE,https://images.unsplash.com/photo-142981401899...,2015-04-23 18:33:44.024841,t,2000,1333,1.5,Barren Countryside,jeremydgreat,...,,,,2159829,7064,landscape photo of road beside field of grass,,,,
3,oYIdH6bFssk,https://unsplash.com/photos/oYIdH6bFssk,https://images.unsplash.com/photo-143275722183...,2015-05-27 20:07:28.35245,t,4288,2848,1.51,Lightbulb reflections,chelaxydp,...,,,,2861489,11778,closed up photo of orange lightened lamp,,,,
4,wgLPy2YBXuc,https://unsplash.com/photos/wgLPy2YBXuc,https://images.unsplash.com/photo-143205996405...,2015-05-19 18:26:15.223161,t,5312,2988,1.78,,csogi,...,,,,6037626,52263,white clouds and blue sky at daytime,,,,


In [4]:
datasets['collections'].head()

Unnamed: 0,photo_id,collection_id,collection_title,photo_collected_at
0,--2IBUMom1I,162470,Majestical Sunsets,2016-03-15 17:04:25
1,--2IBUMom1I,4668070,Pose,2019-04-18 23:59:25
2,--2IBUMom1I,4172658,Guys,2019-02-02 14:40:14
3,--2IBUMom1I,9832457,business,2020-04-04 14:26:10
4,--2IBUMom1I,2143051,Travel / Places,2018-05-22 23:20:05


In [5]:
datasets['conversions'].head()

Unnamed: 0,converted_at,conversion_type,keyword,photo_id,anonymous_user_id,conversion_country
0,2020-02-28 19:17:37,download,birds,RLLR0oRz16Y,d5f2584b-3cde-4e6f-8875-2dde1373f9bb,NL
1,2020-02-28 19:20:15,download,winter,r6TLRDY4Ll0,ee639d04-e079-4737-a500-d3139cdfae9a,KR
2,2020-02-28 19:32:50,download,island,vh0FucFJ7pw,b97f3bdc-bfd2-478c-9a43-fe743ca4813c,ID
3,2020-02-28 19:39:59,download,full hd wallpaper,SbrZdkLtTCY,5a15ec7e-9bf4-464c-9601-1ad731795a8d,US
4,2020-02-28 19:49:56,download,river,x_gyAYzyeQA,4bb2e232-498c-46d8-a02b-eb647ecff408,IN


In [6]:
datasets['colors'].head()

Unnamed: 0,photo_id,hex,red,green,blue,keyword,ai_coverage,ai_score
0,A2mjVkcix-w,101C23,16,28,35,black,0.613267,0.635228
1,0ufkmj46xvU,625946,98,89,70,darkolivegreen,0.037172,0.024936
2,0ufkmj46xvU,C7897A,199,137,122,rosybrown,0.024714,0.111978
3,HY7Az9lZwB4,D18E46,209,142,70,peru,0.005867,0.115786
4,HY7Az9lZwB4,A37343,163,115,67,sienna,0.010667,0.109003


In [7]:
datasets['keywords'].head()

Unnamed: 0,photo_id,keyword,ai_service_1_confidence,ai_service_2_confidence,suggested_by_user
0,zzux2cH-F-A,spring,34.244873,86.833668,f
1,zzux2cH-F-A,compass,26.864105,,f
2,zzux2cH-F-A,nature,99.83799,95.966119,f
3,zzux2cH-F-A,jar,43.128902,,f
4,zzux2cH-F-A,flower,81.635406,,f


## Transform the SkyScanner geo.json to DataFrame

In [8]:
with open('geo.json') as f:
  geoJson = json.load(f)

In [9]:
countries = []
for continent in geoJson['Continents']:
    for country in continent['Countries']:
        countries.append({ 'Id': country['Id'], 'Name': country['Name'] })
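
To make the expected shape of `geo.json` easier to see, here is a minimal sketch of the same flattening on a hand-written sample. The sample only mirrors the keys this notebook reads (`Continents`, `Countries`, `Id`, `Name`); the real response may carry more fields, and `flatten_countries` is a hypothetical name.

```python
# Hand-written sample that mirrors only the keys used in this notebook.
sample_geo = {
    "Continents": [
        {"Countries": [{"Id": "MN", "Name": "Mongolia"},
                       {"Id": "CN", "Name": "China"}]},
        {"Countries": [{"Id": "AU", "Name": "Australia"}]},
    ]
}


def flatten_countries(geo):
    """Collect one {'Id', 'Name'} dict per country across all continents."""
    return [
        {"Id": country["Id"], "Name": country["Name"]}
        for continent in geo["Continents"]
        for country in continent["Countries"]
    ]


sample_countries = flatten_countries(sample_geo)
```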

In [10]:
pd.DataFrame(data=countries).head()

Unnamed: 0,Id,Name
0,MN,Mongolia
1,CN,China
2,KG,Kyrgyzstan
3,TJ,Tajikistan
4,BT,Bhutan


The *SkyScanner* API returns a *two-character ISO code* for each country.
We will use that code to index our images.
However, we cannot be sure that the *Unsplash* dataset has images for every country.
So, we add a *default* Id code and map it to the *Travel* category.
This serves as a fallback when we can't find images for a country, so that we can at least show something nice.

In [11]:
countries.append({
    'Id': 'default',
    'Name': 'Travel'
})
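
On the consuming side, the fallback could look like the sketch below. It assumes the final JSON is a list of records with `Id` and `photos` keys, which is what the grouping step later in this notebook appears to produce. The helper names `index_by_id` and `photos_for` are hypothetical, and the sample records are trimmed to a single field.

```python
def index_by_id(records):
    """Turn the list of records into a dict keyed by country Id."""
    return {record["Id"]: record["photos"] for record in records}


def photos_for(images_by_id, country_id):
    """Return the photos for `country_id`, or the 'default' photos if none exist."""
    return images_by_id.get(country_id) or images_by_id.get("default", [])


# Toy records; real entries also carry the image URL and aspect ratio.
sample_records = [
    {"Id": "JP", "Name": "Japan", "photos": [{"photo_id": "sMP1im1Jbq4"}]},
    {"Id": "default", "Name": "Travel", "photos": [{"photo_id": "6xTKw5ufGTY"}]},
]
sample_index = index_by_id(sample_records)
```

Looking up an Id that is missing from `sample_index` returns the `default` entry's photos instead of failing.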

In [12]:
dfCountries = pd.DataFrame(data=countries)

In [13]:
dfCountries.tail()

Unnamed: 0,Id,Name
243,MP,Northern Mariana Islands
244,TK,Tokelau
245,VU,Vanuatu
246,AU,Australia
247,default,Travel


We add an additional column, `keyword`, so we can merge it with the *Unsplash* keyword dataframe.

In [14]:
dfCountries['keyword'] = dfCountries['Name'].str.lower()

In [15]:
dfCountries.head()

Unnamed: 0,Id,Name,keyword
0,MN,Mongolia,mongolia
1,CN,China,china
2,KG,Kyrgyzstan,kyrgyzstan
3,TJ,Tajikistan,tajikistan
4,BT,Bhutan,bhutan


In [16]:
dfCountryKeywords = dfCountries.merge(datasets['keywords'], how='left')

In [17]:
dfCountryKeywords.sample(5)

Unnamed: 0,Id,Name,keyword,photo_id,ai_service_1_confidence,ai_service_2_confidence,suggested_by_user
1045,NO,Norway,norway,q6vDxCXnvsk,,,t
2645,AU,Australia,australia,uasc967z9fI,,,t
864,IT,Italy,italy,OmelL9tVVno,,,t
629,ES,Spain,spain,Lzx4J_Pb3sk,,,t
3689,default,Travel,travel,6xTKw5ufGTY,,,t
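
Because the merge uses `how='left'`, a country with no matching keyword still gets a row, with `NaN` in the photo columns. The sketch below uses toy data to show this, and uses the `indicator=True` option of `merge` to list the countries that found no match. Whether any real country is unmatched depends on the dataset version, so the Tokelau row here is only an illustration.

```python
import pandas as pd

# Toy stand-ins for dfCountries and datasets['keywords'].
toy_countries = pd.DataFrame({
    "Id": ["NO", "TK"],
    "Name": ["Norway", "Tokelau"],
    "keyword": ["norway", "tokelau"],
})
toy_keywords = pd.DataFrame({
    "photo_id": ["q6vDxCXnvsk", "zzux2cH-F-A"],
    "keyword": ["norway", "spring"],
})

# indicator=True adds a `_merge` column: 'both' or 'left_only' for a left join.
toy_merged = toy_countries.merge(
    toy_keywords, on="keyword", how="left", indicator=True
)
unmatched_ids = toy_merged.loc[toy_merged["_merge"] == "left_only", "Id"].tolist()
```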


In [18]:
dfCountryPhotos = dfCountryKeywords.merge(datasets['photos'], how='left')

In [19]:
dfCountryPhotos.sample(5)

Unnamed: 0,Id,Name,keyword,photo_id,ai_service_1_confidence,ai_service_2_confidence,suggested_by_user,photo_url,photo_image_url,photo_submitted_at,...,photo_location_longitude,photo_location_country,photo_location_city,stats_views,stats_downloads,ai_description,ai_primary_landmark_name,ai_primary_landmark_latitude,ai_primary_landmark_longitude,ai_primary_landmark_confidence
124,JP,Japan,japan,sMP1im1Jbq4,,,t,https://unsplash.com/photos/sMP1im1Jbq4,https://images.unsplash.com/photo-152836123715...,2018-06-07 08:48:24.155065,...,139.703549,Japan,Shinjuku,698678.0,6426.0,people holding umbrella standing near pedestri...,,,,
3433,default,Travel,travel,NVMF-cAHxCg,,,t,https://unsplash.com/photos/NVMF-cAHxCg,https://images.unsplash.com/photo-149859570799...,2017-06-27 20:36:50.480293,...,,,,3847651.0,19819.0,pine trees on rock formation near mountains un...,,,,
2966,default,Travel,travel,pdRsf77OBoo,,,t,https://unsplash.com/photos/pdRsf77OBoo,https://images.unsplash.com/photo-1551482850-d...,2019-03-01 23:28:59.014058,...,,Croatia,,2659178.0,30437.0,aerial photography of island,,,,
2338,CA,Canada,canada,bQl2kRQyUE8,,,t,https://unsplash.com/photos/bQl2kRQyUE8,https://images.unsplash.com/photo-147697973503...,2016-10-20 16:10:32.354776,...,-123.240752,Canada,West Vancouver,2187424.0,14299.0,two gray and orange backpacks on gray rocks at...,,,,
2993,default,Travel,travel,oO15xC38wj4,,,t,https://unsplash.com/photos/oO15xC38wj4,https://images.unsplash.com/photo-145849666556...,2016-03-20 17:59:52.86722,...,7.308953,Germany,,1505781.0,16999.0,green forest,,,,


From all the information we now have, we only need a few columns.
We discard the rest.

In [20]:
df = dfCountryPhotos[['Id', 'Name', 'photo_id', 'photo_image_url', 'photo_aspect_ratio']]

We need the *photo_aspect_ratio* to prevent our layout from shifting while the images are loading.
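
As a rough illustration of how the ratio helps, the sketch below computes the height to reserve for an image before it loads. It assumes `photo_aspect_ratio` is width divided by height, which matches the sample rows above (4192 x 2794 gives 1.5). `placeholder_height` is a hypothetical helper, not part of the notebook or the theme.

```python
def placeholder_height(display_width, aspect_ratio):
    """Height in pixels to reserve for an image shown at `display_width`."""
    if aspect_ratio <= 0:
        raise ValueError("aspect_ratio must be positive")
    # Assumes aspect_ratio = width / height.
    return round(display_width / aspect_ratio)


landscape_height = placeholder_height(300, 1.5)   # wide image
portrait_height = placeholder_height(300, 0.67)   # tall image
```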

In [21]:
df.sample(5)

Unnamed: 0,Id,Name,photo_id,photo_image_url,photo_aspect_ratio
2933,default,Travel,rFh890jKgcs,https://images.unsplash.com/photo-156410762896...,0.67
2164,AQ,Antarctica,nvBfwtaUBnI,https://images.unsplash.com/photo-146288838706...,1.33
1849,HK,Hong Kong,WmCsa1gPlAc,https://images.unsplash.com/photo-1551335844-0...,1.33
2757,AU,Australia,Gk3Ot2WwLQM,https://images.unsplash.com/photo-157957089342...,1.47
1155,CH,Switzerland,z7tQUhBVOrY,https://images.unsplash.com/photo-144775508655...,1.0


We remove any rows with missing values, since they would cause errors down the line.

In [22]:
df = df.dropna()

Now that we have a nifty dataset, we are going to group the rows by *Id*, for easy access.

In [23]:
g = df.groupby(
    [
        'Id',
        'Name'
    ],
    as_index=False
).apply(
    lambda group: group[[
        'photo_id',
        'photo_image_url',
        'photo_aspect_ratio'
    ]].to_dict('records')
).reset_index().rename(
    columns={
        0:'photos'
    }
)
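
The return shape of `groupby(...).apply(...)` has changed between pandas versions, so it may help to see the intended result spelled out. The sketch below builds one record per country with an explicit loop over the groups. It is an alternative way to get a list of `{Id, Name, photos}` records, not the notebook's own method, and the toy rows use placeholder URLs.

```python
import pandas as pd

# Toy stand-in for the cleaned `df`; the URLs are placeholders.
toy_df = pd.DataFrame({
    "Id": ["JP", "JP", "default"],
    "Name": ["Japan", "Japan", "Travel"],
    "photo_id": ["a", "b", "c"],
    "photo_image_url": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ],
    "photo_aspect_ratio": [1.5, 0.67, 1.33],
})

photo_columns = ["photo_id", "photo_image_url", "photo_aspect_ratio"]

# One record per (Id, Name) pair, with that country's photos as a list of dicts.
toy_records = [
    {
        "Id": country_id,
        "Name": name,
        "photos": group[photo_columns].to_dict("records"),
    }
    for (country_id, name), group in toy_df.groupby(["Id", "Name"])
]
```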

In [24]:
imagesJson = json.loads(
    g.to_json(
        orient='records'
    )
)

Finally, we save our file as `images-by-country-id.json`

In [25]:
with open('images-by-country-id.json', 'w') as outfile:
    json.dump(imagesJson, outfile, indent=2)

You can find the latest compiled mapping from country `Id` to *Unsplash* images here:

https://github.com/facutk/wpxvuelos/blob/master/wp/wp-content/themes/xvuelos/assets/country-images-by-id/images-by-country-id.json