## DGI Data Enrichment Demo:

This Jupyter notebook gives a brief demo of how the API can also be used to extract and enrich data. In this case, it shows how web scraping from Google can be used to combine other publicly available data with the data available on DGI. Here, we scrape the email, phone and opening hours information from Google and attach it to the data we already have on the Limerick libraries.

##### Import requirements

In [1]:
!pip install requests fuzzywuzzy pandas pyjstat numpy plotly matplotlib seaborn geopy google folium googlemaps beautifulsoup4 lxml

twisted 18.7.0 requires PyHamcrest>=1.9.0, which is not installed.
You are using pip version 10.0.1, however version 19.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


##### Import all necessary libraries:

In [2]:
import json
import sys
import time
import urllib.request  # urlopen lives in the urllib.request submodule

import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.plotly as py  # plotly < 4 only; later versions moved this to chart_studio.plotly
import plotly.graph_objs as go
import folium
import googlemaps
from bs4 import BeautifulSoup
from fuzzywuzzy import fuzz, process
from geopy.geocoders import GoogleV3, Nominatim
from googlesearch import search  # provided by the "google" package
from pyjstat import pyjstat

### Search the datasets for a given query using the DGI API:

This function returns the metadata for the matching packages. The required resource can then be selected from the result of the function below.

<b>Search Query: "libraries"</b>

In [3]:
def package_search(url, query):
    """Search DGI for packages matching `query` and return the 'result' dict."""

    try:

        response = requests.get(url + 'api/3/action/package_search', params={'q': query})
        results = response.json()

        if results['success'] is not True:
            raise SystemError

        # 'result' is a dict holding 'count' and 'results'; an empty search has count 0
        if results['result'].get('count', 0) == 0:
            raise TypeError

        return results['result']

    except SystemError:

        print("API failure - ")
        sys.exit(1)

    except TypeError:

        print("No results found for the given query!")
        sys.exit(1)

    except Exception as e:
        print(e)
        sys.exit(1)


res = package_search('https://data.gov.ie/', 'libraries')
print(res)

{'count': 29, 'sort': 'score desc, metadata_modified desc', 'facets': {}, 'results': [{'license_title': 'Creative Commons Attribution 4.0', 'maintainer': 'Not supplied', 'issued': '2013-12-18', 'private': False, 'maintainer_email': 'data@smartdublin.ie', 'num_tags': 3, 'frequency': 'Irregular', 'id': 'cd8ef111-10da-4abd-a9dc-0e82c9fd256e', 'metadata_created': '2015-09-13T15:34:06.081426', 'metadata_modified': '2018-03-05T15:54:11.594082', 'author': 'Not supplied', 'author_email': 'data@smartdublin.ie', 'temporal': '2013-01-01 to 2013-12-31', 'theme': 'Arts', 'state': 'active', 'version': '1.0', 'relationships_as_object': [], 'license_id': 'CC-BY-4.0', 'type': 'dataset', 'resources': [{'cache_last_updated': None, 'package_id': 'cd8ef111-10da-4abd-a9dc-0e82c9fd256e', 'datastore_active': False, 'id': '01ac1df7-76c9-4177-a629-51de75850ee5', 'size': None, 'state': 'active', 'api_response_formats': [], 'hash': '', 'description': 'South Dublin Libraries', 'format': 'CSV', 'mimetype_inner': No

### Selected resource/data is Limerick libraries:

id: '042801f0-beb9-4d46-83bf-80a82b23a963'

format: 'json'

Note: `id` is the package ID, and `resc_arr_ind` is the index, within the package's `resources` array, of the resource to be extracted.
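
To see which `resc_arr_ind` values a package offers, the search result can be flattened into one row per resource. This is a minimal sketch, assuming the result has the shape shown in the output above (a `results` list whose packages each carry a `resources` list). `list_resources` is a hypothetical helper and the sample values are made up for illustration.

```python
def list_resources(search_result):
    """Return one dict per resource: package id, title, resource index, format, url."""
    rows = []
    for pkg in search_result.get('results', []):
        for idx, resc in enumerate(pkg.get('resources', [])):
            rows.append({
                'package_id': pkg.get('id'),
                'title': pkg.get('title'),
                'resc_arr_ind': idx,  # value to pass as the resource index
                'format': (resc.get('format') or '').lower(),
                'url': resc.get('url'),
            })
    return rows


# Illustrative stand-in for the dict returned by package_search (made-up values)
sample = {
    'count': 1,
    'results': [{
        'id': 'example-package-id',
        'title': 'Example Libraries',
        'resources': [
            {'format': 'CSV', 'url': 'https://example.org/libraries.csv'},
            {'format': 'GeoJSON', 'url': 'https://example.org/libraries.geojson'},
        ],
    }],
}

for row in list_resources(sample):
    print(row['package_id'], row['resc_arr_ind'], row['format'])
```

With the real search result, the same call would be `list_resources(res)`.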

### Function to extract data from DGI API:

In [4]:
def extract_pkg_data(url, pkg_id, resc_arr_ind):
    """Fetch one resource of a DGI package: a DataFrame for CSV, a dict for JSON/GeoJSON."""

    try:

        response = requests.get(url + 'api/3/action/package_show', params={'id': pkg_id})
        results = response.json()

        if not results['success']:
            raise SystemError

        if len(results['result']) == 0:
            return "No package found"

        # The resource selected by its index in the package's 'resources' array
        resource = results['result']['resources'][resc_arr_ind]
        fmt = resource['format'].lower()

        if fmt == 'csv':

            return pd.read_csv(resource['url'], encoding='ISO-8859-1')

        elif fmt in ('geojson', 'json'):

            # Assumes the resource's own 'url' points at the data file
            return requests.get(resource['url']).json()

        else:

            # Unsupported format: return the package metadata unchanged
            return results['result']

    except SystemError:

        print("Request Failure, please check the URL or the parameters")
        sys.exit(1)

    except Exception as e:

        print(e)
        sys.exit(1)

In [5]:
lm_lib_data = extract_pkg_data('https://data.gov.ie/', '762c93aa-f821-4ab7-8950-fe86bdf7fd2e', 0)
lm_lib_data

{'name': 'libraries',
 'type': 'FeatureCollection',
 'crs': {'type': 'name', 'properties': {'name': 'EPSG:2157'}},
 'features': [{'type': 'Feature',
   'geometry': {'type': 'Point',
    'coordinates': [511376.631959441, 626468.356670141]},
   'properties': {'name': 'Abbeyfeale Library'}},
  {'type': 'Feature',
   'geometry': {'type': 'Point',
    'coordinates': [546451.335412499, 646235.075368134]},
   'properties': {'name': 'Adare Library'}},
  {'type': 'Feature',
   'geometry': {'type': 'Point',
    'coordinates': [534073.892464841, 650401.229900348]},
   'properties': {'name': 'Askeaton Library'}},
  {'type': 'Feature',
   'geometry': {'type': 'Point',
    'coordinates': [568000.663871202, 649481.618443499]},
   'properties': {'name': 'Caherconlish Library'}},
  {'type': 'Feature',
   'geometry': {'type': 'Point',
    'coordinates': [577240.81915693, 651685.76339251]},
   'properties': {'name': 'Cappamore Library'}},
  {'type': 'Feature',
   'geometry': {'type': 'Point',
    'coordi
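
The GeoJSON above stores each library as a `Feature` with a name and a coordinate pair. Its `crs` entry is EPSG:2157 (Irish Transverse Mercator), so the coordinates are eastings and northings in metres rather than latitude and longitude, which is why the next section geocodes the library names. As a rough sketch, the features can be flattened into plain rows like this; `features_to_rows` is a hypothetical helper, and the sample reuses the first two features from the output above.

```python
def features_to_rows(geojson):
    """Flatten a GeoJSON FeatureCollection into one dict per feature."""
    rows = []
    for feature in geojson.get('features', []):
        easting, northing = feature['geometry']['coordinates'][:2]
        rows.append({
            'name': feature['properties']['name'],
            'easting': easting,    # metres, EPSG:2157
            'northing': northing,  # metres, EPSG:2157
        })
    return rows


# First two features copied from the output above
sample_geojson = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature',
         'geometry': {'type': 'Point',
                      'coordinates': [511376.631959441, 626468.356670141]},
         'properties': {'name': 'Abbeyfeale Library'}},
        {'type': 'Feature',
         'geometry': {'type': 'Point',
                      'coordinates': [546451.335412499, 646235.075368134]},
         'properties': {'name': 'Adare Library'}},
    ],
}

library_rows = features_to_rows(sample_geojson)
print([row['name'] for row in library_rows])
```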

### Data Enrichment for Limerick libraries:

This data enrichment is split into 2 stages:

<li>Google Geolocation API: Used to retrieve address and coordinates.</li>
<li>Web scraping: Used to retrieve email, phone and opening hours.</li>

##### Case 1: Google geolocation to get address and coordinates -

In [6]:
GOOGLE_API_KEY = 'YOUR_API_KEY'  # Please use your own API key


def limerick_cnt_preprocess(dt):
    """Geocode each library in the GeoJSON and return the results as a DataFrame."""

    lm_cnty_df = []

    try:
        for x in dt['features']:

            time.sleep(10)  # pause between requests to the geocoding API

            param = {'address': x['properties']['name'] + ", County Limerick", 'key': GOOGLE_API_KEY}

            response = requests.get('https://maps.googleapis.com/maps/api/geocode/json', params=param)
            lt = response.json()

            if lt['status'] != 'OK':

                raise SystemError

            # First comma-separated part is Address1, the rest is Address2
            address_parts = [p.strip() for p in lt['results'][0]['formatted_address'].split(',')]
            location = lt['results'][0]['geometry']['location']

            lm_cnty_df.append({'County': 'Limerick',
                               'Administrative Autority': 'Limerick City and County Council',
                               'Name': x['properties']['name'],
                               'Address1': address_parts[0],
                               'Address2': ", ".join(address_parts[1:]),
                               'Latitude': location['lat'],
                               'Longitude': location['lng']})

        return pd.DataFrame(lm_cnty_df)

    except SystemError:

        print("API Status Failed - ")
        return None


lm_cnty = limerick_cnt_preprocess(lm_lib_data)
if lm_cnty is not None:
    lm_cnty.to_csv('LimerickCnty_Lib.csv', index=False)

In [7]:
lm_cnty = pd.read_csv('LimerickCnty_Lib.csv')
lm_cnty.head()

Unnamed: 0,Address1,Address2,Administrative Autority,County,Latitude,Longitude,Name
0,Bridge St,"Abbeyfeale West, Abbeyfeale, Co. Limerick, ...",Limerick City and County Council,Limerick,52.383975,-9.301578,Abbeyfeale Library
1,Main St,"Blackabbey, Adare, Co. Limerick, Ireland",Limerick City and County Council,Limerick,52.564622,-8.790106,Adare Library
2,The Quay,"Askeaton, Co. Limerick, Ireland",Limerick City and County Council,Limerick,52.600884,-8.973165,Askeaton Library
3,Caherconlish Library,"Hundredacres East, Caherconlish, Co. Limeri...",Limerick City and County Council,Limerick,52.595662,-8.472314,Caherconlish Library
4,Gortnascarry,"Cappamore, Co. Limerick, Ireland",Limerick City and County Council,Limerick,52.615463,-8.335975,Cappamore Library
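
The `Address1` and `Address2` columns come from splitting the geocoder's `formatted_address` string at its first comma. The sketch below isolates that step so it can be tried without an API key. `split_address` is a hypothetical helper, and the sample string is an assumption about what the geocoder returned for Adare Library, pieced together from the table above.

```python
def split_address(formatted_address):
    """Split a formatted address into (first part, remaining parts joined by ', ')."""
    parts = [part.strip() for part in formatted_address.split(',')]
    return parts[0], ', '.join(parts[1:])


# Assumed geocoder output, reconstructed from the Adare Library row above
address1, address2 = split_address('Main St, Blackabbey, Adare, Co. Limerick, Ireland')
print(address1)
print(address2)
```

Stripping each part before joining avoids the leading spaces that a plain `split(',')` leaves behind.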


##### Case 2: Web scraping from Google to get email, phone and opening hours -

In [10]:
def web_scrapping_limerick_cnty(lib_name):
    """Scrape email, opening hours and phone from the first Google hit for a library."""

    query = lib_name + " Limerick County"
    url = list(search(query, tld="co.in", num=1, stop=1, pause=2))[0]

    with urllib.request.urlopen(url) as response:
        html = response.read()
    soup = BeautifulSoup(html, 'lxml')

    try:
        email = soup.find_all("div", class_="field field--name-field-email field--type-email field--label-hidden field__item")[0].text.strip()
    except Exception:
        email = None

    try:
        tel = soup.find_all('div', class_='field field--name-field-telephone field--type-telephone field--label-hidden field__items')[0].text.replace("\n", "").strip()
    except Exception:
        tel = None

    try:
        hrs = soup.find_all("div", class_="clearfix text-formatted field field--name-field-opening-hours field--type-text-long field--label-hidden field__item")[0].find_all("li")
    except IndexError:
        # No opening-hours block on the page
        return email, None, tel, url

    opn_hrs = {}

    try:
        for x in hrs:

            if ":" in x.text:
                # "Day: hours" entries
                tt = x.text.split(":")
                opn_hrs[tt[0].lower()] = tt[1].replace('\xa0', ' ').replace('&', " ").replace("\u200b", " ").strip()

            elif "monday" in x.text.lower():
                tt = x.text.lower().split("monday")
                opn_hrs['monday'] = " ".join(tt[1:]).replace("\xa0", " ").strip()

            elif "closed" in x.text.lower():
                tt = x.text.lower().split("closed")
                opn_hrs['Closed'] = " ".join(tt[1:]).replace("\xa0", " ").replace(":", "").replace("-", "").strip()

    except Exception:
        opn_hrs = None

    return email, opn_hrs, tel, url
            

def limrk_cnty_final_data():
    
    lm_cnt = pd.read_csv('LimerickCnty_Lib.csv')
    
    for x in ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday', 'Email', 'Phone', 'Closed', 'Website']:
        
        lm_cnt[x] = None
        
    for index, row in lm_cnt.iterrows():
        
        em, opn_hrs, tel, url = web_scrapping_limerick_cnty(row['Name'])
        
        lm_cnt.loc[index, 'Email'] = em
        lm_cnt.loc[index, 'Phone'] = tel
        lm_cnt.loc[index, 'Website'] = url
        
        if isinstance(opn_hrs, dict):
            
            for y in list(opn_hrs.keys()):
                
                try:
                    
                    lm_cnt.loc[index, y] = opn_hrs[y]
                    
                except Exception as e:
                    pass
                
    return lm_cnt
        
lmk_data = limrk_cnty_final_data()
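
The opening-hours step splits each list item at a colon and uses the lower-cased text before it as the column name. The sketch below is a simplified variant of that idea, not the scraper's exact logic: it splits only at the first colon, so a time such as `10:30am` stays intact. `parse_opening_hours` is a hypothetical helper, and the sample strings are made up to resemble the scraped list items.

```python
def parse_opening_hours(items):
    """Map lower-cased day names to opening-hours text for 'Day: hours' strings."""
    hours = {}
    for text in items:
        # Normalise non-breaking and zero-width spaces left over from the HTML
        cleaned = text.replace('\xa0', ' ').replace('\u200b', '').strip()
        day, sep, rest = cleaned.partition(':')
        if sep:  # keep only items that contain a colon
            hours[day.strip().lower()] = rest.strip()
    return hours


# Made-up list items for illustration
sample_items = [
    'Monday: Closed',
    'Tuesday:\xa010am - 1pm 2pm - 5pm',
    'Saturday: 10:30am - 1pm',
    'Library notice without a colon',
]
sample_hours = parse_opening_hours(sample_items)
print(sample_hours)
```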

## Final Enriched Data:

In [11]:
lmk_data.to_csv('LimerickCnty_Lib.csv', index=False)  # save the enriched data
lmk_data.head()

Unnamed: 0,Address1,Address2,Administrative Autority,County,Latitude,Longitude,Name,monday,tuesday,wednesday,thursday,friday,saturday,sunday,Email,Phone,Closed,Website,wednesday.1
0,Bridge St,"Abbeyfeale West, Abbeyfeale, Co. Limerick, ...",Limerick City and County Council,Limerick,52.383975,-9.301578,Abbeyfeale Library,closed,10am - 1pm 2pm - 5pm,1pm - 5pm 6pm - 8pm,1pm - 5pm 6pm - 8pm,10am - 1pm 2pm - 5pm,10am - 1pm 2pm - 5pm,,abbeyfealelibrary@limerick.ie,+353 68 32488,on the saturday of bank holiday weekends,https://www.limerick.ie/discover/eat-see-do/fa...,
1,Main St,"Blackabbey, Adare, Co. Limerick, Ireland",Limerick City and County Council,Limerick,52.564622,-8.790106,Adare Library,closed,"10am -1pm, 2pm – 5pm 6pm – 8pm",10am -1pm and 2pm – 5pm,"10am-1pm, 2pm – 5pm 6pm – 8pm",10am -1pm and 2pm – 5pm,10am – 1pm 2pm – 5pm,,adarelibrary@limerick.ie,+353 61 396822,on the saturday of bank holiday weekends,https://www.limerick.ie/discover/eat-see-do/fa...,
2,The Quay,"Askeaton, Co. Limerick, Ireland",Limerick City and County Council,Limerick,52.600884,-8.973165,Askeaton Library,closed,,,,,,,askeatonlibrary@gmail.com,+353 61 392256,,https://www.limerick.ie/discover/eat-see-do/fa...,
3,Caherconlish Library,"Hundredacres East, Caherconlish, Co. Limeri...",Limerick City and County Council,Limerick,52.595662,-8.472314,Caherconlish Library,,,,,,,,,+353 61 556526,,https://www.limerick.ie/discover/eat-see-do/fa...,
4,Gortnascarry,"Cappamore, Co. Limerick, Ireland",Limerick City and County Council,Limerick,52.615463,-8.335975,Cappamore Library,Closed,10am - 1pm 2pm - 5pm,10am - 1pm 5pm - 8pm,10am - 1pm 2pm - 5pm,10am - 1pm 2pm - 5pm,,,cappamorelibrary@limerick.ie,+ 353 61 381586,,https://www.limerick.ie/discover/eat-see-do/fa...,
