# Coursework 1

A Jupyter notebook has a series of cells, which are split into two different types.  There is a Markdown cell (like this one) which allows text to be input in either Markdown or HTML format.  The other type of cell is a code cell, which allows you to run the code written inside it.

When you run a code cell, you will not necessarily see any output, but any output that is produced appears below the cell.  To check whether the cell has finished running, look at the `In [ ]` next to it.  While the cell is processing, the brackets contain an asterisk (\* symbol); when it has finished, the asterisk is replaced by a number that increases with each cell you run.

Run the code in the following cell by pressing `Ctrl` + `Enter`.  If you want to run the cell and then move the focus to the next cell, use `Shift` + `Enter` instead.

In [30]:
import json
import requests
from pymongo import MongoClient
from datetime import datetime
from unittest import TestCase
print("Success!")

Success!


# Question 1

Using the API (v2) at http://api.ratings.food.gov.uk, perform the following tasks with the [Python Requests](http://docs.python-requests.org/en/master/) library.
## Question 1(a) 

Write a function `get_local_authorities()` which gets a list of all the local authorities.  It takes one parameter, `data_format`, which accepts a string.
- If the parameter is XML it should return the data in XML format; if it is JSON, it should return the data in JSON format
- If the parameter is not one of those two strings the function should raise a `ValueError` with an appropriate error message
- You should return the `requests` response object
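The format rules above can be checked before any request is sent. The sketch below shows one way to do that; `build_headers` is a hypothetical helper name that does not appear in the coursework, and matching the format string case-insensitively is an assumption.

```python
def build_headers(data_format):
    """Return request headers for the given format, or raise ValueError.

    Hypothetical helper: the name and the case-insensitive matching are
    assumptions made for illustration only.
    """
    if not isinstance(data_format, str) or data_format.lower() not in ("xml", "json"):
        raise ValueError(
            "data_format must be 'xml' or 'json', got {!r}".format(data_format)
        )
    media_type = "application/{}".format(data_format.lower())
    # The API version is written as a string because header values are text
    return {"x-api-version": "2", "accept": media_type}
```

The returned dictionary could then be passed as the `headers` argument of `requests.get`, so an invalid format fails before the network is touched.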


In [49]:
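def get_local_authorities(data_format):
    # Only "xml" and "json" (in any case) are valid; anything else is an error
    fmt = data_format.lower() if isinstance(data_format, str) else data_format
    if fmt not in ("xml", "json"):
        raise ValueError("data_format must be 'xml' or 'json', got {!r}".format(data_format))
    headers = {"x-api-version": "2", "accept": "application/{}".format(fmt), "content-type": "application/{}".format(fmt)}
    r = requests.get('http://api.ratings.food.gov.uk/Authorities', headers=headers)
    return r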

<Response [200]>


In [40]:
get_local_authorities("json").json()

{'authorities': [{'CreationDate': '2010-08-17T15:30:24.87',
   'Email': 'commercial@aberdeencity.gov.uk',
   'EstablishmentCount': 1762,
   'FileName': 'http://ratings.food.gov.uk/OpenDataFiles/FHRS760en-GB.xml',
   'FileNameWelsh': None,
   'FriendlyName': 'aberdeen-city',
   'LastPublishedDate': '2016-10-12T00:37:35.363',
   'LocalAuthorityId': 197,
   'LocalAuthorityIdCode': '760',
   'Name': 'Aberdeen City',
   'RegionName': 'Scotland',
   'SchemeType': 2,
   'SchemeUrl': '',
   'Url': 'http://www.aberdeencity.gov.uk',
   'links': [{'href': 'http://api.ratings.food.gov.uk/authorities/197',
     'rel': 'self'}]},
  {'CreationDate': '2010-08-17T15:30:24.87',
   'Email': 'environmental@aberdeenshire.gov.uk',
   'EstablishmentCount': 1925,
   'FileName': 'http://ratings.food.gov.uk/OpenDataFiles/FHRS761en-GB.xml',
   'FileNameWelsh': None,
   'FriendlyName': 'aberdeenshire',
   'LastPublishedDate': '2016-11-05T00:35:58.823',
   'LocalAuthorityId': 198,
   'LocalAuthorityIdCode': '761',

In [50]:
r = get_local_authorities("xml")
r.text

'<AuthorityDetailCollection xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/FHRS.Model.Detailed"><links xmlns="http://schemas.datacontract.org/2004/07/FHRS.Model.MetaLinks"><link><href>http://api.ratings.food.gov.uk/authorities</href><rel>self</rel></link></links><meta xmlns="http://schemas.datacontract.org/2004/07/FHRS.Model.MetaLinks"><dataSource>API</dataSource><extractDate>2016-11-05T13:23:24.4505417+00:00</extractDate><itemCount>392</itemCount><pageNumber>1</pageNumber><pageSize>392</pageSize><returncode>OK</returncode><totalCount>392</totalCount><totalPages>1</totalPages></meta><authorities><authority><links xmlns="http://schemas.datacontract.org/2004/07/FHRS.Model.MetaLinks"><link><href>http://api.ratings.food.gov.uk/authorities/197</href><rel>self</rel></link></links><CreationDate>2010-08-17T15:30:24.87</CreationDate><Email>commercial@aberdeencity.gov.uk</Email><EstablishmentCount>1762</EstablishmentCount><FileName>http://ratin

## Question 1(b)

Write a function `get_establishment_ids()` which accepts parameters `page_number` and `page_size` and returns a list of integers of the FHRSID of each establishment on the page.

- This function should gracefully handle any records which do not have a FHRSID attribute by putting `None` in the list instead of the FHRSID.
- If the `page_number` or `page_size` parameters are not integers the function should raise a `ValueError`
- If there are no more records left to collect, the function should return `None`
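The three rules above can be tried out without calling the API by working on an already parsed response. The sketch below assumes the response is a dict with an `establishments` list, as the code in this notebook does; `extract_fhrsids` and `check_page_args` are hypothetical helper names.

```python
def extract_fhrsids(payload):
    """Pull the FHRSID of each record out of a parsed JSON response.

    `payload` is assumed to be a dict holding an "establishments" list.
    """
    records = payload.get("establishments") or []
    # dict.get returns None when a record has no FHRSID key
    ids = [record.get("FHRSID") for record in records]
    # An empty page means there is nothing left to collect
    return ids if ids else None


def check_page_args(page_number, page_size):
    """Raise ValueError unless both arguments are integers (bools excluded)."""
    for name, value in (("page_number", page_number), ("page_size", page_size)):
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError("{} must be an integer, got {!r}".format(name, value))
```

Excluding `bool` is a design choice rather than a requirement of the question: in Python `True` is an instance of `int`, so a plain `isinstance(value, int)` check would accept it.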


In [54]:
def get_establishment_ids(page_number, page_size):
    # Reject non-integer paging arguments, as the question requires
    if not isinstance(page_number, int) or not isinstance(page_size, int):
        raise ValueError("page_number and page_size must be integers")
    headers = {"x-api-version": "2", "accept": "application/json", "content-type": "application/json"}
    url = 'http://api.ratings.food.gov.uk/Establishments/basic/{}/{}'.format(page_number, page_size)
    data = requests.get(url, headers=headers).json()
    # Use None for any record that lacks an FHRSID
    fhrsid_list = [establishment.get("FHRSID") for establishment in data.get("establishments", [])]
    return fhrsid_list if len(fhrsid_list) > 0 else None
get_establishment_ids(500000,12)

None


In [90]:
get_establishment_ids(30,10)

[245129, 875843, 897987, 786266, 182807, 278192, 29166, 859034, 326620, 360305]

In [58]:
print(get_establishment_ids(500000,12))

None


## Question 1(c)

Write a function `get_establishments`, which accepts the parameter `establishment_ids`, which is a list of the establishment IDs

- The function should iterate through the list of IDs, and download the detailed information for that ID from the API
- It should not assume that the caller will provide correct IDs.  If an ID does not exist, the function should not add it to the JSON object
- Use the provided stub function `insert_data(js)` to represent the insertion of data into a database.  The `js` parameter should be a JSON object, or a list of JSON objects.  This function should only be called once within the `get_establishments` function.
- The `insert_data` function should not be called if the JSON object is empty
- A `requests.exceptions.HTTPError` should be raised for a 4XX or 5XX status code.
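The control flow these requirements describe can be sketched separately from the HTTP calls. In the sketch below, `collect_establishments` and its `fetch` and `insert` parameters are hypothetical names: `fetch` stands for whatever downloads one record and returns `None` for an unknown ID, and `insert` plays the role of `insert_data`.

```python
def collect_establishments(establishment_ids, fetch, insert):
    """Gather records for the given IDs and insert them in a single call.

    `fetch(establishment_id)` is assumed to return a dict for a known ID
    and None for an unknown one; `insert(records)` receives the full list.
    """
    records = []
    for establishment_id in establishment_ids:
        record = fetch(establishment_id)
        if record:  # skip unknown IDs and empty payloads
            records.append(record)
    if records:  # never call insert when there is nothing to insert
        insert(records)
    return records
```

In a real `fetch`, calling `raise_for_status()` on the `requests` response raises `requests.exceptions.HTTPError` for 4XX and 5XX status codes, which is one way to meet the last requirement.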

In [85]:
def insert_data(js):
    # Stub standing in for a database insert
    print("number of establishments added {}".format(len(js)))
    print(js)

def get_establishments(establishment_ids):
    headers = {"x-api-version": "2", "accept": "application/json", "content-type": "application/json"}
    data_to_add = []
    for establishment_id in establishment_ids:
        r = requests.get('http://api.ratings.food.gov.uk/Establishments/{}'.format(establishment_id), headers=headers)
        # Any 4XX or 5XX response aborts the whole download
        if r.status_code in range(400, 600):
            raise requests.exceptions.HTTPError("Error Code {}".format(r.status_code))
        record = r.json()
        # Heuristic: keep only payloads with more than two keys
        if len(record.keys()) > 2:
            data_to_add.append(record)
    # Call insert_data once, and only if there is something to insert
    if len(data_to_add) > 0:
        insert_data(data_to_add)

In [88]:
get_establishments(get_establishment_ids(100,4))

number of establishments added 4
[{'FHRSID': 348757, 'links': [], 'AddressLine2': 'Stainland Road', 'Distance': None, 'BusinessTypeID': 1, 'LocalAuthorityWebSite': 'http://www.calderdale.gov.uk', 'AddressLine4': 'Stainland Halifax', 'LocalAuthorityCode': '406', 'SchemeType': 'FHRS', 'Phone': '', 'geocode': {'longitude': '-1.881822', 'latitude': '53.672131'}, 'AddressLine3': '', 'meta': {'pageNumber': 1, 'itemCount': 0, 'extractDate': '0001-01-01T00:00:00', 'dataSource': 'Lucene', 'returncode': 'OK', 'pageSize': 1, 'totalCount': 1, 'totalPages': 1}, 'BusinessName': '1885 The Restaurant', 'NewRatingPending': False, 'RatingDate': '2013-12-04T00:00:00', 'LocalAuthorityBusinessID': '88486', 'PostCode': 'HX4 9HF', 'scores': {'Hygiene': 5, 'Structural': 0, 'ConfidenceInManagement': 5}, 'LocalAuthorityEmailAddress': 'environmental.health@calderdale.gov.uk', 'RatingValue': '5', 'AddressLine1': '', 'BusinessType': 'Restaurant/Cafe/Canteen', 'LocalAuthorityName': 'Calderdale', 'RatingKey': 'fhrs_

In [82]:
get_establishments([100000000])

HTTPError: error code 404

# Question 2

Suppose you have completed collecting the data and are storing it in a MongoDB database.  This question will require you to query that data.  The database is called `health_data`, and contains collections for each local authority (e.g., `db.southampton`, `db.swansea`, `db.westminster`), as well as one for the whole of the UK (`db.uk`).  You can see all the collections by running `db.collection_names()`.  

Note that you will need to be on the ECS network to complete this question.

## Question 2(a)
Using the `MongoClient` class in `PyMongo`, create a database object `db` with the following information.
- Server: svm-hf1g10-comp6235-temp.ecs.soton.ac.uk
- Port: 27017
- User: COMP6235
- Password: wkbbsdh8oDY2
- Database: health_data

In [115]:
"""
In this cell, the variable db should be defined, as a PyMongo database object connected to health_data.
"""
config = {
    "username":"COMP6235",
    "password":"wkbbsdh8oDY2",
    "host":"svm-hf1g10-comp6235-temp.ecs.soton.ac.uk",
    "port":"27017",
    "db":"health_data"
}

Config = type("conf", (object,), config)

uri = "mongodb://{c.username}:{c.password}@{c.host}:{c.port}".format(c=Config)
print(uri)
client = MongoClient(uri)
db = client.health_data
db.collection_names()

mongodb://COMP6235:wkbbsdh8oDY2@svm-hf1g10-comp6235-temp.ecs.soton.ac.uk:27017


ServerSelectionTimeoutError: svm-hf1g10-comp6235-temp.ecs.soton.ac.uk:27017: timed out
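PyMongo raises `ServerSelectionTimeoutError` when it cannot reach a server within the selection timeout, which is what you would expect when running off the ECS network. The sketch below shows a slightly more defensive way to assemble the same connection string, escaping the credentials with `urllib.parse.quote_plus` in case they contain characters such as `@` or `:`. `build_mongo_uri` is a hypothetical helper name, and the commented-out client lines assume PyMongo is installed.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host, port):
    """Assemble a MongoDB connection URI with percent-escaped credentials."""
    return "mongodb://{}:{}@{}:{}".format(
        quote_plus(username), quote_plus(password), host, int(port)
    )


uri = build_mongo_uri(
    "COMP6235", "wkbbsdh8oDY2",
    "svm-hf1g10-comp6235-temp.ecs.soton.ac.uk", 27017,
)
# With PyMongo installed and ECS network access, the client could be created as:
#   client = MongoClient(uri, serverSelectionTimeoutMS=5000)
#   db = client["health_data"]
```

Lowering `serverSelectionTimeoutMS` from its 30 second default only makes an unreachable server fail sooner; the 5000 ms value is an arbitrary choice for illustration.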

## Question 2(b)

Write a function `get_count`, which takes a PyMongo collection object as a parameter and returns the number of businesses in the collection.

In [None]:
def get_count(collection):
    """
    Return an integer which gives the number of unique businesses in the given collection
    """
    # YOUR CODE HERE
    raise NotImplementedError()
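As a starting point, and not the only possible answer, counting can be delegated to the server. The sketch below assumes that each document represents one business and that PyMongo 3.7 or later is installed, where `Collection.count_documents` is available; `get_count_sketch` is a hypothetical name chosen so that it does not clash with the stub above.

```python
def get_count_sketch(collection):
    """Return how many documents the collection holds.

    Assumes one document per business; duplicates, if any, are not removed.
    """
    # An empty filter matches every document in the collection
    return collection.count_documents({})
```

If the same business can appear more than once, counting the distinct values of an identifying field would be needed instead; which field that is depends on how the data was loaded.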

## Question 2(c)

Write a function `get_rating_value_percentage` which returns the percentage of businesses that were awarded an overall `RatingValue` of 5.  The function should accept a parameter `collection` of type `Collection`, the collection for which the percentage is calculated.

In [None]:
def get_rating_value_percentage(collection):
    """
    Return a float between 0 and 1 giving the proportion of businesses with a RatingValue of 5
    """
    # YOUR CODE HERE
    raise NotImplementedError()
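One possible approach is to divide two server-side counts. The sketch below assumes `RatingValue` is stored as the string `'5'`, as it appears in the API output earlier in this notebook; if the database stores it as an integer the filter would need to change. `rating_five_fraction` is a hypothetical name, and `Collection.count_documents` requires PyMongo 3.7 or later.

```python
def rating_five_fraction(collection):
    """Return the fraction (0 to 1) of documents whose RatingValue is '5'."""
    total = collection.count_documents({})
    if total == 0:
        # Avoid dividing by zero on an empty collection
        return 0.0
    # Assumption: RatingValue is stored as a string, not an integer
    fives = collection.count_documents({"RatingValue": "5"})
    return fives / total
```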

## Question 2(d)

What were the earliest and latest dates on which an inspection was carried out?  Write a function which returns a dictionary in the form `{'earliest_date': 'YYYY-MM-DD', 'latest_date': 'YYYY-MM-DD'}`.

In [None]:
from datetime import datetime
def get_earliest_and_latest_dates(collection):
    # YOUR CODE HERE
    raise NotImplementedError()
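The date handling can be worked out separately from the database query. The sketch below assumes the dates arrive as strings in the form `'2013-12-04T00:00:00'`, like the `RatingDate` value in the API output earlier in this notebook; `summarise_dates` is a hypothetical helper name.

```python
from datetime import datetime


def summarise_dates(rating_dates):
    """Return the earliest and latest dates from ISO-style timestamp strings.

    `rating_dates` is assumed to be an iterable of strings such as
    '2013-12-04T00:00:00'; None and empty entries are ignored.
    """
    parsed = [
        # Only the leading YYYY-MM-DD part is needed
        datetime.strptime(value[:10], "%Y-%m-%d")
        for value in rating_dates
        if value
    ]
    if not parsed:
        return None
    return {
        "earliest_date": min(parsed).strftime("%Y-%m-%d"),
        "latest_date": max(parsed).strftime("%Y-%m-%d"),
    }
```

How the strings are gathered from the collection is left open here; whether the field is stored as a string or as a date type in `health_data` is not stated in the question.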

## Question 2(e)

Write a function `get_nearest_establishment_by_gps()` which returns the nearest eating establishment to the given GPS co-ordinates.  It should have two parameters:
- `collection` - A PyMongo collection object
- `gps_dict` which is a dict in the format `{'lat': lat_value, 'lng': lng_value}`

The `Geocode` field has a 2dsphere index which you will need for this answer.

In [None]:
def get_nearest_establishment_by_gps(collection, gps_dict):
    # YOUR CODE HERE
    raise NotImplementedError()
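A 2dsphere index supports MongoDB's `$near` operator, which returns documents ordered from nearest to farthest. The sketch below only builds the query document; `build_near_query` is a hypothetical helper name, and the field name `Geocode` is taken from the question text.

```python
def build_near_query(gps_dict, field="Geocode"):
    """Build a $near query for a point given as {'lat': ..., 'lng': ...}."""
    point = {
        "type": "Point",
        # GeoJSON coordinate order is [longitude, latitude]
        "coordinates": [float(gps_dict["lng"]), float(gps_dict["lat"])],
    }
    return {field: {"$near": {"$geometry": point}}}
```

Passing the result to `collection.find_one(...)` would then return the first, and therefore nearest, match. Note the longitude-first order, which is easy to get wrong when the input dict lists latitude first.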