# Management and Pre-Processing of Data - Due 29th July (23:59)

In this assessment you will go through the process of obtaining data, cleaning it, and then querying it from a database.  We are using data about food hygiene from UK open data.  The data stored is a copy of the official data.

To provide a solution for each task, you might like to do the practice exercises: "HTML and Page Scraping", and "Using MongoDB to Retrieve Information" first.

You may validate your answers by clicking "Validate" on the "Assignments" tab for this exercise.  Validation is done automatically, using the tests in this notebook.  The final submission will be both machine checked and human marked.

## Question 0: Setup [1 mark]

Run the following cell to import the core dependencies required for this exercise.

In [1]:
# You don't need to write anything here
import requests
import json
from bs4 import BeautifulSoup
from urllib.robotparser import RobotFileParser
from nose.tools import assert_equal, assert_raises
from pymongo import MongoClient

In [2]:
# Check that the required libraries and functions have been imported
# You don't need to write anything here

try:
    imports = [requests, BeautifulSoup, RobotFileParser, assert_equal, assert_raises, json, MongoClient]
except NameError as e:
    print(e)
    raise AssertionError('You appear to be missing one of the required libraries or functions')
assert True
print('Successfully imported libraries and functions')

Successfully imported libraries and functions


## Question 1: Web APIs and Page Scraping



### Question 1(a) [2 marks]

Write a function `get_establishment_by_id` which accepts a parameter `id`, and returns the name of that business as a string.  It should obtain the data from the [food hygiene ratings API](http://ratings.food.gov.uk/open-data/en-GB), and use version 2 of the API.
- You may **assume that the ID exists**
- You should use the **`Establishments`** endpoint  

To complete this question you may wish to look at the information found [here](http://docs.python-requests.org/en/master/user/quickstart/).   

**N.B.** The version of requests installed on the server is relatively recent.  In a previous update, there was a breaking change which meant that only strings or byte-like objects could be passed as headers.  As such, if you wish to pass an integer, you will have to do it as e.g., `{'header_name': '4321'}`.  

*Hint: Week 3, Guided Exercise 2, Scraping With Requests and Beautiful Soup*

*Read: the information given regarding the API version on the [help page](http://api.ratings.food.gov.uk/help). Remember that you don't need to include '/Help/Api/GET-' in the URL.*

*Read: the information in the requests ['Quickstart Guide'](http://docs.python-requests.org/en/master/user/quickstart/) with regard ['Custom Headers'](http://docs.python-requests.org/en/master/user/quickstart/#custom-headers) and ['JSON Response Content'](http://docs.python-requests.org/en/master/user/quickstart/#json-response-content)*

#### Notes
- The food hygiene rating data published at www.food.gov.uk/ratings is available via an application programming interface (API) in XML and JSON formats, so a JSON response can be decoded with `.json()`.
- Version 2 of the food hygiene ratings API must be requested in the headers. Please note: if a version is not supplied in the header, calls to FHRS API endpoints will not return data.
- The business name is the `BusinessName` field of the Establishments response for the specified id number. See the request format on the API help page.
- Return the name of the business as a string.

#### Hint
`%d` is the placeholder for `id`: it formats an integer in decimal (the `d` stands for decimal).
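
As a small illustration of the hint and the N.B. above, the sketch below builds the request URL and headers without contacting the API. The helper name `build_establishment_request` is hypothetical and exists only for this example; the base URL and the `x-api-version` header are the ones used in this notebook.

```python
def build_establishment_request(establishment_id, api_version=2):
    """Return the URL and headers for one Establishments lookup (no request is sent)."""
    # %d formats an integer; a non-numeric value raises TypeError
    url = 'http://api.ratings.food.gov.uk/Establishments/%d' % establishment_id
    # Header values must be strings, so the integer version is converted
    headers = {'x-api-version': str(api_version)}
    return url, headers


url, headers = build_establishment_request(511819)
print(url)
print(headers)
```

The returned pair could then be passed to `requests.get(url, headers=headers)`.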


In [3]:
url = 'http://api.ratings.food.gov.uk/Help/Api/GET-Establishments-id'
headers = {'x-api-version': '2'}
r = requests.get(url, headers=headers)
r.status_code #success 200

200

In [4]:
#print(r.headers) #headers the servers sent back
#print(r.text) #full text
#print(r.encoding) #utf-8
#print(r.request.headers) #headers I sent to the server, includes the v2 header

In [5]:
def get_establishment_by_id(id):
    
    #YOUR CODE HERE
    '''num -> string
    Using the establishment id number, return the name of that business
    '''
    
    url = 'http://api.ratings.food.gov.uk/Establishments/%d' %id
    headers = {'x-api-version': '2'}
    r = requests.get(url, headers=headers)
    
    return r.json()['BusinessName']

    #raise NotImplementedError()

get_establishment_by_id(511819)

'Star Karahi'

In [6]:
assert_equal(get_establishment_by_id(492474), '360 Beach and Watersports Centre')
assert_equal(get_establishment_by_id(511819), 'Star Karahi')
assert_equal(get_establishment_by_id(692630), 'Baldiesburn Bed & Breakfast')
print('All tests successfully passed')

All tests successfully passed


### Question 1(b) [2 marks]

Data stored at **http://138.68.148.20/**, in HTML format will be used for this question.  Use the Python `requests` library for any requests to the server:

**Write a function** `check_robots`, which accepts a **parameter** `url` and returns whether the server at **http://138.68.148.20/** will permit you to scrape that page.  

*Hint: Week 3, Guided Exercise 2, Robots.txt*
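
To see how `RobotFileParser` decides, it can be fed rules directly instead of downloading them. The sketch below is an offline illustration only: the rules in `sample_rules` are made up for the example and may not match the real `robots.txt` on the server.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; the real file on the server may differ
sample_rules = [
    'User-agent: *',
    'Disallow: /data/scotland/',
]

rp = RobotFileParser()
# parse() accepts an iterable of lines, so no network request is made
rp.parse(sample_rules)

allowed = rp.can_fetch('*', 'http://138.68.148.20/index.html')
blocked = rp.can_fetch('*', 'http://138.68.148.20/data/scotland/glasgow_city')
print(allowed, blocked)
```

In the assignment itself the rules come from the server, via `set_url` and `read`, rather than from a hard-coded list.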

In [7]:
def check_robots(url):
    """
    Use the RobotFileParser to check if a page on the server can be visited
    """
    # YOUR CODE HERE
    
    rp = RobotFileParser()
    rp.set_url('http://138.68.148.20/robots.txt')
    rp.read()
    return rp.can_fetch('*', url) 
    
    #raise NotImplementedError()
    
#check_robots('http://138.68.148.20/')   
#check_robots('http://138.68.148.20/index.html')
#not check_robots('http://138.68.148.20/data/scotland/glasgow_city')

In [8]:
# Testing whether your code works correctly.  
# You don't need to write anything here

# Confirm an allowed page returns True
assert check_robots('http://138.68.148.20/index.html')
# Confirm a disallowed page returns False
assert not check_robots('http://138.68.148.20/data/scotland/glasgow_city')
print('Passed all the tests')

Passed all the tests


### Question 1(c) [3 marks]

Write a function which takes a URL as a **parameter**, and reads the **XML** at that address.  The function should **return** a `dict` with the number of records in `EstablishmentCollection`, and the name of the first business.  
**HINT (READ THIS)**: You can use `BeautifulSoup` for parsing XML as well as HTML.  The function should behave as follows:
- The function should use the Python **`requests`** library.
- **If** the page is banned by robots.txt, then it should not be visited, and should return **`None`**
- **If** the page does not return a **200 status code** in response, then it should not attempt to parse the result, and return **`None`**
- If the page is an **XML** file, it should return a dict in the following format: `{'first_business': 'business name', 'amount_of_records': 1234}`

N.B. The order of a Python `dict` is not guaranteed, so we will not take into account which key appears first.  

*Hint: Week 3, Guided Exercise 2, Parsing HTML - Scraping with Requests and Beautiful Soup*
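
The question allows Python's core libraries as an alternative to `BeautifulSoup`. The sketch below shows the same counting idea with `xml.etree.ElementTree` on an inline document. The sample XML is a made-up, heavily trimmed stand-in (the `FHRSEstablishment` and `EstablishmentDetail` element names are assumptions about the page layout), and `summarise` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up, trimmed document; the real pages contain many more fields
sample_xml = """<FHRSEstablishment>
  <EstablishmentCollection>
    <EstablishmentDetail><BusinessName>Cafe One</BusinessName></EstablishmentDetail>
    <EstablishmentDetail><BusinessName>Fish &amp; Chips Two</BusinessName></EstablishmentDetail>
  </EstablishmentCollection>
</FHRSEstablishment>"""


def summarise(xml_text):
    """Return the first business name and the record count, or None if there are no records."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so nesting depth does not matter
    names = [element.text for element in root.iter('BusinessName')]
    if not names:
        return None
    return {'first_business': names[0], 'amount_of_records': len(names)}


summary = summarise(sample_xml)
print(summary)
```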

In [9]:
r = requests.get('http://138.68.148.20/data/west_midlands/cannock_chase')

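doc = r.content #raw bytes of the XML response
print(r.status_code)
soup = BeautifulSoup(doc, 'xml')
#name of the first business on the page
first_business = soup.find('BusinessName').text
print(first_business)
#number of records, counted as one BusinessName per establishment
record_count = len(soup.find_all('BusinessName'))
print({'first_business': first_business, 'amount_of_records': record_count})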

200
1st Choice Pizza/Fish & Chips
{'first_business': '1st Choice Pizza/Fish & Chips', 'amount_of_records': 731}


In [10]:
def parse_xml(url):
    """
    string -> dict
    This function should parse the XML file, for example http://138.68.148.20/data/west_midlands/cannock_chase
    NOTE: Unlike for HTML, you need to use 'xml' as the second parameter for BeautifulSoup
    You may use any of Python's core libraries, or other libraries installed if you wish rather than BeautifulSoup
    
    >>> parse_xml('http://138.68.148.20/data/west_midlands/cannock_chase')
        {'amount_of_records': 731, 'first_business': '1st Choice Pizza/Fish & Chips'}
    >>> parse_xml('http://138.68.148.20/data/wales/swansea')                     
        {'amount_of_records': 1700, 'first_business': '360 Beach and Watersports Centre'}
    """
    # YOUR CODE HERE
    
    # Use robots.txt to decide whether the page may be scraped
    if not check_robots(url):
        return None

    r = requests.get(url)
    # Do not attempt to parse anything other than a successful response
    if r.status_code != 200:
        return None

    soup = BeautifulSoup(r.content, 'xml')
    # Name of the first business in the collection
    first_business = soup.find('BusinessName').text
    # Number of records, counted as one BusinessName element per establishment
    record_count = len(soup.find_all('BusinessName'))

    return {'first_business': first_business, 'amount_of_records': record_count}
    
    #raise NotImplementedError()
#parse_xml('http://138.68.148.20/data/west_midlands/cannock_chase')
parse_xml('http://138.68.148.20/data/scotland/clackmannanshire')

In [11]:
# You don't need to write anything here
# Confirm that the function calls the check_robots function
tmp_check_robots = check_robots
del check_robots

try:
    parse_xml('http://138.68.148.20/data/west_midlands/cannock_chase')
except NameError:
    pass
else:
    raise AssertionError("parse_xml does not call check_robots")
finally:
    check_robots = tmp_check_robots

# TEST NOT VISITING PAGES PROHIBITED BY ROBOTS
# THIS SHOULD NOT CALL requests.get

tmp_requests = requests
del requests

try:
    parse_xml('http://138.68.148.20/data/scotland/glasgow_city')
    parse_xml('http://138.68.148.20/data/scotland/clackmannanshire')
except NameError:
    raise AssertionError("The function should not be using requests on this URL")
finally:
    requests = tmp_requests
# TEST OUTPUT RESPONSE
assert_equal(parse_xml('http://138.68.148.20/data/west_midlands/cannock_chase'), 
             {'amount_of_records': 731, 'first_business': '1st Choice Pizza/Fish & Chips'})
assert_equal(parse_xml('http://138.68.148.20/data/wales/swansea'), 
                       {'amount_of_records': 1700, 'first_business': '360 Beach and Watersports Centre'})
# TEST HANDLING 404
assert_equal(parse_xml('http://138.68.148.20/data/calderdale'), None)

print('All tests successfully passed')

All tests successfully passed


## Question 2: Retrieving Data from MongoDB

We will assume that you have successfully cleaned the data, and have stored it in the MongoDB database.  Using the following PyMongo configuration, answer the following questions about the data:

In [12]:
# These are the credentials to connect to the database
# You don't need to write anything here, but you need to run this cell

client = MongoClient('mongodb://cpduser:M13pV5woDW@mongodb/health_data')
db = client.health_data

### Question 2(a) [1 mark]

Write a **function** `get_count`, which takes a PyMongo collection object as a parameter and **returns** the number of businesses in the collection.  

*Hint: Week 3, Guided Exercise 4, Using MongoDB to Retrieve Information*

In [13]:
db.collection_names() #outputs list of collection names
db.aberdeenshire.count() #with no arguments, count() returns the number of documents in the collection;
#given a filter, it returns the number of documents matching that query without fetching them.

1952

In [14]:
def get_count(collection):
    """
    Return an integer which gives the amount of unique businesses in the given collection
    """
    # YOUR CODE HERE
    return collection.count()
    #raise NotImplementedError()


In [15]:
# You don't need to write anything here
assert_equal(get_count(db.uk), 511819)
assert_equal(get_count(db.swansea), 1700)
assert_equal(get_count(db.westminster), 4315)
assert_equal(get_count(db.newcastle_upon_tyne), 2308)
print('Passed all the tests')

Passed all the tests
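
`Collection.count()` and `Database.collection_names()` were deprecated in PyMongo 3.7 and removed in PyMongo 4.0, so the cells above need an older driver. On a newer driver, a sketch along the following lines should give the same totals. `count_businesses` is a hypothetical name, and the function has not been run against this database.

```python
def count_businesses(collection, query=None):
    """Count documents in a PyMongo collection, optionally restricted by a filter."""
    # count_documents requires a filter argument; an empty dict matches everything
    return collection.count_documents(query or {})
```

It would be called as `count_businesses(db.swansea)`, or with a filter such as `count_businesses(db.uk, {'Region': 'scotland'})`.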


### Question 2(b) [3 marks]

Write a **function** `get_rating_value_percentage` which **returns** the **percentage** of businesses which were awarded an overall `RatingValue` of 5.  The function should accept a parameter `collection` of type `Collection`, for which it should return the percentage as a **float** between 0 and 1.  

*Hint: Week 3, Guided Exercise 4, Cursors*

In [16]:
var = get_count(db.swansea.find({'RatingValue': {'$gte': 5}}))
var2 = get_count(db.swansea)
var/var2

0.6688235294117647

In [17]:
def get_rating_value_percentage(collection):
    """
    Return a float between 0 and 1 of the amount with a RatingValue of 5
    """
    # YOUR CODE HERE
    star_5 = get_count(collection.find({'RatingValue':{'$gte': 5}}))
    all_bus = get_count(collection)
    return star_5/all_bus
    
    #raise NotImplementedError()
    
#get_rating_value_percentage(db.uk)

In [18]:
# You don't need to write anything here
assert_equal(get_rating_value_percentage(db.uk), 0.5287240215779406)
assert_equal(get_rating_value_percentage(db.swansea), 0.6688235294117647)
assert_equal(get_rating_value_percentage(db.westminster), 0.4600231749710313)
assert_equal(get_rating_value_percentage(db.newcastle_upon_tyne), 0.5966204506065858)
print('Passed all the tests')

Passed all the tests
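
The solution above filters with `$gte: 5`, which would also count any value above 5, and it divides without checking for an empty collection. The sketch below is one possible variant that matches the rating exactly and guards the division. It assumes that `RatingValue` is stored as a number and that a driver with `count_documents` is available; `share_with_rating` is a hypothetical name, and the function has not been checked against the test values above.

```python
def share_with_rating(collection, rating=5):
    """Return the fraction (0 to 1) of documents whose RatingValue equals `rating`."""
    total = collection.count_documents({})
    if total == 0:
        # Avoid ZeroDivisionError on an empty collection
        return 0.0
    matching = collection.count_documents({'RatingValue': rating})
    return matching / total
```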


### Question 2(c) [3 marks]

Write a **function** `get_no_geocode` which will find establishments with region Scotland which do not have a `Geocode` recorded.  The parameter `establishment_type` is a string, which will indicate the type of establishment to search for.  All queries should be run on the `uk` collection.

The function should **return** a PyMongo **`Cursor`** object, with only the following fields:
- `BusinessName`, `BusinessType`, and `LocalAuthorityName`.  
- `_id` should not be included  

*Hint: Week 3, Guided Exercise 4, Returning Part of a Document*

In [19]:
db.uk.count({'Region': 'scotland'})
db.uk.find_one({'Region': 'scotland'})
db.uk.find_one({'Region': 'scotland', 'BusinessType': 'Takeaway/sandwich shop', 'Geocode': None}, 
               {'BusinessName': 1, 'BusinessType': 1, 'LocalAuthorityName': 1, '_id': 0})                          

{'BusinessName': 'AMT Coffee',
 'BusinessType': 'Takeaway/sandwich shop',
 'LocalAuthorityName': 'Edinburgh (City of)'}

In [30]:
def get_no_geocode(establishment_type):
    # YOUR CODE HERE
    
    var = db.uk.find({'Region': 'scotland', 'BusinessType': establishment_type, 'Geocode': None},
                    {'BusinessName': 1, 'BusinessType': 1, 'LocalAuthorityName': 1, '_id': 0})
    
    return var
    #raise NotImplementedError()

#len(list(get_no_geocode('Takeaway/sandwich shop')))
#len(list(get_no_geocode('Retailers - other')))

In [21]:
# You don't need to write anything here

cursor = get_no_geocode('Restaurant/Cafe/Canteen' )
for cur in cursor:

    assert '_id' not in cur
    assert 'BusinessType' in cur
    assert_equal(cur['BusinessType'], 'Restaurant/Cafe/Canteen')
    assert 'BusinessName' in cur    
    assert 'LocalAuthorityName' in cur

assert_equal(len(list(get_no_geocode('Takeaway/sandwich shop'))), 405)
assert_equal(len(list(get_no_geocode('Retailers - other'))), 1079)
print('Passed all the tests')

Passed all the tests
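
A detail worth knowing about the filter used above: in MongoDB, `{'Geocode': None}` matches documents where the field is null and also documents where the field is missing altogether. The sketch below only builds the filter and projection dictionaries, so it runs without a database; `build_no_geocode_query` is a hypothetical helper name.

```python
def build_no_geocode_query(establishment_type, region='scotland'):
    """Return (filter, projection) dicts suitable for collection.find()."""
    query = {
        'Region': region,
        'BusinessType': establishment_type,
        # None is sent as null, which matches null values and missing fields
        'Geocode': None,
    }
    # 1 includes a field; _id is returned unless it is explicitly excluded with 0
    projection = {'BusinessName': 1, 'BusinessType': 1, 'LocalAuthorityName': 1, '_id': 0}
    return query, projection


query, projection = build_no_geocode_query('Takeaway/sandwich shop')
print(query)
print(projection)
```

The pair could then be unpacked into a call such as `db.uk.find(query, projection)`.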


### Question 2(d) [5 marks]

What were the earliest and latest dates on which an inspection was carried out? Write a **function** which returns a dict in the form `{'earliest_date': 'YYYY-MM-DD', 'latest_date': 'YYYY-MM-DD'}`.  

*Hint: Week 3, Guided Exercise 4, MongoDB Aggregation Framework*

In [65]:
from datetime import datetime # needed so this cell also works when run before the next one

var = db.uk.aggregate([{'$group':{'_id': None, 'earliest':{'$min': '$RatingDate'}, 'latest':{'$max': '$RatingDate'}}}])
for date in var:
    # RatingDate is stored as a datetime, so it can be formatted directly
    earliest = date['earliest']
    latest = date['latest']
    date_format = '%Y-%m-%d'
    e = datetime.strftime(earliest, date_format)
    l = datetime.strftime(latest, date_format)
    print({'earliest_date': e, 'latest_date': l})

{'earliest_date': '1989-01-01', 'latest_date': '2016-09-15'}


In [66]:
from datetime import datetime
def get_earliest_and_latest_dates(collection):
    # YOUR CODE HERE
    # Group every document together (_id: None) and take the min and max RatingDate
    cursor = collection.aggregate([{'$group':{'_id': None,'earliest':{'$min': '$RatingDate'}, 'latest':{'$max': '$RatingDate'}}}])
    date_format = '%Y-%m-%d' #convert dates into YYYY-MM-DD
    # The pipeline yields a single document, so return on the first iteration
    for date in cursor:
        earliest = datetime.strftime(date['earliest'], date_format)
        latest = datetime.strftime(date['latest'], date_format)
        #finally return a dictionary with 'earliest_date' and 'latest_date'
        return {'earliest_date': earliest, 'latest_date': latest}
    # An empty collection produces no grouped document
    return None
get_earliest_and_latest_dates(db.uk)

{'earliest_date': '1989-01-01', 'latest_date': '2016-09-15'}

In [67]:
# You don't need to write anything here
assert_equal(get_earliest_and_latest_dates(db.uk),{'earliest_date': '1989-01-01', 'latest_date': '2016-09-15'})
assert_equal(get_earliest_and_latest_dates(db.swansea),{'earliest_date': '2010-10-06', 'latest_date': '2016-08-16'})
assert_equal(get_earliest_and_latest_dates(db.westminster), 
                {'earliest_date': '1999-01-27', 'latest_date': '2016-09-13'})
assert_equal(get_earliest_and_latest_dates(db.newcastle_upon_tyne), 
                {'earliest_date': '2005-07-08', 'latest_date': '2016-09-06'})
print('Passed all the tests')

Passed all the tests
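
The formatting step can be tried without a database. In the sketch below, `grouped` is a stand-in for the single document produced by the `$group` stage, filled with the dates reported for the `uk` collection above. `format_date_range` is a hypothetical helper name, and `earliest.strftime(fmt)` is the instance-method spelling of `datetime.strftime(earliest, fmt)`.

```python
from datetime import datetime


def format_date_range(earliest, latest, date_format='%Y-%m-%d'):
    """Format two datetime values in the dict shape the question asks for."""
    return {
        'earliest_date': earliest.strftime(date_format),
        'latest_date': latest.strftime(date_format),
    }


# Stand-in for the document returned by the aggregation pipeline
grouped = {'_id': None, 'earliest': datetime(1989, 1, 1), 'latest': datetime(2016, 9, 15)}
formatted = format_date_range(grouped['earliest'], grouped['latest'])
print(formatted)
```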


## Question 3 Exploring and fixing data [5 marks]

During this week Huw has talked about issues which may arise when integrating data. For this task, consider the data described in this notebook, and any other source you wish.
- Provide two concise examples of possible issues, and their mitigation in relation to these data
- Each example should be approximately one paragraph

The data described in this notebook comes from food.gov.uk and describes food hygiene ratings. The dataset is fairly large, with around 500,000 records. Each record in the collection has attributes with numerical, text, and boolean values.
### Example one
We may want to investigate the correlation between a restaurant and cases of stomach illness or food poisoning by integrating the food hygiene rating dataset with GP records. A possible problem is that the two datasets record the Geocode attribute with different measurement conventions, for example because of different instrument settings and readings at data collection and entry. The coordinates may be in different formats: the food hygiene data is in decimal degrees, whereas GP locations may be in degrees, minutes, and seconds. One of the datasets would need to be converted so that the integrated data is consistent. It is also worth noting that location-based datasets may use different datums (for example WGS84 and ED50), in which case the coordinate values would need to be transformed to the same reference system.
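
The format conversion described in this example can be sketched as follows. `dms_to_decimal` is a hypothetical helper and the coordinates are arbitrary test values. It only changes the notation; moving between datums such as ED50 and WGS84 is a separate transformation that this sketch does not attempt.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus N/S/E/W to signed decimal degrees."""
    if hemisphere not in ('N', 'S', 'E', 'W'):
        raise ValueError('hemisphere must be one of N, S, E, W')
    decimal = degrees + minutes / 60 + seconds / 3600
    # South and West are negative in decimal-degree notation
    return -decimal if hemisphere in ('S', 'W') else decimal


latitude = dms_to_decimal(51, 30, 0, 'N')
longitude = dms_to_decimal(3, 15, 0, 'W')
print(latitude, longitude)
```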


### Example two
If I were integrating the food hygiene ratings with TripAdvisor reviews, I could encounter overlapping data with a naming conflict. The food hygiene data may refer to the name of the restaurant as 'BusinessName', while TripAdvisor may use 'RestaurantName'. This is an example of a synonym: different names for the same object. To mitigate this problem, one agreed term (e.g. 'RestaurantName') would be used within the integrated dataset, and its header/tag would reflect this.
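
A minimal sketch of this renaming step, assuming each record is a Python dict. The mapping, the helper name `harmonise_fields`, and the sample record are all made up for illustration.

```python
# Hypothetical mapping from a source's field names to the agreed names
FIELD_MAP = {'BusinessName': 'RestaurantName'}


def harmonise_fields(record, field_map=FIELD_MAP):
    """Return a copy of `record` with synonymous keys renamed to the agreed terms."""
    return {field_map.get(key, key): value for key, value in record.items()}


# Made-up record in the food hygiene naming style
hygiene_record = {'BusinessName': 'Example Cafe', 'RatingValue': 5}
harmonised = harmonise_fields(hygiene_record)
print(harmonised)
```

Keys that are not in the mapping pass through unchanged, so the same function can be applied to records from either source.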

Overall, I would use a data cleaning technique called entity resolution to retain complementary information from the two datasets being integrated and to remove any duplication.
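
As a deliberately naive sketch of entity resolution, the code below treats two records as the same establishment when their normalised name and postcode agree, and merges their fields. The records, the `Postcode` and `ReviewScore` field names, and both helper names are made up for the example. Real entity resolution usually needs fuzzy matching rather than exact keys.

```python
def normalise_name(name):
    """Lower-case a name and keep only letters, digits and single spaces."""
    cleaned = ''.join(ch if ch.isalnum() else ' ' for ch in name.lower())
    return ' '.join(cleaned.split())


def merge_sources(hygiene_records, review_records):
    """Merge records that share a normalised name and postcode."""
    merged = {}
    for record in hygiene_records + review_records:
        key = (normalise_name(record['RestaurantName']),
               record['Postcode'].replace(' ', '').upper())
        target = merged.setdefault(key, {})
        for field, value in record.items():
            # The first source to supply a field wins; later duplicates are ignored
            target.setdefault(field, value)
    return list(merged.values())


# Made-up records describing the same establishment in two sources
hygiene = [{'RestaurantName': 'Example Cafe', 'Postcode': 'AB1 2CD', 'RatingValue': 5}]
reviews = [{'RestaurantName': 'EXAMPLE  CAFE.', 'Postcode': 'ab12cd', 'ReviewScore': 4.5}]
combined = merge_sources(hygiene, reviews)
print(combined)
```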