# Week 5 – JSON and Hestia tasks  


## Current Roadmap 

### Week 5: JSON + Working with URLs
* Using Python's built-in json library for reading + writing files
* Extracting data from the web and handling the output using urllib

### Week 6: Advanced Pandas + Data cleaning (with maize pigeon dataset).
* Sampling and pivoting
* Plotting with Matplotlib


### Week 7: Modelling with sklearn 
* Data preparation for statistical modelling

# Part 1. JSON (JavaScript Object Notation) 
* A data format for reading, writing and storing data, and for transmitting it between a server and a client over HTTP, e.g. when you search for something on Google, the results are requested from a server and then displayed on your screen. 

* Similar to a Python dictionary, a JSON object consists of key-value pairs enclosed in curly braces {}, where keys are strings and values can be strings, numbers, arrays, objects, booleans, or null.

![json img](https://assets-global.website-files.com/5ff66329429d880392f6cba2/61b76e7fdf48bbef0026f39a_JSON%20works.png)
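
To make the list of value types above concrete, here is a minimal sketch of how each JSON type arrives in Python after parsing. The record itself (field names and values) is made up for illustration and is not part of the course data.

```python
import json

# A small JSON document (as a string) that uses every JSON value type.
# All field names and values here are hypothetical.
json_text = '''
{
  "name": "Wheat",
  "price": 8.75,
  "quantity": 5000,
  "organic": false,
  "certifications": ["A", "B"],
  "supplier": {"country": "United States"},
  "notes": null
}
'''

record = json.loads(json_text)

# Each JSON type is converted to its closest Python equivalent
for key, value in record.items():
    print(f"{key!r}: {type(value).__name__}")
```

Note that JSON `false` becomes Python `False`, and JSON `null` becomes Python `None`.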

In [6]:
# Load in the Python json library 
import json

In [47]:
# Agricultural products data
agricultural_data = {
    "products": [
        {
            "name": "Apples",
            "type": "Fruit",
            "origin": "Various",
            "price": 2.5,
            "quantity": 1000
        },
        {
            "name": "Wheat",
            "type": "Grain",
            "origin": "United States",
            "price": 8.75,
            "quantity": 5000
        },
        {
            "name": "Tomatoes",
            "type": "Vegetable",
            "origin": "Spain",
            "price": 3.0,
            "quantity": 750
        }
        # Add more agricultural products as needed
    ]
}

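## JSON Reading and Writing with load/loads and dump/dumps
The json library offers two pairs of functions. `load` and `dump` read from and write to files, while `loads` and `dumps` (the "s" stands for string) do the same with Python strings. We use `dump` and `load` in the cells below.  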
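
As a rough illustration of the difference between the string and file versions, the sketch below round-trips one small dictionary both ways. The file name `example_product.json` is an assumption made for this example, and the file is written to the system temp folder so it does not clutter your working directory.

```python
import json
import os
import tempfile

product = {"name": "Apples", "price": 2.5, "quantity": 1000}

# dumps: Python object -> JSON-formatted string
product_text = json.dumps(product)

# loads: JSON-formatted string -> Python object
product_again = json.loads(product_text)

# dump / load do the same job, but with a file object instead of a string
example_path = os.path.join(tempfile.gettempdir(), "example_product.json")  # hypothetical file name
with open(example_path, "w") as file:
    json.dump(product, file, indent=2)
with open(example_path, "r") as file:
    product_from_file = json.load(file)

print(type(product_text).__name__)
print(product_again == product == product_from_file)
```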


In [49]:
# Let's write and save data to a JSON file
file_path = 'agricultural_products.json'


In [58]:
with open(file_path, 'w') as file:
    json.dump(agricultural_data, file, indent=2)
print(f"Data saved successfully to '{file_path}'")


Data saved successfully to 'agricultural_products.json'


Next, let's read that same data back in with `json.load`

In [55]:
# Path to the JSON file
file_path = 'agricultural_products.json'

In [61]:
# Load data from the JSON file
with open(file_path, 'r') as file:
    agricultural_data2 = json.load(file)

Let's now look at some of the data we loaded in

In [None]:
# Access the loaded data
products = agricultural_data2['products']
for product in products:
    print(f"Product: {product['name']}, Type: {product['type']}, Quantity: {product['quantity']}")


## How to interact with JSON data once loaded in Python

In [18]:
agricultural_data
# agricultural_data['products']
# agricultural_data['products'][0]
# agricultural_data['products'][0]['name']

{'products': [{'name': 'Apples',
   'type': 'Fruit',
   'origin': 'Various',
   'price': 2.5,
   'quantity': 1000},
  {'name': 'Wheat',
   'type': 'Grain',
   'origin': 'United States',
   'price': 8.75,
   'quantity': 5000},
  {'name': 'Tomatoes',
   'type': 'Vegetable',
   'origin': 'Spain',
   'price': 3.0,
   'quantity': 750}]}

In [20]:
# updating data e.g. convert price to pence/cents rather than pounds/dollars

for product in agricultural_data['products']:
    product['price'] *= 100

agricultural_data


# dump again if necessary 

{'products': [{'name': 'Apples',
   'type': 'Fruit',
   'origin': 'Various',
   'price': 250.0,
   'quantity': 1000},
  {'name': 'Wheat',
   'type': 'Grain',
   'origin': 'United States',
   'price': 875.0,
   'quantity': 5000},
  {'name': 'Tomatoes',
   'type': 'Vegetable',
   'origin': 'Spain',
   'price': 300.0,
   'quantity': 750}]}

### Combining data from different JSONs and dumping to new JSON

In [68]:
farm_data = {
  "farm": {
    "name": "Green Fields Farm",
    "location": {
      "country": "United States",
      "city": "Farmington",
      "state": "California",
      "postal_code": "12345"
    },
    "established_year": 1995,
    "owner": "John Farmer",
    "employees": [
      {
        "name": "Alice Smith",
        "position": "Farm Manager",
        "age": 35,
        "experience_years": 10
      },
      {
        "name": "Michael Johnson",
        "position": "Field Worker",
        "age": 28,
        "experience_years": 5
      }
      # Include other employees
    ],
    "equipment": [
      {
        "name": "Tractor",
        "type": "Farm Vehicle",
        "condition": "Good",
        "usage_hours": 500
      },
      {
        "name": "Harvester",
        "type": "Agricultural Equipment",
        "condition": "Fair",
        "usage_hours": 300
      }
      # Include other equipment
    ]
  }
}
# Save the data to a JSON file
file_path = 'farm_data.json'

with open(file_path, 'w') as file:
    json.dump(farm_data, file, indent=2)

In [69]:
# Read data from 'agricultural_products.json'
with open('agricultural_products.json', 'r') as file:
    agricultural_data = json.load(file)

In [70]:
# Read data from 'farm_data.json'
with open('farm_data.json', 'r') as file:
    farm_data = json.load(file)

In [71]:
# Extract 'products' from agricultural_data
products = agricultural_data.get('products', [])
farm_data['farm']['products'] = products


In [72]:
with open('merged_farm_data.json', 'w') as file:
    json.dump(farm_data, file, indent=2)

In [73]:
farm_data

{'farm': {'name': 'Green Fields Farm',
  'location': {'country': 'United States',
   'city': 'Farmington',
   'state': 'California',
   'postal_code': '12345'},
  'established_year': 1995,
  'owner': 'John Farmer',
  'employees': [{'name': 'Alice Smith',
    'position': 'Farm Manager',
    'age': 35,
    'experience_years': 10},
   {'name': 'Michael Johnson',
    'position': 'Field Worker',
    'age': 28,
    'experience_years': 5}],
  'equipment': [{'name': 'Tractor',
    'type': 'Farm Vehicle',
    'condition': 'Good',
    'usage_hours': 500},
   {'name': 'Harvester',
    'type': 'Agricultural Equipment',
    'condition': 'Fair',
    'usage_hours': 300}],
  'products': [{'name': 'Apples',
    'type': 'Fruit',
    'origin': 'Various',
    'price': 2.5,
    'quantity': 1000},
   {'name': 'Wheat',
    'type': 'Grain',
    'origin': 'United States',
    'price': 8.75,
    'quantity': 5000},
   {'name': 'Tomatoes',
    'type': 'Vegetable',
    'origin': 'Spain',
    'price': 3.0,
    'qua

## Merge two JSONs together at top level, i.e. add a new farm to database

In [74]:
sunset_farms = {
  "farm": {
    "name": "Sunset Farms",
    "location": {
      "country": "Canada",
      "city": "Saskatoon",
      "state": "Saskatchewan",
      "postal_code": "A1B 2C3"
    },
    "established_year": 2000,
    "owner": "Emily Farmer",
    "employees": [
      {
        "name": "Michael Johnson",
        "position": "Farm Manager",
        "age": 40,
        "experience_years": 15
      },
      {
        "name": "Sarah Adams",
        "position": "Field Supervisor",
        "age": 32,
        "experience_years": 8
      }
    ],
    "equipment": [
      {
        "name": "Combine Harvester",
        "type": "Agricultural Equipment",
        "condition": "Excellent",
        "usage_hours": 800
      },
      {
        "name": "Seeder",
        "type": "Farm Implement",
        "condition": "Good",
        "usage_hours": 600
      }
    ],
    "products": [
      {
        "name": "Potatoes",
        "type": "Vegetable",
        "origin": "Canada",
        "price": 1.8,
        "quantity": 3000,
        "color": "Brown/Yellow",
        "variety": "Russet",
        "harvest_season": "Fall"
      },
      {
        "name": "Carrots",
        "type": "Vegetable",
        "origin": "Canada",
        "price": 2.2,
        "quantity": 2500,
        "color": "Orange",
        "variety": "Nantes",
        "harvest_season": "Year-round"
      }
    ]
  }
}


In [75]:
# Merge the data into a new dictionary
merged_data = {}
merged_data.update(farm_data)
merged_data.update(sunset_farms)

# Write merged data to a new JSON file
with open('farms.json', 'w') as file:
    json.dump(merged_data, file, indent=2)


#merged_data

# this doesn't work, why?

In [76]:
sunset_farms['farm2'] = sunset_farms['farm']
sunset_farms.pop('farm')
#sunset_farms

{'name': 'Sunset Farms',
 'location': {'country': 'Canada',
  'city': 'Saskatoon',
  'state': 'Saskatchewan',
  'postal_code': 'A1B 2C3'},
 'established_year': 2000,
 'owner': 'Emily Farmer',
 'employees': [{'name': 'Michael Johnson',
   'position': 'Farm Manager',
   'age': 40,
   'experience_years': 15},
  {'name': 'Sarah Adams',
   'position': 'Field Supervisor',
   'age': 32,
   'experience_years': 8}],
 'equipment': [{'name': 'Combine Harvester',
   'type': 'Agricultural Equipment',
   'condition': 'Excellent',
   'usage_hours': 800},
  {'name': 'Seeder',
   'type': 'Farm Implement',
   'condition': 'Good',
   'usage_hours': 600}],
 'products': [{'name': 'Potatoes',
   'type': 'Vegetable',
   'origin': 'Canada',
   'price': 1.8,
   'quantity': 3000,
   'color': 'Brown/Yellow',
   'variety': 'Russet',
   'harvest_season': 'Fall'},
  {'name': 'Carrots',
   'type': 'Vegetable',
   'origin': 'Canada',
   'price': 2.2,
   'quantity': 2500,
   'color': 'Orange',
   'variety': 'Nantes',


In [77]:
# Merge the data into a new dictionary
merged_data = {}
merged_data.update(farm_data)
merged_data.update(sunset_farms)

# Write merged data to a new JSON file
with open('farms.json', 'w') as file:
    json.dump(merged_data, file, indent=2)
    
#merged_data

In [78]:
merged_data = {'farms':[farm_data, sunset_farms]}
#merged_data
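
One way to see why the first merge attempt loses a farm: `dict.update` replaces the value of any key that already exists, and both dictionaries use the same top-level key `"farm"`. The sketch below uses trimmed-down, hypothetical stand-ins for the two farm dictionaries to show the overwrite, followed by the list-based alternative.

```python
# Trimmed-down stand-ins for farm_data and sunset_farms (illustration only)
farm_a = {"farm": {"name": "Green Fields Farm"}}
farm_b = {"farm": {"name": "Sunset Farms"}}

# Attempt 1: update() overwrites the existing "farm" key, so only one farm survives
merged = {}
merged.update(farm_a)
merged.update(farm_b)
print(merged)

# Attempt 2: keep every farm as its own item in a list
farms = {"farms": [farm_a["farm"], farm_b["farm"]]}
print([farm["name"] for farm in farms["farms"]])
```

Storing the farms in a list avoids having to invent a new key name (`farm2`, `farm3`, ...) for each addition. Here the inner dictionaries are unwrapped from their `"farm"` key first, which is a design choice rather than a requirement.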

# Part 2. Requesting live data using `urllib.request`
To handle URLs, Python provides the built-in library [urllib](https://docs.python.org/3/library/urllib.html) as part of its Standard Library (`json` is another built-in library). In the example below, we run through some of its basic usage to request data from a live website and then format the JSON response. 


In [79]:
# Load in required libraries
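# (importing `urllib` on its own does not guarantee that its submodules are loaded)
import urllib.request
import urllib.error
import json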

Let's write a small function to deal with the output of a request. 
We will use `urllib.request.urlopen` to actually perform the request; it is not essential that you understand how this function works. 
HTTP stands for Hypertext Transfer Protocol, a protocol for how to send and receive data across the internet. 

All you need to know for now is that an HTTP response status of 200 means that the request we sent was successful. The function below helps alert us if the request fails (e.g. because of a server failure, lack of WiFi, lack of permission etc.)

In [80]:
def handle_url_response(url_to_request):
    """
    Handle response from url and return decoded output
    """
    # 1. Open the URL; the `with` block closes the connection once we are done.
    #    Note that urlopen itself raises an error for most failures (e.g. a 404 or no connection)
    with urllib.request.urlopen(url_to_request) as web_url:
        status_code = web_url.getcode()
        # 2. Print the HTTP response status code (which determines whether the request has failed or not)
        print("result code: " + str(status_code))
        # 3. Handle the response based on HTTP response status code (200 means success)
        if status_code == 200:
            # Read and decode the response and return it to the user
            return web_url.read().decode("utf-8")
        # Raise an error if the server returned any other status code
        # (HTTPError needs the url, code, message, headers and a file object)
        raise urllib.error.HTTPError(
            url_to_request, status_code,
            "Received an error from server, cannot retrieve results",
            web_url.headers, None)
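
In practice, `urlopen` raises an exception for most failures before any status code can be checked. The sketch below shows one possible way to catch the two most common exceptions. The function name `fetch_text_or_none` and the `timeout_seconds` parameter are assumptions made for this example and are not used elsewhere in these notes.

```python
import urllib.error
import urllib.request


def fetch_text_or_none(url_to_request, timeout_seconds=10):
    """Return the decoded response body, or None if the request fails."""
    try:
        with urllib.request.urlopen(url_to_request, timeout=timeout_seconds) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as error:
        # The server replied, but with an error status such as 404 or 500
        print(f"Server returned HTTP {error.code} for {url_to_request}")
    except urllib.error.URLError as error:
        # No usable reply at all, e.g. no connection or an unknown host
        print(f"Could not reach {url_to_request}: {error.reason}")
    return None
```

`HTTPError` is caught first because it is a subclass of `URLError`; with the order reversed, the more specific branch would never run.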

### Request data
Next, we will request data using the web address, which we can store as a string variable.

In this example we'll use a live feed from the USGS which lists all earthquakes of magnitude 2.5 or larger from the past day

In [81]:
earthquakes_live_feed_url = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_day.geojson"


In [82]:
earthquakes_response_raw = handle_url_response(earthquakes_live_feed_url)

result code: 200


Result code 200 means success!

### Formatting response
If our request is successful, the cell below should show that the output is one long string containing JSON. We can use Python's `json` library to parse this data much as we did in Part 1, this time using `json.loads` because the data is a string rather than a file

In [83]:
# Long JSON string
earthquakes_response_raw

'{"type":"FeatureCollection","metadata":{"generated":1700929659000,"url":"https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_day.geojson","title":"USGS Magnitude 2.5+ Earthquakes, Past Day","status":200,"api":"1.10.3","count":41},"features":[{"type":"Feature","properties":{"mag":3.19,"place":"22 km SSW of Woodruff, Utah","time":1700929205840,"updated":1700929507640,"tz":null,"url":"https://earthquake.usgs.gov/earthquakes/eventpage/uu60555092","detail":"https://earthquake.usgs.gov/earthquakes/feed/v1.0/detail/uu60555092.geojson","felt":null,"cdi":null,"mmi":null,"alert":null,"status":"automatic","tsunami":0,"sig":157,"net":"uu","code":"60555092","ids":",uu60555092,","sources":",uu,","types":",origin,phase-data,","nst":52,"dmin":0.19,"rms":0.31,"gap":46,"magType":"ml","type":"earthquake","title":"M 3.2 - 22 km SSW of Woodruff, Utah"},"geometry":{"type":"Point","coordinates":[-111.3040009,41.3501663,3.5]},"id":"uu60555092"},\n{"type":"Feature","properties":{"mag":4.8,"place":"M

In [84]:
earthquakes_response_json = json.loads(earthquakes_response_raw)

In [85]:
earthquakes_response_json

{'type': 'FeatureCollection',
 'metadata': {'generated': 1700929659000,
  'url': 'https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_day.geojson',
  'title': 'USGS Magnitude 2.5+ Earthquakes, Past Day',
  'status': 200,
  'api': '1.10.3',
  'count': 41},
 'features': [{'type': 'Feature',
   'properties': {'mag': 3.19,
    'place': '22 km SSW of Woodruff, Utah',
    'time': 1700929205840,
    'updated': 1700929507640,
    'tz': None,
    'url': 'https://earthquake.usgs.gov/earthquakes/eventpage/uu60555092',
    'detail': 'https://earthquake.usgs.gov/earthquakes/feed/v1.0/detail/uu60555092.geojson',
    'felt': None,
    'cdi': None,
    'mmi': None,
    'alert': None,
    'status': 'automatic',
    'tsunami': 0,
    'sig': 157,
    'net': 'uu',
    'code': '60555092',
    'ids': ',uu60555092,',
    'sources': ',uu,',
    'types': ',origin,phase-data,',
    'nst': 52,
    'dmin': 0.19,
    'rms': 0.31,
    'gap': 46,
    'magType': 'ml',
    'type': 'earthquake',
    'title': 

Let's look at how many events have occurred in the last 24 hours

In [86]:
# output the number of events
earthquake_count = earthquakes_response_json["metadata"]["count"]
print(str(earthquake_count) + " events recorded")

41 events recorded


Let's have a look at the places where these events occurred

In [87]:
for earthquake_event in earthquakes_response_json["features"]:
    print(earthquake_event["properties"]["place"])

22 km SSW of Woodruff, Utah
Maug Islands region, Northern Mariana Islands
Maug Islands region, Northern Mariana Islands
Santa Cruz Islands
86 km ESE of Palora, Ecuador
57 km SSE of Palca, Peru
south of Panama
59 km E of Hami, China
61 km S of Whites City, New Mexico
85 km SSE of Yonakuni, Japan
70 km WNW of Ninilchik, Alaska
4 km SSW of Guánica, Puerto Rico
1 km W of Pāhala, Hawaii
3 km SW of Guánica, Puerto Rico
3 km NNW of Bennington, Kansas
Maug Islands region, Northern Mariana Islands
4 km NNE of Carrizales, Puerto Rico
67 km NW of Madang, Papua New Guinea
17 km NNW of Brenas, Puerto Rico
Kenai Peninsula, Alaska
Idaho-Montana border region
40 km NNW of Valdez, Alaska
Puerto Rico region
Maug Islands region, Northern Mariana Islands
9 km S of Arroyo, Puerto Rico
Jujuy, Argentina
Hawaii region, Hawaii
10 km NW of Crescent City, CA
Gulf of Alaska
Puerto Rico region
203 km ESE of Tadine, New Caledonia
127 km SW of Fakfak, Indonesia
Pagan region, Northern Mariana Islands
2 km WSW of Guán

We can also use Python conditional statements to search through and subset the response...

In [88]:
for earthquake_event in earthquakes_response_json["features"]:
    magnitude = earthquake_event["properties"]["mag"]
    # Guard against a missing magnitude (null in the JSON becomes None in Python)
    if magnitude is not None and magnitude >= 4.0:
        print("%2.1f" % magnitude, earthquake_event["properties"]["place"])


4.8 Maug Islands region, Northern Mariana Islands
5.5 Maug Islands region, Northern Mariana Islands
4.8 Santa Cruz Islands
5.1 86 km ESE of Palora, Ecuador
4.5 57 km SSE of Palca, Peru
4.8 south of Panama
4.2 59 km E of Hami, China
4.2 85 km SSE of Yonakuni, Japan
5.0 Maug Islands region, Northern Mariana Islands
5.1 67 km NW of Madang, Papua New Guinea
4.9 Maug Islands region, Northern Mariana Islands
4.3 Jujuy, Argentina
4.9 203 km ESE of Tadine, New Caledonia
4.7 127 km SW of Fakfak, Indonesia
5.0 Pagan region, Northern Mariana Islands
4.9 Revilla Gigedo Islands region
4.9 Maug Islands region, Northern Mariana Islands
4.9 western Indian-Antarctic Ridge
4.0 194 km SW of Nikolski, Alaska
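
The same filtering idea can be wrapped in a small reusable function. The sketch below runs on a tiny hand-written sample in the same shape as the feed, so it works without an internet connection. The sample values and the function name `events_at_least` are made up for illustration.

```python
import json

# A tiny hand-written sample shaped like the feed (all values are made up)
sample_text = '''
{"metadata": {"count": 3},
 "features": [
   {"properties": {"mag": 3.19, "place": "Place A"}},
   {"properties": {"mag": 5.5, "place": "Place B"}},
   {"properties": {"mag": null, "place": "Place C"}}
 ]}
'''
sample = json.loads(sample_text)


def events_at_least(feed, minimum_magnitude):
    """Return (magnitude, place) pairs for events at or above the threshold."""
    selected = []
    for event in feed["features"]:
        properties = event["properties"]
        magnitude = properties.get("mag")
        # Skip events whose magnitude is missing (JSON null becomes None)
        if magnitude is not None and magnitude >= minimum_magnitude:
            selected.append((magnitude, properties["place"]))
    return selected


strong_events = events_at_least(sample, 4.0)
print(strong_events)
```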


## Translating the data to csv
The data may seem a little complex, but we can get it into a friendlier format with a little work. Let's convert it to a pandas DataFrame and then save it to a CSV file.

In [89]:
## create two empty lists to store our data
earthquake_locations = []
earthquake_magnitudes = [] 

# Store events in these lists by looping through
for earthquake_event in earthquakes_response_json["features"]:
    earthquake_locations.append(earthquake_event["properties"]["place"])
    earthquake_magnitudes.append(earthquake_event["properties"]["mag"])


In [90]:
## check both lists are same length
len(earthquake_magnitudes), len(earthquake_locations)

(41, 41)

In [91]:
# look at first 5 values from each list
earthquake_locations[:5], earthquake_magnitudes[:5]

(['22 km SSW of Woodruff, Utah',
  'Maug Islands region, Northern Mariana Islands',
  'Maug Islands region, Northern Mariana Islands',
  'Santa Cruz Islands',
  '86 km ESE of Palora, Ecuador'],
 [3.19, 4.8, 5.5, 4.8, 5.1])

In [92]:
import pandas as pd

In [93]:
locs_series = pd.Series(earthquake_locations, name="Location") 
mags_series = pd.Series(earthquake_magnitudes, name="Magnitude")

# Use pd.concat to combine the two series side by side (axis=1 means as columns)
earthquakes_df = pd.concat([locs_series, mags_series], axis=1) 

In [94]:
earthquakes_df

| | Location | Magnitude |
|---|---|---|
| 0 | 22 km SSW of Woodruff, Utah | 3.19 |
| 1 | Maug Islands region, Northern Mariana Islands | 4.8 |
| 2 | Maug Islands region, Northern Mariana Islands | 5.5 |
| 3 | Santa Cruz Islands | 4.8 |
| 4 | 86 km ESE of Palora, Ecuador | 5.1 |
| 5 | 57 km SSE of Palca, Peru | 4.5 |
| 6 | south of Panama | 4.8 |
| 7 | 59 km E of Hami, China | 4.2 |
| 8 | 61 km S of Whites City, New Mexico | 3.2 |
| 9 | 85 km SSE of Yonakuni, Japan | 4.2 |


Remember we can save this data using

In [46]:
# earthquakes_df.to_csv('???')
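
If pandas is not available, the same two columns can also be written with Python's built-in `csv` module. This is an alternative to the pandas route rather than the method used in these notes. It is a minimal sketch that assumes two short made-up lists in place of the real ones, and it writes to an in-memory buffer instead of a real file.

```python
import csv
import io

# Made-up stand-ins for earthquake_locations and earthquake_magnitudes
locations = ["Place A", "Place B, Region"]
magnitudes = [3.19, 5.5]

# The buffer stands in for a real file opened with open(..., "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Location", "Magnitude"])
# zip pairs each location with its magnitude, one row per event
writer.writerows(zip(locations, magnitudes))

csv_text = buffer.getvalue()
print(csv_text)
```

Values that contain a comma, such as the second location, are wrapped in quotes automatically so the columns stay aligned when the file is read back.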

# This week's Tasks
## Task 1
https://gitlab.com/hestia-earth/hestia-engine-models/-/issues/451

In [87]:
transport = {
  "@type": "Cycle",
  "site": {
    "country": {
      "@id": "GADM-FRA",
      "name": "France"
    }
  },
  "inputs": [
    {
      "@type": "Input",
      "term": {
        "@id": "ureaKgN",
        "units": "kg N",
      },
      "value": 40,
      "transport": [
        {
          "@type": "Transport",
          "term": {
            "@id": "transportUnspecified",
          },
          "distance": 8017
        }
      ]
    }
  ]
}


Write code to add a "value" key to each transport entry, defined as

value = distance in km * mass of input in kg / 1000

In [88]:
transport['inputs']
#transport['inputs'][0]
#transport['inputs'][0]['transport']
#transport['inputs'][0]['transport'][0]['distance']
#transport['inputs'][0]['value']


#transport['inputs'][0]['transport'][0]['value'] = transport['inputs'][0]['transport'][0]['distance'] * transport['inputs'][0]['value'] / 1000
#transport

[{'@type': 'Input',
  'term': {'@id': 'ureaKgN', 'units': 'kg N'},
  'value': 40,
  'transport': [{'@type': 'Transport',
    'term': {'@id': 'transportUnspecified'},
    'distance': 8017}]}]
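
The formula above can be applied to every input and every transport entry with two nested loops. The sketch below works on a trimmed-down copy of the `transport` dictionary. It assumes that the input's `value` can be used directly as the mass in kg, which may not match what the linked issue requires, and the function name `add_transport_values` is made up for this example.

```python
# Trimmed-down copy of the `transport` cycle, keeping only the keys the formula needs
cycle = {
    "inputs": [
        {
            "term": {"@id": "ureaKgN", "units": "kg N"},
            "value": 40,
            "transport": [
                {"term": {"@id": "transportUnspecified"}, "distance": 8017}
            ],
        }
    ]
}


def add_transport_values(cycle_data):
    """Add value = distance (km) * input mass (kg) / 1000 to every transport entry."""
    for input_entry in cycle_data.get("inputs", []):
        # Assumption: the input's "value" is treated as its mass in kg
        mass_kg = input_entry["value"]
        for transport_entry in input_entry.get("transport", []):
            transport_entry["value"] = transport_entry["distance"] * mass_kg / 1000
    return cycle_data


add_transport_values(cycle)
print(cycle["inputs"][0]["transport"][0]["value"])
```

Using `.get(..., [])` means inputs without a `transport` list are simply skipped rather than raising a `KeyError`.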

## Task 2

https://gitlab.com/hestia-earth/hestia-engine-models/-/issues/461
    
    
## Task 3
https://gitlab.com/hestia-earth/hestia-engine-models/-/issues/479

# Extra 
## BeautifulSoup 
BeautifulSoup is a third-party package for parsing HTML and XML. It does not request pages itself, so we still fetch them with `urllib`, but it is very useful for scraping data out of the pages we get back. For example, it is useful for websites that do not provide an API that returns data in a JSON format.

If not already installed, you can install BeautifulSoup (version 4, the latest) using the `pip` package manager, see below: 

In [2]:
!pip install beautifulsoup4

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
Collecting soupsieve>1.2 (from beautifulsoup4)
  Downloading soupsieve-2.5-py3-none-any.whl.metadata (4.7 kB)
Downloading soupsieve-2.5-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.12.2 soupsieve-2.5


In [3]:
# Import required libraries
import urllib.request
import urllib.error
import json
from bs4 import BeautifulSoup

Let's use the `handle_url_response` function again, re-declaring it here so that this task is all in one place.

In [18]:
def handle_url_response(url_to_request):
    """
    Handle response from url and return decoded output
    """
    # 1. Open the URL; the `with` block closes the connection once we are done.
    #    Note that urlopen itself raises an error for most failures (e.g. a 404 or no connection)
    with urllib.request.urlopen(url_to_request) as web_url:
        status_code = web_url.getcode()
        # 2. Print the HTTP response status code (which determines whether the request has failed or not)
        print("result code: " + str(status_code))
        # 3. Handle the response based on HTTP response status code (200 means success)
        if status_code == 200:
            # Read and decode the response and return it to the user
            return web_url.read().decode("utf-8")
        # Raise an error if the server returned any other status code
        # (HTTPError needs the url, code, message, headers and a file object)
        raise urllib.error.HTTPError(
            url_to_request, status_code,
            "Received an error from server, cannot retrieve results",
            web_url.headers, None)
    

In [20]:
url = 'https://projectbritain.com/farming.html'  # Replace with the URL of the website you want to scrape
html_content = handle_url_response(url)

result code: 200


In [11]:
soup = BeautifulSoup(html_content, 'html.parser')


In [12]:
soup

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml"><!-- InstanceBegin template="/Templates/brit.dwt" codeOutsideHTMLIsLocked="false" -->
<head>
<!-- InstanceBeginEditable name="doctitle" -->
<title>Farming in Britain</title>
<!-- InstanceEndEditable -->
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="customs/template/brit.css" rel="stylesheet" type="text/css"/>
<script src="SpryAssets/SpryMenuBar.js" type="text/javascript"></script>
<script language="javascript" type="text/javascript">
//--------------- LOCALIZEABLE GLOBALS ---------------
var d=new Date();
var monthname=new Array("January","February","March","April","May","June","July","August","September","October","November","December");
//Ensure correct for language. English is "January 1, 2004"
var TODAY = monthname[d.getMonth()] + " " + d.getDate() + ", " + d.getFullYear();
//--------

BeautifulSoup provides methods for finding elements such as paragraphs; let's look at that...

In [95]:
soup = BeautifulSoup(html_content, 'html.parser')

paragraphs = soup.find_all('p')
for paragraph in paragraphs:
    print(paragraph.text)

