# Lecture 6: Data formats and APIs

March 24, 2025

## JSON, XML, HTML, Requests and APIs

In [4]:
!pip install bs4
# json is part of Python's standard library, so it does not need to be installed
!pip install boto3 

Collecting boto3
  Downloading boto3-1.37.18-py3-none-any.whl.metadata (6.7 kB)
Collecting botocore<1.38.0,>=1.37.18 (from boto3)
  Downloading botocore-1.37.18-py3-none-any.whl.metadata (5.7 kB)
Collecting s3transfer<0.12.0,>=0.11.0 (from boto3)
  Downloading s3transfer-0.11.4-py3-none-any.whl.metadata (1.7 kB)
Downloading boto3-1.37.18-py3-none-any.whl (139 kB)
Downloading botocore-1.37.18-py3-none-any.whl (13.4 MB)
Downloading s3transfer-0.11.4-py3-none-any.whl (84 kB)
Installing collected packages: botocore, s3transfer, boto3
  Attempting uninstall: botocore
    Found existing installation: botocore 1.34.69
    Uninstalling botocore-1.34.69:
      Successfully uninstalled

In [5]:
import os # for environment variables
import pandas as pd # for dataframes
import numpy as np # for numerical operations

import requests # for making HTTP requests
import json # for parsing JSON
from bs4 import BeautifulSoup # for parsing HTML
import boto3 # for interacting with AWS S3

from IPython.display import HTML, display # for displaying HTML

In [6]:
%matplotlib inline
import matplotlib.pyplot as plt

In [7]:
# quick warmup

# imagine a key-value data container
# data about products in a store
inventory = {
      "001": [{"name": "Milk", "quantity": 34, "price": 1.99}],
      "002": [{"name": "Bread", "quantity": 20, "price": 2.49},
              {"name": "Nutella", "quantity": 5, "price": 2.49}]    
    }

inventory

{'001': [{'name': 'Milk', 'quantity': 34, 'price': 1.99}],
 '002': [{'name': 'Bread', 'quantity': 20, 'price': 2.49},
  {'name': 'Nutella', 'quantity': 5, 'price': 2.49}]}

### How would a store manager find out how much milk costs?



### How do we add a new item? What do we need to do?

In [8]:
type(inventory)

dict

In [9]:
inventory["001"]

[{'name': 'Milk', 'quantity': 34, 'price': 1.99}]

In [10]:
inventory["001"][0]['name'], inventory["001"][0]['quantity']

('Milk', 34)

In [16]:
# add a new product
inventory["003"] = [{"name": "Eggs", "quantity": 12, "price": 3.99}]

In [17]:
inventory

{'001': [{'name': 'Milk', 'quantity': 34, 'price': 1.99}],
 '002': [{'name': 'Bread', 'quantity': 20, 'price': 2.49},
  {'name': 'Nutella', 'quantity': 5, 'price': 2.49}],
 '003': [{'name': 'Eggs', 'quantity': 12, 'price': 3.99}]}

In [None]:
# inventory["004"] = [{"name": "Butter", "quantity": 10}]

In [18]:
name_to_item = {} # a dictionary to map product names to item details
for item_id, details_list in inventory.items(): # iterate over the inventory
    for details in details_list: # iterate over the list of details
        # Here we are assuming product names are unique
        name_to_item[details["name"]] = {"id": item_id, "quantity": details["quantity"], "price": details["price"]} 
        
#if details["name"] is not unique, what happens?
print(name_to_item)

{'Milk': {'id': '001', 'quantity': 34, 'price': 1.99}, 'Bread': {'id': '002', 'quantity': 20, 'price': 2.49}, 'Nutella': {'id': '002', 'quantity': 5, 'price': 2.49}, 'Eggs': {'id': '003', 'quantity': 12, 'price': 3.99}}
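The question in the comment above can be checked directly: assigning to an existing dictionary key silently overwrites the earlier value. Below is a minimal sketch, assuming a hypothetical second "Milk" entry under id "004" (not part of the lecture data); it contrasts plain assignment with collecting all matches in a list per name.

```python
from collections import defaultdict

# hypothetical inventory with a duplicated product name ("Milk" appears twice)
inventory_dup = {
    "001": [{"name": "Milk", "quantity": 34, "price": 1.99}],
    "004": [{"name": "Milk", "quantity": 10, "price": 2.19}],
}

# plain assignment: the later "Milk" overwrites the earlier one
overwritten = {}
for item_id, details_list in inventory_dup.items():
    for details in details_list:
        overwritten[details["name"]] = {"id": item_id, **details}

# defaultdict(list): every product with the same name is kept
name_to_items = defaultdict(list)
for item_id, details_list in inventory_dup.items():
    for details in details_list:
        name_to_items[details["name"]].append({"id": item_id, **details})

print(overwritten["Milk"]["id"])   # id of the last "Milk" seen
print(len(name_to_items["Milk"]))  # number of "Milk" entries kept
```

Whether to overwrite, keep a list, or raise an error on duplicates is a design decision that depends on what the lookup is used for.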


In [19]:
pd.DataFrame(name_to_item)

| | Milk | Bread | Nutella | Eggs |
|--|--|--|--|--|
| id | 001 | 002 | 002 | 003 |
| quantity | 34 | 20 | 5 | 12 |
| price | 1.99 | 2.49 | 2.49 | 3.99 |


In [20]:
pd.DataFrame(name_to_item).T # convert the dictionary to a DataFrame and transpose it

| | id | quantity | price |
|--|--|--|--|
| Milk | 001 | 34 | 1.99 |
| Bread | 002 | 20 | 2.49 |
| Nutella | 002 | 5 | 2.49 |
| Eggs | 003 | 12 | 3.99 |


## Contents

* Standardized data representation
* JSON
* XML
* Introduction to BeautifulSoup
* Basics of HTML (+ Element Inspection)
* Introduction to Requests (GET vs. POST) and APIs


### Goals:
    
* work with online/real-time data
* acquisition, processing $\rightarrow$ results

### Overview of Data Formats

- **Data formats**: The structure and encoding of data for storage and transmission.
- **Popular data formats**:
    - **JSON (JavaScript Object Notation)**: Lightweight, easy to read and write, widely used in APIs.
    - **CSV (Comma-Separated Values)**: Simple, flat format ideal for tabular data.
    - **XML (eXtensible Markup Language)**: Hierarchical, suited for documents and complex nested data.
    - **YAML (YAML Ain't Markup Language)**: Human-readable, often used for configuration files.

## Data exchange formats - JSON, XML

Structured, interoperable data formats (e.g., JSON, YAML, or XML) are how internet-based applications exchange data. The points below highlight what makes them useful when working in Python.

`Language of the internet`

* You can send/receive a message with (almost) any service

* send .docx  ->  what if I do not have MS Word?
* we need a simple data format that works on any machine (system agnostic), is general (can express anything) and is editable in basic editors

* More complex than simple tables
* Highly structured - if you don't follow the rules, parsing fails
* Both sides need to understand the structure (e.g. documented via comments in YAML)
* only data, no code to be run (security measure)
* distributed as text/string (to be precise, as encoded `bytes`) 
* parsed to objects - easy to work with straight away
* Can be persisted as files or received as data streams from APIs
* Human readable
* Hierarchical
* Can be fetched using standard web APIs

In summary, formats like JSON and YAML in Python enable secure, structured, and system-agnostic data exchange across diverse applications and services on the internet.
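The "text on the wire, objects in the program" idea can be sketched with the standard library alone. The record below is illustrative; the point is the round trip from a Python object to bytes and back.

```python
import json

record = {"name": "Milk", "quantity": 34, "price": 1.99}

# serialize: Python object -> JSON text -> bytes that travel over the network
text = json.dumps(record)
wire_bytes = text.encode("utf-8")

# parse: bytes -> JSON text -> Python object, ready to use straight away
received = json.loads(wire_bytes.decode("utf-8"))

print(type(wire_bytes).__name__, received == record)
```

Nothing in `wire_bytes` is executable; the receiver only ever gets data, which is the security property mentioned above.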

### Purpose

1. Communication 
    * All imaginable communication channels
    * Applications within single server/machine
    * Only transferring of data
    * Both sides need to understand the structure

2. Storing
    * self-descriptive
    * human readable
    * also in DBs - SQL, MongoDB etc.

3. Standardization
    * predictability
    * cooperation
    * spillovers from standardization


### Dimensionality problem

* rich information comes at the cost of data complexity 
* to interrelate information, you need high dimensionality (or A LOT of columns) or declarative formats such as protobuf
* Strongly object-oriented
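To see why nested information turns into "A LOT of columns", here is a small sketch that flattens a nested record into a single level. The helper `flatten` and the sample record are hypothetical, written only for this illustration.

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into one level with dotted column names."""
    flat = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # recurse into the nested dict, extending the column prefix
            flat.update(flatten(value, prefix=f"{column}."))
        else:
            flat[column] = value
    return flat


# hypothetical nested record
course = {
    "id": "JEM207",
    "name": "Data Processing in Python",
    "teacher": {"id": 3421, "contact": {"email": "teacher@example.com"}},
}

flat_course = flatten(course)
print(flat_course)
```

Every extra level of nesting adds more columns to the flat table, which is the cost the bullet points above describe.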


### 1D:
* logs

### 2D: CSVs
* tabular data (like pandas DFs)

### 3+D:
#### XML
* eXtensible Markup Language is a software- and hardware-independent tool for storing and transporting data.
* Officially defined in 1998, but its roots are even older.
* XML was designed to carry data - with focus on what data is
* HTML was designed to display data - with focus on what data should look like when displayed 
* XML tags are not predefined like HTML tags are
* more verbose than JSON
* can have comments - actually a really cool and useful feature!
* used historically as a transaction format in many areas: 
    * Scientific measurements
    * News information
    * Weather measurements
    * Financial transactions
* An XML parser is needed to work with it in Python or in JavaScript

### JSON
* JavaScript Object Notation
* often *.json* files
* but also used in the web etc.
* supports standard datatypes - strings, numbers (integers, floats), booleans, null, arrays (lists) and objects (dictionaries)
* No comments
* More compact, less verbose
* No closing tags
* Used EVERYWHERE, BUT [NOT LICENSED FOR EVIL](https://www.json.org/license.html). If you want to do evil stuff, use XML instead.
* Native in JavaScript and close to native in Python (dictionary)
* Jupyter Notebooks


* common pitfall: properly formatted JSON is not the same as the printed form of a Python dict -> check: https://jsonlint.com/
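A short sketch of where the printed Python dict and real JSON diverge. The record is illustrative; only the standard `json` module is used.

```python
import json

record = {"name": "Milk", "in_stock": True, "discount": None, "sizes": (1, 2)}

print(repr(record))        # Python repr: single quotes, True, None, tuple
print(json.dumps(record))  # JSON: double quotes, true, null, array

# the Python repr is not valid JSON, so a JSON parser rejects it
try:
    json.loads(repr(record))
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False

# JSON object keys are always strings: integer keys are converted on dump
round_trip = json.loads(json.dumps({1: "one"}))
print(repr_is_valid_json, round_trip)
```

In practice this means a dict should go through `json.dumps` before it is sent anywhere, rather than through `str()` or `print`.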

# JSON

In [14]:
# general representation of a dictionary
# emphasis on accessibility -> key-value ( hash table )
# contains records, lists, or other dictionaries

teachers = [
    {
        "name": "Jozef Baruník",
        "titles": ["doc.", "PhDr.", "Ph.D.", "Bc.", "Mgr."],
        "ID": 1234,
        "courses": ["JEM005", "JEM116", "JEM059", "JEM061"],
    },
    {
        "name": "Martin Hronec",
        "titles": ["Bc.", "Mgr."],
        "ID": 3421,
        "courses": ["JEM005", "JEM207"],
    },
]

courses = {
    "JEM005": {"name": "Advanced Econometrics", "ECTS": 6, "teachers": [3421, 1234]},
    "JEM207": {"name": "Data Processing in Python", "ECTS": 5, "teachers": [3421]},
    "JEM116": {"name": "Applied Econometrics", "ECTS": 6, "teachers": [1234]},
    "JEM059": {"name": "Quantitative Finance I.", "ECTS": 6, "teachers": [1234, 5678]},
    "JEM061": {"name": "Quantitative Finance II.", "ECTS": 6, "teachers": [1234, 5678]},
}
jsondata = {"teachers": teachers, "courses": courses}
jsondata

{'teachers': [{'name': 'Jozef Baruník',
   'titles': ['doc.', 'PhDr.', 'Ph.D.', 'Bc.', 'Mgr.'],
   'ID': 1234,
   'courses': ['JEM005', 'JEM116', 'JEM059', 'JEM061']},
  {'name': 'Martin Hronec',
   'titles': ['Bc.', 'Mgr.'],
   'ID': 3421,
   'courses': ['JEM005', 'JEM207']}],
 'courses': {'JEM005': {'name': 'Advanced Econometrics',
   'ECTS': 6,
   'teachers': [3421, 1234]},
  'JEM207': {'name': 'Data Processing in Python',
   'ECTS': 5,
   'teachers': [3421]},
  'JEM116': {'name': 'Applied Econometrics', 'ECTS': 6, 'teachers': [1234]},
  'JEM059': {'name': 'Quantitative Finance I.',
   'ECTS': 6,
   'teachers': [1234, 5678]},
  'JEM061': {'name': 'Quantitative Finance II.',
   'ECTS': 6,
   'teachers': [1234, 5678]}}}

Is this valid JSON?

https://jsonformatter.curiousconcept.com/

![python and JSON](img/python_json.png)

In [21]:
js = json.dumps(
    jsondata 
) # json formatted string! 

isinstance(js,str)

True

In [22]:
js # json string

'{"teachers": [{"name": "Jozef Barun\\u00edk", "titles": ["doc.", "PhDr.", "Ph.D.", "Bc.", "Mgr."], "ID": 1234, "courses": ["JEM005", "JEM116", "JEM059", "JEM061"]}, {"name": "Martin Hronec", "titles": ["Bc.", "Mgr."], "ID": 3421, "courses": ["JEM005", "JEM207"]}], "courses": {"JEM005": {"name": "Advanced Econometrics", "ECTS": 6, "teachers": [3421, 1234]}, "JEM207": {"name": "Data Processing in Python", "ECTS": 5, "teachers": [3421]}, "JEM116": {"name": "Applied Econometrics", "ECTS": 6, "teachers": [1234]}, "JEM059": {"name": "Quantitative Finance I.", "ECTS": 6, "teachers": [1234, 5678]}, "JEM061": {"name": "Quantitative Finance II.", "ECTS": 6, "teachers": [1234, 5678]}}}'
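The `\u00ed` in the string above comes from the default `ensure_ascii=True`, which escapes every non-ASCII character. A minimal sketch of the two `json.dumps` options that make the output easier to read, using one illustrative record:

```python
import json

teacher = {"name": "Jozef Baruník", "ID": 1234}

escaped = json.dumps(teacher)                                # default: ensure_ascii=True
readable = json.dumps(teacher, ensure_ascii=False)           # keep "í" as-is
pretty = json.dumps(teacher, ensure_ascii=False, indent=2)   # multi-line, indented

print(escaped)
print(readable)
print(pretty)

# all variants parse back to the same Python object
assert json.loads(escaped) == json.loads(readable) == teacher
```

The escaped and unescaped forms are equivalent JSON; the choice only affects how the text looks to a human reader.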

In [23]:
jsondata['courses']['JEM005']

{'name': 'Advanced Econometrics', 'ECTS': 6, 'teachers': [3421, 1234]}

In [24]:
jsondata['courses']['JEM005']['test'] = 'test'

In [25]:
jsondata['courses']['JEM005']

{'name': 'Advanced Econometrics',
 'ECTS': 6,
 'teachers': [3421, 1234],
 'test': 'test'}

In [26]:
jsondata['teachers']


[{'name': 'Jozef Baruník',
  'titles': ['doc.', 'PhDr.', 'Ph.D.', 'Bc.', 'Mgr.'],
  'ID': 1234,
  'courses': ['JEM005', 'JEM116', 'JEM059', 'JEM061']},
 {'name': 'Martin Hronec',
  'titles': ['Bc.', 'Mgr.'],
  'ID': 3421,
  'courses': ['JEM005', 'JEM207']}]

In [27]:
pd.DataFrame(jsondata['courses'])

| | JEM005 | JEM207 | JEM116 | JEM059 | JEM061 |
|--|--|--|--|--|--|
| name | Advanced Econometrics | Data Processing in Python | Applied Econometrics | Quantitative Finance I. | Quantitative Finance II. |
| ECTS | 6 | 5 | 6 | 6 | 6 |
| teachers | [3421, 1234] | [3421] | [1234] | [1234, 5678] | [1234, 5678] |
| test | test | | | | |


In [28]:
pd.DataFrame(jsondata['courses']).transpose()

| | name | ECTS | teachers | test |
|--|--|--|--|--|
| JEM005 | Advanced Econometrics | 6 | [3421, 1234] | test |
| JEM207 | Data Processing in Python | 5 | [3421] | |
| JEM116 | Applied Econometrics | 6 | [1234] | |
| JEM059 | Quantitative Finance I. | 6 | [1234, 5678] | |
| JEM061 | Quantitative Finance II. | 6 | [1234, 5678] | |


In [29]:
pd.read_json?

Signature:
pd.read_json(
    path_or_buf: 'FilePath | ReadBuffer[str] | ReadBuffer[bytes]',
    *,
    orient: 'str | None' = None,
    typ: "Literal['frame', 'series']" = 'frame',
    dtype: 'DtypeArg | None' = None,
    convert_axes: 'bool | None' = None,
    convert_dates: 'bool | list[str]' = True,
    keep_default_dates: 'bool' = True,

In [30]:
json.dumps(jsondata["courses"])

'{"JEM005": {"name": "Advanced Econometrics", "ECTS": 6, "teachers": [3421, 1234], "test": "test"}, "JEM207": {"name": "Data Processing in Python", "ECTS": 5, "teachers": [3421]}, "JEM116": {"name": "Applied Econometrics", "ECTS": 6, "teachers": [1234]}, "JEM059": {"name": "Quantitative Finance I.", "ECTS": 6, "teachers": [1234, 5678]}, "JEM061": {"name": "Quantitative Finance II.", "ECTS": 6, "teachers": [1234, 5678]}}'

In [None]:
# from io import StringIO
# dfc = pd.read_json(StringIO(json.dumps(jsondata["courses"])), orient='index')

In [31]:
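from io import StringIO  # read_json expects a path or a file-like object

# recent pandas versions deprecate passing a literal JSON string, so wrap it in StringIO
dfc = pd.read_json(StringIO(json.dumps(jsondata["courses"])), orient='index')
dfc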


| | name | ECTS | teachers | test |
|--|--|--|--|--|
| JEM005 | Advanced Econometrics | 6 | [3421, 1234] | test |
| JEM207 | Data Processing in Python | 5 | [3421] | |
| JEM116 | Applied Econometrics | 6 | [1234] | |
| JEM059 | Quantitative Finance I. | 6 | [1234, 5678] | |
| JEM061 | Quantitative Finance II. | 6 | [1234, 5678] | |


In [None]:
# let's come back to this a little later

# eXtensible Markup Language (XML)

* elements
* attributes
* tags

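### Tag
> a name in angle brackets: `<element>` opens, `</element>` closes

### Element
> an opening tag, its content and the matching closing tag (or a single self-closing tag)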

### Convert to python data-types

In [None]:
#either
'''<element>content</element>'''

#or self-closing (no content)
'''<element />''';
# <br />

### Attributes

In [None]:
'''<element attr="value" />''';

![XML tree structure](img/xml_tree_structure.png)

**EXAMPLE:**

```xml
<bookstore>
    <book category="fiction">
        <title lang="ENG">Everyday Italian</title>
        <title lang="CZE">AAaAA</title>
        <author>Giada De Laurentis</author>
        <year>2005</year>
        <price>30.00</price>
    </book>
</bookstore>
```

```json
{
    "bookstore":[
        {
            "title":"Everyday Italian",
            "lang":"ENG",
            "author":"Giada de Laurentis",
            "year":2005,
            "price":30
        }
    ]
}
```

*Takeaway:* JSON and XML are not equivalent and cannot be freely mirrored. Unfortunately.

JSON has no attributes, and an object should not repeat a key, so the two `<title>` tags with different `lang` values need a workaround, perhaps `title_en` and `title_cze`.
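One way to carry the repeated, attributed `<title>` elements over to JSON is to turn them into a list of small objects. The sketch below uses the standard library's `xml.etree.ElementTree`; the mapping itself (attributes become keys, repeated tags become a list) is an assumption chosen for illustration, not a standard conversion.

```python
import json
import xml.etree.ElementTree as ET

xml_book = """
<book category="fiction">
    <title lang="ENG">Everyday Italian</title>
    <title lang="CZE">AAaAA</title>
    <author>Giada De Laurentis</author>
    <year>2005</year>
    <price>30.00</price>
</book>
"""

book = ET.fromstring(xml_book.strip())

# one possible mapping (an assumption, not a standard):
# attributes become ordinary keys, repeated <title> elements become a list
book_dict = {
    "category": book.get("category"),
    "titles": [{"lang": t.get("lang"), "text": t.text} for t in book.findall("title")],
    "author": book.findtext("author"),
    "year": int(book.findtext("year")),       # XML text is always a string
    "price": float(book.findtext("price")),
}

print(json.dumps(book_dict, indent=2))
```

Another converter could just as well pick `title_en`/`title_cze` keys; both sides of the exchange have to agree on the convention.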

## Navigation
* Xpath
* CSS selectors 
* **BeautifulSoup**
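For a first impression of XPath-style navigation, the standard library's `xml.etree.ElementTree` supports a limited XPath subset. This is a sketch on shortened, illustrative course data; it is not the BeautifulSoup API covered in the rest of this section.

```python
import xml.etree.ElementTree as ET

# shortened, illustrative course data
xml_courses = """
<ies_data>
    <courses>
        <course id="JEM005" ects="6">
            <teacher-id>3421</teacher-id>
            <teacher-id>1234</teacher-id>
        </course>
        <course id="JEM207" ects="5">
            <teacher-id>3421</teacher-id>
        </course>
    </courses>
</ies_data>
"""

root = ET.fromstring(xml_courses.strip())

# ".//course" = every <course> anywhere below the root
course_ids = [c.get("id") for c in root.findall(".//course")]

# "[@id='JEM207']" = filter by attribute value
jem207 = root.find(".//course[@id='JEM207']")

# attribute values and element text are always strings; convert explicitly
ects = int(jem207.get("ects"))
teacher_ids = [int(t.text) for t in jem207.findall("teacher-id")]

print(course_ids, ects, teacher_ids)
```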

### BeautifulSoup in detail

"Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree."

each BS object represents
* an element
* the position in tree

In [32]:
'''string  on more 
nes '''

'string  on more \nnes '

In [33]:
xml = '''
<?xml version="1.0" encoding="utf-8"?>
<ies_data>
    <courses>
        <course id="JEM005" ects="6" name="Advanced Econometrics">
           <teacher-id>3421</teacher-id>
           <teacher-id>1234</teacher-id>
        </course>
        <course id="JEM207" ects="5" name="Data Processing in Python">
            <teacher-id>3421</teacher-id>
        </course>
            <course id="JEM116" ects="6" name="Applied Econometrics I.">
            <teacher-id>1234</teacher-id>
        </course>
        <course id="JEM059" ects="6" name="Quantitative Finance I.">
            <teacher-id>1234</teacher-id>
            <teacher-id>5678</teacher-id>
        </course>
        <course id="JEM061" ects="6" name="Quantitative Finance II.">
            <teacher-id>1234</teacher-id>
            <teacher-id>5678</teacher-id>
        </course>
    </courses>
    <teachers>
        <teacher teacher-id="3421">
            <name>Martin Hronec</name>
        </teacher>
        <teacher teacher-id="1234">
            <name>Jozef Baruník</name>
        </teacher>
        <teacher teacher-id="5678">
            <name>Lukáš Vácha</name>
        </teacher>
    </teachers>
</ies_data>
'''

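# unlike HTML, these tag names were made up by Vitek, so no one else 'can' understand them
# without documentation -> flexibility is limited. To be fair, JSON keys have the same issue.

# name the parser explicitly; without it bs4 guesses one and emits a warning
soup = BeautifulSoup(xml, 'html.parser')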

In [34]:
dir(soup)

['ASCII_SPACES',
 'DEFAULT_BUILDER_FEATURES',
 'DEFAULT_INTERESTING_STRING_TYPES',
 'EMPTY_ELEMENT_EVENT',
 'END_ELEMENT_EVENT',
 'ROOT_TAG_NAME',
 'START_ELEMENT_EVENT',
 'STRING_ELEMENT_EVENT',
 '__bool__',
 '__call__',
 '__class__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_all_strings',
 '_clone',
 '_decode_markup',
 '_event_stream',
 '_feed',
 '_find_all',
 '_find_one',
 '_format_tag',
 '_indent_string',
 '_is_xml',
 '_lastRecursiveChild',
 '_last_descendant',
 '_linkage_fixer',
 '_marku

```find()``` returns the **first** element matching the input

```find_all()``` (or its legacy alias ```findAll()```) returns **all** elements matching the input

In [35]:
soup.find_all('course')

[<course ects="6" id="JEM005" name="Advanced Econometrics">
 <teacher-id>3421</teacher-id>
 <teacher-id>1234</teacher-id>
 </course>,
 <course ects="5" id="JEM207" name="Data Processing in Python">
 <teacher-id>3421</teacher-id>
 </course>,
 <course ects="6" id="JEM116" name="Applied Econometrics I.">
 <teacher-id>1234</teacher-id>
 </course>,
 <course ects="6" id="JEM059" name="Quantitative Finance I.">
 <teacher-id>1234</teacher-id>
 <teacher-id>5678</teacher-id>
 </course>,
 <course ects="6" id="JEM061" name="Quantitative Finance II.">
 <teacher-id>1234</teacher-id>
 <teacher-id>5678</teacher-id>
 </course>]

In [36]:
soup.find_all('course')[0].findAll('teacher-id')

[<teacher-id>3421</teacher-id>, <teacher-id>1234</teacher-id>]

In [40]:
jem059 = soup.find('course',{'id':'JEM059'}) #looking for a tag with attributes (optional)


In [41]:
jem059

<course ects="6" id="JEM059" name="Quantitative Finance I.">
<teacher-id>1234</teacher-id>
<teacher-id>5678</teacher-id>
</course>

In [39]:
soup.findAll('teacher-id')

[<teacher-id>3421</teacher-id>,
 <teacher-id>1234</teacher-id>,
 <teacher-id>3421</teacher-id>,
 <teacher-id>1234</teacher-id>,
 <teacher-id>1234</teacher-id>,
 <teacher-id>5678</teacher-id>,
 <teacher-id>1234</teacher-id>,
 <teacher-id>5678</teacher-id>]

`tag['attr']` returns the value of the attribute `attr`

In [42]:
print(jem059['ects'])
print(jem059['name'])

6
Quantitative Finance I.


you can also navigate horizontally

In [47]:
# JEM061 is the last course, so the second findNext finds nothing and returns None (no output)
jem059.findNext('course').findNext('course')

In [46]:
jem059.findPrevious('course').findPrevious('course')

<course ects="5" id="JEM207" name="Data Processing in Python">
<teacher-id>3421</teacher-id>
</course>

and even upstream!

In [48]:
jem059.parent.parent

<ies_data>
<courses>
<course ects="6" id="JEM005" name="Advanced Econometrics">
<teacher-id>3421</teacher-id>
<teacher-id>1234</teacher-id>
</course>
<course ects="5" id="JEM207" name="Data Processing in Python">
<teacher-id>3421</teacher-id>
</course>
<course ects="6" id="JEM116" name="Applied Econometrics I.">
<teacher-id>1234</teacher-id>
</course>
<course ects="6" id="JEM059" name="Quantitative Finance I.">
<teacher-id>1234</teacher-id>
<teacher-id>5678</teacher-id>
</course>
<course ects="6" id="JEM061" name="Quantitative Finance II.">
<teacher-id>1234</teacher-id>
<teacher-id>5678</teacher-id>
</course>
</courses>
<teachers>
<teacher teacher-id="3421">
<name>Martin Hronec</name>
</teacher>
<teacher teacher-id="1234">
<name>Jozef Baruník</name>
</teacher>
<teacher teacher-id="5678">
<name>Lukáš Vácha</name>
</teacher>
</teachers>
</ies_data>

In [49]:
#get all teacher ids
teacher_ids = [int(t.text) for t in soup.findAll('teacher-id')]
print(teacher_ids)
#get unique
set(teacher_ids)

[3421, 1234, 3421, 1234, 1234, 5678, 1234, 5678]


{1234, 3421, 5678}

In [50]:
course = soup.find('course')
d = {
    'id':course['id'],
    'name':course['name'],
    'ects':course['ects'],
    'teachers':[int(t.text) for t in course.findAll('teacher-id')]
}
d

{'id': 'JEM005',
 'name': 'Advanced Econometrics',
 'ects': '6',
 'teachers': [3421, 1234]}

### Can convert to JSON-like

In [51]:
l = []
for course in soup.findAll('course'):
    d = {'id':course['id'],
         'name':course['name'],
         'ects':course['ects'],
         'teachers':[int(t.text) for t in course.findAll('teacher-id')]}
    l.append(d)
l

[{'id': 'JEM005',
  'name': 'Advanced Econometrics',
  'ects': '6',
  'teachers': [3421, 1234]},
 {'id': 'JEM207',
  'name': 'Data Processing in Python',
  'ects': '5',
  'teachers': [3421]},
 {'id': 'JEM116',
  'name': 'Applied Econometrics I.',
  'ects': '6',
  'teachers': [1234]},
 {'id': 'JEM059',
  'name': 'Quantitative Finance I.',
  'ects': '6',
  'teachers': [1234, 5678]},
 {'id': 'JEM061',
  'name': 'Quantitative Finance II.',
  'ects': '6',
  'teachers': [1234, 5678]}]

### Or in list-comprehension syntax

In [52]:
l = [{
    'id':course['id'],
    'name':course['name'],
    'ects':course['ects'],
    'teachers':[int(t.text) for t in course.findAll('teacher-id')]
} for course in soup.findAll('course')]

In [None]:
l

In [54]:
pd.DataFrame(l)
# pandas knows how to handle a list of dictionaries

| | id | name | ects | teachers |
|--|--|--|--|--|
| 0 | JEM005 | Advanced Econometrics | 6 | [3421, 1234] |
| 1 | JEM207 | Data Processing in Python | 5 | [3421] |
| 2 | JEM116 | Applied Econometrics I. | 6 | [1234] |
| 3 | JEM059 | Quantitative Finance I. | 6 | [1234, 5678] |
| 4 | JEM061 | Quantitative Finance II. | 6 | [1234, 5678] |


# HTML
A standard web page consists of:

* Browser-executed code (`front-end`)
    * HTML "DOM" structure - the website content
        * List of elements that are on the website
        * Links to CSS classes and ids
    * CSS stylesheets - website graphics
    * JavaScript - website interactivity    

* Server-executed (`back-end`)
    * Server, database, app logic etc.
    * Not available for scraping!
    * May be available as API


## Web-scraping
* client side only
* Navigating HTML DOM by taking advantage of CSS structure

## DOM (Document Object Model):

In [55]:
html = '''
<html>
    <head>
        <title>Sample page</title>
    <script>
        function click_button() {
            alert('Hi guys, still enjoying Python?!')
        }
    </script>
    <style>
        #content div {
            color:black;
        }
        .firstRow {
            background-color:#ddd;
        }

        .normalRow {
            background-color:white;
        }
    </style>
    </head>
    
    <body>
        <div id="header">
            My page header
        </div>
        <div id="table_container">
            <table>
                <tr class="firstRow">
                    <td>name</td>
                    <td>number</td>
                </tr>
                <tr class="normalRow">
                    <td>B</td>
                    <td>2</td>
                </tr>
                <tr class="normalRow">
                    <td>C</td>
                    <td>3</td>
                </tr>
            </table>
        </div>
        <div id="button_container">
            <button id="btn" onclick="click_button()">Click Me!</button>
        </div>
    </body>
</html>
'''
display(HTML(html))

| name | number |
|--|--|
| B | 2 |
| C | 3 |


In [56]:
soup = BeautifulSoup(html,'html')
soup

<html>
<head>
<title>Sample page</title>
<script>
        function click_button() {
            alert('Hi guys, still enjoying Python?!')
        }
    </script>
<style>
        #content div {
            color:black;
        }
        .firstRow {
            background-color:#ddd;
        }

        .normalRow {
            background-color:white;
        }
    </style>
</head>
<body>
<div id="header">
            My page header
        </div>
<div id="table_container">
<table>
<tr class="firstRow">
<td>name</td>
<td>number</td>
</tr>
<tr class="normalRow">
<td>B</td>
<td>2</td>
</tr>
<tr class="normalRow">
<td>C</td>
<td>3</td>
</tr>
</table>
</div>
<div id="button_container">
<button id="btn" onclick="click_button()">Click Me!</button>
</div>
</body></html>

In [57]:
rows = soup.findAll('tr',{'class':'normalRow'})  # attrs must be a dict (key: value), not a set
rows

[<tr class="normalRow">
 <td>B</td>
 <td>2</td>
 </tr>,
 <tr class="normalRow">
 <td>C</td>
 <td>3</td>
 </tr>]

In [58]:
d = {}

for row in rows:
    key = row.findAll('td')[0].text
    val = int(row.findAll('td')[1].text)
    d[key] = val
pd.Series(d)

B    2
C    3
dtype: int64

In [59]:
d

{'B': 2, 'C': 3}

In [60]:
pd.Series({
    row.findAll('td')[0].text:int(row.findAll('td')[1].text) 
    for row in BeautifulSoup(html).findAll('tr',{'class':'normalRow'})})

B    2
C    3
dtype: int64

In [61]:
soup = BeautifulSoup(html)

In [62]:
row = soup.findAll('tr',{'class':'normalRow'})[0]

In [63]:
row

<tr class="normalRow">
<td>B</td>
<td>2</td>
</tr>

In [64]:
row.findAll('td')[0].text

'B'

In [65]:
int(row.findAll('td')[1].text)

2

In [None]:
{row.findAll('td')[0].text:int(row.findAll('td')[1].text) for row in soup.findAll('tr',{'class':'normalRow'})}
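To appreciate what BeautifulSoup does for us, here is the same extraction written against the standard library's event-based `html.parser`. This is a dependency-free sketch on a shortened copy of the table; the class name `NormalRowParser` is made up for this example.

```python
from html.parser import HTMLParser

html_table = """
<table>
    <tr class="firstRow"><td>name</td><td>number</td></tr>
    <tr class="normalRow"><td>B</td><td>2</td></tr>
    <tr class="normalRow"><td>C</td><td>3</td></tr>
</table>
"""


class NormalRowParser(HTMLParser):
    """Collect the cell texts of every <tr class="normalRow">."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell texts
        self._in_row = False   # are we inside a matching <tr>?
        self._in_cell = False  # are we inside a <td> of a matching row?

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and dict(attrs).get("class") == "normalRow":
            self._in_row = True
            self.rows.append([])
        elif tag == "td" and self._in_row:
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr":
            self._in_row = False
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1].append(data.strip())


parser = NormalRowParser()
parser.feed(html_table)
result = {name: int(number) for name, number in parser.rows}
print(result)
```

The parser only reports events (tag opened, text seen, tag closed), so the state tracking that `findAll` hides has to be written by hand.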

## HTML Inspection

https://ies.fsv.cuni.cz/institut/univerzita-karlova


In [66]:
import requests

# Requests and internet communication

* `Client` asks questions/sends requests (your Jupyter client)
* `Server` replies/serves answers (your Jupyter server)


API = *Application Programming Interface*

very general term! Not only used in web communication

## HTTP requests

HTTP is the most common channel for communicating with web servers.

A standard HTTP exchange (a request and its response) involves:

* URL 

    * domain
    * route
    * parameters

* Request type (method) - GET, POST, PUT, DELETE (see below)

* Content type (the `Content-Type` header), for example:
    * application/json
    * application/xml
    * text/html
    * text/css

* Content

* Outgoing data (see below)

* Cookies 

* Status code (sent back in the response):

    * 200 - success
    * 404 - resource does not exist
    * 500 - the server failed while processing your request
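The status codes follow a simple convention: 2xx means success, 4xx a client-side problem, 5xx a server-side problem. A small sketch using the standard library's `http.HTTPStatus`; the helper `describe_status` is a hypothetical name introduced for this example.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    if 200 <= code < 300:
        outcome = "success"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "other"
    return f"{code} {status.phrase} ({outcome})"


for code in (200, 404, 500):
    print(describe_status(code))
```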


1) REST API - uses HTTP requests and typically returns JSON

2) SOAP API - uses HTTP requests and returns XML

3) Website - uses HTTP requests and returns a set of HTML, JavaScript, CSS and other files

### When to use?
* whenever several applications need to communicate
* user-friendly interface for complicated tasks - DEEP AI, Google Maps
* Data - Golemio, OpenStreetMaps

### GET request
* fast
* public: the parameters are visible in the URL
* data flows in one direction only
* parameters are passed via the request address

> https://www.google.com/search?q=how+to+understand+url+parameters&rlz=1C1GCEU_csCZ860CZ860&oq=how+to+understand+url+parameters&aqs=chrome..69i57j33i22i29i30l7.5237j0j4&sourceid=chrome&ie=UTF-8
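The structure of such an address (domain, route, parameters) can be taken apart and rebuilt with the standard library's `urllib.parse`. The sketch uses a shortened version of the search URL above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# shortened version of the search URL above
url = "https://www.google.com/search?q=how+to+understand+url+parameters&ie=UTF-8"

parts = urlsplit(url)
params = parse_qs(parts.query)  # each value is a list, because keys may repeat

print(parts.netloc)  # domain
print(parts.path)    # route
print(params)        # parameters

# building the query string from a dict (spaces become "+")
query = urlencode({"q": "how to understand url parameters", "ie": "UTF-8"})
rebuilt = f"{parts.scheme}://{parts.netloc}{parts.path}?{query}"
print(rebuilt == url)
```

With `requests`, the same dictionary can be handed over as `requests.get(url, params=...)`, which encodes it into the address in a similar way.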


In [67]:
r = requests.get('https://cs.wikipedia.org/wiki/Institut_ekonomick%C3%BDch_studi%C3%AD_Fakulty_soci%C3%A1ln%C3%ADch_v%C4%9Bd_Univerzity_Karlovy')
#plain request - like browser
r.text

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" lang="cs" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Institut ekonomických studií Fakulty sociálních věd Univerzity Karlovy – Wikipedie</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-client

In [None]:
# !pip install beautifulsoup4

In [68]:
soup = BeautifulSoup(r.text, "html")
tags = soup.findAll("span", {"class": "wd"})

In [69]:
tags

[<span class="wd"><span lang="cs">Budova IES FSV UK v Praze v Opletalově ulici</span></span>,
 <span class="wd"><a href="/wiki/Opletalova" title="Opletalova">Opletalova</a>, <a href="/wiki/Praha" title="Praha">Praha</a>, <a href="/wiki/%C4%8Cesko" title="Česko">Česko</a></span>,
 <span class="wd"><span></span><span class="coordinates"><a class="external text" href="//geohack.toolforge.org/geohack.php?language=cs&amp;pagename=Institut+ekonomick%C3%BDch+studi%C3%AD+Fakulty+soci%C3%A1ln%C3%ADch+v%C4%9Bd+Univerzity+Karlovy&amp;params=50.082219444444_N_14.431111111111_E_type:landmark"><span style="white-space:pre">50°4′55,99″ s. š.</span>, <span style="white-space:pre">14°25′52″ v. d.</span></a></span></span>,
 <span class="wd"><span class="sisterproject sisterproject-commons"><span class="sisterproject_image"><span typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="1376" data-file-width="1024" decoding="async" height="16" src="//upload.wikimedia.org/wikipedia/comm

### POST request
* slower
* private: the data travels in the request body, not in the URL
* both sides can send data
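To see where the data of a POST request lives, the sketch below builds (but does not send) a request with the standard library's `urllib.request`. The endpoint `https://api.example.com/hashme` and the payload are placeholders, assumed only for illustration.

```python
import json
from urllib.request import Request

# hypothetical endpoint, used only to build (not send) the request
url = "https://api.example.com/hashme"

# the payload travels in the request body, encoded as bytes
body = json.dumps({"name": "example"}).encode("utf-8")

request = Request(
    url,
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(request.get_method())                # HTTP method
print(request.full_url)                    # no parameters in the address
print(request.get_header("Content-type"))  # urllib stores header names capitalized
print(request.data)                        # the body
```

Keeping data out of the URL means it does not end up in browser history or server access logs, but it is only protected in transit when the connection uses HTTPS.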

## Static pages x Dynamic pages x JavaScript-rendered pages

### Static

* Pages that do not get updated instantly.
* All information necessary for rendering a website is available after entering the URL.
* It may ask the database, but the output is stable.
* All parameters within the address!

* Typical examples:
    
### JavaScript rendered: 
* De facto static, but you cannot take advantage of the HTML/CSS structure, because the content is only built by JavaScript in the browser

### Dynamic content
* the webpage continuously communicates with the web server and the database
  * solution -> Selenium!


Summary Table
|Type | Content Loading | Examples|
|--|--|-----|
|Static | Pre-rendered, no real-time server updates | Basic blogs, brochures|
|JavaScript-Rendered | Uses JavaScript to load content post-initial HTML load | Modern web apps (React, Vue)|
|Dynamic | Real-time server updates with every interaction or visit | Social media, e-commerce sites|

Understanding the type of page helps in choosing the right approach for web scraping, data retrieval, or performance optimization.

### Is this website static or dynamic?

1. Facebook
2. Sreality.cz
3. IES website



## How to choose a data source for a project

You need to know in advance what data you will download:

1. full or satisfactory access to API
2. the web-page is parsable (prefer not too much javascript)
3. plan to generate all requests

# APIs Example
### Get wiki data using GET

In [None]:
# if time, return to geodata

# Let's start with a basic request

In [70]:
api_url = 'https://krcgc3uqga.execute-api.eu-central-1.amazonaws.com'
# this API implements three routes
# GET /time
# GET /stocks
# POST /hashme

In [71]:
route = 'time'
# route /ruːt/
response = requests.get(f'{api_url}/{route}')
response.json()

{'time': '2025-03-24T18:50:20.727Z'}

In [72]:
# route = stocks

route = 'stocks'
# route /ruːt/
response = requests.get(f'{api_url}/{route}')
print(response.json())

{'AAPL': 123.123}


In [73]:
route = "hashme"
url = f"{api_url}/{route}"  # reuse the base URL defined above

payload = json.dumps({
  "name": "Jan Sila"
})
headers = {
  'Content-Type': 'application/json'
}

response = requests.post(url, headers=headers, data=payload)

print(response.json())


{'hash': '9989540023c7d128cc66a374d79574b6'}
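
The POST above serializes the payload with `json.dumps` and sets the `Content-Type` header by hand. `requests` can also do both steps itself through the `json=` argument. The sketch below builds the same request without sending it, so the result can be inspected offline; it reuses the URL and payload from the cell above.

```python
import json

import requests

url = "https://krcgc3uqga.execute-api.eu-central-1.amazonaws.com/hashme"
payload = {"name": "Jan Sila"}

# Build the request without sending it, so nothing goes over the network.
prepared = requests.Request("POST", url, json=payload).prepare()

# requests sets the header and serializes the body for us.
print(prepared.headers["Content-Type"])
print(json.loads(prepared.body))
```

Sending it for real would then be `requests.post(url, json=payload)`.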


In [74]:
response = requests.get('https://en.wikipedia.org/wiki/Charles_University')
soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly
div = soup.find('div', {'id': 'mw-content-text'})  # article body; CSS selector: #mw-content-text
article = ' '.join([p.text for p in div.find_all('p')])  # join the text of all paragraphs
print(article)

Charles University (CUNI; Czech: Univerzita Karlova, UK; Latin: Universitas Carolina; German: Karls-Universität), or historically as the University of Prague (Latin: Universitas Pragensis), is the largest university in the Czech Republic.[3] It is one of the oldest universities in the world in continuous operation, the oldest university north of the Alps and east of Paris.[4] Today, the university consists of 17 faculties located in Prague, Hradec Králové, and Plzeň.[5]
 The establishment of a medieval university in Prague was inspired by Holy Roman Emperor Charles IV.[6] He requested his friend and ally, Pope Clement VI, to create the university. On 26 January 1347, the pope issued the bull establishing a university in Prague, modeled on the University of Paris, with all four faculties, including theology. On 7 April 1348 Charles, the king of Bohemia, gave to the established university privileges and immunities from the secular power in a Golden Bull[7] and on 14 January 1349 he repea
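
The scraped text still contains Wikipedia's footnote markers such as `[3]`. If they get in the way of later text processing, a regular expression can strip them. A small sketch, where `strip_citations` is a made-up helper name and the sample string is a shortened excerpt of the output above:

```python
import re

# Matches numeric footnote markers like [3] or [12].
CITATION = re.compile(r"\[\d+\]")


def strip_citations(text):
    """Remove numeric footnote markers from scraped article text."""
    return CITATION.sub("", text)


sample = ("is the largest university in the Czech Republic.[3] "
          "Today, the university consists of 17 faculties.[5]")
print(strip_citations(sample))
```

With the real page this would be applied as `strip_citations(article)`.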

# Bonus example:

## GeoJSON

* One standardized data format for transferring geodata
* Plenty of geodata out there
* An example: https://opendata.chmi.cz/meteorology/climate/recent/data/daily/01/dly-0-203-0-10101014001-202501.json


In [75]:
path_chmi_data = 'https://opendata.chmi.cz/meteorology/climate/recent/data/daily/01/dly-0-203-0-10101014001-202501.json'
verbose_request = requests.get(path_chmi_data)

In [76]:
print(verbose_request.status_code)

200


In [77]:
dir(verbose_request)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

In [78]:
verbose_request.json()

{'zaznamID': '138c5f4d-71f7-7aa6-1be8-38013113e0b1',
 'datovyZdrojID': 'meteorologie',
 'datovyTokID': 'Open.Data.1D',
 'datumVytvoreni': '2025-03-03T06:37:15.022Z',
 'verzeDat': '1.0',
 'data': {'type': 'DataCollection',
  'data': {'header': 'STATION,ELEMENT,VTYPE,DT,VAL,FLAG,QUALITY',
   'values': [['0-203-0-10101014001',
     'API30',
     '06:00',
     '2025-01-01T06:00:00Z',
     26.9,
     '',
     0.0],
    ['0-203-0-10101014001',
     'API30',
     '06:00',
     '2025-01-02T06:00:00Z',
     24.6,
     '',
     0.0],
    ['0-203-0-10101014001',
     'API30',
     '06:00',
     '2025-01-03T06:00:00Z',
     25.9,
     '',
     0.0],
    ['0-203-0-10101014001',
     'API30',
     '06:00',
     '2025-01-04T06:00:00Z',
     31.7,
     '',
     0.0],
    ['0-203-0-10101014001',
     'API30',
     '06:00',
     '2025-01-05T06:00:00Z',
     30.5,
     '',
     0.0],
    ['0-203-0-10101014001',
     'API30',
     '06:00',
     '2025-01-06T06:00:00Z',
     41.6,
     '',
     0.0],
    ['
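
The response nests the measurements under `data -> data`, with the column names in a comma-separated `header` string and the rows in `values`. One possible way to turn this into a list of records is sketched below. The `sample` dict copies only the first two rows shown above, and `to_records` is a made-up helper name.

```python
# A cut-down copy of the structure printed above (first two rows only).
sample = {
    "data": {
        "type": "DataCollection",
        "data": {
            "header": "STATION,ELEMENT,VTYPE,DT,VAL,FLAG,QUALITY",
            "values": [
                ["0-203-0-10101014001", "API30", "06:00", "2025-01-01T06:00:00Z", 26.9, "", 0.0],
                ["0-203-0-10101014001", "API30", "06:00", "2025-01-02T06:00:00Z", 24.6, "", 0.0],
            ],
        },
    }
}


def to_records(payload):
    """Pair each row of values with the column names from the header."""
    inner = payload["data"]["data"]
    columns = inner["header"].split(",")
    return [dict(zip(columns, row)) for row in inner["values"]]


records = to_records(sample)
print(records[0]["DT"], records[0]["VAL"])
```

Assuming the full response has the same layout, `pd.DataFrame(to_records(verbose_request.json()))` should give one row per measurement.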

### Geoportal Prague

https://geoportalpraha.cz/en/search?topic=data&type=[opendata]

https://geoportalpraha.cz/data-a-sluzby/otevrena-data

IPR, CHMI, CNB, etc.

In [79]:
d = requests.get('https://services5.arcgis.com/SBTXIEUGWbqzUecw/arcgis/rest/services/OVZ_CUR_OVZ_KLIMA_ZNECOVZDUSI_P/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson').json()

In [80]:
# alternative: download the ready-made JSON file directly
# d = requests.get('https://opendata.iprpraha.cz/CUR/OVZ/OVZ_Klima_ZnecOvzdusi_p/WGS_84/OVZ_Klima_ZnecOvzdusi_p.json').json()

d['features'][0]['properties']

{'OBJECTID': 1,
 'GRIDVALUE': 4,
 'GLOBALID': '9acb91c4-83e5-4faa-a90b-00ad313a31e7',
 'SHAPE__Area': 780697.57421875,
 'SHAPE__Length': 4089.1951370619568}
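
A GeoJSON `FeatureCollection` is a list of features, each with `properties` and a `geometry`. Before mapping it, it can help to summarize one property across all features. The sketch below counts `GRIDVALUE` on a tiny hand-made collection; the geometries are left empty and the values for features 2 and 3 are invented for illustration.

```python
from collections import Counter

# Hand-made sample with the same keys as the real response.
# Only the first feature's GRIDVALUE comes from the output above.
sample_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"OBJECTID": 1, "GRIDVALUE": 4}, "geometry": None},
        {"type": "Feature", "properties": {"OBJECTID": 2, "GRIDVALUE": 2}, "geometry": None},
        {"type": "Feature", "properties": {"OBJECTID": 3, "GRIDVALUE": 4}, "geometry": None},
    ],
}


def count_property(geojson, key):
    """Count how often each value of `key` occurs across all features."""
    return Counter(feature["properties"][key] for feature in geojson["features"])


counts = count_property(sample_geojson, "GRIDVALUE")
print(counts)
```

On the downloaded data the same call would be `count_property(d, "GRIDVALUE")`, which also shows the range of values a color scale has to cover.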

In [81]:
import branca # for colormap
import folium # for maps

ModuleNotFoundError: No module named 'branca'

In [None]:
colorscale = branca.colormap.linear.YlOrRd_09.scale(0, 5)

def style_function(feature):
    gridvalue = feature['properties']['GRIDVALUE']
    return {
        'fillOpacity': 0.5,
        'weight': 0,
        'fillColor': colorscale(gridvalue)
    }

m = folium.Map(location=[50.085,14.45], zoom_start=12)

geojson_url = 'https://services5.arcgis.com/SBTXIEUGWbqzUecw/arcgis/rest/services/OVZ_CUR_OVZ_KLIMA_ZNECOVZDUSI_P/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson'
folium.GeoJson(geojson_url, style_function=style_function).add_to(m)
m