<center>
<h1><b>Data Loading</b></h1>
<h2>Python and R for Data Science</h2>
<h3>Management and Data Science</h3>
<img src="../dist/img/cliente-luiss.png">
</center>

# Preliminaries

## Data Science

In data science, we have a few key steps:
1. Get the dataset
2. Load the dataset
3. Clean the dataset
4. Process and analyze the dataset
5. Visualize the dataset

Throughout the course, we will cover these steps and refine them. 

Nonetheless, the starting point is always to **obtain the data**.

## Many faces of the data

About the data, we need to understand:
- *where it is stored*: 
    - ***locally*** (e.g., our disk)
    - ***remotely*** (e.g., website)

- *how it is stored*: its data structure, i.e., its data ***format***


# Data Collection

## Keep everything local?

In principle, we may want to have the entire dataset on our local machine. For instance, by manually downloading the dataset from the web.

However, this is not always possible:
- *the dataset is too large*: we can only locally store chunks of data

- *the dataset is not fully available*: several web services do not allow us to obtain the entire dataset for different (good or bad) reasons. For instance:
    - A service is selling the access to the data and wants to give you limited access to it.
    - Privacy concerns (e.g., in the health sector) require the provider to track exactly which data you request.
    - Giving you the entire dataset would generate excessive network traffic.

- *the dataset is live*: it has frequent updates.

## Remote access to a dataset?

A *data provider* may store the dataset in quite different and undocumented ways. <br>
Most likely, you will have **no direct access** to the dataset or to the system handling the data. 

<center>
<img src="img/no-direct-data-access.png" />
</center>

Indeed, the data provider:
- wants to control how you access the data (*pay as you call*)
- does not want to give you insights about its internal infrastructure:
    - it may lead to security issues
    - it may lead to data leaks
    - it wants to be free to change how it works over time

## Then, how to access the remote data? **REST API**

For these reasons, datasets are often exposed to the external world through a *standard remote interface* called ***REST API*** (or, RESTful API):

<center>
<img src="img/data-rest.png" />
</center>

A REST API:
- is simple to implement (provider) and use (users)
- is typically reachable from the internet via HTTPS (i.e., web)
- can easily integrate authentication
- is often documented by the data provider
- can be fine-grained: users can retrieve exactly the needed bit of data
- can be inefficient: there are better solutions that, however, are less widespread and more complex

## Example: open-meteo.com

<center>
<img src="img/open-meteo.png" />
</center>


The free access is limited to a subset of the data! <br>
**Data has huge monetary value**, and most of the time you will have to pay for it!



## Example: open-meteo.com (cont'd)

<center>
<img src="img/open-meteo-api.png" width="600" />
</center>

Full documentation: **https://open-meteo.com/en/docs**

## Example: open-meteo.com (cont'd)

Data can be accessed on the web via the endpoint: https://api.open-meteo.com/v1/forecast

For instance, after checking the documentation, we can build the following request:

https://api.open-meteo.com/v1/forecast?latitude=41.8967&longitude=12.4822&hourly=temperature_2m

which returns hourly temperature values for Rome. You can open this URL with your browser:

<center>
<img src="img/open-meteo-data-1.png" width="700" />
</center>


## Example: open-meteo.com (cont'd)

If we look closely at the response, we can see the data that we care about:
<center>
<img src="img/open-meteo-data-2.png" width="900" />
</center>
These are the temperature measurements!

## REST API in practice

Given a data provider, we need to:
- **[problem \#1]** build the REST API **requests** correctly: check the documentation!

- **[problem \#2]** perform such **requests** in a fully automatic way:
    - while we could do them manually, it is not convenient
    - live data must be fetched periodically

- **[problem \#3]** interpret the **responses**:
    - the response can be anything: 
        - text file
        - image
        - raw data
    - we need to identify the proper **data format**
    - the data format is documented by the data provider
    - the data format is usually a standard one, to favor interoperability
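
Problem \#2 (fetching live data periodically) can be sketched as a small polling loop. This is only a minimal sketch, not a production scheduler: `poll` and `fake_fetch` are hypothetical names introduced here, and `fake_fetch` stands in for any function that performs one real request and returns its result.

```python
import time


def poll(fetch, interval_seconds, repetitions):
    """Call fetch() `repetitions` times, waiting `interval_seconds` between calls."""
    results = []
    for i in range(repetitions):
        results.append(fetch())
        # Do not wait after the last call
        if i < repetitions - 1:
            time.sleep(interval_seconds)
    return results


# Stand-in for a real request, so the sketch runs without network access
counter = {'calls': 0}


def fake_fetch():
    counter['calls'] += 1
    return f"measurement #{counter['calls']}"


readings = poll(fake_fetch, interval_seconds=0.01, repetitions=3)
print(readings)
```

In a real setting, `fetch` would perform the HTTP request and the interval would follow the update frequency stated by the data provider.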

## REST API request

In a REST API, a request:

- can have some ***request headers***: we may need to set some headers to, e.g., perform authentication

- is performed using a ***URL*** which combines:
    - ***request endpoint***, e.g., `https://api.open-meteo.com/v1/forecast`
    - ***request parameters***, e.g., `?latitude=41.8967&longitude=12.4822`<br> The question mark `?` starts the parameter list. Each parameter is a key-value pair of the form `<parameter_name>=<parameter_value>` (or `<parameter_name>=<parameter_value1>,<parameter_value2>` when a parameter takes several values), and the pairs are separated by the symbol `&`

- can be performed using two HTTP methods:
    - ***GET method***: the most common one (previous example). Any parameter will be embedded into the URL. 
    - ***POST method***: more advanced, often used when we need to send some data that cannot be embedded through headers or request parameters 

All these details are described in the API documentation. 
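
The way a URL combines the endpoint and the parameters can be reproduced with Python's standard library. The following is a minimal sketch that only builds and inspects the URL, without sending any request; the variable names are chosen here for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

endpoint = 'https://api.open-meteo.com/v1/forecast'
params = {'latitude': 41.8967, 'longitude': 12.4822, 'hourly': 'temperature_2m'}

# urlencode joins the key-value pairs with '&';
# the '?' separates them from the endpoint
url = endpoint + '?' + urlencode(params)
print(url)

# Going the other way: split a URL back into endpoint and parameters
parts = urlsplit(url)
print(parts.scheme + '://' + parts.netloc + parts.path)
print(parse_qs(parts.query))
```

Note that `parse_qs` returns every value as a list of strings, since a parameter name may appear more than once in a URL.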

## How to automatically perform REST API requests?

- Data provider API Python package: some services, or the community, may offer a dedicated Python package. For instance, for the Open Meteo API, we could install this package with `pip`: https://pypi.org/project/open-meteo/

- We can use a popular and well-known Python package: `requests`<br>It will work for any REST API. We can install it with `pip`:

In [4]:
! python3 -m pip install requests

Defaulting to user installation because normal site-packages is not writeable


## `requests`: make a GET request

First, we need to import it:

In [6]:
import requests

Then, we can make our first GET request:

In [9]:
url = 'https://api.open-meteo.com/v1/forecast?latitude=41.89&longitude=12.48&hourly=temperature_2m'
response = requests.get(url)

Now, the response is stored in the `response` variable. Depending on the data format, we must parse it in a different way. 

## `requests`: request parameters

Instead of manually writing the list of parameters in the URL, we can pass them as a dictionary:

In [26]:
endpoint = 'https://api.open-meteo.com/v1/forecast'
params = {'latitude': 41.89, 'longitude': 12.48, 'hourly': 'temperature_2m'}
response = requests.get(endpoint, params=params)

which is way more readable and less error-prone!

## Response status code

Each response comes with a ***status code*** that indicates whether the request has been successfully completed:

<center>
<img src="img/http-status-code.png" width="200" /><br>
<a href="https://www.infidigit.com/blog/http-status-codes/">[image credits]</a>
</center>
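
Status codes are grouped into classes by their first digit. As a rough illustration, the standard library's `http.HTTPStatus` can translate a numeric code into its standard phrase. This is only a sketch: `describe_status` and `STATUS_CLASSES` are hypothetical names introduced here.

```python
from http import HTTPStatus

# Usual grouping of HTTP status codes by their first digit
STATUS_CLASSES = {
    1: 'informational',
    2: 'success',
    3: 'redirection',
    4: 'client error',
    5: 'server error',
}


def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    status_class = STATUS_CLASSES[code // 100]
    return f"{code} {status.phrase} ({status_class})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```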

## `requests`: check response status code

In [11]:
url = 'https://api.open-meteo.com/v1/forecast?latitude=41.89&longitude=12.48&hourly=temperature_2m'
response = requests.get(url)
print(f"Status code: {response.status_code}")

Status code: 200


Our request was successful and we can now get the data out of it :)

## `requests`: retrieve response data

`requests` allows us to retrieve the data in the following formats:

- `response.text`: textual format. The response body is decoded into a Python string. Useful when the response is plain text (e.g., a single value or a web page).

- `response.json()`: JSON format. The most common choice. Easy to parse (see later slides!).

- `response.content`: the response is in an arbitrary format and we get the raw bytes. We need to carefully check the API documentation. It can be used when downloading files (e.g., an image, a zip file, etc.) from arbitrary websites.


For instance, in the case of Open Meteo API, the documentation reports that the response is in a JSON format:
<center>
<img src="img/open-meteo-doc.png" width="800" />
</center>

## `requests`: textual response

We treat the response data as a Python string:

In [15]:
response = requests.get('https://google.it')
print(response.text[:50]) # first 50 characters of google page

<!doctype html><html itemscope="" itemtype="http:/


## `requests`: JSON response

A JSON object is like a Python dictionary:

In [28]:
url = 'https://api.open-meteo.com/v1/forecast?latitude=41.89&longitude=12.48&hourly=temperature_2m'
response = requests.get(url).json()
print(f"Temperature: {response['hourly']['temperature_2m'][:10]}")

Temperature: [27.4, 27.8, 27.1, 25.8, 25.6, 26.7, 27.9, 29.4, 31.5, 32.7]


The structure of the dictionary, i.e., which key-value pairs are inside it, is expected to be documented by the data provider. 

Hence, in our example, the documentation from Open Meteo reports that there is an `hourly` key, whose associated value is a dictionary containing the key `temperature_2m`, whose associated value is a list of `float` values, i.e., our temperatures.
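
The nested structure described above can be explored offline with the standard `json` module. The sketch below uses a hand-written sample, not a real response: the timestamps and the `time` key are assumptions added for illustration, while the temperatures are the first values from the output shown earlier.

```python
import json

# Hand-written sample mimicking the structure described above (illustrative values)
sample = '''
{
  "latitude": 41.89,
  "longitude": 12.48,
  "hourly": {
    "time": ["2024-07-01T00:00", "2024-07-01T01:00", "2024-07-01T02:00"],
    "temperature_2m": [27.4, 27.8, 27.1]
  }
}
'''

data = json.loads(sample)  # JSON object -> Python dict
temperatures = data['hourly']['temperature_2m']  # JSON array -> Python list

print(type(data).__name__, type(temperatures).__name__)
print(f"Max temperature: {max(temperatures)}")

# Pair each timestamp with its temperature
for timestamp, temperature in zip(data['hourly']['time'], temperatures):
    print(timestamp, temperature)
```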

More details on the JSON format in later slides!

## `requests`: raw response

When we want to download an arbitrary file over the web, we can get the file and save it to our local filesystem. 

For instance:

In [21]:
# a pic from the web
url = "https://cdn.pixabay.com/photo/2023/11/14/20/08/woman-8388428_1280.jpg"
rawdata = requests.get(url).content # get the image
open('myimage.jpg', 'wb').write(rawdata) # saved the image to a file

88584

After executing these lines of code, you have a local file `myimage.jpg`. You can open it with your image viewer.
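
A possible variant wraps the file in a `with` block, so that the file is closed even if writing fails. This is only a sketch: `save_bytes` is a hypothetical helper, and the placeholder bytes stand in for what `requests.get(url).content` would return.

```python
import os
import tempfile


def save_bytes(path, data):
    """Write raw bytes to `path` and return the number of bytes written."""
    with open(path, 'wb') as f:  # 'wb' = write in binary mode
        return f.write(data)


# Placeholder bytes standing in for `requests.get(url).content`
rawdata = b'\xff\xd8\xff' + b'\x00' * 10

path = os.path.join(tempfile.gettempdir(), 'myimage_demo.bin')
written = save_bytes(path, rawdata)
print(written, os.path.getsize(path))
```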

## `requests`: other features

This package is extremely powerful and makes it easy to:
- perform a POST request: 
    - use `requests.post(url, data=<our data>)`

- add headers to a request: 
    - pass a dictionary with the key-value pairs
    - e.g., `requests.get(url, headers={'User-Agent':'MyApp'})`
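
To see what these two features produce without contacting any server, we can build a request and inspect it before it is sent. This is a minimal sketch: the endpoint `https://example.com/api/items` and the payload are hypothetical and used only for illustration.

```python
import requests

# Hypothetical endpoint and payload, used only for illustration
url = 'https://example.com/api/items'

request = requests.Request(
    'POST',
    url,
    data={'name': 'test'},            # data sent in the request body
    headers={'User-Agent': 'MyApp'},  # custom request header
)
prepared = request.prepare()  # builds the request without sending it

print(prepared.method)
print(prepared.headers['User-Agent'])
print(prepared.headers['Content-Type'])
print(prepared.body)
```

With a POST request, the data travels in the request body rather than in the URL, which is why `prepared.body` is not empty here.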

Look at its documentation for more details:

https://requests.readthedocs.io/en/latest/user/quickstart/

In future lectures, we may come back to `requests`.

# Data Formats

## Popular data formats

Datasets may come in different common formats:

- Textual: `.txt` file
- CSV: `.csv` file
- TSV: `.tsv` file
- JSON: often retrieved via a REST API 
- XLSX: `.xlsx` file, Microsoft Excel format
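
As a quick preview, CSV and TSV differ only in the character that separates the values (comma versus tab). The sketch below parses two small hand-written samples with the standard `csv` module; the cities and values are illustrative, not real data.

```python
import csv
import io

# Hand-written samples with the same content (illustrative values)
csv_text = "city,temperature\nRome,27.4\nMilan,25.1\n"
tsv_text = "city\ttemperature\nRome\t27.4\nMilan\t25.1\n"

# io.StringIO lets us treat a string as if it were an open file
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
tsv_rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter='\t'))

print(csv_rows[0])
print(csv_rows == tsv_rows)
```

Note that the `csv` module returns every value as a string, so numbers must be converted explicitly (e.g., with `float`).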

## Textual file (`.txt`)

The most intuitive one. Human-friendly but hard to parse when it contains complex data. In the real world, it is used only when the data is indeed simple text (e.g., a book). We can open the local file and fetch its content as a Python string. 

For instance, assuming you have a local file `myfile.txt`:

In [31]:
with open('myfile.txt', 'r') as f:  # the file is closed automatically
    content = f.read()
print(content)

Hello, LUISS!


Since it is a string, we can manipulate it through the string functions and operators available in Python.
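
A few of these string operations are sketched below. To keep the example self-contained, the text is written as a literal instead of being read from `myfile.txt`; the trailing newline is an assumption about how the file was saved.

```python
# Same text as in the file above, written as a literal
content = 'Hello, LUISS!\n'

clean = content.strip()      # remove leading/trailing whitespace
print(clean.upper())         # convert to uppercase
print(clean.split(', '))     # split on the comma followed by a space
print('LUISS' in content)    # membership test with the `in` operator
print(len(content.split()))  # number of whitespace-separated words
```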