# ETL I: Data Extraction

The first phase of an ETL process is extraction: getting the data. There are several ways to obtain data, but the most important are:

-   Data from local resources (CSV, txt, Word, Excel, ...)
-   Data from APIs
-   Data from the internet (scraping or documents)

## Data from local sources

In this first part we'll learn how to get data from local files, either on our computer or on the server we're working on.

### open()

The open() function opens a file, and returns it as a file object.

We can specify how to open the file with the mode values:

-   **r**: Read - The default mode if none is specified. Opens a
    file for reading and raises an error if the file does not exist.
-   **a**: Append - Opens a file for appending, creates the file if it does not exist.
-   **w**: Write - Opens a file for writing, creates the file if it does not exist and overwrites its content if it does.
-   **x**: Create - Creates the specified file, raises an error if the file already exists.

We can also set how the content of the file is handled:

-   **t**: Text - Default value. Text mode
-   **b**: Binary - Binary mode (e.g. images)

To access the content of the file we'll usually call the `read()` method on the file object.
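
The modes above can be tried out safely in a temporary directory. This is only a minimal sketch: the file name `notes.txt` is hypothetical, and the `with` statement is used (instead of a bare `open()`) so that each file is closed automatically.

```python
import os
import tempfile

# Work in a throwaway directory so the sketch does not touch real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "notes.txt")  # hypothetical file name

    # "w" creates the file, or empties it if it already exists.
    with open(path, "w", encoding="utf-8") as f:
        f.write("first line\n")

    # "a" keeps the existing content and writes at the end.
    with open(path, "a", encoding="utf-8") as f:
        f.write("second line\n")

    # "r" (the default) reads; read() returns the whole content as one string.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

    # "x" refuses to overwrite: the file already exists, so this raises.
    try:
        with open(path, "x", encoding="utf-8") as f:
            created = True
    except FileExistsError:
        created = False

    # Adding "b" returns raw bytes instead of decoded text.
    with open(path, "rb") as f:
        raw = f.read()

print(repr(text))
print(created, type(raw))
```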

#### Opening a plain text file


In [3]:
open('../sources/example.txt').read()

'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean sodales cursus erat sit amet volutpat. Proin urna urna, consequat a pellentesque vitae, ultrices sit amet nisl. Etiam commodo ligula et eros lobortis, in sollicitudin arcu lacinia. Praesent nec malesuada urna. Nunc id efficitur ex. Sed vitae neque ac est aliquet scelerisque. Sed aliquet felis lacus, in fermentum ante consequat eu.\n\nPraesent eu elementum elit. Integer ex odio, faucibus ac dolor quis, facilisis ullamcorper arcu. Vestibulum in egestas velit. Nunc sit amet interdum leo. Nulla consectetur felis eget fermentum elementum. Nulla facilisi. Sed rutrum nulla nulla, vitae mattis sem faucibus non. Maecenas aliquam tristique congue. Integer non magna turpis. Nunc sit amet suscipit libero, vel maximus tellus. Praesent eros sapien, pharetra ullamcorper lectus ac, cursus aliquam metus. Sed molestie lobortis magna vel fringilla.\n\nDonec quis ipsum est. Donec sed auctor enim. Pellentesque posuere massa a bibendum blandit

#### Opening a csv

In [4]:
open('../sources/aircrafts.csv').read()

'code;name;manufacturer;pax;type\nA321;Airbus A321;Airbus;230;Single Aisle\nB789;Boeing 787-9;Boeing;350;Double Aisle\nB77W;Boeing 777 Long Range;Boeing;410;Double Aisle\nA35X;Airbus A350-1000;Airbus;490;Double Aisle\n'

Which is not the same as doing this:

In [22]:
print(open('../sources/aircrafts.csv').read())

code;name;manufacturer;pax;type
A321;Airbus A321;Airbus;230;Single Aisle
B789;Boeing 787-9;Boeing;350;Double Aisle
B77W;Boeing 777 Long Range;Boeing;410;Double Aisle
A35X;Airbus A350-1000;Airbus;490;Double Aisle



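<font size="4">Got it? 🧐</font>

Without `print()`, the notebook shows the representation of the string, with line breaks written as `\n`. With `print()`, those line breaks are rendered as real line breaks.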

Or you can do this:

In [23]:
for r in open('../sources/aircrafts.csv').read().split('\n'):
    print(r.split(';'))

['code', 'name', 'manufacturer', 'pax', 'type']
['A321', 'Airbus A321', 'Airbus', '230', 'Single Aisle']
['B789', 'Boeing 787-9', 'Boeing', '350', 'Double Aisle']
['B77W', 'Boeing 777 Long Range', 'Boeing', '410', 'Double Aisle']
['A35X', 'Airbus A350-1000', 'Airbus', '490', 'Double Aisle']
['']


As we can see, any text-based format can be opened in the same way. Note the trailing `['']`: the file ends with a newline, so `split('\n')` yields one empty string at the end. How to use that data is a different topic, which we'll get into later.
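
As an alternative to splitting by hand, the standard library's `csv` module can do the parsing. The sketch below is not part of the original notebook: it copies the first rows of `aircrafts.csv` into a string so it runs without the file, and it assumes the `;` delimiter seen above.

```python
import csv
import io

# First rows of aircrafts.csv, held in memory so the sketch is self-contained.
raw = (
    "code;name;manufacturer;pax;type\n"
    "A321;Airbus A321;Airbus;230;Single Aisle\n"
    "B789;Boeing 787-9;Boeing;350;Double Aisle\n"
)

# DictReader uses the first row as keys and skips empty lines,
# so no stray empty row appears at the end.
reader = csv.DictReader(io.StringIO(raw), delimiter=";")
rows = list(reader)

for row in rows:
    print(row["code"], row["pax"])
```

With a real file, `io.StringIO(raw)` would be replaced by a file object opened with `open(..., newline='')`. All values are still strings at this point, so `pax` needs an explicit `int()` before doing arithmetic.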

### Open datasets with pandas

Pandas is an open-source Python package widely used for data science, data analysis and machine learning tasks. It is built on top of another package named NumPy. It's a powerful library, mainly for working with dataframes.

It can also open Excel and CSV files and convert them to dataframes.

You can install pandas with: `pip install pandas` and load it with
`import pandas as pd`

To read a CSV file we use the function `read_csv()`. We can specify
the separator with the option `sep=`

In [6]:
import pandas as pd

pd.read_csv('../sources/aircrafts.csv', sep=';')

Unnamed: 0,code,name,manufacturer,pax,type
0,A321,Airbus A321,Airbus,230,Single Aisle
1,B789,Boeing 787-9,Boeing,350,Double Aisle
2,B77W,Boeing 777 Long Range,Boeing,410,Double Aisle
3,A35X,Airbus A350-1000,Airbus,490,Double Aisle
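
Unlike the manual split shown earlier, pandas also infers column types. The following sketch assumes pandas is installed and feeds the same rows to `read_csv()` from an in-memory string, so it runs without the file.

```python
import io

import pandas as pd

# Same content as aircrafts.csv, held in memory for a self-contained sketch.
raw = (
    "code;name;manufacturer;pax;type\n"
    "A321;Airbus A321;Airbus;230;Single Aisle\n"
    "B789;Boeing 787-9;Boeing;350;Double Aisle\n"
    "B77W;Boeing 777 Long Range;Boeing;410;Double Aisle\n"
    "A35X;Airbus A350-1000;Airbus;490;Double Aisle\n"
)

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(raw), sep=";")

# "pax" is parsed as an integer column, so arithmetic works directly.
total_pax = int(df["pax"].sum())

# Boolean indexing keeps only the rows that match a condition.
airbus = df[df["manufacturer"] == "Airbus"]

print(df.dtypes)
print(total_pax)
print(airbus)
```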


Pandas can also load and prepare Excel files to be used with the function `.read_excel()`

In [9]:
pd.read_excel('../sources/aircrafts.xlsx')

Unnamed: 0,code,name,manufacturer,pax,type
0,A321,Airbus A321,Airbus,230,Single Aisle
1,B789,Boeing 787-9,Boeing,350,Double Aisle
2,B77W,Boeing 777 Long Range,Boeing,410,Double Aisle
3,A35X,Airbus A350-1000,Airbus,490,Double Aisle


### Opening doc and docx

To open Word documents we can use the library `docx2txt`, which extracts
the text of a `.docx` file so we can easily process it.

Install the library by running in the Anaconda prompt
`pip install docx2txt`

The document can be opened with the `process()` function

In [13]:
import docx2txt

docx2txt.process('../sources/lorem_ipsum_word.docx')

'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum euismod ante nibh, vel commodo lectus elementum non. Mauris molestie arcu id mi tempor, nec venenatis enim feugiat. In sit amet nisi at sem semper mollis vel sit amet quam. Phasellus non accumsan felis. Sed pretium ligula sed elit porta, in mollis ligula auctor. Mauris volutpat elit est, sit amet feugiat nunc auctor non. Sed sodales vitae arcu eu aliquam. Proin dictum odio malesuada malesuada gravida. Proin vel urna erat.\n\nCurabitur faucibus accumsan est, non efficitur odio aliquet eu. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Phasellus lorem erat, malesuada ut eros in, commodo euismod nunc. In feugiat erat in nisl congue viverra. Nunc laoreet sit amet quam id consequat. Nulla interdum, mauris eu finibus dapibus, metus urna finibus risus, eu pharetra eros diam in quam. Nam vitae risus nisl. Donec ornare ut arcu sed suscipit.\n\nPellentesque porta nibh quis dui varius con

## Getting data from an API

The first thing we need to understand is what an API is. API stands for “Application Programming Interface”, which is a way for different software services to communicate. Different types of APIs are used in programming hardware and software, including operating system APIs, remote APIs and web APIs.

A web API or web service API is a set of tools that allows developers to send and receive instructions and data between a web server and a client (a browser, a script, another service), usually in JSON format, to build applications.

For our purposes, APIs allow us to get data from a remote source without connecting directly to the underlying system (SQL, Mongo, local files, CMS, ERP, ...), and many of them already provide some kind of data preparation.

To connect to APIs we'll use the `requests` library.

We also need to keep in mind that each API is different from the others. Most of them provide extensive documentation about how to use them.

The most common ways to connect to them are the GET and POST methods. But what is GET, what is POST, and how do they differ?

A very basic explanation: with GET we are just calling a URL and getting the data back (you can even open that URL in your browser and see the data), while with POST we are "posting" some data to the server in the body of the request and getting information back.
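
To make the difference concrete, the sketch below builds one GET and one POST request without sending them. It deliberately uses the standard library's `urllib` so it runs offline (the real calls in this section use `requests`). The URL `https://api.example.com/search` and the parameters are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

base_url = "https://api.example.com/search"  # hypothetical endpoint
params = {"q": "margarita", "limit": 5}       # hypothetical parameters

# GET: the parameters travel inside the URL, as a query string.
get_request = Request(f"{base_url}?{urlencode(params)}", method="GET")

# POST: the same data travels in the request body, here encoded as JSON.
body = json.dumps(params).encode("utf-8")
post_request = Request(
    base_url,
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Nothing has been sent; we only inspect what each request would carry.
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

Because a GET request is fully described by its URL, it can be pasted into a browser. A POST request cannot, since its data is not in the URL.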

Let's see some examples in action!


### Importing the requests library

In [1]:
import requests

Once we have the library imported, let's do some initial setup. Not everything we set up is mandatory, but it will help avoid some common errors and blocks.

You can copy and paste most of these details, since they're common to any project.

**Set up the headers**

What are the headers? HTTP request headers are name-value pairs that a
browser or other client sends along with a request to the web server.
They carry extra information about the request, such as who is making
it and which response formats are accepted.

We'll set the User-Agent header to tell the service that the request
comes from a browser (`Mozilla/5.0` is the prefix most browsers send).
We can also set the Accept header to specify the format of the data we
want back.

In [15]:
basic_headers = {'User-Agent': 'Mozilla/5.0'}

### GET Request

And now it's time to see an example GET request.

We'll use the open API named **The Cocktail DB** $ \rightarrow $ <https://www.thecocktaildb.com/api.php>

In [16]:
endpoint = 'https://www.thecocktaildb.com/api/json/v1/1/random.php'
payload = requests.get(endpoint, headers=basic_headers)
print(payload)

<Response [200]>


As you can see, we have received a 200 code. But what is a 200 code?

The 200 code means that everything is OK. You probably already know some of these codes:

-   **200 OK:** The most common response code, and
    the one we get most of the time when browsing the web. We
    asked for a web page, and it was returned without any
    trouble.
-   **301 or 302 Moved:** The content has been moved to another URL,
    permanently (301) or temporarily (302).
-   **401 Unauthorized:** We've requested content that requires some kind of
    authentication to access (token, user + password, certificate, ...).
-   **403 Forbidden:** We've requested content that we don't have
    permission to access at all. This page isn't for us.
-   **404 Not Found:** We've requested content that the web server can't
    find at that URL. The page can't be shown because the server
    doesn't know what it is.
-   **500 Internal Server Error:** We've requested a page, and in return we
    get a generic error message. Something failed on the server, and no
    detail is given.
-   **503 Service Unavailable:** We asked for a page, but are told that it
    is temporarily unavailable. Perhaps the website is down for
    maintenance or overloaded. (An incorrect payload usually produces a
    400 Bad Request instead.)
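
The first digit of a status code tells us its family, which is often all a script needs to decide what to do next. The helper below is a hypothetical sketch, not part of `requests`. It uses the standard library's `http.HTTPStatus` to look up the official phrase for a code.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found (client error)'.

    Hypothetical helper for illustration; raises ValueError for codes
    that HTTPStatus does not know.
    """
    status = HTTPStatus(code)
    if 200 <= code < 300:
        family = "success"
    elif 300 <= code < 400:
        family = "redirection"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "informational"
    return f"{code} {status.phrase} ({family})"


for code in (200, 301, 401, 404, 503):
    print(describe_status(code))
```

With a `requests` response, the number to pass in would be the response's `status_code` attribute.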

Now that we know about those codes, you may be asking: OK, but what do I
do with that 200 code? Where's the content?

Let's access it with the `.content` attribute

In [17]:
payload.content

b'{"drinks":[{"idDrink":"178333","strDrink":"Raspberry Julep","strDrinkAlternate":null,"strTags":null,"strVideo":null,"strCategory":"Cocktail","strIBA":null,"strAlcoholic":"Alcoholic","strGlass":"Cordial glass","strInstructions":"Softly muddle the mint leaves and raspberry syrup in the bottom of the cup. Add crushed ice and Bourbon to the cup and then stir. Top with more ice, garnish with a mint sprig.","strInstructionsES":null,"strInstructionsDE":null,"strInstructionsFR":null,"strInstructionsIT":"Pestare delicatamente le foglie di menta e lo sciroppo di lamponi sul fondo della tazza. Aggiungere il ghiaccio tritato e il Bourbon nella tazza e poi mescolare. Completare con altro ghiaccio, guarnire con un rametto di menta.","strInstructionsZH-HANS":null,"strInstructionsZH-HANT":null,"strDrinkThumb":"https:\\/\\/www.thecocktaildb.com\\/images\\/media\\/drink\\/hyztmx1598719265.jpg","strIngredient1":"Bourbon","strIngredient2":"Raspberry syrup","strIngredient3":"Mint","strIngredient4":null,"

In this case, the content is JSON. The easiest way to work with it is to parse it directly with the `.json()` method

In [18]:
payload.json()

{'drinks': [{'idDrink': '178333',
   'strDrink': 'Raspberry Julep',
   'strDrinkAlternate': None,
   'strTags': None,
   'strVideo': None,
   'strCategory': 'Cocktail',
   'strIBA': None,
   'strAlcoholic': 'Alcoholic',
   'strGlass': 'Cordial glass',
   'strInstructions': 'Softly muddle the mint leaves and raspberry syrup in the bottom of the cup. Add crushed ice and Bourbon to the cup and then stir. Top with more ice, garnish with a mint sprig.',
   'strInstructionsES': None,
   'strInstructionsDE': None,
   'strInstructionsFR': None,
   'strInstructionsIT': 'Pestare delicatamente le foglie di menta e lo sciroppo di lamponi sul fondo della tazza. Aggiungere il ghiaccio tritato e il Bourbon nella tazza e poi mescolare. Completare con altro ghiaccio, guarnire con un rametto di menta.',
   'strInstructionsZH-HANS': None,
   'strInstructionsZH-HANT': None,
   'strDrinkThumb': 'https://www.thecocktaildb.com/images/media/drink/hyztmx1598719265.jpg',
   'strIngredient1': 'Bourbon',
   'strI

In [19]:
payload.json()['drinks'][0]['strDrink']

'Raspberry Julep'
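
The same decoding can be reproduced offline with the standard library's `json` module. The bytes below are a shortened, hand-copied sample of the `.content` shown above, not a live response. The key `strMissing` is hypothetical and only there to show safe access.

```python
import json

# Shortened sample of the raw bytes returned in `payload.content`.
content = (
    b'{"drinks":[{"idDrink":"178333","strDrink":"Raspberry Julep",'
    b'"strTags":null,"strIngredient1":"Bourbon"}]}'
)

# json.loads accepts bytes or str and returns plain Python objects.
data = json.loads(content)

drink = data["drinks"][0]          # "drinks" is a list with one dict
name = drink["strDrink"]
tags = drink["strTags"]            # JSON null becomes Python None

# .get() avoids a KeyError when a key might be absent.
missing = drink.get("strMissing", "n/a")  # hypothetical key

print(name, tags, missing)
```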

### POST Request

Now we'll try a POST request. POST requests are the most common in big APIs, and in those where we need to exchange information.

We'll often need to authenticate our requests, but we'll see how later.

Now we'll use the OnAirNet API. Since we'll call different endpoints, we'll set things up differently this time: the base URL and the endpoint are stored separately.

Let's make our first POST request:

In [3]:
api = 'https://api.onairnet.xyz/'
endpoint = 'assembler/aircraft/last-5-country'
api_call = api + endpoint
payload = requests.post(api_call)
print(payload.json())

JSONDecodeError: [Errno Expecting value] <!doctype html>
<html lang=en>
<title>404 Not Found</title>
<h1>Not Found</h1>
<p>The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.</p>
: 0