# Data Pipelines: ETL vs ELT
"Data pipeline" is a generic term for any process that moves data from one place to another, for example from one server to another.

## ETL
An [ETL pipeline](https://en.wikipedia.org/wiki/Extract,_transform,_load) is a specific and very common kind of data pipeline. ETL stands for Extract, Transform, Load. Imagine that you have a database containing web log data. Each entry contains the IP address of a user, a timestamp, and the link that the user clicked.

What if your company wanted to run an analysis of links clicked by city and by day? You would need another data set that maps IP address to a city, and you would also need to extract the day from the timestamp. With an ETL pipeline, you could run code once per day that would extract the previous day's log data, map IP address to city, aggregate link clicks by city, and then load these results into a new database. That way, a data analyst or scientist would have access to a table of log data by city and day. That is more convenient than always having to run the same complex data transformations on the raw web log data.
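
The daily job described above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the log rows, the IP-to-city lookup table, the column names, and the `clicks_by_city_day` table name are all made up, and an in-memory SQLite database stands in for the "new database".

```python
import sqlite3

import pandas as pd

# Hypothetical raw web log entries: IP address, timestamp, clicked link.
raw_logs = pd.DataFrame(
    {
        "ip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
        "timestamp": [
            "2024-01-01 09:15:00",
            "2024-01-01 17:40:00",
            "2024-01-02 08:05:00",
            "2024-01-02 12:30:00",
        ],
        "link": ["/home", "/pricing", "/home", "/home"],
    }
)

# Hypothetical lookup table that maps an IP address to a city.
ip_to_city = pd.DataFrame(
    {
        "ip": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
        "city": ["Lisbon", "Nairobi", "Lisbon"],
    }
)


def transform(logs, lookup):
    """Map IPs to cities, extract the day, and count clicks per city, day, and link."""
    merged = logs.merge(lookup, on="ip", how="left")
    merged["day"] = pd.to_datetime(merged["timestamp"]).dt.strftime("%Y-%m-%d")
    return merged.groupby(["city", "day", "link"]).size().reset_index(name="clicks")


def load(table, connection):
    """Write the aggregated table to the target database."""
    table.to_sql("clicks_by_city_day", connection, if_exists="replace", index=False)


connection = sqlite3.connect(":memory:")  # stand-in for the analytics database
clicks = transform(raw_logs, ip_to_city)  # the "extract" step is the in-memory data above
load(clicks, connection)

rows = connection.execute(
    "SELECT city, day, link, clicks FROM clicks_by_city_day ORDER BY city, day, link"
).fetchall()
print(rows)
```

An analyst can then query `clicks_by_city_day` directly instead of repeating the merge and aggregation on the raw logs.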

Before cloud computing, businesses stored their data on large, expensive, private servers. Running queries on large data sets, like raw web log data, could be costly in both money and time. Because data analysts might need to query a database several times in the same day, pre-aggregating the data with an ETL pipeline made sense.

## ELT
ELT (Extract, Load, Transform) pipelines have gained traction since the advent of cloud computing. Cloud computing has lowered the cost of storing data and running queries on large, raw data sets. Many of these cloud services, like Amazon Redshift, Google BigQuery, or IBM Db2, can be queried using SQL or a SQL-like language. With these tools, the data is extracted, loaded directly in its raw form, and only then transformed at the end of the pipeline.
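
To make the ordering concrete, here is a minimal ELT sketch. It assumes an in-memory SQLite database as a stand-in for a cloud warehouse, and the log rows and the `raw_logs` table name are hypothetical. The raw rows are loaded untouched, and the transformation happens later as a SQL query.

```python
import sqlite3

# Hypothetical raw log rows: IP address, timestamp, clicked link.
raw_rows = [
    ("10.0.0.1", "2024-01-01 09:15:00", "/home"),
    ("10.0.0.2", "2024-01-01 17:40:00", "/pricing"),
    ("10.0.0.1", "2024-01-02 08:05:00", "/home"),
]

warehouse = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse

# Extract and Load: the raw data goes in exactly as it arrived.
warehouse.execute("CREATE TABLE raw_logs (ip TEXT, timestamp TEXT, link TEXT)")
warehouse.executemany("INSERT INTO raw_logs VALUES (?, ?, ?)", raw_rows)

# Transform: performed inside the warehouse, at query time, with SQL.
daily_clicks = warehouse.execute(
    """
    SELECT date(timestamp) AS day, link, COUNT(*) AS clicks
    FROM raw_logs
    GROUP BY day, link
    ORDER BY day, link
    """
).fetchall()
print(daily_clicks)
```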

However, ETL pipelines are still used even with these cloud tools. It often still makes sense to run an ETL pipeline and store data in a more readable or intuitive format. This can help data analysts and scientists work more efficiently and help an organization become more data-driven.

## Outline of the Lesson
1. Extract data from different sources such as:

    - csv files
    - json files
    - APIs

2. Transform data

    - combining data from different sources
    - data cleaning
    - data types
    - parsing dates
    - file encodings
    - missing data
    - duplicate data
    - dummy variables
    - remove outliers
    - scaling features
    - engineering features

3. Load

    - send the transformed data to a database

4. ETL Pipeline

    - code an ETL pipeline
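
As a preview of a few transformation steps from the outline (duplicate data, data types, parsing dates, missing data, dummy variables), here is a small pandas sketch. The records and column names are invented for illustration, and filling a missing value with the column mean is just one possible choice, not a recommendation from the lesson.

```python
import pandas as pd

# Hypothetical records with typical problems: a duplicate row, a missing value,
# numbers and dates stored as text, and a categorical column.
records = pd.DataFrame(
    {
        "country": ["Brazil", "Brazil", "Kenya", "India"],
        "approval_date": ["2018-06-28", "2018-06-28", "2017-03-15", "2016-11-02"],
        "total_cost": ["500", "500", None, "1200"],
        "region": ["Latin America", "Latin America", "Africa", "South Asia"],
    }
)

# Duplicate data: drop exact repeats.
cleaned = records.drop_duplicates().copy()

# Data types and parsing dates: convert text columns to numeric and datetime.
cleaned["approval_date"] = pd.to_datetime(cleaned["approval_date"])
cleaned["total_cost"] = pd.to_numeric(cleaned["total_cost"])

# Missing data: fill the gap with the column mean (one option among several).
cleaned["total_cost"] = cleaned["total_cost"].fillna(cleaned["total_cost"].mean())

# Engineering features: derive the approval year from the parsed date.
cleaned["approval_year"] = cleaned["approval_date"].dt.year

# Dummy variables: one indicator column per region.
cleaned = pd.get_dummies(cleaned, columns=["region"])
print(cleaned)
```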

This lesson contains many Jupyter notebook exercises where you can practice the different parts of an ETL pipeline. Some of the exercises are challenging, but they also contain hints to help you get through them. You'll notice that the "transformation" section is relatively long. You'll oftentimes hear data scientists say that cleaning and transforming data is how they spend a majority of their time. This lesson reflects that reality.

## World Bank Data
This lesson assumes you have experience with pandas and basic programming skills.

This lesson uses data from the World Bank. The data comes from two sources:

1. World Bank Indicator Data - This data contains socio-economic indicators for countries around the world. A few example indicators include population, arable land, and central government debt.
2. World Bank Project Data - This data set contains information about World Bank project lending since 1947.

## How to Tackle the Exercises
This course assumes you have experience manipulating data with the pandas library, which is covered in the data analyst nanodegree. Some of these transformation exercises are challenging, and the most challenging ones are marked (challenging). An exercise marked this way is worth solving, but it’s not essential for understanding the lesson material or for getting through the final project at the end of this data engineering course.

Throughout the exercises, you might have to read the pandas documentation or search outside the classroom for how to apply a certain processing technique. That is not just expected but also encouraged. As a data science professional, you will often have to research how to do something on your own, much as software engineers do. See this answer on Quora about [how often do people use stackoverflow when working on data science projects?](https://www.quora.com/How-often-do-people-use-stackoverflow-when-working-on-data-science-projects).

Use Google and other search engines when you're not sure how to do something!

## What You Will Do in the Next Section
In the next section of the lesson, you'll learn about the extract portion of an ETL pipeline. You’ll get practice with a series of exercises. These exercises are relatively brief and focus on extracting, or in other words, reading in data from different sources. The goal is to familiarize yourself with different types of files and see how the same data can be formatted in different ways.

For a review of pandas, click on the "Extracurricular" section of the classroom. Open the Prerequisite: Python for Data Analysis course, and go to Lesson 7: Pandas.

## Summary of the data file types you'll work with
#### CSV files
CSV stands for comma-separated values. These types of files separate values with a comma, and each entry is on a separate line. Oftentimes, the first line will contain the variable names. Here is an example of what CSV data looks like. This is an abbreviated version of the first three lines in the World Bank projects data CSV file.
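
As a separate, minimal sketch of how CSV text maps to records, the snippet below parses two made-up rows with Python's built-in `csv` module. The rows and column names are illustrative assumptions and are not taken from the projects data file.

```python
import csv
import io

# Made-up rows for illustration only; not the contents of projects_data.csv.
csv_text = """id,countryname,totalamt
P001,Brazil,500
P002,Kenya,300
"""

reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

print(reader.fieldnames)  # the header line supplies the variable names
print(rows[0])            # each later line becomes one record of strings
```

Note that every value comes back as a string; converting `totalamt` to a number is a transformation step.
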
#### JSON
JSON is a file format with key/value pairs. It looks like a Python dictionary. The exact same CSV file represented in JSON could look like this:
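
To illustrate the resemblance to Python dictionaries, here is a short sketch that parses two made-up records with the built-in `json` module. The records and key names are assumptions for illustration, not actual World Bank data.

```python
import json

# Made-up records for illustration only.
json_text = """
[
  {"id": "P001", "countryname": "Brazil", "totalamt": 500},
  {"id": "P002", "countryname": "Kenya", "totalamt": 300}
]
"""

records = json.loads(json_text)

print(type(records))               # a list of dictionaries
print(records[0]["countryname"])   # values are looked up by key
```

Unlike CSV, JSON keeps basic types, so `totalamt` is already an integer after parsing.
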
#### XML
Another data format is called XML (Extensible Markup Language). XML is very similar to HTML at least in terms of formatting. The main difference between the two is that HTML has pre-defined tags that are standardized. In XML, tags can be tailored to the data set. Here is what this same data would look like as XML.
XML is falling out of favor especially because JSON tends to be easier to navigate; however, you still might come across XML data. The World Bank API, for example, can return either XML data or JSON data. From a data perspective, the process for handling HTML and XML data is essentially the same.
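
As a rough sketch of what navigating XML involves, the snippet below parses two made-up records with Python's built-in `xml.etree.ElementTree`. The tag names (`projects`, `project`, `countryname`, `totalamt`) are hypothetical, chosen to show that XML tags can be tailored to the data set.

```python
import xml.etree.ElementTree as ET

# Made-up records for illustration only; tag names are tailored to this data.
xml_text = """
<projects>
  <project id="P001">
    <countryname>Brazil</countryname>
    <totalamt>500</totalamt>
  </project>
  <project id="P002">
    <countryname>Kenya</countryname>
    <totalamt>300</totalamt>
  </project>
</projects>
"""

root = ET.fromstring(xml_text.strip())

# Walk the tree and flatten each <project> element into a dictionary.
records = [
    {
        "id": project.get("id"),
        "countryname": project.findtext("countryname"),
        "totalamt": int(project.findtext("totalamt")),
    }
    for project in root.findall("project")
]
print(records)
```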

#### SQL databases
SQL databases store data in tables using primary and foreign keys. In a SQL database, the same data would look like this:
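
A small sketch of the primary key and foreign key idea, using Python's built-in `sqlite3` module with an in-memory database. The two tables, their columns, and the rows are hypothetical and only meant to show how related tables are linked and joined.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical schema: each project row points at a country row.
connection.executescript(
    """
    CREATE TABLE countries (
        country_code TEXT PRIMARY KEY,
        countryname TEXT NOT NULL
    );
    CREATE TABLE projects (
        id TEXT PRIMARY KEY,
        country_code TEXT NOT NULL REFERENCES countries (country_code),
        totalamt INTEGER
    );
    INSERT INTO countries VALUES ('BRA', 'Brazil'), ('KEN', 'Kenya');
    INSERT INTO projects VALUES ('P001', 'BRA', 500), ('P002', 'KEN', 300);
    """
)

# The foreign key lets a query join the two tables back together.
joined = connection.execute(
    """
    SELECT p.id, c.countryname, p.totalamt
    FROM projects AS p
    JOIN countries AS c ON c.country_code = p.country_code
    ORDER BY p.id
    """
).fetchall()
print(joined)
```
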
#### Text Files
This course won't go into much detail about text data. There are other Udacity courses, namely on natural language processing, that go into the details of processing text for machine learning.

## Extracting Data from the Web
In this lesson, you'll see how to extract data from the web using an API (Application Programming Interface). APIs generally provide data in either JSON or XML format.

Companies and organizations provide APIs so that programmers can access data in an official, safe way. APIs allow you to download, and sometimes even upload or modify, data from a web server without giving you direct access.
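
The sketch below shows the general shape of an API request using only the standard library. The URL pattern and the assumption that the JSON response is a two-element list (metadata first, records second) reflect the World Bank indicators API as commonly documented, but treat both as assumptions to check against the official documentation. The sample payload values are made up, and the network call is defined but not executed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.worldbank.org/v2"  # assumed base URL


def build_indicator_url(country_code, indicator, fmt="json"):
    """Build a request URL for one country and one indicator."""
    query = urlencode({"format": fmt})
    return f"{BASE_URL}/country/{country_code}/indicator/{indicator}?{query}"


def parse_indicator_payload(payload_text):
    """Return (date, value) pairs, assuming a [metadata, records] response."""
    _metadata, records = json.loads(payload_text)
    return [(record["date"], record["value"]) for record in records]


def fetch_indicator(country_code, indicator):
    """Download and parse the data. Requires network access; not called below."""
    url = build_indicator_url(country_code, indicator)
    with urlopen(url, timeout=30) as response:
        return parse_indicator_payload(response.read().decode("utf-8"))


# Made-up payload in the assumed response shape, used instead of a live request.
sample_payload = json.dumps(
    [
        {"page": 1, "pages": 1},
        [{"date": "2017", "value": 100}, {"date": "2016", "value": 90}],
    ]
)

print(build_indicator_url("br", "SP.POP.TOTL"))
print(parse_indicator_payload(sample_payload))
```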

# Goal of the ETL Lesson

The main goal of this ETL pipelines lesson is to take the [World Bank Project data set](https://datacatalog.worldbank.org/dataset/world-bank-projects-operations) and merge this data with the [World Bank indicator data](https://data.worldbank.org/indicator/SP.POP.TOTL). Then you'll load the merged data into a database.

In the process, you'll need to transform these data sets in different ways. And finally, you'll code an ETL pipeline to extract, transform, and load the data all in one step.
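
As a rough picture of what "all in one step" can mean, here is a toy pipeline function that extracts two CSV sources, merges them, and loads the result into a database. The CSV contents, the join columns (`countryname`, `year`), and the `projects_with_indicators` table name are hypothetical; the real data sets need considerably more cleaning before they can be merged.

```python
import io
import sqlite3

import pandas as pd

# Made-up stand-ins for the project and indicator data sets.
PROJECTS_CSV = """id,countryname,year,totalamt
P001,Brazil,2016,500
P002,Kenya,2017,300
P003,Brazil,2017,200
"""
INDICATORS_CSV = """countryname,year,population
Brazil,2016,1000
Brazil,2017,1010
Kenya,2017,400
"""


def run_pipeline(projects_source, indicators_source, connection):
    """Extract both sources, merge them, and load the result into the database."""
    # Extract
    projects = pd.read_csv(projects_source)
    indicators = pd.read_csv(indicators_source)
    # Transform: attach the matching indicator value to every project
    merged = projects.merge(indicators, on=["countryname", "year"], how="left")
    # Load
    merged.to_sql("projects_with_indicators", connection, if_exists="replace", index=False)
    return merged


connection = sqlite3.connect(":memory:")
merged = run_pipeline(io.StringIO(PROJECTS_CSV), io.StringIO(INDICATORS_CSV), connection)
row_count = connection.execute(
    "SELECT COUNT(*) FROM projects_with_indicators"
).fetchone()[0]
print(merged)
print(row_count)
```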

# Extracting data from a csv file

The first step in an ETL pipeline is extraction. Data comes in many different formats, and you'll practice extracting data from CSV files, JSON files, XML files, SQL databases, and the web.

In this first exercise, you'll practice extracting data from a CSV file and then navigating through the results. You'll see that extracting data is not always a straightforward process.

This exercise contains a series of coding questions for you to solve. If you get stuck, there is a solution file called 1_csv_exercise_solution.ipynb. You can find this solution file by going to File->Open and then clicking on the file name.

# Part 1 projects_data.csv

You'll be using the following CSV files:
* projects_data.csv
* population_data.csv

As a first step, try importing the projects data using the pandas [read_csv function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). The file path is '../data/projects_data.csv'. You can see the file if you click on File->Open in the workspace and open the data folder.
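
If a plain `read_csv` call gives surprising results, one option worth trying is reading every column as text and converting types later. The sketch below shows that idea on a small temporary file with made-up rows; the helper name `read_projects` and the demo file are hypothetical, and the exercise notebook may call for different options.

```python
import os
import tempfile

import pandas as pd


def read_projects(path):
    """Read a CSV file with every column kept as text (an assumption, not a requirement)."""
    # dtype=str stops pandas from guessing column types, which helps when a
    # column mixes numbers, formatted numbers, and empty values.
    return pd.read_csv(path, dtype=str)


# Demonstration on a temporary file with made-up rows.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_projects.csv")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write('id,countryname,totalamt\nP001,Brazil,"1,500"\nP002,Kenya,300\n')
    df_demo = read_projects(demo_path)

print(df_demo.shape)
print(df_demo["totalamt"].tolist())
```

In the workspace, the same helper could be called as `read_projects('../data/projects_data.csv')`.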