# 1. ***INTRODUCTION***

* **Data Ingestion** is the process of **extracting** data, often unstructured or semi-structured, from a source, transporting it to a suitable environment, and preparing it for use. This often includes **normalizing**, **cleaning**, and adding **metadata**.

Data engineers are the architects behind:
* **building** reliable, efficient, and scalable pipelines.
* **optimizing data storage** to keep costs low and performance high.
* **ensuring data quality and integrity** by addressing duplicates, inconsistencies, and missing values.
* **implementing data governance** for secure, compliant, and well-managed data.
* **adapting data architectures** to meet the changing needs of the organization.

Ultimately, data engineers manage the entire **data lifecycle**, from collection to consumption.

### **Extracting Data**

**Data Streaming vs Batching**

When extracting data, you need to decide **how** to process it:
* **Batching**: processing data in chunks at scheduled intervals. Suitable for scheduled tasks, and it reduces system load.
* **Streaming**: processing data continuously as it arrives. Ideal for real-time processing and immediate insights.
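
The difference between the two modes can be sketched with plain Python generators. This is only an illustration under simplifying assumptions: batches are formed by size rather than by schedule, and the function names and sample records are hypothetical.

```python
from typing import Iterable, Iterator, List


def process_in_batches(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Batching: collect records into chunks and hand over one chunk at a time."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, chunk
        yield batch


def process_as_stream(records: Iterable[dict]) -> Iterator[dict]:
    """Streaming: hand over each record as soon as it arrives."""
    for record in records:
        yield record


# Hypothetical sample records, for illustration only
sample = [{"id": i} for i in range(5)]

batches = list(process_in_batches(sample, batch_size=2))
streamed = list(process_as_stream(sample))

print([len(b) for b in batches])  # [2, 2, 1]
print(len(streamed))              # 5
```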

## 2. ***Working with REST APIs (APIs as a Data Source)***

APIs are a major source for data ingestion. Depending on how an API provides its data, it can be used in both **batch and streaming** workflows.

1. **APIs for batch extraction**: Some APIs return large datasets at once. The data is often fetched on a schedule or as part of an ETL process.

**Common batch API examples**
* **CRM APIs (Salesforce, HubSpot)** - Export customer data daily.
* **E-Commerce APIs (Shopify, Amazon)** - Download product catalogues or sales reports periodically.
* **Public APIs (Weather, Financial Data)** - Receive daily stock and market updates.

**How batch API extraction works**:
1. Call an API at **scheduled intervals** (e.g., every hour or every day).
2. Retrieve all available data (e.g., the last 24 hours of records).
3. Store results in a database, data warehouse, or file storage.
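
The three steps above can be sketched as a single function that a scheduler would call once per run. The names `run_batch_extraction` and `fake_fetch`, the JSON Lines output, and the 24-hour window are assumptions made for illustration; in practice `fetch` would wrap a real API call.

```python
import json
import tempfile
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Callable, List


def run_batch_extraction(
    fetch: Callable[[datetime], List[dict]],
    output_path: Path,
    now: datetime,
) -> int:
    """One scheduled run: fetch the last 24 hours of records and append them to a file."""
    since = now - timedelta(hours=24)
    records = fetch(since)
    with output_path.open("a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")  # one JSON object per line
    return len(records)


def fake_fetch(since: datetime) -> List[dict]:
    # Hypothetical stand-in for an API call that returns records newer than `since`
    return [{"id": 1, "since": since.isoformat()}, {"id": 2, "since": since.isoformat()}]


output_file = Path(tempfile.mkdtemp()) / "records.jsonl"
run_time = datetime(2025, 4, 23, tzinfo=timezone.utc)
written = run_batch_extraction(fake_fetch, output_file, now=run_time)
print(written)  # 2
```

Passing `now` and `fetch` in as arguments keeps the function independent of the scheduler and the API client, which makes a run easy to repeat for a past time window.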


In [None]:
# Get data with the requests library
import requests

url = 'https://api.github.com/repos/DataTalksClub/data-engineering-zoomcamp/events'

response = requests.get(url)
data = response.json()
data

[{'id': '48993156691',
  'type': 'WatchEvent',
  'actor': {'id': 129647123,
   'login': 'Abdullakhan110100100',
   'display_login': 'Abdullakhan110100100',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/Abdullakhan110100100',
   'avatar_url': 'https://avatars.githubusercontent.com/u/129647123?'},
  'repo': {'id': 419661684,
   'name': 'DataTalksClub/data-engineering-zoomcamp',
   'url': 'https://api.github.com/repos/DataTalksClub/data-engineering-zoomcamp'},
  'payload': {'action': 'started'},
  'public': True,
  'created_at': '2025-04-23T21:19:12Z',
  'org': {'id': 72699292,
   'login': 'DataTalksClub',
   'gravatar_id': '',
   'url': 'https://api.github.com/orgs/DataTalksClub',
   'avatar_url': 'https://avatars.githubusercontent.com/u/72699292?'}},
 {'id': '48992523734',
  'type': 'WatchEvent',
  'actor': {'id': 204797752,
   'login': 'rodrigues39',
   'display_login': 'rodrigues39',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/rodrigues39',
   'avatar

## ***3. Common Challenges of API Extraction***

* **Authentication** - API keys, OAuth Tokens, Basic Authentication
* **Pagination** - Pages, Data Chunks
* **Rate Limits** - Pause requests, Retry-After Header
* **Memory Management** - Limited memory, Streaming

1. Rate limits: check the remaining quota through the API and, if it is exhausted, sleep before sending the next request.

In [None]:
url = 'https://api.github.com/rate_limit'

response = requests.get(url)
response.json()['rate']['remaining']

58

In [None]:
import time

url = 'https://api.github.com/rate_limit'

response = requests.get(url)
remaining = response.json()['rate']['remaining']

if remaining == 0:
  time.sleep(30)
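
A fixed 30-second sleep may be shorter or longer than needed. The `rate` object returned by GitHub's `/rate_limit` endpoint also carries a `reset` field (UTC epoch seconds), so the wait can be computed instead of guessed. The sketch below assumes that payload shape; the helper name `seconds_until_reset` and the sample payloads are hypothetical.

```python
def seconds_until_reset(rate: dict, now: float) -> float:
    """Return how long to wait before the next request, given the 'rate' object."""
    if rate["remaining"] > 0:
        return 0.0
    # 'reset' is the moment the quota resets, in UTC epoch seconds
    return max(0.0, float(rate["reset"] - now))


# Hypothetical payloads, for illustration only
exhausted = {"limit": 60, "remaining": 0, "reset": 1_700_000_120}
available = {"limit": 60, "remaining": 58, "reset": 1_700_000_120}

print(seconds_until_reset(exhausted, now=1_700_000_000))  # 120.0
print(seconds_until_reset(available, now=1_700_000_000))  # 0.0
```

With a live response this would be used as `time.sleep(seconds_until_reset(response.json()['rate'], time.time()))`.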

2. Authentication: requests without credentials fail with a 401 error. The fix is to send an API token, read here from Colab secrets, in the `Authorization` header.

In [None]:
url = 'https://api.github.com/user'
requests.get(url).json()

{'message': 'Requires authentication',
 'documentation_url': 'https://docs.github.com/rest/users/users#get-the-authenticated-user',
 'status': '401'}

In [None]:
from google.colab import userdata

API_TOKEN = userdata.get('API_TOKEN')

headers = {
    'Authorization': f'Bearer {API_TOKEN}'
}

url = 'https://api.github.com/user'
response = requests.get(url, headers=headers)
response.json()

SecretNotFoundError: Secret API_TOKEN does not exist.
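
The `SecretNotFoundError` above means that no secret named `API_TOKEN` has been added to the Colab notebook yet. Outside Colab, a common alternative is to read the token from an environment variable. The sketch below assumes an environment variable also called `API_TOKEN`; the helper `build_auth_headers` and the placeholder token are hypothetical.

```python
import os
from typing import Dict, Optional


def build_auth_headers(token: Optional[str]) -> Dict[str, str]:
    """Build the Authorization header, failing early if the token is missing."""
    if not token:
        raise ValueError("No API token provided; set it before calling the API.")
    return {"Authorization": f"Bearer {token}"}


# Assumption: the token is stored in an environment variable named API_TOKEN.
# The fallback value is a placeholder so the example runs; it is not a valid token.
token = os.environ.get("API_TOKEN") or "placeholder-token"
headers = build_auth_headers(token)
print(sorted(headers))  # ['Authorization']
```

The resulting `headers` dictionary is passed to `requests.get(url, headers=headers)` exactly as in the cell above.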

3. Pagination

In [None]:
url = 'https://api.github.com/repos/DataTalksClub/data-engineering-zoomcamp/events'

response = requests.get(url)
response

<Response [200]>

In [None]:
# Get link of next page (2)

response.links['next']['url']

'https://api.github.com/repositories/419661684/events?page=2'

In [None]:
# while Loop for next page or break if no next page

url = 'https://api.github.com/repos/DataTalksClub/data-engineering-zoomcamp/events'

while True:
  response = requests.get(url)
  data = response.json()
  print(len(data))

  if 'next' not in response.links:
    break

  url = response.links['next']['url']

30
30
30
30
30
30
30
30
30
26


* Instead of loading all events into memory at once, it is better to fetch them page by page. A generator yields one page at a time, so only the current page is held in memory.

In [None]:
# get events data in chunks

import requests

def events_getter():
  url = 'https://api.github.com/repos/DataTalksClub/data-engineering-zoomcamp/events'

  while True:
    response = requests.get(url)
    data = response.json()
    yield data

    if 'next' not in response.links:
      break

    url = response.links['next']['url']

In [None]:
events_pages = events_getter()

for event_page in events_pages:
  print(event_page)

[{'id': '48993156691', 'type': 'WatchEvent', 'actor': {'id': 129647123, 'login': 'Abdullakhan110100100', 'display_login': 'Abdullakhan110100100', 'gravatar_id': '', 'url': 'https://api.github.com/users/Abdullakhan110100100', 'avatar_url': 'https://avatars.githubusercontent.com/u/129647123?'}, 'repo': {'id': 419661684, 'name': 'DataTalksClub/data-engineering-zoomcamp', 'url': 'https://api.github.com/repos/DataTalksClub/data-engineering-zoomcamp'}, 'payload': {'action': 'started'}, 'public': True, 'created_at': '2025-04-23T21:19:12Z', 'org': {'id': 72699292, 'login': 'DataTalksClub', 'gravatar_id': '', 'url': 'https://api.github.com/orgs/DataTalksClub', 'avatar_url': 'https://avatars.githubusercontent.com/u/72699292?'}}, {'id': '48992523734', 'type': 'WatchEvent', 'actor': {'id': 204797752, 'login': 'rodrigues39', 'display_login': 'rodrigues39', 'gravatar_id': '', 'url': 'https://api.github.com/users/rodrigues39', 'avatar_url': 'https://avatars.githubusercontent.com/u/204797752?'}, 'repo
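
When downstream code works on single events rather than pages, the pages can be flattened lazily with a second generator. This is a small sketch; `iter_events` is a hypothetical helper and `fake_pages` stands in for the output of `events_getter()`.

```python
from typing import Iterable, Iterator, List


def iter_events(pages: Iterable[List[dict]]) -> Iterator[dict]:
    """Yield events one by one from an iterable of pages."""
    for page in pages:
        yield from page


# Hypothetical pages, for illustration only
fake_pages = [[{"id": "1"}, {"id": "2"}], [{"id": "3"}]]

ids = [event["id"] for event in iter_events(fake_pages)]
print(ids)  # ['1', '2', '3']
```

With the real generator this becomes `for event in iter_events(events_getter()): ...`, and the next page is still requested only when the current one is exhausted.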

## ***4. Normalizing Data***

Normalizing converts the retrieved nested JSON into a structured, tabular form. All **nested structures** (such as dictionaries and lists) need to be flattened so that the data is easier to store and query in a database or a dataframe.

In [None]:
# take the first event from the last page that was fetched
event = event_page[0]
event

{'id': '48615993501',
 'type': 'ForkEvent',
 'actor': {'id': 31289200,
  'login': 'dinuvdm',
  'display_login': 'dinuvdm',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/dinuvdm',
  'avatar_url': 'https://avatars.githubusercontent.com/u/31289200?'},
 'repo': {'id': 419661684,
  'name': 'DataTalksClub/data-engineering-zoomcamp',
  'url': 'https://api.github.com/repos/DataTalksClub/data-engineering-zoomcamp'},
 'payload': {'forkee': {'id': 965208048,
   'node_id': 'R_kgDOOYfn8A',
   'name': 'data-engineering-zoomcamp',
   'full_name': 'dinuvdm/data-engineering-zoomcamp',
   'private': False,
   'owner': {'login': 'dinuvdm',
    'id': 31289200,
    'node_id': 'MDQ6VXNlcjMxMjg5MjAw',
    'avatar_url': 'https://avatars.githubusercontent.com/u/31289200?v=4',
    'gravatar_id': '',
    'url': 'https://api.github.com/users/dinuvdm',
    'html_url': 'https://github.com/dinuvdm',
    'followers_url': 'https://api.github.com/users/dinuvdm/followers',
    'following_url': 'https://api

In [None]:
# Normalize some keys from the events

def process_events(event):
  result = {}

  result['id'] = event['id']
  result['type'] = event['type']
  result['public'] = event['public']
  result['created_at'] = event['created_at']

  result['actor__id'] = event['actor']['id']  # double underscore for nested dictionaries
  result['actor__login'] = event['actor']['login']

  return result

In [None]:
process_events(event)

{'id': '48615993501',
 'type': 'ForkEvent',
 'public': True,
 'created_at': '2025-04-12T16:46:33Z',
 'actor__id': 31289200,
 'actor__login': 'dinuvdm'}
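
Picking keys by hand works for a few fields. For arbitrary nesting, the same double-underscore convention can be applied recursively. The sketch below is one possible approach, not part of the original notebook: `flatten` is a hypothetical helper, it leaves lists untouched, and the sample record is a shortened version of the event shown above.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], parent_key: str = "", sep: str = "__") -> Dict[str, Any]:
    """Recursively flatten nested dictionaries, joining keys with `sep`."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value  # lists and scalars are kept as they are
    return flat


# Shortened sample event, for illustration only
sample_event = {
    "id": "48615993501",
    "type": "ForkEvent",
    "actor": {"id": 31289200, "login": "dinuvdm"},
    "payload": {"forkee": {"id": 965208048}},
}

flat_event = flatten(sample_event)
print(list(flat_event))
# ['id', 'type', 'actor__id', 'actor__login', 'payload__forkee__id']
```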

In [None]:
# iterate and apply above function to all events

processed_events = []

for event in event_page:
  processed_event = process_events(event)
  processed_events.append(processed_event)

processed_events

[{'id': '48615993501',
  'type': 'ForkEvent',
  'public': True,
  'created_at': '2025-04-12T16:46:33Z',
  'actor__id': 31289200,
  'actor__login': 'dinuvdm'},
 {'id': '48615597375',
  'type': 'ForkEvent',
  'public': True,
  'created_at': '2025-04-12T16:09:52Z',
  'actor__id': 150928706,
  'actor__login': 'renad-lab'},
 {'id': '48615497440',
  'type': 'WatchEvent',
  'public': True,
  'created_at': '2025-04-12T16:01:04Z',
  'actor__id': 135336193,
  'actor__login': 'fabriziofranchitti'},
 {'id': '48615147964',
  'type': 'WatchEvent',
  'public': True,
  'created_at': '2025-04-12T15:28:33Z',
  'actor__id': 70435178,
  'actor__login': 'Massinho91'},
 {'id': '48614660649',
  'type': 'WatchEvent',
  'public': True,
  'created_at': '2025-04-12T14:45:47Z',
  'actor__id': 74074514,
  'actor__login': 'omarkhaled122'},
 {'id': '48614602772',
  'type': 'WatchEvent',
  'public': True,
  'created_at': '2025-04-12T14:40:29Z',
  'actor__id': 88022389,
  'actor__login': 'ajit4518'},
 {'id': '48614453

In [None]:
# convert created_at from an ISO 8601 string to a Unix timestamp (seconds since the epoch)

from datetime import datetime

def process_events(event):
  result = {}

  result['id'] = event['id']
  result['type'] = event['type']
  result['public'] = event['public']

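  # replace the trailing 'Z' with an explicit UTC offset: fromisoformat only accepts 'Z' on Python 3.11+
  parsed_timestamp = datetime.fromisoformat(event['created_at'].replace('Z', '+00:00'))
  result['created_at'] = parsed_timestamp.timestamp()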

  result['actor__id'] = event['actor']['id']  # double underscore for nested dictionaries
  result['actor__login'] = event['actor']['login']

  return result

process_events(event)

{'id': '48615993501',
 'type': 'ForkEvent',
 'public': True,
 'created_at': 1744476393.0,
 'actor__id': 31289200,
 'actor__login': 'dinuvdm'}
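
Once the events are flat, they map directly onto table columns. The sketch below stores a few normalized rows in an in-memory SQLite database using the standard library. The table name `events`, the column types, and the choice of SQLite are assumptions made for illustration; the rows are taken from the outputs above, with `created_at` converted to Unix timestamps.

```python
import sqlite3

# Normalized rows in the shape produced by process_events
rows = [
    {"id": "48615993501", "type": "ForkEvent", "public": True,
     "created_at": 1744476393.0, "actor__id": 31289200, "actor__login": "dinuvdm"},
    {"id": "48615597375", "type": "ForkEvent", "public": True,
     "created_at": 1744474192.0, "actor__id": 150928706, "actor__login": "renad-lab"},
    {"id": "48615497440", "type": "WatchEvent", "public": True,
     "created_at": 1744473664.0, "actor__id": 135336193, "actor__login": "fabriziofranchitti"},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        id TEXT PRIMARY KEY,
        type TEXT,
        public INTEGER,
        created_at REAL,
        actor__id INTEGER,
        actor__login TEXT
    )
    """
)

# Named placeholders match the keys of each flattened event
conn.executemany(
    "INSERT INTO events VALUES (:id, :type, :public, :created_at, :actor__id, :actor__login)",
    rows,
)
conn.commit()

fork_count = conn.execute(
    "SELECT COUNT(*) FROM events WHERE type = 'ForkEvent'"
).fetchone()[0]
print(fork_count)  # 2
```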