## WHAT IS ETL?

ETL (Extract, Transform, Load) is a data integration process for collecting, cleaning, and storing information. It extracts raw data from a source, transforms it into a structured, usable format, and loads it into a destination where it can be analyzed. A well-built ETL pipeline helps keep data consistent and accessible for business analysis across industries.






# Data Extraction:
Using the GitHub API as the source: in the extraction phase, we call the GitHub API to retrieve data. An API offers structured access to information; here we use it to list the repositories owned by a specified user.

In [1]:
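# Extracting the data
import requests

def extract_data(username):
    """Fetch the repositories of a GitHub user as parsed JSON."""
    url = f'https://api.github.com/users/{username}/repos'
    # A timeout keeps the call from hanging indefinitely on a stalled connection
    response = requests.get(url, timeout=10)
    # Raise an exception on 4xx/5xx responses instead of returning an error body
    response.raise_for_status()
    return response.json()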
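
The GitHub REST API returns list results in pages (30 items per page by default, up to 100 with the `per_page` parameter), so a single request may not return every repository. The following is a minimal sketch of page-by-page extraction, not part of the original notebook. The helper names `fetch_all_pages` and `make_repo_fetcher` and the `max_pages` safety cap are assumptions made for illustration; `http_get` is expected to behave like `requests.get`.

```python
def fetch_all_pages(fetch_page, per_page=100, max_pages=50):
    """Collect items page by page until a short (or empty) page is returned.

    fetch_page(page, per_page) must return a list. max_pages is a
    hypothetical safety cap so a misbehaving source cannot loop forever.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, per_page)
        items.extend(batch)
        # A page shorter than per_page means there is nothing left to fetch
        if len(batch) < per_page:
            break
    return items


def make_repo_fetcher(username, http_get):
    """Build a fetch_page function for one user's repositories.

    http_get is assumed to have the same interface as requests.get.
    """
    url = f'https://api.github.com/users/{username}/repos'

    def fetch_page(page, per_page):
        response = http_get(
            url,
            params={'per_page': per_page, 'page': page},
            timeout=10,
        )
        response.raise_for_status()
        return response.json()

    return fetch_page
```

In the notebook this could be called as `fetch_all_pages(make_repo_fetcher('matter-labs', requests.get))`. Passing the HTTP function in as an argument keeps the paging logic testable without network access.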


# Data Transformation:
Selecting the relevant fields: in the transformation phase, we keep only the attributes the project needs (repository name, description, and URL) and drop the rest of the API response. This is the core idea of transformation: converting raw data into a more organized and useful format.

In [2]:
# Transforming the data
def transform_data(raw_data):
    # Keep only the name, description, and URL of each repository
    return [
        {'name': repo['name'], 'description': repo['description'], 'url': repo['html_url']}
        for repo in raw_data
    ]
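
Repositories without a description come back from the API with a `null` description, which becomes `None` in Python. The sketch below shows one possible way to normalize such values and to skip records that lack a name or URL. The names `clean_repo` and `transform_data_safe` are hypothetical, and the choice to replace a missing description with an empty string is an assumption, not something the original notebook does.

```python
def clean_repo(repo):
    """Return a cleaned record, or None if a required field is missing."""
    name = repo.get('name')
    url = repo.get('html_url')
    if not name or not url:
        return None
    # Assumption: a missing description is stored as an empty string
    description = (repo.get('description') or '').strip()
    return {'name': name, 'description': description, 'url': url}


def transform_data_safe(raw_data):
    """Clean every record and drop the ones that could not be cleaned."""
    cleaned = (clean_repo(repo) for repo in raw_data)
    return [record for record in cleaned if record is not None]
```

Whether incomplete records should be dropped, logged, or cause the run to fail depends on the project; dropping them silently is only the simplest option.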


# Data Loading:
Using a database as the target: for the loading phase, we store the data in an SQLite database. Storage should match the project's requirements, and SQLite is lightweight and needs no separate server, which makes it a good fit for a small project like this one.

In [3]:
# Loading the data
import sqlite3

def load_data(data, db_path='github_repos.db'):
    connection = sqlite3.connect(db_path)
    try:
        cursor = connection.cursor()

        cursor.execute('''
            CREATE TABLE IF NOT EXISTS repositories (
                id INTEGER PRIMARY KEY,
                name TEXT,
                description TEXT,
                url TEXT
            )
        ''')

        # Each call appends rows; running the pipeline twice stores duplicates
        rows = [(repo['name'], repo['description'], repo['url']) for repo in data]
        cursor.executemany(
            'INSERT INTO repositories (name, description, url) VALUES (?, ?, ?)',
            rows,
        )

        connection.commit()
    finally:
        # Close the connection even if the insert fails
        connection.close()
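
Because `load_data` only inserts, loading the same repositories into the same database file a second time stores each of them twice. One possible way to make the load repeatable is sketched below: it marks `url` as `UNIQUE` and uses SQLite's `ON CONFLICT ... DO UPDATE` clause, which requires SQLite 3.24 or newer. The function name `load_data_idempotent` and the choice of `url` as the unique key are assumptions made for this example.

```python
import sqlite3


def load_data_idempotent(connection, data):
    """Insert records, updating the existing row when the URL is already stored."""
    connection.execute('''
        CREATE TABLE IF NOT EXISTS repositories (
            id INTEGER PRIMARY KEY,
            name TEXT,
            description TEXT,
            url TEXT UNIQUE
        )
    ''')
    # Named placeholders read the values from each record dictionary
    connection.executemany('''
        INSERT INTO repositories (name, description, url)
        VALUES (:name, :description, :url)
        ON CONFLICT(url) DO UPDATE SET
            name = excluded.name,
            description = excluded.description
    ''', data)
    connection.commit()
```

The function takes an open connection rather than a path so the caller controls when it is closed. It needs a fresh database file: `CREATE TABLE IF NOT EXISTS` does not add the `UNIQUE` constraint to a `repositories` table that already exists without it.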


Now let's run the ETL pipeline against GitHub.

In [5]:
# Set your GitHub username
github_username = 'matter-labs'

# Extract
raw_data = extract_data(github_username)

# Transform
transformed_data = transform_data(raw_data)

# Load
load_data(transformed_data)

print("ETL process completed successfully.")


ETL process completed successfully.


Let's check whether the pipeline worked by querying the database.

In [6]:
# Install sql magic extension
!pip install -q ipython-sql

# Load sql extension
%load_ext sql

# Connect to the SQLite database
%sql sqlite:///github_repos.db

# Query the database
%sql SELECT * FROM repositories;


 * sqlite:///github_repos.db
Done.


id,name,description,url
1,hotel-booking-analysis,,https://github.com/abhi14062000/hotel-booking-analysis
2,machine-learning---classification,,https://github.com/abhi14062000/machine-learning---classification
3,machine-learning---regression,,https://github.com/abhi14062000/machine-learning---regression
4,NLP_Txt-to-SPeech,,https://github.com/abhi14062000/NLP_Txt-to-SPeech
5,unsupervised-machine-learning,,https://github.com/abhi14062000/unsupervised-machine-learning
6,-test-git-poap-integration-2,,https://github.com/matter-labs/-test-git-poap-integration-2
7,.github,zkSync Frontend Team workflow configuration,https://github.com/matter-labs/.github
8,aa-signature-checker,,https://github.com/matter-labs/aa-signature-checker
9,action-hosting-deploy,Automatically deploy shareable previews for your Firebase Hosting sites,https://github.com/matter-labs/action-hosting-deploy
10,ansible-en-role,Ansible role for zkSync Era External Node,https://github.com/matter-labs/ansible-en-role


# Conclusion:

This GitHub API-based ETL project is a small working pipeline that keeps extraction, transformation, and loading in separate functions, which gives it a foundation that can be scaled, maintained, and adapted.

Real-world use depends on considerations this notebook does not cover, such as error handling, security, and user experience.

As a starting point, the project can be expanded and improved to meet evolving requirements in data extraction, transformation, and loading.
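
As one illustration of the error handling and security points above, the sketch below reads an access token from the environment instead of hard-coding it and reports which stage of the pipeline failed. The environment variable name `GITHUB_TOKEN`, the `ETLStageError` class, and the helpers `build_github_headers`, `run_stage`, and `run_etl` are all assumptions introduced for this example, not part of the original notebook.

```python
import os


def build_github_headers(token=None):
    """Build request headers, adding a bearer token when one is available."""
    headers = {'Accept': 'application/vnd.github+json'}
    # Assumption: the token is stored in the GITHUB_TOKEN environment variable
    token = token or os.environ.get('GITHUB_TOKEN')
    if token:
        headers['Authorization'] = f'Bearer {token}'
    return headers


class ETLStageError(RuntimeError):
    """Raised when one stage of the pipeline fails."""


def run_stage(name, func, *args):
    """Run one stage and re-raise any failure with the stage name attached."""
    try:
        return func(*args)
    except Exception as exc:
        raise ETLStageError(f'{name} stage failed: {exc}') from exc


def run_etl(username, extract, transform, load):
    """Run the three stages in order and return the number of loaded records."""
    raw_data = run_stage('extract', extract, username)
    records = run_stage('transform', transform, raw_data)
    run_stage('load', load, records)
    return len(records)
```

With the notebook's functions this could be called as `run_etl('matter-labs', extract_data, transform_data, load_data)`. The headers would still need to be passed to the HTTP request, for example through the `headers` argument of `requests.get`.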