# **Course Outline**

## **Phase 1: Data Handling & Storage**
<hr></hr>

### **Working with Data Formats**
- CSV, JSON, Parquet, Avro
- Handling large files efficiently
- Data serialization and deserialization

### **Relational Databases & SQL**
- SQL basics (CRUD operations, joins, indexing)
- Advanced SQL (window functions, CTEs, query optimization)
- PostgreSQL and Oracle ADW hands-on practice

### **NoSQL Databases**
- Introduction to NoSQL (when to use it)
- MongoDB (document-based storage)
- Redis (key-value store)
- Cassandra (columnar storage)


## **Phase 2: Data Ingestion & Processing**
<hr></hr>

### **Batch Data Processing**
- Pandas and Polars for data manipulation
- PySpark basics (RDDs, DataFrames, Transformations & Actions)
- Writing ETL pipelines with PySpark

### **Real-time Data Processing**
- Introduction to streaming vs batch processing
- Kafka for real-time messaging
- Apache Flink and Spark Streaming basics

### **Web Scraping & APIs**
- Scraping with BeautifulSoup & Selenium
- REST & GraphQL APIs
- Handling rate limits, authentication, and pagination

## **Phase 3: Workflow Orchestration & Automation**
<hr></hr>

### **Airflow for Workflow Orchestration**
- DAGs, Operators, Tasks
- Scheduling & monitoring workflows
- Integrating Airflow with databases and cloud storage

### **Data Pipelines & CI/CD**
- Building modular data pipelines
- Testing data pipelines
- CI/CD for data engineering (GitHub Actions, Azure Pipelines)

## **Phase 4: Cloud & Infrastructure**
<hr></hr>

### **Cloud Data Engineering**

- Storing data in cloud storage (Azure Blob, AWS S3, GCP Storage)
- Managed data services (BigQuery, Snowflake, Redshift)
- Hands-on with Oracle ADW in OCI

### **Infrastructure as Code (IaC)**
- Terraform basics for cloud provisioning
- Automating deployments in Azure/GCP/AWS

## **Phase 5: Data Governance & Analytics**
<hr></hr>

### **Data Governance & Security**
- Data privacy laws (GDPR, CCPA)
- Role-based access control (RBAC)
- Data lineage & cataloging tools (Apache Atlas, DataHub)

### **Data Warehousing & BI**
- Data modeling (star schema, snowflake schema)
- Building dashboards with Plotly & Power BI
- Connecting data warehouses to BI tools

## **Final Project**
<hr></hr>

**Building a complete data pipeline**

- Extracting stock and financial data (WSJ & PSE)
- Storing in Oracle ADW
- Processing with PySpark
- Visualizing trends using Plotly or Power BI
- Automating workflows with Airflow

# **Phase 1: Data Handling & Storage**
<hr></hr>

## **Working with Data Formats**
<hr></hr>

### **Overview of Data Formats**

<div style="float: left;">

| Format | Structure | Pros | Cons |
|------|---------------|--------------|--------------|
| **CSV** | Row-based | Easy to read, widely supported | Large file sizes, inefficient for large data |
| **JSON** | Semi-structured (key-value) | Flexible, widely used in APIs | Larger size, needs parsing |
| **Parquet** | Columnar | Efficient storage, great for analytics | Not human-readable |
| **Avro** | Binary format | Schema evolution, fast | Requires special tools to read |
</div>
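
To make the size trade-off in the table concrete, the sketch below writes the same rows as CSV and as JSON using only the standard library and compares the byte counts. The sample rows are made up for illustration, and real ratios depend on the data.

```python
import csv
import io
import json

# Hypothetical sample rows (not taken from the course data files)
rows = [
    {"date": f"2024-01-{day:02d}", "close": 2.0 + day / 100, "volume": 1_000_000 * day}
    for day in range(1, 31)
]

# CSV: column names are written once, in the header row
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["date", "close", "volume"])
writer.writeheader()
writer.writerows(rows)
csv_size = len(csv_buffer.getvalue().encode("utf-8"))

# JSON: every record repeats its keys, which adds overhead
json_size = len(json.dumps(rows).encode("utf-8"))

print(f"CSV:  {csv_size} bytes")
print(f"JSON: {json_size} bytes")
```

The JSON version is larger mainly because each record carries its own field names, which is the "larger size" cost noted in the table.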

### **Handling CSV Files Efficiently**

In [16]:
import pandas as pd

df = pd.read_csv("../data/MEG.csv")
print(df)

            Date   Open   High    Low   Close        Volume
0       10/27/23  1.960  1.970  1.940   1.960  6.800000e+06
1       10/26/23  1.970  1.990  1.950   1.950  1.296500e+07
2       10/25/23  2.000  2.030  2.000   2.010  6.178000e+06
3       10/24/23  2.000  2.020  1.990   2.000  7.524000e+06
4       10/23/23  2.010  2.020  2.000   2.000  8.299000e+06
...          ...    ...    ...    ...     ...           ...
6923  01/09/1995  2.314  2.314  2.288   2.314  1.387800e+07
6924  01/06/1995  2.314  2.341  2.314   2.314  2.407262e+07
6925  01/05/1995  2.288  2.314  2.288   2.288  3.185465e+08
6926  01/04/1995  2.314  2.314  2.288   2.314  2.231886e+07
6927  01/03/1995  2.341  2.367  2.314   2.341  3.149516e+08

[6928 rows x 6 columns]


In [18]:
# For large files

In [21]:
chunks = pd.read_csv("../data/MEG.csv", chunksize=1000)  # Returns an iterator of DataFrames, not a single DataFrame
for chunk in chunks:
    print(chunk.head())  # Process each chunk separately

       Date   Open   High   Low   Close    Volume
0  10/27/23   1.96   1.97  1.94    1.96   6800000
1  10/26/23   1.97   1.99  1.95    1.95  12965000
2  10/25/23   2.00   2.03  2.00    2.01   6178000
3  10/24/23   2.00   2.02  1.99    2.00   7524000
4  10/23/23   2.01   2.02  2.00    2.00   8299000
          Date   Open   High   Low   Close    Volume
1000  09/27/19   4.79   4.81  4.46    4.46  92220000
1001  09/26/19   4.74   4.87  4.70    4.79  17581000
1002  09/25/19   4.97   4.98  4.70    4.72  26913000
1003  09/24/19   4.85   4.98  4.84    4.97  16350000
1004  09/23/19   4.95   5.00  4.80    4.81  17000000
          Date   Open   High   Low   Close    Volume
2000  08/19/15   4.30   4.38  4.20    4.35  56249000
2001  08/18/15   4.45   4.47  4.27    4.28  30480000
2002  08/17/15   4.62   4.66  4.39    4.49  38611000
2003  08/14/15   4.56   4.61  4.56    4.60  30808000
2004  08/13/15   4.55   4.62  4.50    4.55  31808000
            Date   Open   High   Low   Close    Volume
3000  07/

**`chunksize`**

- When working with large CSV files, loading everything into memory at once may cause high RAM usage and slow performance.
- `chunksize` allows reading the file in smaller parts, reducing memory consumption.
- Instead of returning a full DataFrame, `pd.read_csv()` returns an **iterator** that yields one DataFrame chunk at a time.

**Common Use Cases for `chunksize`**

In [26]:
#Counting Rows in a Large File Efficiently
total_rows = 0
for chunk in pd.read_csv("../data/MEG.csv", chunksize=10000):
    total_rows += len(chunk)

print(f"Total Rows: {total_rows}")


Total Rows: 6928


In [38]:
# Filtering Data While Reading (Memory Efficient)

filtered_data = []

for chunk in pd.read_csv("../data/MEG.csv", chunksize=5000):
    # The header in this CSV has a leading space before "Volume", hence the key " Volume"
    filtered_chunk = chunk[chunk[" Volume"] > 10000000]
    filtered_data.append(filtered_chunk)

df_filtered = pd.concat(filtered_data)  # Merge all chunks together
df_filtered.shape

(4634, 6)
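
Aggregations that can be accumulated incrementally follow the same pattern. A minimal sketch, using a small in-memory CSV with a hypothetical `Close` column in place of a large file:

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a large file (made-up values)
csv_text = "Date,Close\n" + "\n".join(
    f"2024-01-{day:02d},{day}.0" for day in range(1, 11)
)

total = 0.0
count = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Keep only running totals in memory, never the full file
    total += chunk["Close"].sum()
    count += len(chunk)
    n_chunks += 1

mean_close = total / count
print(f"{n_chunks} chunks, mean close = {mean_close}")
```

Only the running sum and row count are held between iterations, so memory use stays bounded by the chunk size.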

### **Working with JSON Data**

- `json.load()` parses a JSON file into nested Python lists and dictionaries
- JSON is flexible, but as a verbose text format it can be inefficient for large-scale storage

In [7]:
import json

with open("../data/out.json") as f:
    data = json.load(f)

print(data)  # Prints the parsed document (here, a nested dictionary)


{'documentMetadata': {'pageCount': 1, 'mimeType': 'image/png'}, 'pages': [{'pageNumber': 1, 'dimensions': {'width': 940.0, 'height': 529.0, 'unit': 'PIXEL'}, 'detectedDocumentTypes': None, 'detectedLanguages': None, 'words': [{'text': 'May', 'confidence': 0.9234257, 'boundingPolygon': {'normalizedVertices': [{'x': 0.00851063829787234, 'y': 0.011342155009451797}, {'x': 0.0425531914893617, 'y': 0.011342155009451797}, {'x': 0.0425531914893617, 'y': 0.03780718336483932}, {'x': 0.00851063829787234, 'y': 0.03780718336483932}]}}, {'text': '3', 'confidence': 0.9234257, 'boundingPolygon': {'normalizedVertices': [{'x': 0.045744680851063826, 'y': 0.011342155009451797}, {'x': 0.05638297872340425, 'y': 0.011342155009451797}, {'x': 0.05638297872340425, 'y': 0.03780718336483932}, {'x': 0.045744680851063826, 'y': 0.03780718336483932}]}}, {'text': 'May', 'confidence': 0.9384263, 'boundingPolygon': {'normalizedVertices': [{'x': 0.18829787234042553, 'y': 0.011342155009451797}, {'x': 0.22127659574468084, 

In [51]:
import pandas as pd

# Load JSON file
file_path = "../data/expenses.json"
data = pd.read_json(file_path)
data

|  | id | date | category | amount | currency | description | payment_method |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2025-03-04 | Food | 12.5 | USD | Lunch at a restaurant | Credit Card |
| 1 | 2 | 2025-03-03 | Transport | 5.0 | USD | Bus fare | Cash |
| 2 | 3 | 2025-03-02 | Groceries | 45.3 | USD | Weekly grocery shopping | Debit Card |
| 3 | 4 | 2025-03-01 | Entertainment | 15.99 | USD | Movie ticket | Credit Card |
| 4 | 5 | 2025-02-28 | Utilities | 80.75 | USD | Electricity bill | Bank Transfer |


In [60]:
# If the JSON is nested, flatten it manually into rows
import json
import pandas as pd

# Load JSON from file
with open("../data/out.json", "r") as file:
    data = json.load(file)

# Extract words and confidence per page
rows = []
for page_idx, page in enumerate(data.get("pages", []), start=1):
    for word in page.get("words", []):
        rows.append({"page": page_idx, "text": word["text"], "confidence": float(word["confidence"])})

# Convert to DataFrame
df = pd.DataFrame(rows)
df

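|  | page | text | confidence |
|---|---|---|---|
| 0 | 1 | May | 0.923426 |
| 1 | 1 | 3 | 0.923426 |
| 2 | 1 | May | 0.938426 |
| 3 | 1 | 3 | 0.938426 |
| 4 | 1 | Olympic | 0.955067 |
| ... | ... | ... | ... |
| 227 | 1 | 22 | 0.999598 |
| 228 | 1 | Lazada | 0.986753 |
| 229 | 1 | Ph | 0.986753 |
| 230 | 1 | Makati | 0.999196 |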


In [66]:
import json
import pandas as pd

# Load JSON from file
with open("../data/out.json", "r") as file:
    data = json.load(file)

# Extract words, confidence, and normalizedVertices
rows = []
for page_idx, page in enumerate(data.get("pages", []), start=1):
    for word in page.get("words", []):
        text = word.get("text", "")
        confidence = float(word.get("confidence", 0))
        
        # Extract bounding polygon coordinates
        bounding_polygon = word.get("boundingPolygon", {}).get("normalizedVertices", [])
        
        for vertex in bounding_polygon:
            rows.append({
                "page": page_idx,
                "text": text,
                "confidence": confidence,
                "x": vertex.get("x", None),
                "y": vertex.get("y", None)
            })

# Convert to DataFrame
df = pd.DataFrame(rows)
df


|  | page | text | confidence | x | y |
|---|---|---|---|---|---|
| 0 | 1 | May | 0.923426 | 0.008511 | 0.011342 |
| 1 | 1 | May | 0.923426 | 0.042553 | 0.011342 |
| 2 | 1 | May | 0.923426 | 0.042553 | 0.037807 |
| 3 | 1 | May | 0.923426 | 0.008511 | 0.037807 |
| 4 | 1 | 3 | 0.923426 | 0.045745 | 0.011342 |
| ... | ... | ... | ... | ... | ... |
| 923 | 1 | Makati | 0.999196 | 0.510638 | 0.947070 |
| 924 | 1 | 1130.34 | 0.997102 | 0.891489 | 0.926276 |
| 925 | 1 | 1130.34 | 0.997102 | 0.960638 | 0.926276 |
| 926 | 1 | 1130.34 | 0.997102 | 0.960638 | 0.948960 |
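
For nested structures like this, `pd.json_normalize` can often replace the hand-written loops. The sketch below uses a small inline dictionary shaped like the `pages`/`words` structure above; the values are abbreviated, and `pageNumber` is carried along as metadata for each word.

```python
import pandas as pd

# Abbreviated stand-in for the structure loaded from out.json
data = {
    "pages": [
        {
            "pageNumber": 1,
            "words": [
                {"text": "May", "confidence": 0.9234257},
                {"text": "3", "confidence": 0.9234257},
            ],
        }
    ]
}

# One row per word; pageNumber is repeated for each word on the page
df_words = pd.json_normalize(data["pages"], record_path="words", meta=["pageNumber"])
print(df_words)
```

The manual loop remains the more flexible option when each record needs custom handling, such as expanding every vertex of the bounding polygon into its own row.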


In [77]:
import requests

url = "https://pseops.azurewebsites.net/api/GetDividendInformation?code=1h/6bs7u4tzxVbSWEpCmqjDMda8tgcD7Pt7tjiT6WOX/YjNMpIbBsQ==&ticker=LTG"

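# Send GET request to the API (the timeout keeps the cell from hanging indefinitely)
response = requests.get(url, timeout=30)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the JSON response into a DataFrame
    data = response.json()
    df = pd.DataFrame(data)
else:
    # Fall back to an empty DataFrame so the final line does not raise a NameError
    df = pd.DataFrame()
    print(f"Failed to retrieve data. Status code: {response.status_code}")

df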

|  | ID | Ticker | CompanyName | TypeofSecurity | TypeofDividend | DividendRate | ExDividendDate | RecordDate | PaymentDate | CircularNumber |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1959 | LTG | LT Group, Inc. | COMMON | Cash | Php 0.15 | Mar 08, 2024 | Mar 11, 2024 | Mar 22, 2024 | C00942-2024 |
| 1 | 1958 | LTG | LT Group, Inc. | COMMON | Cash | Php 0.15 | Mar 08, 2024 | Mar 11, 2024 | Mar 22, 2024 | C00943-2024 |
| 2 | 877 | LTG | LT Group, Inc. | COMMON | Cash | Php 0.15 | Mar 01, 2023 | Mar 6, 2023 | Mar 17, 2023 | C01205-2023 |
| 3 | 876 | LTG | LT Group, Inc. | COMMON | Cash | Php 0.15 | Mar 01, 2023 | Mar 6, 2023 | Mar 17, 2023 | C01206-2023 |
| 4 | 227 | LTG | LT Group, Inc. | COMMON | Cash | Php 0.15 | Mar 25, 2022 | Mar 30, 2022 | Apr 12, 2022 | C01718-2022 |
| 5 | 226 | LTG | LT Group, Inc. | COMMON | Cash | Php 0.15 | Mar 25, 2022 | Mar 30, 2022 | Apr 12, 2022 | C01719-2022 |
| 6 | 2572 | LTG | LT Group, Inc. | COMMON | Cash | Php 0.30 | May 31, 2024 | Jun 3, 2024 | Jun 14, 2024 | C03313-2024 |
| 7 | 427 | LTG | LT Group, Inc. | COMMON | Cash | Php 0.30 | May 26, 2022 | May 31, 2022 | Jun 15, 2022 | C03654-2022 |
| 8 | 1105 | LTG | LT Group, Inc. | COMMON | Cash | Php 0.30 | May 25, 2023 | May 30, 2023 | Jun 13, 2023 | C03866-2023 |
| 9 | 225 | LTG | LT Group, Inc. | COMMON | Cash | Php 0.24 | Jun 22, 2021 | Jun 25, 2021 | Jul 9, 2021 | C04063-2021 |


### **Working with `parquet`**

In [3]:
import pandas as pd

# Sample data
data = {
    "id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "age": [25, 30, 35, 40, 45],
    "salary": [50000, 60000, 70000, 80000, 90000]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Save as Parquet
df.to_parquet("sample.parquet", engine="pyarrow", index=False)
print("Parquet file 'sample.parquet' created successfully!")


Parquet file 'sample.parquet' created successfully!


In [5]:
# Read the Parquet file
df_parquet = pd.read_parquet("sample.parquet")

# Display Data
print("Parquet File Data:")
print(df_parquet)


Parquet File Data:
   id     name  age  salary
0   1    Alice   25   50000
1   2      Bob   30   60000
2   3  Charlie   35   70000
3   4    David   40   80000
4   5      Eve   45   90000


### **Working with `avro`**

Avro allows you to evolve your schema over time without breaking existing data. This is useful in data engineering pipelines where requirements change over time.

**How Avro Supports Schema Evolution**

- **Schema is stored with the data:** Avro embeds the writer's schema in the file header, so consumers can decode the data correctly even after the schema changes.
- **Backward and forward compatibility:** Avro supports adding, removing, or modifying fields in a controlled way (for example, a newly added field needs a default value so that older data can still be read).
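
The idea behind compatible changes can be sketched in plain Python. This is a simplified model of schema resolution, not the Avro library's implementation: the `resolve` helper and the `department` field are hypothetical and exist only for illustration.

```python
# Simplified model of Avro schema resolution (not the real Avro algorithm):
# fields missing from old records are filled from the reader schema's defaults,
# and fields the reader schema does not declare are dropped.

reader_schema = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        # Field added later; the default keeps old records readable
        {"name": "department", "type": "string", "default": "unknown"},
    ],
}


def resolve(record, schema):
    """Project a record onto the reader schema, applying defaults."""
    resolved = {}
    for field in schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"No value or default for field '{name}'")
    return resolved


# Record written before 'department' existed, with a field the reader no longer uses
old_record = {"id": 1, "name": "John", "salary": 55000.0}
print(resolve(old_record, reader_schema))
```

With `fastavro`, a comparable effect is obtained by passing the newer schema as the `reader_schema` argument of `fastavro.reader`.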

In [15]:
!pip install fastavro

Collecting fastavro
  Downloading fastavro-1.10.0-cp312-cp312-win_amd64.whl.metadata (5.7 kB)
Downloading fastavro-1.10.0-cp312-cp312-win_amd64.whl (487 kB)
Installing collected packages: fastavro
Successfully installed fastavro-1.10.0




In [17]:
import fastavro

schema = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "salary", "type": "float"}
    ]
}

data = [{"id": 1, "name": "John", "salary": 55000.0}]

with open("data.avro", "wb") as f:
    fastavro.writer(f, schema, data)


In [19]:
with open("data.avro", "rb") as f:
    reader = fastavro.reader(f)
    for record in reader:
        print(record)


{'id': 1, 'name': 'John', 'salary': 55000.0}


| Feature           | Avro  | Parquet | JSON | CSV  |
|------------------|-------|---------|------|------|
| Schema Evolution | Yes | Limited | No | No |
| Storage Efficiency | Compact | Highly Compressed | Large | Large |
| Read Performance | Medium | Best for queries | Slow | Slow |
| Write Performance | Fast | Slower | Medium | Fast |
| Best Use Case | Streaming, Kafka, log storage | Analytical queries (Big Data) | Web APIs, readable storage | Simple tabular data |


### **Serialization & Deserialization**
<hr></hr>

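- Serialization converts in-memory data into a format that can be stored or transmitted; deserialization reconstructs the original data from that format
- Python supports several serialization formats, such as `pickle` (standard library) and MessagePack (via the third-party `msgpack` package)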

#### **Using Pickle (Python's Native Format)**

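Pickle can serialize almost any Python object, but the format is Python-specific (not cross-language compatible) and should only be loaded from trusted sources, because unpickling can execute arbitrary code.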

In [23]:
import pickle

data = {"name": "John", "age": 30}

with open("data.pkl", "wb") as f:
    pickle.dump(data, f)

with open("data.pkl", "rb") as f:
    loaded_data = pickle.load(f)

print(loaded_data)  # {'name': 'John', 'age': 30}


{'name': 'John', 'age': 30}
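
The trade-off is easiest to see with an object that contains Python-only types. A minimal sketch, with a made-up record, comparing Pickle against JSON:

```python
import json
import pickle
from datetime import date

# A Python object containing types that JSON has no representation for
record = {"name": "John", "hired": date(2024, 3, 8), "tags": {"etl", "sql"}}

# Pickle round-trips the object with its Python types intact
restored = pickle.loads(pickle.dumps(record))
print(restored == record)

# JSON rejects the same object because date and set are not JSON types
try:
    json.dumps(record)
    json_ok = True
except TypeError as exc:
    json_ok = False
    print("JSON error:", exc)
```

Pickle preserves the `date` and `set` values exactly, which is convenient inside Python but is also why other languages cannot read the result.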


#### **Using MessagePack (Faster & Cross-Language)**

In [28]:
import msgpack

data = {"name": "John", "age": 30}

with open("data.msgpack", "wb") as f:
    f.write(msgpack.packb(data))

with open("data.msgpack", "rb") as f:
    loaded_data = msgpack.unpackb(f.read())

print(loaded_data)  # {'name': 'John', 'age': 30}


{'name': 'John', 'age': 30}

## **Relational Databases & SQL**
<hr></hr>


```sql
-- Create a server-level login (for SQL authentication users)
CREATE LOGIN dataengineer WITH PASSWORD = 'd@taEngineer';

-- Switch to your database
USE ExpenseMonitoring;

-- Create a database user mapped to the login
CREATE USER dataengineer FOR LOGIN dataengineer;
```

In [1]:
!pip install pyodbc



In [3]:
import pyodbc

# Azure SQL Database connection details
server = "fjpicasoserver.database.windows.net"  # Replace with your server name
database = "ExpenseMonitoring"  # Replace with your database name
username = "dataengineer"  # Replace with your username
password = "D@taengine3r"  # Replace with your password
driver = "{ODBC Driver 17 for SQL Server}"  # Ensure this driver is installed

# Create connection string
conn_str = f"DRIVER={driver};SERVER={server};DATABASE={database};UID={username};PWD={password}"

try:
    # Establish connection
    conn = pyodbc.connect(conn_str)
    cursor = conn.cursor()
    
    # Test query
    cursor.execute("SELECT @@VERSION")
    row = cursor.fetchone()
    
    print("Connected to Azure SQL Database!")
    print("SQL Server Version:", row[0])

except Exception as e:
    print("Error:", e)

finally:
    if 'conn' in locals():
        conn.close()
        print("Connection closed.")


Connected to Azure SQL Database!
SQL Server Version: Microsoft SQL Azure (RTM) - 12.0.2000.8 
	Feb  9 2025 20:57:20 
	Copyright (C) 2024 Microsoft Corporation

Connection closed.
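
Hard-coded credentials in a notebook are easy to leak. A minimal sketch of building the same connection string from environment variables follows; the variable names (`AZURE_SQL_SERVER` and so on) and the helper `build_conn_str` are hypothetical, and the demo values are placeholders.

```python
import os


def build_conn_str(env=os.environ):
    """Build a pyodbc-style connection string from environment variables."""
    driver = env.get("AZURE_SQL_DRIVER", "{ODBC Driver 17 for SQL Server}")
    # A KeyError is raised if any required variable is missing
    return (
        f"DRIVER={driver};"
        f"SERVER={env['AZURE_SQL_SERVER']};"
        f"DATABASE={env['AZURE_SQL_DATABASE']};"
        f"UID={env['AZURE_SQL_USER']};"
        f"PWD={env['AZURE_SQL_PASSWORD']}"
    )


# Demo with placeholder values; in practice these are set outside the notebook
demo_env = {
    "AZURE_SQL_SERVER": "example.database.windows.net",
    "AZURE_SQL_DATABASE": "ExpenseMonitoring",
    "AZURE_SQL_USER": "dataengineer",
    "AZURE_SQL_PASSWORD": "not-a-real-password",
}
conn_str = build_conn_str(demo_env)

# Mask the password before printing
print(conn_str.replace(demo_env["AZURE_SQL_PASSWORD"], "***"))
```

The resulting string can be passed to `pyodbc.connect` in the same way as the hand-built `conn_str` above.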


In [17]:
import requests
import pandas as pd
import pyodbc

# API URL
url = "https://pseops.azurewebsites.net/api/GetDividendInformation?code=1h/6bs7u4tzxVbSWEpCmqjDMda8tgcD7Pt7tjiT6WOX/YjNMpIbBsQ==&ticker=LTG"

# Send GET request to the API
response = requests.get(url)

# Check if request was successful
if response.status_code == 200:
    data = response.json()
    df = pd.DataFrame(data)
else:
    # Stop execution if data fetch fails
    raise SystemExit(f"Failed to retrieve data. Status code: {response.status_code}")


# Azure SQL Database connection details
server = "fjpicasoserver.database.windows.net"  # Replace with your server name
database = "ExpenseMonitoring"  # Replace with your database name
username = "dataengineer"  # Replace with your username
password = "D@taengine3r"  # Replace with your password
driver = "{ODBC Driver 17 for SQL Server}"  # Ensure this driver is installed

# Establish connection
conn_str = f"DRIVER={driver};SERVER={server};DATABASE={database};UID={username};PWD={password}"
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()

# Define the table name
table_name = "DividendInformation"

# Drop and recreate the table on every run (modify schema as needed)
cursor.execute(f"""
DROP TABLE IF EXISTS dengr.{table_name};

CREATE TABLE dengr.{table_name} (
        ID INT,
        Ticker NVARCHAR(100),
        CompanyName NVARCHAR(100),
        TypeofSecurity NVARCHAR(100),
        TypeofDividend NVARCHAR(100),
        DividendRate NVARCHAR(100),
        ExDividendDate NVARCHAR(100),
        RecordDate NVARCHAR(100),
        PaymentDate NVARCHAR(100),
        CircularNumber NVARCHAR(100)
)
""")
conn.commit()

# Insert data into the table
insert_query = f"""
INSERT INTO dengr.{table_name} (ID,Ticker,CompanyName,TypeofSecurity,TypeofDividend,DividendRate,ExDividendDate,RecordDate,PaymentDate,CircularNumber)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
"""

# Convert DataFrame to list of tuples for batch insertion
records = df[[
    "ID",
    "Ticker",
    "CompanyName",
    "TypeofSecurity",
    "TypeofDividend",
    "DividendRate",
    "ExDividendDate",
    "RecordDate",
    "PaymentDate",
    "CircularNumber"
]].values.tolist()

# Execute batch insert
cursor.executemany(insert_query, records)
conn.commit()

print(f"Data successfully inserted into Azure SQL table: {table_name}")

# Close connection
cursor.close()
conn.close()

Data successfully inserted into Azure SQL table: DividendInformation
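
The batch insert pattern can be tried without a database server. The sketch below uses the standard library's `sqlite3` module, which accepts the same `?` placeholder style as `pyodbc`; the three-column table is a reduced, hypothetical version of the one above.

```python
import sqlite3

# In-memory database; sqlite3 uses the same "?" placeholder style as pyodbc
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE DividendInformation (ID INTEGER, Ticker TEXT, DividendRate TEXT)"
)

# Two rows taken from the API output shown earlier (subset of columns)
records = [
    (1959, "LTG", "Php 0.15"),
    (2572, "LTG", "Php 0.30"),
]

# Values are bound by the driver, never interpolated into the SQL string
cursor.executemany(
    "INSERT INTO DividendInformation (ID, Ticker, DividendRate) VALUES (?, ?, ?)",
    records,
)
conn.commit()

row_count = cursor.execute("SELECT COUNT(*) FROM DividendInformation").fetchone()[0]
print(f"Inserted {row_count} rows")
conn.close()
```

Binding values through placeholders, rather than formatting them into the query text, also protects against SQL injection when the data comes from an external API.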


In [21]:
!pip install oracledb

Collecting oracledb
  Downloading oracledb-3.0.0-cp312-cp312-win_amd64.whl.metadata (5.6 kB)
Downloading oracledb-3.0.0-cp312-cp312-win_amd64.whl (2.1 MB)
Installing collected packages: oracledb
Successfully installed oracledb-3.0.0


In [23]:
import oracledb

# Oracle ADW connection configuration
try:
    connection = oracledb.connect(
        config_dir="../adw/Wallet_dataengineeringDB",
        user="DEVUSER",
        password="D@taengine3r",
        dsn="dataengineeringdb_high",
        wallet_location="../adw/Wallet_dataengineeringDB",
        wallet_password="Admin123456789"
    )
    
    cursor = connection.cursor()
    
    # Run a simple test query
    cursor.execute("SELECT 'Connected to Oracle ADW' FROM dual")
    result = cursor.fetchone()
    
    print(result[0])  # Should print "Connected to Oracle ADW"
    
    # Close connection
    cursor.close()
    connection.close()

except oracledb.DatabaseError as e:
    print("Error while connecting to Oracle ADW:", e)


Connected to Oracle ADW


In [29]:
import requests
import pandas as pd
import oracledb

# API URL
url = "https://pseops.azurewebsites.net/api/GetDividendInformation?code=1h/6bs7u4tzxVbSWEpCmqjDMda8tgcD7Pt7tjiT6WOX/YjNMpIbBsQ==&ticker=LTG"

# Send GET request to the API
response = requests.get(url)

# Check if request was successful
if response.status_code == 200:
    data = response.json()
    df = pd.DataFrame(data)
else:
    # Stop execution if data fetch fails
    raise SystemExit(f"Failed to retrieve data. Status code: {response.status_code}")

# Oracle ADW connection configuration
connection = oracledb.connect(
    config_dir="../adw/Wallet_dataengineeringDB",
    user="DEVUSER",
    password="D@taengine3r",
    dsn="dataengineeringdb_high",
    wallet_location="../adw/Wallet_dataengineeringDB",
    wallet_password="Admin123456789"
)

cursor = connection.cursor()

# Define table name
table_name = "DIVIDEND_INFORMATION"

# Ensure the table exists (modify schema as needed)
cursor.execute(f"""
DECLARE
    table_count NUMBER;
BEGIN
    SELECT COUNT(*) INTO table_count FROM user_tables WHERE table_name = UPPER('{table_name}');
    IF table_count = 0 THEN
        EXECUTE IMMEDIATE '
            CREATE TABLE {table_name} (
                ID NUMBER,
                TICKER VARCHAR2(100),
                COMPANYNAME VARCHAR2(100),
                TYPEOFSECURITY VARCHAR2(100),
                TYPEOFDIVIDEND VARCHAR2(100),
                DIVIDENDRATE VARCHAR2(100),
                EXDIVIDENDDATE VARCHAR2(100),
                RECORDDATE VARCHAR2(100),
                PAYMENTDATE VARCHAR2(100),
                CIRCULARNUMBER VARCHAR2(100)


            )';
    END IF;
END;
""")

# Prepare SQL insert statement
insert_query = f"""
INSERT INTO {table_name} (ID,TICKER,COMPANYNAME,TYPEOFSECURITY,TYPEOFDIVIDEND,DIVIDENDRATE,EXDIVIDENDDATE,RECORDDATE,PAYMENTDATE,CIRCULARNUMBER
)
VALUES (:1, :2, :3, :4, :5, :6, :7, :8, :9, :10)
"""

# Convert DataFrame to list of tuples
records = df[[
    "ID",
    "Ticker",
    "CompanyName",
    "TypeofSecurity",
    "TypeofDividend",
    "DividendRate",
    "ExDividendDate",
    "RecordDate",
    "PaymentDate",
    "CircularNumber"
]].values.tolist()

# Execute batch insert
cursor.executemany(insert_query, records)
connection.commit()

print(f"Data successfully inserted into Oracle ADW table: {table_name}")

# Close connection
cursor.close()
connection.close()


Data successfully inserted into Oracle ADW table: DIVIDEND_INFORMATION

## **NoSQL Databases**
<hr></hr>


In [33]:
!pip install borneo oci

Collecting borneo
  Downloading borneo-5.4.2-py3-none-any.whl.metadata (16 kB)
Collecting oci
  Downloading oci-2.149.0-py3-none-any.whl.metadata (5.3 kB)
Collecting circuitbreaker<3.0.0,>=1.3.1 (from oci)
  Downloading circuitbreaker-2.1.0-py2.py3-none-any.whl.metadata (7.8 kB)
Downloading borneo-5.4.2-py3-none-any.whl (175 kB)
Downloading oci-2.149.0-py3-none-any.whl (29.4 MB)

In [43]:
import os
from borneo import (Regions, NoSQLHandle, NoSQLHandleConfig, PutRequest,
                    TableRequest, GetRequest, TableLimits, State)
from borneo.iam import SignatureProvider

# Given a region, and compartment, instantiate a connection to the
# cloud service and return it
def get_connection(region):
    print("Connecting to the Oracle NoSQL Cloud Service")
    provider = SignatureProvider()
    # Uses the DEFAULT profile from the config file in the default location (~/.oci/config)
    config = NoSQLHandleConfig(region, provider)
    config.set_default_compartment("ocid1.compartment.oc1..aaaaaaaa6x4ezmuleaubvwgsx3jdolywyalb6iyqbw4ucmimzs7rmfiwhktq")
    return NoSQLHandle(config)

# Given a handle to the Oracle NoSQL Database cloud service, the name of the table
# to write the record to, and an instance of a dictionary, formatted as a
# record for the table, this function will write the record to the table
def write_a_record(handle, table_name, record):
    request = PutRequest().set_table_name(table_name)
    request.set_value(record)
    handle.put(request)

# Given a handle to the Oracle NoSQL Database cloud service, the name of the table
# to read from, and the primary key value for the table, this function will
# read the record from the table and return it
def read_a_record(handle, table_name, pk):
    request = GetRequest().set_table_name(table_name)
    request.set_key({'ID' : pk})
    return(handle.get(request))



In [41]:
handle = get_connection("us-phoenix-1")

record = {
    'ID':1961,
    'TICKER':'LTG',
    'COMPANYNAME':'LT Group, Inc.',
    'TYPEOFSECURITY':'COMMON',
    'TYPEOFDIVIDEND':'Cash',
    'DIVIDENDRATE':'Php 0.15',
    'EXDIVIDENDDATE':'Mar 08, 2024',
    'RECORDDATE':'Mar 11, 2024',
    'PAYMENTDATE':'Mar 22, 2024',
    'CIRCULARNUMBER':'C00942-2024'
}
write_a_record(handle, 'DIVIDEND_HISTORY', record)
print('Wrote record: \n\t'  + str(record))

the_written_record = read_a_record(handle, 'DIVIDEND_HISTORY', 1961)
# Print the row that was read back, not the local dict that was written
print('Read record: \n\t' + str(the_written_record.get_value()))

Connecting to the Oracle NoSQL Cloud Service
Wrote record: 
	{'ID': 1960, 'TICKER': 'LTG', 'COMPANYNAME': 'LT Group, Inc.', 'TYPEOFSECURITY': 'COMMON', 'TYPEOFDIVIDEND': 'Cash', 'DIVIDENDRATE': 'Php 0.15', 'EXDIVIDENDDATE': 'Mar 08, 2024', 'RECORDDATE': 'Mar 11, 2024', 'PAYMENTDATE': 'Mar 22, 2024', 'CIRCULARNUMBER': 'C00942-2024'}


IllegalArgumentException: GET: Illegal Argument: Missing primary key field: ID