# ETL Extract Lab - DSA 2040A


## Project Setup
This notebook demonstrates:
- Full dataset extraction
- Incremental extraction based on last run timestamp
- Proper ETL workflow practices


In [24]:
# Import required libraries
import pandas as pd
from datetime import datetime

In [25]:
# Section 1: Full Extraction

# %%
def full_extraction(file_path):
    """Read the whole CSV, print a summary, and return the first five rows as a preview"""
    try:
        df = pd.read_csv(file_path)
        print("Full extraction completed successfully.")
        print(f"Extracted {len(df)} rows fully.")
        
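        # Display basic stats.
        # Note: df.info() prints its report directly and returns None,
        # so wrapping it in print() also emits a trailing "None".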
        print("\nDataset Info:")
        print(df.info())
        
        print("\nSample Data:")
        return df.head()
    except Exception as e:
        print(f"Error during full extraction: {e}")
        return None


In [27]:
# Full extraction example
file_path = "custom_data.csv"
full_extraction(file_path)

Full extraction completed successfully.
Extracted 1000 rows fully.

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Transaction ID    1000 non-null   int64 
 1   Date              1000 non-null   object
 2   Customer ID       1000 non-null   object
 3   Gender            1000 non-null   object
 4   Age               1000 non-null   int64 
 5   Product Category  1000 non-null   object
 6   Quantity          1000 non-null   int64 
 7   Price per Unit    1000 non-null   int64 
 8   Total Amount      1000 non-null   int64 
dtypes: int64(5), object(4)
memory usage: 70.4+ KB
None

Sample Data:


|   | Transaction ID | Date | Customer ID | Gender | Age | Product Category | Quantity | Price per Unit | Total Amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-11-24 | CUST001 | Male | 34 | Beauty | 3 | 50 | 150 |
| 1 | 2 | 2023-02-27 | CUST002 | Female | 26 | Clothing | 2 | 500 | 1000 |
| 2 | 3 | 2023-01-13 | CUST003 | Male | 50 | Electronics | 1 | 30 | 30 |
| 3 | 4 | 2023-05-21 | CUST004 | Male | 37 | Clothing | 1 | 500 | 500 |
| 4 | 5 | 2023-05-06 | CUST005 | Male | 30 | Beauty | 2 | 50 | 100 |
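
Because `full_extraction` returns only `df.head()`, any later step that needs every row has to read the file again. The following is a minimal sketch of a variant that returns the full DataFrame and parses `Date` while reading. `extract_full` and `SAMPLE_CSV` are hypothetical names introduced for illustration, and the five rows are copied from the sample output above so the cell runs without `custom_data.csv`.

```python
import io

import pandas as pd

# Five rows copied from the sample output above (assumption: stands in for custom_data.csv).
SAMPLE_CSV = """Transaction ID,Date,Customer ID,Gender,Age,Product Category,Quantity,Price per Unit,Total Amount
1,2023-11-24,CUST001,Male,34,Beauty,3,50,150
2,2023-02-27,CUST002,Female,26,Clothing,2,500,1000
3,2023-01-13,CUST003,Male,50,Electronics,1,30,30
4,2023-05-21,CUST004,Male,37,Clothing,1,500,500
5,2023-05-06,CUST005,Male,30,Beauty,2,50,100
"""


def extract_full(source):
    """Hypothetical variant: return every row, with Date parsed as datetime."""
    # parse_dates converts the named column while the CSV is being read.
    return pd.read_csv(source, parse_dates=["Date"])


full_df = extract_full(io.StringIO(SAMPLE_CSV))
print(f"Extracted {len(full_df)} rows.")
print(full_df["Date"].dtype)
```

In the notebook itself, `extract_full("custom_data.csv")` would be the equivalent call, and `full_df.head()` can still be used for a quick preview.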


In [28]:
# Section 2: Incremental Extraction

# %%
def read_last_extraction_time():
    """Read the last extraction timestamp from file"""
    try:
        with open('last_extraction.txt', 'r') as f:
            return datetime.strptime(f.read().strip(), '%Y-%m-%d %H:%M:%S')
    except (FileNotFoundError, ValueError):
        # Fall back to the earliest datetime if the file is missing
        # or its contents do not match the expected format
        return datetime.min

# %%
def incremental_extraction(file_path):
    """Perform incremental extraction based on last run"""
    try:
        # Read the complete data
        full_df = pd.read_csv(file_path)
        
        # Convert Date column to datetime
        full_df['Date'] = pd.to_datetime(full_df['Date'])
        
        # Get last extraction time
        last_time = read_last_extraction_time()
        print(f"Last extraction was at: {last_time}")
        
        # Filter for new records
        new_data = full_df[full_df['Date'] > last_time]
        
        print(f"Extracted {len(new_data)} rows incrementally since last check.")
        return new_data
    except Exception as e:
        print(f"Error during incremental extraction: {e}")
        return None

# Example incremental extraction
incremental_data = incremental_extraction(file_path)
if incremental_data is not None and not incremental_data.empty:
    print("\nNew data since last extraction:")
    print(incremental_data)



Last extraction was at: 2024-02-19 00:00:00
Extracted 0 rows incrementally since last check.
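
The filter keeps only rows whose `Date` is strictly later than the stored timestamp. The sample rows shown earlier are all dated 2023, so a stored timestamp of 2024-02-19 matches none of them. The sketch below illustrates that comparison on a small in-memory frame; `filter_since` is a hypothetical helper, and the IDs and dates are taken from the sample rows.

```python
import pandas as pd


def filter_since(df, last_time, date_col="Date"):
    """Hypothetical helper: rows whose date is strictly later than last_time."""
    dates = pd.to_datetime(df[date_col])
    return df[dates > pd.Timestamp(last_time)]


sample = pd.DataFrame({
    "Transaction ID": [1, 2, 3, 4, 5],
    "Date": ["2023-11-24", "2023-02-27", "2023-01-13", "2023-05-21", "2023-05-06"],
})

# A timestamp inside the data range returns only the later rows.
recent = filter_since(sample, "2023-05-06")
print(recent["Transaction ID"].tolist())  # [1, 4]

# A timestamp after every row returns an empty frame.
none_new = filter_since(sample, "2024-02-19")
print(len(none_new))  # 0
```

The comparison is strict, so a row dated exactly at the stored timestamp (transaction 5 here) is treated as already extracted.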


In [29]:

# Section 3: Save New Timestamp
# %%
def update_extraction_time():
    """Update the last extraction timestamp to current time"""
    try:
        current_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        with open('last_extraction.txt', 'w') as f:
            f.write(current_time)
        print(f"Updated last extraction time to: {current_time}")
    except Exception as e:
        print(f"Error updating extraction time: {e}")

# Update the timestamp only if the incremental extraction did not fail
if incremental_data is not None:
    update_extraction_time()

Updated last extraction time to: 2025-06-10 18:45:58
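
The read and write helpers depend on sharing one timestamp format. As a rough sketch, the round trip can be checked in a temporary directory so the real `last_extraction.txt` is left untouched. `write_watermark`, `read_watermark`, and `TIME_FORMAT` are hypothetical names that mirror the notebook's functions but take the path as a parameter.

```python
import os
import tempfile
from datetime import datetime

# Same format string the notebook uses for last_extraction.txt.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def write_watermark(path, when):
    """Hypothetical helper: store a timestamp as text."""
    with open(path, "w") as f:
        f.write(when.strftime(TIME_FORMAT))


def read_watermark(path, default=datetime.min):
    """Hypothetical helper: load the timestamp, or default if missing or unparsable."""
    try:
        with open(path, "r") as f:
            return datetime.strptime(f.read().strip(), TIME_FORMAT)
    except (FileNotFoundError, ValueError):
        return default


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "last_extraction.txt")
    first = read_watermark(path)  # file does not exist yet, so the default is returned
    stamp = datetime(2025, 6, 10, 18, 45, 58)
    write_watermark(path, stamp)
    second = read_watermark(path)  # value written above is read back

print(first == datetime.min)  # True
print(second == stamp)        # True
```

Note that the format has one-second resolution, so any sub-second part of `datetime.now()` is dropped when the timestamp is written.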
