<a href="https://www.kaggle.com/code/arunachal/simple-etl-pipeline-using-python?scriptVersionId=140797920" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

Here, we create a simple ETL pipeline using Python. Source data exists in multiple formats (csv, json, xml) in a local directory. We extract amd merge data from 9 files into a dataframe. Then, we carry out a simple transformation and load the transformed data, again, in a local directory. Alongside, we generate a simple log file.

In [1]:
import pandas as pd
from datetime import datetime
from glob import glob
import xml.etree.ElementTree as et

In [2]:
# Creating output files

tmpfile = "dealership_temp.tmp"               # store all extracted data
logfile = "dealership_logfile.txt"            # all event logs will be stored
targetfile = "dealership_transformed_data.csv"   # transformed data is stored

In [3]:
# Function to read data from csv files

def extract_from_csv(file_to_process): 
    dataframe = pd.read_csv(file_to_process) 
    return dataframe

In [4]:
# Function to read data from json files

def extract_from_json(file_to_process):
    dataframe = pd.read_json(file_to_process,lines=True)
    return dataframe

In [5]:
# Function to read data from xml files

def extract_from_xml(file_to_process):

    cols=['car_model','year_of_manufacture','price', 'fuel']
    rows = []

    tree = et.parse(file_to_process) 
    root = tree.getroot() 

    for person in root: 

        car_model = person.find("car_model").text 
        year_of_manufacture = int(person.find("year_of_manufacture").text)
        price = float(person.find("price").text) 
        fuel = person.find("fuel").text 

        rows.append({"car_model":car_model, "year_of_manufacture":year_of_manufacture, "price":price, "fuel":fuel}) 

        return pd.DataFrame(rows, columns = cols)

In [6]:
# Function to merge data from all file types
# Column schema is hard-coded here but automated schema detection can also be performed.

def extract():
    
    extracted_data = pd.DataFrame(columns=['car_model','year_of_manufacture','price', 'fuel']) 

    for csvfile in glob("../input/car-dealership/dealership_data/*.csv"):        
        extracted_data = pd.concat([extracted_data, extract_from_csv(csvfile)], ignore_index=True)
                        
    for jsonfile in glob("../input/car-dealership/dealership_data/*.json"):        
        extracted_data = pd.concat([extracted_data, extract_from_json(jsonfile)], ignore_index=True)

    for xmlfile in glob("../input/car-dealership/dealership_data/*.xml"):
        extracted_data = pd.concat([extracted_data, extract_from_xml(xmlfile)], ignore_index=True)
        
    return extracted_data

In [7]:
# Transforming the data

def transform(df):
    
    data = df.copy()
    
    data['price'] = round(data.price, 1)    
    
    return data

In [8]:
# Loading the data, as a local csv file

def load(targetfile,data_to_load):
    
    data_to_load.to_csv(targetfile)

In [9]:
# Formatting output file

def log(message):
    
    timestamp_format = '%H:%M:%S-%h-%d-%Y'
    now = datetime.now() 
    timestamp = now.strftime(timestamp_format)
    
    with open("dealership_logfile.txt","a") as f: 
        
        f.write(timestamp + ' , ' + message + '\n')

In [10]:
# Running the ETL workload sequentially with logs

log("\n\nETL Job Started\n")

log("Extract phase Started\n")
extracted_data = extract() 
log("Extract phase Ended\n")

log('Transform phase Started\n')
transformed_data = transform(extracted_data)
log("Transform phase Ended\n")

log("Load phase Started\n")
load(targetfile, transformed_data)
log("Load phase Ended\n")

log("ETL Job Ended\n")

In [11]:
extracted_data.head()

Unnamed: 0,car_model,year_of_manufacture,price,fuel
0,ritz,2014,5000.0,Petrol
1,sx4,2013,7089.552239,Diesel
2,ciaz,2017,10820.895522,Petrol
3,wagon r,2011,4253.731343,Petrol
4,swift,2014,6865.671642,Diesel


In [12]:
extracted_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   car_model            63 non-null     object 
 1   year_of_manufacture  63 non-null     object 
 2   price                63 non-null     float64
 3   fuel                 63 non-null     object 
dtypes: float64(1), object(3)
memory usage: 2.1+ KB


In [13]:
# Validating data transformations

transformed_data.head()

Unnamed: 0,car_model,year_of_manufacture,price,fuel
0,ritz,2014,5000.0,Petrol
1,sx4,2013,7089.6,Diesel
2,ciaz,2017,10820.9,Petrol
3,wagon r,2011,4253.7,Petrol
4,swift,2014,6865.7,Diesel


In [14]:
transformed_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   car_model            63 non-null     object 
 1   year_of_manufacture  63 non-null     object 
 2   price                63 non-null     float64
 3   fuel                 63 non-null     object 
dtypes: float64(1), object(3)
memory usage: 2.1+ KB


In [15]:
f = open(logfile, "r")
print(f.read())

14:57:40-Aug-23-2023 , 
ETL Job Started

14:57:40-Aug-23-2023 , Extract phase Started

14:57:40-Aug-23-2023 , Extract phase Ended

14:57:40-Aug-23-2023 , Transform phase Started

14:57:40-Aug-23-2023 , Transform phase Ended

14:57:40-Aug-23-2023 , Load phase Started

14:57:40-Aug-23-2023 , Load phase Ended

14:57:40-Aug-23-2023 , ETL Job Ended

14:58:25-Aug-23-2023 , 
ETL Job Started

14:58:25-Aug-23-2023 , Extract phase Started

14:58:25-Aug-23-2023 , Extract phase Ended

14:58:25-Aug-23-2023 , Transform phase Started

14:58:25-Aug-23-2023 , Transform phase Ended

14:58:25-Aug-23-2023 , Load phase Started

14:58:25-Aug-23-2023 , Load phase Ended

14:58:25-Aug-23-2023 , ETL Job Ended

14:59:12-Aug-23-2023 , 
ETL Job Started

14:59:12-Aug-23-2023 , Extract phase Started

14:59:12-Aug-23-2023 , Extract phase Ended

14:59:12-Aug-23-2023 , Transform phase Started

14:59:12-Aug-23-2023 , Transform phase Ended

14:59:12-Aug-23-2023 , Load phase Started

14:59:12-Aug-23-2023 , Load phase Ende