## Mini ETL Project Using Python OOP

For Data Engineering Track: Practice on Classes, Inheritance & Polymorphism

### 📌 Task Description

In this exercise, you will build a simple ETL pipeline using Object-Oriented Programming concepts.

You will simulate receiving raw data from different sources (CSV, JSON, etc.), then transform it and load it.

Your goal is to apply:

- Classes & Objects

- Attributes & Methods

- Instance vs Class behavior

- Inheritance

- Method overriding

- Polymorphism

This is NOT a real ETL pipeline; it's a small conceptual example to help you understand how OOP can organize data engineering workflows.
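
Of the concepts listed above, instance vs class behavior is the easiest to overlook, so here is a minimal sketch of the difference. The `sources_created` counter is a hypothetical addition used only for illustration; it is not part of the exercise.

```python
class DataSource:
    # Class attribute: a single value shared by every instance.
    # (sources_created is a hypothetical name, used only for this sketch.)
    sources_created = 0

    def __init__(self, raw_data):
        # Instance attribute: each object stores its own raw_data.
        self.raw_data = raw_data
        # Update the shared counter on the class, not on the instance.
        DataSource.sources_created += 1


first = DataSource("a,b,c")
second = DataSource({"name": "Ali"})

print(first.raw_data)              # belongs to first only
print(second.raw_data)             # belongs to second only
print(DataSource.sources_created)  # shared by all instances
```

Each object keeps its own `raw_data`, while the counter lives on the class and can be read through the class or through any instance.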


## What You Will Build

You will create:

1️⃣ Base Class – DataSource

- Holds raw data

- Defines the generic steps: extract(), transform(), and load()

- Child classes will override transform() depending on the data type

2️⃣ Child Classes:

- CSVData → simulates cleaning CSV rows

- JSONData → simulates extracting JSON fields

- Each class has its own version of transform() → THIS is polymorphism.

3️⃣ ETL Runner Function

- A single function called run_etl(source) that works with any data source object.

- This shows how one interface can work with multiple classes.


In [1]:
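class DataSource:
    """Base class: holds raw data and defines the generic ETL steps."""

    def __init__(self, raw_data):
        self.raw_data = raw_data  # input exactly as received
        self.data = None  # working copy, updated by each step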

    def extract(self):
        pass

    def transform(self):
        pass

    def load(self):
        pass

class CSVData(DataSource):
    def __init__(self, raw_data):
        super().__init__(raw_data)

    def extract(self):
        # Split the raw comma-separated string into a list of fields.
        self.data = self.raw_data.split(',')
        return self.data

    def transform(self):
        # Strip surrounding whitespace from every field. Each step updates
        # self.data, so the data flows through the pipeline stage by stage.
        self.data = [element.strip() for element in self.data]
        return self.data

    def load(self):
        print(self.data)
        return self.data


class JSONData(DataSource):
    def __init__(self, raw_data):
        super().__init__(raw_data)

    def extract(self):
        # Copy so the original raw_data is not modified (a tip I picked up from ChatGPT).
        self.data = self.raw_data.copy()
        # Return the data too, so callers can use the extracted value directly.
        return self.data
    
    def transform(self):
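        keys = ["name", "age", "score"]
        cleaned_data = {}

        for key in keys:
            value = self.data.get(key)
            # Convert digit-only strings to int. The isinstance check avoids an
            # AttributeError when a key is missing (None) or already a number.
            if isinstance(value, str) and value.isdigit():
                value = int(value)
            cleaned_data[key] = value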
        
        self.data = cleaned_data
        return self.data
    
    def load(self):
        print(self.data)
        return self.data

def run_etl(source):
    # Works with any object that provides extract(), transform() and load().
    source.extract()
    source.transform()
    return source.load()

What remains is to integrate real CSV and JSON files into this pipeline so that it becomes fully functional. I will try to do this next.
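
A minimal sketch of how real CSV and JSON files could be plugged into the same extract/transform/load flow. Everything new here is an assumption made for illustration: the class names `FileSource`, `CSVFileData` and `JSONFileData`, the `path` attribute, and the sample file contents. It uses only the standard library `csv` and `json` modules, and writes its own sample files to a temporary folder so it can run on its own.

```python
import csv
import json
import os
import tempfile


class FileSource:
    """Hypothetical base class: holds a file path instead of raw data."""

    def __init__(self, path):
        self.path = path
        self.data = None

    def extract(self):
        raise NotImplementedError

    def transform(self):
        raise NotImplementedError

    def load(self):
        # "Loading" is still simulated: just hand back the cleaned data.
        return self.data


class CSVFileData(FileSource):
    def extract(self):
        # csv.DictReader yields one dict per row, keyed by the header line.
        with open(self.path, newline="") as f:
            self.data = list(csv.DictReader(f))
        return self.data

    def transform(self):
        # Strip stray whitespace from every header name and every value.
        self.data = [
            {key.strip(): value.strip() for key, value in row.items()}
            for row in self.data
        ]
        return self.data


class JSONFileData(FileSource):
    def extract(self):
        with open(self.path) as f:
            self.data = json.load(f)
        return self.data

    def transform(self):
        # Keep only the fields of interest; convert digit-only strings to int.
        cleaned = {}
        for key in ("name", "age", "score"):
            value = self.data.get(key)
            if isinstance(value, str) and value.isdigit():
                value = int(value)
            cleaned[key] = value
        self.data = cleaned
        return self.data


def run_etl(source):
    source.extract()
    source.transform()
    return source.load()


# Write two small sample files (made-up contents), then run the pipeline.
with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "people.csv")
    json_path = os.path.join(folder, "person.json")
    with open(csv_path, "w", newline="") as f:
        f.write("name, age\n Ali , 30 \n")
    with open(json_path, "w") as f:
        json.dump({"name": "Sara", "age": "25", "score": 90, "city": "Cairo"}, f)

    csv_result = run_etl(CSVFileData(csv_path))
    json_result = run_etl(JSONFileData(json_path))

print(csv_result)
print(json_result)
```

The `run_etl` function does not change at all: it only relies on each source having `extract()`, `transform()` and `load()`, which is the same polymorphism idea as in the in-memory version.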