# Simple Extract, Transform, Load project

In this notebook, we build a simple ETL process using [Coursera](https://www.coursera.org/learn/python-project-for-data-engineering) as a reference. First, we read data from different sources (CSV, JSON, and XML). Each file is extracted into a dataframe, and the dataframes are combined into one. Next, we transform the data into the format we want. Finally, we load (export) the resulting dataframe into the target, which in this notebook is a CSV file rather than a database.

## Download files

In [1]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/source.zip

--2022-10-30 02:59:04--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/source.zip
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 198.23.119.245
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|198.23.119.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2707 (2.6K) [application/zip]
Saving to: 'source.zip.12'

     0K ..                                                    100%  453M=0s

2022-10-30 02:59:06 (453 MB/s) - 'source.zip.12' saved [2707/2707]



## Unzip the source

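Because the `unzip` package is not available on my machine, I extracted the archive manually with WinRAR. The extract step below looks for the files in the `source` folder.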
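
If neither `unzip` nor WinRAR is at hand, the archive can also be extracted from Python itself. The following is a minimal sketch using the standard-library `zipfile` module. The helper name `unzip_source` and the default destination `source` are assumptions made for illustration, and the sketch assumes the data files sit at the top level of the archive.

```python
import zipfile
from pathlib import Path


def unzip_source(zip_path, dest_dir="source"):
    """Extract every member of zip_path into dest_dir and return the member names.

    Hypothetical helper: the name and the default folder are assumptions.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

Calling `unzip_source("source.zip")` would then place the files where `glob.glob('source/*.csv')` expects them. If the archive already contains a top-level `source` folder, passing `dest_dir="."` avoids a nested `source/source` path.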

## Set Paths


In [2]:
tmpfile    = "temp.tmp"               # file used to store all extracted data
logfile    = "logfile.txt"            # all event logs will be stored in this file
targetfile = "transformed_data.csv"   # file where transformed data is stored

## Import modules

In [3]:
import glob                         # this module helps in selecting files 
import pandas as pd                 # this module helps in processing CSV files
import xml.etree.ElementTree as ET  # this module helps in processing XML files.
from datetime import datetime

## Extract

### CSV extract function

In [4]:
def extract_from_csv(file_to_process):
    dataframe = pd.read_csv(file_to_process)
    return dataframe

### JSON extract function

In [5]:
def extract_from_json(file_to_process):
    dataframe = pd.read_json(file_to_process, lines=True)
    return dataframe
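
The `lines=True` argument tells pandas to read the file as JSON Lines, that is, one JSON object per line rather than a single JSON array. The sketch below shows this on a small in-memory sample. The sample records are hypothetical and only illustrate the expected layout.

```python
import io

import pandas as pd

# Hypothetical sample in JSON Lines format: one JSON object per line
sample = io.StringIO(
    '{"name": "alex", "height": 65.78, "weight": 112.99}\n'
    '{"name": "ajay", "height": 71.52, "weight": 136.49}\n'
)

# Each line becomes one row; the keys become the column names
df = pd.read_json(sample, lines=True)
print(df)
```

Without `lines=True`, pandas would expect the whole file to be one valid JSON document and would fail on this layout.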

### XML extract function

In [6]:
def extract_from_xml(file_to_process):
    dataframe = pd.DataFrame(columns=['name', 'height', 'weight'])
    tree = ET.parse(file_to_process)
    root = tree.getroot()
    for person in root:
        name = person.find('name').text
        height = float(person.find('height').text)
        weight = float(person.find('weight').text)
        dataframe.loc[len(dataframe.index)] = {'name': name, 'height': height, 'weight': weight}
    return dataframe
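
The function above assumes that every child of the XML root has `name`, `height`, and `weight` sub-elements. The sketch below shows that layout on a hypothetical sample (the `data` and `person` tag names are assumptions) and, as an alternative to growing the dataframe row by row, collects the rows in a list and builds the dataframe once at the end.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample; the <data> and <person> tag names are assumptions
sample_xml = """<data>
  <person><name>alex</name><height>65.78</height><weight>112.99</weight></person>
  <person><name>ajay</name><height>71.52</height><weight>136.49</weight></person>
</data>"""


def xml_to_dataframe(xml_text):
    """Parse XML text into a dataframe with name, height and weight columns."""
    root = ET.fromstring(xml_text)
    rows = []
    for person in root:
        rows.append({
            'name': person.find('name').text,
            'height': float(person.find('height').text),
            'weight': float(person.find('weight').text),
        })
    # Build the dataframe once instead of appending one row at a time
    return pd.DataFrame(rows, columns=['name', 'height', 'weight'])


df = xml_to_dataframe(sample_xml)
print(df)
```

Building the dataframe in one step usually scales better on large files, because each `.loc` assignment on a growing dataframe has to enlarge it.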

### Extract function

In [7]:
def extract():
    # start with an empty dataframe that has the expected columns
    extracted_data = pd.DataFrame(columns=['name', 'height', 'weight'])

    # process all csv files
    for csv_file in glob.glob('source/*.csv'):
        extracted_data = pd.concat([extracted_data, extract_from_csv(csv_file)], axis=0, ignore_index=True)

    # process all json files
    for json_file in glob.glob('source/*.json'):
        extracted_data = pd.concat([extracted_data, extract_from_json(json_file)], axis=0, ignore_index=True)

    # process all xml files
    for xml_file in glob.glob('source/*.xml'):
        extracted_data = pd.concat([extracted_data, extract_from_xml(xml_file)], axis=0, ignore_index=True)

    return extracted_data

## Transform

This function converts height from inches to meters and weight from pounds to kilograms.


In [8]:
def transform(data):
    # 1 inch = 0.0254 m
    data['height'] = data['height'] * 0.0254
    # 1 pound = 0.45359237 kg
    data['weight'] = data['weight'] * 0.45359237
    return data
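
The factors 0.0254 and 0.45359237 convert inches to meters and pounds to kilograms. As a quick check, the sketch below applies them to the first row of the data in this notebook (alex: 65.78 in, 112.99 lb). The constant names are chosen for illustration only.

```python
INCH_TO_M = 0.0254       # meters per inch
LB_TO_KG = 0.45359237    # kilograms per pound

height_in, weight_lb = 65.78, 112.99

# Round to six decimals, matching the precision of the dataframe display
height_m = round(height_in * INCH_TO_M, 6)
weight_kg = round(weight_lb * LB_TO_KG, 6)

print(height_m, weight_kg)  # 1.670812 51.251402
```

These values agree with the first row of the transformed output in the run section.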

## Load

In [9]:
def load(targetfile, data_to_load):
    data_to_load.to_csv(targetfile)
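
The `load` function above writes a CSV file. If the target were a database instead, a minimal sketch using SQLite and `DataFrame.to_sql` could look like the following. The function name `load_to_sqlite`, the database path, and the table name are assumptions made for illustration and are not part of the original project.

```python
import sqlite3

import pandas as pd


def load_to_sqlite(db_path, table_name, data_to_load):
    """Write the dataframe to a SQLite table, replacing the table if it exists.

    Hypothetical helper: name and parameters are assumptions.
    """
    conn = sqlite3.connect(db_path)
    try:
        # index=False keeps the dataframe index out of the table
        data_to_load.to_sql(table_name, conn, if_exists="replace", index=False)
    finally:
        conn.close()
```

A call such as `load_to_sqlite("etl.db", "people", transformed_data)` would then take the place of the CSV export. Using `if_exists="replace"` makes repeated runs overwrite the table. Use `"append"` instead if each run should add rows.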

## Logging 

In [10]:
def log(message):
    timestamp_format = '%Y-%m-%d-%H:%M:%S'
    now = datetime.now()
    timestamp = now.strftime(timestamp_format)
    with open(logfile, 'a') as f:
        f.write(timestamp + ',' + message + '\n')
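
The same `timestamp,message` line format can also be produced with the standard-library `logging` module, which handles the timestamp and the file writing. This is only a sketch of an alternative. The helper name `make_logger` and the logger name `etl_sketch` are assumptions.

```python
import logging


def make_logger(path):
    """Return a logger that appends 'timestamp,message' lines to path.

    Hypothetical helper: the function and logger names are assumptions.
    """
    logger = logging.getLogger("etl_sketch")
    logger.setLevel(logging.INFO)
    logger.propagate = False   # keep messages out of the root logger
    logger.handlers.clear()    # avoid duplicate handlers when the cell is re-run
    handler = logging.FileHandler(path)  # opens the file in append mode by default
    handler.setFormatter(
        logging.Formatter("%(asctime)s,%(message)s", datefmt="%Y-%m-%d-%H:%M:%S")
    )
    logger.addHandler(handler)
    return logger
```

With `logger = make_logger("logfile.txt")`, a call like `logger.info("ETL Job Started")` writes a line in the same format as `log("ETL Job Started")`.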

## Run ETL process

In [11]:
log("ETL Job Started")

In [12]:
log("Extract phase Started")
extracted_data = extract()
log("Extract phase Ended")
extracted_data

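|   | name | height | weight |
|---|------|--------|--------|
| 0 | alex | 65.78 | 112.99 |
| 1 | ajay | 71.52 | 136.49 |
| 2 | alice | 69.4 | 153.03 |
| 3 | ravi | 68.22 | 142.34 |
| 4 | joe | 67.79 | 144.3 |
| 5 | alex | 65.78 | 112.99 |
| 6 | ajay | 71.52 | 136.49 |
| 7 | alice | 69.4 | 153.03 |
| 8 | ravi | 68.22 | 142.34 |
| 9 | joe | 67.79 | 144.3 |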


In [13]:
log("Transform phase Started")
transformed_data = transform(extracted_data)
log("Transform phase Ended")
transformed_data

|   | name | height | weight |
|---|------|--------|--------|
| 0 | alex | 1.670812 | 51.251402 |
| 1 | ajay | 1.816608 | 61.910823 |
| 2 | alice | 1.76276 | 69.41324 |
| 3 | ravi | 1.732788 | 64.564338 |
| 4 | joe | 1.721866 | 65.453379 |
| 5 | alex | 1.670812 | 51.251402 |
| 6 | ajay | 1.816608 | 61.910823 |
| 7 | alice | 1.76276 | 69.41324 |
| 8 | ravi | 1.732788 | 64.564338 |
| 9 | joe | 1.721866 | 65.453379 |


In [14]:
log("Load phase Started")
load(targetfile,transformed_data)
log("Load phase Ended")

In [15]:
log("ETL Job Ended")