# Data ETL Pipeline using Python

This pipeline imports the Fashion-MNIST dataset, transforms it, and stores it in an SQLite database so that the data can be queried and manipulated later.

The code below builds an ETL (extract, transform, load) pipeline using Python and SQLite. The load stage includes the following steps:

- Import the sqlite3 library to work with SQLite databases.
- Create a connection to the database.
- Create a table called “images” in the database.
- Loop through each image in the training data, inserting the images and their labels into the “images” table.
- Use the commit() method to save changes to the database.
- Loop through each image in the test data, inserting the images and their labels into the “images” table.
- Use the commit() method again to save changes.
- Close the connection to the database.

## Extract

To collect data, let's use the Fashion-MNIST dataset provided by the Keras library.

In [2]:
import tensorflow.keras as keras
(xtrain, ytrain), (xtest, ytest) = keras.datasets.fashion_mnist.load_data()

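Look at the shape of the data: there are 60,000 training images and 10,000 test images, each 28×28 pixels, with one label per image.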

In [3]:
print(xtrain.shape)
print(ytrain.shape)
print(xtest.shape)
print(ytest.shape)

(60000, 28, 28)
(60000,)
(10000, 28, 28)
(10000,)
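
The label arrays hold integers from 0 to 9. As a quick sanity check on the extracted data, the sketch below counts how many images fall into each class. The `CLASS_NAMES` list and the `count_per_class` helper are illustrative names, not part of Keras; the names follow the Fashion-MNIST label order, and the tiny `labels` array is a made-up stand-in for `ytrain`.

```python
import numpy as np

# Fashion-MNIST class names, indexed by label (0-9)
CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

def count_per_class(labels):
    """Return {class name: number of images} for an array of integer labels."""
    counts = np.bincount(labels, minlength=len(CLASS_NAMES))
    return {name: int(n) for name, n in zip(CLASS_NAMES, counts)}

# Stand-in for ytrain so the example runs without downloading the dataset
labels = np.array([9, 0, 0, 3, 0], dtype=np.uint8)
counts = count_per_class(labels)
print(counts["T-shirt/top"], counts["Dress"], counts["Ankle boot"])
```

Calling `count_per_class(ytrain)` on the real labels would show whether the classes are balanced before anything is written to the database.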


## Transform

To clean and transform the data, we normalize the pixel values to the range 0 to 1 and reshape each image array into a 4D tensor with a trailing channel dimension.

In [6]:
import numpy as np

xtrain = xtrain.astype('float32') / 255
xtest = xtest.astype('float32') / 255

xtrain = np.reshape(xtrain, (xtrain.shape[0], 28, 28, 1))
xtest = np.reshape(xtest, (xtest.shape[0], 28, 28, 1))

print(xtrain.shape)
print(ytrain.shape)
print(xtest.shape)
print(ytest.shape)

(60000, 28, 28, 1)
(60000,)
(10000, 28, 28, 1)
(10000,)
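
To make the effect of the two steps concrete, here is a minimal sketch on a made-up batch of two 28×28 images (the `batch` array is a stand-in for `xtrain`). It also checks that adding the channel axis with `np.expand_dims` gives the same result as the explicit `np.reshape` call.

```python
import numpy as np

# Made-up stand-in for xtrain: two 28x28 images with uint8 pixels in 0..255
batch = np.zeros((2, 28, 28), dtype=np.uint8)
batch[0, 0, 0] = 255   # brightest possible pixel
batch[1, 5, 5] = 51    # a darker pixel

# Normalize to float32 values in [0, 1]
scaled = batch.astype('float32') / 255

# Add a trailing channel axis: (N, 28, 28) -> (N, 28, 28, 1)
reshaped = np.reshape(scaled, (scaled.shape[0], 28, 28, 1))
expanded = np.expand_dims(scaled, axis=-1)

print(reshaped.shape, reshaped.dtype)
print(float(reshaped.min()), float(reshaped.max()))
print(np.array_equal(reshaped, expanded))
```

The reshape does not change any pixel values; it only adds the single grayscale channel that image models in Keras typically expect.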


## Load

Insert the data into an SQLite database.

In [8]:
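import sqlite3

# Open the database file (it is created if it does not exist)
conn = sqlite3.connect('fashion_mnist.db')

# Create the table on the first run only
conn.execute('''CREATE TABLE IF NOT EXISTS images
             (id INTEGER PRIMARY KEY AUTOINCREMENT,
             image BLOB NOT NULL,
             label INTEGER NOT NULL);''')

# Insert the training images; each float32 array is stored as raw bytes.
# The labels are NumPy uint8 scalars, which sqlite3 also stores as bytes.
for i in range(xtrain.shape[0]):
    conn.execute('INSERT INTO images (image, label) VALUES (?, ?)',
                 [sqlite3.Binary(xtrain[i]), ytrain[i]])

# Save the training rows
conn.commit()

# Insert the test images into the same table
for i in range(xtest.shape[0]):
    conn.execute('INSERT INTO images (image, label) VALUES (?, ?)',
                 [sqlite3.Binary(xtest[i]), ytest[i]])

# Save the test rows and release the connection
conn.commit()

conn.close()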
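
A variation on the loop above, offered as a sketch rather than a replacement: it converts each image to raw bytes with `tobytes()`, casts each label to a plain Python `int` so that SQLite stores it as an INTEGER, and inserts all rows with one `executemany()` call. The `insert_images` helper, the in-memory database, and the random stand-in arrays are assumptions made for illustration.

```python
import sqlite3
import numpy as np

def insert_images(conn, images, labels):
    """Insert (image, label) pairs into the images table in one batch."""
    rows = ((img.tobytes(), int(lbl)) for img, lbl in zip(images, labels))
    # Using the connection as a context manager commits on success
    # and rolls back if an insert raises
    with conn:
        conn.executemany(
            'INSERT INTO images (image, label) VALUES (?, ?)', rows)

# In-memory database and small random arrays stand in for the real data
conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE IF NOT EXISTS images
             (id INTEGER PRIMARY KEY AUTOINCREMENT,
             image BLOB NOT NULL,
             label INTEGER NOT NULL);''')

rng = np.random.default_rng(0)
images = rng.random((4, 28, 28, 1), dtype=np.float32)
labels = np.array([9, 0, 0, 3], dtype=np.uint8)

insert_images(conn, images, labels)

count, = conn.execute('SELECT COUNT(*) FROM images').fetchone()
stored_type, = conn.execute(
    'SELECT typeof(label) FROM images LIMIT 1').fetchone()
print(count, stored_type)
```

With integer labels, queries such as `SELECT COUNT(*) FROM images WHERE label = 9` can filter on the label column directly.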

Read the data back from the database.

In [10]:
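import sqlite3
import pandas as pd

conn = sqlite3.connect('fashion_mnist.db')

# Option 1: fetch every row as a list of (id, image, label) tuples
cursor = conn.cursor()
cursor.execute('SELECT * FROM images')
rows = cursor.fetchall()

# Option 2: load the same query into a pandas DataFrame
data = pd.read_sql_query('SELECT * FROM images', conn)

conn.close()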

In [14]:
data

| | id | image | label |
|---|---|---|---|
| 0 | 1 | b"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00... | b'\t' |
| 1 | 2 | b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00... | b'\x00' |
| 2 | 3 | b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00... | b'\x00' |
| 3 | 4 | b"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00... | b'\x03' |
| 4 | 5 | b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00... | b'\x00' |
| ... | ... | ... | ... |
| 69995 | 69996 | b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00... | b'\t' |
| 69996 | 69997 | b"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00... | b'\x01' |
| 69997 | 69998 | b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00... | b'\x08' |
| 69998 | 69999 | b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00... | b'\x01' |
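
Both the image and label columns come back as bytes. The label bytes (for example `b'\t'`, which is 9) are what SQLite returns when NumPy `uint8` scalars are inserted directly. A minimal sketch for decoding one row follows, assuming the images were stored as 28×28×1 `float32` arrays as in the Transform step; `decode_row` is a hypothetical helper, and the sample row is built by hand rather than read from the database.

```python
import numpy as np

def decode_row(image_blob, label_blob):
    """Turn the raw bytes of one row back into an array and an int label."""
    image = np.frombuffer(image_blob, dtype=np.float32).reshape(28, 28, 1)
    # A label stored from a NumPy uint8 scalar is a single byte
    label = int.from_bytes(label_blob, byteorder='little')
    return image, label

# Hand-built sample row mimicking what the query returns
original = np.full((28, 28, 1), 0.5, dtype=np.float32)
row_image = original.tobytes()
row_label = b'\t'

image, label = decode_row(row_image, row_label)
print(image.shape, label)
```

The array returned by `np.frombuffer` on a bytes object is read-only; call `.copy()` on it if the pixels need to be modified.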
