## Apply Schema on the lists from files

Let us understand how to apply a schema while processing data from files.
* In many cases, data files do not contain metadata such as column names and data types.
* The metadata often arrives in separate files. It is also common for metadata to be available via database tables or REST-based schema registries.
* We need to make sure that the metadata (schema) is applied to the data as part of data processing.
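
As a minimal sketch of what applying a schema means, the following pairs a list of column names with the values of one delimited record. The record and column names mirror the orders data set used in this section; the variable names are only illustrative.

```python
# Illustrative only: one raw record and the column names it should carry
record = '1,2013-07-25 00:00:00.0,11599,CLOSED'
column_names = ['order_id', 'order_date', 'order_customer_id', 'order_status']

# Split the record on the delimiter and pair each value with its column name
order = dict(zip(column_names, record.split(',')))
print(order)
```

Once the names are attached, a field can be looked up as `order['order_status']` instead of by position.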

In this case, the data files are available under **/data/retail_db**, and the JSON file with the metadata is available at **schemas/retail_db/retail.json**.

In [8]:
!ls -ltr /data/retail_db

total 20156
-rw-r--r-- 1 root root      806 Jan 21  2021 README.md
drwxr-xr-x 2 root root     4096 Jan 21  2021 products
drwxr-xr-x 2 root root     4096 Jan 21  2021 orders
drwxr-xr-x 2 root root     4096 Jan 21  2021 order_items
-rw-r--r-- 1 root root 10297372 Jan 21  2021 load_db_tables_pg.sql
drwxr-xr-x 2 root root     4096 Jan 21  2021 departments
drwxr-xr-x 2 root root     4096 Jan 21  2021 customers
-rw-r--r-- 1 root root     1748 Jan 21  2021 create_db_tables_pg.sql
-rw-r--r-- 1 root root 10303297 Jan 21  2021 create_db.sql
drwxr-xr-x 2 root root     4096 Jan 21  2021 categories


In [9]:
!ls -ltr schemas/retail_db/retail.json

-rw-r--r-- 1 itv002461 students 2083 Apr  6 01:18 schemas/retail_db/retail.json


In [10]:
!cat schemas/retail_db/retail.json

{
    "categories": [
        {"column_name": "category_id", "data_type": "int"},
        {"column_name": "category_department_id", "data_type": "int"},
        {"column_name": "category_name", "data_type": "str"}
    ],
    "customers": [
        {"column_name": "customer_id", "data_type": "int"},
        {"column_name": "customer_fname", "data_type": "str"},
        {"column_name": "customer_lname", "data_type": "str"},
        {"column_name": "customer_email", "data_type": "str"},
        {"column_name": "customer_password", "data_type": "str"},
        {"column_name": "customer_street", "data_type": "str"},
        {"column_name": "customer_city", "data_type": "str"},
        {"column_name": "customer_state", "data_type": "str"},
        {"column_name": "customer_zipcode", "data_type": "str"}
    ],
    "departments": [
        {"column_name": "department_id", "data_type": "int"},
        {"column_name": "department_name", "data_type": "str"}
    ],
    "order_items": [
        {"c

In [11]:
!ls -ltr /data/retail_db/orders

total 2932
-rw-r--r-- 1 root root 2999944 Jan 21  2021 part-00000


In [12]:
# Read orders data into list of strings

orders_path = '/data/retail_db/orders/part-00000'
# The with block closes the file as soon as the lines are read
with open(orders_path) as orders_file:
    orders = orders_file.read().splitlines()

In [13]:
orders[:10]

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [14]:
# Load schemas into dict using json

import json
# The with block closes the schema file after it is parsed
with open('schemas/retail_db/retail.json') as schema_file:
    retail_schemas = json.load(schema_file)

In [15]:
retail_schemas

{'categories': [{'column_name': 'category_id', 'data_type': 'int'},
  {'column_name': 'category_department_id', 'data_type': 'int'},
  {'column_name': 'category_name', 'data_type': 'str'}],
 'customers': [{'column_name': 'customer_id', 'data_type': 'int'},
  {'column_name': 'customer_fname', 'data_type': 'str'},
  {'column_name': 'customer_lname', 'data_type': 'str'},
  {'column_name': 'customer_email', 'data_type': 'str'},
  {'column_name': 'customer_password', 'data_type': 'str'},
  {'column_name': 'customer_street', 'data_type': 'str'},
  {'column_name': 'customer_city', 'data_type': 'str'},
  {'column_name': 'customer_state', 'data_type': 'str'},
  {'column_name': 'customer_zipcode', 'data_type': 'str'}],
 'departments': [{'column_name': 'department_id', 'data_type': 'int'},
  {'column_name': 'department_name', 'data_type': 'str'}],
 'order_items': [{'column_name': 'order_item_id', 'data_type': 'int'},
  {'column_name': 'order_item_order_id', 'data_type': 'int'},
  {'column_name': 

In [16]:
# Get the schema for the relevant data set

retail_schemas['orders']

[{'column_name': 'order_id', 'data_type': 'int'},
 {'column_name': 'order_date', 'data_type': 'str'},
 {'column_name': 'order_customer_id', 'data_type': 'int'},
 {'column_name': 'order_status', 'data_type': 'int'}]

In [17]:
# Fetch the column names

columns = list(map(lambda col: col['column_name'], retail_schemas['orders']))

In [18]:
columns

['order_id', 'order_date', 'order_customer_id', 'order_status']
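
The schema entries also carry a `data_type` for each column, while values read from a text file are always strings. One possible way to use that information, sketched here under the assumption that the type names in the schema (`int`, `str`) map directly to Python built-ins, is to look up a converter per column. The `converters` mapping, the `apply_types` helper, and the inline `orders_schema` are hypothetical and not part of the original notebook.

```python
# Hypothetical mapping from schema type names to Python converters
converters = {'int': int, 'str': str}

# First three entries of the orders schema shown above
orders_schema = [
    {'column_name': 'order_id', 'data_type': 'int'},
    {'column_name': 'order_date', 'data_type': 'str'},
    {'column_name': 'order_customer_id', 'data_type': 'int'},
]

def apply_types(values, schema):
    """Return a dict keyed by column name with each value converted per the schema."""
    return {
        col['column_name']: converters[col['data_type']](value)
        for col, value in zip(schema, values)
    }

# zip stops at the shorter input, so the fourth field (order_status) is ignored here
typed_order = apply_types('1,2013-07-25 00:00:00.0,11599,CLOSED'.split(','), orders_schema)
print(typed_order)
```

The sketch uses only the first three `orders` columns. `order_status` is listed as `int` in the schema output above even though its values are text such as `CLOSED`, so converting it with `int` would raise a `ValueError`.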

In [19]:
import csv

In [20]:
csv.DictReader?

Init signature:
csv.DictReader(
    f,
    fieldnames=None,
    restkey=None,
    restval=None,
    dialect='excel',
    *args,
    **kwds,
)
Docstring:      <no docstring>
File:           /opt/anaconda3/envs/beakerx/lib/python3.6/csv.py
Type:           type
Subclasses:


In [21]:
# Create a DictReader object from the open file and the column names
# Iterating over it yields one dict per row; the keys come from columns
csv_reader = csv.DictReader(open(orders_path), fieldnames=columns)

In [22]:
csv_reader

<csv.DictReader at 0x7fd984de2e10>
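
`csv.DictReader` is an iterator: it reads rows lazily from the underlying file and can be consumed only once. The sketch below illustrates this with an in-memory `io.StringIO` buffer standing in for the orders file, so it runs without the data set. The two sample lines are taken from the orders output shown earlier.

```python
import csv
import io

columns = ['order_id', 'order_date', 'order_customer_id', 'order_status']

# In-memory stand-in for the orders file
sample = io.StringIO(
    '1,2013-07-25 00:00:00.0,11599,CLOSED\n'
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n'
)

reader = csv.DictReader(sample, fieldnames=columns)
first_pass = list(reader)   # reads every row from the buffer
second_pass = list(reader)  # the reader is already exhausted

print(len(first_pass), len(second_pass))
```

Because the second pass is empty, it is usually safer to convert the reader to a list once and reuse that list.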

In [23]:
list(csv_reader)[:10]

[OrderedDict([('order_id', '1'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '11599'),
              ('order_status', 'CLOSED')]),
 OrderedDict([('order_id', '2'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '256'),
              ('order_status', 'PENDING_PAYMENT')]),
 OrderedDict([('order_id', '3'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '12111'),
              ('order_status', 'COMPLETE')]),
 OrderedDict([('order_id', '4'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '8827'),
              ('order_status', 'CLOSED')]),
 OrderedDict([('order_id', '5'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '11318'),
              ('order_status', 'COMPLETE')]),
 OrderedDict([('order_id', '6'),
              ('order_date', '2013-07-25 00:00:00.0'),
            

In [1]:
folder_name = '/data/retail_db/orders'

In [2]:
import os
file_names = os.listdir(folder_name)

In [3]:
file_names

['part-00000']

In [4]:
l1 = [1, 2, 3]

In [5]:
l2 = [4, 5]

In [6]:
l1 + l2

[1, 2, 3, 4, 5]
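
List concatenation is what allows rows from several files to be collected into one list. A minimal sketch, with made-up chunks standing in for the rows read from each file: `+=` extends the existing list in place, whereas `+` builds a new list each time.

```python
# Made-up chunks standing in for the rows read from each part file
chunks = [[1, 2, 3], [4, 5], [6]]

combined = []
for chunk in chunks:
    combined += chunk  # same effect as combined.extend(chunk)

print(combined)
```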

In [7]:
import os
import json
import csv

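def get_dicts(base_folder, data_set_name, schema_file):
    # Load all schemas and pick the column names for the requested data set
    with open(schema_file) as schema_fh:
        retail_schemas = json.load(schema_fh)
    columns = [col['column_name'] for col in retail_schemas[data_set_name]]
    data = []
    # A data set folder can hold several part files; read each one in turn
    for file_name in os.listdir(f'{base_folder}/{data_set_name}'):
        file_path = f'{base_folder}/{data_set_name}/{file_name}'
        # newline='' is what the csv module documentation recommends for file objects
        with open(file_path, newline='') as data_fh:
            data += list(csv.DictReader(data_fh, fieldnames=columns))
    return data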

In [26]:
data = get_dicts('/data/retail_db', 'orders', 'schemas/retail_db/retail.json')

In [27]:
len(data)

68883
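
With the schema applied, each field can be referenced by name rather than by position. As a small illustration, the sketch below counts orders per status. The `rows` list is a hand-copied sample of the first four orders shown earlier, standing in for the full list returned by `get_dicts`, so the counts apply to the sample only.

```python
from collections import Counter

# Sample rows in the shape returned by get_dicts (first four orders from the output above)
rows = [
    {'order_id': '1', 'order_date': '2013-07-25 00:00:00.0',
     'order_customer_id': '11599', 'order_status': 'CLOSED'},
    {'order_id': '2', 'order_date': '2013-07-25 00:00:00.0',
     'order_customer_id': '256', 'order_status': 'PENDING_PAYMENT'},
    {'order_id': '3', 'order_date': '2013-07-25 00:00:00.0',
     'order_customer_id': '12111', 'order_status': 'COMPLETE'},
    {'order_id': '4', 'order_date': '2013-07-25 00:00:00.0',
     'order_customer_id': '8827', 'order_status': 'CLOSED'},
]

# Look up the status by column name instead of by index
status_counts = Counter(row['order_status'] for row in rows)
print(status_counts)
```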

In [28]:
data[:10]

[OrderedDict([('order_id', '1'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '11599'),
              ('order_status', 'CLOSED')]),
 OrderedDict([('order_id', '2'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '256'),
              ('order_status', 'PENDING_PAYMENT')]),
 OrderedDict([('order_id', '3'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '12111'),
              ('order_status', 'COMPLETE')]),
 OrderedDict([('order_id', '4'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '8827'),
              ('order_status', 'CLOSED')]),
 OrderedDict([('order_id', '5'),
              ('order_date', '2013-07-25 00:00:00.0'),
              ('order_customer_id', '11318'),
              ('order_status', 'COMPLETE')]),
 OrderedDict([('order_id', '6'),
              ('order_date', '2013-07-25 00:00:00.0'),
            

In [29]:
data = get_dicts('/data/retail_db', 'order_items', 'schemas/retail_db/retail.json')

In [30]:
len(data)

172198

In [31]:
data[:10]

[OrderedDict([('order_item_id', '1'),
              ('order_item_order_id', '1'),
              ('order_item_product_id', '957'),
              ('order_item_quantity', '1'),
              ('order_item_subtotal', '299.98'),
              ('order_item_product_price', '299.98')]),
 OrderedDict([('order_item_id', '2'),
              ('order_item_order_id', '2'),
              ('order_item_product_id', '1073'),
              ('order_item_quantity', '1'),
              ('order_item_subtotal', '199.99'),
              ('order_item_product_price', '199.99')]),
 OrderedDict([('order_item_id', '3'),
              ('order_item_order_id', '2'),
              ('order_item_product_id', '502'),
              ('order_item_quantity', '5'),
              ('order_item_subtotal', '250.0'),
              ('order_item_product_price', '50.0')]),
 OrderedDict([('order_item_id', '4'),
              ('order_item_order_id', '2'),
              ('order_item_product_id', '403'),
              ('order_item_quantity