# Exploratory Data Analysis - Data Structure
This notebook starts the take-home challenge exploration and will feed into the data understanding and metric creation process.

The first step, covered here, is to load all the data and understand its format.

The notebook naming convention is a number (for ordering), the creator's initials, and a short `-`-delimited description ([source](https://cookiecutter-data-science.drivendata.org/)).

Starting off by loading some data and checking out how it looks.

In [71]:
import os
import json

In [5]:
!ls ../examples/model_outputs/

2.json	4.json	5.json	7.json


In [60]:
!ls ../examples/ground_truth/

example_01.json  example_05.json  example_09.json  example_13.json
example_02.json  example_06.json  example_10.json
example_03.json  example_07.json  example_11.json
example_04.json  example_08.json  example_12.json


In [58]:
with open("../examples/model_outputs/2.json", 'r') as f:
    data = json.load(f)

In [61]:
with open("../examples/ground_truth/example_01.json", 'r') as f:
    gt_data = json.load(f)

In [68]:
gt_data

{'test_n': 1,
 'input': 'Bathroom Remodel project: 7\'x10\'-6", 8 ft ceilings\nDemo a tub/shower combo\nReplace with a complete shower system, with tiled walls, niche, and curb.\nReplace tile floor with new tile.\nNew double vanity.\nNew Toilet.\nPlumbing fixtures stay in place, no relocation or new rough-in needed.\nRepaint whole room 3 coats.\nReplace 4 light fixtures.\nShower Glass Enclosure excluded.',
 'rows': [{'sectionName': 'Demolition',
   'qty': 66.0,
   'rateUsd': 2.4,
   'rowTotalCostUsd': 158.4,
   'label': 'Demo Shower Surround',
   'uom': 'SF',
   'category': 'other'},
  {'sectionName': 'Demolition',
   'qty': 1.0,
   'rateUsd': 80.0,
   'rowTotalCostUsd': 80.0,
   'label': 'Demo Vanity',
   'uom': 'EA',
   'category': 'other'},
  {'sectionName': 'Demolition',
   'qty': 73.0,
   'rateUsd': 4.0,
   'rowTotalCostUsd': 292.0,
   'label': 'Demo Tile',
   'uom': 'SF',
   'category': 'other'},
  {'sectionName': 'Demolition',
   'qty': 1.0,
   'rateUsd': 240.0,
   'rowTotalCost

In [36]:
data["estimate_preds"][0]

{'valid_file_name': 'example_01',
 'rows': [{'label': 'Protect Work Area and Surroundings',
   'qty': 75.0,
   'uom': 'SF',
   'rateUsd': 0.25,
   'rowTotalCostUsd': 18.75,
   'category': 'labor',
   'sectionName': 'Demolition',
   'metadata': None},
  {'label': 'Demo Tub',
   'qty': 2.0,
   'uom': 'HRS',
   'rateUsd': 65.0,
   'rowTotalCostUsd': 130.0,
   'category': 'labor',
   'sectionName': 'Demolition',
   'metadata': None},
  {'label': 'Demo Shower Pan',
   'qty': 2.0,
   'uom': 'HRS',
   'rateUsd': 65.0,
   'rowTotalCostUsd': 130.0,
   'category': 'labor',
   'sectionName': 'Demolition',
   'metadata': None},
  {'label': 'Demo Floor Tile',
   'qty': 8.0,
   'uom': 'HRS',
   'rateUsd': 65.0,
   'rowTotalCostUsd': 520.0,
   'category': 'labor',
   'sectionName': 'Demolition',
   'metadata': None},
  {'label': 'Demo Vanity',
   'qty': 1.0,
   'uom': 'HRS',
   'rateUsd': 65.0,
   'rowTotalCostUsd': 65.0,
   'category': 'labor',
   'sectionName': 'Demolition',
   'metadata': None},
 

In [51]:
data["estimate_preds"][0].keys()

dict_keys(['valid_file_name', 'rows', 'time_to_estimate_sec'])
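Both the ground-truth rows and the predicted rows carry `qty`, `rateUsd`, and `rowTotalCostUsd`, and in the rows printed above the total looks like `qty * rateUsd`. A minimal sketch for checking that, assuming the relationship is meant to hold for every row (the helper name `inconsistent_rows` and the tolerances are illustrative, not part of the challenge code):

```python
import math


def inconsistent_rows(rows, abs_tol=0.005):
    """Return the rows whose rowTotalCostUsd is not qty * rateUsd.

    Assumption: totals are expected to equal qty * rateUsd up to
    rounding to the cent; abs_tol is an illustrative choice.
    """
    bad = []
    for row in rows:
        expected = row["qty"] * row["rateUsd"]
        if not math.isclose(expected, row["rowTotalCostUsd"], abs_tol=abs_tol):
            bad.append(row)
    return bad


# Two rows copied from the prediction output shown above.
sample_rows = [
    {"label": "Protect Work Area and Surroundings",
     "qty": 75.0, "rateUsd": 0.25, "rowTotalCostUsd": 18.75},
    {"label": "Demo Tub",
     "qty": 2.0, "rateUsd": 65.0, "rowTotalCostUsd": 130.0},
]
print(inconsistent_rows(sample_rows))
```

On the real data this could be called as `inconsistent_rows(data["estimate_preds"][0]["rows"])` or `inconsistent_rows(gt_data["rows"])`.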

In [50]:
sorted(row['label'] for row in data["estimate_preds"][1]['rows'])

['Bathroom Floor Tile',
 'Bathroom Vanity Cabinet',
 'Cabinet Installation Labor',
 'Demo Floor Tile',
 'Demo Tub',
 'Demo Vanity',
 'Disposal Costs',
 'Finish Paint (Sherwin Williams Duration Line)',
 'Grout',
 'Labor',
 'Mask Off and Protect Work Area',
 'Membrane',
 'Mortar',
 'PEX Piping',
 'PVC Drain Piping',
 'Painting Labor',
 'Painting Supplies and Consumables',
 'Pipe Fittings and Adhesives',
 'Plumbing Labor - Rough In and Finish',
 'Prepare Ceiling Surfaces for Paint',
 'Prepare Wall Surfaces for Paint',
 'Protect Work Area and Surroundings',
 'Recessed Lighting',
 'Shower Wall Tile',
 'Spacers',
 'Tile Installation Labor',
 'Toilet']

In [49]:
sorted(row['label'] for row in data["estimate_preds"][0]['rows'])

['Bathroom Floor Tile',
 'Bathroom Light Fixtures',
 'Bathroom Lighting Labor',
 'Bathroom Vanity Cabinet',
 'Cabinet Doors',
 'Cabinet Drawer Hardware',
 'Cabinet Installation Labor',
 'Demo Floor Tile',
 'Demo Shower Pan',
 'Demo Tub',
 'Demo Vanity',
 'Disposal Costs',
 'Finish Paint (Sherwin Williams Duration Line)',
 'Grout',
 'Mask Off and Protect Work Area',
 'Membrane',
 'Mortar',
 'Painting Labor',
 'Painting Supplies and Consumables',
 'Plumbing Labor - Rough In and Finish',
 'Prepare Ceiling Surfaces for Paint',
 'Prepare Wall Surfaces for Paint',
 'Protect Work Area and Surroundings',
 'Shower Drain',
 'Shower Faucet Set',
 'Shower Wall Tile',
 'Spacers',
 'Tile Installation Labor',
 'Toilet']

In [53]:
sorted({row['category'] for row in data["estimate_preds"][0]['rows']})

['labor', 'material']

In [57]:
sorted({row['sectionName'] for row in data["estimate_preds"][0]['rows']})

['Cabinets', 'Demolition', 'Electrical', 'Painting', 'Plumbing', 'Tile']

In [70]:
sorted({row['sectionName'] for row in gt_data['rows']})

['Demolition', 'Electrical', 'Painting', 'Plumbing', 'Tile', 'Trim']
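The two section lists above differ: the prediction has `Cabinets` where the ground truth has `Trim`. A small sketch that makes this comparison explicit with set differences (the helper name `section_diff` is hypothetical; the toy rows below only carry the `sectionName` values printed above):

```python
def section_diff(pred_rows, gt_rows):
    """Return (sections only in the prediction, sections only in the ground truth)."""
    pred = {row["sectionName"] for row in pred_rows}
    gt = {row["sectionName"] for row in gt_rows}
    return sorted(pred - gt), sorted(gt - pred)


# Section names taken from the two outputs above.
pred_rows = [{"sectionName": s} for s in
             ["Cabinets", "Demolition", "Electrical", "Painting", "Plumbing", "Tile"]]
gt_rows = [{"sectionName": s} for s in
           ["Demolition", "Electrical", "Painting", "Plumbing", "Tile", "Trim"]]

only_pred, only_gt = section_diff(pred_rows, gt_rows)
print(only_pred, only_gt)  # ['Cabinets'] ['Trim']
```

With the loaded data, the same call would be `section_diff(data["estimate_preds"][0]["rows"], gt_data["rows"])`.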

In [78]:
# List which ground-truth examples each model output file covers.
for fn in sorted(os.listdir("../examples/model_outputs")):
    with open(os.path.join("../examples/model_outputs", fn), 'r') as f:
        data = json.load(f)
    valid_file_names = {example['valid_file_name'] for example in data["estimate_preds"]}
    print(fn)
    print(sorted(valid_file_names))
    print("-----")

2.json
['example_01', 'example_02', 'example_03', 'example_04', 'example_05', 'example_06', 'example_07', 'example_08', 'example_09', 'example_10', 'example_11', 'example_12', 'example_13']
-----
4.json
['example_01', 'example_02', 'example_03', 'example_04', 'example_05', 'example_06', 'example_07', 'example_08', 'example_09', 'example_10', 'example_11', 'example_12', 'example_13']
-----
5.json
['example_01', 'example_02', 'example_03', 'example_04', 'example_05', 'example_06', 'example_07', 'example_08', 'example_09', 'example_10', 'example_11', 'example_12', 'example_13']
-----
7.json
['example_01', 'example_02', 'example_03', 'example_04', 'example_05', 'example_06', 'example_07', 'example_08', 'example_09', 'example_10', 'example_11', 'example_12', 'example_13']
-----


Each file in `model_outputs` contains two model responses for each of the 13 ground-truth inputs; the `valid_file_name` field of a response names the ground-truth example it answers.
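The loop above de-duplicates names with `set`, so it confirms which examples are covered but not how many responses each one has. A minimal sketch for counting responses per example, assuming each entry of `estimate_preds` is a dict with a `valid_file_name` key as seen earlier (the helper name `responses_per_example` and the toy list are made up for illustration):

```python
from collections import Counter


def responses_per_example(estimate_preds):
    """Count how many responses refer to each ground-truth example."""
    return Counter(pred["valid_file_name"] for pred in estimate_preds)


# Toy stand-in for data["estimate_preds"]; not real model output.
toy_preds = [
    {"valid_file_name": "example_01"},
    {"valid_file_name": "example_01"},
    {"valid_file_name": "example_02"},
    {"valid_file_name": "example_02"},
]

counts = responses_per_example(toy_preds)
print(dict(counts))
# True only if every example has exactly two responses.
print(all(n == 2 for n in counts.values()))
```

Running `responses_per_example(data["estimate_preds"])` on each loaded file would show directly whether every example has exactly two responses.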