## Data Migration/Ingestion Test by Yaron Shamash 
- The task at hand is modeled on part of a common data ingestion workflow. A customer provides
us with two sources of data. One is their customer sheet, which has come from QuickBooks.
The second is their route sheet, which they have created in Excel. In order to ingest their data
into our system, we need to match the customer names in the route sheet against the customer
names in the customer sheet, and parse the data into a JSON object that can be passed to our
API.


In [1]:
import numpy as np
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
import re
from numpy import int64
import json
from pandas import json_normalize
%matplotlib inline

In [2]:
# read datasets and rename the customer sheet columns 
dfc = pd.read_csv('Customers - ServiceCore Data Test.csv')
dfr = pd.read_csv('Routesheet - ServiceCore Data Test.csv')
dfc.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)

In [3]:
# dimensions and view of routesheet df
print(dfr.shape)
dfr.head()

(2528, 8)


Unnamed: 0,Customer,TOILET #,Job Name/Well Name,Address,City,Schedule,Scheduled Day,Charge Type
0,src,1,Ag Pad FRAC,O Street x 59th Ave W1.1 S into,West Greeley,2xWeekly,7/22/18change billing to production,pending
1,Hauer Custom Homes,3611,,"19299 CR 70, Eaton",Eaton,Weekly,address does not exist 10/7/19,MONTHLY (8/1/19)
2,Ridgeway Custom homes,1,,"6879 Crooked Stick, Windsor",West Windsor,Weekly,address does not exist 3/22/19,MONTHLY (9/19/18)
3,ASTER RIDGE,3051,,1827 AA ST,EAST GREELEY,WEEKLY,BISON RIDGE TOOK OVER TOILET 1/27/22,MONTHLY (10/12/21)
4,Blackeagle,"3540, 2489",Angus Compressor Station,"60315 CR 71, Grover\n128x69 E1 N into",Grover,Weekly,BLOWNOVER/TRADED 10/24/19,MONTHLY (6/11/19)


In [4]:
# dimensions and view of customers df
print(dfc.shape)
dfc.head()

(15135, 21)


Unnamed: 0,customer,company,mr./ms./...,first_name,m.i.,last_name,main_phone,main_email,bill_to_1,bill_to_2,...,bill_to_4,bill_to_5,ship_to_1,ship_to_2,ship_to_3,ship_to_4,ship_to_5,terms,sales_tax_code,tax_item
0,*JESUS SANCHEZ,JESUS SANCHEZ,,,,,9670-342-6100,,JESUS SANCHEZ,,...,,,,,,,,,Tax,NON TAXABLE ITEM
1,*JESUS SANCHEZ:17900 CR 5,JESUS SANCHEZ,,,,,9670-342-6100,,JESUS SANCHEZ,,...,,,,,,,,,Tax,NON TAXABLE ITEM
2,1888 INDUSTRIAL SERVICES,1888 INDUSTRIAL SERVICES,,,,,970-702-7610,AP@1888IS.COM,1888 INDUSTRIAL SERVICES,"800 8TH AVE, SUITE 301",...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
3,1888 INDUSTRIAL SERVICES:WELLS RANCH TO REPUBLIC,1888 INDUSTRIAL SERVICES,,JOSEPH,,MONTOYA,970-702-7610,AP@1888IS.COM,1888 INDUSTRIAL SERVICES,"800 8TH AVE, SUITE 301",...,,,,,,,,Net 30,Tax,Wyoming Sales Tax
4,2 RINGS TRUCKING,2 RINGS TRUCKING,,,,,406-289-0901,,2 RINGS TRUCKING,,...,,,,,,,,Due on receipt,Non,NON TAXABLE ITEM


In [5]:
# check data types and nulls
dfc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15135 entries, 0 to 15134
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   customer        15135 non-null  object 
 1   company         14957 non-null  object 
 2   mr./ms./...     11 non-null     object 
 3   first_name      2173 non-null   object 
 4   m.i.            67 non-null     object 
 5   last_name       2033 non-null   object 
 6   main_phone      13135 non-null  object 
 7   main_email      12892 non-null  object 
 8   bill_to_1       15113 non-null  object 
 9   bill_to_2       9797 non-null   object 
 10  bill_to_3       9619 non-null   object 
 11  bill_to_4       38 non-null     object 
 12  bill_to_5       0 non-null      float64
 13  ship_to_1       35 non-null     object 
 14  ship_to_2       32 non-null     object 
 15  ship_to_3       29 non-null     object 
 16  ship_to_4       3 non-null      object 
 17  ship_to_5       0 non-null     

In [6]:
# count nulls per column in the customer df; the customer column has 0 nulls
dfc.isnull().sum()

customer              0
company             178
mr./ms./...       15124
first_name        12962
m.i.              15068
last_name         13102
main_phone         2000
main_email         2243
bill_to_1            22
bill_to_2          5338
bill_to_3          5516
bill_to_4         15097
bill_to_5         15135
ship_to_1         15100
ship_to_2         15103
ship_to_3         15106
ship_to_4         15132
ship_to_5         15135
terms              3560
sales_tax_code     3202
tax_item            599
dtype: int64

Using the two source files provided, implement the following logic:
- Match the route sheet against the customer sheet based on the “Customer” field in both tables.
  - Note: matching is case-insensitive.
  - Only the top-level customer from the QuickBooks customers export should be matched against. For example, the top-level customer in the QuickBooks customer field “John Smith:123 Main Street” would be “John Smith”.
- For each unique customer name (case-insensitive) in the route sheet, create an object consisting of the following fields:
  - From the customer sheet:
    - Customer
    - Bill to 1
    - Bill to 2
    - Bill to 3
    - Main Phone
      - Note: each phone number should be formatted to include only numeric characters. For example “555-123-4567: Tracy” should become “5551234567”.
    - Main Email
    - Terms
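
The two formatting rules in the list above (top-level customer, digits-only phone) can be sketched as small helper functions. This is a minimal illustration using only the standard library; the function names `top_level_customer`, `digits_only` and `match_key` are hypothetical and are not part of the notebook.

```python
import re


def top_level_customer(name: str) -> str:
    """Return the part of a QuickBooks customer name before the first colon."""
    return name.split(":", 1)[0].strip()


def digits_only(phone: str) -> str:
    """Keep only the numeric characters of a phone string."""
    return re.sub(r"\D", "", phone)


def match_key(name: str) -> str:
    """Case-insensitive key for comparing names from the two sheets."""
    return top_level_customer(name).casefold()


print(top_level_customer("John Smith:123 Main Street"))  # John Smith
print(digits_only("555-123-4567: Tracy"))                # 5551234567
```

Note that `digits_only` would also keep the digits of an extension (for example the `106` in `832-698-2203 X 106`), which is one reason the notebook extracts a phone-number pattern first and removes separators afterwards.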

### The customer column contains colons, asterisks, commas and slashes, which should be removed. There are also text strings such as LLC and INC which should be removed for a better match on the route sheet.

In [7]:
# keep only the top-level name: drop everything after the first colon, comma or slash
dfc.customer = dfc.customer.str.split(':').str[0]
dfc.customer = dfc.customer.str.split(',').str[0]
dfc.customer = dfc.customer.str.split('/').str[0]
# remove asterisks and surrounding white space (string methods return a new Series, so assign the result back)
dfc.customer = dfc.customer.str.replace('*', '', regex=False).str.strip()
# LLC/INC suffixes are left in place here; the remaining ones are cleaned manually in the spreadsheet step further down
dfc.customer

0                   JESUS SANCHEZ
1                   JESUS SANCHEZ
2        1888 INDUSTRIAL SERVICES
3        1888 INDUSTRIAL SERVICES
4                2 RINGS TRUCKING
                   ...           
15130                       ZTERS
15131                       ZTERS
15132                       ZTERS
15133                       ZTERS
15134                       ZTERS
Name: customer, Length: 15135, dtype: object

In [8]:
# duplicate rows in the customer column
dfc[dfc["customer"].duplicated()]

Unnamed: 0,customer,company,mr./ms./...,first_name,m.i.,last_name,main_phone,main_email,bill_to_1,bill_to_2,...,bill_to_4,bill_to_5,ship_to_1,ship_to_2,ship_to_3,ship_to_4,ship_to_5,terms,sales_tax_code,tax_item
1,JESUS SANCHEZ,JESUS SANCHEZ,,,,,9670-342-6100,,JESUS SANCHEZ,,...,,,,,,,,,Tax,NON TAXABLE ITEM
3,1888 INDUSTRIAL SERVICES,1888 INDUSTRIAL SERVICES,,JOSEPH,,MONTOYA,970-702-7610,AP@1888IS.COM,1888 INDUSTRIAL SERVICES,"800 8TH AVE, SUITE 301",...,,,,,,,,Net 30,Tax,Wyoming Sales Tax
5,2 RINGS TRUCKING,2 RINGS TRUCKING,,,,,406-289-0901,,2 RINGS TRUCKING,,...,,,,,,,,,Non,NON TAXABLE ITEM
7,2 VALLEY BUILDERS,"2 VALLEY BUILDERS, INC",,,,,970-599-2134- John,2valleybuilders@gmail.com,"2 VALLEY BUILDERS, INC",6637 SPANISH BAY DRIVE,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
8,2 VALLEY BUILDERS,"2 VALLEY BUILDERS, INC",,,,,970-599-2134- John,2valleybuilders@gmail.com,"2 VALLEY BUILDERS, INC",6637 SPANISH BAY DRIVE,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15130,ZTERS,ZTERS INC.,,,,,832-698-2203 X 106 Mary Alvarado/AP,Invoices@zters.com,ZTERS INC.,13727 Office Park Drive,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
15131,ZTERS,ZTERS INC.,,,,,832-698-2203 X 106 Mary Alvarado/AP,Invoices@zters.com,ZTERS INC.,13727 Office Park Drive,...,,,,,,,,,Tax,NON TAXABLE ITEM
15132,ZTERS,ZTERS INC.,,,,,832-698-2203 X 106 Mary Alvarado/AP,Invoices@zters.com,ZTERS INC.,13727 Office Park Drive,...,,,,,,,,Net 30,Tax,NON TAXABLE ITEM
15133,ZTERS,ZTERS INC.,,,,,832-698-2203 X 106 Mary Alvarado/AP,Invoices@zters.com,ZTERS INC.,13727 Office Park Drive,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM


In [9]:
# drop duplicate rows in the customer column and store in a new df
dfcl=dfc.drop_duplicates(subset='customer', keep='first', inplace=False)
dfcl

Unnamed: 0,customer,company,mr./ms./...,first_name,m.i.,last_name,main_phone,main_email,bill_to_1,bill_to_2,...,bill_to_4,bill_to_5,ship_to_1,ship_to_2,ship_to_3,ship_to_4,ship_to_5,terms,sales_tax_code,tax_item
0,JESUS SANCHEZ,JESUS SANCHEZ,,,,,9670-342-6100,,JESUS SANCHEZ,,...,,,,,,,,,Tax,NON TAXABLE ITEM
2,1888 INDUSTRIAL SERVICES,1888 INDUSTRIAL SERVICES,,,,,970-702-7610,AP@1888IS.COM,1888 INDUSTRIAL SERVICES,"800 8TH AVE, SUITE 301",...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
4,2 RINGS TRUCKING,2 RINGS TRUCKING,,,,,406-289-0901,,2 RINGS TRUCKING,,...,,,,,,,,Due on receipt,Non,NON TAXABLE ITEM
6,2 VALLEY BUILDERS,"2 VALLEY BUILDERS, INC",,,,,970-599-2134- John,2valleybuilders@gmail.com,"2 VALLEY BUILDERS, INC",6637 SPANISH BAY DRIVE,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
15,2020 APEX LLC,2020 APEX LLC,,,,,970.381.1081 Ryan Andre,RANDRE@SEARSREALESTATE.COM,2020 APEX LLC,,...,,,,,,,,CREDIT CARD ONLY,Tax,CREDIT CARD FEE 3%
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15053,ZAK DIRT,"ZAK DIRT, INC.",,BRANDI,,WILSON,970-535-4657,BWILSON@ZAKDIRT.COM; accounting@zakdirt.com,"ZAK DIRT, INC.",14290 HILLTOP ROAD,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
15060,ZAP Engineering & Construction Services,ZAP Engineering & Construction Services,,,,,303-565-5567,apinvoices@zapecs.com,ZAP Engineering & Construction Services,"333 S. Allison Pky, Suite 100",...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
15063,ZAYALA FIELD SERVICES,ZAYALA FIELD SERVICES,,,,,303-549-5978,AZAVALA@ZAVALAFIELD SERVICES.COM,ZAYALA FIELD SERVICES,,...,,,,,,,,CREDIT CARD ONLY,Tax,CREDIT CARD FEE 3%
15066,ZAYRA DIAZ,ZAYRA DIAZ,,,,,970-888-2605,ZDIAZ1820@GMAIL.COM,ZAYRA DIAZ,,...,,,,,,,,CREDIT CARD ONLY,Non,NON TAXABLE ITEM


### 3,183 rows remain. The main_phone column has to be cleaned and reformatted. The first step will be to remove some of the text strings from the column.

In [10]:
# extract the first phone-number pattern, dropping names and other trailing text
# (assign() returns a new DataFrame, which avoids the SettingWithCopyWarning)
dfcl = dfcl.assign(main_phone=dfcl['main_phone'].str.extract(r'((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})', expand=False))
dfcl


Unnamed: 0,customer,company,mr./ms./...,first_name,m.i.,last_name,main_phone,main_email,bill_to_1,bill_to_2,...,bill_to_4,bill_to_5,ship_to_1,ship_to_2,ship_to_3,ship_to_4,ship_to_5,terms,sales_tax_code,tax_item
0,JESUS SANCHEZ,JESUS SANCHEZ,,,,,670-342-6100,,JESUS SANCHEZ,,...,,,,,,,,,Tax,NON TAXABLE ITEM
2,1888 INDUSTRIAL SERVICES,1888 INDUSTRIAL SERVICES,,,,,970-702-7610,AP@1888IS.COM,1888 INDUSTRIAL SERVICES,"800 8TH AVE, SUITE 301",...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
4,2 RINGS TRUCKING,2 RINGS TRUCKING,,,,,406-289-0901,,2 RINGS TRUCKING,,...,,,,,,,,Due on receipt,Non,NON TAXABLE ITEM
6,2 VALLEY BUILDERS,"2 VALLEY BUILDERS, INC",,,,,970-599-2134,2valleybuilders@gmail.com,"2 VALLEY BUILDERS, INC",6637 SPANISH BAY DRIVE,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
15,2020 APEX LLC,2020 APEX LLC,,,,,970.381.1081,RANDRE@SEARSREALESTATE.COM,2020 APEX LLC,,...,,,,,,,,CREDIT CARD ONLY,Tax,CREDIT CARD FEE 3%
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15053,ZAK DIRT,"ZAK DIRT, INC.",,BRANDI,,WILSON,970-535-4657,BWILSON@ZAKDIRT.COM; accounting@zakdirt.com,"ZAK DIRT, INC.",14290 HILLTOP ROAD,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
15060,ZAP Engineering & Construction Services,ZAP Engineering & Construction Services,,,,,303-565-5567,apinvoices@zapecs.com,ZAP Engineering & Construction Services,"333 S. Allison Pky, Suite 100",...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
15063,ZAYALA FIELD SERVICES,ZAYALA FIELD SERVICES,,,,,303-549-5978,AZAVALA@ZAVALAFIELD SERVICES.COM,ZAYALA FIELD SERVICES,,...,,,,,,,,CREDIT CARD ONLY,Tax,CREDIT CARD FEE 3%
15066,ZAYRA DIAZ,ZAYRA DIAZ,,,,,970-888-2605,ZDIAZ1820@GMAIL.COM,ZAYRA DIAZ,,...,,,,,,,,CREDIT CARD ONLY,Non,NON TAXABLE ITEM


In [11]:
# remove periods, dashes and any other non-digit characters from the main_phone column
dfcl = dfcl.assign(main_phone=dfcl['main_phone'].str.replace(r'\D', '', regex=True))
dfcl


Unnamed: 0,customer,company,mr./ms./...,first_name,m.i.,last_name,main_phone,main_email,bill_to_1,bill_to_2,...,bill_to_4,bill_to_5,ship_to_1,ship_to_2,ship_to_3,ship_to_4,ship_to_5,terms,sales_tax_code,tax_item
0,JESUS SANCHEZ,JESUS SANCHEZ,,,,,6703426100,,JESUS SANCHEZ,,...,,,,,,,,,Tax,NON TAXABLE ITEM
2,1888 INDUSTRIAL SERVICES,1888 INDUSTRIAL SERVICES,,,,,9707027610,AP@1888IS.COM,1888 INDUSTRIAL SERVICES,"800 8TH AVE, SUITE 301",...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
4,2 RINGS TRUCKING,2 RINGS TRUCKING,,,,,4062890901,,2 RINGS TRUCKING,,...,,,,,,,,Due on receipt,Non,NON TAXABLE ITEM
6,2 VALLEY BUILDERS,"2 VALLEY BUILDERS, INC",,,,,9705992134,2valleybuilders@gmail.com,"2 VALLEY BUILDERS, INC",6637 SPANISH BAY DRIVE,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
15,2020 APEX LLC,2020 APEX LLC,,,,,9703811081,RANDRE@SEARSREALESTATE.COM,2020 APEX LLC,,...,,,,,,,,CREDIT CARD ONLY,Tax,CREDIT CARD FEE 3%
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15053,ZAK DIRT,"ZAK DIRT, INC.",,BRANDI,,WILSON,9705354657,BWILSON@ZAKDIRT.COM; accounting@zakdirt.com,"ZAK DIRT, INC.",14290 HILLTOP ROAD,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
15060,ZAP Engineering & Construction Services,ZAP Engineering & Construction Services,,,,,3035655567,apinvoices@zapecs.com,ZAP Engineering & Construction Services,"333 S. Allison Pky, Suite 100",...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
15063,ZAYALA FIELD SERVICES,ZAYALA FIELD SERVICES,,,,,3035495978,AZAVALA@ZAVALAFIELD SERVICES.COM,ZAYALA FIELD SERVICES,,...,,,,,,,,CREDIT CARD ONLY,Tax,CREDIT CARD FEE 3%
15066,ZAYRA DIAZ,ZAYRA DIAZ,,,,,9708882605,ZDIAZ1820@GMAIL.COM,ZAYRA DIAZ,,...,,,,,,,,CREDIT CARD ONLY,Non,NON TAXABLE ITEM


In [12]:
# phone number column successfully cleaned
dfcl.main_phone.value_counts()

6013191527    3
2539735556    3
9706899345    2
9702302052    2
7207239130    2
             ..
3607262334    1
2077524852    1
3037202824    1
3039319616    1
8326982203    1
Name: main_phone, Length: 2902, dtype: int64

In [13]:
dfcl.main_phone.value_counts(sort=False)

6703426100    1
9707027610    1
4062890901    2
9705992134    1
9703811081    1
             ..
9705354657    1
3035655567    1
3035495978    1
9708882605    1
8326982203    1
Name: main_phone, Length: 2902, dtype: int64

### The customer column seems to be closely related to the company column. I'm going to filter out rows where the two columns do not match and see what I can fix manually in a spreadsheet. I will also create a separate dataframe with the clean rows and will append the two dfs.

In [14]:
# filter for matching columns and store for later 
dfrop= dfcl.loc[(dfcl['company'] == dfcl['customer'])]
dfrop

Unnamed: 0,customer,company,mr./ms./...,first_name,m.i.,last_name,main_phone,main_email,bill_to_1,bill_to_2,...,bill_to_4,bill_to_5,ship_to_1,ship_to_2,ship_to_3,ship_to_4,ship_to_5,terms,sales_tax_code,tax_item
0,JESUS SANCHEZ,JESUS SANCHEZ,,,,,6703426100,,JESUS SANCHEZ,,...,,,,,,,,,Tax,NON TAXABLE ITEM
2,1888 INDUSTRIAL SERVICES,1888 INDUSTRIAL SERVICES,,,,,9707027610,AP@1888IS.COM,1888 INDUSTRIAL SERVICES,"800 8TH AVE, SUITE 301",...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
4,2 RINGS TRUCKING,2 RINGS TRUCKING,,,,,4062890901,,2 RINGS TRUCKING,,...,,,,,,,,Due on receipt,Non,NON TAXABLE ITEM
15,2020 APEX LLC,2020 APEX LLC,,,,,9703811081,RANDRE@SEARSREALESTATE.COM,2020 APEX LLC,,...,,,,,,,,CREDIT CARD ONLY,Tax,CREDIT CARD FEE 3%
21,3 BEANS LLC,3 BEANS LLC,,,,,2539735556,3BEANSLLC@GMAIL.COM,3 BEANS LLC,,...,,,,,,,,CREDIT CARD ONLY,Tax,CREDIT CARD FEE 3%
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15049,ZACH ROSATO,ZACH ROSATO,,,,,3038189442,,ZACH ROSATO,,...,,,,,,,,CREDIT CARD ONLY,Tax,NON TAXABLE ITEM
15051,ZACH SNAVELY,ZACH SNAVELY,,,,,7207626709,,ZACH SNAVELY,,...,,,,,,,,CREDIT CARD ONLY,Tax,NON TAXABLE ITEM
15060,ZAP Engineering & Construction Services,ZAP Engineering & Construction Services,,,,,3035655567,apinvoices@zapecs.com,ZAP Engineering & Construction Services,"333 S. Allison Pky, Suite 100",...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
15063,ZAYALA FIELD SERVICES,ZAYALA FIELD SERVICES,,,,,3035495978,AZAVALA@ZAVALAFIELD SERVICES.COM,ZAYALA FIELD SERVICES,,...,,,,,,,,CREDIT CARD ONLY,Tax,CREDIT CARD FEE 3%


In [15]:
# rows where customer and company name don't match
dfgc=dfcl.loc[(dfcl['company'] != dfcl['customer'])]
dfgc

Unnamed: 0,customer,company,mr./ms./...,first_name,m.i.,last_name,main_phone,main_email,bill_to_1,bill_to_2,...,bill_to_4,bill_to_5,ship_to_1,ship_to_2,ship_to_3,ship_to_4,ship_to_5,terms,sales_tax_code,tax_item
6,2 VALLEY BUILDERS,"2 VALLEY BUILDERS, INC",,,,,9705992134,2valleybuilders@gmail.com,"2 VALLEY BUILDERS, INC",6637 SPANISH BAY DRIVE,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
23,307 MEAT COMPANY,(307) MEAT COMPANY,,,,,3073439001,kelcey@307meat.com,(307) MEAT COMPANY,"3745 Cherrywood, E St.",...,,,,,,,,CREDIT CARD ONLY,Tax,NON TAXABLE ITEM
24,38 NORTH CONSTRUCTION,38 NORTH CONSTRUCTION GROUP,,JIM,,HOPPER,7193589834,accountspayable@38northcg.com,38 NORTH CONSTRUCTION GROUP,"11641 Ridgeline Drive, Unit 160",...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
33,4X INDUSTRIAL,"4X INDUSTRIAL, LLC",,,,,9703521790,ap@4xindustrial.com,"4X Industrial, LLC",800 8th Ave #300,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
72,5280 S SERVICES,"5280 S SERVICES, LLC",,FRANK,,SILVA,9705186487,heather@5280sservices.com,"5280 S SERVICES, LLC",18494 CR 39,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14996,WSP,LT ENVIRONMENTAL INC,,,,,3039625523,ap@ltenv.com,WSP USA,4600 WEST 60TH AVE,...,,,,,,,,Net 60,Non,NON TAXABLE ITEM
15004,XCEL ENERGY - Customer,XCEL ENERGY,,,,,,Kami.R.Moore@xcelenergy.com,XCEL ENERGY,Attn: Robert McKay,...,,,,,,,,CREDIT CARD ONLY,Non,CREDIT CARD FEE 3%
15044,z GIENGERICH STRUCTURES DO NOT USE,GEINGERICH STRUCTURES,,,,,9702302052,will@giengerichstructures.com,GEINGERICH STRUCTURES,,...,,,,,,,,Due on receipt,Non,NON TAXABLE ITEM
15053,ZAK DIRT,"ZAK DIRT, INC.",,BRANDI,,WILSON,9705354657,BWILSON@ZAKDIRT.COM; accounting@zakdirt.com,"ZAK DIRT, INC.",14290 HILLTOP ROAD,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM


In [16]:
#  create csv file for viewing in spreadsheet 
dfgc.to_csv('dfgc.csv')

### Notes on cleaning the customer column:
#### - I would check with the db administrator before making any changes to the Customer sheet
#### - deleted "base rate sheet" from about 100 rows, which left only the company name in the column
#### - deleted remaining LLC, INC, Co from a total of about 10 rows
#### - kept rows marked "collections"
#### - most of the other non-matching rows will be matched when they are in upper case
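
The manual suffix clean-up described in these notes could also be approximated in code. The sketch below is only an illustration: the suffix list (`LLC`, `INC`, `CO`) is taken from the notes, while the pattern `SUFFIX_PATTERN` and the helper `normalize_name` are hypothetical and would need reviewing against the real data before use.

```python
import re

# Hypothetical pattern: one trailing legal suffix, optionally preceded by a comma
# and followed by a period, e.g. ", INC." or " LLC".
SUFFIX_PATTERN = re.compile(r"[\s,]+(?:LLC|INC|CO)\.?$", flags=re.IGNORECASE)


def normalize_name(name: str) -> str:
    """Upper-case a name and strip one trailing suffix such as ', LLC'."""
    cleaned = name.strip().upper()
    cleaned = SUFFIX_PATTERN.sub("", cleaned)
    return cleaned.strip()


names = ["2 VALLEY BUILDERS, INC", "4X Industrial, LLC", "ZAK DIRT, INC.", "2 RINGS TRUCKING"]
normalized = [normalize_name(n) for n in names]
print(normalized)
# ['2 VALLEY BUILDERS', '4X INDUSTRIAL', 'ZAK DIRT', '2 RINGS TRUCKING']
```

On a pandas column the same helper could be applied with `Series.map(normalize_name)`. A rule like this is blunt, so a manual review of the changed rows, as done in the spreadsheet here, is still a reasonable safeguard.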

Read the manually cleaned rows back from the Excel file 'dfgcc.xlsx', using the first column as the index.
Save the resulting dataframe to the CSV file 'dfgcc.csv' with the to_csv() method. The index column is included in the CSV file because to_csv() writes the index by default (index=True).

In [17]:
#   413 clean rows remaining
dfgcc = pd.read_excel('dfgcc.xlsx', index_col=0)
dfgcc.to_csv('dfgcc.csv')
dfgcc

Unnamed: 0,customer,company,mr./ms./...,first_name,m.i.,last_name,main_phone,main_email,bill_to_1,bill_to_2,...,bill_to_4,bill_to_5,ship_to_1,ship_to_2,ship_to_3,ship_to_4,ship_to_5,terms,sales_tax_code,tax_item
6,2 VALLEY BUILDERS,"2 VALLEY BUILDERS, INC",,,,,9.705992e+09,2valleybuilders@gmail.com,"2 VALLEY BUILDERS, INC",6637 SPANISH BAY DRIVE,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
23,307 MEAT COMPANY,(307) MEAT COMPANY,,,,,3.073439e+09,kelcey@307meat.com,(307) MEAT COMPANY,"3745 Cherrywood, E St.",...,,,,,,,,CREDIT CARD ONLY,Tax,NON TAXABLE ITEM
24,38 NORTH CONSTRUCTION,38 NORTH CONSTRUCTION GROUP,,JIM,,HOPPER,7.193590e+09,accountspayable@38northcg.com,38 NORTH CONSTRUCTION GROUP,"11641 Ridgeline Drive, Unit 160",...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
33,4X INDUSTRIAL,"4X INDUSTRIAL, LLC",,,,,9.703522e+09,ap@4xindustrial.com,"4X Industrial, LLC",800 8th Ave #300,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
72,5280 S SERVICES,"5280 S SERVICES, LLC",,FRANK,,SILVA,9.705186e+09,heather@5280sservices.com,"5280 S SERVICES, LLC",18494 CR 39,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14996,WSP,LT ENVIRONMENTAL INC,,,,,3.039626e+09,ap@ltenv.com,WSP USA,4600 WEST 60TH AVE,...,,,,,,,,Net 60,Non,NON TAXABLE ITEM
15004,XCEL ENERGY - Customer,XCEL ENERGY,,,,,,Kami.R.Moore@xcelenergy.com,XCEL ENERGY,Attn: Robert McKay,...,,,,,,,,CREDIT CARD ONLY,Non,CREDIT CARD FEE 3%
15044,z GIENGERICH STRUCTURES DO NOT USE,GEINGERICH STRUCTURES,,,,,9.702302e+09,will@giengerichstructures.com,GEINGERICH STRUCTURES,,...,,,,,,,,Due on receipt,Non,NON TAXABLE ITEM
15053,ZAK DIRT,"ZAK DIRT, INC.",,BRANDI,,WILSON,9.705355e+09,BWILSON@ZAKDIRT.COM; accounting@zakdirt.com,"ZAK DIRT, INC.",14290 HILLTOP ROAD,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM


In [18]:
# No duplicates in the new df
dfgcc[dfgcc["customer"].duplicated()] 

Unnamed: 0,customer,company,mr./ms./...,first_name,m.i.,last_name,main_phone,main_email,bill_to_1,bill_to_2,...,bill_to_4,bill_to_5,ship_to_1,ship_to_2,ship_to_3,ship_to_4,ship_to_5,terms,sales_tax_code,tax_item


### At this point I am going to append the manually cleaned df to the previously filtered df. The combined df has 3,183 rows, and its customer column looks much cleaner and more closely related (for joining) to the Customer column in the route sheet.

In [19]:
# append the manually cleaned rows to the filtered clean dfrop
# (pd.concat is used because DataFrame.append was deprecated in pandas 1.4 and removed in 2.0)
dfm = pd.concat([dfrop, dfgcc])
dfm


Unnamed: 0,customer,company,mr./ms./...,first_name,m.i.,last_name,main_phone,main_email,bill_to_1,bill_to_2,...,bill_to_4,bill_to_5,ship_to_1,ship_to_2,ship_to_3,ship_to_4,ship_to_5,terms,sales_tax_code,tax_item
0,JESUS SANCHEZ,JESUS SANCHEZ,,,,,6703426100,,JESUS SANCHEZ,,...,,,,,,,,,Tax,NON TAXABLE ITEM
2,1888 INDUSTRIAL SERVICES,1888 INDUSTRIAL SERVICES,,,,,9707027610,AP@1888IS.COM,1888 INDUSTRIAL SERVICES,"800 8TH AVE, SUITE 301",...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
4,2 RINGS TRUCKING,2 RINGS TRUCKING,,,,,4062890901,,2 RINGS TRUCKING,,...,,,,,,,,Due on receipt,Non,NON TAXABLE ITEM
15,2020 APEX LLC,2020 APEX LLC,,,,,9703811081,RANDRE@SEARSREALESTATE.COM,2020 APEX LLC,,...,,,,,,,,CREDIT CARD ONLY,Tax,CREDIT CARD FEE 3%
21,3 BEANS LLC,3 BEANS LLC,,,,,2539735556,3BEANSLLC@GMAIL.COM,3 BEANS LLC,,...,,,,,,,,CREDIT CARD ONLY,Tax,CREDIT CARD FEE 3%
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14996,WSP,LT ENVIRONMENTAL INC,,,,,3039625523.0,ap@ltenv.com,WSP USA,4600 WEST 60TH AVE,...,,,,,,,,Net 60,Non,NON TAXABLE ITEM
15004,XCEL ENERGY - Customer,XCEL ENERGY,,,,,,Kami.R.Moore@xcelenergy.com,XCEL ENERGY,Attn: Robert McKay,...,,,,,,,,CREDIT CARD ONLY,Non,CREDIT CARD FEE 3%
15044,z GIENGERICH STRUCTURES DO NOT USE,GEINGERICH STRUCTURES,,,,,9702302052.0,will@giengerichstructures.com,GEINGERICH STRUCTURES,,...,,,,,,,,Due on receipt,Non,NON TAXABLE ITEM
15053,ZAK DIRT,"ZAK DIRT, INC.",,BRANDI,,WILSON,9705354657.0,BWILSON@ZAKDIRT.COM; accounting@zakdirt.com,"ZAK DIRT, INC.",14290 HILLTOP ROAD,...,,,,,,,,Net 30,Non,NON TAXABLE ITEM
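
In the combined frame above, the phone numbers that went through the Excel round trip are shown with a trailing `.0` (for example `3039625523.0`), because they were read back as floats. A minimal sketch of one way to bring them back to digit strings follows; the helper name `phone_to_digits` is hypothetical.

```python
import math


def phone_to_digits(value):
    """Normalise a phone value that may be a str, a float (after Excel) or missing."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    if isinstance(value, float):
        value = int(value)  # 3039625523.0 -> 3039625523
    return "".join(ch for ch in str(value) if ch.isdigit())


print(phone_to_digits(3039625523.0))  # 3039625523
print(phone_to_digits("9707027610"))  # 9707027610
print(phone_to_digits(float("nan")))  # None
```

Applied to the notebook's frame this would look like `dfm['main_phone'] = dfm['main_phone'].map(phone_to_digits)`. Reading the Excel file with the phone column as text would be an alternative, assuming the spreadsheet stored the values as text in the first place.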


#### Clean the Routes Sheet

In [20]:
dfm.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3183 entries, 0 to 15068
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   customer        3183 non-null   object 
 1   company         3124 non-null   object 
 2   mr./ms./...     1 non-null      object 
 3   first_name      262 non-null    object 
 4   m.i.            11 non-null     object 
 5   last_name       237 non-null    object 
 6   main_phone      2965 non-null   object 
 7   main_email      2484 non-null   object 
 8   bill_to_1       3177 non-null   object 
 9   bill_to_2       903 non-null    object 
 10  bill_to_3       866 non-null    object 
 11  bill_to_4       3 non-null      object 
 12  bill_to_5       0 non-null      float64
 13  ship_to_1       27 non-null     object 
 14  ship_to_2       25 non-null     object 
 15  ship_to_3       22 non-null     object 
 16  ship_to_4       2 non-null      object 
 17  ship_to_5       0 non-null      

In [21]:
dfm.rename(columns={"customer": "Customer"}, inplace=True)

In [22]:
dfr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2528 entries, 0 to 2527
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Customer            2528 non-null   object
 1   TOILET #            2528 non-null   object
 2   Job Name/Well Name  1077 non-null   object
 3   Address             2384 non-null   object
 4   City                2523 non-null   object
 5   Schedule            2528 non-null   object
 6   Scheduled Day       2528 non-null   object
 7   Charge Type         2517 non-null   object
dtypes: object(8)
memory usage: 158.1+ KB


In [23]:
dfr = dfr.reindex(columns=['Customer','Job Name/Well Name', 'Address', 'City','Schedule','Scheduled Day','Charge Type', 'Toilet#'])
dfr

Unnamed: 0,Customer,Job Name/Well Name,Address,City,Schedule,Scheduled Day,Charge Type,Toilet#
0,src,Ag Pad FRAC,O Street x 59th Ave W1.1 S into,West Greeley,2xWeekly,7/22/18change billing to production,pending,
1,Hauer Custom Homes,,"19299 CR 70, Eaton",Eaton,Weekly,address does not exist 10/7/19,MONTHLY (8/1/19),
2,Ridgeway Custom homes,,"6879 Crooked Stick, Windsor",West Windsor,Weekly,address does not exist 3/22/19,MONTHLY (9/19/18),
3,ASTER RIDGE,,1827 AA ST,EAST GREELEY,WEEKLY,BISON RIDGE TOOK OVER TOILET 1/27/22,MONTHLY (10/12/21),
4,Blackeagle,Angus Compressor Station,"60315 CR 71, Grover\n128x69 E1 N into",Grover,Weekly,BLOWNOVER/TRADED 10/24/19,MONTHLY (6/11/19),
...,...,...,...,...,...,...,...,...
2523,BLUE BEAR WASTE,,7789 W 5TH AVE,LAKEWOOD,WEEKLY,WEDNESDAY,MONTHLY (5/20/22),
2524,BEAR CREEK,,1074 MARFELL ST,ERIE,WEEKLY,TUESDAY,MONTHLY (5/20/22),
2525,TRISTAR HEATING & AIR,,21350 CR 10,HUDSON,WEEKLY,FRIDAY,MONTHLY (5/20/22),
2526,MARTIN MARIETTA,,,NORTH FORT COLLINS,2XWEEKLY,TUESDAY/FRIDAY,MONTHLY (5/20/22),


In [24]:
# remove surrounding white space and any text after a line break in the Customer column
# (string methods return a new Series, so the result is assigned back; LLC/INC suffixes are left as they are)
dfr.Customer = dfr.Customer.str.split('\n').str[0].str.strip()
dfr

Unnamed: 0,Customer,Job Name/Well Name,Address,City,Schedule,Scheduled Day,Charge Type,Toilet#
0,src,Ag Pad FRAC,O Street x 59th Ave W1.1 S into,West Greeley,2xWeekly,7/22/18change billing to production,pending,
1,Hauer Custom Homes,,"19299 CR 70, Eaton",Eaton,Weekly,address does not exist 10/7/19,MONTHLY (8/1/19),
2,Ridgeway Custom homes,,"6879 Crooked Stick, Windsor",West Windsor,Weekly,address does not exist 3/22/19,MONTHLY (9/19/18),
3,ASTER RIDGE,,1827 AA ST,EAST GREELEY,WEEKLY,BISON RIDGE TOOK OVER TOILET 1/27/22,MONTHLY (10/12/21),
4,Blackeagle,Angus Compressor Station,"60315 CR 71, Grover\n128x69 E1 N into",Grover,Weekly,BLOWNOVER/TRADED 10/24/19,MONTHLY (6/11/19),
...,...,...,...,...,...,...,...,...
2523,BLUE BEAR WASTE,,7789 W 5TH AVE,LAKEWOOD,WEEKLY,WEDNESDAY,MONTHLY (5/20/22),
2524,BEAR CREEK,,1074 MARFELL ST,ERIE,WEEKLY,TUESDAY,MONTHLY (5/20/22),
2525,TRISTAR HEATING & AIR,,21350 CR 10,HUDSON,WEEKLY,FRIDAY,MONTHLY (5/20/22),
2526,MARTIN MARIETTA,,,NORTH FORT COLLINS,2XWEEKLY,TUESDAY/FRIDAY,MONTHLY (5/20/22),


In [25]:
# 1,706 duplicate customer names in the route sheet
dfr.Customer.duplicated().sum()

1706

In [26]:
# convert both columns to Upper Case
dfm.Customer = dfm.Customer.str.upper()
dfr.Customer  = dfr.Customer.str.upper()
dfr

Unnamed: 0,Customer,Job Name/Well Name,Address,City,Schedule,Scheduled Day,Charge Type,Toilet#
0,SRC,Ag Pad FRAC,O Street x 59th Ave W1.1 S into,West Greeley,2xWeekly,7/22/18change billing to production,pending,
1,HAUER CUSTOM HOMES,,"19299 CR 70, Eaton",Eaton,Weekly,address does not exist 10/7/19,MONTHLY (8/1/19),
2,RIDGEWAY CUSTOM HOMES,,"6879 Crooked Stick, Windsor",West Windsor,Weekly,address does not exist 3/22/19,MONTHLY (9/19/18),
3,ASTER RIDGE,,1827 AA ST,EAST GREELEY,WEEKLY,BISON RIDGE TOOK OVER TOILET 1/27/22,MONTHLY (10/12/21),
4,BLACKEAGLE,Angus Compressor Station,"60315 CR 71, Grover\n128x69 E1 N into",Grover,Weekly,BLOWNOVER/TRADED 10/24/19,MONTHLY (6/11/19),
...,...,...,...,...,...,...,...,...
2523,BLUE BEAR WASTE,,7789 W 5TH AVE,LAKEWOOD,WEEKLY,WEDNESDAY,MONTHLY (5/20/22),
2524,BEAR CREEK,,1074 MARFELL ST,ERIE,WEEKLY,TUESDAY,MONTHLY (5/20/22),
2525,TRISTAR HEATING & AIR,,21350 CR 10,HUDSON,WEEKLY,FRIDAY,MONTHLY (5/20/22),
2526,MARTIN MARIETTA,,,NORTH FORT COLLINS,2XWEEKLY,TUESDAY/FRIDAY,MONTHLY (5/20/22),


### In order to ingest their data into our system, we need to match the customer names in the route sheet against the customer names in the customer sheet, and parse the data into a JSON object that can be passed to our API.

### Attach an array created from the route sheet
- An array of jobs, each with the fields:
  - Job Name/Well Name
  - Address
  - City
  - Schedule
  - Scheduled Day
  - Charge Type
  - TOILET #

In [27]:
# Match the route sheet against the customer sheet based on the "Customer" field
merged_df = pd.merge(dfr, dfm, on="Customer", how="left")
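
A left merge keeps every route sheet row, so customers without a match simply come back with NaN in the customer sheet fields. One way to see how many names failed to match is the `indicator` option of `pd.merge`. The sketch below uses toy data with made-up customer names; `validate="many_to_one"` is an optional extra check that the customer side has no duplicate keys.

```python
import pandas as pd

# Toy stand-ins for the route sheet (dfr) and the cleaned customer sheet (dfm).
routes = pd.DataFrame({
    "Customer": ["ACME", "BETA", "BETA"],
    "City": ["Eaton", "Erie", "Hudson"],
})
customers = pd.DataFrame({"Customer": ["BETA"], "terms": ["Net 30"]})

checked = pd.merge(
    routes,
    customers,
    on="Customer",
    how="left",
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
    validate="many_to_one",  # raises if a customer appears twice on the right
)

unmatched = checked.loc[checked["_merge"] == "left_only", "Customer"].unique()
print(len(checked), list(unmatched))  # 3 ['ACME']
```

The list of unmatched names is a practical starting point for deciding which spellings still need manual attention.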

In [28]:
# For each unique customer name in the route sheet, create an object with the required fields
output = []
for customer in merged_df["Customer"].unique():
    customer_data = merged_df.loc[merged_df["Customer"] == customer].iloc[0]
    jobs_data = customer_data.loc[["Job Name/Well Name", "Address", "City", "Schedule", "Scheduled Day", "Charge Type", "Toilet#"]].fillna('').to_frame().T.to_dict(orient="records")
    customer_info = {
        "Customer": customer_data["Customer"],
        "Bill to 1": customer_data["bill_to_1"],
        "Bill to 2": customer_data["bill_to_2"],
        "Bill to 3": customer_data["bill_to_3"],
        "Main Phone": customer_data["main_phone"],
        "Main Email": customer_data["main_email"],
        "Terms": customer_data["terms"],
        "jobs": jobs_data
    }
    output.append(customer_info)

# Return a JSON-formatted array of the objects
import json
json_output = json.dumps(output)
print(json_output)


[{"Customer": "SRC", "Bill to 1": NaN, "Bill to 2": NaN, "Bill to 3": NaN, "Main Phone": NaN, "Main Email": NaN, "Terms": NaN, "jobs": [{"Job Name/Well Name": "Ag Pad FRAC", "Address": "O Street x 59th Ave W1.1 S into", "City": "West Greeley", "Schedule": "2xWeekly", "Scheduled Day": "7/22/18change billing to production", "Charge Type": "pending", "Toilet#": ""}]}, {"Customer": "HAUER CUSTOM HOMES", "Bill to 1": NaN, "Bill to 2": NaN, "Bill to 3": NaN, "Main Phone": NaN, "Main Email": NaN, "Terms": NaN, "jobs": [{"Job Name/Well Name": "", "Address": "19299 CR 70, Eaton", "City": "Eaton", "Schedule": "Weekly", "Scheduled Day": "address does not exist 10/7/19", "Charge Type": "MONTHLY (8/1/19)", "Toilet#": ""}]}, {"Customer": "RIDGEWAY CUSTOM HOMES", "Bill to 1": NaN, "Bill to 2": NaN, "Bill to 3": NaN, "Main Phone": NaN, "Main Email": NaN, "Terms": NaN, "jobs": [{"Job Name/Well Name": "", "Address": "6879 Crooked Stick, Windsor", "City": "West Windsor", "Schedule": "Weekly", "Scheduled 

In [29]:
output

[{'Customer': 'SRC',
  'Bill to 1': nan,
  'Bill to 2': nan,
  'Bill to 3': nan,
  'Main Phone': nan,
  'Main Email': nan,
  'Terms': nan,
  'jobs': [{'Job Name/Well Name': 'Ag Pad FRAC',
    'Address': 'O Street x 59th Ave W1.1 S into',
    'City': 'West Greeley',
    'Schedule': '2xWeekly',
    'Scheduled Day': '7/22/18change billing to production',
    'Charge Type': 'pending',
    'Toilet#': ''}]},
 {'Customer': 'HAUER CUSTOM HOMES',
  'Bill to 1': nan,
  'Bill to 2': nan,
  'Bill to 3': nan,
  'Main Phone': nan,
  'Main Email': nan,
  'Terms': nan,
  'jobs': [{'Job Name/Well Name': '',
    'Address': '19299 CR 70, Eaton',
    'City': 'Eaton',
    'Schedule': 'Weekly',
    'Scheduled Day': 'address does not exist 10/7/19',
    'Charge Type': 'MONTHLY (8/1/19)',
    'Toilet#': ''}]},
 {'Customer': 'RIDGEWAY CUSTOM HOMES',
  'Bill to 1': nan,
  'Bill to 2': nan,
  'Bill to 3': nan,
  'Main Phone': nan,
  'Main Email': nan,
  'Terms': nan,
  'jobs': [{'Job Name/Well Name': '',
    'Ad

The code above loops over each unique customer name in the "Customer" column of merged_df and, for each name, keeps the first matching row.

Specifically, merged_df["Customer"].unique() returns an array of the unique values in the "Customer" column, and the for loop iterates over them.

For each name, merged_df.loc[merged_df["Customer"] == customer] selects every row whose "Customer" value matches, and .iloc[0] then takes the first row of that subset as customer_data.

Because only that first row is used, each customer ends up with a single entry in "jobs", even when the route sheet lists several rows for that customer. The billing and contact fields come from the columns joined in by the left merge, so they are NaN for any route-sheet customer without a match in dfm.
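
To see what keeping only the first row means in practice, here is a minimal sketch on a toy DataFrame. The frame `toy` and its values are made up for illustration and merely stand in for merged_df. `drop_duplicates(keep="first")` is used as a rough equivalent of the loop with .iloc[0], not as the notebook's own method.

```python
import pandas as pd

# Hypothetical stand-in for merged_df: one customer appears on two rows.
toy = pd.DataFrame({
    "Customer": ["SRC", "SRC", "BLACKEAGLE"],
    "Address": ["O Street x 59th Ave", "83rd Ave x Hwy 34", "60315 CR 71"],
})

# Keep one row per customer, like the loop that takes .iloc[0].
first_rows = toy.drop_duplicates(subset="Customer", keep="first")

# Count how many route-sheet rows each customer actually has.
rows_per_customer = toy.groupby("Customer", sort=False).size()

# Rows that the first-row approach leaves out of the output.
dropped = len(toy) - len(first_rows)

print(rows_per_customer)
print("rows left out:", dropped)
```

A non-zero count on the real data would mean that some customers have more jobs than the first output shows.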

In [30]:
output = json.loads(json_output)

# Loop through the list of customer dictionaries and find the one for 'COLORADO POND PROS'
for customer in output:
    if customer['Customer'] == 'COLORADO POND PROS':
        print(customer)
        break

{'Customer': 'COLORADO POND PROS', 'Bill to 1': 'COLORADO POND PROS', 'Bill to 2': nan, 'Bill to 3': nan, 'Main Phone': '3037041505', 'Main Email': 'COLORADOPONDPROS@GMAIL.COM', 'Terms': 'CREDIT CARD ONLY', 'jobs': [{'Job Name/Well Name': '', 'Address': '8450 N FOOTHILLS HIGHWAY', 'City': 'BOULDER', 'Schedule': 'ON CALL 11/29/21', 'Scheduled Day': 'ON CALL', 'Charge Type': 'MAKE A TICKET', 'Toilet#': ''}]}
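
When several customers need to be checked, the linear scan can be replaced by a dictionary keyed on the customer name. The sketch below assumes records shaped like the ones printed above. The list `sample_output` and the name `by_name` are illustrative and not part of the notebook.

```python
# Hypothetical records in the same shape as the notebook's output.
sample_output = [
    {"jobs": [{"City": "West Greeley"}]},  # a record without a "Customer" key
    {"Customer": "COLORADO POND PROS", "Terms": "CREDIT CARD ONLY",
     "jobs": [{"City": "BOULDER"}]},
]

# Build the index once; setdefault keeps the first record seen for each name.
by_name = {}
for record in sample_output:
    name = record.get("Customer")
    if name is not None:
        by_name.setdefault(name, record)

# Lookups no longer need a loop, and .get() returns None for unknown names.
match = by_name.get("COLORADO POND PROS")
missing = by_name.get("NO SUCH CUSTOMER")

print(match["Terms"] if match else "not found")
print(missing)
```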


In [31]:
# Define an empty list to store customer data
customer_data1 = []

# Loop through each unique customer name in the merged DataFrame
for customer in merged_df["Customer"].unique():

    # Create a DataFrame for the current customer by selecting all rows with the current customer name
    group = merged_df[merged_df['Customer'] == customer]

    # Define an empty list to store job data for the current customer
    job_data = []

    # Loop through each row in the DataFrame for the current customer
    for _, row in group.iterrows():

        # Create a dictionary to store the current job's data
        job_dict = {}
        job_dict['Job Name/Well Name'] = row['Job Name/Well Name']
        job_dict['Address'] = row['Address']
        job_dict['City'] = row['City']
        job_dict['Schedule'] = row['Schedule']
        job_dict['Scheduled Day'] = row['Scheduled Day']
        job_dict['Charge Type'] = row['Charge Type']
        job_dict['Toilet #'] = row['Toilet#']

        # Append the job dictionary to the list of job data for the current customer
        job_data.append(job_dict)

    # Create a dictionary to store the current customer's data
    customer_dict = {}

    # Select the row from the customer DataFrame for the current customer (if it exists)
    customer_info = dfm[dfm['Customer'] == customer]
    if not customer_info.empty:
        customer_info = customer_info.iloc[0]
        customer_dict['Customer'] = customer_info['Customer']
        customer_dict['Bill to 1'] = customer_info['bill_to_1']
        customer_dict['Bill to 2'] = customer_info['bill_to_2']
        customer_dict['Bill to 3'] = customer_info['bill_to_3']
        customer_dict['Main Phone'] = customer_info['main_phone']
        customer_dict['Main Email'] = customer_info['main_email']
        customer_dict['Terms'] = customer_info['terms']

    # Add the list of job data for the current customer to the customer dictionary
    customer_dict['jobs'] = job_data

    # Append the customer dictionary to the list of customer data
    customer_data1.append(customer_dict)

# Convert the list of customer dictionaries to a JSON-formatted string
output_json = json.dumps(customer_data1)

# Print the JSON-formatted string
print(output_json)


[{"jobs": [{"Job Name/Well Name": "Ag Pad FRAC", "Address": "O Street x 59th Ave W1.1 S into", "City": "West Greeley", "Schedule": "2xWeekly", "Scheduled Day": "7/22/18change billing to production", "Charge Type": "pending", "Toilet #": NaN}, {"Job Name/Well Name": "Weideman Pad TB", "Address": "83rd Ave x Hwy 34 Business N to 4th Street E1/2 N1/4 E into", "City": "West Greeley", "Schedule": "Post Frac", "Scheduled Day": "changed billing to boomerang pad 12/1/18", "Charge Type": "PENDING", "Toilet #": NaN}, {"Job Name/Well Name": "Bebe Pad", "Address": "0 Street x 31 E1/2 S into", "City": "West Greeley", "Schedule": "2xWeekly", "Scheduled Day": "moved to frac 3/5/19", "Charge Type": "MAKE A TICKET", "Toilet #": NaN}, {"Job Name/Well Name": "Boomerang Pad FRAC", "Address": "4th Street x 71st Ave W1/4 N into", "City": "West Greeley", "Schedule": "Weekly", "Scheduled Day": "moved to TB for Production. 12/10/18", "Charge Type": "PENDING (4/11/18)", "Toilet #": NaN}, {"Job Name/Well Name": 

This code builds a list, customer_data1, with one entry per customer. For each unique customer name in the merged DataFrame, it selects all rows with that name (group) and starts an empty list, job_data, for that customer's jobs.

It then loops over the rows of group, builds a dictionary job_dict from the job columns of each row, and appends it to job_data. Unlike the previous version, every route-sheet row for the customer becomes a job.

Once all jobs have been collected, the code creates customer_dict and looks up the customer in dfm. If a matching row exists, its name, billing, contact, and terms fields are copied into customer_dict. If not, those keys are skipped altogether, which is why the first object in the output above contains only "jobs".

Finally, the code stores job_data in customer_dict under "jobs" and appends customer_dict to customer_data1. Once every customer has been processed, the list is converted to a JSON-formatted string with json.dumps().
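
The same structure can also be built with groupby, keeping the customer name even when there is no match in the customer sheet. This is a minimal sketch under stated assumptions, not the notebook's implementation: `routes` and `customers` are small made-up frames standing in for the route sheet and dfm, and only a few of the real columns are shown.

```python
import json
import pandas as pd

# Hypothetical stand-ins for the route sheet and the matched customer sheet.
routes = pd.DataFrame({
    "Customer": ["SRC", "SRC", "COLORADO POND PROS"],
    "Address": ["O Street x 59th Ave", "83rd Ave x Hwy 34", "8450 N FOOTHILLS HIGHWAY"],
    "City": ["West Greeley", "West Greeley", "BOULDER"],
})
customers = pd.DataFrame({
    "Customer": ["COLORADO POND PROS"],
    "main_phone": ["3037041505"],
    "terms": ["CREDIT CARD ONLY"],
})

# Index the customer sheet by name so each lookup is a single .loc call.
customer_lookup = customers.drop_duplicates("Customer").set_index("Customer")

records = []
for name, group in routes.groupby("Customer", sort=False):
    record = {"Customer": name}  # always present, even without a match
    if name in customer_lookup.index:
        info = customer_lookup.loc[name]
        record["Main Phone"] = info["main_phone"]
        record["Terms"] = info["terms"]
    # One job dictionary per route-sheet row of this customer.
    record["jobs"] = group[["Address", "City"]].to_dict(orient="records")
    records.append(record)

print(json.dumps(records, indent=2))
```

With sort=False the groups come out in order of first appearance, which matches the order produced by unique() in the loop above.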

The resulting output_json string can then be saved to a file or passed to an API. For example, to save the JSON data to a file named "customer_data1.json", you can use the following code:

In [33]:
# Code to collect customer data goes here...

# Convert the customer_data1 list to a JSON string
json_data = json.dumps(customer_data1)

# Save JSON data to a file
with open('customer_data1.json', 'w') as f:
    f.write(json_data)
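
One caveat before relying on this file: by default json.dumps() writes missing values as the bare token NaN (visible in the printed output earlier), which is not part of the JSON standard and may be rejected by stricter parsers. Below is a minimal sketch of one way to handle it, assuming the receiving side accepts null for missing values. The helper `nan_to_none` and the sample record are illustrative.

```python
import json
import math

def nan_to_none(obj):
    """Recursively replace float NaN with None in nested dicts and lists."""
    if isinstance(obj, float) and math.isnan(obj):
        return None
    if isinstance(obj, dict):
        return {key: nan_to_none(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [nan_to_none(item) for item in obj]
    return obj

# Hypothetical record in the same shape as an entry of customer_data1.
sample = [{
    "Customer": "BADLANDS TANK LINES",
    "Bill to 2": float("nan"),
    "jobs": [{"City": "BURNS", "Toilet #": float("nan")}],
}]

loose = json.dumps(sample)  # default behaviour: NaN appears in the text
strict = json.dumps(nan_to_none(sample), allow_nan=False)  # raises if NaN remains

print(loose)
print(strict)
```

Passing allow_nan=False makes json.dumps() raise a ValueError if any NaN slips through, so the cleaning step cannot fail silently.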


Alternatively, to pass the JSON data to an API, you can use the requests library to make a POST request with the JSON data in the request body. For example:

In [35]:
import requests
# Code to collect customer data goes here...

# Convert the customer_data1 list to a JSON string
json_data = json.dumps(customer_data1)

# Define API endpoint URL
url = 'https://example.com/api/customer_data'

# Define headers for the request (if needed)
headers = {'Content-Type': 'application/json'}

# Make POST request with JSON data in the request body
response = requests.post(url, data=json_data, headers=headers)

# Check the response status code
if response.status_code == requests.codes.ok:
    print('Data successfully sent to API.')
else:
    print('Error sending data to API.')


Error sending data to API.
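
The error message is not surprising if, as it appears, the URL is a placeholder rather than a real endpoint. Once a real endpoint is available, it may help to inspect the request before sending it and to surface the HTTP status on failure. The sketch below is one way to do that with requests. The URL is the same placeholder, the payload is a made-up record, and the function name `send_prepared` is hypothetical. Nothing is sent unless `send_prepared` is called.

```python
import json
import requests

url = "https://example.com/api/customer_data"  # placeholder, not a real endpoint
payload = [{"Customer": "COLORADO POND PROS", "Terms": "CREDIT CARD ONLY", "jobs": []}]

# Passing json= serializes the payload and sets the Content-Type header.
prepared = requests.Request("POST", url, json=payload).prepare()

# Inspect what would go over the wire without contacting the server.
print(prepared.method, prepared.url)
print(prepared.headers["Content-Type"])

def send_prepared(prepared_request, timeout=10):
    """Send the request and raise requests.HTTPError on a 4xx/5xx response."""
    with requests.Session() as session:
        response = session.send(prepared_request, timeout=timeout)
        response.raise_for_status()
        return response
```

Calling response.raise_for_status() reports the actual status code in the exception, which is more informative than a generic error message.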


Test of the JSON output against the provided sample output (successful):

In [36]:
output = json.loads(json_output)

# Loop through the list of customer dictionaries and find the one for 'COLORADO POND PROS'
for customer in output:
    if customer['Customer'] == 'COLORADO POND PROS':
        print(customer)
        break

{'Customer': 'COLORADO POND PROS', 'Bill to 1': 'COLORADO POND PROS', 'Bill to 2': nan, 'Bill to 3': nan, 'Main Phone': '3037041505', 'Main Email': 'COLORADOPONDPROS@GMAIL.COM', 'Terms': 'CREDIT CARD ONLY', 'jobs': [{'Job Name/Well Name': '', 'Address': '8450 N FOOTHILLS HIGHWAY', 'City': 'BOULDER', 'Schedule': 'ON CALL 11/29/21', 'Scheduled Day': 'ON CALL', 'Charge Type': 'MAKE A TICKET', 'Toilet#': ''}]}


View customer data by index

In [47]:
customer_at_50 = customer_data1[50]

In [48]:
print(customer_at_50)

{'Customer': 'BADLANDS TANK LINES', 'Bill to 1': 'BADLANDS TANK LINES', 'Bill to 2': nan, 'Bill to 3': nan, 'Main Phone': '4342380248', 'Main Email': 'james.sumner@bltanklines.com', 'Terms': 'CREDIT CARD ONLY', 'jobs': [{'Job Name/Well Name': nan, 'Address': '4800 I-80 SERVICE RD', 'City': 'BURNS', 'Schedule': 'BIWEEKLY 5/27/22', 'Scheduled Day': 'FRIDAY', 'Charge Type': 'MONTHLY (8/18/20)', 'Toilet #': nan}]}
