## Tamr Take Home - Chris Smith

In [2]:
import pandas as pd
from tools import FuzzPipe, FuzzyUSA
from pprint import pprint

## How many distinct suppliers are there in the USA spend dataset?
- Number of initial records.
- Number of records after reducing based on exact supplier/vendor matches.
- Number of records after reducing further based on “fuzzy” matching criteria. This should
    group together records where the same supplier had slightly different names (such as
    “W.W. Grainger” and “WW Grainger” or “IBM” and “International Business Machines”).
    Some of the fuzzy matching logic might also mean matching across columns such as
    matching vendorname with vendoralternatename. Fields like phonenumber,
    streetaddress, city, state, and dunsnumber can also provide useful signals.
- Some measure(s) of accuracy with explanations.

In [3]:
# pick which file to analyze
file_name = './data/all_2021.csv'

In [4]:
df = pd.read_csv(file_name)



### Which fields have suitable cardinality?
- Can we target a field that has few missing values and also gives us a reliable identifier for each company?
- DUNS is a likely candidate off the top of my head.
- We are looking for a good mix of high cardinality and few missing values.

In [5]:
counts = pd.DataFrame(df.nunique()).reset_index(drop=False).rename(columns={'index': 'col', 0: 'count'})

In [6]:
counts.sort_values(by='count', ascending=True).head(10)

Unnamed: 0,col,count
132,small_business_competitiveness_demonstration_p...,1
19,action_date_fiscal_year,1
204,emerging_small_business,2
197,asian_pacific_american_owned_business,2
198,black_american_owned_business,2
199,hispanic_american_owned_business,2
200,native_american_owned_business,2
201,other_minority_owned_business,2
79,multiple_or_single_award_idv,2
78,multiple_or_single_award_idv_code,2


In [7]:
# let's filter out constant and two-valued (mostly boolean flag) fields
counts = counts[counts['count'] > 2]

In [8]:
# keep the columns in the top 10 percent by distinct-value count
top_10 = counts[counts['count'] >= counts['count'].quantile(.9)]

In [9]:
top_10

Unnamed: 0,col,count
0,contract_transaction_unique_key,6350210
1,contract_award_unique_key,5639541
2,award_id_piid,5611228
9,federal_action_obligation,1324880
10,total_dollars_obligated,1394468
11,base_and_exercised_options_value,1288787
12,current_total_value_of_award,1370041
13,base_and_all_options_value,1262537
14,potential_total_value_of_award,1394595
45,recipient_duns,116593


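DUNS looks intriguing! `recipient_duns` is the only column in this list that identifies a recipient rather than a transaction, an award, or a dollar amount.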

### Which fields don't have much missing data?

In [10]:
missing_values = pd.DataFrame(df.isna().sum()).reset_index(drop=False).rename(columns={'index': 'col', 0: 'count'})
# boolean fields won't be much help for our problem; they drop out in the join with top_10 below
# let's start with fields that have 0 missing values
no_missing = missing_values[missing_values['count'] == 0]

In [11]:
no_missing_low_card = pd.merge(
    left=top_10,
    right=no_missing,
    left_on='col',
    right_on='col',
    how='inner'
)

In [12]:
no_missing_low_card.sort_values(by='count_x')

Unnamed: 0,col,count_x,count_y
7,recipient_duns,116593,0
5,base_and_all_options_value,1262537,0
3,federal_action_obligation,1324880,0
4,total_dollars_obligated,1394468,0
6,potential_total_value_of_award,1394595,0
9,last_modified_date,3052218,0
2,award_id_piid,5611228,0
1,contract_award_unique_key,5639541,0
8,usaspending_permalink,5639541,0
0,contract_transaction_unique_key,6350210,0


In [13]:
duns_count = no_missing_low_card[no_missing_low_card.col == 'recipient_duns']['count_x'].values[0]

#### Looks like DUNS is an ideal field to group by
- No missing values
- Suitable cardinality
- Unique identifier
- We can unify records on that key with minimal data loss

Let's clean up the recipient dataset and drop any duplicates that have matching string fields.

The goal is to get as close as possible to one record per DUNS, then scale the analysis up with more data.

Let's subset all columns that deal with recipients and use DUNS as our primary identifier.

In [14]:
larger_recip_data = [
    col for col in df.columns
    if str(col).startswith('recipient')
]
larger_recip_data

['recipient_duns',
 'recipient_uei',
 'recipient_name',
 'recipient_doing_business_as_name',
 'recipient_parent_duns',
 'recipient_parent_uei',
 'recipient_parent_name',
 'recipient_country_code',
 'recipient_country_name',
 'recipient_address_line_1',
 'recipient_address_line_2',
 'recipient_city_name',
 'recipient_county_name',
 'recipient_state_code',
 'recipient_state_name',
 'recipient_zip_4_code',
 'recipient_congressional_district',
 'recipient_phone_number',
 'recipient_fax_number']

In [15]:
recip_data = df[larger_recip_data]

### Workflow is broken into two main classes
These classes are logical groupings of dataframe operations that use DUNS to group like records together.
- FuzzPipe -> simple cleaning and deduping
    - Normalize data into string format
        - Strip trailing zeros left over from float values in string fields
        - Dedup identical records across string fields
    - Establish a UID (in our case DUNS) and use it to find IDs that still have repeated values
        - In our case, we are looking for DUNS with multiple names still associated with them
    - Split the records into single-UID and multi-UID sets for downstream fuzzy matching and joining into the golden table
- FuzzyUSA -> fuzzy matching across multiple fields within each UID
    - Take data with multiple values for a UID field (DUNS)
    - Group those values into a lookup table to improve performance and eliminate long loops
    - Convert fields to strings so they can be evaluated by the fuzzy matching library `rapidfuzz`
        - Using simple ratio for now (could test other approaches or ensemble them together)
        - Simple ratio -> a 0-100 similarity score based on the characters the two compared strings share
    - Score like values against the first record
        - If the average score is greater than 90 for the subset sharing a UID, use the first record (this can be improved, works for now)
        - Generate `match_report` off of FuzzyUSA for analysis
    - Return unified records based on the evaluation
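
The FuzzPipe stage described above might be sketched roughly as follows. This is a dependency-free illustration, not the code in the `tools` module: `normalize`, `split_by_key`, and the toy rows are hypothetical, and it assumes that "multi" means a key that still has more than one distinct `count_field` value after deduping.

```python
from collections import defaultdict


def normalize(value):
    """Render a cell as a string, dropping a trailing '.0' left by float parsing."""
    if value is None:
        return "nan"
    text = str(value).strip()
    if text.endswith(".0"):
        text = text[:-2]
    return text


def split_by_key(rows, key, count_field):
    """Dedup normalized rows, then split them by how many distinct
    count_field values each key still has (assumed behavior)."""
    seen = set()
    unique_rows = []
    for row in rows:
        clean = {col: normalize(val) for col, val in row.items()}
        fingerprint = tuple(sorted(clean.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique_rows.append(clean)

    # Collect the distinct count_field values seen for every key.
    values_per_key = defaultdict(set)
    for row in unique_rows:
        values_per_key[row[key]].add(row[count_field])

    multi = [r for r in unique_rows if len(values_per_key[r[key]]) > 1]
    single = [r for r in unique_rows if len(values_per_key[r[key]]) == 1]
    return multi, single


# Toy data (made up for illustration).
rows = [
    {"duns": 1.0, "name": "ACME INC"},
    {"duns": "1", "name": "ACME INC"},          # duplicate once normalized
    {"duns": 1, "name": "ACME INCORPORATED"},
    {"duns": 2, "name": "UBC INC"},
]
multi, single = split_by_key(rows, key="duns", count_field="name")
print(len(multi), len(single))
```

Only the `multi` rows need the more expensive fuzzy comparison, which is why splitting first keeps the later step small.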

In [16]:
pipe = FuzzPipe(recip_data)

In [17]:
multi_duns, single_duns = pipe.run(
    group_id='recipient_duns',
    count_field='recipient_name'
)

[!] Original size: 6350210
[!] Dedup on string fields size: 176001
[!] Multi IDs: 90831
[!] Single IDs: 85170


In [18]:
fuzzer = FuzzyUSA(multi_duns)
deduped = fuzzer.fuzz_match(
    key_label='recipient_duns',
    fuzz_fields=list(multi_duns.columns)
)

[!] Length before resolving: 90831


100%|██████████| 30593/30593 [00:00<00:00, 34624.54it/s]


[!] Missing keys: 0
[!] Length after resolving: 30593
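
A minimal sketch of the resolution rule, under stated assumptions: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for the `rapidfuzz` ratio, so the scores will not match `match_report` exactly. The names `similarity` and `resolve` and the toy rows are hypothetical, and keeping every row when a group falls at or below the threshold is an assumption, since the notebook does not say what happens in that case.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity on a 0-100 scale (difflib stand-in for a rapidfuzz ratio)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def resolve(rows, key, fields, threshold=90.0):
    """Collapse each key's rows to the first row when the average
    similarity against that first row is greater than the threshold."""
    groups = defaultdict(list)  # lookup table: key -> rows sharing that key
    for row in rows:
        groups[row[key]].append(row)

    resolved, report = [], []
    for group in groups.values():
        first = " * ".join(str(group[0][f]) for f in fields)
        scores = []
        for other in group[1:]:
            candidate = " * ".join(str(other[f]) for f in fields)
            score = similarity(first, candidate)
            scores.append(score)
            report.append({"compare": first, "compared": candidate, "match": score})
        if not scores or sum(scores) / len(scores) > threshold:
            resolved.append(group[0])   # treat the group as one entity
        else:
            resolved.extend(group)      # assumption: leave unresolved groups intact
    return resolved, report


# Toy data (made up for illustration).
rows = [
    {"duns": "107389434", "name": "CARSON SOLUTIONS, LLC", "city": "GREENBELT"},
    {"duns": "107389434", "name": "CARSON SOLUTIONS LLC", "city": "GREENBELT"},
    {"duns": "999", "name": "ALPHA PRODUCE LLC", "city": "TAMPA"},
    {"duns": "999", "name": "ZENITH AVIATION INC", "city": "JESSUP"},
]
resolved, report = resolve(rows, key="duns", fields=["duns", "name", "city"])
print(len(resolved), [round(r["match"], 1) for r in report])
```

Scoring only against the first record keeps the work linear in the group size, at the cost of depending on which record happens to come first.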


In [19]:
pprint(fuzzer.match_report[0:5])

[{'compare': '7914906 * KABZK8W6PQT3 * AMERISOURCEBERGEN DRUG CORPORATION * '
             'nan * 3927759 * NWEGNLYTBDW4 * AMERISOURCEBERGEN CORPORATION * '
             'USA * UNITED STATES * 1300 MORRIS DR STE 1 * nan * CHESTERBROOK '
             '* CHESTER * PA * PENNSYLVANIA * 190875559 * nan * 8002708464 * '
             '804553104 * ',
  'compared': '7914906 * KABZK8W6PQT3 * AMERISOURCEBERGEN DRUG CORPORATION * '
              'nan * 3927759 * NWEGNLYTBDW4 * AMERISOURCEBERGEN CORPORATION * '
              'USA * UNITED STATES * 1300 MORRIS DR STE 1 * nan * CHESTERBROOK '
              '* CHESTER * PA * PENNSYLVANIA * 190875559 * nan * 6238263181 * '
              'nan * ',
  'match': 95.01915708812261},
 {'compare': '7914906 * KABZK8W6PQT3 * AMERISOURCEBERGEN DRUG CORPORATION * '
             'nan * 3927759 * NWEGNLYTBDW4 * AMERISOURCEBERGEN CORPORATION * '
             'USA * UNITED STATES * 1300 MORRIS DR STE 1 * nan * CHESTERBROOK '
             '* CHESTER * PA * PENNSYLVANIA

In [20]:
final = pd.concat([single_duns, deduped])

In [21]:
final.head()

Unnamed: 0,recipient_duns,recipient_uei,recipient_name,recipient_doing_business_as_name,recipient_parent_duns,recipient_parent_uei,recipient_parent_name,recipient_country_code,recipient_country_name,recipient_address_line_1,recipient_address_line_2,recipient_city_name,recipient_county_name,recipient_state_code,recipient_state_name,recipient_zip_4_code,recipient_congressional_district,recipient_phone_number,recipient_fax_number
13,107389434,D91NJLQAALK5,"CARSON SOLUTIONS, LLC",,107389434,D91NJLQAALK5,CARSON SOLUTIONS LLC,USA,UNITED STATES,6305 IVY LN STE 65,,GREENBELT,PRINCE GEORGE'S,MD,MARYLAND,207701465,5,8004807132,2404070773.0
35,148992295,YX2XUVBF3BK5,"SUDANO'S PRODUCE, LLC",,148992295,YX2XUVBF3BK5,SUDANOS PRODUCE LLC,USA,UNITED STATES,7480 CONOWINGO AVE UNT 16-28,,JESSUP,HOWARD,MD,MARYLAND,207949408,2,4107998224,4107999554.0
43,80185177,GLLPKM158NS7,"CO FIRE AVIATION, INC",,80185177,GLLPKM158NS7,CO FIRE AVIATION INC,USA,UNITED STATES,23101 HWY 52,,FORT MORGAN,MORGAN,CO,COLORADO,807019401,4,9708678414,
52,39895743,FT6CY4K64LX1,UBC INC,,39895743,FT6CY4K64LX1,UBC INC,USA,UNITED STATES,6101 JOHNS RD STE 1,,TAMPA,HILLSBOROUGH,FL,FLORIDA,336344425,14,8138846076,8138848318.0
58,877772418,GJVFDKY295L1,"DOUGLAS WEBB & ASSOCIATES, INC",,877772418,GJVFDKY295L1,DOUGLAS WEBB & ASSOCIATES INC,USA,UNITED STATES,8080 CORPORATE BLVD,,PLAIN CITY,UNION,OH,OHIO,43064922,4,614873983,6148739834.0


In [22]:
print(f'Final records for recipients in {file_name}\nis {len(final)} compared to DUNs {duns_count} total')

Final records for recipients in ./data/all_2021.csv
is 115763 compared to DUNs 116593 total
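
As a quick sanity check, the counts printed above can be reconciled by hand. The figures below are copied from the notebook output. The variable names are only for illustration, and the notebook does not establish what causes the remaining gap.

```python
single_ids = 85170        # "[!] Single IDs"
resolved_multi = 30593    # "[!] Length after resolving"
distinct_duns = 116593    # distinct recipient_duns values in the raw file

final_records = single_ids + resolved_multi
gap = distinct_duns - final_records

print(final_records)                  # 115763
print(gap)                            # 830
print(f"{gap / distinct_duns:.2%}")   # 0.71%
```

The final table is therefore about 0.7% short of one record per distinct DUNS, which is worth investigating before treating it as the golden table.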


### Next steps and comments
- From here, we can join this clean recipient data back into our larger dataset for further analysis. Having clean records to join on and to uniquely identify recipients will allow us to pull together aggregate metrics for a resolved entity, giving us a more complete picture of what a given entity looks like in a dataset.
- One weakness of this approach is its reliance on the DUNS number. If a company were to misreport its DUNS, or an upstream data entry mistake assigned the DUNS to the wrong company, this approach would falter since it uses DUNS to group values together prior to matching.
- We also lose the records that get matched, but more work can be done to either concatenate them into multi-value fields in the final result ...
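
One way the multi-value idea from the last bullet might look, as a rough sketch rather than part of the current workflow. `merge_group` is a hypothetical helper, and the sample rows reuse the name and phone values shown in the `match_report` output.

```python
def merge_group(rows):
    """Fold rows that share a key into one record; fields with several
    distinct values become lists, kept in first-seen order."""
    merged = {}
    for row in rows:
        for col, val in row.items():
            values = merged.setdefault(col, [])
            if val not in values:
                values.append(val)
    # Unwrap fields that only ever had one value.
    return {col: vals[0] if len(vals) == 1 else vals for col, vals in merged.items()}


group = [
    {"recipient_duns": "7914906",
     "recipient_name": "AMERISOURCEBERGEN DRUG CORPORATION",
     "recipient_phone_number": "8002708464"},
    {"recipient_duns": "7914906",
     "recipient_name": "AMERISOURCEBERGEN DRUG CORPORATION",
     "recipient_phone_number": "6238263181"},
]
golden = merge_group(group)
print(golden["recipient_phone_number"])
```

This keeps every phone number seen for the entity instead of discarding all but the first record's value.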