Consolidates registrars found in our complaint data that have identical contact information.

After deduplication based on numbers, we found that 100 of the 371 remaining unique registrars appeared to belong to Network Solutions, because they mention it somewhere in their contact row. Checking some of the domains that make up their 'registrar names' on the Wayback Machine confirmed that almost all redirected to SnapNames and *never* appeared to be independent registrars, at least as of ~2006.

A couple mention Network Solutions in their contact information but appear to have their own websites, so we only consolidate those that list Network Solutions as their *link* and show no indication of being actual registrars.

Reads in and over-writes most_malicious. Kind of yikes maybe.

In [None]:
import pandas as pd

In [None]:
df = pd.read_json('/data/all_types_domains_balanced_registered_domains_whois_parsed_correct_registrar_names_deduplicated_ids.json', lines=True)
registrars = pd.read_csv("/data/Accredited-Registrars-202412054424.csv")
registrars_interesting = registrars[['Country/Territory','Public Contact','Link','IANA Number']]

In [None]:
# First, we'll tack on the ICANN contact info for every entry that has it available for the registrar in question
size_before = len(df)
df = pd.merge(df,registrars_interesting,how='left',left_on='registrar_id',right_on='IANA Number',suffixes=(None,None))
del df['IANA Number'] # Redundant, just needed to merge on
assert(size_before == len(df)) # We shouldn't lose/gain any entries when merging
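
The row-count assertion above guards against duplicate IANA numbers on the registrar side. As a hedged sketch on made-up toy data (the domain names, IDs other than 2, and the `registrar-seven.test` link are all hypothetical), pandas can also enforce this during the merge itself through the `validate` argument:

```python
import pandas as pd

# Toy stand-ins for the complaint data and the ICANN registrar list (made-up rows).
left = pd.DataFrame({
    "domain": ["a.test", "b.test", "c.test"],
    "registrar_id": [2, 2, 7],
})
right = pd.DataFrame({
    "IANA Number": [2, 7],
    "Link": ["http://www.networksolutions.com", "http://registrar-seven.test"],
})

# validate="many_to_one" raises if the right-hand keys are not unique,
# which is the situation that would otherwise inflate the row count.
merged = pd.merge(left, right, how="left", left_on="registrar_id",
                  right_on="IANA Number", validate="many_to_one")
rows_preserved = len(merged) == len(left)

# Simulate a registrar list with a duplicated IANA number.
dup_right = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    pd.merge(left, dup_right, how="left", left_on="registrar_id",
             right_on="IANA Number", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
```

This fails earlier and with a more specific error than the length check, though the length check is still a reasonable belt-and-braces guard.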

In [None]:
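# It's a one-liner, surprise! Every registrar sharing a Link takes the registrar_id of the first row in its group, which might be messy, visually.
# Rows without a Link (no ICANN match) can come back as NaN from transform, so they keep their original registrar_id.
df['registrar_id'] = df.groupby('Link')['registrar_id'].transform('first').fillna(df['registrar_id']).astype(df['registrar_id'].dtype)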
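
To make the consolidation step concrete, here is a minimal sketch on made-up rows (IDs 9001, 9002, 7, and 555 and the `registrar-seven.test` link are hypothetical). It also illustrates one thing worth checking: a row whose `Link` is missing belongs to no group, and in recent pandas versions `transform` returns NaN for it, so the sketch falls back to the original ID for such rows.

```python
import pandas as pd

# Made-up rows: three IDs share the Network Solutions link, one has no link at all.
toy = pd.DataFrame({
    "registrar_id": [2, 9001, 9002, 7, 555],
    "Link": ["http://www.networksolutions.com"] * 3
            + ["http://registrar-seven.test", None],
})

# First registrar_id within each Link group, aligned back to the original rows.
consolidated = toy.groupby("Link")["registrar_id"].transform("first")

# Rows with no Link keep their own ID instead of a NaN.
toy["registrar_id"] = consolidated.fillna(toy["registrar_id"]).astype(int)
```

Here the three Network Solutions rows collapse onto ID 2, while the registrar with no link keeps ID 555.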

In [None]:
# This is for aesthetic convenience: Network Solutions appears to be the biggest offender, so we manually set its ID and name
df.loc[df['Link'] == 'http://www.networksolutions.com', 'registrar_id'] = 2
df.loc[df['registrar_id'] == 2, 'registrar'] = "Network Solutions, LLC"
# Other registrars that get consolidated may instead have unexpected names, but their information is still correct.

In [None]:
df.to_json("/data/all_types_domains_balanced_registered_domains_whois_parsed_correct_registrar_names_deduplicated_ids_links.json",orient='records',lines=True)

In [None]:
# PART 2: domain counts
# Now let's do the same for the deduplicated registrars count
df = pd.read_json("/data/registrar_domain_count_flat_deduplicated.json", lines=True)
registrars_interesting = registrars[['Link','IANA Number']] # Minimal

In [None]:
# Deja vu
size_before = len(df)
df = pd.merge(df,registrars_interesting,how='left',left_on='id',right_on='IANA Number',suffixes=(None,None))
del df['IANA Number'] # Redundant, just needed to merge on
assert(size_before == len(df)) # We shouldn't lose/gain any entries when merging

In [None]:
# Copy-pasta from deduplicate_shell_companies.py with a different groupby
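# groupby silently drops rows whose key is NaN, so registrars without an ICANN
# match get a unique fallback key and pass through unmerged.
link_key = df['Link'].fillna('unmatched-id-' + df['id'].astype(str))
df = df.groupby(link_key).agg({
    'id': 'first',  # Choose the first ID as the new ID for this group
    'name': 'first',
    'domains': 'sum',
    'share': 'sum',
    'tlds': 'max',
    'signedZones': 'sum', # I think it's correct to sum these?
    'upcomingDeletes': 'sum', # I think it's correct to sum these?
}).reset_index()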
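
One behaviour worth checking in an aggregation like this: by default pandas `groupby` excludes rows whose key is NaN, so a registrar with no `Link` would be absent from a plain `groupby('Link')` result. The sketch below uses made-up rows and a per-row fallback key (the `unmatched-id-` prefix is an assumption for illustration) to compare the two behaviours:

```python
import pandas as pd

# Made-up domain counts; the last registrar has no ICANN link.
counts = pd.DataFrame({
    "id": [2, 9001, 7, 555],
    "name": ["Network Solutions, LLC", "Shell Name One",
             "Registrar Seven", "Unmatched Registrar"],
    "domains": [100, 5, 40, 3],
    "Link": ["http://www.networksolutions.com",
             "http://www.networksolutions.com",
             "http://registrar-seven.test", None],
})
spec = {"id": "first", "name": "first", "domains": "sum"}

# Plain groupby on Link: the NaN-keyed row is excluded from the result.
dropped = counts.groupby("Link").agg(spec).reset_index()

# Fallback key: rows with no Link each form their own single-row group.
key = counts["Link"].fillna("unmatched-id-" + counts["id"].astype(str))
kept = counts.groupby(key).agg(spec).reset_index()

netsol_domains = kept.loc[kept["id"] == 2, "domains"].iloc[0]
```

Comparing `dropped["domains"].sum()` with `counts["domains"].sum()` is a quick way to see whether any registrars fell out during consolidation.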

In [None]:
# Same rename as above, but this time it's important for making sure it's all working right!
df.loc[df['Link'] == 'http://www.networksolutions.com', 'id'] = 2
df.loc[df['id'] == 2, 'name'] = "Network Solutions, LLC"

In [None]:
del df['Link'] # phoenix stop blowing up dataset size challenge
df.to_json("/data/registrar_domain_count_flat_deduplicated_links.json", orient='records', lines=True)