## Fix duplicate start links

Sometimes the Google search returns the same start_url for multiple government units, which is clearly incorrect (e.g. Baltimore County vs. Baltimore City). This notebook supports a semi-manual fix.

In [1]:
import pandas as pd
import sqlite3

pd.set_option('display.max_rows', 50)

In [2]:
dbconn = sqlite3.connect("sbc_db_2022.sqlite")

In [64]:
gov_info = pd.read_sql("SELECT * FROM gov_info WHERE sbc_count>0;", dbconn)

In [65]:
gov_info.shape

(216, 21)

In [66]:
gov_info.drop_duplicates(subset=['start_url']).shape

(197, 21)

In [67]:
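# count how many units share each start_url (the count lands in the 'ID' column)
dup_counts = gov_info.groupby('start_url').count().reset_index()[['start_url', 'ID']]
# attach that count to every unit and list the most-shared URLs first
url_dup_counts = gov_info[['id_idcd_plant', 'MNAME', 'start_url']].merge(dup_counts, how='left', on='start_url')
url_dup_counts.sort_values(by=['ID', 'start_url'], ascending=False).head(50)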

Unnamed: 0,id_idcd_plant,MNAME,start_url,ID
177,43107907900000,SHELBY COUNTY,https://www.shelbycountytn.gov/167/Employee-Be...,3
178,43107907900100,SHELBY COUNTY HEALTH CARE CORPORATION- PERSONN...,https://www.shelbycountytn.gov/167/Employee-Be...,3
181,43207900501000,MEMPHIS AND SHELBY COUNTY CENTER CITY COMMISSION,https://www.shelbycountytn.gov/167/Employee-Be...,3
23,5100700700000,CONTRA COSTA COUNTY,https://www.contracosta.ca.gov/1343/Employee-B...,3
24,5100700704700,CONTRA COSTA HOUSING AUTHORITY,https://www.contracosta.ca.gov/1343/Employee-B...,3
26,5100700730100,CONTRA COSTA COUNTY SPECIAL SCHOOLS,https://www.contracosta.ca.gov/1343/Employee-B...,3
17,4000000004105,SYSTEM OFFICE,https://www.utsystem.edu/offices/employee-bene...,2
182,44000000017301,UNIVERSITY OF TEXAS OFFICE OF EMPLOYEE BENEFITS,https://www.utsystem.edu/offices/employee-bene...,2
117,26000000003900,SOUTHEAST MISSOURI STATE UNIV,https://www.missouristate.edu/human/medical-in...,2
118,26000000004000,MISSOURI STATE UNIVERSITY,https://www.missouristate.edu/human/medical-in...,2
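
The per-URL count above can also be computed without the intermediate merge. The following is a minimal sketch on made-up data, not the notebook's actual query: the frame `toy`, its IDs, names, and URLs, and the column name `n_units` are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for gov_info; the IDs, names, and URLs below are made up.
toy = pd.DataFrame({
    "id_idcd_plant": [1, 2, 3, 4],
    "MNAME": ["A COUNTY", "A CITY", "B COUNTY", "C CITY"],
    "start_url": [
        "https://a.example/benefits",
        "https://a.example/benefits",
        "https://b.example/hr",
        "https://c.example/hr",
    ],
})

# Number of rows sharing each start_url, aligned back to the original rows.
toy["n_units"] = toy.groupby("start_url")["start_url"].transform("size")

# Keep only URLs assigned to more than one unit, most-shared first.
dups = toy[toy["n_units"] > 1].sort_values(["n_units", "start_url"], ascending=False)
```

Filtering on `n_units > 1` leaves only the rows that need manual review, which can be easier to scan than the full sorted list.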


### Fix a specified ID

In [74]:
# look up the ID of the unit to fix (strip MNAME so stray whitespace does not break the match)
id_to_fix = gov_info.loc[gov_info['MNAME'].str.strip() == 'DENVER SCHOOL DISTRICT 1', 'id_idcd_plant'].values[0]

In [75]:
google_urls = pd.read_sql("SELECT * FROM google_query_results WHERE id_idcd_plant=?", dbconn, params=(id_to_fix,))

In [76]:
google_urls

Unnamed: 0,url_index,start_url,id_idcd_plant,MNAME,is_queried_search,date_queried_search
0,1,https://www.dcsdk12.org/about/our_district/dep...,6501600100000,DENVER SCHOOL DISTRICT 1,1,2022-08-17 14:01:55.247567
1,2,http://thecommons.dpsk12.org/Page/1397,6501600100000,DENVER SCHOOL DISTRICT 1,1,2022-08-17 14:01:55.247567
2,3,https://www.denverhealth.org/-/media/files/emp...,6501600100000,DENVER SCHOOL DISTRICT 1,1,2022-08-17 14:01:55.247567
3,4,https://www.cu.edu/employee-services/benefits-...,6501600100000,DENVER SCHOOL DISTRICT 1,1,2022-08-17 14:01:55.247567
4,5,https://dhr.colorado.gov/state-employees/state...,6501600100000,DENVER SCHOOL DISTRICT 1,1,2022-08-17 14:01:55.247567
5,6,https://www.dpsk12.org/,6501600100000,DENVER SCHOOL DISTRICT 1,1,2022-08-17 14:01:55.247567
6,7,https://careers.dpsk12.org/teach/teachercomp/,6501600100000,DENVER SCHOOL DISTRICT 1,1,2022-08-17 14:01:55.247567
7,8,https://www.landerschools.org/Human-Resources,6501600100000,DENVER SCHOOL DISTRICT 1,1,2022-08-17 14:01:55.247567
8,9,https://www.jeffcopublicschools.org/employment...,6501600100000,DENVER SCHOOL DISTRICT 1,1,2022-08-17 14:01:55.247567
9,10,https://www.metlife.com/,6501600100000,DENVER SCHOOL DISTRICT 1,1,2022-08-17 14:01:55.247567


In [77]:
# from the results above, the second one (url_index == 2) is the one to use
new_start_url = google_urls.loc[google_urls['url_index']==2]['start_url'].values[0]
new_start_url

'http://thecommons.dpsk12.org/Page/1397'

In [78]:
update = '''
         UPDATE gov_info
         SET start_url=?,
             is_scraped=0,
             num_scraped=0,
             pdf_count=0,
             sbc_count=0
         WHERE id_idcd_plant=?;
         '''

In [73]:
cur = dbconn.cursor()
cur.execute(update, (new_start_url, id_to_fix, ))
dbconn.commit()

In [38]:
cur.close()
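
Since the same lookup and UPDATE steps are repeated for every unit in the changelog, they could be wrapped in one helper. This is a hedged sketch, not part of the original notebook: the function name `reset_start_url` is hypothetical, and the in-memory database below is a toy with only the columns the helper touches and made-up rows.

```python
import sqlite3

def reset_start_url(conn, unit_id, url_index):
    """Point one unit at its url_index-th Google result and reset its scrape counters."""
    row = conn.execute(
        "SELECT start_url FROM google_query_results "
        "WHERE id_idcd_plant=? AND url_index=?",
        (unit_id, url_index),
    ).fetchone()
    if row is None:
        raise ValueError(f"no Google result {url_index} for unit {unit_id}")
    # "with conn" commits on success and rolls back if the UPDATE raises.
    with conn:
        conn.execute(
            "UPDATE gov_info SET start_url=?, is_scraped=0, num_scraped=0, "
            "pdf_count=0, sbc_count=0 WHERE id_idcd_plant=?",
            (row[0], unit_id),
        )
    return row[0]

# In-memory toy database; table layouts are reduced and the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gov_info (id_idcd_plant INTEGER, start_url TEXT, is_scraped INTEGER,
                           num_scraped INTEGER, pdf_count INTEGER, sbc_count INTEGER);
    CREATE TABLE google_query_results (id_idcd_plant INTEGER, url_index INTEGER,
                                       start_url TEXT);
    INSERT INTO gov_info VALUES (1, 'https://wrong.example/', 1, 5, 3, 2);
    INSERT INTO google_query_results VALUES (1, 1, 'https://wrong.example/');
    INSERT INTO google_query_results VALUES (1, 2, 'https://right.example/');
""")

new_url = reset_start_url(conn, 1, 2)
```

Raising when the requested result is missing avoids silently writing an empty URL, and the `with conn` block keeps each fix in its own transaction.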

#### Changelog for this manual process
- Baltimore City changed to second Google URL
- Baltimore County Public Schools changed to fourth Google URL
- Community College of Baltimore City changed to fourth Google URL
- Memphis City changed to fourth Google URL
- Southeast Missouri State University changed to second Google URL
- Denver Public Schools changed to second Google URL