## Summary

The client records downloaded from headquarter patent database contains address info, however, for data analyssi purpose, we need to extract zip code, city, state and country(keep USA only) from the address column. 

To do that, we will import the following 2 files:

- `总部专利客户 as of 2019-08-29.xlsx` 
- `US zip codes.xlsx`

In [1761]:
import pandas as pd
import numpy as np
import re

# import all custom functions from myFunctions.ipynb
from ipynb.fs.full.myFunctions import *

## 1. Loading datasets

### Headquarter client data with address

In [1762]:
HQ_clients = pd.read_excel('../总部专利客户 as of 2019-08-29.xlsx')
print(HQ_clients.index)
print(HQ_clients.info())
HQ_clients.head()

RangeIndex(start=0, stop=3731, step=1)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3731 entries, 0 to 3730
Data columns (total 8 columns):
客户代码      3731 non-null object
客户名称      3718 non-null object
客户类别      3731 non-null object
客户地址      3640 non-null object
电话        890 non-null object
传真        784 non-null object
客户中文名称    2690 non-null object
状态        79 non-null object
dtypes: object(8)
memory usage: 233.3+ KB
None


Unnamed: 0,客户代码,客户名称,客户类别,客户地址,电话,传真,客户中文名称,状态
0,US002562,CERION，LLC,(S),"One Blossom Road, Rochester, NY 14610 USA",,,丝润有限责任公司,
1,US002565,"AD-VANTAGE NETWORKS, INC.",(S),"600 North Brand Blvd., Suite 230 Glendale, CA ...",,,AD-优势网络股份公司,
2,US002566,"TOPS Products, LLC",(S),c/o R.R. Donnelley & Sons Company 111 South Wa...,,,托普斯产品有限责任公司,
3,US002567,"BRIGHENTI, Peter",(S),"430 Carmel Court Canton, Georgia 30114 USA",,,彼得·布里根蒂,
4,US002568,"RANA THERAPEUTICS, INC.",(S),"200 Sidney Street Suite 310 Cambridge, Massac...",,,RANA医疗有限公司,


In [1763]:
''' remove rows without address'''
HQ_clients = HQ_clients.dropna(subset=['客户地址'])
HQ_clients = HQ_clients.reset_index(drop=True).copy() # reset index so position index won't go out of bound
print(HQ_clients.index)
print(HQ_clients.info())
HQ_clients.head()

RangeIndex(start=0, stop=3640, step=1)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3640 entries, 0 to 3639
Data columns (total 8 columns):
客户代码      3640 non-null object
客户名称      3640 non-null object
客户类别      3640 non-null object
客户地址      3640 non-null object
电话        882 non-null object
传真        783 non-null object
客户中文名称    2635 non-null object
状态        67 non-null object
dtypes: object(8)
memory usage: 227.6+ KB
None


Unnamed: 0,客户代码,客户名称,客户类别,客户地址,电话,传真,客户中文名称,状态
0,US002562,CERION，LLC,(S),"One Blossom Road, Rochester, NY 14610 USA",,,丝润有限责任公司,
1,US002565,"AD-VANTAGE NETWORKS, INC.",(S),"600 North Brand Blvd., Suite 230 Glendale, CA ...",,,AD-优势网络股份公司,
2,US002566,"TOPS Products, LLC",(S),c/o R.R. Donnelley & Sons Company 111 South Wa...,,,托普斯产品有限责任公司,
3,US002567,"BRIGHENTI, Peter",(S),"430 Carmel Court Canton, Georgia 30114 USA",,,彼得·布里根蒂,
4,US002568,"RANA THERAPEUTICS, INC.",(S),"200 Sidney Street Suite 310 Cambridge, Massac...",,,RANA医疗有限公司,


### US zipcode reference

In [1764]:
zipcodes = pd.read_excel('../US zip codes.xlsx')
print(zipcodes.info())
zipcodes.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40933 entries, 0 to 40932
Data columns (total 7 columns):
Zip Code      40933 non-null int64
Place Name    40933 non-null object
State         40918 non-null object
State_Abbr    40933 non-null object
County        40928 non-null object
Latitude      40933 non-null float64
Longitude     40933 non-null float64
dtypes: float64(2), int64(1), object(4)
memory usage: 2.2+ MB
None


Unnamed: 0,Zip Code,Place Name,State,State_Abbr,County,Latitude,Longitude
0,501,Holtsville,New York,NY,Suffolk,40.8154,-73.0451
1,544,Holtsville,New York,NY,Suffolk,40.8154,-73.0451
2,1001,Agawam,Massachusetts,MA,Hampden,42.0702,-72.6227
3,1002,Amherst,Massachusetts,MA,Hampshire,42.3671,-72.4646
4,1003,Amherst,Massachusetts,MA,Hampshire,42.3919,-72.5248


In [1765]:
'''Fill zip of len < 5 with use of custom fuction `fillzip_leading_0` '''
zipcodes['Zip Code'] = fillzip_leading_0(zipcodes['Zip Code'])

# verify the number of converted zipcodes
len(zipcodes['Zip Code'][zipcodes['Zip Code'].str.len()==5])

40933

## 2. Construct Regex and Functions

In [1766]:
'''Construct regex to capture Country'''

# string pattern in EN
regex_US_EN = (
          # capture `U.S.A` or `USA`
          r"\b(?:U\.?\s?S\.?\s?A?\.?"
          # capture `United States of America`(including mispelling) with or without 'THE' ahead
          # (?:word_to_search?)? - non capturing group (?: ?), plus ? to indicate optional
          # (?:\w){1,2} inside (?:T(?:\w){1,2}\s*\W*)? to capture 1 or 2 numbers of word character (non-optional) after 'T'
          r"|(?:T(?:\w){1,2}\s*\W*)?(?:UN\w*\s+?)[ST]\w*\s+OF\s+A\w*"
          r"|(?:T(?:\w){1,2}\s*\W*)?UN\w*\s+STA\w*"
          r"|UNITED STATES OF AMERICA TATES OF AMERICA)\b"
         )
regex_others_EN =(r"\b(?:CANADA|CHINA|P\.?R\.?C\.?|SWITZERLAND|GERMANY|INDIA|KOREA|ISRAEL)\b")
regex_country_EN = r"(?:" + regex_US_EN + r"|" + regex_others_EN + r")"


# string pattern in CN
regex_US_CN = (r"(?:美国|加州)")
regex_others_CN = (r"(?:中国|上海|新竹市|山东省)")
regex_country_CN = r"(?:" + regex_US_CN + r"|\W*" + regex_others_CN + r")"


# combine US in EN and CN
regex_US_both = regex_US_EN + r"|" + regex_US_CN

# combine other country in EN and CN
regex_others_both = regex_others_EN + r"|" + regex_others_CN

# combine all countries in EN and CN
regex_country = r"(?P<Country>" + regex_US_both + r"|" + regex_others_both + r")"

### Test on the regex
test_s = '''333 CETNENN新竹市IAL Canada usa 美国p.r.c.the united states OF amriCA SUITE B U.sa LOUISVILLE, CO th United States of 
    A china UNITED STATES OF AMERICA 中国 中国 TATES OF AMERICA'''.upper()

#print(regex_country_EN)
#print(address[2882])
re.findall(regex_country, test_s)

['新竹市',
 'CANADA',
 'USA',
 '美国',
 'THE UNITED STATES OF AMRICA',
 'U.SA',
 'TH UNITED STATES OF \n    A',
 'CHINA',
 'UNITED STATES OF AMERICA',
 '中国',
 '中国']

In [1767]:
'''Construct regex to capture zip'''

### ****** SPLIT RAW STRING INTO MULTIPLE LINES ******* ###
# Use parenthesis to trigger automatic line continuation. 
# The strings will be automatically concatenated.'''


regex_zip = (# the digit must be preceded by a word or non-word character(not empty) - forward positive lookaround
           r"(?<=[\w\W])"
           # the digit might be preeceded by word charactor (e.g. CA94105)
           r"\b[a-zA-Z]*?"
           # any digits of len > 4, might be followed by 1 non word character and then extra digits
           # name the captured string as 'Zip'
           r"(?P<Zip>(?:\d{5}\W?\d{4}?|\d{4,})\b)")

### test regex on individual address:
#val = HQ_clients.Address[0]
val='9701 SE JOHNSON CREEK BOULEVARD, APT. 1306 97086 HAPPYVALLEY OREGON UNITED STATES OF AMERICA'
#print("Test string: {}".format(val))
re.findall(regex_zip, val)

['1306', '97086']

In [1768]:
'''Construct regex to capture city from address'''

# construct regex to capture any patterns representing roads or units
regex_roads = (r"(\b(?:RO?A?D|PLACE|AVE|AVENUE|PARK|PARKWAY|WAY|STREET|S\.?T\.?(?!ATES)|" #negative lookahead make sure the captured 'ST' not before 'ATES'
               r"LANE|DRIVE|BOULDER|B\w*L\w*V\w*D|PIKE)"
               r"\b\W*\s*"
               r"(?:N\.?\s?W\.?|N\.?\s?E\.?|S\.?\s?W\.?|S\.?\s?E\.?)?\W*\s*)"
               )

regex_roads_spanish = r"(\b(?:AVENIDA|CAMINO)(?:\s*\w*){0,}\s*\,\s*)" # in Spanish, street name(one word or more) is placed behind (before next ,)

regex_units = (r"((?:\b(?:UNIT|FL(?:OOR)?|APT\.?|SUITE|COURT|B\w*L\w*D\w*G|P\.?O\.?\s*BOX)\s"
               r"|#)"
               r"\s*(?:\d+)?[A-Z]?\W*\s*)") #UNIT followed optionally by number or A-Z
regex_road_or_unit = r"(" + regex_roads + r"|" + regex_roads_spanish + r"|" + regex_units + r")"



# custom functions to get the string after road/unit wording
def idx_aft_roads(val):
    allmatches = re.findall(regex_road_or_unit, val)
    if len(allmatches)>1:
        start_idx = 0
        end_idx = len(val)
        # search till 
        while re.search(regex_road_or_unit, val[start_idx:end_idx]):
            start_idx += re.search(regex_road_or_unit, val[start_idx:end_idx]).end()
        return start_idx
    elif len(allmatches) ==1:
        return re.search(regex_road_or_unit, val).end()
    else:
        return 0
    
def rest_of_str_aft_road_or_unit(val):
    #print("string: " + val)
    return val[idx_aft_roads(val):]

#### Test the above extraction ###
#val = US_addresses.Address.loc[0]
#val = 'P.O. BOX 194344 SAN JUAN, PR 00919 USA'.upper()
#val = '132 N. EL CAMINO REAL REAL, ENCINITAS, CALIFORNIA 92924 USA'
#val = '132 N. EL CAMINO REAL #287, ENCINITAS, CALIFORNIA 92924 USA'
#val = '420 CHESTNUT LANE, WESTON, FL 33226 USA' ### FL as state duplicates FL as floor
#val = '256 ELEANOR STATES ROOSEVELT ST. SAN JUAN, PUERTO RICO 00918 USA'
#val = '9701 SE JOHNSON CREEK BOULEVARD, APT. 1306 97086 HAPPYVALLEY OREGON UNITED STATES OF AMERICA'
#val = '941 AVENIDA ACASO, CAMARILLO, CALIFORNIA, USA'
#val = '1434 AIR RAIL AVENUE VIRGINIA BEACH'
val = '333 CENTENNIAL PARKWAY SUITE B LOUISVILLE, CO USA'
rest_of_str_aft_road_or_unit(val)


'LOUISVILLE, CO USA'

In [1769]:
# on a single value
def isEnglish(val):
    try:
        str(val).encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True
    
# on a column
def clean_rest_of_str_aft_rd(col):
    return (col.str.replace(regex_US_both, 'USA') # standard USA input
                 .str.replace(r'\d+$', '') # remove digit at the end
                 .str.replace(r'\w*?[\W\s]*?\d', '') # remove any word string before a digit
                 .str.replace(r'(\d+?[\W\s]+)','')
                 .str.strip()
                 .str.replace(r'(^[\.,]*?|[\.,]*?$)', '')) # remove any . or , at the start or the end

## 3. Select US clients (extract country from address)

### Standardize address formatting

In [1770]:
'''use custom fuction `transform_addresses`'''
address=transform_addresses(HQ_clients['客户地址'])
print("Total addresses on file: {}".format(address.shape[0]))
print('Index: {}'.format(address.index))
address.head().tolist()

Total addresses on file: 3640
Index: RangeIndex(start=0, stop=3640, step=1)


['ONE BLOSSOM ROAD, ROCHESTER, NY 14610 USA',
 '600 NORTH BRAND BLVD., SUITE 230 GLENDALE, CA 91203, USA',
 'C/O R.R. DONNELLEY & SONS COMPANY 111 SOUTH WACKER DRIVE CHICAGO, ILLINOIS 60606 U.S.A.',
 '430 CARMEL COURT CANTON, GEORGIA 30114 USA',
 '200 SIDNEY STREET SUITE 310 CAMBRIDGE, MASSACHUSETTS 02139 USA']

### Extract `Country` from `Address` 

In [1771]:
'''Extract all matching strings from address data'''
country_extracted = address.str.extractall(regex_country)
print("Index of records with matches found: {}".format(country_extracted.index.get_level_values(0)))
country_extracted.head()

Index of records with matches found: Int64Index([   0,    1,    2,    3,    4,    5,    7,    8,    9,   10,
            ...
            3629, 3630, 3631, 3632, 3633, 3634, 3635, 3637, 3638, 3639],
           dtype='int64', length=3608)


Unnamed: 0_level_0,Unnamed: 1_level_0,Country
Unnamed: 0_level_1,match,Unnamed: 2_level_1
0,0,USA
1,0,USA
2,0,U.S.A
3,0,USA
4,0,USA


####  1. Fix the addresses without matching countries 

In [1772]:
'''Those address cannot find matching country strings'''
no_matching_country = (address[~pd.Series(address.index)
                               .isin(pd.Series(country_extracted.index.get_level_values(0)))])

'''examine data, if any new pattern found, add to regex_country, re-run the extraction'''
print(no_matching_country.shape[0])
no_matching_country.tolist()

50


['5201 GREAT AMERICA PARKWAY, SUITE 270 SANTA CLARA，CA 95054',
 '2717 LINCOLN STREET EVANSTON, IL 60201',
 '300 CARNEGIE CENTER SUITE 220 PRINCETON, NJ 08540',
 'TWO LIBERTY PLACE 50 S. 16TH STREET, SUITE 3200 PHILADELPHIA, PA 19102-2555',
 '150 N MICHIGAN AVE | SUITE 2700 | CHICAGO, ILLINOIS 60601',
 '9530 JEFFERSON BOULEVARD, CULVER CITY, CALIFORNIA 90232',
 '523 OCEAN FRONT WALK VENICE, CALIFORNIA 90291',
 'TWO SEAPORT LANE, BOSTON, MA 02210-2001',
 '600 BANNER PLACE TOWER 12770 COIT ROAD DALLAS, TEXAS 75251',
 '354 TURNPIKE STREET - SUITE 301A CANTON, MA 02021-2714',
 '158 ROUNDHILL RD BOALSBURG, PA 16827',
 '70 WEST MADISON – SUITE 3500 CHICAGO, ILLINOIS 60602-4424',
 '10250 CONSTELLATION BLVD. SUITE 1700 LOS ANGELES, CA 90067',
 '1411 FOURTH AVENUE, SUITE 760 SEATTLE, WASHINGTON 98101',
 '10041 RAMPART COURT #140 LITTLETON, CO 80125',
 '9280 CRESTWYN HILLS DRIVE MEMPHIS, TN 38125',
 '2603 AUGUSTA DRIVE SUITE 1270 HOUSTON, TX 77057',
 '100 EAST WISCONSIN AVENUE, SUITE 1100 MILWAUK

In [1773]:
'''After confirming all the above are US addresses, save it to a new dataframe'''

cols = ['ClientID','Address', 'Zip', 'Country', 'City']
clientIDs = HQ_clients.loc[no_matching_country.index, '客户代码']
addresses = no_matching_country
zipCodes = np.full(len(no_matching_country), np.nan)
countries = np.full(len(no_matching_country), 'USA')
cities = np.full(len(no_matching_country), np.nan)

US_addresses = pd.DataFrame(list(zip(clientIDs, addresses,zipCodes,countries, cities)),
                          columns=cols, index=addresses.index)

print("Number of US clients added: {}".format(US_addresses.shape[0]))
US_addresses.head()

Number of US clients added: 50


Unnamed: 0,ClientID,Address,Zip,Country,City
6,US002570,"5201 GREAT AMERICA PARKWAY, SUITE 270 SANTA CL...",,USA,
64,US002912,"2717 LINCOLN STREET EVANSTON, IL 60201",,USA,
115,US003521,"300 CARNEGIE CENTER SUITE 220 PRINCETON, NJ 08540",,USA,
136,US002719,"TWO LIBERTY PLACE 50 S. 16TH STREET, SUITE 320...",,USA,
162,US002755,"150 N MICHIGAN AVE | SUITE 2700 | CHICAGO, ILL...",,USA,


#### 2. Examine those with matches found and fix any issues

In [1774]:
'''Check the countries captured''' 
country_extracted.Country.value_counts().sort_index()

CANADA                             2
CHINA                              7
INDIA                              3
ISRAEL                             1
KOREA                              2
P.R.C                              1
SWITZERLAND                        2
THE UNITD STATES OF AMERICA        1
THE UNITED STATES OF AMERICA       3
U. S. A                            2
U.S.                               3
U.S.A                            253
UNITED STATE                       1
UNITED STATE OF AMERICA            1
UNITED STATES                     33
UNITED STATES OF AMERCIA           1
UNITED STATES OF AMERICA         539
UNITED STATES OF AMRICA            3
UNITED STATS OF AMERICA            1
UNITES STATES OF AMERICA           1
UNTED STATES OF AMERICA            1
UNTIED STATES OF AMERICA           1
US                                25
US                                 7
USA                             2704
上海                                 1
中国                                 1
加

In [1775]:
'''If any matches are not countries, replace them with empty string '''
raw_exclude = r"(SITY OF ARIZONA |TOURO STREET|ULSTER ST|UNION SQUARE|UNION STREET|UNIVERSITY (?:SERVICES|STREET))"
country_extracted.Country = country_extracted.Country.str.replace(raw_exclude, 'NA')
(country_extracted.Country=='NA').sum()

0

In [1776]:
'''Make the extracted data more easy to work with'''

# Unstack the multiindexed extracted data frame and then merge 'address' column
address_with_matching_country = country_extracted.unstack()
cols = list(address_with_matching_country.columns.get_level_values(1))
address_with_matching_country.columns = ["country{}".format(col) for col in cols]
address_with_matching_country['Address'] = address[address_with_matching_country.index]
address_with_matching_country['ClientID'] = HQ_clients.loc[address_with_matching_country.index, '客户代码']

print('Number of address found with matching country: {}'.format(address_with_matching_country.shape[0]))
address_with_matching_country.head()

Number of address found with matching country: 3590


Unnamed: 0,country0,country1,Address,ClientID
0,USA,,"ONE BLOSSOM ROAD, ROCHESTER, NY 14610 USA",US002562
1,USA,,"600 NORTH BRAND BLVD., SUITE 230 GLENDALE, CA ...",US002565
2,U.S.A,,C/O R.R. DONNELLEY & SONS COMPANY 111 SOUTH WA...,US002566
3,USA,,"430 CARMEL COURT CANTON, GEORGIA 30114 USA",US002567
4,USA,,"200 SIDNEY STREET SUITE 310 CAMBRIDGE, MASSACH...",US002568


In [1777]:
''' Now, filter the data with more than 1 matches and save it to a temp df '''
filter_1 = ~address_with_matching_country.country1.isna()
addresses_with_more_than_1match = address_with_matching_country[filter_1].sort_values('country1')
print("Number of records with more than 1 match found: {}".format(addresses_with_more_than_1match.shape[0]))
addresses_with_more_than_1match.sort_index()

Number of records with more than 1 match found: 18


Unnamed: 0,country0,country1,Address,ClientID
16,US,USA,"600 NORTH US HIGHWAY 45, LIBERTYVILLE, ILLINOI...",US002592
664,US,USA,"US 08648 NJ LAWRENCEVILLE, 1009 LENOX DRIVE, S...",US003161
1163,USA,USA,"CORNELL BUSINESS AND TECHNOLOGY PARK, 20 THORN...",US000240
1278,US,USA,"1815 N. US HIGHWAY 1, ORMOND BEACH, FLORIDA 32...",US000362
1388,CHINA,CHINA,5701 CHINA WORLD TOWER NO.1 JIANGUOMENWAI AVEN...,US000474
1389,US,USA,"1031 US HIGHWAY 22. SUITE 303 BRIDGEWATER, NEW...",US000475
1432,INDIA,INDIA,"NICHOLAS PIRAMAL RESEARCH CENTRE, NICHOLAS PIR...",US000518
1462,CHINA,CHINA,3201 CHINA WORLD TOWER 1 NO. 1 JIANGUOMENWAI A...,US000548
1575,INDIA,UNITED STATES OF AMERICA,"85 EAST INDIA ROW, 15G, BOSTON, MASSACHUSETTS ...",US000662
1586,U.S.,UNITED STATES OF AMERICA,"1801 U.S. HIGHWAY 52 N.W., WEST LAFAYETTE, IND...",US000672


In [1778]:
# inspect a specific address if any issue
addresses_with_more_than_1match.loc[1432].tolist()

['INDIA',
 'INDIA',
 'NICHOLAS PIRAMAL RESEARCH CENTRE, NICHOLAS PIRAMAL INDIA LIMITED, 1, NIRLON COMPLEX, OFF WESTERN EXPRESS HIGHWAY, GOREGAON (EAST). MUMBAI 400063, STATE OF MAHARASHTRA INDIA',
 'US000518']

In [1779]:
'''Update country data for the above with verified selection '''

# Selecte other country and US indices
other_country_idx = filter_rows_with_val2(addresses_with_more_than_1match, ['country1'], regex_others_both).index
us_idx = filter_rows_with_val2(addresses_with_more_than_1match, ['country1'], regex_US_both).index

# Fill in country data in original address_with_matching_country 
address_with_matching_country.loc[us_idx,'Country']='USA'
address_with_matching_country.loc[other_country_idx,'Country']=addresses_with_more_than_1match.loc[other_country_idx, 'country1']

# review the updated data
print("Number of records updated: {}".format(address_with_matching_country[~address_with_matching_country.Country.isnull()].shape[0]))
address_with_matching_country[~address_with_matching_country.Country.isnull()].head()

Number of records updated: 18


Unnamed: 0,country0,country1,Address,ClientID,Country
16,US,USA,"600 NORTH US HIGHWAY 45, LIBERTYVILLE, ILLINOI...",US002592,USA
664,US,USA,"US 08648 NJ LAWRENCEVILLE, 1009 LENOX DRIVE, S...",US003161,USA
1163,USA,USA,"CORNELL BUSINESS AND TECHNOLOGY PARK, 20 THORN...",US000240,USA
1278,US,USA,"1815 N. US HIGHWAY 1, ORMOND BEACH, FLORIDA 32...",US000362,USA
1388,CHINA,CHINA,5701 CHINA WORLD TOWER NO.1 JIANGUOMENWAI AVEN...,US000474,CHINA


In [1780]:
'''Now, fill in the verified countries for the rest of data (only 1 match) '''

filter_country1_NA = address_with_matching_country.country1.isna()
rest_data_with_only_1match = address_with_matching_country[filter_country1_NA]
print("Number of records with only 1 match found: {}".format(rest_data_with_only_1match.shape[0]))
rest_data_with_only_1match.head()

Number of records with only 1 match found: 3572


Unnamed: 0,country0,country1,Address,ClientID,Country
0,USA,,"ONE BLOSSOM ROAD, ROCHESTER, NY 14610 USA",US002562,
1,USA,,"600 NORTH BRAND BLVD., SUITE 230 GLENDALE, CA ...",US002565,
2,U.S.A,,C/O R.R. DONNELLEY & SONS COMPANY 111 SOUTH WA...,US002566,
3,USA,,"430 CARMEL COURT CANTON, GEORGIA 30114 USA",US002567,
4,USA,,"200 SIDNEY STREET SUITE 310 CAMBRIDGE, MASSACH...",US002568,


In [1781]:
# Select indices
filter_others = rest_data_with_only_1match.country0.str.contains(regex_others_both)
other_country_idx = rest_data_with_only_1match[filter_others].index.tolist()
filter_US = rest_data_with_only_1match.country0.str.contains(regex_US_both)
us_idx = rest_data_with_only_1match[filter_US].index.tolist()

# Fill in country data
address_with_matching_country.loc[us_idx,'Country']='USA'
address_with_matching_country.loc[other_country_idx, 'Country']=address_with_matching_country.loc[other_country_idx, 'country0']

# Review the final count of countries
address_with_matching_country['Country'].value_counts()


USA            3572
CHINA             5
山东省               2
SWITZERLAND       2
KOREA             2
P.R.C             1
中国                1
CANADA            1
INDIA             1
新竹市               1
上海                1
ISRAEL            1
Name: Country, dtype: int64

In [1782]:
# verify the above result by examining specific country 
filter_rows_with_val2(address_with_matching_country, ['country0','country1'],'山东省')

Unnamed: 0,country0,country1,Address,ClientID,Country
3326,山东省,,山东省潍坊市奎文区世纪泰华水印公寓D座28层,US002447,山东省
3327,山东省,,山东省潍坊市奎文区世纪泰华水印公寓D座28层,US002448,山东省


In [1783]:
US_addresses.columns

Index(['ClientID', 'Address', 'Zip', 'Country', 'City'], dtype='object')

#### 3. Save above data to corresponding dataframes

In [1784]:
'''Add the data with matching USA to `US_addresses'''
filter_US = address_with_matching_country.Country == 'USA'
cols = US_addresses.columns
US_addresses = US_addresses.append(address_with_matching_country.loc[filter_US, cols]).sort_index().drop_duplicates()
print("Number of US addresses: {}".format(US_addresses.shape[0]))
US_addresses.head()

Number of US addresses: 3622


Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)


Unnamed: 0,ClientID,Address,Zip,Country,City
0,US002562,"ONE BLOSSOM ROAD, ROCHESTER, NY 14610 USA",,USA,
1,US002565,"600 NORTH BRAND BLVD., SUITE 230 GLENDALE, CA ...",,USA,
2,US002566,C/O R.R. DONNELLEY & SONS COMPANY 111 SOUTH WA...,,USA,
3,US002567,"430 CARMEL COURT CANTON, GEORGIA 30114 USA",,USA,
4,US002568,"200 SIDNEY STREET SUITE 310 CAMBRIDGE, MASSACH...",,USA,


In [1785]:
''' Save the data with other matching countries to `other_addresses`'''
filter_other = address_with_matching_country.Country != 'USA'
other_addresses = address_with_matching_country.loc[filter_other,cols]
print("Number of other addresses: {}".format(other_addresses.shape[0]))
other_addresses.to_excel('Clients in other countries.xlsx')
other_addresses

Number of other addresses: 18


Unnamed: 0,ClientID,Address,Zip,Country,City
346,US002993,"403, 30 MCHUGH CRT., CALGARY, ALBERTA T2E 7X3 ...",,CANADA,
854,US002836,"33-4, SANGBONGJUNGANG-RO 8-NA-GIL, JUNGNANG-GU...",,KOREA,
1388,US000474,5701 CHINA WORLD TOWER NO.1 JIANGUOMENWAI AVEN...,,CHINA,
1404,US000490,中国北京市朝阳区 延静西里2号华商大厦918,,中国,
1432,US000518,"NICHOLAS PIRAMAL RESEARCH CENTRE, NICHOLAS PIR...",,INDIA,
1462,US000548,3201 CHINA WORLD TOWER 1 NO. 1 JIANGUOMENWAI A...,,CHINA,
1514,US000600,"4TH FLOOR, 27 ZHONGSHAN DONG YI ROAD, SHANGHAI...",,CHINA,
1623,US000708,"SUITE 4201, 42D FLOOR, BUND CENTER 222 YAN AN ...",,CHINA,
1653,US000739,"1F, BUILDING 28, RING BUILDING, ZHONGGUANCUN S...",,P.R.C,
2247,US001344,上海市天钥桥路30号美罗大厦17楼 邮编：200030,,上海,


## 4. Extract Zips from US addresses

### Extract Zip from address

In [1786]:
'''Extract all possible matching strings from address'''
extracted_Zip_from_address = US_addresses.Address.str.extractall(regex_zip).unstack()
print("Matches found: {}".format(extracted_Zip_from_address.shape[0]))
print("\nSample: ")
extracted_Zip_from_address.head()

Matches found: 3577

Sample: 


Unnamed: 0_level_0,Zip,Zip,Zip,Zip
match,0,1,2,3
0,14610,,,
1,91203,,,
2,60606,,,
3,30114,,,
4,2139,,,


#### 1. Review unmatched records to catch more pattern

In [1787]:
'''Examine the addresses without zip captured'''
'''look for patterns to capture more zip and add them to regex above, rerun extraction'''
no_matching_zip = US_addresses[~US_addresses.index.isin(extracted_Zip_from_address.index)]
print("Number of records without matches found: {}".format(no_matching_zip.shape[0]))
no_matching_zip.Address.tolist()

Number of records without matches found: 45


['333 CENTENNIAL PARKWAY SUITE B LOUISVILLE, CO USA',
 '1909 K STREET NW, SUITE 900, USA',
 'LAGUNA NIGUEL, CA USA',
 'LAGUNA HILLS, CA USA',
 'DELAWARE USA',
 '1434 AIR RAIL AVENUE VIRGINIA BEACH, VA USA',
 '57 SEAVIEW BOULEVARD PORT WASHINGTON,NY U.S.A.',
 'NORTH CAROLINA, USA',
 'ROCKWALL TEXAS UNITED STATES OF AMERICA',
 'ROCKWALL TEXAS UNITED STATES OF AMERICA',
 '3472 88TH AVENUE NE CIRCLE PINES, MN USA',
 '1613 PINEVIEW DRIVE RALEIGH, NORTH CAROLINA UNITED STATES OF AMERICA',
 '1401 CROOKS ROAD, TROY, MICHIGAN USA',
 '941 AVENIDA ACASO, CAMARILLO, CALIFORNIA, USA',
 'SAN DIEGO USA',
 '10 SAINT JAMES AVENUE, 11TH FLOOR | BOSTON, MA 021',
 'TEXAS USA',
 'PLYMOUTH, MN USA',
 'BURNSVILLE, MN USA',
 '3411 SILVERSIDE RD., RODNEY BLDG. SUITE 104, WILMINGTON, DELAWARE USA',
 '941 AVENIDA ACASO, CAMARILLO, CALIFORNIA, USA',
 '9000 CROW CANYON ROAD, SUITE S393, DANVILLE, CALIFORNIA 94S06 USA',
 'IN. U.S.A.',
 'DELAWARE, U.S.A.',
 '美国加州新港海滨詹伯瑞路4311号',
 '301 MERRITT 7, NORWALK, CONNECTICUT 

In [1788]:
'''Quick fix to specific case'''
# if there is particular error, e.g. `CALIFORNIA 94S06 USA`, fix it right away in `US_addresses`, and then rerun extraction
US_addresses['Address'] = US_addresses.Address.str.replace('94S06', '94506')

In [1789]:
'''Review the result after adding a zip pattern to regex'''
# Use custom function filter_rows_with_zip to check whether the a specific zip code pattern is captured now
# e.g. can it capture '97086' in "APT. 1306 97086 HAPPYVALLEY OREGON UNITED STATES OF AMERICA" now? - Yes!
containZip = filter_rows_with_val(extracted_Zip_from_address, extracted_Zip_from_address.columns, '94506')
US_addresses.Address.loc[extracted_Zip_from_address[containZip].index].tolist()

['5443 BLACKHAWK DRIVE DANVILLE, CALIFORNIA 94506 UNITED STATES OF AMERICA',
 '9000 CROW CANYON ROAD, SUITE S393, DANVILLE, CALIFORNIA 94506, USA']

#### 2. Review the records with matching zip

In [1790]:
'''Make the extracted data more easy to work with'''
extracted_Zip_from_address['Address'] = US_addresses.Address.loc[extracted_Zip_from_address.index]
extracted_Zip_from_address.columns = ['Zip1', 'Zip2', 'Zip3', 'Zip4','Address']
print(extracted_Zip_from_address.shape[0])
extracted_Zip_from_address.sample(10)

3577


Unnamed: 0,Zip1,Zip2,Zip3,Zip4,Address
3438,92008,,,,"5545 FERMI COURT CARLSBAD, CALIFORNIA 92008 TH..."
2868,94104,,,,"555 CALIFORNIA STREET, 12TH FLOOR SAN FRANCISC..."
1054,30097,,,,"305 GREEN WAY, DULUTH, CA 30097 U.S.A."
739,06268-0503,,,,"P.O. BOX 503 STORRS, CONNECTICUT U.S.A. 06268-..."
3151,92507,,,,"1040 IOWA AVENUE, SUITE 100 RIVERSIDE, CALIFOR..."
2772,84055,,,,"4383 NORTH RIVER ROAD OAKLEY, UTAH 84055 USA"
558,7400,48098,,,"750 TOWER DRIVE, MAIL CODE 7400 TROY, MI 48098..."
3066,60093,,,,"THREE LAKES DRIVE NORTHFIELD, ILLINOIS 60093 USA"
2387,1300,84106,,,"2150 SOUTH 1300 EAST, SUITE 500, SALT LAKE CIT..."
1841,1201,30309-3424,,,ONE ATLANTIC CENTER 1201 WEST PEACHTREE STREET...


## 5. Extract city from addresses without matching zip

In [1791]:
'''Extract the possible city information - any string after road/unit'''
no_matching_zip_extracted_cities = no_matching_zip.Address.apply(rest_of_str_aft_road_or_unit)
clean_rest_of_str_aft_rd(no_matching_zip_extracted_cities)

72                                     LOUISVILLE, CO USA
168                                                   USA
238                                 LAGUNA NIGUEL, CA USA
239                                  LAGUNA HILLS, CA USA
283                                          DELAWARE USA
284                                VIRGINIA BEACH, VA USA
305                                PORT WASHINGTON,NY USA
336                                   NORTH CAROLINA, USA
337                                    ROCKWALL TEXAS USA
338                                    ROCKWALL TEXAS USA
379                                  CIRCLE PINES, MN USA
396                           RALEIGH, NORTH CAROLINA USA
399                                    TROY, MICHIGAN USA
474                            CAMARILLO, CALIFORNIA, USA
520                                         SAN DIEGO USA
589                                            BOSTON, MA
613                                             TEXAS USA
665           

In [1792]:
'''Clean up the extracted city information'''

# clean up English data only
onlyEnglish = no_matching_zip_extracted_cities.apply(isEnglish)
no_matching_zip_extracted_cities[onlyEnglish] = clean_rest_of_str_aft_rd(no_matching_zip_extracted_cities[onlyEnglish])
no_matching_zip_extracted_cities   

72                                     LOUISVILLE, CO USA
168                                                   USA
238                                 LAGUNA NIGUEL, CA USA
239                                  LAGUNA HILLS, CA USA
283                                          DELAWARE USA
284                                VIRGINIA BEACH, VA USA
305                                PORT WASHINGTON,NY USA
336                                   NORTH CAROLINA, USA
337                                    ROCKWALL TEXAS USA
338                                    ROCKWALL TEXAS USA
379                                  CIRCLE PINES, MN USA
396                           RALEIGH, NORTH CAROLINA USA
399                                    TROY, MICHIGAN USA
474                            CAMARILLO, CALIFORNIA, USA
520                                         SAN DIEGO USA
589                                            BOSTON, MA
613                                             TEXAS USA
665           

In [1793]:
# Examine speicifc address if needed
US_addresses.loc[2981].tolist()

['US002094', '8905 S.W. 87TH AVENUE, MIAMI, USA', nan, 'USA', nan]

In [1794]:
# Update the non-English data with English city info'''
no_matching_zip_extracted_cities[(~onlyEnglish)]

1776           美国加州新港海滨詹伯瑞路4311号
2316    加州95124-3400圣荷西罗吉克路2100号
2365            美国 加州 佛利蒙 利马街51号
Name: Address, dtype: object

In [1795]:
no_matching_zip_extracted_cities[(~onlyEnglish)] = ['New Port, CA USA','San Jose, CA USA','Fremont, CA USA']
no_matching_zip_extracted_cities[(~onlyEnglish)]

1776    New Port, CA USA
2316     Fremont, CA USA
2365    New Port, CA USA
Name: Address, dtype: object

In [1796]:
'''Save the extracted cities from no_matching_zip addresses to US_addresses'''
clean = transform_addresses(no_matching_zip_extracted_cities).str.replace(r'[\s\,\.]*?USA$', '')
US_addresses.loc[no_matching_zip_extracted_cities.index,'City'] = clean
print("Number of records with city: {}".format(US_addresses[~US_addresses.City.isna()].shape[0]))
US_addresses[~US_addresses.City.isna()]

Number of records with city: 45


Unnamed: 0,ClientID,Address,Zip,Country,City
72,US002913,"333 CENTENNIAL PARKWAY SUITE B LOUISVILLE, CO USA",,USA,"LOUISVILLE, CO"
168,US002740,"1909 K STREET NW, SUITE 900, USA",,USA,
238,US002809,"LAGUNA NIGUEL, CA USA",,USA,"LAGUNA NIGUEL, CA"
239,US002810,"LAGUNA HILLS, CA USA",,USA,"LAGUNA HILLS, CA"
283,US003046,DELAWARE USA,,USA,DELAWARE
284,US003048,"1434 AIR RAIL AVENUE VIRGINIA BEACH, VA USA",,USA,"VIRGINIA BEACH, VA"
305,US003485,"57 SEAVIEW BOULEVARD PORT WASHINGTON,NY U.S.A.",,USA,"PORT WASHINGTON,NY"
336,US002977,"NORTH CAROLINA, USA",,USA,NORTH CAROLINA
337,US002963,ROCKWALL TEXAS UNITED STATES OF AMERICA,,USA,ROCKWALL TEXAS
338,US002964,ROCKWALL TEXAS UNITED STATES OF AMERICA,,USA,ROCKWALL TEXAS


## 6. Continue to work on the addresses with matching zips - pick the right zip

### Clean all captured zip strings


In [1797]:
### Standardize zip codes formatting
cols = ['Zip1', 'Zip2', 'Zip3', 'Zip4']
extracted_Zip_from_address[cols] = extracted_Zip_from_address[cols].apply(clean_zip_cols)
print("Number of records: {}".format(extracted_Zip_from_address.shape[0]))
extracted_Zip_from_address.sample(10)

Number of records: 3577


Unnamed: 0,Zip1,Zip2,Zip3,Zip4,Address
1350,52242,,,,"6 GILMORE HALL, 112 N. CAPITOL STREET IOWA CIT..."
575,94107,,,,"303 2ND STREET SAN FRANCISCO, CA 94107 USA"
2609,48083,,,,"1870 TECHNOLOGY DRIVE TROY, MICHIGAN 48083 USA"
393,94539,,,,"47102 MISSION FALLS COURT, SUITE 218 FREMONT, ..."
1247,95670,,,,"2330 GOLD MEADOW WAY, GOLD RIVER, CALIFORNIA 9..."
3456,97223,,,,"7166 SW OLESON ROAD, APT. 47 97223 PORTLAND OR..."
3247,63010,,,,"537 HICKORY MANOR ARNOLD, MISSOURI 63010 USA"
2833,5600,77002-1001,,,"600 TRAVIS SUITE 5600, HOUSTON TEXAS 77002-100..."
916,19103-7505,,,,"1735 MARKET STREET PHILADELPHIA, PA 19103-7505..."
988,06901-3431,,,,"ONE STAMFORD FORUM, 201 TRESSER BOULEVARD, STA..."


### Filtered 1 - those have only 1 matching zip string and length is not shorter than 5 (assumed correct for now)

In [1798]:
# conditions
match_1_only = ~extracted_Zip_from_address[cols[1:]].any(axis=1) # df.any(axis=1) --> check if value in all cols by row 
zip1_len_smaller_than_5 = extracted_Zip_from_address['Zip1'].str.len() < 5

# filtered data
filtered_1 = extracted_Zip_from_address[match_1_only & (~zip1_len_smaller_than_5)].copy()
print(filtered_1.shape)
filtered_1.sample(20)

(3070, 5)


Flushing oldest 200 entries.
  'Flushing oldest {cull_count} entries.'.format(sz=sz, cull_count=cull_count))


Unnamed: 0,Zip1,Zip2,Zip3,Zip4,Address
1221,06254,,,,"841 ROUTE 32, UNIT 2, FRANKLIN, CONNECTICUT 06..."
3168,02421,,,,"33 HAYDEN AVENUE LEXINGTON, MASSACHUSETTS 0242..."
1187,20151,,,,"4511 SINGER COURT, SUITE 300, CHANTILLY, VIRGI..."
1436,24018,,,,"3959 ELECTRIC ROAD SW, SUITE 330, ROANOKE, VIR..."
764,04103,,,,"500 RIVERSIDE INDUSTRIAL PARKWAY, PORTLAND ME ..."
2883,91107,,,,"2409 ONEIDA STREET, UNIT A, PASADENA, CA 91107..."
760,06851,,,,"761 MAIN AVENUE BUILDING G, 2ND FLOOR, NORWALK..."
106,90017,,,,"725 S. FIGUEROA STREET, SUITE 350 LOS ANGELES,..."
1989,11794-8480,,,,"SUNY, STONY BROOK, HSC, L4, #060 STONY BROOK, ..."
2330,30328,,,,"5871 GLENRIDGE DRIVE, SUITE 300, ATLANTA, GEOR..."


In [1799]:
''' save the extracted zips to `US_addresses'''
US_addresses.loc[filtered_1.index, 'Zip']=filtered_1['Zip1']
print(US_addresses.loc[filtered_1.index].shape[0])
US_addresses.loc[filtered_1.index].head()

3070


Unnamed: 0,ClientID,Address,Zip,Country,City
0,US002562,"ONE BLOSSOM ROAD, ROCHESTER, NY 14610 USA",14610,USA,
1,US002565,"600 NORTH BRAND BLVD., SUITE 230 GLENDALE, CA ...",91203,USA,
2,US002566,C/O R.R. DONNELLEY & SONS COMPANY 111 SOUTH WA...,60606,USA,
3,US002567,"430 CARMEL COURT CANTON, GEORGIA 30114 USA",30114,USA,
4,US002568,"200 SIDNEY STREET SUITE 310 CAMBRIDGE, MASSACH...",2139,USA,


### Filter 2 - those have only 1 matching zip string and length is shorter than 5 (there must be some issues)

In [1800]:
filtered_2 = extracted_Zip_from_address.loc[match_1_only & zip1_len_smaller_than_5].copy()

# inspect the filtered data
for row in filtered_2.iterrows():
  print("index: {} \n".format(row[0]), 
        "Zip: {}\n".format(row[1][0]), 
        "Address: {}".format(row[1][4]),
        "\n")

index: 1085 
 Zip: 1760
 Address: INTERNATIONAL TOWER, SUITE 1760 S. FIGUEROA ., LOS ANGELES, CALIFORNIA, U.S.A. 

index: 1351 
 Zip: 2042
 Address: 810 VERMONT AVENUE N. W., WASHINGTON, D. C. 2042 USA 

index: 2263 
 Zip: 3465
 Address: 2227 WELBILT BOULEVARD, NEW PORT RICHEY, FLORIDA 3465 USA 

index: 2812 
 Zip: 0962
 Address: 330 BLAISDELL ROAD ORANGEBURG, NEW YORK 0962, UNITED STATES OF AMERICA 

index: 2944 
 Zip: 2800
 Address: OFFICE OF TECHNOLOGY TRANSFER & ECONOMIC DEVELOPMENT, 2800 WOODLAWN DRIVE, SUITE 280 HONOLULU, HI, USA 



In [1801]:
# Create a dict to store city info with row index as key for the above
dataMap = {1085: ('', 'Los Angeles, CA, USA'),
           1351: ('20420', 'Washington, DC, USA'),
           2263: ('34655', 'NEW PORT RICHEY, FL, USA'),
           2812: ('10962', 'ORANGEBURG, NY, USA'),
           2944: ('', 'HONOLULU, HI, USA')}

In [1802]:
filtered_2_rest_of_str_aft_road_or_unit = (filtered_2.Address[filtered_2.Address.apply(isEnglish)]
                                    .apply(rest_of_str_aft_road_or_unit))
filtered_2_rest_of_str_aft_road_or_unit

1085       S. FIGUEROA ., LOS ANGELES, CALIFORNIA, U.S.A.
1351                           WASHINGTON, D. C. 2042 USA
2263                      W PORT RICHEY, FLORIDA 3465 USA
2812    ORANGEBURG, NEW YORK 0962, UNITED STATES OF AM...
2944                                    HONOLULU, HI, USA
Name: Address, dtype: object

In [1803]:
def make_regex_stateAbbr(state_abbr):
    regex = state_abbr[0] + r"\.?\s*" + state_abbr[1] + r"\.?\s*"
    return regex


regex_state_abbr = make_regex_stateAbbr('DC')
filtered_2_rest_of_str_aft_road_or_unit.str.contains(regex_state_abbr)

1085    False
1351     True
2263    False
2812    False
2944    False
Name: Address, dtype: bool

In [1804]:
# city,\s+ (state | regex_state_abbr)
# if match found, return the reference value (not the matching string in the original value)
def make_df_regex_city_state_with_ref(cities,states, state_abbrs):
    city_state_ref = cities + ', ' + state_abbrs
    regex = (r"(?:" 
            + cities.str.upper() + '[\,\.]?\s+' 
            + states.str.upper() + r"|" 
            + cities.str.upper() + '[\,\.]?\s+' 
            + state_abbrs.apply(make_regex_stateAbbr) + r")")
    return pd.DataFrame({'ref':city_state_ref, 'regex': regex}).drop_duplicates().reset_index()

regex_city_state_with_ref = (make_df_regex_city_state_with_ref(zipcodes['Place Name'], zipcodes['State'], zipcodes['State_Abbr']))
print('Number of regex_city_state_with_ref created: {}'.format(regex_city_state_with_ref.shape[0]))
regex_city_state_with_ref[regex_city_state_with_ref.ref=='Los Angeles, CA']

Number of regex_city_state_with_ref created: 29545


Unnamed: 0,index,ref,regex
27111,36725,"Los Angeles, CA","(?:LOS ANGELES[\,\.]?\s+CALIFORNIA|LOS ANGELES..."


In [1805]:
def extract_match_of_a_pat(val, pat):
    # matchObject.group(0) to return match if found, None otherwise
    if re.search(str(pat),val):
        #return re.search(str(pat),val).group(0)
        return pat

def extract_match_from_a_patlist(val, patlist):
    if isinstance(patlist,(pd.core.series.Series,np.ndarray)):
        patlist = patlist.tolist()
    try:
        return [patlist.index(x) for x in patlist if extract_match_of_a_pat(val, x)][0]
    except IndexError:
    #if len(matching_pat_idx) == 0:
     #   return None
        return None

#### test of regex used in make_regex_city_state_with_ref ###
#regex_test = r"(?:" + 'LOS ANGELES' + '[\,\.]?\s+' + 'CALIFORNIA' + r"|" + 'LOS ANGELES' + '[\,\.]?\s+' + make_regex_stateAbbr('CA') + r")"
#regex_test2 = r"(?:WASHINGTON[\,\.]?\s+DISTRICT OF COLUMBIA|WASHINGTON[\,\.]?\s+D\.?\s*C\.?\s*)"
#print(regex_city_state_with_ref[0])
#print(regex_test)
#re.search(regex_test, filtered_2_rest_of_str_aft_road_or_unit.loc[1351])


#### test of extract_match_of_a_pat and regex ###
# extract_match_of_a_pat(regex_test,filtered_2_rest_of_str_aft_road_or_unit.loc[1351])
#val = filtered_2_rest_of_str_aft_road_or_unit.loc[1085]
#patlist = [regex_test2, regex_test]
#extract_match_from_a_patlist(val, patlist)

In [1806]:
def extract_match_from_a_pat_df(val, pat_df, ref_col_in_pat_df, pat_col_in_pat_df):
    ls = pat_df[pat_col_in_pat_df].tolist()
    try:
        idx = [ls.index(x) for x in ls if extract_match_of_a_pat(val, x)][0]
    except IndexError:
    #if no match extracted then no idx
        return np.NaN
    return pat_df[ref_col_in_pat_df][idx]

In [1807]:
# test
extract_match_from_a_pat_df('LOS ANGELES, CALIFORNIA', regex_city_state_with_ref, 'ref', 'regex')

'Los Angeles, CA'

In [1808]:
result = filtered_2_rest_of_str_aft_road_or_unit.apply(extract_match_from_a_pat_df, 
                                                args=(regex_city_state_with_ref,'ref', 'regex'))

In [1809]:
'''save the extracted cities to `US_addresses '''
US_addresses.loc[result.index,'City']=result
US_addresses.loc[result.index]

Unnamed: 0,ClientID,Address,Zip,Country,City
1085,US000161,"INTERNATIONAL TOWER, SUITE 1760 S. FIGUEROA .,...",,USA,"Los Angeles, CA"
1351,US000436,"810 VERMONT AVENUE N. W., WASHINGTON, D. C. 20...",,USA,"Washington, DC"
2263,US001360,"2227 WELBILT BOULEVARD, NEW PORT RICHEY, FLORI...",,USA,"Port Richey, FL"
2812,US001922,"330 BLAISDELL ROAD ORANGEBURG, NEW YORK 0962, ...",,USA,"Orangeburg, NY"
2944,US002057,OFFICE OF TECHNOLOGY TRANSFER & ECONOMIC DEVEL...,,USA,"Honolulu, HI"


In [1810]:
US_addresses.loc[2263,'Zip'] = '34655'

### Filter 3 - those with more than 1 match require further cleaning

In [1811]:
filtered3 = extracted_Zip_from_address[~match_1_only]
print(filtered3.info())
filtered3.sample(20)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 502 entries, 23 to 3636
Data columns (total 5 columns):
Zip1       502 non-null object
Zip2       502 non-null object
Zip3       31 non-null object
Zip4       1 non-null object
Address    502 non-null object
dtypes: object(5)
memory usage: 23.5+ KB
None


Unnamed: 0,Zip1,Zip2,Zip3,Zip4,Address
929,2200,55402,,,"150 SOUTH FIFTH STREET SUITE 2200 MINNEAPOLIS,..."
351,2160,90067,,,"1801 CENTURY PARK EAST, SUITE 2160 LOS ANGELES..."
2945,1800,28202-5013,,,FIRST CITIZENS BANK PLAZA 128 SOUTH TRYON STRE...
3173,1106,76101,,,"LEGAL DEPARTMENT, MAIL STOP 1106 POST OFFICE B..."
737,2800,60661,,,"500 WEST MADISON STREET, SUITE 2800 CHICAGO, I..."
413,4505,90274,,,"P.O. BOX 4505 PALOS VERDES PENINSULA, CALIFORN..."
2656,1100,63105,,,"7700 FORSYTH BOULEVARD, SUITE 1100, ST. LOUIS,..."
2833,5600,77002-1001,,,"600 TRAVIS SUITE 5600, HOUSTON TEXAS 77002-100..."
782,1007,1596,19899.0,,"NEMOURS BUILDING, 1007 ORANGE STREET, SUITE 20..."
1512,1100,20006,,,"1625 K STREET, N.W. - SUITE 1100 WASHINGTON, D..."


#### Those found with digits in this format '12345-1234' are highly possible to be zips

In [1812]:
for label, col in filtered3[['Zip1', 'Zip2', 'Zip3', 'Zip4']].items():
     filtered3[str(label)+"match"] = col.str.extract(r"(\d{5}-\d{3,})")

filtered3.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Unnamed: 0,Zip1,Zip2,Zip3,Zip4,Address,Zip1match,Zip2match,Zip3match,Zip4match
23,358880,32635-8880,,,"P.O. BOX 358880 GAINESVILLE, FL 32635-8880 USA",,32635-8880,,
35,1926,29304,,,LEGAL DEPARTMENT (M-495) P.O. BOX 1926 SPARTAN...,,,,
45,2100,44114,,,"600 SUPERIOR AVENUE SUITE 2100 CLEVELAND, OH 4...",,,,
58,1500,55402,,,"50 SOUTH SIXTH STREET, SUITE 1500 MINNEAPOLIS,...",,,,
60,2100,75251,,,"12700 PARK CENTRAL DRIVE, #2100 DALLAS, TX 752...",,,,


In [1813]:
cols = ['Zip1match','Zip2match','Zip3match', 'Zip4match']
filtered3_zip_OK = filtered3[filtered3[cols].any(axis=1)]
for col in cols:
    print(col+" {}".format((~filtered3_zip_OK[col].isnull()).sum()))
print("Total: {}".format(filtered3_zip_OK.shape[0]))
filtered3_zip_OK.head()

Zip1match 0
Zip2match 136
Zip3match 12
Zip4match 0
Total: 148


Unnamed: 0,Zip1,Zip2,Zip3,Zip4,Address,Zip1match,Zip2match,Zip3match,Zip4match
23,358880,32635-8880,,,"P.O. BOX 358880 GAINESVILLE, FL 32635-8880 USA",,32635-8880,,
119,8210,27695-8210,,,"CAMPUS BOX 8210 RALEIGH, NORTH CAROLINA 27695-...",,27695-8210,,
136,3200,19102-2555,,,"TWO LIBERTY PLACE 50 S. 16TH STREET, SUITE 320...",,19102-2555,,
231,3500,60602-4424,,,"70 WEST MADISON – SUITE 3500 CHICAGO, ILLINOIS...",,60602-4424,,
247,4000,55402-1425,,,"200 SOUTH SIXTH STREET, SUITE 4000 MINNEAPOLIS...",,55402-1425,,


In [1814]:
Zip1match_dict = filtered3_zip_OK.loc[filtered3_zip_OK.Zip1match.notna(),'Zip1match'].to_dict()
Zip2match_dict = filtered3_zip_OK.loc[filtered3_zip_OK.Zip2match.notna(),'Zip2match'].to_dict()
Zip3match_dict = filtered3_zip_OK.loc[filtered3_zip_OK.Zip3match.notna(),'Zip3match'].to_dict()
zipMatch_dict = {**Zip1match_dict, **Zip2match_dict, **Zip3match_dict}
zipMatch = pd.Series(zipMatch)
zipMatch.sort_index().head()

23     32635-8880
119    27695-8210
136    19102-2555
231    60602-4424
247    55402-1425
dtype: object

In [1815]:
'''Save the extracted zips from filtered3_zip_OK to US_addresses'''
US_addresses.loc[zipMatch.index,'Zip'] = zipMatch
print(US_addresses.loc[zipMatch.index].shape[0])
US_addresses.loc[zipMatch.index].sort_index().head()

149


Unnamed: 0,ClientID,Address,Zip,Country,City
23,US002581,"P.O. BOX 358880 GAINESVILLE, FL 32635-8880 USA",32635-8880,USA,
119,US003537,"CAMPUS BOX 8210 RALEIGH, NORTH CAROLINA 27695-...",27695-8210,USA,
136,US002719,"TWO LIBERTY PLACE 50 S. 16TH STREET, SUITE 320...",19102-2555,USA,
231,US002802,"70 WEST MADISON – SUITE 3500 CHICAGO, ILLINOIS...",60602-4424,USA,
247,US002826,"200 SOUTH SIXTH STREET, SUITE 4000 MINNEAPOLIS...",55402-1425,USA,


#### Continue to work on the address with more than 1 zip matches and don't meet zip 10digits format requirement

In [1816]:
filtered3_zip_not_OK = filtered3[~filtered3.index.isin(filtered3_zip_OK.index)]
print(filtered3_zip_not_OK.shape[0])
filtered3_zip_not_OK.head()

354


Unnamed: 0,Zip1,Zip2,Zip3,Zip4,Address,Zip1match,Zip2match,Zip3match,Zip4match
35,1926,29304,,,LEGAL DEPARTMENT (M-495) P.O. BOX 1926 SPARTAN...,,,,
45,2100,44114,,,"600 SUPERIOR AVENUE SUITE 2100 CLEVELAND, OH 4...",,,,
58,1500,55402,,,"50 SOUTH SIXTH STREET, SUITE 1500 MINNEAPOLIS,...",,,,
60,2100,75251,,,"12700 PARK CENTRAL DRIVE, #2100 DALLAS, TX 752...",,,,
85,1600,77046,,,"24 GREENWAY PLAZA, SUITE 1600, HOUSTON, TX 770...",,,,


In [1817]:
'''regex to capture zip and country(US) in the following order'''
regex = regex_zip + r"\W*" + r"(?P<Country>" + regex_country_EN + r")?" + r"\W*$"

# extract the zip and country that match the above regex
#filtered3_zip_not_OK = filtered3_zip_not_OK.str.extractall(regex)
#print(filtered3_zip_not_OK.info())
#filtered3_zip_not_OK.head()

filtered3_zip_not_OK_extracted = filtered3_zip_not_OK.Address.str.extractall(regex).unstack()
print(filtered3_zip_not_OK_extracted.shape[0])
filtered3_zip_not_OK_extracted.head()

351


Unnamed: 0_level_0,Zip,Country
match,0,0
35,29304,USA
45,44114,USA
58,55402,UNITED STATES OF AMERICA
60,75251,USA
85,77046,USA


In [1818]:
filtered3_zip_not_OK_extracted.columns = ['Zip', 'Country']
filtered3_zip_not_OK_extracted.head()

Unnamed: 0,Zip,Country
35,29304,USA
45,44114,USA
58,55402,UNITED STATES OF AMERICA
60,75251,USA
85,77046,USA


In [1819]:
# verify if country captured is correct
filtered3_zip_not_OK_extracted.loc[filtered3_zip_not_OK_extracted.Country.notna(), 'Country'].value_counts()

USA                         271
UNITED STATES OF AMERICA     33
U.S.A                        28
US                            3
UNITED STATES                 2
UNITED STATES OF AMRICA       1
Name: Country, dtype: int64

In [1820]:
'''assume those with matching country(US) found following zip are done, save to US_addresses (review later)'''
more_than_1match_zip_followed_by_country_idx = filtered3_zip_not_OK_extracted[filtered3_zip_not_OK_extracted.Country.notna()].index
US_addresses.loc[more_than_1match_zip_followed_by_country_idx,'Zip'] = (
clean_zip_cols(filtered3_zip_not_OK_extracted.loc[more_than_1match_zip_followed_by_country_idx,'Zip']))

print(US_addresses.loc[more_than_1match_zip_followed_by_country_idx.shape[0]])
US_addresses.loc[more_than_1match_zip_followed_by_country_idx].head()

ClientID                                   US002964
Address     ROCKWALL TEXAS UNITED STATES OF AMERICA
Zip                                             NaN
Country                                         USA
City                                 ROCKWALL TEXAS
Name: 338, dtype: object


Unnamed: 0,ClientID,Address,Zip,Country,City
35,US002678,LEGAL DEPARTMENT (M-495) P.O. BOX 1926 SPARTAN...,29304,USA,
45,US003426,"600 SUPERIOR AVENUE SUITE 2100 CLEVELAND, OH 4...",44114,USA,
58,US002905,"50 SOUTH SIXTH STREET, SUITE 1500 MINNEAPOLIS,...",55402,USA,
60,US002907,"12700 PARK CENTRAL DRIVE, #2100 DALLAS, TX 752...",75251,USA,
85,US003459,"24 GREENWAY PLAZA, SUITE 1600, HOUSTON, TX 770...",77046,USA,


In [1821]:
''' Review those without matching country(US) '''
#more_than_1match_zip_followed_by_NO_country = US_addresses[filtered3_zip_not_OK_extracted.Country.isnull()]
#more_than_1match_zip_followed_by_NO_country

more_than_1match_zip_followed_by_NO_country_idx = (
    filtered3_zip_not_OK_extracted[filtered3_zip_not_OK_extracted.Country.isnull()].index)

US_addresses.loc[more_than_1match_zip_followed_by_NO_country_idx,'Address'].tolist()

['150 N MICHIGAN AVE | SUITE 2700 | CHICAGO, ILLINOIS 60601',
 '600 BANNER PLACE TOWER 12770 COIT ROAD DALLAS, TEXAS 75251',
 '10250 CONSTELLATION BLVD. SUITE 1700 LOS ANGELES, CA 90067',
 '2603 AUGUSTA DRIVE SUITE 1270 HOUSTON, TX 77057',
 '100 EAST WISCONSIN AVENUE, SUITE 1100 MILWAUKEE, WI 53202',
 '6400 SOUTH FIDDLERS GREEN CIRCLE SUITE 1610 GREENWOOD VILLAGE,CO 80111',
 'THREE EMBARCADERO CENTER,SUITE 1350 SAN FRANCISCO, CA 94111',
 '222 SOUTH MAIN STREET, SUITE 2200 SALT LAKE CITY, UTAH 84101',
 '1875 EYE STREET NW, SUITE 1200 WASHINGTON, DC 20006',
 '4 PENN CENTER, 1600 JFK BLVD., 2ND FLOOR, PHILADELPHIA, PA 19103',
 '1170 PEACHTREE STREET NE, SUITE 1200, ATLANTA, GEORGIA, USA, 300309',
 '999 PEACHTREE STREET NE SUITE 1300 ATLANTA, GEORGIA 30309',
 '701 FIFTH AVENUE, SUITE 4800-SEATTLE, WASHINGTON 98104']

In [1822]:
filtered3_zip_not_OK_extracted.loc[more_than_1match_zip_followed_by_NO_country_idx,'Zip']

162      60601
221      75251
234      90067
421      77057
429      53202
560      80111
576      94111
595      84101
757      20006
867      19103
2604    300309
3441     30309
3636     98104
Name: Zip, dtype: object

In [1823]:
filtered3_zip_not_OK_extracted.loc[2604, 'Zip']='30309'
filtered3_zip_not_OK_extracted.loc[more_than_1match_zip_followed_by_NO_country_idx,'Zip']

162     60601
221     75251
234     90067
421     77057
429     53202
560     80111
576     94111
595     84101
757     20006
867     19103
2604    30309
3441    30309
3636    98104
Name: Zip, dtype: object

In [1824]:
# Save the above extracted zips to US_addresses
US_addresses.loc[more_than_1match_zip_followed_by_NO_country_idx,'Zip'] = (
    filtered3_zip_not_OK_extracted.loc[more_than_1match_zip_followed_by_NO_country_idx,'Zip'])
US_addresses.loc[more_than_1match_zip_followed_by_NO_country_idx]

Unnamed: 0,ClientID,Address,Zip,Country,City
162,US002755,"150 N MICHIGAN AVE | SUITE 2700 | CHICAGO, ILL...",60601,USA,
221,US002801,"600 BANNER PLACE TOWER 12770 COIT ROAD DALLAS,...",75251,USA,
234,US002807,10250 CONSTELLATION BLVD. SUITE 1700 LOS ANGEL...,90067,USA,
421,US003601,"2603 AUGUSTA DRIVE SUITE 1270 HOUSTON, TX 77057",77057,USA,
429,US003614,"100 EAST WISCONSIN AVENUE, SUITE 1100 MILWAUKE...",53202,USA,
560,US002944,6400 SOUTH FIDDLERS GREEN CIRCLE SUITE 1610 GR...,80111,USA,
576,US002937,"THREE EMBARCADERO CENTER,SUITE 1350 SAN FRANCI...",94111,USA,
595,US003491,"222 SOUTH MAIN STREET, SUITE 2200 SALT LAKE CI...",84101,USA,
757,US002758,"1875 EYE STREET NW, SUITE 1200 WASHINGTON, DC ...",20006,USA,
867,US002875,"4 PENN CENTER, 1600 JFK BLVD., 2ND FLOOR, PHIL...",19103,USA,


### Review those without matching zip and city info

In [1825]:
 US_addresses[US_addresses.Zip.isnull() & US_addresses.City.isnull()]

Unnamed: 0,ClientID,Address,Zip,Country,City
664,US003161,"US 08648 NJ LAWRENCEVILLE, 1009 LENOX DRIVE, S...",,USA,
3454,US003324,"9701 SE JOHNSON CREEK BOULEVARD, APT. 1306 970...",,USA,


In [1826]:
US_addresses[US_addresses.Zip.isnull() & US_addresses.City.isnull()].Address.tolist()

['US 08648 NJ LAWRENCEVILLE, 1009 LENOX DRIVE, SUITE 106 PRINCETON PIKE CORPORATE CENTER USA',
 '9701 SE JOHNSON CREEK BOULEVARD, APT. 1306 97086 HAPPYVALLEY OREGON UNITED STATES OF AMERICA']

In [1827]:
US_addresses.loc[[664, 2316, 2374, 3454], 'Zip'] = ['08648', '95124-3400', '20814','97086']
US_addresses[US_addresses.Zip.isnull() & US_addresses.City.isnull()]

Unnamed: 0,ClientID,Address,Zip,Country,City


## Review of `US_addresses`

In [1828]:
'''Examine the dataframe'''
US_addresses.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3622 entries, 0 to 3639
Data columns (total 5 columns):
ClientID    3622 non-null object
Address     3622 non-null object
Zip         3574 non-null object
Country     3622 non-null object
City        50 non-null object
dtypes: object(5)
memory usage: 329.8+ KB


In [1829]:
print("Number of addresses without zip: {}".format(US_addresses.loc[US_addresses.Zip.isna()].shape[0]))
US_addresses.loc[US_addresses.Zip.isna()].head()

Number of addresses without zip: 48


Unnamed: 0,ClientID,Address,Zip,Country,City
72,US002913,"333 CENTENNIAL PARKWAY SUITE B LOUISVILLE, CO USA",,USA,"LOUISVILLE, CO"
168,US002740,"1909 K STREET NW, SUITE 900, USA",,USA,
238,US002809,"LAGUNA NIGUEL, CA USA",,USA,"LAGUNA NIGUEL, CA"
239,US002810,"LAGUNA HILLS, CA USA",,USA,"LAGUNA HILLS, CA"
283,US003046,DELAWARE USA,,USA,DELAWARE


In [1830]:
print("Number of addresses with City: {}".format(US_addresses.loc[US_addresses.City.notna()].shape[0]))
US_addresses.loc[US_addresses.City.notna()].head()

Number of addresses with City: 50


Unnamed: 0,ClientID,Address,Zip,Country,City
72,US002913,"333 CENTENNIAL PARKWAY SUITE B LOUISVILLE, CO USA",,USA,"LOUISVILLE, CO"
168,US002740,"1909 K STREET NW, SUITE 900, USA",,USA,
238,US002809,"LAGUNA NIGUEL, CA USA",,USA,"LAGUNA NIGUEL, CA"
239,US002810,"LAGUNA HILLS, CA USA",,USA,"LAGUNA HILLS, CA"
283,US003046,DELAWARE USA,,USA,DELAWARE


In [1831]:
print("Number of addresses without city: {}".format(US_addresses.loc[US_addresses.City.isna()].shape[0]))
US_addresses.loc[US_addresses.City.isna()].head()

Number of addresses without city: 3572


Unnamed: 0,ClientID,Address,Zip,Country,City
0,US002562,"ONE BLOSSOM ROAD, ROCHESTER, NY 14610 USA",14610,USA,
1,US002565,"600 NORTH BRAND BLVD., SUITE 230 GLENDALE, CA ...",91203,USA,
2,US002566,C/O R.R. DONNELLEY & SONS COMPANY 111 SOUTH WA...,60606,USA,
3,US002567,"430 CARMEL COURT CANTON, GEORGIA 30114 USA",30114,USA,
4,US002568,"200 SIDNEY STREET SUITE 310 CAMBRIDGE, MASSACH...",2139,USA,


In [1832]:
'''Fix the pending issues with city data''' 

US_addresses.loc[US_addresses.City.notna(),'City'].value_counts()

DELAWARE                                              3
ROCKWALL TEXAS                                        2
CAMARILLO, CALIFORNIA                                 2
NEW PORT, CA                                          2
LAGUNA HILLS, CA                                      1
PRINCETON, NJ                                         1
W JERSEY                                              1
LAGUNA NIGUEL, CA                                     1
IN                                                    1
PORT WASHINGTON,NY                                    1
CAMBRIDGE, MASSACHUSETTS                              1
TROY, MICHIGAN                                        1
NORWALK, CONNECTICUT                                  1
WASHINGTON, DISTRICT OF COLUMBIA                      1
LOS ANGELES, CALIFORNIA                               1
NORTH CAROLINA                                        1
CIRCLE PINES, MN                                      1
MIAMI                                           

In [1833]:
# Fix the one containing county
check = US_addresses.loc[US_addresses.City.notna(), 'City'].str.upper().str.contains('COUNTY')
US_addresses.loc[US_addresses.City.notna() & check, 'City'].tolist()

['CITY OF WILMINGTON, COUNTY OF NEW CASTLE, DELAWARE']

In [1834]:
US_addresses.loc[US_addresses.City.notna() & check, 'City'] = 'WILMINGTON, DE'
US_addresses.loc[US_addresses.City.notna() & check, 'City']

2224    WILMINGTON, DE
Name: City, dtype: object

In [1835]:
US_addresses.loc[168,'City'] = 'Washington, DC'
US_addresses.loc[168,'Zip'] = '20006'

In [1836]:
'''replace all full state name to state abbrv'''

states_in_city = US_addresses.loc[US_addresses.City.notna(),'City'].str.extract(r'\b(\w+)$')
cities_to_fix = US_addresses.loc[states_in_city[states_in_city[0].str.len()>2].index,'City']
cities_to_fix

283                             DELAWARE
336                       NORTH CAROLINA
337                       ROCKWALL TEXAS
338                       ROCKWALL TEXAS
396              RALEIGH, NORTH CAROLINA
399                       TROY, MICHIGAN
474                CAMARILLO, CALIFORNIA
520                            SAN DIEGO
613                                TEXAS
763                 WILMINGTON, DELAWARE
892                CAMARILLO, CALIFORNIA
974                             DANVILLE
1288                            DELAWARE
1895                NORWALK, CONNECTICUT
2234    WASHINGTON, DISTRICT OF COLUMBIA
2306                    SALINE, MICHIGAN
2447               SPRINGFIELD, ILLINOIS
2709            CAMBRIDGE, MASSACHUSETTS
2887                   BEAVERTON, OREGON
2888                SCHAUMBURG, ILLINOIS
2981                               MIAMI
2993                            DELAWARE
3179                        AUSTIN TEXAS
3368                            W JERSEY
3411            

In [1837]:
# re.compile the regex to store as a dictionary key
# https://stackoverflow.com/questions/33343680/can-a-regular-expression-be-used-as-a-key-in-a-dictionary
compiled_regex_city = [re.compile(str(x)) for x in regex_city_state_with_ref.regex]
regex_city_dict = dict(zip(compiled_regex_city, regex_city_state_with_ref.ref.str.upper()))
regex_city_dict

{re.compile(r'(?:HOLTSVILLE[\,\.]?\s+NEW YORK|HOLTSVILLE[\,\.]?\s+N\.?\s*Y\.?\s*)',
 re.UNICODE): 'HOLTSVILLE, NY',
 re.compile(r'(?:AGAWAM[\,\.]?\s+MASSACHUSETTS|AGAWAM[\,\.]?\s+M\.?\s*A\.?\s*)',
 re.UNICODE): 'AGAWAM, MA',
 re.compile(r'(?:AMHERST[\,\.]?\s+MASSACHUSETTS|AMHERST[\,\.]?\s+M\.?\s*A\.?\s*)',
 re.UNICODE): 'AMHERST, MA',
 re.compile(r'(?:BARRE[\,\.]?\s+MASSACHUSETTS|BARRE[\,\.]?\s+M\.?\s*A\.?\s*)',
 re.UNICODE): 'BARRE, MA',
 re.compile(r'(?:BELCHERTOWN[\,\.]?\s+MASSACHUSETTS|BELCHERTOWN[\,\.]?\s+M\.?\s*A\.?\s*)',
 re.UNICODE): 'BELCHERTOWN, MA',
 re.compile(r'(?:BLANDFORD[\,\.]?\s+MASSACHUSETTS|BLANDFORD[\,\.]?\s+M\.?\s*A\.?\s*)',
 re.UNICODE): 'BLANDFORD, MA',
 re.compile(r'(?:BONDSVILLE[\,\.]?\s+MASSACHUSETTS|BONDSVILLE[\,\.]?\s+M\.?\s*A\.?\s*)',
 re.UNICODE): 'BONDSVILLE, MA',
 re.compile(r'(?:BRIMFIELD[\,\.]?\s+MASSACHUSETTS|BRIMFIELD[\,\.]?\s+M\.?\s*A\.?\s*)',
 re.UNICODE): 'BRIMFIELD, MA',
 re.compile(r'(?:CHESTER[\,\.]?\s+MASSACHUSETTS|CHESTER[\,\.]?\s+M\.?\s*A\.?

In [1838]:
# Replace cities with standard 'City, State_abbr' format using the 'regex_city_dict'
cities_to_fix.replace(regex_city_dict, inplace=True)
cities_to_fix

283                  DELAWARE
336            NORTH CAROLINA
337              ROCKWALL, TX
338              ROCKWALL, TX
396               RALEIGH, NC
399                  TROY, MI
474             CAMARILLO, CA
520                 SAN DIEGO
613                     TEXAS
763            WILMINGTON, DE
892             CAMARILLO, CA
974                  DANVILLE
1288                 DELAWARE
1895              NORWALK, CT
2234           WASHINGTON, DC
2306               SALINE, MI
2447          SPRINGFIELD, IL
2709            CAMBRIDGE, MA
2887            BEAVERTON, OR
2888           SCHAUMBURG, IL
2981                    MIAMI
2993                 DELAWARE
3179               AUSTIN, TX
3368                 W JERSEY
3411               CHASKA, MN
3448               CALIFORNIA
3515    PACIFIC PALISADES, CA
3516          LOS ANGELES, CA
Name: City, dtype: object

In [1839]:
# fix those still cannot meet the standard formats
cities_to_fix_manual_search = cities_to_fix[cities_to_fix.str.split(',').str.len()<2]
cities_to_fix_manual_search

283           DELAWARE
336     NORTH CAROLINA
520          SAN DIEGO
613              TEXAS
974           DANVILLE
1288          DELAWARE
2981             MIAMI
2993          DELAWARE
3368          W JERSEY
3448        CALIFORNIA
Name: City, dtype: object

In [1840]:
values = {283:'DE', 
          336:'NC', 
          520:'SAN DIEGO, CA', 
          613:'TX', 
          974:'DANVILLE CA',
          1288:'DE', 
          2981:'MIAMI, FL',
          2993:'DE', 
          3368:'ALLENDALE, NJ',
          3448:'Van Nuys, CA'.upper()}

cities_to_fix_manual_search = pd.Series(values)
cities_to_fix[cities_to_fix_manual_search.index]=cities_to_fix_manual_search
cities_to_fix

283                        DE
336                        NC
337              ROCKWALL, TX
338              ROCKWALL, TX
396               RALEIGH, NC
399                  TROY, MI
474             CAMARILLO, CA
520             SAN DIEGO, CA
613                        TX
763            WILMINGTON, DE
892             CAMARILLO, CA
974               DANVILLE CA
1288                       DE
1895              NORWALK, CT
2234           WASHINGTON, DC
2306               SALINE, MI
2447          SPRINGFIELD, IL
2709            CAMBRIDGE, MA
2887            BEAVERTON, OR
2888           SCHAUMBURG, IL
2981                MIAMI, FL
2993                       DE
3179               AUSTIN, TX
3368            ALLENDALE, NJ
3411               CHASKA, MN
3448             VAN NUYS, CA
3515    PACIFIC PALISADES, CA
3516          LOS ANGELES, CA
Name: City, dtype: object

In [1841]:
# save the updated info to US_addresses
US_addresses.loc[cities_to_fix.index,'City']=cities_to_fix
US_addresses.loc[cities_to_fix.index]

Unnamed: 0,ClientID,Address,Zip,Country,City
283,US003046,DELAWARE USA,,USA,DE
336,US002977,"NORTH CAROLINA, USA",,USA,NC
337,US002963,ROCKWALL TEXAS UNITED STATES OF AMERICA,,USA,"ROCKWALL, TX"
338,US002964,ROCKWALL TEXAS UNITED STATES OF AMERICA,,USA,"ROCKWALL, TX"
396,US003542,"1613 PINEVIEW DRIVE RALEIGH, NORTH CAROLINA UN...",,USA,"RALEIGH, NC"
399,US003544,"1401 CROOKS ROAD, TROY, MICHIGAN USA",,USA,"TROY, MI"
474,US003072,"941 AVENIDA ACASO, CAMARILLO, CALIFORNIA, USA",,USA,"CAMARILLO, CA"
520,US002961,SAN DIEGO USA,,USA,"SAN DIEGO, CA"
613,US003269,TEXAS USA,,USA,TX
763,US002527,"3411 SILVERSIDE RD., RODNEY BLDG. SUITE 104, W...",,USA,"WILMINGTON, DE"


### Look up city information by `Zip` in `US_addresses`

In [1842]:
no_city = US_addresses[US_addresses.City.isna()]
print(no_city.shape[0])
no_city.head()

3572


Unnamed: 0,ClientID,Address,Zip,Country,City
0,US002562,"ONE BLOSSOM ROAD, ROCHESTER, NY 14610 USA",14610,USA,
1,US002565,"600 NORTH BRAND BLVD., SUITE 230 GLENDALE, CA ...",91203,USA,
2,US002566,C/O R.R. DONNELLEY & SONS COMPANY 111 SOUTH WA...,60606,USA,
3,US002567,"430 CARMEL COURT CANTON, GEORGIA 30114 USA",30114,USA,
4,US002568,"200 SIDNEY STREET SUITE 310 CAMBRIDGE, MASSACH...",2139,USA,


In [1843]:
US_addresses.loc[US_addresses.City.isna() | US_addresses.Zip.notna()].shape[0]

3575

In [1844]:
US_addresses.loc[US_addresses.City.notna() & US_addresses.Zip.notna()]

Unnamed: 0,ClientID,Address,Zip,Country,City
168,US002740,"1909 K STREET NW, SUITE 900, USA",20006,USA,"Washington, DC"
2263,US001360,"2227 WELBILT BOULEVARD, NEW PORT RICHEY, FLORI...",34655,USA,"Port Richey, FL"
2316,US001413,加州95124-3400圣荷西罗吉克路2100号,95124-3400,USA,"FREMONT, CA"


In [1845]:
zip_city_dict = dict(zip(zipcodes['Zip Code'],zipcodes['Place Name']+', '+zipcodes['State_Abbr']))
zip_city_dict

{'00501': 'Holtsville, NY',
 '00544': 'Holtsville, NY',
 '01001': 'Agawam, MA',
 '01002': 'Amherst, MA',
 '01003': 'Amherst, MA',
 '01004': 'Amherst, MA',
 '01005': 'Barre, MA',
 '01007': 'Belchertown, MA',
 '01008': 'Blandford, MA',
 '01009': 'Bondsville, MA',
 '01010': 'Brimfield, MA',
 '01011': 'Chester, MA',
 '01012': 'Chesterfield, MA',
 '01013': 'Chicopee, MA',
 '01014': 'Chicopee, MA',
 '01020': 'Chicopee, MA',
 '01021': 'Chicopee, MA',
 '01022': 'Chicopee, MA',
 '01026': 'Cummington, MA',
 '01027': 'Easthampton, MA',
 '01028': 'East Longmeadow, MA',
 '01029': 'East Otis, MA',
 '01030': 'Feeding Hills, MA',
 '01031': 'Gilbertville, MA',
 '01032': 'Goshen, MA',
 '01033': 'Granby, MA',
 '01034': 'Granville, MA',
 '01035': 'Hadley, MA',
 '01036': 'Hampden, MA',
 '01037': 'Hardwick, MA',
 '01038': 'Hatfield, MA',
 '01039': 'Haydenville, MA',
 '01040': 'Holyoke, MA',
 '01041': 'Holyoke, MA',
 '01050': 'Huntington, MA',
 '01053': 'Leeds, MA',
 '01054': 'Leverett, MA',
 '01056': 'Ludlo

In [1846]:
def lookup_values_by_keys(data_dict, df, key_col, value_col, key_start=0, key_end=5):
    for i,k in df[key_col].iteritems():
        try:
            df.loc[i][value_col]=(data_dict[k[key_start:key_end]])
        except KeyError:
            df.loc[i][value_col]='Key Error'
    return df

def lookup_values_by_keys2(data_dict, df, key_col, keylen=True, key_start=0, key_end=5):
    idx = df.index
    values = []  
    for k in df[key_col]:
        if keylen is False:
            key_end=len(k)
        try:
            values.append(data_dict[k[key_start:key_end]])
        except KeyError:
            values.append(np.nan)
    return pd.Series(values,idx)

cities_found = lookup_values_by_keys2(zip_city_dict, no_city, 'Zip')
US_addresses.loc[cities_found.index,'City']=cities_found
US_addresses.sample(20)

Unnamed: 0,ClientID,Address,Zip,Country,City
618,US003281,"484 OAKMEAD PARKWAY SUNNYVALE, CA 94085, USA",94085,USA,"Sunnyvale, CA"
2269,US001366,"ONE LOGAN SQUARE, STE. 2000 PHILADELPHIA, PA 1...",19103-6996,USA,"Philadelphia, PA"
1141,US000217,"7229 S. ALTON WAY, CENTENNIAL WAY, COLORADO 80...",80112,USA,"Englewood, CO"
2955,US002068,"10 PALMER AVENUE CROTON ON HUDSON, NY 10520 U....",10520,USA,"Croton On Hudson, NY"
133,US002703,"375 WEST STREET WEST BRIDGEWATER, MASSACHUSETT...",02379,USA,"West Bridgewater, MA"
3318,US002439,"535 MIDDLEFIELD ROAD, STE 280 MENLO PARK, CALI...",94025,USA,"Menlo Park, CA"
1085,US000161,"INTERNATIONAL TOWER, SUITE 1760 S. FIGUEROA .,...",,USA,"Los Angeles, CA"
929,US000005,"150 SOUTH FIFTH STREET SUITE 2200 MINNEAPOLIS,...",55402,USA,"Minneapolis, MN"
3205,US002324,"7070 WINCHESTER CIRCLE BOULDER, COLORADO, 8030...",80301,USA,"Boulder, CO"
1342,US000427,"2006-A WINDY TERRACE, CEDAR PARK, TEXAS, 78613...",78613,USA,"Cedar Park, TX"


In [1847]:
'''Finally, check on those cannot find cities'''
no_cities_aft_lookup_by_zip = US_addresses.loc[US_addresses.City.isnull()]
no_cities_aft_lookup_by_zip

Unnamed: 0,ClientID,Address,Zip,Country,City
178,US002770,"2946 SOUTH WAUKESHA ROAD, WEST ALLIS,WI 53117 USA",53117,USA,
296,US003471,"5 THIRD STREET, SUITE 732 SAN FRANCISCO, CA 94...",94193,USA,
671,US003171,"P.O. BOX 194344 SAN JUAN, PR 00919 USA",919,USA,
922,US002773,"50 TANNERY ROAD, BRANCHBURG, NJ 08878, USA",8878,USA,
1174,US000252,"BATTLE RUN ROAD, TRIADELPHIA, WEST VIRGINIA 26...",26603,USA,
1385,US000471,"180 DEXTER AVENUE, P.O. BOX 9143, WATERTOWN, M...",2272,USA,
1456,US000542,"1009 WESLEY ROAD, OCEAN CITY, NEW JERSY 07115 USA",7115,USA,
1899,US000986,"304 OLD MAIN STREET, UNIVERSITY PARK, PENNSYLV...",70000,USA,
2039,US001133,"132 N. EL CAMINO REAL #287, ENCINITAS, CALIFOR...",92924,USA,
3376,US002499,"420 CHESTNUT LANE, WESTON, FL 33226 USA",33226,USA,


In [1848]:
# update manual

no_cities_aft_lookup_by_zip.Address.tolist()

['2946 SOUTH WAUKESHA ROAD, WEST ALLIS,WI 53117 USA',
 '5 THIRD STREET, SUITE 732 SAN FRANCISCO, CA 94193, USA',
 'P.O. BOX 194344 SAN JUAN, PR 00919 USA',
 '50 TANNERY ROAD, BRANCHBURG, NJ 08878, USA',
 'BATTLE RUN ROAD, TRIADELPHIA, WEST VIRGINIA 26603, USA',
 '180 DEXTER AVENUE, P.O. BOX 9143, WATERTOWN, MA 02272, US',
 '1009 WESLEY ROAD, OCEAN CITY, NEW JERSY 07115 USA',
 '304 OLD MAIN STREET, UNIVERSITY PARK, PENNSYLVANIA 16802-70000 USA',
 '132 N. EL CAMINO REAL #287, ENCINITAS, CALIFORNIA 92924 USA',
 '420 CHESTNUT LANE, WESTON, FL 33226 USA',
 '256 ELEANOR ROOSEVELT ST. SAN JUAN, PUERTO RICO 00918 USA']

In [1849]:
#no_cities_aft_lookup_by_zip.Address.str.extract(r"("+regex_country_EN+r")")
no_cities_aft_lookup_by_zip_extracted_city = no_cities_aft_lookup_by_zip.Address.apply(rest_of_str_aft_road_or_unit)
no_cities_aft_lookup_by_zip_extracted_city

178                   WEST ALLIS,WI 53117 USA
296              SAN FRANCISCO, CA 94193, USA
671                    SAN JUAN, PR 00919 USA
922                 BRANCHBURG, NJ 08878, USA
1174    TRIADELPHIA, WEST VIRGINIA 26603, USA
1385                  WATERTOWN, MA 02272, US
1456          OCEAN CITY, NEW JERSY 07115 USA
1899             PENNSYLVANIA 16802-70000 USA
2039          ENCINITAS, CALIFORNIA 92924 USA
3376                                      USA
3575          SAN JUAN, PUERTO RICO 00918 USA
Name: Address, dtype: object

In [1850]:
no_cities_aft_lookup_by_zip_extracted_city = (no_cities_aft_lookup_by_zip_extracted_city
                                              .str.strip()
                                              .str.replace(r"\d+?\W?\d+\,?\s*USA?", '')
                                              .replace(regex_city_dict))

no_cities_aft_lookup_by_zip_extracted_city

178             WEST ALLIS,WI 
296          SAN FRANCISCO, CA
671              SAN JUAN, PR 
922            BRANCHBURG, NJ 
1174          TRIADELPHIA, WV 
1385             WATERTOWN, MA
1456    OCEAN CITY, NEW JERSY 
1899             PENNSYLVANIA 
2039            ENCINITAS, CA 
3376                       USA
3575    SAN JUAN, PUERTO RICO 
Name: Address, dtype: object

In [1851]:
no_cities_aft_lookup_by_zip.Address.tolist()

['2946 SOUTH WAUKESHA ROAD, WEST ALLIS,WI 53117 USA',
 '5 THIRD STREET, SUITE 732 SAN FRANCISCO, CA 94193, USA',
 'P.O. BOX 194344 SAN JUAN, PR 00919 USA',
 '50 TANNERY ROAD, BRANCHBURG, NJ 08878, USA',
 'BATTLE RUN ROAD, TRIADELPHIA, WEST VIRGINIA 26603, USA',
 '180 DEXTER AVENUE, P.O. BOX 9143, WATERTOWN, MA 02272, US',
 '1009 WESLEY ROAD, OCEAN CITY, NEW JERSY 07115 USA',
 '304 OLD MAIN STREET, UNIVERSITY PARK, PENNSYLVANIA 16802-70000 USA',
 '132 N. EL CAMINO REAL #287, ENCINITAS, CALIFORNIA 92924 USA',
 '420 CHESTNUT LANE, WESTON, FL 33226 USA',
 '256 ELEANOR ROOSEVELT ST. SAN JUAN, PUERTO RICO 00918 USA']

In [1852]:
'''Save to US_addresses'''
US_addresses.loc[no_cities_aft_lookup_by_zip.index,'City']=no_cities_aft_lookup_by_zip_extracted_city
US_addresses.loc[3376,'City']='Weston, FL'
US_addresses.loc[3376,'Zip']='33326'

In [1853]:
US_addresses.loc[US_addresses.City.isnull()]

Unnamed: 0,ClientID,Address,Zip,Country,City


In [1854]:
US_addresses[US_addresses.City.str.contains('USA')]

Unnamed: 0,ClientID,Address,Zip,Country,City


In [1855]:
'''Review those without zip to catch remaining errors'''
US_addresses.loc[US_addresses.Zip.isnull()]

Unnamed: 0,ClientID,Address,Zip,Country,City
72,US002913,"333 CENTENNIAL PARKWAY SUITE B LOUISVILLE, CO USA",,USA,"LOUISVILLE, CO"
238,US002809,"LAGUNA NIGUEL, CA USA",,USA,"LAGUNA NIGUEL, CA"
239,US002810,"LAGUNA HILLS, CA USA",,USA,"LAGUNA HILLS, CA"
283,US003046,DELAWARE USA,,USA,DE
284,US003048,"1434 AIR RAIL AVENUE VIRGINIA BEACH, VA USA",,USA,"VIRGINIA BEACH, VA"
305,US003485,"57 SEAVIEW BOULEVARD PORT WASHINGTON,NY U.S.A.",,USA,"PORT WASHINGTON,NY"
336,US002977,"NORTH CAROLINA, USA",,USA,NC
337,US002963,ROCKWALL TEXAS UNITED STATES OF AMERICA,,USA,"ROCKWALL, TX"
338,US002964,ROCKWALL TEXAS UNITED STATES OF AMERICA,,USA,"ROCKWALL, TX"
379,US002965,"3472 88TH AVENUE NE CIRCLE PINES, MN USA",,USA,"CIRCLE PINES, MN"


In [1856]:
'''Match all the cities with standard formatting'''
US_addresses.loc[US_addresses.Zip.isnull(),'City'].replace(regex_city_dict,inplace=True)
US_addresses.loc[US_addresses.Zip.isnull(),'City']

72             LOUISVILLE, CO
238         LAGUNA NIGUEL, CA
239          LAGUNA HILLS, CA
283                        DE
284        VIRGINIA BEACH, VA
305        PORT WASHINGTON,NY
336                        NC
337              ROCKWALL, TX
338              ROCKWALL, TX
379          CIRCLE PINES, MN
396               RALEIGH, NC
399                  TROY, MI
474             CAMARILLO, CA
520             SAN DIEGO, CA
589                BOSTON, MA
613                        TX
665              PLYMOUTH, MN
696            BURNSVILLE, MN
763            WILMINGTON, DE
892             CAMARILLO, CA
974               DANVILLE CA
1071                       IN
1085          Los Angeles, CA
1288                       DE
1351           Washington, DC
1776             NEW PORT, CA
1895              NORWALK, CT
2224           WILMINGTON, DE
2234           WASHINGTON, DC
2306               SALINE, MI
2365             NEW PORT, CA
2447          SPRINGFIELD, IL
2709            CAMBRIDGE, MA
2812      

In [1857]:
'''Add zips through look up by cities'''
cities_from_zipcodes = zipcodes['Place Name'].str.upper()+ ', '+zipcodes['State_Abbr']
city_zip_dict = dict(zip(cities_from_zipcodes, zipcodes['Zip Code']))
city_zip_dict

{'HOLTSVILLE, NY': '11742',
 'AGAWAM, MA': '01001',
 'AMHERST, MA': '01004',
 'BARRE, MA': '01005',
 'BELCHERTOWN, MA': '01007',
 'BLANDFORD, MA': '01008',
 'BONDSVILLE, MA': '01009',
 'BRIMFIELD, MA': '01010',
 'CHESTER, MA': '01011',
 'CHESTERFIELD, MA': '01012',
 'CHICOPEE, MA': '01022',
 'CUMMINGTON, MA': '01026',
 'EASTHAMPTON, MA': '01027',
 'EAST LONGMEADOW, MA': '01028',
 'EAST OTIS, MA': '01029',
 'FEEDING HILLS, MA': '01030',
 'GILBERTVILLE, MA': '01031',
 'GOSHEN, MA': '01032',
 'GRANBY, MA': '01033',
 'GRANVILLE, MA': '01034',
 'HADLEY, MA': '01035',
 'HAMPDEN, MA': '01036',
 'HARDWICK, MA': '01037',
 'HATFIELD, MA': '01038',
 'HAYDENVILLE, MA': '01039',
 'HOLYOKE, MA': '01041',
 'HUNTINGTON, MA': '01050',
 'LEEDS, MA': '01053',
 'LEVERETT, MA': '01054',
 'LUDLOW, MA': '01056',
 'MONSON, MA': '01057',
 'NORTH AMHERST, MA': '01059',
 'NORTHAMPTON, MA': '01063',
 'FLORENCE, MA': '01062',
 'NORTH HATFIELD, MA': '01066',
 'OAKHAM, MA': '01068',
 'PALMER, MA': '01069',
 'PLAINFI

In [1858]:
US_addresses.loc[US_addresses.Zip.isnull(),'Zip'] = lookup_values_by_keys2(city_zip_dict,
                                                                           US_addresses.loc[US_addresses.Zip.isnull()],
                                                                           'City', keylen=False)

US_addresses.loc[US_addresses.Zip.isnull()]


Unnamed: 0,ClientID,Address,Zip,Country,City
283,US003046,DELAWARE USA,,USA,DE
305,US003485,"57 SEAVIEW BOULEVARD PORT WASHINGTON,NY U.S.A.",,USA,"PORT WASHINGTON,NY"
336,US002977,"NORTH CAROLINA, USA",,USA,NC
613,US003269,TEXAS USA,,USA,TX
665,US003143,"PLYMOUTH, MN USA",,USA,"PLYMOUTH, MN"
974,US000050,"9000 CROW CANYON ROAD, SUITE S393, DANVILLE, C...",,USA,DANVILLE CA
1071,US000147,IN. U.S.A.,,USA,IN
1085,US000161,"INTERNATIONAL TOWER, SUITE 1760 S. FIGUEROA .,...",,USA,"Los Angeles, CA"
1288,US000373,"DELAWARE, U.S.A.",,USA,DE
1351,US000436,"810 VERMONT AVENUE N. W., WASHINGTON, D. C. 20...",,USA,"Washington, DC"


In [1864]:
zipmap = {305:'11050',
          336:'94506',
           1085:'90012', 
           1776:np.nan, 
           2365:np.nan, 
           2812:'10962',
           2944:'96822'}

citymap = {305:'Port Washington, NY',
           336:'Danville, CA',
           974: 'Danville, CA',
           1085:'Los Angeles, CA',
           1776:'Newport Beach, CA',
           2365:'Fremont, CA',
           2812:'Orangeburg, NY',
           2944:'Honolulu, HI'}


US_addresses.loc[list(zipmap.keys()), 'Zip'] = (US_addresses.loc[list(zipmap.keys())].index
                                                      .to_series()
                                                      .map(zipmap))

US_addresses.loc[list(citymap.keys()), 'City'] = (US_addresses.loc[list(citymap.keys())].index
                                                      .to_series()
                                                      .map(citymap))

US_addresses.loc[US_addresses.Zip.isnull()]

Unnamed: 0,ClientID,Address,Zip,Country,City
283,US003046,DELAWARE USA,,USA,DE
613,US003269,TEXAS USA,,USA,TX
665,US003143,"PLYMOUTH, MN USA",,USA,"PLYMOUTH, MN"
974,US000050,"9000 CROW CANYON ROAD, SUITE S393, DANVILLE, C...",,USA,"Danville, CA"
1071,US000147,IN. U.S.A.,,USA,IN
1288,US000373,"DELAWARE, U.S.A.",,USA,DE
1351,US000436,"810 VERMONT AVENUE N. W., WASHINGTON, D. C. 20...",,USA,"Washington, DC"
1776,US000862,美国加州新港海滨詹伯瑞路4311号,,USA,"Newport Beach, CA"
2365,US001463,美国 加州 佛利蒙 利马街51号,,USA,"Fremont, CA"
2993,US002106,DELAWARE USA,,USA,DE


## `US_addresses` combined with client name and other info, export to excel

In [1874]:
final = pd.merge(left=US_addresses,
                 right=HQ_clients[['客户代码','客户名称','客户类别','客户中文名称', '电话','传真', '状态']] ,
                 left_on='ClientID',
                 right_on='客户代码')

final = final.drop('客户代码',axis=1)
print(final.shape[0])
final.head()

3622


Unnamed: 0,ClientID,Address,Zip,Country,City,客户名称,客户类别,客户中文名称,电话,传真,状态
0,US002562,"ONE BLOSSOM ROAD, ROCHESTER, NY 14610 USA",14610,USA,"Rochester, NY",CERION，LLC,(S),丝润有限责任公司,,,
1,US002565,"600 NORTH BRAND BLVD., SUITE 230 GLENDALE, CA ...",91203,USA,"Glendale, CA","AD-VANTAGE NETWORKS, INC.",(S),AD-优势网络股份公司,,,
2,US002566,C/O R.R. DONNELLEY & SONS COMPANY 111 SOUTH WA...,60606,USA,"Chicago, IL","TOPS Products, LLC",(S),托普斯产品有限责任公司,,,
3,US002567,"430 CARMEL COURT CANTON, GEORGIA 30114 USA",30114,USA,"Canton, GA","BRIGHENTI, Peter",(S),彼得·布里根蒂,,,
4,US002568,"200 SIDNEY STREET SUITE 310 CAMBRIDGE, MASSACH...",2139,USA,"Cambridge, MA","RANA THERAPEUTICS, INC.",(S),RANA医疗有限公司,,,


In [1875]:
final.to_excel('Final US addresses.xlsx')

## PENDING

#### Revise `US_addresses` to include client name

#### Review the code using the following functions and revise to use `df.replace(dict)`
        - extract_match_from_a_patlist
        - extract_match_from_a_pat_df
        - extract_match_of_a_pat
        
#### Add other better tools to my personal notes, like `df.replace(dict)` to replace regex search in loop

#### Make a simplified version of this notebook for faster execution