# Module 4 Homework

## More Walmart Cleaning

The file **Walmart_United_States\_&\_Canada.csv** contains data on all
Walmarts, Sam's Clubs and Murphy USA gas/diesel in the USA and Canada.
Note that Gas/Diesel and No Over Night Parking (NOP) are indicated if
known. These data can be obtained from the site
<http://www.poi-factory.com/node/25560>.


**Before you start.** You started cleaning this data set in Activity 4.1.  Start by copying over your code and fixing the encoding issue.

In [1]:
import pandas as pd
pd.set_option("display.max_colwidth", None) # don't cut off text
pd.set_option("display.max_columns", None) # horizontal scroll columns instead of wrapping around
import collections
import collections.abc # make sure the abc submodule is loaded before aliasing it
collections.Iterable = collections.abc.Iterable # python 3.10 fix for dfply if_else
from dfply import *
from more_dfply import case_when
from more_dfply.facets import text_filter
import re

In [2]:
# Notepad indicates this file is ANSI, which means "pretty much whatever the system wants"
# cp1252 is for Western European Languages, so it ought to suffice for US/Canada data
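# NOTE: the first coordinate column holds longitude (about -114 for Alberta) and the second holds latitude,
# so the "Lat" and "Long" labels are swapped relative to the values; later cells keep these names as-is
walmart = pd.read_csv("./data/Walmart_United_States_&_Canada.csv",
                      names = ["Lat", "Long", "Description", "AddressPhone"],
                      encoding="cp1252")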
walmart.head()

Unnamed: 0,Lat,Long,Description,AddressPhone
0,-114.005671,51.262567,"Walmart Supercentre; #1050,","2881 Main St SW,Airdrie ,AB T4B 3G5,(403) 945-1295"
1,-111.900542,50.577939,"Walmart Supercentre; #3658,","917 3rd St W,Brooks ,AB T1R 1L5,(403) 793-2111"
2,-114.039133,51.107253,"Walmart Supercentre; #3013,","1110 57th Ave NE,Calgary ,(NOP),AB T2E 9B7,(403) 730-0990"
3,-114.138488,51.040871,"Walmart Supercentre; #3009,Gas,","1212 37 St SW,Calgary ,(NOP),AB T3C 1S3,(403) 242-2205"
4,-114.028603,50.930551,"Walmart; #1144,","1221 Canyon Meadows Dr SE,Calgary ,AB T2J 6G2,(403) 225-6638"
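
The encoding choice can be checked on a few bytes without opening the file. The sketch below is illustrative only: the byte string is a hand-built sample (not read from the CSV) of how an accented Quebec address would be stored in a cp1252 file, and `raw`, `decoded`, and `utf8_ok` are names made up for the example.

```python
# Hand-built sample bytes, assuming the file stores accented letters as single cp1252 bytes
raw = b"1030 Boul Du Grand-H\xe9ron,Saint-J\xe9r\xf4me"

# cp1252 maps 0xE9 to "é" and 0xF4 to "ô"
decoded = raw.decode("cp1252")
print(decoded)

# The same bytes are not valid UTF-8, so a UTF-8 read raises an error
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

A failure like this on the default UTF-8 read is a common sign that an explicit `encoding=` argument is needed.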


In [3]:
# Copy your code from Activity 4.1 here
wm_cleaned = (walmart
                >> mutate(Description = if_else(X.Description.str.contains("; Supercenter"),
                                              X.Description.str.replace(";", ""), X.Description)) # fix one weird case
                >> mutate(Description = X.Description.str.replace(';', ','))
                >> mutate(Store_type = X.Description.str.split(',').str.get(0),
                         Store_number = X.Description.str.split(',').str.get(1),
                         Gas = X.Description.str.split(',').str.get(2))
                >> mutate(Gas = X.Gas.map(lambda s: s if s else "None")) # replace empty strings in Gas field
)
wm_cleaned.head()

Unnamed: 0,Lat,Long,Description,AddressPhone,Store_type,Store_number,Gas
0,-114.005671,51.262567,"Walmart Supercentre, #1050,","2881 Main St SW,Airdrie ,AB T4B 3G5,(403) 945-1295",Walmart Supercentre,#1050,
1,-111.900542,50.577939,"Walmart Supercentre, #3658,","917 3rd St W,Brooks ,AB T1R 1L5,(403) 793-2111",Walmart Supercentre,#3658,
2,-114.039133,51.107253,"Walmart Supercentre, #3013,","1110 57th Ave NE,Calgary ,(NOP),AB T2E 9B7,(403) 730-0990",Walmart Supercentre,#3013,
3,-114.138488,51.040871,"Walmart Supercentre, #3009,Gas,","1212 37 St SW,Calgary ,(NOP),AB T3C 1S3,(403) 242-2205",Walmart Supercentre,#3009,Gas
4,-114.028603,50.930551,"Walmart, #1144,","1221 Canyon Meadows Dr SE,Calgary ,AB T2J 6G2,(403) 225-6638",Walmart,#1144,


In [4]:
# verify the weird case was handled
walmart.iloc[[893]]

Unnamed: 0,Lat,Long,Description,AddressPhone
893,-120.419884,34.919944,"Walmart; Supercenter,#2507,","2220 S Bradley,Santa Maria,CA,93455 ,(NOP),(805) 349-7885"


In [5]:
wm_cleaned.iloc[[893]]

Unnamed: 0,Lat,Long,Description,AddressPhone,Store_type,Store_number,Gas
893,-120.419884,34.919944,"Walmart Supercenter,#2507,","2220 S Bradley,Santa Maria,CA,93455 ,(NOP),(805) 349-7885",Walmart Supercenter,#2507,
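
The description logic can also be traced on single strings. This is a plain-Python sketch of the same steps, not the dfply pipe itself; `split_description` is a hypothetical helper name, and the inputs are descriptions shown in the outputs above.

```python
def split_description(description):
    """Split one raw Description string into (store_type, store_number, gas)."""
    # One record uses a semicolon inside the store type; drop it first
    if "; Supercenter" in description:
        description = description.replace(";", "")
    # Everywhere else the semicolon is a field delimiter, so standardize on commas
    fields = description.replace(";", ",").split(",")
    store_type, store_number, gas = fields[0], fields[1], fields[2]
    # An empty third field means no fuel information was recorded
    return store_type, store_number, gas if gas else "None"

print(split_description("Walmart Supercentre; #3009,Gas,"))
print(split_description("Walmart; Supercenter,#2507,"))
```

Like the pipe, the sketch does not strip whitespace, so a store number that followed a semicolon keeps its leading space.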


1.  Some of the address columns contain `(NOP)` to indicate *No overnight parking*.  Extract this information into a new indicator column, then remove it from the address column.

In [6]:
# Yes, these are bad variable names, but numbered names preserve each step's result while developing one cell at a time.
# We could modify in place instead, but that is also messy.
# At the end, we'll combine everything into a single pipe anyway.

wm1 = (wm_cleaned
          >> mutate(OvernightParking = 1 - X.AddressPhone.str.count(r"\(NOP\)"), # escape the parentheses so the literal marker is counted
                    AddressPhone = X.AddressPhone.str.replace(r"\(NOP\),?", "", regex=True)
                   ) 
)
wm1.head()

Unnamed: 0,Lat,Long,Description,AddressPhone,Store_type,Store_number,Gas,OvernightParking
0,-114.005671,51.262567,"Walmart Supercentre, #1050,","2881 Main St SW,Airdrie ,AB T4B 3G5,(403) 945-1295",Walmart Supercentre,#1050,,1
1,-111.900542,50.577939,"Walmart Supercentre, #3658,","917 3rd St W,Brooks ,AB T1R 1L5,(403) 793-2111",Walmart Supercentre,#3658,,1
2,-114.039133,51.107253,"Walmart Supercentre, #3013,","1110 57th Ave NE,Calgary ,AB T2E 9B7,(403) 730-0990",Walmart Supercentre,#3013,,0
3,-114.138488,51.040871,"Walmart Supercentre, #3009,Gas,","1212 37 St SW,Calgary ,AB T3C 1S3,(403) 242-2205",Walmart Supercentre,#3009,Gas,0
4,-114.028603,50.930551,"Walmart, #1144,","1221 Canyon Meadows Dr SE,Calgary ,AB T2J 6G2,(403) 225-6638",Walmart,#1144,,1
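
Because `(NOP)` contains regex metacharacters, it is worth checking what a pattern really matches. The sketch below uses only the standard `re` module; `split_nop` is a hypothetical helper, and the `tricky` address is invented for illustration.

```python
import re

# Unescaped, the parentheses are grouping syntax, so this matches the bare letters "NOP"
unescaped = re.compile("(NOP)")
# Escaped, the parentheses are literal; the optional comma removes the emptied field separator
escaped = re.compile(r"\(NOP\),?")

def split_nop(address):
    """Return (overnight_parking, cleaned_address) for one address string."""
    overnight_parking = 0 if escaped.search(address) else 1
    return overnight_parking, escaped.sub("", address)

# Invented address whose street name happens to contain the letters NOP
tricky = "100 NOPAL St,Tucson,AZ,85701 ,,(520) 555-0100"
print(bool(unescaped.search(tricky)), bool(escaped.search(tricky)))

print(split_nop("1212 37 St SW,Calgary ,(NOP),AB T3C 1S3,(403) 242-2205"))
```

Only the escaped pattern ignores the invented street name, which is the reason to escape the marker (or to use a literal, non-regex match).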


2.  The address column contains the phone number of most of the stores.  Extract this information into a new column.  The phone numbers follow several different patterns, so a divide-and-conquer approach is advised.

In [7]:
# Your code here
(wm1
    >> select(X.AddressPhone)
    >> filter_by(~text_filter(X.AddressPhone, r"\(\d{3}\)\s+\d{3}-\d{4}", regex=True)) # vast majority of cases
    >> filter_by(~text_filter(X.AddressPhone, r"\(\d{3}\)\s+\d{3} \d{4}", regex=True)) # space instead of dash
    >> filter_by(~text_filter(X.AddressPhone, r"\(\d{3}\)\d{3}-\d{4}", regex=True)) # no initial space
    >> filter_by(~text_filter(X.AddressPhone, r"\(\d{3}0\s+\d{3}-\d{4}", regex=True)) # end parenthesis replaced with 0
#    >> filter_by(text_filter(X.AddressPhone, "\(\d{3}\)\d{3}-\d{4}", regex=True))
)

Unnamed: 0,AddressPhone
354,"8303 Rogers Ave,Fort Smith,AR,72903 ,,(479) 452-161"
1921,"510 Ave C,Denison,IA,51442,(712_263-2000,"
2633,"6225 Coliseum Blvd,Alexandria,LA,71303 ,,(318-448-8881"
3733,"1318 Mebane Oaks Rd; I-40 Exit 154,Mebane,NC,27302 ,,(919) 30400171"
3994,"950 Rte 37 W,Toms River,NJ,08755,(732_349-6000,"
4232,"1134 Wicker St,Ticonderoga,NY,12883 ,,(518(585-3060"
4543,"7520 E Reno Ave,Midwest City,OK,73110 ,(405( 455-4070"
4796,"100 Stonebridge Blvd,Wasaga Beach ,ON L9Z 0C1,(705 )442-7100"
5017,"1333 Boul Michele-Bohec,Blainville,(450_ 419-5930,QC J7C 0M4"
6135,"6040 Coit Rd,Plano,TX,75023,"
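
The divide-and-conquer idea is to peel off every address that matches a known pattern and then inspect what is left. Here is a minimal stand-alone sketch with plain lists and `re` in place of the dfply filters; `unmatched` is a hypothetical helper, and the sample addresses are copied from the outputs in this notebook.

```python
import re

patterns = [
    r"\(\d{3}\)\s+\d{3}-\d{4}",   # vast majority of cases
    r"\(\d{3}\)\s+\d{3} \d{4}",   # space instead of dash
    r"\(\d{3}\)\d{3}-\d{4}",      # no initial space
    r"\(\d{3}0\s+\d{3}-\d{4}",    # end parenthesis replaced with 0
]

def unmatched(addresses, patterns):
    """Return the addresses that none of the patterns match."""
    remaining = list(addresses)
    for pattern in patterns:
        # Each pass removes the rows already explained by this pattern
        remaining = [a for a in remaining if not re.search(pattern, a)]
    return remaining

sample = [
    "2881 Main St SW,Airdrie ,AB T4B 3G5,(403) 945-1295",
    "6575 Airport Blvd,Mobile,AL,36608 ,(2510 370-9845",
    "510 Ave C,Denison,IA,51442,(712_263-2000,",
    "6040 Coit Rd,Plano,TX,75023,",
]
leftover = unmatched(sample, patterns)
print(leftover)
```

Whatever remains is the next batch to study for a new pattern, or to flag as invalid.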


In [8]:
def get_last(item, sep=","):
    # split on sep and return the last element
    return item.split(sep)[-1]

In [9]:
(wm1
    >> mutate(PhoneNumber = case_when((text_filter(X.AddressPhone, r"\(\d{3}\)\s+\d{3}-\d{4}", regex=True),
                                            X.AddressPhone.map(get_last)),
                                      (text_filter(X.AddressPhone, r"\(\d{3}\)\s+\d{3} \d{4}", regex=True),
                                            X.AddressPhone.map(get_last).str.replace(r"\s+(\d{4})$", r"-\1", regex=True)),
                                      (text_filter(X.AddressPhone, r"\(\d{3}\)\d{3}-\d{4}", regex=True),
                                            X.AddressPhone.map(get_last).str.replace(")", ") ", regex=False)),
                                      (text_filter(X.AddressPhone, r"\(\d{3}0 \d{3}-\d{4}", regex=True),
                                            X.AddressPhone.map(get_last).str.replace(r"^\((\d{3})0", r"(\1)", regex=True)),
                                      (True, "Missing or invalid phone")
                                     )
             )
     >> filter_by(text_filter(X.AddressPhone, r"\(\d{3}0 \d{3}-\d{4}", regex=True))
#     >> filter_by(text_filter(X.PhoneNumber, "no phone"))
).head()

Unnamed: 0,Lat,Long,Description,AddressPhone,Store_type,Store_number,Gas,OvernightParking,PhoneNumber
207,-88.194899,30.679921,"Wm Nbrhd Mkt, #4648,","6575 Airport Blvd,Mobile,AL,36608 ,(2510 370-9845",Wm Nbrhd Mkt,#4648,,0,(251) 370-9845
973,-104.793389,39.625883,"Wm Nbrhd Mkt, #3126,","16746 E Smokey Hill Rd,Centennial,CO,80015 ,(3030 305-1110",Wm Nbrhd Mkt,#3126,,0,(303) 305-1110
3569,-66.092624,45.255944,"Walmart, #1175,","621 Fairville Blvd,Saint John ,NB E2M 4X5,(5060 693-1668",Walmart,#1175,,1,(506) 693-1668
3812,-82.578265,35.703458,"Walmart Supercenter, #4334,","25 Northbridge Commons; I-26 Exit 19,Weaverville,NC,28787 ,,(8280 645-5028",Walmart Supercenter,#4334,,1,(828) 645-5028
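
The `case_when` call applies the first rule whose condition holds. The sketch below shows the same first-match-wins logic on single strings with plain Python; `rules` and `normalize_phone` are hypothetical names, and the last two test inputs are invented to exercise the rarer patterns.

```python
import re

def get_last(item, sep=","):
    # split on sep and return the last element
    return item.split(sep)[-1]

# (pattern, fixer) pairs, tried in order; the fixer repairs the last comma-separated field
rules = [
    (r"\(\d{3}\)\s+\d{3}-\d{4}", lambda p: p),
    (r"\(\d{3}\)\s+\d{3} \d{4}", lambda p: re.sub(r"\s+(\d{4})$", r"-\1", p)),
    (r"\(\d{3}\)\d{3}-\d{4}", lambda p: p.replace(")", ") ")),
    (r"\(\d{3}0 \d{3}-\d{4}", lambda p: re.sub(r"^\((\d{3})0", r"(\1)", p)),
]

def normalize_phone(address, default="Missing or invalid phone"):
    for pattern, fix in rules:
        if re.search(pattern, address):
            return fix(get_last(address))
    return default

print(normalize_phone("6575 Airport Blvd,Mobile,AL,36608 ,(2510 370-9845"))
print(normalize_phone("6040 Coit Rd,Plano,TX,75023,"))
```

Each new pattern needs its own rule, and `get_last` assumes the phone number is the final field, which is why this approach gets tedious for the long tail of one-off formats.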


In [10]:
# We could continue in this pattern, but we're basically down to 1 case per pattern and several more patterns to go
# I observed that you can discern the three fields of the phone number in just about any pattern
# If we can get that, we can make it comply with our formatting
# The only ones that don't match are invalid (missing a digit or just entirely missing)
(wm1
    >> select(X.AddressPhone)
    >> filter_by(~text_filter(X.AddressPhone, r"\(\d{3}.*\d{3}.*\d{4}", regex=True)) # keep only rows with no recognizable 3-3-4 digit phone number
)

Unnamed: 0,AddressPhone
354,"8303 Rogers Ave,Fort Smith,AR,72903 ,,(479) 452-161"
6135,"6040 Coit Rd,Plano,TX,75023,"
6193,"3440 S Bryant Blvd,San Angelo,TX,76903 ,,(325) 26-6599"


In [11]:
# Field either contains a phone number, in which case we can then extract the parts and format as desired
#     or it does not, in which case we should mark it as such

# The opening parenthesis is important to keep it from starting too early, and all phone numbers had this anyway
wm2 = (wm1
    >> mutate(PhoneNumber = X.AddressPhone.str.extract(r"(\(\d{3}.*\d{3}.*\d{4})", expand=False)) # expand=False returns a Series
    >> mutate(PhoneNumber = if_else(X.PhoneNumber.isna(), # did not contain the pattern
                                  "Missing or invalid phone",
                                  X.PhoneNumber.str.replace(r"\((\d{3}).*(\d{3}).*(\d{4})", r"(\1) \2-\3", regex=True)
                                  ))
)
wm2.head()

Unnamed: 0,Lat,Long,Description,AddressPhone,Store_type,Store_number,Gas,OvernightParking,PhoneNumber
0,-114.005671,51.262567,"Walmart Supercentre, #1050,","2881 Main St SW,Airdrie ,AB T4B 3G5,(403) 945-1295",Walmart Supercentre,#1050,,1,(403) 945-1295
1,-111.900542,50.577939,"Walmart Supercentre, #3658,","917 3rd St W,Brooks ,AB T1R 1L5,(403) 793-2111",Walmart Supercentre,#3658,,1,(403) 793-2111
2,-114.039133,51.107253,"Walmart Supercentre, #3013,","1110 57th Ave NE,Calgary ,AB T2E 9B7,(403) 730-0990",Walmart Supercentre,#3013,,0,(403) 730-0990
3,-114.138488,51.040871,"Walmart Supercentre, #3009,Gas,","1212 37 St SW,Calgary ,AB T3C 1S3,(403) 242-2205",Walmart Supercentre,#3009,Gas,0,(403) 242-2205
4,-114.028603,50.930551,"Walmart, #1144,","1221 Canyon Meadows Dr SE,Calgary ,AB T2J 6G2,(403) 225-6638",Walmart,#1144,,1,(403) 225-6638


In [12]:
# What about the weird cases?
wm2.iloc[[354, 6135, 6193]]

Unnamed: 0,Lat,Long,Description,AddressPhone,Store_type,Store_number,Gas,OvernightParking,PhoneNumber
354,-94.339372,35.34797,"Murphy: USA, #7133,Gas/Diesel,","8303 Rogers Ave,Fort Smith,AR,72903 ,,(479) 452-161",Murphy: USA,#7133,Gas/Diesel,1,Missing or invalid phone
6135,-96.769302,33.054745,"Walmart Fuel Center,#0997,Gas,","6040 Coit Rd,Plano,TX,75023,",Walmart Fuel Center,#0997,Gas,1,Missing or invalid phone
6193,-100.441994,31.426951,"Walmart Supercenter, #7281,","3440 S Bryant Blvd,San Angelo,TX,76903 ,,(325) 26-6599",Walmart Supercenter,#7281,,1,Missing or invalid phone
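
The general pattern can be tried on the leftover cases directly. This sketch uses the non-greedy form of the pattern with the standard `re` module; `extract_phone` is a hypothetical helper, and the addresses are the irregular rows listed earlier in this notebook.

```python
import re

# Opening parenthesis, then groups of 3, 3, and 4 digits with anything (as little as possible) in between
phone_parts = re.compile(r"\((\d{3}).*?(\d{3}).*?(\d{4})")

def extract_phone(address, default="Missing or invalid phone"):
    """Return the phone number formatted as (AAA) EEE-LLLL, or the default if no match."""
    m = phone_parts.search(address)
    if m is None:
        return default
    area, exchange, line = m.groups()
    return f"({area}) {exchange}-{line}"

for address in [
    "510 Ave C,Denison,IA,51442,(712_263-2000,",
    "1134 Wicker St,Ticonderoga,NY,12883 ,,(518(585-3060",
    "1333 Boul Michele-Bohec,Blainville,(450_ 419-5930,QC J7C 0M4",
    "8303 Rogers Ave,Fort Smith,AR,72903 ,,(479) 452-161",
]:
    print(extract_phone(address))
```

Rows that are missing a digit fall through to the default, so they stay visible for manual review instead of being reformatted into a plausible-looking but wrong number.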


3.  Extract the country from the address column.

In [13]:
# observe some addresses to look for the pattern
wm2.AddressPhone.sample(15)

4380                       1640 S Washington St,Millersburg,OH,44654 ,,(330) 674-2888
5295                               1602 W Market St,Bolivar,TN,38008 ,,(731) 659-3900
3726                       2985 E Elizabethtown Rd,Lumberton,NC,28358 ,(910) 887-6107
5626                           951 SW Wilshire Blvd,Burleson,TX,76028 ,(817) 572-9574
4563                   10307 S Western Avenue,Oklahoma City,OK,73139 ,,(405) 692-6267
3700                         1170 Western Blvd,Jacksonville,NC,28546 ,,(910) 346-2148
2931                            3549 Russett Green E,Laurel,MD,20724 ,,(301) 604-0180
6654               1486 Dike Access Rd; I-5 Exit 22,Woodland,WA,98674 ,(360) 841-9131
476                           1900 E Chandler Blvd,Chandler,AZ,85225 ,,(480) 448-4322
2188                  420 Weber Rd; I-55 Exit 263,Romeoville,IL,60446 ,(815) 439-1666
5160                        5009 Old Buncombre Rd,Greenville,SC,29617 ,(864) 605-6309
6344    6760 Westworth Blvd; I-30 Exit 78,Westworth Vi

In [14]:
# examine a pattern idea
wm2 >> filter_by(~X.AddressPhone.str.contains(",[A-Z]{2},", regex=True))

Unnamed: 0,Lat,Long,Description,AddressPhone,Store_type,Store_number,Gas,OvernightParking,PhoneNumber
0,-114.005671,51.262567,"Walmart Supercentre, #1050,","2881 Main St SW,Airdrie ,AB T4B 3G5,(403) 945-1295",Walmart Supercentre,#1050,,1,(403) 945-1295
1,-111.900542,50.577939,"Walmart Supercentre, #3658,","917 3rd St W,Brooks ,AB T1R 1L5,(403) 793-2111",Walmart Supercentre,#3658,,1,(403) 793-2111
2,-114.039133,51.107253,"Walmart Supercentre, #3013,","1110 57th Ave NE,Calgary ,AB T2E 9B7,(403) 730-0990",Walmart Supercentre,#3013,,0,(403) 730-0990
3,-114.138488,51.040871,"Walmart Supercentre, #3009,Gas,","1212 37 St SW,Calgary ,AB T3C 1S3,(403) 242-2205",Walmart Supercentre,#3009,Gas,0,(403) 242-2205
4,-114.028603,50.930551,"Walmart, #1144,","1221 Canyon Meadows Dr SE,Calgary ,AB T2J 6G2,(403) 225-6638",Walmart,#1144,,1,(403) 225-6638
...,...,...,...,...,...,...,...,...,...
5279,-106.642526,52.087947,"Walmart Supercentre, #5878,","3035 Clarence Ave S,Saskatoon ,SK S7T 0B6,(306) 653-8200",Walmart Supercentre,#5878,,0,(306) 653-8200
5280,-107.774910,50.306610,"Walmart Supercentre, #3099,","1800 22nd Ave,Swift Current ,SK S9H 0E5,(306) 778-3489",Walmart Supercentre,#3099,,1,(306) 778-3489
5281,-103.866420,49.660470,"Walmart, #5790,","1000 Sims Ave,Weyburn ,SK S4H 3N9,(306) 842-6030",Walmart,#5790,,1,(306) 842-6030
5282,-102.444819,51.204644,"Walmart Supercentre, #3176,","240 Hamilton Rd,Yorkton ,SK S3N 4C6,(306) 782-9820",Walmart Supercentre,#3176,,1,(306) 782-9820


In [15]:
# US addresses have a 2-letter state abbreviation set off by commas, Canadian addresses do not
wm3 = (wm2
    >> mutate(Country = if_else(X.AddressPhone.str.contains(",[A-Z]{2},"), "USA", "Canada"))
)
wm3.sample(10)

Unnamed: 0,Lat,Long,Description,AddressPhone,Store_type,Store_number,Gas,OvernightParking,PhoneNumber,Country
5848,-96.107452,33.093896,"Walmart Supercenter, #0427,","7401 Interstate 30; I-30 Exit 89,Greenville,TX,75402 ,(903) 455-1792",Walmart Supercenter,#0427,,0,(903) 455-1792,USA
3006,-83.441765,42.558807,"Walmart Supercenter, #2618,","3301 Pontiac Trail Rd,Commerce,MI,48382 ,,(248) 668-0274",Walmart Supercenter,#2618,,1,(248) 668-0274,USA
5070,-74.01691,45.7608,"Walmart Supercentre, #3190,","1030 Boul Du Grand-Héron,Saint-Jérôme ,QC J7Y 5K8,(450) 438-6776",Walmart Supercentre,#3190,,1,(450) 438-6776,Canada
5625,-97.339432,32.529201,"Murphy: USA, #5627,Gas/Diesel,","921 SW Wilshire Blvd,Burleson,TX,76028 ,,(817) 426-1505",Murphy: USA,#5627,Gas/Diesel,1,(817) 426-1505,USA
5684,-97.338653,32.576217,"Walmart Supercenter, #3631,Gas/Diesel,","1221 FM 1187; I-35 Exit,Crowley,TX,76036 ,(682) 233-7834",Walmart Supercenter,#3631,Gas/Diesel,0,(682) 233-7834,USA
3107,-85.654857,41.941804,"Walmart Supercenter, #3791,","101 S Tolbert Dr,Three Rivers,MI,49093 ,,(269) 273-7820",Walmart Supercenter,#3791,,1,(269) 273-7820,USA
5159,-82.26864,34.86002,"Walmart Supercenter, #4583,","3925 Pelham Rd; I-85 Exit 54,Greenville,SC,29615 ,,(864) 288-8081",Walmart Supercenter,#4583,,1,(864) 288-8081,USA
3411,-90.881064,38.813326,"Sam's Club, #4875,Gas/Diesel,","3055 Bear Creek Dr: I-70 Exit 208,Wentzville,MO,63385 .,,(636) 698-9774",Sam's Club,#4875,Gas/Diesel,1,(636) 698-9774,USA
5298,-82.2369,36.544304,"Murphy: USA, #6971,Gas/Diesel,","260 Century Blvd; I-81 Exit 1,Bristol,TN,37620 ,,(423) 968-7395",Murphy: USA,#6971,Gas/Diesel,1,(423) 968-7395,USA
4174,-77.725081,43.214631,"Walmart, #1610,","100 Elm Ridge Center Dr,Greece,NY,14626 ,,(585) 227-0720",Walmart,#1610,,1,(585) 227-0720,USA
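
One way to gain confidence in the country rule is to compare it with an independent signal. The sketch below pairs the state-field rule with a second check based on the shape of Canadian postal codes (letter-digit-letter, space, digit-letter-digit). The postal-code check is an added assumption for cross-validation, not part of the solution above, and both function names are hypothetical.

```python
import re

state_field = re.compile(r",[A-Z]{2},")                  # rule used above: state set off by commas
canadian_postal = re.compile(r"[A-Z]\d[A-Z] \d[A-Z]\d")  # assumed alternative: e.g. "T4B 3G5"

def country_by_state_field(address):
    return "USA" if state_field.search(address) else "Canada"

def country_by_postal_code(address):
    return "Canada" if canadian_postal.search(address) else "USA"

sample = [
    "7401 Interstate 30; I-30 Exit 89,Greenville,TX,75402 ,(903) 455-1792",
    "1030 Boul Du Grand-Héron,Saint-Jérôme ,QC J7Y 5K8,(450) 438-6776",
    "2881 Main St SW,Airdrie ,AB T4B 3G5,(403) 945-1295",
]
# Any address where the two rules disagree deserves a closer look
disagreements = [a for a in sample if country_by_state_field(a) != country_by_postal_code(a)]
print(disagreements)
```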


4.  Extract the state or province from the address column.

In [16]:
# Pattern is clear from previous step
wm4 = (wm3 >>
    mutate(StateOrProvince = X.AddressPhone.str.extract(",([A-Z]{2})"))
)
wm4.sample(10)

Unnamed: 0,Lat,Long,Description,AddressPhone,Store_type,Store_number,Gas,OvernightParking,PhoneNumber,Country,StateOrProvince
6645,-122.480896,47.239844,"Walmart Supercenter, #4137,","1965 S Union Ave; I-5 Exit 132,Tacoma,WA,98405 ,(253) 414-9526",Walmart Supercenter,#4137,,0,(253) 414-9526,USA,WA
6546,-80.088733,37.289302,"Walmart Supercenter, #1309,","1851 W Main St; I-81 Exit 137,Salem,VA,24153 ,,(540) 375-2919",Walmart Supercenter,#1309,,1,(540) 375-2919,USA,VA
6218,-98.548481,29.356398,"Sam's Club, #8264,Gas,","3150 Sw Military Dr; I-35 Exit 148,San Antonio,TX,78224 ,,(210) 927-3593",Sam's Club,#8264,Gas,1,(210) 927-3593,USA,TX
2211,-87.796047,41.597997,"Sam's Club, #6485,","16100 Harlem AvE ,Tinley Park,IL,60477 ,,(708) 429-6069",Sam's Club,#6485,,1,(708) 429-6069,USA,IL
2797,-93.711365,32.430073,"Murphy: USA, #5673,Gas,","8020 Youree Dr,Shreveport,LA,71115 ,,(318) 797-7803",Murphy: USA,#5673,Gas,1,(318) 797-7803,USA,LA
3022,-83.648486,42.938093,"Walmart Supercenter, #3726,","6170 S Saginaw Rd,Grand Blanc,MI,48439 ,(810) 603-9739",Walmart Supercenter,#3726,,0,(810) 603-9739,USA,MI
5036,-70.910627,45.594042,"Walmart, #1019,","3130 Rue Laval,Lac-Megantic ,QC G6B 1A4,(819) 583-2882",Walmart,#1019,,1,(819) 583-2882,Canada,QC
1688,-82.969864,32.549038,"Murphy: USA, #7468,Gas/Diesel,","2419 Hwy 80 W,Dublin,GA,31021 ,,(478) 272-3505",Murphy: USA,#7468,Gas/Diesel,1,(478) 272-3505,USA,GA
3910,-71.069762,43.027671,"Walmart Supercenter, #3535,","35 Fresh River Rd,Epping,NH,03042 ,(603) 679-5919",Walmart Supercenter,#3535,,0,(603) 679-5919,USA,NH
6745,-90.508121,44.019824,"Walmart Supercenter, #0965,","222 W Mccoy Blvd; I-94 Exit 143,Tomah,WI,54660 ,,(608) 372-7900",Walmart Supercenter,#0965,,1,(608) 372-7900,USA,WI
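
The pattern `,([A-Z]{2})` takes the first comma followed by two capital letters, which works as long as no earlier field starts that way. The sketch below compares it with a stricter variant that also looks at what follows the code. The stricter pattern and the `hypothetical` address are assumptions made for illustration, not something observed in the data.

```python
import re

loose = re.compile(r",([A-Z]{2})")
# Assumed tightening: the code must be followed by a 5-digit ZIP (US) or a postal-code prefix (Canada)
strict = re.compile(r",([A-Z]{2})(?=,\d{5}| [A-Z]\d[A-Z])")

def extract_region(address, pattern=loose):
    """Return the two-letter state or province code, or None if the pattern finds nothing."""
    m = pattern.search(address)
    return m.group(1) if m else None

us = "1965 S Union Ave; I-5 Exit 132,Tacoma,WA,98405 ,(253) 414-9526"
ca = "3130 Rue Laval,Lac-Megantic ,QC G6B 1A4,(819) 583-2882"
# Invented address whose city field begins with two capital letters
hypothetical = "500 Hwy 17,MT Pleasant,SC,29464 ,,(843) 555-0100"

print(extract_region(us), extract_region(ca))
print(extract_region(hypothetical), extract_region(hypothetical, strict))
```

If the loose and strict patterns ever disagree on a real row, that row is worth inspecting by hand.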


5. Combine all of your transformations into one pipe, then refactor your code to be more readable.

In [17]:
def get_last(item, sep=","):
    # split on sep and return the last element
    return item.split(sep)[-1]

In [18]:
# One pipe for all transformations.
(walmart
    >> mutate(Description = if_else(X.Description.str.contains("; Supercenter"),
                                  X.Description.str.replace(";", ""), X.Description)) # fix one weird case
    >> mutate(Description = X.Description.str.replace(';', ',')) # standardize formatting
    >> mutate(Store_type = X.Description.str.split(',').str.get(0),
              Store_number = X.Description.str.split(',').str.get(1),
              Gas = X.Description.str.split(',').str.get(2))
    >> mutate(Gas = X.Gas.map(lambda s: s if s else "None")) # replace empty strings in Gas field
    >> mutate(OvernightParking = 1 - X.AddressPhone.str.count(r"\(NOP\)"), # escape the parentheses so the literal marker is counted
              AddressPhone = X.AddressPhone.str.replace(r"\(NOP\),?", "", regex=True))
    >> mutate(PhoneNumber = X.AddressPhone.str.extract(r"(\(\d{3}.*\d{3}.*\d{4})", expand=False))
    >> mutate(PhoneNumber = if_else(X.PhoneNumber.isna(), # did not contain the pattern
                                  "Missing or invalid phone",
                                  X.PhoneNumber.str.replace(r"\((\d{3}).*(\d{3}).*(\d{4})", r"(\1) \2-\3", regex=True)))
    >> mutate(Country = if_else(X.AddressPhone.str.contains(",[A-Z]{2},"), "USA", "Canada"))
    >> mutate(StateOrProvince = X.AddressPhone.str.extract(",([A-Z]{2})", expand=False))
) 

Unnamed: 0,Lat,Long,Description,AddressPhone,Store_type,Store_number,Gas,OvernightParking,PhoneNumber,Country,StateOrProvince
0,-114.005671,51.262567,"Walmart Supercentre, #1050,","2881 Main St SW,Airdrie ,AB T4B 3G5,(403) 945-1295",Walmart Supercentre,#1050,,1,(403) 945-1295,Canada,AB
1,-111.900542,50.577939,"Walmart Supercentre, #3658,","917 3rd St W,Brooks ,AB T1R 1L5,(403) 793-2111",Walmart Supercentre,#3658,,1,(403) 793-2111,Canada,AB
2,-114.039133,51.107253,"Walmart Supercentre, #3013,","1110 57th Ave NE,Calgary ,AB T2E 9B7,(403) 730-0990",Walmart Supercentre,#3013,,0,(403) 730-0990,Canada,AB
3,-114.138488,51.040871,"Walmart Supercentre, #3009,Gas,","1212 37 St SW,Calgary ,AB T3C 1S3,(403) 242-2205",Walmart Supercentre,#3009,Gas,0,(403) 242-2205,Canada,AB
4,-114.028603,50.930551,"Walmart, #1144,","1221 Canyon Meadows Dr SE,Calgary ,AB T2J 6G2,(403) 225-6638",Walmart,#1144,,1,(403) 225-6638,Canada,AB
...,...,...,...,...,...,...,...,...,...,...,...
6811,-107.209281,41.792084,"Walmart Supercenter, #4471,Gas,","2390 E Cedar St; I-80 Exit 214,Rawlins,WY,82301 ,,(307) 417-3001",Walmart Supercenter,#4471,Gas,1,(307) 417-3001,USA,WY
6812,-108.379227,43.042858,"Walmart Supercenter, #1457,","1733 N Federal Blvd,Riverton,WY,82501 ,,(307) 856-3261",Walmart Supercenter,#1457,,1,(307) 856-3261,USA,WY
6813,-109.251020,41.579761,"Walmart Supercenter, #1461,","201 Gateway Blvd; I-80 Exit 102,Rock Springs,WY,82901 ,,(307) 362-1957",Walmart Supercenter,#1461,,1,(307) 362-1957,USA,WY
6814,-106.940967,44.779474,"Walmart Supercenter, #1508,","1695 Coffeen Ave; I-90 Exit 25,Sheridan,WY,82801 ,(307) 674-6492",Walmart Supercenter,#1508,,0,(307) 674-6492,USA,WY


In [19]:
# Refactored code here (definitions/lambdas, then the pipe)
phone_capture_full = re.compile(r"(\(\d{3}.*?\d{3}.*?\d{4})")
phone_capture_parts = re.compile(r"\((\d{3}).*?(\d{3}).*?(\d{4})")
phone_formatting = r"(\1) \2-\3" # based on the vast majority of the forms

state_pattern = re.compile(r",[A-Z]{2},")
stateprovince_capture = re.compile(r",([A-Z]{2})")

@dfpipe
def fix_anomaly(df):
    # One entry contains "Walmart; Supercenter" which needs to be fixed
    # However, semicolons are regularly used as delimiters, so we can't indiscriminately replace them
    return (df >> mutate(Description = if_else(X.Description.str.contains("; Supercenter"),
                                  X.Description.str.replace(";", ""), X.Description)))
@dfpipe
def standardize_description_delimiter(df):
    return (df >> mutate(Description = X.Description.str.replace(';', ',')))

def replace_missing(s, on_missing): return s if s else on_missing

description_fields = ["StoreType", "StoreNumber", "Gas"]
description_extraction = {c : X["Description"].str.split(',').str.get(i).apply(replace_missing, args=("None",))
                          for i, c in enumerate(description_fields)}

def get_indicator_value(s, value, reverse=False):
    if reverse:
        return int(value not in s)
    else:
        return int(value in s)

def create_indicator(col, value, reverse=False):
    # create an indicator column based on containing the provided value
    # Use reverse=True to create a negative indicator (1 if value is not present)
    return col.apply(get_indicator_value, args=(value,), reverse=reverse)    

def format_phone(s): return re.sub(phone_capture_parts, phone_formatting, s)

def get_and_format_phone(s):
    raw_phone = m.group() if (m:=re.search(phone_capture_full, s)) else "Missing or invalid phone"
    return format_phone(raw_phone)

def detect_country(col):
    return col.map(lambda s: "USA" if state_pattern.search(s) else "Canada")

def extract_state(col):
    return col.str.extract(stateprovince_capture)

wm_finished = (walmart
    >> fix_anomaly()
    >> standardize_description_delimiter()
    >> mutate(**description_extraction)
    >> mutate(OvernightParking = create_indicator(X.AddressPhone, "(NOP)", reverse=True),
              PhoneNumber = X.AddressPhone.map(get_and_format_phone),
              Country = detect_country(X.AddressPhone),
              StateOrProvince = extract_state(X.AddressPhone))
    >> mutate(AddressPhone = X.AddressPhone.str.replace(r"\(NOP\),?", "", regex=True)) # remove the marker once the indicator exists
)
wm_finished.sample(10)

Unnamed: 0,Lat,Long,Description,AddressPhone,StoreType,StoreNumber,Gas,OvernightParking,PhoneNumber,Country,StateOrProvince
6380,-111.711053,40.273414,"Walmart Supercenter, #1768,","1355 S Sandhill Rd; I-15 Exit 269,Orem,UT,84058 ,,(801) 221-0600",Walmart Supercenter,#1768,,1,(801) 221-0600,USA,UT
1471,-81.012795,29.116841,"Murphy: USA, #7544,Gas/Diesel,","1596 Dunlawton Ave; I-95 Exit 256,Port Orange,FL,32127 ,,(386) 761-7010",Murphy: USA,#7544,Gas/Diesel,1,(386) 761-7010,USA,FL
6444,-76.225073,36.743022,"Wm Nbrhd Mkt, #3299,","475 Kempsville Rd,Chesapeake,VA,23320 ,(410) 454-0021",Wm Nbrhd Mkt,#3299,,0,(410) 454-0021,USA,VA
5308,-85.385487,35.020699,"Walmart Supercenter, #3660,Gas/Diesel,","3550 Cummings Hwy; I-24 Exit 174,Chattanooga,TN,37419 ,,(423) 821-1556",Walmart Supercenter,#3660,Gas/Diesel,1,(423) 821-1556,USA,TN
2812,-90.815808,29.807642,"Murphy: USA, #5817,Gas/Diesel,","412 N Canal Blvd,Thibodaux,LA,70301 ,,(985) 446-6355",Murphy: USA,#5817,Gas/Diesel,1,(985) 446-6355,USA,LA
5282,-102.444819,51.204644,"Walmart Supercentre, #3176,","240 Hamilton Rd,Yorkton ,SK S3N 4C6,(306) 782-9820",Walmart Supercentre,#3176,,1,(306) 782-9820,Canada,SK
5824,-96.652141,32.862091,"Walmart Supercenter, #1800,","1801 MarketPL Dr; I-635 Exit 11,Garland,TX,75041 ,(972) 279-8700",Walmart Supercenter,#1800,,0,(972) 279-8700,USA,TX
5015,-68.256474,49.20089,"Walmart, #3002,","630 Boul Laflèche,Baie-Comeau ,QC G5C 2Y3,(418) 589-9971",Walmart,#3002,,1,(418) 589-9971,Canada,QC
5307,-85.376054,35.018647,"Murphy: USA, #7587,Gas/Diesel,","3538 Cummings Hwy; I-24 Exit 174,Chattanooga,TN,37419 ,,(423) 875-7055",Murphy: USA,#7587,Gas/Diesel,1,(423) 875-7055,USA,TN
6385,-110.789621,39.596304,"Walmart Supercenter, #1573,","255 S Hwy 55,Price,UT,84501 ,,(435) 637-6712",Walmart Supercenter,#1573,,1,(435) 637-6712,USA,UT


In [20]:
# Write the results to a file named walmart_locations_clean.csv.  Make sure to include this file in your submission on D2L
wm_finished.to_csv("./data/walmart_locations_clean.csv")
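
Before submitting, it can help to confirm that every phone value is either in the target format or the sentinel string. The sketch below is an optional check written with plain Python and `re`; `is_clean_phone` and `find_bad_phones` are hypothetical helpers, and the input list is a small hand-made sample.

```python
import re

PHONE_FORMAT = re.compile(r"\(\d{3}\) \d{3}-\d{4}")
SENTINEL = "Missing or invalid phone"

def is_clean_phone(value):
    """True if value is exactly (AAA) EEE-LLLL or the sentinel for missing numbers."""
    return value == SENTINEL or PHONE_FORMAT.fullmatch(value) is not None

def find_bad_phones(values):
    """Return the values that are neither well formatted nor the sentinel."""
    return [v for v in values if not is_clean_phone(v)]

bad = find_bad_phones(["(403) 945-1295", SENTINEL, "(2510 370-9845"])
print(bad)
```

Assuming the cleaned column is available as `wm_finished.PhoneNumber`, passing it to `find_bad_phones` should return an empty list if the formatting step covered every row.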