# Activity 4.2 - Refactoring the Walmart Data Clean Up 

In Activity 4.1, we started cleaning up the Walmart location data, focusing on the column with store information. In this activity, we will clean up that code by refactoring the messy bits.

Below, I have provided a copy of a solution to the previous activity.

In [1]:
import pandas as pd
from dfply import *
from more_dfply import case_when, ifelse
from more_dfply.facets import text_facet, text_filter

In [2]:
header = ['long', 'lat', 'store', 'address'] 

walmart_locations = pd.read_csv("./data/Walmart_United_States_&_Canada_uft8.csv", 
                                names = header, 
                                sep = ',')
walmart_locations.head()

Unnamed: 0,long,lat,store,address
0,-114.005671,51.262567,"Walmart Supercentre; #1050,","2881 Main St SW,Airdrie ,AB T4B 3G5,(403) 945-..."
1,-111.900542,50.577939,"Walmart Supercentre; #3658,","917 3rd St W,Brooks ,AB T1R 1L5,(403) 793-2111"
2,-114.039133,51.107253,"Walmart Supercentre; #3013,","1110 57th Ave NE,Calgary ,(NOP),AB T2E 9B7,(40..."
3,-114.138488,51.040871,"Walmart Supercentre; #3009,Gas,","1212 37 St SW,Calgary ,(NOP),AB T3C 1S3,(403) ..."
4,-114.028603,50.930551,"Walmart; #1144,","1221 Canyon Meadows Dr SE,Calgary ,AB T2J 6G2,..."


In [3]:
# Messy (partial) solution
walmart_loc_messy = (walmart_locations
                     >> select(X.store)
                     >> mutate(has_gas = ifelse(text_filter(X.store, 'Gas'), 1, 0),
                               has_diesel = ifelse(text_filter(X.store, 'Gas/Diesel'), 1, 0),
                               store = X.store.str.split(',').str.get(0)
                              )
                     >> mutate(store_type = case_when((text_filter(X.store, r';\s?#', regex=True),
                                                       (X.store
                                                        .str.split(';')
                                                        .str.get(0))
                                                       ),
                                                      (True, (X.store 
                                                              .str.split(',')
                                                              .str.get(0)
                                                              .str.replace(';', ''))
                                                             )
                                                     ),
                               store_number = case_when((text_filter(X.store, r';\s?#', regex=True),
                                                         (X.store
                                                          .str.split(';')
                                                          .str.get(1))
                                                        ),
                                                        (True, (X.store
                                                              .str.split(',')
                                                              .str.get(1))
                                                             )
                                                     ),
                              )
                    )
walmart_loc_messy.head()

Unnamed: 0,store,has_gas,has_diesel,store_type,store_number
0,Walmart Supercentre; #1050,0,0,Walmart Supercentre,#1050
1,Walmart Supercentre; #3658,0,0,Walmart Supercentre,#3658
2,Walmart Supercentre; #3013,0,0,Walmart Supercentre,#3013
3,Walmart Supercentre; #3009,1,0,Walmart Supercentre,#3009
4,Walmart; #1144,0,0,Walmart,#1144


In [4]:
# not all data is being handled properly
walmart_loc_messy >> filter_by(X.store_number.isna())

Unnamed: 0,store,has_gas,has_diesel,store_type,store_number
429,Wm Nbrhd Mkt,1,0,Wm Nbrhd Mkt,
893,Walmart; Supercenter,0,0,Walmart Supercenter,
5511,Murphy: USA; #7235,1,1,Murphy: USA #7235,
5838,Murphy: USA; #7258,1,1,Murphy: USA #7258,
5982,Murphy: USA; #6797,1,1,Murphy: USA #6797,
6135,Walmart Fuel Center,1,0,Walmart Fuel Center,
6156,Walmart Supercenter,1,1,Walmart Supercenter,


In [5]:
# Hey look! Using whitespace effectively helps a lot!
# Still messy - repeated functionality that could be parameterized
walmart_messy = (walmart_locations
                     >> select(X.store)
                     >> mutate(has_gas = ifelse(text_filter(X.store, 'Gas'), 1, 0),
                               has_diesel = ifelse(text_filter(X.store, 'Gas/Diesel'), 1, 0),
                               store = X.store.str.split(',').str.get(0)
                              )
                     >> mutate(store_type = case_when((text_filter(X.store, r';\s?#', regex=True),
                                                           (X.store.str.split(';').str.get(0))),
                                                      (True,
                                                            (X.store.str.split(',').str.get(0).str.replace(';', '')))
                                                     ),
                               store_number = case_when((text_filter(X.store, r';\s?#', regex=True),
                                                             (X.store.str.split(';').str.get(1))),
                                                        (True,
                                                             (X.store.str.split(',').str.get(1)))
                                                     ),
                              )
                    )
walmart_messy.head()

Unnamed: 0,store,has_gas,has_diesel,store_type,store_number
0,Walmart Supercentre; #1050,0,0,Walmart Supercentre,#1050
1,Walmart Supercentre; #3658,0,0,Walmart Supercentre,#3658
2,Walmart Supercentre; #3013,0,0,Walmart Supercentre,#3013
3,Walmart Supercentre; #3009,1,0,Walmart Supercentre,#3009
4,Walmart; #1144,0,0,Walmart,#1144


In [6]:
# These values have a comma instead of a semicolon before the store number
# The current pipeline breaks for these cases, but since that isn't the task, we'll just exclude them
anomalies = [429, 893, 6135, 6156]

def make_indicator(col, value):
    return col.map(lambda s: 1 if value in s else 0)

def split_get_store(sep, idx):
    return X.store.str.split(sep, n=1, regex=True).str.get(idx)

wm2 = (walmart_locations
       >> mutate(idx = X.index) # filtering on the index itself doesn't work too well...
       >> filter_by(~X.idx.isin(anomalies))
       >> drop(X.idx)
       >> select(X.store)
       >> mutate(has_gas = make_indicator(X.store, 'Gas'),
                 has_diesel = make_indicator(X.store, 'Gas/Diesel'),
                 store = split_get_store(',', 0))
       >> mutate(store_type = split_get_store(';|,', 0),
                 store_number = split_get_store(';|,', 1))
      )
wm2.head()

Unnamed: 0,store,has_gas,has_diesel,store_type,store_number
0,Walmart Supercentre; #1050,0,0,Walmart Supercentre,#1050
1,Walmart Supercentre; #3658,0,0,Walmart Supercentre,#3658
2,Walmart Supercentre; #3013,0,0,Walmart Supercentre,#3013
3,Walmart Supercentre; #3009,1,0,Walmart Supercentre,#3009
4,Walmart; #1144,0,0,Walmart,#1144


In [7]:
wm2 >> filter_by(X.store_type.isna())

Unnamed: 0,store,has_gas,has_diesel,store_type,store_number


In [8]:
wm2 >> filter_by(text_filter(X.store_type, r'\d', regex=True))

Unnamed: 0,store,has_gas,has_diesel,store_type,store_number


In [9]:
wm2.columns

Index(['store', 'has_gas', 'has_diesel', 'store_type', 'store_number'], dtype='object')

## What is refactoring?

Refactoring code involves

1. Identifying parts of our code that can be named by their purpose.
2. Packaging this code in a variable or function with a good name.
3. Replacing the messy code with the variable or function call.
4. *Testing that the code still works*
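
The four steps above can be sketched on a tiny example. This is a minimal illustration in plain pandas rather than `dfply`; the two store strings are copied from the data shown earlier, and `make_indicator` is the helper defined in the solution cells.

```python
import pandas as pd

# Two store strings taken from the sample output above
store = pd.Series(["Walmart Supercentre; #3009,Gas,", "Walmart; #1144,"])

# Before: the purpose (flag stores that sell gas) is buried in an inline expression
has_gas_before = store.map(lambda s: 1 if "Gas" in s else 0)

# Steps 1-2: identify the purpose and package it under a descriptive name
def make_indicator(col, value):
    """Return 1 where `value` occurs in the string, otherwise 0."""
    return col.map(lambda s: 1 if value in s else 0)

# Step 3: replace the inline expression with a call to the named helper
has_gas_after = make_indicator(store, "Gas")

# Step 4: test that the refactored code still gives the same answer
assert has_gas_after.tolist() == has_gas_before.tolist()
```

Because the helper takes the search text as a parameter, the same function can also build `has_diesel` without duplicating the expression.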

We will practice the process together by completing the following tasks.

#### Tasks

1. Refactor the `has_gas` expression by saving the `ifelse` intention as a variable.
2. Refactor the `store` expression using a `lambda` to allow reuse in later expressions.

In [10]:
walmart_locations >> filter_by(~text_filter(X.store, r".*;\s*#\d+", regex=True))

Unnamed: 0,long,lat,store,address
429,-94.152051,36.280774,"Wm Nbrhd Mkt,#0241,Gas,","4206 S Pleasant Crossing Blvd,Rogers,AR,72758 ..."
893,-120.419884,34.919944,"Walmart; Supercenter,#2507,","2220 S Bradley,Santa Maria,CA,93455 ,(NOP),(80..."
6135,-96.769302,33.054745,"Walmart Fuel Center,#0997,Gas,","6040 Coit Rd,Plano,TX,75023,"
6156,-96.796902,33.221384,"Walmart Supercenter,#6300,Gas/Diesel,","500 Richland Blvd,Prosper,TX,75078 ,,(972) 347..."


In [11]:
walmart_locations.head()

Unnamed: 0,long,lat,store,address
0,-114.005671,51.262567,"Walmart Supercentre; #1050,","2881 Main St SW,Airdrie ,AB T4B 3G5,(403) 945-..."
1,-111.900542,50.577939,"Walmart Supercentre; #3658,","917 3rd St W,Brooks ,AB T1R 1L5,(403) 793-2111"
2,-114.039133,51.107253,"Walmart Supercentre; #3013,","1110 57th Ave NE,Calgary ,(NOP),AB T2E 9B7,(40..."
3,-114.138488,51.040871,"Walmart Supercentre; #3009,Gas,","1212 37 St SW,Calgary ,(NOP),AB T3C 1S3,(403) ..."
4,-114.028603,50.930551,"Walmart; #1144,","1221 Canyon Meadows Dr SE,Calgary ,AB T2J 6G2,..."


In [12]:
walmart_locations.iloc[[429,893,6135,6156, 2000]] # 2000 included as a contrasting normal case

Unnamed: 0,long,lat,store,address
429,-94.152051,36.280774,"Wm Nbrhd Mkt,#0241,Gas,","4206 S Pleasant Crossing Blvd,Rogers,AR,72758 ..."
893,-120.419884,34.919944,"Walmart; Supercenter,#2507,","2220 S Bradley,Santa Maria,CA,93455 ,(NOP),(80..."
6135,-96.769302,33.054745,"Walmart Fuel Center,#0997,Gas,","6040 Coit Rd,Plano,TX,75023,"
6156,-96.796902,33.221384,"Walmart Supercenter,#6300,Gas/Diesel,","500 Richland Blvd,Prosper,TX,75078 ,,(972) 347..."
2000,-116.544085,48.308524,"Walmart Supercenter; #2485,Gas,","476999 Hwy 95 N,Ponderay,ID,83864 ,,(208) 265-..."
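
The rows above differ from the normal case mainly in the separator that precedes the store number. As one possible alternative to excluding them (not the approach taken in this activity), a single regular expression could capture the store type and number whichever separator is used. The sketch below uses plain pandas with `Series.str.extract`; the pattern and the names `pattern` and `parts` are illustrative assumptions, and the input strings are copied from the output above.

```python
import pandas as pd

# Store strings copied from the five rows displayed above
store = pd.Series([
    "Wm Nbrhd Mkt,#0241,Gas,",
    "Walmart; Supercenter,#2507,",
    "Walmart Fuel Center,#0997,Gas,",
    "Walmart Supercenter,#6300,Gas/Diesel,",
    "Walmart Supercenter; #2485,Gas,",
])

# Lazily capture everything up to the first ';' or ',' that is followed by
# optional whitespace and a '#<digits>' store number.
pattern = r"^(?P<store_type>.*?)[;,]\s*(?P<store_number>#\d+)"

# Named groups become the column names of the resulting DataFrame
parts = store.str.extract(pattern)
print(parts)
```

Note that row 893 keeps the stray semicolon inside its store type (`Walmart; Supercenter`), so it would still need a follow-up `replace` if that value should match the other Supercenter rows.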


In [13]:
# Refactored expressions here

# These values have a comma instead of a semicolon before the store number
# The current pipeline breaks for these cases, but since that isn't the task, we'll just exclude them
anomalies = [429, 893, 6135, 6156]

def make_indicator(col, value):
    return col.map(lambda s: 1 if value in s else 0)

def split_get_store(sep, idx):
    return X.store.str.split(sep).str.get(idx)

# For some reason (maybe the def instead of a lambda/intention object?) splitting doesn't always work correctly
# Some items refuse to bin correctly and just won't split even though they visually match the patterns
# However, providing a split of semicolon OR comma seems to satisfy them and then no values are missing
# As a plus, we no longer need cases for store_type and store_number

# Refactored code here
walmart_loc_refactored = (walmart_locations
                    # deal with a breaking case by skipping it for now
                     >> mutate(idx = X.index) # filtering on the index itself doesn't work too well...
                     >> filter_by(~X.idx.isin(anomalies))
                     >> drop(X.idx)
                     >> select(X.store)
                     >> mutate(has_gas = make_indicator(X.store, "Gas"),
                               has_diesel = make_indicator(X.store, "Gas/Diesel"),
                               store = split_get_store(",", 0)
                              )
                     >> mutate(store_type = split_get_store(';|,', 0),
                                store_number = split_get_store(';|,', 1)
                              )
)
walmart_loc_refactored.sample(5)

Unnamed: 0,store,has_gas,has_diesel,store_type,store_number
4341,Murphy: USA; #6833,1,1,Murphy: USA,#6833
4040,Walmart Supercenter; #0821,1,0,Walmart Supercenter,#0821
6203,Murphy: USA; #7055,1,1,Murphy: USA,#7055
1358,Wm Nbrhd Mkt; #2391,0,0,Wm Nbrhd Mkt,#2391
4964,Walmart Supercenter; #2300,0,0,Walmart Supercenter,#2300


In [14]:
(walmart_loc_refactored >> filter_by(X.store_number.isna()))

Unnamed: 0,store,has_gas,has_diesel,store_type,store_number


In [15]:
# Better yet, pandas can do its own asserts, with much more helpful error messages
from pandas import testing as pt

wlm = (walmart_loc_messy 
         >> mutate(idx = X.index) # filtering on the index itself doesn't work too well...
         >> filter_by(~X.idx.isin(anomalies))
         >> drop(X.idx)
      )

pt.assert_series_equal(wlm.has_gas, walmart_loc_refactored.has_gas)
pt.assert_series_equal(wlm.has_diesel, walmart_loc_refactored.has_diesel)
#pt.assert_series_equal(wlm.store_type, walmart_loc_refactored.store_type) # fails: the old version gets some rows wrong
#pt.assert_series_equal(wlm.store_number, walmart_loc_refactored.store_number) # fails for the same reason
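
To see what the "more helpful error messages" look like, here is a small self-contained sketch. The two Series are made-up stand-ins for an old and a refactored column, not values from the Walmart data, and `mismatch_report` is a name introduced only for this illustration.

```python
import pandas as pd
from pandas import testing as pt

# Hypothetical stand-ins for a column before and after refactoring
old = pd.Series([0, 1, 1], name="has_gas")
new = pd.Series([0, 1, 0], name="has_gas")  # last value differs on purpose

# Identical Series: the assertion passes silently
pt.assert_series_equal(old, old.copy())

# Differing Series: an AssertionError is raised with a report of the mismatch
try:
    pt.assert_series_equal(old, new)
    mismatch_report = None
except AssertionError as err:
    mismatch_report = str(err)

print(mismatch_report)
```

A bare `assert (old == new).all()` would only say that the check failed, whereas the pandas report describes how the two Series differ.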

In [16]:
(walmart_loc_refactored >> filter_by(X.store_number != wlm.store_number))

Unnamed: 0,store,has_gas,has_diesel,store_type,store_number
5511,Murphy: USA; #7235,1,1,Murphy: USA,#7235
5838,Murphy: USA; #7258,1,1,Murphy: USA,#7258
5982,Murphy: USA; #6797,1,1,Murphy: USA,#6797


In [17]:
wlm.iloc[[55, 5836, 5980]]

Unnamed: 0,store,has_gas,has_diesel,store_type,store_number
55,Walmart Supercentre; #1102,0,0,Walmart Supercentre,#1102
5838,Murphy: USA; #7258,1,1,Murphy: USA #7258,
5982,Murphy: USA; #6797,1,1,Murphy: USA #6797,


#### Problem 1

To complete this activity, you should:

1. Copy our current progress below.
2. Perform each of the following refactors, while adding appropriate `assert` statements to test the results.
    - Refactor the rest of the `split` & `get` parts of the code.
    - Refactor the remaining `text_filter` expressions.  Note that these are all intentions, so they can be saved as variables.
    - Refactor any `True` cases to use `else_` instead.  Explain why this is a cleaner approach.
    - See if you can come up with a solution to the `split`, `get`, then `replace` expression in the last case.  **Hint:** The best solution should reuse our previous solution!
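
As a point of comparison for the catch-all case, plain NumPy offers a loosely similar idea: `np.select` takes a `default` argument for rows that match none of the conditions. The sketch below is only an analogy in NumPy and pandas, not the `more_dfply` solution; the name `has_semicolon_number` is an illustrative assumption, and the two strings are shortened forms of values seen earlier.

```python
import numpy as np
import pandas as pd

# One normal row and one row that uses a comma before the store number
store = pd.Series(["Walmart; #1144", "Walmart Fuel Center,#0997"])

# Name the condition once so it can be reused for several columns
has_semicolon_number = store.str.contains(r";\s?#", regex=True)

# Rows matching the condition split on ';'; everything else falls to `default`
store_type = np.select(
    [has_semicolon_number],
    [store.str.split(";").str.get(0)],
    default=store.str.split(",").str.get(0),
)
print(store_type)
```

Naming the fallback explicitly signals to the reader that it is the "everything else" branch rather than a condition that happens to always be true.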

In [18]:
# Copy and continue to refactor here