# Activity 4.1 - Cleaning Walmart Data the OpenRefine Way

In this activity, you will practice what you learned in Lecture 4.5 by cleaning up a data set containing information on various Walmart locations.

In [1]:
import pandas as pd
from dfply import *

#### Initial Tasks

1. Try to read in the `./data/Walmart_United_States_&_Canada.csv` file and verify that you get an encoding error.  This means that the [character encoding](https://en.wikipedia.org/wiki/Character_encoding) isn't the default of `utf-8`.  The easiest way to fix this is to open and save the file in Visual Studio Code.

In [2]:
walmart = pd.read_csv("./data/Walmart_United_States_&_Canada.csv")
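
If re-saving the file in an editor is not an option, `pd.read_csv` also accepts an `encoding` argument. The sketch below uses a small in-memory sample rather than the Walmart file, and `latin-1` is only an assumed encoding chosen for illustration; the actual encoding of the activity file is not stated here.

```python
import io
import pandas as pd

# Hypothetical sample: one row with a non-ASCII character, encoded as latin-1.
raw_bytes = "city,store\nMontréal,Walmart\n".encode("latin-1")

# Reading with the default utf-8 codec fails on the 0xE9 byte.
try:
    pd.read_csv(io.BytesIO(raw_bytes))
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Passing the (assumed) source encoding lets pandas decode the bytes.
sample = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
```

Guessing an encoding can silently produce wrong characters, so it is worth spot-checking a few non-ASCII values after the read.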

2. Read in the data to verify that the encoding is fixed, but that there are two more problems.  What are they?

In [3]:
walmart = pd.read_csv("./data/Walmart_United_States_&_Canada.csv")
walmart.head(10)

Unnamed: 0,-114.005671,51.262567,"Walmart Supercentre; #1050,","2881 Main St SW,Airdrie ,AB T4B 3G5,(403) 945-1295"
0,-111.900542,50.577939,"Walmart Supercentre; #3658,","917 3rd St W,Brooks ,AB T1R 1L5,(403) 793-2111"
1,-114.039133,51.107253,"Walmart Supercentre; #3013,","1110 57th Ave NE,Calgary ,(NOP),AB T2E 9B7,(40..."
2,-114.138488,51.040871,"Walmart Supercentre; #3009,Gas,","1212 37 St SW,Calgary ,(NOP),AB T3C 1S3,(403) ..."
3,-114.028603,50.930551,"Walmart; #1144,","1221 Canyon Meadows Dr SE,Calgary ,AB T2J 6G2,..."
4,-113.91159,51.04009,"Walmart Supercentre; #1136,","255 E Hills Blvd SE,Calgary ,AB T2A 4X7,(403) ..."
5,-114.145518,51.1757,"Walmart Supercentre; #1097,","35 Sage Hill Gate NW,Calgary ,AB T3R 0S4,(587)..."
6,-113.989925,51.053615,"Walmart Supercentre; #3012,","3800 Memorial Dr NE,Calgary ,(NOP),AB T2A 2K2,..."
7,-113.966699,50.930235,"Walmart Supercentre; #3650,","4705 130th Ave,Calgary ,AB T2Z 4J2,(403) 726-0430"
8,-114.142114,51.097447,"Walmart; #3011,Gas,","5005 Northland Dr NW,Calgary ,(NOP),AB T2L 2K1..."
9,-114.039281,50.984422,"Walmart Supercentre; #1089,","7979 11 St SE,Calgary ,(NOP),AB T2H 0B8,(403) ..."


* No headers
* Name and store number together
* Additional optional "gas" field?
* More commas as address separators
* Address and phone concatenated

3. Take another look at the file in VS Code and determine solutions to the two/three issues, then read in the data correctly by passing `pd.read_csv` the correct defaults for this data. **Note.** Leave the `"` in place for now, as they serve an important role here!

In [4]:
help(pd.read_csv) # This might help!

Help on function read_csv in module pandas.io.parsers.readers:

read_csv(filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]', sep=<no_default>, delimiter=None, header='infer', names=<no_default>, index_col=None, usecols=None, squeeze=None, prefix=<no_default>, mangle_dupe_cols=True, dtype: 'DtypeArg | None' = None, engine: 'CSVEngine | None' = None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=None, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression: 'CompressionOptions' = 'infer', thousands=None, decimal: 'str' = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors: 'str | None' = 'strict', dialect=None, error_bad_li

<font color="blue"> Your thoughts here </font>

In [5]:
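# Leave the quoted parts alone for now - they're intentional
# NOTE: the first numeric column looks like longitude (about -114) and the second
# like latitude (about 51), so the "Lat"/"Long" labels below appear to be swapped.
walmart = pd.read_csv("./data/Walmart_United_States_&_Canada.csv",
                      names=["Lat", "Long", "Description", "Address_Phone"])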
walmart.head()

Unnamed: 0,Lat,Long,Description,Address_Phone
0,-114.005671,51.262567,"Walmart Supercentre; #1050,","2881 Main St SW,Airdrie ,AB T4B 3G5,(403) 945-..."
1,-111.900542,50.577939,"Walmart Supercentre; #3658,","917 3rd St W,Brooks ,AB T1R 1L5,(403) 793-2111"
2,-114.039133,51.107253,"Walmart Supercentre; #3013,","1110 57th Ave NE,Calgary ,(NOP),AB T2E 9B7,(40..."
3,-114.138488,51.040871,"Walmart Supercentre; #3009,Gas,","1212 37 St SW,Calgary ,(NOP),AB T3C 1S3,(403) ..."
4,-114.028603,50.930551,"Walmart; #1144,","1221 Canyon Meadows Dr SE,Calgary ,AB T2J 6G2,..."
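
Why the quotes matter can be seen on a tiny in-memory sample. This is a minimal sketch with made-up rows in the same layout as the Walmart file; the column names `c0` and `c1` are placeholders, not names from the activity. With the default `quotechar='"'`, commas inside a quoted field are treated as part of the field and not as separators.

```python
import io
import pandas as pd

# Hypothetical two-row sample: no header row, and quoted fields that
# contain commas of their own.
csv_text = (
    '-114.0,51.2,"Walmart Supercentre; #1050,","2881 Main St SW,Airdrie ,AB"\n'
    '-114.1,51.0,"Walmart; #3011,Gas,","5005 Northland Dr NW,Calgary ,AB"\n'
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    header=None,  # the first line is data, not column names
    names=["c0", "c1", "Description", "Address_Phone"],  # placeholder names
)
```

Each row parses into exactly four columns even though the second row's description carries an extra `Gas` entry.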


## Cleaning up the store information.

As hinted at above, the `"` characters mean that two of the columns, one containing the store type/number and the other containing the address/phone number, each combine several fields.  This was done because these entries do not all have the same number of fields.  For example, the store type/number column occasionally includes `Gas`.

In this part of the activity, you should apply the iterative OpenRefine approach to separate the information in the store column.

**Warning!** There is one entry that doesn't follow the same pattern as the rest.  You won't find this entry unless you carefully define/fix/eliminate patterns.

In [6]:
from more_dfply import case_when, ifelse
from more_dfply.facets import text_facet, text_filter

# Your code here.

In [7]:
# View cell
(walmart
    >> select(X.Description)
# All descriptions have values
    >> filter_by(~text_filter(X.Description, r"Walmart(?: .*)?(?:;|,)\s?#\d{1,4}", regex=True))  # raw string keeps \s and \d intact
    >> filter_by(~text_filter(X.Description, "Murphy|Wm |Sam's Club", regex=True))
    >> filter_by(~text_filter(X.Description, "; Supercenter"))

)

Unnamed: 0,Description
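
The same "filter out what is already explained" idea can be sketched in plain pandas, without the `more_dfply` helpers. The sample descriptions below are hypothetical stand-ins for the real column; only the regular expression is taken from the view cell.

```python
import pandas as pd

# Hypothetical sample of descriptions, including one that breaks the pattern.
desc = pd.Series([
    "Walmart Supercentre; #1050,",
    "Walmart; #3011,Gas,",
    "Walmart; Supercenter,#2507,",
])

# Pattern for "<store type>; #<number>"; a raw string keeps \s and \d intact.
pattern = r"Walmart(?: .*)?(?:;|,)\s?#\d{1,4}"

# Rows that do NOT match are the ones still left to explain.
leftovers = desc[~desc.str.contains(pattern, regex=True)]
```

Repeating this step with one more pattern per round shrinks `leftovers` until it is empty, which is how the odd entry surfaces.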


In [8]:
# Transform cell
wm_cleaned = (walmart
    >> mutate(Description = ifelse(X.Description.str.contains("; Supercenter"),
                                  X.Description.str.replace(";", "", regex=False),
                                  X.Description)) # fix one weird case
    >> mutate(Description = X.Description.str.replace(';', ',', regex=False))
    >> mutate(Store_type = X.Description.str.split(',').str.get(0),
             Store_number = X.Description.str.split(',').str.get(1),
             Gas = X.Description.str.split(',').str.get(2))
)
wm_cleaned

Unnamed: 0,Lat,Long,Description,Address_Phone,Store_type,Store_number,Gas
0,-114.005671,51.262567,"Walmart Supercentre, #1050,","2881 Main St SW,Airdrie ,AB T4B 3G5,(403) 945-...",Walmart Supercentre,#1050,
1,-111.900542,50.577939,"Walmart Supercentre, #3658,","917 3rd St W,Brooks ,AB T1R 1L5,(403) 793-2111",Walmart Supercentre,#3658,
2,-114.039133,51.107253,"Walmart Supercentre, #3013,","1110 57th Ave NE,Calgary ,(NOP),AB T2E 9B7,(40...",Walmart Supercentre,#3013,
3,-114.138488,51.040871,"Walmart Supercentre, #3009,Gas,","1212 37 St SW,Calgary ,(NOP),AB T3C 1S3,(403) ...",Walmart Supercentre,#3009,Gas
4,-114.028603,50.930551,"Walmart, #1144,","1221 Canyon Meadows Dr SE,Calgary ,AB T2J 6G2,...",Walmart,#1144,
...,...,...,...,...,...,...,...
6811,-107.209281,41.792084,"Walmart Supercenter, #4471,Gas,","2390 E Cedar St; I-80 Exit 214,Rawlins,WY,8230...",Walmart Supercenter,#4471,Gas
6812,-108.379227,43.042858,"Walmart Supercenter, #1457,","1733 N Federal Blvd,Riverton,WY,82501 ,,(307) ...",Walmart Supercenter,#1457,
6813,-109.251020,41.579761,"Walmart Supercenter, #1461,","201 Gateway Blvd; I-80 Exit 102,Rock Springs,W...",Walmart Supercenter,#1461,
6814,-106.940967,44.779474,"Walmart Supercenter, #1508,","1695 Coffeen Ave; I-90 Exit 25,Sheridan,WY,828...",Walmart Supercenter,#1508,
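
As an alternative to calling `str.split` three times, the pieces can be produced in one pass with `expand=True`. This is a sketch on hypothetical sample values, not the activity's solution; the `str.strip()` calls are an addition that removes the space left after each comma.

```python
import pandas as pd

# Hypothetical descriptions after semicolons were replaced by commas.
desc = pd.Series([
    "Walmart Supercentre, #1050,",
    "Walmart Supercentre, #3009,Gas,",
])

# Split once; expand=True returns one column per piece.
parts = desc.str.split(",", expand=True)

fields = pd.DataFrame({
    "Store_type": parts[0].str.strip(),
    "Store_number": parts[1].str.strip(),
    "Gas": parts[2].str.strip(),
})
```

Rows without a gas station end up with an empty string in `Gas`, which can later be turned into a boolean flag if that is more convenient.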


In [9]:
# verify the weird case was handled
walmart.iloc[[893]]

Unnamed: 0,Lat,Long,Description,Address_Phone
893,-120.419884,34.919944,"Walmart; Supercenter,#2507,","2220 S Bradley,Santa Maria,CA,93455 ,(NOP),(80..."


In [10]:
wm_cleaned.iloc[[893]]

Unnamed: 0,Lat,Long,Description,Address_Phone,Store_type,Store_number,Gas
893,-120.419884,34.919944,"Walmart Supercenter,#2507,","2220 S Bradley,Santa Maria,CA,93455 ,(NOP),(80...",Walmart Supercenter,#2507,


## Preview of Coming Attractions

In this module's homework assignment, you will continue to clean up this data set.