# Data Preprocessing Part 2 (SP702 PGA04)
**Set assigned: Sales.csv and Handling_Missing_Values.csv**

Instructions are listed as markdown cells below.

**Guide Questions** 

1. **Describe how and where you got your data.**
    * The data I acquired for Part 3 are the submissions of 50 respondents to my survey on convenience stores for SP201: Excel Basics.


2. **What are the challenges you encountered in gathering data?**
    * String entries arrive in multiple formats that deviate from the intended one. To group similar entries that are written differently, they must first be standardized, and finding an optimal way to standardize these formats is itself a challenge.
    * Ensuring that numbers fresh from surveys are not interpreted as strings is also a challenge, since this can break any arithmetic operations conducted prior to preprocessing.


3. **What are the preprocessing techniques you used to clean the data?**
    * The following were used in preprocessing my own data:
        * Defaulting NaN values to the constant 'Unemployed', since the survey specified that unemployed respondents are to answer N/A on the City of Work item.
        * City name standardization using replacements, the unique() function for checking, and str.title() to capitalize the first letter of each word.
        * For the other data sets, the techniques used in this exercise were standardization of continuous variables, using the mean as a global constant for NaN continuous variables, and using an "unknown" global constant (with stated rationale).


4. **Why did you perform those techniques?**
    * Standardization of the city names was needed for further exploratory analysis, should residents of the same city need to be grouped together. Replacing work-city NaNs with 'Unemployed' reflects something the respondents were aware of when they were filling out the survey, so it was appropriate to set this as a global constant for that column.

### Part 0: Importing the 3 Datasets
The following datasets were imported:
1. sales.csv
2. Handling_Missing_Values.csv
3. RawData_SP201.csv

In [217]:
import pandas as pd
import numpy as np

data01 = pd.read_csv("sales.csv")
data02 = pd.read_csv("Handling_Missing_Values.csv")
data03 = pd.read_csv("RawData_SP201.csv")

### Part 1:
Using the Data found in sales.csv, perform the following tasks:
1. Standardize the Price Column
2. Identify and remove outliers in the dataset

*Note: Based on the previous PGA, one item in the Price column, "13,000" at index 560, is a string with a thousands separator, so the column is read as object rather than as a number. We'll correct that.*

In [218]:
data01.head(10) ## Print file's first 10 entries

Unnamed: 0,Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
0,1/2/2009 6:17,Product1,1200,Mastercard,carolina,Basildon,England,United Kingdom,1/2/2009 6:00,1/2/2009 6:08,51.5,-1.116667
1,1/2/2009 4:53,Product1,1200,Visa,Betina,Parkville,MO,United States,1/2/2009 4:42,1/2/2009 7:49,39.195,-94.68194
2,1/2/2009 13:08,Product1,1200,Mastercard,Federica e Andrea,Astoria,OR,United States,1/1/2009 16:21,1/3/2009 12:32,46.18806,-123.83
3,1/3/2009 14:44,Product1,1200,Visa,Gouya,Echuca,Victoria,Australia,9/25/2005 21:13,1/3/2009 14:22,-36.133333,144.75
4,1/4/2009 12:56,Product2,3600,Visa,Gerd W,Cahaba Heights,AL,United States,11/15/2008 15:47,1/4/2009 12:45,33.52056,-86.8025
5,1/4/2009 13:19,Product1,1200,Visa,LAURENCE,Mickleton,NJ,United States,9/24/2008 15:19,1/4/2009 13:04,39.79,-75.23806
6,1/4/2009 20:11,Product1,1200,Mastercard,Fleur,Peoria,IL,United States,1/3/2009 9:38,1/4/2009 19:45,40.69361,-89.58889
7,1/2/2009 20:09,Product1,1200,Mastercard,adam,Martin,TN,United States,1/2/2009 17:43,1/4/2009 20:01,36.34333,-88.85028
8,1/2/2009 20:09,Product1,1200,Mastercard,adam,Martin,TN,United States,1/2/2009 17:43,1/4/2009 20:01,36.34333,-88.85028
9,1/4/2009 13:17,Product1,1200,Mastercard,Renee Elisabeth,Tel Aviv,Tel Aviv,Israel,1/4/2009 13:03,1/4/2009 22:10,32.066667,34.766667


In [219]:
print(data01[data01['Price']=='13,000'].dtypes)

Transaction_date     object
Product              object
Price                object
Payment_Type         object
Name                 object
City                 object
State                object
Country              object
Account_Created      object
Last_Login           object
Latitude            float64
Longitude           float64
dtype: object
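
The check above relies on knowing the offending value ("13,000") in advance. As a minimal sketch, assuming the column holds number-like strings, `pd.to_numeric` with `errors="coerce"` can locate any entry that fails to parse. The `price` series below is a hypothetical stand-in, not the real sales data.

```python
import pandas as pd

# Hypothetical stand-in for the Price column: one value carries a thousands separator
price = pd.Series(["1200", "3600", "13,000", "1200"], name="Price")

# Coerce to numbers; anything that cannot be parsed becomes NaN
as_number = pd.to_numeric(price, errors="coerce")

# Entries that were present but failed to parse
bad_entries = price[as_number.isna() & price.notna()]
print(bad_entries)
```

The mask excludes values that were already missing, so only malformed entries are listed.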


**We'll convert the string "13,000" by removing the comma, then convert the whole Price column to float.**

In [220]:
data01['Price'].str.replace(",","").astype('float64')
##Preview only: "13,000" converts to a number, but the result is not assigned back to data01['Price']

0      1200.0
1      1200.0
2      1200.0
3      1200.0
4      3600.0
        ...  
995    1200.0
996    3600.0
997    7500.0
998    1200.0
999    1200.0
Name: Price, Length: 1000, dtype: float64
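
An alternative to cleaning the column after loading is to let the parser handle the separator. The sketch below assumes the comma is only ever used as a thousands separator; `csv_text` is a hypothetical extract, not the contents of sales.csv.

```python
import io
import pandas as pd

# Hypothetical two-column extract mimicking the quoted "13,000" entry
csv_text = 'Product,Price\nProduct1,1200\nProduct2,3600\nProduct3,"13,000"\n'

# thousands="," tells the parser to treat the comma as a digit-group separator
sample = pd.read_csv(io.StringIO(csv_text), thousands=",")
print(sample["Price"].dtype)
print(sample["Price"].tolist())
```

With this option the column arrives numeric, so no later string replacement is needed.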

In [221]:
data01.insert(12,'std_sales',0) #insert a new column std_sales for standardized prices

In [222]:
##Standardization while converting price entries to float
price = data01['Price'].str.replace(",","").astype(float) #numeric copy of Price
data01['std_sales'] = (price - np.mean(price)) / np.std(price) #z-score; np.std uses ddof=0

data01.std_sales.describe()

count    9.990000e+02
mean    -3.737417e-16
std      1.000501e+00
min     -1.347942e+00
25%     -3.727122e-01
50%     -3.727122e-01
75%     -3.727122e-01
max      9.379587e+00
Name: std_sales, dtype: float64
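
The summary above reports a standard deviation of about 1.0005 rather than exactly 1. A likely explanation is that `np.std` divides by n (ddof=0) while `describe()` divides by n-1 (ddof=1); with 999 values the ratio is sqrt(999/998), roughly 1.0005. The sketch below illustrates the effect on a hypothetical handful of prices, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, not the real sales data
price = pd.Series([1200.0, 1200.0, 3600.0, 7500.0, 1200.0])

# Standardize with the population standard deviation (ddof=0), as np.std does by default
z = (price - price.mean()) / price.std(ddof=0)

n = len(z)
pop_std = z.std(ddof=0)      # spread measured the same way it was standardized
sample_std = z.std(ddof=1)   # the convention describe() uses
print(pop_std, sample_std, np.sqrt(n / (n - 1)))
```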

In [223]:
##Show Preview of standardization

data01[['Price','std_sales']]

| | Price | std_sales |
|---|---|---|
| 0 | 1200 | -0.372712 |
| 1 | 1200 | -0.372712 |
| 2 | 1200 | -0.372712 |
| 3 | 1200 | -0.372712 |
| 4 | 3600 | 1.610806 |
| ... | ... | ... |
| 995 | 1200 | -0.372712 |
| 996 | 3600 | 1.610806 |
| 997 | 7500 | 4.834024 |
| 998 | 1200 | -0.372712 |


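**Finding: The maximum standardized value (about 9.38) is far more than 3 standard deviations above the mean, while the minimum (about -1.35) is well within range. We remove every row whose standardized price is 3 or more.**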

In [224]:
data01_toFile = data01[data01['std_sales']<3]
data01_toFile.to_csv('sales_new.csv',index = False) #write to file part 1 output

preview = pd.read_csv('sales_new.csv')
preview.head(10)

Unnamed: 0,Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude,std_sales
0,1/2/2009 6:17,Product1,1200,Mastercard,carolina,Basildon,England,United Kingdom,1/2/2009 6:00,1/2/2009 6:08,51.5,-1.116667,-0.372712
1,1/2/2009 4:53,Product1,1200,Visa,Betina,Parkville,MO,United States,1/2/2009 4:42,1/2/2009 7:49,39.195,-94.68194,-0.372712
2,1/2/2009 13:08,Product1,1200,Mastercard,Federica e Andrea,Astoria,OR,United States,1/1/2009 16:21,1/3/2009 12:32,46.18806,-123.83,-0.372712
3,1/3/2009 14:44,Product1,1200,Visa,Gouya,Echuca,Victoria,Australia,9/25/2005 21:13,1/3/2009 14:22,-36.133333,144.75,-0.372712
4,1/4/2009 12:56,Product2,3600,Visa,Gerd W,Cahaba Heights,AL,United States,11/15/2008 15:47,1/4/2009 12:45,33.52056,-86.8025,1.610806
5,1/4/2009 13:19,Product1,1200,Visa,LAURENCE,Mickleton,NJ,United States,9/24/2008 15:19,1/4/2009 13:04,39.79,-75.23806,-0.372712
6,1/4/2009 20:11,Product1,1200,Mastercard,Fleur,Peoria,IL,United States,1/3/2009 9:38,1/4/2009 19:45,40.69361,-89.58889,-0.372712
7,1/2/2009 20:09,Product1,1200,Mastercard,adam,Martin,TN,United States,1/2/2009 17:43,1/4/2009 20:01,36.34333,-88.85028,-0.372712
8,1/2/2009 20:09,Product1,1200,Mastercard,adam,Martin,TN,United States,1/2/2009 17:43,1/4/2009 20:01,36.34333,-88.85028,-0.372712
9,1/4/2009 13:17,Product1,1200,Mastercard,Renee Elisabeth,Tel Aviv,Tel Aviv,Israel,1/4/2009 13:03,1/4/2009 22:10,32.066667,34.766667,-0.372712
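
The filter above keeps rows with `std_sales < 3`, which trims only the upper tail and also drops any row whose standardized value is NaN, since NaN compares as False. A minimal sketch of a two-sided version, using hypothetical prices rather than the real data:

```python
import pandas as pd

# Hypothetical prices: mostly small values plus one extreme entry and one missing value
price = pd.Series([1200.0] * 30 + [3600.0] * 5 + [13000.0, None])

# z-score with the population standard deviation; mean() and std() skip NaN
z = (price - price.mean()) / price.std(ddof=0)

# Keep rows within 3 standard deviations on either side; NaN compares as False and is dropped
kept = price[z.abs() < 3]
print(len(price), len(kept))
```

Whether rows with a missing price should be dropped silently is a judgment call; an explicit `dropna()` beforehand would make that choice visible.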


### Part 2: 
Using the Data in Handling_Missing_Values.csv:
1. Handle all the missing values. Provide an explanation for choosing the specific method (delete, impute, or keep: if impute, why mean, median, or mode?) in handling specific missing values


* The following actions were taken in dealing with missing values:
    
    * **Sex** - A global constant, "Unknown", was applied instead of the mode, since sex can be an impactful variable in a study. Rather than assuming a sex through mode(), "Unknown" is declared, because these respondents may have significantly different statistics in their other attributes.
    
    * **Income** - We use the mean of Income for the missing data in this column, since it is a continuous variable with a low spread.
    
    * **Children** - We use the median of the number of children, since it is a discrete variable.
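
The three choices above can be sketched in a single `fillna` call that takes one fill value per column. The frame below is a hypothetical miniature shaped like Handling_Missing_Values.csv; its values are illustrative only.

```python
import pandas as pd

# Hypothetical rows shaped like Handling_Missing_Values.csv
df = pd.DataFrame({
    "Sex": ["Male", None, "Female", None],
    "Income": [25146.0, 26939.0, None, 26693.0],
    "Children": [0.0, 2.0, None, 3.0],
})

# One fill value per column: a label, the mean, and the median
fill_values = {
    "Sex": "Unknown",
    "Income": df["Income"].mean(),
    "Children": df["Children"].median(),
}
filled = df.fillna(value=fill_values)
print(filled)
```

Keeping the fill values in one dictionary makes the per-column rationale easy to audit.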
    


In [225]:
data02 ## Print file's entries

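| | ID | Sex | Age | Income | Employed | Children | Buy_Car |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 25 | 25146.0 | Single | 0.0 | No |
| 1 | 2 | Male | 30 | 26939.0 | Married | 2.0 | Yes |
| 2 | 3 | Male | 27 | 26693.0 | Married | 0.0 | No |
| 3 | 4 | Male | 28 | 26666.0 | Married | 3.0 | Yes |
| 4 | 5 | Male | 29 | 25899.0 | Married | 0.0 | No |
| 5 | 6 | Male | 28 | 26462.0 | Married | 1.0 | No |
| 6 | 7 | Female | 28 | NaN | Married | 3.0 | Yes |
| 7 | 8 | NaN | 30 | 26037.0 | Married | 2.0 | Yes |
| 8 | 9 | Female | 28 | 26167.0 | Married | 1.0 | Yes |
| 9 | 10 | NaN | 28 | NaN | Single | NaN | No |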


In [226]:
data02.describe() ##EDA checking the properties of the set

| | ID | Age | Income | Children |
|---|---|---|---|---|
| count | 20.0 | 20.0 | 17.0 | 18.0 |
| mean | 10.5 | 28.15 | 26245.705882 | 1.222222 |
| std | 5.91608 | 1.308877 | 583.052395 | 1.437136 |
| min | 1.0 | 25.0 | 25094.0 | 0.0 |
| 25% | 5.75 | 28.0 | 26037.0 | 0.0 |
| 50% | 10.5 | 28.0 | 26234.0 | 1.0 |
| 75% | 15.25 | 29.0 | 26666.0 | 2.0 |
| max | 20.0 | 30.0 | 26969.0 | 4.0 |


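**Finding: It is worth noting that Age and Income have a low spread relative to their values (a standard deviation of about 1.31 around a mean of 28.15 for Age, and about 583 around a mean of about 26,246 for Income). We can apply the mean for continuous variables and the median for discrete variables.**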

In [227]:
data02.mode(numeric_only = True).head(1) ## EDA Check the set's modes

| | ID | Age | Income | Children |
|---|---|---|---|---|
| 0 | 1 | 28.0 | 25094.0 | 0.0 |
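
The Income mode shown above (25094.0) equals the column minimum reported by `describe()`. That is consistent with every income being unique: when no value repeats, `mode()` returns all values in sorted order, and `head(1)` simply picks the smallest. A small sketch with hypothetical incomes:

```python
import pandas as pd

# Hypothetical incomes with no repeated value
income = pd.Series([26939.0, 25146.0, 26693.0])

modes = income.mode()          # every value ties as a mode, returned in sorted order
first_mode = modes.iloc[0]     # the equivalent of .head(1)
print(len(modes), first_mode, income.min())
```

This is one reason the mode is a weak choice for imputing a continuous column such as Income.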


In [228]:
data02.isnull().sum() #Check for counts of null values in each column

ID          0
Sex         4
Age         0
Income      3
Employed    0
Children    2
Buy_Car     0
dtype: int64

In [229]:
##For Sex, a global constant "Unknown" is applied instead of the mode, since sex can be
## impactful in a study. Rather than assuming a sex, "Unknown" is declared, as these
## respondents may have significantly different statistics in their other attributes.
data02['Sex'] = data02['Sex'].fillna("Unknown")

##We use the mean of income for the missing data in this column
inc_mean = data02['Income'].mean()
data02['Income'] = data02['Income'].fillna(inc_mean)

##We use the median of No. of Children since it is a discrete variable.
children_median = data02['Children'].median()
data02['Children'] = data02['Children'].fillna(children_median)

data02.to_csv('Handling_Missing_Values_new.csv', index = False)
preview = pd.read_csv('Handling_Missing_Values_new.csv')
preview
#No more NaN/Null cells

| | ID | Sex | Age | Income | Employed | Children | Buy_Car |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 25 | 25146.0 | Single | 0.0 | No |
| 1 | 2 | Male | 30 | 26939.0 | Married | 2.0 | Yes |
| 2 | 3 | Male | 27 | 26693.0 | Married | 0.0 | No |
| 3 | 4 | Male | 28 | 26666.0 | Married | 3.0 | Yes |
| 4 | 5 | Male | 29 | 25899.0 | Married | 0.0 | No |
| 5 | 6 | Male | 28 | 26462.0 | Married | 1.0 | No |
| 6 | 7 | Female | 28 | 26245.705882 | Married | 3.0 | Yes |
| 7 | 8 | Unknown | 30 | 26037.0 | Married | 2.0 | Yes |
| 8 | 9 | Female | 28 | 26167.0 | Married | 1.0 | Yes |
| 9 | 10 | Unknown | 28 | 26245.705882 | Single | 1.0 | No |
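
The claim that no NaN cells remain after the write and re-read can be checked directly. One caveat worth sketching: `read_csv` treats certain strings, including "N/A", as missing by default, so a fill label spelled that way would come back as NaN. The frame and column names below are hypothetical.

```python
import io
import pandas as pd

# Hypothetical frame: one fill label that is safe, one that collides with read_csv's NA markers
df = pd.DataFrame({"Sex": ["Male", "Unknown"], "WorkCity": ["Pasig", "N/A"]})

# Round trip through CSV in memory instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Count missing cells per column after the round trip
print(reloaded.isna().sum())
```

The same default may explain why survey answers of "N/A" appear as NaN after loading.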


### Part 3: Application

1. Gather Data from Any source (it would be better if you use your own data)
2. Preprocess the data you gathered

### Data Dictionary (data from survey in SP201)

**Age** - Age of Respondent

**ConvStore_nearto** - Where are the convenience store/s you visit near to?

**HomeProv_add** - Province of Residence

**HomeCity_add** - City/Municipality of Residence

**WorkProv_add** - Province of Work

**WorkCity_add** - City/Municipality of Work

**ConvStore_TD** - How far (in travel time) is the nearest convenience store?

**ConvStoresVisited** - Which Convenience stores do you visit?

**Frequency** - How frequent do you visit convenience stores?

**Reasons** - For what reason/s do you visit a convenience store?

**Environment** - Would you classify your place of residence to be rural or urban?

**Nr_CityCenter** - Do you consider to be very near the city center? ( < 10 min walk)

In [230]:
data03.head(10) ## Print file's first 10 entries

Unnamed: 0,Age,ConvStoresVisited,Frequency,Reasons,ConvStore_nearto,HomeProv_add,HomeCity_add,WorkProv_add,WorkCity_add,ConvStore_TD,...,Supermarket_td,Fastfood_td,CoffeeShop_td,StreetVendor_td,PublicMarket_td,Pharmacy_td,Atm_td,BillPayCounter_td,Environment,Nr_CityCenter
0,33,"7-Eleven, Family Mart, Ministop, Alfamart, Lawson",Fair (Twice a week to more than twice a month.),"Purchase of Non-food/drink items, Purchase of ...","Place of Work, Current Address",Metro Manila,Mandaluyong,Metro Manila,Mandaluyong,Less than 5 mins,...,4 - 15 to 25 mins,3 - 10 to 15 mins,4 - 15 to 25 mins,1 - Less than 5 mins,5 - Greater than 25 mins,4 - 15 to 25 mins,4 - 15 to 25 mins,5 - Greater than 25 mins,Urban,Yes
1,26,7-Eleven,Fair (Twice a week to more than twice a month.),Purchase of food and drinks (consumed elsewhere),Place of Work,Metro Manila,Caloocan City,Metro Manila,Mandaluyong,5 to 10 mins,...,5 - Greater than 25 mins,5 - Greater than 25 mins,3 - 10 to 15 mins,3 - 10 to 15 mins,5 - Greater than 25 mins,4 - 15 to 25 mins,4 - 15 to 25 mins,4 - 15 to 25 mins,Urban,No
2,25,"7-Eleven, Ministop, Lawson",Rarely (On occasion only and less than twice a...,"Purchase of Non-food/drink items, Purchase of ...","Place of Work, Current Address",Metro Manila,Quezon City,Metro Manila,Quezon City,Less than 5 mins,...,3 - 10 to 15 mins,3 - 10 to 15 mins,4 - 15 to 25 mins,1 - Less than 5 mins,2 - 5 to 10 mins,3 - 10 to 15 mins,3 - 10 to 15 mins,3 - 10 to 15 mins,Urban,Yes
3,24,"7-Eleven, Family Mart, Ministop, Alfamart",Fair (Twice a week to more than twice a month.),"Purchase of Non-food/drink items, Purchase of ...",Current Address,Metro Manila,"Diliman, QC",Metro Manila,"Diliman, QC",5 to 10 mins,...,4 - 15 to 25 mins,4 - 15 to 25 mins,3 - 10 to 15 mins,1 - Less than 5 mins,2 - 5 to 10 mins,2 - 5 to 10 mins,3 - 10 to 15 mins,5 - Greater than 25 mins,Urban,Yes
4,27,"7-Eleven, Family Mart, Ministop",Fair (Twice a week to more than twice a month.),"Purchase of Non-food/drink items, Purchase of ...","Place of Work, Current Address",Metro Manila,Quezon City,Metro Manila,Quezon City,5 to 10 mins,...,4 - 15 to 25 mins,4 - 15 to 25 mins,4 - 15 to 25 mins,2 - 5 to 10 mins,5 - Greater than 25 mins,5 - Greater than 25 mins,4 - 15 to 25 mins,4 - 15 to 25 mins,Urban,Yes
5,24,"7-Eleven, Family Mart, Ministop",Fair (Twice a week to more than twice a month.),Purchase of food and drinks (consumed elsewher...,Current Address,Pampanga,Angeles City,Pampanga,Clark,5 to 10 mins,...,2 - 5 to 10 mins,2 - 5 to 10 mins,2 - 5 to 10 mins,2 - 5 to 10 mins,3 - 10 to 15 mins,2 - 5 to 10 mins,2 - 5 to 10 mins,3 - 10 to 15 mins,Urban,Yes
6,35,"7-Eleven, Family Mart, Ministop",Seldom (Twice a month or less),Purchase of food and drinks (consumed elsewher...,Place of Work,Bulacan,Guiguinto,Metro Manila,Quezon City,10 to 15 mins,...,2 - 5 to 10 mins,2 - 5 to 10 mins,3 - 10 to 15 mins,3 - 10 to 15 mins,5 - Greater than 25 mins,2 - 5 to 10 mins,2 - 5 to 10 mins,3 - 10 to 15 mins,Rural,No
7,27,"7-Eleven, Family Mart, Ministop, Lawson",Rarely (On occasion only and less than twice a...,"Purchase of Non-food/drink items, Purchase of ...",Place of Work,Benguet,Baguio City,Metro Manila,Mandaluyong,10 to 15 mins,...,4 - 15 to 25 mins,4 - 15 to 25 mins,5 - Greater than 25 mins,5 - Greater than 25 mins,5 - Greater than 25 mins,5 - Greater than 25 mins,5 - Greater than 25 mins,5 - Greater than 25 mins,Urban,Yes
8,30,"7-Eleven, Ministop, Alfamart",Seldom (Twice a month or less),"Purchase of Non-food/drink items, Purchase of ...",Current Address,Metro Manila,Quezon City,Metro Manila,Quezon City,Less than 5 mins,...,4 - 15 to 25 mins,4 - 15 to 25 mins,4 - 15 to 25 mins,1 - Less than 5 mins,1 - Less than 5 mins,1 - Less than 5 mins,4 - 15 to 25 mins,5 - Greater than 25 mins,Urban,Yes
9,26,"7-Eleven, Ministop",Often (Every day to more than twice a week),Purchase of food and drinks (consumed elsewhere),Place of Work,Metro Manila,San Juan,Metro Manila,Pasig City,Less than 5 mins,...,3 - 10 to 15 mins,3 - 10 to 15 mins,4 - 15 to 25 mins,2 - 5 to 10 mins,4 - 15 to 25 mins,3 - 10 to 15 mins,2 - 5 to 10 mins,2 - 5 to 10 mins,Urban,Yes


In [231]:
data03.describe() ##EDA Check on the only numerical variable.

| | Age |
|---|---|
| count | 54.0 |
| mean | 26.777778 |
| std | 5.74511 |
| min | 18.0 |
| 25% | 24.0 |
| 50% | 25.0 |
| 75% | 28.0 |
| max | 56.0 |


While exploring the data in Excel, I found that manual entry of the City of Residence/Work in the survey led to multiple formats for the same city. We shall standardize these answers.

In [237]:
print(data03['HomeCity_add'].unique()) #Displays unique entries in the Home City Address column.

print(data03['WorkCity_add'].unique()) #Displays unique entries in the Work City Address column.

['Mandaluyong' 'Caloocan' 'Quezon' 'Angeles' 'Guiguinto' 'Baguio'
 'San Juan' 'Sto. Domingo' 'Taytay' 'Taguig' 'Daet' 'Lipa' 'Bacoor'
 'Las Pinas' 'Manila' 'Cainta' 'Quezon City' 'Imus' 'Pasig' 'Bay' 'Teresa'
 'Marilao' 'Pulilan' 'Hermosa' 'Malabon' 'Pasay' 'Muntinlupa']
['Mandaluyong' 'Mandaluyong ' 'Quezon City' 'Diliman, QC' 'Clark'
 'Pasig City' nan 'Mandaluyong city' 'Makati City' 'Quezon' 'Makati'
 'Daet' 'General Trias' 'San Juan City' 'Taguig' 'Baguio City'
 'QUEZON CITY' 'Quezon city' 'Pasig city' 'Taguig City' 'Muntinlupa']


**Finding: We can see that the same city is entered in different formats (for example 'Quezon City', 'QUEZON CITY', 'Quezon city', and 'Diliman, QC').**

In [248]:
data03['HomeCity_add'] = data03['HomeCity_add'].str.replace(' City','')
data03['HomeCity_add'] = data03['HomeCity_add'].str.replace('qc','Quezon')
data03['HomeCity_add'] = data03['HomeCity_add'].str.replace('Diliman, QC','Quezon')
data03['HomeCity_add'] = data03['HomeCity_add'].str.replace('Sampaloc, Manila','Manila')
data03['HomeCity_add'] = data03['HomeCity_add'].str.replace('Hermosa ','Hermosa')
##Brute force replacement of cities in a different format was done due to the small size of the data set (54 entries)
data03['HomeCity_add'] = data03['HomeCity_add'].str.title() ##Convert All cities to title format

data03['WorkCity_add'] = data03['WorkCity_add'].str.lower()
data03['WorkCity_add'] = data03['WorkCity_add'].str.replace(' city','')
data03['WorkCity_add'] = data03['WorkCity_add'].str.replace('mandaluyong ','mandaluyong')
data03['WorkCity_add'] = data03['WorkCity_add'].str.replace('diliman, qc','quezon')
data03['WorkCity_add'] = data03['WorkCity_add'].fillna('Unemployed')
data03['WorkCity_add'] = data03['WorkCity_add'].str.title()

data03[['HomeCity_add','WorkCity_add']]

| | HomeCity_add | WorkCity_add |
|---|---|---|
| 0 | Mandaluyong | Mandaluyong |
| 1 | Caloocan | Mandaluyong |
| 2 | Quezon | Quezon |
| 3 | Quezon | Quezon |
| 4 | Quezon | Quezon |
| 5 | Angeles | Clark |
| 6 | Guiguinto | Quezon |
| 7 | Baguio | Mandaluyong |
| 8 | Quezon | Quezon |
| 9 | San Juan | Pasig |
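
The chained replacements above work for a small data set, but the same steps could be collected into one helper that is applied to both city columns. This is a sketch under stated assumptions, not the method used in this notebook: `normalize_city` and `ALIASES` are hypothetical names, and the alias table covers only variants seen in the `unique()` output.

```python
import pandas as pd

# Hypothetical alias table; extend it with whatever variants unique() reveals
ALIASES = {"diliman, qc": "quezon", "qc": "quezon"}

def normalize_city(cities: pd.Series, missing_label: str = "Unemployed") -> pd.Series:
    """Trim, lower-case, drop a trailing ' city', map aliases, then title-case."""
    cleaned = cities.str.strip().str.lower()
    cleaned = cleaned.str.replace(r"\s+city$", "", regex=True)
    cleaned = cleaned.replace(ALIASES)          # whole-value replacement, not substring
    return cleaned.fillna(missing_label).str.title()

work = pd.Series(["Mandaluyong ", "QUEZON CITY", "Diliman, QC", None, "Pasig city"])
print(normalize_city(work).tolist())
```

Lower-casing first means 'Quezon city' and 'QUEZON CITY' are handled by the same rule, and `str.strip()` removes trailing spaces without a dedicated replacement per city. For the home-city column, where a missing value does not mean unemployed, a different `missing_label` would be appropriate.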


In [247]:
data03.to_csv('SP201_preprocessed.csv', index = False) ##write to file
preview = pd.read_csv('SP201_preprocessed.csv')
preview ###preview the preprocessed data

Unnamed: 0,Age,ConvStoresVisited,Frequency,Reasons,ConvStore_nearto,HomeProv_add,HomeCity_add,WorkProv_add,WorkCity_add,ConvStore_TD,...,Supermarket_td,Fastfood_td,CoffeeShop_td,StreetVendor_td,PublicMarket_td,Pharmacy_td,Atm_td,BillPayCounter_td,Environment,Nr_CityCenter
0,33,"7-Eleven, Family Mart, Ministop, Alfamart, Lawson",Fair (Twice a week to more than twice a month.),"Purchase of Non-food/drink items, Purchase of ...","Place of Work, Current Address",Metro Manila,Mandaluyong,Metro Manila,Mandaluyong,Less than 5 mins,...,4 - 15 to 25 mins,3 - 10 to 15 mins,4 - 15 to 25 mins,1 - Less than 5 mins,5 - Greater than 25 mins,4 - 15 to 25 mins,4 - 15 to 25 mins,5 - Greater than 25 mins,Urban,Yes
1,26,7-Eleven,Fair (Twice a week to more than twice a month.),Purchase of food and drinks (consumed elsewhere),Place of Work,Metro Manila,Caloocan,Metro Manila,Mandaluyong,5 to 10 mins,...,5 - Greater than 25 mins,5 - Greater than 25 mins,3 - 10 to 15 mins,3 - 10 to 15 mins,5 - Greater than 25 mins,4 - 15 to 25 mins,4 - 15 to 25 mins,4 - 15 to 25 mins,Urban,No
2,25,"7-Eleven, Ministop, Lawson",Rarely (On occasion only and less than twice a...,"Purchase of Non-food/drink items, Purchase of ...","Place of Work, Current Address",Metro Manila,Quezon,Metro Manila,Quezon,Less than 5 mins,...,3 - 10 to 15 mins,3 - 10 to 15 mins,4 - 15 to 25 mins,1 - Less than 5 mins,2 - 5 to 10 mins,3 - 10 to 15 mins,3 - 10 to 15 mins,3 - 10 to 15 mins,Urban,Yes
3,24,"7-Eleven, Family Mart, Ministop, Alfamart",Fair (Twice a week to more than twice a month.),"Purchase of Non-food/drink items, Purchase of ...",Current Address,Metro Manila,Quezon,Metro Manila,Quezon,5 to 10 mins,...,4 - 15 to 25 mins,4 - 15 to 25 mins,3 - 10 to 15 mins,1 - Less than 5 mins,2 - 5 to 10 mins,2 - 5 to 10 mins,3 - 10 to 15 mins,5 - Greater than 25 mins,Urban,Yes
4,27,"7-Eleven, Family Mart, Ministop",Fair (Twice a week to more than twice a month.),"Purchase of Non-food/drink items, Purchase of ...","Place of Work, Current Address",Metro Manila,Quezon,Metro Manila,Quezon,5 to 10 mins,...,4 - 15 to 25 mins,4 - 15 to 25 mins,4 - 15 to 25 mins,2 - 5 to 10 mins,5 - Greater than 25 mins,5 - Greater than 25 mins,4 - 15 to 25 mins,4 - 15 to 25 mins,Urban,Yes
5,24,"7-Eleven, Family Mart, Ministop",Fair (Twice a week to more than twice a month.),Purchase of food and drinks (consumed elsewher...,Current Address,Pampanga,Angeles,Pampanga,Clark,5 to 10 mins,...,2 - 5 to 10 mins,2 - 5 to 10 mins,2 - 5 to 10 mins,2 - 5 to 10 mins,3 - 10 to 15 mins,2 - 5 to 10 mins,2 - 5 to 10 mins,3 - 10 to 15 mins,Urban,Yes
6,35,"7-Eleven, Family Mart, Ministop",Seldom (Twice a month or less),Purchase of food and drinks (consumed elsewher...,Place of Work,Bulacan,Guiguinto,Metro Manila,Quezon,10 to 15 mins,...,2 - 5 to 10 mins,2 - 5 to 10 mins,3 - 10 to 15 mins,3 - 10 to 15 mins,5 - Greater than 25 mins,2 - 5 to 10 mins,2 - 5 to 10 mins,3 - 10 to 15 mins,Rural,No
7,27,"7-Eleven, Family Mart, Ministop, Lawson",Rarely (On occasion only and less than twice a...,"Purchase of Non-food/drink items, Purchase of ...",Place of Work,Benguet,Baguio,Metro Manila,Mandaluyong,10 to 15 mins,...,4 - 15 to 25 mins,4 - 15 to 25 mins,5 - Greater than 25 mins,5 - Greater than 25 mins,5 - Greater than 25 mins,5 - Greater than 25 mins,5 - Greater than 25 mins,5 - Greater than 25 mins,Urban,Yes
8,30,"7-Eleven, Ministop, Alfamart",Seldom (Twice a month or less),"Purchase of Non-food/drink items, Purchase of ...",Current Address,Metro Manila,Quezon,Metro Manila,Quezon,Less than 5 mins,...,4 - 15 to 25 mins,4 - 15 to 25 mins,4 - 15 to 25 mins,1 - Less than 5 mins,1 - Less than 5 mins,1 - Less than 5 mins,4 - 15 to 25 mins,5 - Greater than 25 mins,Urban,Yes
9,26,"7-Eleven, Ministop",Often (Every day to more than twice a week),Purchase of food and drinks (consumed elsewhere),Place of Work,Metro Manila,San Juan,Metro Manila,Pasig,Less than 5 mins,...,3 - 10 to 15 mins,3 - 10 to 15 mins,4 - 15 to 25 mins,2 - 5 to 10 mins,4 - 15 to 25 mins,3 - 10 to 15 mins,2 - 5 to 10 mins,2 - 5 to 10 mins,Urban,Yes


**Note: Other data can be preprocessed via feature engineering, which is a Week 5 lesson and will not be applied here in Week 4.**