<center style="font-size:48px;">Clean Up</center>
<br>
Steps needed to clean the data and create useful features for our analysis.

# Importing Libraries and Data

## Libraries

In [11]:
# Data Science
import pandas as pd 

## Data

In [12]:
cars = pd.read_csv('../Data/car-assignments.csv')
cc = pd.read_csv('../Data/cc_data.csv', encoding='cp1252', parse_dates=['timestamp'])
gps = pd.read_csv('../Data/gps.csv', parse_dates=['Timestamp'])
loyalty = pd.read_csv('../Data/loyalty_data.csv', encoding='cp1252', parse_dates=['timestamp'])

# Cleaning the Data

## Location Data

Merge the gps and car assignment dataframes with a left outer join, using gps as the left dataframe. This keeps the gps points for the truck drivers, which have no matching car assignment.

In [13]:
locations = gps.merge(cars, left_on='id', right_on='CarID', how='left')
locations.drop(columns='CarID', inplace=True)
locations.head()

Unnamed: 0,Timestamp,id,lat,long,LastName,FirstName,CurrentEmploymentType,CurrentEmploymentTitle
0,2014-01-06 06:28:01,35,36.076225,24.874689,Vasco-Pais,Willem,Executive,Environmental Safety Advisor
1,2014-01-06 06:28:01,35,36.07622,24.874596,Vasco-Pais,Willem,Executive,Environmental Safety Advisor
2,2014-01-06 06:28:03,35,36.076211,24.874443,Vasco-Pais,Willem,Executive,Environmental Safety Advisor
3,2014-01-06 06:28:05,35,36.076217,24.874253,Vasco-Pais,Willem,Executive,Environmental Safety Advisor
4,2014-01-06 06:28:06,35,36.076214,24.874167,Vasco-Pais,Willem,Executive,Environmental Safety Advisor
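
One way to confirm that the left join kept the unassigned vehicles is the `indicator` flag of `merge`. The following is a minimal sketch on toy rows; `gps_toy`, `cars_toy`, and their values are made up for illustration and are not taken from the data files.

```python
import pandas as pd

# Toy stand-ins for gps and cars; values are made up for illustration
gps_toy = pd.DataFrame({'id': [35, 35, 101], 'lat': [36.07, 36.08, 36.05]})
cars_toy = pd.DataFrame({'CarID': [35], 'LastName': ['Vasco-Pais']})

# indicator=True adds a _merge column: 'both' or 'left_only'
check = gps_toy.merge(cars_toy, left_on='id', right_on='CarID',
                      how='left', indicator=True)

# ids present in the gps data with no car assignment
unassigned_ids = sorted(check.loc[check['_merge'] == 'left_only', 'id'].unique())
print(unassigned_ids)
```

Rows marked `left_only` are the ones whose name columns come back null after the join.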


For the truck drivers' gps data the names are null. Let's fill them with the name "Truck Driver_X", where "X" is the car id, and set the employment type and title to match.

In [14]:
# Rows with no matching car assignment belong to the trucks; flag them before filling
is_truck = locations['LastName'].isna()
locations['CurrentEmploymentType'] = locations['CurrentEmploymentType'].fillna('Facilities')
locations['CurrentEmploymentTitle'] = locations['CurrentEmploymentTitle'].fillna('Truck Driver')
locations['FirstName'] = locations['FirstName'].fillna('Truck')
# Build the last name from the car id, e.g. 'Driver_101'
locations.loc[is_truck, 'LastName'] = 'Driver_' + locations.loc[is_truck, 'id'].astype(str)

In [15]:
# Let's separate the time units into their own features
timeUnit = ['day', 'hour', 'minute', 'second']
for unit in timeUnit:
    # The .dt accessor exposes each unit as an attribute of the same name
    locations[unit] = getattr(locations['Timestamp'].dt, unit)

Let's create a feature denoting whether a day falls on the weekend.

In [16]:
def isWeekend(x):
    # weekday() is 0 for Monday through 6 for Sunday
    return x.weekday() >= 5

locations['Weekend'] = locations['Timestamp'].apply(isWeekend)
locations.sample(10)

Unnamed: 0,Timestamp,id,lat,long,LastName,FirstName,CurrentEmploymentType,CurrentEmploymentTitle,day,hour,minute,second,Weekend
142483,2014-01-08 13:37:24,6,36.048118,24.87957,Bergen,Linnea,Information Technology,IT Group Manager,8,13,37,24,False
313690,2014-01-12 13:10:42,27,36.066543,24.857024,Orilla,Kare,Engineering,Drill Technician,12,13,10,42,True
400129,2014-01-14 08:06:11,28,36.059385,24.871692,Orilla,Elsa,Engineering,Drill Technician,14,8,6,11,False
438694,2014-01-14 19:15:42,28,36.056831,24.871544,Orilla,Elsa,Engineering,Drill Technician,14,19,15,42,False
550732,2014-01-16 17:19:34,28,36.055166,24.875181,Orilla,Elsa,Engineering,Drill Technician,16,17,19,34,False
377508,2014-01-13 18:02:43,10,36.07235,24.867864,Campo-Corrente,Ada,Executive,SVP/CIO,13,18,2,43,False
162842,2014-01-08 20:19:46,22,36.052049,24.890332,Herrero,Kanon,Security,Badging Office,8,20,19,46,False
644973,2014-01-17 19:30:17,17,36.054665,24.891036,Flecha,Sven,Information Technology,IT Technician,17,19,30,17,False
521964,2014-01-16 08:15:09,24,36.048118,24.879571,Mies,Minke,Security,Perimeter Control,16,8,15,9,False
55395,2014-01-07 07:16:30,32,36.071475,24.87144,Strum,Orhan,Executive,SVP/COO,7,7,16,30,False


This is now a usable set of data for our analysis. We can subset by name, job title, job type, car, day, time of day, and weekend. We can also resample to get one datapoint per 5, 10, 15, etc. minute interval for our desired subset.
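
The resampling mentioned above could look like the sketch below, which keeps the first reading in each 5-minute bin for each car. It uses `pd.Grouper` on a toy trace; the frame `toy` and its values are made up for illustration, and keeping the first reading (rather than, say, the mean position) is an assumption.

```python
import pandas as pd

# Toy GPS trace for one hypothetical car; values are made up
toy = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2014-01-06 06:28:01', '2014-01-06 06:29:30',
                                 '2014-01-06 06:34:10', '2014-01-06 06:41:00']),
    'id': [35, 35, 35, 35],
    'lat': [36.0762, 36.0760, 36.0755, 36.0750],
    'long': [24.8746, 24.8744, 24.8740, 24.8735],
})

# Group by car and 5-minute bin, then keep the first reading in each bin
resampled = (toy.groupby(['id', pd.Grouper(key='Timestamp', freq='5min')])
                .first()
                .dropna(subset=['lat', 'long'])
                .reset_index())
print(resampled)
```

After the grouping, `Timestamp` holds the start of each bin rather than the original reading time. Changing `freq` to `'10min'` or `'15min'` gives the coarser intervals.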

## Purchase Data

Create an `Is_Loyalty` variable that denotes whether a purchase is a loyalty card purchase.

In [17]:
cc['Is_Loyalty'] = False
loyalty['Is_Loyalty'] = True

Similar to the location data, let's separate the time units and create a weekend variable.

In [18]:
# Let's separate the time units into their own features
timeUnit = ['day', 'hour', 'minute', 'second']
for unit in timeUnit:
    # The .dt accessor exposes each unit as an attribute of the same name
    cc[unit] = getattr(cc['timestamp'].dt, unit)
    loyalty[unit] = getattr(loyalty['timestamp'].dt, unit)
# Denote weekends
cc['Weekend'] = cc['timestamp'].apply(isWeekend)
loyalty['Weekend'] = loyalty['timestamp'].apply(isWeekend)

We can find which transactions are duplicates (at least most of them) by matching on a person's name, purchase location, purchase day, and the cents in the price. The loyalty purchase price is less prone to outliers, so we will use that as the final purchase price. The timestamp in the credit card dataframe has more information on the time of purchase, so we will use that.

In [19]:
# Separate the cents from the purchase price
cc['cents'] = ((cc['price'] % 1) * 100).round()
loyalty['cents'] = ((loyalty['price'] % 1) * 100).round()
# Loop to match the duplicate purchases. Overwrite the credit card dataframe with the loyalty price and loyalty card flag values
for index, row in cc.iterrows():
    # Loyalty rows with the same person, location, day, and cents
    temp = loyalty[(loyalty.FirstName == row['FirstName'])
                   & (loyalty.LastName == row['LastName'])
                   & (loyalty.location == row['location'])
                   & (loyalty.day == row['day'])
                   & (loyalty.cents == row['cents'])]
    if len(temp) >= 1:
        cc.loc[index, 'Is_Loyalty'] = True
        # More than one loyalty row can match, so take the first price
        cc.loc[index, 'price'] = temp.price.values[0]

# Merge the two dataframes by appending one to the other and dropping duplicates
buys = pd.concat([cc, loyalty]).drop_duplicates(['FirstName', 'LastName', 'location', 'day', 'cents'], keep='first')
buys.drop(columns='cents', inplace=True)
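
The row-by-row matching above can also be expressed as a merge on the same five keys, which is usually faster on larger frames. This is a sketch of that alternative, not the method used in this notebook; `cc_toy`, `loyalty_toy`, `lookup`, and the `loyalty_price` column are hypothetical names, and the rows are made up.

```python
import pandas as pd

keys = ['FirstName', 'LastName', 'location', 'day', 'cents']

# Toy rows; names, places, and prices are made up for illustration
cc_toy = pd.DataFrame({
    'FirstName': ['Jane', 'Jane'], 'LastName': ['Doe', 'Doe'],
    'location': ['Cafe A', 'Shop B'], 'day': [6, 6],
    'price': [12.50, 40.25], 'cents': [50.0, 25.0],
    'Is_Loyalty': [False, False],
})
loyalty_toy = pd.DataFrame({
    'FirstName': ['Jane'], 'LastName': ['Doe'],
    'location': ['Cafe A'], 'day': [6],
    'price': [32.50], 'cents': [50.0],
})

# One loyalty row per key so the merge cannot multiply credit card rows
lookup = (loyalty_toy.drop_duplicates(keys)
                     .rename(columns={'price': 'loyalty_price'})[keys + ['loyalty_price']])
matched = cc_toy.merge(lookup, on=keys, how='left')

# Where a loyalty match exists, flag the row and take the loyalty price
has_match = matched['loyalty_price'].notna()
matched['Is_Loyalty'] = has_match
matched['price'] = matched['loyalty_price'].where(has_match, matched['price'])
matched = matched.drop(columns='loyalty_price')
print(matched[['location', 'price', 'Is_Loyalty']])
```

Here the first toy purchase matches on name, location, day, and cents, so it takes the loyalty price, while the second is left unchanged.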

Let's also add the personal information (job title and type). Use a left outer join so we don't lose data for anybody who isn't in the cars dataframe.

In [20]:
buys = buys.merge(cars, on=['LastName', 'FirstName'], how='left')

Replace the null values for job type and title with 'Other'. Also, set a missing CarID to 100 for truck drivers and to 0 for everyone else.

In [21]:
buys.fillna({'CurrentEmploymentType' : 'Other', 'CurrentEmploymentTitle':'Other'}, inplace= True)
def CarNulls(x):
    if pd.isna(x['CarID']):
        if x['CurrentEmploymentTitle'] == 'Truck Driver':
            return 100
        else:
            return 0
    else:
        return x['CarID']

buys['CarID'] = buys.apply(CarNulls, axis=1)
buys.sample(10)


Unnamed: 0,timestamp,location,price,FirstName,LastName,Is_Loyalty,day,hour,minute,second,Weekend,CarID,CurrentEmploymentType,CurrentEmploymentTitle
231,2014-01-07 20:10:00,Albert's Fine Clothing,78.91,Varro,Awelon,True,7,20,10,0,False,0.0,Other,Other
699,2014-01-12 13:03:00,Ouzeri Elian,38.55,Ada,Campo-Corrente,True,12,13,3,0,True,10.0,Executive,SVP/CIO
879,2014-01-14 07:40:00,Brew've Been Served,5.87,Brand,Tempestad,True,14,7,40,0,False,33.0,Engineering,Drill Technician
1001,2014-01-14 21:31:00,Guy's Gyros,26.14,Isia,Vann,True,14,21,31,0,False,16.0,Security,Perimeter Control
641,2014-01-11 13:35:00,Gelatogalore,17.31,Birgitta,Frente,True,11,13,35,0,True,18.0,Engineering,Geologist
1062,2014-01-15 13:36:00,Kalami Kafenion,37.15,Linda,Lagos,True,15,13,36,0,False,0.0,Other,Other
1439,2014-01-18 20:15:00,Katerina’s Café,37.21,Inga,Ferro,True,18,20,15,0,True,13.0,Security,Site Control
1088,2014-01-15 14:12:00,Kronos Pipe and Irrigation,2564.0,Irene,Nant,True,15,14,12,0,False,100.0,Facilities,Truck Driver
821,2014-01-13 13:40:00,Katerina’s Café,8.29,Lidelse,Dedos,True,13,13,40,0,False,14.0,Engineering,Engineering Group Manager
718,2014-01-12 15:10:00,Shoppers' Delight,135.78,Marin,Onda,True,12,15,10,0,True,26.0,Engineering,Drill Site Manager


# Save the transformed data

In [22]:
locations.to_csv('../CheckPoints/Locations_Clean.csv')
buys.to_csv('../CheckPoints/Buys_Clean.csv')
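
`to_csv` writes the index by default, so reading these checkpoints back with a plain `pd.read_csv` produces an extra `Unnamed: 0` column and string timestamps. A minimal sketch of a cleaner round trip, using an in-memory buffer and a made-up one-row frame in place of the real files:

```python
import io
import pandas as pd

# Toy frame standing in for locations; values are made up
toy = pd.DataFrame({'Timestamp': pd.to_datetime(['2014-01-06 06:28:01']),
                    'id': [35]})

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as an unnamed first column
buffer.seek(0)

# index_col=0 restores the index; parse_dates restores the datetime dtype
restored = pd.read_csv(buffer, index_col=0, parse_dates=['Timestamp'])
print(restored.dtypes)
```

The same `index_col` and `parse_dates` arguments should apply when loading the checkpoint files in a later notebook, with `timestamp` in lowercase for the purchase data.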

<div>
    <span style="width:700px;display:inline-block;" align="left">
        <a href="./FurtherEDA.ipynb">&lt;&lt; Further Exploratory Data Analysis</a>
    </span>
    <span style="width:700px;display:inline-block;" align="right">
        <a href="./Modeling.ipynb">Modeling &gt;&gt;</a>
    </span>
</div>
<div>
    <center>
        <span style="width:250px;display:inline-block">
            <a href="../Master.ipynb">Master Notebook</a>
        </span>
        <span style="width:250px;display:inline-block">
            <a href="../README.md">Table of Contents</a>
        </span>
    </center>
</div>