<center style="font-size:48px;">Clean Up</center>
<br>
Steps needed to clean the data and create useful features for our analysis.

# Importing Libraries and Data

## Libraries

In [2]:
# Data Science
import pandas as pd 

## Data

In [3]:
cars = pd.read_csv('../Data/car-assignments.csv')
cc = pd.read_csv('../Data/cc_data.csv', encoding='cp1252', parse_dates=['timestamp'])
gps = pd.read_csv('../Data/gps.csv', parse_dates=['Timestamp'])
loyalty = pd.read_csv('../Data/loyalty_data.csv', encoding='cp1252', parse_dates=['timestamp'])

# Cleaning the Data

## Location Data

Merge the gps and the car assignment dataframes. Do a left outer join with gps as the left dataframe. This keeps the gps points for the truck drivers, whose ids do not match any CarID in the car assignments.

In [4]:
locations = gps.merge(cars, left_on='id', right_on='CarID', how='left')
locations.drop(columns='CarID', inplace=True)
locations.head()

Unnamed: 0,Timestamp,id,lat,long,LastName,FirstName,CurrentEmploymentType,CurrentEmploymentTitle
0,2014-01-06 06:28:01,35,36.076225,24.874689,Vasco-Pais,Willem,Executive,Environmental Safety Advisor
1,2014-01-06 06:28:01,35,36.07622,24.874596,Vasco-Pais,Willem,Executive,Environmental Safety Advisor
2,2014-01-06 06:28:03,35,36.076211,24.874443,Vasco-Pais,Willem,Executive,Environmental Safety Advisor
3,2014-01-06 06:28:05,35,36.076217,24.874253,Vasco-Pais,Willem,Executive,Environmental Safety Advisor
4,2014-01-06 06:28:06,35,36.076214,24.874167,Vasco-Pais,Willem,Executive,Environmental Safety Advisor
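
As a sanity check on the join, here is a minimal sketch on made-up rows. The frames `gps_demo` and `cars_demo` are hypothetical stand-ins for the real tables. It uses the `indicator=True` option of `merge` to flag the gps ids that found no car assignment.

```python
import pandas as pd

# Toy stand-ins for the gps and car assignment tables (values are made up)
gps_demo = pd.DataFrame({'id': [35, 35, 104], 'lat': [36.07, 36.08, 36.05]})
cars_demo = pd.DataFrame({'CarID': [35], 'LastName': ['Vasco-Pais']})

# indicator=True adds a '_merge' column saying where each row came from
merged = gps_demo.merge(cars_demo, left_on='id', right_on='CarID',
                        how='left', indicator=True)

# Rows flagged 'left_only' had no matching CarID (the truck ids in the real data)
unmatched_ids = merged.loc[merged['_merge'] == 'left_only', 'id'].unique()
print(len(merged), list(unmatched_ids))
```

A left join never drops rows from the left frame, so the merged frame should have at least as many rows as the gps data. It has exactly as many when each CarID appears only once.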


The names are null in the gps data for the truck drivers. Let's fill them in with the name "Truck Driver_X", where "X" is the CarID number.

In [5]:
# Rows with no car assignment have null personal fields after the left join
is_truck = locations['LastName'].isna()
locations['CurrentEmploymentType'] = locations['CurrentEmploymentType'].fillna('Facilities')
locations['CurrentEmploymentTitle'] = locations['CurrentEmploymentTitle'].fillna('Truck Driver')
locations['FirstName'] = locations['FirstName'].fillna('Truck')
# Build 'Driver_X' only for the rows that were null, so real last names are never touched
locations.loc[is_truck, 'LastName'] = 'Driver_' + locations.loc[is_truck, 'id'].astype(str)

In [6]:
# Separate the timestamp's time units into their own features
timeUnit = ['day', 'hour', 'minute', 'second']
for unit in timeUnit:
    # The .dt accessor exposes each unit as a vectorized attribute
    locations[unit] = getattr(locations['Timestamp'].dt, unit)

Let's create a feature denoting whether a day falls on the weekend.

In [7]:
def isWeekend(x):
    # weekday() is 0 for Monday through 6 for Sunday
    return x.weekday() >= 5

locations['Weekend'] = locations['Timestamp'].apply(isWeekend)
locations.sample(10)

Unnamed: 0,Timestamp,id,lat,long,LastName,FirstName,CurrentEmploymentType,CurrentEmploymentTitle,day,hour,minute,second,Weekend
265764,2014-01-10 17:25:59,15,36.05012,24.883604,Bodrogi,Loreto,Security,Site Control,10,17,25,59,False
525671,2014-01-16 10:57:45,104,36.050178,24.866679,Driver_104,Truck,Facilities,Truck Driver,16,10,57,45,False
251404,2014-01-10 12:24:26,30,36.074105,24.865698,Resumir,Felix,Security,Security Group Manager,10,12,24,26,False
420727,2014-01-14 13:43:04,21,36.055753,24.875443,Osvaldo,Hennie,Security,Perimeter Control,14,13,43,4,False
522999,2014-01-16 08:21:40,13,36.050326,24.894857,Ferro,Inga,Security,Site Control,16,8,21,40,False
283625,2014-01-10 23:31:10,14,36.080671,24.86168,Dedos,Lidelse,Engineering,Engineering Group Manager,10,23,31,10,False
357228,2014-01-13 12:40:22,13,36.062071,24.857416,Ferro,Inga,Security,Site Control,13,12,40,22,False
129864,2014-01-08 11:57:07,20,36.048632,24.879573,Fusil,Stenig,Security,Building Control,8,11,57,7,False
365124,2014-01-13 13:58:52,17,36.053144,24.876392,Flecha,Sven,Information Technology,IT Technician,13,13,58,52,False
679268,2014-01-19 16:05:28,35,36.078963,24.874092,Vasco-Pais,Willem,Executive,Environmental Safety Advisor,19,16,5,28,True
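
The weekend flag can also be computed without a Python-level function call per row. The sketch below is an illustrative alternative, not the notebook's own method. It uses `Series.dt.dayofweek` on two timestamps taken from the sample above; the frame name `demo` is made up.

```python
import pandas as pd

# Two timestamps from the sample output: a Friday and a Sunday
demo = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2014-01-10 17:25:59', '2014-01-19 16:05:28'])
})

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend
demo['Weekend'] = demo['Timestamp'].dt.dayofweek >= 5
print(demo)
```

On a frame the size of the gps data, the vectorized form is usually much faster than a row-wise `apply`.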


This is now a usable set of data for our analysis. We can subset by name, job title, job type, car, day, time of day, and weekend. We can also resample to get one data point per 5, 10, 15, etc. minute interval for our desired subset.
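
A minimal sketch of that kind of downsampling, assuming we want the first reading per car in each 5-minute bin. The pings are made up and the names `pings` and `five_min` are hypothetical. It bins with `pd.Grouper` rather than `DataFrame.resample` so that the car id can be a grouping key alongside the time bin.

```python
import pandas as pd

# Made-up GPS pings for one car; column names follow the locations frame
pings = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2014-01-06 06:28:01', '2014-01-06 06:29:30',
                                 '2014-01-06 06:34:10', '2014-01-06 06:41:00']),
    'id': [35, 35, 35, 35],
    'lat': [36.0762, 36.0764, 36.0770, 36.0780],
    'long': [24.8746, 24.8742, 24.8730, 24.8720],
})

# One row per car per 5-minute bin, keeping the first ping seen in each bin;
# dropna() discards any bin that contained no pings
five_min = (pings.groupby(['id', pd.Grouper(key='Timestamp', freq='5min')])[['lat', 'long']]
                 .first()
                 .dropna()
                 .reset_index())
print(five_min)
```

Changing `freq` to `'10min'` or `'15min'` gives the coarser intervals, and `.first()` could be swapped for `.median()` or `.last()` depending on the analysis.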

To shrink the data size (so we can put it on GitHub), take the median of any duplicate location readings (same person and time).

In [8]:
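# Select the columns with a list; indexing a groupby with a bare tuple of keys is deprecated
locations[['lat', 'long']] = locations.groupby(['day', 'hour', 'minute', 'second', 'FirstName', 'LastName'])[['lat', 'long']].transform('median')
# drop_duplicates returns a new frame, so assign it back to actually shrink the data
locations = locations.drop_duplicates()
locations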



Unnamed: 0,Timestamp,id,lat,long,LastName,FirstName,CurrentEmploymentType,CurrentEmploymentTitle,day,hour,minute,second,Weekend
0,2014-01-06 06:28:01,35,36.076223,24.874643,Vasco-Pais,Willem,Executive,Environmental Safety Advisor,6,6,28,1,False
2,2014-01-06 06:28:03,35,36.076211,24.874443,Vasco-Pais,Willem,Executive,Environmental Safety Advisor,6,6,28,3,False
3,2014-01-06 06:28:05,35,36.076217,24.874253,Vasco-Pais,Willem,Executive,Environmental Safety Advisor,6,6,28,5,False
4,2014-01-06 06:28:06,35,36.076214,24.874167,Vasco-Pais,Willem,Executive,Environmental Safety Advisor,6,6,28,6,False
5,2014-01-06 06:28:07,35,36.076191,24.874056,Vasco-Pais,Willem,Executive,Environmental Safety Advisor,6,6,28,7,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
685164,2014-01-19 20:56:43,30,36.058110,24.902130,Resumir,Felix,Security,Security Group Manager,19,20,56,43,True
685165,2014-01-19 20:56:47,30,36.058258,24.901774,Resumir,Felix,Security,Security Group Manager,19,20,56,47,True
685166,2014-01-19 20:56:48,30,36.058296,24.901711,Resumir,Felix,Security,Security Group Manager,19,20,56,48,True
685167,2014-01-19 20:56:49,30,36.058304,24.901620,Resumir,Felix,Security,Security Group Manager,19,20,56,49,True
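
To make the median step concrete, here is a small sketch on the first three gps readings shown earlier, where two readings share the same car and timestamp. The frame names `dupes` and `deduped` are made up, and grouping by `id` and `Timestamp` is a simplification of the grouping keys used above.

```python
import pandas as pd

dupes = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2014-01-06 06:28:01', '2014-01-06 06:28:01',
                                 '2014-01-06 06:28:03']),
    'id': [35, 35, 35],
    'lat': [36.076225, 36.076220, 36.076211],
    'long': [24.874689, 24.874596, 24.874443],
})

# Replace each reading with the median of its (car, timestamp) group
dupes[['lat', 'long']] = dupes.groupby(['id', 'Timestamp'])[['lat', 'long']].transform('median')

# The two 06:28:01 rows are now identical, so drop_duplicates collapses them.
# It returns a new frame and leaves dupes itself unchanged.
deduped = dupes.drop_duplicates()
print(len(dupes), len(deduped))
```

The surviving 06:28:01 row holds the midpoint of the two original readings, which agrees with the first row of the output above.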


## Purchase Data

Create an Is_Loyalty variable that denotes whether a purchase is a loyalty card purchase.

In [9]:
cc['Is_Loyalty'] = False
loyalty['Is_Loyalty'] = True

As with the location data, let's separate the time units and create a weekend variable.

In [10]:
# Separate the timestamp's time units into their own features
timeUnit = ['day', 'hour', 'minute', 'second']
for df in (cc, loyalty):
    for unit in timeUnit:
        df[unit] = getattr(df['timestamp'].dt, unit)
    # Denote weekends
    df['Weekend'] = df['timestamp'].apply(isWeekend)

We can find which transactions are duplicates (at least most of them) by matching on a person's name, purchase location, purchase day, and the cents in the price. The loyalty purchase price is less prone to outliers, so we will use that as the final purchase price. The timestamp in the credit card dataframe has more information on the time of purchase, so we will use that.

In [11]:
# Separate the cents from the purchase price
cc['cents'] = ((cc['price'] % 1) * 100).round()
loyalty['cents'] = ((loyalty['price'] % 1) * 100).round()
# Loop to match the duplicate purchases. Overwrite the CREDIT CARD dataframe with the loyalty price and loyalty card flag values
for index, row in cc.iterrows():
    temp = loyalty[(loyalty.FirstName == row['FirstName'])
                   & (loyalty.LastName == row['LastName'])
                   & (loyalty.location == row['location'])
                   & (loyalty.day == row['day'])
                   & (loyalty.cents == row['cents'])]
    if len(temp) >= 1:
        cc.loc[index, 'Is_Loyalty'] = True
        # Use the first match so a scalar is assigned even when several loyalty rows match
        cc.loc[index, 'price'] = temp['price'].iloc[0]

# Merge the two dataframes by appending one to the other and dropping duplicates.
# cc comes first, so keep='first' retains the credit card row (and its timestamp).
buys = pd.concat([cc, loyalty]).drop_duplicates(['FirstName', 'LastName', 'location', 'day', 'cents'], keep='first')
buys.drop(columns='cents', inplace=True)
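
The row-by-row matching can also be written as a single join. The sketch below is one possible alternative under the same matching rule (name, location, day, and cents), shown on made-up rows. The names `cc_demo`, `loyalty_demo`, `loyalty_prices`, and `loyalty_price` are hypothetical.

```python
import pandas as pd

keys = ['FirstName', 'LastName', 'location', 'day', 'cents']

# Made-up purchases: the first credit card price is an outlier in the dollars only
cc_demo = pd.DataFrame({
    'FirstName': ['Kare', 'Felix'], 'LastName': ['Orilla', 'Balas'],
    'location': ['Kalami Kafenion', 'Hippokampos'], 'day': [8, 17],
    'price': [123.82, 12.86], 'Is_Loyalty': [False, False],
})
loyalty_demo = pd.DataFrame({
    'FirstName': ['Kare'], 'LastName': ['Orilla'],
    'location': ['Kalami Kafenion'], 'day': [8], 'price': [23.82],
})
for df in (cc_demo, loyalty_demo):
    df['cents'] = ((df['price'] % 1) * 100).round()

# Keep one loyalty row per key so the left join cannot multiply credit card rows
loyalty_prices = (loyalty_demo.drop_duplicates(keys)[keys + ['price']]
                              .rename(columns={'price': 'loyalty_price'}))
matched = cc_demo.merge(loyalty_prices, on=keys, how='left')

# Where a loyalty match exists, flag the row and take the loyalty price
has_match = matched['loyalty_price'].notna()
matched['Is_Loyalty'] = matched['Is_Loyalty'] | has_match
matched['price'] = matched['loyalty_price'].fillna(matched['price'])
matched = matched.drop(columns='loyalty_price')
print(matched[['FirstName', 'price', 'Is_Loyalty']])
```

A join avoids filtering the loyalty frame once per credit card row, which matters as the tables grow.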

Let's also add the personal information (job title and type). Use a left outer join so we don't lose data for anybody who isn't in the cars dataframe.

In [12]:
buys = buys.merge(cars, on=['LastName', 'FirstName'], how='left')

Replace the null values for job type and title with 'Other'. Also fill in the missing CarID values: 100 for truck drivers and 0 for everyone else.

In [13]:
buys.fillna({'CurrentEmploymentType': 'Other', 'CurrentEmploymentTitle': 'Other'}, inplace=True)
def CarNulls(x):
    # Keep an existing CarID as-is
    if not pd.isna(x['CarID']):
        return x['CarID']
    # Missing CarID: 100 for truck drivers, 0 for everyone else
    if x['CurrentEmploymentTitle'] == 'Truck Driver':
        return 100
    return 0

buys['CarID'] = buys.apply(CarNulls, axis=1)
buys.sample(10)


Unnamed: 0,timestamp,location,price,FirstName,LastName,Is_Loyalty,day,hour,minute,second,Weekend,CarID,CurrentEmploymentType,CurrentEmploymentTitle
943,2014-01-14 13:40:00,Ouzeri Elian,27.32,Varja,Lagos,True,14,13,40,0,False,23.0,Security,Badging Office
189,2014-01-07 13:33:00,Gelatogalore,34.04,Linnea,Bergen,True,7,13,33,0,False,6.0,Information Technology,IT Group Manager
335,2014-01-08 13:45:00,Kalami Kafenion,23.82,Kare,Orilla,True,8,13,45,0,False,27.0,Engineering,Drill Technician
1227,2014-01-16 19:40:00,Guy's Gyros,37.8,Edvard,Vann,True,16,19,40,0,False,34.0,Security,Perimeter Control
1339,2014-01-17 19:24:00,Katerina’s Café,22.18,Kare,Orilla,True,17,19,24,0,False,27.0,Engineering,Drill Technician
1188,2014-01-16 13:27:00,Katerina’s Café,39.74,Ruscella,Mies Haber,True,16,13,27,0,False,0.0,Other,Other
277,2014-01-08 07:57:00,Hallowed Grounds,11.63,Anda,Ribera,True,8,7,57,0,False,0.0,Other,Other
147,2014-01-07 07:58:00,Brew've Been Served,17.4,Dante,Coginian,True,7,7,58,0,False,0.0,Other,Other
1360,2014-01-17 20:35:00,Ouzeri Elian,26.4,Rachel,Pantanal,True,17,20,35,0,False,0.0,Other,Other
1371,2014-01-17 21:58:00,Hippokampos,12.86,Felix,Balas,True,17,21,58,0,False,3.0,Engineering,Engineer
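
The CarID rule can also be expressed without a row-wise function. This is a sketch on made-up rows, with `buys_demo` and `fallback` as hypothetical names, using `numpy.where` to build the default value and `Series.fillna` to apply it only where CarID is missing.

```python
import numpy as np
import pandas as pd

buys_demo = pd.DataFrame({
    'CarID': [23.0, np.nan, np.nan],
    'CurrentEmploymentTitle': ['Badging Office', 'Truck Driver', 'Other'],
})

# Default for a missing CarID: 100 for truck drivers, 0 otherwise
fallback = pd.Series(
    np.where(buys_demo['CurrentEmploymentTitle'] == 'Truck Driver', 100, 0),
    index=buys_demo.index,
)

# fillna aligns on the index and leaves existing CarID values untouched
buys_demo['CarID'] = buys_demo['CarID'].fillna(fallback)
print(buys_demo)
```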


# Save the transformed data

In [24]:
n = len(locations)
locations1 = locations[:round(n/2)]
locations2 = locations[round(n/2):]
locations1.to_csv('../CheckPoints/Locations1_Clean.csv')
locations2.to_csv('../CheckPoints/Locations2_Clean.csv')
buys.to_csv('../CheckPoints/Buys_Clean.csv')
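
Since the location data is saved in two halves, anything that reads it back has to stitch them together again. Below is a minimal sketch of that round trip, using a temporary directory and made-up file names rather than the real checkpoint paths.

```python
import os
import tempfile
import pandas as pd

demo = pd.DataFrame({'id': [35, 35, 104], 'lat': [36.07, 36.08, 36.05]})
half = len(demo) // 2

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names, standing in for the two checkpoint files
    paths = [os.path.join(tmp, 'part1.csv'), os.path.join(tmp, 'part2.csv')]
    demo[:half].to_csv(paths[0])
    demo[half:].to_csv(paths[1])
    # index_col=0 restores the saved index instead of adding an 'Unnamed: 0' column
    restored = pd.concat([pd.read_csv(p, index_col=0) for p in paths])

print(restored)
```

For the real files, passing `parse_dates=['Timestamp']` to `read_csv` as well would restore the timestamp column's datetime type.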

<div>
    <span  style="width:600px;display:inline-block;text-align:left">
        <a href="./FurtherEDA.ipynb">&#60;&#60;Further Exploratory Data Analysis</a>
    </span>
    <span style="width:600px;display:inline-block;text-align:right">
        <a href="./Modeling.ipynb">Modeling&#62;&#62;</a>
    </span>
</div>
<div>
    <center>
        <span style="width:200px;display:inline-block;text-align:center">
            <a href="./Master.ipynb">Master Notebook</a>
        </span>
        <span style="width:200px;display:inline-block;text-align:center">
            <a href="../README.md">Table of Contents</a>
        </span>
    </center>
</div>