Steps:
1. Import the libraries.

2. Load the data and inspect it.

3. Check for missing values, then either remove or fill them.

4. Summarize the data with pandas functions.

5. Explore the data parameter by parameter.

For each trip, the dataset records the start and stop locations, the start and stop times, the category, the purpose, and the miles covered.


In [1]:
# Import the libraries 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Get the Data 
df = pd.read_csv('uberdrive.csv')

#View the first 5 rows of data
df.head(5)

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
0,1/1/2016 21:11,1/1/2016 21:17,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain
1,1/2/2016 1:25,1/2/2016 1:37,Business,Fort Pierce,Fort Pierce,5.0,
2,1/2/2016 20:25,1/2/2016 20:38,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies
3,1/5/2016 17:31,1/5/2016 17:45,Business,Fort Pierce,Fort Pierce,4.7,Meeting
4,1/6/2016 14:42,1/6/2016 15:49,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit


In [3]:
#View the last 5 rows of data
df.tail()

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
1151,12/31/2016 13:24,12/31/2016 13:42,Business,Kar?chi,Unknown Location,3.9,Temporary Site
1152,12/31/2016 15:03,12/31/2016 15:38,Business,Unknown Location,Unknown Location,16.2,Meeting
1153,12/31/2016 21:32,12/31/2016 21:50,Business,Katunayake,Gampaha,6.4,Temporary Site
1154,12/31/2016 22:08,12/31/2016 23:51,Business,Gampaha,Ilukwatta,48.2,Temporary Site
1155,Totals,,,,,12204.7,


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1156 entries, 0 to 1155
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   START_DATE*  1156 non-null   object 
 1   END_DATE*    1155 non-null   object 
 2   CATEGORY*    1155 non-null   object 
 3   START*       1155 non-null   object 
 4   STOP*        1155 non-null   object 
 5   MILES*       1156 non-null   float64
 6   PURPOSE*     653 non-null    object 
dtypes: float64(1), object(6)
memory usage: 63.3+ KB


In [5]:
# understand shape and size of data 
print(df.shape)
print (df.size)

(1156, 7)
8092


**The dataset has 1156 rows and 7 columns**

In [6]:
#Get a summary of the numerical columns in the data
df.describe()

Unnamed: 0,MILES*
count,1156.0
mean,21.115398
std,359.299007
min,0.5
25%,2.9
50%,6.0
75%,10.4
max,12204.7


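**Miles per row range from 0.5 to 12204.7 with a mean of about 21. The maximum is not a trip: it comes from the "Totals" row at the end of the file (index 1155), which also inflates the mean and the standard deviation**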
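One way to keep the summary row out of the statistics is to filter it out before calling `describe()`. The sketch below is only an illustration on a three-row stand-in built from the `df.tail()` output above; the names `raw` and `trips` are not part of the notebook.

```python
import pandas as pd

# Stand-in for the raw file: two real trips plus the trailing "Totals" row
raw = pd.DataFrame({
    'START_DATE*': ['12/31/2016 21:32', '12/31/2016 22:08', 'Totals'],
    'MILES*': [6.4, 48.2, 12204.7],
})

# Keep only rows whose START_DATE* is a real timestamp, not the "Totals" label
trips = raw[raw['START_DATE*'] != 'Totals']

print(raw['MILES*'].max())    # dominated by the summary row
print(trips['MILES*'].max())  # largest real trip in this sample
```

In this notebook the later `dropna()` call removes the "Totals" row anyway, because its other columns are empty.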

In [7]:
#get more information about data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1156 entries, 0 to 1155
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   START_DATE*  1156 non-null   object 
 1   END_DATE*    1155 non-null   object 
 2   CATEGORY*    1155 non-null   object 
 3   START*       1155 non-null   object 
 4   STOP*        1155 non-null   object 
 5   MILES*       1156 non-null   float64
 6   PURPOSE*     653 non-null    object 
dtypes: float64(1), object(6)
memory usage: 63.3+ KB


**The dataset has 1 numeric column (`MILES*`) and 6 object (string) columns; the two date columns are still stored as strings** <br>
**The `PURPOSE*` column is missing in 503 of the 1156 rows**

In [8]:
#Get the number of missing values in each column
df.isnull().sum()

START_DATE*      0
END_DATE*        1
CATEGORY*        1
START*           1
STOP*            1
MILES*           0
PURPOSE*       503
dtype: int64

In [9]:
# Drop every row that has at least one missing value
df = df.dropna()

#Get the shape of the dataframe after removing the null values
df.shape

(653, 7)

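**The dataset now contains 653 rows with no missing values. Almost all of the 503 dropped rows were missing only `PURPOSE*`; the "Totals" row is dropped as well**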
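Step 3 mentions filling as an alternative to removing. A minimal sketch of that option, using rows copied from the `head()` and `tail()` outputs above: drop only rows missing a core field, and label a missing purpose instead of discarding the trip. The `sample` frame and the `'Not recorded'` label are assumptions made for this illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'CATEGORY*': ['Business', 'Business', 'Business', np.nan],
    'MILES*': [5.1, 5.0, 4.8, 12204.7],
    'PURPOSE*': ['Meal/Entertain', np.nan, 'Errand/Supplies', np.nan],
})

# Drop only rows missing a core field (here this removes the "Totals" row)
kept = sample.dropna(subset=['CATEGORY*']).copy()

# Fill the missing purpose with a placeholder label instead of dropping the trip
kept['PURPOSE*'] = kept['PURPOSE*'].fillna('Not recorded')

print(kept)
```

The rest of this notebook continues with the fully dropped version (653 rows), so the numbers below do not reflect this alternative.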

In [10]:
#get the summary of data
df.describe()

Unnamed: 0,MILES*
count,653.0
mean,11.196325
std,22.986429
min,0.5
25%,3.2
50%,6.4
75%,10.4
max,310.3


### Let's explore the data parameter by parameter

1. Destination (start and stop)

2. Time (hour of the day, day of week, month of year)

3. Categories

4. Purpose

5. Grouping two parameters to get more insights


## 1. Destination (start and stop)

In [11]:
# List the unique start locations and count them
print(df['START*'].unique()) #names of unique start points
print(len(df['START*'].unique())) #count of unique start points

['Fort Pierce' 'West Palm Beach' 'Cary' 'Jamaica' 'New York' 'Elmhurst'
 'Midtown' 'East Harlem' 'Flatiron District' 'Midtown East'
 'Hudson Square' 'Lower Manhattan' "Hell's Kitchen" 'Downtown' 'Gulfton'
 'Houston' 'Eagan Park' 'Morrisville' 'Durham' 'Farmington Woods'
 'Lake Wellingborough' 'Fayetteville Street' 'Raleigh' 'Whitebridge'
 'Hazelwood' 'Fairmont' 'Meredith Townes' 'Apex' 'Chapel Hill'
 'Northwoods' 'Edgehill Farms' 'Eastgate' 'East Elmhurst'
 'Long Island City' 'Katunayaka' 'Colombo' 'Nugegoda' 'Unknown Location'
 'Islamabad' 'R?walpindi' 'Noorpur Shahan' 'Preston' 'Heritage Pines'
 'Tanglewood' 'Waverly Place' 'Wayne Ridge' 'Westpark Place' 'East Austin'
 'The Drag' 'South Congress' 'Georgian Acres' 'North Austin'
 'West University' 'Austin' 'Katy' 'Sharpstown' 'Sugar Land' 'Galveston'
 'Port Bolivar' 'Washington Avenue' 'Briar Meadow' 'Latta' 'Jacksonville'
 'Lake Reams' 'Orlando' 'Kissimmee' 'Daytona Beach' 'Ridgeland' 'Florence'
 'Meredith' 'Holly Springs' 'Chessingt

**There are 131 unique start destinations in the dataset**

In [12]:
# List the unique stop locations and count them
print(df['STOP*'].unique()) #names of unique stop points
print(len(df['STOP*'].unique())) #count of unique stop points

['Fort Pierce' 'West Palm Beach' 'Palm Beach' 'Cary' 'Morrisville'
 'New York' 'Queens' 'East Harlem' 'NoMad' 'Midtown' 'Midtown East'
 'Hudson Square' 'Lower Manhattan' "Hell's Kitchen" 'Queens County'
 'Gulfton' 'Downtown' 'Houston' 'Jamestown Court' 'Durham' 'Whitebridge'
 'Raleigh' 'Umstead' 'Hazelwood' 'Westpark Place' 'Meredith Townes'
 'Leesville Hollow' 'Apex' 'Chapel Hill' 'Williamsburg Manor'
 'Macgregor Downs' 'Edgehill Farms' 'Walnut Terrace' 'Midtown West'
 'Long Island City' 'Jamaica' 'Unknown Location' 'Colombo' 'Nugegoda'
 'Katunayaka' 'Islamabad' 'R?walpindi' 'Noorpur Shahan' 'Heritage Pines'
 'Tanglewood' 'Waverly Place' 'Wayne Ridge' 'Northwoods'
 'Depot Historic District' 'West University' 'Congress Ave District'
 'Convention Center District' 'North Austin' 'The Drag' 'Coxville'
 'South Congress' 'Katy' 'Alief' 'Sharpstown' 'Sugar Land' 'Galveston'
 'Port Bolivar' 'Washington Avenue' 'Greater Greenspoint' 'Latta'
 'Jacksonville' 'Kissimmee' 'Lake Reams' 'Orlando' 'D

**There are 137 unique stop destinations in the dataset**
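The count of unique values can also be read with `Series.nunique()`. The two approaches differ only when missing values are present, as this small sketch shows (the `stops` Series is a made-up illustration, not the notebook's data):

```python
import numpy as np
import pandas as pd

stops = pd.Series(['Fort Pierce', 'Fort Pierce', 'West Palm Beach', 'Cary', np.nan])

print(len(stops.unique()))  # unique() keeps NaN as one of the values
print(stops.nunique())      # nunique() ignores NaN by default
```

After `dropna()` there are no missing values left, so both give the same count here.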

In [13]:
#Identify popular start destinations - top 10
df['START*'].value_counts().head(10)

Cary                161
Unknown Location     55
Morrisville          54
Whitebridge          36
Durham               30
Kar?chi              26
Raleigh              21
Lahore               19
Islamabad            15
Midtown              11
Name: START*, dtype: int64

**Cary is the most popular starting point for this driver.**

In [14]:
df['START*'].value_counts(normalize=True).nlargest(10)

Cary                0.246554
Unknown Location    0.084227
Morrisville         0.082695
Whitebridge         0.055130
Durham              0.045942
Kar?chi             0.039816
Raleigh             0.032159
Lahore              0.029096
Islamabad           0.022971
Midtown             0.016845
Name: START*, dtype: float64

In [15]:
df['START*'].value_counts().nlargest(10)

Cary                161
Unknown Location     55
Morrisville          54
Whitebridge          36
Durham               30
Kar?chi              26
Raleigh              21
Lahore               19
Islamabad            15
Midtown              11
Name: START*, dtype: int64

In [16]:
#Identify popular stop destinations - top 10
df['STOP*'].value_counts().head(10)

Cary                155
Morrisville          60
Unknown Location     56
Whitebridge          37
Durham               30
Kar?chi              26
Raleigh              21
Lahore               19
Islamabad            14
Apex                 11
Name: STOP*, dtype: int64

**Cary is also the most popular stop destination**

In [17]:
df[df['START*'] != 'Unknown Location']

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
0,1/1/2016 21:11,1/1/2016 21:17,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain
2,1/2/2016 20:25,1/2/2016 20:38,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies
3,1/5/2016 17:31,1/5/2016 17:45,Business,Fort Pierce,Fort Pierce,4.7,Meeting
4,1/6/2016 14:42,1/6/2016 15:49,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit
5,1/6/2016 17:15,1/6/2016 17:19,Business,West Palm Beach,West Palm Beach,4.3,Meal/Entertain
...,...,...,...,...,...,...,...
1149,12/30/2016 23:06,12/30/2016 23:10,Business,Kar?chi,Kar?chi,0.8,Customer Visit
1150,12/31/2016 1:07,12/31/2016 1:14,Business,Kar?chi,Kar?chi,0.7,Meeting
1151,12/31/2016 13:24,12/31/2016 13:42,Business,Kar?chi,Unknown Location,3.9,Temporary Site
1153,12/31/2016 21:32,12/31/2016 21:50,Business,Katunayake,Gampaha,6.4,Temporary Site


In [18]:
#Total miles driven per start-stop pair - top 10
#Dropping Unknown Location Value
df2 = df[df['START*']!= 'Unknown Location']
df2 = df2[df2['STOP*']!= 'Unknown Location']
df2.groupby(['START*','STOP*'])['MILES*'].sum().sort_values(ascending = False).head(10)

START*        STOP*       
Cary          Durham          312.3
Latta         Jacksonville    310.3
Durham        Cary            298.4
Cary          Morrisville     293.7
Raleigh       Cary            269.5
Morrisville   Cary            250.6
Cary          Cary            233.9
              Raleigh         230.4
Jacksonville  Kissimmee       201.0
Boone         Cary            180.2
Name: MILES*, dtype: float64

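**Cary to Durham accumulates the most total miles (312.3), but this sum reflects how often the route is driven as much as its length. The longest single trip is Latta to Jacksonville at 310.3 miles**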
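To rank pairs by typical trip length rather than accumulated miles, the mean per pair can be computed alongside the sum. This is a sketch on illustrative data: the seven 10-mile Cary to Durham trips are made up, and only the 63.7-mile trip appears in the output above.

```python
import pandas as pd

# Illustrative trips: seven short repeats of one route and one long single trip
trips = pd.DataFrame({
    'START*': ['Cary'] * 7 + ['Fort Pierce'],
    'STOP*': ['Durham'] * 7 + ['West Palm Beach'],
    'MILES*': [10.0] * 7 + [63.7],
})

# Sum, mean and trip count for every start-stop pair
per_pair = trips.groupby(['START*', 'STOP*'])['MILES*'].agg(['sum', 'mean', 'count'])

print(per_pair.sort_values('sum', ascending=False))   # frequent route ranks first
print(per_pair.sort_values('mean', ascending=False))  # long route ranks first
```

Sorting by `sum` favours routes that are driven often, while sorting by `mean` favours routes that are long.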

In [19]:
#Find out most popular start and stop pair - top10
df2.groupby(['START*','STOP*']).size().sort_values(ascending=False).head(10)

START*       STOP*      
Cary         Morrisville    52
Morrisville  Cary           51
Cary         Cary           44
             Durham         30
Durham       Cary           29
Kar?chi      Kar?chi        20
Cary         Raleigh        17
Lahore       Lahore         16
Raleigh      Cary           15
Cary         Apex           11
dtype: int64

**The most popular start to destination pair is Cary-Morrisville**

## 2. Manipulating date & time objects

https://strftime.org/

In [20]:
df_n = df.copy()

### 1. Converting the columns into datetime - standard way

In [21]:
df_n['START_DATE*_dt']  = pd.to_datetime(df_n['START_DATE*'], format= '%m/%d/%Y %H:%M')
df_n['END_DATE*_dt']  = pd.to_datetime(df_n['END_DATE*'], format= '%m/%d/%Y %H:%M')
df_n.head()

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*,START_DATE*_dt,END_DATE*_dt
0,1/1/2016 21:11,1/1/2016 21:17,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain,2016-01-01 21:11:00,2016-01-01 21:17:00
2,1/2/2016 20:25,1/2/2016 20:38,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies,2016-01-02 20:25:00,2016-01-02 20:38:00
3,1/5/2016 17:31,1/5/2016 17:45,Business,Fort Pierce,Fort Pierce,4.7,Meeting,2016-01-05 17:31:00,2016-01-05 17:45:00
4,1/6/2016 14:42,1/6/2016 15:49,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit,2016-01-06 14:42:00,2016-01-06 15:49:00
5,1/6/2016 17:15,1/6/2016 17:19,Business,West Palm Beach,West Palm Beach,4.3,Meal/Entertain,2016-01-06 17:15:00,2016-01-06 17:19:00


**Using pd.to_datetime function within pandas**
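With an explicit `format`, `pd.to_datetime` raises an error on any value that does not match, such as the "Totals" label in the raw file. Passing `errors='coerce'` turns such values into `NaT` instead. A small sketch, assuming the conversion were run on raw strings before `dropna()` (the `raw_dates` Series is only an illustration):

```python
import pandas as pd

raw_dates = pd.Series(['1/1/2016 21:11', '12/31/2016 22:08', 'Totals'])

# errors='coerce' converts unparseable strings to NaT instead of raising
parsed = pd.to_datetime(raw_dates, format='%m/%d/%Y %H:%M', errors='coerce')

print(parsed)
```

In this notebook the "Totals" row is already gone by the time the conversion runs, so the default behaviour is sufficient.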

In [22]:
df_n.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 653 entries, 0 to 1154
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   START_DATE*     653 non-null    object        
 1   END_DATE*       653 non-null    object        
 2   CATEGORY*       653 non-null    object        
 3   START*          653 non-null    object        
 4   STOP*           653 non-null    object        
 5   MILES*          653 non-null    float64       
 6   PURPOSE*        653 non-null    object        
 7   START_DATE*_dt  653 non-null    datetime64[ns]
 8   END_DATE*_dt    653 non-null    datetime64[ns]
dtypes: datetime64[ns](2), float64(1), object(6)
memory usage: 51.0+ KB


### 2. Converting the columns into datetime - Custom Function way

In [23]:
from datetime import datetime  # needed for datetime.strptime below

def date_convert(date_col_to_convert):
    """
    Convert a single date string in '%m/%d/%Y %H:%M' format into a datetime object.

    Pass this function to Series.apply to convert a whole column of date strings,
    one value at a time.
    """
    return datetime.strptime(date_col_to_convert, '%m/%d/%Y %H:%M')

In [24]:
df_n['START_DATE*_def_func'] = df_n['START_DATE*'].apply(date_convert)
df_n['END_DATE*_def_func'] = df_n['END_DATE*'].apply(date_convert)

In [None]:
df_n.info()

In [25]:
df_n.head()

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*,START_DATE*_dt,END_DATE*_dt,START_DATE*_def_func,END_DATE*_def_func
0,1/1/2016 21:11,1/1/2016 21:17,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain,2016-01-01 21:11:00,2016-01-01 21:17:00,2016-01-01 21:11:00,2016-01-01 21:17:00
2,1/2/2016 20:25,1/2/2016 20:38,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies,2016-01-02 20:25:00,2016-01-02 20:38:00,2016-01-02 20:25:00,2016-01-02 20:38:00
3,1/5/2016 17:31,1/5/2016 17:45,Business,Fort Pierce,Fort Pierce,4.7,Meeting,2016-01-05 17:31:00,2016-01-05 17:45:00,2016-01-05 17:31:00,2016-01-05 17:45:00
4,1/6/2016 14:42,1/6/2016 15:49,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit,2016-01-06 14:42:00,2016-01-06 15:49:00,2016-01-06 14:42:00,2016-01-06 15:49:00
5,1/6/2016 17:15,1/6/2016 17:19,Business,West Palm Beach,West Palm Beach,4.3,Meal/Entertain,2016-01-06 17:15:00,2016-01-06 17:19:00,2016-01-06 17:15:00,2016-01-06 17:19:00


### 3. Converting the columns into datetime - Anonymous function (lambda) way - compare with above def way

In [26]:
# START_DATE* and END_DATE* are strings; convert them to datetime objects.
# pd.datetime is deprecated (removed in pandas 2.0), so use the standard-library datetime class.
from datetime import datetime
df['START_DATE*'] = df['START_DATE*'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M'))
df['END_DATE*'] = df['END_DATE*'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M'))


**The step above needs no named function: a lambda is an anonymous function, and `apply` calls it on every value of the column**

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 653 entries, 0 to 1154
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   START_DATE*  653 non-null    datetime64[ns]
 1   END_DATE*    653 non-null    datetime64[ns]
 2   CATEGORY*    653 non-null    object        
 3   START*       653 non-null    object        
 4   STOP*        653 non-null    object        
 5   MILES*       653 non-null    float64       
 6   PURPOSE*     653 non-null    object        
dtypes: datetime64[ns](2), float64(1), object(4)
memory usage: 40.8+ KB


In [28]:
df.head()

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*
0,2016-01-01 21:11:00,2016-01-01 21:17:00,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain
2,2016-01-02 20:25:00,2016-01-02 20:38:00,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies
3,2016-01-05 17:31:00,2016-01-05 17:45:00,Business,Fort Pierce,Fort Pierce,4.7,Meeting
4,2016-01-06 14:42:00,2016-01-06 15:49:00,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit
5,2016-01-06 17:15:00,2016-01-06 17:19:00,Business,West Palm Beach,West Palm Beach,4.3,Meal/Entertain


In [29]:
#Calculate the duration for the rides
df['DIFF'] = df['END_DATE*'] - df['START_DATE*']

In [30]:
df.head()

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*,DIFF
0,2016-01-01 21:11:00,2016-01-01 21:17:00,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain,00:06:00
2,2016-01-02 20:25:00,2016-01-02 20:38:00,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies,00:13:00
3,2016-01-05 17:31:00,2016-01-05 17:45:00,Business,Fort Pierce,Fort Pierce,4.7,Meeting,00:14:00
4,2016-01-06 14:42:00,2016-01-06 15:49:00,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit,01:07:00
5,2016-01-06 17:15:00,2016-01-06 17:19:00,Business,West Palm Beach,West Palm Beach,4.3,Meal/Entertain,00:04:00


`DIFF` is a timedelta (displayed as HH:MM:SS). For ride data a plain number of minutes is more convenient, so we convert it to a numeric column.

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 653 entries, 0 to 1154
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype          
---  ------       --------------  -----          
 0   START_DATE*  653 non-null    datetime64[ns] 
 1   END_DATE*    653 non-null    datetime64[ns] 
 2   CATEGORY*    653 non-null    object         
 3   START*       653 non-null    object         
 4   STOP*        653 non-null    object         
 5   MILES*       653 non-null    float64        
 6   PURPOSE*     653 non-null    object         
 7   DIFF         653 non-null    timedelta64[ns]
dtypes: datetime64[ns](2), float64(1), object(4), timedelta64[ns](1)
memory usage: 45.9+ KB


In [32]:
import datetime
print(pd.Timedelta("1 days"))
print(pd.Timedelta("1 days 2 hours"))
print(pd.Timedelta("-1 days 2 min 3us"))

1 days 00:00:00
1 days 02:00:00
-2 days +23:57:59.999997


**datetime.timedelta-**
A duration expressing the difference between two date, time, or datetime values.

**Use the `Timedelta.to_pytimedelta()` method to convert a pandas `Timedelta` into a standard-library `datetime.timedelta` object.**

In [33]:
pd.Timedelta("1 days 2 hours").to_pytimedelta()

datetime.timedelta(days=1, seconds=7200)

In [34]:
pd.Timedelta("1 days 2 hours").to_pytimedelta().days

1

In [35]:
pd.Timedelta("00:06:00").to_pytimedelta()

datetime.timedelta(seconds=360)

In [36]:
pd.Timedelta("00:06:00").to_pytimedelta().days*24*60

0

In [37]:
pd.Timedelta("00:06:00").to_pytimedelta().seconds

360

360 seconds: dividing by 60 converts them to minutes

In [38]:
pd.Timedelta("00:06:00").to_pytimedelta().seconds/60

6.0

In [39]:
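# Adding them together: whole days in minutes plus the within-day seconds in minutes
(pd.Timedelta("00:06:00").to_pytimedelta().days*24*60 + pd.Timedelta("00:06:00").to_pytimedelta().seconds/60)

6.0

The total is the whole days expressed in minutes (days × 24 × 60) plus the remaining within-day seconds expressed in minutes (seconds / 60).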
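The same conversion to minutes can be done on a whole column at once with `Series.dt.total_seconds()`, which already accounts for full days. A minimal sketch using two start/end pairs from the `df.head()` output above (the `start`, `end` and `minutes` names are only for illustration):

```python
import pandas as pd

start = pd.to_datetime(pd.Series(['2016-01-01 21:11', '2016-01-06 14:42']))
end = pd.to_datetime(pd.Series(['2016-01-01 21:17', '2016-01-06 15:49']))

# Subtracting datetimes gives timedeltas; total_seconds() includes whole days
minutes = (end - start).dt.total_seconds() / 60
print(minutes.tolist())  # [6.0, 67.0]

# A duration longer than a day is handled without any extra arithmetic
print(pd.Timedelta("1 days 2 hours").total_seconds() / 60)  # 1560.0
```

This avoids the per-row `apply` and the separate handling of `.days` and `.seconds`.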

In [40]:
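#convert duration to numbers (minutes): whole days * 24 * 60 plus within-day seconds / 60
#assign with df['DIFF'] = ... so the column is replaced and takes the new float dtype
df['DIFF'] = df['DIFF'].apply(lambda x: pd.Timedelta.to_pytimedelta(x).days*24*60 + pd.Timedelta.to_pytimedelta(x).seconds/60)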

In [41]:
df['DIFF'].head()

0     6.0
2    13.0
3    14.0
4    67.0
5     4.0
Name: DIFF, dtype: float64

In [42]:
df['DIFF'].describe()

count    653.000000
mean      23.398162
std       25.769640
min        2.000000
25%       11.000000
50%       18.000000
75%       28.000000
max      330.000000
Name: DIFF, dtype: float64

**Ride durations range from 2 minutes to 330 minutes with an average duration of 23 minutes**

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 653 entries, 0 to 1154
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   START_DATE*  653 non-null    datetime64[ns]
 1   END_DATE*    653 non-null    datetime64[ns]
 2   CATEGORY*    653 non-null    object        
 3   START*       653 non-null    object        
 4   STOP*        653 non-null    object        
 5   MILES*       653 non-null    float64       
 6   PURPOSE*     653 non-null    object        
 7   DIFF         653 non-null    float64       
dtypes: datetime64[ns](2), float64(2), object(4)
memory usage: 45.9+ KB


In [44]:
df.head()

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*,DIFF
0,2016-01-01 21:11:00,2016-01-01 21:17:00,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain,6.0
2,2016-01-02 20:25:00,2016-01-02 20:38:00,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies,13.0
3,2016-01-05 17:31:00,2016-01-05 17:45:00,Business,Fort Pierce,Fort Pierce,4.7,Meeting,14.0
4,2016-01-06 14:42:00,2016-01-06 15:49:00,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit,67.0
5,2016-01-06 17:15:00,2016-01-06 17:19:00,Business,West Palm Beach,West Palm Beach,4.3,Meal/Entertain,4.0


In [45]:
#Capture Hour, Day, Month and Year of Ride in a separate column
df['month'] = pd.to_datetime(df['START_DATE*']).dt.month
df['Year'] = pd.to_datetime(df['START_DATE*']).dt.year
df['Day'] = pd.to_datetime(df['START_DATE*']).dt.day
df['Hour'] = pd.to_datetime(df['START_DATE*']).dt.hour

In [46]:
#Capture day of week and rename to weekday names
df['day_of_week'] = pd.to_datetime(df['START_DATE*']).dt.dayofweek

days = {0:'Mon',1:'Tue',2:'Wed',3:'Thur',4:'Fri',5:'Sat',6:'Sun'}

df['day_of_week'] = df['day_of_week'].apply(lambda x: days[x])

In [47]:
#Rename the numbers in the Month column to calendar months
import calendar
df['month'] = df['month'].apply(lambda x: calendar.month_abbr[x])
df.head()

Unnamed: 0,START_DATE*,END_DATE*,CATEGORY*,START*,STOP*,MILES*,PURPOSE*,DIFF,month,Year,Day,Hour,day_of_week
0,2016-01-01 21:11:00,2016-01-01 21:17:00,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain,6.0,Jan,2016,1,21,Fri
2,2016-01-02 20:25:00,2016-01-02 20:38:00,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies,13.0,Jan,2016,2,20,Sat
3,2016-01-05 17:31:00,2016-01-05 17:45:00,Business,Fort Pierce,Fort Pierce,4.7,Meeting,14.0,Jan,2016,5,17,Tue
4,2016-01-06 14:42:00,2016-01-06 15:49:00,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit,67.0,Jan,2016,6,14,Wed
5,2016-01-06 17:15:00,2016-01-06 17:19:00,Business,West Palm Beach,West Palm Beach,4.3,Meal/Entertain,4.0,Jan,2016,6,17,Wed
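The weekday and month labels can also be derived directly from the `.dt` accessor, without a lookup dictionary or the `calendar` module. A short sketch on two start times from the table above (the `starts` Series is only an illustration; note that `day_name()` returns full names such as 'Friday' rather than the abbreviations used in this notebook):

```python
import pandas as pd

starts = pd.to_datetime(pd.Series(['2016-01-01 21:11', '2016-01-05 17:31']))

# Full English weekday names
print(starts.dt.day_name().tolist())             # ['Friday', 'Tuesday']

# First three letters of the English month name
print(starts.dt.month_name().str[:3].tolist())   # ['Jan', 'Jan']
```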


In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 653 entries, 0 to 1154
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   START_DATE*  653 non-null    datetime64[ns]
 1   END_DATE*    653 non-null    datetime64[ns]
 2   CATEGORY*    653 non-null    object        
 3   START*       653 non-null    object        
 4   STOP*        653 non-null    object        
 5   MILES*       653 non-null    float64       
 6   PURPOSE*     653 non-null    object        
 7   DIFF         653 non-null    float64       
 8   month        653 non-null    object        
 9   Year         653 non-null    int64         
 10  Day          653 non-null    int64         
 11  Hour         653 non-null    int64         
 12  day_of_week  653 non-null    object        
dtypes: datetime64[ns](2), float64(2), int64(3), object(6)
memory usage: 71.4+ KB


In [48]:
#Extract the total number of trips per month, weekday
print(df['month'].value_counts())
print(df['day_of_week'].value_counts())

Dec    134
Feb     82
Jun     73
Mar     71
Nov     60
Jan     59
Apr     50
Jul     46
May     46
Oct     20
Aug     12
Name: month, dtype: int64
Fri     125
Tue      94
Thur     92
Sun      87
Mon      87
Wed      85
Sat      83
Name: day_of_week, dtype: int64


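**December has the most trips (134). Among the months that appear, August has the fewest (12); September does not appear at all, so no trips from that month remain after dropping rows with missing values** <br>
**Friday has the maximum number of trips**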
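`value_counts()` sorts by frequency, which hides gaps in the calendar. Reindexing the counts in calendar order makes a missing month visible. The sketch below re-enters the monthly counts printed above by hand; `counts`, `month_order` and `by_calendar` are names chosen for this illustration.

```python
import calendar
import pandas as pd

# Trip counts per month, copied from the output above
counts = pd.Series({'Dec': 134, 'Feb': 82, 'Jun': 73, 'Mar': 71, 'Nov': 60, 'Jan': 59,
                    'Apr': 50, 'Jul': 46, 'May': 46, 'Oct': 20, 'Aug': 12})

# calendar.month_abbr starts with an empty string at index 0, so skip it
month_order = list(calendar.month_abbr)[1:]

# Reorder by calendar month; months with no trips get 0
by_calendar = counts.reindex(month_order, fill_value=0)
print(by_calendar)
```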

In [151]:
#Getting the average distance covered per month
#select MILES* before aggregating so non-numeric columns are not averaged
df.groupby('month')['MILES*'].mean().sort_values(ascending = False)

month
Oct    24.840000
Apr    21.898000
Mar    20.505634
Jul    10.615217
Nov    10.590000
Feb     8.868293
Jan     8.486441
May     7.793478
Jun     7.410959
Aug     7.341667
Dec     6.898507
Name: MILES*, dtype: float64

**Longest average distance is covered in Oct and least in Dec**

In [152]:
#Number of trips based on hour of day
df['Hour'].value_counts()

13    55
14    52
18    51
17    51
15    51
20    45
16    45
12    43
11    39
19    35
21    34
10    33
9     26
23    21
22    21
8     17
0     13
7      8
1      4
5      3
3      3
6      2
2      1
Name: Hour, dtype: int64

**Most trips start in the afternoon and early evening (roughly 12:00 to 20:00); very few start between 1:00 and 7:00**

In [153]:
# calculate the average speed of each trip in miles per hour (MILES* is in miles)
df['Duration_hours'] = df['DIFF'] / 60
df['Speed_MPH'] = df['MILES*'] / df['Duration_hours']
df['Speed_MPH'].describe()

count    653.000000
mean      25.261340
std       16.815108
min        6.000000
25%       16.571429
50%       22.285714
75%       29.100000
max      228.000000
Name: Speed_MPH, dtype: float64
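Because `MILES*` is in miles, dividing it by the duration in hours gives miles per hour. If kilometres per hour are wanted, the result has to be multiplied by 1.609344 (kilometres per mile). A small sketch on two trips from the tables above; the `trip` frame and its column names are assumptions for this illustration.

```python
import pandas as pd

MILES_TO_KM = 1.609344  # kilometres in one mile

# Miles and duration in minutes for two trips shown earlier
trip = pd.DataFrame({'MILES*': [5.1, 63.7], 'DIFF': [6.0, 67.0]})

trip['speed_mph'] = trip['MILES*'] / (trip['DIFF'] / 60)
trip['speed_kmh'] = trip['speed_mph'] * MILES_TO_KM

print(trip)
```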

## 3. Category & Purpose

In [154]:
df['CATEGORY*'].value_counts()

Business    647
Personal      6
Name: CATEGORY*, dtype: int64

**Most trips are in the business category**

In [155]:
#Purpose
df['PURPOSE*'].value_counts()


Meeting            187
Meal/Entertain     160
Errand/Supplies    128
Customer Visit     101
Temporary Site      50
Between Offices     18
Moving               4
Airport/Travel       3
Charity ($)          1
Commute              1
Name: PURPOSE*, dtype: int64

**Most trips are for meetings**

In [156]:
#Average distance traveled for each activity
#select MILES* before aggregating so non-numeric columns are not averaged
df.groupby('PURPOSE*')['MILES*'].mean().sort_values(ascending = False)

PURPOSE*
Commute            180.200000
Customer Visit      20.688119
Meeting             15.247594
Charity ($)         15.100000
Between Offices     10.944444
Temporary Site      10.474000
Meal/Entertain       5.698125
Airport/Travel       5.500000
Moving               4.550000
Errand/Supplies      3.968750
Name: MILES*, dtype: float64

Now let's try to answer some questions from this data.

Question 1: How many miles were driven per category and purpose?

Question 2: What is the percentage of business miles vs personal miles?

Question 3: How much time was spent on drives per category and purpose?


In [157]:
#Question 1: How many miles were driven per purpose?
df.groupby('PURPOSE*')['MILES*'].sum().sort_values(ascending = False)

PURPOSE*
Meeting            2851.3
Customer Visit     2089.5
Meal/Entertain      911.7
Temporary Site      523.7
Errand/Supplies     508.0
Between Offices     197.0
Commute             180.2
Moving               18.2
Airport/Travel       16.5
Charity ($)          15.1
Name: MILES*, dtype: float64

In [158]:
#Question 1 (continued): How many miles were driven per category?
df.groupby('CATEGORY*')['MILES*'].sum().sort_values(ascending = False)

CATEGORY*
Business    7097.7
Personal     213.5
Name: MILES*, dtype: float64
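Questions 2 and 3 can be approached in the same way. The sketch below is an illustration, not the notebook's own cells. The Question 2 part reuses the two category totals printed above. The Question 3 part groups the ride duration in minutes on five rows copied from the earlier `df.head()` output, so its numbers describe only that sample. The names `miles`, `share`, `rides` and `time_spent` are chosen for this example.

```python
import pandas as pd

# Question 2: share of miles per category, from the totals printed above
miles = pd.Series({'Business': 7097.7, 'Personal': 213.5})
share = (miles / miles.sum() * 100).round(2)
print(share)

# Question 3: minutes per category and purpose, on five sample rows
rides = pd.DataFrame({
    'CATEGORY*': ['Business'] * 5,
    'PURPOSE*': ['Meal/Entertain', 'Errand/Supplies', 'Meeting',
                 'Customer Visit', 'Meal/Entertain'],
    'DIFF': [6.0, 13.0, 14.0, 67.0, 4.0],
})
time_spent = (rides.groupby(['CATEGORY*', 'PURPOSE*'])['DIFF']
                   .sum()
                   .sort_values(ascending=False))
print(time_spent)
```

On the full cleaned data, the same `groupby` on `df` would give the total minutes for every category and purpose combination.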

## Summary of findings from analysis