# Uber Supply-Demand Gap

<h1>Name : Vikash Kumar Shrivastava</h1>
<h2>Reg. No. : 12018607</h2>
<h2>Section : K20RU</h2>

## Importing numpy and pandas libraries

In [2]:
# Import the numpy and pandas packages
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

## Importing matplotlib and seaborn libraries

In [3]:
import matplotlib.pyplot as plt
import seaborn as sns

## Importing Data from CSV

In [4]:
# Read the csv file using 'read_csv'
uber_data = pd.read_csv('Uber Request Data.csv')

## Inspecting the dataframe

In [5]:
# Observing actual data in dataframe

uber_data

Unnamed: 0,Request id,Pickup point,Driver id,Status,Request timestamp,Drop timestamp
0,619,Airport,1.0,Trip Completed,11/7/2016 11:51,11/7/2016 13:00
1,867,Airport,1.0,Trip Completed,11/7/2016 17:57,11/7/2016 18:47
2,1807,City,1.0,Trip Completed,12/7/2016 9:17,12/7/2016 9:58
3,2532,Airport,1.0,Trip Completed,12/7/2016 21:08,12/7/2016 22:03
4,3112,City,1.0,Trip Completed,13-07-2016 08:33:16,13-07-2016 09:25:47
...,...,...,...,...,...,...
6740,6745,City,,No Cars Available,15-07-2016 23:49:03,
6741,6752,Airport,,No Cars Available,15-07-2016 23:50:05,
6742,6751,City,,No Cars Available,15-07-2016 23:52:06,
6743,6754,City,,No Cars Available,15-07-2016 23:54:39,


**Inference 1:** The 'Driver id' column holds decimal (float) values, but it should hold integers.

**Inference 2:** The 'Request timestamp' and 'Drop timestamp' columns contain dates in two different formats; they need a uniform format for analysis, e.g. '15-07-2016 10:00:43' vs. '11/7/2016 13:08'.

In [6]:
# Check the number of rows and columns in the dataframe
uber_data.shape

(6745, 6)

In [7]:
# Check the column-wise info of the dataframe
uber_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6745 entries, 0 to 6744
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Request id         6745 non-null   int64  
 1   Pickup point       6745 non-null   object 
 2   Driver id          4095 non-null   float64
 3   Status             6745 non-null   object 
 4   Request timestamp  6745 non-null   object 
 5   Drop timestamp     2831 non-null   object 
dtypes: float64(1), int64(1), object(4)
memory usage: 316.3+ KB


**Inference 3:** The 'Driver id' and 'Drop timestamp' columns have many NaN values (2650 and 3914 missing out of 6745 rows, respectively).

### Summary of Inspecting the dataframe
* The 'Driver id' column holds decimal (float) values, but it should hold integers.
* The 'Request timestamp' and 'Drop timestamp' columns contain dates in two different formats; they need a uniform format for analysis, e.g. '15-07-2016 10:00:43' vs. '11/7/2016 13:08'.
* The 'Driver id' and 'Drop timestamp' columns have many NaN values.
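
One way to check whether the missing values are expected is to break them down by 'Status'. The following is a minimal sketch on a small hand-made frame, not the real CSV: the sample rows are illustrative assumptions, and `null_counts_by_status` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking three columns of the request data (illustrative values only).
sample = pd.DataFrame({
    "Status": ["Trip Completed", "Cancelled", "No Cars Available", "No Cars Available"],
    "Driver id": [1.0, 2.0, np.nan, np.nan],
    "Drop timestamp": ["11/7/2016 13:00", np.nan, np.nan, np.nan],
})

def null_counts_by_status(df):
    """Count missing 'Driver id' and 'Drop timestamp' values per 'Status'."""
    return (
        df[["Driver id", "Drop timestamp"]]
        .isnull()                 # True where a value is missing
        .groupby(df["Status"])    # group the boolean flags by request status
        .sum()                    # summing booleans counts the missing values
    )

null_counts = null_counts_by_status(sample)
print(null_counts)
```

Running the same helper on `uber_data` before any cleaning would show which statuses account for the NaN values.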

## Data Cleaning

### Converting the data type of the 'Driver id' column
#### To do this, we first replace the NaN values in 'Driver id' with zero, since this column is not needed for the analysis.
#### We then change the column from float type to integer type.

In [8]:
display(uber_data.dtypes)

Request id             int64
Pickup point          object
Driver id            float64
Status                object
Request timestamp     object
Drop timestamp        object
dtype: object

In [9]:
uber_data['Driver id'] = uber_data['Driver id'].fillna(0)

In [10]:
uber_data['Driver id'] = uber_data['Driver id'].astype(int)

In [11]:
uber_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6745 entries, 0 to 6744
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Request id         6745 non-null   int64 
 1   Pickup point       6745 non-null   object
 2   Driver id          6745 non-null   int32 
 3   Status             6745 non-null   object
 4   Request timestamp  6745 non-null   object
 5   Drop timestamp     2831 non-null   object
dtypes: int32(1), int64(1), object(4)
memory usage: 289.9+ KB


### The NaN values have been replaced and the 'Driver id' column is now of integer type
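
Filling with zero makes a missing driver look like driver 0. If that distinction ever matters, pandas' nullable `Int64` dtype is an alternative that keeps missing ids as `<NA>`. The sketch below compares both approaches on a toy Series; the values are assumptions for illustration, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy 'Driver id' values with one missing entry (illustrative only).
driver_ids = pd.Series([1.0, 2.0, np.nan], name="Driver id")

# Approach used in this notebook: fill missing ids with 0, then cast to integer.
filled = driver_ids.fillna(0).astype(int)

# Alternative: the nullable integer dtype keeps the missing id as <NA>.
nullable = driver_ids.astype("Int64")

print(filled.tolist())
print(nullable)
```

The zero-fill approach is simpler and is sufficient here because the column is not used later; the nullable dtype is only worth it when missing ids must stay distinguishable.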

### Converting 'Request timestamp' and 'Drop timestamp' fields to uniform format

In [12]:
from datetime import datetime

# Convert a timestamp string in either of the two source formats to a datetime object.
def uniform_format(var):
    try:
        if var == ' ':
            # Blank entries are returned unchanged
            return var
        elif '-' in var:
            # e.g. '13-07-2016 08:33:16'
            return datetime.strptime(var, '%d-%m-%Y %H:%M:%S')
        elif '/' in var:
            # e.g. '11/7/2016 11:51' (day/month/year, no seconds)
            return datetime.strptime(var, '%d/%m/%Y %H:%M')
    except Exception as e:
        # Report any value that cannot be parsed; None is returned for it
        print(e, var)

# Replace missing values in 'Drop timestamp' with 'Request timestamp', so that 'Ride Duration' becomes 0 minutes
uber_data['Drop timestamp'] = uber_data['Drop timestamp'].fillna(uber_data['Request timestamp'])

# Apply the uniform datetime format to the 'Request timestamp' and 'Drop timestamp' columns
uber_data['Request timestamp'] = uber_data['Request timestamp'].apply(uniform_format)
uber_data['Drop timestamp'] = uber_data['Drop timestamp'].apply(uniform_format)

uber_data.head()

Unnamed: 0,Request id,Pickup point,Driver id,Status,Request timestamp,Drop timestamp
0,619,Airport,1,Trip Completed,2016-07-11 11:51:00,2016-07-11 13:00:00
1,867,Airport,1,Trip Completed,2016-07-11 17:57:00,2016-07-11 18:47:00
2,1807,City,1,Trip Completed,2016-07-12 09:17:00,2016-07-12 09:58:00
3,2532,Airport,1,Trip Completed,2016-07-12 21:08:00,2016-07-12 22:03:00
4,3112,City,1,Trip Completed,2016-07-13 08:33:16,2016-07-13 09:25:47
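
The same conversion can also be sketched with `pd.to_datetime`, parsing each format with its own explicit format string. This is an alternative to the row-wise function above, not the method used in this notebook. `parse_mixed_timestamps` is a hypothetical helper, and it assumes the column contains no missing values and that both formats occur.

```python
import pandas as pd

# Sample strings in the two formats seen in the data.
raw = pd.Series(["11/7/2016 11:51", "13-07-2016 08:33:16", "12/7/2016 9:17"])

def parse_mixed_timestamps(values):
    """Parse strings that are either 'd/m/Y H:M' or 'd-m-Y H:M:S'."""
    has_slash = values.str.contains("/", regex=False)
    # Parse each subset with the format that matches it.
    slash_part = pd.to_datetime(values[has_slash], format="%d/%m/%Y %H:%M")
    dash_part = pd.to_datetime(values[~has_slash], format="%d-%m-%Y %H:%M:%S")
    # Recombine the two subsets and restore the original row order.
    return pd.concat([slash_part, dash_part]).sort_index()

parsed = parse_mixed_timestamps(raw)
print(parsed)
```

Giving the format explicitly avoids any ambiguity between day-first and month-first dates such as '11/7/2016'.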


### Identifying additional data quality issues

In [13]:
# Get the column-wise null count using 'isnull()' along with the 'sum()' function
uber_data.isnull().sum()

Request id           0
Pickup point         0
Driver id            0
Status               0
Request timestamp    0
Drop timestamp       0
dtype: int64

### No null values remain in any column.
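
Because the missing drop times were filled with the request time, such rows can no longer be found with `isnull()`; they show up instead as rides with a duration of zero. A minimal sketch of that check on a toy frame follows. The two rows are illustrative, and the 'Ride Duration' column name is taken from the comment in the cleaning cell.

```python
import pandas as pd

# Toy frame: one completed trip and one request whose drop time was filled in.
sample = pd.DataFrame({
    "Status": ["Trip Completed", "No Cars Available"],
    "Request timestamp": pd.to_datetime(["2016-07-11 11:51:00", "2016-07-15 23:49:03"]),
    "Drop timestamp": pd.to_datetime(["2016-07-11 13:00:00", "2016-07-15 23:49:03"]),
})

# Ride duration in minutes.
sample["Ride Duration"] = (
    sample["Drop timestamp"] - sample["Request timestamp"]
).dt.total_seconds() / 60

# Rows whose drop time equals the request time have a zero duration.
zero_duration = sample[sample["Ride Duration"] == 0]
print(zero_duration[["Status", "Ride Duration"]])
```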