
# **Uber Data Analysis**
### Context
Uber Technologies, Inc. is an American multinational transportation network company based in San Francisco and has operations in approximately 72 countries and 10,500 cities. In the fourth quarter of 2021, Uber had 118 million monthly active users worldwide and generated an average of 19 million trips per day.

Ridesharing is a volatile market: demand fluctuates with time, place, weather, local events, and other factors. Success in this business depends on detecting patterns in these fluctuations and meeting the demand at any given time.

As a newly hired Data Scientist in Uber's New York office, you have been given the task of extracting insights from data that will help the business better understand the demand profile and take appropriate actions to drive better outcomes. Your goal is to identify insights that are actionable, i.e., insights the business can act on.

### Objective
To extract actionable insights around demand patterns across various factors.

### Key Questions
1. What are the different variables that influence pickups?
2. Which factor affects the pickups the most? What could be plausible reasons for that?
3. What are your recommendations to Uber management to capitalize on fluctuating demand?

### Dataset Description
The data contains information about the weather, location, and pickups:
*   pickup_dt: Date and time of the pick-up
*   borough: NYC borough in which the pickup was made
*   pickups: Number of pickups for the period (1 hour)
*   spd: Wind speed in miles/hour
*   vsb: Visibility in miles to the nearest tenth
*   temp: Temperature in Fahrenheit
*   dewp: Dew point in Fahrenheit
*   slp: Sea level pressure
*   pcp01: 1-hour liquid precipitation
*   pcp06: 6-hour liquid precipitation
*   pcp24: 24-hour liquid precipitation
*   sd: Snow depth in inches
*   hday: Whether the day is a holiday (Y) or not (N)


In [2]:
# Library to suppress warnings
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
%matplotlib inline

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## **Loading the dataset**

In [3]:
# Read in the data through pandas
data = pd.read_csv('/content/drive/MyDrive/MIT Course/Pre-Work/Week1/uber.csv')

In [4]:
# Copy the data to ensure the original data remains unaltered
df = data.copy()

## **View the first 10 rows of the dataset**

In [6]:
df.head(10)

Unnamed: 0,pickup_dt,borough,pickups,spd,vsb,temp,dewp,slp,pcp01,pcp06,pcp24,sd,hday
0,2015-01-01 01:00:00,Bronx,152,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
1,2015-01-01 01:00:00,Brooklyn,1519,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
2,2015-01-01 01:00:00,EWR,0,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
3,2015-01-01 01:00:00,Manhattan,5258,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
4,2015-01-01 01:00:00,Queens,405,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
5,2015-01-01 01:00:00,Staten Island,6,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
6,2015-01-01 01:00:00,,4,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
7,2015-01-01 02:00:00,Bronx,120,3.0,10.0,30.0,6.0,1023.0,0.0,0.0,0.0,0.0,Y
8,2015-01-01 02:00:00,Brooklyn,1229,3.0,10.0,30.0,6.0,1023.0,0.0,0.0,0.0,0.0,Y
9,2015-01-01 02:00:00,EWR,0,3.0,10.0,30.0,6.0,1023.0,0.0,0.0,0.0,0.0,Y


**Observations:**

* The column pickup_dt holds the pickup date and time. The data starts on 01-Jan-2015, and the timestamps appear to be recorded at hourly resolution.
* The column borough contains the name of the New York borough in which the pickup was made. Some rows (e.g., index 6) have no borough value.
* The column pickups contains the number of pickups in the borough during the given hour.
* All of the weather variables are numerical.
* The variable hday (holiday) is a categorical variable with the values Y and N.
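The observation that the timestamps only carry the hour can be checked directly instead of by eye. The sketch below is a minimal illustration on a small hand-typed sample copied from the rows shown above; the names `sample`, `parsed`, and `on_the_hour` are illustrative and not part of the original notebook. On the real data, the same check would be run against `df['pickup_dt']`.

```python
import pandas as pd

# Small illustrative sample copied from the first rows shown above
# (an assumption for demonstration; the notebook itself uses the full df)
sample = pd.DataFrame({
    "pickup_dt": [
        "2015-01-01 01:00:00",
        "2015-01-01 01:00:00",
        "2015-01-01 02:00:00",
    ],
    "borough": ["Bronx", "Brooklyn", "Bronx"],
    "pickups": [152, 1519, 120],
})

# Parse the strings into timestamps without modifying the original column
parsed = pd.to_datetime(sample["pickup_dt"])

# True only if every timestamp has zero minutes and zero seconds
on_the_hour = bool(((parsed.dt.minute == 0) & (parsed.dt.second == 0)).all())
print(on_the_hour)
```

If this prints `True` on the full dataset, pickup_dt can safely be treated as an hourly bucket rather than an exact pickup time.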

In [7]:
# Look at the tail
df.tail()

Unnamed: 0,pickup_dt,borough,pickups,spd,vsb,temp,dewp,slp,pcp01,pcp06,pcp24,sd,hday
29096,2015-06-30 23:00:00,EWR,0,7.0,10.0,75.0,65.0,1011.8,0.0,0.0,0.0,0.0,N
29097,2015-06-30 23:00:00,Manhattan,3828,7.0,10.0,75.0,65.0,1011.8,0.0,0.0,0.0,0.0,N
29098,2015-06-30 23:00:00,Queens,580,7.0,10.0,75.0,65.0,1011.8,0.0,0.0,0.0,0.0,N
29099,2015-06-30 23:00:00,Staten Island,0,7.0,10.0,75.0,65.0,1011.8,0.0,0.0,0.0,0.0,N
29100,2015-06-30 23:00:00,,3,7.0,10.0,75.0,65.0,1011.8,0.0,0.0,0.0,0.0,N


**Observations:**

* The head shows that the data begins on January 1, 2015, and the tail shows that it runs through June 30, 2015. This means we have **six months' worth of data to analyze**.
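Reading the range off the head and tail assumes the rows are sorted by time. A more robust check, sketched here under that caveat, takes the minimum and maximum of the parsed timestamps. The three timestamps below are copied from the outputs above; `sample`, `start`, `end`, and `span_days` are illustrative names, and on the real data the same calls would be applied to `df['pickup_dt']`.

```python
import pandas as pd

# First and last timestamps as they appear in the head/tail outputs above
# (hand-typed sample for illustration only)
sample = pd.DataFrame({
    "pickup_dt": [
        "2015-01-01 01:00:00",
        "2015-01-01 02:00:00",
        "2015-06-30 23:00:00",
    ]
})

parsed = pd.to_datetime(sample["pickup_dt"])

# Earliest and latest timestamps, independent of row order
start, end = parsed.min(), parsed.max()

# Number of whole days covered by the data
span_days = (end - start).days
print(start, end, span_days)
```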

## **Checking the shape of the dataset**

In [8]:
df.shape

(29101, 13)

## **Checking the info()**

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29101 entries, 0 to 29100
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pickup_dt  29101 non-null  object 
 1   borough    26058 non-null  object 
 2   pickups    29101 non-null  int64  
 3   spd        29101 non-null  float64
 4   vsb        29101 non-null  float64
 5   temp       29101 non-null  float64
 6   dewp       29101 non-null  float64
 7   slp        29101 non-null  float64
 8   pcp01      29101 non-null  float64
 9   pcp06      29101 non-null  float64
 10  pcp24      29101 non-null  float64
 11  sd         29101 non-null  float64
 12  hday       29101 non-null  object 
dtypes: float64(9), int64(1), object(3)
memory usage: 2.9+ MB


**Observations:**

* All columns have 29,101 non-null observations except borough, which has 26,058. This means 3,043 borough values are missing.
* pickup_dt is read as an 'object' data type, but it should be converted to a datetime type.
* borough and hday (holiday) are also read as 'object' and should be converted to categorical variables.
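The observations above suggest two follow-up steps: counting the missing values per column and converting the mistyped columns. The following is a minimal sketch of one way to do both, run on a tiny hand-typed sample rather than the full dataset. The names `sample`, `missing`, and `converted` are illustrative, and whether to convert in place or on a copy is a design choice this sketch does not settle for the notebook.

```python
import numpy as np
import pandas as pd

# Tiny illustrative sample based on rows shown earlier, including one
# row with a missing borough (assumption: for demonstration only)
sample = pd.DataFrame({
    "pickup_dt": [
        "2015-01-01 01:00:00",
        "2015-01-01 01:00:00",
        "2015-01-01 02:00:00",
    ],
    "borough": ["Staten Island", np.nan, "Bronx"],
    "pickups": [6, 4, 120],
    "hday": ["Y", "Y", "Y"],
})

# Count missing values in each column
missing = sample.isnull().sum()
print(missing)

# Convert on a copy so the sample itself stays untouched
converted = sample.copy()
converted["pickup_dt"] = pd.to_datetime(converted["pickup_dt"])
converted["borough"] = converted["borough"].astype("category")
converted["hday"] = converted["hday"].astype("category")
print(converted.dtypes)
```

Applied to the full `df`, the missing-value count for borough should match the 3,043 gap implied by the `info()` output. Note that `astype("category")` keeps missing values as NaN rather than turning them into a category.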