# Uber Picking Up in New York Analysis

## Zhang Qinhao - z5263046

### 1. Introduction

This analysis is based mainly on the Uber pickup data provided by Data.world. I chose this dataset because living costs are rising and city traffic is congested, so Uber needs to allocate its drivers across areas and times more sensibly. The aim is for the findings to benefit both Uber and its passengers.

### 2. Importing necessary modules and dataset

In [None]:
import os  # needed for the os.path.isfile check in the next cell
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import warnings; warnings.simplefilter('ignore')
from matplotlib import style
%matplotlib inline

In [None]:
if os.path.isfile("uber-raw-data-may14.csv"):
    filepath = "uber-raw-data-may14.csv"
    print("loading from file")
else:
    # Use the raw-content URL; the github.com "blob" page returns HTML, not CSV
    filepath = "https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/uber-trip-data/uber-raw-data-may14.csv"
    print("loading from the internet")

uber_df = pd.read_csv(filepath)

print("done")
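
`pd.read_csv` expects CSV content, while a github.com "blob" address serves an HTML page, so the raw-content form of the URL is generally the one to pass. The sketch below shows that conversion; `to_raw_github_url` is a hypothetical helper written for illustration, not part of pandas or of the original notebook.

```python
def to_raw_github_url(url: str) -> str:
    """Convert a github.com '/blob/' page URL into its raw-content form.

    Hypothetical helper for illustration; URLs that do not match the
    expected pattern are returned unchanged.
    """
    prefix = "https://github.com/"
    if not url.startswith(prefix) or "/blob/" not in url:
        return url
    # "owner/repo/blob/branch/path" -> "owner/repo/branch/path"
    path = url[len(prefix):].replace("/blob/", "/", 1)
    return "https://raw.githubusercontent.com/" + path


page_url = ("https://github.com/fivethirtyeight/uber-tlc-foil-response/"
            "blob/master/uber-trip-data/uber-raw-data-may14.csv")
raw_url = to_raw_github_url(page_url)
print(raw_url)
```

Local file names pass through unchanged, so the same helper could wrap either branch of the loading cell.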

### 3. Data cleaning

#### First, we take a quick look at the dataset: 652,435 rows and 4 columns in total. It is a raw dataset, and some details are not available for privacy reasons.

In [None]:
uber_df

In [None]:
uber_df.tail()

Checking the dataset information

In [None]:
uber_df.info()

In [None]:
type(uber_df)

Setting the date and time as the index

In [None]:
uber_df.set_index(uber_df['Date/Time'],inplace = True)
uber_df.index = pd.to_datetime(uber_df.index)
del uber_df['Date/Time']
uber_df.head()

Splitting the datetime index into separate columns to make the data easier to process

In [None]:
uber_df['Date'] = uber_df.index.day
uber_df['Weekday'] = uber_df.index.weekday
uber_df['Hour'] = uber_df.index.hour
uber_df['Minute'] = uber_df.index.minute
uber_df.head()
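
To make the meaning of the four derived columns concrete, here is a minimal standard-library sketch of the same split for a single timestamp. The format string and the sample value are assumptions about how the `Date/Time` column is laid out, so check them against the actual data. Note that the weekday number follows the Monday=0 convention, which is also what the pandas index uses.

```python
from datetime import datetime

# Assumed timestamp layout (month/day/year hour:minute:second); check it
# against the actual 'Date/Time' column before relying on it.
ASSUMED_FORMAT = "%m/%d/%Y %H:%M:%S"


def split_timestamp(text: str) -> dict:
    """Return the same four fields the notebook derives from the index."""
    ts = datetime.strptime(text, ASSUMED_FORMAT)
    return {
        "Date": ts.day,           # day of the month, 1-31
        "Weekday": ts.weekday(),  # Monday=0 ... Sunday=6
        "Hour": ts.hour,          # 0-23
        "Minute": ts.minute,      # 0-59
    }


fields = split_timestamp("5/1/2014 0:02:00")  # made-up sample value
print(fields)
```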

### 4. Data Visualization

# a. Uber Daily Statistics

First, we count the Uber pickups for each day of May and draw a line chart of the daily counts to show how usage changes. Next, we add a 5-day moving average as a trend line to see how Uber usage develops over the month.

In [None]:
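# Count pickups per day of the month; groupby returns the days in ascending order
Daily = uber_df.groupby('Date').size()
Date_count_list = [(int(day), int(n)) for day, n in Daily.items()]
print (Date_count_list)
# Column 0 holds the day of the month, column 1 the number of pickups
Date_count = pd.DataFrame(Date_count_list)
# 5-day simple moving average (undefined for the first four days)
Date_count['SMA_5'] = Date_count[1].rolling(5).mean()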

sns.set()  # seaborn styling; the 'seaborn' style name is unavailable in recent matplotlib releases
plt.figure(figsize = (10,8))
Date_line, = plt.plot(Date_count[0],Date_count[1],'--',marker = 'o')
Moving_line, = plt.plot(Date_count[0],Date_count['SMA_5'],'r')
plt.ylabel('Uber Frequency',size = 20)
plt.xlabel('Date of May',size = 20)
plt.title('Statistics of Uber Driving Count/Frequency in May',size = 20)
plt.axis([0,31,0,40000])
plt.xticks(Date_count[0], fontsize=9)
plt.legend(loc='best',handles=[Date_line,Moving_line],labels=['Daily Frequency','Five Days Moving Average'])
plt.tight_layout()
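
As a rough illustration of what the `rolling(5).mean()` call computes, the sketch below implements a trailing simple moving average in plain Python. The function name and the daily counts are made up for the example; the point is only that each value averages the current day and the four days before it, and that the first four days have no value yet.

```python
def simple_moving_average(values, window=5):
    """Trailing mean over `window` points; None until the window is full."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)  # pandas reports NaN for these positions
        else:
            chunk = values[i + 1 - window : i + 1]
            result.append(sum(chunk) / window)
    return result


daily_counts = [10, 20, 30, 40, 50, 60, 70]  # made-up numbers
sma_5 = simple_moving_average(daily_counts, window=5)
print(sma_5)  # [None, None, None, None, 30.0, 40.0, 50.0]
```

This is why the red trend line in the chart only starts on the fifth day of the month.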

The 5-day moving average shows that Uber usage in the middle of the month was lower than at the beginning of the month, and that usage rose sharply towards the end of the month. Next, we rank the days by usage.

In [None]:
Date_count = Date_count.fillna(0)
Date_count.sort_values(by = 1,ascending = False).head(5)

The days with the lowest Uber usage fall in the middle of the month

In [None]:
Date_count.sort_values(by = 1).head(5)

# b. Uber Hourly statistics

We expect Uber usage to be higher during the morning and evening hours, so we draw a histogram of pickups by hour.

In [None]:
Hour_count = uber_df['Hour']
Hour_count.hist(bins=24, figsize=(10,8),range= (0,24),color = '#87CEFA')
plt.xticks(range(0,24))  # hours run from 0 to 23

def counts(i):
    return len(i)
Hour = uber_df.groupby('Hour').apply(counts)
plt.axhline(y = Hour.mean(), color='r')
plt.plot(Hour,marker= 'o',color = 'b')
plt.ylabel('Uber Hour Frequency',size = 20)
plt.xlabel('Hour',size = 20)
plt.title('Statistics of Uber Driving Count/Frequency by Hour',size = 20)
plt.tight_layout()

Usage from 4 p.m. to 9 p.m. is significantly higher than the hourly average, which supports our expectation for the evening hours.
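
One way to back this reading with numbers rather than by eye is to list the hours whose count exceeds the mean hourly count (the red line in the chart). The sketch below does this in plain Python on made-up hour values; `hours_above_average` is a hypothetical helper, and applying the same idea to the notebook's `Hour` series is left as an assumption.

```python
from collections import Counter


def hours_above_average(hours):
    """Return the hours whose pickup count exceeds the mean hourly count."""
    per_hour = Counter(hours)
    # Mean over the hours that actually occur, mirroring Hour.mean()
    mean_count = sum(per_hour.values()) / len(per_hour)
    return sorted(h for h, n in per_hour.items() if n > mean_count)


sample_hours = [8, 8, 17, 17, 17, 18, 18, 18, 18, 3]  # made-up values
busy_hours = hours_above_average(sample_hours)
print(busy_hours)  # [17, 18]
```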

# c. Weekday Uber Frequency

We also need to look at usage by day of the week. Uber usage from Monday to Friday is significantly higher than at weekends, which suggests that people tend to use Uber more for commuting and may prefer to drive themselves at weekends.

In [None]:
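# pandas numbers weekdays Monday=0 ... Sunday=6; shift to Monday=1 ... Sunday=7.
# (Setting only the zeros to 7 would relabel Monday as day 7 and leave Tuesday as day 1.)
uber_df['Weekday'] = uber_df.index.weekday + 1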

Weekday_count = uber_df['Weekday']
Weekday_count.hist(bins=7, figsize=(10,8),range= (1,8),color = 'orange')
plt.xticks(range(1,8))

Weekday = uber_df.groupby('Weekday').apply(counts)
plt.axhline(y = Weekday.mean(), color='r')
plt.plot(Weekday,marker= 'o',color = 'b')
plt.ylabel('Uber Weekday Frequency',size = 20)
plt.xlabel('Weekday',size = 20)
plt.title('Statistics of Uber Driving Count/Frequency by Weekday',size = 20)
plt.tight_layout()
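
Weekday numbers are easy to misread, so a small check against known calendar dates can help. The sketch below uses only the standard library: `weekday()` follows the Monday=0 convention that pandas also uses, and `isoweekday()` gives a Monday=1 to Sunday=7 numbering. The helper name `weekday_label` is made up for this example.

```python
import calendar
from datetime import date


def weekday_label(d: date) -> tuple:
    """Return (number, name) with Monday=1 ... Sunday=7."""
    number = d.isoweekday()                 # same as d.weekday() + 1
    name = calendar.day_name[d.weekday()]   # day_name is indexed Monday=0
    return number, name


sunday = weekday_label(date(2014, 5, 4))
monday = weekday_label(date(2014, 5, 5))
print(sunday, monday)
```

With this numbering, values 1 to 5 on the x-axis are the working days and 6 and 7 are the weekend.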

# d. Heat map of hourly Uber pickups by weekday

Finally, for the time-based analysis, we combine the weekday and the hour and draw a heat map to see the Uber usage for each hour of each day of the week.

In [None]:
Corr_data = uber_df.groupby('Weekday Hour'.split()).apply(counts).unstack()
Corr_data
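
The table behind the heat map is a count of rows per (weekday, hour) pair, pivoted so that hours become columns. A minimal sketch of that reshaping on a tiny made-up frame follows. It uses `size()` in place of the notebook's `apply(counts)` and fills missing combinations with 0, both of which are choices made for this example rather than the notebook's own code.

```python
import pandas as pd

# Made-up pickups: one row per ride, with the two derived columns.
rides = pd.DataFrame({
    "Weekday": [1, 1, 1, 2, 2, 7],
    "Hour":    [8, 8, 17, 8, 17, 23],
})

# Count rows per (Weekday, Hour) pair, then pivot Hour into columns.
table = rides.groupby(["Weekday", "Hour"]).size().unstack(fill_value=0)
print(table)
```

Each row of `table` is one weekday and each column one hour, which is the layout `sns.heatmap` draws directly.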

In [None]:
plt.figure(figsize =(10,8))
sns.heatmap(Corr_data)
plt.title('Heat Map of Uber Frequency by Weekday and Hour',size =20)
plt.text(0,8,'Lighter colours indicate more Uber rides',size =18)
plt.tight_layout()

# e. Uber Analysis based on Base

Count the pickups for each Base and sort the result

In [None]:
Base_data = pd.DataFrame(uber_df.groupby('Base').apply(counts))
Base_data.sort_values(by = 0)

Draw a bar chart of pickup frequency by Base code to see which base handles more Uber pickups

In [None]:
plt.figure(figsize=(10, 8))
plt.bar(Base_data.index,Base_data[0],label ='Frequency',color ='#87CEFA')
plt.xlabel('Base')
plt.ylabel('Frequency')
plt.title('Statistics of Uber Driving Count/Frequency by Base')
plt.legend(loc="upper right")
plt.tight_layout()

#### Create a heat map showing that different bases have different usage patterns across the days of the week.

In [None]:
Base1 = uber_df.groupby('Base Weekday'.split()).apply(counts).unstack()
Base1

In [None]:
plt.figure(figsize =(10,8))
sns.heatmap(Base1)
plt.title('Heat Map of Uber Usage by Weekday for Different Bases',size =20)
plt.tight_layout()

# f. Uber Analysis based on Location

Latitude and longitude are angles that uniquely define points on a sphere. Together, the two angles form a coordinate scheme that can locate or identify geographic positions on the surface of a planet such as the Earth.

# Uber Ride Longitude Histogram - May 2014

In [None]:
plt.hist(uber_df['Lon'], bins = 100, range = (-74.2,-73.84), color = "Purple", alpha = 0.7)
plt.xlabel('Longitude', size = 25)
plt.ylabel('Frequency', size = 25)
plt.xticks(size = 15)
plt.yticks(size = 20)
plt.title('Uber Ride Longitude Histogram - May 2014', size = 30);

# Uber Ride Latitude Histogram - May 2014

In [None]:
plt.hist(uber_df['Lat'], bins = 100, range = (40.57, 40.92), color = "Red", alpha = 0.7)
plt.xlabel('Latitude', size = 25)
plt.ylabel('Frequency', size = 25)
plt.xticks(size = 15)
plt.yticks(size = 20)
plt.title('Uber Ride Latitude Histogram - May 2014', size = 30);

# Uber Ride Longitude and Latitude Histogram - May 2014
#### Drawing latitude and longitude on the same graph better highlights the relationship between the two. The latitude distribution is negatively skewed; the longitude distribution is positively skewed.

In [None]:
plt.hist(uber_df['Lat'], bins = 100, range = (40.57, 40.92), rwidth = 1, color = "Red", alpha = 0.7, label = 'Latitude')
plt.xlabel("Latitude", size = 20)
plt.xticks(size = 15)
plt.yticks(size = 20)
plt.legend(loc="upper left")
plt.twiny()
plt.hist(uber_df['Lon'], bins = 100, range = (-74.2,-73.84), rwidth = 1, color = "Purple", alpha = 0.7, label = 'Longitude')
plt.xlabel("Longitude", size = 20)
plt.xticks(size = 15)
plt.legend(loc="upper right");
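
The asymmetry described for the two histograms can be quantified as skewness. Reading the text this way is an assumption about what is meant, and the sketch below is one common definition (the Fisher-Pearson coefficient based on population moments), shown on made-up numbers rather than on the Uber coordinates. pandas also offers `Series.skew()`, which applies a bias correction, so its values differ slightly from this formula.

```python
def sample_skewness(values):
    """Fisher-Pearson skewness g1 = m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


right_tailed = [1, 1, 1, 2, 2, 3, 10]  # made-up: long tail to the right
symmetric = [1, 2, 3, 4, 5]            # made-up: no asymmetry
print(sample_skewness(right_tailed))
print(sample_skewness(symmetric))
```

A positive result indicates a longer tail towards larger values, a negative result a longer tail towards smaller values.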

#### Finally, we use longitude and latitude as the horizontal and vertical coordinates to draw a rough map of Uber pickups in Manhattan. Interestingly, the result looks like a simplified map of Manhattan, which suggests that Uber usage in an area may reflect its degree of urban development.

As most of the locations are concentrated in a small area at the lower left, we draw that small area separately.

In [None]:
plt.figure(figsize = (12,10))
plt.plot(uber_df['Lon'], uber_df['Lat'],'.',ms =1, alpha = 0.7)
plt.xlim(-74.2, -73.82)
plt.ylim(40.57, 40.92)
plt.xlabel('Longitude', size = 20)
plt.ylabel('Latitude', size = 20)
plt.xticks(size = 20)
plt.yticks(size = 20)
plt.title('Uber pickup plot in Manhattan (Location)', size = 20)
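
To check how much of the data the zoomed plot actually covers, one could compute the share of pickups that fall inside the plot window. The sketch below does this for a handful of made-up coordinates, using the same bounds as the `xlim` and `ylim` calls; `share_inside_window` is a hypothetical helper, not part of the notebook.

```python
# Plot window taken from the xlim/ylim calls above.
LON_MIN, LON_MAX = -74.2, -73.82
LAT_MIN, LAT_MAX = 40.57, 40.92


def share_inside_window(points):
    """Fraction of (lon, lat) points that fall inside the plot window."""
    inside = sum(
        1 for lon, lat in points
        if LON_MIN <= lon <= LON_MAX and LAT_MIN <= lat <= LAT_MAX
    )
    return inside / len(points)


sample_points = [      # made-up coordinates
    (-73.99, 40.75),
    (-73.95, 40.78),
    (-74.50, 40.70),   # too far west
    (-73.90, 41.10),   # too far north
]
share = share_inside_window(sample_points)
print(share)  # 0.5
```

A share close to 1 on the real data would support the statement that most locations lie within this small area.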


#### Hexbin diagram (test)

In [None]:
plt.figure(figsize = (12,10))
x = uber_df['Lon']
y = uber_df['Lat']
plt.hexbin(x, y, gridsize = 500, cmap ='BuGn')  
plt.xlim(-74.2, -73.82)
plt.ylim(40.57, 40.92)
plt.xlabel('Longitude', size = 20)
plt.ylabel('Latitude', size = 20)
plt.xticks(size = 20)
plt.yticks(size = 20)
plt.title('Uber pickup hexbin diagram in Manhattan (Location)', size = 20)

# Thank you!