In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Part 1 Exploratory Data Analysis


In [2]:
data_df = pd.read_json('logins.json')

In [3]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93142 entries, 0 to 93141
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   login_time  93142 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 727.8 KB


In [4]:
# Display the frame; use data_df.describe() (with parentheses) for summary statistics
data_df

               login_time
0     1970-01-01 20:13:18
1     1970-01-01 20:16:10
2     1970-01-01 20:16:37
3     1970-01-01 20:16:36
4     1970-01-01 20:26:21
...                   ...
93137 1970-04-13 18:50:19
93138 1970-04-13 18:43:56
93139 1970-04-13 18:54:02
93140 1970-04-13 18:57:38
93141 1970-04-13 18:54:23

[93142 rows x 1 columns]

In [5]:
data_df['login_time'].dtype

dtype('<M8[ns]')

In [6]:
data_df.set_index('login_time', inplace=True)

In [7]:
data_df.head()

1970-01-01 20:13:18
1970-01-01 20:16:10
1970-01-01 20:16:37
1970-01-01 20:16:36
1970-01-01 20:26:21


In [8]:
data_df.index.dtype

dtype('<M8[ns]')

In [9]:
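# Tag each login with a count of 1 so that resampling can sum logins per time bin
data_df['count'] = 1
# The raw logins are not in chronological order, so sort by the datetime index
sorted_df = data_df.sort_index()
sorted_df.tail(15)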

login_time,count
1970-04-13 18:16:48,1
1970-04-13 18:18:26,1
1970-04-13 18:35:43,1
1970-04-13 18:36:53,1
1970-04-13 18:36:55,1
1970-04-13 18:40:31,1
1970-04-13 18:40:40,1
1970-04-13 18:43:19,1
1970-04-13 18:43:56,1
1970-04-13 18:46:06,1


In [10]:
# Sum logins in 15-minute bins ('15min' replaces the deprecated '15T' alias)
resampled_df = sorted_df.resample('15min').sum()
resampled_df

                     count
login_time                
1970-01-01 20:00:00      2
1970-01-01 20:15:00      6
1970-01-01 20:30:00      9
1970-01-01 20:45:00      7
1970-01-01 21:00:00      1
...                    ...
1970-04-13 17:45:00      5
1970-04-13 18:00:00      5
1970-04-13 18:15:00      2
1970-04-13 18:30:00      7
1970-04-13 18:45:00      6

[9788 rows x 1 columns]

In [11]:
resampled_df = resampled_df.reset_index()
resampled_df['login_time'].dtype

dtype('<M8[ns]')

In [12]:
# Logins per week (weeks end on Sunday); despite the variable name, these are counts, not means
weekly_mean = sorted_df.resample('W').count()
weekly_mean

            count
login_time       
1970-01-04   2374
1970-01-11   5217
1970-01-18   5023
1970-01-25   4751
1970-02-01   4744
1970-02-08   5572
1970-02-15   5915
1970-02-22   7035
1970-03-01   6554
1970-03-08   7398
1970-03-15   7338
1970-03-22   8955
1970-03-29   7285
1970-04-05   8095
1970-04-12   6491
1970-04-19    395

In [13]:
weekly_mean = weekly_mean.reset_index()
weekly_mean['login_time'].dtype

dtype('<M8[ns]')

In [14]:
resampled_df.to_csv('ultimate.csv')

![Login dashboard](ultimate_dashboard_2.png)

![Login counts by hour](ultimate_counts_by_hour.png)

## Part 1 Exploratory Data Analysis Conclusion:
No data quality issues are apparent. All logins fall between 1970-01-01 20:12:16 and 1970-04-13 18:57:38.
There are no null values, and all login values have dtype `datetime64[ns]` (`'<M8[ns]'`).
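
The checks behind these statements can be made explicit. The following is a minimal sketch, assuming a small hand-written stand-in for the `login_time` column; the helper name `quality_report` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the `login_time` column; the real values come from logins.json.
logins = pd.Series(
    pd.to_datetime([
        "1970-01-01 20:13:18",
        "1970-01-01 20:16:10",
        "1970-01-01 20:16:37",
        "1970-01-01 20:16:36",
        "1970-01-01 20:16:36",  # deliberate duplicate for illustration
    ]),
    name="login_time",
)


def quality_report(times: pd.Series) -> dict:
    """Summarize basic quality checks for a datetime Series."""
    return {
        "n_rows": len(times),
        "n_null": int(times.isna().sum()),
        "n_duplicated": int(times.duplicated().sum()),
        "is_sorted": bool(times.is_monotonic_increasing),
        "first": times.min(),
        "last": times.max(),
    }


report = quality_report(logins)
print(report)
```

Duplicate timestamps are not necessarily errors here, since two users can log in during the same second; the count is only a prompt to look closer.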

Login counts trend upward over the period covered by the data. There are also strong weekly cycles: Fridays, Saturdays, and Sundays have consistently higher login counts than other days. Viewing login counts within the day, both aggregated and unaggregated, shows a daily pattern with low login counts in the early morning, afternoon, and evening.

Note: In the two charts that look at weeks, the first and final (incomplete) weeks are not shown. 
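
The weekly and daily patterns described above were read from the dashboard charts. One way to reproduce them in pandas is sketched below, assuming a frame shaped like `resampled_df` after `reset_index()` (columns `login_time` and `count`). The sample rows and the helper names `logins_by_weekday` and `logins_by_hour` are hypothetical.

```python
import pandas as pd

# Hypothetical 15-minute counts shaped like `resampled_df`; the values are made up.
demo = pd.DataFrame({
    "login_time": pd.to_datetime([
        "1970-01-02 04:00:00",  # Friday
        "1970-01-02 22:15:00",  # Friday
        "1970-01-03 22:30:00",  # Saturday
        "1970-01-05 11:45:00",  # Monday
    ]),
    "count": [1, 9, 12, 4],
})


def logins_by_weekday(df: pd.DataFrame) -> pd.Series:
    """Total logins per day of the week."""
    day = df["login_time"].dt.day_name()
    return df.groupby(day)["count"].sum()


def logins_by_hour(df: pd.DataFrame) -> pd.Series:
    """Total logins per hour of the day (0-23)."""
    hour = df["login_time"].dt.hour
    return df.groupby(hour)["count"].sum()


by_weekday = logins_by_weekday(demo)
by_hour = logins_by_hour(demo)
print(by_weekday)
print(by_hour)
```

Dividing each total by the number of times that weekday or hour occurs in the data would give averages instead of totals, which is fairer when the first and last weeks are incomplete.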


# Part 2 Experiment and Metrics Design

In [16]:
pd.read_json('ultimate_data_challenge.json')

,city,trips_in_first_30_days,signup_date,avg_rating_of_driver,avg_surge,last_trip_date,phone,surge_pct,ultimate_black_user,weekday_pct,avg_dist,avg_rating_by_driver
0,King's Landing,4,2014-01-25,4.7,1.10,2014-06-17,iPhone,15.4,True,46.2,3.67,5.0
1,Astapor,0,2014-01-29,5.0,1.00,2014-05-05,Android,0.0,False,50.0,8.26,5.0
2,Astapor,3,2014-01-06,4.3,1.00,2014-01-07,iPhone,0.0,False,100.0,0.77,5.0
3,King's Landing,9,2014-01-10,4.6,1.14,2014-06-29,iPhone,20.0,True,80.0,2.36,4.9
4,Winterfell,14,2014-01-27,4.4,1.19,2014-03-15,Android,11.8,False,82.4,3.13,4.9
...,...,...,...,...,...,...,...,...,...,...,...,...
49995,King's Landing,0,2014-01-25,5.0,1.00,2014-06-05,iPhone,0.0,False,100.0,5.63,4.2
49996,Astapor,1,2014-01-24,,1.00,2014-01-25,iPhone,0.0,False,0.0,0.00,4.0
49997,Winterfell,0,2014-01-31,5.0,1.00,2014-05-22,Android,0.0,True,100.0,3.86,5.0
49998,Astapor,2,2014-01-14,3.0,1.00,2014-01-15,iPhone,0.0,False,100.0,4.58,3.5


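1) The purpose of the experiment is to encourage drivers to serve both cities, so a key measure of success could be the ratio of trips a driver makes in each city, computed per driver per working day. A ratio closer to even would suggest that drivers are splitting their time between the two cities rather than staying in one.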
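
One possible way to turn this metric into a number is sketched below. It assumes a trip log with a driver ID, a working day, and a city for each trip. The records, the city labels `city_a` and `city_b`, and the helper names are all hypothetical, and the "share of trips in the less-visited city" is just one reasonable reading of the ratio described above.

```python
from collections import defaultdict

# Hypothetical trip log: (driver_id, working_day, city). None of these values
# come from the source data.
trips = [
    ("d1", "2014-02-03", "city_a"),
    ("d1", "2014-02-03", "city_a"),
    ("d1", "2014-02-03", "city_b"),
    ("d1", "2014-02-04", "city_a"),
    ("d2", "2014-02-03", "city_b"),
    ("d2", "2014-02-03", "city_b"),
]


def minority_city_share(trip_log):
    """Return {(driver, day): share of that day's trips in the less-visited city}."""
    counts = defaultdict(lambda: defaultdict(int))
    for driver, day, city in trip_log:
        counts[(driver, day)][city] += 1
    shares = {}
    for key, per_city in counts.items():
        total = sum(per_city.values())
        # A driver-day with trips in only one city gets a share of 0.
        minority = min(per_city.values()) if len(per_city) > 1 else 0
        shares[key] = minority / total
    return shares


def mean_share_per_driver(shares):
    """Average the daily shares for each driver."""
    by_driver = defaultdict(list)
    for (driver, _day), share in shares.items():
        by_driver[driver].append(share)
    return {driver: sum(vals) / len(vals) for driver, vals in by_driver.items()}


daily = minority_city_share(trips)
per_driver = mean_share_per_driver(daily)
print(per_driver)
```

A share of 0 means the driver stayed in one city that day, and 0.5 means an even split, so higher values in the treatment group would count as success.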

2)
  a) The proposed experiment is to offer toll reimbursement to a percentage of drivers. The drivers should be selected randomly, for example by applying a random number generator to driver ID numbers. The number of drivers should be determined with a sample size calculation. The experiment should run for 4-6 weeks to average out day-to-day variance, allow time for drivers to change their behavior, and capture weekly patterns.
  b) 
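
The random selection and sample size calculation described in 2a can be sketched with the Python standard library. This is an illustration under stated assumptions, not the design used in the study. The function names, the seed, and the example effect size and standard deviation are hypothetical. The sample size formula is the usual normal approximation for a two-sided comparison of two means.

```python
import math
import random
from statistics import NormalDist


def assign_treatment(driver_ids, fraction, seed=0):
    """Randomly pick a fixed fraction of drivers for toll reimbursement."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    ids = sorted(driver_ids)
    k = round(len(ids) * fraction)
    return set(rng.sample(ids, k))


def sample_size_per_group(effect, sd, alpha=0.05, power=0.8):
    """Drivers needed per group to detect a difference in means of `effect`.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / effect)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


# Hypothetical inputs: 100 driver IDs, 20% treated, and a target effect of
# half a standard deviation in the success metric.
treated = assign_treatment(range(100), fraction=0.2)
n_needed = sample_size_per_group(effect=0.5, sd=1.0)
print(len(treated), n_needed)
```

In practice the standard deviation would be estimated from pre-experiment data on the chosen metric, and the effect size would be the smallest change worth the cost of reimbursing tolls.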


   


# Part 3 Predictive Modeling