# Exploring bike share data

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description
> This is randomly selected data related to bike share systems for the first six months of 2017 in three large cities: Chicago, New York City, and Washington, DC.

|   **Column**  |                    **Description**                    |
|:-------------:|:-----------------------------------------------------:|
|       X       |                        Trip ID                        |
|   Start.Time  |                Trip start day and time                |
|    End.Time   |                 Trip end day and time                 |
| Trip.Duration |              Duration of trip in seconds              |
| Start.Station |                   Trip start station                  |
|  End.Station  |                    Trip end station                   |
|   User.Type   |          Rider type (Subscriber or Customer)          |
|     Gender    |    Male or Female (Chicago and New York City only)    |
|   Birth.Year  | User's year of birth (Chicago and New York City only) |

### Questions for Analysis

- What is the most common month for bike share use?
- What is the most common start station?
- What are the counts of each user type?

### Import necessary libraries

In [1]:
library(tidyverse)
library(lubridate)

Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
Registered S3 method overwritten by 'rvest':
  method            from
  read_xml.response xml2
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.1.1       v purrr   0.3.2  
v tibble  2.1.1       v dplyr   0.8.0.1
v tidyr   0.8.3       v stringr 1.4.0  
v readr   1.3.1       v forcats 0.4.0  
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Attaching package: 'lubridate'

The following object is masked from 'package:base':

    date



### Read in csv files

In [2]:
ny <- read.csv('./data/new_york_city.csv')
wash <- read.csv('./data/washington.csv')
chi <- read.csv('./data/chicago.csv')

<a id='wrangling'></a>
## Data Wrangling

<ul>
    <li><a href="#ny">New York City</a></li>
    <li><a href="#chi">Chicago</a></li>
    <li><a href="#wash">Washington, DC</a></li>
</ul>

<a id='ny'></a>
### Explore the NY dataset

In [3]:
# Find number of rows and columns
dim(ny) 

In [4]:
# Check column names
names(ny)

In [5]:
# Check top 6 rows of dataset
head(ny)

X,Start.Time,End.Time,Trip.Duration,Start.Station,End.Station,User.Type,Gender,Birth.Year
5688089,2017-06-11 14:55:05,2017-06-11 15:08:21,795,Suffolk St & Stanton St,W Broadway & Spring St,Subscriber,Male,1998
4096714,2017-05-11 15:30:11,2017-05-11 15:41:43,692,Lexington Ave & E 63 St,1 Ave & E 78 St,Subscriber,Male,1981
2173887,2017-03-29 13:26:26,2017-03-29 13:48:31,1325,1 Pl & Clinton St,Henry St & Degraw St,Subscriber,Male,1987
3945638,2017-05-08 19:47:18,2017-05-08 19:59:01,703,Barrow St & Hudson St,W 20 St & 8 Ave,Subscriber,Female,1986
6208972,2017-06-21 07:49:16,2017-06-21 07:54:46,329,1 Ave & E 44 St,E 53 St & 3 Ave,Subscriber,Male,1992
1285652,2017-02-22 18:55:24,2017-02-22 19:12:03,998,State St & Smith St,Bond St & Fulton St,Subscriber,Male,1986


In [6]:
# Check bottom 6 rows of dataset
tail(ny) 

Unnamed: 0,X,Start.Time,End.Time,Trip.Duration,Start.Station,End.Station,User.Type,Gender,Birth.Year
54765,1293888,2017-02-23 06:14:14,2017-02-23 06:23:32,558.0,E 27 St & 1 Ave,E 47 St & Park Ave,Subscriber,Male,1984.0
54766,642855,2017-01-28 16:44:18,2017-01-28 16:48:18,240.0,W 52 St & 9 Ave,9 Ave & W 45 St,Subscriber,Male,1991.0
54767,2157959,2017-03-29 06:30:35,2017-03-29 06:32:41,125.0,W 84 St & Columbus Ave,W 87 St & Amsterdam Ave,Subscriber,Male,1984.0
54768,5679624,2017-06-11 12:52:27,2017-06-11 12:58:35,367.0,8 Ave & W 33 St,W 45 St & 8 Ave,Subscriber,Male,1954.0
54769,6762960,2017-06-30 07:48:34,2017-06-30 08:17:16,1722.0,Cathedral Pkwy & Broadway,Broadway & W 51 St,Subscriber,Male,1974.0
54770,6078570,2017-06-18 16:20:21,201,,,,,,


In [7]:
# Check structure of data including datatypes
str(ny)

'data.frame':	54770 obs. of  9 variables:
 $ X            : int  5688089 4096714 2173887 3945638 6208972 1285652 1675753 1692245 2271331 1558339 ...
 $ Start.Time   : Factor w/ 54568 levels "2017-01-01 00:17:01",..: 45448 32799 17316 31589 49688 10220 13390 13509 18111 12449 ...
 $ End.Time     : Factor w/ 54562 levels "201","2017-01-01 00:30:56",..: 45432 32783 17295 31567 49668 10204 13364 13505 18092 12422 ...
 $ Trip.Duration: int  795 692 1325 703 329 998 478 4038 5132 309 ...
 $ Start.Station: Factor w/ 636 levels "","1 Ave & E 16 St",..: 522 406 10 93 5 521 325 309 151 245 ...
 $ End.Station  : Factor w/ 638 levels "","1 Ave & E 16 St",..: 613 8 362 558 269 107 389 110 151 243 ...
 $ User.Type    : Factor w/ 3 levels "","Customer",..: 3 3 3 3 3 3 3 3 2 3 ...
 $ Gender       : Factor w/ 3 levels "","Female","Male": 3 3 3 2 3 3 3 3 1 3 ...
 $ Birth.Year   : num  1998 1981 1987 1986 1992 ...


`Start.Time` and `End.Time` are both stored as factors and will need to be converted to dates.

`End.Time` contains the value `201` instead of a date and time.

`User.Type` has 3 levels. We expect only Customer and Subscriber; the third level is an empty string, so some values are missing.

`Gender` has 3 levels. We expect only Male and Female; the third level is an empty string, so some values are missing.
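
The blank factor levels noted above arise because `read.csv()` keeps empty text fields as empty strings rather than `NA`, while empty numeric fields do become `NA`. The sketch below uses a small made-up CSV (`toy_csv`, a hypothetical stand-in for the real files) to show one possible way to have blanks read as `NA`. It is an illustration under those assumptions, not the approach taken in this notebook.

```r
# Hypothetical three-row CSV standing in for the real trip files
toy_csv <- "X,End.Time,User.Type,Birth.Year
1,2017-06-11 15:08:21,Subscriber,1998
2,2017-04-02 09:28:08,Customer,
3,201,,"

# Default handling: blank text fields become "" rather than NA
raw <- read.csv(text = toy_csv, stringsAsFactors = TRUE)
levels(raw$User.Type)        # includes "" as a level
sum(is.na(raw$User.Type))    # 0: the blank is not counted as missing

# Alternative: treat blanks as NA and keep text columns as character
tidy <- read.csv(text = toy_csv, na.strings = c("", "NA"),
                 stringsAsFactors = FALSE)
sum(is.na(tidy$User.Type))   # 1
sum(is.na(tidy$Birth.Year))  # 2
```

With blanks read as `NA`, checks such as `complete.cases()` would also flag rows that are missing a text value.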

In [8]:
# Create object to store every row that has at least one missing value
ny_missing <- !complete.cases(ny)

# Show all the rows with missing data
ny[ny_missing,]

Unnamed: 0,X,Start.Time,End.Time,Trip.Duration,Start.Station,End.Station,User.Type,Gender,Birth.Year
9,2271331,2017-04-02 08:02:36,2017-04-02 09:28:08,5132,Central Park S & 6 Ave,Central Park S & 6 Ave,Customer,,
11,2287178,2017-04-02 14:37:20,2017-04-02 14:56:12,1131,Bank St & Washington St,Little West St & 1 Pl,Customer,,
20,5857,2017-01-01 13:32:39,2017-01-01 13:49:57,1038,W 22 St & 8 Ave,W 45 St & 6 Ave,Customer,,
24,2497952,2017-04-08 13:39:48,2017-04-08 14:04:24,1476,Dean St & Hoyt St,Plaza St West & Flatbush Ave,Customer,,
33,3676202,2017-05-02 21:43:28,2017-05-02 22:29:15,2746,Old Fulton St,Broadway & E 14 St,Customer,,
37,1975396,2017-03-22 08:56:43,2017-03-22 09:07:13,630,Broadway & W 29 St,E 17 St & Broadway,Customer,,
39,5630375,2017-06-10 14:03:43,2017-06-10 14:05:00,76,Bayard St & Baxter St,Bayard St & Baxter St,Customer,,
53,2897347,2017-04-16 15:23:43,2017-04-16 15:44:16,1233,Cleveland Pl & Spring St,S 5 Pl & S 4 St,Customer,,
61,3847598,2017-05-06 15:58:00,2017-05-06 16:31:17,1997,Front St & Maiden Ln,Old Fulton St,Customer,,
66,6018157,2017-06-17 08:06:57,2017-06-17 08:35:44,1727,Pier 40 - Hudson River Park,Pier 40 - Hudson River Park,Customer,,


Most of the missing values appear to come from `Gender` and `Birth.Year`.

The record with `X` ID 6078570 has missing or incorrect values in almost all columns. This record will need to be removed completely.
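
To check where missing values are concentrated, one option is to count them per column. The sketch below uses a made-up data frame (`toy`) and a hypothetical helper, `count_missing()`, that treats both `NA` and empty strings as missing; neither name appears in the original analysis.

```r
# Made-up rows that mimic the pattern seen above
toy <- data.frame(
  X          = c(1L, 2L, 3L, 4L),
  User.Type  = c("Subscriber", "Customer", "Customer", ""),
  Gender     = c("Male", "", "", ""),
  Birth.Year = c(1998, NA, NA, NA),
  stringsAsFactors = TRUE
)

# Hypothetical helper: count NA or blank values in each column
count_missing <- function(df) {
  sapply(df, function(col) sum(is.na(col) | as.character(col) == ""))
}

count_missing(toy)
```

On this toy data the counts are 0 for `X`, 1 for `User.Type`, and 3 each for `Gender` and `Birth.Year`.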

In [9]:
# Count duplicated rows
sum(duplicated(ny))

<a id='chi'></a>
### Explore the Chicago dataset

In [10]:
# Find number of rows and columns
dim(chi) 

In [11]:
# Check column names
names(chi)

In [12]:
# Check top 6 rows of dataset
head(chi)

X,Start.Time,End.Time,Trip.Duration,Start.Station,End.Station,User.Type,Gender,Birth.Year
1423854,2017-06-23 15:09:32,2017-06-23 15:14:53,321,Wood St & Hubbard St,Damen Ave & Chicago Ave,Subscriber,Male,1992
955915,2017-05-25 18:19:03,2017-05-25 18:45:53,1610,Theater on the Lake,Sheffield Ave & Waveland Ave,Subscriber,Female,1992
9031,2017-01-04 08:27:49,2017-01-04 08:34:45,416,May St & Taylor St,Wood St & Taylor St,Subscriber,Male,1981
304487,2017-03-06 13:49:38,2017-03-06 13:55:28,350,Christiana Ave & Lawrence Ave,St. Louis Ave & Balmoral Ave,Subscriber,Male,1986
45207,2017-01-17 14:53:07,2017-01-17 15:02:01,534,Clark St & Randolph St,Desplaines St & Jackson Blvd,Subscriber,Male,1975
1473887,2017-06-26 09:01:20,2017-06-26 09:11:06,586,Clinton St & Washington Blvd,Canal St & Taylor St,Subscriber,Male,1990


In [13]:
# Check bottom 6 rows of dataset
tail(chi)

Unnamed: 0,X,Start.Time,End.Time,Trip.Duration,Start.Station,End.Station,User.Type,Gender,Birth.Year
8625,397518,2017-03-24 16:52:16,2017-03-24 16:57:57,341,Southport Ave & Waveland Ave,Southport Ave & Waveland Ave,Subscriber,Male,1990.0
8626,879494,2017-05-18 05:06:50,2017-05-18 05:22:10,920,Artesian Ave & Hubbard St,Wacker Dr & Washington St,Subscriber,Male,1959.0
8627,360389,2017-03-19 07:21:29,2017-03-19 07:27:18,349,Wabash Ave & Roosevelt Rd,Wells St & Polk St,Subscriber,Male,1987.0
8628,858496,2017-05-16 17:03:24,2017-05-16 17:31:12,1668,Ashland Ave & Harrison St,Wells St & Concord Ln,Subscriber,Male,1963.0
8629,777620,2017-05-10 08:53:03,2017-05-10 08:54:32,89,Western Ave & Leland Ave,Western Ave & Leland Ave,Subscriber,Male,1977.0
8630,1230561,2017-06-11 14:52:13,2017-06-11 15:42:33,3020,Waba,,,,


In [14]:
# Check structure of data including datatypes
str(chi)

'data.frame':	8630 obs. of  9 variables:
 $ X            : int  1423854 955915 9031 304487 45207 1473887 961916 65924 606841 135470 ...
 $ Start.Time   : Factor w/ 8624 levels "2017-01-01 00:40:14",..: 7876 5303 73 1721 267 8173 5347 368 3376 795 ...
 $ End.Time     : Factor w/ 8625 levels "2017-01-01 00:46:32",..: 7876 5303 73 1722 267 8173 5346 368 3376 796 ...
 $ Trip.Duration: int  321 1610 416 350 534 586 281 723 689 493 ...
 $ Start.Station: Factor w/ 472 levels "2112 W Peterson Ave",..: 468 424 291 80 103 119 22 255 374 420 ...
 $ End.Station  : Factor w/ 471 levels "","2112 W Peterson Ave",..: 132 381 469 409 151 70 467 251 200 118 ...
 $ User.Type    : Factor w/ 3 levels "","Customer",..: 3 3 3 3 3 3 3 2 3 3 ...
 $ Gender       : Factor w/ 3 levels "","Female","Male": 3 2 3 3 3 3 2 1 3 3 ...
 $ Birth.Year   : num  1992 1992 1981 1986 1975 ...


`Start.Time` and `End.Time` are both stored as factors and will need to be converted to dates.

`End.Station` has an empty-string level, so at least one value is missing. This will need to be investigated when cleaning.

`User.Type` has 3 levels. We expect only Customer and Subscriber; the third level is an empty string, so some values are missing.

`Gender` has 3 levels. We expect only Male and Female; the third level is an empty string, so some values are missing.

In [15]:
# Create object to store every row that has at least one missing value
chi_missing <- !complete.cases(chi)

# Show all the rows with missing data
chi[chi_missing,]

Unnamed: 0,X,Start.Time,End.Time,Trip.Duration,Start.Station,End.Station,User.Type,Gender,Birth.Year
8,65924,2017-01-21 14:28:38,2017-01-21 14:40:41,723,Larrabee St & Kingsbury St,Larrabee St & Armitage Ave,Customer,,
20,475456,2017-04-08 11:37:55,2017-04-08 11:51:55,840,Adler Planetarium,Burnham Harbor,Customer,,
32,1539334,2017-06-30 10:56:50,2017-06-30 11:40:20,2610,McCormick Place,Adler Planetarium,Customer,,
36,243879,2017-02-22 15:33:56,2017-02-22 15:54:07,1211,Streeter Dr & Grand Ave,Theater on the Lake,Customer,,
39,720062,2017-05-03 16:27:08,2017-05-03 16:45:15,1087,Clark St & Elm St,Michigan Ave & Pearson St,Customer,,
41,1314009,2017-06-16 19:34:44,2017-06-16 20:16:23,2499,State St & Van Buren St,McClurg Ct & Erie St,Customer,,
45,1372709,2017-06-20 16:14:15,2017-06-20 16:42:26,1691,Streeter Dr & Grand Ave,Streeter Dr & Grand Ave,Customer,,
53,157790,2017-02-11 15:11:34,2017-02-11 16:30:04,4710,McCormick Place,Wabash Ave & Wacker Pl,Customer,,
57,1526760,2017-06-29 13:50:47,2017-06-29 14:10:04,1157,Lake Shore Dr & Belmont Ave,Lake Shore Dr & North Blvd,Customer,,
62,1539175,2017-06-30 10:44:24,2017-06-30 11:11:03,1599,Millennium Park,Streeter Dr & Grand Ave,Customer,,


Most of the missing values appear to come from `Gender` and `Birth.Year`.

The record with `X` ID 1230561 has a truncated `Start.Station` and is missing every value after it. This record will need to be removed completely.

In [16]:
# Count duplicated rows
sum(duplicated(chi))

<a id='wash'></a>
### Explore the Washington, DC dataset

In [17]:
# Find number of rows and columns
dim(wash) 

In [18]:
# Check column names
names(wash)

In [19]:
# Check top 6 rows of dataset
head(wash)

X,Start.Time,End.Time,Trip.Duration,Start.Station,End.Station,User.Type
1621326,2017-06-21 08:36:34,2017-06-21 08:44:43,489.066,14th & Belmont St NW,15th & K St NW,Subscriber
482740,2017-03-11 10:40:00,2017-03-11 10:46:00,402.549,Yuma St & Tenley Circle NW,Connecticut Ave & Yuma St NW,Subscriber
1330037,2017-05-30 01:02:59,2017-05-30 01:13:37,637.251,17th St & Massachusetts Ave NW,5th & K St NW,Subscriber
665458,2017-04-02 07:48:35,2017-04-02 08:19:03,1827.341,Constitution Ave & 2nd St NW/DOL,M St & Pennsylvania Ave NW,Customer
1481135,2017-06-10 08:36:28,2017-06-10 09:02:17,1549.427,Henry Bacon Dr & Lincoln Memorial Circle NW,Maine Ave & 7th St SW,Subscriber
1148202,2017-05-14 07:18:18,2017-05-14 07:24:56,398.0,1st & K St SE,Eastern Market Metro / Pennsylvania Ave & 7th St SE,Subscriber


In [20]:
# Check bottom 6 rows of dataset
tail(wash)

Unnamed: 0,X,Start.Time,End.Time,Trip.Duration,Start.Station,End.Station,User.Type
89046,1484340,2017-06-10 10:58:09,2017-06-10 11:25:58,1669.7,M St & New Jersey Ave SE,4th St & Madison Dr NW,Customer
89047,555788,2017-03-22 18:46:00,2017-03-22 19:04:00,1082.789,8th & H St NW,21st & I St NW,Subscriber
89048,739004,2017-04-09 04:00:22,2017-04-09 04:09:54,571.879,Eckington Pl & Q St NE,Columbus Circle / Union Station,Subscriber
89049,1214907,2017-05-19 09:00:53,2017-05-19 09:07:38,404.152,1st & M St NE,1st & Rhode Island Ave NW,Subscriber
89050,1419806,2017-06-06 04:27:33,2017-06-06 04:49:59,1345.911,10th & Florida Ave NW,Georgetown Harbor / 30th St NW,Customer
89051,132,,,,,,


In [21]:
# Check structure of data including datatypes
str(wash)

'data.frame':	89051 obs. of  7 variables:
 $ X            : int  1621326 482740 1330037 665458 1481135 1148202 1594275 1601832 574182 327058 ...
 $ Start.Time   : Factor w/ 81223 levels "","2017-01-01 00:11:00",..: 74753 19510 59964 26708 67716 50891 73381 73775 23142 13333 ...
 $ End.Time     : Factor w/ 81217 levels "","2017-01-01 00:14:00",..: 74744 19473 59981 26732 67753 50918 73397 73775 23114 13350 ...
 $ Trip.Duration: num  489 403 637 1827 1549 ...
 $ Start.Station: Factor w/ 478 levels "","10th & E St NW",..: 27 478 66 221 278 84 368 82 71 60 ...
 $ End.Station  : Factor w/ 479 levels "","10th & E St NW",..: 47 219 144 312 315 239 162 376 51 308 ...
 $ User.Type    : Factor w/ 3 levels "","Customer",..: 3 3 3 2 3 3 3 3 3 3 ...


`Start.Time` and `End.Time` are both stored as factors and will need to be converted to dates. Both also have an empty-string level, so some values are missing.

`Start.Station` and `End.Station` have missing values and will need to be investigated when cleaning.

`Trip.Duration` is a num here, whereas it is an int in the `Chicago` and `New York City` datasets.

`User.Type` has 3 levels. We expect only Customer and Subscriber; the third level is an empty string, so some values are missing.
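
Because `Trip.Duration` is fractional in the Washington data but whole seconds in the other two, the column types would differ once the datasets are combined. One possible way to align them is to round to whole seconds, shown here on the first three durations from `head(wash)`. Whether rounding is acceptable for this analysis is an assumption.

```r
# First three Washington durations from the output above
wash_toy <- data.frame(Trip.Duration = c(489.066, 402.549, 637.251))

# Round to whole seconds and store as integer, matching the other cities
wash_toy$Trip.Duration <- as.integer(round(wash_toy$Trip.Duration))

class(wash_toy$Trip.Duration)  # "integer"
wash_toy$Trip.Duration         # 489 403 637
```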

In [22]:
# Create object to store every row that has at least one missing value
wash_missing <- !complete.cases(wash)

# Show all the rows with missing data
wash[wash_missing,]

Unnamed: 0,X,Start.Time,End.Time,Trip.Duration,Start.Station,End.Station,User.Type
89051,132,,,,,,


The record with `X` ID 132 is missing values in every other column. This record will need to be removed completely.

In [23]:
# Count duplicated rows
sum(duplicated(wash))

In [24]:
# Count User.Type values that repeat an earlier value (not whole-row duplicates)
sum(duplicated(wash$User.Type))
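
Applied to a single column, `duplicated()` flags every value that has already appeared earlier in the vector, so the sum equals the number of rows minus the number of distinct values. A short illustration with made-up values:

```r
# Made-up user types, including one blank
user_type <- c("Subscriber", "Customer", "Subscriber", "", "Subscriber")

sum(duplicated(user_type))                     # 2
length(user_type) - length(unique(user_type))  # 2, the same quantity
```

For a categorical column such as `User.Type`, this number is close to the row count, so `table()` is usually the more informative summary.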

### Data Cleaning

To help answer the analysis questions, each dataset needs to be cleaned. The following steps will be taken:

- Remove columns that are not needed to answer analysis questions
- Update and transform data in columns for easier understanding and consistency
- Remove records with incorrect or missing values
- Add new columns to assist with answering questions
- Merge each new clean dataset into a single dataset
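
The steps above can be sketched in base R as a single function applied to each city and then combined with `rbind()`. This is a minimal illustration on made-up rows. The helper name `clean_city()`, the toy data frames, and the choices to round durations and to drop rows with a blank `User.Type` are all assumptions, not the notebook's actual cleaning code.

```r
# Hypothetical helper: applies the listed cleaning steps to one city's data
clean_city <- function(df, city_name) {
  keep <- c("X", "Start.Time", "End.Time", "Trip.Duration",
            "Start.Station", "End.Station", "User.Type")
  out <- df[, keep]                                   # drop unneeded columns
  out[] <- lapply(out, function(col)                  # factors -> character
    if (is.factor(col)) as.character(col) else col)
  out$city <- city_name                               # label rows for merging
  out$start_date <- as.Date(out$Start.Time, format = "%Y-%m-%d")
  out$end_date   <- as.Date(out$End.Time,   format = "%Y-%m-%d")
  out$Trip.Duration <- as.integer(round(out$Trip.Duration))
  # keep rows with valid dates and a non-blank user type
  ok <- !is.na(out$start_date) & !is.na(out$end_date) &
        !is.na(out$User.Type) & out$User.Type != ""
  out[ok, ]
}

# Made-up rows: the second NYC row mimics the malformed record
ny_toy <- data.frame(
  X = c(1L, 2L),
  Start.Time = c("2017-06-11 14:55:05", "2017-06-18 16:20:21"),
  End.Time = c("2017-06-11 15:08:21", "201"),
  Trip.Duration = c(795L, NA),
  Start.Station = c("A St", ""), End.Station = c("B St", ""),
  User.Type = c("Subscriber", ""),
  Gender = c("Male", ""), Birth.Year = c(1998, NA)
)
wash_toy <- data.frame(
  X = 3L,
  Start.Time = "2017-06-21 08:36:34", End.Time = "2017-06-21 08:44:43",
  Trip.Duration = 489.066,
  Start.Station = "C St", End.Station = "D St",
  User.Type = "Subscriber"
)

# Merge the cleaned city datasets into one data frame
trips <- rbind(clean_city(ny_toy, "NYC"), clean_city(wash_toy, "Washington"))
trips
```

Keeping the steps in one function makes it easier to apply identical rules to all three cities, which matters because `rbind()` requires matching column names.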

In [25]:
# Make copies of all the datasets before cleaning
# Creates a copy with a different memory address
# https://rdrr.io/cran/data.table/man/copy.html

ny_clean <- data.table::copy(ny)
chi_clean <- data.table::copy(chi)
wash_clean <- data.table::copy(wash)
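
For a plain `data.frame`, R's copy-on-modify semantics mean that ordinary assignment already behaves as an independent copy once either object is changed. An explicit deep copy matters mainly for `data.table` objects, which can be modified by reference. A small base-R illustration with made-up values:

```r
original <- data.frame(x = 1:3)
working  <- original      # plain assignment, no explicit copy

working$x[1] <- 99L       # modify the working copy only

original$x                # still 1 2 3
working$x                 # 99 2 3
```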

#### Clean New York City Dataset

In [26]:
names(ny_clean)

In [27]:
# Remove Gender and Birth.Year columns
ny_clean <- ny_clean %>%
              select(X, Start.Time, End.Time, Trip.Duration, Start.Station, End.Station, User.Type)

In [28]:
# Check that columns have been removed
names(ny_clean)

In [29]:
# Add column for city name for merging into single dataframe
# mutate() appends the new column at the end, so no reordering is needed
ny_clean <- ny_clean %>%
              mutate(city = 'NYC')

In [30]:
# Check that city column has been added
sample_n(ny_clean, 5)

X,Start.Time,End.Time,Trip.Duration,Start.Station,End.Station,User.Type,city
2378101,2017-04-04 19:34:51,2017-04-04 19:48:50,838,Christopher St & Greenwich St,Fulton St & Broadway,Subscriber,NYC
2393600,2017-04-05 09:13:00,2017-04-05 09:24:42,701,W 13 St & 7 Ave,Grand St & Greene St,Subscriber,NYC
1565900,2017-03-02 08:38:55,2017-03-02 08:51:48,773,E 2 St & 2 Ave E,W 13 St & Hudson St,Subscriber,NYC
6380845,2017-06-23 18:14:42,2017-06-23 18:19:15,273,W Broadway & Spring St,Barrow St & Hudson St,Subscriber,NYC
447760,2017-01-20 20:37:59,2017-01-20 21:03:42,1543,W 39 St & 9 Ave,Liberty St & Broadway,Subscriber,NYC


In [31]:
# Add start_date column with date from Start.Time
ny_clean$start_date <- as.Date(ny_clean$Start.Time)

In [32]:
# Check that new start_date column is created and only has dates
sample_n(ny_clean, 5)

X,Start.Time,End.Time,Trip.Duration,Start.Station,End.Station,User.Type,city,start_date
2196909,2017-03-29 21:06:39,2017-03-29 21:13:40,420,W 27 St & 10 Ave,Broadway & W 32 St,Subscriber,NYC,2017-03-29
3741722,2017-05-04 07:24:33,2017-05-04 07:45:59,1285,W 64 St & West End Ave,E 55 St & 2 Ave,Subscriber,NYC,2017-05-04
3152607,2017-04-21 20:17:46,2017-04-21 20:28:53,667,9 Ave & W 18 St,W 45 St & 8 Ave,Subscriber,NYC,2017-04-21
2643990,2017-04-11 15:36:19,2017-04-11 17:14:01,5861,Broadway & W 51 St,E 58 St & 3 Ave,Customer,NYC,2017-04-11
5828768,2017-06-13 21:05:39,2017-06-13 21:29:17,1418,Fulton St & William St,Bayard St & Leonard St,Subscriber,NYC,2017-06-13


In [33]:
# Check data types
str(ny_clean)

'data.frame':	54770 obs. of  9 variables:
 $ X            : int  5688089 4096714 2173887 3945638 6208972 1285652 1675753 1692245 2271331 1558339 ...
 $ Start.Time   : Factor w/ 54568 levels "2017-01-01 00:17:01",..: 45448 32799 17316 31589 49688 10220 13390 13509 18111 12449 ...
 $ End.Time     : Factor w/ 54562 levels "201","2017-01-01 00:30:56",..: 45432 32783 17295 31567 49668 10204 13364 13505 18092 12422 ...
 $ Trip.Duration: int  795 692 1325 703 329 998 478 4038 5132 309 ...
 $ Start.Station: Factor w/ 636 levels "","1 Ave & E 16 St",..: 522 406 10 93 5 521 325 309 151 245 ...
 $ End.Station  : Factor w/ 638 levels "","1 Ave & E 16 St",..: 613 8 362 558 269 107 389 110 151 243 ...
 $ User.Type    : Factor w/ 3 levels "","Customer",..: 3 3 3 3 3 3 3 3 2 3 ...
 $ city         : chr  "NYC" "NYC" "NYC" "NYC" ...
 $ start_date   : Date, format: "2017-06-11" "2017-05-11" ...


In [34]:
# Add end_date column with date from End.Time
ny_clean$end_date <- as.Date(ny_clean$End.Time)
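
The malformed `201` value in `End.Time` is worth keeping in mind at this step. The sketch below, on three made-up factor values, shows how passing an explicit `format` makes the conversion return `NA` for a string that is not a date. It also shows an optional full timestamp conversion; the `"UTC"` time zone there is an assumption, since the data does not state one.

```r
# Two valid timestamps and the malformed value, stored as a factor
end_time <- factor(c("2017-06-11 15:08:21", "2017-05-11 15:41:43", "201"))

# Date only: trailing time characters are ignored, "201" becomes NA
end_date <- as.Date(as.character(end_time), format = "%Y-%m-%d")
end_date
is.na(end_date)   # FALSE FALSE TRUE

# Optional: keep the time of day as well (time zone is an assumption)
end_dttm <- as.POSIXct(as.character(end_time),
                       format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
is.na(end_dttm)   # FALSE FALSE TRUE
```

Rows where the result is `NA` are the ones that a later filter on missing dates removes.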

In [35]:
# Check that new end_date column is created and only has dates
sample_n(ny_clean, 5)

X,Start.Time,End.Time,Trip.Duration,Start.Station,End.Station,User.Type,city,start_date,end_date
5524074,2017-06-08 18:20:43,2017-06-08 18:30:33,590,W 27 St & 7 Ave,W 47 St & 10 Ave,Subscriber,NYC,2017-06-08,2017-06-08
1837213,2017-03-11 16:43:49,2017-03-11 16:45:39,110,9 Ave & W 45 St,W 43 St & 10 Ave,Subscriber,NYC,2017-03-11,2017-03-11
821176,2017-02-03 18:38:07,2017-02-03 18:41:58,230,Lewis Ave & Decatur St,Fulton St & Utica Ave,Subscriber,NYC,2017-02-03,2017-02-03
6736764,2017-06-29 17:49:19,2017-06-29 18:04:10,891,8 Ave & W 33 St,Carmine St & 6 Ave,Subscriber,NYC,2017-06-29,2017-06-29
2585269,2017-04-10 14:00:51,2017-04-10 14:21:26,1234,Pier 40 - Hudson River Park,E 17 St & Broadway,Customer,NYC,2017-04-10,2017-04-10


In [36]:
# Check data types
str(ny_clean)

'data.frame':	54770 obs. of  10 variables:
 $ X            : int  5688089 4096714 2173887 3945638 6208972 1285652 1675753 1692245 2271331 1558339 ...
 $ Start.Time   : Factor w/ 54568 levels "2017-01-01 00:17:01",..: 45448 32799 17316 31589 49688 10220 13390 13509 18111 12449 ...
 $ End.Time     : Factor w/ 54562 levels "201","2017-01-01 00:30:56",..: 45432 32783 17295 31567 49668 10204 13364 13505 18092 12422 ...
 $ Trip.Duration: int  795 692 1325 703 329 998 478 4038 5132 309 ...
 $ Start.Station: Factor w/ 636 levels "","1 Ave & E 16 St",..: 522 406 10 93 5 521 325 309 151 245 ...
 $ End.Station  : Factor w/ 638 levels "","1 Ave & E 16 St",..: 613 8 362 558 269 107 389 110 151 243 ...
 $ User.Type    : Factor w/ 3 levels "","Customer",..: 3 3 3 3 3 3 3 3 2 3 ...
 $ city         : chr  "NYC" "NYC" "NYC" "NYC" ...
 $ start_date   : Date, format: "2017-06-11" "2017-05-11" ...
 $ end_date     : Date, format: "2017-06-11" "2017-05-11" ...


In [37]:
# Check unique values for end_date
unique(ny_clean$end_date)

In [38]:
# Keep only records that have a value in the end date field
ny_clean <- ny_clean %>%
              filter(!is.na(end_date))

In [39]:
# Check that no values are empty
unique(ny_clean$end_date)

In [40]:
# Make sure there are no rows with NA values
# (complete.cases() only flags NA, so blank strings are not detected here)
ny_missing <- !complete.cases(ny_clean)
ny_clean[ny_missing,]

X,Start.Time,End.Time,Trip.Duration,Start.Station,End.Station,User.Type,city,start_date,end_date


In [41]:
table(ny_clean$User.Type)


             Customer Subscriber 
       118       5558      49093 
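
The first, unlabeled count in the table above is the blank `User.Type` level: 118 trips with no recorded user type. These are not `NA`, so they pass a `complete.cases()` check. One possible way to handle them, sketched on made-up values, is to filter out the blanks and drop the unused factor level. Whether these rows should be removed or kept as an "unknown" group is a judgment call for the analysis.

```r
# Made-up rows with two blank user types
toy <- data.frame(
  X = 1:5,
  User.Type = factor(c("Subscriber", "", "Customer", "Subscriber", ""))
)

table(toy$User.Type)                        # blank level is counted too

toy <- toy[toy$User.Type != "", ]           # keep rows with a user type
toy$User.Type <- droplevels(toy$User.Type)  # remove the empty level

table(toy$User.Type)                        # Customer 1, Subscriber 2
```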

#### Clean Chicago Dataset

#### Clean Washington Dataset

#### Merge datasets

<a id='eda'></a>
## Exploratory Data Analysis

### What is the most common month for bike share use?
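
As a starting point, the most common month can be found by extracting the month from each start date and tabulating it. The sketch below uses base R on five made-up dates, so the result says nothing about the real data. `lubridate::month()` with `label = TRUE` would be an alternative to the `format()` call.

```r
# Made-up start dates standing in for the start_date column
start_date <- as.Date(c("2017-06-11", "2017-05-11", "2017-06-21",
                        "2017-02-22", "2017-06-30"))

# Month number -> month name, then count trips per month
month_num    <- as.integer(format(start_date, "%m"))
month_counts <- table(month.name[month_num])

month_counts
names(which.max(month_counts))   # "June" for this toy data
```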

### What is the most common start station?

### What are the counts of each user type?
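
Once the datasets are merged, user type counts can be produced overall or per city with `table()`. The sketch below assumes a merged data frame with `city` and `User.Type` columns and uses made-up rows, so the counts are illustrative only.

```r
# Made-up merged data with the city label added during cleaning
trips <- data.frame(
  city = c("NYC", "NYC", "Chicago", "Washington", "Washington"),
  User.Type = c("Subscriber", "Customer", "Subscriber",
                "Subscriber", "Customer")
)

overall <- table(trips$User.Type)              # counts across all cities
by_city <- table(trips$city, trips$User.Type)  # counts per city

overall
by_city
```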

<a id='conclusions'></a>
## Conclusions

#### Most common month


#### Most common start station


#### Counts of each user type



### Limitation