![](https://media.giphy.com/media/l0HluULNylbTu44Ao/giphy.gif)
## Introduction

In this project, we carry out an exploratory analysis of the Divvy dataset. We first set out research questions, then explore the relationships between stations, user behaviors, user types, and bike types to answer them.

*The project was completed as a part of Google's Data Analytics Professional Certificate online course on Coursera.*

## Setup
**Import packages**

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

---
## The Data

This dataset contains bike trip data for Chicago from 2020 to 2021. The data is provided by [Divvy bikes](https://divvybikes.com) under the [Divvy Data License Agreement](https://ride.divvybikes.com/data-license-agreement).

Each trip is anonymized and includes:
- Trip start day and time
- Trip end day and time
- Trip start station
- Trip end station
- Rider type (Member, Single Ride, and Day Pass)
- Rideable type (Classic, Docked, Electric)

The data has been processed to remove trips taken by staff as they service and inspect the system, as well as any trips shorter than 60 seconds (potentially false starts, or users re-docking a bike to make sure it was secure).
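The published files already have these trips removed. Purely as an illustration, a similar duration filter could be sketched in pandas as below. The rows are made up, the column names follow the ones used later in this notebook, and the staff-trip filter is left out because the source does not say how staff trips are identified.

```python
import pandas as pd

# Toy trips (made-up values); start_time / end_time mirror this dataset's column names
trips = pd.DataFrame({
    "ride_id": ["a", "b", "c"],
    "start_time": pd.to_datetime(
        ["2020-06-01 08:00:00", "2020-06-01 09:00:00", "2020-06-01 10:00:00"]
    ),
    "end_time": pd.to_datetime(
        ["2020-06-01 08:00:30", "2020-06-01 09:15:00", "2020-06-01 10:01:00"]
    ),
})

# Trip length in seconds
duration_s = (trips["end_time"] - trips["start_time"]).dt.total_seconds()

# Keep only trips that last at least 60 seconds
kept = trips[duration_s >= 60]
print(kept["ride_id"].tolist())
```

Here the 30-second trip is dropped, while the trip of exactly 60 seconds is kept because only trips below the threshold are removed.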

The dataset was wrangled by [Chris](https://github.com/ca-ros). The data wrangling process is documented [here](https://github.com/ca-ros/divvy-bikeshare/blob/master/docs/data_wrangling.md). The dataset still contains station names with null values. I plan to fill them in from the stations' coordinates once I have a good grasp of **Machine Learning** and **Web Scraping**.

> To learn more about Divvy and the dataset, visit this [link](https://ride.divvybikes.com/system-data).

---

## Exploratory data analysis
### Research question 1:
### Research question 2:
### Research question 3:

## Overview of data

In [2]:
def overview():
    """Load the trips CSV, print a summary of it, and return the DataFrame."""
    data = pd.read_csv("trips_p2.csv")
    print("The first 5 rows of data are:\n")
    print(data.head(5))
    print("\n\n\nDataset has {} rows and {} columns".format(data.shape[0], data.shape[1]))
    print("\n\n\nDatatype: \n")
    print(data.dtypes)
    print("\n\n\nThe number of unique values in each column are: \n")
    print(data.nunique())
    print("\n\n\nThe number of null values for each column are: \n")
    print(data.isnull().sum())
    print("\n\n\nData summary: \n")
    print(data.describe())
    return data

# Load the data and keep the returned DataFrame
data = overview()

# Note: loading this file is slow; this cell took about 5 minutes to run


### What do we see?
- The dataset has 8,988,891 rows and 13 columns
- The start_station_name and end_station_name columns contain null values (12.68% of the data); these rows will be omitted from the analysis
- The start_time and end_time columns need to be converted to the datetime data type
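As a rough sketch of how a missing-value share like the one above could be computed, the snippet below counts rows where either station name is null. The toy values are made up, and whether the 12.68% figure was calculated exactly this way is an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with the two station-name columns from this dataset
toy = pd.DataFrame({
    "start_station_name": ["A", np.nan, "C", "D"],
    "end_station_name": ["B", "B", np.nan, "E"],
})

# A row counts as incomplete if either station name is missing
missing_either = toy[["start_station_name", "end_station_name"]].isnull().any(axis=1)

# Share of incomplete rows, as a percentage
pct_missing = missing_either.mean() * 100
print(f"{pct_missing:.2f}% of rows lack a station name")
```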

#### Convert start_time and end_time to datetime

In [None]:
# Parse start_time and end_time as datetimes
data[['start_time', 'end_time']] = data[['start_time', 'end_time']].apply(pd.to_datetime, format = '%Y-%m-%d %H:%M:%S')
data.dtypes
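To see what this conversion does on a small scale, here is a minimal sketch on made-up timestamps. The `duration_min` column is a hypothetical name used only for illustration; it shows one thing the datetime type makes possible.

```python
import pandas as pd

# Toy rows (made-up values) stored as strings, as they would be after read_csv
toy = pd.DataFrame({
    "start_time": ["2021-03-01 07:45:10", "2021-03-01 18:02:00"],
    "end_time": ["2021-03-01 08:01:40", "2021-03-01 18:20:30"],
})

# Same conversion as above: parse both columns with an explicit format
toy[["start_time", "end_time"]] = toy[["start_time", "end_time"]].apply(
    pd.to_datetime, format="%Y-%m-%d %H:%M:%S"
)

# With datetime columns, trip durations follow from a simple subtraction
toy["duration_min"] = (toy["end_time"] - toy["start_time"]).dt.total_seconds() / 60
print(toy.dtypes)
print(toy["duration_min"].tolist())
```

Passing an explicit `format` means pandas does not have to infer the layout of each string, which can matter on a file with millions of rows.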

#### The NaNs


In [None]:
# Removing rows where station names are nulls
data.dropna(subset = ['start_station_name', 'end_station_name'], inplace = True)

In [None]:
print('The number of nulls in each column\n')
print(data[['start_station_name', 'end_station_name']].isnull().sum())

## Looking into bike trips over the years

In [None]:
# Use start_time as the index so trips can be grouped by time
data_i = data.set_index('start_time')

In [None]:
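# Daily trip counts and distinct start stations per day (assumed intent of the original expression)
axes = data_i.resample('D').agg({'ride_id': 'count', 'start_station_name': 'nunique'}).plot(figsize=(11, 9), subplots=True, linewidth=1)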
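Counting trips over time relies on the datetime index. A minimal sketch of the idea on made-up rides is shown below. The daily frequency is an assumption, and any other resampling frequency could be used in its place.

```python
import pandas as pd

# Toy rides (made-up values) indexed by start_time
toy = pd.DataFrame({
    "ride_id": ["r1", "r2", "r3", "r4"],
    "start_time": pd.to_datetime([
        "2021-01-01 08:00:00", "2021-01-01 17:30:00",
        "2021-01-02 09:15:00", "2021-01-04 12:00:00",
    ]),
}).set_index("start_time")

# Number of rides that started on each calendar day; days without rides get 0
daily_trips = toy["ride_id"].resample("D").count()
print(daily_trips)
```

Calling `daily_trips.plot()` on the result would draw the time series. Note that the day with no rides (2021-01-03) still appears, with a count of zero.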

## Summary