## Analyzing the Citi Bike trip data for Jersey City in 2022 - Data Science Portfolio Project 

This project uses Python to explore and analyze Citi Bike rides in Jersey City in 2022 (data provided by citibikenyc.com), with the aim of answering the following questions:

* What is the most used bike type, and by whom?
* What is the most used route?
* Is the bike share service used exclusively for business purposes?
* In which month is the service most used?

### Import and Clean Data

Import libraries and load the dataset


In [None]:
import pandas as pd

# Use a forward slash so the path is not read as the invalid escape sequence "\c"
trip = pd.read_csv("dataset/citibike.csv")

We inspect citibike.csv with __.info()__ to check the column names, data types, and the presence of unnecessary columns, and we view the first 5 rows with __.head()__.

In [2]:
trip.info()
trip.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 895497 entries, 0 to 895496
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   ride_id             895497 non-null  object
 1   rideable_type       895496 non-null  object
 2   started_at          895496 non-null  object
 3   ended_at            895496 non-null  object
 4   start_station_name  895486 non-null  object
 5   start_station_id    895486 non-null  object
 6   end_station_name    892292 non-null  object
 7   end_station_id      892292 non-null  object
 8   start_lat           895496 non-null  object
 9   start_lng           895496 non-null  object
 10  end_lat             893526 non-null  object
 11  end_lng             893526 non-null  object
 12  member_casual       895496 non-null  object
dtypes: object(13)
memory usage: 88.8+ MB


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,CA5837152804D4B5,electric_bike,2022-01-26 18:50:39,2022-01-26 18:51:53,12 St & Sinatra Dr N,HB201,12 St & Sinatra Dr N,HB201,40.75060414236896,-74.02402013540268,40.75060414236896,-74.02402013540268,member
1,BA06A5E45B6601D2,classic_bike,2022-01-28 13:14:07,2022-01-28 13:20:23,Essex Light Rail,JC038,Essex Light Rail,JC038,40.7127742,-74.0364857,40.7127742,-74.0364857,member
2,7B6827D7B9508D93,classic_bike,2022-01-10 19:55:13,2022-01-10 20:00:37,Essex Light Rail,JC038,Essex Light Rail,JC038,40.7127742,-74.0364857,40.7127742,-74.0364857,member
3,6E5864EA6FCEC90D,electric_bike,2022-01-26 07:54:57,2022-01-26 07:55:22,12 St & Sinatra Dr N,HB201,12 St & Sinatra Dr N,HB201,40.75060414236896,-74.02402013540268,40.75060414236896,-74.02402013540268,member
4,E24954255BBDE32D,electric_bike,2022-01-13 18:44:46,2022-01-13 18:45:43,12 St & Sinatra Dr N,HB201,12 St & Sinatra Dr N,HB201,40.75060414236896,-74.02402013540268,40.75060414236896,-74.02402013540268,member


### Data Cleaning

Delete the columns _'ride_id'_, _'start_lat'_, _'start_lng'_, _'end_lat'_, and _'end_lng'_

In [3]:
remove_col = ['ride_id', 'start_lat', 'start_lng', 'end_lat', 'end_lng']
trip = trip.drop(columns=remove_col)
trip.head()

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,member_casual
0,electric_bike,2022-01-26 18:50:39,2022-01-26 18:51:53,12 St & Sinatra Dr N,HB201,12 St & Sinatra Dr N,HB201,member
1,classic_bike,2022-01-28 13:14:07,2022-01-28 13:20:23,Essex Light Rail,JC038,Essex Light Rail,JC038,member
2,classic_bike,2022-01-10 19:55:13,2022-01-10 20:00:37,Essex Light Rail,JC038,Essex Light Rail,JC038,member
3,electric_bike,2022-01-26 07:54:57,2022-01-26 07:55:22,12 St & Sinatra Dr N,HB201,12 St & Sinatra Dr N,HB201,member
4,electric_bike,2022-01-13 18:44:46,2022-01-13 18:45:43,12 St & Sinatra Dr N,HB201,12 St & Sinatra Dr N,HB201,member


### Exploratory Data Analysis


#### Data Question #1 - What is the most used bike type and by whom?
We find the most used bike type among _'electric_bike'_, _'classic_bike'_, and _'docked_bike'_ in the _'rideable_type'_ column, showing its total number of rides and its percentage of use.

In [4]:
def count_bike_types(df, column_name):
    bike_type_counts = df[column_name].value_counts()
    total_rides = len(df)
    bike_type_percentages = (bike_type_counts / total_rides) * 100

    bike_type_stats = pd.DataFrame({
        'Count': bike_type_counts,
        'Percentage': bike_type_percentages
    })

    return bike_type_stats

bike_type_stats = count_bike_types(trip, 'rideable_type')

most_used_bike_type = bike_type_stats.index[0]
most_used_bike_count = bike_type_stats['Count'].iloc[0]
most_used_bike_percentage = bike_type_stats['Percentage'].iloc[0]

print(f"The most used bike type is: {most_used_bike_type}")
print(f"Total rides using {most_used_bike_type}: {most_used_bike_count}")
print(f"Percentage of rides using {most_used_bike_type}: {most_used_bike_percentage:.2f}%")

The most used bike type is: classic_bike
Total rides using classic_bike: 627175
Percentage of rides using classic_bike: 70.04%


Let's create a __top 3__ of the most used bike types:

In [5]:
# Redefine count_bike_types so that it returns only the percentage share of each bike type
def count_bike_types(df, column_name):
    bike_type_counts = df[column_name].value_counts()
    total_rides = len(df)
    bike_type_percentages = (bike_type_counts / total_rides) * 100

    return bike_type_percentages

bike_type_percentages = count_bike_types(trip, 'rideable_type')
top_3_bike_types = bike_type_percentages.nlargest(3)

print("Top 3 Bike Types and Percentages:")
for index, percentage in top_3_bike_types.items():
    print(f"{index}: {percentage:.2f}%")


Top 3 Bike Types and Percentages:
classic_bike: 70.04%
electric_bike: 29.10%
docked_bike: 0.87%
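
The counts and percentages above can also be obtained directly from `value_counts`, without a helper that is redefined between cells. The following is a minimal sketch on toy data, not the notebook's own method; the function name `summarize_column` and the `toy` DataFrame are illustrative assumptions. Note that `value_counts` ignores missing values by default, so `normalize=True` divides by the number of non-null rows rather than by `len(df)`, which can differ slightly from the figures above.

```python
import pandas as pd

def summarize_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the count and percentage share of each value in `column` (hypothetical helper)."""
    counts = df[column].value_counts()
    # normalize=True returns fractions of the non-null rows; scale them to percentages
    shares = df[column].value_counts(normalize=True) * 100
    return pd.DataFrame({"Count": counts, "Percentage": shares.round(2)})

# Toy data standing in for the real `trip` DataFrame
toy = pd.DataFrame({
    "rideable_type": ["classic_bike", "classic_bike", "classic_bike", "electric_bike"]
})
summary = summarize_column(toy, "rideable_type")
print(summary)
```

On the real data this would be called as `summarize_column(trip, 'rideable_type')`, and `summary.head(3)` would give the top 3.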


Let's analyze who uses the service: _'member'_ or _'casual'_ users.


In [6]:
def count_user_types(df, column_name):
    user_type_counts = df[column_name].value_counts()
    total_rides = len(df)
    user_type_percentages = (user_type_counts / total_rides) * 100

    return user_type_percentages

user_type_percentages = count_user_types(trip, 'member_casual')

# The values are already percentages, so label them as such and do not scale them again
user_type_stats = pd.DataFrame({
    'User Type': user_type_percentages.index,
    'Percentage': user_type_percentages.values
})

print("User Type Usage Statistics:")
print(user_type_stats.to_string(index=True))

User Type Usage Statistics:
       User Type  Percentage
0         member   65.752314
1         casual   34.246346
2  member_casual    0.001228
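
The `member_casual` entry in the output above is the column name showing up as a value, which suggests that a few header rows are embedded in the data. A likely cause is that several CSV files were concatenated, but that is an assumption. The sketch below shows one way to drop such rows before analysis; the helper name `drop_repeated_headers` and the toy data are hypothetical.

```python
import pandas as pd

def drop_repeated_headers(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Drop rows in which `column` contains its own column name (hypothetical helper)."""
    is_header_row = df[column] == column
    return df.loc[~is_header_row].reset_index(drop=True)

# Toy data with one stray header row in the middle
toy = pd.DataFrame({
    "rideable_type": ["classic_bike", "rideable_type", "electric_bike"],
    "member_casual": ["member", "member_casual", "casual"],
})
clean = drop_repeated_headers(toy, "member_casual")
print(clean)
```

Applied to the real data, this would be `trip = drop_repeated_headers(trip, 'member_casual')`, after which only `member` and `casual` should remain as user types.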


Let's analyze which bike type _'member'_ users prefer.

In [7]:
member_trips = trip[trip['member_casual'] == 'member']

# count_bike_types (as redefined in In [5]) returns percentages, so one call is enough
member_bike_type_percentages = count_bike_types(member_trips, 'rideable_type')

most_used_bike_type_member = member_bike_type_percentages.idxmax()
most_used_bike_percentage_member = member_bike_type_percentages.max()

print(f"The most preferred bike type among 'member' users is: {most_used_bike_type_member}")
print(f"Percentage of rides using {most_used_bike_type_member} by 'member' users: {most_used_bike_percentage_member:.2f}%")

The most preferred bike type among 'member' users is: classic_bike
Percentage of rides using classic_bike by 'member' users: 74.75%
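
To compare the bike type preferences of both user types at once, a cross-tabulation can be used. This is a minimal sketch on toy data rather than the notebook's own code; `pd.crosstab` with `normalize="index"` returns each row's shares, which are scaled to percentages here.

```python
import pandas as pd

# Toy data standing in for the real `trip` DataFrame
toy = pd.DataFrame({
    "member_casual": ["member", "member", "member", "member", "casual", "casual"],
    "rideable_type": ["classic_bike", "classic_bike", "classic_bike",
                      "electric_bike", "classic_bike", "electric_bike"],
})

# Each row sums to 100: the share of each bike type within one user type
shares = pd.crosstab(toy["member_casual"], toy["rideable_type"], normalize="index") * 100
print(shares.round(2))
```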


Question #1 shows that most rides are taken by 'member' users, i.e. those with a subscription, and that the preferred bike type is the classic bike. This establishes the basis for the rest of the analysis.

#### Data Question #2 - What is the most used route?

Let us now analyze which route is used most by bike-share users.

In [8]:
trip['route'] = trip['start_station_name'] + " - " + trip['end_station_name']

route_counts = trip['route'].value_counts()

most_used_route = route_counts.idxmax()
most_used_route_count = route_counts.max()

print(f"The most used bike trip route is: {most_used_route}")
print(f"Number of times the route {most_used_route} was used: {most_used_route_count}")

The most used bike trip route is: Hoboken Terminal - Hudson St & Hudson Pl - Hoboken Ave at Monmouth St
Number of times the route Hoboken Terminal - Hudson St & Hudson Pl - Hoboken Ave at Monmouth St was used: 5565
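
Because some station names contain " - " themselves (for example "Hoboken Terminal - Hudson St & Hudson Pl"), the concatenated route string can be hard to split back into start and end. One alternative, sketched here on toy data with made-up station names, is to group on both columns and keep start and end as separate index levels.

```python
import pandas as pd

# Toy data; the station names are invented for illustration
toy = pd.DataFrame({
    "start_station_name": ["A - 1 St", "A - 1 St", "B", "B"],
    "end_station_name": ["B", "B", "A - 1 St", None],
})

# groupby drops rows with a missing start or end station by default
route_counts = (
    toy.groupby(["start_station_name", "end_station_name"])
       .size()
       .sort_values(ascending=False)
)

top_start, top_end = route_counts.index[0]
print(f"Most used route: {top_start} -> {top_end} ({route_counts.iloc[0]} rides)")
```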


__Top 10__ most used routes:

In [9]:
top_10_routes = route_counts.nlargest(10)

top_10_routes = top_10_routes.reset_index()
top_10_routes.columns = ['Route', 'Count']  # Rename columns for clarity

print("Top 10 Most Used Bike Trip Routes:")
print(top_10_routes.to_string(index=False))

Top 10 Most Used Bike Trip Routes:
                                                                                      Route  Count
                      Hoboken Terminal - Hudson St & Hudson Pl - Hoboken Ave at Monmouth St   5565
South Waterfront Walkway - Sinatra Dr & 1 St - South Waterfront Walkway - Sinatra Dr & 1 St   5439
                                                           Marin Light Rail - Grove St PATH   4113
                      Hoboken Ave at Monmouth St - Hoboken Terminal - Hudson St & Hudson Pl   4083
                                                           Grove St PATH - Marin Light Rail   3973
                        12 St & Sinatra Dr N - South Waterfront Walkway - Sinatra Dr & 1 St   3964
                                                    Liberty Light Rail - Liberty Light Rail   3696
                        South Waterfront Walkway - Sinatra Dr & 1 St - 12 St & Sinatra Dr N   3495
                                                              Hamilton Par

The results also show that the bike-share service is used for pure enjoyment, as evidenced by the route __'South Waterfront Walkway - Sinatra Dr & 1 St - South Waterfront Walkway - Sinatra Dr & 1 St'__, where the start station is the same as the end station.

We now determine whether _'member'_ or _'casual'_ users made more use of the top route, _'Hoboken Terminal - Hudson St & Hudson Pl - Hoboken Ave at Monmouth St'_.

In [10]:
most_used_route = route_counts.idxmax()

member_trips = trip[trip['member_casual'] == 'member']
casual_trips = trip[trip['member_casual'] == 'casual']

member_route_count = member_trips[member_trips['route'] == most_used_route]['route'].count()
casual_route_count = casual_trips[casual_trips['route'] == most_used_route]['route'].count()

if member_route_count > casual_route_count:
    most_frequent_user_type = "member"
    most_frequent_user_count = member_route_count
else:
    most_frequent_user_type = "casual"
    most_frequent_user_count = casual_route_count

print(f"The user type that has utilized the most used route ({most_used_route}) the most is: {most_frequent_user_type}")
print(f"Number of times {most_used_route} was used by {most_frequent_user_type} users: {most_frequent_user_count}")


The user type that has utilized the most used route (Hoboken Terminal - Hudson St & Hudson Pl - Hoboken Ave at Monmouth St) the most is: casual
Number of times Hoboken Terminal - Hudson St & Hudson Pl - Hoboken Ave at Monmouth St was used by casual users: 3081
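
The same comparison can be written more compactly by filtering on the route and counting user types in one step. This is a sketch on toy data with invented route names, assuming a `route` column built as in the cells above.

```python
import pandas as pd

# Toy data standing in for the real `trip` DataFrame
toy = pd.DataFrame({
    "route": ["X - Y", "X - Y", "X - Y", "Y - X"],
    "member_casual": ["casual", "casual", "member", "member"],
})

top_route = toy["route"].value_counts().idxmax()
# Rides per user type on the most used route
split = toy.loc[toy["route"] == top_route, "member_casual"].value_counts()
print(f"{top_route}: most used by {split.idxmax()} users ({split.max()} rides)")
```

Keeping the full `split` series also shows how close the two user types are, which the if/else version hides.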


#### Data Question #3 - Is the bike share service used exclusively for business purposes?

No. Among the top 10 most frequently used routes, 4 appear to be related to pure enjoyment, i.e. __40% of the top 10 routes__ (not 40% of all rides). This deduction stems from the fact that their start and end stations are in the vicinity of parks and cycle paths along the Hudson River. The routes considered are:
* South Waterfront Walkway - Sinatra Dr & 1 St - South Waterfront Walkway - Sinatra Dr & 1 St
* 12 St & Sinatra Dr N - South Waterfront Walkway - Sinatra Dr & 1 St
* Liberty Light Rail - Liberty Light Rail
* South Waterfront Walkway - Sinatra Dr & 1 St - 12 St & Sinatra Dr N
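
One rough way to put a number on recreational use across all rides, rather than only the top 10 routes, is the share of round trips, i.e. rides that start and end at the same station. This is only a proxy and an assumption of this sketch: a round trip is not necessarily recreational, and a one-way ride along the waterfront is not counted. The toy data below uses invented station names.

```python
import pandas as pd

# Toy data standing in for the real `trip` DataFrame
toy = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "C"],
    "end_station_name": ["A", "B", "B", None],
})

# Ignore rides with a missing start or end station
complete = toy.dropna(subset=["start_station_name", "end_station_name"])
is_round_trip = complete["start_station_name"] == complete["end_station_name"]

# The mean of a boolean series is the fraction of True values
round_trip_share = is_round_trip.mean() * 100
print(f"Round trips: {round_trip_share:.2f}% of rides with known stations")
```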

#### Data Question #4 - In which month is the service most used and for what reason?

Let's find the month with the highest number of rides.

In [11]:
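trip = trip.dropna(subset=['started_at'])
# The timestamps include the time of day, so the format must cover it as well;
# values that do not match the format become NaT because of errors='coerce'
trip['started_at'] = pd.to_datetime(trip['started_at'], format='%Y-%m-%d %H:%M:%S', errors='coerce')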

months = trip['started_at'].dt.month

month_counts = months.value_counts()
most_common_month = month_counts.index[0]

print("The bike-sharing service was most used in month:", most_common_month)

The bike-sharing service was most used in month: 8.0
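
The month is printed as `8.0` rather than `8` because values that cannot be parsed become `NaT`, and a column containing `NaT` yields float months. The sketch below, on toy data, drops the unparsed values first and keeps the full per-month counts so that the busiest month can be compared with the others; the variable names are illustrative.

```python
import pandas as pd

# Toy data; the last value imitates a row that is not a timestamp
toy = pd.DataFrame({
    "started_at": ["2022-01-26 18:50:39", "2022-08-01 09:00:00",
                   "2022-08-15 17:30:00", "started_at"]
})

started = pd.to_datetime(toy["started_at"], format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Drop NaT before extracting the month so the result stays an integer
rides_per_month = started.dropna().dt.month.value_counts().sort_index()
busiest_month = int(rides_per_month.idxmax())
print(rides_per_month)
print("Busiest month:", busiest_month)
```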


Now that we know that __August__ is the month with the most rides, let's determine which type of user used the service, with which bike type, and on which route.

In [12]:
most_common_month = 8

august_trips = trip[trip['started_at'].dt.month == most_common_month]

user_type_counts = august_trips['member_casual'].value_counts()
vehicle_type_counts = august_trips['rideable_type'].value_counts()
route_counts = august_trips['route'].value_counts()

print("User type distribution for August trips:")
print(user_type_counts)

print("\nVehicle type distribution for August trips:")
print(vehicle_type_counts)

print("\nRoute distribution for August trips:")
print(route_counts)


User type distribution for August trips:
member    71633
casual    43598
Name: member_casual, dtype: int64

Vehicle type distribution for August trips:
classic_bike     87493
electric_bike    26597
docked_bike       1141
Name: rideable_type, dtype: int64

Route distribution for August trips:
South Waterfront Walkway - Sinatra Dr & 1 St - South Waterfront Walkway - Sinatra Dr & 1 St    842
Hoboken Terminal - Hudson St & Hudson Pl - Hoboken Ave at Monmouth St                          617
Liberty Light Rail - Liberty Light Rail                                                        560
Marin Light Rail - Grove St PATH                                                               512
Grove St PATH - Marin Light Rail                                                               510
                                                                                              ... 
Hamilton Park - W 16 St & The High Line                                                          1
Monmouth and 6

August is the month with the most rides. Even in the summer period, non-recreational rides appear to make up the majority, although the single most used route in August is a recreational round trip:
* South Waterfront Walkway - Sinatra Dr & 1 St - South Waterfront Walkway - Sinatra Dr & 1 St