## Goals:

Analyze CitiBike trip data from December 1, 2020 to January 1, 2021 to understand how riders use the service and to identify opportunities to reduce costs and grow the business, from the perspective of a CitiBike product owner.

## Assumptions:
1. I used Python and the pandas library to clean the data.
2. I kept only trips that lasted longer than 120 seconds and ended at a different station from where they started. Very short trips and trips returned to the same station were dropped to weed out likely instances of a faulty bike.
3. Trip duration was then converted from seconds to minutes, which I find easier to interpret.
4. Next, all trips over 24 hours (1440 minutes) were flagged as invalid; there were 10 such trips in our dataset. Under CitiBike's current pricing model, bikes can be used an unlimited number of times (under the day passes and subscriptions), but only in 30- or 45-minute intervals. If you keep a bike out for longer than 45 minutes at a time, you are charged an extra USD 0.18 or USD 0.12 for each additional minute and your account may be suspended. If you do not return a bike within a 24-hour period, you are charged a lost or stolen bike fee of USD 1,200 (plus tax). Therefore, bikes that were not docked for over 24 hours were most likely stolen, and those records are not legitimate rides.
Source: https://help.citibikenyc.com/hc/en-us/articles/360032367371-What-if-I-keep-a-bike-out-too-long- 
5. Next, I converted the gender column (which was originally numerically coded) into strings using a list comprehension, and added an extra ride id column to identify each ride by concatenating the bike ID, trip duration, start station ID, and end station ID.
6. Then, I calculated the distance of each trip and the age of each rider.
7. Finally, I divided the riders into age bins to help with visualizations per age bracket.

In [1]:
import pandas as pd
from math import sin, cos, sqrt, atan2, radians
import numpy as np



### Import Data

In [2]:
# Read the CSV file into a DataFrame
citibike_data = pd.read_csv(r"C:\Users\abhij\Desktop\Celonis\JC-202012-citibike-tripdata.csv")
citibike_data

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,146,2020-12-01 00:02:50.1450,2020-12-01 00:05:16.1940,3202,Newport PATH,40.727224,-74.033759,3199,Newport Pkwy,40.728745,-74.032108,42308,Subscriber,1989,2
1,572,2020-12-01 00:11:57.3910,2020-12-01 00:21:30.2510,3640,Journal Square,40.733670,-74.062500,3280,Astor Place,40.719282,-74.071262,18568,Subscriber,1997,2
2,387,2020-12-01 00:14:49.3610,2020-12-01 00:21:16.8730,3640,Journal Square,40.733670,-74.062500,3194,McGinley Square,40.725340,-74.067622,44543,Subscriber,1960,1
3,188,2020-12-01 00:45:06.3680,2020-12-01 00:48:14.4280,3186,Grove St PATH,40.719586,-74.043117,3270,Jersey & 6th St,40.725289,-74.045572,43098,Subscriber,1998,1
4,594,2020-12-01 01:17:17.0110,2020-12-01 01:27:11.9400,3212,Christ Hospital,40.734786,-74.050444,3209,Brunswick St,40.724176,-74.050656,44723,Subscriber,1988,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11689,1750,2020-12-31 23:07:44.0030,2020-12-31 23:36:54.4710,3199,Newport Pkwy,40.728745,-74.032108,3199,Newport Pkwy,40.728745,-74.032108,40440,Customer,1969,0
11690,1519,2020-12-31 23:18:00.2630,2020-12-31 23:43:19.8590,3184,Paulus Hook,40.714145,-74.033552,3195,Sip Ave,40.730897,-74.063913,46340,Customer,1969,0
11691,1761,2020-12-31 23:31:09.4620,2021-01-01 00:00:31.3290,3195,Sip Ave,40.730897,-74.063913,3270,Jersey & 6th St,40.725289,-74.045572,40907,Customer,1998,1
11692,637,2020-12-31 23:35:45.4640,2020-12-31 23:46:22.7300,3681,Grand St,40.715178,-74.037683,3199,Newport Pkwy,40.728745,-74.032108,42250,Customer,1969,0


### Clean Data

In [3]:
# Keep only trips longer than 120 seconds that end at a different station from where they started
# This filters out records where users rented a bike but found that it wasn't working properly; a total of 1403 rows were removed
citibike_data = citibike_data.loc[((citibike_data["tripduration"]>120) & (citibike_data["start station id"] != citibike_data["end station id"])),:]
citibike_data

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,146,2020-12-01 00:02:50.1450,2020-12-01 00:05:16.1940,3202,Newport PATH,40.727224,-74.033759,3199,Newport Pkwy,40.728745,-74.032108,42308,Subscriber,1989,2
1,572,2020-12-01 00:11:57.3910,2020-12-01 00:21:30.2510,3640,Journal Square,40.733670,-74.062500,3280,Astor Place,40.719282,-74.071262,18568,Subscriber,1997,2
2,387,2020-12-01 00:14:49.3610,2020-12-01 00:21:16.8730,3640,Journal Square,40.733670,-74.062500,3194,McGinley Square,40.725340,-74.067622,44543,Subscriber,1960,1
3,188,2020-12-01 00:45:06.3680,2020-12-01 00:48:14.4280,3186,Grove St PATH,40.719586,-74.043117,3270,Jersey & 6th St,40.725289,-74.045572,43098,Subscriber,1998,1
4,594,2020-12-01 01:17:17.0110,2020-12-01 01:27:11.9400,3212,Christ Hospital,40.734786,-74.050444,3209,Brunswick St,40.724176,-74.050656,44723,Subscriber,1988,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11686,704,2020-12-31 22:37:08.3630,2020-12-31 22:48:52.5220,3194,McGinley Square,40.725340,-74.067622,3203,Hamilton Park,40.727596,-74.044247,40858,Customer,1969,0
11688,982,2020-12-31 22:59:05.5250,2020-12-31 23:15:28.3710,3677,Glenwood Ave,40.727551,-74.071061,3184,Paulus Hook,40.714145,-74.033552,46340,Customer,1969,0
11690,1519,2020-12-31 23:18:00.2630,2020-12-31 23:43:19.8590,3184,Paulus Hook,40.714145,-74.033552,3195,Sip Ave,40.730897,-74.063913,46340,Customer,1969,0
11691,1761,2020-12-31 23:31:09.4620,2021-01-01 00:00:31.3290,3195,Sip Ave,40.730897,-74.063913,3270,Jersey & 6th St,40.725289,-74.045572,40907,Customer,1998,1


In [4]:
# Convert trip duration column from seconds to minutes
citibike_data["tripduration"] = [trip/60 for trip in citibike_data["tripduration"]]

citibike_data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
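
The warning above appears because `citibike_data` was produced by filtering another DataFrame with `.loc`, so pandas cannot tell whether a later column assignment is meant to reach the original data. One common way to avoid it is to take an explicit `.copy()` of the filtered result before adding or overwriting columns. The following is a minimal sketch on a small made-up frame; the names `raw`, `mask`, and `trips` and all values are illustrative assumptions, not the real dataset.

```python
import pandas as pd

# Illustrative stand-in for the trip data (values are made up)
raw = pd.DataFrame({
    "tripduration": [90, 146, 572, 300],
    "start station id": [3202, 3202, 3640, 3186],
    "end station id": [3199, 3199, 3280, 3186],
})

# Same filter as the notebook, plus .copy() so the result owns its data
mask = (raw["tripduration"] > 120) & (raw["start station id"] != raw["end station id"])
trips = raw.loc[mask, :].copy()

# Vectorized seconds-to-minutes conversion on the independent copy
trips["tripduration"] = trips["tripduration"] / 60
```

Dividing the whole column by 60 gives the same values as the list comprehension, and `raw` is left untouched.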
  


Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,2.433333,2020-12-01 00:02:50.1450,2020-12-01 00:05:16.1940,3202,Newport PATH,40.727224,-74.033759,3199,Newport Pkwy,40.728745,-74.032108,42308,Subscriber,1989,2
1,9.533333,2020-12-01 00:11:57.3910,2020-12-01 00:21:30.2510,3640,Journal Square,40.733670,-74.062500,3280,Astor Place,40.719282,-74.071262,18568,Subscriber,1997,2
2,6.450000,2020-12-01 00:14:49.3610,2020-12-01 00:21:16.8730,3640,Journal Square,40.733670,-74.062500,3194,McGinley Square,40.725340,-74.067622,44543,Subscriber,1960,1
3,3.133333,2020-12-01 00:45:06.3680,2020-12-01 00:48:14.4280,3186,Grove St PATH,40.719586,-74.043117,3270,Jersey & 6th St,40.725289,-74.045572,43098,Subscriber,1998,1
4,9.900000,2020-12-01 01:17:17.0110,2020-12-01 01:27:11.9400,3212,Christ Hospital,40.734786,-74.050444,3209,Brunswick St,40.724176,-74.050656,44723,Subscriber,1988,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11686,11.733333,2020-12-31 22:37:08.3630,2020-12-31 22:48:52.5220,3194,McGinley Square,40.725340,-74.067622,3203,Hamilton Park,40.727596,-74.044247,40858,Customer,1969,0
11688,16.366667,2020-12-31 22:59:05.5250,2020-12-31 23:15:28.3710,3677,Glenwood Ave,40.727551,-74.071061,3184,Paulus Hook,40.714145,-74.033552,46340,Customer,1969,0
11690,25.316667,2020-12-31 23:18:00.2630,2020-12-31 23:43:19.8590,3184,Paulus Hook,40.714145,-74.033552,3195,Sip Ave,40.730897,-74.063913,46340,Customer,1969,0
11691,29.350000,2020-12-31 23:31:09.4620,2021-01-01 00:00:31.3290,3195,Sip Ave,40.730897,-74.063913,3270,Jersey & 6th St,40.725289,-74.045572,40907,Customer,1998,1


In [5]:
# Select trips over 24 hours (1440 minutes) long for inspection; there are 10 such trips in this data set.
# Note: this stores them in hours_data and does not remove them from citibike_data.
hours_data = citibike_data.loc[citibike_data["tripduration"] > 1440, :]

hours_data

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
2343,34282.95,2020-12-05 17:08:54.9040,2020-12-29 12:31:52.1180,3191,Union St,40.718211,-74.083639,3273,Manila & 1st,40.721651,-74.042884,19209,Customer,1969,0
2691,2636.2,2020-12-06 14:06:13.0770,2020-12-08 10:02:25.1170,3792,Columbus Dr at Exchange Pl,40.71687,-74.03281,3186,Grove St PATH,40.719586,-74.043117,42195,Customer,1969,0
3792,2985.183333,2020-12-08 21:11:49.7510,2020-12-10 22:57:01.6330,3277,Communipaw & Berry Lane,40.714358,-74.066611,3681,Grand St,40.715178,-74.037683,26703,Customer,2000,1
4313,34475.833333,2020-12-10 10:43:05.6190,2021-01-03 09:18:56.1980,3639,Harborside,40.719252,-74.034234,3638,Washington St,40.724294,-74.035483,47019,Customer,1969,0
6015,1502.4,2020-12-13 09:30:07.0140,2020-12-14 10:32:31.7040,3277,Communipaw & Berry Lane,40.714358,-74.066611,3203,Hamilton Park,40.727596,-74.044247,42436,Customer,1969,0
7355,11238.233333,2020-12-15 16:38:43.1430,2020-12-23 11:56:57.2060,3199,Newport Pkwy,40.728745,-74.032108,3186,Grove St PATH,40.719586,-74.043117,45274,Customer,1969,0
8000,2400.116667,2020-12-19 06:35:46.2210,2020-12-20 22:35:53.3520,3638,Washington St,40.724294,-74.035483,3792,Columbus Dr at Exchange Pl,40.71687,-74.03281,40168,Customer,1969,0
9674,6052.416667,2020-12-25 04:39:28.1300,2020-12-29 09:31:53.1550,3681,Grand St,40.715178,-74.037683,3270,Jersey & 6th St,40.725289,-74.045572,42153,Subscriber,1951,1
10345,2988.633333,2020-12-28 11:08:39.6820,2020-12-30 12:57:18.4990,3186,Grove St PATH,40.719586,-74.043117,3794,Pier 40 Dock Station,40.72866,-74.01198,48743,Subscriber,1992,1
11532,3022.583333,2020-12-31 15:06:59.0820,2021-01-02 17:29:34.1160,3193,Lincoln Park,40.724605,-74.078406,3194,McGinley Square,40.72534,-74.067622,46529,Customer,1969,0
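
The cell above selects the long trips into `hours_data` but leaves `citibike_data` unchanged. To actually exclude them from the working data, the same condition can be inverted. This is a minimal sketch on a made-up frame; `trips`, `long_trips`, and `MAX_MINUTES` are hypothetical names, and the rows are illustrative.

```python
import pandas as pd

# Illustrative trip durations in minutes (made-up rows)
trips = pd.DataFrame({
    "bikeid": [42308, 19209, 42195, 18568],
    "tripduration": [2.43, 34282.95, 2636.2, 9.53],
})

MAX_MINUTES = 24 * 60  # 1440 minutes, the 24-hour cutoff described above

# Rows over the cutoff, kept separately for inspection
long_trips = trips.loc[trips["tripduration"] > MAX_MINUTES]

# Working data with those rows excluded
trips = trips.loc[trips["tripduration"] <= MAX_MINUTES].copy()
```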


In [6]:
# Rename values in gender column with strings
gender = {0:"Unknown", 1:"Male", 2:"Female"}
citibike_data.gender = [gender[item] for item in citibike_data.gender]
citibike_data.head()



Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,2.433333,2020-12-01 00:02:50.1450,2020-12-01 00:05:16.1940,3202,Newport PATH,40.727224,-74.033759,3199,Newport Pkwy,40.728745,-74.032108,42308,Subscriber,1989,Female
1,9.533333,2020-12-01 00:11:57.3910,2020-12-01 00:21:30.2510,3640,Journal Square,40.73367,-74.0625,3280,Astor Place,40.719282,-74.071262,18568,Subscriber,1997,Female
2,6.45,2020-12-01 00:14:49.3610,2020-12-01 00:21:16.8730,3640,Journal Square,40.73367,-74.0625,3194,McGinley Square,40.72534,-74.067622,44543,Subscriber,1960,Male
3,3.133333,2020-12-01 00:45:06.3680,2020-12-01 00:48:14.4280,3186,Grove St PATH,40.719586,-74.043117,3270,Jersey & 6th St,40.725289,-74.045572,43098,Subscriber,1998,Male
4,9.9,2020-12-01 01:17:17.0110,2020-12-01 01:27:11.9400,3212,Christ Hospital,40.734786,-74.050444,3209,Brunswick St,40.724176,-74.050656,44723,Subscriber,1988,Male
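
The list comprehension above works as long as every code is a key in the dictionary; an unexpected code would raise a `KeyError`. A possible alternative, sketched here with made-up codes, is `Series.map`, which applies the same dictionary and leaves unmapped values as `NaN`. The names `gender_codes` and `gender_labels` are assumptions for illustration.

```python
import pandas as pd

# Illustrative gender codes in the same numeric coding as the dataset
gender_codes = pd.Series([2, 2, 1, 0], name="gender")

# Same mapping as the notebook's dictionary
gender_labels = {0: "Unknown", 1: "Male", 2: "Female"}

# Series.map looks each value up in the dictionary
gender = gender_codes.map(gender_labels)
```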


In [7]:
# Add ride id column to uniquely identify each ride
#citibike_data["rideid"] = citibike_data["start station id"].map(str) + "_" + citibike_data["end station id"].map(str) + "_" + citibike_data["bikeid"].map(str)
citibike_data["rideid"] = citibike_data["bikeid"].map(str) + "_" + citibike_data["tripduration"].map(str) + "_" + citibike_data["start station id"].map(str) + "_" + citibike_data["end station id"].map(str)
citibike_data



Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,rideid
0,2.433333,2020-12-01 00:02:50.1450,2020-12-01 00:05:16.1940,3202,Newport PATH,40.727224,-74.033759,3199,Newport Pkwy,40.728745,-74.032108,42308,Subscriber,1989,Female,42308_2.433333333333333_3202_3199
1,9.533333,2020-12-01 00:11:57.3910,2020-12-01 00:21:30.2510,3640,Journal Square,40.733670,-74.062500,3280,Astor Place,40.719282,-74.071262,18568,Subscriber,1997,Female,18568_9.533333333333333_3640_3280
2,6.450000,2020-12-01 00:14:49.3610,2020-12-01 00:21:16.8730,3640,Journal Square,40.733670,-74.062500,3194,McGinley Square,40.725340,-74.067622,44543,Subscriber,1960,Male,44543_6.45_3640_3194
3,3.133333,2020-12-01 00:45:06.3680,2020-12-01 00:48:14.4280,3186,Grove St PATH,40.719586,-74.043117,3270,Jersey & 6th St,40.725289,-74.045572,43098,Subscriber,1998,Male,43098_3.1333333333333333_3186_3270
4,9.900000,2020-12-01 01:17:17.0110,2020-12-01 01:27:11.9400,3212,Christ Hospital,40.734786,-74.050444,3209,Brunswick St,40.724176,-74.050656,44723,Subscriber,1988,Male,44723_9.9_3212_3209
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11686,11.733333,2020-12-31 22:37:08.3630,2020-12-31 22:48:52.5220,3194,McGinley Square,40.725340,-74.067622,3203,Hamilton Park,40.727596,-74.044247,40858,Customer,1969,Unknown,40858_11.733333333333333_3194_3203
11688,16.366667,2020-12-31 22:59:05.5250,2020-12-31 23:15:28.3710,3677,Glenwood Ave,40.727551,-74.071061,3184,Paulus Hook,40.714145,-74.033552,46340,Customer,1969,Unknown,46340_16.366666666666667_3677_3184
11690,25.316667,2020-12-31 23:18:00.2630,2020-12-31 23:43:19.8590,3184,Paulus Hook,40.714145,-74.033552,3195,Sip Ave,40.730897,-74.063913,46340,Customer,1969,Unknown,46340_25.316666666666666_3184_3195
11691,29.350000,2020-12-31 23:31:09.4620,2021-01-01 00:00:31.3290,3195,Sip Ave,40.730897,-74.063913,3270,Jersey & 6th St,40.725289,-74.045572,40907,Customer,1998,Male,40907_29.35_3195_3270


In [8]:
# How many unique rideid values are there?
citibike_data['rideid'].nunique()

10290
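
If the number of unique `rideid` values is lower than the number of rows, at least two rides share an identifier, because the same bike can repeat a trip between the same stations with the same duration. The sketch below shows one way to list such collisions and one possible way to disambiguate them by appending the start time. The frame, its start times, and the column name `rideid_with_time` are made up for illustration.

```python
import pandas as pd

# Illustrative rides: the first two share bike, duration, and stations
rides = pd.DataFrame({
    "bikeid": [42307, 42307, 44543],
    "tripduration": [3.35, 3.35, 6.45],
    "start station id": [3276, 3276, 3640],
    "end station id": [3186, 3186, 3194],
    "starttime": ["2020-12-02 08:00:00", "2020-12-09 08:00:00", "2020-12-01 00:14:49"],
})

# Same construction as the notebook's rideid
rides["rideid"] = (
    rides["bikeid"].map(str) + "_" + rides["tripduration"].map(str) + "_"
    + rides["start station id"].map(str) + "_" + rides["end station id"].map(str)
)

# keep=False marks every row that takes part in a collision
collisions = rides.loc[rides["rideid"].duplicated(keep=False)]

# Appending the start time separates rides that otherwise look identical
rides["rideid_with_time"] = rides["rideid"] + "_" + rides["starttime"]
```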

### Calculate the distance of each ride

In [9]:
# Convert degrees to radians
start_lat = [radians(lat) for lat in citibike_data["start station latitude"]]
start_lon = [radians(lon) for lon in citibike_data["start station longitude"]]
end_lat = [radians(lat) for lat in citibike_data["end station latitude"]]
end_lon = [radians(lon) for lon in citibike_data["end station longitude"]]

# Convert lists into series
start_lat = pd.Series(start_lat)
start_lon = pd.Series(start_lon)
end_lat = pd.Series(end_lat)
end_lon = pd.Series(end_lon)

# Calculate difference between each set of latitude and longitude
distance_lat = end_lat - start_lat
distance_lon = end_lon - start_lon

In [10]:
# Use the haversine formula to calculate the great-circle distance (as the crow flies) between two points on Earth's surface
# Approximate radius of Earth in km
R = 6373.0

# Empty list to store trip distances
distance = []

for i in range(0, len(start_lat)):
    
    a = sin(distance_lat[i] / 2)**2 + cos(start_lat[i]) * cos(end_lat[i]) * sin(distance_lon[i] / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    # Get distance and convert km to miles; c is the angular distance in radians as per the haversine formula.
    miles = (R * c) * .6214
    
    # Append miles travel to 'distance' list
    distance.append(miles)
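
The same haversine calculation can also be written without an explicit loop by using NumPy's array functions. This is a minimal sketch that reuses the radius and conversion factor from the cells above; the function name `haversine_miles` and the constant names are hypothetical, and the sample coordinates are the first two trips as displayed in the output tables.

```python
import numpy as np

EARTH_RADIUS_KM = 6373.0  # same approximate radius as the loop above
KM_TO_MILES = 0.6214      # same conversion factor as the loop above


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))  # angular distance in radians
    return EARTH_RADIUS_KM * c * KM_TO_MILES


# Newport PATH -> Newport Pkwy, and Journal Square -> Astor Place
miles = haversine_miles(
    [40.727224, 40.733670], [-74.033759, -74.062500],
    [40.728745, 40.719282], [-74.032108, -74.071262],
)
```

For these two trips the result should come out close to 0.136 and 1.095 miles; small differences from the notebook's own output are expected because the displayed coordinates are rounded.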

In [11]:
# Add trip distance as new column to data frame
citibike_data["tripdistance (mi)"] = distance
citibike_data.head()



Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,rideid,tripdistance (mi)
0,2.433333,2020-12-01 00:02:50.1450,2020-12-01 00:05:16.1940,3202,Newport PATH,40.727224,-74.033759,3199,Newport Pkwy,40.728745,-74.032108,42308,Subscriber,1989,Female,42308_2.433333333333333_3202_3199,0.136133
1,9.533333,2020-12-01 00:11:57.3910,2020-12-01 00:21:30.2510,3640,Journal Square,40.73367,-74.0625,3280,Astor Place,40.719282,-74.071262,18568,Subscriber,1997,Female,18568_9.533333333333333_3640_3280,1.095254
2,6.45,2020-12-01 00:14:49.3610,2020-12-01 00:21:16.8730,3640,Journal Square,40.73367,-74.0625,3194,McGinley Square,40.72534,-74.067622,44543,Subscriber,1960,Male,44543_6.45_3640_3194,0.635198
3,3.133333,2020-12-01 00:45:06.3680,2020-12-01 00:48:14.4280,3186,Grove St PATH,40.719586,-74.043117,3270,Jersey & 6th St,40.725289,-74.045572,43098,Subscriber,1998,Male,43098_3.1333333333333333_3186_3270,0.414616
4,9.9,2020-12-01 01:17:17.0110,2020-12-01 01:27:11.9400,3212,Christ Hospital,40.734786,-74.050444,3209,Brunswick St,40.724176,-74.050656,44723,Subscriber,1988,Male,44723_9.9_3212_3209,0.733382


In [12]:
# Calculate age of riders
ages = []

for i in citibike_data["birth year"]:
    age = 2020 - i
    ages.append(age)

In [13]:
# Append new Age column
citibike_data["Age"] = ages
citibike_data



Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,rideid,tripdistance (mi),Age
0,2.433333,2020-12-01 00:02:50.1450,2020-12-01 00:05:16.1940,3202,Newport PATH,40.727224,-74.033759,3199,Newport Pkwy,40.728745,-74.032108,42308,Subscriber,1989,Female,42308_2.433333333333333_3202_3199,0.136133,31
1,9.533333,2020-12-01 00:11:57.3910,2020-12-01 00:21:30.2510,3640,Journal Square,40.733670,-74.062500,3280,Astor Place,40.719282,-74.071262,18568,Subscriber,1997,Female,18568_9.533333333333333_3640_3280,1.095254,23
2,6.450000,2020-12-01 00:14:49.3610,2020-12-01 00:21:16.8730,3640,Journal Square,40.733670,-74.062500,3194,McGinley Square,40.725340,-74.067622,44543,Subscriber,1960,Male,44543_6.45_3640_3194,0.635198,60
3,3.133333,2020-12-01 00:45:06.3680,2020-12-01 00:48:14.4280,3186,Grove St PATH,40.719586,-74.043117,3270,Jersey & 6th St,40.725289,-74.045572,43098,Subscriber,1998,Male,43098_3.1333333333333333_3186_3270,0.414616,22
4,9.900000,2020-12-01 01:17:17.0110,2020-12-01 01:27:11.9400,3212,Christ Hospital,40.734786,-74.050444,3209,Brunswick St,40.724176,-74.050656,44723,Subscriber,1988,Male,44723_9.9_3212_3209,0.733382,32
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11686,11.733333,2020-12-31 22:37:08.3630,2020-12-31 22:48:52.5220,3194,McGinley Square,40.725340,-74.067622,3203,Hamilton Park,40.727596,-74.044247,40858,Customer,1969,Unknown,40858_11.733333333333333_3194_3203,1.234264,51
11688,16.366667,2020-12-31 22:59:05.5250,2020-12-31 23:15:28.3710,3677,Glenwood Ave,40.727551,-74.071061,3184,Paulus Hook,40.714145,-74.033552,46340,Customer,1969,Unknown,46340_16.366666666666667_3677_3184,2.172406,51
11690,25.316667,2020-12-31 23:18:00.2630,2020-12-31 23:43:19.8590,3184,Paulus Hook,40.714145,-74.033552,3195,Sip Ave,40.730897,-74.063913,46340,Customer,1969,Unknown,46340_25.316666666666666_3184_3195,1.967220,51
11691,29.350000,2020-12-31 23:31:09.4620,2021-01-01 00:00:31.3290,3195,Sip Ave,40.730897,-74.063913,3270,Jersey & 6th St,40.725289,-74.045572,40907,Customer,1998,Male,40907_29.35_3195_3270,1.035927,22


In [14]:
# Divide users into age bins; pd.cut uses right-inclusive intervals, e.g. (16, 25]
bins = [0, 16, 25, 35, 50, 65, 90]
labels = ["0-16", "17-25", "26-35", "36-50", "51-65", "66-90"]

citibike_data["Age Group"] = pd.cut(citibike_data["Age"], bins, labels=labels)
citibike_data



Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,rideid,tripdistance (mi),Age,Age Group
0,2.433333,2020-12-01 00:02:50.1450,2020-12-01 00:05:16.1940,3202,Newport PATH,40.727224,-74.033759,3199,Newport Pkwy,40.728745,-74.032108,42308,Subscriber,1989,Female,42308_2.433333333333333_3202_3199,0.136133,31,26-35
1,9.533333,2020-12-01 00:11:57.3910,2020-12-01 00:21:30.2510,3640,Journal Square,40.733670,-74.062500,3280,Astor Place,40.719282,-74.071262,18568,Subscriber,1997,Female,18568_9.533333333333333_3640_3280,1.095254,23,17-25
2,6.450000,2020-12-01 00:14:49.3610,2020-12-01 00:21:16.8730,3640,Journal Square,40.733670,-74.062500,3194,McGinley Square,40.725340,-74.067622,44543,Subscriber,1960,Male,44543_6.45_3640_3194,0.635198,60,51-65
3,3.133333,2020-12-01 00:45:06.3680,2020-12-01 00:48:14.4280,3186,Grove St PATH,40.719586,-74.043117,3270,Jersey & 6th St,40.725289,-74.045572,43098,Subscriber,1998,Male,43098_3.1333333333333333_3186_3270,0.414616,22,17-25
4,9.900000,2020-12-01 01:17:17.0110,2020-12-01 01:27:11.9400,3212,Christ Hospital,40.734786,-74.050444,3209,Brunswick St,40.724176,-74.050656,44723,Subscriber,1988,Male,44723_9.9_3212_3209,0.733382,32,26-35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11686,11.733333,2020-12-31 22:37:08.3630,2020-12-31 22:48:52.5220,3194,McGinley Square,40.725340,-74.067622,3203,Hamilton Park,40.727596,-74.044247,40858,Customer,1969,Unknown,40858_11.733333333333333_3194_3203,1.234264,51,51-65
11688,16.366667,2020-12-31 22:59:05.5250,2020-12-31 23:15:28.3710,3677,Glenwood Ave,40.727551,-74.071061,3184,Paulus Hook,40.714145,-74.033552,46340,Customer,1969,Unknown,46340_16.366666666666667_3677_3184,2.172406,51,51-65
11690,25.316667,2020-12-31 23:18:00.2630,2020-12-31 23:43:19.8590,3184,Paulus Hook,40.714145,-74.033552,3195,Sip Ave,40.730897,-74.063913,46340,Customer,1969,Unknown,46340_25.316666666666666_3184_3195,1.967220,51,51-65
11691,29.350000,2020-12-31 23:31:09.4620,2021-01-01 00:00:31.3290,3195,Sip Ave,40.730897,-74.063913,3270,Jersey & 6th St,40.725289,-74.045572,40907,Customer,1998,Male,40907_29.35_3195_3270,1.035927,22,17-25
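
With its default settings, `pd.cut` treats each bin as open on the left and closed on the right, so an age of 25 falls in the (16, 25] bin and an age of 26 in (25, 35]. Ages outside the outer edges, including exactly 0, get no group. The sketch below checks these boundaries on a few made-up ages; the last label is written as "66-90" here to match the (65, 90] interval, which is an assumption about the intended bracket.

```python
import pandas as pd

# Made-up ages chosen to sit on or next to the bin edges
ages = pd.Series([16, 17, 25, 26, 65, 66, 90, 0, 95])

bins = [0, 16, 25, 35, 50, 65, 90]
labels = ["0-16", "17-25", "26-35", "36-50", "51-65", "66-90"]

# Default right=True: each bin includes its upper edge but not its lower edge
groups = pd.cut(ages, bins=bins, labels=labels)
```

Here the ages 0 and 95 end up as missing values, so it can be worth counting `groups.isna()` before plotting by age bracket.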


### Export cleaned data to csv

In [15]:
citibike_data.to_csv(r"C:\Users\abhij\Desktop\Celonis\Cleaned_JC-202012-citibike-tripdata.csv", index=False)

In [16]:
# Summarize the rideid column: row count, number of unique values, and the most frequent value
citibike_data['rideid'].describe()

count                    10291
unique                   10290
top       42307_3.35_3276_3186
freq                         2
Name: rideid, dtype: object
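
The summary above describes the `rideid` strings themselves. To get statistics on the number of rides per day, one option is to parse `starttime` and count rows per calendar date. This is a minimal sketch on three rows taken from the tables above; the names `rides`, `rides_per_day`, and `summary` are hypothetical.

```python
import pandas as pd

# Three start times in the same format as the dataset
rides = pd.DataFrame({
    "starttime": [
        "2020-12-01 00:02:50.1450",
        "2020-12-01 00:11:57.3910",
        "2020-12-31 23:18:00.2630",
    ],
})

# Parse the strings into timestamps
rides["starttime"] = pd.to_datetime(rides["starttime"])

# Count rides per calendar date
rides_per_day = rides.groupby(rides["starttime"].dt.date).size()

# Count, mean, min, max, and quartiles of the daily ride counts
summary = rides_per_day.describe()
```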