# Data Analysis Project: Citi Bike NYC

**Goals**
This dataset contains a sample of bike trips from the Citi Bike system in New York City.
Each row represents one trip and includes information about the start and end stations, the duration, the user type, and other contextual data like age, season, temperature, and weekday.
Your goal is to explore this dataset and extract insights through data analysis with Pandas.

You'll practice basic pandas operations (loading, exploring, cleaning, transforming, summarizing) and use descriptive statistics and simple visualizations to support your answers.

## 1. Dataset Exploration
  - What information does each column contain?
  - Are there missing or duplicated values?
  - What is the overall time span of the trips?

Environment requirements: Jupyter, Python, IPython, pandas, openpyxl

In [None]:
# %pip install pandas openpyxl



### What information does each column contain?

Each row records one trip, i.e. one unit of service usage: a user picked up a bicycle at a given station and time, used it for a certain period, and returned it. Each row also contains demographic information about the user and whether or not they are a subscribed member of the service.

Looking at the DataFrame, we can see the following columns:<br>
['Start Time', 'Stop Time', 'Start Station ID', 'Start Station Name',<br>
'End Station ID', 'End Station Name', 'Bike ID', 'User Type',<br>
'Birth Year', 'Age', 'Age Groups', 'Trip Duration',<br>
'Trip_Duration_in_min', 'Month', 'Season', 'Temperature', 'Weekday']

```'Start Time'```, ```'Stop Time'```: Show when the bicycle was picked up and when it was returned.<br>
```'Start Station ID'```, ```'End Station ID'```: The IDs of stations where bicycles were collected and returned.<br>
```'Start Station Name'```, ```'End Station Name'```: The station names corresponding to the Station IDs (vid. infra.)<br>
```'Bike ID'```: the unique ID for the bike used for the trip.<br>
```'User Type'```: whether the user is a member or not.<br>
```'Birth Year'```, ```'Age'```, ```'Age Groups'```: user demographic information (vid. infra.)<br>
```'Trip Duration'```, ```'Trip_Duration_in_min'```: Time elapsed between bicycle collection and return. Available in seconds and minutes (vid. infra.)<br>
```'Month'```, ```'Season'```: Columns related to the time of year. (NB. there appear to be records only for January through March.)<br>
```'Temperature'```: the only weather measurement available in the dataset. This will likely hinder any advanced weather-related insight, as we have no information about rain, snow, etc.<br>
```'Weekday'```: the day of the week of the trip. Using this column we might find out which age groups use bicycles more often on which days, as well as test assumptions about current bicycle usage.<br><br>

Certain fields appear to be strongly correlated, most likely because some of them are calculated from others:

- ```'Birth Year'```, ```'Age'``` and ```'Age Groups'```
- ```'Trip Duration'``` and ```'Trip_Duration_in_min'``` (both presumably derive from the difference between ```'Start Time'``` and ```'Stop Time'```, which are full timestamps.)
- ```'Month'```, ```'Season'``` and ```'Weekday'``` (all three can be derived from the date in ```'Start Time'``` or ```'Stop Time'```.)
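
One way to test whether these fields really are calculated from one another is to recompute them. The sketch below uses only values visible in the first distinct rows of the dataset; `sample` is an illustrative name, and the rounding rule is an assumption inferred from those rows rather than a documented fact.

```python
import pandas as pd

# Values copied from the first distinct rows of the dataset
sample = pd.DataFrame({
    "Birth Year": [1961, 1993, 1970, 1978],
    "Age": [60, 28, 51, 43],
    "Trip Duration": [1513, 639, 258, 663],   # seconds
    "Trip_Duration_in_min": [25, 11, 4, 11],
})

# Assumption: minutes are seconds divided by 60, rounded to the nearest integer
minutes_from_seconds = (sample["Trip Duration"] / 60).round().astype(int)
duration_matches = minutes_from_seconds.eq(sample["Trip_Duration_in_min"]).all()

# If 'Age' is calculated from 'Birth Year', their sum is a single reference year
reference_years = (sample["Birth Year"] + sample["Age"]).unique()

print("Minutes match rounded seconds:", duration_matches)
print("Reference year(s) implied by Age:", reference_years)
```

On these rows the sum of birth year and age is always 2021, which suggests that ```'Age'``` was computed relative to 2021 rather than to the year of the trip. The same check can be run on the full `df` once it is loaded.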

There is also some data duplication, because the data is stored as a single flat table rather than in a relational database, namely:

- ```'Start Station Name'``` and ```'End Station Name'``` should always correspond to the same IDs (i.e. ```'Start Station ID'``` and ```'End Station ID'```, respectively), so each name repeats information already carried by its ID.
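
The claim that each station ID always carries the same name can be checked by building one lookup table from the start and end columns. This is a minimal sketch on values copied from the first rows of the dataset; the names `sample`, `stations` and `ambiguous_ids` are illustrative.

```python
import pandas as pd

# Station columns copied from the first distinct rows of the dataset
sample = pd.DataFrame({
    "Start Station ID": [3194, 3183, 3186, 3270],
    "Start Station Name": ["McGinley Square", "Exchange Place",
                           "Grove St PATH", "Jersey & 6th St"],
    "End Station ID": [3271, 3203, 3270, 3206],
    "End Station Name": ["Danforth Light Rail", "Hamilton Park",
                         "Jersey & 6th St", "Hilltop"],
})

# Stack start and end stations into a single (id, name) table
starts = sample[["Start Station ID", "Start Station Name"]].rename(
    columns={"Start Station ID": "id", "Start Station Name": "name"})
ends = sample[["End Station ID", "End Station Name"]].rename(
    columns={"End Station ID": "id", "End Station Name": "name"})
stations = pd.concat([starts, ends], ignore_index=True).drop_duplicates()

# An ID with more than one distinct name would contradict the assumption
names_per_id = stations.groupby("id")["name"].nunique()
ambiguous_ids = names_per_id[names_per_id > 1]

print(len(stations), "distinct stations;", len(ambiguous_ids), "IDs with several names")
```

If `ambiguous_ids` is empty on the full dataset, the name columns could be dropped and restored later from the `stations` table.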

In [4]:
# Load Pandas dataframe

import pandas as pd

df = pd.read_excel('ny_citibikes_raw.xlsx', sheet_name='NYCitiBikes')

# Test df has loaded up
df.head()

Unnamed: 0,Start Time,Stop Time,Start Station ID,Start Station Name,End Station ID,End Station Name,Bike ID,User Type,Birth Year,Age,Age Groups,Trip Duration,Trip_Duration_in_min,Month,Season,Temperature,Weekday
0,2017-01-01 00:38:00,2017-01-01 01:03:00,3194,McGinley Square,3271,Danforth Light Rail,24668,Subscriber,1961,60,55-64,1513,25,1,Winter,10,Sunday
1,2017-01-01 01:47:00,2017-01-01 01:58:00,3183,Exchange Place,3203,Hamilton Park,26167,Subscriber,1993,28,25-34,639,11,1,Winter,10,Sunday
2,2017-01-01 01:47:00,2017-01-01 01:58:00,3183,Exchange Place,3203,Hamilton Park,26167,Subscriber,1993,28,25-34,639,11,1,Winter,10,Sunday
3,2017-01-01 01:56:00,2017-01-01 02:00:00,3186,Grove St PATH,3270,Jersey & 6th St,24604,Subscriber,1970,51,45-54,258,4,1,Winter,10,Sunday
4,2017-01-01 02:12:00,2017-01-01 02:23:00,3270,Jersey & 6th St,3206,Hilltop,24641,Subscriber,1978,43,35-44,663,11,1,Winter,10,Sunday


In [6]:
# List column names

print(df.columns)

Index(['Start Time', 'Stop Time', 'Start Station ID', 'Start Station Name',
       'End Station ID', 'End Station Name', 'Bike ID', 'User Type',
       'Birth Year', 'Age', 'Age Groups', 'Trip Duration',
       'Trip_Duration_in_min', 'Month', 'Season', 'Temperature', 'Weekday'],
      dtype='object')
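
With the column names known, the overall time span of the trips can be read from ```'Start Time'``` and ```'Stop Time'```. The sketch below runs on the timestamps of the first distinct rows only, so the span it prints describes that sample, not the whole dataset; `sample` is an illustrative name.

```python
import pandas as pd

# Timestamps copied from the first distinct rows of the dataset
sample = pd.DataFrame({
    "Start Time": ["2017-01-01 00:38:00", "2017-01-01 01:47:00",
                   "2017-01-01 01:56:00", "2017-01-01 02:12:00"],
    "Stop Time": ["2017-01-01 01:03:00", "2017-01-01 01:58:00",
                  "2017-01-01 02:00:00", "2017-01-01 02:23:00"],
})

# read_excel usually parses dates already; to_datetime also covers text columns
start = pd.to_datetime(sample["Start Time"])
stop = pd.to_datetime(sample["Stop Time"])

first_pickup = start.min()
last_return = stop.max()
time_span = last_return - first_pickup

print("First pickup:", first_pickup)
print("Last return: ", last_return)
print("Time span:   ", time_span)
```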


### Are there missing or duplicated values?

In [None]:
# isna, drop, null values?
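
The output of `df.head()` already hints at duplicates: rows 1 and 2 are identical in every column. A minimal sketch of the checks, run on a few columns of those first five rows (`sample` is an illustrative name; on the real data the same calls would be applied to `df`):

```python
import pandas as pd

# First five rows of the dataset, restricted to three columns; rows 1 and 2 are identical
sample = pd.DataFrame({
    "Start Time": ["2017-01-01 00:38:00", "2017-01-01 01:47:00",
                   "2017-01-01 01:47:00", "2017-01-01 01:56:00",
                   "2017-01-01 02:12:00"],
    "Bike ID": [24668, 26167, 26167, 24604, 24641],
    "Trip Duration": [1513, 639, 639, 258, 663],
})

# Missing values per column
missing_per_column = sample.isna().sum()

# Rows that repeat an earlier row in every column
n_duplicates = sample.duplicated().sum()

# Keep the first occurrence of each row and renumber the index
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print("Duplicated rows:", n_duplicates)
print("Rows after deduplication:", len(deduplicated))
```

Whether identical rows are true duplicates or two genuine trips cannot be decided from the data alone; since the same bike cannot start two trips at the same time, treating them as duplicates seems reasonable here.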

## 2. Basic Statistics
  - What is the average trip duration (in minutes)?
  - What are the minimum and maximum durations?
  - What are the most common start and end stations?
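
A possible starting point for these questions, sketched on the first distinct rows of the dataset (`sample` is an illustrative name; the figures describe only those four rows):

```python
import pandas as pd

# Columns copied from the first distinct rows of the dataset
sample = pd.DataFrame({
    "Start Station Name": ["McGinley Square", "Exchange Place",
                           "Grove St PATH", "Jersey & 6th St"],
    "End Station Name": ["Danforth Light Rail", "Hamilton Park",
                         "Jersey & 6th St", "Hilltop"],
    "Trip_Duration_in_min": [25, 11, 4, 11],
})

# Mean, minimum and maximum trip duration in one call
duration_summary = sample["Trip_Duration_in_min"].agg(["mean", "min", "max"])

# Most frequent stations; head(5) keeps the five most common
top_start = sample["Start Station Name"].value_counts().head(5)
top_end = sample["End Station Name"].value_counts().head(5)

print(duration_summary)
print(top_start)
print(top_end)
```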

## 3. Users and Demographics
  - How many unique bikes were used?
  - What are the proportions of user types (Subscriber vs Customer)?
  - What is the age distribution of the users? Which age group uses the service the most?
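
These questions map onto `nunique`, `value_counts` and `describe`. The sketch below uses the first distinct rows of the dataset (`sample` is an illustrative name). Because the dataset has no user identifier, the proportions are shares of trips, not of distinct users.

```python
import pandas as pd

# Columns copied from the first distinct rows of the dataset
sample = pd.DataFrame({
    "Bike ID": [24668, 26167, 24604, 24641],
    "User Type": ["Subscriber", "Subscriber", "Subscriber", "Subscriber"],
    "Age": [60, 28, 51, 43],
    "Age Groups": ["55-64", "25-34", "45-54", "35-44"],
})

# Number of distinct bikes
n_bikes = sample["Bike ID"].nunique()

# Share of trips per user type (normalize=True turns counts into proportions)
user_type_share = sample["User Type"].value_counts(normalize=True)

# Age distribution and number of trips per age group
age_summary = sample["Age"].describe()
trips_per_age_group = sample["Age Groups"].value_counts()

print("Unique bikes:", n_bikes)
print(user_type_share)
print(age_summary)
print(trips_per_age_group)
```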

## 4. Temporal Analysis
  - How does the number of trips vary by weekday?
  - Which month or season has the most rides?
  - What time of day do most trips start?
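
A sketch of the three counts, again on the first distinct rows of the dataset (`sample` is an illustrative name). It assumes that weekdays are spelled as full English names, as in the rows shown by `df.head()`.

```python
import pandas as pd

# Columns copied from the first distinct rows of the dataset
sample = pd.DataFrame({
    "Start Time": pd.to_datetime(["2017-01-01 00:38:00", "2017-01-01 01:47:00",
                                  "2017-01-01 01:56:00", "2017-01-01 02:12:00"]),
    "Month": [1, 1, 1, 1],
    "Weekday": ["Sunday", "Sunday", "Sunday", "Sunday"],
})

# Reindex so weekdays appear in calendar order; days without trips get 0
weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]
trips_per_weekday = sample["Weekday"].value_counts().reindex(weekday_order, fill_value=0)

# Trips per month, sorted by month number
trips_per_month = sample["Month"].value_counts().sort_index()

# Trips per starting hour (0-23) and the hour with the most starts
trips_per_hour = sample["Start Time"].dt.hour.value_counts().sort_index()
busiest_hour = trips_per_hour.idxmax()

print(trips_per_weekday)
print(trips_per_month)
print("Busiest start hour:", busiest_hour)
```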

## 5. Geographic Analysis
  - Which station pairs (start → end) appear most often?
  - Are there any stations that appear only as start or only as end stations?
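
Station pairs can be counted with a two-column `groupby`, and one-directional stations found with set differences. A minimal sketch on the first distinct rows of the dataset (`sample` is an illustrative name):

```python
import pandas as pd

# Columns copied from the first distinct rows of the dataset
sample = pd.DataFrame({
    "Start Station Name": ["McGinley Square", "Exchange Place",
                           "Grove St PATH", "Jersey & 6th St"],
    "End Station Name": ["Danforth Light Rail", "Hamilton Park",
                         "Jersey & 6th St", "Hilltop"],
})

# Number of trips per (start, end) pair, most frequent first
pair_counts = (
    sample.groupby(["Start Station Name", "End Station Name"])
    .size()
    .sort_values(ascending=False)
)

# Stations that only ever appear on one side of a trip
start_names = set(sample["Start Station Name"])
end_names = set(sample["End Station Name"])
only_start = start_names - end_names
only_end = end_names - start_names

print(pair_counts.head(10))
print("Only start:", sorted(only_start))
print("Only end:  ", sorted(only_end))
```

On a four-row sample most stations look one-directional; the result is only meaningful on the full dataset.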

## 6. Temperature and Duration
  - Is there any visible relationship between temperature and trip duration?
  - How does average trip duration vary by season?
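
A sketch of both computations on the first distinct rows of the dataset (`sample` is an illustrative name). In this sample the temperature is constant and only one season is present, so the correlation is undefined; the code guards against that case instead of reporting a misleading value.

```python
import pandas as pd

# Columns copied from the first distinct rows of the dataset
sample = pd.DataFrame({
    "Trip_Duration_in_min": [25, 11, 4, 11],
    "Season": ["Winter", "Winter", "Winter", "Winter"],
    "Temperature": [10, 10, 10, 10],
})

# Average trip duration per season
mean_duration_by_season = sample.groupby("Season")["Trip_Duration_in_min"].mean()

# Pearson correlation needs some variation in temperature to be defined
if sample["Temperature"].nunique() > 1:
    temp_duration_corr = sample["Temperature"].corr(sample["Trip_Duration_in_min"])
else:
    temp_duration_corr = float("nan")

print(mean_duration_by_season)
print("Temperature/duration correlation:", temp_duration_corr)
```

On the full dataset, a scatter plot such as `df.plot.scatter(x='Temperature', y='Trip_Duration_in_min')` would complement the single correlation figure; it requires matplotlib, which is not in the environment requirements listed for this notebook.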


## 7. Summary and Interpretation
  - Write a short summary (5–10 lines) of your findings.
  - Mention patterns, anomalies, or interesting trends you observed.