# Data Mining Project
This notebook demonstrates the data preprocessing and analysis steps for the Bike Rentals dataset in San Jose. The goal is to clean the data, extract relevant features, and perform exploratory data analysis to derive insights.

## Step 1: Import Libraries and Load Data
In this step, we import pandas and load the three datasets:
- `data_trip`: Contains trip data (start and end dates, stations, etc.).
- `data_weather`: Includes weather information during the trip period.
- `data_station`: Provides details about bike stations.

In [20]:
import pandas as pd
data_trip = pd.read_csv(r"C:\Users\utente\Desktop\UNITN\data mining\data bike rentals san jose\trip_data.csv")
data_weather = pd.read_csv(r"C:\Users\utente\Desktop\UNITN\data mining\data bike rentals san jose\weather_data.csv")
data_station = pd.read_csv(r"C:\Users\utente\Desktop\UNITN\data mining\data bike rentals san jose\station_data.csv")
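
The absolute Windows paths above tie the notebook to one machine. As a minimal sketch, assuming the three CSV files sit together in a single folder, the loading step could be wrapped in a small helper. `DATA_DIR` and `load_datasets` are hypothetical names introduced here for illustration, and the folder `data` is only a placeholder.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location: adjust to wherever the three CSV files live.
DATA_DIR = Path("data")


def load_datasets(data_dir):
    """Read the trip, weather, and station CSV files from one folder."""
    data_dir = Path(data_dir)
    names = ("trip", "weather", "station")
    # File names follow the pattern used above:
    # trip_data.csv, weather_data.csv, station_data.csv
    return tuple(pd.read_csv(data_dir / f"{name}_data.csv") for name in names)
```

With the files in place, `data_trip, data_weather, data_station = load_datasets(DATA_DIR)` would reproduce the three variables used in the rest of the notebook.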

## Step 2: Inspect the Data
To understand the structure of the datasets, we display the first 5 rows and the number of rows in each dataset.

In [21]:
print(data_trip.head(5))
print(len(data_trip))

print(data_weather.head(5))
print(len(data_weather))

print(data_station.head(5))
print(len(data_station))

   Trip ID        Start Date  Start Station          End Date  End Station  \
0   913460  31/08/2015 23:26             50  31/08/2015 23:39           70   
1   913459  31/08/2015 23:11             31  31/08/2015 23:28           27   
2   913455  31/08/2015 23:13             47  31/08/2015 23:18           64   
3   913454  31/08/2015 23:10             10  31/08/2015 23:17            8   
4   913453  31/08/2015 23:09             51  31/08/2015 23:22           60   

  Subscriber Type  
0      Subscriber  
1      Subscriber  
2      Subscriber  
3      Subscriber  
4        Customer  
354152
         Date  Max TemperatureF  Mean TemperatureF  Min TemperatureF  \
0  01/09/2014              83.0               70.0              57.0   
1  02/09/2014              72.0               66.0              60.0   
2  03/09/2014              76.0               69.0              61.0   
3  04/09/2014              74.0               68.0              61.0   
4  05/09/2014              72.0             

In [22]:
print(data_trip.columns)


Index(['Trip ID', 'Start Date', 'Start Station', 'End Date', 'End Station',
       'Subscriber Type'],
      dtype='object')
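
Beyond the first rows and the row count, it can help to look at column types and missing values before preprocessing. The sketch below is one possible way to do that. `summarize` is a hypothetical helper, and the small frame is made-up data standing in for `data_trip`.

```python
import pandas as pd


def summarize(df, name):
    """Print shape, column dtypes, and per-column missing-value counts."""
    print(f"{name}: {df.shape[0]} rows x {df.shape[1]} columns")
    print(df.dtypes)
    print(df.isna().sum())


# Tiny illustrative frame (made-up values), standing in for data_trip
example = pd.DataFrame({"Trip ID": [1, 2], "Subscriber Type": ["Subscriber", None]})
summarize(example, "example")
```

The same call could be applied to `data_trip`, `data_weather`, and `data_station` in turn.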


## Step 3: Preprocessing Trip Data
We process the `Start Date` and `End Date` columns to extract:
- **Start_date**: The date when the trip started.
- **Start_time**: The time when the trip started.
- **End_date**: The date when the trip ended.
- **End_time**: The time when the trip ended.

The original `Start Date` and `End Date` columns are dropped after extraction.

In [23]:
# Substitute "Start Date" with "Start_date" and "Start_time" columns
data_trip["Start Date"] = pd.to_datetime(data_trip["Start Date"], dayfirst=True)

data_trip["Start_date"] = data_trip["Start Date"].dt.date
data_trip["Start_time"] = data_trip["Start Date"].dt.time

data_trip = data_trip.drop(columns=["Start Date"])

# Substitute "End Date" with "End_date" and "End_time" columns
data_trip["End Date"] = pd.to_datetime(data_trip["End Date"], dayfirst=True)

data_trip["End_date"] = data_trip["End Date"].dt.date
data_trip["End_time"] = data_trip["End Date"].dt.time

data_trip = data_trip.drop(columns=["End Date"])
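
The cell above relies on `dayfirst=True` to read the day/month/year strings. As an alternative sketch, assuming every value follows the `dd/mm/YYYY HH:MM` layout seen in the preview, an explicit `format` removes any guessing and raises an error on strings that do not match. The sample strings and the variable names `raw`, `parsed`, `start_date`, and `start_time` are illustrative only.

```python
import pandas as pd

# Sample strings in the day/month/year layout shown in the trip data preview
# (the second value is made up for illustration).
raw = pd.Series(["31/08/2015 23:26", "01/09/2014 07:05"])

# An explicit format avoids day/month ambiguity and fails loudly on
# strings that do not match it.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")

start_date = parsed.dt.normalize()  # midnight timestamps, still datetime64
start_time = parsed.dt.time         # datetime.time objects
```

Using `dt.normalize()` instead of `dt.date` keeps the date column as `datetime64`, so it would not need to be converted again before a merge on dates.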

## Step 6: Merging Datasets
In this step, we combine the datasets to create a unified view:
1. **Merge `data_trip` and `data_weather`:**
   - Merge on `Start_date` from `data_trip` and `Date` from `data_weather`.
   - Use an inner join to retain only matching rows.
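2. **Merge `merged_data` with `data_station`:**
   - Merge on `Start Station` from `merged_data` and `Id` from `data_station`, keeping only the `Id`, `Name`, and `City` columns of `data_station`.
   - Use an inner join to include station details. The result is stored in `merged_data2`.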

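This process creates a combined dataset that links trips, weather data, and station information. Because the first join key is the date alone, every trip is paired with every weather row for that date. If `data_weather` holds more than one row per date, each trip is repeated in the result, as the repeated `Trip ID` 913460 in the final output suggests.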

In [24]:
# Merge data_trip and data_weather on the trip start date.
# Start_date holds datetime.date objects, so convert it back to datetime64
# to match the dtype of the weather Date column.
data_trip["Start_date"] = pd.to_datetime(data_trip["Start_date"])
data_weather["Date"] = pd.to_datetime(data_weather["Date"], dayfirst=True)

merged_data = pd.merge(data_trip, data_weather, left_on="Start_date", right_on="Date", how="inner")

# Add the name and city of the start station.
merged_data2 = pd.merge(merged_data, data_station[["Id", "Name", "City"]], left_on="Start Station", right_on="Id", how="inner")
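
A merge on date alone multiplies rows whenever the weather table has several rows for the same date. The sketch below uses made-up miniature tables (not the real data) to show the effect and how the `validate` argument of `pd.merge` can catch it. The variable names are illustrative.

```python
import pandas as pd

# Made-up miniature tables: two trips on one day, and weather with
# two rows for that same day (for example, one per measurement location).
trips = pd.DataFrame({
    "Trip ID": [1, 2],
    "Start_date": pd.to_datetime(["2015-08-31", "2015-08-31"]),
})
weather = pd.DataFrame({
    "Date": pd.to_datetime(["2015-08-31", "2015-08-31"]),
    "Max TemperatureF": [78.0, 80.0],
})

# Without validation the merge silently yields 2 x 2 = 4 rows.
merged = pd.merge(trips, weather, left_on="Start_date", right_on="Date", how="inner")
rows_without_check = len(merged)

# validate="many_to_one" requires the right-hand keys to be unique and
# raises pandas.errors.MergeError otherwise.
try:
    pd.merge(trips, weather, left_on="Start_date", right_on="Date",
             how="inner", validate="many_to_one")
    duplicate_dates_detected = False
except pd.errors.MergeError:
    duplicate_dates_detected = True
```

If the same check fails on the real `data_weather`, one option would be to add a second join key that identifies the weather location, provided both tables contain a suitable column.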

In [29]:
print("Final dataset:")
print(merged_data2.head(4), "\n")
print("Columns of the final dataset:")
print(merged_data2.columns)
print(merged_data2["City"].value_counts())
print(merged_data2["Subscriber Type"].value_counts())
print(len(merged_data2))

Final dataset:
   Trip ID  Start Station  End Station Subscriber Type Start_date Start_time  \
0   913460             50           70      Subscriber 2015-08-31   23:26:00   
1   913460             50           70      Subscriber 2015-08-31   23:26:00   
2   913460             50           70      Subscriber 2015-08-31   23:26:00   
3   913460             50           70      Subscriber 2015-08-31   23:26:00   

     End_date  End_time       Date  Max TemperatureF  ...  Mean Wind SpeedMPH  \
0  2015-08-31  23:39:00 2015-08-31              78.0  ...                 9.0   
1  2015-08-31  23:39:00 2015-08-31              80.0  ...                 4.0   
2  2015-08-31  23:39:00 2015-08-31              82.0  ...                 8.0   
3  2015-08-31  23:39:00 2015-08-31              82.0  ...                 6.0   

   Max Gust SpeedMPH  PrecipitationIn  CloudCover  Events  WindDirDegrees  \
0               21.0              0.0         1.0     NaN           246.0   
1               20.0    