# Training and Test Data
This notebook checks and prepares the final training and test data for the research task.
## Quality Check
This section inspects the quality of the created dataset and looks for any discrepancies.

In [1]:
import pandas as pd
rentals_jan = pd.read_csv(r"C:\Users\singh\Desktop\TUD (All Semesters)\Courses - Semester 5 (TU Dresden)\Research Task - Spatial Modelling\Code\rentals_with_demand_new_time_units.csv")
rentals_feb = pd.read_csv(r"C:\Users\singh\Desktop\TUD (All Semesters)\Courses - Semester 5 (TU Dresden)\Research Task - Spatial Modelling\Code\rentals_with_demand_new_time_units_feb.csv")
rentals_mar = pd.read_csv(r"C:\Users\singh\Desktop\TUD (All Semesters)\Courses - Semester 5 (TU Dresden)\Research Task - Spatial Modelling\Code\rentals_with_demand_new_time_units_mar.csv")
rentals_apr = pd.read_csv(r"C:\Users\singh\Desktop\TUD (All Semesters)\Courses - Semester 5 (TU Dresden)\Research Task - Spatial Modelling\Code\rentals_with_demand_new_time_units_apr.csv")

In [2]:
# Columns preserved
rentals_jan.columns

Index(['name', 'lat', 'lng', 'station_id', 'datetime', '#_rentals', 'year',
       'month', 'day', 'hour'],
      dtype='object')

In [3]:
# Unique stations in each dataset
print(len(rentals_jan["station_id"].unique()), len(rentals_feb["station_id"].unique()), len(rentals_mar["station_id"].unique()), len(rentals_apr["station_id"].unique()))

2124 2124 2127 2125


- March has the most stations.
- The March dataset has <u>3 additional stations</u> compared with January and February, and the April dataset has <u>one additional station</u>. Because the station counts differ, *station_id* is likely not uniform across the monthly files, in particular for **March** and **April**.
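One way to test whether *station_id* really differs between files is to look for station names that map to more than one id. The sketch below uses two small invented frames (`jan_demo`, `mar_demo`) instead of the real CSVs, and the helper name `inconsistent_ids` is hypothetical.

```python
import pandas as pd

# Toy stand-ins for two monthly files (values are invented for illustration).
jan_demo = pd.DataFrame({"name": ["A St", "B St", "C St"],
                         "station_id": [0, 1, 2]})
mar_demo = pd.DataFrame({"name": ["A St", "AA St", "B St", "C St"],
                         "station_id": [0, 1, 2, 3]})

def inconsistent_ids(*frames):
    """Return station names that map to more than one station_id across frames."""
    pairs = pd.concat(frames, ignore_index=True)[["name", "station_id"]].drop_duplicates()
    ids_per_name = pairs.groupby("name")["station_id"].nunique()
    return sorted(ids_per_name[ids_per_name > 1].index)

mismatched = inconsistent_ids(jan_demo, mar_demo)
print(mismatched)
```

In the toy data, one extra station in March shifts the ids of every station after it, so those names are reported as inconsistent.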

In [4]:
# evaluating the time range for each dataset
print(min(rentals_jan["month"]), min(rentals_feb["month"]), min(rentals_mar["month"]), min(rentals_apr["month"]))
print(max(rentals_jan["month"]), max(rentals_feb["month"]), max(rentals_mar["month"]), max(rentals_apr["month"]))

1 2 3 4
12 2 3 4


In [5]:
# Even in rentals_jan, data is only stored for Jan (in addition to 31 Dec'23)
rentals_jan[rentals_jan["month"] == 2]

Unnamed: 0,name,lat,lng,station_id,datetime,#_rentals,year,month,day,hour


The time range of each dataset looks good!
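The month column only gives a coarse view of the time range. A finer check, sketched here on invented rows that follow the format of the *datetime* column, parses the timestamps and reads off the first and last value as well as the number of distinct days.

```python
import pandas as pd

# Invented rows in the same string format as the `datetime` column.
demo = pd.DataFrame({"datetime": ["2023-12-31 08:00:00.000",
                                  "2024-01-01 08:00:00.000",
                                  "2024-01-31 20:00:00.000"]})

stamps = pd.to_datetime(demo["datetime"])
first, last = stamps.min(), stamps.max()

# Distinct calendar days covered by the data.
n_days = stamps.dt.normalize().nunique()
print(first, last, n_days)
```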

## Reading station information
The monthly datasets from January to April may not share the same station names and IDs. This looks especially true for March and April.

In [6]:
# finding elements in A not in B
a = [1,2,3,4]
b = [1,4]

set(a) - set(b)

{2, 3}

In [28]:
# Comparing January data with remaining months
print(len(set(rentals_jan["name"]) - set(rentals_feb["name"])))
print(len(set(rentals_jan["name"]) - set(rentals_mar["name"])))
print(len(set(rentals_jan["name"]) - set(rentals_apr["name"])))

15
16
26


January seems to contain many station names that do not appear in later months, which suggests that these stations were later discontinued.

In [29]:
# Comparing February data with remaining months
print(len(set(rentals_feb["name"]) - set(rentals_jan["name"])))
print(len(set(rentals_feb["name"]) - set(rentals_mar["name"])))
print(len(set(rentals_feb["name"]) - set(rentals_apr["name"])))

17
6
17


In [30]:
# Comparing March data with remaining months
print(len(set(rentals_mar["name"]) - set(rentals_jan["name"])))
print(len(set(rentals_mar["name"]) - set(rentals_feb["name"])))
print(len(set(rentals_mar["name"]) - set(rentals_apr["name"])))

23
11
12
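The pairwise comparisons above can also be collected in a single table. This is a minimal sketch on invented station-name sets; `names_by_month` is a hypothetical dictionary that would hold the per-month name sets in practice.

```python
import pandas as pd

# Hypothetical station-name sets, one per month.
names_by_month = {
    "jan": {"A St", "B St", "C St"},
    "feb": {"A St", "B St", "D St"},
    "mar": {"A St", "D St", "E St"},
}

# Cell (row, col) = number of names present in `row` but absent from `col`.
months = list(names_by_month)
diff_counts = pd.DataFrame(
    [[len(names_by_month[r] - names_by_month[c]) for c in months] for r in months],
    index=months,
    columns=months,
)
print(diff_counts)
```

The table is not symmetric: a station can be missing in one direction without a counterpart in the other.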


In [76]:
# Number of common stations by name - all datasets
jan_stations = set(rentals_jan["name"].unique())
feb_stations = set(rentals_feb["name"].unique())
mar_stations = set(rentals_mar["name"].unique())
apr_stations = set(rentals_apr["name"].unique())
common_stations = set.intersection(jan_stations, feb_stations, mar_stations, apr_stations)
len(common_stations)

2088

In [38]:
# Viewing the list
common_stations = list(common_stations)
common_stations[:5]

['E 180 St & Monterey Ave',
 '71 St & 37 Ave',
 'Van Brunt St & Van Dyke St',
 'Wyckoff St & 3 Ave',
 'W 218 St & Indian Rd']

So, 2088 stations are common to all four monthly datasets. Only these stations are used, to keep the data uniform. Because some stations are missing from each monthly dataset, the stored *station ids* are <u>not uniform</u>.

In [43]:
# Comparing common records (for Jan) with Jan records
print(len(rentals_jan))
print(len(rentals_jan[rentals_jan["name"].isin(common_stations)]))

475776
468832
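The same comparison can be run for every month at once. The sketch below assumes the monthly frames are held in a dictionary (`monthly`, a hypothetical name) and uses invented rows; note that `isin` accepts a set directly.

```python
import pandas as pd

# Invented monthly frames keyed by month name.
monthly = {
    "jan": pd.DataFrame({"name": ["A St", "A St", "B St", "C St"]}),
    "feb": pd.DataFrame({"name": ["A St", "B St", "B St", "D St"]}),
}

# Names present in every month.
common = set.intersection(*(set(df["name"]) for df in monthly.values()))

# Rows per month before and after restricting to the common names.
total = {m: len(df) for m, df in monthly.items()}
kept = {m: int(df["name"].isin(common).sum()) for m, df in monthly.items()}
print(total, kept)
```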


## Combining the dataset 
The entire four-month dataset is combined row-wise into a single dataframe.

In [46]:
# combining the data
citibike = pd.concat([rentals_jan, rentals_feb, rentals_mar, rentals_apr], axis=0, ignore_index=True)
citibike.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1814757 entries, 0 to 1814756
Data columns (total 10 columns):
 #   Column      Dtype  
---  ------      -----  
 0   name        object 
 1   lat         float64
 2   lng         float64
 3   station_id  int64  
 4   datetime    object 
 5   #_rentals   int64  
 6   year        int64  
 7   month       int64  
 8   day         int64  
 9   hour        int64  
dtypes: float64(2), int64(6), object(2)
memory usage: 138.5+ MB


In [47]:
# dropping the station_id column since it is not uniform across the monthly files
citibike.drop(columns=["station_id"], inplace=True)
citibike.columns

Index(['name', 'lat', 'lng', 'datetime', '#_rentals', 'year', 'month', 'day',
       'hour'],
      dtype='object')

In [49]:
# rows before and after filtering for common stations
print(len(citibike))
len(citibike[citibike["name"].isin(common_stations)])

1814757


1785105

In [50]:
# filtering for common stations 
citibike = citibike[citibike["name"].isin(common_stations)]
len(citibike)

1785105

In [51]:
# number of unique stations
print(len(citibike["name"].unique()))

2088


## Preparing station id
In this dataset, a station may have the same name but slightly different coordinates. So station *name* must be **combined** with the *coordinates* to get a unique <u>station id</u>. 

In [58]:
# Unique name-coordinate pair
station_identifier = citibike.drop_duplicates(subset=["name", "lat", "lng"])
print(len(station_identifier))
print(len(citibike["name"].unique()))

2096
2088


Hence, there are some stations in *citibike* with the same name but different coordinates!
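To see which station names are affected, the unique name-coordinate pairs can be counted per name. A minimal sketch on invented rows:

```python
import pandas as pd

# Invented rows: "A St" appears with two slightly different latitudes.
demo = pd.DataFrame({
    "name": ["A St", "A St", "A St", "B St", "B St"],
    "lat":  [40.70, 40.70, 40.71, 40.80, 40.80],
    "lng":  [-73.90, -73.90, -73.90, -73.95, -73.95],
})

# One row per unique (name, lat, lng) combination.
pairs = demo.drop_duplicates(subset=["name", "lat", "lng"])

# Names that occur with more than one coordinate pair.
coords_per_name = pairs.groupby("name").size()
multi_coord_names = coords_per_name[coords_per_name > 1].index.tolist()
print(multi_coord_names)
```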

In [61]:
# Preparing ID list
ids = citibike.drop_duplicates(subset=["name", "lat", "lng"], ignore_index=True)[["name", "lat", "lng"]]
ids.sort_values(by = "name", ignore_index=True, inplace=True)
ids.head()

Unnamed: 0,name,lat,lng
0,1 Ave & E 110 St,40.792327,-73.9383
1,1 Ave & E 16 St,40.732219,-73.981656
2,1 Ave & E 18 St,40.733812,-73.980544
3,1 Ave & E 30 St,40.741444,-73.975361
4,1 Ave & E 39 St,40.74714,-73.97113


In [62]:
# Assigning IDs
ids["ID"] = range(0,len(ids))
ids.head()

Unnamed: 0,name,lat,lng,ID
0,1 Ave & E 110 St,40.792327,-73.9383,0
1,1 Ave & E 16 St,40.732219,-73.981656,1
2,1 Ave & E 18 St,40.733812,-73.980544,2
3,1 Ave & E 30 St,40.741444,-73.975361,3
4,1 Ave & E 39 St,40.74714,-73.97113,4
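As an alternative to building a lookup table and merging it back, pandas can number the name-coordinate groups directly with `groupby(...).ngroup()`. This is a sketch on invented rows, not the method used in this notebook; the resulting numbers follow the sorted order of (name, lat, lng), so they may differ from the IDs assigned above for stations that share a name.

```python
import pandas as pd

# Invented rows with an unsorted mix of stations.
demo = pd.DataFrame({
    "name": ["B St", "A St", "A St", "B St"],
    "lat":  [40.80, 40.71, 40.70, 40.80],
    "lng":  [-73.95, -73.90, -73.90, -73.95],
})

# Each unique (name, lat, lng) combination gets one integer, in sorted key order.
demo["ID"] = demo.groupby(["name", "lat", "lng"]).ngroup()
print(demo)
```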


In [65]:
# Merging ids with citibike data
citibike = citibike.merge(ids, on=["name", "lat", "lng"], how='left')
citibike.head()

Unnamed: 0,name,lat,lng,datetime,#_rentals,year,month,day,hour,ID_x,ID_y
0,1 Ave & E 110 St,40.792327,-73.9383,2023-12-31 08:00:00.000,0,2023,12,31,8,-,0
1,1 Ave & E 110 St,40.792327,-73.9383,2023-12-31 10:00:00.000,0,2023,12,31,10,-,0
2,1 Ave & E 110 St,40.792327,-73.9383,2023-12-31 12:00:00.000,0,2023,12,31,12,-,0
3,1 Ave & E 110 St,40.792327,-73.9383,2023-12-31 14:00:00.000,0,2023,12,31,14,-,0
4,1 Ave & E 110 St,40.792327,-73.9383,2023-12-31 16:00:00.000,0,2023,12,31,16,-,0


In [66]:
# Processing columns
citibike.drop(columns=["ID_x"], inplace=True)
citibike = citibike.rename(columns={'ID_y': 'ID'})
citibike.sort_values(by="name", ignore_index=True, inplace=True)
citibike.head()

Unnamed: 0,name,lat,lng,datetime,#_rentals,year,month,day,hour,ID
0,1 Ave & E 110 St,40.792327,-73.9383,2023-12-31 08:00:00.000,0,2023,12,31,8,0
1,1 Ave & E 110 St,40.792327,-73.9383,2024-04-29 16:00:00.000,2,2024,4,29,16,0
2,1 Ave & E 110 St,40.792327,-73.9383,2024-04-29 18:00:00.000,2,2024,4,29,18,0
3,1 Ave & E 110 St,40.792327,-73.9383,2024-04-29 20:00:00.000,1,2024,4,29,20,0
4,1 Ave & E 110 St,40.792327,-73.9383,2024-04-30 08:00:00.000,4,2024,4,30,8,0


In [68]:
# Checking data for station: '1 Ave & E 110 St'
len(citibike[citibike["name"] == '1 Ave & E 110 St'])

854

In [71]:
# expected records for station "1 Ave & E 110 St": 7 per day x days (32 for 31 Dec + Jan, 29 Feb, 31 Mar, 30 Apr)
(7*32) + (7*29) + (7*31) + (7*30)

854

There should be 854 records for every station present in the dataset.
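The figure of 854 can be derived from the calendar instead of hard-coded day counts. The sketch assumes seven time units per day, as in the arithmetic above, and one extra day for 31 December 2023.

```python
import calendar

UNITS_PER_DAY = 7  # assumption: seven time units per day, as in the arithmetic above

# 31 Dec 2023 plus every day of January to April 2024 (2024 is a leap year).
days = 1 + sum(calendar.monthrange(2024, m)[1] for m in (1, 2, 3, 4))
expected_records = UNITS_PER_DAY * days
print(days, expected_records)
```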

In [81]:
# for which station do we not have 854 rows?
records = []
for i in range(0,len(ids)):
    records.append(len(citibike[citibike["ID"] == i]))

records[:5]

[854, 854, 854, 854, 854]

In [85]:
# 8 station IDs have fewer than 854 records!
sum([j < 854 for j in records])

8

It appears that the dataset is incomplete for 8 station IDs!
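The per-ID counts can also be obtained without an explicit loop by using `value_counts`. A minimal sketch on invented data, where `EXPECTED` stands in for 854:

```python
import pandas as pd

# Invented ID column: ID 1 has only two rows instead of three.
demo = pd.DataFrame({"ID": [0, 0, 0, 1, 1, 2, 2, 2]})
EXPECTED = 3  # stand-in for 854 in the real data

# Rows per ID, ordered by ID.
counts = demo["ID"].value_counts().sort_index()

# IDs with fewer rows than expected.
incomplete_ids = counts[counts < EXPECTED].index.tolist()
print(incomplete_ids)
```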

In [89]:
# IDs of the stations with fewer than 854 records
[i for i, num in enumerate(records) if num < 854]

[504, 580, 846, 1028, 1310, 1759, 1938, 2019]

In [93]:
# Checking for station ID: 504
len(citibike[citibike["ID"] == 504])

224

## Removing some stations
Stations with IDs *504, 580, 846, 1028, 1310, 1759, 1938, 2019* are removed from the dataset because complete records are not available for them. Retaining these stations would likely reduce prediction performance unnecessarily.

In [99]:
# filtering the data (.copy() avoids the SettingWithCopyWarning raised by the in-place sort)
unwanted_ids = [504, 580, 846, 1028, 1310, 1759, 1938, 2019]
citibike = citibike[~citibike["ID"].isin(unwanted_ids)].copy()
citibike.sort_values(by="name", ignore_index=True, inplace=True)
citibike[citibike["ID"] == 504]


Unnamed: 0,name,lat,lng,datetime,#_rentals,year,month,day,hour,ID


In [100]:
# Unique stations remaining
print(len(citibike["ID"].unique()))
print(len(citibike))

2088
1783152


In [101]:
# Calculating total records now
854*2088

1783152

The row count matches the expected 854 × 2088 = 1783152 records, so the dataset is complete!
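A matching total does not by itself rule out duplicated or unevenly distributed rows. A slightly stricter check, sketched here on invented data with `EXPECTED` standing in for 854, verifies the count per ID and the uniqueness of each (ID, datetime) pair.

```python
import pandas as pd

# Invented data: two IDs, two time units each.
demo = pd.DataFrame({
    "ID": [0, 0, 1, 1],
    "datetime": ["2024-01-01 08:00:00", "2024-01-01 10:00:00"] * 2,
})
EXPECTED = 2  # stand-in for 854 in the real data

# Every ID should have exactly the expected number of rows.
rows_per_id = demo.groupby("ID").size()
all_complete = bool((rows_per_id == EXPECTED).all())

# No (ID, datetime) combination should appear twice.
no_duplicates = not demo.duplicated(subset=["ID", "datetime"]).any()
print(all_complete, no_duplicates)
```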

In [102]:
# Exporting the final data
citibike.to_csv("RENTALS.csv", index=False)