## Station Locations to POINTs
The combined four-month dataset is very large. Instead of working with all of it, we first look at a subset of 200 stations around one selected station. Since the first station in the dataset is <u>1 Ave & E 110 St</u>, the 200 stations nearest to it (counting the station itself) are selected.

In [1]:
# Importing the rentals dataset
import geopandas as gpd
import pandas as pd
rentals = pd.read_csv(r"C:\Users\singh\Desktop\TUD (All Semesters)\Courses - Semester 5 (TU Dresden)\Research Task - Spatial Modelling\Code\RENTALS.csv")
rentals.info()                      

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1783152 entries, 0 to 1783151
Data columns (total 10 columns):
 #   Column     Dtype  
---  ------     -----  
 0   name       object 
 1   lat        float64
 2   lng        float64
 3   datetime   object 
 4   #_rentals  int64  
 5   year       int64  
 6   month      int64  
 7   day        int64  
 8   hour       int64  
 9   ID         int64  
dtypes: float64(2), int64(6), object(2)
memory usage: 136.0+ MB


In [2]:
# adding POINT geometry to rentals
rentals["coordinates"] = gpd.points_from_xy(rentals.lng, rentals.lat, crs="EPSG:4326")
rentals.head()

Unnamed: 0,name,lat,lng,datetime,#_rentals,year,month,day,hour,ID,coordinates
0,1 Ave & E 110 St,40.792327,-73.9383,2023-12-31 08:00:00.000,0,2023,12,31,8,0,POINT (-73.9383 40.79233)
1,1 Ave & E 110 St,40.792327,-73.9383,2024-03-28 10:00:00.000,1,2024,3,28,10,0,POINT (-73.9383 40.79233)
2,1 Ave & E 110 St,40.792327,-73.9383,2024-03-28 12:00:00.000,1,2024,3,28,12,0,POINT (-73.9383 40.79233)
3,1 Ave & E 110 St,40.792327,-73.9383,2024-03-28 14:00:00.000,1,2024,3,28,14,0,POINT (-73.9383 40.79233)
4,1 Ave & E 110 St,40.792327,-73.9383,2024-03-28 16:00:00.000,3,2024,3,28,16,0,POINT (-73.9383 40.79233)


In [3]:
# Identifying unique stations
stations = rentals.drop_duplicates(subset=['name', 'ID', 'coordinates'], ignore_index=True)
stations = stations[["name", "ID", "coordinates"]]
stations.head()

Unnamed: 0,name,ID,coordinates
0,1 Ave & E 110 St,0,POINT (-73.9383 40.79233)
1,1 Ave & E 16 St,1,POINT (-73.98166 40.73222)
2,1 Ave & E 18 St,2,POINT (-73.98054 40.73381)
3,1 Ave & E 30 St,3,POINT (-73.97536 40.74144)
4,1 Ave & E 39 St,4,POINT (-73.97113 40.74714)


## Selecting Stations
The stations need to be filtered so that only those nearest to the selected station remain. Working with stations that are close together helps when modelling <u>spatial dependencies</u>. To determine which stations are closest, the distance from the selected station to every station has to be calculated.

In [4]:
# stations df is not a GeoDataFrame
print(type(stations))

# Convert to a GeoDataFrame
stations_gdf = gpd.GeoDataFrame(stations, geometry='coordinates', crs="EPSG:4326")
print(type(stations_gdf))

<class 'pandas.core.frame.DataFrame'>
<class 'geopandas.geodataframe.GeoDataFrame'>


In [5]:
# Point geometries
stations_gdf.head()

Unnamed: 0,name,ID,coordinates
0,1 Ave & E 110 St,0,POINT (-73.9383 40.79233)
1,1 Ave & E 16 St,1,POINT (-73.98166 40.73222)
2,1 Ave & E 18 St,2,POINT (-73.98054 40.73381)
3,1 Ave & E 30 St,3,POINT (-73.97536 40.74144)
4,1 Ave & E 39 St,4,POINT (-73.97113 40.74714)


In [6]:
# Straight-line distance between the first two stations (in degrees, since the CRS is EPSG:4326)
stations_gdf.loc[0,"coordinates"].distance(stations_gdf.loc[1,"coordinates"])

0.07411314093461886
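
The value above is a planar distance in degrees, because the coordinates are longitude/latitude pairs in EPSG:4326. To get a feel for what it means on the ground, the following is a minimal sketch that converts the same pair of stations to an approximate distance in metres with the haversine formula. It assumes a spherical Earth with a mean radius of 6,371 km, and `haversine_m` is a helper introduced here for illustration, not part of the notebook.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in metres on a sphere (radius is an assumption)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

# Coordinates copied (rounded) from the station table above, as (lat, lng)
first = (40.79233, -73.9383)     # 1 Ave & E 110 St
second = (40.73222, -73.98166)   # 1 Ave & E 16 St

# Planar distance in degrees, comparable to the shapely result above
degrees = math.hypot(second[0] - first[0], second[1] - first[1])

# Approximate ground distance in metres
metres = haversine_m(*first, *second)

print(f"{degrees:.5f} degrees, about {metres / 1000:.1f} km")
```

The degree value is still usable for a rough ranking of nearby stations, but it should not be read as a physical distance.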

In [7]:
# Calculating the distance (in degrees) from '1 Ave & E 110 St' to every station
origin = stations_gdf.loc[0, "coordinates"]
dist = [origin.distance(point) for point in stations_gdf["coordinates"]]
dist[:5]

[0.0,
 0.07411314093461886,
 0.07217048842703006,
 0.06294932605645888,
 0.055854202562030636]

In [8]:
# Adding dist as a column
stations_gdf["distance_to_1Ave&E110_St"] = dist
stations_gdf.head()

Unnamed: 0,name,ID,coordinates,distance_to_1Ave&E110_St
0,1 Ave & E 110 St,0,POINT (-73.9383 40.79233),0.0
1,1 Ave & E 16 St,1,POINT (-73.98166 40.73222),0.074113
2,1 Ave & E 18 St,2,POINT (-73.98054 40.73381),0.07217
3,1 Ave & E 30 St,3,POINT (-73.97536 40.74144),0.062949
4,1 Ave & E 39 St,4,POINT (-73.97113 40.74714),0.055854


In [20]:
# 200 nearest stations (the list includes '1 Ave & E 110 St' itself at distance 0)
selected_stations = stations_gdf.sort_values(by="distance_to_1Ave&E110_St")
selected_stations = selected_stations.head(200)
selected_stations.head()

Unnamed: 0,name,ID,coordinates,distance_to_1Ave&E110_St
0,1 Ave & E 110 St,0,POINT (-73.9383 40.79233),0.0
898,E 114 St & 1 Ave,901,POINT (-73.93625 40.79457),0.003033
891,E 106 St & 1 Ave,894,POINT (-73.93956 40.78925),0.003323
115,3 Ave & E 112 St,115,POINT (-73.94161 40.79551),0.004588
901,E 116 St & 2 Ave,904,POINT (-73.93726 40.79688),0.004669
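
Because the ranking above is based on distances in degrees, it can differ slightly from a ranking by ground distance: at this latitude one degree of longitude covers less ground than one degree of latitude. The sketch below illustrates the effect with two hypothetical candidate points (they are not stations from the dataset) and an equirectangular approximation, which is an assumption that is reasonable only for distances of a few kilometres.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical assumption

def approx_metres(lat1, lon1, lat2, lon2):
    """Equirectangular approximation of ground distance in metres."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat) * EARTH_RADIUS_M
    dy = math.radians(lat2 - lat1) * EARTH_RADIUS_M
    return math.hypot(dx, dy)

def degree_distance(lat1, lon1, lat2, lon2):
    """Planar distance in degrees, as used for the ranking above."""
    return math.hypot(lat2 - lat1, lon2 - lon1)

origin = (40.792327, -73.9383)  # 1 Ave & E 110 St, as (lat, lng)

# Hypothetical candidates: one due north, one due east of the origin
candidates = {
    "north": (origin[0] + 0.0035, origin[1]),
    "east": (origin[0], origin[1] + 0.0040),
}

by_degrees = sorted(candidates, key=lambda n: degree_distance(*origin, *candidates[n]))
by_metres = sorted(candidates, key=lambda n: approx_metres(*origin, *candidates[n]))

print("ranked by degrees:", by_degrees)
print("ranked by metres: ", by_metres)
```

For a selection of 200 stations the effect mainly concerns stations near the cut-off, so the degree-based ranking may be adequate here. Reprojecting to a metric CRS before measuring would avoid the issue.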


In [26]:
# filtering the rentals data to the chosen stations
selected_rentals = rentals[rentals["ID"].isin(selected_stations["ID"])]
selected_rentals.info()

<class 'pandas.core.frame.DataFrame'>
Index: 170800 entries, 0 to 1765217
Data columns (total 11 columns):
 #   Column       Non-Null Count   Dtype   
---  ------       --------------   -----   
 0   name         170800 non-null  object  
 1   lat          170800 non-null  float64 
 2   lng          170800 non-null  float64 
 3   datetime     170800 non-null  object  
 4   #_rentals    170800 non-null  int64   
 5   year         170800 non-null  int64   
 6   month        170800 non-null  int64   
 7   day          170800 non-null  int64   
 8   hour         170800 non-null  int64   
 9   ID           170800 non-null  int64   
 10  coordinates  170800 non-null  geometry
dtypes: float64(2), geometry(1), int64(6), object(2)
memory usage: 15.6+ MB


In [27]:
# Checking the number of distinct stations in the filtered data
selected_rentals["ID"].nunique()

200

In [28]:
# exporting the selected rentals
selected_rentals.to_csv("rentals_near200_st.csv", index=False)