[Feature] Speed boost? Using Geo indexing dependency #30
After some testing, I think I will stick to using the tabular query instead of the spatial query for efficiency: maintaining two separate processes seems to add complexity for model training and prediction. These arguments are presented with my intended setting in mind. Still, I will test whether the tabular query is better for spherical coordinates as well.

---

Update Jan 23, 2024: I finished implementing the spherical coordinates.
Determining whether a point falls in a triangle is already optimized using vectorization; I can imagine that using a Shapely object would cause a performance problem here. Additionally, I also checked the relative time consumption of the three combinations:

- run_normal_query + run_linalg_transform
- run_geo_query + run_linalg_transform
- run_geo_query + run_geo_transform

It is apparent that the transformation function for geo-objects is extremely time-consuming. By contrast, the time spent on the linear-algebra transform is almost negligible, and the pandas query is also very efficient:
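For context, a vectorized point-in-triangle check can be sketched with a barycentric-coordinate test. This is an illustrative NumPy sketch, not the package's actual implementation; the function name `points_in_triangle` is made up here:

```python
import numpy as np

def points_in_triangle(pts, a, b, c):
    # Vectorized barycentric test: pts is an (N, 2) array of points;
    # a, b, c are the triangle vertices as length-2 arrays.
    v0 = c - a
    v1 = b - a
    v2 = pts - a                      # (N, 2), broadcast against vertex a
    d00 = v0 @ v0
    d01 = v0 @ v1
    d11 = v1 @ v1
    d20 = v2 @ v0                     # (N,)
    d21 = v2 @ v1                     # (N,)
    denom = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    # Inside (or on the boundary) iff both barycentric weights are
    # non-negative and they sum to at most 1.
    return (v >= 0) & (w >= 0) & (v + w <= 1)
```

The whole test is a handful of array operations regardless of how many points are checked, which is why a per-point Shapely `contains` call would be hard to beat with object overhead.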
Code to regenerate this:

Generate pseudo-samples:

```python
import numpy as np
import pandas as pd

def get_samples(sample_size=1000):
    width = height = 10
    time_step = 10
    cali_point_x = np.random.uniform(-100, 100, 1000)
    cali_point_y = np.random.uniform(-100, 100, 1000)
    start_time = [int(i) for i in np.random.uniform(1, 10, 1000)]
    tabular_df = pd.DataFrame({
        'start_time': start_time,
        'end_time': [i + time_step for i in start_time],
        'x0': cali_point_x,
        'x1': cali_point_x + width,
        'y0': cali_point_y,
        'y1': cali_point_y + height,
    })
    query = pd.DataFrame({
        'lng': np.random.uniform(-100, 100, sample_size),
        'lat': np.random.uniform(-100, 100, sample_size),
        'time': np.random.uniform(1, 10, sample_size),
    })
    return tabular_df, query
```
Transformation function (jitter and rotation):

```python
import geopandas as gpd
from shapely.geometry import Point

# linalg transform
def transform_pandas_to_geopandas(query):
    query_ = gpd.GeoDataFrame(
        query, geometry=[Point(a, b) for a, b in zip(query['lng'], query['lat'])]
    )
    return query_

def run_linalg_transform(query_):
    if isinstance(query_, gpd.geodataframe.GeoDataFrame):
        query_['lng'] = query_['geometry'].x
        query_['lat'] = query_['geometry'].y
    a, b = JitterRotator.rotate_jitter(
        query_['lng'],
        query_['lat'],
        0, 50, 50)
    query_.loc[:, 'lng'] = np.array(a)
    query_.loc[:, 'lat'] = np.array(b)
    if isinstance(query_, gpd.geodataframe.GeoDataFrame):
        query_ = gpd.GeoDataFrame(query_, geometry=gpd.GeoSeries.from_xy(a, b))
    return query_

# geo transform
def run_geo_transform(query_):
    query_ = JitterRotator.rotate_jitter_gpd(
        query_,
        0, 50, 50)
    return query_
```
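`JitterRotator.rotate_jitter` is project-specific, but to illustrate why the linear-algebra path is cheap, a hypothetical stand-in might look like the sketch below. The signature and behavior are assumptions for illustration, not the package's actual code:

```python
import numpy as np

def rotate_jitter(lng, lat, angle_deg, dx, dy):
    # Hypothetical stand-in for JitterRotator.rotate_jitter: rotate the
    # coordinates about the origin by angle_deg degrees, then shift by
    # (dx, dy). Everything is vectorized NumPy, so the cost is a single
    # 2x2 matrix multiply regardless of the number of points.
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    xy = rot @ np.vstack([lng, lat])
    return xy[0] + dx, xy[1] + dy
```

A transform of this shape touches only two flat float arrays, which is consistent with the near-negligible timings observed for the linalg variant.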
Query function:

```python
# normal query:
def run_normal_query(query, tabular_df, transform_func):
    # query = transform_pandas_to_geopandas(_query)
    res = []
    unique_start_time = sorted(tabular_df['start_time'].unique())
    for start_time in unique_start_time:
        tmp_tabular_df = tabular_df[tabular_df['start_time'] == start_time]
        tmp_query = query[(query['time'] >= start_time) &
                          (query['time'] < tmp_tabular_df['end_time'].iloc[0])]
        tmp_query = transform_func(tmp_query)
        for index, line in tmp_tabular_df.iterrows():
            tmp = tmp_query[
                (tmp_query['lng'] >= line['x0']) &
                (tmp_query['lng'] < line['x1']) &
                (tmp_query['lat'] >= line['y0']) &
                (tmp_query['lat'] < line['y1'])
            ]
            res.append(tmp)
    res = pd.concat(res, axis=0)
    return res.shape[0]
```
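The rectangle membership test inside `run_normal_query` is a plain pandas boolean mask. A minimal standalone sketch of the same pattern, with toy points in place of the benchmark data:

```python
import pandas as pd

# Keep the points that fall inside the half-open box [x0, x1) x [y0, y1),
# mirroring the mask used per rectangle in run_normal_query.
pts = pd.DataFrame({'lng': [0.5, 2.0, -1.0], 'lat': [0.5, 0.5, 0.5]})
x0, x1, y0, y1 = 0.0, 1.0, 0.0, 1.0
mask = (pts['lng'] >= x0) & (pts['lng'] < x1) & \
       (pts['lat'] >= y0) & (pts['lat'] < y1)
inside = pts[mask]
print(inside.shape[0])  # 1: only (0.5, 0.5) falls inside the unit square
```

Each comparison is a vectorized column operation, so the per-rectangle cost stays low even as the number of query points grows.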
```python
# geo query:
from shapely.geometry import Point, Polygon

def run_geo_query(query, tabular_df, transform_func):
    tabular_df_ = gpd.GeoDataFrame(
        tabular_df,
        geometry=[Polygon([(a, c), (a, d), (b, d), (b, c)]) for a, b, c, d in zip(
            tabular_df['x0'], tabular_df['x1'],
            tabular_df['y0'], tabular_df['y1']
        )]
    )
    query_ = transform_pandas_to_geopandas(query)
    res_list = []
    unique_start_time = sorted(tabular_df_['start_time'].unique())
    for start_time in unique_start_time:
        tmp_tabular_df_ = tabular_df_[tabular_df_['start_time'] == start_time]
        end_time = tmp_tabular_df_['end_time'].iloc[0]
        tmp_query_ = query_[(query_['time'] >= start_time) & (query_['time'] < end_time)]
        tmp_query_ = transform_func(tmp_query_)
        res = tmp_query_.sjoin(tmp_tabular_df_)
        res_list.append(res)
    res_list = pd.concat(res_list, axis=0)
    return res_list.shape[0]
```

Execution:

```python
import time

def get_time(query, tabular_df, query_func, transform_func):
    time_list = []
    for i in range(5):
        start_time = time.time()
        query_result_shape = query_func(query, tabular_df, transform_func)
        print(query_result_shape)
        end_time = time.time()
        time_list.append(end_time - start_time)
    return np.mean(time_list)
```
```python
from tqdm import tqdm

res_list = []
for sample_size in tqdm(np.logspace(2, 4, 8)):
    sample_size = int(sample_size)
    tabular_df, query = get_samples(sample_size=sample_size)
    # 1. run_normal_query + run_linalg_transform
    # 2. run_geo_query + run_linalg_transform
    # 3. run_geo_query + run_geo_transform
    time1 = get_time(query, tabular_df, run_normal_query, run_linalg_transform)
    time2 = get_time(query, tabular_df, run_geo_query, run_linalg_transform)
    time3 = get_time(query, tabular_df, run_geo_query, run_geo_transform)
    res_list.append({
        'sample_size': sample_size,
        'run_normal_query + run_linalg_transform': time1,
        # 'run_normal_query + run_geo_transform': time2,
        'run_geo_query + run_linalg_transform': time2,
        'run_geo_query + run_geo_transform': time3,
    })
res_list = pd.DataFrame(res_list)
```

Plot results:
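The original plot image is not preserved in this thread. A minimal sketch of how the timing table could be plotted, where the numbers below are placeholders only and not the measured results:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder numbers -- NOT the measured timings, just a stand-in for the
# res_list DataFrame built in the execution step above.
res_list = pd.DataFrame({
    'sample_size': [100, 1000, 10000],
    'run_normal_query + run_linalg_transform': [0.05, 0.06, 0.09],
    'run_geo_query + run_linalg_transform': [0.4, 0.5, 0.9],
    'run_geo_query + run_geo_transform': [2.0, 2.5, 4.0],
})

# One line per query/transform combination, log-log axes to cover the
# np.logspace(2, 4, 8) sweep of sample sizes.
ax = res_list.set_index('sample_size').plot(marker='o', logx=True, logy=True)
ax.set_xlabel('sample size')
ax.set_ylabel('mean wall time (s)')
ax.figure.savefig('timing_comparison.png')
```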
Closing this.
As suggested during the JOSS review, I should probably use geo-indexing for the prediction problem. This issue is to see if geopandas will speed up the indexing-related tasks.