This Jupyter Notebook builds a pipeline for extracting, enriching, and preparing Points of Interest (POI) data for Kraków from OpenStreetMap (OSM), with a focus on supporting travel assistant and Retrieval-Augmented Generation (RAG) applications. The workflow includes:

1. **Data Extraction**:  
    - Uses OSMnx to download POIs for Kraków, grouped into categories like restaurants, attractions, accommodation, transport, entertainment, shopping, services, and outdoor.
    - Aggregates POIs into a single GeoDataFrame.

2. **Data Cleaning & Feature Engineering**:  
    - Identifies and concatenates all available name fields for each POI.
    - Analyzes missing values and coverage for key descriptive columns.

3. **Wikipedia Enrichment**:  
    - Extracts Wikipedia links from OSM data.
    - Uses the `wikipediaapi` library to fetch summaries in the original language and, if available, in English.
    - For non-English summaries, leverages OpenAI's GPT model to translate Polish descriptions to English, ensuring proper handling of named entities.

4. **Column Selection for RAG**:  
    - Uses an LLM prompt to select the most relevant columns for search, retrieval, and user-facing applications, focusing on descriptive, categorical, and location-based features.

5. **Data Export**:  
    - Saves the final, enriched, and filtered POI dataset to CSV files for downstream use.

6. **Analysis & Statistics**:  
    - Computes statistics on missing data for selected columns to inform further data cleaning or feature selection.

**Overall**, the notebook demonstrates a robust approach to building a high-quality, multilingual POI dataset for Kraków, suitable for powering intelligent travel assistants and search/retrieval systems. It combines geospatial data processing, knowledge enrichment from Wikipedia, and LLM-powered translation and feature selection.

In [None]:
import osmnx as ox
import geopandas as gpd
import pandas as pd
import os
import wikipediaapi
import tqdm

In [2]:
category_tags = {
            'restaurants': {
                'amenity': ['restaurant', 'cafe', 'fast_food', 'food_court', 'ice_cream', 'pub', 'bar', 'biergarten']
            },
            'attractions': {
                'tourism': ['attraction', 'museum', 'monument', 'artwork', 'viewpoint', 'zoo', 'theme_park','yes'],
                'historic': ['castle', 'church', 'cathedral', 'monastery', 'ruins', 'memorial', 'monument'],
                'leisure': ['park', 'garden', 'nature_reserve']
            },
            'accommodation': {
                'tourism': ['hotel', 'hostel', 'guest_house', 'apartment', 'camp_site', 'chalet']
            },
            'transport': {
                'amenity': ['bus_station', 'taxi'],
                'railway': ['station', 'tram_stop'],
                'aeroway': ['aerodrome', 'terminal'],
                'public_transport': ['station', 'stop_position']
            },
            'entertainment': {
                'leisure': ['cinema', 'theatre', 'nightclub', 'bowling_alley', 'amusement_arcade'],
                'amenity': ['casino', 'community_centre', 'social_centre']
            },
            'shopping': {
                'shop': ['mall', 'department_store', 'supermarket', 'marketplace'],
                'amenity': ['marketplace']
            },
            'services': {
                'amenity': ['hospital', 'clinic', 'pharmacy', 'bank', 'atm', 'post_office', 'library'],
                'tourism': ['information']
            },
            'outdoor': {
                'natural': ['beach', 'peak', 'cave', 'spring'],
                'leisure': ['beach_resort', 'sports_centre', 'stadium', 'swimming_pool'],
                'sport': ['skiing', 'climbing', 'hiking']
            }
        }
    

In [3]:
categories = list(category_tags.keys())

In [4]:

all_pois = gpd.GeoDataFrame()

for category in categories:
   
    tags = category_tags[category]
    
    pois = ox.features_from_place(f"Kraków", tags=tags)
    
    if not pois.empty:
        pois['poi_category'] = category
        all_pois = pd.concat([all_pois, pois], ignore_index=True)
        



In [5]:
all_pois.head(10)

Unnamed: 0,geometry,amenity,check_date,name,opening_hours,outdoor_seating,wheelchair,brand,brand:wikidata,brand:wikipedia,...,climbing:rock,rock,supervised,lifeguard,climbing:grade:polish:max,climbing:grade:polish:min,climbing:routes,roof:direction,swimming,lanes
0,POINT (19.93182 50.06126),pub,2024-02-23,Stary Port,"Mo-We 10:00-01:00; Th,Fr 10:00-03:00; Sa 12:00...",yes,no,,,,...,,,,,,,,,,
1,POINT (19.89277 50.0883),fast_food,2024-01-28,McDonald's,24/7,,yes,McDonald's,Q38076,en:McDonald's,...,,,,,,,,,,
2,POINT (19.94146 50.06113),biergarten,,Re,,,,,,,...,,,,,,,,,,
3,POINT (19.94886 50.05036),bar,2024-10-15,Duffy's Irish Bar,,,,,,,...,,,,,,,,,,
4,POINT (19.94478 50.05172),fast_food,,Bar Na Maxa,,,,,,,...,,,,,,,,,,
5,POINT (19.9448 50.05163),fast_food,,Frytki Belgijskie,,,,,,,...,,,,,,,,,,
6,POINT (19.945 50.0518),fast_food,,Wanda Frączek Królowa,,no,,,,,...,,,,,,,,,,
7,POINT (19.94511 50.05167),fast_food,,Bar pod Okrąglakiem,,,,,,,...,,,,,,,,,,
8,POINT (19.93699 50.06324),cafe,,DeCafencja,,,,,,,...,,,,,,,,,,
9,POINT (19.93434 50.06195),restaurant,2024-08-03,Restauracja Cechowa,Mo-Su 12:00-22:00,,,,,,...,,,,,,,,,,


In [6]:
wiki_data_columns = [x for x in all_pois.columns if 'wiki' in x]

In [7]:
name_columns = [x for x in all_pois.columns if 'name' in x]

In [8]:
wiki_data_columns

['brand:wikidata',
 'brand:wikipedia',
 'wikidata',
 'wikipedia',
 'operator:wikidata',
 'operator:wikipedia',
 'not:brand:wikidata',
 'commemorates:wikidata',
 'subject:wikidata',
 'subject:wikipedia',
 'wikimedia_commons',
 'artist:wikidata',
 'artist:wikipedia',
 'subject:wikipedia:2',
 'model:wikipedia',
 'model:wikidata',
 'species:wikidata',
 'name:etymology:wikidata',
 'name:etymology:wikipedia',
 'parish:wikidata',
 'parish:wikipedia',
 'dedicated_to:wikidata',
 'flag:wikidata',
 'flag:wikipedia',
 'network:wikidata',
 'brand:wikipedia:en']

In [9]:
all_pois['name'].isna().sum()/all_pois.shape[0]

np.float64(0.3330767915170173)

In [10]:
def concat_non_null_names(row):
    non_nulls = [str(row[col]) for col in name_columns if pd.notnull(row.get(col))]
    # Remove duplicates while preserving order
    unique_non_nulls = []
    seen = set()
    for val in non_nulls:
        if val not in seen:
            unique_non_nulls.append(val)
            seen.add(val)
    return ';'.join(unique_non_nulls) if unique_non_nulls else None

all_pois['all_names_concat'] = all_pois.apply(concat_non_null_names, axis=1)
all_pois = all_pois.copy() 

  super().__setitem__(key, value)


In [11]:
all_pois['name'].isna().sum()/all_pois.shape[0]

np.float64(0.3330767915170173)

In [12]:
all_pois['all_names_concat'].isna().sum()/all_pois.shape[0]

np.float64(0.3283735248845562)

In [13]:
all_pois.loc[all_pois['all_names_concat'].isna() &
    all_pois['amenity'].isna() &
    all_pois['leisure'].isna() &
    all_pois['natural'].isna() &
    all_pois['tourism'].isna() &
    all_pois['historic'].isna() 
].apply(
    lambda row: [col for col in row.dropna().index if col not in ['geometry', 'poi_category']],
    axis=1
).describe()

count                          14
unique                          8
top       [bus, public_transport]
freq                            6
dtype: object

In [14]:
all_pois.loc[~all_pois['all_names_concat'].isna() |
    ~all_pois['amenity'].isna() |
    ~all_pois['leisure'].isna() |
    ~all_pois['natural'].isna() |
    ~all_pois['tourism'].isna() |
    ~all_pois['historic'].isna() 
].shape[0] / all_pois.shape[0]

0.9988028048571918

In [15]:
all_pois[['wiki_lang', 'wiki_title']] = all_pois['wikipedia'].str.split(':', expand=True)
all_pois = all_pois.copy()

In [16]:
all_pois['wiki_lang'].value_counts()

wiki_lang
pl    333
en      1
es      1
Name: count, dtype: int64

In [17]:
langs = list(all_pois['wiki_lang'].dropna().unique())

In [18]:

for lang in langs:
    wiki_wiki = wikipediaapi.Wikipedia(user_agent = 'travel assistance', language=lang)

    for idx, row in tqdm.tqdm(all_pois[all_pois['wiki_lang'] == lang].iterrows(), total=all_pois[all_pois['wiki_lang'] == lang].shape[0]):
        title = row['wiki_title']
        page = wiki_wiki.page(title)
        en_summary = None
        en_found = False
        if page.exists() and 'en' in page.langlinks:
            en_page = page.langlinks['en']
            if en_page.exists():
                en_summary = en_page.summary
                en_found = True
        else:
            summary = page.summary if page.exists() else None
        all_pois.at[idx, 'wiki_summary'] = en_summary if en_found else summary
        all_pois.at[idx, 'wiki_summary_en_found'] = en_found
 

100%|██████████| 333/333 [04:23<00:00,  1.27it/s]
100%|██████████| 1/1 [00:01<00:00,  1.94s/it]
100%|██████████| 1/1 [00:01<00:00,  1.91s/it]


In [19]:
all_pois[~all_pois['wiki_summary'].isna()].head(10)

Unnamed: 0,geometry,amenity,check_date,name,opening_hours,outdoor_seating,wheelchair,brand,brand:wikidata,brand:wikipedia,...,climbing:grade:polish:min,climbing:routes,roof:direction,swimming,lanes,all_names_concat,wiki_lang,wiki_title,wiki_summary,wiki_summary_en_found
34,POINT (19.93741 50.06038),restaurant,,Wierzynek,Mo-Su 13:00-23:00,,,,,,...,,,,,,Wierzynek,pl,Wierzynek,Wierzynek is a restaurant located at the Main ...,True
39,POINT (19.93819 50.06085),pub,,Klub Pod Jaszczurami,,,no,,,,...,,,,,,Klub Pod Jaszczurami,pl,Pod Jaszczurami,Pod Jaszczurami – klub studencki w Krakowie. J...,False
95,POINT (20.03788 50.0747),restaurant,2025-01-12,Stylowa,Mo-Su 10:00-21:30,pedestrian_zone,,,,,...,,,,,,Stylowa,pl,Restauracja Stylowa w Krakowie,Restauracja „Stylowa” – najstarsza restauracja...,False
449,POINT (19.94093 50.06408),cafe,2024-08-03,Kawiarnia Jama Michalika,,,,,,,...,,,,,,"Kawiarnia Jama Michalika;Kaffee ""Jama Michalik...",pl,Jama Michalika,"Jama Michalika is a historic café in Kraków, P...",True
1162,POINT (19.93574 50.06178),bar,2024-07-10,Vis-à-vis,"Mo-Th 08:00-24:00, Fr-Su 08:00-01:00",yes,,,,,...,,,,,,Vis-à-vis,pl,Vis-à-vis (bar),Vis-à-vis – bar znajdujący się w Krakowie na S...,False
2381,POINT (19.89335 50.05492),,,Kopiec Kościuszki,,,,,,,...,,,,,,Kopiec Kościuszki;Kościuszko Mound;Kościuszko-...,pl,Kopiec Kościuszki w Krakowie,Kościuszko Mound (Polish: kopiec Kościuszki) i...,True
2382,POINT (19.93276 50.06132),,,Mikołaj Kopernik,,,,,,,...,,,,,,Mikołaj Kopernik;Statue of Nicolaus Copernicus...,pl,Pomnik Mikołaja Kopernika w Krakowie,The Nicolaus Copernicus Monument in Kraków (Po...,True
2384,POINT (19.93539 50.06602),,,Tadeusz Rejtan,,,,,,,...,,,,,,Tadeusz Rejtan,pl,Pomnik Tadeusza Rejtana w Krakowie,Pomnik Tadeusza Rejtana – pomnik znajdujący si...,False
2385,POINT (19.93782 50.05681),,,Piotr Skarga,,,yes,,,,...,,,,,,Piotr Skarga;Piotr-Skarga-Denkmal,pl,Pomnik Piotra Skargi w Krakowie,Pomnik Piotra Skargi w Krakowie – pomnik jezui...,False
2386,POINT (19.94214 50.06642),,,Pomnik Grunwaldzki,,,,,,,...,,,,,,Pomnik Grunwaldzki;Grunwald Monument;Grunwaldd...,pl,Pomnik Grunwaldzki w Krakowie,The Grunwald Monument (Polish: Pomnik Grunwald...,True


In [20]:
all_pois.loc[all_pois['wikipedia'].notna(), 'wiki_summary_en_found'].sum()/all_pois.loc[all_pois['wikipedia'].notna()].shape[0]


0.3044776119402985

In [21]:
all_pois.to_csv('krakow_pois_with_wiki.csv', index=False,header=True)

In [4]:
all_pois = pd.read_csv('krakow_pois_with_wiki.csv')

  all_pois = pd.read_csv('krakow_pois_with_wiki.csv')


In [5]:
from openai import OpenAI

openai_client = OpenAI()

In [6]:
def llm(prompt):
    response = openai_client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content

In [7]:
prompt_template = """
Translate below text from Polish to English:{input_text}. 
Own name is {name}.

Don't translate own name, only descriptive text.
Return only translated text without any additional information.
"""

In [10]:
llm(prompt_template.format(input_text=all_pois.loc[39,'wiki_summary'],name=all_pois.loc[39,'wiki_title']))

"Pod Jaszczurami – a student club in Kraków. One of the oldest student clubs in Poland, it has been operating since 1960. It is located in the medieval building called Kamienica pod Jaszczurką at Main Market Square 8. It is known for promoting creators from the Kraków student environment and for promoting jazz.  \nThe club hosts concerts, political debates, poetry evenings, film screenings, discos, exhibitions, and other events. The adjacent Teatr 38 hosts theatrical performances. Pod Jaszczurami is an integral part of Kraków's cultural life. Since 2001, the club has been overseen by the Kraków Institute of Art."

In [11]:
for idx, row in tqdm.tqdm(all_pois[all_pois['wiki_summary_en_found'].eq(False) & all_pois['wiki_summary'].notna()].iterrows(), total=all_pois[all_pois['wiki_summary_en_found'].eq(False) & all_pois['wiki_summary'].notna()].shape[0]):
    all_pois.at[idx, 'wiki_summary_translated'] = llm(prompt_template.format(input_text=row['wiki_summary'],name=row['wiki_title']))


  0%|          | 0/233 [00:00<?, ?it/s]

100%|██████████| 233/233 [13:44<00:00,  3.54s/it]


In [12]:
all_pois.to_csv('krakow_pois_with_wiki_translted.csv', index=False,header=True)

In [15]:
all_pois['wiki_summary_en'] = all_pois.apply(
    lambda row: row['wiki_summary'] if row.get('wiki_summary_en_found') else row.get('wiki_summary_translated'),
    axis=1
)

In [None]:
response = openai_client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[{
        "role": "user",
        "content": (
            f"You are helping to design a travel assistant RAG (Retrieval-Augmented Generation) application. "
            f"Given the following list of OpenStreetMap POI dataframe columns:\n\n{list(all_pois.columns)}\n\n"
            "Select and list the columns that are most likely to be useful for retrieval, search, or providing relevant information to travelers. "
            "Focus on columns that contain descriptive, categorical, or location-based information about the POIs. "
            "Ignore columns that are technical, rarely filled, or not useful for end users. "
            "Return only the names of the recommended columns, separated by commas, without any explanation."
        )
    }]
)



'amenity, name, opening_hours, outdoor_seating, wheelchair, brand, cuisine, smoking, takeaway, toilets, addr:city, addr:street, phone, postal_code, website, email, internet_access, description, alt_name, contact:phone, contact:website, contact:facebook, contact:instagram, contact:twitter, image, operating_status, dietary:vegetarian, dietary:vegan, dietary:gluten_free, dietary:halal, dietary:kosher, dietary:paleo, dietary:seafood, dietary:healthy, children_area, highchair, reservation, pets_allowed, swimming_pool, parking, public_transport, community_centre, tourist_attraction, tourist_information, opening_hours:reception, social_facility, guest_house, health_care, museum, zoo, park, castle, cemetery, attraction, historical, emergency, location, service_time, visiting_time'

In [29]:
selected_columns = response.choices[0].message.content

In [37]:
selected_columns = selected_columns.split(', ')

AttributeError: 'list' object has no attribute 'split'

In [43]:
selected_columns = [col for col in selected_columns if col not in ['name','operating_status', 'dietary:vegetarian', 'dietary:vegan', 'dietary:gluten_free', 'dietary:halal', 'dietary:kosher', 'dietary:paleo', 'dietary:seafood', 'dietary:healthy', 'children_area', 'tourist_attraction', 'tourist_information', 'health_care', 'park', 'castle', 'historical', 'service_time']]

In [23]:
text_columns = ['all_names_concat','amenity','leisure','natural','tourism','historic','wiki_summary_en']

In [None]:
columns_to_rag = list(set(text_columns + selected_columns + ['geometry']))

In [79]:
len(all_pois.columns)

703

In [80]:
len(columns_to_rag)

48

In [81]:
for col in columns_to_rag:
    n_nans = all_pois[col].isna().sum()
    example = all_pois[col].dropna().iloc[0] if all_pois[col].notna().any() else None
    perc_nans = n_nans / all_pois.shape[0] * 100
    if 'col_stats' not in locals():
        col_stats = []
    col_stats.append({'column': col, 'n_nans': n_nans, 'perc_nans': perc_nans, 'example': example})


In [82]:
col_stats_df = pd.DataFrame(col_stats)


In [83]:
col_stats_df[col_stats_df['perc_nans'] > 90]

Unnamed: 0,column,n_nans,perc_nans,example
1,phone,10821,92.534633,+48 12 431 08 81
2,cemetery,11691,99.974346,grave
3,emergency,11678,99.863178,yes
6,pets_allowed,11693,99.991449,yes
8,historic,11080,94.749444,tramcar
...,...,...,...,...
91,location,11550,98.768599,rooftop
92,outdoor_seating,11147,95.322388,yes
93,museum,11662,99.726355,history
94,takeaway,11396,97.451685,yes


In [None]:
all_pois['id'] = all_pois.index.astype(str)
all_pois.columns = [col.replace(':', '_') for col in all_pois.columns]

In [None]:
all_pois.rename(columns={'all_names_concat': 'name'}, inplace=True)

In [None]:
text_columns = ['name','amenity','leisure','natural','tourism','historic','wiki_summary_en']
all_pois[text_columns] = all_pois[text_columns].fillna('no information')

In [84]:
all_pois[columns_to_rag].to_csv('krakow_pois_for_rag.csv', index=False,header=True)

In [85]:
all_pois[columns_to_rag]

Unnamed: 0,phone,cemetery,emergency,opening_hours,website,pets_allowed,geometry,historic,wiki_summary_en,postal_code,...,guest_house,addr:city,contact:instagram,image,location,outdoor_seating,museum,takeaway,smoking,all_names_concat
0,,,,"Mo-We 10:00-01:00; Th,Fr 10:00-03:00; Sa 12:00...",,,POINT (19.9318205 50.0612594),,,,...,,,,,,yes,,,,Stary Port
1,,,,24/7,,,POINT (19.8927726 50.0883023),,,,...,,,,,,,,yes,no,McDonald's
2,+48 12 431 08 81,,,,http://www.klubre.pl/,,POINT (19.9414639 50.0611316),,,31-027,...,,Kraków,,,,,,,,Re
3,+48 662 376 093,,,,,,POINT (19.9488582 50.0503615),,,,...,,,,,,,,,,Duffy's Irish Bar
4,,,,,,,POINT (19.9447786 50.0517173),,,,...,,Kraków,,,,,,,,Bar Na Maxa
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11689,,,,,,,"POLYGON ((19.894169 50.0980159, 19.8941635 50....",,,,...,,,,,,,,,,
11690,,,,,,,"POLYGON ((20.1394951 50.0679735, 20.1394757 50...",,,,...,,,,,,,,,,
11691,,,,,,,"POLYGON ((19.8878834 50.0855656, 19.887916 50....",,,,...,,,,,,,,,,
11692,,,,,,,"POLYGON ((19.9106095 50.0521738, 19.91063 50.0...",,,,...,,,,,,,,,,
