# Graphs

## Authors

| Name | nUSP |
| :--- | :--- |
| Guilherme de Abreu Barreto | 12543033 |
| Lucas Eduardo Gulka Pulcinelli | 12547336 |
| Vinicio Yusuke Hayashibara | 13642797 |

## Assignment

> - [x] Load a graph that has:
>   - As edges: the Airports available in the file Airports.dat (clean file, airports only).
>   - As vertices: the Routes available in the file Routes.dat.
> - [x] The two files obtained from the OpenFlights site must be:
>   - Either converted into load files in the format required by the
>   - or loaded into regular tables and then loaded into the graph via
> Cypher commands.
> - [ ] Next, write a Cypher query that returns all routes with
> up to 3 hops between **São Paulo** and **Brasília**. (you may establish additional
> filtering criteria to limit the number of results)

## Solution

We built a graph with airports as vertices and routes as edges. This arrangement seemed more intuitive to us, since a graph is traversed along its edges, which is analogous to routes providing passage between airports.

Next, we read the `.dat` files as `.csv` and loaded them into dataframes, which were then used to programmatically generate the queries that create the graph's elements.
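
As an illustration of generating such queries from row data, the sketch below renders a dictionary of properties as a Cypher node pattern. The helper names `cypher_value` and `node_pattern` are hypothetical and the sample values are illustrative; this is not the exact code used in the cells further down, which only escape double quotes.

```python
def cypher_value(value):
    """Render a Python value as a Cypher literal (hypothetical helper)."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    # Escape backslashes first, then double quotes, so the escapes
    # added for quotes are not escaped a second time.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def node_pattern(variable, label, properties):
    """Build a pattern such as (a_1:Airport {name: "..."}) for CREATE."""
    body = ", ".join(f"{key}: {cypher_value(val)}" for key, val in properties.items())
    return f"({variable}:{label} {{{body}}})"


# Illustrative values only, not taken from the dataset.
pattern = node_pattern("a_1", "Airport", {"index": 1, "name": 'Goroka "GKA"', "lat": -6.08})
print(pattern)
```

Note that a value containing `$$` would still terminate the dollar-quoted string passed to `cypher(...)`, so this sketch does not make arbitrary input safe.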

Finally, we attempted to run a query that finds the possible itineraries between the specified cities within the stop limit. In our simplest formulation, this would be:

```sql
SELECT * FROM cypher('openflights', $$
    MATCH path = (sp:Airport {city: "Sao Paulo"})-[r:Route*1..3]->(bsb:Airport {city: "Brasilia"})
    WHERE r.stops = 0
    RETURN 
        [node IN nodes(path) | node.name] AS airport_names,
        [node IN nodes(path) | node.city] AS cities,
        length(path) AS number_of_flights
$$) AS (airport_names agtype, cities agtype, number_of_flights agtype);
```

However, this raised a syntax error that we were unable to resolve. Since the query worked for paths of fixed length, we wrote a new query in which `UNION ALL` combines queries for paths of different lengths. That query did not return a result in time for submission: it was still processing while fully consuming 3 of the 4 cores available on my computer.

![The bottom process manager showing the memory consumption observed for the query](imgs/btm.png)

## Initial setup

The cells below load the dependencies used in this analysis and (re)create the database to be manipulated. The variables used to establish the database connection are defined as constants, which should be changed to match the local database configuration wherever this experiment is reproduced.

In [1]:
import json
import pandas as pd
import psycopg2
import unicodedata
from sqlalchemy import create_engine, text
from sqlalchemy.exc import ProgrammingError
from pyvis import network as net
from tqdm import tqdm

In [2]:
DEFAULT_DATABASE = "postgres"
FLIGHTS_DATABASE = "flights" 
USER = "postgres"
PASSWORD = "postgres"
HOST = "localhost"
PORT = 5432
DRIVER = "postgresql+psycopg2"

engine = create_engine(
    f"{DRIVER}://{USER}:{PASSWORD}@{HOST}:{PORT}/{DEFAULT_DATABASE}", echo=True
)
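
Since these constants have to be edited for each machine, one alternative is to read them from the environment. The sketch below is an assumption, not part of the original notebook: the function name `database_url` is hypothetical, and the `PG*` variable names follow the libpq convention. It only builds the URL string, so it can be tried without a running database.

```python
import os
from urllib.parse import quote_plus


def database_url(database, env=os.environ):
    """Build a SQLAlchemy-style URL, reading overrides from the environment.

    Defaults mirror the constants defined in the cell above.
    """
    user = env.get("PGUSER", "postgres")
    password = env.get("PGPASSWORD", "postgres")
    host = env.get("PGHOST", "localhost")
    port = int(env.get("PGPORT", "5432"))
    # quote_plus keeps characters such as "@" or "/" in credentials from
    # being read as URL delimiters.
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


print(database_url("flights", env={"PGPASSWORD": "p@ss", "PGPORT": "5433"}))
```

The resulting string could be passed to `create_engine` in place of the f-string used above.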

In [3]:
with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as conn:
    try:
        conn.execute(text(
        f"""
        SELECT pg_terminate_backend(pid)
        FROM pg_stat_activity
        WHERE datname = '{FLIGHTS_DATABASE}';
        """
        ))
    except ProgrammingError:
        pass  # Could not terminate connections (there are no connections)
    # NOTE: DROP DATABASE cannot run inside a transaction block, that is why we're
    # running it separately below.
    conn.execute(text(f"DROP DATABASE IF EXISTS {FLIGHTS_DATABASE};"))
    conn.execute(text(f"CREATE DATABASE {FLIGHTS_DATABASE};"))

2025-11-18 11:58:44,236 INFO sqlalchemy.engine.Engine select pg_catalog.version()
2025-11-18 11:58:44,237 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-11-18 11:58:44,240 INFO sqlalchemy.engine.Engine select current_schema()
2025-11-18 11:58:44,241 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-11-18 11:58:44,243 INFO sqlalchemy.engine.Engine show standard_conforming_strings
2025-11-18 11:58:44,244 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-11-18 11:58:44,247 INFO sqlalchemy.engine.Engine BEGIN (implicit; DBAPI should not BEGIN due to autocommit mode)
2025-11-18 11:58:44,248 INFO sqlalchemy.engine.Engine 
        SELECT pg_terminate_backend(pid)
        FROM pg_stat_activity
        WHERE datname = 'flights';
        
2025-11-18 11:58:44,249 INFO sqlalchemy.engine.Engine [generated in 0.00256s] {}
2025-11-18 11:58:44,253 INFO sqlalchemy.engine.Engine DROP DATABASE IF EXISTS flights;
2025-11-18 11:58:44,254 INFO sqlalchemy.engine.Engine [generated in 0.00102s] {}
2025-11-18 11:5

## Loading the Apache AGE extension and creating the graph

In [4]:
engine = create_engine(
    f"{DRIVER}://{USER}:{PASSWORD}@{HOST}:{PORT}/{FLIGHTS_DATABASE}", echo=True
)

In [5]:
with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as conn:
    # Use raw connection for better control
    raw_conn = conn.connection
    cursor = raw_conn.cursor()
    
    cursor.execute("CREATE EXTENSION IF NOT EXISTS age;")
    cursor.execute("LOAD 'age';")
    cursor.execute('SET search_path = ag_catalog, "$user", public;')
    cursor.execute("SHOW search_path;")
    
    search_path = cursor.fetchone()
    print("Current search_path:", search_path[0])

2025-11-18 11:58:44,363 INFO sqlalchemy.engine.Engine select pg_catalog.version()
2025-11-18 11:58:44,364 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-11-18 11:58:44,367 INFO sqlalchemy.engine.Engine select current_schema()
2025-11-18 11:58:44,368 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-11-18 11:58:44,370 INFO sqlalchemy.engine.Engine show standard_conforming_strings
2025-11-18 11:58:44,370 INFO sqlalchemy.engine.Engine [raw sql] {}
Current search_path: ag_catalog, "$user", public


In [6]:
GRAPH = "openflights"

with engine.begin() as conn:
    result = conn.execute(
        text(f"SELECT * FROM pg_namespace WHERE nspname = '{GRAPH}'")
    )
    if len(result.fetchall()) > 0:
        conn.execute(text(f"SELECT drop_graph('{GRAPH}', true)"))
    conn.execute(text(f"SELECT create_graph('{GRAPH}');"))

engine.echo = False

2025-11-18 11:58:44,441 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-11-18 11:58:44,442 INFO sqlalchemy.engine.Engine SELECT * FROM pg_namespace WHERE nspname = 'openflights'
2025-11-18 11:58:44,443 INFO sqlalchemy.engine.Engine [generated in 0.00103s] {}
2025-11-18 11:58:44,446 INFO sqlalchemy.engine.Engine SELECT create_graph('openflights');
2025-11-18 11:58:44,447 INFO sqlalchemy.engine.Engine [generated in 0.00142s] {}
2025-11-18 11:58:44,459 INFO sqlalchemy.engine.Engine COMMIT


## Creating the graph vertices (Airports)

In [7]:
airport_columns = {
    'index': 'int64',
    'name': 'string', 
    'city': 'string', 
    'country': 'string', 
    'iata': 'string', 
    'icao': 'string', 
    'lat': 'float64',
    'lon': 'float64',
    'altitude': 'int64',
    'timezone': 'string', 
    'dst': 'string', 
    'tz': 'string', 
    'type': 'string', 
    'source': 'string'
}

# Load the CSV with your column names
airports_df = pd.read_csv(
    'data/airports.csv',
    header=None,
    names=airport_columns.keys(),
    dtype=airport_columns,
    na_values=['\\N'],  # Treat \N as missing values
)
airports_df = airports_df.set_index('index')
airports_df = airports_df.fillna('NULL')
airports_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7698 entries, 1 to 14110
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      7698 non-null   string 
 1   city      7698 non-null   string 
 2   country   7698 non-null   string 
 3   iata      7698 non-null   string 
 4   icao      7698 non-null   string 
 5   lat       7698 non-null   float64
 6   lon       7698 non-null   float64
 7   altitude  7698 non-null   int64  
 8   timezone  7698 non-null   string 
 9   dst       7698 non-null   string 
 10  tz        7698 non-null   string 
 11  type      7698 non-null   string 
 12  source    7698 non-null   string 
dtypes: float64(2), int64(1), string(10)
memory usage: 842.0 KB


In [8]:
batch_size = 100
total_batches = (len(airports_df) + batch_size - 1) // batch_size

for batch_num in tqdm(range(total_batches)):
    start_idx = batch_num * batch_size
    end_idx = min((batch_num + 1) * batch_size, len(airports_df))
    batch_df = airports_df.iloc[start_idx:end_idx]

    with engine.begin() as conn:
        queries = []
        for index, row in batch_df.iterrows():
            query = f"""
            (a_{index}:Airport {{
                index: {index},
                name: "{row['name'].replace('"', '\\"')}",
                city: "{row['city'].replace('"', '\\"')}",
                country: "{row['country'].replace('"', '\\"')}",
                iata: "{row['iata']}",
                icao: "{row['icao']}",
                lat: {row['lat']},
                lon: {row['lon']},
                altitude: {row['altitude']},
                timezone: "{row['timezone']}",
                dst: "{row['dst']}",
                tz: "{row['tz']}",
                type: "{row['type']}",
                source: "{row['source']}"
            }})
            """
            queries.append(query)
        
        conn.execute(text(
            f"""
            SELECT * FROM cypher('{GRAPH}', $$
            CREATE {",".join(queries)}
            $$) AS (result agtype);
            """
        ))

print("Airport creation completed!")

100%|█████████████████████████████████████████████████████████████████████████████| 77/77 [00:02<00:00, 26.91it/s]

Airport creation completed!
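
The batch arithmetic in the cell above (ceiling division plus explicit start and end indices) can also be expressed as a small generator. This is a minimal sketch with a hypothetical helper name, `batches`, shown on a plain list standing in for the 7698 airport rows; a DataFrame would need `.iloc` for the slicing.

```python
def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


rows = list(range(7698))  # stand-in for the 7698 airport rows
chunks = list(batches(rows, 100))
print(len(chunks), len(chunks[-1]))
```

With 7698 rows and a batch size of 100 this yields 77 batches, matching the progress bar above, and the last batch holds the remaining 98 rows.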





## Creating the graph edges (Routes)

In [9]:
route_columns = {
    'airline': 'string',
    'airline_id': 'Int64',
    'source': 'string',
    'source_id': 'Int64',
    'dest': 'string',
    'dest_id': 'Int64',
    'codeshare': 'string',
    'stops': 'Int64',
    'equipment': 'string'
}

# Load the CSV with your column names
routes_df = pd.read_csv(
    'data/routes.csv',
    header=None,
    names=route_columns.keys(),
    dtype=route_columns,
    na_values=['\\N'],  # Treat \N as missing values
)
routes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67663 entries, 0 to 67662
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   airline     67663 non-null  string
 1   airline_id  67184 non-null  Int64 
 2   source      67663 non-null  string
 3   source_id   67443 non-null  Int64 
 4   dest        67663 non-null  string
 5   dest_id     67442 non-null  Int64 
 6   codeshare   14597 non-null  string
 7   stops       67663 non-null  Int64 
 8   equipment   67645 non-null  string
dtypes: Int64(4), string(5)
memory usage: 4.9 MB


In [10]:
batch_size = 100
total_batches = (len(routes_df) + batch_size - 1) // batch_size

for batch_num in tqdm(range(total_batches)):
    start_idx = batch_num * batch_size
    end_idx = min((batch_num + 1) * batch_size, len(routes_df))
    batch_df = routes_df.iloc[start_idx:end_idx]
    
    with engine.begin() as conn:
        queries = []
        for index, row in batch_df.iterrows():
            # Handle nullable values by checking for pd.NA
            airline_id = row['airline_id'] if pd.notna(row['airline_id']) else 'NULL'
            source_id = row['source_id'] if pd.notna(row['source_id']) else 'NULL'
            dest_id = row['dest_id'] if pd.notna(row['dest_id']) else 'NULL'
            codeshare = f'"{row["codeshare"]}"' if pd.notna(row['codeshare']) else 'NULL'
            equipment = f'"{row["equipment"]}"' if pd.notna(row['equipment']) else 'NULL'
            
            query = f"""
            MATCH (source:Airport {{iata: "{row['source']}"}})
            MATCH (dest:Airport {{iata: "{row['dest']}"}})
            CREATE (source)-[r:Route {{
                airline: "{row['airline']}",
                airline_id: {airline_id},
                source_id: {source_id},
                dest_id: {dest_id},
                codeshare: {codeshare},
                stops: {row['stops']},
                equipment: {equipment}
            }}]->(dest)
            """
            queries.append(query)
        
        # Execute all queries in this batch
        for query in queries:
            conn.execute(text(
                f"""
                SELECT * FROM cypher('{GRAPH}', $$
                {query}
                $$) AS (result agtype);
                """
            ))

print("Routes creation completed!")

100%|███████████████████████████████████████████████████████████████████████████| 677/677 [08:55<00:00,  1.26it/s]

Routes creation completed!
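
The loop above issues one `cypher(...)` call per route and took close to nine minutes. A route whose source or destination code matches no airport creates no edge, because its `MATCH` clauses return nothing. One possible saving, sketched below under that assumption, is to drop such routes on the client before sending any query. The helper name is hypothetical, the records are plain dictionaries rather than DataFrame rows, and how much time this would actually save was not measured.

```python
def routes_with_known_endpoints(routes, airports):
    """Keep only routes whose source and destination IATA codes exist."""
    known = {airport["iata"] for airport in airports}
    return [
        route for route in routes
        if route["source"] in known and route["dest"] in known
    ]


# Illustrative records only, not taken from the dataset.
airports = [{"iata": "GRU"}, {"iata": "BSB"}]
routes = [
    {"source": "GRU", "dest": "BSB"},
    {"source": "GRU", "dest": "XXX"},  # unknown destination, no edge possible
]
print(routes_with_known_endpoints(routes, airports))
```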





## Querying the itineraries between São Paulo and Brasília

> Not completed

In [14]:
def remove_accents(text):
    """Remove accents from text while preserving the original characters"""
    return ''.join(
        c for c in unicodedata.normalize('NFD', text)
        if unicodedata.category(c) != 'Mn'
    )

# Define origin and destination cities
origin_city = "São Paulo"
destination_city = "Brasília"

# Remove accents for query
origin_city_clean = remove_accents(origin_city)
destination_city_clean = remove_accents(destination_city)

# Since direct and one-stop routes work, let's combine them
query = f"""
SELECT * FROM cypher('{GRAPH}', $$
    // Direct routes (0 stops)
    MATCH (sp:Airport {{city: "{origin_city_clean}"}})-[r1:Route]->(bsb:Airport {{city: "{destination_city_clean}"}})
    WHERE r1.stops = 0
    RETURN 
        [sp.name, bsb.name] AS airport_names,
        [sp.city, bsb.city] AS cities, 
        [r1.airline] AS airlines,
        1 AS number_of_flights
    
    UNION ALL
    
    // One-stop routes (1 intermediate airport)
    MATCH (sp:Airport {{city: "{origin_city_clean}"}})-[r1:Route]->(stop:Airport)-[r2:Route]->(bsb:Airport {{city: "{destination_city_clean}"}})
    WHERE r1.stops = 0 AND r2.stops = 0
    AND stop.city <> "{origin_city_clean}" AND stop.city <> "{destination_city_clean}"
    RETURN 
        [sp.name, stop.name, bsb.name] AS airport_names,
        [sp.city, stop.city, bsb.city] AS cities,
        [r1.airline, r2.airline] AS airlines,
        2 AS number_of_flights
    
    UNION ALL
    
    // Two-stop routes (2 intermediate airports)  
    MATCH (sp:Airport {{city: "{origin_city_clean}"}})-[r1:Route]->(stop1:Airport)-[r2:Route]->(stop2:Airport)-[r3:Route]->(bsb:Airport {{city: "{destination_city_clean}"}})
    WHERE r1.stops = 0 AND r2.stops = 0 AND r3.stops = 0
    AND stop1.city <> "{origin_city_clean}" AND stop1.city <> "{destination_city_clean}"
    AND stop2.city <> "{origin_city_clean}" AND stop2.city <> "{destination_city_clean}"
    AND stop1 <> stop2
    RETURN 
        [sp.name, stop1.name, stop2.name, bsb.name] AS airport_names,
        [sp.city, stop1.city, stop2.city, bsb.city] AS cities,
        [r1.airline, r2.airline, r3.airline] AS airlines,
        3 AS number_of_flights
$$) AS (airport_names agtype, cities agtype, airlines agtype, number_of_flights agtype);
"""

with engine.connect() as conn:
    result = conn.execute(text(query))
    
    print(f"\nAll routes from {origin_city} to {destination_city} with 0-stop flights:")
    print("=" * 60)
    
    for i, row in enumerate(result, 1):
        print(f"\nRoute #{i}:")
        print(f"  Airports: {' → '.join(row.airport_names)}")
        print(f"  Cities: {' → '.join(row.cities)}")
        print(f"  Airlines: {' → '.join(row.airlines)}")
        print(f"  Number of flights: {row.number_of_flights}")
        print(f"  Total stops: {row.number_of_flights - 1}")
        print("-" * 40)


All routes from São Paulo to Brasília with 0-stop flights:

Route #1:
  Airports: [ → " → G → u → a → r → u → l → h → o → s →   → - →   → G → o → v → e → r → n → a → d → o → r →   → A → n → d → r → é →   → F → r → a → n → c → o →   → M → o → n → t → o → r → o →   → I → n → t → e → r → n → a → t → i → o → n → a → l →   → A → i → r → p → o → r → t → " → , →   → " → P → r → e → s → i → d → e → n → t → e →   → J → u → s → c → e → l → i → n → o →   → K → u → b → i → s → t → s → c → h → e → k →   → I → n → t → e → r → n → a → t → i → o → n → a → l →   → A → i → r → p → o → r → t → " → ]
  Cities: [ → " → S → a → o →   → P → a → u → l → o → " → , →   → " → B → r → a → s → i → l → i → a → " → ]
  Airlines: [ → " → A → D → " → ]
  Number of flights: 1


TypeError: unsupported operand type(s) for -: 'str' and 'int'
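
Both symptoms in the output above (the arrows between single characters and the `TypeError`) are consistent with the `agtype` columns arriving in Python as text rather than as lists and integers. A minimal sketch of decoding them follows. It assumes the values are scalars or lists of scalars, whose text form is valid JSON; `parse_agtype` is a hypothetical helper, not part of any library.

```python
import json


def parse_agtype(value):
    """Decode an agtype value returned as text into a Python object.

    Vertices, edges and paths carry a type suffix in their text form and
    are not handled here.
    """
    return json.loads(value) if isinstance(value, str) else value


cities = parse_agtype('["Sao Paulo", "Brasilia"]')
number_of_flights = parse_agtype("1")
print(" → ".join(cities))
print(number_of_flights - 1)
```

Applied to each column of `row` before printing, this would join whole names instead of characters and allow the subtraction that computes the number of stops.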

## A more concise query

> Not made operational because of a syntax error

In [13]:
query = f"""
SELECT * FROM cypher('{GRAPH}', $$
    MATCH path = (sp:Airport {{city: "Sao Paulo"}})-[r:Route*1..3]->(bsb:Airport {{city: "Brasilia"}})
    WHERE r.stops = 0
    RETURN 
        [node IN nodes(path) | node.name] AS airport_names,
        [node IN nodes(path) | node.city] AS cities,
        length(path) AS number_of_flights
$$) AS (airport_names agtype, cities agtype, number_of_flights agtype);
"""

with engine.connect() as conn:
    for i, row in enumerate(conn.execute(text(query)), 1):
        print(f"\nRoute #{i}:")        
        print(f"  Airports: {' → '.join(row.airport_names)}")
        print(f"  Cities: {' → '.join(row.cities)}")
        print(f"  Number of Stops: {row.number_of_flights - 1}")  # Calculate stops from flights
        print("-" * 50)

ProgrammingError: (psycopg2.errors.UndefinedObject) could not find properties for node
LINE 2: SELECT * FROM cypher('openflights', $$
                                             ^

[SQL: 
SELECT * FROM cypher('openflights', $$
    MATCH path = (sp:Airport {city: "Sao Paulo"})-[*1..3]->(bsb:Airport {city: "Brasilia"})
    RETURN 
        [node IN nodes(path) | node.name] AS airport_names,
        [node IN nodes(path) | node.city] AS cities,
        length(path) AS number_of_flights
$$) AS (airport_names agtype, cities agtype, number_of_flights agtype);
]
(Background on this error at: https://sqlalche.me/e/20/f405)
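
Because the variable-length pattern could not be made to work, the fixed-length branches joined by `UNION ALL` in the previous section remain the fallback. Rather than writing each branch by hand, they could be generated for any hop limit. The sketch below uses hypothetical function names and only builds the Cypher text. It omits the filters on intermediate cities used earlier, and it does nothing about the running time reported for the three-flight branch.

```python
def fixed_length_match(hops, origin, destination):
    """Cypher branch matching exactly `hops` consecutive nonstop routes."""
    nodes = ["sp"] + [f"stop{i}" for i in range(1, hops)] + ["bsb"]
    pattern = f'(sp:Airport {{city: "{origin}"}})'
    for i, node in enumerate(nodes[1:], start=1):
        # Only the final node is constrained to the destination city.
        city_filter = f' {{city: "{destination}"}}' if node == "bsb" else ""
        pattern += f"-[r{i}:Route]->({node}:Airport{city_filter})"
    where = " AND ".join(f"r{i}.stops = 0" for i in range(1, hops + 1))
    names = ", ".join(f"{node}.name" for node in nodes)
    return (
        f"MATCH {pattern}\n"
        f"WHERE {where}\n"
        f"RETURN [{names}] AS airport_names, {hops} AS number_of_flights"
    )


def up_to_n_hops(max_hops, origin, destination):
    """Join the branches for 1..max_hops flights with UNION ALL."""
    return "\nUNION ALL\n".join(
        fixed_length_match(h, origin, destination) for h in range(1, max_hops + 1)
    )


print(up_to_n_hops(2, "Sao Paulo", "Brasilia"))
```

Every branch returns the same two columns, which `UNION ALL` requires, so the generated text can be placed between the `$$` delimiters of a `cypher(...)` call with a matching column list.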