# Advanced Databases 2025/2026 
### Prof. Márcia Barros and Prof. Francisco Couto
TP4 - Week 5
MongoDB: interaction with Pandas; testing performance with MongoDB, Pandas and MySQL


# Part 1 - Creating the databases

### Create the database in MongoDB


### Database "schema"
```
{
  "name": <string>,
  "position": {
    "RA_ICRS": <number>,
    "DE_ICRS": <number>,
    "Plx": <number>,
    "dist_PLX": <number>
  },
  "features": {
    "r50": <number>,
    "Vr": <number>,
    "age": <number>,
    "FeH": <number>,
    "Diam_pc": <number>
  }
}
```
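
As a minimal sketch of how one flat CSV row maps onto this nested layout, the helper below groups the columns into the two sub-documents. The function name `to_document` and the sample values are hypothetical and only serve as illustration; they are not part of the catalogue or of the original notebook.

```python
POSITION_FIELDS = ["RA_ICRS", "DE_ICRS", "Plx", "dist_PLX"]
FEATURE_FIELDS = ["r50", "Vr", "age", "FeH", "Diam_pc"]


def to_document(row):
    """Group the flat columns of one catalogue row into the nested schema."""
    return {
        "name": row["name"],
        "position": {key: row[key] for key in POSITION_FIELDS},
        "features": {key: row[key] for key in FEATURE_FIELDS},
    }


# Illustrative values only, not taken from the real catalogue.
sample_row = {
    "name": "Example_1", "RA_ICRS": 10.5, "DE_ICRS": -20.0, "Plx": 0.5,
    "dist_PLX": 2000.0, "r50": 0.1, "Vr": 5.0, "age": 8.2, "FeH": -0.1,
    "Diam_pc": 7.0,
}
doc = to_document(sample_row)
print(doc["position"]["RA_ICRS"])
```

With this layout, nested fields are addressed in MongoDB queries with dot notation, for example `position.RA_ICRS`.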

In [None]:
import pandas as pd

# Read the dataset into a pandas DataFrame
df = pd.read_csv('dias_catalogue.csv')

# Remove stray whitespace from the names, as the Pandas cell further down also does
df['name'] = df['name'].str.strip()

# Build the nested "position" and "features" objects, one dict per row
df['position'] = df[['RA_ICRS', 'DE_ICRS', 'Plx', 'dist_PLX']].apply(
    lambda s: s.to_dict(), axis=1
)

df['features'] = df[['r50', 'Vr', 'age', 'FeH', 'Diam_pc']].apply(
    lambda s: s.to_dict(), axis=1
)

# Write the nested records out to a JSON file
df[['name', 'position', 'features']].to_json(
    "dias_catalogue_filtered.json",
    orient="records",
    double_precision=10,
    indent=2,
)



In [None]:
# pip install pymongo (if it's not installed)
import json

import pymongo

# Connect to the local MongoDB server and select the database and collection
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["openClusters"]
collection = db["cluster"]

# Load the nested records written in the previous cell
with open("dias_catalogue_filtered.json", "r") as f:
    stars_data = json.load(f)

# Clear old data (optional, for repeat runs)
collection.drop()

# Insert the records into MongoDB
result = collection.insert_many(stars_data)

print("Inserted documents:", len(result.inserted_ids))
print("Total in collection:", collection.count_documents({}))

### Create the database in MySQL

In [None]:
""" 
Create the same database in mySQL
"""

# pip install sqlalchemy (if it's not installed)
# pip install mysql-connector-python (if it's not installed; it provides the mysqlconnector driver used below)

import numpy as np
import pandas as pd
from sqlalchemy import create_engine, text

# Load the CSV and normalise values that MySQL cannot store
df = pd.read_csv("dias_catalogue.csv")
df['name'] = df['name'].str.strip()  # same clean-up as in the Pandas cell
df = df.replace('', np.nan)
df = df.replace([np.inf, -np.inf], np.nan)

# Create a connection to MySQL
# Replace user, password, host and port with your own settings
engine = create_engine("mysql+mysqlconnector://root:1234@localhost:3306/")

# Execute raw SQL to create the database
with engine.connect() as conn:
    conn.execute(text("CREATE DATABASE IF NOT EXISTS openclusters"))
    conn.commit()

# Reconnect, this time selecting the new database
engine = create_engine("mysql+mysqlconnector://root:1234@localhost:3306/openclusters")

# to_sql creates the table if needed; if_exists='replace' drops and recreates it
df.to_sql(name='clusters', con=engine, if_exists='replace', index=False)


### Use the data in the CSV with Pandas

In [None]:
import pandas as pd

# Getting the pandas DataFrame to hold the same data as the databases
df = pd.read_csv('dias_catalogue.csv')
df['name'] = df['name'].str.strip()
df = df[['name', 'RA_ICRS', 'DE_ICRS', 'Plx', 'dist_PLX', 'Vr', 'age', 'FeH', 'Diam_pc', 'r50']]

print(df.head())

# Part 2 - Testing the performance for retrieving the same data

### Test 1 - simple query in MongoDB, MySQL and Pandas

In [None]:
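### 1 query to MongoDB using pymongo
import time

myquery_nested = {'position.RA_ICRS': {"$gt": 50}}

time_i = time.time()
# find() only returns a lazy cursor; list() forces the documents to be fetched,
# so the measurement is comparable to fetchall() in the MySQL cell
docs_nested = list(collection.find(myquery_nested))
time_f = time.time()
print('total time pymongo = ', time_f - time_i)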

In [None]:
# same query with mySQL
import mysql.connector

mydb = mysql.connector.connect(
  host="localhost",
  user="root",
  password="1234",
  database="openclusters"
)

mycursor = mydb.cursor()

time_ip = time.time()
mycursor.execute("SELECT * FROM clusters WHERE RA_ICRS > 50")

myresult = mycursor.fetchall()

time_fp = time.time()

print('total time MySQL = ', time_fp-time_ip)

In [None]:
### same query with pandas

time_ip = time.time()
# select all the clusters with RA_ICRS greater than 50
my_pands_teste = df[df.RA_ICRS > 50]
time_fp = time.time()
print('total time Pandas = ', time_fp-time_ip)
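
A single wall-clock measurement can be noisy, so the three timings above are easier to compare when each query is repeated a few times. The helper below is a minimal sketch of that idea; `time_call` is a hypothetical name, and the workload passed to it here is only a stand-in so the block runs without a database.

```python
import statistics
import time


def time_call(func, repeats=5):
    """Run func() several times and return (best, mean) wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return min(durations), statistics.mean(durations)


# Stand-in workload. In the notebook this could be, for example,
#   lambda: list(collection.find(myquery_nested))
#   lambda: df[df.RA_ICRS > 50]
best, mean = time_call(lambda: sum(range(100_000)))
print(f"best = {best:.6f} s, mean = {mean:.6f} s")
```

Reporting the best run alongside the mean helps separate the cost of the query itself from one-off effects such as caching or connection warm-up.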

### Try it yourself

### Exercise 1 - more complex query
Define a query to get all the clusters with:

* name starting with A
* RA_ICRS smaller than 180
* DE_ICRS greater than 60 or smaller than -60

Test the performance using MongoDB, Pandas and MySQL.
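
As a starting point, the sketch below shows one possible way to express these three conditions in each system. The Pandas mask is applied to a tiny made-up frame (`toy`) so that the block runs on its own; the MongoDB filter assumes the nested schema from Part 1 and the SQL string assumes the `clusters` table created earlier. Neither of those two is executed here.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the catalogue.
toy = pd.DataFrame({
    "name": ["Alpha_1", "Alpha_2", "Beta_1", "Alpha_3"],
    "RA_ICRS": [100.0, 200.0, 50.0, 30.0],
    "DE_ICRS": [70.0, 80.0, -75.0, 10.0],
})

# Pandas: combine the conditions with & and |, one parenthesised term each.
mask = (
    toy["name"].str.startswith("A")
    & (toy["RA_ICRS"] < 180)
    & ((toy["DE_ICRS"] > 60) | (toy["DE_ICRS"] < -60))
)
selected = toy[mask]
print(selected["name"].tolist())

# MongoDB: top-level keys are ANDed together; $or holds the two DE_ICRS ranges.
mongo_filter = {
    "name": {"$regex": "^A"},
    "position.RA_ICRS": {"$lt": 180},
    "$or": [
        {"position.DE_ICRS": {"$gt": 60}},
        {"position.DE_ICRS": {"$lt": -60}},
    ],
}

# MySQL: the parentheses around the OR keep it from absorbing the other ANDs.
sql_query = (
    "SELECT * FROM clusters "
    "WHERE name LIKE 'A%' AND RA_ICRS < 180 "
    "AND (DE_ICRS > 60 OR DE_ICRS < -60)"
)
```

When comparing outputs, note that `str.startswith` and `$regex` are case-sensitive, while `LIKE` in MySQL may match case-insensitively depending on the column collation.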

In [None]:
# Mongo


In [None]:
# Pandas


In [None]:
# mySQL


### Exercise 2 – Filtering and sorting results

Retrieve all clusters with:

* age > 4
* FeH < 0

Sort the results by RA_ICRS in ascending order.

Compare query execution time between Pandas, MongoDB, and MySQL.
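
The SQL side of this exercise can be sketched as follows. The standard-library `sqlite3` module is used here only as a self-contained stand-in for the MySQL server, and the four rows are made up; the `SELECT` statement itself is plain SQL that MySQL should also accept.

```python
import sqlite3

# In-memory stand-in for the MySQL "clusters" table, with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clusters (name TEXT, RA_ICRS REAL, age REAL, FeH REAL)")
conn.executemany(
    "INSERT INTO clusters VALUES (?, ?, ?, ?)",
    [
        ("c1", 120.0, 8.5, -0.2),
        ("c2", 15.0, 9.1, -0.1),
        ("c3", 60.0, 3.0, -0.3),  # fails age > 4
        ("c4", 5.0, 7.0, 0.2),    # fails FeH < 0
    ],
)

query = """
    SELECT name, RA_ICRS, age, FeH
    FROM clusters
    WHERE age > 4 AND FeH < 0
    ORDER BY RA_ICRS ASC
"""
rows = conn.execute(query).fetchall()
print([row[0] for row in rows])
conn.close()
```

The equivalent steps in the other two systems would be `sort_values("RA_ICRS")` on the filtered DataFrame and `.sort("position.RA_ICRS", pymongo.ASCENDING)` on the pymongo cursor.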

### Exercise 3 – Range query with multiple conditions

Find all clusters where:

* Diam_pc is between 5 and 20
* age is less than 9

Display the top 10 results.

Measure performance in each database.
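
One way to express this range query for MongoDB is sketched below. It assumes that "between" is inclusive at both ends and that "top 10" simply means the first 10 matches, since the exercise does not name a sort key. The helper names `build_range_filter` and `first_matches` are hypothetical, and `first_matches` needs a live pymongo collection, so only the filter is printed here.

```python
def build_range_filter(diam_min=5, diam_max=20, max_age=9):
    """Return a MongoDB filter for the nested schema used in this notebook."""
    return {
        # Two operators on the same field are ANDed: diam_min <= Diam_pc <= diam_max
        "features.Diam_pc": {"$gte": diam_min, "$lte": diam_max},
        "features.age": {"$lt": max_age},
    }


def first_matches(collection, limit=10):
    """Fetch at most `limit` matching documents from a pymongo collection."""
    cursor = collection.find(build_range_filter()).limit(limit)
    return list(cursor)


print(build_range_filter())
```

In Pandas the same limit is `.head(10)` on the filtered DataFrame, and in MySQL it is a `LIMIT 10` clause.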

### Exercise 4 – Combining logical operators

Write a query to return clusters that meet either of the following:

* RA_ICRS between 100 and 200 and DE_ICRS > 0

* RA_ICRS < 50 and DE_ICRS < -30

Test with MongoDB’s $or operator, a Pandas boolean mask, and a MySQL OR clause.
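
A minimal sketch of the two-branch condition is given below, first as a Pandas boolean mask on a made-up frame (`toy`) and then as a MongoDB filter for the nested schema from Part 1. It assumes "between 100 and 200" includes both end points; the MongoDB filter is only defined, not executed.

```python
import pandas as pd

# Made-up rows: c1 satisfies the first branch, c3 the second.
toy = pd.DataFrame({
    "name": ["c1", "c2", "c3", "c4"],
    "RA_ICRS": [150.0, 150.0, 20.0, 20.0],
    "DE_ICRS": [10.0, -10.0, -45.0, 0.0],
})

# Comparisons are parenthesised because & and | bind tighter than < and >.
branch_a = toy["RA_ICRS"].between(100, 200) & (toy["DE_ICRS"] > 0)
branch_b = (toy["RA_ICRS"] < 50) & (toy["DE_ICRS"] < -30)
selected = toy[branch_a | branch_b]
print(selected["name"].tolist())

# MongoDB: each element of $or is itself an implicit AND of its keys.
mongo_filter = {
    "$or": [
        {"position.RA_ICRS": {"$gte": 100, "$lte": 200},
         "position.DE_ICRS": {"$gt": 0}},
        {"position.RA_ICRS": {"$lt": 50},
         "position.DE_ICRS": {"$lt": -30}},
    ]
}
```

In SQL the same grouping needs explicit parentheses around each AND branch, because a bare mix of AND and OR is easy to misread.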

### Exercise 5 – Aggregation and statistics

Compute the average and maximum Diam_pc grouped by FeH metallicity bins.
Perform this aggregation using:

* MongoDB’s aggregate() pipeline

* Pandas groupby()

* MySQL GROUP BY statement

Compare execution speed and output consistency.
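
The exercise does not fix the bin edges, so the sketch below assumes two bins, (-0.5, 0.0] and (0.0, 0.5], and uses made-up values in a small frame (`toy`) to show the Pandas side of the aggregation.

```python
import pandas as pd

# Made-up values; the real catalogue will span a different FeH range.
toy = pd.DataFrame({
    "FeH": [-0.4, -0.3, 0.1, 0.2, 0.3],
    "Diam_pc": [4.0, 8.0, 10.0, 20.0, 6.0],
})

# Assumed bin edges; adjust them to the range actually present in the data.
bin_edges = [-0.5, 0.0, 0.5]
toy["FeH_bin"] = pd.cut(toy["FeH"], bins=bin_edges)

# Average and maximum diameter per metallicity bin.
stats = (
    toy.groupby("FeH_bin", observed=True)["Diam_pc"]
    .agg(["mean", "max"])
)
print(stats)
```

To get consistent output across the three systems, the same edges have to be used everywhere, for instance as the `boundaries` of a `$bucket` stage in the MongoDB pipeline. Note that `pd.cut` closes each bin on the right by default, whereas `$bucket` boundaries are inclusive on the lower side, so values lying exactly on an edge may land in different bins.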