# Gemma Analytics Hiring Test – Surgeon Skill Ranking

Author: David Gonzalez  
Date: 11.07.2025

This notebook analyzes the performance of surgeons who perform hip replacement operations, based on their patients' EQ-5D-5L questionnaire results.

In [1]:

from sqlalchemy import create_engine
import pandas as pd

# DB Credentials
db_user = "c50c162d93e1b19027aafe01f4915371e"
db_pass = "f1c1e1f88935a9c21b05e200cc938c0c"
db_host = "candidate-testing.cowkpei4bgel.eu-central-1.rds.amazonaws.com"
db_port = "5432"
db_name = "hiring_test"

# Create SQLAlchemy engine
engine = create_engine(f"postgresql+psycopg2://{db_user}:{db_pass}@{db_host}:{db_port}/{db_name}")
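The credentials above are written directly into the notebook. One alternative is to read them from the environment. The sketch below is only an illustration under stated assumptions: the `DB_*` variable names and the `build_db_url` helper are hypothetical and not part of the original setup. It uses the standard library to assemble the same URL format that is passed to `create_engine`.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=os.environ):
    """Assemble a postgresql+psycopg2 URL from hypothetical DB_* variables."""
    user = quote_plus(env["DB_USER"])      # raises KeyError if the variable is missing
    password = quote_plus(env["DB_PASS"])  # quote_plus escapes '@', ':' and '/'
    host = env["DB_HOST"]
    port = env.get("DB_PORT", "5432")      # falls back to the default PostgreSQL port
    name = env["DB_NAME"]
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{name}"


# Demonstration with placeholder values (not real credentials).
example_env = {
    "DB_USER": "demo_user",
    "DB_PASS": "p@ss:word",
    "DB_HOST": "localhost",
    "DB_NAME": "hiring_test",
}
example_url = build_db_url(example_env)
print(example_url)
```

The resulting string could then be handed to `create_engine` in place of the f-string above.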


### 1. Explore Available Tables

In [None]:
# List user tables (system schemas excluded) to find the relevant ones
tables_df = pd.read_sql("""
    SELECT table_name
    FROM information_schema.tables
    WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
    ORDER BY table_name;
""", engine)

with pd.option_context('display.max_rows', None):
    display(tables_df)

### High-priority tables found
 + patients
 + surgeons
 + answer_options
 + answers
 + questionnaires
 + questions

In [21]:
# Preview tables

pd.read_sql("SELECT * FROM answers LIMIT 1;", engine)

Unnamed: 0,id,question_id,patient_id,questionnaire_id,answer
0,1,1,1,1,I have no problems in walking around


In [22]:
pd.read_sql("SELECT * FROM answer_options LIMIT 1;", engine)

Unnamed: 0,question_id,answer,severity_code,central_estimate
0,1,I have no problems in walking around,1,0.0


In [23]:
pd.read_sql("SELECT * FROM questionnaires LIMIT 1;", engine)

Unnamed: 0,id,type,treatment,questions
0,1,pre,Hip,"[1, 2, 3, 4, 5]"


In [24]:
pd.read_sql("SELECT * FROM questions LIMIT 1;", engine)

Unnamed: 0,id,title,description
0,1,Mobility,Please indicate what applies


In [25]:
pd.read_sql("SELECT * FROM patients LIMIT 1;", engine)

Unnamed: 0,id,gender,surgeon_id
0,1,Male,3


In [26]:
pd.read_sql("SELECT * FROM surgeons LIMIT 1;", engine)

Unnamed: 0,id,name
0,1,Padme Amidala


### 2. Extracting Patient Responses with Health Scores

SQL queries are stored in the `sql/` folder, one level above this notebook, and loaded as needed.

This query joins:
- `answers` → raw responses
- `answer_options` → to get the `central_estimate` score for each answer
- `questionnaires` → to keep only `Hip` treatments and to distinguish `pre` from `post` questionnaires

The result is one row per answered question, with the `central_estimate` for that answer, which feeds into the health score.
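The SQL file itself is not reproduced here, so the following is only a sketch of the join logic in pandas, using toy data. It assumes that `answers` links to `answer_options` on `question_id` plus the answer text, and to `questionnaires` on `questionnaire_id = id`, as the table previews suggest. All values, including the `Knee` row, are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the three tables (values are illustrative, not real data).
answers = pd.DataFrame({
    "question_id": [1, 2, 1],
    "patient_id": [10, 10, 11],
    "questionnaire_id": [1, 1, 3],
    "answer": ["slight", "none", "slight"],
})
answer_options = pd.DataFrame({
    "question_id": [1, 1, 2],
    "answer": ["none", "slight", "none"],
    "central_estimate": [0.0, 0.058, 0.0],
})
questionnaires = pd.DataFrame({
    "id": [1, 2, 3],
    "type": ["pre", "post", "pre"],
    "treatment": ["Hip", "Hip", "Knee"],
})

# Attach the score for each answer (assumed key: question_id + answer text).
scored = answers.merge(answer_options, on=["question_id", "answer"], how="inner")

# Attach questionnaire metadata, then keep Hip questionnaires only.
joined = scored.merge(
    questionnaires.rename(columns={"id": "questionnaire_id", "type": "questionnaire_type"}),
    on="questionnaire_id",
    how="inner",
)
hip_scores = joined.loc[
    joined["treatment"] == "Hip",
    ["patient_id", "questionnaire_id", "questionnaire_type", "treatment", "central_estimate"],
].reset_index(drop=True)

print(hip_scores)
```

In this toy data, only the two answers from patient 10 survive the filter, because patient 11 answered a non-Hip questionnaire.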


In [32]:
with open("../sql/responses_and_scores.sql", "r") as file:
    query = file.read()

patient_scores_df = pd.read_sql(query, engine)
patient_scores_df.head()

Unnamed: 0,patient_id,questionnaire_id,questionnaire_type,treatment,central_estimate
0,6163,1,pre,Hip,0.274
1,7862,1,pre,Hip,0.274
2,4802,2,post,Hip,0.274
3,9581,1,pre,Hip,0.274
4,9582,1,pre,Hip,0.274


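### 3. Pre and Post Scores per Patient

- Sum the `central_estimate` values per patient and questionnaire type (`pre`/`post`).
- Convert each sum to a health score: `health_score = 1 - sum(central_estimate)`.
- Pivot so that each patient has `pre` and `post` on one row.
- Add an `improvement` column (`post - pre`); it is `NaN` for patients missing either questionnaire.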

In [None]:
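# Total central estimate per patient and questionnaire type
score_sums = (
    patient_scores_df
    .groupby(['patient_id', 'questionnaire_type'], as_index=False)['central_estimate']
    .sum()
)

# Health score = 1 minus the summed central estimates
score_sums['health_score'] = 1 - score_sums['central_estimate']

# One row per patient, with 'pre' and 'post' as columns
score_pivot = (
    score_sums
    .pivot(index='patient_id', columns='questionnaire_type', values='health_score')
    .reset_index()
)

# Positive values mean the score rose from pre to post; NaN if either is missing
score_pivot['improvement'] = score_pivot['post'] - score_pivot['pre']
score_pivot.head()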


questionnaire_type,patient_id,post,pre,improvement
0,1,0.84,0.613,0.227
1,4,,0.615,
2,5,0.838,0.691,0.147
3,6,0.501,0.09,0.411
4,7,0.443,0.445,-0.002
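
Patient 4 in the output above has no `post` score, so its `improvement` is `NaN`. The sketch below shows one possible way to separate complete from incomplete patients before any further aggregation. It re-types the first three rows of the output by hand, and the names `score_pivot_demo`, `complete`, and `incomplete` are introduced here purely for illustration. Whether incomplete patients should be dropped or handled differently is a judgment call this sketch does not settle.

```python
import numpy as np
import pandas as pd

# First rows of the pivot shown above, re-typed by hand for illustration.
score_pivot_demo = pd.DataFrame({
    "patient_id": [1, 4, 5],
    "post": [0.84, np.nan, 0.838],
    "pre": [0.613, 0.615, 0.691],
})
score_pivot_demo["improvement"] = score_pivot_demo["post"] - score_pivot_demo["pre"]

# Patients missing either questionnaire have no defined improvement.
has_both = score_pivot_demo[["pre", "post"]].notna().all(axis=1)
complete = score_pivot_demo[has_both]
incomplete = score_pivot_demo[~has_both]

print(f"{len(complete)} complete, {len(incomplete)} incomplete")
```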
