In [1]:
%load_ext autoreload

In [2]:
%autoreload 2

In [3]:
%load_ext chapyter



In [4]:
import os
os.environ["NOTEBOOK_NAME"] = "01-sepsis-gender-distribution-2.ipynb"

In [5]:
%%mimicSQL

Retrieve all data from PATIENTS table where patients database. 

Sure, here is a SQL query to retrieve all data from the PATIENTS table:

```sql
SELECT * 
FROM mimic.mimiciii.patients;
```

Please replace `mimic.mimiciii` with your actual database schema name if it's different.
RECEIVED <class 'pandas.core.frame.DataFrame'>       row_id subject_id gender                      dob  \
0        234        249      F  2075-03-13 00:00:00.000   
1        235        250      F  2164-12-27 00:00:00.000   
2        236        251      M  2090-03-15 00:00:00.000   
3        237        252      M  2078-03-06 00:00:00.000   
4        238        253      F  2089-11-26 00:00:00.000   
...      ...        ...    ...                      ...   
46469  31840      44089      M  2026-05-25 00:00:00.000   
46470  31841      44115      F  2124-07-27 00:00:00.000   
46471  31842      44123      F  2049-11-26 00:00:00.000   
46472  31843      44126      F  2076-07-25 00:00:00.000   
46473  31844      44128      M  2098-07-25 00:00:00.000   

                           dod

Unnamed: 0,row_id,subject_id,gender,dob,dod,dod_hosp,dod_ssn,expire_flag
0,234,249,F,2075-03-13 00:00:00.000,,,,0
1,235,250,F,2164-12-27 00:00:00.000,2188-11-22 00:00:00.000,2188-11-22 00:00:00.000,,1
2,236,251,M,2090-03-15 00:00:00.000,,,,0
3,237,252,M,2078-03-06 00:00:00.000,,,,0
4,238,253,F,2089-11-26 00:00:00.000,,,,0


In [6]:
%%mimicPython

How many men and women are there?

You can use the `value_counts()` function in pandas to count the number of men and women in the 'gender' column. Here is the Python code:

```python
df['gender'].value_counts()
```

This will return the count of each unique value in the 'gender' column, which should be 'M' for male and 'F' for female.

 Result : gender
M    26088
F    20386
Name: count, dtype: int64


In [8]:
%%mimicSQL

Now retrieve the same dataset from MIMIC-III, except only patients over 18.

To retrieve data for patients over 18, we need to calculate the age of patients at their first admission. However, the MIMIC-III dataset intentionally shifts dates for patients aged over 89 to protect patient confidentiality. Therefore, we can't accurately calculate age for these patients. 

Here is a SQL query that retrieves data for patients where we can calculate the age and the age is over 18:

```sql
SELECT p.*
FROM mimic.mimiciii.patients p
INNER JOIN (
    SELECT 
        a.subject_id,
        MIN( ROUND( (cast(admittime as date) - cast(dob as date)) / 365.25 ) ) as age
    FROM mimic.mimiciii.admissions a
    INNER JOIN mimic.mimiciii.patients p
    ON a.subject_id = p.subject_id
    GROUP BY a.subject_id
) a
ON p.subject_id = a.subject_id
WHERE a.age > 18;
```

Please replace `mimic.mimiciii` with your actual database schema name if it's different.
Query failed! State: FAILED
Reason: FUNCTION_NOT_FOUND: line 6:14: Unexpected parameters (interval day to second) for function rou

In [9]:
%%mimicSQL

Try again.

I apologize for the confusion. The ROUND function in SQL does not accept interval data types. We can use the EXTRACT function to get the year from the dates and then subtract to calculate the age. Here is the corrected SQL query:

```sql
SELECT p.*
FROM mimic.mimiciii.patients p
INNER JOIN (
    SELECT 
        a.subject_id,
        MIN( EXTRACT(year FROM a.admittime) - EXTRACT(year FROM p.dob) ) as age
    FROM mimic.mimiciii.admissions a
    INNER JOIN mimic.mimiciii.patients p
    ON a.subject_id = p.subject_id
    GROUP BY a.subject_id
) a
ON p.subject_id = a.subject_id
WHERE a.age > 18;
```

Please replace `mimic.mimiciii` with your actual database schema name if it's different.


Unnamed: 0,row_id,subject_id,gender,dob,dod,dod_hosp,dod_ssn,expire_flag
0,234,249,F,2075-03-13 00:00:00.000,,,,0
1,235,250,F,2164-12-27 00:00:00.000,2188-11-22 00:00:00.000,2188-11-22 00:00:00.000,,1
2,236,251,M,2090-03-15 00:00:00.000,,,,0
3,237,252,M,2078-03-06 00:00:00.000,,,,0
4,238,253,F,2089-11-26 00:00:00.000,,,,0


In [12]:
%%mimicSQL

Try again

I apologize for the confusion. AWS Athena does not support the AGE function. We can calculate the age by subtracting the year of birth from the year of the first admission. Here is the corrected SQL query:

```sql
SELECT p.*
FROM mimiciii.patients p
JOIN (
    SELECT subject_id, MIN(DATE_PART('year', admittime)) as first_admit_year
    FROM mimiciii.admissions
    GROUP BY subject_id
) a
ON p.subject_id = a.subject_id
WHERE first_admit_year - DATE_PART('year', p.dob) >= 18;
```

This query first finds the year of the first admission for each patient, then calculates the age at the time of this admission, and finally filters for patients who were at least 18 years old. Please replace 'mimiciii' with the actual schema name if it's different.
Query failed! State: FAILED
Reason: FUNCTION_NOT_FOUND: line 4:28: Function 'date_part' not registered


TypeError: cannot unpack non-iterable NoneType object

In [13]:
%%mimicSQL

Try again! Reason: FUNCTION_NOT_FOUND: line 4:28: Function 'date_part' not registered

I apologize for the confusion earlier. AWS Athena uses different functions for date and time manipulations. Let's use the `year` function to extract the year from the dates. Here is the corrected SQL query:

```sql
SELECT p.*
FROM mimiciii.patients p
JOIN (
    SELECT subject_id, MIN(year(admittime)) as first_admit_year
    FROM mimiciii.admissions
    GROUP BY subject_id
) a
ON p.subject_id = a.subject_id
WHERE first_admit_year - year(p.dob) >= 18;
```

This query first finds the year of the first admission for each patient, then calculates the age at the time of this admission, and finally filters for patients who were at least 18 years old. Please replace 'mimiciii' with the actual schema name if it's different.


Unnamed: 0,row_id,subject_id,gender,dob,dod,dod_hosp,dod_ssn,expire_flag
0,234,249,F,2075-03-13 00:00:00.000,,,,0
1,235,250,F,2164-12-27 00:00:00.000,2188-11-22 00:00:00.000,2188-11-22 00:00:00.000,,1
2,236,251,M,2090-03-15 00:00:00.000,,,,0
3,237,252,M,2078-03-06 00:00:00.000,,,,0
4,238,253,F,2089-11-26 00:00:00.000,,,,0


In [15]:
%%mimicPython

How many men and women are there?

You can use the `value_counts()` function in pandas to count the number of men and women in the dataframe. Here is the Python code:

```python
gender_counts = df['gender'].value_counts()
print(gender_counts)
```

This will print the number of men and women in the dataframe. 'M' stands for male and 'F' stands for female.

 Result : gender
M    21818
F    16722
Name: count, dtype: int64


In [11]:
%%mimicPython

So how statistically significant is this result?

The p-value obtained from the Chi-Square test is extremely small (5.74e-05), much less than the commonly used significance level of 0.05. This means that the difference in death rates between men and women is statistically significant. 

In other words, the likelihood that the observed difference in death rates occurred by chance is very low, so we can confidently reject the null hypothesis that there is no difference in death rates between men and women. 

Please note that while this result is statistically significant, it does not necessarily imply a large or meaningful difference in death rates. The magnitude and clinical significance of the difference would need to be evaluated separately.
