In [None]:
1. Combine the datasets
2. Determine whether each patient died at any time
3. Determine when each patient was diagnosed and when they died
4. Run the survival analysis and plot the survival chart
5. Find the correlation between the time of diagnosis and the time of death

## What Does Time-to-Event Mean?

In survival analysis, time-to-event refers to the amount of time from a defined starting point (e.g., admission) until a specific event of interest occurs (e.g., death, readmission, relapse, recovery).

## In your HCUP data context:

Start time: Admission to the hospital.

Event: In-hospital death (died == 1).

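Time-to-event: The number of days from the patient's first admission to the admission at which the patient died. The raw daystoevent value is not this duration by itself: it counts days from a randomly assigned, patient-specific start date, so the duration is the difference between two daystoevent values that share the same visitlink.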

## For patients who did not die:

Their died == 0

The difference between their last and first daystoevent values is the time they were observed without the event → this is called censoring
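
To make the event/censoring distinction concrete, the sketch below collapses visit-level rows into one row per patient (a duration plus an event flag) and computes a Kaplan-Meier estimate by hand. The toy values and the helper name `kaplan_meier` are illustrative assumptions, not part of this notebook, and the duration ignores length of stay for simplicity.

```python
import pandas as pd

# Toy visit-level data (hypothetical values, same column names as the notebook)
visits = pd.DataFrame({
    "visitlink":   [1, 1, 2, 3, 3, 4],
    "daystoevent": [100, 130, 500, 40, 50, 900],
    "died":        [0, 1, 0, 0, 0, 1],
})

# One row per patient: duration = last observed day - first admission day,
# event = 1 if the patient died in hospital at any visit, else 0 (censored)
patients = visits.groupby("visitlink").agg(
    start_day=("daystoevent", "min"),
    last_day=("daystoevent", "max"),
    event=("died", "max"),
)
patients["duration"] = patients["last_day"] - patients["start_day"]


def kaplan_meier(durations, events):
    """Return survival probabilities indexed by the observed event times."""
    data = pd.DataFrame({"t": durations, "e": events})
    survival, s = {}, 1.0
    for t in sorted(data.loc[data["e"] == 1, "t"].unique()):
        at_risk = (data["t"] >= t).sum()  # patients still observed at time t
        deaths = ((data["t"] == t) & (data["e"] == 1)).sum()
        s *= 1 - deaths / at_risk
        survival[t] = s
    return pd.Series(survival, name="survival")


km = kaplan_meier(patients["duration"], patients["event"])
print(patients[["duration", "event"]])
print(km)
```

Censored patients still count in the at-risk denominator up to their last observed day, which is why dropping them would bias the curve downward.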

In [53]:
# Load the dataset
import pandas as pd

df = pd.read_csv('../data/Lung_Cancer_one_hot/md_sid_one_hot_encoding_core_Lung_Cancer.csv')
# Drop rows with invalid visitlink
df = df[df["visitlink"] != -99999999]

# Confirm they're gone
print("Any -99999999 left?", (-99999999 in df["visitlink"].values))



Any -99999999 left? False


In [54]:
df.shape

(28733, 864)

# What Does the `atype` Variable Mean?

| `atype` | Meaning                        |
| ------- | ------------------------------ |
| 1       | **Emergency** (e.g., ER visit) |
| 2       | **Urgent**                     |
| 3       | **Elective**                   |
| 4       | **Newborn**                    |
| 5       | **Trauma Center**              |
| 6       | **Other**                      |
| .       | Missing / invalid / unknown    |
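
A lookup like the table above can be applied to the data with `Series.map`. This is a minimal sketch on made-up values; the dictionary name `ATYPE_LABELS` and the `atype_label` column are assumptions introduced for illustration.

```python
import pandas as pd

# Code-to-label lookup taken from the table above
ATYPE_LABELS = {
    1: "Emergency",
    2: "Urgent",
    3: "Elective",
    4: "Newborn",
    5: "Trauma Center",
    6: "Other",
}

# Hypothetical sample; any code not in the lookup is treated as missing/invalid
sample = pd.DataFrame({"atype": [1, 3, 6, -9, 1]})
sample["atype_label"] = (
    sample["atype"].map(ATYPE_LABELS).fillna("Missing / invalid / unknown")
)
print(sample["atype_label"].value_counts())
```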


# What Does the `readmit` Variable Mean?

| `readmit` | Meaning                               |
| --------- | ------------------------------------- |
| 1         | Yes — this visit is a **readmission** |
| 0         | No — this is **not** a readmission    |
| . or NaN  | Missing or not applicable             |


# What Does the `died` Variable Mean?

| `died`   | Meaning                               |
| -------- | ------------------------------------- |
| 1        | Yes — patient **died in hospital**    |
| 0        | No — patient was **discharged alive** |
| . or NaN | Missing data (rare)                   |


# What Does the `discharge` Variable Mean?

| Code | Meaning                                         |
| ---- | ----------------------------------------------- |
| 1    | Discharged to **home/self-care**                |
| 2    | Short-term hospital                             |
| 3    | Skilled nursing facility (SNF)                  |
| 4    | Intermediate care facility                      |
| 5    | Another type of facility                        |
| 6    | **Home health care**                            |
| 7    | **Left against medical advice (AMA)**           |
| 20   | **Died in hospital**                            |
| 21   | Discharged/transferred to court/law enforcement |
| 30   | Still a patient                                 |
| .    | Missing / invalid                               |
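
Since both `died == 1` and discharge code 20 indicate an in-hospital death, the two fields can be cross-checked against each other. A minimal sketch on made-up rows, assuming the discharge code is stored in the `dispuniform` column used later in this notebook:

```python
import pandas as pd

# Hypothetical rows; `dispuniform` is assumed to hold the discharge codes above
sample = pd.DataFrame({
    "died":        [1, 0, 0, 1, 0],
    "dispuniform": [20, 1, 6, 20, 20],  # the last row is deliberately inconsistent
})

# Cross-tabulate the death flag against "discharge code is 20"
check = pd.crosstab(sample["died"], sample["dispuniform"] == 20)
print(check)

# Rows where the two indicators disagree deserve a manual look
mismatch = sample[(sample["died"] == 1) != (sample["dispuniform"] == 20)]
print(mismatch)
```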


### ✅ Step 1: Group to get `start_day` (first admission day)

```python
start_day_map = df.groupby("visitlink")["daystoevent"].min().rename("start_day")
```

* **Purpose**: Find the **earliest `daystoevent`** for each patient (first admission).
* `.groupby("visitlink")` makes sure you're grouping all visits per patient.
* `.min()` returns the **first admission day**.

---

### ✅ Step 2: Group to get `death_day` (death event day)

```python
death_day_map = df[df["died"] == 1].groupby("visitlink")["daystoevent"].max().rename("death_day")
```

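* **Purpose**: For patients who died (`died == 1`), find the `daystoevent` of the visit at which the death occurred (`death_day`).
* Filtering on `died == 1` first keeps only the death visits; `.groupby("visitlink")` then groups them per patient.
* `.max()` takes the latest such `daystoevent`, which guards against a patient having more than one row flagged as a death.
* Because `daystoevent` is measured at admission, `death_day` is the admission day of the fatal stay, not the exact day of death.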

---

### 🔁 Then, these two grouped Series are merged back:

```python
df = df.merge(start_day_map, on="visitlink", how="left")
df = df.merge(death_day_map, on="visitlink", how="left")
```

Each row in the DataFrame gets the corresponding `start_day` and `death_day` based on its `visitlink`.
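
The same per-patient values can also be attached with `groupby(...).transform`, which skips the merge. One practical difference: merging a second time into a DataFrame that already has `start_day` or `death_day` produces suffixed columns (`start_day_x`, `start_day_y`), whereas the assignment below simply overwrites the column on a re-run. A small sketch on made-up data (the values are illustrative only):

```python
import pandas as pd

# Made-up visits for two patients; patient 1 dies at the second visit
toy = pd.DataFrame({
    "visitlink":   [1, 1, 2, 2],
    "daystoevent": [200, 260, 50, 75],
    "died":        [0, 1, 0, 0],
})

# First admission day per patient, broadcast back to every row
toy["start_day"] = toy.groupby("visitlink")["daystoevent"].transform("min")

# Death day: keep daystoevent only on death rows, then take the per-patient max
# (patients without a death row end up with NaN)
toy["death_day"] = (
    toy["daystoevent"].where(toy["died"] == 1)
    .groupby(toy["visitlink"]).transform("max")
)

toy["time_to_death"] = toy["death_day"] - toy["start_day"]
print(toy)
```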




In [55]:
import pandas as pd

# === Load your DataFrame (assuming it's already loaded as df) ===
# Example: df = pd.read_csv("your_file.csv")

# === Step 1: Get first admission day (start_day) for each visitlink ===
start_day_map = df.groupby("visitlink")["daystoevent"].min().rename("start_day")

# === Step 2: Get death_day for each visitlink (only where patient died) ===
death_day_map = df[df["died"] == 1].groupby("visitlink")["daystoevent"].max().rename("death_day")

# === Step 3: Merge start_day and death_day into original dataframe ===
df = df.merge(start_day_map, on="visitlink", how="left")
df = df.merge(death_day_map, on="visitlink", how="left")

# === Step 4: Calculate time_to_death ===
df["time_to_death"] = df["death_day"] - df["start_day"]

# === Step 5: Define columns to display ===
columns_to_show = [
    "visitlink", "ayear", "daystoevent", "los", "died",
    "atype", "dispuniform", "readmit", "start_day", "death_day", "time_to_death"
]

# === Step 6: Filter only patients who died and export to CSV ===
df_died = df[df["died"] == 1]
df_died[columns_to_show].to_csv("../data/Lung_Cancer_one_hot/time_to_death_Lung_Cancer_died.csv", index=False)

# === Step 7: Optional display ===
print("\n☠️ Patients who died:")
df_died[columns_to_show].head(50)



☠️ Patients who died:


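|  | visitlink | ayear | daystoevent | los | died | atype | dispuniform | readmit | start_day | death_day | time_to_death |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 13 | 1824304 | 2016 | 19933 | 2 | 1 | 1 | 20 | 1 | 19933 | 19933.0 | 0.0 |
| 38 | 1825802 | 2017 | 20295 | 4 | 1 | 1 | 20 | 0 | 20203 | 20295.0 | 92.0 |
| 41 | 1826120 | 2019 | 16568 | 8 | 1 | 1 | 20 | 0 | 16568 | 16568.0 | 0.0 |
| 49 | 1826910 | 2017 | 15912 | 2 | 1 | 1 | 20 | 1 | 15871 | 15912.0 | 41.0 |
| 54 | 1827537 | 2017 | 16820 | 2 | 1 | 1 | 20 | 0 | 16820 | 16820.0 | 0.0 |
| 55 | 1827619 | 2015 | 18837 | 2 | 1 | 1 | 20 | 0 | 18837 | 18837.0 | 0.0 |
| 60 | 1827663 | 2019 | 20183 | 9 | 1 | 1 | 20 | 0 | 19178 | 20183.0 | 1005.0 |
| 141 | 1837746 | 2018 | 16181 | 5 | 1 | 1 | 20 | 0 | 15927 | 16181.0 | 254.0 |
| 178 | 1841221 | 2017 | 19537 | 10 | 1 | 1 | 20 | 0 | 19537 | 19537.0 | 0.0 |
| 182 | 1841539 | 2019 | 16308 | 0 | 1 | 1 | 20 | 1 | 16097 | 16308.0 | 211.0 |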


In [52]:
# read the csv file

data_death = pd.read_csv("../data/Lung_Cancer_one_hot/time_to_death_Lung_Cancer_died.csv")

data_death.value_counts("time_to_death")

time_to_death
0.0       1082
16.0        25
14.0        21
11.0        21
19.0        19
          ... 
1043.0       1
1063.0       1
1142.0       1
1236.0       1
1005.0       1
Name: count, Length: 407, dtype: int64

In [45]:
# === Step 2: Extract death_day for each visitlink (only where patient died) ===
death_day_map = df[df["died"] == 1].groupby("visitlink")["daystoevent"].max().rename("death_day")

# === Step 3: Merge death_day into original dataframe ===
df = df.merge(death_day_map, on="visitlink", how="left")

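# === Step 4: Calculate time_to_death ===
# NOTE: this subtracts each row's own daystoevent, so it measures the days from
# *that visit* to the death visit. On the death visit itself the result is
# always 0 (see the output below). To measure time from the first admission,
# subtract the per-patient minimum daystoevent (start_day) instead.
df["time_to_death"] = df["death_day"] - df["daystoevent"]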

# === Step 5 (Optional): Define what to show ===
columns_to_show = [
    "visitlink", "ayear", "daystoevent", "los", "died",
    "atype", "dispuniform", "readmit", "death_day", "time_to_death"
]



# === Optional: Filter rows where patient died (died == 1) ===
df_died = df[df["died"] == 1]
print("\n☠️ Patients who died:")
df_died[columns_to_show].to_csv("../data/Lung_Cancer_one_hot/time_to_death_Lung_Cancer_died.csv", index=False)
df_died[columns_to_show]


☠️ Patients who died:


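|  | visitlink | ayear | daystoevent | los | died | atype | dispuniform | readmit | death_day | time_to_death |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 13 | 1824304 | 2016 | 19933 | 2 | 1 | 1 | 20 | 1 | 19933.0 | 0.0 |
| 38 | 1825802 | 2017 | 20295 | 4 | 1 | 1 | 20 | 0 | 20295.0 | 0.0 |
| 41 | 1826120 | 2019 | 16568 | 8 | 1 | 1 | 20 | 0 | 16568.0 | 0.0 |
| 49 | 1826910 | 2017 | 15912 | 2 | 1 | 1 | 20 | 1 | 15912.0 | 0.0 |
| 54 | 1827537 | 2017 | 16820 | 2 | 1 | 1 | 20 | 0 | 16820.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 28695 | 13978890 | 2019 | 18791 | 12 | 1 | 1 | 20 | 0 | 18791.0 | 0.0 |
| 28696 | 13979209 | 2019 | 18871 | 8 | 1 | 6 | 20 | 1 | 18871.0 | 0.0 |
| 28697 | 13979492 | 2019 | 17138 | 2 | 1 | 1 | 20 | 0 | 17138.0 | 0.0 |
| 28709 | 13981997 | 2019 | 20985 | 2 | 1 | 1 | 20 | 0 | 20985.0 | 0.0 |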


In [39]:
# df_died is the DataFrame itself, not one of its columns;
# count the values of an existing column instead
df_died["died"].value_counts()

# Sanity check for key variables

In [2]:
# Sanity check for key variables
sanity_df = df[['ayear', 'daystoevent', 'dispuniform', 'visitlink']]

# Check for missing values
print("Missing values:\n", sanity_df.isnull().sum())

Missing values:
 ayear          0
daystoevent    0
dispuniform    0
visitlink      0
dtype: int64


# Unique counts for key variables

In [3]:

print("\nUnique years of admission:", sanity_df['ayear'].unique())
print("\nSample discharge statuses (dispuniform):")
print(sanity_df['dispuniform'].value_counts())


Unique years of admission: [2016 2017 2018 2019 2015]

Sample discharge statuses (dispuniform):
dispuniform
1     12917
6      6509
5      5922
20     2397
2       795
7       201
0        11
99        1
Name: count, dtype: int64


In [4]:
# Check if daystoevent is non-negative
print("\nAny negative daystoevent?:", (sanity_df['daystoevent'] < 0).any())

# Count unique patients (distinct visitlink values)
print("\nTotal unique patients (visitlink):", sanity_df['visitlink'].nunique())



Any negative daystoevent?: True

Total unique patients (visitlink): 16860


In [5]:
# Check for -99999999 in visitlink
invalid_ids = df[df["visitlink"] == -99999999]

# Summary
print("Number of records with visitlink = -99999999:", len(invalid_ids))

Number of records with visitlink = -99999999: 20




In [7]:
# Filter only patients who died
died_df = df[df["died"] == 1]
died_df.shape

(2394, 864)

In [8]:
# Summary statistics of daystoevent
print(died_df["daystoevent"].describe())

count     2394.000000
mean     18108.738513
std       1778.327647
min      14453.000000
25%      16635.000000
50%      18055.500000
75%      19673.500000
max      21891.000000
Name: daystoevent, dtype: float64


# DaysToEvent

The data element DaysToEvent is one of two data elements that are supplemental information created for HCUP States for which there are encrypted person identifiers. The timing information in DaysToEvent must be used in tandem with the visit linkage variable (VisitLink). VisitLink is created from verified person numbers. These variables enable users to study multiple hospital visits for the same patient across hospitals and time while adhering to strict privacy regulations.

The timing variable (DaysToEvent) was calculated consistently for each verified person number (visitLink) based on a randomly assigned "start date." Each verified person number is assigned a unique start date that is used to calculate DaysToEvent for all visits associated with that visitLink value. The variable DaysToEvent is the difference between the visit's admission date and the start date associated with the visitLink.

The calculation of days between visits is the difference of DaysToEvent between two selected visits for a unique verified person number (visitLink). For example, consider a patient with congestive heart failure who has a hospital admission on 1/10/2008 and an ED visit on 1/25/2008. If the DaysToEvent value is "9" for the 1/10/2008 admission and the DaysToEvent value is "24" for the 1/25/2008 ED visit, then the number of days between the two visits is 15 days (24 - 9 = 15). It should be noted that readmission analyses often consider the time between the end of one admission and the start of the next admission. To adjust for the length of the admission, subtract the length of stay from the difference. In the example above, if the first admission had a length of stay of 2 days, then the number of days between the end of the first visit and the start of the second visit is 13 days (24 - 9 - 2 = 13).
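
The arithmetic in this example can be reproduced with a grouped `shift`. The sketch below uses the numbers from the paragraph above; the `visitlink` value 101, the second visit's length of stay, and the new column names are made up for illustration.

```python
import pandas as pd

# The example from the text: an admission with DaysToEvent 9 (LOS 2 days),
# followed by an ED visit with DaysToEvent 24 (its LOS of 0 is an assumption)
visits = pd.DataFrame({
    "visitlink":   [101, 101],
    "daystoevent": [9, 24],
    "los":         [2, 0],
})

# Order visits within each patient, then look up the previous visit's values
visits = visits.sort_values(["visitlink", "daystoevent"])
prev = visits.groupby("visitlink")[["daystoevent", "los"]].shift(1)

# Admission-to-admission gap, and discharge-to-admission gap (adjusted for LOS)
visits["days_between"] = visits["daystoevent"] - prev["daystoevent"]
visits["days_since_discharge"] = visits["days_between"] - prev["los"]
print(visits)
```

Grouping by `visitlink` before shifting keeps the subtraction within one patient, which matters because DaysToEvent values are not comparable across patients.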

The lowest value of DaysToEvent will be on the first or earliest event for a patient. It is important to remember that if patient A has a value of 605 for DaysToEvent and patient B has a value of 300 for DaysToEvent, patient B's event did not necessarily take place prior to patient A's event - in fact, Patient B's DaysToEvent value has no relation to Patient A's DaysToEvent value. Because of the use of a random start date in the calculation of DaysToEvent, the value of DaysToEvent cannot be compared across patients.

Beginning with the 2009 HCUP data, the revisit variables (VisitLink and DaysToEvent) are included in the Core file of the SID, SASD, and SEDD files, when possible. For 2003-2008 data, the revisit variables are in separate HCUP Supplemental Files for Revisit Analyses.

In [1]:
from datetime import datetime, timedelta
datetime(1960, 1, 1) + timedelta(days=18837)
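# Result: 2011-07-29
# This treats daystoevent as a SAS date (days since 1960-01-01). The start date
# is random per patient, so the result is not a real calendar date: 18837 belongs
# to visitlink 1827619, whose ayear is 2015, not 2011.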


datetime.datetime(2011, 7, 29, 0, 0)

In [12]:
import pandas as pd



# Filter patients who died
died_df = df[df["died"] == 1]

# Ensure ayear is valid
df_valid = df[df["ayear"] > 0]

# Step 1: Get first valid admission day and year for each patient who died
first_admission = df_valid[df_valid['visitlink'].isin(died_df['visitlink'])] \
    .groupby('visitlink').agg({
        'daystoevent': 'min',
        'ayear': 'min'
    }).reset_index().rename(columns={
        'daystoevent': 'start_day',
        'ayear': 'start_year'
    })

# Step 2: Get death information (we assume died_df already has valid ayear)
death_info = died_df[['visitlink', 'daystoevent', 'ayear']].rename(columns={
    'daystoevent': 'death_day',
    'ayear': 'death_year'
})

# Step 3: Merge and compute time to death
merged = pd.merge(death_info, first_admission, on='visitlink')
merged['time_to_death'] = merged['death_day'] - merged['start_day']

# Step 4: Output
print(merged[['visitlink', 'start_year', 'death_year', 'start_day', 'death_day', 'time_to_death']].head(20))


    visitlink  start_year  death_year  start_day  death_day  time_to_death
0     1824304        2016        2016      19933      19933              0
1     1825802        2017        2017      20203      20295             92
2     1826120        2019        2019      16568      16568              0
3     1826910        2017        2017      15871      15912             41
4     1827537        2017        2017      16820      16820              0
5     1827619        2015        2015      18837      18837              0
6     1827663        2016        2019      19178      20183           1005
7     1837746        2018        2018      15927      16181            254
8     1841221        2017        2017      19537      19537              0
9     1841539        2019        2019      16097      16308            211
10    1841945        2016        2017      16876      17297            421
11    1842276        2016        2016      20553      20553              0
12    1843954        2017

| Column          | Meaning                                                              |
| --------------- | -------------------------------------------------------------------- |
| `visitlink`     | Encrypted patient ID (same person may appear multiple times)         |
| `start_year`    | **Actual calendar year** of the first admission (can trust this)     |
| `death_year`    | **Actual calendar year** of the death (can trust this)               |
| `start_day`     | Synthetic number for intra-patient timing; not a real-world date     |
| `death_day`     | Same synthetic timeline as `start_day`                               |
| `time_to_death` | Difference between `death_day` and `start_day` (within same patient) |


KeyError: 'death_day'