## Cascadia RSV A/B swab dates and serum collection date plots
**The goal of this script is to create a single plot for each patient that contains:**
- a timeline of the dates that they tested POSITIVE for RSV A or RSV B (see computational_notebooks/gjuviler/rsv_imprinting/01-data/Imprinting_Sera2 - column name: 'oa_rsv_a' or 'oa_rsv_b'. A 1 denotes a positive swab.)
- a timeline of the dates that serum was collected that WE HAVE RIGHT NOW (see computational_notebooks/gjuviler/rsv_imprinting/Bloom_Simonich_CASCADIA_Oct2025_Samples.xlsx)
- the date and outcome of pre-F binding antibody tests (see computational_notebooks/gjuviler/rsv_imprinting/Imprinting_Sera2 - column name: 'ar_rsv_pre_f')
    - not sure whether to show the numeric value or convert it to a simple yes/no for whether binding occurred
    - currently on Teagan's plots, this is on a timeline of patient visit (redcap repeat instance), but we would rather have a date
- the date and outcome of neutralization titer assays (see computational_notebooks/gjuviler/rsv_imprinting/Imprinting_Sera2 - column name: 'ar_rsva_nd50')
    - again, this is currently on a patient visit timeline, but we want the date
    - there is only rsva neut data
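
A minimal sketch of how the last two bullets might be handled: attach a collection date to each assay result by joining on the aliquot ID, and keep the numeric pre-F value alongside a yes/no flag. The toy rows, the assumption that `aliquot_id` links the assay table to a table carrying `collect_dt`, and the `PRE_F_CUTOFF` value are all hypothetical placeholders; a real cutoff would have to come from the assay documentation.

```python
import pandas as pd

# Hypothetical assay rows; column names follow Imprinting_Sera2, values are made up.
assays = pd.DataFrame({
    "ptid": [1, 1, 2],
    "aliquot_id": ["a1", "a2", "b1"],
    "ar_rsv_pre_f": [0.2, 3.5, None],
    "ar_rsva_nd50": [None, 422.0, 886.0],
})

# Hypothetical lookup of collection dates per aliquot (assumed to be available).
dates = pd.DataFrame({
    "aliquot_id": ["a1", "a2", "b1"],
    "collect_dt": ["9/21/22", "11/18/23", "8/19/22"],
})
dates["collect_dt"] = pd.to_datetime(dates["collect_dt"], format="%m/%d/%y")

PRE_F_CUTOFF = 1.0  # placeholder threshold, NOT a validated value

# Left join keeps every assay row, even if no date is found for its aliquot.
dated = assays.merge(dates, on="aliquot_id", how="left")

# Keep the number and add a yes/no flag; missing measurements stay missing.
dated["pre_f_binding"] = (dated["ar_rsv_pre_f"] >= PRE_F_CUTOFF).astype("boolean")
dated.loc[dated["ar_rsv_pre_f"].isna(), "pre_f_binding"] = pd.NA
print(dated)
```

Keeping both the number and the flag means the plot can switch between the two views without re-deriving anything.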

**Previous work**
- Teagan has two notebooks in her comp notebook (see computational_notebooks/tmcmahon/2025/RSV_imprinting/02_notebooks) that create plots
- These notebooks are extremely long, and it's hard to tell exactly what is going on. The outputs are in the output folder and are fairly useful, but the changes mentioned above still need to be made.


### Import necessary components

In [None]:
import math
import os
import altair as alt
import numpy as np
import pandas as pd
from pathlib import Path

#os.chdir('..')
print(os.getcwd())


/fh/fast/bloom_j/computational_notebooks/gjuviler/rsv_imprinting


### Read the current data

In [None]:


selected_child = pd.read_csv('01-data/Bloom_Simonich_CASCADIA_Oct2025_Samples_children.csv')
selected_child['ptid'] = pd.to_numeric(selected_child['ptid'], downcast='integer', errors='coerce')     #convert the ptid column to numeric; 'EMPTY' entries become NaN, so the column ends up as float

selected_adult = pd.read_csv('01-data/Bloom_Simonich_CASCADIA_Oct2025_Samples_adult.csv')

sera = pd.read_csv('01-data/Imprinting_Sera2.csv')       #sera data
swab = pd.read_csv('01-data/Imprinting_swab2.csv')       #swab data
sera_swab = '01-data/Imprinting_sera_swab'      #path for the new combined dataframe that will hold the necessary information

### First, let's create a new dataframe of serum that contains only the patient ids that we have on hand:

In [53]:
df_aliquots_child = selected_child[['household_id', 'ptid', 'aliquot_id', 'collect_dt']]    #creates a new df with just these 4 columns

#list of the patient ids (as ints) that we have on hand, used to select the correct sera data
#rows with EMPTY where the ptid should be were coerced to NaN by pd.to_numeric above, so drop them here
selected_ptids = df_aliquots_child['ptid'].dropna().astype(int).unique().tolist()

sera_filtered_ptid = sera[sera['ptid'].isin(selected_ptids)]
sera_filtered_ptid


Unnamed: 0,ptid,household_id,study_region,calc_age_years,rsv_a_outcome,rsv_b_outcome,rsv_both,rsv_a_only,rsv_b_only,aliquot_id,...,ar_b11_529_nd50,ar_b11_529_nd80,ar_neut_b11_529_result,ar_xbb_nd50,ar_xbb_nd80,ar_neut_xbb_result,rsva_neut_flag,ar_rsva_nd50,ar_rsva_nd80,ar_rsva_result
303,20000401,40,2,0,1,1,1,0,0,3704633g,...,,,,,,,0.0,,,
304,20000401,40,2,0,1,1,1,0,0,0027014g,...,4137.0,1642.0,POS,,,,0.0,,,
305,20000401,40,2,0,1,1,1,0,0,8189674g,...,,,,,,,0.0,,,
306,20000401,40,2,0,1,1,1,0,0,0027014g,...,,,,,,,1.0,422.0,93.0,POS
307,20000401,40,2,0,1,1,1,0,0,1665540g,...,,,,,,,1.0,886.0,262.0,POS
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
637,20074272,7427,2,0,0,1,0,0,1,4210123g,...,,,,,,,0.0,,,
638,20074272,7427,2,0,0,1,0,0,1,7047035g,...,,,,,,,1.0,465.0,73.0,POS
639,20074272,7427,2,0,0,1,0,0,1,3883693g,...,,,,594.0,236.0,POS,0.0,,,
640,20074272,7427,2,0,0,1,0,0,1,3883693g,...,,,,,,,1.0,1474.0,526.0,POS


# Filtering Swab Data

### Create a dataframe that contains only our selected ptids, aliquot ids, and dates

In [54]:

filtered_aliquots = selected_child[['ptid', 'aliquot_id', 'collect_dt']]
filtered_aliquots


Unnamed: 0,ptid,aliquot_id,collect_dt
0,20000401.0,3001262g,9/21/22
1,20000401.0,6125192g,11/18/23
2,20000533.0,8129615g,8/19/22
3,20000533.0,3614113g,8/21/23
4,20001363.0,2140292g,7/24/22
5,20001363.0,4446778g,9/10/23
6,20002793.0,9555599g,9/3/22
7,20002793.0,8242954g,9/23/23
8,20002903.0,4027265g,8/22/22
9,20002903.0,4636768g,9/2/23


### And then we can create a filtered dataset for when our ptids test positive for rsv a or b
I am filtering on the oa_rsv_a and oa_rsv_b columns, but judging from Teagan's plots, she seems to be filtering on something else. Her plots show 3 RSV A positive events for ptid 20001363, whereas my filter finds only two; the swab from week 2022_46 is the one my filtering strategy misses. I'll have to ask Cassie which column is actually correct. It's possible that I should be filtering on the rsv_a and rsv_b columns instead.
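
One way to pin down the discrepancy is to list the swabs that one filtering strategy calls positive and the other does not. The sketch below does this on a toy table; the `positive_events` helper is hypothetical, the values are invented, and whether the real swab table actually has `rsv_a` and `rsv_b` columns is an assumption to check.

```python
import pandas as pd

def positive_events(swab, col_a, col_b):
    """Return the set of (ptid, swab_date) pairs where either column equals 1."""
    mask = (swab[col_a] == 1) | (swab[col_b] == 1)
    hits = swab.loc[mask]
    return set(zip(hits["ptid"].tolist(), hits["swab_date"].tolist()))

# Toy swab table; every value here is invented for illustration.
toy_swab = pd.DataFrame({
    "ptid": [1, 1, 1],
    "swab_date": ["7-Nov-22", "14-Nov-22", "19-Feb-23"],
    "oa_rsv_a": [1, 0, 1],
    "oa_rsv_b": [0, 0, 0],
    "rsv_a": [1, 1, 1],
    "rsv_b": [0, 0, 0],
})

by_oa = positive_events(toy_swab, "oa_rsv_a", "oa_rsv_b")
by_plain = positive_events(toy_swab, "rsv_a", "rsv_b")

# Events that only one of the two strategies calls positive.
only_plain = by_plain - by_oa
only_oa = by_oa - by_plain
print(sorted(only_plain), sorted(only_oa))
```

Run against the real swab table, the rows in `only_plain` would be the ones to bring to Cassie.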

In [55]:
swab_filtered = swab[swab['ptid'].isin(selected_ptids)]     #filter swab data to contain only our ptids
swab_positive = swab_filtered[(swab_filtered['oa_rsv_a'] == 1) | (swab_filtered['oa_rsv_b'] == 1)]        #filter for positive rsv a or b swabs (mask is built from swab_filtered so its index matches)
swab_positive[['ptid', 'swab_date', 'oa_rsv_a', 'oa_rsv_b']]


Unnamed: 0,ptid,swab_date,oa_rsv_a,oa_rsv_b
3008,20000401,27-Sep-22,1.0,0.0
3009,20000401,2-Oct-22,1.0,0.0
3010,20000401,10-Oct-22,1.0,0.0
3011,20000401,16-Oct-22,1.0,0.0
3067,20000401,14-Nov-23,0.0,1.0
3103,20000533,27-Nov-22,0.0,1.0
3280,20001363,7-Nov-22,1.0,0.0
3295,20001363,19-Feb-23,1.0,0.0
3477,20002793,28-Jan-23,1.0,0.0
3539,20002793,6-Jan-24,0.0,1.0


### Now, we can stack the aliquot/date dataframe and the filtered swab dataframe so that each patient's swab dates and aliquot dates sit in one table

In [73]:
swab_aliquot_concat = pd.concat([swab_positive, filtered_aliquots], ignore_index=True)
swab_aliquot_concat = swab_aliquot_concat[['ptid', 'aliquot_id', 'collect_dt', 'oa_rsv_a', 'oa_rsv_b', 'swab_date']].sort_values(['ptid'])
swab_aliquot_concat

Unnamed: 0,ptid,aliquot_id,collect_dt,oa_rsv_a,oa_rsv_b,swab_date
0,20000401.0,,,1.0,0.0,27-Sep-22
1,20000401.0,,,1.0,0.0,2-Oct-22
2,20000401.0,,,1.0,0.0,10-Oct-22
3,20000401.0,,,1.0,0.0,16-Oct-22
4,20000401.0,,,0.0,1.0,14-Nov-23
...,...,...,...,...,...,...
77,20074272.0,4751353g,11/11/23,,,
78,20074272.0,9669406g,12/30/23,,,
65,,EMPTY,EMPTY,,,
67,,EMPTY,EMPTY,,,
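
The two date columns in the table above are written differently (`27-Sep-22` versus `11/11/23`), and some rows hold the placeholder `EMPTY`. A minimal sketch of parsing each column with its own explicit format follows; the format strings are assumptions read off the rows displayed here, and the toy rows are illustrative.

```python
import pandas as pd

# Toy rows mimicking the combined table (values are illustrative).
toy = pd.DataFrame({
    "swab_date": ["27-Sep-22", "2-Oct-22", None, None],
    "collect_dt": [None, None, "11/11/23", "EMPTY"],
})

# Assumed formats: swab_date is day-month abbreviation-year,
# collect_dt is month/day/year. Anything else becomes NaT via errors="coerce".
toy["swab_date"] = pd.to_datetime(toy["swab_date"], format="%d-%b-%y", errors="coerce")
toy["collect_dt"] = pd.to_datetime(toy["collect_dt"], format="%m/%d/%y", errors="coerce")

# One date per row: the swab date if present, otherwise the collection date.
toy["event_date"] = toy["swab_date"].combine_first(toy["collect_dt"])

# Rows whose date could not be parsed (e.g. 'EMPTY') end up as NaT.
print(toy)
```

Counting the NaT rows after parsing is a quick check that only the `EMPTY` placeholders were dropped, not real dates in an unexpected format.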


### Now, we can plot the data on a timeline

In [104]:
df = swab_aliquot_concat.copy()     #work on a copy of the combined swab/aliquot table
df["swab_date"]  = pd.to_datetime(df["swab_date"], errors="coerce")
df["collect_dt"] = pd.to_datetime(df["collect_dt"], errors="coerce")

df["event_date"] = df["swab_date"].combine_first(df["collect_dt"])

df["event_type"] = None
df.loc[df["oa_rsv_a"] == 1, "event_type"] = "RSV A pos."
df.loc[df["oa_rsv_b"] == 1, "event_type"] = "RSV B pos."
df.loc[df["aliquot_id"].notna(), "event_type"] = "Aliquot"

timeline_df = df.dropna(subset=["event_type", "event_date"]).copy()
timeline_df["ptid"] = timeline_df["ptid"].astype(int)

# --- STEP 1: CREATE DATE LABEL STRING ---
# This goes *after* timeline_df is made, before charting.
timeline_df["date_label"] = timeline_df["event_date"].dt.strftime("%Y-%m-%d")


# --- STEP 2: Make Dropdown Selection ---
ptid_dropdown = alt.binding_select(
    options=sorted(timeline_df["ptid"].unique()),
    name="Select ptid: "
)

ptid_selection = alt.selection_point(
    fields=["ptid"],
    bind=ptid_dropdown,
    value=sorted(timeline_df["ptid"].unique())[0]
)



# --- STEP 3: BUILD ALTAIR CHART ---
chart = (
    alt.Chart(timeline_df)
    .mark_point(size=120)
    .encode(
        x=alt.X(
            "date_label:N",
            title="Date (YYYY-MM-DD)",
            sort=alt.SortField(field="event_date", order="ascending"),
            axis=alt.Axis(
                labelAngle=-45,
                labelFontSize=14,
                titleFontSize=18
            ),
        ),
        y=alt.value(50),
        color=alt.Color(
            "event_type:N",
            scale=alt.Scale(
                domain=["RSV A pos.", "RSV B pos.", "Aliquot"],
                range=["red", "blue", "black"]
            ),
            legend=alt.Legend(
                title="Event Type",
                labelFontSize=14,
                titleFontSize=18
                )
        ),
        shape=alt.Shape(
            "event_type:N",
            scale=alt.Scale(
                domain=["RSV A pos.", "RSV B pos.", "Aliquot"],
                range=["circle", "square", "triangle"]
            )
        ),
        tooltip=["ptid", "event_type", "event_date"]
    )
    .add_params(ptid_selection)
    .transform_filter(ptid_selection)
    .properties(width=800, height=100)
)

chart