## Cascadia RSV A/B swab dates and serum collection date plots
**The goal of this script is to create a single plot for each patient that contains:**
- a timeline of the dates on which they tested POSITIVE for RSV A or RSV B (see computational_notebooks/gjuviler/rsv_imprinting/01-data/Imprinting_Sera2; column names: 'rsv_a' and 'rsv_b', where a 1 denotes a positive swab)
- a timeline of the collection dates for the serum samples that WE HAVE RIGHT NOW (see computational_notebooks/gjuviler/rsv_imprinting/Bloom_Simonich_CASCADIA_Oct2025_Samples.xlsx)
- the date and outcome of pre-F binding antibody tests (see computational_notebooks/gjuviler/rsv_imprinting/Imprinting_Sera2 - column name: 'ar_rsv_pre_f')
    - open question: should this be shown as the numeric value, or converted to a simple yes/no for whether binding occurred?
    - on Teagan's plots this is currently shown on a timeline of patient visits (redcap repeat instance), but we would rather use a date
- the date and outcome of neutralization titer assays (see computational_notebooks/gjuviler/rsv_imprinting/Imprinting_Sera2 - column name: 'ar_rsva_nd50')
    - again, this is currently on a patient visit timeline, but we want the date
    - neutralization data exist only for RSV A
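
One way to feed all four items into a single per-patient plot is to stack them into one long-format event table first. The sketch below is only an illustration on made-up rows: the helper `build_events`, the `date` and `event` and `value` column names, and the assumption that both tables already carry a usable date column are hypothetical and not taken from the real files.

```python
import pandas as pd

def build_events(swabs, sera):
    """Stack positive swabs and serum assay results into one long table (hypothetical helper)."""
    # One row per swab result, keeping only positives (1 denotes a positive swab).
    swab_long = swabs.melt(id_vars=["ptid", "date"], value_vars=["rsv_a", "rsv_b"],
                           var_name="event", value_name="value")
    swab_long = swab_long[swab_long["value"] == 1]
    # One row per assay measurement, dropping visits where the assay was not run.
    sera_long = sera.melt(id_vars=["ptid", "date"],
                          value_vars=["ar_rsv_pre_f", "ar_rsva_nd50"],
                          var_name="event", value_name="value").dropna(subset=["value"])
    events = pd.concat([swab_long, sera_long], ignore_index=True)
    events["date"] = pd.to_datetime(events["date"])
    return events.sort_values(["ptid", "date"]).reset_index(drop=True)

# Made-up rows for illustration only; the real column layout may differ.
swabs = pd.DataFrame({
    "ptid": [1, 1, 2],
    "date": ["2023-01-10", "2023-11-02", "2023-02-01"],
    "rsv_a": [1, 0, 0],
    "rsv_b": [0, 1, 0],
})
sera = pd.DataFrame({
    "ptid": [1, 1],
    "date": ["2023-03-01", "2023-12-15"],
    "ar_rsv_pre_f": [2.5, None],
    "ar_rsva_nd50": [None, 422.0],
})
events = build_events(swabs, sera)
print(events)
```

With everything in one table, each patient's plot can be drawn from a single filter on `ptid`, and a yes/no version of the binding result could be added later as an extra column without changing the layout.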

**Previous work**
- Teagan has two notebooks in her computational notebook directory (see computational_notebooks/tmcmahon/2025/RSV_imprinting/02_notebooks) that create plots
- These notebooks are very long, and it is hard to tell exactly what each step does. Their outputs are in the output folder and are fairly useful, but the changes listed above still need to be made.


### Import necessary components

In [1]:
import math
import os
import altair as alt
import numpy as np
import pandas as pd
from pathlib import Path
os.chdir('../../fh/fast/bloom_j/computational_notebooks/gjuviler/rsv_imprinting')
print(os.getcwd())

/fh/fast/bloom_j/computational_notebooks/gjuviler/rsv_imprinting


### Read the current data

In [2]:


selected_child = pd.read_csv('01-data/Bloom_Simonich_CASCADIA_Oct2025_Samples_children.csv')
selected_adult = pd.read_csv('01-data/Bloom_Simonich_CASCADIA_Oct2025_Samples_adult.csv')

sera = pd.read_csv('01-data/Imprinting_Sera2.csv')       # sera data
swab = pd.read_csv('01-data/Imprinting_swab2.csv')       # swab data
sera_swab = '01-data/Imprinting_sera_swab'      # output path for the new combined dataset (a path string, not a dataframe); the necessary information will be collected here

### First, let's create a new dataframe of serum that contains only the patient ids that we have on hand:

In [3]:
df_aliquots_child = selected_child[['household_id', 'ptid', 'aliquot_id', 'collect_dt']]    # new df with just these 4 columns

# list of patient ids (as ints) that we have on hand, used to select the matching sera rows
selected_ptids = []
for ptid in df_aliquots_child['ptid']:
    if ptid != 'EMPTY':     # some rows have EMPTY where the ptid should be
        selected_ptids.append(int(ptid))


sera_filtered = sera[sera['ptid'].isin(selected_ptids)]
sera_filtered

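| Unnamed: 0 | ptid | household_id | study_region | calc_age_years | rsv_a_outcome | rsv_b_outcome | rsv_both | rsv_a_only | rsv_b_only | aliquot_id | ... | ar_b11_529_nd50 | ar_b11_529_nd80 | ar_neut_b11_529_result | ar_xbb_nd50 | ar_xbb_nd80 | ar_neut_xbb_result | rsva_neut_flag | ar_rsva_nd50 | ar_rsva_nd80 | ar_rsva_result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 303 | 20000401 | 40 | 2 | 0 | 1 | 1 | 1 | 0 | 0 | 3704633g | ... |  |  |  |  |  |  | 0.0 |  |  |  |
| 304 | 20000401 | 40 | 2 | 0 | 1 | 1 | 1 | 0 | 0 | 0027014g | ... | 4137.0 | 1642.0 | POS |  |  |  | 0.0 |  |  |  |
| 305 | 20000401 | 40 | 2 | 0 | 1 | 1 | 1 | 0 | 0 | 8189674g | ... |  |  |  |  |  |  | 0.0 |  |  |  |
| 306 | 20000401 | 40 | 2 | 0 | 1 | 1 | 1 | 0 | 0 | 0027014g | ... |  |  |  |  |  |  | 1.0 | 422.0 | 93.0 | POS |
| 307 | 20000401 | 40 | 2 | 0 | 1 | 1 | 1 | 0 | 0 | 1665540g | ... |  |  |  |  |  |  | 1.0 | 886.0 | 262.0 | POS |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 637 | 20074272 | 7427 | 2 | 0 | 0 | 1 | 0 | 0 | 1 | 4210123g | ... |  |  |  |  |  |  | 0.0 |  |  |  |
| 638 | 20074272 | 7427 | 2 | 0 | 0 | 1 | 0 | 0 | 1 | 7047035g | ... |  |  |  |  |  |  | 1.0 | 465.0 | 73.0 | POS |
| 639 | 20074272 | 7427 | 2 | 0 | 0 | 1 | 0 | 0 | 1 | 3883693g | ... |  |  |  | 594.0 | 236.0 | POS | 0.0 |  |  |  |
| 640 | 20074272 | 7427 | 2 | 0 | 0 | 1 | 0 | 0 | 1 | 3883693g | ... |  |  |  |  |  |  | 1.0 | 1474.0 | 526.0 | POS |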
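
The patient-id filtering in this cell could also be done without an explicit loop. The sketch below is a possible vectorized alternative on small stand-in dataframes (`aliquots` and `sera_demo` are made up for illustration). It assumes that any non-numeric `ptid` entry, such as `EMPTY` or a missing value, should simply be dropped.

```python
import pandas as pd

# Made-up stand-ins for the aliquot sheet and the sera table.
aliquots = pd.DataFrame({"ptid": ["20000401", "EMPTY", "20074272", None]})
sera_demo = pd.DataFrame({"ptid": [20000401, 20000401, 12345678, 20074272]})

# Coerce non-numeric entries such as 'EMPTY' (and missing values) to NaN, then drop them.
ptids = pd.to_numeric(aliquots["ptid"], errors="coerce").dropna().astype(int)

# Keep only the sera rows whose ptid appears in the cleaned list.
sera_demo_filtered = sera_demo[sera_demo["ptid"].isin(ptids)]
print(sera_demo_filtered)
```

Unlike `int(ptid)` inside a loop, `pd.to_numeric(..., errors="coerce")` does not raise on unexpected text or empty cells, which may or may not be the desired behavior if silent drops would hide data-entry problems.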


### Next, we need a way to add our selected aliquot ids into this filtered dataset
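
One possible approach, sketched here on made-up rows, is a left merge on `ptid` and `aliquot_id` that flags which sera rows correspond to an aliquot we hold and brings in its `collect_dt`. This assumes the `aliquot_id` values in the sample sheet use the same format as those in the sera table, which has not been checked here. The `on_hand` column name, the `sera_demo` and `aliquots_demo` frames, and the dates are hypothetical.

```python
import pandas as pd

# Made-up stand-in for the filtered sera table.
sera_demo = pd.DataFrame({
    "ptid": [20000401, 20000401, 20074272],
    "aliquot_id": ["3704633g", "0027014g", "4210123g"],
})
# Made-up stand-in for the sample sheet of aliquots on hand (dates are invented).
aliquots_demo = pd.DataFrame({
    "ptid": [20000401, 20074272],
    "aliquot_id": ["0027014g", "9999999x"],
    "collect_dt": ["2023-01-15", "2023-02-20"],
})

# Left merge keeps every sera row; the indicator column records whether a match was found.
merged = sera_demo.merge(aliquots_demo, on=["ptid", "aliquot_id"],
                         how="left", indicator="on_hand")
# Turn the indicator into a plain boolean: True only where the aliquot is on hand.
merged["on_hand"] = merged["on_hand"] == "both"
print(merged)
```

A left merge keeps sera rows that have no matching aliquot, so unmatched samples stay visible (with an empty `collect_dt`) rather than disappearing from the table.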