## There are three questions to address:

1. For each of the 32 communities in the data and for each of 4 time points, report the proportion of patients who have an unsuppressed viral load. This quantity is defined below.

* `unsupp_t` is a patient-level indicator of unsuppressed viral load: it equals 1 if the patient is HIV positive at time t and `VL_t > 500`.
* Output 1: unsupp.csv - a csv file with 32 rows and 5 columns:
`community name, prop_unsupp_0, prop_unsupp_1, prop_unsupp_2, prop_unsupp_3`



2. Pretending these were real data, are there any data quality problems in the dataset that the team would need to investigate? What decisions did you make in addressing these problems?
* Output 2: writeup.pdf - Your answers to [2] and [3]. 
3. (bonus question - the ability to answer this question is not a requirement for the position, but an ideal candidate would be able to do this) Suppose we changed our data simulation so that all patients who are HIV positive at time 1, 2, or 3 are immediately treated with antiretroviral therapy (ART). The data generating process is otherwise unchanged (including treatment at time 0). In the resulting data, what would be the total population proportion of patients with an unsuppressed viral load at time 3? Provide a single estimate and a 95% confidence interval. Very briefly describe your methodology.

* Output: The code you used for this challenge.
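
As a rough illustration of how the `unsupp_t` definition in question 1 could be turned into the layout requested for unsupp.csv, here is a minimal sketch on toy data. The frame `patients` and its column names `community` and `t` are hypothetical and do not come from the challenge files. The denominator (every patient row in the community at time t) is also an assumption, since the task statement does not spell it out.

```python
import pandas as pd

# Hypothetical toy data: one row per patient per time point.
patients = pd.DataFrame({
    "community": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "t":         [0, 0, 1, 1, 0, 0, 1, 1],
    "HIV":       [1, 0, 1, 1, 1, 1, 0, 1],
    "VL":        [900.0, None, 40.0, 501.0, 500.0, 2000.0, None, 100.0],
})

# unsupp_t = 1 if HIV positive at time t and VL_t > 500 (strictly greater).
# A missing VL compares as False, so it is counted as not unsuppressed here.
patients["unsupp"] = ((patients["HIV"] == 1) & (patients["VL"] > 500)).astype(int)

# Assumption: the denominator is every patient row in the community at time t.
unsupp = (
    patients.groupby(["community", "t"])["unsupp"].mean()
    .unstack("t")
    .add_prefix("prop_unsupp_")
    .reset_index()
)
unsupp.columns.name = None
print(unsupp)
```

With the real data the same reshaping would give one row per community and one `prop_unsupp_t` column per time point, and `unsupp.to_csv("unsupp.csv", index=False)` would write the file.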

In [1]:
import pandas as pd
import numpy as np

In [2]:
Bugono0 = pd.read_csv("Bugono_0.csv")
Bugono0

Unnamed: 0,searchid,HIV,ART,chcdate,trdate,braceletid,age,male
0,458837,0,0,2014-01-17,,2419425546,16,0
1,812797,0,0,2014-01-09,,2446123435,6,0
2,770596,0,0,2014-01-17,,2373811392,36,0
3,876332,0,0,2014-01-18,,2361281601,3,1
4,530216,0,0,2014-01-18,,2435287694,33,0
...,...,...,...,...,...,...,...,...
6426,393553,0,0,2014-01-19,,2402768221,8,0
6427,303589,0,0,,2014-03-23,2436234783,37,0
6428,290013,0,0,2014-01-13,,2382884601,11,1
6429,778437,0,0,2014-01-12,,2448140368,2,0


In [3]:
Bugono0N = Bugono0["searchid"].unique().shape
Bugono0N

(6431,)

In [4]:
ViralLoads= pd.read_csv("ViralLoads.csv")
ViralLoads

Unnamed: 0,braceletid,VL,date
0,2471071857,40,2013-11-28
1,2541201209,40,2013-11-29
2,2535113633,314,2013-11-29
3,2388517000,116,2013-12-01
4,2361177391,249,2013-12-02
...,...,...,...
176417,2377024394,343,2016-12-27
176418,2435753313,129,2016-12-28
176419,2534064706,714,2016-12-30
176420,2525472748,60,2017-01-01


In [5]:
Bugono0VL = pd.merge(Bugono0, ViralLoads, how='outer', on='braceletid')

In [6]:
Bugono0VL

Unnamed: 0,searchid,HIV,ART,chcdate,trdate,braceletid,age,male,VL,date
0,458837.0,0.0,0.0,2014-01-17,,2419425546,16.0,0.0,,
1,812797.0,0.0,0.0,2014-01-09,,2446123435,6.0,0.0,,
2,770596.0,0.0,0.0,2014-01-17,,2373811392,36.0,0.0,,
3,876332.0,0.0,0.0,2014-01-18,,2361281601,3.0,1.0,,
4,530216.0,0.0,0.0,2014-01-18,,2435287694,33.0,0.0,,
...,...,...,...,...,...,...,...,...,...,...
182233,,,,,,2384551382,,,312.0,2016-12-12
182234,,,,,,2319695678,,,40.0,2016-12-13
182235,,,,,,2458155417,,,835.0,2016-12-14
182236,,,,,,2458843052,,,40.0,2016-12-14
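
The outer merge above keeps viral loads whose `braceletid` does not appear in this community's file, which is why the tail of the table has empty `searchid` values. A possible alternative, sketched here on made-up toy frames (`census` and `viral_loads` are illustrative names), is a left merge, which keeps only the community's patients and leaves `searchid` as an integer column.

```python
import pandas as pd

# Toy stand-ins for the community and viral-load tables (values are made up).
census = pd.DataFrame({"searchid": [1, 2, 3], "braceletid": [10, 20, 30]})
viral_loads = pd.DataFrame({
    "braceletid": [10, 10, 40],
    "VL": [40, 900, 120],
})

# how="left" keeps every census row and drops viral loads with no census match;
# indicator=True adds a _merge column showing where each row came from.
merged = pd.merge(census, viral_loads, how="left", on="braceletid", indicator=True)
print(merged)
print(merged["_merge"].value_counts())
```

The `_merge` column makes it easy to count how many patients have no viral load at all (`left_only`) before deciding how to treat them.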


In [7]:
#remove null searchid

Bugono0VL = Bugono0VL[Bugono0VL["searchid"].notnull()]

#Show data type
Bugono0VL.info()
Bugono0VL.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6498 entries, 0 to 6497
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   searchid    6498 non-null   float64
 1   HIV         6498 non-null   float64
 2   ART         6498 non-null   float64
 3   chcdate     5857 non-null   object 
 4   trdate      642 non-null    object 
 5   braceletid  6498 non-null   int64  
 6   age         6498 non-null   float64
 7   male        6498 non-null   float64
 8   VL          682 non-null    float64
 9   date        682 non-null    object 
dtypes: float64(6), int64(1), object(3)
memory usage: 558.4+ KB


Unnamed: 0,searchid,HIV,ART,chcdate,trdate,braceletid,age,male,VL,date
0,458837.0,0.0,0.0,2014-01-17,,2419425546,16.0,0.0,,
1,812797.0,0.0,0.0,2014-01-09,,2446123435,6.0,0.0,,
2,770596.0,0.0,0.0,2014-01-17,,2373811392,36.0,0.0,,
3,876332.0,0.0,0.0,2014-01-18,,2361281601,3.0,1.0,,
4,530216.0,0.0,0.0,2014-01-18,,2435287694,33.0,0.0,,


In [8]:
#convert dates to datetime format
Bugono0VL['chcdate'] = pd.to_datetime(Bugono0VL['chcdate'], infer_datetime_format=True)
Bugono0VL['date'] = pd.to_datetime(Bugono0VL['date'], infer_datetime_format=True)
Bugono0VL['trdate'] = pd.to_datetime(Bugono0VL['trdate'], infer_datetime_format=True)
Bugono0VL.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6498 entries, 0 to 6497
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   searchid    6498 non-null   float64       
 1   HIV         6498 non-null   float64       
 2   ART         6498 non-null   float64       
 3   chcdate     5857 non-null   datetime64[ns]
 4   trdate      642 non-null    datetime64[ns]
 5   braceletid  6498 non-null   int64         
 6   age         6498 non-null   float64       
 7   male        6498 non-null   float64       
 8   VL          682 non-null    float64       
 9   date        682 non-null    datetime64[ns]
dtypes: datetime64[ns](3), float64(6), int64(1)
memory usage: 558.4 KB


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Bugono0VL['chcdate'] = pd.to_datetime(Bugono0VL['chcdate'], infer_datetime_format=True)
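
The `SettingWithCopyWarning` shown above most likely appears because `Bugono0VL` was created by filtering another frame and then had columns assigned to it. One common way to avoid it, sketched here on a toy frame with made-up values, is to take an explicit `.copy()` right after filtering.

```python
import pandas as pd

df = pd.DataFrame({
    "searchid": [1.0, None, 3.0],
    "chcdate": ["2014-01-17", "2014-01-09", None],
})

# .copy() makes the filtered frame independent of df, so later column
# assignments do not trigger SettingWithCopyWarning.
kept = df[df["searchid"].notnull()].copy()
kept["chcdate"] = pd.to_datetime(kept["chcdate"])
print(kept.dtypes)
```

The original frame `df` is left untouched, which also makes it easier to rerun individual notebook cells without side effects.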

## Initial Data Exploration and Cleaning
We'll start by exploring the data to find obvious problems that need cleaning.

In [9]:
Bugono0VL.describe(include='all')



Unnamed: 0,searchid,HIV,ART,chcdate,trdate,braceletid,age,male,VL,date
count,6498.0,6498.0,6498.0,5857,642,6498.0,6498.0,6498.0,682.0,682
unique,,,,38,27,,,,,81
top,,,,2014-01-16 00:00:00,2014-03-19 00:00:00,,,,,2014-01-16 00:00:00
freq,,,,476,57,,,,,44
first,,,,1899-01-01 00:00:00,2014-03-06 00:00:00,,,,,2013-12-30 00:00:00
last,,,,2014-02-03 00:00:00,2014-04-03 00:00:00,,,,,2016-08-30 00:00:00
mean,552502.492151,0.104494,0.030779,,,2432930000.0,20.15374,0.487073,2042.140762,
std,260869.553212,0.305923,0.172731,,,66980110.0,11.808196,0.499871,7502.873028,
min,100066.0,0.0,0.0,,,2316316000.0,0.0,0.0,40.0,
25%,326647.5,0.0,0.0,,,2374537000.0,10.0,0.0,72.25,


## Keep the viral loads with the date closest to chcstart_t

* I could not work out how to do this step in Python despite spending several hours on it, as I am still learning the language. My attempt is below.

In [10]:
# Identify all rows for patients who have more than one VL
# z = Bugono0VL['searchid'].value_counts() 
# z1 = z.to_dict() #converts to dictionary
# Bugono0VL['Count_Patient_ID'] = Bugono0VL['searchid'].map(z1) 

# Simplified version of the code above: map() accepts a Series, so no dict is needed
Bugono0VL['Count_Patient_ID'] = Bugono0VL['searchid'].map(Bugono0VL['searchid'].value_counts())
print(Bugono0VL['Count_Patient_ID'].describe(include="all"))
print(Bugono0VL["Count_Patient_ID"].value_counts())

count    6498.000000
mean        1.020622
std         0.142125
min         1.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         2.000000
Name: Count_Patient_ID, dtype: float64
1    6364
2     134
Name: Count_Patient_ID, dtype: int64



In [11]:
Bugono0VL.to_csv("Bugono0VL_cleaned.csv", index=False)

## Problems with the data
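* Some viral load dates are the same as chcdate, so where chcdate is missing I will fill it with the viral load date.
* Some viral load dates do not match chcdate, so I will sort by date in ascending order and keep the first row.
* Some chcdate values date back to 1899 (the earliest is 1899-01-01), but I am leaving them unchanged.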
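
A minimal sketch of how these checks could be made explicit, using a toy frame with made-up dates. The name `visits`, the added columns, and the 2013 cutoff are assumptions chosen for illustration; the cutoff in particular would need to be agreed with the study team.

```python
import pandas as pd

visits = pd.DataFrame({
    "chcdate": pd.to_datetime(["2014-01-17", "1899-01-01", None, "2014-02-03"]),
    "date":    pd.to_datetime(["2014-01-17", "2014-01-16", "2014-01-20", None]),
})

# Hypothetical cutoff: anything before 2013 is treated as a placeholder date.
cutoff = pd.Timestamp("2013-01-01")
visits["chcdate_suspect"] = visits["chcdate"] < cutoff

# fillna returns a new Series, so the result has to be assigned back.
visits["chcdate_filled"] = visits["chcdate"].fillna(visits["date"])
print(visits)
```

Flagging suspect dates in a separate column, rather than overwriting them, keeps the raw values available for the team to investigate.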

In [12]:
Bugono0VL['chcdate'] = Bugono0VL['chcdate'].fillna(Bugono0VL['date'])
Bugono0VL['chcdate']

0      2014-01-17
1      2014-01-09
2      2014-01-17
3      2014-01-18
4      2014-01-18
          ...    
6493   2014-01-19
6494          NaT
6495   2014-01-13
6496   2014-01-12
6497   2014-01-18
Name: chcdate, Length: 6498, dtype: datetime64[ns]

In [13]:
# For patients with more than one VL, select the VL whose date is closest to chcdate

# Abandoned attempts, left commented out.
# This function would return the datetime in items that is closest to the date pivot.
# def nearest(items, pivot):
#     return min(items, key=lambda x: abs(x - pivot))

# def nearest(date, chcdate):
#     return min(date, key=lambda x: abs(x - chcdate))
# print(nearest(Bugono0VL['date'], Bugono0VL['chcdate']))

# from datetime import datetime as dt
# copy = Bugono0VL.copy()
# copy = copy[~(copy['Count_Patient_ID'] == 2)]
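
One way the "closest date" step might be done is sketched below on toy data; it is not the method used in this notebook. The frame `vl` and the helper column `gap` are illustrative names, and ties between equally close dates are broken arbitrarily by sort order.

```python
import pandas as pd

vl = pd.DataFrame({
    "searchid": [1, 1, 2, 3],
    "chcdate": pd.to_datetime(["2014-01-17", "2014-01-17", "2014-01-09", "2014-01-18"]),
    "date":    pd.to_datetime(["2014-01-16", "2016-08-30", "2014-01-09", None]),
    "VL":      [900.0, 40.0, 120.0, None],
})

# Absolute gap between each viral load date and the patient's chcdate.
vl["gap"] = (vl["date"] - vl["chcdate"]).abs()

# Sort so the smallest gap comes first within each patient; rows with a
# missing gap (no viral load) sort last, so they survive only when the
# patient has no dated viral load at all.
closest = (
    vl.sort_values(["searchid", "gap"], na_position="last")
      .drop_duplicates(subset="searchid", keep="first")
      .reset_index(drop=True)
)
print(closest)
```

Applied to the merged frame, this would leave one row per `searchid`, so the unsuppressed flag computed afterwards would count each patient once.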

## Find proportion with unsuppressed VL

In [14]:
#Create an unsuppressed VL column
#unsupp_t = patient level - unsuppressed viral load - 1 if HIV positive at time t and VL_t >  500
filter_unsupp_0 = (Bugono0VL["HIV"] == 1) & (Bugono0VL["VL"] > 500.0)
filter_unsupp_0.value_counts()

False    6229
True      269
dtype: int64

In [15]:
Bugono0VL["unsupp_0"] = filter_unsupp_0
Bugono0VL


Unnamed: 0,searchid,HIV,ART,chcdate,trdate,braceletid,age,male,VL,date,Count_Patient_ID,unsupp_0
0,458837.0,0.0,0.0,2014-01-17,NaT,2419425546,16.0,0.0,,NaT,1,False
1,812797.0,0.0,0.0,2014-01-09,NaT,2446123435,6.0,0.0,,NaT,1,False
2,770596.0,0.0,0.0,2014-01-17,NaT,2373811392,36.0,0.0,,NaT,1,False
3,876332.0,0.0,0.0,2014-01-18,NaT,2361281601,3.0,1.0,,NaT,1,False
4,530216.0,0.0,0.0,2014-01-18,NaT,2435287694,33.0,0.0,,NaT,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
6493,393553.0,0.0,0.0,2014-01-19,NaT,2402768221,8.0,0.0,,NaT,1,False
6494,303589.0,0.0,0.0,NaT,2014-03-23,2436234783,37.0,0.0,,NaT,1,False
6495,290013.0,0.0,0.0,2014-01-13,NaT,2382884601,11.0,1.0,,NaT,1,False
6496,778437.0,0.0,0.0,2014-01-12,NaT,2448140368,2.0,0.0,,NaT,1,False


## Assuming I knew how to keep the viral loads with the date closest to chcstart_t, below is how I would calculate the proportion of unsuppressed VL for this CHC

In [16]:
# Bugono0N is the 1-tuple (6431,) from .shape, so this division yields a one-element array
prop = Bugono0VL['unsupp_0'].values.sum() / Bugono0N

In [17]:
prop

array([0.04182864])

### Double check the work

In [18]:
Bugono0VL['unsupp_0'].value_counts()

False    6229
True      269
Name: unsupp_0, dtype: int64

In [19]:
Bugono0VL['unsupp_0'].values.sum()

269
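
A further check worth considering: the 269 above counts rows, and patients with two viral loads appear twice. The sketch below, on toy data with made-up values, shows one way to get a patient-level proportion with a scalar denominator. Using `any()` per patient is an assumption made for illustration; after keeping only the closest viral load, each patient would have a single row and this step would be unnecessary.

```python
import pandas as pd

rows = pd.DataFrame({
    "searchid": [1, 1, 2, 3, 4],
    "unsupp_0": [True, True, False, True, False],
})

# nunique() returns a plain integer, unlike .unique().shape, which is a tuple.
n_patients = rows["searchid"].nunique()

# Collapse to one flag per patient so duplicated rows are not double counted.
per_patient = rows.groupby("searchid")["unsupp_0"].any()
prop = per_patient.sum() / n_patients
print(n_patients, prop)
```

In this toy example the row-level count is 3 but only 2 distinct patients are unsuppressed, which is the kind of discrepancy the duplicated rows can introduce.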