# Analyzing Dengue Transmission by the Mosquito Species Aedes aegypti

This project analyzes how much transmission from the mosquito species Aedes aegypti has occurred in a small town, and the current state of exposed individuals, as of January 1, 2020. For each community member tested, we have the following information:

- birthdate: patient's date of birth
- exposed_date: date the patient first noticed mosquito bites
- test_results: 'Pos' = positive, 'Neg' = negative, or 'TBD' = to be determined


With this information, we will classify each patient into one of the transmission categories below and compare the size of each category across age groups. The transmission categories are:

- 'Susceptible': does not have the virus but is able to get it
- 'Exposed': has been bitten by infectious mosquitoes and might have the virus
- 'Infectious': has the virus and is currently able to transmit to new mosquitoes
- 'Recovered': had the virus and is no longer infectious or susceptible
- 'Unknown': not enough information


Here's a summary of the steps we will take:

- First, we will use each patient's birthdate to calculate their age and place them into age groups.
- Second, we will use exposed_date and test_results to place patients into one of the five transmission categories.
- Finally, we will examine counts within each transmission category by age group using a pivot table.

# Step 1. Opening File & Observing Data

In [4]:
import pandas as pd

test_data = pd.read_excel('/datasets/test_results.xlsx')

test_data.info() 
#printing the general info for our dataset

print (test_data.head()) 
#printing the first 5 rows

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 786 entries, 0 to 785
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   birthdate     786 non-null    datetime64[ns]
 1   exposed_date  786 non-null    datetime64[ns]
 2   test_results  786 non-null    object        
dtypes: datetime64[ns](2), object(1)
memory usage: 18.5+ KB
   birthdate exposed_date test_results
0 1992-01-08   2019-12-24          Neg
1 1972-01-13   2019-12-15          Neg
2 1981-01-10   2019-12-21          Neg
3 1962-01-15   2019-12-15          Pos
4 1962-01-15   2019-12-04          Pos


# Step 2. Checking for Missing Values

In [10]:
print(test_data.isna().sum())
#confirming we have no missing values

birthdate       0
exposed_date    0
test_results    0
age             0
dtype: int64


The data types are appropriate and there are no missing values. We would normally check for duplicates, but there is no patient ID that uniquely identifies patients, so we cannot conclude that duplicate rows belong to the same patient: two people could share the exact same values in all three columns. We will therefore assume that each row belongs to a unique patient.

To recap, we read the data and used info() and head() to get a first look at it. We found no missing values and decided that duplicates could not be identified due to a lack of identifying information (no patient ID, Social Security number, driver's license, etc.).
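
Although duplicates cannot be attributed to the same patient, it can still be useful to know how many fully identical rows exist. The following is a minimal sketch on a small hypothetical sample (the `sample` frame is made up for illustration and is not the real dataset); it uses the pandas `duplicated()` method, which flags a row when all of its values match an earlier row.

```python
import pandas as pd

# Hypothetical sample: the third row repeats the first one exactly
sample = pd.DataFrame({
    'birthdate': pd.to_datetime(['1962-01-15', '1962-01-15', '1962-01-15']),
    'exposed_date': pd.to_datetime(['2019-12-15', '2019-12-04', '2019-12-15']),
    'test_results': ['Pos', 'Pos', 'Pos'],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(sample.duplicated().sum())
print(n_duplicates)
```

A non-zero count would not prove that a patient was entered twice, but it would show how many rows are affected by the uniqueness assumption.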

The next step is to create age groups, for which we first need an age column. For a crude estimate of age, let's take the year of the date given (January 1, 2020) and subtract each patient's birth year. If a negative age is returned, perhaps due to a data entry error, we will return a NaN value.

# Step 3. Categorizing Data & Using apply()

In [7]:
bdate1 = pd.to_datetime('1986-08-30',format='%Y-%m-%d')
bdate2 = pd.to_datetime('2000-04-01',format='%Y-%m-%d')
bdate3 = pd.to_datetime('2025-04-01',format='%Y-%m-%d')

today = pd.to_datetime('2020-01-01',format='%Y-%m-%d') 
#setting this equal to January 1, 2020 as a datetime variable

def calc_age(birthdate):
    #crude age: difference between the current year and the birth year
    age = today.year - birthdate.year
    if age < 0:
        #a negative age suggests a data entry error
        return float('nan')
    else:
        return age

print(calc_age(bdate1))
#testing that our function on bdate1 works

print(calc_age(bdate2))
#testing that our function on bdate2 works

print(calc_age(bdate3))
#testing that our function on bdate3 works

34
20
nan
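
The year-difference estimate above counts a birthday that has not yet happened in the current year. As a sketch of a stricter alternative (the helper name `calc_exact_age` is hypothetical and is not used elsewhere in this notebook), one year can be subtracted when the birthday falls later in the year than the reference date.

```python
import pandas as pd

today = pd.to_datetime('2020-01-01', format='%Y-%m-%d')

def calc_exact_age(birthdate, today=today):
    # Hypothetical helper: start from the crude year difference
    age = today.year - birthdate.year
    # Subtract one year if the birthday has not occurred yet this year
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        age -= 1
    # A negative age suggests a data entry error
    if age < 0:
        return float('nan')
    return age

exact = calc_exact_age(pd.to_datetime('1986-08-30', format='%Y-%m-%d'))
print(exact)
```

Because the reference date is January 1, this version lowers the age of almost every patient by one year compared with the crude estimate, which matters mainly for patients near an age-group boundary.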


Now that we are able to calculate age, let's add an age column to our dataset.

In [11]:
def calc_age(birthdate):
    age = today.year - birthdate.year
    if age < 0:
        return float('nan')
    else:
        return age


test_data['age'] = test_data['birthdate'].apply(calc_age)
#applying our function

print (test_data.head(10)) 
#printing the first 10 rows

   birthdate exposed_date test_results  age
0 1992-01-08   2019-12-24          Neg   28
1 1972-01-13   2019-12-15          Neg   48
2 1981-01-10   2019-12-21          Neg   39
3 1962-01-15   2019-12-15          Pos   58
4 1962-01-15   2019-12-04          Pos   58
5 1989-01-08   2019-12-11          Neg   31
6 1989-01-08   2019-12-30          TBD   31
7 1960-01-16   2019-12-24          Pos   60
8 1942-01-20   2019-12-01          TBD   78
9 1953-01-17   2019-12-09          Neg   67
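
The same column can also be built without apply(). The sketch below runs on a small hypothetical sample (the `sample` frame and its third, future birthdate are made up for illustration) and uses the `.dt.year` accessor together with `where()`, which replaces values that fail a condition with NaN.

```python
import pandas as pd

today = pd.to_datetime('2020-01-01', format='%Y-%m-%d')

# Hypothetical sample; the last birthdate is in the future on purpose
sample = pd.DataFrame({
    'birthdate': pd.to_datetime(['1992-01-08', '1972-01-13', '2025-04-01'])
})

# Subtract the birth year for the whole column at once
age = today.year - sample['birthdate'].dt.year

# where() keeps values that meet the condition and sets the rest to NaN
sample['age'] = age.where(age >= 0)
print(sample)
```

One side effect to be aware of: once a NaN is present, the column is stored as float rather than int.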


Examining the distribution of age will tell us how best to construct our age groups.

In [17]:
print(test_data['age'].value_counts().sort_index()) 
#printing the value counts of age sorted

15     8
16     9
17     8
18    10
19    12
      ..
84     3
85     3
86     3
88     2
90     2
Name: age, Length: 74, dtype: int64


Ages range from 15 to 90, with more patients in the lower tail than the upper tail. This will be useful for defining our age groups.

We will therefore group people by decade (teens, 20s, 30s, etc.). With age data, it is common to place older ages in a single open-ended category such as 60+ or 70+, and we will use 70+. Smaller groups provide more detail, but more detail makes the data harder to interpret, so 10-year age groups are a reasonable compromise. If age is negative or NaN, the function will return the string 'NA' (not applicable).

In [18]:
def assign_age_group(age):
    
    if age < 0 or pd.isna(age):
        return 'NA'
    
    elif age < 10:
        return '0-9'
        
    elif age < 20:
        return '10-19'
        
    elif age < 30:
        return '20-29'

    elif age < 40:
        return '30-39'
        
    elif age < 50:
        return '40-49'
        
    elif age < 60:
        return '50-59'
        
    elif age < 70:
        return '60-69'
        
    else: 
        return '70+'
        

test_data['age_group'] = test_data['age'].apply(assign_age_group) 
#applying function assign_age_group    

print (test_data.head(10)) 
#printing the first 10 rows

   birthdate exposed_date test_results  age age_group
0 1992-01-08   2019-12-24          Neg   28     20-29
1 1972-01-13   2019-12-15          Neg   48     40-49
2 1981-01-10   2019-12-21          Neg   39     30-39
3 1962-01-15   2019-12-15          Pos   58     50-59
4 1962-01-15   2019-12-04          Pos   58     50-59
5 1989-01-08   2019-12-11          Neg   31     30-39
6 1989-01-08   2019-12-30          TBD   31     30-39
7 1960-01-16   2019-12-24          Pos   60     60-69
8 1942-01-20   2019-12-01          TBD   78       70+
9 1953-01-17   2019-12-09          Neg   67     60-69
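
The decade grouping can also be expressed with `pd.cut`, which bins a numeric column in one call. This is a sketch on a hypothetical series of ages (the `ages`, `bins`, and `labels` names are illustrative), not a replacement for the function above.

```python
import pandas as pd

# Hypothetical ages, including boundary values and a missing value
ages = pd.Series([5, 15, 28, 60, 69, 70, 90, float('nan')])

bins = [0, 10, 20, 30, 40, 50, 60, 70, float('inf')]
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70+']

# right=False makes each bin include its left edge and exclude its right edge,
# so 60 falls in '60-69' and 70 falls in '70+'
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

# Ages outside the bins (negative or missing) come back as NaN; label them 'NA'
groups = groups.cat.add_categories('NA').fillna('NA')
print(groups.tolist())
```

The result is a categorical column, so the groups keep their natural order when sorted or used as a pivot table index.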


An important measurement needed for our categorization is the number of days between each patient's date of exposure and the current date. We will call this new variable days_since_exposed.

In [19]:
today = pd.to_datetime('2020-01-01',format='%Y-%m-%d')

date_diff = today - test_data['exposed_date'][0]
#calculating date_diff

print (date_diff)
#printing date_diff

print (type(date_diff))
#print data type

8 days 00:00:00
<class 'pandas._libs.tslibs.timedeltas.Timedelta'>


We will use the above information to complete a function that accepts a datetime object, exposed, and returns the number of days between exposed and today. If the number of days is less than 0, the function will return a NaN value.

In [20]:
def calc_days_since_exposed(exposed):
    
    days_diff =  today - exposed 
    
    if days_diff.days < 0:
        return float('nan') 
        
    else:
        return days_diff.days 
    
    
test_data['days_since_exposed'] = test_data['exposed_date'].apply (calc_days_since_exposed) 
#applying function to the exposed_date column

print (test_data.head(10)) 
#printing the first 10 rows

   birthdate exposed_date test_results  age age_group  days_since_exposed
0 1992-01-08   2019-12-24          Neg   28     20-29                   8
1 1972-01-13   2019-12-15          Neg   48     40-49                  17
2 1981-01-10   2019-12-21          Neg   39     30-39                  11
3 1962-01-15   2019-12-15          Pos   58     50-59                  17
4 1962-01-15   2019-12-04          Pos   58     50-59                  28
5 1989-01-08   2019-12-11          Neg   31     30-39                  21
6 1989-01-08   2019-12-30          TBD   31     30-39                   2
7 1960-01-16   2019-12-24          Pos   60     60-69                   8
8 1942-01-20   2019-12-01          TBD   78       70+                  31
9 1953-01-17   2019-12-09          Neg   67     60-69                  23
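
Subtracting a whole datetime column from a Timestamp yields a timedelta column, so the same result can be sketched without apply(). The sample below is hypothetical (its last exposure date is deliberately after January 1, 2020) and uses the `.dt.days` accessor.

```python
import pandas as pd

today = pd.to_datetime('2020-01-01', format='%Y-%m-%d')

# Hypothetical sample; the last date is after "today" on purpose
sample = pd.DataFrame({
    'exposed_date': pd.to_datetime(['2019-12-24', '2019-12-15', '2020-01-05'])
})

# Column-wise subtraction gives timedeltas; .dt.days extracts whole days
days = (today - sample['exposed_date']).dt.days

# Negative day counts (exposure after today) become NaN
sample['days_since_exposed'] = days.where(days >= 0)
print(sample)
```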


We need the above information to determine whether individuals who test positive have recovered or are currently infectious. We have been told that the infectious period can last from 6 to 17 days after exposure. For those who test positive, we therefore assume they are still infectious if days since exposure is 17 or less, and recovered if it is greater than 17.

Additionally, we've created a new Excel file that contains all the previously added columns. Now let's assign each patient to one of the five transmission categories.

We will create a function to assign the correct category and apply the function to the dataset. We will use the following rules to assign categories:

- 'Susceptible': if test results are negative
- 'Exposed': if test results are TBD and days since exposed is less than or equal to 17
- 'Infectious': if test results are positive and days since exposed is less than or equal to 17
- 'Recovered': if test results are positive and days since exposed is greater than 17
- 'Unknown': if test results are TBD and days since exposed is greater than 17 (they are either recovered or susceptible)

In [22]:
test_data = pd.read_excel('/datasets/testing_with_age_exp_days.xlsx')

def assign_status(row):
    if row['test_results'] == 'Neg':
        return 'Susceptible'
    
    elif row['test_results'] == 'TBD' and row['days_since_exposed'] <= 17:
        return 'Exposed'
    
    elif row['test_results'] == 'Pos' and row['days_since_exposed'] <= 17:
        return 'Infectious'
    
    elif row['test_results'] == 'Pos' and row['days_since_exposed'] > 17:
        return 'Recovered'
    
    elif row['test_results'] == 'TBD' and row['days_since_exposed'] > 17:
        return 'Unknown'
    
    
test_data['status'] = test_data.apply(assign_status, axis=1) 
#applying the function 

print (test_data.head(10)) 
#printing the first 10 rows

   birthdate exposed_date test_results  age age_group  days_since_exposed  \
0 1992-01-08   2019-12-24          Neg   28     20-29                   8   
1 1972-01-13   2019-12-15          Neg   48     40-49                  17   
2 1981-01-10   2019-12-21          Neg   39     30-39                  11   
3 1962-01-15   2019-12-15          Pos   58     50-59                  17   
4 1962-01-15   2019-12-04          Pos   58     50-59                  28   
5 1989-01-08   2019-12-11          Neg   31     30-39                  21   
6 1989-01-08   2019-12-30          TBD   31     30-39                   2   
7 1960-01-16   2019-12-24          Pos   60     60-69                   8   
8 1942-01-20   2019-12-01          TBD   78       70+                  31   
9 1953-01-17   2019-12-09          Neg   67     60-69                  23   

        status  
0  Susceptible  
1  Susceptible  
2  Susceptible  
3   Infectious  
4    Recovered  
5  Susceptible  
6      Exposed  
7   Infectious  
8      Unknown  
9  Susceptible  
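
The five rules can also be written as a list of conditions and matching labels with `np.select`. This is a sketch on a hypothetical sample; the `'Unassigned'` default is an assumption added here for rows that match no rule (for example, a missing days_since_exposed), whereas the function above returns None in that case.

```python
import numpy as np
import pandas as pd

# Hypothetical sample covering one row per rule
sample = pd.DataFrame({
    'test_results': ['Neg', 'Pos', 'Pos', 'TBD', 'TBD'],
    'days_since_exposed': [8, 17, 28, 2, 31],
})

recent = sample['days_since_exposed'] <= 17
older = sample['days_since_exposed'] > 17

# Conditions are checked in order; the first True one decides the label
conditions = [
    sample['test_results'] == 'Neg',
    (sample['test_results'] == 'TBD') & recent,
    (sample['test_results'] == 'Pos') & recent,
    (sample['test_results'] == 'Pos') & older,
    (sample['test_results'] == 'TBD') & older,
]
choices = ['Susceptible', 'Exposed', 'Infectious', 'Recovered', 'Unknown']

# 'Unassigned' is an assumed fallback label, not one of the five categories
sample['status'] = np.select(conditions, choices, default='Unassigned')
print(sample)
```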

# Step 4. Identifying Correlation

Let's use the age groups we've created and the statuses we've assigned to examine the current state of sickness in the small town's community. We will use a pivot table to look at status counts (columns) by age group (rows). We've saved the final dataset in the file final_dengue_testing_data.xlsx.

In [26]:
test_data = pd.read_excel('/datasets/final_dengue_testing_data.xlsx')

print(pd.pivot_table(test_data, values='age', index='age_group', columns='status', aggfunc='count', margins=True))
#pivot table

status     Exposed  Infectious  Recovered  Susceptible  Unknown  All
age_group                                                           
10-19            3           2          8           22       12   47
20-29           11          14         40           53       26  144
30-39            8          16         31           69       35  159
40-49            7          10         56           65       14  152
50-59           11          10         26           57       12  116
60-69            6           9         18           36        7   76
70+              8           8         23           41       12   92
All             54          69        202          343      118  786
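
Raw counts are hard to compare when age groups differ in size. As a sketch of one way to address this, `pd.crosstab` with `normalize='index'` returns the share of each status within an age group. The sample below is hypothetical and much smaller than the real dataset.

```python
import pandas as pd

# Hypothetical sample of six patients in two age groups
sample = pd.DataFrame({
    'age_group': ['20-29', '20-29', '30-39', '30-39', '30-39', '30-39'],
    'status': ['Infectious', 'Susceptible', 'Infectious',
               'Recovered', 'Susceptible', 'Susceptible'],
})

# Counts with row and column totals, similar to the pivot table above
counts = pd.crosstab(sample['age_group'], sample['status'], margins=True)

# Row-normalized shares: each row sums to 1
shares = pd.crosstab(sample['age_group'], sample['status'], normalize='index')

print(counts)
print(shares)
```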


# Step 5. Conclusion

Many insights can be drawn from our pivot table above, including:

- There are many susceptible and recovered patients across age groups.
- The 30-39 age group has the most infectious patients (16 of 69), though no single age group holds a majority.
- There are many unknown status counts in the 20-29 and 30-39 age groups.
- There are more 40-49-year-old people who have recovered than any other age group.
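
Observations like these can be checked directly against the counts. The sketch below re-enters the pivot table counts by hand (the `counts` frame is typed in from the table above rather than computed from the file) and asks which age group is largest for each status.

```python
import pandas as pd

# Counts copied from the pivot table above, without the 'All' margins
counts = pd.DataFrame(
    {
        'Exposed': [3, 11, 8, 7, 11, 6, 8],
        'Infectious': [2, 14, 16, 10, 10, 9, 8],
        'Recovered': [8, 40, 31, 56, 26, 18, 23],
        'Susceptible': [22, 53, 69, 65, 57, 36, 41],
        'Unknown': [12, 26, 35, 14, 12, 7, 12],
    },
    index=['10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70+'],
)

# Age group with the highest count for each status
largest_group = counts.idxmax()
print(largest_group)

# Share of all infectious patients that falls in each age group
infectious_share = counts['Infectious'] / counts['Infectious'].sum()
print(infectious_share.round(2))
```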

The high number of unknowns should encourage the town to obtain the TBD test results as quickly as possible. The high number of recovered patients across age groups is worrisome because it implies that dengue virus was being transmitted for some time before test results were analyzed. In this short project, we started with three columns and ended with seven. The four columns we created (age, age_group, days_since_exposed, and status) were necessary to examine the data the way we did.