# Project 4: Predict Dengue Cases

**Notebook 1.2 - Contents:**<br>
[Data Inspection](#Data-Inspection)<br>
[Data Cleaning](#Data-Cleaning)<br>
[Concatenate dataframes](#Concatenate-dataframes)<br>
[Export dataframe](#Export-dataframe)


## Data Inspection

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import pickle

### Import datasets (train)

In [2]:
#import datasets from data.gov
dengue_cases = pd.read_csv('../data/data.gov/weekly-infectious-disease-bulletin-cases.csv')

#import datasets from weather.gov
changi_weather = pd.read_csv('../data/weather.gov/changi_weather.csv')

#import datasets from google_trends
dengue_csv = ['../data/google_trends/dengue_google_1.csv',
             '../data/google_trends/dengue_google_2.csv',
             '../data/google_trends/dengue_google_3.csv']
dfs = [pd.read_csv(csv_file, skiprows=2, header=0, index_col=0, sep=',') for csv_file in dengue_csv]
dengue_google = pd.concat(dfs, axis=0)

### Import functions for cleaning

First, we import my personal module of compiled helper functions. We use `compiled_functions.shape_head` and `compiled_functions.data_inspect` to do a preliminary round of inspection of the data.
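The `compiled_functions` module itself is not shown in this notebook. Judging only from the output printed further down, the two helpers might look roughly like the sketch below. The function bodies and the returned dictionary are assumptions made for illustration, not the actual module.

```python
import pandas as pd


def shape_head(df: pd.DataFrame, df_name: str) -> None:
    """Print the shape and first five rows (hypothetical reimplementation)."""
    print(f"{df_name} dataset shape:")
    print(df.shape)
    print(f"\n{df_name} dataset head:")
    print(df.head())


def data_inspect(df: pd.DataFrame, df_name: str) -> dict:
    """Print dtypes, null percentages and duplicate count, and return the last two."""
    print(f"{df_name} dataset inspection")
    print("-" * 50)
    df.info()
    null_pct = df.isnull().mean() * 100
    n_dup = int(df.duplicated().sum())
    print(f"\nCheck for null percentages for {df_name} dataset:")
    print(null_pct)
    print(f"\nCheck for no of duplicated values for {df_name} dataset:")
    print(n_dup)
    return {"null_pct": null_pct, "n_duplicates": n_dup}


# Tiny illustrative frame (made-up values): one null, one repeated row
demo = pd.DataFrame({"a": [1, 1, None], "b": ["x", "x", "y"]})
summary = data_inspect(demo, "demo")
```

Returning the summary as well as printing it makes the checks easy to reuse programmatically, for example to flag any dataset whose duplicate count is non-zero.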

In [3]:
import compiled_functions

In [4]:
compiled_functions.shape_head(dengue_cases, "dengue_cases") 

dengue_cases dataset shape:
(20070, 3)

dengue_cases dataset head:


Unnamed: 0,epi_week,disease,no._of_cases
0,2012-W01,Acute Viral hepatitis B,0
1,2012-W01,Acute Viral hepatitis C,0
2,2012-W01,Avian Influenza,0
3,2012-W01,Campylobacterenterosis,6
4,2012-W01,Chikungunya Fever,0


In [5]:
compiled_functions.shape_head(changi_weather, "changi_weather") 

changi_weather dataset shape:
(4018, 13)

changi_weather dataset head:


Unnamed: 0,station,year,month,day,daily rainfall total (mm),highest 30 min rainfall (mm),highest 60 min rainfall (mm),highest 120 min rainfall (mm),mean temperature (°c),maximum temperature (°c),minimum temperature (°c),mean wind speed (km/h),max wind speed (km/h)
0,Changi,2012,1,1,0.6,,,,27.2,31.4,25.2,8.4,28.4
1,Changi,2012,1,2,0.0,,,,27.7,31.3,25.4,13.6,33.1
2,Changi,2012,1,3,0.0,,,,27.6,30.9,25.7,15.4,34.6
3,Changi,2012,1,4,0.0,,,,27.4,31.0,25.0,13.3,33.8
4,Changi,2012,1,5,0.0,,,,27.0,30.7,24.5,12.2,33.8


In [6]:
compiled_functions.shape_head(dengue_google, "dengue_google")

dengue_google dataset shape:
(574, 3)

dengue_google dataset head:


Unnamed: 0_level_0,Dengue: (Singapore),dengue fever: (Singapore),dengue symptoms: (Singapore)
Week,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2012-01-01,7,5,2
2012-01-08,5,5,0
2012-01-15,8,4,2
2012-01-22,6,7,3
2012-01-29,6,3,0


In [7]:
# Create list of dataframes
data_train = [(dengue_cases, "dengue_cases"),
              (changi_weather, "changi_weather"),
              (dengue_google, "dengue_google")]

Do preliminary inspection:

In [8]:
for df, df_name in data_train:
    compiled_functions.data_inspect(df, df_name)

dengue_cases dataset inspection
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20070 entries, 0 to 20069
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   epi_week      20070 non-null  object
 1   disease       20070 non-null  object
 2   no._of_cases  20070 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 470.5+ KB
None

++++++++++

Check for null percentages for dengue_cases dataset:
epi_week        0.0
disease         0.0
no._of_cases    0.0
dtype: float64

++++++++++

Check for no of duplicated values for dengue_cases dataset:
0
++++++++++

changi_weather dataset inspection
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4018 entries, 0 to 4017
Data columns (total 13 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0

##### Findings:
In the preliminary inspection, there appeared to be no null values in any of the datasets, and no duplicated rows in the dengue_cases and changi_weather datasets. dengue_google has 142 duplicated rows.

Some of the null percentages may be inaccurate: in the changi_weather dataframe, missing values are actually represented by the placeholder character '\x97', so they are not counted as nulls. Further inspection of the data is needed. Also note that values of 0 are treated as genuine observations rather than missing data, because there can be 0 dengue cases in a week, 0.0 rainfall or wind speed on a day, or 0 search interest for a term. We hence keep these values, and only address true null values.

For all 3 datasets, we also need to convert the date/time values into datetime format. For the changi_weather dataset, we create a new column combining year, month and day, and convert it to datetime. All of its rainfall, temperature and wind speed columns should be float or integer datatypes, as these are numerical values. For the dengue_google dataset, all columns (besides the datetime index) should be integers, as these values are Google Trends search-interest scores reported as whole numbers.
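Because placeholder strings are invisible to `isnull()`, one way to surface them is to try parsing every non-numeric column as a number and list the values that fail. The sketch below does this on a made-up sample; the helper name `non_numeric_tokens` and the column names are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def non_numeric_tokens(df: pd.DataFrame) -> dict:
    """Return, per text-like column, the distinct values that fail numeric parsing."""
    found = {}
    for col in df.columns:
        series = df[col]
        # Columns that are already numeric or datetime cannot hide string placeholders
        if is_numeric_dtype(series) or is_datetime64_any_dtype(series):
            continue
        parsed = pd.to_numeric(series, errors="coerce")
        # Present in the raw data, but not parseable as a number
        bad = series[parsed.isna() & series.notna()].unique().tolist()
        if bad:
            found[col] = sorted(bad)
    return found


# Made-up sample mimicking the placeholder tokens seen in this notebook
sample = pd.DataFrame({
    "rain_30min": ["\x97", "0.0", "8.6", "-"],
    "wind_mean": ["8.4", "13.6", "-", "12.2"],
    "temp_mean": [27.2, 27.7, 27.6, 27.4],
})
tokens = non_numeric_tokens(sample)
```

Running a check like this before computing null percentages shows which columns need their placeholders converted to real nulls first.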

## Data Cleaning

### dengue_cases

This dataset contains the weekly case counts of all infectious diseases. Let's extract only the relevant diseases.

In [9]:
dengue_cases['disease'].unique()

array(['Acute Viral hepatitis B', 'Acute Viral hepatitis C',
       'Avian Influenza', 'Campylobacterenterosis', 'Chikungunya Fever',
       'Cholera', 'Dengue Fever', 'Dengue Haemorrhagic Fever',
       'Diphtheria', 'Encephalitis', 'Haemophilus influenzae type b',
       'Hand, Foot Mouth Disease', 'Legionellosis', 'Malaria', 'Measles',
       'Melioidosis', 'Meningococcal Infection', 'Mumps',
       'Nipah virus infection', 'Paratyphoid', 'Pertussis', 'Plague',
       'Pneumococcal Disease (invasive)', 'Poliomyelitis', 'Rubella',
       'Salmonellosis(non-enteric fevers)', 'SARS', 'Typhoid',
       'Viral Hepatitis A', 'Viral Hepatitis E', 'Yellow Fever',
       'Zika Virus Infection', 'Acute Viral Hepatitis A',
       'Acute Viral Hepatitis E', 'Chikungunya', 'HFMD', 'Nipah',
       'Campylobacter enteritis', 'Leptospirosis', 'Zika',
       'Ebola Virus Disease', 'Japanese Encephalitis', 'Tetanus',
       'Botulism', 'Murine Typhus', 'Monkeypox'], dtype=object)

Select only the relevant 'Dengue Fever' and 'Dengue Haemorrhagic Fever' rows:

In [10]:
target_diseases = ['Dengue Fever', 'Dengue Haemorrhagic Fever']
dengue_cases = dengue_cases[dengue_cases['disease'].isin(target_diseases)]

In [11]:
dengue_cases

Unnamed: 0,epi_week,disease,no._of_cases
6,2012-W01,Dengue Fever,74
7,2012-W01,Dengue Haemorrhagic Fever,0
37,2012-W02,Dengue Fever,64
38,2012-W02,Dengue Haemorrhagic Fever,2
68,2012-W03,Dengue Fever,60
...,...,...,...
19962,2022-W50,Dengue Haemorrhagic Fever,1
20000,2022-W51,Dengue Fever,270
20001,2022-W51,Dengue Haemorrhagic Fever,0
20039,2022-W52,Dengue Fever,285


Convert epi_week to datetime format (the Sunday that starts each week):

In [12]:
dengue_cases['epi_week'] = pd.to_datetime(dengue_cases['epi_week'] + '-1', format='%Y-W%U-%w') - pd.Timedelta(days=1)
dengue_cases

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dengue_cases['epi_week'] = pd.to_datetime(dengue_cases['epi_week'] + '-1', format='%Y-W%U-%w') - pd.Timedelta(days=1)


Unnamed: 0,epi_week,disease,no._of_cases
6,2012-01-01,Dengue Fever,74
7,2012-01-01,Dengue Haemorrhagic Fever,0
37,2012-01-08,Dengue Fever,64
38,2012-01-08,Dengue Haemorrhagic Fever,2
68,2012-01-15,Dengue Fever,60
...,...,...,...
19962,2022-12-11,Dengue Haemorrhagic Fever,1
20000,2022-12-18,Dengue Fever,270
20001,2022-12-18,Dengue Haemorrhagic Fever,0
20039,2022-12-25,Dengue Fever,285
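The conversion above appends a weekday digit to each `epi_week` string so that `%U` (week of the year, with weeks starting on Sunday) and `%w` (weekday) together identify a concrete date, then steps back one day from that Monday to the Sunday that starts the week. The `SettingWithCopyWarning` is raised because `dengue_cases` was produced by filtering another dataframe; taking an explicit `.copy()` of the filtered rows is one common way to avoid it. A minimal sketch on a few illustrative rows (the Malaria row and the `week_start` column name are made up):

```python
import pandas as pd

raw = pd.DataFrame({
    "epi_week": ["2012-W01", "2012-W01", "2018-W01", "2022-W52"],
    "disease": ["Dengue Fever", "Malaria", "Dengue Fever", "Dengue Fever"],
    "no._of_cases": [74, 1, 83, 285],
})

# .copy() makes the filtered frame independent of `raw`, so the assignment
# below writes to a real dataframe rather than a possible view of `raw`.
dengue = raw[raw["disease"] == "Dengue Fever"].copy()

# '-1' selects Monday (%w = 1) of week %U; subtracting a day gives the Sunday.
monday = pd.to_datetime(dengue["epi_week"] + "-1", format="%Y-W%U-%w")
dengue["week_start"] = monday - pd.Timedelta(days=1)
```

Every resulting date falls on a Sunday, which matches the weekly dates used by the Google Trends export.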


Make Dengue Fever and Dengue Haemorrhagic Fever each a column of its own by pivoting the dataframe. That way, epi_week becomes the index, which we can use for time series analysis later on.

In [13]:
# Pivot the DataFrame
pivoted_dengue = dengue_cases.pivot_table(index='epi_week', columns='disease', values='no._of_cases', fill_value=0)
print(pivoted_dengue.shape)
pivoted_dengue.head()

(572, 2)


disease,Dengue Fever,Dengue Haemorrhagic Fever
epi_week,Unnamed: 1_level_1,Unnamed: 2_level_1
2012-01-01,74.0,0.0
2012-01-08,64.0,2.0
2012-01-15,60.0,1.0
2012-01-22,50.0,2.0
2012-01-29,84.0,1.0


From the shape of the dataframe, we notice that there are only 572 rows, whereas we expect 574 (one per week from 2012-01-01 to 2022-12-25). After merging the various datasets later, we will be able to identify more easily which weeks are missing and then impute values.
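The missing weeks can also be listed directly by comparing the index against a complete weekly calendar. The sketch below assumes Sunday-anchored weeks, as produced by the conversion above; the helper name `missing_weeks` and the short demo index are hypothetical.

```python
import pandas as pd


def missing_weeks(index: pd.DatetimeIndex, start: str, end: str) -> pd.DatetimeIndex:
    """Return the Sunday-anchored weeks between start and end that are absent from index."""
    expected = pd.date_range(start=start, end=end, freq="W-SUN")
    return expected.difference(index)


# Made-up index with one week (2012-01-15) deliberately left out
observed = pd.DatetimeIndex(["2012-01-01", "2012-01-08", "2012-01-22", "2012-01-29"])
gaps = missing_weeks(observed, "2012-01-01", "2012-01-29")

# Number of weeks in the full study period of this notebook
n_expected = len(pd.date_range("2012-01-01", "2022-12-25", freq="W-SUN"))
```

For the study period, the calendar has 574 entries, which is where the expected row count comes from. Calling `missing_weeks(pivoted_dengue.index, "2012-01-01", "2022-12-25")` would then name the two absent weeks.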

Also, as Dengue Haemorrhagic Fever counts are very low compared to Dengue Fever counts, let's engineer a new column, 'dengue_cases', that combines both:

In [14]:
# Add total dengue cases (combine dengue_cases and dengue_haemorrhagic_case)
pivoted_dengue['dengue_cases'] = pivoted_dengue['Dengue Fever'] + pivoted_dengue['Dengue Haemorrhagic Fever']

# Sort by date
pivoted_dengue.sort_index(inplace=True)

# View dataframe
pivoted_dengue.tail()

disease,Dengue Fever,Dengue Haemorrhagic Fever,dengue_cases
epi_week,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2022-11-27,242.0,0.0,242.0
2022-12-04,326.0,1.0,327.0
2022-12-11,289.0,1.0,290.0
2022-12-18,270.0,0.0,270.0
2022-12-25,285.0,0.0,285.0


In [15]:
pivoted_dengue.loc['2018-1']

disease,Dengue Fever,Dengue Haemorrhagic Fever,dengue_cases
epi_week,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2018-01-07,83.0,0.0,83.0
2018-01-14,68.0,0.0,68.0
2018-01-21,54.0,0.0,54.0
2018-01-28,45.0,0.0,45.0


---

### changi_weather

Create a date column in datetime format and drop the existing year, month and day columns.

In [16]:
# create new date column
changi_weather['date'] = pd.to_datetime(changi_weather[['year', 'month', 'day']])

# drop year, month, day columns
changi_weather.drop(columns=['year', 'month', 'day'], inplace=True)
changi_weather.head()

Unnamed: 0,station,daily rainfall total (mm),highest 30 min rainfall (mm),highest 60 min rainfall (mm),highest 120 min rainfall (mm),mean temperature (°c),maximum temperature (°c),minimum temperature (°c),mean wind speed (km/h),max wind speed (km/h),date
0,Changi,0.6,,,,27.2,31.4,25.2,8.4,28.4,2012-01-01
1,Changi,0.0,,,,27.7,31.3,25.4,13.6,33.1,2012-01-02
2,Changi,0.0,,,,27.6,30.9,25.7,15.4,34.6,2012-01-03
3,Changi,0.0,,,,27.4,31.0,25.0,13.3,33.8,2012-01-04
4,Changi,0.0,,,,27.0,30.7,24.5,12.2,33.8,2012-01-05


In [17]:
# confirm date is in date-time format
changi_weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4018 entries, 0 to 4017
Data columns (total 11 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   station                        4018 non-null   object        
 1   daily rainfall total (mm)      4018 non-null   float64       
 2   highest 30 min rainfall (mm)   4018 non-null   object        
 3   highest 60 min rainfall (mm)   4018 non-null   object        
 4   highest 120 min rainfall (mm)  4018 non-null   object        
 5   mean temperature (°c)          4018 non-null   float64       
 6   maximum temperature (°c)       4018 non-null   float64       
 7   minimum temperature (°c)       4018 non-null   float64       
 8   mean wind speed (km/h)         4018 non-null   object        
 9   max wind speed (km/h)          4018 non-null   object        
 10  date                           4018 non-null   datetime64[ns]
dtypes: datetime64[ns]

Find out the number of rows of data with null values:

In [18]:
changi_weather.columns

Index(['station', 'daily rainfall total (mm)', 'highest 30 min rainfall (mm)',
       'highest 60 min rainfall (mm)', 'highest 120 min rainfall (mm)',
       'mean temperature (°c)', 'maximum temperature (°c)',
       'minimum temperature (°c)', 'mean wind speed (km/h)',
       'max wind speed (km/h)', 'date'],
      dtype='object')

In [19]:
changi_weather['highest 30 min rainfall (mm)'].unique()

array(['\x97', '0.0', '8.6', '10.0', '1.8', '5.4', '18.2', '0.2', '20.4',
       '3.6', '0.4', '2.0', '2.8', '15.2', '9.8', '4.2', '1.4', '21.2',
       '1.6', '3.0', '8.2', '9.4', '11.0', '3.8', '0.8', '7.8', '17.2',
       '3.2', '1.0', '16.4', '2.4', '18.0', '0.6', '3.4', '10.8', '10.4',
       '24.6', '7.6', '15.4', '18.4', '11.2', '1.2', '2.6', '9.2', '33.6',
       '4.4', '31.4', '5.0', '33.4', '36.4', '29.8', '7.0', '13.0',
       '21.0', '16.6', '19.2', '5.8', '4.6', '19.8', '43.6', '6.2',
       '32.2', '22.2', '27.8', '25.2', '13.6', '4.0', '26.6', '6.4',
       '11.8', '23.6', '6.0', '4.8', '12.4', '13.2', '2.2', '5.2', '12.8',
       '17.4', '41.6', '15.8', '10.6', '12.6', '29.4', '19.0', '11.6',
       '6.8', '16.2', '7.2', '17.0', '24.4', '12.2', '22.4', '40.8',
       '14.4', '20.2', '8.8', '43.8', '6.6', '21.6', '23.2', '10.2',
       '5.6', '0', '1', '3', '9.6', '53.4', '17.8', '14.2', '34.0', '8.0',
       '31.8', '19.6', '8.4', '30.6', '23.4', '13.8', '26.8', '11.4',

In [20]:
# Filter rows where columns contain '\x97' (the Windows-1252 em dash), used here as a missing-value placeholder
columns_check = ['highest 30 min rainfall (mm)',
       'highest 60 min rainfall (mm)', 'highest 120 min rainfall (mm)', 'mean wind speed (km/h)',
       'max wind speed (km/h)']

filtered_weather = changi_weather[changi_weather[columns_check].apply(lambda col: col == '\x97').any(axis=1)]
print(f"Percentage of rows with '\\x97': {round((len(filtered_weather)/len(changi_weather))*100, 2)}%")
filtered_weather.head()

Percentage of rows with '\x97': 18.34%


Unnamed: 0,station,daily rainfall total (mm),highest 30 min rainfall (mm),highest 60 min rainfall (mm),highest 120 min rainfall (mm),mean temperature (°c),maximum temperature (°c),minimum temperature (°c),mean wind speed (km/h),max wind speed (km/h),date
0,Changi,0.6,,,,27.2,31.4,25.2,8.4,28.4,2012-01-01
1,Changi,0.0,,,,27.7,31.3,25.4,13.6,33.1,2012-01-02
2,Changi,0.0,,,,27.6,30.9,25.7,15.4,34.6,2012-01-03
3,Changi,0.0,,,,27.4,31.0,25.0,13.3,33.8,2012-01-04
4,Changi,0.0,,,,27.0,30.7,24.5,12.2,33.8,2012-01-05


Further inspection shows that the '\x97' values in the columns ['highest 30 min rainfall (mm)', 'highest 60 min rainfall (mm)', 'highest 120 min rainfall (mm)'] come from 2012 and 2013. Since dropping these rows would mean losing about 18% of our data, we keep the rows and impute values for each column instead.

In [21]:
# Investigate '-' values
changi_weather[changi_weather[columns_check].apply(lambda col: col == '-').any(axis=1)]

Unnamed: 0,station,daily rainfall total (mm),highest 30 min rainfall (mm),highest 60 min rainfall (mm),highest 120 min rainfall (mm),mean temperature (°c),maximum temperature (°c),minimum temperature (°c),mean wind speed (km/h),max wind speed (km/h),date
3013,Changi,0.0,0.0,0.0,0.0,29.3,32.7,26.4,9.9,-,2020-04-01
3025,Changi,0.0,0.0,0.0,0.0,29.5,33.7,27.1,8.4,-,2020-04-13
3074,Changi,32.4,20.2,-,-,26.6,29.5,24.7,5.9,31.5,2020-06-01
3075,Changi,0.2,0.2,-,-,28.5,33.0,25.3,5.9,33.3,2020-06-02
3078,Changi,41.0,-,-,-,27.6,31.1,24.1,7.6,33.3,2020-06-05
3079,Changi,4.2,-,-,-,27.6,29.1,25.0,6.3,33.3,2020-06-06
3085,Changi,0.0,0.0,0.0,0.0,29.1,31.6,27.2,-,25.9,2020-06-12
3119,Changi,3.8,-,-,-,28.2,32.2,25.6,7.7,35.2,2020-07-16
3133,Changi,0.6,0.4,0.4,0.6,28.2,31.6,25.3,7.9,-,2020-07-30
3163,Changi,6.6,2.8,3.4,4.8,27.2,30.8,25.1,8.0,-,2020-08-29


Inspection of the above data suggests that these are not readings that should be 0, but readings that were unavailable. For example, index 3078 has a daily rainfall of 41.0 mm but no highest 30, 60 or 120 min rainfall reading. We treat these as NA values and impute them together with the '\x97' values.

In [22]:
# convert non-numeric values such as '\x97' or '-' to np.nan
for column in columns_check:
    changi_weather[column] = pd.to_numeric(changi_weather[column], errors='coerce')

# check results
print(changi_weather.info(), '\n')

# check null values
changi_weather.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4018 entries, 0 to 4017
Data columns (total 11 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   station                        4018 non-null   object        
 1   daily rainfall total (mm)      4018 non-null   float64       
 2   highest 30 min rainfall (mm)   3282 non-null   float64       
 3   highest 60 min rainfall (mm)   3280 non-null   float64       
 4   highest 120 min rainfall (mm)  3280 non-null   float64       
 5   mean temperature (°c)          4018 non-null   float64       
 6   maximum temperature (°c)       4018 non-null   float64       
 7   minimum temperature (°c)       4018 non-null   float64       
 8   mean wind speed (km/h)         4010 non-null   float64       
 9   max wind speed (km/h)          4003 non-null   float64       
 10  date                           4018 non-null   datetime64[ns]
dtypes: datetime64[ns]

station                            0
daily rainfall total (mm)          0
highest 30 min rainfall (mm)     736
highest 60 min rainfall (mm)     738
highest 120 min rainfall (mm)    738
mean temperature (°c)              0
maximum temperature (°c)           0
minimum temperature (°c)           0
mean wind speed (km/h)             8
max wind speed (km/h)             15
date                               0
dtype: int64

#### missing data imputation

The bulk of the null values are in the highest-rainfall columns, which depend on the daily rainfall total, so imputing a column mean or median would not make sense. The same goes for wind speed: mean wind speed cannot exceed max wind speed, and max wind speed cannot be lower than mean wind speed. These weather features probably also depend on the other weather features in the table.

Therefore, we use IterativeImputer, then sense-check the imputed values against the minimum and maximum bounds implied by the related columns.
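To illustrate why a model-based imputer suits correlated weather columns, the sketch below hides some values of a synthetic "peak rainfall" column that is tied to a "daily total" column, then lets `IterativeImputer` fill them in from the other column. The data and column names are invented for the example; only the imputer settings mirror those used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic data: peak rainfall is between 40% and 80% of the daily total
total = rng.gamma(shape=1.0, scale=8.0, size=200)
peak = total * rng.uniform(0.4, 0.8, size=200)
toy = pd.DataFrame({"daily_total": total, "peak_30min": peak})

# Hide roughly a quarter of the peak values to mimic the missing readings
mask = rng.random(200) < 0.25
toy_missing = toy.copy()
toy_missing.loc[mask, "peak_30min"] = np.nan

imp = IterativeImputer(random_state=123, sample_posterior=True, min_value=0, max_iter=10)
toy_imputed = pd.DataFrame(
    imp.fit_transform(toy_missing),
    index=toy_missing.index,
    columns=toy_missing.columns,
)
```

With `sample_posterior=True` the imputed values are random draws from the fitted model, so they can still violate physical constraints such as peak rainfall exceeding the daily total. That is why the bound checks in the following cells are needed.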

In [23]:
# drop station column as this is non-variable
changi_weather.drop(columns=['station'], inplace=True)

In [24]:
# set date as index and sort
changi_weather.set_index('date', inplace=True)
changi_weather.sort_index(inplace=True)

In [25]:
changi_weather.head()

Unnamed: 0_level_0,daily rainfall total (mm),highest 30 min rainfall (mm),highest 60 min rainfall (mm),highest 120 min rainfall (mm),mean temperature (°c),maximum temperature (°c),minimum temperature (°c),mean wind speed (km/h),max wind speed (km/h)
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2012-01-01,0.6,,,,27.2,31.4,25.2,8.4,28.4
2012-01-02,0.0,,,,27.7,31.3,25.4,13.6,33.1
2012-01-03,0.0,,,,27.6,30.9,25.7,15.4,34.6
2012-01-04,0.0,,,,27.4,31.0,25.0,13.3,33.8
2012-01-05,0.0,,,,27.0,30.7,24.5,12.2,33.8


In [26]:
# store indexes for missing values to enable sense-checking later
missing_dic = {}

for column in columns_check:
    missing_dic[column] = changi_weather[changi_weather[column].isnull()].index

missing_dic

{'highest 30 min rainfall (mm)': DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03', '2012-01-04',
                '2012-01-05', '2012-01-06', '2012-01-07', '2012-01-08',
                '2012-01-09', '2012-01-10',
                ...
                '2013-12-27', '2013-12-28', '2013-12-29', '2013-12-30',
                '2013-12-31', '2016-09-21', '2017-05-30', '2020-06-05',
                '2020-06-06', '2020-07-16'],
               dtype='datetime64[ns]', name='date', length=736, freq=None),
 'highest 60 min rainfall (mm)': DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03', '2012-01-04',
                '2012-01-05', '2012-01-06', '2012-01-07', '2012-01-08',
                '2012-01-09', '2012-01-10',
                ...
                '2013-12-29', '2013-12-30', '2013-12-31', '2016-09-21',
                '2017-05-30', '2020-06-01', '2020-06-02', '2020-06-05',
                '2020-06-06', '2020-07-16'],
               dtype='datetime64[ns]', name='date', length=738, fr

In [27]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [28]:
# Impute missing data with Iterative Imputer
imp = IterativeImputer(random_state=123, 
                       sample_posterior=True, # sample_posterior True since multiple imputations required
                       min_value = 0, # rainfall and windspeeds should be non-negative
                       max_iter=10)

imputed = imp.fit_transform(changi_weather)

In [29]:
changi_weather_imp = pd.DataFrame(imputed, index=changi_weather.index, columns=imp.get_feature_names_out())

In [30]:
changi_weather_imp.head()

Unnamed: 0_level_0,daily rainfall total (mm),highest 30 min rainfall (mm),highest 60 min rainfall (mm),highest 120 min rainfall (mm),mean temperature (°c),maximum temperature (°c),minimum temperature (°c),mean wind speed (km/h),max wind speed (km/h)
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2012-01-01,0.6,5.285768,4.146961,5.203163,27.2,31.4,25.2,8.4,28.4
2012-01-02,0.0,3.429591,3.859431,4.450288,27.7,31.3,25.4,13.6,33.1
2012-01-03,0.0,1.499996,1.558347,1.503074,27.6,30.9,25.7,15.4,34.6
2012-01-04,0.0,3.83671,2.836281,2.085437,27.4,31.0,25.0,13.3,33.8
2012-01-05,0.0,2.816258,1.415756,3.515006,27.0,30.7,24.5,12.2,33.8


In [31]:
# Create function for min/max check

def min_max(df, imputed_col, max_col=None, min_col=None):
    # Clamp only the rows that were originally missing (recorded in missing_dic)
    for index in missing_dic[imputed_col]:
        value = df.loc[index, imputed_col]
        if max_col is not None and value > df.loc[index, max_col]:
            # .loc avoids chained assignment, which may silently fail to update df
            df.loc[index, imputed_col] = df.loc[index, max_col]
        elif min_col is not None and value < df.loc[index, min_col]:
            df.loc[index, imputed_col] = df.loc[index, min_col]
        

In [32]:
# sense-check imputed values: highest 30 min rainfall (mm), no more than daily rainfall total

min_max(changi_weather_imp, 'highest 30 min rainfall (mm)', 'daily rainfall total (mm)')


In [33]:
# sense-check imputed values: highest 60 min rainfall (mm), no more than daily rainfall total and no less than highest 30 min rainfall (mm)

min_max(changi_weather_imp, 'highest 60 min rainfall (mm)', 'daily rainfall total (mm)', 'highest 30 min rainfall (mm)')

In [34]:
# sense-check imputed values: highest 120 min rainfall (mm), no more than daily rainfall total and no less than highest 60 min rainfall (mm)

min_max(changi_weather_imp, 'highest 120 min rainfall (mm)', 'daily rainfall total (mm)', 'highest 60 min rainfall (mm)')
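The same bounds can also be enforced without a Python loop. The sketch below is an alternative formulation using `Series.clip`, not the notebook's own function; `clamp_imputed` and the toy column names are hypothetical. Unlike `min_max`, it applies the upper bound first and then the lower bound, so the two differ only in the corner case where the bounds contradict each other.

```python
import pandas as pd


def clamp_imputed(df, col, rows, upper_col=None, lower_col=None):
    """Clamp df[col] on the given rows to the range [df[lower_col], df[upper_col]]."""
    values = df.loc[rows, col]
    if upper_col is not None:
        values = values.clip(upper=df.loc[rows, upper_col])
    if lower_col is not None:
        values = values.clip(lower=df.loc[rows, lower_col])
    df.loc[rows, col] = values
    return df


# Made-up imputed values that break the rainfall ordering
toy = pd.DataFrame(
    {"daily_total": [0.6, 0.0, 10.0], "peak_30": [5.3, 3.4, 4.0], "peak_60": [4.1, 3.9, 2.0]},
    index=pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-03"]),
)
imputed_rows = toy.index  # pretend every row was imputed

clamp_imputed(toy, "peak_30", imputed_rows, upper_col="daily_total")
clamp_imputed(toy, "peak_60", imputed_rows, upper_col="daily_total", lower_col="peak_30")
```

After clamping, each 30-minute peak is no larger than the daily total, and each 60-minute peak lies between the 30-minute peak and the daily total.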

In [35]:
# view/check results
changi_weather_imp.loc[missing_dic['highest 120 min rainfall (mm)'], :]

Unnamed: 0_level_0,daily rainfall total (mm),highest 30 min rainfall (mm),highest 60 min rainfall (mm),highest 120 min rainfall (mm),mean temperature (°c),maximum temperature (°c),minimum temperature (°c),mean wind speed (km/h),max wind speed (km/h)
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2012-01-01,0.6,0.600000,0.600000,0.600000,27.2,31.4,25.2,8.4,28.4
2012-01-02,0.0,0.000000,0.000000,0.000000,27.7,31.3,25.4,13.6,33.1
2012-01-03,0.0,0.000000,0.000000,0.000000,27.6,30.9,25.7,15.4,34.6
2012-01-04,0.0,0.000000,0.000000,0.000000,27.4,31.0,25.0,13.3,33.8
2012-01-05,0.0,0.000000,0.000000,0.000000,27.0,30.7,24.5,12.2,33.8
...,...,...,...,...,...,...,...,...,...
2020-06-01,32.4,20.200000,24.063788,24.695657,26.6,29.5,24.7,5.9,31.5
2020-06-02,0.2,0.200000,0.200000,0.200000,28.5,33.0,25.3,5.9,33.3
2020-06-05,41.0,10.950965,17.899576,24.382176,27.6,31.1,24.1,7.6,33.3
2020-06-06,4.2,0.521959,0.674713,0.721058,27.6,29.1,25.0,6.3,33.3


In [36]:
# sense-check imputed values: mean wind speed (km/h) no more than max wind speed (km/h)

min_max(changi_weather_imp, 'mean wind speed (km/h)', 'max wind speed (km/h)')

In [37]:
# sense-check imputed values: max wind speed (km/h) no less than mean wind speed (km/h)

min_max(changi_weather_imp, 'max wind speed (km/h)', max_col=None, min_col='mean wind speed (km/h)')

In [38]:
# view/check results
changi_weather_imp.loc[missing_dic['max wind speed (km/h)'].append(missing_dic['mean wind speed (km/h)']), :]

Unnamed: 0_level_0,daily rainfall total (mm),highest 30 min rainfall (mm),highest 60 min rainfall (mm),highest 120 min rainfall (mm),mean temperature (°c),maximum temperature (°c),minimum temperature (°c),mean wind speed (km/h),max wind speed (km/h)
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2013-01-16,0.0,0.0,0.0,0.0,27.4,30.3,26.1,11.043082,35.75711
2013-01-17,0.0,0.0,0.0,0.0,27.5,30.7,25.2,5.669832,28.317076
2015-02-10,0.0,0.0,0.0,0.0,26.7,30.4,24.5,9.579448,41.302577
2015-02-11,0.0,0.0,0.0,0.0,26.6,31.1,23.6,5.516387,18.865555
2015-02-12,0.0,0.0,0.0,0.0,26.3,30.8,23.3,4.699378,33.599034
2015-02-13,0.0,0.0,0.0,0.0,26.6,30.8,24.0,10.4,34.800442
2016-09-21,2.5,2.5,2.5,2.5,28.1,33.6,24.0,5.696609,28.154801
2020-04-01,0.0,0.0,0.0,0.0,29.3,32.7,26.4,9.9,25.003108
2020-04-13,0.0,0.0,0.0,0.0,29.5,33.7,27.1,8.4,29.19329
2020-07-30,0.6,0.4,0.4,0.6,28.2,31.6,25.3,7.9,29.550806


Sense-check imputed values by comparing summary statistics before and after imputation:

In [39]:
stats_bef = changi_weather.describe()
stats_aft = changi_weather_imp.describe()

# join before and after stats for imputed data
compare = stats_bef[columns_check].merge(stats_aft[columns_check],
                                         left_index=True,
                                         right_index=True,
                                         suffixes=['_0', '_1'])

# Add absolute change (after minus before)
for col in columns_check:
    compare[col+'_valchange'] = compare[col+'_1'] - compare[col+'_0']

# Sort columns for easier comparison
compare[compare.columns.sort_values()]


Unnamed: 0,highest 120 min rainfall (mm)_0,highest 120 min rainfall (mm)_1,highest 120 min rainfall (mm)_valchange,highest 30 min rainfall (mm)_0,highest 30 min rainfall (mm)_1,highest 30 min rainfall (mm)_valchange,highest 60 min rainfall (mm)_0,highest 60 min rainfall (mm)_1,highest 60 min rainfall (mm)_valchange,max wind speed (km/h)_0,max wind speed (km/h)_1,max wind speed (km/h)_valchange,mean wind speed (km/h)_0,mean wind speed (km/h)_1,mean wind speed (km/h)_valchange
count,3280.0,4018.0,738.0,3282.0,4018.0,736.0,3280.0,4018.0,738.0,4003.0,4018.0,15.0,4010.0,4018.0,8.0
mean,4.209787,4.058089,-0.151698,3.012523,2.806174,-0.206348,3.712713,3.507443,-0.20527,32.564402,32.565117,0.000715,8.68591,8.68294,-0.00297
std,9.406909,8.903684,-0.503225,6.460285,5.992072,-0.468213,8.223056,7.675789,-0.547267,6.600887,6.602152,0.001265,2.964181,2.963714,-0.000467
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,14.8,14.8,0.0,0.4,0.4,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,28.1,28.1,0.0,6.4,6.4,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,31.7,31.7,0.0,7.9,7.9,0.0
75%,3.4,3.519863,0.119863,2.4,2.577522,0.177522,2.85,3.0,0.15,36.0,36.0,0.0,10.6,10.6,0.0
max,85.6,85.6,0.0,53.4,53.4,0.0,73.8,73.8,0.0,79.6,79.6,0.0,21.4,21.4,0.0


Generally, the differences between the original data (with missing values) and the imputed data are within expectation, with no significant shifts in the summary statistics (ignoring `count`, which rises by the number of imputed values):
- Rainfall statistics all change by less than 1 mm
- Wind speed statistics all change by less than 0.01 km/h

#### weather data wrap-up

For time series forecasting later, re-sample weather data to weekly mean to match weekly dengue data:

In [40]:
weather = changi_weather_imp.resample('W').mean()
print(weather.shape)
weather.head()

(575, 9)


Unnamed: 0_level_0,daily rainfall total (mm),highest 30 min rainfall (mm),highest 60 min rainfall (mm),highest 120 min rainfall (mm),mean temperature (°c),maximum temperature (°c),minimum temperature (°c),mean wind speed (km/h),max wind speed (km/h)
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2012-01-01,0.6,0.6,0.6,0.6,27.2,31.4,25.2,8.4,28.4
2012-01-08,4.0,1.468296,1.468296,1.776585,26.971429,30.542857,24.785714,12.214286,34.6
2012-01-15,3.685714,0.808308,1.158314,1.754746,26.228571,29.5,23.828571,7.814286,34.214286
2012-01-22,4.0,2.701471,3.2066,3.225174,26.914286,31.3,24.471429,7.357143,28.042857
2012-01-29,1.228571,0.871997,0.916665,0.962512,26.6,30.6,24.4,8.585714,30.857143


In [41]:
# Expected 574 rows, but got 575 rows: view tail
weather.tail()

Unnamed: 0_level_0,daily rainfall total (mm),highest 30 min rainfall (mm),highest 60 min rainfall (mm),highest 120 min rainfall (mm),mean temperature (°c),maximum temperature (°c),minimum temperature (°c),mean wind speed (km/h),max wind speed (km/h)
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2022-12-04,13.828571,8.942857,11.142857,12.4,27.057143,31.5,24.128571,6.9,32.285714
2022-12-11,9.371429,6.457143,7.371429,8.314286,26.8,30.328571,24.8,6.357143,26.971429
2022-12-18,11.685714,6.885714,9.028571,9.542857,26.414286,29.828571,24.314286,8.628571,28.314286
2022-12-25,1.457143,1.171429,1.257143,1.342857,26.685714,30.285714,24.442857,10.214286,34.114286
2023-01-01,2.666667,1.866667,1.933333,2.666667,27.083333,30.7,24.916667,11.733333,36.083333


Drop the final row, 2023-01-01, as we only want to analyse data up to the end of 2022, which is as far as our dengue cases dataset goes.

In [42]:
weather.drop('2023-01-01', inplace=True)
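The extra row appears because `resample('W')` groups days into weeks that end on Sunday and labels each group with that Sunday, so any days after the last full week fall into a bin labelled with the following Sunday. A small sketch on made-up daily values illustrates the binning; the series is invented purely for demonstration.

```python
import pandas as pd

# Made-up daily series: the value is simply the day of the month
days = pd.date_range("2012-01-01", "2012-01-17", freq="D")
daily = pd.Series(days.day.astype(float), index=days)

weekly = daily.resample("W").mean()    # mean of each Monday-to-Sunday week
counts = daily.resample("W").count()   # number of days contributing to each mean
```

Here the first bin holds a single day (Sunday 1 January), the bin labelled 2012-01-08 averages 2 to 8 January, and the last bin holds only two days. Checking `count()` alongside `mean()` is a quick way to spot partial weeks such as the 2023-01-01 row, and the week-ending labels are worth keeping in mind when aligning with the dengue weeks.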

---

### dengue_google

In [43]:
dengue_google.head()

Unnamed: 0_level_0,Dengue: (Singapore),dengue fever: (Singapore),dengue symptoms: (Singapore)
Week,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2012-01-01,7,5,2
2012-01-08,5,5,0
2012-01-15,8,4,2
2012-01-22,6,7,3
2012-01-29,6,3,0


In [44]:
dengue_google.index.dtype

dtype('O')

Convert datatype of index (currently object) into datetime format:

In [45]:
dengue_google.index = pd.to_datetime(dengue_google.index)
print(f"Updated dtype of index: {dengue_google.index.dtype}")
dengue_google.head()

Updated dtype of index: datetime64[ns]


Unnamed: 0_level_0,Dengue: (Singapore),dengue fever: (Singapore),dengue symptoms: (Singapore)
Week,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2012-01-01,7,5,2
2012-01-08,5,5,0
2012-01-15,8,4,2
2012-01-22,6,7,3
2012-01-29,6,3,0


In [46]:
# Sort index
dengue_google.sort_index(inplace=True)

Inspect duplicated rows:

In [47]:
dengue_google[dengue_google.duplicated()]

Unnamed: 0_level_0,Dengue: (Singapore),dengue fever: (Singapore),dengue symptoms: (Singapore)
Week,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2012-02-26,6,3,2
2012-04-01,7,5,2
2012-07-01,9,5,3
2012-08-05,8,5,3
2012-09-02,7,5,2
...,...,...,...
2021-10-31,8,1,1
2021-11-28,7,1,0
2022-01-02,11,3,2
2022-02-13,14,3,4


In [48]:
# double-check for duplicated dates
np.unique(dengue_google.index.duplicated(), return_counts=True)

(array([False]), array([574], dtype=int64))

On further inspection, these rows are not true duplicates: they share the same values but have different dates in the index. They should therefore be kept, as each row represents the search interest at a particular point in time, defined by the index.
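The distinction comes down to what each check compares: `DataFrame.duplicated()` looks only at column values, while `Index.duplicated()` looks only at the index labels. A minimal sketch with made-up values:

```python
import pandas as pd

trend = pd.DataFrame(
    {"dengue": [7, 5, 7], "dengue fever": [5, 5, 5]},
    index=pd.to_datetime(["2012-01-01", "2012-01-08", "2012-01-15"]),
)

repeated_values = trend.duplicated()        # compares column values only, ignores the index
repeated_dates = trend.index.duplicated()   # compares index labels only
```

The third row repeats the values of the first, so it is flagged by `duplicated()`, yet no date occurs twice. For a time series, only repeated dates would indicate a real duplication problem.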

In [49]:
# Double check column data-types
dengue_google.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 574 entries, 2012-01-01 to 2022-12-25
Data columns (total 3 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Dengue: (Singapore)           574 non-null    int64 
 1   dengue fever: (Singapore)     574 non-null    object
 2   dengue symptoms: (Singapore)  574 non-null    object
dtypes: int64(1), object(2)
memory usage: 17.9+ KB


In [50]:
print(dengue_google['Dengue: (Singapore)'].unique())
print(dengue_google['dengue fever: (Singapore)'].unique())
print(dengue_google['dengue symptoms: (Singapore)'].unique())

[  7   5   8   6   4   9  10   3  12  11  13  15  14  16  18  38  34  30
  39  35  32  29  59  53 100  72  44  43  37  31  26  23  22  25  20  19
  28  24  21  17  33  27  42  45  52  41  47  51  64  75  78  70  77  76
  67  57  46  36  74  58  84  86  87  90  83  85  69  68  55  40]
[5 4 7 3 6 2 9 8 16 13 10 20 15 12 11 26 21 41 24 14 1 0 '6' '3' '4' '0'
 '1' '2' '5' '8' '7' '10' '13' '11' '12' '9' '14' '<1' 17]
[2 0 3 1 4 5 8 7 9 10 15 16 28 20 13 11 6 '3' '1' '0' '4' '2' '5' '6' '8'
 '9' '7' '10' '13' '14' '12' '11' '<1' 12 25 19 24 23 22 27 17 18]


Looking at the unique values, the two `object` columns mix integers with numeric strings, and both contain the non-numeric value `<1`. Let's map `<1` to 0 and convert the columns to integer:

In [51]:
# Replace non-numerical "<1" value with "0"
for column in dengue_google.columns:
    dengue_google[column] = dengue_google[column].replace('<1', '0')

# Correct selected columns' datatypes
selected_columns = ['Dengue: (Singapore)',
       'dengue fever: (Singapore)', 'dengue symptoms: (Singapore)']
dengue_google[selected_columns] = dengue_google[selected_columns].astype(int)
dengue_google.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 574 entries, 2012-01-01 to 2022-12-25
Data columns (total 3 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Dengue: (Singapore)           574 non-null    int32
 1   dengue fever: (Singapore)     574 non-null    int32
 2   dengue symptoms: (Singapore)  574 non-null    int32
dtypes: int32(3)
memory usage: 11.2 KB
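
As a possible alternative to the cell above, the same cleaning can be sketched with `pd.to_numeric`, which raises an error if any other unexpected token is left in the column. The toy series `raw` is made up for illustration, and the explicit `int64` is our assumption to keep the dtype the same across platforms:

```python
import pandas as pd

# Toy column (made up) mixing integers, digit strings and the "<1" marker.
raw = pd.Series([5, "6", "<1", 17, "0"], dtype=object)

# Cast everything to str so "<1" can be replaced uniformly,
# then parse; errors="raise" fails loudly on anything non-numeric.
as_text = raw.astype(str).replace("<1", "0")
cleaned = pd.to_numeric(as_text, errors="raise").astype("int64")

print(cleaned.tolist())
```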


## Concatenate dataframes
Let's concatenate the three dataframes (dengue cases, weather and Google Trends) into a single dataframe:<br>

**`dengue_wk`**<br>
This dataset is explored during EDA, where we engineer features and reshape the time series for modelling. All data in it are at weekly granularity.

In [52]:
dengue_wk = pd.concat([pivoted_dengue, weather, dengue_google], axis=1)

In [53]:
compiled_functions.shape_head(dengue_wk, "dengue_wk")

dengue_wk dataset shape:
(574, 15)

dengue_wk dataset head:


Unnamed: 0,Dengue Fever,Dengue Haemorrhagic Fever,dengue_cases,daily rainfall total (mm),highest 30 min rainfall (mm),highest 60 min rainfall (mm),highest 120 min rainfall (mm),mean temperature (°c),maximum temperature (°c),minimum temperature (°c),mean wind speed (km/h),max wind speed (km/h),Dengue: (Singapore),dengue fever: (Singapore),dengue symptoms: (Singapore)
2012-01-01,74.0,0.0,74.0,0.6,0.6,0.6,0.6,27.2,31.4,25.2,8.4,28.4,7,5,2
2012-01-08,64.0,2.0,66.0,4.0,1.468296,1.468296,1.776585,26.971429,30.542857,24.785714,12.214286,34.6,5,5,0
2012-01-15,60.0,1.0,61.0,3.685714,0.808308,1.158314,1.754746,26.228571,29.5,23.828571,7.814286,34.214286,8,4,2
2012-01-22,50.0,2.0,52.0,4.0,2.701471,3.2066,3.225174,26.914286,31.3,24.471429,7.357143,28.042857,6,7,3
2012-01-29,84.0,1.0,85.0,1.228571,0.871997,0.916665,0.962512,26.6,30.6,24.4,8.585714,30.857143,6,3,0
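
For reference, `pd.concat(..., axis=1)` aligns rows on the index rather than on position. The sketch below uses made-up toy frames (`cases` and `searches` are hypothetical names) to show that a week present in only one frame comes through as `NaN` in the other frame's columns:

```python
import pandas as pd

weeks = pd.to_datetime(["2012-12-23", "2012-12-30", "2013-01-06"])

# Toy frames (values made up): `cases` has no row for the middle week.
cases = pd.DataFrame({"dengue_cases": [112.0, 134.0]}, index=weeks[[0, 2]])
searches = pd.DataFrame({"searches": [5, 5, 9]}, index=weeks)

# axis=1 aligns on the index (outer join by default),
# so the week absent from `cases` shows up as NaN.
combined = pd.concat([cases, searches], axis=1)

print(combined)
```

This only works as intended when every frame shares the same weekly `DatetimeIndex`, which is why the indexes were converted and sorted beforehand.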


During the earlier data cleaning, we found that 'Dengue Fever' and 'Dengue Haemorrhagic Fever' had missing values in 2 rows. With everything in one dataframe, we can now locate those rows and impute them:

In [54]:
rows_with_nulls = dengue_wk[dengue_wk.isna().any(axis=1)]
rows_with_nulls

Unnamed: 0,Dengue Fever,Dengue Haemorrhagic Fever,dengue_cases,daily rainfall total (mm),highest 30 min rainfall (mm),highest 60 min rainfall (mm),highest 120 min rainfall (mm),mean temperature (°c),maximum temperature (°c),minimum temperature (°c),mean wind speed (km/h),max wind speed (km/h),Dengue: (Singapore),dengue fever: (Singapore),dengue symptoms: (Singapore)
2012-12-30,,,,1.457143,0.734695,1.007585,1.37882,26.728571,29.9,24.957143,6.742857,26.528571,5,4,2
2017-12-31,,,,17.2,8.342857,9.285714,11.171429,26.528571,29.871429,24.342857,8.6,33.128571,9,3,2


In [55]:
dengue_wk.columns

Index(['Dengue Fever', 'Dengue Haemorrhagic Fever', 'dengue_cases',
       'daily rainfall total (mm)', 'highest 30 min rainfall (mm)',
       'highest 60 min rainfall (mm)', 'highest 120 min rainfall (mm)',
       'mean temperature (°c)', 'maximum temperature (°c)',
       'minimum temperature (°c)', 'mean wind speed (km/h)',
       'max wind speed (km/h)', 'Dengue: (Singapore)',
       'dengue fever: (Singapore)', 'dengue symptoms: (Singapore)'],
      dtype='object')

In [56]:
# view dates before and after these dates to get an idea of what dengue numbers were like around these dates
dengue_col = ['Dengue Fever', 'Dengue Haemorrhagic Fever', 'dengue_cases']

print(dengue_wk.loc['2012-12', dengue_col])
print(dengue_wk.loc['2013-1', dengue_col], '\n')

print(dengue_wk.loc['2017-12', dengue_col])
print(dengue_wk.loc['2018-1', dengue_col], '\n')

            Dengue Fever  Dengue Haemorrhagic Fever  dengue_cases
2012-12-02          78.0                        0.0          78.0
2012-12-09         104.0                        1.0         105.0
2012-12-16          92.0                        2.0          94.0
2012-12-23         111.0                        1.0         112.0
2012-12-30           NaN                        NaN           NaN
            Dengue Fever  Dengue Haemorrhagic Fever  dengue_cases
2013-01-06         132.0                        2.0         134.0
2013-01-13         204.0                        1.0         205.0
2013-01-20         219.0                        0.0         219.0
2013-01-27         264.0                        3.0         267.0 

            Dengue Fever  Dengue Haemorrhagic Fever  dengue_cases
2017-12-03          33.0                        1.0          34.0
2017-12-10          40.0                        0.0          40.0
2017-12-17          51.0                        0.0          51.0
2017-12-

In [57]:
# Impute NA with the mean of the week before and the week after rather than the annual mean,
# as the former probably gives a better indication of dengue numbers at that point in time
gaps = [("2012-12-30", "2012-12-23", "2013-01-06"),
        ("2017-12-31", "2017-12-24", "2018-01-07")]

for missing, before, after in gaps:
    for col in ["Dengue Fever", "Dengue Haemorrhagic Fever"]:
        dengue_wk.loc[missing, col] = round(np.mean([dengue_wk.loc[before, col], dengue_wk.loc[after, col]]))
    # total cases = sum of the two imputed components
    dengue_wk.loc[missing, "dengue_cases"] = dengue_wk.loc[missing, "Dengue Fever"] + dengue_wk.loc[missing, "Dengue Haemorrhagic Fever"]

In [59]:
# check results
print(dengue_wk.loc["2012-12-30", dengue_col], '\n')
print(dengue_wk.loc["2017-12-31", dengue_col])

Dengue Fever                 122.0
Dengue Haemorrhagic Fever      2.0
dengue_cases                 124.0
Name: 2012-12-30 00:00:00, dtype: float64 

Dengue Fever                 74.0
Dengue Haemorrhagic Fever     0.0
dengue_cases                 74.0
Name: 2017-12-31 00:00:00, dtype: float64
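
If more gaps turned up, the same neighbour-mean idea could be sketched with `Series.interpolate`, which for a single missing row between two known rows gives their midpoint. This is an alternative we did not run in the notebook; the series below reuses the 'Dengue Fever' values from the December 2012 and January 2013 output above:

```python
import numpy as np
import pandas as pd

# 'Dengue Fever' values around the 2012-12-30 gap, taken from the output above.
s = pd.Series(
    [111.0, np.nan, 132.0],
    index=pd.to_datetime(["2012-12-23", "2012-12-30", "2013-01-06"]),
    name="Dengue Fever",
)

# method="linear" treats rows as equally spaced, so a single gap is filled
# with the midpoint of its neighbours (121.5 here); round() then rounds
# half to even.
filled = s.interpolate(method="linear").round()

print(filled)
```

For this gap the result matches the imputed value of 122 shown above.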


In [60]:
dengue_wk.columns

Index(['Dengue Fever', 'Dengue Haemorrhagic Fever', 'dengue_cases',
       'daily rainfall total (mm)', 'highest 30 min rainfall (mm)',
       'highest 60 min rainfall (mm)', 'highest 120 min rainfall (mm)',
       'mean temperature (°c)', 'maximum temperature (°c)',
       'minimum temperature (°c)', 'mean wind speed (km/h)',
       'max wind speed (km/h)', 'Dengue: (Singapore)',
       'dengue fever: (Singapore)', 'dengue symptoms: (Singapore)'],
      dtype='object')

To make the dataframe neater, rename the columns to short snake_case names (`dengue_cases` already fits and is left unchanged):

In [62]:
# map column names to make neat
column_name_mapping = {
    'Dengue Fever': 'df',
    'Dengue Haemorrhagic Fever': 'dhf',
    'daily rainfall total (mm)': 'daily_rainf_total',
    'highest 30 min rainfall (mm)': 'highest_30min_rainf',
    'highest 60 min rainfall (mm)': 'highest_60min_rainf',
    'highest 120 min rainfall (mm)': 'highest_120min_rainf',
    'mean temperature (°c)': 'mean_temp',
    'maximum temperature (°c)': 'max_temp',
    'minimum temperature (°c)': 'min_temp',
    'mean wind speed (km/h)': 'mean_wind_speed',
    'max wind speed (km/h)': 'max_wind_speed',
    'Dengue: (Singapore)': 'dengue_searches',
    'dengue fever: (Singapore)': 'dengue_fever_searches',
    'dengue symptoms: (Singapore)': 'dengue_symptoms_searches'
}
# Rename the columns using the rename() method
dengue_wk.rename(columns=column_name_mapping, inplace=True)
dengue_wk.head()

Unnamed: 0,df,dhf,dengue_cases,daily_rainf_total,highest_30min_rainf,highest_60min_rainf,highest_120min_rainf,mean_temp,max_temp,min_temp,mean_wind_speed,max_wind_speed,dengue_searches,dengue_fever_searches,dengue_symptoms_searches
2012-01-01,74.0,0.0,74.0,0.6,0.6,0.6,0.6,27.2,31.4,25.2,8.4,28.4,7,5,2
2012-01-08,64.0,2.0,66.0,4.0,1.468296,1.468296,1.776585,26.971429,30.542857,24.785714,12.214286,34.6,5,5,0
2012-01-15,60.0,1.0,61.0,3.685714,0.808308,1.158314,1.754746,26.228571,29.5,23.828571,7.814286,34.214286,8,4,2
2012-01-22,50.0,2.0,52.0,4.0,2.701471,3.2066,3.225174,26.914286,31.3,24.471429,7.357143,28.042857,6,7,3
2012-01-29,84.0,1.0,85.0,1.228571,0.871997,0.916665,0.962512,26.6,30.6,24.4,8.585714,30.857143,6,3,0
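
One thing to keep in mind with `rename()` is that keys missing from the dataframe are skipped silently by default, so a typo in the mapping would go unnoticed. A minimal sketch on a made-up toy frame (the names `toy`, `mapping` and `raised` are for illustration only) shows how `errors="raise"` surfaces that case:

```python
import pandas as pd

# Toy frame (made up) that lacks one of the columns named in the mapping.
toy = pd.DataFrame({"Dengue Fever": [74.0], "dengue_cases": [74.0]})
mapping = {"Dengue Fever": "df", "Dengue Haemorrhagic Fever": "dhf"}

# Default behaviour: the key missing from the frame is ignored.
renamed = toy.rename(columns=mapping)

# errors="raise" turns a missing key into a KeyError instead.
try:
    toy.rename(columns=mapping, errors="raise")
    raised = False
except KeyError:
    raised = True

print(list(renamed.columns), raised)
```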


## Export dataframe

In [63]:
# Save the DataFrame to a CSV file
dengue_wk.to_csv('../data/dengue_wk.csv', index=True)

# Also save as a pickle; the context manager closes the file handle
with open('../pkls/dengue_wk.pkl', 'wb') as f:
    pickle.dump(dengue_wk, f)
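
When loading the exports in a later notebook, the two formats behave differently. The sketch below is an assumption about how they might be read back, using a made-up toy frame and a temporary directory instead of the project paths, and `DataFrame.to_pickle` in place of `pickle.dump` for brevity:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame (values made up) with a weekly DatetimeIndex.
toy = pd.DataFrame(
    {"dengue_cases": [74.0, 66.0], "dengue_searches": [7, 5]},
    index=pd.to_datetime(["2012-01-01", "2012-01-08"]),
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "toy.csv"
    pkl_path = Path(tmp) / "toy.pkl"

    toy.to_csv(csv_path, index=True)
    toy.to_pickle(pkl_path)

    # CSV stores the index as text, so it has to be parsed back into dates.
    from_csv = pd.read_csv(csv_path, index_col=0, parse_dates=True)
    # Pickle keeps the dtypes and the DatetimeIndex as they were.
    from_pkl = pd.read_pickle(pkl_path)

print(type(from_csv.index).__name__, from_pkl.equals(toy))
```

The CSV is the more portable of the two, while the pickle is convenient when the next notebook runs in the same Python environment.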