# Class balancing

During preliminary analysis, a significant class imbalance became apparent. This was due in part to the fact that most locations had outbreaks, and most of these outbreaks were ongoing (present at all dates) throughout the span of the available data.

To address this within the time allotted for this project, several frameworks were tested to balance the classes and make prediction easier:

* **Framework A:** The `non-outbreak` class consisted of locations that never had an outbreak, with features taken from the first available date. The `outbreak` class consisted of locations that had an outbreak at any time during the analyzed dates. For these locations, features from two different dates were tested: the date on which the outbreak began and the date on which the outbreak reached its maximum level (within the span of data collection). These two data sets were called `framework_a_first` and `framework_a_max`, respectively.
* **Framework B:** Only data from locations which had an outbreak were used, restricted to locations with no cases on the first available date. Features from that first date were used as the `non-outbreak` class. Then, from the time series for the same locations, features from either the date on which the outbreak began or the date on which the outbreak reached its maximum level (within the span of data collection) were used for the `outbreak` class. These two data sets were called `framework_b_first` and `framework_b_max`, respectively.

In the end, **`framework_a_first`** was used. This framework predicts on the most relevant portion of the outbreak: the start. It also likely makes prediction slightly easier, since factors such as weather differ between locations.
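As a rough illustration of the two frameworks described above, the sketch below applies both selection rules to a toy table. The locations `X`, `Y`, `Z` and their case counts are hypothetical, and the vectorized aggregations are a simplification for illustration, not the code used later in this notebook.

```python
import pandas as pd

# Toy case counts (hypothetical values, not taken from the project data).
toy = pd.DataFrame({
    'location': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z'],
    'date': pd.to_datetime(['2016-01-02', '2016-01-09', '2016-01-16'] * 3),
    'zika_cases': [0, 3, 8,    # X: starts at zero, outbreak begins later
                   0, 0, 0,    # Y: never has an outbreak
                   5, 9, 4],   # Z: outbreak already under way at the first date
})

grouped = toy.groupby('location')
first_date = grouped['date'].min()
max_cases = grouped['zika_cases'].max()
first_cases = toy.sort_values('date').groupby('location')['zika_cases'].first()
first_nonzero = toy[toy.zika_cases > 0].groupby('location')['date'].min()

# Framework A: never-outbreak locations give label 0 (first date),
# outbreak locations give label 1 (date of first non-zero count).
a_negative = first_date[max_cases == 0]
a_positive = first_nonzero

# Framework B: only locations that start at zero and later report cases;
# each such location contributes one row for each label.
b_locations = max_cases.index[(max_cases > 0) & (first_cases == 0)]

print(a_negative)
print(a_positive)
print(list(b_locations))
```

In this toy table only `X` qualifies for Framework B, because `Z` already has cases on its first date and `Y` never has any.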

In [1]:
import pandas as pd
import numpy as np
import dill
from datetime import timedelta
from dateutil.parser import parser
from csv_pkl_sql import save_it, csv_it, pkl_it

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

## Load infection data

In [2]:
infection = pd.read_pickle('../pkl/03_infection_data_final.pkl')
infection.head(1)

| | date | location | zika_cases | data_field |
|---|---|---|---|---|
| 0 | 2016-03-19 | Argentina-Buenos_Aires | 0 | cumulative_confirmed_local_cases |


In [3]:
infection.dtypes

date          datetime64[ns]
location              object
zika_cases             int64
data_field            object
dtype: object

There are multiple data fields for a given date and location, so roll the data up by just date and location.

In [4]:
infection_sum = (infection[['date','location','zika_cases']]
                 .groupby(['date','location'], as_index=False)
                 .sum())
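
To make the roll-up concrete, here is a minimal sketch on a toy frame with the same columns. The rows and the `field_one`/`field_two` names are hypothetical; only the `groupby`/`sum` pattern mirrors the cell above.

```python
import pandas as pd

# Hypothetical rows: two data fields reported for the same date and location.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2016-03-19', '2016-03-19', '2016-03-26']),
    'location': ['Argentina-Buenos_Aires'] * 3,
    'zika_cases': [100, 30, 140],
    'data_field': ['field_one', 'field_two', 'field_one'],
})

# Dropping data_field and summing leaves one row per (date, location) pair.
toy_sum = (toy[['date', 'location', 'zika_cases']]
           .groupby(['date', 'location'], as_index=False)
           .sum())
print(toy_sum)
```

The two rows that share a date collapse into a single row whose case count is their sum.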

In [5]:
infection_sum.to_pickle('../pkl/10_class_balancing_parsed_dates.pkl')

## Examine feasibility for prediction

In [6]:
def feasibility(df):
    cases_first_date = df.loc[df.date==df.date.min(), 'zika_cases'].values[0]
    date_first_date  = df.date.min()
    
    cases_max   = df.zika_cases.max()
    date_max    = df.loc[df.zika_cases==df.zika_cases.max(), 'date'].values[0]
    
    cases_last  = df.loc[df.date==df.date.max(), 'zika_cases'].values[0]
    date_last   = df.date.max()
    
    cases_total = df.zika_cases.sum()
    
    df2 = df.loc[df.zika_cases>0]
    
    if df2.shape[0]>=1:
        cases_first_nonzero = df2.loc[df2.date==df2.date.min(),'zika_cases'].values[0]
        date_first_nonzero  = df2.date.min()
    else:
        # np.nan is the spelling supported across NumPy versions (np.NaN was removed in NumPy 2.0)
        cases_first_nonzero = np.nan
        date_first_nonzero = np.nan
        
    #print(type(date_first_date), type(date_max), type(date_last), type(date_first_nonzero))
        
    return pd.Series({'cases_first_date' : cases_first_date,
                      'date_first_date'  : date_first_date,
                      'cases_first_nonzero' : cases_first_nonzero,
                      'date_first_nonzero'  : date_first_nonzero,
                      'cases_max'  : cases_max,
                      'date_max'   : date_max,
                      'cases_last' : cases_last,
                      'date_last'  : date_last,
                      'cases_total': cases_total})

In [7]:
framework_key = (infection_sum
                   .groupby('location')
                   .apply(feasibility))

In [8]:
framework_key.head(1)

| location | cases_first_date | cases_first_nonzero | cases_last | cases_max | cases_total | date_first_date | date_first_nonzero | date_last | date_max |
|---|---|---|---|---|---|---|---|---|---|
| Argentina-Buenos_Aires | 130 | 130.0 | 227 | 356 | 3028 | 2016-03-19 00:00:00 | 2016-03-19 00:00:00 | 2016-06-26 00:00:00 | 2016-05-22T00:00:00.000000000 |


In [9]:
# Total size
print(framework_key.shape[0])

# Completely zero entries
print(framework_key.query('cases_max==0').shape[0])

# No zeros at all
print(framework_key.query('cases_first_date>0').shape[0])

# Number of entries that start at zero and have cases
print(((framework_key.cases_max>0) & (framework_key.cases_first_date==0)).sum())

1605
365
718
522
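
The three subsets counted above should partition the locations, assuming case counts are never negative. A quick arithmetic check, with the numbers copied from the output:

```python
# Counts copied from the output above.
total = 1605
all_zero = 365           # cases_max == 0
nonzero_at_start = 718   # cases_first_date > 0
zero_then_cases = 522    # cases_first_date == 0 and cases_max > 0

# The three groups are mutually exclusive and should cover every location.
assert all_zero + nonzero_at_start + zero_then_cases == total

# Locations with an outbreak at any time.
outbreak_locations = nonzero_at_start + zero_then_cases
print(outbreak_locations)
```

The 522 locations in the last group are the only ones usable for Framework B, while the outbreak total corresponds to the positive class in Framework A.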


In [10]:
framework_key['date_max'] = pd.to_datetime(framework_key.date_max)
framework_key['date_last'] = pd.to_datetime(framework_key.date_last)
framework_key['date_first_date'] = pd.to_datetime(framework_key.date_first_date)
framework_key['date_first_nonzero'] = pd.to_datetime(framework_key.date_first_nonzero)

In [11]:
framework_key.dtypes

cases_first_date                int64
cases_first_nonzero           float64
cases_last                      int64
cases_max                       int64
cases_total                     int64
date_first_date        datetime64[ns]
date_first_nonzero     datetime64[ns]
date_last              datetime64[ns]
date_max               datetime64[ns]
dtype: object
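
The conversion above also takes care of locations with no cases, whose `date_first_nonzero` holds a NaN placeholder. A minimal sketch of that behaviour on a hypothetical two-element column (the exact datetime resolution reported may depend on the pandas version):

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing a real timestamp with a missing-value placeholder.
mixed = pd.Series([pd.Timestamp('2016-03-19'), np.nan], dtype=object)

# to_datetime turns the placeholder into NaT and gives the column a datetime dtype.
converted = pd.to_datetime(mixed)
print(converted.dtype)
print(converted.isna().tolist())
```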

## Framework B
Outbreak and non-outbreak samples come from the same location. The outbreak date is either the date of the first non-zero case entry (`first`) or the date of the maximum case entry (`max`). The non-outbreak date is the first available date, which is required to have zero cases.

In [12]:
mask = ((framework_key.cases_max>0) & (framework_key.cases_first_date==0))

framework_b_first = pd.concat([(framework_key
                                .loc[mask, ['date_first_date']]
                                .assign(zika_bool=0)
                                .rename(columns={'date_first_date':'date'})),
                               (framework_key
                                .loc[mask, ['date_first_nonzero']]
                                .assign(zika_bool=1)
                                .rename(columns={'date_first_nonzero':'date'}))]).sort_index().reset_index()

framework_b_max   = pd.concat([(framework_key
                                .loc[mask, ['date_first_date']]
                                .assign(zika_bool=0)
                                .rename(columns={'date_first_date':'date'})),
                               (framework_key
                                .loc[mask, ['date_max']]
                                .assign(zika_bool=1)
                                .rename(columns={'date_max':'date'}))]).sort_index().reset_index()

framework_b_max.head(5)

| | location | date | zika_bool |
|---|---|---|---|
| 0 | Argentina-San_Juan | 2016-03-19 | 0 |
| 1 | Argentina-San_Juan | 2016-05-07 | 1 |
| 2 | Brazil-Amapa | 2016-05-28 | 1 |
| 3 | Brazil-Amapa | 2016-02-13 | 0 |
| 4 | Brazil-Amazonas | 2016-05-28 | 1 |
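
In the output above, the two Brazil-Amapa rows appear with the outbreak row first, since sorting on the location index alone does not fix the order of rows that share a location. If a fixed order is wanted, one option is to sort on both columns after resetting the index. The frame below is a hypothetical stand-in with the same layout as `framework_b_max`:

```python
import pandas as pd

# Hypothetical two-location frame in the same layout as framework_b_max.
toy = pd.DataFrame({
    'location': ['B', 'A', 'B', 'A'],
    'date': pd.to_datetime(['2016-05-28', '2016-03-19',
                            '2016-02-13', '2016-05-07']),
    'zika_bool': [1, 0, 0, 1],
})

# Sort by location, then label, so the non-outbreak row always comes first.
ordered = (toy
           .sort_values(['location', 'zika_bool'])
           .reset_index(drop=True))
print(ordered)
```

Row order should not matter for model fitting, so this is only a convenience for inspecting the data.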


In [13]:
framework_b_first.dtypes

location             object
date         datetime64[ns]
zika_bool             int64
dtype: object

In [14]:
framework_b_first.zika_bool.value_counts()

1    522
0    522
Name: zika_bool, dtype: int64

In [15]:
framework_b_max.zika_bool.value_counts()

1    522
0    522
Name: zika_bool, dtype: int64

In [16]:
save_it(framework_b_first, '10_class_balancing_framework_b_first')

In [17]:
save_it(framework_b_max, '10_class_balancing_framework_b_max')

## Framework A

Data for the outbreak and non-outbreak classes come from separate locations. Outbreak locations are those with a non-zero case count at any time. The outbreak date can be either the date of the first non-zero value (`first`) or the date of the maximum value (`max`).

In [18]:
mask = framework_key.cases_max > 0

framework_a_first = pd.concat([(framework_key
                                .loc[mask, ['date_first_nonzero']]
                                .assign(zika_bool=1)
                                .rename(columns={'date_first_nonzero':'date'})),
                               
                                # zero case data are taken from first date
                               (framework_key
                                .loc[~mask, ['date_first_date']]
                                .assign(zika_bool=0)
                                .rename(columns={'date_first_date':'date'}))]).sort_index().reset_index()


framework_a_max  = pd.concat([(framework_key
                                .loc[mask, ['date_max']]
                                .assign(zika_bool=1)
                                .rename(columns={'date_max':'date'})),
                              
                                # zero case data are taken from first date
                               (framework_key
                                .loc[~mask, ['date_first_date']]
                                .assign(zika_bool=0)
                                .rename(columns={'date_first_date':'date'}))]).sort_index().reset_index()

In [19]:
framework_key.head()

| location | cases_first_date | cases_first_nonzero | cases_last | cases_max | cases_total | date_first_date | date_first_nonzero | date_last | date_max |
|---|---|---|---|---|---|---|---|---|---|
| Argentina-Buenos_Aires | 130 | 130.0 | 227 | 356 | 3028 | 2016-03-19 | 2016-03-19 | 2016-06-26 | 2016-05-22 |
| Argentina-CABA | 77 | 77.0 | 158 | 258 | 1872 | 2016-03-19 | 2016-03-19 | 2016-06-26 | 2016-05-22 |
| Argentina-Catamarca | 14 | 14.0 | 16 | 16 | 212 | 2016-03-19 | 2016-03-19 | 2016-06-26 | 2016-05-07 |
| Argentina-Chaco | 48 | 48.0 | 66 | 126 | 984 | 2016-03-19 | 2016-03-19 | 2016-06-26 | 2016-05-07 |
| Argentina-Chubut | 6 | 6.0 | 6 | 6 | 82 | 2016-03-19 | 2016-03-19 | 2016-06-26 | 2016-03-19 |


In [20]:
mask = framework_key.cases_max > 0

fwf = pd.concat([(framework_key
                  .loc[mask, ['date_first_nonzero','cases_first_date','cases_total']]
                  .assign(zika_bool=1)
                  .rename(columns={'date_first_nonzero':'date'})),

                 # zero case data are taken from first date
                 (framework_key
                  .loc[~mask, ['date_first_date']]
                  .assign(zika_bool=0)
                  .rename(columns={'date_first_date':'date'}))]).sort_index().reset_index()

# non-outbreak rows have no case columns after the concat, so fill them with 0;
# the builtin int replaces np.int, which was removed in NumPy 1.24
fwf['cases_first_date'] = fwf.cases_first_date.fillna(0).astype(int)
fwf['cases_total'] = fwf.cases_total.fillna(0).astype(int)

#fwf.to_pickle('../pkl/10_class_balancing_fwf.pkl')
save_it(fwf, '10_class_balancing_fwf')

In [21]:
framework_a_first.head(1).T

| | 0 |
|---|---|
| location | Argentina-Buenos_Aires |
| date | 2016-03-19 00:00:00 |
| zika_bool | 1 |


In [22]:
fwf.head(1).T

| | 0 |
|---|---|
| location | Argentina-Buenos_Aires |
| cases_first_date | 130 |
| cases_total | 3028 |
| date | 2016-03-19 00:00:00 |
| zika_bool | 1 |


In [23]:
framework_a_first.zika_bool.value_counts()

1    1240
0     365
Name: zika_bool, dtype: int64

In [24]:
framework_a_max.zika_bool.value_counts()

1    1240
0     365
Name: zika_bool, dtype: int64
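
Unlike Framework B, Framework A does not produce equal class sizes. A small sketch of the resulting proportions, using the counts copied from the output above:

```python
import pandas as pd

# Class counts copied from the value_counts output above.
counts = pd.Series({1: 1240, 0: 365}, name='zika_bool')

# Fraction of samples in each class.
proportions = counts / counts.sum()
print(proportions.round(3))
```

Roughly three quarters of the samples fall in the outbreak class, so a classifier that always predicts an outbreak would reach about that accuracy; this may be a useful baseline to keep in mind when evaluating models on this data.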

Write the dataframes to pickle files.

In [25]:
framework_a_first.dtypes

location             object
date         datetime64[ns]
zika_bool             int64
dtype: object

In [26]:
save_it(framework_a_first, '10_class_balancing_framework_a_first')

In [27]:
save_it(framework_a_max, '10_class_balancing_framework_a_max')