# Data Science Functions Module and Influenza Exploratory Analysis

Lucas Nguyen 
UCSD

Description:

The goal of this project is to build a module of functions that automate common steps in the data science process. Each function replaces a frequently repeated task, so users don't have to redo the same work every time they analyze a new dataset. The functions are kept abstract enough to apply to a variety of data science tasks.

We demonstrate these functions in an exploratory data analysis of Google Flu Trends and CDC influenza data.
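
As a rough illustration of the kind of helper such a module might contain, here is a minimal sketch of a row-preview function. The name `peek_rows` and its parameters are assumptions made for this example; they are not taken from the project's `functions` module.

```python
import pandas as pd


def peek_rows(df, start=0, stop=None, step=1):
    """Return every `step`-th row of `df` between positions `start` and `stop`.

    Hypothetical helper for illustration; not the project's own `head_n`.
    """
    return df.iloc[start:stop:step]


# Toy data standing in for a real dataset.
toy = pd.DataFrame({'value': range(10)})
preview = peek_rows(toy, start=2, stop=9, step=3)
print(preview)
```

Wrapping a one-line slice like this only pays off if the same preview is needed across many datasets, which is the situation the module is aimed at.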


In [102]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime as dt
import functions as f

In [29]:
cdc_df = pd.read_csv('data/fv2/ILINet.csv', skiprows=1)

In [87]:
ca_df = pd.read_csv('data/ca2.csv', skiprows=2)

In [103]:
f.head_n(ca_df, 3, 20, 30)

Month,influenza: (California),flu: (California)
2006-07,1,4
2006-10,1,16
2007-01,1,10
2007-04,1,4
2007-07,<1,3
2007-10,1,13
2008-01,1,8
2008-04,1,4
2008-07,1,2
2008-10,1,12
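
The `<1` entry in the preview above suggests that the `influenza: (California)` column is read as text rather than as numbers. One possible way to make it numeric is sketched below on toy values; replacing `<1` with 0.5 is an assumption for illustration, not a choice made in this notebook.

```python
import pandas as pd

# Toy values mimicking the Google Trends column shown above.
raw = pd.Series(['1', '<1', '2'], name='influenza: (California)')

# Assumption: treat '<1' as 0.5; pick whatever value suits the analysis.
numeric = pd.to_numeric(raw.replace('<1', '0.5'))
print(numeric.tolist())
```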


In [89]:
# ca_df['Date'] = ca_df.apply(lambda x: dt.strptime(x[0], '%Y-%m'), axis=1)
# ca_df.head()

In [90]:
# Use the 'Month' column ('YYYY-MM') as the index.
ca_df = ca_df.set_index('Month')
ca_df.head()

Month,influenza: (California),flu: (California)
2004-01,1,9
2004-02,2,7
2004-03,1,3
2004-04,1,3
2004-05,1,2


In [6]:
cdc_df.head()

Unnamed: 0,REGION TYPE,REGION,YEAR,WEEK,% WEIGHTED ILI,%UNWEIGHTED ILI,AGE 0-4,AGE 25-49,AGE 25-64,AGE 5-24,AGE 50-64,AGE 65,ILITOTAL,NUM. OF PROVIDERS,TOTAL PATIENTS
0,States,California,2010,40,X,1.95412,X,X,X,X,X,X,632,112,32342
1,States,California,2010,41,X,2.15266,X,X,X,X,X,X,742,122,34469
2,States,California,2010,42,X,2.24173,X,X,X,X,X,X,766,126,34170
3,States,California,2010,43,X,1.91748,X,X,X,X,X,X,666,130,34733
4,States,California,2010,44,X,2.52326,X,X,X,X,X,X,887,131,35153


In [44]:
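# Build a date for the Monday of each reporting week from YEAR and WEEK.
# '%U' counts weeks starting on Sunday, which may differ slightly from the CDC's MMWR week definition.
cdc_df['Date'] = cdc_df.apply(lambda x: dt.strptime('{}{}-1'.format(x['YEAR'], x['WEEK']), '%Y%U-%w'), axis=1)
cdc_df['Month'] = cdc_df['Date'].dt.month
cdc_df.head()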

Unnamed: 0,REGION TYPE,REGION,YEAR,WEEK,% WEIGHTED ILI,%UNWEIGHTED ILI,AGE 0-4,AGE 25-49,AGE 25-64,AGE 5-24,AGE 50-64,AGE 65,ILITOTAL,NUM. OF PROVIDERS,TOTAL PATIENTS,Date,Month
2010-10-04,States,California,2010,40,X,1.95412,X,X,X,X,X,X,632,112,32342,2010-10-04,10
2010-10-11,States,California,2010,41,X,2.15266,X,X,X,X,X,X,742,122,34469,2010-10-11,10
2010-10-18,States,California,2010,42,X,2.24173,X,X,X,X,X,X,766,126,34170,2010-10-18,10
2010-10-25,States,California,2010,43,X,1.91748,X,X,X,X,X,X,666,130,34733,2010-10-25,10
2010-11-01,States,California,2010,44,X,2.52326,X,X,X,X,X,X,887,131,35153,2010-11-01,11
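
To make the format string in the cell above easier to follow, here is a small standalone sketch of the same parsing step. The function name `week_start` is introduced only for this example.

```python
from datetime import datetime


def week_start(year, week):
    """Monday of the given week, using strptime's Sunday-based '%U' week numbers."""
    # '%w' with value 1 selects Monday; without a weekday, '%U' is ignored.
    return datetime.strptime('{}{}-1'.format(year, week), '%Y%U-%w')


print(week_start(2010, 40))  # 2010-10-04 00:00:00
print(week_start(2010, 44))  # 2010-11-01 00:00:00
```

These two dates match the first and last rows of the output above.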


In [71]:
# Total ILI cases per calendar month.
cdc_df_clean = cdc_df.groupby(['YEAR', 'Month'], as_index=False)['ILITOTAL'].sum()
# Label each row 'YEAR-Month'. Note: months are not zero-padded here (e.g. '2011-1'),
# whereas ca_df uses 'YYYY-MM' labels (e.g. '2011-01').
cdc_df_clean.index = cdc_df_clean.apply(lambda x: '{}-{}'.format(x['YEAR'], x['Month']), axis=1)
cdc_df_clean.head()

Unnamed: 0,YEAR,Month,ILITOTAL
2010-10,2010,10,2806
2010-11,2010,11,4481
2010-12,2010,12,4215
2011-1,2011,1,7383
2011-2,2011,2,7805


In [40]:
date_range = cdc_df['Date'].min(), cdc_df['Date'].max()
date_range

(Timestamp('2010-10-04 00:00:00'), Timestamp('2019-12-02 00:00:00'))

In [91]:
cdc_df_clean.join(ca_df)

Unnamed: 0,YEAR,Month,ILITOTAL,influenza: (California),flu: (California)
2010-10,2010,10,2806,1,9.0
2010-11,2010,11,4481,1,6.0
2010-12,2010,12,4215,1,6.0
2011-1,2011,1,7383,,
2011-2,2011,2,7805,,
2011-3,2011,3,5722,,
2011-4,2011,4,3365,,
2011-5,2011,5,3270,,
2011-6,2011,6,1911,,
2011-7,2011,7,1356,,
