In [None]:
"""Things to be done:
1 correlation
2 correlation with other pages
3 predictive model - neural network


1 compute another graph"""

# Near Real-Time Flu Estimation via Wikipedia

## Introduction



In this notebook we reproduce for Italy the paper by McIver et al., "Wikipedia usage estimates Prevalence of Influenza-like illness in the United States in near real time". Following that paper, we show a method of estimating, in near real time, the level of influenza-like illness (ILI) in Italy by monitoring the daily view counts of particular Wikipedia articles. We compute, on a weekly basis, the number of times certain influenza- or health-related Wikipedia articles were accessed and compare these counts with the data of "Influnet", one of the programs of the Italian health protection agency.

The notebook is organized into the following three sections:

* **Comparing Influnet and Wikipedia's Influenza click-through rate**
    * Retrieval and cleaning of the Wikipedia pages:
        * Using the third-party toolkit wikishark
        * Downloading the raw data from https://dumps.wikimedia.org/ (bonus point)
    * Plot the two scaled curves on the same figure
    * Compute correlation
* **Comparing Influnet data with other related Wikipedia pages:**
    * Retrieval and cleaning
    * Plotting and correlation analysis among each other
* **Estimate Flu outbreaks:**
    * Lasso
    * Jesus
    
Throughout the notebook we will make use of the following conventions:

## 1 - Comparison between Influnet and Wikipedia Influenza

In this section we define the auxiliary functions used to perform the data analysis more efficiently:

In [5]:
import pandas as pd
import numpy as np
from os import listdir
import matplotlib.pyplot as plt
from scipy.stats import pearsonr 
import seaborn as sb
import datetime as dt

In [36]:
def importInflunet(path):
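    '''
    Reads the Influnet files and merges them into a single dataframe indexed by
    week (each week is labelled by its Sunday) of the format
    
    time - incidence
    
    Weeks without Influnet data are filled with 0.
    
    :param path: location of the influnet folder
    :return: weekly incidence dataframe covering the whole observed period
    '''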
    parser = lambda d: dt.datetime.strptime(d + '-0', "%Y-%W-%w")
    
    df = pd.concat([pd.read_csv(path+t,
            names=["time","incidence"], sep=" ", parse_dates = ["time"], date_parser = parser, 
        header=1, usecols=[0,4], decimal=",") for t in listdir(path)])
    
    df = df.set_index(["time"], append=False)
    df["incidence"] = df["incidence"].astype(float)
    df = df.sort_index()
    
    df = df.groupby(df.index).sum()
    alpha = df.index[0]
    omega = df.index[-1]
    time_range = pd.date_range(alpha, omega, freq='W-SUN')
    df = df.reindex(time_range, fill_value=0)
    
    return df



In [37]:
def getWikiRaw(wikiPages, path = "/home/aalto/PycharmProjects/digitalepidemiology/data/"):
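    '''
    Loads the page-view files of the given Wikipedia pages and groups the views by week.

    :param wikiPages: list of the wikipages that we want to analyze
    :param path: location of the downloaded wikipedia pages
    :return: dataframe indexed by week (ending on Sunday) with one column of views per page
    '''
    df = pd.DataFrame()
    for wikiPage in wikiPages:
        # column 0 is parsed as the date index, column 1 holds the view counts
        wiki = pd.read_csv(path+wikiPage+".csv", usecols=[0,1], parse_dates=True, index_col=[0], header=None)
        wiki = wiki.resample("W-SUN").sum()
        df = df.reindex(wiki.index)
        # select the single remaining column so that a series is assigned
        df[wikiPage] = wiki.iloc[:, 0]
    return df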

In [38]:
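def comparePlots(elems, start):
    '''Plots the first column of each dataframe, scaled to a maximum of 1, for the years after start.'''
    for elem in elems:
        # rows selected by boolean mask, first column selected by position
        y = elem[elem.index.year > start].iloc[:, 0]
        y = y / y.max()
        plt.plot(y)
    plt.show()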

In [39]:
influnet = importInflunet("/home/aalto/Desktop/DE/hw2/influnet/data/")
wiki = getWikiRaw(["influenza2"])

comparePlots([influnet, wiki], 2010)

In [67]:
getCrossCorrelation([influnet, wiki], range(2010,2017))

          2010      2011      2012      2013      2014      2015      2016
2010  0.515360  0.734056  0.788339  0.667596  0.679576  0.616321  0.306227
2011  0.308532  0.905462  0.820030  0.912294  0.821094  0.914130  0.504102
2012  0.316096  0.858859  0.809108  0.877478  0.796107  0.887290  0.505945
2013  0.263040  0.764573  0.745169  0.826106  0.778834  0.826799  0.547922
2014  0.307226  0.822641  0.784128  0.855148  0.816775  0.850329  0.543867
2015  0.264612  0.861972  0.784301  0.883366  0.824500  0.903945  0.551272
2016  0.497960  0.311634  0.548219  0.508097  0.516761  0.449457  0.823771


At this point we need the Wikipedia page view counts in order to compare them with Influnet. To satisfy the requirement for the bonus point, I wrote a simple script that downloads the files from https://dumps.wikimedia.org/, scans them for the page titles we are interested in, and writes the matching entries to separate files.
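
The filtering step of such a script could look roughly like the sketch below. It is not the script used for this notebook: it assumes the hourly pagecounts files, in which every line has the form `project page_title view_count bytes`, and the function name `filterPagecounts` as well as the sample lines are hypothetical. Downloading and decompressing the files is left out.

```python
def filterPagecounts(lines, project, titles):
    """Sum the views of the requested titles found in one pagecounts file.

    Assumption: every line looks like "project page_title view_count bytes".
    """
    wanted = set(titles)
    views = dict((title, 0) for title in wanted)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed format
        line_project, title, count = fields[0], fields[1], fields[2]
        if line_project == project and title in wanted:
            views[title] += int(count)
    return views


# made-up sample lines in the assumed format
sample = [
    "it Influenza 12 345678",
    "it Febbre 7 123456",
    "en Influenza 90 999999",
    "it Roma 40 555555",
]
views = filterPagecounts(sample, "it", ["Influenza", "Febbre"])
print(views)
```

Matching on the project code as well as on the title keeps the views of the Italian page separate from those of pages with the same title in other languages.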

Once those files are collected and stored on disk, we can call `getWikiRaw`, which loads the requested pages into memory and groups their views by week.
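
The weekly grouping can be illustrated in isolation with a minimal sketch on made-up daily counts (the dates and values below are not taken from the dataset):

```python
import pandas as pd

# two weeks of made-up daily views, Monday 2015-01-05 to Sunday 2015-01-18
days = pd.date_range("2015-01-05", periods=14, freq="D")
daily = pd.Series([10] * 7 + [20] * 7, index=days)

# "W-SUN" closes each weekly bin on Sunday and labels it with that date
weekly = daily.resample("W-SUN").sum()
print(weekly)
```

Labelling each week by its Sunday gives the Wikipedia data the same weekly index that `importInflunet` builds for Influnet, so the two can be compared row by row.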


To check the previous step, we now load the Influnet dataset into memory and then plot it.
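
Loading Influnet relies on the parser inside `importInflunet`, which turns week labels into dates. The sketch below shows that conversion on its own; the label "2015-03" is made up and assumes that the time column of the Influnet files has the form year-week.

```python
import datetime as dt

# same parser as in importInflunet: appending "-0" selects the Sunday of the week
parser = lambda d: dt.datetime.strptime(d + '-0', "%Y-%W-%w")

# hypothetical label: week 3 of 2015 (%W counts weeks that start on Monday)
day = parser("2015-03")
print(day.strftime("%Y-%m-%d %A"))
```

Because every parsed date is a Sunday, the dates line up with the `W-SUN` range used to reindex the dataframe.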

## 2 - Other functions



At this point we are interested in finding out whether any other Wikipedia pages give us better insight into the topic.
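
One way to compare candidate pages is to rank them by their Pearson correlation with the Influnet incidence. The sketch below is only an illustration on made-up data: the helper `rankPages` and the page names are hypothetical, and it assumes that the incidence and the page views share the same weekly index.

```python
import pandas as pd

def rankPages(incidence, wikis):
    """Pearson correlation of every page column with the incidence, best first."""
    return wikis.corrwith(incidence).sort_values(ascending=False)

# made-up weekly data: one page follows the incidence, the other does not
weeks = pd.date_range("2015-01-04", periods=6, freq="W-SUN")
incidence = pd.Series([1.0, 3.0, 6.0, 4.0, 2.0, 1.0], index=weeks)
wikis = pd.DataFrame({
    "influenza": 100 * incidence + 5,           # linear in the incidence
    "vacanza": [5.0, 1.0, 2.0, 6.0, 3.0, 4.0],  # unrelated values
}, index=weeks)

ranking = rankPages(incidence, wikis)
print(ranking)
```

The Pearson correlation is unaffected by rescaling, so the pages do not need to be divided by their maximum before being ranked.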

In [140]:
def getWikis(path = "/home/aalto/PycharmProjects/digitalepidemiology/data/influenza1.csv"):
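    '''
    Loads the csv file that collects the views of several Wikipedia pages.

    :param path: location of the csv file with the downloaded wikipedia page views
    :return: dataframe indexed by date with one column per page
    '''
    df = pd.read_csv(path, parse_dates=['Date'])
    df = df.set_index(["Date"], append=False)
    del df["Week Number "]
    # drop the last 4 characters of every column name and lowercase the rest
    new_columns = dict((column,column[:-4].lower()) for column in df.columns.values)
    df = df.rename(columns = new_columns )
    # "unname" is presumably what is left of an "Unnamed: N" column created by pandas
    del df["unname"]
    return df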
    

In [229]:
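def getCrossCorrelation(elems, years):
    '''
    Computes the Pearson correlation between the Influnet incidence of each year
    and the Wikipedia page views of each year, then prints and plots the matrix
    (rows are Influnet years, columns are Wikipedia years).

    :param elems: pair [influnet, wiki]; influnet is a dataframe with an "incidence" column,
                  wiki is a series (or one-column dataframe) of page views, both indexed by date
    :param years: years to compare
    :return: None
    '''
    years = list(years)
    influnet, wiki = elems[0], elems[1]
    if isinstance(wiki, pd.DataFrame):
        wiki = wiki.iloc[:, 0]  # use the first column when a dataframe is passed
    heatmap = pd.DataFrame(np.zeros((len(years), len(years))), index=years, columns=years)

    for y1 in years:
        # Influnet incidence of year y1, scaled to a maximum of 1
        a = influnet.loc[influnet.index.year == y1, "incidence"]
        a = a / a.max()
        for y2 in years:
            b = wiki[wiki.index.year == y2]
            b = b / b.max()
            # the two years may differ in number of weeks: truncate to the common length
            minimum = min(len(a), len(b))
            heatmap.loc[y1, y2] = pearsonr(a.values[:minimum], b.values[:minimum])[0]
    print(heatmap)
    sb.heatmap(heatmap)
    plt.show()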

In [None]:
influnet = importInflunet("/home/aalto/Desktop/DE/hw2/influnet/data/")
wikis = getWikis()
for wiki in wikis:
    #getCrossCorrelation([influnet, wikis[wiki]], range(2010,2017))
    plt.plot(influnet/max(influnet["incidence"]))
    plt.plot(wikis[wiki]/max(wikis[wiki]))
    plt.show()

# Stuff
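
The helpers in this section pad the yearly Influnet data with zeros for the weeks outside the flu season. The padding idea can be sketched on a made-up year as follows (the week numbers and incidence values are invented for illustration):

```python
import pandas as pd

# made-up incidence of one year: no data between week 16 and week 41
observed = pd.DataFrame({"incidence": [5.1, 4.2, 0.6, 0.8, 1.5, 2.7]},
                        index=[1, 2, 16, 41, 42, 52])

# reindex over every week up to the last observed one, filling the gaps with 0
week_range = range(1, observed.index.values[-1] + 1)
padded = observed.reindex(week_range, fill_value=0)
print(padded.loc[[16, 17, 40, 41]])
```

Filling with 0 rather than leaving the weeks missing keeps one row per week, which is what the weekly Wikipedia counts provide.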

In [None]:
def padInflunet(aux, year):
    '''
    The Influnet dataset lacks information about the weeks that do not belong to the flu season (usually, but not necessarily, from week 17 to 40).
    This function fills those weeks with zeros in order to match the Wikipedia format.
    
    :param aux: Influnet dataframe of a specific year
    :param year: year of that Influnet dataframe
    :return: padded version of the original dataframe
    '''
    year_weeks = aux.index.values[-1]
    week_range = range(1,year_weeks+1)
    aux = aux.reindex(week_range, fill_value=0)
    aux["year"] = year
    aux["week"] = week_range
    
    aux.set_index(['year', 'week'], append=False, inplace=True)
    return aux


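def getInflunet(path = "/home/aalto/Desktop/DE/hw2/influnet/data/"):
    '''
    Imports and reformats the original Influnet dataset.
    Note: the loop below iterates over (year, week) index pairs, so it expects
    importInflunet to return a dataframe with that multi-index.
    
    :param path: location of the influnet folder
    :return: clean and padded version of the Influnet dataset
    '''
    
    df = importInflunet(path)
    padded = []
    previous = None
    for x, y in df.index.values:
        # pad each year only once, the first time it is encountered
        if x != previous:
            padded.append(padInflunet(df.loc[x], x))
        previous = x
    return pd.concat(padded)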