<h1 style="text-align:center;text-decoration: underline">Getting Started Tutorial</h1>
<h1>Overview</h1>
<p>Welcome to the getting started tutorial for EpiData's Jupyter Notebook interface. In this tutorial we will query, retrieve and analyze sample weather data acquired from a simulated wireless sensor network.</p>
<p>Note: The tutorial assumes that you have a working knowledge of Jupyter Notebook.</p>

<h2>Package and Module Imports</h2>
<p>As a first step, we will import packages and modules required for this tutorial. Since <i>EpiData Context (ec)</i> is required to use the application, it is implicitly imported. Other modules, such as <i>datetime</i>, <i>pandas</i> and <i>matplotlib</i>, can be imported at this time. Let's run the cell below to import these modules.
</p>

In [None]:
#from epidata.context import ec
from datetime import datetime, timedelta
import pandas as pd
import matplotlib.pyplot as plt

<h2>Query and Retrieve</h2>

<h3>1. Query</h3>
<p>Data stored in the database can be queried by specifying the values of primary keys, start time and stop time. Below are the required primary keys for the current dataset:
<ul>
<li><i>company, site, station, sensor</i></li>
</ul>
</p>
<p>
In many cases, the values stored in the primary keys are not known in advance. The <i>ec.list_keys()</i> function returns the valid combinations of primary key values. Let's run the code below to see these values for the sample weather data.
</p>

In [None]:
keys = ec.list_keys()
keys.toPandas()

<p>Now that we know the valid primary key values for the sample weather data, we can pass them to the <i>ec.query_measurements_original()</i> function, together with a start time and a stop time. The function returns the query result as an <i>EpiData DataFrame</i>.</p>

In [None]:
primary_key={"company": "EpiData", "site": "San_Jose", "station":"WSN-1", "sensor": ["Temperature_Probe","Anemometer","RH_Probe"]}
start_time = datetime.strptime('8/1/2017 00:00:00', '%m/%d/%Y %H:%M:%S')
stop_time = datetime.strptime('8/31/2017 00:00:00', '%m/%d/%Y %H:%M:%S')
df = ec.query_measurements_original(primary_key, start_time, stop_time)
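
<p>The query window above is written as two parsed timestamps. As a minimal sketch using only the standard library, the same window could also be derived from a start time and a duration with <i>timedelta</i>, which this tutorial already imports. The variable name <i>window</i> is introduced here for illustration only and is not part of the EpiData API.</p>

```python
from datetime import datetime, timedelta

# Parse the start of the window with the same format string used in the tutorial.
start_time = datetime.strptime('8/1/2017 00:00:00', '%m/%d/%Y %H:%M:%S')

# Hypothetical helper variable: the length of the query window.
window = timedelta(days=30)

# Adding the duration to the start time gives the stop time.
stop_time = start_time + window

print("Start: {}".format(start_time))
print("Stop:  {}".format(stop_time))
```

<p>Expressing the window as a duration makes it easy to shift or resize the query range by changing a single value.</p>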

<h3>2. Retrieve</h3>
<p>Data is retrieved from the database as an <i>EpiData DataFrame</i>. To save memory and compute resources, we can reduce the size of the data with the <i>df.select()</i> function, which keeps only the named fields. In the cell below, we will select the fields of interest, retrieve the data and count the number of records.</p>

In [None]:
df = df.select("site", "station", "ts", "meas_name", "meas_value", "meas_unit")
print("Number of records: {}".format(df.count()))

<p>Data can also be retrieved as a <i>pandas DataFrame</i> using the <i>toPandas()</i> function. Let's perform this operation and take a look at the first 5 records of our sample data.</p>

In [None]:
dflocal = df.toPandas()
dflocal.head(5)

<h2>Data Analysis</h2>
<p>Once the data is available in a <i>pandas DataFrame</i>, we can call any of the data analysis functions available in the <i>pandas</i> library. Let's start by computing basic statistics such as the min, max, mean, standard deviation and percentiles of the temperature measurements.</p>

In [None]:
dflocal = dflocal.loc[dflocal["meas_name"]=="Temperature"]
dflocal["meas_value"].describe()
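
<p>The cell above filters to one measurement before summarizing it. As a hedged sketch, the same statistics could be computed for every measurement type at once by grouping on <i>meas_name</i>. The data below is synthetic and stands in for the retrieved records; the values, and the name <i>Wind_Speed</i>, are assumptions made for illustration.</p>

```python
import pandas as pd

# Synthetic stand-in for the retrieved data; all values are made up for illustration.
sample = pd.DataFrame({
    "meas_name": ["Temperature", "Temperature", "Temperature",
                  "Wind_Speed", "Wind_Speed"],  # "Wind_Speed" is a hypothetical name
    "meas_value": [70.0, 72.0, 74.0, 5.0, 7.0],
})

# Group by measurement name and summarize each group's values in one call.
stats = sample.groupby("meas_name")["meas_value"].describe()
print(stats)
```

<p>Each row of the result holds the count, mean, standard deviation, min, quartiles and max for one measurement name.</p>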

<p>Next, we'll look at the distribution of the temperature measurements using a histogram.</p>

In [None]:
plt.rcParams["figure.figsize"] = [10,5]
plt.title("Histogram - Temperature Measurements")
plt.xlabel("Temperature (deg F)")
plt.ylabel("Frequency")
dflocal["meas_value"].hist()

<p>As we can see, most of the temperature measurements in our sample data are moderate. However, some measurements are unusually high. Let's identify these outliers using a simple rule: a measurement is flagged when it lies more than three standard deviations from the sample mean.</p>

In [None]:
outliers = dflocal.loc[abs(dflocal["meas_value"] - dflocal["meas_value"].mean()) > abs(3*dflocal["meas_value"].std())]
print("Number of Outliers: {}".format(outliers["meas_value"].count()))
outliers.head()
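
<p>The rule used above can be sketched as a small reusable function. This is an illustration under stated assumptions, not part of EpiData: the function name <i>three_sigma_outliers</i> and the readings are hypothetical, and the readings are synthetic.</p>

```python
import pandas as pd

def three_sigma_outliers(values):
    """Return the entries lying more than three standard deviations from the mean."""
    deviation = (values - values.mean()).abs()
    return values[deviation > 3 * values.std()]

# Synthetic readings: twenty moderate values plus one unusually high value.
readings = pd.Series([68.0, 69.0, 70.0, 71.0, 72.0] * 4 + [150.0])

flagged = three_sigma_outliers(readings)
print("Number of outliers: {}".format(flagged.count()))
```

<p>Note that the mean and standard deviation are themselves pulled toward extreme values, so in small samples this rule can miss outliers that a more robust method would flag.</p>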

<h2>Next Steps</h2>
<p>Congratulations, you have successfully queried, retrieved and analyzed sample data acquired by a wireless sensor network. The next step is to explore the various capabilities of EpiData by creating your own Jupyter Notebook. Happy Data Exploring!</p>