# Exercise 2: Analysing log data

The aim of this exercise is to analyse the user logs generated by the Meerkat system.

Every time a page is loaded or the API is called, the action is logged in a database. The logged data looks like this:

Metadata:

`
timestamp - time of event, defined at the client as the logging request is sent
type - Type of event:
	User-activity - user prompted activity in front end
	Batch-job - scheduled batch job
	Admin - administrative activity
source - deployment ID
source_type - frontend/api/abacus/auth etc: Module that is performing the logging
implementation - country or organization name
Event Data (JSON): 
    User-activity: 
        {
            "path": <relative path>, 
            "role": <user roles>,
            "user": <user account>, 
            "base_url": <accessed url>, 
            "full_url": <full accessed url>, 
            "request_time": <time spent to deliver request response>
         }
`

The main questions we want to explore are the following: 

* Who are the most active users?
* What pages are they visiting?
* Which URLs take the most time to respond, and is there large variability in the request time?
* What time of day are people accessing the site?


This data can be explored in two ways. The first, which this tutorial follows, uses the Python pandas package. The second uses SQL; ask a demonstrator to help you start an SQL session to access the data.

This notebook contains some help for getting the log data into pandas; you can then explore the data on your own to answer the questions above. For an overview of what pandas can do, see this resource: https://pandas.pydata.org/pandas-docs/stable/10min.html

In [18]:
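import pandas as pd
import sqlalchemy
%matplotlib notebook

# Connect to the local PostgreSQL database that holds the event log
engine = sqlalchemy.create_engine("postgresql+psycopg2://postgres:postgres@localhost/event_db")
# Load the whole log table into a DataFrame
data = pd.read_sql_query("SELECT * from log", engine)
# Expand the JSON in event_data into one column per key, next to the metadata columns
data = pd.concat([data, data["event_data"].apply(pd.Series)], axis=1)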
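
To see what the expansion of `event_data` does without needing the database, here is a minimal sketch on a hand-made table. The rows and the user names `alice` and `bob` are invented for illustration. It assumes every row holds a dict in `event_data`, and it builds the new columns with `pd.DataFrame(...)` from a list of dicts, which is an alternative to the `apply(pd.Series)` call used in the cell, not the same call.

```python
import pandas as pd

# Hypothetical rows shaped like the log table described above;
# the values are invented for illustration only.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2017-09-01 08:15:00", "2017-09-01 14:40:00"]),
    "type": ["User-activity", "User-activity"],
    "event_data": [
        {"path": "/en/", "user": "alice", "request_time": 0.12},
        {"path": "/api/locations", "user": "bob", "request_time": 0.48},
    ],
})

# Turn each dict into a row of columns, keeping the original index so the
# new columns line up with the metadata columns when concatenated.
expanded = pd.DataFrame(toy["event_data"].tolist(), index=toy.index)
toy = pd.concat([toy.drop(columns="event_data"), expanded], axis=1)

print(toy.columns.tolist())
```

After this step each key of the JSON object (`path`, `user`, `request_time`, ...) is an ordinary column, which is why later cells can write `data["user"]` directly.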

In [20]:
# Examples

data["user"].value_counts()

cd-clinic         1126
ncd-clinic         585
rania              370
cd-dir             207
refqi              132
gunnar             107
jberry              81
admin               46
sami                28
report-emails       20
jsoppela            12
reports-jor-cd       2
Name: user, dtype: int64
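
The counts above answer who is most active. To also see which pages each user visits, one option is to group by both `user` and `path`. The sketch below runs on invented toy data so it can be tried without the database; on the real data you would replace `toy` with `data`, assuming both columns are present after the expansion step.

```python
import pandas as pd

# Invented example rows; only the column names match the real log.
toy = pd.DataFrame({
    "user": ["alice", "alice", "alice", "bob", "bob"],
    "path": ["/en/", "/en/", "/api/locations", "/en/", "/en/login"],
})

# Count requests per (user, path) pair, then sort so that each user's
# most visited paths come first.
visits = (
    toy.groupby(["user", "path"])
    .size()
    .rename("visits")
    .reset_index()
    .sort_values(["user", "visits"], ascending=[True, False])
)

# Keep at most the two busiest paths per user.
top_paths = visits.groupby("user").head(2)
print(top_paths)
```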

In [23]:
data["base_url"].value_counts()

http://jordan.emro.info/en/                                                                                            714
http://jordan.emro.info/                                                                                               710
http://jordan.emro.info/en                                                                                             704
http://jordan.emro.info/api/key_indicators                                                                             702
http://iers.moh.gov.jo/api/locations                                                                                   361
http://iers.moh.gov.jo/en/login                                                                                        253
http://iers.moh.gov.jo/api/locationtree                                                                                233
http://iers.moh.gov.jo/api/variables/pc                                                                                168
http://iers.moh.
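
For the time-of-day question, the `timestamp` column can be reduced to an hour of day and counted. This is a minimal sketch on invented timestamps. It assumes the column can be parsed by `pd.to_datetime`, and it ignores time zones, which may matter because the timestamp is defined at the client.

```python
import pandas as pd

# Invented timestamps for illustration only.
toy = pd.DataFrame({
    "timestamp": ["2017-09-01 08:15:00", "2017-09-01 08:50:00",
                  "2017-09-01 14:40:00", "2017-09-02 08:05:00"],
})

# Parse to datetimes (harmless if the column is already datetime64).
ts = pd.to_datetime(toy["timestamp"])

# Count events per hour of day; hours with no events are filled with zero
# so that all 24 hours appear in order.
per_hour = ts.dt.hour.value_counts().reindex(range(24), fill_value=0)
print(per_hour[per_hour > 0])
```

With the real data, `per_hour.plot(kind="bar")` would draw the distribution in the notebook.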

In [25]:
data["request_time"].max(), data["request_time"].mean()

(40.610305309295654, 0.66513522626156985)
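
A maximum of about 40 seconds against a mean of about 0.67 seconds suggests that a few very slow requests pull the mean upwards. In that situation the median and upper quantiles often describe typical behaviour better. The sketch below uses invented numbers to show the idea; on the real data the same call would be `data["request_time"].quantile([0.5, 0.9, 0.99])`.

```python
import pandas as pd

# Invented request times in seconds: mostly fast, with one slow outlier.
request_time = pd.Series([0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0, 40.0])

# Median, 90th and 99th percentile of the request time.
summary = request_time.quantile([0.5, 0.9, 0.99])
print(summary)

# The single outlier makes the mean much larger than the median.
print("mean:", request_time.mean())
```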

In [46]:
data.groupby("path")["request_time"].agg(
    ["max", "min", "mean", "std", "count"]
).sort_values(by="mean", ascending=False)

path,max,min,mean,std,count
/aggregate_category/visit/5/2017,40.610305,40.610305,40.610305,,1
/en/reports/non_communicable_diseases_return_visits~1~2017-08-31T00:00:00.000Z~2017-08-01T00:00:00.000Z.pdf,20.046369,20.046369,20.046369,,1
/aggregate_category/pc/5/2017/tot_1,20.020204,20.020204,20.020204,,1
/aggregate_category/Chapter/5/2017/tot_1,20.008009,20.008009,20.008009,,1
/en/reports/non_communicable_diseases_new_visits~7~2017-08-31T00:00:00.000Z~2017-08-01T00:00:00.000Z.pdf,17.329325,17.329325,17.329325,,1
/en/reports/non_communicable_diseases_return_visits/1/2017-08-31T00:00:00.000Z/2017-08-01T00:00:00.000Z/,16.380693,15.012480,15.652493,0.688356,3
/reports/ncd_report_return_visits/1/2017-08-31T00:00:00.000Z/2017-08-01T00:00:00.000Z,15.764689,14.939328,15.383811,0.416341,3
/aggregate_category/visit/6/2017,29.339819,0.807180,15.073500,20.175622,2
"/completeness/reg_11/37/6/1/4,5",21.474913,9.520266,14.767030,5.086038,4
/query_category/cd_tab/locations:region/2017-01-01T01:00:00.000Z/2017-09-21T00:00:00.000Z,12.769280,12.769280,12.769280,,1
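
Several of the slowest paths above were requested only once, so their `std` is empty and their mean rests on a single measurement. One way to focus on paths with enough data is to filter on `count` before sorting. The sketch below uses invented data, and the threshold `MIN_REQUESTS` is an arbitrary value chosen for this illustration, not a recommendation.

```python
import pandas as pd

# Invented example: one path seen once, two paths seen three times each.
toy = pd.DataFrame({
    "path": ["/slow-once", "/busy", "/busy", "/busy",
             "/steady", "/steady", "/steady"],
    "request_time": [40.0, 1.0, 2.0, 9.0, 3.0, 3.1, 2.9],
})

stats = toy.groupby("path")["request_time"].agg(["mean", "median", "std", "count"])

MIN_REQUESTS = 3  # arbitrary threshold chosen for this illustration

# Keep only paths with enough requests, slowest on average first.
frequent = stats[stats["count"] >= MIN_REQUESTS].sort_values("mean", ascending=False)
print(frequent)
```

Comparing `std` with `mean` for the remaining paths gives a rough sense of how variable each one is.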
