# Time Series - Active Users 

## Project problem

A web-based application is used by a university's staff members, students and agencies. At the moment we track how many active users connect every month, but there is no forecasting model in place to predict active users in the following weeks.

**Goal:** This notebook uses time series analysis to build a model that predicts the number of active users in the following week.

**Impact:** Because the web application is deployed in the cloud, being able to predict active users could improve how effectively cloud services are provisioned and adjusted to demand. It could potentially save the company some money.

**Hypothesis:** Based on the university placements created, a seasonality component and the number of active users in the previous period, we could predict whether the number of active users will be higher or lower.
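
To make the hypothesis more concrete, the sketch below shows one way the "previous period" and "seasonality" inputs could be expressed as columns of a weekly table. The values, the column names `active_users_prev` and `week_of_year`, and the use of the ISO week number as a seasonality proxy are all assumptions for illustration. Placement data is left out because it is not part of the dataset yet.

```python
import pandas as pd

# Hypothetical weekly series of active users (values are made up for illustration).
weekly = pd.DataFrame(
    {"active_users": [120, 135, 150, 90, 160, 170]},
    index=pd.date_range("2018-01-07", periods=6, freq="W"),
)

# Active users in the previous period (one-week lag).
weekly["active_users_prev"] = weekly["active_users"].shift(1)

# A simple seasonality proxy (assumption): the ISO week number of each observation.
weekly["week_of_year"] = weekly.index.isocalendar().week.astype(int)

# The first row has no previous week, so drop it before modelling.
features = weekly.dropna()
print(features)
```

A table shaped like `features` could then be passed to a regression or time series model, with `active_users` as the target.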

## Dataset

The dataset is a set of **IIS logs** from a client in Canberra **from 2017-12-07 to 2018-10-10**.

Variable | Description | Type of variable
---|---|---
 date          | Date of the logged event|continuous
 time          | Time of the logged event|continuous
 server-ip     | IP of the server|categorical
 cs-uri-query  | Part of the URL accessed by the client|categorical
 server-port   | Server port used for serving the page|categorical
 cs-username   | Client user. A high percentage of requests are missing the user|categorical
 client-ip     | IP of the client|categorical
 cs(User-Agent)| HTTP request header with information about the browser used and the type of device|categorical
 cs(Referer)   | HTTP request header with the URL of the page from which the request originated|categorical
 sc-status     | HTTP response status|categorical
 sc-substatus  | HTTP response sub-status|categorical
 time-taken(ms)| Time taken to respond, in ms|continuous
 client-city   | City from which the client connected. Derived from the IP|categorical
 client-country| Country from which the client connected. Derived from the IP|categorical
 client-device | Type of device used by the client to access the website (Desktop or Mobile)|categorical
 client-browser| Browser used by the client to access the website. Derived from cs(User-Agent)|categorical
 client-webPage| Web page accessed by the client. Derived from cs(User-Agent)|categorical
 
**To do: include placements with students and placements with no students, aggregated per week. Also include user-related data if possible, to predict by type of user (agency, staff, student).**
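
The derived columns `client-device` and `client-browser` in the table above come from the User-Agent string. The function below is a minimal sketch of how such a derivation might look. The matching rules and the name `classify_user_agent` are assumptions for illustration and are not taken from myLogReader.py.

```python
import re


def classify_user_agent(user_agent):
    """Return (device, browser) guessed from a User-Agent string.

    The rules are illustrative assumptions, not the logic in myLogReader.py.
    """
    ua = user_agent or ""
    # Common mobile markers; anything else is treated as a desktop client.
    device = "Mobile" if re.search(r"Mobi|Android|iPhone|iPad", ua) else "Desktop"
    # Order matters: Chrome's User-Agent also contains "Safari".
    if "Firefox/" in ua:
        browser = "Firefox"
    elif "Chrome/" in ua:
        browser = "Chrome"
    elif "Safari/" in ua:
        browser = "Safari"
    else:
        browser = "Other"
    return device, browser


# IIS replaces spaces in the User-Agent with "+", which does not affect the checks.
sample = (
    "Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+"
    "(KHTML,+like+Gecko)+Chrome/69.0.3497.100+Safari/537.36"
)
print(classify_user_agent(sample))
```

With these simplified rules, Chromium-based browsers such as Edge would be counted as Chrome. A dedicated User-Agent parsing library would be more precise.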

## Domain Knowledge

This is a new area for me, as I have not been actively involved in the development of this product. After asking colleagues at work, the conclusion was that, as of now, there is no existing process to forecast active users.

There are already tools on the market, like Google Analytics, that give insights based on the IIS logs, but I have not seen a tool that provides a forecast of active users from the IIS logs out of the box.

## Project concerns
**Risks:** 

1) cs-username is missing in many of the observations. It might not be good enough to identify what type of user is connecting to the website (staff, agency, student). If that is the case, we will just forecast active users as a whole.

2) The features and the model may yield very low accuracy, below 60%.
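
The first risk can be quantified before any modelling. The sketch below measures the share of requests without a user name, assuming the raw logs mark an empty field with `-` (the W3C log convention used by IIS). The sample rows are made up.

```python
import pandas as pd

# Hypothetical sample of parsed log rows; "-" marks an empty field.
logs = pd.DataFrame({"cs-username": ["-", "alice", "-", "bob", "-"]})

# Treat "-" as missing, then measure how much of the column is unusable.
usernames = logs["cs-username"].mask(logs["cs-username"] == "-")
missing_share = usernames.isna().mean()
print(f"Missing cs-username: {missing_share:.0%}")
```

If the share is high across the real logs, forecasting active users as a whole is probably the safer option.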


## Outcomes
For the project to be a success, the prediction accuracy should be at least 60%. If the project fails, I will try other features or time series models, with the goal of improving the accuracy to at least 60%.
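
The notebook does not define how accuracy is measured for a forecast. One possible reading, used here purely as an assumption, is 1 minus the mean absolute percentage error (MAPE). The function name `forecast_accuracy` and the sample values are hypothetical.

```python
import numpy as np


def forecast_accuracy(actual, predicted):
    """Return 1 - MAPE, one possible reading of 'prediction accuracy'.

    Assumes every actual value is non-zero.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mape = np.mean(np.abs((actual - predicted) / actual))
    return 1.0 - mape


# Made-up weekly values, for illustration only.
actual = [100, 200, 400]
predicted = [90, 220, 300]
accuracy = forecast_accuracy(actual, predicted)
meets_target = accuracy >= 0.60
print(f"accuracy={accuracy:.2f}, meets 60% target: {meets_target}")
```

Other error measures, such as mean absolute error, would work as well. What matters is fixing the metric before comparing models.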

## EDA

For the EDA, since we have one file for each day, the intention is to load all the files into a dataframe, enhance the data and aggregate it per week. To keep the notebook as tidy as possible, most of the logic to automate this process has been moved to a Python module called **myLogReader.py**.

Below is a summary of the automated steps:

- List every log file in the local folder
- For each file:
    - Load the file into a dataframe
    - Enhance the dataframe, deriving city, country, device type, browser used, weekday, etc.
    - **Default NaN** values for city, user name, webpage to **Unknown**
    - For every week (7 log files) loaded, aggregate the data and calculate the number of distinct connections, total number of connections, response time taken (ms), number of distinct user names, etc.
    
![title](../data/img/Aggregated_data.JPG)
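
The steps above can be sketched with plain pandas. This is not the code in myLogReader.py. The helper names `read_iis_log` and `aggregate_weekly`, the in-memory sample logs and the use of ISO calendar weeks are assumptions made for illustration. The raw field names (`c-ip`, `cs-username`, `time-taken`) follow the standard IIS W3C log format.

```python
import io

import pandas as pd


def read_iis_log(handle):
    """Parse one W3C-format IIS log into a DataFrame (sketch, not myLogReader)."""
    columns, rows = [], []
    for line in handle:
        line = line.strip()
        if line.startswith("#Fields:"):
            columns = line.split()[1:]  # field names follow the directive
        elif line and not line.startswith("#"):
            rows.append(line.split())  # IIS separates values with spaces
    return pd.DataFrame(rows, columns=columns)


def aggregate_weekly(df):
    """Aggregate raw requests into one row per ISO calendar week."""
    df = df.copy()
    # Default missing user names to "Unknown".
    df["cs-username"] = df["cs-username"].replace("-", "Unknown")
    iso = pd.to_datetime(df["date"]).dt.isocalendar()
    df["calendar-year-week"] = (
        iso["year"].astype(str) + "-" + iso["week"].astype(str).str.zfill(2)
    )
    df["time-taken"] = df["time-taken"].astype(float)
    return df.groupby("calendar-year-week").agg(**{
        "client-ip-unique-count": ("c-ip", "nunique"),
        "cs-username-unique-count": ("cs-username", "nunique"),
        "client-connections-count": ("c-ip", "count"),
        "time-taken(ms)-sum": ("time-taken", "sum"),
    })


# Two tiny made-up daily logs standing in for files such as u_ex171216.log.
day1 = io.StringIO(
    "#Fields: date time c-ip cs-username time-taken\n"
    "2017-12-16 10:00:00 10.0.0.1 alice 120\n"
    "2017-12-16 10:05:00 10.0.0.2 - 80\n"
)
day2 = io.StringIO(
    "#Fields: date time c-ip cs-username time-taken\n"
    "2017-12-18 09:00:00 10.0.0.1 alice 200\n"
)
raw = pd.concat([read_iis_log(day1), read_iis_log(day2)], ignore_index=True)
weekly = aggregate_weekly(raw)
print(weekly)
```

Reading the real files would only require replacing the `io.StringIO` objects with opened log files. The enrichment steps (city, country, device, browser) are omitted here to keep the sketch short.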


#### Load python libraries

In [2]:
import pandas as pd
import statsmodels.api as sm
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import geoip2.database
import myLogReader as mlr
import re
import os
import sys
import datetime as dt

%matplotlib inline

#### Load and transform log files into a data frame

In [13]:
logsPath = '../data/logs'
geoLiteIPDBPath = '../data/GeoLite2-City_20181009/GeoLite2-City.mmdb'

# Create a myLogReader object
myLogReader = mlr.log()
# Open the reader with the GeoLite2 city database
myLogReader.openReader(geoLiteIPDBPath)

In [11]:
# Read the logs, aggregating every 7 log files (one week)
df = myLogReader.readLogs(logsPath, 7)

../data/logs\u_ex171216.log
../data/logs\u_ex171217.log
../data/logs\u_ex171218.log
../data/logs\u_ex171219.log
../data/logs\u_ex171220.log
../data/logs\u_ex171221.log
../data/logs\u_ex171222.log


In [5]:
#Close Reader
myLogReader.closeReader()

In [12]:
df.head()

calendar-year-week | client-ip-unique-count | cs-username-unique-count | client-connections-count | time-taken(ms)-sum | Chrome-count | Firefox-count | Other-count | Safari-count | Desktop-count | Mobile-count
---|---|---|---|---|---|---|---|---|---|---
2017-50 | 225 | 91 | 13730 | 6799558.0 | 5098 | 289 | 5892 | 2451 | 12460 | 1270
2017-51 | 715 | 347 | 70252 | 38674394.0 | 36005 | 5803 | 16593 | 11851 | 63201 | 7051
