# ESTIMATE AVERAGE NUMBER OF EVENTS PER USER PER DAY

## Introduction
It's important to know how many events each user generates on average, because this directly affects the accuracy of our analysis. It also has implications for the preprocessing stages of the analysis.
For example, we will likely drop some users from the analysis as follows:
1. Users with a small number of events per day, because it will be hard to determine trips for such users. The exact threshold will be determined at a later stage.
2. Users with too many events, as this may indicate that those numbers are used for business rather than personal use.
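
The two exclusion rules above can be sketched in plain Python. The thresholds `min_events_per_day` and `max_events_per_day` and the sample averages are placeholders chosen for illustration only; the real cut-offs have not been decided yet.

```python
def keep_users(avg_events_per_day, min_events_per_day, max_events_per_day):
    """Return the users whose daily average lies within [min, max].

    avg_events_per_day: dict mapping user id -> average events per day.
    Both thresholds are hypothetical; the real values are still to be decided.
    """
    return {
        uid: avg
        for uid, avg in avg_events_per_day.items()
        if min_events_per_day <= avg <= max_events_per_day
    }


# Made-up averages for three illustrative users
sample = {"user_a": 0.5, "user_b": 12.0, "user_c": 900.0}
kept = keep_users(sample, min_events_per_day=2, max_events_per_day=200)
print(sorted(kept))
```

Here `user_a` falls below the lower bound and `user_c` exceeds the upper bound, so only `user_b` is kept.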

In [3]:
import os
from datetime import datetime
import pandas as pd
import numpy as np
from pyspark.sql.types import Row
from pyspark.sql.types import IntegerType, DateType, TimestampType, StringType
from pyspark.sql.functions import collect_set, from_unixtime, unix_timestamp, col, udf, datediff, count


## Processing environment setup

In [7]:
# Replace with your values
# NOTE: Set the access to this notebook appropriately to protect the security of your keys.
# Or you can delete this cell after you run the mount command below once successfully.
#YOUR_STORAGE_ACCOUNT_NAME = "REPLACE_WITH_YOUR_AZURE_BLOB"
STORAGE_ACCOUNT_NAME = "c344850"
#YOUR_CONTAINER_NAME = "REPLACE_WITH_YOUR_AZURE_CONTAINER"
CONTAINER_NAME = "freetown-sampledata"
#MOUNT_NAME = "REPLACE_WITH_YOUR_MOUNT_NAME"
MOUNT_NAME = "sample"

#ACCESS_KEY = "fs.azure.account.key.YOUR_STORAGE_ACCOUNT_NAME.blob.core.windows.net"
ACCESS_KEY = "fs.azure.account.key.{}.blob.core.windows.net".format(STORAGE_ACCOUNT_NAME)
# Never store a real key in the notebook; paste it in at run time or load it from a secret store.
SECRET_KEY = "REPLACE_WITH_YOUR_SECRET_KEY"

In [8]:
def mount_folder_from_azure_blob(storage_acc_name=None, container_name=None, 
                                 dirname=None, mnt_name=None, access_key=None, secret_key=None):
  
  """
  Utility function to mount a folder from Azure Blob storage
  """
  configs = {access_key: secret_key}
  # An unset dirname mounts the container root instead of a folder literally named "None"
  result = dbutils.fs.mount(
              source = "wasbs://{}@{}.blob.core.windows.net/{}".format(container_name, storage_acc_name, dirname or ""),
              mount_point = "/mnt/{}".format(mnt_name),
              extra_configs = configs)
  
  return result
  

In [9]:
def check_if_mounted(mount_name=None):
  """
  Mounts the folder unless a mount point with this name already exists
  """
  for r in dbutils.fs.mounts():
    if r.mountPoint.split('/')[-1] == mount_name:
      print('Already mounted')
      return
  # No existing mount point matched, so mount once, after the whole list has been checked
  if mount_folder_from_azure_blob(storage_acc_name=STORAGE_ACCOUNT_NAME, container_name=CONTAINER_NAME, mnt_name=mount_name, secret_key=SECRET_KEY, access_key=ACCESS_KEY):
    print('Successfully mounted')
  

## Read in Data

In [11]:
# Read data as Spark Dataframe
file_name = 'africell_first_sample.csv'
df = spark.read.format("csv").option("header", "true").option("mode", "DROPMALFORMED").load("/mnt/sample/{}".format(file_name))

#### Add datetime

In [13]:
# Spark UDF that parses a 'YYYYMMDDHHMMSS' string into a timestamp
add_datetime = udf(lambda x: datetime.strptime(x, '%Y%m%d%H%M%S'), TimestampType())
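
To see what the UDF does to a single value, the same format string can be tried with the standard library alone. The sample string below is invented for illustration and assumes `cdr_datetime` is stored as a 14-digit `YYYYMMDDHHMMSS` string, which is what the format string implies.

```python
from datetime import datetime

CDR_TIME_FORMAT = '%Y%m%d%H%M%S'  # same format string as the UDF


def parse_cdr_datetime(raw):
    """Parse a compact CDR timestamp string into a datetime."""
    return datetime.strptime(raw, CDR_TIME_FORMAT)


# Invented sample value, not taken from the dataset
parsed = parse_cdr_datetime('20180315142530')
print(parsed.isoformat())
```

Note that `strptime` raises `ValueError` on a malformed string and `TypeError` on `None`, so rows with missing or malformed timestamps would make the UDF fail unless they are filtered out first.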

In [14]:
def find_num_of_days(df=None, time_col=None):
  """
  Returns the number of whole days between the earliest and latest timestamps.
  Assumes time_col is a Spark TimestampType column.
  """
  # Read the bounds by column name rather than through a hard-coded 'datetime' attribute
  start = df.sort(time_col, ascending=True).first()[time_col]
  end = df.sort(time_col, ascending=False).first()[time_col]
  num_dys = (end - start).days

  return num_dys
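
The day count is the whole-day part of the difference between the latest and earliest timestamps. A minimal standard-library sketch with invented timestamps shows the behaviour, including the fact that `.days` drops a partial day.

```python
from datetime import datetime


def span_in_days(timestamps):
    """Whole days between the earliest and latest timestamp."""
    return (max(timestamps) - min(timestamps)).days


# Invented timestamps covering 13 days and 6 hours
stamps = [
    datetime(2018, 3, 1, 8, 0, 0),
    datetime(2018, 3, 7, 12, 30, 0),
    datetime(2018, 3, 14, 14, 0, 0),
]
print(span_in_days(stamps))
```

Data that spans less than a full day gives 0, so any caller that divides by the result needs to handle that case.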

## Let's determine what kinds of events we have in the data

In [16]:
df.select('cdr_type').distinct().show()

Here is an attempt to describe what each type of event means. It doesn't seem like we need all these CDR types. Some of the data may be repetitive.
Some terminology to know about:
SMC: Short Message Centre
1. MtSMSRecord: For now I just assume this pertains to SMS
2. TransitRecord:
3. MtCallRecord: 
4. MoCallRecord: Mobile generated call record due to outgoing call attempt.
5. RoamingRecord: Roaming call attempt
6. MoSMSRecord: 

##### Skip this part for now and come back to it later

#### Add user id: we use the anonymised calling IMEI as user_id. We could also use the phone number

In [19]:
df2 = df.withColumn('datetime', add_datetime(col('cdr_datetime')))

In [20]:
def calculate_crude_avg_events_per_day(exclude_events=None, df=None, outcol=None, uid_col=None, numdays=None):
  """
  For each user, count all events in the dataset and divide by the total number of days in the dataset.
  In other cases, we may want to do this on a day by day basis
  :param exclude_events: list of cdr_type values to exclude (e.g., roaming)
  :param outcol: name of the output column that holds the average
  """
  if not numdays:
    numdays = find_num_of_days(df=df, time_col='datetime')

  if exclude_events:
    # Drop the excluded cdr-types before counting
    df = df.filter(~col('cdr_type').isin(exclude_events))

  # Count the remaining events per user, then divide by the number of days
  dfout = df.groupBy(uid_col).agg(count('cdr_type').alias('events_count'))
  dfout2 = dfout.withColumn(outcol, col('events_count') / numdays)

  return dfout2

In [21]:
df_avg_events = calculate_crude_avg_events_per_day(exclude_events=None, df=df2, outcol='avg_events_all', uid_col='userid', numdays=13)
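
The same crude average can be checked on a handful of rows without Spark. This is only a sketch: the `(userid, cdr_type)` pairs are invented, and `numdays=2` is an arbitrary value chosen for the example.

```python
from collections import Counter


def crude_avg_events_per_day(events, numdays, exclude_events=None):
    """Count events per user and divide by the number of days.

    events: iterable of (userid, cdr_type) pairs.
    exclude_events: optional collection of cdr_type values to skip.
    """
    excluded = set(exclude_events or [])
    counts = Counter(uid for uid, cdr_type in events if cdr_type not in excluded)
    return {uid: n / numdays for uid, n in counts.items()}


# Invented rows for illustration
rows = [
    ("u1", "MoCallRecord"),
    ("u1", "MtSMSRecord"),
    ("u1", "RoamingRecord"),
    ("u2", "MoCallRecord"),
]
avg_all = crude_avg_events_per_day(rows, numdays=2)
avg_no_roaming = crude_avg_events_per_day(rows, numdays=2, exclude_events=["RoamingRecord"])
print(avg_all, avg_no_roaming)
```

Because the divisor is the length of the whole dataset, a user who was active on only one day still has their count spread over every day, which is why the average is called crude.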

#### Now let's see the overall daily average number of events across all users

In [23]:
# First, I would rather deal with a pandas dataframe for this
df_avg = df_avg_events.toPandas()

In [24]:
print('='*100)
print('When all categories of events are considered, the average number of events per day for each user is {}'.format(int(df_avg.avg_events_all.mean())))
print('='*100)