## Workstream3

### Tanvi Arora

**This notebook contains basic starter code, including a query that fetches a random sample of data for analysis. It can be modified as needed.**

In [72]:
import os
import pandas as pd
import numpy as np

# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")

# set option to view all columns
pd.set_option('display.max_columns', None)

In [None]:
# path to the credentials file obtained from the DataKind team
GOOGLE_APPLICATION_CREDENTIALS=os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
GOOGLE_APPLICATION_CREDENTIALS

### Data from viamo

In [5]:
from google.cloud import bigquery
bigquery_client=bigquery.Client()

In [44]:
#sample data
'''
organization_country : country for which data is fetched.
min_call_date : only calls made on or after this date are fetched.
sample_count : number of calls to sample. Note - this limits the number of calls, not the number of records. The query selects all the records of every sampled call.
'''
organization_country='Uganda'
min_call_date = '2022-01-01'
sample_count = 25000

query_str=f"""WITH random_calls AS (select distinct(call_id) from `viamo-datakind.datadive.321_sessions_1122` WHERE  organization_country='{organization_country}' and call_date >= '{min_call_date}' order by rand() limit {sample_count}) SELECT * FROM `viamo-datakind.datadive.321_sessions_1122` WHERE call_id IN (SELECT call_id FROM random_calls)"""
query_str

"WITH random_calls AS (select distinct(call_id) from `viamo-datakind.datadive.321_sessions_1122` WHERE  organization_country='Uganda' and call_date >= '2022-01-01' order by rand() limit 25000) SELECT * FROM `viamo-datakind.datadive.321_sessions_1122` WHERE call_id IN (SELECT call_id FROM random_calls)"

In [45]:
data_df = pd.read_gbq(query_str)
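The three parameters above are pasted straight into the SQL text, so a typo in the date or a stray quote in the country name only shows up as a query error. A minimal sketch of a wrapper that checks the inputs first; `build_sample_query` and `example_query` are hypothetical names introduced here for illustration, and the SQL text itself is the same as in the cell above.

```python
from datetime import datetime

# Table name copied from the query above.
TABLE = "viamo-datakind.datadive.321_sessions_1122"


def build_sample_query(organization_country: str, min_call_date: str, sample_count: int) -> str:
    """Hypothetical helper: validate the three inputs, then build the sampling query."""
    # Refuse quotes and backslashes so the value cannot end the SQL string literal early.
    if any(ch in organization_country for ch in "'\"\\"):
        raise ValueError("organization_country contains a quote or backslash")
    # strptime raises ValueError if the text is not a year-month-day date.
    call_date = datetime.strptime(min_call_date, "%Y-%m-%d").date().isoformat()
    if isinstance(sample_count, bool) or not isinstance(sample_count, int) or sample_count <= 0:
        raise ValueError("sample_count must be a positive integer")
    return (
        f"WITH random_calls AS (select distinct(call_id) from `{TABLE}` "
        f"WHERE organization_country='{organization_country}' "
        f"and call_date >= '{call_date}' "
        f"order by rand() limit {sample_count}) "
        f"SELECT * FROM `{TABLE}` WHERE call_id IN (SELECT call_id FROM random_calls)"
    )


example_query = build_sample_query("Uganda", "2022-01-01", 25000)
print(example_query)
```

The returned string can be passed to `pd.read_gbq` exactly like `query_str`.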

In [None]:
##Backup data

orig_data_df=data_df.copy()
data_df.to_csv('uganda_25K_calls_sample_2022.csv',index=False)

#### Basic statistics about data

In [46]:
data_df.shape

(173322, 44)

**173K records were obtained for 25K calls, which indicates a one-to-many relationship between calls and records. Each record represents a block, and 1 call can have 1 or more blocks.**
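The records-per-call ratio can be checked directly rather than read off the shape. A minimal sketch on a made-up toy frame (`toy_df` and its values are illustrative, not Viamo data); on the real sample the same steps would run on `data_df`, where 173,322 records over 25,000 calls works out to roughly 6.9 blocks per call.

```python
import pandas as pd

# Toy stand-in for data_df: three calls with 3, 1 and 2 blocks (values are made up).
toy_df = pd.DataFrame({
    "call_id": [101, 101, 101, 102, 103, 103],
    "block_interaction_id": [1, 2, 3, 4, 5, 6],
})

n_records = len(toy_df)                 # one row per block
n_calls = toy_df["call_id"].nunique()   # distinct calls in the sample
blocks_per_call = toy_df.groupby("call_id")["block_interaction_id"].count()

print(n_records, "records for", n_calls, "calls")
print(blocks_per_call.describe())       # distribution of blocks per call
```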

In [47]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173322 entries, 0 to 173321
Data columns (total 44 columns):
 #   Column                          Non-Null Count   Dtype              
---  ------                          --------------   -----              
 0   call_date                       173322 non-null  dbdate             
 1   dl_global_created_at            173322 non-null  datetime64[ns, UTC]
 2   block_interaction_id            173322 non-null  Int64              
 3   call_id                         173322 non-null  Int64              
 4   subscriber_id                   173322 non-null  Int64              
 5   block_global_created_at         173322 non-null  datetime64[ns, UTC]
 6   block_entry_at                  173322 non-null  datetime64[ns, UTC]
 7   js_key                          173322 non-null  object             
 8   tree_version_set_id             173322 non-null  Int64              
 9   call_started                    173322 non-null  datetime64[ns, UTC]
 

**Data-Dictionary** [here](https://docs.google.com/spreadsheets/d/1QRJzAj0EB05CF7qmuy1VZ1Ivsa8me3UeKrKgRNeijaM/edit?usp=sharing)

call_date : date on which call was made.  
block_interaction_id : primary key. Represents a block entry.  
call_id : is an individual call. 1 call can have 1 or more blocks.   
subscriber_id : id of the caller. Can be used to identify patterns in the calls a caller makes.  

#### Sample data

In [None]:
data_df.head(30)

### Workstream#3 :

3 - How many individual calls end at the menu block? Is that the only block accessed in those calls?  
4 - How many individual calls end at the digest block? Is that the only block accessed in those calls?


To begin answering the questions above, we need to know which block is the last block of a call. Below is some analysis of that.

In [52]:
#Analyze last block for a call logic
'''
logic is to look into calls with multiple blocks
blocks_per_call_df1 : count of blocks per call
multiple_blocks_per_call_df1 : call_id (calls) with multiple blocks
'''
#get call_ids with multiple blocks
blocks_per_call_df1=data_df.groupby(['call_id'],as_index=False).agg({"block_interaction_id":"count"})
#blocks_per_call_df1
multiple_blocks_per_call_df1 = blocks_per_call_df1.query('block_interaction_id>1')
multiple_blocks_per_call_df1

| | call_id | block_interaction_id |
|---|---|---|
| 0 | 1323858875531135168 | 10 |
| 1 | 1323860291012267184 | 3 |
| 2 | 1323867992417109448 | 7 |
| 3 | 1323872643887663024 | 9 |
| 4 | 1323873895543473368 | 12 |
| ... | ... | ... |
| 24995 | 1441513275946175444 | 4 |
| 24996 | 1441514351311195940 | 3 |
| 24997 | 1441516847324727908 | 13 |
| 24998 | 1441520677680582332 | 4 |


In [53]:
num_calls_multiple_blocks=(len(multiple_blocks_per_call_df1)/len(blocks_per_call_df1))*100
print("percent of calls with multiple blocks :",num_calls_multiple_blocks)

percent of calls with multiple blocks : 92.688


**From the sample obtained, 92.7% of calls had more than 1 block**

In [54]:
# get call records with multiple blocks
'''
marker == 1 indicates the call has multiple blocks. Join this to the original data.
'''
# assign() returns a new frame, so the column is not set on a slice of blocks_per_call_df1
multiple_blocks_per_call_df1 = multiple_blocks_per_call_df1.assign(marker=1)
joined_calls_records_multiple_blocks_df1=pd.merge(data_df,multiple_blocks_per_call_df1,on=['call_id'],how='left')
joined_calls_records_multiple_blocks_df1.shape

(173322, 46)

In [57]:
'''
calls_records_multiple_blocks_df1 : the original data plus the marker column, restricted to records that
                                    belong to a call with multiple blocks.
block_interaction_id_x : block_interaction_id of original data
block_interaction_id_y : count of blocks for the call

'''
calls_records_multiple_blocks_df1=joined_calls_records_multiple_blocks_df1.query('marker==1').sort_values(by=['call_id','block_global_created_at'])


In [None]:
calls_records_multiple_blocks_df1.head(50)

In [59]:
## records of calls with a single block (marker is NaN for these after the left join, so marker!=1 selects them)
calls_records_single_blocks_df1=joined_calls_records_multiple_blocks_df1.query('marker!=1').sort_values(by=['call_id','block_global_created_at'])


In [60]:
calls_records_single_blocks_df1.shape

(1828, 46)

In [None]:
calls_records_single_blocks_df1.head(30)

In [108]:
calls_records_single_blocks_df1['block_theme'].unique()

array(['', 'games', 'health', 'ag'], dtype=object)

One might assume that single-block calls simply ended without being used. However, some of these records have values in block_theme and block_topic, which indicates that these were not blank calls.

#### Logic to identify last block of a call
Looking at the records of a call with multiple blocks, one of the records has  

block_global_created_at=call_ended  

The same holds for calls with a single block.   
Hence we can say that the block with **"block_global_created_at=call_ended"** is the last block of the call, whether the call has 1 block or several.  
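The rule above is easy to verify across the whole sample: every call should have exactly one record whose `block_global_created_at` equals `call_ended`. A minimal sketch on a made-up toy frame (`toy_df`, `matches_per_call` and the timestamps are illustrative); on the real data the same check would run on `data_df`.

```python
import pandas as pd

# Toy stand-in: call 101 has three blocks, call 102 has one (timestamps are made up).
toy_df = pd.DataFrame({
    "call_id": [101, 101, 101, 102],
    "block_global_created_at": pd.to_datetime(
        ["2022-01-01 10:00:00", "2022-01-01 10:00:30",
         "2022-01-01 10:01:00", "2022-01-02 09:00:00"], utc=True),
    "call_ended": pd.to_datetime(
        ["2022-01-01 10:01:00", "2022-01-01 10:01:00",
         "2022-01-01 10:01:00", "2022-01-02 09:00:00"], utc=True),
})

# True where the record satisfies the last-block rule.
is_last = toy_df["block_global_created_at"] == toy_df["call_ended"]

# Number of records per call that satisfy the rule; the rule is safe if this is always 1.
matches_per_call = is_last.groupby(toy_df["call_id"]).sum()

calls_without_match = matches_per_call[matches_per_call == 0].index.tolist()
calls_with_several = matches_per_call[matches_per_call > 1].index.tolist()
print(calls_without_match, calls_with_several)
```

If either list is non-empty on the real sample, the percentages computed from the last blocks would need to be revisited, since some calls would be missing or counted more than once.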

In [None]:
# get the last block of each call
# .copy() so that later column assignments do not write to a slice of the joined frame
last_block_call_df=joined_calls_records_multiple_blocks_df1.query('block_global_created_at==call_ended').copy()
last_block_call_df.head(10)

In [77]:
last_block_call_df['listen_menu_or_digest']=last_block_call_df['listen_menu_or_digest'].fillna('None')

In [92]:
list_count=last_block_call_df.groupby('listen_menu_or_digest')['call_id'].count()

In [102]:
total_calls=len(last_block_call_df)
list_option_percentage={}
for l,v in list_count.items():
    list_option_percentage[l]=round((v/total_calls)*100,2)

list_option_percentage

{'Listen Digest': 19.72, 'Listen Menu': 44.72, 'None': 35.56}

Listen Digest means the caller hears an automated digest of the top 10 news items.  
Listen Menu is where the user navigates to information of their own interest.

Share of calls by the option recorded on the last block:  
'Listen Digest': 19.72%  
'Listen Menu': 44.72%  
'None': 35.56%  
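The second half of each question ("Is that the only block accessed in those calls?") can be approached by splitting the last blocks by whether the call had a single block. A minimal sketch on a made-up toy frame; `toy_df`, `with_counts`, `call_type` and `summary` are illustrative names, and reading "the only block accessed" as "the call has exactly one block" is an assumption.

```python
import pandas as pd

# Toy stand-in: calls 102 and 104 have one block, calls 101 and 103 have several.
toy_df = pd.DataFrame({
    "call_id": [101, 101, 102, 103, 103, 103, 104],
    "block_global_created_at": pd.to_datetime([
        "2022-01-01 10:00:00", "2022-01-01 10:00:30",
        "2022-01-01 11:00:00",
        "2022-01-01 12:00:00", "2022-01-01 12:00:20", "2022-01-01 12:00:40",
        "2022-01-01 13:00:00"], utc=True),
    "call_ended": pd.to_datetime([
        "2022-01-01 10:00:30", "2022-01-01 10:00:30",
        "2022-01-01 11:00:00",
        "2022-01-01 12:00:40", "2022-01-01 12:00:40", "2022-01-01 12:00:40",
        "2022-01-01 13:00:00"], utc=True),
    "listen_menu_or_digest": ["Listen Menu", "Listen Menu", "Listen Digest",
                              None, "Listen Menu", "Listen Digest", "Listen Menu"],
})

# Attach the number of blocks in the call to every record.
with_counts = toy_df.assign(n_blocks=toy_df.groupby("call_id")["call_id"].transform("size"))

# Keep the last block of each call, using the rule block_global_created_at == call_ended.
last_blocks = with_counts[with_counts["block_global_created_at"] == with_counts["call_ended"]]

# Label each call by its block count, then cross-tabulate against the last-block option.
call_type = last_blocks["n_blocks"].map(lambda n: "single block" if n == 1 else "multiple blocks")
summary = pd.crosstab(last_blocks["listen_menu_or_digest"].fillna("None"), call_type)
print(summary)
```

The "single block" column then counts the calls that ended at a given option without visiting any other block.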


Among the labelled last blocks, "Listen Menu" is the most common, which suggests this is where users actively engage.
However, 35.56% of the last blocks have a null value. When asked about this, the Viamo team replied:
"Those values are labelled on our blocks by the staff who create the content, so null could mean the block was not appropriately labelled."

This means these 35.56% could be either "Listen Digest" or "Listen Menu".
Correct labels for the nulls could change the interpretation completely, so no firm conclusion can be drawn here.
On the current numbers, more calls end at "Listen Menu" than at "Listen Digest" (44.72% vs 19.72%). If that holds, Viamo should focus on making sure users find the information they are looking for, in order to maintain or improve engagement.

If the null values were properly labelled and the two options turned out to be almost equally common, effort would need to be divided between both areas.
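How far the nulls could move the picture can be bounded with simple arithmetic. A minimal sketch, assuming every unlabelled last block is really either "Listen Digest" or "Listen Menu"; the percentages are the ones reported above, and `observed` and `bounds` are illustrative names.

```python
# Shares of last blocks by option, as reported above.
observed = {"Listen Digest": 19.72, "Listen Menu": 44.72, "None": 35.56}
unlabelled = observed["None"]

# Lower bound: none of the nulls belong to the option.
# Upper bound: all of the nulls belong to the option.
bounds = {
    label: (share, round(share + unlabelled, 2))
    for label, share in observed.items()
    if label != "None"
}

for label, (low, high) in bounds.items():
    print(f"{label}: between {low}% and {high}% of last blocks")
# Listen Digest: between 19.72% and 55.28% of last blocks
# Listen Menu: between 44.72% and 80.28% of last blocks
```

Since the upper bound for "Listen Digest" (55.28%) is above the lower bound for "Listen Menu" (44.72%), the ranking of the two options is not settled until the nulls are labelled.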

####  What blocks identify the start of a call

In [None]:
calls_records_single_blocks_df1

From the above, for calls with a single block, where that block is both the start and the end of the call,
**"call_ended==block_global_created_at"**

For calls with multiple blocks, the first block has **"call_started==block_global_created_at"**
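Because the two rules above differ between single-block and multi-block calls, an alternative that does not depend on timestamp equality is to order each call's blocks by time and take the first and last rows. A minimal sketch on a made-up toy frame (`toy_df`, `ordered`, `first_blocks`, `last_blocks` and the timestamps are illustrative), assuming `block_global_created_at` orders the blocks within a call.

```python
import pandas as pd

# Toy stand-in: call 101 has two blocks, call 102 has a single block
# created a few seconds after call_started (timestamps are made up).
toy_df = pd.DataFrame({
    "call_id": [101, 101, 102],
    "block_interaction_id": [1, 2, 3],
    "block_global_created_at": pd.to_datetime(
        ["2022-01-01 10:00:00", "2022-01-01 10:00:30", "2022-01-02 09:00:05"], utc=True),
    "call_started": pd.to_datetime(
        ["2022-01-01 10:00:00", "2022-01-01 10:00:00", "2022-01-02 09:00:00"], utc=True),
    "call_ended": pd.to_datetime(
        ["2022-01-01 10:00:30", "2022-01-01 10:00:30", "2022-01-02 09:00:05"], utc=True),
})

# Order blocks within each call, then take the first and last row per call.
ordered = toy_df.sort_values(["call_id", "block_global_created_at"])
first_blocks = ordered.groupby("call_id").head(1)
last_blocks = ordered.groupby("call_id").tail(1)

# Check how often the first block coincides with call_started.
first_at_start = first_blocks["block_global_created_at"] == first_blocks["call_started"]
print(first_blocks["block_interaction_id"].tolist())
print(last_blocks["block_interaction_id"].tolist())
print(first_at_start.tolist())
```

Every call gets exactly one first block this way, and `first_at_start` shows how many of them the `call_started` rule would have missed.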

## Themes and Topics of calls

In [105]:
# which block_title values does a call start with?
joined_calls_records_multiple_blocks_df1.query('call_started==block_global_created_at')['block_title'].unique()

array(['10th Call Message', '11th Call Message', '1-9th Call Message',
       None], dtype=object)

In [None]:
# what does the block_title of the last block look like?
joined_calls_records_multiple_blocks_df1.query('call_ended==block_global_created_at')['block_title'].unique()