## Creating a User-level dataset for analysing user insights

Richie started this notebook to pull data on individual subscribers. The target is one row per subscriber, holding attributes such as gender and age alongside calculated values such as average call duration and number of calls. The resulting table can feed many kinds of user analysis, including cluster analysis.

In [2]:
import os
import pandas as pd
from google.cloud import bigquery
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
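# Point the Google client libraries at the credentials JSON file.
# Note: CHANGE THIS TO WHERE ON YOUR COMPUTER THE JSON FILE IS
# A raw string (r'...') keeps the Windows backslashes from being read as escape sequences.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = r'G:\My Drive\data science\DataDives\viamo_api_key.json'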

In [4]:
bigquery_client = bigquery.Client()

# One row per Ugandan subscriber: age, gender, number of distinct calls,
# and the mean of duration_listened_seconds.
user_data = pd.read_gbq('''
WITH ugandan_data AS (
    SELECT subscriber_id,
           MIN(age) AS age,
           MIN(gender) AS gender,
           COUNT(DISTINCT call_id) AS n_calls,
           AVG(duration_listened_seconds) AS average_call_duration
    FROM `viamo-datakind.datadive.321_sessions_1122`
    WHERE organization_country = 'Uganda'
    GROUP BY subscriber_id
)

SELECT * FROM ugandan_data
''')

Idea: write some SQL to find each user's most popular block theme and block topic. Do users have a favourite call topic or theme? If so, we can use it as a feature for clustering.
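The idea above can be prototyped in pandas before committing to SQL. The sketch below is only an illustration on made-up rows: it assumes a block-level table with columns named `block_theme` and `block_topic` (the real column names may differ), and the helper `favourite` is hypothetical. Ties are broken alphabetically, which is an arbitrary choice.

```python
import pandas as pd

# Hypothetical block-level sample: one row per content block a subscriber heard.
# The column names `block_theme` and `block_topic` are assumptions.
blocks = pd.DataFrame({
    "subscriber_id": ["a", "a", "a", "b", "b", "c"],
    "block_theme": ["health", "health", "news", "news", "news", "games"],
    "block_topic": ["malaria", "covid", "weather", "weather", "weather", "quiz"],
})


def favourite(df: pd.DataFrame, column: str) -> pd.Series:
    """Return the most frequent value of `column` for each subscriber."""
    counts = (
        df.groupby(["subscriber_id", column])
        .size()
        .reset_index(name="n")
        # Highest count first within each subscriber; ties break alphabetically.
        .sort_values(["subscriber_id", "n", column], ascending=[True, False, True])
    )
    # After sorting, the first row per subscriber is their favourite.
    top = counts.drop_duplicates("subscriber_id").set_index("subscriber_id")
    return top[column]


favourites = pd.DataFrame({
    "favourite_theme": favourite(blocks, "block_theme"),
    "favourite_topic": favourite(blocks, "block_topic"),
})
print(favourites)
```

The resulting frame is indexed by `subscriber_id`, so it could be joined onto `user_data` as two extra categorical columns.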

In [8]:
user_data.head()

Unnamed: 0,subscriber_id,age,gender,n_calls,average_call_duration
0,1249007965135440660,,,34,21.760563
1,1107356916055015424,18_24,male,41,49.25
2,921411853161594880,18_24,male,31,56.010676
3,590958164967424000,under_18,female,26,53.369458
4,701490087854080000,,,28,44.851852


In [14]:
user_data.describe()

Unnamed: 0,subscriber_id,n_calls,average_call_duration
count,2714900.0,2714900.0,2107415.0
mean,2237601000000.0,8.526765,28.62595
std,2.390752e+17,13.49636,17.30161
min,5.697945e+17,1.0,0.0
25%,1.090526e+18,1.0,16.66667
50%,1.291351e+18,3.0,25.75
75%,1.383805e+18,10.0,36.87952
max,1.441578e+18,2488.0,483.5
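Two things stand out in this summary. The `subscriber_id` column is an identifier, so its statistics carry no meaning (the reported mean is even below the reported minimum). The count for `average_call_duration` (2,107,415) is also lower than the number of subscribers (2,714,900), so roughly 607,000 subscribers have no average duration. A minimal sketch of a cleaner summary follows, using three rows copied from the `head()` output above; casting the ID to a string is a suggestion, not something the notebook already does.

```python
import pandas as pd

# Sample rows copied from the user_data.head() output above.
sample = pd.DataFrame({
    "subscriber_id": [1249007965135440660, 1107356916055015424, 921411853161594880],
    "age": [None, "18_24", "18_24"],
    "gender": [None, "male", "male"],
    "n_calls": [34, 41, 31],
    "average_call_duration": [21.760563, 49.25, 56.010676],
})

# Treat the ID as a label, not a quantity, so describe() skips it.
sample["subscriber_id"] = sample["subscriber_id"].astype(str)

# describe() now summarises only the numeric feature columns.
summary = sample.describe()
print(summary)

# Share of missing values per column, useful before clustering.
missing_share = sample.isna().mean()
print(missing_share)
```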


## A separate analysis: which subscribers have unusually high call counts?

In [37]:
bigquery_client = bigquery.Client()

# Subscribers with more than 1,000 distinct calls, across all countries.
n_calls = pd.read_gbq('''
WITH n_calls_per_sub AS (
    SELECT subscriber_id,
           organization_country,
           COUNT(DISTINCT call_id) AS n_calls
    FROM `viamo-datakind.datadive.321_sessions_1122`
    GROUP BY subscriber_id, organization_country
)

SELECT * FROM n_calls_per_sub WHERE n_calls > 1000 ORDER BY n_calls DESC
''')

In [38]:
n_calls.organization_country.value_counts()

Mali       12016
Nigeria        5
Uganda         3
Name: organization_country, dtype: int64

In [36]:
n_calls.organization_country.value_counts()

Mali       224451
Nigeria      6549
Uganda       5636
Name: organization_country, dtype: int64
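How many subscribers count as "unusually high" depends on the cut-off, so it can help to view several thresholds side by side. The following is a rough sketch on made-up numbers, assuming a per-subscriber table with `organization_country` and `n_calls` columns that has not yet been filtered; the helper name `heavy_callers_by_country` and the threshold values are hypothetical.

```python
import pandas as pd

# Hypothetical per-subscriber call counts; the values are made up for illustration.
calls = pd.DataFrame({
    "organization_country": ["Mali", "Mali", "Mali", "Nigeria", "Uganda", "Uganda"],
    "n_calls": [1500, 250, 90, 1200, 300, 40],
})


def heavy_callers_by_country(df: pd.DataFrame, thresholds: list) -> pd.DataFrame:
    """Count subscribers per country whose n_calls exceeds each threshold."""
    columns = {
        f"n_calls > {t}": df.loc[df["n_calls"] > t, "organization_country"].value_counts()
        for t in thresholds
    }
    # Countries with no subscriber above a threshold get 0 instead of NaN.
    return pd.DataFrame(columns).fillna(0).astype(int)


table = heavy_callers_by_country(calls, thresholds=[100, 1000])
print(table)
```

Running the query once without the `WHERE` filter and applying thresholds locally avoids re-querying BigQuery for every cut-off, at the cost of pulling a larger result.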