<a href="https://colab.research.google.com/github/datakind/Viamo_DataDive_Dec22/blob/main/Workstream%20%232/Viamo_Workstream2_Mali_EveThan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**Install and import packages**

In [99]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px
import matplotlib.dates as mdates
import numpy as np

In [None]:
!pip install google-cloud
!pip install google-cloud-bigquery[pandas]
!pip install google-cloud-storage
!pip install pandas-gbq -U

In [2]:
from google.cloud import bigquery

##**Initial setup**

I'm going to randomly extract 100,000 rows where organization_country equals 'Mali', keeping only the columns that seem relevant.

In [3]:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = 'viamo-datakind-19b12e3872f5.json'

Bigquery_client = bigquery.Client()

df = pd.read_gbq("SELECT call_id, call_date, subscriber_id, call_started, call_ended, region, region_name, age, gender, location, location_level_2, subscriber_pereferred_language, rural_or_urban, phone_type, education_level, income_source, highest_expense_area, duration_listened_minutes, duration_listened_seconds, block_title, block_theme, block_topic FROM `viamo-datakind.datadive.321_sessions_1122` WHERE organization_country='Mali' ORDER BY RAND() LIMIT 100000")

##**Take a quick look at the data: EDA and some data cleaning**

###**First 5 rows**

It seems like there's a misspelled column, [subscriber_pereferred_language], where 'preferred' is spelled wrongly. Not a big deal though.

In [4]:
df.head()

Unnamed: 0,call_id,call_date,subscriber_id,call_started,call_ended,region,region_name,age,gender,location,...,rural_or_urban,phone_type,education_level,income_source,highest_expense_area,duration_listened_minutes,duration_listened_seconds,block_title,block_theme,block_topic
0,1314686664769595156,2021-12-06,1145379592367300608,2021-12-06 20:15:26+00:00,2021-12-06 20:18:09+00:00,FWA,Francophone West Africa,18_24,male,Kidal,...,,,,,,0.4,23.0,Digest Intro,,
1,1198628852222979868,2021-01-20,991681842476802048,2021-01-20 14:03:07+00:00,2021-01-20 14:16:18+00:00,FWA,Francophone West Africa,25_34,male,Sikasso,...,,,,,,0.5,30.0,Fin_Journal,news,
2,1355218011510336948,2022-03-28,1096547323951505408,2022-03-28 16:32:33+00:00,2022-03-28 16:48:38+00:00,FWA,Francophone West Africa,under_18,male,Segou,...,,,,,,,,Fin_Journal,news,
3,1401874574773641300,2022-08-04,1082214802493333505,2022-08-04 10:29:04+00:00,2022-08-04 10:33:34+00:00,FWA,Francophone West Africa,18_24,female,Sikasso,...,,,,,,0.8,49.0,Spot femme adultes 321,,
4,1330130663764454024,2022-01-18,994214908340527104,2022-01-18 11:04:22+00:00,2022-01-18 11:04:54+00:00,FWA,Francophone West Africa,18_24,female,Kayes,...,,,,,,0.1,3.0,Digest or main menu?,,


###**Missing data**

In [5]:
df.isnull().sum()

call_id                                0
call_date                              0
subscriber_id                          0
call_started                           0
call_ended                             0
region                                 0
region_name                            0
age                                25797
gender                             14694
location                           17581
location_level_2                   34930
subscriber_pereferred_language       614
rural_or_urban                    100000
phone_type                        100000
education_level                   100000
income_source                     100000
highest_expense_area              100000
duration_listened_minutes           1009
duration_listened_seconds           1009
block_title                        18012
block_theme                            0
block_topic                            0
dtype: int64

We can see that none of the extracted rows have values for these columns: [rural_or_urban], [phone_type], [education_level], [income_source], and [highest_expense_area]. I am going to remove these columns.

In [6]:
df = df.drop(columns=['rural_or_urban', 'phone_type', 'education_level', 'income_source', 'highest_expense_area'])
df.isnull().sum()

call_id                               0
call_date                             0
subscriber_id                         0
call_started                          0
call_ended                            0
region                                0
region_name                           0
age                               25797
gender                            14694
location                          17581
location_level_2                  34930
subscriber_pereferred_language      614
duration_listened_minutes          1009
duration_listened_seconds          1009
block_title                       18012
block_theme                           0
block_topic                           0
dtype: int64

In [7]:
df.block_title.value_counts()

Digest or main menu?                                                                          14952
Main Menu                                                                                      8123
Choix Menu Tamani                                                                              7097
Titres du Journal                                                                              6597
Digest Intro                                                                                   4800
                                                                                              ...  
Dans quel cas est-ce qu’un certificat international de transhumance ou CIT est nécessaire?        1
1.1.3. ENVIRONNEMENT SCOLAIRE                                                                     1
Q.4.3                                                                                             1
Les méthodes de prévention du paludisme lors des Visites à Domicile                               1


In [8]:
df.block_theme.value_counts()

                    55948
news                26273
health               7324
financial            1853
health,nutrition     1702
ed                   1345
security             1338
financial,rights     1196
gender               1118
rights               1020
ag                    584
games                 293
nutrition               6
Name: block_theme, dtype: int64

In [9]:
df.block_topic.value_counts()

                           92845
coronavirus,malaria         3787
coronavirus                 2149
livestock                    401
malaria                      369
gbv                          163
coronavirus,malaria,ncd      148
malaria,ncd                  108
maternal                      25
coronavirus,ncd                3
environment                    2
Name: block_topic, dtype: int64

[block_theme] and [block_topic] have empty strings as values which are not caught by the isnull() function.

Since I'm only going to visualize this data and not run some machine learning algorithm on it, I feel like it's ok to have some null data in some of these columns, so I will keep them for now.
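If the empty strings ever need to be counted as missing values, one possible approach is sketched below. It runs on a small made-up frame (`toy` is my own stand-in, not the real `df`), so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "block_theme": ["news", "", "health", ""],
    "block_topic": ["", "", "malaria", ""],
})

# Count empty strings per column (these are invisible to isnull()).
empty_counts = toy.eq("").sum()

# Turn empty strings into NaN so that isnull() picks them up afterwards.
cleaned = toy.mask(toy.eq(""))
null_counts = cleaned.isnull().sum()

print(empty_counts)
print(null_counts)
```

The same two lines could be pointed at `df[['block_theme', 'block_topic']]` to get the real counts.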

###**Outliers**

In [10]:
df.describe()

Unnamed: 0,call_id,subscriber_id,duration_listened_minutes,duration_listened_seconds
count,100000.0,100000.0,98991.0,98991.0
mean,-67633420000000.0,-40331010000000.0,0.964128,57.819903
std,6.939328e+16,1.343981e+17,2.313053,138.745476
min,1.191524e+18,7.322052e+17,0.0,0.0
25%,1.254897e+18,1.017215e+18,0.0,2.0
50%,1.311791e+18,1.112842e+18,0.2,14.0
75%,1.369344e+18,1.259873e+18,0.9,52.0
max,1.441578e+18,1.441499e+18,45.5,2728.0


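Someone listened for 45.5 minutes, but the third quartile is only 0.9 minutes, so the distribution has a long right tail. The statistics for [call_id] and [subscriber_id] can be ignored because these columns are identifiers (their negative means are most likely an integer overflow artifact).

Again, since I'm only going to visualize this data and not run a machine learning algorithm on it, I'm going to keep these potential outliers for now.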
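For reference, a common rule of thumb for flagging such values is the 1.5 × IQR fence. The sketch below applies it to a handful of made-up durations; the rule and the `upper_fence` name are my own choices here, not something used elsewhere in this notebook.

```python
import pandas as pd

# Made-up listening durations in minutes, for illustration only.
minutes = pd.Series([0.0, 0.0, 0.2, 0.9, 1.5, 45.5], name="duration_listened_minutes")

# Interquartile range and the usual upper fence (Q3 + 1.5 * IQR).
q1, q3 = minutes.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Values above the fence are candidate outliers.
outliers = minutes[minutes > upper_fence]
print(upper_fence)
print(outliers)
```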

###**Repetitive or uninformative data**

In [11]:
df[['region', 'region_name']].apply(pd.Series.value_counts)

Unnamed: 0,region,region_name
FWA,100000.0,
Francophone West Africa,,100000.0


Each of the [region] and [region_name] columns contains a single value across all rows, so they carry no information and I will remove them.

In [12]:
df = df.drop(columns=['region', 'region_name'])
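Single-valued columns like these could also be found programmatically instead of by eye. A minimal sketch on a made-up frame (the `toy` data and the `constant_cols` name are assumptions for illustration):

```python
import pandas as pd

# Toy frame: two constant columns and one that varies.
toy = pd.DataFrame({
    "region": ["FWA", "FWA", "FWA"],
    "region_name": ["Francophone West Africa"] * 3,
    "location": ["Kidal", "Sikasso", "Segou"],
})

# Count distinct values per column, treating NaN as a value of its own.
n_unique = toy.nunique(dropna=False)

# Columns with at most one distinct value are uninformative.
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)
```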

In [13]:
df['location'].value_counts()

Sikasso       21289
Segou         14896
Kayes         13486
Koulikoro     10430
Mopti          8607
Bamako         5402
Tombouctou     3861
Kidal          2419
Gao            2016
Kita              3
Koutiala          2
Bankass           2
Fana              1
Markala           1
Douentza          1
Kati              1
Kolondieba        1
Ségou             1
Name: location, dtype: int64

In [14]:
df['location_level_2'].value_counts()

Cercle de Bougouni          6588
Cercle de Kadiolo           3952
Cercle de Barouéli          3345
Cercle de Koutiala          3081
Cercle de Bafoulabé         3076
Cercle de Bla               2923
Cercle de Kéniéba           2662
Cercle de Diéma             2507
Cercle de Dioïla            2368
Cercle de Banamba           2286
Cercle de Sikasso           2235
Cercle de Bandiagara        2201
Cercle de Macina            1909
Cercle de Bankass           1597
Cercle de Niono             1537
Cercle de Ségou             1345
Cercle de Kita              1314
Cercle de Goundam           1238
Cercle de Kayes             1111
Cercle de Djenné            1105
Cercle de Kangaba           1073
Cercle de Kati               995
Cercle de San                986
Cercle de Mopti              830
Cercle de Kolondiéba         827
Cercle de Yorosso            817
Cercle de Koulikoro          804
Cercle de Gao                801
Cercle de Kolokani           789
Cercle de Kidal              732
Cercle de 

In [15]:
df['subscriber_pereferred_language'].value_counts()

Bambara     70276
Fula        13267
Tamasheq     6730
Songhay      5104
French       4009
Name: subscriber_pereferred_language, dtype: int64

[location], [location_level_2], and [subscriber_pereferred_language] look ok.
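The one stray 'Ségou' row in [location] (next to 14,896 'Segou' rows) does not matter for the charts here. If it did, spelling variants like this could be merged by stripping accents. A minimal sketch, where `strip_accents` is a hypothetical helper of my own and the sample values are made up:

```python
import unicodedata

import pandas as pd


def strip_accents(text):
    """Return text with combining accent marks removed (e.g. 'Ségou' -> 'Segou')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Made-up sample with an accented variant and a missing value.
locations = pd.Series(["Segou", "Ségou", "Sikasso", None])

# na_action="ignore" leaves missing values untouched.
normalized = locations.map(strip_accents, na_action="ignore")
print(normalized.value_counts())
```

Note that this would also strip accents from names where they are the expected spelling, so it is only suitable when an unaccented form is acceptable.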

###**Duplicates**

Let's remove any duplicate rows.

In [16]:
df[df.duplicated()]

Unnamed: 0,call_id,call_date,subscriber_id,call_started,call_ended,age,gender,location,location_level_2,subscriber_pereferred_language,duration_listened_minutes,duration_listened_seconds,block_title,block_theme,block_topic
47610,1255652067050579108,2021-06-26,996741621477859328,2021-06-26 22:33:01+00:00,2021-06-27 00:13:00+00:00,35_44,male,Sikasso,,Fula,1.3,75.0,Faire un Budget,"financial,rights",
68561,1271317827966003572,2021-08-09,1149071909825601536,2021-08-09 04:03:10+00:00,2021-08-09 06:58:41+00:00,under_18,female,Sikasso,Cercle de Bougouni,Tamasheq,0.7,42.0,Digest or main menu?,,
98603,1377521797528348616,2022-05-29,1336627097577776676,2022-05-29 05:39:49+00:00,2022-05-29 08:39:50+00:00,,female,,,French,0.0,0.0,,,


These 3 rows are exact repeats of rows that appear earlier in the dataframe. I'm going to drop the repeats and keep the first occurrence of each.

In [17]:
df = df.drop_duplicates()
df.shape

(99997, 15)
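By default `duplicated()` flags only the later copies, which is why 3 rows were listed and 3 rows were dropped. To see every copy of a duplicated row side by side, `keep=False` can be used. A small sketch on made-up data:

```python
import pandas as pd

# Toy frame: rows 1 and 2 are identical; rows 3 and 4 share a call_id only.
toy = pd.DataFrame({
    "call_id": [1, 2, 2, 3, 3],
    "block_title": ["Main Menu", "Digest Intro", "Digest Intro", "Main Menu", "Fin_Journal"],
})

# Default behaviour: only the second and later copies are flagged.
later_copies = toy[toy.duplicated()]

# keep=False flags every member of a duplicated group.
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each group.
deduped = toy.drop_duplicates()
print(all_copies)
```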

###**Data types**

In [18]:
df.dtypes

call_id                                         Int64
call_date                                      dbdate
subscriber_id                                   Int64
call_started                      datetime64[ns, UTC]
call_ended                        datetime64[ns, UTC]
age                                            object
gender                                         object
location                                       object
location_level_2                               object
subscriber_pereferred_language                 object
duration_listened_minutes                     float64
duration_listened_seconds                     float64
block_title                                    object
block_theme                                    object
block_topic                                    object
dtype: object

Matplotlib needs the date to be in datetime64[ns], so I'm changing the [call_date] column's data type from dbdate to datetime64[ns].

In [19]:
df.call_date = pd.to_datetime(df.call_date, format="%Y-%m-%d")

In [20]:
df.dtypes

call_id                                         Int64
call_date                              datetime64[ns]
subscriber_id                                   Int64
call_started                      datetime64[ns, UTC]
call_ended                        datetime64[ns, UTC]
age                                            object
gender                                         object
location                                       object
location_level_2                               object
subscriber_pereferred_language                 object
duration_listened_minutes                     float64
duration_listened_seconds                     float64
block_title                                    object
block_theme                                    object
block_topic                                    object
dtype: object

The data types for other columns look ok.

##**Workstream 2: Bivariate Data Exploration**

###**Question 15: Distribution of calls (unique call ids) over time (day, week, month)**

####**Day**

In [21]:
day_calls_df = df[['call_date', 'call_id']].copy()
day_calls_df = day_calls_df.drop_duplicates()

In [22]:
day_calls_df.call_date.nunique()

691

I'm going to set the number of bins in the graph to 691 so that each bin covers exactly 1 day.

In [23]:
day_calls_df.call_date.describe()

count                   99915
unique                    691
top       2021-11-26 00:00:00
freq                      242
first     2020-12-31 00:00:00
last      2022-11-21 00:00:00
Name: call_date, dtype: object

In [24]:
day_calls_df.groupby('call_date')['call_id'].count().sort_values(ascending=False)

call_date
2021-11-26    242
2021-11-15    240
2021-11-18    240
2021-11-19    234
2021-12-02    232
             ... 
2022-07-10     75
2022-05-07     73
2022-08-21     72
2022-11-20     61
2020-12-31      1
Name: call_id, Length: 691, dtype: int64

From the data above, we can see that: 
- **2021-11-26** got the **most** unique calls (242 calls in total). 
- **2020-12-31** got the **fewest** unique calls (1 call in total).
- The **earliest** date we have here is **2020-12-31**.
- The **latest** date we have here is **2022-11-21**.

In [26]:
fig = px.histogram(day_calls_df, x='call_date', nbins=691)
fig.show()

Hover over the graph above to see the exact count on each day. 
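If exact per-day counts are needed without depending on how the histogram picks its bins (as far as I understand, plotly treats `nbins` as an upper limit, not an exact number), they can be computed directly. A minimal sketch on made-up data; `toy` and `full_range` are names I introduce for illustration:

```python
import pandas as pd

# Made-up calls; 2021-01-03 deliberately has no calls.
toy = pd.DataFrame({
    "call_date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-04"]),
    "call_id": [10, 11, 12, 13],
})

# Unique calls per day.
daily_counts = toy.groupby("call_date")["call_id"].nunique()

# Reindex on the full calendar so days without calls show up as 0.
full_range = pd.date_range(daily_counts.index.min(), daily_counts.index.max(), freq="D")
daily_counts = daily_counts.reindex(full_range, fill_value=0)
print(daily_counts)
```

The resulting series could then be drawn as a bar chart, for example with `px.bar`, with one bar per calendar day.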

####**Week**

In [27]:
week_calls_df = day_calls_df.copy()
week_calls_df['week'] = pd.to_datetime(week_calls_df['call_date']).dt.to_period('W')
week_calls_df = week_calls_df.drop(columns=['call_date'])
week_calls_df = week_calls_df.drop_duplicates()
week_calls_df

Unnamed: 0,call_id,week
0,1314686664769595156,2021-12-06/2021-12-12
1,1198628852222979868,2021-01-18/2021-01-24
2,1355218011510336948,2022-03-28/2022-04-03
3,1401874574773641300,2022-08-01/2022-08-07
4,1330130663764454024,2022-01-17/2022-01-23
...,...,...
99995,1285506573024161260,2021-09-13/2021-09-19
99996,1423762135360007960,2022-10-03/2022-10-09
99997,1255785381656846340,2021-06-21/2021-06-27
99998,1355548180473243268,2022-03-28/2022-04-03


In [28]:
week_calls_df.week.nunique()

100

There are 100 unique weeks.

In [29]:
week_calls_df.week.describe()

count                     99915
unique                      100
top       2021-11-15/2021-11-21
freq                       1532
Name: week, dtype: object

In [30]:
week_calls_df.sort_values(by='week')

Unnamed: 0,call_id,week
16115,1192237056190376744,2020-12-28/2021-01-03
46825,1192457069379775800,2020-12-28/2021-01-03
95930,1192382733427994124,2020-12-28/2021-01-03
79976,1191572649424773808,2020-12-28/2021-01-03
55677,1192370052994230916,2020-12-28/2021-01-03
...,...,...
9882,1441495278556866392,2022-11-21/2022-11-27
55412,1441499205968913936,2022-11-21/2022-11-27
16684,1441449766436794312,2022-11-21/2022-11-27
52710,1441437987799363444,2022-11-21/2022-11-27


In [31]:
week_calls_df.groupby('week')['call_id'].count().sort_values(ascending=False)

week
2021-11-15/2021-11-21    1532
2021-11-22/2021-11-28    1441
2021-11-29/2021-12-05    1383
2021-11-08/2021-11-14    1306
2021-11-01/2021-11-07    1302
                         ... 
2022-11-07/2022-11-13     766
2022-06-27/2022-07-03     760
2022-11-14/2022-11-20     703
2020-12-28/2021-01-03     416
2022-11-21/2022-11-27      93
Freq: W-SUN, Name: call_id, Length: 100, dtype: int64

From the data above, we can see that: 
- The week **2021-11-15/2021-11-21** got the **most** unique calls (1532 calls in total). 
- The week **2022-11-21/2022-11-27** got the **fewest** unique calls (93 calls in total).
- The **earliest** week we have here is **2020-12-28/2021-01-03**.
- The **latest** week we have here is **2022-11-21/2022-11-27**.

In [32]:
week_calls_df_temp = week_calls_df.copy()
week_calls_df_temp['week'] = week_calls_df_temp['week'].astype(str)
week_calls_df_temp.sort_values(by='week', inplace=True)

fig2 = px.histogram(week_calls_df_temp, x='week')
fig2.show()

Hover over the graph above to see the exact count in each week. 
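One caveat for the weekly counts: the sample runs from 2020-12-31 to 2022-11-21, so the first and last weeks are only partly covered, which probably explains their low totals. A minimal sketch of how partly covered weeks could be flagged, using made-up dates (the `toy` frame and the `days_per_week` name are my own):

```python
import pandas as pd

# One made-up call per day from a Thursday to a Tuesday two weeks later.
toy = pd.DataFrame({"call_date": pd.date_range("2020-12-31", "2021-01-12", freq="D")})
toy["call_id"] = range(len(toy))

# Same Monday-to-Sunday weekly periods as used above.
toy["week"] = toy["call_date"].dt.to_period("W")

# Number of distinct days with data in each week.
days_per_week = toy.groupby("week")["call_date"].nunique()

# Weeks with fewer than 7 observed days are only partly covered.
partial_weeks = days_per_week[days_per_week < 7]
print(partial_weeks)
```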

####**Month**

In [33]:
month_calls_df = day_calls_df.copy()
month_calls_df['month'] = pd.to_datetime(month_calls_df['call_date']).dt.to_period('M')
month_calls_df = month_calls_df.drop(columns=['call_date'])
month_calls_df = month_calls_df.drop_duplicates()
month_calls_df

Unnamed: 0,call_id,month
0,1314686664769595156,2021-12
1,1198628852222979868,2021-01
2,1355218011510336948,2022-03
3,1401874574773641300,2022-08
4,1330130663764454024,2022-01
...,...,...
99995,1285506573024161260,2021-09
99996,1423762135360007960,2022-10
99997,1255785381656846340,2021-06
99998,1355548180473243268,2022-03


In [34]:
month_calls_df.month.nunique()

24

There are 24 unique months.

In [35]:
month_calls_df.month.describe()

count       99915
unique         24
top       2021-11
freq         5931
Name: month, dtype: object

In [36]:
month_calls_df.sort_values(by='month')

Unnamed: 0,call_id,month
5789,1191523579024960748,2020-12
29249,1192890385266434264,2021-01
43910,1196685530986242764,2021-01
95283,1198181603881836656,2021-01
57722,1198376272263702128,2021-01
...,...,...
41570,1439377535384742048,2022-11
1954,1434892409758083916,2022-11
21663,1436731415412142036,2022-11
29976,1438602641348814668,2022-11


In [37]:
month_calls_df.groupby('month')['call_id'].count().sort_values(ascending=False)

month
2021-11    5931
2022-03    5144
2021-10    4866
2022-02    4807
2021-07    4739
2022-04    4720
2021-12    4719
2021-04    4634
2021-09    4605
2021-03    4473
2022-01    4410
2021-05    4404
2021-08    4360
2021-01    4320
2021-06    4281
2022-07    4110
2022-08    4036
2022-05    4030
2022-10    3864
2022-06    3780
2021-02    3745
2022-09    3689
2022-11    2247
2020-12       1
Freq: M, Name: call_id, dtype: int64

From the data above, we can see that: 
- The month **2021-11** got the **most** unique calls (5931 calls in total). 
- The month **2020-12** got the **fewest** unique calls (1 call in total).
- The **earliest** month we have here is **2020-12**.
- The **latest** month we have here is **2022-11**.

In [97]:
month_calls_df_temp = month_calls_df.copy()
month_calls_df_temp['month'] = month_calls_df_temp['month'].astype(str)
month_calls_df_temp.sort_values(by='month', inplace=True)

fig3 = px.histogram(month_calls_df_temp, x='month', nbins=24)
fig3.show()

Hover over the graph above to see the exact count in each month. 
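The months at both ends are only partly covered (the data starts on 2020-12-31 and ends on 2022-11-21), so their totals are not directly comparable with full months. One possible way to put months on an equal footing is calls per covered day. A rough sketch on made-up data, where `per_month` and its column names are assumptions of mine:

```python
import pandas as pd

# One made-up call per day, starting on the last day of a month.
toy = pd.DataFrame({"call_date": pd.date_range("2020-12-31", "2021-02-28", freq="D")})
toy["call_id"] = range(len(toy))
toy["month"] = toy["call_date"].dt.to_period("M")

# Unique calls and number of days with data, per month.
per_month = toy.groupby("month").agg(
    calls=("call_id", "nunique"),
    days_covered=("call_date", "nunique"),
)

# Average calls per covered day makes partial months comparable.
per_month["calls_per_day"] = per_month["calls"] / per_month["days_covered"]
print(per_month)
```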

###**Question 16: Distribution of number of calls per subscriber (overall, by month)**

In [68]:
subscriber_calls_df = df[['subscriber_id', 'call_id', 'call_date']].copy()
subscriber_calls_df[subscriber_calls_df.duplicated()]

Unnamed: 0,subscriber_id,call_id,call_date
2557,988439896706834432,1430975462423193872,2022-10-23
4945,984560638909538304,1246384427786362928,2021-06-01
18482,1043184358951280640,1296891495395683180,2021-10-18
27521,1137734825101811712,1309269045820317760,2021-11-21
29206,1361629175370802972,1382354042806134540,2022-06-11
...,...,...,...
94504,1337513781379001212,1338080895059945316,2022-02-09
95472,966307122495676416,1366820943863277708,2022-04-29
95487,1305278998242716568,1354165744082414808,2022-03-25
99062,1111931167240151040,1278000432446760292,2021-08-27


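There are 82 rows that repeat an earlier (subscriber_id, call_id, call_date) combination, presumably because a single call can span several rows (for example, one per block listened to).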

In [69]:
subscriber_calls_df.drop_duplicates(inplace=True)
subscriber_calls_df

Unnamed: 0,subscriber_id,call_id,call_date
0,1145379592367300608,1314686664769595156,2021-12-06
1,991681842476802048,1198628852222979868,2021-01-20
2,1096547323951505408,1355218011510336948,2022-03-28
3,1082214802493333505,1401874574773641300,2022-08-04
4,994214908340527104,1330130663764454024,2022-01-18
...,...,...,...
99995,1108701495542996992,1285506573024161260,2021-09-17
99996,1422365841269777136,1423762135360007960,2022-10-03
99997,1255486224605963248,1255785381656846340,2021-06-27
99998,1209990486736561824,1355548180473243268,2022-03-29


In [70]:
subscriber_calls_df.groupby('subscriber_id')['call_id'].count().sort_values(ascending=False)

subscriber_id
977142100578066432     22
1328677109371560040    19
1141692174535548928    19
1185540172512027116    16
1147494740703895552    14
                       ..
1052229678796431360     1
1052229029417508864     1
1052225021638402048     1
1052216748411510784     1
1441499200696674780     1
Name: call_id, Length: 81697, dtype: int64

In [74]:
fig = px.histogram(subscriber_calls_df.groupby('subscriber_id')['call_id'].count())
fig.show()

It seems like most of the subscribers in this sample called only once, which leaves little to plot per subscriber. To have enough calls to form a distribution by month, I'm going to pick only the 10 subscribers with the most calls.

In [76]:
temp_series = subscriber_calls_df.groupby('subscriber_id')['call_id'].count().sort_values(ascending=False).head(10)
temp_series

subscriber_id
977142100578066432     22
1328677109371560040    19
1141692174535548928    19
1185540172512027116    16
1147494740703895552    14
974707124544790528     14
1297566227384297616    13
1333534415007574604    13
1342449927968058360    13
1140319678510981120    13
Name: call_id, dtype: int64

In [79]:
top10_subscribers = temp_series.index.to_numpy()

In [134]:
for i, subscriber in enumerate(top10_subscribers, start=1):
  # Keep only this subscriber's calls and tag each call with its month
  single_subscriber_df = subscriber_calls_df.loc[subscriber_calls_df.subscriber_id == subscriber].copy()
  single_subscriber_df['month'] = pd.to_datetime(single_subscriber_df['call_date']).dt.to_period('M')
  single_subscriber_df.sort_values(by='call_date', inplace=True)

  # Number of months from the first to the last call (inclusive), used as the bin count
  month_differences = (single_subscriber_df.month.iloc[-1] - single_subscriber_df.month.iloc[0]).n + 1

  # Convert the Period values to strings for plotting
  single_subscriber_df['month'] = single_subscriber_df['month'].astype(str)
  single_subscriber_df.drop(columns=['subscriber_id', 'call_date'], inplace=True)

  title = f"Distribution of calls over month for subscriber No.{i} with subscriber_id {subscriber}"

  fig = px.histogram(single_subscriber_df, x='month', nbins=month_differences, title=title)
  fig.show()
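The charts above cover only ten subscribers. For a by-month view over all subscribers, one option is to count calls per subscriber within each month and then summarize those counts per month. A minimal sketch on made-up data (`toy`, `calls_per_sub_month` and `summary` are illustrative names, not part of the analysis above):

```python
import pandas as pd

# Made-up calls for three subscribers across three months.
toy = pd.DataFrame({
    "subscriber_id": [1, 1, 1, 2, 2, 3],
    "call_id": [10, 11, 12, 20, 21, 30],
    "call_date": pd.to_datetime([
        "2021-01-05", "2021-01-20", "2021-02-01",
        "2021-01-07", "2021-03-09", "2021-02-11",
    ]),
})
toy["month"] = toy["call_date"].dt.to_period("M")

# Unique calls per subscriber within each month.
calls_per_sub_month = toy.groupby(["month", "subscriber_id"])["call_id"].nunique()

# Per month: number of active subscribers, mean and max calls per subscriber.
summary = calls_per_sub_month.groupby(level="month").agg(["count", "mean", "max"])
print(summary)
```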
  

###**Question 20: Time of day distribution of calls**

In [39]:
hours_calls_df = df[['call_started', 'call_id']].copy()
hours_calls_df = hours_calls_df.drop_duplicates()

In [40]:
hours_calls_df['time'] = hours_calls_df['call_started'].dt.time
hours_calls_df.drop(columns=['call_started'], inplace=True)

In [41]:
hours_calls_df[hours_calls_df.duplicated()]

Unnamed: 0,call_id,time


No duplicates.

In [42]:
hours_calls_df

Unnamed: 0,call_id,time
0,1314686664769595156,20:15:26
1,1198628852222979868,14:03:07
2,1355218011510336948,16:32:33
3,1401874574773641300,10:29:04
4,1330130663764454024,11:04:22
...,...,...
99995,1285506573024161260,07:44:10
99996,1423762135360007960,20:02:25
99997,1255785381656846340,07:22:46
99998,1355548180473243268,14:24:31


In [43]:
hours_calls_df.time.describe()

count        99915
unique       51649
top       19:34:44
freq            10
Name: time, dtype: object

In [44]:
hours_calls_df.sort_values(by='time', inplace=True)
hours_calls_df

Unnamed: 0,call_id,time
55526,1400991497620876000,00:00:02
76945,1343734218110853208,00:00:02
84887,1381422553616672808,00:00:02
55146,1371638091374717928,00:00:04
33641,1262196961554522492,00:00:06
...,...,...
79380,1227407678096598552,23:59:54
68171,1193705611343749408,23:59:55
7835,1283940193770464732,23:59:56
76094,1230306791624599776,23:59:57


In [45]:
hours_calls_df.groupby('time')['call_id'].count().sort_values(ascending=False)

time
19:34:44    10
19:56:40    10
22:20:15     9
20:08:16     9
20:57:05     9
            ..
13:20:19     1
13:20:13     1
13:20:12     1
13:20:10     1
23:59:58     1
Name: call_id, Length: 51649, dtype: int64

From the data above, we can see that: 
- The time **19:34:44** got the **most** unique calls (10 calls in total). 
- The **lowest** count here is 1 call, and many timestamps share it.
- The **earliest** time we have here is **00:00:02**.
- The **latest** time we have here is **23:59:58**.

In [46]:
fig = px.histogram(hours_calls_df, x='time')
fig.show()

Hover over the graph above to see the exact count at each timestamp.
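With 51,649 distinct timestamps, counts per exact second are too sparse to show a pattern, so grouping by hour may give a clearer picture. A minimal sketch with made-up timestamps; the hours are in UTC, which as far as I know matches local time in Mali (GMT, no daylight saving):

```python
import pandas as pd

# Made-up call start times in UTC.
calls = pd.DataFrame({
    "call_started": pd.to_datetime(
        [
            "2021-12-06 20:15:26",
            "2021-01-20 14:03:07",
            "2022-10-03 20:02:25",
            "2021-09-17 07:44:10",
        ],
        utc=True,
    ),
    "call_id": [1, 2, 3, 4],
})

# Hour of day (0-23) in which each call started.
calls["hour"] = calls["call_started"].dt.hour

# Unique calls per hour, with hours that had no calls shown as 0.
hourly_counts = calls.groupby("hour")["call_id"].nunique().reindex(range(24), fill_value=0)
print(hourly_counts[hourly_counts > 0])
```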

###**Question 21: Day of week distribution of calls**

In [47]:
dayofweek_calls_df = df[['call_date', 'call_id']].copy()
dayofweek_calls_df = dayofweek_calls_df.drop_duplicates()

In [48]:
dayofweek_calls_df['dayofweek'] = dayofweek_calls_df['call_date'].dt.day_name()
dayofweek_calls_df.drop(columns=['call_date'], inplace=True)

In [49]:
dayofweek_calls_df[dayofweek_calls_df.duplicated()]

Unnamed: 0,call_id,dayofweek


No duplicates.

In [50]:
dayofweek_calls_df

Unnamed: 0,call_id,dayofweek
0,1314686664769595156,Monday
1,1198628852222979868,Wednesday
2,1355218011510336948,Monday
3,1401874574773641300,Thursday
4,1330130663764454024,Tuesday
...,...,...
99995,1285506573024161260,Friday
99996,1423762135360007960,Monday
99997,1255785381656846340,Sunday
99998,1355548180473243268,Tuesday


In [51]:
dayofweek_calls_df.dayofweek.describe()

count      99915
unique         7
top       Friday
freq       15485
Name: dayofweek, dtype: object

In [52]:
cats = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

dayofweek_calls_df['dayofweek'] = pd.Categorical(dayofweek_calls_df['dayofweek'], categories=cats, ordered=True)
dayofweek_calls_df = dayofweek_calls_df.sort_values('dayofweek')
dayofweek_calls_df

Unnamed: 0,call_id,dayofweek
0,1314686664769595156,Monday
41796,1299445878780126836,Monday
41794,1228294803864217572,Monday
41787,1248537981443695252,Monday
41781,1398359113146624272,Monday
...,...,...
69642,1210293552811664888,Sunday
33920,1410597584577817224,Sunday
33919,1403043115619117584,Sunday
15225,1286463436263843352,Sunday


In [53]:
dayofweek_calls_df.groupby('dayofweek')['call_id'].count().sort_values(ascending=False)

dayofweek
Friday       15485
Wednesday    15236
Monday       14916
Tuesday      14876
Thursday     14785
Saturday     13049
Sunday       11568
Name: call_id, dtype: int64

From the data above, we can see that: 
- **Friday** got the **most** unique calls (15,485 calls in total). 
- **Sunday** got the **fewest** unique calls (11,568 calls in total).

In [54]:
fig = px.histogram(dayofweek_calls_df, x='dayofweek')
fig.show()

Hover over the graph above to see the exact count on each weekday.

###**Extra: Day of month distribution of calls**



In [55]:
dayofmonth_calls_df = df[['call_date', 'call_id']].copy()
dayofmonth_calls_df = dayofmonth_calls_df.drop_duplicates()

In [56]:
dayofmonth_calls_df['dayofmonth'] = dayofmonth_calls_df['call_date'].dt.day.values
dayofmonth_calls_df

Unnamed: 0,call_date,call_id,dayofmonth
0,2021-12-06,1314686664769595156,6
1,2021-01-20,1198628852222979868,20
2,2022-03-28,1355218011510336948,28
3,2022-08-04,1401874574773641300,4
4,2022-01-18,1330130663764454024,18
...,...,...,...
99995,2021-09-17,1285506573024161260,17
99996,2022-10-03,1423762135360007960,3
99997,2021-06-27,1255785381656846340,27
99998,2022-03-29,1355548180473243268,29


In [57]:
dayofmonth_calls_df.drop(columns=['call_date'], inplace=True)
dayofmonth_calls_df.sort_values(by='dayofmonth', inplace=True)
dayofmonth_calls_df['dayofmonth'] = dayofmonth_calls_df['dayofmonth'].astype(str)
dayofmonth_calls_df

Unnamed: 0,call_id,dayofmonth
58561,1257299813805384728,1
58846,1302040339225832020,1
85461,1356607524631539100,1
85458,1345326829007332924,1
35473,1224271896037092128,1
...,...,...
1861,1301671956407969856,31
26624,1223958684318098384,31
9616,1400446695707831216,31
4171,1378346449683145676,31


In [58]:
dayofmonth_calls_df[dayofmonth_calls_df.duplicated()]

Unnamed: 0,call_id,dayofmonth


No duplicates.

In [59]:
dayofmonth_calls_df.dayofmonth.describe()

count     99915
unique       31
top           1
freq       3446
Name: dayofmonth, dtype: object

In [60]:
dayofmonth_calls_df.groupby('dayofmonth')['call_id'].count().sort_values(ascending=False)

dayofmonth
1     3446
15    3429
6     3425
5     3386
4     3375
8     3352
2     3348
23    3340
14    3340
21    3334
25    3326
22    3319
13    3303
7     3301
11    3299
16    3288
18    3284
17    3270
12    3270
27    3270
19    3248
9     3235
24    3230
10    3229
28    3222
20    3211
3     3186
26    3114
29    2949
30    2766
31    1820
Name: call_id, dtype: int64

From the data above, we can see that: 
- The day of month **1** got the **most** unique calls (3,446 calls in total). 
- The day of month **31** got the **fewest** unique calls (1,820 calls in total).

Note that not every month has a 29th, 30th, or 31st, so lower totals for those days are expected.

In [61]:
fig = px.histogram(dayofmonth_calls_df, x='dayofmonth')
fig.show()

Hover over the graph above to see the exact count on each day of month.
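To account for days of the month that occur less often, the totals could be divided by the number of times each day of month appears in the covered period. The sketch below uses the first and last dates reported earlier (2020-12-31 and 2022-11-21) and three of the totals from the table above; the variable names are my own.

```python
import pandas as pd

# Every calendar day covered by the sample (first and last dates from above).
covered_days = pd.date_range("2020-12-31", "2022-11-21", freq="D")

# How many times each day of month occurs in that period.
occurrences = pd.Series(covered_days.day).value_counts().sort_index()

# Totals for a few days of month, copied from the table above.
calls_by_day = pd.Series({1: 3446, 30: 2766, 31: 1820})

# Average unique calls per occurrence of each day of month.
calls_per_occurrence = calls_by_day / occurrences.reindex(calls_by_day.index)
print(calls_per_occurrence)
```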