# Introduction

## Question
How can we determine whether the location associated with each call record is a proxy for the location of the calling party or of the called party?

## Key Assumption
A rule should exist; that is, the location recorded in each call record is not assigned at random during data collection.

> A related question: what is the data collection process supposed to be?
>> Suppose a company wants to collect the call detail records for Deyang. There are three types of calls: calls within Deyang, calls into Deyang, and calls out of Deyang. For each type, a call is retained if either the calling number or the called number belongs to one of the company's clients, and excluded otherwise.
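
The retention rule described above can be sketched in plain Python on a few toy records. This is only an illustration under stated assumptions: the helper names `call_type` and `retain`, the client set, and the sample numbers are hypothetical, and 838 is taken as the Deyang area code as in the rest of this notebook.

```python
DEYANG = 838  # Deyang area code, as used in this notebook


def call_type(calling_area_code, called_area_code, home=DEYANG):
    """Classify a call relative to the home area (hypothetical helper)."""
    calling_home = calling_area_code == home
    called_home = called_area_code == home
    if calling_home and called_home:
        return "within"
    if calling_home:
        return "out"
    if called_home:
        return "in"
    return "outside"


def retain(record, clients):
    """Keep a call only if at least one party is a client of the company."""
    return record["calling_nbr"] in clients or record["called_nbr"] in clients


# Toy data: made-up numbers, not taken from the real data set.
clients = {"a1", "b2"}
records = [
    {"calling_nbr": "a1", "called_nbr": "x9", "calling_area_code": 838, "called_area_code": 28},
    {"calling_nbr": "y7", "called_nbr": "z3", "calling_area_code": 838, "called_area_code": 838},
    {"calling_nbr": "q4", "called_nbr": "b2", "calling_area_code": 28, "called_area_code": 838},
]

kept = [r for r in records if retain(r, clients)]
kept_types = [call_type(r["calling_area_code"], r["called_area_code"]) for r in kept]
print(kept_types)
```

The second toy record is dropped because neither number is a client, even though the call takes place within Deyang.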

## Rules
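1. The tower location is a proxy for the calling party.
2. The tower location is a proxy for the called party.
3. The tower location is a proxy for the client.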


## Setup
If both the calling and the called area codes are 838, meaning that both parties are within Deyang, it is impossible to tell whether the tower location is a proxy for the calling party or the called party, because both area codes are 838.

> The tower area code is inferred from the address corresponding to the longitude and latitude of the tower.

Therefore, we consider only those call records in which exactly one of the calling and called area codes is 838.
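
As a minimal sketch of this selection, the predicate below keeps a record only when exactly one side has area code 838. The function name `exactly_one_in_deyang` and the sample pairs are hypothetical and only illustrate the condition stated above.

```python
DEYANG = 838  # Deyang area code, as used in this notebook


def exactly_one_in_deyang(calling_area_code, called_area_code):
    """True when exactly one of the two area codes is Deyang's (an XOR)."""
    return (calling_area_code == DEYANG) != (called_area_code == DEYANG)


# Toy (calling_area_code, called_area_code) pairs, not real data.
pairs = [(838, 838), (838, 28), (28, 838), (28, 816)]
informative = [p for p in pairs if exactly_one_in_deyang(*p)]
print(informative)
```

The cells further down filter on `calling_area_code != called_area_code`, which is a weaker condition: it would also admit a pair such as `(28, 816)` if records like that exist in the data.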

If the tower location is a proxy for the calling party, then the calling area code should always be consistent with the tower area code; the same reasoning applies to the called party and to the client. Note that the "client area code" is inferred from the "client number": if the client number is the calling number, the client area code is the calling area code, and likewise for the called number.
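
The consistency check can be sketched per record as follows. The helpers `client_area_code` and `tower_matches` are hypothetical names introduced for illustration; the field names mirror the columns of the data set, and the sample record is made up.

```python
def client_area_code(record):
    """Area code of whichever party is the client, or None if neither is."""
    if record["client_nbr"] == record["calling_nbr"]:
        return record["calling_area_code"]
    if record["client_nbr"] == record["called_nbr"]:
        return record["called_area_code"]
    return None


def tower_matches(record):
    """Report which candidate area codes agree with the tower area code."""
    tower = record["tower_area_code"]
    return {
        "calling": tower == record["calling_area_code"],
        "called": tower == record["called_area_code"],
        "client": tower == client_area_code(record),
    }


# Toy record: the client is the called party, located in Deyang.
record = {
    "client_nbr": "b2",
    "calling_nbr": "a1",
    "calling_area_code": 28,
    "called_nbr": "b2",
    "called_area_code": 838,
    "tower_area_code": 838,
}
matches = tower_matches(record)
print(matches)
```

If one of the three rules held for every record, the corresponding entry would be `True` across the whole data set; the cells below measure how often each one holds.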

In [1]:
import os
os.chdir('..')

from dask_cuda import LocalCUDACluster
from dask.distributed import Client


cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES="0,1",
    protocol="ucx",
    enable_tcp_over_ucx=True,
    enable_infiniband=True,
    rmm_managed_memory=True,
    rmm_pool_size='24GB'
)
client = Client(cluster)
client

Perhaps you already have a cluster running?
Hosting the HTTP server on port 43061 instead


Client
Connection method: Cluster object, Cluster type: dask_cuda.LocalCUDACluster
Dashboard: http://127.0.0.1:43061/status

Cluster
Dashboard: http://127.0.0.1:43061/status, Workers: 2
Total threads: 2, Total memory: 502.58 GiB
Status: running, Using processes: True

Scheduler
Comm: ucx://127.0.0.1:33271, Workers: 2
Dashboard: http://127.0.0.1:43061/status, Total threads: 2
Started: Just now, Total memory: 502.58 GiB

Worker
Comm: ucx://127.0.0.1:53435, Total threads: 1
Dashboard: http://127.0.0.1:36303/status, Memory: 251.29 GiB
Nanny: ucx://127.0.0.1:56911
Local directory: /tmp/dask-scratch-space/worker-rwybmhv4
GPU: NVIDIA GeForce RTX 3090, GPU memory: 24.00 GiB

Worker
Comm: ucx://127.0.0.1:48661, Total threads: 1
Dashboard: http://127.0.0.1:40741/status, Memory: 251.29 GiB
Nanny: ucx://127.0.0.1:48575
Local directory: /tmp/dask-scratch-space/worker-o5vgdfmg
GPU: NVIDIA RTX A4000, GPU memory: 15.99 GiB


In [2]:
import cudf
import dask
import dask.dataframe as dd

dask.config.set({"dataframe.backend": "cudf"})


cdr = dd.read_parquet('data/processed/201308/cdr_loc')
N = len(cdr)

cdr.head()

| Unnamed: 0 | serv_id | client_nbr | calling_nbr | calling_area_code | duration | called_area_code | called_nbr | cell_id | time | day_of_week | lon | lat | deyang_center_flag | tower_area_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a5fvkt7s | 5z06jq6z | 5z06jq6z | 838 | 33 | 838 | 46rc69e6 | 348C | 21 | 4 | 104.263 | 31.0913 | 0 | 838 |
| 1 | 9fo16vqy | 6hz03ag8 | gcy0l5w8 | 838 | 483 | 838 | 6hz03ag8 | 3B93 | 21 | 3 | 104.709 | 31.2001 | 0 | 838 |
| 2 | 9mgzc8ht | fq343ibo | 7kwzd0j0 | 838 | 39 | 838 | fq343ibo | 3488 | 14 | 5 | 104.29 | 30.9884 | 0 | 838 |
| 3 | 8nil2kw1 | ce7t2smm | 5r0c6xjn | 838 | 25 | 838 | ce7t2smm | 390C | 8 | 0 | 104.245 | 31.1316 | 0 | 838 |
| 4 | 8uu2mydh | ima05c55 | dkhqlzr7 | 838 | 129 | 838 | ima05c55 | 36DB | 12 | 1 | 104.429 | 31.1813 | 0 | 838 |


## Is the tower location a proxy for the calling or the called party?

In [3]:
# Noise ratio: among calls whose calling and called area codes differ
#     (taken here as calls with only one party within Deyang), the share whose
#     tower area code matches neither the calling nor the called area code
(
    cdr[
        (cdr.calling_area_code != cdr.called_area_code)
        & (cdr.calling_area_code != cdr.tower_area_code)
        & (cdr.called_area_code != cdr.tower_area_code)
    ].shape[0]
    /
    cdr[
        cdr.calling_area_code != cdr.called_area_code
    ].shape[0]
).compute()

0.053215001881392235

In [4]:
group = cdr[
    (cdr.calling_area_code != cdr.called_area_code)
    # &(
    #     (cdr.calling_area_code == cdr.tower_area_code)
    #     |(cdr.called_area_code == cdr.tower_area_code)
    # )
]

n = group.shape[0]
n.compute()

8169482

In [5]:
n1 = (
    group[
        group.tower_area_code == group.calling_area_code
    ]
    .shape[0]
)

(n1 / n).compute()

0.5061230320355685

In [6]:
n2 = (
    group[
        group.tower_area_code == group.called_area_code
    ]
    .shape[0]
)

(n2 / n).compute()

0.4406619660830393

In [7]:
(
    group[
        (group.calling_area_code != group.tower_area_code)
        & (group.called_area_code != group.tower_area_code)
    ]
    .shape[0]
    / n
).compute()

0.053215001881392235
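
A quick arithmetic check on the three ratios printed above: because every record in `group` has different calling and called area codes, the tower area code can match at most one of them, so "matches calling", "matches called", and "matches neither" should partition the group and the shares should sum to 1. The sketch below only reuses the printed values; the variable names are introduced here for illustration.

```python
# Shares copied from the outputs of In [5], In [6], and In [7].
share_calling = 0.5061230320355685
share_called = 0.4406619660830393
share_neither = 0.053215001881392235

# The three cases are mutually exclusive and exhaustive within `group`.
total = share_calling + share_called + share_neither
print(total)

# Among records where the tower matches one side, the split between sides.
calling_given_match = share_calling / (share_calling + share_called)
print(calling_given_match)
```

The tower area code agrees with the calling side only slightly more often than with the called side, so neither of the first two rules holds on its own.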

## How about the client area code?

In [8]:
official_client = cudf.read_csv('data/processed/201308/clean_user_info.csv')['client_nbr']

In [9]:
(
    group[
        (group.tower_area_code == group.calling_area_code)
        & (group.client_nbr == group.calling_nbr)
    ].shape[0]
    /
    n1
).compute()

0.5365669084298181

In [10]:
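# NOTE: the numerator below is drawn from the tower == called group (size n2),
#       but the ratio printed underneath was computed with n1 as the denominator.
(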
    group[
        (group.tower_area_code == group.called_area_code)
        & (group.client_nbr == group.called_nbr)
    ].shape[0]
    /
    n1
).compute()

0.4036855316737622
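
The ratios in this section are conditional shares. As a minimal sketch on toy records, the helper below divides each numerator by the size of the group it is drawn from, which keeps the shares for the two tower-match groups comparable. The function `share`, the predicates, and the rows are all hypothetical and are not taken from the data set.

```python
def share(rows, condition, given):
    """Share of rows satisfying `condition` among rows satisfying `given`."""
    base = [r for r in rows if given(r)]
    if not base:
        return float("nan")
    return sum(1 for r in base if condition(r)) / len(base)


def tower_is_calling(r):
    return r["tower"] == r["calling"]


def tower_is_called(r):
    return r["tower"] == r["called"]


def client_is_calling(r):
    return r["client"] == "calling"


def client_is_called(r):
    return r["client"] == "called"


# Toy rows: area codes for tower, calling, called, and which side is the client.
rows = [
    {"tower": 838, "calling": 838, "called": 28, "client": "calling"},
    {"tower": 838, "calling": 838, "called": 28, "client": "called"},
    {"tower": 838, "calling": 28, "called": 838, "client": "called"},
    {"tower": 838, "calling": 28, "called": 838, "client": "called"},
    {"tower": 838, "calling": 28, "called": 838, "client": "calling"},
    {"tower": 28, "calling": 816, "called": 838, "client": "called"},
]

p_client_given_calling = share(rows, client_is_calling, tower_is_calling)
p_client_given_called = share(rows, client_is_called, tower_is_called)
print(p_client_given_calling, p_client_given_called)
```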

## Extension: Why does tower_area_code equal calling_area_code while client_nbr equals called_nbr?

In [11]:
_test = group.compute()

test = _test[
    (_test.tower_area_code == _test.calling_area_code)
    & (_test.client_nbr == _test.called_nbr)
]
test.shape[0]

1916203

In [12]:
(
    test[
        ~test.calling_nbr.isin(official_client)
    ].shape[0]
    /
    test.shape[0]
)

0.7894048803806277

In [13]:
(
    test[
        (test.calling_nbr.isin(official_client))
        &(test.called_nbr.isin(official_client))
    ].shape[0]
    /
    test.shape[0]
)

0.17625011546271455

In [14]:
test.tower_area_code.value_counts()

838    1864294
28       50699
816        902
817         58
833         40
825         39
834         31
839         30
837         27
830         22
831         12
818          9
827          8
835          8
813          7
836          7
826          5
812          3
832          1
852          1
Name: tower_area_code, dtype: int64

In [15]:
test[
    test.tower_area_code == 838
].shape[0]

1864294

In [16]:
test[
    (~test.calling_nbr.isin(official_client))
    &(test.tower_area_code == 838)
].shape[0]

1468959

In [17]:
test[
    (test.calling_nbr.isin(official_client))
    &(test.called_nbr.isin(official_client))
    &(test.tower_area_code == 838)
].shape[0]

330556

In [18]:
test[
    test.tower_area_code == 28
].shape[0]

50699

In [19]:
test[
    (~test.calling_nbr.isin(official_client))
    &(test.tower_area_code == 28)
].shape[0]

42657

In [20]:
test[
    (test.calling_nbr.isin(official_client))
    &(test.called_nbr.isin(official_client))
    &(test.tower_area_code == 28)
].shape[0]

7032
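
The last six cells print raw counts only. As a small sketch, the counts can be turned into shares per tower area code so that the two areas are directly comparable. The dictionary layout and key names are introduced here for illustration; the numbers are copied from the outputs of In [15] to In [20].

```python
# Counts within `test`, keyed by tower area code.
counts = {
    838: {"total": 1864294, "caller_not_client": 1468959, "both_clients": 330556},
    28: {"total": 50699, "caller_not_client": 42657, "both_clients": 7032},
}

shares = {}
for code, c in counts.items():
    # Records covered by neither of the two conditions counted above.
    other = c["total"] - c["caller_not_client"] - c["both_clients"]
    shares[code] = {
        "caller_not_client": c["caller_not_client"] / c["total"],
        "both_clients": c["both_clients"] / c["total"],
        "other": other / c["total"],
    }

for code, s in shares.items():
    print(code, {k: round(v, 4) for k, v in s.items()})
```

In both areas, the largest share of these records has a calling number that is not in the official client list.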