# DAT210x - Programming with Python for DS

## Module5 - Lab3

In [18]:
import pandas as pd
from datetime import timedelta
import matplotlib.pyplot as plt
import matplotlib

matplotlib.style.use('ggplot') # Look Pretty

A convenience function for you to use:

In [19]:
def clusterInfo(model):
    print("Cluster Analysis Inertia: ", model.inertia_)
    print('------------------------------------------')
    
    for i in range(len(model.cluster_centers_)):
        print("\n  Cluster ", i)
        print("    Centroid ", model.cluster_centers_[i])
        print("    #Samples ", (model.labels_==i).sum()) # NumPy Power

In [20]:
# Find the cluster with the least # attached nodes
def clusterWithFewestSamples(model):
    # Start from the total sample count, which no single cluster can exceed
    minSamples = len(model.labels_)
    minCluster = 0
    
    for i in range(len(model.cluster_centers_)):
        if minSamples > (model.labels_==i).sum():
            minCluster = i
            minSamples = (model.labels_==i).sum()

    print("\n  Cluster With Fewest Samples: ", minCluster)
    return (model.labels_==minCluster)

### CDRs

A [call detail record](https://en.wikipedia.org/wiki/Call_detail_record) (CDR) is a data record produced by a telephone exchange or other telecommunications equipment that documents the details of a telephone call or other telecommunications transaction (e.g., text message) that passes through that facility or device.

The record contains various attributes of the call, such as time, duration, completion status, source number, and destination number. It is the automated equivalent of the paper toll tickets that were written and timed by operators for long-distance calls in a manual telephone exchange.

The dataset we've curated for you contains call records for 10 people, tracked over the course of 3 years. Your job in this assignment is to find out where each of these people likely lives and works!

Start by loading up the dataset and taking a peek at its `head` and `dtypes`. You can convert date-strings to real date-time objects using `pd.to_datetime`, and the times using `pd.to_timedelta`:

In [21]:
df1 = pd.read_csv('Datasets/CDR.csv')
df1 = df1.dropna()
df1['CallDate'] = pd.to_datetime(df1['CallDate'], errors='coerce')
df1['CallTime'] = pd.to_timedelta(df1['CallTime'])
df1['Duration'] = pd.to_timedelta(df1['Duration'])
df1.dtypes

In                     int64
Out                    int64
Direction             object
CallDate      datetime64[ns]
CallTime     timedelta64[ns]
DOW                   object
Duration     timedelta64[ns]
TowerID               object
TowerLat             float64
TowerLon             float64
dtype: object
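
As a small illustration of the two conversions, the sketch below parses a made-up three-row frame (the values are invented, not taken from `CDR.csv`). It assumes the intent of the coerce option is that an unparseable date should become `NaT` instead of raising an error:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'CallDate': ['2010-12-25', 'not a date', '2011-01-03'],
    'CallTime': ['07:16:24.736813', '21:18:30.053710', '16:03:10.997508'],
})

# Unparseable dates become NaT rather than raising an exception
toy['CallDate'] = pd.to_datetime(toy['CallDate'], errors='coerce')
# 'HH:MM:SS.ffffff' strings become offsets from midnight
toy['CallTime'] = pd.to_timedelta(toy['CallTime'])

print(toy.dtypes)
print(toy['CallDate'].isna().sum(), 'date(s) could not be parsed')
```

Because `CallTime` is stored as an offset from midnight, it can later be compared directly against thresholds such as `pd.to_timedelta('17:00:00')`.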

Create a unique list of the phone number values (people) stored in the `In` column of the dataset, and save them in a regular Python list called `unique_numbers`. Manually check through `unique_numbers` to ensure the numbers appear in the same order in which they first appear in your dataset:

In [22]:
# .. your code here ..
unique_numbers = df1.In.unique().tolist()
unique_numbers

[4638472273,
 1559410755,
 4931532174,
 2419930464,
 1884182865,
 3688089071,
 4555003213,
 2068627935,
 2894365987,
 8549533077]
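
One way to convince yourself about the ordering is to try it on a tiny series. The sketch below uses invented numbers, not values from the dataset, and shows that `unique()` returns values in order of first appearance rather than sorted:

```python
import pandas as pd

# Invented phone numbers, repeated out of order
calls = pd.Series([555, 111, 555, 333, 111, 222])

# unique() does not sort; values come back in first-appearance order
first_seen = calls.unique().tolist()
print(first_seen)

# drop_duplicates() keeps the first occurrence too, so the two should agree
assert first_seen == calls.drop_duplicates().tolist()
```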

With a little domain expertise, you would expect people to behave differently on weekends than on weekdays:

#### On Weekends
1. People probably don't go into work
1. They probably sleep in late on Saturday
1. They probably run a bunch of random errands, since they couldn't during the week
1. They should be home, at least during the very late hours, e.g. 1-4 AM

#### On Weekdays
1. People probably are at work during normal working hours
1. They probably are at home in the early morning and during the late night
1. They probably spend time commuting between work and home every day
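
These expectations can be turned into boolean masks. The following is only a sketch on invented records that reuse the dataset's `DOW` and `CallTime` column names; the 1-4 AM window comes from the list above, while treating 7:30am-5pm as "normal working hours" is an assumption:

```python
import pandas as pd

# Invented records using the same DOW / CallTime column layout as the dataset
toy = pd.DataFrame({
    'DOW': ['Sat', 'Sun', 'Mon', 'Wed', 'Fri', 'Sat'],
    'CallTime': pd.to_timedelta(
        ['02:10:00', '14:30:00', '09:15:00', '18:45:00', '16:59:00', '03:59:59']),
})

is_weekend = toy['DOW'].isin(['Sat', 'Sun'])

# Weekend calls in the very late hours (1-4 AM): candidate "home" samples
late_night = ((toy['CallTime'] > pd.to_timedelta('01:00:00'))
              & (toy['CallTime'] < pd.to_timedelta('04:00:00')))
home_like = toy[is_weekend & late_night]

# Weekday calls during the assumed working hours: candidate "work" samples
work_hours = ((toy['CallTime'] > pd.to_timedelta('07:30:00'))
              & (toy['CallTime'] < pd.to_timedelta('17:00:00')))
work_like = toy[~is_weekend & work_hours]

print(len(home_like), len(work_like))
```

Using `isin` keeps the day-of-week test to one expression instead of a chain of `==` comparisons joined with `|`.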

In [23]:
print("Examining person: ", unique_numbers[0])

Examining person:  4638472273


Create a slice called `user1` that filters to only include dataset records where the `In` feature (user phone number) is equal to the first number on your unique list above:

In [24]:
# .. your code here ..
user1 = df1[df1['In'] == unique_numbers[0]]
user1

Unnamed: 0,In,Out,Direction,CallDate,CallTime,DOW,Duration,TowerID,TowerLat,TowerLon
0,4638472273,2666307251,Incoming,2010-12-25,07:16:24.736813,Sat,00:02:41.741499,0db53dd3-eb9c-4344-abc5-c2d74ebc3eec,32.731611,-96.709417
1,4638472273,1755442610,Incoming,2010-12-25,21:18:30.053710,Sat,00:02:47.108750,aeaf8b43-8034-44fe-833d-31854a75acbf,32.731722,-96.709500
2,4638472273,5481755331,Incoming,2010-12-25,14:52:42.878016,Sat,00:04:35.356341,fadaa83f-6001-45fd-aa4a-17d6c6b7ec00,32.899944,-96.910389
3,4638472273,1755442610,Incoming,2010-12-25,16:02:09.001913,Sat,00:02:23.498499,fadaa83f-6001-45fd-aa4a-17d6c6b7ec00,32.899944,-96.910389
4,4638472273,2145623768,Incoming,2010-12-25,15:28:35.028554,Sat,00:03:54.692497,95d7920d-c3cd-4d20-a568-9a55800dc807,32.899944,-96.910389
5,4638472273,2946222380,Incoming,2010-12-25,11:38:17.275327,Sat,00:03:06.670355,95c91e8b-6ff1-4893-9df3-b0342636bd25,32.899944,-96.910389
6,4638472273,7841019020,Missed,2010-12-25,10:38:35.924232,Sat,00:02:02.855268,fadaa83f-6001-45fd-aa4a-17d6c6b7ec00,32.899944,-96.910389
71,4638472273,5912117253,Incoming,2010-12-26,14:34:27.530327,Sun,00:06:34.154104,84844bb1-1d36-4401-abf4-a3c566c07630,32.895833,-96.901944
72,4638472273,2145623768,Incoming,2010-12-26,20:45:00.306964,Sun,00:03:35.283608,26f7441e-e64b-4b9d-8c5a-0eb34f7e919f,32.731611,-96.709417
73,4638472273,4551601645,Missed,2010-12-26,16:03:43.841914,Sun,00:04:15.182895,bf3bf815-3795-4be6-bec3-12b8f713e50d,32.899944,-96.910389


Alter your slice so that it includes only weekday (Mon-Fri) values. The cell below also restricts `CallTime` to the window between 7:30am and 5pm:

In [26]:
# .. your code here ..
pm5 = pd.to_timedelta('17:00:00')
am730 = pd.to_timedelta('07:30:00')
user2 = user1
# Keep Mon-Fri calls placed strictly between 7:30am and 5pm
weekdays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
user1 = user1[user1['DOW'].isin(weekdays)
              & (user1['CallTime'] > am730)
              & (user1['CallTime'] < pm5)]
user1

Unnamed: 0,In,Out,Direction,CallDate,CallTime,DOW,Duration,TowerID,TowerLat,TowerLon
197,4638472273,5481755331,Incoming,2010-12-28,16:20:11.532130,Tue,00:03:56.069765,95d7920d-c3cd-4d20-a568-9a55800dc807,32.899944,-96.910389
200,4638472273,8180005891,Incoming,2010-12-28,15:27:51.105337,Tue,00:03:30.047757,95c91e8b-6ff1-4893-9df3-b0342636bd25,32.899944,-96.910389
248,4638472273,8180005891,Missed,2010-12-29,09:13:35.198987,Wed,00:02:10.574446,95c91e8b-6ff1-4893-9df3-b0342636bd25,32.899944,-96.910389
250,4638472273,8068555700,Incoming,2010-12-29,13:54:14.724722,Wed,00:03:31.297111,fadaa83f-6001-45fd-aa4a-17d6c6b7ec00,32.899944,-96.910389
251,4638472273,8180005891,Incoming,2010-12-29,15:45:52.307534,Wed,00:05:34.587700,bf3bf815-3795-4be6-bec3-12b8f713e50d,32.899944,-96.910389
403,4638472273,4217121115,Incoming,2010-12-31,08:29:48.427392,Fri,00:01:50.428576,fadaa83f-6001-45fd-aa4a-17d6c6b7ec00,32.899944,-96.910389
406,4638472273,2946222380,Incoming,2010-12-31,08:55:44.522912,Fri,00:02:11.781534,fadaa83f-6001-45fd-aa4a-17d6c6b7ec00,32.899944,-96.910389
616,4638472273,6022767590,Incoming,2011-01-03,16:03:10.997508,Mon,00:05:17.983777,95c91e8b-6ff1-4893-9df3-b0342636bd25,32.899944,-96.910389
618,4638472273,2946222380,Missed,2011-01-03,08:46:32.937427,Mon,00:05:18.069929,95c91e8b-6ff1-4893-9df3-b0342636bd25,32.899944,-96.910389
619,4638472273,7756914135,Incoming,2011-01-03,07:50:54.789093,Mon,00:03:58.003275,60161457-ddb5-4b18-bbb2-8b78fd61c1e7,32.888833,-96.895722


The idea is that these calls were placed before 5pm. From midnight to 7:30am, the user is probably sleeping and won't place a call or wake up to take one. There should be a brief window in the morning during their commute to work, after which they'll spend the entire day at work. So the assumption is that most of the time is spent at work and, second most, at home:

In [None]:
# .. your code here ..

Plot the Cell Towers the user connected to

In [39]:
# .. your code here ..
%matplotlib notebook

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(user1.TowerLon,user1.TowerLat, c='g', marker='o', alpha=0.2)
ax.set_title('Weekday Calls (7:30am - 5pm)')
plt.show()


In [33]:
from sklearn.cluster import KMeans

def doKMeans(data, num_clusters=0):
    # TODO: Be sure to only feed in Lat and Lon coordinates to the KMeans algo, since none of the other
    # data is suitable for your purposes. Since both Lat and Lon are (approximately) on the same scale,
    # no feature scaling is required. Print out the centroid locations and add them onto your scatter
    # plot. Use a distinguishable marker and color.
    #
    # Hint: Make sure you fit ONLY the coordinates, and in the CORRECT order (lat first). This is part
    # of your domain expertise. Also, *YOU* need to create, initialize (and return) the variable named
    # `model` here, which will be a SKLearn K-Means model for this to work:
    
    # .. your code here ..
    
    # Fit on latitude and longitude only, in that order
    data = data[['TowerLat', 'TowerLon']]
    model = KMeans(n_clusters=num_clusters)
    model.fit(data)

    # Print the centroids; each row is (lat, lon)
    centroids = model.cluster_centers_
    print(centroids)
    # To plot them, put longitude on the x-axis to match the scatter plot:
    #ax.scatter(centroids[:,1], centroids[:,0], marker='x', c='red', alpha=0.3)
    return model

Let's run K-Means with `K=3` or `K=4`. There really should only be two areas of concentration. If you notice multiple areas that are "hot" (multiple areas the user spends a lot of time at that are FAR apart from one another), then increase K to 5, with the goal being that all centroids except two will sweep up the annoying outliers and not-home, not-work travel occasions. The other two will zero in on the user's approximate home and work locations, or rather on the locations of the cell towers closest to them.

In [41]:
model = doKMeans(user1, 4)

[[ 32.84579692 -96.81976265]
 [ 32.89970164 -96.91026779]
 [ 32.87348968 -96.85115015]
 [ 32.911583   -96.892222  ]]
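
To see why spare centroids tend to absorb the outliers, here is a minimal sketch on synthetic coordinates. Everything in it is invented for illustration: the two dense spots, the scattered points, and the choice of `n_init=10` and `random_state=0`, which are set only so the run is repeatable:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)

# Invented (lat, lon) points: two dense spots plus a few scattered outliers
work = rng.normal(loc=[32.90, -96.91], scale=0.001, size=(200, 2))
home = rng.normal(loc=[32.73, -96.71], scale=0.001, size=(80, 2))
stray = rng.uniform(low=[32.60, -97.10], high=[33.00, -96.60], size=(10, 2))
points = np.vstack([work, home, stray])

# Four clusters: two for the dense spots, two left over for the outliers
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(points)

# Count samples per cluster, then list clusters from largest to smallest
counts = np.bincount(model.labels_, minlength=4)
order = np.argsort(counts)[::-1]
for i in order:
    print(i, counts[i], model.cluster_centers_[i])
```

In this toy setup the two largest clusters sit on the dense spots, while the remaining centroids end up among the scattered points. Real tower data is messier, so treat this as intuition rather than a guarantee.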


Print out the mean `CallTime` value for the samples belonging to the cluster with the FEWEST samples attached to it. If our logic is correct, the cluster with the MOST samples will be work. The cluster with the 2nd most samples will be home. And with `K=3`, the cluster with the fewest samples should be somewhere in between the two. What time, on average, is the user in between home and work before 5pm?

In [42]:
midWayClusterIndices = clusterWithFewestSamples(model)
midWaySamples = user1[midWayClusterIndices]
print("    Its Waypoint Time: ", midWaySamples.CallTime.mean())


  Cluster With Fewest Samples:  3
    Its Waypoint Time:  0 days 07:44:31.892341
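
The same reasoning can be applied to every cluster at once by grouping on the labels. This is a sketch with invented call times and hand-written labels standing in for `model.labels_`; averaging over seconds is a cautious choice here, not something the lab requires:

```python
import numpy as np
import pandas as pd

# Invented stand-ins: call times and the cluster label assigned to each call
call_time = pd.to_timedelta(
    ['09:00:00', '10:30:00', '14:00:00', '15:30:00',
     '08:00:00', '16:00:00', '07:45:00'])
labels = np.array([0, 0, 0, 0, 1, 1, 2])

# Average the times as seconds since midnight, then convert back to timedeltas
summary = (
    pd.DataFrame({'label': labels, 'seconds': call_time.total_seconds()})
    .groupby('label')['seconds']
    .agg(['count', 'mean'])
    .sort_values('count', ascending=False)
)
summary['mean_time'] = pd.to_timedelta(summary['mean'], unit='s')
print(summary[['count', 'mean_time']])
```

Under the lab's logic, the first row would be read as work, the second as home, and the last as the waypoint cluster.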


Let's visualize the results! First draw the X's for the clusters:

In [43]:
fig1 = plt.figure()
ax1 = fig1.add_subplot(111)
ax1.scatter(model.cluster_centers_[:,1], model.cluster_centers_[:,0], s=169, c='r', marker='x', alpha=0.8, linewidths=2)
ax1.set_title('Weekday Calls Centroids')
plt.show()


In [44]:
clusterInfo(model)

Cluster Analysis Inertia:  0.0499933966513
------------------------------------------

  Cluster  0
    Centroid  [ 32.84579692 -96.81976265]
    #Samples  48

  Cluster  1
    Centroid  [ 32.89970164 -96.91026779]
    #Samples  728

  Cluster  2
    Centroid  [ 32.87348968 -96.85115015]
    #Samples  81

  Cluster  3
    Centroid  [ 32.911583 -96.892222]
    #Samples  32
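
To judge whether two centroids are "FAR apart", one option is to convert the difference in degrees into kilometres. The sketch below uses the haversine great-circle formula, which the lab itself does not use. The name `haversine_km` is a hypothetical helper, and the 6371 km Earth radius is the usual mean-radius approximation. The coordinates are the centroids of clusters 0 and 1 printed above:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# Centroids of cluster 0 and cluster 1 from the output above
d = haversine_km(32.84579692, -96.81976265, 32.89970164, -96.91026779)
print(round(d, 1), 'km')
```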


In [48]:
users_phones = [2068627935, 2894365987, 1559410755, 3688089071]

In [46]:
def examineNumber(df, number, num_clusters):
    print("Examining person: ", number)
    user = df[df['In'] == number]
    pm5 = pd.to_timedelta('17:00:00')
    am730 = pd.to_timedelta('07:30:00')
    # Keep Mon-Fri calls placed strictly between 7:30am and 5pm
    weekdays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
    user = user[user['DOW'].isin(weekdays)
                & (user['CallTime'] > am730)
                & (user['CallTime'] < pm5)]
    data = user[['TowerLat', 'TowerLon']]
    model = KMeans(n_clusters=num_clusters)
    model.fit(data)
    # Print the inertia, centroids and per-cluster sample counts
    clusterInfo(model)
    return model



In [50]:
examineNumber(df1,users_phones[0],4)

Examining person:  2068627935
Cluster Analysis Inertia:  0.083671853092
------------------------------------------

  Cluster  0
    Centroid  [ 32.72089235 -96.83287671]
    #Samples  1194

  Cluster  1
    Centroid  [ 32.71215719 -96.75923115]
    #Samples  190

  Cluster  2
    Centroid  [ 32.72222073 -96.80687153]
    #Samples  112

  Cluster  3
    Centroid  [ 32.731611 -96.709444]
    #Samples  19


KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [51]:
examineNumber(df1,users_phones[1],4)

Examining person:  2894365987
Cluster Analysis Inertia:  0.00584613804294
------------------------------------------

  Cluster  0
    Centroid  [ 32.717667 -96.875194]
    #Samples  141

  Cluster  1
    Centroid  [ 32.72174109 -96.89194104]
    #Samples  2705

  Cluster  2
    Centroid  [ 32.741889 -96.857611]
    #Samples  241

  Cluster  3
    Centroid  [ 32.698088   -96.92053683]
    #Samples  6


KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

2894365987 is the closest so far

In [52]:
examineNumber(df1,users_phones[2],4)

Examining person:  1559410755
Cluster Analysis Inertia:  0.0278806172308
------------------------------------------

  Cluster  0
    Centroid  [ 32.69634229 -96.93521264]
    #Samples  2430

  Cluster  1
    Centroid  [ 32.72173859 -96.91649201]
    #Samples  189

  Cluster  2
    Centroid  [ 32.76754224 -96.91641493]
    #Samples  99

  Cluster  3
    Centroid  [ 32.67086679 -96.93461827]
    #Samples  75


KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [53]:
examineNumber(df1,users_phones[3],4)

Examining person:  3688089071
Cluster Analysis Inertia:  0.00457446197459
------------------------------------------

  Cluster  0
    Centroid  [ 32.80065673 -96.81319445]
    #Samples  22

  Cluster  1
    Centroid  [ 32.81175517 -96.87019575]
    #Samples  361

  Cluster  2
    Centroid  [ 32.80733081 -96.83542586]
    #Samples  21

  Cluster  3
    Centroid  [ 32.79794895 -96.78880445]
    #Samples  20


KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [63]:
def getClusterSamples(df, number, num_clusters):
    print("getting cluster for person: ", number)
    user = df[df['In'] == number]
    pm5 = pd.to_timedelta('17:00:00')
    am730 = pd.to_timedelta('07:30:00')
    # Keep Mon-Fri calls placed strictly between 7:30am and 5pm
    weekdays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
    user = user[user['DOW'].isin(weekdays)
                & (user['CallTime'] > am730)
                & (user['CallTime'] < pm5)]
    data = user[['TowerLat', 'TowerLon']]
    model = KMeans(n_clusters=num_clusters)
    model.fit(data)
    # Boolean mask selecting the samples in the smallest cluster
    smallest_cluster_index = clusterWithFewestSamples(model)
    sample = user[smallest_cluster_index]
    print("    Avg time : ", sample.CallTime.mean())
    return sample


In [64]:
clusterlist = []

for number in unique_numbers:
    print("examining user : ", number)
    c = getClusterSamples(df1, number, 3)
    clusterlist.append(c)
    


examining user :  4638472273
getting cluster for person:  4638472273

  Cluster With Fewest Samples:  1
    Avg time :  0 days 07:44:01.395089
examining user :  1559410755
getting cluster for person:  1559410755

  Cluster With Fewest Samples:  0
    Avg time :  0 days 07:49:46.609049
examining user :  4931532174
getting cluster for person:  4931532174

  Cluster With Fewest Samples:  2
    Avg time :  0 days 10:25:23.941509
examining user :  2419930464
getting cluster for person:  2419930464

  Cluster With Fewest Samples:  2
    Avg time :  0 days 07:47:11.097689
examining user :  1884182865
getting cluster for person:  1884182865

  Cluster With Fewest Samples:  1
    Avg time :  0 days 07:44:52.338718
examining user :  3688089071
getting cluster for person:  3688089071

  Cluster With Fewest Samples:  1
    Avg time :  0 days 07:43:12.171078
examining user :  4555003213
getting cluster for person:  4555003213

  Cluster With Fewest Samples:  1
    Avg time :  0 days 08:04:09.204236