<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Telecomm EDA Challenge Lab

_Author: Alex Combs (NYC)_

---

Let's do some Exploratory Data Analysis (EDA)! As a data scientist, you will often be handed a data set you've never seen before and asked to analyze it quickly. That is today's goal.

# Prompt

You work for a telecommunications company. The company has been storing metadata about customer phone usage, as part of the regular course of business. Currently, this data is sitting in an unsecured database. The company doesn't want to pay to increase their database security, because they don't think there's really anything to be learned from the metadata.

They are under pressure from "right to privacy" organizations to beef up the database security. These organizations argue that you can learn a lot about a person from their cell phone metadata.

The telecom company wants to understand whether this is true, and they want your help. They will give you one person's metadata for 2014 and want to see what you can learn from it.

Working in teams, create a report revealing everything you can about the person. Prepare a presentation, with slides, showcasing your findings.


# The Data

The [person's metadata](./datasets/metadata.csv) has the following fields:

| Field Name               | Description
| ---                      | ---
| **Cell Cgi**             | cell phone tower identifier
| **Cell Tower Location**  | cell phone tower location
| **Comm Identifier**      | de-identified recipient of communication
| **Comm Timedate String** | time of communication
| **Comm Type**            | type of communication
| **Latitude**             | latitude of communication
| **Longitude**            | longitude of communication


In [1]:
!conda install folium --yes
!pip install folium

!pip install geopy

# pygmaps

!pip install gpxpy
import gpxpy.geo

Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - folium

Current channels:

  - https://repo.continuum.io/pkgs/main/osx-64
  - https://repo.continuum.io/pkgs/main/noarch
  - https://repo.continuum.io/pkgs/free/osx-64
  - https://repo.continuum.io/pkgs/free/noarch
  - https://repo.continuum.io/pkgs/r/osx-64
  - https://repo.continuum.io/pkgs/r/noarch
  - https://repo.continuum.io/pkgs/pro/osx-64
  - https://repo.continuum.io/pkgs/pro/noarch


Collecting folium
  Downloading folium-0.5.0.tar.gz (79kB)
    100% |████████████████████████████████| 81kB 3.2MB/s
Collecting branca (from folium)
  Downloading branca-0.2.0-py2-none-any.whl
Building wheels for collected packages: folium
  Running setup.py bdist_wheel for folium ... done
  Stored in directory: /Users/fiona/Library/Caches/pip/wheels/04/d0/a0/b2b8356443364ae79743fce0b9b6a5b045f7560742129fde22
Successfully built folium
Installing 

In [3]:
import numpy as np
import pandas as pd

from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

# Hints

This lab is completely open-ended. Look below for prompts only if you are truly stumped. As a starting point, given that you have geolocations, consider ways to display this type of information, such as mapping tools.

<font color='white'>
Well, for starters, he's in Australia!

Ideas for things to look into:
- where does he work?
- where does he live?
- who does he contact most often?
- what hours does he work?
- did he move?
- did he go on holiday?  If so, where did he go?
- did he get a new phone?

Challenges:
- how does he get to work?
- where does his family live?
- if he went on holiday, can you find which flights he took?
- can you guess who some of his contacts are, based on the frequency, location, time and mode (phone/text) of communications?


If you're stuck on how to map the data, you can try "basemap" or "gmplot", or anything else you find online.
</font>
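
Because every record carries a latitude and longitude, one simple way to reason about movement before plotting anything is to compute the great-circle distance between two towers. The sketch below uses the haversine formula with only the standard library. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for illustration; the two coordinate pairs are the REDFERN TE and HAYMARKET # values from the first rows of the dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Tower coordinates as they appear in the metadata.
redfern = (-33.892933, 151.202296)
haymarket = (-33.880329, 151.20569)

distance_km = haversine_km(*redfern, *haymarket)
print(round(distance_km, 2))
```

Distances like this can help separate towers that are effectively the same neighbourhood from towers that imply real travel.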

In [11]:
df_raw = pd.read_csv("../eda-telecomm_group_project-lab/datasets/metadata.csv")

In [12]:
df_raw.shape

(10476, 7)

In [14]:
df_raw.head()

Unnamed: 0,Cell Cgi,Cell Tower Location,Comm Identifier,Comm Timedate String,Comm Type,Latitude,Longitude
0,50501015388B9,REDFERN TE,f1a6836c0b7a3415a19a90fdd6f0ae18484d6d1e,4/1/14 9:40,Phone,-33.892933,151.202296
1,50501015388B9,REDFERN TE,62157ccf2910019ffd915b11fa037243b75c1624,4/1/14 9:42,Phone,-33.892933,151.202296
2,505010153111F,HAYMARKET #,c8f92bd0f4e6fb45ed7fce96fc831b283db2b642,4/1/14 13:13,Phone,-33.880329,151.20569
3,505010153111F,HAYMARKET #,f1a6836c0b7a3415a19a90fdd6f0ae18484d6d1e,4/1/14 13:13,Phone,-33.880329,151.20569
4,5.05E+106,HAYMARKET #,f1a6836c0b7a3415a19a90fdd6f0ae18484d6d1e,4/1/14 17:27,Phone,-33.880329,151.20569
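
Row 4 above shows a `Cell Cgi` of `5.05E+106`, which looks like an identifier that was converted to scientific notation somewhere upstream (an assumption; the source does not say how it happened). A minimal sketch for flagging such values follows, using three values copied from the rows above rather than the real file. The helper name `looks_mangled` is hypothetical.

```python
import re

# Matches values like "5.05E+106": one digit, an optional decimal part, then a
# signed exponent. Requiring the sign avoids flagging genuine hex-like
# identifiers that merely contain the letter E.
SCI_NOTATION = re.compile(r"^\d(\.\d+)?[eE][+-]\d+$")


def looks_mangled(cgi):
    """Return True if a Cell Cgi value looks like a number in scientific notation."""
    return bool(SCI_NOTATION.match(str(cgi).strip()))


sample_cgis = ["50501015388B9", "505010153111F", "5.05E+106"]
flags = [looks_mangled(c) for c in sample_cgis]
print(flags)  # [False, False, True]
```

Passing `dtype={"Cell Cgi": str}` to `pd.read_csv` keeps the identifiers as text on load, although it cannot restore values that are already damaged in the file itself.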


In [176]:
comms_rank = df_raw["Comm Identifier"].value_counts(ascending = False).to_frame()
comms_rank2 = comms_rank.rank(ascending = False)
comms_rank2.index.name = 'Comm Identifier'
comms_rank2

Comm Identifier (index),Comm Identifier (rank)
bc0b01860486b0f0a240ce8419d3d7553fe404ab,1.0
12e3d1b0c95aa32b6890c4455918dfc10e09fb51,2.0
91aba4a11359ff3af7902428d20cfa7e676c36e7,3.0
a24a4646d074a779b45b34b943a47bf33168f791,4.0
6bbc17070aa91e2dab7909b96c6eecbd6109ba56,5.0
a804558e420ececf05faedf05722704a115f1b50,6.0
cd3b39466869088df4904451c626591cc500e4ba,7.0
c22670da93038f568c4a3bd8ae22f9e6fef2c5a2,8.0
70e1f163d854d4e9b63e9a3f4056ced467567d85,9.0
c521537546eee0e62e2d8e98e831ac11edbf10cc,10.0


In [None]:
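# The rank column shares the key's name, so rename it and join on the index of comms_rank2.
df_raw2 = df_raw.merge(comms_rank2.rename(columns={'Comm Identifier': 'Comm Rank'}), left_on='Comm Identifier', right_index=True)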

In [98]:
# find home:
df_raw["Cell Tower Location"].value_counts().head(20)

BALGOWLAH HAYES ST                          4301
CHIPPENDALE                                 1084
SUNDERLAND ST                                723
REDFERN TE                                   712
HAYMARKET #                                  563
BRICKWORKS                                   501
HARBORD 22 WAINE ST                          465
FAIRLIGHT 137 SYDNEY RD                      454
MANLY #                                      231
NEW TOWN                                     197
CHINATOWN                                    161
BEECHWORTH                                   112
BALGOWLAH VILLAGE SHOPPING CENTRE IBC        106
MANLY SOUTH STEYNE                            92
BROADWAY OTC                                  85
MASCOT INTERNATIONAL AIRPORT TERMINAL T1      65
71 MACQUARIE ST                               49
SURRY HILLS 418A ELIZABETH ST                 45
MANLY NTH STEYNE                              40
MASCOT M5 MOTORWAY EMERGENCY STAIRS           33
Name: Cell Tower Loc
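
The overall tower counts mix home, work, and everything in between. One hedged refinement is to count towers only during night hours, on the assumption that most people are at home overnight. The sketch below runs on a tiny made-up sample with the same column names as the real data; the 22:00 to 06:59 window and the month-first date format are both assumptions.

```python
import pandas as pd

# Tiny made-up sample in the same shape as the real columns (not real records).
toy = pd.DataFrame({
    "Cell Tower Location": ["BALGOWLAH HAYES ST", "REDFERN TE", "BALGOWLAH HAYES ST",
                            "HAYMARKET #", "BALGOWLAH HAYES ST", "REDFERN TE"],
    "Comm Timedate String": ["4/1/14 23:10", "4/2/14 10:05", "4/3/14 6:45",
                             "4/3/14 13:20", "4/4/14 22:30", "4/4/14 15:00"],
})

# Assumes month-first dates, matching how dateutil parses them by default.
hours = pd.to_datetime(toy["Comm Timedate String"], format="%m/%d/%y %H:%M").dt.hour

# Assumption: "night" means 22:00 through 06:59.
is_night = (hours >= 22) | (hours < 7)
night_counts = toy.loc[is_night, "Cell Tower Location"].value_counts()

likely_home = night_counts.idxmax()
print(likely_home)
```

On the real data, the same mask could be applied to `df_raw` in place of `toy`; a matching daytime, weekday-only mask would be a natural way to look for a workplace.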

In [182]:
df_raw["Comm Identifier"].value_counts(ascending = False).to_frame().head(20)

Unnamed: 0,Comm Identifier
bc0b01860486b0f0a240ce8419d3d7553fe404ab,219
12e3d1b0c95aa32b6890c4455918dfc10e09fb51,146
91aba4a11359ff3af7902428d20cfa7e676c36e7,144
a24a4646d074a779b45b34b943a47bf33168f791,133
6bbc17070aa91e2dab7909b96c6eecbd6109ba56,83
a804558e420ececf05faedf05722704a115f1b50,62
cd3b39466869088df4904451c626591cc500e4ba,56
c22670da93038f568c4a3bd8ae22f9e6fef2c5a2,44
70e1f163d854d4e9b63e9a3f4056ced467567d85,39
c521537546eee0e62e2d8e98e831ac11edbf10cc,31


In [62]:
top_1 = df_raw[df_raw["Comm Identifier"] == 'bc0b01860486b0f0a240ce8419d3d7553fe404ab']

In [63]:
top_2 = df_raw[df_raw["Comm Identifier"] == '12e3d1b0c95aa32b6890c4455918dfc10e09fb51']

In [42]:
print(top_1["Comm Type"].value_counts())
print(top_2["Comm Type"].value_counts())

SMS      210
Phone      9
Name: Comm Type, dtype: int64
Phone    136
SMS       10
Name: Comm Type, dtype: int64
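
The two contacts differ sharply in how they are reached: one is almost entirely SMS, the other almost entirely phone calls. A cross-tabulation puts that comparison in a single table and scales to more than two contacts. The sketch below rebuilds only the counts printed above (identifiers are abbreviated for readability, which is an illustration choice and not part of the data) and uses `pd.crosstab`.

```python
import pandas as pd

# Rebuild just the counts printed above; identifiers are abbreviated.
rows = (
    [("bc0b0186", "SMS")] * 210 + [("bc0b0186", "Phone")] * 9
    + [("12e3d1b0", "Phone")] * 136 + [("12e3d1b0", "SMS")] * 10
)
sample = pd.DataFrame(rows, columns=["Comm Identifier", "Comm Type"])

# One row per contact, one column per communication type.
type_mix = pd.crosstab(sample["Comm Identifier"], sample["Comm Type"])

# Share of each contact's traffic that is SMS.
sms_share = type_mix["SMS"] / type_mix.sum(axis=1)

print(type_mix)
print(sms_share.round(2))
```

On the full dataset, `pd.crosstab(df_raw["Comm Identifier"], df_raw["Comm Type"])` would give the same breakdown for every contact at once.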


In [43]:
df_raw["Comm Type"].value_counts()

Internet    9102
Phone        717
SMS          657
Name: Comm Type, dtype: int64

In [61]:
top_1["Cell Tower Location"].value_counts()

REDFERN TE                       95
BALGOWLAH HAYES ST               54
HAYMARKET #                      33
CHIPPENDALE                      11
FAIRLIGHT 137 SYDNEY RD           7
CHINATOWN                         6
SYDNEY 450 GEORGE ST              4
SUNDERLAND ST                     2
SURRY HILLS 418A ELIZABETH ST     2
SYDNEY 2 CASTLEREAGH STREET       2
BROADWAY OTC                      2
71 MACQUARIE ST                   1
Name: Cell Tower Location, dtype: int64

In [51]:
top_2["Cell Tower Location"].value_counts().head(5)

BALGOWLAH HAYES ST     46
REDFERN TE             24
HARBORD 22 WAINE ST    17
CHIPPENDALE            13
HAYMARKET #             8
Name: Cell Tower Location, dtype: int64

In [59]:
#!pip install python-dateutil
from dateutil import parser

def convertDateTimeToTimeOfDayPrecision(string):
    dt = parser.parse(string)
    return dt.strftime("%H:00")

def convertDateTimeToDayPrecision(string):
    dt = parser.parse(string)
    return dt.strftime("%m-%d-%Y")

def convertDateTimeToHourPrecision(string):
    dt = parser.parse(string)
    return dt.strftime("%m-%d-%Y %H:00")

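df_raw['Comm Day'] = df_raw["Comm Timedate String"].map(convertDateTimeToDayPrecision)
df_raw['Time of Day'] = df_raw["Comm Timedate String"].map(convertDateTimeToTimeOfDayPrecision)
# top_1 and top_2 must be (re)built after this cell so they pick up the new columns.
df_raw.head(10)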

Unnamed: 0,Cell Cgi,Cell Tower Location,Comm Identifier,Comm Timedate String,Comm Type,Latitude,Longitude,Comm Day,Time of Day
0,50501015388B9,REDFERN TE,f1a6836c0b7a3415a19a90fdd6f0ae18484d6d1e,4/1/14 9:40,Phone,-33.892933,151.202296,04-01-2014,09:00
1,50501015388B9,REDFERN TE,62157ccf2910019ffd915b11fa037243b75c1624,4/1/14 9:42,Phone,-33.892933,151.202296,04-01-2014,09:00
2,505010153111F,HAYMARKET #,c8f92bd0f4e6fb45ed7fce96fc831b283db2b642,4/1/14 13:13,Phone,-33.880329,151.205690,04-01-2014,13:00
3,505010153111F,HAYMARKET #,f1a6836c0b7a3415a19a90fdd6f0ae18484d6d1e,4/1/14 13:13,Phone,-33.880329,151.205690,04-01-2014,13:00
4,5.05E+106,HAYMARKET #,f1a6836c0b7a3415a19a90fdd6f0ae18484d6d1e,4/1/14 17:27,Phone,-33.880329,151.205690,04-01-2014,17:00
5,5050101532B23,CHIPPENDALE,6bbc17070aa91e2dab7909b96c6eecbd6109ba56,4/1/14 17:36,Phone,-33.884171,151.202350,04-01-2014,17:00
6,5050101536E5E,CHIPPENDALE,6bbc17070aa91e2dab7909b96c6eecbd6109ba56,4/1/14 17:40,Phone,-33.884171,151.202350,04-01-2014,17:00
7,5050101531F08,REDFERN TE,7cb96eadd3ff95e25406d24794027c443c0661c5,4/2/14 19:18,Phone,-33.892933,151.202296,04-02-2014,19:00
8,505010153111F,HAYMARKET #,de40c5c1f9249f95f7fb216931db58747afef74f,4/3/14 14:35,Phone,-33.880329,151.205690,04-03-2014,14:00
9,505010153111F,HAYMARKET #,66f32c1163d0e597983b65c51f5a477070ad3785,4/3/14 14:36,Phone,-33.880329,151.205690,04-03-2014,14:00
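
Parsing every string with `dateutil` works, but it parses each value twice and guesses the format each time. A hedged alternative is to parse the column once with `pd.to_datetime` and an explicit format, then derive the other columns from the parsed values. The sketch below uses three timestamps copied from the rows above and assumes the month comes first (so `4/1/14` is April 1, 2014), which is what `dateutil` assumed by default.

```python
import pandas as pd

# Three timestamp strings copied from the rows above.
stamps = pd.Series(["4/1/14 9:40", "4/1/14 13:13", "4/2/14 19:18"])

# Assumption: month-first dates, as dateutil parsed them in the cell above.
parsed = pd.to_datetime(stamps, format="%m/%d/%y %H:%M")

comm_day = parsed.dt.strftime("%m-%d-%Y")   # same layout as 'Comm Day'
time_of_day = parsed.dt.strftime("%H:00")   # same layout as 'Time of Day'
weekday = parsed.dt.day_name()              # extra column, useful for work patterns

print(comm_day.tolist())
print(time_of_day.tolist())
print(weekday.tolist())
```

Since the data comes from Australia, where dates are often written day first, it is worth checking the month-first assumption against something unambiguous, such as a day value above 12, before trusting any seasonal conclusions.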


In [73]:
top_1.groupby(["Time of Day"])["Comm Identifier"].size().sort_values(ascending = False)

Time of Day
19:00    44
22:00    36
17:00    25
18:00    23
12:00    14
10:00    13
21:00    12
15:00    11
23:00     8
14:00     7
09:00     7
20:00     6
13:00     4
11:00     3
16:00     2
08:00     2
01:00     2
Name: Comm Identifier, dtype: int64

In [71]:
top_2.groupby(["Time of Day"])["Comm Identifier"].size().sort_values(ascending = False)

Time of Day
17:00    18
09:00    16
20:00    15
10:00    15
19:00    13
15:00    13
16:00    10
14:00     9
12:00     8
21:00     7
18:00     6
11:00     6
13:00     5
08:00     4
07:00     1
Name: Comm Identifier, dtype: int64
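
The two hourly profiles are easier to compare as a single number each. The sketch below copies the hourly counts printed above into plain dictionaries and computes the share of each contact's communications that falls in daytime hours. Treating 09:00 to 16:59 as "business hours" is an assumption, and the name `business_share` is hypothetical.

```python
# Hourly counts copied from the two outputs above (hour of day -> count).
top_1_hours = {19: 44, 22: 36, 17: 25, 18: 23, 12: 14, 10: 13, 21: 12, 15: 11,
               23: 8, 14: 7, 9: 7, 20: 6, 13: 4, 11: 3, 16: 2, 8: 2, 1: 2}
top_2_hours = {17: 18, 9: 16, 20: 15, 10: 15, 19: 13, 15: 13, 16: 10, 14: 9,
               12: 8, 21: 7, 18: 6, 11: 6, 13: 5, 8: 4, 7: 1}

# Assumption: business hours are the buckets 09:00 through 16:00 (i.e. up to 16:59).
BUSINESS_HOURS = range(9, 17)


def business_share(hour_counts):
    """Fraction of all communications that fall inside BUSINESS_HOURS."""
    total = sum(hour_counts.values())
    in_hours = sum(n for hour, n in hour_counts.items() if hour in BUSINESS_HOURS)
    return in_hours / total


share = {"top_1": business_share(top_1_hours), "top_2": business_share(top_2_hours)}
print({name: round(value, 2) for name, value in share.items()})
```

A contact reached mostly outside business hours and mostly by SMS reads differently from one reached mostly by phone during the day, which is the kind of contrast the challenge questions ask about.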

In [94]:
df_raw.groupby(["Cell Tower Location","Comm Identifier"])["Comm Identifier"].size().sort_values(ascending = False)

Cell Tower Location                 Comm Identifier                         
REDFERN TE                          bc0b01860486b0f0a240ce8419d3d7553fe404ab    95
BALGOWLAH HAYES ST                  a24a4646d074a779b45b34b943a47bf33168f791    69
                                    bc0b01860486b0f0a240ce8419d3d7553fe404ab    54
                                    12e3d1b0c95aa32b6890c4455918dfc10e09fb51    46
REDFERN TE                          a24a4646d074a779b45b34b943a47bf33168f791    39
                                    91aba4a11359ff3af7902428d20cfa7e676c36e7    39
HAYMARKET #                         bc0b01860486b0f0a240ce8419d3d7553fe404ab    33
BALGOWLAH HAYES ST                  91aba4a11359ff3af7902428d20cfa7e676c36e7    27
REDFERN TE                          6bbc17070aa91e2dab7909b96c6eecbd6109ba56    26
                                    12e3d1b0c95aa32b6890c4455918dfc10e09fb51    24
BALGOWLAH HAYES ST                  a804558e420ececf05faedf05722704a115f1b50    23
          