<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Telecom EDA Challenge Lab

_Author: Alex Combs (NYC)_

---

Let's do some Exploratory Data Analysis (EDA)! As a data scientist, you will often be handed a data set you've never seen before and asked to analyze it quickly. That is today's goal.

# Prompt

You work for a telecommunications company. The company has been storing metadata about customer phone usage, as part of the regular course of business. Currently, this data is sitting in an unsecured database. The company doesn't want to pay to increase their database security, because they don't think there's really anything to be learned from the metadata.

They are under pressure from "right to privacy" organizations to beef up the database security. These organizations argue that you can learn a lot about a person from their cell phone metadata.

The telecom company wants to understand whether this is true, and they want your help. They will give you one person's metadata for 2014 and want to see what you can learn from it.

Working in teams, create a report revealing everything you can about the person. Prepare a presentation, with slides, showcasing your findings.


# The Data

The [person's metadata](./datasets/metadata.csv) has the following fields:

| Field Name          | Description
| ---                 | ---
| **Cell Cgi**        | cell phone tower identifier
| **Cell Tower Location** | cell phone tower location
| **Comm Identifier** | de-identified recipient of communication
| **Comm Timedate String** | time of communication
| **Comm Type**       | type of communication
| **Latitude**        | latitude of communication
| **Longitude**       | longitude of communication


# Hints

This is totally open-ended! Look at the prompts below only if you're completely stumped. As a starting point, since the data includes geolocations, consider ways to display them on a map (e.g., with a mapping library).

<font color='white'>
Well, for starters, he's in Australia!

Ideas for things to look into:
- where does he work?
- where does he live?
- who does he contact most often?
- what hours does he work?
- did he move?
- did he go on holiday? If so, where did he go?
- did he get a new phone?

Challenges:
- how does he get to work?
- where does his family live?
- if he went on holiday, can you find which flights he took?
- can you guess who some of his contacts are, based on the frequency, location, time and mode (phone/text) of communications?


If you're stuck on how to map the data, you can try "basemap" or "gmplot", or anything else you find online.
</font>
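
Before reaching for a full mapping library, a plain scatter plot of longitude against latitude can already give a rough picture of where records cluster. The sketch below is only an illustration: the `towers` frame is a hypothetical stand-in built from three tower names, coordinates, and record counts that appear in the outputs of this notebook, not the full data set.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the real data: three towers with coordinates
# and record counts copied from the outputs shown in this notebook.
towers = pd.DataFrame({
    'tower': ['BALGOWLAH HAYES ST', 'CHIPPENDALE', 'REDFERN TE'],
    'latitude': [-33.78815, -33.884171, -33.892933],
    'longitude': [151.26654, 151.20235, 151.202296],
    'records': [4301, 1084, 712],
})

fig, ax = plt.subplots(figsize=(6, 6))

# Longitude on x and latitude on y gives a rough map-like layout;
# marker area scales with the number of records at each tower.
ax.scatter(towers['longitude'], towers['latitude'],
           s=towers['records'] / 10, alpha=0.6)

# Label each point with its tower name.
for row in towers.itertuples():
    ax.annotate(row.tower, (row.longitude, row.latitude), fontsize=8)

ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Records per tower (sample)')
```

In a notebook the figure renders inline; in a script, `plt.show()` or `fig.savefig(...)` would be needed. A plot like this has no coastline or street context, so a mapping library is still the better choice for the final presentation.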

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

metadata_csv = './datasets/metadata.csv'

df = pd.read_csv(metadata_csv, encoding='latin-1')

In [4]:
df.head()

Unnamed: 0,Cell Cgi,Cell Tower Location,Comm Identifier,Comm Timedate String,Comm Type,Latitude,Longitude
0,50501015388B9,REDFERN TE,f1a6836c0b7a3415a19a90fdd6f0ae18484d6d1e,4/1/14 9:40,Phone,-33.892933,151.202296
1,50501015388B9,REDFERN TE,62157ccf2910019ffd915b11fa037243b75c1624,4/1/14 9:42,Phone,-33.892933,151.202296
2,505010153111F,HAYMARKET #,c8f92bd0f4e6fb45ed7fce96fc831b283db2b642,4/1/14 13:13,Phone,-33.880329,151.20569
3,505010153111F,HAYMARKET #,f1a6836c0b7a3415a19a90fdd6f0ae18484d6d1e,4/1/14 13:13,Phone,-33.880329,151.20569
4,5.05E+106,HAYMARKET #,f1a6836c0b7a3415a19a90fdd6f0ae18484d6d1e,4/1/14 17:27,Phone,-33.880329,151.20569


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10476 entries, 0 to 10475
Data columns (total 7 columns):
Cell Cgi                10476 non-null object
Cell Tower Location     10476 non-null object
Comm Identifier         1374 non-null object
Comm Timedate String    10476 non-null object
Comm Type               10476 non-null object
Latitude                10476 non-null float64
Longitude               10476 non-null float64
dtypes: float64(2), object(5)
memory usage: 573.0+ KB


In [6]:
df_col = ['cgi', 'tower', 'identifier', 'timedate', 'type', 'latitude', 'longitude']

# Rename the columns while reading: header=0 discards the file's own header row,
# and names= supplies the replacement column names.
df = pd.read_csv(metadata_csv, header=0, names=df_col, encoding='latin-1')

In [7]:
df.head()

Unnamed: 0,cgi,tower,identifier,timedate,type,latitude,longitude
0,50501015388B9,REDFERN TE,f1a6836c0b7a3415a19a90fdd6f0ae18484d6d1e,4/1/14 9:40,Phone,-33.892933,151.202296
1,50501015388B9,REDFERN TE,62157ccf2910019ffd915b11fa037243b75c1624,4/1/14 9:42,Phone,-33.892933,151.202296
2,505010153111F,HAYMARKET #,c8f92bd0f4e6fb45ed7fce96fc831b283db2b642,4/1/14 13:13,Phone,-33.880329,151.20569
3,505010153111F,HAYMARKET #,f1a6836c0b7a3415a19a90fdd6f0ae18484d6d1e,4/1/14 13:13,Phone,-33.880329,151.20569
4,5.05E+106,HAYMARKET #,f1a6836c0b7a3415a19a90fdd6f0ae18484d6d1e,4/1/14 17:27,Phone,-33.880329,151.20569


In [8]:
df['tower'].value_counts()

# The most common tower locations indicate the places the customer frequents most.
# For example, the top three might correspond to his home, his office, and perhaps
# his partner's house or somewhere tied to a hobby, such as a gym.
# Conversely, the least frequented towers might reflect travel, such as a holiday.

BALGOWLAH HAYES ST                          4301
CHIPPENDALE                                 1084
SUNDERLAND ST                                723
REDFERN TE                                   712
HAYMARKET #                                  563
BRICKWORKS                                   501
HARBORD 22 WAINE ST                          465
FAIRLIGHT 137 SYDNEY RD                      454
MANLY #                                      231
NEW TOWN                                     197
CHINATOWN                                    161
BEECHWORTH                                   112
BALGOWLAH VILLAGE SHOPPING CENTRE IBC        106
MANLY SOUTH STEYNE                            92
BROADWAY OTC                                  85
MASCOT INTERNATIONAL AIRPORT TERMINAL T1      65
71 MACQUARIE ST                               49
SURRY HILLS 418A ELIZABETH ST                 45
MANLY NTH STEYNE                              40
MASCOT M5 MOTORWAY EMERGENCY STAIRS           33
BALGOWLAH TE        
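
The reasoning in the comments above (frequent towers as home or work candidates, rare towers as possible trips) can be sketched as follows. The `sample` series is a made-up miniature with tower names borrowed from the output above, and the cut-offs (top two, at most one record) are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample: one entry per record, tower names taken from the output above.
sample = pd.Series(
    ['BALGOWLAH HAYES ST'] * 5
    + ['CHIPPENDALE'] * 3
    + ['REDFERN TE'] * 2
    + ['MASCOT INTERNATIONAL AIRPORT TERMINAL T1'],
    name='tower',
)

counts = sample.value_counts()                 # records per tower, most frequent first
shares = sample.value_counts(normalize=True)   # same, as a fraction of all records

top_towers = counts.head(2)          # candidates for home / work (cut-off is an assumption)
rare_towers = counts[counts <= 1]    # candidates for one-off trips (threshold is an assumption)

print(top_towers)
print(rare_towers)
```

On the real data, the same calls would run on `df['tower']`, and the thresholds would need tuning to the actual distribution.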

In [9]:
df.loc[(df['tower'] == 'LENEVA; WODONGA TIP OFF BEACHWORTH RD')]

# On the holiday question: mapping the coordinates of "LENEVA; WODONGA TIP OFF BEACHWORTH RD" shows
# that the customer was in Victoria, Australia, whereas his most frequented towers are in New South Wales.

Unnamed: 0,cgi,tower,identifier,timedate,type,latitude,longitude
10000,5050115504CB1,LENEVA; WODONGA TIP OFF BEACHWORTH RD,,3/27/15 10:03,Internet,-36.16793,146.88306
10001,5050115504CB1,LENEVA; WODONGA TIP OFF BEACHWORTH RD,,3/27/15 10:03,Internet,-36.16793,146.88306
10002,5050115504CB1,LENEVA; WODONGA TIP OFF BEACHWORTH RD,e925cff3596db298e7f5d7cd31306e790b7fe7be,3/27/15 10:03,Phone,-36.16793,146.88306
10164,5050115504CB1,LENEVA; WODONGA TIP OFF BEACHWORTH RD,,3/29/15 14:22,Internet,-36.16793,146.88306
10165,5050115504CB1,LENEVA; WODONGA TIP OFF BEACHWORTH RD,,3/29/15 14:22,Internet,-36.16793,146.88306
10166,5050115504CB1,LENEVA; WODONGA TIP OFF BEACHWORTH RD,,3/29/15 14:23,Internet,-36.16793,146.88306
10167,5050115504CB1,LENEVA; WODONGA TIP OFF BEACHWORTH RD,,3/29/15 14:23,Internet,-36.16793,146.88306
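
To put a number on how far this tower is from the usual locations, one option is the haversine great-circle distance. The sketch below is a minimal illustration: `haversine_km` is a helper name introduced here, the Earth radius is the common mean-radius approximation, and the two coordinate pairs are copied from the outputs in this notebook.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Coordinates copied from the outputs in this notebook.
balgowlah = (-33.78815, 151.26654)   # BALGOWLAH HAYES ST
leneva = (-36.16793, 146.88306)      # LENEVA; WODONGA TIP OFF BEACHWORTH RD

distance = haversine_km(*balgowlah, *leneva)
print(f"{distance:.0f} km")
```

A distance of several hundred kilometres from the most frequented tower is consistent with travel rather than a daily routine, though it cannot by itself distinguish a holiday from a business trip.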


In [10]:
df.loc[(df['tower'] == 'CHIPPENDALE')]

Unnamed: 0,cgi,tower,identifier,timedate,type,latitude,longitude
5,5050101532B23,CHIPPENDALE,6bbc17070aa91e2dab7909b96c6eecbd6109ba56,4/1/14 17:36,Phone,-33.884171,151.20235
6,5050101536E5E,CHIPPENDALE,6bbc17070aa91e2dab7909b96c6eecbd6109ba56,4/1/14 17:40,Phone,-33.884171,151.20235
13,5050101537A4A,CHIPPENDALE,91aba4a11359ff3af7902428d20cfa7e676c36e7,4/4/14 9:47,Phone,-33.884171,151.20235
17,5050101536E5E,CHIPPENDALE,91aba4a11359ff3af7902428d20cfa7e676c36e7,4/4/14 18:10,Phone,-33.884171,151.20235
42,50501015334B6,CHIPPENDALE,70e1f163d854d4e9b63e9a3f4056ced467567d85,4/10/14 21:35,SMS,-33.884171,151.20235
43,50501015334B6,CHIPPENDALE,70e1f163d854d4e9b63e9a3f4056ced467567d85,4/10/14 21:36,SMS,-33.884171,151.20235
44,50501015334B6,CHIPPENDALE,70e1f163d854d4e9b63e9a3f4056ced467567d85,4/10/14 21:36,SMS,-33.884171,151.20235
45,50501015334B6,CHIPPENDALE,70e1f163d854d4e9b63e9a3f4056ced467567d85,4/10/14 21:43,SMS,-33.884171,151.20235
46,50501015334B6,CHIPPENDALE,70e1f163d854d4e9b63e9a3f4056ced467567d85,4/10/14 21:44,SMS,-33.884171,151.20235
47,50501015334B6,CHIPPENDALE,70e1f163d854d4e9b63e9a3f4056ced467567d85,4/10/14 21:46,SMS,-33.884171,151.20235


In [11]:
df.loc[(df['tower'] == 'BALGOWLAH HAYES ST')]

Unnamed: 0,cgi,tower,identifier,timedate,type,latitude,longitude
715,505012056EF02,BALGOWLAH HAYES ST,,9/24/14 17:17,Internet,-33.78815,151.26654
716,505012056EF02,BALGOWLAH HAYES ST,,9/24/14 19:08,Internet,-33.78815,151.26654
717,505012056EF02,BALGOWLAH HAYES ST,,9/24/14 19:08,Internet,-33.78815,151.26654
718,505012056EF02,BALGOWLAH HAYES ST,,9/24/14 19:09,Internet,-33.78815,151.26654
719,505012056EF02,BALGOWLAH HAYES ST,,9/24/14 19:10,Internet,-33.78815,151.26654
720,505012056EF02,BALGOWLAH HAYES ST,,9/24/14 19:10,Internet,-33.78815,151.26654
789,505012056EF02,BALGOWLAH HAYES ST,,9/27/14 10:34,Internet,-33.78815,151.26654
790,505012056EF02,BALGOWLAH HAYES ST,,9/27/14 13:37,Internet,-33.78815,151.26654
793,505012056EF02,BALGOWLAH HAYES ST,,9/27/14 15:20,Internet,-33.78815,151.26654
799,505012056EF02,BALGOWLAH HAYES ST,,9/27/14 15:26,Internet,-33.78815,151.26654


In [None]:
# Comparing when the phone was used at the previous two towers, Balgowlah Hayes St looks like his place
# of work, since usage there generally falls between 07:00 and 17:00, while Chippendale mostly registers usage at night.
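
One way to check a time-of-day claim like this is to tabulate records per tower and hour. The sketch below uses a hypothetical six-row `sample` (rows copied from the outputs above) in the renamed column schema, and it assumes the `timedate` strings are in month/day/year order, which values such as `3/27/15` suggest.

```python
import pandas as pd

# Hypothetical sample in the notebook's renamed schema; rows copied from the outputs above.
sample = pd.DataFrame({
    'tower': ['CHIPPENDALE', 'CHIPPENDALE', 'CHIPPENDALE',
              'BALGOWLAH HAYES ST', 'BALGOWLAH HAYES ST', 'BALGOWLAH HAYES ST'],
    'timedate': ['4/1/14 17:36', '4/4/14 9:47', '4/10/14 21:35',
                 '9/24/14 17:17', '9/24/14 19:08', '9/27/14 10:34'],
})

# Assumes month/day/year order with a 24-hour clock.
sample['timestamp'] = pd.to_datetime(sample['timedate'], format='%m/%d/%y %H:%M')
sample['hour'] = sample['timestamp'].dt.hour

# Rows: tower; columns: hour of day; values: number of records.
by_hour = pd.crosstab(sample['tower'], sample['hour'])
print(by_hour)
```

Six rows say nothing about the real pattern; run on the full `df`, a table like this (or a heatmap of it) would show whether the working-hours reading holds. Splitting further by weekday versus weekend would help separate a workplace from a home.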

In [None]:
# Open questions:
# Is there a way to split 'timedate' into two columns, one for the date and one for the time?
# How can the records be sorted by month when each date string holds three values (month, day, and year)?
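
A possible answer to both questions is to parse the strings with `pd.to_datetime` and then use the `.dt` accessor. The sketch below is an illustration on a hypothetical four-row `sample` with values copied from the outputs above; it assumes month/day/year order, and the column names `date`, `time`, and `month` are introduced here.

```python
import pandas as pd

# Hypothetical sample; the strings are copied from the outputs above.
sample = pd.DataFrame({
    'timedate': ['9/24/14 17:17', '4/1/14 9:40', '3/27/15 10:03', '4/10/14 21:35'],
})

# Parse once (assumes month/day/year order), then derive the pieces.
ts = pd.to_datetime(sample['timedate'], format='%m/%d/%y %H:%M')
sample['date'] = ts.dt.date     # datetime.date objects
sample['time'] = ts.dt.time     # datetime.time objects
sample['month'] = ts.dt.month   # integers 1-12

# Sorting on the parsed timestamps gives true chronological order...
chronological = sample.loc[ts.sort_values().index]

# ...whereas sorting on 'month' groups records by calendar month regardless of year.
by_month = sample.sort_values('month', kind='stable')

print(chronological)
print(by_month)
```

Because the data spans 2014 and 2015, sorting by `month` alone would interleave the two years; sorting by the parsed timestamp, or by year and then month, avoids that.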