# Exploration Overview
This notebook loads the raw submarine cables JSON file and explores it to identify the data cleaning steps that will be needed.

The code and outputs for this are shown below.


## Findings 
All findings noted in the code below are collected here for ease of viewing.
- Small data (545 records). Assuming this is a one-time cleaning process, heavy optimization is not needed.
- `url` and `email_link` columns don't seem to carry much information and can probably be removed.
- `rfs` is a string column even though it represents a date, and its granularity is inconsistent.
- `owners` is a comma-separated string of multiple owners. It should be split into individual owners.
- `landing_points` seems to be its own dataset.
- `length` is a string column that includes a unit. It should be cleaned by converting to a single unit and casting to numeric.
- `length` is missing in 37 rows. Filtering may have unexpected results.
- `rfs` is missing in 5 rows. Filtering may have unexpected results.
- `date_of_collection` is a string despite appearing to have a date format.
- `date_of_collection` is constant. The column can likely be removed as it provides no additional information.
- All `length` values are in km. The unit can be stripped and the column cast to numeric without converting units.
- `rfs` (ready for service) values have inconsistent granularity, but seem to have the year as the first element.
    - Not sure if Q4 means Q4 of the calendar year or of that individual landing point's business year.
- `cable_id` is unique, so it makes a good primary key.
- `is_tbd` is null most of the time (2,572 of 2,578 rows). Likely either fill or drop.
- Is there only one landing point in a city? Potential to extrapolate the city from `name`.
- No coordinates are provided. These will need to be collected in order to visualize the landing points.
- `is_tbd` is either empty or True. Could assume that None means False.
- `owner` to `cable_id` is a many-to-many relationship.
- Landing point `id` to `cable_id` is a many-to-many relationship.

## Expected Cleaning Steps
- Remove `url`, `email_link`, and `date_of_collection` columns
- Split the `rfs` column into its granularity components (year, month/quarter) and cast each to an appropriate type
- Because of the many-to-many relationships, split the owner and landing point information into separate table-like datasets.
- Cast `length` to numeric and add unit information to column name
- Fill `is_tbd` nulls with False
- Potentially feature engineer to get landing point city
- Collect and append latitude/longitude data for landing points
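
As a rough illustration of the first and fourth steps, the sketch below drops the uninformative columns and casts `length` to a numeric `length_km` column. It runs on a small toy frame rather than the real file; the `example-cable` row, the `link-*` placeholders, and the `length_km` column name are assumptions made for the example.

```python
import pandas as pd

# Toy rows shaped like the raw records (values are illustrative only;
# "example-cable" and the "link-*" strings are hypothetical placeholders).
cables = pd.DataFrame(
    {
        "cable_id": ["2africa", "aden-djibouti", "example-cable"],
        "date_of_collection": ["2023-02-22"] * 3,
        "url": [None, "https://www.teleyemen.com.ye/", None],
        "email_link": ["link-a", "link-b", "link-c"],
        "length": ["45,000 km", "269 km", None],
    }
)

# Drop the columns flagged as uninformative.
cleaned = cables.drop(columns=["url", "email_link", "date_of_collection"])

# Strip the " km" suffix and the thousands separators, then cast to numeric.
# Missing lengths stay missing (NaN) rather than being filled.
length_text = (
    cleaned["length"]
    .str.replace(" km", "", regex=False)
    .str.replace(",", "", regex=False)
)
cleaned["length_km"] = pd.to_numeric(length_text, errors="coerce")
cleaned = cleaned.drop(columns=["length"])

print(cleaned)
```

Note that `errors="coerce"` would also turn any unexpected value (for example a different unit) into NaN without raising, so it is only reasonable here because every non-null `length` was found to end in `km`.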

## Dashboard User Feedback
- When filtering by `length` or `rfs`, let the user choose whether rows with null values are included.
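
One possible shape for such a filter is sketched below. The function name `filter_by_range`, the `include_nulls` flag, and the `length_km` column are hypothetical and only meant to show the idea, not a dashboard implementation.

```python
import pandas as pd


def filter_by_range(df, column, low, high, include_nulls=False):
    """Keep rows whose `column` lies in [low, high]; optionally keep null rows too."""
    mask = df[column].between(low, high)  # null values evaluate to False here
    if include_nulls:
        mask = mask | df[column].isna()
    return df[mask]


# Toy data: one short cable, one long cable, one with unknown length.
cables = pd.DataFrame(
    {"cable_id": ["a", "b", "c"], "length_km": [269.0, 45000.0, None]}
)

short_only = filter_by_range(cables, "length_km", 0, 1000)
short_or_unknown = filter_by_range(cables, "length_km", 0, 1000, include_nulls=True)
```

Without the flag, rows with a missing value silently drop out of any range filter, which is the unexpected behavior noted in the findings.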

In [1]:
# Imports
import json
from pathlib import Path

import pandas as pd

from us_state_abbr import abbrev_to_us_state

# Constants
RAW_DATA_PATH = Path("raw_data")
RAW_CABLE_DATA_PATH = RAW_DATA_PATH / "Submarine Cables - 2023-02-22.json"

In [5]:
# Load sub cable datasets
with RAW_CABLE_DATA_PATH.open("r") as data_file:
    raw_cable_data = json.load(data_file)
    
# Print out various data tidbits
# FINDING: small data, excessive optimization not needed
print("Number of elements:", len(raw_cable_data))
print("\nKey values for a data element:", raw_cable_data[0].keys())
print("\n2 example elements:\n")
for example in raw_cable_data[:2]:
    print("\n", example)

Number of elements: 545

Key values for a data element: dict_keys(['date_of_collection', 'cable_id', 'cable_name', 'rfs', 'owners', 'url', 'email_link', 'landing_points', 'length'])

2 example elements:


 {'date_of_collection': '2023-02-22', 'cable_id': '2africa', 'cable_name': '2Africa', 'rfs': '2023', 'owners': 'China Mobile, MTN, Meta, Orange, Saudi Telecom, Telecom Egypt, Vodafone, WIOCC', 'url': None, 'email_link': 'https://www.submarinecablemap.com/#/submarine-cable/2africa', 'landing_points': '[{"is_tbd": null, "country": "Angola", "id": "luanda-angola", "name": "Luanda, Angola"}, {"is_tbd": null, "country": "Bahrain", "id": "manama-bahrain", "name": "Manama, Bahrain"}, {"is_tbd": null, "country": "Comoros", "id": "moroni-comoros", "name": "Moroni, Comoros"}, {"is_tbd": null, "country": "Congo, Dem. Rep.", "id": "muanda-congo-dem-rep-", "name": "Muanda, Congo, Dem. Rep."}, {"is_tbd": null, "country": "Congo, Rep.", "id": "pointe-noire-congo-rep-", "name": "Pointe-Noire, Congo, 

In [6]:
# For easier visualization and exploration, load json into a Pandas DataFrame and visualize first couple of rows
# FINDING: `url` and `email_link` columns don't seem to have much information, can probably be removed
# FINDING: `rfs` is a string column when it represents a date. Doesn't seem to have granularity consistency.
# FINDING: `owners` is a string list of multiple owners. Should be separated into individual owners.
# FINDING: `landing_points` seems to be its own dataset
# FINDING: `length` is a string col with unit and should be cleaned by making all the same unit and casting to numeric
raw_cable_df = pd.DataFrame.from_dict(raw_cable_data)
raw_cable_df.head(3)

Unnamed: 0,date_of_collection,cable_id,cable_name,rfs,owners,url,email_link,landing_points,length
0,2023-02-22,2africa,2Africa,2023,"China Mobile, MTN, Meta, Orange, Saudi Telecom...",,https://www.submarinecablemap.com/#/submarine-...,"[{""is_tbd"": null, ""country"": ""Angola"", ""id"": ""...","45,000 km"
1,2023-02-22,acs-alaska-oregon-network-akorn,ACS Alaska-Oregon Network (AKORN),2009 April,Alaska Communications,https://www.alaskacommunications.com,https://www.submarinecablemap.com/#/submarine-...,"[{""is_tbd"": null, ""country"": ""United States"", ...","3,000 km"
2,2023-02-22,aden-djibouti,Aden-Djibouti,1994,"Djibouti Telecom, Orange, Tata Communications,...",https://www.teleyemen.com.ye/,https://www.submarinecablemap.com/#/submarine-...,"[{""is_tbd"": null, ""country"": ""Djibouti"", ""id"":...",269 km


In [7]:
# Visualize nulls and types
# FINDING: `length` is missing in some rows. Filtering may have unexpected results
# FINDING: `rfs` is missing in 5 rows. Filtering may have unexpected results
# FINDING: `date_of_collection` is a string despite looking to have a date format above
raw_cable_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   date_of_collection  545 non-null    object
 1   cable_id            545 non-null    object
 2   cable_name          545 non-null    object
 3   rfs                 540 non-null    object
 4   owners              545 non-null    object
 5   url                 297 non-null    object
 6   email_link          545 non-null    object
 7   landing_points      545 non-null    object
 8   length              508 non-null    object
dtypes: object(9)
memory usage: 38.4+ KB


In [10]:
# FINDING: `date_of_collection` is a constant. Column can likely be removed as it doesn't provide additional info
print("Unique date collections:", raw_cable_df['date_of_collection'].nunique())

# FINDING: All `length` values are in KM. Can strip the unit and cast to numeric without converting units.
print("\nLength Unit options:", raw_cable_df['length'].str.split().str[-1].unique())

# FINDING: `rfs` (ready for service) values have inconsistent granularity, but seem to have the year as the first element
#     Not sure if Q4 means Q4 of the calendar year or that individual landing point's business year
print("\nReady for Service examples", raw_cable_df['rfs'].unique().tolist()[:15])

# FINDING: The cable_id is a unique field, good primary key
print(f"\nNum unique cable ids: {raw_cable_df['cable_id'].nunique()} / {raw_cable_df.shape[0]} total rows")

Unique date collections: 1

Length Unit options: ['km' None]

Ready for Service examples ['2023', '2009 April', '1994', '1996 September', '2016 January', '2024 Q4', '2012 December', '1999 February', '2008 November', '2012', '2004 June', '2012 August', '1997 April', '1990', '1999']

Num unique cable ids: 545 / 545 total rows
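
Given the `rfs` patterns seen above (year only, year plus month name, year plus quarter), one way to split the column into typed parts is sketched here. The output column names `rfs_year`, `rfs_quarter`, and `rfs_month` are assumptions, and the sketch only covers the three patterns visible in the sample output; other patterns in the full data would need checking.

```python
import calendar

import pandas as pd

# Month name -> month number, e.g. "April" -> 4.
# calendar.month_name follows the active locale (English by default).
MONTH_NUMBERS = {name: num for num, name in enumerate(calendar.month_name) if name}

rfs = pd.Series(["2023", "2009 April", "2024 Q4", None], name="rfs")

# Year is the first token; anything after it is the optional month or quarter.
parts = rfs.str.extract(r"^(?P<year>\d{4})(?:\s+(?P<detail>.+))?$")
quarter = parts["detail"].str.extract(r"^Q([1-4])$")[0]

rfs_parts = pd.DataFrame(
    {
        "rfs_year": pd.to_numeric(parts["year"]).astype("Int64"),
        "rfs_quarter": pd.to_numeric(quarter).astype("Int64"),
        "rfs_month": pd.to_numeric(parts["detail"].map(MONTH_NUMBERS)).astype("Int64"),
    }
)
print(rfs_parts)
```

The nullable `Int64` dtype keeps the parts as integers while still allowing missing values, which matters because most rows will have at most one of month or quarter.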


In [12]:
# Explore owners and location separately

# Explore owners
# Explode owners out so each owner for a `cable_id` value is on a new line.
owner_data = raw_cable_df[['owners', 'cable_id']].copy()
owner_data['owners'] = owner_data['owners'].str.split(', ')
owner_data = owner_data.explode('owners', ignore_index=True).rename(columns={'owners': 'owner'})
owner_data.head(3)

Unnamed: 0,owner,cable_id
0,China Mobile,2africa
1,MTN,2africa
2,Meta,2africa


In [18]:
# FINDING: `owner` to `cable_id` is a many to many relationship. 
print(owner_data.groupby('cable_id').count().sort_values('owner', ascending=False).head(5))
print(owner_data.groupby('owner').count().sort_values('cable_id', ascending=False).head(5))

                             owner
cable_id                          
seamewe-3                       50
sat-3wasc                       33
safe                            29
japan-u-s-cable-network-jus     24
apcn-2                          22
                        cable_id
owner                           
Orange                        30
Telecom Italia Sparkle        25
Tata Communications           22
AT&T                          22
Google                        21
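
Because the relationship is many-to-many, the owner information could be stored as an owner table plus a cable-to-owner link table. The sketch below shows one way to build both on toy data; the owner lists are shortened from the real records, and the surrogate key `owner_id` is a hypothetical name.

```python
import pandas as pd

# Toy data with shortened owner lists.
cables = pd.DataFrame(
    {
        "cable_id": ["2africa", "aden-djibouti"],
        "owners": ["China Mobile, MTN, Orange", "Djibouti Telecom, Orange"],
    }
)

# One row per (cable, owner) pair.
pairs = (
    cables.assign(owner=cables["owners"].str.split(", "))
    .explode("owner", ignore_index=True)[["cable_id", "owner"]]
    .drop_duplicates()
)

# Owner table: one row per distinct owner with a surrogate integer key.
owners = pairs[["owner"]].drop_duplicates().sort_values("owner").reset_index(drop=True)
owners["owner_id"] = owners.index

# Link table: cable_id <-> owner_id.
cable_owner = pairs.merge(owners, on="owner")[["cable_id", "owner_id"]]
```

Splitting on `", "` assumes that no owner name contains a comma itself; that assumption has not been verified against the full dataset.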


In [23]:
# Explore location
# Explode out landing points after loading them into Python objects using JSON
# Translate dictionary into separate columns
location_data = raw_cable_df[['landing_points', 'cable_id']].copy()
location_data['landing_points'] = location_data['landing_points'].apply(json.loads)
location_data = location_data.explode('landing_points', ignore_index=True)
location_data = pd.concat([location_data['landing_points'].apply(pd.Series), location_data['cable_id']], axis=1)


# Visualize
# FINDING: `is_tbd` is null most of the time. Either fill or drop likely.
# FINDING: Is there only one landing point in a city? Potential to extrapolate the city from `name`.
# FINDING: For visualizing, there are no coordinates provided. Will need to collect these in order to visualize.
print(location_data.info())
location_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2578 entries, 0 to 2577
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   is_tbd    6 non-null      object
 1   country   2578 non-null   object
 2   id        2578 non-null   object
 3   name      2578 non-null   object
 4   cable_id  2578 non-null   object
dtypes: object(5)
memory usage: 100.8+ KB
None


Unnamed: 0,is_tbd,country,id,name,cable_id
0,,Angola,luanda-angola,"Luanda, Angola",2africa
1,,Bahrain,manama-bahrain,"Manama, Bahrain",2africa
2,,Comoros,moroni-comoros,"Moroni, Comoros",2africa
3,,"Congo, Dem. Rep.",muanda-congo-dem-rep-,"Muanda, Congo, Dem. Rep.",2africa
4,,"Congo, Rep.",pointe-noire-congo-rep-,"Pointe-Noire, Congo, Rep.",2africa


In [24]:
# FINDING: `is_tbd` is either empty or True. Could make assumption that None means False.
location_data['is_tbd'].value_counts(dropna=False)

None    2572
True       6
Name: is_tbd, dtype: int64
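
If the assumption that None means False holds, the fill could look like the following minimal sketch on a toy series. Converting through the nullable `boolean` dtype first keeps the fill explicit and yields a plain `bool` column at the end.

```python
import pandas as pd

# Toy column mirroring the observed values: mostly None, occasionally True.
is_tbd = pd.Series([None, True, None], dtype="object", name="is_tbd")

# Assumption: a missing value means the landing point is not "to be determined".
is_tbd_filled = is_tbd.astype("boolean").fillna(False).astype(bool)
print(is_tbd_filled)
```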

In [26]:
# FINDING: landing point's `id` to `cable_id` is a many to many relationship. 
group_cols = ['cable_id', 'id']
print(location_data[group_cols].groupby('cable_id').count().sort_values('id', ascending=False).head(5))
print(location_data[group_cols].groupby('id').count().sort_values('cable_id', ascending=False).head(5))

                                                   id
cable_id                                             
connected-coast                                    51
2africa                                            48
bt-highlands-and-islands-submarine-cable-system    40
seamewe-3                                          38
philippine-domestic-submarine-cable-network-pdscn  33
                     cable_id
id                           
mumbai-india               17
tuas-singapore             17
batam-indonesia            15
jeddah-saudi-arabia        14
marseille-france           14
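
For the potential city feature mentioned earlier, one rough approach is to strip the known `country` from the end of `name` and keep the first remaining part. This is only a sketch under assumptions: the `city` column name and the `strip_country` helper are hypothetical, and any remaining comma-separated part (such as a state or region) is simply dropped. It requires Python 3.9+ for `str.removesuffix`.

```python
import pandas as pd

# Two landing points taken from the output above.
landing_points = pd.DataFrame(
    {
        "id": ["luanda-angola", "muanda-congo-dem-rep-"],
        "name": ["Luanda, Angola", "Muanda, Congo, Dem. Rep."],
        "country": ["Angola", "Congo, Dem. Rep."],
    }
)


def strip_country(name: str, country: str) -> str:
    """Remove a trailing ', <country>' from a landing point name, if present."""
    return name.removesuffix(f", {country}")


# Strip the country first, because country names can contain commas themselves.
locality = pd.Series(
    [strip_country(n, c) for n, c in zip(landing_points["name"], landing_points["country"])],
    index=landing_points.index,
)
landing_points["city"] = locality.str.split(", ").str[0]
print(landing_points[["id", "city"]])
```

Whether a city has only one landing point remains an open question; this sketch does not answer it, it only makes the check possible by grouping on `city`.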
