# Extract data from Notice To Mariners
### Objective: Get coordinates of points from PDF documents published on the __www.marine.gov.my__ website
### Target dataset: Coordinates of published **oil & gas** operations within **Sarawak** waters in **2019**
### Steps are as follows
1. Load the website
2. Find all links to PDF files
3. Download all the PDFs
4. Convert all PDFs to text
5. Find coordinate information in the text
6. Aggregate the collected data into a table
7. Draw maps of current activities in the vicinity of our company's operations

In [1]:
import requests
import time
from bs4 import BeautifulSoup

### Using plain __BeautifulSoup__ and __Requests__ to download the PDFs
__Beautiful Soup__ is a Python library for pulling data out of HTML and XML files.

__Requests__ is an elegant and simple HTTP library for Python, built for human beings.

In [2]:
main_url = 'http://www.marine.gov.my/jlmv4/ms/notis/pelaut'

In [3]:
response = requests.get(main_url)
soup = BeautifulSoup(response.text, 'lxml')

### With this simple approach, we only get 20 documents from 2019, covering all of Malaysia regardless of industry

In [4]:
for i,a in enumerate(soup.select("a[href*='2019']")):
    print(f"{i+1} {a.text[:120]}{'-'*(120-len(a.text))}: {a['href']}")

1 Amendment To Beacon Height, Sarawak Waters------------------------------------------------------------------------------: http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK1242019.pdf
2 New Position For Tok Bali Fairway Buoy Kelantan Waters------------------------------------------------------------------: http://www.marine.gov.my/jlmv4/sites/default/files/NTM1562019.pdf
3 Site Survey And Soil Boring Malacca Straits-----------------------------------------------------------------------------: http://www.marine.gov.my/jlmv4/sites/default/files/NTM1572019.pdf
4 Anchorage Buoy 2, Sarawak River - Collapsed-----------------------------------------------------------------------------: http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK1232019.pdf
5 Notice Of Transportation Vessel And Pipeline Pull In – Posh Defender And MMA Prestige To D18 Field, Offshore Sarawak----: http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK1222019.pdf
6 Sungai Kuantan Wreck Buoy Off Station K

### We need to use __Selenium__ to get all the PDFs for 2019 because __www.marine.gov.my__ is a dynamic website.
__Selenium__ automates browsers. It is especially useful for dynamically loaded websites.

What I know about the website from browsing it manually:
1. There is a drop down to select the region (in my case I select 'Wilayah Sarawak').
2. There is a drop down to select how many documents shown in one page (here I select 60).
3. To cover all 2019 documents, I need to go to page 2.
4. All of the above are dynamically loaded when selected.

In [5]:
from selenium import webdriver
driver = webdriver.Chrome("../Chromedriver/chromedriver.exe")
driver.set_window_position(-2560,0)
driver.set_window_size(1280,1440)
# Open the browser and go to the URL
driver.get(main_url)
# Select 'Wilayah Sarawak' from the region drop-down
driver.find_element_by_xpath("//*[@id='edit-field-notis-header-tid']/option[3]").click()
time.sleep(3)
# Select 60 items per page from the drop-down
driver.find_element_by_xpath("//*[@id='edit-items-per-page']/option[5]").click()
time.sleep(3)
soup1 = BeautifulSoup(driver.page_source, 'lxml')

### This time we get 59 documents identified as being from 2019

In [6]:
for i,a in enumerate(soup1.select("a[href*='2019']")):
    print(f"{i+1} {a.text[:120]}{'-'*(120-len(a.text))}: {a['href']}")

1 Amendment To Beacon Height, Sarawak Waters------------------------------------------------------------------------------: http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK1242019.pdf
2 Anchorage Buoy 2, Sarawak River - Collapsed-----------------------------------------------------------------------------: http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK1232019.pdf
3 Notice Of Transportation Vessel And Pipeline Pull In – Posh Defender And MMA Prestige To D18 Field, Offshore Sarawak----: http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK1222019.pdf
4 The Installation Operation Of Floating Production, Storage And Offloading (FPSO) In Block SK10, Offshore Sarawak--------: http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK1212019.pdf
5 DSV Sapura Jane Diving And Rov Underwater Inspection In Sarawak Waters--------------------------------------------------: http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK12022019.pdf
6 Geohazard Site Survey Investigat

### Now we filter them with appropriate keywords

In [7]:
keywords = ["oil","drilling","exploration","field","block","geotechnical","rig"]
links1 = []
for i,a in enumerate(soup1.select("a[href*='2019']")):
    for kw in keywords:
        if kw in a.get_text().lower():
            links1.append(a)
links1 = list(set(links1))
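
Because the filter above uses plain substring matching, "rig" also matches inside "Corrigendum" and "oil" inside "Soil". The sketch below is one possible refinement, not part of the original run: it only accepts a keyword at the start of a word. The helper name `mentions_keyword` and the leading `\b` rule are assumptions for illustration, and the sample titles are taken (shortened) from the listings in this notebook.

```python
import re

# Same keywords as in the notebook. The leading \b (keyword must start a word)
# is an assumed rule, chosen so that "Oilfield" still matches but "Soil" does not.
keywords = ["oil", "drilling", "exploration", "field", "block", "geotechnical", "rig"]
keyword_re = re.compile(r"\b(?:" + "|".join(map(re.escape, keywords)) + r")", re.IGNORECASE)

def mentions_keyword(title):
    """Return True if any keyword starts a word in the title (hypothetical helper)."""
    return keyword_re.search(title) is not None

titles = [
    "Site Survey And Soil Boring Malacca Straits",
    "Notice Of Naga-7 Jack Up Rig Move From ASB To TEDP-B Platform In Temana Field, Offshore Sarawak",
    "Corrigendum To NTM 41/2019(T) - MPSV Nor Australis Installing Subsea Equipment",
]
for title in titles:
    print(mentions_keyword(title), title)
```

Applying such a rule would change which documents are selected, so the counts reported in this notebook would no longer hold exactly.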

### This time we get 27 documents after keyword filtering

In [8]:
for i,a in enumerate(links1):
    print(f"{i+1} {a.text[:120]}{'-'*(120-len(a.text))}: {a['href']}")

1 Marine Geotechnical Survey, Offshore Sarawak----------------------------------------------------------------------------: http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK692019.pdf
2 SSR (Semi-Submersible Drilling Rig) Deep Water Nautilus Moving From Bolai To Saderi Location, Offshore Sarawak----------: http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK922019.pdf
3 Debris Survey At Bokor (BODP-D) Oil Rig For Integrated Redevelopment Projects Bokor Phase 3 Eor And Betty, Offshore Sara: http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK672019.pdf
4 Ship Movement For Modification Works At Oil Rig For D18 Phase 2 Development Projects, Offshore Sarawak------------------: http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK1042019.pdf
5 Notice Of Naga-7 Jack Up Rig Move From ASB To TEDP-B Platform In Temana Field, Offshore Sarawak-------------------------: http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK932019.pdf
6 Notice Of Rig Mobilization And SK408 

### Let's do the same for the second page

In [9]:
# Go to the second page
driver.find_element_by_xpath("//*[@id='block-system-main']/div/div[3]/ul/li[2]/a").click()
time.sleep(3)
soup2 = BeautifulSoup(driver.page_source, 'lxml')
# Quit the browser and end the WebDriver session
driver.quit()

### We get an additional 27 documents that fit our criteria on the second page.

In [10]:
links2 = []
for i,a in enumerate(soup2.select("a[href*='2019']")):
    for kw in keywords:
        if kw in a.get_text().lower():
            links2.append(a)
links2 = list(set(links2))
for i,a in enumerate(links2):
    print(f"{i+1} {a.text[:120]}{'-'*(120-len(a.text))}: {a['href']}")

1 Notification Of Perisai Pacific 101 (PP101) Jack-Up Rig Movement From Baronia Field (BNJT-K) To Johor-------------------: http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK632019.pdf
2 TAD (Tender Assist Drilling Rig) SKD Esperanza Moving From Labuan Anchorage To F1 4DR-A Location, Offshore Sarawak------: http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK482019.pdf
3 FPSO MTC Ledang Floating Hoses, On Tow From Bintulu (Sarawak) To Kayu Manis Oilfield, Offshore Sarawak------------------: http://www.marine.gov.my/jlmv4/sites/default/files/NPM312019(T).pdf
4 Corrigendum To NTM 41/2019(T) - MPSV Nor Australis Installing Subsea Equipment At Gumusut - Kakap ( Phase 2 ), Offshore : http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK592019.pdf
5 Notification Of Carrying Out Activities Involving Ships At Medan Merapuh, Block SK309, Within Exclusive Economic Zone Of: http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK652019.pdf
6 Mooring Pile And Mooring Chain Laying 

### Merge pages 1 and 2

In [11]:
links = links1 + links2

### Now we download the documents to our local disk

In [12]:
for i,a in enumerate(links):
    print(f"{i} Downloading... {a['href']}")
    url = a['href']
    r = requests.get(url, allow_redirects=True)
    with open(f"{url.split('/')[-1]}", 'wb') as file:
        file.write(r.content)
    time.sleep(0.2)
print("DONE!")

0 Downloading... http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK692019.pdf
1 Downloading... http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK922019.pdf
2 Downloading... http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK672019.pdf
3 Downloading... http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK1042019.pdf
4 Downloading... http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK932019.pdf
5 Downloading... http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK1152019.pdf
6 Downloading... http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK1142019.pdf
7 Downloading... http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK732019.pdf
8 Downloading... http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK682019.pdf
9 Downloading... http://www.marine.gov.my/jlmv4/sites/default/files/NTMSRK912019.pdf
10 Downloading... http://www.marine.gov.my/jlmv4/sites/default/files/892019%28T%29.pdf
11 Downloading... http://www.marine.gov.my/jlmv4/sites/defau
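
The download loop names each file after the last URL segment, so a link such as `892019%28T%29.pdf` is saved with its percent-encoding intact. If a decoded name is preferred, a minimal sketch using only the standard library could look like this. The helper name `local_name` is hypothetical and was not used in the run above.

```python
import posixpath
from urllib.parse import unquote, urlsplit

def local_name(url):
    """Return the percent-decoded last path segment of a URL (hypothetical helper)."""
    return unquote(posixpath.basename(urlsplit(url).path))

url = "http://www.marine.gov.my/jlmv4/sites/default/files/892019%28T%29.pdf"
print(local_name(url))  # 892019(T).pdf
```

Note that the rest of this notebook was run with the encoded filenames, so the `DocName` values shown later keep the `%28T%29` form.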

### Extract data from the downloaded PDFs

In [13]:
import PyPDF2
import re
import pandas as pd
import glob

filenames = sorted(glob.glob("*.pdf"))
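# Raw string so that \s, \d, etc. reach the regex engine unchanged.
# Groups: name, then degrees / minutes / optional seconds / hemisphere letter, twice (latitude, longitude).
pat = r"\s+([a-zA-Z0-9.,:\-\&\(\)\s]{,40})\s*(\d+)°\s?(\d+|\d+\.\d+)\'\s?(\d+\.\d+|\d+)?\"?\s*([NnEe]?)\s*(\d+)°\s?(\d+|\d+\.\d+)\'\s?(\d+\.\d+|\d+)?\"?\s*([NnEe]?)"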
#pat = "\s+([a-zA-Z0-9.,:\-\&\(\)\s]{,40})\s*(\d+)°\s?(\d+|\d+\.\d+)[\'’]\s?(\d+\.\d+|\d+)?\"?\s*([NnEe]?)\s*(\d+)°\s?(\d+|\d+\.\d+)[\'’]\s?(\d+\.\d+|\d+)?\"?\s*([NnEe]?)"
#dms_pattern = "\s*\t{3}(\s?[a-zA-Z0-9\-\&\(\):]+)\t{3}\s*?(\d+)°\s?(\d+)\'\s?(\d*\.?\d*)\s*\"\s*([NnEe]?)\s+(\d+)°\s?(\d+)\'\s?(\d*\.?\d*)\s*\"\s*([NnEe]?)"
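
To see what the active pattern captures, here is a small check on two made-up lines. The coordinate values come from the result table at the end of this notebook, but the exact spacing of the sample text is an assumption; the text PyPDF2 extracts from the real PDFs may be laid out differently.

```python
import re

# Same expression as `pat` in this notebook, written as a raw string.
pat = r"\s+([a-zA-Z0-9.,:\-\&\(\)\s]{,40})\s*(\d+)°\s?(\d+|\d+\.\d+)\'\s?(\d+\.\d+|\d+)?\"?\s*([NnEe]?)\s*(\d+)°\s?(\d+|\d+\.\d+)\'\s?(\d+\.\d+|\d+)?\"?\s*([NnEe]?)"

# Hypothetical sample lines: one with seconds, one with decimal minutes only.
sample_dms = " Patawali-2 4°37'35.256\"N 111°35'59.643\"E"
sample_dm = " D35R 3°9.56'N 113°4.53'E"

for sample in (sample_dms, sample_dm):
    for match in re.findall(pat, sample):
        name, *parts = match
        # The name group may keep trailing whitespace, so strip it for display.
        print(name.strip(), parts)
```

When the seconds are absent, the optional group comes back as an empty string, which is why the notebook later replaces `''` with `0`.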

In [14]:
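columns = ["Name","DegY","MinY","SecY","SymY","DegX","MinX","SecX","SymX","DocName","Page"]
frames = []  # one DataFrame per page, concatenated once at the end
for filename in filenames:
    found = False  # becomes True when any page of this document yields a match
    with open(filename,'rb') as fileObj:
        # Legacy PyPDF2 names; PyPDF2 3.0 replaced them with PdfReader,
        # len(reader.pages), reader.pages[i] and page.extract_text().
        pdfReader = PyPDF2.PdfFileReader(fileObj)
        for page in range(pdfReader.getNumPages()):
            text = pdfReader.getPage(page).extractText()
            text = text.replace("\n","")
            # Workaround kept from the original run: repeat the header words six times
            text = text.replace("Longitude","Longitude"*6)
            text = text.replace("Duration","Duration"*6)
            data = re.findall(pat,text)
            df_temp = pd.DataFrame(data,columns=columns[:-2])
            df_temp["DocName"] = filename
            df_temp["Page"] = page + 1
            frames.append(df_temp)
            found = found or bool(data)
    # Report once per document, not once per page
    if not found:
        print(filename," document has no Coordinates")
# pd.concat replaces the repeated DataFrame.append calls (removed in pandas 2.0)
df = pd.concat(frames) if frames else pd.DataFrame(columns=columns)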

NPM342019(T).pdf  document has no Coordinates
NTMSRK1042019.pdf  document has no Coordinates
NTMSRK1152019.pdf  document has no Coordinates
NTMSRK282019.pdf  document has no Coordinates
NTMSRK382019.pdf  document has no Coordinates
NTMSRK582019.pdf  document has no Coordinates
NTMSRK622019.pdf  document has no Coordinates
NTMSRK682019.pdf  document has no Coordinates
NTMSRK962019.pdf  document has no Coordinates


### Convert coordinates to decimal degrees

In [15]:
df.fillna(0, inplace=True)
df.replace('',0,inplace=True)

In [16]:
cols = ['DegY', 'MinY', 'SecY','DegX', 'MinX', 'SecX']
for col in cols:
    df[col] = pd.to_numeric(df[col],errors='coerce')

In [17]:
df["ddY"] = df['DegY'] + (df['MinY']/60) + (df['SecY']/3600)
df["ddX"] = df['DegX'] + (df['MinX']/60) + (df['SecX']/3600)
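
The conversion above is degrees + minutes/60 + seconds/3600. As a worked check outside pandas, the sketch below applies the same arithmetic to the Patawali-2 point from the result table. The function name `dms_to_dd` and the hemisphere handling are additions for illustration: the notebook itself ignores `SymY`/`SymX`, which is harmless here because Sarawak lies in the northern and eastern hemispheres.

```python
def dms_to_dd(deg, minutes=0.0, seconds=0.0, hemisphere=""):
    """Convert degrees/minutes/seconds to decimal degrees (hypothetical helper).

    South and west hemispheres give negative values; this sign handling is
    an assumption added here and is not part of the notebook's calculation.
    """
    dd = float(deg) + float(minutes) / 60 + float(seconds) / 3600
    return -dd if hemisphere.upper() in ("S", "W") else dd

# Patawali-2: 4°37'35.256"N, 111°35'59.643"E
print(round(dms_to_dd(4, 37, 35.256, "N"), 6))    # 4.62646
print(round(dms_to_dd(111, 35, 59.643, "E"), 6))  # 111.599901
```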

In [18]:
df.to_csv("point.csv")

### Transform WGS84 to Timbalai 1948

In [19]:
from pyproj import Proj, transform
wgs84 = Proj('+proj=longlat +datum=WGS84 +no_defs')
tim48 = Proj('+proj=longlat +ellps=evrstSS +towgs84=-533.4,669.2,-52.5,0.0,0.0,4.28,9.4 +no_defs')
timUTM = Proj('+proj=utm +zone=49 +ellps=evrstSS +towgs84=-533.4,669.2,-52.5,0.0,0.0,4.28,9.4 +units=m +no_defs')
wgsUTM = Proj('+proj=utm +zone=49 +datum=WGS84 +units=m +no_defs')

In [20]:
X_,Y_ = transform(wgs84,tim48,df.ddX.values,df.ddY.values)
df["ddY_tim"],df["ddX_tim"] = Y_,X_
df1 = df.copy()
df2 = df.copy()

### Save the points as ESRI Shapefiles

In [21]:
import geopandas as gpd
from shapely.geometry import Point

In [22]:
df1['geometry'] = df.apply(lambda x : Point((float(x.ddX),float(x.ddY))),axis=1)
df1 = gpd.GeoDataFrame(df1,geometry='geometry')
df1.crs = '+proj=longlat +datum=WGS84 +no_defs'
df1.to_file("Points_WGS84.shp",driver='ESRI Shapefile')

In [23]:
df2['geometry'] = df.apply(lambda x : Point((float(x.ddX_tim),float(x.ddY_tim))),axis=1)
df2 = gpd.GeoDataFrame(df2,geometry='geometry')
df2.crs = '+proj=longlat +ellps=evrstSS +towgs84=-533.4,669.2,-52.5,0.0,0.0,4.28,9.4 +no_defs'
df2.to_file("Points_tim48.shp",driver='ESRI Shapefile')

In [24]:
df1.__dict__

{'_is_copy': None, '_data': BlockManager
 Items: Index(['Name', 'DegY', 'MinY', 'SecY', 'SymY', 'DegX', 'MinX', 'SecX', 'SymX',
        'DocName', 'Page', 'ddY', 'ddX', 'ddY_tim', 'ddX_tim', 'geometry'],
       dtype='object')
 Axis 1: Int64Index([0, 1, 0, 1, 0, 0, 1, 0, 1, 2,
             ...
             1, 0, 0, 0, 0, 1, 0, 0, 0, 1],
            dtype='int64', length=102)
 FloatBlock: [2, 3, 6, 7, 11, 12, 13, 14], 8 x 102, dtype: float64
 IntBlock: [1, 5, 10], 3 x 102, dtype: int64
 ObjectBlock: [0, 4, 8, 9, 15], 5 x 102, dtype: object, '_item_cache': {'geometry': 0    POINT (111.5999008333333 4.626460000000001)
  1    POINT (111.5789236111111 4.607560833333333)
  0             POINT (113.0755 3.159333333333334)
  1                  POINT (111.6373333333333 4.5)
  0    POINT (112.0658333333333 4.764166666666667)
  0    POINT (112.0453388888889 3.792733333333333)
  1    POINT (112.0659777777778 3.762805555555556)
  0             POINT (113.6283166666667 4.553925)
  1    POINT (113.62

In [25]:
df2.head()

Unnamed: 0,Name,DegY,MinY,SecY,SymY,DegX,MinX,SecX,SymX,DocName,Page,ddY,ddX,ddY_tim,ddX_tim,geometry
0,Patawali-2,4,37.0,35.256,N,111,35.0,59.643,E,892019%28T%29.pdf,1,4.62646,111.599901,4.627264,111.596462,POINT (111.5964622221384 4.627264416819839)
1,Patawali-3,4,36.0,27.219,N,111,34.0,44.125,E,892019%28T%29.pdf,1,4.607561,111.578924,4.608364,111.575482,POINT (111.575482358422 4.608363839876462)
0,"From: Sg, Nyigu (Bintulu)",3,9.56,0.0,N,113,4.53,0.0,E,NPM312019(T).pdf,1,3.159333,113.0755,3.160036,113.072256,POINT (113.0722556895158 3.160036469366533)
1,Kayu Manis Oil-Field (Offshore Sarawak),4,30.0,0.0,N,111,38.24,0.0,E,NPM312019(T).pdf,1,4.5,111.637333,4.500796,111.6339,POINT (111.6338999356702 4.50079554037153)
0,D35R,4,45.0,51.0,N,112,3.0,57.0,E,NPM322019(T).pdf,1,4.764167,112.065833,4.764982,112.062454,POINT (112.0624543649082 4.764982379677904)
