# Diving into VesselFinder

## Prototyping a piece of code - Part 1

Let's assume we want to collect the data shown on this web page:

https://www.vesselfinder.com/vessels/MOTIVATION-D-IMO-9301108-MMSI-636092241

Interesting information is located, for example, here:

![Information on Motivation D.](https://ibb.co/nrJ8Ypz)

The source code of the web page shows that this information is embedded in a table:

![Table for the data](https://ibb.co/Gpc0DZH)

Our goal is to write this data to a simple CSV file.

### Installing some packages

We will use two packages to scrape the data:
- requests, which sends HTTP requests to the website, much as a web browser would
- beautifulsoup4, which reads and parses the HTML code

Both are installed with "pip".

Please note that the --user option at the end lets you install these packages without admin privileges on your machine... pretty convenient!


In [3]:
pip install requests --user

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install beautifulsoup4 --user

Note: you may need to restart the kernel to use updated packages.


### Let's start to write some code

First, we import some standard Python libraries:

- csv, to easily handle CSV files
- time and datetime, two modules for dealing with timestamps and scheduling.

Python best practices (PEP 8) recommend importing these standard-library modules first, before any third-party packages.

Importing a library is done using... the "import" command! ;)

#### Importing the libs

In [51]:
import csv

In [52]:
import time

In [53]:
from datetime import datetime

Then we import the libraries from the packages we just installed above, using pip.

In [54]:
import requests

In [55]:
import urllib.request

In [56]:
from bs4 import BeautifulSoup

#### Defining some variables
We must declare a variable containing the URL we want to read, corresponding to our target vessel. We also record the time at which we scrape the page.

In [57]:
url = 'https://www.vesselfinder.com/vessels/MOTIVATION-D-IMO-9301108-MMSI-636092241'
scrape_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

Let's check that both variables are set correctly by printing them with print(). It's not mandatory, but it's often useful for debugging!

In [58]:
print(url, scrape_time)

https://www.vesselfinder.com/vessels/MOTIVATION-D-IMO-9301108-MMSI-636092241 2020-10-12 22:52:13


#### Let's try to connect to the website....
This is done with the get function of "requests": requests.get(url)

In [59]:
reqs = requests.get(url)

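The web page should now be stored in the reqs variable.
To read it, we use Beautiful Soup to parse the page. Note that the 'lxml' parser used here requires the lxml package (pip install lxml --user); Python's built-in 'html.parser' works as well.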

In [60]:
soup = BeautifulSoup(reqs.text, 'lxml')

Let's print the result of this by using print().

In [61]:
print(soup)

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access this resource.</p>
</body></html>



What happened here?!

The problem is that VesselFinder uses some form of protection against being scraped by robots: the server answered with a 403 Forbidden error.
We must identify ourselves as a real browser, and not as a Python script.
This is done by defining a "headers" dictionary for requests, containing a 'user-agent' entry.
We will use the user-agent of Firefox, for instance...

In [62]:
headers = {'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0'}

Now we can try again to get the page and its content....

In [63]:
reqs = requests.get(url, headers=headers)
soup = BeautifulSoup(reqs.text, 'lxml')

Let's print the soup!.... That's much better!

In [64]:
print(soup)

<!DOCTYPE html>
<html lang="en"><head><script>(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)})(window,document,'script','//www.google-analytics.com/analytics.js','ga');ga('create', 'UA-27021448-6', 'vesselfinder.com');ga('send', 'pageview');</script><script async="" src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script><script>
var prefix="";</script><meta charset="utf-8"/><meta content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no" name="viewport"/>
<title>MOTIVATION D, Container Ship - Details and current position - IMO 9301108 MMSI 636092241 - VesselFinder</title><meta content="Vessel MOTIVATION D (IMO: 9301108, MMSI: 636092241) is a Container Ship built in 2006 and currently sailing under the flag of Liberia." name="description"/>
<link href="https://ww

It's not mandatory, but we will save this as an HTML page... It's always a good idea to keep a trace of what we are doing.

#### Backup the page

In [65]:
with open("motivationD_output.html", "w", encoding='utf-8') as file:
    file.write(str(soup))
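
Because the script is meant to be run repeatedly, a fixed file name means each run overwrites the previous backup. One possible way around this is to put the scrape time in the file name. The sketch below is only an illustration: the helper name backup_filename and the naming scheme are assumptions, not part of the original notebook.

```python
from datetime import datetime


def backup_filename(vessel_name, when):
    """Build a backup file name such as 'motivationD_20201012_225213.html'.

    Hypothetical helper: the naming scheme is an assumption for illustration.
    """
    stamp = when.strftime('%Y%m%d_%H%M%S')
    return f'{vessel_name}_{stamp}.html'


# Example with the scrape time printed earlier in this notebook
print(backup_filename('motivationD', datetime(2020, 10, 12, 22, 52, 13)))
```

The returned name can then be passed to open() in place of "motivationD_output.html".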

Let's look at the source code once again.
We can see that all our data is enclosed in <td> tags....
BeautifulSoup to the rescue!
We can search for all the td elements in the soup....

#### Read the soup

In [66]:
data = soup.find_all('td')

What do we have here? Let's print it!

In [67]:
print(data)

[<td class="n3ata">
<a href="/ports/HAMBURG-GERMANY-1977">
<span class="m-flag-small flag-icon" style="background-image:url(https://static.vesselfinder.net/images/flags/4x3/de)" title="Germany"></span>
                    Hamburg                </a>
</td>, <td class="n3ata">Oct 13, 09:00</td>, <td class="n3">AIS Type</td>, <td class="v3">Cargo ship</td>, <td class="n3">Flag</td>, <td class="v3">Liberia</td>, <td class="n3">Destination</td>, <td class="v3">HAMBURG VIA NOK</td>, <td class="n3">ETA</td>, <td class="v3">Oct 13, 09:00</td>, <td class="n3">IMO / MMSI</td>, <td class="v3">9301108 / 636092241</td>, <td class="n3">Callsign</td>, <td class="v3">A8ZB6</td>, <td class="n3">Length / Beam</td>, <td class="v3">155 / 20 m</td>, <td class="n3">Current draught</td>, <td class="v3">6.0 m</td>, <td class="n3">Course / Speed</td>, <td class="v3">264.7° / 14.4 kn</td>, <td class="n3">Coordinates</td>, <td class="v3">54.58134 N/10.72763 E</td>, <td class="n3">Status</td>, <td class="v3 toolt

#### Select the data

We can now extract the desired information by simply indicating its position in square brackets...
For example, the coordinates are at position 21.... So we can do something like:
coordinates = data[21].get_text()
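
Relying on a fixed position works for this page, but it would break if the site added or removed a row. As a hedged alternative, the sketch below looks a value up by the label of the cell that precedes it. The function name value_for is hypothetical, and the cell texts are hard-coded from the output above so that the example runs on its own.

```python
def value_for(cells, label):
    """Return the text of the cell that follows the cell named `label`.

    Returns None when the label is not found.
    """
    for index, text in enumerate(cells[:-1]):
        if text == label:
            return cells[index + 1]
    return None


# With BeautifulSoup, the list could be built from the td elements, e.g.:
#   cells = [td.get_text(strip=True) for td in data]
# Here a few values from the page are hard-coded instead.
cells = ['AIS Type', 'Cargo ship', 'Flag', 'Liberia',
         'Course / Speed', '264.7° / 14.4 kn',
         'Coordinates', '54.58134 N/10.72763 E']

print(value_for(cells, 'Coordinates'))
```

This trades a little extra code for not having to count cells by hand.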

In [68]:
coordinates = data[21].get_text()
print(coordinates)
lat = coordinates.split('/')[0]
lon = coordinates.split('/')[1]

54.58134 N/10.72763 E
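
At this point lat and lon are still strings such as '54.58134 N'. If numeric values are needed later (for plotting, for instance), they can be converted to signed decimal degrees. This is a minimal sketch: the function name parse_coordinates is hypothetical, and the 'S' and 'W' suffixes are an assumption, since only 'N' and 'E' appear in the page shown here.

```python
def parse_coordinates(text):
    """Convert e.g. '54.58134 N/10.72763 E' to (lat, lon) in signed degrees."""
    lat_text, lon_text = text.split('/')
    lat_value, lat_hemisphere = lat_text.split()
    lon_value, lon_hemisphere = lon_text.split()
    lat = float(lat_value)
    lon = float(lon_value)
    # Assumption: southern latitudes and western longitudes are marked S and W
    if lat_hemisphere.upper() == 'S':
        lat = -lat
    if lon_hemisphere.upper() == 'W':
        lon = -lon
    return lat, lon


print(parse_coordinates('54.58134 N/10.72763 E'))
```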


Let's do the same for the timestamp of the position (as date_tag).
It's a bit more involved because the timestamp is stored in an HTML attribute (data-title) of the cell rather than in its text...
But it's not that difficult, as BeautifulSoup can read attributes too.

In [76]:
# data[25] is already a BeautifulSoup Tag, so its attributes can be read directly
date_tag = data[25]['data-title']

In [77]:
print(date_tag)

Oct 12, 2020 20:52 UTC


We convert this date to a more friendly format, using datetime.

In [71]:
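# str.strip(' UTC') would strip any of the characters ' ', 'U', 'T', 'C', not the suffix itself
date_tag = date_tag.replace(',', '').replace(' UTC', '')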
date_tag = datetime.strptime(date_tag, '%b %d %Y %H:%M')
date_collect = date_tag.strftime('%Y-%m-%d')
time_collect = date_tag.strftime('%H:%M')

In [72]:
print(date_collect, time_collect)

2020-10-12 20:52
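
The conversion above drops the "UTC" information. If you would rather keep it, the same string can be parsed into a timezone-aware datetime. The sketch below is one possible approach, not the notebook's method; the function name parse_position_time is hypothetical, and it assumes the timestamp always ends with ' UTC' as in the output above.

```python
from datetime import datetime, timezone


def parse_position_time(text):
    """Parse e.g. 'Oct 12, 2020 20:52 UTC' into a timezone-aware datetime."""
    suffix = ' UTC'
    if text.endswith(suffix):
        text = text[:-len(suffix)]
    # Note: %b matches month abbreviations in the current locale (English here)
    parsed = datetime.strptime(text, '%b %d, %Y %H:%M')
    return parsed.replace(tzinfo=timezone.utc)


position_time = parse_position_time('Oct 12, 2020 20:52 UTC')
print(position_time.isoformat())
```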


Let's extract the heading and the speed (as with the coordinates... we must split them!)

In [73]:
head_spd = data[19].get_text()
heading = head_spd.split(' / ')[0]
speed = head_spd.split(' / ')[1]

In [74]:
print(heading,speed)

264.7° 14.4 kn
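
As with the coordinates, heading and speed are still strings that carry their units. A small helper can turn them into numbers. This is a sketch under the assumption that the cell always looks like '264.7° / 14.4 kn'; the function name parse_course_speed is hypothetical, and missing values are not handled.

```python
def parse_course_speed(text):
    """Convert e.g. '264.7° / 14.4 kn' to (course_degrees, speed_knots)."""
    course_text, speed_text = text.split(' / ')
    course = float(course_text.rstrip('°').strip())
    speed = float(speed_text.replace('kn', '').strip())
    return course, speed


print(parse_course_speed('264.7° / 14.4 kn'))
```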


We could also scrape the ETA, the port, the draught... Everything....

But for now, let's append all this information as a new row of a CSV file.

In [75]:
with open('AIS_Track_motivation.csv', 'a', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    writer.writerow([scrape_time, lat, lon, date_collect, time_collect, heading, speed])
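
Since the file is opened in append mode, it never gets a header row. One possible refinement is to write the column names only when the file is new or empty. The sketch below assumes the same column order as the row written above; the names FIELDS and append_row are hypothetical.

```python
import csv
import os

# Column names, in the same order as the row written above (assumed labels)
FIELDS = ['scrape_time', 'lat', 'lon', 'date_collect',
          'time_collect', 'heading', 'speed']


def append_row(path, row):
    """Append one row, writing the header first if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file, delimiter=',')
        if needs_header:
            writer.writerow(FIELDS)
        writer.writerow(row)


# Possible usage with the variables defined in this notebook:
#   append_row('AIS_Track_motivation.csv',
#              [scrape_time, lat, lon, date_collect, time_collect, heading, speed])
```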

### Conclusion

At this stage, we know how to scrape the data for one ship.
Every time we launch this script, we collect new data....
Running it once an hour should be enough.
Let's enhance this a little bit by adding new vessels... using a loop and Google Spreadsheets!
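
In the meantime, the hourly schedule mentioned above could be sketched with the time module imported earlier. This is only an illustration: the function name run_repeatedly is hypothetical, and the sleep function is passed as a parameter so that the example can be tried without actually waiting an hour.

```python
import time


def run_repeatedly(job, interval_seconds, repetitions, sleep=time.sleep):
    """Call job() `repetitions` times, pausing between consecutive calls."""
    results = []
    for run_number in range(repetitions):
        results.append(job())
        # No need to wait after the last run
        if run_number < repetitions - 1:
            sleep(interval_seconds)
    return results


# Demo with a stand-in job and a fake sleep that only records the waits
waits = []
print(run_repeatedly(lambda: 'scraped', 3600, 3, sleep=waits.append))
print(waits)
```

In real use, job would be a function wrapping the scraping steps of this notebook, and the default time.sleep would be kept.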