# Diving into VesselFinder

## Prototyping a piece of code - Part 1

Let's assume that we want to collect the data shown on this web page:

https://www.vesselfinder.com/vessels/MOTIVATION-D-IMO-9301108-MMSI-636092241

The interesting information is located, for example, here:

[Information on Motivation D.](https://ibb.co/nrJ8Ypz)

The source code of the web page shows us that this information is embedded in a table:

[Table for the data](https://ibb.co/Gpc0DZH)

Our goal is to put this data in a simple CSV file.

### Installing some modules

We will use 3 modules to scrape the data:
- requests, which sends HTTP requests the way a web browser does
- beautifulsoup4, which reads and parses the HTML code
- lxml, the parser that BeautifulSoup is asked to use later on ('lxml')

We install them with "pip".

Note that the --user option at the end lets you install these modules without admin privileges on your machine... pretty convenient!

In [None]:
pip install requests --user

In [None]:
pip install beautifulsoup4 --user

In [None]:
pip install lxml --user

### Let's start to write some code

First, we import some standard Python libraries:

- csv, to easily handle CSV files
- time and datetime, two modules to deal with timestamps and scheduling.

Python style conventions (PEP 8) recommend importing these standard-library modules first, before any third-party package.

Importing a library is done using... the "import" command! ;)

#### Importing the libs

In [None]:
import csv

In [None]:
import time

In [None]:
from datetime import datetime

Then we import the libraries from the packages we just installed above, using pip.

In [None]:
import requests

In [None]:
import urllib.request  # part of the standard library (no pip install needed); not used in this part

In [None]:
from bs4 import BeautifulSoup

#### Defining some variables
We declare a variable containing the URL of our target vessel, and we record the time at which we scrape it.

In [None]:
url = 'https://www.vesselfinder.com/vessels/MOTIVATION-D-IMO-9301108-MMSI-636092241'
scrape_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

Let's check that we did it right by printing both values, using print(url, scrape_time). It's not mandatory, but it's often useful for debugging purposes!

In [None]:
print(url, scrape_time)

#### Let's try to connect to the website....
This is done with the get function of the "requests" module: requests.get(url)

In [None]:
reqs = requests.get(url)

The response from the server, including the content of the web page, should now be stored in the reqs variable.
In order to read it, we use BeautifulSoup to parse the page.

In [None]:
soup = BeautifulSoup(reqs.text, 'lxml')

Let's print the result of this by using print().

In [None]:
print(soup)

What happened here?!

The problem is that VesselFinder uses some form of protection to avoid being scraped by robots, so the response is not the page we expect.
We must identify ourselves as a real browser, and not as a Python script.
This is done by passing a "headers" dictionary to requests, with a 'user-agent' entry.
We will use the user-agent of Firefox, for instance...

In [None]:
headers = {'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0'}

Now we can try again to get the page and its content....

In [None]:
reqs = requests.get(url, headers=headers)
soup = BeautifulSoup(reqs.text, 'lxml')

Let's print the soup!.... That's much better!

In [None]:
print(soup)
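
To see why the first attempt looked like a robot, it can help to compare the user-agent that requests announces by default with the one defined above. The sketch below builds the request without sending it, so it runs offline; the names default_agent and prepared are only used for this illustration.

```python
import requests

# The URL and the user-agent string are the ones used in this notebook.
url = 'https://www.vesselfinder.com/vessels/MOTIVATION-D-IMO-9301108-MMSI-636092241'
headers = {'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0'}

# What requests announces when no header is given (a "python-requests/..." string).
default_agent = requests.utils.default_user_agent()
print(default_agent)

# Build the request without sending it, to inspect the headers offline.
prepared = requests.Request('GET', url, headers=headers).prepare()
print(prepared.headers['User-Agent'])
```

With a real response, reqs.status_code (or reqs.raise_for_status()) is a quick way to check whether the server accepted the request before parsing anything.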

It's not mandatory but we will save this as an html page... It's always a good idea to keep a trace of what we are doing.

#### Backup the page

In [None]:
with open("motivationD_output.html", "w", encoding='utf-8') as file:
    file.write(str(soup))

Let's look at the source code once again.
We can see that all our data is enclosed in <td> tags....
BeautifulSoup to the rescue!
We can search for all the td elements in the soup....

#### Read the soup

In [None]:
data = soup.find_all('td')

What do we have here? Let's print it!

In [None]:
print(data)
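
The raw printout is hard to read, so one option is to print each cell next to its position. The sketch below does this on a small, made-up table (sample_html is hypothetical and much shorter than the real page), and also shows a possible alternative to a hard-coded position: taking the cell that follows a known label. It uses 'html.parser', which ships with Python, only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified table: the real page has many more cells.
sample_html = """
<table>
  <tr><td>Course / Speed</td><td>254.0° / 11.3 kn</td></tr>
  <tr><td>Coordinates</td><td>36.12345 N/5.54321 W</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
cells = sample_soup.find_all('td')

# Print each cell with its position, to find the index of interest.
for index, cell in enumerate(cells):
    print(index, cell.get_text(strip=True))

# Alternative to a hard-coded index: take the cell that follows a known label.
labels = [cell.get_text(strip=True) for cell in cells]
coordinates_cell = cells[labels.index('Coordinates') + 1]
print(coordinates_cell.get_text())
```

Looking a value up by its label is a little more work, but it keeps working if the site adds or removes a row above it.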

#### Select the data

We can now extract the desired information, just by giving its position in brackets...
For example, the coordinates are in position 21.... So we can do something like:
coordinates = data[21].get_text()

In [None]:
coordinates = data[21].get_text()
print(coordinates)
lat = coordinates.split('/')[0]
lon = coordinates.split('/')[1]
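
At this point lat and lon are still strings. If numbers are needed later (for plotting, say), they could be converted as in the sketch below. The sample string and its "value hemisphere" layout are assumptions made for illustration, and to_signed_degrees is a hypothetical helper, not part of any library.

```python
def to_signed_degrees(text):
    """Convert a string such as '5.54321 W' to a signed float (-5.54321)."""
    value, hemisphere = text.split()
    degrees = float(value)
    # South and West are negative by convention.
    return -degrees if hemisphere.upper() in ('S', 'W') else degrees


sample_coordinates = '36.12345 N/5.54321 W'  # assumed format, for illustration
lat_text, lon_text = sample_coordinates.split('/')
lat_deg = to_signed_degrees(lat_text)
lon_deg = to_signed_degrees(lon_text)
print(lat_deg, lon_deg)
```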

Let's do the same for the timestamp of the position (as date_tag).
It's a bit more complicated because the value is stored in an HTML attribute ('data-title') rather than in the text of the cell...
But it's not that difficult, as we can use BeautifulSoup again to read it.

In [None]:
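# Take the cell at position 25 and turn it back into an HTML string
date_tag = data[25]
date_tag = str(date_tag)
# Parse that small piece of HTML on its own...
date_tag_soup = BeautifulSoup(date_tag, features="lxml")
# ...and read the 'data-title' attribute of its <td> tag
date_tag = date_tag_soup.td['data-title']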

In [None]:
print(date_tag)

We convert this date to a more friendly format, using datetime.

In [None]:
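# Remove the comma and the ' UTC' suffix (str.strip(' UTC') would strip characters, not the suffix)
date_tag = date_tag.replace(',', '').replace(' UTC', '').strip()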
date_tag = datetime.strptime(date_tag, '%b %d %Y %H:%M')
date_collect = date_tag.strftime('%Y-%m-%d')
time_collect = date_tag.strftime('%H:%M')

In [None]:
print(date_collect, time_collect)
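
The conversion can also be wrapped in a small function, which makes it easier to test on its own. This is only a sketch: parse_position_time is a hypothetical name and the sample string is an assumed value in the format handled above. Note that '%b' expects English month abbreviations only when the locale is English (or the default C locale).

```python
from datetime import datetime


def parse_position_time(raw):
    """Parse a string such as 'Oct 20, 2020 13:45 UTC' into a datetime."""
    cleaned = raw.replace(',', '').strip()
    # Remove the suffix explicitly: str.strip(' UTC') would strip any of the
    # characters ' ', 'U', 'T', 'C' from both ends, not the literal suffix.
    if cleaned.endswith(' UTC'):
        cleaned = cleaned[:-len(' UTC')]
    return datetime.strptime(cleaned, '%b %d %Y %H:%M')


sample_time = parse_position_time('Oct 20, 2020 13:45 UTC')  # assumed sample value
print(sample_time.strftime('%Y-%m-%d'), sample_time.strftime('%H:%M'))
```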

Let's extract the heading and the speed (same as for the coordinates... we must split them!)

In [None]:
head_spd = data[19].get_text()
heading = head_spd.split(' / ')[0]
speed = head_spd.split(' / ')[1]

In [None]:
print(heading, speed)
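
As with the coordinates, heading and speed are strings at this stage. If the units need to be dropped, a regular expression can pull out the number. The sample string below is an assumed format, and first_number is a hypothetical helper written for this sketch.

```python
import re


def first_number(text):
    """Return the first decimal number found in text, or None if there is none."""
    match = re.search(r'-?\d+(?:\.\d+)?', text)
    return float(match.group()) if match else None


sample_head_spd = '254.0° / 11.3 kn'  # assumed format, for illustration
heading_text, speed_text = sample_head_spd.split(' / ')
heading_deg = first_number(heading_text)
speed_kn = first_number(speed_text)
print(heading_deg, speed_kn)
```

Returning None when no number is found lets the calling code decide what to do with a missing value, for instance when the cell contains only a dash.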

We could also scrape the ETA, the port, the draught... Everything....

But for the moment, we append one row with all this information to a CSV file.

In [None]:
with open('AIS_Track_motivation.csv', 'a', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    writer.writerow([scrape_time, lat, lon, date_collect, time_collect, heading, speed])

### Conclusion

At this stage, we know how to scrape the data for one ship.
Every time we launch this script, we collect a new position and append it to the CSV file....
Running it once an hour should be enough.
Let's enhance this a little bit, by adding new vessels... using a loop and Google spreadsheets!
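
As for running the script every hour, one simple approach is a loop that pauses with time.sleep between two runs. The sketch below is only an illustration: run_periodically is a hypothetical helper, the job is a stand-in for the scraping code, and the sleep function is replaced by a fake one so that the example does not actually wait.

```python
import time


def run_periodically(job, interval_seconds, repetitions, sleep=time.sleep):
    """Call job() `repetitions` times, pausing interval_seconds between calls."""
    results = []
    for count in range(repetitions):
        results.append(job())
        # No need to wait after the last run.
        if count < repetitions - 1:
            sleep(interval_seconds)
    return results


# Demonstration with a stand-in job and a fake sleep, so nothing actually waits.
waits = []
demo_results = run_periodically(lambda: 'scraped', 3600, 3, sleep=waits.append)
print(demo_results, waits)
```

In real use, the sleep argument would be left at its default and the job would be a function wrapping the scraping steps of this notebook.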