### Information evaluation
We search each article for:
1. IMO number: a 7-digit number
2. MMSI: a 9-digit number
3. Ship, company, or people's names: who is involved (crew, captain, police)
4. Country names: these can be the offence location, the flag state, the offender's country of origin, etc.
5. Fish names: the species involved


#### 1. Searching IMO

In [43]:
import re

# some example text 
text = "Here are some numbers: 1234, 12312, 1423124125, 1231241232, 1231244, 2384822, 210392892"


In [44]:
import re

def find_imo(content):    
    # IMO is a 7-digit number (exactly 7-digit)
    imo_pattern = r'\b\d{7}\b'
    
    # find all 7-digit numbers in the content 
    imo_list = re.findall(imo_pattern, content)
    
    return imo_list

find_imo(text)

['1231244', '2384822']
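
The pattern above accepts any 7-digit number, so both matches in the sample text are likely false positives. IMO numbers carry a check digit: the first six digits are multiplied by the weights 7, 6, 5, 4, 3, 2, and the last digit of the sum must equal the seventh digit. A minimal sketch of that filter follows; the helper names `is_valid_imo` and `find_valid_imo` are ours and hypothetical, not part of any library.

```python
import re

def is_valid_imo(candidate):
    """Return True if a 7-digit string passes the IMO check-digit test."""
    if not re.fullmatch(r"\d{7}", candidate):
        return False
    digits = [int(ch) for ch in candidate]
    # Multiply the first six digits by the weights 7, 6, 5, 4, 3, 2 and sum them
    weighted_sum = sum(d * w for d, w in zip(digits[:6], range(7, 1, -1)))
    # The last digit of the sum must equal the seventh digit
    return weighted_sum % 10 == digits[6]

def find_valid_imo(content):
    """Find 7-digit numbers in the text and keep those with a valid check digit."""
    return [m for m in re.findall(r"\b\d{7}\b", content) if is_valid_imo(m)]
```

On the sample text this filter would drop both `1231244` and `2384822`, since neither passes the check. A number that passes is still only a candidate: an unrelated 7-digit number has roughly a one-in-ten chance of passing by accident.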

#### 2. Searching MMSI

In [45]:
def find_mmsi(content):    
    # regular expression for a 9-digit number (exactly 9-digit)
    mmsi_pattern = r'\b\d{9}\b'
    
    # find all 9-digit numbers in the content 
    mmsi_list = re.findall(mmsi_pattern, content)
    
    return mmsi_list

find_mmsi(text)

['210392892']
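
MMSI has no check digit, but a ship station's MMSI starts with a three-digit Maritime Identification Digits (MID) country code. The sketch below assumes that valid MIDs fall in the range 201 to 775, which should be confirmed against the current ITU table before relying on it. The function name `find_ship_mmsi` is hypothetical.

```python
import re

# Assumption: ship-station MMSIs begin with a MID between 201 and 775
MID_MIN, MID_MAX = 201, 775

def find_ship_mmsi(content):
    """Find 9-digit numbers whose first three digits look like a MID."""
    candidates = re.findall(r"\b\d{9}\b", content)
    return [c for c in candidates if MID_MIN <= int(c[:3]) <= MID_MAX]
```

This only narrows the candidates; any 9-digit number with a plausible prefix (a phone number, for instance) would still pass.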

#### 3. Ship, company, people's name

In [47]:
import spacy

# Load the English language model once and reuse it in both functions
nlp = spacy.load('en_core_web_sm')

def find_involved_parties_spacy(text):
    doc = nlp(text)
    names = []

    for ent in doc.ents:
        # NORP covers nationalities and religious or political groups
        if ent.label_ in ['PERSON', 'ORG', 'NORP']:
            names.append(ent.text)
    return names

# country's name / location
def find_location(text):
    doc = nlp(text)
    locations = []

    for ent in doc.ents:
        if ent.label_ in ['LOC', 'GPE']:  # location labels
            locations.append(ent.text)

    return locations



##### Examples

In [35]:
# Sample text for demonstration. The article is from:
# https://www.cbc.ca/news/canada/nova-scotia/n-s-boat-captain-fined-fisheries-violations-1.6965198
text = """
A boat captain from Sambro, N.S., with a history of fishery convictions has been fined $60,000 and banned from fishing for six months for five violations that included a secret, middle-of-the-night offload of halibut.

The case involved misreporting of halibut, hake and cod catch from trips on board the fishing boat Ivy Lew between May 2019 and June 2020.

Casey Henneberry, 40, and ALS Fisheries and Law Fisheries were found guilty last October by Halifax provincial court Judge Elizabeth Buckle.

The sentence was handed down in court last month and posted this week.

"Mr. Henneberry's offending conduct is that he inaccurately logged and hailed weight of groundfish on four trips over approximately one year and on one of those trips, he illegally off-loaded $40,000 worth of halibut, intending to sell it," Buckle said in her written sentence.

On that trip, DFO officers were observing the Ivy Lew and intercepted the illegal off-load in Sambro. During the attempted arrests of those involved, Henneberry fled and was arrested later, the judge noted in sentencing.

During this period the groundfish licence was held first by Law Fisheries and then ALS Fisheries. The companies were fined $55,000 and $10,000 respectively for failing to ensure licence conditions were complied with.

ALS Fisheries owns the Ivy Lew, which has been held for the past three years by the Department of Fisheries and Oceans, likely as security on any fine.

The company has continued to pay on the $1-million mortgage on the vessel."""

print("involved parties:", find_involved_parties_spacy(text))
print("location:", find_location(text))

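#### Sambro and Halifax are places, but the model returned them as involved parties, not locations.
#### The small English model does not seem to know much about Canadian place names 😂
### Maybe we should combine the two functions, or try a larger model or a better library.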

involved parties: ['Sambro', 'Casey Henneberry', 'ALS Fisheries', 'Halifax', 'Elizabeth Buckle', 'Henneberry', 'Sambro', 'Henneberry', 'Law Fisheries', 'ALS Fisheries', 'ALS Fisheries', 'the Department of Fisheries', 'Oceans']
location: ['N.S.']
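
One way to act on the note above is to post-process the two lists: remove duplicates and move names found in a small list of known places over to the locations. This is a minimal sketch, not a spaCy feature. `KNOWN_PLACES`, `dedupe`, and `split_parties_and_places` are hypothetical names, and the gazetteer holds only the two places seen in this example.

```python
# Hypothetical gazetteer: in practice this would be a much larger list of place names
KNOWN_PLACES = {"Sambro", "Halifax"}

def dedupe(items):
    """Remove duplicates while keeping the order of first appearance."""
    return list(dict.fromkeys(items))

def split_parties_and_places(parties, locations, known_places=KNOWN_PLACES):
    """Move known place names from the parties list to the locations list."""
    places = dedupe(list(locations) + [p for p in parties if p in known_places])
    people_orgs = dedupe(p for p in parties if p not in known_places)
    return people_orgs, places
```

Calling `split_parties_and_places(find_involved_parties_spacy(text), find_location(text))` would then return the cleaned pair. Partial names such as `Henneberry` next to `Casey Henneberry` are not merged by this sketch.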


#### 5. Searching fish names (in progress)

In [41]:
from nltk.corpus import wordnet as wn

### hand-picked names of commonly fished species (not all are fish, e.g. whale, crab)
fish_words = {'salmon', 'tuna', 'shark', 'whale', 'crab', 'lobster', 'shrimp'}
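
A word list like this can be matched against an article with a single case-insensitive regular expression. The sketch below is one possible approach; the function name `find_fish_names` is hypothetical. It matches whole words only, so plurals such as "sharks" would need to be added to the vocabulary or handled separately.

```python
import re

def find_fish_names(content, vocabulary):
    """Return the vocabulary words that appear in the text, in order of first appearance."""
    if not vocabulary:
        return []
    # Try longer names first so multi-word names win over their shorter parts
    ordered = sorted(vocabulary, key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in ordered)
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    found = [match.group(0).lower() for match in pattern.finditer(content)]
    # Drop repeats while keeping the order of first appearance
    return list(dict.fromkeys(found))
```

For the article used earlier, a vocabulary containing `halibut`, `hake`, and `cod` would pick up the three species named in the second paragraph.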

In [42]:
import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet')

def search_fish_names():
    fish_names = []
    # Search for synsets directly related to 'fish'
    synsets = wn.synsets('fish', pos=wn.NOUN)

    for synset in synsets:
        # Retrieve hyponyms (more specific terms) of each 'fish' synset
        hyponyms = synset.hyponyms()
        for hyponym in hyponyms:
            # Retrieve lemmas (actual words) of each hyponym
            for lemma in hyponym.lemmas():
                fish_names.append(lemma.name().replace('_', ' '))

    return fish_names

# Example usage:
fish_names = search_fish_names()
print("Fish names found in WordNet:")
for name in fish_names:
    print(name)

## Only direct hyponyms are visited, so many species are missing.

Fish names found in WordNet:
bony fish
bottom-feeder
bottom-dweller
bottom lurkers
cartilaginous fish
chondrichthian
climbing perch
Anabas testudineus
A. testudineus
fingerling
food fish
game fish
sport fish
mouthbreeder
northern snakehead
rough fish
spawner
young fish
alewife
anchovy
eel
haddock
hake
mullet
grey mullet
gray mullet
panfish
rock salmon
salmon
schrod
scrod
shad
smelt
stockfish
trout


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sumin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [64]:
import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet')

def search_fish_names():
    fish_names = []
    # Search for synsets related to 'fish'
    fish_synsets = wn.synsets('fish')

    for synset in fish_synsets:
        # Retrieve the direct hyponyms (more specific terms) of each synset
        hyponyms = synset.hyponyms()
        for hyponym in hyponyms:
            # Iterate through all lemmas (actual words) of each hyponym
            for lemma in hyponym.lemmas():
                fish_names.append(lemma.name().replace('_', ' '))

            # Additionally, retrieve the hyponyms of each hyponym (one more level only; deeper levels are not visited)
            nested_hyponyms = hyponym.hyponyms()
            for nested_hyponym in nested_hyponyms:
                for lemma in nested_hyponym.lemmas():
                    fish_names.append(lemma.name().replace('_', ' '))

    return fish_names

# Example usage:
fish_names = search_fish_names()
print("Fish names found in WordNet:")
for name in fish_names:
    print(name)


# Only two levels of hyponyms are visited, so some species may still be missing

Fish names found in WordNet:
bony fish
crossopterygian
lobefin
lobe-finned fish
lungfish
teleost fish
teleost
teleostan
bottom-feeder
bottom-dweller
mullet
bottom lurkers
cartilaginous fish
chondrichthian
elasmobranch
selachian
holocephalan
holocephalian
climbing perch
Anabas testudineus
A. testudineus
fingerling
food fish
barracouta
snoek
groundfish
bottom fish
herring
Clupea harangus
salmon
sardine
sea bass
shad
snapper
sole
trout
tuna
tunny
whitefish
game fish
sport fish
mouthbreeder
northern snakehead
rough fish
spawner
young fish
brit
britt
parr
parr
whitebait
alewife
anchovy
eel
elver
smoked eel
haddock
finnan haddie
finnan haddock
finnan
smoked haddock
hake
mullet
grey mullet
gray mullet
panfish
rock salmon
salmon
Atlantic salmon
chinook salmon
chinook
king salmon
kippered salmon
red salmon
sockeye
sockeye salmon
silver salmon
coho salmon
coho
cohoe
smoked salmon
schrod
scrod
shad
smelt
American smelt
rainbow smelt
European smelt
sparling
stockfish
trout
rainbow trout
sea trout


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sumin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
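
Both versions above stop at a fixed depth. To reach every level, the traversal can keep following child links until none are left. The sketch below shows the idea on a small hand-made hierarchy so that it runs without WordNet; `collect_descendants` and `toy_taxonomy` are hypothetical names, and the toy data is illustrative, not taken from WordNet.

```python
def collect_descendants(roots, children_of):
    """Collect every node reachable from the roots by repeatedly following child links."""
    found = []
    seen = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        for child in children_of(node):
            # Skip nodes already visited; a term can sit under several parents
            if child not in seen:
                seen.add(child)
                found.append(child)
                stack.append(child)
    return found

# Toy hierarchy for illustration only
toy_taxonomy = {
    "fish": ["food fish", "bony fish"],
    "food fish": ["salmon", "trout"],
    "bony fish": ["salmon"],
    "salmon": ["sockeye salmon", "coho salmon"],
}

all_names = collect_descendants(["fish"], lambda name: toy_taxonomy.get(name, []))
```

With NLTK, the roots could be `wn.synsets('fish', pos=wn.NOUN)` and `children_of` could be `lambda s: s.hyponyms()`, after which the lemma names are read from each collected synset as in the cells above. NLTK's `Synset.closure` method offers a similar traversal. Even a full traversal is limited by WordNet's coverage, so it should not be expected to list every commercial species.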
