# scanning all files for relevant information

we need to build a baseline to determin who has been kidnapped, to do this, we are going to scrape the articles provided to find out who has been mentioned in the context of the kidnapping. this will help us first and foremost establish who the victims are. once we have determined that information, we can begin to use other sources to determine who runs in their circles.

### import statements

to find out where to look first we are going to be leaning on NLTK to help us find the standout articles relative to kidnapping and the individuals associated within those articles.

In [25]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
import os
from collections import Counter

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\andyt\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\andyt\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\andyt\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\andyt\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


### function definition

we're going to functionalize these operations because we might want to perform them a couple times with modified parameters to get the best slice of the available data.

this information also includes a list of additional stopwords to ignore in the scanning, this list was updated iteratively over scans of the information.

In [26]:
def search_for_terms(folder_path, terms, ignore_names):
    entities = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(folder_path, filename)
            with open(file_path, "r") as file:
                full_text = file.read()
            tokenized_text = nltk.word_tokenize(full_text)
            tagged_text = nltk.pos_tag(tokenized_text)
            named_entities = nltk.ne_chunk(tagged_text)
            for entity in named_entities:
                if type(entity) == nltk.tree.Tree:
                    if entity.label() == 'PERSON':
                        name = ' '.join([word for word, tag in entity.leaves()])
                        if name not in ignore_names and terms[0] in full_text and name in full_text:
                            print(f"Found {terms[0]} and name '{name}' in document: {file_path}")
                            entities.append(name)
    return entities


def count_elements(list):
    count = Counter(list)
    for element, frequency in sorted(count.items(), key=lambda x: x[1], reverse=True):
        print(f"{element}: {frequency}")


stop_words = ['Abila', 'Tethys', 'Kronos', 'Online', 'Currently', 'Tethyn', 'Jr', 'Ngohebo', 'Haneson Ngohebo' 'Carmen', 'Adrien Carmen']

### first pass

we think its a kidnapping! so lets take a look for any names associated with kidnapping first

note: we worked on this in a few different environments, so feel free to replace the except statement with whatever path makes sense for your data

In [27]:
try: 
    folder_path = "C:/Users/Andy/PycharmProjects/pythonProject/The_Last_Stand/TP-1_Kronos/articles"
    sus = search_for_terms(folder_path, ["kidnapping", "name"], stop_words)    
except: 
    folder_path = "C:/Users/andyt/PycharmProjects/classworker_General/The_Last_Stand/TP-1_Kronos/articles"
    sus = search_for_terms(folder_path, ["kidnapping", "name"], stop_words)


Found kidnapping and name 'Edvard Vann' in document: C:/Users/andyt/PycharmProjects/classworker_General/The_Last_Stand/TP-1_Kronos/articles\107.txt
Found kidnapping and name 'Avila' in document: C:/Users/andyt/PycharmProjects/classworker_General/The_Last_Stand/TP-1_Kronos/articles\107.txt
Found kidnapping and name 'Athena' in document: C:/Users/andyt/PycharmProjects/classworker_General/The_Last_Stand/TP-1_Kronos/articles\140.txt
Found kidnapping and name 'Tethan' in document: C:/Users/andyt/PycharmProjects/classworker_General/The_Last_Stand/TP-1_Kronos/articles\140.txt
Found kidnapping and name 'Sten St. George' in document: C:/Users/andyt/PycharmProjects/classworker_General/The_Last_Stand/TP-1_Kronos/articles\140.txt
Found kidnapping and name 'Edvard Vann' in document: C:/Users/andyt/PycharmProjects/classworker_General/The_Last_Stand/TP-1_Kronos/articles\140.txt
Found kidnapping and name 'Edvard Vann' in document: C:/Users/andyt/PycharmProjects/classworker_General/The_Last_Stand/TP-1_

In [28]:
count_elements(sus)

Sanjorge: 21
Edvard Vann: 15
Sten Sanjorge: 8
Vann: 7
Carman: 6
Rossini: 6
Marcella Trapani: 6
Kapelou: 5
John Rathburn: 4
Haneson Ngohebo: 4
Tethan: 3
Worldwise: 3
Adrien Carman: 3
Abila Fire: 3
Sanjorge Escapes Kidnapping: 3
Avila: 2
Athena: 2
Adrien Tethyn: 2
Willem: 2
Sten St. George: 1
Disappeared: 1
Sten Sanjorge Jr: 1
Orhan: 1
Juan Rathburn: 1
Speaks: 1
Civils: 1
Star: 1
Nobody: 1
Adrien: 1


so off the bat we are getting lots of references to a Sanjorge, this is a likely first victim based on the context, so we're going to make a note to lookout for this person.

### second pass

kidnapping can also mean missing, so lets do a scan looking for missing as our keyword as well. might turn up a few other clues.

In [29]:
try: 
    folder_path = "C:/Users/Andy/PycharmProjects/pythonProject/The_Last_Stand/TP-1_Kronos/articles"
    mis = search_for_terms(folder_path, ["missing", "name"], stop_words)
except: 
    folder_path = "C:/Users/andyt/PycharmProjects/classworker_General/The_Last_Stand/TP-1_Kronos/articles"
    mis = search_for_terms(folder_path, ["missing", "name"], stop_words)
    

Found missing and name 'Haneson Ngohebo' in document: C:/Users/andyt/PycharmProjects/classworker_General/The_Last_Stand/TP-1_Kronos/articles\139.txt
Found missing and name 'Sanjorge' in document: C:/Users/andyt/PycharmProjects/classworker_General/The_Last_Stand/TP-1_Kronos/articles\167.txt
Found missing and name 'John Rathburn' in document: C:/Users/andyt/PycharmProjects/classworker_General/The_Last_Stand/TP-1_Kronos/articles\167.txt
Found missing and name 'Miriam Avila' in document: C:/Users/andyt/PycharmProjects/classworker_General/The_Last_Stand/TP-1_Kronos/articles\170.txt
Found missing and name 'Avila' in document: C:/Users/andyt/PycharmProjects/classworker_General/The_Last_Stand/TP-1_Kronos/articles\170.txt
Found missing and name 'Avila' in document: C:/Users/andyt/PycharmProjects/classworker_General/The_Last_Stand/TP-1_Kronos/articles\170.txt
Found missing and name 'Haneson Ngohebo' in document: C:/Users/andyt/PycharmProjects/classworker_General/The_Last_Stand/TP-1_Kronos/articl

In [30]:
count_elements(mis)

Haneson Ngohebo: 10
Maha Salo: 7
Avila: 5
Miriam Avila: 4
Edvard Vann: 4
John Rathburn: 3
Tethan: 3
Adrien Carman: 3
Sanjorge: 2
Willem: 2
Abila Fire: 2
Orhan Strum: 2
Mr. Strum: 2
Sten Sanjorge: 2
Rossini: 2
Marcella Trapani: 2
Carman: 2
Asterian: 1
Linda Lagos: 1
Uneasy: 1
Lagos: 1
Spokesman Adrien Carman: 1
Centrum: 1
Elodis: 1
Star: 1
Mr. Sanjorge: 1
Abila Kronos: 1


less interesting information, but at this point we started clicking around on the articles that looked interesting, particularly with regard to mentions of Sanjorge as he seems highly likely to be a victim and reading more might provide us with a lay of the land. in doing so, we identified article 460, elements reproduced below:

> Fourteen employees feared kidnapped in Kronos for radical an environmental group of the terrorist during a corporative meeting. 

> They fear fourteen employees, including possibly to five executive officials, kidnapped yesterday by the "protectors of Kronos".  The disclosed disappear include: President and CEO Sten Sanjorge Jr, CFO Ingrid Precipice, Field-Corrente of Ada of I BACK WATER, rasgueo of Orhan de GAStech of the COO, and environmental official Willem Basoue-Country.

> The local organizations of the news have received a note of the rescue of the responsibility that she demanded and to demand of POK $20 million of the company.  It is possible additional demands is next.

so that's pretty much a wrap, we cross referenced the names of the high ranking GASTech officials kidnapped and found that they're all pretty much the highest ranking people at the company.

now its time to determine who might be connected to their disappearance
