# Project 4: Finding geo-names in the Flydubai Leak documents
## *An evaluation of GUI software for text analysis*


On July 29, 2016, The Guardian published a [report](https://www.theguardian.com/business/2016/jul/29/airline-pilots-complain-dangerous-fatigue-leaked-documents-flydubai) about internal complaints from the staff of Flydubai, a government-owned airline from the UAE.

The report was based on [documents](https://www.theguardian.com/business/2016/jul/29/flydubai-flight-records-the-leaked-documents) leaked to the newspaper, which published them on the same date.

![](https://s19.postimg.org/s7zv1g2n7/flydubai_cover.png)

From our project's perspective, the documents looked like a good opportunity to practise with geographical names, which abound in them.

But instead of treating this corpus like any other, we decided to first try some of the tools available to journalists who want to do text mining without programming.

The idea was to see how effective these tools can be, and possibly compare results.

### Description of the content

Period covered by the events in the complaints: March and April 2016

Guardian's note on spelling: *The misspellings are as they appear in the documents; English is the language used by pilots, but it is not necessarily a pilot’s mother tongue.*

**Guardian's findings**: In all, the reports include:

- 42 complaints about or experience of fatigue; 
- 25 bird strikes; 
- 10 medical emergencies; 
- 5 laser incidents;
- 1 bomb threat;
- 1 “dogs on the runway”; 
- 1 unstable aircraft due to unstable truffles.

## (1) Collecting the data

Data collection in this case was very simple: we copied and pasted the text, which was all on one web page (we tried to find copies of the original documents, but the *Guardian* did not release them). We saved the text to a UTF-8 text file.

## (2) Using text mining software for content analysis

## DocumentCloud

We first uploaded the file to [DocumentCloud](https://www.documentcloud.org), a project that IRE (Investigative Reporters and Editors) has sponsored since 2011. It is not only a "catalog of primary source documents" but also "a tool for annotating, organizing and publishing" documents on the web.

![](https://s19.postimg.org/v2gsk48yr/flydubai_doccloud.jpg)

Most of the buttons deal with the documents' metadata rather than their content; the exception is the `Analyze` drop-down menu.

We used the Entity analysis feature and this is what we found:

![](https://s19.postimg.org/vqytujmg3/results_doccloud.jpg)

As you can see, the one person found is mentioned many times because he is the photographer credited in the piece we are analysing (we deliberately did no manual data cleaning, so that we could see how these programs cope with dirty data).

Another option in this menu allows for the extraction of dates. This document includes only one:

![](https://s19.postimg.org/5y40vaulf/flydubai_timeline_doccloud.jpg)

These are the kinds of native entity extraction performed by DocumentCloud. Note that, as indicated on their website, this analysis includes running the uploaded documents through Thomson Reuters' [OpenCalais](http://www.opencalais.com), so this test is also indicative of the capabilities of that tool.

A third option in the `Analyze` menu is to analyse the document(s) with OverviewDocs.

## OverviewDocs

Overview is also a text analysis platform created with journalists in mind: "Overview began at The Associated Press, supported by the John S. and James L. Knight Foundation as part of its Knight News Challenge," reads the *About* section on their website.

The platform has many more features than DocumentCloud and can be used on its own, without a DocumentCloud account (DocumentCloud restricts its services to registered news organisations). In fact, DocumentCloud users who want to use Overview must register separately.

The tool offers: "built-in OCR, a sophisticated search engine, word clouds, entity detection, and topic-based document clustering. It has sophisticated tagging and metadata support and supports many input and export formats. If you need custom analysis, you can write your own plugins using the API."

The first thing we see when we open our document in Overview is a word cloud and a series of tabs:

![](https://s19.postimg.org/6c5cuweoz/word_cloud_overview.jpg)

In the `Multisearch` tab we can search for a term and get the number of documents containing it (not particularly useful here, since we only have one document). If we activate the search within documents on the right, we also get a preview of the matches highlighted in colour:

![](https://s19.postimg.org/5b545rxpf/overview_multisearch.jpg)

### Geo-name extraction

We finally get to our original motivation as journalists in the search for a story: the geographical names mentioned in the document.

In its `Entities` tab, Overview offers a series of options, one of which is the extraction of country names, based, as you can see in the image, on a list of names provided by [this website](http://www.geonames.org/countries/):

![](https://s19.postimg.org/qm79ns57n/geo_names_countries_overview.jpg)

### Excluding results manually

Once we have the list, the program gives us the option to manually exclude inaccurate results. We can use the document search to the right to make sure that the word in question is not the name of a place. It's unfortunate that search options don't include the possibility to search for exact words:

![](https://s19.postimg.org/hfoz0hzz7/excluding_names_overview.jpg)

We excluded "go to", "pain", "some", "a man", "men", "sat", "bi", "end" and "cat".

A big problem with this feature is that we didn't find a way to export those results. 

The experience with city names was different: the search returned a much wider range of results, which in this case is a drawback, because the filter is not as selective as it should be (we excluded the list of stop words from the results):

![](https://s19.postimg.org/tvlou8tb7/geocities.jpg)

The best results came when we excluded the Google Books words but, again, it was not possible to export or copy the list of results. As the website states in its About section, Overview was created with investigative journalists in mind, which may be why the hand-off to further analysis by data journalists is not very well implemented.

## (3) Processing the document with Python libraries

Having seen these options, we now analyse our document with the tools that programming has to offer (NLTK and other Python libraries in this case).

We begin by importing the libraries we need...

In [317]:
import nltk
import pandas as pd

...and opening/reading our file to start working with it.

In [318]:
with open("FlyDubai leak documents.txt", "r", encoding="utf-8") as f:
    text = f.read()

### (3.1) Preliminary data exploration

It's always useful to get an idea of the length of the document (here, in characters):

In [26]:
len(text)

88746
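
The character count alone says little about how much text there is to read. A minimal sketch of a slightly richer summary, assuming that a plain whitespace split is a good enough approximation of words (the helper name `describe_text` and the sample string are ours, for illustration only):

```python
def describe_text(text):
    """Return rough size indicators for a text: characters, words and lines."""
    return {
        "characters": len(text),
        # a whitespace split is only an approximation of a real tokeniser
        "words": len(text.split()),
        "lines": len(text.splitlines()),
    }

# Made-up sample, not taken from the leaked documents
sample = "DXB-KBL\nCrew reported fatigue after a long duty.\n"
stats = describe_text(sample)
print(stats)
```

Calling `describe_text(text)` on the full document would give the same three indicators for the leak.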

### (3.2) Finding geo-names based on the intersections of sets of words

We are going to start by applying the same method used by DocumentCloud and Overview for entity extraction: comparing a document with a given set of words that identify something we want to discover in the text.

In this case, we are going to compare our text with a known list of geographical names.

Let's start by cleaning/normalising the text:

In [316]:
# import nltk, pprint
from nltk import word_tokenize

# we separate the text into its tokens
tokens = word_tokenize(text)

# transforming the text into lower case and storing it in a different variable
lowercase_text = [w.lower() for w in tokens]

This is one of those cases in which keeping a copy of the text with the original capitalisation can be helpful, given that geographical names could be easier to find that way.
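
To illustrate why the original capitalisation matters, here is a small sketch with made-up tokens and a two-name list of countries (both are assumptions for the example, not taken from the leak). Once everything is lowercased, the common noun "turkey" can no longer be told apart from the country:

```python
# Hypothetical tokens and gazetteer, for illustration only
tokens = ["Crew", "ate", "turkey", "before", "the", "flight", "to", "Turkey"]
countries = {"Turkey", "Oman"}

# Case-sensitive match: only the capitalised place name is found
case_sensitive = [w for w in tokens if w in countries]

# Case-insensitive match: the common noun "turkey" becomes a false positive
countries_lower = {c.lower() for c in countries}
case_insensitive = [w for w in tokens if w.lower() in countries_lower]

print(case_sensitive)    # ['Turkey']
print(case_insensitive)  # ['turkey', 'Turkey']
```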

We are going to use the same list used by Overview, available [here](http://www.geonames.org/countries/).

We transferred the list to a CSV file, which we can now import and assign to a data frame named `countries_df`.

In [319]:
countries_df = pd.read_csv("geonames.csv", encoding = "ISO-8859-1")
countries_df.head(2)

Unnamed: 0,ISO-3166,ISO-3166.1,fips,Country,Capital,Continent
0,AD,AND,AN,Andorra,Andorra la Vella,EU
1,AE,ARE,AE,United Arab Emirates,Abu Dhabi,AS


We can now turn some of its columns into lists that we can use as filters:

In [320]:
country_name = countries_df["Country"].tolist()
city_name = countries_df["Capital"].tolist()

We can write a very simple loop to extract the matches from our document and store the results in a list that we have called `countries_flydubai`.

In [321]:
countries_flydubai = []
for w in tokens:
    if w in country_name:
        if w not in countries_flydubai:
            countries_flydubai.append(w)    

In [324]:
# we sort the list to visualise the results in alphabetical order
sorted(countries_flydubai)

['Afghanistan',
 'Armenia',
 'Azerbaijan',
 'Bahrain',
 'Bangladesh',
 'Egypt',
 'Georgia',
 'India',
 'Iran',
 'Iraq',
 'Kazakhstan',
 'Kuwait',
 'Lebanon',
 'Nepal',
 'Oman',
 'Pakistan',
 'Qatar',
 'Russia',
 'Slovakia',
 'Somalia',
 'Sudan',
 'Tajikistan',
 'Ukraine']
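
The same result can also be written as a literal intersection of sets, which is what the title of this section refers to. A minimal sketch, using short hypothetical stand-ins for `tokens` and `country_name` so that it runs on its own:

```python
# Hypothetical stand-ins for `tokens` and `country_name` from the cells above
tokens = ["Flight", "from", "Oman", "to", "Nepal", "via", "Oman"]
country_name = ["Nepal", "Oman", "Qatar"]

# The intersection keeps each name once, so no duplicate check is needed
countries_found = sorted(set(tokens) & set(country_name))
print(countries_found)  # ['Nepal', 'Oman']
```

Besides being shorter, set membership tests are faster than scanning a list, which starts to matter with long gazetteers.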

Now we do the same for the cities:

In [60]:
cities_flydubai = []
for w in tokens:
    if w in city_name:
        if w not in cities_flydubai:
            cities_flydubai.append(w)

In [325]:
sorted(cities_flydubai)

['Astana',
 'Baghdad',
 'Baku',
 'Beirut',
 'Bratislava',
 'Brussels',
 'Colombo',
 'Dhaka',
 'Doha',
 'Dushanbe',
 'Juba',
 'Kabul',
 'Kathmandu',
 'Khartoum',
 'Kiev',
 'Moscow',
 'Muscat',
 'Riyadh',
 'Sarajevo',
 'Tbilisi',
 'Tehran']

It is easy to notice the deficiencies of our list/filter: Dubai is not on the list because it is not the capital of the UAE. We need to work on the creation of better quality filters if we want to use this method.
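
One possible way to patch these gaps is sketched below. It assumes we keep an extra hand-made list of places (`extra_places` is hypothetical) and that we search the raw text rather than single tokens, so that multi-word names such as "United Arab Emirates", which token-by-token matching cannot find, are caught too. The function name `find_places` and the sample sentence are ours:

```python
import re

def find_places(text, places):
    """Return the places that occur in text as whole words or phrases, sorted."""
    found = set()
    for place in places:
        # re.escape protects names containing regex metacharacters;
        # \b keeps "Oman" from matching inside a longer word such as "Omani"
        if re.search(r"\b" + re.escape(place) + r"\b", text):
            found.add(place)
    return sorted(found)

# Hypothetical sample text and lists, for illustration only
sample = "Departed Dubai for Kabul. An Omani passenger fell ill over the United Arab Emirates."
capitals = ["Kabul", "Muscat"]
extra_places = ["Dubai", "United Arab Emirates", "Oman"]

places_found = find_places(sample, capitals + extra_places)
print(places_found)  # ['Dubai', 'Kabul', 'United Arab Emirates']
```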

We can now get an idea of the number of elements in each search:

In [329]:
# number of cities mentioned in the document
len(cities_flydubai)

21

In [331]:
# number of countries mentioned in the document
len(countries_flydubai)

23

## Adapting the search to the characteristics of the document

The advantage of being able to write code for text analysis is that we can adapt calculations at will, something that can be especially useful with documents that have very specific characteristics. 

When working with text data, it is good practice to do an exploration of the text (skim through it) to get familiar with its structure, and try to detect any special features that can facilitate its analysis.

When we did that, we found the following pattern:

![](https://s19.postimg.org/qidg7izqb/doc_flydubai_copia.jpg)

Most of the complaints were identified by the codes of the airports connected by each flight.

Let's work with that.

### Using regular expressions for entity extraction

When we realised that the connection codes followed the patterns XXX-XXX and XXXX-XXXX (that is, groups of three or four capital letters), we decided to use regular expressions to extract them from the text.

We begin by importing the `re` library:

In [91]:
import re
# We use this website https://regex101.com to write our expressions

# Restricting the match to capital letters [A-Z] leaves out hyphenated terms such as "turn-around"
# and year ranges like "2015-2016"

p = re.compile(r'([A-Z]{3,4}-[A-Z]{3,4})')
# the slice [1:10] skips the first match and shows the next nine
p.findall(text)[1:10]

['LYP-DXB',
 'URKK-OMBD',
 'DXB-JED',
 'DXB-KBL',
 'DXB-KBL',
 'DXB-KHI',
 'BTS-DXB',
 'DXB-KWI',
 'AAA-AAA']

The above is a sample of the results: matches 2 to 10, to be exact, since the slice `[1:10]` skips the first one. Now we can calculate totals:

In [88]:
# total number of matches
flights = p.findall(text)
len(flights)

70

In [332]:
# unique number of matches (using `set`)
len(set(flights))

53
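
One caveat about the pattern used above: without word boundaries it can also match inside longer runs of capital letters. A hedged variant with `\b` anchors is sketched below and compared with the unanchored pattern on a made-up string (the sample and the variable names are assumptions for the example, not taken from the leak):

```python
import re

# Hypothetical sample, for illustration only
sample = "DXB-KBL turn-around 2015-2016 OMDB-URKK ABCDEF-GHIJKL"

# Unanchored pattern, equivalent to `p` above
loose_matches = re.findall(r"[A-Z]{3,4}-[A-Z]{3,4}", sample)

# \b on both sides accepts only whole three- or four-letter codes
strict_matches = re.findall(r"\b[A-Z]{3,4}-[A-Z]{3,4}\b", sample)

print(loose_matches)   # ['DXB-KBL', 'OMDB-URKK', 'CDEF-GHIJ']
print(strict_matches)  # ['DXB-KBL', 'OMDB-URKK']
```

Whether the stricter pattern changes the totals for this document would need to be checked against the full text.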

We can get more specific results by counting the repetitions:

In [150]:
# http://stackoverflow.com/questions/2600191/how-can-i-count-the-occurrences-of-a-list-item-in-python

from collections import Counter
countofFlights = Counter(flights)

# http://stackoverflow.com/questions/20950650/how-to-sort-counter-by-value-python

countofFlights.most_common()

[('DXB-KHI', 5),
 ('DXB-KBL', 3),
 ('DXB-TIF', 3),
 ('MCT-DXB', 2),
 ('BTS-DXB', 2),
 ('DXB-JED', 2),
 ('KDH-DXB', 2),
 ('DXB-BGW', 2),
 ('DXB-KWI', 2),
 ('IEV-DXB', 2),
 ('DXB-TSE', 2),
 ('URKK-OMBD', 2),
 ('DXB-SJJ', 1),
 ('DXB-JUB', 1),
 ('OIKB-OMDB', 1),
 ('OMDB-URKK', 1),
 ('DXB-HBE', 1),
 ('AJAK-GIDO', 1),
 ('DXB-DYU', 1),
 ('MUX-DXB', 1),
 ('DXB-ELQ', 1),
 ('DXB-MED', 1),
 ('DXB-KRT', 1),
 ('KRT-DXB', 1),
 ('ALA-DXB', 1),
 ('ADER-OGOG', 1),
 ('HGA-DXB', 1),
 ('VKO-DXB', 1),
 ('DXB-AHB', 1),
 ('LYP-DXB', 1),
 ('MHD-DXB', 1),
 ('DWC-DXB', 1),
 ('DYU-DXB', 1),
 ('DXB-OAI', 1),
 ('DXB-GYD', 1),
 ('KWI-DXB', 1),
 ('DXB-TBS', 1),
 ('DXB-BEY', 1),
 ('OMDB-OAKB', 1),
 ('DAC-DXB', 1),
 ('KHI-DXB', 1),
 ('DXB-SKT', 1),
 ('ODS-DXB', 1),
 ('DXB-COK', 1),
 ('CMB-DXB', 1),
 ('DXB-TRV', 1),
 ('KWI-DWC', 1),
 ('DXB-IEV', 1),
 ('DXB-DAC', 1),
 ('AAA-AAA', 1),
 ('LKO-KTM', 1),
 ('VCBI-OMDB', 1),
 ('RUH-DXB', 1)]

The list above is the full list of results, which is already a good start: we could use it to create a visualisation of the airports connected and of the itineraries that attracted the most complaints.
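
As a first step towards such a visualisation, the route strings can be split into origin and destination and tallied per airport and per connection. A minimal sketch, using a short hypothetical sample in place of the full `flights` list:

```python
from collections import Counter

# Hypothetical sample standing in for the `flights` list above
flights = ["DXB-KHI", "DXB-KHI", "KHI-DXB", "DXB-KBL", "MCT-DXB"]

# Split each route into an (origin, destination) pair
edges = [tuple(route.split("-")) for route in flights]

# Count how often each airport appears, in either role
airport_counts = Counter(code for edge in edges for code in edge)

# Treat A-B and B-A as the same connection by sorting each pair
connection_counts = Counter(tuple(sorted(edge)) for edge in edges)

print(airport_counts.most_common(2))     # [('DXB', 5), ('KHI', 3)]
print(connection_counts.most_common(1))  # [(('DXB', 'KHI'), 3)]
```

The `edges` list is the kind of input that most network or map plotting tools expect, although which tool to use is left open here.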

We can transform the data even more. Let's transform the codes into readable names of cities:

In [333]:
# we need to convert our list of flights to text (string) to be able to replace the text for new text
flights_as_string = ",".join(str(x) for x in flights)

In [334]:
flights_as_string = flights_as_string.replace("DXB", "Dubai")
flights_as_string = flights_as_string.replace("IEV", "Kiev")
flights_as_string = flights_as_string.replace("KHI", "Karachi")
flights_as_string = flights_as_string.replace("KBL", "Kabul")
flights_as_string = flights_as_string.replace("TIF", "Taif")
flights_as_string = flights_as_string.replace("MCT", "Muscat")
flights_as_string = flights_as_string.replace("BTS", "Bratislava")
flights_as_string = flights_as_string.replace("JED", "Jeddah")
flights_as_string = flights_as_string.replace("KWI", "Kuwait")
flights_as_string = flights_as_string.replace("KDH", "Kandahar")
flights_as_string

'Dubai-Kiev,LYP-Dubai,URKK-OMBD,Dubai-Jeddah,Dubai-Kabul,Dubai-Kabul,Dubai-Karachi,Bratislava-Dubai,Dubai-Kuwait,AAA-AAA,OMDB-URKK,Muscat-Dubai,RUH-Dubai,Dubai-Taif,Dubai-HBE,Bratislava-Dubai,Dubai-TSE,Dubai-Kabul,Dubai-MED,MUX-Dubai,Dubai-BGW,Dubai-TRV,Dubai-GYD,Dubai-SJJ,Dubai-Karachi,Dubai-AHB,Dubai-DAC,KRT-Dubai,Kuwait-DWC,Dubai-COK,Dubai-TSE,CMB-Dubai,Dubai-TBS,Kiev-Dubai,Kiev-Dubai,Dubai-KRT,Dubai-Karachi,Kuwait-Dubai,Dubai-Karachi,Dubai-Karachi,Dubai-Taif,ALA-Dubai,LKO-KTM,Kandahar-Dubai,Karachi-Dubai,Dubai-BEY,DAC-Dubai,DYU-Dubai,Kandahar-Dubai,Dubai-DYU,HGA-Dubai,VCBI-OMDB,Dubai-Jeddah,URKK-OMBD,OIKB-OMDB,Dubai-JUB,Dubai-ELQ,Muscat-Dubai,DWC-Dubai,Dubai-BGW,Dubai-Kuwait,VKO-Dubai,OMDB-OAKB,ADER-OGOG,AJAK-GIDO,MHD-Dubai,Dubai-SKT,ODS-Dubai,Dubai-Taif,Dubai-OAI'

We can now repeat the count from above to see the results with the city names in place:

In [335]:
p2 = re.compile(r'(\w*-\w*)')
flights_with_names = p2.findall(flights_as_string)
Counter(flights_with_names).most_common(15)

[('Dubai-Karachi', 5),
 ('Dubai-Kabul', 3),
 ('Dubai-Taif', 3),
 ('Bratislava-Dubai', 2),
 ('Kiev-Dubai', 2),
 ('Dubai-Jeddah', 2),
 ('Kandahar-Dubai', 2),
 ('Muscat-Dubai', 2),
 ('URKK-OMBD', 2),
 ('Dubai-TSE', 2),
 ('Dubai-Kuwait', 2),
 ('Dubai-BGW', 2),
 ('Dubai-SJJ', 1),
 ('Dubai-TRV', 1),
 ('OIKB-OMDB', 1)]

### Failed attempt

We know that the search-and-replace process used above could be automated, but we didn't get the code to run. We have preserved our failed attempts below because we may be able to fix the code in the future; we think our algorithm for solving the problem is right, and we just need to figure out how to translate it into code.

This is what we want to do:

- We created a filter based on [this list]() of international airport codes
- We created tuples based on the combination of (Airport Code, City)
- We tried to use the replace method above, providing the "Airport Code" as the "old" argument, and the "City" as the "new" argument
- We tried to use that idea in a for loop, but it failed with an error that we read as saying tuples are not iterable. Tuples are iterable in Python, so the problem probably lies in how we indexed them

In [336]:
# airports_df = pd.read_csv("international airport codes.csv", encoding = "ISO-8859-1")
# airports_df.head(2)

# airports_df['pairs'] = airports_df[["Airport Code", "City"]].apply(tuple, axis=1)
# airports_df['pairs'][1][1]

# code = airports_df['pairs'][10][i][0] <---- we tried to create an iteration and used it as the index for the tuple
                                            # but that didn't work
# city = airports_df['pairs'][10][i][1]
# flights_as_string.replace(code, city)

# --------------------------------------------------------

# origin_dubai = re.compile("(DXB-.*)")   <---- We also tried this option in the for loop below, but we didn't get it to work
# destination = re.compile("DXB-(.*)")

# for w in flights_as_string:
    # if w==origin_dubai:
        # readable_list.append("Dubai-" + destination)     
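
One possible way to automate the replacement is sketched below, with a small hand-written dictionary instead of the airport CSV (the mapping only contains the codes already translated manually above; loading it from the file is left out because the sketch should run on its own). Splitting each route on the hyphen and looking up each code avoids the chained `replace` calls, and unknown codes are left untouched. The function name `readable_route` is ours:

```python
# Codes and city names taken from the manual replacements above
code_to_city = {
    "DXB": "Dubai", "IEV": "Kiev", "KHI": "Karachi", "KBL": "Kabul",
    "TIF": "Taif", "MCT": "Muscat", "BTS": "Bratislava",
    "JED": "Jeddah", "KWI": "Kuwait", "KDH": "Kandahar",
}

def readable_route(route, mapping):
    """Translate 'DXB-KHI' into 'Dubai-Karachi'; unknown codes stay as they are."""
    return "-".join(mapping.get(code, code) for code in route.split("-"))

# Short sample standing in for the `flights` list
flights = ["DXB-IEV", "LYP-DXB", "URKK-OMBD"]
readable_list = [readable_route(route, code_to_city) for route in flights]
print(readable_list)  # ['Dubai-Kiev', 'LYP-Dubai', 'URKK-OMBD']
```

With the airport table loaded, a mapping of the same shape could presumably be built with `dict(zip(airports_df["Airport Code"], airports_df["City"]))`, assuming those two column names.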

## Findings

The information we have so far is enough to create a visualisation of the flights and the most affected connections. But it could also be the starting point for an investigation into the worst connections in terms of complaints. Here are a few possible lines of enquiry:

- explore correlations between flight duration and complaints
- explore correlations between the most problematic connections and known air traffic accidents
- cross-reference incidents with weather conditions

![](https://s19.postimg.org/6zdgbwqcj/flydubai_map.jpg)



### References

- Guardian (2016), *Airline pilots complain of dangerous fatigue in leaked documents* https://www.theguardian.com/business/2016/jul/29/airline-pilots-complain-dangerous-fatigue-leaked-documents-flydubai

- Guardian (2016), *Flydubai flight records – the leaked documents* https://www.theguardian.com/business/2016/jul/29/flydubai-flight-records-the-leaked-documents 

- NLTK 3.0 Documentation (2015), *Corpus readers*, available at http://www.nltk.org/howto/corpus.html#corpus-reader-classes

- PythonHow (2016) *Accessing pandas dataframe columns, rows, and cells* http://pythonhow.com/accessing-dataframe-columns-rows-and-cells

- Stack Overflow (2010) *How can I count the occurrences of a list item in Python?* http://stackoverflow.com/questions/2600191/how-can-i-count-the-occurrences-of-a-list-item-in-python

- Stack Overflow (2011) *Python count elements in list [duplicate]* http://stackoverflow.com/questions/4130027/python-count-elements-in-list

- Stack Overflow (2011) *Reference an Element in a List of Tuples* http://stackoverflow.com/questions/6454894/reference-an-element-in-a-list-of-tuples

- Stack Overflow (2013) *How to form tuple column from two columns in Pandas* http://stackoverflow.com/questions/16031056/how-to-form-tuple-column-from-two-columns-in-pandas

- Stack Overflow (2014) *Pandas Dataframe: split column into multiple columns, right-align inconsistent cell entries* http://stackoverflow.com/questions/23317342/pandas-dataframe-split-column-into-multiple-columns-right-align-inconsistent-c

- Stack Overflow (2014) *Creating a pandas DataFrame from columns of other DataFrames with similar indexes* http://stackoverflow.com/questions/21231834/creating-a-pandas-dataframe-from-columns-of-other-dataframes-with-similar-indexe

- TutorialsPoint (2016) *Python List count() Method* http://www.tutorialspoint.com/python/list_count.htm