Airport fathead plugin #47

Merged
merged 33 commits into from Jan 22, 2014

Conversation

Projects
None yet
5 participants
Contributor

mellon85 commented Mar 30, 2013

The output.txt file content is quite big as for each airport writes 3 lines (one for IATA,ICAO, Airport Name, Airport Location); so that, I guess, the mapping should work in that way to. Correct me I am wrong obviously.

Please suggest how data should be represented too!

Contributor

rpicard commented Mar 30, 2013

It looks like someone beat you to it. #46 :)

Why don't you join in on that discussion so we can work in any ideas you have!

Contributor

rpicard commented Apr 1, 2013

Since you're already parsing the Wikipedia page, we can go ahead and use your implementation. Some changes I think we should make are:

  • Limit the entries to the code (e.g. 'JAX' ) and the airport name + 'code' (e.g. 'Jacksonville International Airport code'). The latter being a redirect to the former. We could also include one with the word 'airport' removed (e.g. 'Jacksonville International code'). Having the result for the city name would cover too many searches that aren't looking for the airport code.
  • I commented on the other pull request regarding my thoughts on how to format the abstract: #46 (comment)
  • "More at" links should go to the source of the information. In this case it would be: https://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_[$letter].

Let me know what you think!

@rpicard rpicard referenced this pull request Apr 1, 2013

Closed

Airport Codes #46

Contributor

mellon85 commented Apr 2, 2013

  1. What about having the codes and the names like (using as an example Vienna's Airport)

VIE
LOWW
Schwechat Airport 'code' / Schwechat 'airport code'
Vienna 'airport'

The last 2 can be obtained by the location and the airport name in the
wikipedia table fields.

  1. I like the abstract, but we should find a way to put the ICAO code too.
  2. More at shouldn't link at the wikipedia airport page and only if not
    present in the links in the table page itself?
Contributor

rpicard commented Apr 3, 2013

  1. That sounds good. In those cases, you can probably go ahead and leave out the "code" part. I'll add it as a trigger on our end. This means that the output should have these:
  • VIE
  • LOWW
  • Schwechat Airport
  • Vienna Airport
  1. Would the ICAO codes not fit within that abstract template?
  2. I don't quite understand what you're saying here, but right now the "More at" is linking to the Wikipedia page for the city. We want it to link to the source of the information (i.e. the page I listed above).
Contributor

mellon85 commented Apr 4, 2013

I think I did all the changes. The abstract changes based on the type of query

@mellon85 mellon85 closed this Apr 4, 2013

@rpicard rpicard reopened this Apr 4, 2013

Contributor

mellon85 commented Apr 4, 2013

ops, I clicked the wrong button :D

Contributor

rpicard commented Apr 4, 2013

Haha, I thought it was weird that you closed it. I'll take a look at your changes later today. For now, I have class. :)

Contributor

rpicard commented Apr 5, 2013

It looks like there is something silly going on with the abstract. See: "The "KAAF" airport code corresponds to the Apalachicola, Florida in Apalachicola, Florida, United States"

Here's another one: "The "Ilulissat" airport corresponds to the IATA JAV and ICAO BGJN"

Contributor

mellon85 commented Apr 5, 2013

The first one is an issue,

The second part is correct data

Contributor

rpicard commented Apr 5, 2013

I'm a little confused on what the second one is supposed to be saying then. What is "Ilulissat"?

Contributor

mellon85 commented Apr 5, 2013

Illuisat Airport is the name of the airport

Contributor

rpicard commented Apr 5, 2013

Oh, okay. I see what you did there. Maybe it should be something along the lines of: "The IATA code for the Ilulissat airport is JAV and the ICAO code is BGJN." to make things a little clearer.

Fixes to abstract text to avoid confusion
added IATA<->ICAO mapping in all abstracts when possible
Contributor

mellon85 commented Apr 5, 2013

Please check also the Berlin airport (BER) abstracts too as it's a meta airport with no ICAO code and no real name

Contributor

rpicard commented Apr 5, 2013

Since we have so many different forms of entry here, we need to work on making things a little clearer. Here's an example of one where I don't understand what is going on:

Bury St. Edmunds Airport    A                                       The RAF Honington correspond to the IATA "BEQ" and the ICAO code is "EGXH" near Bury St. Edmunds, England, United Kingdom   https://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_B

What is the "RAF Honington"?

In this one the wording is just awkward in my opinion. I think this one should follow the same format as Val de Cães International Airport (see my last example below):

Bandon Airport      A                                       The Bandon State Airport correspond to the IATA "BDY" near Bandon, Oregon, United States    https://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_B

This one looks better, except I think we should specify that the first code is IATA and add a period to the end (inside the quotation marks):

BES A                                       The "BES" airport code corresponds to Brest Bretagne Airport in Brest, France and the ICAO code is "LFRB"   https://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_B

I like the format here, but there is some redundancy in "airports in Berlin in Berlin, Germany." Also, we should probably specify which type of code it is.

BER A                                       The "BER" airport code corresponds to airports in Berlin in Berlin, Germany https://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_B

With a period at the end, this one will be perfect IMO.

Val de Cães International Airport  A                                       The IATA code for the Val de Cães International Airport is "BEL" and the ICAO code is "SBBE"   https://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_B

I really do like the way this is shaping up!

mellon85 added some commits Apr 6, 2013

Fix for "RAF Honington"
If the airport name doesn't contain the word Airport it is added at the
name to make the abstracts meaningful
Fix Bandon Airport
Changed the abstract format for location
Fix BES Airport abstract
Added the type of code we have found in the abstracts
Fix
Adding a period at the end of the abstracts
Fix BER
Changed 'airports in Berlin near Berlin' to 'airports in Berlin'
Contributor

mellon85 commented Apr 6, 2013

I managed to fix all the points I think.

Only the Berlin airport seemed to have the bad name/location issue in the abstract.

Contributor

rpicard commented Apr 7, 2013

There are a couple of things I'd like to tweak before we move forward with this:

  • 'The ICAO "YARY" airport code corresponds...' is worded in an awkward way. Maybe we should switch it to 'The "YARY" ICAO airport code corresponds...'
  • When the sentence ends with a quotation, ", the period should be inside it, e.g. "The dog ran under the tomato." This will be kind of a pain to program since you're not sure whether it will end in a quotation or not during programming, so you'll have to add some checks to it before it's printed out.

Let me know what you think!

Contributor

mellon85 commented Apr 7, 2013

I agree with the period, and I can fix it very easily.

Regarding the wording of the ICAO code.. yes, it is probably awkward :) but
then should I change the wording in general? or just for the IATA/ICAO code
lookups?

Contributor

rpicard commented Apr 7, 2013

I think the awkwardness just comes from the fact that "IATA/ICAO" comes before the code in the sentence. The other types of answers (i.e., those with different sentence structures) should be fine still.

Fixes: Punctuation and Phrasing
- Added function append_period to append a final period in a sentence.
- Corrected phrasing in IATA/ICAO code lookup
Contributor

mellon85 commented Apr 9, 2013

I have done the rephrasing

Contributor

rpicard commented Apr 29, 2013

@mellon85 Just wanted to give you an update. I just went through and tested your changes. They look pretty good. I've started this plugin on the QA process, so I might have some more feedback soon. Otherwise, we'll be deploying it.

Contributor

rpicard commented May 19, 2013

@mellon85 We have one final change we'd like to make. In cases where the airport is [CITY] international airport we'd like to add iata [CITY] as an element. This way you can just search "iata jacksonville" to get the code for Jacksonville International Airport. The same goes for iaca as well of course. This new record should probably be a redirect to the [CITY] international airport one.

Contributor

mellon85 commented May 20, 2013

@rpicard I have added the redirections.
I hope I understood the format for redirections correctly.

Contributor

rpicard commented May 20, 2013

I just noticed another issue. There are several different results for "Jacksonville Airport." I think we could probably get rid of this case altogether now that we have the city name and "airport" as a trigger (i.e. "jacksonville airport" would trigger the redirects we just added). The other option would be to create a disambiguation, which would be nice. That way if they search "Jacksonville airport" they would get a list of them (like: https://duckduckgo.com/?q=perl6+accepts). Then we could change the redirects we just added to point there instead of to the international airport result.

What do you think?

Contributor

mellon85 commented May 20, 2013

Having the disambiguation is good idea in case of multiple airport with the same city name like for Jacksonville having the Municipal Airport and the International Airport; this clearly requires a disambiguation.
I'll work on it.

Contributor

rpicard commented May 20, 2013

@mellon85 Awesome, thanks. Let me know if you have any questions.

Contributor

mellon85 commented May 22, 2013

@rpicard I have indeed a question, I can't find any documentation on the format of the disambiguation field. How should I exactly fill it?

Contributor

rpicard commented May 22, 2013

@mellon85 That's a good point. I'll have to add something to the README. In the meantime though, you can look at the Perl 6 Fathead does it: https://github.com/duckduckgo/zeroclickinfo-fathead/blob/master/share/perl6_doc/output.txt#L3

Here's the ZCI for that line: https://duckduckgo.com/?q=perl6+accepts

It's the same formatting that is used for disambiguation pages on Wikipedia.

Contributor

mellon85 commented May 23, 2013

@rpicard Ok, this should properly generate the disambiguation pages

Contributor

rpicard commented May 24, 2013

Those don't look quite right.

Marietta Airport    D                               *[[JGL]] , [[]] , Marietta, Georgia, United States\n*[[MGE]] , [[KMGE]] , Marietta, Georgia, United States\n            

This should be:

Marietta Airport    D                               *[[JGL]], Marietta, Georgia, United States\n*[[MGE]], Marietta, Georgia, United States\n            

Actually, I think it should probably say Galleria Heliport Airport in Marietta, Georgia instead, since all of them are from Marietta (that's what we searched for).

Contributor

rpicard commented May 24, 2013

I think you should use the ICAO code as the link.

Contributor

mellon85 commented May 24, 2013

I am using both ICAO and IATA because ICAO is not always present, and not
always unique :)

I have to add the airport name, not just the location as it can still
be ambiguous and fix the check regarding missing ICAOs, it was not meant to
be present with a [[]] at all...

Dario Meloni

2013/5/24 Robert Picard notifications@github.com

I think you should use the ICAO code as the link.


Reply to this email directly or view it on GitHubhttps://github.com/duckduckgo/zeroclickinfo-fathead/pull/47#issuecomment-18381079
.

Contributor

rpicard commented May 24, 2013

Are you trying to have a link in the disambiguation for the ICAO and one for IATA?

Contributor

mellon85 commented May 25, 2013

Yes,so that they could choose on which of the 2 article go to. If it's not
a good idea I will just use the IATA code

Dario Meloni

2013/5/24 Robert Picard notifications@github.com

Are you trying to have a link in the disambiguation for the ICAO and one
for IATA?


Reply to this email directly or view it on GitHubhttps://github.com/duckduckgo/zeroclickinfo-fathead/pull/47#issuecomment-18420548
.

Contributor

rpicard commented May 25, 2013

If we wanted to do that, we would need to have a separate one for each. Having two bracketed sections wouldn't generate the two lines.

That said, I think having the IATA code only would work best because the abstract also gives the ICAO code.

Contributor

mellon85 commented May 25, 2013

It should be ok now. Using only IATA, and with name and location of the airport for disambiguation

Contributor

rpicard commented May 26, 2013

This looks awesome! https://robert.duckduckgo.com/?q=jacksonville+airport+code
I'm going to see what the other guys think.

mellon85 added some commits May 26, 2013

fix: the strings from the HTML source are trimmed
QDF airport contains a newline in the location and would ruin the format
Contributor

mellon85 commented May 26, 2013

I have checked for duplicates in the file and I found a couple of bug:

ICAO could be duplicated (such as ETMN) and html string are stripped to avoid insertin unwanted '\n'

Contributor

rpicard commented May 27, 2013

Could you elaborate on what you're fixing? Both of those examples (QDF and ETMN) look alright to me.

Contributor

mellon85 commented May 27, 2013

QDF text contained a newline in the HTML that was then brought to my output.txt. I just added strip() calls to clear any possible issue regarding the formatting issues. I found it while checking for duplicates entries checking the difference between 'cut -f1 | sort' and 'cut -f1 | sort -u'.

Regarding ETMN (and others), there were 6 lines with duplicate titles. For instance ETMN is a duplicated ICAO code for 2 airports with 2 different IATA codes, respectively FCN and NDZ, so I added a disambiguation point to the IATA codes.

Contributor

rpicard commented May 29, 2013

Sounds good. I'll update the data on our server.

Contributor

rpicard commented May 29, 2013

It looks like records such as Tampa International Airport are no longer in output.txt.

Contributor

mellon85 commented May 29, 2013

you are right, I'll take a look

Dario Meloni

2013/5/29 Robert Picard notifications@github.com

It looks like records such as Tampa International Airport are no longer
in output.txt.


Reply to this email directly or view it on GitHubhttps://github.com/duckduckgo/zeroclickinfo-fathead/pull/47#issuecomment-18623273
.

Contributor

rpicard commented May 29, 2013

Awesome, thanks.

Contributor

mellon85 commented May 31, 2013

It looks like beautiful soup doesn't return all the airports in the tables... it stops while parsing the file and doesn't returns all the lines as it should , but just the first 130 airports for the letter T for instance, probably it is happening even for other letters.

Contributor

rpicard commented Jun 1, 2013

I might not have access to the internet for a couple of days. I just wanted to let you know. I'll get back to you as soon as I get back on.

Contributor

rpicard commented Jun 2, 2013

It turns out I'll have access after-all. Can you fix the beautiful soup problem?

Contributor

mellon85 commented Jun 2, 2013

I was busy this weekend, I'll report to them the issue to see if in fact
there something wrong on this side or on their side. Otherwise I have to
rewrite it in a different way.

Dario Meloni

2013/6/2 Robert Picard notifications@github.com

It turns out I'll have access after-all. Can you fix the beautiful soup
problem?


Reply to this email directly or view it on GitHubhttps://github.com/duckduckgo/zeroclickinfo-fathead/pull/47#issuecomment-18812584
.

Contributor

rpicard commented Jun 3, 2013

Let me know when you are ready to move forward.

Contributor

mellon85 commented Jun 4, 2013

It is not a bug in beatifulsoup, or at least, not anything they will fix. The correct way to do this is using an external parser such as html5lib. Now it has 3 times more airports...

Contributor

rpicard commented Jun 4, 2013

I'm running into this error:

(airport-fathead)~/Code/zeroclickinfo-fathead/share/airports (mellon85-master) $ ./parse.sh 
DEBUG:root:Index: A
Traceback (most recent call last):
  File "parse.py", line 209, in <module>
    parser.get_airports()
  File "parse.py", line 146, in get_airports
    table = self.soup.find_all('table')[1]
IndexError: list index out of range
Contributor

mellon85 commented Jun 4, 2013

There is now a new dependency, are you sure you installed it?

Dario Meloni

2013/6/4 Robert Picard notifications@github.com

I'm running into this error:

(airport-fathead)~/Code/zeroclickinfo-fathead/share/airports (mellon85-master) $ ./parse.sh DEBUG:root:Index: ATraceback (most recent call last):
File "parse.py", line 209, in
parser.get_airports()
File "parse.py", line 146, in get_airports
table = self.soup.find_all('table')[1]IndexError: list index out of range


Reply to this email directly or view it on GitHubhttps://github.com/duckduckgo/zeroclickinfo-fathead/pull/47#issuecomment-18924360
.

Contributor

rpicard commented Jun 4, 2013

Here's what I've installed:

(airport-fathead)~/Code/zeroclickinfo-fathead/share/airports (mellon85-master) $ pip freeze
beautifulsoup4==4.2.0
html5lib==1.0b1
six==1.3.0
wsgiref==0.1.2
Contributor

mellon85 commented Jun 4, 2013

I am using html5lib version 0.95 (the one packaged with ubuntu)

Dario Meloni

2013/6/4 Robert Picard notifications@github.com

Here's what I've installed:

(airport-fathead)~/Code/zeroclickinfo-fathead/share/airports (mellon85-master) $ pip freezebeautifulsoup4==4.2.0html5lib==1.0b1six==1.3.0wsgiref==0.1.2


Reply to this email directly or view it on GitHubhttps://github.com/duckduckgo/zeroclickinfo-fathead/pull/47#issuecomment-18925464
.

Contributor

rpicard commented Jun 4, 2013

I'm getting the same problem when running it with 0.95 installed.

Contributor

mellon85 commented Jun 5, 2013

I tried deleting all the index data and download them again with fetch.sh but I am not having the same error;... I can't reproduce it, the only difference can be in the input files, can you send the index file A?

Contributor

rpicard commented Jun 6, 2013

Contributor

mellon85 commented Jun 7, 2013

ok, I found out the problem. In fact I was using bs4 4.1.2 and not 4.2.0. I have the same issue using the latest version.

update: I have done some tests from the command line. Their parser doesn't parse all the airports as the in version 4.1 but there must some issues with their usage of html5lib as now almost any type of find call fails miserably...

Contributor

mellon85 commented Jun 9, 2013

I have done some tests and indeed a problem is present with beautiful soup 4.2.0
I tried using various parsing libraries (lxml, html5lib) it's not working at all. While with version 4.1.2 it's working.
The only "working" is the internal parser, that failes to parse all the airports.

Contributor

rpicard commented Jun 9, 2013

Lets go ahead and use the one with the best results. Let me know what I need to do for that.

Contributor

mellon85 commented Jun 10, 2013

Then is beautifulsoup 4.1.2 with html5lib

Contributor

rpicard commented Jun 11, 2013

Could you add a record for each airport name that redirects to it's code? For example Tampa International Airport would redirect to TPA. Either that or we could have the code redirect to the airport name. I think the latter might be nicer, but it would mean more work. What do you think?

Contributor

mellon85 commented Jun 11, 2013

Now all the airports should be reported correctly

Contributor

rpicard commented Jun 21, 2013

Hey, I have what should be the last alteration. We'd like to have airport code tampa international work, which means adding tampa international to the list. If the airport ends with the words "international airport," we should also have a record that ends with "international." This can redirect to wherever "international airport" points.

Contributor

mellon85 commented Jun 24, 2013

This commit should add the last request

Contributor

rpicard commented Jun 25, 2013

We shouldn't have these:

Tampa   A                                       The IATA code for the Tampa is "TPA" and the ICAO code is "KTPA."   https://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_T

We already have Tampa Airport.

On another note, should we add in Tampa Airports (plural) as well and redirect Tampa Airport there?

Contributor

mellon85 commented Jun 26, 2013

The Tampa thing is an issue introduced by the previous change.
As of now there is Tampa airport that disambiguates, adding a very similar line with the s is an easy task. I don't know if it is actually needed or not

Contributor

rpicard commented Jun 27, 2013

It looks like that introduced a little bug:

KTPA    A                                       The "KTPA" ICAO airport code corresponds to None in Tampa, Florida, United States and the IATA code is "TPA."   https://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_T
Tampa International Airport A                                       The IATA code for the None is "TPA" and the ICAO code is "KTPA."    https://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_T
Tampa International R   Tampa International Airport                                     
TPE A                                       The "TPE" IATA airport code corresponds to None in Taoyuan County (near Taipei), Taiwan, Republic of China and the ICAO code is "RCTP." https://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_T
RCTP    A                                       The "RCTP" ICAO airport code corresponds to None in Taoyuan County (near Taipei), Taiwan, Republic of China and the IATA code is "TPE." https://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_T
Taiwan Taoyuan International Airport    A                                       The IATA code for the None is "TPE" and the ICAO code is "RCTP."    https://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_T

Notice all of the Nones.

Contributor

mellon85 commented Jun 27, 2013

haha i missed it :) ok I'll fix it, damn!

Contributor

rpicard commented Jun 27, 2013

@mellon85 I know, it seems like every time we get one thing another pops up! :)

Contributor

mellon85 commented Jun 28, 2013

Ok, this fix should not introduce any new regression...

Contributor

rpicard commented Jul 4, 2013

I'd still like to add Tampa Airports as a disambiguation (like Tampa Airport is now). #47 (comment)

Added redirection for disambiguations
Tampa Airports ('R') -> Tampa Airport ('D')
Contributor

mellon85 commented Jul 4, 2013

Added and tested for the additional redirection to disambiguation articles

Contributor

rpicard commented Jul 8, 2013

It looks like that is mostly getting overridden by other information, which is actually alright. I'm going to go ahead and move forward with deploying this.

@ghost ghost assigned hunterlang Oct 23, 2013

jagtalon added a commit that referenced this pull request Jan 22, 2014

@jagtalon jagtalon merged commit af6b959 into duckduckgo:master Jan 22, 2014

@mellon85 make sure you change the package name ;)

Owner

moollaza commented Jan 22, 2014

@jagtalon @mellon85 also, the metadata is all wrong. Please make sure to read the docs on this: https://duck.co/duckduckhack/metadata

You can only choose from a pre-determined list for the category and topics.

You can also test that the metadata is correct but running duckpan query inside the Fathead repo.

Contributor

mellon85 commented Jan 23, 2014

I'll fix it by today (CEST time)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment