This repository has been archived by the owner on Oct 15, 2022. It is now read-only.

Add EPA Fuel Economy Fathead #100

Closed
wants to merge 11 commits into from

Conversation

zachthompson
Contributor

What does your Instant Answer do?
Fathead for EPA Fuel Economy data. Downloads source, parses it, and creates the standard output.txt file for fatheads.

What problem does your Instant Answer solve (Why is it better than organic links)?
- Displays city/hwy fuel economy directly.
- Gives ranges for city/hwy for models with multiple vehicle configurations.
- Lists individual model configuration fuel economies.
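As a rough illustration of the range summary described above, here is a minimal Python sketch (the PR itself is written in Perl). The sample rows, the `summarize` and `fmt_range` names, and the exact wording after "MPG:" are assumptions made for the example, not taken from the PR's code.

```python
from collections import defaultdict

# Hypothetical sample rows: (year, make, model, city_mpg, hwy_mpg).
ROWS = [
    (2011, "Honda", "Fit", 27, 33),
    (2011, "Honda", "Fit", 28, 35),
    (2011, "Honda", "Insight", 40, 43),
]

def fmt_range(values):
    """Return 'lo-hi' when the values differ, or a single number otherwise."""
    lo, hi = min(values), max(values)
    return str(lo) if lo == hi else f"{lo}-{hi}"

def summarize(rows):
    """Group configurations by vehicle and build one summary line for each."""
    groups = defaultdict(list)
    for year, make, model, city, hwy in rows:
        groups[f"{year} {make} {model}"].append((city, hwy))
    return {
        title: f"MPG: {fmt_range([c for c, _ in cfgs])} city, "
               f"{fmt_range([h for _, h in cfgs])} hwy"
        for title, cfgs in groups.items()
    }

SUMMARY = summarize(ROWS)
```

A vehicle with one configuration collapses to single numbers, while one with several configurations gets a low-high range for each figure.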

What is the data source for your Instant Answer? (Provide a link if possible)
http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip

Why did you choose this data source?
Suggested in idea request

Are there any other alternative (better) data sources?
Not that I found with comprehensive fuel economy data.

What are some example queries that trigger this Instant Answer?
2011 Honda Fit fuel economy
2011 Honda Fit mpg (this could be optional)

Which communities will this Instant Answer be especially useful for? (gamers, book lovers, etc)
Anyone researching mileage for a vehicle

Is this Instant Answer connected to a DuckDuckHack Instant Answer idea?
Yes - https://duck.co/ideas/idea/4514/vehicle-fuel-efficiency

Which existing Instant Answers will this one supersede/overlap with?
None that I know of.

Are you having any problems? Do you need our help with anything?
How to best handle disambiguation pages and/or display is TBD. For example, a search for "1993 Colt fuel economy" could reference 1993 Dodge Colt or 1993 Plymouth Colt. Should the answer list links to those models or just list both models so no link is necessary? For now, ambiguous redirects like this are simply deleted.

Where did you hear about DuckDuckHack? (For first time contributors)
jobs-subscribe@perl.org I believe.

What does the Instant Answer look like? (Provide a screenshot for new or updated Instant Answers)
http://withoutopus.org/fueleconomy.htm

Checklist

Please place an 'X' where appropriate.

[x] Added metadata and attribution information
[] Wrote test file and added to t/ directory
[x] Verified that Instant Answer adheres to design guidelines (https://duck.co/duckduckhack/design_styleguide)
[x] Verified that Instant Answer adheres to code styleguide (https://duck.co/duckduckhack/code_styleguide)
[] Tested cross-browser compatibility
    Please let us know which browsers/devices you've tested on:
    - Windows 8
        [] Google Chrome
        [] Firefox
        [] Opera
        [] IE 10

    - Windows 7
        [] Google Chrome
        [] Firefox
        [] Opera
        [] IE 8
        [] IE 9
        [] IE 10

    - Windows XP
        [] IE 7
        [] IE 8

    - Mac OSX
        [] Google Chrome
        [] Firefox
        [] Opera
        [] Safari

@moollaza
Member

Hey @zachthompson, thanks a lot for submitting this. We'll give it a look and provide some feedback shortly.

@jdorweiler
Contributor

@zachthompson Thanks for this. Everything works!
[Screenshot: selection_228]

@jdorweiler
Contributor

What do you think of the design of it?

@jdorweiler
Contributor

The title can't have any parens in it.
[Screenshot: selection_230]

Otherwise that text is interpreted as a subtitle.
[Screenshot: selection_231]

@zachthompson
Contributor Author

I could replace parens with double quotes since it seems like that's how they're using them.

Also, now that I see the output, should I remove the redundant title in the abstract so that it starts with "MPG:..."?

@jdorweiler
Contributor

Yeah try double quotes and removing that repeated title.

@zachthompson
Contributor Author

Those two changes are made. Let me know if we need any others.

@jdorweiler
Contributor

@zachthompson looks great, thanks! I added a few trigger words to it: mpg, fuel economy, gas mileage. Are there any others you can think of that should trigger it?

Here's what it looks like now:
[Screenshot: selection_234]

And working with a trigger word:
[Screenshot: selection_235]

@zachthompson
Contributor Author

@jdorweiler Looks good. All I can think of is maybe expanding mpg, e.g. miles per gallon, miles/gallon, etc., or "fuel mileage" which seems to be propagated, for example, here - http://www.fuelmileage.com/

BTW, any thoughts on the disambiguation?

@mwmiller
Contributor

Consider also "(fuel|gas|petrol) efficiency"
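A minimal Python sketch of how the trigger phrases suggested so far could be recognized and stripped from a query. The regular expression and the `strip_trigger` name are assumptions for illustration; the pattern is slightly more permissive than the phrases listed in the thread (it would also accept "petrol economy", for instance), and it is not how DuckDuckGo's trigger matching is implemented.

```python
import re

# Trigger phrases collected from the thread; the regex itself is an assumption.
TRIGGER = re.compile(
    r"\b(?:mpg|miles\s*(?:per|/)\s*gallon"
    r"|(?:fuel|gas|petrol)\s+(?:economy|mileage|efficiency))\b",
    re.IGNORECASE,
)

def strip_trigger(query):
    """Return the query without its trigger phrase, or None if none is present."""
    if not TRIGGER.search(query):
        return None
    # Remove the phrase, then collapse any leftover whitespace.
    return " ".join(TRIGGER.sub(" ", query).split())
```

The remainder returned by `strip_trigger` is what would then be looked up as the vehicle title.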

> BTW, any thoughts on the disambiguation?

For my part, I would love to see them in a Spice-style tile/detail view, but I don't know if Fatheads can do that today.

@mwmiller
Contributor

This looks super cool!

@zachthompson
Contributor Author

@mwmiller I can change the shebang to whatever best works in the ddg environment. I like the tile idea as well. I'm working on another fathead and was wondering the same thing for certain searches it could provide.

@jdorweiler
Contributor

No fathead templates yet, but that's something I'd like to have too. For now you can just use <br> for a new line.

@zachthompson
Contributor Author

Well, there are a couple of options.

  1. Leave them out, since many are convenience searches without the maker. However, there are quite a few that are ambiguous because of 2WD/4WD/AWD, passenger/cargo, etc.
  2. Change the summary to something like '"1992 Plymouth Voyager" may refer to multiple vehicles' and list each one below. I think the max number of vehicles is four.
    If we want to do option 2, I could either a) have the vehicle as a subheader (would look roughly like the normal display without the summary sentence in the abstract) or b) append the vehicle name to each line. Option b could make for long lines, e.g. "2003 Chrysler Town and Country/Voyager/Grand Voy. AWD, x L, y C, ...etc."
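The second option could look roughly like the following Python sketch, which joins lines with `<br>` as suggested earlier in the thread. The `disambiguation_abstract` function, its `limit` parameter, and the alphabetical ordering are assumptions for illustration only.

```python
def disambiguation_abstract(query_title, vehicles, limit=4):
    """Build an abstract listing the vehicles a title may refer to.

    `limit` caps the list; four is the maximum mentioned in the thread.
    """
    shown = sorted(vehicles)[:limit]
    lines = [f'"{query_title}" may refer to multiple vehicles'] + shown
    # Fatheads have no templates yet, so <br> is used as the line break.
    return "<br>".join(lines)

ABSTRACT = disambiguation_abstract(
    "1993 Colt", ["1993 Plymouth Colt", "1993 Dodge Colt"]
)
```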

@jdorweiler
Contributor

Let's just leave it as-is for now. I kinda like seeing the multiple vehicles as long as it has some logical limit to the number that will show.

I'm wondering if we can fix the triggering on this. The titles are so specific that I have to go into the output.txt file to see what to search for. I thought 2009 subaru outback would trigger, but the actual search title is 2009 subaru outback wagon awd (see img below). See the 1987 Chevrolet example a few posts above for an even better example.

What do you think about adding additional redirect entries that have some of the common words stripped off? i.e. awd, 2wd, 4wd, pickup ... I'm sure there are others. That way 2009 subaru outback would redirect to the correct entry.
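A minimal Python sketch of this idea, under the assumption that a stripped title is only kept as a redirect when it still points to exactly one entry. The word list (including "wagon"), the sample titles, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical qualifier list and titles; the thread names awd, 2wd, 4wd, pickup.
COMMON = {"awd", "2wd", "4wd", "pickup", "wagon"}
TITLES = [
    "2009 subaru outback wagon awd",
    "2009 ford ranger pickup 2wd",
    "2009 ford ranger pickup 4wd",
]

def stripped(title):
    """Remove common qualifier words from a lowercase title."""
    return " ".join(w for w in title.split() if w not in COMMON)

def extra_redirects(titles):
    """Return stripped-title -> full-title pairs that stay unambiguous."""
    targets = defaultdict(set)
    for t in titles:
        key = stripped(t)
        if key != t:
            targets[key].add(t)
    # A key that maps to two or more titles is ambiguous and is dropped.
    return {k: next(iter(v)) for k, v in targets.items() if len(v) == 1}

EXTRA = extra_redirects(TITLES)
```

With this sample data only the Outback gains a shorter redirect; the two Ranger entries collide on the same stripped key and are left out.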

This page has some info about redirects if you didn't see it already https://duck.co/duckduckhack/fathead_overview#data-file-format. Let me know if you need help.

[Screenshot: selection_236]

@zachthompson
Contributor Author

If you look at the bottom half of output.txt you should see about 20k redirects. I played around with several ways to remove these types of words to make them easier to trigger. In the specific case you mention, it will work, since Subaru doesn't make 2WD versions. In most, however, removing the AWD/2WD will create an ambiguous redirect (e.g. go to the search http://www.fueleconomy.gov/feg/findacar.shtml, bring up 2009 Ford, and scan the models.) We can do it and just see how many additional, unique redirects it generates. I could also just run through all of the combinations of the words in the model.

@zachthompson
Contributor Author

BTW, to be clear, I was only talking about changing the display for any redirects that referenced multiple vehicles. It wouldn't change the display of multiple configurations, which is what we've been looking at. Multiple vehicles would just be vertically stacked, rather than horizontally as was suggested.

@zachthompson
Contributor Author

131k additional redirects with the update. Any variation that's not completely ambiguous should work.

@jdorweiler
Contributor

ah thanks. I didn't notice the redirects at the bottom. I'll check out the new ones and see how it works now.

@zachthompson
Contributor Author

The only item not part of the variations is the year. It has to be first. I could change it to allow the year to be anywhere as well.

@zachthompson
Contributor Author

Actually, that last statement is incorrect. Both the year and make are in fixed positions. If we allow for these two terms to be in any position, require that a year be present, and allow for any number of terms, as long as it's unique, the redirects balloon to over 5.2M.

Not allowing the year or make to appear between two terms of the model reduces the output to a much more manageable 1.25M redirects or so.

* Significant memory (~1.4G on FreeBSD amd64 Perl 5.16.3) and run time (~3 minutes) reqs
* Clarified some variable usage
* Some memory tweaks
@zachthompson
Contributor Author

@jdorweiler Updated. Only 75 vehicles where the volumes come into play.

@jdorweiler
Contributor

@zachthompson great! I updated with your new changes.


use DDG::Fathead;

primary_example_queries '2014 Honday Fit fuel economy', '2014 Prius mpg';
Contributor

honday -> honda
2014 prius mpg -> 2014 prius v mpg (I guess that's the model name now?)

@jdorweiler
Contributor

Looks great! I just want to see if @chrismorast has any design ideas
https://ddh4.duckduckgo.com/?q=2005+volkswagen+jetta

* Make transmission in the configuration optional
* Fix some typos
@zachthompson
Contributor Author

@jdorweiler A couple of additional tweaks for electric vehicles.

@zachthompson
Contributor Author

fyi, records for the upcoming year start showing up in Spring and are continuing to be added even now. According to my contact 15 new records arrived just this afternoon. Might need to run this somewhat regularly.

@jdorweiler
Contributor

@zachthompson oh nice! I didn't even think of electric cars. The names for the Tesla cars are kinda long, though, e.g. "2013 tesla 40 s battery kwhr model pack".

Thanks for looking into when they do updates. I'll make a note to update this a few times a year. New changes are up if you want to try and trigger some of the electric cars.

@zachthompson
Contributor Author

@jdorweiler yeah, unfortunately some models have extensive qualifiers in parens like that. Looks good. I tried the 2012 nissan leaf and a few without transmissions specified, e.g. 2001 th!nk and 2001 hyper-mini.

For models with a single configuration like these, we might collapse the summary and configuration into one line. Since there's no range it's sort of redundant. However, I could also see leaving it just to be consistent.

@jdorweiler
Contributor

@zachthompson Could you limit the number of variations? When I search for 2013 tesla I'm finding ~18k results. Is this because the name is so long and it's making permutations of each word in the title?

@zachthompson
Contributor Author

@jdorweiler With that particular example I'm not sure I see a good way to reduce the redirects any further.

The redirects are generated as follows:

Year: 2013
Make: Tesla
Model: Model S "60 kW-hr battery pack"

The terms of model are 1) permuted and 2) reduced in number, requiring at least one term. Each model is then inserted into each permutation of year/make/model. It only requires two of the latter but forces the year to be one of them (eliminating rare cases where make/model, or even just model, would work.) So it allows for reasonable variety while also requiring minimal terms, e.g. "2013 60" will work.
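One possible reading of the generation rule above, sketched in Python rather than the PR's Perl. The function names and the three-term model are assumptions, and the uniqueness check that discards ambiguous keys is left out. Each model variant stays contiguous, so the year and make never land between two model terms.

```python
from itertools import permutations

def model_variants(model_terms):
    """Every ordered arrangement of every non-empty subset of the model terms."""
    for size in range(1, len(model_terms) + 1):
        for perm in permutations(model_terms, size):
            yield " ".join(perm)

def redirect_keys(year, make, model_terms):
    """Keys that contain the year plus the make, a model variant, or both."""
    keys = set()
    for variant in model_variants(model_terms):
        # The year is mandatory; the make is optional alongside the model.
        for parts in ((year, variant), (year, make, variant)):
            for order in permutations(parts):
                keys.add(" ".join(order).lower())
    # Year and make alone also satisfy "two of the three, including the year".
    for order in permutations((year, make)):
        keys.add(" ".join(order).lower())
    return keys

KEYS = redirect_keys("2013", "Tesla", ["Model", "S", "60"])
```

Even this shortened three-term model yields 122 candidate keys. The count grows quickly with the number of terms: a model name split into six terms would have 1,956 variants before the year and make are added.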

Ways I can think of to reduce redirects:

  1. Force year, make, and model to be specified.
  2. Require a specific order to year, make, and model.
  3. Remove junk terms from the model, e.g. "of", "the", "and", etc. I'm working on another fathead where this will probably be required. It will reduce redirects but has unintentional consequences.
  4. Somehow detect phrases like "battery pack", "town and country", "60 kW-hr", etc., and disallow permutation. This is probably difficult...to do well.

Let me know if you have other ideas. None of the above seem like great tradeoffs with respect to flexibility or effort to me.

@jdorweiler
Contributor

@zachthompson Makes sense. I agree, though, that there doesn't seem to be an easy way to fix that. I think this is good to go. I'm going to post it up internally for testing.

@zachthompson
Contributor Author

@jdorweiler A variation of #3 above would be to prevent "and", "of", "the", etc., from appearing first or last in the model permutation. However, I generated all of the unique first and last terms and it would only save us ~15k redirects, mostly from "and", "inc", and "incl.".
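A minimal Python sketch of that variation: model-term arrangements are generated as before, but any arrangement that starts or ends with a junk term is skipped. The `JUNK` set and the function name are hypothetical, and the sample model is chosen only for illustration.

```python
from itertools import permutations

# Hypothetical junk-term list; the thread mentions "and", "of", "the", "inc", "incl.".
JUNK = {"and", "of", "the", "inc", "incl."}

def edge_filtered_variants(model_terms):
    """Yield term arrangements that neither start nor end with a junk term."""
    for size in range(1, len(model_terms) + 1):
        for perm in permutations(model_terms, size):
            if perm[0] in JUNK or perm[-1] in JUNK:
                continue
            yield " ".join(perm)

VARIANTS = set(edge_filtered_variants(["town", "and", "country"]))
```

For this three-term model the filter keeps 6 of the 15 possible arrangements, for example "town and country" but not "and country".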

@chrismorast

Looks good!

Although, I tried some other vehicles but couldn't get it to trigger.

2007 Nissan Frontier
2007 Subaru Impreza

@jdorweiler
Contributor

@zachthompson After getting some feedback on this I think we're going to have to cut back on the number of redirects. All of the current fatheads have fewer than 1M entries combined. I'm not sure of the best way to fix it, but here are a few ideas that could work:

  • make a big list of common words to remove from the titles
  • look for a source of car names with less specific names that we can apply to this data
  • try this as a longtail

I think a better option would be to try this as a longtail and see how that works. A longtail does relevancy searching on the title so it's not as strict as key:value searching for a fathead. I think either option could work so don't let me discourage you from trying the others.

Let me know if you want to try the longtail and I can give you a quick summary. It's been a while since anyone made one so our docs need an update (https://duck.co/duckduckhack/longtail_overview).

@zachthompson
Contributor Author

@chrismorast Both of those have multiple models (2007 Nissan Frontier 2WD, 2007 Nissan Frontier V6 4WD, 2007 Nissan Frontier V6 2WD, etc.) I suppose one way to reduce redirects would be to try and determine the most generic model name and group the specific models under it like this. From some of the model names though it might be harder to figure out than it appears.

@jdorweiler ok. So it sounds like fatheads should only be used for keys with a couple of terms. Most of them seem to be essentially definitions that don't really require extensive redirects.

I'll check out the longtail when I have a chance. I'm guessing some magic is performed on the title field instead of the redirects.

Shall I close this?

@jdorweiler
Contributor

@zachthompson Let's leave this open. What about reducing it down to [year][make][model] by stripping off the common words?

@zachthompson
Contributor Author

@jdorweiler It already is year/make/model. In the case of the Frontier above, that's what the EPA considers a model. All of the descriptors like "2WD", "AWD", "V6", etc., are what distinguish them with respect to fuel economy. Though we visually comprehend that "Nissan Frontier X" all refer to the same basic model, the data aren't that way.

So if we were to remove the common words, do you mean in the article or just for redirects? If we attempt that in the article, there are a few challenges. First, how to do it in a general way? Even if the model were split apart and rebuilt term by term, how do we know when the base model has been found? For example, if there were a "Town and Country *" and a "Town Car *", are these both model "Town" or completely separate? There are a lot of other models with numbers and letters where it just isn't obvious where the base model stops.

Just removing common terms from articles leads to duplicates which have to be somehow resolved. In the case of the Frontier, for example, we would assume that they all refer to the same model. The removed terms would likely have to be relocated to the configurations below to distinguish them.

We can't do this only for redirects since they would become ambiguous.

If you or anyone else has experience with this type of thing or additional thoughts, I'm game for giving it a go.

The longtail does sound like a better solution and addresses my earlier concern about generating redirects in a uniform way. However, I'm not sure how extensive the relevancy search is on the title(s). I'm assuming it does terms in any order, minimum number of terms to identify a single item, etc.? Or can longtails display multiple items if a single item isn't found?

@moollaza
Member

moollaza commented Nov 7, 2014

Just to chime in: maybe we should reconsider the Spice route? If we can build a hash that maps the car names to IDs, or we can cleverly parse a query to get the make, model, and year, we can form an API request, I think?

@zachthompson
Contributor Author

@moollaza I was looking at this a bit. The year, make, model API is pretty inflexible. You have to have the exact model for it to generate a hit as far as I can tell, e.g. Town and Country/Voyager/Grand Voy. 2WD.

You might be able to utilize the year/make API, if they can be identified, and try to match the rest against one of the models returned. That seems pretty involved to do on the fly.

The hash idea with names to id mappings sounds interesting. Were you thinking something like all of the articles + redirects (~1.25M) in the current output mapped to specific IDs? That probably wouldn't be too intense. However, it should be noted that each configuration within each article has a separate ID. For example, the 2005 Jetta above maps to 11 IDs, not one. In order to derive a single ID for the API we would need to generate the unique redirects for each configuration!
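To make the one-title-to-many-IDs point concrete, here is a minimal Python sketch of such a hash. The IDs, titles, and function names are hypothetical; real IDs would come from the EPA data.

```python
from collections import defaultdict

# Hypothetical (vehicle_id, title) pairs; real IDs come from the EPA data.
CONFIGS = [
    (101, "2005 Volkswagen Jetta"),
    (102, "2005 Volkswagen Jetta"),
    (103, "2005 Volkswagen Jetta Wagon"),
]

def build_index(configs):
    """Map a normalized title to the sorted list of configuration IDs."""
    index = defaultdict(list)
    for vehicle_id, title in configs:
        index[title.lower()].append(vehicle_id)
    return {title: sorted(ids) for title, ids in index.items()}

INDEX = build_index(CONFIGS)

def lookup(query):
    """Return all configuration IDs for a query, or an empty list."""
    return INDEX.get(" ".join(query.lower().split()), [])
```

A lookup for an article title returns every configuration ID under it, so a single-ID API request would still need a further step to choose among them.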

@jdorweiler
Contributor

@zachthompson Good point. That really makes me think that this is better for a longtail. With the longtail things like 2012 impreza should trigger things like:

  • 2012 subaru impreza
  • 2012 subaru impreza awd
  • 2012 subaru impreza 4wd awd 4 door

You can even search for 2012 subaru turbo and you get results where the description field has the word turbo in the mileage description. Multiple results would show up as a tile view (stackoverflow is a longtail https://duckduckgo.com/?q=python+append+to+list&t=canonical).

If you want to try that out the output for a single entry would look like this:

<?xml version="1.0" encoding="UTF-8"?>
<add allowDups="true">

<doc>
<field name="title"><![CDATA[ NAME OF CAR ]]></field>
<field name="paragraph"><![CDATA[  MPG TEXT ]]></field>
<field name="source">LINK BACK TO SOURCE PAGE </field>
</doc>

</add>

You can just repeat the <doc></doc> part for each entry.
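A minimal Python sketch that emits entries in the layout shown above. The function names, the sample entry, and the placeholder source URL are assumptions; only the element and attribute names are taken from the example. The final line parses the result as a quick well-formedness check.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def cdata(text):
    """Wrap text in a CDATA section, splitting any embedded ']]>'."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

def to_doc(title, paragraph, source):
    """Render one <doc> element in the layout shown above."""
    return (
        "<doc>\n"
        f'<field name="title">{cdata(title)}</field>\n'
        f'<field name="paragraph">{cdata(paragraph)}</field>\n'
        f'<field name="source">{escape(source)}</field>\n'
        "</doc>\n"
    )

def to_feed(entries):
    """Wrap all <doc> elements in a single <add> root."""
    docs = "\n".join(to_doc(*entry) for entry in entries)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<add allowDups="true">\n\n' + docs + "\n</add>\n"
    )

# Hypothetical entry; the MPG text and source URL are placeholders.
FEED = to_feed([
    ("2012 Subaru Impreza AWD", "MPG: 27 city, 36 hwy",
     "https://example.com/?id=1&y=2012"),
])

# Parsing the feed confirms it is well-formed XML.
ROOT = ET.fromstring(FEED.encode("utf-8"))
```

The source field is not wrapped in CDATA in the example, so the sketch escapes it instead; otherwise a URL containing `&` would break the document.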

@zachthompson
Contributor Author

@moollaza Yeah, that would be awesome if it tiled on searches like that. I'll start converting it to a longtail and we can see how it compares. Should make the parsing much easier.

@jdorweiler
Contributor

Great, thanks for trying this out. 👍

@chrismorast

@jdorweiler, as far as the multiple models go, is it possible to add a dropdown (similar to the nutrition one) for disambiguation?

@jdorweiler
Contributor

@chrismorast no but @zachthompson resubmitted this as a longtail which will show multiple models in a tile view. duckduckgo/zeroclickinfo-longtail#9

@moollaza
Member

moollaza commented Dec 1, 2014

@zachthompson @jdorweiler do we still need this PR? Or are we indefinitely going with the Longtail? Just want to make sure we don't have any lingering PR's that need our attention :)

@zachthompson
Contributor Author

@moollaza I think the fathead route has limitations that can't be overcome for the data. I'm ok with closing it unless @jdorweiler has other reasons not to.

@jdorweiler
Contributor

@zachthompson @moollaza Thanks. The longtail solves all the troubles we had here so let's go with that one.

@jdorweiler jdorweiler closed this Dec 2, 2014
5 participants