Reindex site results with search.gov #2991

Closed
6 tasks done
Tracked by #137 ...
dorothyyeager opened this issue Jun 24, 2019 · 13 comments · Fixed by #3278
@dorothyyeager
Contributor

dorothyyeager commented Jun 24, 2019

Summary

When searching for "other pages" in fec.gov, the search generally yields badly outdated results pointing to transition pages that no longer exist.

For example, a search in fec.gov's search box for "Guideline good order public funding" should return the public funding page (https://www.fec.gov/introduction-campaign-finance/understanding-ways-support-federal-candidates/presidential-elections/public-funding-presidential-elections/) at the top. Instead this is the result. The correct page does not appear in the results.

[Screenshot: fec.gov search results for this query]

(The correct page is actually at the top of Google's results when searching the same term.)

Expected Behavior

The search should return updated results so that the latest version of the current pages appears. As noted above, I don't think this is an SEO issue; fec.gov pages generally turn up in Google's results quickly.

Actual Behavior

fec.gov content pages that have been taken down from transition are returned in the search results. Pages that have been up for quite a while, and that appear in Google's results for the same search term, are not returned.

Frequency

  • This is an ongoing, predictable issue

How to Reproduce

List any steps you took for this issue to happen. Be sure to include the URL and what you clicked or entered.
URL: https://www.fec.gov/…

  1. Enter "Guideline good order public funding" into the search box
  2. Click "Search other pages"
  3. Watch a bunch of retired, pulled transition pages appear.

Screenshots

Same search on Google, yielding correct page as top result:
[Screenshot: Google search results for the same query]

Misc

This is actually happening a lot with various searches for content pages, but this was the most egregious error yet, as the public funding page has been up for over a year.

Completion criteria:

  • Review search.gov search tool that is incorporated into our global site search. Refer to search.gov documentation: https://search.gov/developer/index.html
  • Take a look at what scripts run the indexing on our website. It looks like their code base may have been updated after the rebranding, so we'll need to double-check that our indexing scripts still work. If not, fix them.
  • Document what the indexing script is doing and what it is indexing, so that we can write a good follow-up ticket to test that the global search is picking up content accurately. @AmyKort thinks that this is what is happening:

The overall site search doesn't include latest updates.
If you want to search anything from latest updates, search in the separate search box for that.
The overall site search will look through certain data search aspects (committee name, contributor name, candidate name, etc.) and text from legal resources (but not from legal resources data results, like AOs, etc.).

  • Get the latest Wagtail db dump from prod and store it within our backups. This may be a good time to check the backups made by the automated scripts. Ask @rjayasekera; he'll know where these automated db dumps are stored. Make sure to update the Wagtail database on stage, so that the indexing may be accurate?
  • Run the index script
  • Test search on the website. It doesn't matter which environment, since it's all using the same search.gov index drawer. We don't have separate drawers for our feature, staging and dev spaces.
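
For the final test step, one way to check the index is to compare the paths it holds against the paths the site currently serves. The sketch below is only an illustration: `diff_index` and the sample paths are hypothetical, and how the two lists would actually be obtained (from search.gov and from the CMS) is left open.

```python
def diff_index(indexed_paths, live_paths):
    """Compare what the search index holds against what the site serves."""
    indexed = set(indexed_paths)
    live = set(live_paths)
    return {
        "stale": sorted(indexed - live),    # indexed, but no longer on the site
        "missing": sorted(live - indexed),  # on the site, but never indexed
    }


# Hypothetical sample data, not taken from the real index.
indexed = ["/pages/retired-transition-page/", "/data/"]
live = ["/data/", "/introduction-campaign-finance/"]

report = diff_index(indexed, live)
print(report)
```

Anything under "stale" would be a candidate for removal from the index, and anything under "missing" a candidate for the next indexing run.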
@dorothyyeager dorothyyeager added this to the Sprint 10.1 milestone Aug 9, 2019
@lbeaufort lbeaufort changed the title from "FEC.gov search results for content pages are turning up outdated pages pulled off transition" to "Reindex site results with search.gov" Sep 5, 2019
@lbeaufort lbeaufort modified the milestones: Sprint 10.1, Sprint 10.2 Sep 5, 2019
@patphongs patphongs modified the milestones: Sprint 10.2, Sprint 10.3 Oct 2, 2019
@rfultz
Contributor

rfultz commented Oct 7, 2019

While we're reindexing the site, I'd like to think about adding a sitemap.xml. Are there places in the site that people are having trouble finding, places we'd like them to find more easily, etc? May be good to add those to a general sitemap so search engines can more readily find them.
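
As a rough illustration of what a sitemap.xml could look like, here is a minimal sketch using Python's standard library and the sitemaps.org 0.9 schema. The function name `build_sitemap`, the choice of pages, and the `lastmod` date are all assumptions for the example, not a proposal for how fec-cms should generate it.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (absolute_url, lastmod or None) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:  # lastmod is optional in the sitemap protocol
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Illustrative entries only; the date is made up.
sitemap_xml = build_sitemap([
    ("https://www.fec.gov/introduction-campaign-finance/", "2019-10-18"),
    ("https://www.fec.gov/data/", None),
])
print(sitemap_xml)
```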

@patphongs patphongs self-assigned this Oct 16, 2019
@patphongs
Member

patphongs commented Oct 17, 2019

Trying to get access into the rebranded search.gov system. Emailed the system owner yesterday. Going to move this to blocked; it's important that we get access back into the system before we start trying to re-index. Need to verify that the API key from their system is also still valid.

@patphongs
Member

I have received a response back from the search.gov team and I have access back into their system now. Heading back into researching this ticket.

@patphongs
Member

Found some great documentation on our website's search indexing here: https://github.com/fecgov/fec-cms/blob/develop/fec/search/management/instructions.md

@patphongs
Member

I was able to follow the documentation to re-index the wagtail pages and data app pages. Next we'll need to re-index transition pages.

@patphongs
Member

@dorothyyeager The pages for transition are scraped based on the pages that are defined in this JSON file: https://github.com/fecgov/fec-cms/blob/develop/fec/search/management/data/transition_pages.json. Is there a more up-to-date list of transition pages you would like me to scrape? It doesn't have to be every page on transition, just the pages we think are important to have in the search.

@patphongs
Member

Thanks @dorothyyeager! These have been removed from the site search. At some point, I'd like the content team to decide what pages on the transition site should be added to the site index. Made a new ticket here: #3279

@dorothyyeager
Contributor Author

Thanks for doing this! It will be awesome to have all the content we've been putting up show up in the searches. Much, much appreciated! Will start thinking about the new ticket.

@patphongs
Member

patphongs commented Oct 18, 2019

> While we're reindexing the site, I'd like to think about adding a sitemap.xml. Are there places in the site that people are having trouble finding, places we'd like them to find more easily, etc? May be good to add those to a general sitemap so search engines can more readily find them.

Thank you @rfultz for suggesting this! I wrote up a ticket that explains steps that should be taken to accomplish this goal: #3280. It may even help us with automating search indexing from search.gov.

@patphongs
Member

patphongs commented Oct 18, 2019

@dorothyyeager I noticed that your example in this issue is still not showing up in the site search. I think it's not indexing the children or descendants of this section: /introduction-campaign-finance/. It may be missing other pages too. I created this new issue to see if we can figure out why: #3281
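
To make "children or descendants" concrete: the public funding page sits three levels below /introduction-campaign-finance/, so an indexer that only visits a section's direct children would never reach it. Whether that is the actual cause is not established here. The sketch below walks a hypothetical parent-to-children mapping (built from the URL in the issue summary) and is not the real indexing script.

```python
def descendants(tree, root):
    """Return every page below `root` in a parent -> children mapping, depth-first."""
    found = []
    stack = list(reversed(tree.get(root, [])))
    while stack:
        path = stack.pop()
        found.append(path)
        # Push this page's children so deeper levels are visited too.
        stack.extend(reversed(tree.get(path, [])))
    return found


# Hypothetical page tree reconstructed from the example URL in this issue.
ROOT = "/introduction-campaign-finance/"
WAYS = ROOT + "understanding-ways-support-federal-candidates/"
PRES = WAYS + "presidential-elections/"
FUNDING = PRES + "public-funding-presidential-elections/"
tree = {ROOT: [WAYS], WAYS: [PRES], PRES: [FUNDING]}

pages = descendants(tree, ROOT)
print(pages)
```

Indexing only `tree[ROOT]` would cover one page, while the full walk covers all three, including the public funding page.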

@patphongs
Member

@dorothyyeager FYI, solved the issue about the /introduction-campaign-finance/ page indexing. It now appears in the search. Still going to leave open #3281 though, in case there's others we want to add to the index.

@dorothyyeager
Contributor Author

Thanks @patphongs !! Good idea. I'll think on it.

@PaulClark2 PaulClark2 modified the milestones: Sprint 10.3, Sprint 10.4 Oct 22, 2019