Full text search #46
Merged
Conversation
Querying instructions are given in full-text guide.txt, which walks through the process of developing the queries. The querying script is query-script.py, which provides a Python interface for running queries. Sample usage: python query-script.py housing income gross rent
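As a rough illustration (not the actual contents of query-script.py), a command-line interface over PostgreSQL full-text search might look like the following. The table name table_search_metadata and the document column come from later commits in this branch; the exact SQL and column list are assumptions:

```python
import sys

def build_tsquery(terms):
    """Join command-line terms into a Postgres tsquery string,
    e.g. ['housing', 'income'] -> 'housing & income'."""
    return " & ".join(terms)

# Hypothetical query shape; the real script's SQL may differ.
SQL = """
SELECT table_id, table_title,
       ts_rank(document, to_tsquery(%s)) AS score
FROM table_search_metadata
WHERE document @@ to_tsquery(%s)
ORDER BY score DESC;
"""

if __name__ == "__main__":
    # Mirrors the sample usage: python query-script.py housing income gross rent
    tsquery = build_tsquery(sys.argv[1:])
    print(tsquery)
```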
The search data now includes column titles, as well as the previously included table title, subject area, and universe. The weighting has also been adjusted.
The Python script now has an explanation of the complicated SQL query. Full detail is provided in full-text-guide.md, with a walkthrough of query construction and usage instructions.
…census-api into full-text-search
Wrote a script that creates a table, table_search_metadata, which stores information about the acs2014_1yr tables to facilitate full text search.
…census-api into full-text-search
/2.1/table/search now functions as a basic route for the full-text search tabulation data.
Added an ALTER TABLE statement to change the owner of table_search_metadata to census.
The route is under /2.1/geo/search and the create script may be incomplete in the information it tabulates.
The route is /2.1/geo/search. The query output still needs to be verified to match the previous version.
For local testing purposes, the app.s3 connection initialization block is now inside a try/except block. The app will log a warning if S3 configuration fails.
The metadata_profile_script and metadata_table_script files were updated with additional documentation. metadata_script.sql contains code to create a combined metadata table for both profiles and tables. api.py was modified to point to this new table.
Also added minor clarification in full-text guide file and refactored a few lines in api.py
score.py takes the query 'ithaca' and shows the expected search results alongside their custom computed scores.
Also began commenting the score.py script.
The first change is that profile_query_script.py and table_query_script.py were refactored to share the same structure, primarily to separate the search and score algorithms from I/O (taking in the query and printing results). This refactoring allows their functions to be called independently in query_script.py, and renaming them makes the purpose of both scripts (running queries from the command line) clearer. Secondly, table_query_script.py had its compute_score function updated to match the profile_query_script.py / score.py function and return scores in a comparable range. Finally, query_script.py takes advantage of the refactoring and is able to search for both tables and profiles matching a given query. Because all scores are in the same range, it can simply run both scripts and build combined results from the two independent result sets.
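The merge step described above can be sketched as follows. The result-dict shape is an assumption; the key point is that a global sort is only meaningful because both scripts now score in the same range:

```python
def combined_results(table_results, profile_results, limit=10):
    """Merge two independently scored result lists into one ranking.

    This works only because table and profile scores are now in a
    comparable range; otherwise one set would dominate the sort.
    """
    merged = table_results + profile_results
    merged.sort(key=lambda r: r["score"], reverse=True)
    return merged[:limit]
```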
In addition, modified the table querying script to target the combined metadata table, simplifying the query.
In addition, modified query_script.py to take into account the new data shapes from table_ and profile_query_script.py, after their queries were updated.
api.py now includes a function to search both tables and profiles, implementing the functionality from query_script.py to the API.
…census-api into full-text-search
Minor bugfixes: a couple of words were getting concatenated together, which I took care of. The dagger and double-dagger special characters on the topics pages were being picked up, so they are now removed. Two topics pages contained additional data in a second section tag; these are now scraped as well.
Add functions to remove old topics from search_metadata (it should be cleaned every time the script is run) and insert new topics into it. These are not yet called anywhere.
The functions created before now update search_metadata with topic page entries. Data for the topic name, relevant tables, and the text on the page is included. The data can be refreshed simply by running python topic_scraper.py, as each run of the script deletes the old topic entries before adding the new ones, refreshing the table.
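The delete-then-insert refresh pattern described above might look roughly like this. Only the table name search_metadata comes from the commits; the column names and the 'topic' type value are assumptions:

```python
# Each run wipes the old topic rows and re-inserts fresh ones, so the
# scraper script is safe to re-run at any time.
DELETE_TOPICS = "DELETE FROM search_metadata WHERE type = 'topic';"
INSERT_TOPIC = (
    "INSERT INTO search_metadata (topic_name, document, type) "
    "VALUES (%s, %s, 'topic');"
)

def refresh_topics(cursor, topics):
    """topics: iterable of (name, page_text) pairs scraped from the site."""
    cursor.execute(DELETE_TOPICS)
    for name, text in topics:
        cursor.execute(INSERT_TOPIC, (name, text))
```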
The glossary page on Census Reporter is now scraped as well, and it is indexed as a topic page along with the regular topics.
Removed query_script.py, as there have been many changes to the API and to the database that were not reflected in the script. Furthermore, it was reliant on the individual profile and table query scripts, which themselves have since been deleted. Because of all this, and the fact that its original purpose was only for testing, there is no reason to keep it around. full-text-guide.md has been updated to reflect this.
topic_scraper.py now adds the URL to search_metadata. It was necessary to include this in the database because there is no standard format for topic page URLs that *always* works, so they cannot be generated on the fly. If the URLs change, the script will need to be re-run, but I do not anticipate that happening. Fixed some whitespace as well.
The full text search endpoint now returns topics as well as profiles and tables. There is work to be done with ranking of the results - it is difficult to rank the results to give topics precedence while not flooding the results with topics only (for queries like 'income', which return 11 topic pages and a ton of tables). More work to be done refactoring as well.
Refactored table_search(), profile_search(), and topic_search() into one function. There was a lot of repeated code, and it makes more sense to have a single search function that accepts a "type" parameter.
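A sketch of what the unified function might look like, shown here just building the SQL rather than executing it; the column names and parameter style are assumptions:

```python
def search(query, obj_type=None):
    """Unified replacement for table_search/profile_search/topic_search.

    obj_type is 'table', 'profile', or 'topic'; None searches all types.
    Returns the SQL and its parameters instead of executing, for clarity.
    """
    sql = ("SELECT * FROM search_metadata "
           "WHERE document @@ plainto_tsquery(%s)")
    params = [query]
    if obj_type is not None:
        # The single "type" parameter replaces three near-identical queries.
        sql += " AND type = %s"
        params.append(obj_type)
    return sql, params
```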
Refactored compute_profile_score() and compute_table_score() into one function, compute_score(), reducing duplicated code in the score computation. In addition, the function takes a row directly, rather than individual attributes of the row (like population or priority), so it is resilient to changes in the query format.
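For illustration, a row-based scorer might look like the following. The weighting and field names are assumptions; the point is that the signature no longer changes when the query's column list does:

```python
def compute_score(row):
    """Score a search result from the row itself rather than from
    individually passed attributes (population, priority, ...)."""
    if row.get("type") == "profile":
        # Hypothetical weighting: larger places rank higher, capped at 1.0.
        return min(row.get("population", 0) / 1e7, 1.0)
    # Tables and topics fall back to a precomputed relevance value.
    return row.get("relevance", 0.0)
```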
```python
try:
    app.s3 = S3Connection()
except Exception, e:
    app.s3 = None
    app.logger.warning("S3 Configuration failed.")
```
Can this be app.logger.exception(), so that we get the exception logged in addition to the message? That would help with debugging if something is broken.
The profile URLs returned now use only the geoid, rather than the geoid and display name as before. The Census Reporter app will correctly interpret a URL with only a geoid and redirect to the "right" URL. This was done to avoid having to slugify display names, which could potentially be wrong in the future.
Group score computation logic into one block, rather than having three separate blocks for each object type.
Add a note that if the search_metadata table is rebuilt, the topic scraper must be run again as well. Add table IDs and geoids to the 'document' column in the search table, so users can enter these and see results returned.
Table scraping now happens when the TopicPageParser is initialized, using a regex search for table codes. This is much simpler than checking all of the strings found to see if they're a table.
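The regex approach could look like the following. The pattern matches the common ACS table-code shape (a B or C prefix, five digits, optional letter suffix) but is an assumption, not the exact expression used in the scraper:

```python
import re

# ACS table codes look like 'B19013' or 'C02003A'.
TABLE_CODE_RE = re.compile(r"\b[BC]\d{5}[A-Z]?\b")

def find_table_codes(text):
    """Return every table code referenced in a page's text."""
    return TABLE_CODE_RE.findall(text)
```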
table_tester scrapes all topic pages and checks every table referenced on those pages for existence in census_tabulation_metadata.
The HTMLStripper class removes HTML tags from a page, leaving only the contents of those tags (i.e., the raw data). This is called in the function that finds all tables in a page.
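In Python 3, such a class can be built on the standard library's html.parser; this is a sketch of the idea rather than the exact class in the scraper:

```python
from html.parser import HTMLParser

class HTMLStripper(HTMLParser):
    """Accumulate only the text content of a page, discarding all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves.
        self.chunks.append(data)

    def get_data(self):
        return "".join(self.chunks)

def strip_tags(html):
    stripper = HTMLStripper()
    stripper.feed(html)
    return stripper.get_data()
```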
The API has been updated to support full text search. The most notable change is the addition of a new function in census_extractomatic/api.py called full_text_search(), which contains all of the functionality needed to perform such a search. The API is a superset of the original result format, staying faithful to existing fields and adding some new information.

It is necessary to build a search table before using the API; instructions to do so are written in setup.md, with further detail provided in full-text-guide.md. A Python script (query_script.py) is included for local testing of the search.

Addresses #43