
Full text search #46

Merged
merged 78 commits into master
Aug 14, 2016

Conversation

@Joonpark13 (Member) commented Jun 28, 2016

The API has been updated to support full text search. The most notable change is a new function in census_extractomatic/api.py, full_text_search(), which contains all of the functionality needed to perform such a search. The response format is a superset of the original: existing fields are unchanged, and some new information has been added.

It is necessary to build a search table before using the API; instructions to do so are written in setup.md, with further detail provided in full-text-guide.md. A Python script (query_script.py) is included for local testing of the search.

Addresses #43

tuchandra and others added 30 commits June 28, 2016 09:49
Querying instructions take the form of full-text guide.txt,
which walks through the process of developing the queries.

The querying script is query-script.py, providing a Python
interface for running queries. Sample usage is
python query-script.py housing income gross rent
The search data now includes column titles, as well as the previously
included table title, subject area, and universe. The weighting has also
been adjusted.
The Python script now has an explanation of the complicated SQL query.
Full detail is provided in full-text-guide.md, with a walkthrough of
query construction and usage instructions.
Wrote a script that creates a table, table_search_metadata,
which stores information about the acs2014_1yr tables to
facilitate full text search.
/2.1/table/search now functions as a basic route for the full-text
search tabulation data.
Added an "alter table" statement to change the owner of
table_search_metadata to census.
The route is under /2.1/geo/search and the create script may be
incomplete in the information it tabulates.
The route is 2.1/geo/search. The query output still needs to be verified
to match the previous version.
For local testing purposes, the app.S3 connection initialization block
is now inside a try except block. The app will log a warning if S3
configuration fails.
The metadata_profile_script and metadata_table_script files were
updated with additional documentation. metadata_script.sql contains
code to create a combined metadata table for both profiles and tables.
api.py was modified to point to this new table.
Also added a minor clarification in the full-text guide file and refactored
a few lines in api.py.
score.py takes the query 'ithaca' and shows the expected search results
alongside their custom computed scores.
Also begin commenting score.py script.
The first change is that profile_query_script.py and table_query_script.py
were refactored to have the same structure. This was primarily to separate
the search and score algorithms from I/O (taking in the query and printing
results). This refactoring allows its functions to be called independently
in query_script.py. Moreover, it makes clearer the purpose of both scripts
(running queries from the command line) by renaming them.

Secondly, table_query_script.py had its compute_score function updated, to
match the profile_query_script.py / score.py function and return scores in
a comparable range.

Finally, query_script.py takes advantage of the refactoring and is able to
search for both tables and profiles matching a given query. Because scores
are all in the same range, it can simply run both scripts and build
combined results from the two independent sets of results.
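Because both scorers emit values in a comparable range, the combination step can be a plain merge and sort. A minimal sketch of that idea (the result shapes and field names here are illustrative, not the actual script's):

```python
def combine_results(table_results, profile_results, limit=10):
    """Merge two independently scored result lists and rank them together.

    This only works because both scorers return values in a comparable
    range; otherwise one result type would always dominate the other.
    """
    merged = table_results + profile_results
    merged.sort(key=lambda r: r["score"], reverse=True)
    return merged[:limit]

tables = [{"name": "Median Household Income", "score": 0.91}]
profiles = [{"name": "Ithaca, NY", "score": 0.97}]
top = combine_results(tables, profiles)
```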
In addition, modified the table querying script to target the combined
metadata table, simplifying the query.
In addition, modified query_script.py to take into account the
new data shapes from table_ and profile_query_script.py, after
their queries were updated.
api.py now includes a function to search both tables and profiles,
implementing the functionality from query_script.py to the API.
Minor bugfixes: a couple of words were getting concatenated together,
which I took care of. The dagger and double-dagger characters on the
topics pages were being scraped into the text, so they were removed. Two
topics pages contained additional data in a second section tag; these
are now scraped as well.
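The dagger cleanup can be as simple as a character translation; a sketch (the scraper's actual cleanup code may differ):

```python
def strip_footnote_marks(text):
    """Remove the dagger/double-dagger footnote markers found on topic pages."""
    return text.translate({ord("†"): None, ord("‡"): None})

clean = strip_footnote_marks("Median income† by household type‡")
```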
Add functions to remove old topics from search_metadata (it should
be cleaned every time the script is run) and insert new topics into
it. These are not yet called anywhere.
The functions created before now update search_metadata with topic
page entries. Data for the topic name, relevant tables, and text on
the page is included.

The data can be refreshed simply by running "python topic_scraper.py", as
each run of the script deletes the old topic entries before adding the
new ones in, refreshing the table.
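The delete-then-insert refresh the script performs can be sketched against an in-memory SQLite table (the real script targets the PostgreSQL search_metadata table; the column names here are simplified assumptions):

```python
import sqlite3

def refresh_topics(conn, topics):
    """Replace all topic rows: clear the old entries, then insert the new set.

    Running this repeatedly always leaves exactly one row per current topic,
    which is what makes re-running the scraper safe.
    """
    cur = conn.cursor()
    cur.execute("DELETE FROM search_metadata WHERE type = 'topic'")
    cur.executemany(
        "INSERT INTO search_metadata (type, name) VALUES ('topic', ?)",
        [(t,) for t in topics],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_metadata (type TEXT, name TEXT)")
refresh_topics(conn, ["income", "housing"])
refresh_topics(conn, ["income", "housing", "glossary"])  # safe to re-run
count = conn.execute("SELECT COUNT(*) FROM search_metadata").fetchone()[0]
```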
The glossary page on Census Reporter is now scraped as well, and
it is indexed as a topic page along with the regular topics.
Removed query_script.py, as there have been many changes to the API
and to the database that were not reflected in the script. Furthermore,
it was reliant on the individual profile and table query scripts, which
themselves have since been deleted. Because of all this, and the fact
that its original purpose was only for testing, there is no reason to
keep it around.

full-text-guide.md has been updated to reflect this.
topic_scraper.py now adds the URL to search_metadata. It was necessary
to include this in the database because there is no standard format
for topic page URLs that *always* works, so they cannot be generated
on the fly. If the URLs change, the script will need to be re-run, but
I do not anticipate that happening.

Fixed some whitespace as well.
The full text search endpoint now returns topics as well as profiles
and tables. There is still work to be done on ranking: it is difficult
to give topics precedence without flooding the results with topics
alone (for queries like 'income', which return 11 topic pages and a
ton of tables). More refactoring work remains as well.
Refactor table_search(), profile_search(), and topic_search() into
one function. There was a lot of repeated code, and it is cleaner to
have a single search function that accepts a "type" parameter.
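The shape of that refactor might look like the following sketch (the row structure and function name are hypothetical, not the actual api.py code, and the database query is stood in for by a list of dicts):

```python
def perform_search(rows, query, object_type=None):
    """Single search entry point replacing the three per-type functions.

    `rows` stands in for search_metadata; each row carries a 'type'
    field, so one function can serve all three former call sites by
    varying only the type filter.
    """
    query = query.lower()
    hits = [r for r in rows if query in r["document"].lower()]
    if object_type is not None:
        hits = [r for r in hits if r["type"] == object_type]
    return hits

rows = [
    {"type": "table", "document": "Median Household Income"},
    {"type": "topic", "document": "Income and earnings"},
]
income_tables = perform_search(rows, "income", object_type="table")
```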
Refactor compute_profile_score() and compute_table_score() into
one function, compute_score(). This reduces duplicated code in the
score computation. In addition, the function takes a row directly,
rather than individual attributes of the row (like population or
priority), so it is resilient to changes in the query format.
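A sketch of the row-based signature (the field names and weighting here are hypothetical, not the actual scoring formula):

```python
def compute_score(row):
    """Score one search result from the row itself.

    Taking the whole row (a dict) instead of unpacked arguments means
    new columns in the query don't change this function's signature.
    """
    relevance = row.get("rank", 0.0)
    if row.get("type") == "profile":
        # Larger places should outrank smaller ones with the same text match.
        relevance += row.get("population", 0) / 1e7
    return relevance

score = compute_score({"type": "profile", "rank": 0.5, "population": 1_000_000})
```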
    try:
        app.s3 = S3Connection()
    except Exception as e:
        app.s3 = None
        app.logger.warning("S3 Configuration failed.")

Can this be app.logger.exception() so that we get the exception logged out in addition to the message? That would help with debugging if something is broken.

The profile URLs returned now use only the geoid, rather than the
geoid and display name as before. The Census Reporter app will
interpret a geoid-only URL correctly and redirect to the "right" URL.
This was done to avoid having to slugify display names and risk their
being wrong in the future.
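URL construction then needs nothing but the geoid; a minimal sketch (the geoid shown is just an example value):

```python
def profile_url(geoid):
    """Build a profile URL from the geoid alone.

    Census Reporter resolves a geoid-only URL and redirects to the
    canonical slugged version, so no display-name slug is needed here.
    """
    return "/profiles/%s" % geoid

url = profile_url("16000US3637000")  # example geoid
```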
Group score computation logic into one block, rather than having
three separate blocks for each object type.
Add a note that if the search_metadata table is rebuilt, the topic
scraper must be run again as well.

Add table IDs and geoids to the 'document' in the search table, so
users can enter these and see results returned.
Table scraping now happens when the TopicPageParser is initialized,
using a regex search for table codes. This is much simpler than
checking all of the strings found to see if they're a table.
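A sketch of that regex scan (the exact pattern in the scraper may differ; ACS table codes look like B01001 or B19013A):

```python
import re

# ACS table codes: B or C, five digits, optional race-iteration letter.
TABLE_CODE_RE = re.compile(r"\b[BC]\d{5}[A-Z]?\b")

def find_table_codes(text):
    """Return the unique table codes referenced in a page's text."""
    return sorted(set(TABLE_CODE_RE.findall(text)))

codes = find_table_codes("See tables B01001 and B19013A for income detail.")
```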
table_tester scrapes all topic pages and checks every table referenced
on those pages for existence in census_tabulation_metadata.
The HTMLStripper class removes HTML tags from a page, leaving only
the contents of those tags (i.e., the raw data). This is called in
the function that finds all tables in a page.
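Such a class is a thin subclass of the standard-library HTMLParser; a sketch (shown in Python 3 syntax, though the repo at the time targeted Python 2):

```python
from html.parser import HTMLParser

class HTMLStripper(HTMLParser):
    """Collect only the text content of a page, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_data(self, data):
        # Called for the text between tags; tags themselves are skipped.
        self.fragments.append(data)

    def get_data(self):
        return "".join(self.fragments)

stripper = HTMLStripper()
stripper.feed("<p>Median <b>household</b> income</p>")
text = stripper.get_data()
```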
@JoeGermuska JoeGermuska merged commit cc887c5 into master Aug 14, 2016