
Full text search #46

Merged
merged 78 commits into master
Aug 14, 2016

Conversation

@Joonpark13 (Member) commented Jun 28, 2016

The API has been updated to support full text search. The most notable change is a new function in census_extractomatic/api.py, full_text_search(), which contains all of the functionality needed to perform such a search. The response format is a superset of the original: existing fields are unchanged, and some new information has been added.

It is necessary to build a search table before using the API; instructions to do so are written in setup.md, with further detail provided in full-text-guide.md. A Python script (query_script.py) is included for local testing of the search.

Addresses #43

tuchandra and others added 30 commits June 28, 2016 09:49
Querying instructions take the form of full-text guide.txt,
which walks through the process of developing the queries.

The querying script is query-script.py, providing a Python
interface for running queries. Sample usage is
python query-script.py housing income gross rent
The search data now includes column titles, as well as the previously
included table title, subject area, and universe. The weighting has also
been adjusted.
The Python script now has an explanation of the complicated SQL query.
Full detail is provided in full-text-guide.md, with a walkthrough of
query construction and usage instructions.
Wrote a script that creates a table, table_search_metadata,
which stores information about the acs2014_1yr tables to
facilitate full text search.
/2.1/table/search now functions as a basic route for the full-text
search tabulation data.
Added an "alter table" statement to change the owner of
table_search_metadata to census.
The route is under /2.1/geo/search and the create script may be
incomplete in the information it tabulates.
The route is 2.1/geo/search. The query output still needs to be verified
to match the previous version.
For local testing purposes, the app.S3 connection initialization block
is now inside a try except block. The app will log a warning if S3
configuration fails.
The metadata_profile_script and metadata_table_script files were
updated with additional documentation. metadata_script.sql contains
code to create a combined metadata table for both profiles and tables.
api.py was modified to point to this new table.
Also added a minor clarification in the full-text guide file and refactored
a few lines in api.py.
score.py takes the query 'ithaca' and shows the expected search results
alongside their custom computed scores.
Also begin commenting score.py script.
The first change is that profile_query_script.py and table_query_script.py
were refactored to have the same structure. This was primarily to separate
the search and score algorithms from I/O (taking in the query and printing
results). This refactoring allows its functions to be called independently
in query_script.py. Moreover, it makes clearer the purpose of both scripts
(running queries from the command line) by renaming them.

Secondly, table_query_script.py had its compute_score function updated, to
match the profile_query_script.py / score.py function and return scores in
a comparable range.

Finally, query_script.py takes advantage of the refactoring and is able to
search for both tables and profiles matching a given query. Because scores
are all in the same range, it can simply run both scripts and build
combined results from the two independent sets of results.
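Because both scorers emit values in a comparable range, the combination step can be a plain merge and sort. A minimal sketch of that idea (the result shapes and field names here are illustrative, not the actual script's):

```python
def combine_results(table_results, profile_results, limit=10):
    """Merge two independently scored result lists and rank them together.

    This only works because both scorers return values in a comparable
    range; otherwise one result type would always dominate the other.
    """
    merged = table_results + profile_results
    merged.sort(key=lambda r: r["score"], reverse=True)
    return merged[:limit]

tables = [{"name": "Median Household Income", "score": 0.91}]
profiles = [{"name": "Ithaca, NY", "score": 0.97}]
top = combine_results(tables, profiles)
```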
In addition, modified the table querying script to target the combined
metadata table, simplifying the query.
In addition, modified query_script.py to take into account the
new data shapes from table_ and profile_query_script.py, after
their queries were updated.
api.py now includes a function to search both tables and profiles,
implementing the functionality from query_script.py to the API.
Minor bugfixes: a couple of words were getting concatenated together,
which I took care of. The dagger and double-dagger characters on the
topics pages were being scraped into the text, so they were removed. Two
topics pages contained additional data in a second section tag; these
are now scraped as well.
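The dagger cleanup can be as simple as a character translation; a sketch (the scraper's actual cleanup code may differ):

```python
def strip_footnote_marks(text):
    """Remove the dagger/double-dagger footnote markers found on topic pages."""
    return text.translate({ord("†"): None, ord("‡"): None})

clean = strip_footnote_marks("Median income† by household type‡")
```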
Add functions to remove old topics from search_metadata (it should
be cleaned every time the script is run) and insert new topics into
it. These are not yet called anywhere.
The functions created before now update search_metadata with topic
page entries. Data for the topic name, relevant tables, and text on
the page is included.

The data can be refreshed simply by running "python topic_scraper.py", as
each run of the script deletes the old topic entries before adding the
new ones in, refreshing the table.
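The delete-then-insert refresh the script performs can be sketched against an in-memory SQLite table (the real script targets the PostgreSQL search_metadata table; the column names here are simplified assumptions):

```python
import sqlite3

def refresh_topics(conn, topics):
    """Replace all topic rows: clear the old entries, then insert the new set.

    Running this repeatedly always leaves exactly one row per current topic,
    which is what makes re-running the scraper safe.
    """
    cur = conn.cursor()
    cur.execute("DELETE FROM search_metadata WHERE type = 'topic'")
    cur.executemany(
        "INSERT INTO search_metadata (type, name) VALUES ('topic', ?)",
        [(t,) for t in topics],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_metadata (type TEXT, name TEXT)")
refresh_topics(conn, ["income", "housing"])
refresh_topics(conn, ["income", "housing", "glossary"])  # safe to re-run
count = conn.execute("SELECT COUNT(*) FROM search_metadata").fetchone()[0]
```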
The glossary page on Census Reporter is now scraped as well, and
it is indexed as a topic page along with the regular topics.
Removed query_script.py, as there have been many changes to the API
and to the database that were not reflected in the script. Furthermore,
it was reliant on the individual profile and table query scripts, which
themselves have since been deleted. Because of all this, and the fact
that its original purpose was only for testing, there is no reason to
keep it around.

full-text-guide.md has been updated to reflect this.
topic_scraper.py now adds the URL to search_metadata. It was necessary
to include this in the database because there is no standard format
for topic page URLs that *always* works, so they cannot be generated
on the fly. If the URLs change, the script will need to be re-run, but
I do not anticipate that happening.

Fixed some whitespace as well.
The full text search endpoint now returns topics as well as profiles
and tables. There is still work to be done on ranking: it is difficult
to give topics precedence without flooding the results with topics
alone (for queries like 'income', which return 11 topic pages and a
ton of tables). More refactoring work remains as well.
Refactor table_search(), profile_search(), and topic_search() into
one function. There was a lot of repeated code, and it is cleaner to
have a single search function that accepts a "type" parameter.
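The shape of that refactor might look like the following sketch (the row structure and function name are hypothetical, not the actual api.py code, and the database query is stood in for by a list of dicts):

```python
def perform_search(rows, query, object_type=None):
    """Single search entry point replacing the three per-type functions.

    `rows` stands in for search_metadata; each row carries a 'type'
    field, so one function can serve all three former call sites by
    varying only the type filter.
    """
    query = query.lower()
    hits = [r for r in rows if query in r["document"].lower()]
    if object_type is not None:
        hits = [r for r in hits if r["type"] == object_type]
    return hits

rows = [
    {"type": "table", "document": "Median Household Income"},
    {"type": "topic", "document": "Income and earnings"},
]
income_tables = perform_search(rows, "income", object_type="table")
```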
Refactor compute_profile_score() and compute_table_score() into
one function, compute_score(). This reduces duplicated code in the
score computation. In addition, the function takes a row directly,
rather than individual attributes of the row (like population or
priority), so it is resilient to changes in the query format.
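A sketch of the row-based signature (the field names and weighting here are hypothetical, not the actual scoring formula):

```python
def compute_score(row):
    """Score one search result from the row itself.

    Taking the whole row (a dict) instead of unpacked arguments means
    new columns in the query don't change this function's signature.
    """
    relevance = row.get("rank", 0.0)
    if row.get("type") == "profile":
        # Larger places should outrank smaller ones with the same text match.
        relevance += row.get("population", 0) / 1e7
    return relevance

score = compute_score({"type": "profile", "rank": 0.5, "population": 1_000_000})
```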
    try:
        app.s3 = S3Connection()
    except Exception as e:
        app.s3 = None
        app.logger.warning("S3 Configuration failed.")

Can this be app.logger.exception() so that we get the exception logged out in addition to the message? That would help with debugging if something is broken.

The profile URLs returned now use only the geoid, rather than the
geoid and display name as before. The Census Reporter app will
interpret a geoid-only URL correctly and redirect to the "right" URL.
This was done to avoid having to slugify display names and risk their
being wrong in the future.
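URL construction then needs nothing but the geoid; a minimal sketch (the geoid shown is just an example value):

```python
def profile_url(geoid):
    """Build a profile URL from the geoid alone.

    Census Reporter resolves a geoid-only URL and redirects to the
    canonical slugged version, so no display-name slug is needed here.
    """
    return "/profiles/%s" % geoid

url = profile_url("16000US3637000")  # example geoid
```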
Group score computation logic into one block, rather than having
three separate blocks for each object type.
Add a note that if the search_metadata table is rebuilt, the topic
scraper must be run again as well.

Add table IDs and geoids to the 'document' in the search table, so
users can enter these and see results returned.
Table scraping now happens when the TopicPageParser is initialized,
using a regex search for table codes. This is much simpler than
checking all of the strings found to see if they're a table.
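A sketch of that regex scan (the exact pattern in the scraper may differ; ACS table codes look like B01001 or B19013A):

```python
import re

# ACS table codes: B or C, five digits, optional race-iteration letter.
TABLE_CODE_RE = re.compile(r"\b[BC]\d{5}[A-Z]?\b")

def find_table_codes(text):
    """Return the unique table codes referenced in a page's text."""
    return sorted(set(TABLE_CODE_RE.findall(text)))

codes = find_table_codes("See tables B01001 and B19013A for income detail.")
```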
table_tester scrapes all topic pages and checks every table referenced
on those pages for existence in census_tabulation_metadata.
The HTMLStripper class removes HTML tags from a page, leaving only
the contents of those tags (i.e., the raw data). This is called in
the function that finds all tables in a page.
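Such a class is a thin subclass of the standard-library HTMLParser; a sketch (shown in Python 3 syntax, though the repo at the time targeted Python 2):

```python
from html.parser import HTMLParser

class HTMLStripper(HTMLParser):
    """Collect only the text content of a page, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_data(self, data):
        # Called for the text between tags; tags themselves are skipped.
        self.fragments.append(data)

    def get_data(self):
        return "".join(self.fragments)

stripper = HTMLStripper()
stripper.feed("<p>Median <b>household</b> income</p>")
text = stripper.get_data()
```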
@JoeGermuska JoeGermuska merged commit cc887c5 into master Aug 14, 2016