Research code for TREC Genomics 2006 at UC Berkeley.


Here is my research code for TREC Genomics 2006. 
It includes the steps needed for data preparation, citation analysis, parsing, annotating, indexing, and search.
The data (about 162,259 articles from the 49 journals) can be downloaded here:

Below is the README file which was included at the time.


Author: Alex Ksikes

==== (0) Programs installed ==== 

/projects/bebop/usr/local/gcc-3.4.4 --- Used to compile PyLucene
/projects/bebop/usr/local/lib/python2.4 --- python version for PyLucene

==== (1) Data normalization ====

Description: normalize the trec data
Usage: python out_folder trec_zipfiles

Description: compute the spans of a given html_file
Usage: python html_file

Description: remove all html tags
Usage: python html_file
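The script names were not preserved in this README, so as a rough sketch, the tag-removal step could be done with the standard-library HTML parser (shown in Python 3 here, though the project originally ran on Python 2.4):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect only the text content of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html):
    """Return the input HTML with all tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

print(strip_tags('<p>Gene <b>BRCA1</b> expression</p>'))
```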

==== (2) Citation/link analysis ====

Description: Grab PubMed links from the trec data
Usage: python trec_zipfiles output_file
Writes a citation_file

Description: Writes a dot file to be read by Graphviz
Usage: python citation_file dot_file
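The citation_file format is not documented here; assuming one (citing_pmid, cited_pmid) pair per edge, a minimal sketch of the dot-file writer might look like:

```python
def write_dot(edges, out_path):
    """Write a directed citation graph in Graphviz DOT format.

    `edges` is an iterable of (citing_pmid, cited_pmid) pairs; the
    two-column citation_file layout is an assumption.
    """
    with open(out_path, "w") as f:
        f.write("digraph citations {\n")
        for src, dst in edges:
            # Quote ids so numeric pmids are valid DOT node names.
            f.write('  "%s" -> "%s";\n' % (src, dst))
        f.write("}\n")
```

The resulting file can then be rendered with e.g. `dot -Tps citations.dot`.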

==== (3) Parsing - Html tag analysis ====

# From internal links of the form <a href="#tags">anchored_text</a>
# We get the following stats:
# tags (exist_in_article_count, total_count)
# anchored_text (exist_in_article_count, total_count)
# coo_tags_anchored_text (tags, anchored_text, count)
# article_size (article, number_of_char)
# tags_article (tags, article)
# anchored_text article (anchored_text, article)
# jrl_art_count (journal_name, number_articles)
Usage: python output_name trec_zipfiles
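A minimal sketch of how the per-article link statistics above could be tallied from internal links, using a regex over the raw HTML (the original script is not shown here):

```python
import re
from collections import Counter

# Matches internal links of the form <a href="#tags">anchored_text</a>.
INTERNAL_LINK = re.compile(r'<a\s+href="#([^"]+)"[^>]*>(.*?)</a>',
                           re.IGNORECASE | re.DOTALL)

def link_stats(html):
    """Count internal-link targets and anchor texts in one article.

    Returns (tag_counts, text_counts, cooccurrence_counts) as Counters,
    corresponding to the total_count columns described above.
    """
    tags, texts, coo = Counter(), Counter(), Counter()
    for tag, text in INTERNAL_LINK.findall(html):
        text = text.strip()
        tags[tag] += 1
        texts[text] += 1
        coo[(tag, text)] += 1
    return tags, texts, coo
```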

Older versions:

==== (4) Parsing - Using Martin's XML ====

== Error checking - Fixing ==

Description: Fix Martin's xml by stripping out all unrecognized entities. 

Description: Check xml validity, flag abnormal ids, and flag articles where
the ratio of text from xml over text from html is < 0.5

== Section extraction - categorizing ==

Description: Grab all section ids
Usage: python xml_data

Description: Categorize each clustered section
Usage: python sections_clustered

==== (5) Section annotation ====

Description: Get the spans of each section
Usage: python xml_path
Output section spans.

Description: Map section_spans to span_ids
Usage: python span_ids section_spans
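The span_ids and section_spans file formats are not documented here; assuming both reduce to (start, end) character offsets, the mapping step might be sketched as containment of annotated spans inside section spans:

```python
def map_sections_to_spans(section_spans, span_ids):
    """For each section span, collect the ids of spans it fully contains.

    Both inputs use (start, end) character offsets; containment as the
    mapping criterion is an assumption, since the original file formats
    are not documented in this README.
    """
    mapping = {}
    for name, (s_start, s_end) in section_spans.items():
        mapping[name] = [sid for sid, (start, end) in span_ids.items()
                         if s_start <= start and end <= s_end]
    return mapping
```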

==== (6) Medline abstracts ====

Many abstracts from the xml data were misclassified. Medline was
used to correct them.

Run these under nepal:

Description: Annotate the span ids as abstract using medline
Usage: python normalized_data

Description: Finds span_id of a given medline abstract
Usage: python abstract normalized_article

A quick and dirty term vector implementation used to locate
Medline abstracts in our normalized data.

==== (7) Search engine - Indexing ====

# Index the normalized data with the annotation_files
# An annotation file has the following tab delimited format:
# pmid span_id annotation_id annotation_text
# annotation_id is what will be indexed (for now)
# annotation_text is the actual annotated text
# The field names are hard coded and should be given in order
Usage: python data_norm index_dir annotation_files
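Given the tab-delimited format above, the annotation files could be read with a sketch like the following before feeding fields to the Lucene indexer (the grouping by pmid is an assumption about how indexing consumes them):

```python
def read_annotations(path):
    """Parse a tab-delimited annotation file into per-pmid records.

    Each line is: pmid \t span_id \t annotation_id \t annotation_text
    """
    by_pmid = {}
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # annotation_text may itself contain tabs, so split at most 3 times.
            pmid, span_id, ann_id, ann_text = line.split("\t", 3)
            by_pmid.setdefault(pmid, []).append((span_id, ann_id, ann_text))
    return by_pmid
```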

Description: Allows searching any index and outputs results as a table.
Usage: python -i lucene_index -f field_name_1,field_name_2,...,field_name_n -q 'the_query' -s start_page -r nb_of_results

You have to use /projects/bebop/usr/local/bin/python

To allow a programmatic interface from any language, I chose the following
architecture:

|---------------| url (REST queries)  |--------| searcher  |--------|
| Web interface |-------------------->| Small  |---------->| Lucene |
|               |                     | web    |           |        |
| Other clients |<--------------------| server |<----------| index  |
|---------------|   formatted output  |--------|    hits   |--------|

Description: Launch the web server that takes REST calls and searches
the lucene index given by index_dir
Usage: python index_dir

Description: A small command line client to the server.

==== (8) Programmable interface ====

The search engine supports a REST API. 
Read the output data at the following url:

port = port number of the server
query = urlencoded query
start = start page
results = number of results per page
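The server URL itself was not preserved in this README, so as a sketch, a REST query string could be assembled from the parameters above like this (the host, `/search` path, and `port` handling are assumptions):

```python
from urllib.parse import urlencode

def build_search_url(host, port, query, start=1, results=10):
    """Build a REST search URL from the parameters listed above.

    The '/search' path is hypothetical; adjust to the server's
    actual route.
    """
    params = urlencode({"query": query, "start": start, "results": results})
    return "http://%s:%d/search?%s" % (host, port, params)

print(build_search_url("localhost", 8080, "p53 tumor suppressor"))
```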

The output data:

I wanted the server to output xml, but no DOM extension for PHP was installed.
The web interface would have needed this extension, so I used this quick and
dirty output format instead.


where hit_i is:

==== (9) Web interface ====
q is the query
s is the start page
r is the number of results

bebop/public_html/trec/index.php --- Main php script
bebop/public_html/trec/tools.php --- Some useful functions   

==== (10) Search engine - own implementation ====

I started by writing a small search engine that would fit the
entire index in memory.

The files are:
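The original files are not listed here, but the idea of an index that fits entirely in memory can be sketched as a tiny inverted index with AND-queries (details of the actual implementation are assumptions):

```python
import re
from collections import defaultdict

class MemoryIndex:
    """Tiny in-memory inverted index: term -> set of doc ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Add a document's terms to the postings lists."""
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        """AND-query: return ids of docs containing every query term."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return set()
        result = self.postings[terms[0]].copy()
        for term in terms[1:]:
            result &= self.postings[term]
        return result
```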

==== (11) Passage scoring  ====
Description: Quickly get a span from the collection given its span_id
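As a sketch of the span lookup, assuming a table mapping span_id to (pmid, start, end) character offsets and some way of reading a normalized article by pmid (both assumptions about the undocumented on-disk layout):

```python
def get_span(span_table, read_article, span_id):
    """Fetch the text of a span given its id.

    `span_table` maps span_id -> (pmid, start, end) character offsets;
    `read_article` returns the normalized text for a pmid. Both are
    hypothetical interfaces standing in for the real data files.
    """
    pmid, start, end = span_table[span_id]
    return read_article(pmid)[start:end]
```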