
Merge branch 'master' of github.com:dbieber/coursesearch

2 parents 0a4b46b + d741e7c commit 87df83f7b0570c948b244d989ddba181a4d4c2d5 @dbieber committed May 15, 2012
Showing with 5 additions and 5 deletions.
  1. +5 −5 Project.tex
10 Project.tex
@@ -142,7 +142,7 @@ \subsection{Registrar Scraper}
Our registrar scraper operates in three tiers and is highly specific to the job of scraping course data. A visual overview of the registrar scraper's logic is provided in the accompanying figure. The first tier contains a single document, the registrar's main page, from which we scrape a list of department pages. The second tier is made of these department pages, each of which contains a list of courses. We begin gathering course-specific information on the second tier, as the department pages contain information about when the classes meet, how large the classes are, and what distribution areas they satisfy. Additionally, and most importantly, we crawl the second tier for links to pages with additional information about the classes. These class-specific pages form the third tier of the registrar scraper. For each page in the third tier we gather as much information as possible about the individual course represented. We can usually find most of the following information about each class: the professor teaching it, a description of the class, a sample reading list, any prerequisites, the way the class is graded, P/D/F and audit options, the amount of work to expect, when the course is offered, and sometimes even more. All of the information about the classes gathered during tiers two and three is aggregated and serialized to disk so that it can later be used by the indexer.
-The course data for the whole registrar is 1649kB. While scraping the whole registrar, 6 department pages failed to load and 52 course pages failed to load on the first try. Our scraper handled these failures gracefully and reconnected to these pages at a later time. The time of day at which we scrape may be a factor. All but 1 link succeeded on the second iteration. The last one, Macroeconomics/Int'l Finance Workshop, succeeded on the third iteration. There are 1201 total courses and 96 departments (from the perspective of our scraper -- Freshman Seminars and Writing Seminars are considered departments by it). This gives us a $4.3\%$ error rate for courses, $6.3\%$ error rate for department pages, and a $4.5\%$ error rate overall.
+The course data for the whole registrar is 1649kB. While scraping the whole registrar, 6 department pages failed to load and 52 course pages failed to load on the first try. Our scraper handled these failures gracefully and reconnected to these pages at a later time. The time of day at which we scrape may be a factor. All but 1 link succeeded on the second iteration. The last one, Macroeconomics/Int'l Finance Workshop, succeeded on the third iteration. There are 1201 total courses and 96 departments (from the perspective of our scraper -- Freshman Seminars and Writing Seminars are considered departments by it). This gives us a $4.3\%$ failure rate for courses, $6.3\%$ failure rate for department pages, and a $4.5\%$ failure rate overall.
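The three-tier crawl with deferred retries might look like the following minimal sketch in Java, assuming jsoup for fetching and parsing; the URL, CSS selectors, and class name are illustrative placeholders, not the project's actual code.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class RegistrarScraper {
    public static void main(String[] args) throws IOException {
        // Tier 1: the registrar's main page lists the department pages.
        Document main = Jsoup.connect("http://registrar.example.edu/").get();
        List<String> courseLinks = new ArrayList<String>();
        for (Element dept : main.select("a.department")) {        // assumed selector
            // Tier 2: each department page lists its courses (meeting times,
            // sizes, distribution areas) and links to course-specific pages.
            Document deptPage = Jsoup.connect(dept.absUrl("href")).get();
            for (Element course : deptPage.select("a.course")) {  // assumed selector
                courseLinks.add(course.absUrl("href"));
            }
        }
        // Tier 3: visit each course page; failed loads are queued and
        // retried on a later pass rather than aborting the crawl.
        List<String> pending = courseLinks;
        for (int pass = 0; pass < 3 && !pending.isEmpty(); pass++) {
            List<String> failed = new ArrayList<String>();
            for (String url : pending) {
                try {
                    Document coursePage = Jsoup.connect(url).get();
                    // ... extract professor, description, prerequisites, etc.
                } catch (IOException e) {
                    failed.add(url);  // reconnect on the next pass
                }
            }
            pending = failed;
        }
    }
}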
\subsection{Indexer}
@@ -165,10 +165,7 @@ \subsection{Query Parser}
For times, we perform two special operations. We must convert ranges to sets of values and put these values into military time. We also search for special time search terms such as ``afternoon'' or ``noon'' and use them to set ranges of times. We do not remove these terms from the query, in case time is not what the query term intends. Once we have extracted the Princeton-specific information, we pass our new query to Lucene's multi-field query parser. This parser takes a query string and expands it into a form suitable for searching a Lucene index. Specifically, for each search term for which a field is not specified, it constructs the query such that the searcher will search all specified fields for the term.
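The hand-off to Lucene might look like the following minimal sketch against a recent Lucene release (the 2012-era API differs slightly); the field names and example query are assumptions, not the paper's actual schema.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.Query;

public class ParseDemo {
    public static void main(String[] args) throws ParseException {
        // Unfielded terms ("painting") are expanded across every listed
        // field; fielded terms ("professor:smith") go to that field only.
        String[] fields = {"title", "description", "professor", "distribution"};
        MultiFieldQueryParser parser =
                new MultiFieldQueryParser(fields, new StandardAnalyzer());
        Query query = parser.parse("painting professor:smith");
        System.out.println(query);
    }
}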
\subsection{Search Engine}
-At this level, the user types a query into the search bar. The query is sent to the query parser which puts it into a form appropriate for the index. The searcher is a Lucene object created from the index. For a given query, it retrieves information from the index and compiles a list ranked by score. ??? We then take these ranked results and re-order them according to our own ranking scheme. ???
-
-This search engine is a Chrome extension and accompanies the existing ICE tool. Our extension takes over the search bar on ICE and submits queries to our server.
-
+At this level, the user types a query into the search bar. The query is sent to the query parser, which puts it into a form appropriate for the index. The searcher is a Lucene object created from the index. For a given query, it retrieves information from the index and compiles a list ranked by score. Because we customized the boosts at indexing time, the searcher returns our course-specific results. This search engine is a Chrome extension that accompanies the existing ICE tool. Our extension takes over the search bar on ICE and can submit queries to our server.
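The search step itself might look like the following sketch against a recent Lucene release; the index path, result count, and stub query are placeholders (in practice the query comes from the query parser above).

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class SearchDemo {
    public static void main(String[] args) throws Exception {
        // Open the on-disk index written by the indexer (path assumed).
        DirectoryReader reader =
                DirectoryReader.open(FSDirectory.open(Paths.get("course-index")));
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new TermQuery(new Term("title", "painting"));  // stub query
        // Retrieve the top hits, ranked by score (boosts applied at index time).
        TopDocs top = searcher.search(query, 20);
        for (ScoreDoc hit : top.scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("title") + "  " + hit.score);
        }
        reader.close();
    }
}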
\subsection{Ranker}
@@ -244,6 +241,9 @@ \subsection{Comparison with Existing Search Engines}
At a broader scale, our search engine is simply more powerful than both existing tools. We can search for a variety of options that neither of the other two can and we parse complex queries to deliver more relevant results. The Registrar and ICE are functional when you know what courses you want to see. However, when you only know the characteristics of the courses you want to take, our search engine can find those relevant courses for you. In this respect, we are more powerful than both alternatives.
+\subsection{Additional Results}
+
+
\section{Future Work}
\subsection{Personalization}
