Commit d741e7c: Merge branch 'master' of github.com:dbieber/coursesearch (abbiward, May 15, 2012; 2 parents 8cb8a6a + 0b71fab; 1 changed file, Project.tex, 17 additions and 4 deletions)

\subsubsection{Course Ranking}
At indexing time, we adjusted the fieldBoost for the following fields: course abbreviation, distribution area, title, and PDF. We chose these fields because they tend to be most relevant to a user's query. We attempted to determine the boost values experimentally, using a technique loosely resembling gradient descent. Since the default fieldBoost value is 1.0, indicating no boost, we first ran our algorithm without any boosting; the results were usable, but the ranking was less than adequate.
With a combination of logical reasoning and simple experimentation with the numbers, we adjusted the boost values to improve the rankings our search engine generated. When moving a value too far in one direction made the results worse, we adjusted the fieldBoost values to correct for the change, iterating until we were satisfied with the results.
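The tuning loop above can be sketched with a toy scoring function: a hit's score is a weighted sum of per-field match counts, and raising one field's boost raises the rank of documents matching on that field. This is a minimal illustration, not our actual Lucene scoring; the field names and boost values below are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoostSketch {
    // Toy scoring function: weighted sum of per-field match counts.
    // Field names and boost values are illustrative, not the shipped ones.
    static double score(Map<String, Integer> fieldMatches, Map<String, Double> boosts) {
        double s = 0.0;
        for (Map.Entry<String, Integer> e : fieldMatches.entrySet()) {
            // Fields without an explicit boost keep the default weight of 1.0.
            s += boosts.getOrDefault(e.getKey(), 1.0) * e.getValue();
        }
        return s;
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("courseAbbreviation", 4.0);  // hypothetical boost
        boosts.put("title", 2.0);               // hypothetical boost

        Map<String, Integer> hit = new LinkedHashMap<>();
        hit.put("courseAbbreviation", 1);  // query term matched the abbreviation
        hit.put("description", 1);         // unboosted field, default weight 1.0

        // The abbreviation match dominates: 4.0 * 1 + 1.0 * 1 = 5.0
        System.out.println(score(hit, boosts));
    }
}
```

Nudging one boost at a time and re-checking a few reference queries, as described above, amounts to coordinate-wise hill climbing on these weights.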

\section{Implementation Choices}
We chose to build our search engine on top of Apache Lucene, as it provides a powerful library for performing text-based search. We considered Sphinx as an alternative, but chose Lucene due to its vibrant community offering support and our own familiarity with Java as compared to C++. In forming the architecture of our search engine, we chose the components described previously in order to make our code base more modular and manageable. For instance, our choice to separate out our query-parsing modules allows us to later extend our search engine to handle additional signals.

In order to implement our scraper, we chose to use the open-source library JSoup over two alternatives: Kevin Wayne's \emph{In} module and another open-source project, JTidy. JSoup proved to be cleaner and better documented than either alternative, and was more powerful for traversing the particular type of HTML document we needed to scrape the course data from.

For our scraper, we initially chose a sequential scraping model. All the courses in a particular department were scraped one after another, and the departments were scraped in order as well. Later we changed our scraping model to take after Mercator, using queues to keep track of which URLs still must be scraped, adding to them as more courses are discovered or as HTTP requests fail. This allows us to handle connection failures more gracefully and also gives us the option of parallelizing our scraper, which would be useful if we take on a larger task such as scraping every major university's course data.
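The queue-based frontier can be sketched as follows. This is a simplified sketch, not our scraper's actual code: class and method names are hypothetical, and fetching is omitted.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class FrontierSketch {
    // Queue-based URL frontier in the spirit of Mercator: newly discovered
    // links are enqueued, failed fetches are re-enqueued, and a seen-set
    // keeps each URL from being scheduled twice.
    private final Queue<String> frontier = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    public void discover(String url) {
        if (seen.add(url)) {       // true only the first time we see this URL
            frontier.offer(url);
        }
    }

    public void retry(String url) {
        frontier.offer(url);       // a failed HTTP request goes back on the queue
    }

    public String next() {
        return frontier.poll();    // null once nothing is left to scrape
    }
}
```

Because the frontier is the only shared state, parallelizing would mainly mean guarding these queues (or swapping in a concurrent queue) and running several fetch workers against them.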

For code sharing we

\section{Experiments}

We tested our index without fieldBoosts against the index with fieldBoosts. As the score is only affected when relevant fields are selected, we chose queries that included those fields. We found that the ranking with the boosts was improved. We used the query ``arc technology'' as our means of comparison. Based on how we boosted fields, this increases the importance of the `ARC' term relative to the `technology' term. \\
\section{Future Works}
\subsection{Personalization}

\subsubsection{Personal Data}

The following is the personal data we could use to implement personalized search results. These are data we theoretically have access to on the ICE platform.
\begin{enumerate}
\item Course History

\item Know which prerequisites you've filled and give a warning before adding a course whose prerequisites are unmet.

\item Boost the rankings of those classes your friends have rated highly, and those classes rated highly by users we deem similar to you.

\end{enumerate}
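The friend-rating idea in the list above could be realized as a simple re-ranking pass over the hit list. This is a hypothetical sketch under assumed inputs (a base Lucene score, a list of friends' ratings, and an illustrative weight), not an implemented feature.

```java
import java.util.List;

public class FriendBoostSketch {
    // Hypothetical re-ranking rule: bump a hit's score by the average rating
    // the user's friends gave the course, scaled by an illustrative weight.
    static double rerank(double baseScore, List<Double> friendRatings, double weight) {
        if (friendRatings == null || friendRatings.isEmpty()) {
            return baseScore;  // no friend signal: leave the score unchanged
        }
        double avg = friendRatings.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
        return baseScore + weight * avg;
    }
}
```

The same shape would work for "similar users": only the source of the ratings list changes, so the signal could be swapped without touching the ranking code.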


\subsection{User Interface}
Because our engine takes free-text queries rather than database queries, the results we return are ranked by best fit rather than filtered by a yes-or-no match. Thus, if the user enters several terms, we may return results that do not exactly satisfy the information need.
To help address this from the user's perspective, we could display the relevant fields searched alongside each result. For instance, if I search ``CLA MW 330 pdfonly'', then in addition to listing CLA XXX - TITLE, the time/days and PDF options would also be displayed. This lets the user determine at a glance whether the search has been satisfied in an acceptable way.
Since we have the score of each document in the returned results list, we could compute a percent match for the query to give the user a sense of expected relevance. We could also implement feedback, so that the user could give us a thumbs up for relevant or interesting courses.
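One simple way to turn raw scores into a percent match is to normalize each hit's score by the top hit's score. A minimal sketch, assuming we already have both scores in hand; the class and method names are illustrative.

```java
public class PercentMatchSketch {
    // Express each hit's score as a percentage of the top hit's score, as a
    // rough "expected relevance" cue to show next to each result.
    static int percentMatch(double score, double topScore) {
        if (topScore <= 0.0) {
            return 0;  // avoid dividing by zero on an empty result list
        }
        return (int) Math.round(100.0 * score / topScore);
    }
}
```

Normalizing by the top hit makes the numbers comparable across queries, since absolute scores vary with query length and term rarity.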

As we hope to fully integrate this with ICE, our primary flexibility is with the ranking order and details per hit. We have already created a Chrome extension that takes over the search bar and can send a request with the query to our own server.

\subsection{Additional Signals}

There are a variety of other sources we could use to help refine our results. We could use SCORE evaluations and the student course guide to help students choose courses with lighter workloads or courses that are often reviewed positively. Additionally, data on friends' course histories could also serve as a ranking signal. We could also consider signals outside ICE, such as what's trending on Twitter or on PrincetonFML, a site where students go to complain, celebrate, and share snippets of their lives.

\section{Conclusion}

We created a useful tool for Princeton students. In order to create it, we had to learn about web scraping, and we built our own Princeton-specific web crawler. Toward the end of the project, we moved to the Mercator model, using queues to determine which links to crawl. Our search engine is designed to meet students' information needs. In addition to basic searches, we can successfully search for desired PDF and audit options. We can also search by reading amount and under schedule constraints. We've also created code that is easy to maintain and extend as we incorporate more signals.


\appendix