A-Small-Search-Engine

In this project, I have built the basic data structure underlying search engines: an inverted index. I have used this inverted index to answer some simple search queries.

An inverted index for a set of webpages

Suppose we are given a set of webpages W. For our purposes, each webpage w ∈ W will be considered to be a sequence of words w₁, w₂, . . . w_k. Another way of representing the webpage could be to maintain a list of words along with the position(s) of the words in the webpage. For example consider a web page with the following text:

Data structures is the study of structures for storing data.

This can be represented as

{ ( data : 1, 10 ), ( structures : 2, 7 ), ( study : 5 ), ( storing : 9 ) }.

Note that the small connector words like “is”, “the”, “of”, “for” have not been stored. Words like this are referred to as stop words and are generally removed since they are very frequent and normally contain no information about the content of the webpage.

This representation of the webpage is similar to the index we see at the back of many books which tell us the page numbers where certain important terms used in the book may be found. In fact, we can refer to this as an index for the webpage. In mathematical notation we would say that given a webpage w = w₁, w₂, . . . , w_k, the index of w is

{ ( u : i₁( u ), . . . i_l( u ) ) : w_{i_j}(u) = u, 1 ≤ j ≤ l }.

An index is used to find the location of a particular string (word) in a specific document or webpage, but when we move to a collection of webpages, we need to first figure out which of the web pages contain the string. For this we store an inverted index. Let us try to define an inverted index formally.

Let us suppose we are given a collection C of webpages. For each page p ∈ C, let us denote by W(p) the set of all words (excluding stop words) that occur in p.

An inverted index for C will contain an entry for each word w ∈ W(C). This entry will contain tuples of the form (p, k) to indicate that w occurs in the kth position of page p ∈ C. Using the notation that p[k] denotes the kth word of page p, we can say that the inverted index of C is defined as

Inv(C) = {(w : {(p, k) : p ∈ C, p[k] = w}) : w ∈ W(C)}.

For example, consider the following (small) collection of documents.

Data structures is the study of structures for storing data.
Structural engineers collect data about structures.

The inverted index for this would be

{ ( data : { ( 1, 1 ), ( 1, 10 ), ( 2, 4 ) } ), ( structures: { ( 1, 2 ), ( 1, 7 ), ( 2, 6 ) } ), ( study : { ( 1, 5 ) } ), ( storing : { ( 1,9 ) } ), ( structural : { ( 2, 1 ) } ), ( engineers : { ( 2, 2 ) } ), ( collect : { ( 2, 3 ) } ) }

The web search problem

The web search problem is defined as follows:

Given a collection of webpages C and a sequence of words q₁ . . . q_k, find the “most relevant” set of pages p₁, p₂, . . . p that contain as many of q₁ . . . q_k as possible and return them in the order of decreasing “relevance”.

The question of how to measure the relevance of a webpage to a particular query is an involved question with no easy answers. However, for the purpose of this assignment we will work with a simple scoring function.

A scoring function for search term relevance

One of the simplest scoring function is term frequency-inverse document frequency. It is used to measure how important a word w is to a webpage p. It is a product of two factors i.e. term frequency and inverse document frequency. Given a word w and a webpage p, the relevance score is defined as

relevance_w(p) = tf_w(p) x idf_w(p)

Term Frequency: It is the total number of occurrence of a word w in a webpage p, denoted by f_w(p). It is normalized by the total number of words in webpage p, denoted by |W(p)|. It is defined as

tf_w(p) = f_w(p)/ |W(p)|

Inverse document frequency: It is the logarithm of the total number of webpages, denoted by N divide by the logarithm of the number of webpages n_w(p) that contain the word w. It is defined as

idf_w(p) = log(N/n_w(p))

So, if we are given a search query that has a single term, say w, to return the webpages in order of relevance we have to first extract the entry corresponding to w from Inv(C) and then calculate the relevance of each page and return the pages in decreasing order of relevance.

Compound searches

In this assignment we will answer three kinds of search queries: AND queries, OR queries and phrase queries. We now describe these three along with their scoring methodology.

• OR queries: Given a search query q₁ . . . q_k, any page that contains any of the words q₁ to q_k is a valid answer. The relevance score of a page p is computed as

relevance_{q₁...q_k} (p) = ∑ relevance_{q_i} (p),

and pages are returned in decreasing order of relevance. Note that if some q_i does not occur in page p the relevance_{q_i}(p) = 0.

• AND queries: Given a search query q₁ . . . q_k, any page that contains all of the words q₁ to q_k is a valid answer. The relevance score of a page p is computed as

relevance_{q₁...q_k} (p) = ∑ relevance_{q_i} (p),

and pages are returned in decreasing order of relevance.

• Phrase queries: Given a search query q₁ . . . q_k, any page that contains q₁ in position l, q₂ in position l+1 and so on till q_k in position l+k−1 is said to contain the phrase q₁ . . . q_k at the position . Suppose in a webpage p having |W(p)| words, the phrase q₁ . . . q_k occurs m times then the relevance score of page p for this phrase is

relevance_{q₁...q_k} (p) = tf_{q₁...q_k}(p) x idf_{q₁...q_k}(p) = {m /(|W(p)| − (k − 1) x m )} x log (N/n_w(p))

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
webpages		webpages
.gitignore		.gitignore
AVL.java		AVL.java
InvertedPageIndex.java		InvertedPageIndex.java
MyHashTable.java		MyHashTable.java
MyLinkedList.java		MyLinkedList.java
MySort.java		MySort.java
Myset.java		Myset.java
NotInSetException.java		NotInSetException.java
PageEntry.java		PageEntry.java
PageIndex.java		PageIndex.java
Position.java		Position.java
README.md		README.md
SearchEngine.java		SearchEngine.java
SearchResult.java		SearchResult.java
WordEntry.java		WordEntry.java

anchittandon/A-Small-Search-Engine

Folders and files

Latest commit

History

Repository files navigation

A-Small-Search-Engine

An inverted index for a set of webpages

The web search problem

A scoring function for search term relevance

Compound searches

About

Topics

Resources

Stars

Watchers

Forks

Languages