# [Lucene](http://lucene.apache.org/)

- Originally written in 1999;<br>
- Open Source;<br>
- Inverted Index + Search;<br>
- Advanced search options: synonyms, stopwords, autocomplete, facets, fuzzy search, etc.;<br>
- Used by: Apple, Twitter, LinkedIn, Eclipse, IBM, [etc](http://wiki.apache.org/lucene-java/PoweredBy); <br>
- Can be used for implementing recommender systems - supports a MoreLikeThis functionality; <br>
- Supports many types of file formats. <br>
- _Solr_ and _Elasticsearch_ are full-text search servers built on top of Lucene; they add scalability, fault tolerance, caching and more. <br>
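
To make the "Inverted Index + Search" bullet concrete, here is a minimal sketch in plain Java (no Lucene dependency). The class `ToyInvertedIndex` and its methods are hypothetical names invented for this illustration; Lucene's real index is a far more elaborate on-disk structure.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy inverted index: maps each lowercased term to the sorted ids of the documents containing it.
// Hypothetical illustration only, not how Lucene stores its index.
class ToyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Adds a document and returns the id assigned to it.
    int add(String text) {
        int docId = nextDocId++;
        // Split on anything that is not an ASCII letter or digit
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of documents that contain every query term (boolean AND).
    List<Integer> searchAll(String... terms) {
        TreeSet<Integer> result = null;
        for (String term : terms) {
            TreeSet<Integer> docs =
                postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs); // copy, so the postings list is not modified
            } else {
                result.retainAll(docs);       // intersect with the next postings list
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }
}
```

A lookup touches only the postings lists of the query terms rather than scanning every document, which is the point of inverting the document-to-terms relation.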

# Flow
## Index Structure
- Index contains documents;<br>
- Documents consist of fields;<br>
- Each field has attributes:<br>
    - data type;<br>
    - how its content is processed (analyzers);<br>
    - whether it is stored and whether it is indexed;<br>
    
![Lucene Flow](img/lucene-flow.png)

# This notebook configuration

This notebook uses a Java kernel for Jupyter. 
You can execute the examples in three ways:
1. Open an IDE for Java and include the jar dependencies located in the lucene_jars folder.
2. Open the Binder link to the copy of this repo with a Java kernel setup. 
    - The notebook imports the dependencies with the %jars magic command.
    - To obtain the data, don't forget to run the ./data/get_20newsgroups.sh script
3. Install the [Java kernel](https://github.com/SpencerPark/IJava) for Jupyter locally.

In [1]:
// This is a java-kernel specific syntax, which allows for including jar dependencies
// The dependencies for Lucene are in the lucene_jars folder
List<String> added = %jars lucene_jars/*.jar
added

[/Users/pgencheva/git/information_retrieval_exercises/./lucene_jars/lucene-core-7.5.0.jar, /Users/pgencheva/git/information_retrieval_exercises/./lucene_jars/lucene-analyzers-common-7.5.0.jar, /Users/pgencheva/git/information_retrieval_exercises/./lucene_jars/lucene-queryparser-7.5.0.jar]

In [2]:
import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.analysis.core.WhitespaceTokenizerFactory;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.core.StopFilterFactory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.document.Field;

# Index Creation

## [Directory](http://lucene.apache.org/core/7_5_0/core/org/apache/lucene/store/Directory.html)
- Abstract class that represents the location of an index
- Some implementations: SimpleFSDirectory, MMapDirectory, RAMDirectory

In [3]:
Path indexPath = Paths.get("data", "lucene_index");
Directory indexDir = FSDirectory.open(indexPath);

## [Analyzer](http://lucene.apache.org/core/7_5_0/core/org/apache/lucene/analysis/Analyzer.html)
- The Analyzer implements the procedure of turning text into streams of tokens
- A pipeline of : 0 or more Char Filters + 1 Tokenizer + 0 or more Token Filters
- Some implementations: <br>
_“The quick brown fox jumped over the lazy dogs.”_<br>
    - __WhitespaceAnalyzer__ - splits tokens on whitespace<br>
The, quick, brown, fox, jumped, over, the, lazy, dogs.<br>
    - __SimpleAnalyzer__ - splits tokens on non-letters, and then lowercases<br>
the, quick, brown, fox, jumped, over, the, lazy, dogs<br>
    - __StandardAnalyzer__ - the most sophisticated of the three: recognizes certain token types, lowercases, and removes stop words<br>
quick, brown, fox, jumped, over, lazy, dogs<br>
- Some [Filters](https://lucene.apache.org/core/6_4_2/analyzers-common/org/apache/lucene/analysis/util/TokenFilterFactory.html) and [Tokenizers](https://lucene.apache.org/core/6_4_2/analyzers-common/org/apache/lucene/analysis/util/TokenizerFactory.html): <br>
    - __Standard Tokenizer__ (Other: WhitespaceTokenizer, KeywordTokenizer, LetterTokenizer)<br> 
The, quick, brown, fox, jumped, over, the, lazy, dogs<br>
    - __LowerCase Filter__<br>
the, quick, brown, fox, jumped, over, the, lazy, dogs<br>
    - __StopWords Filter__<br>
quick, brown, fox, jumped, over, lazy, dogs<br>
    - __PorterStem Filter__<br>
quick, brown, fox, jump, over, lazy, dog<br>
    - Other: SynonymFilter, CodepointCountFilter, SynonymGraphFilter, CommonGramsFilter, HyphenatedWordsFilter, WordDelimiterFilter
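
The tokenizer-plus-filters pipeline described above can be imitated in plain Java. This is a sketch only: `ToyAnalyzer`, its methods, and the three-word stop list are assumptions made for this example, and Lucene's own analyzers work on token streams rather than lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an analysis chain: 1 tokenizer followed by 2 token filters.
class ToyAnalyzer {
    // Tiny stop list assumed for this example; real stop lists are much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    // Tokenizer: split at every run of non-letter characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Token filter 1: lowercase every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(token.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    // Token filter 2: drop stop words (expects lowercased tokens).
    static List<String> removeStopWords(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    // The full chain, applied in order.
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(tokenize(text)));
    }
}
```

Calling `ToyAnalyzer.analyze("The quick brown fox jumped over the lazy dogs.")` yields `[quick, brown, fox, jumped, over, lazy, dogs]`. The order of the filters matters: the stop filter here only recognizes "The" because the lowercase filter ran first.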

In [4]:
// Now we can create a custom text Analyzer!
Analyzer analyzer = CustomAnalyzer.builder()
    .withTokenizer(WhitespaceTokenizerFactory.class)
    .addTokenFilter(LowerCaseFilterFactory.class)
    .build();

## IndexWriter
- Allows you to create a new index, open an existing one, and add, remove, or update documents in an index

In [5]:
IndexWriter writer = new IndexWriter(indexDir, new IndexWriterConfig(analyzer));

## Documents
- The building blocks of an index <br>
- Collection of named fields <br>

## Field types:
- __StringField__ - indexed but not tokenized <br>
- __TextField__ - indexed and tokenized
- __Others__: BinaryDocValuesField, BinaryPoint, DoublePoint, DoubleRange, FeatureField, FloatPoint, FloatRange, IntPoint, IntRange, LatLonDocValuesField, LatLonPoint, LongPoint, LongRange, NumericDocValuesField, SortedDocValuesField, SortedNumericDocValuesField, SortedSetDocValuesField, StoredField
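
The practical difference between "indexed but not tokenized" and "indexed and tokenized" is which terms end up in the index. The sketch below imitates this in plain Java; `ToyFieldIndexing` and its methods are hypothetical, and the tokenized variant assumes a whitespace tokenizer followed by lowercasing, as in the custom analyzer built in this notebook.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of the terms produced for a field value.
class ToyFieldIndexing {
    // Like a StringField: the whole value becomes exactly one term, unchanged.
    static List<String> untokenizedTerms(String value) {
        return Collections.singletonList(value);
    }

    // Like a TextField analyzed with whitespace tokenization + lowercasing:
    // every whitespace-separated piece becomes its own term.
    static List<String> tokenizedTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String piece : value.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }
}
```

For the value `Re: Rawlins debunks creationism`, the untokenized variant produces the single term `Re: Rawlins debunks creationism`, so only an exact match on the full value finds it. The tokenized variant produces `re:`, `rawlins`, `debunks`, `creationism`, so a search for `rawlins` matches. This is why identifiers and paths usually go into a StringField and free text into a TextField.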

In [6]:
private static final String CONTENT_FIELD_NAME = "content";

public void addDocument(IndexWriter writer, File file) throws IOException {
    Document doc = new Document();
    
    String content = new String(Files.readAllBytes(file.toPath()), Charset.defaultCharset());
    doc.add(new TextField(CONTENT_FIELD_NAME, content, Field.Store.YES));

    writer.addDocument(doc);
}

In [7]:
// Add each document of the "20 News Groups" corpus (alt.atheism category) to the index
File[] listOfFiles = new File("./data/mini_newsgroups/alt.atheism/").listFiles();
// listFiles() returns null when the directory does not exist
if (listOfFiles == null) {
    throw new IllegalStateException("Corpus not found; run ./data/get_20newsgroups.sh first");
}
for (File file : listOfFiles) {
    addDocument(writer, file);
}

In [8]:
// Don't forget to close the writer!
writer.close();

# Searching in the index

## IndexSearcher
- Exposes several search methods on an index

In [9]:
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(indexDir));

## QueryParser
- Parses a textual representation of a query into a Query instance

In [11]:
int hitsPerPage = 10;
String userQuery = "Christians celebrate the +Holy Friday.";
Query query = new QueryParser(CONTENT_FIELD_NAME, analyzer).parse(userQuery);

// references to the top documents returned by a search
TopDocs docs = searcher.search(query, hitsPerPage);

System.out.println(docs.totalHits);
for(ScoreDoc scoreDoc : docs.scoreDocs) {
    Document doc = searcher.doc(scoreDoc.doc);
    System.out.println(doc.get(CONTENT_FIELD_NAME));
}

32
Newsgroups: alt.atheism
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!darwin.sura.net!udel!news.intercon.com!psinntp!wrldlnk!usenet
From: "Robert Knowles" <p00261@psilink.com>
Subject: Re: Islam And Scientific Predictions (was
In-Reply-To: <C5L1Fv.H9r@ra.nrl.navy.mil>
Message-ID: <2944081075.2.p00261@psilink.com>
Sender: usenet@worldlink.com
Nntp-Posting-Host: 127.0.0.1
Organization: Kupajava, East of Krakatoa
Date: Fri, 16 Apr 1993 23:32:03 GMT
X-Mailer: PSILink-DOS (3.3)
Lines: 35

>DATE:   Fri, 16 Apr 1993 15:23:54 GMT
>FROM:   Umar Khan <khan@itd.itd.nrl.navy.mil>
>
> His conclusion was that,
>while he was impressed that what little the Holy Qur'an had to
>say about science was accurate, he was far more impressed that the
>Holy Qur'an did not contain the same rampant errors evidenced in
>the Traditions.  How would a man of 7th Century Arabia have known
>what *not to include* in the Holy Q


Newsgroups: alt.atheism
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!darwin.sura.net!udel!news.intercon.com!psinntp!wrldlnk!usenet
From: "Robert Knowles" <p00261@psilink.com>
Subject: Re: Islam And Scientific Predictions (was
In-Reply-To: <C5L1Fv.H9r@ra.nrl.navy.mil>
Message-ID: <2944081075.2.p00261@psilink.com>
Sender: usenet@worldlink.com
Nntp-Posting-Host: 127.0.0.1
Organization: Kupajava, East of Krakatoa
Date: Fri, 16 Apr 1993 23:32:03 GMT
X-Mailer: PSILink-DOS (3.3)
Lines: 35

>DATE:   Fri, 16 Apr 1993 15:23:54 GMT
>FROM:   Umar Khan <khan@itd.itd.nrl.navy.mil>
>
> His conclusion was that,
>while he was impressed that what little the Holy Qur'an had to
>say about science was accurate, he was far more impressed that the
>Holy Qur'an did not contain the same rampant errors evidenced in
>the Traditions.  How would a man of 7th Century Arabia have known
>what *not to include* in the Holy Qur


Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53369 talk.religion.misc:83780 talk.origins:40954
Newsgroups: alt.atheism,talk.religion.misc,talk.origins
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!noc.near.net!uunet!pipex!uknet!warwick!nott-cs!mips.nott.ac.uk!eczcaw
From: eczcaw@mips.nott.ac.uk (A.Wainwright)
Subject: Re: Rawlins debunks creationism
Message-ID: <1993Apr16.132316.29748@cs.nott.ac.uk>
Sender: news@cs.nott.ac.uk
Reply-To: eczcaw@mips.nott.ac.uk (A.Wainwright)
Organization: Nottingham University
References: <1993Mar29.231830.2055@rambo.atlanta.dg.com> <1993Apr7.073926.9874@engage.pko.dec.com> <1993Apr10.213547.17644@rambo.atlanta.dg.com> <2BC8B03B.29868@ics.uci.edu> <1993Apr15.223844.16453@rambo.atlanta.dg.com>
Date: Fri, 16 Apr 93 13:23:16 GMT
Lines: 58

In article <1993Apr15.223844.16453@rambo.atlanta.dg.com>, wpr@atlanta.dg.com (Bill Rawlins) writes:

|>     We are talking about origins, not merely 

# Advanced Query Syntax: <br>
- Fields<br>
title:"The Right Way" AND text:go<br>
- Wildcard Searches<br>
te?t, test*, te*t<br>
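- Fuzzy Searches - Levenshtein/edit distance<br>
roam~<br>
roam~1 (maximum number of edits: 0, 1 or 2; the older fractional form such as roam~0.8 is deprecated)<br>
- Proximity Searches - words are within a specific distance of each other<br>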
"jakarta apache"~10<br>
- Range Searches<br>
mod_date:[20020101 TO 20030101]<br>
- Boosting a Term<br>
jakarta^4 apache<br>
- Boolean Operators<br>
"jakarta apache" OR jakarta<br>
"jakarta apache" AND "Apache Lucene"<br>
+jakarta lucene - must contain jakarta and can contain lucene<br>
"jakarta apache" -"Apache Lucene"<br>
- Escaping Special Characters (prefix each with a backslash): + - && || ! ( ) { } [ ] ^ " ~ * ? : \
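
Fuzzy search is based on edit distance, which can be sketched in plain Java as follows. The class `EditDistance` and the helper `fuzzyMatches` are hypothetical names for this illustration. This is the textbook dynamic-programming Levenshtein distance; Lucene itself matches fuzzy terms with automata and, by default, also counts a transposition of two adjacent characters as a single edit, which this sketch does not.

```java
// Hypothetical helper illustrating the distance behind queries such as roam~1.
class EditDistance {
    // Minimum number of single-character insertions, deletions and substitutions
    // needed to turn a into b, computed with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                    Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                    prev[j - 1] + cost);                    // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True when candidate is within maxEdits edits of term.
    static boolean fuzzyMatches(String term, String candidate, int maxEdits) {
        return levenshtein(term, candidate) <= maxEdits;
    }
}
```

With a limit of one edit, `roam` matches `foam` and `roams` (one edit each) but not `rope` (two substitutions).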

# Exercise : <br>
- Implement the __index construction__ with all categories in the “20 News Groups” corpus. <br>
- Add 3 __more fields__ to each document:<br>
    - Title of the document (id) - StringField/IntPoint<br>
    - Path to the document - StringField<br>
    - Size of the document (in terms of the text’s length) - LongPoint<br>
- Create your own __Analyzer__ with the following pipeline:<br>
    - Tokenize at non-letters<br>
    - Lower Case Filter<br>
    - Stop Words Filter - find a list of stopwords to use<br>
    - Porter Stem Filter<br>
    - Choose also some filters from the available ones (like CommonGramsFilter)
- Experiment with Lucene’s __Rich Query Syntax__ of these kinds:<br>
    - Proximity Searches<br>
        - "Apple patent"~5<br>
    - Range Searches & Fields<br>
        - id:[57110 TO 59652]<br>
        - length:[100 TO 10000]<br>
    - Wildcard search within single term<br>
        - te?t, test*, te*t<br>
    - Boolean Operators<br>
        - (Honda OR Accord) AND compact<br>
    - Term Boosting<br>
        - chronically depressed^4<br>
- Experiment with Advanced Lucene functionalities like Highlight, Faceted Search, More Like This, etc.