IQPlayground

Text Indexation & Query Algorithms Playground

Aim

This project offers a graphical playground for discovering Text Indexation & Query Algorithms, and playing around with the many parameters to see how it may influence performance in terms of speed and memory consumption.

Since this is also a class project, we wanted to learn the several things from it :

Learn Python (at the beginning, we were four out of five never to have programmed a full project in this language), and how to write decent python as compliant as possible with PEP8,
Implement many of the Indexation & Query Algorithms that are used nowadays in most software,
Think about how the activation and parameterizing of these algorithms influenced the overall performance,
And naturally spend time trying to understand each other and work together as a team.

Running the project:

Prerequisites

- Python 3.6
- Git
- Ideally, a GNU/Linux based operating system, though the code is meant to be portable

Installation

Clone the project:

git pull https://github.com/Xia0ben/IQPlayground.git

If you have access to the "latimes" archive with all the files, you may want to uncompress it in the project's root (careful to remove the MACOSX subfolder and put all the content in the main latimes folder, and remove the .txt files manually : the files shouldn't be in "latimes/latimes" but in "latimes").
Install PyGObject, GTK+ and their dependencies following the instructions at the following link : http://pygobject.readthedocs.io/en/latest/getting_started.html
Install nltk and gi (using the python interpreter provided by the previous link):

>>> import nltk
>>> import gi
>>> nltk.download()

Install the numpy and sortedcontainers librairies with pip :

pip install numpy --index-url=http://pypi.python.org/simple/ --trusted-host pypi.python.org
pip install sortedcontainers

If at execution time, the project fails and says that you are missing the "pickles" and "invertedfiles" folders in the project's root directory, create them manually.

Execution

Console mode:
- access the project repository in the console
- run "console_mode_launch.py" with the python interpreter in you installed everything.
Graphical mode:
- run "launch.py" with the python interpreter in you installed everything.

Description of the packages

Algorithm:
- Contains classes for each algorithm
- Each algorithm is derived from the baseAlgorithm class
- naiveAlgorithm : a naive query algorithm based on parallel scan of the posting lists
- FA_Algorithm : Fagin's top-k query algorithm
- TA_Algorithm : Fagin's thershold algorithm
Executable
- Contains an executable class which act as a controller for the project
Files
- Contains multiple classes which act with the memory and the hard drive
- document : a memory based representation of a document
- fileToPostingList : read a pl FILE to generate memory based posting lists when needed
- invertedFile : memory based inverted file, which do a merge based creation of the posting lists
- postingList : a memory based representation of the posting lists with 3 access :
  - randomaccess : return the score of a doc id in log(n) time
  - alphaaccess : return the doc_id and score ordered by doc_id
  - orderedaccess : return the doc_id and score ordered by score
- reader : read a file and returns documents
- stemmer : implementation of porter's algorithm
- variableByte : encoding and decoding of variable byte integers

Explaining Encoding :

Get the last seven bits from the binary code 1100001001 is 0001001,
Add 1 as the "head" bit (0001001 -> 10001001)
Because there are still 3 bits left in the original binary code. Get the remain 3 bits, this time use 0 as the "head" bit (110 -> 00000110)
Because there is no remain in the original binary code put these two bytes together 0000011010001001 is the VB code.
RandomIndexing
- Contains classes used for the random indexing of terms
- randomIndex : Contains a class with different functions to get context vector for a term
- vectorsSimilarity : Contains functions to calculate Cosinus distance between context vectors this distance is in the interval [0-2]
Stats
- Contains classes which allow to get stats on indexing and querying
UI
- Contains files used to generate the user interface

Project Preparation

Learning prerequisites

Learn Python 3 :

To test and use quickly without an IDE :

Online code tester

Use Pycharm :

To install it quickly on Ubuntu

Feature Goals

Indexing :

Parse pseudo-xml files into usable objects
Tokenization of each of the documents : use the nltk Python library
Stemming : may help save place but is useless outside of this scope -> use a library. Allow user to activate or deactivate it.
Stop words removal : use a library. Allow user to activate or deactivate it.
Inverted file : Implement this mapping between vocabulary of terms (VOC) and posting lists (PL).
Merge-based algorithm To build the IF : to research and implement
PL Compression : to research and implement
Variable Byte : Implement following this proposition.
Score computation : To be implemented with variations that the user can play with.
Random indexing : Search pseudo-code and implement

Querying :

Fagin’s top-k query algorithm (FA) : To implement with possibilities for variations.

Graphical User interface

Simple graphical user interface with search bar(s), some checkboxes to activate some other options, and a text area where results can be displayed (list of articles with id, title, file name). When you click on it, displays

First iteration 06/10/2017 -> 08/10/2017

Alexis writes a basic implementation to get everyone started :
- Parse pseudo-xml files into usable objects without library : make each document into a separate object, removing all the XML markup and considering all the rest as text
- Simple tokenization of each of the documents using nltk (space-based rule with some exceptions managed by the library)
- Basic fully in-memory inverted file, with scores computed as the product of the term-frequency and inverse-document-frequency, with posting-lists ordered by DOCID
- For queries, input the query in terminal
- Parallel scan all posting-lists to find each DOCID that has all search terms at least once
- Consider all queries are only conjunctive and aggregate the scores as a simple sum of the score of each term for each document
- Sort document list by score and output DOCIDs of the 5 first ones in terminal
All others learn Python through at least the QuickStart and the PEP8 (ideally with Learn Python the hard way if they can)

First code review 09/10/2017 from 10 to 12 a.m.

Discussion around a nice cup of tea/chocolate/coffee.
Alexis presents the basic structure of the project and the strategy for expanding the different elements (packages, classes, functions, ...).
Affectation of the elements for the second iteration.

Second iteration 09/10/2017 -> 15/10/2017

Important note : For each feature, think about parameterizing it so that it is possible to play with it in the GUI (for example, change the ponderation of words that are in the document title when computing the inverted file).

Simple GUI using PyGObject, Kivy or PySide -> SHOTGUN By Benoit :D
Better parse the different XML elements of the file so that it is possible to ponderate the computation of the scores in the inverted file depending on whether the terms are the title or if they must agree with a particular date, ... -> TODO By Zakaria
Deactivable (through UI) stemming and stop words removal that is going to influence the creation of the -> TODO By Omar
Merge-based creation of the inverted file (see slide 24 of teacher's slideshow) -> TODO By Alexis
PL Compression : Variable Byte Encoding (see slides 26 to 29 of teacher's slideshow) -> TODO By Bachir

Second code review 16/10/2017 from 8 to 10 a.m.

Discussion around a nice cup of tea/chocolate/coffee.
Presentation of each person's work for 20 minutes max.
Affectation of the elements for the third iteration in the remaining time.

Third iteration 16/10/2017 -> 19/10/2017

Fagin’s threshold algorithm (TA) (Top-k improvement, so we don't do the FA before) (see slides 16 to 18, then 19 to 21 of teacher's slideshow), with Espilon-approximation of Top-K Answers (see slide 22 of teacher's slideshow) -> TODO By Zakaria
Random Indexing (see slide 32 of teacher's slideshow) -> TODO By Bachir
Allow OR clause instead of only AND -> TODO By Alexis
Improvement of the GUI : Logging of queries ad associated results + performance indicators (processor time, memory used, number of reads of posting-lists...) -> SHOTGUN By Benoit :D

Third code review 19/10/2017 from 4:30 to 6:30 p.m.

Discussion around a nice cup of tea/chocolate/coffee.
Presentation of each person's work for 20 minutes max.
Bug fixes.

Preparation of the presentation 19/10/2017 from 6:30 to 8:30 p.m.

Overall bug fixes
Preparation of slideshow and oral presentation.

Preparation of the presentation 20/10/2017 from 10 to 17 a.m

Overall bug fixes
Preparation of slideshow and oral presentation.

Presentation 20/10/2017 from 17 to 18 a.m

BE AWESOME ! :D

Name		Name	Last commit message	Last commit date
Latest commit History 129 Commits
algorithm		algorithm
app-overview		app-overview
demofiles		demofiles
executable		executable
files		files
randomindexing		randomindexing
stats		stats
tests		tests
ui		ui
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
console_mode_launch.py		console_mode_launch.py
launch.py		launch.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

IQPlayground

Aim

Running the project:

Prerequisites

Installation

Execution

Description of the packages

Project Preparation

Learning prerequisites

Feature Goals

Indexing :

Querying :

Graphical User interface

First iteration 06/10/2017 -> 08/10/2017

First code review 09/10/2017 from 10 to 12 a.m.

Second iteration 09/10/2017 -> 15/10/2017

Second code review 16/10/2017 from 8 to 10 a.m.

Third iteration 16/10/2017 -> 19/10/2017

Third code review 19/10/2017 from 4:30 to 6:30 p.m.

Preparation of the presentation 19/10/2017 from 6:30 to 8:30 p.m.

Preparation of the presentation 20/10/2017 from 10 to 17 a.m

Presentation 20/10/2017 from 17 to 18 a.m

About

Releases

Packages

Contributors 6

Languages

License

Xia0ben/IQPlayground

Folders and files

Latest commit

History

Repository files navigation

IQPlayground

Aim

Running the project:

Prerequisites

Installation

Execution

Description of the packages

Project Preparation

Learning prerequisites

Feature Goals

Indexing :

Querying :

Graphical User interface

First iteration 06/10/2017 -> 08/10/2017

First code review 09/10/2017 from 10 to 12 a.m.

Second iteration 09/10/2017 -> 15/10/2017

Second code review 16/10/2017 from 8 to 10 a.m.

Third iteration 16/10/2017 -> 19/10/2017

Third code review 19/10/2017 from 4:30 to 6:30 p.m.

Preparation of the presentation 19/10/2017 from 6:30 to 8:30 p.m.

Preparation of the presentation 20/10/2017 from 10 to 17 a.m

Presentation 20/10/2017 from 17 to 18 a.m

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 6

Languages

Packages