# Using Gismo on a toy example

A typical Gismo workflow goes as follows:
- Its input is a list of objects, called the source;
- A source is wrapped into a Corpus object;
- A dual embedding is computed that relates objects and their content;
- The embedding fuels a query-based ranking function;
- The best results of a query can be organized in a hierarchical way.

## Source

In Gismo, a source is a list of objects. The typical case is when objects are documents represented by a string or a dictionary.

For tutorial purposes, Gismo provides a toy source in both string and dict formats.

In [1]:
from gismo.common import toy_source_text
toy_source_text

['Gizmo is a Mogwaï.',
 'This is a sentence about Blade.',
 'This is another sentence about Shadoks.',
 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.',
 'In chinese folklore, a Mogwaï is a demon.']

In [2]:
from gismo.common import toy_source_dict
toy_source_dict

[{'title': 'First Document', 'content': 'Gizmo is a Mogwaï.'},
 {'title': 'Second Document', 'content': 'This is a sentence about Blade.'},
 {'title': 'Third Document',
  'content': 'This is another sentence about Shadoks.'},
 {'title': 'Fourth Document',
  'content': 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.'},
 {'title': 'Fifth Document',
  'content': 'In chinese folklore, a Mogwaï is a demon.'}]
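
To make the link between the two formats concrete, here is a minimal sketch that does not import Gismo. It rebuilds a string-format source from a dict-format one; the names `source_dict` and `source_text` are illustrative and are not part of the library.

```python
# Two-item excerpt in dict format (same layout as the toy source above).
source_dict = [
    {"title": "First Document", "content": "Gizmo is a Mogwaï."},
    {"title": "Second Document", "content": "This is a sentence about Blade."},
]

# The string format keeps only the text of each item.
source_text = [item["content"] for item in source_dict]
print(source_text)
```

Either list is a valid source in the sense used here: a plain Python list whose items can be turned into text.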

## Corpus

A corpus is mostly a wrapper around a source: it tells how to convert source objects to strings and provides basic I/O capabilities.

### Construction and basic use

In [3]:
from gismo.corpus import Corpus
corpus = Corpus(source=toy_source_dict, to_text=lambda e: e['content'])

The source itself is stored as an attribute of the corpus.

In [4]:
corpus.source == toy_source_dict

True

The corpus exposes the length and the individual elements of the source.

In [5]:
len(corpus)

5

In [6]:
corpus[2]

{'title': 'Third Document',
 'content': 'This is another sentence about Shadoks.'}

The ``iterate`` method iterates through the source.

In [8]:
[e['title'].split(' ')[0] for e in corpus.iterate()]

['First', 'Second', 'Third', 'Fourth', 'Fifth']

### iterate_text method

Note that all the actions above can be performed directly on the source and do not, by themselves, justify the introduction of a Corpus class. The main interest of the class is the ``iterate_text`` method. Indeed, the Embedding class that comes next works on strings, and the Corpus class supplies them.

In [9]:
[e for e in corpus.iterate_text()]

['Gizmo is a Mogwaï.',
 'This is a sentence about Blade.',
 'This is another sentence about Shadoks.',
 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.',
 'In chinese folklore, a Mogwaï is a demon.']

Remark: the corpus ``to_text`` function can be overridden on the fly. This can be handy if one needs to consider multiple views of the same source without creating two Corpus objects.

In [10]:
[e for e in corpus.iterate_text(to_text=lambda e: e['content'][::-1])]

['.ïawgoM a si omziG',
 '.edalB tuoba ecnetnes a si sihT',
 '.skodahS tuoba ecnetnes rehtona si sihT',
 '.adoY dna omziG gnirapmoc yb eivom snilmerG eht ot ecnerefer edis a tniop emos ta sekam ,edisni sraW ratS tuoba ffuts fo tol a htiw ,ecnetnes gnol yrev sihT',
 '.nomed a si ïawgoM a ,erolklof esenihc nI']
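
The behaviour shown in this section can be mimicked with a few lines of plain Python. The class below, `MiniCorpus`, is a hypothetical stand-in written for illustration only. It is not Gismo's implementation and only reproduces the calls demonstrated above (length, indexing, ``iterate``, ``iterate_text`` with an optional override).

```python
class MiniCorpus:
    """Hypothetical stand-in for a corpus wrapper (not Gismo's code)."""

    def __init__(self, source, to_text=None):
        self.source = source
        # Default accessor: assume the items are already strings.
        self.to_text = to_text if to_text is not None else (lambda e: e)

    def __len__(self):
        return len(self.source)

    def __getitem__(self, i):
        return self.source[i]

    def iterate(self):
        # Yield the raw source items.
        yield from self.source

    def iterate_text(self, to_text=None):
        # Use the per-call accessor if one is given, else the stored one.
        convert = to_text if to_text is not None else self.to_text
        for item in self.source:
            yield convert(item)


docs = [
    {"title": "First Document", "content": "Gizmo is a Mogwaï."},
    {"title": "Second Document", "content": "This is a sentence about Blade."},
]
mini = MiniCorpus(docs, to_text=lambda e: e["content"])
print(list(mini.iterate_text()))
print(list(mini.iterate_text(to_text=lambda e: e["title"])))
```

The per-call override leaves the stored accessor untouched, which is what allows several views of one source to coexist in a single object.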

### I/O

Like most Gismo objects, Corpus provides a basic I/O interface.

In [11]:
import tempfile
corpus1 = Corpus(toy_source_text)
with tempfile.TemporaryDirectory() as tmpdirname:
    corpus1.save(filename="myfile", path=tmpdirname)
    corpus2 = Corpus(filename="myfile", path=tmpdirname)
corpus2[0]

'Gizmo is a Mogwaï.'
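
For readers who want to see the save-then-reload pattern without Gismo, here is a generic sketch based on the standard ``pickle`` module. It only illustrates the round trip; the file name `myfile.pkl` is an assumption, and Gismo's actual on-disk format may differ.

```python
import pickle
import tempfile
from pathlib import Path

source = ["Gizmo is a Mogwaï.", "This is a sentence about Blade."]

with tempfile.TemporaryDirectory() as tmpdirname:
    target = Path(tmpdirname) / "myfile.pkl"  # hypothetical file name
    # Save the source to disk.
    with open(target, "wb") as f:
        pickle.dump(source, f)
    # Load it back into a new object.
    with open(target, "rb") as f:
        restored = pickle.load(f)

# The temporary directory is gone, but the reloaded copy lives in memory.
print(restored[0])
```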

## Embedding

The Embedding class computes the relationships between the corpus and the features of its elements.

### X/Y convention

In most scenarios, the objects of a corpus are documents and the features are words, but this is not always the case. For example, you can treat scientific articles as documents and their authors as features, or authors as documents and the words they use as features.
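
The article/author case can be sketched in plain Python. The data below is made up, and the names `item_to_features` and `feature_to_items` are illustrative, not Gismo attributes. The point is only that one relation can be read in two directions: from items to their features, and from features back to the items that carry them.

```python
from collections import Counter, defaultdict

# Hypothetical articles, each listing its authors (made-up names).
articles = [
    {"title": "Paper 1", "authors": ["Alice", "Bob"]},
    {"title": "Paper 2", "authors": ["Bob", "Carol"]},
    {"title": "Paper 3", "authors": ["Alice"]},
]

# First direction: articles are the items, authors are the features.
item_to_features = {a["title"]: Counter(a["authors"]) for a in articles}

# Second direction: for each feature, the items that carry it.
feature_to_items = defaultdict(Counter)
for title, feats in item_to_features.items():
    for feat, weight in feats.items():
        feature_to_items[feat][title] += weight

print(dict(item_to_features["Paper 2"]))
print(dict(feature_to_items["Bob"]))
```

Swapping the roles (authors as items, words as features) only changes what is fed in; the two-directional structure stays the same.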

## Ranking

## Structuring