### Background

The Garifuna are a people of mixed West/Central African, Arawak, and Carib ancestry. Their language, also called Garifuna, is widely spoken in Garifuna villages along the coasts of Central America. Today, the global Garifuna population stands at upwards of **300,000** people, many of whom live in the U.S. and Canada.

This project is an attempt at language preservation by:
1. Digitizing existing physical dictionaries (**Tesseract OCR** – Optical Character Recognition)
2. Building a "flexible" Search Engine (**Whoosh**)
> When exact matches are not found, the search engine should return suggestions

In [1]:
# Import packages
import os
import pandas as pd
# constants
import settings

### Extract Text From Images (Batch)

In [2]:
# run script from CLI
! python3 batch_ocr.py

In [3]:
# read in DataFrame
data = pd.read_csv(
    os.path.join(settings.DIR_DATA, settings.DATA_FNAME_SEP_TUP[0]), 
    sep=settings.DATA_FNAME_SEP_TUP[1],
)

data.head()

Unnamed: 0,English,Garifuna,Spanish
0,"chaotic, adj","urouhabuti (u-rou-ha-bu-ti), adj","caotico, adj"
1,"chapel, n","ligilisi (li-gi-li-si), n","capilla, n"
2,"character, n","usa (u-sa), n","caracter, n"
3,"characteristic, n","luruyeri (lu-ru-ye-ri), n","caracteristica, n"
4,"charancaco, n","wagagan (wa-ga-gan), n","charancaco, n"
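
Each cell in the table above packs several pieces of information into one string: a headword, an optional syllable breakdown in parentheses, and a part-of-speech tag. The sketch below shows one way such a cell could be split into separate fields. The function name `parse_entry` and the regular expression are illustrative assumptions, not part of the project's code, and the pattern only covers the entry shapes visible in the sample rows.

```python
import re

# Assumed entry shape: "headword (syl-la-bles), pos"; the parenthesized part is optional.
ENTRY_PATTERN = re.compile(
    r"^(?P<headword>.+?)(?:\s*\((?P<syllables>[^)]*)\))?\s*,\s*(?P<pos>\w+)\s*$"
)


def parse_entry(cell):
    """Split one dictionary cell into headword, syllables, and part of speech.

    Hypothetical helper; returns None when the cell does not fit the assumed shape.
    """
    match = ENTRY_PATTERN.match(cell)
    if match is None:
        return None
    syllables = match.group("syllables")
    return {
        "headword": match.group("headword").strip(),
        "syllables": syllables.split("-") if syllables else [],
        "pos": match.group("pos"),
    }


print(parse_entry("urouhabuti (u-rou-ha-bu-ti), adj"))
print(parse_entry("chapel, n"))
```

Keeping the headword separate from the syllable guide and the tag would let a search match on the headword alone.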


### Populate Whoosh Search Engine

In [4]:
# from CLI
! python3 create_and_load_schema_ix.py

In [5]:
from whoosh.index import open_dir

In [6]:
# get schema index
ix = open_dir(settings.DIR_SCHEMA)

# read the stored "output" field of every document;
# the context manager closes the searcher when done
with ix.searcher() as searcher:
    dict_output = [doc["output"] for doc in searcher.documents()]
dict_output[:5]

['chaotic, adj | urouhabuti (u-rou-ha-bu-ti), adj | caotico, adj',
 'chapel, n | ligilisi (li-gi-li-si), n | capilla, n',
 'character, n | usa (u-sa), n | caracter, n',
 'characteristic, n | luruyeri (lu-ru-ye-ri), n | caracteristica, n',
 'charancaco, n | wagagan (wa-ga-gan), n | charancaco, n']
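
The stored `output` strings above appear to be the three dictionary columns joined with `" | "`. As a rough illustration of what populating an index involves, the sketch below builds such strings and a toy lookup table in plain Python. It is a stand-in for the idea only, not the Whoosh schema or the contents of `create_and_load_schema_ix.py`; the names `format_output` and `build_index` are hypothetical.

```python
from collections import defaultdict

LANGUAGES = ("english", "garifuna", "spanish")

# Two rows copied from the data.head() output shown earlier.
ROWS = [
    {
        "english": "chaotic, adj",
        "garifuna": "urouhabuti (u-rou-ha-bu-ti), adj",
        "spanish": "caotico, adj",
    },
    {
        "english": "chapel, n",
        "garifuna": "ligilisi (li-gi-li-si), n",
        "spanish": "capilla, n",
    },
]


def format_output(row):
    """Join the three columns into a single display string."""
    return " | ".join(row[lang] for lang in LANGUAGES)


def build_index(rows):
    """Map (language, headword) to the display strings of the matching rows."""
    index = defaultdict(list)
    for row in rows:
        output = format_output(row)
        for lang in LANGUAGES:
            # Headword = text before the first "(" or ",", lowercased.
            headword = row[lang].split("(")[0].split(",")[0].strip().lower()
            index[(lang, headword)].append(output)
    return index


index = build_index(ROWS)
print(index[("garifuna", "ligilisi")])
```

A real search library adds tokenization, scoring, and on-disk storage on top of this basic term-to-document mapping.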

### Search

In [7]:
# Example 1: Successful search
! python3 translator.py --t "child" --lang "english"

    English            Garifuna       Spanish
0  child, n  irahd (i-ra-hu), n  nifio (a), n
1  child, n  irahd (i-ra-hu), n      nifio, n


In [8]:
# Example 2: Unsuccessful Search – Suggestions
! python3 translator.py --t "chil" --lang "english"

No search results found for: 'chil'

Try:
- 'child'
- 'chili'
- 'chill'
- 'chin'
- 'chip'


In [9]:
# Example 3: Incorrect Language
! python3 translator.py --t "chil" --lang ""

supported language(s): ['english', 'garifuna', 'spanish']
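
The language check in Example 3 could be implemented along the following lines. This is a minimal sketch that assumes the `--t` and `--lang` flags seen in the commands above are parsed with `argparse`. The helper names `build_parser` and `check_language` are hypothetical, and the actual `translator.py` may be structured differently.

```python
import argparse

SUPPORTED_LANGUAGES = ["english", "garifuna", "spanish"]


def build_parser():
    """Create a parser for the two flags used in the examples above."""
    parser = argparse.ArgumentParser(description="Look up a dictionary term.")
    parser.add_argument("--t", required=True, help="term to translate")
    parser.add_argument("--lang", default="", help="language of the term")
    return parser


def check_language(lang):
    """Return an error message for an unsupported language, else None."""
    if lang.lower() not in SUPPORTED_LANGUAGES:
        return f"supported language(s): {SUPPORTED_LANGUAGES}"
    return None


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--t", "chil", "--lang", ""])
error = check_language(args.lang)
if error is not None:
    print(error)
```

Validating the language before querying the index lets the tool fail early with a helpful message instead of returning an empty result.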


In [10]:
# Example 4: Nothing Found
! python3 translator.py --t "an" --lang "english"

No search results found for: an
