# Building and Loading Text Search in PostgreSQL

## OUTLINE
 1. [PostgreSQL Text storage](#PG_text)
 1. [Task at hand](#task)
 1. [Building our Text Document Retrieval DB](#build_it)
 1. [Loading Data](#load_it)
 1. [Executing Queries, Google-lite...very very lite](#search_me) 
 



--- 
<a id='PG_text' ></a>

## PostgreSQL Text Storage

This notebook documents how the `ir.BookLines` table was built using PostgreSQL's Information Retrieval (IR) capability, _full text search_.


<a id='task' /> </a>

## Task at Hand

This lab walks through creating a full text search capability within PostgreSQL, so that the lines of a book (with sub-books) can be searched and integrated into other analytical processes.


### Database of Unstructured Text Files 

As in the lab, we are going to use this collection of text files.
It holds 4.3 megabytes of text in about 31 thousand lines. Sounds fun!

```BASH
[scottgs@metal pg_text_search]$ ls book/*
book/1chron.txt    book/acts.txt      book/isaiah.txt    book/nahum.txt
book/1corinth.txt  book/amos.txt      book/james.txt     book/nehemiah.txt
book/1john.txt     book/colossia.txt  book/jeremiah.txt  book/numbers.txt
book/1kings.txt    book/daniel.txt    book/job.txt       book/obadiah.txt
book/1peter.txt    book/deut.txt      book/joel.txt      book/philemon.txt
book/1samuel.txt   book/eccl.txt      book/john.txt      book/philipp.txt
book/1thess.txt    book/ephesian.txt  book/jonah.txt     book/proverbs.txt
book/1timothy.txt  book/esther.txt    book/joshua.txt    book/psalms.txt
book/2chron.txt    book/exodus.txt    book/jude.txt      book/rev.txt
book/2corinth.txt  book/ezekiel.txt   book/judges.txt    book/romans.txt
book/2john.txt     book/ezra.txt      book/lament.txt    book/ruth.txt
book/2kings.txt    book/galatian.txt  book/levit.txt     book/song.txt
book/2peter.txt    book/genesis.txt   book/luke.txt      book/titus.txt
book/2samuel.txt   book/habakkuk.txt  book/malachi.txt   book/zech.txt
book/2thess.txt    book/haggai.txt    book/mark.txt      book/zeph.txt
book/2timothy.txt  book/hebrews.txt   book/matthew.txt
book/3john.txt     book/hosea.txt     book/micah.txt

[scottgs@metal pg_text_search]$ du -skh book
4.3M	book
[scottgs@metal pg_text_search]$ wc -l book/*  | tail -n1
  31258 total
```

### However, now I am going to index it line-by-line.

<a id='build_it' /> </a>

## Building a Text Retrieval Database

#### All the commands are available [here](../practices/PG_Build_Lines_Search.sql).

### Data repository within database.

```SQL
-------------------------
-- Basic Table 
-------------------------
CREATE TABLE ir.BookLines(
        id SERIAL NOT NULL,
        name varchar(250) NOT NULL,
        line_no INT NOT NULL,
        line text NOT NULL
);

ALTER TABLE ir.BookLines
ADD CONSTRAINT pk_BookLines PRIMARY KEY (id);
```

### A column that implements the vector model

```SQL
-------------------------
-- Separate Ts_Vector column
-------------------------
-- TS_Vector for GIN INDEX
ALTER TABLE ir.BookLines
  ADD COLUMN line_tsv_gin tsvector;

UPDATE ir.BookLines
SET line_tsv_gin = to_tsvector('pg_catalog.english', line);
```

### Another column that implements the vector model

```SQL
-- TS_Vector for GIST INDEX
ALTER TABLE ir.BookLines
  ADD COLUMN line_tsv_gist tsvector;

UPDATE ir.BookLines
SET line_tsv_gist = to_tsvector('pg_catalog.english', line);
```

### Further steps are completed, similar to those shown in the lab
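
Those steps are not listed in this notebook. Going by the index and trigger names in the table definition under Result, they probably resemble the statements generated below. This is a sketch under assumptions, not the original build script: `build_statements` is a hypothetical helper, and the `CREATE EXTENSION` line is assumed because `gin_trgm_ops` comes from the `pg_trgm` extension.

```python
# Sketch only: SQL text reconstructed from the index and trigger names that
# appear in the \d ir.booklines output. The CREATE EXTENSION line is an
# assumption (gin_trgm_ops is provided by the pg_trgm extension).

def build_statements(table="ir.booklines"):
    """Return the assumed index and trigger DDL as a list of SQL strings."""
    stmts = [
        "CREATE EXTENSION IF NOT EXISTS pg_trgm;",
        f"CREATE INDEX booklines_line ON {table} USING gin (line gin_trgm_ops);",
    ]
    for kind in ("gin", "gist"):
        col = f"line_tsv_{kind}"
        # One index per tsvector column, using the matching index type
        stmts.append(
            f"CREATE INDEX booklines_{col} ON {table} USING {kind} ({col});"
        )
        # Keep the tsvector column in sync whenever a row is inserted or updated
        stmts.append(
            f"CREATE TRIGGER tsv_{kind}_update BEFORE INSERT OR UPDATE ON {table} "
            f"FOR EACH ROW EXECUTE PROCEDURE "
            f"tsvector_update_trigger({col}, 'pg_catalog.english', line);"
        )
    return stmts

for stmt in build_statements():
    print(stmt)
```

Each printed statement could be pasted into `psql` by a user with write access to the schema.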


### Result


Finally, take a look at the resulting table definition:

```SQL
dsa_ro=# \dt ir.
           List of relations
 Schema |    Name    | Type  |  Owner  
--------+------------+-------+---------
 ir     | booklines  | table | scottgs
 ir     | booksearch | table | scottgs
(2 rows)


dsa_ro=# \d ir.booklines
                                        Table "ir.booklines"
    Column     |          Type          |                         Modifiers                         
---------------+------------------------+-----------------------------------------------------------
 id            | integer                | not null default nextval('ir.booklines_id_seq'::regclass)
 name          | character varying(250) | not null
 line_no       | integer                | not null
 line          | text                   | not null
 line_tsv_gin  | tsvector               | 
 line_tsv_gist | tsvector               | 
Indexes:
    "pk_booklines" PRIMARY KEY, btree (id)
    "booklines_line" gin (line gin_trgm_ops)
    "booklines_line_tsv_gin" gin (line_tsv_gin)
    "booklines_line_tsv_gist" gist (line_tsv_gist)
Triggers:
    tsv_gin_update BEFORE INSERT OR UPDATE ON ir.booklines FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('line_tsv_gin', 'pg_catalog.english', 'line')
    tsv_gist_update BEFORE INSERT OR UPDATE ON ir.booklines FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('line_tsv_gist', 'pg_catalog.english', 'line')


```

<a id='load_it' /> </a>

## Loading Data

To load the data, a Python script will follow this basic crawling behavior:

 1. For each file/folder in the specified starting folder:
 1. If it is a folder, recurse into folder and process contents
 1. If it is a file, read contents and load into database, one line at a time.
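
The crawl itself can be tried without a database. The sketch below builds a throwaway folder tree and collects its `.txt` files; it relies on `os.walk`, which visits every subfolder on its own. The helper name `list_text_files` and the temporary tree are illustrative assumptions.

```python
import os
import tempfile

def list_text_files(folder):
    """Return every .txt path under folder; os.walk handles the recursion."""
    found = []
    for root, dirs, files in os.walk(folder):
        for name in files:
            if name.endswith(".txt"):
                found.append(os.path.join(root, name))
    return sorted(found)

# Build a throwaway tree with one top-level file and one nested file.
with tempfile.TemporaryDirectory() as top:
    nested = os.path.join(top, "one_level_down", "two_levels_down")
    os.makedirs(nested)
    for path in (os.path.join(top, "ruth.txt"), os.path.join(nested, "test.txt")):
        with open(path, "w") as out:
            out.write("This is just a test file\n")
    paths = [os.path.relpath(p, top) for p in list_text_files(top)]
    print(paths)
```

Both files are found even though only the top folder was passed in.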

In [None]:
# Not Executable
import getpass

uname = input("Database user: ")
epass = getpass.getpass("Database password: ")  # does not echo the password

In [None]:
import os
import psycopg2

try:
    conn = psycopg2.connect("host='dbase' port='5432' dbname='dsa_ro' user='{}' password='{}'".format(uname,epass))
except psycopg2.Error as err:
    # Re-raise: nothing below can work without a connection
    print("I am unable to connect to the database:", err)
    raise

def loadFile(filename):
    '''
    Read file contents, load into database one row per line.
    
    Returns: The number of lines that were loaded
    '''
    # Note, even numerical parameters get a %s placeholder,
    #       not %d as in some other DB libraries
    SQL = "INSERT INTO ir.booklines(name,line_no,line) VALUES (%s,%s,%s);"
    lines_loaded = 0
    # "with conn" commits on success and rolls back if an exception is raised
    with conn, conn.cursor() as curs:
        with open(filename, 'r') as infile:
            for line_no, line in enumerate(infile, start=1):
                line = line.rstrip('\n')
                # print("Loading: {},{} = {}".format(filename,line_no,line))
                curs.execute(SQL,(filename,line_no,line))
                lines_loaded = line_no
    return lines_loaded


def processFolder(folder):
    '''
    Process a folder for files and subfolders
    '''
    
    print('Processing folder: ',folder)
    
    # os.walk already descends into every subfolder,
    # so no explicit recursion is needed
    for root, dirs, files in os.walk(folder):
        
        print("root = ", root)
        
        # Process Files
        for file in files:
            if file.endswith(".txt"):
                filename = os.path.join(root, file)
                print('Processing File:',filename)
                lines_loaded = 0
                # Comment out this line to watch the next cell walk the tree
                lines_loaded = loadFile(filename)
                print("Lines Loaded: {}".format(lines_loaded))
                
            elif file.endswith(".html"):
                print("HTML Files Not Handled Yet")
        

In [None]:
###########################
# If you run this cell, first comment out the " lines_loaded = loadFile(filename) " line above
###########################
processFolder('./book');

##### In case the output above is cleared, it is saved [here](../resources/PG_FTS_load_output.txt).

### Check the Results

```SQL
dsa_ro=# select count(*),sum(length(line)) from ir.booklines;
 count |   sum   
-------+---------
 31259 | 4315223
(1 row)
```

#### 31K lines

#### Looking at a random line that was added:

```SQL
dsa_ro=# \x 
Expanded display is on.
dsa_ro=# select * from ir.booklines where id = 34;
-[ RECORD 1 ]-+-------------------------------------------
id            | 34
name          | ./book/zeph.txt
line_no       | 34
line          | 2:14: And flocks shall lie down in the midst of her, 
                all the beasts of the nations: both the cormorant and 
                the bittern shall lodge in the upper lintels of it; 
                their voice shall sing in the windows; desolation shall 
                be in the thresholds: for he shall uncover the cedar work.
line_tsv_gin  | '14':2 '2':1 'beast':15 'bittern':24 'cedar':51 'cormor':21 
                'desol':40 'flock':4 'lie':6 'lintel':30 'lodg':26 'midst':10 
                'nation':18 'shall':5,25,35,41,48 'sing':36 'threshold':45 
                'uncov':49 'upper':29 'voic':34 'window':39 'work':52
line_tsv_gist | '14':2 '2':1 'beast':15 'bittern':24 'cedar':51 'cormor':21 
                'desol':40 'flock':4 'lie':6 'lintel':30 'lodg':26 'midst':10 
                'nation':18 'shall':5,25,35,41,48 'sing':36 'threshold':45 
                'uncov':49 'upper':29 'voic':34 'window':39 'work':52

```

Notice that the stored document vector is stemmed (for example, `flocks` became `flock`), has common (stop) words removed, and records the position of each remaining word.
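
To make the idea concrete, here is a toy version of that transformation in plain Python. It is only a sketch: the stop-word set is a tiny hand-picked subset, `toy_stem` is a crude suffix stripper rather than the Snowball stemmer PostgreSQL uses, and both function names are hypothetical.

```python
import re

# Tiny illustrative subset; the real english configuration has a longer list.
STOP_WORDS = {"and", "down", "her", "in", "of", "the"}

def toy_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_tsvector(text):
    """Map each kept lexeme to the 1-based positions where it occurs."""
    vector = {}
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    for position, token in enumerate(tokens, start=1):
        if token in STOP_WORDS:
            continue  # dropped, but the position is still consumed
        vector.setdefault(toy_stem(token), []).append(position)
    return vector

print(toy_tsvector("2:14: And flocks shall lie down in the midst of her"))
```

For this fragment the toy positions agree with the start of the stored vector shown above (`'2':1 '14':2 'flock':4 'shall':5 'lie':6 'midst':10`).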



<a id='search_me' /> </a>

## Executing Queries,
### Google-lite...very very lite

Recall from the video lecture that
the database is now a collection of vectors. 

To query the database, we must convert our queries into vectors for matching.

For full details, consult the PostgreSQL documentation:
  * https://www.postgresql.org/docs/current/static/textsearch.html
  * https://www.postgresql.org/docs/current/static/textsearch-controls.html
  * https://www.postgresql.org/docs/current/static/textsearch-features.html

Below we show a few examples, which you can play with and adjust as you see fit.

#### Basic connection with readonly user

In [1]:
%load_ext sql
%sql postgres://dsa_ro_user:readonly@dbase.dsa.missouri.edu/dsa_ro

'Connected: dsa_ro_user@dsa_ro'

#### A couple query examples

NOTE:
```
%%sql
```
... allows multi-line SQL statements

NOTE:
Query terms can be joined with boolean operators:
  * `|` is "or" 
  * `&` is "and"
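
How these operators select rows can be sketched in plain Python. The toy below treats a document as a set of already-normalized lexemes and evaluates a flat query; `matches` is a hypothetical helper, and it ignores parentheses, negation, and the stemming that `to_tsquery` applies to query terms.

```python
def matches(lexemes, query):
    """Evaluate a query of lexemes joined by & and | (no parentheses or !)."""
    # | binds least tightly, so split on it first;
    # within each group every &-joined term must be present.
    return any(
        all(term.strip() in lexemes for term in group.split("&"))
        for group in query.split("|")
    )

doc_a = {"hate", "love", "death"}  # contains both terms
doc_b = {"love", "world"}          # contains only one term

print(matches(doc_a, "love | hate"), matches(doc_b, "love | hate"))
print(matches(doc_a, "love & hate"), matches(doc_b, "love & hate"))
```

The "or" query accepts both documents, while the "and" query accepts only the one that contains both lexemes.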
  

In [2]:
%%sql

SELECT id,name,line_no,line, ts_rank_cd(line_tsv_gin, query) AS rank
FROM ir.booklines, to_tsquery('love | hate') query
WHERE query @@ line_tsv_gin
ORDER BY rank DESC LIMIT 20;

20 rows affected.


id,name,line_no,line,rank
8856,./book/luke.txt,286,"6:32: For if ye love them which love you, what thank have ye? for sinners also love those that love them.",0.4
25558,./book/2samuel.txt,311,"13:15: Then Amnon hated her exceedingly; so that the hatred wherewith he hated her was greater than the love wherewith he had loved her. And Amnon said unto her, Arise, be gone.",0.4
12774,./book/john.txt,674,"15:18: If the world hate you, ye know that it hated me before it hated you.",0.3
29794,./book/1john.txt,26,"2:15: Love not the world, neither the things that are in the world. If any man love the world, the love of the Father is not in him.",0.3
29848,./book/1john.txt,80,"4:16: And we have known and believed the love that God hath to us. God is love; and he that dwelleth in love dwelleth in God, and God in him.",0.3
8515,./book/malachi.txt,3,"1:2: I have loved you, saith the LORD. Yet ye say, Wherein hast thou loved us? Was not Esau Jacob's brother? saith the LORD: yet I loved Jacob,",0.3
16938,./book/hosea.txt,36,"3:1: Then said the LORD unto me, Go yet, love a woman beloved of her friend, yet an adulteress, according to the love of the LORD toward the children of Israel, who look to other gods, and love flagons of wine.",0.3
12721,./book/john.txt,621,"13:34: A new commandment I give unto you, That ye love one another; as I have loved you, that ye also love one another.",0.3
28568,./book/1samuel.txt,536,"20:17: And Jonathan caused David to swear again, because he loved him: for he loved him as he loved his own soul.",0.3
23093,./book/deut.txt,574,"21:15: If a man have two wives, one beloved, and another hated, and they have born him children, both the beloved and the hated; and if the firstborn son be hers that was hated:",0.3


In [3]:
%%sql

SELECT id,name,line_no,line, ts_rank_cd(line_tsv_gin, query) AS rank
FROM ir.booklines, to_tsquery('love & hate') query
WHERE query @@ line_tsv_gin
ORDER BY rank DESC LIMIT 10;

10 rows affected.


id,name,line_no,line,rank
4067,./book/proverbs.txt,239,8:36: But he that sinneth against me wrongeth his own soul: all they that hate me love death.,0.05
2914,./book/psalms.txt,1550,"97:10: Ye that love the LORD, hate evil: he preserveth the souls of his saints; he delivereth them out of the hand of the wicked.",0.0333333
27388,./book/2chron.txt,385,"19:2: And Jehu the son of Hanani the seer went out to meet him, and said to king Jehoshaphat, Shouldest thou help the ungodly, and love them that hate the LORD? therefore is wrath upon thee from before the LORD.",0.0333333
16805,./book/isaiah.txt,1200,"61:8: For I the LORD love judgment, I hate robbery for burnt offering; and I will direct their work in truth, and I will make an everlasting covenant with them.",0.0333333
17112,./book/hebrews.txt,11,"1:9: Thou hast loved righteousness, and hated iniquity; therefore God, even thy God, hath anointed thee with the oil of gladness above thy fellows.",0.0333333
6922,./book/matthew.txt,163,"6:24: No man can serve two masters: for either he will hate the one, and love the other; or else he will hold to the one, and despise the other. Ye cannot serve God and mammon.",0.025
6893,./book/matthew.txt,134,"5:43: Ye have heard that it hath been said, Thou shalt love thy neighbour, and hate thine enemy.",0.025
6684,./book/micah.txt,32,"3:2: Who hate the good, and love the evil; who pluck off their skin from off them, and their flesh from off their bones;",0.025
9311,./book/luke.txt,741,"16:13: No servant can serve two masters: for either he will hate the one, and love the other; or else he will hold to the one, and despise the other. Ye cannot serve God and mammon.",0.025
24014,./book/amos.txt,75,"5:15: Hate the evil, and love the good, and establish judgment in the gate: it may be that the LORD God of hosts will be gracious unto the remnant of Joseph.",0.025


In [4]:
%%sql

SELECT id,name,line_no,line, ts_rank_cd(line_tsv_gin, query) AS rank
FROM ir.booklines, to_tsquery('love') query
WHERE query @@ line_tsv_gin
ORDER BY rank DESC LIMIT 10;

10 rows affected.


id,name,line_no,line,rank
8856,./book/luke.txt,286,"6:32: For if ye love them which love you, what thank have ye? for sinners also love those that love them.",0.4
29842,./book/1john.txt,74,"4:10: Herein is love, not that we loved God, but that he loved us, and sent his Son to be the propitiation for our sins.",0.3
8515,./book/malachi.txt,3,"1:2: I have loved you, saith the LORD. Yet ye say, Wherein hast thou loved us? Was not Esau Jacob's brother? saith the LORD: yet I loved Jacob,",0.3
12765,./book/john.txt,665,"15:9: As the Father hath loved me, so have I loved you: continue ye in my love.",0.3
28568,./book/1samuel.txt,536,"20:17: And Jonathan caused David to swear again, because he loved him: for he loved him as he loved his own soul.",0.3
29794,./book/1john.txt,26,"2:15: Love not the world, neither the things that are in the world. If any man love the world, the love of the Father is not in him.",0.3
12721,./book/john.txt,621,"13:34: A new commandment I give unto you, That ye love one another; as I have loved you, that ye also love one another.",0.3
16938,./book/hosea.txt,36,"3:1: Then said the LORD unto me, Go yet, love a woman beloved of her friend, yet an adulteress, according to the love of the LORD toward the children of Israel, who look to other gods, and love flagons of wine.",0.3
29850,./book/1john.txt,82,4:18: There is no fear in love; but perfect love casteth out fear: because fear hath torment. He that feareth is not made perfect in love.,0.3
29848,./book/1john.txt,80,"4:16: And we have known and believed the love that God hath to us. God is love; and he that dwelleth in love dwelleth in God, and God in him.",0.3


##### Optional third argument for ts_rank_cd to normalize the rank

In [5]:
%%sql
SELECT id,name,line_no,line, ts_rank_cd(line_tsv_gin, query, 50) AS rank
FROM ir.booklines, to_tsquery('test | file') query
WHERE query @@ line_tsv_gin
ORDER BY rank DESC LIMIT 10;

2 rows affected.


id,name,line_no,line,rank
31259,./book/one_level_down/two_levels_down/test.txt,1,This is just a test file,0.0593485
28327,./book/1samuel.txt,295,"13:21: Yet they had a file for the mattocks, and for the coulters, and for the forks, and for the axes, and to sharpen the goads.",0.00288232
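
The third argument passed to `ts_rank_cd` in the query above is a normalization bitmask, so `50` is a combination of flags rather than a weight. The sketch below decodes such a value. The flag descriptions are paraphrased from the PostgreSQL text search documentation, and `decode_normalization` is a hypothetical helper.

```python
# Flag meanings paraphrased from the PostgreSQL documentation for ts_rank_cd.
NORMALIZATION_FLAGS = {
    1: "divide by 1 + log(document length)",
    2: "divide by document length",
    4: "divide by mean harmonic distance between extents",
    8: "divide by number of unique words",
    16: "divide by 1 + log(number of unique words)",
    32: "divide rank by itself + 1",
}

def decode_normalization(value):
    """Return the descriptions of the flags set in a normalization bitmask."""
    return [text for bit, text in NORMALIZATION_FLAGS.items() if value & bit]

for description in decode_normalization(50):
    print(description)
```

Since 50 = 2 + 16 + 32, the query above asks for three normalizations at once, which is why its scores sit on a different scale than the earlier examples.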


In [6]:
%%sql
SELECT id,name,line_no,line, ts_rank_cd(line_tsv_gin, query) AS rank
FROM ir.booklines, plainto_tsquery('test file') query
WHERE query @@ line_tsv_gin
ORDER BY rank DESC LIMIT 10;

1 rows affected.


id,name,line_no,line,rank
31259,./book/one_level_down/two_levels_down/test.txt,1,This is just a test file,0.1
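
Only one row comes back here because `plainto_tsquery` joins the words of plain text with `&`, whereas the earlier `to_tsquery('test | file')` query used "or" and matched two rows. The toy below imitates that behavior in Python. It is a rough sketch: `toy_plainto_tsquery` is a hypothetical helper, its stop-word set is a tiny subset, and it does no stemming.

```python
import re

STOP_WORDS = {"a", "is", "just", "the", "this"}  # tiny illustrative subset

def toy_plainto_tsquery(text):
    """Drop stop words and join the remaining terms with & (AND)."""
    terms = [t for t in re.findall(r"[a-z0-9]+", text.lower())
             if t not in STOP_WORDS]
    return " & ".join(f"'{t}'" for t in terms)

print(toy_plainto_tsquery("test file"))
print(toy_plainto_tsquery("This is just a test file"))
```

Both calls produce the same query text, since the extra words in the second call are all in the toy stop-word set.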


In [7]:
%%sql
SELECT id,name,line_no,line, ts_rank_cd(line_tsv_gin, query) AS rank
FROM ir.booklines, plainto_tsquery('love') query
WHERE query @@ line_tsv_gin
ORDER BY rank DESC LIMIT 10;

10 rows affected.


id,name,line_no,line,rank
8856,./book/luke.txt,286,"6:32: For if ye love them which love you, what thank have ye? for sinners also love those that love them.",0.4
29842,./book/1john.txt,74,"4:10: Herein is love, not that we loved God, but that he loved us, and sent his Son to be the propitiation for our sins.",0.3
8515,./book/malachi.txt,3,"1:2: I have loved you, saith the LORD. Yet ye say, Wherein hast thou loved us? Was not Esau Jacob's brother? saith the LORD: yet I loved Jacob,",0.3
12765,./book/john.txt,665,"15:9: As the Father hath loved me, so have I loved you: continue ye in my love.",0.3
28568,./book/1samuel.txt,536,"20:17: And Jonathan caused David to swear again, because he loved him: for he loved him as he loved his own soul.",0.3
29794,./book/1john.txt,26,"2:15: Love not the world, neither the things that are in the world. If any man love the world, the love of the Father is not in him.",0.3
12721,./book/john.txt,621,"13:34: A new commandment I give unto you, That ye love one another; as I have loved you, that ye also love one another.",0.3
16938,./book/hosea.txt,36,"3:1: Then said the LORD unto me, Go yet, love a woman beloved of her friend, yet an adulteress, according to the love of the LORD toward the children of Israel, who look to other gods, and love flagons of wine.",0.3
29850,./book/1john.txt,82,4:18: There is no fear in love; but perfect love casteth out fear: because fear hath torment. He that feareth is not made perfect in love.,0.3
29848,./book/1john.txt,80,"4:16: And we have known and believed the love that God hath to us. God is love; and he that dwelleth in love dwelleth in God, and God in him.",0.3


# Please explore different queries

  1. Explore changing the query below.
  2. Observe how the ranking score changes with different queries and different numbers of search terms.
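
For quick experiments outside the `%%sql` magic, a small helper can build the query text from a list of terms. This is a sketch under assumptions: `make_tsquery` is a hypothetical helper, the search terms are arbitrary examples, and the `SQL` string simply mirrors the queries above with a `%s` placeholder for use with a psycopg2 cursor.

```python
def make_tsquery(terms, operator="&"):
    """Join search terms with a tsquery operator, e.g. 'stone & altar'."""
    if operator not in ("&", "|"):
        raise ValueError("operator must be '&' or '|'")
    cleaned = [t.strip() for t in terms if t.strip()]
    return f" {operator} ".join(cleaned)

# Same shape as the notebook queries; the tsquery text is passed as a parameter.
SQL = (
    "SELECT id, name, line_no, line, ts_rank_cd(line_tsv_gin, query) AS rank "
    "FROM ir.booklines, to_tsquery(%s) query "
    "WHERE query @@ line_tsv_gin "
    "ORDER BY rank DESC LIMIT 10;"
)

for op in ("&", "|"):
    print(make_tsquery(["stone", "altar"], op))
```

With an open connection, `curs.execute(SQL, (make_tsquery(["stone", "altar"], "|"),))` would run the search with the chosen terms.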

In [8]:
%%sql
SELECT id,name,line_no,line, ts_rank_cd(line_tsv_gin, query) AS rank
FROM ir.booklines, plainto_tsquery('stone') query
WHERE query @@ line_tsv_gin
ORDER BY rank DESC LIMIT 10;

10 rows affected.


id,name,line_no,line,rank
31229,./book/1chron.txt,915,"29:2: Now I have prepared with all my might for the house of my God the gold for things to be made of gold, and the silver for things of silver, and the brass for things of brass, the iron for things of iron, and wood for things of wood; onyx stones, and stones to be set, glistering stones, and of divers colours, and all manner of precious stones, and marble stones in abundance.",0.5
11541,./book/joshua.txt,151,"7:25: And Joshua said, Why hast thou troubled us? the LORD shall trouble thee this day. And all Israel stoned him with stones, and burned them with fire, after they had stoned them with stones.",0.4
29178,./book/1kings.txt,228,"7:10: And the foundation was of costly stones, even great stones, stones of ten cubits, and stones of eight cubits.",0.4
16132,./book/isaiah.txt,527,"28:16: Therefore thus saith the Lord GOD, Behold, I lay in Zion for a foundation a stone, a tried stone, a precious corner stone, a sure foundation: he that believeth shall not make haste.",0.3
29129,./book/1kings.txt,179,"5:17: And the king commanded, and they brought great stones, costly stones, and hewed stones, to lay the foundation of the house.",0.3
996,./book/rev.txt,38,"2:17: He that hath an ear, let him hear what the Spirit saith unto the churches; To him that overcometh will I give to eat of the hidden manna, and will give him a white stone, and in the stone a new name written, which no man knoweth saving he that receiveth it.",0.2
5490,./book/numbers.txt,589,"15:36: And all the congregation brought him without the camp, and stoned him with stones, and he died; as the LORD commanded Moses.",0.2
10135,./book/levit.txt,412,"14:42: And they shall take other stones, and put them in the place of those stones; and he shall take other morter, and shall plaister the house.",0.2
5419,./book/numbers.txt,518,14:10: But all the congregation bade stone them with stones. And the glory of the LORD appeared in the tabernacle of the congregation before all the children of Israel.,0.2
1326,./book/rev.txt,368,"21:11: Having the glory of God: and her light was like unto a stone most precious, even like a jasper stone, clear as crystal;",0.2


# Save your notebook