## Data Types
As mentioned on the Data Types page, two Postgres data types together provide text search functionality:
- `TSVECTOR`
- `TSQUERY`

`TSVECTOR` represents a sorted list of distinct lexemes. A lexeme is a normalized form of a word, so that different variants of the same word are merged into one:

In [1]:
# %%
%load_ext sql

# %%
%sql postgresql://postgres:root@localhost:5432/dvdrental

In [2]:
%config SqlMagic.style = '_DEPRECATED_DEFAULT'

In [5]:
%%sql

-- # Use to_tsvector to convert text to tsvector
SELECT TO_TSVECTOR('A quick brown dog jumps over the lazy fox');

 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.


to_tsvector
'brown':3 'dog':4 'fox':9 'jump':5 'lazi':8 'quick':2


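A few things to note here:
- the result contains the words of the original text, reduced to lexemes (`jumps` becomes `jump`, `lazy` becomes `lazi`) and sorted alphabetically
- every lexeme has its position in the original text associated with it
- stop words such as 'a', 'the' and 'over' are skipped, although they still count towards the positions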
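
The three observations above can be reproduced with a minimal Python sketch. It is a toy model of the idea, not how Postgres implements it: the stop-word set and stemming table (`STOP_WORDS`, `STEMS`) are hand-written stand-ins for the `english` configuration, and `toy_tsvector` and `render` are hypothetical helper names.

```python
from collections import defaultdict

# Hand-written stand-ins for the 'english' configuration (assumptions, illustration only).
STOP_WORDS = {"a", "the", "over"}
STEMS = {"jumps": "jump", "lazy": "lazi"}


def toy_tsvector(text):
    """Map each lexeme to the 1-based positions where it occurs."""
    positions = defaultdict(list)
    for pos, word in enumerate(text.lower().split(), start=1):
        if word in STOP_WORDS:
            continue  # stop words are dropped but still consume a position
        positions[STEMS.get(word, word)].append(pos)
    return dict(sorted(positions.items()))  # lexemes sorted alphabetically


def render(vector):
    """Format the mapping in the style of the query output shown earlier."""
    return " ".join(
        f"'{lexeme}':{','.join(map(str, pos))}" for lexeme, pos in vector.items()
    )


print(render(toy_tsvector("A quick brown dog jumps over the lazy fox")))
```

With these hand-picked tables the sketch prints the same lexemes and positions as the `TO_TSVECTOR` call; a real configuration uses a full stop-word list and a stemming dictionary instead.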

The example below makes the term "lexeme" clearer:

In [6]:
%%sql

SELECT TO_TSVECTOR('Run runs runner');

 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.


to_tsvector
"'run':1,2 'runner':3"

"Run" and "runs" are considered the same lexeme, which means a search for "run" also returns "runs" and vice versa. Another thing to note is that this result is specific to the English language:

In [7]:
%%sql

SELECT TO_TSVECTOR('english', 'A quick brown dog jumps over the lazy fox') -- # Same as SELECT TO_TSVECTOR('A quick brown dog jumps over the lazy fox');
UNION
SELECT TO_TSVECTOR('simple', 'A quick brown dog jumps over the lazy fox'); -- # Considers all words in the input text

 * postgresql://postgres:***@localhost:5432/dvdrental
2 rows affected.


to_tsvector
'brown':3 'dog':4 'fox':9 'jump':5 'lazi':8 'quick':2
'a':1 'brown':3 'dog':4 'fox':9 'jumps':5 'lazy':8 'over':6 'quick':2 'the':7


To search for a term, first convert it into a `TSQUERY` using `TO_TSQUERY` (more on querying later):

In [8]:
%%sql

SELECT TO_TSQUERY('fox');

 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.


to_tsquery
'fox'


## Searching
To search for a term in a text, use the `@@` operator:

In [11]:
%%sql

SELECT TO_TSVECTOR('A quick brown dog jumps over the lazy fox') @@ TO_TSQUERY('foxes');
-- # Order does not matter, so one could also write SELECT TO_TSQUERY('foxes') @@ TO_TSVECTOR('A quick brown dog jumps over the lazy fox');

 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.


?column?
True


In [12]:
%%sql

SELECT title FROM movies
WHERE TO_TSVECTOR(title) @@ TO_TSQUERY('star')
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


title
Lucky Star
Five Star Final
The Star Witness
Star Trek VI: The Undiscovered Country
Make Me a Star


What if we want to search for the movie "Star Wars"?

In [13]:
%%sql

SELECT title FROM movies
WHERE TO_TSVECTOR(title) @@ TO_TSQUERY('star wars')
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/dvdrental
(psycopg2.errors.SyntaxError) syntax error in tsquery: "star wars"

[SQL: SELECT title FROM movies
WHERE TO_TSVECTOR(title) @@ TO_TSQUERY('star wars')
LIMIT 5;]
(Background on this error at: https://sqlalche.me/e/20/f405)


It results in an error. The input to `TO_TSQUERY` must be in a specific form, with terms combined using operators:
- **AND:** using `&` operator
- **OR:** using `|` operator
- **NOT:** using `!` operator
- **FOLLOWED BY:** using `<->` operator
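
A rough way to picture what these operators test is to treat a `TSVECTOR` as a mapping from lexeme to positions. The Python sketch below is only a toy model under that assumption: the two vectors are hand-written (roughly what the `simple` configuration would give for the titles), and `has`, `followed_by` and `star_without_franchises` are hypothetical helper names, not Postgres functions.

```python
# Hand-written toy vectors: lexeme -> 1-based positions (illustration only).
clone_wars = {"star": [1], "wars": [2, 5], "the": [3], "clone": [4]}  # "Star Wars: The Clone Wars"
lucky_star = {"lucky": [1], "star": [2]}                              # "Lucky Star"


def has(vector, lexeme):
    """A bare term matches when the lexeme is present in the vector."""
    return lexeme in vector


def followed_by(vector, left, right):
    """left <-> right: `right` occurs at the position immediately after `left`."""
    right_positions = set(vector.get(right, []))
    return any(pos + 1 in right_positions for pos in vector.get(left, []))


def star_without_franchises(vector):
    """star & !(wars | trek), spelled out with Python's and / or / not."""
    return has(vector, "star") and not (has(vector, "wars") or has(vector, "trek"))


print(followed_by(clone_wars, "star", "wars"))   # adjacent and in this order
print(followed_by(clone_wars, "wars", "star"))   # order matters for <->
print(star_without_franchises(lucky_star))
print(star_without_franchises(clone_wars))
```

The point of the sketch is that `&`, `|` and `!` only need to know whether a lexeme is present, while `<->` also relies on the stored positions.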

In [14]:
%%sql

SELECT title FROM movies
WHERE TO_TSVECTOR(title) @@ TO_TSQUERY('star<->wars') -- # star and wars appearing next to each other
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


title
Star Wars: Episode I – The Phantom Menace 3D
Star Wars: Episode I – The Phantom Menace
Star Wars: Episode III – Revenge of the Sith
Star Wars: The Last Jedi
Star Wars: The Clone Wars


In [15]:
%%sql

-- # All movies containing star but excluding star wars and star trek series
SELECT title FROM movies
WHERE TO_TSVECTOR(title) @@ TO_TSQUERY('star&!(wars|trek)')
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


title
The Star Maker
Star Dust
Star Spangled Rhythm
The Lone Star Trail
Star Spangled Rhythm


Postgres provides a few handy functions to convert plain text to `TSQUERY`:

In [16]:
%%sql

SELECT PLAINTO_TSQUERY('star wars'), PHRASETO_TSQUERY('star wars');

 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.


plainto_tsquery,phraseto_tsquery
'star' & 'war','star' <-> 'war'


`WEBSEARCH_TO_TSQUERY` takes input in web search format and converts it to `TSQUERY`:

In [17]:
%%sql

SELECT WEBSEARCH_TO_TSQUERY('"star wars" +clone');

 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.


websearch_to_tsquery
'star' <-> 'war' & 'clone'


## Ranking
Every matching search result can be given a rank indicating how closely it matches the search term:

In [20]:
%%sql

SELECT title, TS_RANK(TO_TSVECTOR(title), TO_TSQUERY('fight')) AS RANK
FROM movies
WHERE TO_TSVECTOR(title) @@ TO_TSQUERY('fight')
ORDER BY RANK DESC
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


title,rank
"Fight, Zatoichi, Fight",0.075990885
Fight Back to School,0.06079271
Ip Man: The Final Fight,0.06079271
Fight Club – Members Only,0.06079271
The Fighting Prince of Donegal,0.06079271


The first entry has a rank of about 0.076, whereas the rest have about 0.061. The first title ranks higher because the term "fight" appears in it twice.
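
The two rank values in this output are consistent with a simple model in which the k-th occurrence of a term contributes its weight divided by k squared. The Python sketch below is only that: a toy model that reproduces these two numbers. The constants `DEFAULT_WEIGHT` (0.1 for an unlabelled lexeme) and `NORMALISER` (pi squared over 6) and the helper `toy_rank` are assumptions made for illustration, not a description of the full `TS_RANK` algorithm.

```python
import math

DEFAULT_WEIGHT = 0.1           # assumed weight of an unlabelled lexeme
NORMALISER = math.pi ** 2 / 6  # assumed scaling constant, roughly 1.645


def toy_rank(occurrences):
    """Sum weight / k**2 over the k-th occurrence, then scale the total."""
    total = sum(DEFAULT_WEIGHT / k ** 2 for k in range(1, occurrences + 1))
    return total / NORMALISER


print(round(toy_rank(1), 8))  # a title containing "fight" once
print(round(toy_rank(2), 8))  # a title containing "fight" twice
```

Under this model a second occurrence adds only a quarter of what the first one did, which is why the top rank is higher but not doubled.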

`TS_RANK` also accepts a normalization option, to account for the following:
- a longer document tends to have more matches just by chance.
- a shorter document with fewer but concentrated matches might actually be more relevant.

There are several possible options. For example, setting it to 1 divides the rank by a logarithmic function of the document length (2 divides it by the length itself), so shorter texts rank higher:

In [21]:
%%sql

SELECT title, TS_RANK(TO_TSVECTOR(title), TO_TSQUERY('fight'), 1) AS RANK
FROM movies
WHERE TO_TSVECTOR(title) @@ TO_TSQUERY('fight')
ORDER BY RANK DESC
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


title,rank
Fighting,0.06079271
The Fighting Sullivans,0.03835593
Come Out Fighting,0.03835593
Fighting Back,0.03835593
The Fighting Seabees,0.03835593
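
The ranks above are consistent with dividing the unnormalized rank by log2(1 + n), where n is the number of lexemes left in the title after stop words are removed. The Python sketch below checks that arithmetic. The formula is inferred from this output and is an assumption, and `log_length_normalised` is a hypothetical helper name.

```python
import math

BASE_RANK = 0.06079271  # rank of a single "fight" match without normalization, from the earlier output


def log_length_normalised(rank, lexeme_count):
    """Assumed form of normalization 1: divide by log2(1 + number of lexemes)."""
    return rank / math.log2(1 + lexeme_count)


print(round(log_length_normalised(BASE_RANK, 1), 8))  # "Fighting": one lexeme
print(round(log_length_normalised(BASE_RANK, 2), 8))  # "The Fighting Sullivans": "the" is a stop word
```

A one-lexeme title keeps its full rank, while each additional lexeme shrinks it, which is why "Fighting" comes first.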


There are other ways to influence the rank as well. For example, if we want to rank movies with the genre "drama" or "romance" higher:

In [22]:
%%sql

SELECT title, (TS_RANK(TO_TSVECTOR(title), TO_TSQUERY('fight'), 1) + (
	CASE
		WHEN genre ILIKE '%drama%' OR genre ILIKE '%romance%' THEN 0.1 
		ELSE 0
	END
)) AS RANK
FROM movies
WHERE TO_TSVECTOR(title) @@ TO_TSQUERY('fight')
ORDER BY RANK DESC
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


title,rank
Fighting for Love,0.1383559316396713
The Fighting Sullivans,0.1383559316396713
The Fighting Chance,0.1383559316396713
Fighting Back,0.1383559316396713
A Man's Fight,0.1383559316396713


To search multiple columns at once, simply concatenate the `TSVECTOR`s:

In [27]:
%%sql

SELECT title, TS_RANK(TO_TSVECTOR(title) || ' ' || TO_TSVECTOR(plot), TO_TSQUERY('fight')) AS rank
FROM movies
WHERE (TO_TSVECTOR(title) || ' ' || TO_TSVECTOR(plot)) @@ TO_TSQUERY('fight')
ORDER BY rank DESC
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


title,rank
Goon: Last of the Enforcers,0.09703509
Dragon Ball Z: Bojack Unbound,0.0967147
Never Back Down,0.0967147
The Legend of Bruce Lee,0.0967147
Knucklehead,0.09652707


The first search result has a higher rank even though its title doesn't contain the word "fight". This is because its plot contains the word "fight" a few times. To give more weight to a particular column, use the `SETWEIGHT` function:

In [29]:
%%sql

SELECT title, TS_RANK(SETWEIGHT(TO_TSVECTOR(title), 'A') || ' ' || SETWEIGHT(TO_TSVECTOR(plot), 'B'), TO_TSQUERY('fight')) AS rank -- # A is the highest weight, D the lowest (and the default)
FROM movies
WHERE (TO_TSVECTOR(title) || ' ' || TO_TSVECTOR(plot)) @@ TO_TSQUERY('fight')
ORDER BY rank DESC
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


title,rank
"Fight, Zatoichi, Fight",0.80212605
Fighting,0.7500232
Fist Fight,0.7391829
Fight Club,0.7391829
Fight Club – Members Only,0.7323812


The `SETWEIGHT` function changes the `TSVECTOR` by labelling every lexeme position with the given weight:

In [None]:
%%sql

SELECT SETWEIGHT(TO_TSVECTOR(title), 'A') 
FROM movies 
LIMIT 5;
-- # Returns the following:
-- # "'friday':3A 'mous':2A"
-- # "'hollywood':1A 'stori':2A"
-- # "'home':1A 'stori':3A 'town':2A"
-- # "'hong':1A 'kong':2A"
-- # "'hoodlum':2A"
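
The labelling can be pictured with a small Python sketch that appends the weight letter to every position. This is a toy model only: `toy_setweight` and `render_weighted` are hypothetical helpers, and the input vector is copied from the first row of the result listed in the comments above.

```python
def toy_setweight(vector, label):
    """Attach a weight label (A, B, C or D) to every position of every lexeme."""
    return {
        lexeme: [f"{pos}{label}" for pos in positions]
        for lexeme, positions in vector.items()
    }


def render_weighted(vector):
    """Format labelled positions in the style of the query output."""
    return " ".join(f"'{lexeme}':{','.join(pos)}" for lexeme, pos in vector.items())


title_vector = {"friday": [3], "mous": [2]}  # lexeme -> positions, taken from the first row
print(render_weighted(toy_setweight(title_vector, "A")))
```

Because the label travels with each position, the title and plot vectors can be concatenated and still be told apart when the rank is computed.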

**Highlighting:**

## Indexing