<img src = "https://images2.imgbox.com/60/09/VFwl5LOq_o.jpg" width="400">

# 4. Full-text Search and PostgreSQL Extensions
---

An introduction to some of PostgreSQL's more advanced capabilities, such as full-text search and extensions.

In [1]:
%pip install -q sqlalchemy

Note: you may need to restart the kernel to use updated packages.


In [2]:
%load_ext sql

In [3]:
%sql postgresql://postgres:123@localhost/sakila

## A review of the LIKE operator
---

The `LIKE` operator allows us to filter our queries by matching patterns in text data. The `%` wildcard matches any sequence of zero or more characters in a string. This is useful when you want to return a result set that matches certain characteristics, and it can also be very helpful during exploratory data analysis or data cleansing tasks.

Let's explore how different usage of the `%` wildcard will return different results by looking at the `film` table of the Sakila DVD Rental database.
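To make the wildcard behavior concrete, here is a minimal Python sketch that mimics `LIKE` by translating a pattern into a regular expression. The helper name `like_match` is hypothetical, and the sketch ignores `ESCAPE` clauses and collation, so treat it as an illustration rather than PostgreSQL's actual implementation.

```python
import re

def like_match(value: str, pattern: str) -> bool:
    """Approximate SQL LIKE: % matches zero or more characters, _ matches exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # any run of characters, including none
        elif ch == "_":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is matched literally
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None

titles = ["GOLD RIVER", "OSCAR GOLD", "ACE GOLDFINGER", "ACADEMY DINOSAUR"]
print([t for t in titles if like_match(t, "GOLD%")])   # titles starting with GOLD
print([t for t in titles if like_match(t, "%GOLD")])   # titles ending with GOLD
print([t for t in titles if like_match(t, "%GOLD%")])  # titles containing GOLD
```

Because `%` can match an empty string, the pattern `GOLD%` would also match a title that is exactly `GOLD`.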

### Instructions

Select all columns for all records whose title begins with `GOLD`.

In [4]:
%%sql

SELECT *
FROM   film
WHERE  title LIKE 'GOLD%' 

 * postgresql://postgres:***@localhost/sakila
3 rows affected.


film_id,title,description,release_year,language_id,original_language_id,rental_duration,rental_rate,length,replacement_cost,rating,last_update,special_features,fulltext
365,GOLD RIVER,A Taut Documentary of a Database Administrator And a Waitress who must Reach a Mad Scientist in A Baloon Factory,2006,1,,4,4.99,154,21.99,R,2006-02-15 05:03:42,"['Trailers', 'Commentaries', 'Deleted Scenes', 'Behind the Scenes']",'administr':9 'baloon':21 'databas':8 'documentari':5 'factori':22 'gold':1 'mad':17 'must':14 'reach':15 'river':2 'scientist':18 'taut':4 'waitress':12
366,GOLDFINGER SENSIBILITY,A Insightful Drama of a Mad Scientist And a Hunter who must Defeat a Pastry Chef in New Orleans,2006,1,,3,0.99,93,29.99,G,2006-02-15 05:03:42,"['Trailers', 'Commentaries', 'Behind the Scenes']",'chef':18 'defeat':15 'drama':5 'goldfing':1 'hunter':12 'insight':4 'mad':8 'must':14 'new':20 'orlean':21 'pastri':17 'scientist':9 'sensibl':2
367,GOLDMINE TYCOON,A Brilliant Epistle of a Composer And a Frisbee who must Conquer a Husband in The Outback,2006,1,,6,0.99,153,20.99,R,2006-02-15 05:03:42,"['Trailers', 'Behind the Scenes']",'brilliant':4 'compos':8 'conquer':14 'epistl':5 'frisbe':11 'goldmin':1 'husband':16 'must':13 'outback':19 'tycoon':2


Now select all records whose title ends with `GOLD`.

In [6]:
%%sql

SELECT *
FROM   film
WHERE  title LIKE '%GOLD' 

 * postgresql://postgres:***@localhost/sakila
2 rows affected.


film_id,title,description,release_year,language_id,original_language_id,rental_duration,rental_rate,length,replacement_cost,rating,last_update,special_features,fulltext
644,OSCAR GOLD,A Insightful Tale of a Database Administrator And a Dog who must Face a Madman in Soviet Georgia,2006,1,,7,2.99,115,29.99,PG,2006-02-15 05:03:42,['Behind the Scenes'],'administr':9 'databas':8 'dog':12 'face':15 'georgia':20 'gold':2 'insight':4 'madman':17 'must':14 'oscar':1 'soviet':19 'tale':5
870,SWARM GOLD,A Insightful Panorama of a Crocodile And a Boat who must Conquer a Sumo Wrestler in A MySQL Convention,2006,1,,4,0.99,123,12.99,PG-13,2006-02-15 05:03:42,"['Trailers', 'Commentaries']",'boat':11 'conquer':14 'convent':21 'crocodil':8 'gold':2 'insight':4 'must':13 'mysql':20 'panorama':5 'sumo':16 'swarm':1 'wrestler':17


Finally, select all records whose title contains `GOLD`.

In [7]:
%%sql

SELECT *
FROM   film
WHERE  title LIKE '%GOLD%'

 * postgresql://postgres:***@localhost/sakila
8 rows affected.


film_id,title,description,release_year,language_id,original_language_id,rental_duration,rental_rate,length,replacement_cost,rating,last_update,special_features,fulltext
2,ACE GOLDFINGER,A Astounding Epistle of a Database Administrator And a Explorer who must Find a Car in Ancient China,2006,1,,3,4.99,48,12.99,G,2006-02-15 05:03:42,"['Trailers', 'Deleted Scenes']",'ace':1 'administr':9 'ancient':19 'astound':4 'car':17 'china':20 'databas':8 'epistl':5 'explor':12 'find':15 'goldfing':2 'must':14
95,BREAKFAST GOLDFINGER,A Beautiful Reflection of a Student And a Student who must Fight a Moose in Berlin,2006,1,,5,4.99,123,18.99,G,2006-02-15 05:03:42,"['Trailers', 'Commentaries', 'Deleted Scenes']","'beauti':4 'berlin':18 'breakfast':1 'fight':14 'goldfing':2 'moos':16 'must':13 'reflect':5 'student':8,11"
365,GOLD RIVER,A Taut Documentary of a Database Administrator And a Waitress who must Reach a Mad Scientist in A Baloon Factory,2006,1,,4,4.99,154,21.99,R,2006-02-15 05:03:42,"['Trailers', 'Commentaries', 'Deleted Scenes', 'Behind the Scenes']",'administr':9 'baloon':21 'databas':8 'documentari':5 'factori':22 'gold':1 'mad':17 'must':14 'reach':15 'river':2 'scientist':18 'taut':4 'waitress':12
366,GOLDFINGER SENSIBILITY,A Insightful Drama of a Mad Scientist And a Hunter who must Defeat a Pastry Chef in New Orleans,2006,1,,3,0.99,93,29.99,G,2006-02-15 05:03:42,"['Trailers', 'Commentaries', 'Behind the Scenes']",'chef':18 'defeat':15 'drama':5 'goldfing':1 'hunter':12 'insight':4 'mad':8 'must':14 'new':20 'orlean':21 'pastri':17 'scientist':9 'sensibl':2
367,GOLDMINE TYCOON,A Brilliant Epistle of a Composer And a Frisbee who must Conquer a Husband in The Outback,2006,1,,6,0.99,153,20.99,R,2006-02-15 05:03:42,"['Trailers', 'Behind the Scenes']",'brilliant':4 'compos':8 'conquer':14 'epistl':5 'frisbe':11 'goldmin':1 'husband':16 'must':13 'outback':19 'tycoon':2
644,OSCAR GOLD,A Insightful Tale of a Database Administrator And a Dog who must Face a Madman in Soviet Georgia,2006,1,,7,2.99,115,29.99,PG,2006-02-15 05:03:42,['Behind the Scenes'],'administr':9 'databas':8 'dog':12 'face':15 'georgia':20 'gold':2 'insight':4 'madman':17 'must':14 'oscar':1 'soviet':19 'tale':5
798,SILVERADO GOLDFINGER,A Stunning Epistle of a Sumo Wrestler And a Man who must Challenge a Waitress in Ancient India,2006,1,,4,4.99,74,11.99,PG,2006-02-15 05:03:42,"['Trailers', 'Commentaries']",'ancient':19 'challeng':15 'epistl':5 'goldfing':2 'india':20 'man':12 'must':14 'silverado':1 'stun':4 'sumo':8 'waitress':17 'wrestler':9
870,SWARM GOLD,A Insightful Panorama of a Crocodile And a Boat who must Conquer a Sumo Wrestler in A MySQL Convention,2006,1,,4,0.99,123,12.99,PG-13,2006-02-15 05:03:42,"['Trailers', 'Commentaries']",'boat':11 'conquer':14 'convent':21 'crocodil':8 'gold':2 'insight':4 'must':13 'mysql':20 'panorama':5 'sumo':16 'swarm':1 'wrestler':17


## What is a tsvector?
---

You saw how to convert strings to `tsvector` and `tsquery` in the video. In this exercise, we are going to dive deeper into what these functions actually return after converting a string to a `tsvector`. You will convert a text column from the `film` table to a `tsvector` and inspect the results. Understanding how full-text search works is a first step toward more advanced machine learning and data science concepts like natural language processing.

### Instructions

Select the film description and convert it to a `tsvector` data type.

In [9]:
%%sql

SELECT to_tsvector(description)
FROM   film
LIMIT  20

 * postgresql://postgres:***@localhost/sakila
20 rows affected.


to_tsvector
'battl':13 'canadian':18 'drama':3 'epic':2 'feminist':6 'mad':9 'must':12 'rocki':19 'scientist':10 'teacher':15
'administr':7 'ancient':17 'astound':2 'car':15 'china':18 'databas':6 'epistl':3 'explor':10 'find':13 'must':12
"'astound':2 'baloon':17 'car':9 'factori':18 'lumberjack':6,14 'must':11 'reflect':3 'sink':12"
'chase':12 'documentari':3 'fanci':2 'frisbe':6 'lumberjack':9 'monkey':14 'must':11 'shark':17 'tank':18
'chef':9 'dentist':12 'documentari':5 'fast':3 'fast-pac':2 'forens':17 'gulf':21 'mexico':23 'must':14 'pace':4 'pastri':8 'psychologist':18 'pursu':15
'ancient':17 'boy':9 'china':18 'escap':12 'intrepid':2 'must':11 'panorama':3 'robot':6 'sumo':14 'wrestler':15
"'boat':18 'butler':9,14 'discov':12 'hunter':6 'jet':17 'must':11 'saga':3 'touch':2"
'ancient':16 'confront':12 'epic':2 'girl':9 'india':17 'monkey':14 'moos':6 'must':11 'tale':3
"'administr':7 'boat':21 'databas':6 'jet':20 'mad':10,16 'must':13 'outgun':14 'panorama':3 'scientist':11,17 'thought':2"
'action':3 'action-pack':2 'ancient':18 'china':19 'feminist':16 'lumberjack':11 'man':8 'must':13 'pack':4 'reach':14 'tale':5
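The numbers after each lexeme in the output are word positions in the original text. As a rough sketch of how such a structure could be built, the Python below records the position of every non-stop word in one of the film descriptions. The function name `simple_tsvector` and the tiny stop-word list are assumptions for illustration; unlike `to_tsvector`, this sketch does no stemming, so it keeps `astounding` where PostgreSQL stores `astound`.

```python
import re
from collections import defaultdict

# Tiny stop-word list assumed for this example; PostgreSQL uses a full dictionary.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "who"}

def simple_tsvector(text):
    """Map each non-stop word to its 1-based positions (no stemming)."""
    positions = defaultdict(list)
    words = re.findall(r"[a-z0-9]+", text.lower())
    for pos, word in enumerate(words, start=1):
        if word not in STOP_WORDS:       # stop words are skipped but still counted
            positions[word].append(pos)
    return dict(sorted(positions.items()))

description = ("A Astounding Reflection of a Lumberjack And a Car "
               "who must Sink a Lumberjack in A Baloon Factory")
for word, pos in simple_tsvector(description).items():
    print(word, pos)
```

Note that `lumberjack` appears twice in the description, so it carries two positions, just as the third output row shows `'lumberjack':6,14`.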


## Basic full-text search
---

Searching text will become something you do repeatedly when building applications or exploring data sets for data science. Full-text search is helpful when performing exploratory data analysis for a natural language processing model or building a search feature into your application.

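In this exercise, you will practice searching a text column by matching it against a string. The search will return the same result as a query that uses the `LIKE` operator with the `%` wildcard at the beginning and end of the string, although the two are not equivalent in general: full-text search matches whole words after normalization, while `LIKE` matches any substring. Full-text search can also perform much better on large tables when the `tsvector` is backed by an index, and it provides a foundation for more advanced full-text search queries. Let's dive in.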
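The difference between word matching and substring matching can be sketched in a few lines of Python. The helpers `word_search` and `substring_search` are hypothetical stand-ins, the title `SELFISH GIANT` is made up for the example, and the word-level search skips the stemming that `to_tsvector` would apply.

```python
import re

def word_search(title, term):
    """Whole-word match, loosely analogous to to_tsvector(title) @@ to_tsquery(term)."""
    tokens = set(re.findall(r"[a-z0-9]+", title.lower()))
    return term.lower() in tokens

def substring_search(title, term):
    """Case-insensitive substring match, loosely analogous to ILIKE '%term%'."""
    return term.lower() in title.lower()

# The last title is hypothetical; it is included to show where the two approaches differ.
titles = ["ELF MURDER", "ENCINO ELF", "GHOSTBUSTERS ELF", "SELFISH GIANT"]

print([t for t in titles if word_search(t, "elf")])
print([t for t in titles if substring_search(t, "elf")])
```

The substring search also returns the made-up title because `elf` occurs inside `SELFISH`, while the word-level search does not.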

### Instructions

Select the `title` and `description` columns from the `film` table.

Perform a full-text search on the `title` column for the word `elf`.

In [10]:
%%sql

SELECT title,
       description
FROM   film
WHERE  to_tsvector(title) @@ to_tsquery('elf')

 * postgresql://postgres:***@localhost/sakila
3 rows affected.


title,description
ELF MURDER,A Action-Packed Story of a Frisbee And a Woman who must Reach a Girl in An Abandoned Mine Shaft
ENCINO ELF,A Astounding Drama of a Feminist And a Teacher who must Confront a Husband in A Baloon
GHOSTBUSTERS ELF,A Thoughtful Epistle of a Dog And a Feminist who must Chase a Composer in Berlin


## User-defined data types
---

`ENUM` or enumerated data types are great options to use in your database when you have a column where you want to store a fixed list of values that rarely change. Examples of when it would be appropriate to use an `ENUM` include days of the week and states or provinces in a country.

Another example is the set of directions on a compass (i.e., north, south, east and west). In this exercise, you are going to create a new `ENUM` data type called `compass_position`.
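As a loose analogy in Python, the notebook's host language, an enumerated type restricts a value to a fixed list of labels and rejects anything else. The class name `CompassPosition` and the helper `parse_position` are hypothetical; this is only a sketch of the idea, not of how PostgreSQL stores `ENUM` values.

```python
from enum import Enum

class CompassPosition(Enum):
    """A fixed set of allowed labels, kept in declaration order."""
    NORTH = "North"
    SOUTH = "South"
    EAST = "East"
    WEST = "West"

def parse_position(label):
    """Return the matching member, or None if the label is not allowed."""
    try:
        return CompassPosition(label)
    except ValueError:
        return None

print([p.value for p in CompassPosition])  # labels in declaration order
print(parse_position("East"))
print(parse_position("Up"))                # not one of the four labels
```

A PostgreSQL `ENUM` column behaves similarly in that inserting a label outside the declared list raises an error.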

### Instructions

Create a new enumerated data type called `compass_position`.

Use the four positions of a compass as the values.

In [11]:
%%sql

CREATE TYPE compass_position AS ENUM (
    'North', 
    'South',
    'East', 
    'West'
);

 * postgresql://postgres:***@localhost/sakila
Done.


[]

Verify that the new data type has been created by looking in the `pg_type` system table.

In [13]:
%%sql

SELECT typname,typcategory
FROM   pg_type
WHERE  typname='compass_position'

 * postgresql://postgres:***@localhost/sakila
1 rows affected.


typname,typcategory
compass_position,E


## Getting info about user-defined data types
---

The Sakila database has a user-defined `enum` data type called `mpaa_rating`. The `rating` column in the `film` table is of type `mpaa_rating` and contains the familiar rating for each film, such as PG or R. This is a great example of when an enumerated data type comes in handy: film ratings have a limited number of standard values that rarely change.

When you want to learn about a column or data type in your database, the best place to start is the `INFORMATION_SCHEMA`. You can find information about the `rating` column that can help you learn about the type of data you can expect to find. For `enum` data types, you can also find the specific values that are valid for a particular `enum` by looking in the `pg_enum` system table. Let's dive into the exercises and learn more.

### Instructions

Select the `column_name`, `data_type`, `udt_name`.

Filter for the `rating` column in the `film` table.

In [14]:
%%sql

SELECT column_name, data_type, udt_name
FROM   INFORMATION_SCHEMA.COLUMNS 
WHERE  table_name = 'film' AND column_name ='rating'

 * postgresql://postgres:***@localhost/sakila
1 rows affected.


column_name,data_type,udt_name
rating,USER-DEFINED,mpaa_rating


Select all columns from the `pg_type` table where the type name is equal to `mpaa_rating`.

In [15]:
%%sql

SELECT *
FROM   pg_type
WHERE  typname='mpaa_rating'

 * postgresql://postgres:***@localhost/sakila
1 rows affected.


oid,typname,typnamespace,typowner,typlen,typbyval,typtype,typcategory,typispreferred,typisdefined,typdelim,typrelid,typsubscript,typelem,typarray,typinput,typoutput,typreceive,typsend,typmodin,typmodout,typanalyze,typalign,typstorage,typnotnull,typbasetype,typtypmod,typndims,typcollation,typdefaultbin,typdefault,typacl
17807,mpaa_rating,17799,10,4,True,e,E,False,True,",",0,-,0,17806,enum_in,enum_out,enum_recv,enum_send,-,-,-,i,p,False,0,-1,0,0,,,


## User-defined functions in Sakila
---

If you were running a real-life DVD Rental store, there are many questions that you may need to answer repeatedly like whether a film is in stock at a particular store or the outstanding balance for a particular customer. These types of scenarios are where user-defined functions will come in very handy. The Sakila database has several user-defined functions pre-defined. These functions are available out-of-the-box and can be used in your queries like many of the built-in functions we've learned about in this course.

In this exercise, you will build a query step-by-step that can be used to produce a report to determine which film title is currently held by which customer using the `inventory_held_by_customer()` function.

### Instructions

Select the `title` and `inventory_id` columns from the `film` and `inventory` tables in the database.

In [17]:
%%sql

SELECT f.title,
       i.inventory_id
FROM   film AS f
       INNER JOIN inventory AS i
               ON f.film_id = i.film_id
LIMIT  20

 * postgresql://postgres:***@localhost/sakila
20 rows affected.


title,inventory_id
ACADEMY DINOSAUR,1
ACADEMY DINOSAUR,2
ACADEMY DINOSAUR,3
ACADEMY DINOSAUR,4
ACADEMY DINOSAUR,5
ACADEMY DINOSAUR,6
ACADEMY DINOSAUR,7
ACADEMY DINOSAUR,8
ACE GOLDFINGER,9
ACE GOLDFINGER,10


Use the `inventory_held_by_customer()` function to determine whether each `inventory_id` is currently held by a customer, and alias the column as `held_by_cust`.

In [18]:
%%sql

SELECT f.title,
       i.inventory_id,
       inventory_held_by_customer(i.inventory_id) AS held_by_cust
FROM   film AS f
       INNER JOIN inventory AS i
               ON f.film_id = i.film_id
LIMIT 20

 * postgresql://postgres:***@localhost/sakila
20 rows affected.


title,inventory_id,held_by_cust
ACADEMY DINOSAUR,1,
ACADEMY DINOSAUR,2,
ACADEMY DINOSAUR,3,
ACADEMY DINOSAUR,4,
ACADEMY DINOSAUR,5,
ACADEMY DINOSAUR,6,554.0
ACADEMY DINOSAUR,7,
ACADEMY DINOSAUR,8,
ACE GOLDFINGER,9,366.0
ACE GOLDFINGER,10,


Now filter your query to only return records where the `inventory_held_by_customer()` function returns a non-null value.

In [19]:
%%sql

SELECT f.title,
       i.inventory_id,
       inventory_held_by_customer(i.inventory_id) AS held_by_cust
FROM   film AS f
       INNER JOIN inventory AS i
               ON f.film_id = i.film_id
WHERE  inventory_held_by_customer(i.inventory_id) IS NOT NULL

LIMIT 20

 * postgresql://postgres:***@localhost/sakila
20 rows affected.


title,inventory_id,held_by_cust
ACADEMY DINOSAUR,6,554
ACE GOLDFINGER,9,366
AFFAIR PREJUDICE,21,111
AFRICAN EGG,25,590
ALI FOREVER,70,108
ALONE TRIP,81,236
AMADEUS HOLY,97,512
AMERICAN CIRCUS,106,44
AMISTAD MIDSUMMER,112,349
ARMAGEDDON LOST,177,317


## Enabling extensions
---

Before you can use the capabilities of an extension, it must be enabled. As you have previously learned, most PostgreSQL distributions come bundled with many useful extensions that extend the native features of your database. You will be working with `fuzzystrmatch` and `pg_trgm` in upcoming exercises, so you first need to make sure they are enabled in the database. In this exercise you will enable the `pg_trgm` extension and then query the `pg_extension` system table to confirm that the `fuzzystrmatch` extension, which was enabled in the video, is still enabled.

### Instructions

Enable the `pg_trgm` extension

In [20]:
%%sql

CREATE EXTENSION IF NOT EXISTS pg_trgm

 * postgresql://postgres:***@localhost/sakila
Done.


[]

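Now confirm that both `fuzzystrmatch` and `pg_trgm` are enabled by selecting all rows from the `pg_extension` system table. If `fuzzystrmatch` is missing from the result, as it is in the output shown here, enable it the same way with `CREATE EXTENSION IF NOT EXISTS fuzzystrmatch` before using `levenshtein`.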

In [21]:
%%sql

SELECT * 
FROM   pg_extension

 * postgresql://postgres:***@localhost/sakila
2 rows affected.


oid,extname,extowner,extnamespace,extrelocatable,extversion,extconfig,extcondition
13740,plpgsql,10,11,False,1.0,,
18279,pg_trgm,10,17799,True,1.6,,


## Measuring similarity between two strings
---

Now that you have enabled the `fuzzystrmatch` and `pg_trgm` extensions you can begin to explore their capabilities. First, we will measure the similarity between the title and description from the `film` table of the Sakila database.
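To give a feel for what a trigram similarity score measures, the sketch below follows the approach described in the `pg_trgm` documentation: lowercase each word, pad it with two leading spaces and one trailing space, collect all three-character sequences, and divide the number of shared trigrams by the number of distinct trigrams across both strings. The function names are hypothetical, and the real extension may differ in details such as how non-ASCII characters are handled.

```python
import re

def trigrams(text):
    """Collect the set of padded, lowercase trigrams for every word in text."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "       # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams

def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams; 0.0 if either side is empty."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(sorted(trigrams("word")))
print(round(trigram_similarity("word", "two words"), 6))
```

A short title shares few trigrams with a long description, which is why the scores in this exercise are expected to be close to zero.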

### Instructions

Select the film title and description.

Calculate the similarity between the title and description.

In [22]:
%%sql

SELECT title,
       description,
       similarity(title, description)
FROM   film 

LIMIT  20

 * postgresql://postgres:***@localhost/sakila
20 rows affected.


title,description,similarity
ACADEMY DINOSAUR,A Epic Drama of a Feminist And a Mad Scientist who must Battle a Teacher in The Canadian Rockies,0.02
ACE GOLDFINGER,A Astounding Epistle of a Database Administrator And a Explorer who must Find a Car in Ancient China,0.041237112
ADAPTATION HOLES,A Astounding Reflection of a Lumberjack And a Car who must Sink a Lumberjack in A Baloon Factory,0.045454547
AFFAIR PREJUDICE,A Fanciful Documentary of a Frisbee And a Lumberjack who must Chase a Monkey in A Shark Tank,0.010204081
AFRICAN EGG,A Fast-Paced Documentary of a Pastry Chef And a Dentist who must Pursue a Forensic Psychologist in The Gulf of Mexico,0.00952381
AGENT TRUMAN,A Intrepid Panorama of a Robot And a Boy who must Escape a Sumo Wrestler in Ancient China,0.03409091
AIRPLANE SIERRA,A Touching Saga of a Hunter And a Butler who must Discover a Butler in A Jet Boat,0.025974026
AIRPORT POLLOCK,A Epic Tale of a Moose And a Girl who must Confront a Monkey in Ancient India,0.012820513
ALABAMA DEVIL,A Thoughtful Panorama of a Database Administrator And a Mad Scientist who must Outgun a Mad Scientist in A Jet Boat,0.05154639
ALADDIN CALENDAR,A Action-Packed Tale of a Man And a Lumberjack who must Reach a Feminist in Ancient China,0.044444446


## Levenshtein distance examples
---

Now let's take a closer look at how we can use the `levenshtein` function to match strings against text data. If you recall, the Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to convert one string into another.

In a search application or when performing data analysis on any data that contains manual user input, you will always want to account for typos or incorrect spellings. The `levenshtein` function provides a great method for performing this task. In this exercise, we will perform a query against the `film` table using a search string with a misspelling and use the results from `levenshtein` to determine a match. Let's check it out.
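The distance itself can be computed with a standard dynamic-programming approach. The Python sketch below is a generic textbook version with unit cost for each edit; it is not the source code of the `fuzzystrmatch` extension, and the misspelled title is invented for the example.

```python
def levenshtein(source, target):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]                    # turning i characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep if the characters match
            ))
        previous = current
    return previous[-1]

print(levenshtein("ACE GOLDFINGER", "ACE GOLDFINGR"))  # one dropped letter
print(levenshtein("kitten", "sitting"))
```

Sorting candidate titles by this distance puts the closest spellings first, which is the idea behind ordering the query result by the computed distance.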

### Instructions

Select the film title and film description.

Calculate the levenshtein distance for the film title with the string `JET NEIGHBOR`.

In [None]:
%%sql

SELECT title,
       description,
       LEVENSHTEIN(title, 'JET NEIGHBOR') AS distance
FROM   film
ORDER  BY distance

## Putting it all together
---

In this exercise, we are going to use many of the techniques and concepts we learned throughout the course to generate a data set that we could use to predict whether the words and phrases used to describe a film have an impact on the number of rentals.

First, you need to create a `tsvector` from the `description` column in the `film` table. You will match against a `tsquery` to determine if the phrase "Astounding Drama" leads to more rentals per month. Next, create a new column using the `similarity` function to rank the film descriptions based on this phrase.

### Instructions

Select the title and description for all DVDs from the `film` table.

Perform a full-text search by converting the description to a `tsvector` and match it to the phrase `'Astounding & Drama'` using a `tsquery` in the `WHERE` clause.

In [24]:
%%sql

SELECT title,
       description
FROM   film
WHERE  to_tsvector(description) @@ to_tsquery('Astounding & Drama')

 * postgresql://postgres:***@localhost/sakila
5 rows affected.


title,description
BIKINI BORROWERS,A Astounding Drama of a Astronaut And a Cat who must Discover a Woman in The First Manned Space Station
CAMPUS REMEMBER,A Astounding Drama of a Crocodile And a Mad Cow who must Build a Robot in A Jet Boat
COWBOY DOOM,A Astounding Drama of a Boy And a Lumberjack who must Fight a Butler in A Baloon
ENCINO ELF,A Astounding Drama of a Feminist And a Teacher who must Confront a Husband in A Baloon
GLASS DYING,A Astounding Drama of a Frisbee And a Astronaut who must Fight a Dog in Ancient Japan


Add a new column that calculates the similarity of the description with the phrase 'Astounding Drama'.

Sort the results by the new similarity column in descending order.

In [25]:
%%sql

SELECT   title,
         description,
         similarity(description, 'Astounding Drama')
FROM     film
WHERE    to_tsvector(description) @@ to_tsquery('Astounding & Drama')
ORDER BY similarity(description, 'Astounding Drama') DESC

 * postgresql://postgres:***@localhost/sakila
5 rows affected.


title,description,similarity
COWBOY DOOM,A Astounding Drama of a Boy And a Lumberjack who must Fight a Butler in A Baloon,0.24637681
GLASS DYING,A Astounding Drama of a Frisbee And a Astronaut who must Fight a Dog in Ancient Japan,0.23943663
CAMPUS REMEMBER,A Astounding Drama of a Crocodile And a Mad Cow who must Build a Robot in A Jet Boat,0.2361111
ENCINO ELF,A Astounding Drama of a Feminist And a Teacher who must Confront a Husband in A Baloon,0.22972973
BIKINI BORROWERS,A Astounding Drama of a Astronaut And a Cat who must Discover a Woman in The First Manned Space Station,0.1954023
