In [1]:
%load_ext sql

In [2]:
%sql postgresql://postgres:aa@localhost:5432/dvdrental

In [4]:
%%sql
/* see what extensions are available */
SELECT name
FROM pg_available_extensions;

 * postgresql://postgres:***@localhost:5432/dvdrental
99 rows affected.


name
adminpack
amcheck
autoinc
bloom
bool_plperl
bool_plperlu
btree_gin
btree_gist
citext
cube


In [5]:
%%sql
/* see which extensions are already installed */
SELECT extname
FROM pg_extension;

 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.


extname
plpgsql


In [7]:
%%sql
/* enable an extension that is available */
-- enable the fuzzystrmatch extension
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- confirm that the extension has been enabled
SELECT extname FROM pg_extension;

 * postgresql://postgres:***@localhost:5432/dvdrental
Done.
2 rows affected.


extname
plpgsql
fuzzystrmatch


#### Using fuzzystrmatch for fuzzy searching
When performing a full-text search based on user input, or when analyzing and comparing text data in a natural language processing exercise, a function you will use often is levenshtein from the fuzzystrmatch extension. The levenshtein function calculates the Levenshtein distance between two strings: the minimum number of single-character edits (insertions, deletions, or substitutions) required to turn one string into the other. In this example the distance returned is 2, because converting GUMBO to GAMBOL requires two edits: replacing the U with an A and adding an L to the end of the word.

In [8]:
%%sql
SELECT levenshtein('GUMBO','GAMBOL'); 

 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.


levenshtein
2
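
To make the edit count concrete, here is a minimal Python sketch of the standard dynamic-programming approach to Levenshtein distance. It is an illustration only, not the extension's actual implementation, and the function name simply mirrors the SQL function for readability.

```python
def levenshtein(source: str, target: str) -> int:
    """Return the minimum number of single-character edits between two strings."""
    # previous[j] is the distance between the already-processed prefix of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into an empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


print(levenshtein("GUMBO", "GAMBOL"))  # 2
```

The result agrees with the SQL output above: one substitution (U to A) plus one insertion (the trailing L).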


#### Compare two strings with pg_trgm
The pg_trgm extension provides functions and operators that determine the similarity of two strings using trigram matching. Trigrams are groups of three consecutive characters in a string; the number of trigrams that two strings share gives a measure of how similar they are. This measure can be calculated with the extension's similarity function, which accepts two parameters: the string you wish to compare and the string you wish to compare it against. It returns a number from 0 to 1, with 0 meaning the strings share no trigrams and 1 meaning a perfect match. In this example, similarity on GUMBO and GAMBOL returns a value of about 0.181818. Note that the function is only available once pg_trgm has been enabled, which is why the first attempt below fails.

In [9]:
%%sql
SELECT similarity('GUMBO','GAMBOL'); 

 * postgresql://postgres:***@localhost:5432/dvdrental
(psycopg2.errors.UndefinedFunction) function similarity(unknown, unknown) does not exist
LINE 1: SELECT similarity('GUMBO','GAMBOL');
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

[SQL: SELECT similarity('GUMBO','GAMBOL');]
(Background on this error at: https://sqlalche.me/e/14/f405)


In [10]:
%%sql
/* enable an extension that is available */
-- enable the pg_trgm extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- confirm that the extension has been enabled
SELECT extname FROM pg_extension;

 * postgresql://postgres:***@localhost:5432/dvdrental
Done.
3 rows affected.


extname
plpgsql
fuzzystrmatch
pg_trgm


In [11]:
%%sql
SELECT similarity('GUMBO','GAMBOL'); 

 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.


similarity
0.18181819
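
The 0.18181819 value can be reproduced by hand. The sketch below is a simplified Python model of trigram similarity, assuming the padding rule described in the pg_trgm documentation (each word is treated as having two spaces prefixed and one space suffixed), lowercase folding, and ASCII letters and digits only. The helper names are illustrative and not part of the extension.

```python
import re


def trigrams(text: str) -> set:
    """Return the set of trigrams in text, padding each word the way pg_trgm documents."""
    result = set()
    # Lowercase the input and treat each run of letters/digits as one word.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams found in either string."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(sorted(trigrams("GUMBO") & trigrams("GAMBOL")))  # ['  g', 'mbo']
print(round(similarity("GUMBO", "GAMBOL"), 6))         # 0.181818
```

Under these assumptions GUMBO yields 6 trigrams and GAMBOL yields 7; they share 2, so the score is 2 / 11, which matches the SQL result.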


In [12]:
%%sql
-- Select all columns for the installed extensions
SELECT * 
FROM pg_extension;

 * postgresql://postgres:***@localhost:5432/dvdrental
3 rows affected.


oid,extname,extowner,extnamespace,extrelocatable,extversion,extconfig,extcondition
15049,plpgsql,10,11,False,1.0,,
16719,fuzzystrmatch,10,2200,True,1.2,,
16731,pg_trgm,10,2200,True,1.6,,


#### Measuring similarity between two strings
Now that you have enabled the fuzzystrmatch and pg_trgm extensions, you can begin to explore their capabilities. First, we will measure the similarity between the title and description columns of the film table in the Sakila-based dvdrental database.

In [13]:
%%sql
-- Select the title and description columns
SELECT 
  title, 
  description, 
  -- Calculate the similarity
  similarity(title,description)
FROM 
  film
LIMIT 3;

 * postgresql://postgres:***@localhost:5432/dvdrental
3 rows affected.


title,description,similarity
Chamber Italian,A Fateful Reflection of a Moose And a Husband who must Overcome a Monkey in Nigeria,0.011627907
Grosse Wonderful,A Epic Drama of a Cat And a Explorer who must Redeem a Moose in Australia,0.025641026
Airport Pollock,A Epic Tale of a Moose And a Girl who must Confront a Monkey in Ancient India,0.012820513


#### Levenshtein distance examples
Now let's take a closer look at how we can use the levenshtein function to match strings against text data. Recall that the Levenshtein distance is the number of edits required to convert one string into the string it is being compared with.

In a search application, or when analyzing any data that contains manual user input, you will want to account for typos and incorrect spellings. The levenshtein function provides a simple way to do this. In this exercise, we will query the film table using a search string that contains a misspelling and use the results from levenshtein to determine a match. Let's check it out.

In [14]:
%%sql
-- Select the title and description columns
SELECT  
  title, 
  description, 
  -- Calculate the levenshtein distance
  levenshtein('JET NEIGHBOR', title) AS distance
FROM 
  film
ORDER BY 3
LIMIT 3;

 * postgresql://postgres:***@localhost:5432/dvdrental
3 rows affected.


title,description,distance
Early Home,A Amazing Panorama of a Mad Scientist And a Husband who must Meet a Woman in The Outback,10
Egg Igby,A Beautiful Documentary of a Boat And a Sumo Wrestler who must Succumb a Database Administrator in The First Manned Space Station,10
Jet Neighbors,A Amazing Display of a Lumberjack And a Teacher who must Outrace a Woman in A U-Boat,10
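
The intended match, Jet Neighbors, scores 10 rather than 1 because the comparison is case-sensitive: the search string is upper case while the titles in this database are stored in mixed case. The Python sketch below illustrates the effect with a plain dynamic-programming Levenshtein function (an illustration, not the extension's code).

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


search = "JET NEIGHBOR"
title = "Jet Neighbors"

# Case-sensitive: each lowercase letter in the title counts as a substitution.
print(levenshtein(search, title))          # 10
# After normalizing case, only the trailing S differs.
print(levenshtein(search, title.upper()))  # 1
```

In SQL, the same normalization could be applied by wrapping the column, for example levenshtein('JET NEIGHBOR', upper(title)). That variant was not run against this database, so treat it as a suggestion rather than a recorded result.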


## Putting it all together
In this exercise, we are going to use many of the techniques and concepts we learned throughout the course to generate a data set that we could use to predict whether the words and phrases used to describe a film have an impact on the number of rentals.

First, you need to create a tsvector from the description column in the film table. You will match against a tsquery to determine if the phrase "Astounding Drama" leads to more rentals per month. Next, create a new column using the similarity function to rank the film descriptions based on this phrase.

In [15]:
%%sql
/*
Select the title and description for all DVDs from the film table.
Perform a full-text search by converting the description to a tsvector and match it to the phrase 'Astounding & Drama' using a tsquery in the WHERE clause.
*/
-- Select the title and description columns
SELECT  
  title, 
  description 
FROM 
  film
WHERE 
  -- Match descriptions that contain both "Astounding" and "Drama"
  to_tsvector(description) @@
  to_tsquery('Astounding & Drama');

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


title,description
Bikini Borrowers,A Astounding Drama of a Astronaut And a Cat who must Discover a Woman in The First Manned Space Station
Campus Remember,A Astounding Drama of a Crocodile And a Mad Cow who must Build a Robot in A Jet Boat
Cowboy Doom,A Astounding Drama of a Boy And a Lumberjack who must Fight a Butler in A Baloon
Encino Elf,A Astounding Drama of a Feminist And a Teacher who must Confront a Husband in A Baloon
Glass Dying,A Astounding Drama of a Frisbee And a Astronaut who must Fight a Dog in Ancient Japan


In [16]:
%%sql
SELECT 
  title, 
  description, 
  -- Calculate the similarity to the search phrase
  similarity(description, 'Astounding & Drama')
FROM 
  film 
WHERE 
  to_tsvector(description) @@ 
  to_tsquery('Astounding & Drama') 
ORDER BY 
  -- Rank by the same expression that is displayed
  similarity(description, 'Astounding & Drama') DESC;

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


title,description,similarity
Cowboy Doom,A Astounding Drama of a Boy And a Lumberjack who must Fight a Butler in A Baloon,0.24637681
Glass Dying,A Astounding Drama of a Frisbee And a Astronaut who must Fight a Dog in Ancient Japan,0.23943663
Campus Remember,A Astounding Drama of a Crocodile And a Mad Cow who must Build a Robot in A Jet Boat,0.2361111
Encino Elf,A Astounding Drama of a Feminist And a Teacher who must Confront a Husband in A Baloon,0.22972973
Bikini Borrowers,A Astounding Drama of a Astronaut And a Cat who must Discover a Woman in The First Manned Space Station,0.1954023
