# Semantic Search demo

In this demo we will perform semantic queries on a vector DB.

## Prerequisites

1. Install the dependencies listed in `requirements.txt`.
2. Run `populate_movie_faiss.py` to create and populate the FAISS index.

## Dataset

For this demo we'll use the [CMU Movie Summary Corpus](https://www.cs.cmu.edu/~ark/personas/), a dataset of movie metadata (name, release date, actors, etc.) together with plot summaries.

The features we'll use are:

- plot_summary (str): a summary of the movie's plot
- name (str): name of the movie
- release_date (str): date of release 
- box_office (float): box office earnings of the movie
- runtime (float): movie runtime
- languages (array): movie languages
- countries (array): movie countries
- genres (array): movie genres
- actors (array): movie actors

In [1]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

## Load the FAISS index in memory

In [2]:
embeddings = HuggingFaceEmbeddings()
vector_store = FAISS.load_local('movie_faiss', embeddings, allow_dangerous_deserialization=True)


In [3]:
print('Num vectors stored:', vector_store.index.ntotal)
print('Vector dim:', vector_store.index.d)

Num vectors stored: 42204
Vector dim: 768


## Semantic queries

The benefit of using a vector DB instead of a regular one is that we can perform semantic queries instead of regular keyword-based ones. 
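To make the contrast concrete, here is a minimal, self-contained sketch. The three-dimensional vectors are invented for illustration (the embeddings used in this demo have 768 dimensions), and `cosine_similarity` and `keyword_match` are hypothetical helpers, not part of any library. Cosine similarity is only one common choice of metric; the index in this demo may use a different one.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def keyword_match(query, text):
    """True only if every query word appears literally in the text."""
    words = text.lower().split()
    return all(w in words for w in query.lower().split())

# Hand-made toy "embeddings"; the numbers are invented for illustration only.
docs = {
    "A mob boss fights a rival crime family": [0.9, 0.1, 0.0],
    "Two friends open a bakery in Paris": [0.0, 0.2, 0.9],
}
query_text = "gangster movie"
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of the query

# Keyword search finds nothing: neither summary contains the word "gangster".
keyword_hits = [t for t in docs if keyword_match(query_text, t)]

# Vector search still ranks the crime summary first, because its vector
# points in nearly the same direction as the query vector.
best_semantic = max(docs, key=lambda t: cosine_similarity(query_vec, docs[t]))

print("Keyword hits:", keyword_hits)
print("Best semantic match:", best_semantic)
```

The keyword search comes back empty, while the vector comparison picks the mob summary even though it shares no words with the query.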

For example, let's say we want to look for movies that have to do with gangsters...

In [5]:
results = vector_store.similarity_search('gangster movie', k=10)  # return 10 results

for res in results:
    print(f"* {res.metadata['name']}")
    print(res.page_content)
    print()

* The Musketeers of Pig Alley
The film is about a poor married couple living in New York City. The husband works as a musician and must travel often for work. When returning, his wallet is taken by a gangster. His wife goes to a ball where a man tries to drug her, but his attempt is stopped by the same man who robbed the husband. The two criminals become rivals, and a shootout ensues. The husband gets caught in the shootout and recognizes one of the men as the gangster who took his money. The husband sneaks his wallet back and the gangster goes to safety in the couple's apartment. Policemen track the gangster down but the wife gives him a false alibi.

* Marrying the Mafia
The film is a gangster comedy about a businessman who becomes involved with the gangster underworld through the daughter of a crime boss.Synopsis based on {{cite web}} A businessman and a young woman wake up in bed together with no knowledge of how they got there. Next the business man is confronted by the young woma

Or let's say we want movies about the Vietnam War...

In [7]:
results = vector_store.similarity_search('vietnam war')

for res in results:
    print(f"* {res.metadata['name']}")
    print(res.page_content)
    print()

* How Sleep the Brave
During the Vietnam War of Christmas 1969, a group of fresh young American soldiers who arrive at an army camp in Vietnam are sent to patrol in a nearby jungle. Once they have killed a few Viet Cong soldiers and losing a couple of their comrades in the battle, they return to camp. They are now sent on a mission, which is to destroy a Viet Cong village. After they destroy the village, they embark on a hazardous journey through a jungle to board a helicopter and return to camp. But, it's only a matter of who will survive the Viet Cong's gunshots and make it to the helicopter.

* Hail, Hero!
During the Vietnam War, college student Carl Dixon quits school and joins the Army in hopes of using love, not bullets, to combat the Viet Cong.

* Vietnam: The Last Battle
Pilger introduces the film, on the 20th anniversary of the end of the conflict, from the roof of the U.S. Embassy, Saigon, where the last American troops had departed by helicopter. Veteran Robert Muller, inter

Or another one. Let's say we want movies about art thieves...

In [9]:
results = vector_store.similarity_search('art thieves')

for res in results:
    print(f"* {res.metadata['name']}")
    print(res.page_content)
    print()

* Artworks
Runtime:  93.0
A police chief’s daughter in home security sales, contacts an art gallery possessor. They love art and have despise for their clients’ motives for gathering artworks. They think of a plot to steal overlooked but valuable artworks from her affluent clients’ homes.

* Fakers
Runtime:  85.0
In this crime caper set in the eccentric London art world, Nick Edwards  owes £50,000 to the super-smooth, yet brutal, crime lord Foster Wright  and has four days to find the cash. Nick knows nothing about working a heist of that size, but when he stumbles across a lost sketch by the legendary Italian artist Antonio Fraccini, he believes he’s in the clear. The problem is, it’s only worth 15 grand. With the help of the eternal cynic Eve  and her extremely talented yet naïve artist brother Tony , the plan is hatched; to forge the drawing and sell it to five Mayfair galleries within an hour before anyone cottons onto the fact there’s a scam going down.

* Insadong Scandal
Runtime

### Filtering

We can also combine semantic queries with regular ones. 

Let's say we want to look for WW2 movies between 1 and 2 hours long...

In [18]:
results = vector_store.similarity_search(
    'world war 2',
    filter=lambda d: 60 < d['runtime'] < 120,
    k=5
)

for res in results:
    print(f"* {res.metadata['name']}")
    print('Runtime: ', res.metadata['runtime'])
    print(res.page_content)
    print()

* The Young Warriors
Runtime:  93.0
Europe: 1944. A group of replacements are assigned to Sgt Cooley's squad and sent into battle. Initially frightened, Hacker grows to love killing but loses that feeling as well. He is promoted to Corporal and later given his own squad.

* A Midnight Clear
Runtime:  107.0
In France in 1944, an American Intelligence squad locates a German platoon in the Ardennes, wishing to surrender rather than die in Germany's final war offensive. The two groups of men, isolated from the war at present, put aside their differences and spend Christmas together before the surrender plan turns bad and both sides are forced to fight each other.

* Eagles Over London
Runtime:  100.0
During World War II at the height of the Battle of Britain, British military officers are in pursuit of a merciless team of Nazi saboteurs. They searched though war-ravaged London but the Nazis eluded them. Finally, the British caught up with the Germans in a final battle at the RAF Control Ce

Now imagine the following scenario. We are looking for the movie *Top Gun*, but we forgot its name. We know that the protagonist is played by Tom Cruise and that he is a US military pilot. Can we use this information to somehow find the movie?

In [76]:
def find_actor(actor_name):
    """
    Helper function factory to produce functions that look for specific actors
    """
    def find_func(d):
        try:
            return actor_name in d['actors']
        except TypeError:
            return False
    return find_func
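The `try`/`except` matters because the filter is called with each candidate's metadata, and some movies may have no usable actor list. Here is a quick check on hand-written metadata dicts. The sample values are made up, and the idea that a missing list shows up as `None` or NaN is an assumption about the data, not something the dataset documents.

```python
def find_actor(actor_name):
    """Return a filter function that checks whether actor_name is in a movie's cast."""
    def find_func(d):
        try:
            return actor_name in d['actors']
        except TypeError:
            # 'actors' is not a container (e.g. None or NaN): treat as no match
            return False
    return find_func

find_tom_cruise = find_actor('Tom Cruise')

# Made-up metadata dicts, for illustration only
print(find_tom_cruise({'actors': ['Tom Cruise', 'Val Kilmer']}))  # True
print(find_tom_cruise({'actors': ['Tom Hanks']}))                 # False
print(find_tom_cruise({'actors': None}))                          # False
print(find_tom_cruise({'actors': float('nan')}))                  # False
```

Without the `except` branch, the last two calls would raise a `TypeError` and abort the whole search.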

In [77]:
find_tom_cruise = find_actor('Tom Cruise')
        
        
results = vector_store.similarity_search(
    'aircraft pilot',        # look for movies that have something to do with the query 'aircraft pilot'
    filter=find_tom_cruise,  # filter movies by actor 'Tom Cruise' 
)

for res in results:
    print(f"* {res.metadata['name']}")
    print('Actors: ', res.metadata['actors'])
    print(res.page_content)
    print()

Why didn't we get any results?

The reason is how this library applies the filter during the search. There are two approaches, **pre-filtering** and **post-filtering**, depending on whether the filter is applied before or after computing the vector similarity.

This library uses post-filtering: by default it fetches the 20 most similar results from the vector similarity search and only then applies the filter. In our example, if none of those 20 results for 'aircraft pilot' is a Tom Cruise film, nothing is returned!
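The behaviour can be reproduced without any vector store. The sketch below is a simplified stand-in, not the library's actual implementation: `post_filter_search` is a hypothetical function, the movie list is synthetic, and the default `k=4` is an assumption (the default `fetch_k=20` mirrors the figure mentioned above).

```python
def post_filter_search(ranked, predicate, k=4, fetch_k=20):
    """Post-filtering over metadata dicts already sorted by similarity (best first)."""
    candidates = ranked[:fetch_k]                    # 1. keep the fetch_k most similar
    kept = [d for d in candidates if predicate(d)]   # 2. apply the metadata filter
    return kept[:k]                                  # 3. return at most k survivors

# Synthetic ranking of 100 movies; only the one at rank 50 stars Tom Cruise.
ranked = [{'name': f'Movie {i}', 'actors': ['Someone Else']} for i in range(100)]
ranked[50]['actors'] = ['Tom Cruise']

def has_cruise(d):
    return 'Tom Cruise' in d['actors']

# The match sits outside the top 20, so it is discarded before the filter runs.
print([d['name'] for d in post_filter_search(ranked, has_cruise)])
# → []

# Widening the candidate pool lets the match survive.
print([d['name'] for d in post_filter_search(ranked, has_cruise, fetch_k=100)])
# → ['Movie 50']
```

A pre-filtering system would instead restrict the collection to Tom Cruise films first and rank only those, so it would never return an empty list while matches exist.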

To mitigate this, we need to fetch more candidates from the vector similarity search before filtering!

In [78]:
find_tom_cruise = find_actor('Tom Cruise')

results = vector_store.similarity_search(
    'aircraft pilot',        # look for movies that have something to do with the query 'aircraft pilot'
    filter=find_tom_cruise,  # filter movies by actor 'Tom Cruise' 
    fetch_k=5000             # fetch 5000 candidates before filtering
)

for res in results:
    print(f"* {res.metadata['name']}")
    print('Actors: ', res.metadata['actors'])
    print(res.page_content)
    print()

* Valkyrie
Actors:  ['Kenneth Branagh' 'Tom Cruise' 'Bill Nighy' 'Tom Wilkinson'
 'Carice van Houten' 'Thomas Kretschmann' 'Terence Stamp' 'Eddie Izzard'
 'Kevin McNally' 'Christian Berkel' 'Jamie Parker' 'David Bamber'
 'Tom Hollander' 'Halina Reijn' 'Christian Oliver' 'Florian Panzner']
During World War II, Wehrmacht Colonel Claus von Stauffenberg  is severely wounded during an RAF air raid in Tunisia, losing a hand and an eye, and is evacuated home to Nazi Germany. Meanwhile, Major General Henning von Tresckow  attempts to assassinate Adolf Hitler by smuggling a bomb aboard the Führer's personal airplane. The bomb, however, is a dud and fails to detonate, and Tresckow flies to Berlin in order to safely retrieve it. After learning that the Gestapo has arrested Major General Hans Oster, he orders General Olbricht  to find a replacement. After recruiting Stauffenberg into the German Resistance, Olbricht presents Stauffenberg at a meeting of the secret committee which has coordinated pr
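Picking `fetch_k=5000` by hand works, but the right value depends on how rare the filtered items are. One possible alternative, sketched here under assumptions, is to grow `fetch_k` until enough results survive the filter. `search_with_growing_fetch_k` and its parameters are hypothetical names, and the `search` argument is any callable that takes a `fetch_k` value and returns the filtered results.

```python
def search_with_growing_fetch_k(search, k, start=20, factor=4, max_fetch_k=50_000):
    """Call search(fetch_k) with increasing fetch_k until k results come back."""
    fetch_k = start
    while True:
        results = search(fetch_k)
        # Stop when we have enough results or have hit the upper bound.
        if len(results) >= k or fetch_k >= max_fetch_k:
            return results, fetch_k
        fetch_k = min(fetch_k * factor, max_fetch_k)

# Stub standing in for a filtered vector search: the only items that pass
# the filter sit at similarity ranks 300, 900 and 4000 (made-up numbers).
matching_ranks = [300, 900, 4000]

def stub_search(fetch_k, k=2):
    return [r for r in matching_ranks if r < fetch_k][:k]

results, used_fetch_k = search_with_growing_fetch_k(stub_search, k=2)
print(results, used_fetch_k)
# → [300, 900] 1280
```

With the store from this demo, `search` could be something like `lambda n: vector_store.similarity_search('aircraft pilot', filter=find_tom_cruise, fetch_k=n)`. Each retry repeats the similarity search, so this trades extra queries for not having to guess a large `fetch_k` up front.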

Let's try another example. Let's try to find *Saving Private Ryan*, but without using its name.

In [81]:
find_tom_hanks = find_actor('Tom Hanks')

results = vector_store.similarity_search(
    'world war 2',
    filter=find_tom_hanks,
    fetch_k=5000
)

for res in results:
    print(f"* {res.metadata['name']}")
    print('Actors: ', res.metadata['actors'])
    print(res.page_content)
    print()

* Charlie Wilson's War
Actors:  ['Peter Gerety' 'Tom Hanks' 'Amy Adams' 'Julia Roberts'
 'Philip Seymour Hoffman' 'Brian Markinson' 'Judy Tylor' 'Emily Blunt'
 'Wynn Everett' 'Mary Bonner Baker' 'Rachel Nichols' 'Shiri Appleby'
 'John Slattery' 'Om Puri' 'Navid Negahban' "Denis O'Hare" 'Ken Stott'
 'Ned Beatty' 'Spencer Garrett' 'Nazanin Boniadi']
In 1980, U.S. Representative Charlie Wilson  is more interested in partying than legislating, frequently throwing huge galas and staffing his congressional office with young, attractive women. His social life eventually brings about a federal investigation into allegations of his cocaine use, conducted by then-U.S. Attorney Rudy Giuliani as part of a larger investigation into congressional misconduct. The investigation results in no charge against Charlie. A friend and romantic interest, Joanne Herring , encourages Charlie to do more to help the Afghan people, and persuades Charlie to visit the Pakistani leadership. The Pakistanis complain ab

One final example. We are now looking for sci-fi movies that have to do with space battles, excluding the *Star Wars* films.

In [90]:
def find_scifi(d):
    try:
        return 'Science Fiction' in d['genres'] and 'Star Wars' not in d['name']
    except TypeError:
        return False

results = vector_store.similarity_search(
    'space battles',
    filter=find_scifi,
    fetch_k=5000,
    k=10
)

for res in results:
    print(f"* {res.metadata['name']}")
    print('Genres: ', res.metadata['genres'])
    print(res.page_content)
    print()

* Royal Space Force: The Wings of Honneamise
Genres:  ['Science Fiction', 'Japanese Movies', 'Animation', 'Anime', 'Drama', 'Romance Film', 'Action']
 On an alternate Earth, an industrial civilization is flourishing amid an impending war between two bordering nations, the kingdom of Honneamise and the "Republic". Shirotsugh Lhadatt is an unmotivated young man who has drifted into his nation's lackadaisical space program. After the death of a fellow astronaut, he nurtures a close acquaintance with a young religious woman named Riquinni Nonderaiko, whose faith has seen her through some personal hardships.  Seeing Lhadatt as a prime example of what mankind is capable of, along with the godliness and ground-breaking nature of his work, she inspires him to become the first man in space. His training as an astronaut parallels his coming of age, and he and the rest of the members of the space project overcome technological difficulties, spiritual doubt, the machinations of their political mas