# Data Modeling with Movie Data

Retrieval-Augmented Generation (RAG), which uses vector similarity search to
retrieve semantically relevant information, is easy to implement and gives good
results for many questions. However, there are cases where it breaks down and
more advanced RAG techniques may be needed:
* Complex questions needing information about multiple topics may require question rewriting or an agent-based approach.
* Broad questions may need multiple perspectives, and benefit from techniques like maximal marginal relevance (MMR) to retrieve a diverse set of results.
* Deep questions may require explanations and clarifying information for the retrieved context, benefiting from knowledge graphs and/or agent techniques that ask follow-up questions.

Knowledge graphs that link pieces of content address these problems. Graph RAG
lets us start with vector similarity to address a wide variety of questions.
Where vector similarity breaks down, links between documents can be traversed
to retrieve deeper information and handle more complex questions. Graph
traversal techniques like MMR can also be used to retrieve a diverse set of
contexts, which helps with broad questions.
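
To make the diversity idea concrete, here is a minimal pure-Python sketch of MMR selection on its own, independent of any graph traversal. The function names, the `lambda_mult` parameter name, and the toy two-dimensional vectors are assumptions made for illustration; this is not the implementation used by any particular library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k, lambda_mult=0.5):
    """Greedily pick up to k candidate ids, trading relevance for diversity.

    candidates maps a (hypothetical) document id to its embedding vector.
    lambda_mult=1.0 ranks purely by relevance; lower values penalize
    candidates that are similar to documents already selected.
    """
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:

        def score(doc_id):
            relevance = cosine(query, remaining[doc_id])
            redundancy = max(
                (cosine(remaining[doc_id], candidates[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy embeddings: "a" and "a_dup" are near-duplicates, "b" points elsewhere.
query = [1.0, 0.0]
candidates = {
    "a": [1.0, 0.1],
    "a_dup": [1.0, 0.12],
    "b": [1.0, -0.6],
}

diverse = mmr_select(query, candidates, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, candidates, k=2, lambda_mult=1.0)
```

With the diversity penalty enabled, the near-duplicate is skipped in favor of the less similar document; with `lambda_mult=1.0` the two most similar documents are returned even though they overlap.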

By enabling these graph traversals using metadata in the vector store, Graph RAG allows you to solve these problems without re-ingesting your data. At retrieval time, documents may be linked to other documents based on properties stored in the metadata.
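
As a rough illustration of linking at retrieval time, the sketch below treats each document as an id plus a metadata dictionary and finds neighbors by matching metadata values. The helper names (`as_values`, `build_index`, `neighbors`), the `(source_field, target_field)` edge representation, and the sample records are all hypothetical; a real graph retriever would push these lookups down to the vector store rather than scan in memory.

```python
from collections import defaultdict


def as_values(value):
    """Treat scalars and collections uniformly as a list of link values."""
    if value is None:
        return []
    if isinstance(value, (list, tuple, set)):
        return list(value)
    return [value]


def build_index(docs, field):
    """Map each value of `field` to the ids of the documents that carry it."""
    index = defaultdict(set)
    for doc_id, metadata in docs.items():
        for value in as_values(metadata.get(field)):
            index[value].add(doc_id)
    return index


def neighbors(doc_id, docs, edges):
    """Ids of documents linked to `doc_id` by any (source_field, target_field) edge.

    The index is rebuilt on every call to keep the sketch short.
    """
    found = set()
    for source_field, target_field in edges:
        index = build_index(docs, target_field)
        for value in as_values(docs[doc_id].get(source_field)):
            found |= index.get(value, set())
    found.discard(doc_id)  # a document is not its own neighbor
    return found


# Hypothetical stored metadata; nothing here is re-ingested between queries.
docs = {
    "m1": {"genres": ["Drama", "Crime"], "directors": ["Director A"]},
    "m2": {"genres": ["Comedy"], "directors": ["Director A"]},
    "m3": {"genres": ["Crime"], "directors": ["Director B"]},
    "m4": {"genres": ["Horror"], "directors": ["Director C"]},
}

by_genre = neighbors("m1", docs, [("genres", "genres")])
by_genre_or_director = neighbors(
    "m1", docs, [("genres", "genres"), ("directors", "directors")]
)
```

Changing the list of edges changes which documents count as linked, while the stored documents stay exactly as they were.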

In this article, we'll see how to use an existing vector store as a graph vector store. We'll also look at the kind of information that is useful to have in the metadata to support these traversals.

## What to link
When first upgrading a vector store to a GraphVectorStore, the question of what to link naturally arises. Since vector similarity already does a good job of capturing semantic similarity based on the content, the best use of links is to capture relationships that may not be readily apparent from the content alone. For example:

1. In a legal document, you could create links from paragraphs to definitions of relevant terms.
1. In a technical support knowledge base, you could link to more specific steps for each suggested procedure.
1. In an article, you could link a paragraph of text to an image that illustrates the concept.
1. In a travel application, you could link a hotel to nearby attractions in the same metropolitan area.

In all of these cases, the links provide additional information that is not necessarily semantically similar to the original content. They are the kinds of things that you might link to if you were writing a webpage, or look for in a follow-up search when doing deeper research. Often, this information is already available in your existing API or database and just needs to be added to the retrieval system.
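
The "follow-up search" intuition can be sketched as a bounded breadth-first traversal over links. The example below is illustrative only: the `traverse` function, the adjacency dictionary, and the hotel and attraction ids are made up, and a real system would derive the links from metadata rather than hard-code them.

```python
from collections import deque


def traverse(start_ids, links, max_depth):
    """Breadth-first expansion from the starting documents.

    Returns a dict mapping each reached document id to the number of links
    followed to reach it. Documents deeper than max_depth are not visited.
    """
    depths = {doc_id: 0 for doc_id in start_ids}
    queue = deque(start_ids)
    while queue:
        current = queue.popleft()
        if depths[current] >= max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in links.get(current, ()):
            if neighbor not in depths:
                depths[neighbor] = depths[current] + 1
                queue.append(neighbor)
    return depths


# Hypothetical links for the travel example: a hotel and nearby attractions.
links = {
    "hotel": ["museum", "harbor_tour"],
    "museum": ["museum_cafe"],
    "harbor_tour": [],
    "museum_cafe": [],
}

one_hop = traverse(["hotel"], links, max_depth=1)
two_hops = traverse(["hotel"], links, max_depth=2)
```

Here the starting document stands in for the result of a vector search, and the depth limit controls how far the follow-up research goes before it stops.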

## Example: IMDB API

We previously showed how to use movie descriptions from IMDB as documents when
constructing a knowledge graph. In that example, we extracted keywords and other
properties to use as links. We're going to revisit that example, this time using
the IMDB API to demonstrate creating links from existing structured information.

For our use case, we're focused on retrieving information about movies, so we'll
use bidirectional links and create a LangChain `Document` only for movies.

```python
%pip install langchain-core langchain-community langchain-graph-retriever PyMovieDb lxml-html-clean
```



```python
import json

from langchain_core.documents import Document
from PyMovieDb import IMDB

imdb = IMDB()

# Fetches the current popular movies. The result is not used below, which
# works from the fixed list of IDs instead.
res = json.loads(imdb.popular_movies())

# IMDB title IDs to load. Each ID appears once, since it is also used as the
# document ID.
IDS = [
    'tt14513804', 'tt13654226', 'tt20969586', 'tt5040012', 'tt32063050',
    'tt28607951', 'tt8999762', 'tt26584495', 'tt26743210', 'tt17526714',
    'tt30253036', 'tt31036941', 'tt32558992', 'tt26625693', 'tt10676052',
    'tt14961016', 'tt21227864', 'tt30057084', 'tt34956443', 'tt20215234',
    'tt20221436', 'tt9218128', 'tt11563598', 'tt32434379', 'tt30788842',
]


def names(people) -> list[str]:
    """Extract the `name` field from a list of person records."""
    return [p["name"] for p in people]


def movie_to_doc(movie_id: str) -> Document:
    """Fetch one movie by IMDB ID and convert it to a LangChain `Document`."""
    movie = json.loads(imdb.get_by_id(movie_id))
    try:
        return Document(
            id=movie_id,
            # Fall back to the title when the description is empty.
            page_content=movie["description"] or movie["name"],
            # Structured fields that can later be used as links or filters.
            metadata={
                "name": movie["name"],
                "url": movie["url"],
                "rating": movie["rating"],
                "contentRating": movie["contentRating"],
                "genres": movie["genre"],
                "keywords": movie["keywords"],
                "actors": names(movie.get("actor", [])),
                "directors": names(movie.get("director", [])),
                "creators": names(movie.get("creator", [])),
            },
        )
    except Exception as e:
        # Show the failing record before re-raising with the original traceback.
        print(e)
        print(movie)
        raise


docs = [movie_to_doc(movie_id) for movie_id in IDS]
```
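
Before choosing which metadata fields to link on, it can help to check which values are actually shared between documents, since a value that appears in only one document creates no link. The sketch below is one hedged way to do that; `shared_values` is a hypothetical helper, the sample records use placeholder names rather than real IMDB data, and it assumes link fields hold lists of values.

```python
from collections import Counter


def shared_values(metadatas, field):
    """Count how many documents carry each value of `field`.

    Only values present in more than one document are returned, because
    those are the values that would connect documents to each other.
    """
    counts = Counter()
    for metadata in metadatas:
        values = metadata.get(field) or []
        if isinstance(values, str):
            values = [values]  # treat a bare string as a single value
        counts.update(set(values))  # count each document at most once per value
    return {value: n for value, n in counts.items() if n > 1}


# Placeholder metadata shaped like the records built by `movie_to_doc`.
sample = [
    {"actors": ["Actor A", "Actor B"], "genres": ["Drama"]},
    {"actors": ["Actor B"], "genres": ["Drama", "Crime"]},
    {"actors": [], "genres": ["Comedy"]},
    {},
]

shared_actors = shared_values(sample, "actors")
shared_genres = shared_values(sample, "genres")
```

With the documents loaded above, the same check could be run as `shared_values([d.metadata for d in docs], "actors")`.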

## Conclusion

Documents contain _unstructured_ content. Document metadata captures
_structured_ information about that content. Edges connect documents based on
the metadata.

When populating your vector store, make sure the information you already have is
available in the metadata. Adding this information allows you to use metadata
filtering and graph traversal to navigate these dimensions in ways not captured
by semantic similarity on the content.