diff --git a/05_knowledge_graphs.ipynb b/05_knowledge_graphs.ipynb
new file mode 100644
index 0000000..6487b7a
--- /dev/null
+++ b/05_knowledge_graphs.ipynb
@@ -0,0 +1,532 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Knowledge Graphs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Knowledge Graphs, a form of graph-based knowledge representation, provide a method for modeling and storing interlinked information in a format that is both human- and machine-understandable. These graphs consist of *nodes* and *edges*, representing entities and their relationships. Unlike traditional databases, the inherent expressiveness of graphs allows for richer semantic understanding, while providing the flexibility to accommodate new entity types and relationships without being constrained by a fixed schema.\n",
+ "\n",
+ "By combining knowledge graphs with embeddings (vector search), we can leverage *multi-hop connectivity* and *contextual understanding of information* to enhance querying, reasoning, and explainability in LLMs. This notebook explores the practical implementation of this approach, demonstrating how to (i) build a knowledge graph of academic publications, and (ii) extract actionable insights from it."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "
\n",
+ "
\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1. Knowledge Graph Initialization\n",
+ "\n",
+ "We will create our Knowledge Graph using [Neo4j](https://neo4j.com/), an open-source database management system that specializes in graph database technology."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[33mWARNING: Ignoring invalid distribution -umpy (/home/codespace/.local/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n",
+ "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -umpy (/home/codespace/.local/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n",
+ "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -umpy (/home/codespace/.local/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n",
+ "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -umpy (/home/codespace/.local/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n",
+ "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -umpy (/home/codespace/.local/lib/python3.10/site-packages)\u001b[0m\u001b[33m\n",
+ "\u001b[0mNote: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
+ "source": [
+ "%pip install neo4j langchain langchain_openai langchain-community python-dotenv --quiet | tail -n 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### 1.1 Setting Up a Neo4j Instance\n",
+ "\n",
+ "For a quick and easy setup, you can start a free instance on [Neo4j Aura](https://neo4j.com/product/auradb/). "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import dotenv\n",
+ "dotenv.load_dotenv('.env', override=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "from langchain_community.graphs import Neo4jGraph\n",
+ "\n",
+ "graph = Neo4jGraph(\n",
+ " url=os.environ['NEO4J_URI'], \n",
+ " username=os.environ['NEO4J_USERNAME'],\n",
+ " password=os.environ['NEO4J_PASSWORD'],\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### 1.2 Loading Dataset into a Graph\n",
+ "\n",
+ "The below example creates a connection with our Neo4j database and populates it with synthetic data about research articles and their authors. \n",
+ "\n",
+ "The entities are: \n",
+ "- *Researcher*\n",
+ "- *Article*\n",
+ "- *Topic*\n",
+ "\n",
+ "Whereas the relationships are:\n",
+ "- *Researcher* --[PUBLISHED]--> *Article*\n",
+ "- *Article* --[IN_TOPIC]--> *Topic*\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from langchain_community.graphs import Neo4jGraph\n",
+ "\n",
+ "graph = Neo4jGraph()\n",
+ "\n",
+ "q_load_articles = \"\"\"\n",
+ "LOAD CSV WITH HEADERS\n",
+ "FROM 'https://raw.githubusercontent.com/dcarpintero/generative-ai-101/main/dataset/synthetic_articles.csv' \n",
+ "AS row \n",
+ "FIELDTERMINATOR ';'\n",
+ "MERGE (a:Article {title:row.Title})\n",
+ "SET a.abstract = row.Abstract,\n",
+ " a.publication_date = date(row.Publication_Date)\n",
+ "FOREACH (researcher in split(row.Authors, ',') | \n",
+ " MERGE (p:Researcher {name:trim(researcher)})\n",
+ " MERGE (p)-[:PUBLISHED]->(a))\n",
+ "FOREACH (topic in [row.Topic] | \n",
+ " MERGE (t:Topic {name:trim(topic)})\n",
+ " MERGE (a)-[:IN_TOPIC]->(t))\n",
+ "\"\"\"\n",
+ "\n",
+ "graph.query(q_load_articles)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Node properties:\n",
+ "Article {title: STRING, abstract: STRING, publication_date: DATE}\n",
+ "Researcher {name: STRING}\n",
+ "Topic {name: STRING}\n",
+ "Relationship properties:\n",
+ "\n",
+ "The relationships:\n",
+ "(:Article)-[:IN_TOPIC]->(:Topic)\n",
+ "(:Researcher)-[:PUBLISHED]->(:Article)\n"
+ ]
+ }
+ ],
+ "source": [
+ "graph.refresh_schema()\n",
+ "print(graph.get_schema)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### 1.3 Build Vector Index\n",
+ "\n",
+ "We implement a vector index to efficiently search for relevant articles based on their *topic, title, and abstract*. This process involves calculating the embeddings for each article using these fields. At query time, the system finds the most similar articles to the user's input by employing a similarity metric, such as cosine distance.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain_community.vectorstores import Neo4jVector\n",
+ "from langchain_openai import OpenAIEmbeddings\n",
+ "\n",
+ "vector_index = Neo4jVector.from_existing_graph(\n",
+ " OpenAIEmbeddings(),\n",
+ " url=os.environ['NEO4J_URI'],\n",
+ " username=os.environ['NEO4J_USERNAME'],\n",
+ " password=os.environ['NEO4J_PASSWORD'],\n",
+ " index_name='articles',\n",
+ " node_label=\"Article\",\n",
+ " text_node_properties=['topic', 'title', 'abstract'],\n",
+ " embedding_node_property='embedding',\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 2. Q&A on Similarity"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain.chains import RetrievalQA\n",
+ "from langchain_openai import ChatOpenAI\n",
+ "\n",
+ "vector_qa = RetrievalQA.from_chain_type(\n",
+ " llm=ChatOpenAI(),\n",
+ " chain_type=\"stuff\",\n",
+ " retriever=vector_index.as_retriever()\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "The articles related to protecting users from misinformation are:\n",
+ "\n",
+ "1. Title: Language Models for Automated News Article Generation\n",
+ "Abstract: We discuss the use of language models in generating news articles, exploring the benefits and ethical considerations related to media integrity.\n",
+ "\n",
+ "2. Title: Language Models for Detecting Fake Reviews\n",
+ "Abstract: We propose language model-based approaches for detecting fake and misleading reviews online, helping users make informed decisions.\n"
+ ]
+ }
+ ],
+ "source": [
+ "r = vector_qa.invoke(\n",
+ " {\"query\": \"which articles are related to protecting users from misinformation? include the article titles and abstracts.\"}\n",
+ ")\n",
+ "print(r['result'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 3. Graph-Cypher-Chain w/ LangChain\n",
+ "\n",
+ "LangChain provides also wrapper around Neo4j graph database that allows to generate Cypher statements based on the user input and use them to retrieve relevant information from the database."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain.chains import GraphCypherQAChain\n",
+ "from langchain_openai import ChatOpenAI\n",
+ "\n",
+ "graph.refresh_schema()\n",
+ "\n",
+ "cypher_chain = GraphCypherQAChain.from_llm(\n",
+ " cypher_llm = ChatOpenAI(temperature=0, model_name='gpt-4o'),\n",
+ " qa_llm = ChatOpenAI(temperature=0, model_name='gpt-4o'), \n",
+ " graph=graph,\n",
+ " verbose=True,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 4. Inference traversing Knowledge Graphs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Knowledge graphs excel in their ability to query and navigate the connections between entities, allowing for the retrieval of pertinent information and the discovery of new insights."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### 3.1 Sample 1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In this example, our question 'How many articles has published Emma Wilson' will be translated into the Cyper query:\n",
+ "\n",
+ "```\n",
+ "MATCH (r:Researcher {name: \"Emma Wilson\"})-[:PUBLISHED]->(a:Article)\n",
+ "RETURN COUNT(a) AS numberOfArticles\n",
+ "```\n",
+ "\n",
+ "which matches nodes labeled `Author` with the name 'Emma Wilson' and traverses the `PUBLISHED` relationships to `Article` nodes. \n",
+ "It then counts the number of `Article` nodes connected to 'Emma Wilson':"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "\n",
+ "\u001b[1m> Entering new GraphCypherQAChain chain...\u001b[0m\n",
+ "Generated Cypher:\n",
+ "\u001b[32;1m\u001b[1;3mcypher\n",
+ "MATCH (r:Researcher {name: \"Emma Wilson\"})-[:PUBLISHED]->(a:Article)\n",
+ "RETURN COUNT(a) AS numberOfArticles\n",
+ "\u001b[0m\n",
+ "Full Context:\n",
+ "\u001b[32;1m\u001b[1;3m[{'numberOfArticles': 5}]\u001b[0m\n",
+ "\n",
+ "\u001b[1m> Finished chain.\u001b[0m\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "{'query': 'How many articles has published Emma Wilson?',\n",
+ " 'result': 'Emma Wilson has published 5 articles.'}"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# the answer should be '5'\n",
+ "cypher_chain.invoke(\n",
+ " {\"query\": \"How many articles has published Emma Wilson?\"}\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### 3.2 Sample 2"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In this example the query 'are there any pair of researchers who have published more than one article together?' results in the Cypher query:\n",
+ "\n",
+ "```\n",
+ "MATCH (r1:Researcher)-[:PUBLISHED]->(a:Article)<-[:PUBLISHED]-(r2:Researcher)\n",
+ "WHERE r1 <> r2\n",
+ "WITH r1, r2, COUNT(a) AS sharedArticles\n",
+ "WHERE sharedArticles > 1\n",
+ "RETURN r1.name, r2.name, sharedArticles\n",
+ "```\n",
+ "\n",
+ "which results in traversing from the `Researcher` nodes to the `PUBLISHED` relationship to find connected `Article` nodes, and then traversing back to find `Researchers` pairs."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "\n",
+ "\u001b[1m> Entering new GraphCypherQAChain chain...\u001b[0m\n",
+ "Generated Cypher:\n",
+ "\u001b[32;1m\u001b[1;3mcypher\n",
+ "MATCH (r1:Researcher)-[:PUBLISHED]->(a:Article)<-[:PUBLISHED]-(r2:Researcher)\n",
+ "WHERE r1 <> r2\n",
+ "WITH r1, r2, COUNT(a) AS sharedArticles\n",
+ "WHERE sharedArticles > 1\n",
+ "RETURN r1.name, r2.name, sharedArticles\n",
+ "\u001b[0m\n",
+ "Full Context:\n",
+ "\u001b[32;1m\u001b[1;3m[{'r1.name': 'Alice Johnson', 'r2.name': 'David Miller', 'sharedArticles': 2}, {'r1.name': 'Alexander Lee', 'r2.name': 'David Miller', 'sharedArticles': 2}, {'r1.name': 'Olivia Taylor', 'r2.name': 'Alexander Lee', 'sharedArticles': 2}, {'r1.name': 'David Miller', 'r2.name': 'Alexander Lee', 'sharedArticles': 2}, {'r1.name': 'Alexander Lee', 'r2.name': 'Olivia Taylor', 'sharedArticles': 2}, {'r1.name': 'David Miller', 'r2.name': 'Alice Johnson', 'sharedArticles': 2}]\u001b[0m\n",
+ "\n",
+ "\u001b[1m> Finished chain.\u001b[0m\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "{'query': 'are there any pair of researchers who have published more than one article together?',\n",
+ " 'result': 'Yes, there are pairs of researchers who have published more than one article together. These pairs are:\\n\\n- Alice Johnson and David Miller\\n- Alexander Lee and David Miller\\n- Olivia Taylor and Alexander Lee\\n- David Miller and Alexander Lee\\n- Alexander Lee and Olivia Taylor\\n- David Miller and Alice Johnson'}"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# the answer should be Alice Johnson and David Miller, Alexander Lee and David Miller, Olivia Taylor and Alexander Lee, and David Miller and Alice Johnson\n",
+ "cypher_chain.invoke(\n",
+ " {\"query\": \"are there any pair of researchers who have published more than one article together?\"}\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### 3.3 Sample 3\n",
+ "\n",
+ "It appears David Miller has collaborated with many peers. Lets find out is he is the researcher with most peers collaborations. \n",
+ "Our query 'which researcher has collaborated with the most peers?' results now in the Cyper:\n",
+ "\n",
+ "```\n",
+ "MATCH (r:Researcher)-[:PUBLISHED]->(:Article)<-[:PUBLISHED]-(peer:Researcher)\n",
+ "WITH r, COUNT(DISTINCT peer) AS peerCount\n",
+ "RETURN r.name AS researcher, peerCount\n",
+ "ORDER BY peerCount DESC\n",
+ "LIMIT 1\n",
+ "```\n",
+ "\n",
+ "Here, we need to star from all `Researcher` nodes and traverse their `PUBLISHED` relationships to find connected `Article` nodes. For each `Article` node, Neo4j then traverses back to find other `Researcher` nodes (peer) who have also published the same article."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "\n",
+ "\u001b[1m> Entering new GraphCypherQAChain chain...\u001b[0m\n",
+ "Generated Cypher:\n",
+ "\u001b[32;1m\u001b[1;3mcypher\n",
+ "MATCH (r1:Researcher)-[:PUBLISHED]->(:Article)<-[:PUBLISHED]-(r2:Researcher)\n",
+ "WHERE r1 <> r2\n",
+ "WITH r1, COUNT(DISTINCT r2) AS collaborators\n",
+ "RETURN r1.name AS researcher, collaborators\n",
+ "ORDER BY collaborators DESC\n",
+ "LIMIT 1\n",
+ "\u001b[0m\n",
+ "Full Context:\n",
+ "\u001b[32;1m\u001b[1;3m[{'researcher': 'David Miller', 'collaborators': 5}]\u001b[0m\n",
+ "\n",
+ "\u001b[1m> Finished chain.\u001b[0m\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "{'query': 'Which researcher has collaborated with the most peers?',\n",
+ " 'result': 'David Miller has collaborated with 5 peers.'}"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# the answer should be 'David Miller' with 5\n",
+ "cypher_chain.invoke(\n",
+ " {\"query\": \"Which researcher has collaborated with the most peers?\"}\n",
+ ")"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/README.md b/README.md
index c9402dd..44a07bb 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
# Generative-AI-101
-Annotated Notebooks to dive into foundational concepts and state-of-the-art techniques for LLMs and Diffusion models. This is a work in progress, more content will be released on a regular basis.
+Best Practices for building with LLMs. Dive into Self-Attention, In-Context Learning, LLM-Augmentation, RAG, Knowledge-Graphs, Fine-Tuning, and Model Optimization.
[](https://github.com/dcarpintero/generative-ai-101/blob/master/LICENSE)
[](https://GitHub.com/dcarpintero/generative-ai-101/graphs/contributors/)
@@ -81,7 +81,17 @@ Tags: `[RAG]` `[Chunking]` `[FAISS]` `[Hugging Face Transformers]` `[LangChain]`
## 05. Knowledge Graphs
-In Progress
+[](https://colab.research.google.com/github/dcarpintero/generative-ai-101/blob/main/05_knowledge_graphs.ipynb)
+
+Knowledge Graphs, a form of graph-based knowledge representation, provide a method for modeling and storing interlinked information in a human - and machine - understandable format. In practice, such a graph data structure consists of *nodes* and *edges*, representing entities and their relationships. Unlike traditional databases, the inherent expressiveness of graphs allows for richer semantic understanding, while providing the flexibility to accommodate new entity types and relationships without being constrained by a fixed schema.
+
+By combining knowledge graphs with embeddings (vector search), we can leverage *multi-hop connectivity* and *contextual understanding of information* to enhance querying, reasoning, and explainability in LLMs. This notebook explores the practical implementation of this approach, demonstrating how to (i) build a knowledge graph from academic literature, and (ii) extract actionable insights from it.
+
+
+
+
+
+Tags: `[Knowledge Graphs]` `[Neo4j]` `[Contextual Reasoning]` `[Embeddings]` `[Data Modeling]`
## 06. Fine-Tuning BERT
diff --git a/static/knowledge-graphs.png b/static/knowledge-graphs.png
new file mode 100644
index 0000000..8156089
Binary files /dev/null and b/static/knowledge-graphs.png differ