In [1]:
import pandas as pd
import warnings

## Task 3. Enriching graph with unstructured data and external KGs

### 3.1. Extracting knowledge from unstructured data
This task turned out to be quite complicated. Initially, we considered three options:
- NER with a spaCy NLP model (e.g. en_core_web_trf)
- NER with an LLM
- NER+RE with an LLM

Models like en_core_web_trf are well optimized and work well for general-purpose NER, but we only need to extract particular types of entities from reviews, which would require retraining.
This leaves us with LLMs, which perform well but are constrained by API rate limits that are usually too low to process the amount of data we have. Thus, the most logical option was to download a lightweight LLM and run it locally, which is thankfully a fairly straightforward process with spaCy.

The choice of model is the most important part here: some models are too heavy to run on limited hardware, while others are not 'smart' enough for our task. Among the [models available from spaCy](https://spacy.io/api/large-language-models#models-hf), Mistral-7B-instruct seemed the best fit for our needs, as it is quite fast and performs well. However, due to limited resources we restricted ourselves to NER only; no relation extraction was performed, as it would be too computationally expensive to run on our devices.

Unfortunately, the spaCy source code passes an outdated parameter that the Mistral model does not accept, which led to an error during `assemble`. The only straightforward fix was to edit [this file](.venv/Lib/site-packages/spacy_llm/models/hf/mistral.py) in the spaCy source code, deleting the `resume_download=True` parameter from the `__init__()` method.

**NOTE: If you want to run LLM inference on your computer, be aware that it requires a GPU, a large amount of RAM, and the edit to the aforementioned file.
We provide a file with the NER results in the [data](data) folder.**

The following code is commented out to avoid errors when running the Jupyter notebook.

In [2]:
# reviews_df = pd.read_csv("data/cleaned_reviews.csv").head(10000)

In [3]:
# from spacy_llm.util import assemble
# import spacy
# from transformers import AutoModelForCausalLM, AutoTokenizer

# nlp = assemble("config/ner_rel_llm.cfg")

In [4]:
# from tqdm import tqdm
# import warnings

# warnings.filterwarnings('ignore')

# docs = nlp.pipe(reviews_df["Review"].head(500), batch_size=1)

# extracted_entities = []
# for doc in tqdm(docs, total=500):
#     extracted_entities.append(doc.ents)

# Store the entities under the column name that is read back later
# reviews_df["extracted_ingredients"] = pd.Series(extracted_entities)

# reviews_df.to_csv("data/extracted_reviews.csv")

In [5]:
# doc = nlp(". ".join(["123: I have made this pie instead of plain ol' pumpkin pie for the last 7 years.  Everyone always raves about it.  The flavor is wonderful and the texture is slightly lighter than traditional pumpin pie\"	 I suspect due to the substitution of light cream instead of canned milk.  	If you try this	\" you won't go back to plain ol' pumkin again!", "I hate this freaking recipe it's the worst thing i've every eaten in my life crazy", "I don't know, it seems fine, but nothing special really. I don't know what to think about it", "I decided to add milk chocolate and it resulted in a more colourful flavour!"]))

#### Clean extracted ingredients
Format them as a list of lowercase strings.

In [6]:
def clean_ingredients(value: str):
    return [s.lower() for s in value.strip("()").replace(", ", ",").split(",") if s != '' ]
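
As a quick check of what this cleaner does, the sketch below runs it on a few strings shaped like stringified entity tuples. The sample strings are made up for illustration, on the assumption that the CSV stores each `doc.ents` tuple as text such as `(blue cheese, roast beef)`.

```python
def clean_ingredients(value: str):
    # Same logic as the cell above, repeated so the example runs on its own.
    return [s.lower() for s in value.strip("()").replace(", ", ",").split(",") if s != ""]

# Hypothetical samples mimicking stringified entity tuples.
samples = ["(Blue Cheese, Roast Beef)", "(cornstarch,)", "()"]
cleaned = [clean_ingredients(s) for s in samples]

for raw, result in zip(samples, cleaned):
    print(f"{raw!r} -> {result}")
```

The trailing comma of a one-element tuple produces an empty string after splitting, which is why the `if s != ""` filter is needed.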



In [7]:
reviews_ingredients_df = pd.read_csv("data/extracted_reviews.csv")

# na_action='ignore' leaves missing values as NaN instead of passing them to the cleaner
reviews_ingredients_df["extracted_ingredients"] = reviews_ingredients_df["extracted_ingredients"].map(clean_ingredients, na_action='ignore')
reviews_ingredients_df = reviews_ingredients_df[~reviews_ingredients_df["extracted_ingredients"].isna()]

reviews_ingredients_df = reviews_ingredients_df[reviews_ingredients_df["extracted_ingredients"].apply(lambda x: len(x) > 0)]

reviews_ingredients_df = reviews_ingredients_df.drop(["Unnamed: 0"], axis=1)


reviews_ingredients_df


Unnamed: 0,ReviewId,RecipeId,AuthorId,AuthorName,Review,DateSubmitted,DateModified,extracted_ingredients
1,9,4523,2046,Gay Gilmore ckpt,i think i did something wrong because i could ...,2000-02-25T09:00:00Z,2000-02-25T09:00:00Z,[cornstarch]
5,23,4684,2046,Gay Gilmore ckpt,this is absolutely delicious. i even served i...,2000-02-25T09:06:00Z,2000-02-25T09:06:00Z,[lime slices]
6,25,3431,2046,Gay Gilmore ckpt,leeks on a pizza?! it was really delicious. ...,2000-04-07T11:06:00Z,2000-04-07T11:06:00Z,"[leeks, pizza, boboli, chicken sausage, mushro..."
8,33,4053,1986,Kevin Connolly,This was a fine sandwich I'll definitely be ma...,2000-08-26T12:35:25Z,2000-08-26T12:35:25Z,"[blue cheese, roast beef]"
15,49,2388,2033,Sandy Zikursh,My husbands Aunt Dorothy made the best dumplin...,2000-09-20T13:56:46Z,2000-09-20T13:56:46Z,"[mashed potatoes, pork, beef, stew]"
...,...,...,...,...,...,...,...,...
487,1310,7776,8571,Kirsti Piironen,Why would you want to add anything like mayo o...,2001-06-25T10:36:56Z,2001-06-25T10:36:56Z,"[mayo, sour cream]"
489,1319,9427,7802,Mark H.,These were very smooth and had a good taste. N...,2001-06-25T15:51:29Z,2001-06-25T15:51:29Z,"[smooth, cream cheese, sour cream]"
492,1324,5275,11733,Janet1,having lived in Australia for 33years we often...,2001-06-25T19:19:34Z,2001-06-25T19:19:34Z,"[english faggots, pigs liver]"
493,1325,9492,11297,Jen T,So easy and tasty. To make the clean-up even...,2001-06-26T11:23:50Z,2001-06-26T11:23:50Z,[tinfoil]


#### Connect Extracted Ingredients with KG


In [8]:
from rdflib import Graph, Namespace, URIRef, Literal, RDF, RDFS, XSD, SDO, SKOS

In [9]:

g = Graph() 
g.parse("vocabulary.ttl")

print(g.serialize(format="ttl"))

@prefix kgs: <http://kg-course.io/food-nutrition/schema/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <https://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

kgs:extractedIngredient a rdfs:Class .

schema:NutritionInformation a rdfs:Class .

schema:Recipe a rdfs:Class .

schema:Restaurant a rdfs:Class .

schema:Review a rdfs:Class .

kgs:averageCostOfTwo a rdf:Property ;
    rdfs:comment "None"^^xsd:string ;
    rdfs:domain schema:Restaurant ;
    rdfs:range xsd:decimal .

kgs:hasExtractedIngredients a rdf:Property ;
    rdfs:comment "None"^^xsd:string ;
    rdfs:domain schema:Review ;
    rdfs:range kgs:extractedIngredient .

kgs:hasNutrition a rdf:Property ;
    rdfs:comment "Nutrition information for the recipe"^^xsd:string ;
    rdfs:domain schema:Recipe ;
    rdfs:range schema:NutritionInformation .

kgs:hasOnlineDelivery a rdf:Property ;
    rdfs:comment "None"^^xsd:stri

In [10]:
BASE = Namespace("http://kg-course.io/food-nutrition/")
KGS = Namespace("http://kg-course.io/food-nutrition/schema/")

Extend the KG with the class and property for extracted ingredients

In [11]:
# # Add class for Extracted Ingredients
# g.add((KGS.extractedIngredient, RDF.type, RDFS.Class))
#
# # Add property for connecting Review and Extracted Ingredient
# g.add((KGS.hasExtractedIngredients, RDF.type, RDF.Property))
# g.add((KGS.hasExtractedIngredients, RDFS.domain, SDO.Review))
# g.add((KGS.hasExtractedIngredients, RDFS.range, KGS.extractedIngredient))


Enrich KG with reviews and extracted ingredients


In [12]:
def add_review(g, review):
    review_uri = URIRef(BASE["review/" + str(review["ReviewId"])])
    recipe_uri = URIRef(BASE["recipe/" + str(review["RecipeId"])])

    g.add((review_uri, RDF.type, SDO.Review))
    g.add((review_uri, SDO.author, Literal(review["AuthorId"], datatype=XSD.string)))
    g.add((review_uri, SDO.reviewBody, Literal(review["Review"], datatype=XSD.string)))
    g.add((review_uri, SDO.datePublished, Literal(review["DateSubmitted"], datatype=XSD.dateTime)))
    g.add((review_uri, SDO.dateModified, Literal(review["DateModified"], datatype=XSD.dateTime)))
    g.add((recipe_uri, KGS.hasReview, review_uri))



In [13]:
index = 0 # unique identifier for an extracted ingredient
# save ingredient-ingredient_uri pairs in a dictionary, as we will need them later
ingredient_dict = {}
for _, row in reviews_ingredients_df.iterrows():
    review_uri = URIRef(BASE["review/" + str(row["ReviewId"])])
    recipe_uri = URIRef(BASE["recipe/" + str(row["RecipeId"])])

    # We also need to add reviews we are using to the graph
    add_review(g, row)

    for ingredient in row["extracted_ingredients"]:
        ingredient_uri = URIRef(BASE["extractedIngredient/" + str(index)])

        # link extracted ingredients to review
        g.add((ingredient_uri, RDF.type, KGS.extractedIngredient))
        # use the property name declared in vocabulary.ttl (kgs:hasExtractedIngredients)
        g.add((review_uri, KGS.hasExtractedIngredients, ingredient_uri))
        g.add((ingredient_uri, RDFS.label, Literal(ingredient, datatype=XSD.string)))

        ingredient_dict[ingredient] = ingredient_uri

        index += 1
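
The loop above mints a fresh URI for every mention, so `ingredient_dict` ends up pointing at only the last URI created for a repeated label. If one node per distinct ingredient is preferred, a possible variant is to reuse the URI created the first time a label is seen. The sketch below illustrates that idea with plain strings instead of `URIRef` objects; `mint_ingredient_uri` and the sample rows are hypothetical and are not part of the notebook.

```python
BASE = "http://kg-course.io/food-nutrition/"

def mint_ingredient_uri(label, registry):
    """Return the URI for `label`, creating it only on first sight."""
    if label not in registry:
        registry[label] = f"{BASE}extractedIngredient/{len(registry)}"
    return registry[label]

# Hypothetical (ReviewId, ingredients) rows.
rows = [(1310, ["mayo", "sour cream"]), (1319, ["cream cheese", "sour cream"])]

registry = {}
links = []  # (review URI, ingredient URI) pairs
for review_id, ingredients in rows:
    review_uri = f"{BASE}review/{review_id}"
    for label in ingredients:
        links.append((review_uri, mint_ingredient_uri(label, registry)))

print(len(registry), "distinct ingredient URIs for", len(links), "mentions")
```

With this variant both reviews point at the same "sour cream" node, at the cost of no longer having a separate node per mention.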


### 3.2. Sentiment Analysis
For sentiment analysis we use the [cardiffnlp/twitter-roberta-base-sentiment-latest](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest) classification model. It was trained on Twitter posts, which are short and often informal, much like reviews, so the model is a good fit for our needs. The model outputs one of three labels: Negative, Neutral, or Positive.

In [14]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig, pipeline
import numpy as np




In [15]:
# Use softmax to normalize the output to probability distribution
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)
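
To see why the maximum is subtracted before exponentiating, here is a small dependency-free sketch using only `math`. It mirrors the NumPy function above for a flat list of logits; the helper names `naive_softmax` and `stable_softmax` and the example logits are made up for illustration.

```python
import math

def naive_softmax(logits):
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(logits):
    # Shifting by the maximum leaves the result unchanged but keeps exp() in range.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]  # same gaps between values, shifted by 999

probs_small = stable_softmax(small)
probs_large = stable_softmax(large)

try:
    naive_softmax(large)
    naive_overflowed = False
except OverflowError:
    # math.exp(1000.0) is too large for a float
    naive_overflowed = True
```

Because softmax depends only on the differences between logits, `probs_small` and `probs_large` agree, while the unshifted version cannot handle the large inputs.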

In [16]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Using tokenizer and config provided by the model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)

model = AutoModelForSequenceClassification.from_pretrained(MODEL)

Loading weights: 100%|██████████| 201/201 [00:00<00:00, 1173.65it/s, Materializing param=roberta.encoder.layer.11.output.dense.weight]
RobertaForSequenceClassification LOAD REPORT from: cardiffnlp/twitter-roberta-base-sentiment-latest
Key                             | Status     |  | 
--------------------------------+------------+--+-
roberta.embeddings.position_ids | UNEXPECTED |  | 
roberta.pooler.dense.weight     | UNEXPECTED |  | 
roberta.pooler.dense.bias       | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


In [17]:
classifier = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)

In [18]:
results = classifier(reviews_ingredients_df["Review"].to_list(), batch_size=32, truncation=True)

labels = np.array([res['label'] for res in results])

scores = np.array([res['score'] for res in results])

reviews_ingredients_df['sentimentLabel'] = labels
reviews_ingredients_df['confidenceScore'] = scores



Connect the sentiment scores to the graph

In [19]:
for _, row in reviews_ingredients_df.iterrows():
    g.add((URIRef(BASE["review/" + str(row["ReviewId"])]), KGS.hasSentiment, Literal(row["confidenceScore"], datatype=XSD.float)))
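
Note that the loop above stores only the confidence score, so the polarity of a review cannot be recovered from `kgs:hasSentiment` alone. One possible way to keep both in a single number is to fold the label into the sign of the score. The sketch below is an illustrative assumption, not what the notebook does; `signed_sentiment` is a hypothetical helper and the sample outputs are invented in the shape the `pipeline` call returns.

```python
def signed_sentiment(label: str, score: float) -> float:
    """Map a (label, confidence) pair to a value in [-1, 1]."""
    sign = {"negative": -1.0, "neutral": 0.0, "positive": 1.0}[label.lower()]
    return sign * score

# Invented classifier outputs, shaped like the pipeline results.
results = [
    {"label": "positive", "score": 0.98},
    {"label": "negative", "score": 0.91},
    {"label": "neutral", "score": 0.60},
]
signed = [signed_sentiment(r["label"], r["score"]) for r in results]
print(signed)
```

An alternative that loses no information would be to store the label as a second literal next to the score.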

### 3.3. Connect Extracted Ingredients with Wikidata
The extractedIngredient class and the links between ingredients and their reviews were created in 3.1. Here we link each extracted ingredient to its closest matching Wikidata entity with `skos:closeMatch`.

We reuse the helper methods from lab 3 for creating and searching the vector store.

In [20]:
# import torch
from transformers import AutoTokenizer, AutoModel
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
import numpy as np

# Create vector store
def create_vector_store(embedding_model, texts=None, metadata=None):
    if texts is None or len(texts) == 0:
        raise ValueError("Text data cannot be empty when initializing the vector store.")

    # Add embeddings and metadata to FAISS
    vector_store = FAISS.from_texts(texts, embedding_model, metadatas=metadata or [])
    return vector_store

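# Search the FAISS vector store and return only the best match.
# The score is a distance, so lower values mean closer matches.
# embedding_model is not used here; it is kept so existing calls still work.
def search_vector_store(vector_store, query, embedding_model, top_k=5):
    result, score = vector_store.similarity_search_with_score(query, k=top_k)[0]
    return result, score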

In [21]:
import json

with open('data/ingredients_wikidata.json', 'r', encoding='UTF-8') as json_file:
     ingredients_wiki = json.load(json_file)

ingredients_wiki[:10]


[{'item': 'http://www.wikidata.org/entity/Q10987',
  'itemLabel': 'honey',
  'itemDescription': 'sweet food made by bees mostly using nectar from flowers'},
 {'item': 'http://www.wikidata.org/entity/Q12106',
  'itemLabel': 'Triticum',
  'itemDescription': 'genus of plants (focus on taxonomy here, not on agriculture)'},
 {'item': 'http://www.wikidata.org/entity/Q13186',
  'itemLabel': 'raisin',
  'itemDescription': 'dried grape'},
 {'item': 'http://www.wikidata.org/entity/Q27855',
  'itemLabel': 'Mytilus edulis',
  'itemDescription': 'species of mollusc'},
 {'item': 'http://www.wikidata.org/entity/Q28165',
  'itemLabel': 'cinnamon',
  'itemDescription': 'spice obtained from the inner bark of several tree species from the genus Cinnamomum'},
 {'item': 'http://www.wikidata.org/entity/Q29476',
  'itemLabel': 'baking powder',
  'itemDescription': 'dry chemical leavening agent'},
 {'item': 'http://www.wikidata.org/entity/Q36465',
  'itemLabel': 'flour',
  'itemDescription': 'powder which is 

In [22]:
# Formatting the input texts for embedding.

def format_text(json_data):
    texts = [f"""{i.get('itemLabel', " ")}: {i.get('itemDescription', " ")}"""
                     for i in json_data]
    metadata = [{"label": item.get('itemLabel', ''),
                  "description": item.get("itemDescription", ''),
                  "item": item.get('item', '')}
                 for item in json_data]

    return texts, metadata


text, metadata = format_text(ingredients_wiki)

In [23]:
# Pick an embedding model
model_name = "BAAI/bge-large-en-v1.5"
# Embed texts
%time embedding_model = HuggingFaceEmbeddings(model_name=model_name)
%time vector_store = create_vector_store(embedding_model, text, metadata)

Loading weights: 100%|██████████| 391/391 [00:00<00:00, 1210.86it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: BAAI/bge-large-en-v1.5
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


CPU times: user 358 ms, sys: 610 ms, total: 969 ms
Wall time: 4.9 s
CPU times: user 683 ms, sys: 182 ms, total: 864 ms
Wall time: 4.76 s


The vector store always returns an answer: if it cannot find a decent match for the query, it still returns the closest entry it has. To avoid too many unrelated links, we skip results with a score (distance) higher than 0.5. This threshold was chosen empirically, by inspecting which scores correspond to which results.
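
The effect of such a cutoff can be illustrated without FAISS. The sketch below does a brute-force nearest-neighbour lookup over toy 2-D vectors and rejects the best match when its distance exceeds a threshold. The vectors, labels and the `link_or_skip` helper are invented for illustration, and squared Euclidean distance is used here only as a stand-in for whatever metric the vector store is configured with.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def link_or_skip(query_vec, index, max_distance=0.5):
    """Return (label, distance) for the closest entry, or None if it is too far."""
    label, vec = min(index.items(), key=lambda kv: squared_l2(query_vec, kv[1]))
    distance = squared_l2(query_vec, vec)
    return (label, distance) if distance <= max_distance else None

# Toy "embeddings" for three knowledge-base entries.
index = {"corn starch": (1.0, 0.0), "sour cream": (0.0, 1.0), "roast": (-1.0, 0.0)}

close = link_or_skip((0.9, 0.1), index)   # lies near "corn starch"
far = link_or_skip((0.0, -1.5), index)    # nearest entry is still far away

print(close, far)
```

Without the threshold the second query would still be linked to some entry, which is the kind of unrelated match the cutoff is meant to drop.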

In [24]:
ingredients_linking = {}
for _, row in reviews_ingredients_df.head(10).iterrows():
    for ingredient in row["extracted_ingredients"]:
        results, score = search_vector_store(vector_store, ingredient, embedding_model, top_k=1)

        if score > 0.5:
            continue

        ingredients_linking[ingredient] = {'matched-label': results.metadata.get('label',''),
                                        'wikidata-iri': results.metadata.get('item',''),
                                        "score": score}
        
        # add extracted links from wikidata to their corresponding entities in the graph
        g.add((ingredient_dict[ingredient], SKOS.closeMatch, URIRef(results.metadata.get('item', ''))))

ingredients_linking


{'cornstarch': {'matched-label': 'corn starch',
  'wikidata-iri': 'http://www.wikidata.org/entity/Q3393961',
  'score': np.float32(0.33861506)},
 'chicken sausage': {'matched-label': 'chicken sausage meat',
  'wikidata-iri': 'http://www.wikidata.org/entity/Q107246470',
  'score': np.float32(0.3144497)},
 'mushrooms': {'matched-label': 'dried mushrooms',
  'wikidata-iri': 'http://www.wikidata.org/entity/Q11702846',
  'score': np.float32(0.39636096)},
 'roast beef': {'matched-label': 'roast',
  'wikidata-iri': 'http://www.wikidata.org/entity/Q899561',
  'score': np.float32(0.20075789)},
 'pork': {'matched-label': 'pork liver',
  'wikidata-iri': 'http://www.wikidata.org/entity/Q18384179',
  'score': np.float32(0.33809114)},
 'beef': {'matched-label': 'roast',
  'wikidata-iri': 'http://www.wikidata.org/entity/Q899561',
  'score': np.float32(0.41034186)},
 'chicken': {'matched-label': 'chicken egg',
  'wikidata-iri': 'http://www.wikidata.org/entity/Q15260613',
  'score': np.float32(0.452519

### 3.4. Integrating external KG

In [25]:
with open('data/recipe_cuisine_wikidata.json', 'r', encoding='UTF-8') as json_file:
     recipe_cuisine = json.load(json_file)

In [26]:
# Add recipeCuisine property to recipes
# g.add((SDO.recipeCuisine, RDF.type, RDF.Property))
# g.add((SDO.recipeCuisine, RDFS.domain, SDO.Recipe))
# g.add((SDO.recipeCuisine, RDFS.range, XSD.string))

# "entry" avoids shadowing the built-in dict type
for entry in recipe_cuisine:
    g.add((URIRef(entry["recipe"]), RDF.type, SDO.Recipe))
    g.add((URIRef(entry["recipe"]), SDO.recipeCuisine, Literal(entry["cuisineLabel_l"], datatype=XSD.string)))

In [27]:
# save the unstructured graph we created
g.serialize(destination="KEN4256-unstructured-KG-Team6.ttl", format="ttl")

structured_g = Graph()
structured_g.parse("KEN4256-structured-KG-Team6.ttl")

g_combined = structured_g + g

g_combined.serialize(destination="KEN4256-integrated-KG-Team6.ttl", format="ttl")

<Graph identifier=Na4470c7fd58745398159a15a1419136c (<class 'rdflib.graph.Graph'>)>