In [22]:
from sentence_transformers import SentenceTransformer
from elasticsearch import Elasticsearch
from dotenv import load_dotenv
from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForCausalLM,
)
import json
import os

from tqdm.auto import tqdm
import time
import torch

In [2]:
load_dotenv()

True

In [3]:
ELASTIC_URL = os.getenv("ELASTIC_URL_LOCAL")
MODEL_NAME = os.getenv("MODEL_NAME")
INDEX_NAME = os.getenv("INDEX_NAME")
HUGGINGFACE_API = os.getenv("HUGGINGFACE_API")
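
`os.getenv` returns `None` when a variable is missing, so a misconfigured `.env` file would only surface later, for example when the Elasticsearch client is created. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper, not part of the notebook.

```python
import os


def require_env(names, environ=os.environ):
    """Return the requested variables as a dict, failing fast if any is unset or empty.

    `require_env` is a hypothetical helper introduced for illustration only.
    """
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


# Possible use in this notebook, after load_dotenv():
# config = require_env(["ELASTIC_URL_LOCAL", "MODEL_NAME", "INDEX_NAME", "HUGGINGFACE_API"])
```

The `environ` parameter defaults to the real environment and exists only so the check can be exercised with a plain dictionary.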


## Summary of 02_SEARCH_AND_INDEX_IMPLEMENTATION.ipynb

This Jupyter Notebook builds a search and indexing system with Elasticsearch and Sentence Transformers, then uses the retrieved text to prompt a language model. The workflow includes the following steps:

1. **Importing Libraries**: Essential libraries such as `sentence_transformers`, `elasticsearch`, `dotenv`, `transformers`, `json`, `os`, `tqdm`, `time`, and `torch` are imported.

2. **Loading Environment Variables**: `load_dotenv()` loads the `.env` file, and `os.getenv` reads `ELASTIC_URL_LOCAL` (stored as `ELASTIC_URL`), `MODEL_NAME`, `INDEX_NAME`, and `HUGGINGFACE_API`.

3. **Loading the Model**: A pre-trained SentenceTransformer model is loaded using the specified `MODEL_NAME`.

4. **Fetching Documents**: JSON documents are fetched from a specified directory, read, and aggregated into a list.

5. **Setting up Elasticsearch**: An Elasticsearch index is set up with specific settings and mappings, including properties for document ID, page number, chunk ID, text, and text vectors.

6. **Indexing Documents**: The fetched documents are indexed into Elasticsearch. Each document's text is encoded into a dense vector using the SentenceTransformer model before indexing.

7. **KNN Search in Elasticsearch**: A function is defined to perform a K-Nearest Neighbors (KNN) search in Elasticsearch using the encoded text vectors.

8. **Building Prompts**: A prompt is constructed for querying the language model, incorporating the search results from Elasticsearch.

9. **Loading a Language Model**: A language model is loaded from Hugging Face using the specified model name and device configuration (CPU or GPU).

10. **Generating Responses**: The language model generates responses to the constructed prompts, and the response time is measured.

11. **Displaying Results**: The generated answers and response times are printed for evaluation.

Together, these steps cover the full path from raw JSON chunks to a generated answer: Sentence Transformers encodes the text, Elasticsearch stores and retrieves the most similar chunks, and the language model answers using only the retrieved context.


In [4]:
def load_model():
    """
    Loads a pre-trained SentenceTransformer model.

    This function prints the name of the model being loaded and returns an instance of the SentenceTransformer
    initialized with the specified model name.

    Returns:
        SentenceTransformer: An instance of the SentenceTransformer model.
    """
    print(f"Loading model: {MODEL_NAME}")
    return SentenceTransformer(MODEL_NAME)


model = load_model()

Loading model: all-mpnet-base-v2


In [5]:
def read_json(file_path):
    """Reads a UTF-8 encoded JSON file and returns the parsed data."""
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data


def fetch_documents():
    """
    Fetches and reads JSON documents from a specified directory.

    This function lists all files in the '../json_data' directory, reads each JSON file,
    and aggregates the data into a single list of documents.

    Returns:
        list: A list containing all the documents read from the JSON files.
    """
    print("Fetching documents...")

    directory_path = "../json_data"

    # List all files in the directory
    files = os.listdir(directory_path)

    documents = []
    for file in files:
        print(f"Reading file: {file}")
        data = read_json(f"{directory_path}/{file}")
        documents.extend(data)
        print(f"Fetched {len(documents)} documents")
    return documents


documents = fetch_documents()

Fetching documents...
Reading file: Cityphilia-and-cityphobia--A-multi-scalar-search-for_2024_Journal-of-Urban-M.json
Fetched 60 documents
Reading file: How-do-local-governments-respond-to-central-mandate-in-affo_2024_Journal-of-.json
Fetched 113 documents
Reading file: Inclusive-cities--Less-crime-requires-more-lo_2024_Journal-of-Urban-Manageme.json
Fetched 118 documents
Reading file: sideris_gonzales_ong.json
Fetched 171 documents
Reading file: The_High_Cost_of_Free_Parking.json
Fetched 190 documents
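
`os.listdir` returns every entry in the directory, in no guaranteed order, so a stray non-JSON file would make `json.load` fail. A minimal sketch of a stricter loader, assuming each file holds a JSON list of chunks as above; `fetch_json_documents` is a hypothetical name.

```python
import json
from pathlib import Path


def fetch_json_documents(directory_path):
    """Read every *.json file in a directory and concatenate their lists of chunks.

    Hypothetical variant of fetch_documents(); assumes each file holds a JSON list.
    """
    documents = []
    # glob() skips non-JSON entries and sorted() makes the load order deterministic.
    for path in sorted(Path(directory_path).glob("*.json")):
        with path.open("r", encoding="utf-8") as f:
            data = json.load(f)
        if not isinstance(data, list):
            raise ValueError(f"{path.name} does not contain a JSON list")
        documents.extend(data)
    return documents
```

A deterministic order also keeps slices such as `documents[:50]` stable between runs.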


In [16]:
len(documents)

190

In [21]:
# Data to evaluate
with open("../data_output/data_to_test.json", "w", encoding="utf-8") as f:
    json.dump(documents[:50], f, ensure_ascii=False, indent=4)

{'doc_id': 'Cityphilia-and-cityphobia--A-multi-scalar-search-for_2024_Journal-of-Urban-M',
 'page_num': 1,
 'chunk_id': 'Cityphilia-and-cityphobia--A-multi-scalar-search-for_2024_Journal-of-Urban-M_1_1',
 'text': "Research Article Cityphilia and cityphobia: A multi-scalar search for city love in Flanders Karima Kourtita,b,c,*, Bart Neutsd, Peter Nijkampa,b,c, Marie H. Wahlstr €ome aOpen University, Heerlen, the Netherlands bAlexandru Ioan Cuza University, Iasi, Romania cUniversity of Rijeka, Rijeka, Croatia dKU Leuven, Leuven, Belgium eKTH, Stockholm, Sweden ARTICLE INFO Keywords: Well-being Happiness City loveSocial cohesionCentral place systemsInter-urban attractivenessABSTRACT Cities, towns, and rural areas form a complex spatial system in ﬂuenced by governance, economic factors, and the perceptions of their residents. This paper introduces the concepts of 'cityphilia' and 'cityphobia' as metaphors for the spatial attraction and repulsion forces that shape local quality of life. It 

In [6]:
def setup_elasticsearch():
    """
    Sets up an Elasticsearch index with specified settings and mappings.

    This function performs the following steps:
    1. Connects to the Elasticsearch client using the provided ELASTIC_URL.
    2. Defines the index settings, including the number of shards and replicas.
    3. Defines the mappings for the index, specifying the data types and properties for each field.
    4. Deletes the existing index if it exists.
    5. Creates a new index with the defined settings and mappings.

    Returns:
        Elasticsearch: An instance of the Elasticsearch client connected to the created index.
    """
    print("Setting up Elasticsearch...")
    es_client = Elasticsearch(ELASTIC_URL)

    index_settings = {
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
        "mappings": {
            "properties": {
                "doc_id": {"type": "keyword"},
                "page_num": {"type": "integer"},
                "chunk_id": {"type": "keyword"},
                "text": {"type": "text"},
                "text_vector": {
                    "type": "dense_vector",
                    "dims": 768,
                    "index": True,
                    "similarity": "cosine",
                },
            }
        },
    }

    es_client.indices.delete(index=INDEX_NAME, ignore_unavailable=True)
    es_client.indices.create(index=INDEX_NAME, body=index_settings)
    print(f"Elasticsearch index '{INDEX_NAME}' created")
    return es_client


es_client = setup_elasticsearch()

Setting up Elasticsearch...
Elasticsearch index 'housing-political-policy' created
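
The mapping hard-codes `dims: 768`, which matches the embedding size of `all-mpnet-base-v2`. If `MODEL_NAME` pointed to a model with a different output size, indexing would be rejected. The sketch below builds the same settings from a `dims` argument; `build_index_settings` is a hypothetical helper, and in this notebook the value could plausibly come from `model.get_sentence_embedding_dimension()`.

```python
def build_index_settings(dims, similarity="cosine"):
    """Return index settings whose dense_vector size matches the embedding model.

    Hypothetical helper; the field names mirror the mapping used in this notebook.
    """
    if not isinstance(dims, int) or dims <= 0:
        raise ValueError("dims must be a positive integer")
    return {
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
        "mappings": {
            "properties": {
                "doc_id": {"type": "keyword"},
                "page_num": {"type": "integer"},
                "chunk_id": {"type": "keyword"},
                "text": {"type": "text"},
                "text_vector": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": similarity,
                },
            }
        },
    }
```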


In [18]:
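def index_documents(es_client, documents, model):
    """
    Encodes each document's text and indexes the document into Elasticsearch.

    Note: this adds a "text_vector" key to every dict in `documents` in place.
    """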
    print("Indexing documents...")
    for doc in tqdm(documents):
        doc["text_vector"] = model.encode(doc["text"]).tolist()
        es_client.index(index=INDEX_NAME, document=doc)
    print(f"Indexed {len(documents)} documents")


index_documents(es_client, documents, model)

Indexing documents...


100%|██████████| 190/190 [00:54<00:00,  3.48it/s]

Indexed 190 documents
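
The loop above encodes and indexes one document per request. A possible alternative is to encode texts in batches and send the documents through the bulk helper of the Elasticsearch client. The sketch below only builds the bulk actions, so it runs without a cluster; `build_bulk_actions` is a hypothetical helper, and using `chunk_id` as `_id` assumes chunk IDs are unique.

```python
def build_bulk_actions(documents, encode, index_name, batch_size=32):
    """Yield bulk-API actions, encoding the texts one batch at a time.

    `encode` is any callable that maps a list of strings to a sequence of
    vectors (for example a SentenceTransformer's `encode` method).
    """
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        vectors = encode([doc["text"] for doc in batch])
        for doc, vector in zip(batch, vectors):
            # Copy rather than mutate, so the in-memory documents stay vector-free.
            source = dict(doc)
            source["text_vector"] = [float(x) for x in vector]
            # Assumption: chunk_id is unique, so re-running overwrites instead of duplicating.
            yield {"_index": index_name, "_id": doc["chunk_id"], "_source": source}
```

In this notebook the generator could be passed to `elasticsearch.helpers.bulk(es_client, build_bulk_actions(documents, model.encode, INDEX_NAME))`; whether that is faster here depends on the hardware and was not measured.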





In [7]:
def elastic_search_knn(
    field,
    vector,
    index_name=INDEX_NAME,
):
    """
    Runs a kNN search and returns the `_source` of the top 5 hits.

    `field` is the dense_vector field to search and `vector` is the encoded
    query. The function relies on the module-level `es_client`.
    """
    knn = {
        "field": field,
        "query_vector": vector,
        "k": 5,
        "num_candidates": 10000,
    }

    search_query = {
        "knn": knn,
        "_source": ["doc_id", "page_num", "chunk_id", "text"],
    }

    es_results = es_client.search(index=index_name, body=search_query)

    return [hit["_source"] for hit in es_results["hits"]["hits"]]
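
The function above fixes `k` and `num_candidates` inside the body. A minimal sketch of a query builder that exposes them as parameters and accepts an optional filter clause; `build_knn_query` and its defaults are hypothetical, not part of the notebook.

```python
def build_knn_query(
    field,
    vector,
    k=5,
    num_candidates=100,
    filter_clause=None,
    source_fields=("doc_id", "page_num", "chunk_id", "text"),
):
    """Build the body of a kNN search request as a plain dict.

    Hypothetical helper; the default values are illustrative assumptions.
    """
    if k > num_candidates:
        raise ValueError("num_candidates must be at least k")
    knn = {
        "field": field,
        # Accept NumPy arrays as well as plain sequences.
        "query_vector": [float(x) for x in vector],
        "k": k,
        "num_candidates": num_candidates,
    }
    if filter_clause is not None:
        knn["filter"] = filter_clause
    return {"knn": knn, "_source": list(source_fields)}
```

The returned dict has the same shape as `search_query` above, so it could be passed to `es_client.search` in the same way. With only 190 indexed chunks, the choice of `num_candidates` is unlikely to matter much.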

In [9]:
query = "What is the gentrification of cities?"
vector = model.encode(query)
search_results = elastic_search_knn("text_vector", vector)
search_results

[{'page_num': 16,
  'doc_id': 'sideris_gonzales_ong',
  'chunk_id': 'sideris_gonzales_ong_16_1'},
 {'page_num': 1,
  'text': 'https://doi.org/10.1177/0739456X17730890Journal of Planning Education and Research 2019, Vol. 39(2) 227 –242 © The Author(s) 2017Article reuse guidelines: sagepub.com/journals-permissions DOI: 10.1177/0739456X17730890 journals.sagepub.com/home/jpe Planning Research Introduction Since the term gentrification was first used by sociologist Ruth Glass (1964) in the mid-1960s, a rich literature has emerged of studies that seek to identify the magnitude of change and document its impact on gentrified neighbor - hoods. While these studies discuss mostly the processes and impacts of gentrification, we are not aware of studies that focus on the methodologies of studying gentrification. In general, a methodological dichotomy characterizes much of the existing gentrification literature, as studies are either quantitative, “macro” analyses or qualitative, “micro” inqui-ries

In [23]:
def build_prompt(query, search_results):
    """Builds the LLM prompt from the user query and the retrieved chunks."""
    prompt_template = """
As a housing policy expert advising policymakers, answer the QUESTION below using only the verified information provided in the CONTEXT. 
Maintain a neutral, factual tone, and avoid assumptions or extrapolations beyond the CONTEXT. 
Structure your response with a brief summary of pros and cons to support balanced decision-making, and keep the response as concise as possible.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

    context = "\n\n".join(
        [f"doc_id: {doc['doc_id']}\nanswer: {doc['text']}" for doc in search_results]
    )
    return prompt_template.format(question=query, context=context).strip()
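
`build_prompt` joins every retrieved chunk and reads `doc['text']` directly, so a hit without a `text` field would raise a `KeyError`, and long chunks could exceed the model's context window. A minimal sketch of a more defensive context builder; `build_context` is a hypothetical helper, and the character budget is a crude stand-in for a token limit.

```python
def build_context(search_results, max_chars=4000):
    """Join retrieved chunks into one context string.

    Hits without text are skipped. Hits are assumed to arrive in relevance
    order, so the loop stops at the first chunk that would exceed the budget.
    """
    parts = []
    used = 0
    for doc in search_results:
        text = doc.get("text")
        if not text:
            continue
        entry = f"doc_id: {doc.get('doc_id', 'unknown')}\nanswer: {text}"
        # Always keep the best hit, even if it alone exceeds the budget.
        if parts and used + len(entry) > max_chars:
            break
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts)
```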

In [11]:
query = "What is the gentrification of cities?"
search_results = elastic_search_knn("text_vector", model.encode(query))
prompt = build_prompt(query, search_results)
prompt



In [24]:
# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"


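def model_from_huggingface(model_name, device):
    """Loads a causal language model and its tokenizer, and wraps them in a text-generation pipeline."""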
    model_generation = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map=device,
        torch_dtype="auto",
        trust_remote_code=True,
        token=HUGGINGFACE_API,
    )
    tokenizer_generation = AutoTokenizer.from_pretrained(
        model_name,
        token=HUGGINGFACE_API,
    )

    pipe_generation = pipeline(
        "text-generation",
        model=model_generation,
        tokenizer=tokenizer_generation,
    )

    return pipe_generation


pipe_generation = model_from_huggingface("meta-llama/Llama-3.2-1B-Instruct", device)


def llm(prompt):
    """Generates an answer for the prompt and returns it with the elapsed time in seconds."""
    start_time = time.time()
    messages = [
        {"role": "user", "content": prompt},
    ]

    eos_token_id = pipe_generation.tokenizer.eos_token_id

    generation_args = {
        "max_new_tokens": 500,
        "return_full_text": False,
        # "temperature": 0.0,
        "do_sample": False,
        "pad_token_id": eos_token_id,
    }

    output = pipe_generation(messages, **generation_args)

    answer = output[0]["generated_text"].strip()

    end_time = time.time()
    response_time = end_time - start_time

    return answer, response_time
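
`llm` measures the response time with `time.time()`, which reads the wall clock and can shift if the system clock is adjusted during generation. `time.perf_counter()` is the standard-library clock intended for measuring durations. A minimal sketch of a reusable timing wrapper; `timed` is a hypothetical helper.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

With such a wrapper the generation code would not need its own timing logic, and the same helper could also time the encoding and search steps.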

In [None]:
query = "What is the gentrification of cities?"
search_results = elastic_search_knn("text_vector", model.encode(query))
prompt = build_prompt(query, search_results)
answer, response_time = llm(prompt)




In [14]:
response_time

26.792170524597168

In [15]:
print(answer)

**Gentrification of Cities: A Balanced Approach**

Gentrification is a complex and multifaceted phenomenon that affects urban neighborhoods in various ways. While some studies focus on the magnitude of gentrification, others emphasize the importance of incorporating qualitative methods to gain a more comprehensive understanding of the issue. Here are the pros and cons of adopting a balanced approach to studying gentrification:

**Pros:**

1. **Comprehensive understanding**: Incorporating both quantitative and qualitative methods can provide a more nuanced understanding of gentrification, allowing researchers to capture the complex interactions between physical, cultural, economic, and demographic shifts.
2. **Improved data collection**: Visual surveys and interviews can complement each other, providing a more detailed picture of neighborhood change brought about by gentrification.
3. **Increased accuracy**: Mixed-methods approaches can reduce the risk of biased or incomplete data, lead