<a href="https://colab.research.google.com/github/eduardoplima/artists-expenditure-llm/blob/main/artists.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Identifying artists in public expenditure using LLMs

## Author: Eduardo P. Lima

## Summary

The Brazilian Audit Courts are constitutionally responsible, among other duties, for monitoring what the government departments under their jurisdiction spend on cultural events and artistic performances. To this end, those departments report their expenditures of this kind to the Audit Courts.

However, this information is not structured in a way that makes it easy to identify the artists hired. Natural Language Processing techniques are therefore needed to extract the artists' names, so that auditors can assess whether these contracts were paid properly.

This notebook demonstrates techniques for this purpose, with an emphasis on Large Language Models (LLMs).

### Key points

* Point 1

In [1]:
!pip install gdown langchain_openai langgraph langchain_community

In [11]:
import os
import requests
import base64
import gdown
import getpass

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import langchain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

from langgraph.graph import StateGraph, END, START

## Dataset loading

We load the dataset from an .xlsx file. It has three text columns, which describe the procurement process, the contract, and the subsequent prepayment. An artist may be identified in any of these columns, so we need to search all three.

In [4]:
url = "https://github.com/eduardoplima/artists-expenditure-llm/raw/refs/heads/main/festas_juninas.xlsx"
output = "artists.xlsx"
gdown.download(url, output)

Downloading...
From: https://github.com/eduardoplima/artists-expenditure-llm/raw/refs/heads/main/festas_juninas.xlsx
To: /content/artists.xlsx
100%|██████████| 1.14M/1.14M [00:00<00:00, 16.9MB/s]


'artists.xlsx'

In [5]:
df_art = pd.read_excel('artists.xlsx', engine='openpyxl')

In [6]:
df_art.head(10)

| Unnamed: 0 | contract | prepayment | procurement |
|---|---|---|---|
| 0 | contratação da empresa A. NUNES DE ARAÚJO PROD... | Despesa com diária em favor da servidora, NAYA... | Contratação de empresa especializada no fornec... |
| 1 | contratação da empresa A. NUNES DE ARAÚJO PROD... | Ref. empenho estimativo de diárias nacionais p... | Contratação de empresa especializada no fornec... |
| 2 | contratação da empresa A. NUNES DE ARAÚJO PROD... | Ref. empenho estimativo de diárias internacion... | Contratação de empresa especializada no fornec... |
| 3 |  | Referente despesa com 4º termo aditivo empenho... |  |
| 4 | contratação da empresa A. NUNES DE ARAÚJO PROD... | Referente despesa do 4º termo aditivo empenho ... | Contratação de empresa especializada no fornec... |
| 5 |  | Ref. serviço de fornecimento de passagens aére... |  |
| 6 |  | Ref. serviço de fornecimento de passagens aére... |  |
| 7 |  | Referente despesa com participação no lounge m... |  |
| 8 | contratação da empresa A. NUNES DE ARAÚJO PROD... | Referente empenho com participação no evento s... | Contratação de empresa especializada no fornec... |
| 9 |  | Despesa com participação Expoturismo Paraná d... |  |
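
Because an artist may be named in any of the three columns, one option is to join them into a single text per row before passing it to a model. The sketch below is only an illustration and is not part of the original notebook: the helper `combine_fields`, the column order, and the separator are assumptions. It works on plain dictionaries and skips empty cells, including the NaN values that pandas uses for missing data.

```python
import math

# Column names as they appear in the dataset preview.
TEXT_COLUMNS = ("procurement", "contract", "prepayment")


def combine_fields(row, columns=TEXT_COLUMNS, sep=" | "):
    """Join the non-empty text fields of a row into one string (hypothetical helper)."""
    parts = []
    for col in columns:
        value = row.get(col)
        # Skip missing cells: None, or the float NaN pandas uses for empty cells.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        text = str(value).strip()
        if text:
            parts.append(text)
    return sep.join(parts)


# A row with only the prepayment description filled in.
example = {"contract": float("nan"), "prepayment": "Ref. empenho", "procurement": None}
print(combine_fields(example))
```

With the DataFrame loaded above, the same helper could be applied row by row, for example with `df_art.apply(lambda r: combine_fields(r.to_dict()), axis=1)`.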


## Environment variables

We set the environment variables that hold the credentials for the external APIs that power our agents.

In [7]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

OpenAI API Key:··········


In [8]:
os.environ["SPOTIFY_CLIENT_ID"] = getpass.getpass("Spotify Client ID:")

Spotify Client ID:··········


In [9]:
os.environ["SPOTIFY_CLIENT_SECRET"] = getpass.getpass("Spotify Client Secret:")

Spotify Client Secret:··········


In [10]:
os.environ["TAVILY_API_KEY"] = getpass.getpass("Tavily API Key:")

Tavily API Key:··········
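
The four cells above repeat the same pattern, so they could be folded into one helper that prompts only when a variable is not already set. This is a minimal sketch under that assumption; `ensure_env` is a hypothetical name, not something the notebook defines.

```python
import getpass
import os


def ensure_env(name, prompt=None):
    """Return os.environ[name], asking for it only if it is missing or empty.

    Hypothetical helper: the value is read with getpass so it is not echoed.
    """
    if not os.environ.get(name):
        os.environ[name] = getpass.getpass(prompt or f"{name}: ")
    return os.environ[name]
```

Calling `ensure_env("OPENAI_API_KEY")`, and likewise for the Spotify and Tavily variables, would then make the notebook safe to re-run without retyping keys that are already present in the session.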


## Agent functions

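We create the helper functions the agents use to query the Spotify API: `get_token` obtains an access token, and `spotify_api_call` sends an authenticated request to an endpoint.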

In [12]:
def get_token(client_id: str, client_secret: str):
  """
  Gets an access token for the Spotify API.

  Args:
      client_id (str): Spotify Client ID.
      client_secret (str): Spotify Client Secret.

  Returns:
      str: Value for the Authorization header, in the form
          "<token_type> <access_token>", or None if the request fails.
  """
  # Client credentials flow: the ID and secret are sent as HTTP Basic auth.
  base64_auth = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()

  auth_options = {
      'url': 'https://accounts.spotify.com/api/token',
      'headers': {
          'Authorization': 'Basic ' + base64_auth,
          'Content-Type' : 'application/x-www-form-urlencoded'
      },
      'data': {
          'grant_type': 'client_credentials'
      }
  }

  response = requests.post(auth_options['url'], headers=auth_options['headers'], data=auth_options['data'])

  if response.status_code == 200:
    r = response.json()
    # The response also carries 'expires_in' (token lifetime in seconds), unused here.
    return f"{r['token_type']} {r['access_token']}"
  else:
    return None


def spotify_api_call(url: str, access_token: str) -> dict:
  """
  Calls the Spotify API using a given endpoint URL and access token.

  Args:
      url (str): Endpoint URL for the API call.
      access_token (str): Authorization header value, as returned by get_token.

  Returns:
      dict: API response parsed from JSON.
  """
  response = requests.get(url, headers={'Authorization': access_token})
  api_response = response.json()

  return api_response
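
To show how these two helpers might be used to check whether a name corresponds to a known artist, here is a minimal sketch built on the Spotify search endpoint. The helper names `build_artist_search_url` and `extract_artist_names` are hypothetical, and the response shape (`artists` containing a list of `items`) is an assumption to verify against the Spotify Web API documentation.

```python
from urllib.parse import urlencode

# Search endpoint of the Spotify Web API.
SEARCH_ENDPOINT = "https://api.spotify.com/v1/search"


def build_artist_search_url(name, limit=1):
    """Build a search URL that looks for artists matching `name` (hypothetical helper)."""
    query = urlencode({"q": name, "type": "artist", "limit": limit})
    return f"{SEARCH_ENDPOINT}?{query}"


def extract_artist_names(api_response):
    """Return the artist names found in a search response (assumed response shape)."""
    items = api_response.get("artists", {}).get("items", [])
    return [item["name"] for item in items if "name" in item]
```

The URL could then be passed to the functions above, for example `spotify_api_call(build_artist_search_url(name), get_token(client_id, client_secret))`, and the result handed to `extract_artist_names`.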

In [None]:
def has_artist_name(text):
  # Stub: the body of this function has not been written yet.
  raise NotImplementedError


## Model creation

We create the chat model that our agents will use.

In [13]:
model = ChatOpenAI(temperature=0, model_name="gpt-4o-mini")

In [None]:
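def analyze_question(state):
  """Asks the model to identify hired artists in an expenditure description."""
  # The prompt is kept in Portuguese to match the language of the dataset.
  # In English: "You are an agent that identifies artists in descriptive texts
  # of public expenditures. Given the expenditure, identify the artists hired
  # in the description provided. Only answer if there is an artist in the text
  # provided. If there is none, answer with an empty text."
  prompt = PromptTemplate.from_template("""
  Você é um agente que identifica artistas em textos descritivos de despesas públicas.

  Despesa : {input}

  Dada a despesa identifique artistas contratados na descrição fornecida. Só responda se
  houve um artista no texto fornecido. Se não houve, responda com um texto vazio.

  Sua resposta :
  """)
  chain = prompt | model
  response = chain.invoke({"input": state["input"]})
  # Normalize the answer; an empty string means no artist was found.
  decision = response.content.strip().lower()
  return {"decision": decision, "input": state["input"]}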
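
Since `analyze_question` returns an empty `decision` when no artist is found, a later step can branch on that value. The sketch below is an assumption about how such branching might look, not the notebook's actual graph: `AgentState`, `route_on_decision`, and the step names `"lookup_artist"` and `"end"` are all hypothetical.

```python
from typing import TypedDict


class AgentState(TypedDict, total=False):
    """State passed between steps; mirrors the keys returned by analyze_question."""
    input: str
    decision: str


def route_on_decision(state):
    """Pick the next step: look the artist up only if the model named one."""
    decision = (state.get("decision") or "").strip()
    return "lookup_artist" if decision else "end"


print(route_on_decision({"input": "some expenditure text", "decision": ""}))
```

In LangGraph, a function like this could be passed to `StateGraph.add_conditional_edges` to choose which node runs after the analysis node.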