## Reading an HTML File with LangChain and Google Generative AI

### Installing Required Libraries

In [20]:
#!pip install -q langchain-google-genai
#!pip install -q langchain-community
#!pip install -q python-dotenv
#!pip install -q bs4
#!pip install -q lxml
#!pip install -q docarray
#!pip install -q pydantic==2.8.2
#!pip uninstall openai -y

In [1]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv('langchain1_llmapp/api_key.env')) # read local .env file
google_key = os.environ['GOOGLE_API_KEY']
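
`os.environ['GOOGLE_API_KEY']` raises a bare `KeyError` when the variable is missing, which can be hard to diagnose if the `.env` path is wrong. A minimal sketch of a more explicit check follows; `require_env` and the `DEMO_API_KEY` variable are hypothetical names used only for illustration, not part of python-dotenv or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "check that the .env file was found and loaded."
        )
    return value


# Demonstration with a dummy variable so the snippet runs without a real key
os.environ["DEMO_API_KEY"] = "dummy-value"
print(require_env("DEMO_API_KEY"))
```

In the notebook itself, the call would presumably be `google_key = require_env('GOOGLE_API_KEY')` right after `load_dotenv(...)`.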

### Loading the Members of the 20th German Bundestag Session (DEU)

#### Cleaning the HTML File

In [53]:
from bs4 import BeautifulSoup
from langchain_core.documents import Document

# Read the saved page as UTF-8 so umlauts and other non-ASCII names survive
file_path = 'data/20_session_DEU.html'
with open(file_path, 'r', encoding='utf-8') as f:
    html = f.read()

soup = BeautifulSoup(html, 'html.parser')

# Collect every <table> element on the page
tables = soup.find_all('table')

# Create one Document per table, keeping the table's position as metadata
documents = []
for idx, table in enumerate(tables, start=1):
    # Flatten the table to plain text, with cell contents separated by spaces
    table_text = table.get_text(separator=' ', strip=True)
    doc = Document(
        page_content=table_text,
        metadata={"table_index": idx}
    )
    documents.append(doc)


# Alternative: load the whole page as a single Document with BSHTMLLoader
from langchain_community.document_loaders import BSHTMLLoader

file_path = 'data/20_session_DEU.html'
loader = BSHTMLLoader(file_path, open_encoding='utf-8')
data = loader.load()

print(data)
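
A table with several hundred members produces one very long `page_content` string, which may dilute the embedding or run into the embedding model's input limit. One option, offered here as a sketch rather than as the method this notebook uses, is to split each table's text into smaller overlapping chunks before embedding. The function `split_words` and its parameters are hypothetical, and word counts are only a rough stand-in for tokens.

```python
def split_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word-based chunks that share `overlap` words with the next chunk."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Tiny demonstration with ten placeholder words
sample = " ".join(f"w{i}" for i in range(10))
for chunk in split_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk could then be wrapped in its own `Document` with the same `table_index` metadata, so that retrieval returns smaller pieces of a table instead of the whole thing.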

## Step by Step

### Creating an Embedding Model

In [55]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embedding_model = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

### Creating an Index

In [56]:
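from langchain_community.vectorstores import DocArrayInMemorySearch

# Alternative (disabled): build the index straight from the BSHTMLLoader output,
# which treats the whole page as a single document.
# from langchain.indexes import VectorstoreIndexCreator
# index = VectorstoreIndexCreator(
#     vectorstore_cls=DocArrayInMemorySearch,
#     embedding=embedding_model
# ).from_loaders([loader])

# Embed the per-table documents and keep the vectors in memory
vectorstore = DocArrayInMemorySearch.from_documents(documents, embedding_model)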
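
Conceptually, an in-memory vector store keeps one embedding per document and, at query time, ranks the documents by their similarity to the query embedding. The toy sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. It is not `DocArrayInMemorySearch`'s actual implementation; the names `cosine`, `top_k`, and `toy_store`, as well as the vectors themselves, are made up for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the texts of the k stored items most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up embeddings; a real embedding model returns much longer vectors
toy_store = [
    ("SPD members table", [0.9, 0.1, 0.0]),
    ("CDU/CSU members table", [0.8, 0.2, 0.1]),
    ("Election results table", [0.0, 0.1, 0.9]),
]

print(top_k([1.0, 0.0, 0.0], toy_store, k=2))
```

Because only the top-ranked documents are handed to the LLM, a question that needs every table (such as a full member list) can come back incomplete.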

### Querying the Index and Passing the Results to the LLM

In [49]:
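from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain_google_genai import ChatGoogleGenerativeAI

query = "List all members of the German Bundestag. Retrieve what you can; it doesn't have to be perfect. Format the result as a table."
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", temperature=0)

# Wrap the in-memory vector store so it exposes the `query` interface
index = VectorStoreIndexWrapper(vectorstore=vectorstore)
response = index.query(query, llm=llm)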

In [50]:
from IPython.display import display, Markdown

display(Markdown(response))

I can provide a partial list of members, along with some additional information, but the table in the Wikipedia article is incomplete and I cannot generate a full list of all 733 members.  I also cannot reliably extract every name even from the partial table due to formatting inconsistencies.

| Faction/Group | Party | Members (Partial List) |
|---|---|---|
| SPD | SPD | Fahimi, Keller, Mohrs, Maas, Philippi, Kiziltepe, Gremmels, Mansoori, Trăsnea, Grötsch, Mehmet Ali, De Ridder, Rinkert, Bartz, Vontz, Mende, Trăsnea, Lutze, Ruf, Rabanus, Hohmann, Schanbacher |
| CDU/CSU | CDU/CSU | Altmaier, Kramp-Karrenbauer, Storjohann, Hennrich, Schäuble, Berghegger, Jung, Schwarz, Stöcker, Schön, Uhl, Bernstein, Föhr, Pahlmann, Kaufmann, Mannes, Wiesmann, Wellenreuther, Sekmen, Scheuer, Müller |
| Bündnis 90/Die Grünen |  | Krischer, Trittin, Kühn, Stahr, Sekmen, Rottmann, Aeffner, Sacher, von Holtz, Kretz, Krumwiede-Steiner, Kekeritz |
| Die Linke (Gruppe) | Die Linke | Amira Mohamed Ali, Dietmar Bartsch, Susanne Ferschl, Gesine Lötzsch, Nicole Gohlke, Ali Al-Dailami, Jan Korte, Sahra Wagenknecht, Klaus Ernst, Jessica Tatti, Christian Görke, Heidi Reichinnek, Sören Pellmann |
| BSW (Gruppe) |  |  (No names listed in the provided excerpt) |


**Presidium (Leadership):**

* **President:** Bärbel Bas (SPD)
* **Vice Presidents:** Aydan Özoğuz (SPD), Yvonne Magwas (CDU), Katrin Göring-Eckardt (Grüne), Wolfgang Kubicki (FDP), Petra Pau (Linke), *[Open position for AfD]*
* **Elder President (Alterspräsident):** Wolfgang Schäuble (CDU)


This is not a complete list of Bundestag members. The Wikipedia article from which this information is taken is itself incomplete.  It is best to consult the official Bundestag website for a definitive list.
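
Since the model returns its answer as a Markdown table, it can be handy to turn that text back into Python data for further processing. The following is a minimal sketch, assuming a simple pipe-delimited table without escaped `|` characters inside cells; `parse_markdown_table` is a hypothetical helper, and the sample string is a shortened excerpt of the response above.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Parse the first pipe-delimited Markdown table into a list of row dicts."""
    header = None
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # skip prose and blank lines
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue  # skip the |---|---| separator row
        if header is None:
            header = cells
        elif len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows


sample = """
| Faction/Group | Party | Members (Partial List) |
|---|---|---|
| SPD | SPD | Fahimi, Keller |
| BSW (Gruppe) |  | (No names listed in the provided excerpt) |
"""

for row in parse_markdown_table(sample):
    print(row)
```

In the notebook, the same function could be applied to `response` directly; rows whose cell count does not match the header are silently dropped in this sketch.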