## Install ollama - local language models

Note: we need to install ollama and langchain into the default environment

(see here for details https://docs.aws.amazon.com/sagemaker/latest/dg/studio-lab-environments.html)

Langchain is already installed in the persistent environment, but it is an older version, so the install cell below upgrades it.

Some libraries for integrating langchain with ollama when working with text:

https://pypi.org/project/langchain-community/

https://github.com/chroma-core/chroma

ChromaDB, often referred to simply as Chroma, is an open-source embedding database designed to store, query, and manage embeddings efficiently. Embeddings are numerical representations of data, typically derived from machine learning models, that capture semantic information in a high-dimensional space. They are widely used in applications like natural language processing, recommendation systems, and image recognition.
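
To make "capture semantic information" a little more concrete, here is a minimal sketch using hand-written three-dimensional vectors. The words and numbers are invented for illustration; real embeddings come from a model (such as the `nomic-embed-text` model used later) and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": made-up numbers, not the output of any model.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "spreadsheet": [0.1, 0.2, 0.95],
}

sim_related = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
sim_unrelated = cosine_similarity(toy_embeddings["cat"], toy_embeddings["spreadsheet"])
print(f"cat vs kitten:      {sim_related:.3f}")
print(f"cat vs spreadsheet: {sim_unrelated:.3f}")
```

Vectors that point in similar directions score close to 1, which is the property an embedding database relies on when it looks up "similar" items.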

In [2]:
!mkdir -p tmp
print("Downloading ollama...")
!curl --fail --show-error --location --progress-bar -o tmp/ollama "https://ollama.com/download/ollama-linux-amd64"
print("Installing ollama...")
!install -m755 tmp/ollama /home/studio-lab-user/.conda/envs/default/bin/ollama
print("Cleaning up...")
!rm -rf tmp
print("Upgrading langchain to work with latest models...")
%pip install -U langchain-community chromadb --quiet
print("Install complete!")


Downloading ollama...
######################################################################### 100.0%
Installing ollama...
Cleaning up...
Upgrading langchain to work with latest models...
Note: you may need to restart the kernel to use updated packages.
Install complete!
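
As a quick sanity check after the install, the sketch below looks for the `ollama` binary on the `PATH`. It assumes the kernel's `PATH` includes the environment's `bin` directory used in the install cell; the helper name `find_executable` is just for illustration.

```python
import shutil

def find_executable(name):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)

path = find_executable("ollama")
if path is None:
    print("ollama not found on PATH; re-run the install cell")
else:
    print(f"ollama found at {path}")
```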


## Next, we'll launch ollama

All this code does is run `ollama serve`. It runs the command in a separate thread so that the cell doesn't block the notebook and we can keep running other cells. Running the command in a separate terminal achieves the same effect.

In [33]:
import asyncio
import threading

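async def run_process(cmd):
    print('>>> starting', *cmd)
    # Discard the server's output: nothing here reads from a pipe, and an
    # unread pipe can fill up and stall the process.
    process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.DEVNULL
    )
    # Keep the coroutine (and its event loop) alive while the server runs.
    await process.wait()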

    
async def start_ollama_serve():
    await run_process(['ollama', 'serve'])

def run_async_in_thread(loop, coro):
    asyncio.set_event_loop(loop)
    loop.run_until_complete(coro) 
    loop.close()

# Create a new event loop that will run in a new thread 
new_loop = asyncio.new_event_loop() 

# Start ollama serve in a separate thread so the cell won't block execution 
thread = threading.Thread(target=run_async_in_thread, args=(new_loop, start_ollama_serve()))
thread.start() 

>>> starting ollama serve
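
A lighter-weight way to start a long-running command without blocking the cell is `subprocess.Popen`, which returns as soon as the process has been launched. This is only a sketch: the helper name `start_background` is made up, and a short-lived Python command stands in for the server so the example runs anywhere.

```python
import subprocess
import sys

def start_background(cmd):
    """Start `cmd` without waiting for it; returns the Popen handle."""
    return subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,  # discard output so no pipe can fill up
        stderr=subprocess.DEVNULL,
    )

# Stand-in command for the demo; in the notebook you would pass
# ["ollama", "serve"] instead.
demo = start_background([sys.executable, "-c", "import time; time.sleep(0.5)"])
print("still running:", demo.poll() is None)
demo.wait()
print("exit code:", demo.returncode)
```

Keeping the returned handle around also makes it possible to stop the server later with `terminate()`.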


In [34]:
from langchain_community.llms import Ollama

llm = Ollama(model="llama3",  base_url='http://localhost:11434')

print(llm.invoke("Tell me a joke"))

Here's one:

Why don't scientists trust atoms?

Because they make up everything!

Hope that made you smile! Do you want to hear another?


In [32]:
print(llm.invoke("Tell me another joke"))

ConnectionError: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0cffc976a0>: Failed to establish a new connection: [Errno 111] Connection refused'))
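
A `Connection refused` error like the one above means nothing was accepting connections on port 11434 when the call was made, for example because `ollama serve` had not been started yet or had stopped. One way to guard against this is to poll the server before calling the model. The sketch below uses only the standard library; `wait_for_server` is a hypothetical helper, and it assumes the server answers plain HTTP requests at its base URL.

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url, timeout=30.0, interval=0.5):
    """Poll `url` until it answers or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered, just not with a 2xx status.
            return True
        except (urllib.error.URLError, OSError):
            # Nothing listening yet (or the attempt timed out); retry.
            time.sleep(interval)
    return False

if wait_for_server("http://localhost:11434", timeout=2):
    print("server is reachable")
else:
    print("server is not reachable yet")
```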

In [5]:
print(llm.invoke("Tell me a joke about AI"))

Here's one:

Why did the AI program go to therapy?

Because it was struggling with its "algorithmic" identity and wanted to improve its "predictive" personality!

Hope that made you laugh (or at least raised an eyebrow)


In [6]:
print(llm.invoke("why is the sky blue"))

The sky appears blue because of a phenomenon called Rayleigh scattering, named after the British physicist Lord Rayleigh, who first described it in the late 19th century.

Here's what happens:

1. **Light from the sun**: When sunlight enters Earth's atmosphere, it contains all the colors of the visible spectrum (red, orange, yellow, green, blue, indigo, and violet).
2. **Molecules in the air**: The air molecules are much smaller than the wavelength of light. This is important.
3. **Scattering occurs**: As the sunlight travels through the atmosphere, it encounters these tiny air molecules. These molecules scatter the shorter (blue) wavelengths of light more than the longer (red) wavelengths.
4. **Blue light dominates**: Because blue light is scattered in all directions by the air molecules, it reaches our eyes from all parts of the sky. This scattering effect gives the sky its blue appearance.
5. **Red light continues straight**: The longer wavelengths of light, like red and orange, are

## Working with text data (RAG - Retrieval-Augmented Generation)

https://ollama.com/blog/embedding-models


In Langchain, `VectorstoreIndexCreator` is a helper that builds a vector index from one or more document loaders: it loads the documents, splits them into chunks, embeds each chunk, and stores the resulting vectors in a vector store. Vectors represent data points in a multi-dimensional space and are commonly used in data analysis, machine learning, and natural language processing. Organizing them in an index allows quick and effective searching and querying.

The `as_retriever` method converts the vector store held by the index (`index.vectorstore`) into a retriever object. The `search_kwargs={"k": 10}` argument used below tells the retriever to return the 10 nearest neighbors of each query vector. The retriever finds these neighbors, typically with an approximate nearest neighbor search, using a similarity metric such as cosine similarity or Euclidean distance.
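
The nearest-neighbor lookup described above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not how a real vector store is implemented: it compares the query against every stored vector (an exact search rather than an approximate one), and the texts and vectors are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, store, k):
    """Return the k (text, score) pairs most similar to query_vector."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy store of (text, vector) pairs; the vectors are invented.
store = [
    ("deep learning", [0.9, 0.1, 0.0]),
    ("neural networks", [0.8, 0.2, 0.1]),
    ("credit scoring", [0.1, 0.9, 0.2]),
    ("text analysis", [0.2, 0.3, 0.9]),
]
query_vector = [1.0, 0.0, 0.0]

for text, score in top_k(query_vector, store, k=2):
    print(f"{score:.3f}  {text}")
```

Here `k` plays the same role as the `k` in `search_kwargs`: it caps how many of the best-matching entries are returned.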


In [8]:
# Course codes STATS5073 through STATS5084 (12 courses)
course_codes = [f"STATS{n}" for n in range(5073, 5085)]

base_url = "https://www.gla.ac.uk/postgraduate/taught/dataanalyticsonline/?card=course&code="

course_list = [base_url + c for c in course_codes]



In [15]:
course_list[10]

'https://www.gla.ac.uk/postgraduate/taught/dataanalyticsonline/?card=course&code=STATS5083'



In [25]:
from langchain.chains import ConversationalRetrievalChain
from langchain.indexes import VectorstoreIndexCreator
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import OllamaEmbeddings

# Embedding model served by the local ollama instance
oembed = OllamaEmbeddings(base_url="http://localhost:11434", model="nomic-embed-text")

# Load every course page, embed the text, and store the vectors in an index
loader = WebBaseLoader(web_paths=course_list)
index = VectorstoreIndexCreator(embedding=oembed).from_loaders([loader])

# Retrieve the 10 most similar chunks for each question
chain = ConversationalRetrievalChain.from_llm(
  llm=llm,
  retriever=index.vectorstore.as_retriever(search_kwargs={"k": 10}),
)

chat_history = []




In [26]:
query = "Summarize the combined content of all the courses. Group similar topics together"
result = chain.invoke({"question": query, "chat_history": chat_history})
print(result['answer'])

chat_history.append((query, result['answer']))

Based on the course descriptions, here is a summary of the combined content:

**Machine Learning and Data Analysis**

* Introduction to different methods for dimension reduction and clustering (unsupervised learning)
* Introduction to kernel methods and support vector machines
* Application and interpretation of methods such as principal component analysis, classical cluster analysis, and classification techniques
* Explanation and interpretation of ROC curves and performance measures

**Data Analytics**

* Data- analytical methods specific to a given application area such as credit scoring or marketing
* Classical methods for cluster analysis
* Regularised regression
* Graphical models for structural inference in high-dimensional data
* Quantitative text analysis
* Social network analysis and knowledge graphs

**Big Data and Distributed Computing**

* Efficient implementation of computationally expensive data- analytic methods and/or data-analytic methods for big data
* Deep learning 

In [27]:
query = "Summarize the deep learning components of the programme "
result = chain.invoke({"question": query, "chat_history": chat_history})
print(result['answer'])

chat_history.append((query, result['answer']))

Based on the course aims and intended learning outcomes, the deep learning components of the program appear to be:

* Introduction to deep learning and convolutional neural networks
* Fitting a neural network using specialized frameworks such as TensorFlow or Keras
* Discussing important methodological aspects underpinning deep learning

These components suggest that students will learn about the fundamental concepts and applications of deep learning, including convolutional neural networks, and gain practical experience in implementing deep learning models using popular frameworks.
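
Both queries above pass the same `chat_history` list, which grows by one `(question, answer)` tuple per turn. A conversational retrieval chain uses that history to rephrase a follow-up question into a standalone one before it searches the index. The sketch below shows the idea only: the function names and the prompt wording are hypothetical, not Langchain's actual internals.

```python
def format_history(chat_history):
    """Render (question, answer) tuples as a plain-text transcript."""
    lines = []
    for question, answer in chat_history:
        lines.append(f"Human: {question}")
        lines.append(f"Assistant: {answer}")
    return "\n".join(lines)

def build_condense_prompt(chat_history, follow_up):
    """Build a prompt asking a model to make `follow_up` self-contained."""
    return (
        "Given the conversation below, rephrase the follow-up question "
        "so that it can be understood on its own.\n\n"
        f"{format_history(chat_history)}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )

# Illustrative history; the answer text is a placeholder.
history = [
    ("Summarize the courses.", "The courses cover machine learning and data analytics."),
]
prompt = build_condense_prompt(history, "Which of them cover deep learning?")
print(prompt)
```

The rephrased question, rather than the raw follow-up, is what would then be embedded and matched against the stored chunks.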


In [35]:
mostpopularpage = ['https://en.wikipedia.org/wiki/Bill_Walton']


oembed = OllamaEmbeddings(base_url="http://localhost:11434", model="nomic-embed-text")

loader = WebBaseLoader(web_paths=mostpopularpage)
index = VectorstoreIndexCreator(embedding=oembed).from_loaders([loader])

chain = ConversationalRetrievalChain.from_llm(
  llm=llm,
  retriever=index.vectorstore.as_retriever(search_kwargs={"k": 10}),
)

chat_history = []



In [40]:
query = "Where was Bill Walton born?"
result = chain.invoke({"question": query, "chat_history": chat_history})
print(result['answer'])

chat_history.append((query, result['answer']))

Unfortunately, this context does not provide information about where Bill Walton was born. The provided text only mentions his achievements, awards, and involvement in various sports-related activities, but does not include any details about his birthplace or personal life.


In [41]:
query = "Are you sure you don't have information about his early life?"


In [None]:
result = chain.invoke({"question": query, "chat_history": chat_history})
print(result['answer'])

chat_history.append((query, result['answer']))

In [None]:
from langchain_community.llms import Ollama
ollama = Ollama(
    base_url='http://localhost:11434',
    model="llama3"
)
print(ollama.invoke("why is the sky blue"))