# Week 2 Practical

Relevant unit learning outcomes used this week:

**UL01**: Identify the key natural language processing applications that match
current and emerging industry needs.

# Hugging Face Tools Exploration

This is mostly copied from last week's practical, because many students didn't finish the practical
or encountered problems, or (sorry!) some of the wording wasn't clear about what was expected.

## What are some common natural language processing tasks?

Let's explore some of the interesting pre-trained models on Hugging Face hub.

We're doing this so that you can see the different kinds of natural language processing applications that exist, and can play with them to understand how they work.

## Set up

Make sure the `transformers` library is installed. If the next cell works and displays a version number,
you're good. Otherwise run `conda install transformers` or `pip install transformers`.

Common problems:

- Expect to see a `transformers` version of 4.x or newer.

- If `transformers.pipeline` doesn't exist, check what version of python you are running. It should be 3.11.x or
  newer.
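
Both checks in the list above can be scripted. The sketch below uses only the standard library; the helper name `environment_report` is made up for this example, and it reports `None` instead of failing when the package is not installed.

```python
import sys
from importlib import metadata


def environment_report(package="transformers"):
    """Return the running Python version and the installed version of `package`."""
    try:
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        package_version = None  # not installed in this environment
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        package: package_version,
    }


print(environment_report())
```

If the second value comes back as `None`, install the library as described above and restart the kernel.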

In [18]:
import transformers
transformers.__version__

'4.38.1'

In [19]:
import sys
sys.version

'3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]'

## Research about some different NLP tasks

Look up https://huggingface.co/docs/transformers/en/main_classes/pipelines and find the
documentation for the `pipeline` constructor.

Make a list of `task` options that look relevant to natural language processing. Randomly pick one
and look up more information about it.

In [21]:
tasks = {
    "audio-classification": 'will return a AudioClassificationPipeline.',
    "automatic-speech-recognition": 'will return a AutomaticSpeechRecognitionPipeline.',
    "conversational": 'will return a ConversationalPipeline.',
    "document-question-answering": 'will return a DocumentQuestionAnsweringPipeline.',
    "question-answering": 'will return a QuestionAnsweringPipeline.',
    "summarization": 'will return a SummarizationPipeline.',
    "table-question-answering": 'will return a TableQuestionAnsweringPipeline.',
    "text2text-generation": 'will return a Text2TextGenerationPipeline.',
    "text-classification": '(alias "sentiment-analysis" available): will return a TextClassificationPipeline.',
    "text-generation": 'will return a TextGenerationPipeline:.',
    "text-to-audio": '(alias "text-to-speech" available): will return a TextToAudioPipeline:.',
    "token-classification": '(alias "ner" available): will return a TokenClassificationPipeline.',
    "translation": 'will return a TranslationPipeline.',
    "translation_xx_to_yy": 'will return a TranslationPipeline.',
    "zero-shot-image-classification": 'will return a ZeroShotImageClassificationPipeline.',
    "zero-shot-audio-classification": 'will return a ZeroShotAudioClassificationPipeline.',
}
import random
random.choice(list(tasks))

'table-question-answering'

Go to the Hugging Face hub and find a model that implements the task that you saw.

## Looking at more example models, looking at their datasets and using models

The point of this section (down to where we start talking about PyTorch) is to get some
hands-on time with different NLP tasks.

Each `model` has been trained on a `dataset` of some kind. Looking at that dataset will
tell you what sort of answers the model can output. (e.g. if the training dataset only
has 0s and 1s in the output column, it is never going to predict "2").

Many natural language datasets are quite simple: there will be a text column and a label
column.

Let's look at some datasets so that we can understand what a model trained on that dataset
would be useful for.
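
To see why the label column limits what a model can say, it helps to count the distinct labels. Here is a small sketch on made-up rows (the sentences and the helper name `label_counts` are invented for illustration; a real dataset would come from `load_dataset`):

```python
from collections import Counter

# Made-up rows in the common "text column + label column" layout.
rows = [
    {"sentence": "a delightful surprise", "label": 1},
    {"sentence": "tedious and overlong", "label": 0},
    {"sentence": "warm , funny and sharp", "label": 1},
]


def label_counts(examples, label_key="label"):
    """Count how often each label value appears."""
    return Counter(example[label_key] for example in examples)


counts = label_counts(rows)
print(counts)          # which labels exist, and how many of each
print(sorted(counts))  # the only values a classifier trained on this data can predict
```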

### SST2

Look up the following dataset:

- sst2

(You can look it up on Hugging Face, or just search in general.)

What is it about? What are the features (columns) of this data set?

In [23]:
!pip install datasets

from datasets import load_dataset

# Load the SST-2 dataset
sst2_dataset = load_dataset("glue", "sst2")



In [None]:
# Print dataset information
print(sst2_dataset)

# Access the training split
train_data = sst2_dataset["train"]

# Select a few examples from the training split
few_examples = train_data.select(range(5))  # select the first 5 examples

# Iterate over and display the content of each example
for example in few_examples:
    sentence = example["sentence"]  # Sentence content
    label = example["label"]  # Label
    idx = example["idx"]
    print(f"Sentence: {sentence}")
    print(f"Label: {label}")
    print(f"Idx: {idx}")
    print()

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})
Sentence: hide new secretions from the parental units 
Label: 0
Idx: 0

Sentence: contains no wit , only labored gags 
Label: 0
Idx: 1

Sentence: that loves its characters and communicates something rather beautiful about human nature 
Label: 1
Idx: 2

Sentence: remains utterly satisfied to remain the same throughout 
Label: 0
Idx: 3

Sentence: on the worst revenge-of-the-nerds clichés the filmmakers could dredge up 
Label: 0
Idx: 4




SST-2 is the binary-label version of the Stanford Sentiment Treebank, a benchmark dataset commonly used for sentiment analysis in natural language processing (NLP). It consists of sentences extracted from movie reviews, along with binary sentiment labels indicating whether each sentence expresses a positive (1) or negative (0) sentiment.

Look up the following models:

- distilbert-base-uncased-finetuned-sst-2-english

You can probably guess what dataset it has been fine-tuned on! What sort of task is it?

You can use it with `transformers.pipeline(` *name-of-the-task* `,model="` *name-of-the-model* `)`

In [27]:
transformers.pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

<transformers.pipelines.text_classification.TextClassificationPipeline at 0x7a27723451e0>

Try the following:

- Write a grumpy sentence and see what it says.

- Write a cheery sentence

What output does it give?

In [28]:
from transformers import pipeline
classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [29]:
from transformers import pipeline
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

In [30]:
classifier("I hate the world and everything in it.")

[{'label': 'NEGATIVE', 'score': 0.9982221722602844}]

In [31]:
classifier("I'm chill about the the world and everything in it.")

[{'label': 'POSITIVE', 'score': 0.9757987260818481}]

Now let's look at some forms of bias in these models, and other anomalies:

- Write a neutral sentence but include the name of a developed nation or a popular city.

- Write a neutral sentence but include the name of a developing nation or a city that has a bad reputation.

In [32]:
classifier("I'm going to Sydney")

[{'label': 'POSITIVE', 'score': 0.9987388253211975}]

In [33]:
classifier("I'm going to Wagga Wagga")

[{'label': 'NEGATIVE', 'score': 0.9660237431526184}]

In [34]:
classifier("I'm going to Uzbekistan")

[{'label': 'POSITIVE', 'score': 0.9977136850357056}]

In [35]:
classifier("I'm going to Afghanistan")

[{'label': 'NEGATIVE', 'score': 0.9114143252372742}]
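
Each call returns a list with one dictionary holding a `label` and a `score`. To compare sentences side by side, it can help to fold those into a single signed number. The sketch below does that for the four results recorded above; `signed_score` is a helper made up for this example, not part of `transformers`.

```python
def signed_score(result):
    """Map a one-item classifier result to [-1, 1]; negative means NEGATIVE."""
    top = result[0]
    sign = 1.0 if top["label"] == "POSITIVE" else -1.0
    return sign * top["score"]


# Outputs copied from the cells above.
recorded = {
    "Sydney": [{"label": "POSITIVE", "score": 0.9987388253211975}],
    "Wagga Wagga": [{"label": "NEGATIVE", "score": 0.9660237431526184}],
    "Uzbekistan": [{"label": "POSITIVE", "score": 0.9977136850357056}],
    "Afghanistan": [{"label": "NEGATIVE", "score": 0.9114143252372742}],
}

# Print the places from most negative to most positive.
for place, result in sorted(recorded.items(), key=lambda item: signed_score(item[1])):
    print(f"{place:12s} {signed_score(result):+.3f}")
```

The sentences differ only in the place name, so any gap between the signed scores comes from the name itself.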

Confess to some terrible deeds (pretend you committed a crime) but phrase it in a way that is very upbeat.

In [36]:
classifier("I'm so excited about being able to commit my first murder!")

[{'label': 'POSITIVE', 'score': 0.9991581439971924}]

### The conll03 dataset

Look up the **conll03** dataset (CoNLL-2003; on the Hugging Face hub it is listed under the name `conll2003`).

We can create a pipeline with the "ner" task. The default model has been trained on **conll03**.


"NER" stands for Named Entity Recognition. It is a task in natural language processing (NLP) that involves identifying and classifying named entities (such as persons, organizations, locations, dates, etc.) mentioned in unstructured text into predefined categories.

In [None]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [37]:
from transformers import pipeline

# Specifying a model name and revision
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
ner_tagger = pipeline("ner", model=model_name, aggregation_strategy="simple")

config.json:   0%|          | 0.00/998 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.33G [00:00<?, ?B/s]

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/60.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Copy and paste a news article from today's news and run NER (named entity recognition) on it.

In [38]:
news_article = """A man is in custody as New South Wales police continue investigating the disappearance of the former Channel Ten presenter Jesse Baird and his Qantas flight attendant partner, Luke Davies.

“A 29-year-old man presented to Bondi police station shortly after 10.30am,” police said on Friday. “The man will be transferred to Waverley police station where investigations will continue.”

Police earlier raided a home in the Sydney suburb of Balmain after announcing they were searching for a third person potentially linked to the disappearance of Baird, 26 and now an AFL goal umpire, and Davies, 29, who vanished from Sydney’s east on Monday.

Their disappearance is being treated as suspicious after bloody possessions belonging to both men were found in a skip bin in Cronulla on Wednesday, leading police to Baird’s blood-stained sharehouse 30km away in Paddington."""

ner_tagger(news_article)

[{'entity_group': 'LOC',
  'score': 0.96861726,
  'word': 'New South Wales',
  'start': 23,
  'end': 38},
 {'entity_group': 'ORG',
  'score': 0.9967077,
  'word': 'Channel Ten',
  'start': 101,
  'end': 112},
 {'entity_group': 'PER',
  'score': 0.9994762,
  'word': 'Jesse Baird',
  'start': 123,
  'end': 134},
 {'entity_group': 'ORG',
  'score': 0.9854202,
  'word': 'Qantas',
  'start': 143,
  'end': 149},
 {'entity_group': 'PER',
  'score': 0.9995712,
  'word': 'Luke Davies',
  'start': 176,
  'end': 187},
 {'entity_group': 'LOC',
  'score': 0.99238646,
  'word': 'Bondi',
  'start': 222,
  'end': 227},
 {'entity_group': 'LOC',
  'score': 0.9860048,
  'word': 'Waverley',
  'start': 322,
  'end': 330},
 {'entity_group': 'LOC',
  'score': 0.9959139,
  'word': 'Sydney',
  'start': 420,
  'end': 426},
 {'entity_group': 'LOC',
  'score': 0.96554327,
  'word': 'Balmain',
  'start': 437,
  'end': 444},
 {'entity_group': 'PER',
  'score': 0.99972326,
  'word': 'Baird',
  'start': 544,
  'end':

Expect to see some locations, persons and organisations. If you don't, edit the news article and add a few,
and confirm that this NER model can detect them.
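
The `start` and `end` values in each result are character offsets into the input text, so the entities can be sliced back out and grouped by type. Here is a small sketch using the opening of the article and the first three entities from the output above (`group_entities` is a name invented for this example):

```python
from collections import defaultdict

# Opening of the article and the first three entities from the output above.
text = ("A man is in custody as New South Wales police continue investigating "
        "the disappearance of the former Channel Ten presenter Jesse Baird")
entities = [
    {"entity_group": "LOC", "word": "New South Wales", "start": 23, "end": 38},
    {"entity_group": "ORG", "word": "Channel Ten", "start": 101, "end": 112},
    {"entity_group": "PER", "word": "Jesse Baird", "start": 123, "end": 134},
]


def group_entities(source, found):
    """Collect entity strings by type, slicing them out of the source text."""
    grouped = defaultdict(list)
    for entity in found:
        grouped[entity["entity_group"]].append(source[entity["start"]:entity["end"]])
    return dict(grouped)


print(group_entities(text, entities))
```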

#### A small digression

Named entity recognition is comparatively easy in English, because almost every place, person or organisation
begins with a capital letter.

What about languages that don't have capital letters?

Create a new pipeline using the Japanese named entity model `Mizuiro-sakura/luke-japanese-base-finetuned-ner`

In [40]:
jp_ner = pipeline(model="Mizuiro-sakura/luke-japanese-base-finetuned-ner")

config.json:   0%|          | 0.00/2.16k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.88k [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/842k [00:00<?, ?B/s]

entity_vocab.json:   0%|          | 0.00/22.3M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/40.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/595 [00:00<?, ?B/s]

Let's give it an easy sentence to start with:

"2023年4月に東京で開催されるオリンピックに、選手として参加する予定の大坂なおみは、トレーニングに励んでいます。"

This sentence translates to: "Naomi Osaka, who is scheduled to participate as an athlete in the Olympics held in Tokyo in April 2023, is working hard on her training."

You should expect the model to find:

- Naomi Osaka (大坂なおみ) -- which will appear as a person (人名), although the tokens the model tags may only cover の大坂なお

- Tokyo (東京) -- which will appear as a place name (地名)

In [41]:
jp_ner("2023年4月に東京で開催されるオリンピックに、選手として参加する予定の大坂なおみは、トレーニングに励んでいます")

[{'entity': 'B-地名',
  'score': 0.9938637,
  'index': 6,
  'word': '東京',
  'start': None,
  'end': None},
 {'entity': 'B-人名',
  'score': 0.99575496,
  'index': 15,
  'word': 'の大',
  'start': None,
  'end': None},
 {'entity': 'I-人名',
  'score': 0.9965101,
  'index': 16,
  'word': '坂',
  'start': None,
  'end': None},
 {'entity': 'I-人名',
  'score': 0.9965744,
  'index': 17,
  'word': 'なお',
  'start': None,
  'end': None}]
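
The `B-` and `I-` prefixes follow the common BIO convention: `B-` marks the first token of an entity and `I-` marks a token that continues it. Without an aggregation strategy the pieces come back separately, so here is a rough sketch of stitching them together (`merge_bio` is a name invented for this example; it assumes tokens arrive in order with an `index` field, as in the output above):

```python
def merge_bio(tokens):
    """Merge B-/I- tagged tokens into (entity_type, text) spans."""
    spans = []            # each item is [entity_type, text]
    previous_index = None
    for token in tokens:
        prefix, _, entity_type = token["entity"].partition("-")
        piece = token["word"].replace("\u2581", "")  # drop SentencePiece's word-start marker
        adjacent = previous_index is not None and token["index"] == previous_index + 1
        if prefix == "I" and adjacent and spans and spans[-1][0] == entity_type:
            spans[-1][1] += piece                     # continue the current entity
        else:
            spans.append([entity_type, piece])        # start a new entity
        previous_index = token["index"]
    return [tuple(span) for span in spans]


# Tags, indices and words copied from the output above.
tokens = [
    {"entity": "B-地名", "index": 6, "word": "東京"},
    {"entity": "B-人名", "index": 15, "word": "の大"},
    {"entity": "I-人名", "index": 16, "word": "坂"},
    {"entity": "I-人名", "index": 17, "word": "なお"},
]
print(merge_bio(tokens))
```

The merged person span makes the tokenisation problem visible: it picks up the particle の and drops the final み of the name.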

But it gets much, much harder still in Japanese!

Use your NER classifier on "花春は春に桜の花を見に行くのが大好きです。"

This sentence means: "Hana Haru loves to go see the cherry blossoms in the spring."

"花春" (Hana Haru) is the person's full name, but the components of the name, "花" (flower) and "春" (spring), also directly relate to the content of the activity (viewing cherry blossoms), highlighting the complexity and beauty of Japanese naming conventions and their potential ambiguity in NER tasks.

The `Mizuiro-sakura/luke-japanese-base-finetuned-ner` model guesses that the first "Flower" is a name (Hana), but doesn't think "Haru" is part of that name.

In [42]:
jp_ner("花春は春に桜の花を見に行くのが大好きです。")

[{'entity': 'B-人名',
  'score': 0.9887017,
  'index': 1,
  'word': '▁花',
  'start': None,
  'end': None}]

## SQuAD dataset

Look up the **SQuAD** dataset.

The question-answering task is a little different. The pipeline takes two arguments:

- question (the question you want to ask)

- context (the information that it has to answer questions about)

Unlike a generative model, it is not synthesising an answer. It is simply trying to find a relevant fragment of the document that answers the question, which makes it very fast.
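
Roughly speaking, an extractive reader scores every token as a possible answer start and as a possible answer end, then returns the span with the best combined score. The toy sketch below shows only that selection step. The scores are made up, `best_span` is a name invented for this example, and the real pipeline also handles tokenisation and long contexts.

```python
def best_span(start_scores, end_scores, max_answer_len=4):
    """Return (start, end) token indices maximising start score + end score."""
    best, best_score = None, float("-inf")
    for start, start_score in enumerate(start_scores):
        # Only consider spans of at most max_answer_len tokens.
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


tokens = ["his", "Qantas", "flight", "attendant", "partner"]
start_scores = [0.1, 2.5, 0.3, 0.2, 0.1]  # made-up numbers
end_scores = [0.0, 0.4, 0.6, 2.2, 0.9]    # made-up numbers

start, end = best_span(start_scores, end_scores)
print(" ".join(tokens[start:end + 1]))
```

Because the answer is always a slice of the context, the model cannot say anything the document does not contain.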

Ask a question about your news article.

In [43]:
reader = pipeline("question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/473 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/261M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

In [44]:
reader(question="Who were the police looking for", context=news_article)

{'score': 0.09344498068094254,
 'start': 486,
 'end': 500,
 'answer': 'a third person'}

In [45]:
reader(question="What was Luke Davies job?", context=news_article)

{'score': 0.34823381900787354,
 'start': 143,
 'end': 166,
 'answer': 'Qantas flight attendant'}

## Summarisation

Congbo will talk more about how this works in her lecture!

Let's not spoil any secrets here.

Summarise your news article using the default model for the "summarization" task.

In [46]:
summarizer = pipeline(task="summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

In [47]:
summarizer(news_article)

[{'summary_text': ' Jesse Baird and Luke Davies vanished from Sydney’s east on Monday . Police have arrested a 29-year-old man in connection with the disappearance of the pair . The man will be transferred to Waverley police station where investigations will continue . Baird and Davies were found dead in a sharehouse in Paddington on Wednesday .'}]

## Translation

Translate it into another language. The name of the task will be something like
"translation_en_to_de" ("de" is the language code for German). The two-letter ISO 639-1 codes
are listed here: https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes

Pick your favourite language!

Note that you will have to specify a model. The Helsinki-NLP models on the hub are a good place to
start searching. You might have to install some supporting packages like `sentencepiece`
(and then you will have to restart the jupyter notebook kernel).
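
The task name is built from the two language codes. The sketch below builds it together with a candidate Helsinki-NLP model name. The model naming pattern is an assumption inferred from the English-to-Esperanto model used in this practical, and not every language pair has a model, so check the hub before relying on it. `translation_names` is a name invented for this example.

```python
def translation_names(source, target):
    """Build the task name and a candidate Helsinki-NLP model name for a language pair."""
    task = f"translation_{source}_to_{target}"
    model = f"Helsinki-NLP/opus-mt-{source}-{target}"  # candidate only; confirm it exists on the hub
    return task, model


print(translation_names("en", "eo"))
```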

In [6]:
translator = pipeline("translation_en_to_eo",
                     model='Helsinki-NLP/opus-mt-en-eo')

config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/300M [00:00<?, ?B/s]


generation_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

source.spm:   0%|          | 0.00/806k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/820k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.35M [00:00<?, ?B/s]



Write a short sentence and translate it into that language.

In [48]:
translator("This is pretty amazing")

[{'translation_text': 'Tio estas sufiĉe mirinda'}]

## Speech recognition

OpenAI have a remarkably good speech-to-text model called "whisper".

The task is called "automatic-speech-recognition". There is a good model called "openai/whisper-small".

In [49]:
transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-small")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


There is an audio clip of David Attenborough speaking in the same folder as this notebook, called `David_Attenborough.wav`

Transcribe it.

In [50]:
transcriber("David_Attenborough.wav")

{'text': " Mae'r unig ymddiadau bod ymddiadau ymddiadau yn ymddiadau yn ymddiadau yn ymddiadau."}

## Speech generation

ElevenLabs.io just raised $80m and have a billion dollar valuation turning text into speech!

The quality of the default model in Hugging Face isn't quite that good though.

The name of the task is "text-to-audio".

In [51]:
make_audio = pipeline("text-to-audio")

No model was supplied, defaulted to suno/bark-small and revision 645cfba (https://huggingface.co/suno/bark-small).
Using a pipeline without specifying a model name and revision in production is not recommended.


Make up a sentence and run the text-to-audio model on your sentence. It outputs a dictionary with two keys:

- `audio` (a 2D numpy array, but because it's mono audio data there's only one real axis)

- `sampling_rate` (a single number: the number of audio samples per second)

This may take a long time to run. A 9-word sentence took 2 minutes on my laptop (keeping all 4 cores busy
the whole time).

Store the result in a variable called `audio_out`.

In [52]:
%%time
audio_out = make_audio("But what if I want to speak without a mouth?")
audio_out

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.


CPU times: user 1min 39s, sys: 0 ns, total: 1min 39s
Wall time: 1min 40s


{'audio': array([[0.06964394, 0.06185713, 0.06140869, ..., 0.00832439, 0.00824419,
         0.00820295]], dtype=float32),
 'sampling_rate': 24000}

If you install the `sounddevice` package, you'll be able to run

```python
import sounddevice
sounddevice.play(audio_out['audio'].T, audio_out['sampling_rate'])
```

Assuming your computer has a speaker, you should hear something.

In [11]:
!pip install sounddevice
!apt-get install libportaudio2

Collecting sounddevice
  Downloading sounddevice-0.4.6-py3-none-any.whl (31 kB)
Installing collected packages: sounddevice
Successfully installed sounddevice-0.4.6
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  libportaudio2
0 upgraded, 1 newly installed, 0 to remove and 35 not upgraded.
Need to get 65.3 kB of archives.
After this operation, 223 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libportaudio2 amd64 19.6.0-1.1 [65.3 kB]
Fetched 65.3 kB in 0s (155 kB/s)
Selecting previously unselected package libportaudio2:amd64.
(Reading database ... 121749 files and directories currently installed.)
Preparing to unpack .../libportaudio2_19.6.0-1.1_amd64.deb ...
Unpacking libportaudio2:amd64 (19.6.0-1.1) ...
Setting up libportaudio2:amd64 (19.6.0-1.1) ...
Processing triggers for libc-bin (2.35-0ubuntu3.4) ...
/sbin/ldconfig.real: /usr/local

In [17]:
import sounddevice
# Query available audio devices
devices = sounddevice.query_devices()
print(devices)  # Print available audio devices




In [16]:
import sounddevice
sounddevice.play(audio_out['audio'].T, audio_out['sampling_rate'])

PortAudioError: Error querying device -1
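
On a machine with no audio device (a hosted notebook, for instance), one alternative is to write the samples to a WAV file and listen to that. This is a minimal sketch using only the standard library; `write_wav` is a name invented for this example, and the test signal is a plain tone, not model output.

```python
import math
import struct
import wave


def write_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))
    return len(samples) / sampling_rate     # duration in seconds


# Stand-in signal: one second of a 440 Hz tone at a 24000 Hz sampling rate.
rate = 24000
tone = [0.3 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
print(write_wav("tone.wav", tone, rate))
```

With the model output, the call would look like `write_wav("speech.wav", audio_out["audio"].squeeze().tolist(), audio_out["sampling_rate"])`.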

## ollama

ollama is a platform for running some famous LLMs very efficiently on a wide range of computers. We'll talk about
quantization and other ways of shrinking models in a later lecture.

If you are using the 06EaR 206 computers, they should already have ollama installed.

If you are on a laptop, you can install it from https://ollama.com/download

-----

There are many ways to run `ollama`. You can do `!ollama` in a cell.

In [53]:
!ollama

/bin/bash: line 1: ollama: command not found


You can also open up a terminal and run `ollama`.

On Windows, terminal is `cmd.exe` (typing in `cmd` at the Start menu will find it)

On OSX, the default terminal is `Terminal.app` and it is in the `/Applications/Utilities` folder.

On Linux, it depends on your Linux distribution, but there's probably a very prominent button to show it.

Regardless of the operating system, from the Jupyter main page you can also choose to create a Terminal instead of a notebook.

Running the `ollama` command there should output the same thing as running it in the cell above.

------
Run `ollama list`.

This will probably show nothing.

In [54]:
!ollama list

/bin/bash: line 1: ollama: command not found


You download models with `ollama pull`. They are listed on https://ollama.com/library

Interesting models:
- `gemma:2b` -- a recent small model from Google
- `phi` -- a very small model from Microsoft
- `zephyr` -- a somewhat larger model that is good at writing code
- `mixtral` -- a huge model, nearly as capable as ChatGPT-4. It needs 26+ GB of RAM.
- `dolphin-mistral` -- an uncensored model

Pick one of these and download it.

In [None]:
!ollama pull phi

/bin/bash: line 1: ollama: command not found


You should be able to run the model and interact with it now. (Unfortunately, you can't do `!ollama run *model*` because it interacts in ways that a Jupyter notebook can't handle. You'll have to run it elsewhere.)

`!ollama run ...`

## Ollama python

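You can also make use of this very efficient way of running LLMs in your python programs. The python
library talks to a local `ollama` server, so that server needs to be running first (for example via
`ollama serve` in a terminal); otherwise the calls fail with a `Connection refused` error.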

You will need to run `pip install ollama` to install the python library package.

In [None]:
!pip install ollama



Try the following (replace "llama2" with the model you chose)

    import ollama
    ollama.generate(model='llama2', prompt='Why is the sky blue?')

In [None]:
!curl https://ollama.ai/install.sh | sh

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0>>> Downloading ollama...
100  8425    0  8425    0     0  30570      0 --:--:-- --:--:-- --:--:-- 30636
############################################################################################# 100.0%
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


In [None]:
ollama

<module 'ollama' from '/usr/local/lib/python3.10/dist-packages/ollama/__init__.py'>

In [None]:
import os
import asyncio

# Set LD_LIBRARY_PATH so that the system NVIDIA libraries take precedence over the bundled ones.
# This matters especially on Google Colab, because it installs older drivers.
os.environ.update({'LD_LIBRARY_PATH': '/usr/lib64-nvidia'})

async def run(cmd):
    '''run is a helper function that runs a subcommand asynchronously.'''
    print('>>> starting', *cmd)
    p = await asyncio.subprocess.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    async def pipe(lines):
        async for line in lines:
            print(line.strip().decode('utf-8'))
    await asyncio.gather(
        pipe(p.stdout),
        pipe(p.stderr),
    )

await asyncio.gather(
    run(['ollama', 'serve']),
)

>>> starting ollama serve
time=2024-02-29T11:59:15.928Z level=INFO source=images.go:710 msg="total blobs: 0"
time=2024-02-29T11:59:15.928Z level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-02-29T11:59:15.928Z level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)"
time=2024-02-29T11:59:15.929Z level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-02-29T11:59:27.932Z level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [rocm_v5 cpu cpu_avx rocm_v6 cuda_v11 cpu_avx2]"
time=2024-02-29T11:59:27.933Z level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-29T11:59:27.934Z level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-02-29T11:59:27.955Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
time=2024-02-29T11:59:27.956Z level=INFO source=gpu.go:265 msg="Searching for GPU management library librocm_smi64.so"
time=2024-

CancelledError: 

In [None]:
!ollama pull phi

Error: could not connect to ollama app, is it running?


In [None]:
import ollama
ollama.generate(model='phi', prompt='Why is the sky blue?')

ConnectError: [Errno 111] Connection refused
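
`Connection refused` means nothing is listening on the address the client tried, which for ollama is `127.0.0.1:11434` by default (as the install and server logs above show). A quick way to check before calling the library is to try opening a TCP connection yourself. This is a sketch using only the standard library, and `port_is_open` is a name invented for this example.

```python
import socket


def port_is_open(host="127.0.0.1", port=11434, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False


if port_is_open():
    print("Something is listening on 127.0.0.1:11434; the client should be able to connect.")
else:
    print("Nothing is listening on 127.0.0.1:11434; start the server first.")
```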

You can make it feel faster by streaming the output. Add `stream=True` as an argument. It will then
return a generator that you can iterate over.

In [None]:
for term in ollama.generate(model='phi', prompt='Why is the sky blue?', stream=True):
    print(term['response'])

 The
 sky
 appears
 blue
 due
 to
 Ray
leigh
 scattering
,
 which
 occurs
 when
 light
 from
 the
 sun
 passes
 through
 the
 Earth
's
 atmosphere
.
 The
 atoms
 and
 molecules
 in
 the
 air
 scatter
 the
 sunlight
 in
 all
 directions
,
 but
 blue
 light
 is
 scattered
 more
 than
 other
 colors
 because
 it
 travels
 as
 shorter
,
 smaller
 waves
.
 This
 causes
 the
 blue
 light
 to
 be
 more
 concentrated
 at
 lower
 alt
itudes
,
 making
 the
 sky
 appear
 blue
.
 Additionally
,
 during
 sunrise
 and
 sunset
,
 the
 sun
's
 rays
 have
 to
 pass
 through
 a
 thicker
 layer
 of
 the
 atmosphere
,
 which
 sc
atters
 even
 more
 blue
 light
 away
 from
 our
 view
,
 giving
 the
 sky
 a
 redd
ish
-
orange
 hue
.
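
Printing each chunk with a plain `print` puts every token on its own line, as above. Passing `end=""` keeps the chunks on one line so the text reads naturally while it streams. The sketch below uses a stand-in generator (`fake_stream`, invented for this example) that yields the first few chunks from the output above, so it runs without a server.

```python
def fake_stream():
    """Stand-in for a streaming response: yields chunks shaped like the ones above."""
    for piece in [" The", " sky", " appears", " blue", " due", " to", " Ray", "leigh", " scattering"]:
        yield {"response": piece}


pieces = []
for chunk in fake_stream():
    print(chunk["response"], end="", flush=True)  # no newline between chunks
    pieces.append(chunk["response"])
print()

# Joining the chunks recovers the full text, e.g. for later processing.
full_text = "".join(pieces)
```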





If you want to use the output in a program though, it's often convenient to have it output in JSON format
that your program can parse. Use `format="json"` for this. This means that the `response` text will likely
be a JSON object that you can parse with python's JSON library.

For example, try asking it to say whether each thing in a list is living or non-living (e.g. "stone",
"mouse", "car", "tiger", "human being").

In [None]:
r = ollama.generate(
    model='phi',
    prompt='Tag these things as either living or non-living: stone, mouse, car, tiger, human being',
    format='json')
import json
answers = json.loads(r['response'])
print(answers)

{'stone': 'non-living', 'mouse': 'living', 'car': 'non-living', 'tiger': 'living', 'human being': 'living'}
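
A model's JSON is not guaranteed to be well-formed or to use the tags you asked for, so it is worth validating before using it. Here is a defensive sketch (`parse_tags` is a name invented for this example, and the second input is a made-up non-JSON reply):

```python
import json


def parse_tags(response_text, allowed=("living", "non-living")):
    """Parse a JSON object of item -> tag; return None if it is not usable."""
    try:
        parsed = json.loads(response_text)
    except json.JSONDecodeError:
        return None                      # not valid JSON at all
    if not isinstance(parsed, dict):
        return None                      # valid JSON, but not an object
    if any(tag not in allowed for tag in parsed.values()):
        return None                      # contains a tag we did not ask for
    return parsed


print(parse_tags('{"stone": "non-living", "mouse": "living"}'))
print(parse_tags("Sure! Here is the list you asked for"))
```

Returning `None` lets the calling code decide whether to retry the prompt or skip the item.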


# PyTorch

In the lectures, we finished up talking about how PyTorch doesn't have a built-in `.fit()` or `.train()`
method.

The code below has a training loop, but it has a fixed number of iterations.

Change it so that it has a validation dataset, and that it stops training when the accuracy
on the validation dataset is no longer improving.

To do this, you will make use of
- `model.eval()`  This doesn't evaluate anything! It just tells the model that you are in
   evaluation mode.
- `model.train()` Likewise, doesn't train, it just tells the model that you are
   switching back to training mode.
- `with torch.no_grad():` speeds up calculations by not tracking gradients (which are not
   needed when you are only evaluating on the validation or test set).
- `loss.item()` gets the loss out as a python float.
- `model()` this is how you do inference on a model: call it like a function.

For your own sanity you will probably want to split this code up into a few cells, print out
some messages in each batch (or at least each epoch).
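
The "stop when validation accuracy is no longer improving" rule is just bookkeeping, so it can be sketched without PyTorch. This is one possible version, not the only way to do it: `EarlyStopper`, `patience` and `min_delta` are names chosen for this example, and the accuracies are made up.

```python
class EarlyStopper:
    """Signal a stop once validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # how many non-improving epochs to tolerate
        self.min_delta = min_delta    # smallest gain that counts as an improvement
        self.best = float("-inf")
        self.bad_epochs = 0

    def should_stop(self, val_accuracy):
        if val_accuracy > self.best + self.min_delta:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation accuracies, one per epoch.
history = [0.61, 0.66, 0.70, 0.69, 0.70, 0.68, 0.71]
stopper = EarlyStopper(patience=3)
for epoch, accuracy in enumerate(history, start=1):
    if stopper.should_stop(accuracy):
        print(f"stopping after epoch {epoch}, best accuracy {stopper.best:.2f}")
        break
```

In the real loop, the accuracy passed to `should_stop` would be computed once per epoch on the validation loader, inside `model.eval()` and `torch.no_grad()`, before switching back with `model.train()`.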

Optional bonus exercises:

- According to your model, what was the probability that a 3rd-class male passenger would die? What about a 1st-class female passenger?

- Visualise how the weights of the first layer of this model change over time.

- Make the model work better. It's particularly inaccurate at the moment.

In [65]:
import torch
import pandas as pd
import sklearn.model_selection  # splitting data into training and testing sets
import sklearn.preprocessing #  scaling, transforming, and preprocessing data

df = pd.read_csv('titanic.csv')
df.dropna(inplace=True)
X = df.drop('Survived', axis=1).select_dtypes(include=['float64', 'int64']).values
y = df['Survived'].values #Converting it into a NumPy array.

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.2, random_state=42)
# Split the training set into training and validation sets (64% training, 16% validation)
X_train, X_val, y_train, y_val = sklearn.model_selection.train_test_split(X_train, y_train, test_size=0.2, random_state=42)

In [66]:
# Instantiates a StandardScaler object, which is used to standardize features by removing the mean and scaling to unit variance.
scaler = sklearn.preprocessing.StandardScaler()
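X_train = scaler.fit_transform(X_train)  # learn the mean and variance from the training set only
X_val = scaler.transform(X_val)          # reuse the training-set statistics (do not refit)
X_test = scaler.transform(X_test)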
# fit_transform is used for both learning parameters from the data and applying transformations,
# transform is used solely for applying transformations using the previously learned parameters.

# Convert arrays to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
X_val_tensor = torch.tensor(X_val, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.long)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)
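
To see why the scaler should be fitted on the training data only, here is a small sketch with synthetic arrays (`train` and `val` are made up for illustration). It checks that `transform` applies the training-set mean and standard deviation to the validation data.

```python
import numpy as np
import sklearn.preprocessing

rng = np.random.default_rng(0)
train = rng.normal(loc=10.0, scale=2.0, size=(100, 3))  # synthetic "training" features
val = rng.normal(loc=10.0, scale=2.0, size=(20, 3))     # synthetic "validation" features

scaler = sklearn.preprocessing.StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train, then scales it
val_scaled = scaler.transform(val)          # scales val using the statistics learned from train

# The validation data was scaled with the *training* statistics.
expected = (val - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(val_scaled, expected))
```

Calling `fit_transform` on the validation set instead would overwrite the learned statistics, and anything transformed afterwards would be scaled differently from the training data.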


In [67]:
# Create DataLoader instances
# Creates a PyTorch TensorDataset for the training data. TensorDataset is a dataset wrapper that allows you to access slices of tensors as if they were arrays.
train_dataset = torch.utils.data.TensorDataset(X_train_tensor, y_train_tensor)
# Creates a DataLoader for the training dataset. DataLoader is responsible for creating batches of data from the dataset.
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
# Shuffling is not needed for evaluation, so the validation and test loaders keep a fixed order.
val_dataset = torch.utils.data.TensorDataset(X_val_tensor, y_val_tensor)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=64, shuffle=False)
test_dataset = torch.utils.data.TensorDataset(X_test_tensor, y_test_tensor)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=64, shuffle=False)
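
If it's not obvious what a `DataLoader` hands back, this small sketch may help. It uses ten made-up rows (`features` and `targets` are illustrative only) and a batch size of 4, so the last batch is smaller than the others.

```python
import torch

features = torch.arange(20, dtype=torch.float32).reshape(10, 2)  # 10 rows, 2 columns
targets = torch.arange(10) % 2                                   # alternating 0/1 labels

dataset = torch.utils.data.TensorDataset(features, targets)
loader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=False)

# Each iteration yields one (inputs, labels) pair holding up to batch_size rows.
batch_sizes = [inputs.shape[0] for inputs, labels in loader]
print(batch_sizes)  # [4, 4, 2]
```

This is also why the loss printed at the end of an epoch can jump around: it comes from whichever batch happened to be last, and that batch may be small.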

In [68]:
# Neural network architecture
class TitanicNN(torch.nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.layer1 = torch.nn.Linear(input_size, 64)
        # self.dropout1 = torch.nn.Dropout(p=0.2)
        self.layer2 = torch.nn.Linear(64, 32)
        self.layer3 = torch.nn.Linear(32, 2)
        # The last layer outputs two logits, one per class (binary classification)
        # Don't need softmax here; CrossEntropyLoss applies log-softmax itself

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        # x = self.dropout1(x)
        x = torch.relu(self.layer2(x))
        x = self.layer3(x)
        return x
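
The comment about not needing softmax can be checked directly. The sketch below, with made-up logits and labels, compares `CrossEntropyLoss` on raw logits against a manual log-softmax followed by negative log-likelihood.

```python
import torch

logits = torch.tensor([[1.5, -0.5], [0.2, 0.9]])  # made-up raw outputs for two samples
labels = torch.tensor([0, 1])

# CrossEntropyLoss takes raw logits...
loss_direct = torch.nn.CrossEntropyLoss()(logits, labels)

# ...and is equivalent to log-softmax followed by negative log-likelihood.
log_probs = torch.log_softmax(logits, dim=1)
loss_manual = torch.nn.functional.nll_loss(log_probs, labels)

print(torch.allclose(loss_direct, loss_manual))
```

Adding a softmax layer at the end of the model as well would therefore apply it twice during training.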

In [69]:
# Instantiate the model, loss function, and optimizer
model = TitanicNN(X_train.shape[1])
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
# optimizer = torch.optim.SGD(model.parameters(), lr=0.005)
# optimizer = torch.optim.RMSprop(model.parameters(), lr=0.005)

# Training loop
epochs = 500
for epoch in range(epochs):
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()    # Computes the gradients of the loss with respect to the model parameters.
        optimizer.step()   # Updates the model parameters using the computed gradients.

    # Note: this is the loss of the last batch in the epoch only, so it will look noisy.
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')

Epoch 1, Loss: 0.6871896982192993
Epoch 2, Loss: 0.6139301657676697
Epoch 3, Loss: 0.5957580804824829
Epoch 4, Loss: 0.521987795829773
Epoch 5, Loss: 0.5473371744155884
Epoch 6, Loss: 0.6493860483169556
Epoch 7, Loss: 0.578399121761322
Epoch 8, Loss: 0.5540949106216431
Epoch 9, Loss: 0.6019346714019775
Epoch 10, Loss: 0.5852155089378357
Epoch 11, Loss: 0.5684382915496826
Epoch 12, Loss: 0.5499333739280701
Epoch 13, Loss: 0.47539222240448
Epoch 14, Loss: 0.5460414290428162
Epoch 15, Loss: 0.5364839434623718
Epoch 16, Loss: 0.5173927545547485
Epoch 17, Loss: 0.5736076831817627
Epoch 18, Loss: 0.52018803358078
Epoch 19, Loss: 0.46956920623779297
Epoch 20, Loss: 0.500639796257019
Epoch 21, Loss: 0.46086257696151733
Epoch 22, Loss: 0.4376153349876404
Epoch 23, Loss: 0.5234804749488831
Epoch 24, Loss: 0.5307371020317078
Epoch 25, Loss: 0.5198067426681519
Epoch 26, Loss: 0.48918965458869934
Epoch 27, Loss: 0.4526529610157013
Epoch 28, Loss: 0.4998268485069275
Epoch 29, Loss: 0.457363754510879

In [74]:
# Note: `model` and `optimizer` carry over from the previous cell, so this loop
# continues training the already-trained model rather than starting from scratch.
best_val_accuracy = 0
patience = 5          # how many non-improving epochs to tolerate
patience_counter = 0  # initialised here so it exists even if the first epoch does not improve

for epoch in range(epochs):
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    # Validation phase
    model.eval()
    with torch.no_grad():  # gradients are not tracked during the forward pass
        correct = 0
        total = 0
        for inputs, labels in val_loader:
            outputs = model(inputs)
            # The index of the largest output is the predicted class
            _, predicted = torch.max(outputs.data, 1)
            # print("outputs", outputs.data)
            # print("predicted:", predicted)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
        val_accuracy = correct / total

    print(f'Epoch {epoch+1}, Loss: {loss.item()}, Validation Accuracy: {val_accuracy}')

    # Early stopping
    if val_accuracy > best_val_accuracy:
        best_val_accuracy = val_accuracy
        patience_counter = 0  # Reset counter
    else:
        patience_counter += 1

    # Stop once more than `patience` consecutive epochs have failed to improve
    if patience_counter > patience:
        print("Stopping early due to no improvement in validation accuracy.")
        break

Epoch 1, Loss: 0.0062426854856312275, Validation Accuracy: 0.43333333333333335
Epoch 2, Loss: 0.006429627537727356, Validation Accuracy: 0.4666666666666667
Epoch 3, Loss: 0.00467557180672884, Validation Accuracy: 0.4666666666666667
Epoch 4, Loss: 0.004509605932980776, Validation Accuracy: 0.4666666666666667
Epoch 5, Loss: 0.003611586755141616, Validation Accuracy: 0.4666666666666667
Epoch 6, Loss: 0.005945819895714521, Validation Accuracy: 0.43333333333333335
Epoch 7, Loss: 0.0025855039712041616, Validation Accuracy: 0.4666666666666667
Epoch 8, Loss: 0.003516323631629348, Validation Accuracy: 0.4666666666666667
Stopping early due to no improvement in validation accuracy.
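
One thing the loop above does not do is keep the best weights: when it stops, the model holds the weights from the last epoch, not the best one. A possible way to handle that is sketched below. `EarlyStopper` is a hypothetical helper written for illustration, not something from PyTorch. The accuracies are the ones from the run above (rounded), and a plain dict stands in for the model state.

```python
import copy

class EarlyStopper:
    """Track the best score and a copy of the state that produced it."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_state = None
        self.bad_epochs = 0

    def step(self, score, state):
        """Record one epoch; return True when training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_state = copy.deepcopy(state)  # snapshot, so later updates don't change it
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs > self.patience

# Validation accuracies from the run above, rounded to four places.
accuracies = [0.4333, 0.4667, 0.4667, 0.4667, 0.4667, 0.4333, 0.4667, 0.4667]

stopper = EarlyStopper(patience=5)
stopped_at = None
for epoch, acc in enumerate(accuracies, start=1):
    # In a real loop the state would be model.state_dict(); a dict stands in here.
    if stopper.step(acc, {"epoch": epoch}):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_state)  # 8 {'epoch': 2}
```

In a real training loop you would pass `model.state_dict()` as the state, and call `model.load_state_dict(stopper.best_state)` after the loop to restore the best weights.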


In [75]:
# Test phase
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for inputs, labels in test_loader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    test_accuracy = correct / total

print(f'Epoch {epoch+1}, Loss: {loss.item()}, Test Accuracy: {test_accuracy}')

Epoch 8, Loss: 0.003516323631629348, Test Accuracy: 0.5945945945945946


To improve how well this model trains, consider the following approaches:

- Increase Model Capacity: Add more layers or more units per layer. A larger model may capture patterns in the data that the current one misses, although with a validation set of only 30 rows it may also overfit more easily.

- Adjust Learning Rate: The learning rate strongly affects training. Try values above and below the current 0.005 and compare the validation accuracy.

- Try Different Optimizers: Besides Adam, try SGD or RMSprop (both are commented out in the code above). Different optimizers can converge at different speeds and reach different solutions.

- Increase Training Data: If feasible, use more training data. Note that `df.dropna()` discards every row with any missing value, so filling in missing values instead would keep more rows. More data often helps the model generalize to unseen data.

- Incorporate Dropout Layers: Add dropout layers to reduce the risk of overfitting. Dropout randomly zeroes some neuron outputs during training, which discourages the network from relying too heavily on any single unit. It is switched off in evaluation mode.

- Adjust Batch Size and Number of Training Epochs: Experiment with different batch sizes and numbers of training epochs. Smaller batches give more parameter updates per epoch, which may speed up convergence, but each update uses a noisier gradient estimate, which can make training less stable.
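
If you do add dropout, it helps to see why `model.train()` and `model.eval()` matter. This small sketch applies a standalone dropout layer to a tensor of ones (made up for illustration) in both modes.

```python
import torch

torch.manual_seed(0)
dropout = torch.nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()          # training mode: elements are randomly zeroed
train_out = dropout(x)   # surviving elements are scaled by 1 / (1 - p) = 2

dropout.eval()           # evaluation mode: dropout does nothing
eval_out = dropout(x)

print(train_out.unique())        # only zeros and twos
print(torch.equal(eval_out, x))  # True
```

In the model above, this would mean uncommenting the `dropout1` lines in both `__init__` and `forward`.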