In [2]:
from IPython.display import display, Markdown

with open('README.md', 'r') as file:
    readme_content = file.read()

display(Markdown(readme_content))

### Client Engineering: LLMs Test Exercises

#### Exercise 1 - Text Cleaning

Process the provided text samples as follows:
- Remove URLs from the text: "The link to latest football score. https://xyz.com/a/b"
- Remove alphanumeric words from the text: "Hello Maria whatsup123"
- Remove words starting with the '#' character: "Mado is very good with last ball six #dhoni #six"
- Split alphanumeric words into digits and text: "I will be buying movie tickets for 4adults"

#### Exercise 2 - Summarization 
Use the text.csv file from the /data folder and summarize 3 of the stories using a model/technique of your choice 

#### Exercise 3 - Classification
Use the provided PyTorch model from the /model folder and classify the text from all 3 stories chosen above (one by one).

#### Exercise 4 - Performance
Compare the summarized outputs of the article from /data and calculate the precision (BLEU score) of the candidate summary (summary-2-flan-ul2--article1) against the reference summary (summary-1-flan-ul2--article1).

**NOTE**:
- You can provide your input within a Jupyter notebook containing the cells' output. Feel free to be as creative as you want in the provided time.
- Please don't fork this repo.


--------------
## Exercise 1
Remove URLs from the text: "The link to latest football score. https://xyz.com/a/b"

In [2]:
text = "The link to latest football score. https://xyz.com/a/b"

output1 = text[:-20]
print(f'Output #1: {output1}')

url_start = text.find('https://')
output2 = text[:url_start]
print(f'Output #2: {output2}')

import re

url_regex = r'https?://\S+\.\S+'
output3 = re.sub(url_regex, '', text)
print(f'output #3: {output3}')

Output #1: The link to latest football score.
Output #2: The link to latest football score. 
output #3: The link to latest football score. 


Remove alphanumeric words from the text: "Hello Maria whatsup123"


In [16]:
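import re

text = "Hello Maria whatsup123"

# Match any whole word that contains at least one digit
regex = r'\b\w*\d\w*\b'
# Strip the trailing space left behind by the removed word
output = re.sub(regex, '', text).strip()
print(output)

Hello Maria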


Remove words starting with '#' character: "Mado is very good with last ball six #dhoni #six"

In [15]:
import re

text = "Mado is very good with last ball six #dhoni #six"

# Remove each hashtag together with the whitespace before it
regex = r'\s*#\w+'
output = re.sub(regex, '', text).strip()
print(output)

Mado is very good with last ball six


Split alphanumeric words into digits and text: "I will be buying movie tickets for 4adults"


In [17]:
import re

text = "I will be buying movie tickets for 4adults"

# Insert a space wherever a run of digits is directly followed by letters
regex = r'(\d+)([A-Za-z]+)'
output = re.sub(regex, r'\1 \2', text)
print(output)

I will be buying movie tickets for 4 adults
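
The four cleaning steps above can be collected into small reusable helpers. The following is a minimal sketch, not the exercise's reference solution: the function names and the `_squeeze` helper are hypothetical, and it assumes an "alphanumeric word" means a word mixing letters and digits, so purely numeric words are kept.

```python
import re

URL_RE = re.compile(r'https?://\S+')
HASHTAG_RE = re.compile(r'#\w+')
# A word that contains at least one digit AND at least one letter (assumption)
ALNUM_WORD_RE = re.compile(r'\b(?=\w*\d)(?=\w*[A-Za-z])\w+\b')
# Zero-width boundary between a digit and a letter, in either order
DIGIT_LETTER_RE = re.compile(r'(?<=\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\d)')


def _squeeze(text):
    """Collapse whitespace runs left behind by a removal."""
    return re.sub(r'\s+', ' ', text).strip()


def remove_urls(text):
    return _squeeze(URL_RE.sub('', text))


def remove_alphanumeric_words(text):
    return _squeeze(ALNUM_WORD_RE.sub('', text))


def remove_hashtags(text):
    return _squeeze(HASHTAG_RE.sub('', text))


def split_alphanumeric_words(text):
    return _squeeze(DIGIT_LETTER_RE.sub(' ', text))
```

Each helper normalizes whitespace after substitution, so removing a token from the middle of a sentence does not leave a double space or glue two words together.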


## Exercise 2 - Summarization
Use the text.csv file from the /data folder and summarize 3 of the stories using a model/technique of your choice

In [18]:
import pandas as pd
df = pd.read_csv('data/text.csv')
df.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [19]:
article_labels = df.labels.unique()
df.labels.value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: labels, dtype: int64

In [20]:
import random 
random.seed(13)
nums = random.sample(range(1, df.shape[0]), 3)

articles = df.loc[nums].text.to_list()
article_labels = df.loc[nums].labels.to_list()
articles

['Blunkett tells of love and pain\n\nDavid Blunkett has spoken of his love for married publisher Kimberly Quinn for the first time.\n\nThe home secretary described how it affected his friends and personal life, but said he was a great believer in personal responsibility. Mr Blunkett is taking legal action to gain access to Mrs Quinn\'s two-year-old son. She denies he is Mr Blunkett\'s. The interview with BBC Radio Sheffield was made before allegations he fast-tracked a visa for Mrs Quinn\'s nanny. The allegations, which he has denied, are being investigated by Sir Alan Budd. Mr Blunkett talked about how he fell in love - but that she resisted his desire to go public.\n\nIn an apparent reference to his court action to gain access to her son, he says he was a great believer in responsibility and consequences, even when they were painful. Mr Blunkett told BBC Radio Sheffield: "I fell in love with someone and they wouldn\'t go public and things started to go very badly wrong in the summer,

### HuggingFace

In [137]:
from transformers import pipeline
summarizer = pipeline('summarization')

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [141]:
summaries = []
for article in articles:
    summaries.append(summarizer(article, max_length=150, min_length=20, do_sample=False))

In [140]:
summaries

[[{'summary_text': " Blunkett tells of love and pain but says he is a great believer in responsibility and consequences . Interview made before allegations he fast-tracked a visa for Quinn's nanny ."}],
 [{'summary_text': ' Lord Archer spent two years in prison after being convicted of perjury and perverting the course of justice . Former Tory deputy chairman\'s five-year suspension from the party has just elapsed . Dr Liam Fox said there was no place for "vindictiveness" in politics .'}],
 [{'summary_text': " Motley Crue guitarist is being sued by his ex-girlfriend for $10 million (£5.4 million) Robin Mantooth claims he broke a promise to take care of her . She is asking a Los Angeles court to award her half the musician's property and damages ."}]]

### NLTK

In [24]:
import nltk
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize 
from heapq import nlargest

In [25]:
%%capture
nltk.download("stopwords")
nltk.download('punkt')
stop_words = stopwords.words('english')
punctuation = punctuation + '\n\t\r'

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ZZ082W668\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ZZ082W668\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [26]:
from collections import Counter

def compute_word_frequencies(word_list):
    word_freq = Counter(word_list)
    return word_freq

def weight_frequencies(word_frequencies):
    max_frequency = max(word_frequencies.values())
    
    for word in word_frequencies.keys():
        word_frequencies[word] = word_frequencies[word]/max_frequency
    return word_frequencies

def sentence_score(sent_token, word_frequencies):
    # Lowercase the vocabulary so lookups are case-insensitive
    word_frequencies_lower = {word.lower(): freq for word, freq in word_frequencies.items()}

    sentence_scores = {}
    for sent in sent_token:
        # Each distinct word in the sentence counts once
        sentence_words_lower = set(sent.lower().split(" "))

        # Sum the normalized frequencies of the sentence's words
        score = sum(word_frequencies_lower.get(word, 0) for word in sentence_words_lower)

        sentence_scores[sent] = score

    return sentence_scores



def get_summary(sent_token, sentence_scores):
    # Keep the top 30% of sentences, ranked by score
    select_length = int(len(sent_token) * 0.3)
    top_sentences = nlargest(select_length, sentence_scores, key=sentence_scores.get)
    return ' '.join(top_sentences)

In [28]:
for article in articles:
    # nltk func
    tokens = word_tokenize(article)
    # compute frequencies   
    word_frequencies = compute_word_frequencies(tokens)
    # normalize frequencies
    word_frequencies = weight_frequencies(word_frequencies)
    # nltk func
    sent_token = sent_tokenize(article)
    # compute words' score in sentences
    sentence_scores = sentence_score(sent_token, word_frequencies)

    summary = get_summary(sent_token, sentence_scores)

    print(summary)
    print(f'Article len: [{len(article)}]')
    print(f'Summary len: [{len(summary)}]')
    print('---'*10)

Mr Blunkett told BBC Radio Sheffield: "I fell in love with someone and they wouldn't go public and things started to go very badly wrong in the summer, and then the News of the World picked up the story. BBC political correspondent Carole Walker said the timing of the broadcast was unlikely to help his efforts to show that he is concentrating on getting on with the job of home secretary. "I work with him every day and I have always been surprised by how focused he is on the job in hand, on working to deal with things," she said. Shadow home secretary David Davis says Mr Blunkett should quit if he is found to have influenced the visa process even indirectly. Mr Blunkett talked about how he fell in love - but that she resisted his desire to go public.
Article len: [2157]
Summary len: [759]
------------------------------
He has not been seen in the House of Lords since his release from prison in July 2003, although there is nothing in the rules to prevent him from attending. A jury ruled 
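
The frequency-based summarizer above loads `stop_words` and `punctuation` but never applies them, and it returns sentences in score order rather than reading order. The sketch below shows one possible variant that filters stop words and restores document order. It is an illustration under stated assumptions: the `summarize` function and the tiny `STOP_WORDS` set are hypothetical, and the regex-based sentence and word splitting stands in for NLTK's `sent_tokenize` and `word_tokenize` so the block runs without downloads.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word list (an assumption); NLTK's
# stopwords.words('english') would be the fuller choice.
STOP_WORDS = {'a', 'an', 'and', 'the', 'is', 'in', 'of', 'on', 'to', 'it', 'was', 'for'}


def summarize(text, ratio=0.3):
    """Return the top-scoring sentences, kept in their original order."""
    # Naive sentence split: ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    words = [w for w in re.findall(r'[a-z0-9]+', text.lower()) if w not in STOP_WORDS]
    if not sentences or not words:
        return ''

    freq = Counter(words)
    top = max(freq.values())

    def score(sentence):
        # Each distinct non-stop word contributes its normalized frequency
        tokens = set(re.findall(r'[a-z0-9]+', sentence.lower()))
        return sum(freq[t] / top for t in tokens if t in freq)

    keep = max(1, int(len(sentences) * ratio))
    best = nlargest(keep, range(len(sentences)), key=lambda i: score(sentences[i]))
    # Sort the selected indices so the summary follows the article's order
    return ' '.join(sentences[i] for i in sorted(best))
```

Selecting by sentence index instead of by sentence text also avoids collapsing duplicate sentences into a single dictionary key.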

## Exercise 3 - Classification
Use the provided PyTorch model from the /model folder and classify the text from all 3 stories chosen above (one by one).

In [30]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = './model'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

OSError: You seem to have cloned a repository without having git-lfs installed. Please install git-lfs and run `git lfs install` followed by `git lfs pull` in the folder you cloned.

https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-storage-and-bandwidth-usage

![Sample Image](git_error.png)
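
The error above suggests that the files in `./model` are Git LFS pointers rather than the real weights. A pointer is a small text file whose first line is `version https://git-lfs.github.com/spec/v1`, so this can be checked before loading. The sketch below is one way to do that; the helper names `is_lfs_pointer` and `find_lfs_pointers` are hypothetical.

```python
from pathlib import Path

# First bytes of every Git LFS pointer file
LFS_HEADER = b'version https://git-lfs.github.com/spec/v1'


def is_lfs_pointer(path):
    """Return True if the file looks like an un-pulled Git LFS pointer."""
    with open(path, 'rb') as fh:
        return fh.read(len(LFS_HEADER)) == LFS_HEADER


def find_lfs_pointers(folder):
    """List files under `folder` that are still LFS pointers."""
    return sorted(
        str(p) for p in Path(folder).rglob('*')
        if p.is_file() and is_lfs_pointer(p)
    )
```

Calling `find_lfs_pointers('./model')` would return an empty list once `git lfs pull` has replaced the pointers with the actual files.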


In [131]:
from transformers import pipeline

classifier = pipeline('zero-shot-classification')
# classifier = pipeline('zero-shot-classification', model=model, tokenizer=tokenizer)

array(['business', 'entertainment', 'politics', 'sport', 'tech'],
      dtype=object)

In [143]:
results = []
for article in articles:
    # Offer every category as a candidate, not only the true labels of the sampled articles
    results.append(classifier(article, candidate_labels=list(df.labels.unique())))

In [154]:
for article, article_label in zip(results, article_labels):
    pred_label = article['labels'][0]    
    print(f'Pred label: [{pred_label}] | True label: [{article_label}]')

Pred label: [politics] | True label: [politics]
Pred label: [politics] | True label: [politics]
Pred label: [business] | True label: [entertainment]


In [152]:
results

[{'sequence': 'Blunkett tells of love and pain\n\nDavid Blunkett has spoken of his love for married publisher Kimberly Quinn for the first time.\n\nThe home secretary described how it affected his friends and personal life, but said he was a great believer in personal responsibility. Mr Blunkett is taking legal action to gain access to Mrs Quinn\'s two-year-old son. She denies he is Mr Blunkett\'s. The interview with BBC Radio Sheffield was made before allegations he fast-tracked a visa for Mrs Quinn\'s nanny. The allegations, which he has denied, are being investigated by Sir Alan Budd. Mr Blunkett talked about how he fell in love - but that she resisted his desire to go public.\n\nIn an apparent reference to his court action to gain access to her son, he says he was a great believer in responsibility and consequences, even when they were painful. Mr Blunkett told BBC Radio Sheffield: "I fell in love with someone and they wouldn\'t go public and things started to go very badly wrong i

## Exercise 4 - Performance
Compare the summarized outputs of the article from /data and calculate the precision (BLEU score) of the candidate summary (summary-2-flan-ul2--article1) against the reference summary (summary-1-flan-ul2--article1).

In [20]:
with open('data/summary-1-flan-ul2--article1.txt', 'r') as file:
    reference_summary = file.read()

with open('data/summary-2-flan-ul2--article1.txt', 'r') as file:
    candidate_summary = file.read()

print(reference_summary)
print(candidate_summary)

People are using AI chatbots to fill junk websites with AI-generated text that attracts paying advertisers, according to a new report from the media research organization NewsGuard that was shared exclusively with MIT Technology Review. Over 140 major brands are paying for ads that end up on unreliable AI-written sites, likely without their knowledge. Ninety percent of the ads from major brands found on these AI-generated news sites were served by Google, though the company’s own policies prohibit sites from placing Google-served ads on pages that include “spammy automatically generated content.” The practice threatens to hasten the arrival of a glitchy, spammy internet that is overrun by AI-generated content, as well as wasting massive amounts of ad money.

A new report finds that sites run with AI-generated content serve ads from major brands, which mostly come from Google. Some of those sites contained dangerous misinformation. And this is just getting started. More could be on the 

In [21]:
from nltk.translate.bleu_score import sentence_bleu

bleu = sentence_bleu(reference_summary, candidate_summary)
print(f'BLEU score {bleu}')


BLEU score 1.0025117266892697e-231


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


In [23]:
from nltk.translate.bleu_score import SmoothingFunction

smoothie = SmoothingFunction().method4

bleu = sentence_bleu(reference_summary, candidate_summary, smoothing_function=smoothie)
print(f'BLEU score with smoothing: {bleu}')


BLEU score with smoothing: 0.0027240282824459398
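
The near-zero scores above are consistent with passing raw strings to `sentence_bleu`. NLTK expects a list of tokenized references and a tokenized hypothesis, for example `sentence_bleu([reference_summary.split()], candidate_summary.split())`. With plain strings, the inputs are iterated character by character, so almost no higher-order n-grams overlap. To show what BLEU measures at the word level, here is a minimal pure-Python sketch of its clipped (modified) n-gram precision. The helper names and toy sentences are illustrative assumptions, and the sketch omits the brevity penalty and the geometric mean over n-gram orders that full BLEU applies.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference_tokens, candidate_tokens, n=1):
    """Clipped n-gram precision of the candidate against one reference."""
    cand = Counter(ngrams(candidate_tokens, n))
    ref = Counter(ngrams(reference_tokens, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # A candidate n-gram is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


# Toy sentences (not the summaries from /data)
reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

p1 = modified_precision(reference, candidate, n=1)  # 5 of 6 unigrams match
p2 = modified_precision(reference, candidate, n=2)  # 3 of 5 bigrams match
print(f'unigram precision: {p1:.3f}, bigram precision: {p2:.3f}')
```

Tokenizing both summaries the same way (for instance with `word_tokenize`, which is already imported earlier in the notebook) before scoring keeps the comparison at the word level.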
