<a href="https://colab.research.google.com/github/VictorPelaez/Courses/blob/master/news.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook to experiment and build an example

Tutorial steps:
- Clone the [genai_gazzete](https://github.com/VictorPelaez/genai_gazzete) GitHub repository
- Install the Python dependencies
- **Get an API key** from [newsapi.org](https://newsapi.org/docs/client-libraries/python) and set it in the `config.ini` file
- Run the summarization model to get a summary of every article
- Run the document generation step with images to produce a .docx newsletter
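
The API key step can be sketched as follows. The notebook later reads the key through `utils.readConfig()` and `config.items('DEFAULT')`; this sketch assumes that helper wraps Python's standard `configparser` and that `config.ini` holds an `api_key` entry in its `DEFAULT` section. The function name `read_api_key`, the temporary file location, and the placeholder key are illustrative only.

```python
import configparser
import os
import tempfile

# Hypothetical config.ini content; replace the placeholder with a real key.
SAMPLE_INI = "[DEFAULT]\napi_key = YOUR_NEWSAPI_KEY\n"

def read_api_key(path):
    """Return the api_key stored in the DEFAULT section of an INI file."""
    config = configparser.ConfigParser()
    if not config.read(path):  # read() returns the list of files it parsed
        raise FileNotFoundError(f"config file not found: {path}")
    return dict(config.items("DEFAULT"))["api_key"]

# Write the sample to a temporary directory so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    ini_path = os.path.join(tmp, "config.ini")
    with open(ini_path, "w") as fh:
        fh.write(SAMPLE_INI)
    api_key = read_api_key(ini_path)

print(api_key)
```

Keeping the key in a file that is excluded from version control avoids pasting it into notebook cells.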

In [1]:
!git clone https://github.com/VictorPelaez/genai_gazzete.git

Cloning into 'genai_gazzete'...
remote: Enumerating objects: 28, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 28 (delta 11), reused 14 (delta 4), pack-reused 0
Receiving objects: 100% (28/28), 31.45 KiB | 5.24 MiB/s, done.
Resolving deltas: 100% (11/11), done.


In [2]:
!pip install --upgrade newsapi-python transformers python-docx diffusers scipy

Collecting newsapi-python
  Downloading newsapi_python-0.2.7-py2.py3-none-any.whl (7.9 kB)
Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting python-docx
  Downloading python-docx-0.8.11.tar.gz (5.6 MB)
  Preparing metadata (setup.py) ... done
Collecting diffusers
  Downloading diffusers-0.19.3-py3-none-any.whl (1.3 MB)
Collecting scipy
  Downloading scipy-1.11.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (36.3 MB)
Collecting huggingf

In [3]:
from genai_gazzete import utils
from newsapi import NewsApiClient

# -----------------------------------------------------
# Fetch news published since fdate
# -----------------------------------------------------

fdate = '2023-07-31'

config = utils.readConfig()
d_config = dict(config.items('DEFAULT'))

newsapi = NewsApiClient(api_key=d_config['api_key'])
all_articles = newsapi.get_everything(q='generative ai llms',
                                      language='en',
                                      from_param=fdate,
                                      sort_by='relevancy')

print("#articles: ", len(all_articles["articles"]))

#articles:  100
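
The query above hard-codes `fdate`. A minimal sketch for deriving the start date relative to the current day instead, assuming an ISO `YYYY-MM-DD` string is what `from_param` expects (as in the cell above). The helper name `start_date` and the window length are illustrative, and the NewsAPI plan in use may limit how far back a query can reach.

```python
from datetime import date, timedelta

def start_date(days_back, today=None):
    """Return the ISO date (YYYY-MM-DD) that lies days_back days before today."""
    if days_back < 0:
        raise ValueError("days_back must be non-negative")
    today = today or date.today()
    return (today - timedelta(days=days_back)).isoformat()

# Fixed reference day so the result is reproducible.
fdate_example = start_date(4, today=date(2023, 8, 4))
print(fdate_example)  # 2023-07-31
```

Calling `start_date(7)` without the `today` argument would give a rolling one-week window each time the notebook runs.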


## Summarization Model

In [4]:
import pandas as pd
import time
from bs4 import BeautifulSoup
import requests
import re
from transformers import pipeline

import warnings
warnings.filterwarnings('ignore')

# -----------------------------------------------------
# Run Summarization model for all the articles
# -----------------------------------------------------

model_name = "sshleifer/distilbart-cnn-12-6" # other models: "sshleifer/bart-large-cnn" "sshleifer/distilbart-xsum-12-1" "google/flan-t5-base"
summarizer = pipeline('summarization', model=model_name, device=0) # device=0: first GPU (T4 in Colab)

dash_line = '-' * 99 # separator line for the printed output

## Data to feed
N = int(round(len(all_articles["articles"])*0.1, 0)) # number of summaries to print (first 10% of the articles)
L = 3000 # soft cap on the characters of article text fed to the model
df = pd.DataFrame(columns = ["source", "url", "title", "description", "len_text", "summary"])

for i, a in enumerate(all_articles["articles"]):

  start = time.time()
  print(dash_line)
  print('Example ', i + 1)
  print(dash_line)

  # read the article's HTML page
  page = requests.get(a["url"])
  soup = BeautifulSoup(page.content, 'html.parser')
  result = soup.find_all(["p","i"]) # Slashdot puts its text in <i>
  ARTICLE = ""
  for part in result:
    # keep only tags without a class attribute, until roughly L characters are collected
    if (part.get("class") is None) and len(ARTICLE)<L:
      ARTICLE = ARTICLE + " " + part.get_text()

  ARTICLE = utils.clear_article(ARTICLE)

  if len(ARTICLE)>0:
    summarized_article = summarizer(ARTICLE)[0]["summary_text"] # run the summarization model
    summarized_article = re.sub(r'\s([?.!"](?:\s|$))', r'\1', summarized_article) # remove whitespace before punctuation
  else:
    summarized_article = "empty"

  # DataFrame.append was removed in pandas 2.0; add the row in place (values in column order)
  df.loc[len(df)] = [a['source']['name'],
                     a["url"],
                     a["title"],
                     a["description"],
                     len(ARTICLE),
                     summarized_article]

  # print samples
  if (i<N) and (len(ARTICLE)>0) :
    print(a['source']['name'], a["url"])
    print('Description:  ', a['description'])
    print('LLM Summary')
    print(df.iloc[i]["summary"])

  end = time.time()
  print(end - start)
  print(dash_line)
  print()

df = utils.remove_summaries(df)
print('Final number of articles: ' + str(len(df.index)))

---------------------------------------------------------------------------------------------------
Example  1
---------------------------------------------------------------------------------------------------
ReadWrite https://readwrite.com/google-assistant-is-getting-a-major-upgrade/
Description:   In a monumental shift towards the future, Google has made an astonishing realization that has spurred the company to realign […]
The post Google Assistant Is Getting a Major Upgrade appeared first on ReadWrite.
LLM Summary
 Recent reports indicate that google assistant one of the most widely used virtual assistants is undergoing a generative face lift harnessing the power of the latest large language model llm technology. This strategic move reflects google s ambition to explore the immense potential of a supercharged assistant revolutionizing the way users interact with this groundbreaking technology.
2.1187071800231934
--------------------------------------------------------------------

Your max_length is set to 142, but your input_length is only 11. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)


1.7781243324279785
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  29
---------------------------------------------------------------------------------------------------
0.661388635635376
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  30
---------------------------------------------------------------------------------------------------
4.603588104248047
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  31
---------------------------------------------------------------------------------------------------
2.8502

Your max_length is set to 142, but your input_length is only 137. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=68)


2.006213426589966
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  53
---------------------------------------------------------------------------------------------------
0.9050388336181641
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  54
---------------------------------------------------------------------------------------------------
1.331233024597168
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  55
---------------------------------------------------------------------------------------------------
1.1012

Your max_length is set to 142, but your input_length is only 11. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)


1.388181447982788
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  68
---------------------------------------------------------------------------------------------------


Your max_length is set to 142, but your input_length is only 4. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=2)


1.5695586204528809
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  69
---------------------------------------------------------------------------------------------------
0.46226000785827637
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  70
---------------------------------------------------------------------------------------------------
0.29074764251708984
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  71
---------------------------------------------------------------------------------------------------
0.

Your max_length is set to 142, but your input_length is only 4. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=2)


1.431950569152832
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  73
---------------------------------------------------------------------------------------------------


Your max_length is set to 142, but your input_length is only 4. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=2)


0.9586844444274902
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  74
---------------------------------------------------------------------------------------------------
0.10875773429870605
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  75
---------------------------------------------------------------------------------------------------
1.55092191696167
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  76
---------------------------------------------------------------------------------------------------
2.611

Your max_length is set to 142, but your input_length is only 4. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=2)


1.425638198852539
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  83
---------------------------------------------------------------------------------------------------
0.09788751602172852
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  84
---------------------------------------------------------------------------------------------------


Your max_length is set to 142, but your input_length is only 4. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=2)


1.680586814880371
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  85
---------------------------------------------------------------------------------------------------


Your max_length is set to 142, but your input_length is only 4. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=2)


1.852083683013916
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  86
---------------------------------------------------------------------------------------------------
0.15578532218933105
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  87
---------------------------------------------------------------------------------------------------
0.15661096572875977
---------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------
Example  88
---------------------------------------------------------------------------------------------------
1.7
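
The log above repeats a warning that `max_length` (142) is larger than the input. One way to quiet it is to derive the length limits from the input size before calling the summarizer; `max_length` is the argument the warning itself suggests passing, and `min_length` is the matching lower bound. This is only a sketch: the helper names, the 0.5 ratio, the `min_cap` value, and the whitespace-based token estimate are assumptions, and the tokenizer's real count will differ.

```python
def summary_lengths(n_tokens, max_cap=142, ratio=0.5, floor=2, min_cap=56):
    """Pick (min_length, max_length) so the summary stays shorter than the input.

    n_tokens is an estimate of the input length in tokens; max_cap mirrors
    the 142-token limit reported in the warning.
    """
    max_length = max(floor, min(max_cap, int(n_tokens * ratio)))
    min_length = max(1, min(max_length // 2, min_cap))
    return min_length, max_length

def rough_token_count(text):
    """Whitespace word count, a crude stand-in for the tokenizer's count."""
    return len(text.split())

short_text = "Generative AI models keep getting larger and more capable every year"
lengths = summary_lengths(rough_token_count(short_text))
print(lengths)  # (2, 5)

# In the notebook loop this could be used as (not executed here):
# min_len, max_len = summary_lengths(rough_token_count(ARTICLE))
# summarizer(ARTICLE, min_length=min_len, max_length=max_len)
```

Inputs of only a few tokens, like the 4-token cases in the log, are probably failed page extractions and may be better skipped than summarized.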

In [5]:
df.head(3)

| | source | url | title | description | len_text | summary |
|---|---|---|---|---|---|---|
| 0 | ReadWrite | https://readwrite.com/google-assistant-is-gett... | Google Assistant Is Getting a Major Upgrade | In a monumental shift towards the future, Goog... | 3079 | Recent reports indicate that google assistant... |
| 1 | Slashdot.org | https://tech.slashdot.org/story/23/08/01/00282... | Google's Jigsaw Was Fighting Toxic Speech With... | tedlistens writes: All large language models a... | 2948 | perspective API is a free tool from google s ... |
| 2 | VentureBeat | https://venturebeat.com/games/inworld-ai-raise... | Inworld AI raises new round at $500M valuation... | Inworld AI has raised funding from Lightspeed,... | 3289 | inworld ai has raised 50 million funding from... |
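
The `re.sub` call in the summarization loop removes whitespace that appears before punctuation in the generated text. A small standalone check of that same pattern is sketched below; the function name `tidy_summary` and the sample sentence are made up for illustration.

```python
import re

# Same pattern as in the summarization loop: a whitespace character followed by
# one of ? . ! " which is itself followed by whitespace or the end of the string.
SPACE_BEFORE_PUNCT = re.compile(r'\s([?.!"](?:\s|$))')

def tidy_summary(text):
    """Drop the whitespace that precedes closing punctuation."""
    return SPACE_BEFORE_PUNCT.sub(r'\1', text)

sample = "google is upgrading its assistant . Will users notice ?"
cleaned = tidy_summary(sample)
print(cleaned)  # google is upgrading its assistant. Will users notice?
```

Text that is already well formed passes through unchanged, so the cleanup is safe to apply to every summary.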


## Document Generation with images

In [6]:
import os
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive
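
The later cells save images under `/content/drive/MyDrive/LLMs/images/` and the final document under `/content/drive/MyDrive/LLMs/`; saving raises an error if those folders do not exist. A minimal sketch that creates them up front follows. It uses a temporary directory as a stand-in for the Drive mount so it runs anywhere, and the helper name `ensure_output_dirs` is hypothetical.

```python
import os
import tempfile

def ensure_output_dirs(base):
    """Create <base>/LLMs/images (and parents) if missing; return both paths."""
    llm_dir = os.path.join(base, "LLMs")
    images_dir = os.path.join(llm_dir, "images")
    os.makedirs(images_dir, exist_ok=True)  # no error if it already exists
    return llm_dir, images_dir

# Stand-in for '/content/drive/MyDrive' so the sketch runs outside Colab.
base = tempfile.mkdtemp()
llm_dir, images_dir = ensure_output_dirs(base)
print(os.path.isdir(llm_dir), os.path.isdir(images_dir))  # True True
```

In Colab the call would be `ensure_output_dirs('/content/drive/MyDrive')`, run once after mounting.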


In [7]:
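# load the Stable Diffusion model

import torch
from diffusers import StableDiffusionPipeline

model_id = "CompVis/stable-diffusion-v1-4"

device = "cuda" if torch.cuda.is_available() else "cpu"
# float16 halves GPU memory use; fall back to float32 on CPU, where half precision is poorly supported
dtype = torch.float16 if device == "cuda" else torch.float32
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=dtype).to(device)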

Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with: 
```
pip install accelerate
```
.


`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["bos_token_id"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["eos_token_id"]` will be overriden.


In [8]:
from genai_gazzete import functions_doc
from docx import Document
from docx.shared import Inches
from docx.enum.text import WD_ALIGN_PARAGRAPH
from datetime import date

# -----------------------------------------------------
# document creation
# -----------------------------------------------------

today = date.today()
document = Document()

document.add_picture('/content/drive/MyDrive/LLMs/20230804_130845_0000.png', width=Inches(6), height=Inches(1.2))
last_paragraph = document.paragraphs[-1]
last_paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER
p = document.add_paragraph()
run = p.add_run(fdate + " - " + str(today)) # date range covered by the newsletter
run.italic = True
last_paragraph = document.paragraphs[-1]
last_paragraph.alignment = WD_ALIGN_PARAGRAPH.RIGHT

for r in df.index:
  # 1. add title
  document.add_heading('[' + df["source"][r] + '] ' +df["title"][r], level=1)
  # 2. add link
  p = document.add_paragraph()
  functions_doc.add_hyperlink(p, 'Original article', df["url"][r])

  # 3. create image and add it
  prompt_style = "sci-fi painting by Ian McQue:1 sci-fi painting by Simon Stalenhag:0.5, pen and ink, pastel colors"
  # prompt_style= "sci-fi painting style, pen and ink, primary pastel colors"
  prompt = df["title"][r] + ", " + prompt_style # separate the title from the style keywords
  num_images_per_prompt = 2
  images = pipe(prompt, num_images_per_prompt=num_images_per_prompt).images

  for idx, im in enumerate(images):
    im.save("/content/drive/MyDrive/LLMs/images/image_"+str(r)+"_"+str(idx)+".png")
  document.add_picture("/content/drive/MyDrive/LLMs/images/image_"+str(r)+"_0.png", width=Inches(2), height=Inches(2))
  last_paragraph = document.paragraphs[-1]
  last_paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER

  # 4. add summary
  document.add_paragraph(df["summary"][r], style='Intense Quote')
  p = document.add_paragraph()
  p.paragraph_format.line_spacing = Inches(0.3)

file_name = "summarized_articles" + "_" + today.strftime('%m_%d_%Y') + ".docx"
document.save('/content/drive/MyDrive/LLMs/'+file_name)

  0%|          | 0/50 [00:00<?, ?it/s]
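
The per-article prompt, the image paths, and the output file name in the document cell are all built by string concatenation. A sketch that isolates them as small functions makes each format easy to check. The function names are illustrative and not part of `genai_gazzete`, the style string is shortened, and joining title and style with a comma is an assumption of this sketch.

```python
from datetime import date

PROMPT_STYLE = "sci-fi painting, pen and ink, pastel colors"  # shortened style string

def build_prompt(title, style=PROMPT_STYLE):
    """Join an article title and a style description into one image prompt."""
    return f"{title.strip()}, {style}"

def image_path(folder, row, idx):
    """Path of the idx-th image generated for DataFrame row `row`."""
    return f"{folder}/image_{row}_{idx}.png"

def newsletter_filename(day):
    """File name in the notebook's format: summarized_articles_MM_DD_YYYY.docx."""
    return "summarized_articles_" + day.strftime("%m_%d_%Y") + ".docx"

print(build_prompt("Google Assistant Is Getting a Major Upgrade"))
print(image_path("/content/drive/MyDrive/LLMs/images", 0, 1))
print(newsletter_filename(date(2023, 8, 4)))  # summarized_articles_08_04_2023.docx
```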