<a href="https://colab.research.google.com/github/fedy-culer/AI-Blog-Post-Summarization-Project/blob/main/AI_Blog_Post_Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI Blog Post Summarization with Hugging Face Transformers & Beautiful Soup Web Scraping

**By Fedy Ben Hassouna**

## 1- Install Hugging Face Transformers and Dependencies

In [2]:
from transformers import pipeline
from bs4 import BeautifulSoup
import requests
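
The cell above imports the libraries but does not install them. As a minimal sketch, the check below reports which modules are not importable in the current environment. The helper name `missing_packages` is hypothetical, and the install command in the message assumes the usual PyPI names (`transformers`, `beautifulsoup4`, `requests`).

```python
from importlib.util import find_spec

def missing_packages(module_names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in module_names if find_spec(name) is None]

# Modules imported by this notebook (bs4 is provided by the beautifulsoup4 package).
required = ["transformers", "bs4", "requests"]
missing = missing_packages(required)
if missing:
    print("Missing modules:", ", ".join(missing))
    print("In Colab, try: !pip install transformers beautifulsoup4 requests")
else:
    print("All dependencies are available.")
```

Colab usually ships with these packages preinstalled, so the check mostly matters when running the notebook locally.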

## 2- Load Summarization Pipeline


In [3]:
model = "sshleifer/distilbart-cnn-12-6"
Summarizer = pipeline(task="summarization", model=model)




## 3- Get a Blog Post from a Website: "HackerNoon"

In [4]:
URL = "https://hackernoon.com/school-is-dead-embrace-modern-education-instead"

In [5]:
r = requests.get(URL)


In [6]:
r

<Response [200]>
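
The request above succeeded, but a plain `requests.get` call has no timeout and does not raise on HTTP errors. One possible hardening is sketched below. The helper `fetch_html` and its `get` parameter are hypothetical and exist only so the function can be exercised without network access; by default it falls back to `requests.get`.

```python
def fetch_html(url, get=None, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    if get is None:
        import requests  # imported lazily so a stub can be passed in tests
        get = requests.get
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises an exception for 4xx/5xx responses
    return response.text

# Usage in this notebook (performs a real network request):
# html = fetch_html(URL)
# soup = BeautifulSoup(html, 'html.parser')
```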

In [7]:
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all(['h1', 'p'])

In [8]:
results

[<h1 class="story-title" style="text-align:left">School is Dead. Embrace Modern Education Instead.</h1>,
 <p class="line-space"> <br/> </p>,
 <p>N.B. <em>In my previous letter,</em> <em><a href="https://hackernoon.com/how-to-learn-anything-fast?ref=hackernoon.com" rel="noopener noreferrer ugc" target="_blank">Learn Anything Fast</a>, I shared insights on how to quickly acquire new skills and knowledge.</em></p>,
 <p class="line-space"> <br/> </p>,
 <p><em>Today, I want to take that a step further by introducing the concept of the "New School" and providing a practical framework to help you put these ideas into action.</em></p>,
 <p class="line-space"> <br/> </p>,
 <p><em>If you haven’t read the previous letter, I’d suggest you</em> <em><a href="https://hackernoon.com/how-to-learn-anything-fast?ref=hackernoon.com" rel="noopener noreferrer ugc" target="_blank">do that first</a>.</em></p>,
 <p class="line-space"> <br/> </p>,
 <p>Let’s get into it.</p>,
 <p class="line-space"> <br/> </p>,


In [9]:
text = [result.text for result in results]
ARTICLE = ' '.join(text)

In [10]:
ARTICLE

'School is Dead. Embrace Modern Education Instead.    N.B. In my previous letter, Learn Anything Fast, I shared insights on how to quickly acquire new skills and knowledge.    Today, I want to take that a step further by introducing the concept of the "New School" and providing a practical framework to help you put these ideas into action.    If you haven’t read the previous letter, I’d suggest you do that first.    Let’s get into it.    I always had a dislike for school—And school seemed to have a dislike for me.    As a matter of fact, I had to start over a class year not once, but twice. I often got into trouble. My parents were frequently called in to discuss my lack of interest, poor grades, and disturbances.    Was it because I had below-average intelligence?    I hope not.    It was because I lacked trust in the education system and because I was bored.    Philosophy was my favorite class until someone ruined it for me. We were supposed to recite philosophers’ thoughts verbatim.
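
The joined text contains runs of spaces, which come from the empty `line-space` paragraphs in the scraped HTML. If a cleaner input is preferred, a small sketch like the following collapses them; the helper name `normalize_whitespace` is an assumption and is not part of the original notebook.

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return " ".join(text.split())

sample = "School is Dead.    N.B. In my previous letter,   Learn Anything Fast."
print(normalize_whitespace(sample))
# School is Dead. N.B. In my previous letter, Learn Anything Fast.
```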

## 4- Chunk Text

In [11]:
ARTICLE = ARTICLE.replace('.', '.<eos>')
ARTICLE = ARTICLE.replace('?', '?<eos>')
ARTICLE = ARTICLE.replace('!', '!<eos>')

In [12]:
sentences = ARTICLE.split('<eos>')

In [13]:
sentences[12]

'    Was it because I had below-average intelligence?'
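
The `<eos>` replacement above splits on every period, including ones inside numbers such as `3.5`. A possible alternative, shown here only as a sketch, is a regular expression that splits after `.`, `?`, or `!` when whitespace follows. The function name `split_sentences` is hypothetical. Abbreviations such as `N.B.` are still treated as sentence ends, as in the original approach.

```python
import re

def split_sentences(text):
    """Split text after '.', '?' or '!' when followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [part for part in parts if part]

print(split_sentences("Version 3.5 is out. Was it late? No!"))
# ['Version 3.5 is out.', 'Was it late?', 'No!']
```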

In [31]:
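max_chunk = 500  # maximum number of words per chunk
current_chunk = 0  # index of the chunk currently being filled
chunks = []
for sentence in sentences:
    words = sentence.split(' ')
    if len(chunks) == current_chunk + 1:
        # Keep filling the current chunk while it stays within the limit.
        if len(chunks[current_chunk]) + len(words) <= max_chunk:
            chunks[current_chunk].extend(words)
        else:
            # Limit reached: start a new chunk with this sentence.
            current_chunk += 1
            chunks.append(words)
    else:
        # First sentence: create the first chunk.
        chunks.append(words)

# Join each chunk's word list back into a single string.
for chunk_id in range(len(chunks)):
    chunks[chunk_id] = ' '.join(chunks[chunk_id])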


In [32]:
len(chunks)


6

In [33]:
chunks[2]

'    Curate Resources: Pin Point reputable sources like YouTube channels, paid courses, and informative articles.     Use AI: to help you speed up the process of gathering and sorting through large amounts of data.     Schedule Learning Time: Dedicate specific times for learning to build a consistent routine.     Engage with the Community: Join forums, groups, and networks related to your field of interest to share insights and gain support.     Apply Knowledge: Implement what you learn through real-life projects and practical applications.     Now that you have an understanding of the approach, let me show you my exact process of learning.        Here’s a step-by-step guide, going beyond the basics to give you a comprehensive framework.      I’ll walk you through each step, and a timeline to keep it all under (an average of*) 20 hours—not including the mandatory 24 hour break.        Start with a broad search to gather initial information and understand the basics.     When I dive int
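
The chunking logic in this section can also be packaged as a reusable function. The version below is a sketch rather than the notebook's exact behavior: the name `chunk_sentences` and the parameter `max_words` are assumptions, and it uses `split()` instead of `split(' ')`, so the empty strings produced by repeated spaces are not counted as words.

```python
def chunk_sentences(sentences, max_words=500):
    """Group sentences into chunks of at most max_words words each."""
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue  # skip empty or whitespace-only sentences
        if current and count + len(words) > max_words:
            # The current chunk is full: store it and start a new one.
            chunks.append(" ".join(current))
            current, count = [], 0
        current.extend(words)
        count += len(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk_sentences(["a b c.", "d e.", "f g h i."], max_words=5))
# ['a b c. d e.', 'f g h i.']
```

A single sentence longer than `max_words` still becomes its own oversized chunk, so the limit is a target rather than a hard guarantee.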

## 5- Summarize Text

In [34]:
results = Summarizer(chunks, max_length=120, min_length=30, do_sample=False)


In [35]:
results[0]

{'summary_text': ' In his previous letter, Learn Anything Fast, I shared insights on how to quickly acquire new skills and knowledge . Today, I want to take that a step further by introducing the concept of the "New School" and providing a practical framework .'}
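
The summarization call can be wrapped so that empty chunks are skipped and over-long inputs are truncated instead of raising an error. This is a minimal sketch: `summarize_chunks` is a hypothetical helper, and passing `truncation=True` assumes the installed `transformers` version accepts that keyword in the summarization pipeline call.

```python
def summarize_chunks(summarizer, chunks, max_length=120, min_length=30):
    """Summarize non-empty chunks and return the summary strings."""
    texts = [chunk for chunk in chunks if chunk.strip()]
    if not texts:
        return []
    outputs = summarizer(
        texts,
        max_length=max_length,
        min_length=min_length,
        do_sample=False,
        truncation=True,  # assumption: cut inputs that exceed the model limit
    )
    return [output["summary_text"].strip() for output in outputs]

# Usage in this notebook:
# summaries = summarize_chunks(Summarizer, chunks)
```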

In [39]:
text = ' '.join([res['summary_text'] for res in results])
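
The model output shown earlier has a leading space and spaces before periods (for example `knowledge . Today`). If that matters for the saved file, a small cleanup step could look like the sketch below; the helper name `tidy_summary` is an assumption.

```python
import re

def tidy_summary(text):
    """Remove whitespace before punctuation and trim both ends."""
    return re.sub(r"\s+([.,!?])", r"\1", text).strip()

print(tidy_summary(" skills and knowledge . Today, I want more ."))
# skills and knowledge. Today, I want more.
```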


In [40]:
with open('blogsummary.txt', 'w', encoding='utf-8') as f:
    f.write(text)