

***Summarization*** is the process of condensing a document into a shorter form by keeping its most important points.

News summarization approaches generally fall into two groups: extractive and abstractive. Extractive summarization identifies the most important parts of the text and assembles the summary from those parts, reusing the original wording. Abstractive summarization, by contrast, restates the important material in a new way: it interprets and examines the text with NLP, then generates new sentences derived from the original document.
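
To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the frequency of its words and returns the best ones verbatim. The word-frequency scoring, the naive regex sentence splitter, and the name `extractive_summary` are assumptions chosen for illustration; this is not the TextRank method used in the rest of this notebook.

```python
import re
from collections import Counter

def extractive_summary(text, k=1):
    """Return the k highest-scoring sentences, in their original order."""
    # Naive sentence split after ., ! or ? (illustrative only).
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    # Word frequencies over the whole text.
    freq = Counter(re.findall(r'\w+', text.lower()))
    # Sentence score = sum of the frequencies of its words.
    scored = [(sum(freq[w] for w in re.findall(r'\w+', s.lower())), i)
              for i, s in enumerate(sentences)]
    # Keep the k best, then restore document order.
    keep = sorted(i for _, i in sorted(scored, reverse=True)[:k])
    return [sentences[i] for i in keep]

doc = "Cats sleep a lot. Cats chase mice and cats purr. Dogs bark."
summary = extractive_summary(doc, k=1)
print(summary)
```

Every sentence in the result is copied unchanged from the input, which is what distinguishes an extractive summary from an abstractive one.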




***TextRank*** is a graph-based ranking algorithm for text processing. Here, TextRank produces the summary by extracting sentences. One of its advantages is that the algorithm does not need to be trained on a training dataset.
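
The ranking step behind TextRank is a weighted PageRank over a graph whose nodes are sentences and whose edge weights are sentence similarities. The sketch below shows that idea with a plain power iteration. The function name `pagerank_sketch`, the damping factor of 0.85, the fixed iteration count, and the toy weights are assumptions for illustration; the notebook itself relies on the networkx implementation.

```python
def pagerank_sketch(weights, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration on a square weight matrix."""
    n = len(weights)
    # Total outgoing edge weight of each node.
    out = [sum(row) for row in weights]
    # Start from a uniform score distribution.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score,
            # proportional to the weight of the edge j -> i.
            incoming = sum(weights[j][i] / out[j] * scores[j]
                           for j in range(n) if out[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Toy similarity graph: sentence 1 is strongly similar to both others.
weights = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.8],
    [0.1, 0.8, 0.0],
]
scores = pagerank_sketch(weights)
print(scores)
```

The sentence that is most similar to the others ends up with the highest score, which is why highly ranked sentences are treated as representative of the text.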

**The first step is to import the required libraries**

In [None]:
import numpy as np
import pandas as pd
import nltk
import re
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from gensim.models import Word2Vec
from scipy import spatial
import networkx as nx

**The news text used**

In [None]:
text='''The Double Asteroid Redirection Test successfully changed the trajectory of the asteroid Dimorphos when the NASA spacecraft intentionally slammed into the space rock on September 26, according to the agency. The DART mission, a full-scale demonstration of deflection technology, was the world’s first conducted on behalf of planetary defense. The mission was also the first time humanity intentionally changed the motion of a celestial object in space. Prior to impact, it took Dimorphos 11 hours and 55 minutes to orbit its larger parent asteroid Didymos. Astronomers used ground-based telescopes to measure how Dimorphos’ orbit changed after impact. Now, it takes Dimorphos 11 hours and 23 minutes to circle Didymos. The DART spacecraft changed the moonlet asteroid’s orbit by 32 minutes. Initially, astronomers expected DART to be a success if it shortened the trajectory by 10 minutes. “All of us have a responsibility to protect our home planet. After all, it’s the only one we have,” said NASA Administrator Bill Nelson. “This mission shows that NASA is trying to be ready for whatever the universe throws at us. NASA has proven we are serious as a defender of the planet. This is a watershed moment for planetary defense and all of humanity, demonstrating commitment from NASA’s exceptional team and partners from around the world.”'''

In [None]:
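import nltk
# Download the Punkt sentence tokenizer models used by sent_tokenize
# (recent NLTK releases may ask for the 'punkt_tab' resource instead).
nltk.download('punkt')
from nltk import sent_tokenize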

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**Tokenizing the paragraph into a list of sentences**

In [None]:
sentences=sent_tokenize(text)
print(sentences)

['The Double Asteroid Redirection Test successfully changed the trajectory of the asteroid Dimorphos when the NASA spacecraft intentionally slammed into the space rock on September 26, according to the agency.', 'The DART mission, a full-scale demonstration of deflection technology, was the world’s first conducted on behalf of planetary defense.', 'The mission was also the first time humanity intentionally changed the motion of a celestial object in space.', 'Prior to impact, it took Dimorphos 11 hours and 55 minutes to orbit its larger parent asteroid Didymos.', 'Astronomers used ground-based telescopes to measure how Dimorphos’ orbit changed after impact.', 'Now, it takes Dimorphos 11 hours and 23 minutes to circle Didymos.', 'The DART spacecraft changed the moonlet asteroid’s orbit by 32 minutes.', 'Initially, astronomers expected DART to be a success if it shortened the trajectory by 10 minutes.', '“All of us have a responsibility to protect our home planet.', 'After all, it’s the 

**Preprocessing stage**

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True



*   re.sub(r'[^\w\s]', '', sentence) removes all punctuation from a sentence; a list comprehension applies it to every sentence.
*   Remove stopwords (a, the, an, etc.) to focus on the more significant words in each sentence.
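
The two steps above can be sketched on a single sentence as follows. The small `STOP_WORDS` set and the helper name `clean_and_tokenize` are stand-ins introduced for this example so that it runs without downloading anything; the notebook uses the full English list from NLTK.

```python
import re

# Tiny stand-in stop-word set (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"a", "an", "the", "of", "to", "by"}

def clean_and_tokenize(sentence):
    # Lowercase, then drop every character that is neither
    # a word character nor whitespace (i.e. punctuation).
    cleaned = re.sub(r'[^\w\s]', '', sentence.lower())
    # split() with no argument also discards empty strings
    # that repeated spaces would otherwise produce.
    return [word for word in cleaned.split() if word not in STOP_WORDS]

tokens = clean_and_tokenize(
    "The DART spacecraft changed the moonlet asteroid's orbit by 32 minutes."
)
print(tokens)
# ['dart', 'spacecraft', 'changed', 'moonlet', 'asteroids', 'orbit', '32', 'minutes']
```

Note that removing the apostrophe turns "asteroid's" into "asteroids", the same effect visible in the notebook output.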




In [None]:
sentences_clean=[re.sub(r'[^\w\s]','',sentence.lower()) for sentence in sentences]
stop_words = stopwords.words('english')
sentence_tokens=[[words for words in sentence.split(' ') if words not in stop_words] for sentence in sentences_clean]
print(sentence_tokens)

[['double', 'asteroid', 'redirection', 'test', 'successfully', 'changed', 'trajectory', 'asteroid', 'dimorphos', 'nasa', 'spacecraft', 'intentionally', 'slammed', 'space', 'rock', 'september', '26', 'according', 'agency'], ['dart', 'mission', 'fullscale', 'demonstration', 'deflection', 'technology', 'worlds', 'first', 'conducted', 'behalf', 'planetary', 'defense'], ['mission', 'also', 'first', 'time', 'humanity', 'intentionally', 'changed', 'motion', 'celestial', 'object', 'space'], ['prior', 'impact', 'took', 'dimorphos', '11', 'hours', '55', 'minutes', 'orbit', 'larger', 'parent', 'asteroid', 'didymos'], ['astronomers', 'used', 'groundbased', 'telescopes', 'measure', 'dimorphos', 'orbit', 'changed', 'impact'], ['takes', 'dimorphos', '11', 'hours', '23', 'minutes', 'circle', 'didymos'], ['dart', 'spacecraft', 'changed', 'moonlet', 'asteroids', 'orbit', '32', 'minutes'], ['initially', 'astronomers', 'expected', 'dart', 'success', 'shortened', 'trajectory', '10', 'minutes'], ['us', 'res

As we know, text data has to be converted into numeric form. We first compute word embeddings, then use them to build sentence embeddings from the sentence_tokens variable computed above.

- Use gensim's Word2Vec() to compute an embedding for every word in the text.
- Using a list comprehension, replace each word token in sentence_tokens with its embedding.
- Compute max_len, the length of the longest sentence in the text.
- Pad each sentence embedding with 0s up to max_len, so that every sentence embedding has the same size.
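
The padding step is easiest to see on a toy input. The sketch below pads three made-up "sentence embeddings" of different lengths to a common length with np.pad; the values are invented for illustration and are not Word2Vec output.

```python
import numpy as np

# Made-up one-number-per-word embeddings for three sentences (illustrative values).
embeddings = [[0.5, -1.2, 0.3], [0.7], [1.0, 2.0]]

# Length of the longest sentence.
max_len = max(len(e) for e in embeddings)

# Append zeros on the right until each vector has max_len entries.
padded = [np.pad(e, (0, max_len - len(e)), 'constant') for e in embeddings]

# Vectors of equal length can now be stacked into one matrix.
matrix = np.vstack(padded)
print(matrix)
```

Without this step the vectors would have different lengths, and a cosine distance between two sentences could not be computed.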

In [None]:
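# gensim >= 4.0 parameter names; gensim 3.x used size= and iter= instead
w2v=Word2Vec(sentence_tokens,vector_size=1,min_count=1,epochs=1000)
# Each word vector has a single dimension, so keep its only component
sentence_embeddings=[[w2v.wv[word][0] for word in words] for words in sentence_tokens]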
max_len=max([len(tokens) for tokens in sentence_tokens])
sentence_embeddings=[np.pad(embedding,(0,max_len-len(embedding)),'constant') for embedding in sentence_embeddings]
print(sentence_embeddings)



[array([-1.3774105, -1.5648252, -1.4396875, -1.4545798, -1.4218374,
       -1.5217278, -1.5086281, -1.5648252, -1.5196548, -1.6093941,
       -1.4879727, -1.5637304, -1.4247156, -1.6206998, -1.5261114,
       -1.6321287, -1.6664679, -1.6027577, -1.6043236], dtype=float32), array([-1.58468  , -1.6650534, -1.5069708, -1.683015 , -1.6686802,
       -1.7436057, -1.6813625, -1.6704615, -1.6362563, -1.7113441,
       -1.803447 , -1.7605822,  0.       ,  0.       ,  0.       ,
        0.       ,  0.       ,  0.       ,  0.       ], dtype=float32), array([-1.6650534, -1.4216933, -1.6704615, -1.4764392, -1.6757175,
       -1.5637304, -1.5217278, -1.4476883, -1.4730337, -1.5155343,
       -1.6206998,  0.       ,  0.       ,  0.       ,  0.       ,
        0.       ,  0.       ,  0.       ,  0.       ], dtype=float32), array([-1.3958535, -1.4649928, -1.4221333, -1.5196548, -1.5319772,
       -1.5061108, -1.3955861, -1.492912 , -1.4730704, -1.392894 ,
       -1.340811 , -1.5648252, -1.5237055,  0.

  


**Calculate the similarity matrix**


> Initialize similarity_matrix with dimensions N x N, where N is the total number of sentences in the text.

> Use 1-spatial.distance.cosine() to compute the similarity between every pair of sentences.
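
Since spatial.distance.cosine() returns the cosine distance, subtracting it from 1 gives the cosine similarity. As a sketch of what that quantity is, the example below computes it directly from the definition, using only the standard library and three toy vectors; the helper name `cosine_similarity` is introduced here for illustration.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for padded sentence embeddings.
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
n = len(vectors)

# N x N matrix holding the similarity of every pair of vectors.
sim = [[cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
       for i in range(n)]
for row in sim:
    print(row)
```

The matrix is symmetric with ones (up to rounding) on the diagonal, the same pattern visible in the notebook output.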





In [None]:
similarity_matrix = np.zeros([len(sentence_tokens), len(sentence_tokens)])
for i,row_embedding in enumerate(sentence_embeddings):
    for j,column_embedding in enumerate(sentence_embeddings):
        similarity_matrix[i][j]=1-spatial.distance.cosine(row_embedding,column_embedding)
print(similarity_matrix)

[[1.         0.77842277 0.73990798 0.80623448 0.66596884 0.62674886
  0.6247347  0.66628879 0.48455659 0.53557593 0.66616797 0.48269325
  0.8068682 ]
 [0.77842277 1.         0.94986784 0.95573455 0.85033149 0.80226189
  0.79970229 0.85097426 0.62316644 0.69113314 0.85086727 0.62143332
  0.96138918]
 [0.73990798 0.94986784 1.         0.90760857 0.89940935 0.85212719
  0.85283911 0.89906245 0.68671745 0.7514078  0.90093911 0.68750477
  0.91812694]
 [0.80623448 0.95573455 0.90760857 1.         0.83176255 0.78460878
  0.78243357 0.83306611 0.61989808 0.68305069 0.83296162 0.61842948
  0.99759716]
 [0.66596884 0.85033149 0.89940935 0.83176255 1.         0.94004345
  0.94033813 0.99717361 0.73086035 0.80639416 0.99925023 0.72986954
  0.81961304]
 [0.62674886 0.80226189 0.85212719 0.78460878 0.94004345 1.
  0.99711019 0.94154871 0.78555882 0.86498463 0.94156569 0.78253978
  0.77424937]
 [0.6247347  0.79970229 0.85283911 0.78243357 0.94033813 0.99711019
  1.         0.93804789 0.79763031 0.874

**Implement PageRank**


> Convert the similarity matrix to a network/graph


> Apply pagerank()





In [None]:
nx_graph = nx.from_numpy_array(similarity_matrix)
scores = nx.pagerank(nx_graph)
print(scores)

{0: 0.06648528095912415, 1: 0.07784784411812283, 2: 0.07975563663364352, 3: 0.07735306264708693, 4: 0.08075394198654515, 5: 0.08003775721426168, 6: 0.0801811591515655, 7: 0.08108974282846147, 8: 0.07171796050837635, 9: 0.07537486763119604, 10: 0.08093280912219844, 11: 0.07162080061892813, 12: 0.07684913658048982}


Next, we build a dictionary with the sentences (obtained from sent_tokenize()) as keys and their PageRank scores as values, then sort it by score. Here the top 4 sentences are taken as the summary:

In [None]:
top_sentence={sentence:scores[index] for index,sentence in enumerate(sentences)}
top=dict(sorted(top_sentence.items(), key=lambda x: x[1], reverse=True)[:4])
print(top)

{'Initially, astronomers expected DART to be a success if it shortened the trajectory by 10 minutes.': 0.08108974282846147, '“This mission shows that NASA is trying to be ready for whatever the universe throws at us.': 0.08093280912219844, 'Astronomers used ground-based telescopes to measure how Dimorphos’ orbit changed after impact.': 0.08075394198654515, 'The DART spacecraft changed the moonlet asteroid’s orbit by 32 minutes.': 0.0801811591515655}


In [None]:
for sent in sentences:
    if sent in top.keys():
        print(sent)

Astronomers used ground-based telescopes to measure how Dimorphos’ orbit changed after impact.
The DART spacecraft changed the moonlet asteroid’s orbit by 32 minutes.
Initially, astronomers expected DART to be a success if it shortened the trajectory by 10 minutes.
“This mission shows that NASA is trying to be ready for whatever the universe throws at us.
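
Using the sentences themselves as dictionary keys works for this article, but two identical sentences would share one key and keep only the last score. A possible alternative, sketched below under the assumption that scores maps sentence indices to PageRank values as in the output above, ranks by index instead. The helper name `top_k_in_order` and the toy data are hypothetical.

```python
def top_k_in_order(sentences, scores, k):
    # Rank sentence indices by score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the k best indices and restore document order.
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]

# Toy input with a repeated sentence ("A." appears twice).
sentences = ["A.", "B.", "A.", "C."]
scores = {0: 0.1, 1: 0.4, 2: 0.3, 3: 0.2}

summary = top_k_in_order(sentences, scores, k=2)
print(summary)
# ['B.', 'A.']
```

Only the second "A." is selected here, because each occurrence keeps its own score.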
