## Text Summarization (Sentence Ranking)
The sentence ranking method produces a document summary by selecting the most important sentences, where importance is based on how strongly each sentence relates to the other sentences in the document.

### The process consists of the following steps:
#### 1. Data cleaning (removing non-letter characters, converting to lowercase):
Clean the text by removing non-letter characters and punctuation and converting all text to lowercase so that it is easier to process.
#### 2. Building the sentence similarity matrix:
Compute the similarity between every pair of sentences from their vector representations (for example word counts or TF-IDF weights, compared with cosine similarity) and store the results in a matrix.
#### 3. Sentence ranking:
Score each sentence by its importance, typically with a graph-based algorithm such as PageRank.
#### 4. Summary generation:
Select the highest-scoring sentences as the document summary without changing the structure of the original sentences.
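
Step 2 names TF-IDF as one possible sentence representation, while the notebook code further down works with raw counts. As a rough illustration of the TF-IDF option, here is a minimal standard-library sketch. The helper names `tfidf_vectors` and `cosine`, the whitespace tokenization, and the unsmoothed `log(n / df)` weighting are all assumptions chosen for brevity, not something the notebook itself uses.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Return one {word: weight} dict per sentence (hypothetical helper)."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences each word appears
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Term frequency times an unsmoothed inverse document frequency
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = ["federer won the match", "nadal lost the match", "the crowd cheered"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

In this toy input the word "the" occurs in every sentence, so its weight is zero and it does not contribute to any similarity; the first and third sentences therefore end up with a similarity of exactly 0.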

## Initial Phase
### Importing Libraries and Reading Data 

In [1]:
%pip install networkx


In [2]:
### Import the necessary libraries

import re

import networkx as nx
import nltk
import numpy as np
import pandas as pd
from nltk.cluster.util import cosine_distance
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize

# Note: sent_tokenize needs NLTK's Punkt tokenizer data, downloaded once with nltk.download(...)

In [3]:
### Reading the data file

df = pd.read_csv('../dataset/tennis_articles_v4.csv')
df['article_text']

0    Maria Sharapova has basically no friends as te...
1    BASEL, Switzerland (AP), Roger Federer advance...
2    Roger Federer has revealed that organisers of ...
3    Kei Nishikori will try to end his long losing ...
4    Federer, 37, first broke through on tour over ...
5    Nadal has not played tennis since he was force...
6    Tennis giveth, and tennis taketh away. The end...
7    Federer won the Swiss Indoors last week by bea...
Name: article_text, dtype: object

In [4]:
## Working with re (regular expressions in Python)

s = 'he&&&s'
s = re.sub("[^a-zA-Z]", " ", s)  # every non-letter character becomes a space
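
To make the effect of this substitution concrete, here is a small standalone sketch. The helper name `clean_text` and the extra step that collapses repeated spaces are illustrative assumptions and are not part of the notebook.

```python
import re

def clean_text(text):
    """Replace non-letters with spaces, lowercase, and collapse runs of spaces."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return " ".join(letters_only.lower().split())

# The same substitution as in the cell above: each '&' becomes one space
raw = re.sub(r"[^a-zA-Z]", " ", "he&&&s")
print(repr(raw))                       # 'he   s'

# Digits and punctuation are removed as well
print(clean_text("He&&&s won 6-4!"))   # he s won
```

Note that digits are dropped too, so a score such as "6-4" disappears from the cleaned sentence.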

## STEP 1 : Data Cleaning
Clean the sentences by removing non-letter characters and converting them to lowercase.

In [5]:
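### Clean the sentences by removing non-alphabetic characters and converting to lowercase

dict = {}  # maps each cleaned sentence to its lowercased source sentence (note: shadows the built-in dict)

# Concatenate all articles into one string. No separator is inserted, so the last
# sentence of one article can run into the first sentence of the next.
s = ""
for a in df['article_text']:
    s += a
s = s.lower()

# Split the lowercased text into sentences
sentences = sent_tokenize(s)

final = []  # cleaned sentences, in document order

for sentence in sentences:
    temp = re.sub("[^a-zA-Z]", " ", sentence)
    final.append(temp)
    dict[temp] = sentence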

## STEP 2 : Building Sentence Similarity Matrix
Compute the cosine similarity between the vector representations of each pair of sentences.

In [6]:
### Define functions for calculating similarity

def sentence_similarity(sent1, sent2, stopwords=None):
    """Return the cosine similarity of the count vectors of two token sequences.

    Each argument should be a sequence of tokens (e.g. a list of words).
    A plain string is iterated character by character, so split it into
    words first if word-level similarity is intended.
    """
    if stopwords is None:
        stopwords = []

    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]

    all_words = list(set(sent1 + sent2))

    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)

    # build the count vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1

    # build the count vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1

    return 1 - cosine_distance(vector1, vector2)


def build_similarity_matrix(sentences, stop_words):
    """Return an n x n matrix of pairwise sentence similarities (diagonal left at 0)."""
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))

    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:  # skip comparing a sentence with itself
                continue
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)
    return similarity_matrix
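
`sentence_similarity` builds its vectors from whatever items it finds when iterating over its arguments. When a cleaned sentence is passed as a plain string, those items are single characters rather than words. The following standard-library sketch shows a word-level variant for comparison. The name `word_similarity`, the whitespace tokenization, and the toy sentences are assumptions for illustration only.

```python
import math
from collections import Counter

def word_similarity(sent1, sent2, stop_words=()):
    """Cosine similarity of word-count vectors for two sentence strings."""
    counts1 = Counter(w for w in sent1.lower().split() if w not in stop_words)
    counts2 = Counter(w for w in sent2.lower().split() if w not in stop_words)
    # Counter returns 0 for missing words, so only shared words contribute
    dot = sum(counts1[w] * counts2[w] for w in counts1)
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty sentence is treated as similar to nothing
    return dot / (norm1 * norm2)

a = "federer won the final"
b = "nadal won the final"
print(round(word_similarity(a, b), 2))                      # 0.75
print(round(word_similarity(a, b, stop_words={"the"}), 2))  # 0.67
```

Removing the stop word "the" lowers the score here because one of the three shared words no longer counts toward the overlap.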

## STEP 3 : Sentence Ranking
Rank the sentences by running the PageRank algorithm on the sentence similarity graph.

In [None]:
### Build a weighted graph from the cosine similarity matrix with networkx, score the
### sentences with PageRank, and store them sorted by score in ranked_sentence

# Step 2 - Generate the similarity matrix across sentences (no stop words are passed)
sentence_similarity_matrix = build_similarity_matrix(final, [])

# Step 3 - Rank sentences using the similarity matrix as a weighted graph
sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
scores = nx.pagerank(sentence_similarity_graph)

# Step 4 - Sort the (score, sentence) pairs, highest score first
ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(final)), reverse=True)
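
`nx.pagerank` is used above as a black box. The sketch below shows the underlying idea as a plain power iteration over a small, made-up similarity matrix. The function name `pagerank_scores`, the toy matrix, and the stopping rule are assumptions for illustration. It is not networkx's implementation, and nodes without any edges are simply skipped as sources instead of being treated the way networkx treats dangling nodes. The damping value 0.85 matches the default of the `alpha` parameter in `nx.pagerank`.

```python
def pagerank_scores(weights, damping=0.85, iterations=100, tol=1e-8):
    """Power iteration on a non-negative weight matrix given as a list of lists."""
    n = len(weights)
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Rank flowing into node j from each node i, in proportion to edge weight
            incoming = sum(
                scores[i] * weights[i][j] / out_sums[i]
                for i in range(n)
                if out_sums[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores

# Toy symmetric similarity matrix for three sentences (made-up values)
sim = [
    [0.0, 0.8, 0.6],
    [0.8, 0.0, 0.1],
    [0.6, 0.1, 0.0],
]
scores = pagerank_scores(sim)
best = max(range(len(scores)), key=scores.__getitem__)
print(best)  # 0
```

Sentence 0 is strongly similar to both other sentences, so it collects the most rank. This is the intuition behind using PageRank for summarization: a sentence that resembles many others is likely to carry the central content.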

## STEP 4 : Summary Generation
Display the top-ranked sentence as the summary; increase the range (for example to 10) to include more sentences.

In [8]:
# Print the lowercased source text of the top-ranked sentence;
# raise the range to include more sentences in the summary
for i in range(1):
    print(dict[ranked_sentence[i][1]])

argentina and britain received wild cards to the new-look event, and will compete along with the four 2018 semi-finalists and the 12 teams who win qualifying rounds next february.
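
When more than one sentence is selected, printing them in score order can make the summary jump around the document. One option, sketched below with made-up data, is to pick the top `n` sentences by score and then restore their original order. The helper name `top_sentences` and the sample sentences and scores are hypothetical and do not come from the notebook.

```python
def top_sentences(sentences, scores, n=3):
    """Return the n highest-scoring sentences in their original document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n])  # re-sort the chosen indices by position
    return [sentences[i] for i in keep]

sentences = ["a first point.", "a minor aside.", "the key claim.", "a closing remark."]
scores = [0.30, 0.05, 0.45, 0.20]

summary = " ".join(top_sentences(sentences, scores, n=2))
print(summary)  # a first point. the key claim.
```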
