# Text summarization using NLP

#### 1. Data Loading and Initial Exploration:

  * Load the tennis articles dataset from the CSV file.
  * Inspect the dataset to understand its structure and content.
  * Remove unnecessary columns (‘article_title’ and ‘source’) to simplify the data.

In [1]:
from google.colab import files
uploaded = files.upload()

Saving tennis_articles.zip to tennis_articles.zip


In [2]:
!unzip tennis_articles.zip

Archive:  tennis_articles.zip
  inflating: tennis_articles.csv     


In [25]:
import pandas as pd
import numpy as np

import nltk
import re
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords

# Download necessary NLP resources
nltk.download("punkt")
nltk.download("stopwords")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [26]:
data = pd.read_csv('tennis_articles.csv', encoding='unicode_escape')
data.shape

(8, 4)

In [27]:
data.head()

| | article_id | article_title | article_text | source |
| --- | --- | --- | --- | --- |
| 0 | 1 | I do not have friends in tennis, says Maria Sh... | Maria Sharapova has basically no friends as te... | https://www.tennisworldusa.org/tennis/news/Mar... |
| 1 | 2 | Federer defeats Medvedev to advance to 14th Sw... | BASEL, Switzerland (AP)  Roger Federer advanc... | http://www.tennis.com/pro-game/2018/10/copil-s... |
| 2 | 3 | Tennis: Roger Federer ignored deadline set by ... | Roger Federer has revealed that organisers of ... | https://scroll.in/field/899938/tennis-roger-fe... |
| 3 | 4 | Nishikori to face off against Anderson in Vien... | Kei Nishikori will try to end his long losing ... | http://www.tennis.com/pro-game/2018/10/nishiko... |
| 4 | 5 | Roger Federer has made this huge change to ten... | Federer, 37, first broke through on tour over ... | https://www.express.co.uk/sport/tennis/1036101... |


In [30]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   article_id     8 non-null      int64 
 1   article_title  8 non-null      object
 2   article_text   8 non-null      object
 3   source         8 non-null      object
dtypes: int64(1), object(3)
memory usage: 388.0+ bytes


In [31]:
df = data.drop(columns=['article_title','source'])
df

| | article_id | article_text |
| --- | --- | --- |
| 0 | 1 | Maria Sharapova has basically no friends as te... |
| 1 | 2 | BASEL, Switzerland (AP)  Roger Federer advanc... |
| 2 | 3 | Roger Federer has revealed that organisers of ... |
| 3 | 4 | Kei Nishikori will try to end his long losing ... |
| 4 | 5 | Federer, 37, first broke through on tour over ... |
| 5 | 6 | Nadal has not played tennis since he was force... |
| 6 | 7 | Tennis giveth, and tennis taketh away. The end... |
| 7 | 8 | I PLAYED golf last week with Todd Reid. He pic... |


#### 2. Text Preprocessing:

  * Tokenize the articles into individual sentences using nltk.sent_tokenize.

In [32]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [23]:
df_tokenized = df.copy()
df_tokenized['article_text'] = df_tokenized['article_text'].apply(nltk.sent_tokenize)

  * Download and load pre-trained GloVe word embeddings.

In [21]:
!wget https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
!unzip -q glove.6B.zip

--2025-03-14 11:13:35--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2025-03-14 11:16:15 (5.17 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]



In [34]:
# Path to the GloVe embeddings file (100-dimensional vectors)
path_to_glove_file = "glove.6B.100d.txt"

# Initialize an empty dictionary to store word embeddings
embeddings_index = {}

# Open the GloVe file and read line by line
with open(path_to_glove_file, "r", encoding="utf-8") as f:
    for line in f:
        # The first word is the actual word, the rest are the vector coefficients
        word, coefs = line.split(maxsplit=1)

        # Convert coefficients from string to NumPy array (floating point numbers)
        coefs = np.fromstring(coefs, "f", sep=" ")

        # Store the word and its corresponding vector in the dictionary
        embeddings_index[word] = coefs

# Print the number of word vectors found
print("Found %s word vectors." % len(embeddings_index))


Found 400000 word vectors.


  * Clean the sentences by removing punctuation, special characters, and numbers.
  * Convert all sentences to lowercase.
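
The cleaning steps above can be tried on a single sentence with only the standard library. This is a minimal sketch, not the notebook's own code: the `clean_and_filter` helper and the tiny `STOP_WORDS` set are hypothetical names invented for illustration (the notebook loads NLTK's English stop-word list instead). This variant also drops stop words and returns the filtered words joined back into a string.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook itself uses nltk.corpus.stopwords.words('english').
STOP_WORDS = {"the", "is", "a", "on", "and", "in", "to", "has"}

def clean_and_filter(sentence, stop_words=STOP_WORDS):
    """Remove punctuation and digits, lowercase, then drop stop words."""
    sentence = re.sub(r"[^\w\s]", "", sentence)  # strip punctuation
    sentence = re.sub(r"\d+", "", sentence)      # strip digits
    words = sentence.lower().split()
    return " ".join(w for w in words if w not in stop_words)

example = "Federer, 37, has won 20 Grand Slam titles!"
print(clean_and_filter(example))  # federer won grand slam titles
```

Because punctuation is deleted rather than replaced by a space, words that were separated only by punctuation in the source text end up fused together; replacing punctuation with a space instead would avoid that.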

In [35]:
import re
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

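def clean_sentence(sentence):
  """Strip punctuation and digits from a sentence, then lowercase it."""
  sentence = re.sub(r'[^\w\s]', '', sentence) # Remove punctuation
  sentence = re.sub(r'\d+', '', sentence) # Remove numbers
  sentence = sentence.lower()

  # NOTE: the stop-word filter below has no effect on the result, because
  # the function returns `sentence` rather than the filtered `words`.
  # The outputs shown in this notebook were produced this way; return
  # " ".join(words) instead to actually drop stop words.
  words = sentence.split()
  words = [word for word in words if word not in stop_words]

  return sentence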

sentences = df_tokenized['article_text'].explode().tolist()

cleaned_sentences = [clean_sentence(sentence) for sentence in sentences]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [36]:
df_cleaned = df_tokenized.copy()
df_cleaned['article_text'] = df_cleaned['article_text'].apply(lambda x: [clean_sentence(sentence) for sentence in x])

In [38]:
df_cleaned["article_text"][0]

['maria sharapova has basically no friends as tennis players on the wta tour',
 'the russian player has no problems in openly speaking about it and in a recent interview she said i dont really hide any feelings too much',
 'i think everyone knows this is my job here',
 'when im on the courts or when im on the court playing im a competitor and i want to beat every single person whether theyre in the locker room or across the net',
 'so im not the one to strike up a conversation about the weather and know that in the next few minutes i have to go and try to win a tennis match',
 'im a pretty competitive girl',
 'i say my hellos but im not sending any players flowers as well',
 'uhm im not really friendly or close to many players',
 'i have not a lot of friends away from the courts',
 'when she said she is not really close to a lot of players is that something strategic that she is doing',
 'is it different on the mens tour than the womens tour',
 'no not at all',
 'i think just because y

In [39]:
# Function to get sentence embedding
def get_sentence_embedding(sentence):
    """Convert a sentence into a numerical vector by averaging its word embeddings."""
    words = sentence.split()  # Split cleaned sentence into words
    word_vectors = [embeddings_index[word] for word in words if word in embeddings_index]  # Retrieve word embeddings
    if len(word_vectors) == 0:
        return np.zeros(100)  # Return a zero vector if no words have embeddings
    return np.mean(word_vectors, axis=0)  # Compute the average of word vectors

In [44]:
# Convert cleaned sentences into embeddings
sentence_embeddings = np.array([get_sentence_embedding(sentence) for sentence in cleaned_sentences])

# Print the shape of the final embeddings
print("\n Sentence Embeddings Shape:", sentence_embeddings.shape)



 Sentence Embeddings Shape: (130, 100)
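
To make the averaging step concrete, here is a small sketch on toy data. The 3-dimensional vectors and the `average_embedding` helper are invented for illustration (the notebook uses 100-dimensional GloVe vectors and NumPy); the logic mirrors the idea of skipping out-of-vocabulary words and falling back to a zero vector.

```python
# Toy 3-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "tennis": [1.0, 0.0, 2.0],
    "match":  [3.0, 2.0, 0.0],
}

def average_embedding(sentence, embeddings, dim=3):
    """Average the vectors of in-vocabulary words; zero vector if none match."""
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together
    return [sum(component) / len(vectors) for component in zip(*vectors)]

print(average_embedding("tennis match today", toy_embeddings))  # [2.0, 1.0, 1.0]
print(average_embedding("unknown words", toy_embeddings))       # [0.0, 0.0, 0.0]
```

Note that "today" contributes nothing in the first call: words missing from the vocabulary are silently ignored, so two sentences that differ only in such words receive the same vector.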


#### 3. Similarity Matrix and Graph Construction:

  * Calculate the cosine similarity between all pairs of sentence vectors.
  * Create a similarity matrix representing the sentence similarities.
  * Construct a graph from the similarity matrix using networkx.
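
As a rough illustration of these three steps, the sketch below computes cosine similarity by hand for three toy vectors and keeps only the pairs above a cutoff. The vectors, the `cosine` helper, and the `threshold` value are assumptions made for this example; the notebook itself relies on scikit-learn and networkx.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy sentence vectors, invented for illustration.
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
n = len(vectors)
similarity = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

# Keep an edge only when similarity exceeds a hypothetical threshold.
threshold = 0.5
edges = [(i, j, similarity[i][j])
         for i in range(n) for j in range(i + 1, n)
         if similarity[i][j] > threshold]
print([(i, j) for i, j, _ in edges])  # [(0, 1), (1, 2)]
```

Vectors 0 and 2 are orthogonal, so their similarity is 0 and no edge is created between them. With averaged word embeddings, similarities tend to be high across the board, which is why the choice of threshold strongly affects how dense the graph becomes.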

In [47]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity between sentence embeddings
cos_similarity_matrix = cosine_similarity(sentence_embeddings)

In [49]:
print(cos_similarity_matrix.shape)

(130, 130)


In [50]:
# Display the first 5x5 portion of the matrix for verification
print("\n - Sample of the Cosine Similarity Matrix:")
print(cos_similarity_matrix[:5, :5])


 - Sample of the Cosine Similarity Matrix:
[[1.0000001  0.88191336 0.8373998  0.8973751  0.9011506 ]
 [0.88191336 0.9999999  0.93075025 0.9421247  0.95935875]
 [0.8373998  0.93075025 1.0000001  0.8886524  0.9025485 ]
 [0.8973751  0.9421247  0.8886524  1.0000001  0.9729609 ]
 [0.9011506  0.95935875 0.9025485  0.9729609  0.9999998 ]]


In [64]:
import networkx as nx

# Build graph
G = nx.Graph()


# Add nodes (each sentence is a node)
for i, sentence in enumerate(cleaned_sentences):
    G.add_node(i, text=sentence)  # Use index as node ID

# Define similarity threshold (higher = fewer edges)
threshold = 0.75

# Add edges for sentence pairs whose cosine similarity exceeds the threshold
for i in range(len(cleaned_sentences)):
    for j in range(i + 1, len(cleaned_sentences)):  # Visit each unordered pair once
        if cos_similarity_matrix[i][j] > threshold:
            G.add_edge(i, j, weight=cos_similarity_matrix[i][j])  # Edge weight = similarity score

print("\n  Graph Construction Complete!")
print(f"  - Total Nodes: {G.number_of_nodes()}")
print(f"  - Total Edges: {G.number_of_edges()}")



  Graph Construction Complete!
  - Total Nodes: 130
  - Total Edges: 8001


#### 4. Sentence Ranking and Summarization:

  * Apply the PageRank algorithm to the graph to rank the sentences based on their importance.
  * Sort the sentences based on their PageRank scores.
  * Extract the top N sentences to form the summary.
  * Print the summary.
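
For intuition about what the ranking step computes, the following is a simplified power-iteration sketch of weighted PageRank on a toy three-node graph. The `pagerank` function here is a hypothetical helper written for illustration and is not guaranteed to match the networkx implementation in every detail; the weights are invented.

```python
def pagerank(weights, alpha=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank on a symmetric weight matrix (list of lists)."""
    n = len(weights)
    out_strength = [sum(row) for row in weights]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Rank held by nodes without edges is spread evenly over all nodes.
        dangling = sum(rank[j] for j in range(n) if out_strength[j] == 0)
        new_rank = []
        for i in range(n):
            incoming = sum(rank[j] * weights[j][i] / out_strength[j]
                           for j in range(n) if out_strength[j] > 0)
            new_rank.append((1 - alpha) / n + alpha * (incoming + dangling / n))
        converged = sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol
        rank = new_rank
        if converged:
            break
    return rank

# Toy graph: node 1 is connected to both others, so it should rank highest.
weights = [
    [0.0, 0.9, 0.0],
    [0.9, 0.0, 0.8],
    [0.0, 0.8, 0.0],
]
scores = pagerank(weights)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [1, 0, 2]
```

Each node passes its score to its neighbours in proportion to edge weight, so a sentence that is similar to many other sentences accumulates a higher score. When almost every pair of sentences is connected, the scores come out nearly uniform and the ranking is driven by small differences.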

In [67]:
# Apply PageRank algorithm to rank sentences based on importance
sentence_scores = nx.pagerank(G, alpha=0.85,  weight='weight')

In [68]:
# Convert scores into a sorted list (higher score = more important)
sorted_sentences = sorted(sentence_scores.items(), key=lambda x: x[1], reverse=True)

# Print the top-ranked sentences (optional)
print("\n Top Ranked Sentences (by PageRank Score):")
for i, (idx, score) in enumerate(sorted_sentences[:5]):  # Displaying top 5 sentences
    print(f"{i+1}. (Score: {score:.4f}) - {cleaned_sentences[idx]}")



 Top Ranked Sentences (by PageRank Score):
1. (Score: 0.0082) - so im not the one to strike up a conversation about the weather and know that in the next few minutes i have to go and try to win a tennis match
2. (Score: 0.0082) - i was on a nice trajectorythen reid recalledif i hadnt got sick i think i could have started pushing towards the second week at the slams and then who knows duringa comeback attempt some five years later reid added bernard tomic and  us open federer slayer john millman to his list of career scalps
3. (Score: 0.0082) - i just felt like it really kind of changed where people were a little bit definitely in the s a lot more quiet into themselves and then it started to become better meanwhile federer is hoping he can improve his service game as he hunts his ninth swiss indoors title this week
4. (Score: 0.0082) - speaking at the swiss indoors tournament where he will play in sundays final against romanian qualifier marius copil the world number three said that gi

In [73]:
N = 3
top_sentences = [cleaned_sentences[idx] for idx, _ in sorted_sentences[:N]]

# Step 4: Print Final Summary
print("\n  **Improved Summary:**")
print(" ".join(top_sentences))


  **Improved Summary:**
so im not the one to strike up a conversation about the weather and know that in the next few minutes i have to go and try to win a tennis match i was on a nice trajectorythen reid recalledif i hadnt got sick i think i could have started pushing towards the second week at the slams and then who knows duringa comeback attempt some five years later reid added bernard tomic and  us open federer slayer john millman to his list of career scalps major players feel that a big event in late november combined with one in january before the australian open will mean too much tennis and too little rest
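
The summary above is assembled from the cleaned sentences, so it has no punctuation or capitalization. Since `cleaned_sentences` is built from `sentences` element by element, the two lists share indices, and one possible refinement is to map the top-ranked indices back to the original text and restore document order. The sketch below shows the idea on toy data; `original_sentences`, the score values, and `build_summary` are hypothetical stand-ins, not the notebook's variables or results.

```python
# Toy stand-ins for the notebook's `sentences` list and PageRank scores.
original_sentences = [
    "Federer won the match.",
    "It rained in Basel.",
    "He now leads the series 3-0.",
    "The crowd cheered.",
]
scores = {0: 0.40, 1: 0.10, 2: 0.35, 3: 0.15}  # hypothetical values

def build_summary(sentences, scores, n=2):
    """Pick the n highest-scoring sentences and return them in original order."""
    top = sorted(scores, key=scores.get, reverse=True)[:n]
    return " ".join(sentences[i] for i in sorted(top))

print(build_summary(original_sentences, scores))
# Federer won the match. He now leads the series 3-0.
```

Sorting the selected indices before joining keeps the sentences in the order they appeared, which usually reads more naturally than ordering them by score.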
