# Extractive Summarisation using TF-IDF Vectors

In [None]:
# Only the libraries this notebook actually uses.
import nltk
import spacy as sp
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
text = """
Vijay married Sangeetha Sornalingam,[151][152] a Sri Lankan Tamil whom he met in the United Kingdom, on 25 August 1999.[153][154] They have two children.[155] Vijay's son, Jason Sanjay, made a cameo appearance with his father in Vettaikaaran (2009)[156] and his daughter portrayed a small role as her father's pre-teen daughter in the climax of Theri (2016).[157]

On 5 February 2020, the Income Tax Department raided Vijay's residence in Chennai and inquired about potential tax evasion, making note of his investment in immovable properties, which he inherited from the production studio AGS Entertainment.[158] It was reported that Vijay and AGS Entertainment producer Anbu Cheliyan were suspected of undisclosed payments and alleged tax fraud. Nearly ₹65 crore was seized by the officials from Cheliyan's residence.[159] The investigation took place while Vijay was shooting for his film Master in Cuddalore.[160] On 12 March, officials stated that nothing significant was found during the raid.[161] Opponents of the Bharatiya Janata Party accused them of politically targeting Vijay through such raids because he was critical of them on demonetization and the Goods and Services Tax (India) in the film Mersal.[162]

On 13 July 2021, the Madras High Court dismissed a writ petition filed by Vijay in 2012 seeking exemption of the Entry Tax for his Rolls-Royce Ghost car that was imported from England. It imposed a fine of ₹1 lakh which was allotted to the Tamil Nadu Chief Minister's Public COVID relief fund. Justice S.M. Subramaniam said that Vijay's fan base considers him as a hero and he was expected to be one instead of a "reel" hero, further calling it an anti-national habit.[163] On 15 July 2021, Vijay filed an appeal against defamatory statements made by the judge against him in Madras High court.[164] On 20 July 2021, Vijay's appeal against the tax exemption case issue and defamatory statements was moved to a different tax bench sector of the court.[165] On 27 July 2021, a two-judge bench of Madras High court stayed the earlier passed order by Judge S.M. Subramaniam that included the critical remarks and also stayed the order of a ₹1 lakh fine amount.[166] On 25 January 2022, the court dismissed and removed the defamatory critical statements made by Judge S.M. Subramaniam against Vijay.[167][168] On 15 July 2022, the court declared that no fine should be imposed on the car imported since he had paid the full entry tax before January 2019, closing out the case in the process.[169][170][171]
"""

In [None]:
import re

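# Replace bracketed citation markers such as [151] with a space,
# then collapse the runs of spaces this leaves behind.
text = re.sub(r'\[[^\]]*\]', ' ', text)
text = re.sub(r' +', ' ', text)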
print(text)


Vijay married Sangeetha Sornalingam, a Sri Lankan Tamil whom he met in the United Kingdom, on 25 August 1999. They have two children. Vijay's son, Jason Sanjay, made a cameo appearance with his father in Vettaikaaran (2009) and his daughter portrayed a small role as her father's pre-teen daughter in the climax of Theri (2016). 

On 5 February 2020, the Income Tax Department raided Vijay's residence in Chennai and inquired about potential tax evasion, making note of his investment in immovable properties, which he inherited from the production studio AGS Entertainment. It was reported that Vijay and AGS Entertainment producer Anbu Cheliyan were suspected of undisclosed payments and alleged tax fraud. Nearly ₹65 crore was seized by the officials from Cheliyan's residence. The investigation took place while Vijay was shooting for his film Master in Cuddalore. On 12 March, officials stated that nothing significant was found during the raid. Opponents of the Bharatiya Janata Party accused 
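The citation-stripping step can be checked in isolation on a short string. The sketch below wraps the same two substitutions in a helper; the name `strip_citations` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re


def strip_citations(raw: str) -> str:
    """Remove bracketed markers such as [151] and collapse repeated spaces."""
    # Same two patterns as the cell above (hypothetical helper name).
    no_markers = re.sub(r'\[[^\]]*\]', ' ', raw)
    return re.sub(r' +', ' ', no_markers)


sample = "They have two children.[155] Vijay's son made a cameo.[156]"
cleaned = strip_citations(sample)
print(repr(cleaned))
```

Note that the pattern removes any bracketed text, not only numeric markers, so an editorial note like `[sic]` would be dropped as well. If only numbered citations should go, a narrower pattern such as `r'\[\d+\]'` is one option.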

In [None]:
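# Small English spaCy pipeline, used below for lemmatisation and stop-word filtering.
nlp = sp.load("en_core_web_sm")

# Punkt sentence-tokenizer data; recent NLTK releases may ask for 'punkt_tab' instead.
nltk.download('punkt')
original_sentences = nltk.sent_tokenize(text)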

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
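def preprocess(sentences):
    """Lemmatise each sentence, dropping stop words and non-alphabetic tokens."""
    preprocessed_sentences = []
    for sentence in sentences:
        doc = nlp(sentence)
        # Keep the lemma of every alphabetic token that is not a stop word.
        extracted_words = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
        preprocessed_sentences.append(" ".join(extracted_words))
    return preprocessed_sentences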

In [None]:
preprocessed_sentences = preprocess(original_sentences)
preprocessed_sentences

['Vijay marry Sangeetha Sornalingam Sri Lankan Tamil meet United Kingdom August',
 'child',
 'Vijay son Jason Sanjay cameo appearance father Vettaikaaran daughter portray small role father pre teen daughter climax Theri',
 'February Income Tax Department raid Vijay residence Chennai inquire potential tax evasion make note investment immovable property inherit production studio AGS Entertainment',
 'report Vijay AGS Entertainment producer Anbu Cheliyan suspect undisclosed payment allege tax fraud',
 'nearly crore seize official Cheliyan residence',
 'investigation take place Vijay shoot film Master Cuddalore',
 'March official state significant find raid',
 'opponent Bharatiya Janata Party accuse politically target Vijay raid critical demonetization Goods Services Tax India film Mersal',
 'July Madras High Court dismiss writ petition file Vijay seek exemption Entry Tax Rolls Royce Ghost car import England',
 'impose fine lakh allot Tamil Nadu Chief Minister public covid relief fund',
 '

In [None]:
# Each preprocessed sentence becomes one row of the TF-IDF matrix.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(preprocessed_sentences)

In [None]:
# Score each sentence as the sum of the TF-IDF weights in its row.
sent_scores = matrix.toarray().sum(axis=1)
sent_scores

array([3.27294297, 1.        , 3.79175395, 4.53569618, 3.54815123,
       2.44442802, 2.77805245, 2.44051661, 4.06592239, 4.29026976,
       3.45465341, 1.        , 3.5819497 , 3.12216308, 3.65755205,
       3.04677306, 2.81670012, 2.80765786, 1.36855169, 3.70055034])
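The one-word sentence `'child'` (index 1) scores exactly 1.0. This is consistent with `TfidfVectorizer`'s default L2 normalisation: a row with a single non-zero weight is scaled to length 1, so its sum is 1. The sketch below recomputes the row sums in plain Python to make that visible. It assumes scikit-learn's default weighting (raw counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 norm) and simplifies tokenisation to a lowercase whitespace split. The function name and toy sentences are hypothetical.

```python
import math
from collections import Counter


def tfidf_row_sums(docs):
    """Score each document as the sum of its L2-normalised TF-IDF weights.

    Assumption: mirrors scikit-learn's default weighting, but tokenises
    with a plain lowercase whitespace split (a simplification).
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = [
            tf * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, tf in counts.items()
        ]
        norm = math.sqrt(sum(w * w for w in weights))
        scores.append(sum(weights) / norm if norm else 0.0)
    return scores


toy_sentences = ["child", "tax raid", "tax fine court"]
toy_scores = tfidf_row_sums(toy_sentences)
print(toy_scores)
```

Because each row has unit length, a sentence with k distinct terms can score at most √k. A plain row sum therefore tends to favour longer sentences, which is worth keeping in mind when reading the ranking.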

In [None]:
sent_scores[2]

3.791753953515344

In [None]:
len(sent_scores)

20

In [None]:
matrix.toarray()[0]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.31639814, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.31639

In [None]:
len(matrix.toarray()[0])

148

In [None]:
len(vectorizer.vocabulary_)

148

In [None]:
# Sentence indices ordered from highest to lowest score (negating gives a descending sort).
ranked_scores = (-sent_scores).argsort()
ranked_scores

array([ 3,  9,  8,  2, 19, 14, 12,  4, 10,  0, 13, 15, 16, 17,  6,  5,  7,
       18,  1, 11])

In [None]:
no_of_sentences = 5
# Take the five best-scoring indices, then sort them to restore document order.
top_score_indices = sorted(ranked_scores[:no_of_sentences])
top_score_indices

[2, 3, 8, 9, 19]
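The rank-then-reorder step does not depend on NumPy. As a minimal sketch, the same selection can be written with the standard library's `heapq.nlargest`; the scores are copied from the notebook output above (rounded to four decimals), and the helper name `top_k_in_document_order` is an assumption for illustration.

```python
import heapq

# Sentence scores copied from the notebook output, rounded to four decimals.
scores = [3.2729, 1.0, 3.7918, 4.5357, 3.5482, 2.4444, 2.7781, 2.4405, 4.0659, 4.2903,
          3.4547, 1.0, 3.5819, 3.1222, 3.6576, 3.0468, 2.8167, 2.8077, 1.3686, 3.7006]


def top_k_in_document_order(sentence_scores, k):
    """Return the indices of the k highest scores, sorted by position."""
    best = heapq.nlargest(k, range(len(sentence_scores)), key=sentence_scores.__getitem__)
    # Sorting the indices keeps the selected sentences in their original order.
    return sorted(best)


selected = top_k_in_document_order(scores, 5)
print(selected)  # [2, 3, 8, 9, 19]
```

Restoring document order matters for readability: the summary then follows the chronology of the source text rather than the order of the scores.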

In [None]:
final_sentences = [original_sentences[i] for i in top_score_indices]
# Joining the selected sentences is enough; no second spaCy pass is needed.
summary = " ".join(final_sentences)
print(summary)

Vijay's son, Jason Sanjay, made a cameo appearance with his father in Vettaikaaran (2009) and his daughter portrayed a small role as her father's pre-teen daughter in the climax of Theri (2016). On 5 February 2020, the Income Tax Department raided Vijay's residence in Chennai and inquired about potential tax evasion, making note of his investment in immovable properties, which he inherited from the production studio AGS Entertainment. Opponents of the Bharatiya Janata Party accused them of politically targeting Vijay through such raids because he was critical of them on demonetization and the Goods and Services Tax (India) in the film Mersal. On 13 July 2021, the Madras High Court dismissed a writ petition filed by Vijay in 2012 seeking exemption of the Entry Tax for his Rolls-Royce Ghost car that was imported from England. On 15 July 2022, the court declared that no fine should be imposed on the car imported since he had paid the full entry tax before January 2019, closing out the cas