# Text Summarization with TF-IDF
* This lab builds an extractive summarization mechanism using TF-IDF.
* Given an article, we split it into sentences and apply TF-IDF to them, treating each sentence as a document. Our extraction metric is the mean of the non-zero entries of each sentence's TF-IDF vector, computed with `np.nanmean`.
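
To make the metric concrete, here is a minimal sketch on a hand-written weight matrix. The values are made up for illustration and do not come from the dataset. It contrasts a plain mean, which divides by the full vocabulary size, with the masked mean, which divides only by the number of terms a sentence actually contains.

```python
import numpy as np

# Hypothetical TF-IDF weights: 2 sentences (rows) x 3 vocabulary terms (columns).
# A zero means the term does not occur in that sentence.
weights = np.array([
    [0.0, 0.6, 0.8],   # sentence 0 contains two of the three terms
    [1.0, 0.0, 0.0],   # sentence 1 contains a single term
])

# Plain mean: absent terms count as zeros and drag the score down.
plain_mean = weights.mean(axis=1)

# Masked mean: replace zeros with NaN so np.nanmean skips them.
masked = np.where(weights == 0, np.nan, weights)
masked_mean = np.nanmean(masked, axis=1)

print(plain_mean)   # roughly [0.467, 0.333]
print(masked_mean)  # roughly [0.7, 1.0]
```

With the masked mean, the score reflects how informative a sentence's own words are rather than how many vocabulary terms it happens to cover.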

In [1]:
# ! mkdir data
# !wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv -P data/

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/bbc_text_cls.csv')
df = df[df.labels=='business']
df.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [4]:
from nltk import sent_tokenize
from typing import List

def get_sentences(df:pd.DataFrame, idx:int|None=None)->List[str]:
    ''' 
        Extracts the sentences of a certain article from the dataset.

        Parameters
        ----------
        `df`: `pd.DataFrame`
            The news articles dataset.
        `idx`: `int | None`
            The positional index of the desired article. If None, an article
            is chosen at random.

        Returns
        -------
        A list with the article's sentences.
    '''
    if idx is None:
        # Draw a random row position in [0, number of articles).
        idx = np.random.randint(0, len(df))
    # Select the text column by name rather than by position.
    text = df['text'].iloc[idx]
    return sent_tokenize(text)

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

class ExtractiveSummarizer:
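    def __init__(self, p: float):
        assert 0 < p < 1, '`p` must lie in the open interval (0, 1).'
        self.p = p

    def fit(self, X: List[str]):
        # One row per sentence, one column per vocabulary term.
        tfidf = TfidfVectorizer().fit_transform(X).toarray()
        # Mask zeros as NaN so each score averages only the terms present in the sentence.
        self._means = np.nanmean(np.where(tfidf == 0, np.nan, tfidf), axis=1)
        return self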

sample_sentences = get_sentences(df, 10)
ExtractiveSummarizer(0.25).fit(sample_sentences)._means

array([0.19790323, 0.27374447, 0.25673992, 0.24306069, 0.27074816,
       0.37008136, 0.18136769, 0.24750664, 0.2870746 , 0.34687   ,
       0.23272259])

<p style='color:red'> Started `ExtractiveSummarizer`; `transform` still needs to be implemented.</p>
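
As one possible direction for the missing `transform` step, the sketch below selects the highest-scoring sentences from a score array. It rests on an assumption the notebook does not confirm: that `p` is the fraction of sentences to keep. The helper `select_top_sentences` is hypothetical and is not part of the class above. The scores are copied from the output above, and placeholder labels stand in for the real sentences.

```python
import numpy as np
from typing import List, Sequence


def select_top_sentences(sentences: Sequence[str], scores: np.ndarray, p: float) -> List[str]:
    """Hypothetical helper: keep the top `p` fraction of sentences by score."""
    # Assumption: `p` is the fraction of sentences to keep; always keep at least one.
    n_keep = max(1, int(np.ceil(p * len(sentences))))
    # argsort is ascending, so the last n_keep indices are the highest scores.
    top_idx = np.argsort(scores)[-n_keep:]
    # Return the selected sentences in their original article order.
    return [sentences[i] for i in sorted(top_idx)]


# Scores from the `fit` output above (article at position 10, 11 sentences).
scores = np.array([0.19790323, 0.27374447, 0.25673992, 0.24306069, 0.27074816,
                   0.37008136, 0.18136769, 0.24750664, 0.2870746, 0.34687,
                   0.23272259])
labels = [f"sentence {i}" for i in range(len(scores))]

summary = select_top_sentences(labels, scores, p=0.25)
print(summary)  # ['sentence 5', 'sentence 8', 'sentence 9']
```

Sorting the selected indices before returning keeps the summary in reading order, which tends to matter more for extractive summaries than ranking by score.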