In [7]:
SCHOOL = "AU"
data_path = f"../bias_processing/data/1/{SCHOOL.lower()}_dataset.poli.csv"
output_path = f"../bias_processing/data/3/{SCHOOL.lower()}_dataset_summarizer.poli.csv"
model = "nltk_sia"
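
Both paths above depend only on `SCHOOL`, so they can be derived and inspected in one place before starting a long run. A minimal sketch, assuming the `data/1` and `data/3` layout shown in the cell above; `build_paths`, `in_csv`, and `out_csv` are hypothetical names that do not appear in the notebook.

```python
from pathlib import Path


def build_paths(school, root="../bias_processing/data"):
    """Return (input_csv, output_csv) for a school code such as "AU".

    Hypothetical helper: it mirrors the two f-strings in the config cell.
    """
    stem = school.lower()
    in_csv = Path(root) / "1" / f"{stem}_dataset.poli.csv"
    out_csv = Path(root) / "3" / f"{stem}_dataset_summarizer.poli.csv"
    return in_csv, out_csv


in_csv, out_csv = build_paths("AU")
print(in_csv.as_posix())
print(out_csv.as_posix())
```

Calling `out_csv.parent.mkdir(parents=True, exist_ok=True)` before the run would avoid a failure at save time if the output directory does not exist yet.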

In [8]:
"""
Load a CSV produced by Sentiment_Dataset_Maker and add 4 x 3 x 3 = 36 columns:
4 topics ("Israel", "Palestine", "India", "China")
3 sentiment hypotheses (Positive, Negative, Neutral)
3 levels of granularity:
'sentence': Compute sentiment on the full, unsummarized article text.
'paragraph': Summarize each paragraph with an ML summarization model, join the summaries into one body of text, and compute sentiment on that text.
'article': Summarize the entire article in one pass with the same ML model and compute sentiment on the summary.
Save a new CSV with these added columns.
"""

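
The 4 x 3 x 3 layout described above can be enumerated directly. This is only a sketch: the `<topic>_<granularity>_<sentiment>` naming scheme is an assumption made for illustration, and the actual column names are whatever `build_csv` writes.

```python
from itertools import product

TOPICS = ["Israel", "Palestine", "India", "China"]
GRANULARITIES = ["sentence", "paragraph", "article"]
SENTIMENTS = ["neg", "pos", "neu"]

# Hypothetical naming scheme: <topic>_<granularity>_<sentiment>
columns = [
    f"{topic}_{granularity}_{sentiment}"
    for topic, granularity, sentiment in product(TOPICS, GRANULARITIES, SENTIMENTS)
]

print(len(columns))  # 36
print(columns[:3])
```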

In [9]:
%pip install transformers nltk

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [10]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/Dana/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [11]:
import pandas as pd
import csv
import os
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from statistics import mean
from transformers import pipeline

summarizer = pipeline('summarization', model='t5-base')

# Summarize each paragraph longer than 40 words and join the summaries
def summarize_paragraphs(text):
    summaries = []
    for para in text.split('\n'):
        n_words = len(para.split())
        if n_words <= 40:
            continue  # skip empty and short paragraphs
        # Cap the summary at just under half the paragraph's word count
        result = summarizer(para, max_length=n_words // 2 - 1, min_length=15)
        summaries.append(result[0]['summary_text'])
    return '\n'.join(summaries)

# Summarize the full text in a single pass
def summarize_full_text(text):
    summarized_text = summarizer(text, max_length=512, min_length=50)[0]['summary_text']
    return summarized_text

# Return the (neg, pos, neu) sentiment scores of a text.
# `keyword` and `method` are not used in the body below.
def get_sentiment(text, granularity, keyword, model=model, method='avg'):
    if model != "nltk_sia":
        raise ValueError(f"Unsupported sentiment model: {model!r}")

    # Instantiate the sentiment analyzer
    sia = SentimentIntensityAnalyzer()

    if granularity == 'paragraph':
        # Summarize each paragraph and join the summaries before scoring
        # TODO: Revise and check paragraph splitting, may have issues with article splitting
        text = summarize_paragraphs(text)
    elif granularity == 'article':
        # Summarize the whole article in one pass before scoring
        text = summarize_full_text(text)
    # Any other granularity (e.g. 'sentence') scores the original text as-is

    # polarity_scores returns a dict with keys 'neg', 'neu', 'pos', 'compound';
    # only the first three are used here
    scores = sia.polarity_scores(text)
    return scores['neg'], scores['pos'], scores['neu']

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
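
The paragraph path in the cell above has two rules that are easy to miss: paragraphs of 40 words or fewer are dropped entirely, and the `max_length` passed to the summarizer is just under half the paragraph's word count. The sketch below isolates those rules so they can be checked without loading t5-base. `paragraph_jobs`, `summarize_paragraphs_with`, and `fake_summarize` are hypothetical names, and `fake_summarize` is a stand-in that only truncates, not a real summarizer.

```python
def paragraph_jobs(text, min_words=40):
    """Yield (paragraph, max_length) for paragraphs longer than min_words."""
    for para in text.split("\n"):
        n_words = len(para.split())
        if n_words > min_words:
            # Same bound as the notebook: just under half the word count
            yield para, n_words // 2 - 1


def summarize_paragraphs_with(summarize, text):
    """Join per-paragraph summaries; `summarize` is any callable(str, int) -> str."""
    return "\n".join(summarize(para, limit) for para, limit in paragraph_jobs(text))


def fake_summarize(para, max_length):
    """Illustration only: keep the first max_length words."""
    return " ".join(para.split()[:max_length])


# One short line, one 50-word paragraph, one blank line, one 40-word paragraph
text = "short intro\n" + " ".join(["word"] * 50) + "\n\n" + " ".join(["w"] * 40)
summary = summarize_paragraphs_with(fake_summarize, text)
print(len(summary.split()))
```

Only the 50-word paragraph survives the filter, so a text made up entirely of short paragraphs produces an empty string. Note also that the real pipeline measures `max_length` in tokens, while this bound is computed from whitespace-separated words.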


In [12]:
from sentiment_calculater import build_csv

build_csv(data_path, output_path, get_sentiment)

Token indices sequence length is longer than the specified maximum sequence length for this model (1512 > 512). Running this sequence through the model will result in indexing errors


Article: As the crowd streamed out of W... | Keyword: conservative | Granularity: article | Neg: 0.049 | Pos: 0.073 | Neu: 0.878
Article: As the crowd streamed out of W... | Keyword: conservative | Granularity: paragraph | Neg: 0.0 | Pos: 0.123 | Neu: 0.877
Article: As the crowd streamed out of W... | Keyword: conservative | Granularity: sentence | Neg: 0.016 | Pos: 0.083 | Neu: 0.901
Article: Being at AU for four years now... | Keyword: conservative | Granularity: article | Neg: 0.119 | Pos: 0.0 | Neu: 0.881


Your max_length is set to 512, but your input_length is only 361. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=180)


Article: Being at AU for four years now... | Keyword: conservative | Granularity: paragraph | Neg: 0.044 | Pos: 0.13 | Neu: 0.826
Article: Being at AU for four years now... | Keyword: conservative | Granularity: sentence | Neg: 0.098 | Pos: 0.094 | Neu: 0.808
Article: Some of your readers may recal... | Keyword: conservative | Granularity: article | Neg: 0.105 | Pos: 0.043 | Neu: 0.852
Article: Some of your readers may recal... | Keyword: conservative | Granularity: paragraph | Neg: 0.068 | Pos: 0.0 | Neu: 0.932
Article: Some of your readers may recal... | Keyword: conservative | Granularity: sentence | Neg: 0.057 | Pos: 0.047 | Neu: 0.896
Article: The national teenage birthrate... | Keyword: conservative | Granularity: article | Neg: 0.047 | Pos: 0.0 | Neu: 0.953


Your max_length is set to 512, but your input_length is only 73. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=36)


Article: The national teenage birthrate... | Keyword: conservative | Granularity: paragraph | Neg: 0.0 | Pos: 0.039 | Neu: 0.961
Article: The national teenage birthrate... | Keyword: conservative | Granularity: sentence | Neg: 0.022 | Pos: 0.08 | Neu: 0.898
Article: If Ronald Reagan could see AU'... | Keyword: conservative | Granularity: article | Neg: 0.0 | Pos: 0.174 | Neu: 0.826
Article: If Ronald Reagan could see AU'... | Keyword: conservative | Granularity: paragraph | Neg: 0.0 | Pos: 0.286 | Neu: 0.714
Article: If Ronald Reagan could see AU'... | Keyword: conservative | Granularity: sentence | Neg: 0.0 | Pos: 0.181 | Neu: 0.819
Article: As I watched Ted Kennedy laid ... | Keyword: conservative | Granularity: article | Neg: 0.0 | Pos: 0.206 | Neu: 0.794
Article: As I watched Ted Kennedy laid ... | Keyword: conservative | Granularity: paragraph | Neg: 0.077 | Pos: 0.127 | Neu: 0.797
Article: As I watched Ted Kennedy laid ... | Keyword: conservative | Granularity: sentence | Neg: 0.

KeyboardInterrupt: 
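
Because the run was interrupted, the progress lines above are the only record of the scores computed so far. A minimal sketch for recovering them, assuming the line format is exactly the one printed above; `LOG_LINE` and `parse_log_line` are hypothetical names. It also checks the VADER property that the `neg`, `neu`, and `pos` proportions sum to about 1, up to rounding.

```python
import re

LOG_LINE = re.compile(
    r"Article: (?P<article>.*?) \| Keyword: (?P<keyword>.*?) \| "
    r"Granularity: (?P<granularity>\w+) \| "
    r"Neg: (?P<neg>[\d.]+) \| Pos: (?P<pos>[\d.]+) \| Neu: (?P<neu>[\d.]+)"
)


def parse_log_line(line):
    """Return a dict for one progress line, or None if the line does not match."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    row = match.groupdict()
    for key in ("neg", "pos", "neu"):
        row[key] = float(row[key])
    return row


line = (
    "Article: As the crowd streamed out of W... | Keyword: conservative | "
    "Granularity: article | Neg: 0.049 | Pos: 0.073 | Neu: 0.878"
)
row = parse_log_line(line)
print(row["granularity"], round(row["neg"] + row["pos"] + row["neu"], 3))
```

A line cut off by the interrupt, such as the last one above, does not match the pattern and is returned as `None` rather than as a partial row.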