<a href="https://colab.research.google.com/github/bdadeveloper1/MachineLearningProjects/blob/main/Experiment_Slant_Rewriting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h3>Part One: Generate Summaries</h3>

In [None]:
# The cells below use the legacy openai.Completion API, which was removed in openai 1.0
!pip install "openai<1"

import openai
import pandas as pd
import datetime
import numpy as np

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re

# Paste your OpenAI API key between the quotes before running the cells below
openai.api_key = ''


In [None]:
# This function provides a dictionary with context for GPT-3 in generating its outputs

def set_params(topic, position, slant, context=False):
    params = {}
    
    # The topic should be described briefly, in five words or less (e.g. "Green New Deal")
    params['topic'] = topic
    
    # The context is optional and should be no longer than a phrase (e.g. "a resolution recently introduced to Congress")
    params['context'] = context
    
    # Position: should the new article "support" or "oppose" the topic? 
    params['position'] = position
    
    # Slant: what should the slant of the new article be? (e.g. "conservative", "anti-China", "pro-Bernie")
    params['slant'] = slant
    
    return params

In [None]:
# This function generates the prompt for GPT-3 to summarize the article

def step_one_prompt(params, article):
    if params['context'] == False:
        prompt_string = "Below is a short article about {}. ".format(params['topic'])
    else:
        prompt_string = "Below is a short article about {}{}. ".format(params['topic'], params['context'])
        
    prompt_string += "After reading it, summarize five key takeaways about {}.\n\nText: ".format(params['topic'])
    prompt_string += article
    prompt_string += """\n\nNow summarize in short bullet points five key takeaways from this passage. These should be short bullet points under 10 words and they should not directly repeat the original text.\n1:"""
    
    return prompt_string

In [None]:
# This function gets five summaries from GPT-3 and returns them in list format

def summarize_article(params, article, n=5, temp=0.7):
    input_string = step_one_prompt(params, article)
    response_full = openai.Completion.create(engine='davinci-instruct-beta', 
                                             prompt=input_string, 
                                             max_tokens=400, 
                                             n=n, 
                                             temperature=temp, 
                                             frequency_penalty=0.2)
    responses = [response_full.get('choices')[i].text.strip() for i in range(n)]
    return responses

In [None]:
# Example prompt string (without GPT-3 call)

# The slant is passed without "strongly" because step_two_prompt already adds that word
params = set_params('the Green New Deal', 'oppose', 'conservative', context=', a resolution recently introduced to Congress')

article = "On February 7, Senator Ed Markey (D-MA) and Congresswoman Alexandria Ocasio-Cortez (D-NY) introduced a Green New Deal resolution to Congress that spells out a transformative path forward on the road to decarbonization. The plan provides an ambitious roadmap to not only decarbonize the economy but also make it fairer, by using taxes on the rich to fund a massive push for renewable energy that doubles as an ambitious jobs-creation program. Other parts of the resolution call for re-investing in American infrastructure and taking steps to ensure that the most vulnerable populations can be adequately protected from the costs of a changing climate. Whether or not it passes, the Green New Deal will undoubtedly come to define the climate goals of the progressive movement for the coming years."

print(step_one_prompt(params, article))

Below is a short article about the Green New Deal, a resolution recently introduced to Congress. After reading it, summarize five key takeaways about the Green New Deal.

Text: On February 7, Senator Ed Markey (D-MA) and Congresswoman Alexandria Ocasio-Cortez (D-NY) introduced a Green New Deal resolution to Congress that spells out a transformative path forward on the road to decarbonization. The plan provides an ambitious roadmap to not only decarbonize the economy but also make it fairer, by using taxes on the rich to fund a massive push for renewable energy that doubles as an ambitious jobs-creation program. Other parts of the resolution call for re-investing in American infrastructure and taking steps to ensure that the most vulnerable populations can be adequately protected from the costs of a changing climate. Whether or not it passes, the Green New Deal will undoubtedly come to define the climate goals of the progressive movement for the coming years.

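Now summarize in short bullet points five key takeaways from this passage. These should be short bullet points under 10 words and they should not directly repeat the original text.
1: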

<h3>Part Two: Quality Checks for Summaries</h3>

In [None]:
# This function creates a df of the five generated summaries with 
# fields for the relevant measures we'll use to evaluate them

def init_response_df(responses):
    df = pd.DataFrame(responses, columns=['text'])
    
    n = len(responses)
    
    df['sentences'] = [[]] * n
    df['tokens'] = [[]] * n
    df['format_check'] = [False] * n
    df['repetition_check'] = [False] * n
    df['repetition_score'] = [1] * n
    df['avg_length'] = [100] * n
    df['quality_score'] = [0] * n
    
    return df

In [None]:
# This function splits the full text output for each summary into a list of its bullet points
# It then checks that the GPT-3 output contained five distinct bullets

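def check_format(df):
    responses = df.text.tolist()
    
    # The prompt ends with "1:", so each response starts with the first bullet
    # and is split at "\n2: ", "\n3: ", and so on
    responses_split = [re.split(r'\n\d: ', response) for response in responses]
    
    df['sentences'] = responses_split
    
    format_check = [len(r) == 5 for r in responses_split]
    df['format_check'] = format_check
    
    if sum(format_check) == 0:
        print('Quality check failed: no outputs formatted correctly.')
    
    # Always return the df so the later steps can run; failed outputs receive a quality score of 0
    return df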
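
To make the format check concrete, here is a minimal sketch that splits a made-up completion the same way. The sample text is invented for illustration and is not real model output. Because the prompt already ends in `1:`, a well-formed completion begins with the text of the first bullet and continues with `2:` through `5:`.

```python
import re

# Hypothetical completion text (invented for illustration, not real model output).
sample = (
    "Resolution introduced by Markey and Ocasio-Cortez\n"
    "2: Aims to decarbonize the economy\n"
    "3: Funded by taxes on the rich\n"
    "4: Doubles as a jobs program\n"
    "5: Protects vulnerable populations"
)

# Same pattern as check_format: a newline, one digit, a colon and a space.
bullets = re.split(r'\n\d: ', sample)

print(len(bullets))        # number of pieces after the split
print(len(bullets) == 5)   # the format check passes only for exactly five pieces
```

A completion that stops early, or that numbers its bullets in a different style, produces a different number of pieces and fails the check.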

In [None]:
# This function checks that not only does a GPT-3 output contain five bullet points, but none of those
# bullet points are an exact duplicate of another (a fairly common problem)

def remove_repeats(df):
    responses = df.sentences.tolist()
    repetition_check = [len(set(r)) == 5 for r in responses]
    df['repetition_check'] = repetition_check
    
    if sum(repetition_check) == 0:
        print('Quality check failed: all outputs contained repetition.')
    
    # Always return the df so the later steps can run; failed outputs receive a quality score of 0
    return df

In [None]:
# This function breaks the summaries into tokens and lemmatizes them. 
# The result is a list of lists for each summary 

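import nltk

# Download the NLTK data used below (avoids a LookupError on a fresh runtime);
# some NLTK versions also need 'omw-1.4' for the WordNet lemmatizer
for resource in ['stopwords', 'wordnet', 'omw-1.4']:
    nltk.download(resource, quiet=True)

stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()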

def tokenize(df):
    tokens = []
    outputs = df.sentences.tolist()
    
    for sentences in outputs:
        output_toks = []
        for sent in sentences:
            
            # Remove numbers and punctuation
            clean = re.sub(r'\d+|\W+', ' ', sent.lower())
            
            # Split into words on whitespace characters
            toks = re.split(r'\s+', clean.strip())
            
            # Remove stop words and any empty strings left over from the split
            toks = [tok for tok in toks if tok and tok not in stop]
            
            # Lemmatize
            lemmas = [lemmatizer.lemmatize(tok) for tok in toks]
            output_toks.append(lemmas)
            
        tokens.append(output_toks)
    df['tokens'] = tokens
    return df
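
The cleaning steps can be tried on a single bullet without NLTK. This is a minimal sketch under two stated assumptions: a tiny hand-written stopword set stands in for NLTK's English list, and lemmatization is skipped. `clean_tokens` is a hypothetical helper, not part of the notebook.

```python
import re

# Assumption: a tiny stand-in stopword set; the notebook uses NLTK's full English list.
STOP = {"the", "a", "to", "by", "on", "of", "as"}

def clean_tokens(sentence):
    # Replace digits and runs of non-word characters with spaces, as in tokenize().
    clean = re.sub(r'\d+|\W+', ' ', sentence.lower())
    # Split on whitespace, then drop stopwords and empty strings.
    return [tok for tok in re.split(r'\s+', clean.strip()) if tok and tok not in STOP]

tokens = clean_tokens("Funded by taxes on the rich (top 1%)")
print(tokens)
```

The digit and the punctuation disappear, and only the content words remain as tokens.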


In [None]:
# This function creates a repetition score by measuring how much the five bullet points overlap with each other

def calc_repetition_score(df):
    outputs = df.tokens.tolist()
    
    repetition_scores = []
    
    for output in outputs:
        
        # Create zero-matrix of size (5,5) (assuming output correctly contains 5 bullet points)
        sims = np.zeros(shape=(len(output), len(output)))
        
        for i in range(len(output)):
            for j in range(len(output)):
                
                # Calculate pairwise comparisons as number of tokens that show up in both bullet points
                # divided by the number of unique tokens in the shorter bullet point
                num_overlaps = len(set(output[i]) & set(output[j]))
                denom = min(len(set(output[i])), len(set(output[j])))
                
                # Guard against bullet points with no tokens left after cleaning
                sims[i][j] = num_overlaps / denom if denom > 0 else 0
        
        # Average each row's similarities, excluding the diagonal (each bullet compared with itself)
        score_by_col = (sims.sum(1)-np.diag(sims))/max(sims.shape[1]-1, 1)
        
        # Take the mean of all pairwise comparisons as the overall repetition score
        score = np.mean(score_by_col)
        repetition_scores.append(score)
        
    df['repetition_score'] = repetition_scores
    
    return df
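
A small worked example may help with reading the repetition score. The sketch below restates the same calculation in plain Python (no NumPy) on invented token lists. `overlap` and `repetition_score` are hypothetical helper names used only here.

```python
def overlap(a, b):
    # Shared tokens divided by the size of the smaller token set.
    a, b = set(a), set(b)
    denom = min(len(a), len(b))
    return len(a & b) / denom if denom else 0.0

def repetition_score(bullets):
    # Mean overlap over all ordered pairs of distinct bullets (the diagonal is excluded).
    n = len(bullets)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    if not pairs:
        return 0.0
    return sum(overlap(bullets[i], bullets[j]) for i, j in pairs) / len(pairs)

# Hypothetical token lists, invented for illustration.
toy = [
    ["plan", "decarbonize", "economy"],
    ["plan", "tax", "rich"],
    ["job", "program"],
]
print(repetition_score(toy))
```

Only the first two bullets share a token ("plan"), so two of the six ordered pairs score 1/3 and the rest score 0, which gives a mean of 1/9. A score near 0 means the bullets cover different ground; a score near 1 means they largely repeat each other.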

In [None]:
# This function calculates the average length (in non-stopword tokens) of each bullet point

def calc_avg_length(df):
    avg_lengths = []
    outputs = df.tokens.tolist()
    
    for output in outputs:
        avg = np.mean([len(sent) for sent in output])
        avg_lengths.append(avg)
        
    df['avg_length'] = avg_lengths
    
    return df

In [None]:
# This function calculates an overall quality score using the repetition_score and avg_length fields

def calc_quality_score(df):
    quality_scores = []
    
    for i in range(len(df)):
        
        # Assign score of 0 if output does not consist of five unique bullet points
        if df['format_check'][i] == False:
            quality_scores.append(0)
        elif df['repetition_check'][i] == False:
            quality_scores.append(0)
            
        # Otherwise, calculate score as 100 - 100*the repetition score - 5*the distance of the average bullet length from 7 tokens
        else:
            dist_from_seven = np.abs(7 - df['avg_length'][i])
            quality_scores.append(100 - df['repetition_score'][i]*100 - dist_from_seven*5)
    
    df['quality_score'] = quality_scores
    
    return df
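
The scoring formula is easy to check by hand. The sketch below is a standalone restatement of it; the parameter names `target_length` and `length_weight` are hypothetical labels for the constants 7 and 5 in the function above, and the inputs are invented.

```python
def quality_score(repetition_score, avg_length, target_length=7, length_weight=5):
    # 100, minus 100x the repetition score, minus 5 points per token of distance from 7.
    return 100 - 100 * repetition_score - length_weight * abs(target_length - avg_length)

# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
print(quality_score(0.05, 6.0))   # 100 - 5 - 5
print(quality_score(0.20, 9.0))   # 100 - 20 - 10
```

Under this formula a summary clears the threshold of 90 used in Part Four only when its bullets barely overlap and average close to seven tokens each.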

In [None]:
# This function uses the previous functions to pick the best summary and returns the text and score

def pick_best_summary(summaries):
    df = init_response_df(summaries)
    df = check_format(df)
    df = remove_repeats(df)
    df = tokenize(df)
    df = calc_repetition_score(df)
    df = calc_avg_length(df)
    df = calc_quality_score(df)
    
    max_row = df[df.quality_score == df.quality_score.max()].reset_index()
    return max_row.text[0], df.quality_score.max()

<h3>Part Three: Using the Summary to Rewrite the Article from a New Slant</h3>

In [None]:
# This function uses the same dictionary of context to generate a prompt for GPT-3

def step_two_prompt(params, summary):
    if params['context'] == False:
        prompt_string = "Below is a short list of facts regarding {}:".format(params['topic'])
    else:
        prompt_string = "Below is a short list of facts regarding {}{}:".format(params['topic'], params['context'])
        
    prompt_string += "\n\n1: "
    prompt_string += summary
    prompt_string += "\n\nUse this set of facts to write a headline and accompanying article that {}s {} from a strongly {} slant.".format(params['position'], params['topic'], params['slant'])
    prompt_string += "\n\nHeadline:"
    
    return prompt_string
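
For comparison with the Part One example, here is a minimal sketch of the prompt this step produces. `build_rewrite_prompt` is a hypothetical compact restatement of `step_two_prompt` so that the block runs on its own, and the summary text is invented for illustration.

```python
# Hypothetical compact restatement of the prompt template above.
def build_rewrite_prompt(topic, position, slant, summary, context=""):
    return (
        "Below is a short list of facts regarding {}{}:".format(topic, context)
        + "\n\n1: " + summary
        + "\n\nUse this set of facts to write a headline and accompanying article "
        + "that {}s {} from a strongly {} slant.".format(position, topic, slant)
        + "\n\nHeadline:"
    )

# Invented summary in the "1: ... 2: ..." layout that Part One produces.
summary = "Resolution introduced to Congress\n2: Aims to decarbonize the economy"

prompt = build_rewrite_prompt(
    topic="the Green New Deal",
    position="oppose",
    slant="conservative",
    summary=summary,
    context=", a resolution recently introduced to Congress",
)
print(prompt)
```

Note that the template itself inserts the word "strongly" and appends an "s" to the position, so the slant reads best as a bare adjective such as "conservative" and the position as a bare verb such as "oppose".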


In [None]:
# This function calls GPT-3 to generate a rewrite

def rewrite_from_summary(params, summary, n_outputs=1, temp=0.7):
    input_string = step_two_prompt(params, summary)
    response_full = openai.Completion.create(engine='davinci-instruct-beta',
                                            prompt=input_string,
                                            max_tokens=200,
                                            n=n_outputs,
                                            temperature=temp,
                                            frequency_penalty=0.2)
    responses = [response_full.get('choices')[i].text.strip() for i in range(n_outputs)]
    return responses
    

In [None]:
# This function also performs a quality check on the rewrite

# It contains much of the same functionality as the previous quality checks, but in more compressed form

def quality_check_rewrite(rewrites):
    repetition_scores = []
    
    for rewrite in rewrites:
        
        # Remove numbers and punctuation (lowercase first so that stopword matching works)
        rewrite_clean = re.sub(r'\d+|\W+', ' ', rewrite.lower())
        
        # Tokenize on whitespace and remove stopwords and empty strings
        toks = re.split(r'\s+', rewrite_clean.strip())
        toks = [t for t in toks if t and t not in stop]
        
        # If the length of the output is over 75 tokens, assign a repetition score of 1 minus the number of
        # unique tokens divided by the total number of tokens; else set repetition score to 1
        
        # (Sometimes the outputs would cut off suddenly; the length check weeded out those outputs)
        if len(toks) > 75:
            repetition_scores.append(1 - len(set(toks))/len(toks))
        else:
            repetition_scores.append(1)
    
    # Use the repetition score to identify the best rewrite
    for r, s in zip(rewrites, repetition_scores):
        if s == min(repetition_scores):
            best_rewrite = r
            
    return best_rewrite, min(repetition_scores)
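
The rewrite score can be illustrated on toy token lists. This is a minimal sketch of the same rule; `rewrite_repetition_score` and `min_tokens` are hypothetical names, and the token lists are invented.

```python
def rewrite_repetition_score(tokens, min_tokens=75):
    # Outputs at or below the length cutoff get the worst possible score, 1.
    if len(tokens) <= min_tokens:
        return 1
    # Otherwise: 1 minus the share of tokens that are unique.
    return 1 - len(set(tokens)) / len(tokens)

# Hypothetical token lists, invented for illustration.
varied = ["tok{}".format(i) for i in range(100)]   # 100 tokens, all distinct
loopy = ["tax", "plan"] * 50                       # 100 tokens, only 2 distinct
short = ["too", "short"]                           # cut-off output

print(rewrite_repetition_score(varied))
print(rewrite_repetition_score(loopy))
print(rewrite_repetition_score(short))
```

Lower is better: a long rewrite with no repeated tokens scores 0, a rewrite stuck in a loop scores close to 1, and a truncated rewrite is assigned 1 outright.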


<h3>Part Four: Combining It All</h3>

In [None]:
# This function combines all of the previous steps into one function call

def rewrite_from_raw(article, # The text of the article
                     topic, # Context field for GPT-3
                     position, # The intended position of the rewrite
                     slant, # The intended slant of the rewrite
                     context=False, # Optional field to provide additional context
                     n_summaries=5, # Number of summaries generated at each intermediate call
                     n_outputs=1, # Number of rewrites generated at each call
                     temperature=0.7):
    
    # First generate dictionary of context fields
    params = set_params(topic, position, slant, context=context)
    
    # Generate summaries in batches of five until either a quality score over 90 is obtained or 25 summaries have been generated
    quality_score = 0
    i = 0
    while i < 5 and quality_score < 90:
        summaries = summarize_article(params, article, n=n_summaries, temp=temperature)
        best_summary, quality_score = pick_best_summary(summaries)
        i += 1
        if quality_score >= 90:
            print('Summary obtained at iteration {}'.format(i))
        elif i == 5:
            print('Failed to obtain quality score > 90; proceeding with best alternative.')

            
    # Generate rewrites in batches of n_outputs until either a repetition score of 0.5 or lower is obtained or 5 batches have been generated
    repetition_score = 1
    i = 0
    while i < 5 and repetition_score > 0.5:
        rewrites = rewrite_from_summary(params, best_summary, n_outputs=n_outputs, temp=temperature)
        best_rewrite, repetition_score = quality_check_rewrite(rewrites)
        i += 1
        if repetition_score <= 0.5:
            print('Rewrite obtained at iteration {}'.format(i))
        elif i == 5:
            print('Failed to obtain sufficiently long rewrite; proceeding with best alternative.')
    
    print('————————')
    
    return best_rewrite
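
Both loops in this function follow the same pattern: generate, score, and stop early once the score passes a threshold or the attempt limit is reached. The sketch below isolates that pattern with a stand-in generator so it runs without any API call. `retry_until` and `fake_generate` are hypothetical names, and the scores are invented.

```python
def retry_until(generate, is_good_enough, max_attempts=5):
    # Call generate() up to max_attempts times, stopping early once the score passes.
    result, score, attempts = None, None, 0
    while attempts < max_attempts:
        result, score = generate()
        attempts += 1
        if is_good_enough(score):
            break
    return result, score, attempts

# Hypothetical stand-in for "call GPT-3, then run the quality check".
scores = iter([40, 72, 93, 99])

def fake_generate():
    s = next(scores)
    return "summary with score {}".format(s), s

best, score, attempts = retry_until(fake_generate, lambda s: s >= 90)
print(best, score, attempts)
```

Like the loops above, this sketch returns the most recent result when no attempt passes, not the highest-scoring one seen so far.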

In [None]:
# Read in a .csv file with the relevant parameters for each generation

# For each topic, there will be two rewrites for and two against the topic

df = pd.read_csv('key.csv')
df.head()

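
The contents of `key.csv` are not shown in this notebook. Judging only from the column names used in the generation loop (`original`, `topic`, `position`, `slant`, `context`), a file along the following lines would fit. This is a sketch with invented rows, and it uses the standard-library `csv` module in place of pandas so that it runs on its own.

```python
import csv
import io

# Hypothetical file contents; only the column names come from the generation loop.
sample_csv = io.StringIO(
    "original,topic,position,slant,context\n"
    '"Article text goes here.",the Green New Deal,oppose,conservative,'
    '", a resolution recently introduced to Congress"\n'
    '"Another article.",carbon taxes,support,progressive,\n'
)

rows = list(csv.DictReader(sample_csv))
for row in rows:
    # An empty context cell becomes False, the "no context" value used by set_params.
    row["context"] = row["context"] or False

print(rows[0]["context"])
print(rows[1]["context"])
```

With `pd.read_csv`, an empty `context` cell is read as NaN by default, not as False, so a similar normalization step may be needed before the value is passed to `rewrite_from_raw`.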

In [None]:
# Finally, generate the GPT-3 outputs and append to the df, then save as a .csv

gpt3_outputs = []

for i in range(len(df)):
    # An empty context cell is read by pandas as NaN; treat it as "no context"
    context = df['context'][i] if pd.notna(df['context'][i]) else False
    
    output = rewrite_from_raw(article=df['original'][i], 
                              topic=df['topic'][i], 
                              position=df['position'][i], 
                              slant=df['slant'][i], 
                              context=context, 
                              n_summaries=5,
                              n_outputs=5,
                              temperature=0.75)
    
    # Store None when no rewrite was produced
    gpt3_outputs.append(output if output else None)
        
df['output'] = gpt3_outputs

df.to_csv('rewritings.csv')