<h1 style="text-align:center;">2024 Six Nations Drama:<br> Analyzing YouTube Reactions to Try Decision <br>Did France Deserve the Win or Was Scotland Robbed?</h1>

<p style="text-align:center; font-size:large;">Created By: Christopher Castor</p>

## Overview
The 2024 Six Nations match between Scotland and France ended in controversy when an inconclusive last-second try review decided the outcome. On-field referee Nic Berry called no try, and although replays suggested the ball may have been grounded, the decision stood and France held on for a tense victory. The call has sparked widespread debate among rugby fans.

This Jupyter notebook analyzes public sentiment about the call using the comments on the [official Six Nations YouTube highlight video](https://www.youtube.com/watch?v=Rcst-jIOQDo). The goal is to estimate whether most commenters agree with the no-try decision, using OpenAI's chat model to classify each comment and VADER sentiment analysis to score it.

## Data
Comments and their like counts were collected from the highlight video with the YouTube Data API. Each comment was then evaluated with OpenAI's API to classify it as agreeing or disagreeing with the contentious last-second call.
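To make the data shape concrete, the sketch below flattens one page of comment threads into the `Comment` / `Like_Count` rows used later in the notebook. The `sample_response` dictionary and the `extract_rows` helper are illustrative assumptions: the comments are made up and the structure is trimmed to the fields the notebook reads (`textDisplay` and `likeCount`).

```python
# Hypothetical, trimmed-down page of comment threads; real responses carry more fields.
sample_response = {
    "items": [
        {"snippet": {"topLevelComment": {"snippet": {
            "textDisplay": "Clear try, Scotland were robbed",
            "likeCount": 12}}}},
        {"snippet": {"topLevelComment": {"snippet": {
            "textDisplay": "Held up. Good call by the ref",
            "likeCount": 3}}}},
    ]
}


def extract_rows(response):
    """Flatten one page of comment threads into Comment / Like_Count rows."""
    rows = []
    for item in response.get("items", []):
        # Each thread nests the top-level comment two snippets deep
        top = item["snippet"]["topLevelComment"]["snippet"]
        rows.append({"Comment": top["textDisplay"], "Like_Count": top["likeCount"]})
    return rows


rows = extract_rows(sample_response)
print(rows)
```

A list of such rows can be passed straight to `pd.DataFrame`, which is how the notebook builds its working table.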

## Table of Contents

- <a href='#1'> 1. Import Libraries</a>
- <a href='#2'> 2. Define Functions</a>
- <a href='#3'> 3. Retrieve and Clean Data</a>
- <a href='#4'> 4. Determine Sentiment</a>
- <a href='#5'> 5. Sentiment Distribution</a>
- <a href='#6'> 6. Word Clouds</a>

# <a id='1'>1. Import Libraries</a>

In [None]:
# Import standard libraries
import pandas as pd
import numpy as np
import os
from dotenv import load_dotenv
from collections import Counter

# Import api libraries
from googleapiclient.discovery import build
from openai import OpenAI

# Import natural language processing (nlp)
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import html

# Import visualization libraries
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-dark')
import seaborn as sns
from wordcloud import WordCloud

# Load spaCy model with TextBlob capabilities
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("spacytextblob")
nltk.download('vader_lexicon')

# Import API Key and Video ID
load_dotenv()
api_key = os.getenv('API_KEY')
video_id = os.getenv('VIDEO_ID')

# Initialize OpenAI client
client = OpenAI(api_key = os.getenv('OPENAI_KEY'))

# <a id='2'>2. Define Functions</a>

In [None]:
def generate_decision(prompt):
    '''
    Generate a decision using OpenAI's chat model.

    Parameters:
    - prompt (str): The input prompt for the chat model.

    Returns:
    - decision (str): The generated decision based on the prompt.
    '''
    # Create chat completion with the given prompt
    response = client.chat.completions.create(
        model = "gpt-3.5-turbo-0125",
        messages=[{'role':'user', 'content':prompt}]
    )

    decision = response.choices[0].message.content

    return decision
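
Because the notebook makes one API call per comment, a single transient failure can abort the whole run. The sketch below shows one possible way to wrap such a call with exponential backoff. `call_with_retry` and `flaky_decision` are hypothetical names that are not part of the notebook, and the broad `except Exception` is a simplification; in practice it would be narrowed to the specific errors the client raises.

```python
import time


def call_with_retry(func, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Exception:
            # Give up after the final attempt and surface the original error
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for generate_decision that fails twice before succeeding
attempts = {"count": 0}


def flaky_decision(prompt):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return "Disagree:4"


# The sleep function is replaced with a no-op so the demo runs instantly
result = call_with_retry(flaky_decision, "some prompt", sleep=lambda seconds: None)
print(result, attempts["count"])
```

With this helper, `generate_decision` could be invoked as `call_with_retry(generate_decision, comment_prompt)` without changing its signature.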

In [None]:
def get_decisions(comment):
    '''
    Generate decisions for a comment using OpenAI's chat model.

    Parameters:
    - comment (str): The main comment for which a decision is generated.

    Note: relies on the global variable `common_prompt`, which is prepended to
    each comment and must be defined before this function is called.

    Returns:
    - comment_decision (str): The generated decision and confidence for the comment.
    '''
    # Generate decision for the comment
    comment_prompt = common_prompt + '\n\n"' + comment + '"'
    comment_decision = generate_decision(comment_prompt)

    return comment_decision

In [None]:
def clean_text(text):
    '''
    Clean and decode HTML entities from the input text.

    Parameters:
    - text (str): The text to be cleaned.

    Returns:
    - html.unescape(text) (str): The text after cleaning and decoding HTML entities.
    '''
    return html.unescape(text)
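
As a small illustration of what the decoding step does, the sketch below applies `html.unescape` to a made-up comment. It also strips simple markup, on the assumption that the returned comment text can contain tags such as `<br>`; `clean_comment` is a hypothetical extension and not the notebook's `clean_text`.

```python
import html
import re


def clean_comment(text):
    """Hypothetical extension of clean_text: flatten simple markup, then decode entities."""
    text = re.sub(r"<br\s*/?>", " ", text)  # line breaks become spaces
    text = re.sub(r"<[^>]+>", "", text)     # drop any remaining tags
    # Decode entities last so escaped text such as &lt;b&gt; is not mistaken for a tag
    return html.unescape(text).strip()


raw = "Scotland were robbed &amp; it&#39;s a shame<br>Great game though"
cleaned = clean_comment(raw)
print(cleaned)
```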

In [None]:
def video_reactions(video_id, api_key):
	'''
	Retrieve comments data for a YouTube video.

	Parameters:
	- video_id (str): The YouTube video ID.
	- api_key (str): The YouTube Data API key.

	Returns:
	- df (DataFrame): DataFrame containing comments and likes.
	'''
	# Create a list to store comments and related information
	comments_data = []

	# Create YouTube resource object
	youtube = build('youtube', 'v3', developerKey=api_key)

	# Retrieve YouTube video comments
	video_response = youtube.commentThreads().list(
		part='snippet',
		videoId=video_id
	).execute()

	# Iterate through video comments
	while video_response:
	
		# Extract information from each object
		for item in video_response['items']:
		
			# Extract comment text
			comment_text = item['snippet']['topLevelComment']['snippet']['textDisplay']
			
			# Extract number of likes for each comment
			likes_count = item['snippet']['topLevelComment']['snippet']['likeCount']

			# Append information to the list
			comments_data.append({'Comment': comment_text, 'Like_Count': likes_count})

		# Repeat for the next page if available
		if 'nextPageToken' in video_response:
			video_response = youtube.commentThreads().list(
					part = 'snippet',
					videoId = video_id,
					pageToken = video_response['nextPageToken']
				).execute()
		else:
			break

	# Create DataFrame containing comments and related information
	df = pd.DataFrame(comments_data)

	return df
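
The pagination logic above can be viewed in isolation: keep requesting pages until a response arrives without a `nextPageToken`. The sketch below reproduces that loop against an in-memory stand-in, so it runs without network access. `collect_all_pages`, `fetch_page`, and `fake_pages` are illustrative names and not part of the YouTube client library.

```python
def collect_all_pages(fetch_page):
    """Follow nextPageToken until it is absent.

    fetch_page(token) must return one response dict; token is None for the first page.
    """
    items = []
    token = None
    while True:
        response = fetch_page(token)
        items.extend(response.get("items", []))
        token = response.get("nextPageToken")
        if not token:
            break
    return items


# In-memory stand-in for the API: two pages, the second has no nextPageToken
fake_pages = {
    None: {"items": ["comment 1", "comment 2"], "nextPageToken": "page-2"},
    "page-2": {"items": ["comment 3"]},
}

all_items = collect_all_pages(lambda token: fake_pages[token])
print(all_items)
```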

In [None]:
def generate_distribution(counts, percentages, color, title):
    '''
    Generate a bar plot to visualize the distribution of categories.

    Parameters:
    - counts (pd.Series): Series containing counts for each category.
    - percentages (list): List of percentage values corresponding to each category.
    - color (str): Color of the bars in the plot.
    - title (str): Title of the plot.

    Returns:
    None (Displays the plot).
    '''
    # Create a new figure and axis
    fig, ax = plt.subplots()

    # Create bars for each category with the specified color
    bars = ax.bar(counts.index, counts, color=color)

    # Add percentage labels above each bar
    for bar, percentage in zip(bars, percentages):
        yval = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2, yval + 0.05, f'{percentage: .1f}%', ha='center', va='bottom', weight='bold')

    # Set the title of the plot
    ax.set_title(title, fontsize=14, fontweight='bold', y=1.05)

    # Display the plot
    plt.show()

In [None]:
def calculate_average_sentiment(df, col):
    '''
    Calculate the average sentiment score for a specified column in a DataFrame.

    Parameters: 
    - df (DataFrame): Input DataFrame containing the text data.
    - col (str): Name of the column in the DataFrame containing the text data.

    Returns:
    - average_sentiment (float): Average sentiment score for the specified column.
    '''
    # Ensure the specified column exists in the DataFrame
    if col not in df.columns:
        raise ValueError(f"Column '{col}' does not exist")
    
    # Initialize the VADER sentiment analyzer
    sid = SentimentIntensityAnalyzer()

    # Score each text with NLTK VADER and keep the compound score
    sentiment_scores = [sid.polarity_scores(text)['compound'] for text in df[col]]

    # Guard against an empty selection to avoid dividing by zero
    if not sentiment_scores:
        return float('nan')

    # Calculate the average sentiment score
    average_sentiment = sum(sentiment_scores)/len(sentiment_scores)
    return average_sentiment

In [None]:
def clean_text_word_cloud_word(text):
    '''
    Tokenize and clean the input text for word cloud generation.

    Parameters:
    - text (str): Input text to be processed

    Returns:
    - cleaned_text (str): Cleaned and processed text suitable for word cloud generation.
    '''
    # Tokenize the input text using spaCy
    doc = nlp(text)

    # Initialize the VADER sentiment analyzer
    sid = SentimentIntensityAnalyzer()

    # Keep 'robbed' plus alphabetic, non-stopword tokens that have a non-zero VADER score
    cleaned_tokens = [token.text.lower() for token in doc if (
        (token.text.lower() in ['robbed']) or (
        not token.is_stop and 
        token.is_alpha and 
        (sid.polarity_scores(token.text)['compound'] != 0))
        )]
    
    # Join the cleaned tokens into a single string
    cleaned_text = ' '.join(cleaned_tokens)

    # Return the cleaned and processed text
    return cleaned_text

In [None]:
def generate_word_cloud(comments, title, stopwords=None):
    '''
    Generate a word cloud from a list of comments.

    Parameters:
    - comments (list): List of text comments.
    - title (str): Title for the word cloud plot.
    - stopwords (set): Set of stopwords to be excluded from the word cloud.

    Returns:
    None (Displays the word cloud plot)
    '''
    # Clean the comments
    cleaned_comments = [clean_text_word_cloud_word(comment) for comment in comments]

    # Create a WordCloud object; passing None falls back to the library's default stopwords
    wordcloud = WordCloud(
        width=800, height=400, max_words=1000, background_color='white',
        stopwords=stopwords or None
    ).generate(' '.join(cleaned_comments))
    
    # Plot the word cloud
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')

    # Set the title of the plot
    plt.title(title, fontsize=18, fontweight='bold', y=1.05)

    # Display the word cloud plot
    plt.show()

# <a id='3'>3. Retrieve and Clean Data</a>

In [None]:
# Retrieve comments and likes from the YouTube video
reaction_df = video_reactions(video_id, api_key)

# Clean text in the 'Comment' column
reaction_df['Comment'] = reaction_df['Comment'].apply(clean_text)

# Display number of comments
print('Comments: {}'.format(len(reaction_df)))

# <a id='4'>4. Determine Sentiment</a>

In [None]:
# Create a common prompt to be used for generating decisions
common_prompt = '''
Scotland were denied victory by an inconclusive last-second try review as France held on to win a tense Six Nations rugby encounter. Replays suggested the ball was grounded by Sam Skinner, but the on-field referee, Nic Berry, called no try and the Television Match Official (TMO) claimed not to have the evidence to conclusively prove otherwise. Fans either agree with the referees' decision of no try, implying that the fan is happy and the ball was lost and the referees did a good job and France deservedly won ('Agree') OR fans disagree with the decision of no try, implying that the fan is upset and the ball was grounded and touched the ground and it was a clear try and the referees did a bad job and Scotland was robbed and deserved to win ('Disagree'). Read a fan's statement below and use the criteria above to choose 'Agree' or 'Disagree' and provide a confidence level between 1-5, where 1 means you have low confidence and 5 is high confidence. Separate the word and number with a colon. So your reply should be in the format, word:number.
'''

# Generate decisions for each comment
reaction_df['OpenAI_Response'] = reaction_df['Comment'].apply(get_decisions)

In [None]:
# Clean returned OpenAI response
reaction_df['OpenAI_Response'] = reaction_df['OpenAI_Response'].astype(str).replace(to_replace = r'[. ]', value='', regex=True)

# Split into decision and confidence (on the first colon only, so extra colons cannot add columns)
reaction_df[['Comment_Decision','Comment_Decision_Confidence']] = reaction_df['OpenAI_Response'].str.split(':', n=1, expand=True)

# Clean the decision column
reaction_df['Comment_Decision'] = np.where(~reaction_df['Comment_Decision'].isin(['Agree','Disagree']),'Inconclusive',reaction_df['Comment_Decision'])

# Clean the confidence column: keep the leading digit; missing or unparseable values default to 1
reaction_df['Comment_Decision_Confidence'] = pd.to_numeric(reaction_df['Comment_Decision_Confidence'].str[:1], errors='coerce').fillna(1).astype(float)
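
Chat models do not always follow the requested `word:number` format exactly, so it can help to parse each response with a single tolerant pattern. The sketch below is one possible approach under that assumption. `parse_decision` and `RESPONSE_PATTERN` are hypothetical names, and unlike the cell above the match is case-insensitive. It keeps the same fallbacks: `Inconclusive` for an unrecognized decision and a confidence of 1 when none can be read.

```python
import re

# Decision word, optional separators, then an optional confidence digit from 1 to 5
RESPONSE_PATTERN = re.compile(r"\b(agree|disagree)\b\W*([1-5])?", re.IGNORECASE)


def parse_decision(response):
    """Return (decision, confidence) parsed from a raw model response."""
    match = RESPONSE_PATTERN.search(response or "")
    if not match:
        return "Inconclusive", 1.0
    decision = match.group(1).capitalize()
    # Fall back to the lowest confidence when no valid digit follows the word
    confidence = float(match.group(2)) if match.group(2) else 1.0
    return decision, confidence


for raw in ["Disagree:4", "Agree: 5.", "disagree", "I cannot tell"]:
    print(raw, "->", parse_decision(raw))
```

Applied with `reaction_df['OpenAI_Response'].apply(parse_decision)`, this would yield both cleaned columns in one pass.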

In [None]:
reaction_df['OpenAI_Response'].value_counts(dropna=False)

In [None]:
reaction_df['Comment_Decision'].value_counts(dropna=False)

In [None]:
reaction_df['Comment_Decision_Confidence'].value_counts(dropna=False)

In [None]:
reaction_df.sample(frac=0.2).head()

In [None]:
# Print the full text of one comment, selected by its index label
print(reaction_df.loc[407, 'Comment'])

In [None]:
reaction_df.sort_values(by='Like_Count', ascending=False).head()

In [None]:
# Select the most-liked comment within each decision group
agree_df = reaction_df[reaction_df['Comment_Decision']=='Agree']
disagree_df = reaction_df[reaction_df['Comment_Decision']=='Disagree']
agree_comment_with_most_likes = agree_df.loc[agree_df['Like_Count'].idxmax(), 'Comment']
disagree_comment_with_most_likes = disagree_df.loc[disagree_df['Like_Count'].idxmax(), 'Comment']

print('Agree comment with most likes: {}'.format(agree_comment_with_most_likes))
print('Disagree comment with most likes: {}'.format(disagree_comment_with_most_likes))

# <a id='5'>5. Sentiment Distribution</a>

In [None]:
# Calculate distribution of YouTube comment sentiment
comment_counts = reaction_df['Comment_Decision'].value_counts().sort_index()
comment_percentages = (comment_counts / len(reaction_df)) * 100
generate_distribution(comment_counts, comment_percentages, 'lightsteelblue', "OpenAI's Assessment: YouTube Comment Sentiment")

In [None]:
# Calculate distribution of YouTube comment likes sentiment
like_counts = reaction_df['Like_Count'].groupby(reaction_df['Comment_Decision']).sum().sort_index()
like_percentages = (like_counts / like_counts.sum()) * 100
generate_distribution(like_counts, like_percentages, 'cornflowerblue', "OpenAI's Assessment: YouTube Comment Likes Sentiment")
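
The two distributions above answer different questions: the first weights every comment equally, while the second weights each comment by its likes, so a few popular comments can dominate. The sketch below works through that difference on made-up numbers using only the standard library. The `labelled` list and the `demo_` names are illustrative and are not results from the video.

```python
from collections import Counter

# Made-up (decision, like count) pairs, purely for illustration
labelled = [
    ("Disagree", 120),
    ("Disagree", 4),
    ("Agree", 10),
    ("Inconclusive", 0),
    ("Agree", 2),
]

# Comment-level view: every comment counts once
demo_comment_counts = Counter(decision for decision, _ in labelled)

# Like-weighted view: each comment counts once per like it received
demo_like_counts = Counter()
for decision, likes in labelled:
    demo_like_counts[decision] += likes


def to_percentages(counts):
    """Convert a mapping of counts into percentages rounded to one decimal place."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: round(100 * value / total, 1) for key, value in counts.items()}


comment_pct = to_percentages(demo_comment_counts)
like_pct = to_percentages(demo_like_counts)
print(comment_pct)
print(like_pct)
```

In this toy data, 'Disagree' and 'Agree' each account for 40% of comments, yet 'Disagree' collects over 90% of the likes, which is why the notebook plots both views.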

In [None]:
# Calculate distribution of OpenAI's confidence on sentiment
confidence_counts = reaction_df['Comment_Decision_Confidence'].value_counts().sort_index()
confidence_percentages = (confidence_counts / len(reaction_df)) * 100
generate_distribution(confidence_counts, confidence_percentages, 'royalblue', "OpenAI's Confidence: YouTube Comment Sentiment")

In [None]:
# Create grouped bar chart showing OpenAI's assessment vs confidence
ax = sns.countplot(data=reaction_df, x='Comment_Decision', hue='Comment_Decision_Confidence', palette="crest")
ax.set_xlabel('')
ax.set_ylabel('')
ax.set_title("OpenAI's Assessment vs Confidence", fontsize=14, fontweight='bold', y=1.05)
ax.legend(title='Confidence')

total = len(reaction_df)
for p in ax.patches:
    height = p.get_height()
    if height > 0:
        ax.text(p.get_x() + p.get_width() / 2., height + 0.1, f'{height/total: .1%}', ha='center', va='bottom', weight='bold')

plt.show()

In [None]:
# Create a DataFrame that only has comments with a confidence score of 3 or above
confident_decisions = reaction_df[reaction_df['Comment_Decision_Confidence']>2].reset_index(drop=True)
print(len(confident_decisions))

In [None]:
average_sentiment_agree = round(calculate_average_sentiment(confident_decisions[confident_decisions['Comment_Decision']=='Agree'].reset_index(drop=True), 'Comment'),2)
average_sentiment_disagree = round(calculate_average_sentiment(confident_decisions[confident_decisions['Comment_Decision']=='Disagree'].reset_index(drop=True), 'Comment'),2)

print('Sentiment scores from VADER range from -1 to 1')
print('Negative Sentiment: Less than 0')
print('Neutral Sentiment: Equal to 0')
print('Positive Sentiment: More than 0')
print(f"Average Sentiment for Agree: {average_sentiment_agree}")
print(f"Average Sentiment for Disagree: {average_sentiment_disagree}")

# <a id='6'>6. Word Clouds</a>

In [None]:
stopwords=['win','play','fan','supporter','like','played','won']
# Generate word clouds for each sentiment
generate_word_cloud(reaction_df[reaction_df['Comment_Decision']=='Agree']['Comment'], 'Agree Sentiment', stopwords)
generate_word_cloud(reaction_df[reaction_df['Comment_Decision']=='Disagree']['Comment'], 'Disagree Sentiment', stopwords)