### 1. Initial setup: Importing libraries

This cell imports the required libraries: **pandas** for data handling and **NLTK**, whose VADER module provides the sentiment analyzer. **pathlib** is used to build file paths that work across operating systems.

In [26]:
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from pathlib import Path

### 2. Prepare the corpus

This step iterates over the four corpus CSV files in the `data` folder one level above the notebook (`../data`), loads each into a pandas DataFrame, and **assigns a `band` label** to every row (song). The four DataFrames are then concatenated into a single master DataFrame (`lyrics_df`) for unified processing. Files that are missing or cannot be parsed are reported and skipped.

In [27]:
base_path = Path('..')
corpus_filenames = {
    "Beatles": "beatles.csv",
    "CSNY": "csny.csv",
    "Grunge": "grunge.csv",
    "SNA": "sna.csv"
}

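all_data_frames = []
for band, filename in corpus_filenames.items():
    file_path = base_path / "data" / filename

    try:
        df = pd.read_csv(file_path, encoding='utf-8')
    except FileNotFoundError:
        print(f"Error: Data file not found at path: {file_path}")
        continue
    except Exception as e:
        print(f"Error processing {filename}. Check the file format (delimiters, columns): {e}")
        continue

    # Tag every row with its corpus so the groups survive concatenation.
    df['band'] = band
    all_data_frames.append(df)

# pd.concat raises ValueError on an empty list, so fail early with a clearer message.
if not all_data_frames:
    raise RuntimeError("No corpus files could be loaded; check base_path and the data folder.")

lyrics_df = pd.concat(all_data_frames, ignore_index=True)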
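
To see what the labeling and concatenation produce, here is a minimal sketch that uses two tiny in-memory CSVs in place of the real files. The band names, the `title` column, and the sample rows are assumptions made for illustration; only the `lyrics` column is confirmed by the later cells.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two corpus files; the contents are invented.
fake_files = {
    "BandA": "title,lyrics\nSong 1,la la la\nSong 2,na na na\n",
    "BandB": "title,lyrics\nSong 3,oh oh oh\n",
}

frames = []
for band, csv_text in fake_files.items():
    df = pd.read_csv(io.StringIO(csv_text))
    df["band"] = band  # every row from this file gets the same label
    frames.append(df)

# ignore_index=True discards each file's own 0..n index and renumbers the rows.
combined = pd.concat(frames, ignore_index=True)
print(combined)
print(combined["band"].value_counts().to_dict())
```

Without `ignore_index=True`, the row labels would restart at 0 for each file, and label-based lookups such as `combined.loc[0]` would return one row per file.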

### 3. Initialize VADER sentiment analyzer

The VADER analyzer object (`sid`) is instantiated. If the **VADER lexicon** is not yet present in the environment, the constructor raises a `LookupError`; the cell catches it, downloads the lexicon with `nltk.download('vader_lexicon')`, and creates the analyzer again.

In [28]:
try:
    sid = SentimentIntensityAnalyzer()
except LookupError:
    nltk.download('vader_lexicon')
    sid = SentimentIntensityAnalyzer()

### 4. Perform sentiment analysis

This cell runs the core VADER analysis for every song in the corpus. Missing lyrics are first replaced with empty strings so that `polarity_scores` always receives a string. The code then iterates over the `lyrics` column, **calculates the four VADER scores** (negative, neutral, positive, and compound) for each text, and collects the results in four Python lists.

In [29]:
# Replace missing lyrics (NaN) with empty strings before scoring.
lyrics_df['lyrics'] = lyrics_df['lyrics'].fillna("")

neg_scores, neu_scores, pos_scores, compound_scores = [], [], [], []

for lyrics in lyrics_df['lyrics']:
    sentiment_scores = sid.polarity_scores(lyrics)
    neg_scores.append(sentiment_scores['neg'])
    neu_scores.append(sentiment_scores['neu'])
    pos_scores.append(sentiment_scores['pos'])
    compound_scores.append(sentiment_scores['compound'])

### 5. Store scores in DataFrame

This cell finalizes the per-song analysis by adding the four calculated sentiment lists directly into the main `lyrics_df`. Four new columns (`neg`, `neu`, `pos`, `compound`) are created to store the individual VADER scores.

In [30]:
lyrics_df['neg'] = neg_scores
lyrics_df['neu'] = neu_scores
lyrics_df['pos'] = pos_scores
lyrics_df['compound'] = compound_scores
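
The scoring loop and the four column assignments can also be collapsed into one step, because `polarity_scores` returns a dictionary with the keys `neg`, `neu`, `pos`, and `compound`. The sketch below shows the pattern with a hypothetical `fake_polarity_scores` function standing in for `sid.polarity_scores`, so it runs without the VADER lexicon; the numbers it returns are invented.

```python
import pandas as pd

def fake_polarity_scores(text):
    """Hypothetical stand-in for sid.polarity_scores; returns invented values."""
    if not text:
        return {"neg": 0.0, "neu": 0.0, "pos": 0.0, "compound": 0.0}
    return {"neg": 0.1, "neu": 0.7, "pos": 0.2, "compound": 0.3}

toy_df = pd.DataFrame({"lyrics": ["some words", "", "more words"]})

# One dict per song becomes one row per song, with one column per score.
score_rows = [fake_polarity_scores(text) for text in toy_df["lyrics"]]
scores_df = pd.DataFrame(score_rows, index=toy_df.index)

# Column-wise concat aligns on the shared index, so each row keeps its own scores.
toy_df = pd.concat([toy_df, scores_df], axis=1)
print(toy_df)
```

In the notebook itself, the stand-in would be replaced by `sid.polarity_scores` and `toy_df` by `lyrics_df`. Passing `index=toy_df.index` matters if the DataFrame has been filtered or re-indexed before scoring.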

### 6. Generate an overall corpus summary

This final step aggregates the per-song scores into the main comparative result:

1.  **Grouping:** The rows are grouped by `band`, and the mean of each of the four score columns is calculated with `groupby('band')[...].mean()`, giving one set of average scores per corpus.
2.  **Classification:** A custom function applies the standard VADER thresholds to each corpus's mean compound score: $\geq 0.05$ is **positive**, $\leq -0.05$ is **negative**, and anything in between is **neutral**. The label is stored in the **"Sentiment Classification"** column.
3.  **Reporting:** The table of average scores and classifications is printed with four decimal places.
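
The grouping and classification steps can be tried on a small scale first. The sketch below uses invented compound scores for two hypothetical corpora, `A` and `B`; the function is named `classify` here only to keep it distinct from the notebook's own function, and its `threshold` parameter is an illustrative addition.

```python
import pandas as pd

# Invented per-song compound scores for two hypothetical corpora.
toy = pd.DataFrame({
    "band": ["A", "A", "B", "B"],
    "compound": [0.90, -0.70, 0.02, 0.04],
})

def classify(score, threshold=0.05):
    """Label a compound score using symmetric cut-offs at +/- threshold."""
    if score >= threshold:
        return "Positive"
    if score <= -threshold:
        return "Negative"
    return "Neutral"

# Step 1: one mean per corpus. Step 2: label each mean.
means = toy.groupby("band")["compound"].mean()
labels = means.apply(classify)
print(pd.DataFrame({"mean_compound": means, "label": labels}))
```

Corpus `A` is labeled positive even though one of its two songs is strongly negative, because the label is derived from the mean alone. Where the spread matters, the same function could be applied to each song's compound score before grouping.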

In [31]:
corpus_summary_df = lyrics_df.groupby('band')[['neg', 'neu', 'pos', 'compound']].mean().reset_index()

final_df = corpus_summary_df.rename(columns={
    'band': 'Corpus',
    'neg': 'Negative polarity',
    'neu': 'Neutral polarity',
    'pos': 'Positive polarity',
    'compound': 'Overall sentiment score'
})

def classify_sentiment(score):
    if score >= 0.05:
        return "Positive"
    elif score <= -0.05:
        return "Negative"
    else:
        return "Neutral"

final_df['Sentiment Classification'] = final_df['Overall sentiment score'].apply(classify_sentiment)

pd.set_option('display.float_format', lambda x: '%.4f' % x)

print("\n--- Sentiment analysis results per corpus ---")
print("-----------------------------------------------------")
print(final_df)


--- Sentiment analysis results per corpus ---
-----------------------------------------------------
    Corpus  Negative polarity  Neutral polarity  Positive polarity  \
0  Beatles             0.0606            0.7550             0.1844   
1     CSNY             0.0819            0.7736             0.1307   
2   Grunge             0.1235            0.7544             0.1222   
3      SNA             0.0174            0.1392             0.0234   

   Overall sentiment score Sentiment Classification  
0                   0.4658                 Positive  
1                   0.3681                 Positive  
2                  -0.0382                  Neutral  
3                   0.0295                  Neutral  
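
In the table above, the three polarity columns for SNA sum to roughly 0.18, whereas they sum to roughly 1 for the other corpora. One possible explanation, not confirmed by the source, is that many SNA rows had missing lyrics: after `fillna("")`, VADER scores an empty string as 0.0 on all four measures, which pulls every mean toward zero. A quick way to check is to measure the share of empty lyrics per band. The sketch below uses invented rows; in the notebook, `toy_df` would be `lyrics_df`.

```python
import pandas as pd

# Invented rows; in the notebook this would be lyrics_df after fillna("").
toy_df = pd.DataFrame({
    "band": ["A", "A", "B", "B", "B"],
    "lyrics": ["some words", "more words", "", "   ", "a real lyric"],
})

# Treat whitespace-only strings as empty too.
is_empty = toy_df["lyrics"].str.strip().eq("")

# The mean of a boolean column is the fraction of True values.
empty_share = is_empty.groupby(toy_df["band"]).mean()
print(empty_share)
```

If a corpus shows a large share of empty lyrics, its averages could be recomputed on the non-empty rows only, for example by filtering with `~is_empty` before the `groupby`.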
