# Multilingual Sentiment Analysis Lexicons for African Languages: An LLM Approach

This notebook implements the methodology and findings from research conducted by Gift Markus Xipu on **"Multilingual Sentiment Analysis Lexicons for African Languages: An LLM Approach."** 

This study evaluates how well language models (OpenAI's GPT models, Claude, Gemini, and BERT) perform sentiment analysis directly on three African languages, Sepedi, Sesotho, and Setswana, without relying on translation-based techniques. We examine the effectiveness of various prompting strategies and assess how much fine-tuning is needed to optimise performance across these languages.

The code provided here demonstrates how to leverage these LLMs to create sentiment lexicons and analyse African language text, representing a step toward more culturally and linguistically inclusive NLP tools. The approach aims to address the critical gap in NLP resources for African languages by utilising the multilingual capabilities of modern language models.

<div style="background-color: #d1f3d1; padding: 10px; border-radius: 5px;">
    Importing the necessary dependencies
</div>

In [2]:
import os
from openai import OpenAI
import anthropic
import google.generativeai as genai

<div style="background-color: #d1f3d1; padding: 10px; border-radius: 5px;">
    <strong>Code Explanation:</strong> This defines the base <code>LLM</code> class, a common interface for all language models. It stores shared attributes (name, API key, model identifier, temperature, and token limit) and declares two methods that child classes must override; until they do, each raises <code>NotImplementedError</code>. <code>setup_client()</code> creates the client for a model's API, while <code>generate()</code> submits a prompt and returns the response. Each language model implementation (Claude, OpenAI, Gemini) extends this class with its own functionality.
</div>

In [3]:
class LLM:
    def __init__(self, name, api_key, model, temperature=0.0, max_tokens=1000):

        self.name = name
        self.api_key = api_key
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.client = None
    
    def setup_client(self):
        raise NotImplementedError("Subclasses must implement setup_client()")
    
    def generate(self, prompt, system_prompt=None):
        raise NotImplementedError("Subclasses must implement generate()")
    
    def __str__(self):
        return f"{self.name}(model='{self.model}')"
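
The base class above signals its abstract methods by raising `NotImplementedError` when they are called. As a minimal sketch of the same interface, the variant below uses Python's `abc` module so that a missing override fails at instantiation instead. `AbstractLLM` and `CannedLLM` are hypothetical names used only for illustration; `CannedLLM` returns a fixed string and never contacts an API, which can be handy for exercising downstream code offline.

```python
from abc import ABC, abstractmethod


class AbstractLLM(ABC):
    """Hypothetical ABC-based variant of the LLM interface (illustration only)."""

    def __init__(self, name, api_key, model, temperature=0.0, max_tokens=1000):
        self.name = name
        self.api_key = api_key
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.client = None

    @abstractmethod
    def setup_client(self):
        """Create and return the provider-specific client."""

    @abstractmethod
    def generate(self, prompt, system_prompt=None):
        """Send a prompt and return the model's text response."""

    def __str__(self):
        return f"{self.name}(model='{self.model}')"


class CannedLLM(AbstractLLM):
    """Hypothetical offline stub: always returns the same reply."""

    def __init__(self, reply):
        super().__init__("Canned", api_key=None, model="canned-stub")
        self.reply = reply

    def setup_client(self):
        # No network client is needed; a placeholder marks the stub as ready.
        self.client = object()
        return self.client

    def generate(self, prompt, system_prompt=None):
        if not self.client:
            self.setup_client()
        return self.reply


stub = CannedLLM("lerato, positive, 1")
print(stub)                      # Canned(model='canned-stub')
print(stub.generate("ignored"))  # lerato, positive, 1
```

With this variant, calling `AbstractLLM(...)` directly raises `TypeError`, so a subclass that forgets to implement `generate()` is caught before any prompt is sent.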

<div style="background-color: #d1ecff; padding: 10px; border-radius: 5px;">
    <strong>Claude Initialization</strong>
</div>

<div style="background-color: #d1f3d1; padding: 10px; border-radius: 5px;">
    <strong>Code Explanation:</strong> This code implements the Claude-specific LLM subclass. The <code>ClaudeLLM</code> class inherits from the base <code>LLM</code> class and implements client setup and text generation for Anthropic's API. It defaults to the <code>claude-3-7-sonnet-20250219</code> model with a 4096-token response limit. The <code>generate()</code> method builds the request for Anthropic's Messages API, passing the user prompt as a message and the optional system prompt as a separate <code>system</code> parameter. The cell then creates a Claude instance with the provided API key, sets the temperature to 0 to make responses as repeatable as possible, and creates the API client.
</div>

In [14]:
# Claude-specific implementation
class ClaudeLLM(LLM):
    def __init__(self, api_key, model="claude-3-7-sonnet-20250219", temperature=0.0, max_tokens=4096):
        super().__init__("Claude", api_key, model, temperature, max_tokens)
    
    def setup_client(self):
        self.client = anthropic.Anthropic(api_key=self.api_key)
        return self.client
    
    def generate(self, prompt, system_prompt=None):
        # Set up client if not already done
        if not self.client:
            self.setup_client()
        
        # Prepare the message parameters
        message_params = {
            "model": self.model,
            "max_tokens": self.max_tokens,
            "temperature": self.temperature,
            "messages": [
                {"role": "user", "content": prompt}
            ]
        }
        
        # Add system prompt if provided
        if system_prompt:
            message_params["system"] = system_prompt
        
        # Send request to Claude
        response = self.client.messages.create(**message_params)
        
        # Return the text response
        return response.content[0].text


# Now initialize a Claude instance (replace with your API key)
CLAUDE_API_KEY = "Add key"

# Create the Claude instance
claude = ClaudeLLM(
    api_key=CLAUDE_API_KEY,
    temperature=0.0  # Temperature 0 keeps responses as repeatable as possible
)

# Setup the client
claude.setup_client()

# Verify setup
print(f"Claude setup complete: {claude}")

Claude setup complete: Claude(model='claude-3-7-sonnet-20250219')
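
Hard-coding a key in the notebook risks leaking it when the notebook is shared. One common alternative, sketched here, reads the key from an environment variable using the `os` module imported earlier. The helper name `get_api_key` and the variable name `ANTHROPIC_API_KEY` are assumptions for illustration, not part of the original notebook.

```python
import os


def get_api_key(var_name):
    """Return the API key stored in the environment variable `var_name`.

    Raises RuntimeError with a readable message if the variable is unset or empty.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return key


# Example usage (assumes the variable was exported before starting Jupyter):
# CLAUDE_API_KEY = get_api_key("ANTHROPIC_API_KEY")
```

Failing early with a named variable makes a missing key easier to diagnose than an authentication error returned by the first API call.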


<div style="background-color: #c8e6c9; padding: 10px; border-radius: 5px;">
    <strong>OpenAI Initialization</strong>
</div>

In [15]:
# OpenAI-specific implementation
class OpenAILLM(LLM):
    def __init__(self, api_key, model="gpt-4o", temperature=0.0, max_tokens=4096):
        """
        Initialize an OpenAI LLM instance.
        
        Args:
            api_key (str): The OpenAI API key for authentication
            model (str): The OpenAI model to use (default: gpt-4o)
            temperature (float): Controls randomness (0.0 to 2.0; 0.0 is the most deterministic)
            max_tokens (int): Maximum tokens in the response
        """
        # Call the parent constructor with the name "OpenAI"
        super().__init__("OpenAI", api_key, model, temperature, max_tokens)
        self.base_url = None  # Optional base URL for API requests
    
    def setup_client(self):
        """Set up the OpenAI client."""
        if self.base_url:
            self.client = OpenAI(api_key=self.api_key, base_url=self.base_url)
        else:
            self.client = OpenAI(api_key=self.api_key)
        return self.client
    
    def set_base_url(self, base_url):
        """
        Set a custom base URL for API requests (useful for proxies or Azure OpenAI).
        
        Args:
            base_url (str): The base URL to use for API requests
        """
        self.base_url = base_url
        # Reset client if it was already set up
        if self.client:
            self.setup_client()
        return self
    
    def generate(self, prompt, system_prompt=None):
        """
        Generate a response from OpenAI.
        
        Args:
            prompt (str): The user prompt
            system_prompt (str, optional): System instructions for the AI
            
        Returns:
            str: OpenAI's response
        """
        # Set up client if not already done
        if not self.client:
            self.setup_client()
        
        # Prepare the messages list
        messages = []
        
        # Add system prompt if provided
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        
        # Add user prompt
        messages.append({"role": "user", "content": prompt})
        
        # Send request to OpenAI
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            max_tokens=self.max_tokens,
            temperature=self.temperature
        )
        
        # Return the text response
        return response.choices[0].message.content


# Now initialize an OpenAI instance (replace with your API key)
OPENAI_API_KEY = "add key"

# Create the OpenAI instance
openai_llm = OpenAILLM(
    api_key=OPENAI_API_KEY,
    temperature=0.0  # Temperature 0 keeps responses as repeatable as possible
)

# Setup the client
openai_llm.setup_client()

# Verify setup
print(f"OpenAI setup complete: {openai_llm}")

OpenAI setup complete: OpenAI(model='gpt-4o')


In [23]:
class MultilingualSentimentBearings:
    """Query an LLM for word-level sentiment bearings in Sepedi, Sesotho, and Setswana."""
    
    # Language codes and names
    LANGUAGES = {
        'sepedi': 'Sepedi',
        'sesotho': 'Sesotho',
        'setswana': 'Setswana'
    }
    
    def __init__(self, llm_model):
        """
        Initialize with an LLM model instance.
        
        Args:
            llm_model: An instance of an LLM class (Claude, OpenAI, etc.)
        """
        self.llm = llm_model
        
    def get_bearing_prompt(self, word, language):
        """
        Generate a prompt to determine the sentiment bearing of a word in the specified language.
        
        Args:
            word (str): The word to analyze
            language (str): The language code ('sepedi', 'sesotho', or 'setswana')
            
        Returns:
            str: A prompt for the LLM
        """
        language_name = self.LANGUAGES.get(language.lower(), language)
        
        return f"""
        Analyze the sentiment bearing of the {language_name} word: "{word}"
        
        Please respond with ONLY a single line in exactly this format:
        <word>, <sentiment>, <score>
        
        Where:
        - <word> is the analyzed word in {language_name}
        - <sentiment> is either "positive", "neutral", or "negative" in English
        - <score> is either 1 (positive), 0 (neutral), or -1 (negative)
        
        For example:
        lerato, positive, 1
        thata, neutral, 0
        bohloko, negative, -1
        """
    
    def get_bearing_system_prompt(self, language):
        """
        Generate a system prompt for sentiment bearing analysis in the specified language.
        
        Args:
            language (str): The language code ('sepedi', 'sesotho', or 'setswana')
            
        Returns:
            str: A system prompt with instructions
        """
        language_name = self.LANGUAGES.get(language.lower(), language)
        
        return f"""
        You are a precise sentiment analyzer for the {language_name} language. Your task is to determine 
        if a {language_name} word has a positive, neutral, or negative sentiment bearing.
        
        Respond with ONLY a single line containing the word, sentiment (in English), and score (-1, 0, or 1) 
        in the exact format requested. Do not include any explanations or additional text.
        
        Be objective in your analysis and ensure you understand the cultural context and nuances of 
        the {language_name} language.
        """
    
    def analyze_word(self, word, language):
        """
        Analyze the sentiment bearing of a single word in the specified language.
        
        Args:
            word (str): The word to analyze
            language (str): The language code ('sepedi', 'sesotho', or 'setswana')
            
        Returns:
            dict: A dictionary with the word, language, sentiment, and score
        """
        if language.lower() not in self.LANGUAGES:
            raise ValueError(f"Unsupported language: {language}. Supported languages are: {', '.join(self.LANGUAGES.keys())}")
        
        prompt = self.get_bearing_prompt(word, language)
        system_prompt = self.get_bearing_system_prompt(language)
        
        # Get response from the LLM
        response = self.llm.generate(prompt, system_prompt=system_prompt)
        
        # Parse the response (expecting format: "word, sentiment, score")
        try:
            # Clean the response and split by comma
            clean_response = response.strip()
            parts = clean_response.split(',')
            
            if len(parts) >= 3:
                analyzed_word = parts[0].strip()
                sentiment = parts[1].strip().lower()
                score_str = parts[2].strip()
                
                # Convert score to int
                try:
                    score = int(score_str)
                except ValueError:
                    # If score is not an integer, try to extract it from the string
                    if '-1' in score_str:
                        score = -1
                    elif '1' in score_str and not score_str.startswith('-'):
                        score = 1
                    else:
                        score = 0
                
                return {
                    'word': analyzed_word,
                    'language': language.lower(),
                    'sentiment': sentiment,
                    'score': score
                }
            else:
                # If parsing fails, return a default response
                print(f"Warning: Could not parse LLM response correctly. Raw response: {response}")
                return {
                    'word': word,
                    'language': language.lower(),
                    'sentiment': 'unknown',
                    'score': None,
                    'raw_response': response
                }
                
        except Exception as e:
            print(f"Error parsing LLM response: {e}")
            return {
                'word': word,
                'language': language.lower(),
                'sentiment': 'error',
                'score': None,
                'error': str(e),
                'raw_response': response
            }
    
    def analyze_words(self, words_dict):
        """
        Analyze the sentiment bearing of multiple words in different languages.
        
        Args:
            words_dict (dict): A dictionary where keys are language codes and values are
                              lists of words to analyze in that language
            
        Returns:
            dict: A dictionary with language codes as keys and lists of analysis results as values
        """
        results = {}
        
        for language, words in words_dict.items():
            language_results = []
            
            for word in words:
                result = self.analyze_word(word, language)
                language_results.append(result)
                
            results[language] = language_results
            
        return results
    
    def create_sentiment_lexicon(self, words_dict):
        """
        Create a sentiment lexicon for multiple languages.
        
        Args:
            words_dict (dict): A dictionary where keys are language codes and values are
                              lists of words to analyze in that language
            
        Returns:
            dict: A lexicon of the form {language: {word: score}}; words whose
                  response could not be parsed (score is None) are omitted
        """
        lexicon = {}
        
        for language, words in words_dict.items():
            language_lexicon = {}
            
            for word in words:
                result = self.analyze_word(word, language)
                if result['score'] is not None:
                    language_lexicon[word] = result['score']
                
            lexicon[language] = language_lexicon
            
        return lexicon
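
The parsing step in `analyze_word` splits the reply on commas and then falls back to substring checks for the score. As a sketch of one alternative, assuming the model sometimes wraps the answer in extra lines or punctuation, a regular expression can locate the `<word>, <sentiment>, <score>` triple anywhere in the reply. `parse_bearing_response` is a hypothetical helper, not part of the class above.

```python
import re

# Matches a line of the form "<word>, <sentiment>, <score>".
BEARING_PATTERN = re.compile(
    r"^\s*(?P<word>[^,\n]+?)\s*,\s*"
    r"(?P<sentiment>positive|neutral|negative)\s*,\s*"
    r"(?P<score>-1|0|1)\b",
    re.IGNORECASE | re.MULTILINE,
)


def parse_bearing_response(response):
    """Return (word, sentiment, score) from an LLM reply, or None if no line matches."""
    match = BEARING_PATTERN.search(response)
    if match is None:
        return None
    return (
        match.group("word"),
        match.group("sentiment").lower(),
        int(match.group("score")),
    )


print(parse_bearing_response("lerato, positive, 1"))
# ('lerato', 'positive', 1)
print(parse_bearing_response("Here is the result:\nbohloko, Negative, -1."))
# ('bohloko', 'negative', -1)
print(parse_bearing_response("I am not sure about this word."))
# None
```

Returning `None` for unmatched replies leaves it to the caller to decide whether to record the word as unknown or retry the request.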

In [24]:
# Test the multilingual sentiment analysis

# Make sure your LLM is initialized with the correct API key
# For example:
# claude = ClaudeLLM(api_key="your_actual_api_key_here")
# claude.setup_client()

# Create a MultilingualSentimentBearings instance
try:
    sentiment_llm = claude  # Use your initialized LLM
    multilingual_analyzer = MultilingualSentimentBearings(sentiment_llm)
    print("Multilingual analyzer initialized successfully")
except Exception as e:
    print(f"Error initializing multilingual analyzer: {e}")

# Sample words in each language
test_words = {
    'sepedi': [
        'manyami',    # sadness
        'thabo',      # joy
        'kgahlego',   # interest
        'pefelo',     # anger
        'manyami',    # sadness
        'poifo',      # fear
        'botho',      # humanity/ubuntu
        'tshele'      # life
    ],
    'sesotho': [
        'lerato',     # love
        'thabo',      # joy
        'thahasello', # interest
        'kgalefo',    # anger
        'maswabi',    # sadness
        'tshabo',     # fear
        'botho',      # humanity/ubuntu
        'bophelo'     # life
    ],
    'setswana': [
        'botlhoko',   # pain
        'boitumelo',  # joy
        'kgatlhego',  # interest
        'tenego',     # anger
        'kutlobotlhoko', # sadness
        'poifo',      # fear
        'botho',      # humanity/ubuntu
        'botshelo'    # life
    ]
}

# Function to display sentiment analysis results for a language
def display_language_results(language, results):
    """
    Display sentiment analysis results for a specific language.
    
    Args:
        language (str): The language name
        results (list): List of result dictionaries
    """
    print(f"\nSentiment Analysis Results for {language.title()}:")
    print(f"{'Word':<15} | {'Sentiment':<10} | {'Score':<5}")
    print("-" * 35)
    
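    for result in results:
        word = result.get('word', 'unknown')
        sentiment = result.get('sentiment', 'unknown')
        # A failed parse stores score=None, which cannot take a width format spec
        score = result.get('score')
        score_text = 'N/A' if score is None else str(score)
        
        print(f"{word:<15} | {sentiment:<10} | {score_text:<5}")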

# Test with a smaller subset for initial testing
test_subset = {
    'sepedi': test_words['sepedi'][:3],   # Just first 3 words
    'sesotho': test_words['sesotho'][:3], # Just first 3 words
    'setswana': test_words['setswana'][:3] # Just first 3 words
}

# Analyze words in each language
try:
    print("Starting multilingual sentiment analysis...")
    
    # Use smaller subset for initial testing
    results = multilingual_analyzer.analyze_words(test_subset)
    
    # Display results for each language
    for language, language_results in results.items():
        display_language_results(language, language_results)
    
    print("\nMultilingual sentiment analysis completed successfully")
except Exception as e:
    print(f"Error in multilingual sentiment analysis: {e}")

Multilingual analyzer initialized successfully
Starting multilingual sentiment analysis...

Sentiment Analysis Results for Sepedi:
Word            | Sentiment  | Score
-----------------------------------
manyami         | negative   | -1   
thabo           | positive   | 1    
kgahlego        | positive   | 1    

Sentiment Analysis Results for Sesotho:
Word            | Sentiment  | Score
-----------------------------------
lerato          | positive   | 1    
thabo           | positive   | 1    
thahasello      | positive   | 1    

Sentiment Analysis Results for Setswana:
Word            | Sentiment  | Score
-----------------------------------
botlhoko        | negative   | -1   
boitumelo       | positive   | 1    
kgatlhego       | positive   | 1    

Multilingual sentiment analysis completed successfully
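
`create_sentiment_lexicon` calls the model again for every word, even when `analyze_words` has already produced the same information. As a sketch, assuming results in the shape returned by `analyze_words`, the hypothetical helper `lexicon_from_results` below folds existing results into a `{language: {word: score}}` lexicon without further API calls. The sample data reproduces the Sepedi rows and the first Setswana row printed above, plus one hand-written unparsed entry to show how missing scores are skipped.

```python
def lexicon_from_results(results):
    """Build {language: {word: score}} from analyze_words-style results.

    Entries whose score is None (unparsed or failed responses) are skipped.
    """
    lexicon = {}
    for language, entries in results.items():
        lexicon[language] = {
            entry["word"]: entry["score"]
            for entry in entries
            if entry.get("score") is not None
        }
    return lexicon


sample_results = {
    "sepedi": [
        {"word": "manyami", "language": "sepedi", "sentiment": "negative", "score": -1},
        {"word": "thabo", "language": "sepedi", "sentiment": "positive", "score": 1},
        {"word": "kgahlego", "language": "sepedi", "sentiment": "positive", "score": 1},
    ],
    "setswana": [
        {"word": "botlhoko", "language": "setswana", "sentiment": "negative", "score": -1},
        # Illustrative entry (not from the run above): a response that failed to parse.
        {"word": "example", "language": "setswana", "sentiment": "unknown", "score": None},
    ],
}

print(lexicon_from_results(sample_results))
# {'sepedi': {'manyami': -1, 'thabo': 1, 'kgahlego': 1}, 'setswana': {'botlhoko': -1}}
```

Note that this sketch keys the lexicon by the word echoed back in each result, whereas `create_sentiment_lexicon` keys it by the input word; the two can differ if the model alters the spelling.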
