# YMMB Project: Evaluating the Usage of African-American Vernacular English in Large Language Models


---


## Deja Dunlap

This project is designed to evaluate the ability of LLMs, specifically ChatGPT (GPT-4o), Gemma 2 27B, and Llama 3, to understand the grammatical features and usage of non-"standard" dialects, specifically African American Vernacular English (AAVE). We focus on three grammatical features: "ain't", the negative module (double negatives), and the expletive "be".

The first phase of the project evaluated how ChatGPT's usage of these grammatical features compares to human usage. To do this, we prompted ChatGPT to create a dialogue as if it were an African American speaker from one of the cities represented in the dataset. We did this 1000 times, measured how often the previously mentioned grammatical features occurred, and compared the result to human use.
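
As a rough illustration of what "how often a feature occurs" can mean, the sketch below computes per-sentence densities for "ain't" and for double negatives on a toy string. The helper names (`split_sentences`, `aint_density`, `double_negative_density`) are hypothetical and the sample text is invented; only the negation word list is taken from the class defined later in this notebook. Treat it as a minimal sketch, not the exact procedure used for the reported numbers.

```python
import re

# Negation words copied from the double_negatives method in this notebook.
NEGATIONS = re.compile(
    r"\b(no|not|never|none|nothing|nowhere|nobody|neither|nor|ain't|can't|won't|don't"
    r"|isn't|aren't|hasn't|haven't|hadn't)\b",
    re.IGNORECASE,
)

def split_sentences(text):
    """Split on periods and drop empty fragments (hypothetical helper)."""
    return [s.strip() for s in text.split(".") if s.strip()]

def aint_density(text):
    """Occurrences of "ain't" per sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return len(re.findall(r"\bain't\b", text.lower())) / len(sentences)

def double_negative_density(text):
    """Share of sentences that contain two or more negation words."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    hits = sum(len(NEGATIONS.findall(s)) >= 2 for s in sentences)
    return hits / len(sentences)

# Invented example text, not drawn from the dataset.
sample = "I ain't never seen nothing like that. She be working late. We ain't going."
print(aint_density(sample))
print(double_negative_density(sample))
```

Running the same two functions on the human text and on the model-generated text gives directly comparable per-sentence rates.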

The next stage analyzed the contexts in which each feature regularly occurs in the speech of AAVE speakers. For the feature "ain't", we determine the 10 words that most often precede "ain't" and the probability that "ain't" follows each of those words (i.e., P(ain't | word)). We then compare those values to the probabilities that ChatGPT assigns to "ain't" following the same word, given a preceding sentence containing some element of AAVE to make it "speak" in the dialect. For the expletive "be", the process was much the same as for "ain't". For the negative module, we evaluated the percentage of sentences in the human dataset that contained the negative feature. When prompting the model, we had it generate an entire sentence multiple times and evaluated how often the output contained the feature.
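
The conditional probability P(ain't | word) can be estimated from counts as count(word, ain't) / count(word). The sketch below shows that estimate on an invented token list; the function name `conditional_probs` and its `k` parameter are assumptions made for illustration and are not part of the class defined later in this notebook.

```python
from collections import Counter

def conditional_probs(tokens, feature, k=10):
    """Return {word: P(feature | word)} for the k words that most often precede `feature`."""
    unigrams = Counter(tokens)                  # count(word)
    bigrams = Counter(zip(tokens, tokens[1:]))  # count(word, next_word)
    # keep only the bigrams whose second element is the feature
    preceding = Counter({w: c for (w, nxt), c in bigrams.items() if nxt == feature})
    return {w: c / unigrams[w] for w, c in preceding.most_common(k)}

# Invented example tokens, not drawn from the dataset.
tokens = "i ain't going and you ain't going but i know you tired".split()
print(conditional_probs(tokens, "ain't", k=2))
```

Here "i" and "you" each appear twice and are each followed by "ain't" once, so both receive a probability of 0.5.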



In [None]:
!pip install stanza
!pip install together

# importing the dependencies needed for this project
import re
import os
from together import Together
import torch
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
import random
from google.colab import files
import openai
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from collections import Counter
from google.colab import userdata

nltk.download('all')



[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_r

True

## Using GPUs in Colab

LLMs run much faster on GPUs. To attach your Colab notebook to a GPU, use the following steps:


1.   Click "Runtime" in the menu at the top
2.   Click "Change runtime type"
3.   Select "T4 GPU"
4.   Click "Save"

In [None]:
# checking if cuda is active
print(torch.cuda.is_available())


# use the GPU when one is attached, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

True


In [None]:
class AAVE_Feature:

    def __init__(self, folder):
        self.folder = folder
        self.dataset = ""
        self.bigrams = []
        self.feature_prob = {}
        self.feature_density = {"be": 0, "negative": 0, "ain't": 0}
        self.sentiment = [0, 0]
        self.files_count = 0

    """
    Input: sentiment (bool), word_count (bool), human (bool)
    Output: none
    Reads the files in the folder, cleans their text, and appends it to self.dataset
    """
    def read_files(self, sentiment = False, word_count = False, human = False):
      # going through the files
      for root, _, files in os.walk(self.folder):
        self.files_count += len(files)
        for file in files:
          file_path = os.path.join(root, file)
          # ignore the checkpoint folder (model generated datasets)
          if "checkpoints" in file_path:
            continue
          # clean the data
          cleaned = self.clean_data(file_path, human)
          # add to dataset
          self.dataset += cleaned

      # print out number of words, sentences across dataset (once, after all files are read)
      if word_count:
        print("Words: " + str(len(self.dataset.split())))
        print("Sentences: " + str(len(self.dataset.split("."))))

      # average the sentiment [positive, negative] across the files in the dataset
      if sentiment and self.files_count:
        self.sentiment[0] /= self.files_count
        self.sentiment[1] /= self.files_count

    """
    Input: text (str)
    Output: None
    Preprocesses the text and adds its positive and negative sentiment scores to self.sentiment
    """
    def content_analysis(self, text):
      # preprocessing the text
      tokens = word_tokenize(text.lower())
      filtered_tokens = [token for token in tokens if token not in stopwords.words('english')]
      lemmatizer = WordNetLemmatizer()
      lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

      # rejoining the processed version of the text
      processed_text = ' '.join(lemmatized_tokens)

      # sentiment analysis (score the text once and reuse the result)
      analyzer = SentimentIntensityAnalyzer()
      scores = analyzer.polarity_scores(processed_text)
      self.sentiment[0] += scores['pos']
      self.sentiment[1] += scores['neg']

    """
    Input: feature (str)
    Output: None
    Stores the density (occurrences per sentence) of the grammatical feature in self.feature_density
    """
    def feature_densities(self, feature):
      if feature == "ain't":
        self.feature_density["ain't"] = (self.dataset.count("ain't")/ len(self.dataset.split('.')))
      elif feature == "negative":
        self.feature_density["negative"] = (self.double_negatives(self.dataset) / len(self.dataset.split('.')))
      elif feature == "be":
        self.feature_density["be"] =  (self.feature_density["be"] / len(self.dataset.split('.')))

    """
    Input: file (str)
    Output: cleaned_text (str)
    Returns the parts of the interviews spoken by the speakers of AAVE and removes non-alphanumeric characters (except periods and apostrophes) from the text
    """
    def clean_data(self, file, human = True):
        interview = []
        with open(file, 'r', encoding='utf-8', errors='ignore') as file:
          if human:
            for line in file:
                line = line.split('\t')
                if len(line) < 4:
                    continue
                speaker = line[1]
                content = line[3]

                # only pulling the lines of the dataset that are spoken by the speaker (se means speaker) and are actual pieces of content (data also contains things like pauses)
                if "se" in speaker and "(pause " not in content:
                    interview.append(line[3])
          else:
            for line in file:
              interview.append(line)

        # joining all of the text together
        text = " ".join(interview)
        cleaned_text = re.sub(r"[^\w\s'.]", '', text)
        #self.content_analysis(cleaned_text)

        return cleaned_text.lower()

    """
    Input: count (int OPTIONAL)
    Return: examples (arr)
    Returns `count` random sentences from the dataset (default is 10)
    """
    def ran_sentences(self, count = 10):
      dataset = self.dataset.split(".")
      random.shuffle(dataset)
      examples = dataset[:count]
      return examples

    """
    Input: window (int OPTIONAL)
    Return: none
    Builds n-grams of the given window size (default is 2) from the dataset and stores them in self.bigrams
    """
    def n_grams(self, window = 2):
        words = self.dataset.split(" ")
        n_grams = []
        for idx in range(len(words) - window + 1):
            n_grams.append(words[idx:idx+window])
        self.bigrams = n_grams

    """
    Input: key (string), k (int), human (bool OPTIONAL), human_key (iterable OPTIONAL)
    Output: top_pre (dict)
    Goes through the bigrams, finds all of the bigrams that end with the key, and returns the counts of the words that precede it: the k most common ones for the human data, or only those listed in human_key for the model data
    """
    def top_k_bigrams(self, key, k, human=True, human_key = None):

        # constructing bigrams from the dataset
        self.n_grams()

        # finding all the bigrams that contain the key word
        key_count = 0
        preceding = {}

        for bigram in self.bigrams:
            if key == "be":
              if '' not in bigram and bigram[1] == "be" and pos_tag([bigram[0]])[0][1] in ["NN", "NNP", "NNS", "PRP"] and "'" not in bigram[0] and bigram[0] not in ["couldn't", "wanna", "um", "can't", "gonna", "could", "should", "uh", "gotta", "sposta"]:
                preceding[bigram[0]] = preceding[bigram[0]] + 1 if bigram[0] in preceding else 1
                self.feature_density["be"] += 1
            elif key in bigram:
              if bigram[1] == key:
                preceding[bigram[0]] = preceding[bigram[0]] + 1 if bigram[0] in preceding else 1

        # focusing only on the words that regularly occur next to key
        top_pre = {}
        if human:
          sorted_preceding = sorted(preceding.items(), key=lambda item: item[1], reverse=True)
          top_pre = dict(sorted_preceding[:k])
        else:
          for word in preceding:
            if word in human_key:
              top_pre[word] = preceding[word]

        return top_pre

    """
    Input: text (str)
    Return: neg_sentence (int)
    Returns the number of sentences in which the negative module (two or more negation words) was present
    """
    def double_negatives(self, text):
      # List of common negation words
      negation_words = r"\b(no|not|never|none|nothing|nowhere|nobody|neither|nor|ain't|can't|won't|don't|isn't|aren't|hasn't|haven't|hadn't)\b"
      neg_sentence = 0
      # Find all negations
      for sentence in text.split('.'):
        negations = re.findall(negation_words, sentence, re.IGNORECASE)
        neg_sentence += (len(negations) >= 2)

      return neg_sentence

    """
    Input: key (string), feature (string)
    Return: combo_count (float)
    Returns the context-specific feature probabilities of the 'be' and "ain't" features
    """
    def prob_word(self, key, feature):

      # find the total number of times the preceding word appears as a whole token in the dataset [count(word)]
      # (str.count would also match the key inside longer words)
      word_count = self.dataset.split(" ").count(key)

      # the negative module is not context specific, so reuse its density
      if feature == "negative":
        return self.feature_density["negative"]

      # count the bigrams in which the feature directly follows the word [count(word, feature)]
      combo_count = 0
      for bigram in self.bigrams:
        if bigram == [key, feature]:
          combo_count += 1

      # p(feature | word) = count(word, feature) / count(word)
      return combo_count / word_count if word_count else 0


    """
    Input: feature (str), human (bool OPTIONAL), human_keys (iterable OPTIONAL)
    Output: none
    Finds the words that commonly precede the feature and stores the probability of the feature following each of those words in self.feature_prob
    """
    def lexical_feature(self, feature, human=True, human_keys = None):

        # if from the human data - finding the top k words that commonly precede the feature word of interest
        # if from the model data - finding the counts for the most common preceding words from the human dataset
        human_pre = self.top_k_bigrams(feature, 10, human, human_keys)

        if feature == "ain't" or feature == "be":
          self.feature_prob[feature] = {key: self.prob_word(key, feature) for key in human_pre.keys()}
        else:
          self.feature_prob[feature] = self.prob_word("", feature)

    """
    Input: model (str OPTIONAL), n (int OPTIONAL)
    Output: none
    Samples the model to create a 'sociolinguistic interview' as a speaker from one of the cities in the human dataset and stores each output in a file
    To extend this to another model available through the Together API, add its model tag to the models dictionary; any other provider needs its own API call
    You must have your own API key for either provider to run this
    """
    def model_data(self, model = 'meta', n = 1000):
      STORY_PROMPT = """Produce a narrative as if you were an African American {gender} from {city} where you are being recorded for a sociolinguistic interview. Create a piece of text equivalent in length to talking for about 30 minutes. Include only the text for the prompt. Do not include any other extraneous information (including the phrase "Here's the prompt" or any of it's varieties)."""
      for i in range(n):
        gender = random.choice(["Female", "Male"])
        city = random.choices(population=["Atlanta", "DC", "Detroit", "Lower East Side of New York City", "Princeville", "Rochester", "Valdosta"], weights=[0.05, 0.5, 0.14, 0.05, 0.11, 0.06, 0.05], k=1)[0]
        prompt = STORY_PROMPT.format(gender=gender, city=city)
        models = {'meta':"meta-llama/Meta-Llama-3-8B-Instruct-Turbo" , 'deepseek': 'deepseek-ai/DeepSeek-R1', 'google': 'google/gemma-2-27b-it'}
        # every tag listed in models (including 'deepseek') is served through the Together API
        if model in models:
          client = Together(api_key=userdata.get('TOGETHER_API_KEY'))
          response = client.chat.completions.create(
              model=models[model],
              messages=[{"role": "user", "content": prompt}]          )
        else:
          openai.api_key = userdata.get('OPENAI_API_KEY')
          client = openai.OpenAI(api_key = openai.api_key)
          response = client.chat.completions.create(
              model="gpt-4o-mini",
              store=True,
              messages=[{"role": "user", "content": prompt}])

        # write each response to its own file inside the output folder
        directory = self.folder
        os.makedirs(directory, exist_ok=True)
        file_name = os.path.join(directory, str(city) + "_" + str(gender) + "_" + str(i) + ".txt")
        with open(file_name, 'w', encoding='utf-8') as file:
          file.write(response.choices[0].message.content)


# Generate Model Data

Running the following code segment will generate a new batch of model-generated text for evaluation. Pass in the model you would like to sample from and the number of samples you would like to create.

In [None]:
# create the model datasets, prompting the model to create sociolinguistic interviews based off of the demographics represented in the human dataset
# pass in folder that you want the data to be created in
model_AAVE = AAVE_Feature(folder = "./openai_data")

# uncomment to generate the sociolinguistic interview from model
#model_AAVE.model_data(model = "openai")
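
Before spending API calls, it can help to preview which speaker demographics a run would request. The sketch below repeats the gender and weighted city draw from `model_data` without contacting any model. The function name `sample_demographics` and the `seed` parameter are assumptions added for reproducibility; the original method does not seed its draws.

```python
import random

# Cities and weights copied from model_data; random.choices normalizes the
# weights, so they do not have to sum to 1.
CITIES = ["Atlanta", "DC", "Detroit", "Lower East Side of New York City",
          "Princeville", "Rochester", "Valdosta"]
WEIGHTS = [0.05, 0.5, 0.14, 0.05, 0.11, 0.06, 0.05]

def sample_demographics(n, seed=None):
    """Draw n (gender, city) pairs the way model_data does (seed is a hypothetical addition)."""
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    return [
        (rng.choice(["Female", "Male"]), rng.choices(CITIES, weights=WEIGHTS, k=1)[0])
        for _ in range(n)
    ]

pairs = sample_demographics(5, seed=0)
for gender, city in pairs:
    # mirrors the "<city>_<gender>" prefix used for the output file names
    print(f"{city}_{gender}")
```

With these weights, roughly half of the sampled speakers are expected to come from DC.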

In [None]:
# download data to computer
from google.colab import drive
from google.colab import files

drive.mount('/content/drive')


%cd /content/drive/MyDrive/model_data

!zip -r openai_data.zip openai_data/

files.download('openai_data.zip')

Mounted at /content/drive
[Errno 2] No such file or directory: '/content/drive/MyDrive/model_data'
/content
  adding: openai_data/ (stored 0%)
  adding: openai_data/openai_data/ (stored 0%)
  adding: openai_data/openai_data/DC_Female_297.txt (deflated 53%)
  adding: openai_data/openai_data/openai_data (2)/ (stored 0%)
  adding: openai_data/openai_data/openai_data (2)/openai_data/ (stored 0%)
  adding: openai_data/openai_data/openai_data (2)/openai_data/DC_Female_107.txt (deflated 54%)
  adding: openai_data/openai_data/openai_data (2)/openai_data/DC_Female_379.txt (deflated 53%)
  adding: openai_data/openai_data/openai_data (2)/openai_data/DC_Male_411.txt (deflated 54%)
  adding: openai_data/openai_data/openai_data (2)/openai_data/Atlanta_Male_4.txt (deflated 55%)
  adding: openai_data/openai_data/openai_data (2)/openai_data/DC_Male_392.txt (deflated 54%)
  adding: openai_data/openai_data/openai_data (2)/openai_data/DC_Female_359.txt (deflated 55%)
  adding: openai_data/openai_data/open


# Download the Data

The following code snippet will request that you upload a file. Upload either the human dataset or one of the model datasets to test for feature densities + context-based usage.


In [None]:
uploaded = files.upload()

# Unzip the file
!unzip final_openai_data.zip

Saving final_openai_data.zip to final_openai_data (1).zip
Archive:  final_openai_data.zip
replace openai_data/openai_data/Atlanta_Female_112.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


In [None]:
uploaded = files.upload()

# Unzip the file
!unzip meta_data.zip

Saving meta_data.zip to meta_data.zip
Archive:  meta_data.zip
   creating: deepseek_data/
  inflating: deepseek_data/DC_Female_896.txt  
  inflating: deepseek_data/DC_Female_324.txt  
  inflating: deepseek_data/DC_Female_128.txt  
  inflating: deepseek_data/Detroit_Male_602.txt  
  inflating: deepseek_data/DC_Female_250.txt  
  inflating: deepseek_data/DC_Female_383.txt  
  inflating: deepseek_data/Detroit_Male_168.txt  
  inflating: deepseek_data/Atlanta_Female_174.txt  
  inflating: deepseek_data/Valdosta_Female_371.txt  
  inflating: deepseek_data/Valdosta_Female_987.txt  
  inflating: deepseek_data/DC_Male_725.txt  
  inflating: deepseek_data/Valdosta_Female_116.txt  
  inflating: deepseek_data/Atlanta_Male_282.txt  
  inflating: deepseek_data/Rochester_Male_882.txt  
  inflating: deepseek_data/DC_Male_562.txt  
  inflating: deepseek_data/Detroit_Male_958.txt  
  inflating: deepseek_data/DC_Female_743.txt  
  inflating: deepseek_data/Rochester_Male_249.txt  
  inflating: deepseek_d

In [None]:
uploaded = files.upload()

# Unzip the file
!unzip google_data.zip

KeyboardInterrupt: 

In [None]:
uploaded = files.upload()

# Unzip the file
!unzip data.zip

# Exploratory Analysis

Below is an example of running the `ran_sentences` function, which randomly chooses ten sentences from the dataset to give a general glimpse into its contents.

In [None]:
openai_AAVE = AAVE_Feature(folder = "./openai_data")

openai_AAVE.read_files(human=False)
print(openai_AAVE.ran_sentences())

# Feature Detection

Run the following code chunks to compute and display all of the relevant data points in an interpretable manner.

In [None]:
# collects the human probabilities for the features from the dataset
human_AAVE = AAVE_Feature(folder = "./data")
human_AAVE.read_files()
human_AAVE.n_grams()

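# lexical_feature counts the 'be' occurrences that feature_densities later normalizes, so it runs first here
for feature in ["ain't", "be"]:
  human_AAVE.lexical_feature(feature)
  human_AAVE.feature_densities(feature)

# the negative density must be computed before lexical_feature reads it
human_AAVE.feature_densities("negative")
human_AAVE.lexical_feature("negative")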

In [None]:
# display context-specific feature probabilities
human_AAVE.feature_prob

In [None]:
# display feature density
human_AAVE.feature_density

In [None]:
openai_AAVE = AAVE_Feature(folder = "./openai_data")

# read in the files from the dataset
openai_AAVE.read_files(human=False)
openai_AAVE.n_grams()

# calculate context specific feature densities for be and aint features
for feature in ["ain't",'be']:
  openai_AAVE.lexical_feature(feature, human=False, human_keys = human_AAVE.feature_prob[feature].keys())
  openai_AAVE.feature_densities(feature)

# calculate feature densities for double negative feature
openai_AAVE.feature_densities('negative')
openai_AAVE.lexical_feature('negative')

# display densities
print(openai_AAVE.feature_prob)
print(openai_AAVE.feature_density)

In [None]:
meta_AAVE = AAVE_Feature(folder = "./meta_data")

# read in the files from the dataset
meta_AAVE.read_files(human=False)
meta_AAVE.n_grams()

# calculate context specific feature densities for be and aint features
for feature in ["ain't",'be']:
  meta_AAVE.lexical_feature(feature, human=False, human_keys = human_AAVE.feature_prob[feature].keys())
  meta_AAVE.feature_densities(feature)

# calculate feature densities for double negative feature
meta_AAVE.feature_densities('negative')
meta_AAVE.lexical_feature('negative')

# display densities
print(meta_AAVE.feature_prob)
print(meta_AAVE.feature_density)

In [None]:
google_AAVE = AAVE_Feature(folder = "./google_data")

# read in the files from the dataset
google_AAVE.read_files(human=False)
google_AAVE.n_grams()

# calculate context specific feature densities for be and aint features
for feature in ["ain't",'be']:
  google_AAVE.lexical_feature(feature, human=False, human_keys = human_AAVE.feature_prob[feature].keys())
  google_AAVE.feature_densities(feature)

# calculate feature densities for double negative feature
google_AAVE.feature_densities('negative')
google_AAVE.lexical_feature('negative')

# display densities
print(google_AAVE.feature_prob)
print(google_AAVE.feature_density)