## Burstiness

In this notebook we construct a data frame consisting of essays and their associated burstiness scores.

In this document, "burstiness" means a measure of how much the sentence lengths in an essay deviate from the essay's average sentence length. Note that some people use "burstiness" to mean variance in word frequency, which is another potentially useful metric.
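To make the definition concrete, here is a small sketch on made-up sentence lengths (the numbers are illustrative and not taken from the data set). It uses the population variance from Python's `statistics` module as the measure of deviation; the helper name `toy_burstiness` is hypothetical.

```python
from statistics import pvariance

def toy_burstiness(sentence_lengths):
    """Population variance of a list of sentence lengths (in words)."""
    return pvariance(sentence_lengths)

# Four sentences of identical length: no deviation from the average.
even_lengths = [12, 12, 12, 12]
# Same average length (12), but a mix of short and long sentences.
mixed_lengths = [3, 25, 8, 12]

print(toy_burstiness(even_lengths))
print(toy_burstiness(mixed_lengths))
```

Both lists have the same average, but only the second has a non-zero score, which is the property the burstiness measure is meant to capture.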

The code below is the same as Derek's (essay topic analysis). It uses the os module to read each text file in our data set into a string and appends it to text_list. We then display the second entry, text_list[1].

In [51]:
import os

folder_path = 'small test'

file_list = [file for file in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, file))]
file_list.sort()
# os.listdir() returns entries in arbitrary order. Sort to standardize.

text_list = []

for file_name in file_list:
    file_path = os.path.join(folder_path, file_name)
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
        text_list.append(text)
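As an alternative sketch of the same loading step, the standard-library `pathlib` module can list, sort and read the files in a few lines. The function name `load_texts` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def load_texts(folder):
    """Return the contents of every regular file in `folder`, sorted by file name."""
    paths = sorted(
        (p for p in Path(folder).iterdir() if p.is_file()),
        key=lambda p: p.name,  # same ordering as sorting the names from os.listdir()
    )
    return [p.read_text(encoding="utf-8") for p in paths]

# Usage with the folder from the cell above:
# text_list = load_texts('small test')
```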

In [52]:
text_list[1]

"Driverless cars are exaclty what you would expect them to be. Cars that will drive without a person actually behind the wheel controlling the actions of the vehicle. The idea of driverless cars going in to developement shows the amount of technological increase that the wolrd has made. The leader of this idea of driverless cars are the automobiles they call Google cars. The arduous task of creating safe driverless cars has not been fully mastered yet. The developement of these cars should be stopped immediately because there are too many hazardous and dangerous events that could occur.\n\nOne thing that the article mentions is that the driver will be alerted when they will need to take over the driving responsibilites of the car. This is such a dangerous thing because we all know that whenever humans get their attention drawn in on something interesting it is hard to draw their focus somewhere else. The article explains that companies are trying to implement vibrations when the car is

We import nltk and download the stopwords and punkt packages.

We create a function that tokenises the words in a piece of text, with punctuation and stop words removed. Stop words are common words such as "a", "the" and "in"; they are usually removed because they carry little content, which also saves processing time.

We also create a function that tokenises sentences.

In [53]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
import string

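def token_word(text):
    # Lower-case the text and split it into word tokens.
    tokens = nltk.word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))
    # Keep tokens that are neither stop words nor punctuation marks.
    tokens = [token for token in tokens if token not in stop_words and token not in string.punctuation]
    return tokens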

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\s1557452\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\s1557452\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [54]:
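from nltk.tokenize import sent_tokenize

def token_sent(text):
    # Split the lower-cased text into sentences.
    tokens = sent_tokenize(text.lower())
    # Drop any "sentence" that is just a punctuation mark, e.g. a lone "." or "!".
    tokens = [token for token in tokens if token not in string.punctuation]
    return tokens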

Examples of tokenisation:

In [55]:
print(token_sent(text_list[1]))
print(token_word(text_list[1]))

['driverless cars are exaclty what you would expect them to be.', 'cars that will drive without a person actually behind the wheel controlling the actions of the vehicle.', 'the idea of driverless cars going in to developement shows the amount of technological increase that the wolrd has made.', 'the leader of this idea of driverless cars are the automobiles they call google cars.', 'the arduous task of creating safe driverless cars has not been fully mastered yet.', 'the developement of these cars should be stopped immediately because there are too many hazardous and dangerous events that could occur.', 'one thing that the article mentions is that the driver will be alerted when they will need to take over the driving responsibilites of the car.', 'this is such a dangerous thing because we all know that whenever humans get their attention drawn in on something interesting it is hard to draw their focus somewhere else.', 'the article explains that companies are trying to implement vibr

We now make a function that calculates the average sentence length of an essay.

In [56]:
def avg_sent_len(text):
    num_words = len(token_word(text))  # tokens left after removing stop words and punctuation
    num_sents = len(token_sent(text))  # total number of sentences in text
    avg_sent_len = num_words / num_sents  # average number of such tokens per sentence
    return avg_sent_len


asl = avg_sent_len(text_list[5])   
print(f"Average number of words per sentence is {asl:.1f}")

Average number of words per sentence is 10.2
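Because `avg_sent_len` counts the tokens returned by `token_word`, the figure above is an average of the tokens left after stop-word and punctuation removal, not of raw words. The sketch below illustrates the difference on a single made-up sentence. It uses a tiny hypothetical stop list (`DEMO_STOP_WORDS`) instead of NLTK's much longer one, so the counts are illustrative only.

```python
import string

# Hypothetical mini stop list, for illustration only.
DEMO_STOP_WORDS = {"the", "of", "to", "a", "is", "are", "that"}

def count_all_words(sentence):
    """Count whitespace-separated words, stop words included."""
    return len(sentence.split())

def count_content_words(sentence, stop_words=DEMO_STOP_WORDS):
    """Count words that remain after stripping punctuation and stop words."""
    words = [w.strip(string.punctuation).lower() for w in sentence.split()]
    return sum(1 for w in words if w and w not in stop_words)

sentence = "The idea of driverless cars is exciting to many people."
print(count_all_words(sentence), count_content_words(sentence))
```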


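We now create a function that calculates the variance in sentence length of an essay; this is our burstiness score. Note that the mean comes from avg_sent_len, which counts tokens after stop-word removal, whereas each sentence's length is counted with split() on the full sentence. The result is therefore the mean squared deviation from the filtered average rather than a strict variance.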

In [57]:
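def length_variance(text):
    sentences = token_sent(text)
    avg_len = avg_sent_len(text)  # average sentence length of the essay
    # Mean squared deviation of each sentence's word count from avg_len.
    variance = sum((len(sentence.split()) - avg_len) ** 2 for sentence in sentences) / len(sentences)
    return variance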

length_variance(text_list[10])

118.81163434903046
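A minimal sketch of a variant that measures the mean and each sentence's length with the same whitespace word count, so that the result is a true population variance. To stay self-contained it splits sentences with a naive regular expression as a stand-in for `sent_tokenize`; both function names are hypothetical, and its scores will differ from the ones reported in this notebook.

```python
import re

def split_sentences(text):
    """Naive stand-in for nltk.sent_tokenize: split after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def sentence_length_variance(text):
    """Population variance of the number of whitespace-separated words per sentence."""
    lengths = [len(s.split()) for s in split_sentences(text)]
    if not lengths:
        return 0.0  # avoid dividing by zero on empty text
    mean = sum(lengths) / len(lengths)
    return sum((n - mean) ** 2 for n in lengths) / len(lengths)

print(sentence_length_variance("One two three. Four five six seven eight. Nine!"))
```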

Next we compute the burstiness score of every essay and collect the [essay, burstiness] pairs in a list.

In [58]:
import pandas as pd 

In [59]:
data_burst = []
for essay in text_list:  # loop through essays
    burstiness = length_variance(essay)
    data_burst.append([essay, burstiness])  # one [essay, burstiness] row per essay


Below we build the data frame from data_burst, with one row per essay and its burstiness score.

In [60]:
df_burst = pd.DataFrame(data_burst, columns=['essay', 'burstiness'])
df_burst

| | essay | burstiness |
|---|---|---|
| 0 | Some people belive that the so called "face" o... | 137.637755 |
| 1 | Driverless cars are exaclty what you would exp... | 205.626667 |
| 2 | Dear: Principal\n\nI am arguing against the po... | 384.107438 |
| 3 | Would you be able to give your car up? Having ... | 100.303241 |
| 4 | I think that students would benefit from learn... | 113.299383 |
| 5 | The seagoing cowboy program is the best thing ... | 147.6 |
| 6 | Venus also known as the Earth's "twin" is simi... | 165.216049 |
| 7 | It is every student's dream to be able to loun... | 214.481445 |
| 8 | Cars have been an issue to our community for a... | 68.442708 |
| 9 | Phones & Driving\n\nWaking up from a wonderful... | 136.998615 |


## ChatGPT essays - comparison

In this section I import three essays that I generated with ChatGPT and calculate their burstiness.


In [61]:
ai_text_list =[
    "Mars, the fourth rock from the sun, has always been the coolest planet to me. Its rusty-red hue and mysterious landscapes make it a captivating subject that sparks the curiosity of scientists and dreamers alike. As a high schooler, the allure of Mars goes beyond its scientific wonders – it's about the possibility of humans setting foot on its surface and the adventure that awaits. One of the most exciting things about Mars is the idea that it might have hosted life. Scientists have found evidence of ancient riverbeds and minerals that suggest there was once water on the planet. It's like a cosmic detective story, with rovers like Curiosity and Perseverance searching for clues about Mars' past. Imagine, just a few million years ago, there might have been Martian microbes partying on the red soil! The rovers are like our interplanetary explorers, sending us selfies and digging into the Martian soil. It's mind-blowing to think that, while we're sitting in classrooms, there are robots millions of miles away, uncovering the secrets of another planet. It's not just science fiction anymore; it's science reality! But the coolest part? The idea that we might go there ourselves one day. Elon Musk talks about colonizing Mars like it's the next big vacation spot. Sure, it's a bit far-fetched, but the thought of humans becoming Martians is mind-boggling. Imagine telling your grandkids, 'I was part of the generation that first stepped onto Mars!' That's the kind of history-making stuff that makes studying Mars in high school so exciting. In conclusion, Mars isn't just a distant planet; it's a potential future for humanity. Its mysteries, scientific discoveries, and the dream of human exploration make it the coolest topic to study. Who knows, maybe one day we'll be the ones leaving footprints on the red Martian soil. Mars isn't just a planet; it's an invitation to dream big and reach for the stars – or, in this case, the next planet over!",
    "Electric cars are not just futuristic dreams anymore; they're the cool, eco-friendly rides that are transforming the way we think about transportation. As a high schooler, the buzz around electric cars has me excited about the future of driving and its impact on our environment. First off, let's talk about the planet. Electric cars are like superheroes for Mother Earth. They run on electricity, which means no tailpipe emissions, no smoke, and no harm to the air we breathe. It's like giving our planet a breath of fresh air. As teenagers who will inherit the Earth, it feels awesome to know that we can contribute to a greener future simply by driving an electric car. Charging an electric car is as easy as plugging in your phone. No more waiting in line at gas stations, and no more worrying about those unpredictable gas prices. It's a game-changer for our wallets and our schedules. Plus, some electric cars are so sleek and high-tech; they make regular cars look like they're from the Stone Age. But here's the best part – speed! Electric cars are like the race cars of the future. They can go from zero to sixty in no time, leaving traditional cars in the dust. It's not just about being eco-friendly; it's about looking cool while doing it. Sure, there are some challenges, like finding charging stations everywhere and the cost of buying an electric car upfront. But think about it – every superhero had to face some challenges before they became legends. The more we support electric cars, the more accessible and affordable they'll become. It's like being part of a movement that's steering the world towards a cleaner, brighter future. In conclusion, electric cars aren't just about getting from point A to B; they're about sparking a positive change. As high schoolers, we have the power to shape the future, and supporting electric cars is one way we can drive towards a planet-friendly, stylish tomorrow. So, let's buckle up, plug in, and ride into a greener, cooler future!",
    "Distance learning – it's a term that became part of our everyday vocabulary faster than you can say 'online class.' As a high school student navigating the world of virtual education, distance learning has been a rollercoaster of challenges and unexpected perks. One of the coolest things about distance learning is the flexibility it offers. No more racing against the clock to catch the bus or stressing about being late for first period. With distance learning, we have the freedom to roll out of bed, grab our laptops, and dive into class from the comfort of our homes. It's like having school in our pajamas – a dream come true for every teenager. However, it's not all sunshine and rainbows. Staring at a screen for hours can feel like a never-ending Netflix binge, minus the popcorn. The lack of face-to-face interaction with teachers and classmates makes it challenging to stay engaged. The struggle to resist the temptation of checking social media or getting distracted by the latest YouTube video is real. It's a constant battle between staying focused and the allure of the internet. Another hurdle is the loneliness. High school is supposed to be about building friendships, sharing inside jokes, and surviving the ups and downs together. Distance learning, with its virtual classrooms and muted microphones, can feel isolating. It's like we're missing out on the high school experience – the camaraderie that comes with navigating the maze of lockers and crowded hallways. Yet, in the midst of these challenges, distance learning has taught us resilience. We've become tech-savvy problem solvers, troubleshooting internet issues and mastering the art of the virtual handshake. It's a crash course in adaptability, preparing us for a future where digital skills are non-negotiable. In conclusion, distance learning is a double-edged sword. While it provides unprecedented flexibility, it comes with its share of challenges – the battle against distractions, the yearning for real human connection, and the occasional Wi-Fi meltdown. As high schoolers, we're not just students; we're pioneers navigating the uncharted territory of the digital classroom, learning lessons that go beyond textbooks and assignments. Whether we love it or loathe it, distance learning is shaping us into resilient, tech-savvy individuals ready to conquer the challenges of the digital age."
    ]

In [62]:
ai_text_list[2]

"Distance learning – it's a term that became part of our everyday vocabulary faster than you can say 'online class.' As a high school student navigating the world of virtual education, distance learning has been a rollercoaster of challenges and unexpected perks. One of the coolest things about distance learning is the flexibility it offers. No more racing against the clock to catch the bus or stressing about being late for first period. With distance learning, we have the freedom to roll out of bed, grab our laptops, and dive into class from the comfort of our homes. It's like having school in our pajamas – a dream come true for every teenager. However, it's not all sunshine and rainbows. Staring at a screen for hours can feel like a never-ending Netflix binge, minus the popcorn. The lack of face-to-face interaction with teachers and classmates makes it challenging to stay engaged. The struggle to resist the temptation of checking social media or getting distracted by the latest YouTu

We now create a list, ai_burst, containing the burstiness scores of the three essays above, and print it together with its mean.

In [63]:
ai_burst = []
for essay in ai_text_list:  # loop through essays
    burstiness = length_variance(essay)
    ai_burst.append(burstiness)
print(ai_burst)
print(sum(ai_burst) / len(ai_burst))  # mean burstiness of the AI essays

[73.03, 78.91942148760332, 83.71280991735539]
78.55407713498624


The average burstiness of the human-written essays is calculated below.

In [65]:
print(df_burst['burstiness'].mean())

233.53882368494692
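To put the two averages side by side, the sketch below recomputes the ChatGPT mean from the three scores printed above and divides the human mean by it. The numbers are copied from the outputs in this notebook; the variable names `human_mean`, `ai_mean` and `ratio` are chosen here for illustration.

```python
from statistics import mean

# Scores copied from the notebook outputs above.
ai_burst = [73.03, 78.91942148760332, 83.71280991735539]
human_mean = 233.53882368494692

ai_mean = mean(ai_burst)
ratio = human_mean / ai_mean  # how many times burstier the human essays are on average

print(f"AI mean: {ai_mean:.2f}, human mean: {human_mean:.2f}, ratio: {ratio:.2f}")
```

On these numbers the human-written essays score roughly three times higher on average, although with only three ChatGPT essays the comparison is indicative at best.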
