# Essay Topic Analysis

In this notebook, we use NLTK (the Natural Language Toolkit) to run a simple keyword-based topic analysis of a dataset of essays. We then build a ChatGPT prompt from the resulting keywords and analyze the essay it generates.

Let's read each text file in our dataset into a string, collect the strings in text_list using the os module, and then display the first entry.

In [24]:
import os

folder_path = 'data/train'

file_list = [file for file in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, file))]
file_list.sort()
# os.listdir() returns entries in arbitrary order, so we sort to make the run reproducible.

text_list = []

for file_name in file_list:
    file_path = os.path.join(folder_path, file_name)
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
        text_list.append(text)

In [25]:
text_list[0]

'Some people belive that the so called "face" on mars was created by life on mars. This is not the case. The face on Mars is a naturally occuring land form called a mesa. It was not created by aliens, and there is no consiracy to hide alien lifeforms on mars. There is no evidence that NASA has found that even suggests that this face was created by aliens.\n\nA mesa is a naturally occuring rock formation, that is found on Mars and Earth. This "face" on mars only looks like a face because humans tend to see faces wherever we look, humans are obviously extremely social, which is why our brain is designed to recognize faces.\n\nMany conspiracy theorists believe that NASA is hiding life on Mars from the rest of the world. These people would be very wrong. If NASA found life on Mars, then they would get millions of people\'s attention. NASA\'s budget would increase drasticly, which means that their workers would get paid more. There is no good reason that NASA would hide life on Mars from th

Next, we lowercase and tokenize each text, which gives a list of token lists. Lowercasing is a preprocessing step so that, for example, "Mars" and "mars" count as the same word when we investigate the topic of the text. We then drop every token that is not purely alphabetic, which removes punctuation and numbers.

In [26]:
import nltk
from nltk.tokenize import word_tokenize

words_in_text = [word_tokenize(text.lower()) for text in text_list]

In [27]:
words_in_text[0]

['some',
 'people',
 'belive',
 'that',
 'the',
 'so',
 'called',
 '``',
 'face',
 "''",
 'on',
 'mars',
 'was',
 'created',
 'by',
 'life',
 'on',
 'mars',
 '.',
 'this',
 'is',
 'not',
 'the',
 'case',
 '.',
 'the',
 'face',
 'on',
 'mars',
 'is',
 'a',
 'naturally',
 'occuring',
 'land',
 'form',
 'called',
 'a',
 'mesa',
 '.',
 'it',
 'was',
 'not',
 'created',
 'by',
 'aliens',
 ',',
 'and',
 'there',
 'is',
 'no',
 'consiracy',
 'to',
 'hide',
 'alien',
 'lifeforms',
 'on',
 'mars',
 '.',
 'there',
 'is',
 'no',
 'evidence',
 'that',
 'nasa',
 'has',
 'found',
 'that',
 'even',
 'suggests',
 'that',
 'this',
 'face',
 'was',
 'created',
 'by',
 'aliens',
 '.',
 'a',
 'mesa',
 'is',
 'a',
 'naturally',
 'occuring',
 'rock',
 'formation',
 ',',
 'that',
 'is',
 'found',
 'on',
 'mars',
 'and',
 'earth',
 '.',
 'this',
 '``',
 'face',
 "''",
 'on',
 'mars',
 'only',
 'looks',
 'like',
 'a',
 'face',
 'because',
 'humans',
 'tend',
 'to',
 'see',
 'faces',
 'wherever',
 'we',
 'look'

In [28]:
words_in_text_alp = [ [word for word in text if word.isalpha()]
                      for text in words_in_text ]
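
As a small illustration of what this filter keeps and drops, here is a sketch that uses a few tokens taken from words_in_text[0] above, plus one hypothetical hyphenated token ('so-called') that is an assumption added for the example and does not come from the data. It suggests that hyphenated words are discarded whole, not just their punctuation.

```python
# A few tokens from words_in_text[0], plus one hypothetical
# hyphenated token ('so-called') added purely for illustration.
tokens = ['some', 'people', 'belive', 'that', 'the', 'so', 'called',
          '``', 'face', "''", 'on', 'mars', '.', 'so-called']

# str.isalpha() is True only if every character is a letter.
kept = [tok for tok in tokens if tok.isalpha()]
dropped = [tok for tok in tokens if not tok.isalpha()]

print(kept)     # alphabetic tokens survive
print(dropped)  # quote marks, the period and the hyphenated token are removed
```

If hyphenated words matter for a given topic, they would need separate handling before this step.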

In [29]:
words_in_text_alp[0]

['some',
 'people',
 'belive',
 'that',
 'the',
 'so',
 'called',
 'face',
 'on',
 'mars',
 'was',
 'created',
 'by',
 'life',
 'on',
 'mars',
 'this',
 'is',
 'not',
 'the',
 'case',
 'the',
 'face',
 'on',
 'mars',
 'is',
 'a',
 'naturally',
 'occuring',
 'land',
 'form',
 'called',
 'a',
 'mesa',
 'it',
 'was',
 'not',
 'created',
 'by',
 'aliens',
 'and',
 'there',
 'is',
 'no',
 'consiracy',
 'to',
 'hide',
 'alien',
 'lifeforms',
 'on',
 'mars',
 'there',
 'is',
 'no',
 'evidence',
 'that',
 'nasa',
 'has',
 'found',
 'that',
 'even',
 'suggests',
 'that',
 'this',
 'face',
 'was',
 'created',
 'by',
 'aliens',
 'a',
 'mesa',
 'is',
 'a',
 'naturally',
 'occuring',
 'rock',
 'formation',
 'that',
 'is',
 'found',
 'on',
 'mars',
 'and',
 'earth',
 'this',
 'face',
 'on',
 'mars',
 'only',
 'looks',
 'like',
 'a',
 'face',
 'because',
 'humans',
 'tend',
 'to',
 'see',
 'faces',
 'wherever',
 'we',
 'look',
 'humans',
 'are',
 'obviously',
 'extremely',
 'social',
 'which',
 'is',

Next, we calculate the mean and median number of words per essay (punctuation excluded) over the entire dataset.

In [31]:
import numpy as np

# Number of alphabetic tokens in each essay
word_counts = [len(words) for words in words_in_text_alp]

average_word_count = np.mean(word_counts)
median_word_count = np.median(word_counts)

print(f"Average word count without punctuation: {average_word_count}")
print()
print(f"Median word count without punctuation: {median_word_count}")

Average word count without punctuation: 417.6122867769655

Median word count without punctuation: 380.0


Topic identification: find the key words that distinguish an essay's subject. We first naively count the most common words in the semi-processed word list of the first essay.

In [32]:
from collections import Counter

Counter(words_in_text_alp[0]).most_common(10)

[('is', 13),
 ('on', 12),
 ('mars', 12),
 ('that', 10),
 ('the', 9),
 ('nasa', 8),
 ('life', 7),
 ('a', 7),
 ('would', 7),
 ('face', 6)]

We now apply some simple cleaning to get rid of these common words (such as 'is', 'on', 'that' and 'the'), since they clearly do not contribute to the subject of the essay. We first download the WordNet package, although we do not use it until the lemmatization step later.

In [33]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/derekkielty/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

To preprocess, let's first remove stop words, that is, very common words that carry little information about the subject.

In [34]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

no_stops = [ [word for word in text_alp if word not in stop_words]
            for text_alp in words_in_text_alp ] 

In [35]:
no_stops[0]

['people',
 'belive',
 'called',
 'face',
 'mars',
 'created',
 'life',
 'mars',
 'case',
 'face',
 'mars',
 'naturally',
 'occuring',
 'land',
 'form',
 'called',
 'mesa',
 'created',
 'aliens',
 'consiracy',
 'hide',
 'alien',
 'lifeforms',
 'mars',
 'evidence',
 'nasa',
 'found',
 'even',
 'suggests',
 'face',
 'created',
 'aliens',
 'mesa',
 'naturally',
 'occuring',
 'rock',
 'formation',
 'found',
 'mars',
 'earth',
 'face',
 'mars',
 'looks',
 'like',
 'face',
 'humans',
 'tend',
 'see',
 'faces',
 'wherever',
 'look',
 'humans',
 'obviously',
 'extremely',
 'social',
 'brain',
 'designed',
 'recognize',
 'faces',
 'many',
 'conspiracy',
 'theorists',
 'believe',
 'nasa',
 'hiding',
 'life',
 'mars',
 'rest',
 'world',
 'people',
 'would',
 'wrong',
 'nasa',
 'found',
 'life',
 'mars',
 'would',
 'get',
 'millions',
 'people',
 'attention',
 'nasa',
 'budget',
 'would',
 'increase',
 'drasticly',
 'means',
 'workers',
 'would',
 'get',
 'paid',
 'good',
 'reason',
 'nasa',
 'wou

Lastly, we lemmatize every word list in no_stops, using the WordNet data we downloaded earlier.

In [36]:
from nltk.stem import WordNetLemmatizer

# Instantiate the lemmatizer
word_net = WordNetLemmatizer()

lemmatized = [[word_net.lemmatize(t) for t in no_stop]
              for no_stop in no_stops]

Next, we count the most common words in the lemmatized first essay:

In [102]:
mars_topic_freq = Counter(lemmatized[0]).most_common(10)
mars_topic_freq

[('mar', 12),
 ('face', 8),
 ('nasa', 8),
 ('life', 7),
 ('would', 7),
 ('people', 5),
 ('found', 4),
 ('created', 3),
 ('mesa', 3),
 ('alien', 3)]
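
The top entry 'mar' has the same count (12) that 'mars' had before lemmatization, which suggests the lemmatizer stripped the final "s" from the planet's name. One possible workaround, sketched here under assumptions, is to wrap the lemmatizer so that selected words pass through unchanged. The names `protect_words` and `toy_lemmatize` are hypothetical; `toy_lemmatize` is only a crude stand-in for `word_net.lemmatize` so that the snippet runs without the WordNet data.

```python
from collections import Counter

def protect_words(lemmatize, protected):
    """Return a lemmatizer that leaves words in `protected` unchanged."""
    protected = frozenset(protected)

    def wrapped(word):
        return word if word in protected else lemmatize(word)

    return wrapped

def toy_lemmatize(word):
    # Crude stand-in for word_net.lemmatize: strips one trailing 's'.
    # This is NOT how WordNet works; it only mimics the 'mars' -> 'mar' effect.
    return word[:-1] if word.endswith('s') and len(word) > 3 else word

safe_lemmatize = protect_words(toy_lemmatize, {'mars'})

words = ['mars', 'aliens', 'mars', 'faces', 'face']
counts = Counter(safe_lemmatize(w) for w in words)
print(counts)
```

In the notebook, the same wrapper could be built as `protect_words(word_net.lemmatize, {'mars'})` and used in place of `word_net.lemmatize`.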

Next, I feed the above keywords to ChatGPT (writing "mars" rather than the lemma "mar"):

Input : "Write an essay as though it is written by 6-12 grader that is no more than 420 words based on the keywords : mars, face, nasa, would, life, people, found, created, mesa, alien" 

(I am not sure whether this prompt is specific enough, or whether this is even the best way to phrase the request.)

Output : "Title: My Amazing Adventure to Mars: The Search for Aliens and the Mysterious Face

Hey there, fellow explorers! Let me tell you about this incredible journey to Mars. Imagine a planet with red deserts and rocky landscapes, where scientists from NASA are sending cool rovers to find signs of life. And guess what? There's this mysterious mesa that people used to think looked like a giant face!

Back in 1976, NASA's Viking 1 spacecraft sent back a picture of a mesa on Mars that everyone thought was a face. It was like something out of a sci-fi movie! But later, when newer pictures came in, the face turned out to be just a funny-shaped hill. Still, the idea of finding a face on another planet got everyone super excited.

Now, why are we so obsessed with Mars? Well, NASA believes there might be tiny living things, like microbes, hiding under the surface or in old Martian lakes. How cool would that be? It's like playing hide and seek with aliens on a whole other planet!

The rovers NASA created, like Curiosity, are like our robotic buddies exploring Mars for us. These machines are super smart and have all sorts of tools to dig, sniff, and take awesome pictures. They're like our interplanetary detectives, searching for clues that could tell us if Mars ever had or still has any form of life.

Just think about it – if those rovers found even the tiniest hint of life on Mars, it would be mind-blowing! It would mean that we're not alone in this vast universe. It would be like discovering a whole new neighborhood of alien neighbors.

But getting to Mars and sending these cool rovers isn't easy. It takes a lot of brainpower from scientists and engineers who design and build these amazing space machines. They're like the superheroes of space exploration, using their smarts to unlock the secrets of the universe.

The idea of aliens has always been a big mystery that sparks our imagination. We wonder what they might look like and if they're friendly or not. Finding life on Mars would be like making friends with aliens, and that's just awesome to think about!

So, here's to the brave people at NASA and their rovers, exploring Mars and searching for signs of life. Who knows what they might discover next? Maybe they'll find something even cooler than a face – maybe they'll find our future Martian pals! It's like the best space adventure ever, and we're all a part of it, dreaming big and reaching for the stars."

### Topic Analysis of ChatGPT Mars Essay

In [84]:
mars_GPT_path = 'mars_GPT_essay.txt'
with open(mars_GPT_path, 'r', encoding='utf-8') as file:
    mars_GPT_text = file.read()

"My Amazing Adventure to Mars: The Search for Aliens and the Mysterious Face\nHey there, fellow explorers! Let me tell you about this incredible journey to Mars. Imagine a planet with red deserts and rocky landscapes, where scientists from NASA are sending cool rovers to find signs of life. And guess what? There's this mysterious mesa that people used to think looked like a giant face!\nBack in 1976, NASA's Viking 1 spacecraft sent back a picture of a mesa on Mars that everyone thought was a face. It was like something out of a sci-fi movie! But later, when newer pictures came in, the face turned out to be just a funny-shaped hill. Still, the idea of finding a face on another planet got everyone super excited.\nNow, why are we so obsessed with Mars? Well, NASA believes there might be tiny living things, like microbes, hiding under the surface or in old Martian lakes. How cool would that be? It's like playing hide and seek with aliens on a whole other planet!\nThe rovers NASA created, l

In [103]:
# Same analysis as above, applied to the ChatGPT Mars essay.
# TODO: factor these steps into a reusable function.

mars_GPT_words = word_tokenize(mars_GPT_text.lower())
mars_GPT_alp = [word for word in mars_GPT_words if word.isalpha()]
mars_GPT_no_stop = [word for word in mars_GPT_alp if word not in stop_words]
mars_GPT_lem = [(word_net.lemmatize(t)) for t in mars_GPT_no_stop ]
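
Since the same four steps are repeated for every essay, they could be wrapped in a single helper. The following is a minimal sketch, not the notebook's own code: the function name `top_keywords` and its parameters are assumptions, and the tokenizer, stop-word set and lemmatizer are passed in as arguments. The demo call uses simple stand-ins for the NLTK objects so that it runs on its own.

```python
from collections import Counter

def top_keywords(text, tokenize, stop_words, lemmatize, n=10):
    """Lowercase, tokenize, keep alphabetic non-stop words, lemmatize, count."""
    tokens = tokenize(text.lower())
    alphabetic = [tok for tok in tokens if tok.isalpha()]
    content = [tok for tok in alphabetic if tok not in stop_words]
    lemmas = [lemmatize(tok) for tok in content]
    return Counter(lemmas).most_common(n)

# Stand-ins so the sketch runs without NLTK; in the notebook one would pass
# word_tokenize, stop_words and word_net.lemmatize instead.
demo = top_keywords(
    "The rover found a face. The face is a mesa!",
    tokenize=lambda s: s.replace('.', ' . ').replace('!', ' ! ').split(),
    stop_words={'the', 'a', 'is'},
    lemmatize=lambda w: w,
    n=2,
)
print(demo)
```

With the objects defined earlier in this notebook, a call such as `top_keywords(mars_GPT_text, word_tokenize, stop_words, word_net.lemmatize)` would be expected to reproduce the keyword counts computed step by step.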

In [110]:
mars_GPT_topic_freq = Counter(mars_GPT_lem).most_common(10)
mars_GPT_topic_freq

[('like', 12),
 ('mar', 10),
 ('face', 6),
 ('alien', 5),
 ('nasa', 5),
 ('rover', 5),
 ('life', 5),
 ('would', 5),
 ('planet', 3),
 ('cool', 3)]

In [104]:
mars_topics = [topic[0] for topic in mars_topic_freq]
mars_GPT_topics = [topic[0] for topic in mars_GPT_topic_freq]

In [107]:
print(mars_topics)
print(mars_GPT_topics)

['mar', 'face', 'nasa', 'life', 'would', 'people', 'found', 'created', 'mesa', 'alien']
['like', 'mar', 'face', 'alien', 'nasa', 'rover', 'life', 'would', 'planet', 'cool']


The two collections of "topics" have 6 elements in common:

In [108]:
print(set(mars_topics).intersection(mars_GPT_topics))

{'mar', 'face', 'would', 'alien', 'life', 'nasa'}


### Topic Analysis of ChatGPT Driving Essay

As a second pass, we repeat the analysis on the second essay in the dataset:

In [124]:
driving_topic_freq = Counter(lemmatized[1]).most_common(10)

In [125]:
driving_topic_freq

[('car', 23),
 ('driverless', 7),
 ('could', 6),
 ('thing', 6),
 ('driver', 6),
 ('call', 5),
 ('one', 5),
 ('article', 5),
 ('wrong', 5),
 ('developement', 4)]

Input : "Write an essay as though it is written by 6-12 grader that is no more than 420 words based on the keywords : car, driverless, could, thing, driver, call, one, article, wrong, development"

Output: "Title: The Coolness of Driverless Cars: A Rollercoaster Ride into the Future

Have you ever imagined a world where cars drive themselves? It might sound like something from a futuristic movie, but guess what? Driverless cars could be a real thing, and they're cooler than you might think!

First off, let's talk about what a driverless car is. It's like having a super-smart robot buddy take you wherever you want to go. You just hop in, tell the car where you want to be, and boom, you're off! No need for a grown-up driver – you're in charge, and it's like having your own personal chauffeur.

Now, why are people so excited about driverless cars? Well, for one, they could make our roads safer. You know how sometimes drivers can get distracted or tired? With driverless cars, there's no need to worry about that. These cars don't get sleepy or check their phones; they're like the superheroes of the road, always alert and ready to keep us safe.

Imagine being able to call one of these cool cars whenever you need to go somewhere. No more waiting for your parents to drive you around – you're the boss of your own transportation! It's like having your own magic carpet, but instead of flying, it cruises on the roads, taking you to your favorite places.

Some people in an article I read said that driverless cars might be a bad thing. They're worried about the development of this new technology and think it could go wrong. But here's the thing – scientists and really smart people are working hard to make sure these cars are super safe. It's like having a bunch of wizards making sure our magical ride is the best and safest it can be.

And get this – driverless cars could be good for the planet too. They might be able to drive more smoothly, using less fuel and creating fewer yucky emissions. So not only are they like your personal chauffeur, but they're also helping to take care of our Earth. How awesome is that?

Sure, there might be a few bumps in the road as we figure things out, but that's how it is with anything new. Remember when your parents first gave you the keys to your bike, and you wobbled a bit before becoming a biking pro? It's the same idea with driverless cars – we're learning and getting better every day.

So, next time someone says driverless cars are a bad thing, just remember the coolness they bring – safety, independence, and a greener planet. It's like stepping into the future, and who wouldn't want to be a part of that awesome adventure? Buckle up, fellow adventurers – the driverless car rollercoaster ride is just beginning!"

Remarks: The generated essay contains many question marks and apostrophes, which could be a consequence of how the prompt is posed. The text also reads rather colloquially.

In [112]:
driving_GPT_path = 'driving_GPT_essay.txt'
with open(driving_GPT_path, 'r', encoding='utf-8') as file:
    driving_GPT_text = file.read()

In [114]:
driving_GPT_words = word_tokenize(driving_GPT_text.lower())
driving_GPT_alp = [word for word in driving_GPT_words if word.isalpha()]
driving_GPT_no_stop = [word for word in driving_GPT_alp if word not in stop_words]
driving_GPT_lem = [(word_net.lemmatize(t)) for t in driving_GPT_no_stop ]

In [121]:
driving_GPT_topic_freq = Counter(driving_GPT_lem).most_common(10)
driving_GPT_topic_freq

[('car', 26),
 ('driverless', 16),
 ('like', 14),
 ('might', 10),
 ('could', 8),
 ('thing', 8),
 ('road', 8),
 ('drive', 6),
 ('go', 6),
 ('need', 6)]

In [126]:
driving_topics = [topic[0] for topic in driving_topic_freq]
driving_GPT_topics = [topic[0] for topic in driving_GPT_topic_freq]

In [127]:
print(driving_topics)
print(driving_GPT_topics)

['car', 'driverless', 'could', 'thing', 'driver', 'call', 'one', 'article', 'wrong', 'developement']
['car', 'driverless', 'like', 'might', 'could', 'thing', 'road', 'drive', 'go', 'need']


Note that the first two topics are the same! The two collections of "topics" have 4 elements in common:

In [129]:
print(set(driving_topics).intersection(driving_GPT_topics))

{'car', 'driverless', 'thing', 'could'}
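
To put the two overlaps on a common scale, one option is the Jaccard index: the number of shared keywords divided by the number of distinct keywords across both lists. This is an additional measure offered as a sketch, not part of the original analysis; the helper name `jaccard` is an assumption, and the keyword lists are copied from the printed outputs above.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Keyword lists copied from the printed outputs in this notebook.
mars_topics = ['mar', 'face', 'nasa', 'life', 'would',
               'people', 'found', 'created', 'mesa', 'alien']
mars_GPT_topics = ['like', 'mar', 'face', 'alien', 'nasa',
                   'rover', 'life', 'would', 'planet', 'cool']
driving_topics = ['car', 'driverless', 'could', 'thing', 'driver',
                  'call', 'one', 'article', 'wrong', 'developement']
driving_GPT_topics = ['car', 'driverless', 'like', 'might', 'could',
                      'thing', 'road', 'drive', 'go', 'need']

mars_score = jaccard(mars_topics, mars_GPT_topics)            # 6 shared / 14 distinct
driving_score = jaccard(driving_topics, driving_GPT_topics)   # 4 shared / 16 distinct
print(round(mars_score, 3), round(driving_score, 3))  # 0.429 0.25
```

By this measure the generated Mars essay tracks its source keywords more closely than the generated driving essay does, although two essays are far too few to draw conclusions from.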


### Topic Distribution of Dataset

In this section, we determine the most common word from each essay. Considering the most common word as the "topic" of an essay allows us to calculate a distribution of topics, which we do below. 
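
To make the idea concrete before running it on the full dataset, here is a toy sketch of the same computation. The three short word lists are hypothetical data standing in for `lemmatized`, and the `toy_` names are assumptions used only in this example.

```python
from collections import Counter

# Toy stand-in for `lemmatized`: three tiny "essays" (hypothetical data).
toy_essays = [
    ['car', 'driver', 'car', 'road'],
    ['student', 'school', 'student'],
    ['car', 'car', 'phone'],
]

# The most frequent word of each essay is taken as its "topic".
toy_top_word = [Counter(words).most_common(1)[0][0] for words in toy_essays]

# Share of essays per topic, most common topic first.
toy_dist = [(topic, round(count / len(toy_essays), 3))
            for topic, count in Counter(toy_top_word).most_common()]
print(toy_dist)  # [('car', 0.667), ('student', 0.333)]
```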

In [79]:
top_word_freq = [Counter(topics).most_common(1)[0] for topics in lemmatized]
top_word = [freq[0] for freq in top_word_freq]

In [132]:
print(f'There are {len(text_list)} essays but only {len(set(top_word))} "topics".')

There are 15594 essays but only 397 "topics".


In [174]:
# Number of essays per topic
Counter(top_word)

Counter({'mar': 135,
         'car': 2130,
         'would': 446,
         'student': 2406,
         'animal': 33,
         'venus': 749,
         'phone': 783,
         'school': 590,
         'grade': 62,
         'vote': 764,
         'kid': 96,
         'community': 357,
         'face': 356,
         'college': 58,
         'person': 49,
         'people': 563,
         'policy': 36,
         'driver': 28,
         'system': 44,
         'electoral': 484,
         'opinion': 182,
         'state': 152,
         'computer': 79,
         'ad': 2,
         'still': 1,
         'topic': 3,
         'believe': 5,
         'make': 47,
         'author': 49,
         'seagoing': 80,
         'technology': 88,
         'advice': 391,
         'class': 160,
         'need': 16,
         'facial': 31,
         'sport': 136,
         'extracurricular': 14,
         'get': 100,
         'driving': 227,
         'feel': 10,
         'one': 88,
         'could': 172,
         'home': 35,
      

In [184]:
#normalized distribution of topics

total = len(text_list)
topic_dist = [(topic, round(count/total,3)) for topic, count in Counter(top_word).items()]
topic_dist = sorted(topic_dist, key = lambda topic: topic[1], reverse = True)

for topic in topic_dist:
    print(topic)

('student', 0.154)
('car', 0.137)
('phone', 0.05)
('vote', 0.049)
('venus', 0.048)
('school', 0.038)
('people', 0.036)
('electoral', 0.031)
('would', 0.029)
('advice', 0.025)
('community', 0.023)
('face', 0.023)
('project', 0.018)
('driving', 0.015)
('opinion', 0.012)
('help', 0.012)
('could', 0.011)
('activity', 0.011)
('state', 0.01)
('class', 0.01)
('mar', 0.009)
('sport', 0.009)
('alien', 0.007)
('emotion', 0.007)
('kid', 0.006)
('technology', 0.006)
('get', 0.006)
('one', 0.006)
('luke', 0.006)
('president', 0.006)
('computer', 0.005)
('seagoing', 0.005)
('think', 0.005)
('grade', 0.004)
('college', 0.004)
('choice', 0.004)
('average', 0.004)
('like', 0.004)
('cowboy', 0.004)
('person', 0.003)
('system', 0.003)
('make', 0.003)
('author', 0.003)
('planet', 0.003)
('decision', 0.003)
('elector', 0.003)
('ask', 0.003)
('time', 0.003)
('cell', 0.003)
('program', 0.003)
('animal', 0.002)
('policy', 0.002)
('driver', 0.002)
('facial', 0.002)
('home', 0.002)
('good', 0.002)
('want', 0.00

The most common topic is "student". Now we look at the second most common word among the essays whose topic is "student".

In [188]:
# For each essay whose top word is 'student', take its second most common word.
student_word_freq = [Counter(topics).most_common(2)[1] for topics in lemmatized
                     if Counter(topics).most_common(1)[0][0] == 'student']

student_word = [freq[0] for freq in student_word_freq]

In [189]:
print(f'There are {len(student_word)} essays with a top word of "student" '
      f'but only {len(set(student_word))} "sub-topics" among them.')

There are 2406 essays with a top word of "student" but only 163 "sub-topics" among them.


In [190]:
Counter(student_word).most_common(20)

[('school', 398),
 ('project', 332),
 ('class', 171),
 ('would', 136),
 ('activity', 125),
 ('teacher', 97),
 ('learning', 83),
 ('online', 78),
 ('home', 67),
 ('community', 66),
 ('phone', 64),
 ('summer', 57),
 ('could', 48),
 ('sport', 45),
 ('help', 37),
 ('work', 35),
 ('technology', 29),
 ('grade', 25),
 ('time', 22),
 ('average', 19)]

In [192]:
total = len(student_word)
topic_dist = [(topic, round(count/total,3)) for topic, count in Counter(student_word).items()]
topic_dist = sorted(topic_dist, key = lambda topic: topic[1], reverse = True)

for topic in topic_dist:
    print(topic)

('school', 0.165)
('project', 0.138)
('class', 0.071)
('would', 0.057)
('activity', 0.052)
('teacher', 0.04)
('learning', 0.034)
('online', 0.032)
('home', 0.028)
('phone', 0.027)
('community', 0.027)
('summer', 0.024)
('could', 0.02)
('sport', 0.019)
('help', 0.015)
('work', 0.015)
('technology', 0.012)
('grade', 0.01)
('time', 0.009)
('average', 0.008)
('assignment', 0.007)
('may', 0.007)
('emotion', 0.007)
('cell', 0.007)
('extracurricular', 0.007)
('classroom', 0.007)
('policy', 0.006)
('learn', 0.005)
('computer', 0.005)
('get', 0.005)
('think', 0.004)
('system', 0.004)
('c', 0.004)
('like', 0.004)
('distance', 0.003)
('want', 0.003)
('many', 0.003)
('topic', 0.003)
('club', 0.002)
('need', 0.002)
('one', 0.002)
('new', 0.002)
('might', 0.002)
('program', 0.002)
('facial', 0.002)
('kid', 0.002)
('education', 0.002)
('know', 0.002)
('play', 0.002)
('lesson', 0.002)
('software', 0.002)
('make', 0.002)
('child', 0.002)
('expression', 0.002)
('also', 0.002)
('facs', 0.002)
('able', 0.

This analysis shows that essay topics are far from uniformly distributed and are biased toward subjects that students in grades 6-12 would write about.