# Exercise 5: Spark APIs [100 points]

## 1. Accumulators [10 points]
The title of this Q&A is wrong; it is really about global variables (aka accumulators). The question shows the following incorrect code:
val data = Array(1,2,3,4,5)
var counter = 0
var rdd = sc.parallelize(data)

// Wrong: Don't do this!!
rdd.foreach(x => counter += x)

println("Counter value: " + counter)
Write a corrected version of the code and demonstrate its intended operation.



In [None]:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.LongAccumulator

object SparkAccumulatorExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AccumulatorExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val data = Array(1, 2, 3, 4, 5)
    val rdd = sc.parallelize(data)

    // Correct approach using Accumulator
    val counter: LongAccumulator = sc.longAccumulator("Counter Accumulator")

    rdd.foreach(x => counter.add(x))

    println("Counter value: " + counter.value)
    
    sc.stop()
  }
}

Output: Counter value: 15
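
The original `foreach` version fails because each task works on its own serialized copy of `counter`, so the driver's variable is never updated. The block below is a plain-Python simulation of that behaviour, not Spark code: `run_task`, the `driver_state` dictionary, and the two-partition split are hypothetical stand-ins for executor tasks and partitions.

```python
import pickle

def run_task(serialized_state, partition):
    """Hypothetical stand-in for an executor task: it only sees a copy."""
    state = pickle.loads(serialized_state)   # private copy, like a shipped closure
    for x in partition:
        state["counter"] += x
    return state["counter"]                  # partial sum for this partition

data = [1, 2, 3, 4, 5]
partitions = [data[:2], data[2:]]            # assumed split into two partitions

driver_state = {"counter": 0}
partials = [run_task(pickle.dumps(driver_state), p) for p in partitions]

# The driver's own variable was never touched by the tasks.
broken_total = driver_state["counter"]

# Accumulator-style fix: tasks report partial sums and the driver merges them.
merged_total = sum(partials)

print("Closure-style counter:", broken_total)
print("Accumulator-style counter:", merged_total)
```

The merge step is the part an accumulator handles on the driver, which is why `counter.value` is read there after the action completes.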

## 2. Airline Traffic [45 points]

#### 1. [15 points] Describe in words and in code (where applicable) the steps you took to set up the environment for gathering the statistical data in the below questions.

#### 2. [6 points] Which US Airline Has the Least Delays? Report by full names, (e.g., Delta Airlines, not DL)

In [23]:
# Step 1: Import libraries
import pandas as pd
import re


    # Step 2: Define documented column names
documented_columns = [
        'Carrier', 'FlightNumber',
        'Undocumented_1', 'Undocumented_2',  # Placeholder for extra columns
        'OperatingCarrier', 'OperatingFlightNumber',
        'DepartureAirport', 'ArrivalAirport', 'FlightDate', 'DayOfWeek',
        'ScheduledDepartureTime_OAG', 'ScheduledDepartureTime_CRS',
        'ActualDepartureTime', 'ScheduledArrivalTime_OAG', 'ScheduledArrivalTime_CRS',
        'ActualArrivalTime', 'Diff_ScheduledDepartureTimes', 'Diff_ScheduledArrivalTimes',
        'ScheduledElapsedMinutes', 'DepartureDelayMinutes', 'ArrivalDelayMinutes',
        'Diff_ElapsedMinutes', 'WheelsOffTime', 'WheelsOnTime', 'AircraftTailNumber',
        'CancellationCode', 'MinutesLate_DelayCodeE', 'MinutesLate_DelayCodeF',
        'MinutesLate_DelayCodeG', 'MinutesLate_DelayCodeH', 'MinutesLate_DelayCodeI'
    ]

    # Step 3: Determine the total number of columns in the data
with open('ontime.td.202406.asc', 'r') as f:
        first_line = f.readline()
        total_columns = len(first_line.split('|'))

    # Step 4: Generate column names
if total_columns > len(documented_columns):
        extra_columns = total_columns - len(documented_columns)
        column_names = documented_columns + [f'ExtraColumn_{i}' for i in range(1, extra_columns + 1)]
else:
        column_names = documented_columns[:total_columns]

    # Step 5: Specify data types for all columns as strings
dtype_spec = {col: str for col in column_names}

    # Step 6: Load the datasets for June 2024 and July 2024
june_data = pd.read_csv('ontime.td.202406.asc', delimiter='|', header=None, names=column_names, dtype=dtype_spec, low_memory=False)
july_data = pd.read_csv('ontime.td.202407.asc', delimiter='|', header=None, names=column_names, dtype=dtype_spec, low_memory=False)

combined_data = pd.concat([june_data, july_data], ignore_index=True)

    # Step 7: Keep only the documented columns (drops the 2 undocumented placeholders and any extras)
columns_to_keep = [
        'Carrier', 'FlightNumber', 'OperatingCarrier', 'OperatingFlightNumber',
        'DepartureAirport', 'ArrivalAirport', 'FlightDate', 'DayOfWeek',
        'ScheduledDepartureTime_OAG', 'ScheduledDepartureTime_CRS',
        'ActualDepartureTime', 'ScheduledArrivalTime_OAG', 'ScheduledArrivalTime_CRS',
        'ActualArrivalTime', 'Diff_ScheduledDepartureTimes', 'Diff_ScheduledArrivalTimes',
        'ScheduledElapsedMinutes', 'DepartureDelayMinutes', 'ArrivalDelayMinutes',
        'Diff_ElapsedMinutes', 'WheelsOffTime', 'WheelsOnTime', 'AircraftTailNumber',
        'CancellationCode', 'MinutesLate_DelayCodeE', 'MinutesLate_DelayCodeF',
        'MinutesLate_DelayCodeG', 'MinutesLate_DelayCodeH', 'MinutesLate_DelayCodeI'
    ]
combined_data = combined_data[columns_to_keep]

    # Step 8: Clean Carrier column
combined_data['Carrier'] = combined_data['Carrier'].str.upper().str.strip()

    # Step 9: Filter valid Carrier codes (two uppercase letters)
valid_carrier_pattern = re.compile(r'^[A-Z]{2}$')
carrier_filter = combined_data['Carrier'].str.match(valid_carrier_pattern, na=False)
print(f"\nRows with valid Carrier codes: {carrier_filter.sum()}")
combined_data = combined_data[carrier_filter]

    

        


Rows with valid Carrier codes: 1225398


The code above shows how I parsed and cleaned the raw data so that I could answer the questions.
The files are read without a header row, so I assigned the column labels myself. I added only 2 placeholder columns
between columns B and C, because adding 4, as the instructions said, shifted the arrival and departure airport
columns out of place for the following questions. I loaded every column as a string and converted to int only where
needed, because I was getting many data-type errors that I couldn't resolve otherwise. These errors were likely
caused by the column labels initially being offset by two.
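
A quick way to confirm that a column layout is aligned is to check that the two airport columns really contain three-letter codes. The following is a minimal sketch under stated assumptions: the sample record is made up (it is not taken from the real files), and `airports_look_valid` is a hypothetical helper, not part of the code above.

```python
import csv
import io
import re

AIRPORT_CODE = re.compile(r"^[A-Z]{3}$")

def airports_look_valid(rows, column_names):
    """Return True when both airport columns hold three-letter codes in every row."""
    dep = column_names.index("DepartureAirport")
    arr = column_names.index("ArrivalAirport")
    return all(
        bool(AIRPORT_CODE.match(row[dep].strip()))
        and bool(AIRPORT_CODE.match(row[arr].strip()))
        for row in rows
    )

# Made-up record: carrier, flight number, two undocumented fields,
# operating carrier, operating flight number, airports, date, day of week.
sample = "DL|0123|x|y|DL|0123|ATL|LAX|06/01/2024|6\n"
rows = list(csv.reader(io.StringIO(sample), delimiter="|"))

head = ["Carrier", "FlightNumber"]
tail = ["OperatingCarrier", "OperatingFlightNumber",
        "DepartureAirport", "ArrivalAirport"]
two_placeholders = head + ["Undocumented_1", "Undocumented_2"] + tail
four_placeholders = head + [f"Undocumented_{i}" for i in range(1, 5)] + tail

print("2 placeholders aligned:", airports_look_valid(rows, two_placeholders))
print("4 placeholders aligned:", airports_look_valid(rows, four_placeholders))
```

With the made-up record, the four-placeholder layout pushes the airport labels onto the date and day-of-week fields, which is the kind of shift described above.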

In [53]:
# Step 10: Filter valid DepartureDelayMinutes
    # Convert column to numeric, forcing errors to NaN
combined_data['DepartureDelayMinutes'] = pd.to_numeric(combined_data['DepartureDelayMinutes'], errors='coerce')

    # Drop rows where conversion resulted in NaN
combined_data = combined_data.dropna(subset=['DepartureDelayMinutes'])

    # Convert from float to int (since NaN values are removed)
combined_data['DepartureDelayMinutes'] = combined_data['DepartureDelayMinutes'].astype(int)

    # Step 11: Select the rows to analyze (rows without a valid delay were already dropped above)
relevant_data = combined_data.dropna(subset=['DepartureDelayMinutes'])

    # Step 12: Analyze the data
if relevant_data.empty:
        print("\nThe cleaned dataset is empty. No valid rows found.")
else:
        # Group by Carrier and calculate the mean delay (including all flights)
    average_delays = relevant_data.groupby('Carrier')['DepartureDelayMinutes'].mean()

# Sort by average delay
    sorted_delays = average_delays.sort_values()

    airline_mapping = {
                'DL': 'Delta Airlines',
                'AA': 'American Airlines',
                'UA': 'United Airlines',
                'WN': 'Southwest Airlines',
                'AS': 'Alaska Airlines',
                'B6': 'JetBlue Airways',
                'NK': 'Spirit Airlines',
                'F9': 'Frontier Airlines',
                'HA': 'Hawaiian Airlines',
                'G4': 'Allegiant Air',
                'YX': 'Republic Airways',
                'OO': 'SkyWest Airlines',
                'MQ': 'Envoy Air',
                'OH': 'PSA Airlines',
                'YV': 'Mesa Airlines',
                'QX': 'Horizon Air',
                'EV': 'ExpressJet Airlines'
            }

            # Convert airline codes to full names
    sorted_delays.index = sorted_delays.index.map(airline_mapping)

            # Report the airline with the least delays
    least_delay_airline = sorted_delays.idxmin()
    least_delay_value = sorted_delays.min()
    print(sorted_delays.head(5))

    print(f"\nThe airline with the least delays is {least_delay_airline} with an average delay of {least_delay_value:.2f} minutes.")



Carrier
Southwest Airlines    125.858359
American Airlines     129.479622
Delta Airlines        131.965433
Hawaiian Airlines     140.665946
United Airlines       141.952446
Name: DepartureDelayMinutes, dtype: float64

The airline with the least delays is Southwest Airlines with an average delay of 125.86 minutes.


The code above calculates the average departure delay for each airline.
I converted the departure delays to int so that the mean could be computed.
The five airlines with the lowest average delays are shown above. Southwest Airlines
had the lowest average delay, at 125.86 minutes.
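
One caveat with mapping codes through a dictionary is that `index.map(airline_mapping)` turns any carrier code missing from the dictionary into NaN. The sketch below shows the same group-and-average step in plain Python with a fallback to the raw code; the `records` list is made-up data for illustration and `mean_delay_by_carrier` is a hypothetical helper.

```python
from collections import defaultdict

# Subset of the mapping used above.
airline_mapping = {"WN": "Southwest Airlines", "AA": "American Airlines"}

# Made-up (carrier, departure-delay-minutes) pairs for illustration only.
records = [("WN", 10), ("WN", 20), ("AA", 30), ("ZZ", 5)]

def mean_delay_by_carrier(records, mapping):
    """Average delay per carrier, sorted ascending, keyed by full name when known."""
    totals = defaultdict(lambda: [0, 0])     # carrier -> [sum of delays, flight count]
    for carrier, delay in records:
        totals[carrier][0] += delay
        totals[carrier][1] += 1
    # Fall back to the raw code when the mapping has no full name.
    means = {mapping.get(c, c): s / n for c, (s, n) in totals.items()}
    return dict(sorted(means.items(), key=lambda kv: kv[1]))

result = mean_delay_by_carrier(records, airline_mapping)
print(result)
```

In pandas, a comparable fallback would be to keep the original index value wherever the mapped value is missing.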

#### 3. [6 points] What Departure Time of Day Is Best to Avoid Flight Delays, segmented into 5 time blocks [night (10 pm - 6 am), morning (6 am to 10 am), mid-day (10 am to 2 pm), afternoon (2 pm - 6 pm), evening (6 pm - 10 pm)]

In [52]:
def categorize_time(hour):
    if 22 <= hour or hour < 6:
        return 'Night'
    elif 6 <= hour < 10:
        return 'Morning'
    elif 10 <= hour < 14:
        return 'Mid-Day'
    elif 14 <= hour < 18:
        return 'Afternoon'
    else:
        return 'Evening'

## Find the best time of day to avoid delays
combined_data['DepartureHour'] = combined_data['ActualDepartureTime'].str[:2]
combined_data['DepartureHour'] = pd.to_numeric(combined_data['DepartureHour'], errors='coerce')
combined_data['TimeBlock'] = combined_data['DepartureHour'].apply(categorize_time)
timeblock_delays = combined_data.groupby('TimeBlock')['DepartureDelayMinutes'].mean().sort_values()
print(timeblock_delays)

TimeBlock
Night        132.867186
Mid-Day      135.265775
Afternoon    135.455740
Evening      137.914275
Morning      175.985673
Name: DepartureDelayMinutes, dtype: float64


The code for calculating the average delay per time block is above. Night (10 pm to 6 am) was the best
departure time block for avoiding delays, with an average delay of 132.87 minutes; Morning was the worst, at 175.99 minutes.
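
One detail worth noting about `categorize_time`: a missing hour (NaN after `pd.to_numeric(..., errors='coerce')`) fails every comparison and therefore falls through to the final `else`, so it is labelled 'Evening'. Whether any such rows exist in this data is not shown above. A minimal sketch of a variant that returns `None` for missing hours instead, so they can be excluded from the grouping:

```python
import math

def categorize_time(hour):
    """Map an hour (0-23) to a time block; return None when the hour is missing."""
    if hour is None or (isinstance(hour, float) and math.isnan(hour)):
        return None
    if hour >= 22 or hour < 6:
        return 'Night'
    elif hour < 10:
        return 'Morning'
    elif hour < 14:
        return 'Mid-Day'
    elif hour < 18:
        return 'Afternoon'
    else:
        return 'Evening'

for hour in (23, 7, 12, 15, 19, float('nan')):
    print(hour, categorize_time(hour))
```

Because `groupby` drops missing keys by default, rows labelled `None` would simply be left out of the per-block averages.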

#### 4. [5 points] Which Airports Have The Most Flight Delays? Report by full name, (e.g., “Newark Liberty International,” not “EWR,” when the airport code EWR is provided).

In [48]:
## Find the airports with the most delays
airport_mapping = {
    'ATL': 'Atlanta - Hartsfield Jackson',
    'BWI': "Baltimore/Wash. Int'l Thurgood Marshall",
    'BOS': 'Boston - Logan International',
    'CLT': 'Charlotte - Douglas',
    'MDW': 'Chicago - Midway',
    'ORD': "Chicago - O'Hare",
    'CVG': 'Cincinnati Greater Cincinnati',
    'DFW': 'Dallas-Fort Worth International',
    'DEN': 'Denver - International',
    'DTW': 'Detroit - Metro Wayne County',
    'FLL': 'Fort Lauderdale Hollywood International',
    'IAH': 'Houston - George Bush International',
    'LAS': 'Las Vegas - McCarran International',
    'LAX': 'Los Angeles International',
    'MIA': 'Miami International',
    'MSP': 'Minneapolis-St. Paul International',
    'EWR': 'Newark Liberty International',
    'JFK': 'New York - JFK International',
    'LGA': 'New York - LaGuardia',
    'MCO': 'Orlando International',
    'OAK': 'Oakland International',
    'PHL': 'Philadelphia International',
    'PHX': 'Phoenix - Sky Harbor International',
    'PDX': 'Portland International',
    'SLC': 'Salt Lake City International',
    'STL': 'St. Louis Lambert International',
    'SAN': 'San Diego Intl. Lindbergh Field',
    'SFO': 'San Francisco International',
    'SEA': 'Seattle-Tacoma International',
    'TPA': 'Tampa International',
    'DCA': 'Washington - Reagan National',
    'IAD': 'Washington - Dulles International',
    'PPG': 'Pago Pago International',
    'GUM': 'Guam International',
    'HNL': 'Honolulu International',
    'OGG': 'Kahului Airport',
    'KOA': 'Kona International',
    'LIH': 'Lihue Airport',
    'ITO': 'Hilo International',
    'BQN': 'Aeropuerto Internacional Rafael Hernández',
    'SJU': 'San Juan - Luis Muñoz Marín International',
    'STT': 'Cyril E. King Airport'
}

    # Find the airports with the most delays
combined_data['ArrivalDelayMinutes'] = pd.to_numeric(combined_data['ArrivalDelayMinutes'], errors='coerce')
combined_data['DepartureDelayMinutes'] = pd.to_numeric(combined_data['DepartureDelayMinutes'], errors='coerce')

arrival_delay = combined_data.groupby('ArrivalAirport')['ArrivalDelayMinutes'].mean().sort_values(ascending=False)
departure_delay = combined_data.groupby('DepartureAirport')['DepartureDelayMinutes'].mean().sort_values(ascending=False)
total_delay = arrival_delay + departure_delay

total_delay.index = total_delay.index.to_series().replace(airport_mapping)
total_delay = total_delay.sort_values(ascending=False)
print(total_delay.head(5))





Pago Pago International                      334.461538
Aeropuerto Internacional Rafael Hernández    254.245902
San Juan - Luis Muñoz Marín International    247.113487
Guam International                           242.926230
Cyril E. King Airport                        242.833724
dtype: float64


The code above shows the airports with the most delays, ranked by the sum of the average arrival delay and the average departure delay. Pago Pago International in American Samoa was the airport with the most delays.

#### 5. [5 points] What Are the Top 5 Busiest Airports in the US? Report by full name, (e.g., “Newark Liberty International,” not “EWR”).

In [46]:
# Count total arrivals and departures
arrivals_count = combined_data['ArrivalAirport'].value_counts()
departures_count = combined_data['DepartureAirport'].value_counts()

# Sum arrivals and departures for each airport
total_flights = arrivals_count.add(departures_count, fill_value=0)

# Map airport codes to full names
total_flights.index = total_flights.index.to_series().replace(airport_mapping)

# Sort in descending order and get the top 5
top_5_busiest_airports = total_flights.sort_values(ascending=False).head(5)

# Display result
print("Top 5 Busiest Airports in the US (by total arrivals and departures):")
print(top_5_busiest_airports)


Top 5 Busiest Airports in the US (by total arrivals and departures):
Atlanta - Hartsfield Jackson       114182
Dallas-Fort Worth International    110252
Chicago - O'Hare                   109931
Denver - International             101826
Charlotte - Douglas                 86198
Name: count, dtype: int64


The above output shows the busiest airports in the US by total arrivals and departures. Atlanta was the busiest overall.
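
The counting logic (arrivals plus departures per airport) can be illustrated without pandas. This is a minimal sketch using `collections.Counter`; the `flights` list is made-up data, not taken from the dataset.

```python
from collections import Counter

# Made-up (departure, arrival) airport pairs for illustration only.
flights = [("ATL", "DFW"), ("DFW", "ATL"), ("ATL", "ORD"), ("DEN", "ATL")]

departures = Counter(dep for dep, _ in flights)
arrivals = Counter(arr for _, arr in flights)

# Adding two Counters sums the counts per key, like Series.add(fill_value=0).
total_flights = departures + arrivals

print(total_flights.most_common(2))
```

An airport that appears only as an origin or only as a destination still gets counted, which mirrors the `fill_value=0` argument in the code above.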

## 3. ShortStoryJam [45 pts]

#### 1. [3 points] To seed the effort, the text of about 22 short stories by Edgar Allan Poe, he of the “quoth the raven” fame, are available in my github repository. Clean the text and remove stopwords, as specified in a previous assignment.

In [57]:
#!/usr/bin/env python3
import re
import string
import requests
import sys

def load_stopwords():
    url = "https://gist.githubusercontent.com/rg089/35e00abf8941d72d419224cfd5b5925d/raw/12d899b70156fd0041fa9778d657330b024b959c/stopwords.txt"
    stopwords_list = requests.get(url).content
    return set(stopwords_list.decode().splitlines())

stopwords = load_stopwords()

def process_text(text, is_recursive=False):
    if not is_recursive:
        text = text.lower()
        text = re.sub(r'\[.*?\]', '', text)
        text = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', text)
        text = re.sub(r'\d+', ' ', text)
        text = ' '.join(remove_stopwords(text))
        return process_text(text, is_recursive=True)
    else:
        return text

def remove_stopwords(words):
    words_list = re.sub(r'[^a-zA-Z0-9]', " ", words.lower()).split()
    return [word for word in words_list if word not in stopwords]

def process_file(filename):
    try:
        with open(filename, 'r', encoding='utf-8') as file:
            text = file.read()
            processed_text = process_text(text)
            return processed_text
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found.")
    except Exception as e:
        print(f"An error occurred: {e}")


paragraph = process_file('A_DESCENT_INTO_THE_MAELSTROM.txt')
print(paragraph)

ways god nature providence ways models frame commensurate vastness profundity unsearchableness works depth greater democritus joseph glanville reached summit loftiest crag minutes man exhausted speak long ago length guided route youngest sons years happened event happened mortal man man survived hours deadly terror endured broken body soul suppose man single day change hairs jetty black white weaken limbs unstring nerves tremble exertion frightened shadow scarcely cliff giddy cliff edge carelessly thrown rest weightier portion body hung falling tenure elbow extreme slippery edge cliff arose sheer unobstructed precipice black shining rock sixteen feet crags beneath tempted half dozen yards brink truth deeply excited perilous position companion fell length ground clung shrubs dared glance upward sky struggled vain divest idea foundations mountain danger fury winds long reason sufficient courage sit distance fancies guide brought view scene event mentioned story spot eye continued particu

I reused the code above from my Assignment 5 to clean the 22 short stories. It was modified to accept filenames instead of raw text.
The output shown is from "A Descent Into the Maelstrom" as an example.
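
Note that `process_text` strips punctuation, so the cleaned output is one long run of words with no sentence boundaries left. If sentence boundaries are needed later, one option is to split into sentences first and clean each sentence afterwards. The following is a minimal sketch of that idea: the regex splitter is a naive stand-in (not NLTK's tokenizer), and `STOPWORDS` is a tiny hypothetical set rather than the list downloaded above.

```python
import re
import string

# Tiny stand-in stopword set for illustration only.
STOPWORDS = {"the", "a", "of", "to", "and", "we", "had", "now", "it", "was"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def clean_sentence(sentence):
    """Lowercase, strip punctuation and digits, and drop stopwords."""
    sentence = sentence.lower()
    sentence = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', sentence)
    sentence = re.sub(r'\d+', ' ', sentence)
    return ' '.join(w for w in sentence.split() if w not in STOPWORDS)

# Short sample text for illustration.
text = "We had now reached the summit of the loftiest crag. It was exhausting!"
cleaned = [clean_sentence(s) for s in split_sentences(text)]
print(cleaned)
```

Each list element then corresponds to one original sentence, which keeps the cleaned words grouped the way the source text grouped them.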

#### 2. [8 points] Use NLTK to decompose the first story (A_DESCENT_INTO…) into sentences & sentences into tokens. Here is the code for doing that, after you set the variable paragraph to hold the text of the story.

In [67]:
import nltk
nltk.download('punkt', download_dir='/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/')
nltk.download('averaged_perceptron_tagger', download_dir='/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/')


[nltk_data] Downloading package punkt to /Library/Frameworks/Python.fr
[nltk_data]     amework/Versions/3.12/lib/python3.12/site-packages/...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to /Library
[nltk_data]     /Frameworks/Python.framework/Versions/3.12/lib/python3
[nltk_data]     .12/site-packages/...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
import nltk
nltk.download()
from nltk.tokenize import sent_tokenize, word_tokenize
sent_text = nltk.sent_tokenize(paragraph) # this gives us a list of sentences
# now loop over each sentence and tokenize it separately
all_tagged = [nltk.pos_tag(nltk.word_tokenize(sent)) for sent in sent_text]

I was having issues with the given code, so I changed the download call to "nltk.download()", following a Stack Overflow post. The whole Python code is pasted below.

In [1]:
#!/usr/bin/env python3
import re
import string
import requests
import sys
import nltk

def load_stopwords():
    url = "https://gist.githubusercontent.com/rg089/35e00abf8941d72d419224cfd5b5925d/raw/12d899b70156fd0041fa9778d657330b024b959c/stopwords.txt"
    stopwords_list = requests.get(url).content
    return set(stopwords_list.decode().splitlines())

stopwords = load_stopwords()

def process_text(text, is_recursive=False):
    if not is_recursive:
        text = text.lower()
        text = re.sub(r'\[.*?\]', '', text)
        text = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', text)
        text = re.sub(r'\d+', ' ', text)
        text = ' '.join(remove_stopwords(text))
        return process_text(text, is_recursive=True)
    else:
        return text

def remove_stopwords(words):
    words_list = re.sub(r'[^a-zA-Z0-9]', " ", words.lower()).split()
    return [word for word in words_list if word not in stopwords]

def process_file(filename):
    try:
        with open(filename, 'r', encoding='utf-8') as file:
            text = file.read()
            processed_text = process_text(text)
            return processed_text
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found.")
    except Exception as e:
        print(f"An error occurred: {e}")


paragraph = process_file('A_DESCENT_INTO_THE_MAELSTROM')

from nltk.tokenize import sent_tokenize, word_tokenize
sent_text = nltk.sent_tokenize(paragraph) # this gives us a list of sentences
# now loop over each sentence and tokenize it separately
all_tagged = [nltk.pos_tag(nltk.word_tokenize(sent)) for sent in sent_text]

#### 3. [11 points] Tag all remaining words in the story as parts of speech using the Penn POS Tags. This SO answer shows how to obtain the POS tag values. Create and print a dictionary with the Penn POS Tags as keys and a list of words as the values.

The code and explanation for this part are below question 4.

#### 4. [11 points] In this framework, each row will represent a story. The columns will be as follows:
The text of the story,
Two-letter prefixes of each tag, for example NN, VB, RB, JJ etc., and the words belonging to that tag in the story.
Show your code and the tag columns, at least for the one story.


The code for this and part 3 combined is below. 

In [None]:
#!/usr/bin/env python3
import re
import string
import requests
import sys
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
import os

# Download required NLTK resources
nltk.download('punkt')  # Required for sentence tokenization
nltk.download('averaged_perceptron_tagger')  # Required for POS tagging

def load_stopwords():
    url = "https://gist.githubusercontent.com/rg089/35e00abf8941d72d419224cfd5b5925d/raw/12d899b70156fd0041fa9778d657330b024b959c/stopwords.txt"
    stopwords_list = requests.get(url).content
    return set(stopwords_list.decode().splitlines())

stopwords = load_stopwords()

def process_text(text, is_recursive=False):
    if not is_recursive:
        text = text.lower()
        text = re.sub(r'\[.*?\]', '', text)
        text = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', text)
        text = re.sub(r'\d+', ' ', text)
        text = ' '.join(remove_stopwords(text))
        return process_text(text, is_recursive=True)
    else:
        return text

def remove_stopwords(words):
    words_list = re.sub(r'[^a-zA-Z0-9]', " ", words.lower()).split()
    return [word for word in words_list if word not in stopwords]

def process_file(filename):
    try:
        with open(filename, 'r', encoding='utf-8', errors='ignore') as file:
            text = file.read()
            processed_text = process_text(text)
            return processed_text
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found.")
        return None
    except Exception as e:
        print(f"An error occurred while processing the file: {e}")
        return None

# Process the file
filename = 'A_DESCENT_INTO_THE_MAELSTROM.txt'
paragraph = process_file(filename)

if paragraph is not None:

    # Tokenize sentences and perform POS tagging
    sent_text = sent_tokenize(paragraph)  # Sentence tokenization
    all_tagged = [nltk.pos_tag(word_tokenize(sent)) for sent in sent_text]  # POS tagging

    # Flatten the list of tagged words
    flat_tagged_words = [word_tag for sentence in all_tagged for word_tag in sentence]

    # Create a dictionary with POS tags as keys and lists of words as values
    pos_dict = {}
    for word, tag in flat_tagged_words:
        if tag not in pos_dict:
            pos_dict[tag] = []
        pos_dict[tag].append(word)

    # Print the dictionary
    print("\nPOS Tags Dictionary:")
    for tag, words in pos_dict.items():
        print(f"{tag}: {words[:10]}...")  # Print only the first 10 words for brevity
else:
    print("No text to process due to file error.")

Tag Output for the first 10 of each part of speech:

    POS Tags Dictionary:
    NNS: ['ways', 'ways', 'models', 'minutes', 'sons', 'years', 'hours', 'limbs', 'nerves', 'feet']...
    VBP: ['god', 'frame', 'terror', 'brink', 'mountain', 'guide', 'sea', 'land', 'remote', 'swell']...
    JJ: ['nature', 'commensurate', 'unsearchableness', 'speak', 'mortal', 'suppose', 'single', 'jetty', 'black', 'white']...
    NN: ['providence', 'vastness', 'profundity', 'democritus', 'joseph', 'glanville', 'summit', 'crag', 'man', 'length']...
    VBZ: ['works', 'hairs', 'assumes', 'passages', 'decreases', 'runs', 'leagues', 'whales', 'rocks', 'precipitates']...
    RB: ['depth', 'long', 'ago', 'deadly', 'scarcely', 'carelessly', 'beneath', 'deeply', 'length', 'upward']...
    JJR: ['greater', 'weightier', 'higher', 'smaller', 'yonder', 'greater', 'smaller', 'higher', 'deeper', 'lower']...
    VBD: ['reached', 'exhausted', 'guided', 'happened', 'happened', 'survived', 'frightened', 'hung', 'arose', 'tempted']...
    JJS: ['loftiest', 'youngest', 'divest', 'crest', 'faintest', 'loudest', 'highest', 'honest', 'largest', 'finest']...
    VBN: ['endured', 'beheld', 'reared', 'called', 'ascended', 'set', 'acquired', 'lashed', 'phrensied', 'assumed']...
    IN: ['broken', 'teeth', 'otterholm', 'ver', 'abyss', 'amid', 'drove', 'overcast', 'thrown', 'wind']...
    RBR: ['shadow', 'farther', 'matter', 'feather', 'higher', 'longer', 'listen', 'explore', 'farther', 'limbs']...
    VBG: ['falling', 'shining', 'particularizing', 'beetling', 'howling', 'shrieking', 'blowing', 'offing', 'dashing', 'increasing']...
    FW: ['elbow', 'kircher']...
    VB: ['raise', 'timid', 'morrow', 'watch', 'deck', 'elder', 'shake', 'slack', 'keel', 'hold']...
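
The dictionary above is keyed by the full Penn tags (NNS, VBD, JJR, ...), while question 4 asks for two-letter prefixes. A minimal sketch of collapsing the tags into one row per story follows; `group_by_prefix` and the "text" key are hypothetical names, and the `tagged` list reuses a few (word, tag) pairs from the output above.

```python
from collections import defaultdict

# A few (word, tag) pairs copied from the output above.
tagged = [("ways", "NNS"), ("providence", "NN"), ("reached", "VBD"),
          ("falling", "VBG"), ("greater", "JJR"), ("deeply", "RB")]

def group_by_prefix(tagged_words, story_text):
    """Build one row: the story text plus one column per two-letter tag prefix."""
    columns = defaultdict(list)
    for word, tag in tagged_words:
        columns[tag[:2]].append(word)        # NNS -> NN, VBD -> VB, JJR -> JJ
    return {"text": story_text, **columns}

row = group_by_prefix(tagged, "ways providence reached falling greater deeply")
for column, value in row.items():
    print(f"{column}: {value}")
```

A list of such rows, one per story, could then be passed to `pd.DataFrame` to get the story-per-row layout.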

#### 5. [12 points] The conjecture of many linguists is that the number of different parts of speech per thousand words (nouns, verbs, adjectives, adverbs, …) is pretty much the same for all stories in a given language. In this case, with all stories in English, and all from the same author, we expect it to be true. Is the conjecture consistent with your findings?

I calculated the frequencies per 1000 words for each part of speech for the first 3 short stories:

Domain of Arnheim

    POS frequencies per 1000 words for THE_DOMAIN_OF_ARNHEIM:
        CC: 34.24
        CD: 6.21
        DT: 122.73
        EX: 2.88
        FW: 0.30
        IN: 140.76
        JJ: 80.61
        JJR: 3.33
        JJS: 3.33
        MD: 9.85
        NN: 170.76
        NNP: 15.15
        NNS: 39.24
        PDT: 1.36
        PRP: 26.82
        PRP$: 17.42
        RB: 45.45
        RBR: 2.12
        RBS: 2.27
        RP: 1.52
        TO: 20.61
        VB: 25.30
        VBD: 29.24
        VBG: 13.64
        VBN: 27.73
        VBP: 10.15
        VBZ: 22.27
        WDT: 10.45
        WP: 4.39
        WP$: 0.91
        WRB: 1.36

Berenice:

    Processing file: BERENICE
    POS frequencies per 1000 words for BERENICE:
    CC: 47.74
    CD: 1.92
    DT: 102.61
    EX: 1.65
    IN: 134.43
    JJ: 81.76
    JJR: 0.82
    JJS: 3.29
    MD: 5.76
    NN: 152.26
    NNP: 20.85
    NNS: 36.21
    PDT: 0.82
    PRP: 48.83
    PRP$: 29.36
    RB: 49.93
    RBR: 3.29
    RBS: 2.47
    RP: 1.10
    TO: 13.44
    UH: 1.37
    VB: 16.74
    VBD: 49.11
    VBG: 14.54
    VBN: 21.40
    VBP: 12.07
    VBZ: 7.13
    WDT: 5.76
    WP: 1.37
    WP$: 0.27
    WRB: 1.92

Cask of Amontillado:

    Processing file: THE_CASK_OF_AMONTILLADO
    POS frequencies per 1000 words for THE_CASK_OF_AMONTILLADO:
    CC: 35.45
    CD: 4.49
    DT: 106.76
    EX: 2.56
    FW: 0.23
    IN: 130.49
    JJ: 73.03
    JJR: 2.98
    JJS: 2.26
    MD: 9.13
    NN: 151.85
    NNP: 18.77
    NNPS: 0.16
    NNS: 34.70
    PDT: 1.21
    PRP: 51.86
    PRP$: 24.31
    RB: 51.74
    RBR: 2.38
    RBS: 1.42
    RP: 1.76
    TO: 17.94
    UH: 0.55
    VB: 23.87
    VBD: 52.29
    VBG: 12.57
    VBN: 24.72
    VBP: 11.21
    VBZ: 10.18
    WDT: 8.21
    WP: 3.84
    WP$: 0.71
    WRB: 3.02

I calculated the frequencies of Nouns, Verbs and Adjectives from the above output for each of the 3 stories.

#### Nouns (NN, NNP, NNS):
        THE_DOMAIN_OF_ARNHEIM: 225.15

        BERENICE: 209.32

        THE_CASK_OF_AMONTILLADO: 205.32

#### Verbs (VB, VBD, VBG, VBN, VBP, VBZ):
        THE_DOMAIN_OF_ARNHEIM: 128.33

        BERENICE: 120.99

        THE_CASK_OF_AMONTILLADO: 134.84

#### Adjectives (JJ, JJR, JJS):
        THE_DOMAIN_OF_ARNHEIM: 87.27

        BERENICE: 85.87

        THE_CASK_OF_AMONTILLADO: 78.27
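
The category totals above are sums of the per-tag frequencies. As a worked check, the sketch below recomputes them for THE_DOMAIN_OF_ARNHEIM from the values listed earlier; `CATEGORIES` and `category_totals` are hypothetical names introduced for this illustration.

```python
# Tags that make up each category, as listed in the headings above.
CATEGORIES = {
    "Nouns": ("NN", "NNP", "NNS"),
    "Verbs": ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ"),
    "Adjectives": ("JJ", "JJR", "JJS"),
}

# Per-1000-word frequencies for THE_DOMAIN_OF_ARNHEIM, copied from the output above.
arnheim = {
    "NN": 170.76, "NNP": 15.15, "NNS": 39.24,
    "VB": 25.30, "VBD": 29.24, "VBG": 13.64,
    "VBN": 27.73, "VBP": 10.15, "VBZ": 22.27,
    "JJ": 80.61, "JJR": 3.33, "JJS": 3.33,
}

def category_totals(freqs):
    """Sum the per-tag frequencies for each category, rounded to 2 decimals."""
    return {
        name: round(sum(freqs.get(tag, 0.0) for tag in tags), 2)
        for name, tags in CATEGORIES.items()
    }

totals = category_totals(arnheim)
print(totals)
```

The same function applied to the other two stories' frequencies gives the remaining totals.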


Based on the above calculations, the data from these 3 short stories is consistent with the linguists' conjecture. For each of the three categories, the frequencies per 1000 words are similar across the stories: nouns range from 205.32 to 225.15, verbs from 120.99 to 134.84, and adjectives from 78.27 to 87.27. The code I used is pasted below.

In [None]:
import os
from collections import defaultdict
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

def calculate_pos_frequencies_per_thousand(file_path):
    # Load and tokenize the text
    with open(file_path, 'r') as file:
        text = file.read()
    tokens = word_tokenize(text)
    total_words = len(tokens)

    # Tag tokens with POS
    tagged_tokens = pos_tag(tokens)

    # Count POS frequencies
    pos_counts = defaultdict(int)
    for word, pos in tagged_tokens:
        pos_counts[pos] += 1

    # Normalize frequencies per 1000 words
    pos_frequencies_per_thousand = {pos: (count / total_words) * 1000 for pos, count in pos_counts.items()}

    return pos_frequencies_per_thousand

def main():
    # Get the current working directory (where the script and text files are located)
    directory = "."


    # Loop through each file in the directory
    for filename in os.listdir(directory):
        # Skip directories and Python scripts; process only the remaining files
        if os.path.isfile(os.path.join(directory, filename)) and not filename.endswith('.py'):
            file_path = os.path.join(directory, filename)
            print(f"Processing file: {filename}")

            # Calculate POS frequencies per 1000 words
            pos_frequencies = calculate_pos_frequencies_per_thousand(file_path)

            # Print the results
            print(f"POS frequencies per 1000 words for {filename}:")