# Hurricanes 1975 Wikipedia 

### Anastasia Drakou
- a.drakou@hotmail.com  
- (+30) 6941523553

#### Import libraries

In [20]:
import requests 
from bs4 import BeautifulSoup
import pandas as pd
import re
from transformers import BertTokenizer, BertForTokenClassification
from transformers import pipeline
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
from datetime import datetime
from transformers import AutoModelForCausalLM, AutoTokenizer

#### Fetch the URL from which we wish to extract data

When we fetch the Wikipedia page, the server returns a three-digit status code. Each code has a defined meaning; the most common ones are listed below:
 
- 200: OK - The request was successful, and the server has returned the requested data. 
- 404: Not Found - The requested resource could not be found on the server.
- 500: Internal Server Error - The server encountered an unexpected error while processing the request.
- 401: Unauthorized - The request requires authentication or valid credentials were not provided.
- 403: Forbidden - The server understood the request, but access is not allowed.
- 302: Found - The requested resource has been temporarily moved to a different location.
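
The codes listed above can also be looked up programmatically. The following is a minimal sketch using Python's standard `http.HTTPStatus` enum; the helper name `describe_status` and the family labels are illustrative assumptions, not part of the notebook.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found (client error)'."""
    # The first digit of a status code identifies its family.
    families = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    family = families.get(code // 100, "informational")
    return f"{code} {phrase} ({family})"


for code in (200, 302, 401, 403, 404, 500):
    print(describe_status(code))
```

Grouping by the first digit makes it easy to treat every 4xx or 5xx response as a failure without listing each code separately.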

In [21]:
url = 'https://en.wikipedia.org/wiki/1975_Pacific_hurricane_season'
r = requests.get(url)
print(r.status_code)

200


The output 200 shows that the request was successful.

In [22]:
wiki_page = BeautifulSoup(r.text, 'html.parser')
print(wiki_page.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   1975 Pacific hurricane season - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-li

First, let's find the names of all the Pacific hurricanes and storms. In the Beautiful Soup output we can see that the names of the hurricanes/storms are written in `<h3>` tags, so we select those elements.

In [23]:
for name in wiki_page.select('h3'):
    print(name.getText())

Hurricane Agatha
Tropical Storm Bridget
Hurricane Carlotta
Hurricane Denise
Tropical Storm Eleanor
Tropical Storm Francene
Tropical Storm Georgette
Tropical Storm Hilary
Hurricane Ilsa
Hurricane Jewel
Hurricane Katrina
Unnamed hurricane
Hurricane Lily
Tropical Storm Monica
Tropical Storm Nanette
Hurricane Olivia
Tropical Storm Priscilla
Other systems


The total number of hurricanes and storms is 17. The names of the hurricanes and storms listed above, excluding the last entry labeled "Other systems," will be retained in our records.
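
The idea of collecting the `<h3>` headings and dropping the "Other systems" entry can be illustrated without any third-party dependency. This is only a sketch using the standard-library `html.parser` on a small hypothetical HTML snippet (`sample_html`); the notebook itself uses Beautiful Soup on the real page, and the class name `HeadingCollector` is an assumption made for the example.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text content of every <h3> element."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
            self.headings.append("")  # start a new heading

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False

    def handle_data(self, data):
        if self.in_h3:
            self.headings[-1] += data


# Hypothetical stand-in for the downloaded page.
sample_html = """
<h2>Systems</h2>
<h3>Hurricane Agatha</h3><p>...</p>
<h3>Tropical Storm Bridget</h3><p>...</p>
<h3>Other systems</h3><p>...</p>
"""

collector = HeadingCollector()
collector.feed(sample_html)

# Keep every heading except the catch-all section.
storm_names = [h.strip() for h in collector.headings if h.strip() != "Other systems"]
print(storm_names)
```

Filtering by the heading text rather than by position keeps the result correct even if the catch-all section is not the last heading on the page.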

Next, Beautiful Soup will be utilized to clean the data, facilitating a more efficient extraction of the desired information.

#### Data Extraction

For data extraction, the dataset will be processed to extract only the necessary information. Using Beautiful Soup, the focus will be on retrieving relevant details specifically related to hurricanes or storms. The desired information includes the names of the hurricanes or storms, their respective dates, and general details.

In [24]:
def extract_all_hurricanes(wiki_page: BeautifulSoup):
    hurricanes_info = []

    hurricane_headers = wiki_page.find_all('h3')

    for header in hurricane_headers:
        hurricane_name = header.text.strip()
        
        duration_row = header.find_next('table').find('th', string='Duration')
        duration = duration_row.find_next('td').text.strip() if duration_row else None
        
        relevant_paragraphs = []
        for p in header.find_all_next('p'):
            if p.find_previous('h3') != header:
                break
            relevant_paragraphs.append(p.text.strip())

        relevant_text = " ".join(relevant_paragraphs)

        hurricanes_info.append({
            "hurricane_name": hurricane_name,
            "duration": duration,
            "relevant_text": relevant_text
        })

    return hurricanes_info[:-1]

result = extract_all_hurricanes(wiki_page)
for hurricane in result:
    print(hurricane)


{'hurricane_name': 'Hurricane Agatha', 'duration': 'June 2\xa0– June 5', 'relevant_text': 'An area of disturbed weather about 290\xa0mi (467\xa0km) southwest of Acapulco formed on June 1. It organized into a tropical depression the next day. After heading southwestward, it turned to the northwest and strengthened into Tropical Storm Agatha on June 2. Agatha maintained its course and steadily intensified. It reached hurricane intensity on June 3 while located about 170\xa0mi (270\xa0km) southwest of Zihuatanejo. Hurricane Agatha started weakening thereafter, becoming a tropical storm on June 4 and a depression on June 5. It dissipated shortly afterwards. At this time, Agatha was located about 140\xa0mi (230\xa0km) south of the Tres Marias Islands.[4] Even though Agatha passed close to Mexico as it weakened, no impact is known to have been caused.[4] Waves caused by Agatha did impact a ship called the Polynesian Diakan. A Greek freighter en route from Pago Pago to Terminal Island, Califo

Data cleaning for duration and relevant text

In [25]:
def clean_hurricane_data(hurricane_data):
    cleaned_data = []
    for hurricane in hurricane_data:
        cleaned_text = re.sub(r'\[\d+\]', '', hurricane['relevant_text']) 
        cleaned_text = cleaned_text.replace('\xa0', ' ')  
        cleaned_duration = hurricane['duration'].replace('\xa0', ' ')
        
        cleaned_data.append({
            "hurricane_name": hurricane['hurricane_name'],
            "duration": cleaned_duration.strip(),
            "relevant_text": cleaned_text.strip()
        })
    return cleaned_data


cleaned_hurricanes = clean_hurricane_data(result)
for hurricane in cleaned_hurricanes:
    print(hurricane)

{'hurricane_name': 'Hurricane Agatha', 'duration': 'June 2 – June 5', 'relevant_text': 'An area of disturbed weather about 290 mi (467 km) southwest of Acapulco formed on June 1. It organized into a tropical depression the next day. After heading southwestward, it turned to the northwest and strengthened into Tropical Storm Agatha on June 2. Agatha maintained its course and steadily intensified. It reached hurricane intensity on June 3 while located about 170 mi (270 km) southwest of Zihuatanejo. Hurricane Agatha started weakening thereafter, becoming a tropical storm on June 4 and a depression on June 5. It dissipated shortly afterwards. At this time, Agatha was located about 140 mi (230 km) south of the Tres Marias Islands. Even though Agatha passed close to Mexico as it weakened, no impact is known to have been caused. Waves caused by Agatha did impact a ship called the Polynesian Diakan. A Greek freighter en route from Pago Pago to Terminal Island, California, the Polynesian Diakan
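
The two cleaning steps (removing citation markers and replacing non-breaking spaces) can be tried on a short string. The sketch below is a standalone variant; the helper name `clean_text`, the extra `[nb 1]` footnote format, and the whitespace collapsing are assumptions that go slightly beyond the cell above.

```python
import re


def clean_text(text: str) -> str:
    # Drop citation markers such as [4]; [nb 1] is an assumed extra footnote format.
    text = re.sub(r"\[(?:\d+|nb \d+)\]", "", text)
    # Turn non-breaking spaces into ordinary ones, then collapse runs of whitespace.
    text = text.replace("\xa0", " ")
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("about 290\xa0mi south.[4] No impact.[nb 1]"))
print(clean_text("June 2\xa0– June 5"))
```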

Extracting date start and date end of each hurricane or storm.

In [26]:
def transform_hurricane_data(hurricanes_data):
    month_map = {
        "January": "01", "February": "02", "March": "03", "April": "04",
        "May": "05", "June": "06", "July": "07", "August": "08",
        "September": "09", "October": "10", "November": "11", "December": "12"
    }

    for hurricane in hurricanes_data:
        start_date, end_date = hurricane['duration'].split(' – ')
        start_month, start_day = start_date.strip().split()
        end_month, end_day = end_date.strip().split()
        hurricane['date_start'] = f"1975-{month_map[start_month]}-{start_day.zfill(2)}"
        hurricane['date_end'] = f"1975-{month_map[end_month]}-{end_day.zfill(2)}"
        del hurricane['duration']
    return hurricanes_data

updated_hurricanes_data = transform_hurricane_data(cleaned_hurricanes)

for hurricane in updated_hurricanes_data:
    print(hurricane)


{'hurricane_name': 'Hurricane Agatha', 'relevant_text': 'An area of disturbed weather about 290 mi (467 km) southwest of Acapulco formed on June 1. It organized into a tropical depression the next day. After heading southwestward, it turned to the northwest and strengthened into Tropical Storm Agatha on June 2. Agatha maintained its course and steadily intensified. It reached hurricane intensity on June 3 while located about 170 mi (270 km) southwest of Zihuatanejo. Hurricane Agatha started weakening thereafter, becoming a tropical storm on June 4 and a depression on June 5. It dissipated shortly afterwards. At this time, Agatha was located about 140 mi (230 km) south of the Tres Marias Islands. Even though Agatha passed close to Mexico as it weakened, no impact is known to have been caused. Waves caused by Agatha did impact a ship called the Polynesian Diakan. A Greek freighter en route from Pago Pago to Terminal Island, California, the Polynesian Diakan began flooding on June 3, forc
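
As an alternative to a hand-written month map, the duration string could be parsed with `datetime.strptime`, which also rejects impossible dates such as "June 31". This is a sketch under assumptions: the helper name `parse_duration` is hypothetical, month names are assumed to be English, and the shorthand form "June 2 – 5" is an assumed variant that may not occur on the page.

```python
import re
from datetime import datetime


def parse_duration(duration: str, year: int = 1975) -> tuple[str, str]:
    """Convert 'June 28 – July 3' into ISO dates for the given year."""
    # Split on an en dash or a hyphen, ignoring surrounding whitespace.
    start_raw, end_raw = re.split(r"\s*[–-]\s*", duration.strip(), maxsplit=1)

    # Assumed shorthand: 'June 2 – 5' reuses the month of the start date.
    if end_raw.isdigit():
        end_raw = f"{start_raw.split()[0]} {end_raw}"

    def to_iso(part: str) -> str:
        # strptime raises ValueError for unknown months or impossible days.
        return datetime.strptime(f"{part} {year}", "%B %d %Y").date().isoformat()

    return to_iso(start_raw), to_iso(end_raw)


print(parse_duration("June 2 – June 5"))
print(parse_duration("June 28 – July 3"))
```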

Extract the areas that were affected by the hurricanes or storms.

In [27]:
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

def extract_hurricane_areas(text):
    ner_results = ner_pipeline(text)
    affected_areas = [entity['word'] for entity in ner_results if entity['entity_group'] == 'LOC']
    return {
        'affected_areas': affected_areas
    }

def add_affected_areas(hurricane_dictionary):
    updated_hurricanes = []
    for hurricane in hurricane_dictionary:
        hurricane_info = extract_hurricane_areas(hurricane['relevant_text'])
        hurricane['list_of_areas_affected'] = hurricane_info['affected_areas']
        updated_hurricanes.append(hurricane)
    return updated_hurricanes

# The function name differs from the result variable so the function is not overwritten.
hurricanes_areas = add_affected_areas(updated_hurricanes_data)

for hurricane in hurricanes_areas:
    print(hurricane)

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


{'hurricane_name': 'Hurricane Agatha', 'relevant_text': 'An area of disturbed weather about 290 mi (467 km) southwest of Acapulco formed on June 1. It organized into a tropical depression the next day. After heading southwestward, it turned to the northwest and strengthened into Tropical Storm Agatha on June 2. Agatha maintained its course and steadily intensified. It reached hurricane intensity on June 3 while located about 170 mi (270 km) southwest of Zihuatanejo. Hurricane Agatha started weakening thereafter, becoming a tropical storm on June 4 and a depression on June 5. It dissipated shortly afterwards. At this time, Agatha was located about 140 mi (230 km) south of the Tres Marias Islands. Even though Agatha passed close to Mexico as it weakened, no impact is known to have been caused. Waves caused by Agatha did impact a ship called the Polynesian Diakan. A Greek freighter en route from Pago Pago to Terminal Island, California, the Polynesian Diakan began flooding on June 3, forc
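
With `aggregation_strategy="simple"`, each entity returned by the NER pipeline is a dictionary that includes an `entity_group`, a `word`, and a confidence `score`. The sketch below shows how low-confidence locations could additionally be filtered out; the helper name `confident_locations`, the threshold of 0.90, and the sample entities with their scores are all hypothetical.

```python
def confident_locations(entities, min_score=0.90):
    """Keep the text of LOC entities whose score reaches the threshold."""
    return [
        entity["word"]
        for entity in entities
        if entity["entity_group"] == "LOC" and entity["score"] >= min_score
    ]


# Hypothetical entities shaped like the pipeline output (scores are made up).
sample_entities = [
    {"entity_group": "LOC", "word": "Acapulco", "score": 0.999},
    {"entity_group": "MISC", "word": "Polynesian Diakan", "score": 0.95},
    {"entity_group": "LOC", "word": "Columbia", "score": 0.62},
]

print(confident_locations(sample_entities))
```

Note that the model tags every location mentioned in the text, including reference points used only to describe a storm's position, so a score threshold reduces noise but does not by itself tell affected areas apart from merely mentioned ones.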

Extract the number of deaths caused by each hurricane or storm.

In [28]:
model_name = "distilbert-base-uncased-distilled-squad"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
qa_model = DistilBertForQuestionAnswering.from_pretrained(model_name)
qa_pipeline = pipeline("question-answering", model=qa_model, tokenizer=tokenizer)

def extract_hurricane_deaths(text):
    question = "How many people died?"
    answer = qa_pipeline(question=question, context=text)
    death_count = answer['answer'] if answer['score'] > 0.5 else 0
    return {
        'death_count': death_count
    }

def process_hurricanes_data(hurricane_dictionary):
    updated_hurricanes = []
    for hurricane in hurricane_dictionary:
        hurricane_info = extract_hurricane_deaths(hurricane['relevant_text'])
        hurricane['number_of_deaths'] = hurricane_info['death_count']
        del hurricane['relevant_text'] 
        updated_hurricanes.append(hurricane)
    return updated_hurricanes

hurricanes_info = process_hurricanes_data(hurricanes_areas)

for hurricane in hurricanes_info:
    print(hurricane)

{'hurricane_name': 'Hurricane Agatha', 'date_start': '1975-06-02', 'date_end': '1975-06-05', 'list_of_areas_affected': ['Acapulco', 'Zihuatanejo', 'Tres Marias Islands', 'Mexico', 'Pago Pago', 'Terminal Island', 'California', 'San Clemente Island'], 'number_of_deaths': 0}
{'hurricane_name': 'Tropical Storm Bridget', 'date_start': '1975-06-28', 'date_end': '1975-07-03', 'list_of_areas_affected': ['Baja California Peninsula'], 'number_of_deaths': 0}
{'hurricane_name': 'Hurricane Carlotta', 'date_start': '1975-07-02', 'date_end': '1975-07-11', 'list_of_areas_affected': ['Acapulco'], 'number_of_deaths': 0}
{'hurricane_name': 'Hurricane Denise', 'date_start': '1975-07-05', 'date_end': '1975-07-15', 'list_of_areas_affected': ['Mexico'], 'number_of_deaths': 'no damage or casualties.'}
{'hurricane_name': 'Tropical Storm Eleanor', 'date_start': '1975-07-10', 'date_end': '1975-07-12', 'list_of_areas_affected': ['Acapulco', 'Manzanillo', 'Manzanillo'], 'number_of_deaths': 0}
{'hurricane_name': 'T

#### Data Cleaning and Checking

- Checking whether the names of the hurricanes or storms are valid.

In [29]:
model_name = "gpt2"  
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def check_hurricane_name(name):
    prompt = f"Is '{name}' a valid hurricane name from 1975 Pacific hurricane season? Reply with yes or no."
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
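    # outputs[0] holds the prompt tokens followed by the generated ones, so the decoded
    # text includes the prompt itself, which already contains the word "yes".
    response = tokenizer.decode(outputs[0], skip_special_tokens=True).lower()
    return "yes" in response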

for hurricane in hurricanes_info:
    name = hurricane['hurricane_name']
    if check_hurricane_name(name):
        print(f"{name}: Yes")
    else:
        print(f"{name}: Invalid")


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

Hurricane Agatha: Yes
Tropical Storm Bridget: Yes
Hurricane Carlotta: Yes
Hurricane Denise: Yes
Tropical Storm Eleanor: Yes
Tropical Storm Francene: Yes
Tropical Storm Georgette: Yes
Tropical Storm Hilary: Yes
Hurricane Ilsa: Yes
Hurricane Jewel: Yes
Hurricane Katrina: Yes
Unnamed hurricane: Yes
Hurricane Lily: Yes
Tropical Storm Monica: Yes
Tropical Storm Nanette: Yes
Hurricane Olivia: Yes
Tropical Storm Priscilla: Yes
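
A deterministic format check can complement the language-model check. The sketch below is a swapped-in technique, not the notebook's method: it only verifies that a name has the shape seen in the headings ("Hurricane X", "Tropical Storm X", or "Unnamed hurricane") and says nothing about whether the storm actually occurred. The names `NAME_PATTERN` and `looks_like_storm_name` are hypothetical.

```python
import re

# Assumed shape of a heading: a category followed by one capitalised name.
NAME_PATTERN = re.compile(r"^(?:Hurricane|Tropical Storm) [A-Z][a-z]+$")


def looks_like_storm_name(name: str) -> bool:
    """Return True if the heading has the expected storm-name format."""
    return name == "Unnamed hurricane" or bool(NAME_PATTERN.match(name))


for name in ("Hurricane Agatha", "Tropical Storm Bridget", "Unnamed hurricane", "Other systems"):
    print(name, looks_like_storm_name(name))
```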


- Finding missing values and checking that the date formats are correct.

In [30]:
missing_values_flag = True
valid_format_flag = True
valid_dates_flag = True

for entry in hurricanes_info:
    start_date = entry.get('date_start')
    end_date = entry.get('date_end')

    if not start_date or not end_date:
        print(f"Missing values in {entry['hurricane_name']}")
        missing_values_flag = False
        continue  # nothing to parse for this entry

    try:
        start = datetime.strptime(start_date, '%Y-%m-%d')
        end = datetime.strptime(end_date, '%Y-%m-%d')
    except ValueError:
        print(f"Invalid date format in {entry['hurricane_name']}")
        valid_format_flag = False
        continue  # skip the ordering check when the dates cannot be parsed

    # Compare the dates of the current entry only.
    if start >= end:
        print(f"Start date is not before end date in {entry['hurricane_name']}")
        valid_dates_flag = False

if missing_values_flag:
    print("There are no missing values.")
if valid_format_flag:
    print("All the date formats are correct.")
if valid_dates_flag:
    print("Start dates and end dates are valid.")

There are no missing values.
All the date formats are correct.
Start dates and end dates are valid.


- Checking the number of deaths 

We already know that the total number of deaths for the season is 30, so we use this figure as a sanity check on the extracted values.

In [31]:
def is_valid_death_count(value):
    # Accept only values that convert to a non-negative integer.
    try:
        return int(value) >= 0
    except (ValueError, TypeError):  # TypeError covers None
        return False

updated_hurricanes_info = []
total_deaths = 0

for entry in hurricanes_info:
    new_entry = entry.copy()  
    deaths = entry.get('number_of_deaths')
    if is_valid_death_count(deaths):
        new_entry['number_of_deaths'] = int(deaths)  
    else:
        new_entry['number_of_deaths'] = 0
    total_deaths += new_entry['number_of_deaths']
    updated_hurricanes_info.append(new_entry)  

if total_deaths != 30:
    print(f"Total deaths are {total_deaths}, but should be 30.")
else:
    print(f"Total deaths: {total_deaths}. All invalid values have been corrected to 0.")

print("Updated data:")
for entry in updated_hurricanes_info:
    print(entry)

Total deaths: 30. All invalid values have been corrected to 0.
Updated data:
{'hurricane_name': 'Hurricane Agatha', 'date_start': '1975-06-02', 'date_end': '1975-06-05', 'list_of_areas_affected': ['Acapulco', 'Zihuatanejo', 'Tres Marias Islands', 'Mexico', 'Pago Pago', 'Terminal Island', 'California', 'San Clemente Island'], 'number_of_deaths': 0}
{'hurricane_name': 'Tropical Storm Bridget', 'date_start': '1975-06-28', 'date_end': '1975-07-03', 'list_of_areas_affected': ['Baja California Peninsula'], 'number_of_deaths': 0}
{'hurricane_name': 'Hurricane Carlotta', 'date_start': '1975-07-02', 'date_end': '1975-07-11', 'list_of_areas_affected': ['Acapulco'], 'number_of_deaths': 0}
{'hurricane_name': 'Hurricane Denise', 'date_start': '1975-07-05', 'date_end': '1975-07-15', 'list_of_areas_affected': ['Mexico'], 'number_of_deaths': 0}
{'hurricane_name': 'Tropical Storm Eleanor', 'date_start': '1975-07-10', 'date_end': '1975-07-12', 'list_of_areas_affected': ['Acapulco', 'Manzanillo', 'Manzan
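
Because the question-answering model returns free text, an answer such as "30 people" would be reset to 0 by a strict integer conversion. The sketch below shows one way to pull the first number out of such an answer; the helper name `parse_death_count` is hypothetical, and the approach is an assumption rather than the notebook's method. It would also pick up unrelated numbers (for example a distance), so the result still needs checking against the known total.

```python
import re


def parse_death_count(answer) -> int:
    """Extract the first non-negative integer from a QA answer, else 0."""
    if isinstance(answer, int):
        return max(answer, 0)
    # Match digits with optional thousands separators, e.g. '1,000'.
    match = re.search(r"\d[\d,]*", str(answer))
    return int(match.group().replace(",", "")) if match else 0


for answer in (0, "30", "30 people", "no damage or casualties."):
    print(repr(answer), "->", parse_death_count(answer))
```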

- Checking the lists of areas affected by the hurricanes and storms and removing duplicates

In [35]:
for hurricane in updated_hurricanes_info:
    unique_areas = list(set(hurricane['list_of_areas_affected']))
    hurricane['list_of_areas_affected'] = unique_areas

for hurricane in updated_hurricanes_info:
    print(hurricane['hurricane_name'], hurricane['list_of_areas_affected'])


Hurricane Agatha ['Terminal Island', 'Zihuatanejo', 'California', 'San Clemente Island', 'Mexico', 'Tres Marias Islands', 'Pago Pago', 'Acapulco']
Tropical Storm Bridget ['Baja California Peninsula']
Hurricane Carlotta ['Acapulco']
Hurricane Denise ['Mexico']
Tropical Storm Eleanor ['Acapulco', 'Manzanillo']
Tropical Storm Francene []
Tropical Storm Georgette ['Cabo San Lucas']
Tropical Storm Hilary []
Hurricane Ilsa ['Atlantic Ocean', 'Pacific Ocean', 'Gulf of Tehuantepec']
Hurricane Jewel ['Mexico', 'Acapulco']
Hurricane Katrina ['Socorro Island']
Unnamed hurricane ['Hawaii', 'Kauai', 'Alaska', 'Gulf of Alaska', 'Juneau', 'Montana', 'Columbia']
Hurricane Lily ['Socorro Island', 'Acapulco', 'Manzanillo']
Tropical Storm Monica ['Pacific Ocean']
Tropical Storm Nanette []
Hurricane Olivia ['Mexico', 'Mazatlán']
Tropical Storm Priscilla ['Clarion Island']
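
Converting a list to a `set` removes duplicates but does not preserve the order in which the areas were mentioned. If the original order matters, `dict.fromkeys` is a possible alternative, since dictionaries keep insertion order. A minimal sketch, with the helper name `unique_in_order` chosen for illustration:

```python
def unique_in_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


areas = ['Acapulco', 'Manzanillo', 'Manzanillo']
print(unique_in_order(areas))
```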


#### Create Dataframe and Create hurricanes_1975.csv file

In [36]:
hurricanes_1975_df = pd.DataFrame(updated_hurricanes_info)

# Rename by column name rather than by position, so the result does not depend on key order.
hurricanes_1975_df = hurricanes_1975_df.rename(columns={'hurricane_name': 'hurricane_storm_name'})
hurricanes_1975_df = hurricanes_1975_df[['hurricane_storm_name', 'date_start', 'date_end', 'number_of_deaths', 'list_of_areas_affected']]
hurricanes_1975_df.to_csv('hurricanes_1975.csv', index=False)

print("Now the csv file was saved successfully.")

Now the csv file was saved successfully.
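
One point to keep in mind when reading the file back: a CSV cell can only hold text, so the `list_of_areas_affected` column is stored as the string form of each list. The sketch below reproduces this with the standard-library `csv` module on an in-memory buffer (a stand-in for the real file, with a single illustrative row) and restores the list with `ast.literal_eval`.

```python
import ast
import csv
import io

# One illustrative row; the real file has one row per storm.
rows = [{
    "hurricane_storm_name": "Tropical Storm Eleanor",
    "list_of_areas_affected": ["Acapulco", "Manzanillo"],
}]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["hurricane_storm_name", "list_of_areas_affected"])
writer.writeheader()
writer.writerows(rows)  # the list is written as its string representation

# Read the data back: every cell is now a plain string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_cell = loaded[0]["list_of_areas_affected"]

# literal_eval safely turns the string back into a Python list.
areas = ast.literal_eval(raw_cell)
print(type(raw_cell).__name__, raw_cell)
print(type(areas).__name__, areas)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of round trip.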
