# **1.01 - Audio Feature Extraction**

Adding features to the haunted places dataset.

Features are added using pandas and regular expressions.

## **Audio Feature Extraction**

Uses the keywords specified in **audio_keywords.txt**.

Keywords were generated with 2 queries to ChatGPT:
 - "could you generate a long list of words that are similar to "hear", "overheard", "hearing", "sound", in all tenses."
 - "could you generate more words similar to "screams" "crying" "cries" etc"
 
**"Audio Evidence" [bool]**
- True - the description mentions audio evidence
    - example True {index: 2}: "Others report seeing nothing at all but hearing the killer's shouts of rage followed by the victims' screams of agony"
- False - otherwise


In [181]:
import pandas as pd
import re 


# Reading CSV
df = pd.read_csv("../data/processed/haunted_places.tab", sep = "\t")


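# Define Audio Keyword List (comma-separated file)
with open('../data/keywords/audio_keywords.txt', 'r') as f:
    audio_keywords = f.read().split(',')


def contains_audio_keywords(text):
    """Return True if any audio keyword matches the text (case-insensitive)."""
    # Each keyword is passed to re.search as a regex pattern; no word boundaries are added.
    if isinstance(text, str):
        for keyword in audio_keywords:
            if re.search(keyword, text, re.IGNORECASE):
                return True
    return False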

df["Audio_Evidence"] = df["description"].apply(contains_audio_keywords)

df['Audio_Evidence'].value_counts()

Audio_Evidence
True     5615
False    5377
Name: count, dtype: int64
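
Because each keyword is searched as a plain pattern, a short keyword can also match inside a longer, unrelated word. The sketch below shows one possible alternative: a single compiled pattern with word boundaries. The keyword list and the helper names `build_keyword_pattern` and `contains_keyword` are hypothetical and are not part of this notebook.

```python
import re

# Hypothetical keyword list; the real one lives in audio_keywords.txt.
sample_keywords = ["hear", "heard", "scream", "screams"]

def build_keyword_pattern(keywords):
    """Compile one case-insensitive pattern that matches any keyword as a whole word."""
    if not keywords:
        return re.compile(r"(?!)")  # a pattern that never matches
    alternation = "|".join(re.escape(kw) for kw in keywords)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)

def contains_keyword(text, pattern):
    """Return True if text is a string containing at least one whole-word keyword."""
    return isinstance(text, str) and pattern.search(text) is not None

pattern = build_keyword_pattern(sample_keywords)
print(contains_keyword("Visitors heard faint screams at night.", pattern))  # True
print(contains_keyword("A heart-shaped stone marks the grave.", pattern))   # False
```

Whole-word matching trades recall for precision: inflected forms such as "hearing" must then appear in the keyword list explicitly, so the counts would likely differ from those reported above.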

## **Visual Feature Extraction**

Uses the keywords specified in **visual_keywords.txt**.

Keywords were generated with 2 queries to ChatGPT:
 - "could you generate a long list of verbs that are similar to "saw", "viewed", and "spot" in all tenses."
 - "could you generate a long list of nouns similar to "picture" "images" "drawings" etc"
 
**"Visual Evidence" [bool]**
- True - the description mentions visual evidence
    - example True flag {index: 53}: *'Many people claim to see a light come out of river and chase their vehicle to the end of the road'*
- False - otherwise
    - example False flag {index: 4}: *'Kappa Delta Sorority - The Kappa Delta Sorority is haunted by an entity simply known as 'P'. It is said she was a sister there who died in a car accident. Current sisters there have reported hearing giggling and running around coming from the upstairs floor while they are in the basement. At one time a sister called out to "P" and received a "hello" in reply.'*

In [None]:
# Define Visual Keyword List (comma-separated file)
with open('../data/keywords/visual_keywords.txt', 'r') as f:
    visual_keywords = f.read().split(',')


def contains_visual_keywords(text):
    """Return True if any visual keyword matches the text (case-insensitive)."""
    if isinstance(text, str):
        for keyword in visual_keywords:
            if re.search(keyword, text, re.IGNORECASE):
                return True
    return False


df["Visual_Evidence"] = df["description"].apply(contains_visual_keywords)
df['Visual_Evidence'].value_counts()


Visual_Evidence
True     6949
False    4043
Name: count, dtype: int64
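
Both keyword files are read with `read().split(',')`, which keeps any surrounding whitespace and turns a trailing comma into an empty keyword (and an empty pattern matches every string in `re.search`). The following is a minimal sketch of a more defensive loader; `load_keywords` and the demo file contents are hypothetical, and whether the real keyword files need this cleanup is an assumption.

```python
import tempfile
from pathlib import Path

def load_keywords(path):
    """Read a comma-separated keyword file, strip whitespace, and drop empty entries."""
    raw = Path(path).read_text(encoding="utf-8")
    return [kw.strip() for kw in raw.split(",") if kw.strip()]

# Demo with a throwaway file; the contents are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo_keywords.txt"
    demo_file.write_text("saw, seen ,spotted,\n", encoding="utf-8")
    demo_keywords = load_keywords(demo_file)

print(demo_keywords)  # ['saw', 'seen', 'spotted']
```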

## **Date Extraction**

Dates extracted using [datefinder](https://github.com/akoumjian/datefinder).
 
**"Haunted Places Date" [datetime]**
- Format: YYYY/MM/DD
- Default Value: 2025/01/01

**NOTE**:

- datefinder.find_dates() can parse a bare number as an incomplete date and fills in the missing parts from base_date, which is set to 2025/01/01 here, so such matches come back with year 2025.
    - To filter out these false positives, we keep only dates with a year before 2025.
- Used a regular expression to capture decades written as "20's", "30's", etc., which are assumed to be in the 1900s.
    - eg {index: 1275}: *"A little boy haunts theater number 5 who was killed back in the '70's during a freak construction accident."* -> [datetime.datetime(1970, 1, 1, 0, 0)]
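
The decade rule can be tried in isolation. This sketch reuses the same regular expression as the notebook; the helper name `decades_to_dates` and its `century` parameter are hypothetical, and mapping every two-digit decade to the 1900s is the same assumption the notebook makes.

```python
import datetime
import re

# Same pattern as in this notebook: two digits followed by "'s".
DECADE_PATTERN = re.compile(r"\b\d{2}'s\b")

def decades_to_dates(text, century="19"):
    """Map each two-digit decade mention to January 1 of that decade (century is assumed)."""
    return [
        datetime.datetime(int(century + match[:2]), 1, 1)
        for match in DECADE_PATTERN.findall(text)
    ]

sample = "A little boy haunts theater number 5 who was killed back in the '70's."
print(decades_to_dates(sample))                 # [datetime.datetime(1970, 1, 1, 0, 0)]
print(decades_to_dates("built in the 1970's"))  # []
```

The second call returns an empty list because the word boundary cannot fall inside "1970", so four-digit decades are left to datefinder.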

    

In [197]:
import datefinder
import datetime

DEFAULT_DATE = datetime.datetime(2025, 1, 1)

def extract_dates(text):
    """Return a list of dates found in text, or DEFAULT_DATE if none are found."""
    if not isinstance(text, str):
        return DEFAULT_DATE

    # Parse Using DateFinder; incomplete dates inherit year 2025 from base_date and are dropped
    matched_dates = [
        date for date in datefinder.find_dates(text, base_date=DEFAULT_DATE)
        if date.year < 2025
    ]

    # Parse Using Regex: two-digit decades such as "70's", assumed to be 19xx
    pattern = r"\b\d{2}'s\b"
    for decade in re.findall(pattern, text):
        year = int("19" + decade.replace("'s", ""))
        matched_dates.append(datetime.datetime(year, 1, 1))

    # If No Dates Matched, Return [2025, 1, 1]
    if not matched_dates:
        return DEFAULT_DATE
    return matched_dates

df["Haunted_Places_Date"] = df["description"].apply(extract_dates)

# Handling Multi-Date Entries
print("Extraction Completed")

# Flags rows whose date list holds more than two dates
multi_date_entries = df["Haunted_Places_Date"].apply(lambda x : len(x) > 2 if isinstance(x, list) else False)

# .size counts cells (rows x columns) of the flagged rows
print("There are ", df[multi_date_entries].size, " entries with multiple dates")

# Expand DataFrame: one row per extracted date (explode also takes the dates out of their lists)
exploded_df = df.explode("Haunted_Places_Date")
# Convert to Datetime. Fillna with [2025, 1, 1]
exploded_df["Haunted_Places_Date"] = pd.to_datetime(exploded_df["Haunted_Places_Date"], errors="coerce").fillna(datetime.datetime(2025, 1, 1))




Extraction Completed
There are  1932  entries with multiple dates
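
In the cell above, `len(x) > 2` flags rows with three or more dates, and `DataFrame.size` counts cells (rows times columns) instead of rows, so 1932 is probably not the number of descriptions with more than one date. A minimal sketch of a per-row count follows, using plain Python lists as hypothetical stand-ins for the `extract_dates` results; `has_multiple_dates` is an illustrative name.

```python
import datetime

# Hypothetical stand-ins for extract_dates results:
# a list of dates, or a single default datetime when nothing was found.
default = datetime.datetime(2025, 1, 1)
results = [
    [datetime.datetime(1970, 1, 1)],
    [datetime.datetime(1952, 6, 1), datetime.datetime(1980, 1, 1)],
    default,
    [datetime.datetime(1901, 1, 1), datetime.datetime(1930, 1, 1), datetime.datetime(1999, 12, 31)],
]

def has_multiple_dates(value):
    """True when the value is a list holding at least two dates."""
    return isinstance(value, list) and len(value) > 1

multi_date_rows = sum(has_multiple_dates(value) for value in results)
print(multi_date_rows)  # 2
```

With pandas, the same predicate could be applied to the date column and the resulting boolean Series summed with `.sum()`, which counts rows because each True counts as 1.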


## **Haunted Places Witness Count**

Witness counts extracted using [number-parser](https://github.com/scrapinghub/number-parser).


In [None]:
def text_2_num(text):
    pass  # TODO: implement (the parameter name is assumed)
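
The body of `text_2_num` is not given in this notebook. As a rough illustration of the idea only, the sketch below pulls integers out of a description using the standard library. It is a stand-in and does not use or reproduce number-parser's API; the vocabulary, the return type, and the behavior are all assumptions.

```python
import re

# Small illustrative vocabulary; a dedicated library covers far more forms.
WORD_TO_NUM = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
}

def text_2_num(text):
    """Return the integers mentioned in text, written as digits or as number words."""
    if not isinstance(text, str):
        return []
    found = []
    # Tokens are whole alphabetic words or digit runs, so "someone" never counts as "one".
    for token in re.findall(r"[A-Za-z]+|\d+", text):
        if token.isdigit():
            found.append(int(token))
        elif token.lower() in WORD_TO_NUM:
            found.append(WORD_TO_NUM[token.lower()])
    return found

print(text_2_num("Two students and 3 teachers saw the figure."))  # [2, 3]
```

Turning these numbers into a witness count would still require deciding which number refers to witnesses, which this sketch does not attempt.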

In [None]:
df.to_csv('../data/processed/haunted_places_features_added.tab', sep = '\t')