# **1.03 Date Feature Extraction**

## **Date Extraction**

Dates are extracted using [datefinder](https://github.com/akoumjian/datefinder).
 
**"Haunted Places Date" [datetime]**
- Format: YYYY/MM/DD
- Default Value: 2025/01/01 (assigned when no date is found in the description)

**NOTES**:

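- `datefinder.find_dates()` parses bare numbers as incomplete dates and fills in the missing year from the base date, which is 2025 here.
    - To filter out these false positives, we drop dates with year == 2025. Note that the range check inside `extract_dates` (`1620 <= date.year < 2026`) still admits 2025, so this exclusion is not enforced by that check.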
- A regular expression captures decade references such as "20's", "30's", etc.
    - e.g. {index: 1275}: *"A little boy haunts theater number 5 who was killed back in the '70's during a freak construction accident."* -> [datetime.datetime(1970, 1, 1, 0, 0)]
- The four-digit regex pattern requires the prefix "in the". This avoids false positives such as:
    - e.g. {index: 1167}: "A young man dressed **in 1700's clothing** has been seen through the windows of the first floor. He also bangs on windows late at night as if trying to escape. It is believed that the body of the young soldier is buried on the grounds of Craven Hall."
- Before `extract_dates` returns, datefinder matches with a year before 1620 (the landing at Plymouth Rock) are removed.
    - e.g. {index: 3}: "In the 1970's, one room, **room 211** ..." -> datetime([211, 1, 1]).
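
The decade rules in the notes above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: the names `TWO_DIGIT`, `FOUR_DIGIT`, `MIN_YEAR`, and `decade_dates` are hypothetical, and the sketch assumes, as the notebook does, that every two-digit decade belongs to the 1900s.

```python
import datetime
import re

# Hypothetical names for illustration; the notebook's own logic lives in extract_dates.
TWO_DIGIT = re.compile(r"\b(\d{2})'?s\b", re.IGNORECASE)            # "70's", "30s"
FOUR_DIGIT = re.compile(r"\bin the (\d{4})'?s\b", re.IGNORECASE)    # "in the 1970's"
MIN_YEAR = 1620  # landing at Plymouth Rock, per the notes

def decade_dates(text):
    """Return sorted January-1st datetimes for decade phrases found in text."""
    # Assumption: two-digit decades are read as 19xx.
    years = {int("19" + m) for m in TWO_DIGIT.findall(text)}
    # Four-digit decades only count when preceded by "in the".
    years |= {int(m) for m in FOUR_DIGIT.findall(text)}
    return [datetime.datetime(y, 1, 1) for y in sorted(years) if y >= MIN_YEAR]

print(decade_dates("He was killed back in the '70's during a freak accident."))
print(decade_dates("A young man dressed in 1700's clothing has been seen."))
```

The second call finds nothing because "in 1700's clothing" lacks the "in the" prefix, which is the false positive the four-digit rule is meant to avoid.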


    

### Feature Extraction

In [19]:
import pandas as pd
import datefinder
import datetime
import time
import re

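# Read the cleaned, tab-separated dataset
df = pd.read_csv("../data/processed/haunted_places_cleaned.tab", sep="\t")
# Replace missing descriptions with a blank string so the parsers always receive str
df["description"] = df["description"].fillna(" ").astype(str)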



def extract_dates(text):
    """
    Extract dates from a given text using three different methods:
    - `datefinder`
    - Two-digit regex patterns (e.g., "20's", "30s")
    - Four-digit regex patterns (e.g., "1920s", "1970's")

    Args:
        text (str): The input text containing potential date references.

    Returns:
        dict: A dictionary containing:
            - dates (list of datetime): Extracted date objects.
            - datefinder_count (int): Number of dates found using datefinder.
            - two_digit_pattern_count (int): Number of dates found using two-digit patterns.
            - four_digit_pattern_count (int): Number of dates found using four-digit patterns.
    """

    ## Parse Using DateFinder ##
    # Keep Only Years From 1620 Through 2025 #
    matched_dates = [date for date in datefinder.find_dates(text, base_date=datetime.datetime(2025, 1, 1))
                     if isinstance(date, datetime.datetime) and 1620 <= date.year < 2026]
    datefinder_count = len(matched_dates)


    matched_years = []

    ## Parse Two Digit Pattern eg. "20's" ##
    two_digit_pattern = [r"\b\d{2}'s\b", r"\b\d{2}s\b", r"\bin the \d{2}'s\b", r"\bin the \d{2}s\b"]
    for pattern in two_digit_pattern:
        matched_years.extend(
            [re.sub(r"in the|'|s", "", year.lower()).strip() for year in re.findall(pattern, text, re.IGNORECASE)]
        )
    matched_years = ["19" + year for year in matched_years]
    two_digit_pattern_count = len(matched_years)

    ## Parse 4 Digit Pattern eg. "in the 1970's" ##
    four_digit_pattern = [r"\bin the \d{4}'s\b", r"\bin the \d{4}s\b"]
    for pattern in four_digit_pattern:
        matched_years.extend(
            [re.sub(r"in the|'|s", "", year.lower()).strip() for year in re.findall(pattern, text, re.IGNORECASE)]
        )
    four_digit_pattern_count = len(matched_years) - two_digit_pattern_count

    ## Add Regex Matches to matched_dates ##
    for year in matched_years:
        matched_dates.append(datetime.datetime(int(year), 1, 1))

    ## If No Dates Matched, Return [2025, 1, 1] ##
    if not matched_dates:
        matched_dates.append(datetime.datetime(2025, 1, 1))

    ## Remove Duplicates ##
    matched_dates = list(set(matched_dates))
    
    res = {
        "dates" : matched_dates,
        "datefinder_count" : datefinder_count,
        "two_digit_pattern_count" : two_digit_pattern_count,
        "four_digit_pattern_count" : four_digit_pattern_count  
    }

    return res

start = time.time()
df["Haunted_Places_Date"] = df["description"].apply(extract_dates)
end = time.time()
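
One thing to watch in the two-digit step: the four patterns overlap. A phrase such as "in the 70's" is matched by both `\b\d{2}'s\b` and `\bin the \d{2}'s\b`, so `two_digit_pattern_count` can count it twice even though `set()` later removes the duplicate date. The sketch below compares the original pattern list with a single pattern that has an optional prefix. The names `count_original` and `count_single` are hypothetical, and the single pattern is an assumed alternative, not what the notebook ran.

```python
import re

# The four two-digit patterns from extract_dates.
ORIGINAL_PATTERNS = [r"\b\d{2}'s\b", r"\b\d{2}s\b", r"\bin the \d{2}'s\b", r"\bin the \d{2}s\b"]
# Assumed alternative: one pattern, optional "in the " prefix, optional apostrophe.
SINGLE_PATTERN = r"\b(?:in the )?(\d{2})'?s\b"

def count_original(text):
    """Count matches the way extract_dates does: one findall per pattern, summed."""
    return sum(len(re.findall(p, text, re.IGNORECASE)) for p in ORIGINAL_PATTERNS)

def count_single(text):
    """Count matches with one non-overlapping scan of the text."""
    return len(re.findall(SINGLE_PATTERN, text, re.IGNORECASE))

sample = "The ghost first appeared in the 50s and again in the 70's."
print(count_original(sample), count_single(sample))  # prints: 4 2
```

If per-method counts matter for the report, a single scan like this keeps each phrase to one count; the extracted dates themselves are unaffected either way.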




### Post-Processing

In [20]:
## Extract Counts From Each Method ##
df["Datefinder_Extracts"] = df["Haunted_Places_Date"].apply(lambda x: x["datefinder_count"])
df["Two_Digit_Extracts"] = df["Haunted_Places_Date"].apply(lambda x: x["two_digit_pattern_count"])
df["Four_Digit_Extracts"] = df["Haunted_Places_Date"].apply(lambda x: x["four_digit_pattern_count"])


date_finder_total = df["Datefinder_Extracts"].sum()
two_digit_total = df["Two_Digit_Extracts"].sum()
four_digit_total = df["Four_Digit_Extracts"].sum()

extract_printout = [("Datefinder", date_finder_total), 
                    ("Two_Digit", two_digit_total), 
                    ("Four_Digit",four_digit_total), 
                    ("Total", date_finder_total + two_digit_total + four_digit_total)]

## Overwrite Date Column To Exclude Counts ##
df["Haunted_Places_Date"] = df["Haunted_Places_Date"].apply(lambda x: x["dates"])

## Find Multi Date Entries (rows with more than two extracted dates) ##
multi_date_idx = df["Haunted_Places_Date"].apply(lambda x: len(x) > 2 if isinstance(x, list) else False)
multi_date_entries = df[multi_date_idx]

## Expand DataFrame ##
exploded_df = df.explode("Haunted_Places_Date")
## Take Dates Out of List (assign the result; apply does not modify in place) ##
exploded_df["Haunted_Places_Date"] = exploded_df["Haunted_Places_Date"].apply(lambda x: x[0] if isinstance(x, list) else x)
## Convert to Datetime. Fillna with [2025, 1, 1] ##
exploded_df["Haunted_Places_Date"] = pd.to_datetime(exploded_df["Haunted_Places_Date"], errors="coerce").fillna(datetime.datetime(2025, 1, 1))
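
To make the row growth from the explode step concrete, here is a plain-Python sketch of the same idea without pandas. The rows and the `explode` helper are hypothetical illustrations; the helper mirrors the documented behavior of `DataFrame.explode`, where an empty list yields a single row with a missing value.

```python
import datetime

# Hypothetical rows: an id plus the list of dates extracted for that description.
rows = [
    {"id": 0, "dates": [datetime.datetime(1970, 1, 1)]},
    {"id": 1, "dates": [datetime.datetime(1920, 1, 1), datetime.datetime(1985, 6, 3)]},
    {"id": 2, "dates": [datetime.datetime(2025, 1, 1)]},  # default: nothing was found
]

def explode(rows, column):
    """Return one copy of each row per element of row[column]."""
    out = []
    for row in rows:
        # An empty list still produces one row, with None standing in for NaN.
        values = row[column] or [None]
        for value in values:
            out.append({**row, column: value})
    return out

exploded = explode(rows, "dates")
print(len(rows), "->", len(exploded))  # prints: 3 -> 4
```

Each row with n dates becomes n rows, so the exploded row count equals the sum of the list lengths.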





In [None]:
df[df['description'].str.contains('thirties', na=False)]

In [25]:
df.loc[9520, "Haunted_Places_Date"]

KeyError: 'Haunted_Places_Date'

### Report and Save

In [21]:
## Printout Report ##
print("-" * 150, "Extraction Completed", "-" * 150)
print(f"Extraction Took: {end - start:.6f} seconds", end = "\n\n")
print("\n".join([f"{extract_str[0]}: {extract_str[1]} dates extracted" for extract_str in extract_printout]))
print(f"multi-date entries: {len(multi_date_entries)}", end = "\n\n")
print(f"Old Dataframe Shape: {df.shape} -> New DataFrame Shape: {exploded_df.shape}")
print("-" * 150)

------------------------------------------------------------------------------------------------------------------------------------------------------ Extraction Completed ------------------------------------------------------------------------------------------------------------------------------------------------------
Extraction Took: 9.535893 seconds

Datefinder: 5817 dates extracted
Two_Digit: 94 dates extracted
Four_Digit: 0 dates extracted
Total: 5911 dates extracted
multi-date entries: 421

Old Dataframe Shape: (10992, 14) -> New DataFrame Shape: (12871, 14)
------------------------------------------------------------------------------------------------------------------------------------------------------
