# Shark attack data project

#### Import the data, explore it, determine what needs to be cleaned or removed to make the data useful, and formulate hypotheses/questions

In [35]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')


In [3]:
attacks = pd.read_csv("../data/attacks.csv", encoding='latin1')


In [211]:
pd.set_option('display.max_columns', None) #Displays all the columns if they don't fit in the notebook
attacks.sample()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
25496,,,,,,,,,,,,,,,,,,,,,,,,


### `What would be interesting to find out?`

- Do sharks discriminate by gender?
- What percentage of the attacks occur during activities like boat trips, fishing, surfing...?
- ~~Which species is the most aggressive?~~ (too many unique values)
- What's the mortality rate of a shark attack?
- Do attacks increase or decrease over the years?
- What are the most common injuries (parts of the body)?
- How many megalodon attacks are there...?
- At what time of day do most attacks occur?
- Which age range is attacked the most?
- Where do they occur? Country/region

### `What's this data?`

`This is a dirty and disorganized dataset of global shark attacks with the following information:`

- `Case Number`: Case index
- `Date`
- `Year`
- `Type`: Type of incident, mainly one of: Boating, Unprovoked, Provoked, Questionable or Sea Disaster
- `Country`
- `Area`: Where the attack occurred
- `Location`: More specific location of the incident
- `Activity`: What the person was doing when the incident happened
- `Name`
- `Sex`
- `Age`
- `Injury`: Description of the injury
- `Fatal (Y/N)`: Whether the person was killed ("Y") or survived the attack ("N")
- `Time`: The hour and minutes when the incident happened
- `Species`: The species of shark involved in the incident
- `Investigator or Source`: Person or entity that carried out the case investigation (could be both)
- `pdf`: Name of the document related to the incident
- `href formula` & `href`: Link to the actual document
- `Case Number.1` & `Case Number.2`: Copies of 'Case Number'
- `original order`: Unclear, possibly another identifier


### `What needs to be cleaned?`

- 'Unnamed: 22' and 'Unnamed: 23' are almost completely empty. They can be dropped.
- Fix column names. 'Sex ' and 'Species ' have a trailing space.
- Rows from index 8702 to 25721 are entirely NaN. Actual data only goes up to row index 6301 (row 6302 is the first one with no real content).
- Standardize. All Sex values should be 'F'->female, 'M'->male or 'unknown'.
- Standardize. All fatality values should be 'Y'->yes, 'N'->no or 'unknown'.
- Too many unique values in species; not useful as is.
- Lots of matching values between the three identifier columns. Drop two of them and make 'Case Number' a unique identifier.
- Remove duplicate rows at indexes 4688, 5709, 6295.
- Dates/years need to be cleaned and standardized to a single format.
- Activity descriptions need to be simpler. Find the most common activities, use regex to group them into categories, and label the rest 'unknown'.
- Injury should also be cleaned with regex.
- Replace all null values in the reduced dataset with 'unknown'.
- Type values will be: Boating, Unprovoked, Provoked, Questionable or Sea Disaster.
- Country/Area/Location: useful data. Each may be deducible from the others. Cleanable, but might take too much work.
- Time: useful data. Cleanable, but might take too much work.
- Not using the pdf column for now.
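
The regex grouping of activities mentioned in the list could look roughly like the sketch below. The category names and patterns are assumptions for illustration only; the real ones should come from inspecting `attacks["Activity"].value_counts()`. The hypothetical helper `categorize_activity` uses only the standard library, so it can be applied with `attacks["Activity"].map(categorize_activity)`.

```python
import re

# Hypothetical categories and patterns (assumptions, not taken from the data).
# Order matters: the first matching pattern wins, so "Spearfishing" is
# classified as Fishing before the Diving pattern is even tried.
ACTIVITY_PATTERNS = [
    ("Fishing", re.compile(r"fish", re.IGNORECASE)),
    ("Surfing", re.compile(r"surf|boogie|body\s*board", re.IGNORECASE)),
    ("Diving", re.compile(r"div(e|ing)|snorkel", re.IGNORECASE)),
    ("Swimming", re.compile(r"swim|bath|wad", re.IGNORECASE)),
    ("Boating", re.compile(r"boat|kayak|canoe|sail", re.IGNORECASE)),
]


def categorize_activity(value):
    """Map a free-text activity to a coarse category, else 'unknown'."""
    if not isinstance(value, str):
        return "unknown"  # NaN and any other non-string value
    for category, pattern in ACTIVITY_PATTERNS:
        if pattern.search(value):
            return category
    return "unknown"
```

Putting Fishing first is a design choice: compound descriptions such as "Surf fishing" then land in Fishing rather than Surfing.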

`The exploration steps used to reach these conclusions can be found in the section below.`

In [6]:
attacks.shape

(25723, 24)

##### Exploration methods/attributes

- table
    - attacks.shape
    - attacks.columns
    - attacks.info()
- cohesiveness of data
    - attacks.duplicated()
    - attacks.isna().sum()
- values
    - attacks.dtypes: categorical / quantitative / str, int, float
    - attacks.describe() # quantitative variables
    - attacks[col1].value_counts() # frequency of each level of a categorical variable (dimension)
    - attacks[col1].unique() / .nunique()

In [77]:
attacks.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25723 entries, 0 to 25722
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Case Number             8702 non-null   object 
 1   Date                    6302 non-null   object 
 2   Year                    6300 non-null   float64
 3   Type                    6298 non-null   object 
 4   Country                 6252 non-null   object 
 5   Area                    5847 non-null   object 
 6   Location                5762 non-null   object 
 7   Activity                5758 non-null   object 
 8   Name                    6092 non-null   object 
 9   Sex                     5737 non-null   object 
 10  Age                     3471 non-null   object 
 11  Injury                  6274 non-null   object 
 12  Fatal (Y/N)             5763 non-null   object 
 13  Time                    2948 non-null   object 
 14  Species                 3464 non-null 

In [8]:
attacks.isna().sum()
# 'Unnamed: 22' and 'Unnamed: 23' are almost completely empty. They can be dropped

Case Number               17021
Date                      19421
Year                      19423
Type                      19425
Country                   19471
Area                      19876
Location                  19961
Activity                  19965
Name                      19631
Sex                       19986
Age                       22252
Injury                    19449
Fatal (Y/N)               19960
Time                      22775
Species                   22259
Investigator or Source    19438
pdf                       19421
href formula              19422
href                      19421
Case Number.1             19421
Case Number.2             19421
original order            19414
Unnamed: 22               25722
Unnamed: 23               25721
dtype: int64

In [17]:
attacks.columns.values.tolist()
# Fix column names. 'Sex ' and 'Species ' have a trailing space.

['Case Number',
 'Date',
 'Year',
 'Type',
 'Country',
 'Area',
 'Location',
 'Activity',
 'Name',
 'Sex ',
 'Age',
 'Injury',
 'Fatal (Y/N)',
 'Time',
 'Species ',
 'Investigator or Source',
 'pdf',
 'href formula',
 'href',
 'Case Number.1',
 'Case Number.2',
 'original order',
 'Unnamed: 22',
 'Unnamed: 23']
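
A minimal sketch of the column clean-up, run on a toy frame rather than the real dataset (the toy values are made up). It strips whitespace from the column names and drops every column whose name starts with `Unnamed`.

```python
import pandas as pd

# Toy frame that mimics the problem columns; not the real dataset.
toy = pd.DataFrame({
    "Sex ": ["M", "F"],
    "Species ": ["White shark", None],
    "Unnamed: 22": [None, None],
})

cleaned = toy.copy()
# Remove leading/trailing whitespace from every column name
cleaned.columns = cleaned.columns.str.strip()
# Drop the near-empty spill-over columns
unnamed_cols = [col for col in cleaned.columns if col.startswith("Unnamed")]
cleaned = cleaned.drop(columns=unnamed_cols)

print(cleaned.columns.tolist())
```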

In [9]:
attacks.isna().all(axis=1).sum()

17020

In [10]:
attacks.index[attacks.isna().all(axis=1)].min()

8702

In [11]:
attacks.index[attacks.isna().all(axis=1)].max()
# Rows from index 8702 to 25721 are all NaN

25721

In [40]:
attacks.index.max()

25722

In [59]:
attacks.iloc[25722]

Case Number                xx
Date                      NaN
Year                      NaN
Type                      NaN
Country                   NaN
Area                      NaN
Location                  NaN
Activity                  NaN
Name                      NaN
Sex                       NaN
Age                       NaN
Injury                    NaN
Fatal (Y/N)               NaN
Time                      NaN
Species                   NaN
Investigator or Source    NaN
pdf                       NaN
href formula              NaN
href                      NaN
Case Number.1             NaN
Case Number.2             NaN
original order            NaN
Unnamed: 22               NaN
Unnamed: 23               NaN
Name: 25722, dtype: object

In [99]:
#attacks.iloc[8701]
#attacks.iloc[7373]
#attacks.iloc[6373]
attacks.iloc[6302]

Case Number                    0
Date                         NaN
Year                         NaN
Type                         NaN
Country                      NaN
Area                         NaN
Location                     NaN
Activity                     NaN
Name                         NaN
Sex                          NaN
Age                          NaN
Injury                       NaN
Fatal (Y/N)                  NaN
Time                         NaN
Species                      NaN
Investigator or Source       NaN
pdf                          NaN
href formula                 NaN
href                         NaN
Case Number.1                NaN
Case Number.2                NaN
original order            6304.0
Unnamed: 22                  NaN
Unnamed: 23                  NaN
Name: 6302, dtype: object

In [72]:
# Check where the actual data stops: many rows are empty except for 'Case Number', which holds the value '0'
# columns_to_check = [col for col in attacks.columns if col != 'Case Number']

# The same applies to the values of the 'original order' column
columns_to_check = [col for col in attacks.columns if col != 'Case Number' and col != 'original order']

# idxmax() returns the first row where every checked column is NaN, i.e. the first row past the real data
min_index = attacks[columns_to_check].isna().all(axis=1).idxmax()

print(f"actual data ends at row {min_index}")

actual data ends at row 6302
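
Once the placeholder columns are known, the trimming itself can be sketched with `dropna(how="all", subset=...)`. This is one possible approach on a toy frame with made-up values, not necessarily the one used later in the project.

```python
import numpy as np
import pandas as pd

# Toy frame: two real rows, two rows that only hold placeholder values,
# and one fully empty row.
toy = pd.DataFrame({
    "Case Number": ["2018.06.25", "2018.06.18", "0", "0", np.nan],
    "Date": ["25-Jun-2018", "18-Jun-2018", np.nan, np.nan, np.nan],
    "Country": ["USA", np.nan, np.nan, np.nan, np.nan],
    "original order": [6303.0, 6302.0, 6304.0, 6305.0, np.nan],
})

# Columns that can hold a value even when the row has no real content
placeholder_cols = ["Case Number", "original order"]
content_cols = [col for col in toy.columns if col not in placeholder_cols]

# Keep only rows with at least one non-null value among the content columns
trimmed = toy.dropna(how="all", subset=content_cols).reset_index(drop=True)
print(trimmed)
```

Unlike slicing by position, this does not rely on the empty rows all sitting at the end of the file.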


In [12]:
attacks.describe()

Unnamed: 0,Year,original order
count,6300.0,6309.0
mean,1927.272381,3155.999683
std,281.116308,1821.396206
min,0.0,2.0
25%,1942.0,1579.0
50%,1977.0,3156.0
75%,2005.0,4733.0
max,2018.0,6310.0


In [15]:
attacks["Sex "].unique()
# Standardize. All Sex values should be 'F'->female, 'M'->male or 'unknown'

array(['F', 'M', nan, 'M ', 'lli', 'N', '.'], dtype=object)

In [140]:
attacks[:6302]["Sex "].isna().sum()

565

In [16]:
attacks["Fatal (Y/N)"].unique()
# Standardize. All fatality values should be 'Y'->yes, 'N'->no or 'unknown'

array(['N', 'Y', nan, 'M', 'UNKNOWN', '2017', ' N', 'N ', 'y'],
      dtype=object)

In [141]:
attacks[:6302]["Fatal (Y/N)"].isna().sum()

539
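
Both `Sex` and `Fatal (Y/N)` need the same treatment: trim, upper-case, and keep the value only if it is in an allowed set. A minimal sketch with a hypothetical helper `normalize_flag`. It assumes stray values such as 'lli', '.', '2017' or an 'N' in the Sex column cannot be recovered and should become 'unknown'.

```python
def normalize_flag(value, allowed):
    """Return the trimmed, upper-cased value if it is allowed, else 'unknown'."""
    if isinstance(value, str):
        cleaned = value.strip().upper()
        if cleaned in allowed:
            return cleaned
    # NaN, typos and anything outside the allowed set end up here
    return "unknown"


# Values taken from the unique() outputs above
print([normalize_flag(v, {"M", "F"}) for v in ["F", "M ", "lli", "N", "."]])
print([normalize_flag(v, {"Y", "N"}) for v in [" N", "N ", "y", "UNKNOWN", "2017"]])
```

It could then be applied with something like `attacks["Sex "].map(lambda v: normalize_flag(v, {"M", "F"}))`.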

In [13]:
attacks["Species "].nunique()
#Too many unique values in species, not useful data

1549

In [81]:
# attacks['Case Number'].equals(attacks['Case Number.1']) -> False
# attacks['Case Number'].equals(attacks['Case Number.2']) -> False
# attacks['Case Number.1'].equals(attacks['Case Number.2']) -> False

valid_rows = attacks.iloc[:6302]  # rows that hold real data

# Build the masks from the same slice so the boolean index aligns with the rows being filtered
matching_cases = valid_rows[valid_rows['Case Number'] == valid_rows['Case Number.1']]
non_matching_cases = valid_rows[valid_rows['Case Number'] != valid_rows['Case Number.1']]

matching_count = matching_cases.shape[0]
non_matching_count = non_matching_cases.shape[0]

print(matching_count, non_matching_count)
# NaN != NaN, so missing values are never counted as a match. No need to go up to index 8702.


6278 24
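
Because `NaN == NaN` evaluates to `False`, a plain `==` undercounts matches whenever both columns are missing on the same row. If such rows should count as matching, a NaN-aware comparison can be sketched as follows (toy values, not the real data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Case Number": ["2018.06.25", "2018.06.18", np.nan],
    "Case Number.1": ["2018.06.25", "2018.06.17", np.nan],
})

a, b = toy["Case Number"], toy["Case Number.1"]

# Plain equality: the row where both values are NaN is not a match
plain_matches = int((a == b).sum())

# NaN-aware equality: also treat "both missing" as a match
nan_aware_matches = int(((a == b) | (a.isna() & b.isna())).sum())

print(plain_matches, nan_aware_matches)
```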


In [181]:
column_pairs = [
    ('Case Number', 'Case Number.1'),
    ('Case Number', 'Case Number.2'),
    ('Case Number.1', 'Case Number.2'),
    ('href', 'href formula')
]

results = []

valid_rows = attacks.iloc[:6302]  # rows that hold real data

for column1, column2 in column_pairs:
    # Build the masks from the same slice so the boolean index aligns with the rows being filtered
    matching_count = valid_rows[valid_rows[column1] == valid_rows[column2]].shape[0]
    non_matching_count = valid_rows[valid_rows[column1] != valid_rows[column2]].shape[0]
    result_string = f"{column1} and {column2} have {matching_count} matching values and {non_matching_count} non-matching values"
    results.append(result_string)

for result in results:
    print(result)

Case Number and Case Number.1 have 6278 matching values and 24 non-matching values
Case Number and Case Number.2 have 6298 matching values and 4 non-matching values
Case Number.1 and Case Number.2 have 6282 matching values and 20 non-matching values
href and href formula have 6242 matching values and 60 non-matching values


In [76]:
attacks["Case Number"].is_unique
# Lots of matching values between the three identifier columns. Drop two of them and make 'Case Number' a unique identifier

False

In [185]:
# 'href formula' can be dropped because it's almost the same as 'href'
attacks[:6302]["href"].isna().sum()

0

In [79]:
subset = attacks[['Date', 'Year', 'Type', 'Country', 'Area', 'Location', 'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)']]

subset.duplicated().any()

True

In [85]:
subset[:6302][subset[:6302].duplicated()]
# These three rows are duplicated

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N)
4688,Fall 1943,1943.0,Unprovoked,USA,Hawaii,"Midway Island, Northwestern Hawaiian Islands",Spearfishing,2 males,M,,Calf nipped in each case,N
5709,1890,1890.0,Unprovoked,INDIA,Tamil Nadu,Tuticorin,Diving,a pearl diver,M,,No details,UNKNOWN
6295,Before 1906,0.0,Unprovoked,AUSTRALIA,,,Fishing,fisherman,M,,FATAL,Y
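
Removing such rows can be sketched with `drop_duplicates` and a `subset` of descriptive columns, so that differing identifiers do not hide a duplicate. The toy rows and case numbers below are hypothetical.

```python
import pandas as pd

# Toy frame: rows 0 and 1 describe the same incident under different case numbers
toy = pd.DataFrame({
    "Case Number": ["1943.00.00.a", "1943.00.00.b", "1890.00.00"],
    "Date": ["Fall 1943", "Fall 1943", "1890"],
    "Country": ["USA", "USA", "INDIA"],
})

# Compare only the descriptive columns and keep the first occurrence
key_cols = ["Date", "Country"]
deduped = toy.drop_duplicates(subset=key_cols, keep="first")

print(deduped)
```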


In [105]:
attacks["Type"].nunique()

8

In [107]:
attacks["Type"].unique()

array(['Boating', 'Unprovoked', 'Invalid', 'Provoked', 'Questionable',
       'Sea Disaster', nan, 'Boat', 'Boatomg'], dtype=object)

In [142]:
attacks[:6302]["Type"].isna().sum()

4
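
A small alias table is probably enough to bring `Type` down to the five target values. In this sketch the aliases come from the `unique()` output above, while sending 'Invalid' and NaN to 'unknown' is an assumption. `normalize_type` is a hypothetical helper.

```python
# Target values from the clean-up notes
VALID_TYPES = {"Boating", "Unprovoked", "Provoked", "Questionable", "Sea Disaster"}

# Misspellings and variants seen in attacks["Type"].unique()
TYPE_ALIASES = {"boat": "Boating", "boatomg": "Boating"}


def normalize_type(value):
    """Map a raw Type value to one of VALID_TYPES, else 'unknown'."""
    if not isinstance(value, str):
        return "unknown"  # NaN
    cleaned = value.strip()
    if cleaned in VALID_TYPES:
        return cleaned
    # 'Invalid' has no alias, so it falls through to 'unknown' (assumption)
    return TYPE_ALIASES.get(cleaned.lower(), "unknown")


print([normalize_type(v) for v in ["Boat", "Boatomg", "Sea Disaster", "Invalid"]])
```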

In [95]:
#attacks["Year"].nunique()
attacks["Year"].unique()

array([2018., 2017.,   nan, 2016., 2015., 2014., 2013., 2012., 2011.,
       2010., 2009., 2008., 2007., 2006., 2005., 2004., 2003., 2002.,
       2001., 2000., 1999., 1998., 1997., 1996., 1995., 1984., 1994.,
       1993., 1992., 1991., 1990., 1989., 1969., 1988., 1987., 1986.,
       1985., 1983., 1982., 1981., 1980., 1979., 1978., 1977., 1976.,
       1975., 1974., 1973., 1972., 1971., 1970., 1968., 1967., 1966.,
       1965., 1964., 1963., 1962., 1961., 1960., 1959., 1958., 1957.,
       1956., 1955., 1954., 1953., 1952., 1951., 1950., 1949., 1948.,
       1848., 1947., 1946., 1945., 1944., 1943., 1942., 1941., 1940.,
       1939., 1938., 1937., 1936., 1935., 1934., 1933., 1932., 1931.,
       1930., 1929., 1928., 1927., 1926., 1925., 1924., 1923., 1922.,
       1921., 1920., 1919., 1918., 1917., 1916., 1915., 1914., 1913.,
       1912., 1911., 1910., 1909., 1908., 1907., 1906., 1905., 1904.,
       1903., 1902., 1901., 1900., 1899., 1898., 1897., 1896., 1895.,
       1894., 1893.,

In [111]:
attacks["Date"].nunique()
attacks["Date"].unique()


array(['25-Jun-2018', '18-Jun-2018', '09-Jun-2018', ..., '1883-1889',
       '1845-1853', nan], dtype=object)

In [113]:
attacks["Date"].nunique()

5433

In [101]:
attacks["Location"].nunique()
#might be too much work

4108

In [103]:
attacks["Area"].nunique()
#might be too much work

825

In [144]:
#attacks["Country"].nunique()
attacks["Country"].unique()

array(['USA', 'AUSTRALIA', 'MEXICO', 'BRAZIL', 'ENGLAND', 'SOUTH AFRICA',
       'THAILAND', 'COSTA RICA', 'MALDIVES', 'BAHAMAS', 'NEW CALEDONIA',
       'ECUADOR', 'MALAYSIA', 'LIBYA', nan, 'CUBA', 'MAURITIUS',
       'NEW ZEALAND', 'SPAIN', 'SAMOA', 'SOLOMON ISLANDS', 'JAPAN',
       'EGYPT', 'ST HELENA, British overseas territory', 'COMOROS',
       'REUNION', 'FRENCH POLYNESIA', 'UNITED KINGDOM',
       'UNITED ARAB EMIRATES', 'PHILIPPINES', 'INDONESIA', 'CHINA',
       'COLUMBIA', 'CAPE VERDE', 'Fiji', 'DOMINICAN REPUBLIC',
       'CAYMAN ISLANDS', 'ARUBA', 'MOZAMBIQUE', 'FIJI', 'PUERTO RICO',
       'ITALY', 'ATLANTIC OCEAN', 'GREECE', 'ST. MARTIN', 'FRANCE',
       'PAPUA NEW GUINEA', 'TRINIDAD & TOBAGO', 'KIRIBATI', 'ISRAEL',
       'DIEGO GARCIA', 'TAIWAN', 'JAMAICA', 'PALESTINIAN TERRITORIES',
       'GUAM', 'SEYCHELLES', 'BELIZE', 'NIGERIA', 'TONGA', 'SCOTLAND',
       'CANADA', 'CROATIA', 'SAUDI ARABIA', 'CHILE', 'ANTIGUA', 'KENYA',
       'RUSSIA', 'TURKS & CAICOS', 'UNITE

In [175]:
list_of_countries_to_change = []

for country, count in attacks["Country"].value_counts().items():
  if count == 1:
    list_of_countries_to_change.append(country)

print(list_of_countries_to_change)

['SAN DOMINGO', 'RED SEA?', 'REUNION ISLAND', 'CYPRUS', 'IRELAND', 'KUWAIT', 'ASIA?', 'MONACO', 'FALKLAND ISLANDS', 'INDIAN OCEAN?', 'PARAGUAY', 'EQUATORIAL GUINEA / CAMEROON', 'ALGERIA', 'Coast of AFRICA', 'TASMAN SEA', 'GHANA', 'AFRICA', 'PERU', 'GREENLAND', 'SWEDEN', 'COOK ISLANDS', 'ROATAN', 'BRITISH NEW GUINEA', 'ANDAMAN ISLANDS', 'Between PORTUGAL & INDIA', 'TUVALU', 'DJIBOUTI', 'SYRIA', 'GEORGIA', 'BAHREIN', 'OCEAN', 'KOREA', 'ITALY / CROATIA', 'COMOROS', 'CURACAO', 'SLOVENIA', 'BRITISH ISLES', 'WESTERN SAMOA', 'BANGLADESH', 'SOUTH CHINA SEA', 'ANGOLA', 'NORTHERN ARABIAN SEA', 'EGYPT / ISRAEL', 'MEXICO ', 'Seychelles', 'GRAND CAYMAN', 'ST. MAARTIN', 'Sierra Leone', 'GULF OF ADEN', 'BRITISH VIRGIN ISLANDS', 'NEVIS', 'MALDIVES', 'PALESTINIAN TERRITORIES', 'DIEGO GARCIA', 'ST. MARTIN', 'PUERTO RICO', 'ARUBA', 'RED SEA', 'FEDERATED STATES OF MICRONESIA', 'ADMIRALTY ISLANDS', 'ARGENTINA', 'MID-PACIFC OCEAN', 'BAY OF BENGAL', 'SOLOMON ISLANDS / VANUATU', ' PHILIPPINES', 'JAVA', 'IRAN 
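
Several of these single-occurrence countries are only formatting variants of more frequent ones (' PHILIPPINES', 'MEXICO ', 'Seychelles'). A first normalization pass could be sketched as below. The hypothetical `normalize_country` only fixes whitespace, case and a trailing '?'. It does not merge real aliases such as 'REUNION ISLAND' and 'REUNION'.

```python
def normalize_country(value):
    """Trim, drop a trailing '?', and upper-case a country name."""
    if not isinstance(value, str):
        return "unknown"  # NaN
    cleaned = value.strip().rstrip("?").strip().upper()
    return cleaned if cleaned else "unknown"


print([normalize_country(v) for v in [" PHILIPPINES", "MEXICO ", "Seychelles", "RED SEA?"]])
```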

In [149]:
attacks["Time"].unique()
# cleanable but might be too much

array(['18h00', '14h00  -15h00', '07h45', nan, 'Late afternoon', '17h00',
       '14h00', 'Morning', '15h00', '08h15', '11h00', '10h30', '10h40',
       '16h50', '07h00', '09h30', 'Afternoon', '21h50', '09h40', '08h00',
       '17h35', '15h30', '07h30', '19h00, Dusk', 'Night', '16h00',
       '15h01', '12h00', '13h45', '23h30', '09h00', '14h30', '18h30',
       '12h30', '16h30', '18h45', '06h00', '10h00', '10h44', '13h19',
       'Midday', '13h30', '10h45', '11h20', '11h45', '19h30', '08h30',
       '15h45', 'Shortly before 12h00', '17h34', '17h10', '11h15',
       '08h50', '17h45', '13h00', '10h20', '13h20', '02h00', '09h50',
       '11h30', '17h30', '9h00', '10h43', 'After noon', '15h15', '15h40',
       '19h05', '1300', '14h30 / 15h30', '22h00', '16h20', '14h34',
       '15h25', '14h55', '17h46', 'Morning ', '15h49', '19h00',
       'Midnight', '09h30 / 10h00', '10h15', '18h15', '04h00', '14h50',
       '13h50', '19h20', '10h25', '10h45-11h15', '16h45', '15h52',
       '06h15', '14h

In [188]:
from dateutil.parser import parse
def is_valid_time(time_str):
    if isinstance(time_str, str):
        try:
            parse(time_str)
            return True
        except ValueError:
            return False
    return False  # Return False for non-string values

# List of time columns (you can add more if needed)
time_columns = ['Time']

# Initialize counters for valid and invalid times
valid_time_count = 0
invalid_times = []

# Iterate through time columns and check for valid times
for column in time_columns:
    is_valid = attacks.iloc[:6302][column].apply(is_valid_time)
    valid_time_count += is_valid.sum()
    
    # Collect invalid times
    invalid_times.extend(attacks.iloc[:6302][~is_valid][column])

    print(f'Number of valid times in column {column}: {valid_time_count}')
    print(f'Number of invalid times in column {column}: {len(invalid_times)}')

print('Invalid times:')

Number of valid times in column Time: 2325
Number of invalid times in column Time: 3977
Invalid times:
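
For the "time of day" question, the hour alone may be enough. A sketch that pulls the first `HHhMM` pattern out of a value with a regex. The hypothetical `extract_hour` deliberately ignores other spellings such as '1300' and free-text values such as 'Morning', which would need their own rules.

```python
import re

# One or two digits, a literal 'h', then exactly two digits (e.g. 18h00, 9h00)
TIME_RE = re.compile(r"(?<!\d)(\d{1,2})h(\d{2})(?!\d)")


def extract_hour(value):
    """Return the hour (0-23) of the first HHhMM pattern found, else None."""
    if not isinstance(value, str):
        return None  # NaN
    match = TIME_RE.search(value)
    if match is None:
        return None
    hour, minute = int(match.group(1)), int(match.group(2))
    if hour > 23 or minute > 59:
        return None  # looks like a time but is out of range
    return hour


print([extract_hour(v) for v in ["18h00", "14h00  -15h00", "Shortly before 12h00", "Morning"]])
```

For ranges such as '14h00  -15h00' this keeps the start of the range, which is a simplification.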


In [130]:
# attacks.sample()
# seems like case number columns have dates, might be useful
# found an interesting function to import
from dateutil.parser import parse

# Define a function to check if a date is valid
def is_valid_date(date_str):
    if isinstance(date_str, str):
        try:
            parse(date_str)
            return True
        except ValueError:
            return False
    return False  # Return False for non-string values

# List of date columns (you can add more if needed)
date_columns = ['Date']

# Initialize counters for valid and invalid dates
valid_date_count = 0
invalid_dates = []

# iterate through date columns and check for valid dates up to row 6301
for column in date_columns:
    is_valid = attacks.iloc[:6302][column].apply(is_valid_date)
    valid_date_count += is_valid.sum()
    
    # collect invalid dates
    invalid_dates.extend(attacks.iloc[:6302][~is_valid][column])

    print(f'Number of valid dates in column {column}: {valid_date_count}')
    print(f'Number of invalid dates in column {column}: {len(invalid_dates)}')

print('Invalid dates:')
print(invalid_dates)



Number of valid dates in column Date: 5462
Number of invalid dates in column Date: 840
Invalid dates:
['Reported 30-Apr-2018', 'Reported 10-Apr-2018', 'Reported 25-Nov-2017', 'Reported 13-Nov-2017', 'Reported 31-Oct-2017', 'Reported 06-Sep-2017', 'Reported 26-Jul-2017', 'Reported 07-Jul-2017', 'Reported 14-Jun-2017', 'Reported 07-Jun-2017', 'Reported 06-May-2017', 'Reported 09-Mar-2017', 'Reported 08-Jan-2017', 'Reported  14-Jul-2016', 'Reported 08-Jul-2016', 'Reported 03-Mar-2016', 'Reported 10-Feb-2016', 'Reported 11-Jan-2016', 'Reported 25-Jun-2015', 'Reported 23-Dec-2014', 'Reported 03-Dec-2014', 'Reported 17-Nov-2014', 'Reported 12-Sep-2014', 'Reported 25-Aug-2014', 'Reported 27-Jun-2014', 'Reported 17-Jun-2014', 'Reported 10-May-2014', 'Reported 12-Apr-2014', 'Reported 17-Feb-2014', 'Reported 08-Aug-2013', 'Reported 17-Jul-2013', 'Reported 14-Jun-2013', 'Reported 02-Apr-2013', 'Reported 21-Mar-2013', 'Reported 21-Jan-2013', 'Reported 11-Oct-2012', 'Reported 28-Jun-2012', 'Reported 22-Jan-2012', 'Reported 26-Dec-2011', 'Reported 20-Nov-2011', 'Reported 28-Oct-2011', '16-Aug--2011', '11-Aug--2011', 'Reported 14-Jun-2011', 'Reported 06-Jun-2011', 'Reported 07-May-2011', 'Reported 29-Mar-2011', 'Reported 10-Mar-2010', 'Reported 28-Feb-2011', 'Reported 04-Feb-2011', 'Reported 12-Jan 2011', 'Reported 03-Dec-2010', 'Reported 27-Nov-2010', 'Reported 12-Nov-2010', 'Reported 28-Oct-2010', 'Reported 06-Sep-2010', '190Feb-2010', 'Reported 06-Feb-2010', 'Reported 29-Oct-2009', 'Reported 14-Oct-2009', 'Reported 24-Jul-2009', 'Reported 14-Jun-2009', 'Reported 25-Apr-2009', 'Reported 17-Mar-2009', 'Reported 16-Mar-2009', 'Reported 27-Jan-2009', 'Reported 26-Jan-2009', 'Reported 13-Jan-2009', 'Reported 30-Jul-2008', 'Late Jul-2008', 'Reported 26-Jun-2008', 'Reported 02-Jun-2008', 'Reported 19-Apr-2008', 'Reported 09-Apr-2008', 'Reported 08-Apr-2008', 'Reported 21-Feb-2008', 'Reported 19-Jan-2008', 'Fall 2008', 'Summer-2008', '19-Jul-2007.b', '19-Jul-2007.a', 'Reported 17-May-2007', 'Reported 09-May-2007', 'Reported      13-Apr-2007', 'Reported 14-Mar-2007', 'Reported 
05-Mar-2007', 'Reported 18-Sep-2006', 'Early Aug-2006', 'Reported 17-Jul-2006', 'Reported      23-Apr-2006', 'Reported  28-Mar-2006', 'Reported 28-Jan-2006', 'Reported 06-Dec-2005', 'Reported 29-Nov-2005', 'Reported  16-Nov-2005', 'Reported      27-Sep-2005', 'Reported      15-Jul-2005', 'Reported    27-Mar-2005', ' 19-Jul-2004 Reported to have happened  "on the weekend"', 'Reported 15-Jan-2004 ', 'Reported 26-Jul-2003', 'Late Jul-2003', 'Reported 06-Aug-2002', 'Reported 13-Jun-2002', 'Reported 13-Jun-2002', 'Reported 21-May-2002', '02-Ap-2001', 'Reported  24-Jan-2001', 'Early Sep-2000', 'Reported 27-Aug-2000', 'Early Jun-2000', 'Reported      03-Mar-2000', 'Reported 28-Jan-2000', 'Reported 16-Sep-1999', 'Reported 18-Mar-1999 ', 'Reported 03-Jan-1999', 'Reported 20-Dec-1998', 'Reported 16-Sep-1998', 'Reported      23-Aug-1998', 'Reported 28-Jan-1998', 'Reported 05-Nov-1997', 'Reported 11-Oct-1997', 'Reported      19-Feb-1996', 'Reported 28-Oct-1995', 'Early Jul-1995', 'Reported      10-Dec-1994 ', 'Reported 11-Sep-1994', 'Reported 16-Apr-1994', 'Reported 12-Jan-1994', 'Last incident of 1994 in Hong Kong', 'Reported      19-Aug-1993', 'Late May 1993', 'Fall 1993', 'Between May & Nov-1993', 'Reported 12-Nov-1992', 'Reported 29-Oct-1989', 'Reported    06-Aug-1989', 'Reported      05-Jan-1988', 'Reported 22-Sep-1986 ', 'Mid Jul-1985 or mid Jul-1986', 'Reported      10-Nov-1983', 'Ca. 
1983', 'Late Aug-1982', 'Reported 15-Jun-1981', 'Summer of 1981', 'Reported 22-Aug-1980', 'Late Jul-1980', 'Early Jul-1980', 'Summer 1980', '1980s ', '1980s ', 'Reported 02-Sep-1978', 'Reported 01-Aug-1978', 'Apr-1978`', 'Reported 02-Jun-1976', '26-Jul-1975.b', 'Reported 02-Jun-1975', 'Reported 14-Feb-1975', 'Reported 25-Apr-1974', 'Early Feb-1974', 'Summer 1974', 'Reported 18-Dec-1973', 'Reported 10-Sep-1973', 'Reported 10-Oct-1972', 'Reported 26-Jun-1972', 'Reported 26-Jun-1972', 'Reported 25-Nov-1971', 'Reported 16-Apr-1971', 'Late Apr-1971', 'Reported 09-Jan-1970', '1970s', 'Late 1970s', 'Ca. 1970', '1970s', '1970s', '1970s', '1970s', 'Reported 17-Feb-1969', 'Winter 1969', 'Reported 11-Apr-1968', 'Reported 21-Dec-1967', 'Reported 26-Oct-1967', 'Reported 14-Aug-1967', '13 or 30-May-1967', 'Early Nov-1966', 'Sep- 1966', 'Mid Aug-1966', 'Summer of 1996', 'Reported 02-Jul-1965', 'May-Jun-1965', 'Summer 1965', 'Early 1965', 'May-Jun-1965', 'May-Jun-1965', 'Ca. 1965', 'Reported 17-Feb-1964', 'Reported 06-Jan-1964', 'Reported 16-Nov-1963', 'Reported 05-Nov-1963', 'Reported 10-Jul-1963', 'Early 1963', 'Reported 31-Aug-1962', 'Late Aug-1962', 'Reported 03-Jul-1962', 'Jan-Jun-1962', 'Ca. 1962', 'Reported 06-Sep-1961', 'Reported 06-Jun-1961', 'Reported 02-Jan-1961', 'Reported 22-Aug-1960', 'Reported 20-Apr-1960', 'Early summer 1960', 'Late 1960s', '1960s', '1960-1961', 'Ca. 1960', 'Ca. 1960', 'Reported 12-Nov-1959', 'Between 10 and 12-Sep-1959', 'Reported 10-Sep-1959', 'Reported 31-Aug-1959', '21764', 'Late Jul-1959', 'Reported 26-Jun-1959', '02-Feb-1959 Reported', 'Summer of 1959', 'Jul- to Sep-1959', 'Reported 07-Nov-1958', '  Reported 31-Jul-1958', 'Reported 02-Jun-1958', 'Reported 09-Jan-1958', '1958-1959', 'Circa 1958', 'Reported 04-Nov-1957', 'Reported 07-May-1957', 'Reported 15-Dec-1956', 'Reported 15-Aug-1956', 'Reported 26-May-1956', 'Reported 10-Feb-1956', 'Reported 16-Jan-1956', 'Reported 31-Dec-1955', 'Reported 04-Aug-1955', 'Reported 13-Apr-1955', 'Ca. 
1955 ', '19955', 'Reported 04-Jul-1954', 'Reported 02-Jul-1954', 'Reported 01-Jul-1954', 'Reported 26-May-1954', '1954 (same day as  1954.00.00.f)', '1954 (same day as  1954.00.00.f)', 'Reported 20-Sep-1953', 'Reported 03-Sep-1953', 'Reported 19-Mar-1953', 'Reported 07-May-1952', '1952-1954', '\n1951.12.15.R', 'Reported 23-Nov-1951', 'Reported 03-Sep-1951', 'Reported 02-Sep-1951', 'Reported 16-Aug-1951', 'Between 01-Aug-1951 & 08-Aug-1951', 'Reported 19-Jul-1951', 'Reported 09-May-1951', 'Reported 19-Dec-1950', 'Reported 27-Jul-1950', 'Reported 18-Feb-1950', 'Reported 12-Jan-1950', '1950 - 1951', 'Summer 1950', '1950s', 'Summer 1950', 'Ca. 1950', 'Ca. 1950', 'Reported 10-Aug-1949', 'Mar-1949 or Apr-1949', '1949-1950', 'Reported 17-Sep-1848', 'Reported 06-Jun-1948', 'Reported 13-Mar-1948 "Bitten last weekend', 'Summer 1948', 'Reported 15-Dec-1947', 'Reported 27-Jul-1947', 'Reported 24-Jul-1947', 'Reported 13-May-1947', 'Reported 06-Feb-1947', 'Reported 24-Dec-1946', 'Reported 26-Oct-1946', 'Between 18 & 22-Dec 1944', 'Reported 23-Oct-1944', 'Reported 24-May-1944', 'Some time between Apr & Nov-1944', 'Reported 01-May-1943', '02-Mar-1943 to 07-Mar-1943', 'Summer 1943', 'Fall 1943', 'Fall 1943', '11-Sep-1942 to 16-Sep-1942', 'Reported 11-Jun-1942', 'Reported 08-Jun-1942', 'Winter 1942', 'Summer 1942', 'Reported 21-Aug-1941', 'Reported 19-Dec-1940', 'Ca. 1940', 'Reported 02-Nov-1939', 'Reported 25-Oct-1939', 'Reported 27-Sep-1939', 'Woirld War II', 'Ca. 
1939', 'Reported 17-Jul-1938', 'Reported 02-May-1938', 'Reported 21-Mar-1938', 'Reported 1938', 'Reported 06-Nov-1937', 'Reported 26-Sep-t937', 'Reported 16-Jul-1937', 'Reported 28-Jun-1937', 'Reported 11-Sep-1936', 'Reported 04-Aug-1936', '08-Ap-1936', 'Reported 20-Feb-1936', 'Reported 04-Sep-1935', 'Reported 05-Jun-1935', 'Reported 12-Apr-1935', 'Reported 08-Apr-1935', 'Reported 25-Mar-1935', 'Reported 21-Jan-1935', 'Reported 26-Aug-1934', 'Reported 08-Feb-1934', 'Reported 25-Oct-1933', 'Reported 27-Sep-1933', 'Reported 26-Aug-1933', 'Reported 07-Jul-1933', 'Reported 08-Jun-1933', 'Reported 15-Feb-1933', 'Reported 11-Dec-1932', 'Reported 09-Dec-1932', 'Reported 26-Sep-1932', 'Reported 14-Jul-1932', 'Reported 06-Aug-1931', 'Reported 28-Jul-1931', 'Reported 04-Jun-1931', 'Reported 27-Apr-1931', 'Reported 26-Sep-1930', 'Reported 11-May-1930', 'Reported 07-Mar-1930', 'Reported 03-Feb-1930', 'Reported 09-Dec-1929', 'Reported 03-Dec-1929', 'Reported 17-Jul-1929', 'Reported 26-Apr-1929', 'Reported 17-Apr-1929', 'Reported 1929', 'Ca. 
1929', 'Reported 15-Nov-1928', 'Reported 24-Aug-1928', 'Late Jul-1928', 'Reported 24-Jun-1928', 'Reported 14-Apr-1928', 'Reported 28-Mar-1928', 'Some time between 08-Jan-1928 & 21-Jan-1928', 'Reported 09-May 1927', 'Reported 08-Jan-1927', 'Reported 02-Dec-1926', 'Reported 29-Oct-1926', 'Reported 06-Sep-1926', 'Summer of 1926', '07-Mar-1925 or 27-Mar-1925', 'Reported 27-Jan-1925', 'Reported 31-Oct-1924', 'Reported 28-Mar-1924', 'Reported 02-Jul-1923', 'Reported 23-May-1923', '1923-1924', 'Reported 26-Sep-1922', 'Reported 21-Sep-1922', 'Reported 02-Feb-1922', 'Reported 28-Jan-1922', 'Reported 15-Nov-1921', 'Reported 29-Aug-1929', 'Reported 11-Jan-1921', 'Reported 06-Jul-1920', 'Reported 24-Jan-1920', 'Reported 24-Jan-1920', '1920s', 'Reported 30-Dec-1919', 'Reported to have taken place in 1919', 'Reported 05-May-1917', 'Reported 24-Jun-1916', 'Reported 25-Apr-1916', 'Reported 09-Dec-1915', 'Reported 06-Jul-1915', 'Reported 06-Jul-1915', 'Reported 15-May-1915', 'Ca. 1915', 'Reported 04-Dec-1914', 'Reported 26-Sep-1914', 'Reported 15-Jul-1914', 'Reported 09-Jul-1914', 'Reported 14-Mar-1914', 'Reported 10-Mar-1914', 'Reported 09-Feb-1914', 'Reported 17-Jan-1914', 'Reported 30-Dec-1913', 'Reported 26-Dec-1913', 'Reported 27-Aug-1913', 'Reported 27-Aug-1913', 'Reported 10-Jul-1913', 'Reported 30-Nov-1912', 'Reported 06-Jul-1912', 'Reported 13-Jan-1912', 'Reported 31-Jul-1911', 'Reported 16-Jul-1911', 'Reported 01-May-1911', 'Reported 08-Apr-1911', 'Reported 29-Mar-1911', 'Ca. 
1911', 'Reported 25-Dec-1910', 'Reported 23-Dec-1910', 'Reported 25-Jun-1910', 'Reported 08-Jun-1910', 'Reported 16-May-1910', 'Reported 26-Nov-1909', 'Reported 15-Dec-1909', '14-Nov-1909 to 19-Nov-1909', 'Reported 04-Sep-1909', 'Reported 26-Jun-1909', 'Reported 27-Apr-1909', 'Reported 09-Apr-1909', 'Reported 16-Dec-1908', 'Reported 28-Aug-1908', 'Reported 18-Jul-1908', 'Reported 08-Jul-1908', 'Reported 02-Jun-1908', 'Reported 18-Oct-1907', 'Reported 16-Oct-1907', 'Reported 16-Oct-1907', 'Reported 19-Sep-1907', 'Reported 12-Aug-1907', 'Reported 12-Aug-1907', 'Reported 08-Aug-1907', 'Reported 14-Jul-1907', 'Reported 04-Jul-1907', 'Reported 10-Oct-1906', 'Reported 10-Oct-1906', 'Reported 10-Oct-1906', 'Reported 10-Oct-1906', 'Reported 27-Sep-1906', 'Reported 27-Sep-1906', 'Reported 05-Sep-1906', 'Reported 05-Jul-1906', 'Reported 27-Apr-1906', 'Reported 10-April 1906', 'Reported 02-April 1906', 'Reported 14-Feb-1906', 'Reported 06-Sep-1905', 'Late Aug-1905', 'Reported 25-Jul-1905', 'Reported 11-Oct-1904', 'Reported 01-Jul-1904', 'Reported 16-Sep-1903', 'Reported 20-Mar-1903', 'Summer of 1903', 'Ca. 1903', 'Reported 22-Dec-1902', 'Reported 01-Nov-1902', 'Reported 29-Aug-1902', 'Reported 24-Aug-1902', '.Reported 22-Feb-1902', 'Reported 25-Jan-1902', 'Reported 19-Jan-1902', 'Mid Oct-1901', 'Reported 23-Sep-1901', 'Reported 29-Jun-1901', 'Summer 1901', 'Late Jul-1900', 'Early 1900s', 'Ca. 1900', 'Reported 12-Oct-1899', 'Reported 12-Oct-1899', 'Reported 11-Sep-1899', 'Reported 23-Aug-1899', 'Reported 08-Jul-1899', 'Reported 04-May-1899', '1899 During the Seige of Ladysmith', 'Ca. 
1899', 'Reported 28-Dec-1898', 'Reported 19-Sep-1898', 'Reported 19-Sep-1898', 'Reported 07-Sep-1898', 'Reported 26-Jul-1898', 'Reported 1898', '1898 (soon after the close of the Spanish-American War)', '1898-1899', 'Summer of 1898', 'Reported 04-Dec-1897', 'Reported 05-Oct-1897', 'Reported 15-Mar-1897', 'Reported 15-Mar-1897', '23-Decp1896', 'Reported 17-Dec-1896', 'Reported 11-Sep-1896', 'Reported 21-Jun-1896', 'Reported 21-Nov-1895', 'Reported 14-Sep-1895', 'Reported 16-Jul-1895', 'Reported 13-Jun-1895', 'Reported 03-Jun-1895', 'Reported 29-Mar-1895', 'Reported 23-Feb-1895', 'Reported 06-Oct-1894', 'Reported 12-Sep-1894', 'Reported 06-Sep-1894', 'Reported 01-Sep-1894', 'Reportd 15-Jul-1894', 'Reported 15-Jun-1894', 'Reported 15-Jun-1894', 'Reported 28-Apr-1894', 'Reported 28-Apr-1894', 'Reported 20-Oct-1893', 'Reported 22-Jun-1893', 'Reported 23-May-1893', 'Reported 15-Apr-1893', 'Reported 30-Jan-1893', 'Reported 18-Jan-1893', 'Reported 09-Nov-1892', 'Reported 16-Sep-1892', 'Reported 20-Jun-1892', 'Reported 19-May-1892', 'Reported 21-Apr-1892', 'Reported 25-Mar-1892', 'Reported 31-Dec-1891', 'Reported 22-Dec-1891', 'Reported 14-Sep-1891', 'Reported 30-Aug-1891', 'Reported 08-Jan-1891', 'Reported 27-Dec-1890', 'Reported 25-Oct-1890', 'Reported 17-Aug-1890', 'Reported 14-Jun-1890', 'Reported 02-Jun-1890', 'Reported 03-Mar-1890', 'Ca. 1890', 'Reported 29-Nov-1889', 'Reported 03-Oct-1889', 'Reported 31-Jan-1889', 'Reported 25-Dec-1888', 'Reported 23-Oct-1888', 'Reported 01-Aug-1888', 'Reported 20-Jul-1888', 'Reported 13-Jul-1888', 'Reported 18-Jun-1888', 'Reported 22-Dec-1887', 'Reported 08-Feb-1887', 'Mid-Aug-1886', 'Mid-Aug-1886', 'Reported 06-Mar-1886', 'Reported 26-Nov-1885', 'Reported 16-Apr-1885', 'Reported 08-Dec-1884', 'Reported 28-Aug-1884', 'Reported 28-Apr-1884', 'Reported 18-Dec-1883', 'Reported 20-Oct-1883', 'Reported 19-Oct-1883', 'Reported 14-Sep-1883 (probably happened Ca. 
1843/1844)', 'Reported 26-Feb-1883', 'Reported 25-Feb-1883', 'Summer of 1883', 'Reported 12-May-1882', 'Reported  07-Feb-1882', 'Reported 23-Jan-1882', 'Reprted 05-Jan-1882', 'Reported 16-Aug-1881', 'Ca. 1881', 'Reported 07-May-1880', '1880?', 'Ca. 1880', 'Ca. 1880', 'Reported 30-Aug-1879', 'Reported 24-Oct-1878', 'Reported 14-Sep-1878', 'Reported 30-March-1878', 'Reported 15-Dec-1877', 'Reported 28-Aug-1877', 'Reported 17-Feb-1877', 'Reported 24-Jan-1877', 'Before 1878', 'Reported 13-Oct-1876', 'Reported 07-Sep-1876', 'Reported 04-Jun-1876', 'Reported 27-Nov-1875', 'Reported 20-Jan-1875', 'Ca. mid-1870s', 'Reported 15-Jul-1874', 'Reported 18-Jun- 1874', 'Reported 15-Jun-1874', 'Reported 15-Jun-1874', 'Reported 20-Apr-1874', '09-Jan- 1874', 'Reported 09-Jun-1873', 'Reported 09-Jan-1873', 'Nov- or Dec-1873', 'Reported 30-Nov-1872', 'Reported 07-Aug-1872', 'Reported 03-Apr-1872', 'Circa 1872', 'Reported 11-Dec-1871', 'Reported 29-Aug-1871', 'Reported 22-May-1871', 'Reported 13-May 1871', 'Before 1871', 'Before 1871', 'Reported 26-May-1870', 'Early 1870s', 'Reported 15-Apr-1869', 'Reported 04-Jan-1868', 'Reported 24-Oct-1888, but took place around 1868', '1868 (?)', 'Reported 22-Aug-1867', 'Reported 26-Jan-1867', 'Reported 09-Jul-1866', 'Reported 24-Apr-1866', 'Reported 14-Jul-1865', 'Reported 01-Mar-1865', 'Reported 18-Sep-1864', 'Reported 27-Jan-1864', 'Reported 16-Jan-1864', 'Reported 13-Sep-1863', 'Reported 09-Jul-1863', 'Reported 1863', 'Reported 19-Dec-1862', 'Reported 15-Aug-1862', 'Reported 02-Aug-1862', 'Circa 1862', 'Reported 25-Sep-1861', 'Reported 12-Feb-1861', 'Reported 15-Jan-1861', 'Reported 14-Jan 1858', 'Reported 09-Jan-1858', 'Reported 25-Nov-1856', 'Reported 20-Nov-1856', 'Reported 21-Jun-1856', 'Reported 09-Apr-1855', 'Reported 12-Feb-1855', 'Circa 1855', 'Reported 13-Jul-1853', 'Sep or Oct-1853', '1853 or 1854', 'Reported 23-Oct-1852', 'Reported 09-Oct-1852', '"Anniversary Day" 22-Jan-1850 or 1852', 'Reported 16-Jun-1851', 'Reported 17-Jul-1848', 
'Reported 03-Jul-1847', 'Reported in 1847', 'Ca. 1847', 'Reported 30-Sep-1846', 'Reported 20-Aug-1846', 'Reported 16-Sep-1845', 'Reported 10-Sep-1845', 'Reported 31-Jul-1845', 'Reported 1845', '1844.07.16.R', 'Reported 18-Sep-1840', 'Reported 22-Jul-1840', 'Reported 09-Apr-1840', '1839/1840', 'Ca. 1839', 'Ca. 1837', 'Reported 19-Aug-1836', '1836.07.26.R', '1836.00.', 'Reported 21-Feb-1835', 'Reported 15-Jul-1834', 'Reported 23-Jan-1832', 'Reported 22- Jan-1831', 'Reported 02-Jul-1830', 'Reported 22-Apr-1830', 'Reported 03-Jul-1829', 'Ca. Nov-1826', 'Reported 15-Aug-1826', 'Reported 20-May-1826', '1820s', 'Ca . 1825', 'Reported 30-Dec-1823', 'Reported 08-Jul-1819', 'Reported 22-May-1818', 'Reported 03-Sept-1816', 'Reported 25-Dec-1808', 'Reported 01-May-1808', 'Reported 26-Feb-1804', 'Reported Apr-13-1802', 'Reported 18-Dec-1801', 'Reported May-28-1797', 'Reported 10-Aug-1786', 'Reported 26-Sep-1785', 'Reported 1776', 'Reported 12-Jul-1771', 'Reported 27-Oct-1753', 'Reported 06-Apr-1738', '1700s', '1700s', '1700s', 'Late 1600s Reported 1728', 'Reported 1638', 'Reported 1637', 'Reported 1617', 'Letter dated 10-Jan-1580', 'Ca. 1554', 'Ca. 1543', 'Circa 500 A.D.', '77  A.D.', 'Ca. 5 A.D.', 'Ca. 214 B.C.', 'Ca. 336.B.C..', '493 B.C.', 'Ca. 
725 B.C.', 'Before 1939', '1990 or 1991', 'Before 2016', 'Before Oct-2009', 'Before 1934', 'Before 1934', '2009?', 'Before 1930', '1880-1899', 'Before 1909', 'Before 2012', 'Before 1916', 'Between   1951-1963', 'Before 1908', 'Before 1900', 'Before 1876', 'Before 2012', 'Before 2011', 'Before 2011', 'Before 2009', 'Beforer 1994', 'Before 1963', '1896-1913', 'Before 1936', 'Before 08-Jun-1912', 'Before 2012', 'Before 1911', 'Before 1901', 'No date, late 1960s', 'Before 2006', 'Before 2003', 'Before 2004', 'Before 1962', '1950s', 'No date, Before 1963', '2003?', 'No date', 'No date', 'Before Feb-1998', 'No date, Before May-1996', 'No date, Before Mar-1995', 'Before 1996', 'No date, Before Aug-1989', 'No date, Before Aug-1987', 'No date, Before 1987', 'No date, Before  1975', 'No date, Before 1975', 'No date, Before 1969', 'No date, Before 3-Jan-1967', 'No date, Before 1963', 'No date, Before 8-May-1965', 'No date, Before 1963', 'No date, Before 1902', 'No date, Before 1902', 'No date, Before 1963', 'No date, After August 1926 and before 1936', 'No date, Before 1963', 'Before 1962', 'Before 1962', 'Before 1961', '1960s', '1960s', '1960s', 'Before 1960', 'Before  19-Jun-1959', 'Before  24 Apr-1959', 'Before  1958', 'Before  1958', 'Before 1958', 'Before 1958', 'Before 1958', 'Before 1958', 'Before 1957', 'Before 1957', 'Before 1956', 'Before 1956', 'Before Mar-1956', 'Before 1952', '1941-1945', '"During the war" 1943-1945', '"Before the war"', 'Said to be 1941-1945, more likely 1945', '1941-1945', '1941-1945', '1941-1942', '1940 - 1950', '1940 - 1950', '1940 - 1950', '1940-1946', 'Before 1905', 'World War II', 'World War II', 'Before 1905', 'A few years before 1938', 'No date', 'Early 1930s', 'Before 1927', 'Between 1918 & 1939', 'No date', 'No date', 'No date', '1920 -1923', 'Before 1921', 'Before 1911', 'Before 1921', 'Before 1921', 'Before 1917', 'Before 17-Jul-1916', 'No date (3 days after preceding incident) & prior to 19-Jul-1913', 'Before 19-Jul-1913', 'Before 
1911', 'Circa 1862', 'Before 1906', 'Before 1906', 'Before 1906', 'Before 1906', 'Before 1903', 'Before 1903', '1900-1905', '1883-1889', '1845-1853']


In [136]:
# Check whether a value is a string that dateutil can parse as a date
from dateutil.parser import parse

def is_valid_date(date_str):
    if isinstance(date_str, str):
        try:
            parse(date_str)
            return True
        except (ValueError, OverflowError):
            return False
    return False  # Non-string values (e.g. NaN) are not valid dates

# Columns to check (add more if needed)
date_columns = ['Date']

# First and last row positions that hold a parseable date
valid_range_start = None
valid_range_end = None

# Scan the first 6301 rows (positions 0 to 6300) of each column.
# Note: only the first and last valid rows are recorded; rows in between may still be invalid.
for column in date_columns:
    for index, date_value in enumerate(attacks.iloc[:6301][column]):
        if is_valid_date(date_value):
            if valid_range_start is None:
                valid_range_start = index
            valid_range_end = index

# Check if any valid dates were found
if valid_range_start is not None:
    print(f'Valid dates are from row {valid_range_start} to row {valid_range_end}.')
else:
    print('No valid dates found within the specified range.')


Valid dates are from row 0 to row 6171.
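
The loop above only reports where the first and last parseable values sit. As a complementary idea, here is a minimal sketch of a stricter, vectorized parse. It assumes that most usable values follow the `DD-Mon-YYYY` pattern, optionally prefixed by `Reported`, as the `Date` listing suggests. The `sample` series and the `clean_dates` helper are hypothetical names for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample mirroring formats seen in the Date column
sample = pd.Series([
    "25-Jun-2018",
    "Reported 08-Jun-1910",
    "Ca. 1903",
    "No date",
    None,
])

def clean_dates(dates: pd.Series) -> pd.Series:
    """Drop a leading 'Reported' and parse DD-Mon-YYYY strings.

    Anything that does not match the format becomes NaT.
    """
    stripped = dates.str.replace(r"^\W*Reported\s+", "", regex=True).str.strip()
    return pd.to_datetime(stripped, format="%d-%b-%Y", errors="coerce")

parsed = clean_dates(sample)
print(parsed.notna().sum(), "of", len(sample), "values parsed")
```

Approximate values such as `Ca. 1903` are deliberately left as `NaT` here; whether to keep only the year for those rows is a separate decision.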


In [135]:
# Check whether a value is a string that dateutil can parse as a date
from dateutil.parser import parse

def is_valid_date(date_str):
    if isinstance(date_str, str):
        try:
            parse(date_str)
            return True
        except (ValueError, OverflowError):
            return False
    return False  # Non-string values (e.g. NaN) are not valid dates

# Columns to check (add more if needed)
date_columns = ['Case Number.1']

# First and last row positions that hold a parseable date
valid_range_start = None
valid_range_end = None

# Scan the first 6301 rows (positions 0 to 6300) of each column.
# Note: only the first and last valid rows are recorded; rows in between may still be invalid.
for column in date_columns:
    for index, date_value in enumerate(attacks.iloc[:6301][column]):
        if is_valid_date(date_value):
            if valid_range_start is None:
                valid_range_start = index
            valid_range_end = index

# Check if any valid dates were found
if valid_range_start is not None:
    print(f'Valid dates are from row {valid_range_start} to row {valid_range_end}.')
else:
    print('No valid dates found within the specified range.')


Valid dates are from row 0 to row 6300.
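
Because the two cells above track only the first and last valid positions, they cannot tell how many rows in between fail to parse. A possible way to get that count is a boolean mask, sketched below with the same `dateutil` check. The `validity_summary` helper and the `sample` series are assumptions made for illustration.

```python
import pandas as pd
from dateutil.parser import parse

def is_valid_date(value) -> bool:
    """Return True only for strings that dateutil can parse."""
    if not isinstance(value, str):
        return False
    try:
        parse(value)
        return True
    except (ValueError, OverflowError):
        return False

def validity_summary(column: pd.Series) -> dict:
    """Count valid and invalid values and report the first and last valid index."""
    mask = column.map(is_valid_date)
    valid_idx = mask[mask].index
    return {
        "valid": int(mask.sum()),
        "invalid": int((~mask).sum()),
        "first_valid": valid_idx.min() if len(valid_idx) else None,
        "last_valid": valid_idx.max() if len(valid_idx) else None,
    }

# Hypothetical sample; in the notebook this would be attacks.iloc[:6301]["Date"]
sample = pd.Series(["25-Jun-2018", "No date", None, "08-Jun-1910"])
summary = validity_summary(sample)
print(summary)
```

A large `invalid` count between `first_valid` and `last_valid` would indicate that the reported range is not a clean block of dates.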


In [177]:
attacks["Age"].unique()

array(['57', '11', '48', nan, '18', '52', '15', '12', '32', '10', '21',
       '34', '30', '60', '33', '29', '54', '41', '37', '56', '19', '25',
       '69', '38', '55', '35', '46', '45', '14', '40s', '28', '20', '24',
       '26', '49', '22', '7', '31', '17', '40', '13', '42', '3', '8',
       '50', '16', '82', '73', '20s', '68', '51', '39', '58', 'Teen',
       '47', '61', '65', '36', '66', '43', '60s', '9', '72', '59', '6',
       '27', '64', '23', '71', '44', '62', '63', '70', '18 months', '53',
       '30s', '50s', 'teen', '77', '74', '28 & 26', '5', '86', '18 or 20',
       '12 or 13', '46 & 34', '28, 23 & 30', 'Teens', '36 & 26',
       '8 or 10', '84', '\xa0 ', ' ', '30 or 36', '6½', '21 & ?', '75',
       '33 or 37', 'mid-30s', '23 & 20', ' 30', '7      &    31', ' 28',
       '20?', "60's", '32 & 30', '16 to 18', '87', '67', 'Elderly',
       'mid-20s', 'Ca. 33', '74 ', '45 ', '21 or 26', '20 ', '>50',
       '18 to 22', 'adult', '9 & 12', '? & 19', '9 months', '25 to 35',
  

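The `Age` values mix plain numbers with ranges, decades, several victims per row and free text. One possible cleaning step, sketched here as an assumption rather than the notebook's actual method, is to keep the first number found in each entry and convert month-based ages to years. The `first_age_in_years` helper and `sample_ages` are hypothetical.

```python
import pandas as pd

# Hypothetical sample built from values visible in the output above
sample_ages = pd.Series(
    ["57", "40s", "Teen", "18 months", "28 & 26", " 30", "6½", None]
)

def first_age_in_years(ages: pd.Series) -> pd.Series:
    """Extract the first integer in each entry; entries without digits become NaN.

    Simplification: for entries such as '28 & 26' only the first age is kept.
    """
    digits = ages.str.extract(r"(\d+)", expand=False)
    number = pd.to_numeric(digits, errors="coerce").astype(float)
    is_months = ages.str.contains("month", case=False, na=False)
    # Ages given in months are converted to years
    return number.where(~is_months, number / 12)

clean_ages = first_age_in_years(sample_ages)
print(clean_ages.tolist())
```

Entries like `Teen` or `adult` stay missing with this approach; mapping them to an age band would need an explicit lookup.
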
In [179]:
attacks["Injury"].nunique()

3737
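
With 3737 distinct descriptions, `Injury` is too free-form to group directly. To approach the question about the most affected body parts, one option is to count keyword matches instead. The sketch below assumes a small, hand-picked keyword list and made-up sample descriptions; both are illustrative and would need tuning against the real column.

```python
import pandas as pd

# Hypothetical injury descriptions, not taken from the dataset
sample_injuries = pd.Series([
    "Lacerations to left leg",
    "FATAL",
    "Right hand bitten",
    "No injury, board bitten",
    "Leg and foot injured",
    None,
])

# Assumed keyword list; plurals in -s are matched, irregular ones (e.g. 'feet') are not
keywords = ["leg", "foot", "arm", "hand"]

def count_body_parts(injuries: pd.Series, words: list) -> pd.Series:
    """Count how many descriptions mention each keyword as a whole word."""
    text = injuries.fillna("").str.lower()
    counts = {
        word: int(text.str.contains(rf"\b{word}s?\b", regex=True).sum())
        for word in words
    }
    return pd.Series(counts).sort_values(ascending=False)

counts = count_body_parts(sample_injuries, keywords)
print(counts)
```

A description can count towards several keywords, so the totals are mentions rather than exclusive categories.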