# Required Files:

NOAA Data
from ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ :
1. ghcnd-stations.txt
2. ghcnd-states.txt
3. ghcnd_all.tar.gz - this is a compressed archive, so you will need a program like WinRAR or 7-Zip to open it

FRA Data
from https://data.transportation.gov/Railroads/Highway-Rail-Grade-Crossing-Accident-Data/7wn6-i5b9 : 
1. https://data.transportation.gov/api/views/7wn6-i5b9/rows.csv?accessType=DOWNLOAD&bom=true&format=true

Cities Data
from https://github.com/kelvins/US-Cities-Database : 
1. https://github.com/kelvins/US-Cities-Database/blob/main/csv/us_cities.csv

-------
Additional Files

Weather Events Data
from https://www.kaggle.com/sobhanmoosavi/us-weather-events :
1. https://www.kaggle.com/sobhanmoosavi/us-weather-events/download

Safety Events Data 
from https://www.transit.dot.gov/ntd/data-product/safety-security-major-only-time-series-data :
1. https://data.transportation.gov/Public-Transit/Major-Safety-Events/9ivb-8ae9

## Instructions

# IMPORTANT - You will need ~40 GB of HDD or SSD space for the NOAA data, as well as about 2 hours (?) for downloading and extracting it
# (you don't have to be at the computer for most of that time)


1. Download all files and move them to the directory you plan to work in (the working directory, or wd).
2. Select the file ghcnd_all.tar.gz and open it with your unzipping tool (I used WinRAR). It will take several minutes to load because of its size (~120,000 files).
3. Once everything is loaded, you will see a text file, ghcnd-version.txt, and a folder, ghcnd_all.

You have two options for the next part: extracting all data or extracting only US data. Both take a long time; extracting only the US data is slightly faster but takes slightly more effort.

Extract all:
4. Select the folder and extract it to your current working directory (or another directory of your choice). You will need ~40 GB of space to be safe; I used an external hard drive. Extraction will take a while, so go grab a snack or watch an episode of something.

Extract US only:
4. Select the folder and open it in WinRAR/7-Zip; it will take a minute to load all the files inside. Sort the files by name, scroll down to 'US', select all files beginning with 'US', and extract them to a new folder 'ghcnd_all' in your working directory (wd/ghcnd_all). Extraction still takes a while, but this is faster than getting all 120k files.
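
If you would rather not use WinRAR/7-Zip, the US-only extraction can also be scripted. The following is a minimal sketch using Python's standard `tarfile` module; the function name `extract_us_stations` and its parameters are made up for illustration, and it assumes the daily files inside the archive are named `<station ID>.dly`.

```python
import os
import shutil
import tarfile


def extract_us_stations(archive_path, dest_dir, prefix="US"):
    """Copy only the .dly members whose file name starts with `prefix`.

    Hypothetical helper: files are written flat into dest_dir and the
    number of extracted files is returned.
    """
    os.makedirs(dest_dir, exist_ok=True)
    count = 0
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            name = os.path.basename(member.name)
            wanted = name.startswith(prefix) and name.endswith(".dly")
            if not (member.isfile() and wanted):
                continue
            # stream the member to disk without extracting the whole archive
            target = os.path.join(dest_dir, name)
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            count += 1
    return count


# example call (paths are placeholders for your own working directory):
# extract_us_stations(os.path.join(wd, "ghcnd_all.tar.gz"), os.path.join(wd, "ghcnd_all"))
```

The archive still has to be read from start to finish, so this is unlikely to be much faster than the manual route; it mainly saves the sorting and selecting by hand.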

Last step:
Check the imports below and pip install any that are missing from your PC.

In [None]:
# ensure you have at least: pandas 1.2.3, numpy 1.21.x

In [4]:
pip freeze




In [None]:
!pip install --upgrade --user pandas

In [None]:
!pip install --upgrade --user numpy
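
After upgrading, it can help to confirm that the installed versions meet the minimums noted earlier (pandas 1.2.3, numpy 1.21). This is a small sketch, not part of the original workflow; `version_tuple` and `meets_minimum` are hypothetical helper names, and the comparison only looks at the first three numeric parts of the version string.

```python
import re

import numpy as np
import pandas as pd


def version_tuple(version):
    """Turn a string such as '1.21.4' into (1, 21, 4)."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def meets_minimum(installed, minimum):
    """True when the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


print("pandas", pd.__version__, "ok:", meets_minimum(pd.__version__, "1.2.3"))
print("numpy", np.__version__, "ok:", meets_minimum(np.__version__, "1.21.0"))
```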

In [5]:
import glob
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np
import os
import pandas as pd
import seaborn as sns
import csv


In [6]:
# set the working directory (edit this path to match your machine), then confirm it
os.chdir('C:/Users/Skurai/practicum/final')
wd = os.getcwd()
print(wd)

C:\Users\Skurai\practicum\final


# Loading Data

We begin with NOAA because it is the most complex. Before we can use the NOAA daily data, we have to know which daily files we want to investigate. There are too many files (~120,000) to read them all in, so we restrict the files we look at by first checking which ones are relevant. We do this by mapping NOAA stations to the FRA (or other) dataset(s), and we only pull daily weather observations for stations that map to those datasets. This reduces the amount of data held in memory to a workable size.

# NOAA Dailies part 1: NOAA Stations

Use the function below to create a dataframe containing station information from the ghcnd-stations.txt file.  
The only input is your working directory (wd); your ghcnd-stations.txt file must be saved there. 

In [7]:
def load_noaa_stations_data(wd):
    """
    This function will load the weather station dataset from NOAA
    this is used for mapping to rail crossing locations.
    
    Contains weather station reference data
    
    Source: ghcnd-stations.txt from NOAA ftp
    Input: wd - working directory
    Output: stations_df - dataframe of NOAA stations
    """
    # columns in the station file
    colnames = ['ID', 'LAT', 'LON', 'ELEV', 'STATE', 'NAME', 'GSN', 'HCNCRN', 'WMOID']
    stationlist = []

    # iterate through stations and add them to our collection of stations if they are in the US
    with open(os.path.join(wd, "ghcnd-stations.txt"), "r") as f:
        for line in f:
            # first 2 characters are the country code, we only care about US stations
            if line[0:2] == 'US':

                # the readme gives 1-based, inclusive column positions (e.g. LATITUDE 13-20),
                # so each start is shifted down by one for Python slicing
                row = {"ID": line[0:11].upper(),
                       "LAT": float(line[12:20]),
                       "LON": float(line[21:30]),
                       "ELEV": float(line[31:37]),
                       "STATE": line[38:40],
                       "NAME": line[41:71],
                       "GSN": line[72:75],
                       "HCNCRN": line[76:79],
                       "WMOID": line[80:85]
                      }
                stationlist.append(row)

    # build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
    stations_df = pd.DataFrame(stationlist, columns=colnames)

    return stations_df

In [8]:
stations_df = load_noaa_stations_data(wd)
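
As an alternative to slicing each line by hand, pandas can read fixed-width files directly with `pd.read_fwf`. The sketch below is illustrative only: `read_stations_fwf` is a hypothetical name, the two sample rows are made-up stations, and the column boundaries are the zero-based, end-exclusive equivalents of the positions listed in the NOAA readme.

```python
import io

import pandas as pd

# zero-based, end-exclusive (start, stop) pairs for each column
COLSPECS = [(0, 11), (12, 20), (21, 30), (31, 37), (38, 40),
            (41, 71), (72, 75), (76, 79), (80, 85)]
COLNAMES = ['ID', 'LAT', 'LON', 'ELEV', 'STATE', 'NAME', 'GSN', 'HCNCRN', 'WMOID']


def read_stations_fwf(source):
    """Read a stations file (path or file-like object) and keep US stations only."""
    df = pd.read_fwf(source, colspecs=COLSPECS, names=COLNAMES, header=None)
    return df[df['ID'].str.startswith('US')].reset_index(drop=True)


def make_line(sid, lat, lon, elev, state, name):
    """Build one 85-character sample row in the stations layout (made-up data)."""
    return (f"{sid:<11} {lat:8.4f} {lon:9.4f} {elev:6.1f} {state:<2} "
            f"{name:<30} {'':<3} {'':<3} {'':<5}\n")


sample = io.StringIO(
    make_line('US1NJBG0001', 40.1234, -74.5678, 12.3, 'NJ', 'EXAMPLE TOWN 1.0 N')
    + make_line('CA001010066', 48.8667, -123.2833, 4.0, 'BC', 'EXAMPLE ISLAND')
)
sample_stations = read_stations_fwf(sample)
print(sample_stations[['ID', 'LAT', 'LON', 'ELEV', 'STATE']])
```

`read_fwf` strips the padding around each field and converts numeric columns automatically, so the result should match the hand-written parser apart from whitespace in the text columns.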

# NOAA Dailies part 2: NOAA States

Use the function below to create a dataframe containing state information from the ghcnd-states.txt file.  
The only input is your working directory (wd); your ghcnd-states.txt file must be saved there. 

In [9]:
def load_noaa_states_data(wd):
    """
    This function will load the state dataset from NOAA
    this is used for mapping to rail crossing locations.
    
    Contains state reference data
    
    Source: ghcnd-states.txt from NOAA ftp
    Input: wd - working directory
    Output: states_df - dataframe of NOAA states
    """
    # read in states dataset to supplement weather stations data
    colnames = ['CODE', 'NAME']
    rows = []

    with open(os.path.join(wd, "ghcnd-states.txt"), "r") as f:
        for line in f:
            modline = line.strip('\n')
            # first 2 characters are the state code, the name starts at the 4th character
            rows.append({'CODE': modline[0:2],
                         'NAME': modline[3:50]
                        })

    # create dataframe of state data in one step (DataFrame.append was removed in pandas 2.0)
    states_df = pd.DataFrame(rows, columns=colnames)
    return states_df

In [10]:
states_df = load_noaa_states_data(wd)

# NOAA Dailies part 3: NOAA Stations+States

use the function below to combine the two dataframes, while also adding the engineered feature 'wcoordinateID'.  
'wcoordinateID' will be used to join this dataset with others.

In [11]:
def merge_noaa_refdata(stations_df, states_df):
    """
    This function will merge the NOAA refdata
    this is used for mapping to rail crossing locations.
    
    Contains state & station reference data
    
    Input: stations_df, states_df
    Output: station_plus_df - dataframe of NOAA refdata
    """
    
    # add state data to the stations dataset
    station_plus_df = stations_df.join(states_df.set_index('CODE'), on='STATE', rsuffix='_STATE')

    # create our key feature: coordinateID (wcoordinateID for weather)
    # round latitude & longitude to 1 decimal, combine them in a tuple (lat, lon)
    station_plus_df['wcoordinateID'] = list(zip(round(station_plus_df['LAT'],1),round(station_plus_df['LON'],1)))
    station_plus_df = station_plus_df[['ID','ELEV','wcoordinateID']]
    
    return station_plus_df

In [12]:
station_plus_df = merge_noaa_refdata(stations_df, states_df)
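
To make the 'wcoordinateID' idea concrete: rounding latitude and longitude to one decimal place puts every location into a grid cell of 0.1 degrees (roughly 11 km north to south), and any two records in the same cell get the same key. The sketch below uses made-up stations and a hypothetical helper name, `add_coordinate_id`.

```python
import pandas as pd


def add_coordinate_id(df, lat_col, lon_col, key='wcoordinateID', decimals=1):
    """Return a copy of df with a (rounded lat, rounded lon) tuple column."""
    out = df.copy()
    out[key] = list(zip(out[lat_col].round(decimals), out[lon_col].round(decimals)))
    return out


# made-up coordinates: the first two stations are close together, the third is not
sample_points = pd.DataFrame({
    'ID': ['STATION_A', 'STATION_B', 'STATION_C'],
    'LAT': [40.14, 40.06, 40.31],
    'LON': [-74.46, -74.52, -74.46],
})
keyed = add_coordinate_id(sample_points, 'LAT', 'LON')
print(keyed[['ID', 'wcoordinateID']])
```

STATION_A and STATION_B end up with the same key, while STATION_C does not. One consequence of this design is that a single incident location can match several stations, or none at all if no station falls in its cell.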

# NOAA Dailies part 4: Cities Data

As mentioned above, we need to load the FRA data and join it with the station data before we pull in the daily observations.  
The FRA and NOAA data share no natural join key, so we must engineer a 'coordinateID' for the FRA dataset to merge on.  
To add 'coordinateID' to the FRA data we first need the city data, so that is the next step.

In [13]:
def load_us_cities_data(wd):
    """
    This function will load cities data which will 
    be used to attach coordinateID to other datasets 
    which only have city or county level data.
    County locations are derived separately, in grouped_meancounties.
    
    Input: wd - working directory
    Output: cities_df - cities dataframe with upper-cased 'County' and 'State' columns
    """
    cities_df = pd.read_csv(os.path.join(wd,"us_cities.csv"))
    
    # standardize county and state, city is not populated for all events.
    # one change to approach would be to include all cities + the grouped mean of each county
    cities_df['County'] = cities_df['COUNTY'].str.upper()
    cities_df['State'] = cities_df['STATE_NAME'].str.upper()
    return cities_df


In [14]:
cities_df = load_us_cities_data(wd)

In [15]:
def grouped_meancounties(cities):
    # subset of data that we care about, lat+lon to make coordinateID, county, state, state code to merge on
    counties = cities[['County','State','LATITUDE','LONGITUDE','STATE_CODE']]
    grouped_counties = counties.groupby(['State','County'])
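    # average only the coordinate columns; the string column STATE_CODE cannot be averaged
    grouped_meancounties_df = grouped_counties[['LATITUDE','LONGITUDE']].mean()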
    grouped_meancounties_df = grouped_meancounties_df.reset_index()
    grouped_meancounties_df['wcoordinateID'] = list(zip(round(grouped_meancounties_df['LATITUDE'],1),round(grouped_meancounties_df['LONGITUDE'],1)))
    
    return grouped_meancounties_df

In [16]:
grouped_meancounties_df = grouped_meancounties(cities_df)
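
The county location used here is simply the mean latitude and longitude of the cities listed for that county. A tiny sketch with made-up coordinates shows the grouping step on its own; the column names follow the cities dataframe above.

```python
import pandas as pd

# made-up city coordinates, two cities in one county and one in another
sample_cities = pd.DataFrame({
    'State': ['NEW JERSEY', 'NEW JERSEY', 'NEW YORK'],
    'County': ['BERGEN', 'BERGEN', 'ALBANY'],
    'LATITUDE': [40.9, 41.0, 42.65],
    'LONGITUDE': [-74.0, -74.1, -73.75],
})

# one row per (State, County) holding the mean of its cities' coordinates
county_centroids = (
    sample_cities
    .groupby(['State', 'County'], as_index=False)[['LATITUDE', 'LONGITUDE']]
    .mean()
)
print(county_centroids)
```

Note that this is an unweighted average of city locations, so it approximates the county's position rather than giving its true geographic center.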

# NOAA Dailies part 5: FRA Data

As mentioned above, we need to load in FRA data and join it with the station data before we pull in the daily observations.  
Run the function below to create a dataframe from the FRA dataset.

In [17]:
def load_rail_crossing_data(wd, grouped_meancounties_df):
    """
    This function will load data for rail crossings
    which will be used for instances for model training.
    Will also be used to limit weather station observations
    
    Input: wd - working directory, grouped_meancounties_df - location base data
    Output: rail_city_df
    """
    railcrossing_df = pd.read_csv(os.path.join(wd,"Highway-Rail_Grade_Crossing_Accident_Data.csv"))
    
    # gather the fields necessary for coordinateID, as well as any fields you want for analysis later
    # one possible change to the approach: restrict refined_rr_df to a subset of fields here
    refined_rr_df = railcrossing_df #[['Incident Number','Date','County Name', 'State Name']]

    # drop any incident without a date; .copy() avoids pandas' SettingWithCopyWarning when columns are added below
    refined_rr_df = refined_rr_df.dropna(subset=['Date']).copy()

    # create our feature incident date, which is an integer with format: yyyymmdd 
    incident_date = refined_rr_df['Date'].str.split(' ', expand=True)
    incident_date = incident_date[0].str.split('/', expand=True)
    refined_rr_df['incident_date'] = (incident_date[2].astype(int) * 10000) + (incident_date[0].astype(int) * 100) + (incident_date[1].astype(int) * 1)

    # merge accident data with city/county data to add coordinateID to each accident.
    merg_rail_city_df = refined_rr_df.merge(grouped_meancounties_df, how='inner', left_on=['County Name','State Name'], right_on=['County','State'])
    print("Shape of merged unfiltered rail_city dataset:  {}".format(merg_rail_city_df.shape))
    # keep only incidents from 2016 onward (incident_date is an integer yyyymmdd)
    merg_rail_city_df = merg_rail_city_df[merg_rail_city_df['incident_date'] > 20160000]
    merg_rail_city_df = merg_rail_city_df[['Grade Crossing ID', 
                                           'Maintenance Parent Railroad Code', 
                                           'Incident Number',
                                           'Crossing Illuminated',
                                           'Railroad Type',
                                           'Track Type Code', 
                                           'Number of Locomotive Units',
                                           'Number of Cars',
                                           'incident_date', 
                                           'wcoordinateID', 
                                           'State', 
                                           'County'
                                          ]]
    print("Shape of merged filtered rail_city dataset:  {}".format(merg_rail_city_df.shape))
    
    return merg_rail_city_df

In [18]:
merg_rail_city_df = load_rail_crossing_data(wd, grouped_meancounties_df)

Shape of merged unfiltered rail_city dataset:  (231891, 165)
Shape of merged filtered rail_city dataset:  (11412, 12)
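
The integer `incident_date` (yyyymmdd) can also be derived with `pd.to_datetime` instead of splitting strings by hand. This is a sketch under the assumption that the 'Date' field starts with a month/day/year date, optionally followed by a time; `to_yyyymmdd` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd


def to_yyyymmdd(date_strings):
    """Convert strings like '03/15/2019 12:00:00 AM' to integers like 20190315."""
    date_part = date_strings.str.split(' ').str[0]           # drop any time portion
    parsed = pd.to_datetime(date_part, format='%m/%d/%Y')    # assumes month/day/year
    return parsed.dt.year * 10000 + parsed.dt.month * 100 + parsed.dt.day


sample_dates = pd.Series(['03/15/2019 12:00:00 AM', '12/01/2016'])
print(to_yyyymmdd(sample_dates).tolist())
```

Parsing with an explicit format has the side benefit of raising an error on malformed dates instead of silently producing an invalid integer.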


# FRA Manipulation 
It may make sense to do as much manipulation of the FRA data as you can (restricting state, year, etc.) before you merge with the NOAA data. The date preprocessing code is included here for now.

In [19]:
def pre_process_rail(merg_rail_city_df):
    """
    This function will preprocess rail df.
    Focus is on datetime formats
    
    input: merg_rail_city_df
    output: enriched merg_rail_city_df
    """
    
    merg_rail_city_df['incident_datetime'] = pd.to_datetime(merg_rail_city_df['incident_date'], format='%Y%m%d')
    merg_rail_city_df['incident_year'] = merg_rail_city_df['incident_datetime'].dt.year
    merg_rail_city_df['incident_month'] = merg_rail_city_df['incident_datetime'].dt.month
    merg_rail_city_df['incident_year_month'] = merg_rail_city_df['incident_year'].astype(str) + '_' + merg_rail_city_df['incident_month'].astype(str)
    
    return merg_rail_city_df

In [20]:
merg_rail_city_df = pre_process_rail(merg_rail_city_df)

# FRA Merged Stations
Use the code below to get the list of relevant weather station IDs.  
The name of each daily file contains the weather station ID, so this list is the filter that prevents loading irrelevant daily files from NOAA.

In [21]:
# dataframe of weather stations - only those that share coordinateID with an accident
merged_stations_incidents_df = station_plus_df.merge(merg_rail_city_df,left_on='wcoordinateID', right_on='wcoordinateID', how='inner')

# filter by state 
target_states = ['NEW JERSEY','NEW YORK','PENNSYLVANIA','CONNECTICUT','DELAWARE','MARYLAND','MASSACHUSETTS','NEW HAMPSHIRE','VIRGINIA']
target_state_codes = ['NJ', 'NY', 'PA', 'CT', 'DE', 'MD', 'MA', 'NH', 'VA']
statefiltered_stations_incidents_df = merged_stations_incidents_df[merged_stations_incidents_df['State'].isin(target_states)]

# filter by year (incidents before 2016 were already dropped in load_rail_crossing_data, so 2015 matches nothing)
yearstatefiltered_stations_incidents_df = statefiltered_stations_incidents_df[statefiltered_stations_incidents_df['incident_year'].isin([2015,2016,2017,2018,2019,2020,2021])]

# save a list of the station IDs that were included in the merged dataframe
incident_stations = [x.upper() for x in yearstatefiltered_stations_incidents_df['ID'].unique()]

In [22]:
print(len(incident_stations))

546
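
The filtering idea can be seen end to end on a toy example: join stations to incidents on the shared key, collect the matching station IDs, and use them to decide which `.dly` files are worth opening. All IDs, keys, and paths below are made up, and `is_relevant` is a hypothetical helper.

```python
import os

import pandas as pd

toy_stations = pd.DataFrame({
    'ID': ['us_station_1', 'US_STATION_2', 'US_STATION_3'],
    'wcoordinateID': [(40.1, -74.5), (40.1, -74.5), (42.7, -73.8)],
})
toy_incidents = pd.DataFrame({
    'Incident Number': ['A1', 'A2'],
    'wcoordinateID': [(40.1, -74.5), (39.0, -76.6)],
})

# inner join keeps only stations that share a grid cell with at least one incident
matched = toy_stations.merge(toy_incidents, on='wcoordinateID', how='inner')
relevant_ids = set(matched['ID'].str.upper())


def is_relevant(dly_path, relevant_ids):
    """True when the file name (without .dly) is one of the relevant station IDs."""
    station_id = os.path.splitext(os.path.basename(dly_path))[0].upper()
    return station_id in relevant_ids


print(sorted(relevant_ids))
```

Incident A2 has no station in its cell, so it contributes nothing to the list. It is worth checking how many incidents fall into that category in the real data.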


# Loading NOAA Daily Data

Now that we have the incident stations list, we can load daily NOAA data using the function below.  
## You will need to edit the glob.glob path inside the function so that it points to where your daily files are located 

In [23]:
def load_noaa_dailies(wd):
    '''
    prior attempts to include this data failed because ghcnd_all.tar.gz is too large.
    by limiting the number of stations included to only those where incidents occurred,
    and by limiting the observation years from each station, we can reduce the amount of
    memory required to process this data

    '''
    # with assistance from 
    # https://stackoverflow.com/questions/62165172/convert-dly-files-to-csv-using-python
    # fields as given by the spec
    
    fields = [
        ["ID", 1, 11],
        ["YEAR", 12, 15],
        ["MONTH", 16, 17],
        ["ELEMENT", 18, 21]]

    offset = 22

    for value in range(1, 32):
        fields.append((f"VALUE{value}", offset,     offset + 4))
        fields.append((f"MFLAG{value}", offset + 5, offset + 5))
        fields.append((f"QFLAG{value}", offset + 6, offset + 6))
        fields.append((f"SFLAG{value}", offset + 7, offset + 7))
        offset += 8

    # Modify fields to use Python numbering
    fields = [[var, start - 1, end] for var, start, end in fields]
    fieldnames = [var for var, start, end in fields]


    # the goal of this code is to make 1 file TOTAL from many (originally 1 per station)

    # enter where you want a csv saved - it can be very large
    csv_filename = os.path.join(wd, 'noaa_relevant_dailies.csv')

    # a set makes the per-file membership check fast
    wanted_stations = set(incident_stations)

    with open(csv_filename, 'w', newline='') as f_csv:
        spamwriter = csv.writer(f_csv)
        # write the header once, not once per station file
        spamwriter.writerow(fieldnames)

        # glob.glob should aim at the folder where you extracted all the daily files, wd/ghcnd_all - do not forget to include '\*.dly'
        for dly_filename in glob.glob(r'F:\weather\ghcnd_all\*.dly'):
            path, name = os.path.split(dly_filename)
            station = name[:-4].upper()
            if station in wanted_stations:
                # you could replace this with adding to a dataframe or something else
                with open(dly_filename, newline='') as f_dly:
                    for line in f_dly:
                        row = [line[start:end].strip() for var, start, end in fields]
                        year = int(row[1])

                        # important check to save memory, only add recent observations
                        if year > 2014:
                            spamwriter.writerow(row)

            
                
                
# end product is a csv with us weather station data.  needs more cleaning 

In [26]:
load_noaa_dailies(wd)
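
For reference, each line of a `.dly` file holds one station, year, month, and element, followed by 31 day slots of 8 characters each (a 5-character value plus three 1-character flags). The sketch below parses a single line into day values; `parse_dly_line` and `build_sample_line` are hypothetical names, the sample record is made up, and -9999 is treated as the missing-value marker described in the NOAA readme.

```python
def parse_dly_line(line):
    """Parse one fixed-width .dly line into a dict with 31 daily values."""
    record = {
        'ID': line[0:11],
        'YEAR': int(line[11:15]),
        'MONTH': int(line[15:17]),
        'ELEMENT': line[17:21],
    }
    values = []
    for day in range(31):
        start = 21 + day * 8              # each day occupies 8 characters
        raw = int(line[start:start + 5])  # first 5 characters are the value
        values.append(None if raw == -9999 else raw)
    record['VALUES'] = values
    return record


def build_sample_line():
    """Made-up record: two observed days, the rest missing, all flags blank."""
    days = [123, 50] + [-9999] * 29
    body = ''.join(f"{value:>5}   " for value in days)
    return 'USC00280325' + '2019' + '03' + 'TMAX' + body


parsed = parse_dly_line(build_sample_line())
print(parsed['ID'], parsed['YEAR'], parsed['MONTH'], parsed['ELEMENT'], parsed['VALUES'][:3])
```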

# Cleaning NOAA Daily Data

The previous function took all the relevant .dly files and combined them into a single .csv.  
Now we will work with the csv to further clean the data.

In [28]:
def clean_noaa_dailies(wd):
    df = pd.read_csv(os.path.join(wd,"noaa_relevant_dailies.csv"))
    # drop any repeated header rows (present when a header was written once per station file)
    df = df[df['YEAR'] != 'YEAR'].copy()

    # month and year may hold a mix of strings and ints. lets standardize
    df['YEAR'] = pd.to_numeric(df['YEAR'])
    df['MONTH'] = pd.to_numeric(df['MONTH'])


    # collect one frame per day, then concatenate once at the end
    day_frames = []

    # loop through all days to partially transpose the file (day cols -> rows)
    for i in range(1,32):
        colnames = [f'VALUE{i}', f'MFLAG{i}', f'QFLAG{i}', f'SFLAG{i}']
        newcolnames = ['VALUE', 'MFLAG', 'QFLAG', 'SFLAG']
        col_order = ['ID','YEAR','MONTH','DAY','ELEMENT', colnames[0], colnames[1], colnames[2], colnames[3]]

        # .copy() so that adding DAY does not trigger SettingWithCopyWarning
        df_new = df[['ID','YEAR','MONTH','ELEMENT', colnames[0], colnames[1], colnames[2], colnames[3]]].copy()
        df_new['DAY'] = i
        df_new = df_new[col_order]
        df_new = df_new.rename(columns=dict(zip(colnames, newcolnames)))
        day_frames.append(df_new)

    base = pd.concat(day_frames, sort=False)

    newcsv = base[['ID','YEAR','MONTH','DAY','ELEMENT','VALUE','MFLAG','QFLAG','SFLAG']]

    daily_station_coordinates = station_plus_df[['ID','wcoordinateID']]
    daily_final = newcsv.merge(daily_station_coordinates, left_on='ID', right_on='ID')
    # save next to the other files in the working directory
    daily_final.to_csv(os.path.join(wd, 'final_daily_observations.csv'))
    print('cleaning complete.  final shape: ', daily_final.shape, '.  reload directory to see file')

In [29]:
clean_noaa_dailies(wd)


cleaning complete.  final shape:  (2099351, 10) .  reload directory to see file
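
The day-columns-to-rows reshape can also be expressed with `pd.wide_to_long`, which understands column names made of a stub plus a numeric suffix (VALUE1, VALUE2, ...). The sketch below is a possible alternative on a made-up two-day record, not the method used above; it also masks the -9999 missing-value marker, which the cleaning function leaves in place.

```python
import pandas as pd

# made-up wide record with two day slots instead of 31, to keep the example short
wide = pd.DataFrame({
    'ID': ['USC00280325'],
    'YEAR': [2019],
    'MONTH': [3],
    'ELEMENT': ['TMAX'],
    'VALUE1': [123], 'MFLAG1': [''],
    'VALUE2': [-9999], 'MFLAG2': [''],
})

# stubnames are the column prefixes; the numeric suffix becomes the DAY column
long_df = pd.wide_to_long(
    wide,
    stubnames=['VALUE', 'MFLAG'],
    i=['ID', 'YEAR', 'MONTH', 'ELEMENT'],
    j='DAY',
).reset_index()

# treat -9999 as missing (it also fills day slots that do not exist in short months)
long_df['VALUE'] = long_df['VALUE'].mask(long_df['VALUE'] == -9999)
print(long_df)
```

The `i` columns must uniquely identify each wide row, which holds for the daily data as long as there is one line per station, year, month, and element.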


# Weather Event Data 
The Kaggle weather events dataset can be loaded with the code below, which also adds 'wcoordinateID'. 

In [31]:
def load_weatherEvents(wd):
    '''
    This data may not actually be used by us if we like the daily data
    or possibly we will use both somehow
    '''
    w_events = pd.read_csv(os.path.join(wd, 'WeatherEvents_Jan2016-Dec2020.csv'))

    # it looks like 'Severe' and 'Heavy' are most extreme, so filter to these 
    extreme_severities = ['Severe', 'Heavy']
    # .copy() so the new columns below are added to an independent dataframe
    extreme_w_events = w_events[w_events['Severity'].isin(extreme_severities)].copy()
    print(w_events.shape, extreme_w_events.shape)
    print(f'extreme events represent about {round(100*(extreme_w_events.shape[0]/w_events.shape[0]),2)} percent of all events in the original data')

    # add coordinateID
    extreme_w_events['wcoordinateID'] = list(zip(round(extreme_w_events['LocationLat'],1),round(extreme_w_events['LocationLng'],1)))

    # calculate start and end dates as integers yyyymmdd
    w_events_date_start = extreme_w_events['StartTime(UTC)'].str.split(' ', expand=True)
    w_events_date_start = w_events_date_start[0].str.split('-', expand=True)
    extreme_w_events['event_start_dt'] = (w_events_date_start[0].astype(int) * 10000) + (w_events_date_start[1].astype(int) * 100) + (w_events_date_start[2].astype(int) * 1)

    w_events_date_end = extreme_w_events['EndTime(UTC)'].str.split(' ', expand=True)
    w_events_date_end = w_events_date_end[0].str.split('-', expand=True)
    extreme_w_events['event_end_dt'] = (w_events_date_end[0].astype(int) * 10000) + (w_events_date_end[1].astype(int) * 100) + (w_events_date_end[2].astype(int) * 1)
    
    return extreme_w_events


In [32]:
extreme_w_events = load_weatherEvents(wd)

(6274206, 13) (1333526, 13)
extreme events represent about 21.25 percent of all events in the original data
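
Filtering a dataframe and then adding columns to the result is a common source of pandas' SettingWithCopyWarning. A minimal sketch of the usual remedy, taking an explicit `.copy()` of the filtered rows before assigning, is shown below on made-up events that reuse the dataset's column names.

```python
import pandas as pd

toy_events = pd.DataFrame({
    'Severity': ['Light', 'Severe', 'Heavy'],
    'LocationLat': [40.14, 40.06, 42.65],
    'LocationLng': [-74.46, -74.52, -73.75],
})

# .copy() makes the filtered rows an independent dataframe
toy_extreme = toy_events[toy_events['Severity'].isin(['Severe', 'Heavy'])].copy()

# adding a column now affects only the copy, and no warning is raised
toy_extreme['wcoordinateID'] = list(zip(toy_extreme['LocationLat'].round(1),
                                        toy_extreme['LocationLng'].round(1)))
print(toy_extreme)
```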



# Safety Events Data
The safety events dataset can be loaded with the code below, which also adds 'coordinateID'. 

In [33]:
def load_safetyEvents(wd):
    '''
    not sure if we will use the safety events dataset or not
    it is included here with incident date and coordinate ID so we can join on other datasets

    '''

    # ingest data
    m_safety_events = pd.read_csv(os.path.join(wd, 'Major_Safety_Events.csv'))

    # drop events without an area
    m_safety_events.dropna(subset=['Primary UZA Name'],inplace=True)

    # add city, state to each event
    safety_events_citystate = m_safety_events['Primary UZA Name'].str.split(',', expand=True)
    safety_events_state = safety_events_citystate[1].str.split('-', expand=True)
    safety_events_state[0] = safety_events_state[0].str.strip()
    m_safety_events['City'] = safety_events_citystate[0]
    m_safety_events['State'] = safety_events_state[0]

    # add coordinate ID by merging with cities data
    m_safety_events = m_safety_events.merge(cities_df, how='inner', left_on=['City','State'], right_on=['CITY','STATE_CODE'])
    m_safety_events['coordinateID'] = list(zip(round(m_safety_events['LATITUDE'],1),round(m_safety_events['LONGITUDE'],1)))

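    # add event date in integer format yyyymmdd
    # NOTE: this reads 'Incident Date' as day/month/year; if the file uses month/day/year
    # (as the FRA 'Date' field above does), swap indices 0 and 1 in the calculation below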
    safety_events_date = m_safety_events['Incident Date'].str.split(' ', expand=True)
    safety_events_date = safety_events_date[0].str.split('/', expand=True)
    m_safety_events['event_date'] = (safety_events_date[2].astype(int) * 10000) + (safety_events_date[1].astype(int) * 100) + (safety_events_date[0].astype(int) * 1)

    # filter for relevant events
    inscope_events = ['Non-Rail Collision', 'Main Line Derailment', 'Rail Fire', 'Rail Collision', 'Flood','Ferry Boat Collision','Other High Winds','Tornado','Lightning','Hurricane']
    inscope2_events = ['Non-Rail Collision', 'Main Line Derailment', 'Rail Fire', 'Rail Collision']

    rail_events = m_safety_events[m_safety_events['Event Type'].isin(inscope_events)]
    rail_events2 = m_safety_events[m_safety_events['Event Type'].isin(inscope2_events)]

    print(m_safety_events.shape, rail_events.shape)
    print(f'rail events represent about {round(100*(rail_events.shape[0]/m_safety_events.shape[0]),2)} percent of all events in the original data')

    print(m_safety_events.shape, rail_events2.shape)
    print(f'rail events2 represent about {round(100*(rail_events2.shape[0]/m_safety_events.shape[0]),2)} percent of all events in the original data')

    # based on this data i recommend we use inscope2_events, if anything
    # can add more fields, but we need to be careful managing memory when we merge with other data
    rail_events2 = rail_events2[['coordinateID', 'event_date', 'Incident Number', 'Event Type']]
    return rail_events2

In [34]:
rail_events = load_safetyEvents(wd)

(54007, 97) (43696, 97)
rail events represent about 80.91 percent of all events in the original data
(54007, 97) (43653, 97)
rail events2 represent about 80.83 percent of all events in the original data
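
The city and state used for the merge come from splitting 'Primary UZA Name' on the comma and then taking the first state code before any hyphen. The sketch below isolates that step; `split_uza_name` is a hypothetical helper and the sample names are illustrative values in the "City, ST-ST" style.

```python
import pandas as pd


def split_uza_name(uza_names):
    """Split names like 'Boston, MA-NH-RI' into City and first State code."""
    parts = uza_names.str.split(',', expand=True)
    city = parts[0].str.strip()
    state = parts[1].str.split('-', expand=True)[0].str.strip()
    return pd.DataFrame({'City': city, 'State': state})


sample_uza = pd.Series(['New York-Newark, NY-NJ-CT', 'Boston, MA-NH-RI', 'Denver-Aurora, CO'])
uza_parts = split_uza_name(sample_uza)
print(uza_parts)
```

Multi-city names such as 'New York-Newark' are kept whole by this split, so they will only match the cities table if a city with exactly that name exists there. That may be one reason rows are lost in the inner merge.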



## NOAA Daily Observation Data
dataset overview:  
https://www.ncei.noaa.gov/metadata/geoportal/rest/metadata/item/gov.noaa.ncdc:C00861/html  
main ftp directory:  
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/  
readme:  
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt  
required ftp files:  
1. ghcnd-stations.txt
2. ghcnd-states.txt
3. ghcnd_all.tar.gz - YOU WILL NEED TO UNZIP THIS 


## Highway Rail Grade Crossing Accident Data
dataset overview:  
https://data.transportation.gov/Railroads/Highway-Rail-Grade-Crossing-Accident-Data/7wn6-i5b9  
download link:  
https://data.transportation.gov/api/views/7wn6-i5b9/rows.csv?accessType=DOWNLOAD&bom=true&format=true  
From the overview page, click Export and choose your output type (I chose CSV for this code), or use the download link.


## Weather Events 2016 - 2020
dataset overview:  
https://www.kaggle.com/sobhanmoosavi/us-weather-events  
download link:  
https://www.kaggle.com/sobhanmoosavi/us-weather-events/download


## Major Safety Events
dataset overview:  
https://www.transit.dot.gov/ntd/data-product/safety-security-major-only-time-series-data  
dataset download:  
https://data.transportation.gov/Public-Transit/Major-Safety-Events/9ivb-8ae9


## US Cities
dataset overview:  
https://github.com/kelvins/US-Cities-Database  
dataset download:  
https://github.com/kelvins/US-Cities-Database/blob/main/csv/us_cities.csv