-
Notifications
You must be signed in to change notification settings - Fork 0
Importing the Data into my Database
I used MySQL Workbench's Table Data Import Wizard Tool to load the CSV file. All data cleaning was executed using MySQL Workbench.
Before importing the CSV file into Workbench, I ensured all characters were Unicode compatible. Unsupported characters, especially accented ones, were substituted with ‘?’. This was the only pre-import modification. The script that handles this can be found here.
- I added a new date_time_formatted column to the dataset
ALTER TABLE mod_earthquakes
ADD COLUMN date_time_formatted DATETIME;- I converted the strings to date data types
UPDATE mod_earthquakes
SET date_time_formatted = STR_TO_DATE(date_time, '%d-%m-%Y %H:%i');- I created/extracted day, month, year, and hour attributes
UPDATE mod_earthquakes
SET day = DAY(date_time_formatted),
month = MONTH(date_time_formatted),
year = YEAR(date_time_formatted),
time_hour = HOUR(date_time_formatted);- I added the
event_idprimary key
ALTER TABLE mod_earthquakes
ADD COLUMN event_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;- I removed rows with unspecified locations
DELETE FROM mod_earthquakes
WHERE location = "";- I identified and later removed rows that had duplicate titles
SELECT e1.event_id
FROM mod_earthquakes e1
JOIN (
SELECT title, MAX(event_id) as max_event_id
FROM mod_earthquakes
GROUP BY title
HAVING COUNT(event_id) > 1
) AS e2 ON e1.title = e2.title
WHERE e1.event_id <> e2.max_event_id
ORDER BY e1.title, e1.event_id;Query Description: The above query contains a subquery. The subquery returns a title and the maximum event_id for that title if there were multiple event_ids corresponding to that title, indicating the presence of duplicates. The query joins the results of the subquery with the event_ids of the original table. The event ids returned are those that need to be deleted i.e. those that were present in the dataset, but not returned by the subquery.
- I created City and Country attributes by Parsing the Location
UPDATE mod_earthquakes
SET
City = CASE
WHEN LOCATE(',', location) > 0 THEN TRIM(SUBSTRING(location, 1, LOCATE(',', location) - 1))
ELSE location
END,
Country = CASE
WHEN LOCATE(',', location) > 0 THEN TRIM(SUBSTRING(location, LOCATE(',', location) + 1))
ELSE location
END;Query Description: The original dataset consisted of a Location attribute which provided the location of an earthquake. The format of this attribute varied from value to value. However, a general pattern was present. Normally, the location column followed the pattern of "City, Country". In this case, the City and Country attributes were parsed from the Location attribute by the location of the comma. However, there were cases where only the City/Country name was provided. In this case, both the City and Country attribute took on the value of the Location attribute. Another case I had to consider was if the earthquake occurred in the United States. In this case, the format was City, State. I handled this case using the following SQL command.
UPDATE mod_earthquakes
SET Country = 'US'
WHERE Country IN ('CA', 'Alaska', 'Hawaii', 'Idaho', 'Aleutian Islands, Alaska', 'California', 'Washington');- I used a pre-existing dataset that can be found here
SELECT *
FROM mod_earthquakes AS m
JOIN `countries-continents` AS c
ON m.Country = c.Country;I then proceeded to insert the results of this query into another intermediary table. There were cases where the country to continent map was not present in the external dataset; however, these cases were handled manually with INSERT SQL commands.
- Extracting Distance from Title
UPDATE mod_earthquakes
SET distance_from_location =
CASE
WHEN title REGEXP '[0-9]+ ?km' THEN
TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(title, ' - ', -1), 'km', 1))
ELSE NULL
END;Query Description: If the the title followed the format (as outlined by the REGEX notation), then the command would parse the location information and set that as the value. Otherwise the value would be NULL. Below I have provided an example.
Suppose we are given the following title: M 7.3 - 205 km ESE of Neiafu, Tonga. The title matches the REGEX pattern since 205 km is in the title. The inner SUBSTRING_INDEX command selects everything to the right of the hyphen, that is 205 km ESE of Neiafu, Tonga. The outer SUBSTRING_INDEX command splits the string based on km. The first index is taken to be the value. In this case the first index is 205. The TRIM command is used to remove any leading or training spaces.
- Extracting Direction from Title
UPDATE earthquakes
SET direction_from_location =
CASE
WHEN title REGEXP '[0-9]+ ?km [NSEW]{1,3} of' THEN
TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(title, 'km ', -1), ' of', 1))
ELSE NULL
END;Query Description: The mechanics of this query are similar to extracting the distance, so I will move on to an example. We take the same title: M 7.3 - 205 km ESE of Neiafu, Tonga. As described above, the title matches the REGEX pattern. The inner SUBSTRING_INDEX takes everything to the right of km, which corresponds to ESE of Neiafu, Tonga. The outer SUBSTRING_INDEX takes everything to the left of of which is ESE. This corresponds to the direction.
After cleaning the data with newly extracted attributes, I used INSERT INTO SQL commands to insert data into the decomposed tables: Continent, Country, City, Location, MagnitudeType, Network, EarthquakeDetails. The cleaned dataset can be found here.
INSERT INTO Continent (continent_name)
SELECT DISTINCT continent
FROM mod_earthquakes;INSERT INTO Country (country_name, continent_id)
SELECT DISTINCT m.country, c.continent_id
FROM mod_earthquakes AS m
JOIN Continent c ON m.continent = c.continent_name;INSERT INTO City (city_name, country_id)
SELECT DISTINCT m.city, c.country_id
FROM mod_earthquakes m
JOIN Country c ON m.country = c.country_name;INSERT INTO Location (latitude, longitude, city_id)
SELECT DISTINCT m.latitude, m.longitude, c.city_id
FROM mod_earthquakes AS m
JOIN City c ON m.city = c.city_name;INSERT INTO MagnitudeType (magType)
SELECT DISTINCT magType
FROM mod_earthquakes;INSERT INTO Network (net)
SELECT DISTINCT net
FROM mod_earthquakes;INSERT INTO EarthquakeDetails
(distance_from_location, direction_from_location, time_hour, magnitude, year, month, day,
cdi, mmi, alert, tsunami, sig, nst, dmin, gap, depth, location_id, magType_id, net_id)
SELECT
m.distance_from_location, m.direction_from_location, m.time_hour, m.magnitude, m.year, m.month,
m.day, m.cdi, m.mmi, m.alert, m.tsunami, m.sig, m.nst, m.dmin, m.gap, m.depth,
l.location_id, mt.magType_id, n.net_id
FROM mod_earthquakes as m
JOIN Location AS l ON (m.latitude = l.latitude AND m.longitude = l.longitude)
JOIN MagnitudeType as mt ON m.magType = mt.magType
JOIN Network n ON m.net = n.net;