-
Notifications
You must be signed in to change notification settings - Fork 0
Importing the Data into my Database
I used MySQL Workbench's Table Data Import Wizard Tool to load the CSV file into MySQL Workbench. Thus the import code is not relevant to this project. All data cleaning was executed using MySQL Workbench.
I updated the data in CSV before importing into MySQL Workbench. Before importing the CSV file into Workbench, I ensured all characters were Unicode compatible. Unsupported characters, especially accented ones, were substituted with ‘?’. This was the only pre-import modification. The script that handles this can be found here.
I had to make several data changes after importing the data into MySQL Workbench to address many issues outlined below. A common issue was that many attributes that were in the proposed ERD diagram in Benchmark 2 were not yet extracted in the original dataset. Thus, decomposing the data directly was not possible. Below, I have described the issues and my solutions in more detail.
- I added a new date_time_formatted column to the dataset
ALTER TABLE mod_earthquakes
ADD COLUMN date_time_formatted DATETIME;- I converted the strings to date data types
UPDATE mod_earthquakes
SET date_time_formatted = STR_TO_DATE(date_time, '%d-%m-%Y %H:%i');- I created/extracted day, month, year, and hour attributes
UPDATE mod_earthquakes
SET day = DAY(date_time_formatted),
month = MONTH(date_time_formatted),
year = YEAR(date_time_formatted),
time_hour = HOUR(date_time_formatted);- I added the
event_idprimary key
ALTER TABLE mod_earthquakes
ADD COLUMN event_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;- I removed rows with unspecified locations
DELETE FROM mod_earthquakes
WHERE location = "";- I identified and later removed rows that had duplicate titles
SELECT e1.event_id
FROM mod_earthquakes e1
JOIN (
SELECT title, MAX(event_id) as max_event_id
FROM mod_earthquakes
GROUP BY title
HAVING COUNT(event_id) > 1
) AS e2 ON e1.title = e2.title
WHERE e1.event_id <> e2.max_event_id
ORDER BY e1.title, e1.event_id;Query Description: The above query contains a subquery. The subquery returns a title and the maximum event_id for that title if there were multiple event_ids corresponding to that title, indicating the presence of duplicates. The query joins the results of the subquery with the event_ids of the original table. The event ids returned are those that need to be deleted i.e. those that were present in the dataset, but not returned by the subquery.
- I created City and Country attributes by Parsing the Location
UPDATE mod_earthquakes
SET
City = CASE
WHEN LOCATE(',', location) > 0 THEN TRIM(SUBSTRING(location, 1, LOCATE(',', location) - 1))
ELSE location
END,
Country = CASE
WHEN LOCATE(',', location) > 0 THEN TRIM(SUBSTRING(location, LOCATE(',', location) + 1))
ELSE location
END;Query Description: The original dataset consisted of a Location attribute which provided the location of an earthquake. The format of this attribute varied from value to value. However, a general pattern was present. Normally, the location column followed the pattern of "City, Country". In this case, the City and Country attributes were parsed from the Location attribute by the location of the comma. However, there were cases where only the City/Country name was provided. In this case, both the City and Country attribute took on the value of the Location attribute. Another case I had to consider was if the earthquake occurred in the United States. In this case, the format was City, State. I handled this case using the following SQL command.
UPDATE mod_earthquakes
SET Country = 'US'
WHERE Country IN ('CA', 'Alaska', 'Hawaii', 'Idaho', 'Aleutian Islands, Alaska', 'California', 'Washington');- I used a pre-existing dataset that can be found here
SELECT *
FROM mod_earthquakes AS m
JOIN `countries-continents` AS c
ON m.Country = c.Country;I then proceeded to insert the results of this query into another intermediary table. There were cases where the country to continent map was not present in the external dataset; however, these cases were handled manually with INSERT SQL commands.
- Extracting Distance from Title
UPDATE mod_earthquakes
SET distance_from_location =
CASE
WHEN title REGEXP '[0-9]+ ?km' THEN
TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(title, ' - ', -1), 'km', 1))
ELSE NULL
END;Query Description: If the the title followed the format (as outlined by the REGEX notation), then the command would parse the location information and set that as the value. Otherwise the value would be NULL. Below I have provided an example.
Suppose we are given the following title: M 7.3 - 205 km ESE of Neiafu, Tonga. The title matches the REGEX pattern since 205 km is in the title. The inner SUBSTRING_INDEX command selects everything to the right of the hyphen, that is 205 km ESE of Neiafu, Tonga. The outer SUBSTRING_INDEX command splits the string based on km. The first index is taken to be the value. In this case the first index is 205. The TRIM command is used to remove any leading or training spaces.
- Extracting Direction from Title
UPDATE mod_earthquakes
SET direction_from_location =
CASE
WHEN title REGEXP '[0-9]+ ?km [NSEW]{1,3} of' THEN
TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(title, 'km ', -1), ' of', 1))
ELSE NULL
END;Query Description: The mechanics of this query are similar to extracting the distance, so I will move on to an example. We take the same title: M 7.3 - 205 km ESE of Neiafu, Tonga. As described above, the title matches the REGEX pattern. The inner SUBSTRING_INDEX takes everything to the right of km, which corresponds to ESE of Neiafu, Tonga. The outer SUBSTRING_INDEX takes everything to the left of of which is ESE. This corresponds to the direction.
I decomposed the data after importing into MySQL Workbench. After cleaning the data with newly extracted attributes, I used INSERT INTO SQL commands to insert data into the decomposed tables: Continent, Country, City, Location, MagnitudeType, Network, EarthquakeDetails. The cleaned dataset can be found here.
I inserted all of the distinct continent names from our cleaned dataset. The auto-incrementing takes care of the continent_id.
INSERT INTO Continent (continent_name)
SELECT DISTINCT continent
FROM mod_earthquakes;I join the cleaned table with the Continent table created in the previous step to collect distinct countries and continents. As before, the auto-incrementing takes care of the country_id.
INSERT INTO Country (country_name, continent_id)
SELECT DISTINCT m.country, c.continent_id
FROM mod_earthquakes AS m
JOIN Continent AS c ON m.continent = c.continent_name;This follows the same structure with creating the Country table.
INSERT INTO City (city_name, country_id)
SELECT DISTINCT m.city, c.country_id
FROM mod_earthquakes m
JOIN Country c ON m.country = c.country_name;This follows the same structure with creating the City table.
INSERT INTO Location (latitude, longitude, city_id)
SELECT DISTINCT m.latitude, m.longitude, c.city_id
FROM mod_earthquakes AS m
JOIN City c ON m.city = c.city_name;This follows the same structure with creating the Continent table.
INSERT INTO MagnitudeType (magType)
SELECT DISTINCT magType
FROM mod_earthquakes;This follows the same structure with creating the MagnitudeType table.
INSERT INTO Network (net)
SELECT DISTINCT net
FROM mod_earthquakes;Many of the columns present in EarthquakesDetails comes from the cleaned table itself. There are three columns: location_id, magType_id, and net_id that come from other tables. I joined these tables with the cleaned table based on distinguishing attributes and selected the attributes that were planned for the EarthquakeDetails table.
INSERT INTO EarthquakeDetails
(distance_from_location, direction_from_location, time_hour, magnitude, year, month, day,
cdi, mmi, alert, tsunami, sig, nst, dmin, gap, depth, location_id, magType_id, net_id)
SELECT
m.distance_from_location, m.direction_from_location, m.time_hour, m.magnitude, m.year, m.month,
m.day, m.cdi, m.mmi, m.alert, m.tsunami, m.sig, m.nst, m.dmin, m.gap, m.depth,
l.location_id, mt.magType_id, n.net_id
FROM mod_earthquakes as m
JOIN Location AS l ON (m.latitude = l.latitude AND m.longitude = l.longitude)
JOIN MagnitudeType as mt ON m.magType = mt.magType
JOIN Network n ON m.net = n.net;