Skip to content

Importing the Data into my Database

Amit Das edited this page Aug 14, 2023 · 22 revisions

Import Method Description

I used MySQL Workbench's Table Data Import Wizard Tool to load the CSV file. All data cleaning was executed using MySQL Workbench.

Data Preparation

Pre-import Changes

Before importing the CSV file into Workbench, I ensured all characters were Unicode compatible. Unsupported characters, especially accented ones, were substituted with ‘?’. This was the only pre-import modification. The script that handles this can be found here.

Post-import Changes

Date Reformatting

  • I added a new date_time_formatted column to the dataset
ALTER TABLE mod_earthquakes
ADD COLUMN date_time_formatted DATETIME;
  • I converted the strings to date data types
UPDATE mod_earthquakes
SET date_time_formatted = STR_TO_DATE(date_time, '%d-%m-%Y %H:%i');
  • I created/extracted day, month, year, and hour attributes
UPDATE mod_earthquakes
SET day = DAY(date_time_formatted),
    month = MONTH(date_time_formatted),
    year = YEAR(date_time_formatted),
    time_hour = HOUR(date_time_formatted);

Introducing the Primary Key

  • I added the event_id primary key
ALTER TABLE mod_earthquakes
ADD COLUMN event_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;

Data/Row Cleaning

  • I removed rows with unspecified locations
DELETE FROM mod_earthquakes
WHERE location = "";
  • I identified and later removed rows that had duplicate titles
SELECT e1.event_id
FROM mod_earthquakes e1
JOIN (
    SELECT title, MAX(event_id) as max_event_id
    FROM mod_earthquakes
    GROUP BY title
    HAVING COUNT(event_id) > 1
) AS e2 ON e1.title = e2.title
WHERE e1.event_id <> e2.max_event_id
ORDER BY e1.title, e1.event_id;

Query Description: The above query contains a subquery. The subquery returns a title and the maximum event_id for that title if there were multiple event_ids corresponding to that title, indicating the presence of duplicates. The query joins the results of the subquery with the event_ids of the original table. The event ids returned are those that need to be deleted i.e. those that were present in the dataset, but not returned by the subquery.

City and Country Attribute Extraction

  • I created City and Country attributes by Parsing the Location
UPDATE mod_earthquakes
SET
    City = CASE
               WHEN LOCATE(',', location) > 0 THEN TRIM(SUBSTRING(location, 1, LOCATE(',', location) - 1))
               ELSE location
           END,
    Country = CASE
                  WHEN LOCATE(',', location) > 0 THEN TRIM(SUBSTRING(location, LOCATE(',', location) + 1))
                  ELSE location
              END;

Query Description: The original dataset consisted of a Location attribute which provided the location of an earthquake. The format of this attribute varied from value to value. However, a general pattern was present. Normally, the location column followed the pattern of "City, Country". In this case, the City and Country attributes were parsed from the Location attribute by the location of the comma. However, there were cases where only the City/Country name was provided. In this case, both the City and Country attribute took on the value of the Location attribute. Another case I had to consider was if the earthquake occurred in the United States. In this case, the format was City, State. I handled this case using the following SQL command.

UPDATE mod_earthquakes
SET Country = 'US'
WHERE Country IN ('CA', 'Alaska', 'Hawaii', 'Idaho', 'Aleutian Islands, Alaska', 'California', 'Washington');

Adding Continent Data

  • I used a pre-existing dataset that can be found here
SELECT *
FROM mod_earthquakes AS m
JOIN `countries-continents` AS c
ON m.Country = c.Country;

I then proceeded to insert the results of this query into another intermediary table. There were cases where the country to continent map was not present in the external dataset; however, these cases were handled manually with INSERT SQL commands.

Extracting Distance and Direction Attributes from Title

  • Extracting Distance from Title
UPDATE mod_earthquakes
SET distance_from_location =
    CASE
        WHEN title REGEXP '[0-9]+ ?km' THEN
            TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(title, ' - ', -1), 'km', 1))
        ELSE NULL
    END;

Query Description: If the the title followed the format (as outlined by the REGEX notation), then the command would parse the location information and set that as the value. Otherwise the value would be NULL. Below I have provided an example.
Suppose we are given the following title: M 7.3 - 205 km ESE of Neiafu, Tonga. The title matches the REGEX pattern since 205 km is in the title. The inner SUBSTRING_INDEX command selects everything to the right of the hyphen, that is 205 km ESE of Neiafu, Tonga. The outer SUBSTRING_INDEX command splits the string based on km. The first index is taken to be the value. In this case the first index is 205. The TRIM command is used to remove any leading or training spaces.

  • Extracting Direction from Title
UPDATE earthquakes
SET direction_from_location =
    CASE
        WHEN title REGEXP '[0-9]+ ?km [NSEW]{1,3} of' THEN
            TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(title, 'km ', -1), ' of', 1))
        ELSE NULL
    END;

Query Description: The mechanics of this query are similar to extracting the distance, so I will move on to an example. We take the same title: M 7.3 - 205 km ESE of Neiafu, Tonga. As described above, the title matches the REGEX pattern. The inner SUBSTRING_INDEX takes everything to the right of km, which corresponds to ESE of Neiafu, Tonga. The outer SUBSTRING_INDEX takes everything to the left of of which is ESE. This corresponds to the direction.

Data Decomposition

After cleaning the data with newly extracted attributes, I used INSERT INTO SQL commands to insert data into the decomposed tables: Continent, Country, City, Location, MagnitudeType, Network, EarthquakeDetails. The cleaned dataset can be found here.

Inserting into Continent Table

INSERT INTO Continent (continent_name)
SELECT DISTINCT continent
FROM mod_earthquakes;

Inserting into Country Table

INSERT INTO Country (country_name, continent_id)
SELECT DISTINCT m.country, c.continent_id
FROM mod_earthquakes AS m
JOIN Continent c ON m.continent = c.continent_name;

Inserting into City Table

INSERT INTO City (city_name, country_id)
SELECT DISTINCT m.city, c.country_id
FROM mod_earthquakes m
JOIN Country c ON m.country = c.country_name;

Inserting into Location Table

INSERT INTO Location (latitude, longitude, city_id)
SELECT DISTINCT m.latitude, m.longitude, c.city_id
FROM mod_earthquakes AS m
JOIN City c ON m.city = c.city_name;

Inserting into MagnitudeType Table

INSERT INTO MagnitudeType (magType)
SELECT DISTINCT magType
FROM mod_earthquakes;

Inserting into Network Table

INSERT INTO Network (net)
SELECT DISTINCT net
FROM mod_earthquakes;

Inserting into EarthquakeDetails Table

INSERT INTO EarthquakeDetails
    (distance_from_location, direction_from_location, time_hour, magnitude, year, month, day,
     cdi, mmi, alert, tsunami, sig, nst, dmin, gap, depth, location_id, magType_id, net_id)
SELECT
    m.distance_from_location, m.direction_from_location, m.time_hour, m.magnitude, m.year, m.month,
    m.day, m.cdi, m.mmi, m.alert, m.tsunami, m.sig, m.nst, m.dmin, m.gap, m.depth,
    l.location_id, mt.magType_id, n.net_id
FROM mod_earthquakes as m
JOIN Location AS l ON (m.latitude = l.latitude AND m.longitude = l.longitude)
JOIN MagnitudeType as mt ON m.magType = mt.magType
JOIN Network n ON m.net = n.net;

Clone this wiki locally