<a href="https://colab.research.google.com/github/ctmes/test/blob/main/HANDLING_GEOSPATIAL_DATA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IMPORTING GOOGLE EARTH (KML/KMZ FILES) - EARTHQUAKE DATASET

<img src="https://github.com/Nouhailadr/HANDLING_GEOSPATIAL_DATA/blob/main/xs.png?raw=1" alt="Image Alt Text">

First, let's install geopandas, a Python library for geospatial data manipulation.

In [None]:
%pip install geopandas



Let's import the required dependencies.

In [None]:
import zipfile  # Extracts the KML file(s) packed inside the KMZ archive.
import os  # Manages file paths.
import fiona  # Lists the layers in the KML file and enables the KML driver.
import pandas as pd  # Concatenates the per-layer GeoDataFrames.
import geopandas as gpd  # Reads, processes, and manipulates the KML data.
from bs4 import BeautifulSoup  # Parses the HTML embedded in one column of the KML file.

# Importing Google Earth Files

First of all, we need to 'unzip' the KMZ file.

In [None]:
# Specify the path to our KMZ file
kmz_file_path = "C:/Users/HP/OneDrive/Documents/PROJECTS/Earthquake/hazards.kmz"

# Extract the KML file into the same directory as the KMZ file
extraction_dir = os.path.dirname(kmz_file_path)

# Open the KMZ file and extract its contents
with zipfile.ZipFile(kmz_file_path, "r") as kmz:
    kmz.extractall(extraction_dir)
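
A KMZ file is an ordinary ZIP archive, so it can help to check which KML files it contains before extracting everything. The sketch below uses only the standard library. `list_kml_members` is a hypothetical helper name, and the small archive built in the demo is a stand-in, not the real hazards.kmz.

```python
import os
import tempfile
import zipfile


def list_kml_members(kmz_path):
    """Return the names of all .kml entries inside a KMZ (ZIP) archive."""
    with zipfile.ZipFile(kmz_path, "r") as kmz:
        return [name for name in kmz.namelist() if name.lower().endswith(".kml")]


# Demo on a throwaway archive (a stand-in for the real KMZ file).
demo_dir = tempfile.mkdtemp()
demo_kmz = os.path.join(demo_dir, "demo.kmz")
with zipfile.ZipFile(demo_kmz, "w") as kmz:
    kmz.writestr("doc.kml", "<kml></kml>")
    kmz.writestr("files/layer.kml", "<kml></kml>")
    kmz.writestr("files/icon.png", b"")

kml_members = list_kml_members(demo_kmz)
print(kml_members)
```

Calling the helper on the real archive would show where the KML files end up after `extractall`.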

Fiona disables KML support by default, so we need to enable it first; otherwise, reading the file fails with an unsupported driver error.

In [None]:
fiona.drvsupport.supported_drivers['libkml'] = 'rw' # enable KML support which is disabled by default
fiona.drvsupport.supported_drivers['LIBKML'] = 'rw'

In [None]:
#gdf = gpd.read_file("C:/Users/Earthquake/hazards/files/significantEarthquakes.kml", driver='libkml')
#gdf.head()

Oops, the output is not right: only the column names are displayed, with no data in sight. Maybe the whole dataset isn't being loaded? Let's try another way.

In [None]:
# Specify the directory holding the extracted KML files, then build the full file path
kml_file_path = "C:/Users/HP/OneDrive/Documents/PROJECTS/Earthquake/hazards/files/"
fp_eq = os.path.join(kml_file_path, "significantEarthquakes.kml")

Next, we use the fiona library to list all the layers in the KML file. For each layer, we use geopandas to read and load that layer's data. Finally, we concatenate the individual GeoDataFrames into a single GeoDataFrame called gdf, merging the data from all KML layers into one consolidated dataset, like assembling the pieces of a puzzle into a single picture.

In [None]:
gdf_list = []
for layer in fiona.listlayers(fp_eq):
    gdf = gpd.read_file(fp_eq, driver='LIBKML', layer=layer)
    gdf_list.append(gdf)

gdf = gpd.GeoDataFrame(pd.concat(gdf_list, ignore_index=True))
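
To see why a single read can come back empty, it helps to look at how a KML file is organized. As far as I understand GDAL's LIBKML driver, KML containers such as `Folder` are exposed as separate layers, so the placemarks may live in layers other than the first one. The sketch below counts placemarks per folder with the standard library only. The sample KML is hand-written for illustration and is not taken from the dataset, and `count_placemarks_per_folder` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

KML_URI = "http://www.opengis.net/kml/2.2"
KML_NS = {"kml": KML_URI}

# Tiny hand-written KML document (illustrative only).
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Folder>
      <name>Before 1900</name>
      <Placemark><name>-2150/??/??</name></Placemark>
    </Folder>
    <Folder>
      <name>After 1900</name>
      <Placemark><name>1954/09/09</name></Placemark>
      <Placemark><name>2017/01/18</name></Placemark>
    </Folder>
  </Document>
</kml>
"""


def count_placemarks_per_folder(kml_text):
    """Map each Folder name to the number of Placemark elements inside it."""
    root = ET.fromstring(kml_text)
    counts = {}
    for folder in root.iter("{%s}Folder" % KML_URI):
        name = folder.findtext("kml:name", default="(unnamed)", namespaces=KML_NS)
        counts[name] = len(folder.findall(".//kml:Placemark", KML_NS))
    return counts


folder_counts = count_placemarks_per_folder(SAMPLE_KML)
print(folder_counts)
```

Nested folders would be counted once per enclosing folder here, which is acceptable for a quick inspection but not for exact totals.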

In [None]:
gdf.head(1)

Unnamed: 0,Name,description,timestamp,begin,end,altitudeMode,tessellate,extrude,visibility,drawOrder,icon,geometry,snippet
0,-2150/??/??,\n <table width='300'>\n <tr>\n ...,NaT,NaT,NaT,,-1,0,1,,,POINT Z (35.50000 31.10000 0.00000),"JORDAN: BAB-A-DARAA,AL-KARAK"


Voilà!
What a satisfying moment. (I don't know about you, fellow data scientists, but this was my first time working with this type of file, and it definitely took me some time to understand how it works.)

# Data Processing: A Few Examples

First, let's explore the data.

In [None]:
print("Number of rows & columns in GeoDataFrame:", gdf.shape)

Number of rows & columns in GeoDataFrame: (5884, 13)


In [None]:
gdf.info()  # info() prints its report itself and returns None, so no print() is needed

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 5884 entries, 0 to 5883
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Name          5884 non-null   object        
 1   description   5884 non-null   object        
 2   timestamp     0 non-null      datetime64[ns]
 3   begin         0 non-null      datetime64[ns]
 4   end           0 non-null      datetime64[ns]
 5   altitudeMode  0 non-null      object        
 6   tessellate    5884 non-null   object        
 7   extrude       5884 non-null   object        
 8   visibility    5884 non-null   object        
 9   drawOrder     0 non-null      object        
 10  icon          0 non-null      object        
 11  geometry      5884 non-null   geometry      
 12  snippet       5884 non-null   object        
dtypes: datetime64[ns](3), geometry(1), object(9)
memory usage: 597.7+ KB


In [None]:
print(gdf.isnull().sum())

Name               0
description        0
timestamp       5884
begin           5884
end             5884
altitudeMode    5884
tessellate         0
extrude            0
visibility         0
drawOrder       5884
icon            5884
geometry           0
snippet            0
dtype: int64


In [None]:
gdf['snippet'].value_counts().head()

CHINA:  YUNNAN PROVINCE     68
TURKEY                      49
CHINA:  SICHUAN PROVINCE    44
RUSSIA:  KURIL ISLANDS      39
VANUATU ISLANDS             33
Name: snippet, dtype: int64

In [None]:
gdf = gdf.rename(columns={'Name': 'EQ_Date'})
gdf['EQ_Date'].value_counts().head()

1954/09/09    4
1901/08/09    4
2006/06/03    3
1989/03/10    3
1750/??/??    3
Name: EQ_Date, dtype: int64

In [None]:
gdf['description'].value_counts().head(1)

\n        <table width='300'>\n        <tr>\n        <th>Location of Earthquake Effects:</th><td><nobr>JORDAN:  BAB-A-DARAA,AL-KARAK</nobr></td></tr>\n        <tr><th>Earthquake Magnitude:</th><td>7.3</td></tr>\n        <tr><th>Number of Deaths:</th><td>null</td></tr>\n        <tr><th>Triggered a Tsunami?</th><td>No</td></tr>\n        </table>\n        <hr>\n        <br><a href="https://www.ngdc.noaa.gov/nndc/struts/results?t=101650&s=13&d=229,26,13,12&nd=display&eq_0=1">Get more details from NGDC Natural Hazards Website</a>\n        <br>    1
Name: description, dtype: int64

In [None]:
html_column = gdf['description']

# Initialize empty lists to store extracted data
EQ_location = []
EQ_magnitude = []
nb_deaths = []
tsunami = []

# Loop through the HTML strings in the column
for html_string in html_column:
    # Use BeautifulSoup to parse the HTML
    soup = BeautifulSoup(html_string, 'html.parser')

    # Extract data based on HTML structure and tags
    # Example: Extracting location, magnitude, deaths, and tsunami information
    table = soup.find('table')
    rows = table.find_all('tr')

    location = None
    magnitude = None
    deaths = None
    tsunami_info = None

    for row in rows:
        header = row.find('th').text.strip()
        data = row.find('td').text.strip()

        if header == "Location of Earthquake Effects:":
            location = data
        elif header == "Earthquake Magnitude:":
            magnitude = data
        elif header == "Number of Deaths:":
            deaths = data
        elif header == "Triggered a Tsunami?":
            tsunami_info = data

    # Append the extracted data to the corresponding lists
    EQ_location.append(location)
    EQ_magnitude.append(magnitude)
    nb_deaths.append(deaths)
    tsunami.append(tsunami_info)

# Create new columns in the GeoDataFrame
gdf['EQ_location'] = EQ_location
gdf['EQ_magnitude'] = EQ_magnitude
gdf['Nb_deaths'] = nb_deaths
gdf['Tsunami'] = tsunami
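
Wrapping the parsing step in a function that returns a dictionary makes it easy to try on a single description before running it over all rows. The sketch below swaps BeautifulSoup for the standard library's `html.parser`, purely so the example has no extra dependency. `ThTdParser` and `parse_description` are hypothetical names, and the sample string is a shortened copy of the first description shown above.

```python
from html.parser import HTMLParser


class ThTdParser(HTMLParser):
    """Collect {header text: cell text} pairs from <th>/<td> cells."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._tag = None     # "th" or "td" while inside such a cell
        self._header = None  # most recent header waiting for its value
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # ignore nested tags such as <nobr>
        text = "".join(self._buffer).strip()
        if tag == "th":
            self._header = text
        elif self._header is not None:
            self.pairs[self._header] = text
            self._header = None
        self._tag = None


def parse_description(html_string):
    """Return the header/value pairs found in one description string."""
    parser = ThTdParser()
    parser.feed(html_string)
    parser.close()
    return parser.pairs


sample = (
    "<table width='300'>"
    "<tr><th>Location of Earthquake Effects:</th>"
    "<td><nobr>JORDAN:  BAB-A-DARAA,AL-KARAK</nobr></td></tr>"
    "<tr><th>Earthquake Magnitude:</th><td>7.3</td></tr>"
    "<tr><th>Number of Deaths:</th><td>null</td></tr>"
    "<tr><th>Triggered a Tsunami?</th><td>No</td></tr>"
    "</table>"
)
fields = parse_description(sample)
print(fields)
```

A description without a table simply yields an empty dictionary, whereas the loop above would raise an error when `soup.find('table')` returns `None`.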

In [None]:
gdf['EQ_location']

0           JORDAN:  BAB-A-DARAA,AL-KARAK
1                        TURKMENISTAN:  W
2                          SYRIA:  UGARIT
3       GREECE:  THERA ISLAND (SANTORINI)
4                ISRAEL:  ARIHA (JERICHO)
                      ...                
5879                                CHILE
5880                         FIJI ISLANDS
5881                       INDIA:  AMBASA
5882                         IRAN:  KHONJ
5883                    ITALY:  FARINDOLA
Name: EQ_location, Length: 5884, dtype: object

In [None]:
cols_to_drop = ['description','snippet','visibility','tessellate','extrude','timestamp', 'begin', 'end','altitudeMode','drawOrder', 'icon']
gdf.drop(columns=cols_to_drop, inplace=True)

In [None]:
gdf['Nb_deaths'].value_counts()

null                                4031
Few (~1 to 50 people)                955
Many (~101 to 1000 people)           412
Very Many (~1001 or more people)     311
Some (~51 to 100 people)             175
Name: Nb_deaths, dtype: int64

In [None]:
gdf['Nb_deaths']=gdf['Nb_deaths'].replace('null', 'unknown/None')
gdf['Nb_deaths'].value_counts()

unknown/None                        4031
Few (~1 to 50 people)                955
Many (~101 to 1000 people)           412
Very Many (~1001 or more people)     311
Some (~51 to 100 people)             175
Name: Nb_deaths, dtype: int64
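
The death counts are categories with approximate ranges rather than numbers. If numeric bounds are needed later, for example for sorting or filtering, the labels can be split into their parts. This is a minimal sketch that assumes every category follows one of the patterns shown above. `parse_deaths` is a hypothetical helper, not part of the notebook.

```python
import re

# Matches e.g. "Few (~1 to 50 people)" and "Very Many (~1001 or more people)".
DEATHS_PATTERN = re.compile(
    r"^(?P<label>.+?) \(~(?P<low>\d+) (?:to (?P<high>\d+)|or more) people\)$"
)


def parse_deaths(category):
    """Return (label, low, high) for a category, or None if it has no range.

    high is None for open-ended categories ("or more").
    """
    match = DEATHS_PATTERN.match(category)
    if match is None:
        return None
    high = match.group("high")
    return (match.group("label"), int(match.group("low")), int(high) if high else None)


for category in ["Few (~1 to 50 people)", "Very Many (~1001 or more people)", "unknown/None"]:
    print(category, "->", parse_deaths(category))
```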

In [None]:
gdf.head(1)

Unnamed: 0,EQ_Date,geometry,EQ_location,EQ_magnitude,Nb_deaths,Tsunami
0,-2150/??/??,POINT Z (35.50000 31.10000 0.00000),"JORDAN: BAB-A-DARAA,AL-KARAK",7.3,unknown/None,No


What a clean table, so satisfying! Before plotting the earthquake locations, let's split the location and the date into separate columns.

In [None]:
gdf['Country'] = gdf['EQ_location'].str.split(':').str[0]
gdf['Country'] = gdf['Country'].str.strip()

In [None]:
gdf['Region'] = gdf['EQ_location'].str.split(':').str[1]
gdf['Region'] = gdf['Region'].str.strip()
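
Some locations, such as `CHILE` or `FIJI ISLANDS`, have no colon, so they have a country but no region. The plain-Python sketch below makes that case explicit. `split_location` is a hypothetical helper. Unlike `str.split(':').str[1]`, it keeps everything after the first colon, which is an assumption about how locations containing more than one colon should be handled.

```python
def split_location(location):
    """Split 'COUNTRY:  REGION' into (country, region); region is None if absent."""
    country, separator, region = location.partition(":")
    return country.strip(), (region.strip() if separator else None)


for location in ["JORDAN:  BAB-A-DARAA,AL-KARAK", "CHILE"]:
    print(location, "->", split_location(location))
```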

In [None]:
gdf[['Year', 'Month', 'Day']] = gdf['EQ_Date'].str.split('/', expand=True).astype(str)

gdf.drop(columns=['Day'], inplace=True)

In [None]:
gdf['Month']=gdf['Month'].replace('??', 0)
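
Replacing `'??'` with `0` may leave the `Month` column as a mix of strings and integers, and `0` is not a real month. One alternative is to parse each date into integers and mark unknown parts as `None`. The sketch below shows this in plain Python under the assumption that every date has the `year/month/day` shape seen above. `parse_eq_date` is a hypothetical helper.

```python
def parse_eq_date(eq_date):
    """Split 'YYYY/MM/DD' into (year, month, day); unknown parts become None."""
    parts = eq_date.split("/")
    if len(parts) != 3:
        raise ValueError("expected 'year/month/day', got %r" % eq_date)

    def to_int(part):
        # Unknown components are written as '??' in the dataset.
        return None if "?" in part else int(part)

    return tuple(to_int(part) for part in parts)


for eq_date in ["-2150/??/??", "2017/01/18"]:
    print(eq_date, "->", parse_eq_date(eq_date))
```

Negative years parse correctly because the minus sign stays attached to the year after splitting on `/`.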

In [None]:
gdf.tail(1)

Unnamed: 0,EQ_Date,geometry,EQ_location,EQ_magnitude,Nb_deaths,Tsunami,Country,Region,Year,Month
5883,2017/01/18,POINT Z (13.24100 42.60100 0.00000),ITALY: FARINDOLA,5.7,Few (~1 to 50 people),No,ITALY,FARINDOLA,2017,1


In [None]:
# Export to CSV

csv_file_path = "C:/Users/HP/OneDrive/Documents/PROJECTS/Earthquake/DATA/Earthquakes1.csv"
gdf.to_csv(csv_file_path, index=False)

print("GeoDataFrame has been exported to CSV:", csv_file_path)

GeoDataFrame has been exported to CSV: C:/Users/HP/OneDrive/Documents/PROJECTS/Earthquake/DATA/Earthquakes1.csv
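
A CSV file has no geometry type, so the `geometry` column is written as text. Assuming it is stored as WKT strings like the `POINT Z (...)` values displayed above, the coordinates can be recovered when the file is read back. The sketch below uses only the standard library on a small in-memory sample rather than the exported file. `parse_point_wkt` is a hypothetical helper that handles only points.

```python
import csv
import io
import re

# Matches "POINT (x y)" and "POINT Z (x y z)".
POINT_PATTERN = re.compile(r"^POINT(?: Z)? \((?P<coords>[^)]+)\)$")


def parse_point_wkt(wkt):
    """Return the coordinates of a WKT point as a tuple of floats, or None."""
    match = POINT_PATTERN.match(wkt.strip())
    if match is None:
        return None
    return tuple(float(value) for value in match.group("coords").split())


# In-memory sample with a subset of the exported columns (illustrative only).
sample_csv = io.StringIO(
    "EQ_Date,geometry,EQ_magnitude\n"
    "2017/01/18,POINT Z (13.241 42.601 0),5.7\n"
)
rows = list(csv.DictReader(sample_csv))
coords = parse_point_wkt(rows[0]["geometry"])
print(coords)
```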
