# Wildfire Analysis

In this Jupyter Notebook we present an analysis of Brazilian wildfires.

First, we import the required libraries and define a few constants.

In [1]:
import random
import os
from os import listdir

import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import col, asc, desc

from ipywidgets import interact, widgets

DATA_PATH = './data'
DATA_FILE = './data.zip'
KUDU_MASTER = 'kudu-master-1:7051'
KUDU_TABLE = 'impala::default.queimada'

Importing the `Kudu-Spark` connector from Cloudera.

In [2]:
os.environ['PYSPARK_SUBMIT_ARGS'] = f'--packages org.apache.kudu:kudu-spark3_2.12:1.13.0.7.1.5.17-1 --repositories https://repository.cloudera.com/artifactory/cloudera-repos/ pyspark-shell'
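The submit arguments can also be assembled from named parts, so that the package coordinate and the repository URL each live in one place. The sketch below is one possible way to do that; `submit_args`, `KUDU_SPARK_PACKAGE` and `CLOUDERA_REPO` are hypothetical names introduced here, and the values are the ones from the cell above.

```python
import os

# Hypothetical constant names; the values are copied from the cell above.
KUDU_SPARK_PACKAGE = 'org.apache.kudu:kudu-spark3_2.12:1.13.0.7.1.5.17-1'
CLOUDERA_REPO = 'https://repository.cloudera.com/artifactory/cloudera-repos/'


def submit_args(package, repository):
    """Build the PYSPARK_SUBMIT_ARGS string for one package and one repository."""
    return f'--packages {package} --repositories {repository} pyspark-shell'


args = submit_args(KUDU_SPARK_PACKAGE, CLOUDERA_REPO)

# Must be set before the first SparkSession is created in the kernel.
os.environ['PYSPARK_SUBMIT_ARGS'] = args
```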

Initializing *Spark*.

In [3]:
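# 'spark.jars.packages' is the Spark property for Maven coordinates; it matches the --packages argument set above.
spark = SparkSession.builder.config('spark.jars.packages', 'org.apache.kudu:kudu-spark3_2.12:1.13.0.7.1.5.17-1').getOrCreate()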
sc = SparkContext.getOrCreate()
sc.setLogLevel('OFF')

https://repository.cloudera.com/artifactory/cloudera-repos/ added as a remote repository with the name: repo-1


:: loading settings :: url = jar:file:/usr/local/spark-3.1.2-bin-hadoop3.2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/jovyan/.ivy2/cache
The jars for the packages stored in: /home/jovyan/.ivy2/jars
org.apache.kudu#kudu-spark3_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-49502993-9f6f-4c7d-ac95-d8b4c96a8b95;1.0
	confs: [default]
	found org.apache.kudu#kudu-spark3_2.12;1.13.0.7.1.5.17-1 in repo-1
:: resolution report :: resolve 443ms :: artifacts dl 12ms
	:: modules in use:
	org.apache.kudu#kudu-spark3_2.12;1.13.0.7.1.5.17-1 from repo-1 in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-4950

Importing the Kudu datastore table.

In [4]:
kudu = spark.read.option('kudu.master', KUDU_MASTER).option('kudu.table', KUDU_TABLE).format('kudu').load()
kudu.createOrReplaceTempView('queimada')
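The same two connector options are needed again later when writing to Kudu. A minimal sketch of keeping them in one dictionary follows; `kudu_options` is a hypothetical helper, not part of the connector, and the Spark calls are shown only as comments because they need a live session.

```python
def kudu_options(master, table):
    """Return the connector options shared by Kudu reads and writes."""
    return {'kudu.master': master, 'kudu.table': table}


# Values match KUDU_MASTER and KUDU_TABLE defined at the top of the notebook.
opts = kudu_options('kudu-master-1:7051', 'impala::default.queimada')

# With a running session the dictionary could be passed like this (not executed here):
#   spark.read.format('kudu').options(**opts).load()
#   data.write.format('kudu').options(**opts).mode('append').save()
```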

Importing the data.

In [5]:
!unzip {DATA_FILE} -d {DATA_PATH}

Archive:  ./data.zip
  inflating: ./data/Focos_2010-01-01_2010-12-31.csv  
  inflating: ./data/Focos_2011-01-01_2011-12-31.csv  
  inflating: ./data/Focos_2012-01-01_2012-12-31.csv  
  inflating: ./data/Focos_2013-01-01_2013-12-31.csv  
  inflating: ./data/Focos_2014-01-01_2014-12-31.csv  
  inflating: ./data/Focos_2015-01-01_2015-12-31.csv  
  inflating: ./data/Focos_2016-01-01_2016-12-31.csv  
  inflating: ./data/Focos_2017-01-01_2017-12-31.csv  
  inflating: ./data/Focos_2018-01-01_2018-12-31.csv  
  inflating: ./data/Focos_2019-01-01_2019-12-31.csv  
  inflating: ./data/Focos_2020-01-01_2020-12-31.csv  
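`listdir` returns entries in arbitrary order, so the load order of the yearly files is not guaranteed. The sketch below shows one way to collect the CSV files in sorted order; `csv_files` is a hypothetical helper, and the file names in the demonstration are made up and created in a temporary directory.

```python
import tempfile
from pathlib import Path


def csv_files(data_path):
    """Return the CSV files directly under data_path as sorted path strings."""
    return sorted(str(p) for p in Path(data_path).glob('*.csv'))


# Demonstration on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('Focos_2011.csv', 'Focos_2010.csv', 'notes.txt'):
        Path(tmp, name).touch()
    found = [Path(p).name for p in csv_files(tmp)]

print(found)
```

`spark.read.csv` also accepts a list of paths, so a sorted list like this could be read in a single call instead of one read per file.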


In [6]:
data = None

print(f'loading data files from {DATA_FILE}')

for file in listdir(DATA_PATH):
    if not file.endswith('.csv'):
        continue

    print(f'... {file}')
    tmp = spark.read.csv(f'{DATA_PATH}/{file}', header='true', inferSchema='true')
    # Compare against None explicitly instead of relying on DataFrame truthiness.
    data = tmp if data is None else data.union(tmp)

print('done')

loading data files from ./data.zip
... Focos_2010-01-01_2010-12-31.csv
... Focos_2011-01-01_2011-12-31.csv
... Focos_2012-01-01_2012-12-31.csv
... Focos_2013-01-01_2013-12-31.csv
... Focos_2014-01-01_2014-12-31.csv
... Focos_2015-01-01_2015-12-31.csv
... Focos_2016-01-01_2016-12-31.csv
... Focos_2017-01-01_2017-12-31.csv
... Focos_2018-01-01_2018-12-31.csv
... Focos_2019-01-01_2019-12-31.csv
... Focos_2020-01-01_2020-12-31.csv
done

Printing the original table schema.

In [7]:
data.printSchema()

root
 |-- datahora: string (nullable = true)
 |-- satelite: string (nullable = true)
 |-- pais: string (nullable = true)
 |-- estado: string (nullable = true)
 |-- municipio: string (nullable = true)
 |-- bioma: string (nullable = true)
 |-- diasemchuva: string (nullable = true)
 |-- precipitacao: string (nullable = true)
 |-- riscofogo: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- frp: string (nullable = true)



Casting the numeric columns to decimal types.

In [8]:
data = data.withColumn('diasemchuva', data.diasemchuva.cast(DecimalType(10, 5))) \
    .withColumn('precipitacao', data.precipitacao.cast(DecimalType(10, 5))) \
    .withColumn('riscofogo', data.riscofogo.cast(DecimalType(10, 5))) \
    .withColumn('latitude', data.latitude.cast(DecimalType(8, 5))) \
    .withColumn('longitude', data.longitude.cast(DecimalType(8, 5))) \
    .withColumn('frp', data.frp.cast(DecimalType(10, 5)))
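`DecimalType(precision, scale)` keeps `precision` digits in total, of which `scale` are after the decimal point, so `DecimalType(8, 5)` leaves three integer digits, which is enough for latitudes and longitudes. The sketch below checks this with Python's `decimal` module; `fits_decimal` is a hypothetical helper, and the half-up rounding is an assumption about how the cast rounds, not something taken from this notebook.

```python
from decimal import Decimal, ROUND_HALF_UP


def fits_decimal(value, precision, scale):
    """Return True if value, rounded to `scale` places, has at most `precision` digits."""
    step = Decimal(1).scaleb(-scale)  # e.g. 0.00001 for scale=5
    quantized = Decimal(str(value)).quantize(step, rounding=ROUND_HALF_UP)
    return len(quantized.as_tuple().digits) <= precision


lon_ok = fits_decimal('-36.777', 8, 5)    # 2 integer digits + 5 decimals = 7 digits
too_big = fits_decimal('1234.5', 8, 5)    # 4 integer digits + 5 decimals = 9 digits
```

Values that do not fit are worth checking before the cast, since an overflowing value would not survive it intact.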

Printing the resulting table schema.

In [9]:
data.printSchema()

root
 |-- datahora: string (nullable = true)
 |-- satelite: string (nullable = true)
 |-- pais: string (nullable = true)
 |-- estado: string (nullable = true)
 |-- municipio: string (nullable = true)
 |-- bioma: string (nullable = true)
 |-- diasemchuva: decimal(10,5) (nullable = true)
 |-- precipitacao: decimal(10,5) (nullable = true)
 |-- riscofogo: decimal(10,5) (nullable = true)
 |-- latitude: decimal(8,5) (nullable = true)
 |-- longitude: decimal(8,5) (nullable = true)
 |-- frp: decimal(10,5) (nullable = true)



In [10]:
data.show(n=5, truncate=False)

+-------------------+--------+------+--------------+-------------------+--------------+-----------+------------+---------+---------+---------+----+
|datahora           |satelite|pais  |estado        |municipio          |bioma         |diasemchuva|precipitacao|riscofogo|latitude |longitude|frp |
+-------------------+--------+------+--------------+-------------------+--------------+-----------+------------+---------+---------+---------+----+
|2010/01/01 15:40:00|AQUA_M-T|Brasil|SERGIPE       |JAPOATA            |Mata Atlantica|null       |null        |null     |-10.34700|-36.77700|null|
|2010/01/01 15:41:00|AQUA_M-T|Brasil|PERNAMBUCO    |PESQUEIRA          |Caatinga      |null       |null        |null     |-8.44200 |-36.68300|null|
|2010/01/01 15:41:00|AQUA_M-T|Brasil|SERGIPE       |PORTO DA FOLHA     |Caatinga      |null       |null        |null     |-9.86100 |-37.53200|null|
|2010/01/01 15:41:00|AQUA_M-T|Brasil|PERNAMBUCO    |PESQUEIRA          |Caatinga      |null       |null        |

Interactive demonstration

In [11]:
@interact(x=widgets.IntSlider(min=0, max=30, step=1, value=10, description='yes', continuous_update=False))
def f(x):
    return x

interactive(children=(IntSlider(value=10, continuous_update=False, description='yes', max=30), Output()), _dom…

In [12]:
@interact(x=widgets.IntSlider(min=0, max=30, step=1, value=5, description='yes', continuous_update=False))
def f(x):
    return data.show(n=x, truncate=False)

interactive(children=(IntSlider(value=5, continuous_update=False, description='yes', max=30), Output()), _dom_…

Writing data to *Kudu*.

In [13]:
data.write.option('kudu.master', KUDU_MASTER).option('kudu.table', KUDU_TABLE).mode('append').format('kudu').save()

Retrieving data from *Kudu*.

In [14]:
#spark.sql('SELECT * FROM queimada LIMIT 30').show()
spark.sql('SELECT * FROM queimada LIMIT 30').toPandas()

| | datahora | latitude | longitude | satelite | pais | estado | municipio | bioma | diasemchuva | precipitacao | riscofogo | frp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012/11/05 16:30:00 | -19.628 | -45.868 | AQUA_M-T | Brasil | MINAS GERAIS | ESTRELA DO INDAIA | Cerrado | | | | |
| 1 | 2012/11/05 16:35:00 | -2.801 | -49.299 | AQUA_M-T | Brasil | PARA | MOJU | Amazonia | | | | |
| 2 | 2012/11/05 18:15:00 | 2.937 | -61.108 | AQUA_M-T | Brasil | RORAIMA | ALTO ALEGRE | Amazonia | | | | |
| 3 | 2012/11/05 18:15:00 | 3.399 | -60.363 | AQUA_M-T | Brasil | RORAIMA | BOA VISTA | Amazonia | | | | |
| 4 | 2012/11/06 15:37:00 | -9.94 | -36.469 | AQUA_M-T | Brasil | ALAGOAS | JUNQUEIRO | Mata Atlantica | | | | |
| 5 | 2012/11/06 15:37:00 | -9.873 | -36.489 | AQUA_M-T | Brasil | ALAGOAS | JUNQUEIRO | Mata Atlantica | | | | |
| 6 | 2012/11/06 15:37:00 | -9.865 | -36.485 | AQUA_M-T | Brasil | ALAGOAS | JUNQUEIRO | Mata Atlantica | | | | |
| 7 | 2012/11/06 15:37:00 | -9.86 | -36.491 | AQUA_M-T | Brasil | ALAGOAS | JUNQUEIRO | Mata Atlantica | | | | |
| 8 | 2012/11/06 15:37:00 | -9.851 | -36.487 | AQUA_M-T | Brasil | ALAGOAS | JUNQUEIRO | Mata Atlantica | | | | |
| 9 | 2012/11/06 15:37:00 | -9.846 | -36.492 | AQUA_M-T | Brasil | ALAGOAS | JUNQUEIRO | Mata Atlantica | | | | |


QUERY: Get all biomes

In [15]:
%%time
kudu.select('bioma').distinct().show()

+--------------+
|         bioma|
+--------------+
|Mata Atlantica|
|         Pampa|
|      Pantanal|
|      Amazonia|
|       Cerrado|
|      Caatinga|
+--------------+

CPU times: user 6.76 ms, sys: 3.49 ms, total: 10.3 ms
Wall time: 5.43 s


QUERY: Count the wildfires in the Amazonia biome that occurred between 2019/01/01 and 2019/09/05, grouped by municipality and ordered by count (descending).

In [16]:
%%time
kudu.filter(('2019/01/01' <= kudu.datahora) & (kudu.datahora <= '2019/09/05') & (kudu.bioma == 'Amazonia')) \
    .groupBy('municipio').count() \
    .orderBy(col('count').desc()).show()



+------------------+-----+
|         municipio|count|
+------------------+-----+
|          ALTAMIRA| 3007|
|SAO FELIX DO XINGU| 2560|
|       PORTO VELHO| 2465|
|              APUI| 2032|
|            LABREA| 1822|
|    NOVO PROGRESSO| 1797|
|           COLNIZA| 1558|
|         CARACARAI| 1379|
|          ITAITUBA|  978|
|     NOVO ARIPUANA|  936|
|             FEIJO|  828|
|       NOVA MAMORE|  772|
|          MANICORE|  752|
|           CUJUBIM|  693|
|      BOCA DO ACRE|  688|
|CANDEIAS DO JAMARI|  677|
|          ARIPUANA|  652|
|           MUCAJAI|  611|
|      JACAREACANGA|  585|
|          TARAUACA|  567|
+------------------+-----+
only showing top 20 rows

CPU times: user 9.07 ms, sys: 6.24 ms, total: 15.3 ms
Wall time: 3.27 s
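Because `datahora` is stored as a string, the date filter in this query is a lexicographic comparison. A value such as `'2019/09/05 16:20:00'` sorts after `'2019/09/05'`, so records from the last day fall outside a `<= '2019/09/05'` bound. The sketch below reproduces the filter and the count in plain Python with a half-open range; the sample rows and the `count_by_municipality` helper are hypothetical and not taken from the dataset.

```python
from collections import Counter

# Hypothetical sample rows: (datahora, bioma, municipio).
rows = [
    ('2019/01/01 00:10:00', 'Amazonia', 'ALTAMIRA'),
    ('2019/09/05 16:20:00', 'Amazonia', 'ALTAMIRA'),
    ('2019/09/04 23:59:00', 'Amazonia', 'LABREA'),
    ('2019/03/10 12:00:00', 'Cerrado', 'BRASILIA'),
]


def count_by_municipality(rows, start, end_exclusive, biome='Amazonia'):
    """Count rows per municipality with start <= datahora < end_exclusive."""
    counts = Counter(
        municipio
        for datahora, bioma, municipio in rows
        if start <= datahora < end_exclusive and bioma == biome
    )
    return counts.most_common()  # highest count first


# A string with a time part sorts after the bare date, so this is False.
last_day_included = '2019/09/05 16:20:00' <= '2019/09/05'

# Using the following day as an exclusive bound keeps all of 2019/09/05.
result = count_by_municipality(rows, '2019/01/01', '2019/09/06')
```

With a Spark DataFrame the same idea would be to compare against `'2019/09/06'` with `<`, or to convert `datahora` to a timestamp before filtering.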

In [17]:
#sc.stop()