# hw 5: kernel density estimation (KDE)

```julia
using CSV
using DataFrames
using PyPlot
using Statistics
using Random
using LaTeXStrings # for L"$x$" to work instead of needing to do "\$x\$"
using Printf

# (optional) change settings for all plots at once, e.g. font size
rcParams = PyPlot.PyDict(PyPlot.matplotlib."rcParams")
rcParams["font.size"] = 16

# use PyCall to call in Seaborn
using PyCall
seaborn = pyimport("seaborn")

# note: some have done the following to bring `kdeplot` and `rugplot` into the namespace
#  but I recommend the above. you need to install Seaborn, the Python package.
# using Seaborn
```


## data on forest fires in the northeast region of Portugal

(1) read in `forestfires.csv` as a `DataFrame`. [source](https://archive.ics.uci.edu/ml/datasets/Forest+Fires)

* each row corresponds to the occurrence of a forest fire.
* the `:temp` attribute is the temperature in degrees Celsius when the forest fire occurred
* the `:RH` attribute is the relative humidity in % when the forest fire occurred
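
A minimal sketch of step (1), assuming a recent CSV.jl release in which `CSV.read` takes the sink type `DataFrame` as its second argument (older releases accepted `CSV.read(path)` alone). The helper `load_fires` and the toy file are hypothetical, used only so the snippet runs without the real data; the toy rows are illustrative and keep just the two columns described above.

```julia
using CSV
using DataFrames

# hypothetical helper: read a CSV file at `path` into a DataFrame
load_fires(path::String) = CSV.read(path, DataFrame)

# write a tiny illustrative file with only the `temp` and `RH` columns
toy_path = tempname() * ".csv"
write(toy_path, "temp,RH\n8.2,51\n18.0,33\n14.6,33\n")

df_toy = load_fires(toy_path)
size(df_toy) # (number of rows, number of columns)
```

With the real file in the working directory, `df = load_fires("forestfires.csv")` would give the `DataFrame` used in the rest of the assignment.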

## using Seaborn to do KDE

(2) use Seaborn's [kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) function and [rugplot](https://seaborn.pydata.org/generated/seaborn.rugplot.html) function to, on the same plot, draw a rugplot and 1D KDE of the temperature during forest fires in Portugal.

* shade in the area under the curve (see `shade` in the documentation)
* use the cosine kernel, denoted as `"cos"` in Seaborn (see `kernel` in the documentation). the cosine kernel has finite (compact) support, unlike the Gaussian kernel.
* label the x- and y-axes with appropriate units (note: both have units!)
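
For reference, here is the cosine kernel in its standard textbook form, written for the scaled distance $u = (x - x_i)/λ$ with bandwidth $λ$. The symbols $u$, $x_i$ and $λ$ are notation assumed here, and the library's internal bandwidth scaling may differ.

$$K(u) = \frac{\pi}{4} \cos\left(\frac{\pi}{2} u\right) \text{ for } |u| \leq 1, \qquad K(u) = 0 \text{ otherwise.}$$

This kernel is exactly zero outside $|u| \leq 1$, which is what compact support means, whereas the Gaussian kernel $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$ is positive for every $u$.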

note regarding translation of Seaborn (Python) documentation to Julia: string arguments must use double quotes, e.g. `kernel="gau"` in Julia, as opposed to `kernel='gau'`, which works in Python. in Julia, single quotes delimit a single character (`Char`), so `'gau'` is a syntax error.

(3) draw a scatter plot to visualize the relationship, during forest fires, between:
* x-axis: temperature
* y-axis: relative humidity

label the x- and y-axes along with units.

use green "+" markers.

(4) now draw a bivariate KDE using Seaborn's [kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) function that corresponds to your scatter plot in (3).
* devote the x-axis to temperature
* devote the y-axis to relative humidity
* label the x- and y-axes with units
* pass `shade=true` to shade between the contours
* pass `shade_lowest=false` to avoid shading regions of temperature-relative humidity space where forest fires were very unlikely to occur.
* pass `cmap="Greens"` to change the colormap to use green colors for the shading.

## coding up your own 1D top hat KDE
To intimately understand KDE, let's code up our own KDE with a new kernel we haven't seen before, the top hat function, which looks like a top hat. See what the top hat kernel looks like [here](https://scikit-learn.org/stable/_images/sphx_glr_plot_kde_1d_002.png). The top hat function has finite, compact [support](https://en.wikipedia.org/wiki/Support_(mathematics)) and is (piecewise) flat. The top hat kernel is implemented in scikit-learn's kernel density estimation module if you are interested in checking your code. Let $X$ denote the random variable whose density we seek to estimate (via top hat kernel density estimation).

(5) write a function `K_top_hat(x::Float64, x_i::Float64, λ::Float64)` that returns the value of the top hat kernel density at `x` contributed by data point `x_i`. `λ` is the bandwidth of the top hat kernel, which is half of its width. think carefully about what the height should be...
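
One possible shape for `K_top_hat`, as a minimal sketch. It assumes the kernel is nonzero on the closed interval $[x_i - λ, x_i + λ]$ and that averaging over the data points is left to the caller. Under those assumptions the height $1/(2λ)$ follows from requiring a rectangle of width $2λ$ to have unit area.

```julia
# top hat kernel centered at data point x_i with bandwidth (half-width) λ.
# assumption: boundary points with |x - x_i| == λ count as inside the hat.
function K_top_hat(x::Float64, x_i::Float64, λ::Float64)
    # width 2λ times height 1/(2λ) gives unit area
    return abs(x - x_i) <= λ ? 1.0 / (2.0 * λ) : 0.0
end

K_top_hat(20.0, 21.0, 1.5) # inside the hat: 1 / (2 * 1.5)
K_top_hat(20.0, 25.0, 1.5) # outside the hat: 0.0
```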

(6) write a function `top_hat_kde(x::Float64, x_sample::Array{Float64}, λ::Float64)` that takes in the point `x` at which we seek to estimate the density, the array `x_sample` of samples of $X$ (the data, i.e. the `x_i`'s passed to `K_top_hat`), and the bandwidth `λ` of the top hat kernel used to make the density estimate, then returns the top hat kernel density estimate at `x`.
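
A sketch of how `top_hat_kde` could be assembled, assuming the estimate is the plain average of the per-point kernels, $\hat{\rho}(x) = \frac{1}{n} \sum_{i=1}^{n} K(x, x_i, λ)$, and assuming a top hat of height $1/(2λ)$ on $[x_i - λ, x_i + λ]$. The kernel is repeated here only to keep the snippet self-contained; an empty `x_sample` is not handled.

```julia
# assumed top hat kernel: height 1/(2λ) on [x_i - λ, x_i + λ], zero elsewhere
K_top_hat(x::Float64, x_i::Float64, λ::Float64) = abs(x - x_i) <= λ ? 1.0 / (2.0 * λ) : 0.0

# kernel density estimate at x: average the kernels centered on each data point
function top_hat_kde(x::Float64, x_sample::Array{Float64}, λ::Float64)
    n = length(x_sample)
    return sum(K_top_hat(x, x_i, λ) for x_i in x_sample) / n
end

# two of the three points lie within λ of x: (0.5 + 0.5 + 0.0) / 3
top_hat_kde(1.5, [1.0, 2.0, 4.0], 1.0)
```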

(7) finally, use your function `top_hat_kde` to estimate the density of temperatures during forest fires in Portugal over a dense grid of temperatures ranging from 0.0 to 50.0 °C. Use a bandwidth of 1.5 °C. plot the density as in question (2) and compare. What strikes you as a qualitative difference between the KDE using the top hat kernel vs. the cosine kernel in (2)?

note: in comparing the scale on the y-axis here and in your plot from question (2) generated with Seaborn, you should get a hint of whether or not you chose the height of the top hat kernel correctly. Remember that the total kernel density should integrate to 1.0.
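
The normalization check in the note can also be done numerically. The sketch below evaluates a top hat KDE on a dense grid and approximates its integral with a Riemann sum. It assumes a kernel height of $1/(2λ)$ and uses hypothetical toy temperatures (`temps_toy`) in place of the `:temp` column, so it runs without the data file.

```julia
# assumed top hat kernel and KDE (height 1/(2λ), average over data points)
K_top_hat(x::Float64, x_i::Float64, λ::Float64) = abs(x - x_i) <= λ ? 1.0 / (2.0 * λ) : 0.0
top_hat_kde(x::Float64, x_sample::Array{Float64}, λ::Float64) =
    sum(K_top_hat(x, x_i, λ) for x_i in x_sample) / length(x_sample)

# hypothetical toy temperatures [°C]; replace with the `:temp` column of the data
temps_toy = [18.0, 21.3, 22.1, 25.4, 30.2]
λ = 1.5

# dense grid from 0 to 50 °C with spacing 0.01 °C
xs = range(0.0, stop=50.0, length=5001)
ρ = [top_hat_kde(x, temps_toy, λ) for x in xs]

# Riemann-sum approximation of the integral of the density estimate;
# it should be close to 1 if the kernel height is chosen correctly
area = sum(ρ) * step(xs)
```

If the height were wrong by some factor, `area` would be off from 1 by that same factor, which is the same hint the y-axis scale gives in the plot.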