# 1. Data Exploration

# Imports

In [None]:
%reload_ext autoreload
%autoreload 2

import data_chaser as dc
import numpy as np
import os
import pandas as pd
import plotly.graph_objects as go
from data_chaser.plot.plotly import missing_value_heatmap, missing_data_ratios

# Data loading

First, we define the data directory. I recommend `lost-data-chaser/data` so that you can follow along with the notebook. The datasets we use first are all `.csv` files from:
- [Meteorite landings](https://catalog.data.gov/dataset/meteorite-landings)
- [Near Earth Comets](https://catalog.data.gov/dataset/near-earth-comets-orbital-elements)
- [Fireball and Bolide Reports](https://catalog.data.gov/dataset/fireball-and-bolide-reports)
- [Global Landslide Catalog](https://catalog.data.gov/dataset/global-landslide-catalog)

In [None]:
datadir = os.path.join(os.path.dirname(os.getcwd()), 'data')
fnames = sorted([os.path.join(datadir, fname) for fname in os.listdir(datadir)])
print(fnames)
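Because the cells that follow index into `fnames` by position, any stray non-CSV file in the data directory would shift those indices. A minimal sketch of a stricter listing, assuming the datasets sit directly inside the directory; `list_csv_files` is a hypothetical helper, and the file names in the demonstration are placeholders rather than the real dataset names:

```python
import tempfile
from pathlib import Path


def list_csv_files(datadir):
    """Return the sorted paths of the .csv files directly inside datadir."""
    return sorted(str(p) for p in Path(datadir).glob("*.csv"))


# Demonstration on a temporary directory with placeholder file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.csv", "a.csv", "notes.txt"]:
        (Path(tmp) / name).touch()
    demo_files = [Path(p).name for p in list_csv_files(tmp)]

print(demo_files)
```

Calling `list_csv_files(datadir)` in place of the `os.listdir` comprehension would keep the positional indexing stable as long as only the four datasets are stored as `.csv` files.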

Now that we have the filenames, let's load the data and inspect the first few rows of each dataset to get a feel for its columns.

In [None]:
fire_df = pd.read_csv(fnames[0])
fire_df.head(3)

In [None]:
landslide_df = pd.read_csv(fnames[1])
landslide_df.head(3)

In [None]:
meteor_df = pd.read_csv(fnames[2])  # third file in sorted order; fnames[1] is the landslide catalog
meteor_df.head(3)

In [None]:
comet_df = pd.read_csv(fnames[3])
comet_df.head(3)

# Visualising the `NaN` distributions

## Location of NaNs in each dataset

Before we start implementing a solution, it is important to visualise the distribution of missing values (`NaN`s) in each dataset. This way, we can better understand the sparsity of the data that we're dealing with!

In [None]:
fig = missing_value_heatmap(fire_df, "fire_df")
fig.show()

In [None]:
fig = missing_value_heatmap(meteor_df, "meteor_df")
fig.show()

In [None]:
fig = missing_value_heatmap(landslide_df, "landslide_df")
fig.show()

In [None]:
fig = missing_value_heatmap(comet_df, "comet_df")
fig.show()

From these plots, we can see that there are generally four types of missing-data challenges to consider:
1. Columns with **complete** sparsity (no values present)
2. Columns with **high** sparsity (around 90% of values are missing)
3. Columns with **low/medium** sparsity (50% or more of the values are present)
4. Datasets with **overall** sparsity (few values in most columns). This type isn't present in these datasets, but we can experiment with engineering some.
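The column-level categories in this list can also be checked numerically rather than by eye. The following is a rough sketch using plain pandas, not part of `data_chaser`; the function names, the exact thresholds (1.0, 0.9, 0.5) and the extra "intermediate" and "none" labels are assumptions made for illustration:

```python
import numpy as np
import pandas as pd


def sparsity_category(missing_frac):
    """Map a column's missing fraction to a coarse label (thresholds are assumptions)."""
    if missing_frac == 1.0:
        return "complete"
    if missing_frac >= 0.9:
        return "high"
    if missing_frac > 0.5:
        return "intermediate"  # gap between the listed categories
    if missing_frac > 0.0:
        return "low/medium"
    return "none"


def summarise_sparsity(df):
    """Return one row per column: its missing fraction and its category."""
    missing_frac = df.isna().mean()
    return pd.DataFrame({
        "missing_frac": missing_frac,
        "category": missing_frac.map(sparsity_category),
    })


# Toy frame with 20 rows; the column names are placeholders.
toy = pd.DataFrame({
    "empty": [np.nan] * 20,
    "mostly_missing": [1.0] + [np.nan] * 19,
    "partly_missing": [1.0] * 12 + [np.nan] * 8,
    "full": list(range(20)),
})
summary = summarise_sparsity(toy)
print(summary)
```

The same call, for example `summarise_sparsity(fire_df)`, should give a per-column table to compare against the heatmaps.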

We must also consider dependencies (or the lack thereof) in the data. Some columns may be measuring samples (rows) with some temporal dependence on each other, e.g. a time series from the same signal. Others may be measuring **independent** events.

## Ratio of missing data to present data 

In [None]:
ratio_fig = missing_data_ratios([comet_df, meteor_df, landslide_df, fire_df], ['comet_df', 'meteor_df', 'landslide_df', 'fire_df'])
ratio_fig.show()
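To cross-check the figure, the overall share of missing cells per dataset can be computed directly with pandas. This is a sketch of one plausible definition (missing cells divided by all cells); it is an assumption that `missing_data_ratios` uses a comparable measure, and `missing_fraction` and the toy frames are hypothetical:

```python
import numpy as np
import pandas as pd


def missing_fraction(df):
    """Fraction of all cells in df that are NaN."""
    return float(df.isna().to_numpy().mean())


# Toy frames standing in for the real datasets.
toy_frames = {
    "toy_a": pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, np.nan]}),
    "toy_b": pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, np.nan]}),
}
ratios = {name: missing_fraction(df) for name, df in toy_frames.items()}
print(ratios)
```

Replacing `toy_frames` with a dictionary of `comet_df`, `meteor_df`, `landslide_df` and `fire_df` would give one number per dataset to set against the plot.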