# Statistical analysis

# Completeness of data series and outlier detection

Use Case: Check the completeness of the lake water temperature time series for the Great African Lakes and detect outliers.

User Question: Is the satellite lake water temperature dataset for the Great African Lakes complete in time? Are there any outliers?

Methods:

- Select the Great African Lakes area and extract the mean lake water temperature
- Plot the time series
- Calculate the percentage of missing values
- Draw a boxplot of the values and detect outliers

Short answer:

Considering the mean lake water temperature over the Great African Lakes, the data series is not complete: 44.73 % of the values are missing. The boxplot analysis reveals outliers at both the upper and lower ends of the distribution.

## Import packages

In [None]:
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np
from c3s_eqc_automatic_quality_control import diagnostics, download, plot, utils

plt.style.use("seaborn-v0_8-notebook")

## Set variables

In [None]:
# Time
start = "1997-01"
stop = "1999-12"

# Region
lon_slice = slice(28, 41)
lat_slice = slice(-16, 4)

# Variable
varname = "lake_surface_water_temperature"

## Set the data request

In [None]:
collection_id = "satellite-lake-water-temperature"
request = {
    "version": "4.0",
    "variable": "all",
    "format": "zip",
}

## Define function to extract region and compute spatial weighted mean

In [None]:
def spatial_weighted_mean_of_region(ds, lon_slice, lat_slice, varname):
    ds = ds[[varname]]
    ds = utils.regionalise(ds, lon_slice=lon_slice, lat_slice=lat_slice)
    ds = diagnostics.spatial_weighted_mean(ds)
    return ds
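
To illustrate what an area-weighted spatial mean does, here is a minimal sketch on a synthetic grid. It assumes cosine-of-latitude weights, which is a common choice for regular latitude/longitude grids; the source does not show how `diagnostics.spatial_weighted_mean` is implemented, so the helper `cos_lat_weighted_mean` and the sample values are hypothetical and only serve as an illustration.

```python
import numpy as np


def cos_lat_weighted_mean(field, lat):
    """Area-weighted mean of a (lat, lon) field, ignoring NaN cells.

    Hypothetical helper: assumes cos(latitude) weights on a regular grid.
    """
    # One weight per latitude row, repeated across all longitudes
    weights = np.cos(np.deg2rad(lat))[:, np.newaxis] * np.ones_like(field)
    valid = ~np.isnan(field)
    if not valid.any():
        return np.nan  # no lake pixel observed at this time step
    return np.sum(field[valid] * weights[valid]) / np.sum(weights[valid])


# Synthetic 3x2 grid (made-up values in kelvin); NaN marks land or cloud
lat = np.array([-10.0, 0.0, 4.0])
field = np.array([[298.0, np.nan], [300.0, 300.0], [np.nan, np.nan]])
print(cos_lat_weighted_mean(field, lat))
```

A time step with no valid pixel in the region yields NaN, which is how gaps can end up in the regional mean series analysed below.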

## Download data

In [None]:
chunks = {"year": 1, "month": 1}
requests = download.update_request_date(
    request, start=start, stop=stop, stringify_dates=True
)
ds = download.download_and_transform(
    collection_id,
    requests,
    chunks=chunks,
    transform_func=spatial_weighted_mean_of_region,
    transform_func_kwargs={
        "lon_slice": lon_slice,
        "lat_slice": lat_slice,
        "varname": varname,
    },
)
da = ds[varname]

## Extract lake id to plot a map of the region

In [None]:
# We use one of the requests previously cached
single_request = requests[0]
single_request["month"] = single_request["month"][0]
ds_raw = download.download_and_transform(
    collection_id,
    single_request,
    chunks=chunks,
)

da_lakeid = utils.regionalise(
    ds_raw["lakeid"].isel(time=0), lon_slice=lon_slice, lat_slice=lat_slice
)

## Plot projected map

In [None]:
plot.projected_map(da_lakeid, projection=ccrs.PlateCarree())

## Plot spatial weighted mean

In [None]:
da.plot()
plt.title("Spatial weighted mean")

## Percentage of missing values

In [None]:
pct_missing = np.count_nonzero(np.isnan(da)) / da.size * 100
# Print the result
print(f"Percentage of missing values: {round(pct_missing, 2)} %.")
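
A single percentage does not say whether the gaps are scattered or clustered. The sketch below, run on a small synthetic series rather than the real data, reports both the share of missing values and the longest run of consecutive missing time steps. The helper `missing_summary` and the sample values are hypothetical.

```python
import numpy as np


def missing_summary(values):
    """Return (percentage of NaN values, longest run of consecutive NaNs)."""
    is_nan = np.isnan(np.asarray(values, dtype=float))
    pct_missing = 100 * is_nan.mean()
    longest = current = 0
    for flag in is_nan:
        # Extend the current gap on NaN, reset it on a valid value
        current = current + 1 if flag else 0
        longest = max(longest, current)
    return pct_missing, longest


# Synthetic series (made-up values in kelvin) with four gaps
series = np.array([298.1, np.nan, np.nan, 299.0, np.nan, 298.7, 298.9, np.nan])
pct, gap = missing_summary(series)
print(f"{pct:.1f} % missing, longest gap: {gap} time steps")
```

For the notebook's series, the same function could be applied to `da.values`.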

## Boxplot

In [None]:
# Create a boxplot
plt.boxplot(da.values[~np.isnan(da.values)])

# Add title and labels
# plt.title("Boxplot of array with missing values")
plt.xlabel("Array")
plt.ylabel("Lake surface skin temperature [K]")


# Show the plot
plt.show()

arr1 = da.values[~np.isnan(da.values)]
# finding the 1st quartile
q1 = np.quantile(arr1, 0.25)

# finding the 3rd quartile
q3 = np.quantile(arr1, 0.75)
med = np.median(arr1)


# finding the iqr region
iqr = q3 - q1

# finding upper and lower whiskers
upper_bound = q3 + (1.5 * iqr)
lower_bound = q1 - (1.5 * iqr)
print(
    "The median value is",
    round(med, 2),
    "K, the upper bound (Q3 + 1.5 IQR) is:",
    round(upper_bound, 2),
    "K, the lower bound (Q1 - 1.5 IQR) is:",
    round(lower_bound, 2),
    "K",
)


# Check whether any value falls outside the whisker bounds (outliers)
if min(arr1) < lower_bound or max(arr1) > upper_bound:
    print(
        "The series contains outliers: the minimum value is",
        round(min(arr1), 2),
        "K, the maximum value is",
        round(max(arr1), 2),
        "K.",
    )
else:
    print("The series contains no outliers.")
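
The check above only compares the extremes with the bounds. To see how many outliers there are and which values they are, the same 1.5 IQR rule can be applied as a boolean mask. This is a minimal sketch on synthetic values; the helper `iqr_outliers` is hypothetical and not part of the original workflow.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return the non-NaN values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]  # drop missing values first
    q1, q3 = np.quantile(arr, [0.25, 0.75])
    iqr = q3 - q1
    mask = (arr < q1 - k * iqr) | (arr > q3 + k * iqr)
    return arr[mask]


# Synthetic sample (made-up values in kelvin) with one high and one low extreme
sample = np.array([298.0, 298.5, 299.0, 299.5, 300.0, np.nan, 310.0, 285.0])
outliers = iqr_outliers(sample)
print(f"{outliers.size} outliers: {outliers}")
```

With `k=1.5` the bounds match the default whisker rule of `plt.boxplot`, so the values returned correspond to the points drawn beyond the whiskers.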