# Quality Control for Ecological Data using the Frictionless Framework

### 1. **Introduction to Frictionless**

The **Frictionless framework** is a set of open-source tools for managing and ensuring data quality. It helps with the validation, cleaning, and documentation of datasets. Researchers and IT staff can use it to ensure the ecological data uploaded to EnviDat meets specific quality standards.

* **For Researchers (Self-check)**: Use the graphical interface ([Open Data Editor](https://okfn.org/en/projects/open-data-editor/)) to validate datasets before upload.
* **For IT Staff (Backend Check)**: Automate quality control through Python scripts that validate data on upload.

You can explore the detailed documentation:

* [Frictionless Framework Documentation](https://framework.frictionlessdata.io/index.html)
* [Frictionless RDM Workflows (Colab Example)](https://colab.research.google.com/github/frictionlessdata/frictionless-py/blob/v4/site/docs/tutorials/notebooks/frictionless-RDM-workflows.ipynb#scrollTo=dc538394)

### 2. **Setting Up the Environment**

Install the `frictionless` package if it is not already installed. The `pandas` extra lets Frictionless read DataFrames directly.

```python
!pip install "frictionless[pandas]"
```

First, we will load the sample ecological dataset and explore it.

```python
import pandas as pd
from io import StringIO

# Sample data provided by WSL for ecological measurements
data = """
Site.ID,Biomasstype,Site,Invasion,Treatment,Weight_20by100_cm
1,Litter,PnK,Native,Open,15.515
1,Living,PnK,Native,Open,95.89
2,Litter,PnK,Native,No livestock,39.14
2,Living,PnK,Native,No livestock,177.355
3,Litter,PnK,Native,No mammals,38.95
3,Living,PnK,Native,No mammals,117.16
...
48,Living,David,Invaded,No insects,150.84
"""

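# Convert the data into a pandas DataFrame.
# Note: the "..." line above marks rows omitted from this listing. Replace it
# with the full data before running; otherwise pandas reads it as a data row
# with missing values.
df = pd.read_csv(StringIO(data))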

# Show the first few rows of the dataset
df.head()
```

### 3. **Defining a Schema for Data Validation**

To validate the dataset, we first define the expected structure and rules for each column. A **Frictionless** Table Schema declares each field's type and its constraints, such as required values and allowed categories.

```python
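from frictionless import Schema, validate

# Define the schema for the ecological dataset as a Table Schema descriptor.
# This uses the frictionless v5 API; allowed values and required flags are
# expressed through the standard "enum" and "required" constraints.
schema = Schema.from_descriptor({
    "fields": [
        {"name": "Site.ID", "type": "integer", "constraints": {"required": True}},
        {"name": "Biomasstype", "type": "string",
         "constraints": {"required": True, "enum": ["Living", "Litter"]}},
        {"name": "Site", "type": "string", "constraints": {"required": True}},
        {"name": "Invasion", "type": "string",
         "constraints": {"required": True, "enum": ["Native", "Invaded"]}},
        {"name": "Treatment", "type": "string",
         "constraints": {"required": True,
                         "enum": ["Open", "No livestock", "No mammals", "No insects"]}},
        {"name": "Weight_20by100_cm", "type": "number", "constraints": {"required": True}},
    ]
})

# Validate the DataFrame against the schema
# (reading DataFrames requires the pandas extra: pip install "frictionless[pandas]")
report = validate(df, schema=schema)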

# Display the validation issues (if any)
report.to_dict()
```
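
The same rules can also be kept as a plain Table Schema descriptor in a JSON file, so they can be reused outside this notebook, for example by a backend script. The sketch below is one possible way to do that using only the standard library. The helper `build_schema_descriptor` and the file name `ecological_schema.json` are illustrative assumptions, not part of Frictionless; the `required` and `enum` keys follow the Table Schema constraints vocabulary.

```python
import json
from pathlib import Path


def build_schema_descriptor():
    """Return a Table Schema descriptor (plain dict) for the sample dataset."""

    def field(name, type_, enum=None):
        # Every column in this dataset is treated as required.
        constraints = {"required": True}
        if enum is not None:
            constraints["enum"] = enum
        return {"name": name, "type": type_, "constraints": constraints}

    return {
        "fields": [
            field("Site.ID", "integer"),
            field("Biomasstype", "string", ["Living", "Litter"]),
            field("Site", "string"),
            field("Invasion", "string", ["Native", "Invaded"]),
            field("Treatment", "string",
                  ["Open", "No livestock", "No mammals", "No insects"]),
            field("Weight_20by100_cm", "number"),
        ],
        # Empty cells are read as missing values.
        "missingValues": [""],
    }


# Hypothetical file name, chosen for this example only.
schema_path = Path("ecological_schema.json")
schema_path.write_text(json.dumps(build_schema_descriptor(), indent=2), encoding="utf-8")

# Read the file back to confirm the round trip.
loaded = json.loads(schema_path.read_text(encoding="utf-8"))
print([f["name"] for f in loaded["fields"]])
```

With frictionless v5, a file like this can typically be loaded with `Schema.from_descriptor("ecological_schema.json")`, which keeps the notebook and any automated checks on the same set of rules.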

### 4. **Identifying Common Data Issues**

Next, we will check for two common data issues in ecological datasets:

1. **Missing values** in critical columns like `Weight_20by100_cm`.
2. **Outliers** in numerical data, such as extreme weight values that may be erroneous.

```python
# Check for missing values in the dataset
missing_data = df.isnull().sum()
print(missing_data)  # printed explicitly: a notebook cell only displays its last expression

# Detect outliers: Values exceeding 3 standard deviations from the mean
mean_weight = df['Weight_20by100_cm'].mean()
std_weight = df['Weight_20by100_cm'].std()
outliers = df[(df['Weight_20by100_cm'] > mean_weight + 3 * std_weight) | 
              (df['Weight_20by100_cm'] < mean_weight - 3 * std_weight)]

outliers
```
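
The three-standard-deviation rule works best on large, roughly normal samples; in a small dataset a single extreme value inflates the standard deviation and can hide itself. As a possible alternative, the sketch below flags values outside the interquartile-range (IQR) fences using only the standard library. The function name `iqr_outliers` and the multiplier `k=1.5` are assumptions made for this example, and the second series is hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# The seven weights visible in the sample data above.
weights = [15.515, 95.89, 39.14, 177.355, 38.95, 117.16, 150.84]
print(iqr_outliers(weights))

# Hypothetical series with one implausible reading, for illustration only.
print(iqr_outliers([10, 12, 11, 13, 12, 11, 10, 100]))
```

Because `Living` and `Litter` samples sit in very different weight ranges, it may be more meaningful to apply such a check separately to each `Biomasstype`.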

### 5. **Backend Validation with Frictionless for IT Staff**

IT staff can use a Python script to validate the data programmatically. This approach allows you to automate the validation of any new datasets uploaded to EnviDat.

```python
from frictionless import Package, Resource

# Save the dataset as 'ecological_data.csv'
df.to_csv('ecological_data.csv', index=False)

# Create a package that includes the dataset and schema for validation
# (frictionless v5 API; `schema` is the Schema object defined in section 3)
package = Package(resources=[
    Resource(name='ecological_data', path='ecological_data.csv', schema=schema)
])

# Validate the package; the result is a report object
report = package.validate()

# Check for validation issues (if any)
if report.valid:
    print("Data passed validation")
else:
    print("Data failed validation:")
    for row in report.flatten(["rowNumber", "fieldNumber", "type", "message"]):
        print(row)
```
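
For an automated upload check, a short summary per error type is often easier to act on than a full error listing. The sketch below is a minimal example under stated assumptions: it expects rows shaped like `[rowNumber, fieldNumber, type, message]`, which is the shape frictionless v5 is expected to return from `report.flatten([...])` with those four keys. The helper `summarize_errors` and the example rows are hypothetical and hand-written, not real validation output.

```python
from collections import Counter


def summarize_errors(rows):
    """Count validation errors per error type.

    `rows` is assumed to be a list of [row_number, field_number, error_type,
    message] lists, i.e. a flattened validation report with those four keys.
    """
    counts = Counter(error_type for _, _, error_type, _ in rows)
    lines = [f"{error_type}: {n}" for error_type, n in counts.most_common()]
    return counts, lines


# Hand-written rows for illustration only; they are not real validation output.
example_rows = [
    [4, 6, "type-error", "illustrative message"],
    [9, 2, "constraint-error", "illustrative message"],
    [12, 6, "type-error", "illustrative message"],
]
counts, lines = summarize_errors(example_rows)
print("\n".join(lines))
```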

### 6. **Handling Missing Data and Outliers**

Here's how to handle missing data and outliers before uploading the data to EnviDat. Researchers can manually fill in missing values or use imputation methods.

#### Handling Missing Data:

```python
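# Fill missing values with the column mean (a fixed placeholder such as 0 is another option).
# Assigning the result back avoids chained in-place modification, which newer pandas versions warn about.
df['Weight_20by100_cm'] = df['Weight_20by100_cm'].fillna(df['Weight_20by100_cm'].mean())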

# Verify that missing data has been handled
df.isnull().sum()
```
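
Filling every gap with the overall column mean mixes `Living` and `Litter` samples, whose weights differ considerably in the sample data. One possible refinement, sketched below with a hypothetical miniature dataset, fills each gap with the mean of its own `Biomasstype` group and records which values were imputed. The column name `Weight_imputed` is an assumption introduced for this example.

```python
import pandas as pd

# Hypothetical miniature dataset with one missing weight per biomass type.
demo = pd.DataFrame({
    "Biomasstype": ["Litter", "Litter", "Litter", "Living", "Living", "Living"],
    "Weight_20by100_cm": [10.0, 20.0, None, 100.0, None, 200.0],
})

# Record which values were missing before they are overwritten.
demo["Weight_imputed"] = demo["Weight_20by100_cm"].isna()

# Mean of the observed weights within each biomass type, aligned to the rows.
group_means = demo.groupby("Biomasstype")["Weight_20by100_cm"].transform("mean")

# Fill each gap with the mean of its own group.
demo["Weight_20by100_cm"] = demo["Weight_20by100_cm"].fillna(group_means)
print(demo)
```

Keeping the flag column means imputed values remain distinguishable from measured ones after upload.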

#### Handling Outliers:

```python
# Handle high outliers by capping them at a threshold value (e.g., the 99th percentile)
threshold = df['Weight_20by100_cm'].quantile(0.99)
df['Weight_20by100_cm'] = df['Weight_20by100_cm'].clip(upper=threshold)

# Verify outliers are handled
df['Weight_20by100_cm'].describe()
```

### 7. **Visualizing Data Issues**

Visualizing missing data and outliers can help researchers better understand data quality issues.

#### Visualizing Missing Data:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Heatmap for missing data
sns.heatmap(df.isnull(), cbar=False, cmap='Blues')
plt.title('Missing Data Heatmap')
plt.show()
```

#### Visualizing Outliers:

```python
# Boxplot for detecting outliers in the 'Weight_20by100_cm' column
sns.boxplot(x=df['Weight_20by100_cm'])
plt.title('Outliers in Weight_20by100_cm')
plt.show()
```

### 8. **Conclusion: Quality Assurance and Control for Ecological Datasets**

By using Frictionless, both researchers and IT staff at WSL can:

* **Self-check** their datasets before uploading to EnviDat using the Open Data Editor (GUI).
* **Automate backend validation** to ensure uploaded datasets meet the required quality standards.

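Frictionless validation catches structural problems such as missing required values, data type mismatches, and values outside the allowed categories, while the pandas checks shown above flag outliers. Together they help identify issues early, ensuring the integrity and consistency of ecological data shared in EnviDat.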

For more detailed examples and information:

* [Frictionless Framework Documentation](https://framework.frictionlessdata.io/index.html)
* [Frictionless RDM Workflows (Colab)](https://colab.research.google.com/github/frictionlessdata/frictionless-py/blob/v4/site/docs/tutorials/notebooks/frictionless-RDM-workflows.ipynb#scrollTo=dc538394)

