# Data Engineering Take-Home Assignment: Nature Conservation & Geospatial Data

## Context
Assume you have been hired as a Data Engineer for an organization focused on nature conservation. The organization is working on a project to monitor and protect natural habitats using satellite data, wildlife sensor data, and geospatial information. Your task is to design and implement a data pipeline that ingests, processes, and analyzes this data to help identify areas needing immediate conservation attention, and to build a model that provides helpful insights related to our organization's interests.

## Objective 

Your goal in this assessment is to showcase your curiosity and creativity in designing rigorous models and deriving interesting insights.

You'll be given two tasks.

The first is a design task, in which we expect you to diagram and describe how you'd set up a process to ingest this data from a live-streamed source, assuming you are also paying monitoring services to supply this data from scratch. Think about how you might transform and store the data efficiently for querying and analysis, and how you would feed it into your model.

The second task requires you to devise interesting questions from preliminary explorations of a subset of migration data, found alongside this notebook, and to construct a rigorous model to answer them. Please demonstrate your full process in this notebook and, most importantly, your outputs.




## Tasks

### 1) Design - Data Ingestion & Storage:
- **Ingestion**: Design and implement a solution to ingest data in three different formats: GeoJSON, CSV, and JSON.
- **Automation**: Ensure the pipeline can handle regular data updates (e.g., daily or hourly).
- **Storage**: Choose appropriate storage solutions for each dataset (e.g., relational database, NoSQL, cloud storage, or data lake). Provide justification for your choices.
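One way to make regular updates safe to re-run is to key each load on a unique identifier, so a batch that arrives twice does not create duplicate rows. The sketch below is illustrative only: it uses SQLite from the standard library as a stand-in for whatever store you choose, the function name `load_events` is hypothetical, and the column list follows the `animal_events` CSV described in this notebook.

```python
import csv
import io
import sqlite3


def load_events(conn, csv_text):
    """Insert rows from one CSV batch, skipping event_ids already stored.

    Returns the number of rows that were actually added.
    """
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS animal_events (
            event_id  TEXT PRIMARY KEY,
            animal_id TEXT,
            timestamp TEXT,
            latitude  REAL,
            longitude REAL,
            speed     REAL
        )
        """
    )
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    before = conn.execute("SELECT COUNT(*) FROM animal_events").fetchone()[0]
    # INSERT OR IGNORE drops any row whose primary key is already present,
    # which is what makes re-running the same batch harmless.
    conn.executemany(
        "INSERT OR IGNORE INTO animal_events "
        "VALUES (:event_id, :animal_id, :timestamp, :latitude, :longitude, :speed)",
        rows,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM animal_events").fetchone()[0]
    return after - before
```

Calling `load_events` twice with the same batch adds rows only the first time. In PostgreSQL, the comparable statement is `INSERT ... ON CONFLICT (event_id) DO NOTHING`.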

### 2) Data Transformation & Analysis:
- **Data Parsing & Cleaning**: 
  - Parse and clean the wildlife tracking data (CSV) and geospatial data (GeoJSON) to ensure consistency.
  - Ensure the data is ready for analysis by standardizing formats, removing errors, and handling missing values.

- **Exploratory Data Analysis**:
  - Investigate the data to understand key characteristics, distributions, and trends.

- **Behavioral Analysis**:
  - Identify more complex animal behaviors:
    - Determine when animals cross the boundaries of protected areas.
    - Analyze potential factors contributing to these crossings (e.g., time, weather, or environmental changes).
    - Calculate the total number of animal entries and exits from protected areas over time.
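As a rough illustration of the crossing logic, an entry or exit can be detected by testing each consecutive location fix against an area boundary and watching for a change in the inside/outside state. The sketch below is a dependency-free version under simplifying assumptions: the boundary is a single ring of `(lon, lat)` vertices without holes, coordinates are treated as planar, and the function names are hypothetical. In this notebook you would more likely use a GeoPandas spatial join instead.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test; ring is a list of (lon, lat) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def count_crossings(track, ring):
    """Count (entries, exits) for one animal's time-ordered (lon, lat) fixes."""
    entries = 0
    exits = 0
    was_inside = None  # unknown before the first fix
    for lon, lat in track:
        is_inside = point_in_polygon(lon, lat, ring)
        if was_inside is not None and is_inside != was_inside:
            if is_inside:
                entries += 1
            else:
                exits += 1
        was_inside = is_inside
    return entries, exits
```

The track must be sorted by timestamp per animal before counting. Crossings that happen between two fixes that are both outside (or both inside) the area are not detected by this approach.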

- **Advanced Insights**:
  - Identify migration paths or clustering patterns.
  - Build a predictive model to anticipate future animal movements or identify risk zones for endangered species.
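A common starting point for describing movement paths is the distance between consecutive location fixes, which can later feed clustering or a predictive model as a feature. The sketch below is one possible baseline, not a prescribed method: it uses the haversine great-circle formula with a mean Earth radius of 6371 km, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def step_lengths_km(fixes):
    """Distances between consecutive (lat, lon) fixes of one animal.

    fixes must already be sorted by timestamp.
    """
    return [haversine_km(*a, *b) for a, b in zip(fixes, fixes[1:])]
```

Summing the step lengths gives a path length per animal, and dividing each step by the time between fixes gives a derived speed that can be compared with the recorded `speed` column.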

### 3) Optional Bonus - Visualization/Reporting:
- Provide interactive visualizations to demonstrate your analysis, ideally within this notebook.

### Here are data sources you can use to build your analysis. 

- https://storage.googleapis.com/data-science-assessment/animal_events.csv
- https://storage.googleapis.com/data-science-assessment/animals.csv
- https://storage.googleapis.com/data-science-assessment/protected_areas.json
- https://storage.googleapis.com/data-science-assessment/satellites.json

## Deliverables
#### Design component:
- A clear description and diagrams of the architecture and tools you might use, including any cloud services, databases, or libraries (if applicable). During the discussion, we'll go over different scenarios.

#### Implementation:
- Code for the data pipeline that includes:
  - Data ingestion scripts or setup.
  - Transformation and processing logic.
  - Queries or outputs showcasing the results.
- (Optional) a visualization of the results.

## Data
### 1. **Animal Events - CSV** [Download link](https://storage.googleapis.com/data-science-assessment/animal_events.csv)

- Contains data on animal movement events with details like location and speed.
- **Key Columns**: `event_id`, `animal_id`, `timestamp`, `latitude`, `longitude`, `speed`.

---

### 2. **Animals - CSV** [Download link](https://storage.googleapis.com/data-science-assessment/animals.csv)

- Metadata about tracked animals, including species and conservation status.
- **Key Columns**: `animal_id`, `species`, `endangered`, `animal_type`, `preferred_landcover`.

---

### 3. **Protected Areas - GeoJSON** [Download link](https://storage.googleapis.com/data-science-assessment/protected_areas.json)

- Geospatial data representing protected areas with boundaries and metadata.
- **Key Fields**: `name`, `category`, `protected_area_id`, `geometry`.

---

### 4. **Satellite Metadata - JSON** [Download link](https://storage.googleapis.com/data-science-assessment/satellites.json)

- Metadata from satellite imagery, covering factors like cloud cover and resolution.
- **Key Fields**: `satellite_id`, `start_time`, `last_time`, `frequency`, `bounding_box`, `cloud_cover_percentage`, `resolution`.

---

## Evaluation Criteria

- **Data Engineering Skills**: How well the pipeline handles ingestion, transformation, and storage.
- **Geospatial Data Handling**: Ability to process geospatial data and perform spatial operations (e.g., joins, intersections).
- **Scalability & Efficiency**: The pipeline’s ability to handle larger datasets or more frequent updates.
- **Code Quality**: Structure, readability, and use of best practices.
- **Documentation**: Clear explanations of your approach and any assumptions made.
- **Bonus (Visualization/Reporting)**: Extra points for insightful data visualization or reporting.

## Set up

Feel free to set up this notebook using conda or your own kernel / virtual environment. To make it easier, you can set up the notebook using the provided Docker setup, which includes the libraries you might need.

#### To start using a prepared Docker image
1. Navigate to this shared folder in your terminal, start Docker, then build the image from the Dockerfile (which pulls in the needed libraries) and run it:

```bash
docker build -t geospatial-notebook .
docker run -p 8888:8888 -v $(pwd):/home/jovyan/work geospatial-notebook
```


When the container runs, it will display a URL with a token (something like http://127.0.0.1:8888/?token=..., or possibly http://127.0.0.1:8888/tree). Copy this URL into your browser to open JupyterLab. Your existing notebook will be available inside the container under the work directory.

Any time you want to resume work, run the following command to start the Docker container and access your notebooks:

```bash
docker run -p 8888:8888 -v $(pwd):/home/jovyan/work geospatial-notebook
```


In [66]:
# Libraries you may or may not need
import pandas as pd
import geopandas as gpd
import shapely
import sqlalchemy
import psycopg2
import osgeo.gdal
#extra
import geoalchemy2
import json


### Task 1: Data Ingestion & Storage


In [67]:
# We will be using a postgres database to store our data from multiple sources
# Prerequisites:
# 1. Postgres 16 with the PostGIS extension installed on your machine
# 2. Create a local database called 'assignment' accessible on port '5433' with username 'postgres' and password 'postgres'

# Connect to your database
db_username = 'postgres'  
db_password = 'postgres'
db_host = 'localhost'
db_port = '5433'
db_name = 'assignment'

conn_string = f'postgresql://{db_username}:{db_password}@{db_host}:{db_port}/{db_name}' # Create connection string
engine = sqlalchemy.create_engine(conn_string) # Create database engine using sqlalchemy
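
The credentials above are hard-coded for a local setup. As an optional variation, the same connection string can be assembled from environment variables so that secrets stay out of the notebook. This is a sketch under assumptions: the variable names (`DB_USERNAME`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT`, `DB_NAME`) and the function `build_conn_string` are hypothetical, and the defaults simply mirror the local values used in this notebook.

```python
import os
from urllib.parse import quote_plus


def build_conn_string(env=os.environ):
    """Build a PostgreSQL URL from environment variables (names are assumptions)."""
    user = env.get("DB_USERNAME", "postgres")
    password = env.get("DB_PASSWORD", "postgres")
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5433")
    name = env.get("DB_NAME", "assignment")
    # Percent-encode user and password so characters such as '@' or ':'
    # do not break the URL structure.
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{name}"
    )
```

The returned string can be passed to `sqlalchemy.create_engine` in place of the hard-coded `conn_string`.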

In [70]:
# For each data source, we define a function to ingest the data into the database

# Ingest animal_events.csv
def ingest_animal_events(csv_file):
    animal_events = pd.read_csv(csv_file) # Read CSV data
    animal_events['timestamp'] = pd.to_datetime(animal_events['timestamp']) # ensure proper datetime format
    animal_events.to_sql('animal_events', engine, if_exists='replace', index=False) # Store in PostgreSQL

# Ingest animals.csv
def ingest_animals(csv_file):
    animals = pd.read_csv(csv_file) # Read CSV data
    animals.drop_duplicates(subset='animal_id', inplace=True) # Keep one row per animal_id
    animals.to_sql('animals', engine, if_exists='replace', index=False) # Store in PostgreSQL

# Ingest protected_areas.json
def ingest_protected_areas(geojson_file):
    protected_areas = gpd.read_file(geojson_file) # Read GeoJSON data
    protected_areas = protected_areas.to_crs(epsg=4326) # Ensure consistent Coordinate Reference System (crs)
    protected_areas.to_postgis('protected_areas', engine, if_exists='replace', index=False) # Store in PostGIS

# Ingest satellites.json
def ingest_satellites(json_file):
    with open(json_file) as f: # Read JSON data
        satellites_data = json.load(f)
    satellites_df = pd.json_normalize(satellites_data) # Convert JSON to DataFrame
    satellites_df.to_sql('satellites', engine, if_exists='replace', index=False) # Store in PostgreSQL

In [71]:
# Execute the data ingestion
ingest_animal_events('data/animal_events.csv')
ingest_animals('data/animals.csv')
ingest_protected_areas('data/protected_areas.json')
ingest_satellites('data/satellites.json')

### Task 2: Data Transformation & Analysis


#### 2.1 Parse and Clean the Data


In [None]:
#### animal_events ####
animal_events = pd.read_sql('SELECT * FROM animal_events', engine) # Load data
animal_events['timestamp'] = pd.to_datetime(animal_events['timestamp']) # Convert timestamp to datetime
animal_events.drop_duplicates(subset=['event_id'], inplace=True) # Drop duplicates and handle missing values
animal_events.dropna(subset=['latitude', 'longitude'], inplace=True)
geometry = gpd.points_from_xy(animal_events['longitude'], animal_events['latitude']) # Build point geometries (x = longitude, y = latitude)
animal_events_gdf = gpd.GeoDataFrame(animal_events, geometry=geometry, crs='EPSG:4326') # Convert to a GeoDataFrame

#### protected_areas ####
protected_areas_gdf = gpd.read_postgis('SELECT * FROM protected_areas', engine, geom_col='geometry') # Load data
protected_areas_gdf = protected_areas_gdf.to_crs(epsg=4326) # Ensure correct CRS

#### animals ####
animals_df = pd.read_sql('SELECT * FROM animals', engine) # Load data
animals_df.drop_duplicates(subset=['animal_id'], inplace=True) # Drop duplicate animal records

#### satellites ####