Data Collection

github-actions[bot] edited this page May 30, 2026 · 2 revisions

This page explains how to collect sighting data, run the pipeline, and add new records.


What data we collect

Each record in the dataset represents one confirmed human sighting with:

| Field | Description |
| --- | --- |
| Date | The calendar date of the sighting (local date) |
| Location | Latitude, longitude, and elevation in metres |
| Observed time | The local time at which the sighting occurred |
| UTC offset | The offset from UTC, in hours, at that date and location |

The pipeline converts each record into a solar depression angle by back-calculating the sun's position at the UTC moment of the sighting using PyEphem with atmospheric refraction.
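As a rough illustration of that back-calculation, the sketch below estimates the solar depression angle at a UTC instant using a low-precision almanac-style formula and only the standard library. It is not the pipeline's PyEphem code, and it ignores atmospheric refraction and observer elevation, so expect small differences from the pipeline's output. The function name `solar_depression_deg` is hypothetical.

```python
import math
from datetime import datetime, timezone


def solar_depression_deg(when_utc: datetime, lat_deg: float, lon_deg: float) -> float:
    """Approximate geometric solar depression in degrees (positive = below horizon).

    Low-precision formula, no refraction; longitude is east-positive.
    """
    # Days since the J2000.0 epoch (2000-01-01 12:00 UT).
    j2000 = datetime(2000, 1, 1, 12, tzinfo=timezone.utc)
    n = (when_utc - j2000).total_seconds() / 86400.0

    # Mean longitude and mean anomaly of the sun.
    mean_lon = math.radians((280.460 + 0.9856474 * n) % 360.0)
    mean_anom = math.radians((357.528 + 0.9856003 * n) % 360.0)

    # Ecliptic longitude and obliquity of the ecliptic.
    ecl_lon = mean_lon + math.radians(
        1.915 * math.sin(mean_anom) + 0.020 * math.sin(2.0 * mean_anom)
    )
    obliquity = math.radians(23.439 - 0.0000004 * n)

    # Equatorial coordinates: right ascension and declination.
    ra = math.atan2(math.cos(obliquity) * math.sin(ecl_lon), math.cos(ecl_lon))
    dec = math.asin(math.sin(obliquity) * math.sin(ecl_lon))

    # Greenwich mean sidereal time (hours), then the local hour angle.
    gmst_hours = (18.697374558 + 24.06570982441908 * n) % 24.0
    hour_angle = math.radians(gmst_hours * 15.0 + lon_deg) - ra

    lat = math.radians(lat_deg)
    sin_alt = (
        math.sin(lat) * math.sin(dec)
        + math.cos(lat) * math.cos(dec) * math.cos(hour_angle)
    )
    sin_alt = max(-1.0, min(1.0, sin_alt))  # guard against rounding error
    return -math.degrees(math.asin(sin_alt))


if __name__ == "__main__":
    # Birmingham, 03:00 BST on the 2024 summer solstice (02:00 UTC).
    moment = datetime(2024, 6, 21, 2, 0, tzinfo=timezone.utc)
    print(round(solar_depression_deg(moment, 52.4862, -1.8904), 2))
```

The depression angle is simply the negated altitude, so a sighting made while the sun is above the horizon yields a negative value.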

Not included: calculated prayer times, angle guesses, or aggregate statistics. The dataset contains only records where an actual human reported "I saw true dawn at this time on this date at this location."


Running the pipeline

Prerequisites

# Python 3.10+
python -m venv .venv
source .venv/bin/activate          # on Windows: .venv\Scripts\activate
pip install -r requirements.txt

Full run (recommended)

python -m src.pipeline

This does four things in sequence:

  1. Fetches the OpenFajr iCal feed from calendar.google.com — ~4,018 community-verified Fajr records from Birmingham, UK, 2016-2026. Requires network access.
  2. Loads manually compiled records from src/collect/verified_sightings.py and per-source CSVs in data/raw/raw_sightings/.
  3. Loads pre-computed SQM angles from src/collect/precomputed_angles.py (1,621 Basthoni 2022 records where depression angles were measured directly by instrument).
  4. Looks up missing elevations via the Open-Topo-Data API (with Open-Elevation fallback) for any record where elevation_m == 0.
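The elevation step in item 4 follows a "primary, then fallback" pattern. The sketch below shows that pattern in isolation. It makes no network calls; the lookup functions are passed in as plain callables, and the names `fill_missing_elevations` and `Lookup` are hypothetical rather than taken from the pipeline source.

```python
from typing import Callable, Optional

# A lookup takes (lat, lng) and returns metres, or None if it has no answer.
Lookup = Callable[[float, float], Optional[float]]


def fill_missing_elevations(records: list[dict], lookups: list[Lookup]) -> int:
    """Fill elevation_m in place for records where it is 0.

    Lookups are tried in order; the first one that returns a value wins.
    Returns the number of records that were filled.
    """
    filled = 0
    for rec in records:
        if rec.get("elevation_m", 0) != 0:
            continue  # already has an elevation, leave it alone
        for lookup in lookups:
            try:
                elevation = lookup(rec["lat"], rec["lng"])
            except Exception:
                elevation = None  # treat any failure as "no answer", try the next one
            if elevation is not None:
                rec["elevation_m"] = float(elevation)
                filled += 1
                break
    return filled
```

Passing the lookups as a list keeps the order explicit, so a caller would supply the primary service first and the fallback second.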

Output:

data/processed/fajr_angles.csv   — 48,668 Fajr records
data/processed/isha_angles.csv   — 34,529 Isha records

Without elevation lookup

python -m src.pipeline --no-elevation-lookup

Skips the elevation API calls (Open-Topo-Data and its Open-Elevation fallback). Use this when:

  • You're offline
  • You want faster iteration while adding new records
  • All records in verified_sightings.py already have non-zero elevations

Interpreting the pipeline output

Loading OpenFajr Birmingham iCal feed...
  4018 Fajr records from OpenFajr
Loading manually verified sightings...
  ... genuine manually compiled records (after quality filter)
Loading ingested raw CSV sightings...
  ... records from raw CSVs
Loading pre-computed angle records (SQM instrument data)...
  1621 pre-computed angle records
Computing solar depression angles...
  Dropping N record(s) with implausible angles (< 7.0° Fajr / < 10.0° Isha):
    ...

Fajr dataset: 48668 records → data/processed/fajr_angles.csv
Isha dataset: 34529 records → data/processed/isha_angles.csv

Records dropped with "implausible angles" are usually data-entry or DST-transition artifacts. The quality filter drops any record whose computed angle falls below 7° for Fajr or 10° for Isha, because such a time is too close to sunrise or sunset to be a credible sighting. All dropped records are logged so you can investigate them.
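A minimal sketch of that threshold check, assuming each record is a dict with a `prayer` key and a computed `angle_deg` key. The `angle_deg` name and the function name are hypothetical; the thresholds are the ones shown in the log output above.

```python
# Minimum plausible depression angle per prayer, in degrees.
MIN_ANGLE_DEG = {"fajr": 7.0, "isha": 10.0}


def split_by_plausibility(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (kept, dropped); a record is dropped when its angle is below the threshold."""
    kept, dropped = [], []
    for rec in records:
        threshold = MIN_ANGLE_DEG[rec["prayer"]]
        if rec["angle_deg"] < threshold:
            dropped.append(rec)
        else:
            kept.append(rec)
    return kept, dropped


if __name__ == "__main__":
    sample = [
        {"prayer": "fajr", "angle_deg": 13.2},
        {"prayer": "fajr", "angle_deg": 3.1},   # likely a wrong UTC offset
        {"prayer": "isha", "angle_deg": 15.0},
    ]
    kept, dropped = split_by_plausibility(sample)
    for rec in dropped:
        print("Dropping", rec)
```

Returning the dropped records rather than discarding them mirrors the logging behaviour described above, so each one can be inspected.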


Data sources

Primary: OpenFajr (Birmingham, UK)

The OpenFajr Project runs a continuous community astrophotography program in Birmingham. A panel of scholars reviews daily sky photos and votes on the moment of true dawn. The voted times are published as a public Google Calendar iCal feed.

  • ~4,018 records, 2016-2026
  • Location: 52.4862°N, 1.8904°W, 141 m elevation
  • All times are UTC (Z suffix in iCal)
  • Fetched live by the pipeline — no local cache needed
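Because the feed's timestamps carry the `Z` suffix, they can be read as UTC without any offset arithmetic. The sketch below pulls `DTSTART` values out of iCal text using only the standard library. It is a simplified illustration, not the pipeline's parser: it ignores line folding and skips `DTSTART` lines that carry parameters such as `TZID`. The function name and the sample event are made up.

```python
from datetime import datetime, timezone


def parse_ical_utc_starts(ical_text: str) -> list[datetime]:
    """Return timezone-aware UTC datetimes for every 'DTSTART:<...>Z' line."""
    starts = []
    for line in ical_text.splitlines():
        name, _, value = line.strip().partition(":")
        # Only plain DTSTART properties with a UTC ("Z") date-time value.
        if name == "DTSTART" and value.endswith("Z"):
            parsed = datetime.strptime(value, "%Y%m%dT%H%M%SZ")
            starts.append(parsed.replace(tzinfo=timezone.utc))
    return starts


if __name__ == "__main__":
    # Made-up sample event, for illustration only.
    sample = "\n".join([
        "BEGIN:VEVENT",
        "DTSTART:20240621T021500Z",
        "SUMMARY:Fajr",
        "END:VEVENT",
    ])
    print(parse_ical_utc_starts(sample))
```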

This is the highest-quality source: actual community-reviewed per-date timestamps at a single well-documented location. It provides ~68% of the Fajr training data.

Secondary: Basthoni 2022 SQM network (Indonesia)

1,621 per-night SQM records across 46 Indonesian sites, extracted from Basthoni's 2022 PhD dissertation at UIN Walisongo. Each record is a direct instrument measurement where the Fajr depression angle was determined by linear fitting of SQM time-series data. Loaded by src/collect/precomputed_angles.py.

Tertiary: Manually compiled records

Located in src/collect/verified_sightings.py and per-source CSVs in data/raw/raw_sightings/. These come from:

  • Peer-reviewed academic papers (NRIAG Egypt, Malaysia, Indonesia, Saudi Arabia, Mauritania)
  • Community observation programs (Miftahi/Shaukat UK, Asim Yusuf UK, Moonsighting.com)
  • Institutional SQM data (BRIN Mount Timau, BRIN multistation network)

See Data Sources for the full citation table.


Adding new sighting records

Open src/collect/verified_sightings.py and append to the VERIFIED_SIGHTINGS list:

{
    "prayer": "fajr",              # "fajr" or "isha"
    "date_local": "2024-06-21",    # ISO date, local calendar date
    "time_local": "04:38",         # HH:MM, 24-hour, local time at moment of sighting
    "utc_offset": 1.0,             # hours from UTC (e.g. 1.0 for BST, -5.0 for EST, 5.5 for IST)
    "lat": 51.150,                 # decimal degrees (south = negative)
    "lng": -3.650,                 # decimal degrees (west = negative)
    "elevation_m": 430.0,          # metres above sea level (0 = will be looked up by API)
    "source": "Your citation here",
    "notes": "Any relevant notes about conditions, method, observer count, etc.",
}
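To sanity-check an entry before running the pipeline, it can help to see which UTC instant the `date_local`, `time_local`, and `utc_offset` fields imply. The sketch below does that conversion with the standard library. The helper name `sighting_utc` is hypothetical, and the pipeline's own conversion may differ in detail.

```python
from datetime import datetime, timedelta, timezone


def sighting_utc(record: dict) -> datetime:
    """Convert a record's local date/time plus utc_offset (hours) to an aware UTC datetime."""
    local_zone = timezone(timedelta(hours=record["utc_offset"]))
    local = datetime.strptime(
        f'{record["date_local"]} {record["time_local"]}', "%Y-%m-%d %H:%M"
    ).replace(tzinfo=local_zone)
    return local.astimezone(timezone.utc)


if __name__ == "__main__":
    example = {"date_local": "2024-06-21", "time_local": "04:38", "utc_offset": 1.0}
    print(sighting_utc(example).isoformat())
```

Note that the UTC date can differ from the local date: an early-morning sighting at a large positive offset falls on the previous UTC day.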

UTC offset tips

| Region | UTC offset |
| --- | --- |
| UK (BST, summer) | +1.0 |
| UK (GMT, winter) | 0.0 |
| Egypt / Eastern Europe (EET) | +2.0 |
| Egypt / Eastern Europe (EEST, summer) | +3.0 |
| Saudi Arabia / Arabia Standard | +3.0 |
| Iran (IRST) | +3.5 |
| Iran (IRDT, summer) | +4.5 |
| UAE / Oman (GST) | +4.0 |
| Pakistan (PKT) | +5.0 |
| India / Sri Lanka (IST) | +5.5 |
| Bangladesh (BST) | +6.0 |
| Indonesia West (WIB) | +7.0 |
| Malaysia / Singapore (MYT) | +8.0 |
| Indonesia East (WIT) | +9.0 |
| Australia East (AEST, winter) | +10.0 |
| Australia East (AEDT, summer) | +11.0 |
| New Zealand (NZST) | +12.0 |
| New Zealand (NZDT) | +13.0 |
| US Eastern (EST) | -5.0 |
| US Eastern (EDT) | -4.0 |
| US Central (CST) | -6.0 |
| US Central (CDT) | -5.0 |
| West Africa (WAT) | +1.0 |
| East Africa (EAT) | +3.0 |
| South Africa (SAST) | +2.0 |

Verifying a new record

After adding records, run the pipeline and check the output. A correctly entered record should produce an angle between 8° and 21° for Fajr, or between 11° and 22° for Isha. If the pipeline drops your record (angle below the 7° Fajr or 10° Isha threshold), the recorded time is too close to sunrise or sunset; recheck the UTC offset and local time.

python -m src.pipeline --no-elevation-lookup 2>&1 | grep -A5 "Dropping"

Priority gaps to fill

The Isha dataset is the most critical gap at 46 records. Fajr has excellent Birmingham coverage but needs more geographic diversity:

| Gap | What to look for |
| --- | --- |
| Isha (all regions) | Shafaq al-Abyad disappearance logs with explicit per-date timestamps |
| South America | Any Muslim community observation records with coordinates and times |
| Southeast Asia | Additional Indonesian/Malaysian per-night SQM data files |
| High latitudes (55°N+) | Scandinavian or northern Canadian observation logs |
| Sub-Saharan Africa | Observation records from West Africa, East Africa, Southern Africa |
