Laboratory — Geospatial Raster Analysis & RAG Chatbot
Course: Python Programming — Applied Data Science
Total Score: 20 points
Deadline: Saturday, May 16 — 11:59 PM
Submission: Course Submission Form — Repository & Video Links
Score Summary
| Component | Description | Points |
|---|---|---|
| Task 1 | Geospatial Raster Analysis: Territorial Digital Divide | 8 pts |
| Task 2 | RAG Chatbot: Beca 18 Regulations | 8 pts |
| Video | Explanatory demonstration video (both tasks) | 4 pts |
| Total | | 20 pts |
Task 1 — Territorial Digital Divide: Geospatial Raster Analysis
Score: 8 points
Description
Design a geospatial analysis pipeline that measures the territorial digital divide in the Cusco region, Peru, by cross-referencing two satellite-derived raster datasets: NASA nighttime lights (VNL 2025) as a proxy for urbanization, and mobile network coverage density (OSIPTEL 2019, 50 m kernel) as a proxy for internet access. The analysis must produce four thematic maps and a classification layer, exposing inequality patterns between connected urban zones and digitally excluded rural areas.
Input Data
Download the input raster files from the following Google Drive folder and place them inside your local data/ folder before running the notebook. Do not commit these files to your repository.
Drive folder (all files): https://drive.google.com/drive/folders/16oP-IEX8EWklvuigtt-cTWJfDdD4E0ec?usp=drive_link
| File | Direct link | Description | Native CRS |
|---|---|---|---|
| VNL_cusco_2025.tif | Available in Drive folder above | NASA Black Marble nighttime radiance (nW·cm⁻²·sr⁻¹) | EPSG:4326 |
| kernel_cobmovil2019_50m.tif | Download link | Mobile coverage kernel density | EPSG:32719 |
Note: If kernel_cobmovil2019_50m.tif fails to open, download Cobmovil_raster_opcional.zip from the Drive folder as an alternative.
Repository Structure
Create a repository named exactly raster-digital-divide with the following layout:
raster-digital-divide/
│
├── data/
│   ├── VNL_cusco_2025.tif
│   └── kernel_cobmovil2019_50m.tif
│
├── notebooks/
│   └── digital_divide_cusco.ipynb        # Main analysis notebook
│
├── output/
│   ├── vnl_norm.tif                      # Normalized nighttime lights
│   ├── conn_norm.tif                     # Normalized connectivity
│   ├── ibd_brecha_digital.tif            # Digital Divide Index raster
│   ├── clasificacion_brecha.tif          # 4-category classification raster
│   └── dashboard_brecha_digital.png      # Final composite figure
│
├── README.md
└── requirements.txt
Pipeline Requirements
Step 0 — Environment Setup
Install and import the required libraries. Print library versions to confirm the environment is reproducible.
rasterio · numpy · matplotlib · scipy · seaborn · pandas
Step 1 — Raster Loading and Inspection
For each raster, print: CRS, shape (height × width), band count, NoData value, data type, bounding box, and pixel resolution in native CRS units (degrees for EPSG:4326, meters for EPSG:32719) together with its approximate size in kilometers. Report the number of valid (non-NoData) pixels and the value range.
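The degree-to-kilometer conversion requested above can be sketched with plain arithmetic. This is a minimal approximation, assuming a spherical Earth with about 111.32 km per degree of latitude. The function name `pixel_size_km` and the 15 arc-second pixel are illustrative assumptions, not values taken from the rasters.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude (assumption)


def pixel_size_km(res_x_deg: float, res_y_deg: float, latitude_deg: float) -> tuple[float, float]:
    """Approximate pixel width and height in km for a grid expressed in degrees."""
    # Longitude degrees shrink with the cosine of the latitude.
    width_km = abs(res_x_deg) * KM_PER_DEGREE * math.cos(math.radians(latitude_deg))
    height_km = abs(res_y_deg) * KM_PER_DEGREE
    return width_km, height_km


# Hypothetical 15 arc-second pixel at roughly the latitude of Cusco (13.5° S).
w_km, h_km = pixel_size_km(15 / 3600, 15 / 3600, -13.5)
print(f"pixel is about {w_km:.3f} km wide and {h_km:.3f} km tall")
```

In the notebook, the resolution and latitude would come from the raster's transform and bounding box rather than from hard-coded values.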
Step 2 — Reprojection and Grid Alignment
Reproject the connectivity raster from EPSG:32719 to EPSG:4326 using bilinear resampling. Resample the reprojected connectivity raster to the exact grid of the VNL raster (same transform, same shape). Verify that both arrays share identical dimensions before proceeding.
Step 3 — Robust Normalization
Apply percentile-based normalization [2nd–98th percentile] to both layers to produce values in [0, 1]. Replace negative values and NoData sentinels with 0 before clipping. Print the post-normalization min, max, mean, and standard deviation for each layer.
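A minimal sketch of the normalization described above, using only `numpy`. It assumes the percentiles are computed over valid pixels only, so that a large share of NoData does not collapse the range. That choice and the function name `robust_normalize` are illustrative assumptions.

```python
import numpy as np


def robust_normalize(arr, nodata=None, p_low=2, p_high=98):
    """Scale an array to [0, 1] between its p_low and p_high percentiles."""
    data = np.asarray(arr, dtype="float64").copy()
    invalid = ~np.isfinite(data) | (data < 0)
    if nodata is not None:
        invalid |= data == nodata
    valid = data[~invalid]
    data[invalid] = 0.0  # negatives and NoData sentinels become 0 before clipping
    if valid.size == 0:
        return np.zeros_like(data)
    lo, hi = np.percentile(valid, [p_low, p_high])
    if hi <= lo:
        return np.zeros_like(data)
    return np.clip((data - lo) / (hi - lo), 0.0, 1.0)


raw = np.array([[-5.0, 0.0, 10.0], [50.0, 100.0, -9999.0]])
norm = robust_normalize(raw, nodata=-9999.0)
print(norm.min(), norm.max(), norm.mean(), norm.std())
```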
Step 4 — Map 1: VNL Nighttime Lights
Display the raw and normalized VNL rasters side by side. Use the inferno colormap with geographic extent on both axes. Add a colorbar and axis labels. Print an observation explaining what the bright zones indicate.
Step 5 — Map 2: Digital Divide Index (IBD) and Exclusion Index (EDT)
Compute and display two indices:
- IBD (Digital Divide Index): VNL_norm - Connectivity_norm, range [−1, 1]. Use the RdYlGn_r colormap. Red tones indicate zones with more light than connectivity (active digital divide). Green tones indicate zones that are relatively well connected.
- EDT (Total Digital Exclusion): (1 - VNL_norm) × (1 - Connectivity_norm), range [0, 1]. Use the Purples colormap. High values identify zones with neither light nor connectivity (maximum exclusion).
Show both maps in a single figure with individual colorbars. Add a brief printed interpretation for each index.
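The two index formulas translate directly into array arithmetic. The sketch below assumes both inputs are already normalized to [0, 1] and aligned to the same grid. The function name is hypothetical.

```python
import numpy as np


def digital_divide_indices(vnl_norm, conn_norm):
    """Return (IBD, EDT) for two aligned layers normalized to [0, 1]."""
    vnl = np.asarray(vnl_norm, dtype="float64")
    conn = np.asarray(conn_norm, dtype="float64")
    ibd = vnl - conn                      # range [-1, 1]
    edt = (1.0 - vnl) * (1.0 - conn)      # range [0, 1]
    return ibd, edt


# Four corner cases: lit/unconnected, dark/connected, lit/connected, dark/unconnected.
vnl_demo = np.array([1.0, 0.0, 1.0, 0.0])
conn_demo = np.array([0.0, 1.0, 1.0, 0.0])
ibd_demo, edt_demo = digital_divide_indices(vnl_demo, conn_demo)
print(ibd_demo, edt_demo)
```

The corner cases show why the indices complement each other: IBD is zero both where everything is present and where everything is absent, while EDT separates those two situations.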
Step 6 — Map 3: Intervention Priority
Classify the territory into three priority levels based on the presence of nighttime light (population proxy) combined with low connectivity:
| Priority | Condition | Rationale |
|---|---|---|
| Critical (P3) | VNL ≥ 0.30 and Connectivity < 0.10 | High population density, no internet |
| High (P2) | VNL ≥ 0.15 and Connectivity < 0.15 | Urban fringe, incomplete coverage |
| Medium (P1) | VNL ≥ 0.10 and Connectivity < 0.25 | Peri-urban, partial coverage |
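The conditions are nested (every P3 pixel also satisfies P2 and P1), so assign each pixel the highest level it satisfies. Display the map with a discrete colormap and a legend. Print the pixel count and percentage of total area for each priority level.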
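A compact way to apply the three conditions is to write the levels from lowest to highest, so that a later assignment overwrites an earlier one. This is a sketch under that assumption. The value 0 for "no priority" and the function name are illustrative choices.

```python
import numpy as np


def classify_priority(vnl_norm, conn_norm):
    """Return 0 (none), 1 (Medium), 2 (High) or 3 (Critical) per pixel."""
    vnl = np.asarray(vnl_norm)
    conn = np.asarray(conn_norm)
    priority = np.zeros(vnl.shape, dtype="uint8")
    # Lowest level first; stricter conditions overwrite looser ones.
    priority[(vnl >= 0.10) & (conn < 0.25)] = 1
    priority[(vnl >= 0.15) & (conn < 0.15)] = 2
    priority[(vnl >= 0.30) & (conn < 0.10)] = 3
    return priority


vnl_px = np.array([0.50, 0.20, 0.12, 0.05, 0.50])
conn_px = np.array([0.05, 0.12, 0.20, 0.00, 0.90])
levels = classify_priority(vnl_px, conn_px)

counts = np.bincount(levels.ravel(), minlength=4)
for level, n in enumerate(counts):
    print(f"P{level}: {n} px ({100 * n / levels.size:.1f} %)")
```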
Step 7 — Map 4: Social Exclusion Risk
Compute the Social Exclusion Risk score:
Risk = EDT × (1 - VNL_norm), then normalize to [0, 1]
Apply a Gaussian spatial filter (sigma=5) to reveal regional patterns. Display the raw and smoothed risk maps side by side using the hot_r colormap. Report the 75th and 90th percentile risk thresholds.
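The risk score and its smoothed version can be sketched as follows, assuming `scipy` is available and that "normalize to [0, 1]" means min-max scaling. The text does not specify the scaling method, so that is an assumption, and the random inputs only stand in for the real layers.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def exclusion_risk(vnl_norm, conn_norm, sigma=5):
    """Return (raw risk, Gaussian-smoothed risk), with raw risk min-max scaled to [0, 1]."""
    vnl = np.asarray(vnl_norm, dtype="float64")
    conn = np.asarray(conn_norm, dtype="float64")
    edt = (1.0 - vnl) * (1.0 - conn)
    risk = edt * (1.0 - vnl)
    span = risk.max() - risk.min()
    risk = (risk - risk.min()) / span if span > 0 else np.zeros_like(risk)
    return risk, gaussian_filter(risk, sigma=sigma)


rng = np.random.default_rng(1)
risk_raw, risk_smooth = exclusion_risk(rng.random((60, 60)), rng.random((60, 60)))
p75, p90 = np.percentile(risk_raw, [75, 90])
print(f"p75={p75:.3f}  p90={p90:.3f}")
```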
Step 8 — Territorial Classification
Apply a 2 × 2 classification using thresholds VNL ≥ 0.15 and Connectivity ≥ 0.15:
| Class | Name | Color |
|---|---|---|
| 1 | Urban Connected | Green |
| 2 | Urban Divide | Red |
| 3 | Rural Connected | Blue |
| 4 | Critical Divide | Purple |
Display the classification map with a legend. Build a pandas DataFrame reporting: class name, pixel count, percentage of total area, and approximate area in km². Save the classification raster to output/clasificacion_brecha.tif.
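The sketch below shows one reading of the 2 × 2 scheme. The mapping from threshold combinations to class numbers is inferred from the class names (for example, "Urban Divide" as lit but unconnected) and is an assumption, as are the function names and the per-pixel area of 0.2 km².

```python
import numpy as np
import pandas as pd

CLASS_NAMES = {1: "Urban Connected", 2: "Urban Divide", 3: "Rural Connected", 4: "Critical Divide"}


def classify_territory(vnl_norm, conn_norm, t_vnl=0.15, t_conn=0.15):
    """Assign classes 1-4 from the two thresholds (mapping inferred from class names)."""
    lit = np.asarray(vnl_norm) >= t_vnl
    connected = np.asarray(conn_norm) >= t_conn
    classes = np.empty(lit.shape, dtype="uint8")
    classes[lit & connected] = 1
    classes[lit & ~connected] = 2
    classes[~lit & connected] = 3
    classes[~lit & ~connected] = 4
    return classes


def class_table(classes, pixel_area_km2):
    """Summarize pixel count, share of area and approximate km² per class."""
    counts = np.bincount(classes.ravel(), minlength=5)[1:5]
    return pd.DataFrame({
        "class": list(CLASS_NAMES),
        "name": list(CLASS_NAMES.values()),
        "pixels": counts,
        "percent": 100 * counts / counts.sum(),
        "area_km2": counts * pixel_area_km2,
    })


vnl_grid = np.array([[0.5, 0.5], [0.05, 0.05]])
conn_grid = np.array([[0.5, 0.05], [0.5, 0.05]])
classes = classify_territory(vnl_grid, conn_grid)
table = class_table(classes, pixel_area_km2=0.2)  # hypothetical pixel area
print(table)
```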
Step 9 — Statistical Summary
Compute the following using scipy and seaborn:
- Descriptive statistics (mean, std, min, max) for VNL, connectivity, and IBD, grouped by class.
- Pearson correlation between VNL and connectivity (subsample every 40th pixel).
- KDE distribution plots for VNL and connectivity, one curve per class, in a single figure.
- Welch's t-test comparing VNL values between Class 1 (Urban Connected) and Class 4 (Critical Divide). Report t-statistic, p-value, and Cohen's d.
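The correlation and the Welch test listed above map onto `scipy.stats` calls, sketched here on synthetic data. Cohen's d has several definitions. This sketch uses the root mean square of the two sample standard deviations, which is one common choice and an assumption here. The synthetic arrays only stand in for the real per-class pixel values.

```python
import numpy as np
from scipy import stats


def cohens_d(a, b):
    """Cohen's d using the root mean square of the two sample standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return (a.mean() - b.mean()) / pooled_sd


rng = np.random.default_rng(42)
vnl_layer = rng.random((200, 200))
conn_layer = 0.5 * vnl_layer + 0.5 * rng.random((200, 200))

# Pearson correlation on every 40th pixel of the flattened layers.
r, p_corr = stats.pearsonr(vnl_layer.ravel()[::40], conn_layer.ravel()[::40])

# Welch's t-test (unequal variances) between two synthetic class samples.
class1_vnl = rng.normal(0.60, 0.10, 500)
class4_vnl = rng.normal(0.05, 0.02, 500)
t_stat, p_val = stats.ttest_ind(class1_vnl, class4_vnl, equal_var=False)
d = cohens_d(class1_vnl, class4_vnl)

print(f"r={r:.3f}  t={t_stat:.2f}  p={p_val:.3g}  d={d:.2f}")
```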
Step 10 — Export Deliverables
Export the four processed rasters (normalized VNL, normalized connectivity, IBD, classification) to the output/ folder as GeoTIFF files aligned to the VNL grid. Save the final composite dashboard figure as dashboard_brecha_digital.png at 150 dpi.
GitHub Workflow (Mandatory)
- Do not commit directly to main.
- Create a working branch (e.g., feature/raster-analysis).
- Make at least five commits with descriptive messages covering distinct stages of the pipeline.
- Merge into main via a Pull Request.
Penalty: Committing directly to main deducts 1.5 points automatically.
README.md
Must include:
- Project description and research question.
- Dependencies and installation instructions (pip install -r requirements.txt).
- How to run the notebook end-to-end.
- Description of each output file.
- Brief interpretation of the main findings (2–3 sentences).
Grading Rubric — Task 1 (0–8 pts)
| Criteria | Points |
|---|---|
| Raster loading, reprojection, and grid alignment (Steps 1–2) | 1.5 |
| Normalization pipeline with correct NoData handling (Step 3) | 0.5 |
| Map 1 (VNL Nighttime Lights): raw and normalized side by side | 1.0 |
| Map 2 (IBD and EDT): both indices displayed and interpreted | 1.5 |
| Map 3 (Intervention Priority): 3 levels classified and quantified | 1.5 |
| Map 4 (Social Exclusion Risk): raw and smoothed maps displayed | 1.0 |
| Territorial Classification: map, table, and exported raster | 0.5 |
| Statistical summary: correlation, KDE plots, Welch t-test | 0.5 |
| README and Pull Request | 0.5 (shared with branch workflow) |
| TOTAL | 8 |
Note: Code cells without executed output will receive 0 points for that criterion. All maps must include colorbars, axis labels, and a title.
Task 2 — Beca 18 RAG Chatbot: Document Retrieval and Grounded Generation
Score: 8 points
Description
Build an end-to-end Retrieval-Augmented Generation (RAG) pipeline that answers user questions about the official Beca 18 regulations (PRONABEC) by retrieving relevant fragments from the source PDF and passing them as context to a large language model. The system must never rely on the model's parametric knowledge and must decline to answer when the information is not present in the document.
Source document: Resolución Directoral Ejecutiva N.° 033-2026-MINEDU/VMGI-PRONABEC
https://www.gob.pe/institucion/pronabec/normas-legales/7778068-033-2026-minedu-vmgi-pronabec
Pipeline overview:
PDF → text extraction → chunking → embeddings → ChromaDB
→ user question → query embedding → top-k retrieval
→ LLM with context → grounded answer + cited sources
Repository Structure
Create a repository named exactly beca18-rag-chatbot with the following layout:
beca18-rag-chatbot/
│
├── data/
│   └── beca18_reglamento.pdf         # Source regulation document
│
├── notebooks/
│   └── beca18_rag_chatbot.ipynb      # Main notebook
│
├── .env.example                      # Template: GEMINI_API_KEY=your_key_here
├── .gitignore                        # Excludes .env and chroma_db_*/
├── requirements.txt
└── README.md
Pipeline Requirements
Step 0 — Setup
Install dependencies and load the Gemini API key from a .env file using python-dotenv. Never hardcode the key in the notebook. Confirm the environment by printing the loaded package versions.
You need a free API key from Google AI Studio: https://aistudio.google.com/app/apikey
Step 1 — PDF Text Extraction
Extract text page by page using pypdf. Insert a [PAGE N] marker at the start of each page. Apply light cleaning: collapse multiple spaces, remove isolated line breaks, and strip headers/footers if present. Print total character and word counts.
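The marker insertion and light cleaning can be sketched with the standard library. The `extract_pages` helper assumes `pypdf` is installed and is not called in the demo. The regular expressions are one reasonable interpretation of "light cleaning", and header/footer stripping is left out because it depends on the actual PDF layout.

```python
import re


def extract_pages(pdf_path):
    """Read raw text page by page (requires pypdf; not executed in this demo)."""
    from pypdf import PdfReader
    return [page.extract_text() or "" for page in PdfReader(pdf_path).pages]


def light_clean(text):
    """Join isolated line breaks, collapse runs of spaces, keep paragraph breaks."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)   # single newline becomes a space
    text = re.sub(r"[ \t]+", " ", text)            # collapse repeated spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)         # at most one blank line
    return text.strip()


def build_document(pages):
    """Prefix each cleaned page with a [PAGE N] marker and join the pages."""
    parts = [f"[PAGE {n}]\n\n{light_clean(p)}" for n, p in enumerate(pages, start=1)]
    return "\n\n".join(parts)


sample_pages = ["Artículo 1.  Objeto\nde la beca.", "Segunda   página."]
document = build_document(sample_pages)
print(len(document), "characters,", len(document.split()), "words")
```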
Step 2 — Tokenization and Chunking Justification
Count total tokens using tiktoken with the cl100k_base encoding. Print the total token count and explain in a Markdown cell why a chunk size of 400 tokens with 60-token overlap is appropriate given the 8,192-token embedding limit.
Chunk the cleaned text using LangChain RecursiveCharacterTextSplitter with:
chunk_size = 400
chunk_overlap = 60
separators = ["\n\n", "\n", ". ", " "]
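Note that RecursiveCharacterTextSplitter measures chunk_size and chunk_overlap in characters by default. To size chunks in tokens, as the justification above assumes, construct the splitter with RecursiveCharacterTextSplitter.from_tiktoken_encoder(encoding_name="cl100k_base", ...) and the same parameters.
Attach the following metadata to each chunk: {document, topic, language}. Print total chunk count and the average chunk length in characters.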
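For the written justification, a back-of-the-envelope estimate of how many chunks to expect can help. The sketch below assumes an idealized sliding window measured in tokens. The real splitter breaks on separators, so actual counts will differ somewhat. The 10,000-token document is a made-up figure.

```python
import math


def estimate_chunk_count(total_tokens, chunk_size=400, chunk_overlap=60):
    """Idealized number of sliding-window chunks for a document of total_tokens."""
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # new tokens contributed by each extra chunk
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


EMBEDDING_LIMIT = 8192  # limit quoted in the task statement

n_chunks = estimate_chunk_count(10_000)          # hypothetical document size
share_of_limit = 400 / EMBEDDING_LIMIT           # fraction of the limit used by one chunk
context_tokens = 5 * 400                         # tokens retrieved with k = 5

print(n_chunks, f"{share_of_limit:.1%}", context_tokens)
```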
Step 3 — Embeddings
Implement two embedding functions using gemini-embedding-001 (768 dimensions):
embed_documents(texts) — uses task type RETRIEVAL_DOCUMENT for indexing.
embed_query(text) — uses task type RETRIEVAL_QUERY for search.
Add exponential backoff and retry logic to handle the free-tier rate limit (approximately 60 requests per minute).
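The retry logic is independent of the embedding API, so it can be sketched and tested with a stub. In the sketch below, `with_backoff` is a hypothetical helper and `flaky_embed` simulates a rate-limited call. Catching every `Exception` is a simplification. In the notebook it would be narrowed to the rate-limit error raised by the client library.

```python
import random
import time


def with_backoff(fn, *args, max_retries=5, base_delay=1.0, max_delay=60.0,
                 sleep=time.sleep, **kwargs):
    """Call fn, retrying with exponentially growing delays plus a little jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, 0.1 * delay))


# Stub that fails twice before succeeding, standing in for the real embedding call.
calls = {"n": 0}


def flaky_embed(texts):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 rate limited (simulated)")
    return [[0.0] * 768 for _ in texts]


delays = []
vectors = with_backoff(flaky_embed, ["hola"], sleep=delays.append)
print(calls["n"], [round(x, 2) for x in delays])
```

Passing `sleep=delays.append` records the waits instead of sleeping, which keeps the demo instant. In real use the default `time.sleep` applies.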
Step 4 — Vector Database
Create a persistent ChromaDB collection using cosine distance:
chromadb.PersistentClient(path="chroma_db_beca18")
Implement idempotent indexing: check whether the collection is already populated before embedding. If it contains documents, skip the embedding step and load the existing collection. Print the total number of stored documents after indexing.
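The idempotency check reduces to "count first, embed only if empty". The sketch below is written against any object exposing `count()` and `add(...)`, which a ChromaDB collection does. `FakeCollection` and `index_if_empty` are hypothetical names used so the example runs without a database. How cosine distance is requested depends on the pinned chromadb version, so check its documentation.

```python
def index_if_empty(collection, chunks, metadatas, embed_fn):
    """Embed and store chunks only when the collection holds no documents yet."""
    existing = collection.count()
    if existing > 0:
        return existing  # already indexed: skip the (rate-limited) embedding step
    collection.add(
        ids=[f"chunk-{i:05d}" for i in range(len(chunks))],
        documents=chunks,
        metadatas=metadatas,
        embeddings=embed_fn(chunks),
    )
    return collection.count()


class FakeCollection:
    """In-memory stand-in for a ChromaDB collection (illustration only)."""

    def __init__(self):
        self._docs = []

    def count(self):
        return len(self._docs)

    def add(self, ids, documents, metadatas, embeddings):
        self._docs.extend(documents)


embed_calls = {"n": 0}


def fake_embed(texts):
    embed_calls["n"] += 1
    return [[0.0] * 768 for _ in texts]


col = FakeCollection()
chunks = ["[PAGE 1] Requisitos ...", "[PAGE 2] Modalidades ..."]
metas = [{"document": "beca18", "topic": "regulation", "language": "es"}] * 2

first = index_if_empty(col, chunks, metas, fake_embed)
second = index_if_empty(col, chunks, metas, fake_embed)  # second run does nothing
print(first, second, embed_calls["n"])
```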
Step 5 — Semantic Search
Implement a semantic_search(question: str, k: int = 5) function that:
- Embeds the question using embed_query.
- Queries the ChromaDB collection for the top-k nearest chunks.
- Returns a list of dictionaries containing: text, metadata, and distance.
Test the function with one sample question and print the top-3 results with their distances.
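A ChromaDB query returns parallel lists nested one level per query, so the main work in `semantic_search` is reshaping that result. The sketch below shows only the reshaping step on a hand-written result dictionary. `format_hits` is a hypothetical helper and the sample texts and distances are invented.

```python
def format_hits(result):
    """Turn a single-query ChromaDB result into a list of {text, metadata, distance}."""
    docs = result["documents"][0]
    metas = result["metadatas"][0]
    dists = result["distances"][0]
    return [
        {"text": doc, "metadata": meta, "distance": float(dist)}
        for doc, meta, dist in zip(docs, metas, dists)
    ]


# Hand-written stand-in for the dictionary returned by collection.query(...).
sample_result = {
    "documents": [["[PAGE 4] Requisitos de postulación ...", "[PAGE 9] Subvención mensual ..."]],
    "metadatas": [[{"document": "beca18", "language": "es"}, {"document": "beca18", "language": "es"}]],
    "distances": [[0.18, 0.27]],
}

hits = format_hits(sample_result)
for rank, hit in enumerate(hits[:3], start=1):
    print(rank, f"{hit['distance']:.2f}", hit["text"][:40])
```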
Step 6 — Grounded Generation
Implement answer_with_context(question: str, k: int = 5) using gemini-2.5-flash. The system prompt must:
- Instruct the model to answer exclusively from the retrieved context.
- Require the model to cite the page number when available.
- Instruct the model to respond with "The document does not contain information about this topic." when the context is insufficient.
Test with at least five on-topic questions covering: eligibility requirements, scholarship modalities, monthly stipend amount, student obligations, and conditions for losing the scholarship. Test with one off-topic question to confirm the model refuses to hallucinate.
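The prompt assembly can be prototyped without calling the model. In the sketch below the wording of `SYSTEM_PROMPT` is only one possible phrasing of the three rules, and `build_messages` is a hypothetical helper. It relies on the [PAGE N] markers surviving inside the retrieved chunks so the model has page numbers to cite.

```python
REFUSAL = "The document does not contain information about this topic."

SYSTEM_PROMPT = (
    "You answer questions about the Beca 18 regulations.\n"
    "Rules:\n"
    "1. Use only the CONTEXT provided; never use outside knowledge.\n"
    "2. Cite the page number, taken from the [PAGE N] markers, when available.\n"
    f'3. If the context is insufficient, reply exactly: "{REFUSAL}"'
)


def build_messages(question, hits):
    """Return the system prompt and a user message containing the retrieved context."""
    context = "\n\n".join(
        f"[Fragment {i}] {hit['text']}" for i, hit in enumerate(hits, start=1)
    )
    user = f"CONTEXT:\n{context}\n\nQUESTION: {question}"
    return {"system": SYSTEM_PROMPT, "user": user}


demo_hits = [
    {"text": "[PAGE 4] Requisitos de postulación ...", "metadata": {}, "distance": 0.18},
    {"text": "[PAGE 9] Subvención mensual ...", "metadata": {}, "distance": 0.27},
]
messages = build_messages("¿Cuáles son los requisitos para postular?", demo_hits)
print(messages["user"])
```

Keeping the refusal sentence in a constant makes the off-topic test easy to check: the notebook can compare the model's reply against `REFUSAL` directly.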
Step 7 — Interactive Chat Interface
Build a chat interface using ipywidgets with:
- A text input box for the question.
- "Ask" and "Clear" buttons.
- An integer slider to control the value of k (retrieved chunks).
- An output area that displays the answer and an expandable accordion showing the source fragments with their page numbers and distances.
Technical Requirements
- Python 3.10+
- Models: gemini-embedding-001 (768 dim), gemini-2.5-flash
- Must run end-to-end in Google Colab without manual intervention.
- All API keys must be loaded from .env and never committed to the repository.
Required packages (requirements.txt must include pinned versions):
pypdf
tiktoken
langchain-text-splitters
google-genai
chromadb
ipywidgets
tqdm
python-dotenv
GitHub Workflow (Mandatory)
- Do not commit directly to main.
- Create a branch for this task (e.g., feature/rag-pipeline).
- Make progressive commits with descriptive messages at each pipeline stage.
- Merge into main via a Pull Request.
Penalty: Committing directly to main deducts 1.5 points automatically.
README.md
Must include:
- Purpose of the project and source document description.
- Pipeline summary (one paragraph).
- Installation and setup instructions (API key configuration, dependency installation).
- How to run the notebook.
- How to use the chat interface.
Grading Rubric — Task 2 (0–8 pts)
| Criteria | Points |
|---|---|
| PDF extraction with page markers and light cleaning (Step 1) | 1.5 |
| Token count with tiktoken and chunking justification (Step 2) | 0.5 |
| RecursiveCharacterTextSplitter correctly configured with metadata (Step 2) | 0.5 |
| Embeddings: both task types implemented with rate-limit handling (Step 3) | 1.5 |
| ChromaDB: persistent, cosine distance, idempotent indexing (Step 4) | 1.0 |
| semantic_search returns text, metadata, and distance (Step 5) | 1.0 |
| answer_with_context: strict system prompt, 5 on-topic + 1 off-topic test (Step 6) | 1.5 |
| ipywidgets chat UI: input, buttons, k slider, expandable sources (Step 7) | 0.5 |
| README and Pull Request | 0.5 (shared with branch workflow) |
| TOTAL | 8 |
Note: Committing the API key to the repository deducts 2 points regardless of other criteria.
Explanatory Video
Score: 4 points
Requirements
Record a single video covering both tasks. The video must demonstrate that the code works correctly and that you understand the pipeline end-to-end.
| Requirement | Detail |
|---|---|
| Duration | 5 minutes maximum |
| Language | Spanish or English |
| Content: Task 1 | Brief walkthrough of the raster pipeline: reprojection, normalization, and at least two of the four maps |
| Content: Task 2 | Brief walkthrough of the RAG pipeline: chunking, embedding, retrieval, and a live query answered by the chatbot |
| Live execution | At least one notebook must be shown running live (cells executing with visible output) |
| Upload | Upload to YouTube (unlisted) or Google Drive and paste the link in video/link.txt inside each repository |
Submission of Video Link
Create a file video/link.txt in both repositories with the following format:
Video URL: https://youtu.be/your_link_here
Grading Rubric — Video (0–4 pts)
| Criteria | Points |
|---|---|
| Covers Task 1 pipeline with visible map outputs | 1.5 |
| Covers Task 2 pipeline with a live chatbot query | 1.5 |
| Clear explanation showing understanding of the code | 0.5 |
| Duration ≤ 5 minutes and link accessible in video/link.txt | 0.5 |
| TOTAL | 4 |
Global Score Summary
| Component | Description | Points |
|---|---|---|
| Task 1 | Geospatial Raster Analysis: Territorial Digital Divide | 8 pts |
| Task 2 | RAG Chatbot: Beca 18 Regulations | 8 pts |
| Video | Explanatory demonstration video (both tasks) | 4 pts |
| Total | | 20 pts |
Submission Instructions
- Paste the URL of each repository and your video link in the submission form:
Course Submission Form — Repository & Video Links
- Ensure all code cells are executed with visible output before submission.
- Verify that neither repository contains API keys, .env files, or raster data exceeding 100 MB; use .gitignore to exclude them.
- The video link must also be present in video/link.txt inside each repository.
Deadline: Saturday, May 16 — 11:59 PM
Questions and clarifications: Discord course channel.