# Notebook to scrape RPQS from SISPEA and/or Google

This notebook scrapes the RPQS of the collectivities listed in a CSV file such as `data/scraping/collectivity_list_for_scraping_example.csv`:
- **Scraping from SISPEA works well**, but scanned versions, deliberation reports, etc. still need to be filtered out.
- **Scraping from Google does not work yet** because it does not filter out PDFs that are not RPQS. It does, however, cover the first step. See the comments below.

### Import modules

In [None]:
from pathlib import Path
import sys
sys.path.append("../")    # Add the path to the root directory (where we can find the folder narval/)

%load_ext autoreload
%autoreload 2 

from narval.pdfscraper import PDFScraperFromSispea, PDFScraperFromGoogle
from narval.utils import get_data_dir

In [13]:
data_dir = get_data_dir()

## Scraping from SISPEA website

We scrape RPQS directly from the SISPEA website (https://services.eaufrance.fr/) using the template link https://www.services.eaufrance.fr/sispea//referential/download-rpqs.action?collectivityId=206841&rpqsId=512773 (only the `collectivity_id` and `rpqs_id` need to be known). Note that the fetched PDFs are not necessarily RPQS: they may also be RAD or deliberation reports.
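
As a minimal sketch of how such a link can be assembled, the helper below fills the two query parameters of the template. The name `build_rpqs_url` is hypothetical and is not part of `narval`; the base URL is copied as-is from the template link above.

```python
from urllib.parse import urlencode

# Base of the template link quoted above (copied as-is, including the double slash).
SISPEA_DOWNLOAD_URL = "https://www.services.eaufrance.fr/sispea//referential/download-rpqs.action"


def build_rpqs_url(collectivity_id: int, rpqs_id: int) -> str:
    """Hypothetical helper: fill the two identifiers into the template link."""
    query = urlencode({"collectivityId": collectivity_id, "rpqsId": rpqs_id})
    return f"{SISPEA_DOWNLOAD_URL}?{query}"


# Rebuild the example link from its two identifiers
print(build_rpqs_url(206841, 512773))
```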


Scrape RPQS from the SISPEA website for a given year, competence, and status: PDFs are downloaded to `data/scraping/from_sispea/`, and the scraping report in `data/scraping` is updated (or created if needed).

In [None]:
# Name of the input file in data/scraping containing the list of relevant collectivities 
# (with one municipality, one service per competence)
# This file is the output of the notebook "sispea_extract_data_for_scraping.ipynb"
collectivity_filename = "collectivity_list_for_scraping_example.csv"
# Name of the output file (scraping report) in data/scraping
report_filename = "scraping_report_from_"+collectivity_filename

# Scraping parameters
year = "2015"
competence = "assainissement collectif"
status_lot = "WaitingForInput"

# Scrape PDFs from SISPEA
scrapper = PDFScraperFromSispea(collectivity_filename, report_filename)
scrapper.download_all_rpqs_from_ids(year=year, competence=competence, status_lot=status_lot)

Scrape RPQS from the SISPEA website for a given competence and status, but for all years: PDFs that already exist in `data/scraping/from_sispea/` are not downloaded again. The scraping report is updated.

In [None]:
collectivity_filename = "collectivity_list_for_scraping_20012025.csv"
report_filename = "scraping_report_from_"+collectivity_filename

# Scraping parameters
competence = "assainissement collectif"
status_lot = "WaitingForInput"

# Scrape PDFs from SISPEA
scrapper = PDFScraperFromSispea(collectivity_filename, report_filename)
scrapper.download_all_rpqs_from_ids(competence=competence, status_lot=status_lot)
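
The idea of not downloading a file twice can be sketched as below. This is only an illustration, not the logic actually used by `PDFScraperFromSispea`: the helper name `needs_download` is hypothetical, and treating an empty file as "to be downloaded again" is an assumption.

```python
from pathlib import Path
import tempfile


def needs_download(target: Path) -> bool:
    """Return True when the file is missing or empty (assumed to be a failed download)."""
    return not target.is_file() or target.stat().st_size == 0


# Small demonstration in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "already_there.pdf"
    existing.write_bytes(b"%PDF-1.4\n")
    missing = Path(tmp) / "not_yet.pdf"
    print(needs_download(existing), needs_download(missing))
```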

Scrape more PDFs from SISPEA, this time with the status `Published`:

In [None]:
collectivity_filename = "collectivity_list_for_scraping_example.csv"
report_filename = "scraping_report_from_"+collectivity_filename

# Scraping parameters
year = "2022"
competence = "assainissement collectif"
status_lot = "Published"

# Scrape PDFs from SISPEA
scrapper = PDFScraperFromSispea(collectivity_filename, report_filename)
scrapper.download_all_rpqs_from_ids(year=year, competence=competence, status_lot=status_lot)




**WARNING**  
- Some scraped files are `doc`, `rtf`, ... files. For now, they are not converted to PDF but downloaded as they are.  
- The scraped PDF files are not necessarily RPQS. They may also be deliberation reports, RAD (OK), or other reports. Moreover, even when a PDF is an RPQS, it can be a scanned version $\rightarrow$ a module `pdfclassifier.py` should be added to classify PDFs automatically (Is it an RPQS? Is it a scanned version?)
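
A first, very rough check before any real classification is to look at the leading bytes of a downloaded file to tell actual PDFs apart from `doc` or `rtf` files. The sketch below is not the proposed `pdfclassifier.py`; the helper name `sniff_file_type` is hypothetical, and it assumes the signature sits at the very start of the file. It says nothing about whether a PDF is an RPQS or a scanned version.

```python
# Well-known leading byte signatures ("magic numbers") of a few file formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"{\\rtf", "rtf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # container used by legacy .doc files
    (b"PK\x03\x04", "zip"),  # container used by .docx files
]


def sniff_file_type(head: bytes) -> str:
    """Hypothetical helper: guess the file type from its first bytes."""
    for signature, label in SIGNATURES:
        if head.startswith(signature):
            return label
    return "unknown"


print(sniff_file_type(b"%PDF-1.4\n"))
print(sniff_file_type(b"{\\rtf1\\ansi"))
```

In practice, `head` would be the first few bytes read from the downloaded file, for example with `open(path, "rb").read(8)`.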

## Scraping from Google

We scrape RPQS from Google:
- For each collectivity, competence, and year, we search Google for relevant URLs, keeping the first 5 URLs returned for the keywords `f"RPQS {collectivity_name} {competence} {year}"`, then `f"RAD {collectivity_name} {competence} {year}"`, then `f"Rapport du maire {collectivity_name} {competence} {year}"`.  
- In each URL, we look for PDF links.  
- We download all PDFs (with no filtering) to `data/scraping/from_google` and update the scraping report.   
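
The three search queries described above can be sketched as follows. The function name `build_queries` is hypothetical (it is not taken from `PDFScraperFromGoogle`), and the collectivity name used in the example is only an illustration.

```python
def build_queries(collectivity_name: str, competence: str, year: str) -> list[str]:
    """Hypothetical helper: build the three keyword queries, in the order they are tried."""
    prefixes = ["RPQS", "RAD", "Rapport du maire"]
    return [f"{prefix} {collectivity_name} {competence} {year}" for prefix in prefixes]


# "Lyon" is an arbitrary example name, not taken from the input file
for query in build_queries("Lyon", "assainissement collectif", "2015"):
    print(query)
```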

In [None]:
# Name of the input file in data/scraping containing the list of relevant collectivities 
# (with one municipality, one service per competence)
# This file is the output of the notebook "sispea_extract_data_for_scraping.ipynb"
collectivity_filename = "collectivity_list_for_scraping_example.csv"
report_filename = "scraping_report_from_"+collectivity_filename

# Scraping parameters
year = "2015"
competence = "assainissement collectif"
status_lot = "WaitingForInput"

scrapper = PDFScraperFromGoogle(collectivity_filename, report_filename)
scrapper.download_all_pdfs(year=year, competence=competence, status_lot=status_lot)

**WARNING**  
- Work in progress: there might be errors. 
- We very often get a `Request error` (see above), so many URLs are ignored. Why? Can it be fixed quickly?
- We download far too many PDFs! Most of them are not RPQS.
    - Check whether the keywords should be changed, or whether we should keep only the top 3 (?) Google results, instead of 5, for each query.
    - More importantly, add a module `pdfclassifier.py` to classify PDFs and filter out those that are not RPQS (Is it an RPQS? Is it the correct city? Is it the correct competence? Is it the correct year? Is it a scanned version?)
- Note that PDFs coming from https://services.eaufrance.fr/ should be filtered out immediately, since they are more easily scraped with `PDFScraperFromSispea`.
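
Filtering out the links that point to the SISPEA website could look like the sketch below. The helper name `is_from_sispea` is hypothetical and not part of `narval`, and the second URL in the example is made up for illustration.

```python
from urllib.parse import urlparse

SISPEA_HOST = "services.eaufrance.fr"


def is_from_sispea(url: str) -> bool:
    """Hypothetical helper: True if the URL points to the SISPEA host or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == SISPEA_HOST or host.endswith("." + SISPEA_HOST)


candidate_urls = [
    "https://www.services.eaufrance.fr/sispea//referential/download-rpqs.action?collectivityId=206841&rpqsId=512773",
    "https://example.org/rapports/rpqs_2015.pdf",  # made-up URL for illustration
]
# Keep only the links that are not already covered by the SISPEA scraper
kept = [url for url in candidate_urls if not is_from_sispea(url)]
print(kept)
```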

## Count the number of scraped files

In [None]:
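# Build the folder path (works whether data_dir is a str or a Path)
path = Path(data_dir) / "data" / "scraping" / "from_sispea"

# Count the PDF files recursively (the extension check is case-insensitive)
number_files = sum(1 for x in path.rglob('*') if x.is_file() and x.suffix.lower() == ".pdf")
print(f"There are {number_files} PDF files in the folder {path}/")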

In [None]:
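# Build the folder path (works whether data_dir is a str or a Path)
path = Path(data_dir) / "data" / "scraping" / "from_google"

# Count the PDF files recursively (the extension check is case-insensitive)
number_files = sum(1 for x in path.rglob('*') if x.is_file() and x.suffix.lower() == ".pdf")
print(f"There are {number_files} PDF files in the folder {path}/")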