# Scraping
In this notebook we gather as much data as possible on our subject, XXXXXX.

To do so, we proceed in two steps:

1. Scrape ArianeWeb, to retrieve the most important decisions
2. Scrape Opendata.justice-administrative, to get all the other decisions

We will later reconcile these two databases to remove duplicates.
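
The reconciliation step could look like the minimal sketch below. It assumes that both sources expose a shared decision identifier and that the ArianeWeb record is preferred when a decision appears in both; the `reconcile` helper, the `decision_id` key and the sample records are hypothetical and only serve the illustration.

```python
def reconcile(ariane_rows, opendata_rows, key="decision_id"):
    """Merge two lists of decision records, keeping one record per key.

    Records from `ariane_rows` win when the same key appears in both sources
    (an assumption made for this sketch).
    """
    merged = {}
    # Insert the Opendata records first so that ArianeWeb records overwrite them
    for row in opendata_rows:
        merged[row[key]] = row
    for row in ariane_rows:
        merged[row[key]] = row
    return list(merged.values())


# Hypothetical records, for illustration only
ariane = [{"decision_id": "100001", "source": "ArianeWeb"}]
opendata = [
    {"decision_id": "100001", "source": "Opendata"},
    {"decision_id": "100002", "source": "Opendata"},
]
deduplicated = reconcile(ariane, opendata)
print(len(deduplicated))
```

With both datasets loaded as pandas DataFrames, `pd.concat` followed by `drop_duplicates` on the identifier column would be an equivalent approach.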


## Install and import dependencies

In [None]:
! pip install -q pandas bs4 requests selenium regex lxml

In [34]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys as KeysBrowser
from bs4 import BeautifulSoup
import requests
import time
import regex as re 


# 1. Scrape ArianeWeb


We first scrape the [ArianeWeb](https://www.conseil-etat.fr/arianeweb/#/recherche) website, which provides numerous decisions made by the Conseil d'Etat.

More precisely, it contains the following data (*translated from French*):


**Decisions and analyses of the Conseil d’État, and conclusions of the rapporteurs publics**

Decisions, contentious opinions (avis contentieux) and orders issued since 19 February 1875. In particular:

- 19/02/1875-1965: landmark decisions (Grands arrêts)
- 1965-1986: decisions of jurisprudential interest (decisions published (A) or mentioned (B) in the tables of the Recueil Lebon, whose analyses are available in ArianeWeb)
- Since 1987: all decisions, except those ruling on the admission of appeals in cassation and the routine orders of the presidents
- Analyses of collegial decisions, contentious opinions and interim (référé) orders published (A) or mentioned (B) in the tables of the Recueil Lebon, since 19 February 1875
- A selection of conclusions of the rapporteurs publics: conclusions relating to collegial decisions and contentious opinions since 31 January 2007

(*[source](https://jurisguide.fr/fiches-documentaires/arianeweb-1/)*)


## 1.1 Get Ariane landing page

In [35]:
# we instantiate the webdriver
driver = webdriver.Chrome()

# URL of the ArianeWeb search page
url = "https://www.conseil-etat.fr/arianeweb/#/recherche"

# at this point the driver lands on the home page of the website
driver.get(url)

# we inspected the HTML page to find the XPaths of the "Décisions du conseil d'etat" checkbox
# and of the "Rechercher" button
x_path_decisions = "/html/body/div[2]/div[2]/div[1]/form/div[2]/div/div[1]/div[1]/label/input"
x_path_rechercher = "/html/body/div[2]/div[2]/div[1]/form/div[4]/div/button"

# we tick the "Décisions du conseil d'etat" checkbox to get only the decisions
driver.find_element(By.XPATH, x_path_decisions).click()

# we click on the "Rechercher" button to get the results
driver.find_element(By.XPATH, x_path_rechercher).click()


## 1.2 Tests

### 1.2.1 Test to parse a single decision

We saved a single decision page as an HTML file and we try to parse it to decide what we want to keep.

After discussing with our team, we agreed to keep:

- date of the decision
- decision id
- composition of the council
- corpus
- decision

We will later define which kinds of decisions are of interest to us.
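
One possible way to hold these five fields is a small record type, sketched below. The `Decision` class, its field names and the placeholder values are hypothetical; the actual parsing of each field is not covered here.

```python
from dataclasses import dataclass, asdict


@dataclass
class Decision:
    """One parsed decision; class and field names are assumptions for this sketch."""
    date: str          # date of the decision, kept as raw text here
    decision_id: str   # decision id
    composition: str   # composition of the council
    corpus: str        # corpus of the decision
    decision: str      # the decision itself


# Placeholder values, for illustration only
example = Decision(
    date="<date>",
    decision_id="<id>",
    composition="<composition>",
    corpus="<corpus>",
    decision="<decision>",
)
record = asdict(example)  # plain dict, one key per field
print(list(record))
```

A list of such dictionaries can be passed directly to `pd.DataFrame` to build one row per decision.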

In [40]:
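from pathlib import Path

# path to the decision we want to parse
html_path = "../data/487634.html"

# XPath to the line with the date and id of the decision
x_path_date_id = "/html/body/strong[3]"

# load the page in the driver (the browser window opened in 1.1 must still be open);
# a file:// URL needs an absolute path, so we resolve the relative one first
driver.get(Path(html_path).resolve().as_uri())

# get the date and id of the decision
date_id = driver.find_element(By.XPATH, x_path_date_id).text

date_id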



NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=121.0.6167.139)


### 1.2.2 Test to scrape the first page

We check that we can get the results for a single page before trying to retrieve all of them.

For readability, we wrap the parsing of a page's HTML in a function.

This function is mainly inspired by Prof Charlotin's code snippet provided in Notebook #3: Scraping

(in all honesty, it is a 100% copy-paste).

In [None]:
def parse_decision():
    raise NotImplementedError  # placeholder: parsing of a single decision is not written yet

In [36]:
# The function takes the whole HTML page, stored as a soup, and parses it

from io import StringIO


def parse_page(soup):

    # Collect the tables from the page; there are two of them in the page source
    # (find_all returns a list), and we are interested in the last one.
    table = soup.find_all("table")[-1]

    # To make things easier, we convert the table into a pandas DataFrame
    # (StringIO avoids the deprecation warning recent pandas versions raise for literal HTML)
    df = pd.read_html(StringIO(str(table)))[-1]

    # For each row, we make the browser click on the element and collect the judgment
    for index, row in df.iterrows():
        # Take only the number, because the "(...)" part messes up the XPath
        num = re.search(r"\d+", str(row["Numéro d'affaire"])).group()
        # With that number, we look for the relevant element in the browser
        row_el = driver.find_element(By.XPATH, ".//td[contains(text(), '" + num + "')]")
        row_el.click()  # Load the page with the judgment
        time.sleep(1)  # Give the page 1 s to load before changing focus
        # Switch the driver's focus to the window just opened
        # (the latest loaded window is at index -1 in window_handles)
        driver.switch_to.window(driver.window_handles[-1])

        # TODO: to be changed here

        # Find the download button; a CSS selector lets us rely on its unique 'title'
        save_el = driver.find_element(By.CSS_SELECTOR, "button[title='enregistre le document']")
        # Download the judgment in HTML format; it ends up in the default Downloads folder
        save_el.click()

        # Give the file 1 s to download
        time.sleep(1)
        # Important: return to the main window, or the next search for row_el won't work
        driver.switch_to.window(driver.window_handles[0])


# We call the function with the current page (the first one)
parse_page(BeautifulSoup(driver.page_source, "html.parser"))

IndexError: list index out of range
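
The fixed `time.sleep(1)` pauses in `parse_page` may be too short on a slow connection and needlessly long on a fast one. A polling helper is one alternative, sketched below with the standard library only; `wait_until` and the stand-in condition are hypothetical names, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.2):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in for a browser check: becomes true on the third call
calls = {"n": 0}


def two_windows_open():
    calls["n"] += 1
    return calls["n"] >= 3


wait_until(two_windows_open, timeout=2.0, poll=0.01)
print(calls["n"])
```

In the notebook, the condition could be something like `lambda: len(driver.window_handles) > 1`, checked before switching windows. Selenium also provides `WebDriverWait` (in `selenium.webdriver.support.ui`) for the same purpose.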

### 1.2.3 Test to keep only interesting pages

## 1.3 Final parsing

As of now (07/02), the above query returns 3190 pages of results.
This value can easily be changed by assigning a new number to the `num_pages` variable.
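
The loop over all result pages could be organised as in the sketch below. It is a dry run under stated assumptions: `scrape_pages`, `parse_current_page` and `go_to_next_page` are hypothetical names, and the way to reach the next page on ArianeWeb is not shown because it has not been established in this notebook.

```python
def scrape_pages(num_pages, parse_current_page, go_to_next_page):
    """Parse `num_pages` result pages in order and return the pages that failed."""
    failed = []
    for page in range(1, num_pages + 1):
        try:
            parse_current_page()
        except Exception as exc:  # keep going: one bad page should not stop the run
            failed.append((page, repr(exc)))
        if page < num_pages:
            go_to_next_page()
    return failed


# Dry run with stand-in callables (no browser involved)
visited = []
state = {"page": 1}


def fake_parse():
    if state["page"] == 2:
        raise ValueError("simulated failure")
    visited.append(state["page"])


def fake_next():
    state["page"] += 1


failed_pages = scrape_pages(3, fake_parse, fake_next)
print(visited, failed_pages)
```

In the notebook, `parse_current_page` could wrap the existing call `parse_page(BeautifulSoup(driver.page_source, "html.parser"))`, and the list of failed pages makes it possible to retry them afterwards.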

In [None]:
num_pages = 3190 

