# Get the data into the right format

This section converts the data into a storage format that is easier to manipulate and edit than exchange formats such as CSV or JSON. While CSV and JSON are useful for exchanging data, they are very limited when we need to manipulate or query it.

In this section, we:
1. download the raw data from [data.gouv.fr](https://www.data.gouv.fr/fr/datasets/donnees-ouvertes-du-grand-debat-national/)
2. transform the data into a more suitable format
3. show an example of how to query and access the data

## Download the data

In [2]:
import hashlib
import requests
from pathlib import Path

# Define the URL and expected SHA-1 checksum
url = "https://www.data.gouv.fr/fr/datasets/r/bc085888-e6bd-445d-b3f4-632190c29e3f"
expected_sha1 = "90540350af64eb61f8a9823c83468934b19634c1"

# Define the target directory and file path
data_dir = Path("../data/raw")
data_dir.mkdir(parents=True, exist_ok=True)
file_path = data_dir / "fiscalite_et_les_depenses_publiques.csv"

# Check if the file already exists
if file_path.exists():
    print(f"File already exists: {file_path}")
else:
    # Download the file if it doesn't exist
    print(f"Downloading file: {file_path}")
    response = requests.get(url, timeout=60)  # Time out instead of hanging on a stalled connection
    response.raise_for_status()  # Raise an error for bad HTTP responses
    file_path.write_bytes(response.content)

    # Verify the SHA-1 checksum
    sha1 = hashlib.sha1()
    with file_path.open("rb") as f:
        while chunk := f.read(8192):
            sha1.update(chunk)
    calculated_sha1 = sha1.hexdigest()
    if calculated_sha1 == expected_sha1:
        print("SHA-1 checksum verified successfully.")
    else:
        print(f"SHA-1 checksum mismatch! Expected: {expected_sha1}, Got: {calculated_sha1}")

File already exists: ../data/raw/fiscalite_et_les_depenses_publiques.csv
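
The cell above verifies the checksum only right after a fresh download, so a file that is already on disk is never checked. As a minimal sketch, the chunked hashing could be factored into a small helper and applied to an existing file as well. The names `sha1_of_file` and `verify_sha1` are hypothetical and are not part of the original notebook.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size: int = 8192) -> str:
    """Return the hexadecimal SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with Path(path).open("rb") as f:
        # Read fixed-size chunks so large files never have to fit in memory
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha1(path, expected: str) -> bool:
    """Hypothetical helper: True if the file's SHA-1 matches `expected` (case-insensitive)."""
    return sha1_of_file(path) == expected.lower()
```

With this in place, `verify_sha1(file_path, expected_sha1)` could be called in both branches of the download cell, whether or not the file was just downloaded.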


## Read the data

In [None]:
import pandas as pd

df = pd.read_csv(file_path, sep=",")
print(df.columns)


Index(['id', 'reference', 'title', 'createdAt', 'publishedAt', 'updatedAt',
       'trashed', 'trashedStatus', 'authorId', 'authorType', 'authorZipCode',
       'QUXVlc3Rpb246MTYy - Quelles sont toutes les choses qui pourraient être faites pour améliorer l'information des citoyens sur l'utilisation des impôts ?',
       'QUXVlc3Rpb246MTYz - Que faudrait-il faire pour rendre la fiscalité plus juste et plus efficace ?',
       'QUXVlc3Rpb246MTY0 - Quels sont selon vous les impôts qu'il faut baisser en priorité ?',
       'QUXVlc3Rpb246MjA2 - Afin de financer les dépenses sociales, faut-il selon vous...',
       'QUXVlc3Rpb246MjA1 - S'il faut selon vous revoir les conditions d'attribution de certaines aides sociales, lesquelles doivent être concernées ?',
       'QUXVlc3Rpb246MTY1 - Quels sont les domaines prioritaires où notre protection sociale doit être renforcée ?',
       'QUXVlc3Rpb246MTY2 - Pour quelle(s) politique(s) publique(s) ou pour quels domaines d'action publique, seriez-vou

For the proof of concept (POC), we focus only on the following question:
> 'QUXVlc3Rpb246MTYz - Que faudrait-il faire pour rendre la fiscalité plus juste et plus efficace ?'

In [26]:
col_name = "QUXVlc3Rpb246MTYz - Que faudrait-il faire pour rendre la fiscalité plus juste et plus efficace ?"
df_contrib = df[['authorId', col_name]].rename(
    {col_name:'contribution'},
    axis=1)
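
Hard-coding the full header works, but it ties the code to the opaque ID prefix. As a sketch, the column could instead be looked up by its question text. This assumes every question header follows the `<id> - <question text>` pattern seen in the column listing above; `find_question_column` is a hypothetical helper, not part of pandas.

```python
def find_question_column(columns, question_text: str) -> str:
    """Return the single header that ends with ' - <question_text>'.

    Assumes headers look like '<opaque id> - <question text>'.
    """
    suffix = f" - {question_text}"
    matches = [str(c) for c in columns if str(c).endswith(suffix)]
    if len(matches) != 1:
        raise KeyError(
            f"expected exactly one column for {question_text!r}, found {len(matches)}"
        )
    return matches[0]


# Example with two headers copied from the column listing
headers = [
    "authorId",
    "QUXVlc3Rpb246MTYz - Que faudrait-il faire pour rendre la fiscalité plus juste et plus efficace ?",
    "QUXVlc3Rpb246MjA2 - Afin de financer les dépenses sociales, faut-il selon vous...",
]
question = "Que faudrait-il faire pour rendre la fiscalité plus juste et plus efficace ?"
col = find_question_column(headers, question)
```

Because the helper accepts any iterable of strings, `df.columns` can be passed in directly.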

In [27]:
df_contrib.head()

|   | authorId | contribution |
|---|----------|--------------|
| 0 | VXNlcjo3ZTVjYTUwMi0xZDZlLTExZTktOTRkMi1mYTE2M2... | |
| 1 | VXNlcjo5NmNhYWM4ZS0xZTIwLTExZTktOTRkMi1mYTE2M2... | |
| 2 | VXNlcjo3ZTVjYTUwMi0xZDZlLTExZTktOTRkMi1mYTE2M2... | |
| 3 | VXNlcjpjNDY0ZjllMy0xZDk4LTExZTktOTRkMi1mYTE2M2... | Repartir les richesses. suppression de la tax... |
| 4 | VXNlcjo3MDdkM2IzOC0xZDYxLTExZTktOTRkMi1mYTE2M2... | Les droits soient automatiques, comme nos devo... |


## Convert the data to the Arrow/Parquet format

In [30]:
from datasets import Dataset
import numpy as np

In [28]:
dataset = Dataset.from_pandas(df_contrib)
print(dataset.info)

DatasetInfo(description='', citation='', homepage='', license='', features={'authorId': Value(dtype='string', id=None), 'contribution': Value(dtype='string', id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name=None, dataset_name=None, config_name=None, version=None, splits=None, download_checksums=None, download_size=None, post_processing_size=None, dataset_size=None, size_in_bytes=None)


In [34]:
n = np.random.randint(0, dataset.num_rows)
dataset[n]

{'authorId': 'VXNlcjpkZWE1ODYxOC0yNmYxLTExZTktOTRkMi1mYTE2M2VlYjExZTE=',
 'contribution': 'Supprimer toutes  les taxes et imposer tous les revenus au même taux.'}

In [39]:
save_path = data_dir / "contributions"
save_path.mkdir(parents=True, exist_ok=True)
dataset.save_to_disk(str(save_path.resolve()))

Saving the dataset (1/1 shards): 100%|██████████| 186711/186711 [00:00<00:00, 373371.97 examples/s]
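
Once saved, the dataset can be reloaded and queried without going back to the CSV. The sketch below reloads it with `load_from_disk` and keeps only the rows that contain text, since several of the first rows shown earlier have an empty contribution. It assumes missing values are stored as `None` after `Dataset.from_pandas`; `has_text` and `load_non_empty_contributions` are hypothetical helper names.

```python
def has_text(example: dict) -> bool:
    """True if the row has a non-blank 'contribution' (assumes missing values are None)."""
    text = example.get("contribution")
    return text is not None and text.strip() != ""


def load_non_empty_contributions(path):
    """Reload a dataset written by `save_to_disk` and drop rows without text."""
    # Imported lazily so the predicate above stays usable without the library
    from datasets import load_from_disk

    dataset = load_from_disk(str(path))
    return dataset.filter(has_text)
```

For example, `load_non_empty_contributions(save_path)` would return a `Dataset` that can be indexed like `dataset[n]` above.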
