<a href="https://colab.research.google.com/github/nicojimestre/ml_project2/blob/main/Code/Traduction_and_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Library Installation and CSV File Reading

In [None]:
!pip install --no-cache-dir transformers sentencepiece
from transformers import AutoTokenizer, AutoModelWithLMHead, TranslationPipeline, AutoModelForSequenceClassification, TextClassificationPipeline
from google.colab import files

import pandas as pd
import torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers, sentencepiece
Successfully installed huggingface-hub-0.11.1 sentencepiece-0.1.97 tokenizers-0.13.2 transformers-4.2


In [None]:
trad_ds = pd.read_csv('1ere_lecture_NLP_ds.csv')

In [None]:
print(trad_ds.dtypes)

Unnamed: 0             int64
article_text          object
artId                 object
article_translated    object
dtype: object


# Translation Pipeline

All articles are translated from French into English using a pretrained model from Hugging Face.

In [None]:
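translation_model_name = "SEBIS/legal_t5_small_trans_fr_en"

# French-to-English legal translation model; device=0 runs it on the first GPU.
pipeline_translation = TranslationPipeline(
    model=AutoModelWithLMHead.from_pretrained(translation_model_name),
    tokenizer=AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path=translation_model_name,
        use_fast=False, do_lower_case=False, skip_special_tokens=True),
    device=0
)

# Translate the articles one by one, capping each output at 512 tokens.
list_trad = []
for index, row in trad_ds.iterrows():
  # Select the source text by column name rather than by position.
  dict_trad = pipeline_translation([row['article_text']], max_length=512)
  list_trad.append(dict_trad[0]['translation_text'])

articles_translated = pd.DataFrame(list_trad)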







In [None]:
print(articles_translated.head())

                                                   0
0  The Valais canton is a democratic Republic in ...
1  The Canton Organisation of Canton The Canton o...
2  The capital of capital of capital of capital o...
3                                          Minority:
4  The official valaisan anthem is made up of the...
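
The preview above already shows two suspicious outputs: row 2 loops on the same words and row 3 is a single word. The following is a minimal sketch of a heuristic for flagging such rows for manual review. The function name `looks_degenerate` and both thresholds are hypothetical choices made for illustration, not part of the original pipeline.

```python
from collections import Counter

def looks_degenerate(text, min_words=3, max_repeat_ratio=0.3):
    """Heuristic check: True for very short or heavily repetitive text.

    min_words and max_repeat_ratio are illustrative defaults, not tuned values.
    """
    words = text.lower().split()
    if len(words) < min_words:
        return True
    # Share of all word pairs taken up by the single most frequent pair.
    bigrams = list(zip(words, words[1:]))
    top_count = Counter(bigrams).most_common(1)[0][1]
    return top_count / len(bigrams) > max_repeat_ratio

# Visible prefixes of rows 0, 2 and 3 from the preview (the full texts are longer).
previews = [
    "The Valais canton is a democratic Republic in",
    "The capital of capital of capital of capital",
    "Minority:",
]
flags = [looks_degenerate(p) for p in previews]
print(flags)
```

On these three prefixes the check flags the second and third entries only. In the notebook it could be applied to the whole column with `articles_translated[0].map(looks_degenerate)`.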


# Text Labelling Pipeline

The goal is to assign a label to each previously translated article using a pretrained model from Hugging Face.
With the model we are using, each article is assigned a score for each of the following themes:
- external relations
- freedom and democracy
- political system
- economy
- welfare and quality of life
- fabric of society
- social groups

In [None]:
model_name = "MoritzLaurer/policy-distilbert-7d"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Alternative: wrap the model and tokenizer in a ready-made pipeline.
#pipeline_classif = TextClassificationPipeline(model = AutoModelForSequenceClassification.from_pretrained(model_name), tokenizer = AutoTokenizer.from_pretrained(model_name))

label_names = ["external relations", "freedom and democracy", "political system", "economy", "welfare and quality of life", "fabric of society", "social groups"]

# Sample sentences for quick manual checks (not used in the loop below).
text = "The state and the municipalities provide education for the citizenship of children and young people. The state is putting in place instruments to enable children and young people to participate in political life."
text = "The new variant first detected in southern England in September is blamed for sharp rises in levels of positive tests in recent weeks in London, south-east England and the east of England"
# pipeline_classif(text)

predictions = []
for index, row in articles_translated.iterrows():
  # Avoid shadowing the built-in input(); pass the attention mask along with the ids.
  inputs = tokenizer([row[0]], truncation=True, return_tensors="pt")
  with torch.no_grad():  # inference only, no gradients needed
    output = model(**inputs)

  # Softmax over the seven logits, expressed as percentages rounded to one decimal.
  probs = torch.softmax(output["logits"][0], -1).tolist()
  predictions.append({name: round(float(p) * 100, 1) for p, name in zip(probs, label_names)})

# DataFrame.append was removed in pandas 2.0, so build the table in one step.
articles_classif = pd.DataFrame(predictions)
print(articles_classif.head())


   external relations  freedom and democracy  political system  economy  \
0                 0.0                  100.0               0.0      0.0   
1                 0.0                    0.0             100.0      0.0   
2                 0.0                    0.0             100.0      0.0   
3                 0.0                    0.2               0.0      0.0   
4                 0.0                    0.0               0.0      0.0   

   welfare and quality of life  fabric of society  social groups  
0                          0.0                0.0            0.0  
1                          0.0                0.0            0.0  
2                          0.0                0.0            0.0  
3                         98.9                0.4            0.4  
4                          0.0              100.0            0.0  


In [None]:
articles_classif = articles_classif.join(trad_ds['artId'])
print(articles_classif.head())


   external relations  freedom and democracy  political system  economy  \
0                 0.0                  100.0               0.0      0.0   
1                 0.0                    0.0             100.0      0.0   
2                 0.0                    0.0             100.0      0.0   
3                 0.0                    0.2               0.0      0.0   
4                 0.0                    0.0               0.0      0.0   

   welfare and quality of life  fabric of society  social groups      artId  
0                          0.0                0.0            0.0   Art. 100  
1                          0.0                0.0            0.0   Art. 101  
2                          0.0                0.0            0.0   Art. 102  
3                         98.9                0.4            0.4   Art. 103  
4                          0.0              100.0            0.0  Art. 103a  
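
To read the table as one label per article, the highest-scoring theme can be extracted from each row. The sketch below works on plain dictionaries, such as those returned by `articles_classif.to_dict("records")`, and reuses the values printed above for Art. 100 and Art. 103. The helper `dominant_theme` and its `min_score` cutoff of 50 are assumptions made for illustration.

```python
THEMES = [
    "external relations",
    "freedom and democracy",
    "political system",
    "economy",
    "welfare and quality of life",
    "fabric of society",
    "social groups",
]

def dominant_theme(row, min_score=50.0):
    """Return (theme, score) for the best theme, or (None, score) if it is below min_score."""
    theme = max(THEMES, key=lambda name: row[name])
    score = row[theme]
    return (theme if score >= min_score else None, score)

# Two rows copied from the printed table; all unlisted themes scored 0.0.
art_100 = {"artId": "Art. 100", **dict.fromkeys(THEMES, 0.0)}
art_100["freedom and democracy"] = 100.0

art_103 = {"artId": "Art. 103", **dict.fromkeys(THEMES, 0.0)}
art_103.update({
    "freedom and democracy": 0.2,
    "welfare and quality of life": 98.9,
    "fabric of society": 0.4,
    "social groups": 0.4,
})

summary = {row["artId"]: dominant_theme(row) for row in (art_100, art_103)}
print(summary)
```

A cutoff like `min_score` makes it possible to leave an article unlabelled when no theme clearly dominates, instead of forcing a choice.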


In [None]:
articles_classif.to_csv('all_articles_labels.csv', encoding = 'utf-8-sig') 
files.download('all_articles_labels.csv')


After examining individual labelling examples from our dataset in more detail, we concluded that this method was not very accurate.