## EnvBert - BERT for Environment domain 


This notebook is a walkthrough of how to use EnvBert to make predictions on your environment data. For any queries, please reach out to:

Afreen Aman https://www.linkedin.com/in/afreen-aman-177824b8/

Deepak John Reji https://www.linkedin.com/in/deepak-john-reji/

#### Install EnvBert from the PyPI repository

!pip install EnvBert 

For the latest version, see https://pypi.org/project/EnvBert/
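
To compare the locally installed version against the PyPI page, a minimal sketch using only the standard library is shown below. The helper name `installed_version` is hypothetical and not part of EnvBert, and the sketch assumes the distribution name matches the one used in the `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Assumption: the distribution is registered under the name "EnvBert".
print(installed_version("EnvBert"))
```

If this prints `None`, the package is not installed in the current environment.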

### EnvBert for EDD Prediction

EnvBert has a base model that classifies Environment data into 11 categories. It uses a fine-tuned model and a custom embedding layer to predict the class.

In [3]:
# load functions from envbert
from EnvBert.due_diligence import envbert_predict

# provide a test sentence
doc = "At the every month post-injection monitoring event, TCE, carbon tetrachloride, and chloroform concentrations were above CBSGs in three of the wells"

# run the function over this test sentence
envbert_predict(doc)

['Remediation Standards', 0.3396157974881729]

It returns a list containing the predicted class and its prediction probability.
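
Because the result is a two-element list, it can be unpacked directly. The sketch below hard-codes the output shown above instead of calling the model, so it runs without EnvBert installed; the variable names are illustrative only.

```python
# Result copied from the cell output above; in practice this would be
# the return value of envbert_predict(doc).
result = ['Remediation Standards', 0.3396157974881729]

# Unpack the list into the class label and its probability.
label, probability = result

# Format the probability as a percentage with one decimal place.
summary = f"{label} ({probability:.1%})"
print(summary)
```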

### EnvBert for determining the relevancy

In [5]:
# provide a sentence which is not related to the environment domain
doc1 = "Tesla designs and manufactures electric vehicles, battery energy storage from home to grid-scale, solar panels and solar roof tiles, and related products and services."

# run the function over this irrelevant sentence
envbert_predict(doc1)

['Not Relevant', 0.14867964865350358]

In [6]:
# provide a sentence which is about the environment domain
doc2 = "weathered shale was encountered below the surface area with fluvial deposits. Sediments in the coastal plain region are found above and below the bedrock with sandstones and shales that form the basement rock"

# run the function over this relevant sentence
envbert_predict(doc2)

['Geology', 0.6775815994496605]
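
One way to use these outputs is to filter out anything the model labels `'Not Relevant'`. The following is a minimal sketch under that assumption: `is_relevant` is a hypothetical helper, not an EnvBert function, and the predictions are copied from the two cell outputs above so the example runs on its own.

```python
# Label string exactly as it appears in the output above.
NOT_RELEVANT_LABEL = "Not Relevant"


def is_relevant(prediction):
    """Return True when the predicted class is anything other than 'Not Relevant'."""
    label, _probability = prediction
    return label != NOT_RELEVANT_LABEL


# Predictions copied from the outputs of envbert_predict(doc1) and envbert_predict(doc2).
predictions = {
    "doc1": ['Not Relevant', 0.14867964865350358],
    "doc2": ['Geology', 0.6775815994496605],
}

# Keep only the documents that the model considers relevant.
relevant_docs = [name for name, pred in predictions.items() if is_relevant(pred)]
print(relevant_docs)
```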

### EnvBert for ranking the data

In [13]:
# provide a set of sentences
doc_list = ["Norilsk Nickel says ‘flagrant violation of rules’ has been committed by dumping wastewater from reservoir into wildlife.",
"Also, some of the people most harmed by groundwater contamination are indigenous or people of color who live in under-resourced areas, she says. Smaller or disadvantaged communities “should be at the front of the line to make sure they get the money that they need,” Evelyn adds. “Unfortunately, this is not the way that it works. So it’s a scramble.”",
"California fined Chevron more than $2.7 million for numerous violations, noting that the spills caused “significant threat of harm to human health and the environment.” ​",
"Last fall, the Central Valley Regional Water Quality Control Board assured critics that it had reviewed studies of the practice and found no elevated risks to human health or crop safety. The board focused primarily on one question—are crops grown with produced water safe to eat?—and considered as beyond the scope of its responsibility the wider range of potential harms associated with recycling the oil industry’s wastewater.",
"But scientists in other parts of the country have investigated these questions, looking at both the consequences of intentional reuse of oil wastewater for irrigation and disposal and accidental spills of the wastewater on wildlife and the environment. And a growing body of research shows that even highly diluted produced water can harm soil, plants, and aquatic life, and that oil drilling boosts groundwater concentrations of naturally occurring toxic elements like arsenic, and radioactive elements like radium, while also endangering sensitive ecosystems and protected wildlife."]

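# predict each document and print the predicted class with its probability
for text in doc_list:
    prediction, probability = envbert_predict(text)
    print('doc: ', text)
    print('prediction:', prediction)
    print('probability:', probability)
    print("")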


doc:  Norilsk Nickel says ‘flagrant violation of rules’ has been committed by dumping wastewater from reservoir into wildlife.
prediction: Contaminated media
probability: 0.3669099021814714

doc:  Also, some of the people most harmed by groundwater contamination are indigenous or people of color who live in under-resourced areas, she says. Smaller or disadvantaged communities “should be at the front of the line to make sure they get the money that they need,” Evelyn adds. “Unfortunately, this is not the way that it works. So it’s a scramble.”
prediction: Contaminated media
probability: 0.32199568997106415

doc:  California fined Chevron more than $2.7 million for numerous violations, noting that the spills caused “significant threat of harm to human health and the environment.” ​
prediction: Contaminated media
probability: 0.3608322522465483

doc:  Last fall, the Central Valley Regional Water Quality Control Board assured critics that it had reviewed studies of the practice and found n
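
The loop above prints predictions in input order. To turn them into an actual ranking, one option is to sort by prediction probability. This is a sketch that assumes ranking by probability is the intent, which the notebook does not state explicitly. The short document keys are hypothetical, and the values are copied from the three complete outputs above.

```python
# (label, probability) pairs copied from the printed outputs above,
# keyed by a short hypothetical identifier for each document.
results = {
    "norilsk_wastewater": ("Contaminated media", 0.3669099021814714),
    "groundwater_communities": ("Contaminated media", 0.32199568997106415),
    "chevron_fine": ("Contaminated media", 0.3608322522465483),
}

# Sort documents from highest to lowest prediction probability.
ranked = sorted(results.items(), key=lambda item: item[1][1], reverse=True)

for rank, (doc_id, (label, probability)) in enumerate(ranked, start=1):
    print(rank, doc_id, label, round(probability, 4))
```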

### EnvBert for Fine-tuning new data

In [None]:
# load all the functions
from EnvBert.due_diligence import *

# define training config
training_config = {
    'learning_rate':5e-5,
    'epochs':10,
    'batch_size':16,
    'sentence column name':'Sentence', #training sentences column name
    'label column name': 'label', #encoded labels column name
    'save_dir': r'XX\XX\XXX' #model save path
    }

"""
Make sure your labels are encoded before training.
Provide the save_dir path to save the model automatically after training.
'sentence column name' and 'label column name' are mandatory fields in the training config.
The other parameters are optional; default values are used for any that are omitted.
"""

# Train the model with a single call
# df is the dataframe with your sentences and encoded labels
new_model, new_tokenizer = finetune(df, training_config)
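
The note above asks for encoded labels. A minimal, library-free sketch of one possible encoding is shown below: each distinct text label is mapped to an integer id. The example labels are illustrative, and whether `finetune` expects ids in a particular order or range is an assumption not confirmed by this notebook.

```python
# Hypothetical text labels for four training sentences.
raw_labels = ["Geology", "Contaminated media", "Geology", "Not Relevant"]

# Assign each distinct label an integer id, in alphabetical order for reproducibility.
label_to_id = {label: idx for idx, label in enumerate(sorted(set(raw_labels)))}

# Reverse mapping, useful for turning predicted ids back into readable labels.
id_to_label = {idx: label for label, idx in label_to_id.items()}

# Encoded values that would go into the 'label' column of the training dataframe.
encoded_labels = [label_to_id[label] for label in raw_labels]
print(label_to_id)
print(encoded_labels)
```

Keeping `id_to_label` alongside the saved model makes it possible to decode predictions later.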