If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell to do so.

In [None]:
! pip install datasets transformers

Collecting datasets
  Downloading datasets-2.14.4-py3-none-any.whl (519 kB)
Collecting transformers
  Downloading transformers-4.32.0-py3-none-any.whl (7.5 MB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1

In [None]:
! pip install -U accelerate
! pip install -U transformers
! pip install -U SentencePiece

Collecting accelerate
  Downloading accelerate-0.22.0-py3-none-any.whl (251 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.22.0
Collecting SentencePiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: SentencePiece
Successfully installed SentencePiece-0.1.99


In [None]:
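import os

# Kill the current kernel process so the runtime restarts and
# the freshly upgraded packages are picked up on the next import.
os._exit(0)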

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to get an access token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cell and enter the token when prompted:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

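
Typing or pasting a token into a notebook cell makes it easy to leak when the notebook is shared. As a hedged alternative, the sketch below reads the token from an environment variable and passes it to `huggingface_hub.login`. The variable name `HF_TOKEN` and the helper names `get_hf_token` and `login_from_env` are assumptions made for this illustration; they are not part of the original notebook.

```python
import os
from typing import Optional


def get_hf_token(env_var: str = "HF_TOKEN") -> Optional[str]:
    """Return the token stored in `env_var`, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def login_from_env(env_var: str = "HF_TOKEN") -> bool:
    """Log in with the token from the environment; return False if none is set."""
    token = get_hf_token(env_var)
    if token is None:
        return False
    # Imported lazily so get_hf_token works even without huggingface_hub installed.
    from huggingface_hub import login

    login(token=token)
    return True
```

If `login_from_env()` returns `False`, falling back to the interactive `notebook_login()` call above is a reasonable choice.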

Then you need to install Git-LFS. Run the following cell:

In [None]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 16 not upgraded.


Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [None]:
import transformers

print(transformers.__version__)

4.32.0
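
Rather than comparing the printed version by eye, the check can be made explicit. The following is a minimal standard-library sketch; the helper names `version_tuple` and `meets_minimum` are hypothetical, and the parsing only looks at the leading numeric components, so it is an approximation (the `packaging` library's `version.parse` is the more robust route if it is available).

```python
import re


def version_tuple(version: str) -> tuple:
    """Turn '4.32.0' (or '4.33.0.dev0') into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "4.11.0") -> bool:
    """Return True if `installed` is at least `minimum`."""
    return version_tuple(installed) >= version_tuple(minimum)


# The version printed above, checked against the stated minimum.
print(meets_minimum("4.32.0"))
```

In the notebook itself, `meets_minimum(transformers.__version__)` would perform the same check on the installed library.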


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/text-classification).

We also quickly upload some telemetry - this tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.

# Fine-tuning a model on a text classification task

This notebook is built to run on any of the tasks in the list above, with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a classification head. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly:

In [None]:
task = "INCIBE"
model_checkpoint = "mnaylor/mega-base-wikitext"
batch_size = 32
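
If `batch_size = 32` does not fit in GPU memory, one common workaround (not used in this notebook) is to lower the per-step batch size and accumulate gradients over several steps, so that the effective batch size stays the same. The sketch below only does that arithmetic; the function name `accumulation_steps` is hypothetical.

```python
import math


def accumulation_steps(target_batch_size: int, per_device_batch_size: int) -> int:
    """Number of accumulation steps needed to reach at least the target batch size."""
    if target_batch_size <= 0 or per_device_batch_size <= 0:
        raise ValueError("Batch sizes must be positive integers.")
    return math.ceil(target_batch_size / per_device_batch_size)


target = 32  # the batch size set in the cell above
for per_device in (32, 16, 8):
    steps = accumulation_steps(target, per_device)
    print(f"per-device batch {per_device:>2} x {steps} steps = {per_device * steps}")
```

With 🤗 Transformers, these two numbers would typically be passed to `TrainingArguments` as `per_device_train_batch_size` and `gradient_accumulation_steps`.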

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [None]:
from datasets import load_dataset, load_metric

Here we pass the dataset's Hub repository ID directly to `load_dataset`. `load_dataset` will cache the dataset to avoid downloading it again the next time you run this cell.

In [None]:
dataset = load_dataset("agarc15/TFM_INCIBE")

Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/33.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/7.05M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/7.12M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [None]:
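# Note: the second positional argument of `load_metric` is a configuration name,
# not a second metric, so this call loads only the accuracy metric.
metric = load_metric('accuracy', 'f1')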

Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]
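
To make explicit what the two metrics named in the cell above measure, here is a minimal pure-Python sketch of accuracy and macro-averaged F1. It is an illustration only, not the implementation used by 🤗 Datasets. The function names and the toy labels are made up for the example, and macro averaging is an assumption, since the notebook does not say which F1 variant is intended.

```python
def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def macro_f1(predictions, references):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(predictions) | set(references))
    scores = []
    for label in labels:
        pairs = list(zip(predictions, references))
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


toy_predictions = [0, 1, 1, 2]
toy_references = [0, 1, 2, 2]
print(accuracy(toy_predictions, toy_references))
print(macro_f1(toy_predictions, toy_references))
```

Macro F1 treats every class equally regardless of how often it occurs, which matters when the label distribution is uneven.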

The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split; here, `train` and `test`.

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['DESCRIPTION', 'INCIBE_TAXONOMY', 'label', 'A', 'CI', 'F', 'I', 'IA', 'Others', 'SUM'],
        num_rows: 6247
    })
    test: Dataset({
        features: ['DESCRIPTION', 'INCIBE_TAXONOMY', 'label', 'A', 'CI', 'F', 'I', 'IA', 'Others', 'SUM'],
        num_rows: 2678
    })
})
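
The row counts printed above imply roughly a 70/30 train/test split. A quick check of that arithmetic, using only the numbers from the output:

```python
# Row counts copied from the DatasetDict output above.
split_rows = {"train": 6247, "test": 2678}

total_rows = sum(split_rows.values())
split_share = {name: rows / total_rows for name, rows in split_rows.items()}

for name, share in split_share.items():
    print(f"{name}: {split_rows[name]} rows ({share:.1%})")
```

Note that the dataset has no separate validation split, so any evaluation during training would have to use `test` or a portion held out from `train`.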

To access an actual element, you need to select a split first, then give an index:

In [None]:
dataset["train"][0]

{'DESCRIPTION': " The White House says the program is part of a broader cybersecurity plan designed to address issues across the nation's critical infrastructure.. The 100-day initiative will involve government agencies that are responsible for the security of critical infrastructure as well as businesses and private utilities that oversee or own infrastructure, such as electrical distribution systems that deliver power to homes.. Public-private partnership is paramount to the administration's efforts because protecting our nation's critical infrastructure is a shared responsibility of government and the owners and operators of that infrastructure, says Emily Horne, a spokesperson for the National Security Council.. Some lawmakers and a government watchdog agency have recently criticized the Department of Energy for its cybersecurity practices, especially in the wake of the SolarWinds supply chain attack, which led to follow-on attacks on the DOE and eight other federal agencies, plus 

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
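
The retry loop in `show_random_elements` keeps drawing indices until it has `num_examples` distinct ones. The standard library's `random.sample` does the same in a single call. A minimal sketch, where `pick_unique_indices` and its `seed` parameter are hypothetical names introduced here:

```python
import random
from typing import List, Optional


def pick_unique_indices(
    dataset_size: int, num_examples: int = 10, seed: Optional[int] = None
) -> List[int]:
    """Draw `num_examples` distinct indices from range(dataset_size)."""
    if num_examples > dataset_size:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    return rng.sample(range(dataset_size), num_examples)


# 6247 is the size of the train split shown earlier.
picks = pick_unique_indices(6247, num_examples=10, seed=0)
print(picks)
```

The resulting list can be passed to `dataset[picks]` in the same way as in the function above.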

In [None]:
show_random_elements(dataset["train"])

Unnamed: 0,DESCRIPTION,INCIBE_TAXONOMY,label,A,CI,F,I,IA,Others,SUM
0,"In a statement issued Jan. 22, Neiman Marcus President and CEO Karen Katz says a network malware attack designed to collect or scrape payment card data had been identified by forensics investigators. The investigation is ongoing.. To date, Visa, MasterCard and Discover have notified us that approximately 2,400 unique customer payment cards used at Neiman Marcus and Last Call stores were subsequently used fraudulently, Katz says in the company's statement. Last Call is a retail clearance center with 28 locations owned by Neiman Marcus.. No fraudulent activity has yet been linked to Neiman Marcus or Bergdorf Goodman payment cards, the statement notes. Bergdorf Goodman is a subsidiary of Neiman Marcus.. So far, the retailer says its investigation has revealed that personally identifiable information, such as Social Security numbers and dates of birth, was not compromised. The retailer also notes that online purchases and PINs were not adversely affected by the breach. We do not use PIN pads in our stores, Katz states in the Jan. 22 statement.. Like Target Corp., which announced its network breach Dec. 19, Neiman Marcus is stressing its zero liability for consumers adversely affected by fraudulent charges.. The policies of the payment brands such as Visa, MasterCard, American Express, Discover and the Neiman Marcus card provide that you have zero liability for any unauthorized charges if you report them in a timely manner, the company says. Please contact your card brand or issuing bank for more information about the policy that applies to you.. Neiman Marcus also is offering free credit monitoring to all customers who conducted transactions at Neiman Marcus or Last Call from January 2013 to January 2014. We are notifying all customers for whom we have addresses or e-mail, the company says.. 
Additional information is available for consumers under the general questions section on Neiman Marcus' website.",I,4,0,0,0,1,0,0,"Last Call is a retail clearance center with 28 locations owned by Neiman Marcus.. No fraudulent activity has yet been linked to Neiman Marcus or Bergdorf Goodman payment cards, the statement notes."
1,"In a blog posted by Facebook Security on Feb. 15, the company said it found no evidence that Facebook user data was compromised.. Here's what happened at Facebook, according to its blog:. Several Facebook employees visited a mobile developer website that was compromised. The compromised website hosted an exploit that then allowed malware to be installed on these employees' laptops.. The laptops were fully-patched and running up-to-date anti-virus software, the blog says. As soon as we discovered the presence of the malware, we remediated all infected machines, informed law enforcement and began a significant investigation that continues to this day.. Facebook Security flagged a suspicious domain in its corporate DNS (Domain Name Servers) logs and tracked it back to an employee laptop. The security team conducted a forensic examination of that laptop and identified a malicious file, and then searched company-wide and flagged several other compromised employee laptops.. After analyzing the compromised website where the attack originated, Facebook found the site was using a previously unseen, zero-day exploit to bypass the Java sandbox (built-in protections) to install the malware. Facebook immediately reported the exploit to Oracle, and Oracle confirmed Facebook's findings and provided a patch on Feb. 1 that addressed the vulnerability.. Facebook says it wasn't the only victim of this exploit. It is clear that others were attacked and infiltrated recently as well, the blog says. As one of the first companies to discover this malware, we immediately took steps to start sharing details about the infiltration with the other companies and entities that were affected. We plan to continue collaborating on this incident through an informal working group and other means.. The social-media company says it is working with law enforcement and the other organizations affected by this attack. 
It is in everyone's interests for our industry to work together to prevent attacks such as these in the future, Facebook says.. Sharing threat information has received much attention in Washington this past week. President Obama, in his State of the Union address on Feb. 12, announced an executive order that calls on the government to share cyberthreat information with critical infrastructure owners and called for legislation to allow businesses to share threat information with the government and with each other [see Obama Issues Cybersecurity Executive Order]. The following day, the heads of the House Permanent Select Committee on Intelligence introduced a bill to do just that [see Is Compromise in Offing for CISPA?]. Facebook is the latest high-profiled media company to reveal it's been victimized by intruders. The New York Times, Wall Street Journal, Twitter and Washington Post have reported their websites being attacked [see N.Y. Times' Transparent Hack Response and Twitter, Washington Post Report Cyberattacks].. The Facebook attack is reminiscent of the 2011 breach at security provider RSA, when a well-crafted e-mail tricked an RSA employee to retrieve from a junk-mail folder and open a message containing a virus that led to a sophisticated attack on the company's information systems [see 'Tricked' RSA Worker Opened Backdoor to APT Attack].",I,4,0,0,0,1,0,0,The social-media company says it is working with law enforcement and the other organizations affected by this attack.
2,"The cybersecurity workforce gap narrowed for the second consecutive year, but the global workforce still must grow by 65 percent in order to effectively defend critical assets and data, according to analysis from (ISC)².. (ISC)² collected survey data from over 4,500 cybersecurity professionals. Only 4 percent of respondents reported working in healthcare, which validates previous findings of inadequate IT staffing within the sector.. LASTING CONSEQUENCES OF A CYBERSECURITY WORKFORCE SHORTAGE. The cybersecurity workforce gap, which (ISC)² defines as the number of additional professionals that organizations need to adequately defend their critical assets, decreased from 2.12 million last year to 2.72 million this year. The study also revealed that in 2021, over three-quarters of respondents reported being satisfied or extremely satisfied with their jobs.. While this improvement in numbers and job satisfaction shows promise, increasing the workforce by 65 percent is not an easy task. As current cybersecurity professionals continue to work in the middle of the workforce shortage, negative consequences may emerge.. A workforce shortage can result in employee burnout, as exhibited by the current nationwide clinician shortage. For IT teams, the shortage could mean that employees are stretched too thin and may miss key vulnerabilities and suspicious network activity as a result.. READ MORE: Security Automation, Collaboration Prove Critical For Healthcare. Cyberattacks on the healthcare sectors are ramping up, which naturally requires a more robust IT security team. A report conducted by CyberMDX and Philips found that hospitals in particular are struggling with a cybersecurity talent shortage. Respondents reported struggling to fill jobs within 100 days of posting new roles.. Without proper staffing to account for common vulnerabilities, healthcare organizations may face risks to patient safety and costly recovery costs in the event of a cyberattack.. 
(ISC)² survey respondents reported misconfigured systems, not enough time to focus on risk management and assessment, slower patching of critical systems, and oversights in processes and procedures as a consequence of being short-staffed. Respondents also reported high rates of rushed deployments and the inability to remain aware of all active threats.. LACK OF DIVERSITY IN THE CYBERSECURITY WORKFORCE LIMITS POTENTIAL. The report indicated that the global cybersecurity community is well-educated, technically grounded, and strongly compensated.. However, the field is about three-quarters male and Caucasian, which reveals significant missed opportunities for bright minds to join the field and contribute diverse perspectives.. READ MORE: Recent Health Data Breaches Cause EHR Downtime, Deploy Malware. A few government agencies and private sector organizations are actively trying to combat the lack of diversity in the cybersecurity workforce.. The Cybersecurity and Infrastructure Security Agency (CISA) recently awarded $2 million to two organizations to develop cyber workforce training programs in underserved communities in rural and urban areas.. The three-year pilot program, led by NPower and CyberWarrior, will focus on developing a comprehensive retention strategy and delivering accessible entry-level cybersecurity training while providing opportunities to underserved communities.. President Biden, along with a coalition of private companies, recently announced numerous national cybersecurity initiatives aimed at increasing the availability of cybersecurity training and education.. IBM pledged to train 150,000 people in cybersecurity skills over the next three years and partner with more than 20 Historically Black Colleges & Universities to establish Cybersecurity Leadership Centers. Girls Who Code announced that it will establish a micro-credentialing program for historically excluded groups in technology. 
In addition, Code.org said it will teach cybersecurity concepts to three million students over the next three years.. READ MORE: Growing Number of States Enact New Genetic Data Privacy Laws. It is crucial that organizations acknowledge the lack of diversity in the cybersecurity workforce and work to improve it.. HOW ORGANIZATIONS CAN NARROW THE GAP. The study suggested that organizations begin by embracing diversity, equity, and inclusion (DEI).. “DEI is a catalyst for positive change,” the study asserted.. “Organizations that take a hard look at their own skills gap, reconsider the qualities that make a successful cybersecurity professional, focus on their people before technology and remove geographical barriers through remote work will tap into a broader pool of talent that opens up new possibilities. Cybersecurity professionals are not only aware of how DEI can contribute to solving the skills gap, but they expect their employers to act.”. Organizations should also consider prioritizing investments in existing staff before investing in technology to improve their security posture. By focusing on recruitment, development, and retention, organizations can build a cohesive team of competent cybersecurity professionals.. Survey respondents reported investing in training, providing more flexible working conditions, investing in DEI programs, and addressing pay gaps in order to address the workforce gap.. Organizations may also consider investing in automation technologies, cloud service providers, and involving cybersecurity staff earlier in product design and development to alleviate the challenges of the ongoing workforce shortage.. Tagged Cyber Hygiene Cybersecurity Ransomware.",CI,6,0,1,0,0,0,0,It is crucial that organizations acknowledge the lack of diversity in the cybersecurity workforce and work to improve it.. HOW ORGANIZATIONS CAN NARROW THE GAP.
3,"The botnet, which was uncovered in October by Juniper researchers, originally targeted vulnerable Linux applications as well as IoT devices, according to the report. The operators behind Gitpaste-12 were also using legitimate services, such as GitHub and Pastebin, to help hide the malware's infrastructure (see: Botnet Operators Abusing Legit GitHub, Pastebin Resources).. The initial wave of Gitpaste-12 attacks started in July but was not uncovered until October, when the GitHub repository that was hosting the bulk of the worm's payloads was removed. On Nov. 10, the Juniper researchers discovered a second round of attacks had started, according to the report.. The Juniper analysis notes that seven of the vulnerability exploits found in the latest version were ported over from the previous Gitpaste-12 sample. The worm also attempts to compromise open Android Debug Bridge connections and existing malware backdoors.. The [latest] wave of attacks used payloads from yet another GitHub repository, which contained a Linux cryptominer ('ls'), a list of passwords for brute-force attempts ('pass') and a statically linked Python 3.9 interpreter of unknown provenance, the researchers note in the report.. The Juniper analysis also notes the latest version contains a cryptomining tool that mines for monero virtual currency, as was the case in the earlier version.. GitHub has taken down a section of its platform used to host the malicious worm, which slowed its spread, the researcher say.. Initial Attack. The Juniper researchers who examined the second wave of attacks also found a repository of new attacks vectors in one of the malware's variants that they examined.. That variant, which the researchers call 10-inix, is a UPX-packed binary written in the Go programming language and compiled for x86_64 Linux systems. The repository contained the 31 vulnerabilities that the worm would look for, according to the report.. 
Many of the vulnerabilities exploited are new, with public disclosures and proof-of-concept exploits dated as recently as September, the researchers say.. Although the X10-unix attacks vary depending on the operating systems and architectures targeted, each of these attacks tries to install the monero cryptomining malware, install the appropriate version of the X10-unix worm and open a backdoor that listens for instructions on ports 30004 and 30006, the report notes.. X10-unix also attempts to connect to Android Debug Bridge connections on port 5555, according to the report.. While it's difficult to ascertain the breadth or effectiveness of this malware campaign, in part because monero - unlike bitcoin - does not have publicly traceable transactions, [Juniper Threat Labs] can confirm over a hundred distinct hosts have been observed propagating the infection, the researchers note.. Other Botnet Activity. Other researchers have also noted several new botnets that target vulnerable Linux-based devices as well as IoT devices.. In May, for example, researchers uncovered a botnet dubbed Kaiji that uses brute-force methods targeting the SSH protocol to infect endpoints, which also allows it to launch distributed denial-of-service attacks (see: Kaiji Botnet Targets Linux Servers, IoT Devices).. In October, researchers at security firm Avira Protection Lab identified a new strain of the Mirai botnet targeting vulnerable IoT devices (see: Even in Test Mode, New Mirai Variant Infecting IoT Devices).",I,4,0,0,0,1,0,0,"The repository contained the 31 vulnerabilities that the worm would look for, according to the report.."
4,"Acquirers have traditionally used rules-based systems to deal with fraud. Now with growing liability risk, they are developing two distinct problems: more alerts than can be investigated and growing staffing needs to handle those alerts.. With such a high volume of alerts, acquirers’ fraud management teams can only comb through volumes of data looking for the most obvious anomalies while missing others. Regardless of the number of resources on staff, the work is overwhelming.. View this video OnDemand and learn about:. Reducing acquirer fraud alerts and increasing accuracy. Why early detection needs speedy processing and output. Assessing risk for acquirer fraud",F,7,0,0,1,0,0,0,"Regardless of the number of resources on staff, the work is overwhelming.. View this video OnDemand and learn about:."
5,"Organizations can use a cloud-native app protection platform, or CNAPP, to bolster their security processes.. It is not uncommon for organizations to bite off more than they can chew by procuring various point-solutions to beef up their cloud security. However, these tools (which may come from different vendors) often target specific security issues in their respective silos. These silos block information exchange, making it much more difficult to identify threats and prevent attackers from embedding across multiple cloud services.. A CNAPP aims to fix this by combining disparate cloud security functions (e.g. CSPM, CWPP, KSPM, CIEM, IaC) into one proprietary software solution, offering a platform-level view of the entire attack surface. CNAPP configurations can also include infrastructure-as-code automation, which allows for earlier discovery of faulty code and misconfigurations in the CI/CD pipeline. In other words, it grants DevOps teams greater visibility, ease of control, and speed in resolving cloud security issues as they emerge.. How to deploy a CNAPP. So, let’s say you’ve decided to invest in a CNAPP. What do you need to keep in mind to get the most out of it? Consider the purpose of deploying a CNAPP: It is designed for consolidating cloud security and reducing complexities in the management of security tools. With that in mind, there are several important items that security leaders should consider as they identify a CNAPP that works for their cloud needs.. Checklist for deploying a CNAPP. #1: Strategize before diving into the deep end. Before pulling the trigger, organizations should consider how a CNAPP will work for their business needs.. Do they have a hybrid cloud model that spans on-premise, private, and public cloud environments?. Do they have contingencies for securing all (not just some) artifacts across the dev cycle, including source code, containers, VM files, IaC scripts, APIs and cloud configuration files?. 
Will they need to shift responsibilities and feedback loops to accommodate the platform approach? The right CNAPP strategy should be flexible in addressing instances across all these spaces and more. It’s worth taking time to understand how a CNAPP might reveal gaps in the architecture and what needs to be done to address them.. #2: Be clear in communicating to vendors your requirements. As the CNAPP market is still relatively young, it’s worth negotiating one to two year contracts with a vendor in case new features become available elsewhere. Gartner recommends requiring CNAPP vendors to only charge for licenses-based modules that your team actually uses, rather than jumping all in on the full set of integrated capabilities (as the latter could take several years to fully adopt). Organizations can also require that vendors scan containers already in development and provide features like infrastructure-as-code scanning, Kubernetes security posture management, and runtime assessment using APIs to get the most bang for their buck.. #3: Keep the focus and the priority on developers. A CNAPP can help developers create a more secure environment with better visibility. However, implementing new technology may still present some challenges. Taking a platform approach to cloud-native security could necessitate a shift in how teams work together, structure and report their findings, and configure and test new features. Organizations should provide time and resources for devs to familiarize themselves with the CNAPP’s offerings, such as improved threat scanning, the ability to automate cloud configurations, and proactive controls for fixing code in development rather than at runtime. A CNAPP can shift team responsibilities and structures. Organizations should clarify and communicate these changes with developers to ensure smooth adoption of the platform.. #4: Treat security and compliance as a continuum from development to production. 
The added value of a CNAPP includes consolidated insights from the development lifecycle. By integrating solutions, a CNAPP platform builds shared context for security and development teams to analyze for future improvements. Unifying functions, tools and oversight into a single platform gives DevSecOps greater visibility throughout the development process. With infrastructure-as-code automation, a CNAPP platform helps developers see and correct problems before they lead to an incident. Overall, this results in fewer bugs and more expedient production.",IA,3,0,0,0,0,1,0,"With that in mind, there are several important items that security leaders should consider as they identify a CNAPP that works for their cloud needs.. Checklist for deploying a CNAPP."
6,"Natus Medical has updated its NeuroWorks software to plug eight cybersecurity vulnerabilities that could enable an attacker to get control of the Natus Xltek electroencephalogram (EEG) device and crash it, according to a June 14 ICS-CERT advisory.. Natus recommended installing the update, NeuroWorks/SleepWorks 8.5 GMA 3, “as quickly as possible on affected systems.”. The NeuroWorks software uses a SQL server database, which enables collaboration between multiple users while also providing customization capabilities to fit any clinical configuration. The software enables remote access to the Xltek EEG device and video monitoring and review, as well as running, analyzing, reporting on, and managing an EEG study using an intuitive user interface.. Dig Deeper. Medical Device Security Should Be Focus for Healthcare Providers. Medical Device Cybersecurity Top Challenge to IoT Ecosystem. Medical Devices Reportedly Infected in Ransomware Attack. Cory Duplantis of Cisco Talos discovered the vulnerabilities and reported them to Natus.. In a blog post, Paul Rascagneres of Cisco Talos explained that the company identified code execution vulnerabilities and denial-of-service vulnerability in the NeuroWorks software. The vulnerabilities can be triggered remotely without authentication.. The Windows-based NeuroWorks software uses the hospital’s ethernet network to connect to EEG devices and integrate with patient information systems.. “Clinicians rely on accurate clinical data in order to decide what is the most appropriate care for their patients. Medical devices such as Natus Xltek EEG are a convenient tool for collecting and recording complex data relating to patients’ state of health,” explained Rascagneres.. “However, this captured clinical data is only as reliable as the platform on which it is collected. If the system collecting the data is liable to be compromised, then the care of the patients will also be compromised,” he noted.. 
Cisco Talos has observed attackers targeting the healthcare sector to deploy ransomware and steal confidential health care records.. ICS-CERT said that no known public exploits target these vulnerabilities.. National Cybersecurity and Communications Integration Center (NCCIC) recommended that device end users take the following defensive measures:. • Minimize network exposure for all control system devices and/or systems and ensure that they are not accessible from the internet. • Locate control system networks and remote devices behind firewalls and isolate them from the business network. • Use secure methods for remote access, such as virtual private networks (VPNs), recognizing that VPNs may have vulnerabilities, should be updated to the most current version available, and are only as secure as the devices connected to them. NCCIC advised organizations to perform impact analysis and risk assessment prior to deploying defensive measures.. Earlier in June, ICS-CERT also issued an advisory about security vulnerabilities in Philips’ IntelliVue patient and Avalon fetal monitor.. The vulnerabilities could enable an attacker to read/write memory and induce a denial of service through a system restart, the advisory warned.. Oran Avraham of Medigate reported the Philips device vulnerabilities to NCCIC.. Philips said it will provide a remediation patch for supported versions of the devices, as well as an upgrade path for all versions. The company said it will communicate service options to all affected install-base users.. In its product security advisory, Philips said that the vulnerabilities cannot be exploited without an attacker first attaining local area network (LAN) access to the medical device.. Last month, ICS-CERT highlighted vulnerabilities in another Philips medical device, its Brilliance CT scanners. Those vulnerabilities could be exploited by attackers to steal PHI and other sensitive data files.. 
The vulnerabilities affect the following Philips CT scanners: Brilliance 64 version 2.6.2 and below, Brilliance iCT versions 4.1.6 and below, Brilliance CT SP versions 3.2.4 and below, and Brilliance CT Big Bore 2.3.5 and below.. The security vulnerabilities include execution with unnecessary privileges, exposure of resources to wrong sphere, and use of hard-coded credentials. These security flaws could impact system confidentiality, system integrity, or system availability, the advisory noted.. The rash of medical device security flaws uncovered by security researchers has prompted the Food and Drug Administration to issue a medical device safety action plan to help reduce the vulnerabilities in legacy medical devices.. As part of those efforts, the FDA wants to set up a CyberMed Safety (Expert) Analysis Board, which would be a public-private partnership between the FDA and devices makers to complement existing device vulnerability coordination and response mechanisms.. Medical Device Connectivity Medical Device Security Virtual Private Networks.",CI,6,0,1,0,0,0,0,"In its product security advisory, Philips said that the vulnerabilities cannot be exploited without an attacker first attaining local area network (LAN) access to the medical device.. Last month, ICS-CERT highlighted vulnerabilities in another Philips medical device, its Brilliance CT scanners."
7,"Deputy Secretary of the Treasury Wally Adeyemo met with Israeli Finance Minister Avigdor Lieberman and Director General of the National Cyber Directorate Yigal Unna in Israel on Sunday to announce the partnership, which aims to protect critical financial infrastructure and counter ransomware, they said.. Officials said in a statement that they will form a new U.S.-Israeli Task Force on Fintech Innovation and Cybersecurity.. As the global economy recovers and ransomware and other illicit finance threats present a grave challenge to Israel and the U.S., increased information exchanges, joint work, and collaboration on policy, regulation, and enforcement are critical to our economic and national security objectives, Adeyemo said.. Strategic alliances win wars, pure and simple, says Tim Wade, a former network and security technical manager with the U.S. Air Force. The U.S.-Israeli partnership against ransomware is not only welcome and likely to be productive, it's in the tradition of effective international partnerships.. Wade, currently the technical director and CTO at the firm Vectra AI, adds, Our efforts against cybercrime to date have been hampered by insufficient global alliances. Good ones devoted to other forms of criminal activity point the way toward better cybersecurity.. Yet it is also about optics - especially on the heels of the recent White House ransomware summit, says Rosa Smothers, a former Central Intelligence Agency threat analyst and technical intelligence officer.. This makes sense, but there is also likely a PR motivation, says Smothers, currently the senior vice president of cyber operations at the firm KnowBe4. She cites recent NSO Group headlines as potential motivation for Israel to strengthen ties with the U.S.. Memorandum of Understanding. U.S. Treasury Department officials said a yet-to-be-drafted memorandum of understanding between the two nations will cover the following:. 
Financial sector information sharing on regulations and guidance, and on threat intelligence;. Staff training, study visits and cross-border competency-building activities;. Technical exchanges on policy, regulation and outreach;. Improved public sector analytics and enforcement.. 'Valuable Partnerships'. Praising similar partnerships, Adam Flatley, a member of the U.S. Ransomware Task Force and a former technical lead for the National Security Agency, says these efforts will need to take different forms to be effective. Some, like this one, will be bilateral, he says. The U.S. and Israel have long worked closely and effectively together on critical security issues of common interest.. Flatley, currently the director of threat intelligence for the firm [redacted], adds, Others will be multilateral …. [but] all of these relationships are valuable and necessary in concert with a global coordinated campaign.. Israel is a strategic partner in this sector because it has long seen cybersecurity as a national security issue, says Marcus Fowler, a former department chief for the CIA and currently the director of strategic threat at the firm Darktrace. Israel is also highly advanced in applying new technologies and innovations.. File image of NSO Group. Developments With NSO Spyware. This month, the U.S. Department of Commerce added both NSO Group and fellow Israeli spyware company Candiru to its Entity List for allegedly engaging in activities contrary to the national security or foreign policy interests of the U.S. A final rule from the Commerce Department's Bureau of Industry and Security - or BIS - says the companies threatened the privacy and security of individuals and organizations worldwide. Those on the Entity List cannot purchase U.S. technologies or goods without a license provided by the Department of Commerce (see: US Commerce Department Blacklists Israeli Spyware Firms).. An NSO Group spokesperson previously told U.S. 
media that the company was dismayed by the decision, given that our technologies support U.S. national security interests and policies by preventing terrorism and crime.. Israeli Foreign Affairs Minister Yair Lapid later distanced the government from the NSO Group, which distributes its products under licenses from Israel's Defense Ministry, which is reportedly investigating the company's activities. NSO Group has said it sells its products to law enforcement and intelligence agencies for legitimate use.. NSO is a private company. It is not a governmental project and therefore, even if it is designated, it has nothing to do with the policies of the Israeli government, Lapid told reporters this month, according to Reuters.. Shalev Hulio, the co-founder and CEO of NSO Group, announced that he will remain in his position as CEO, following reports by Israeli media that its CEO-designate, Itzik Benbenisti, currently NSO's co-president, has resigned (see: NSO's Troubles Extend Beyond CEO-Designate Quitting).. Whole-of-Government Approach. Last week, the Treasury Department blacklisted cryptocurrency exchange Chatex, along with a network of entities it says support the exchange, for allegedly facilitating ransomware-related financial transactions (see: US Treasury Blacklists Cryptocurrency Exchange Chatex).. This followed the White House's counter-ransomware summit held last month - with attendance from more than 30 countries, including Israel. During the meeting, Adeyemo reportedly urged global action against the abuse of virtual currency used in ransomware transactions (see: US Convenes Global Ransomware Summit Without Russia).. And last week, Vice President Kamala Harris confirmed that the U.S. would be joining the Paris Call, an 80-nation cybersecurity pact established in 2018 by French President Emmanuel Macron to develop international cybersecurity norms (see: VP Kamala Harris: US Will Join 80-Nation Cybersecurity Pact).. 
Recapping her meeting with Macron, Harris told reporters last week, We talked extensively … about what we as nations must do, who have similar values … to apply those principles and norms to how we will engage with each other … as it relates to our use of technology.. And, [on] cybersecurity … addressing what we have seen in the U.S. and around the world - [with] hackers that have compromised systems [via] ransomware.. U.S. President Joe Biden is set to sign his administration's landmark $1.2 trillion Infrastructure Investment and Jobs Act on Monday, unlocking some $1.9 billion in new cybersecurity funding for the federal government. The bill includes a $1 billion grant program to assist state, local, tribal and territorial governments guard against cyberthreats (see: Infrastructure Bill Features $1.9 Billion in Cyber Funding).",A,5,1,0,0,0,0,0,"This month, the U.S. Department of Commerce added both NSO Group and fellow Israeli spyware company Candiru to its Entity List for allegedly engaging in activities contrary to the national security or foreign policy interests of the U.S. A final rule from the Commerce Department's Bureau of Industry and Security - or BIS - says the companies threatened the privacy and security of individuals and organizations worldwide."
8,"The Thursday letter, presented by the nonprofit group Issue One, which focuses on reducing the role of money in politics and “modernizing elections,” requests the Senate approve five bills covering a range of cybersecurity-related issues.. “We are alarmed at the lack of meaningful Congressional action to secure our elections. The United States cannot afford to sit by as our adversaries exploit our vulnerabilities,” the letter states. “Congress - especially the Senate - must enact a robust and bipartisan set of policies now. China, Iran, Russia, and nonstate actors are utilizing every means possible to manipulate our elections and undermine the faith Americans have in our democracy. These efforts pose severe threats to our national security.”. The legislation the letter endorses includes:. The Secure Elections Act, which seeks to bolster voting systems while reaffirming each state’s role in administering federal elections;. The Honest Ads Act, which backers say would help protect against hidden, foreign propaganda efforts online.. The Foreign Agents Disclosure and Registration Enhancement Act, which is designed to modernize and enforce lobbying laws and impose meaningful penalties for rule breakers;. The Shell Company Abuse Act, which backers say would help ensure foreign actors cannot hide behind tax laws to subvert elections;. The Defending Elections from Threats by Establishing Redlines (DETER) Act, which would impose sanctions on countries that interfere in American elections. ”In addition to action on these five important bills, Congress should ensure that states and counties have the additional financial support they need to address election vulnerabilities, coupled with minimum standards and requirements to ensure election security and verifiability,” the letter states.. 
In recent days, Senate Democrats have tried to get votes on certain election security bills, but have been blocked by Republicans, who have argued, for example, that some of the legislation would give the federal government unprecedented control over elections, according to The Hill.. Helping Local Governments. Meanwhile, yet another bill, the DOTGOV Online Trust in Government Act of 2019, was introduced Wednesday by senators James Lankford, R-Okla; Gary Peters, D-Mich; Ron Johnson, R-Wisc.; and Amy Klobuchar, D-Minn... The bill aims to strengthen local governments' cybersecurity defenses by enabling them to switch to thegov domain for websites and email addresses. The measure is an attempt to make it more difficult for cybercriminals to impersonate government officials for phishing and other identity-theft related scams.. When official government websites use thegov domain instead of alternatives likeus orcom, it makes those government websites and email addresses more secure,” Klobuchar says. “Unfortunately, right now most county and local governments don’t use thegov domain. This allows cybercriminals to more easily impersonate government officials in order to defraud the public and get people to share sensitive information. (News Editor Howard Anderson contributed to this story.)",IA,3,0,0,0,0,1,0,"”In addition to action on these five important bills, Congress should ensure that states and counties have the additional financial support they need to address election vulnerabilities, coupled with minimum standards and requirements to ensure election security and verifiability,” the letter states.."
9,"The Department of Justice indicted two Iranian hackers behind the targeted and highly successful SamSam ransomware campaign that has plagued the healthcare sector for several years.. The federal prosecutors charged Mohammad Mehdi Shah Mansouri and Faramarz Shahi Savandi for an extortion scheme that targeted a wide range of organizations, especially the healthcare sector.. Assistant Attorney General Brian A. Benczkowski explained the hackers were responsible for some of the biggest hacks on the healthcare sector in the last two years: Allscripts, medical testing giant LabCorp, Washington, DC-based MedStar Health, Nebraska Orthopedic Hospital, Hancock Health and a host of others.. Dig Deeper. 2.65M Atrium Health Patient Records Breached in Third-Party Vendor Hack. HealthEquity Email Hack Breaches Data of 190K Patients. Phishing Attacks Breach Data of 42K Florida Patients for 3 Months. The notorious SamSam variant has been actively targeting the healthcare sector and the government since 2016. DOJ officials allege the most recent ransomware attack took place on Sept. 25, 2018.. The hackers primarily use brute force attacks on Remote Desktop Services to gain access onto a victim’s system. They’d use the RDP as an entry point onto a system to then infect other computers on the network. The defendants would also mask attacks to appear like legitimate network activity.. Further, the hackers purposefully launched attacks outside of regular business hours, “when a victim would find it more difficult to mitigate the attack, and by encrypting backups on the victims’ computers.”. “This was intended to -- and often did -- cripple the regular business operations of the victims,” according to the indictment.. Their ransom demands ranged from $5,000 to $60,000, depending on the attack size. While typically ransomware attacks are random, SamSam hackers leveraged a targeted and manual nature and heavily researched victims before launching an attack.. 
The method proved successful as the hacking group banked at least $6 million in ransom payments and caused more than $30 million in damages for its 200 victims.. “The allegations in the indictment unsealed today -- the first of its kind – outline an Iran-based international computer hacking and extortion scheme that engaged in 21st-century digital blackmail,” Benczkowski said in a statement.. “The defendants in this case developed and deployed the SamSam Ransomware in order to hold public and private entities hostage and then extort money from them,” U.S. Attorney for New Jersey Craig Carpenito said in a statement.. The hackers began by targeting a Mercer County business, then moved to public entities like the city of Newark and specifically healthcare providers including Kansas Heart Hospital in Wichita and the Hollywood Presbyterian Medical Center in Los Angeles, Carpenito explained.. In fact, DOJ officials allege the hackers targeted healthcare as the organizations rely on data to serve the public without interruption.. “By calling out those who threaten American systems, we expose criminals who hide behind their computer and launch attacks that threaten our public safety and national security,” FBI Executive Assistant Director said in a statement.. “The actions highlighted today, which represent a continuing trend of cybercriminal activity emanating from Iran, were particularly threatening, as they targeted public safety institutions, including U.S. hospital systems and governmental entities,” she added.. DOJ charged Savandi and Mansouri with one count of conspiracy to commit wire fraud, one count of conspiracy to commit fraud and related activity in connection with computers, two substantive counts of intentional damage to a protected computer and two substantive counts of transmitting a demand in relation to damaging a protected computer.. The indictment is not an admission of guilt and the two hackers are still at large. 
In a separate announcement, the Department of the Treasury imposed sanctions against two bitcoin addresses connected to SamSam. The two addresses processed over 7,000 ransom demands from its victims.. Savandi and Mansouri are still wanted by the FBI, so it’s yet to be seen if the SamSam attacks will continue. However, the indictment sheds light on some of the biggest healthcare breaches in recent years.. Data Breaches Protected Health Information Ransomware.",CI,6,0,1,0,0,0,0,The Department of Justice indicted two Iranian hackers behind the targeted and highly successful SamSam ransomware campaign that has plagued the healthcare sector for several years..


The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [None]:
metric

Metric(name: "accuracy", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (`list` of `float`): Sample weights Defaults to None.

Returns:
    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`.. A higher score means higher accuracy.

Examples:

    Example 1-A simple example
        >>> accuracy_metric = datasets.load_metric("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
        >>> print(results)
        {'accuracy': 0.5}
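The `normalize` and `sample_weight` arguments described in this docstring can be illustrated with a small pure-Python sketch. The `accuracy` helper below is a hypothetical re-implementation written for illustration, not the library's code. It reproduces the docstring example and shows the count returned when `normalize=False`.

```python
def accuracy(predictions, references, normalize=True, sample_weight=None):
    """Hypothetical stand-in for the metric's compute(); illustration only."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    weights = sample_weight if sample_weight is not None else [1.0] * len(references)
    # Sum the weights of the correctly classified samples.
    correct = sum(w for p, r, w in zip(predictions, references, weights) if p == r)
    # normalize=True returns a fraction, normalize=False a (weighted) count.
    return correct / sum(weights) if normalize else correct


references = [0, 1, 2, 0, 1, 2]
predictions = [0, 1, 1, 2, 1, 0]
fraction = accuracy(predictions, references)
count = accuracy(predictions, references, normalize=False)
print(fraction, count)
```

Three of the six predictions match, so the normalized score is 0.5 and the unnormalized score is 3. The score can only reach the number of examples in the `normalize=False` case, which is what the description of the `normalize` argument states.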

   

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), put them in the format the model expects, and generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

Downloading (…)okenizer_config.json:   0%|          | 0.00/346 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you get an error with the previous call, remove that argument.

Depending on the model you selected, you will see different keys in the dictionary the tokenizer returns when it is called on a sentence. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later); you can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.
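To make the shape of a tokenizer's output concrete, here is a toy sketch. It assumes a made-up whitespace vocabulary and made-up special-token IDs; `toy_tokenize` and `toy_vocab` are hypothetical names, and real tokenizers use learned subword vocabularies instead.

```python
def toy_tokenize(text, vocab, bos_id=0, eos_id=2, unk_id=3):
    """Toy whitespace tokenizer; real tokenizers use learned subword vocabularies."""
    tokens = text.lower().split()
    # Convert tokens to IDs and wrap them in start/end special tokens.
    input_ids = [bos_id] + [vocab.get(tok, unk_id) for tok in tokens] + [eos_id]
    # The attention mask marks every real (non-padding) position with 1.
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


toy_vocab = {"the": 5, "hackers": 11, "encrypted": 12, "backups": 13}
encoded = toy_tokenize("The hackers encrypted the backups", toy_vocab)
print(encoded)
```

The returned dictionary has one list per key, with `input_ids` holding the vocabulary IDs and `attention_mask` flagging which positions the model should attend to.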

To preprocess our dataset, we will thus need the names of the columns containing the sentence(s). The following dictionary maps each task to the names of its text columns:

In [None]:
data_key = {
    "Incident": ("SUM", None),
}

We can double check it does work on our current dataset:

In [None]:
sentence1_key, sentence2_key = data_key["Incident"]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

Sentence: As part of the 100-day plan for the nation's electrical grid, the Energy Department's Office of Cybersecurity, Energy Security, and Emergency Response, or CESER, will work with the Cybersecurity and Infrastructure Security Agency and private utilities to make a series of cybersecurity improvements..


We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model.

In [None]:
def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
preprocess_function(dataset['train'][:5])

{'input_ids': [[0, 1620, 233, 9, 5, 727, 12, 1208, 563, 13, 5, 1226, 18, 8980, 7961, 6, 5, 2169, 641, 18, 1387, 9, 12324, 15506, 6, 2169, 2010, 6, 8, 6824, 19121, 6, 50, 18694, 2076, 6, 40, 173, 19, 5, 12324, 15506, 8, 13469, 2010, 3131, 8, 940, 9987, 7, 146, 10, 651, 9, 13468, 5139, 7586, 2], [0, 1121, 41, 11256, 2187, 6, 5, 16886, 13063, 5, 10646, 13, 5, 542, 46706, 414, 8, 11210, 5, 335, 124, 7, 5, 3526, 9230, 6, 309, 7, 5, 266, 7586, 2], [0, 1620, 503, 535, 7, 31391, 81, 5, 3302, 9, 1083, 12, 25706, 5381, 25764, 136, 5, 121, 4, 104, 482, 80, 4039, 7028, 31, 258, 1799, 236, 7, 216, 55, 59, 141, 5, 641, 9, 9777, 2010, 8, 63, 7681, 2244, 32, 1759, 21574, 447, 7, 543, 225, 5, 1778, 16946, 9, 5, 752, 168, 8, 2008, 2112, 7586, 2], [0, 48527, 28665, 12859, 7, 453, 9, 41, 14818, 44366, 333, 14, 30, 6814, 33, 455, 797, 9, 5, 11170, 6, 61, 16, 596, 215, 13801, 64, 28, 3656, 30, 13463, 36, 7048, 35, 2612, 14520, 268, 23827, 14818, 44366, 43, 7586, 2], [0, 3908, 209, 28631, 12758, 11, 317, 6, 
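The two behaviours at play here, truncation to a maximum length and the list-of-lists output for a batch, can be sketched in pure Python. This is a toy under stated assumptions: `truncate_ids` and `encode_batch` are hypothetical helpers, the special-token IDs 0 and 2 are chosen to mirror the output above, and `max_length=6` is arbitrary.

```python
def truncate_ids(token_ids, max_length, bos_id=0, eos_id=2):
    """Toy truncation: keep at most max_length IDs including both special tokens."""
    body = token_ids[: max_length - 2]  # reserve two slots for the special tokens
    return [bos_id] + body + [eos_id]


def encode_batch(batch_of_token_ids, max_length):
    # One key, one inner list per example: the "list of lists" shape.
    return {"input_ids": [truncate_ids(ids, max_length) for ids in batch_of_token_ids]}


batch = [[10, 11, 12], [20, 21, 22, 23, 24, 25, 26]]
encoded_batch = encode_batch(batch, max_length=6)
print(encoded_batch)
```

The short example is left untouched, while the long one loses its trailing tokens so that the total length, special tokens included, does not exceed the limit.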

To apply this function on all the sentences (or pairs of sentences) in our dataset, we just use the `map` method of our `dataset` object we created earlier. This will apply the function on all the elements of all the splits in `dataset`, so our training, validation and testing data will be preprocessed in one single command.

In [None]:
encoded_dataset = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/6247 [00:00<?, ? examples/s]

Map:   0%|          | 0/2678 [00:00<?, ? examples/s]

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data should not be reused). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts in batches. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.
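As a rough mental model of what a batched `map` does, the sketch below walks over rows in chunks and hands the function a dictionary of columns rather than a single row. It is a simplified illustration, not the 🤗 Datasets implementation: `batched_map` and `count_words` are hypothetical names, and `count_words` stands in for `preprocess_function`.

```python
def batched_map(rows, function, batch_size=1000):
    """Toy version of a batched map over a list of row dictionaries."""
    output_rows = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start : start + batch_size]
        # The function receives a dict of columns, each a list with one entry per row.
        columns = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(columns)
        for i, row in enumerate(chunk):
            # New columns are added alongside the original ones.
            output_rows.append({**row, **{k: v[i] for k, v in new_columns.items()}})
    return output_rows


def count_words(examples):
    # Hypothetical stand-in for preprocess_function: one output per input text.
    return {"n_words": [len(text.split()) for text in examples["SUM"]]}


rows = [{"SUM": "a b c"}, {"SUM": "d e"}, {"SUM": "f"}]
mapped = batched_map(rows, count_words, batch_size=2)
print(mapped)
```

Because the function sees a whole list of texts at once, a tokenizer can process them together instead of being called once per example.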

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is sentence classification, we use the `AutoModelForSequenceClassification` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem (11 for this dataset):

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np

num_labels = 11
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
model.config.pad_token_id = model.config.eos_token_id

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.23k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/29.3M [00:00<?, ?B/s]

Some weights of MegaForSequenceClassification were not initialized from the model checkpoint at mnaylor/mega-base-wikitext and are newly initialized: ['classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The warning is telling us that some weights (the `classifier.dense` and `classifier.out_proj` layers) were not found in the checkpoint and are randomly initialized. This is expected here: the pretrained checkpoint does not contain a sequence classification head, so a new head is created for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.
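A freshly initialized head should perform roughly at chance level, which gives a useful reference point when reading the training log. The sketch below computes that reference under two simplifying assumptions that may not hold for this dataset: the 11 classes are balanced, and the untrained model predicts a uniform distribution.

```python
import math

num_labels = 11
# Cross-entropy of a uniform prediction over num_labels classes is ln(num_labels).
chance_loss = math.log(num_labels)
# Accuracy of uniform random guessing on a balanced label set.
chance_accuracy = 1 / num_labels
print(f"{chance_loss:.3f} {chance_accuracy:.3f}")
```

A validation loss well below this value, or an accuracy well above it, indicates the model has learned something beyond guessing, although with imbalanced classes a majority-class predictor can also beat these numbers.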

To instantiate a `Trainer`, we will need to define two more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
metric_name = "accuracy"
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=25,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=True,
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook and customize the number of epochs for training, as well as the weight decay. Since the best model might not be the one at the end of training, we ask the `Trainer` to load the best model it saved (according to `metric_name`) at the end of training.


The last thing to define for our `Trainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `metric` we loaded earlier. The only preprocessing we have to do is to take the argmax of our predicted logits (or just squeeze the last axis in the case of STS-B):

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)
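The argmax step can be traced by hand on a few made-up logits. The sketch below uses a pure-Python helper, `argmax_rows`, as a hypothetical equivalent of `np.argmax(predictions, axis=1)`; the logits and labels are invented for illustration.

```python
def argmax_rows(logits):
    """Pure-Python equivalent of np.argmax(logits, axis=1) for a list of rows."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


logits = [[0.1, 2.3, -1.0], [1.5, 0.2, 0.3], [-0.4, -0.2, 0.9]]
labels = [1, 0, 1]
predicted = argmax_rows(logits)
accuracy_value = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
print(predicted, accuracy_value)
```

Each row of logits is reduced to the index of its largest entry, and those indices are what get compared against the reference labels.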

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
validation_key = "test"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics

)

You might wonder why we pass along the `tokenizer` when we already preprocessed our data. This is because we will use it one last time to make all the samples we gather the same length by applying padding, which requires knowing the model's preferences regarding padding (to the left or right? with which token?). The `tokenizer` has a `pad` method that will do all of this for us, and the `Trainer` will use it. You can customize this part by defining and passing your own `data_collator`, which will receive samples like the dictionaries seen above and will need to return a dictionary of tensors.
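The padding step can be sketched as follows. This is a toy illustration, not the collator the `Trainer` uses: `pad_batch` is a hypothetical helper, the pad token ID of 1 is made up, and a real `data_collator` would return tensors rather than Python lists.

```python
def pad_batch(features, pad_token_id, padding_side="right"):
    """Toy collator: pad every sample to the longest one in the batch."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = longest - len(f["input_ids"])
        pad_ids, pad_mask = [pad_token_id] * n_pad, [0] * n_pad
        real_mask = [1] * len(f["input_ids"])
        if padding_side == "right":
            batch["input_ids"].append(f["input_ids"] + pad_ids)
            batch["attention_mask"].append(real_mask + pad_mask)
        else:
            batch["input_ids"].append(pad_ids + f["input_ids"])
            batch["attention_mask"].append(pad_mask + real_mask)
    return batch


features = [{"input_ids": [0, 7, 8, 2]}, {"input_ids": [0, 9, 2]}]
padded = pad_batch(features, pad_token_id=1)
print(padded)
```

Padding per batch rather than to a global maximum keeps batches of short texts small, and the zeros in `attention_mask` tell the model to ignore the padded positions.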

We can now finetune our model by just calling the `train` method:

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.535949,0.37416
2,No log,1.525682,0.37528
3,1.503500,1.531996,0.371919
4,1.503500,1.523085,0.371546
5,1.503500,1.520347,0.374533
6,1.475500,1.521731,0.37416
7,1.475500,1.530072,0.371919
8,1.453100,1.5131,0.380508
9,1.453100,1.521179,0.378267
10,1.453100,1.517347,0.377147


TrainOutput(global_step=4900, training_loss=1.4334916469029018, metrics={'train_runtime': 6605.3441, 'train_samples_per_second': 23.644, 'train_steps_per_second': 0.742, 'total_flos': 107098892230512.0, 'train_loss': 1.4334916469029018, 'epoch': 25.0})
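The numbers in this output can be cross-checked with a little arithmetic. The sketch below assumes that the 6247-example split is the training split, that training ran on a single device without gradient accumulation, and that `logging_steps` was left at the `TrainingArguments` default of 500; none of these is stated explicitly in the notebook.

```python
import math

train_examples = 6247   # size of the split assumed to be the training set
num_epochs = 25         # num_train_epochs
global_step = 4900      # from TrainOutput
logging_steps = 500     # assumed: the TrainingArguments default

steps_per_epoch = global_step // num_epochs
# Effective batch sizes for which ceil(train_examples / b) matches steps_per_epoch.
compatible = [b for b in range(1, 257) if math.ceil(train_examples / b) == steps_per_epoch]
# Epoch in which the first training-loss value is logged.
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)
print(steps_per_epoch, compatible, first_logged_epoch)
```

Under these assumptions the run used 196 optimizer steps per epoch, which is consistent with an effective batch size of 32. It also explains why the Training Loss column reads "No log" for the first two epochs: the first logging step is only reached during the third.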

We can check with the `evaluate` method that our `Trainer` did reload the best model properly (if it was not the last one):

In [None]:
trainer.evaluate()
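To see which checkpoint "best" would refer to, the per-epoch accuracies can be compared directly. The sketch below uses only the ten rows displayed in the training log above; the run lasted 25 epochs, so the checkpoint actually reloaded may come from a later epoch that is not shown.

```python
# Accuracy per epoch, copied from the first ten rows of the training log above.
accuracy_by_epoch = {
    1: 0.37416, 2: 0.37528, 3: 0.371919, 4: 0.371546, 5: 0.374533,
    6: 0.37416, 7: 0.371919, 8: 0.380508, 9: 0.378267, 10: 0.377147,
}
# Higher is better for accuracy, so the best checkpoint maximizes the metric.
best_epoch = max(accuracy_by_epoch, key=accuracy_by_epoch.get)
print(best_epoch, accuracy_by_epoch[best_epoch])
```

If the best model was reloaded correctly, the accuracy reported by `trainer.evaluate()` should match the highest value in the full log rather than the value from the final epoch.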

If you fine-tuned on a GLUE task, you can see how your model fared by comparing it to the [GLUE Benchmark leaderboard](https://gluebenchmark.com/leaderboard).

You can now upload the result of the training to the Hub by executing this instruction:

trainer.push_to_hub()