If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell to do so.

In [None]:
! pip install datasets transformers

Collecting datasets
  Downloading datasets-2.14.4-py3-none-any.whl (519 kB)
Collecting transformers
  Downloading transformers-4.32.0-py3-none-any.whl (7.5 MB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)

In [None]:
! pip install -U accelerate
! pip install -U transformers

Collecting accelerate
  Downloading accelerate-0.22.0-py3-none-any.whl (251 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.22.0


In [None]:
import os

# Kill the kernel process so the runtime restarts and picks up the upgraded packages.
os._exit(0)

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cell and paste your access token when prompted:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

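To keep the token out of the notebook itself, one option is to read it from an environment variable. The following is only a minimal sketch: the helper `read_hub_token` is hypothetical, and the variable name `HF_TOKEN` is an assumption about how your environment is configured.

```python
import os


def read_hub_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or None if it is unset or blank.

    Both the helper and the default variable name are illustrative assumptions.
    """
    token = os.environ.get(env_var, "").strip()
    return token or None


token = read_hub_token()
if token is None:
    print("No token found in the environment; fall back to notebook_login().")
else:
    print("Token found in the environment.")
```

If a token is found, it can be passed to `huggingface_hub.login(token=token)` instead of using the interactive widget.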

Then you need to install Git-LFS by running the following cell:

In [None]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 16 not upgraded.


Make sure your version of Transformers is at least 4.11.0, since the functionality used to share the model was introduced in that version:

In [None]:
import transformers

print(transformers.__version__)

4.32.0
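Rather than comparing the printed version by eye, the check can be automated. This is a minimal sketch using only the standard library; `version_tuple` and `is_at_least` are hypothetical helper names, and the parsing ignores pre-release suffixes such as `.dev0`.

```python
import re


def version_tuple(version):
    """Turn '4.32.0' (or '4.33.0.dev0') into (4, 32, 0), ignoring suffixes."""
    numbers = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)  # leading digits only, e.g. '0rc1' -> 0
        numbers.append(int(match.group()) if match else 0)
    return tuple(numbers)


def is_at_least(installed, required):
    """Return True when `installed` is the same as or newer than `required`."""
    return version_tuple(installed) >= version_tuple(required)


print(is_at_least("4.32.0", "4.11.0"))
```

In the notebook, the first argument would be `transformers.__version__`.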


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/text-classification).

# Fine-tuning a model on a text classification task

This notebook is built to run on a text classification task (here, `INCIBE`), with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a classification head. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly:

In [None]:
task = "INCIBE"
model_checkpoint = "xlnet-base-cased"
batch_size = 16

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [None]:
from datasets import load_dataset, load_metric

Here we pass the dataset's repository id on the Hub directly to `load_dataset`, which will cache the dataset to avoid downloading it again the next time you run this cell.

In [None]:
dataset = load_dataset("agarc15/TFM_INCIBE")


In [None]:
# The second positional argument of `load_metric` is a configuration name, not a
# second metric, so only accuracy is loaded here. Load "f1" separately if needed.
metric = load_metric("accuracy")
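To make clear what these evaluation metrics measure, here is a minimal pure-Python sketch of accuracy and macro-averaged F1. It is not the implementation used by 🤗 Datasets; the function names and the toy labels are illustrative assumptions.

```python
def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def macro_f1(predictions, references):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(references) | set(predictions))
    scores = []
    for label in labels:
        pairs = list(zip(predictions, references))
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, made up for illustration only.
toy_predictions = [0, 1, 1, 2]
toy_references = [0, 1, 2, 2]
print(accuracy(toy_predictions, toy_references))
print(macro_f1(toy_predictions, toy_references))
```

Macro averaging weights every class equally, which can matter when some categories are much rarer than others.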

The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split. This dataset has a training split and a test split:

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['DESCRIPTION', 'INCIBE_TAXONOMY', 'label', 'A', 'CI', 'F', 'I', 'IA', 'Others', 'SUM'],
        num_rows: 6247
    })
    test: Dataset({
        features: ['DESCRIPTION', 'INCIBE_TAXONOMY', 'label', 'A', 'CI', 'F', 'I', 'IA', 'Others', 'SUM'],
        num_rows: 2678
    })
})

To access an actual element, you need to select a split first, then give an index:

In [None]:
dataset["train"][4580]

{'DESCRIPTION': ' On behalf of its nearly 5,000 member healthcare organizations, the American Hospital Association (AHA) expressed its support for the Protecting and Transforming Cyber Health Care (PATCH) Act, which was introduced by Senators in April to enhance medical device security.. In a letter addressed to Senators Bill Cassidy (R-LA) and Tammy Baldwin (D-WI), who first introduced the PATCH Act, the AHA said that the association and its members were committed to preventing cyberattacks and would support the PATCH Act’s intentions of doing the same via medical device security improvements.. “We are pleased to support this legislation to improve the security of medical devices, which can create cyber vulnerabilities and serious risks to the security and privacy of patient data along with vital medical technology used in care delivery,” the letter stated.. Dig Deeper. GAO Calls on HHS to Improve Healthcare Data Breach Reporting Process. Senators Call on FTC to Investigate Apple, Goo

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    """Display `num_examples` distinct, randomly chosen rows of `dataset` as an HTML table."""
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw random indices, re-drawing on collision so that every pick is unique.
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    # Replace integer class ids with their names for any ClassLabel column.
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
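The index-picking loop in the function above draws distinct indices by retrying on collisions. The same result can be sketched more compactly with `random.sample`. The helper name `pick_unique_indices` and the `seed` parameter are assumptions added for illustration; they are not part of the original notebook.

```python
import random


def pick_unique_indices(dataset_length, num_examples=10, seed=None):
    """Return `num_examples` distinct indices in [0, dataset_length)."""
    if num_examples > dataset_length:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # a local generator keeps the global random state untouched
    return rng.sample(range(dataset_length), num_examples)


# 6247 is the size of the training split shown earlier.
picks = pick_unique_indices(6247, num_examples=10, seed=0)
print(len(picks), len(set(picks)))
```

Sampling without replacement removes the need for the inner `while` loop, and passing a seed makes the displayed examples reproducible.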

In [None]:
show_random_elements(dataset["train"])

Unnamed: 0,DESCRIPTION,INCIBE_TAXONOMY,label,A,CI,F,I,IA,Others,SUM
0,"As market fluctuations continue, the Federal Reserve is watching. On Tuesday, the Fed announced it was holding down interest rates.. Upon the news, the Dow Jones Industrial Average bounced back, up 429 points from its more than 600-point plummet on Monday. The S&P 500 jumped 4.7 percent and trading for Bank of America and Citigroup saw upticks of about 13 percent.. But industry pundits such as Tim Ghriskey, chief information officer at wealth strategy and asset-management firm Solaris Group LLC, say it's just too early for the Fed to make a call on what the long-term impact on the economy and U.S. financial institutions will be. [See Market Collapse Worst Since '08.]. We're really talking about investor sentiment and investor psychology at this point, Ghriskey says.. Doug Johnson, vice president of risk management policy for the American Bankers Association, says alarmist reactions would do bankers more harm than good. The market fluctuation was one day in time. I don't think it's going to have a long-term impact, he says.. Johnson also points out that, despite overall dips in share prices, banking institutions are not expected to decrease investments in fraud prevention. I don't think there's any connection between the market fluctuations and the investments banks will make in fraud prevention, he says. It's not about making budget cuts; it's about protecting the customer relationship and ensuring security. Banks know we need to be very careful how we protect customers.. Mike Mancusi, managing director at FTI Consulting and former senior deputy controller with the Office of the Comptroller, says banks will closely analyze mandates versus what's deemed good practice or a nice-to-have.. I suspect they're going to look at when they'll have to spend or make certain investments, Mancusi says. Some of the spend will be deferred until things improve. Again, I think anytime banks have to comply with something, they're going to spend to comply. 
If it's something that enhances their practice, in difficult times, they'll defer that spending.. When it comes to complying with the new FFIEC Authentication Guidance, Mancusi says spending for fraud-prevention could be affected. The FFIC guidance is there to tell you how to manage or address a particular area, he says. But during an examination, if you're in compliance, you're not going to be cited for a violation of an act or a safety practice, unless one exists outside the guidance.. In light of tightened budgets, how banks might balance fraud-prevention investments with investments in anti-money-laundering solutions that help them comply with the Bank Secrecy Act and the USA Patriot Act, for instance, is difficult to estimate.. But Julie McNelley, a fraud and financial services analyst at Aite, says making generalizations about future fraud investments is risky.. Each bank's environment is unique, and it really depends on where they feel the internal need is the greatest, she says. All things being equal, I would give a slight edge to fraud-prevention technologies, particularly those that can provide a revenue-augmentation component. A fraud-prevention technology can provide an improvement to the customer experience, through reduced false-negatives, or through the opportunity to deploy new products. This is a powerful business case component any time, but particularly so in times of recession.. Long-Term Effects. Mancusi says gauging the long-term impact of current market fluctuations is difficult. Besides, this market plunge is different than the one the market felt Sept. 29, 2008, when Dow stocks dropped 778 points, the S&P fell 60 points and the NASDAQ dropped 137 points.. The financial institutions have taken a real beating. They are definitely being hit hard in the market, he says. 
I don't know that this market is necessarily going to cause more banks to fail, though, because the things that are going on in the market are not the typical bank-lending activities that led to the number of failures we saw a few years ago. This is something that is dealing with what S&P says about the debt rating in the U.S.. More bank failures are coming, but they would have come regardless of whether the market took a plunge. The FDIC has a considerable list of problem banks, Mancusi says. I think we'll see more failures, and the FDIC selling the deposits and some of the assets to stronger institutions.. Johnson says the market fluctuations should encourage core processors to offer more fraud-prevention solutions and measures that are linked to AML. [See AML Case Study: New Way to Fight Fraud.]. There's a clearly a need to do that, and it does affect the cost if those two are together, he says.. McNelley adds that nothing is ever exempt from budgetary cuts, of course, but says there will be increasing pressure on all vendors to prove their worth, whether they are incumbent or new solutions.. The bar for new business cases also rises during times of recession, as IT resources are typically even more constrained than they are during the positive economic times, McNelley says. Fraud-prevention vendors will need to show their ability to provide a solid ROI, and if they can also show how their solution can help minimize false-negatives, thus improving the customer experience, that can be a powerful way to help prioritize their business case versus others competing for shared resources.",F,7,0,0,1,0,0,0,"I don't know that this market is necessarily going to cause more banks to fail, though, because the things that are going on in the market are not the typical bank-lending activities that led to the number of failures we saw a few years ago."
1,"Graham Ivan Clark , who is now 18, pleaded guilty to 30 felony charges stemming from a worldwide Twitter hack during which he gained access to celebrity Twitter accounts by tricking several Twitter employees into sharing needed admin credentials.. Clark used this access to post a tweet on 45 of the accounts asking for donations of $1,000 in bitcoin. The note said anyone sending the money would be sent $2,000 in return. This resulted in about $118,000 being stolen from 360 people, the state attorney's office in Tampa, Florida, says.. He took over the accounts of famous people, but the money he stole came from regular, hardworking people, says Andrew Warren, the state attorney in Florida's 13th Judicial District, who prosecuted the case. Graham Clark needs to be held accountable for that crime, and other potential scammers out there need to see the consequences.. By agreeing to the deal, Clark was sentenced as a juvenile under Florida's Youthful Offender Act, with the seven months he has already spent in jail being counted against the sentence. He was originally charged as an adult. If Clark violates the terms of his three-year probation, he will face a 10-year sentence in an adult prison, prosecutors say.. The Scam. Clark was arrested on July 30, 2020 - when he was 17 years old - for conducting the campaign that allowed him to gain access to 130 Twitter accounts, including those of now President Joe Biden, Bill Gates and Elon Musk.. Using social engineering techniques, Clark conducted a multistep telephone scam against several Twitter employees to obtain the admin-level credentials needed for accessing the internal support tools available to only a few employees that would allow him to take over the celebrity accounts, Twitter said in a report issued last year.. On July 15, 2020, Clark used the stolen admin tool to access Twitter's internal system and formally took control of the accounts, Warren says. 
Posing as account owner, he posted messages on 45 accounts with the same theme:. Bill Gates' verified Twitter account was hijacked for a cryptocurrency scam. (Source: SeekingAlpha). About 360 people fell for the crytocurrency scam, sending in 12.86 bitcoin, which equaled $117,440 at the time, Warren says. Law enforcement authorities seized the money, and it will be returned to its owners, the prosecutor says.. The Co-Conspirators. Two alleged co-conspirators also charged in connection to the July 2020 hack are Nima Fazeli, aka Rolex, 22, of Orlando, and U.K. resident Mason Sheppard, aka Chaewon, 19.. Fazeli was arrested and charged with aiding and abetting the intentional access of a protected computer. Sheppard was charged with conspiracy to commit wire fraud, conspiracy to commit money laundering and the intentional access of a protected computer, according to the U.S. Attorney's Office for the Northern District of California, which is overseeing the federal prosecutions.. Court documents dated Feb. 1 state Fazeli and prosecutors are negotiating a potential resolution of the case. A hearing on the matter was set for March 8, but at this time, no other information from the Justice Department is available.. Fazeli's lawyers did not immediately reply to a request for an update on his case.. Sheppard's case is still pending.",I,4,0,0,0,1,0,0,"Sheppard was charged with conspiracy to commit wire fraud, conspiracy to commit money laundering and the intentional access of a protected computer, according to the U.S. Attorney's Office for the Northern District of California, which is overseeing the federal prosecutions.. Court documents dated Feb. 1 state Fazeli and prosecutors are negotiating a potential resolution of the case."
2,"This approach may be the handiwork of an advanced persistent threat group known as APT32, or OceanLotus, which has ties to Vietnam, Malwarebytes says. The domain used to host some of the data is registered to Ho Chi Minh City, Vietnam, says Hossein Jazi, a senior threat researcher with Malwarebytes, and Jérôme Segura, director of threat intelligence for the security firm.. Starts With Phishing. The Malwarebytes researchers say the attack kicks off with a phishing scam that uses the subject line Your Right to Compensation. The email contains a zip file that hosts a document labeled Compensation manual.doc.. The document says it is encrypted and requests that the victim enable editing. When this is done, the victim is taken to a website where the fileless malware is loaded into the Windows Error Reporting system, according to the report.. The attackers use the Windows Error Reporting service because that makes the attack more difficult to detect, according to Malwarebytes. Werfault.exe, the Windows Error Reporting process of Windows 10, is used to report errors. If any application or hardware crashes in a device, then Werfault.exe makes it possible to forward the crash report to Microsoft, the researchers note.. Inside [the document] we see a malicious macro that uses a modified version of CactusTorch VBA module to execute its shellcode. CactusTorch is leveraging the DotNetToJscript technique to load aNet compiled binary into memory and execute it from vbscript, the Malwarebytes researchers say.. The use of CactusTorch is another indicator APT32 may be behind the campaign because the group is known to use that VBA module to drop variants of the Denis remote access Trojan, according to Malwarebytes.. However, since we were not able to get the final payload we cannot definitely attribute this attack to APT32, the researchers say (see: Vietnamese APT Group Targets BMW, Hyundai: Report).. Kraken Loader. 
The loaded payload is aNET Dynamic Link Library with Kraken.dll as its internal name, researchers say.. This DLL is a loader that injects an embedded shellcode into WerFault.exe. To be clear, this is not the first case of such a technique. It was observed before with the NetWire RAT and even the Cerber ransomware, Segura and Jazi note.. The researchers report that the loader has two main classes - Kraken and Loader. Kraken contains the shellcode that gets injected into the target process defined in this class as WerFault.exe.. It only has one function that calls the Load function of Loader class with shellcode and target process as parameters. Whereas, the Loader class is responsible for injecting shellcode into the target process by making Windows API calls, the researchers note.. Anti-Analysis Checks. To perform anti-analysis checks, the hackers created multiple threads to make sure the fileless malware is not running in a sandbox environment or in a debugger. Researchers first checked the existence of a debugger by calling GetTickCount, which is a timing function that is used to measure the time needed to execute some instruction sets.. The Malwarebytes researchers note: In this thread, it is being called two times before and after a sleep instruction and then the difference is being calculated. If it is not equal to 2, the program exits, as it identifies it is being debugged.",I,4,0,0,0,1,0,0,"When this is done, the victim is taken to a website where the fileless malware is loaded into the Windows Error Reporting system, according to the report.."
3,"The Healthcare & Public Health Sector Coordinating Councils (HSCC) published model contract language to help healthcare organizations ensure medical device security when crafting contracts with device manufacturers.. Mayo Clinic, Premier Inc., and Siemens Healthineers led the drafting process intending to deliver a template to help healthcare organizations and medical technology companies navigate and create cybersecurity contractual terms and conditions.. The need for a contract template stemmed from ongoing complications between healthcare organizations and medical device manufacturers (MDMs) regarding responsibility, accountability, and varying cybersecurity expectations.. Dig Deeper. 7 New Vulnerabilities Threaten Supply Chain, Medical Device Security. BD Discloses Viper, Pyxis Medical Device Vulnerabilities. Healthcare IoT, Medical Device Vulnerability Disclosures Skyrocket. “These factors have introduced and sustained ambiguities in cybersecurity and accountability between MDM’s and [healthcare organizations] that historically have been reconciled at best inconsistently in the purchase contract negotiation process, leading to downstream disputes and potential patient safety implications,” an accompanying press release explained.. To ensure adequate security measures, the contract template includes language that articulates compliance and security requirements surrounding how healthcare organizations and MDMs store, transfer, or access medical devices and network-connected solutions.. HSCC noted that the contract template is not a one-size-fits-all solution, and organizations will have to modify some aspects during contract negotiations to align with their needs. The guide is meant to serve as a scalable template for organizations of any size.. The HSCC Model Contract Language task group attributed some miscommunications between healthcare organizations and MDMs to inconsistent contract terminology. 
The group suggested that the inconsistent language ultimately led to cybersecurity responsibility and accountability ambiguities in the past.. MDMs and healthcare organizations are linked by HIPAA business associate agreements (BAAs), which subject vendors with protected health information (PHI) access to the same security standards as HIPAA-covered entities.. The model contract language can serve as a standalone agreement or as an addendum to a BAA, a Master Service Agreement (MSA), or a Requests for Proposals (RFP).. HSCC organized the model contract framework into three key cybersecurity pillars: performance, maturity, and product design maturity. The task group organized contract clauses into fourteen core principles within these pillars.. As with any business associate, healthcare organizations are responsible for ensuring that the vendor has implemented adequate security standards before entrusting them with PHI. Organizations must conduct regular risk assessments and implement technical safeguards to prevent cyberattacks and data breaches.. This is especially apparent with medical devices, which are notorious for being difficult to manage from a cybersecurity perspective due to their mobility and the number of legacy devices that cannot be patched.. Medical devices are often the subject of severe vulnerability disclosures. A recent report by Unit 42 found that 75 percent of 200,000 analyzed infusion pumps contained known security gaps. Claroty also found that healthcare IoT, IT, and medical device vulnerability disclosures have increased exponentially over the last four years.. BD disclosed severe vulnerabilities in some of its BD Pyxis and BD Viper LT products in early March. Separately, Forescout’s global research team discovered seven vulnerabilities, known as Access:7, that impact the PTC Axeda agent and could result in supply chain and medical device security issues.. 
These recent disclosures underscore the need for consistent communication and thorough contract language between healthcare organizations and medical device manufacturers.. The model contract language included considerations about vulnerability management, security patch validation, and incident response management, among other core principles.. “Medical device manufacturers, health delivery organizations, and group purchasing organizations are encouraged to closely review this contract language and adopt as much as is appropriate for the organization,” the press release continued.. “The more uniformity and predictability the sector can achieve in cross enterprise cybersecurity management expectations, the greater strides it will make toward patient safety and a more secure and resilient healthcare system.”. Tagged HIPAA Business Associates Internet of Things Medical Device Security.",CI,6,0,1,0,0,0,0,These recent disclosures underscore the need for consistent communication and thorough contract language between healthcare organizations and medical device manufacturers..
4,"The White House says the program is part of a broader cybersecurity plan designed to address issues across the nation's critical infrastructure.. The 100-day initiative will involve government agencies that are responsible for the security of critical infrastructure as well as businesses and private utilities that oversee or own infrastructure, such as electrical distribution systems that deliver power to homes.. Public-private partnership is paramount to the administration's efforts because protecting our nation's critical infrastructure is a shared responsibility of government and the owners and operators of that infrastructure, says Emily Horne, a spokesperson for the National Security Council.. Some lawmakers and a government watchdog agency have recently criticized the Department of Energy for its cybersecurity practices, especially in the wake of the SolarWinds supply chain attack, which led to follow-on attacks on the DOE and eight other federal agencies, plus 100 companies.. In March, the Government Accountability Office released a report that found the U.S. electrical grid's distribution systems, which deliver electricity directly to customers, are increasingly vulnerable to cyberthreats and urged the Energy Department to incorporate these systems into its cybersecurity plans (see: GAO: Electrical Grid's Distribution Systems More Vulnerable).. Some security experts have criticized President Joe Biden's $2 trillion infrastructure spending proposal for lacking cybersecurity specifics, including security enhancements for the nation's electrical grid. Others analysts, however, noted that any improvements in infrastructure would likely strengthen security by updating and replacing older equipment (see: Biden's Infrastructure Plan: 3 Cybersecurity Provisions).. Security Improvements. 
As part of the 100-day plan for the nation's electrical grid, the Energy Department's Office of Cybersecurity, Energy Security, and Emergency Response, or CESER, will work with the Cybersecurity and Infrastructure Security Agency and private utilities to make a series of cybersecurity improvements.. The goals of the project include:. Encouraging owners and operators of power plants and facilities to enhance security incident detection, mitigation, response and forensic capabilities;. Deploying technologies to allow for real-time situational awareness within industrial control systems and operational technology networks;. Reinforcing the IT networks and infrastructure used within facilities;. Deploying technologies to increase the visibility of threats within ICS and OT systems.. The Energy Department also is seeking suggestions from electric utilities, energy companies, academia, research laboratories, government agencies and others for improving supply chain security within U.S. energy systems.. Modernization. While the emphasis on protecting and shoring up cybersecurity around the nation's electrical grid is long overdue, updating and improving complex OT and ICS systems will be a time-consuming process, says Austin Berglas, who formerly was an assistant special agent in charge of cyber investigations at the FBI's New York office.. Operational technology - or computing systems used to manage industrial operations opposed to administrative actions - often rely on outdated, unprotected systems that were not manufactured with security in mind, says Berglas, who is now global head of professional services at cybersecurity firm BlueVoyant. In many instances, this will require a complete transformation of process and technology. There will need to be a significant investment in resources, both human and capital, to bring many energy companies up to a higher standard of cybersecurity.. 
Padraic O'Reilly, co-founder and chief product officer of CyberSaint Security, also notes that when making changes and updates to ICS and OT systems, the federal government is in a unique position to help private organizations focus on what needs to be modernized.. With so much of the infrastructure privatized and in need of modernization, it can be difficult to get everyone pulling in the same direction, and the Department of Energy and CISA can really help with this, O'Reilly says.. Growing Concerns. Lawmakers are growing more concerned about cyberthreats facing the nation's electrical grid, including from nation-state attackers and others.. In March, a bipartisan group of U.S. senators sent a letter to Energy Secretary Jennifer Granholm demanding that the DOE place a greater emphasis on cybersecurity as part of strategic planning and that the new administration keep the leadership of CESER in place to better respond to threats (see: Senators Raise Concerns About Energy Dept. Cybersecurity).. At a recent U.S. Senate Intelligence Committee hearing, Sen. Dianne Feinstein, D-Calif., asked Gen. Paul Nakasone, the head of the U.S. Cyber Command and the National Security Agency, about China's ability to use cyber tools to disrupt natural gas pipelines and Russia's ability to interfere with the U.S. electrical grid (see: Senators Push for Changes in Wake of SolarWinds Attack).. Nakasone acknowledged that China and Russia have continued to improve their cyber capabilities and noted that the U.S. government is looking to strengthen its defenses for critical infrastructure.",IA,3,0,0,0,0,1,0,"As part of the 100-day plan for the nation's electrical grid, the Energy Department's Office of Cybersecurity, Energy Security, and Emergency Response, or CESER, will work with the Cybersecurity and Infrastructure Security Agency and private utilities to make a series of cybersecurity improvements.."
5,"In their letter to Blinken dated Sept. 22, Cotton and Gallagher, who is a member of the Cyberspace Solarium Commission, say Huawei's cloud services run in more than 40 countries, providing potential system access to the CCP. This includes projects in countries of immense geopolitical importance to the U.S., such as Egypt, Indonesia, Malaysia, Mexico, Saudi Arabia, Turkey and the United Arab Emirates, they say.. The GOP lawmakers echo ongoing security and privacy concerns related to the telecom giant and cite a nearly 170% revenue increase for Huawei's cloud offerings in 2020. This undermines U.S. efforts to curtail [its] power, influence and financial strength, they add.. The lawmakers ask Blinken to outline the department's related actions/plans, including efforts to prevent other governments from adopting Huawei technologies, and similarly, whether alternatives can be presented in those cases.. Both the U.S. Department of State and Huawei could not immediately be reached for comment Thursday. Huawei has previously denied allegations that it poses a national security threat.. Assisting China's MSS?. The international threat to data and integrity posed by Huawei extends far beyond 5G, Cotton and Gallagher say in the letter this week. They cite an alleged incident of China reportedly spying on the African Union headquarters through Huawei-made cameras it installed in 2012 - alongside the AU's information and computer systems. The lawmakers claim China installed backdoors in the systems and reportedly obtained sensitive information. Huawei has denied the allegations.. If allowed to proliferate, Huawei's cloud services could give the Chinese Communist Party similar access to additional governments, companies, and other important institutions, the GOP lawmakers write.. 
Rosa Smothers, a former technical intelligence officer and cyber threat analyst for the Central Intelligence Agency, tells Information Security Media Group that China's National Intelligence Law, enacted in 2017, created legal responsibilities for Chinese companies to provide access, cooperation or support for Beijing's intelligence collection needs.. Noting that, Smothers, the senior vice president of operations at the security firm KnowBe4, says, Huawei and other Chinese companies that can serve as a force multiplier for the Ministry of State Security will do so.. E-Government Services. Cotton and Gallagher contend that Huawei Cloud's e-Government services, which streamline digitization, tax services, national ID systems and elections, may expose its clients to the prying eyes of the CCP.. They add, When Huawei's client is a country, its entire population and political structure sits in the crosshairs.. The threat, they add, could lead to CCP access to the personal data of visiting U.S. citizens, service members, businesspersons and diplomats.. Our FCC designated Huawei as a national security threat last year, and I expect the [current] administration will maintain that stance, KnowBe4's Smothers says. But anything we can do to dissuade other countries from leveraging Huawei's cloud products is better not just for our national security but the security of the 40 countries where these cloud services are currently in use.. U.S. Senator Tom Cotton, R-Ark., co-author of a Huawei letter to the Department of State (Photo: Gage Skidmore via Flickr). 'Clean Network' Program. Cotton and Gallagher also address the Clean Network program launched by former Secretary of State Mike Pompeo and former Undersecretary of State Keith Krach, which they say helps address the long-term threat [that] malign authoritarian actors pose to data privacy, security and democratic values. They press Blinken on whether the program will proceed during the Biden administration.. 
The alliance of democracies yielded digital trust and democratic values commitments from more than 60 countries and 200 telecom companies, among others, after its launch in 2020. At the time, it was a departure from then-President Donald Trump's America First economic strategy.. We must combat Huawei as a whole and target each of the company's commercial units, including their 5G, cloud services, mobile-phone, and underwater cable businesses, the letter's authors maintain.. Sanctions. In the National Defense Authorization Act for 2019, the U.S. banned federal use of equipment from Huawei and fellow Chinese telecom company ZTE. In May 2019, the Department of Commerce added Huawei to its Entity List for its dealings with the Iranian government, restricting U.S. companies from doing business with the multinational giant without a special license.. Additionally, in 2020, the U.S. extended its ban to include semiconductors. And in June of that year, the Federal Communications Commission officially classified Huawei as a national security threat and later prohibited approvals of Huawei equipment in U.S. telecom networks.. In July, the FCC finalized a $1.9 billion plan that will assist smaller, rural telecom carriers in paying to rip and replace Huawei and ZTE technologies from their networks.. Also in July, Sens. Mark Warner, D-Va., and Cotton, looked to place additional restrictions on the use of telecom equipment from Huawei and ZTE by introducing a still-pending bill that prohibits the use of funds from the $1.9 trillion American Rescue Plan stimulus package to buy such equipment (see: Senate Bill Proposes Further Restrictions on Huawei, ZTE).",A,5,1,0,0,0,0,0,"In the National Defense Authorization Act for 2019, the U.S. banned federal use of equipment from Huawei and fellow Chinese telecom company ZTE."
6,"Sharing clinical trial data is an important aspect for healthcare, but the key is to ensure that facilities are securely sharing data, according to a recent report from the Institute of Medicine (IOM).. Increasing scientific knowledge is the ultimate goal of data sharing. However, it is essential that the public also trusts that the data sharing is being done in a secure way, the report stated. There are significant risks, burdens, and challenges that accompany data sharing. That is why IOM formed a committee “to develop guiding principles and a practical framework for the responsible sharing of clinical trial data.”. “Policies for granting access to data should be in the service of several goals — protecting the privacy of participants; reducing risk of invalid analyses or misuse; avoiding undue burdens on data users and harm to investigators and sponsors; and enhancing public trust in clinical trial data sharing,” the committee members stated in the report.. The IOM report also explained how the the timing of data sharing should balance three separate goals:. Consumer Health Privacy Violated Online, Says New Study. Legal experts examine future HIPAA audits, OCR guidance. Despite Microsoft Patch, Attacks Using WannaCry Exploit on the Rise. 1. allow a fair opportunity for clinical trialists to publish results before secondary investigators gain access to the data;. 2. allow secondary investigators to access unpublished trial data after a fair period has passed or reproduce the findings of a published analysis; and. 3. protect the commercial interests of sponsors in gaining regulatory approval for a product so that they receive fair financial rewards for their investment.. Moreover, the IOM report stated that while it is important for clinical analysts to have enough time after a trial has finished to go over all data, the analysis period should not exceed 18 months.. 
“When that period has passed — regardless of whether the trial results have been published — the IOM committee finds that the scientific process is best served by allowing other investigators to access the data,” the committee explained.. Also, when trial findings are published before the end of those 18 months, the committee recommended that the supporting analytic dataset be shared within six months of publication. As systems for “responsible data sharing” continue to evolve, the committee said that it hoped that simultaneous sharing would soon become the norm.. “Greater data sharing could enhance public well-being by accelerating the drug discovery and development process, reducing redundant research, and facilitating scientific innovation. Before these benefits can be realized, however, stakeholders must confront significant risks and challenges,” the committee members wrote.. The IOM committee is not the only group investigating the benefits of health data sharing. 23andMe partnered with Pfizer last week to further the use of health data sharing for genetic research purposes. The two companies plan to collaborate on certain genome-wide association studies, surveys, and clinical trial recruitment.. Many consumers are also willing to share their health data, if it is done anonymously and in a secure way, according to a recent NPR-Truven Health Analytics Poll. In terms of drug/pharma researchers, 87 percent of respondents stated that they were comfortable sharing their data anonymously. The number was even higher of government researchers were asking for health information – 92 percent stated they were fine with anonymous health data sharing.. Tagged Protected Health Information.",CI,6,0,1,0,0,0,0,"However, it is essential that the public also trusts that the data sharing is being done in a secure way, the report stated."
7,"The National Institute of Standards and Technology (NIST) published its official definition of “critical software,” as instructed by President Biden’s executive order (EO) on improving the nation’s cybersecurity. NIST solicited feedback and position papers from the community to settle on a reasonable definition.. The executive order also directs the Cybersecurity & Infrastructure Security Agency (CISA) to use the “critical software” definition to create a list of categories of software that might fall under the first phase of the executive order’s implementation. NIST proposed a phased implementation approach to give the government and software industry time to secure the supply chain of critical software.. In a white paper released on June 25th, a day before the EO’s official deadline, NIST explains that “One of the goals of the EO is to assist in developing a security baseline for critical software products used across the Federal Government. The designation of software as EO-critical will then drive additional activities, including how the Federal Government purchases and manages deployed critical software.”. Dig Deeper. NIST Releases Draft of Ransomware Risk Management Framework. NIST Unveils Guide to Mobile Device Authentication for First Responders. NIST IoT Guidance for Network-Based Attacks, Device Communication. NIST uses the term “EO-critical” to differentiate between the common usage of the word “critical” and avoid any confusion. The official definition states:. EO-critical software is defined as any software that has, or has direct software dependencies upon, one or more components with at least one of these attributes:. is designed to run with elevated privilege or manage privileges;. has direct or privileged access to networking or computing resources;. is designed to control access to data or operational technology;. performs a function critical to trust; or,. operates outside of normal trust boundaries with privileged access.. 
NIST recommends that the first EO implementation phase focus on on-site standalone software that has security functions and the potential to be compromised. Future phases will tackle cloud-based software, software development tools, software components in operational technology (OT), and software that controls data access.. The paper provides a lengthy list of EO-critical software, along with descriptions, types of products, and rationale for the inclusion of each category of software. Categories include endpoint security, remote scanning, and identity, credential, and access management (ICAM), among others. This list is just a head start, and CISA will issue the finalized list in the near future.. NIST’s work contributes to the executive order’s main goal of managing risks to the cyber supply chain within the federal government. While private companies will not be required to follow NIST’s software supply chain guidelines, it is strongly recommended. Companies that sell to the federal government will need to comply with the government’s software supply chain practices.. Biden’s executive order contained a long list of tasks for NIST with deadlines extending into 2022. By July 11th, NIST will publish guidance outlining critical software security measures, and guidance on the minimum standards for the testing of a vendor’s source code.. Outside of it is executive order duties, NIST recently released a preliminary draft of its ransomware risk management framework, which aims to help organizations respond to ransomware attacks. The draft identifies crucial steps to maintaining cybersecurity, including using antivirus software, restricting the use of personal devices at the workplace, and keeping computers up to date.. 
Tagged Cyber Hygiene Cybersecurity NIST.",A,5,1,0,0,0,0,0,The executive order also directs the Cybersecurity & Infrastructure Security Agency (CISA) to use the “critical software” definition to create a list of categories of software that might fall under the first phase of the executive order’s implementation.
8,"In its new report, entitled The State of DeFi Security 2021, CertiK researchers say, however, that due to the uptick in investment, 2021 losses represented just 0.05% of crypto's total market capitalization - dropping 17% from 2020.. CertiK credits much of the growth in digital currencies to the rise of Binance Smart Chain, whose total value locked, or TVL, grew from $62 million to $21 billion in 2021 - a 31,000% increase, the firm says.. But the rise of DeFi protocols - which do not rely on traditional intermediaries and instead run on peer-to-peer smart contracts across decentralized apps, or DApps - has made the reward for successful exploits even greater, CertiK says. And increased interoperability, it says, has opened up new attack vectors.. According to DeFi Pulse, which tracks related investments, DeFi had $95 billion in TVL at the time of writing.. Centralization and Other Risks. CertiK researchers, who audited more than 1,700 projects, say the most common vulnerability detected across DeFi protocols was centralization risk, in which a single actor controls multiple addresses. CertiK encountered 286 discrete centralization risks across the 1,737 audits performed in 2021. It says: Centralization is antithetical to the ethos of DeFi and poses major security risks. Single points of failure can be exploited by dedicated hackers and malicious insiders alike.. Other common vulnerabilities included 211 instances of mission event emissions, or functions that should emit notifications to users when sensitive variables or important processes are changed. CertiK also cites the use of an unlocked compiler version, detected in 176 instances, which can lead to differences in bytecode.. CertiK came across 104 lines of code lacking proper input validation - or inputs that limit the functionality of an executable to a set of known possibilities.. The firm warned against a reliance on third-party dependencies, which it detected in 102 cases. 
It writes: A developer can only control the security of their own code, not that of the external contracts with which theirs interact.. Offering a similar warning on DeFi security, Jennifer Fernick, a governing board member of The Linux Foundation’s Open Source Security Foundation, tells ISMG: One leaked cryptographic key or a single software flaw could lead to the collapse of entire organizations. I suspect that serious DeFi companies will, over time, more easily understand the intrinsic value of robust cybersecurity than their so-called 'web2' counterparts, mainly because for DeFi, 'code is law,' and there is so much at stake that can vanish in an instant.. Fernick, who is currently the global head of research at cybersecurity consulting firm NCC Group, says she expects to see a potentially unprecedented market-driven push for higher assurance systems for DeFi companies.. And Connie Lam, head of CertiK's Incident Response Team, tells ISMG that crypto markets are no doubt widening, and the need for cybersecurity is intensifying. Still, she says, We're entering a multi-chain world. … The real opportunity [moving forward] lies in efficiently maximizing opportunity across all chains.. Source: CertiK's The State of DeFi Security 2021 report. 'Security: A Foundational Concern'. Noting that Solidity - the language in which Ethereum Virtual Machine, or EVM, smart contracts are written - is only seven years old, CertiK states, Developers are still exploring the possibilities of smart contract code, and there is no better time than these early days to make security a foundational concern and protect users well into the future.. The CertiK report says hasty forks, or chain-splits following protocol updates, unaudited deployments and outright scams resulted in significant losses. It says that Uranium Finance, a fork of Uniswap deployed on Binance Smart Chain, lost $57 million in user funds due to a single character in its source code.. 
Any changes to a platform's code should be reviewed and audited, no matter how small the initial modification is, the CertiK researchers say. As we've seen, a byte-sized piece of code can have multimillion-dollar ramifications.. Regulators Watching Closely. Citing a wide-scale increase in crypto adoption, the CertiK researchers also acknowledge that regulators have circled DeFi and the broader cryptocurrency market of late.. In China in 2021, regulators cracked down on cryptocurrency and cryptomining, resulting in an exodus of miners from the Chinese mainland.. In the U.S., under the leadership of new Chair Gary Gensler, the Securities and Exchange Commission has repeatedly signaled potentially broader enforcement of securities laws to govern crypto markets.. On Aug. 3, 2021, Gensler called crypto markets rife with fraud, scams and abuse, and urged Congress to provide the SEC with additional authority to regulate the markets. Gensler also noted in August that DeFi projects are not immune to regulation - with features that warrant federal oversight (see: SEC to Monitor Illicit Activity on DeFi Platforms).. Outspoken critic Sen. Elizabeth Warren, D-Mass., has also called for comprehensive regulation around cryptocurrencies - citing both security and market risks, including crypto's highly volatile nature.. Going forward, security will continue to be inextricably tied to the future of DeFi, the CertiK researchers say. Without meaningful security that protects users and secures platforms, innovation will suffer and interest will die off.. U.S. Federal Reserve Chair Jerome Powell (Photo: Federalreserve via Flickr). Pending Legislation?. Elsewhere on the regulatory front, Federal Reserve Chair Jerome Powell said on Tuesday before the Senate Banking Committee that the Fed will be issuing its report on cryptocurrencies - including the feasibility of a central bank digital currency, or CBDC - in the coming weeks. 
Powell appeared for a hearing to be reconfirmed as Fed chair for four years.. Meanwhile, on Capitol Hill, Rep. Tom Emmer, R-Minn., tweeted on Tuesday that he intends to introduce legislation around digital currencies, but did not offer specifics. Last month, Sen. Cynthia Loomis, R-Wyo., a longtime crypto evangelist, announced that she too will introduce a bill that attempts to regulate the cryptocurrency space - including the creation of a self-regulatory body under the jurisdiction of the SEC and its sister agency, the Commodities Futures Trading Commission (see: GOP Senator to Introduce 'Comprehensive' Crypto Regs Bill).. +++. [Update - Jan. 12, 6:15 p.m.]: On Wednesday, Emmer officially introduced a bill that would prohibit the Federal Reserve from issuing a CBDC directly to individuals. In a statement, he said: Not only would this CBDC model centralize Americans' financial information, leaving it vulnerable to attack, but it could also be used as a surveillance tool. … Requiring users to open up an account at the Fed to access a U.S. CBDC would put the Fed on an insidious path akin to China's digital authoritarianism. The congressman said any CBDC must be accessible, transact on a transparent blockchain, and maintain the privacy elements of cash.",F,7,0,0,1,0,0,0,"In its new report, entitled The State of DeFi Security 2021, CertiK researchers say, however, that due to the uptick in investment, 2021 losses represented just 0.05% of crypto's total market capitalization - dropping 17% from 2020.. CertiK credits much of the growth in digital currencies to the rise of Binance Smart Chain, whose total value locked, or TVL, grew from $62 million to $21 billion in 2021 - a 31,000% increase, the firm says.."
9,"Impresa says that the websites of Expresso, SIC and the Blitz magazine are temporarily unavailable, impeding its ability to report news from Portugal. In its effort to recover from Sunday's cyberattack, the media group has launched a temporary website: Expresso.pt.. The Impresa group, in its statement, says that it is collaborating with authorities to resolve the situation at the earliest opportunity and guarantees delivery of its next weekly edition.. In the wake of the cyberattack, several readers and publications, such as CNN Portugal and Publico, expressed solidarity on Twitter and Facebook with Expresso's new motto #liberdadeparainformar, which translates to Freedom to Inform.. Online media outlet The Record reports that the Lapsus$ ransomware group has claimed responsibility for the attack that impacted Impresa's IT server. The attack knocked Expresso and SIC websites offline, in addition to SIC’s internet streaming service.. Following takeover of the Expresso and SIC websites, the Lapsus$ group posted a ransom note in Portuguese which translates to: The data will be leaked if the required amount is not paid. We have access to their cloud dashboards (AWS), among other types of devices. This was followed by the Lapsus$ group' Telegram ID and email address.. Soon after, the Lapsus$ group posted from Expresso's official Twitter account saying Lapsus$ is officially the new president of Portugal.. A report by news agency Reuters also says that the Lapsus$ group sent a phishing e-mail to Expresso subscribers.. The Impresa group has not responded to Information Security Media Group's request for information on the nature of intrusion or demands made by the ransomware group.. The Lapsus$ Ransomware Group. Avkash Kathiriya, vice president of research and innovation at cybersecurity firm Cyware tells ISMG that the Lapsus$ group hit the limelight in December 2021 following a ransomware attack on websites owned by Brazil's Ministry of Health. 
The group claimed to have stolen and subsequently deleted around 50TB of data from the ministry’s systems.. Following the cyberattack on Brazil’s health ministry, Lapsus$ also claimed to have breached Brazilian telecom provider Claro and allegedly gained access to a gargantuan data trove of 10,000TB, according to Kathiriya.. Kathiriya says that based on the messaging on its website and Telegram channel, the group is financially motivated and does not seem to be focused on any particular industry. The group initiated its Telegram channel on Dec. 10, 2021, as a medium to expose its victims and provide evidence of breaches.. Targeting Portuguese-speaking countries and usage of the Brazilian Portuguese dialect in its messages hints at the fact that the group may be based in Brazil, says Kathiriya.. He points out that in most of its attacks, the Lapsus$ group claimed to have gained access to cloud-based servers and applications of the targeted organizations, such as AWS instances and VMware vCenter servers, but so far it is unclear which malware is being used by the group.. News Outlets Under Attack - 3 Ransomware Hits in 3 Weeks. The cyberattack on Impresa is the third security incident in the news publication space in just three weeks - all three incidents were the result of ransomware attacks and all of the affected news websites were knocked offline for an extended period.. An incident very similar to Lapsus$ group's ransomware attacks on Expresso and SIC websites occurred on December 28, when a ransomware attack on Norway-based media company Amedia brought its presses to a halt.. According to a report on Digi.no, Amedia's executive vice president of technology, Pål Nedregotten, said in a press conference that a known security hole in Windows was exploited and that impacted Amedia's Windows servers.. Preceding Amedia's ransomware incident, the Philippines' biggest and oldest television broadcaster, ABS-CBN, fell prey to a cyberattack on December 11. 
According to local media organization Rappler, ABS-CBN's News website was targeted by a distributed denial of service or DDoS attack.. As with Portugal's Impresa group, ABS-CBN was also forced to push its news updates through its social media channels.. A report by cybersecurity company Fortinet says that media companies using outdated software and ineffective authentication and verification procedures coupled with an ever-increasing attack surface gives hackers a range of options to attack their digital infrastructure.. Fortinet advises media firms to apply least privilege access, deploy multi-factor authentication and run backups frequently along with regular pen-testing and patching exercises.. In a 2015 cyberattack which resulted in a dozen TV5 Monde channels blacking out simultaneously, IT company Atos found that TV5 Monde multimedia servers had their remote desktop protocol ports exposed to the internet and the staff was using default usernames and passwords.. Using social engineering, hackers targeted TV5 Monde's journalists and were eventually successful in penetrating the network through a Trojan and deploying malware in TV5 Monde's IT infrastructure, following which they were able to create accounts with administrator privileges.",IA,3,0,0,0,0,1,0,The Impresa group has not responded to Information Security Media Group's request for information on the nature of intrusion or demands made by the ransomware group..


*The* metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [None]:
metric

Metric(name: "accuracy", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (`list` of `float`): Sample weights Defaults to None.

Returns:
    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`.. A higher score means higher accuracy.

Examples:

    Example 1-A simple example
        >>> accuracy_metric = datasets.load_metric("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
        >>> print(results)
        {'accuracy': 0.5}
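
The effect of the `normalize` argument described in the usage text above can be reproduced by hand. The following is a minimal pure-Python sketch, not the library's implementation; the function name `accuracy` is an illustrative assumption, and the inputs are the ones from the docstring example.

```python
def accuracy(predictions, references, normalize=True):
    """Fraction (or count, if normalize=False) of predictions equal to the references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    # Count the positions where the predicted label matches the true label.
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references) if normalize else correct


references = [0, 1, 2, 0, 1, 2]
predictions = [0, 1, 1, 2, 1, 0]

fraction = accuracy(predictions, references)                # 3 of 6 match
count = accuracy(predictions, references, normalize=False)  # raw number of matches
print(fraction, count)
```

With these inputs three of the six positions match, which agrees with the `{'accuracy': 0.5}` result shown in the docstring example.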

   

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in a format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Downloading (…)lve/main/config.json:   0%|          | 0.00/760 [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/798k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.38M [00:00<?, ?B/s]

We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you get an error with the previous call, remove that argument.

Depending on the model you selected, calling the tokenizer on some text returns a dictionary with different keys (for example `input_ids` and `attention_mask`). They don't matter much for what we're doing here (just know they are required by the model we will instantiate later); you can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

To preprocess our dataset, we will thus need the names of the columns containing the sentence(s). The following dictionary maps each task to its column names:

In [None]:
data_key = {
    "Incident": ("SUM", None),
}

We can double check it does work on our current dataset:

In [None]:
sentence1_key, sentence2_key = data_key["Incident"]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

Sentence: As part of the 100-day plan for the nation's electrical grid, the Energy Department's Office of Cybersecurity, Energy Security, and Emergency Response, or CESER, will work with the Cybersecurity and Infrastructure Security Agency and private utilities to make a series of cybersecurity improvements..


We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that any input longer than what the selected model can handle is truncated to the maximum length accepted by the model.

In [None]:
def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
preprocess_function(dataset['train'][:5])

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'input_ids': [[228, 188, 20, 18, 842, 13, 765, 493, 28, 18, 1051, 26, 23, 5836, 8174, 19, 18, 3425, 760, 26, 23, 1884, 20, 16589, 11805, 19, 3425, 1424, 19, 21, 11535, 18653, 19, 49, 17, 24496, 6031, 19, 53, 154, 33, 18, 16589, 11805, 21, 23729, 1424, 2907, 21, 804, 12038, 22, 144, 24, 461, 20, 11148, 11805, 6371, 9, 9, 4, 3], [67, 48, 7432, 2774, 19, 18, 28979, 10670, 18, 17, 12734, 28, 18, 422, 254, 11442, 68, 527, 21, 12239, 18, 229, 126, 22, 18, 4618, 1448, 23, 19, 549, 22, 18, 419, 9, 9, 4, 3], [228, 361, 786, 22, 22953, 95, 18, 3009, 20, 965, 13, 22883, 11148, 23000, 23, 157, 18, 128, 9, 83, 9, 19, 87, 3578, 11004, 40, 207, 1385, 210, 22, 175, 70, 75, 160, 18, 760, 20, 17092, 1424, 21, 81, 5528, 2358, 41, 24113, 111, 481, 22, 500, 254, 18, 2247, 1339, 23, 20, 18, 819, 146, 21, 2083, 3735, 9, 9, 4, 3], [19552, 21794, 6599, 22, 340, 20, 48, 16653, 13854, 256, 29, 37, 5327, 47, 410, 475, 20, 18, 4987, 19, 59, 27, 611, 148, 14717, 64, 39, 5101, 37, 14033, 17, 10, 3080, 60, 2291, 189
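
The warning printed above indicates that no maximum length was available for this checkpoint, so nothing was actually truncated; passing an explicit `max_length` to the tokenizer call is the usual remedy. As a rough illustration of what truncation to a fixed length does, here is a pure-Python sketch. It is not the tokenizer's implementation: the helper name `truncate_with_specials` is hypothetical, and it assumes (as the output above suggests) that every sequence ends with the two special token IDs 4 and 3, which must survive truncation.

```python
def truncate_with_specials(content_ids, max_length, special_suffix=(4, 3)):
    """Keep at most `max_length` IDs in total, always preserving the closing special tokens."""
    budget = max_length - len(special_suffix)
    if budget < 0:
        raise ValueError("max_length is too small to hold the special tokens")
    # Cut the content first, then append the special tokens so they are never dropped.
    return list(content_ids[:budget]) + list(special_suffix)


content = [228, 188, 20, 18, 842, 13, 765, 493]  # first content IDs of the sample above
shortened = truncate_with_specials(content, max_length=6)
untouched = truncate_with_specials(content, max_length=100)
print(shortened)  # [228, 188, 20, 18, 4, 3]
print(len(untouched))
```

Sequences that already fit within `max_length` come back unchanged apart from the special tokens, which is why truncation is safe to enable for every example.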

To apply this function to all the sentences (or pairs of sentences) in our dataset, we just use the `map` method of the `dataset` object we created earlier. This applies the function to all the elements of all the splits in `dataset`, so every split is preprocessed in a single command.

In [None]:
encoded_dataset = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/6247 [00:00<?, ? examples/s]

Map:   0%|          | 0/2678 [00:00<?, ? examples/s]

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and the cached data therefore must not be reused). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.
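
The idea behind this kind of cache can be illustrated with a toy version. The sketch below is only a conceptual model and is not how 🤗 Datasets implements its cache: the names `fingerprint` and `cached_apply` are hypothetical, and the key is simply a hash of the function's bytecode, its constants and the input data. The `load_from_cache_file` parameter here only mimics the role of the real argument of the same name.

```python
import hashlib

_cache = {}


def fingerprint(function, data):
    """Hash the function's compiled code and the data, so a change in either gives a new key."""
    code = function.__code__
    payload = repr((code.co_code, code.co_consts, data))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_apply(function, data, load_from_cache_file=True):
    """Return (results, was_cached); recompute when the key is new or the cache is bypassed."""
    key = fingerprint(function, data)
    if load_from_cache_file and key in _cache:
        return _cache[key], True
    results = [function(item) for item in data]
    _cache[key] = results
    return results, False


def add_one(x):
    return x + 1


def add_two(x):
    return x + 2


first = cached_apply(add_one, [1, 2])                               # computed
second = cached_apply(add_one, [1, 2])                              # served from the cache
changed = cached_apply(add_two, [1, 2])                             # different function, recomputed
forced = cached_apply(add_one, [1, 2], load_from_cache_file=False)  # cache bypassed
print(first, second, changed, forced)
```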

Note that we passed `batched=True` to encode the texts in batches. This leverages the full benefit of the fast tokenizer we loaded earlier, which uses multi-threading to process the texts in a batch concurrently.
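
To make the difference between per-example and batched mapping concrete, here is a small pure-Python sketch. It is a simplified stand-in rather than the 🤗 Datasets implementation: `toy_map` and `toy_tokenize` are hypothetical names, rows are plain dictionaries, and the "tokenizer" just counts whitespace-separated words. The point it shows is that with batching, the function receives a dictionary mapping each column name to a list of values.

```python
def toy_map(rows, function, batched=False, batch_size=1000):
    """Apply `function` to a list of dict rows, one row or one batch at a time."""
    output = []
    if not batched:
        for row in rows:
            output.append({**row, **function(row)})
        return output
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # A batch is a dict of columns: each key maps to the list of that column's values.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = function(batch)
        for i, row in enumerate(chunk):
            output.append({**row, **{key: values[i] for key, values in result.items()}})
    return output


def toy_tokenize(examples):
    """Works on a single example (str) or on a batch (list of str), like the function above."""
    texts = examples["SUM"]
    if isinstance(texts, str):
        return {"n_words": len(texts.split())}
    return {"n_words": [len(text.split()) for text in texts]}


rows = [{"SUM": "a b c"}, {"SUM": "d e"}, {"SUM": "f"}]
per_row = toy_map(rows, toy_tokenize)
per_batch = toy_map(rows, toy_tokenize, batched=True, batch_size=2)
print(per_batch)
```

Both calls produce the same rows; batching only changes how many examples the function sees per call, which is what lets a fast tokenizer process them together.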

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is sentence classification, we use the `AutoModelForSequenceClassification` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem (11 for our dataset):

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np

num_labels = 11
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Downloading pytorch_model.bin:   0%|          | 0.00/467M [00:00<?, ?B/s]

Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.bias', 'sequence_summary.summary.weight', 'logits_proj.weight', 'logits_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The warning is telling us that some weights (the `sequence_summary` and `logits_proj` layers) are newly and randomly initialized. This is absolutely normal in this case, because the head used to pretrain the model on a language modeling objective is replaced with a new classification head for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a `Trainer`, we will need to define two more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
metric_name = "accuracy"
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=True,
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook and customize the number of epochs for training, as well as the weight decay. Since the best model might not be the one at the end of training, we ask the `Trainer` to load the best model it saved (according to `metric_name`) at the end of training.
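What `load_best_model_at_end=True` combined with `metric_for_best_model` amounts to can be sketched as picking the checkpoint with the best evaluation score. The snippet below is only an illustration in plain Python: the `eval_history` numbers, the checkpoint names and the `pick_best_checkpoint` helper are made up, and the real `Trainer` does this bookkeeping internally.

```python
# Hypothetical per-epoch evaluation results (made-up numbers for illustration).
# The Trainer reports evaluation metrics with an "eval_" prefix.
eval_history = [
    {"checkpoint": "checkpoint-epoch-1", "eval_accuracy": 0.42},
    {"checkpoint": "checkpoint-epoch-2", "eval_accuracy": 0.55},
    {"checkpoint": "checkpoint-epoch-3", "eval_accuracy": 0.61},
    {"checkpoint": "checkpoint-epoch-4", "eval_accuracy": 0.58},
]


def pick_best_checkpoint(history, metric_key, greater_is_better=True):
    """Return the name of the checkpoint with the best value for `metric_key`."""
    chooser = max if greater_is_better else min
    return chooser(history, key=lambda entry: entry[metric_key])["checkpoint"]


best = pick_best_checkpoint(eval_history, "eval_accuracy")
print(best)
```

In this made-up history the third epoch wins even though training continued for another epoch, which is the situation the option is meant to handle. For a metric where lower is better (a loss, for example), the comparison has to be flipped.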

The last argument sets everything up so we can push the model to the [Hub](https://huggingface.co/models) regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook. If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization rather than your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/bert-finetuned-mrpc"` or `"huggingface/bert-finetuned-mrpc"`).

The last thing to define for our `Trainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `metric` we loaded earlier. The only preprocessing we have to do is to take the argmax of our predicted logits (or just squeeze the last axis in the case of STS-B):

In [None]:
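def compute_metrics(eval_pred):
    # eval_pred unpacks into the model outputs (logits) and the reference labels.
    predictions, labels = eval_pred
    if task != "stsb":
        # Classification: pick the class with the highest logit.
        predictions = np.argmax(predictions, axis=1)
    else:
        # Regression (STS-B): one output per example, so drop the last axis.
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)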
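As a small worked example of the two branches above, the following sketch applies the same NumPy operations to made-up arrays. The logits, labels and the hand-computed accuracy are assumptions for illustration only; they do not come from the model or from the `metric` object used in this notebook.

```python
import numpy as np

# Made-up logits for 3 examples and 4 classes (illustrative values only).
logits = np.array([
    [0.1, 2.0, -1.0, 0.3],
    [1.5, 0.2, 0.1, 0.0],
    [-0.5, 0.0, 0.4, 3.2],
])
labels = np.array([1, 0, 2])

# Classification branch: index of the largest logit in each row.
predicted_classes = np.argmax(logits, axis=1)

# Accuracy computed by hand: fraction of predictions equal to the labels.
accuracy = float((predicted_classes == labels).mean())

# Regression branch: the model emits one value per example, shape (n, 1),
# so indexing column 0 yields a flat array of shape (n,).
regression_logits = np.array([[2.7], [0.3], [4.1]])
regression_predictions = regression_logits[:, 0]

print(predicted_classes, accuracy, regression_predictions.shape)
```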

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
validation_key = "test"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

You might wonder why we pass along the `tokenizer` when we already preprocessed our data. This is because we will use it one last time to make all the samples we gather the same length by applying padding, which requires knowing the model's preferences regarding padding (to the left or right? with which token?). The `tokenizer` has a `pad` method that will do all of this for us, and the `Trainer` will use it. You can customize this part by defining and passing your own `data_collator`, which will receive the samples as dictionaries like the ones seen above and will need to return a dictionary of tensors.
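The padding step described above can be sketched in plain Python. The function `toy_pad_collate`, its `pad_token_id` default and the sample ids are hypothetical. To stay dependency-free it returns lists, whereas a real `data_collator` passed to the `Trainer` has to return tensors and should take the padding token and side from the tokenizer.

```python
def toy_pad_collate(samples, pad_token_id=0, padding_side="right"):
    """Pad every sample's input_ids to the longest length in the batch."""
    max_len = max(len(sample["input_ids"]) for sample in samples)
    batch = {"input_ids": [], "attention_mask": []}
    for sample in samples:
        ids = list(sample["input_ids"])
        n_pad = max_len - len(ids)
        padding = [pad_token_id] * n_pad
        real_mask = [1] * len(ids)   # 1 marks real tokens
        pad_mask = [0] * n_pad       # 0 marks padding the model should ignore
        if padding_side == "left":
            batch["input_ids"].append(padding + ids)
            batch["attention_mask"].append(pad_mask + real_mask)
        else:
            batch["input_ids"].append(ids + padding)
            batch["attention_mask"].append(real_mask + pad_mask)
    return batch


samples = [{"input_ids": [7, 8, 9]}, {"input_ids": [5]}]
left_batch = toy_pad_collate(samples, padding_side="left")
right_batch = toy_pad_collate(samples, padding_side="right")
print(left_batch)
print(right_batch)
```

The two calls differ only in where the padding goes, which is exactly the model-specific preference the tokenizer knows about.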

We can now finetune our model by just calling the `train` method:

In [None]:
trainer.train()

You're using a XLNetTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.454924,0.42233


We can check with the `evaluate` method that our `Trainer` did reload the best model properly (if it was not the last one):

In [None]:
trainer.evaluate()

To see how your model fared you can compare it to the [GLUE Benchmark leaderboard](https://gluebenchmark.com/leaderboard).

You can now upload the result of the training to the Hub by executing this instruction:

In [None]:
trainer.push_to_hub()