If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell.

In [None]:
! pip install datasets transformers

Collecting datasets
  Downloading datasets-2.14.4-py3-none-any.whl (519 kB)
Collecting transformers
  Downloading transformers-4.32.0-py3-none-any.whl (7.5 MB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)

In [None]:
! pip install -U accelerate
! pip install -U transformers

Collecting accelerate
  Downloading accelerate-0.22.0-py3-none-any.whl (251 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.22.0


In [None]:
import os
# Kill the kernel so that Colab restarts the runtime and picks up the upgraded packages.
os._exit(0)

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cell and paste your access token when prompted:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

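
A token typed or pasted into a notebook cell is easy to leak when the notebook is shared. As a minimal sketch, the token can instead be read from an environment variable. The helper name `read_hub_token` and the variable name `HF_TOKEN` are assumptions made for this illustration, not something the notebook defines.

```python
import os
from typing import Mapping, Optional


def read_hub_token(env: Mapping[str, str] = os.environ,
                   var_name: str = "HF_TOKEN") -> Optional[str]:
    """Return the token stored in `var_name`, or None if it is unset or blank.

    Hypothetical helper; `HF_TOKEN` is an assumed variable name.
    """
    token = env.get(var_name, "").strip()
    return token or None


token = read_hub_token()
if token is None:
    print("No token in the environment; fall back to notebook_login().")
else:
    # Report only that a token exists; never print the token itself.
    print("Token found in the environment.")
```

When a token is found, it could be passed to `huggingface_hub.login(token=token)` instead of being entered in the widget.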

Then you need to install Git-LFS by running the following cell:

In [None]:
! apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 16 not upgraded.


Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [None]:
import transformers

print(transformers.__version__)

4.32.0
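
If you would rather have the notebook fail early than compare versions by eye, a small guard can help. The following is a minimal sketch using only the standard library. `is_at_least` is a hypothetical helper, and it assumes plain dotted version strings; `packaging.version.parse` handles pre-release suffixes more carefully.

```python
import re


def is_at_least(installed: str, required: str) -> bool:
    """Return True if a dotted version string is >= the required one.

    Hypothetical helper: only the leading numeric components are compared,
    so suffixes such as ".dev0" are ignored.
    """
    def numeric_parts(version: str) -> tuple:
        parts = []
        for piece in version.split("."):
            match = re.match(r"\d+", piece)
            if match is None:
                break
            parts.append(int(match.group()))
        return tuple(parts)

    a, b = numeric_parts(installed), numeric_parts(required)
    # Pad the shorter tuple with zeros so "4.11" and "4.11.0" compare equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# In the notebook you would pass transformers.__version__ as the first argument.
print(is_at_least("4.32.0", "4.11.0"))
print(is_at_least("4.9.2", "4.11.0"))
```

Comparing integer tuples rather than raw strings matters here: as strings, `"4.9.2"` would sort after `"4.11.0"`.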


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/text-classification).

# Fine-tuning a model on a text classification task

This notebook is built to run on any of the tasks in the list above, with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a classification head. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly:

In [None]:
task = "INCIBE"
model_checkpoint = "gpt2"
batch_size = 16
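
If `batch_size = 16` does not fit in GPU memory, one common workaround is to lower the per-device batch size and accumulate gradients over several steps so that the effective batch size stays the same. The sketch below only does the arithmetic. `accumulation_plan` is a hypothetical helper, and gradient accumulation is an assumption here rather than something this notebook configures.

```python
def accumulation_plan(target_batch_size: int, per_device_batch_size: int) -> int:
    """Return how many accumulation steps reproduce the target batch size.

    Hypothetical helper: the smaller batch must divide the target evenly.
    """
    if per_device_batch_size <= 0 or target_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    if target_batch_size % per_device_batch_size != 0:
        raise ValueError("per-device batch size must divide the target batch size")
    return target_batch_size // per_device_batch_size


batch_size = 16  # value used in this notebook
for smaller in (16, 8, 4):
    steps = accumulation_plan(batch_size, smaller)
    print(f"per-device batch {smaller} x {steps} accumulation steps = {smaller * steps}")
```

In 🤗 Transformers, the corresponding `TrainingArguments` fields are `per_device_train_batch_size` and `gradient_accumulation_steps`.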

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [None]:
from datasets import load_dataset, load_metric

Here we pass the Hub identifier of our dataset directly to `load_dataset`, which will cache the dataset to avoid downloading it again the next time you run this cell.

In [None]:
dataset = load_dataset("agarc15/TFM_INCIBE")

In [None]:
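# load_metric takes a single metric name; its second positional argument is a
# configuration name, so passing 'f1' there does not add an F1 metric.
metric = load_metric('accuracy')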

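
To make concrete what the evaluation numbers mean, here is a minimal pure-Python sketch of accuracy and macro-averaged F1 on integer labels. It is an illustration only, not the implementation used by 🤗 Datasets. The function names and the toy labels are made up for the example.

```python
from typing import Sequence


def _check(predictions: Sequence[int], references: Sequence[int]) -> None:
    if len(predictions) != len(references) or len(references) == 0:
        raise ValueError("inputs must be non-empty and of equal length")


def accuracy(predictions: Sequence[int], references: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    _check(predictions, references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def macro_f1(predictions: Sequence[int], references: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either input."""
    _check(predictions, references)
    pairs = list(zip(predictions, references))
    scores = []
    for label in sorted(set(predictions) | set(references)):
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        # Each label occurs at least once, so the denominator is never zero.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Toy labels, invented for illustration.
references = [7, 5, 7, 7, 4, 7]
predictions = [7, 7, 7, 7, 4, 5]
print(accuracy(predictions, references))
print(macro_f1(predictions, references))
```

Macro averaging weights every class equally, so a rare class that is always misclassified pulls the score down much more than it pulls down accuracy.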

The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split; here those are `train` and `test`.

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['DESCRIPTION', 'INCIBE_TAXONOMY', 'label', 'A', 'CI', 'F', 'I', 'IA', 'Others', 'SUM'],
        num_rows: 6247
    })
    test: Dataset({
        features: ['DESCRIPTION', 'INCIBE_TAXONOMY', 'label', 'A', 'CI', 'F', 'I', 'IA', 'Others', 'SUM'],
        num_rows: 2678
    })
})

To access an actual element, you need to select a split first, then give an index:

In [None]:
dataset["train"][0]

{'DESCRIPTION': " The White House says the program is part of a broader cybersecurity plan designed to address issues across the nation's critical infrastructure.. The 100-day initiative will involve government agencies that are responsible for the security of critical infrastructure as well as businesses and private utilities that oversee or own infrastructure, such as electrical distribution systems that deliver power to homes.. Public-private partnership is paramount to the administration's efforts because protecting our nation's critical infrastructure is a shared responsibility of government and the owners and operators of that infrastructure, says Emily Horne, a spokesperson for the National Security Council.. Some lawmakers and a government watchdog agency have recently criticized the Department of Energy for its cybersecurity practices, especially in the wake of the SolarWinds supply chain attack, which led to follow-on attacks on the DOE and eight other federal agencies, plus 

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw distinct random indices, retrying whenever an index was already picked.
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    # Show class names instead of integer ids for ClassLabel columns.
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
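
The loop above re-draws until it finds an index it has not picked yet. The standard library's `random.sample` draws distinct indices in a single call. The following minimal sketch shows an equivalent selection step, with `pick_indices` as a hypothetical helper name.

```python
import random


def pick_indices(dataset_length: int, num_examples: int = 10) -> list:
    """Return `num_examples` distinct indices in the range [0, dataset_length)."""
    if num_examples > dataset_length:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # random.sample draws without replacement, so no duplicate check is needed.
    return random.sample(range(dataset_length), num_examples)


picks = pick_indices(6247)  # 6247 is the number of rows in the train split
print(picks)
```

The resulting list can be used exactly like `picks` in `show_random_elements`, for example as `dataset[picks]`.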

In [None]:
show_random_elements(dataset["train"])

Unnamed: 0,DESCRIPTION,INCIBE_TAXONOMY,label,A,CI,F,I,IA,Others,SUM
0,"The worm, which has been active since early December, typically attempts to inject XMRig malware - increasingly used to mine for cryptocurrency such as monero - within vulnerable servers, the researchers say (see: Kubeflow Targeted in XMRig Monero Cryptomining Campaign). It targets vulnerable, public-facing services such as MySQL, the Tomcat administration panel and the open-source automation Jenkins server that use weak passwords. Plus, it targets a vulnerability in Oracle WebLogic that is tracked as CVE-2020-14882.. Oracle and the U.S. Cybersecurity and Infrastructure Security Agency have previously warned WebLogic users to apply patches for the vulnerability (see: CISA and Oracle Warn Over WebLogic Server Vulnerability).. During our analysis, the attacker kept updating the worm on the command-and-control server, indicating that it's active and might be targeting additional weak configured services in future updates, Avigayil Mechtinger, a security researcher at Intezer, notes in the report.. How It Works. An attack typically starts with the worm attempting to brute force passwords to gain access to a device. Once inside, it uses three separate files to continue its attack. The first is a dropper - either a Bash or PowerShell script. The second is a Golang binary worm, and the third is the XMRig miner. All are hosted on the same command-and-control server, the researchers determined.. During the attack, the worm checks if a process on the infected machine is listening on port 52013 of the targeted server. A listener on this port would function as a mutex - a synchronization mechanism for enforcing limits on access to a resource in an environment where there are many threads of execution. If a listener is not found on the port, a network socket is opened, the researchers say.. The Linux version of the worm so far remains undetected on the VirusTotal scanning platform, according to the report. 
The fact that the worm's code is nearly identical for both its [Windows] and [Linux] malware - and the [executable Linux file] malware going undetected in VirusTotal - demonstrates that Linux threats are still flying under the radar for most security and detection platforms, Mechtinger says.. Kyung Kim, senior managing director and the head of cybersecurity for the Asia-Pacific Region at FTI Consulting, says more threat actors are using the Golang programming language to help them target operating systems other than Windows.. Golang is popular for attackers because it's multi-variate and allows a single codebase to be accumulated into all major operating systems, Kim says. Rather than attacking end-users, Golang malware focuses its efforts on compromising application servers, frameworks and web applications, which is partially why it can infiltrate systems easily without being detected.. Targeting Linux. Other security researchers have noted an increase in malware, especially cryptominers, targeting the Linux platform.. In November, Intezer found the Linux version of the Stantinko botnet was recently updated to better mine cryptocurrency and deliver malware (see: Linux Botnet Disguises Itself as Apache Server).. Another example is the InterPlanetary Storm botnet that infects Windows, Linux, Mac and Android devices, according to Barracuda Networks. It mines for cryptocurrency and can initiate distributed denial-of-service attacks (see: 'InterPlanetary Storm' Botnet Infecting Mac, Android Devices ).",F,7,0,0,1,0,0,0,"All are hosted on the same command-and-control server, the researchers determined.. During the attack, the worm checks if a process on the infected machine is listening on port 52013 of the targeted server."
1,"A proposed settlement for a 2020 breach requires BJC HealthCare to implement MFA for email access, estimated to cost $2.7 million. (Photo by Mark Wilson/Getty Images). BJC HealthCare reached a settlement with the 287,873 patients impacted by a 2020 protected health information breach of its email system brought on by a successful phishing attack. Nineteen of its affiliated hospitals were involved in the incident.. Each affected patient will receive up to $250 for bank fees, interest, credit monitoring costs, postage, mileage and up to three hours of lost time. Individuals who’ve faced extraordinary expenses as a direct result of the hack may also qualify for up to $5,000 in reimbursement.. The proposed settlement also requires BJC HealthCare to implement multi-factor authentication for email access to reduce the risk of phishing, projected to cost $2.7 million. Depending on how many of the patients file claims, the overall settlement costs could be staggering.. BJC Health has been defending itself against allegations that its poor cybersecurity policies and practices directly led to a 2020 phishing attack and subsequent PHI breach. In May 20202, the Missouri-based provider notified patients that their data was exposed during a phishing attack two months earlier. Three employees were duped by the phishing emails on March 6 and detected by the security team on the same day. The investigation determined the phishing attack enabled the threat actor to gain access to the accounts for only one day.. The accounts contained a trove of patient data including Social Security numbers, medical record or patient account numbers, provider names, treatments, medications, and clinical data. BJC could not rule out whether the emails, attachments, or patient data were viewed by the attacker during the incident. Nineteen affiliated hospitals were affected by the security incident.. 
What was notable, however, was that this was the third healthcare data breach reported by BJC in two years. In March 2018, a data server misconfiguration exposed the data of 33,420 patients for nearly a year. Later that year, malware was installed onto its patient portal, which allowed a hacker to intercept the credit and debit card numbers of 5,850 for approximately one month.. After the May 2020 notification, five separate class-action lawsuits were filed against BJC over the incident, which claimed that its failure to implement and follow basic security procedures enabled the success of the phishing attack.. BJC was also accused of failing to adequately encrypt, if at all, the PHI in its possession, while failing to follow contractually agreed upon security standards in direct violation of the HITECH Act and Health Insurance Portability and Accountability Act.. The lawsuit claims these missteps have put patients at an increased risk of identity theft and are “immediately and imminently in danger of sustaining some or further direct injury/injuries as a result of the identity theft they suffered when [BJC] did not protect and secure the PHI and disclosed the PHI to hackers.” Under the proposed settlement, BJC must provide breach victims with the aforementioned payments, in addition to two years of credit monitoring services. The health system has also agreed to bolster its cybersecurity policies to better protect patient information, including conducting mandatory cybersecurity training annually and during new hire orientation.. The settlement also requires BJC Health to apply periodic training updates to reflect new information security issues. The health system must also maintain a written password policy that requires the appropriate password complexity.. The MFA project must target remote access to the email systems. The estimated $2.7 price tag will include $1.22 million for the initial implementation and another $1.5 million for annual maintenance. 
However, these are “reasonable estimates only.” The MFA project is required, but BJC is not mandated to “spend a particular dollar amount towards these measures.”. BJC is also required to pay the costs for notifying breach victims of the settlement, as well as related fees and attorneys’ costs, up to $790,000 to Missouri Class Counsel and up to $415,000 for the Illinois Class Counsel. Named plaintiffs may receive up to $2,000.. “As detailed herein, the settlement surely satisfies the preliminary approval standard of likely to be approved as fair, reasonable, and adequate,” according to the proposal.",A,5,1,0,0,0,0,0,"A proposed settlement for a 2020 breach requires BJC HealthCare to implement MFA for email access, estimated to cost $2.7 million."
2,"At the center of the strategy is “a safe, aggressive vaccination campaign” to meet the goal of administrating 100 million vaccine shots in the administration’s first 100 days, Biden said at a Thursday press briefing. “This will be one of the greatest operational challenges our nation has ever undertaken.. Director of National Intelligence’s Role. The document calls for the director of national intelligence to lead the risk assessment.. On Wednesday, the Senate approved Avril Haines as national intelligence director. She's been a vocal proponent of improving U.S. organizations' cybersecurity postures as well as public and private cooperation.. “The U.S. government will take steps to address cyberthreats to the fight against COVID-19, including cyberattacks on COVID-19 research, vaccination efforts, the healthcare systems and the public health infrastructure,” Biden’s coronavirus strategic document states.. President Biden discusses new national COVID-19 strategy. (Source: C-SPAN). The strategy paper also notes that the national intelligence director will assist “in the federal government’s efforts to provide warning of pandemics, protect our biotechnology infrastructure from cyberattacks and intellectual property theft, identify and monitor biological threats from states and non-state actors, provide validation of foreign data and response efforts and assess strategic challenges and opportunities from emerging biotechnologies.”. Critical Issues. Responding to the initiative, Greg Garcia, executive director of cybersecurity of the public-private Healthcare and Public Health Sector Coordinating Council, says: “It’s critically important that the [Biden] administration continue the work that’s been done under [the Trump administration’s] Operation Warp Speed to mitigate threats of cyberattacks against the vaccine supply chain and research community. 
The health sector has been working closely with the Department of Health and Human Services and the Department of Homeland Security’s Cybersecurity and Infrastructure Security Agency over the past several months in this area.”. Garcia recommends educating all those in the vaccine value chain about the imperative of investing in strong data and network security. He says all participants must “work together to identify those critical linchpin functions and entities - for which there is little or no redundancy – and develop protective and backup strategies in the event of cyber or physical disruption.”. Threat Intelligence Sharing. Errol Weiss, chief security officer at the Health Information Sharing and Analysis Center, points out: IP theft, disinformation and security threats against the development, manufacture and distribution of biotechnology are not new - although they are certainly more visible with COVID-19.”. As part of the pandemic response, “Health-ISAC has formed dedicated working groups to help improve security of the entire vaccine supply chain,” he adds.. Global Threats. In recent months, some threat actors have targeted COVID-19 vaccine development and distribution.. For instance, in November, Microsoft warned that three state-sponsored advanced persistent threat groups - one Russian, two North Korean - had been targeting companies across the globe involved in COVID-19 vaccine and treatment development.. Then in December, CISA and IBM X-Force alerted organizations involved in COVID-19 vaccine production and distribution of a global phishing campaign targeting the cold storage and transport supply chain (see: Phishing Campaign Targets COVID-19 'Cold Chain').. Also in December, Europol, the European Union's law enforcement agency, warned that organized crime gangs have reacted swiftly to adapt methods and product offerings to the COVID-19 pandemic. 
That alert followed a warning from international law enforcement agency Interpol about a potential surge in organized crime activity tied to COVID-19 vaccines.. More recently, the European Medicines Agency, which helps evaluate and authorize medicines and vaccines, including those for COVID-19, revealed that documents on coronavirus vaccines and medications, including some containing personal information, were stolen in a cyberattack last month.. The EMA said last week that some of the COVID-19 vaccine documents that were leaked on the internet have been manipulated by the perpetrators prior to publication in a way which could undermine trust in vaccines.",F,7,0,0,1,0,0,0,"Responding to the initiative, Greg Garcia, executive director of cybersecurity of the public-private Healthcare and Public Health Sector Coordinating Council, says: “It’s critically important that the [Biden] administration continue the work that’s been done under [the Trump administration’s] Operation Warp Speed to mitigate threats of cyberattacks against the vaccine supply chain and research community."
3,"1. Stay on Top of Vendors -- Often vendor managed systems are the ones that are maintained most sloppily - terrible passwords, multiple critical patches missing, etc. If need be, change the administration password when they are done working on it, and isolate it from the network via a firewall to avoid exposed vulnerabilities.. 2. Be Password Wise -- Remember that the longer and more complex passwords are, the harder they are to crack. Using phrases with a mix of substituted characters like numbers, special characters, upper and lower case, etc. helps greatly. In particular, spaces - spaces have a tendency to throw off password cracking software. Lastly, don't use the ALT+255 character. It's an old trick and well known at this point.. 3. Mind Your Patches -- Don't forget that almost all software will have a security patch or update at some point, so don't just rely on Microsoft patches. Check with each software vendor to see if there are updates for the other systems on a regular basis. To make it easy, most vendors have a mailing list that will alert you if there is a patch or update that should be applied.. 4. Clear Your Cache -- Web browsers maintain temporary files in what is called cache.” These files can contain a multitude of customer information, so implement a policy to regularly clear the cache on users’ machines to avoid possible privacy violation risks.. 5. Restrict Internet Use -- Banks are one of those environments that have no good need for its users to have unrestricted access to the Internet. Use your firewall to restrict access to only the services that are truly needed, and shut down the rest.. 6. Don’t be Fooled By Appearances -- Try to remember that you can never really know what someone is thinking. 
The aside Oh, he would never do something like that often turns into I would never thought that he was capable of something like that.",F,7,0,0,1,0,0,0,"To make it easy, most vendors have a mailing list that will alert you if there is a patch or update that should be applied.. 4."
4,"The information, including financial details, contact information, memos and private chats, was leaked in December but only recently spotted.. The leak includes details for German celebrities as well as members of six of the seven main political parties in the Bundestag lower house, including the ruling center-right and center-left parties, as well as The Greens, left-wing party Die Linke and the Free Democratic Party, the BBC reported.. But there's a notable exception: No members of the far-right Alternative for Germany - AfD - saw their personal details get spilled, according to German media reports. It's not clear, however, if that's a clue to the perpetrator's identity or a false flag.. Whoever is behind this wants to damage faith in our democracy and its institutions, says Justice Minister Katarina Barley in a statement.. It's also not clear if all of the leaked data is authentic or unaltered.. 'Immense' Leak. The leaked information was made available online via tweets from a Twitter account, which has now been suspended, that linked to a platform that appeared to be based in the German city of Hamburg.. The amount of data published is immense, says Hamburg's Data Protection Commissioner, who has been responding to the data leak by cataloging tweets that contain links to the stolen data. The commissioner has been communicating to Twitter as part of its legal request that all such information be removed.. Even if no information relevant to public safety is concerned, the damage that may be caused by the publication of personal information to the individual concerned is nonetheless significant, the commissioner says.. BSI Investigates. Germany's Federal Office for Information Security, or BSI, is investigating the leak.. Hacker attack on politicians: The BSI is currently intensively examining the case in close cooperation with other federal authorities, the BSI tweeted on Friday. The National Cyber Defense Center has taken over the central coordination. 
According to our current information, government networks have not been targeted.. The data dump included Merkel's email address and fax number, as well as letters she wrote or which were written to her, German news agency DPA reported. One reporter who reviewed the data dump said it also appears to contain numerous private details, including sensitive information about individuals' private lives.. Officials say the data may have been obtained by hackers using stolen passwords to log into email accounts, social networks and cloud-based services (see: Credential Stuffing Attacks: How to Combat Reused Passwords).. After an initial analysis, much evidence points toward the data being obtained through the improper use of login details to cloud services, email accounts or social networks, Minister of the Interior Horst Seehofer said in a statement on Friday, the Guardian reported. Currently, nothing points towards the system of the parliament or government having been compromised.. Dump is Massively Mirrored. The information security researcher known as the Grugq says that whoever stole and packaged up the information appears to have done so over a significant period of time. They also went to great lengths to make it difficult to eradicate online copies of the information by mirroring the data in numerous places online, and then creating mirrors of the mirrors, according to the Grugg.. This data leak has so much data squirrelled away to avoid take downs. It must have required many man hours of uploading.. - 70 mirrors of the download links. - 40 d/l links, each with 3-5 mirrors. - 161 mirrors of data files. Plus the tweets, blog posts, mirrors of mirror links.. — the grugq (@thegrugq) January 4, 2019. If I had to guess, I'd say that the leak files were not produced at the same time, the Grugq says via Twitter. The changes in layout and naming suggest that it wasn't one person in one marathon session creating these. 
There is variation in the archive passwords too: 123, abbreviations, variations.. At least one German media outlet published links to the stolen information, drawing a rebuke from information security experts.. Today's German data leak presents a particularly sharp dilemma: It is highly unethical to further publicize access to all the private data of so many prominent, high-interest individuals - but the leak's rollout design is also highly resilient to takedowns, says German political scientist Thomas Rid, a professor of strategic studies at Johns Hopkins University's School of Advanced International Studies.. Let's spell this out more clearly:. 1-Twitter accounts spreads URLs to bad leak. 2-Twitter suspends account. 3-Journalist posts screenshot of suspended account with live links. 4-Press stories simply mention suspended Twitter handle. 5-Hello archives. 3 = just stupid. 4 = please stop. — Thomas Rid (@RidT) January 4, 2019. Follows Alleged APT28 Attack. This isn't the first major information security mishap to occur on the BSI's watch. In 2015, the BSI shut down the parliamentary intranet after discovering it had been infected with spyware.. In February 2018, it admitted that in December 2017, it discovered that for up to a year, hackers had infiltrated the sensitive Informationsverbund Berlin-Bonn - IVBB - network used by Germany's Foreign Ministry and Defense Ministry, and planted malware, German public broadcaster Deutsche Welle reported.. The Russian government hacking group APT28 is suspected as being responsible for that attack. The group is also known as BlackEnergy Actors, Cyber Berkut, CyberCaliphate, Fancy Bear, Pawnstorm, Sandworm, Sednit, Sofacy, Strontium, Tsar Team and Voodoo Bear (see: Dutch and British Governments Slam Russia for Cyberattacks).. Reuters, meanwhile, reported that the BSI only learned of the new, massive data dump on Friday, shortly before it was reported by German news media.. Advent Calendar of Leaks. 
Some information security experts say that the dump of German politicians' personal details, memos and other potentially sensitive data has none of the hallmarks of a typical Russian information operations campaign.. For starters, the dump appeared to be designed to be an Advent calendar of big and little leaks, with new data being dumped every day in December up until Christmas via a Twitter account - reportedly followed by up to 18,000 people - before it was suspended.. Initially, at the beginning of December 2018, the account began leaking data for celebrities before switching to politicians on Dec. 20.. Someone put a lot of effort into this. It doesn't make sense for a Russian op, the timing is way off, the Grugq tweets. And they'd have been pissed that they got ignored for all of December as they were leaking. It is unusual to do an IO and just wait around until it is found.. Privacy Commissioner Seeks Link Removal. The Hamburg Commissioner for Data Protection says it's been working throughout Friday to legally compel Twitter to excise all links to the stolen data from any tweets. To do so, the commissioner is working with Ireland's Data Protection Commission because Twitter's European operations are based in Ireland (see: GDPR: EU Sees More Data Breach Reports, Privacy Complaints).. But it's not clear yet if any of the links specified by Hamburg's data protection commissioner have yet been removed by Twitter or if the social networking firm will honor those requests.. We are continuing to investigate this issue and our teams will take action where appropriate, a Twitter spokeswoman tells Information Security Media Group.. Posting a person's private information without their permission or authorization is a direct and serious violation of the Twitter Rules, she says. 
We also recently updated our rules to prohibit the distribution of any hacked material that contains private information, trade secrets or could put people in harm's way.",I,4,0,0,0,1,0,0,"The amount of data published is immense, says Hamburg's Data Protection Commissioner, who has been responding to the data leak by cataloging tweets that contain links to the stolen data."
5,"The malware is another example of how fraudsters are increasingly getting around standard modes of authentication, such as usernames and passwords, says fraud-fighting expert Avivah Litan.. I don't think most banks are aware of these latest scams that are replacing Zeus, SpyEye and other financial Trojans, in terms of popularity and usefulness to the criminals, says Litan, an analyst at the consultancy Gartner. This particular Trojan is using techniques that I've seen before, so I'm not sure if it's that unique. But Beta Bot is most definitely indicative of the new trend in cyber-attack vectors.. Beta Bot's Attack. The Internet Crime Complaint Center and the Federal Bureau of Investigation recently issued an advisory about Beta Bot, the new malware that targets e-commerce sites, online payment platforms and even social networking sites to compromise log-in credentials and financial information.. When Beta Bot infects a system, an illegitimate but official-looking Microsoft Windows message box named User Account Control pops up, asking the user to approve modifications to the computer's settings. If the user complies with the request, the hackers are able to exfiltrate data from the computer, the advisory states. Beta Bot is also spread via USB thumb drives or online via Skype, where it redirects the user to compromised websites.. Beta Bot defeats malware detection programs because it blocks access to security websites and disables anti-virus programs, according to IC3.. This is a good demonstration of how fraudsters' methods are evolving constantly, says Shirley Inscoe, a fraud analyst with consultancy Aite. They are coming up with sophisticated methods that appear so convincing, even people who typically would not fall for their schemes may do so.. 
Beta Bot's attacks also resemble the ransomware attacks that coupled the banking Trojan known as Citadel with the drive-by virus known as Reveton, which seized consumers' computers and demanded ransom, purporting to be from the FBI (see Trojans Tied to New Ransomware Attacks).. Distribution Increasing. Andreas Baumhof, chief technology officer at online security and research firm ThreatMetrix, says Beta Bot first surfaced in March, targeting U.S. consumers. But distribution of Beta Bot has recently picked up, making it more of a concern, he says. Security firm RSA earlier noted that the malware's DNS-redirection scheme resembled features of the Citadel Trojan.. And while it's not a banking Trojan, Beta Bot possesses the same characteristics of most common banking Trojans, such as Zeus, Baumhof says. It can block access to AV update servers, so your anti-virus engine can't update its signature patterns; it can grab HTTP post data and also has DDoS [distributed-denial-of-service] capabilities, he says. It doesn't matter what Trojan people use, Baumhof adds. What matters is what effect it has on the current transaction you do, and this is where people should focus.. Al Pascual, a fraud expert and analyst with Javelin Strategy & Research, says banking institutions should be concerned about any malware, such as Beta Bot, that proliferates.. For now, it is not, apparently, designed to function as a banking Trojan, Pascual says. While this is good news, it has all the basic components in place to become just that; so this should stay on the radar.. Mitigating Risks. IC3 and the FBI warn that if consumers see what appears to be an alert from Microsoft but have not requested computer setting modifications from the company, they have likely been targeted for a Beta Bot attack.. If infected, running a full system scan with up-to-date anti-virus software is recommended. 
And if access to security sites has been blocked, then downloading anti-virus updates or a new anti-virus program is advised.. Inscoe says continual compromise of login credentials, which compromises standard online authentication practices, should be concerning to banking institutions. And they should be taking steps to educate their customers.. I have not heard of any bank proactively alerting their customers to this new threat, Inscoe says. There may be some who have put information on their websites, but at some point, banks must realize this is just not adequate to protect their client base.",F,7,0,0,1,0,0,0,"Security firm RSA earlier noted that the malware's DNS-redirection scheme resembled features of the Citadel Trojan.. And while it's not a banking Trojan, Beta Bot possesses the same characteristics of most common banking Trojans, such as Zeus, Baumhof says."
6,"SAN FRANCISCO — There’s a growing consensus that the device manufacturers, provider organizations, and regulators are moving the security of medical devices in the right direction. But systemic challenges are stymying the progress.. At the RSA Conference, Marty Edwards, vice president of OT security for Tenable, Ankit Patel, business information security officer at Humana, and Errol Weiss, chief security officer of Health-ISAC, held an informative discussion on the state of medical devices and what’s truly holding the healthcare industry back.. Overall, there’s still room to grow and improve in the outreach and educational aspects of medical device security, but the industry is making great strides in building more security features into devices, explained Edwards.. The culture in the manufacturing community is getting better, as well. In the past, researchers who discovered a vulnerability would call the manufacturer and be immediately connected with the legal department.. “That was their response to vulnerability disclosures because they had no security contact information on the webpages,” said Edwards. “Now I see most manufacturers have gained a little bit of maturity, and they're starting to lean in.”. Many have added chief product security officers assigned to managing the product security, rather than the corporate business. There are obviously manufacturers that need to improve, but the “trend is moving in a positive direction,” he added.. “We may not have all the security controls that you would want in a difficult medical device, or connected IoT device … but there has been a lot more progress made over the last five years or so, than ever before,” said Patel. But despite progress being made to secure new devices, including improved use of authentication, the complexity of the medical device infrastructure and heavy reliance on legacy tech will keep the state of device security in flux, without a solution.. 
Legacy technology is the biggest security problem for healthcare. For Patel, the real problem is concentrated around legacy technologies. For example, MRI machines and ultrasound machines cost more than $1 million each, which means providers can’t replace those technologies with “the latest and greatest” product with these newly implemented features.. While it’s clear that newer devices will be vastly more secure, it won’t solve the continued use of legacy devices and those that rely on Windows XP. These devices weren’t implemented with concern that it would be connected to the internet because “ultimately, why would anybody bother with it? What's the problem?”. “In this industry, in healthcare, we see MRI systems wide open on the internet, as well,” said Weiss. “Every single one of those devices on your network could represent an entry point for the bad guy to get into your network and [malware] spreads from that.. “That's the problem. That's the biggest challenge,” he added.. In the last few years, there have been multiple, critical vulnerabilities disclosed within the underlying software packages of these medical devices, including those on the wireless connectivity modules used on millions of devices, said Weiss. “It seems like an endless supply of vulnerabilities constantly popping up.”. And as healthcare works to address these longstanding issues, the IoT trends keep moving forward.. “Home health is becoming mainstream, with a lot of health systems investing a significant amount of dollars into technologies where they can monitor patients overnight,” said Patel. There are also conversations around performing surgeries remotely using robots, and “we're moving very quickly in using innovation and technology.”. Regulations, communication will play a role in reducing medical device risks. Congress and regulators are the likely key to reducing the risks posed by medical devices, while supporting providers with the process. 
There are several proposed legislations that target manufacturers, software bill of materials (SBOMs), and other items that the healthcare sector has sought for years.. From a legislative perspective, Patel explained there are a lot of good foundational elements of medical device security. But within the manufacturer community, there need to be greater conversations between those vendors and security leaders to collaborate on tangible ideas and “identify more concrete practical solutions.”. To Weiss, improving communication between medical device manufacturers and providers will lead to the creation of devices that are more practical for implementation, as well as continuous upgrades on security going forward.. As it stands, the industry is well aware of the issues out there, said Patel. “But we need to find a way to move towards a solution, and analyzing it, as opposed to saying, ‘Hey, the vendor community does this,’ ‘the practitioner, they don't know what they're doing,’ and it's not logistical.” “Manufacturers have their own challenges, but it’s something we’re all trying to figure out and navigate through,” Patel added. “As we keep thinking about security and trying to bake security as part of the process, I think we will be moving in the right way.”",Others,9,0,0,0,0,0,1,"SAN FRANCISCO — There’s a growing consensus that the device manufacturers, provider organizations, and regulators are moving the security of medical devices in the right direction."
7,"Technology alone cannot solve this problem, says Clark in an interview with Information Security Media Group [transcript below].. In reviewing the three key areas of insider threats - IT sabotage, theft of intellectual property and insider fraud - technology can serve as an additional mitigation layer, he says.. For IT sabotage, technologies such as resiliency, back-up, access control, code review and log analysis are beneficial, he points out. We would suggest data loss prevention, encryption and intrusion detection systems when monitoring for the theft of IP, he adds.. To mitigate fraud risks, organizations should consider two-factor authentication and auditing technologies, Clark says. Also, technologies that are capable of detecting unauthorized addition or modification of data in databases are of paramount importance.. During this interview, Clark discusses:. Linking insider risks back to common network attacks and breaches;. Why so-called low-and-slow attacks are always the most damaging; and. Where and how technology fits into insider fraud detection.. Clark recently made a presentation on insider threats at Information Security Media Group's Fraud Summit. A video of his presentation is available on ISMG's Fraud Summit page.. As a researcher at the CERT Insider Threat Center at the Carnegie Mellon University Software Engineering Institute, Clark's main area of interest is insider threat and cybersecurity. He previously worked at the Census Bureau and the Institute for Defense Analyses. Clark is also researching cybercrime in the doctoral program at George Mason University.. Insider Risks. TRACY KITTEN: Why are insider threats so difficult for organizations to mitigate?. JASON CLARK: There are many reasons why insider threat is challenging to mitigate. First, insiders can bypass existing physical and electronic security measures through legitimate measures. 
In other words, they're supposed to be there working on systems with access, unlike an outside attacker. Essentially they have authorized access to authorized systems. Also, some organizations may not be aware they have been a victim of an insider attack or, for a variety of different reasons, choose not to report it.. One common misconception is that insider threats can be solved with technologies. Unfortunately, solving the insider threat problem with technology will not suffice as it's difficult to search and analyze logs to look for bad behavior. In fact, the insider, for all intents and purposes, looks normal until they have the intent to complete a malicious attack. Additionally, it's often difficult to determine what normal behavior is versus insider threat behavior. Even if you could somehow solve the insider threat problem with technology, there's an entire aspect of social behavior to consider. Often times, it's challenging to predict when and why an insider may go down a slippery slope to committing an insider attack.. Duration of Insider Cases. KITTEN: How long do most insider fraud cases go on before they're detected?. CLARK: We define insider fraud as an insider's use of IT, from the unauthorized modification, addition or deletion of an organization's data, not programs or systems, for personal gain; or theft of information which leads to fraud, identity theft, credit card fraud, etc. Our results, based on a study completed in July 2012 sponsored by the Department of Homeland Security Science and Technology Directorate and the Homeland Security Advanced Research Projects Agency in collaboration with 80 case files from the Secret Service, found.. there are approximately 32 months on average between the beginning of the fraud and its detection.. KITTEN: Are there certain industries or sectors that are at greater risk for insider fraud than others?. 
CLARK: Of the 250 cases that we have coded and subsequently analyzed at the SEI, we have found as no surprise that banking/financial was the highest industry coming in at about 47 percent of our cases. This was followed by government at the state, local and federal level as well as healthcare, commercial facilities and communications.. Average Cost of Insider Fraud Scheme. KITTEN: How much does a typical insider fraud scheme cost a company or organization?. CLARK: According to the previous study mentioned earlier, sponsored by the Department of Homeland Security Science and Technology Directorate and the 80 case files we received from the Secret Service, we found that average damage caused by managers was slightly over $1.5 million, and the median was approximately $200,000. For non-managers, we found the average was $287,000, and median of $112,000. Of the 250 cases in our own database where we had information on the financial impact of the cases, 13 percent were impacted over a million dollars, while 32 percent had an impact between $100,000 and $999,000. Finally, 19 percent had an impact of between $10,000 and $99,000.. However, there are additional costs that cannot be measured in dollar figures alone. There are operational costs, loss of customers, embarrassment, lost privacy in the form of stolen PII that could cause additional damage that can't be measured in just dollar figures.. Emerging Insider Threat Concerns. KITTEN: What are some of the emerging insider threat concerns?. CLARK: While we do not have the data to support this, nor do we have real-time data, we do see some emerging concerns for 2014 that we're researching and watching very closely. Of interest is a look at the difference between U.S.-based and international insider threat cases. We're also looking at insider threat in the cloud and unintentional insider threat problems. 
Often we put such information in the form of a blog post on the CERT website which can be found at www.cert.org/insider_threat.. Use of Technology. KITTEN: What types of technologies should organizations be investing in to help at least mitigate some of their risks?. CLARK: Given that we're an FFRDC - federally funded research and development center - we're not at liberty to provide any specific vendor product recommendations. However, if we take a look at how we break down the insider threat problems in three key areas - IT sabotage, theft of intellectual property and insider fraud - we can offer categories of technologies. For IT sabotage, we would suggest technologies such as resiliency, back-up, access control, code review and log analysis. For theft of IP, we would suggest data loss prevention, or DLP solutions, encryption, intrusion detection systems and the like. For fraud, we would really consider business practices such as two-factor authorization as well as auditing technologies. Also, technologies that are capable of detecting unauthorized addition or modification of data in databases is of paramount importance. However, as a reminder, technology alone cannot solve this problem.. Breaches Linked to Insiders. KITTEN: How often would you say breaches are linked to an insider?. CLARK: This is a difficult question to answer. However in 2013, [a] magazine conducted a survey of 501 respondents and found that 53 percent of participants stated that they experienced an insider incident. It's unknown as to whether these reports are linked to a specific insider. Also, there are elements that many of these insider attacks are grossly under-reported. The most likely reasons for the under-reporting come from the fact that damage level is insufficient to warrant prosecution or there's a lack of evidence to prosecute. Often it's difficult to identify the individuals responsible for committing an electronic crime. 
The study also showed that 75 percent of the time organizations do not involve law enforcement. Additionally, the survey found that electronic crime, or e-crime, events were known [or] suspected to have been caused by outsiders 56 percent of the time, insiders 23 percent of the time, and 21 percent of the time it was unknown.. Information Sharing. KITTEN: How has information sharing helped to reduce insider fraud losses across industries, if at all?. CLARK: Information sharing has certainly helped reduce insider fraud. Of course, agencies and organizations are somewhat fearful of sharing information. However, there are several task forces, such as the National Insider Threat Task Force, that really strive to improve information sharing. Conferences and other trusted communities are important in reducing insider fraud losses across industries. However, unless there are formal agreements in place, information sharing will not occur as frequently as it probably should. Given the CERT insider threats group can't compete as a trusted broker, we're in a unique position to conduct unbiased assessments of an organization's insider threat program. We have over 13 years of experience and have a wide variety of services, including training, workshops, assessments and [can] help an organization in setting up their own insider threat program. All the information we collect is protected and our reputation is stellar so we certainly urge organizations to reach out to us on our website at www.cert.org/insider_threat, and contact us if you have any questions or need assistance.",F,7,0,0,1,0,0,0,Often we put such information in the form of a blog post on the CERT website which can be found at www.cert.org/insider_threat.. Use of Technology.
8,"There are no bad robots or good robots, says Tracy Altman. However, the humans developing and applying AI technology can certainly have either good or bad intentions — and scientists are still grappling how AI will reshape the world in both constructive and deleterious ways.. Altman, executive director of the Museum of AI in Denver promises that this future pop-up attraction will look at both sides of the artificial intelligence debate across an array of industries, including cybersecurity.. In the world of cyber, AI can act as friend or foe. For instance, network defenders can employ AI to quickly flag and mitigate anomalous end-user behavior or identify malware based on its inherent characteristics. But on the flip side, attackers can leverage AI to craft deceiving deepfake videos or generate phishing email content.. Click here for more SC Media coverage from the Identiverse Conference.. The museum, which is slated to open in fall 2022, will use immersive theater and interactive visitor engagement to tell the story of AI from a B2B perspective — though consumers and students are meant to enjoy the experience as well. Altman said she has seen very little happening in B2B in terms of experiential marketing or experiential learning, so the Museum of AI is created to close that gap.. One exhibit will look at the deepfake phenomenon, noted Altman, in an interview with SC Media that was recently conducted at the Gaylord Rockies Resort in the Denver metropolitan area. (The resort is not affiliated with the museum.). The docent, aka actor, in our story... challenges the guests to identify an authentic video versus a deepfake, and they learn various steps that we as humans can apply, Altman said. And then they also learn how an AI vendor... [would] apply AI to identify anomalies that a human would never recognize.. While AI's presence in our world is rapidly expanding, the museum itself is very much powered by genuine human creativity and inspiration. 
In addition to Altman, the attraction is buoyed by a team of visionaries that includes artistic director Lonnie Hanzon, whose installations can be seen across Denver including at Coors Field; head storyteller Jessica Austgen, who was named 2018 Colorado Theatre Person of the Year; experience designer Cody Borst, who has constructed over 20 escape rooms; and Vice President of Operations Jeff Altman, a sales and marketing veteran. Additional advisors and experts are also contributing to various projects within the museum's walls.. Altman said she hopes that as visitors exit the experience, they will feel transformed and ready to take action, and more seriously recognize how AI might help them in their work.. Watch the embedded video to see Tracy Altman preview what the Museum of AI will have to offer. Perhaps the SC Media team will return in the fall to cover the museum's opening in person. Unless, of course, AI bots will be writing all our stories for us by then.",IA,3,0,0,0,0,1,0,Perhaps the SC Media team will return in the fall to cover the museum's opening in person.
9,"In this exclusive interview, security expert Diana Kelley discusses:. The types of multi-channel fraud now prevalent in the marketplace;. How these attacks are launched;. Ways institutions can spot and respond to the threat.. Diana Kelley founded SecurityCurve in April of 2003. She has extensive experience creating secure network architectures and business solutions for large corporations and delivering strategic, competitive knowledge to security software vendors.. Prior to returning to SecurityCurve in January 2008, she was Vice President and Service Director for the Security and Risk Management Strategies (SRMS) service at Burton Group. Diana was the Executive Security Advisor for CA's eTrust Business Unit. At CA she was responsible for advising customers on strategic security solutions and helped guide CA's security business.. She served as the Vice President of Security Technology for Safe3W, Inc (acquired by iPass), a provider of strong, two factor authentication. Representing Safe3W she was actively involved in the Technical Group for NACHA's Project Action. And she was a security industry Analyst with Baroudi Bloor, a top-tier analyst firm where she delivered strategic advice to, among others, IBM and Psionic (acquired by Cisco.). TOM FIELD: Hi, this is Tom Field, Editorial Director with Information Security Media Group. The topic today is multi-channel fraud, and we're talking with Diana Kelley, Partner with Security Curve. Diana, thanks so much for joining me today.. DIANA KELLEY: Oh, thanks for having me, Tom.. TOM FIELD: Now you've recently written a white paper about multi-channel fraud, and I wanted to give you just a second to tell us a bit about yourself and about Security Curve.. KELLEY: Okay, sure. Thanks. 
Well, I'm an 18-year veteran of IT and security, and I have a broad range of experience including being a manager and a financial service consulting at KPMG, a general manager at Symantec, and most recently I was the vice-president and service director at Burton Group for the security and risk management strategy service. So it's you know, sort of a full background of actually having worked with vendors, worked as a SI and doing system integrations inside of large organizations, as well as looking at it from the analyst point of view for research. I'm bringing all that together with Security Curve, which is an independent research and consultant firm that provides strategic guidance to companies and vendors.. FIELD: Now you've just written this white paper on multi-channel fraud, so my question for you is what types of fraud are you seeing in the financial services market place, and then talk about multi-channel in particular, please.. KELLEY: Well, it's interesting because they are actually very inter-twined as it works because a lot of the fraud that is going on is in fact multi-channel. It may not always appear to be multi-channeled, and that's the catch, because if you think about especially the larger financial services organizations, the numbers of ways to get into the system to look at the information. You can actually have a fraudulent transaction occur that may appear to come from one channel. Phone for example is one channel where there is a lot of rise going on. It's been reported in the media. So phone fraud is definitely on the rise and trackable, but sometimes that phone fraud is actually fueled by fraud that is coming in from other channels, for example online.. So what attackers are really trying to do is two things: 1) is to increase their intelligence so that when they go in and execute an attack, they have more information so they can make that attack a little bit stronger, a little bit better. 
For example, knowing when you get your paycheck or when you get that bonus. That is knowledge that is useful to an attacker. They know how much is in that account that they could potentially remove unfortunately, but that is what they are trying to do. 2) The other thing that the attackers are doing with the multi-channel is they are trying to make it hard for financial services organizations to understand where that attack is coming from. So if you break it up, it can be more difficult to understand what they are trying to do than if you are going through only one specific area. So it is pretty similar with when you see attacks online. What we're seeing is a rise in password-stealing software, for example, but then nobody's quite sure, well, how is that password-stealing software being used? Is it in fact being used to get into accounts and move money? And the multi-channel attack, you could steal a password that appears to go through one channel. You get online through that reconnaissance work and know how much information or how much money is in that account, or get information such as even in some cases credit card numbers, and I can explain that a little bit more if you liked. But then go off and then make the attack through another channel, and then tying that all together is where it really becomes very difficult sometimes to know.. You know, from all those different points of getting into the information what the attackers are really up to, or that it is the same attack line from one particular attack group.. FIELD: Now one of the things that was interesting in your white paper, is you actually outline the anatomy of an attack. Could you sort of summarize and give us kind of the profile you are looking at there.. KELLEY: Yeah this is actually came out of an interview that I had with one of the financial services firms I spoke with, and they did approve use of that attack without mentioning the name of the financial services firm. 
And what happened in that attack was exactly the kind of point that I had found was happening again and again, so it was a good summary.. In that case, it appeared to be an off-line attack. So it was an attack where an order came through for transfer of money and the approval came through on a fax. So, you would see that as -- although we know faxes are digital -- that doesn't count as being on the internet. So first that institution thought 'Well, this is your kind off-line attack,' but as they went back through and looked at all the activity on that particular account where the fraud had occurred, they realized that the attack had started much earlier than when that fax got transferred. The attack had started back in the online account, but what the attacker had done on the online account did not appear fraudulent because they weren't flagging for that level of activity. And it was activity, and you might say, well shouldn't they have been flagging? Well not necessarily because the activity was all reconnaissance work so it was logging in. It was understanding how much money was in the account, so how much you could transfer without you know, cleaning the account completely blank. They knew what was there, so that they weren't going to over-transfer. They could also see things such as signatures, and then signature could be used on a fax to make it appear as though it was a legitimate request for the transfer of the funds.. So as they went back, they realized that what looked like an off-line attack was in fact actually a combination online and off-line attack or multi-channel fraud.. FIELD: That is scary stuff.. KELLEY: I know it is. It was a little bit of a scary research product because you do, you hear about this quite a bit in the media, but to go through and actually talk to institutions to hear what is going on.. 
And I actually looked at my own accounts, and that is actually how as I was saying, credit card information, I did find out that in some cases our banks because they are putting, we often have savings accounts with banks. Now that we have some sort of a credit card with as well, and some organizations are actually putting the PDF of your credit card statements online. And guess what? That is one of the few places other than your credit card where you see that full sixteen digits... FIELD: Now given the economic times we're in right now, are financial institutions more vulnerable to fraud?. KELLEY: Because of what would be the economic crisis that is going on? Would it make them more vulnerable? I don't think necessarily that the crisis itself.. But I think, however, what could impact additional vulnerability is that one of the best things that a financial service institution can do is to stay on top of things and monitor at all times, because fraudsters are always going to be attacking. So it's really about monitoring. And a couple of things are happening. One is that many financial services institutions have to slash their staff, and as they slash staff that could mean slashing the people that are actually monitoring the reporting tools that they have telling them a fraud is occurring or not. So that could impact. You know, if you don't have somebody watching store, then you could have a higher impact or you could see fraud increase because of that.. The other thing that is going on is that as financial institutions are essentially dropping like flies and merging and getting bought by other companies and being brought in, you're seeing a lot of IT departments that are now absorbing a whole other large company. If you are Bank of America, you are looking at what do we do with Merrill Lynch's IT organization for example. 
And as you merge sometimes -- and I'm not saying that Bank of America or the Merrill Lynch IT departments that this will happen -- but sometimes when you do see big mergers of IT departments, some things can occasionally fall through the cracks, can be hard to reconcile the different architectures quickly so that you may find that there could be some vulnerability holes there for any financial services institution that is going through a big merger or an acquisition of another institution and trying to bring all of those IT and monitoring systems on, you know into one consolidated version.. FIELD: You know it's interesting because I'm not a customer of Bank of America, but I'm receiving phishing emails from Bank of America now, you know to the tune of you might be a new customer, you are coming over. It seems like we're going to be seeing a lot of those.. KELLEY: Yeah and the attackers their whole thing, as I said earlier, you know if somebody is not watching the store as these companies are going through the mergers and they're cutting, and some companies are cutting staff so yeah they're going to try. They are going to try to exploit that.. FIELD: Now it occurs to me, Diana, that one of the risks that financial institutions have to be mindful of is the insider threat, as you said, people are cutting staff people are losing jobs, and there is a little more desperation. Have you ever heard of multi-channel insider fraud?. KELLEY: That was actually not one of the things that I was researching specifically, but without a doubt when you've got insiders -- especially as they become disgruntled -- that is going to rise up as a potential. So this is certainly a red-flag and something that organizations should be aware of, and again great monitoring is going to help, but also identity management where you know, you cut off people's access. It's a little bit scary. 
It crosses all kinds of verticals that even with all the work we've done in identity management, how many times you run into somebody weeks even months after leaving a company saying, hey, you want to see something cool? and they can still log in with privileged access to trusted machine. So that is definitely something that the financial services should be very, very aware of. Most of them do have very strong identity management and do have a very global across the board cut-off but for any that don't. It's always a good time; this is even a better time to make sure that is up and running properly.. FIELD: Well you raise a good point there. What are some of the other risks and vulnerabilities that institutions should specifically watch now?. KELLEY: Well, I do think that this multi-channel is very important, and that was really the point of the research is that sometimes what seems to be innocuous behavior is actually just a reconnaissance mission. So you said for example that you are seeing increased phishing, and that increased phishing you know being able to tie that back to is this actually now resulting in an increase reconnaissance work and is that you know, resulting in an increased fraud. So I think that is absolutely major. We've got the red flags from that coming out soon, so I think that is November 1st is when the red flag goes into effect.. So companies looking at that and being more aware of things, such as you know, and in red flags they are talking about address changes. And people might say, Does that really matter but that is a core to its huge potentially, it's potentially innocuous. It is something that many of us do multiple times in our life. We legally, legitimately change our address, yet it is an underlining piece of you know how you can begin an identity theft attack. 
So I think it is really the understanding that there are many of these different pieces of information that people can either receive or alter that would leave them to be able to launch the bigger attacks is one of the most important things for SI to look at. And so a good thing to do would be to really tie them together so as they are bringing organizations together or even cutting staff, look at the increasingly efficient about how their tying the information together. That the credit cards subsidiary and the banking subsidiary are all sharing information with each other so they can tie together suspicious activity on these accounts.. FIELD: One last question for you, Diana. In your research, what are some of the effective ways you want covered that institutions are responding to in preventing multi-channel fraud?. KELLEY: Well, they've all got a lot of really great tool boxes, which is wonderful. And monitoring is without a doubt one of the things that organizations use in order to find out that this is that fraud is going on. And so credit card transactions for example, they've got very good with all the systems to identify does this look like Tom? Does Tom usually his credit card this way or not? So continuing to use those kinds of tools, we've seen because of the strong authentication guidance, we've seen a great increase in what they are doing to prevent you know, simple log-ins. Just stealing your user name and password may not be enough for log-ins whether they've added complete two-factor or what I think of as partial two-factor, when you get mutual authentication with they want to mark you, so you know it is really the site. You don't log in and give your credentials away to the wrong site, hopefully, to an attack site. But also being able to do a little bit of additional factoring on whether or not you are the legitimate user by understanding things such as what is your IP address, where do you usually log in from, what time do you usually log in. 
So that, although, just because you login from a different machine, doesn't you are fraudulent, but if you start adding out plots of different pieces of information, then you can start to see that this doesn't appear to be Tom. Because when you go with your credit card, you actually have so many recognizable patterns of how you use your credit card. We are finding out that there is fairly recognizable pattern of how people access their bank accounts and what they do online with them.. FIELD: That makes sense. Diana, I appreciate your time and your insight today.. KELLEY: Oh sure, my pleasure.. FIELD: We've been talking with Diana Kelley, a Partner with Security Curve. The topic has been multi-channel fraud. For Information Security Media Group, I'm Tom Field. Thank you very much.",F,7,0,0,1,0,0,0,2) The other thing that the attackers are doing with the multi-channel is they are trying to make it hard for financial services organizations to understand where that attack is coming from.


*The* metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [None]:
metric

Metric(name: "accuracy", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (`list` of `float`): Sample weights Defaults to None.

Returns:
    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`.. A higher score means higher accuracy.

Examples:

    Example 1-A simple example
        >>> accuracy_metric = datasets.load_metric("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
        >>> print(results)
        {'accuracy': 0.5}

   

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in the format the model expects, as well as generate the other inputs the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Using pad_token, but it is not set yet.


We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you got an error with the previous call, remove that argument.

Depending on the model you selected, you will see different keys in the dictionary the tokenizer returns when you call it on some text (for GPT-2, `input_ids` and `attention_mask`). They don't matter much for what we're doing here (just know they are required by the model we will instantiate later); you can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

To preprocess our dataset, we need the names of the columns containing the sentence(s). The following dictionary keeps track of the correspondence between task and column names:

In [None]:
data_key = {
    "Incident": ("SUM", None),
}

We can double check it does work on our current dataset:

In [None]:
sentence1_key, sentence2_key = data_key["Incident"]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

Sentence: As part of the 100-day plan for the nation's electrical grid, the Energy Department's Office of Cybersecurity, Energy Security, and Emergency Response, or CESER, will work with the Cybersecurity and Infrastructure Security Agency and private utilities to make a series of cybersecurity improvements..


We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model.

In [None]:
def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
preprocess_function(dataset['train'][:5])

{'input_ids': [[1722, 636, 286, 262, 1802, 12, 820, 1410, 329, 262, 3277, 338, 12278, 10706, 11, 262, 6682, 2732, 338, 4452, 286, 15101, 12961, 11, 6682, 4765, 11, 290, 18154, 18261, 11, 393, 42700, 1137, 11, 481, 670, 351, 262, 15101, 12961, 290, 33709, 4765, 7732, 290, 2839, 20081, 284, 787, 257, 2168, 286, 31335, 8561, 492], [818, 281, 14112, 3335, 11, 262, 18953, 15455, 262, 13931, 329, 262, 555, 43628, 1366, 290, 12800, 262, 1321, 736, 284, 262, 7394, 5937, 11, 1864, 284, 262, 989, 492], [1722, 2828, 2555, 284, 33960, 625, 262, 5885, 286, 3394, 12, 34762, 10075, 38458, 1028, 262, 471, 13, 50, 1539, 734, 8667, 15469, 422, 1111, 4671, 765, 284, 760, 517, 546, 703, 262, 2732, 286, 17444, 4765, 290, 663, 7515, 5942, 389, 386, 33329, 1762, 284, 1327, 268, 262, 4875, 18370, 286, 262, 2717, 1230, 290, 4688, 6884, 492], [43961, 13169, 10229, 284, 1866, 286, 281, 14199, 27387, 1448, 326, 416, 4277, 423, 1336, 1630, 286, 262, 7386, 11, 543, 318, 1521, 884, 18031, 460, 307, 7977, 416, 16391,
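
To make the output shape above concrete without downloading anything, here is a toy sketch. It uses a hypothetical `toy_tokenize` function (whitespace splitting with a made-up vocabulary, not the GPT-2 tokenizer) to show that a batch comes back as one list of lists per key, and that truncation simply caps each inner list at a maximum length. The `max_length=6` value is an arbitrary assumption for the illustration.

```python
# Toy stand-in for a tokenizer: NOT the GPT-2 tokenizer, just whitespace
# splitting with a vocabulary built on the fly.
def toy_tokenize(texts, max_length=6):
    vocab = {}
    batch = {"input_ids": [], "attention_mask": []}
    for text in texts:
        # Assign each new word the next free integer id.
        ids = [vocab.setdefault(word, len(vocab)) for word in text.split()]
        ids = ids[:max_length]  # truncation: keep at most max_length tokens
        batch["input_ids"].append(ids)
        batch["attention_mask"].append([1] * len(ids))
    return batch


batch = toy_tokenize(["the grid plan covers utilities", "a b c d e f g h"])
print(batch["input_ids"])       # one inner list per input text
print(batch["attention_mask"])  # same shape as input_ids
```

The second text has eight words, so its inner list is cut to six ids, while the first (five words) is left untouched.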

To apply this function to all the sentences (or pairs of sentences) in our dataset, we just use the `map` method of the `dataset` object we created earlier. This applies the function to all the elements of all the splits in `dataset`, so every split is preprocessed in one single command.

In [None]:
encoded_dataset = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/6247 [00:00<?, ? examples/s]

Map:   0%|          | 0/2678 [00:00<?, ? examples/s]

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (in which case the cached data must not be reused). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts in batches. This leverages the full benefit of the fast tokenizer we loaded earlier, which uses multi-threading to process the texts in a batch concurrently.
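
As a rough idea of what `batched=True` changes, the sketch below uses a hypothetical `batched_map` helper (an illustration only, not the 🤗 Datasets implementation): the function receives a dictionary of lists, one list per column, instead of a single row. `count_words` stands in for `preprocess_function`, and the tiny `batch_size=2` is an assumption chosen to keep the example readable.

```python
def batched_map(rows, function, batch_size=2):
    """Call `function` on column-wise batches of `rows` and merge the new columns back."""
    output = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # A batch is a dict of lists: one list per column.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(batch)
        # Split the returned lists back into one value per row.
        for i, row in enumerate(chunk):
            output.append({**row, **{key: values[i] for key, values in new_columns.items()}})
    return output


def count_words(batch):
    # Stand-in for preprocess_function: returns one value per input text.
    return {"n_words": [len(text.split()) for text in batch["SUM"]]}


rows = [{"SUM": "one two"}, {"SUM": "three"}, {"SUM": "four five six"}]
mapped = batched_map(rows, count_words)
print(mapped)
```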

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is sequence classification, we use a sequence classification class (`GPT2ForSequenceClassification` here; `AutoModelForSequenceClassification` is the generic equivalent). Like with the tokenizer, the `from_pretrained` method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem, which is 11 for this dataset:

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, GPT2ForSequenceClassification
import numpy as np

num_labels = 11
# Generic alternative that picks the right class from the checkpoint:
#model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

# `from_pretrained` is a class method: call it on the class directly rather than
# on a freshly built (randomly initialized) instance, which would be discarded.
model = GPT2ForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
# GPT-2 has no pad token; reuse EOS (as done for the tokenizer) so batches can be padded
model.config.pad_token_id = model.config.eos_token_id

Downloading model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The warning is telling us that some weights (the classification head, `score.weight`) are not present in the `gpt2` checkpoint and are therefore randomly initialized. This is absolutely normal in this case: the checkpoint was pretrained on a causal language modeling objective, and we are adding a new sequence classification head for which we don't have pretrained weights. The library warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.
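
Setting `pad_token_id` on the model configuration matters for this architecture: `GPT2ForSequenceClassification` classifies each sequence from the hidden state of its last non-padding token, so it needs to know which ID is padding. The sketch below is a simplified pure-Python illustration of that lookup, not the library's code. It assumes right padding, and `last_real_token_index` is a hypothetical helper name. The token IDs are taken from the tokenizer output shown earlier, and 50256 is GPT-2's EOS ID.

```python
PAD_ID = 50256  # GPT-2's EOS id, reused as the pad id in this notebook


def last_real_token_index(input_ids, pad_token_id):
    """Return, for each row, the index of the last token that is not padding."""
    indices = []
    for ids in input_ids:
        real = [i for i, token in enumerate(ids) if token != pad_token_id]
        # Fall back to position 0 for a row made only of padding.
        indices.append(real[-1] if real else 0)
    return indices


# Two right-padded sequences of equal length, as a collated batch would look.
batch = [
    [1722, 636, 286, PAD_ID, PAD_ID],
    [818, 281, 14112, 3335, 11],
]
positions = last_real_token_index(batch, PAD_ID)
print(positions)
```

The classification head would read the hidden state at index 2 for the first row and index 4 for the second.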

To instantiate a `Trainer`, we will need to define two more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
metric_name = "accuracy"
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=True,
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook and customize the number of epochs for training, as well as the weight decay. Since the best model might not be the one at the end of training, we ask the `Trainer` to load the best model it saved (according to `metric_name`) at the end of training.
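
The idea behind `load_best_model_at_end` can be sketched in a few lines: keep one evaluation record per epoch and pick the record with the best value of the chosen metric. This is only an illustration of the selection rule, with a hypothetical `pick_best` helper, not the `Trainer` internals. The two records reuse the validation loss and accuracy logged for epochs 1 and 2 of the training run in this notebook, and the `eval_` key prefix is an assumption made for the example.

```python
# Evaluation records for the two epochs logged in this notebook's training run.
eval_history = [
    {"epoch": 1, "eval_loss": 1.469651, "eval_accuracy": 0.42009},
    {"epoch": 2, "eval_loss": 1.407831, "eval_accuracy": 0.438013},
]


def pick_best(history, metric="eval_accuracy", greater_is_better=True):
    """Return the record with the best value of `metric`."""
    chooser = max if greater_is_better else min
    return chooser(history, key=lambda record: record[metric])


best = pick_best(eval_history)
print(best["epoch"], best["eval_accuracy"])
```

For a metric such as accuracy, higher is better; for a loss you would select with `greater_is_better=False`.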



The last thing to define for our `Trainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `metric` we loaded earlier. The only preprocessing we have to do is to take the argmax of our predicted logits (or just squeeze the last axis in the case of STS-B):

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)
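
To see what the argmax step and the accuracy metric do, here is a small dependency-free sketch. The helpers `argmax_rows` and `accuracy` are hypothetical stand-ins for `np.argmax(..., axis=1)` and `metric.compute`, and the logits are made-up numbers chosen so that the predicted labels match the example in the metric's docstring.

```python
def argmax_rows(logits):
    """Index of the largest value in each row (like np.argmax(logits, axis=1))."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(predictions, references):
    """Fraction of predictions equal to the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up logits for 6 examples and 3 classes.
logits = [
    [2.0, 0.1, -1.0],
    [0.3, 1.5, 0.2],
    [0.0, 0.9, 0.4],
    [-0.5, 0.1, 1.2],
    [0.2, 2.2, 0.3],
    [1.1, 0.0, 0.5],
]
labels = [0, 1, 2, 0, 1, 2]

predictions = argmax_rows(logits)
score = accuracy(predictions, labels)
print(predictions, score)
```

Three of the six predictions match their labels, which gives the same 0.5 accuracy as the docstring example.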

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
validation_key = "test"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

You might wonder why we pass along the `tokenizer` when we already preprocessed our data. This is because we will use it one last time to make all the samples we gather the same length by applying padding, which requires knowing the model's preferences regarding padding (to the left or right? with which token?). The `tokenizer` has a `pad` method that will do all of this for us, and the `Trainer` will use it. You can customize this part by defining and passing your own `data_collator`, which will receive samples like the dictionaries seen above and will need to return a dictionary of tensors.
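
The padding step can be pictured with the following sketch. `pad_batch` is a hypothetical helper written for illustration, not the tokenizer's `pad` method: it returns plain Python lists, whereas a real `data_collator` would return tensors. The pad ID 50256 is GPT-2's EOS ID, which this notebook reuses as the pad token.

```python
def pad_batch(features, pad_token_id, padding_side="right"):
    """Pad every sample to the length of the longest one and build attention masks."""
    max_len = max(len(feature["input_ids"]) for feature in features)
    batch = {"input_ids": [], "attention_mask": []}
    for feature in features:
        ids = list(feature["input_ids"])
        n_pad = max_len - len(ids)
        if padding_side == "right":
            batch["input_ids"].append(ids + [pad_token_id] * n_pad)
            batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        else:
            batch["input_ids"].append([pad_token_id] * n_pad + ids)
            batch["attention_mask"].append([0] * n_pad + [1] * len(ids))
    return batch


features = [{"input_ids": [1722, 636, 286]}, {"input_ids": [818, 281]}]
padded = pad_batch(features, pad_token_id=50256)
print(padded)
```

The attention mask marks padded positions with 0 so the model can ignore them.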

We can now finetune our model by just calling the `train` method:

In [None]:
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.469651,0.42009
2,1.606600,1.407831,0.438013


We can check with the `evaluate` method that our `Trainer` did reload the best model properly (if it was not the last one):

In [None]:
trainer.evaluate()

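To see how your model fared on a GLUE task, you can compare it to the [GLUE Benchmark leaderboard](https://gluebenchmark.com/leaderboard); for a custom dataset like the one used here, the leaderboard scores are not directly comparable.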

You can now upload the result of the training to the Hub by executing this instruction:

In [None]:
trainer.push_to_hub()