In [1]:
from IPython.display import display, HTML
display(HTML("""
<style>
div.container{width:90% !important;}
div.cell.code_cell.rendered{width:100%;}
div.input_prompt{padding:0px;}
div.CodeMirror {font-family:Consolas; font-size:12pt;}
div.text_cell_render.rendered_html{font-size:12pt;}
div.output {font-size:12pt; font-weight:bold;}
div.input{font-family:Consolas; font-size:12pt;}
div.prompt {min-width:70px;}
div#toc-wrapper{padding-top:120px;}
div.text_cell_render ul li{font-size:12pt;padding:5px;}
table.dataframe{font-size:12px;}
</style>
"""))

In [2]:
import warnings
import os
import logging
# Suppress warnings
warnings.filterwarnings('ignore')

# Adjust the transformers logging level
logging.getLogger("transformers").setLevel(logging.ERROR)

# Suppress the Hugging Face symlink warning
os.environ['HF_HUB_DISABLE_SYMLINKS_WARNING'] = '1'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# from transformers import pipeline, logging as hf_logging
# hf_logging.set_verbosity_error()


**<font size='6' color='red'>ch1_Hugging Face</font>**
- Using the Inference API: the model runs on a server, which returns the result
- Using pipeline(): the model is downloaded and the result is computed locally
    * raw text -> tokenizer -> model -> prediction output as logit values such as [0.11, 0.55, 0.xx, ~]
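
The last step of the flow above (logits to prediction) can be sketched without downloading a model. The logits and the label order below are made-up values for illustration, not output from a real checkpoint; a real pipeline gets them from the tokenizer and the model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values: the two-class logits and the label order are assumptions.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-1.5, 2.0]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
prediction = {"label": id2label[best], "score": probs[best]}
print(prediction)
```

The dictionary has the same shape as the `{'label': ..., 'score': ...}` entries that the classification pipelines return.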

```
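Tasks supported by Hugging Face transformers
"sentiment-analysis" : alias of "text-classification" (used for sentiment analysis)
"text-classification" : general sentence classification such as sentiment analysis, news classification, and review classification
"zero-shot-classification" : classifies among given candidate labels without training on those labels
"token-classification" : token-level labeling such as named entity recognition (NER)
"ner" : alias of "token-classification"
"fill-mask" : fills in a blank
"text-generation" : text generation (used with GPT-style models)
"text2text-generation" : input -> output conversion such as translation and summarization
"translation" : translation
"summarization" : text summarization
"question-answering" : answers a question based on a given context
"image-to-text" : describes an image
"image-classification" : image classification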
```

## 1. Text-based sentiment analysis (positive/negative)
- Models are downloaded to c:/Users/<your user name>/.cache/huggingface/hub

In [3]:
from transformers import pipeline
classifier = pipeline(task="sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.951606810092926}]

In [4]:
classifier = pipeline(task="text-classification",
                     model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
# Pass a list to analyze several texts at once
classifier([
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!"
])

Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [5]:
classifier(["이 영화 정말 최고였어요. 감동적이고 연기가 대단해",
           "This moive was the best. It's touching, and the acting is amazing"])

[{'label': 'POSITIVE', 'score': 0.857815682888031},
 {'label': 'POSITIVE', 'score': 0.9998846054077148}]

In [6]:
classifier("이 물건 정말 사고 싶어요")

[{'label': 'POSITIVE', 'score': 0.8577604293823242}]

In [7]:
classifier(["I like you", "I hate you", "나 너가 싫어", "힘들어요"])

[{'label': 'POSITIVE', 'score': 0.9998695850372314},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079},
 {'label': 'NEGATIVE', 'score': 0.599323034286499},
 {'label': 'POSITIVE', 'score': 0.8669533729553223}]
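
The model used so far is an English SST-2 checkpoint, so its scores on Korean inputs are not reliable: "힘들어요" (roughly "I'm having a hard time") comes back as POSITIVE. One possible safeguard, sketched below, is to flag predictions under a confidence cutoff for review. The cutoff of 0.9 and the helper name `flag_uncertain` are assumptions for illustration, and a cutoff does not make an English model understand Korean.

```python
# Predictions copied from the output above (scores rounded to 4 decimals).
texts = ["I like you", "I hate you", "나 너가 싫어", "힘들어요"]
results = [
    {"label": "POSITIVE", "score": 0.9999},
    {"label": "NEGATIVE", "score": 0.9991},
    {"label": "NEGATIVE", "score": 0.5993},
    {"label": "POSITIVE", "score": 0.8670},
]

THRESHOLD = 0.9  # assumed cutoff for illustration; tune it on real data

def flag_uncertain(texts, results, threshold=THRESHOLD):
    """Return (text, label, score) triples whose score is below the threshold."""
    return [
        (text, r["label"], r["score"])
        for text, r in zip(texts, results)
        if r["score"] < threshold
    ]

uncertain = flag_uncertain(texts, results)
for text, label, score in uncertain:
    print(f"{text} => {label} ({score:.4f}) needs review")
```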

In [8]:
classifier = pipeline(task="sentiment-analysis",
                      model="matthewburke/korean_sentiment")
texts = ['나는 너가 좋아', "당신이 싫어요", "힘들어요", "오늘 기분이 최고야"]
result = classifier(texts)

Device set to use cpu


In [9]:
for text, res in zip(texts, classifier(texts)):
    label = "positive" if res['label'] == 'LABEL_1' else "negative"
    print(f"{text}=>{label} : {res['score']:.4f}")

나는 너가 좋아=>positive : 0.9558
당신이 싫어요=>negative : 0.9093
힘들어요=>negative : 0.9140
오늘 기분이 최고야=>positive : 0.9714


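## 2. Zero-shot classification
- In machine learning and natural language processing, a zero-shot model performs a task without training specific to that task; here the classifier has never been trained on the candidate labels. This is not unsupervised learning: the default model, facebook/bart-large-mnli, was trained with supervision on a natural language inference dataset.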

In [10]:
classifier = pipeline("zero-shot-classification",
                      # model="facebook/bart-large-mnli"
                      )
classifier(
    "I have a problem with my iphoe that needs to be resloved asap!",
    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"]
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'sequence': 'I have a problem with my iphoe that needs to be resloved asap!',
 'labels': ['phone', 'urgent', 'computer', 'tablet', 'not urgent'],
 'scores': [0.6687335968017578,
  0.31948044896125793,
  0.005518774501979351,
  0.004069005139172077,
  0.002198058646172285]}
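
The scores in the output above add up to about 1. To my understanding, with the default single-label setting the pipeline scores each candidate label as an NLI hypothesis and then applies a softmax across the labels. The sketch below first checks the sum on the scores printed above, then applies the same normalization to made-up logits. The logit values and the helper name `rank_labels` are assumptions, not model output or library API.

```python
import math

# Scores copied from the output above.
labels = ["phone", "urgent", "computer", "tablet", "not urgent"]
scores = [
    0.6687335968017578,
    0.31948044896125793,
    0.005518774501979351,
    0.004069005139172077,
    0.002198058646172285,
]
total = sum(scores)
print(f"sum of scores = {total:.6f}")

def rank_labels(entailment_logits):
    """Softmax over per-label logits, then sort labels by descending score."""
    m = max(entailment_logits.values())
    exps = {k: math.exp(v - m) for k, v in entailment_logits.items()}
    z = sum(exps.values())
    return sorted(((k, e / z) for k, e in exps.items()),
                  key=lambda kv: kv[1], reverse=True)

# Hypothetical entailment logits, one per candidate label (not real model output).
ranked = rank_labels({"travel": 4.0, "cooking": -0.5, "dancing": -0.3})
print(ranked)
```

Because the scores compete with each other, adding or removing a candidate label changes the score of every other label.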

In [11]:
sequence_to_classify = "One dat I well see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
classifier(sequence_to_classify, candidate_labels)

{'sequence': 'One dat I well see the world',
 'labels': ['travel', 'dancing', 'cooking'],
 'scores': [0.9798721671104431, 0.011021879501640797, 0.009105958044528961]}

## 3. Text generation

In [12]:
from transformers import pipeline
generation = pipeline("text-generation", "gpt2") # text generation (GPT-3 and later GPT models are not available on Hugging Face)
generation(
    "in this course. We will teach you how to",
    pad_token_id=generation.tokenizer.eos_token_id
)

Device set to use cpu


[{'generated_text': 'in this course. We will teach you how to use virtual machine and Java to build a virtual machine that can run Windows. We will be able to write applications with a Java virtual machine on your home computer. We will be able to run Android applications on your home computer. We will be able to run Windows applications on your home computer. We will be able to use a virtual machine that will run Linux on your home computer.\n\nWe will be able to use a virtual machine that can run Windows on your home computer. We will be able to write applications with a Java virtual machine on your home computer. We will be able to run Android applications on your home computer. We will be able to run Windows applications on your home computer. We will be able to use a virtual machine that will run Linux on your home computer. Our course will address the following topics:\n\nHow to build a virtual machine that can run Java on your home computer\n\nHow to use Java as a virtual machin

In [13]:
generation = pipeline("text-generation", "gpt2") # text generation (GPT-3 and later GPT models are not available on Hugging Face)
result = generation(
    "in this course. We will teach you how to",
    pad_token_id=generation.tokenizer.eos_token_id
)
print(result[0]['generated_text'])

Device set to use cpu


in this course. We will teach you how to create your own tools and create your own projects.

Learn to create your own projects

This course will help you create your own projects in your own way.

How to create your own projects:

1. Find the right place

A lot of people say that they are going to build a website. But what if you are just going to build a website and use a tool like Sketch or Illustrator? Now what if you just want to create your own website? How about using a tool like Paint? This course will show you how to create your own website using Sketch and Illustrator.

2. Create your own project templates

This is where you will learn how to create your own templates.

3. Create your own templates

The main idea of this course is to create a new project template. In this course, you will learn how to create your own templates.

4. Create your own templates

The main idea of this course is to create a new project template. In this course, you will learn how to create your own
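
The GPT-2 output above repeats whole passages (items 3 and 4 are nearly identical). A rough way to measure this is to count the word n-grams that occur more than once, as sketched below. Splitting on whitespace is a simplification of real subword tokenization, and `repeated_ngrams` is a hypothetical helper, not a transformers function.

```python
from collections import Counter

def repeated_ngrams(text, n=3):
    """Return the word n-grams that occur more than once, with their counts."""
    words = text.split()  # whitespace split: a simplification of real tokenization
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)
    return {gram: c for gram, c in counts.items() if c > 1}

# Excerpt copied from the generated text above.
sample = (
    "3. Create your own templates "
    "The main idea of this course is to create a new project template. "
    "4. Create your own templates "
    "The main idea of this course is to create a new project template."
)

repeats = repeated_ngrams(sample, n=3)
print(len(repeats), "distinct 3-grams repeat")
```

The `no_repeat_ngram_size` generation option targets this kind of repetition by blocking an n-gram from being generated a second time.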

In [14]:
generation = pipeline("text-generation", "skt/kogpt2-base-v2")
result = generation(
    "이 과정은 다음과 같은 방법을 알려드려요.",
    pad_token_id = generation.tokenizer.eos_token_id,
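    max_new_tokens = 100, # maximum number of new tokens to generate
    num_return_sequences=1, # number of sequences to generate
    do_sample=True, # sample tokens instead of always taking the most likely one
    top_k=50, # top-k sampling (only the 50 most probable tokens are candidates)
    top_p=0.95, # only the most probable tokens whose cumulative probability reaches 95% are candidates
    temperature=1.2, # controls creativity (lower is more conservative)
    no_repeat_ngram_size=2 # prevents repetition (no 2-gram may appear twice)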
)
print(result[0]['generated_text'])

Device set to use cpu


이 과정은 다음과 같은 방법을 알려드려요." 하고 말하고 다시 한 번 "내 말이 정말입니까?" 하며 이 말을 되풀이합니다.
어떤 사람은 이 일을 해냈습니다.
어떤 이는 "내가 일을 했다고 해. 그럼 네가 그 일을 하도록 했느냐?"고 질문합니다.
그러나 그는 그 말을 믿지 못하죠.
그러니 그 일은 일어나지 않았습니다.
'네가 왜 그랬을까?'는 질문에는 아무런 대답도 하지 않아요.
그리고 어떤 사람은 네 자신을 이해할 수 없는 것 같습니다.
그래서 네게 '넌 왜 그래, 아니면 나는?'
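
How `temperature`, `top_k`, and `top_p` narrow the next-token candidates can be sketched in plain Python. This follows the usual definitions of the three settings rather than the library's exact implementation; the toy logits and the function name `sampling_candidates` are assumptions for illustration.

```python
import math

def sampling_candidates(logits, temperature=1.0, top_k=50, top_p=0.95):
    """Return {token: prob} for tokens left after temperature, top-k, and top-p."""
    # Temperature: divide logits before softmax; values above 1 flatten the distribution.
    scaled = {tok: x / temperature for tok, x in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(x - m) for tok, x in scaled.items()}
    ranked = sorted(exps.items(), key=lambda kv: kv[1], reverse=True)

    # Top-k: keep only the k most probable tokens, then renormalize.
    ranked = ranked[:top_k]
    z = sum(e for _, e in ranked)
    probs = [(tok, e / z) for tok, e in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving candidates sum to 1.
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

# Hypothetical next-token logits (made up for illustration).
toy_logits = {"A": 3.0, "B": 2.0, "C": 1.0, "D": -2.0}
sharp = sampling_candidates(toy_logits, temperature=0.5, top_k=3, top_p=0.95)
flat = sampling_candidates(toy_logits, temperature=1.2, top_k=3, top_p=0.95)
print(sharp)
print(flat)
```

With the lower temperature the distribution is sharper, so fewer tokens are needed to reach the 95% cutoff; with temperature 1.2 more tokens stay in play, which is why higher values give more varied text.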


## 4. Fill-mask (filling in a blank)

In [15]:
unmasker = pipeline(task='fill-mask',
                   model='distilbert/distilroberta-base') # fill in the mask
unmasker("I'm going to hospital and meet a <mask>", top_k=2) # top_k defaults to 5

Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.19275707006454468,
  'token': 3299,
  'token_str': ' doctor',
  'sequence': "I'm going to hospital and meet a doctor"},
 {'score': 0.06794589757919312,
  'token': 27321,
  'token_str': ' psychiatrist',
  'sequence': "I'm going to hospital and meet a psychiatrist"}]

In [16]:
#unmasker("병원에 가서 <mask>를 만날 거예요")

In [17]:
unmasker("Hello, I'm a <mask> model.")

[{'score': 0.0629730075597763,
  'token': 265,
  'token_str': ' business',
  'sequence': "Hello, I'm a business model."},
 {'score': 0.038101598620414734,
  'token': 18150,
  'token_str': ' freelance',
  'sequence': "Hello, I'm a freelance model."},
 {'score': 0.03764132782816887,
  'token': 774,
  'token_str': ' role',
  'sequence': "Hello, I'm a role model."},
 {'score': 0.037326786667108536,
  'token': 2734,
  'token_str': ' fashion',
  'sequence': "Hello, I'm a fashion model."},
 {'score': 0.026023676618933678,
  'token': 24526,
  'token_str': ' Playboy',
  'sequence': "Hello, I'm a Playboy model."}]

In [18]:
unmasker("안녕하세요? 나는 <mask> 모델이예요.", top_k=3)

[{'score': 0.14130638539791107,
  'token': 35,
  'token_str': ':',
  'sequence': '안녕하세요? 나는: 모델이예요.'},
 {'score': 0.1223798543214798,
  'token': 116,
  'token_str': '?',
  'sequence': '안녕하세요? 나는? 모델이예요.'},
 {'score': 0.08188082277774811,
  'token': 328,
  'token_str': '!',
  'sequence': '안녕하세요? 나는! 모델이예요.'}]

In [19]:
unmasker = pipeline(task="fill-mask",
                   model="google-bert/bert-base-uncased")
unmasker("Hello, I'm a [MASK] model.")

Some weights of the model checkpoint at google-bert/bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.1441437155008316,
  'token': 2535,
  'token_str': 'role',
  'sequence': "hello, i ' m a role model."},
 {'score': 0.14175789058208466,
  'token': 4827,
  'token_str': 'fashion',
  'sequence': "hello, i ' m a fashion model."},
 {'score': 0.062214579433202744,
  'token': 2047,
  'token_str': 'new',
  'sequence': "hello, i ' m a new model."},
 {'score': 0.041028350591659546,
  'token': 3565,
  'token_str': 'super',
  'sequence': "hello, i ' m a super model."},
 {'score': 0.025911200791597366,
  'token': 2449,
  'token_str': 'business',
  'sequence': "hello, i ' m a business model."}]
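
The two checkpoints above expect different mask tokens (`<mask>` for the RoBERTa model, `[MASK]` for the BERT model), and a prompt written for one will not work as intended with the other. A small sketch of building the prompt per model follows. The lookup table only covers the two models used in this notebook, and `build_prompt` is a hypothetical helper; with a loaded pipeline, the token should also be readable from `unmasker.tokenizer.mask_token`.

```python
# Mask tokens as used in the cells above; other checkpoints may differ.
MASK_TOKENS = {
    "distilbert/distilroberta-base": "<mask>",
    "google-bert/bert-base-uncased": "[MASK]",
}

def build_prompt(template, model_name):
    """Fill the {mask} placeholder with the mask token of the given model."""
    return template.format(mask=MASK_TOKENS[model_name])

for name in MASK_TOKENS:
    print(build_prompt("Hello, I'm a {mask} model.", name))
```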

# ※ Using the Inference API

In [6]:
from dotenv import load_dotenv
import os
load_dotenv()
#os.environ['HF_TOKEN']
# Create a Hugging Face token with Read permission and add it to .env

True
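
Once `load_dotenv()` has returned `True`, the token can be read from the environment. The sketch below fails early when `HF_TOKEN` is missing and builds a bearer-style Authorization header for server-side requests. The helper name `auth_headers` and the placeholder token are assumptions; no request is sent here.

```python
import os

def auth_headers(env=os.environ):
    """Build an Authorization header from HF_TOKEN, failing early if it is missing."""
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; add it to .env and call load_dotenv() first"
        )
    return {"Authorization": f"Bearer {token}"}

# Placeholder value for demonstration only; never hard-code a real token.
headers = auth_headers({"HF_TOKEN": "hf_example_placeholder"})
print(sorted(headers))
```

Passing the environment as an argument keeps the helper easy to test without touching the real `.env` file.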