To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g. for llama.cpp).

**[NEW] Llama-3 8b is trained on a crazy 15 trillion tokens! Llama-2 was 2 trillion.**

Use our [Llama-3 8b Instruct](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing) notebook for conversational style finetunes.

In [1]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

In [2]:
!pip install huggingface_hub



* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, OpenHermes, etc.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
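
The RoPE scaling bullet above can be illustrated with a minimal sketch of linear position scaling, the idea usually associated with kaiokendev's method. Whether Unsloth implements it exactly this way is an assumption, and `rope_angles`, `TRAINED_LEN`, and `TARGET_LEN` are hypothetical names and values chosen for illustration.

```python
def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotary angles for one token position.

    A scale below 1 compresses positions, so a long sequence is mapped
    back into the position range the model saw during training.
    """
    scaled_position = position * scale
    return [scaled_position / base ** (2 * i / head_dim) for i in range(head_dim // 2)]

TRAINED_LEN = 2048   # hypothetical context length used in pretraining
TARGET_LEN = 8192    # hypothetical longer context requested by the user
scale = TRAINED_LEN / TARGET_LEN

# The last position of the long sequence lands inside the trained range.
print(8191 * scale)

# A scaled position produces the same angles as the unscaled position it maps to.
print(rope_angles(8000, 8, scale=scale) == rope_angles(2000, 8))
```

The trade-off of this approach is that neighbouring positions become closer together in angle space, which is why fine-tuning at the longer length is usually recommended.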

In [3]:
from huggingface_hub import login
from google.colab import userdata
token = userdata.get('huggingface_LCIA_2')
login(token)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
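
As a rough check on how many parameters LoRA trains, the sketch below counts the adapter weights for one configuration. The projection shapes are assumed to follow a Llama-3 8B style model with 32 layers, and the rank of 16 and the total of roughly 8.03 billion parameters are illustrative assumptions, not values read from this notebook.

```python
def lora_param_count(shapes, rank):
    """Parameters added by LoRA: each (d_in, d_out) weight gains two
    low-rank matrices, A (d_in x rank) and B (rank x d_out)."""
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes)

# Assumed per-layer shapes for a Llama-3 8B style model.
HIDDEN, KV, MLP, LAYERS = 4096, 1024, 14336, 32
per_layer_shapes = [
    (HIDDEN, HIDDEN),  # q_proj
    (HIDDEN, KV),      # k_proj
    (HIDDEN, KV),      # v_proj
    (HIDDEN, HIDDEN),  # o_proj
    (HIDDEN, MLP),     # gate_proj
    (HIDDEN, MLP),     # up_proj
    (MLP, HIDDEN),     # down_proj
]

RANK = 16                      # hypothetical LoRA rank
TOTAL_PARAMS = 8_030_000_000   # approximate size of the base model

trainable = LAYERS * lora_param_count(per_layer_shapes, RANK)
print(trainable)                          # 41943040
print(f"{trainable / TOTAL_PARAMS:.2%}")  # share of all parameters
```

The share grows linearly with the rank, so larger ranks or additional target modules move the figure toward the upper end of the range quoted above.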

## Download the fine-tuned model from Hugging Face and use it

In [4]:
import torch
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

max_seq_length = 2048  # Choose any! We auto support RoPE Scaling internally!
dtype = None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True  # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "JT000/LCIA_Test2_Llama", # The model you fine-tuned (Llama3.1_3, Llama3.1_2, Llama3.1, taide_1, taide_2)
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    attn_implementation="flash_attention_2",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# tokenizer = get_chat_template(
#     tokenizer,
#     chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
#     mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
#     map_eos_token = True, # Maps <|im_end|> to </s> instead
# )

# messages = [
#     {"from": "human", "value": "What is a famous tall tower in Paris?"},
# ]
# inputs = tokenizer.apply_chat_template(
#     messages,
#     tokenize = True,
#     add_generation_prompt = True, # Must add for generation
#     return_tensors = "pt",
# ).to("cuda")

# from transformers import TextStreamer
# text_streamer = TextStreamer(tokenizer)
# _ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.43.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/131 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/51.1k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [5]:
# Define the labels
labels = [
    "加工中心機臺", "床台結構_頭部", "床台結構_立柱", "床台結構_工作台", "床台結構_鞍座", "床台結構_底座",
    "主軸系統_主軸電機", "主軸系統_主軸軸承", "主軸系統_鬆刀機構", "主軸系統_拉刀機構", "進給系統_導軌及滑塊",
    "進給系統_導螺桿", "進給系統_進給電機", "進給系統_進給軸承", "進給系統_聯軸器", "控制系統",
    "鈑金件_强化透明隔板", "鈑金件_金屬板件", "加工機臺_周邊組件", "刀庫", "自動換刀裝置", "潤滑系統", "冷卻系統",
    "排屑裝置", "夾具", "馬達", "潤滑油", "工具", "套筒", "刀具、銑刀、鑽頭", "螺絲刀", "金屬鑄件", "原料"
]

In [None]:
# Define the labels
labels = [
    "加工中心機臺", "床台結構_頭部", "床台結構_立柱", "床台結構_工作台", "床台結構_鞍座", "床台結構_底座",
    "主軸系統_主軸電機", "主軸系統_鬆刀機構", "主軸系統_拉刀機構", "進給系統_導軌及滑塊",
    "進給系統_導螺桿", "進給系統_進給電機", "進給系統_進給軸承", "進給系統_聯軸器", "控制系統",
    "鈑金件_强化透明隔板", "鈑金件_金屬板件", "加工機臺_周邊組件", "刀庫", "自動換刀裝置", "潤滑系統", "冷卻系統",
    "排屑裝置", "夾具", "馬達", "潤滑油", "工具", "套筒", "刀具、銑刀、鑽頭", "螺絲刀", "金屬鑄件"
]

In [11]:
import pandas as pd
from tqdm import tqdm
from difflib import get_close_matches
# Read the Excel file
df = pd.read_excel('/content/output_file_Llama3_2 (1).xlsx')

In [7]:
from transformers import LogitsProcessor

In [8]:
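class LabelConstrainedLogitsProcessor(LogitsProcessor):
    """Restrict the next token to the first token of one of the labels."""

    def __init__(self, tokenizer, labels):
        # Only the first token ID of each label is kept, so labels that share
        # a first token cannot be told apart by this processor.
        self.label_token_ids = [tokenizer.encode(str(label), add_special_tokens=False)[0] for label in labels]

    def __call__(self, input_ids, scores):
        # Mask every token with -inf except the label tokens, which keep
        # their original scores so the model's preference is preserved.
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.label_token_ids] = 0
        return scores + mask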

def classify_text(row, model, tokenizer, labels):
    # Combine all relevant columns into a single text
    combined_text = f"""
    Company Profile: {row['ComProfile']}
    Business Item Description: {row['Business_Item_Desc']}
    Product Info: {row['ProductInfo']}
    Business Item eTax: {row['Business_Item_eTax']}
    Product/Service: {row['product/service']}
    """

    alpaca_prompt = """The following is a description of the company and its products. Please select the most appropriate category from the given list of labels.

    ### Instruction:
    Select the most appropriate category from the following list: {labels}

    ### Input:
    {input_text}

    ### Response:
    The most appropriate category is:"""

    formatted_prompt = alpaca_prompt.format(
        labels=", ".join(labels),
        input_text=combined_text
    )

    inputs = tokenizer([formatted_prompt], return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

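    # Extract the category from the response
    category = response.split("The most appropriate category is:")[-1].strip()

    # Make sure the returned category is in our label list
    if category not in labels:
        # Otherwise pick the label sharing the most characters with the output
        # (the labels are Chinese, so whitespace-separated words do not work)
        category = max(labels, key=lambda x: len(set(category) & set(x)))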

    return category
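
The fallback for outputs that are not an exact label can also be written with `difflib.get_close_matches`, which this notebook imports but only uses in commented-out code. The sketch below is one possible version; `closest_label` and the short label list are hypothetical, and the `cutoff` of 0.0 is an assumption that forces a match whenever the list is non-empty.

```python
from difflib import get_close_matches

def closest_label(prediction, labels, cutoff=0.0):
    """Return the prediction if it is a known label, else the most similar label."""
    if prediction in labels:
        return prediction
    # Similarity is computed on character sequences, so it also works for
    # labels without whitespace between words.
    matches = get_close_matches(prediction, labels, n=1, cutoff=cutoff)
    return matches[0] if matches else labels[0]

sample_labels = ["刀庫", "夾具", "馬達"]  # small subset used for illustration
print(closest_label("刀庫。", sample_labels))
print(closest_label("夾具", sample_labels))
```

Raising `cutoff` (for example to 0.6) makes the function fall back to the first label more often; returning a dedicated "unclassified" value instead may be preferable in practice.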

In [12]:
logits_processor = LabelConstrainedLogitsProcessor(tokenizer, labels)

# Classify every row
tqdm.pandas()
df['result'] = df.progress_apply(lambda row: classify_text(row, model, tokenizer, labels), axis=1)

# Save the results to a new Excel file
df.to_excel('output_file_Llama3_0805.xlsx', index=False)

100%|██████████| 122/122 [02:40<00:00,  1.31s/it]
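
The cell above creates `logits_processor` but `classify_text` never passes it to `model.generate`, so generation is unconstrained. The idea behind such a processor is sketched below in plain Python, without the model: every score outside an allowed set is replaced by negative infinity before the best token is picked. The function names and the example scores are hypothetical.

```python
import math

def constrain_scores(scores, allowed_ids):
    """Keep the scores of allowed token IDs and mask all others with -inf."""
    allowed = set(allowed_ids)
    return [s if i in allowed else -math.inf for i, s in enumerate(scores)]

def pick_token(scores, allowed_ids):
    """Greedy choice among the allowed token IDs only."""
    masked = constrain_scores(scores, allowed_ids)
    return max(range(len(masked)), key=masked.__getitem__)

example_scores = [5.0, 1.0, 3.0, 2.0]  # token 0 would win without the mask
print(pick_token(example_scores, allowed_ids=[1, 3]))
```

Because only the first token of each label would be allowed, applying such a mask at every step of a multi-token generation would not reproduce full labels; limiting it to the first generated token is one way to use it.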


In [None]:
import pandas as pd
from google.colab import files

# Upstream and downstream data lookup definitions

In [None]:
def process_input(choice, product_ids, file_path, sheet_name):
    # Choice '2' means the product's own companies: return the IDs unchanged
    if choice == '2':
        return product_ids

    # Choice '1' reads the upstream ('up') column; any other choice reads
    # the downstream ('down') column
    column = 'up' if choice == '1' else 'down'

    df = pd.read_excel(file_path, sheet_name=sheet_name)
    # Convert the product ID column to strings
    df['product_id'] = df['product_id'].astype(str)
    processed_data = []

    # Go through the input product IDs and process the chosen column
    for product_id in product_ids:
        # Select the rows for this product ID (compared as strings) and
        # collect the values of the chosen column
        values = df[df['product_id'] == str(product_id)][column].tolist()

        for value in values:
            if pd.isna(value):
                continue  # Skip empty cells
            # Each cell holds a comma-separated list of product IDs
            items = value.split(',')
            processed_data.extend(items)

    # Remove duplicates
    processed_data = list(set(processed_data))
    return processed_data
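
The lookup in `process_input` can be tried without an Excel file. The sketch below reproduces the same idea on a list of dictionaries; `collect_related_ids` and the sample rows are hypothetical, and unlike the notebook function it strips whitespace around each ID and returns a sorted list.

```python
def collect_related_ids(rows, product_ids, column):
    """Gather the comma-separated IDs stored in `column` for the given products."""
    wanted = {str(pid) for pid in product_ids}
    related = set()
    for row in rows:
        if str(row["product_id"]) not in wanted:
            continue
        value = row.get(column)
        if not value:  # skip missing or empty cells
            continue
        related.update(item.strip() for item in value.split(",") if item.strip())
    return sorted(related)

# Hypothetical rows standing in for the worksheet contents.
sample_rows = [
    {"product_id": 1, "up": "2,3", "down": None},
    {"product_id": 2, "up": "3, 4", "down": "1"},
]
print(collect_related_ids(sample_rows, ["1", "2"], "up"))
print(collect_related_ids(sample_rows, ["1"], "down"))
```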

In [None]:
def load_company_sheets(product_ids, file_path):
    # Dictionary that stores one DataFrame per product ID
    data_frames = {}

    # Loop over the given product IDs
    for product_id in product_ids:
        # Build the worksheet name
        sheet_name = f"company_{product_id}"

        # Try to read the worksheet
        try:
            # Read the named worksheet with pandas
            data_frames[product_id] = pd.read_excel(file_path, sheet_name=sheet_name)
            print(f"成功加载工作表：{sheet_name}")
        except Exception as e:
            print(f"加载工作表{sheet_name}失败: {e}")

    return data_frames

In [None]:
from google.colab import files
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
# Path to the workbook on the mounted shared drive
file_path = '/content/drive/Shareddrives/碩研/LCIA/dataset/LCIA_test_v2.xlsx'

# Read the worksheet that lists each product name with its product ID
df = pd.read_excel(file_path, sheet_name='工作表1', engine='openpyxl')
# Build lookups from product name to product ID and from ID back to name
product_mapping = dict(zip(df['product_name'], df['product_id']))
product_mapping_reverse = {v: k for k, v in product_mapping.items()}

# Classify a single text input

In [None]:
input_text = input("請輸入需判別之描述：")
alpaca_prompt = """The following is a description of the company's products. Please select the most appropriate category from the given list of labels.

### Instruction:
Select the most appropriate category from the following list: {labels}

### Input:
{input_text}

### Response:
The most appropriate category is:"""

formatted_prompt = alpaca_prompt.format(
    labels=", ".join(labels),
    input_text=input_text
)

inputs = tokenizer([formatted_prompt], return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
response = tokenizer.batch_decode(outputs)[0]
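# Extract the category from the response: keep only the text after the answer prefix
category = response.split("The most appropriate category is:")[-1].strip()
# Strip end-of-turn markers; the model may emit <|eot_id|> or <|im_end|>
for marker in ("<|eot_id|>", "<|im_end|>"):
    category = category.split(marker)[0].strip()
category = category.encode('utf-8', errors='ignore').decode('utf-8')
# Optionally make sure the category is one of the predefined labels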
#closest_match = get_close_matches(category, labels, n=1, cutoff=0.6)
#if closest_match:
#    category = closest_match[0]
#else:
#    category = "無法分類"

print(category)
# Example input: 億達精密刀具成立於2001年，位處於擁有著豐富工業資源的台中。我們秉持著一步一腳印、循序漸進的態度，以提供最好品質且合理價格的產品給客戶。億達從傳統焊刃式切削刀具發展至高精度硬質的鎢鋼切削刀具，集傳統精髓與創新技術於一身。從原料檢閱到成品輸出皆透過高度嚴謹的品管監督出廠。因為這樣堅持的精神，億達廣受客戶地肯定與愛戴，也相當榮幸獲得鄧白氏企業認證 (D-U-N-S® Certificate)此項殊榮。
# Split the category string on the enumeration comma '、' into a list
product_names = category.split('、')

product_ids = [product_mapping.get(name, '未知') for name in product_names]

# Print the product names to verify the input
print("您输入的產品是：", product_names)
# Print the product IDs to verify the mapping
print("您输入的產品ID是：", product_ids)

請輸入需判別之描述：億達精密刀具成立於2001年，位處於擁有著豐富工業資源的台中。我們秉持著一步一腳印、循序漸進的態度，以提供最好品質且合理價格的產品給客戶。億達從傳統焊刃式切削刀具發展至高精度硬質的鎢鋼切削刀具，集傳統精髓與創新技術於一身。從原料檢閱到成品輸出皆透過高度嚴謹的品管監督出廠。因為這樣堅持的精神，億達廣受客戶地肯定與愛戴，也相當榮幸獲得鄧白氏企業認證 (D-U-N-S® Certificate)此項殊榮
刀庫  <|im_end|>


NameError: name 'product_mapping' is not defined

In [None]:
# Split the category string on the enumeration comma '、' into a list
product_names = category.split('、')

product_ids = [product_mapping.get(name, '未知') for name in product_names]

# Print the product names to verify the input
print("您输入的產品是：", product_names)
# Print the product IDs to verify the mapping
print("您输入的產品ID是：", product_ids)
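
One of the labels, "刀具、銑刀、鑽頭", itself contains the separator '、', so splitting the predicted category unconditionally would break it into three unknown names. The sketch below checks the whole string first and only splits when it is not a known product name. `map_products`, the sample mapping, and its IDs are hypothetical, and it assumes product names in the workbook match the label strings.

```python
def map_products(category, product_mapping, unknown="未知"):
    """Return (names, ids) for a predicted category string."""
    if category in product_mapping:
        # The whole string is a known product, even if it contains '、'.
        names = [category]
    else:
        names = [name.strip() for name in category.split("、") if name.strip()]
    ids = [product_mapping.get(name, unknown) for name in names]
    return names, ids

# Hypothetical mapping; the real one is built from the Excel worksheet.
sample_mapping = {"刀庫": 19, "夾具": 24, "刀具、銑刀、鑽頭": 29}
print(map_products("刀具、銑刀、鑽頭", sample_mapping))
print(map_products("刀庫、夾具", sample_mapping))
```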

In [None]:
# Define the options
options = {
    '1': '上游公司',
    '2': '本產品公司',
    '3': '下游客戶'
}

# Print the options
print("请选择您針對",category,"查找何資料：")
for key, value in options.items():
    print(f"{key}. {value}")

# Read the user's choice
choice = input("请输入选项（1, 2, 或 3）：")

# Keep asking until the input is a valid option
while choice not in options:
    print("输入无效，请输入1, 2, 或3作为您的选择！")
    choice = input("请输入选项（1, 2, 或 3）：")

# Print the selected option
print(f"您选择的是：{options[choice]}")
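
The validation loop above can be wrapped in a small function so it is reusable and can be exercised without typing at a prompt. This is a sketch only: `prompt_choice` and its `read` and `write` parameters are hypothetical, and the messages are illustrative English stand-ins for the notebook's own prompts.

```python
def prompt_choice(options, read=input, write=print):
    """Ask until the answer is one of the keys in `options`, then return it."""
    while True:
        choice = read("Enter an option (1, 2, or 3): ").strip()
        if choice in options:
            return choice
        write("Invalid input, please enter 1, 2, or 3.")

demo_options = {"1": "upstream", "2": "own product", "3": "downstream"}

# Simulated user who first types an invalid value, then a padded valid one.
answers = iter(["9", " 2 "])
selected = prompt_choice(demo_options, read=lambda _prompt: next(answers), write=lambda *_: None)
print(selected)
```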


In [None]:
find_list = process_input(choice, product_ids, file_path, '工作表1')
company_found_list = []
if not find_list:
    print("查無資料")
else:
    sheets_data = load_company_sheets(find_list, file_path)
    # Print the loaded sheets for inspection and collect every company ID
    for product_id, sheet_df in sheets_data.items():
        product_name = product_mapping_reverse.get(product_id, '未知')
        print(f"数据 {product_name}:")
        print(sheet_df)  # Print the full worksheet
        company_found_list.extend(sheet_df['company_id'].tolist())