<a target="_blank" href="https://colab.research.google.com/github/rapidsai-community/showcase/blob/main/getting_started_tutorials/rapids-pip-colab-template.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Install RAPIDS into Colab"/>
</a>

# RAPIDS cuDF is already on your Colab instance!
RAPIDS cuDF is preinstalled on Google Colab and instantly accelerates Pandas with zero code changes. [You can quickly get started with our tutorial notebook](https://nvda.ws/rapids-cudf). This notebook template is for users who want to use the full suite of RAPIDS libraries for their workflows on Colab.

# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Run `!nvidia-smi` to see which GPU you have; uncomment the cell below to do that. Currently, RAPIDS runs on all available Colab GPU instances.

In [None]:
# !nvidia-smi
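The same check can be made from Python, so that a CPU-only runtime produces a readable message instead of a shell error. This is a minimal sketch, assuming `nvidia-smi` is on the `PATH` whenever a GPU is attached; the helper name `gpu_summary` is hypothetical and not part of RAPIDS or Colab.

```python
import shutil
import subprocess
from typing import Optional


def gpu_summary() -> Optional[str]:
    """Return the output of `nvidia-smi -L`, or None if no NVIDIA GPU is visible.

    Hypothetical helper for illustration only.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA driver tools on PATH, so most likely a CPU-only runtime.
        return None
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


summary = gpu_summary()
print(summary if summary else "No GPU detected: switch the runtime type to GPU.")
```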

# Setup
This setup script:

1. Checks that the GPU is RAPIDS compatible
1. Pip installs the RAPIDS libraries, which are:
  1. cuDF
  1. cuML
  1. cuGraph
  1. cuSpatial
  1. cuxFilter
  1. cuCIM
  1. xgboost

# Controlling Which RAPIDS Version is Installed
The line `!python rapidsai-csp-utils/colab/pip-install.py` in the cell below starts the RAPIDS installation script. You can control which RAPIDS version is installed by appending `latest` or `nightlies`, or by leaving the option blank for the default. Example:

`!python rapidsai-csp-utils/colab/pip-install.py <option>`

You can now tell the script to install:
1. **RAPIDS + Colab default version**: leave the install script option blank (or give an invalid option) to add the rest of the RAPIDS libraries to the RAPIDS cuDF library preinstalled on Colab. **This is the default and recommended version.** Example: `!python rapidsai-csp-utils/colab/pip-install.py`
1. **Latest known working RAPIDS stable version**: use the option `latest` to upgrade all RAPIDS libraries to the latest working RAPIDS stable version. This usually gives early access to future RAPIDS+Colab functionality; some functionality may not work, and the version can be the same as the default. Example: `!python rapidsai-csp-utils/colab/pip-install.py latest`
1. **Current nightlies version**: use the option `nightlies` to install the current RAPIDS nightlies. For RAPIDS developer use; **not recommended/untested**. Example: `!python rapidsai-csp-utils/colab/pip-install.py nightlies`
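The three choices above can be sketched as a small helper that builds the command string. This only illustrates the option handling described in the list; `build_install_command` and the two constants are hypothetical names, and the real decision is made inside `pip-install.py` itself.

```python
from typing import Optional

# Script path as used in this notebook.
INSTALL_SCRIPT = "rapidsai-csp-utils/colab/pip-install.py"
# The two non-default options named in the list above.
KNOWN_OPTIONS = ("latest", "nightlies")


def build_install_command(option: Optional[str] = None) -> str:
    """Return the install command for an option (hypothetical helper).

    A blank or unrecognized option is dropped, mirroring the described
    behavior that anything else selects the default install.
    """
    if option in KNOWN_OPTIONS:
        return f"python {INSTALL_SCRIPT} {option}"
    return f"python {INSTALL_SCRIPT}"


for choice in (None, "latest", "nightlies"):
    print(build_install_command(choice))
```

In a Colab cell, the resulting string would be run with a leading `!`.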


**This will complete in about 5-6 minutes**

In [None]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py


Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 490, done.
remote: Counting objects: 100% (221/221), done.
remote: Compressing objects: 100% (130/130), done.
remote: Total 490 (delta 149), reused 124 (delta 91), pack-reused 269
Receiving objects: 100% (490/490), 136.70 KiB | 6.21 MiB/s, done.
Resolving deltas: 100% (251/251), done.
Collecting pynvml
  Downloading pynvml-11.5.0-py3-none-any.whl (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 714.5 kB/s eta 0:00:00
Installing collected packages: pynvml
Successfully installed pynvml-11.5.0
Installing the rest of the RAPIDS 24.4.* libraries
Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting cuml-cu12==24.4.*
  Downloading https://pypi.nvidia.com/cuml-cu12/cuml_cu12-24.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1200.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 GB 1.1 MB/s eta 0:00:00
Collecting cugraph-cu12==24.4.*
  Downloading

# RAPIDS is now installed on Colab.  
You can copy your code into the cells below, or run the cells below to validate your RAPIDS installation and versions.
# Enjoy!

In [None]:
import cudf
cudf.__version__

'24.04.01'

In [None]:
import cuml
cuml.__version__

'24.04.00'

In [None]:
import cugraph
cugraph.__version__

'24.04.00'

In [None]:
import cuspatial
cuspatial.__version__

'24.04.00'

In [None]:
import cuxfilter
cuxfilter.__version__

'24.04.01'
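The per-library cells above can also be collapsed into one loop. The following is a sketch rather than an official RAPIDS utility: `report_versions` is a hypothetical helper that records `None` for any library that cannot be imported, so a missing package does not abort the check.

```python
import importlib
from typing import Dict, Iterable, Optional


def report_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each module name to its __version__, or None if unavailable."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Not installed, or installed but failing to load (e.g. no GPU).
            versions[name] = None
            continue
        versions[name] = getattr(module, "__version__", None)
    return versions


rapids_libs = ["cudf", "cuml", "cugraph", "cuspatial", "cuxfilter"]
for name, version in report_versions(rapids_libs).items():
    print(f"{name}: {version or 'not importable'}")
```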

# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-contrib

In [None]:
!pip install bertopic[gensim]

Collecting bertopic[gensim]
  Downloading bertopic-0.16.4-py3-none-any.whl.metadata (23 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic[gensim])
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic[gensim])
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic[gensim])
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic[gensim])
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.11.0->sentence-

In [None]:
!pip install --extra-index-url=https://pypi.nvidia.com cuml-cu12==25.2.*

In [None]:
import pandas as pd
import cuml
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer

In [None]:
from sentence_transformers import SentenceTransformer

In [None]:
df = pd.read_excel('jdseg.xlsx')

In [None]:
df.seg = df.seg.astype(str)

In [None]:
docs = df.seg.tolist()
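One caveat with `astype(str)`: missing cells become the literal string `'nan'` and would then be embedded as if they were reviews. A hedged sketch of a filter that could be applied to the raw column values instead (for example `clean_docs(df.seg)` before any string conversion); `clean_docs` is a hypothetical helper and the sample values are made up for illustration.

```python
import math
from typing import Iterable, List


def clean_docs(values: Iterable[object]) -> List[str]:
    """Keep only non-empty documents, dropping None, NaN and blank strings."""
    docs: List[str] = []
    for value in values:
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        text = str(value).strip()
        if text:
            docs.append(text)
    return docs


# Illustrative values only, not taken from the dataset.
sample = ["东西 不错", float("nan"), None, "   ", "物流 很快"]
print(clean_docs(sample))
```

Any rows dropped this way must also be dropped from data that is later joined back to the topics, so that the documents and their labels stay aligned.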

In [None]:
docs[0]

'东西 不错 优惠 放心 省钱 购买 不要 犹豫 发货 很快 一天 颜色 很 喜欢 性价比 很高 非常 nice 宝贝 收到 实物 真的 好看 价格 非常 实惠 强烈推荐 款 超级 喜欢'

In [None]:
embedding_model = SentenceTransformer(
  'BAAI/bge-base-zh-v1.5',
)

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/27.8k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/998 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/409M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/110k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/409M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/439k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [None]:
embeddings = embedding_model.encode(docs, show_progress_bar=True)

Batches:   0%|          | 0/93 [00:00<?, ?it/s]

In [None]:
import numpy as np
np.save('jd_seg_1.5_2.npy',embeddings)
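Because encoding is the slow step, the saved array can be reused on later runs instead of being recomputed. A minimal sketch of that caching pattern, assuming the path ends in `.npy` (otherwise `np.save` appends the suffix and the existence check would miss the file); `load_or_compute` is a hypothetical helper.

```python
import os
import tempfile
from typing import Callable

import numpy as np


def load_or_compute(path: str, compute: Callable[[], np.ndarray]) -> np.ndarray:
    """Load an array from `path` if it exists, else compute and save it."""
    if os.path.exists(path):
        return np.load(path)
    array = np.asarray(compute())
    np.save(path, array)
    return array


# Demonstration with a temporary file and dummy data.
with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "demo_embeddings.npy")
    first = load_or_compute(cache_path, lambda: np.ones((2, 3), dtype=np.float32))
    # The second call finds the file and never runs its compute function.
    second = load_or_compute(cache_path, lambda: np.zeros((2, 3), dtype=np.float32))
    print(np.array_equal(first, second))
```

In this notebook the call would look like `load_or_compute('jd_seg_1.5_2.npy', lambda: embedding_model.encode(docs, show_progress_bar=True))`.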

In [None]:
# Read one stopword per line; the context manager closes the file afterwards.
with open('hit_stopwords.txt', encoding='UTF-8') as f:
    stopwords = [line.strip() for line in f]

In [None]:
umap_model = UMAP(
  n_neighbors=5,
  n_components=5,
  min_dist=0.0,
  metric='cosine',
  random_state=30
)
hdbscan_model = HDBSCAN(
  min_cluster_size=10,
  min_samples=5,
  metric='euclidean'
)
vectorizer_model = CountVectorizer(stop_words=stopwords)
ctfidf_model = ClassTfidfTransformer()

[2025-03-09 13:38:56.457] [CUML] [info] build_algo set to brute_force_knn because random_state is given


In [None]:
reduced_emb = UMAP(
  n_neighbors=5,
  n_components=2,
  min_dist=0.0,
  metric='cosine',
  random_state=30
).fit_transform(embeddings)

[2025-03-09 13:48:04.062] [CUML] [info] build_algo set to brute_force_knn because random_state is given


In [None]:
topic_model = BERTopic(
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    #representation_model=representation_model,
    ctfidf_model=ctfidf_model,
    nr_topics=21
)

In [None]:
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)

In [None]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,859,-1_不错_非常_血压_功能,"[不错, 非常, 血压, 功能, 质量, 很好, 喜欢, 方便, 测量, 使用]",[这次 购物 非常 愉快 商品 实惠 好用 质量上乘 物流 迅速 商家 服务 热情 售后 无...
1,0,410,0_血压_测量_功能_测血压,"[血压, 测量, 功能, 测血压, 程度, 比较, 主要, 方便, 准确, 续航]",[主要 功能 量 血压 续航 能力 还 行 舒适 程度 带 挺舒服 精准 程度 还 算 精准...
2,1,340,1_非常_物流_包装_满意,"[非常, 物流, 包装, 满意, 商品, 质量, 购物, 商家, 速度, 服务]",[真的 不错 真的 超级 喜欢 非常 支持 质量 非常 好 卖家 描述 完全一致 非常 满意...
3,2,251,2_监测_运动_健康_睡眠,"[监测, 运动, 健康, 睡眠, 心率, pro, 功能, 佩戴, 续航, 智能]",[智能 生活 好帮手 款 设计 简约 时尚 佩戴 舒适 功能 却 十分 强大 实时 监测 心...
4,3,226,3_表带_表盘_好看_很好,"[表带, 表盘, 好看, 很好, 很好看, 比较, 非常, 不错, 灵敏度, 外观]",[年底 更新 装备 之前 表盘 有点 嫌小 买 pro 最新款 曲面 屏 表盘 大气 好看 ...
5,4,189,4_操作_灵敏度_做工_准确性,"[操作, 灵敏度, 做工, 准确性, 简单, 质量, 外形, 难易, 外观, 喜欢]",[灵敏度 灵敏度 很高 触摸 灵敏 跟手 准确性 显示 准确 操作 难易 操作 方便 简单 ...
6,5,157,5_不错_续航_操作_屏幕,"[不错, 续航, 操作, 屏幕, 非常, 做工, 灵敏度, 外形, 外观, 真的]",[真的 很好看 做工 精美 功能 很多 表面 显示 很 清晰 运行 非常 流畅 金属外壳 质...
7,6,101,6_睡眠_心率_监测_检测,"[睡眠, 心率, 监测, 检测, 功能, 喜欢, 质量, 不错, nfc, 闹钟]",[nfc 版 使用 体验 非常 好 心率 血 氧 监测 准确 运动 追踪 功能 全面 睡眠 ...
8,7,93,7_nfc_门禁卡_功能_门禁,"[nfc, 门禁卡, 功能, 门禁, 运动, 非常, 方便, 实用, 公交, 健康]",[款 nfc 版 真是 送礼 理想 选择 外观设计 时尚 银色 配色 非常 质感 适合 男友...
9,8,76,8_京东_东西_快递_放心,"[京东, 东西, 快递, 放心, 好评, 物流, 速度, 包装, 小哥, 下去]",[n 次 京东 买 东西 东西 便宜 质量 好 物美价廉 买 放心 开心 东西 品种 特别 ...
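In the table above, topic `-1` collects the documents that were not assigned to any cluster. A quick way to see how much of the corpus that covers is to summarize the `topics` list directly. This is a sketch with a hypothetical helper, `summarize_topics`, run here on a made-up label list rather than the real output.

```python
from collections import Counter
from typing import Dict, Sequence


def summarize_topics(topics: Sequence[int]) -> Dict[str, object]:
    """Count real topics and the share of documents labelled -1 (outliers)."""
    counts = Counter(topics)
    total = len(topics)
    outliers = counts.get(-1, 0)
    return {
        "n_topics": sum(1 for topic in counts if topic != -1),
        "outlier_share": outliers / total if total else 0.0,
        "largest": counts.most_common(3),
    }


# Made-up labels for illustration; pass the real `topics` list in practice.
demo_topics = [-1, 0, 0, 1, -1, 0, 2, 1]
print(summarize_topics(demo_topics))
```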


In [None]:
topic_model.get_topic_info().to_excel('tm.xlsx',index=None)

In [None]:
tps = pd.DataFrame(topics)

In [None]:
tps.to_excel('tps.xlsx',index=None)

In [None]:
np.save('re_2.npy',reduced_emb)

The following is the sentiment analysis code.

In [None]:
# Import the required libraries
from transformers import pipeline
import pandas as pd

In [None]:

# 1. Load a pretrained sentiment-analysis pipeline using an Erlangshen model
#    "IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment" is a pretrained model for Chinese sentiment analysis
#    More Erlangshen models are listed on the Hugging Face Model Hub: https://huggingface.co/models?search=erlangshen
sentiment_pipeline = pipeline("sentiment-analysis", model="IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment")

In [None]:

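# 2. Define the input CSV path and the output Excel path
csv_file_path = 'your_comments.csv'  # replace with the path to your CSV file
excel_file_path = 'sentiment_results.xlsx'  # results are saved to this Excel file

# 3. Read the data from the CSV file
try:
    df = pd.read_csv(csv_file_path)
except FileNotFoundError:
    print(f"Error: CSV file {csv_file_path} not found. Please check that the path is correct.")
    raise  # stop this cell; in a notebook, exit() would end the kernel session instead

# Check that the CSV file contains the '评论内容' (review text) column
if '评论内容' not in df.columns:
    raise KeyError("The CSV file is missing the '评论内容' column. Make sure it has a column with that name holding the text to analyze.")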

In [None]:
# 4. Create new columns for the sentiment label and the sentiment strength
df['情感标签'] = ''  # sentiment label (positive or negative)
df['情感强度'] = 0.0 # sentiment strength score (model confidence)

# 5. Run sentiment analysis on every row of the '评论内容' column
for index, row in df.iterrows():
    text = row['评论内容']

    # Run the pipeline on this text
    result = sentiment_pipeline(text)

    # The pipeline returns a list holding one dict, for example: [{'label': 'POSITIVE', 'score': 0.999}]
    sentiment_label = result[0]['label'] # sentiment label (the exact label strings depend on the model)
    sentiment_score = result[0]['score'] # float between 0 and 1: the model's confidence in its prediction

    # Store the results in the new DataFrame columns
    df.loc[index, '情感标签'] = sentiment_label
    df.loc[index, '情感强度'] = sentiment_score

# 6. Save the DataFrame with the sentiment results to an Excel file
try:
    df.to_excel(excel_file_path, index=False) # index=False keeps the DataFrame index out of the Excel file
    print(f"Sentiment analysis results saved to {excel_file_path}")
except Exception as e:
    print(f"Error while saving the Excel file: {e}")

print("Done.")