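# Semantic Analysis of Project Names
This notebook demonstrates how to run a semantic analysis of project names to identify potential duplicate projects. It uses text embeddings from the Doubao API, called through the OpenAI-compatible Python client, together with cosine similarity.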


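## Overview
1. Set up the environment and dependencies
2. Load and prepare the project data
3. Generate text embeddings with the Doubao API
4. Analyze the similarity between project names
5. Process and visualize the results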


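## 1. Environment and Dependency Setup
First, install and import the required libraries, then configure the OpenAI client to call the Doubao endpoint.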

In [None]:
# Install required packages
!pip install openai pandas numpy scikit-learn seaborn matplotlib openpyxl  # openpyxl: needed by pandas to read/write .xlsx

In [None]:
# Import required libraries
import os
from openai import OpenAI
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import time
import seaborn as sns
import matplotlib.pyplot as plt

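# Initialize the OpenAI-compatible client against the Doubao (Volcengine Ark) endpoint
client = OpenAI(
    # ARK_API_KEY is an assumed environment variable name; falls back to the placeholder
    api_key=os.environ.get("ARK_API_KEY", "YOUR_API_KEY"),
    base_url="https://ark.cn-beijing.volces.com/api/v3",
)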

## 2. Load and Prepare the Data
Load the project data from an Excel file and prepare it for analysis.

In [None]:
# Load data from Excel file
file_path = "project_name.xlsx"  # Update with your file path
df = pd.read_excel(file_path)

# Extract IDs and project names
ids = df.iloc[:, 0].tolist()
project_names = df.iloc[:, 1].tolist()

print(f"Loaded {len(project_names)} projects")
print("\nFirst 5 projects:")
for i in range(min(5, len(project_names))):
    print(f"ID: {ids[i]}, Name: {project_names[i]}")
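
Embedding requests generally expect non-empty strings, while a spreadsheet column can contain blank cells or stray whitespace. The following is a minimal sketch of a cleanup step, assuming the first two columns hold the ID and the name as in the cell above. The helper name `clean_project_rows` and the sample data are hypothetical.

```python
import pandas as pd

def clean_project_rows(df):
    """Return (ids, names) with missing or blank names removed.

    Assumes column 0 holds the ID and column 1 the project name.
    """
    subset = df.iloc[:, :2].copy()
    subset.columns = ["id", "name"]
    subset = subset.dropna(subset=["name"])            # drop empty cells
    subset["name"] = subset["name"].astype(str).str.strip()
    subset = subset[subset["name"] != ""]              # drop whitespace-only names
    return subset["id"].tolist(), subset["name"].tolist()

# Hypothetical sample data standing in for the Excel sheet
sample = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Project": ["  Data Platform ", None, "", "Billing Revamp"],
})
sample_ids, sample_names = clean_project_rows(sample)
print(sample_ids, sample_names)
```

Filtering IDs and names together keeps the two lists aligned, which the later pairwise comparison relies on.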

## 3. Generate Text Embeddings
Implement an embedding function with batching and error handling.

In [None]:
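def get_embeddings(texts, batch_size=100):
    """
    Generate embeddings for a list of texts using batch processing.

    Raises RuntimeError if a batch fails, so the returned list always
    lines up one-to-one with `texts`.
    """
    embeddings = []
    total_batches = (len(texts) + batch_size - 1) // batch_size  # ceiling division
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        batch_number = i // batch_size + 1
        print(f"Processing batch {batch_number} of {total_batches}")
        try:
            resp = client.embeddings.create(
                model="doubao-embedding-text-240515",
                input=batch,
                encoding_format="float"
            )
        except Exception as e:
            # Skipping a failed batch would shift every later embedding
            # out of step with project_names, so stop instead.
            raise RuntimeError(f"Embedding request failed for batch {batch_number}") from e
        embeddings.extend(item.embedding for item in resp.data)
        time.sleep(1)  # Rate limiting
    return embeddings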

# Generate embeddings
print("----- Starting embeddings generation -----")
embeddings = get_embeddings(project_names)
embeddings = np.array(embeddings)
print(f"\nGenerated embeddings shape: {embeddings.shape}")
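
Batch calls to a remote API can fail transiently, for example on rate limits or timeouts. The following is a sketch of retrying with exponential backoff around any callable. `with_retries` and the `flaky` stand-in are hypothetical names, and no real API is called here.

```python
import time

def with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**attempt, then retry.

    Re-raises the last exception once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Stand-in for an API call that fails twice, then succeeds
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return [0.1, 0.2, 0.3]

delays = []
result = with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

In this notebook, the request inside `get_embeddings` could be wrapped the same way, for instance by passing a `lambda` that calls `client.embeddings.create(...)` for the current batch.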

## 4. Similarity Analysis
Compute the similarity matrix and identify similar project pairs.

In [None]:
# Calculate similarity matrix
print("----- Calculating similarity matrix -----")
similarity_matrix = cosine_similarity(embeddings)

# Find similar projects
threshold = 0.9
duplicates = []

for i in range(len(project_names)):
    for j in range(i + 1, len(project_names)):
        if similarity_matrix[i][j] > threshold:
            duplicate_entry = {
                "ID_1": ids[i],
                "Project_Name_1": project_names[i],
                "ID_2": ids[j],
                "Project_Name_2": project_names[j],
                "Similarity": similarity_matrix[i][j]
            }
            duplicates.append(duplicate_entry)

print(f"\nFound {len(duplicates)} potential duplicate pairs")
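
The nested loop above visits every pair in Python, which becomes slow as the project list grows. The following is a sketch of the same upper-triangle scan done with NumPy, shown on a small hand-made matrix. `find_similar_pairs` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np

def find_similar_pairs(similarity_matrix, threshold):
    """Return (i, j, score) for every pair i < j with score > threshold."""
    sim = np.asarray(similarity_matrix)
    rows, cols = np.triu_indices(sim.shape[0], k=1)  # indices strictly above the diagonal
    scores = sim[rows, cols]
    keep = scores > threshold
    return [
        (int(i), int(j), float(s))
        for i, j, s in zip(rows[keep], cols[keep], scores[keep])
    ]

# Toy 3x3 similarity matrix (symmetric, ones on the diagonal)
toy = np.array([
    [1.00, 0.95, 0.20],
    [0.95, 1.00, 0.91],
    [0.20, 0.91, 1.00],
])
pairs = find_similar_pairs(toy, threshold=0.9)
print(pairs)
```

The returned indices can be mapped back to `ids` and `project_names` to build the same dictionaries the loop produces.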

## 5. Process and Visualize the Results
Visualize the similarity matrix and export the results.

In [None]:
# Visualize similarity matrix
plt.figure(figsize=(10, 8))
sns.heatmap(similarity_matrix, cmap='YlOrRd')
plt.title('Project Name Similarity Matrix')
plt.show()

# Create and display results DataFrame
if duplicates:
    output_df = pd.DataFrame(duplicates)
    print("\nTop 10 most similar project pairs:")
    display(output_df.sort_values('Similarity', ascending=False).head(10))
    
    # Export results
    output_path = "duplicate_projects.xlsx"
    output_df.to_excel(output_path, index=False)
    print(f"\nResults exported to {output_path}")
else:
    print("\nNo duplicates found above threshold")

## Summary
This notebook has demonstrated:
1. How to process project names using Doubao embeddings through the OpenAI-compatible client
2. How to calculate semantic similarity between project names
3. How to identify and visualize potential duplicate projects
4. How to export and analyze results

The threshold value (0.9) can be adjusted to suit your data: lower values catch more potential duplicates but also let in more false positives.
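
One way to choose a threshold is to count how many pairs each candidate value would flag before reviewing any of them. The following is a sketch on a toy matrix. `count_pairs_above` and the candidate thresholds are illustrative assumptions.

```python
import numpy as np

def count_pairs_above(similarity_matrix, thresholds):
    """Map each threshold to the number of pairs i < j scoring above it."""
    sim = np.asarray(similarity_matrix)
    upper = sim[np.triu_indices(sim.shape[0], k=1)]  # each pair counted once
    return {t: int((upper > t).sum()) for t in thresholds}

# Toy 4x4 similarity matrix (symmetric, ones on the diagonal)
toy = np.array([
    [1.00, 0.95, 0.82, 0.10],
    [0.95, 1.00, 0.88, 0.15],
    [0.82, 0.88, 1.00, 0.30],
    [0.10, 0.15, 0.30, 1.00],
])
counts = count_pairs_above(toy, [0.8, 0.85, 0.9])
print(counts)
```

A sharp jump in the count between two neighboring thresholds suggests where near-duplicates give way to merely related names, though the right cutoff still depends on manual review of sample pairs.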