In [2]:
from langchain_openai import ChatOpenAI
import os

openai_api_key=os.getenv("OPEN_API_KEY")
model = ChatOpenAI(
    model='deepseek-chat', # or use 'deepseek-reasoner' (corresponds to DeepSeek-R1)
    openai_api_key=openai_api_key,
    base_url='https://api.deepseek.com', # must be changed to point at DeepSeek's server
    temperature=0.7
)

llm=model
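
`os.getenv` returns `None` when the variable is unset, so a misspelled variable name only surfaces later, when the client is built or the first request is sent. A minimal sketch of a fail-fast lookup follows; `require_env` is a hypothetical helper, not part of LangChain or the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; not part of any library.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Possible usage with the variable name from the cell above:
# openai_api_key = require_env("OPEN_API_KEY")
```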

In [4]:
from langchain_community.utilities import ArxivAPIWrapper

arxiv=ArxivAPIWrapper()
docs=arxiv.run("1605.08386")
docs

'Published: 2016-05-26\nTitle: Heat-bath random walks with Markov bases\nAuthors: Caprice Stanley, Tobias Windisch\nSummary: Graphs on lattice points are studied whose edges come from a finite set of allowed moves of arbitrary length. We show that the diameter of these graphs on fibers of a fixed integer matrix can be bounded from above by a constant. We then study the mixing behaviour of heat-bath random walks on these graphs. We also state explicit conditions on the set of moves so that the heat-bath random walk, a generalization of the Glauber dynamics, is an expander in fixed dimension.'
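
The string returned by `arxiv.run(...)` above is a set of `Key: value` lines. If the fields are needed separately, it can be split into a dictionary. This is a sketch that assumes the four keys seen in the output above (`Published`, `Title`, `Authors`, `Summary`); `parse_arxiv_entry` is a hypothetical helper and the sample summary is shortened.

```python
KNOWN_KEYS = ("Published", "Title", "Authors", "Summary")


def parse_arxiv_entry(text: str) -> dict[str, str]:
    """Split one 'Key: value' block into a dict (hypothetical helper)."""
    fields: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep and key in KNOWN_KEYS:
            # A new field starts on this line.
            current = key
            fields[current] = value
        elif current is not None:
            # Continuation line: append it to the field being read.
            fields[current] += "\n" + line
    return fields


sample = (
    "Published: 2016-05-26\n"
    "Title: Heat-bath random walks with Markov bases\n"
    "Authors: Caprice Stanley, Tobias Windisch\n"
    "Summary: Graphs on lattice points are studied."
)
entry = parse_arxiv_entry(sample)
authors = [name.strip() for name in entry["Authors"].split(",")]
print(entry["Title"], authors)
```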

In [5]:
docs=arxiv.run("sora")
docs

"Published: 2025-09-22\nTitle: Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking\nAuthors: Zihan Su, Xuerui Qiu, Hongbin Xu, Tangyu Jiang, Junhao Zhuang, Chun Yuan, Ming Li, Shengfeng He, Fei Richard Yu\nSummary: The explosive growth of generative video models has amplified the demand for reliable copyright preservation of AI-generated content. Despite its popularity in image synthesis, invisible generative watermarking remains largely underexplored in video generation. To address this gap, we propose Safe-Sora, the first framework to embed graphical watermarks directly into the video generation process. Motivated by the observation that watermarking performance is closely tied to the visual similarity between the watermark and cover content, we introduce a hierarchical coarse-to-fine adaptive matching mechanism. Specifically, the watermark image is divided into patches, each assigned to the most visually similar video frame, and further localized to the optimal spati

In [7]:
import arxiv

search=arxiv.Search(
    query="gpt-4",
    max_results=5,
    sort_by=arxiv.SortCriterion.Relevance
)
search

arxiv.Search(query='gpt-4', id_list=[], max_results=5, sort_by=<SortCriterion.Relevance: 'relevance'>, sort_order=<SortOrder.Descending: 'descending'>)

In [9]:
client=arxiv.Client()
results=client.results(search)

results

<itertools.islice at 0x173d9dc6610>

In [10]:
papers=[]
for item in results:
    print(item) 
    papers.append(item)

http://arxiv.org/abs/2304.03277v1
http://arxiv.org/abs/2303.13375v2
http://arxiv.org/abs/2308.07921v1
http://arxiv.org/abs/2311.15732v2
http://arxiv.org/abs/2312.14302v2
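
`results` above is an `itertools.islice`, which is lazy and can be consumed only once. This is why the loop copies each item into `papers`. The sketch below shows the same behaviour with a stand-in generator; `fake_results` is hypothetical and makes no network calls.

```python
from itertools import islice


def fake_results():
    """Stand-in for a lazy result stream (hypothetical, no network access)."""
    for i in range(1, 100):
        yield f"paper-{i}"


lazy = islice(fake_results(), 5)  # nothing is produced yet
first_pass = list(lazy)           # consumes the five items
second_pass = list(lazy)          # the iterator is now exhausted

print(first_pass)
print(second_pass)
```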


In [11]:
papers[0]

arxiv.Result(entry_id='http://arxiv.org/abs/2304.03277v1', updated=datetime.datetime(2023, 4, 6, 17, 58, 9, tzinfo=datetime.timezone.utc), published=datetime.datetime(2023, 4, 6, 17, 58, 9, tzinfo=datetime.timezone.utc), title='Instruction Tuning with GPT-4', authors=[arxiv.Result.Author('Baolin Peng'), arxiv.Result.Author('Chunyuan Li'), arxiv.Result.Author('Pengcheng He'), arxiv.Result.Author('Michel Galley'), arxiv.Result.Author('Jianfeng Gao')], summary='Prior work has shown that finetuning large language models (LLMs) using machine-generated instruction-following data enables such models to achieve remarkable zero-shot capabilities on new tasks, and no human-written instructions are needed. In this paper, we present the first attempt to use GPT-4 to generate instruction-following data for LLM finetuning. Our early experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following data generated by GPT-4 leads to superior zero-shot performance

In [14]:
from langchain_community.document_loaders import ArxivLoader
docs=ArxivLoader(query="2309.12732v1",load_max_docs=2).load()
docs



In [15]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

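# The prompt (in Chinese) says: "Please explain the article above in detail in Chinese and extract its key points."
prompt=ChatPromptTemplate.from_template("{article}\n\n\n请使用中文详细讲解上面这篇文章内容，并将核心的要点提炼出来")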

chain=prompt | llm | StrOutputParser()
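
The `|` operator above chains three steps so that each step's output feeds the next. The toy sketch below illustrates only that composition idea using Python's `__or__`. It is not LangChain's implementation, and `Step`, `fill_prompt`, `fake_llm`, and `to_text` are hypothetical stand-ins that run offline.

```python
from typing import Any, Callable


class Step:
    """Toy pipeline stage that supports `a | b` composition (illustrative only)."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # Run this step first, then pass its output to the next step.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


# Stand-ins for the prompt template, the model, and the output parser.
fill_prompt = Step(lambda inputs: f"{inputs['article']}\n\nSummarize the article above.")
fake_llm = Step(lambda prompt_text: {"content": f"[{len(prompt_text)} characters received]"})
to_text = Step(lambda message: message["content"])

toy_chain = fill_prompt | fake_llm | to_text
result = toy_chain.invoke({"article": "abc"})
print(result)
```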

In [17]:
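# stream() yields the answer in chunks as it is generated;
# invoke() returns the whole string, so looping over it would iterate characters.
for chunk in chain.stream({"article":docs[0].page_content}):
    print(chunk,end="",flush=True)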

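This article, titled "OpenAI's GPT4 as coding assistant", is by Lefteris Moussiades and George Zografos of the Department of Computer Science, International Hellenic University, Greece. Published on September 25, 2023, it evaluates GPT-3.5 and GPT-4 as programming assistants across three core tasks: **answering questions**, **code development**, and **code debugging**.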

---

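## I. Explanation of the Article's Core Content

### 1. **Research Background and Motivation**
- **Large language models (LLMs)** are already widely used for code generation, e.g. CodeBERT, Codex, and AlphaCode.
- GPT-4 is considered one of the most powerful LLMs available, yet no public study has systematically evaluated its coding ability.
- This paper aims to fill that gap by evaluating GPT-3.5 and GPT-4 in realistic programming scenarios using custom test sets (not public benchmark datasets).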

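### 2. **Research Method**
- Three test suites were designed:
  - **Answering questions**: simulates the syntax and semantics questions developers commonly ask.
  - **Code development**: asks for code that implements a specific function (e.g. a high-precision power function, a tic-tac-toe game).
  - **Code debugging**: supplies code that throws exceptions or contains logic errors and asks for an explanation and a fix.
- Java is the programming language; interaction goes through OpenAI's web interface and follows prompt-engineering best practices.
- Results are assessed by human experts or compared against a reliable source (such as Java standard library functions).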

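### 3. **Question-Answering Test**
- Three challenging questions were posed:
  1. Does Java support passing a function as an argument? What is the syntax?
  2. Explain why a piece of code prints only one boolean value instead of two.
  3. Briefly explain the difference between a "default method" and a "non-abstract method" in Java.
- **Result**: GPT-3.5 and GPT-4 both answered all questions correctly, with satisfactory performance.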

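### 4. **Code Development Test**
#### a) **Power Function Implementation (PF)**
- **Requirement**: implement a high-precision `pow(double b, int e)` without using `Math.pow` or `BigDecimal.pow`.
- **Round 1**: both used exponentiation by squaring, but GPT-4 optimized it with bitwise operations while GPT-3.5 used arithmetic operations.
- **Precision comparison**: measured against `Math.pow`, the two showed similar average deviation.
- **Round 2**: higher precision was requested.
  - GPT-3.5 switched to a Taylor series expansion, but precision actually dropped.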