
[Bug] Repeated Rendering of Answers Causes Multiple Image Requests #4827

Open · 1 of 3 tasks
ficapy opened this issue Jun 7, 2024 · 2 comments
Labels
bug Something isn't working

Comments


ficapy commented Jun 7, 2024

Bug Description

Repeated Rendering of Answers Causes Multiple Image Requests
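
For context, here is a minimal sketch of the suspected mechanism (not code from this repository; the component name and the `marked` renderer below are assumptions for illustration only): if every streamed chunk causes the whole answer to be re-converted to HTML and re-injected, the `<img>` produced by `![alt](url)` is recreated on each chunk, and each recreation can hit the network again depending on caching.

```tsx
// Minimal sketch of the suspected mechanism -- NOT the actual project code.
// Component names are hypothetical; `marked` stands in for whatever Markdown
// renderer the app really uses.
import React from "react";
import { marked } from "marked";

function StreamingAnswer({ content }: { content: string }) {
  // `content` grows chunk by chunk while the answer streams in, so this
  // component re-renders once per chunk.
  const html = marked.parse(content) as string;

  // Re-injecting fresh HTML rebuilds the whole DOM subtree, including the
  // <img> element, so the browser may request the image again each time
  // (subject to its cache behaviour).
  return <div dangerouslySetInnerHTML={{ __html: html }} />;
}

// One possible direction for a fix (an assumption, not a confirmed solution):
// keep the rendered image element stable across streaming updates, e.g. by
// reconciling the Markdown output instead of replacing it wholesale, so the
// existing <img> DOM node is reused rather than recreated per chunk.
```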

Steps to Reproduce

Simulate a Markdown translation task that includes an image link.

Send the following content to the LLM and watch the browser's Network panel: it shows numerous requests to https://placehold.co/600x400 for the image.

Please translate the following Markdown into Chinese

![https://placehold.co/600x400](https://placehold.co/600x400)

Hello GPT-4o
============

We’re announcing GPT-4o, our new flagship model that can reason across audio, vision, and text in real time.

All videos on this page are at 1x real time.

Guessing May 13th’s announcement.

GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to [human response time(opens in a new window)](https://www.pnas.org/doi/10.1073/pnas.0903616106) in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models.

Model capabilities
------------------

Two GPT-4os interacting and singing.  

Interview prep.

Rock Paper Scissors.

Sarcasm.

Math with Sal and Imran Khan.

Two GPT-4os harmonizing.

Point and learn Spanish.

Meeting AI.

Real-time translation.

Lullaby.

Talking faster.

Happy Birthday.

Dog.

Dad jokes.

GPT-4o with Andy, from BeMyEyes in London.

Customer service proof of concept.

Prior to GPT-4o, you could use [Voice Mode](https://openai.com/index/chatgpt-can-now-see-hear-and-speak) to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.

With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.

Explorations of capabilities
----------------------------

Select sample:

Visual Narratives - Robot Writer’s Block · Visual narratives - Sally the mailwoman · Poster creation for the movie 'Detective' · Character design - Geary the robot · Poetic typography with iterative editing 1 · Poetic typography with iterative editing 2 · Commemorative coin design for GPT-4o · Photo to caricature · Text to font · 3D object synthesis · Brand placement - logo on coaster · Poetic typography · Multiline rendering - robot texting · Meeting notes with multiple speakers · Lecture summarization · Variable binding - cube stacking · Concrete poetry

Model safety and limitations
----------------------------

GPT-4o has safety built-in by design across modalities, through techniques such as filtering training data and refining the model’s behavior through post-training. We have also created new safety systems to provide guardrails on voice outputs.  

We’ve evaluated GPT-4o according to our [Preparedness Framework](https://openai.com/preparedness) and in line with our [voluntary commitments](https://openai.com/index/moving-ai-governance-forward/). Our evaluations of cybersecurity, CBRN, persuasion, and model autonomy show that GPT-4o does not score above Medium risk in any of these categories. This assessment involved running a suite of automated and human evaluations throughout the model training process. We tested both pre-safety-mitigation and post-safety-mitigation versions of the model, using custom fine-tuning and prompts, to better elicit model capabilities.  

GPT-4o has also undergone extensive external red teaming with 70+ [external experts](https://openai.com/index/red-teaming-network) in domains such as social psychology, bias and fairness, and misinformation to identify risks that are introduced or amplified by the newly added modalities. We used these learnings to build out our safety interventions in order to improve the safety of interacting with GPT-4o. We will continue to mitigate new risks as they’re discovered.  

We recognize that GPT-4o’s audio modalities present a variety of novel risks. Today we are publicly releasing text and image inputs and text outputs. Over the upcoming weeks and months, we’ll be working on the technical infrastructure, usability via post-training, and safety necessary to release the other modalities. For example, at launch, audio outputs will be limited to a selection of preset voices and will abide by our existing safety policies. We will share further details addressing the full range of GPT-4o’s modalities in the forthcoming system card.  

Through our testing and iteration with the model, we have observed several limitations that exist across all of the model’s modalities, a few of which are illustrated below.  

Examples of model limitations

We would love feedback to help identify tasks where GPT-4 Turbo still outperforms GPT-4o, so we can continue to improve the model. 

Model availability
------------------

GPT-4o is our latest step in pushing the boundaries of deep learning, this time in the direction of practical usability. We spent a lot of effort over the last two years working on efficiency improvements at every layer of the stack. As a first fruit of this research, we’re able to make a GPT-4 level model available much more broadly. GPT-4o’s capabilities will be rolled out iteratively (with extended red team access starting today). 

GPT-4o’s text and image capabilities are starting to roll out today in ChatGPT. We are making GPT-4o available in the free tier, and to Plus users with up to 5x higher message limits. We'll roll out a new version of Voice Mode with GPT-4o in alpha within ChatGPT Plus in the coming weeks.

Developers can also now access GPT-4o in the API as a text and vision model. GPT-4o is 2x faster, half the price, and has 5x higher rate limits compared to GPT-4 Turbo. We plan to launch support for GPT-4o's new audio and video capabilities to a small group of trusted partners in the API in the coming weeks.

Authors
-------

[OpenAI](/news/?author=openai#results)

Expected Behavior

I observed in the Chrome DevTools Network panel that the image at https://placehold.co/600x400 was requested over a hundred times.
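
One quick way to quantify the repeated fetches, sketched with the standard Resource Timing API in the DevTools console (a convenience check only, not part of the original report):

```ts
// Paste into the DevTools console on the chat page: counts resource-timing
// entries for the placeholder image since page load. The resource timing
// buffer is finite, so very large counts can be truncated.
const hits = performance
  .getEntriesByType("resource")
  .filter((entry) => entry.name.startsWith("https://placehold.co/600x400"));
console.log(`placehold.co/600x400 was requested ${hits.length} times`);
```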

Screenshots

[Screenshot: Chrome DevTools Network panel showing repeated requests to https://placehold.co/600x400]

Deployment Method

  • Docker
  • Vercel
  • Server

Desktop OS

OSX

Desktop Browser

Chrome

Desktop Browser Version

125.0.6422.142

Smartphone Device

No response

Smartphone OS

No response

Smartphone Browser

No response

Smartphone Browser Version

No response

Additional Logs

No response

ficapy added the bug label Jun 7, 2024

Kosette commented Jun 7, 2024

The Markdown syntax `![alt text](url)` embeds an image, which gets rendered in the dialog; that's why there are so many failed requests.

Replace it with a valid URL and you'll see an image.
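
For reference, a minimal sketch of what that syntax compiles to (using `marked` here purely as an illustration; the project may use a different Markdown library):

```ts
import { marked } from "marked";

// ![alt text](url) compiles to an <img> element, so every time the rendered
// message is rebuilt, the browser sees a brand-new image element.
const html = marked.parse(
  "![https://placehold.co/600x400](https://placehold.co/600x400)"
) as string;
console.log(html);
// -> <p><img src="https://placehold.co/600x400" alt="https://placehold.co/600x400"></p>
```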


Kosette commented Jun 7, 2024


My bad, it's a valid image URL. Ignore my comment above.
