This is a personal information-collection repository, not an official Happy Horse repository and not a public release.
Happy Horse 1.0 has not yet been officially open-sourced. No model weights, no inference code, and no official GitHub repository have been published yet. Everything in this README is gathered from publicly circulating sources — community-compiled architecture notes, alleged technical leaks, the Artificial Analysis Video Arena, project landing pages, and independent discussion threads — and is collected here for personal reference and learning.
I am not affiliated with the Happy Horse team. The information below may be incomplete, inconsistent across sources, or change at any time. I am watching for updates and will sync this README the moment the official model, repository, or technical report is released.
HappyHorse 1.0, officially translated as “快乐小马”, is Alibaba ATH Innovation Unit’s latest video generation and editing model. It started grayscale testing in China on April 27, 2026.
- official website: www.happyhorse.com
- Try HappyHorse now: happyhorse
HappyHorse 1.0 can be experienced through:
- Alibaba Cloud Bailian
- HappyHorse official website
- Qwen App
- Alibaba Wukong
- MuleRun
- JVS Claw
On Arena.ai, HappyHorse 1.0 ranked second in:
- Text-to-video
- Image-to-video
- Video editing
It ranked behind ByteDance Seedance 2.0.
- Text-to-video generation
- Image-to-video generation
- Video editing
- 3s–15s video generation
- Multi-shot generation
- Coherent short-form storytelling
- Up to 1080p resolution
- Up to 4 videos generated at the same time
| Resolution | Listed Price | Discounted Pro Price |
|---|---|---|
| 720p | RMB 0.9 / second | RMB 0.44 / second |
| 1080p | RMB 1.6 / second | RMB 0.78 / second |
- Fast generation speed: around 2–5 minutes per video
- Strong prompt-following ability
- Good control over camera movement, composition, style, and atmosphere
- High reference fidelity in image-to-video generation
- Supports multi-shot and coherent short video storytelling
- Audio-video synchronization still needs improvement in complex scenes
- Longer videos may contain physical inconsistencies
- Text rendering may produce garbled or incorrect characters
Happy Horse 1.0 is described as an open-source state-of-the-art AI video generator with native joint audio-video generation — meaning the Happy Horse AI video model produces video frames and the corresponding audio track (dialogue, ambient sound, Foley) together in a single forward pass, rather than generating silent video and dubbing it afterward.
According to community-compiled architecture notes, the model is built around a 15-billion-parameter unified self-attention Transformer that processes text, image, video, and audio tokens within a single token sequence. It is reportedly built without dedicated cross-attention branches and without a separate audio module. Combined with DMD-2 distillation, the distilled variant is reported to generate 1080p video in roughly 38 seconds on an NVIDIA H100, using only 8 denoising steps without classifier-free guidance.
The Happy Horse AI video generator first surfaced publicly as a "mystery model" on the Artificial Analysis Video Arena, where it appeared anonymously alongside frontier closed models from ByteDance, Kling, and Google. Independent community observers noted that it appeared to come from Asia and that — unusually for a mystery entry — it included native audio output.
As of this writing, the announcement promises that the public release will include the base model, the distilled 8-step model, the super-resolution module, and inference code. None of these have been published yet. You can follow the official Happy Horse platform for release news.
The technical details below come from community-compiled architecture notes and alleged technical leaks circulating about Happy Horse 1.0. None of this has been confirmed by an official technical report or peer review. Treat as plausible but unverified until the model is officially released.
| Component | Reported Specification |
|---|---|
| Total parameters | ~15B |
| Architecture | Unified self-attention Transformer (reportedly without dedicated cross-attention branches) |
| Total layers | 40 |
| Layer layout | Sandwich — first 4 + last 4 layers use modality-specific projections; middle 32 layers reportedly share parameters across all modalities |
| Modalities processed | Text, image, video, and audio tokens — concatenated into a single token sequence |
| Multimodal fusion | Per-attention-head learned scalar gates with sigmoid activation, used for training stability |
| Conditioning | Reference image and denoising signals routed through one minimal unified interface — reportedly no dedicated conditioning branches |
| Timestep handling | Reportedly no explicit timestep embeddings — the model is said to infer the denoising state directly from the noise level of input latents |
| Distillation | DMD-2 (Distribution Matching Distillation v2) |
| Sampling steps | 8 |
| Classifier-free guidance | Reportedly not required |
| Inference runtime | MagiCompiler — full-graph compilation with operator fusion across Transformer layers, reported to give ~1.2× end-to-end speedup |
| Reference GPU | NVIDIA H100 80GB |
Unified self-attention instead of cross-attention. Most existing open-source video models — Wan 2.2, HunyuanVideo, LTX-2, CogVideoX — use a Diffusion Transformer (DiT) backbone where text conditioning is injected via cross-attention from a separate text encoder, and audio (when present) is generated by a different model entirely. Happy Horse 1.0 is reported to rely entirely on unified self-attention, putting text, image, video, and audio into the same sequence and letting attention handle all of it. The claim is that this lets the model learn audio-video alignment as a fundamental part of denoising rather than as a downstream fix-up step.
Sandwich layer layout. The first 4 and last 4 layers reportedly handle modality-specific embedding and decoding; the middle 32 layers share parameters across all modalities. The reported benefit is parameter efficiency — most of the network does cross-modal reasoning instead of being split into siloed sub-networks.
Per-head sigmoid gating. Joint multimodal training is notoriously unstable, because gradients from the audio loss can dominate or be dominated by gradients from the video loss. The reported solution: a learned scalar gate on each attention head, allowing the model to selectively dampen heads producing destructive gradients in a particular modality.
Timestep-free denoising. Most diffusion models feed the current denoising timestep into every layer as an explicit embedding. Happy Horse 1.0 is reported to omit this entirely, on the basis that the noise level is already encoded in the noisy latents themselves. This is described as one of the prerequisites for the aggressive 8-step DMD-2 distillation.
DMD-2 distillation. Standard video diffusion typically needs 25–50 sampling steps with classifier-free guidance, both of which double or triple inference cost. DMD-2 (Distribution Matching Distillation v2) trains a student model to match the teacher's output distribution in just 8 steps with no CFG. This is what the reported "1080p in ~38 seconds" figure rests on.
This is the defining reported feature. A single unified Transformer is said to denoise video tokens and audio tokens together in the same sequence. Dialogue, Foley, and ambient sound are produced in one pass and are inherently aligned to the visuals — no separate dubbing or lip-sync model needed.
Reported to generate up to 1080p video at clip lengths of roughly 5–10 seconds, with multiple aspect ratios.
According to the technical project description, Happy Horse 1.0 lip-sync is natively trained for English, Mandarin Chinese, Japanese, Korean, German, and French with low Word Error Rate. (One marketing-oriented page lists 7 languages including Cantonese — the official count should be confirmed at release.)
Reported wall-clock generation time on an NVIDIA H100 using the distilled Happy Horse model. A 5-second 256p preview is reported at around 2 seconds.
A single set of weights is reported to handle both text-to-video and image-to-video tasks within the same Happy Horse pipeline.
The announced release scope includes the base Happy Horse 1.0 model, the distilled 8-step model, the super-resolution module, and the inference code. License terms have not been published.
The most-cited public benchmark for AI video models is the Artificial Analysis Video Arena, which uses blind head-to-head voting to compute Elo ratings. Happy Horse 1.0 has been observed competing in the arena under its codename, but as of this writing it is not yet listed in the official public leaderboard tables.
Authoritative leaderboards to watch for the model's eventual public placement:
- 🥊 Artificial Analysis Video Arena — vote on blind comparisons
- 📈 Artificial Analysis Text-to-Video Leaderboard
- 🖼️ Artificial Analysis Image-to-Video Leaderboard
Note on rankings: Elo ratings on the Artificial Analysis arena are recomputed continuously as new votes come in, and a model's rank can move several places within a single day. The tier table below is a point-in-time snapshot of the Text-to-Video leaderboard as of the date this README was last updated. For the most current ordering, always check the live leaderboards linked above.
Where Happy Horse 1.0 eventually places will be most meaningful when read against the existing field. At the time of writing, top entries on the Artificial Analysis Text-to-Video Leaderboard fall roughly into these tiers:
| Tier | Models |
|---|---|
| Frontier closed models (Elo ~1,200–1,275) | Dreamina Seedance 2.0, SkyReels V4, Kling 3.0, PixVerse V6, Veo 3.1, Runway Gen-4.5 |
| Mid-tier closed models (Elo ~1,150–1,200) | Sora 2 Pro, Hailuo 2.3, Wan 2.6, Vidu Q2 |
| Top open-weights models (Elo ~1,100–1,135) | LTX-2 Pro, LTX-2 Fast, Wan 2.2 A14B |
| Earlier open-weights (Elo ~950–1,020) | HunyuanVideo-1.5, Wan 2.1 14B, Wan 2.2 5B, Hunyuan Video |
In this landscape, anything that places at or above the LTX-2 line is considered the open-source state of the art. Anything in the frontier closed tier is competing directly with the best paid APIs. I will update this section with Happy Horse 1.0's verified placement once it appears on the official leaderboard.
A frequently asked question in the AI video community is how Happy Horse 1.0 stacks up against the current open-source leaders. Based on reported specifications, here is the comparison:
| Feature | Happy Horse 1.0 (reported) | LTX-2 Pro | Wan 2.2 A14B | HunyuanVideo-1.5 | CogVideoX-5B |
|---|---|---|---|---|---|
| Parameters | ~15B | ~13B | 14B | ~13B | 5B |
| Backbone | Unified self-attention Transformer | DiT | DiT | DiT | DiT |
| Native audio generation | ✅ joint with video | ❌ | ❌ | ❌ | ❌ |
| Native lip-sync languages | 6 (reported) | 0 | 0 | 0 | 0 |
| Sampling steps | 8 (no CFG) | ~25 | ~50 | ~50 | ~50 |
| Reported 1080p time | ~38s on H100 | minutes | minutes | minutes | minutes |
| Text-to-video | ✅ | ✅ | ✅ | ✅ | ✅ |
| Image-to-video | ✅ unified | ✅ | ✅ | ✅ | ✅ |
| Open weights available today | ❌ not yet | ✅ | ✅ | ✅ | ✅ |
The headline differentiator on paper is native joint audio-video generation. Wan 2.2, HunyuanVideo, LTX-2, and CogVideoX all generate silent video and require entirely separate models for sound and lip-sync. The headline disclaimer is the last row: all of those competitors have already released weights, while Happy Horse 1.0 has not.
Based on the announced capabilities, the Happy Horse AI video generator — once released — would be relevant for the following workflows. None of this has been verified by hands-on testing yet.
- Short-form social video — TikTok, Reels, and Shorts content with native sound, no separate dubbing pipeline.
- Marketing and ad creative — launch trailers, product teasers, and high-converting ad creatives with cinematic motion.
- Multilingual campaigns — running a single creative concept across English, Chinese, Japanese, Korean, German, and French markets without re-shooting.
- B-roll and previz — establishing shots, concept footage, and animatics for film, TV, and YouTube production.
- E-commerce product video — turning product photos into motion demos via image-to-video.
- AI research — open weights for studying joint audio-video diffusion, unified multimodal Transformers, DMD-2 distillation, and timestep-free denoising, once published.
Last updated: April 27, 2026
HappyHorse is an Alibaba-associated AI video generation model designed for creating short videos from text prompts, images, and reference visuals. If you want to try it without setting up API keys or writing code, you can use the HappyHorse online AI video generator to create videos directly in your browser.
This FAQ answers the most common questions about HappyHorse, including whether it is free, whether it supports 1080P, how the API works, how long video generation takes, whether outputs can be used commercially, and what you should know before using it for ecommerce, ads, social media, or creative projects.
HappyHorse is an AI video generation model associated with Alibaba. It supports several video creation workflows, including text-to-video, image-to-video, and reference-to-video.
In practical terms, HappyHorse can help users create short AI-generated videos from a written prompt, animate an existing image, or use reference images to guide the look of a character, product, or visual subject.
The current HappyHorse API family includes three main model types:
| Capability | Model | What it does |
|---|---|---|
| Text to video | happyhorse-1.0-t2v |
Generates a video from a text prompt |
| Image to video | happyhorse-1.0-i2v |
Animates a single image into a video |
| Reference to video | happyhorse-1.0-r2v |
Generates a video using one or more reference images |
HappyHorse is best suited for short-form AI video generation, especially for social media clips, product visuals, ecommerce content, advertising concepts, character animation, and creative storyboarding.
HappyHorse may be free or paid depending on where you use it.
If you access HappyHorse through Alibaba Cloud Model Studio, usage depends on Alibaba Cloud’s API access, pricing, quota, and regional availability. Direct API usage usually requires an API key, endpoint configuration, asynchronous task handling, and result retrieval.
If you use a third-party HappyHorse-powered website, free access depends on that platform’s own policy. Some tools may offer free trials, limited daily generations, paid credits, subscription plans, or watermark-free exports.
For users who want to try HappyHorse without configuring the API, happyhorses.io provides an online way to generate HappyHorse-powered AI videos from prompts or images.
Yes. HappyHorse supports 1080P video generation.
HappyHorse API documentation lists both 720P and 1080P as supported resolution options. In many HappyHorse workflows, 1080P is the default resolution.
However, resolution is only one part of video quality. The final output also depends on the prompt, input image quality, subject complexity, video duration, camera motion, aspect ratio, and generation randomness.
For best results, use a clear prompt, avoid overly crowded scenes, and upload high-quality images when using image-to-video or reference-to-video.
Yes. HappyHorse has API access through Alibaba Cloud Model Studio.
The API supports three major video generation workflows:
- Text-to-video — generate a video directly from a written prompt.
- Image-to-video — animate a single image into a short video.
- Reference-to-video — use one or more reference images to guide the generated video.
HappyHorse API calls are asynchronous. That means you do not receive the finished video immediately after submitting a request. Instead, the API creates a task, returns a task ID, and then you query the task status until the result is ready.
For developers, this workflow offers flexibility. For non-technical users, an online generator is usually easier because it handles the API setup, task polling, and video retrieval behind the scenes.
HappyHorse video generation usually takes about 1 to 5 minutes, depending on the workflow, resolution, duration, server load, and complexity of the request.
A simple 5-second video may complete faster than a longer 1080P video with detailed motion, multiple subjects, or complex reference images.
Generation time can vary based on:
- Video duration
- Output resolution
- Prompt complexity
- Input image quality
- Number of reference images
- Platform traffic or server load
- Whether you are using text-to-video, image-to-video, or reference-to-video
If you are generating videos for a campaign, product launch, or client project, it is best to create multiple variations in advance rather than waiting until the last minute.
HappyHorse currently focuses on short video generation.
The supported duration range is typically 3 to 15 seconds, with 5 seconds commonly used as a default duration. This makes HappyHorse especially useful for short-form visual content rather than long-form video production.
HappyHorse is a good fit for:
- TikTok clips
- Instagram Reels
- YouTube Shorts
- Product motion shots
- Ecommerce ads
- App promo clips
- Brand concept videos
- Social media visuals
- AI storyboard previews
- Short cinematic scenes
If you need a longer video, the best workflow is to generate multiple short clips and edit them together in a video editor.
HappyHorse text-to-video supports common aspect ratios used across social media, ecommerce, advertising, and web design.
Common supported ratios include:
| Aspect ratio | Best for |
|---|---|
16:9 |
YouTube, websites, landing pages, landscape ads |
9:16 |
TikTok, Reels, Shorts, vertical mobile ads |
1:1 |
Social posts, product previews, square ads |
4:3 |
Presentation-style visuals or classic formats |
3:4 |
Portrait-style creative assets |
For image-to-video, the output ratio usually follows the uploaded image. If you want a vertical video, start with a vertical image. If you want a landscape video, start with a landscape image.
A simple rule:
- Use text-to-video when you want to choose the aspect ratio directly.
- Use image-to-video when you want to animate an existing image and preserve its layout.
Yes. HappyHorse supports both Chinese and English prompts.
A good prompt should describe the subject, action, scene, camera movement, lighting, style, and intended mood. The more specific the prompt, the easier it is for the model to generate a useful result.
Example prompt:
A cinematic close-up of a golden retriever running through a sunlit meadow at sunrise, soft backlight, shallow depth of field, realistic fur movement, smooth slow motion, warm color grading.
For ecommerce or advertising videos, it is usually better to keep the scene simple and describe the product clearly.
Example ecommerce prompt:
A premium skincare bottle standing on a marble surface, soft studio lighting, gentle camera push-in, water droplets on the bottle, clean luxury beauty advertisement style, realistic reflections.
Yes. HappyHorse supports image-to-video generation.
Image-to-video uses an uploaded image as the first frame and creates motion based on that image. You can also add a prompt to describe how the image should move, what the camera should do, or what style the final video should have.
Image-to-video works well for:
- Product photos
- Portraits
- Fashion images
- Character designs
- Posters
- Real estate images
- Food photos
- Game concept art
- Ecommerce visuals
- Social media creative assets
For best results, upload a clear image with a visible subject. Avoid blurry images, low-resolution screenshots, heavy compression, cluttered backgrounds, or images where the main subject is too small.
Yes. HappyHorse offers a reference-to-video workflow that can help keep a character, product, mascot, or visual subject more consistent across a generated video.
Reference-to-video lets you provide reference images and refer to them in the prompt. This is useful when you want the generated video to preserve a specific person, object, outfit, product design, or brand visual.
This workflow is especially useful for:
- Product advertising
- Brand mascots
- Virtual influencers
- Character videos
- Fashion lookbooks
- Game characters
- Ecommerce product clips
- Visual identity testing
Example prompt:
character1 stands in a clean studio environment, the camera slowly pushes in, soft commercial lighting, premium product advertising style, realistic reflections, smooth motion.
For better consistency, use sharp reference images with clear subjects, simple backgrounds, and minimal visual noise.
HappyHorse supports a watermark setting.
Depending on the platform or API configuration, generated videos may include a “HappyHorse” watermark in the lower-right corner. Some platforms may allow watermark-free generation, while others may reserve watermark removal for paid plans or specific export settings.
If you are using generated videos for ads, ecommerce, or client work, always check the final export before publishing.
If you use the HappyHorse API directly, result links may be temporary. API-generated video URLs are typically not meant to be permanent storage links.
For third-party tools, storage policies vary. Some platforms may save your generation history, while others may only provide temporary download links.
Best practice: download and back up your generated video as soon as it is ready, especially if you plan to use it for commercial, marketing, or client-facing work.
HappyHorse can be useful for commercial creative workflows, but whether you can use a specific generated video commercially depends on the platform terms, your input assets, and the content of the final output.
HappyHorse-style AI video generation can be useful for:
- Product videos
- Ecommerce ads
- Social media campaigns
- Brand concept videos
- App promo videos
- Creative testing
- Storyboard previews
- Character or mascot videos
Before using a generated video commercially, check:
- The terms of the platform you used
- Whether your input images are licensed for commercial use
- Whether the output includes protected characters, celebrities, trademarks, logos, or copyrighted elements
- Whether your local laws or advertising platforms require AI-generated content disclosure
A safe rule is to use your own assets, avoid imitating protected IP, and review the platform’s commercial-use policy before publishing.
Yes. HappyHorse is especially useful for ecommerce, advertising, and short-form marketing content.
It can help teams create visual concepts faster, test multiple creative directions, and produce short video assets without organizing a full video shoot.
Common ecommerce and advertising use cases include:
- A product hero shot with subtle camera movement
- A fashion product in a lifestyle scene
- A skincare bottle with studio lighting
- A food product in a cinematic close-up
- A seasonal campaign visual
- A social media ad concept
- A product teaser video
- A mascot or brand character clip
For commercial campaigns, keep the prompt focused and product-centered. Simple scenes with one clear subject usually produce more usable results than crowded scenes with too many details.
No. You only need coding skills if you want to call the HappyHorse API directly.
Developers can use the API to integrate HappyHorse into apps, websites, creative tools, automation workflows, or internal production systems. Direct API usage usually requires handling API keys, endpoints, asynchronous tasks, task status polling, result downloads, and error handling.
If you simply want to create a video, an online generator is easier. You can enter a prompt, upload an image, choose basic settings, start generation, and download the result.
To try this workflow without coding, visit the HappyHorse online AI video generator.
Be careful with that wording.
HappyHorse has been publicly associated with Alibaba, and HappyHorse API documentation is available through Alibaba Cloud Model Studio. However, that does not mean every website using the HappyHorse name is official, and it does not necessarily mean the model is open source.
If a website integrates HappyHorse through an API, the safer wording is:
This tool provides online access to HappyHorse-powered AI video generation.
or:
This platform integrates HappyHorse video generation so users can create AI videos online.
Avoid claiming to be the “official HappyHorse website” unless the site is actually operated or authorized by Alibaba.
HappyHorse is best for short, visually rich AI videos.
You can use it to create:
- AI product videos
- Ecommerce advertising clips
- Social media content
- TikTok and Reels videos
- YouTube Shorts visuals
- App launch videos
- Brand concept videos
- Character animation
- Game concept trailers
- Fashion and beauty visuals
- Food and beverage ads
- Real estate previews
- Interior design visuals
- Storyboard and pitch visuals
Because HappyHorse supports text-to-video, image-to-video, and reference-to-video workflows, it can be useful for both quick idea generation and more controlled creative production.
To get better HappyHorse results, use specific prompts, clean images, and simple scenes.
Here are practical tips:
-
Describe the subject clearly
Say exactly who or what should appear in the video.
-
Add motion
Include actions such as walking, floating, rotating, pouring, blooming, zooming, or camera panning.
-
Define the camera style
Use phrases like “close-up,” “wide shot,” “slow dolly in,” “aerial view,” or “cinematic handheld shot.”
-
Set the lighting and mood
Try “soft studio lighting,” “golden hour,” “neon cyberpunk lighting,” or “clean commercial lighting.”
-
Keep the scene focused
One clear subject often works better than a crowded prompt with many competing details.
-
Use high-quality images
For image-to-video and reference-to-video, clean input images usually produce better output.
-
Generate multiple versions
AI video generation is probabilistic. Small changes in prompt, seed, image, or settings can produce different results.
A strong HappyHorse prompt usually follows this structure:
Subject + action + scene + camera movement + lighting + visual style + mood + platform format.
Example:
A luxury perfume bottle rotating slowly on a reflective black surface, soft studio lighting, close-up product shot, slow camera push-in, cinematic commercial style, elegant and premium mood, 16:9 landscape video.
For vertical social videos, include the intended platform format:
A stylish sneaker floating above a clean gradient background, slow 360-degree rotation, dramatic studio lighting, sharp product details, modern streetwear ad style, vertical 9:16 format for TikTok and Reels.
This structure helps the model understand what should appear, how it should move, and what visual style you want.
You can try HappyHorse online by using a generator that supports HappyHorse-powered video creation.
The basic workflow is simple:
- Enter a text prompt or upload an image.
- Choose the generation mode.
- Select video settings when available.
- Start generation.
- Download the finished video.
Whether you want to create a product video, social media clip, advertising concept, animated image, or creative storyboard, HappyHorse gives you a fast way to turn ideas into short AI-generated videos.
HappyHorse can generate short AI videos from text prompts, images, and reference visuals. It is useful for ecommerce, social media, advertising, product videos, character animation, and creative testing.
If you want to try HappyHorse without setting up API access, use the HappyHorse online AI video generator to create your first video.
Primary CTA: Try the HappyHorse Online Generator
Secondary CTA: Upload an Image to Create a Video
This README is compiled from publicly circulating information about Happy Horse 1.0 as of the date below. Two categories of sources are referenced:
- Official information channel — the official Happy Horse AI video generator site is the source of record for live demos and updates. All other promotional pages discovered during research have been intentionally not linked here, since they are third-party and their claims occasionally contradict each other.
- Authoritative third-party benchmark — the Artificial Analysis Video Arena and its text-to-video and image-to-video leaderboards are the only third-party sources linked in this document, because they are widely cited as the standard public benchmark for AI video models.
Last updated: April 8, 2026 Status: Pre-release · awaiting official open-source publication Will be updated when: the official repository goes live, the model weights are published, a technical report is released, or the model appears in the public Artificial Analysis leaderboard tables.
This is a personal information-collection repository.
Happy Horse 1.0 has not yet been officially open-sourced — this README will be synced as soon as it is.