<a href="https://colab.research.google.com/github/bahramkhanlarov/Analyse_audio_recordings_with_Gemini1.5/blob/main/quickstarts/Audio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analyze audio recordings with Gemini 1.5

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Audio.ipynb"><img src="https://github.com/google-gemini/cookbook/blob/main/images/colab_logo_32px.png?raw=1" />Run in Google Colab</a>
  </td>
</table>

This notebook shows how to prompt Gemini 1.5 Flash with an audio file. You upload an audio file and then ask the model to transcribe it or to summarize it as bullet points.

### Install dependencies

In [1]:
!pip install -q -U "google-generativeai>=0.7.2"


In [2]:
import google.generativeai as genai

### Configure your API key

To run the following cell, your API key must be stored in a Colab Secret named `GOOGLE_API_KEY`. If you don't already have an API key, or you're not sure how to create a Colab Secret, see [Authentication](../quickstarts/Authentication.ipynb) for an example.

In [8]:
from google.colab import userdata
GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')

genai.configure(api_key=GOOGLE_API_KEY)
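Outside Colab, `google.colab` cannot be imported, so the cell above fails. The following is a minimal sketch of a fallback that reads the same key from an environment variable. The helper name `load_api_key` is hypothetical and not part of the SDK, and using an environment variable with the same name as the Colab Secret is an assumption.

```python
import os


def load_api_key(name: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from a Colab Secret, or from the environment outside Colab.

    Hypothetical helper for illustration; not part of google-generativeai.
    """
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        # Assumption: outside Colab the key is exported as an environment variable.
        key = os.environ.get(name)
    else:
        key = userdata.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set")
    return key
```

You could then call `genai.configure(api_key=load_api_key())` in place of the Colab-specific lines.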

## Upload an audio file with the File API

To use an audio file in your prompt, you must first upload it using the [File API](../quickstarts/File_API.ipynb).


In [14]:
your_file = genai.upload_file(path='/content/attention_all_you_need.wav')

## Use the file in your prompt

In [16]:
prompt = "Listen carefully to the following audio file. Provide a full transcript."
model = genai.GenerativeModel('models/gemini-1.5-flash')
response = model.generate_content([prompt, your_file])
print(response.text)

All right, ready to dive into some seriously cool AI stuff. Absolutely, let's do it. Today we're tackling this paper that well kind of hit the reset button on how machines understand us. How they understand language. Yeah. It's a big one. Attention Is All You Need. Catchy title, right? Definitely eye-catching. But, before you think whoa, too techy for me, trust me, this one's Well, it's a game changer. Yeah. We're talking about the tech behind, you know, those mind-blowing translation apps that magically makes sense of any language. Mm-hm. Or even that AI chatbot everyone's buzzing about: ChatGPT, the one that can write poems and get this, even code. It's pretty amazing. This paper came out back in 2017. A group of well brilliant minds over at Google basically said Hold my AI-powered drink, and turn the whole field on its head. Yeah. They really did. It's fascinating, really. Before this paper, AI was kind of stuck in this rut. Yeah. Imagine trying to understand a story, right? But you

In [17]:
prompt = "Listen carefully to the following audio file. Provide a summary as bullet points."
model = genai.GenerativeModel('models/gemini-1.5-flash')
response = model.generate_content([prompt, your_file])
print(response.text)

## Summary of Audio File:

* **Topic:** This audio file discusses a groundbreaking paper from 2017 that revolutionized how AI understands language. 
* **Focus:** The paper focused on the concept of "attention," allowing AI to understand the context and relationships between words in a sentence.
* **Challenge:** Previously, AI struggled to grasp the overall meaning of sentences due to its inability to process context. 
* **Breakthrough:** The paper introduced "scaled dot-product attention," a mechanism that helps AI focus on the most relevant words in a sentence. 
* **Metaphor:** It's like giving AI a highlighter to identify key phrases, or a team of linguistic detectives each specializing in a different aspect of the sentence.
* **Transformer:**  The paper used an architecture called the "transformer," which works like a factory with different stations (encoder, decoder, positional encoding) to process information.
* **Importance of Order:**  The paper highlighted the importance of wor

## Inline Audio

For small requests, you can inline the audio data in the request, as you can with images. Use PyDub to load the audio; you can optionally trim it first (for example, to the first 10 seconds) to keep the request small:

In [18]:
!pip install -Uq pydub

In [19]:
from pydub import AudioSegment

In [30]:
# from_file() lets pydub/ffmpeg determine the format instead of assuming MP3 for a .wav file.
sound = AudioSegment.from_file("attention_all_you_need.wav")

In [24]:
# sound = sound[:10000]  # Uncomment to keep only the first 10 s; slices are in milliseconds.

Add it to the list of parts in the prompt:

In [31]:
response = model.generate_content([
    "Please transcribe this recording:",
    {
        "mime_type": "audio/mp3",
        "data": sound.export().read()  # export() encodes to MP3 by default, matching mime_type
    }
])
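The inline part above is a plain dictionary with a `mime_type` and raw bytes, so it can be built and checked before any request is sent. The sketch below is one way to do that. The helper `inline_audio_part`, the extension-to-MIME mapping, and the 20 MB threshold are assumptions for illustration; check the current API documentation for the actual supported types and size limits. The prompt text also counts toward the request size, so the check is approximate.

```python
import os

# Assumed mapping from file extension to audio MIME type; adjust to the formats you use.
AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aiff": "audio/aiff",
    ".aac": "audio/aac",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}

# Assumed upper bound for inline data; larger files should go through the File API.
MAX_INLINE_BYTES = 20 * 1024 * 1024


def inline_audio_part(data: bytes, filename: str, max_bytes: int = MAX_INLINE_BYTES) -> dict:
    """Build an inline audio part, rejecting unknown extensions and oversized payloads."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in AUDIO_MIME_TYPES:
        raise ValueError(f"Unrecognized audio extension: {ext!r}")
    if len(data) > max_bytes:
        raise ValueError(f"{len(data)} bytes exceeds the inline limit of {max_bytes}")
    return {"mime_type": AUDIO_MIME_TYPES[ext], "data": data}
```

For example, `inline_audio_part(sound.export().read(), "clip.mp3")` would produce the same dictionary as the cell above, and a `ValueError` signals that the File API upload shown earlier is the better route.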

In [32]:
from IPython import display

display.Markdown(response.text)

All right, ready to dive into some seriously cool AI stuff. Absolutely, let's do it. Today we're tackling this paper that well, kind of hit the reset button on how machines understand us, how they understand language. Yeah, it's a big one. Attention is All You Need. Catchy title, right? Definitely eye-catching. But before you think, "Whoa, too techy for me," trust me, this one's well, it's a game changer. Yeah. We're talking about the tech behind, you know, those mind-blowing translation apps that magically make sense of any language. Mhm. Or even that AI chatbot everyone's buzzing about, ChatGPT, the one that can write poems and get this, even code. It's pretty amazing. This paper came out back in 2017. A group of well, brilliant minds over at Google basically said hold my AI-powered drink, and turn the whole field on its head. Yeah. They really did. It's fascinating really. Before this paper, AI was kind of stuck in this rut, you know. Imagine trying to understand a story, right? But you're forced to read just one word at a time. You'd completely lose track of the plot. Yeah, for sure. That's what those older AI models were dealing with. Really? They had a really hard time grasping like the context of like an entire sentence. Especially if it was a longer one. It was like, they had a short-term memory problem or something. Okay, yeah, that analogy definitely makes it relatable. Yeah. I can barely remember a grocery list let alone like a whole sentence word by word. Mhm. So, this attention thing everyone keeps mentioning, that's the secret sauce. Yeah, you could say that. The breakthrough that allowed AI to finally see the whole picture. Yeah, you know it's funny, we humans we just take for granted how easily we can focus on what matters most in a sentence. Like if I say grab the blue book, not the red one, you instantly know which book I'm talking about. Right. But, for AI that used to be a real head-scratcher. It did indeed. 
And that's precisely what Attention is All You Need sought to address. Okay. How to make AI models attend to specific parts of a sequence just like we do when we read or listen. So, how do they do it? Do they like build a tiny little AI brain that could you know, concentrate? Not quite. The paper lays out this elegant mechanism called Scaled Dot-Product Attention. Scaled Dot-Product Attention? Yeah, it's a mouthful. Yeah, but basically, it's a way for the model to score each word in relation to all the other words. Figuring out which ones are most important for you know, understanding the overall meaning. Okay, so it's like giving AI a highlighter, letting it pick out the key phrases. Precisely. And it gets even more ingenious. They introduce this concept of multi-head attention, which is like having multiple highlighters, all well, each tuned to different aspects of the text. One might focus on the subject, another on the action, and so on. Right. Giving the AI a more multi-faceted understanding of the sentence. Wow, so it's like giving AI a whole team of like little linguistic detectives, each with their own specialty. That's a great way to put it. Yeah. Now, all of this magic happens within a specific architecture that they call the Transformer. Ah, yes. The Transformer. The name's been thrown around a lot. It has. It sounds kind of intimidating, like something out of a sci-fi movie. It might sound like something out of Transformers, but it's actually quite elegant in its design. Think of it as a well-organized factory floor with different stations all working together. Mhm. You've got the encoder station, which is responsible for you know, reading and understanding that input sequence. Like figuring out what our mystery novel is all about. Exactly. The encoder processes the text using this attention mechanism we talked about. Creating a rich representation of the information. Mhm. 
Then, you have the decoder, which takes that understanding and uses it to generate the output. Okay. Whether that's a translation, a response in a chatbot conversation, or even a poem. Okay, so the encoder is like the reader and the decoder is like the writer? Precisely. And there's one more crucial element positional encoding. Ooh, that sounds important. What does that do? Positional encoding. It sounds like some kind of secret code. Well, not quite a secret, but it is clever. You see, language it relies, like really heavily on order. Okay. The sentence, "The dog bit the man," means something totally different from "The man bit the dog." Oh, absolutely, yeah. Word order is everything. Exactly. But, for a machine, just seeing you know, a bunch of words without any sense of their order, it's like trying to solve a jigsaw puzzle with all the pieces flipped over. Oh, wow. That's where this positional encoding comes in. Okay. It's like giving each word a little flag that says hey, I'm number one in this sentence or you know, I come after this word but before that one. Huh. That way, the Transformer knows exactly how the words relate to each other in the sequence. Wow, that's that's incredibly clever. So, they basically taught AI grammar without actually teaching it grammar? You got it. And, the results well, they speak for themselves. This paper showed that Transformers with this whole attention mechanism, the positional encoding, they were able to translate languages faster, more accurately, with a deeper understanding than well, anything we'd seen before. Really, I'm impressed. But, it gets even better, right? Yeah. This whole Transformer thing, it isn't just a one-hit wonder, just limited to languages. Absolutely not. And that's what's so exciting about this paper. It opened up this whole new world of possibilities for AI. Because at its core, the Transformer, it's simply a powerful engine for understanding sequences. Okay. 
Whether those sequences are words in a sentence, images in a video, or even musical notes in a melody. Wait, hold on. AI composing music? Mhm. Image recognition, okay, we've all heard that before, but composing music? How does that even work? It's the same principle. Just like a sentence has structure and grammar, so does a piece of music. You have melodies, harmonies, rhythms, all arranged in a particular order to evoke certain emotions. But by feeding the Transformer this massive amount of musical data, it can start to learn those patterns, those relationships. And once it understands the language of music, so to speak, it can start to create its own compositions. This is blowing my mind a little bit. I always thought creativity was something uniquely human. But, here we are, on the verge of AI Mozart or you know, AI Beethoven. It's definitely a fascinating frontier. And it goes beyond just music. We're already seeing Transformers used in everything from like medical diagnosis, where they're analyzing patient data to predict illnesses, to self-driving cars where they can help the car understand and navigate its surroundings in real-time. So, from translating languages to composing symphonies to potentially driving our cars, Attention is All You Need really did kind of usher in a new era of AI, didn't it? Absolutely. This paper was a pivotal moment, a paradigm shift in how we think about machine intelligence. And the most exciting part? We're only just scratching the surface of what's possible. I don't know about you, but I'm both excited and a little terrified by the possibilities. So, to wrap up this incredible deep dive, if AI can now learn to attend like humans, what other human-like abilities might it develop? And, more importantly, what does that mean for the future of humanity? That's something for you all to ponder. Until next time, keep those AI curiosity circuits firing. B-b-b- 

## Count audio tokens

You can count the number of tokens in your audio file like this.

In [33]:
model.count_tokens([your_file])

total_tokens: 13760
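The token count can be turned into a rough duration estimate. The sketch below assumes a fixed rate of 32 tokens per second of audio, which is the rate the Gemini documentation describes for audio input. Treat the rate as an assumption that may change between models, and the result as an approximation. The helper names are hypothetical.

```python
# Assumed tokenization rate for audio input, in tokens per second.
TOKENS_PER_SECOND = 32


def estimate_audio_seconds(total_tokens: int, tokens_per_second: int = TOKENS_PER_SECOND) -> float:
    """Estimate the audio duration in seconds from a token count."""
    return total_tokens / tokens_per_second


def format_duration(seconds: float) -> str:
    """Format a duration in seconds as minutes and seconds."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}m {secs:02d}s"


seconds = estimate_audio_seconds(13760)
print(seconds, format_duration(seconds))  # 430.0 7m 10s
```

Under that assumption, the 13,760 tokens reported above correspond to about 7 minutes and 10 seconds of audio.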