
AI Prompt

CS50x Final Project

Description:

This application allows the user to prompt an AI assistant with a recorded voice clip and receive a response in both audio and text. To accomplish this, I used three different APIs. To convert the voice clip to text, I used Google Cloud's Speech-to-Text API. The resulting transcription is then used in a request to OpenAI's Chat Completions API with the gpt-3.5-turbo language model. Finally, the generated text is sent to Google Cloud's Text-to-Speech API to produce the audio response.
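
The middle step of that pipeline, the request to the language model, looks roughly like the sketch below. This is a minimal example using the pre-1.0 openai Python library; the function name and the system prompt are placeholders of mine, not necessarily what the app actually uses.

```python
import openai

def ask_assistant(transcript: str) -> str:
    """Send the transcribed voice prompt to gpt-3.5-turbo and return the reply text."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful voice assistant."},  # placeholder prompt
            {"role": "user", "content": transcript},
        ],
    )
    return response["choices"][0]["message"]["content"]
```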

Google's cloud services use their own authentication system through their CLI and Python libraries, but OpenAI uses a standard API key. In order to keep the key secure, I have it saved in a .env file. The python-dotenv library loads it as an environment variable, which is then accessed using the os module from the Python standard library. This file and others are kept out of the repository by a .gitignore file.
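
In outline, the key loading looks something like this (the variable name OPENAI_API_KEY is an assumption on my part, since the .env file itself is not in the repository):

```python
import os

import openai
from dotenv import load_dotenv

load_dotenv()                                 # read key=value pairs from .env into the environment
openai.api_key = os.getenv("OPENAI_API_KEY")  # variable name is assumed; the real .env is gitignored
```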

The application itself is a relatively simple Flask app, with one route that accepts three methods: GET, POST, and PUT. The use of PUT in particular is not entirely standard according to REST design principles, but it seemed reasonable because I did not implement a database and this is essentially a single-page application. I wrote one layout template, plus a template for each of the "prompt" and "response" pages. I used the Tailwind CSS framework and the daisyUI component library for the visual elements of the page. I mostly used the out-of-the-box functionality there, except for the tiniest bit of JavaScript to replace the submit button with one that has a loading animation. Not much, but I would not have known to do that before taking CS50x!
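
A minimal sketch of the single route is below. The template names and the transcribe(), ask_assistant(), and speak() helpers are placeholders standing in for the API calls sketched elsewhere in this write-up, so the details may differ from the real code.

```python
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST", "PUT"])
def index():
    if request.method == "PUT":
        # The recorded audio blob is sent here by the front-end JavaScript;
        # the returned transcription is dropped into the prompt input field.
        return transcribe(request.get_data())
    if request.method == "POST":
        # Submit the prompt to the language model and render the response page
        # with both the text reply and the synthesized audio.
        reply = ask_assistant(request.form["prompt"])
        return render_template("response.html", reply=reply, audio_b64=speak(reply))
    return render_template("prompt.html")
```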

The audio recording is done with JavaScript, using the MediaStream Recording API. The client-side code for this was repurposed from the Web Dictaphone sample application. I removed the parts that I didn't need, changed the audio format to WebM to ensure compatibility with Chrome, and added some code to send the audio blob to the back end and use the response to populate the input field of the page.

One interesting problem that I was able to solve had to do with the audio response from the Text-to-Speech API. The method of handling the response shown in the documentation is to save it to a file, which is likely the standard practice. Since render_template() is not suitable for sending binary data, the obvious approach was to save it to a file, pass the filename to Jinja, and use url_for() to create the src attribute of the audio tag inside the template. But for various reasons, from performance to security, I wanted to find a way to avoid writing files to disk altogether. I found a couple of articles online that discussed a method of passing image data without saving it to a file [1] [2]. The technique involves first encoding the binary data as base64 and then decoding the result into a Unicode string so that it can be passed as ordinary string data. I was able to adapt that approach for my app, and I think it was a major improvement.
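
A minimal sketch of that idea is below; the helper name, voice settings, and language code are assumptions, but the base64 round trip is the important part.

```python
import base64

from google.cloud import texttospeech

def speak(text: str) -> str:
    """Synthesize `text` and return it as a base64 string, never touching the disk."""
    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3
        ),
    )
    # audio_content is raw bytes; b64encode() also returns bytes, so decode()
    # turns it into a plain string that Jinja can drop into the template.
    return base64.b64encode(response.audio_content).decode("ascii")
```

In the template, that string becomes the src of the audio element as a data URI, something like `<audio controls src="data:audio/mpeg;base64,{{ audio_b64 }}"></audio>`.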

For some time, I did not think I would be able to avoid the complications of file creation. At first, it seemed that I would have to use a third-party library to convert the audio into a suitable format for the Speech-to-Text API. I tried to use the method that I outlined above, but could not because of a limitation of the library I was using (likely a filename input requirement, similar to what this Stack Overflow response suggests). Fortunately, I realized that I was not going to have to perform any conversion at all. After digging into the API documentation and finding out which formats were compatible with which browsers (with the help of this code snippet), I was able to determine that Opus-encoded WebM was the way to go. The last bit of information I needed to satisfy the STT API was the sample rate of MediaRecorder's WebM output (it's 48,000 Hz).
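
Putting those two facts together, the recognition request can work directly on the bytes of the uploaded blob. A minimal sketch, assuming the function name and language code (the real code may differ):

```python
from google.cloud import speech

def transcribe(audio_bytes: bytes) -> str:
    """Send the recorded blob straight to Speech-to-Text, with no file and no conversion."""
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.WEBM_OPUS,  # Opus audio in a WebM container
        sample_rate_hertz=48000,                                    # MediaRecorder's WebM sample rate
        language_code="en-US",                                      # assumed; the app may differ
    )
    audio = speech.RecognitionAudio(content=audio_bytes)
    response = client.recognize(config=config, audio=audio)
    return " ".join(result.alternatives[0].transcript for result in response.results)
```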

Because I consider it part of my project, I'd like to note that the app is currently live on a VPS that I configured from its default state. This included a number of system administration tasks: creating a regular user account with sudo privileges, disabling root login via SSH, disabling password login for SSH, configuring a firewall, and installing all the software packages needed for development. I updated the resource records of my domain to point to the new VPS and installed an HTTPS certificate. The Flask application is served by Gunicorn behind an Nginx proxy that is configured to redirect HTTP traffic to HTTPS. I was familiar with some of these tasks, but many of them were new to me. Everything considered, this project was a great opportunity to bring together much of what we learned in this class, and I personally found it immensely rewarding.
