It uses deep learning models, namely a CNN and an LSTM, to extract information from an image and store it as text. The text is then converted into audio using the Google Text-to-Speech API.
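As a rough illustration of this flow, the sketch below builds a caption one word at a time and then hands it to a text-to-speech step. It is not the project's actual code: it assumes the third-party `gTTS` package as the Google Text-to-Speech client, and `greedy_caption`, `caption_to_audio`, `fake_next_word`, and the `start`/`end` markers are hypothetical names. `fake_next_word` is only a stand-in for the trained CNN and LSTM model, which is not shown here.

```python
from typing import Callable, List

# Hypothetical sequence markers; the real tokens depend on how captions were preprocessed.
START, END = "start", "end"


def greedy_caption(next_word: Callable[[List[str]], str], max_length: int = 32) -> str:
    """Build a caption word by word until END is predicted or max_length is reached."""
    words = [START]
    for _ in range(max_length):
        word = next_word(words)  # in the real model: CNN image features + LSTM over `words`
        if word == END:
            break
        words.append(word)
    return " ".join(words[1:])  # drop the START marker


def caption_to_audio(caption: str, out_path: str = "caption.mp3") -> str:
    """Save the caption as spoken audio. Assumes gTTS is installed and the network is reachable."""
    from gtts import gTTS  # third-party: pip install gTTS

    gTTS(text=caption, lang="en").save(out_path)
    return out_path


def fake_next_word(words: List[str]) -> str:
    """Stand-in predictor that replays a canned caption (assumption, not a trained model)."""
    canned = ["a", "dog", "runs", END]
    return canned[len(words) - 1]


caption = greedy_caption(fake_next_word)
print(caption)
```

With a trained model in place of `fake_next_word`, calling `caption_to_audio(caption)` would write the spoken caption to `caption.mp3`. The import of `gTTS` is kept inside the function so that the captioning step can be run without the audio dependency.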
image :- Flicker8k_Dataset
text :- Flickr_8k_text
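For orientation, here is a minimal sketch of how the caption text could be grouped by image before training. It assumes the token file inside `Flickr_8k_text` stores one caption per line in the form `<image name>#<index><TAB><caption>`; the function name `parse_captions` and the sample lines are hypothetical and used only for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def parse_captions(text: str) -> Dict[str, List[str]]:
    """Group captions by image name, assuming 'name.jpg#k<TAB>caption' lines."""
    captions = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        image_ref, _, caption = line.partition("\t")
        if not caption:
            continue  # skip lines that do not match the assumed format
        image_name = image_ref.split("#")[0]  # drop the '#k' caption index
        captions[image_name].append(caption.strip())
    return dict(captions)


# Made-up sample lines in the assumed format (not taken from the dataset).
sample = (
    "img1.jpg#0\tA dog runs .\n"
    "img1.jpg#1\tA brown dog is running .\n"
    "img2.jpg#0\tTwo kids play .\n"
)
captions = parse_captions(sample)
print(captions)
```

Each image name then maps to its list of captions, which can be paired with the matching file in `Flicker8k_Dataset`.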
reference :- https://data-flair.training/blogs/python-based-project-image-caption-generator-cnn/