# Image Extractors

This notebook tests the ImageExtractor classes created for the news extraction pipeline.

ImageExtractors were built because the data extracted with the Python library *feedparser* only contains the following fields:

- title
- news_url
- publish_date

However, the AI-SharePoint contains a visual that also requires a link to an image representing each news item. You can see the visual [here](https://endava.sharepoint.com/sites/AINewsletter).

To obtain that image, it was decided to create web scrapers that extract the main image from the page behind each news_url.
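The internals of the extractors are not shown in this notebook, so the following is only a minimal sketch of how a main image could be located in a page. It assumes the page exposes an Open Graph `og:image` meta tag, which may not hold for every site. The names `OgImageParser`, `find_main_image`, and `sample_html` are hypothetical and are not part of `news_extraction_pipeline`.

```python
from html.parser import HTMLParser
from typing import Optional


class OgImageParser(HTMLParser):
    """Hypothetical helper: keeps the first <meta property="og:image"> value."""

    def __init__(self) -> None:
        super().__init__()
        self.image_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except meta tags, and stop after the first match.
        if tag != "meta" or self.image_url is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == "og:image":
            self.image_url = attr_map.get("content")


def find_main_image(html: str) -> Optional[str]:
    """Return the og:image URL found in the HTML, or None if there is none."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.image_url


# Static sample page (an assumption for illustration, not real site markup).
sample_html = """
<html><head>
<meta property="og:title" content="Example article">
<meta property="og:image" content="https://example.com/images/cover.jpg">
</head><body><p>Body</p></body></html>
"""

print(find_main_image(sample_html))
```

The sketch parses a static string rather than downloading a page, so it runs without network access. A real scraper would also need to fetch the HTML and handle sites that do not provide this tag.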

In [1]:
import sys

sys.path.append("..")

from news_extraction_pipeline.extractors.image_url.image_url_extractors import AINEWSImageExtractor, MITImageExtractor

Because every website structures its pages differently, a dedicated ImageExtractor is required for each site from which news is extracted.
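One way to picture the per-site design is a base class that checks the URL's domain before doing any site-specific parsing. This is a hedged sketch only: `SketchImageExtractor`, `SketchMITExtractor`, `allowed_domain`, and `_extract_from_page` are hypothetical names, and the real classes in `image_url_extractors` may be organised differently. It mirrors the behaviour seen in the cells of this notebook, where a URL from the wrong site produces an error log and a `None` result.

```python
import logging
from abc import ABC, abstractmethod
from typing import Optional
from urllib.parse import urlparse

logger = logging.getLogger(__name__)


class SketchImageExtractor(ABC):
    """Hypothetical base class: one subclass per news site."""

    allowed_domain: str = ""

    def matches(self, article_url: str) -> bool:
        # Accept the exact domain or any subdomain of it.
        host = urlparse(article_url).hostname or ""
        return host == self.allowed_domain or host.endswith("." + self.allowed_domain)

    def extract(self, article_url: str) -> Optional[str]:
        if not self.matches(article_url):
            logger.error(
                "The URL %s does not correspond to %s",
                article_url,
                type(self).__name__,
            )
            return None
        return self._extract_from_page(article_url)

    @abstractmethod
    def _extract_from_page(self, article_url: str) -> Optional[str]:
        """Site-specific scraping logic lives in each subclass."""


class SketchMITExtractor(SketchImageExtractor):
    allowed_domain = "news.mit.edu"

    def _extract_from_page(self, article_url: str) -> Optional[str]:
        # Placeholder for illustration: a real subclass would download
        # and parse the page here instead of returning a marker string.
        return f"<image for {article_url}>"


extractor = SketchMITExtractor()
print(extractor.extract("https://www.artificialintelligence-news.com/news/example/"))
```

Keeping the domain check in the base class means each subclass only has to describe where the image sits in that site's markup.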

In [2]:
ai_news_url = "https://www.artificialintelligence-news.com/news/google-new-ai-agent-rewrites-code-automate-vulnerability-fixes/"
ai_news_url2 = "https://www.artificialintelligence-news.com/news/how-ai-is-changing-the-way-we-travel/"
# MIT News URL, deliberately passed to the AI News extractor to test the mismatch case
bad_ai_news_url = "https://news.mit.edu/2025/new-prediction-model-could-improve-reliability-fusion-power-plants-1007"

ainews_image_extractor = AINEWSImageExtractor()

image1 = ainews_image_extractor.extract(ai_news_url)
image2 = ainews_image_extractor.extract(ai_news_url2)
image3 = ainews_image_extractor.extract(bad_ai_news_url)

print(f"{image1=}")
print(f"{image2=}")
print(f"{image3=}")

2025-10-09 18:50:19.763 | ERROR    | news_extraction_pipeline.extractors.image_url.image_url_extractors:_fetch_html_code:72 - The URL introduced in article_url: https://news.mit.edu/2025/new-prediction-model-could-improve-reliability-fusion-power-plants-1007 does not correspond to the ImageExtractor defined


image1='https://www.artificialintelligence-news.com/wp-content/uploads/2025/10/google-ai-agent-fix-code-cybersecurity-infosec-security-coding-programming-software-development-artificial-intelligence-1024x768.jpg'
image2='https://www.artificialintelligence-news.com/wp-content/uploads/2025/10/From-inspiration-to-itinerary_How-AI-is-rewriting-travel-1024x683.jpg'
image3=None


In [3]:
mit_news_url = "https://news.mit.edu/2025/ai-maps-how-new-antibiotic-targets-gut-bacteria-1003"
mit_news_url2 = "https://news.mit.edu/2025/new-prediction-model-could-improve-reliability-fusion-power-plants-1007"
# AI News URL, deliberately passed to the MIT extractor to test the mismatch case
bad_mit_news_url = "https://www.artificialintelligence-news.com/news/how-ai-is-changing-the-way-we-travel/"

mitnews_image_extractor = MITImageExtractor()

image1 = mitnews_image_extractor.extract(mit_news_url)
image2 = mitnews_image_extractor.extract(mit_news_url2)
image3 = mitnews_image_extractor.extract(bad_mit_news_url)

print(f"{image1=}")
print(f"{image2=}")
print(f"{image3=}")

2025-10-09 18:50:27.120 | ERROR    | news_extraction_pipeline.extractors.image_url.image_url_extractors:_fetch_html_code:72 - The URL introduced in article_url: https://www.artificialintelligence-news.com/news/how-ai-is-changing-the-way-we-travel/ does not correspond to the ImageExtractor defined


image1='https://news.mit.edu/sites/default/files/styles/news_article__image_gallery/public/images/202509/Entero3.png?itok=VztGSs5y'
image2='https://news.mit.edu/sites/default/files/styles/news_article__image_gallery/public/images/202510/MIT-Tokamak-Tuning-01_0.jpg?itok=Ym0UVkae'
image3=None
