## Image Extractor Selector

This notebook tests the ImageExtractorSelector class.

In [None]:
import sys

sys.path.append("..")

from news_extraction_pipeline.extractor_selectors.extractor_selector import ImageExtractorSelector
from news_extraction_pipeline.config import AINewsConfig

In [4]:
news_config = AINewsConfig()

## Instantiate an ImageExtractorSelector

In [5]:
image_extractor_selector = ImageExtractorSelector()

In [6]:
# URLs of the feeds that news will be extracted from
news_source1 = news_config.AI_NEWS_FEED_URL
news_source2 = news_config.MIT_NEWS_FEED_URL

In [7]:
# Example articles, one from each registered source
news_article1 = "https://www.artificialintelligence-news.com/news/google-new-ai-agent-rewrites-code-automate-vulnerability-fixes/"
news_article2 = "https://news.mit.edu/2025/lincoln-lab-unveils-most-powerful-ai-supercomputer-at-any-us-university-1002"

# News article not related to the registered sources
news_article3 = "https://www.xataka.com/robotica-e-ia/chatgpt-empezo-siendo-simple-asistente-ia-openai-quiere-convertir-tu-futuro-sistema-operativo"

#### ImageExtractor classes work with specific articles as well as with the feed URL

Whether you pass the feed URL or a specific article URL, the selector resolves it to the same base URL and returns the matching image extractor in both cases.
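The log output in this notebook reports the selection "for base URL", which suggests the lookup is keyed on the scheme and host of the URL. The snippet below is a minimal sketch of that idea under this assumption, not the actual implementation of ImageExtractorSelector. The names `base_url`, `REGISTRY`, and `select_extractor_name` are hypothetical and exist only for illustration.

```python
from typing import Optional
from urllib.parse import urlsplit


def base_url(url: str) -> str:
    """Reduce a feed or article URL to its scheme and host."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


# Hypothetical registry mapping a base URL to an extractor name.
# The real selector presumably stores extractor classes or instances.
REGISTRY = {
    "https://www.artificialintelligence-news.com": "AINEWSImageExtractor",
    "https://news.mit.edu": "MITImageExtractor",
}


def select_extractor_name(url: str) -> Optional[str]:
    """Return the registered extractor name, or None for unknown sources."""
    return REGISTRY.get(base_url(url))
```

With a lookup like this, any article path under a registered host resolves to the same entry, and an unregistered host yields None.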

In [8]:
image_extractor_ainews = image_extractor_selector.get_extractor(news_source1)
image_extractor_ainews2 = image_extractor_selector.get_extractor(news_article1)

2025-10-09 18:50:42.664 | INFO     | news_extraction_pipeline.selectors.extractor_selector:get_extractor:73 - Extractor 'AINEWSImageExtractor' selected for base URL: https://www.artificialintelligence-news.com
2025-10-09 18:50:42.664 | INFO     | news_extraction_pipeline.selectors.extractor_selector:get_extractor:73 - Extractor 'AINEWSImageExtractor' selected for base URL: https://www.artificialintelligence-news.com


In [9]:
image_extractor_mitnews = image_extractor_selector.get_extractor(news_source2)
image_extractor_mitnews2 = image_extractor_selector.get_extractor(news_article2)

2025-10-09 18:50:42.678 | INFO     | news_extraction_pipeline.selectors.extractor_selector:get_extractor:73 - Extractor 'MITImageExtractor' selected for base URL: https://news.mit.edu
2025-10-09 18:50:42.678 | INFO     | news_extraction_pipeline.selectors.extractor_selector:get_extractor:73 - Extractor 'MITImageExtractor' selected for base URL: https://news.mit.edu


If the base URL of the given link is not registered, get_extractor returns None

In [10]:
image_extractor_xataka = image_extractor_selector.get_extractor(news_article3)
type(image_extractor_xataka)



NoneType

#### Once the ImageExtractor has been selected, you can retrieve the main image link of any article related to the source

In [11]:
ainews_img_link = image_extractor_ainews.extract(news_article1)
print(f"Link to the news article: {news_article1}\nLink to the main image of the news article: {ainews_img_link}")

Link to the news article: https://www.artificialintelligence-news.com/news/google-new-ai-agent-rewrites-code-automate-vulnerability-fixes/
Link to the main image of the news article: https://www.artificialintelligence-news.com/wp-content/uploads/2025/10/google-ai-agent-fix-code-cybersecurity-infosec-security-coding-programming-software-development-artificial-intelligence-1024x768.jpg


In [12]:
mitnews_img_link = image_extractor_mitnews.extract(news_article2)
print(f"Link to the news article: {news_article2}\nLink to the main image of the news article: {mitnews_img_link}")

Link to the news article: https://news.mit.edu/2025/lincoln-lab-unveils-most-powerful-ai-supercomputer-at-any-us-university-1002
Link to the main image of the news article: https://news.mit.edu/sites/default/files/styles/news_article__image_gallery/public/images/202509/LLSC_Rosas_525558-037D.jpg?itok=u4SpaZTe


The ImageExtractorSelector will be used as one component of the NewsExtractor class, which will retrieve additional data about each news item, such as title, publish_date, and news_link