# Notebook Scope

This notebook compiles a list of celebrity names and collects a face image for each one by scraping their Wikipedia pages.

The process is structured as follows:

### Step 1: Compilation of Celebrity Names
- Names were gathered from several websites and CSV files that list celebrities from around the world.
- The names from all sources were merged into a single deduplicated set.

### Step 2: Image Scraping
- A scraper requests the Italian Wikipedia page (`it.wikipedia.org`) for each celebrity's name.
- It locates the first thumbnail image on the page and downloads it.
- The images are stored in an `images` folder within the project.
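
The name-to-URL mapping this step relies on can be sketched as follows. The notebook replaces spaces with underscores and requests pages from `it.wikipedia.org`; the helper name `wikipedia_url` and the percent-encoding via `urllib.parse.quote` are assumptions added for illustration, not part of the notebook.

```python
from urllib.parse import quote

def wikipedia_url(name, lang='it'):
    """Build a Wikipedia page URL from a display name (hypothetical helper)."""
    title = name.strip().replace(' ', '_')
    # Percent-encode non-ASCII characters; keep underscores, parentheses, dots and commas readable.
    return f'https://{lang}.wikipedia.org/wiki/{quote(title, safe="_().,")}'

print(wikipedia_url('Tiger Woods'))
```

Encoding the title is optional when the URL is passed to `requests`, which encodes it on its own, but it makes the final URL explicit when logging failures.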

### Outcome
- The scraping run collected over 1,000 celebrity images.
- These images feed the main project, which identifies the celebrity who most closely resembles an input image.


In [1]:
import requests
from bs4 import BeautifulSoup
from urllib.request import urlretrieve
import os
from urllib.parse import urljoin
import pandas as pd
from tqdm import tqdm as ProgressBar
import random
import time

In [1]:
all_names = set()

## Top 100 Greatest Hollywood Actors of All Time (Dataset)

In [45]:
actors = pd.read_csv('Top 100 Greatest Hollywood Actors of All Time.csv')['Name'].to_list()

all_names.update(actors)

## 100 MOST POPULAR CELEBRITIES IN THE WORLD

In [34]:
html_text = requests.get('https://www.imdb.com/list/ls052283250/').text
soup = BeautifulSoup(html_text, 'html.parser')

res = soup.find_all('h3', class_='lister-item-header')

celebrities = [r.find('a').text.strip().replace('\n', '') for r in res]
all_names.update(celebrities)

## Top 1000 Actors and Actresses

In [35]:
for x in ProgressBar(range(1, 11)):
    url = f'https://www.imdb.com/list/ls058011111/?sort=list_order,asc&mode=detail&page={x}'
    html_text = requests.get(url).text
    soup = BeautifulSoup(html_text, 'html.parser')

    res = soup.find_all('h3', class_='lister-item-header')

    celebrities = [r.find('a').text.strip().replace('\n', '') for r in res]
    all_names.update(celebrities)

100%|██████████| 10/10 [00:40<00:00,  4.00s/it]


## 100 famous people

In [84]:
url = 'https://www.biographyonline.net/people/famous-100.html#google_vignette'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

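# The list of people lives in the article body; each <li> is one entry.
section = soup.find('section', {'class': 'post-content clearfix'})
items = section.find_all('li')
# Use the link text when the entry is linked, otherwise the full item text.
names = [item.find('a').text if item.find('a') is not None else item.text for item in items]
# Drop the parenthesised details after each name and keep only the first 130 entries.
names = [name.split('(')[0].strip() for name in names][:130]
all_names.update(names)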
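
The cleaning step above keeps whatever precedes the first opening parenthesis. A minimal sketch on made-up list entries shows the effect; the sample strings and the helper name `clean_name` are illustrative assumptions, not scraped data.

```python
def clean_name(entry):
    """Return the text before the first '(' with surrounding whitespace removed."""
    return entry.split('(')[0].strip()

# Illustrative inputs only; the real page entries may be formatted differently.
samples = [
    'Example Person (1900 - 1980) Example description',
    '  Another Person ',
]
print([clean_name(s) for s in samples])
# prints ['Example Person', 'Another Person']
```

Entries without a parenthesis pass through unchanged apart from whitespace trimming, so non-name list items are not filtered out by this step.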

## Additional Names from nomi_famosi.csv (Dataset)

In [87]:
nomi = pd.read_csv('nomi_famosi.csv')['Nomi'].to_list()
all_names.update(nomi)

## Image Scraping from Wikipedia

In [88]:
def download_wikipedia_images(url, save_path, person_name):
    """Save the first thumbnail of a Wikipedia page as save_path/person_name.jpg."""
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        os.makedirs(save_path, exist_ok=True)
        try:
            # Take the <img> inside the first <figure typeof="mw:File/Thumb"> on the page.
            img_tag = soup.find_all('figure', attrs={'typeof': 'mw:File/Thumb'})[0].find_all('img')[0]
            img_url = img_tag.get('src')
        except IndexError:
            # The page has no thumbnail figure, or the figure contains no <img>.
            print(f"Failed to retrieve the image. Page: {url}")
            return
        if img_url:
            # Resolve a relative or protocol-relative src against the page URL.
            img_url = urljoin(url, img_url)
            img_path = os.path.join(save_path, person_name + '.jpg')
            urlretrieve(img_url, img_path)
    else:
        print(f"Failed to retrieve the page. Status code: {response.status_code}. Page: {url}")
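
`download_wikipedia_images` saves every file with a `.jpg` suffix, although a thumbnail URL may point to a PNG or another format. One possible refinement, sketched under the assumption that the URL path ends in the real file extension, derives the suffix from the image URL. The helper `image_extension` is hypothetical and not part of the notebook, and the URLs are made up.

```python
import os
from urllib.parse import urlparse

def image_extension(img_url, default='.jpg'):
    """Return the lowercase file extension found in an image URL path (hypothetical helper)."""
    path = urlparse(img_url).path
    ext = os.path.splitext(path)[1].lower()
    # Fall back to the default when the path carries no extension.
    return ext if ext else default

# Made-up URLs for illustration.
print(image_extension('//upload.example.org/thumb/a/ab/Photo.PNG/220px-Photo.PNG'))
print(image_extension('https://upload.example.org/thumb/noext'))
```

The result could replace the hard-coded `'.jpg'` when building `img_path`, at the cost of having to match files by stem rather than by full name later on.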

In [3]:
images_scraped = os.listdir('images')

In [90]:
links = [name.replace(' ', '_') for name in all_names]
print(f'Number of links: {len(links)}')

Number of links: 1397


In [91]:
removed = ['Patrick_Wilson', 'Graham_Greene', 'Ben_Johnson', 'Mohanlal', 'Jane_Alexander', 'Kevin_Hart', 'January_Jones', 'Jane_Seymour', 'Maria_Falconetti', 'Robert_Shaw', 'Richard_Farnsworth', 'John_Mills', 'Tyrese_Gibson', 'Michael_Lerner', 'Craig_Robinson', 'Dan_Aykroyd', 'Bette_Midler', 'John_Cazale', 'Tim_Curry', 'Danny_McBride', 'Julianne_Hough', 'Steve_Martin', 'Andy_Samberg']

In [92]:
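for name in ProgressBar(links):
    # Skip names that already have an image on disk (exact filename-stem match) or are excluded.
    if any(os.path.splitext(s)[0] == name for s in images_scraped) or name in removed:
        continue
    url = f'https://it.wikipedia.org/wiki/{name}'
    save_path = 'images'
    download_wikipedia_images(url, save_path, name)
    # Pause for 1 or 2 seconds between requests to limit the load on the server.
    time.sleep(random.randint(1, 2))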

  1%|          | 16/1397 [00:16<17:46,  1.29it/s]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Gabriele_Cirilli


  8%|▊         | 105/1397 [01:22<08:06,  2.65it/s]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Leonardo_Decarli


  9%|▉         | 124/1397 [01:30<11:12,  1.89it/s]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Graham_Hill


 12%|█▏        | 173/1397 [02:08<14:16,  1.43it/s]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Elisa_Maino


 17%|█▋        | 243/1397 [02:48<13:22,  1.44it/s]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Marzia_Bisognin


 21%|██        | 293/1397 [03:17<14:10,  1.30it/s]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Michelle_Williams


 22%|██▏       | 312/1397 [03:59<42:23,  2.34s/it]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Andrea_Gabrieli


 24%|██▍       | 338/1397 [04:54<41:55,  2.38s/it]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Prince_Charles


 25%|██▌       | 355/1397 [05:28<25:49,  1.49s/it]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Shirley_Jones


 26%|██▋       | 367/1397 [05:54<39:50,  2.32s/it]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Queen_Elizabeth


 32%|███▏      | 445/1397 [06:21<06:01,  2.64it/s]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Ludovica_Pagani


 33%|███▎      | 463/1397 [06:31<07:56,  1.96it/s]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Riccardo_Pozzoli


 35%|███▍      | 482/1397 [06:43<09:22,  1.63it/s]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Wright_Brothers


 35%|███▌      | 489/1397 [06:49<10:56,  1.38it/s]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Kerri_Strug


 35%|███▌      | 490/1397 [06:51<12:05,  1.25it/s]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Christopher_Columbus


 38%|███▊      | 524/1397 [07:01<05:17,  2.75it/s]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/J.K.Rowling


 40%|███▉      | 557/1397 [07:21<05:11,  2.70it/s]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Karl_Malone


 41%|████      | 568/1397 [07:28<07:46,  1.78it/s]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Amadeus


 42%|████▏     | 581/1397 [07:42<15:38,  1.15s/it]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Matteo_Markus_Bok


 42%|████▏     | 583/1397 [07:47<21:36,  1.59s/it]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Mark_Spitz


 44%|████▍     | 615/1397 [08:06<10:31,  1.24it/s]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Eusebio


 46%|████▌     | 646/1397 [08:22<10:28,  1.19it/s]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Christine_Taylor


 47%|████▋     | 652/1397 [08:34<20:38,  1.66s/it]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Alexis_Thorpe


 49%|████▊     | 681/1397 [09:38<29:53,  2.51s/it]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Carlo_Conti


 51%|█████     | 707/1397 [10:36<24:42,  2.15s/it]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Glenda_Jackson


 53%|█████▎    | 742/1397 [11:23<06:26,  1.70it/s]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/John_M_Keynes


 58%|█████▊    | 812/1397 [12:04<05:39,  1.72it/s]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Plato


 58%|█████▊    | 815/1397 [12:07<06:15,  1.55it/s]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Pope_Francis


 64%|██████▍   | 896/1397 [12:47<03:23,  2.47it/s]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Pauleta


 65%|██████▍   | 906/1397 [12:54<05:14,  1.56it/s]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Raphael


 66%|██████▌   | 916/1397 [13:00<04:58,  1.61it/s]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Emile_Zatopek


 68%|██████▊   | 947/1397 [13:19<05:17,  1.42it/s]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Luis_Sal


 72%|███████▏  | 1001/1397 [14:16<13:38,  2.07s/it]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Marcello_Ascani


 72%|███████▏  | 1007/1397 [14:28<13:07,  2.02s/it]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Common


 72%|███████▏  | 1009/1397 [14:32<13:05,  2.03s/it]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Tiger_Woods


 73%|███████▎  | 1016/1397 [14:49<15:10,  2.39s/it]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Marie_Antoinette


 73%|███████▎  | 1019/1397 [14:55<14:32,  2.31s/it]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Ludwig_Beethoven


 75%|███████▌  | 1050/1397 [16:01<11:42,  2.02s/it]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Lord_Baden_Powell


 76%|███████▌  | 1064/1397 [16:28<11:20,  2.04s/it]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Linda_Fiorentino


 77%|███████▋  | 1075/1397 [16:40<05:24,  1.01s/it]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Ligabue


 77%|███████▋  | 1076/1397 [16:41<05:37,  1.05s/it]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Mary_Magdalene


 79%|███████▉  | 1102/1397 [16:52<02:25,  2.02it/s]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Bob_Geldof


 80%|███████▉  | 1111/1397 [16:58<03:02,  1.57it/s]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Dario_Moccia


 83%|████████▎ | 1156/1397 [17:24<03:41,  1.09it/s]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Luca_Chikovani


 83%|████████▎ | 1158/1397 [17:25<03:23,  1.18it/s]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Madonna


 85%|████████▌ | 1189/1397 [17:45<02:15,  1.53it/s]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Raul_Gonzalez


 89%|████████▉ | 1240/1397 [18:19<02:57,  1.13s/it]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Ronaldo_Nazário


 92%|█████████▏| 1291/1397 [18:45<00:45,  2.35it/s]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Marco_Ferrero


 92%|█████████▏| 1292/1397 [18:46<00:53,  1.97it/s]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Aung_San_Suu_Kyi


 96%|█████████▌| 1339/1397 [19:31<02:15,  2.33s/it]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Birgit_Fischer


 98%|█████████▊| 1363/1397 [20:21<01:06,  1.96s/it]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Anthony_Davis


 98%|█████████▊| 1365/1397 [20:24<00:51,  1.62s/it]

Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Mother_Teresa


 99%|█████████▉| 1385/1397 [21:02<00:23,  1.94s/it]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Lodovica_Comello


 99%|█████████▉| 1388/1397 [21:07<00:15,  1.73s/it]

Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Alessia_Marcuzzi


100%|██████████| 1397/1397 [21:27<00:00,  1.09it/s]
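
The run above prints two kinds of failure: pages that returned a non-200 status and pages without a matching thumbnail figure. To decide which names to retry or rename, the printed messages can be grouped by kind. A minimal sketch, assuming the messages were captured as text lines; the function name `group_failures` and the regular expression are illustrative, not part of the notebook.

```python
import re
from collections import defaultdict

# Matches both message formats printed by download_wikipedia_images.
FAILURE_RE = re.compile(
    r'Failed to retrieve the (?P<kind>page|image)\.'
    r'(?: Status code: (?P<status>\d+)\.)?'
    r' Page: https://it\.wikipedia\.org/wiki/(?P<title>\S+)'
)

def group_failures(lines):
    """Map failure kind ('page' or 'image') to the list of affected page titles."""
    groups = defaultdict(list)
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            groups[match.group('kind')].append(match.group('title'))
    return dict(groups)

# Three lines copied from the output above; progress-bar lines are ignored.
log = [
    'Failed to retrieve the image. Page: https://it.wikipedia.org/wiki/Graham_Hill',
    'Failed to retrieve the page. Status code: 404. Page: https://it.wikipedia.org/wiki/Pope_Francis',
    '  8%|▊         | 105/1397 [01:22<08:06,  2.65it/s]',
]
print(group_failures(log))
# prints {'image': ['Graham_Hill'], 'page': ['Pope_Francis']}
```

Several of the 404 titles in the log are English-language names (for example `Pope_Francis` and `Mother_Teresa`), which likely exist under a different title on the Italian Wikipedia, so the `page` group is a natural starting point for manual renaming.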
