The current data comes from **booksmovies_list.txt**, part of this student's capstone project: https://github.com/SKGemini/capstone_project

It contains 3,000+ adaptations, scraped from Wikipedia and IMDb. Each line is the title of a book that has a movie adaptation, and is also the title of the corresponding Wikipedia page.

However, this data is 6 years old, so we should probably run our own scrape to get the latest data.

The original scraper (which no longer works, but can likely be fixed): https://github.com/SKGemini/capstone_project/blob/master/clean_data/wikiscraper.py

In [22]:
DATA_DIR = "../../data"  # no trailing slash: the f-strings below add their own "/"

with open(f"{DATA_DIR}/booksmovies_list.txt", encoding="utf8") as f:
    allbooksmovies = f.read().splitlines()

len(allbooksmovies)


3345

In [29]:
import asyncio
import wikipedia
import pandas as pd

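async def check_wiki_page(book, i, total):
    """Look up one book on Wikipedia and return a one-row DataFrame describing the result."""
    # Note: the wikipedia calls below are blocking, so `async` adds no concurrency here.
    data = {
        'book': [book],
        'url': None,
        'status': None,
        'error': None,
        'first_heading': None,
        'heading_title': None,
        'title_match': None,
        'has_novel': None,
        'has_film': None
    }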
    
    try:
        # Search Wikipedia for book + "novel"
        search_results = wikipedia.search(f"{book} novel")
        if not search_results:
            data['error'] = ['No results found']
            return pd.DataFrame(data)
            
        # Get the first result page
        page = wikipedia.page(search_results[0], auto_suggest=False)
        
        data['url'] = [page.url]
        data['status'] = [200]  # hard-coded: means "a page was returned", not a measured HTTP status
        data['first_heading'] = [page.title]
        data['heading_title'] = [page.title]
        data['title_match'] = [page.title.strip() == book.strip()]
        
        content = page.content.lower()
        data['has_novel'] = ['novel' in content]
        data['has_film'] = ['film' in content]
        
    except wikipedia.exceptions.DisambiguationError:
        data['error'] = ['Disambiguation page']
    except wikipedia.exceptions.PageError:
        data['error'] = ['Page not found']
    except Exception as e:
        data['error'] = [str(e)]
        
    print(f"Processed {i+1}/{total}: {book}")
    return pd.DataFrame(data)

async def run_checks():
    books_to_process = allbooksmovies[:10]
    total = len(books_to_process)
    results = []
    
    for i, book in enumerate(books_to_process):
        result = await check_wiki_page(book, i, total)
        results.append(result)
        # Save a checkpoint every 100 books (not reached in this 10-book trial run)
        if (i + 1) % 100 == 0:
            pd.concat(results, ignore_index=True).to_csv(f"{DATA_DIR}/wiki_results_{i+1}.csv")
    
    final_results = pd.concat(results, ignore_index=True)
    final_results.to_csv(f"{DATA_DIR}/wiki_results_final.csv")
    return final_results

results = await run_checks()


Processed 1/10: Sounder
Processed 2/10: Sunset Song
Processed 3/10: Despair
Processed 4/10: The Big Sky
Processed 5/10: My Louisiana Sky
Processed 6/10: The Dragon Murder Case
Processed 7/10: Portrait of Jennie
Processed 8/10: The Conquest of Space
Processed 9/10: The Treatment
Processed 10/10: Someone Was Watching
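
Note that `check_wiki_page` is declared `async`, but the `wikipedia` calls inside it are blocking, so the loop above still handles one title at a time. Below is a minimal sketch of one way to overlap the requests, assuming Python 3.9+ for `asyncio.to_thread`. The names `run_blocking_concurrently`, `worker`, and `max_concurrent` are hypothetical and not part of the original notebook.

```python
import asyncio

async def run_blocking_concurrently(items, worker, max_concurrent=5):
    """Run a blocking `worker(item)` for every item, at most `max_concurrent` at a time.

    Results come back in the same order as `items`.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(item):
        async with semaphore:
            # to_thread runs the blocking call in a worker thread,
            # so the event loop stays free to start other lookups
            return await asyncio.to_thread(worker, item)

    return await asyncio.gather(*(run_one(item) for item in items))
```

In the notebook this could be called as `await run_blocking_concurrently(allbooksmovies[:10], sync_check)`, where `sync_check` stands for a plain (non-async) version of the lookup. Keeping `max_concurrent` small is a courtesy to Wikipedia's servers.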


In [30]:
results

Unnamed: 0,book,url,status,error,first_heading,heading_title,title_match,has_novel,has_film
0,Sounder,https://en.wikipedia.org/wiki/Sounder_(novel),200,,Sounder (novel),Sounder (novel),False,True,True
1,Sunset Song,https://en.wikipedia.org/wiki/Sunset_Song,200,,Sunset Song,Sunset Song,True,True,True
2,Despair,https://en.wikipedia.org/wiki/Despair_(novel),200,,Despair (novel),Despair (novel),False,True,True
3,The Big Sky,https://en.wikipedia.org/wiki/The_Big_Sky_(novel),200,,The Big Sky (novel),The Big Sky (novel),False,True,False
4,My Louisiana Sky,https://en.wikipedia.org/wiki/My_Louisiana_Sky,200,,My Louisiana Sky,My Louisiana Sky,True,True,True
5,The Dragon Murder Case,https://en.wikipedia.org/wiki/The_Dragon_Murde...,200,,The Dragon Murder Case (film),The Dragon Murder Case (film),False,True,True
6,Portrait of Jennie,https://en.wikipedia.org/wiki/Portrait_of_Jennie,200,,Portrait of Jennie,Portrait of Jennie,True,True,True
7,The Conquest of Space,https://en.wikipedia.org/wiki/Conquest_of_Space,200,,Conquest of Space,Conquest of Space,False,False,True
8,The Treatment,https://en.wikipedia.org/wiki/The_Treatment_(n...,200,,The Treatment (novel),The Treatment (novel),False,True,True
9,Someone Was Watching,https://en.wikipedia.org/wiki/Someone_Was_Watc...,200,,Someone Was Watching,Someone Was Watching,True,True,True
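
Several `title_match` values above are `False` only because the page title carries a disambiguation suffix such as `(novel)`, or drops a leading article. A sketch of a looser comparison follows. `normalize_title` and `titles_match` are hypothetical helpers, and the rules they apply (strip one trailing parenthetical, drop a leading "The", "A" or "An", ignore case) are assumptions about what should count as a match.

```python
import re

# One trailing parenthetical, e.g. " (novel)" or " (film)"
_TRAILING_PAREN = re.compile(r"\s*\([^()]*\)\s*$")
# A leading English article followed by whitespace
_LEADING_ARTICLE = re.compile(r"^(the|a|an)\s+", re.IGNORECASE)

def normalize_title(title):
    """Reduce a title to a comparable form: no suffix, no leading article, case-folded."""
    title = _TRAILING_PAREN.sub("", title.strip())
    title = _LEADING_ARTICLE.sub("", title)
    return title.casefold()

def titles_match(book, page_title):
    """Return True when both titles normalize to the same string."""
    return normalize_title(book) == normalize_title(page_title)
```

This comparison deliberately ignores the suffix, so it would also accept `The Dragon Murder Case (film)`. If the novel/film distinction matters, the suffix should be captured in its own column rather than discarded.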


In [None]:
# In general I think we can trust this list, but it would be better to build our own so we can link each title to its Wikipedia page ID and other metadata.
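
One possible route to page IDs is the MediaWiki Action API: an `action=query&titles=...&format=json` request returns a `query.pages` object keyed by page ID. The sketch below only parses such a response and makes no network call. `extract_page_ids` is a hypothetical helper, and the sample response is made up for illustration (the IDs are placeholders, not real Wikipedia page IDs).

```python
def extract_page_ids(response_json):
    """Map each returned title to its page ID, or None if the page is missing."""
    pages = response_json.get("query", {}).get("pages", {})
    page_ids = {}
    for page in pages.values():
        # Missing pages carry a "missing" key and have no "pageid"
        page_ids[page["title"]] = None if "missing" in page else page.get("pageid")
    return page_ids

# Made-up sample shaped like a format=json response (placeholder IDs)
sample = {
    "query": {
        "pages": {
            "111": {"pageid": 111, "ns": 0, "title": "Sunset Song"},
            "-1": {"ns": 0, "title": "No Such Book Title", "missing": ""},
        }
    }
}
print(extract_page_ids(sample))
```

Storing the page ID next to each title would give a key that stays stable even if a page is later renamed, which a title string alone does not.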