# Dataset from IMDb, downloaded from https://developer.imdb.com/non-commercial-datasets/
## Please download title.basics.tsv.gz and title.ratings.tsv.gz, unzip them, and put them in the data/ folder

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import importlib
import os
import kagglehub
import ast

In [2]:
import sys
sys.path.append('scripts/')
import scraping, merge_goodreads, merge_cmu, merge_imdb
from scraping import *
from merge_goodreads import *
from merge_cmu import *
from merge_imdb import *

# Scraping data from Wikipedia
We first define the URLs to scrape. These pages let us build a mapping between books and their film adaptations.

In [3]:
# URL of the Wikipedia page
url_0_C = "https://en.wikipedia.org/wiki/List_of_fiction_works_made_into_feature_films_(0%E2%80%939,_A%E2%80%93C)"
url_D_J = "https://en.wikipedia.org/wiki/List_of_fiction_works_made_into_feature_films_(D%E2%80%93J)"
url_K_R = "https://en.wikipedia.org/wiki/List_of_fiction_works_made_into_feature_films_(K%E2%80%93R)"
url_S_Z = "https://en.wikipedia.org/wiki/List_of_fiction_works_made_into_feature_films_(S%E2%80%93Z)"
url_short = "https://en.wikipedia.org/wiki/List_of_short_fiction_made_into_feature_films"
url_kids = "https://en.wikipedia.org/wiki/List_of_children%27s_books_made_into_feature_films"

urls = [url_0_C, url_D_J, url_K_R, url_S_Z, url_short, url_kids]

Then we scrape and process the data from these Wikipedia pages.
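The post-processing helpers live in `scripts/` and are not shown in this notebook. As a rough sketch of the kind of parsing involved, assuming each scraped cell looks like `Title (year), Author` (an assumption about the Wikipedia list layout), a hypothetical `parse_fiction_work` could split a cell into title, author and year:

```python
import re

# Hypothetical helper, not the project's actual implementation.
# Assumes a cell formatted as: Title (year), Author
CELL_PATTERN = re.compile(
    r'^(?P<title>.*?)\s*\((?P<year>[^)]*)\)\s*,\s*(?P<author>.+)$'
)

def parse_fiction_work(cell):
    """Split one scraped cell into title, author and year (all strings)."""
    text = cell.strip()
    match = CELL_PATTERN.match(text)
    if match is None:
        # Cell does not follow the assumed layout: keep the raw text as the title
        return {"title_book": text.replace('"', ''), "author_book": None, "year_book": None}
    return {
        "title_book": match.group("title").replace('"', '').strip(),
        "author_book": match.group("author").strip(),
        "year_book": match.group("year").strip(),
    }

example = parse_fiction_work('The 25th Hour (2001), David Benioff')
print(example)
```

The year is kept as a string on purpose, since some entries span a range of years rather than a single one.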

In [4]:
# Run the scraper on every selected URL
dataframes = []
for url in urls: 
    df = scrap_book_to_movie(url)
    clean_df = scrap_post_processing(df)
    dataframes.append(clean_df)

book_adaptations = pd.concat(dataframes).reset_index(drop=True)
book_adaptations = book_adaptations.drop_duplicates().reset_index(drop=True)
book_adaptations.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['title_book'] = df['fiction_work'].str.split('(').str[0]
  df['title_book'] = df['title_book'].apply(lambda t: t.replace('"', ''))
  df['author_book'] = df_split_comma.apply(extract_authors)

Unnamed: 0,title_book,author_book,year_book,title_film,year_film
0,The 25th Hour,David Benioff,2001,25th Hour,2002
1,3 Assassins,Kōtarō Isaka,2004,Grasshopper,2015
2,4.50 from Paddington,Agatha Christie,1957,"Murder, She Said",1961
3,4.50 from Paddington,Agatha Christie,1957,Crime Is Our Business,2008
4,58 Minutes,Walter Wager,1987,Die Hard 2,1990


We now have a DataFrame of 4941 film adaptations, each paired with the book it adapts.
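The `SettingWithCopyWarning` messages printed during scraping are raised by pandas when a column is assigned on a DataFrame that was itself obtained by slicing another one. A common remedy is to take an explicit copy of the slice before adding columns. The sketch below uses a made-up frame; the column names mirror those in the warning, but the filtering step is an assumption about where the slice comes from.

```python
import pandas as pd

# Made-up input resembling the scraped column
raw = pd.DataFrame({
    "fiction_work": [
        '"The 25th Hour" (2001), David Benioff',
        None,
        "58 Minutes (1987), Walter Wager",
    ],
})

# Filtering returns a slice of `raw`; .copy() makes it an independent frame,
# so the assignment below does not trigger SettingWithCopyWarning.
df = raw[raw["fiction_work"].notna()].copy()
df["title_book"] = (
    df["fiction_work"]
    .str.split("(").str[0]             # keep the text before the first "("
    .str.replace('"', "", regex=False)  # drop quotation marks
    .str.strip()
)
print(df["title_book"].tolist())
```

The original frame `raw` is left unchanged, which is usually what the warning is trying to help guarantee.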

# Merge with Goodreads
We now merge the book-to-movie mapping with the Goodreads dataset to get additional information on the books.


First, we download the dataset from Kaggle.

In [5]:
path = kagglehub.dataset_download("bahramjannesarr/goodreads-book-datasets-10m")
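`kagglehub.dataset_download` returns the local directory where the dataset files were stored. The helper `books_csv_to_df` used in the next cell is defined in `scripts/` and not shown here; a minimal sketch of what such a loader might do, assuming the folder holds several CSV files with a shared schema, is given below. The name `load_csv_folder` is hypothetical.

```python
import glob
import os
import tempfile

import pandas as pd

def load_csv_folder(folder):
    """Hypothetical stand-in for books_csv_to_df: concatenate every CSV in `folder`."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)

# Demo on a temporary folder containing two tiny files
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Name": ["A"], "Authors": ["X"]}).to_csv(
        os.path.join(tmp, "part1.csv"), index=False)
    pd.DataFrame({"Name": ["B"], "Authors": ["Y"]}).to_csv(
        os.path.join(tmp, "part2.csv"), index=False)
    demo = load_csv_folder(tmp)

print(demo)
```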



In [6]:
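# Work on a copy so the scraped mapping stays untouched
df_movies = book_adaptations.copy()
df_goodreads = books_csv_to_df(path)

# Build the author and title merge keys on the Goodreads side
df_goodreads['merge_authors'] = clean_spaces(df_goodreads['Authors'])
df_goodreads['merge_names'] = clean_spaces(df_goodreads['Name'])
df_goodreads['merge_names'] = remove_parenthesis(df_goodreads['merge_names'])

# Build the same keys on the adaptations side
df_movies['merge_authors'] = clean_spaces(df_movies['author_book'])
df_movies['merge_names'] = clean_spaces(df_movies['title_book'])
df_movies['merge_names'] = remove_parenthesis(df_movies['merge_names'])

# Right merge: keep every adaptation, with Goodreads columns filled where both keys match
# Note: from here on, the name merge_goodreads refers to this DataFrame, not the imported module
merge_goodreads = df_goodreads.merge(right=df_movies, how="right", on=['merge_authors', 'merge_names'], copy=False)
# Keep a single Goodreads match per adaptation row
merge_goodreads = merge_goodreads.drop_duplicates(subset=df_movies.columns).reset_index(drop=True)
# Drop the merge keys and the Goodreads columns already covered by title_book and author_book
merge_goodreads = merge_goodreads.drop(columns=['merge_authors', 'merge_names', 'Authors', 'Name'])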
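The helpers `clean_spaces` and `remove_parenthesis` come from `scripts/` and are not shown. The sketch below uses hypothetical stand-ins (`normalize_spaces`, `strip_parenthesis`) on made-up rows to illustrate why the keys are cleaned before a right merge: every adaptation row is kept, and the Goodreads columns are filled only where both keys match.

```python
import pandas as pd

def normalize_spaces(series):
    """Hypothetical stand-in: collapse runs of whitespace and trim the ends."""
    return series.str.replace(r"\s+", " ", regex=True).str.strip()

def strip_parenthesis(series):
    """Hypothetical stand-in: remove any '(...)' group from each string."""
    return series.str.replace(r"\s*\([^)]*\)", "", regex=True).str.strip()

# Made-up rows for illustration only
goodreads = pd.DataFrame({
    "Name": ["The  Castle (Penguin Classics)", "Unrelated Book"],
    "Authors": ["Franz Kafka", "Someone Else"],
    "Rating": [3.96, 4.50],
})
movies = pd.DataFrame({
    "title_book": ["The Castle", "58 Minutes"],
    "author_book": ["Franz  Kafka", "Walter Wager"],
})

goodreads["merge_names"] = strip_parenthesis(normalize_spaces(goodreads["Name"]))
goodreads["merge_authors"] = normalize_spaces(goodreads["Authors"])
movies["merge_names"] = strip_parenthesis(normalize_spaces(movies["title_book"]))
movies["merge_authors"] = normalize_spaces(movies["author_book"])

# how="right" keeps all rows of `movies`, matched or not
merged = goodreads.merge(movies, how="right", on=["merge_authors", "merge_names"])
print(merged[["title_book", "author_book", "Rating"]])
```

Without the cleaning step, the extra space and the edition suffix would prevent "The Castle" from matching, and its `Rating` would be missing.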

# Merge with CMU
We will now merge this data with the CMU dataset to add extra information on these films.
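`merge_with_CMU` is defined in `scripts/` and not shown. The `clean_name` column in its output (for example `marypoppins` for "Mary Poppins") suggests that titles are reduced to lowercase alphanumerics before matching. A minimal sketch of such a key, under that assumption and with a hypothetical `make_clean_name` helper:

```python
import re

def make_clean_name(title):
    """Hypothetical key: lowercase the title, keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())

# Made-up rows: (title, year)
cmu_films = [("Mary Poppins", 1964), ("The Great Santini", 1979)]
adaptations = [("Mary  Poppins", 1964), ("the great santini", 1979), ("Die Hard 2", 1990)]

# Match on the pair (cleaned title, year) so remakes from other years stay distinct
cmu_keys = {(make_clean_name(name), year) for name, year in cmu_films}
matched = [
    (name, year) for name, year in adaptations
    if (make_clean_name(name), year) in cmu_keys
]
print(matched)
```

Note that this particular regex also drops accented letters, so a real implementation may need a more careful normalization.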

In [7]:
# Merge with the CMU data on title_film and year_film
merge_cmu = merge_with_CMU(merge_goodreads)
merge_cmu.head()

Unnamed: 0,movie_name,movie_date,box_office,runtime,language,countries,genres,clean_name,ID,Rating,...,RatingDistTotal,CountsOfReview,Language,PagesNumber,Description,pagesNumber,Count of text reviews,title_book,author_book,year_book
0,Mary Poppins,1964,102272727.0,139.0,English Language,United States of America,"Children's/Family, Musical, Fantasy, Comedy, D...",marypoppins,,4.03,...,total:110287,3845.0,eng,,,209.0,,Mary Poppins,P. L. Travers,1934–1988
1,Mysterious Island,1982,,100.0,Standard Mandarin,Hong Kong,"Action/Adventure, Wuxia, Martial Arts Film, Ch...",mysteriousisland,,4.11,...,total:43120,4.0,eng,728.0,At a time when Verne is making a comeback in t...,,,The Mysterious Island,Jules Verne,1874
2,Juarez,1939,,125.0,"English Language, Spanish Language",United States of America,"Costume drama, Biographical film, Historical f...",juarez,,,...,,,,,,,,The Phantom Crown: The Story of Maximilian & C...,Bertita Harding,1934
3,The Great Santini,1979,4702575.0,115.0,English Language,United States of America,"Family Drama, Drama",thegreatsantini,,4.14,...,total:29100,75.0,eng,,,487.0,,The Great Santini,Pat Conroy,1976
4,The Castle,1968,,88.0,German Language,West Germany,"Mystery, Drama",thecastle,,3.96,...,total:42498,37.0,eng,,<b>Rewriting Kafka</b><p><br />Just before his...,325.0,37.0,The Castle,Franz Kafka,1926


We now have more information on the adapted films, such as their genres. Next, we add each film's rating by merging with the IMDb dataset.

# Merge with IMDb
## Dataset from IMDb, downloaded from https://developer.imdb.com/non-commercial-datasets/
### Please download title.basics.tsv.gz and title.ratings.tsv.gz, unzip them, and put them in the data/ folder
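The IMDb files are tab-separated, use `\N` for missing values, and share the `tconst` identifier. The sketch below shows one way they might be loaded and joined with pandas; it reads tiny in-memory stand-ins instead of the real files in `data/`, the `tconst` values are made up, and restricting to `titleType == "movie"` is an assumption rather than what `merge_with_imdb` necessarily does.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for data/title.basics.tsv and data/title.ratings.tsv
basics_tsv = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\tstartYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tmovie\tCinderella\tCinderella\t0\t1914\t\\N\t52\tFantasy\n"
    "tt0000002\tshort\tSome Short\tSome Short\t0\t1914\t\\N\t\\N\t\\N\n"
)
ratings_tsv = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t6.1\t1095\n"
)

def read_imdb_tsv(source):
    """Read an IMDb TSV file (or file-like object), mapping \\N to NaN."""
    return pd.read_csv(source, sep="\t", na_values="\\N", low_memory=False)

basics = read_imdb_tsv(io.StringIO(basics_tsv))
ratings = read_imdb_tsv(io.StringIO(ratings_tsv))

# Keep feature films only, then attach ratings through the shared identifier
films = basics[basics["titleType"] == "movie"]
imdb = films.merge(ratings, on="tconst", how="inner")
print(imdb[["primaryTitle", "startYear", "averageRating", "numVotes"]])
```

With the real data, `read_imdb_tsv` would be called with the paths of the unzipped files instead of `io.StringIO` objects.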

In [8]:
merge_imdb = merge_with_imdb(merge_cmu)
merge_imdb.to_csv('merge_imdb.csv', index=False)
merge_imdb.head()

lines dropped during merge with IMDB:  546


Unnamed: 0,isAdult,movie_name,movie_date,box_office,runtime,language,countries,genres,ID,Rating,...,Language,PagesNumber,Description,pagesNumber,Count of text reviews,title_book,author_book,year_book,rating,numVotes
0,0,The Fairylogue and Radio-Plays,1908,,120.0,English Language,United States of America,"Silent film, Black-and-white",,3.99,...,,120.0,A cyclone hits Kansas and whirls away Dorothy ...,,,The Wonderful Wizard of Oz,L. Frank Baum,1900,5.2,76
1,0,Atlantis,1913,,113.0,"English Language, Danish Language",Denmark,"Silent film, Drama, Indie, Black-and-white",,,...,,,,,,Atlantis,Gerhart Hauptmann,1912,6.5,500
2,0,Ivanhoe,1913,,,"Silent film, English Language",United States of America,"Swashbuckler films, Silent film, Drama, Adventure",,,...,,,,,,Ivanhoe,Sir Walter Scott,1820,5.6,97
3,0,Cinderella,1914,,52.0,"Silent film, English Language",United States of America,"Silent film, Fantasy, Black-and-white",,3.97,...,eng,,Italian artist Roberto Innocenti's elegantly r...,32.0,,Cinderella,Charles Perrault,1697,6.1,1095
4,0,"His Majesty, the Scarecrow of Oz",1914,,60.0,English Language,United States of America,"Silent film, Adventure, Children's/Family, Bla...",,3.99,...,,120.0,A cyclone hits Kansas and whirls away Dorothy ...,,,The Wonderful Wizard of Oz,L. Frank Baum,1900,5.3,533


We now have 1940 films that are adaptations of known books, which we can use for analysis.
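The "lines dropped during merge with IMDB" count printed above comes from `merge_with_imdb`, whose code is not shown. As a sketch of how such a count could be obtained, pandas can flag unmatched rows with the `indicator` option of `merge`; the frames and the join columns below are made up for illustration.

```python
import pandas as pd

# Made-up frames for illustration only
films = pd.DataFrame({
    "movie_name": ["Cinderella", "Atlantis", "Unknown Film"],
    "movie_date": [1914, 1913, 1950],
})
imdb = pd.DataFrame({
    "movie_name": ["Cinderella", "Atlantis"],
    "movie_date": [1914, 1913],
    "rating": [6.1, 6.5],
})

# indicator=True adds a `_merge` column telling where each row was found
checked = films.merge(imdb, on=["movie_name", "movie_date"], how="left", indicator=True)
dropped = int((checked["_merge"] == "left_only").sum())
kept = checked[checked["_merge"] == "both"].drop(columns="_merge")

print("lines dropped during merge:", dropped)
```

Inspecting the `left_only` rows is also a quick way to see which titles fail to match and whether the keys need further cleaning.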