# Selecting a fiction canon
> with an additive approach
- toc: true 
- badges: true
- comments: true
- categories: [canon]

Still working on the Invisible Canon. Today I want to find a superset of canons for fiction: not genre fiction, but "literature". Of course, the Stanford Literary Lab has already done work on this. In their *Pamphlet 8: Between Canon and Corpus* they demonstrate that overlaying different "top 100 books of the 20th century" lists leads to significant overlap, and can be used to triangulate a sort of "most voted for" list. They use six different lists and end up with roughly 400 unique books. 

I can use this "found canon" technique to craft a data feature, something like "number of times this work has been canonized". That's useful in its own right, but it could also be the target for a collaborative filtering system.

I need to extract that information from a bunch of HTML tables and PDF reports, because academics, and then map it to the features from my previous distillation. One problem is that the Goodreads data is organized by edition, while the canon lists are organized by work. So the first step is to reshape the data. Pandas provides MultiIndex, a way of stacking data in multiple dimensions. Let's try that.
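For reference, a two-level MultiIndex over work and edition could look like the sketch below. The rows are a hand-copied fragment of the Goodreads data (only `work_id`, `book_id`, `title`, and `ratings_count`), and the names `editions` and `by_work` are made up for illustration. The notebook cells themselves index on `work_id` alone.

```python
import pandas as pd

# Toy data: three editions of one work plus one edition of another.
editions = pd.DataFrame({
    "work_id": [3349802, 3349802, 3349802, 206370],
    "book_id": [820229, 820226, 820227, 213189],
    "title": ["Second Glance", "Second Glance", "Second Glance",
              "The Heart of Parenting: Raising an Emotionally..."],
    "ratings_count": [82, 334, 24, 70],
})

# Two index levels: the outer level is the work, the inner level the edition.
by_work = editions.set_index(["work_id", "book_id"]).sort_index()

# Selecting on the outer level returns every edition of that work.
second_glance = by_work.loc[3349802]

# Counting distinct values on the outer level gives the number of works.
n_works = by_work.index.get_level_values("work_id").nunique()
```

Selecting with `by_work.loc[work_id]` drops the outer level, so the result is indexed by `book_id` only.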

In [3]:
import pandas as pd

pd.set_option("display.max_columns", None)


In [43]:
total_df = pd.read_csv('../../records/cleaned_goodreads_books.csv')

In [44]:
total_df = total_df.drop(['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1'], axis=1)

In [45]:
total_df.tail()

Unnamed: 0,isbn,text_reviews_count,series,language_code,popular_shelves,asin,average_rating,similar_books,description,format,link,authors,publisher,num_pages,isbn13,publication_month,publication_year,url,image_url,book_id,ratings_count,work_id,title,top_genre,author_name
1215978,0689852959,1.0,[],,"[{'count': '22', 'name': 'to-read'}, {'count':...",,4.36,[],One of the most popular series ever published ...,Paperback,https://www.goodreads.com/book/show/331839.Jac...,"[{'author_id': '10681', 'role': ''}, {'author_...",Aladdin,176.0,9780689852954,9.0,2002.0,https://www.goodreads.com/book/show/331839.Jac...,https://s.gr-assets.com/assets/nophoto/book/11...,331839,18.0,25313618.0,Jacqueline Kennedy Onassis: Friend of the Arts,biography,Beatrice Gormley
1215979,0373126476,9.0,[],,"[{'count': '78', 'name': 'to-read'}, {'count':...",,3.42,"['2200344', '695337', '10333421', '1934240', '...","Blackmailed into marriage to save her family, ...",Paperback,https://www.goodreads.com/book/show/2685097-th...,"[{'author_id': '319441', 'role': ''}]",Harlequin,192.0,9780373126477,7.0,2007.0,https://www.goodreads.com/book/show/2685097-th...,https://s.gr-assets.com/assets/nophoto/book/11...,2685097,112.0,2710420.0,The Spaniard's Blackmailed Bride,harlequin,Trish Morey
1215980,178092870X,2.0,[],eng,"[{'count': '702', 'name': 'to-read'}, {'count'...",,3.5,"['12064253', '25017213', '571796', '27306126',...",Sir Arthur Conan Doyle is brought back to life...,Paperback,https://www.goodreads.com/book/show/26168430-s...,"[{'author_id': '2448', 'role': ''}, {'author_i...",MX Publishing,148.0,9781780928708,8.0,2015.0,https://www.goodreads.com/book/show/26168430-s...,https://images.gr-assets.com/books/1440592011m...,26168430,6.0,46130263.0,Sherlock Holmes and the July Crisis,mystery,Arthur Conan Doyle
1215981,0765197456,6.0,[],,"[{'count': '37', 'name': 'to-read'}, {'count':...",,4.0,[],"Gathers poems by William Blake, Emily Bronte, ...",Hardcover,https://www.goodreads.com/book/show/2342551.Th...,"[{'author_id': '82312', 'role': 'Editor'}]",Smithmark Publishers,96.0,9780765197450,8.0,1996.0,https://www.goodreads.com/book/show/2342551.Th...,https://s.gr-assets.com/assets/nophoto/book/11...,2342551,36.0,2349247.0,The Children's Classic Poetry Collection,poetry,Nicola Baxter
1215982,162378140X,17.0,['658195'],eng,"[{'count': '56', 'name': 'to-read'}, {'count':...",,4.37,"['23562786', '13548289', '26094541', '20570173...","Volume One contains: ""Claimed,"" ""Tainted,"" and...",Paperback,https://www.goodreads.com/book/show/22017381-1...,"[{'author_id': '7789809', 'role': ''}]",Guerrilla Wordfare,306.0,9781623781408,4.0,2014.0,https://www.goodreads.com/book/show/22017381-1...,https://images.gr-assets.com/books/1398621236m...,22017381,70.0,41332799.0,"101 Nights: Volume One (101 Nights, #1-3)",erotica,S.E. Reign


In [46]:
total_df = total_df.set_index(['work_id'])

In [57]:
total_df

work_id,isbn,text_reviews_count,series,language_code,popular_shelves,asin,average_rating,similar_books,description,format,link,authors,publisher,num_pages,isbn13,publication_month,publication_year,url,image_url,book_id,ratings_count,title,top_genre,author_name
5400751.0,0312853122,1.0,[],,"[{'count': '3', 'name': 'to-read'}, {'count': ...",,4.00,[],,Paperback,https://www.goodreads.com/book/show/5333265-w-...,"[{'author_id': '604031', 'role': ''}]",St. Martin's Press,256.0,9780312853129,9.0,1984.0,https://www.goodreads.com/book/show/5333265-w-...,https://images.gr-assets.com/books/1310220028m...,5333265,3.0,W.C. Fields: A Life on Film,p,Ronald J. Fields
8948723.0,,7.0,['189911'],eng,"[{'count': '58', 'name': 'to-read'}, {'count':...",B00071IKUY,4.03,"['19997', '828466', '1569323', '425389', '1176...",Omnibus book club edition containing the Ladie...,Hardcover,https://www.goodreads.com/book/show/7327624-th...,"[{'author_id': '10333', 'role': ''}]","Nelson Doubleday, Inc.",600.0,,,1987.0,https://www.goodreads.com/book/show/7327624-th...,https://images.gr-assets.com/books/1304100136m...,7327624,140.0,"The Unschooled Wizard (Sun Wolf and Starhawk, ...",fantasy,Barbara Hambly
6243154.0,0743294297,3282.0,[],eng,"[{'count': '7615', 'name': 'to-read'}, {'count...",,3.49,"['6604176', '6054190', '2285777', '82641', '75...",Addie Downs and Valerie Adler were eight when ...,Hardcover,https://www.goodreads.com/book/show/6066819-be...,"[{'author_id': '9212', 'role': ''}]",Atria Books,368.0,9780743294294,7.0,2009.0,https://www.goodreads.com/book/show/6066819-be...,https://s.gr-assets.com/assets/nophoto/book/11...,6066819,51184.0,Best Friends Forever,chick-lit,Jennifer Weiner
278577.0,0850308712,5.0,[],,"[{'count': '32', 'name': 'to-read'}, {'count':...",,3.40,[],,,https://www.goodreads.com/book/show/287140.Run...,"[{'author_id': '149918', 'role': ''}]",,,9780850308716,,,https://www.goodreads.com/book/show/287140.Run...,https://images.gr-assets.com/books/1413219371m...,287140,15.0,Runic Astrology: Starcraft and Timekeeping in ...,runes,Nigel Pennick
278578.0,1599150603,7.0,[],,"[{'count': '56', 'name': 'to-read'}, {'count':...",,4.13,[],"Relates in vigorous prose the tale of Aeneas, ...",Paperback,https://www.goodreads.com/book/show/287141.The...,"[{'author_id': '3041852', 'role': ''}]",Yesterday's Classics,162.0,9781599150604,9.0,2006.0,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...,287141,46.0,The Aeneid for Boys and Girls,history,Alfred J. Church
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25313618.0,0689852959,1.0,[],,"[{'count': '22', 'name': 'to-read'}, {'count':...",,4.36,[],One of the most popular series ever published ...,Paperback,https://www.goodreads.com/book/show/331839.Jac...,"[{'author_id': '10681', 'role': ''}, {'author_...",Aladdin,176.0,9780689852954,9.0,2002.0,https://www.goodreads.com/book/show/331839.Jac...,https://s.gr-assets.com/assets/nophoto/book/11...,331839,18.0,Jacqueline Kennedy Onassis: Friend of the Arts,biography,Beatrice Gormley
2710420.0,0373126476,9.0,[],,"[{'count': '78', 'name': 'to-read'}, {'count':...",,3.42,"['2200344', '695337', '10333421', '1934240', '...","Blackmailed into marriage to save her family, ...",Paperback,https://www.goodreads.com/book/show/2685097-th...,"[{'author_id': '319441', 'role': ''}]",Harlequin,192.0,9780373126477,7.0,2007.0,https://www.goodreads.com/book/show/2685097-th...,https://s.gr-assets.com/assets/nophoto/book/11...,2685097,112.0,The Spaniard's Blackmailed Bride,harlequin,Trish Morey
46130263.0,178092870X,2.0,[],eng,"[{'count': '702', 'name': 'to-read'}, {'count'...",,3.50,"['12064253', '25017213', '571796', '27306126',...",Sir Arthur Conan Doyle is brought back to life...,Paperback,https://www.goodreads.com/book/show/26168430-s...,"[{'author_id': '2448', 'role': ''}, {'author_i...",MX Publishing,148.0,9781780928708,8.0,2015.0,https://www.goodreads.com/book/show/26168430-s...,https://images.gr-assets.com/books/1440592011m...,26168430,6.0,Sherlock Holmes and the July Crisis,mystery,Arthur Conan Doyle
2349247.0,0765197456,6.0,[],,"[{'count': '37', 'name': 'to-read'}, {'count':...",,4.00,[],"Gathers poems by William Blake, Emily Bronte, ...",Hardcover,https://www.goodreads.com/book/show/2342551.Th...,"[{'author_id': '82312', 'role': 'Editor'}]",Smithmark Publishers,96.0,9780765197450,8.0,1996.0,https://www.goodreads.com/book/show/2342551.Th...,https://s.gr-assets.com/assets/nophoto/book/11...,2342551,36.0,The Children's Classic Poetry Collection,poetry,Nicola Baxter


In [56]:
total_df.loc[total_df.index.duplicated()]

work_id,isbn,text_reviews_count,series,language_code,popular_shelves,asin,average_rating,similar_books,description,format,link,authors,publisher,num_pages,isbn13,publication_month,publication_year,url,image_url,book_id,ratings_count,title,top_genre,author_name
3349802.0,174114244X,9.0,[],,"[{'count': '19688', 'name': 'to-read'}, {'coun...",,3.79,"['8359929', '723742', '297130', '7570244', '39...",From the moment Ross's fiancee Aimee was kille...,,https://www.goodreads.com/book/show/820229.Sec...,"[{'author_id': '7128', 'role': ''}]",,,9781741142440,,,https://www.goodreads.com/book/show/820229.Sec...,https://images.gr-assets.com/books/1293769966m...,820229,82.0,Second Glance,fiction,Jodi Picoult
3349802.0,0340897260,46.0,[],en-GB,"[{'count': '19688', 'name': 'to-read'}, {'coun...",,3.79,"['8359929', '723742', '297130', '7570244', '39...",From the moment Ross's fiancee Aimee was kille...,Paperback,https://www.goodreads.com/book/show/820226.Sec...,"[{'author_id': '7128', 'role': ''}]",Hodder,483.0,9780340897263,,2008.0,https://www.goodreads.com/book/show/820226.Sec...,https://images.gr-assets.com/books/1363397305m...,820226,334.0,Second Glance,fiction,Jodi Picoult
3349802.0,0340897279,4.0,[],eng,"[{'count': '19688', 'name': 'to-read'}, {'coun...",,3.79,"['8359929', '723742', '297130', '7570244', '39...",From the moment Ross's fiancee Aimee was kille...,Mass Market Paperback,https://www.goodreads.com/book/show/820227.Sec...,"[{'author_id': '7128', 'role': ''}]",Hodder,420.0,9780340897270,,2007.0,https://www.goodreads.com/book/show/820227.Sec...,https://images.gr-assets.com/books/1288638236m...,820227,24.0,Second Glance,fiction,Jodi Picoult
206370.0,0684801302,16.0,[],,"[{'count': '1654', 'name': 'to-read'}, {'count...",,4.13,"['25343', '256004', '426682', '160909', '13422...",An award-winning research psychologist who has...,Hardcover,https://www.goodreads.com/book/show/213189.The...,"[{'author_id': '14734208', 'role': ''}, {'auth...",Simon & Schuster,240.0,9780684801308,2.0,1997.0,https://www.goodreads.com/book/show/213189.The...,https://s.gr-assets.com/assets/nophoto/book/11...,213189,70.0,The Heart of Parenting: Raising an Emotionally...,parenting,John M. Gottman
752200.0,080500291X,1.0,['191162'],,"[{'count': '8205', 'name': 'to-read'}, {'count...",,4.14,"['7926', '42337', '7904', '7932', '377889', '3...",Saturdays can make dreams come true when the M...,,https://www.goodreads.com/book/show/8037412-th...,"[{'author_id': '3420', 'role': ''}]",,,9780805002911,,,https://www.goodreads.com/book/show/8037412-th...,https://images.gr-assets.com/books/1412898312m...,8037412,6.0,The Saturdays,childrens,Elizabeth Enright
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2206102.0,0373126794,13.0,[],eng,"[{'count': '231', 'name': 'to-read'}, {'count'...",,3.52,"['2685097', '1866878', '2597992', '6282598', '...","Revenge, passion and an arranged marriage...Du...",Paperback,https://www.goodreads.com/book/show/2200344.Th...,"[{'author_id': '621880', 'role': ''}]",Harlequin,192.0,9780373126798,11.0,2007.0,https://www.goodreads.com/book/show/2200344.Th...,https://s.gr-assets.com/assets/nophoto/book/11...,2200344,223.0,The Spanish Duke's Virgin Bride,harlequin,Chantelle Shaw
16154954.0,,16.0,[],eng,"[{'count': '773', 'name': 'to-read'}, {'count'...",,4.07,"['978053', '425481', '361551', '255045', '2581...","""The Short Happy Life of Francis Macomber"" is ...",,https://www.goodreads.com/book/show/7195902-th...,"[{'author_id': '1455', 'role': ''}]",,,,,,https://www.goodreads.com/book/show/7195902-th...,https://images.gr-assets.com/books/1329637203m...,7195902,290.0,The Short Happy Life of Francis Macomber,short-stories,Ernest Hemingway
7198840.0,0749927577,1.0,[],,"[{'count': '28', 'name': 'to-read'}, {'count':...",,4.32,[],Anne Jirsch is a psychic with an extraordinary...,Paperback,https://www.goodreads.com/book/show/2233591.In...,"[{'author_id': '1009480', 'role': ''}, {'autho...",Piatkus Books,302.0,9780749927578,2.0,2007.0,https://www.goodreads.com/book/show/2233591.In...,https://images.gr-assets.com/books/1328821191m...,2233591,12.0,Instant Intuition: A Psychic's Guide to Findin...,books-i-owe,Anne Jirsch
3155241.0,0263853357,1.0,[],eng,"[{'count': '178', 'name': 'to-read'}, {'count'...",,3.32,"['2200344', '6408413', '2789084', '2494238', '...","A marriage is forever, not just for convenienc...",Paperback,https://www.goodreads.com/book/show/6818068-co...,"[{'author_id': '4990', 'role': ''}]",Mills & Boon,192.0,9780263853353,7.0,2007.0,https://www.goodreads.com/book/show/6818068-co...,https://images.gr-assets.com/books/1504917893m...,6818068,3.0,Contracted: A Wife for the Bedroom,harlequin,Carol Marinelli


In [63]:
len(total_df.index.unique())

927891

It worked! Now I have 927,891 individual works, with the editions of each work sharing a `work_id` index value. Time to import the different canons and associate their data with the books!
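Note that `set_index` keeps one row per edition, so the 927,891 figure counts distinct index values rather than rows. If a single row per work is wanted later, a `groupby` can collapse the editions. This is a minimal sketch: summing the ratings and keeping the first-seen title are my own illustrative choices, not something the dataset prescribes.

```python
import pandas as pd

# Toy edition-level rows copied from the duplicated-index output above.
editions = pd.DataFrame({
    "work_id": [3349802, 3349802, 3349802, 206370],
    "book_id": [820229, 820226, 820227, 213189],
    "title": ["Second Glance", "Second Glance", "Second Glance",
              "The Heart of Parenting: Raising an Emotionally..."],
    "ratings_count": [82, 334, 24, 70],
})

# One row per work: first title seen, total ratings, and number of editions.
works = editions.groupby("work_id", sort=False).agg(
    title=("title", "first"),
    ratings_count=("ratings_count", "sum"),
    n_editions=("book_id", "count"),
)
```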

## Gather the canons

I can build here on the great work of others. As mentioned above, the Stanford Literary Lab published [six canons](file:///home/mage/Downloads/LiteraryLabPamphlet8.pdf) (PDF) that they concatenated into their 20th Century Fiction corpus.

I've also found the incredible [The Greatest Books](https://thegreatestbooks.org) project by Shane Sherman, which does a similar thing but with 196 lists! Unfortunately he only distributes the rankings post-weighting, whereas I would like each list to be a separate column so I can count how many times each book gets mentioned. I have sent him a request for the lists, but I may have to scrape the website myself...

For now, I should gather the lists that *didn't* make it into Mr. Sherman's project. Those would be the bestsellers and the reader rankings, for the most part, since he explicitly prefers prestige over popularity. So there are three lists from the LitLab pamphlet that I suspect didn't make it into The Greatest Books, and a few from other places around the web, especially [Penguin's 100 must-read classic books, as chosen by our readers](https://www.penguin.co.uk/articles/2018/100-must-read-classic-books.html). I err on the side of popularity, myself, but the Goodreads ratings should do plenty to balance that effect.

In [80]:
%ls ../../records

bkunde.csv                      goodreads_books_0016.csv
cleaned_goodreads_books.csv     goodreads_books_0017.csv
goodbooks-genre-pop-time.html*  goodreads_books_0018.csv
goodreads_books_0000.csv        goodreads_books_0019.csv
goodreads_books_0001.csv        goodreads_books_0020.csv
goodreads_books_0002.csv        goodreads_books_0021.csv
goodreads_books_0003.csv        goodreads_books_0022.csv
goodreads_books_0004.csv        goodreads_books_0023.csv
goodreads_books_0005.csv        goodreads-classics.csv
goodreads_books_0006.csv        library-journal.csv
goodreads_books_0007.csv        modern-library-readers-list.csv
goodreads_books_0008.csv        modern-library.tsv
goodreads_books_0009.csv        penguin-readers.csv
goodreads_books_0010.csv        postcolonial-studies.csv
goodreads_books_0011.csv        pub-weekly.csv
goodreads_books_0012.csv        to_graph.csv*
goodreads_books_0013.csv        ucsd-goodreads-genre-pop-time.html*
goodreads_bo

In [176]:
bestsellers = pd.read_csv('../../records/wikipedia-bestselling-books.csv')
library = pd.read_csv('../../records/library-journal.csv')
penguin = pd.read_csv('../../records/penguin-readers.csv')
ml_readers = pd.read_csv('../../records/modern-library-readers-list.csv')
pw_readers = pd.read_csv('../../records/pub-weekly.csv') 
psa = pd.read_csv('../../records/postcolonial-studies.csv') 

In [83]:
for i in [bestsellers, library, penguin, ml_readers, pw_readers, psa]:
    print(len(i))

167
150
100
100
83
100


See how these lists are all different lengths? That's why we can't just average out the different rankings. Instead, each book needs a one-hot encoding per list: 1 if it's on the list, 0 if it's not. So we have to add columns to our dataframe, one for each list. That means I need a bunch of small lists, not one megalist like The Greatest Books provides. I'll have to scrape them.


In [84]:
import requests
from bs4 import BeautifulSoup

In [523]:
url = "https://thegreatestbooks.org/lists/28"
r = requests.get(url)

In [524]:
htm = BeautifulSoup(r.text, 'html.parser')

In [525]:
h4s = htm.find_all('h4')

In [526]:
[a.get_text() for a in h4s[0].find_all('a')]

['Don Quixote', 'Miguel de Cervantes']

In [438]:
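def get_list_from_htm(htm):
    """Return the list's title and its [title, author] pairs from a parsed list page."""
    title = htm.find_all('h2')[0].get_text()
    h4s = htm.find_all('h4')
    # Each book sits in an <h4> whose first two links are the title and the author;
    # len(o) counts a tag's children, so this skips headings with a single child
    books = [[a.get_text() for a in o.find_all('a')][:2] for o in h4s if len(o) > 1]
    return title, books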

In [439]:
get_list_from_htm(htm)

('TIME Magazine All Time 100 Novels  by TIME Magazine',
 [['The Adventures of Augie March', 'Saul Bellow'],
  ["All the King's Men", 'Robert Penn Warren'],
  ['American Pastoral', 'Philip Roth'],
  ['Animal Farm', 'George Orwell'],
  ['Appointment in Samarra', "John O'Hara"],
  ["Are You There God? It's Me, Margaret", 'Judy Blume'],
  ['The Assistant', 'Bernard Malamud'],
  ['Atonement', 'Ian McEwan'],
  ['Beloved', 'Toni Morrison'],
  ['The Berlin Stories', 'Christopher Isherwood'],
  ['The Big Sleep', 'Raymond Chandler'],
  ['The Blind Assassin', 'Margaret Atwood'],
  ['Blood Meridian', 'Cormac McCarthy'],
  ['Brideshead Revisited', 'Evelyn Waugh'],
  ['The Bridge of San Luis Rey', 'Thornton Wilder'],
  ['Call It Sleep', 'Henry Roth'],
  ['Catch-22 ', 'Joseph Heller'],
  ['The Catcher in the Rye', 'J. D. Salinger'],
  ['A Clockwork Orange', 'Anthony Burgess'],
  ['The Confessions of Nat Turner', 'William Styron'],
  ['The Corrections', 'Jonathan Franzen'],
  ['The Crying of Lot 49 ',

Hey, that worked! The website has good semantic HTML, so it will be easy to extend the same process to the rest of the lists.

I want to make a dataframe of these lists, indexed by title. That way I can join it onto the larger Goodreads dataset, or pull information from that dataset into this one.

I don't care about relative rankings within a given list, just the number of lists a book appears on. So I can use a one-hot encoding: 0 if a book is not on a list, 1 if it is. There's probably a nifty method for this, but I've never done it before, so I'm just going to implement it manually right now.
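One candidate for that nifty method is `pd.crosstab`, which builds a membership matrix from a long table of (title, list) pairs. The sketch below uses three tiny made-up lists (`List A`, `List B`, `List C` are placeholders, not real canons) and is only meant to show the shape of the result.

```python
import pandas as pd

# Hypothetical miniature lists; the real ones come from the CSVs and the scraper.
lists = {
    "List A": ["Beloved", "Animal Farm", "Catch-22"],
    "List B": ["Beloved", "Don Quixote"],
    "List C": ["Beloved", "Animal Farm"],
}

# Long format: one row per (title, list) membership.
long = pd.DataFrame(
    [(title, name) for name, titles in lists.items() for title in titles],
    columns=["Title", "List"],
)

# crosstab counts co-occurrences; clip(upper=1) guards against a title
# that appears twice within the same list.
onehot_sketch = pd.crosstab(long["Title"], long["List"]).clip(upper=1)
```

Each row of `onehot_sketch` is a title and each column a list, which is the same layout the manual approach below produces.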

First I'll try it on the data I already have. Then I'll start scraping the website.

In [440]:
local_lists = [bestsellers, library, penguin, ml_readers, pw_readers, psa]
for i in local_lists:
    print(i.keys())

Index(['Book', 'Author(s)', 'Original language', 'First published',
       'Approximate sales', 'Genre', 'Author'],
      dtype='object')
Index(['\nK', 'L', 'M', 'R', 'App.', 'Points', 'Rank', 'Author', 'Title',
       'Date'],
      dtype='object')
Index(['Title', ' Author', ' Year', 'Author'], dtype='object')
Index(['Book', ' Author', ' Date', ' Rank', 'Author'], dtype='object')
Index(['Book', ' Author', ' Date', 'Author'], dtype='object')
Index(['Title', ' Author', ' Date', ' Rank', 'Author'], dtype='object')


I don't care about date or genre or any of that; I can reconstruct it later. Either 'Book' or 'Title' exists in each one, along with 'Author' or in one case 'Author(s)'. That's all I can get from the Greatest Books website, so that's what I will work with here.
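The header differences could also be ironed out up front with a small helper. This is a sketch only: `normalize_columns` is a hypothetical function, the two toy frames are made-up rows shaped like the local lists, and it assumes it runs on freshly loaded CSVs.

```python
import pandas as pd

def normalize_columns(df):
    """Hypothetical helper: return only 'Title' and 'Author', whatever the CSV called them."""
    # Strip stray whitespace from header names such as ' Author'
    out = df.rename(columns=lambda c: c.strip())
    out = out.rename(columns={"Book": "Title", "Author(s)": "Author"})
    # If renaming produced two 'Author' columns, keep the first one
    out = out.loc[:, ~out.columns.duplicated()]
    return out[["Title", "Author"]]

# Toy frames with the two header styles seen above
title_style = pd.DataFrame({"Title": ["Rebecca"], " Author": ["Daphne Du Maurier"], " Year": [1938]})
book_style = pd.DataFrame({"Book": ["The Hobbit"], "Author(s)": ["J. R. R. Tolkien"]})

combined = pd.concat(
    [normalize_columns(title_style), normalize_columns(book_style)],
    ignore_index=True,
)
```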

I didn't pair names with these lists, so I'll just write out a little list of names, in the same order, real quick:

In [489]:
list_names = ['Wikipedia Bestselling', 'Library Journal', 'Penguin Readers', 'Modern Library Readers', 'PW Bestsellers', 'Postcolonial Studies']

In [490]:
titles_df = pd.DataFrame()

In [491]:
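for name, df in zip(list_names, local_lists):
    print(name)
    # The title column is called 'Book' in some lists and 'Title' in others
    bk = 'Book' if 'Book' in df.keys() else 'Title'
    # Likewise the author column, which sometimes carries a leading space
    au = 'Author(s)' if 'Author(s)' in df.keys() else 'Author' if 'Author' in df.keys() else ' Author'
    df['Author'] = df[au]
    new_df = df[[bk, 'Author']].set_index(bk)
    new_df[name] = 1
    # DataFrame.append was removed in pandas 2.0, so concatenate instead
    titles_df = pd.concat([titles_df, new_df])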


Wikipedia Bestselling
Library Journal
Penguin Readers
Modern Library Readers
PW Bestsellers
Postcolonial Studies


In [581]:
titles_df

Unnamed: 0,Author,Wikipedia Bestselling,Library Journal,Penguin Readers,Modern Library Readers,PW Bestsellers,Postcolonial Studies
The Hobbit,J. R. R. Tolkien,1.0,,,,,
Harry Potter and the Philosopher's Stone,J. K. Rowling,1.0,,,,,
The Little Prince,Antoine de Saint-Exupéry,1.0,,,,,
Dream of the Red Chamber,Cao Xueqin,1.0,,,,,
And Then There Were None,Agatha Christie,1.0,,,,,
...,...,...,...,...,...,...,...
Nervous Conditions,Tsitsi Dangarembga,,,,,,1.0
The Palace Of The Peacock,Wilson Harris,,,,,,1.0
Rebecca,Daphne Du Maurier,,,,,,1.0
The Autobiography Of My Mother,Jamaica Kincaid,,,,,,1.0


In [582]:
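# sum(level=0) was removed in pandas 2.0; group on the index instead
onehot = titles_df.drop(columns='Author').fillna(0).groupby(level=0, sort=False).sum()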

onehot

Unnamed: 0,Wikipedia Bestselling,Library Journal,Penguin Readers,Modern Library Readers,PW Bestsellers,Postcolonial Studies
The Hobbit,1.0,0.0,0.0,0.0,0.0,0.0
Harry Potter and the Philosopher's Stone,1.0,0.0,0.0,0.0,0.0,0.0
The Little Prince,1.0,0.0,0.0,0.0,0.0,0.0
Dream of the Red Chamber,1.0,0.0,0.0,0.0,0.0,0.0
And Then There Were None,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...
Goodbye To Berlin,0.0,0.0,0.0,0.0,0.0,1.0
Nervous Conditions,0.0,0.0,0.0,0.0,0.0,1.0
The Palace Of The Peacock,0.0,0.0,0.0,0.0,0.0,1.0
The Autobiography Of My Mother,0.0,0.0,0.0,0.0,0.0,1.0


Brilliant! That's a one-hot encoding for each list, indexed by title. Now I just encapsulate that logic in a function and run it across each new list as I scrape them from the internet!

In [583]:
full_titles_df = titles_df

In [584]:
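def books_df_from_list(list_from_htm):
    """Turn a (list name, [[title, author], ...]) tuple into a dataframe flagged with that list."""
    name, books = list_from_htm
    new_df = pd.DataFrame(data=books, columns=['Title', 'Author'])
    # Every book scraped from this page is on the list, so the whole column is 1
    new_df[name] = 1
    return new_df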

I need a list of all the fiction lists, not the nonfiction ones. I can scrape the URLs from the site's list details page.

In [585]:
url = "https://thegreatestbooks.org/lists/details"
r = requests.get(url)

In [586]:
htm = BeautifulSoup(r.text, 'html.parser')

In [587]:
hrefs = [o.get('href') for o in htm.find_all('a') if o.get('href') is not None]

In [588]:
links = [o for o in hrefs if '/lists/' in o]
links = [o for o in links if 'http' not in o]

In [589]:
links[:5]

['/lists/28', '/lists/122', '/lists/114', '/lists/120', '/lists/44']
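The substring test keeps any relative link containing `/lists/`, which would include `/lists/details` itself if the page links to it, plus any repeated links. A stricter filter could match the numbered pattern and drop duplicates. This sketch assumes every list page lives at `/lists/<number>`, which fits the five links shown but which I haven't checked for the whole site. The name `numbered_list_links` and the `sample` input are made up.

```python
import re

# Site-relative list pages look like /lists/28 (assumed pattern)
LIST_LINK = re.compile(r"^/lists/\d+$")

def numbered_list_links(hrefs):
    """Keep links of the form /lists/<number>, in first-seen order, without duplicates."""
    seen = {}
    for href in hrefs:
        if href and LIST_LINK.match(href):
            # dicts preserve insertion order, so this de-duplicates stably
            seen.setdefault(href, None)
    return list(seen)

sample = ['/lists/28', '/lists/details', 'https://example.com/lists/3',
          '/lists/122', '/lists/28', None]
filtered = numbered_list_links(sample)
```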

Now we put it all together: pull each list and append it to the dataframe, then merge the duplicates for one great big one-hot-encoded best-books list...

In [590]:
from tqdm import tqdm

In [591]:
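for link in tqdm(links):
    try:
        url = f"https://thegreatestbooks.org{link}"
        r = requests.get(url)
        htm = BeautifulSoup(r.text, 'html.parser')
        bks = get_list_from_htm(htm)
        new_df = books_df_from_list(bks)
        # DataFrame.append was removed in pandas 2.0, so concatenate instead
        full_titles_df = pd.concat([full_titles_df, new_df.set_index('Title')])
    except IndexError as e:
        # get_list_from_htm raises IndexError when a page has no <h2>; report the link and move on
        print(e, link)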


100%|██████████| 130/130 [02:25<00:00,  1.12s/it]

In [592]:
full_titles_df

Unnamed: 0,Author,Wikipedia Bestselling,Library Journal,Penguin Readers,Modern Library Readers,PW Bestsellers,Postcolonial Studies,"Top 100 Works in World Literature by Norwegian Book Clubs, with the Norwegian Nobel Institute",Biblioteca by Argentina,The 25 Favorite Books of 100 Francophone Writers by Telerama,For The Love of Books by For The Love of Books,The Top 10: The Greatest Books of All Time by The Top 10 (Book),The 100 Best Books of World Literature by ABC.es,The Ideal Library by Book,El Pais Favorite Books of 100 Spanish Authors by El Pais,Pour une Bibliothèque Idéale by Raymond Queneau,1001 Books You Must Read Before You Die by The Book,Koen Book Distributors Top 100 Books of the Past Century by themodernnovel.com,The Celebrity Reading List by Gardiner Public Library,Finest Works of Fiction by Martin Seymour-Smith and Editors,The 100 Best Non-Fiction Books of the Century by National Review,Great Books of the Western World by Great Books Foundation,100 Life-Changing Books by National Book Award,The New York Public Library's Books of the Century by New York Public Library,The New Lifetime Reading Plan by The New Lifetime Reading Plan,Great Books by The Learning Channel,The 100 Greatest British Novels by BBC,The 50 Best Books of the Century by Intercollegiate Studies Institute,Världsbiblioteket (The World Library) by Tidningen Boken,Recommended Books by Academy of Achievement,The Greatest 20th Century Novels by Waterstone,"""Best Foreign Work of Fiction"" by Transfuge",The 16 Greatest Books of All Time by NYU Local,ZEIT-Bibliothek der 100 Bücher by Die Zeit,The Bigger Read List by English PEN,100 Novels That Shaped Our World by BBC,"""Our Readable Century"", The Best Books of the 20th Century by January Magazine",100 Books to Read in a Lifetime by Amazon.com (USA),100 Books to Read in a Lifetime by Amazon.com (UK),The 100 Greatest Books Ever Written by Easton Press,48 Good Books by University of Buffalo,25 acclaimed international writers choose 25 of the best books from the last 25 years by Wasafiri Magazine,The Modern Library | 100 Best Novels by Modern Library,The 21st Century's 12 Greatest Novels by BBC,Top 100 World Literature Titles by Perfection Learning,A Premature Attempt at the 21st Century Canon by Vulture,The 100 Favorite Novels of Librarians by Bookman.com,Third World Novels… The Top 10 by New Internationalist,110 Best Books: The Perfect Library by The Telegraph,The Great American Read by PBS,Best German Novels of the Twentieth Century by Wikipedia,The New Vanguard by New York Times,Select 100 by University of Wisconsin-Milwaukee,The Modern Library | 100 Best Nonfiction by The Modern Library,The Millions: The Best Fiction of the Millennium by The Millions,100 Best Books by Montana State University,100 Best Novels in English Since 1900 by Counterpunch,Radcliffe's 100 Best Novels by Radcliffe Publishing Course,The 75 Best Books of the Past 75 Years by Parade Magazine,Harvard Book Store Staff's Favorite 100 Books by Harvard Book Store,Man Booker Prize by Man Booker Prize,PEN/Faulkner Award for Fiction by PEN/Faulkner,Pulitzer Prize for Fiction by Pulitzer Prize,National Book Award - Nonfiction by National Book Foundation,Pulitzer Prize for Biography or Autobiography by Pulitzer Prize,James Tait Black Memorial Prize by Wikipedia,National Book Critics Circle Award - Fiction by National Book Critics Circle,National Book Critics Circle Award - Nonfiction by National Book Critics Circle,Best Books Ever by bookdepository.com,Pulitzer Prize for History by Pulitzer Prize,National Book Award - Fiction by National Book Foundation,Pulitzer Prize for Non-Fiction by Pulitzer Prize,How to Read Literature Like a Professor: A Reading List by Thomas C. Foster,100 Essential Books by Bravo! Magazine,The Greatest Novel of All Time by William Faulkner,W. Somerset Maugham’s Ten Greatest Novels of All Time by Great Novelists and Their Novels,The 100 Greatest Novels by greatbooksguide.com,The College Board: 101 Great Books Recommended for College-Bound Readers by http://www.uhlibrary.net/pdf/college_board_recommended_books.pdf,The Great Books Reader by Book,The Best Classics by The Times,In Which These Are the 100 Greatest Novels by ThisRecording.com,How to Read and Why by Harold Bloom,The Novel 100: A Ranking of the Greatest Novels of All Time by The Novel 100,50 Greatest Books of All Time by Globe and Mail,The 100 Greatest Novels of All Time: The List by The Observer,Books That Changed the World: The 50 Most Influential Books in Human History by Book,Masterpieces of World Literature by Frank N. Magill,Great Books by Anthony O'Hear,TIME Magazine All Time 100 Novels by TIME Magazine,The Telegraph’s 100 Novels Everyone Should Read by Telegraph,50 Books That Changed the World by Open Education Database,The best books in Spanish for the last 25 years by El Pais,"Top 10 British, Irish or Commonwealth Novels from 1980 to 2005 by The Observer",The 100 best books of the 21st century by The Guardian,The 100 Best Books in the World by AbeBooks.de (in German),What Is the Best Work of American Fiction of the Last 25 Years? by New York Times,50 Books to Read Before You Die by Complex,The 50 Best Nonfiction Books of the Past 25 Years by Slate,D. G. Myers’ 50 Greatest English Language Novels by D. G. Myers,Modern classics: 11 novels that belong in the classroom by Today.com,The 100 Best Books of the Decade(2000) by Times,50 Books to (Re-)Read at 50 by nextavenue,The Best Books of the 2000s by The Onion AV Club,Extreme Classics: The 100 Greatest Adventure Books of All Time by National Geographic Adventure Magazine,The Best Southern Novels of All Time by Oxford American,Books of the Decade by The Guardian,The 80 Books Every Man Should Read by Esquire,Paste Magazine's Best Books of the Decade(2000-2009) by Paste Magazine,"The 50 Books Everyone Needs to Read, 1963-2013 by Flavor Wire",The New Classics - 100 Best Reads from 1983 to 2008 by Entertainment Weekly,The 10 Best of the Decade(2000) by Entertainment Weekly,From Zero to Well-Read in 100 Books by Jeff O'Neal at Bookriot.com,50 Books to Read Before You Die by Barnes and Noble,The Book of Great Books: A Guide to 100 World Classics by Book,"The 100 Greatest American Novels, 1893 – 1993 by Jeff O'Neal at Bookriot.com",Donald Barthelme’s Reading List by Believer Mag,100 Best Novels Written in English by The Guardian,Books That Changed the World by Book,Costa Book Award - Best Novel by Costa Coffee,Le Monde's 100 Books of the Century by Le Monde,Entertainment Weekly's Top 100 Novels by Entertainment Weekly,The Dream of the Great American Novel by Book,The Graphic Canon by Book,Greatest Prose Works of the 20th Century by Vladimir Nabokov,20th Century's Greatest Hits: 100 English-Language Books of Fiction by Larry McCaffery,Robert McCrum's top 10 books of the twentieth century by The Guardian,100 Major Works of Modern Creative Nonfiction by About.com,Waterstone's Books of the Century by LibraryThing,The Best Fiction Books of the 2010s by Time,50 Memorable Books from 50 Years of Books to Remember by The New York Public Library,The 100 Best Nonfiction Books of All Time by The Guardian,The Best Southern Nonfiction of All Time by Oxford American,The 100 Most Influential Books Ever Written by Martin Seymour-Smith,25 Books to Read Before you Die: 21st Century by Powell's Books,Top 10 Fiction Books of the Decade(2010) by Entertainment Weekly,The 40 Best Novels of the 2010s by Paste Magazine,100 Most Influential Books of the Century by Boston Public Library
The Hobbit,J. R. R. Tolkien,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Harry Potter and the Philosopher's Stone,J. K. Rowling,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
The Little Prince,Antoine de Saint-Exupéry,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Dream of the Red Chamber,Cao Xueqin,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
And Then There Were None,Agatha Christie,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
A Room of One's Own,Virginia Woolf,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0
Native Son,Richard Wright,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0
Syntactic Structures,Noam Chomsky,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0
The Feminine Mystique,Betty Friedan,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0


In [593]:
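# Fill missing votes with 0, then collapse to one row per title (index level 0).
# DataFrame.sum(level=...) was removed in pandas 2.0; groupby(level=...) is the replacement.
full_onehot = full_titles_df.fillna(0).groupby(level=0, sort=False).sum()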

full_onehot

Unnamed: 0,Wikipedia Bestselling,Library Journal,Penguin Readers,Modern Library Readers,PW Bestsellers,Postcolonial Studies,"Top 100 Works in World Literature by Norwegian Book Clubs, with the Norwegian Nobel Institute",Biblioteca by Argentina,The 25 Favorite Books of 100 Francophone Writers by Telerama,For The Love of Books by For The Love of Books,The Top 10: The Greatest Books of All Time by The Top 10 (Book),The 100 Best Books of World Literature by ABC.es,The Ideal Library by Book,El Pais Favorite Books of 100 Spanish Authors by El Pais,Pour une Bibliothèque Idéale by Raymond Queneau,1001 Books You Must Read Before You Die by The Book,Koen Book Distributors Top 100 Books of the Past Century by themodernnovel.com,The Celebrity Reading List by Gardiner Public Library,Finest Works of Fiction by Martin Seymour-Smith and Editors,The 100 Best Non-Fiction Books of the Century by National Review,Great Books of the Western World by Great Books Foundation,100 Life-Changing Books by National Book Award,The New York Public Library's Books of the Century by New York Public Library,The New Lifetime Reading Plan by The New Lifetime Reading Plan,Great Books by The Learning Channel,The 100 Greatest British Novels by BBC,The 50 Best Books of the Century by Intercollegiate Studies Institute,Världsbiblioteket (The World Library) by Tidningen Boken,Recommended Books by Academy of Achievement,The Greatest 20th Century Novels by Waterstone,"""Best Foreign Work of Fiction"" by Transfuge",The 16 Greatest Books of All Time by NYU Local,ZEIT-Bibliothek der 100 Bücher by Die Zeit,The Bigger Read List by English PEN,100 Novels That Shaped Our World by BBC,"""Our Readable Century"", The Best Books of the 20th Century by January Magazine",100 Books to Read in a Lifetime by Amazon.com (USA),100 Books to Read in a Lifetime by Amazon.com (UK),The 100 Greatest Books Ever Written by Easton Press,48 Good Books by University of Buffalo,25 acclaimed international writers choose 25 of the best books 
from the last 25 years by Wasafiri Magazine,The Modern Library | 100 Best Novels by Modern Library,The 21st Century's 12 Greatest Novels by BBC,Top 100 World Literature Titles by Perfection Learning,A Premature Attempt at the 21st Century Canon by Vulture,The 100 Favorite Novels of Librarians by Bookman.com,Third World Novels… The Top 10 by New Internationalist,110 Best Books: The Perfect Library by The Telegraph,The Great American Read by PBS,Best German Novels of the Twentieth Century by Wikipedia,The New Vanguard by New York Times,Select 100 by University of Wisconsin-Milwaukee,The Modern Library | 100 Best Nonfiction by The Modern Library,The Millions: The Best Fiction of the Millennium by The Millions,100 Best Books by Montana State University,100 Best Novels in English Since 1900 by Counterpunch,Radcliffe's 100 Best Novels by Radcliffe Publishing Course,The 75 Best Books of the Past 75 Years by Parade Magazine,Harvard Book Store Staff's Favorite 100 Books by Harvard Book Store,Man Booker Prize by Man Booker Prize,PEN/Faulkner Award for Fiction by PEN/Faulkner,Pulitzer Prize for Fiction by Pulitzer Prize,National Book Award - Nonfiction by National Book Foundation,Pulitzer Prize for Biography or Autobiography by Pulitzer Prize,James Tait Black Memorial Prize by Wikipedia,National Book Critics Circle Award - Fiction by National Book Critics Circle,National Book Critics Circle Award - Nonfiction by National Book Critics Circle,Best Books Ever by bookdepository.com,Pulitzer Prize for History by Pulitzer Prize,National Book Award - Fiction by National Book Foundation,Pulitzer Prize for Non-Fiction by Pulitzer Prize,How to Read Literature Like a Professor: A Reading List by Thomas C. Foster,100 Essential Books by Bravo! Magazine,The Greatest Novel of All Time by William Faulkner,W. 
Somerset Maugham’s Ten Greatest Novels of All Time by Great Novelists and Their Novels,The 100 Greatest Novels by greatbooksguide.com,The College Board: 101 Great Books Recommended for College-Bound Readers by http://www.uhlibrary.net/pdf/college_board_recommended_books.pdf,The Great Books Reader by Book,The Best Classics by The Times,In Which These Are the 100 Greatest Novels by ThisRecording.com,How to Read and Why by Harold Bloom,The Novel 100: A Ranking of the Greatest Novels of All Time by The Novel 100,50 Greatest Books of All Time by Globe and Mail,The 100 Greatest Novels of All Time: The List by The Observer,Books That Changed the World: The 50 Most Influential Books in Human History by Book,Masterpieces of World Literature by Frank N. Magill,Great Books by Anthony O'Hear,TIME Magazine All Time 100 Novels by TIME Magazine,The Telegraph’s 100 Novels Everyone Should Read by Telegraph,50 Books That Changed the World by Open Education Database,The best books in Spanish for the last 25 years by El Pais,"Top 10 British, Irish or Commonwealth Novels from 1980 to 2005 by The Observer",The 100 best books of the 21st century by The Guardian,The 100 Best Books in the World by AbeBooks.de (in German),What Is the Best Work of American Fiction of the Last 25 Years? by New York Times,50 Books to Read Before You Die by Complex,The 50 Best Nonfiction Books of the Past 25 Years by Slate,D. G. Myers’ 50 Greatest English Language Novels by D. G. 
Myers,Modern classics: 11 novels that belong in the classroom by Today.com,The 100 Best Books of the Decade(2000) by Times,50 Books to (Re-)Read at 50 by nextavenue,The Best Books of the 2000s by The Onion AV Club,Extreme Classics: The 100 Greatest Adventure Books of All Time by National Geographic Adventure Magazine,The Best Southern Novels of All Time by Oxford American,Books of the Decade by The Guardian,The 80 Books Every Man Should Read by Esquire,Paste Magazine's Best Books of the Decade(2000-2009) by Paste Magazine,"The 50 Books Everyone Needs to Read, 1963-2013 by Flavor Wire",The New Classics - 100 Best Reads from 1983 to 2008 by Entertainment Weekly,The 10 Best of the Decade(2000) by Entertainment Weekly,From Zero to Well-Read in 100 Books by Jeff O'Neal at Bookriot.com,50 Books to Read Before You Die by Barnes and Noble,The Book of Great Books: A Guide to 100 World Classics by Book,"The 100 Greatest American Novels, 1893 – 1993 by Jeff O'Neal at Bookriot.com",Donald Barthelme’s Reading List by Believer Mag,100 Best Novels Written in English by The Guardian,Books That Changed the World by Book,Costa Book Award - Best Novel by Costa Coffee,Le Monde's 100 Books of the Century by Le Monde,Entertainment Weekly's Top 100 Novels by Entertainment Weekly,The Dream of the Great American Novel by Book,The Graphic Canon by Book,Greatest Prose Works of the 20th Century by Vladimir Nabokov,20th Century's Greatest Hits: 100 English-Language Books of Fiction by Larry McCaffery,Robert McCrum's top 10 books of the twentieth century by The Guardian,100 Major Works of Modern Creative Nonfiction by About.com,Waterstone's Books of the Century by LibraryThing,The Best Fiction Books of the 2010s by Time,50 Memorable Books from 50 Years of Books to Remember by The New York Public Library,The 100 Best Nonfiction Books of All Time by The Guardian,The Best Southern Nonfiction of All Time by Oxford American,The 100 Most Influential Books Ever Written by Martin Seymour-Smith,25 Books 
to Read Before you Die: 21st Century by Powell's Books,Top 10 Fiction Books of the Decade(2010) by Entertainment Weekly,The 40 Best Novels of the 2010s by Paste Magazine,100 Most Influential Books of the Century by Boston Public Library
The Hobbit,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Harry Potter and the Philosopher's Stone,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Little Prince,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Dream of the Red Chamber,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
And Then There Were None,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Decline of the West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
The History of the Standard Oil Company,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
Theory of Games and Economic Behavior,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
AA Big Book,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
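
To make the collapse step concrete, here is a minimal sketch on a made-up toy frame. It assumes `full_titles_df` is indexed by a (title, author) MultiIndex with one sparse row per vote, which is what the output above suggests; the names `toy_votes`, `toy_onehot`, "List A" and "List B" are hypothetical stand-ins, not part of the real data.

```python
import pandas as pd

nan = float("nan")

# Toy stand-in for full_titles_df: one sparse row per vote,
# indexed by a (title, author) MultiIndex. List names are made up.
index = pd.MultiIndex.from_tuples(
    [
        ("Ulysses", "James Joyce"),
        ("Ulysses", "James Joyce"),
        ("Lolita", "Vladimir Nabokov"),
    ],
    names=["title", "author"],
)
toy_votes = pd.DataFrame(
    {"List A": [1.0, nan, 1.0], "List B": [nan, 1.0, nan]},
    index=index,
)

# Fill missing votes with 0, then add up all rows that share a title.
# sort=False keeps titles in order of first appearance.
toy_onehot = toy_votes.fillna(0).groupby(level=0, sort=False).sum()
print(toy_onehot)
```

Grouping on level 0 drops the author level, so two different works that share a title would be merged into one row. Grouping on both levels would avoid that, at the cost of keeping the MultiIndex around.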


In [595]:
full_onehot.to_csv('../../records/greatest-book-lists-onehot.csv')

And there we have it! 136 book lists, 11,225 votes, and 4,414 unique titles. Now we can do all kinds of math on these features, including recommender systems, network graphs, etc. But for now: a quick ranking by count, then sweet sleep...


In [637]:
counts = zip(full_onehot.index, full_onehot.sum(axis=1))

In [638]:
counts_df = pd.DataFrame(sorted(counts, key=lambda x:x[1], reverse=True))

In [647]:
counts_df.loc[:50]

Unnamed: 0,0,1
0,Ulysses,51.0
1,The Great Gatsby,50.0
2,One Hundred Years of Solitude,44.0
3,Lolita,43.0
4,Nineteen Eighty Four,42.0
5,Don Quixote,42.0
6,Moby Dick,42.0
7,The Catcher in the Rye,41.0
8,In Search of Lost Time,41.0
9,Pride and Prejudice,39.0


That looks like the classics, all right... :roll_eyes:
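
For what it's worth, the ranking and the summary numbers can also be sketched without the `zip`/`sorted` detour, staying in pandas the whole way. This is only an illustration on a made-up toy frame (`toy_onehot` and the list names are hypothetical), assuming the one-hot layout above: titles as rows, lists as columns.

```python
import pandas as pd

# Toy stand-in for full_onehot: rows are titles, columns are lists.
toy_onehot = pd.DataFrame(
    {
        "List A": [1.0, 1.0, 0.0],
        "List B": [1.0, 0.0, 0.0],
        "List C": [1.0, 1.0, 1.0],
    },
    index=["Ulysses", "Lolita", "Don Quixote"],
)

# Votes per title: sum across the list columns, highest first.
votes_per_title = toy_onehot.sum(axis=1).sort_values(ascending=False)

# The three headline numbers fall out of the same frame.
n_lists = toy_onehot.shape[1]
n_titles = toy_onehot.shape[0]
n_votes = int(toy_onehot.to_numpy().sum())

print(votes_per_title.head(10))
print(n_lists, n_votes, n_titles)
```

Keeping the result as a Series indexed by title means it can be joined back onto other per-work features later.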