# Checking solved request book popularity

For the solved book search requests, we want to have an indicator of book popularity. We use two source:

1. book item popularity: the popularity of the individual book. For this we use the Open Libary API to query for records of the correct answer book, and extract the `readinglog`

2. book genre popularity: the popularity of the genre of the book. For this we combine two sources:
    - 1. the genre label in the title of the request,
    - 2. the tags that user assigned to Goodreads book lists that they are interested in or contributed to. 

In [1]:
import re
import urllib

import numpy as np
import pandas as pd
import requests



In [162]:
tsv_file = '../../data/2026-Why_are_complex_search_requests_so_difficult/solved threads (unchecked)/Books-solved.tsv'

df = pd.read_csv(tsv_file, sep='\t')
df.head(2)

Unnamed: 0,subreddit,thread_id,comment_id,author_name,timestamp,comment,solved,answer,LibraryThing_ID,notes
0,goodreads,13195,1,Angela,2008-08-05T00:00:00Z,"A book about a child who goes to school, but s...",,,,TW: abuse
1,goodreads,13195,2,Krista the Krazy Kataloguer,2008-08-05T00:00:00Z,"Well, it sounds like The Bears' House by Maril...",,,,


## Sanity checks

The `subreddit` column must always be `goodreads` and the `comment_id` column must always be an integer.

In [163]:
df.subreddit.value_counts()

subreddit
goodreads    12198
Name: count, dtype: int64

In [164]:
df.comment_id.value_counts()
#df.comment_id.apply(lambda x: isinstance(x, int))

comment_id
1     1941
2     1921
3     1779
4     1434
5     1011
6      745
7      544
8      411
9      315
10     242
11     200
12     164
13     141
14     111
15      94
16      84
17      78
18      70
19      64
20      59
21      57
22      50
23      44
24      43
25      41
27      38
26      38
28      34
29      31
30      26
31      26
32      26
33      25
34      23
35      21
36      21
37      21
41      19
42      19
39      19
40      19
38      19
43      18
44      16
45      16
46      16
47      16
48      16
49      16
50      16
Name: count, dtype: int64

## Selecting answer rows

In [165]:
answer_df = df[df.answer.apply(lambda x: pd.notna(x) and x != '')]
answer_df.head(2)

Unnamed: 0,subreddit,thread_id,comment_id,author_name,timestamp,comment,solved,answer,LibraryThing_ID,notes
16,goodreads,13195,14,Angela,2008-08-05T00:00:00Z,The book I was looking for is called Lovey by ...,solved / confirmed,Lovey: A Very Special Child,272207,by Mary MacCracken
83,goodreads,37787,14,Deborah,2008-08-05T00:00:00Z,It jumped into my head this morning - after ho...,solved / confirmed,The Hite Report: A Nationwide Study of Female ...,71020,solved by the OP title according to LibraryThing


### Add author information

In [169]:
def extract_author(note):
    if pd.isna(note):
        return np.nan
    if note.startswith('by ') is False:
        return np.nan
    if ' part of ' in note:
        part_idx = note.index(' part of ')
        #print(note[part_idx:])
        return note[3:part_idx]
    else:
        return note[3:]
        
df['has_author'] = df.notes.apply(lambda x: pd.isna(x) is False and x.startswith('by '))
df['author'] = df.notes.apply(extract_author)


In [170]:
df[df.author.notna()]

Unnamed: 0,subreddit,thread_id,comment_id,author_name,timestamp,comment,solved,answer,LibraryThing_ID,notes,has_author,author
16,goodreads,13195,14,Angela,2008-08-05T00:00:00Z,The book I was looking for is called Lovey by ...,solved / confirmed,Lovey: A Very Special Child,272207,by Mary MacCracken,True,Mary MacCracken
104,goodreads,55619,8,Tab,2019-08-05T00:00:00Z,It sounds like a book that Chronicle Books wou...,solved,Unforgettable: Images That Have Changed Our Lives,1564231,by Peter Davenport,True,Peter Davenport
127,goodreads,160509,2,karen,2016-08-05T00:00:00Z,i NEVER abandoned this! i kept hope alive! and...,solved / confirmed,Against Incredible Odds,19290779,by Arthur Roth,True,Arthur Roth
129,goodreads,178121,37,The Flooze,2009-08-05T00:00:00Z,Donbas: A True Story of an Escape Across Russi...,solved,Donbas: A True Story of an Escape Across Russia,8868250,by Jacques Sandulescu,True,Jacques Sandulescu
192,goodreads,205189,2,Misty,2009-08-05T00:00:00Z,"Okay - I can't believe it, but I just found it...",solved / confirmed,Savage Inequalities: Children in America's Sch...,35301,by Jonathan Kozol,True,Jonathan Kozol
...,...,...,...,...,...,...,...,...,...,...,...,...
12155,goodreads,23185338,2,tesla,2025-08-02T00:00:00Z,Menace ?,solved,Menace,18968629,by JM Darhower part of series: Scarlet Scars,True,JM Darhower
12169,goodreads,23186720,5,Juels,2025-07-30T00:00:00Z,Secret Fire by Johanna Lindsey . This starts i...,solved,Secret Fire,79282,by Johanna Lindsey,True,Johanna Lindsey
12177,goodreads,23187261,2,Becca,2025-07-30T00:00:00Z,A Touch of Shadows by Jessica Thorne?,solved,A Touch of Shadows,32882005,by Jessica Thorne part of series: The Lost Queen,True,Jessica Thorne
12182,goodreads,23187967,2,Genesistrine,2025-07-31T00:00:00Z,Ariel ?,solved,Ariel,157923,by Steven R. Boyett part of series: The Change,True,Steven R. Boyett


In [171]:
records = (df[(df.answer.notna()) & (df.answer != '')][['thread_id', 'answer', 'author']]
           .to_dict('records'))

## Downloading Open Library work dump

The Open Library website offers data dumps at different levels. See: https://openlibrary.org/developers/dumps

The `works` dump contains all records at the work level (https://openlibrary.org/data/ol_dump_works_latest.txt.gz).

The data in the file is tab-separated and has the following columns:

- `type` - type of record (/type/edition, /type/work etc.)
- `key` - unique key of the record (/books/OL1M etc.)
- `revision` - revision number of the record
- `last_modified` - last modified timestamp
- `JSON` - the complete record in JSON format

In [70]:
import gzip
import json


dump_file = '/Volumes/T7_Data/Data/CHIIR-papers/why_ki_is_hard/ol_dump_works_2025-08-31.txt.gz'
headers = ['record_type', 'record_id', 'revision', 'last_modified', 'record']

with gzip.open(dump_file, 'rt') as fh:
    for line in fh:
        row = line.strip('\n').split('\t')
        row_record = {header: row[hi] for hi, header in enumerate(headers)}
        row_record['record'] = json.loads(row_record['record'])
        print(json.dumps(row_record, indent=4))
        break

{
    "record_type": "/type/work",
    "record_id": "/works/OL10000196W",
    "revision": "3",
    "last_modified": "2010-04-28T06:54:19.472104",
    "record": {
        "title": "Les mots-cl\u00e9s de l'\u00e9conomie",
        "created": {
            "type": "/type/datetime",
            "value": "2009-12-11T01:57:19.964652"
        },
        "covers": [
            3146541
        ],
        "last_modified": {
            "type": "/type/datetime",
            "value": "2010-04-28T06:54:19.472104"
        },
        "latest_revision": 3,
        "key": "/works/OL10000196W",
        "authors": [
            {
                "type": "/type/author_role",
                "author": {
                    "key": "/authors/OL3965197A"
                }
            }
        ],
        "type": {
            "key": "/type/work"
        },
        "revision": 3
    }
}


## Querying the Open Library API

Run all answer book titles against the API to get exact matches. This is to sanity check that runing the titles against the data dump returns at least the same exact matches. **Otherwise the API is doing something other than exact match.**

In [119]:
BASE_URL = "https://openlibrary.org/search.json"

def search_openlibrary(title, max_retries: int = 5):
    # url-encoded title and author
    title = re.sub(f' +', '+', title)
    title = urllib.parse.quote_plus(title)
    url = f'{BASE_URL}?q={title}&fields=*'
    response = requests.get(url)
    retry = 0
    while retry < max_retries:
        if response.status_code == 200:
            return response.json()
        else:
            print(response.status_code, f"retry {retry} of {max_retries}, error for title #{title}#")
            retry += 1
            time.sleep(2)

In [120]:
len(records)

1481

Make the API requests:

In [146]:
import time

for ri, record in enumerate(records):
    title = record['answer']
    #record['response_q'] = search_openlibrary(title)
    if pd.isna(record['author']):
        record['response_qta'] = record['response_q']
    else:    
        title_author = f'{title} {record["author"]}'
        record['response_qta'] = search_openlibrary(title_author)
        time.sleep(2)
    if (ri+1) % 10 == 0:
        print(f"{ri+1} of {len(records)} records processed")

10 of 1481 records processed
20 of 1481 records processed
30 of 1481 records processed
40 of 1481 records processed
50 of 1481 records processed
60 of 1481 records processed
70 of 1481 records processed
80 of 1481 records processed
90 of 1481 records processed
100 of 1481 records processed
110 of 1481 records processed
120 of 1481 records processed
130 of 1481 records processed
140 of 1481 records processed
150 of 1481 records processed
160 of 1481 records processed
170 of 1481 records processed
180 of 1481 records processed
190 of 1481 records processed
200 of 1481 records processed
210 of 1481 records processed
220 of 1481 records processed
230 of 1481 records processed
240 of 1481 records processed
250 of 1481 records processed
260 of 1481 records processed
270 of 1481 records processed
280 of 1481 records processed
290 of 1481 records processed
300 of 1481 records processed
310 of 1481 records processed
320 of 1481 records processed
330 of 1481 records processed
340 of 1481 records

In [23]:
import gzip
import glob
import json
import re


BASE_FILENAME = '../../data/books/records-open_library-request_answers'

def get_records_files():
    return glob.glob(f'{BASE_FILENAME}_v*.json.gz')


def get_records_file_version(records_file):
    if m := re.search(r"_v(\d+)\.json.gz$", records_file):
        return int(m.group(1))
    else:
        raise ValueError(f"invalid records_file name {records_file}.")
    return None


def get_records_file_versions():
    records_files = get_records_files()
    return [(get_records_file_version(rf), rf) for rf in records_files]


def get_records_lastest_file():
    version_num = determine_last_version_number()
    return (f'{BASE_FILENAME}_v{version_num}.json.gz')

    
def determine_new_version_number():
    return determine_last_version_number() +1


def determine_last_version_number():
    file_versions = get_records_file_versions()
    if len(file_versions) == 0:
        return 0
    latest = max(version for version, _ in file_versions)
    return latest



Store the results:

In [148]:
version_num = determine_new_version_number()
records_file = f'../../data/books/records-open_library-request_answers_v{version_num}.json.gz'

if len(records) == 1481 and all('response' in record for record in records):
    with gzip.open(records_file, 'wt') as fh:
        json.dump(records, fh)

Read the results from file:

In [25]:
records_file = get_records_lastest_file()

with gzip.open(records_file, 'rt') as fh:
    records = json.load(fh)

## Extracting Genre Information

In [40]:
for record in records:
    #print(record)
    docs = record['response_qta']['docs']
    subjects = [doc['subject'] for doc in docs if 'subject' in doc]
    subjects = [subject for subject_list in subjects for subject in subject_list]
    lccs = [doc['lcc'] for doc in docs if 'lcc' in doc]
    print(lccs)
    print(set(subjects))
    break

[['HV-6626.52000000', 'RJ-0506.00000000.P63'], ['RJ-0505.00000000.M54 M3 1976b', 'RJ-0505.00000000.M54 M3']]
{'Child psychotherapy', 'Mental Disorders', 'Mentally ill children, education', 'Large type books', 'Milieu therapy', 'Case studies', 'Problem children', 'Popular works', 'Child psychology', 'Psychotherapy', 'Children, biography', 'Emotionally disturbed children', 'Mentally ill children', 'In infancy and childhood', 'Education', 'Cases, clinical reports, statistics'}


#### Goodreads genres

The main genres on Goodreads, taken from the main page:

Art
Biography
Business
Children's
Christian
Classics
Comics
Cookbooks
Ebooks
Fantasy
Fiction
Graphic Novels
Historical Fiction
History
Horror
Memoir
Music
Mystery
Nonfiction
Poetry
Psychology
Romance
Science
Science Fiction
Self Help
Sports
Thriller
Travel
Young Adult
More genres


In [69]:
genre_map_gr = {
    'Art': 'Art',
    'Biography': 'Biography',
    'Business': 'Business',
    "Children's": "Children's",
    'Christian': 'Christian',
    'Classics': 'Classics',
    'Comics': 'Comics',
    'Cookbooks': 'Cookbooks',
    'Ebooks': 'Ebooks',
    'Fantasy': 'Fantasy',
    'Fiction': 'Fiction',
    'Graphic Novels': 'Graphic Novels',
    'Historical Fiction': 'Historical Fiction',
    'History': 'History',
    'Horror': 'Horror',
    'Memoir': 'Memoir',
    'Music': 'Music',
    'Mystery': 'Mystery',
    'Nonfiction': 'Nonfiction',
    'Poetry': 'Poetry',
    'Psychology': 'Psychology',
    'Romance': 'Romance',
    'Science': 'Science',
    'Science Fiction': 'Science Fiction',
    'Self Help': 'Self Help',
    'Sports': 'Sports',
    'Thriller': 'Thriller',
    'Travel': 'Travel',
    'Young Adult': 'Young Adult'
    
}

Extend the list of genres with genres found in the thread titles, mapping them to normalised/standardised genre labels:

In [143]:
import os


def extract_genre(title):
    if title is None:
        return None
    for genre in genre_map_gr:
        if genre.lower() in title.lower():
            return genre_map_gr[genre]
    for genre in genre_map_upper:
        if genre in title:
            return genre_map_upper[genre]
    for genre in genre_map_title:
        if genre in title.lower():
            return genre_map_title[genre]
    # Fallback on indicator of fiction
    for genre in ['novel', ' fiction', 'adult']:
        if genre in title.lower():
            return 'Fiction'
    return None
    

genre_map_title = {
    'ya fantasy': 'Young Adult',
    'ya fiction': 'Young Adult',
    'ya romance': 'Young Adult',
    'ya book': 'Young Adult',
    'y/a': 'Young Adult',
    'scifi': 'Science Fiction',
    'sci-fi': 'Science Fiction',
    'sci/fi': 'Science Fiction',
    'science fiction': 'Science Fiction',
    'thriller': 'Thriller',
    'adult fantasy': 'Fantasy',
    'fantasy': 'Fantasy',
    'adult romance': 'Romance',
    'romance novel': 'Romance',
    'erotica': 'Romance',
    'historical romance': 'Romance',
    'historical fiction': 'Historical Fiction',
    'adult fiction': 'Fiction',
    'adult non-fiction': 'Nonfiction',
    'non-fiction': 'Nonfiction',
    'non fiction': 'Nonfiction',
    'memoir': 'Memoir',
    'historical': 'Historical Fiction',
    'ya short': 'Young Adult',
    'sci fi': 'Science Fiction',
    'ya teenage': 'Young Adult',
    'short story': 'Short story',
    'short stories': 'Short story',
    'ya novel': 'Young Adult',
    'suspence': 'Thriller',
    'romantic': 'Romance',
    'rom com': 'Romance',
    'non-fic': 'Nonfiction',
    'crime': 'Thriller',
    'graphic novel': 'Comics',
    'comic book': 'Comics',
    'childrens': "Children's",
    'kids': "Children's",
    'drama': 'Fiction',
    'teen': 'Young Adult',
    "children’s": "Children's",
    'adult novel': 'Fiction'
}

genre_map_upper = {
    'YA': 'Young Adult',
    'SF': 'Science Fiction'
}

parsed_thread_dir = '../../data/books/goodreads_crawl/parsed_threads-2025-08-05/'

parsed_thread_files = glob.glob(os.path.join(parsed_thread_dir, '*.json'))

# for genre in genre_map:
#     title = 'Harlequin historical romance about two sisters one was named cat which was the nick name of the evil twin sister'
#     print([genre, genre in title.lower()])
#     print(extract_genre(title))

no_genre = []
yes_genre = []
thread_genre = {}
for pi, ptf in enumerate(parsed_thread_files):
    with open(ptf, 'rt') as fh:
        thread = json.load(fh)
        title = thread['messages'][0]['thread_title']
        genre = extract_genre(title)
        if genre is None:
            no_genre.append([thread['thread_id'], title, None])
        else:
            yes_genre.append([thread['thread_id'], title, genre])
        thread_genre[thread['thread_id']] = genre

len(yes_genre), len(no_genre)

(2380, 344)

In [174]:
missing = len(no_genre) / (len(yes_genre) + len(no_genre))
print(f"fraction of threads with no genre information: {missing: >.4f}")

fraction of threads with no genre information: 0.1263


### Extracting Genre Popularity Information

Goodreads has public book lists that have a title and description to clarify the scope of the list, and that users can add books to and where they can upvote individual book titles. Users can also add tags to individual lists. The most common tags for list are all genre labels. 

Via the genre tags, the book lists can be used as an indicator of popularity. The number of public book lists with a given genre tag are an indicator of that genre's popularity.

In [175]:
# copy-and-paste of the most popular book tags and in parentheses, the number of book lists with that tag
gr_list_tags = """
romance (8174)fiction (7764)young-adult (5880)fantasy (5348)science-fiction (3687)non-fiction (3111)children (2526)history (2351)covers (2323)mystery (2243)horror (2154)historical-fiction (1922)gay (1820)titles (1715)best (1616)middle-grade (1569)nonfiction (1509)queer (1507)paranormal (1507)historical-romance (1458)lgbt (1447)love (1431)contemporary (1417)thriller (1385)title-challenge (1297)women (1288)lgbtq (1287)biography (1240)title (1190)classics (1176)"""

genres_counts = [item.strip() for item in re.split(r"\((\d+)\)", gr_list_tags.strip()) if item != '']
genre_popularity = {genres_counts[i]: genres_counts[i+1] for i in range(0, len(genres_counts), 2)}
genre_popularity
#len(genres_counts)
#genres_counts

{'romance': '8174',
 'fiction': '7764',
 'young-adult': '5880',
 'fantasy': '5348',
 'science-fiction': '3687',
 'non-fiction': '3111',
 'children': '2526',
 'history': '2351',
 'covers': '2323',
 'mystery': '2243',
 'horror': '2154',
 'historical-fiction': '1922',
 'gay': '1820',
 'titles': '1715',
 'best': '1616',
 'middle-grade': '1569',
 'nonfiction': '1509',
 'queer': '1507',
 'paranormal': '1507',
 'historical-romance': '1458',
 'lgbt': '1447',
 'love': '1431',
 'contemporary': '1417',
 'thriller': '1385',
 'title-challenge': '1297',
 'women': '1288',
 'lgbtq': '1287',
 'biography': '1240',
 'title': '1190',
 'classics': '1176'}

In [176]:
genre_lists = {
    'Romance': 8174,
    'Fiction': 7764,
    'Young Adult': 5880,
    'Fantasy': 5348,
    'Science Fiction': 3687,
    'Nonfiction': 3111,
    "Children's": 2526,
    'History': 2351,
    'Mystery': 2243,
    'Horror': 2154,
    'Historical Fiction': 1922,
    'Paranormal': 1507,
    'Historical-romance': 1458,
    'Thriller': 1385,
    'Biography': 1240,
    'Classics': 1176,
    'Memoir': 1104,
    'Poetry': 973,
    'Art': 897,
    'Christian': 788,
    'Science': 783,
    'Comics': 759,
    'Travel': 690,
    'Music': 563,
    'Psychology': 512,
    'Business': 438,
    'Self Help': 432,
    'Short story': 190,
    'Sports': 125
}



In [177]:
import requests
from collections import Counter

genre_freq = Counter([genre_info[-1] for genre_info in yes_genre])
for genre, freq in genre_freq.most_common():
    if genre in genre_lists:
        #print(genre, freq, genre_pop[genre])
        continue
    print(genre, freq)


## Extracting Popularity Information

In [212]:
popularity_fields = [
    'edition_count',
    'osp_count',
    'ratings_count'
]

popularity_len_fields = [
    'language'
]
doc
record = records[-1]
for doc in record['response_qta']['docs']:
    if 'ratings_average' in doc:
        print(json.dumps(doc, indent=4))

{
    "author_alternative_name": [
        "Francis Bacon",
        "Sir Francis Bacon",
        "Francis Bacon Viscount St. Albans",
        "Bacon, Francis Viscount St. Albans",
        "Francis Bacon, Baron Verulam, Viscount St. Albans, etc",
        "Bacon, Francis viscount St. Albans",
        "1626",
        "Francis Bacon, John Milton, Thomas Browne",
        "Francis Bacon, John Milton, Sir Thomas Browne"
    ],
    "author_key": [
        "OL23720A"
    ],
    "author_name": [
        "Francis Bacon"
    ],
    "contributor": [
        "Rawley, William, 1588?-1667",
        "Va\u0301zquez, Juan Adolfo, 1917- ed. and tr.",
        "Smith, G. C. Moore 1858-1940",
        "Haines, Richard, 1633-1685",
        "Rawley, William, 1588?-1667.",
        "R. H.",
        "R., M.",
        "Bartolozzi, Roberto, ed. and tr.",
        "R.H.",
        "George Fabyan Collection (Library of Congress)",
        "R. H",
        "Haines, Richard, 1633-1685.",
        "Raquet, Gilles Bernard",
 

## checking missing titles against data dump

The titles for which there are not matching records, check if we can find a near match.

- TO DO: define near match. 
- TO DO: for titles with matches, check if author names have near matches as well

In [178]:
from unidecode import unidecode

from fuzzy_search.tokenization.token import Tokenizer


def normalise_string(text_string: str):
    if text_string is None or pd.isna(text_string):
        return ''
    text_string = unidecode(text_string)
    tokens = tokenizer.tokenize(text_string)
    return ' '.join([token.n for token in tokens])


def is_flex_match_doc(answer_title, answer_author, doc):
    doc_title = normalise_string(doc['title'])
    has_match = False
    try:
        doc_authors = get_oa_authors(doc, )
    except AttributeError:
        print(doc['author_name'])
        raise
    if answer_author is not None and len(doc_authors) > 0:
        return any(is_flex_match(answer_author, doc_author) for doc_author in doc_authors)
    return is_flex_match(answer_title, doc_title)


def get_oa_titles(record, response_field):
    return [normalise_string(doc['title']) for doc in record[response_field]['docs']]


def get_oa_authors(doc, normalise: bool = True):
    if 'author_name' not in doc:
        return []
    if normalise is True:
        return [normalise_string(author) for author in doc['author_name']]
    else:
        return [author for author in doc['author_name']]


def is_flex_match(answer_string, oa_string):
    return answer_string == oa_string or answer_string in oa_string or oa_string in answer_string


def get_flex_match_doc(record, response_field):
    answer_title = normalise_string(record['answer'])
    answer_author = normalise_string(record['author'])
    for doc in record[response_field]['docs']:
        if is_flex_match_doc(answer_title, answer_author, doc):
            return doc
    #print("unclear match:", answer_title, oa_titles)
    #print(answer_author, oa_authors)
    return None
        



For each record we extract three variables:

1. `readinglog_count` to indicate book item popularity
2. `first_publish_year` to indicate book recency
3. book genre and genre popularity

In [150]:
response_field = 'response_qta'
record = records[0]
no_match = 0
tokenizer = Tokenizer(ignorecase=True, remove_punctuation=True)

readinglog_rows = []
genre_rows = []
for ri, record in enumerate(records):
    if record[response_field]['num_found'] == 0:
        response_field = 'response_q'
    if record[response_field]['num_found'] == 0:
        continue
    doc = get_flex_match_doc(record, response_field)
    if doc is not None:
        readinglog_count = doc['readinglog_count'] if 'readinglog_count' in doc else 0
        answer_id = f"https://openlibrary.org{doc['key']}"
        doc_title = doc['title']
        first_publish_year = doc['first_publish_year'] if 'first_publish_year' in doc else None
        doc_author = '; '.join(get_oa_authors(doc, normalise=False))
        #print(json.dumps(doc, indent=4))
        #break
    else:
        readinglog_count = 0
        answer_id = None
        doc_title = None
        doc_author = None
        first_publish_year = None
        no_match += 1
        print(f"no match {no_match}: record num {ri}\treadinglog_count: {readinglog_count}")
    genre = thread_genre[record['thread_id']]
    genre_pop = genre_lists[genre] if genre in genre_lists else None
    row = [
        ri, record['thread_id'], record['answer'], record['author'], 
        answer_id, doc_title, doc_author, readinglog_count, first_publish_year,
        genre, genre_pop
    ]
    expected_rows = 11
    if len(row) > expected_rows:
        print(f"row has {len(row)} rows instead of {expected_rows}")
        print(row)
        break
    readinglog_rows.append(row)


no match 1: record num 12	readinglog_count: 0
no match 2: record num 16	readinglog_count: 0
no match 3: record num 25	readinglog_count: 0
no match 4: record num 41	readinglog_count: 0
no match 5: record num 59	readinglog_count: 0
no match 6: record num 96	readinglog_count: 0
no match 7: record num 129	readinglog_count: 0
no match 8: record num 161	readinglog_count: 0
no match 9: record num 165	readinglog_count: 0
no match 10: record num 200	readinglog_count: 0
no match 11: record num 212	readinglog_count: 0
no match 12: record num 265	readinglog_count: 0
no match 13: record num 287	readinglog_count: 0
no match 14: record num 289	readinglog_count: 0
no match 15: record num 305	readinglog_count: 0
no match 16: record num 362	readinglog_count: 0
no match 17: record num 372	readinglog_count: 0
no match 18: record num 374	readinglog_count: 0
no match 19: record num 382	readinglog_count: 0
no match 20: record num 422	readinglog_count: 0
no match 21: record num 429	readinglog_count: 0
no matc

In [213]:
readinglog_rows[0]

[0,
 '13195',
 'Lovey: A Very Special Child',
 'Mary MacCracken',
 'https://openlibrary.org/works/OL5262888W',
 'Lovey',
 'Mary MacCracken',
 4,
 1977]

Finally, we write all book feature data to file:

In [151]:
readinglog_cols = [
    'req_no', 'thread_id', 'answer_title', 'answer_author',
    'work_id', 'work_title','work_author', 'readinglog_count', 'first_publish_year',
    'genre', 'genre_popularity'
]
readinglog_df = pd.DataFrame(readinglog_rows, columns=readinglog_cols)
readinglog_df

Unnamed: 0,req_no,thread_id,answer_title,answer_author,work_id,work_title,work_author,readinglog_count,first_publish_year,genre,genre_popularity
0,0,13195,Lovey: A Very Special Child,Mary MacCracken,https://openlibrary.org/works/OL5262888W,Lovey,Mary MacCracken,4,1977.0,,
1,1,14207,Clone Catcher,Alfred Slote,https://openlibrary.org/works/OL15926986W,Clone catcher,Alfred Slote,9,1982.0,Young Adult,5880.0
2,2,37787,The Hite Report: A Nationwide Study of Female ...,,https://openlibrary.org/works/OL2815101W,Hite Report Women & Love,Shere Hite,34,1976.0,Fiction,7764.0
3,5,160509,Against Incredible Odds,Arthur Roth,https://openlibrary.org/works/OL16478118W,Against incredible odds,,0,1983.0,,
4,6,178121,Donbas: A True Story of an Escape Across Russia,Jacques Sandulescu,https://openlibrary.org/works/OL7003295W,Donbas,Jacques Sandulescu,2,1968.0,,
...,...,...,...,...,...,...,...,...,...,...,...
1410,1476,23187261,A Touch of Shadows,Jessica Thorne,https://openlibrary.org/works/OL42425465W,Touch of Shadows,Jessica Thorne,0,2025.0,Fantasy,5348.0
1411,1477,23187967,Ariel,Steven R. Boyett,https://openlibrary.org/works/OL1876345W,Ariel,Steven R. Boyett,11,1984.0,Fantasy,5348.0
1412,1478,23190385,The Young Traveler in Sweden,George L. Proctor,https://openlibrary.org/works/OL6939723W,The young traveler in Sweden,George L. Proctor,0,1953.0,Children's,2526.0
1413,1479,23190983,Visions of Darkness,,https://openlibrary.org/works/OL17732W,Bible,Bible,1999,1200.0,Fantasy,5348.0


In [153]:
readinglog_df.to_csv('../../data/books/book_answers-popularity.tsv', sep='\t', index=False)

In [154]:
readinglog_df.readinglog_count.value_counts()

readinglog_count
0       381
1       159
2       112
3        62
4        58
       ... 
2692      1
1919      1
108       1
4478      1
314       1
Name: count, Length: 223, dtype: int64