# Exploratory data analysis and data preprocessing.

## Introduction.

In our work, we will be using information about ten thousand books from the [goodbooks-10k](https://www.kaggle.com/zygmunt/goodbooks-10k?select=books.csv) dataset, hosted on [Kaggle](https://www.kaggle.com). All the information was retrieved from [Goodreads](https://www.goodreads.com/), the world’s largest site for readers and book recommendations.

The great majority of word and sentence embedding techniques are trained on large corpora, often comprising a very large number of books (e.g., the Toronto Book Corpus). This results in representations that, out of the box, perform well on a wide variety of feature-learning tasks, often with little fine-tuning required.

## Dataset exploration.

Before getting started, it is first necessary to load all libraries and dependencies that will be used later in the notebook.

In [1]:
import os
import pandas as pd
import numpy as np
from utils import utils
from utils.paths import *
from pathlib import Path


We also set a seed to guarantee the reproducibility of the experiments.

In [2]:
seed = 0
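
Assigning the variable does not seed any random number generator by itself; it only takes effect where it is passed on, for instance as `random_state` in `sample`. A minimal sketch of that behaviour, using a made-up toy dataframe rather than the real dataset:

```python
import pandas as pd

seed = 0

# Toy frame standing in for the real dataset (values are made up).
frame = pd.DataFrame({"value": range(100)})

# Passing the same seed as `random_state` yields the same rows every time.
first = frame.sample(5, random_state=seed)
second = frame.sample(5, random_state=seed)

print(first.equals(second))
```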

Let us now unzip the archive containing the collection of books.

In [4]:
from shutil import unpack_archive
unpack_archive(PATH_DATASET_ZIP, DIR_DATASET)

In [5]:
filepath = PATH_BOOKS
data = pd.read_csv(filepath, sep='\t')

To sample the first *n* instances of a dataset we can use the `head` function.

In [6]:
data.head(5)

Unnamed: 0,book_id,gr_book_id,gr_best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,0,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,1,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,2,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,3,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,4,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


Alternatively, we can use the `sample` function, which samples *n* random instances of the dataset.

In [7]:
data.sample(5, random_state=seed)

Unnamed: 0,book_id,gr_book_id,gr_best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
9394,9394,38703,38703,575142,43,385733143,9780386000000.0,Louis Sachar,2006.0,Small Steps,...,11837,13095,1387,267,1177,4066,4471,3114,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...
898,898,53835,53835,1959512,836,159308143X,9781593000000.0,"Edith Wharton, Maureen Howard",1920.0,The Age of Innocence,...,102646,114994,5051,2359,6549,25631,42542,37913,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...
2398,2398,43893,43893,1443364,47,765344300,9780765000000.0,Terry Goodkind,2003.0,"Naked Empire (Sword of Truth, #8)",...,39682,42066,548,1519,3639,9953,12891,14064,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...
5906,5906,31244,31244,2888469,378,375761144,9780376000000.0,Charles Dickens,1865.0,Our Mutual Friend,...,18599,20659,1102,434,986,3803,6936,8500,https://images.gr-assets.com/books/1403189244m...,https://images.gr-assets.com/books/1403189244s...
2343,2343,497199,497199,1132770,80,876852630,9780877000000.0,Charles Bukowski,1975.0,Factotum,...,37376,40444,1213,457,1875,8979,16585,12548,https://images.gr-assets.com/books/1407706616m...,https://images.gr-assets.com/books/1407706616s...


The dataset has 23 different features. Let us print the names of all the features in the dataset.

In [8]:
data.columns

Index(['book_id', 'gr_book_id', 'gr_best_book_id', 'work_id', 'books_count',
       'isbn', 'isbn13', 'authors', 'original_publication_year',
       'original_title', 'title', 'language_code', 'average_rating',
       'ratings_count', 'work_ratings_count', 'work_text_reviews_count',
       'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5',
       'image_url', 'small_image_url'],
      dtype='object')

Here is a brief description of the features in the dataset.

* `books_count` represents the number of editions for a given work.

* `gr_best_book_id` contains the identifier of the most popular edition for a given work.

* Columns `book_id`, `gr_book_id`, `gr_best_book_id`, `work_id`, `isbn` and `isbn13` are different identifiers for the book. As we will see later, the book overviews are not included in this dataset and were obtained by scraping. Each overview is identified by its `gr_book_id`, which therefore links both sources of information. Let us first check that it is a valid identifier (i.e., there are no null values and all identifiers are unique).

In [9]:
print(f"Unique values in 'gr_book_id' column: {len(data.gr_book_id.unique())}")
print(f"Print null values in 'gr_book_id' column {data.gr_book_id[data.gr_book_id.isna()]}")

Unique values in 'gr_book_id' column: 10000
Print null values in 'gr_book_id' column Series([], Name: gr_book_id, dtype: int64)


* `gr_book_id` is, in fact, a valid identifier, thus it is the one that we will be using. Furthermore, we also now know that there are no duplicated instances in the dataset, since at least one of its features has no repeated values. The remaining identifier columns can be deleted, as they do not provide any more meaningful information for the tasks that we are to perform.

* As the name suggests, `authors` contains the names of the authors of the book.

* `original_publication_year` indicates the year in which the book was published. We will not be using this information.

* `title` is the English title of the book.

* `original_title` is the title of the book in its original language. We are primarily concerned with English textual information, hence `title` is a more suitable feature.

* `language_code` indicates the textual code assigned to the language of the book. This feature is particularly useful because it will help us get rid of non-English books.

* `average_rating` is a floating-point value indicating the average rating of a book, ranging from 1 to 5. This feature does not provide relevant information for semantic search, thus it will be discarded. That notwithstanding, it could be used as a criterion to filter the query results, prioritizing those that have better ratings.

* `ratings_count` indicates the number of registered ratings for a book. Analogously, `work_ratings_count` and `work_text_reviews_count` indicate the number of ratings and reviews a work has in the platform, respectively. None of this information is useful for our work.

* `ratings_1`, `ratings_2`, `ratings_3`, `ratings_4` and `ratings_5` hold the counts for each rating value. Again, these features do not provide any relevant information to perform semantic search.

* `image_url` and `small_image_url` contain links to pictures of the book cover. Since images cannot be displayed in CLIs, we will discard this information too.
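
The validity check on `gr_book_id` described above can also be written as a small reusable predicate. The sketch below is only illustrative: `is_valid_identifier` is a hypothetical helper (it is not part of the `utils` module), and the toy dataframe holds made-up values.

```python
import pandas as pd

def is_valid_identifier(frame: pd.DataFrame, column: str) -> bool:
    """Return True when `column` has no missing values and no repeated values."""
    values = frame[column]
    return bool(values.notna().all() and values.is_unique)

# Toy frame standing in for the real `data` dataframe (values are made up).
toy = pd.DataFrame({"gr_book_id": [3, 2657, 4671], "isbn": ["a", None, "a"]})

print(is_valid_identifier(toy, "gr_book_id"))
print(is_valid_identifier(toy, "isbn"))
```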

## Remove useless features.

Let's drop all the features that we will not use.

In [10]:
columns_to_drop = set(['book_id', 'gr_best_book_id', 'work_id', 'books_count',
       'isbn', 'isbn13', 'original_publication_year', 'original_title',
       'average_rating','ratings_count', 'work_ratings_count',
       'work_text_reviews_count', 'ratings_1', 'ratings_2', 'ratings_3',
       'ratings_4', 'ratings_5', 'image_url', 'small_image_url'])

In [11]:
data = data.drop(columns_to_drop, axis=1)
data.sample(5, random_state=seed)

Unnamed: 0,gr_book_id,authors,title,language_code
9394,38703,Louis Sachar,"Small Steps (Holes, #2)",eng
898,53835,"Edith Wharton, Maureen Howard",The Age of Innocence,eng
2398,43893,Terry Goodkind,"Naked Empire (Sword of Truth, #8)",en-GB
5906,31244,Charles Dickens,Our Mutual Friend,eng
2343,497199,Charles Bukowski,Factotum,


## Integrate book overviews into the dataset.

Now that all useless characteristics have been deleted, let's append the overviews to the dataframe. The book overviews are stored in a directory, one `txt` file for each overview. We will first generate a dataframe containing all txt files in the directory. The filename for each `txt` file is the `gr_book_id` identifier.

In [12]:
book_overviews = utils.generate_dataframe_from_sparse_txts(DIR_OVERVIEW)
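
The implementation of `generate_dataframe_from_sparse_txts` lives in the `utils` module and is not shown here. As a rough idea of what such a helper could look like, here is a minimal sketch under our own assumptions: the function name `overviews_to_dataframe` is hypothetical, files are assumed to be UTF-8 encoded, and the filename (without extension) is assumed to be a numeric `gr_book_id`.

```python
from pathlib import Path

import pandas as pd

def overviews_to_dataframe(directory) -> pd.DataFrame:
    """Read every `<gr_book_id>.txt` file in `directory` into a two-column frame."""
    rows = []
    for path in sorted(Path(directory).glob("*.txt")):
        rows.append({
            # The filename without its extension is the identifier.
            "gr_book_id": int(path.stem),
            "overview": path.read_text(encoding="utf-8"),
        })
    return pd.DataFrame(rows, columns=["gr_book_id", "overview"])
```

Casting the identifier to `int` keeps it comparable with the integer `gr_book_id` column of the books dataframe when both are merged later.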

In [13]:
print(f"Number of overviews in the dataset: {book_overviews.overview.shape[0]}")
book_overviews

Number of overviews in the dataset: 9956


Unnamed: 0,gr_book_id,overview
0,1,When Harry Potter and the Half-Blood Prince op...
1,10,"Six years of magic, adventure, and mystery mak..."
2,10000191,"À sa naissance, Lisbeth est enlevée à sa mère ..."
3,10006,The discovery of a mysterious notebook turns a...
4,1000751,When orphaned 11-year-old Pollyanna comes to l...
...,...,...
9951,9995135,"At long last, New York Times bestselling autho..."
9952,99955,Paine's daring prose paved the way for the Dec...
9953,9998,"The Woman in the Dunes, by celebrated writer a..."
9954,9998705,"FLASH! Illuminated by lightning, a lifeless hu..."


There are $9956$ book overviews, which is fewer than the number of instances in the other dataframe. Consequently, at least $44$ books will have no overview. There are several strategies to merge both dataframes. In this case, we will allow books with no overview (*left join* operation).

In [14]:
data = pd.merge(data, book_overviews, on='gr_book_id', how='left')
data

Unnamed: 0,gr_book_id,authors,title,language_code,overview
0,2767052,Suzanne Collins,"The Hunger Games (The Hunger Games, #1)",eng,Winning will make you famous. Losing means cer...
1,3,"J.K. Rowling, Mary GrandPré",Harry Potter and the Sorcerer's Stone (Harry P...,eng,Harry Potter's life is miserable. His parents ...
2,41865,Stephenie Meyer,"Twilight (Twilight, #1)",en-US,About three things I was absolutely positive.F...
3,2657,Harper Lee,To Kill a Mockingbird,eng,The unforgettable novel of a childhood in a sl...
4,4671,F. Scott Fitzgerald,The Great Gatsby,eng,"On its first publication in 1925, The Great Ga..."
...,...,...,...,...,...
9995,7130616,Ilona Andrews,"Bayou Moon (The Edge, #2)",eng,"The Edge lies between worlds, on the border be..."
9996,208324,Robert A. Caro,"Means of Ascent (The Years of Lyndon Johnson, #2)",eng,"Robert A. Caro's life of Lyndon Johnson, which..."
9997,77431,Patrick O'Brian,The Mauritius Command,eng,"""O'Brian's Aubrey-Maturin volumes actually con..."
9998,8565083,Peggy Orenstein,Cinderella Ate My Daughter: Dispatches from th...,eng,The acclaimed author of the groundbreaking bes...
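
To find out exactly how many books end up without an overview after a left join, `pd.merge` offers the `indicator` argument, and `validate` guards against accidentally duplicated keys. The sketch below uses two made-up toy dataframes, not the real data:

```python
import pandas as pd

# Toy frames standing in for the books and the scraped overviews (made-up values).
books = pd.DataFrame({"gr_book_id": [3, 2657, 4671], "title": ["A", "B", "C"]})
overviews = pd.DataFrame({"gr_book_id": [3, 4671], "overview": ["text a", "text c"]})

merged = pd.merge(
    books, overviews,
    on="gr_book_id",
    how="left",               # keep every book, even without an overview
    validate="one_to_one",    # raise if an identifier is repeated on either side
    indicator=True,           # adds a `_merge` column: 'both' or 'left_only'
)

# Rows flagged 'left_only' are books for which no overview was found.
missing = merged[merged["_merge"] == "left_only"]
print(len(merged), len(missing))
```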


## Remove instances with invalid language codes.

Let us now check whether there is noisy data in any of the selected features. Starting with the language code, we need to make sure that all data fed into the models is in English, since the models have been trained to derive semantic representations for English texts. To that end, let's see which language codes appear in the dataset.

In [15]:
data.language_code.unique()

array(['eng', 'en-US', 'en-CA', nan, 'spa', 'en-GB', 'fre', 'nl', 'ara',
       'por', 'ger', 'nor', 'jpn', 'en', 'vie', 'ind', 'pol', 'tur',
       'dan', 'fil', 'ita', 'per', 'swe', 'rum', 'mul', 'rus'],
      dtype=object)
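
Besides listing the distinct codes, it can be useful to know how often each one occurs and what share of the rows has no code at all. A minimal sketch on a made-up toy series (the counts do not reflect the real dataset):

```python
import pandas as pd

# Toy column standing in for `data.language_code` (made-up values).
codes = pd.Series(["eng", "en-US", None, "spa", "eng", None], name="language_code")

# Frequency of every code; dropna=False keeps the missing values in the tally.
counts = codes.value_counts(dropna=False)

# Fraction of rows without a language code.
share_missing = codes.isna().mean()

print(counts)
print(f"Share of rows without a language code: {share_missing:.2%}")
```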

As can be seen, there are plenty of different languages. However, are the title and the overview of each book actually written in the language indicated in `language_code`? Let's test it on some of the books with `language_code = spa`.

In [16]:
data[data.language_code == 'spa'].sample(n=10, random_state=seed)

Unnamed: 0,gr_book_id,authors,title,language_code,overview
9472,53809,Paulo Coelho,Maktub,spa,"Maktub não é um livro de conselhos, mas uma tr..."
83,7677,Michael Crichton,"Jurassic Park (Jurassic Park, #1)",spa,
9890,1365225,José Emilio Pacheco,Las batallas en el desierto,spa,"Historia de un amor imposible, narración de un..."
3751,140302,Agatha Christie,"Poirot Investiga (Hércules Poirot, #3)",spa,
4508,63032,Roberto Bolaño,2666,spa,"A cuatro profesores de literatura, Pelletier, ..."
9222,61794,Anonymous,La vida del Lazarillo de Tormes,spa,Lázaro es un muchacho desarrapado a quien la m...
3476,31343,Anne Rice,"Pandora (New Tales of the Vampires, #1)",spa,"Anne Rice, creator of the Vampire Lestat, the ..."
5125,53926,Mario Vargas Llosa,Travesuras de la niña mala,spa,¿Cuál es el verdadero rostro del amor?Ricardo ...
1799,22590,"Philip K. Dick, David Alabort, Manuel Espín",Ubik,spa,Ubik (/ˈjuːbᵻk/ EW-bik) is a 1969 science fict...
555,10603,Stephen King,Cujo,spa,"Outside a peaceful town in central Maine, a mo..."


The title and the overview are often, though not always, written in the language indicated in `language_code` (several of the overviews above are in English or Portuguese despite the `spa` code). As a conservative filter, we will only keep the language codes mapped to English texts: `eng`, `en-US`, `en-CA`, `en-GB` and `en`. It is, however, still necessary to check the instances in which the value for the language code is `NaN`.

In [17]:
data[['title', 'overview']][data['language_code'].isna()].sample(n=10, random_state=seed)

Unnamed: 0,title,overview
3241,Born Free: A Lioness of Two Worlds (Story of E...,There have been many accounts of the return to...
3050,Stone Soup,"First published in 1947, this classic picture ..."
4807,The Glass Magician (The Paper Magician Trilogy...,Three months after returning Magician Emery Th...
9918,Nothing's Fair in Fifth Grade,Jenny knows one thing for sure - Elsie Edwards...
3971,Experiencing God: Knowing and Doing the Will o...,Most Bible studies help people; this one chang...
9772,The Voyages of Doctor Dolittle (Doctor Dolittl...,"The delightfully eccentric Doctor Dolittle, re..."
8179,First Love,An extraordinary portrait of true love that wi...
2048,"Ramona the Pest (Ramona, #2)",This is the second title in the hugely popular...
9559,"Relentless (The Lost Fleet, #5)","After successfully freeing Alliance POWs, ""Bla..."
9240,"Truth Will Prevail (The Work and the Glory, #3)",


More than $10\%$ of the books have no language code. We verified that they are all in English, so they do not have to be deleted. Once the filter is applied, the `language_code` feature is no longer needed.

In [18]:
eng_lc = set(['en', 'en-CA', 'en-US', 'en-GB', 'eng'])

data = data[(data.language_code.isin(eng_lc)) | (data.language_code.isna())].drop('language_code', axis=1)
data.sample(n=10, random_state=seed)

Unnamed: 0,gr_book_id,authors,title,overview
7203,342994,"Hans Christian Andersen, Rachel Isadora",The Little Match Girl,The wares of the poor little match girl illumi...
8399,53200,Stephen Hawking,Black Holes and Baby Universes,NY Times bestseller. 13 extraordinary essays s...
8179,17899392,"James Patterson, Emily Raymond",First Love,An extraordinary portrait of true love that wi...
7047,2033217,Daniel Silva,"Moscow Rules (Gabriel Allon, #8)",Now the death of a journalist leads Allon to R...
1091,17288661,John Grisham,Sycamore Row,Seth Hubbard is a wealthy man dying of lung ca...
2050,13872,Katherine Dunn,Geek Love,"Geek Love is the story of the Binewskis, a car..."
8558,1015311,Ken Akamatsu,"Love Hina, Vol. 01","At the age of 5, Keitaro and his childhood swe..."
6090,21849362,J.R. Ward,"The Shadows (Black Dagger Brotherhood, #13)",Trez “Latimer” doesn’t really exist. And not j...
6774,7389,"Brian K. Vaughan, Adrian Alphona","Runaways, Vol. 1: Pride and Joy (Runaways, #1)","Meet Alex, Karolina, Gert, Chase, Molly and Ni..."
5760,522525,"Carol Tavris, Elliot Aronson",Mistakes Were Made (But Not by Me): Why We Jus...,Why do people dodge responsibility when things...


## Remove noisy data from book titles.

The book titles follow this pattern: $book\_title + (book\_saga\_name \ \#Nº\_book\_saga)$. Both the book title and the saga name can be valuable information. However, the saga number, along with the # symbol, adds nothing and can be removed. The `clean_book_title` method in the `utils` module can remove either all the saga information or just the saga number.

In [19]:
data.title = [utils.clean_book_title(title) for title in data.title.tolist()]
data.title.sample(n=10, random_state=seed)

7203                                The Little Match Girl
8399                       Black Holes and Baby Universes
8179                                           First Love
7047                         Moscow Rules (Gabriel Allon)
1091                                         Sycamore Row
2050                                            Geek Love
8558                                   Love Hina, Vol. 01
6090               The Shadows (Black Dagger Brotherhood)
6774           Runaways, Vol. 1: Pride and Joy (Runaways)
5760    Mistakes Were Made (But Not by Me): Why We Jus...
Name: title, dtype: object
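
The code of `clean_book_title` is not reproduced in this notebook. A regular expression is one possible way to obtain the behaviour shown above; the sketch below is an assumption on our part, with a hypothetical function name (`strip_saga_info`) and a pattern that only handles a trailing `(saga name, #N)` group.

```python
import re

# Trailing "(saga name, #N)" group; other parentheses in the title are left untouched.
_SAGA = re.compile(r"\s*\((?P<saga>[^()]*?),?\s*#[\d.\-]+\)\s*$")

def strip_saga_info(title: str, keep_saga_name: bool = True) -> str:
    """Drop the saga number (and optionally the saga name) from a book title."""
    match = _SAGA.search(title)
    if match is None:
        return title.strip()
    base = title[:match.start()]
    saga = match.group("saga").strip()
    if keep_saga_name and saga:
        return f"{base} ({saga})"
    return base

print(strip_saga_info("Moscow Rules (Gabriel Allon, #8)"))
print(strip_saga_info("Moscow Rules (Gabriel Allon, #8)", keep_saga_name=False))
```

Titles without a saga group, such as `The Little Match Girl`, are returned unchanged.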

## Remove noisy data from book overviews.

Luckily, text is automatically tokenized before being fed into any transformer model. That notwithstanding, there is still some work we need to do to clean our text beforehand, such as removing special characters and extra blank spaces. The built-in `str.maketrans` method comes in handy here, as it creates a mapping table for `str.translate`. The first two arguments can be left empty, while the third argument lists all the characters to remove during the translation process. In addition, we will use Python's `re` module to fix some wrong text patterns by means of regular expressions.
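
As a minimal sketch of how both tools fit together (not the actual `clean_overview` code): the function name `clean_text_sketch`, the set of characters to remove and the regular expressions below are our own illustrative assumptions.

```python
import re

# Assumption: characters to delete; the real helper may use a different set.
_REMOVE_TABLE = str.maketrans("", "", '"“”')

def clean_text_sketch(text: str) -> str:
    """Illustrative cleaning: drop quote characters and normalise spacing."""
    # Delete every character listed in the third argument of str.maketrans.
    text = text.translate(_REMOVE_TABLE)
    # Heuristic: add the space missing between glued sentences ("in.Then" -> "in. Then").
    text = re.sub(r"(?<=[a-z][.!?])(?=[A-Z])", " ", text)
    # Collapse runs of whitespace (including newlines) into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(clean_text_sketch('a "camp," home to five lawyers.And   three former judges'))
```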

For further details, please check the implementation included in the `utils` module for the `clean_overview` method.
Let's see an example:

In [20]:
text = data.overview[data.gr_book_id == 5354].tolist()[0]
print(
    f'BEFORE cleaning:\n {text}\n\n'
    f'AFTER cleaning:\n{utils.clean_overview(text)}')

BEFORE cleaning:
 Trumble is a minimum-security federal prison, a "camp," home to the usual assortment of relatively harmless criminals--drug dealers, bank robbers, swindlers, embezzlers, tax evaders, two Wall Street crooks, one doctor, at least five lawyers.And three former judges who call themselves the Brethren: one from Texas, one from California, and one from Mississippi. They meet each day in the law library, their turf at Trumble, where they write briefs, handle cases for other inmates, practice law without a license, and sometimes dispense jailhouse justice. And they spend hours writing letters. They are fine-tuning a mail scam, and it's starting to really work. The money is pouring in.Then their little scam goes awry. It ensnares the wrong victim, a powerful man on the outside, a man with dangerous friends, and the Brethren's days of quietly marking time are over.

AFTER cleaning:
Trumble is a minimum-security federal prison, a camp, home to the usual assortment of relatively 

In [21]:
# str() guards against missing overviews (NaN floats), which become the literal string 'nan'
data.overview = [utils.clean_overview(str(overview)) for overview in data.overview.tolist()]

Once the preprocessing is done, the dataframe can be exported to a CSV file to avoid repeating these steps every time we need to work with the cleaned data.

In [22]:
data.set_index('gr_book_id').to_csv(os.path.join(DIR_DATASET, 'books_processed.csv'), sep=',')
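
When the cleaned data is needed again, the file can be loaded with `gr_book_id` restored as the index. The sketch below illustrates the round trip on a tiny made-up dataframe and an in-memory buffer instead of the real file:

```python
import io

import pandas as pd

# Toy frame standing in for the processed dataset.
processed = pd.DataFrame({
    "gr_book_id": [2657, 4671],
    "title": ["To Kill a Mockingbird", "The Great Gatsby"],
})

# Export with the identifier as index, as done for the real dataset.
buffer = io.StringIO()
processed.set_index("gr_book_id").to_csv(buffer, sep=",")

# Load it back, telling pandas which column holds the index.
buffer.seek(0)
restored = pd.read_csv(buffer, sep=",", index_col="gr_book_id")
print(restored)
```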

## Annex. Code to perform data scraping.

The webpage for each book follows the format `https://www.goodreads.com/book/show/book_id`. For instance, [https://www.goodreads.com/book/show/320](https://www.goodreads.com/book/show/320) is the page containing information for the book "One Hundred Years of Solitude" by Gabriel García Márquez.

The overview is contained in a `div` element with the class `readable stacked`, which can be found by inspecting the source code of the page.

In [None]:
import requests
from bs4 import BeautifulSoup

def scrap_book_overview(book_id, save=False):
    """Return the overview text of a Goodreads book, or None on failure."""
    try:
        # Connect to the page
        url = f"https://www.goodreads.com/book/show/{book_id}"
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        # Instantiate a BeautifulSoup object
        soup = BeautifulSoup(response.text, 'lxml')
        # Access the component that holds the overview
        sec = soup.find("div", {"class": "readable stacked"})
        # Extract the overview (last <span> of the component)
        overview = sec.find_all('span')[-1].text
    except (requests.RequestException, AttributeError, IndexError):
        # Network error, or the page does not have the expected layout
        return None
    # Store it, should you require it
    if save:
        with open(f"overviews/{book_id}.txt", "w", encoding="utf-8") as file:
            file.write(overview)
    return overview