# Exploratory data analysis and data preprocessing.

### As part of the undergraduate dissertation: Similarity Measures in Natural Language Processing based on Deep Learning Models.

#### David Lorenzo Alfaro


# 0. Introduction

This document covers the first practical assignment of the Data Mining course.

In this first assignment we work on some of the most important aspects of the KDD (Knowledge Discovery from Data) process, such as data storage and loading, exploratory analysis and preprocessing, and the validation of classification models.

To that end, we will learn to manipulate and visualize data with several functions from the pandas and plotly libraries. We will also learn to use classification algorithms such as Zero-R and decision trees with the scikit-learn library.

The goal of the assignment is to learn to load, explore and prepare our data, to train and validate different classification models, and to be able to interpret the results obtained. To achieve this, we have studied and analyzed three synthetic datasets. The first one is the `iris` dataset, which had already been worked through by the course instructors; our task was to analyze and understand everything that had been done with it. This first dataset does not appear in this document.

The second dataset is **`wisconsin`**, and it is the first one covered in this notebook. For this dataset we carry out an exploratory data analysis and a preprocessing stage in line with the findings of that analysis. Next, several classification models are trained and validated, and the results are interpreted.

Finally, the same process is repeated with the **`pima_diabetes`** dataset, which concludes this document and the assignment.

Note that all explanations common to both datasets are given mainly for `wisconsin`, since it is the one that appears first in this notebook. We chose to do so to avoid repeating the shared explanations.

Before starting, we need to load the libraries so that they are available for later use:

In [1]:
import os
import pandas as pd
import numpy as np
import paths, utils
from pathlib import Path



In addition, we set a seed so that the experiments are reproducible:

In [2]:
seed = 27912

The dataset that feeds our semantic search engine is a collection of metadata for ten thousand books available on [Goodreads](https://www.goodreads.com/), the world’s largest site for readers and book recommendations. This information was retrieved using the Goodreads API.

Let us load the dataset to study its features.


In [3]:
filepath = paths.PATH_BOOKS
data = pd.read_csv(filepath, sep='\t')

To display the first *n* instances of a dataset we can use the `head` method.

In [4]:
data.head(5)

Unnamed: 0,book_id,gr_book_id,gr_best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,0,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,1,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,2,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,3,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,4,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


Alternatively, we can use the `sample` method, which draws *n* random instances from the dataset.

In [5]:
data.sample(5, random_state=seed)

Unnamed: 0,book_id,gr_book_id,gr_best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
4325,4325,22584,22584,949696,92,1857983416,9781858000000.0,Philip K. Dick,1974.0,"Flow My Tears, the Policeman Said",...,22063,25396,1187,206,1104,6014,10985,7087,https://images.gr-assets.com/books/1398026028m...,https://images.gr-assets.com/books/1398026028s...
2906,2906,773276,773276,2992071,36,399230033,9780399000000.0,Peggy Rathmann,1994.0,"Good Night, Gorilla",...,37948,38692,796,511,1544,6784,10585,19268,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...
412,412,7190,7190,1263212,1341,,,Alexandre Dumas,1844.0,Les Trois Mousquetaires,...,195274,221481,4974,2176,8195,46090,83254,81766,https://images.gr-assets.com/books/1320436982m...,https://images.gr-assets.com/books/1320436982s...
6042,6042,702539,702539,2504855,53,006008216X,9780060000000.0,Elmore Leonard,1990.0,Get Shorty,...,15747,16863,505,164,663,3832,7374,4830,https://images.gr-assets.com/books/1330673682m...,https://images.gr-assets.com/books/1330673682s...
4512,4512,14965,14965,3165570,57,316014281,9780316000000.0,Anita Shreve,2004.0,Light on Snow,...,22845,24610,1724,286,2005,9451,9337,3531,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...
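
The difference between the two calls can be illustrated on a small, made-up dataframe (the column names and values below are placeholders, not taken from the Goodreads data). `head` always returns the first rows, whereas `sample` draws rows at random and only returns the same rows across runs when `random_state` is fixed:

```python
import pandas as pd

seed = 27912
toy = pd.DataFrame({"book": list("ABCDEFGH"), "rating": range(8)})

first_rows = toy.head(3)                     # always the first three rows
draw_a = toy.sample(3, random_state=seed)    # random rows, fixed by the seed
draw_b = toy.sample(3, random_state=seed)    # same seed, hence the same rows

print(first_rows.index.tolist())
print(draw_a.equals(draw_b))
```

This is why the same `seed` is passed to every `sample` call in this notebook: the rows shown stay comparable from one cell to the next.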


As can be observed, the dataset has 23 different features. However, only the first and last 10 features are displayed. Let us print the names of all the features in the dataset.

In [6]:
data.columns

Index(['book_id', 'gr_book_id', 'gr_best_book_id', 'work_id', 'books_count',
       'isbn', 'isbn13', 'authors', 'original_publication_year',
       'original_title', 'title', 'language_code', 'average_rating',
       'ratings_count', 'work_ratings_count', 'work_text_reviews_count',
       'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5',
       'image_url', 'small_image_url'],
      dtype='object')
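
If the goal is to see the values of every column rather than just their names, the pandas display limit can be lifted temporarily. The following is a minimal sketch on a made-up 30-column dataframe; `wide` and its column names are placeholders:

```python
import pandas as pd

wide = pd.DataFrame([range(30)], columns=[f"col_{i}" for i in range(30)])

# Inside the context manager no column is hidden behind '...';
# the previous display setting is restored when the block exits.
with pd.option_context("display.max_columns", None):
    full_view = repr(wide)

print(full_view)
```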

Here is a brief description of the features in the dataset.

* Columns `book_id`, `gr_book_id`, `gr_best_book_id`, `work_id`, `books_count`, `isbn` and `isbn13` are different identifiers for the book. As we will see later, the book overviews are not included in this dataset. Each overview is identified by `gr_book_id`, which is therefore the link between both sources of information. Let us first check that it is a valid identifier (i.e., there are no null values and all identifiers are unique).

In [7]:
print(f"Unique values in 'gr_book_id' column: {len(data.gr_book_id.unique())}")
print(f"Print null values in 'gr_book_id' column {data.gr_book_id[data.gr_book_id.isna()]}")

Unique values in 'gr_book_id' column: 10000
Print null values in 'gr_book_id' column Series([], Name: gr_book_id, dtype: int64)
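
The same two conditions can also be wrapped in a small helper. The following is a minimal sketch on made-up series; `is_valid_identifier` is a hypothetical name and is not part of the `utils` module:

```python
import pandas as pd

def is_valid_identifier(column: pd.Series) -> bool:
    """Return True when the column has no missing values and no repeated values."""
    return bool(column.notna().all() and column.is_unique)

good_ids = pd.Series([2767052, 3, 41865], name="gr_book_id")
repeated_ids = pd.Series([3, 3, 41865], name="gr_book_id")
missing_ids = pd.Series([3, None, 41865], name="gr_book_id")

for candidate in (good_ids, repeated_ids, missing_ids):
    print(is_valid_identifier(candidate))
```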


* `gr_book_id` is, in fact, a valid identifier, thus it is the one that we will be using. Furthermore, we also now know that there are no duplicated instances in the dataset, since at least one of its features has no repeated values. The remaining identifier columns can be deleted, as they do not provide any more meaningful information for the tasks that we are to perform.
* As the name suggests, `authors` contains the names of the authors of the book.

* `original_publication_year` indicates the year in which the book was published. We will not be using this information.

* `title` is the English title of the book.

* `original_title` is the title of the book in its original language. We are primarily concerned with English textual information, hence `title` is a more suitable feature.

* `language_code` indicates the textual code assigned to the language of the book. This feature is particularly useful because it will help us get rid of non-English books.

* `average_rating` is a floating-point value indicating the average rating of a book, ranging from 1 to 5. This feature does not provide relevant information for semantic search, thus it will be discarded. That notwithstanding, it could be used as a criterion to filter the query results, prioritizing those with better ratings.

* `ratings_count` indicates the number of registered ratings for a book. Analogously, `work_ratings_count` and `work_text_reviews_count` indicate the number of ratings and reviews a work has in the platform, respectively. None of this information is useful for our work.

* `ratings_1`, `ratings_2`, `ratings_3`, `ratings_4` and `ratings_5` hold the counts for each rating value. Again, these features do not provide any relevant information for semantic search.

* `image_url` and `small_image_url` contain links to pictures of the book cover. Since images cannot be displayed in CLIs, we will discard this information too.

**Removing useless features.**

Let's drop every feature that is not useful.

In [8]:
columns_to_drop = ['book_id', 'gr_best_book_id', 'work_id', 'books_count',
                   'isbn', 'isbn13', 'original_publication_year', 'original_title',
                   'average_rating', 'ratings_count', 'work_ratings_count',
                   'work_text_reviews_count', 'ratings_1', 'ratings_2', 'ratings_3',
                   'ratings_4', 'ratings_5', 'image_url', 'small_image_url']

In [9]:
data = data.drop(columns_to_drop, axis=1)
data.sample(5, random_state=seed)

Unnamed: 0,gr_book_id,authors,title,language_code
4325,22584,Philip K. Dick,"Flow My Tears, the Policeman Said",eng
2906,773276,Peggy Rathmann,"Good Night, Gorilla",
412,7190,Alexandre Dumas,The Three Musketeers,
6042,702539,Elmore Leonard,"Get Shorty (Chili Palmer, #1)",eng
4512,14965,Anita Shreve,Light on Snow,


**Appending book overviews into the dataset.**

Now that the unneeded features have been removed, let's append the overviews to the dataframe. The book overviews are stored in a directory, one `txt` file per overview, and the name of each file is the corresponding `gr_book_id` identifier. We first build a dataframe from all the `txt` files in that directory.

In [10]:
book_overviews = utils.generate_dataframe_from_sparse_txts(paths.DIR_OVERVIEW)
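
The implementation of `utils.generate_dataframe_from_sparse_txts` is not shown in this notebook. As a rough idea of what such a helper could look like, the sketch below reads every `txt` file of a directory into a two-column dataframe. The function name `load_overviews`, the UTF-8 encoding and the `<gr_book_id>.txt` naming are assumptions made for illustration only:

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_overviews(directory):
    """Build a dataframe with one row per '<gr_book_id>.txt' file (hypothetical helper)."""
    rows = []
    for txt_file in sorted(Path(directory).glob("*.txt")):
        rows.append({
            "gr_book_id": int(txt_file.stem),  # file name without the extension
            "overview": txt_file.read_text(encoding="utf-8"),
        })
    return pd.DataFrame(rows, columns=["gr_book_id", "overview"])

# Usage on a throwaway directory holding two made-up overviews.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "3.txt").write_text("First made-up overview.", encoding="utf-8")
    Path(tmp, "2657.txt").write_text("Second made-up overview.", encoding="utf-8")
    overviews = load_overviews(tmp)

print(overviews)
```

Casting the file stem to `int` keeps the key in the same dtype as `gr_book_id` in the main dataframe, which matters when both dataframes are merged on that column.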

In [11]:
print(f"Number of overviews in the dataset: {book_overviews.overview.shape[0]}")
book_overviews

Number of overviews in the dataset: 9956


Unnamed: 0,gr_book_id,overview
0,1,When Harry Potter and the Half-Blood Prince op...
1,10,"Six years of magic, adventure, and mystery mak..."
2,10000191,"À sa naissance, Lisbeth est enlevée à sa mère ..."
3,10006,The discovery of a mysterious notebook turns a...
4,1000751,When orphaned 11-year-old Pollyanna comes to l...
...,...,...
9951,9995135,"At long last, New York Times bestselling autho..."
9952,99955,Paine's daring prose paved the way for the Dec...
9953,9998,"The Woman in the Dunes, by celebrated writer a..."
9954,9998705,"FLASH! Illuminated by lightning, a lifeless hu..."


As can be seen, there are $9956$ book overviews, fewer than the $10000$ instances in the other dataframe. Consequently, at least $44$ books have no overview. There are several strategies to merge both dataframes; in this case, we allow books with no overview (*left join* operation).

In [12]:
data = pd.merge(data, book_overviews, on='gr_book_id', how='left')
data

Unnamed: 0,gr_book_id,authors,title,language_code,overview
0,2767052,Suzanne Collins,"The Hunger Games (The Hunger Games, #1)",eng,Winning will make you famous. Losing means cer...
1,3,"J.K. Rowling, Mary GrandPré",Harry Potter and the Sorcerer's Stone (Harry P...,eng,Harry Potter's life is miserable. His parents ...
2,41865,Stephenie Meyer,"Twilight (Twilight, #1)",en-US,About three things I was absolutely positive.F...
3,2657,Harper Lee,To Kill a Mockingbird,eng,The unforgettable novel of a childhood in a sl...
4,4671,F. Scott Fitzgerald,The Great Gatsby,eng,"On its first publication in 1925, The Great Ga..."
...,...,...,...,...,...
9995,7130616,Ilona Andrews,"Bayou Moon (The Edge, #2)",eng,"The Edge lies between worlds, on the border be..."
9996,208324,Robert A. Caro,"Means of Ascent (The Years of Lyndon Johnson, #2)",eng,"Robert A. Caro's life of Lyndon Johnson, which..."
9997,77431,Patrick O'Brian,The Mauritius Command,eng,"""O'Brian's Aubrey-Maturin volumes actually con..."
9998,8565083,Peggy Orenstein,Cinderella Ate My Daughter: Dispatches from th...,eng,The acclaimed author of the groundbreaking bes...
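
To find out exactly how many books end up without an overview after a left join, `pd.merge` accepts `indicator=True`, which adds a `_merge` column stating the origin of each row. A minimal sketch on made-up data (the toy `books` and `overviews` frames are placeholders):

```python
import pandas as pd

books = pd.DataFrame({"gr_book_id": [1, 2, 3], "title": ["A", "B", "C"]})
overviews = pd.DataFrame({"gr_book_id": [1, 3], "overview": ["text a", "text c"]})

# 'left_only' marks books with no matching overview file.
merged = pd.merge(books, overviews, on="gr_book_id", how="left", indicator=True)

n_without_overview = int((merged["_merge"] == "left_only").sum())
print(n_without_overview)
```

The count is "at least" the difference in sizes because an overview file whose identifier matches no book would be dropped by the left join without reducing the number of unmatched books.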


**Removing instances with invalid language codes.**


Let us now check whether there is noisy data in any of the selected features. Starting with the language code, we need to make sure that all data fed into the models is in English, since the models have been trained to derive semantic representations for English texts. To that end, let's see which language codes appear in the dataset.

In [13]:
data.language_code.unique()

array(['eng', 'en-US', 'en-CA', nan, 'spa', 'en-GB', 'fre', 'nl', 'ara',
       'por', 'ger', 'nor', 'jpn', 'en', 'vie', 'ind', 'pol', 'tur',
       'dan', 'fil', 'ita', 'per', 'swe', 'rum', 'mul', 'rus'],
      dtype=object)
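
`unique` only lists the distinct codes. To see how many books carry each code, including the missing ones, `value_counts(dropna=False)` can be used. A minimal sketch on a made-up series:

```python
import pandas as pd

codes = pd.Series(["eng", "en-US", None, "spa", "eng", None], name="language_code")

# dropna=False keeps missing codes as their own category.
counts = codes.value_counts(dropna=False)

# Fraction of rows without a language code.
missing_share = codes.isna().mean()

print(counts)
print(f"{missing_share:.0%} of the rows have no language code")
```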

As can be seen, there are plenty of different languages. However, are the title and the overview of each book actually written in the language indicated by `language_code`? Let's test it on some of the books with `language_code == 'spa'`.

In [14]:
data[data.language_code == 'spa'].sample(n=10, random_state=seed)

Unnamed: 0,gr_book_id,authors,title,language_code,overview
4508,63032,Roberto Bolaño,2666,spa,"A cuatro profesores de literatura, Pelletier, ..."
47,4381,Ray Bradbury,Fahrenheit 451,spa,Fahrenheit 451 ofrece la historia de un sombrí...
4605,24790,Isabel Allende,Paula,spa,"Paula es el libro más conmovedor, más personal..."
5698,23875,Gabriel García Márquez,El coronel no tiene quien le escriba,spa,El coronel no tiene quien le escriba fue escri...
3718,53447,Ernesto Sabato,El túnel,spa,"Breve e intensa novela publicada en 1948, este..."
9222,61794,Anonymous,La vida del Lazarillo de Tormes,spa,Lázaro es un muchacho desarrapado a quien la m...
1799,22590,"Philip K. Dick, David Alabort, Manuel Espín",Ubik,spa,Ubik (/ˈjuːbᵻk/ EW-bik) is a 1969 science fict...
9890,1365225,José Emilio Pacheco,Las batallas en el desierto,spa,"Historia de un amor imposible, narración de un..."
7755,60142,Mario Vargas Llosa,La ciudad y los perros,spa,"En 1962, La ciudad y los perros recibía el Pre..."
4151,54984,Ildefonso Falcones,La catedral del mar,spa,La Barcelona medieval en tiempos de la constru...


Since the title and the overview seem to be written in the language indicated by `language_code` (with occasional exceptions, such as the English overview of *Ubik* above), we keep only the language codes mapped to English texts: `eng`, `en-US`, `en-CA`, `en-GB` and `en`. It is, however, still necessary to check the instances whose language code is `NaN`.

In [15]:
data[['title', 'overview']][data['language_code'].isna()].sample(n=10, random_state=seed)

Unnamed: 0,title,overview
9189,City of Bones / City of Ashes / City of Glass ...,\nDon’t miss The Mortal Instruments: City of B...
9791,The Knockoff,"An outrageously stylish, wickedly funny novel ..."
8808,The Testament of Mary,"Provocative, haunting, and indelible, Colm Tói..."
5669,Love's Executioner and Other Tales of Psychoth...,The collection of ten absorbing tales by maste...
5691,"Alex Cross, Run (Alex Cross, #20)","Kill Alex Cross was ""Patterson at the top of h..."
6594,Tom's Midnight Garden,"Lying awake at night, Tom hears the old grandf..."
2764,Hinds' Feet on High Places,"With over 2 million copies sold, Hinds’ Feet o..."
9935,The Bridge Across Forever: A True Love Story,More than one year on the New York Times bests...
4163,Maine,"In her best-selling debut, Commencement, J. Co..."
8837,Lost Horizon,"While attempting to escape a civil war, four p..."


More than $10\%$ of the instances have no language code. We verified that they are all in English, so they do not have to be deleted. Once the non-English books are filtered out, the `language_code` feature is no longer needed.

In [16]:
eng_lc = {'en', 'en-CA', 'en-US', 'en-GB', 'eng'}

data = data[(data.language_code.isin(eng_lc)) | (data.language_code.isna())].drop('language_code', axis=1)
data.sample(n=10, random_state=seed)

Unnamed: 0,gr_book_id,authors,title,overview
8893,6497645,"Steve Berry, Scott Brick","The Paris Vendetta (Cotton Malone, #5)","When Napoleon Bonaparte died in exile in 1821,..."
1069,112750,Karen Marie Moning,"Darkfever (Fever, #1)",Librarian Note: Alternate/new cover edition fo...
6888,2137,Michael Cunningham,A Home at the End of the World,"From Michael Cunningham, the Pulitzer Prize-wi..."
9417,15799151,Mason Currey,Daily Rituals: How Artists Work,"Franz Kafka, frustrated with his living quarte..."
5694,687215,Stephen R. Donaldson,The Power That Preserves (The Chronicles of Th...,"""A trilogy of remarkable scope and sophisticat..."
1160,4799,John Steinbeck,Cannery Row,Cannery Row is a book without much of a plot. ...
3416,328854,Jonathan Lethem,Motherless Brooklyn,Lionel Essrog is Brooklyn’s very own self-appo...
8007,238139,J.D. Robb,"Imitation in Death (In Death, #17)","Summer, 2059. A man wearing a cape and a top h..."
4203,6801582,Aprilynne Pike,"Spells (Wings, #2)","""I can't just storm in and proclaim my intenti..."
6181,787660,V.C. Andrews,"Heaven (Casteel, #1)","Of all the folks in the mountain shacks, the C..."
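
The filter above combines two conditions because `isin` never matches a missing value. A minimal sketch on made-up rows shows the effect (the toy titles are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "language_code": ["eng", "spa", None, "en-GB"],
})
english_codes = {"en", "en-CA", "en-US", "en-GB", "eng"}

# Keep rows whose code is English or missing.
keep = toy["language_code"].isin(english_codes) | toy["language_code"].isna()
english_only = toy.loc[keep].drop(columns="language_code")

print(english_only)
```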


**Removing noisy data from the book titles.**

The book titles follow the pattern $book\_title \ (book\_saga\_name, \ \#book\_saga\_number)$, e.g. `The Hunger Games (The Hunger Games, #1)`. Both the book title and the saga name can be valuable information. However, the saga number, along with the `#` symbol, can be removed. I have defined a function called `clean_book_title` that removes either all saga information or just the saga number.

In [17]:
data.title = [utils.clean_book_title(title) for title in data.title.tolist()]
data.title.sample(n=10, random_state=seed)

8893                   The Paris Vendetta (Cotton Malone)
1069                                    Darkfever (Fever)
6888                       A Home at the End of the World
9417                      Daily Rituals: How Artists Work
5694    The Power That Preserves (The Chronicles of Th...
1160                                          Cannery Row
3416                                  Motherless Brooklyn
8007                        Imitation in Death (In Death)
4203                                       Spells (Wings)
6181                                     Heaven (Casteel)
Name: title, dtype: object
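
`utils.clean_book_title` itself is not reproduced in this notebook. A minimal sketch of how such a function might work is shown below, assuming titles end with `(saga name, #number)`. The function name `clean_book_title_sketch`, the regular expression and the `keep_saga` flag are illustrative assumptions, not the actual implementation:

```python
import re

# Matches a trailing "(saga name, #number)" group at the end of a title.
_SAGA_SUFFIX = re.compile(r"\s*\((?P<saga>[^()]*?),?\s*#[\d.\-]+\)\s*$")

def clean_book_title_sketch(title, keep_saga=True):
    """Drop the saga number, and optionally the saga name too (hypothetical helper)."""
    match = _SAGA_SUFFIX.search(title)
    if match is None:
        return title.strip()
    base = title[:match.start()].strip()
    saga = match.group("saga").strip()
    if keep_saga and saga:
        return f"{base} ({saga})"
    return base

print(clean_book_title_sketch("The Paris Vendetta (Cotton Malone, #5)"))
print(clean_book_title_sketch("Darkfever (Fever, #1)", keep_saga=False))
print(clean_book_title_sketch("Cannery Row"))
```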

**Removing noisy data from the book overviews.**

Luckily, text is automatically tokenized before being fed into any transformer model. That notwithstanding, there is still some work to do to clean our text beforehand, such as removing special characters and extra blank spaces. The built-in `str.maketrans` method comes in handy here: it creates a mapping table for `str.translate`. Its first two arguments can be left empty, while the third one lists all the characters to remove during translation. In addition, we use the `re` module to fix some wrong text patterns with regular expressions.
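
As an illustration of the two tools just mentioned, here is a minimal sketch under stated assumptions. The function name `clean_overview_sketch`, the set of removed characters and the two regular expressions are chosen for this example and may differ from what `utils.clean_overview` actually does:

```python
import re

# Empty first and second arguments; the third lists the characters to delete.
_REMOVE_CHARS = str.maketrans("", "", '"*~^<>|')
# A sentence end glued to the next sentence, e.g. "lawyers.And".
_MISSING_SPACE = re.compile(r"(?<=[a-z][.!?])(?=[A-Z])")
# Any run of whitespace, including line breaks.
_WHITESPACE = re.compile(r"\s+")

def clean_overview_sketch(text):
    """Remove unwanted characters and normalize spacing (hypothetical helper)."""
    text = text.translate(_REMOVE_CHARS)
    text = _MISSING_SPACE.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()

print(clean_overview_sketch('a "camp,"  home.And\nthree former judges'))
```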

For further details, please check the implementation included in the `utils` module for the `clean_overview` method.
Let's see an example:

In [18]:
text = data.overview[data.gr_book_id == 5354].tolist()[0]
print(f'BEFORE cleaning:\n {text}\n\nAFTER cleaning:\n{utils.clean_overview(text)}')

BEFORE cleaning:
 Trumble is a minimum-security federal prison, a "camp," home to the usual assortment of relatively harmless criminals--drug dealers, bank robbers, swindlers, embezzlers, tax evaders, two Wall Street crooks, one doctor, at least five lawyers.And three former judges who call themselves the Brethren: one from Texas, one from California, and one from Mississippi. They meet each day in the law library, their turf at Trumble, where they write briefs, handle cases for other inmates, practice law without a license, and sometimes dispense jailhouse justice. And they spend hours writing letters. They are fine-tuning a mail scam, and it's starting to really work. The money is pouring in.Then their little scam goes awry. It ensnares the wrong victim, a powerful man on the outside, a man with dangerous friends, and the Brethren's days of quietly marking time are over.

AFTER cleaning:
Trumble is a minimum-security federal prison, a camp, home to the usual assortment of relatively 

In [37]:
data.overview = [utils.clean_overview(str(overview)) for overview in data.overview.tolist()]
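
One caveat with the cell above: because of the left join, books without an overview hold `NaN`, and `str()` turns that value into the literal string `'nan'`. If an empty string is preferable (an assumption about the intended behaviour), the missing values can be replaced before cleaning. A minimal sketch on a made-up series:

```python
import pandas as pd

overviews = pd.Series(["Some text.", float("nan")], name="overview")

# str() converts the missing value into the three-letter string 'nan'.
as_strings = [str(value) for value in overviews.tolist()]

# Filling missing values first keeps them as empty strings instead.
guarded = overviews.fillna("").tolist()

print(as_strings)
print(guarded)
```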

Once the preprocessing is done, the dataframe can be exported to a CSV file to avoid repeating these steps every time we need to work with the cleaned data.

In [56]:
data.set_index('gr_book_id').to_csv(Path(paths.DIR_DATA) / 'books_processed.csv', sep=',')
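
When the exported file is loaded again, passing `index_col` restores `gr_book_id` as the index. The sketch below checks the round trip on a made-up dataframe written to a temporary directory; the rows and the temporary path are placeholders:

```python
import tempfile
from pathlib import Path

import pandas as pd

processed = pd.DataFrame({
    "gr_book_id": [3, 2657],
    "title": ["Title one", "Title two"],
    "overview": ["Made-up overview, with a comma.", "Another one."],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "books_processed.csv"
    processed.set_index("gr_book_id").to_csv(csv_path)
    # index_col turns the identifier column back into the index.
    reloaded = pd.read_csv(csv_path, index_col="gr_book_id")

print(reloaded)
```

Fields containing commas are quoted automatically by `to_csv`, so overviews with commas survive the round trip.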