#### Problem:

The goal is to build a book recommendation system based on book summaries and their topics.

To that end, we use the [CMU Book Summary Dataset](https://www.cs.cmu.edu/~dbamman/booksummaries.html).

## Libraries

In [29]:
import csv 
import json
import pickle

import pandas as pd
import numpy as np
from collections import Counter # To count frequencies

# Text preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Topic modeling
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Visualizations
import pyLDAvis
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore', category = DeprecationWarning) # Installing pyLDAvis triggers a warning with ipykernel

import sklearn
for lib in [sklearn, pyLDAvis, np, pd]:
    print(lib.__name__, lib.__version__)

sklearn 1.2.1
pyLDAvis 3.4.0
numpy 1.26.4
pandas 2.2.3


In [4]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/danielml/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
!tar -xzvf "/Users/danielml/Documents/Computational/Curso_Bourbaki/Semana_7/booksummaries.tar.gz" -C"/Users/danielml/Documents/Computational/Curso_Bourbaki/Semana_7/"

x booksummaries/
x booksummaries/README
x booksummaries/booksummaries.txt


## Reading and exploring the data

In [6]:
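# The file is tab-separated; read it explicitly as UTF-8 so accented characters survive on any platform
with open("/Users/danielml/Documents/Computational/Curso_Bourbaki/Semana_7/booksummaries/booksummaries.txt", 'r', encoding='utf-8', newline='') as f:
    reader = csv.reader(f, dialect='excel-tab')
    data = list(reader)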

In [7]:
len(data)

16559

In [8]:
title = []
author = []
genre = []
summary = []

# Columns used from each row: 2 = title, 3 = author, 5 = genres (JSON dict), 6 = summary
for row in data:
    title.append(row[2])
    author.append(row[3])
    if row[5] == '':
        genre.append([''])  # no genre information for this book
    else:
        genre.append(list(json.loads(row[5]).values()))
    summary.append(row[6])

df = pd.DataFrame({'Title': title, 'Author': author,
                   'Genre': genre, 'Summary': summary})

print(df.shape)
df.head(5)

(16559, 4)


| | Title | Author | Genre | Summary |
|---|---|---|---|---|
| 0 | Animal Farm | George Orwell | [Roman à clef, Satire, Children's literature, ... | Old Major, the old boar on the Manor Farm, ca... |
| 1 | A Clockwork Orange | Anthony Burgess | [Science Fiction, Novella, Speculative fiction... | Alex, a teenager living in near-future Englan... |
| 2 | The Plague | Albert Camus | [Existentialism, Fiction, Absurdist fiction, N... | The text of The Plague is divided into five p... |
| 3 | An Enquiry Concerning Human Understanding | David Hume | [] | The argument of the Enquiry proceeds by a ser... |
| 4 | A Fire Upon the Deep | Vernor Vinge | [Hard science fiction, Science Fiction, Specul... | The novel posits that space around the Milky ... |


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16559 entries, 0 to 16558
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Title    16559 non-null  object
 1   Author   16559 non-null  object
 2   Genre    16559 non-null  object
 3   Summary  16559 non-null  object
dtypes: object(4)
memory usage: 517.6+ KB


In [10]:
df[['Title', 'Author']].nunique()

Title     16277
Author     4715
dtype: int64

In [11]:
df['Title'].value_counts().head()

Title
Nemesis     6
Outcast     4
Haunted     4
Inferno     4
The Gift    3
Name: count, dtype: int64

Why do some titles have more than one summary?

In [12]:
df[df['Title'] == 'Nemesis']

| | Title | Author | Genre | Summary |
|---|---|---|---|---|
| 375 | Nemesis | Isaac Asimov | [Science Fiction, Speculative fiction, Childre... | The novel is set in an era in which interstel... |
| 3499 | Nemesis | Agatha Christie | [Crime Fiction, Mystery, Children's literature... | Miss Marple receives a post card from the rec... |
| 5157 | Nemesis | Scott Ciencin | [Speculative fiction, Horror] | One of Fred's old friends from graduate schoo... |
| 6159 | Nemesis | Jo Nesbø | [Crime Fiction] | A bank robbery is committed by a lone robber ... |
| 13696 | Nemesis | Philip Roth | [] | Nemesis explores the effect of a 1944 polio e... |
| 13842 | Nemesis | | [] | The story, set in Latium in AD 77, opens with... |
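
The rows above come from different authors, which suggests that repeated titles are mostly distinct books sharing a name rather than duplicated records. A minimal sketch of how that could be checked, using a toy DataFrame (the rows and the `Author X` placeholder are assumptions for illustration, not data taken from the dataset):

```python
import pandas as pd

# Toy catalogue (hypothetical rows modelled on the output above)
books = pd.DataFrame({
    'Title':  ['Nemesis', 'Nemesis', 'Nemesis', 'The Gift', 'The Gift', 'Animal Farm'],
    'Author': ['Isaac Asimov', 'Agatha Christie', 'Philip Roth',
               'Author X', 'Author X', 'George Orwell'],
})

# Rows whose title appears more than once, regardless of author
repeated_titles = books[books.duplicated('Title', keep=False)]

# Rows where both title and author repeat: candidates for true duplicates
same_book = books[books.duplicated(['Title', 'Author'], keep=False)]

print(len(repeated_titles), len(same_book))
```

On the real data, the same two calls on `df` would separate titles shared by different authors from records that may describe the same book twice.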


How many categories does the 'Genre' variable have?

In [13]:
# Count how many books are tagged with each genre
frec_genre = Counter(g for genres in df['Genre'] for g in genres)
genre_dict = dict(frec_genre)

In [14]:
print('Distinct genres: {}\n '.format(len(frec_genre)))

Distinct genres: 228
 


In [15]:
frec_genre.most_common(30)

[('Fiction', 4747),
 ('Speculative fiction', 4314),
 ('', 3718),
 ('Science Fiction', 2870),
 ('Novel', 2463),
 ('Fantasy', 2413),
 ("Children's literature", 2122),
 ('Mystery', 1396),
 ('Young adult literature', 825),
 ('Suspense', 765),
 ('Crime Fiction', 753),
 ('Historical novel', 654),
 ('Thriller', 568),
 ('Horror', 511),
 ('Romance novel', 435),
 ('Historical fiction', 388),
 ('Detective fiction', 341),
 ('Adventure novel', 330),
 ('Non-fiction', 230),
 ('Alternate history', 226),
 ('Spy fiction', 190),
 ('Comedy', 145),
 ('Dystopia', 127),
 ('Autobiography', 124),
 ('Satire', 123),
 ('Gothic fiction', 112),
 ('Comic novel', 104),
 ('Biography', 102),
 ('Novella', 87),
 ('War novel', 87)]

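Note that 3718 summaries (roughly 22% of the 16559 books) have no genre information. They appear under the empty label `''`, so the 228 distinct values counted above correspond to 227 actual genres plus that empty label.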
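
A short sketch of how the empty label could be separated from the real genres when counting. The `genres` list below is toy data standing in for `df['Genre']`, and the variable names are illustrative only:

```python
from collections import Counter

# Toy genre lists (hypothetical); [''] marks a book with no genre, as in the notebook
genres = [
    ['Fiction', 'Satire'],
    [''],
    ['Science Fiction', 'Fiction'],
    [''],
    ['Mystery'],
]

counts = Counter(g for book in genres for g in book)

n_missing = counts['']                    # books without any genre
share_missing = n_missing / len(genres)   # fraction of the corpus without a genre
real_genres = {g: c for g, c in counts.items() if g != ''}

print(n_missing, share_missing, len(real_genres))
```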

In [16]:
df['len Summary'] = df['Summary'].apply(lambda x: len(str(x).split()))
df['len Summary'].describe()

count    16559.000000
mean       429.202126
std        500.339692
min          1.000000
25%        120.000000
50%        263.000000
75%        569.000000
max      10334.000000
Name: len Summary, dtype: float64

In [17]:
df.head()

| | Title | Author | Genre | Summary | len Summary |
|---|---|---|---|---|---|
| 0 | Animal Farm | George Orwell | [Roman à clef, Satire, Children's literature, ... | Old Major, the old boar on the Manor Farm, ca... | 957 |
| 1 | A Clockwork Orange | Anthony Burgess | [Science Fiction, Novella, Speculative fiction... | Alex, a teenager living in near-future Englan... | 998 |
| 2 | The Plague | Albert Camus | [Existentialism, Fiction, Absurdist fiction, N... | The text of The Plague is divided into five p... | 1119 |
| 3 | An Enquiry Concerning Human Understanding | David Hume | [] | The argument of the Enquiry proceeds by a ser... | 2825 |
| 4 | A Fire Upon the Deep | Vernor Vinge | [Hard science fiction, Science Fiction, Specul... | The novel posits that space around the Milky ... | 722 |


In [18]:
df[df['len Summary'] < 10].sort_values('len Summary')

| | Title | Author | Genre | Summary | len Summary |
|---|---|---|---|---|---|
| 16531 | Guardians of Ga'Hoole Book 4: The Siege | Helen Dunmore | [Speculative fiction, Fantasy, Historical novel] | ==Receptio | 1 |
| 11215 | Chucaro: Wild Pony of the Pampa | Francis Kalnay | [Children's literature] | ==Reference | 1 |
| 5879 | The Caverns of Kalte | Joe Dever | [Gamebook, Speculative fiction, Fantasy, Child... | ==Receptio | 1 |
| 5693 | The Deathlord of Ixia | John Grant | [Gamebook, Speculative fiction, Children's lit... | ==Receptio | 1 |
| 5972 | The Eyes of Darkness | Dean Koontz | [Speculative fiction, Horror, Fiction, Romance... | ==Character | 1 |
| ... | ... | ... | ... | ... | ... |
| 13201 | Archform: Beauty | L. E. Modesitt, Jr. | [Science Fiction] | Archform: Beauty is set in 24th century Earth. | 8 |
| 9689 | The Princess Diaries, Volume VII and 3/4: Vale... | Meg Cabot | [Young adult literature] | Mia and Michael share Valentine's Day togethe... | 9 |
| 12201 | The Temple of the Ten | H. Bedford-Jones | [Fantasy] | The novel adventures in the realms of Prester... | 9 |
| 12856 | The Sword of Aldones | Marion Zimmer Bradley | [Science Fiction] | The novel concerns involved intrigue on the p... | 9 |


In [19]:
df = df[df['len Summary'] >= 10].copy().reset_index(drop = True)
df.sort_values('len Summary')

| | Title | Author | Genre | Summary | len Summary |
|---|---|---|---|---|---|
| 11840 | The Abyss of Wonders | Perley Poore Sheehan | [Science Fiction] | The novel concerns a lost race in the Gobi De... | 10 |
| 11810 | Seeds of Life | Eric Temple Bell | [Science Fiction] | The novel concerns the creation of a superman... | 10 |
| 6395 | Bullet Time | David A. McIntee | [Science Fiction] | Sarah Jane Smith encounters the Seventh Docto... | 10 |
| 10853 | Stone Tables | Orson Scott Card | [History, Fiction] | Stone Tables is a novelization of the life of... | 10 |
| 12356 | Yellow Fog | Les Daniels | [Speculative fiction, Horror] | The novel concerns the vampire Don Sebastian ... | 10 |
| ... | ... | ... | ... | ... | ... |
| 14161 | March to the Stars | John Ringo | [Science Fiction] | The story opens in the restored city of Voita... | 6560 |
| 12448 | Dawkins vs. Gould | Kim Sterelny | [] | In the introductory chapter the author points... | 7182 |
| 14619 | Fire World | Chris D'Lacey | [Fantasy] | It opens on the planet Co:pern:ica with Couns... | 7958 |
| 518 | The History of Tom Jones, a Foundling | Henry Fielding | [Fiction, Novel] | The novel's events occupy eighteen books. Squ... | 9055 |


In [20]:
df.loc[12350, 'Summary']

' Upon reaching the City of Elua, Sidonie and Imriel find that there are many people awaiting them. Some, like the Yeshuites and the Tsingani, are there simply because Imriel was foster-son to Phèdre nó Delaunay and Joscelin Verreuil. Also there are small knots of people, each wearing black armbands around one of their arms, signifying death. They all hold out their fists, thumbs pointed downward. Imriel later learns that these people are the families of his mother, Melisande Shahrizai\'s, victims. In the City, Imriel parts ways with Phèdre and Joscelin, declining their offer to stay at the townhouse in favor of confronting Queen Ysandre. Upon reaching the palace, Imriel comes to find that the Queen has seemed to have cooled off since he had last seen her. His room are ready for his use and he takes the opportunity to have a much-needed bath. Hearing a commotion outside his door, he allows his cousin, Mavros Shahrizai, to be admitted to see him. Mavros comes in and begins berating Imri

## Extracting the main topics

## Text vectorization

In [21]:
# Compile the stopword pattern and create the stemmer once, not on every call
stop = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
st = PorterStemmer()

def preprocesar(texto):
    # Convert to lowercase
    texto = texto.lower()

    # Remove stopwords
    texto = stop.sub('', texto)

    # Replace punctuation, digits and any other non-letter character with a space
    texto = re.sub('[^ña-z]+', ' ', texto)

    # Stem each word (Porter stemmer, not a lemmatizer) and keep only words with more than two characters
    texto = ' '.join(st.stem(i) for i in texto.split() if len(i) > 2)

    return texto


In [22]:
df['Summary_pp'] = df['Summary'].apply(preprocesar)
df.head()

| | Title | Author | Genre | Summary | len Summary | Summary_pp |
|---|---|---|---|---|---|---|
| 0 | Animal Farm | George Orwell | [Roman à clef, Satire, Children's literature, ... | Old Major, the old boar on the Manor Farm, ca... | 957 | old major old boar manor farm call anim farm m... |
| 1 | A Clockwork Orange | Anthony Burgess | [Science Fiction, Novella, Speculative fiction... | Alex, a teenager living in near-future Englan... | 998 | alex teenag live near futur england lead gang ... |
| 2 | The Plague | Albert Camus | [Existentialism, Fiction, Absurdist fiction, N... | The text of The Plague is divided into five p... | 1119 | text plagu divid five part town oran thousand ... |
| 3 | An Enquiry Concerning Human Understanding | David Hume | [] | The argument of the Enquiry proceeds by a ser... | 2825 | argument enquiri proce seri increment step sep... |
| 4 | A Fire Upon the Deep | Vernor Vinge | [Hard science fiction, Science Fiction, Specul... | The novel posits that space around the Milky ... | 722 | novel posit space around milki way divid conce... |


In [23]:
vectorizer = CountVectorizer(min_df = 10, max_df = 0.10, ngram_range = (1, 2))
BOW = vectorizer.fit_transform(df['Summary_pp'])
BOW.shape

(16496, 37122)

In [24]:
vocabulario = vectorizer.get_feature_names_out()
len(vocabulario)

37122

In [25]:
vocabulario[200:240]

array(['accident stumbl', 'acclaim', 'accommod', 'accompani',
       'accompani back', 'accompani father', 'accompani journey',
       'accompani two', 'accomplic', 'accomplish', 'accomplish goal',
       'accomplish mission', 'accomplish task', 'accord', 'accord author',
       'accord book', 'accord plan', 'accordingli', 'accost', 'account',
       'account event', 'account life', 'accumul', 'accur', 'accuraci',
       'accus', 'accus murder', 'accus steal', 'accus treason',
       'accustom', 'ace', 'ach', 'achiev', 'achiev goal',
       'achiev success', 'achil', 'acid', 'ackbar', 'acknowledg',
       'acolyt'], dtype=object)

## Training the model

The optimal number of topics depends on the characteristics of the texts being analyzed (their length, the number of distinct ideas they cover).

Nevertheless, some metrics, such as held-out perplexity or topic coherence, can help choose k.
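
One metric that scikit-learn exposes directly is perplexity, via `LatentDirichletAllocation.perplexity`. The following is a minimal sketch of comparing a few candidate values of k on held-out documents. The toy corpus, the candidate list and the helper name `perplexity_by_k` are assumptions for illustration; on the real data, `bow` would be the notebook's `BOW` matrix and each fit would take far longer:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus (hypothetical) with three loosely separated themes
docs = [
    "dragon wizard magic sword kingdom",
    "wizard spell magic castle dragon",
    "kingdom sword knight castle magic",
    "detective murder police clue suspect",
    "murder suspect police trial detective",
    "clue detective crime police murder",
    "spaceship planet alien star galaxy",
    "alien planet galaxy robot spaceship",
    "robot star planet spaceship alien",
] * 4

bow = CountVectorizer().fit_transform(docs)
bow_train, bow_test = train_test_split(bow, test_size=0.25, random_state=42)

def perplexity_by_k(candidates):
    """Fit one LDA model per candidate k and return its held-out perplexity."""
    scores = {}
    for k in candidates:
        lda = LatentDirichletAllocation(n_components=k, random_state=42, max_iter=20)
        lda.fit(bow_train)
        scores[k] = lda.perplexity(bow_test)  # lower is better
    return scores

scores = perplexity_by_k([2, 3, 5])
for k, value in scores.items():
    print(k, round(value, 2))
```

Perplexity does not always agree with how interpretable the topics look, so it is usually read alongside a manual inspection of the top words per topic.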

In [26]:
k = 10

In [27]:
lda_model = LatentDirichletAllocation(n_components = k, learning_method = 'online', random_state = 42, max_iter = 50)

In [28]:
%%time
lda_model.fit(BOW) # Train the model; the document-topic matrix can be obtained afterwards with lda_model.transform(BOW)

CPU times: user 9min 51s, sys: 10min, total: 19min 51s
Wall time: 3min 1s
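
Note that `fit` returns the fitted estimator, not a matrix. The document-topic matrix comes from `transform` (or `fit_transform`). The sketch below shows this on a toy corpus and then uses the topic vectors to find similar documents with cosine similarity. The similarity step and the helper `most_similar` are assumptions about how the topics could feed a recommender, not the notebook's own method:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical)
docs = [
    "dragon wizard magic sword kingdom",
    "wizard spell magic castle dragon",
    "detective murder police clue suspect",
    "murder suspect police trial detective",
    "spaceship planet alien star galaxy",
    "alien planet galaxy robot spaceship",
]

bow = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(bow)                    # returns the estimator itself
doc_topic = lda.transform(bow)  # shape (n_documents, n_topics); each row sums to 1

def most_similar(doc_index, top_n=2):
    """Indices of the top_n documents closest to doc_index in topic space."""
    unit = doc_topic / np.linalg.norm(doc_topic, axis=1, keepdims=True)
    sims = unit @ unit[doc_index]   # cosine similarity to every document
    sims[doc_index] = -np.inf       # exclude the query document itself
    return np.argsort(sims)[::-1][:top_n]

neighbors = most_similar(0)
print(doc_topic.shape, neighbors)
```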


### Saving the model with pickle

In [30]:
path = '/Users/danielml/Documents/Computational/Curso_Bourbaki/Semana_7/'
tuple_models = (lda_model, BOW, vectorizer)
# Use a context manager so the file is flushed and closed once the dump finishes
with open(path + "tuple_model_books_k10.pkl", 'wb') as f:
    pickle.dump(tuple_models, f)
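
For completeness, a minimal sketch of the save and load round trip. Plain Python objects stand in for `(lda_model, BOW, vectorizer)`, and the temporary file name is hypothetical. The same `pickle.load` call would restore the real tuple:

```python
import os
import pickle
import tempfile

# Stand-ins for (lda_model, BOW, vectorizer); any picklable objects behave the same way
tuple_models = ({'n_components': 10}, [[0, 1], [2, 0]], ['farm', 'old'])

# Hypothetical file inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'tuple_model_demo.pkl')

with open(path, 'wb') as f:
    pickle.dump(tuple_models, f)

# Later (for example in a new session): restore the three objects in the same order
with open(path, 'rb') as f:
    model, bow, vocab = pickle.load(f)

print(model, bow, vocab)
```

Pickle files should only be loaded from trusted sources, and scikit-learn models are best unpickled with the same library version used to save them (1.2.1 here).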