# Book Recommendation Project

## Cleaning the dataset

#### Imports needed to clean the dataset

In [1]:
import pandas as pd
import csv
import json

#### Building the dataset

In [2]:
data = []
with open('data.txt', 'r') as f:
    reader = csv.reader(f, dialect='excel-tab')
    for row in reader:
        data.append(row)

df = pd.DataFrame.from_records(data, columns=['book_id', 'freebase_id', 'book_title', 'author', 'publication_date', 'genre', 'summary'])
df

Unnamed: 0,book_id,freebase_id,book_title,author,publication_date,genre,summary
0,620,/m/0hhy,Animal Farm,George Orwell,1945-08-17,"{""/m/016lj8"": ""Roman \u00e0 clef"", ""/m/06nbt"":...","Old Major, the old boar on the Manor Farm, ca..."
1,843,/m/0k36,A Clockwork Orange,Anthony Burgess,1962,"{""/m/06n90"": ""Science Fiction"", ""/m/0l67h"": ""N...","Alex, a teenager living in near-future Englan..."
2,986,/m/0ldx,The Plague,Albert Camus,1947,"{""/m/02m4t"": ""Existentialism"", ""/m/02xlf"": ""Fi...",The text of The Plague is divided into five p...
3,1756,/m/0sww,An Enquiry Concerning Human Understanding,David Hume,,,The argument of the Enquiry proceeds by a ser...
4,2080,/m/0wkt,A Fire Upon the Deep,Vernor Vinge,,"{""/m/03lrw"": ""Hard science fiction"", ""/m/06n90...",The novel posits that space around the Milky ...
...,...,...,...,...,...,...,...
16554,36934824,/m/0m0p0hr,Under Wildwood,Colin Meloy,2012-09-25,,"Prue McKeel, having rescued her brother from ..."
16555,37054020,/m/04f1nbs,Transfer of Power,Vince Flynn,2000-06-01,"{""/m/01jfsb"": ""Thriller"", ""/m/02xlf"": ""Fiction""}",The reader first meets Rapp while he is doing...
16556,37122323,/m/0n5236t,Decoded,Jay-Z,2010-11-16,"{""/m/0xdf"": ""Autobiography""}",The book follows very rough chronological ord...
16557,37132319,/m/0n4bqb1,America Again: Re-becoming The Greatness We Ne...,Stephen Colbert,2012-10-02,,Colbert addresses topics including Wall Stree...
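
The empty `publication_date` and `genre` cells above are empty strings rather than missing values, because `csv.reader` returns every field as text. A minimal sketch of that behaviour, using an in-memory sample in place of `data.txt` (the two records are abbreviated copies of rows shown above; `sample` and `rows` are illustrative names, not part of the notebook):

```python
import csv
import io

# Two tab-separated records; the second has empty date and genre fields.
sample = (
    "620\t/m/0hhy\tAnimal Farm\tGeorge Orwell\t1945-08-17\t"
    '{"/m/06nbt": "Satire"}\tOld Major, the old boar...\n'
    "1756\t/m/0sww\tAn Enquiry Concerning Human Understanding\tDavid Hume\t\t\t"
    "The argument of the Enquiry...\n"
)

rows = list(csv.reader(io.StringIO(sample), dialect="excel-tab"))

# Every record has seven fields; missing values come back as ''.
print([len(r) for r in rows])              # [7, 7]
print(repr(rows[1][4]), repr(rows[1][5]))  # '' ''
```

This is why the genre parser further down tests for `''` instead of `NaN`.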


#### Dropping the freebase_id column

In [3]:
df = df.drop(columns = ['freebase_id'])
df

Unnamed: 0,book_id,book_title,author,publication_date,genre,summary
0,620,Animal Farm,George Orwell,1945-08-17,"{""/m/016lj8"": ""Roman \u00e0 clef"", ""/m/06nbt"":...","Old Major, the old boar on the Manor Farm, ca..."
1,843,A Clockwork Orange,Anthony Burgess,1962,"{""/m/06n90"": ""Science Fiction"", ""/m/0l67h"": ""N...","Alex, a teenager living in near-future Englan..."
2,986,The Plague,Albert Camus,1947,"{""/m/02m4t"": ""Existentialism"", ""/m/02xlf"": ""Fi...",The text of The Plague is divided into five p...
3,1756,An Enquiry Concerning Human Understanding,David Hume,,,The argument of the Enquiry proceeds by a ser...
4,2080,A Fire Upon the Deep,Vernor Vinge,,"{""/m/03lrw"": ""Hard science fiction"", ""/m/06n90...",The novel posits that space around the Milky ...
...,...,...,...,...,...,...
16554,36934824,Under Wildwood,Colin Meloy,2012-09-25,,"Prue McKeel, having rescued her brother from ..."
16555,37054020,Transfer of Power,Vince Flynn,2000-06-01,"{""/m/01jfsb"": ""Thriller"", ""/m/02xlf"": ""Fiction""}",The reader first meets Rapp while he is doing...
16556,37122323,Decoded,Jay-Z,2010-11-16,"{""/m/0xdf"": ""Autobiography""}",The book follows very rough chronological ord...
16557,37132319,America Again: Re-becoming The Greatness We Ne...,Stephen Colbert,2012-10-02,,Colbert addresses topics including Wall Stree...


#### Cleaning the genre column

In [4]:
def parse_genre_entry(genre_info):
    if genre_info == '':
        return []
    genre_dict = json.loads(genre_info)
    genres = list(genre_dict.values())
    return genres

df['genre'] = df['genre'].apply(parse_genre_entry)
df['genre']

0        [Roman à clef, Satire, Children's literature, ...
1        [Science Fiction, Novella, Speculative fiction...
2        [Existentialism, Fiction, Absurdist fiction, N...
3                                                       []
4        [Hard science fiction, Science Fiction, Specul...
                               ...                        
16554                                                   []
16555                                  [Thriller, Fiction]
16556                                      [Autobiography]
16557                                                   []
16558              [Epistolary novel, Speculative fiction]
Name: genre, Length: 16559, dtype: object
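
To see what `parse_genre_entry` does to a single cell, the sketch below applies the same logic to two hand-written inputs. The function is repeated so the block runs on its own; `raw` is an illustrative value shaped like the cells shown earlier.

```python
import json

def parse_genre_entry(genre_info):
    # An empty cell means the book has no genre annotation.
    if genre_info == '':
        return []
    # Keys are Freebase ids, values are human-readable genre names.
    return list(json.loads(genre_info).values())

raw = '{"/m/016lj8": "Roman \\u00e0 clef", "/m/06nbt": "Satire"}'
print(parse_genre_entry(raw))   # ['Roman à clef', 'Satire']
print(parse_genre_entry(''))    # []
```

`json.loads` also decodes the `\u00e0` escape, which is why the cleaned column shows `Roman à clef`.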

#### Listing the unique genres and their total count

In [5]:
# Deduplicate while keeping the order of first appearance
genres = list(dict.fromkeys(g for row in df['genre'] for g in row))
(len(genres), genres)

(227,
 ['Roman à clef',
  'Satire',
  "Children's literature",
  'Speculative fiction',
  'Fiction',
  'Science Fiction',
  'Novella',
  'Utopian and dystopian fiction',
  'Existentialism',
  'Absurdist fiction',
  'Novel',
  'Hard science fiction',
  'Fantasy',
  'War novel',
  'Bildungsroman',
  'Religious text',
  'Picaresque novel',
  'Gothic fiction',
  'Horror',
  'Invasion literature',
  'Mystery',
  'Epistolary novel',
  'Parody',
  'Psychological novel',
  'Farce',
  'Philosophy',
  'Science',
  'Dystopia',
  'Detective fiction',
  'Suspense',
  'Historical fiction',
  'Adventure novel',
  'Humour',
  'Historical novel',
  'Sea story',
  'Cyberpunk',
  'Business',
  'Non-fiction',
  'Economics',
  'Anthropology',
  'Sociology',
  'Romance novel',
  'Poetry',
  'Chivalric romance',
  'High fantasy',
  'Time travel',
  'Scientific romance',
  'Crime Fiction',
  'Juvenile fantasy',
  'Religion',
  'Inspirational',
  'Short story',
  'Techno-thriller',
  'Thriller',
  'Young adult

## Preprocessing

#### Importing the word resources for preprocessing

In [6]:
import re
import nltk
from nltk import WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/antoinecastaing/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/antoinecastaing/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/antoinecastaing/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

#### Creating a column for preprocessing

In [7]:
df['summary_processing'] = df['summary']
df

Unnamed: 0,book_id,book_title,author,publication_date,genre,summary,summary_processing
0,620,Animal Farm,George Orwell,1945-08-17,"[Roman à clef, Satire, Children's literature, ...","Old Major, the old boar on the Manor Farm, ca...","Old Major, the old boar on the Manor Farm, ca..."
1,843,A Clockwork Orange,Anthony Burgess,1962,"[Science Fiction, Novella, Speculative fiction...","Alex, a teenager living in near-future Englan...","Alex, a teenager living in near-future Englan..."
2,986,The Plague,Albert Camus,1947,"[Existentialism, Fiction, Absurdist fiction, N...",The text of The Plague is divided into five p...,The text of The Plague is divided into five p...
3,1756,An Enquiry Concerning Human Understanding,David Hume,,[],The argument of the Enquiry proceeds by a ser...,The argument of the Enquiry proceeds by a ser...
4,2080,A Fire Upon the Deep,Vernor Vinge,,"[Hard science fiction, Science Fiction, Specul...",The novel posits that space around the Milky ...,The novel posits that space around the Milky ...
...,...,...,...,...,...,...,...
16554,36934824,Under Wildwood,Colin Meloy,2012-09-25,[],"Prue McKeel, having rescued her brother from ...","Prue McKeel, having rescued her brother from ..."
16555,37054020,Transfer of Power,Vince Flynn,2000-06-01,"[Thriller, Fiction]",The reader first meets Rapp while he is doing...,The reader first meets Rapp while he is doing...
16556,37122323,Decoded,Jay-Z,2010-11-16,[Autobiography],The book follows very rough chronological ord...,The book follows very rough chronological ord...
16557,37132319,America Again: Re-becoming The Greatness We Ne...,Stephen Colbert,2012-10-02,[],Colbert addresses topics including Wall Stree...,Colbert addresses topics including Wall Stree...


#### Lowercasing the text

In [8]:
df['summary_processing'] = df['summary_processing'].str.lower()
df

Unnamed: 0,book_id,book_title,author,publication_date,genre,summary,summary_processing
0,620,Animal Farm,George Orwell,1945-08-17,"[Roman à clef, Satire, Children's literature, ...","Old Major, the old boar on the Manor Farm, ca...","old major, the old boar on the manor farm, ca..."
1,843,A Clockwork Orange,Anthony Burgess,1962,"[Science Fiction, Novella, Speculative fiction...","Alex, a teenager living in near-future Englan...","alex, a teenager living in near-future englan..."
2,986,The Plague,Albert Camus,1947,"[Existentialism, Fiction, Absurdist fiction, N...",The text of The Plague is divided into five p...,the text of the plague is divided into five p...
3,1756,An Enquiry Concerning Human Understanding,David Hume,,[],The argument of the Enquiry proceeds by a ser...,the argument of the enquiry proceeds by a ser...
4,2080,A Fire Upon the Deep,Vernor Vinge,,"[Hard science fiction, Science Fiction, Specul...",The novel posits that space around the Milky ...,the novel posits that space around the milky ...
...,...,...,...,...,...,...,...
16554,36934824,Under Wildwood,Colin Meloy,2012-09-25,[],"Prue McKeel, having rescued her brother from ...","prue mckeel, having rescued her brother from ..."
16555,37054020,Transfer of Power,Vince Flynn,2000-06-01,"[Thriller, Fiction]",The reader first meets Rapp while he is doing...,the reader first meets rapp while he is doing...
16556,37122323,Decoded,Jay-Z,2010-11-16,[Autobiography],The book follows very rough chronological ord...,the book follows very rough chronological ord...
16557,37132319,America Again: Re-becoming The Greatness We Ne...,Stephen Colbert,2012-10-02,[],Colbert addresses topics including Wall Stree...,colbert addresses topics including wall stree...


#### Removing symbols

In [9]:
df['summary_processing'] = df['summary_processing'].apply(lambda x: re.sub(r'[^\w\s]', ' ', x))
df['summary_processing'] = df['summary_processing'].apply(lambda x: re.sub(' +', ' ', x).strip())
df

Unnamed: 0,book_id,book_title,author,publication_date,genre,summary,summary_processing
0,620,Animal Farm,George Orwell,1945-08-17,"[Roman à clef, Satire, Children's literature, ...","Old Major, the old boar on the Manor Farm, ca...",old major the old boar on the manor farm calls...
1,843,A Clockwork Orange,Anthony Burgess,1962,"[Science Fiction, Novella, Speculative fiction...","Alex, a teenager living in near-future Englan...",alex a teenager living in near future england ...
2,986,The Plague,Albert Camus,1947,"[Existentialism, Fiction, Absurdist fiction, N...",The text of The Plague is divided into five p...,the text of the plague is divided into five pa...
3,1756,An Enquiry Concerning Human Understanding,David Hume,,[],The argument of the Enquiry proceeds by a ser...,the argument of the enquiry proceeds by a seri...
4,2080,A Fire Upon the Deep,Vernor Vinge,,"[Hard science fiction, Science Fiction, Specul...",The novel posits that space around the Milky ...,the novel posits that space around the milky w...
...,...,...,...,...,...,...,...
16554,36934824,Under Wildwood,Colin Meloy,2012-09-25,[],"Prue McKeel, having rescued her brother from ...",prue mckeel having rescued her brother from th...
16555,37054020,Transfer of Power,Vince Flynn,2000-06-01,"[Thriller, Fiction]",The reader first meets Rapp while he is doing...,the reader first meets rapp while he is doing ...
16556,37122323,Decoded,Jay-Z,2010-11-16,[Autobiography],The book follows very rough chronological ord...,the book follows very rough chronological orde...
16557,37132319,America Again: Re-becoming The Greatness We Ne...,Stephen Colbert,2012-10-02,[],Colbert addresses topics including Wall Stree...,colbert addresses topics including wall street...
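
The two substitutions can be checked on a single string. A small sketch, assuming nothing beyond the two regular expressions used in the cell above (`clean_symbols` is a hypothetical helper name introduced for illustration):

```python
import re

def clean_symbols(text):
    # Replace every character that is neither a word character nor whitespace.
    text = re.sub(r'[^\w\s]', ' ', text)
    # Collapse the runs of spaces created above and trim both ends.
    return re.sub(' +', ' ', text).strip()

print(clean_symbols("alex, a teenager living in near-future england..."))
# alex a teenager living in near future england
print(clean_symbols("mr. jones's 2nd farm_house"))
# mr jones s 2nd farm_house
```

Two side effects are worth keeping in mind: hyphenated words and possessives are split into separate tokens, and digits and underscores survive because `\w` matches them.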


#### Tokenization

In [10]:
df['summary_processing'] = df['summary_processing'].apply(nltk.word_tokenize)
df

Unnamed: 0,book_id,book_title,author,publication_date,genre,summary,summary_processing
0,620,Animal Farm,George Orwell,1945-08-17,"[Roman à clef, Satire, Children's literature, ...","Old Major, the old boar on the Manor Farm, ca...","[old, major, the, old, boar, on, the, manor, f..."
1,843,A Clockwork Orange,Anthony Burgess,1962,"[Science Fiction, Novella, Speculative fiction...","Alex, a teenager living in near-future Englan...","[alex, a, teenager, living, in, near, future, ..."
2,986,The Plague,Albert Camus,1947,"[Existentialism, Fiction, Absurdist fiction, N...",The text of The Plague is divided into five p...,"[the, text, of, the, plague, is, divided, into..."
3,1756,An Enquiry Concerning Human Understanding,David Hume,,[],The argument of the Enquiry proceeds by a ser...,"[the, argument, of, the, enquiry, proceeds, by..."
4,2080,A Fire Upon the Deep,Vernor Vinge,,"[Hard science fiction, Science Fiction, Specul...",The novel posits that space around the Milky ...,"[the, novel, posits, that, space, around, the,..."
...,...,...,...,...,...,...,...
16554,36934824,Under Wildwood,Colin Meloy,2012-09-25,[],"Prue McKeel, having rescued her brother from ...","[prue, mckeel, having, rescued, her, brother, ..."
16555,37054020,Transfer of Power,Vince Flynn,2000-06-01,"[Thriller, Fiction]",The reader first meets Rapp while he is doing...,"[the, reader, first, meets, rapp, while, he, i..."
16556,37122323,Decoded,Jay-Z,2010-11-16,[Autobiography],The book follows very rough chronological ord...,"[the, book, follows, very, rough, chronologica..."
16557,37132319,America Again: Re-becoming The Greatness We Ne...,Stephen Colbert,2012-10-02,[],Colbert addresses topics including Wall Stree...,"[colbert, addresses, topics, including, wall, ..."


#### Creating a column for lemmatization

In [11]:
lemmatizer = WordNetLemmatizer()

df['summary_lemmatized'] = df['summary_processing'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
df

Unnamed: 0,book_id,book_title,author,publication_date,genre,summary,summary_processing,summary_lemmatized
0,620,Animal Farm,George Orwell,1945-08-17,"[Roman à clef, Satire, Children's literature, ...","Old Major, the old boar on the Manor Farm, ca...","[old, major, the, old, boar, on, the, manor, f...","[old, major, the, old, boar, on, the, manor, f..."
1,843,A Clockwork Orange,Anthony Burgess,1962,"[Science Fiction, Novella, Speculative fiction...","Alex, a teenager living in near-future Englan...","[alex, a, teenager, living, in, near, future, ...","[alex, a, teenager, living, in, near, future, ..."
2,986,The Plague,Albert Camus,1947,"[Existentialism, Fiction, Absurdist fiction, N...",The text of The Plague is divided into five p...,"[the, text, of, the, plague, is, divided, into...","[the, text, of, the, plague, is, divided, into..."
3,1756,An Enquiry Concerning Human Understanding,David Hume,,[],The argument of the Enquiry proceeds by a ser...,"[the, argument, of, the, enquiry, proceeds, by...","[the, argument, of, the, enquiry, proceeds, by..."
4,2080,A Fire Upon the Deep,Vernor Vinge,,"[Hard science fiction, Science Fiction, Specul...",The novel posits that space around the Milky ...,"[the, novel, posits, that, space, around, the,...","[the, novel, posit, that, space, around, the, ..."
...,...,...,...,...,...,...,...,...
16554,36934824,Under Wildwood,Colin Meloy,2012-09-25,[],"Prue McKeel, having rescued her brother from ...","[prue, mckeel, having, rescued, her, brother, ...","[prue, mckeel, having, rescued, her, brother, ..."
16555,37054020,Transfer of Power,Vince Flynn,2000-06-01,"[Thriller, Fiction]",The reader first meets Rapp while he is doing...,"[the, reader, first, meets, rapp, while, he, i...","[the, reader, first, meet, rapp, while, he, is..."
16556,37122323,Decoded,Jay-Z,2010-11-16,[Autobiography],The book follows very rough chronological ord...,"[the, book, follows, very, rough, chronologica...","[the, book, follows, very, rough, chronologica..."
16557,37132319,America Again: Re-becoming The Greatness We Ne...,Stephen Colbert,2012-10-02,[],Colbert addresses topics including Wall Stree...,"[colbert, addresses, topics, including, wall, ...","[colbert, address, topic, including, wall, str..."


## Vectorization with tf-idf (chosen method)

#### Imports needed for tf-idf vectorization

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#### Vectorizing the documents with tf-idf and building a matrix in which each row vector is a document

In [13]:
tfidf = TfidfVectorizer(
    min_df=5,
    max_df=0.95,
    stop_words='english'
)
# Join each token list into a string for the vectorizer; the column itself keeps its token lists
documents_vectorized = tfidf.fit_transform(df['summary_lemmatized'].apply(" ".join))
documents_vectorized

<16559x27774 sparse matrix of type '<class 'numpy.float64'>'
	with 2265196 stored elements in Compressed Sparse Row format>
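
To make the weighting concrete, here is a hand-rolled sketch of tf-idf on a toy corpus. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalisation of each row, which is what scikit-learn documents as the `TfidfVectorizer` defaults. Stop-word removal and scikit-learn's own tokenisation are left out, and `tfidf_rows` and the three toy documents are illustrative, not part of the notebook.

```python
import math
from collections import Counter

def tfidf_rows(docs, min_df=1, max_df=1.0):
    """Hand-rolled tf-idf using smoothed idf and L2 row normalisation."""
    n = len(docs)
    tokenized = [doc.split() for doc in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # min_df is an absolute count here, max_df a proportion of the corpus.
    vocab = sorted(t for t, c in df.items() if c >= min_df and c <= max_df * n)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(t for t in tokens if t in idf)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Divide by the L2 norm so that each non-empty row has length 1.
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vocab, rows

docs = ["farm animal pig farm", "animal dog", "pig windmill farm"]
vocab, rows = tfidf_rows(docs, min_df=2)
print(vocab)  # ['animal', 'farm', 'pig']
```

With `min_df=2`, `dog` and `windmill` are dropped because each appears in a single document. The same mechanism, with `min_df=5`, removes rare tokens from the book summaries.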

#### Converting the matrix to a DataFrame

In [14]:
# Dense copy of the sparse matrix: 16559 x 27774 float64 values, roughly 3.7 GB of memory
test = pd.DataFrame(documents_vectorized.toarray())
test

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,27764,27765,27766,27767,27768,27769,27770,27771,27772,27773
0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.027791,0.0,0.0,0.0,0.0,0.0,0.028421,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16554,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16555,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16556,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16557,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Choosing a summary and computing its distance to the other documents

In [15]:
from scipy.spatial.distance import cdist

# Choose a summary (row 0, Animal Farm)
vector = test.loc[0]

distances = pd.DataFrame(cdist([vector], test)[0])
distances

Unnamed: 0,0
0,0.000000
1,1.404041
2,1.400378
3,1.383029
4,1.393623
...,...
16554,1.412296
16555,1.409086
16556,1.406293
16557,1.410697
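
`cdist` computes Euclidean distances by default, and tf-idf rows have unit length under the vectorizer's default L2 normalisation. For unit vectors the squared distance equals `2 - 2 * cos`, so the values above are bounded by the square root of 2. The sketch below checks this on toy vectors, using only the standard library; `u` and `v` are made-up stand-ins for tf-idf rows.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Two unit-length vectors, standing in for tf-idf rows.
u = [0.6, 0.8, 0.0]
v = [0.0, 0.8, 0.6]

d = euclidean(u, v)
c = cosine(u, v)
# For unit vectors: d**2 == 2 - 2 * cos
print(round(d, 6), round(math.sqrt(2 - 2 * c), 6))

# No shared terms means cos = 0, hence d = sqrt(2).
print(round(euclidean([1.0, 0.0], [0.0, 1.0]), 6))   # 1.414214

# Against an all-zero row, the distance is the norm of u, i.e. 1.
print(round(euclidean(u, [0.0, 0.0, 0.0]), 6))       # 1.0
```

Read this way, a distance near 1.414 means the two summaries share almost no weighted vocabulary, and a distance of exactly 1 is consistent with a document whose tf-idf row is empty.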


#### Attaching the distances to their respective documents

In [16]:
new_df = pd.concat([df, distances], axis = 1)
new_df

Unnamed: 0,book_id,book_title,author,publication_date,genre,summary,summary_processing,summary_lemmatized,0
0,620,Animal Farm,George Orwell,1945-08-17,"[Roman à clef, Satire, Children's literature, ...","Old Major, the old boar on the Manor Farm, ca...","[old, major, the, old, boar, on, the, manor, f...","[old, major, the, old, boar, on, the, manor, f...",0.000000
1,843,A Clockwork Orange,Anthony Burgess,1962,"[Science Fiction, Novella, Speculative fiction...","Alex, a teenager living in near-future Englan...","[alex, a, teenager, living, in, near, future, ...","[alex, a, teenager, living, in, near, future, ...",1.404041
2,986,The Plague,Albert Camus,1947,"[Existentialism, Fiction, Absurdist fiction, N...",The text of The Plague is divided into five p...,"[the, text, of, the, plague, is, divided, into...","[the, text, of, the, plague, is, divided, into...",1.400378
3,1756,An Enquiry Concerning Human Understanding,David Hume,,[],The argument of the Enquiry proceeds by a ser...,"[the, argument, of, the, enquiry, proceeds, by...","[the, argument, of, the, enquiry, proceeds, by...",1.383029
4,2080,A Fire Upon the Deep,Vernor Vinge,,"[Hard science fiction, Science Fiction, Specul...",The novel posits that space around the Milky ...,"[the, novel, posits, that, space, around, the,...","[the, novel, posit, that, space, around, the, ...",1.393623
...,...,...,...,...,...,...,...,...,...
16554,36934824,Under Wildwood,Colin Meloy,2012-09-25,[],"Prue McKeel, having rescued her brother from ...","[prue, mckeel, having, rescued, her, brother, ...","[prue, mckeel, having, rescued, her, brother, ...",1.412296
16555,37054020,Transfer of Power,Vince Flynn,2000-06-01,"[Thriller, Fiction]",The reader first meets Rapp while he is doing...,"[the, reader, first, meets, rapp, while, he, i...","[the, reader, first, meet, rapp, while, he, is...",1.409086
16556,37122323,Decoded,Jay-Z,2010-11-16,[Autobiography],The book follows very rough chronological ord...,"[the, book, follows, very, rough, chronologica...","[the, book, follows, very, rough, chronologica...",1.406293
16557,37132319,America Again: Re-becoming The Greatness We Ne...,Stephen Colbert,2012-10-02,[],Colbert addresses topics including Wall Stree...,"[colbert, addresses, topics, including, wall, ...","[colbert, address, topic, including, wall, str...",1.410697


#### Sorting the documents from nearest to farthest

In [17]:
new_df.sort_values(0)

Unnamed: 0,book_id,book_title,author,publication_date,genre,summary,summary_processing,summary_lemmatized,0
0,620,Animal Farm,George Orwell,1945-08-17,"[Roman à clef, Satire, Children's literature, ...","Old Major, the old boar on the Manor Farm, ca...","[old, major, the, old, boar, on, the, manor, f...","[old, major, the, old, boar, on, the, manor, f...",0.000000
9989,11541759,Snowball's Chance,John Reed,,[Parody],"The story begins with the death of Napoleon, ...","[the, story, begins, with, the, death, of, nap...","[the, story, begin, with, the, death, of, napo...",0.948869
16531,35311147,Guardians of Ga'Hoole Book 4: The Siege,Helen Dunmore,2004-05-01,"[Speculative fiction, Fantasy, Historical novel]",==Receptio,[receptio],[receptio],1.000000
5879,4817875,The Caverns of Kalte,Joe Dever,1984,"[Gamebook, Speculative fiction, Fantasy, Child...",==Receptio,[receptio],[receptio],1.000000
5693,4597024,The Deathlord of Ixia,John Grant,1992,"[Gamebook, Speculative fiction, Children's lit...",==Receptio,[receptio],[receptio],1.000000
...,...,...,...,...,...,...,...,...,...
12225,17419637,Dating Hamlet,,2002-11,"[History, Novel]",The novel is a retelling of Hamlet from Ophel...,"[the, novel, is, a, retelling, of, hamlet, fro...","[the, novel, is, a, retelling, of, hamlet, fro...",1.414214
2045,1078455,The Kennel Murder Case,S. S. Van Dine,,"[Mystery, Fiction, Suspense]",~Plot outline description,"[plot, outline, description]","[plot, outline, description]",1.414214
3257,2074207,The Jungle Pyramid,Franklin W. Dixon,1977,"[Mystery, Detective fiction]",The search for gold robbed from a mint takes ...,"[the, search, for, gold, robbed, from, a, mint...","[the, search, for, gold, robbed, from, a, mint...",1.414214
12007,16845793,Hello Sailor,Eric Idle,1975,[Satire],Hello Sailor is a satirical view of British p...,"[hello, sailor, is, a, satirical, view, of, br...","[hello, sailor, is, a, satirical, view, of, br...",1.414214
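
The sorted table puts the query itself first (distance 0), followed closely by stub summaries such as `==Receptio`. One possible way to turn the ranking into recommendations is to skip both. The sketch below works on a few hand-copied rows from the output above; `nearest_books` and the `min_tokens` threshold are assumptions introduced for illustration, not part of the notebook.

```python
import heapq

def nearest_books(records, query_id, k=3, min_tokens=5):
    """Return the k closest records, skipping the query and stub summaries."""
    candidates = (
        r for r in records
        if r["book_id"] != query_id and len(r["tokens"]) >= min_tokens
    )
    return heapq.nsmallest(k, candidates, key=lambda r: r["distance"])

records = [
    {"book_id": 620, "title": "Animal Farm", "distance": 0.0,
     "tokens": ["old", "major", "the", "old", "boar", "on", "the"]},
    {"book_id": 11541759, "title": "Snowball's Chance", "distance": 0.948869,
     "tokens": ["the", "story", "begin", "with", "the", "death", "of"]},
    {"book_id": 4817875, "title": "The Caverns of Kalte", "distance": 1.0,
     "tokens": ["receptio"]},
    {"book_id": 1078455, "title": "The Kennel Murder Case", "distance": 1.414214,
     "tokens": ["plot", "outline", "description"]},
]

print([r["title"] for r in nearest_books(records, query_id=620)])
# ["Snowball's Chance"]
```

The right value for `min_tokens` would need to be chosen by inspecting the real summaries.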


In [149]:
new_df.loc[9989, 'summary']

' The story begins with the death of Napoleon, the original antagonist of Animal Farm. The animals of the farm, fearing what will become of them, learn that Snowball is alive and well, and Snowball returns to the farm to encourage capitalism. A second windmill is soon built alongside the first, and the two are thenceforth known as the Twin Mills (allegorical of the Twin Towers of the World Trade Center), and the animals all learn to walk on their hind legs, something hitherto forbidden by Old Major shortly before the expulsion of Mr. Jones from the farm. Also, in place of the original Seven Commandments, Snowball adopts a single slogan for the animals to live by: All animals are born equal - what they become is their own affair. As time passes, the animals, under the leadership of Snowball, realise the properties of monetary gain, and begin to file lawsuits against neighbouring farms, allowing Animal Farm to gain land and wealth. The revitalised farm also attracts animals from elsewher

## Vectorization with Word2vec (possible method)

#### Imports needed for word-embedding vectorization

In [118]:
import numpy as np

#### Loading a word2vec-style matrix (GloVe vectors)

In [87]:
word2vec_matrix = pd.read_table("word2vec/glove.6B.300d.txt", sep=" ",index_col=0, header=None, quoting=csv.QUOTE_NONE)
word2vec_matrix

Unnamed: 0_level_0,1,2,3,4,5,6,7,8,9,10,...,291,292,293,294,295,296,297,298,299,300
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
the,0.046560,0.213180,-0.007436,-0.458540,-0.035639,0.236430,-0.288360,0.215210,-0.134860,-1.641300,...,-0.013064,-0.296860,-0.079913,0.195000,0.031549,0.285060,-0.087461,0.009061,-0.209890,0.053913
",",-0.255390,-0.257230,0.131690,-0.042688,0.218170,-0.022702,-0.178540,0.107560,0.058936,-1.385400,...,0.075968,-0.014359,-0.073794,0.221760,0.146520,0.566860,0.053307,-0.232900,-0.122260,0.354990
.,-0.125590,0.013630,0.103060,-0.101230,0.098128,0.136270,-0.107210,0.236970,0.328700,-1.678500,...,0.060148,-0.156190,-0.119490,0.234450,0.081367,0.246180,-0.152420,-0.342240,-0.022394,0.136840
of,-0.076947,-0.021211,0.212710,-0.722320,-0.139880,-0.122340,-0.175210,0.121370,-0.070866,-1.572100,...,-0.366730,-0.386030,0.302900,0.015747,0.340360,0.478410,0.068617,0.183510,-0.291830,-0.046533
to,-0.257560,-0.057132,-0.671900,-0.380820,-0.364210,-0.082155,-0.010955,-0.082047,0.460560,-1.847700,...,-0.012806,-0.597070,0.317340,-0.252670,0.543840,0.063007,-0.049795,-0.160430,0.046744,-0.070621
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
chanty,0.392700,-0.022505,0.304580,0.187990,0.141180,0.724030,-0.257810,-0.137290,-0.016521,0.595960,...,-0.182950,0.406630,-0.343630,-0.270400,-0.593680,0.016447,0.140740,0.463940,-0.369570,-0.287180
kronik,0.136790,-0.139090,-0.360890,0.079864,0.321490,0.263870,-0.109900,0.044420,0.083869,0.791330,...,0.036419,-0.036845,-0.348150,0.064732,-0.000577,-0.133790,0.428960,-0.023320,0.410210,-0.393080
rolonda,0.075713,-0.040502,0.183450,0.512300,-0.228560,0.839110,0.178780,-0.713010,0.326900,0.695350,...,-0.388530,0.545850,-0.035050,-0.184360,-0.197000,-0.350030,0.160650,0.218380,0.309670,0.437610
zsombor,0.814510,-0.362210,0.311860,0.813810,0.188520,-0.313600,0.827840,0.296560,-0.085519,0.475970,...,0.130880,0.106120,-0.408110,0.313380,-0.430250,0.069798,-0.207690,0.075486,0.284080,-0.175590
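
The `read_table` arguments mirror the layout of the GloVe text file: one token per line followed by space-separated floats, with no header. Disabling quoting matters because some tokens, such as `,` in the table above, are punctuation. A standard-library sketch of the same parsing, on a two-line sample truncated to three dimensions (values copied from the table above; `load_vectors` is a hypothetical helper):

```python
import io

def load_vectors(handle):
    """Parse 'token v1 v2 ...' lines into a {token: [floats]} dict."""
    vectors = {}
    for line in handle:
        # Split on single spaces only: the token itself may be punctuation.
        token, *values = line.rstrip("\n").split(" ")
        vectors[token] = [float(v) for v in values]
    return vectors

sample = io.StringIO(
    "the 0.046560 0.213180 -0.007436\n"
    ", -0.255390 -0.257230 0.131690\n"
)
vectors = load_vectors(sample)
print(sorted(vectors))      # [',', 'the']
print(vectors["the"][:2])   # [0.04656, 0.21318]
```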


#### Collecting all the words in the GloVe dataset

In [88]:
glove_words = set(word2vec_matrix.index)
glove_words

{'t.stewart',
 'spivakov',
 'setpieces',
 'anogeissus',
 'c-46',
 'nausea',
 'myeongseong',
 'rakitic',
 'peine',
 'priit',
 'errin',
 'bojaxhiu',
 'enclosure',
 'moschata',
 '1,456',
 'chestnut',
 'forfar',
 'biedenkopf',
 'dehra',
 'pownal',
 'patapsco',
 'neptuno',
 'metković',
 'ochamchire',
 'fajer',
 'myself',
 'nykänen',
 'gansu',
 'malacologist',
 'gordion',
 'pinstriped',
 'zeidler',
 'tv-ma',
 'u.n.-based',
 'draper',
 'salernitana',
 'kratos',
 '42.95',
 'nabateans',
 'reesha',
 '126.05',
 'ssangyong',
 'swiftness',
 'trevelyan',
 'grita',
 'firemen',
 'delamar',
 '25.24',
 '45.16',
 'synaxis',
 'sagafjord',
 'sipepa',
 'zango',
 'gouri',
 'rubí',
 'aboo',
 'protégé',
 'tewari',
 'szczęsny',
 'whelan',
 'gerdts',
 'atim',
 'fandango',
 's400',
 'bivouacked',
 'ameican',
 '92.66',
 'ziff-davis',
 'yoginis',
 'speicher',
 '4,930',
 'dynastically',
 'ss3',
 'zl',
 'mapushi',
 'vlogger',
 'spyro',
 'gutbucket',
 'yōsuke',
 'confides',
 'nij',
 'injudicious',
 'non-electric',
 't

#### Collecting all the words in the summaries

In [89]:
# Deduplicate while keeping the order of first appearance
all_summary_words = list(dict.fromkeys(word for tokens in df['summary_lemmatized'] for word in tokens))
all_summary_words

['old',
 'major',
 'the',
 'boar',
 'on',
 'manor',
 'farm',
 'call',
 'animal',
 'for',
 'a',
 'meeting',
 'where',
 'he',
 'compare',
 'human',
 'to',
 'parasite',
 'and',
 'teach',
 'revolutionary',
 'song',
 'beast',
 'of',
 'england',
 'when',
 'dy',
 'two',
 'young',
 'pig',
 'snowball',
 'napoleon',
 'assume',
 'command',
 'turn',
 'his',
 'dream',
 'into',
 'philosophy',
 'revolt',
 'drive',
 'drunken',
 'irresponsible',
 'mr',
 'jones',
 'from',
 'renaming',
 'it',
 'they',
 'adopt',
 'seven',
 'commandment',
 'ism',
 'most',
 'important',
 'which',
 'is',
 'all',
 'are',
 'equal',
 'attempt',
 'reading',
 'writing',
 'food',
 'plentiful',
 'run',
 'smoothly',
 'elevate',
 'themselves',
 'position',
 'leadership',
 'set',
 'aside',
 'special',
 'item',
 'ostensibly',
 'their',
 'personal',
 'health',
 'take',
 'pup',
 'dog',
 'train',
 'them',
 'privately',
 'struggle',
 'announces',
 'plan',
 'build',
 'windmill',
 'ha',
 'chase',
 'away',
 'declares',
 'himself',
 'leader',


#### Building the set of summary words missing from the GloVe dataset

In [90]:
summary_words_not_in_glove = set(all_summary_words) - glove_words
summary_words_not_in_glove

{'panzee',
 'gorseclaw',
 'worstead',
 'honestus',
 'tulippa',
 'tzeentch',
 'kopeikin',
 'remoralization',
 'griever',
 'melée',
 'terayama',
 'huntingseåssen',
 'dorrel',
 'mackecknie',
 'pavulean',
 'magenpies',
 'tummeler',
 'путь',
 'macquarts',
 'prahu',
 'staho',
 'macroscope',
 'meima',
 'repness',
 'kinsford',
 'logboats',
 'sealously',
 'tharkoonian',
 'phaders',
 'mechanicist',
 'soakes',
 'lowicker',
 'larrin',
 'alflolol',
 'palante',
 'coude',
 'concotion',
 'komitzät',
 'tellai',
 'malverton',
 'gryylth',
 'buntaro',
 'arantola',
 'ਆਰਟਸ',
 'hundrec',
 '25em',
 'adopta',
 'winthorp',
 'kerchpoff',
 'friendrichs',
 'seppings',
 'erij',
 'eilistraee',
 'baptizer',
 'rodlox',
 'crowpaw',
 '極東アジア調査會',
 'tethis',
 'beguildy',
 'arsindo',
 'ganymedeans',
 'cluracan',
 'ottaanthadi',
 'bayta',
 'raincloud',
 'asanova',
 'rhime',
 'giftlist',
 'dynas',
 'draag',
 'harpooneer',
 'grabr',
 'mislays',
 'bannus',
 'vacuousness',
 'garthlings',
 'rocangus',
 'mirri',
 'dragonsriders',

#### Building the set of summary words present in the GloVe dataset

In [91]:
glove_compatible_words = (set(all_summary_words) - summary_words_not_in_glove)
glove_compatible_words

{'macey',
 'mayard',
 'signed',
 'sapiens',
 'hogue',
 'quinn',
 'mille',
 'compromising',
 'oxcart',
 'exor',
 'svein',
 'nausea',
 'kokopelli',
 'sensitivity',
 'coolly',
 'gunilla',
 'errin',
 'oo',
 'nava',
 'trivium',
 'price',
 'mixer',
 'kwesi',
 'euthanizing',
 'enclosure',
 'physiotherapist',
 'touchdown',
 'chestnut',
 'lurks',
 'cyanide',
 'saldana',
 'mould',
 'fallbrook',
 'birte',
 'pownal',
 'epidemic',
 'oran',
 'doolittle',
 'adnan',
 'opined',
 'parcells',
 'höhe',
 '317',
 'strayhorn',
 'naive',
 'myself',
 'ideological',
 'mcnamara',
 'gansu',
 'mirage',
 'aggressed',
 'wolfhound',
 'cormeilles',
 'modest',
 'impending',
 'hoarse',
 'draper',
 'denied',
 'melt',
 'mashing',
 'kratos',
 'flagg',
 'overground',
 'danced',
 'biblical',
 'dasa',
 'crossbones',
 'agia',
 'kindred',
 'swiftness',
 'trevelyan',
 'palestinian',
 'rizla',
 'grita',
 'trever',
 'elke',
 'computed',
 'progressed',
 'jose',
 'axle',
 'objected',
 'fairest',
 'antinous',
 'ca',
 'candlewick',
 '

#### Création d'une fonction qui prend un sommaire et renvoie un vecteur

In [92]:
def summary_to_vector(summary):
    """Average the embeddings of the tokens in `summary` (a list of tokens)."""

    # Keep only the tokens that have an embedding
    compatible_summary = [i for i in summary if i in glove_compatible_words]

    # Look up the embedding matrix, one row per remaining token
    matrix = word2vec_matrix.loc[compatible_summary].values

    # Aggregate the rows into a single vector by averaging
    vector = np.mean(matrix, axis=0)

    return vector
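
When none of the tokens has an embedding, the lookup returns an empty matrix and `np.mean` yields a vector of NaNs together with a runtime warning. The sketch below shows one possible way to make that case explicit. The toy embedding table, the vocabulary, and the name `safe_summary_to_vector` are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy embedding table: one row per word, 3 dimensions.
toy_matrix = pd.DataFrame(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    index=["farm", "animal", "pig"],
)
toy_vocab = set(toy_matrix.index)


def safe_summary_to_vector(tokens, matrix=toy_matrix, vocab=toy_vocab):
    """Average the embeddings of known tokens; return NaNs if none is known."""
    known = [t for t in tokens if t in vocab]
    if not known:
        # No usable token: return an explicit NaN vector instead of
        # letting np.mean warn about an empty slice.
        return np.full(matrix.shape[1], np.nan)
    return matrix.loc[known].to_numpy().mean(axis=0)


mixed = safe_summary_to_vector(["animal", "farm", "zzz"])  # "zzz" is ignored
empty = safe_summary_to_vector(["receptio"])               # no known token
print(mixed)
print(empty)
```

Keeping NaN (rather than a zero vector) means such documents can later be dropped explicitly instead of silently receiving a distance.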

#### Create a DataFrame to store the summaries as vectors

In [93]:
vectorized_documents = pd.DataFrame()

#### Vectorize the documents

In [94]:
vectorized_documents['Documents'] = df['summary_lemmatized'].apply(summary_to_vector)
vectorized_documents

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = um.true_divide(


Unnamed: 0,Documents
0,"[-0.07298177581493165, 0.12149058247108308, -0..."
1,"[-0.07680598694767443, 0.12032673653197673, -0..."
2,"[-0.09822240497757848, 0.07893396940807176, -0..."
3,"[-0.12620177455459272, 0.12082359030849218, -0..."
4,"[-0.09442028913223141, 0.09537487761845731, -0..."
...,...
16554,"[-0.11869587844155845, 0.0594336512987013, -0...."
16555,"[-0.11666568169014085, 0.09926808591549297, -0..."
16556,"[-0.12534059150943397, 0.11652011540880504, -0..."
16557,"[0.04994615, 0.09150670000000002, -0.082625990..."


#### Stack the vectorized documents into a matrix

In [95]:
documents_matrix = np.stack(vectorized_documents['Documents'].tolist())
documents_matrix

array([[-0.07298178,  0.12149058, -0.08183253, ..., -0.13990112,
        -0.04957455,  0.0390767 ],
       [-0.07680599,  0.12032674, -0.01767911, ..., -0.07146781,
        -0.16399043,  0.06361667],
       [-0.0982224 ,  0.07893397, -0.02971348, ..., -0.12836276,
        -0.11224773,  0.0291524 ],
       ...,
       [-0.12534059,  0.11652012, -0.01453767, ..., -0.10155226,
        -0.17483535,  0.01556576],
       [ 0.04994615,  0.0915067 , -0.08262599, ..., -0.11599289,
        -0.23081545,  0.0766829 ],
       [-0.1485635 ,  0.07215925, -0.05787597, ..., -0.14550448,
        -0.11372692,  0.06725887]])

## User

#### Find the books closest to an input text

In [96]:
from scipy.spatial.distance import cdist

entree = ' Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, \'Beasts of England\'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a philosophy. The animals revolt and drive the drunken and irresponsible Mr Jones from the farm, renaming it "Animal Farm". They adopt Seven Commandments of Animal-ism, the most important of which is, "All animals are equal". Snowball attempts to teach the animals reading and writing; food is plentiful, and the farm runs smoothly. The pigs elevate themselves to positions of leadership and set aside special food items, ostensibly for their personal health. Napoleon takes the pups from the farm dogs and trains them privately. Napoleon and Snowball struggle for leadership. When Snowball announces his plans to build a windmill, Napoleon has his dogs chase Snowball away and declares himself leader. Napoleon enacts changes to the governance structure of the farm, replacing meetings with a committee of pigs, who will run the farm. Using a young pig named Squealer as a "mouthpiece", Napoleon claims credit for the windmill idea. The animals work harder with the promise of easier lives with the windmill. After a violent storm, the animals find the windmill annihilated. Napoleon and Squealer convince the animals that Snowball destroyed it, although the scorn of the neighbouring farmers suggests that its walls were too thin. Once Snowball becomes a scapegoat, Napoleon begins purging the farm with his dogs, killing animals he accuses of consorting with his old rival. He and the pigs abuse their power, imposing more control while reserving privileges for themselves and rewriting history, villainising Snowball and glorifying Napoleon. Squealer justifies every statement Napoleon makes, even the pigs\' alteration of the Seven Commandments of Animalism to benefit themselves. 
\'Beasts of England\' is replaced by an anthem glorifying Napoleon, who appears to be adopting the lifestyle of a man. The animals remain convinced that they are better off than they were when under Mr Jones. Squealer abuses the animals\' poor memories and invents numbers to show their improvement. Mr Frederick, one of the neighbouring farmers, attacks the farm, using blasting powder to blow up the restored windmill. Though the animals win the battle, they do so at great cost, as many, including Boxer the workhorse, are wounded. Despite his injuries, Boxer continues working harder and harder, until he collapses while working on the windmill. Napoleon sends for a van to take Boxer to the veterinary surgeon\'s, explaining that better care can be given there. Benjamin, the cynical donkey, who "could read as well as any pig", notices that the van belongs to a knacker, and attempts to mount a rescue; but the animals\' attempts are futile. Squealer reports that the van was purchased by the hospital and the writing from the previous owner had not been repainted. He recounts a tale of Boxer\'s death in the hands of the best medical care. Years pass, and the pigs learn to walk upright, carry whips and wear clothes. The Seven Commandments are reduced to a single phrase: "All animals are equal, but some animals are more equal than others". Napoleon holds a dinner party for the pigs and the humans of the area, who congratulate Napoleon on having the hardest-working but least fed animals in the country. Napoleon announces an alliance with the humans, against the labouring classes of both "worlds". He abolishes practices and traditions related to the Revolution, and changes the name of the farm to "The Manor Farm". The animals, overhearing the conversation, notice that the faces of the pigs have begun changing. 
During a poker match, an argument breaks out between Napoleon and Mr Pilkington, and the animals realise that the faces of the pigs look like the faces of humans, and no one can tell the difference between them. The pigs Snowball, Napoleon, and Squealer adapt Old Major\'s ideas into an actual philosophy, which they formally name Animalism. Soon after, Napoleon and Squealer indulge in the vices of humans (drinking alcohol, sleeping in beds, trading). Squealer is employed to alter the Seven Commandments to account for this humanisation, an allusion to the Soviet government\'s revising of history in order to exercise control of the people\'s beliefs about themselves and their society. The original commandments are: # Whatever goes upon two legs is an enemy. # Whatever goes upon four legs, or has wings, is a friend. # No animal shall wear clothes. # No animal shall sleep in a bed. # No animal shall drink alcohol. # No animal shall kill any other animal. # All animals are equal. Later, Napoleon and his pigs secretly revise some commandments to clear them of accusations of law-breaking (such as "No animal shall drink alcohol" having "to excess" appended to it and "No animal shall sleep in a bed" with "with sheets" added to it). The changed commandments are as follows, with the changes bolded: * 4 No animal shall sleep in a bed with sheets. * 5 No animal shall drink alcohol to excess. * 6 No animal shall kill any other animal without cause. Eventually these are replaced with the maxims, "All animals are equal, but some animals are more equal than others", and "Four legs good, two legs better!" as the pigs become more human. This is an ironic twist to the original purpose of the Seven Commandments, which were supposed to keep order within Animal Farm by uniting the animals together against the humans, and prevent animals from following the humans\' evil habits. 
Through the revision of the commandments, Orwell demonstrates how simply political dogma can be turned into malleable propaganda.'

vector = summary_to_vector(entree)

distances = cdist([vector], documents_matrix)[0]
distances

array([4.09107612, 3.98300727, 4.11027837, ..., 3.98640225, 4.32334609,
       4.1059843 ])
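
Note that `entree` is passed to `summary_to_vector` as a raw string, while the function iterates over its argument element by element. With a string, the elements are single characters, so only one-letter tokens can match the vocabulary. This is consistent with the distance to row 0 (the same Animal Farm summary) being about 4.09 rather than 0. The sketch below reproduces the effect on a toy vocabulary. The regex tokenizer is an assumption standing in for the notebook's own preprocessing, which is not shown in this section.

```python
import re

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical toy embeddings; "a" plays the role of a one-letter token.
toy_matrix = pd.DataFrame(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    index=["old", "farm", "a"],
)
toy_vocab = set(toy_matrix.index)


def toy_summary_to_vector(summary):
    """Same logic as summary_to_vector, on the toy table."""
    known = [t for t in summary if t in toy_vocab]
    return toy_matrix.loc[known].to_numpy().mean(axis=0)


text = "Old farm"
# Assumed tokenizer: lowercase, keep runs of letters and digits.
tokens = re.findall(r"[a-z0-9]+", text.lower())

char_vector = toy_summary_to_vector(text)     # iterates over characters
token_vector = toy_summary_to_vector(tokens)  # iterates over words

# Vector of the same text as it would be stored in the documents matrix.
documents = np.array([[0.5, 0.5]])
print(cdist([char_vector], documents)[0])   # non-zero distance
print(cdist([token_vector], documents)[0])  # zero distance
```

Applying the same cleaning and lemmatization steps to the input as to `df['summary_lemmatized']` before calling `summary_to_vector` would make the two representations comparable.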

In [97]:
distances_from_input_summary = pd.DataFrame()
distances_from_input_summary['distances'] = pd.Series(distances)
distances_from_input_summary

Unnamed: 0,distances
0,4.091076
1,3.983007
2,4.110278
3,4.056156
4,4.039546
...,...
16554,4.095399
16555,4.191006
16556,3.986402
16557,4.323346


In [98]:
new_df = pd.concat([df, distances_from_input_summary], axis = 1)
new_df

Unnamed: 0,book_id,book_title,author,publication_date,genre,summary,summary_processing,summary_lemmatized,distances
0,620,Animal Farm,George Orwell,1945-08-17,"[Roman à clef, Satire, Children's literature, ...","Old Major, the old boar on the Manor Farm, ca...","[old, major, the, old, boar, on, the, manor, f...","[old, major, the, old, boar, on, the, manor, f...",4.091076
1,843,A Clockwork Orange,Anthony Burgess,1962,"[Science Fiction, Novella, Speculative fiction...","Alex, a teenager living in near-future Englan...","[alex, a, teenager, living, in, near, future, ...","[alex, a, teenager, living, in, near, future, ...",3.983007
2,986,The Plague,Albert Camus,1947,"[Existentialism, Fiction, Absurdist fiction, N...",The text of The Plague is divided into five p...,"[the, text, of, the, plague, is, divided, into...","[the, text, of, the, plague, is, divided, into...",4.110278
3,1756,An Enquiry Concerning Human Understanding,David Hume,,[],The argument of the Enquiry proceeds by a ser...,"[the, argument, of, the, enquiry, proceeds, by...","[the, argument, of, the, enquiry, proceeds, by...",4.056156
4,2080,A Fire Upon the Deep,Vernor Vinge,,"[Hard science fiction, Science Fiction, Specul...",The novel posits that space around the Milky ...,"[the, novel, posits, that, space, around, the,...","[the, novel, posit, that, space, around, the, ...",4.039546
...,...,...,...,...,...,...,...,...,...
16554,36934824,Under Wildwood,Colin Meloy,2012-09-25,[],"Prue McKeel, having rescued her brother from ...","[prue, mckeel, having, rescued, her, brother, ...","[prue, mckeel, having, rescued, her, brother, ...",4.095399
16555,37054020,Transfer of Power,Vince Flynn,2000-06-01,"[Thriller, Fiction]",The reader first meets Rapp while he is doing...,"[the, reader, first, meets, rapp, while, he, i...","[the, reader, first, meet, rapp, while, he, is...",4.191006
16556,37122323,Decoded,Jay-Z,2010-11-16,[Autobiography],The book follows very rough chronological ord...,"[the, book, follows, very, rough, chronologica...","[the, book, follows, very, rough, chronologica...",3.986402
16557,37132319,America Again: Re-becoming The Greatness We Ne...,Stephen Colbert,2012-10-02,[],Colbert addresses topics including Wall Stree...,"[colbert, addresses, topics, including, wall, ...","[colbert, address, topic, including, wall, str...",4.323346


In [99]:
new_df.sort_values('distances')

Unnamed: 0,book_id,book_title,author,publication_date,genre,summary,summary_processing,summary_lemmatized,distances
10215,12076423,Arthur's Teacher Trouble,Marc Brown,1986,[Children's literature],"Arthur starts a new year with Mr. Ratburn, an...","[arthur, starts, a, new, year, with, mr, ratbu...","[arthur, start, a, new, year, with, mr, ratbur...",3.466410
6018,4932118,Chicka Chicka Boom Boom,John Archambault,1989,"[Picture book, Children's literature]",The lower-case letters climb up a coconut tre...,"[the, lower, case, letters, climb, up, a, coco...","[the, lower, case, letter, climb, up, a, cocon...",3.509439
14547,24629520,New York Dead,Stuart Woods,1991-10,[],New york spelled N-E-W Y-O-R-K. is a city in ...,"[new, york, spelled, n, e, w, y, o, r, k, is, ...","[new, york, spelled, n, e, w, y, o, r, k, is, ...",3.551551
3071,1908089,The Ring of Charon,Roger MacBride Allen,1990,"[Science Fiction, Speculative fiction, Fiction]",- ==Charonian Life Cycle==--&#62; it:L'anello ...,"[charonian, life, cycle, 62, it, l, anello, di...","[charonian, life, cycle, 62, it, l, anello, di...",3.558923
15539,28959603,The Cambridge Quintet,,1998,[Novel],The book describes a fictitious dinner party ...,"[the, book, describes, a, fictitious, dinner, ...","[the, book, describes, a, fictitious, dinner, ...",3.576953
...,...,...,...,...,...,...,...,...,...
5972,4908574,The Eyes of Darkness,Dean Koontz,1981,"[Speculative fiction, Horror, Fiction, Romance...",==Character,[character],[character],6.836701
9064,9388867,Cecily Parsley's Nursery Rhymes,Beatrix Potter,1922,[Children's literature],== Merchandise,[merchandise],[merchandise],7.268012
5693,4597024,The Deathlord of Ixia,John Grant,1992,"[Gamebook, Speculative fiction, Children's lit...",==Receptio,[receptio],[receptio],
5879,4817875,The Caverns of Kalte,Joe Dever,1984,"[Gamebook, Speculative fiction, Fantasy, Child...",==Receptio,[receptio],[receptio],
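
The last rows above have no distance because their summaries contain no token with an embedding. A minimal sketch of how the ranking could be turned into a top-k list while discarding those rows follows. The toy frame and the helper name `closest_books` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of new_df: titles and their distances to the input.
toy = pd.DataFrame({
    "book_title": ["A", "B", "C", "D"],
    "distances": [4.09, 3.47, np.nan, 3.51],
})


def closest_books(frame, k=2):
    """Return the k rows with the smallest distance, ignoring missing ones."""
    return frame.dropna(subset=["distances"]).nsmallest(k, "distances")


top = closest_books(toy)
print(top[["book_title", "distances"]])
```

`nsmallest` avoids sorting the whole frame when only a few recommendations are needed.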



In [101]:
df.loc[0, 'summary']

' Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, \'Beasts of England\'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a philosophy. The animals revolt and drive the drunken and irresponsible Mr Jones from the farm, renaming it "Animal Farm". They adopt Seven Commandments of Animal-ism, the most important of which is, "All animals are equal". Snowball attempts to teach the animals reading and writing; food is plentiful, and the farm runs smoothly. The pigs elevate themselves to positions of leadership and set aside special food items, ostensibly for their personal health. Napoleon takes the pups from the farm dogs and trains them privately. Napoleon and Snowball struggle for leadership. When Snowball announces his plans to build a windmill, Napoleon has his dogs chase Snowball away and declares himself leader. N

In [104]:
new_df.loc[10215, 'summary']

' Arthur starts a new year with Mr. Ratburn, and is given heaps of homework because Mr. Ratburn is very strict, D. W is ecstatic because she has not started school yet, and she knows that next year, she won\'t get any homework because the kindergarten teacher is nice. The principal announces the annual September Spellathon, and not long after Mr. Ratburn announces a spelling test to determine which two students will represent his class at the spellathon. Everybody studies, and Arthur and Brain get all twenty words right, and enter into the spellathon. On the night of the spellathon, Arthur is very nervous. Brain is first, and spells \'fear\' "F-E-R-E", Prunella falls out not long after, spelling \'preparation\' "P-R-E-P-E-R-A-T-I-O-N". Arthur spells preparation correctly and wins the spellathon. At the end of the spellathon, Mr. Ratburn announces that he has loved teaching third grade, but that he is looking forward to a new challenge next year, teaching kindergarten. At this announcem

## Doc2Vec (possible method)

In [12]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import nltk
from nltk.tokenize import word_tokenize

# Train on the book summaries
data = df['summary']

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]

In [13]:
max_epochs = 100
vec_size = 20
alpha = 0.025
 
model = Doc2Vec(vector_size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm =1)
  
model.build_vocab(tagged_data)
 
for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.epochs)
    # decrease the learning rate
    model.alpha -= 0.0002
    
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

iteration 0
iteration 1
iteration 2
...
iteration

In [14]:
test_data = word_tokenize("I love chatbots".lower())
liste = df['summary_processing'].apply(lambda x: model.infer_vector(x))
print("V1_infer", liste)

V1_infer 0        [-2.3444753, -1.4343172, 1.5873882, -0.7337385...
1        [-1.3725247, -1.102957, -0.01775584, -1.283962...
2        [-1.174576, 2.0684273, -1.9528142, 0.30446884,...
3        [-1.4223877, 1.0449867, -1.8482441, -1.8878657...
4        [1.7154088, 0.59428126, -0.45145193, -3.694948...
                               ...                        
16554    [-0.4012651, -0.2989922, -0.028916895, -0.8193...
16555    [-0.5523634, 0.75570744, -0.6764767, -0.411904...
16556    [-0.82417643, -0.27783677, -0.840221, -0.83324...
16557    [-0.20514935, -0.0345414, -0.16608354, -0.1309...
16558    [-0.27915102, 0.28536725, -0.97357947, -0.7447...
Name: summary_processing, Length: 16559, dtype: object


In [21]:
from scipy.spatial.distance import cdist
import numpy as np

entree = ' Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, \'Beasts of England\'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a philosophy. The animals revolt and drive the drunken and irresponsible Mr Jones from the farm, renaming it "Animal Farm". They adopt Seven Commandments of Animal-ism, the most important of which is, "All animals are equal". Snowball attempts to teach the animals reading and writing; food is plentiful, and the farm runs smoothly. The pigs elevate themselves to positions of leadership and set aside special food items, ostensibly for their personal health. Napoleon takes the pups from the farm dogs and trains them privately. Napoleon and Snowball struggle for leadership. When Snowball announces his plans to build a windmill, Napoleon has his dogs chase Snowball away and declares himself leader. Napoleon enacts changes to the governance structure of the farm, replacing meetings with a committee of pigs, who will run the farm. Using a young pig named Squealer as a "mouthpiece", Napoleon claims credit for the windmill idea. The animals work harder with the promise of easier lives with the windmill. After a violent storm, the animals find the windmill annihilated. Napoleon and Squealer convince the animals that Snowball destroyed it, although the scorn of the neighbouring farmers suggests that its walls were too thin. Once Snowball becomes a scapegoat, Napoleon begins purging the farm with his dogs, killing animals he accuses of consorting with his old rival. He and the pigs abuse their power, imposing more control while reserving privileges for themselves and rewriting history, villainising Snowball and glorifying Napoleon. Squealer justifies every statement Napoleon makes, even the pigs\' alteration of the Seven Commandments of Animalism to benefit themselves. 
\'Beasts of England\' is replaced by an anthem glorifying Napoleon, who appears to be adopting the lifestyle of a man. The animals remain convinced that they are better off than they were when under Mr Jones. Squealer abuses the animals\' poor memories and invents numbers to show their improvement. Mr Frederick, one of the neighbouring farmers, attacks the farm, using blasting powder to blow up the restored windmill. Though the animals win the battle, they do so at great cost, as many, including Boxer the workhorse, are wounded. Despite his injuries, Boxer continues working harder and harder, until he collapses while working on the windmill. Napoleon sends for a van to take Boxer to the veterinary surgeon\'s, explaining that better care can be given there. Benjamin, the cynical donkey, who "could read as well as any pig", notices that the van belongs to a knacker, and attempts to mount a rescue; but the animals\' attempts are futile. Squealer reports that the van was purchased by the hospital and the writing from the previous owner had not been repainted. He recounts a tale of Boxer\'s death in the hands of the best medical care. Years pass, and the pigs learn to walk upright, carry whips and wear clothes. The Seven Commandments are reduced to a single phrase: "All animals are equal, but some animals are more equal than others". Napoleon holds a dinner party for the pigs and the humans of the area, who congratulate Napoleon on having the hardest-working but least fed animals in the country. Napoleon announces an alliance with the humans, against the labouring classes of both "worlds". He abolishes practices and traditions related to the Revolution, and changes the name of the farm to "The Manor Farm". The animals, overhearing the conversation, notice that the faces of the pigs have begun changing. 
During a poker match, an argument breaks out between Napoleon and Mr Pilkington, and the animals realise that the faces of the pigs look like the faces of humans, and no one can tell the difference between them. The pigs Snowball, Napoleon, and Squealer adapt Old Major\'s ideas into an actual philosophy, which they formally name Animalism. Soon after, Napoleon and Squealer indulge in the vices of humans (drinking alcohol, sleeping in beds, trading). Squealer is employed to alter the Seven Commandments to account for this humanisation, an allusion to the Soviet government\'s revising of history in order to exercise control of the people\'s beliefs about themselves and their society. The original commandments are: # Whatever goes upon two legs is an enemy. # Whatever goes upon four legs, or has wings, is a friend. # No animal shall wear clothes. # No animal shall sleep in a bed. # No animal shall drink alcohol. # No animal shall kill any other animal. # All animals are equal. Later, Napoleon and his pigs secretly revise some commandments to clear them of accusations of law-breaking (such as "No animal shall drink alcohol" having "to excess" appended to it and "No animal shall sleep in a bed" with "with sheets" added to it). The changed commandments are as follows, with the changes bolded: * 4 No animal shall sleep in a bed with sheets. * 5 No animal shall drink alcohol to excess. * 6 No animal shall kill any other animal without cause. Eventually these are replaced with the maxims, "All animals are equal, but some animals are more equal than others", and "Four legs good, two legs better!" as the pigs become more human. This is an ironic twist to the original purpose of the Seven Commandments, which were supposed to keep order within Animal Farm by uniting the animals together against the humans, and prevent animals from following the humans\' evil habits. 
Through the revision of the commandments, Orwell demonstrates how simply political dogma can be turned into malleable propaganda.'

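# infer_vector expects a list of tokens, tokenized the same way as the training data
vector = model.infer_vector(word_tokenize(entree.lower()))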
liste.shape

#distances = cdist([vector], liste)[0]
#distances
#vector

(16559,)
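
The shape `(16559,)` shows that `liste` is a one-dimensional Series whose elements are arrays, which `cdist` cannot consume directly. A possible way forward, mirroring the `np.stack` step used earlier for `documents_matrix`, is sketched below on toy data. The three two-dimensional vectors are hypothetical stand-ins for the inferred Doc2Vec vectors.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical stand-in for `liste`: a Series holding one array per document.
inferred = pd.Series([
    np.array([0.0, 0.0]),
    np.array([3.0, 4.0]),
    np.array([1.0, 0.0]),
])
print(inferred.shape)  # (3,)

# Stack the per-document arrays into a 2-D matrix, one row per document.
doc_matrix = np.stack(inferred.to_list())
print(doc_matrix.shape)  # (3, 2)

query = np.array([0.0, 0.0])
distances = cdist([query], doc_matrix)[0]  # Euclidean by default
print(distances)
```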