# Data processing

#### Requirements

- [TextHero](https://github.com/jbesomi/texthero)
- [langdetect](https://pypi.org/project/langdetect/)
- pandas 
- numpy
- matplotlib
- re



In [1]:
import re
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

import texthero as hero
from texthero import stopwords
from texthero import preprocessing

from langdetect import detect

from ast import literal_eval

In [2]:
data = pd.read_csv("data/books.csv")
data.head(3)

Unnamed: 0,bookId,title,series,author,rating,description,language,isbn,genres,characters,...,firstPublishDate,awards,numRatings,ratingsByStars,likedPercent,setting,coverImg,bbeScore,bbeVotes,price
0,2767052-the-hunger-games,The Hunger Games,The Hunger Games #1,Suzanne Collins,4.33,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,English,9780439023481,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...","['Katniss Everdeen', 'Peeta Mellark', 'Cato (H...",...,,['Locus Award Nominee for Best Young Adult Boo...,6376780,"['3444695', '1921313', '745221', '171994', '93...",96.0,"['District 12, Panem', 'Capitol, Panem', 'Pane...",https://i.gr-assets.com/images/S/compressed.ph...,2993816,30516,5.09
1,2.Harry_Potter_and_the_Order_of_the_Phoenix,Harry Potter and the Order of the Phoenix,Harry Potter #5,"J.K. Rowling, Mary GrandPré (Illustrator)",4.5,There is a door at the end of a silent corrido...,English,9780439358071,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...","['Sirius Black', 'Draco Malfoy', 'Ron Weasley'...",...,06/21/03,['Bram Stoker Award for Works for Young Reader...,2507623,"['1593642', '637516', '222366', '39573', '14526']",98.0,['Hogwarts School of Witchcraft and Wizardry (...,https://i.gr-assets.com/images/S/compressed.ph...,2632233,26923,7.38
2,2657.To_Kill_a_Mockingbird,To Kill a Mockingbird,To Kill a Mockingbird,Harper Lee,4.28,The unforgettable novel of a childhood in a sl...,English,9999999999999,"['Classics', 'Fiction', 'Historical Fiction', ...","['Scout Finch', 'Atticus Finch', 'Jem Finch', ...",...,07/11/60,"['Pulitzer Prize for Fiction (1961)', 'Audie A...",4501075,"['2363896', '1333153', '573280', '149952', '80...",95.0,"['Maycomb, Alabama (United States)']",https://i.gr-assets.com/images/S/compressed.ph...,2269402,23328,


In [3]:
data.shape

(52478, 25)

Initially, the dataset contains 52,478 records across 25 columns.

## Sanitize book formats

In [4]:
counts = data['bookFormat'].nunique()
counts

136

In [5]:
print(data['bookFormat'].unique())

['Hardcover' 'Paperback' 'Mass Market Paperback' 'Kindle Edition'
 'Audiobook' 'ebook' nan 'Board book' 'Boxed Set' 'Leather Bound'
 'Capa dura' 'Trade Paperback' 'Box Set' 'Board Book' 'Nook'
 'Library Binding' 'Capa comum' 'Pasta blanda' 'Audio Cassette'
 'Unknown Binding' 'Audio CD' 'Slipcased Hardcover' 'Broschiert'
 'Paperback ' 'Brochura' 'MP3 CD' 'Audible Audio' 'hardcover' 'cloth'
 'Pasta dura' 'Paperback/Kindle' 'paper' 'Hard Cover' 'Perfect Paperback'
 'Poche' 'Comics' 'Hardcover Slipcased ' 'Unbound' 'Taschenbuch'
 'Paper back' 'Paperback, Kindle, Ebook, Audio' 'CD-ROM'
 'Paperback and Kindle' 'Hardcover im Schuber' 'paperback'
 'Graphic Novels' 'Broché' 'Science Fiction Book Club Omnibus' 'Newsprint'
 'Spiral-bound' 'Mass Market' 'Hardcover Boxed Set'
 'Mass Market Paperback ' 'Hardback' 'Audio' 'Novel' 'Gebundene Ausgabe'
 'softcover' 'گالینگور-وزیری' 'hardbound' 'Hard cover, Soft cover, e-book'
 'Kindle' 'Paperback/Ebook' 'Online Fiction' 'Interactive ebook'
 'Paperback m

In [6]:
# set all names to lower case
data['bookFormat'] = data['bookFormat'].str.lower()
data['bookFormat'] = data['bookFormat'].fillna('unknown binding')

cover_types = data[['bookFormat', 'title']]
cover_types.insert(1, 'count', 1)

n_uniques = cover_types[['bookFormat', 'count']].groupby('bookFormat').sum()
n_uniques.head()

bookFormat,count
album,1
audible audio,33
audio,16
audio book,1
audio cassette,36


Most of the book formats are either variants of the same format (different spellings, casing or languages) or very marginal. Let's group them into four major categories (hardcover, paperback, audiobook and ebook) and put everything else into an 'other' category.
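As a side note, the per-format counts can also be obtained with `value_counts`. The sketch below runs on a toy series rather than the real `data['bookFormat']` column, and the extra `str.strip()` step is an assumption of this example (the notebook itself only lowercases), added because the listing above contains entries such as `'Paperback '` with a trailing space.

```python
import pandas as pd

# Toy sample standing in for data['bookFormat'] (values taken from the listing above)
formats = pd.Series(['Paperback', 'Paperback ', 'paperback', 'Hardcover', 'hardcover', None])

# Lowercase, trim surrounding whitespace, and label missing values
normalized = formats.str.lower().str.strip().fillna('unknown binding')

# One row per distinct format, sorted by frequency
format_counts = normalized.value_counts()
print(format_counts)
```

With the whitespace trimmed, the three paperback spellings collapse into a single entry instead of two.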

In [7]:
n_uniques = n_uniques[n_uniques['count'] > 10] 
n_uniques

bookFormat,count
audible audio,33
audio,16
audio cassette,36
audio cd,146
audiobook,107
board book,37
comics,17
ebook,2547
hardcover,12278
kindle edition,5834


In [8]:
format_names = ['hardcover', 'paperback', 'audiobook', 'ebook']
all_cover_names = []
format_categories = [
    ['hardcover', 'slipcased hardcover', 'hardcover slipcased', 'hardcover im schuber', 'hardcover boxed set', 'hardcover, paper dust jacket', 'hardcover chapbook', 'tankobon hardcover'],
    ['mass market paperback', 'paperback', 'trade paperback', 'paperback/ebook', 'paperback, ebook', 'paperback/kindle'],
    ['audible audio', 'audio', 'audio cassette', 'audiobook', 'audio play', 'audio cd'],
    ['ebook', 'kindle edition', 'interactive ebook', 'softcover, free ebook', 'kindle_edition', 'pdf']
]

for i, cover_names in enumerate(format_categories):
    # Boolean column: True when the book's format belongs to this category
    data[format_names[i]] = data['bookFormat'].isin(cover_names)
    all_cover_names.extend(cover_names)

# Every format not covered by the four categories above
data['other'] = ~data['bookFormat'].isin(all_cover_names)

In [9]:
data[['bookFormat', 'hardcover', 'paperback', 'audiobook', 'ebook', 'other']].sample(10)

Unnamed: 0,bookFormat,hardcover,paperback,audiobook,ebook,other
1240,paperback,False,True,False,False,False
37921,paperback,False,True,False,False,False
16342,paperback,False,True,False,False,False
30316,paperback,False,True,False,False,False
1506,paperback,False,True,False,False,False
32040,kindle edition,False,False,False,True,False
43339,unknown binding,False,False,False,False,True
36714,kindle edition,False,False,False,True,False
51433,paperback,False,True,False,False,False
28926,hardcover,True,False,False,False,False


## Sanitize non-valid book descriptions

An important part of the project relies on the similarities between book descriptions, so we require every description to be present and meaningful. Let's filter out the NaN values, as well as descriptions that are too short or contain no Latin letters.

In [10]:
data.shape

(52478, 30)

In [11]:
MIN_NB_CHAR = 40

data = data[data['description'].notna()]

# Keep a description only if it is longer than MIN_NB_CHAR characters AND contains at least one Latin letter.
# Note: without re.DOTALL, `.` stops at the first newline, so the lookahead only inspects the first line.
data['is_descr_valid'] = data.apply(lambda row: len(row['description'])>MIN_NB_CHAR and bool(re.match('^(?=.*[a-zA-Z])', row['description'])), axis=1)

data = data[data['is_descr_valid']]
data = data.drop(columns=['is_descr_valid'])

In [12]:
data.shape

(48940, 30)

This removes 3,538 books (about 7% of the dataset), which is acceptable.
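The same kind of filter can be sketched with vectorized string methods instead of a row-wise `apply`. This is a minimal illustration on toy data, not the notebook's exact logic: `str.contains` looks for a Latin letter anywhere in the text, whereas the `re.match` lookahead above only inspects the first line.

```python
import pandas as pd

MIN_NB_CHAR = 40  # same threshold as in the notebook

# Toy descriptions: too short, no Latin letters, missing, and valid
df = pd.DataFrame({'description': [
    'Too short.',
    '1234567890 ' * 5,
    None,
    'A perfectly ordinary description that is long enough to keep.',
]})

descr = df['description']
# na=False makes missing descriptions count as invalid
has_latin_letter = descr.str.contains(r'[a-zA-Z]', regex=True, na=False)
# Missing descriptions do not pass the length test either
is_long_enough = descr.str.len() > MIN_NB_CHAR

valid = df[has_latin_letter & is_long_enough]
print(valid)
```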

## Sanitize book language

In [13]:
data_not_en = data[data['language'] != 'English']
data_not_en.shape

(7521, 30)

Around 15% of the remaining dataset is not labelled as English, which is non-negligible.

In [14]:
data_not_en.sample(5)

Unnamed: 0,bookId,title,series,author,rating,description,language,isbn,genres,characters,...,setting,coverImg,bbeScore,bbeVotes,price,hardcover,paperback,audiobook,ebook,other
21471,3726791-die-w-chter-trilogie,Die Wächter-Trilogie: Drei Romane in einem Band,Дозоры #1-3,Sergei Lukyanenko,4.18,"First three books in Lukyanenko's ""Watch"" seri...",German,9783453532861,"['Fantasy', 'Fiction', 'Horror', 'Urban Fantas...",[],...,[],https://i.gr-assets.com/images/S/compressed.ph...,99,1,5.48,False,True,False,False,False
42808,22807749-stealing-from-god,Stealing from God: Why Atheists Need God to Ma...,,Frank Turek,4.29,"If you think atheists have reason, evidence, a...",,9781612917016,"['Christianity', 'Nonfiction', 'Philosophy', '...",[],...,[],https://i.gr-assets.com/images/S/compressed.ph...,76,1,13.79,False,True,False,False,False
34975,17948902-de-geesten-van-het-dodenpunt,de geesten van het dodenpunt,,Eve Bunting,4.12,"After a car plunges over a dangerous cliff, ki...",Dutch,9789026977909,"['Young Adult', 'Ghosts', 'Teen', 'Paranormal'...",[],...,[],,90,1,,False,False,False,False,True
20739,42186115-sins-of-the-son,Sins of the Son (The Frank Lucianus Mafia #1),,Frank Lucianus (Goodreads Author),4.02,Alternate cover edition of ASIN B07HZ1ZL26An u...,,9999999999999,['Dark'],[],...,[],https://i.gr-assets.com/images/S/compressed.ph...,100,1,,False,False,False,True,False
18480,8156791-genesis,Genesis: Commentary,,Robert Alter,4.17,"This volume, a part of the Old Testament Libra...",,9780393316704,"['Religion', 'Theology', 'School', 'Classics',...",[],...,[],https://i.gr-assets.com/images/S/compressed.ph...,100,1,5.26,False,False,False,True,False


Many books whose language is NaN nevertheless have an English description. Let's run the langdetect package (a Python port of Google's language-detection library) on the descriptions to infer whether each book is indeed in English.

In [15]:
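# Flag rows whose description is classified as English by langdetect.
# Caution: the first test compares the description text (not the 'language' column) to 'English',
# so in practice every description is passed to detect().
data['is_description_en'] = data.apply(lambda row: row['description']=='English' or detect(row['description'])=='en', axis=1)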

In [16]:
# Let's drop the books whose summary has not been detected to be in English
data = data[data['is_description_en']==True]

In [17]:
data.shape

(44799, 31)
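According to its README, langdetect is non-deterministic unless `DetectorFactory.seed` is set, and `detect` raises an exception on text with no usable features. The sketch below wraps a detector in a small helper that treats any failure as "not English". The helper `is_english` and the stand-in `fake_detect` are hypothetical names introduced for this illustration; the real langdetect calls are left commented so the example runs without the package installed.

```python
def is_english(text, detect_fn):
    """Return True if detect_fn classifies text as English ('en').

    detect_fn is expected to behave like langdetect.detect; any exception it
    raises (for example on text with no detectable features) counts as 'not English'.
    """
    try:
        return detect_fn(text) == 'en'
    except Exception:
        return False


# Intended use with langdetect (commented out so the sketch runs without it):
# from langdetect import DetectorFactory, detect
# DetectorFactory.seed = 0   # makes langdetect's results reproducible
# data['is_description_en'] = data['description'].apply(lambda t: is_english(t, detect))

# Stand-in detector, for illustration only
def fake_detect(text):
    if not text.strip():
        raise ValueError('no features in text')
    return 'en' if 'the' in text.lower().split() else 'de'


print(is_english('The door at the end of the corridor', fake_detect))
print(is_english('   ', fake_detect))
```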

## Sanitize books without any genres 

Our main visualization depends heavily on the genre of each displayed book, so we require every valid book to have at least one genre. Furthermore, to avoid displaying the same book several times in our bubble graph, we represent each book by its first listed genre only.

In [18]:
data.shape

(44799, 31)

In [19]:
# Remove books with NaN values on 'genres'
data = data[data['genres'].notna()]

In [20]:
# Remove books whose genre list is empty
data['nb_genres'] = data.apply(lambda row: len(literal_eval(row['genres'])), axis=1)
data = data[data['nb_genres']>=1]
data = data.drop(['nb_genres'], axis = 1)

In [21]:
# Derive main genre of each book into separate column
data['first_genre'] = data.apply(lambda row: literal_eval(row['genres'])[0], axis=1)

In [22]:
data.shape

(41476, 32)

This procedure removed 3,323 books (about 7% of the remaining dataset), which is acceptable.
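The genre steps above call `literal_eval` twice per row. A possible variant, sketched here on toy data, parses each string once and derives both the filter and the main genre from the parsed lists. The sample rows are made up for illustration.

```python
from ast import literal_eval
import pandas as pd

# Toy rows mimicking the string-encoded lists in the 'genres' column
df = pd.DataFrame({'genres': ["['Fantasy', 'Fiction']", "[]", None, "['Classics']"]})

# Parse each string once; missing values become empty lists
parsed = df['genres'].apply(lambda s: literal_eval(s) if isinstance(s, str) else [])

# Keep rows with at least one genre and take the first one as the main genre
has_genre = parsed.str.len() >= 1
df = df[has_genre].copy()
df['first_genre'] = parsed[has_genre].str[0]
print(df)
```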

## Infer missing prices and number of pages

On the website, users filter the dataset with widgets (for example on price and number of pages). For this reason, we want every book in the dataset to have a value for the fields concerned. Replacing NaN values with the median has a low impact on the distribution and keeps most books displayable when the user sets the widgets to average values.
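The idea of median imputation can be sketched in a few lines. The example below uses a toy `price` column and relies on `pd.to_numeric(..., errors='coerce')` to turn malformed strings into NaN in one pass, which is an alternative to listing the corrupted values by hand as done in the cells that follow.

```python
import pandas as pd

# Toy prices as strings, including a corrupted entry and a missing value
df = pd.DataFrame({'price': ['5.09', '1.743.28', None, '7.38']})

# errors='coerce' turns anything unparseable (such as '1.743.28') into NaN
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Median of the valid prices, then use it to fill the gaps
median_price = df['price'].median()
df['price'] = df['price'].fillna(median_price)
print(df['price'].tolist())
```

A side effect of this approach is that the column ends up with a numeric dtype rather than a mix of strings and floats.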

#### a) Prices

In [23]:
# Replace a few corrupted datapoints by NaN values
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
corrupted_prices = ['1.743.28', '1.307.46', '8.715.51', '1.775.18', '1.734.84']
data['price'] = data['price'].replace(corrupted_prices, np.nan)

In [24]:
# Work on an explicit copy so that the conversion does not trigger a SettingWithCopyWarning
no_nan_price = data[data['price'].notna()].copy()
no_nan_price['price'] = pd.to_numeric(no_nan_price['price'])


In [25]:
max_price = no_nan_price['price'].max()
max_price

898.64

In [26]:
min_price = no_nan_price['price'].min()
min_price

0.84

In [27]:
median_price = no_nan_price['price'].median()
median_price

5.09

In [28]:
data['price'] = data['price'].fillna(median_price)

#### b) Number of pages

In [29]:
# Replace a few corrupted datapoints
data['pages'] = data['pages'].replace(['1 page'],1)

In [30]:
# Work on an explicit copy so that the conversion does not trigger a SettingWithCopyWarning
with_nb_pages = data[data['pages'].notna()].copy()
with_nb_pages['pages'] = pd.to_numeric(with_nb_pages['pages'])


In [31]:
max_pages = with_nb_pages['pages'].max()
max_pages

14777

In [32]:
min_pages = with_nb_pages['pages'].min()
min_pages

0

In [33]:
median_pages = with_nb_pages['pages'].median()
median_pages

312.0

In [34]:
data['pages'] = data['pages'].fillna(median_pages)

## Export useful pre-processed data

In [35]:
# Fields currently used on the website, can be adapted to future needs
usefull_fields = [
    'bookId',
    'title',
    'author',
    'first_genre',
    'coverImg',
    'rating',
    'numRatings',
    'price',
    'description',
    'pages',
    'publishDate',
    'publisher',
    'bookFormat',
    'hardcover',
    'paperback',
    'audiobook',
    'ebook',
    'other',
    'isbn'
]

export_data = data[usefull_fields]

In [36]:
export_data.head(5)

Unnamed: 0,bookId,title,author,first_genre,coverImg,rating,numRatings,price,description,pages,publishDate,publisher,bookFormat,hardcover,paperback,audiobook,ebook,other,isbn
0,2767052-the-hunger-games,The Hunger Games,Suzanne Collins,Young Adult,https://i.gr-assets.com/images/S/compressed.ph...,4.33,6376780,5.09,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,374,09/14/08,Scholastic Press,hardcover,True,False,False,False,False,9780439023481
1,2.Harry_Potter_and_the_Order_of_the_Phoenix,Harry Potter and the Order of the Phoenix,"J.K. Rowling, Mary GrandPré (Illustrator)",Fantasy,https://i.gr-assets.com/images/S/compressed.ph...,4.5,2507623,7.38,There is a door at the end of a silent corrido...,870,09/28/04,Scholastic Inc.,paperback,False,True,False,False,False,9780439358071
2,2657.To_Kill_a_Mockingbird,To Kill a Mockingbird,Harper Lee,Classics,https://i.gr-assets.com/images/S/compressed.ph...,4.28,4501075,5.09,The unforgettable novel of a childhood in a sl...,324,05/23/06,Harper Perennial Modern Classics,paperback,False,True,False,False,False,9999999999999
3,1885.Pride_and_Prejudice,Pride and Prejudice,"Jane Austen, Anna Quindlen (Introduction)",Classics,https://i.gr-assets.com/images/S/compressed.ph...,4.26,2998241,5.09,Alternate cover edition of ISBN 9780679783268S...,279,10/10/00,Modern Library,paperback,False,True,False,False,False,9999999999999
4,41865.Twilight,Twilight,Stephenie Meyer,Young Adult,https://i.gr-assets.com/images/S/compressed.ph...,3.6,4964519,2.1,About three things I was absolutely positive.\...,501,09/06/06,"Little, Brown and Company",paperback,False,True,False,False,False,9780316015844


In [37]:
export_data.to_csv(path_or_buf='./data/processed_books.csv', index=False)
# export_data.to_pickle(path='./data/processed_books.pkl')

## Export Tinder books data

In [38]:
tinder_30book_ids = [
    '2767052-the-hunger-games',
    '2.Harry_Potter_and_the_Order_of_the_Phoenix',
    '2657.To_Kill_a_Mockingbird',
    '1885.Pride_and_Prejudice',
    '33.The_Lord_of_the_Rings',
    '370493.The_Giving_Tree',
    '968.The_Da_Vinci_Code',
    '24213.Alice_s_Adventures_in_Wonderland_Through_the_Looking_Glass',
    '24280.Les_Mis_rables',
    '18144590-the-alchemist',
    '7144.Crime_and_Punishment',
    '22628.The_Perks_of_Being_a_Wallflower',
    '375802.Ender_s_Game',
    '17245.Dracula',
    '13496.A_Game_of_Thrones',
    '1381.The_Odyssey',
    '4214.Life_of_Pi',
    '44767458-dune',
    '3590.The_Adventures_of_Sherlock_Holmes',
    '2429135.The_Girl_with_the_Dragon_Tattoo',
    '4934.The_Brothers_Karamazov',
    '99107.Winnie_the_Pooh',
    '49552.The_Stranger',
    '11588.The_Shining',
    '99561.Looking_for_Alaska',
    '1618.The_Curious_Incident_of_the_Dog_in_the_Night_Time',
    '119073.The_Name_of_the_Rose',
    '22034.The_Godfather',
    '830502.It',
    '1845.Into_the_Wild', 
]

tinder_books_df = data[data['bookId'].isin(tinder_30book_ids)]
tinder_books_df.shape

(30, 32)
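Because the earlier filtering steps can drop books, it may be worth checking that every hand-picked ID is still present after the `isin` selection. The sketch below shows one way to list the missing IDs. The toy `catalogue` frame and the deliberately absent ID are made up for illustration.

```python
import pandas as pd

# Toy catalogue and a hand-picked ID list; the last ID is deliberately absent
catalogue = pd.DataFrame({'bookId': ['17245.Dracula', '830502.It', '4214.Life_of_Pi']})
wanted_ids = ['830502.It', '17245.Dracula', '0.Not_In_Catalogue']

subset = catalogue[catalogue['bookId'].isin(wanted_ids)]

# IDs that were requested but are not in the filtered catalogue
missing_ids = sorted(set(wanted_ids) - set(subset['bookId']))
print(missing_ids)
```

In the notebook, the shape `(30, 32)` for a list of 30 IDs indicates that nothing was lost.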

In [39]:
tinder_books_df.head(30)

Unnamed: 0,bookId,title,series,author,rating,description,language,isbn,genres,characters,...,bbeScore,bbeVotes,price,hardcover,paperback,audiobook,ebook,other,is_description_en,first_genre
0,2767052-the-hunger-games,The Hunger Games,The Hunger Games #1,Suzanne Collins,4.33,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,English,9780439023481,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...","['Katniss Everdeen', 'Peeta Mellark', 'Cato (H...",...,2993816,30516,5.09,True,False,False,False,False,True,Young Adult
1,2.Harry_Potter_and_the_Order_of_the_Phoenix,Harry Potter and the Order of the Phoenix,Harry Potter #5,"J.K. Rowling, Mary GrandPré (Illustrator)",4.5,There is a door at the end of a silent corrido...,English,9780439358071,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...","['Sirius Black', 'Draco Malfoy', 'Ron Weasley'...",...,2632233,26923,7.38,False,True,False,False,False,True,Fantasy
2,2657.To_Kill_a_Mockingbird,To Kill a Mockingbird,To Kill a Mockingbird,Harper Lee,4.28,The unforgettable novel of a childhood in a sl...,English,9999999999999,"['Classics', 'Fiction', 'Historical Fiction', ...","['Scout Finch', 'Atticus Finch', 'Jem Finch', ...",...,2269402,23328,5.09,False,True,False,False,False,True,Classics
3,1885.Pride_and_Prejudice,Pride and Prejudice,,"Jane Austen, Anna Quindlen (Introduction)",4.26,Alternate cover edition of ISBN 9780679783268S...,English,9999999999999,"['Classics', 'Fiction', 'Romance', 'Historical...","['Mr. Bennet', 'Mrs. Bennet', 'Jane Bennet', '...",...,1983116,20452,5.09,False,True,False,False,False,True,Classics
12,370493.The_Giving_Tree,The Giving Tree,,Shel Silverstein,4.37,"""Once there was a tree...and she loved a littl...",English,9780060256654,"['Childrens', 'Picture Books', 'Classics', 'Fi...",[],...,1021534,10594,4.87,True,False,False,False,False,True,Childrens
14,968.The_Da_Vinci_Code,The Da Vinci Code,Robert Langdon #2,Dan Brown (Goodreads Author),3.86,ISBN 9780307277671 moved to this edition.While...,English,9999999999999,"['Fiction', 'Mystery', 'Thriller', 'Suspense',...","['Sophie Neveu', 'Robert Langdon', 'Sir Leigh ...",...,876633,9231,5.09,False,True,False,False,False,True,Fiction
17,24213.Alice_s_Adventures_in_Wonderland_Through...,Alice's Adventures in Wonderland & Through the...,Alice's Adventures in Wonderland #1-2,"Lewis Carroll, John Tenniel (Illustrator), Mar...",4.06,"""I can't explain myself, I'm afraid, sir,"" sai...",English,9780451527745,"['Classics', 'Fantasy', 'Fiction', 'Childrens'...","['The Hatter (Lewis Carroll)', 'The Queen of H...",...,833791,8812,3.07,False,True,False,False,False,True,Classics
19,24280.Les_Mis_rables,Les Misérables,,"Victor Hugo, Lee Fahnestock (Translator), Norm...",4.18,Introducing one of the most famous characters ...,English,9999999999999,"['Classics', 'Fiction', 'Historical Fiction', ...","['Jean Valjean', 'Javert', 'Cosette', 'Fantine...",...,813088,8548,5.09,False,True,False,False,False,True,Classics
24,18144590-the-alchemist,The Alchemist,,"Paulo Coelho (Goodreads Author), Alan R. Clark...",3.88,Paulo Coelho's enchanting novel has inspired a...,English,9780062315007,"['Fiction', 'Classics', 'Fantasy', 'Philosophy...","['Santiago', 'Alchemist', 'Melchizedek']",...,765587,8008,13.22,False,True,False,False,False,True,Fiction
25,7144.Crime_and_Punishment,Crime and Punishment,,"Fyodor Dostoyevsky, David McDuff (Translator)",4.22,"Raskolnikov, a destitute and desperate former ...",English,9780143058144,"['Classics', 'Fiction', 'Russia', 'Literature'...","['Rodion Romanovich Raskolnikov', 'Porfiry Pet...",...,759066,7937,18.85,False,True,False,False,False,True,Classics


In [40]:
tinder_books_df[['bookId', 'coverImg']].to_csv(path_or_buf='./data/tinder_books.csv', index=False)


In [2]:
data = pd.read_pickle('../processed_books.pkl')

In [3]:
data.head()

Unnamed: 0,bookId,title,author,first_genre,coverImg,rating,numRatings,price,description,pages,publishDate,publisher,bookFormat,hardcover,paperback,audiobook,ebook,other,isbn
0,2767052-the-hunger-games,The Hunger Games,Suzanne Collins,Young Adult,https://i.gr-assets.com/images/S/compressed.ph...,4.33,6376780,5.09,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,374,09/14/08,Scholastic Press,hardcover,True,False,False,False,False,9780439023481
1,2.Harry_Potter_and_the_Order_of_the_Phoenix,Harry Potter and the Order of the Phoenix,"J.K. Rowling, Mary GrandPré (Illustrator)",Fantasy,https://i.gr-assets.com/images/S/compressed.ph...,4.5,2507623,7.38,There is a door at the end of a silent corrido...,870,09/28/04,Scholastic Inc.,paperback,False,True,False,False,False,9780439358071
2,2657.To_Kill_a_Mockingbird,To Kill a Mockingbird,Harper Lee,Classics,https://i.gr-assets.com/images/S/compressed.ph...,4.28,4501075,5.09,The unforgettable novel of a childhood in a sl...,324,05/23/06,Harper Perennial Modern Classics,paperback,False,True,False,False,False,9999999999999
3,1885.Pride_and_Prejudice,Pride and Prejudice,"Jane Austen, Anna Quindlen (Introduction)",Classics,https://i.gr-assets.com/images/S/compressed.ph...,4.26,2998241,5.09,Alternate cover edition of ISBN 9780679783268S...,279,10/10/00,Modern Library,paperback,False,True,False,False,False,9999999999999
4,41865.Twilight,Twilight,Stephenie Meyer,Young Adult,https://i.gr-assets.com/images/S/compressed.ph...,3.6,4964519,2.1,About three things I was absolutely positive.\...,501,09/06/06,"Little, Brown and Company",paperback,False,True,False,False,False,9780316015844


## Compute description similarities

In [41]:
help(hero)

Help on package texthero:

NAME
    texthero - Texthero: python toolkit for text preprocessing, representation and visualization.

PACKAGE CONTENTS
    nlp
    preprocessing
    representation
    stopwords
    visualization

DATA
    Callable = typing.Callable
    List = typing.List
    Optional = typing.Optional
    Set = typing.Set

FILE
    c:\users\guillaumep\anaconda3\lib\site-packages\texthero\__init__.py




In [13]:
# Perform standard cleaning operations on the book descriptions
custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_digits,
                   preprocessing.remove_punctuation,
                   preprocessing.remove_urls,
                   preprocessing.remove_html_tags,
                   preprocessing.remove_diacritics, # Strip accents, e.g. Noël -> Noel
                   preprocessing.remove_whitespace] # Remove extra whitespace, newlines and tabs
data['nlp_description'] = hero.clean(data['description'], custom_pipeline)

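# Add a few custom stopwords to be removed.
# Note: the descriptions are already lowercased at this point, so the uppercase 'ISBN' entry
# never matches ('isbn' survives in row 3 of the output).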
default_stopwords = stopwords.DEFAULT
specific_stopwords = ['ISBN', 'my', 'great', 'interesting']
custom_stopwords = default_stopwords.union(set(specific_stopwords))

data['nlp_description'] = hero.remove_stopwords(data.nlp_description, custom_stopwords)

data[['nlp_description', 'description']].head(10)

Unnamed: 0,nlp_description,description
0,winning means fame fortune losing means certai...,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...
1,door end silent corridor ' haunting harry pott...,There is a door at the end of a silent corrido...
2,unforgettable novel childhood sleepy southern ...,The unforgettable novel of a childhood in a sl...
3,alternate cover edition isbn 9780679783268sinc...,Alternate cover edition of ISBN 9780679783268S...
4,three things absolutely positive first edward ...,About three things I was absolutely positive.\...
