This notebook performs general preprocessing of textual data that will be used by several NLP models.
The input is the original data set, not the filtered one (no 0-ratings and only books with more than 20 recommendations), because this step concerns the books' content rather than the ratings.

In [34]:
import pandas as pd

#if you move this to another file make sure to adjust the path to the csv-file
df = pd.read_csv('../data/raw/Preprocessed_data.csv', index_col=0, encoding = 'ISO-8859-1')
df

Unnamed: 0,user_id,location,age,isbn,rating,book_title,book_author,year_of_publication,publisher,img_s,img_m,img_l,Summary,Language,Category,city,state,country
0,2,"stockton, california, usa",18.0000,0195153448,0,Classical Mythology,Mark P. O. Morford,2002.0,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,Provides an introduction to classical myths pl...,en,['Social Science'],stockton,california,usa
1,8,"timmins, ontario, canada",34.7439,0002005018,5,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],timmins,ontario,canada
2,11400,"ottawa, ontario, canada",49.0000,0002005018,0,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],ottawa,ontario,canada
3,11676,"n/a, n/a, n/a",34.7439,0002005018,8,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],,,
4,41385,"sudbury, ontario, canada",34.7439,0002005018,0,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,"In a small town in Canada, Clara Callan reluct...",en,['Actresses'],sudbury,ontario,canada
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1031170,278851,"dallas, texas, usa",33.0000,0743203763,0,As Hogan Said . . . : The 389 Best Things Anyo...,Randy Voorhees,2000.0,Simon & Schuster,http://images.amazon.com/images/P/0743203763.0...,http://images.amazon.com/images/P/0743203763.0...,http://images.amazon.com/images/P/0743203763.0...,Golf lovers will revel in this collection of t...,en,['Humor'],dallas,texas,usa
1031171,278851,"dallas, texas, usa",33.0000,0767907566,5,All Elevations Unknown: An Adventure in the He...,Sam Lightner,2001.0,Broadway Books,http://images.amazon.com/images/P/0767907566.0...,http://images.amazon.com/images/P/0767907566.0...,http://images.amazon.com/images/P/0767907566.0...,A daring twist on the travel-adventure genre t...,en,['Nature'],dallas,texas,usa
1031172,278851,"dallas, texas, usa",33.0000,0884159221,7,Why stop?: A guide to Texas historical roadsid...,Claude Dooley,1985.0,Lone Star Books,http://images.amazon.com/images/P/0884159221.0...,http://images.amazon.com/images/P/0884159221.0...,http://images.amazon.com/images/P/0884159221.0...,9,9,9,dallas,texas,usa
1031173,278851,"dallas, texas, usa",33.0000,0912333022,7,The Are You Being Served? Stories: 'Camping In...,Jeremy Lloyd,1997.0,Kqed Books,http://images.amazon.com/images/P/0912333022.0...,http://images.amazon.com/images/P/0912333022.0...,http://images.amazon.com/images/P/0912333022.0...,These hilarious stories by the creator of publ...,en,['Fiction'],dallas,texas,usa


In [None]:
from gensim.parsing.preprocessing import preprocess_documents

'remove_pollution' is a function that takes a summary text and a list of substrings. It replaces every occurrence of each substring in the list with a space (' ').
The summaries are polluted with characters that were not properly decoded, for example the HTML entities '&quot;' and '&#39;'.

Some books also have a summary that contains only the string '9', which is a default value for missing summaries. We drop those books.

In [3]:
def remove_pollution(text, pollution_list):
    for trash in pollution_list:
        text = text.replace(trash, ' ')
    return text
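
To make the effect of the function concrete, here is a minimal sketch on a made-up summary string (the sample text is hypothetical, not taken from the data set). It also shows `html.unescape` from the standard library as a possible alternative, which decodes the entities instead of blanking them; the notebook itself does not use it.

```python
import html

def remove_pollution(text, pollution_list):
    # Same logic as the notebook's function: every polluting substring becomes a space
    for trash in pollution_list:
        text = text.replace(trash, ' ')
    return text

# Hypothetical sample with undecoded HTML entities and a line break
sample = 'He discovers it&#39;s best to be &quot;himself&quot;.\nThe end.'

# Approach used in this notebook: apostrophes and quotes turn into spaces
cleaned = remove_pollution(sample, ['\n', '&quot;', '&#39;'])

# Alternative (an assumption, not the notebook's method): decode the entities
decoded = html.unescape(sample).replace('\n', ' ')

print(cleaned)
print(decoded)
```

Replacing with a space splits contractions ("it s"), which is harmless for the token-based preprocessing further down because very short tokens are dropped there anyway.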

In [4]:
#remove all records with "9" in the summary
#.copy() avoids pandas' SettingWithCopyWarning when 'Summary' is reassigned below
df = df[df['Summary'] != '9'].copy()

#remove pollution from the summaries
pollution_list = ['\n', '&quot;','&#39;']
df['Summary'] = df['Summary'].map(lambda x: remove_pollution(x,pollution_list))

Group by ISBN so that each ISBN, book title, and summary appears only once.

In [5]:
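# keep the first record of every ISBN group; the lambda argument is named 'group' so it does not shadow the outer df
books_with_summary = df.groupby('isbn').apply(lambda group: group.iloc[0]).loc[:,['isbn', 'book_title', 'Summary']]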
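
Applying a Python lambda to every group can be slow on a frame with over a million rows. The following is a minimal sketch of an alternative using `drop_duplicates`, shown on a hypothetical toy frame (`toy` and `unique_books` are illustrative names, and the values are made up). Unlike the groupby version, it keeps the original row order and a plain integer index instead of sorting by and indexing on `isbn`.

```python
import pandas as pd

# Hypothetical toy data: the same book rated by two users, plus one other book
toy = pd.DataFrame({
    'isbn': ['0002005018', '0002005018', '0195153448'],
    'book_title': ['Clara Callan', 'Clara Callan', 'Classical Mythology'],
    'Summary': ['summary A', 'summary A', 'summary B'],
    'rating': [5, 0, 0],
})

# Keep the first row per ISBN, then select the three text columns
unique_books = (
    toy.drop_duplicates(subset='isbn', keep='first')
       .loc[:, ['isbn', 'book_title', 'Summary']]
       .reset_index(drop=True)
)
print(unique_books)
```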

In [20]:
print(f'Number of Rows: {books_with_summary.shape[0]}')
print(f'Number of unique ISBNs: {len(books_with_summary.isbn.unique())}')
print(f'Number of unique titles: {len(books_with_summary["book_title"].unique())}')
print(f'Number of unique summaries: {len(books_with_summary["Summary"].unique())}')

Number of Rows: 141967
Number of unique ISBNs: 141967
Number of unique titles: 131845
Number of unique summaries: 136910
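
These counts imply that 141967 - 136910 = 5057 rows carry a summary that already appears under another ISBN (for instance, different editions of the same book). A minimal sketch of how such rows could be inspected, using a hypothetical toy frame with made-up values:

```python
import pandas as pd

# Hypothetical toy data: two ISBNs share one summary text
toy = pd.DataFrame({
    'isbn': ['a1', 'a2', 'b1'],
    'book_title': ['Book A (hardcover)', 'Book A (paperback)', 'Book B'],
    'Summary': ['same text', 'same text', 'other text'],
})

# keep=False marks every member of a duplicate group, not only the repeats
shared = toy[toy.duplicated(subset='Summary', keep=False)].sort_values('Summary')

# Rows beyond the first occurrence of each summary
n_extra = len(toy) - toy['Summary'].nunique()

print(shared)
print(n_extra)
```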


# Preprocessing

In [21]:
from gensim.parsing.preprocessing import preprocess_documents, preprocess_string

In [32]:
# display only: the result is not assigned, so books_with_summary keeps isbn as its index
books_with_summary.reset_index(drop=True)

Unnamed: 0,isbn,book_title,Summary
0,0000913154,The Way Things Work: An Illustrated Encycloped...,"Scientific principles, inventions, and chemica..."
1,0001055607,Cereus Blooms At Night,"When Mala, old and notoriously crazy, arrives ..."
2,0001061127,CHESS FOR YOUNG BEGINNERS,A step by step guide to playing chess
3,0001374362,When It's Time for Bed (Collins Baby & Toddler...,Shows baby and his animal friends preparing fo...
4,0001711253,The Big Honey Hunt,Father Bear takes Small Bear on a honey hunt. ...
...,...,...,...
141962,9992003766,Petersburg,Four people swept up in the turmoil of the Rus...
141963,9997410440,Nine Hours to Rama,Story in the form of a novel of the assassinat...
141964,9997488997,Floating Island,Tells the story of a family of dolls shipwreck...
141965,999750271X,The Star Gazer,A novel of the life of Galileo.


In [25]:
text_corpus = books_with_summary['Summary'].tolist()
text_corpus

['Scientific principles, inventions, and chemical, mechanical, and industrial processes are explained for the general reader with the help of drawings and diagrams',
 'When Mala, old and notoriously crazy, arrives at the Paradise Alms House, she is placed in the tender care of Tyler, a gay male nurse, and an extraordinary relationship begins to develop.',
 'A step by step guide to playing chess',
 'Shows baby and his animal friends preparing for bedtime. 1-2 yrs.',
 'Father Bear takes Small Bear on a honey hunt. After many problems, they go to their local store.',
 'P.J. Funnybunny did not like being a bunny.',
 'Spot is an animal who first changed his spots in PUT ME IN THE ZOO, and this new story finds him changing his shape as well. He becomes an elephant, a giraffe, a mouse, and then discovers it s best to be himself.',
 'The new road is to go right through the Callendar family s garden and February Callendar, while trying to change the Ministry s plans, discovers some very fishy t

In [26]:
text_corpus_preprocessed = preprocess_documents(text_corpus)

In [27]:
text_corpus_preprocessed

[['scientif',
  'principl',
  'invent',
  'chemic',
  'mechan',
  'industri',
  'process',
  'explain',
  'gener',
  'reader',
  'help',
  'draw',
  'diagram'],
 ['mala',
  'old',
  'notori',
  'crazi',
  'arriv',
  'paradis',
  'alm',
  'hous',
  'place',
  'tender',
  'care',
  'tyler',
  'gai',
  'male',
  'nurs',
  'extraordinari',
  'relationship',
  'begin',
  'develop'],
 ['step', 'step', 'guid', 'plai', 'chess'],
 ['show', 'babi', 'anim', 'friend', 'prepar', 'bedtim', 'yr'],
 ['father',
  'bear',
  'take',
  'small',
  'bear',
  'honei',
  'hunt',
  'problem',
  'local',
  'store'],
 ['funnybunni', 'like', 'bunni'],
 ['spot',
  'anim',
  'chang',
  'spot',
  'zoo',
  'new',
  'stori',
  'find',
  'chang',
  'shape',
  'eleph',
  'giraff',
  'mous',
  'discov',
  'best'],
 ['new',
  'road',
  'right',
  'callendar',
  'famili',
  'garden',
  'februari',
  'callendar',
  'try',
  'chang',
  'ministri',
  'plan',
  'discov',
  'fishi',
  'thing',
  'go'],
 ['near', 'end', 'long', 

In [23]:
books_with_summary['Summary']

isbn
0000913154    Scientific principles, inventions, and chemica...
0001055607    When Mala, old and notoriously crazy, arrives ...
0001061127                A step by step guide to playing chess
0001374362    Shows baby and his animal friends preparing fo...
0001711253    Father Bear takes Small Bear on a honey hunt. ...
                                    ...                        
9992003766    Four people swept up in the turmoil of the Rus...
9997410440    Story in the form of a novel of the assassinat...
9997488997    Tells the story of a family of dolls shipwreck...
999750271X                      A novel of the life of Galileo.
9999999999    This is the story of how slavery caused the Ci...
Name: Summary, Length: 141967, dtype: object