# CHILDREN'S BOOKS EDA AND DATA CLEANING

## FIRST STEPS: IMPORTING THE LIBRARIES AND THE DATA

In [1]:
import pandas as pd
import numpy as np
import json
pd.options.display.max_rows=350
pd.options.display.max_columns=40
pd.options.mode.chained_assignment = None 

In [2]:
filename = '../../datasets/goodreads_books_children.json'  #change your path here
data = pd.read_json(filename,lines=True)

In [3]:
books=data.copy()

# EDA AND DATA CLEANING

In [4]:
books.shape

(124082, 29)

In [5]:
books.head(3)

Unnamed: 0,isbn,text_reviews_count,series,country_code,language_code,popular_shelves,asin,is_ebook,average_rating,kindle_asin,similar_books,description,format,link,authors,publisher,num_pages,publication_day,isbn13,publication_month,edition_information,publication_year,url,image_url,book_id,ratings_count,work_id,title,title_without_series
0,1599150603,7,[],US,,"[{'count': '56', 'name': 'to-read'}, {'count':...",,False,4.13,B00DU10PUG,[],"Relates in vigorous prose the tale of Aeneas, ...",Paperback,https://www.goodreads.com/book/show/287141.The...,"[{'author_id': '3041852', 'role': ''}]",Yesterday's Classics,162,13,9781599150604,9,,2006,https://www.goodreads.com/book/show/287141.The...,https://s.gr-assets.com/assets/nophoto/book/11...,287141,46,278578,The Aeneid for Boys and Girls,The Aeneid for Boys and Girls
1,1934876569,6,[151854],US,,"[{'count': '515', 'name': 'to-read'}, {'count'...",,False,4.22,,"[948696, 439885, 274955, 12978730, 372986, 216...","To Kara's astonishment, she discovers that a p...",Paperback,https://www.goodreads.com/book/show/6066812-al...,"[{'author_id': '19158', 'role': ''}]",Seven Seas,216,3,9781934876565,3,,2009,https://www.goodreads.com/book/show/6066812-al...,https://images.gr-assets.com/books/1316637798m...,6066812,98,701117,All's Fairy in Love and War (Avalon: Web of Ma...,All's Fairy in Love and War (Avalon: Web of Ma...
2,590417010,193,[],US,eng,"[{'count': '450', 'name': 'to-read'}, {'count'...",,False,4.43,B017RORXNI,"[834493, 452189, 140185, 1897316, 2189812, 424...",In Newbery Medalist Cynthia Rylant's classic b...,Hardcover,https://www.goodreads.com/book/show/89378.Dog_...,"[{'author_id': '5411', 'role': ''}]",Blue Sky Press,40,1,9780590417013,9,,1995,https://www.goodreads.com/book/show/89378.Dog_...,https://images.gr-assets.com/books/1360057676m...,89378,1331,86259,Dog Heaven,Dog Heaven


In [6]:
books.columns

Index(['isbn', 'text_reviews_count', 'series', 'country_code', 'language_code',
       'popular_shelves', 'asin', 'is_ebook', 'average_rating', 'kindle_asin',
       'similar_books', 'description', 'format', 'link', 'authors',
       'publisher', 'num_pages', 'publication_day', 'isbn13',
       'publication_month', 'edition_information', 'publication_year', 'url',
       'image_url', 'book_id', 'ratings_count', 'work_id', 'title',
       'title_without_series'],
      dtype='object')

## Dropping some columns without meaningful information

In [7]:
books=books.drop(['series','country_code','popular_shelves', 'asin', 'similar_books','kindle_asin','publication_day', 'publication_month', 'edition_information','work_id','authors', 'image_url','link','url','title_without_series'], axis = 1)

In [8]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124082 entries, 0 to 124081
Data columns (total 14 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   isbn                124082 non-null  object 
 1   text_reviews_count  124082 non-null  int64  
 2   language_code       124082 non-null  object 
 3   is_ebook            124082 non-null  object 
 4   average_rating      124082 non-null  float64
 5   description         124082 non-null  object 
 6   format              124082 non-null  object 
 7   publisher           124082 non-null  object 
 8   num_pages           124082 non-null  object 
 9   isbn13              124082 non-null  object 
 10  publication_year    124082 non-null  object 
 11  book_id             124082 non-null  int64  
 12  ratings_count       124082 non-null  int64  
 13  title               124082 non-null  object 
dtypes: float64(1), int64(3), object(10)
memory usage: 13.3+ MB


At first glance we have 4 numerical variables and 10 categorical variables, but num_pages, isbn, isbn13, and publication_year look like they should be integers rather than objects.
There are no null values in the dataset, so these columns must contain noise or missing values encoded some other way (for example as empty strings), which prevents them from being parsed as numerical variables.
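One way to test that suspicion is to count blank strings per column. The snippet below is a minimal sketch on a made-up toy frame (the `toy` data and the helper name `count_blank_strings` are illustrative assumptions, not part of the original notebook).

```python
import pandas as pd

# Toy frame standing in for `books`; the values are made up for illustration.
toy = pd.DataFrame({
    "num_pages": ["32", "", "40"],
    "publication_year": ["2006", "2009", ""],
    "title": ["A", "B", "C"],
})

def count_blank_strings(df):
    """Return, per column, how many entries are empty or whitespace-only strings."""
    return df.apply(lambda col: col.astype(str).str.strip().eq("").sum())

blank_counts = count_blank_strings(toy)
print(blank_counts)
```

Applied to `books`, a non-zero count in a column reported as fully non-null would suggest that missing values are stored as empty strings there.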

## Checking if there are duplicates in the dataframe

In [9]:
books.duplicated().sum() 

0

There are no duplicates in the dataframe.

## Getting the unique values in each column

In [10]:
def get_uniques(df,lim=20):
    for col in df.columns:
        array=list(df[col].values)
        uniques=set(array)
        if len(uniques) < lim:
            print(col, ":", uniques)
        else:
            print(col, ":", len(uniques), "unique values.")          

In [11]:
get_uniques(books)

isbn : 103884 unique values.
text_reviews_count : 1040 unique values.
language_code : 90 unique values.
is_ebook : {'true', 'false'}
average_rating : 299 unique values.
description : 94613 unique values.
format : 209 unique values.
publisher : 11651 unique values.
num_pages : 790 unique values.
isbn13 : 108594 unique values.
publication_year : 192 unique values.
book_id : 124082 unique values.
ratings_count : 4433 unique values.
title : 96354 unique values.


## Univariate analysis

### Numerical variables

In [12]:
books.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
text_reviews_count,124082.0,27.08621,266.547,0.0,2.0,5.0,15.0,49850.0
average_rating,124082.0,3.910883,0.3648551,0.0,3.71,3.94,4.14,5.0
book_id,124082.0,10579290.0,10179900.0,5.0,1414649.5,7068956.0,18165059.25,36469877.0
ratings_count,124082.0,522.8165,10838.69,0.0,10.0,30.0,96.0,1876252.0


#### Book ID

The column book_id is the only variable with 124082 unique values, one per row. Each value corresponds to one Goodreads book page.

In [13]:
books.book_id.nunique()

124082

In [14]:
books[books.book_id == 5]

Unnamed: 0,isbn,text_reviews_count,language_code,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title
53726,043965548X,28561,eng,False,4.53,Harry Potter's third year at Hogwarts is full ...,Mass Market Paperback,Scholastic Inc.,435,9780439655484,2004,5,1876252,Harry Potter and the Prisoner of Azkaban (Harr...


#### Text Reviews Count

The column text_reviews_count gives the number of text reviews for each book_id in the dataset. The minimum is 0 and the maximum is 49850, so some books are far more popular than others.

In [15]:
books.text_reviews_count.value_counts()

1        21107
2        15054
3        11395
4         8753
5         7015
         ...  
987          1
3290         1
1627         1
11870        1
7868         1
Name: text_reviews_count, Length: 1040, dtype: int64

In [16]:
books[books.text_reviews_count== 49850]

Unnamed: 0,isbn,text_reviews_count,language_code,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title
110278,385732554,49850,eng,False,4.12,Twelve-year-old Jonas lives in a seemingly ide...,Paperback,Ember,208,9780385732550,2006,3636,1311422,"The Giver (The Giver, #1)"


The book with the most text reviews is The Giver, which also has 1311422 ratings.

In [17]:
books[books.text_reviews_count== 0].title.count()

15

In [18]:
books[books.text_reviews_count== 0].title

10853                                    James and the Mini
22957                      Alice's Adventures in Wonderland
26069                                     Muumipeikko herää
31913                                           Hoàng Tử Bé
32681                                The Story of Christmas
44094                        Sagwa, the Chinese Siamese Cat
57177                          The Mystery of the Wrong Dog
67331                           Laika: The 1st Dog in Space
72909                                            Büyülü Kuş
95282     Lego Ninjago: Masters of Spinjitzu: Official G...
95815                      Alice's Adventures in Wonderland
99336                           My Name is Oscar the Grouch
100312                                       In My Backyard
107820    DORK Diaries 02. Nikkis (nicht ganz so) glamou...
114660                                     Figgs & Phantoms
Name: title, dtype: object

In [19]:
books[books.text_reviews_count== 0].ratings_count

10853      5
22957     56
26069      1
31913      8
32681     28
44094      7
57177     46
67331      0
72909      5
95282     39
95815     38
99336      3
100312     3
107820    51
114660     7
Name: ratings_count, dtype: int64

There are 15 books with 0 text reviews, a few of them in less widely spoken languages. These books don't have many ratings either.

#### Rating counts

ratings_count ranges from 0 to 1876252, but the middle half of the sample (25th to 75th percentile) lies between 10 and 96.

In [20]:
books[books.ratings_count== 1876252]

Unnamed: 0,isbn,text_reviews_count,language_code,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title
53726,043965548X,28561,eng,False,4.53,Harry Potter's third year at Hogwarts is full ...,Mass Market Paperback,Scholastic Inc.,435,9780439655484,2004,5,1876252,Harry Potter and the Prisoner of Azkaban (Harr...


In [21]:
books[books.ratings_count== 0].title.count()

316

There are 316 books without any ratings. Let's drop them.

In [22]:
books.shape

(124082, 14)

In [23]:
books=books[books.ratings_count!= 0]

In [24]:
books.shape

(123766, 14)

In [25]:
books.describe()

Unnamed: 0,text_reviews_count,average_rating,book_id,ratings_count
count,123766.0,123766.0,123766.0,123766.0
mean,27.152796,3.911681,10566360.0,524.1514
std,266.883826,0.360879,10176660.0,10852.49
min,0.0,1.0,5.0,1.0
25%,2.0,3.71,1413372.0,10.0
50%,5.0,3.94,7049316.0,31.0
75%,15.0,4.14,18154770.0,96.0
max,49850.0,5.0,36469880.0,1876252.0


#### Average rating

The average rating ranges from 1 to 5, with a mean of 3.91.

In [26]:
books[books.average_rating == 1].title.count()

10

In [27]:
books[books.average_rating == 1].ratings_count

29219    1
30726    1
31565    1
32445    1
38341    1
40781    1
42742    3
56489    1
78572    1
96827    1
Name: ratings_count, dtype: int64

In [28]:
books[books.average_rating == 5].title.count()

388

### Categorical variables that should be numerical

#### ISBN

In [29]:
books.isbn.value_counts()

              20129
078577842X        1
0746073356        1
0060734663        1
1519549466        1
              ...  
0380977176        1
0061234753        1
1619630451        1
1627552987        1
2070612384        1
Name: isbn, Length: 103638, dtype: int64

There are 20129 missing values (empty strings) in the column isbn. Each of the remaining books has a unique isbn.

#### ISBN 13

In [30]:
books.isbn13.value_counts()

                 15436
9780439057493        1
9780763659837        1
9780803730960        1
9780670852529        1
                 ...  
9780152057770        1
9780747562832        1
9780679892281        1
9781402258893        1
9780066238685        1
Name: isbn13, Length: 108331, dtype: int64

There are 15436 missing values (empty strings) in the column isbn13. Each of the remaining books has a unique isbn13.

An ISBN is an International Standard Book Number. ISBNs were 10 digits long until the end of December 2006; since 1 January 2007 they always consist of 13 digits. ISBNs are calculated using a specific mathematical formula and include a check digit to validate the number.
An ISBN is essentially a product identifier used by publishers, booksellers, libraries, internet retailers and other supply chain participants for ordering, listing, sales records and stock control purposes. The ISBN identifies the registrant as well as the specific title, edition and format.

ISBNs are assigned to text-based monographic publications (i.e. one-off publications rather than journals, newspapers, or other types of serials). Any book made publicly available, whether for sale or on a gratis basis, can be identified by ISBN.
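The check digit mentioned above can be verified in a few lines. This is a sketch of the standard checksum rules, not something the original notebook runs: an ISBN-10 is valid when its digits, weighted 10 down to 1, sum to a multiple of 11 (with `X` standing for 10 in the last position), and an ISBN-13 is valid when its digits, weighted alternately 1 and 3, sum to a multiple of 10. The function names are hypothetical.

```python
DIGITS = "0123456789"

def is_valid_isbn10(isbn):
    """ISBN-10 check: weights 10..1, total divisible by 11, 'X' means 10 in the last position."""
    if len(isbn) != 10:
        return False
    total = 0
    for position, char in enumerate(isbn):
        if char in DIGITS:
            value = int(char)
        elif char in "Xx" and position == 9:
            value = 10
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0

def is_valid_isbn13(isbn):
    """ISBN-13 check: weights alternate 1 and 3, total divisible by 10."""
    if len(isbn) != 13 or any(char not in DIGITS for char in isbn):
        return False
    total = sum(int(char) * (3 if position % 2 else 1)
                for position, char in enumerate(isbn))
    return total % 10 == 0

# Identifiers taken from the Harry Potter row shown earlier in this notebook.
print(is_valid_isbn10("043965548X"))
print(is_valid_isbn13("9780439655484"))
```

A check like this could help separate genuinely malformed identifiers from ones that merely lost a leading zero when displayed.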

Let's check which books are missing both the isbn and the isbn13.

In [31]:
books[(books.isbn == '') & (books.isbn13== '')].shape

(12882, 14)

There are 12882 books without any kind of isbn. Let's drop them.

In [32]:
books=books[(books.isbn != '')|(books.isbn13!= '')]

In [33]:
books.shape

(110884, 14)

In [34]:
books.isbn.value_counts()

              7247
078577842X       1
0746073356       1
0060734663       1
1519549466       1
              ... 
0380977176       1
0061234753       1
1619630451       1
1627552987       1
2070612384       1
Name: isbn, Length: 103638, dtype: int64

In [35]:
books.isbn13.value_counts()

                 2554
9780439057493       1
9780763659837       1
9780803730960       1
9780670852529       1
                 ... 
9780152057770       1
9780747562832       1
9780679892281       1
9781402258893       1
9780066238685       1
Name: isbn13, Length: 108331, dtype: int64

#### Number of Pages

The num_pages variable should contain integers, but it contains strings. Some odd values could make sense if the book's format is, for example, audio, but that is not the case for all of them.
There are many missing values in this column. Should we drop them, or scrape the page counts and fill them in?
There are also books with more than 3000 pages (I suppose these might be collections), and books with just 1 or 2 pages.
Let's check them.

In [36]:
books.num_pages.value_counts()

32      23485
        21079
40       6042
24       4430
48       3474
        ...  
566         1
731         1
621         1
471         1
1293        1
Name: num_pages, Length: 757, dtype: int64

In [37]:
books.num_pages.nunique()

757

In [38]:
books[books.num_pages== ''].format.value_counts()

                                  11480
Hardcover                          4212
Paperback                          3047
Audio CD                            647
ebook                               362
Audio                               360
Unknown Binding                     223
Audiobook                           200
Board Book                          197
Board book                           66
Library Binding                      65
Audio Cassette                       36
Mass Market Paperback                34
Boxed Set                            21
School &amp; Library Binding         15
paperback                            12
Kindle Edition                       12
Audible Audio                         9
Gebundene Ausgabe                     7
Spiral-bound                          7
MP3 CD                                6
audio                                 5
Bath Book                             4
board book (lift the flap)            4
Turtleback                            3


In [39]:
books[books.num_pages== '1'].format.value_counts()

Audio CD          25
Hardcover         23
Audiobook         16
Paperback          8
Audio              3
Board book         3
                   1
DVD                1
Audio Cassette     1
Name: format, dtype: int64
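If we later want num_pages as a number, one possible approach is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries such as empty strings into NaN. The sketch below uses a made-up toy frame; the column name `num_pages_num` and the threshold of 2 pages are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for `books`; the values are made up for illustration.
toy = pd.DataFrame({
    "num_pages": ["32", "", "1", "435"],
    "format": ["Hardcover", "Audio CD", "Audiobook", "Paperback"],
})

# Empty strings cannot be parsed, so errors="coerce" turns them into NaN.
toy["num_pages_num"] = pd.to_numeric(toy["num_pages"], errors="coerce")

missing_pages = int(toy["num_pages_num"].isna().sum())

# Flag suspiciously short books for manual review instead of dropping them outright.
suspicious = toy[toy["num_pages_num"] <= 2]
print(missing_pages)
print(suspicious)
```

Keeping the numeric version in a separate column leaves the raw strings available for inspection while the cleaning rules are still being decided.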

#### Publication year

In [40]:
books.publication_year.value_counts()

         13373
2013      5855
2014      5755
2012      5466
2015      5315
2016      4978
2011      4808
2010      4426
2008      4101
2009      4050
2007      3933
2006      3629
2005      3390
2017      3193
2004      3187
2003      3023
2002      2701
2001      2495
2000      2462
1999      2215
1998      2061
1997      1889
1996      1758
1995      1520
1994      1428
1993      1357
1992      1229
1991      1048
1990       928
1989       916
1988       852
1987       728
1986       586
1985       540
1984       494
1983       415
1982       362
1980       328
1981       318
1978       301
1977       289
1979       275
1976       223
1973       213
1974       204
1975       204
1972       170
1971       165
1970       163
1969       132
1968       114
1967       107
1966        85
2018        82
1962        73
1963        70
1961        70
1965        68
1964        63
1960        57
1958        45
1950        42
1959        41
1957        37
1953        34
1956        30
1948      

There are 13373 missing values in publication_year.
Most of the books with a known publication year fall between 1960 and 2017.
There are also implausible values, with more or fewer than 4 digits.
We should drop the rows with missing or implausible values.

In [41]:
books.drop(books[books.publication_year == ''].index, inplace=True)

In [42]:
books.publication_year=pd.to_numeric(books.publication_year)

In [43]:
books.publication_year.describe().T

count    97511.000000
mean      2005.754756
std        250.288665
min          5.000000
25%       1999.000000
50%       2007.000000
75%       2013.000000
max      65535.000000
Name: publication_year, dtype: float64

In [44]:
books.drop(books[books.publication_year <= 1800].index, inplace = True)

In [45]:
books.drop(books[books.publication_year >= 2019].index, inplace = True)

In [46]:
books.publication_year.describe().T

count    97469.000000
mean      2004.407196
std         11.333694
min       1877.000000
25%       1999.000000
50%       2007.000000
75%       2013.000000
max       2018.000000
Name: publication_year, dtype: float64

In [47]:
books.shape

(97469, 14)
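The same cleaning can also be expressed as a single boolean mask. The sketch below is an alternative formulation on a made-up toy frame, not the code the notebook actually ran; it keeps the same bounds as the cells above (years after 1800 and before 2019).

```python
import pandas as pd

# Toy frame standing in for `books`; the values are made up for illustration.
toy = pd.DataFrame({"publication_year": ["2006", "", "5", "65535", "1877", "2018"]})

# Empty strings become NaN instead of raising an error.
years = pd.to_numeric(toy["publication_year"], errors="coerce")

# Series.between is inclusive on both ends by default.
# NaN compares as False, so rows with a missing year are filtered out as well.
plausible = years.between(1801, 2018)

cleaned = toy[plausible].assign(publication_year=years[plausible].astype(int))
print(cleaned)
```

Building a new frame from a mask avoids repeated `inplace=True` drops and keeps the original frame untouched for comparison.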

### Rest of categorical variables

#### Description

There are 7029 books without a description. The description column is one of the most important for our model, so let's drop all rows in which the description is missing.
There are also different books sharing the same description, but we will deal with that later.

In [48]:
books.description.value_counts()

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        

In [49]:
books1=books[books.description != '']

In [50]:
books1.shape

(90440, 14)

So there are 90440 books including a description. 

In [51]:
books1.description.value_counts()

Source of legend and lyric, reference and conjecture, Alice's Adventures in Wonderland is for most children pure pleasure in prose. While adults try to decipher Lewis Carroll's putative use of complex mathematical codes in the text, or debate his alleged use of opium, young readers simply dive with Alice through the rabbit hole, pursuing "The dream-child moving through a land / Of wonders wild and new." There they encounter the White Rabbit, the Queen of Hearts, the Mock Turtle, and the Mad Hatter, among a multitude of other characters--extinct, fantastical, and commonplace creatures. Alice journeys through this Wonderland, trying to fathom the meaning of her strange experiences. But they turn out to be "curiouser and curiouser," seemingly without moral or sense.\nFor more than 130 years, children have reveled in the delightfully non-moralistic, non-educational virtues of this classic. In fact, at every turn, Alice's new companions scoff at her traditional education. The Mock Turtle, f

In [52]:
# But there are several books sharing the same description

In [53]:
books1[books1.description =='By falling down a rabbit hole and stepping through a mirror, Alice experiences unusual adventures with a variety of nonsensical characters.']

Unnamed: 0,isbn,text_reviews_count,language_code,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title
223,,4,eng,False,4.06,By falling down a rabbit hole and stepping thr...,Paperback,Sweetwater Press,272.0,9781784284268.0,2017,34457541,19,Alice's Adventures in Wonderland & Through the...
4018,0922984018,1,eng,False,4.06,By falling down a rabbit hole and stepping thr...,Hardcover,Wellington Pub,165.0,9780922984015.0,1991,5135797,3,Alice In Wonderland and Through the Looking-glass
4491,044860043,1,eng,False,4.06,By falling down a rabbit hole and stepping thr...,Hardcover,"Grosset & Dunlap, Publishers",307.0,,1980,15809676,20,Alice's Adventures in Wonderland & Through the...
6108,0451512790,11,eng,False,4.06,By falling down a rabbit hole and stepping thr...,Mass Market Paperback,Signet Classics,,9780451512796.0,1960,1503618,111,Alice's Adventures in Wonderland & Through the...
6316,8477024472,1,spa,False,4.06,By falling down a rabbit hole and stepping thr...,Paperback,Valdemar,,9788477024477.0,2006,12228349,5,Aventuras De Alicia En El País De Las Maravill...
8231,,2,spa,False,4.06,By falling down a rabbit hole and stepping thr...,Paperback,Ediciones Brontes,187.0,9788496975668.0,2010,18141618,20,Alicia en el País de las Maravillas & A través...
9324,0716631989,1,,False,4.06,By falling down a rabbit hole and stepping thr...,Unknown Binding,World Book,248.0,9780716631989.0,1988,4930110,8,Alice in Wonderland & Through the Looking Glass
19874,1847494072,9,eng,False,4.06,By falling down a rabbit hole and stepping thr...,Paperback,Alma Classics,384.0,9781847494078.0,2015,25231788,85,Alice's Adventures in Wonderland & Through the...
25165,1907360360,5,en-US,False,4.06,By falling down a rabbit hole and stepping thr...,Hardcover,Collector's Library,288.0,9781907360367.0,2011,13664833,67,Alice's Adventures in Wonderland and Through t...
30876,8889145439,3,ita,False,4.06,By falling down a rabbit hole and stepping thr...,Hardcover,Gruppo Editoriale L'Espresso,320.0,9788889145432.0,2004,15735794,21,Le avventure di Alice nel paese delle meravigl...
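To find every group of books that share a description, rather than looking one description up by hand, `DataFrame.duplicated` with `keep=False` can be used. This is a minimal sketch on a made-up toy frame; the data and variable names are illustrative assumptions.

```python
import pandas as pd

# Toy frame standing in for `books1`; the values are made up for illustration.
toy = pd.DataFrame({
    "book_id": [1, 2, 3, 4],
    "title": ["Alice A", "Alice B", "Dog Heaven", "Alicia"],
    "description": ["Same text", "Same text", "Other text", "Same text"],
})

# keep=False marks every member of a duplicate group, not just the repeats.
shared = toy[toy.duplicated(subset="description", keep=False)]

# Number of books sharing each repeated description.
group_sizes = shared.groupby("description")["book_id"].count()
print(shared)
print(group_sizes)
```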


#### Language_code

There are 90 unique values in the language_code variable, but for the majority of books this value is an empty string. There are also 2 books with -- as the language code, and one book with pt-BR instead of por (for Portuguese). English also has several codes: en-US, en-GB, en-CA and eng.
In this case, we are only interested in books written in or translated into English. So we will check the book descriptions carefully using the langdetect library, and we will delete the books whose language code is not English or whose description is written in another language.

In [54]:
books1.language_code.value_counts()

         54429
eng      20893
en-US     4728
en-GB     1284
spa       1023
ind        833
fin        817
ita        699
ger        669
nl         634
por        567
fre        494
swe        443
per        254
bul        237
tur        212
gre        208
cze        194
dan        191
ara        167
nor        162
rum        149
pol        122
rus        112
est         90
lav         83
vie         79
slo         65
scr         55
hun         54
ukr         51
ben         48
en-CA       44
nob         43
lit         40
mul         25
kat         24
srp         23
fil         22
tha         17
zho         17
jpn         16
cat         14
afr         12
msa         10
pes         10
hye          6
slv          6
tgl          5
kor          5
isl          5
lat          5
mal          4
sin          4
nno          3
nav          2
--           2
bos          2
gle          2
heb          2
glg          2
sco          2
egy          1
smn          1
mlt          1
eus          1
hin       

In [55]:
books1[books1.language_code == '--']

Unnamed: 0,isbn,text_reviews_count,language_code,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title
108770,3785544359.0,2,--,False,4.15,"As a child, there are constantly people trying...",Hardcover,Loewe,32,9783785544358,2002,7940220,6,Mein Körper gehört mir! Schutz vor Missbrauch ...
122588,,1,--,False,4.28,"Translated to Spanglish by Ilan Stavans, this ...",Paperback,Edition Tintenfass,96,9783946190431,2016,34369354,1,El Little Principe


One book is written in Spanglish and the other in German, so we will delete these two rows.

In [56]:
books1.drop(books1[books1.language_code == '--'].index, inplace=True)

In [57]:
#Let's label missing codes as unknown and replace the English variants with just eng.
books1['language_code'] = books1['language_code'].replace({"": "unknown", 'en-US':'eng', 'en-GB':'eng', 'en-CA':'eng'})

In [58]:
books1.language_code.unique()

array(['unknown', 'eng', 'fin', 'scr', 'fil', 'ger', 'tur', 'ara', 'nl',
       'gre', 'zho', 'est', 'spa', 'fre', 'dan', 'cze', 'per', 'ind',
       'rum', 'ita', 'por', 'ben', 'swe', 'nor', 'vie', 'hun', 'rus',
       'lit', 'bul', 'mul', 'slo', 'pol', 'ukr', 'nob', 'pes', 'lav',
       'cat', 'smn', 'mal', 'kat', 'srp', 'jpn', 'slv', 'oci', 'pt-BR',
       'kor', 'afr', 'tha', 'msa', 'lat', 'tgl', 'sin', 'mon', 'mlt',
       'gle', 'egy', 'glg', 'hye', 'isl', 'sqi', 'dum', 'yid', 'guj',
       'sco', 'fao', 'bos', 'heb', 'eus', 'hmn', 'nav', 'mkd', 'nld',
       'hin', 'kaz', 'krl', 'roh', 'nno', 'nub'], dtype=object)

In [59]:
#Let's check some language_codes to see the titles and descriptions.
books1[books1.language_code == 'scr'].sample(10)

Unnamed: 0,isbn,text_reviews_count,language_code,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title
60471,9532130195,1,scr,False,4.28,"Le Petit Princ (Mali princ, 1943. godine), dje...",Hardcover,Marjan knjiga,104,,2000,12396588,36,Mali princ
111365,9532290451,3,scr,False,3.05,Kraj u kojemu smo rodeni i u kojemu smo odrast...,Hardcover,ABC,125,,2005,9841888,64,Voda i druge pripovijetke
18375,,1,scr,False,3.53,"Lili obozava carobirati, sve od trenutka kad j...",Hardcover,Mozaik knjiga,132,9789531407267.0,2010,31868708,1,"Čarobnica Lili leti na Mjesec (Čarobnica Lili,..."
102765,953612498X,3,scr,False,4.09,Nenadani susret sestara blizanki u ljetovalist...,Hardcover,"Znanje, Zagreb",114,9789536124985.0,1995,12475550,65,Blizanke
10010,9533240482,6,scr,False,3.88,Roman Sofijin svijet nastao je iz Gaarderove z...,Mass Market Paperback,"Znanje d.o.o., Zagreb",512,9789533240480.0,2010,10652799,106,Sofijin svijet
25,,1,scr,False,3.92,"Nakon Labirinta kostiju u kojem je sve pocelo,...",Hardcover,Algoritam,208,9789533164069.0,2011,25208148,7,"Kao u grobu (39 tragova, #4)"
50722,9531967075,4,scr,False,3.8,Vladimir Nazor nas u BIJELOM JELENU uvodi u ba...,Paperback,Mozaik knjiga,84,,2000,10756509,333,Bijeli jelen
2360,9536989077,8,scr,False,4.28,Prica o WAITAPU zapravo je poetska bajka o pot...,Hardcover,Neretva d.o.o.,162,,1993,8423233,131,Waitapu
111204,9531969744,8,scr,False,3.63,"Novo izdanje pripovjedaka Dinka Simunovica, sa...",Hardcover,Mozaik knjiga d.o.o.,197,,2002,8981458,770,Duga
48240,,5,scr,False,4.24,Kad bismo pokusali odrediti zanr romana Demon ...,Hardcover,Algoritam,220,9789533168241.0,2015,25665932,26,Demon školske knjižnice


In [60]:
#And the books with unknown language code. We can see many of them are written in English.
books1[books1.language_code == 'unknown'].sample(30)

Unnamed: 0,isbn,text_reviews_count,language_code,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title
38980,0192727745,2,unknown,False,4.25,I fjarde vaningen i ett hyreshus nagonstans i ...,Paperback,"Oxford University Press, USA",183.0,9780192727749,2009,6338700,53,Karlson Flies Again. Astrid Lindgren
109005,0310727308,8,unknown,False,4.06,A Lesson in Being Patient\nThe Pod Squad needs...,Paperback,Zonderkidz,32.0,9780310727309,2011,11262145,25,Bob and Larry in the Case of the Missing Patience
121682,0395687004,7,unknown,False,3.69,Rabbit leaves her warm dark burrow and discove...,Hardcover,Clarion Books,32.0,9780395687000,1995,4052958,35,Rabbit's Good News
1074,0061853089,2,unknown,False,4.27,"Braaaaains \nZack Clarke, his best pal, Rice, ...",Paperback,HarperCollins,204.0,9780061853081,2011,10905281,25,Undead Ahead (The Zombie Chasers #2)
108257,0544022637,12,unknown,False,3.9,"In Chinese, peng youmeans friend. But in any l...",Paperback,HMH Books for Young Readers,160.0,9780544022638,2013,15814549,65,The Year of the Book
55161,0064442950,4,unknown,False,4.12,Rain-day bluesJohnny Lion has everything he ne...,Paperback,HarperCollins,64.0,9780064442954,2000,1402954,31,Johnny Lion's Rubber Boots
102062,0552547123,21,unknown,False,3.81,"Fortunately for Jacqueline Wilson fans, her ne...",Paperback,Corgi Childrens,288.0,9780552547123,2004,1527197,240,Lola Rose
104102,0060239824,19,unknown,False,3.67,Thirty-eight original limericks about all mann...,Hardcover,HarperCollins Publishers,48.0,9780060239824,1983,311794,64,The Book of Pigericks: Pig Limericks
86881,0761320776,3,unknown,False,3.06,A young girl camps out in her bedroom and is j...,Paperback,Millbrook Press,28.0,9780761320777,1999,2973608,16,My Camp-Out
89865,079452852X,2,unknown,False,3.29,Babies will love sharing this book with you. T...,Board Book,Usborne Books,10.0,9780794528522,2010,11480634,7,Baby's Very First Touchy-Feely Christmas Book


Before detecting the language of the descriptions, we have to lowercase them. We will create a new column with all the descriptions transformed to lowercase.

In [61]:
def preprocess_df(df):
    # Add a lowercase copy of each description, used only for language detection.
    df['descriptiondetect'] = df['description'].str.lower()
    return df

In [62]:
%%time
books2=preprocess_df(books1)

Wall time: 136 ms


In [63]:
books2.descriptiondetect

0         relates in vigorous prose the tale of aeneas, ...
1         to kara's astonishment, she discovers that a p...
2         in newbery medalist cynthia rylant's classic b...
4         what do you do?\na hen lays eggs...\na cow giv...
5         ben draws a train that takes him to all sorts ...
                                ...                        
124075    holy unanticipated occurrences! a cynic meets ...
124076    rhyming text and illustrations of comical cats...
124079    "a perfect reminder to always be on the lookou...
124080    one of the most popular series ever published ...
124081    gathers poems by william blake, emily bronte, ...
Name: descriptiondetect, Length: 90438, dtype: object

In [64]:
books2.columns

Index(['isbn', 'text_reviews_count', 'language_code', 'is_ebook',
       'average_rating', 'description', 'format', 'publisher', 'num_pages',
       'isbn13', 'publication_year', 'book_id', 'ratings_count', 'title',
       'descriptiondetect'],
      dtype='object')

And now we will use the detect_langs function to check all descriptions, creating a new column to keep all the results.

''' Important!!!
The cell below takes a long time to run, so I have kept it in markdown format.
The dataset with the new column has been saved as a csv file; we just have to import it to keep working on the project
instead of running the detection every time we restart the kernel.'''

%%time
from langdetect import detect_langs

def detect_top_language(text):
    # detect_langs returns candidates sorted by probability; keep the most likely code.
    try:
        return detect_langs(text)[0].lang
    except Exception:
        # langdetect raises on empty or undetectable text; mark those rows with '?'.
        return '?'

books2['detect'] = books2['descriptiondetect'].apply(detect_top_language)
print(books2.head())

books2[books2.language_code=='unknown'].detect.value_counts()

As the language detection takes a long time to run, let's save the dataset with the results in a new csv and let's import it again so we don't have to run it every time we restart the kernel.
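The save-and-reload step can be wrapped in a small caching helper. The sketch below is one possible pattern, not the code the notebook actually used: `load_or_compute` and `slow_step` are hypothetical names, and the demo writes to a temporary directory rather than to `booksdescriptiondetect.csv`.

```python
import os
import tempfile

import pandas as pd

def load_or_compute(csv_path, compute):
    """Return the cached CSV if it exists; otherwise call `compute` and cache its result."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    df = compute()
    # index=False avoids an extra "Unnamed: 0" column when the file is read back.
    df.to_csv(csv_path, index=False)
    return df

calls = []

def slow_step():
    # Stand-in for the expensive language detection; records each invocation.
    calls.append(1)
    return pd.DataFrame({"book_id": [5, 3636], "detect": ["en", "en"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache_demo.csv")
    first = load_or_compute(path, slow_step)   # computes and writes the file
    second = load_or_compute(path, slow_step)  # reads the cached file

print(len(calls))
```

Note that `read_csv` re-infers dtypes, so identifier columns such as isbn may need `dtype=str` on reload to preserve leading zeros.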

##### Importing the clean csv file

In [65]:
books = pd.read_csv('booksdescriptiondetect.csv')

In [66]:
books.head(2)

Unnamed: 0,isbn,text_reviews_count,language_code,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title,descriptiondetect,detect
0,1599150603,7,unknown,False,4.13,"Relates in vigorous prose the tale of Aeneas, ...",Paperback,Yesterday's Classics,162.0,9781599150604,2006,287141,46,The Aeneid for Boys and Girls,"relates in vigorous prose the tale of aeneas, ...",en
1,1934876569,6,unknown,False,4.22,"To Kara's astonishment, she discovers that a p...",Paperback,Seven Seas,216.0,9781934876565,2009,6066812,98,All's Fairy in Love and War (Avalon: Web of Ma...,"to kara's astonishment, she discovers that a p...",en


In [67]:
books.shape

(90438, 16)

In [68]:
books.columns

Index(['isbn', 'text_reviews_count', 'language_code', 'is_ebook',
       'average_rating', 'description', 'format', 'publisher', 'num_pages',
       'isbn13', 'publication_year', 'book_id', 'ratings_count', 'title',
       'descriptiondetect', 'detect'],
      dtype='object')

In [69]:
books_otherlang=books[(books.language_code != 'unknown') & (books.language_code != 'eng') ]

In [70]:
books_otherlang[books_otherlang.detect=='es']  
# Checking whether the detection looks correct for the other language codes. It looks correct, so let's drop the rows whose language code is neither 'eng' nor 'unknown'.

Unnamed: 0,isbn,text_reviews_count,language_code,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title,descriptiondetect,detect
165,8416690073,6,spa,False,4.05,Suzy Swanson esta segura de que conoce el verd...,Paperback,Maeva,320.0,9788416690077,2016,32792284,17,Lo que sucedió con la medusa,suzy swanson esta segura de que conoce el verd...,es
217,8714193523,7,dan,False,3.88,Luego del exito mundial del libro que relata l...,Paperback,Host & Son,542.0,9788714193522,1995,8285950,267,Sofies verden,luego del exito mundial del libro que relata l...,es
283,8498676185,1,spa,False,4.27,"""A las mellizas les parecio que el tren tardab...",Paperback,Molino,,9788498676181,2009,11196995,29,Santa Clara: Todos los cursos (St. Clare's #1-6),"""a las mellizas les parecio que el tren tardab...",es
608,844066785X,2,spa,False,3.51,Zakie Beauchamp quiere convertirse en un famos...,Paperback,Ediciones B,140.0,9788440667854,1999,17610045,9,"El monstruo baboso (Pesadillas, #53)",zakie beauchamp quiere convertirse en un famos...,es
639,,1,spa,False,4.08,La ciudad en verano es el campo de juego perfe...,Hardcover,Barbara Fiore,48.0,9788415208464,2014,22066825,17,Las reglas del verano,la ciudad en verano es el campo de juego perfe...,es
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
89933,8422610086,6,spa,False,4.29,en una ciudad -cualquier ciudad- aparece un di...,Hardcover,circulo de lectores,232.0,,1988,2799598,29,Momo,en una ciudad -cualquier ciudad- aparece un di...,es
89997,842644556X,4,spa,False,3.89,Enviamos un dibijo de Quint a cuarenta y seis ...,Hardcover,Lumen,120.0,9788426445568,1998,993956,14,El Libro De Los Libros. Historias sobre imágenes.,enviamos un dibijo de quint a cuarenta y seis ...,es
90212,,1,spa,False,3.88,Con el unico objeto de divertir a una nina de ...,Paperback,ERA,52.0,9786074554564,2010,30347070,3,Alicia para niños,con el unico objeto de divertir a una nina de ...,es
90373,8494411039,13,spa,False,3.88,"Buenas noches, Charlie,\ndulces suenos, Jack.\...",Hardcover,Oceano - Gran Travesia,403.0,9788494411038,2015,26239341,26,¡Pesadillas! (¡Pesadillas! #1),"buenas noches, charlie,\ndulces suenos, jack.\...",es


In [71]:
books.drop(books[(books.language_code != 'unknown') & (books.language_code != 'eng') ].index, inplace = True)

In [72]:
books.language_code.unique()

array(['unknown', 'eng'], dtype=object)

In [73]:
books.shape

(81378, 16)

In [74]:
books_eng=books[books.language_code=='eng']

In [75]:
books_eng.detect.value_counts()

en    26771
es       32
fr       20
de       16
id       14
nl       13
hr       12
it        9
pt        8
no        7
af        6
tl        6
sv        5
cy        5
ca        5
pl        4
da        3
sl        3
sk        2
fi        2
lt        2
sq        1
cs        1
et        1
tr        1
Name: detect, dtype: int64

We can see that some books with eng as language code have descriptions detected as other languages. 
Let's check them.
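
One way to check these rows without filtering each language by hand is to preview a few descriptions per detected language. This is a minimal sketch on toy data: the helper `sample_mismatches` and the example rows are hypothetical, while the column names language_code, detect and description follow the notebook.

```python
import pandas as pd


def sample_mismatches(df, code="eng", expected="en", n=2):
    """Return up to `n` rows per detected language where the detection disagrees with `code`."""
    mismatched = df[(df["language_code"] == code) & (df["detect"] != expected)]
    # head(n) on a groupby keeps the first n rows of each group, in the original row order.
    return mismatched.groupby("detect").head(n)


toy = pd.DataFrame({
    "language_code": ["eng", "eng", "eng", "eng", "spa"],
    "detect": ["en", "es", "es", "fr", "es"],
    "description": ["A dog story", "Un cuento", "Otro cuento", "Une histoire", "Un libro"],
})
preview = sample_mismatches(toy, n=1)
```

On the real data, `sample_mismatches(books)` would be the call to try; reading a handful of rows per language is usually enough to judge whether the detection or the language code is wrong.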

In [76]:
#books_eng[books_eng.detect=='es']
#books_eng[books_eng.detect=='fr']

In [77]:
# Let's keep only the books with 'eng' as language_code whose description was detected as English.
books.drop(books[(books.language_code == 'eng') & (books.detect != 'en') ].index, inplace = True)

In [78]:
books[books.language_code=='eng'].detect.value_counts()

en    26771
Name: detect, dtype: int64

In [79]:
books[books.language_code=='unknown'].detect.value_counts()

en    53889
es      174
fr       90
id       42
cy       28
de       27
sv       22
nl       21
fi       20
it       18
pt       16
da       13
af        9
hr        9
sl        7
no        7
pl        5
ro        5
tr        5
vi        4
sw        4
tl        3
sk        2
hu        2
ca        2
cs        1
sq        1
so        1
lt        1
Name: detect, dtype: int64

In [80]:
books_unk=books[books.language_code=='unknown']

In [81]:
#books_unk[books_unk.detect=='es']
books_unk[books_unk.detect=='en']
# The language detection also looks fine for most of the books with an unknown language code.

Unnamed: 0,isbn,text_reviews_count,language_code,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title,descriptiondetect,detect
0,1599150603,7,unknown,False,4.13,"Relates in vigorous prose the tale of Aeneas, ...",Paperback,Yesterday's Classics,162.0,9781599150604,2006,287141,46,The Aeneid for Boys and Girls,"relates in vigorous prose the tale of aeneas, ...",en
1,1934876569,6,unknown,False,4.22,"To Kara's astonishment, she discovers that a p...",Paperback,Seven Seas,216.0,9781934876565,2009,6066812,98,All's Fairy in Love and War (Avalon: Web of Ma...,"to kara's astonishment, she discovers that a p...",en
3,1416904999,4,unknown,False,3.57,WHAT DO YOU DO?\nA hen lays eggs...\nA cow giv...,Board Book,Little Simon,24.0,9781416904991,2005,1698376,23,What Do You Do?,what do you do?\na hen lays eggs...\na cow giv...,en
4,0531301060,3,unknown,False,3.68,Ben draws a train that takes him to all sorts ...,Hardcover,Orchard Books (NY),,9780531301067,1999,2592648,21,It's Funny Where Ben's Train Takes Him,ben draws a train that takes him to all sorts ...,en
5,0884482987,17,unknown,False,3.89,When Amadi disobeys his mother and runs off to...,Hardcover,Tilbury House Publishers,32.0,9780884482987,2008,3631900,44,Amadi's Snowman: A Story of Reading,when amadi disobeys his mother and runs off to...,en
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90432,055256916X,16,unknown,False,4.11,"When they first arrived, they came quietly and...",Paperback,Corgi Childrens,336.0,9780552569163,2015,23346344,58,Boy In The Tower,"when they first arrived, they came quietly and...",en
90434,1575054035,36,unknown,False,4.05,Rhyming text and illustrations of comical cats...,Hardcover,Carolrhoda Books,,9781575054032,2000,823094,240,"To Root, to Toot, to Parachute",rhyming text and illustrations of comical cats...,en
90435,0061960314,13,unknown,False,4.29,"""A perfect reminder to always be on the lookou...",Hardcover,HarperCollins,40.0,9780061960314,2010,7925060,40,Instructions,"""a perfect reminder to always be on the lookou...",en
90436,0689852959,1,unknown,False,4.36,One of the most popular series ever published ...,Paperback,Aladdin,176.0,9780689852954,2002,331839,18,Jacqueline Kennedy Onassis: Friend of the Arts,one of the most popular series ever published ...,en


In [82]:
books.shape

(81200, 16)

In [83]:
# Let's keep only the books with 'unknown' as language_code whose description was detected as English.
books.drop(books[(books.language_code == 'unknown') & (books.detect != 'en') ].index, inplace = True)

In [84]:
books.shape

(80660, 16)

In [85]:
books[books.language_code=='unknown'].detect.value_counts()

en    53889
Name: detect, dtype: int64

In [86]:
books.language_code.isnull().sum()

0

In [87]:
books.language_code.value_counts()

unknown    53889
eng        26771
Name: language_code, dtype: int64

In [88]:
books.detect.value_counts()

en    80660
Name: detect, dtype: int64

In [89]:
books.columns

Index(['isbn', 'text_reviews_count', 'language_code', 'is_ebook',
       'average_rating', 'description', 'format', 'publisher', 'num_pages',
       'isbn13', 'publication_year', 'book_id', 'ratings_count', 'title',
       'descriptiondetect', 'detect'],
      dtype='object')

In [90]:
# Now that every remaining description is detected as English, let's drop the language_code and detect columns.
books=books.drop(['language_code', 'detect'],axis=1)

In [91]:
books.shape

(80660, 14)

##### Dealing with missing values

After importing the csv file, the empty (whitespace) values have been read back as NaN values, so we can use the isnull function to check the rest of the columns.
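
By default, pandas parses empty csv fields as NaN, which is why isnull is enough to find the missing values here. A minimal sketch with a hypothetical two-row csv:

```python
import io

import pandas as pd

# Hypothetical two-row csv: the second row has an empty 'format' field.
csv_text = "title,format\nDog Heaven,Hardcover\nMomo,\n"
demo = pd.read_csv(io.StringIO(csv_text))

# Empty fields are parsed as NaN by default, so isnull() counts them as missing.
missing_per_column = demo.isnull().sum()
```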

In [92]:
books.isnull().sum()

isbn                  1889
text_reviews_count       0
is_ebook                 0
average_rating           0
description              0
format                1356
publisher              924
num_pages             7737
isbn13                 469
publication_year         0
book_id                  0
ratings_count            0
title                    1
descriptiondetect        0
dtype: int64

#### Titles

In [93]:
books.title.isnull().sum()

1

In [94]:
# Let's check the book with the missing title
books[books.title.isna()]

Unnamed: 0,isbn,text_reviews_count,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title,descriptiondetect
34303,440428475,3,False,3.63,Ben has always been content to be brilliant at...,Paperback,Yearling,160.0,9780440428473,1985,2433394,8,,ben has always been content to be brilliant at...


https://www.goodreads.com/book/show/2433394

In [95]:
# Let's fill in the missing title manually
books.title= books.title.fillna('(George)')

In [96]:
books.title.isnull().sum()

0

In [97]:
books.title.value_counts()

The Secret Garden                                         99
Peter Pan                                                 88
The Wind in the Willows                                   82
Alice's Adventures in Wonderland                          74
Anne of Green Gables                                      72
                                                          ..
Charlie Chick Wants to Play                                1
She Persisted: 13 American Women Who Changed the World     1
Big Bad Detective Agency - Library Edition                 1
Nobody Has Time for Me                                     1
The Adventures of Isaiah James: Beach Boy                  1
Name: title, Length: 64916, dtype: int64

Let's try to detect whether there are titles that are not written in English even though their description is in English.

In [98]:
def preprocess_df(df):
    # Add a lowercased copy of the title in a new 'titledetect' column.
    df['titledetect'] = df['title'].str.lower()
    return df

In [99]:
books=preprocess_df(books)

In [100]:
books.titledetect

0                            the aeneid for boys and girls
1        all's fairy in love and war (avalon: web of ma...
2                                               dog heaven
3                                          what do you do?
4                   it's funny where ben's train takes him
                               ...                        
90433        flora and ulysses: the illuminated adventures
90434                       to root, to toot, to parachute
90435                                         instructions
90436       jacqueline kennedy onassis: friend of the arts
90437             the children's classic poetry collection
Name: titledetect, Length: 80660, dtype: object

In [101]:
books.columns

Index(['isbn', 'text_reviews_count', 'is_ebook', 'average_rating',
       'description', 'format', 'publisher', 'num_pages', 'isbn13',
       'publication_year', 'book_id', 'ratings_count', 'title',
       'descriptiondetect', 'titledetect'],
      dtype='object')

Again, the language detection on the titles takes a long time to run, so it is not executed here. The dataset with the results of 
applying the langdetect function to the titles is saved in the csv file 'bookstitledetect.csv'.
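
The title detection step is not shown in the notebook, so the following is only a sketch of how a detect_tit column could be built. The helper `detect_column` and the `toy_detector` stand-in are hypothetical; the detector is passed in as a parameter so the example runs without langdetect installed.

```python
import pandas as pd


def detect_column(texts, detector, failure=None):
    """Apply `detector` to each text; return `failure` where the detector raises."""
    results = []
    for text in texts:
        try:
            results.append(detector(text))
        except Exception:
            # Language detectors typically raise on inputs without letters, e.g. "1 2 3".
            results.append(failure)
    return results


def toy_detector(text):
    # Hypothetical stand-in for a real language detector.
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no features in text")
    return "en"


demo = pd.DataFrame({"titledetect": ["dog heaven", "1 2 3"]})
demo["detect_tit"] = detect_column(demo["titledetect"], toy_detector)
```

With langdetect installed, its `detect` function (which returns the most probable language code as a string) could be passed as `detector`. Titles made only of digits would then end up with a missing value, much like the missing detect_tit values in the imported file.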

In [102]:
books.columns

Index(['isbn', 'text_reviews_count', 'is_ebook', 'average_rating',
       'description', 'format', 'publisher', 'num_pages', 'isbn13',
       'publication_year', 'book_id', 'ratings_count', 'title',
       'descriptiondetect', 'titledetect'],
      dtype='object')

####  Importing the clean csv file with the detected language of the titles

In [103]:
books = pd.read_csv('bookstitledetect.csv')

In [104]:
books.detect_tit.value_counts()

en    59770
af     1994
cy     1645
it     1490
da     1430
no     1382
fr     1370
tl     1193
nl     1072
so      930
ca      927
ro      729
id      698
es      688
et      659
sv      591
fi      588
pl      485
pt      373
sk      349
de      319
tr      291
lt      291
hr      253
cs      232
sl      215
sq      212
sw      181
hu      122
lv      106
vi       55
fa        6
ar        5
ja        2
bg        1
ko        1
Name: detect_tit, dtype: int64

In [105]:
books[books.detect_tit=='ko']  # Try also with bg, ja, ar and fa: all of them are titles in languages other than English.

Unnamed: 0,isbn,text_reviews_count,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title,descriptiondetect,titledetect,detect_tit
51712,8983782706,2,False,3.88,THIS JUST IN! Amy and Dan Cahill were spotted ...,Paperback,Seoul Gyoyuk,264.0,9788983782700,2010,8907049,25,모차르트의 악보 (39 클루스. 2),this just in! amy and dan cahill were spotted ...,모차르트의 악보 (39 클루스. 2),ko


In [106]:
books.detect_tit.unique()

array(['en', 'da', 'pt', 'af', 'id', 'es', 'sw', 'pl', 'fi', 'sv', 'nl',
       'et', 'fr', 'so', 'tl', 'sk', 'vi', 'it', 'ro', 'ca', 'hr', 'lt',
       'tr', 'cy', 'no', 'de', 'cs', 'sq', 'hu', 'lv', 'sl', 'fa', 'ar',
       nan, 'ja', 'ko', 'bg'], dtype=object)

In [107]:
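# Assign the result back instead of using inplace=True on a column selection, which newer pandas versions may not propagate.
books['detect_tit'] = books.detect_tit.replace({'ko':'del', 'ja':'del', 'ar': 'del', 'fa': 'del','bg':'del'})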

In [108]:
books.detect_tit.unique()

array(['en', 'da', 'pt', 'af', 'id', 'es', 'sw', 'pl', 'fi', 'sv', 'nl',
       'et', 'fr', 'so', 'tl', 'sk', 'vi', 'it', 'ro', 'ca', 'hr', 'lt',
       'tr', 'cy', 'no', 'de', 'cs', 'sq', 'hu', 'lv', 'sl', 'del', nan],
      dtype=object)

In [109]:
books=books[books.detect_tit != 'del']

In [110]:
books[books.detect_tit=='da'] # Here most of the titles are in English even though they are detected as a different language.

Unnamed: 0,isbn,text_reviews_count,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title,descriptiondetect,titledetect,detect_tit
2,0590417010,193,False,4.43,In Newbery Medalist Cynthia Rylant's classic b...,Hardcover,Blue Sky Press,40.0,9780590417013,1995,89378,1331,Dog Heaven,in newbery medalist cynthia rylant's classic b...,dog heaven,da
29,0544873912,1,False,3.22,CHIRP! CHIRP! CHIRP! It's springtime on Bradfo...,Hardcover,HMH Books for Young Readers,48.0,9780544873919,2017,30971686,9,Bradford Street Buddies: Springtime Blossoms,chirp! chirp! chirp! it's springtime on bradfo...,bradford street buddies: springtime blossoms,da
117,0763624764,31,False,3.66,Fetching illustrations from an exciting new ta...,Hardcover,Candlewick,40.0,9780763624767,2004,683151,194,Dog Blue,fetching illustrations from an exciting new ta...,dog blue,da
241,1590789962,35,False,3.11,"Goose has an important message for Bear, and h...",Hardcover,Boyds Mills Press,32.0,9781590789964,2013,17469953,142,Fox Forgets,"goose has an important message for bear, and h...",fox forgets,da
344,155469891X,9,False,3.68,"Hanna is fed up with her best friend, Lizzy, w...",Hardcover,Orca Book Publishers,32.0,9781554698912,2014,18853127,35,Best Friend Trouble,"hanna is fed up with her best friend, lizzy, w...",best friend trouble,da
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80342,0006481108,10,False,3.61,Star on the rise: a pulse-racing new novel fro...,Paperback,HarperCollins Canada,230.0,9780006481102,1998,2487930,100,Stranded,star on the rise: a pulse-racing new novel fro...,stranded,da
80377,1550028286,3,False,3.93,Short-listed for the Sheila A. Egoff Award for...,Paperback,Dundurn,152.0,9781550028287,2008,6853284,11,Finders Keepers,short-listed for the sheila a. egoff award for...,finders keepers,da
80490,039925028X,41,False,3.67,Caldecott Honor winner Rachel Isadora gives re...,Hardcover,G.P. Putnam's Sons Books for Young Readers,32.0,9780399250286,2009,6400465,151,Hansel and Gretel,caldecott honor winner rachel isadora gives re...,hansel and gretel,da
80521,0606142282,19,False,3.86,A young pig uses her ability to read to outwit...,Hardcover,Turtleback Books,,9780606142281,1998,3325443,74,Hog-Eye,a young pig uses her ability to read to outwit...,hog-eye,da
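
The output above shows short English titles labelled as Danish, so the detected language is not a reliable filter for titles. A script-based check may be a more robust way to find titles that are clearly not English. The sketch below is an alternative to the detect_tit filter, not the method used in this notebook; the helper `has_non_latin_letter` is hypothetical and relies only on the standard library module unicodedata.

```python
import unicodedata


def has_non_latin_letter(text):
    """Return True if any alphabetic character in `text` is outside the Latin script."""
    for ch in text:
        # unicodedata.name gives e.g. 'LATIN SMALL LETTER A' for 'a'; Hangul,
        # Cyrillic or Arabic letters have names without the word 'LATIN'.
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            return True
    return False


titles = ["dog heaven", "¡pesadillas! (¡pesadillas! #1)", "모차르트의 악보 (39 클루스. 2)"]
flags = [has_non_latin_letter(t) for t in titles]
```

Applied to the dataframe, something like `books[books.titledetect.apply(has_non_latin_letter)]` would list the candidates. Note that this only catches titles in other scripts; a Spanish or French title written in Latin letters is not flagged.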


In [111]:
books.detect_tit.isnull().sum()

5

In [112]:
books[books.detect_tit.isna()]

Unnamed: 0,isbn,text_reviews_count,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title,descriptiondetect,titledetect,detect_tit
24276,0740755803,4,False,4.19,Every masterful image by famed photographer An...,Hardcover,Andrews McMeel Publishing,24.0,9780740755804,2005,699395,17,123,every masterful image by famed photographer an...,123,
33468,0753467720,12,False,3.62,The numbers 1 to 20 have never been so creativ...,Hardcover,Kingfisher,48.0,9780753467725,2012,13167187,37,1 2 3,the numbers 1 to 20 have never been so creativ...,1 2 3,
44626,0140371915,1,False,3.82,"Zapped into the 21st century by ""The Book"", th...",Paperback,Puffin,80.0,9780140371918,1997,3301515,23,2095,"zapped into the 21st century by ""the book"", th...",2095,
51561,0746040997,5,False,3.67,-- This delightful series of board books has b...,Board Book,Usborne Books,12.0,9780746040997,2000,2621001,9,1 2 3,-- this delightful series of board books has b...,1 2 3,
54679,068930529X,1,False,4.25,All the animals have advice for mouse on what ...,Hardcover,Atheneum Books,24.0,9780689305290,1976,3889642,4,22 23,all the animals have advice for mouse on what ...,22 23,


In [113]:
books=books.drop(['detect_tit'], axis=1)

In [114]:
books.columns

Index(['isbn', 'text_reviews_count', 'is_ebook', 'average_rating',
       'description', 'format', 'publisher', 'num_pages', 'isbn13',
       'publication_year', 'book_id', 'ratings_count', 'title',
       'descriptiondetect', 'titledetect'],
      dtype='object')

#### is_ebook

In [115]:
books.is_ebook.value_counts()

False    78707
True      1938
Name: is_ebook, dtype: int64

#### Format

In [116]:
books.format.isnull().sum()

1349

In [117]:
books.format= books.format.fillna('not defined')

In [118]:
books.format.value_counts()

Hardcover                         40749
Paperback                         29510
Board Book                         3074
ebook                              1875
not defined                        1349
Audio CD                           1105
Unknown Binding                     548
Library Binding                     422
Audio                               422
Mass Market Paperback               388
Board book                          337
Audiobook                           256
Novelty Book                        178
Audio Cassette                       79
Kindle Edition                       54
Spiral-bound                         52
Leather Bound                        39
paperback                            19
Audible Audio                        14
MP3 CD                               12
School &amp; Library Binding          9
Big Book                              9
Bath Book                             8
hardback                              8
Turtleback                            7


#### Publisher

In [119]:
books.publisher.isnull().sum()

924

In [120]:
books.publisher= books.publisher.fillna('not defined')

In [121]:
books.publisher.value_counts()

HarperCollins                           3134
HMH Books for Young Readers             2494
Random House Books for Young Readers    1798
Scholastic                              1689
Candlewick Press                        1285
                                        ... 
Ignatius Ho                                1
Raintree Publishers                        1
New Chapter Press                          1
Lothrop Lee & Shepard                      1
Random House Value Publications,U.S.       1
Name: publisher, Length: 6212, dtype: int64

## EDA AND DATA CLEANING PART 2

### Dealing with missing values after first round of data cleaning

In [122]:
books.shape

(80645, 15)

In [123]:
books.isnull().sum()

isbn                  1885
text_reviews_count       0
is_ebook                 0
average_rating           0
description              0
format                   0
publisher                0
num_pages             7735
isbn13                 461
publication_year         0
book_id                  0
ratings_count            0
title                    0
descriptiondetect        0
titledetect              0
dtype: int64

#### ISBN and ISBN 13

In this case, we are going to replace the null values in the isbn and isbn13 columns with the value 'not defined'. ISBN and ISBN 13 are useful codes to identify the books, so we will keep them in case they are useful later.
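
Both columns can also be filled in a single call by passing a dict to fillna. A minimal sketch on hypothetical rows, showing that columns not listed in the dict keep their NaN values:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one book without isbn, one without isbn13.
demo = pd.DataFrame({
    "isbn": ["8416690073", np.nan],
    "isbn13": [np.nan, "9788415208464"],
    "num_pages": [320.0, np.nan],
})

# A dict limits the fill to the listed columns; num_pages keeps its NaN.
demo = demo.fillna({"isbn": "not defined", "isbn13": "not defined"})
```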

In [124]:
books.isbn= books.isbn.fillna('not defined')

In [125]:
books.isbn13=books.isbn13.fillna('not defined')

#### Number of pages

In [126]:
books.num_pages.isnull().sum()

7735

In [127]:
books.num_pages=pd.to_numeric(books.num_pages)

In [128]:
books.num_pages.describe().T

count    72910.000000
mean        91.226978
std        105.056666
min          0.000000
25%         32.000000
50%         40.000000
75%        128.000000
max       3816.000000
Name: num_pages, dtype: float64

In [129]:
books[books.num_pages== 0]

Unnamed: 0,isbn,text_reviews_count,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title,descriptiondetect,titledetect
86,0739356275,7,False,3.99,Peter Hatcher's summer is not looking good.\nF...,Audio CD,Listening Library (Audio),0.0,9780739356272,2007,428336,38,Fudge-A-Mania,peter hatcher's summer is not looking good.\nf...,fudge-a-mania
547,1400109124,2,False,3.81,Young Cedric Errol lives in poverty in New Yor...,Audio CD,Tantor Media,0.0,9781400109128,2008,5844470,3,"Little Lord Fauntleroy, with eBook",young cedric errol lives in poverty in new yor...,"little lord fauntleroy, with ebook"
586,1400090938,4,False,3.86,Who is Ida B. Applewood? She is a fourth grade...,Audio Cassette,Listening Library (Audio),0.0,9781400090938,2004,988939,10,"Ida B: ...and Her Plans to Maximize Fun, Avoid...",who is ida b. applewood? she is a fourth grade...,"ida b: ...and her plans to maximize fun, avoid..."
1270,0613240820,2,False,3.86,When a man brings a pregnant border collie to ...,not defined,not defined,0.0,9780613240826,2000,3775172,8,Abandoned,when a man brings a pregnant border collie to ...,abandoned
1372,0739360469,1,False,4.19,"One night, Mercy hears a noise. An unlikely th...",Audio,Listening Library (Audio),0.0,9780739360460,2007,13134116,1,Mercy Watson Fights Crime,"one night, mercy hears a noise. an unlikely th...",mercy watson fights crime
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79201,0307282481,19,False,4.16,Mrs. Piggle-Wiggle lives in an upside-down hou...,Audio CD,Listening Library (Audio),0.0,9780307282484,2005,1290115,38,Mrs. Piggle-Wiggle,mrs. piggle-wiggle lives in an upside-down hou...,mrs. piggle-wiggle
79520,1400138914,1,False,4.05,"Aesop, an ancient Greek poet who was sold into...",Audio CD + ebook,Tantor Media,0.0,9781400138913,2008,8873903,1,Aesop's Fables,"aesop, an ancient greek poet who was sold into...",aesop's fables
79776,1600246753,25,False,3.96,The classic Newbery Honor book that inspired t...,Audio CD,"Little, Brown Young Readers",0.0,9781600246753,2009,6080943,66,Mr. Popper's Penguins,the classic newbery honor book that inspired t...,mr. popper's penguins
80185,1427217254,2,False,4.18,"Brave Ireneis Irene Bobbin, the dressmaker's d...",Paperback,Macmillan Young Listeners,0.0,9781427217257,2011,12079010,4,Brave Irene,"brave ireneis irene bobbin, the dressmaker's d...",brave irene


Since the number of pages is not a very important variable and it contains many errors, we will fill the null values with 0. This way we keep those rows, which contain otherwise meaningful information.
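
After this fill, 0 stands both for the missing values and for the zeros that were already present (mostly audio formats in the output above), so any later statistic on num_pages should probably exclude those rows. A minimal sketch on hypothetical values:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"num_pages": [32.0, 40.0, np.nan, 0.0, 128.0]})

# Same fill as in the notebook: missing page counts become 0.
demo["num_pages"] = demo["num_pages"].fillna(0)

# Treat 0 as "unknown" when summarising, so it does not pull the statistics down.
known_pages = demo.loc[demo["num_pages"] > 0, "num_pages"]
median_all = demo["num_pages"].median()
median_known = known_pages.median()
```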

In [130]:
books.num_pages= books.num_pages.fillna(0)

In [131]:
books.isnull().sum()

isbn                  0
text_reviews_count    0
is_ebook              0
average_rating        0
description           0
format                0
publisher             0
num_pages             0
isbn13                0
publication_year      0
book_id               0
ratings_count         0
title                 0
descriptiondetect     0
titledetect           0
dtype: int64

In [132]:
books.shape

(80645, 15)

### DEALING WITH REPEATED TITLES AND DESCRIPTIONS

In [133]:
# applying the function we created at the beginning of the notebook
get_uniques(books)

isbn : 78761 unique values.
text_reviews_count : 980 unique values.
is_ebook : {False, True}
average_rating : 286 unique values.
description : 73913 unique values.
format : 99 unique values.
publisher : 6212 unique values.
num_pages : 643 unique values.
isbn13 : 80185 unique values.
publication_year : 115 unique values.
book_id : 80645 unique values.
ratings_count : 3920 unique values.
title : 64901 unique values.
descriptiondetect : 73894 unique values.
titledetect : 64123 unique values.


We can notice that titledetect and descriptiondetect have fewer unique values than title and description, because they are already in lowercase. 
So we are going to use those columns for the data cleaning.
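
One possible way to use the lowercased columns for removing repeated books is to sort by a popularity measure and keep the first row for each (titledetect, descriptiondetect) pair. This is only a sketch on toy data: the helper name `drop_duplicate_editions` and the choice of text_reviews_count to decide which row survives are assumptions, not necessarily the rule applied later in the notebook. The descriptions are shortened placeholders.

```python
import pandas as pd


def drop_duplicate_editions(df):
    """Keep the most-reviewed row for each (titledetect, descriptiondetect) pair."""
    ranked = df.sort_values("text_reviews_count", ascending=False)
    return ranked.drop_duplicates(subset=["titledetect", "descriptiondetect"], keep="first")


toy = pd.DataFrame({
    "book_id": [13167169, 13167167, 89378],
    "text_reviews_count": [2, 38, 193],
    "titledetect": ["zoom, rocket, zoom!", "zoom, rocket, zoom!", "dog heaven"],
    "descriptiondetect": ["when night has come", "when night has come", "in newbery medalist"],
})
deduped = drop_duplicate_editions(toy)
```

Requiring both columns to match is the conservative choice: different books that share a title, or different titles that share a publisher blurb as description, are left untouched.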

In [134]:
books.titledetect.nunique()

64123

In [135]:
books.descriptiondetect.nunique()

73894

In [136]:
books.columns

Index(['isbn', 'text_reviews_count', 'is_ebook', 'average_rating',
       'description', 'format', 'publisher', 'num_pages', 'isbn13',
       'publication_year', 'book_id', 'ratings_count', 'title',
       'descriptiondetect', 'titledetect'],
      dtype='object')

In [137]:
books.descriptiondetect.value_counts()

boyds mills press publishes a wide range of high-quality fiction and nonfiction picture books, chapter books, novels, and nonfiction ...

In [138]:
books[books.descriptiondetect.map(books.descriptiondetect.value_counts())== 32]

Unnamed: 0,isbn,text_reviews_count,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title,descriptiondetect,titledetect
2537,0261660578,3,False,3.98,Follow the yellow brick road!\nDorothy thinks ...,not defined,Diamond Books,159.0,9780261660571,1993,7569458,14,The Wizard of Oz,follow the yellow brick road!\ndorothy thinks ...,the wizard of oz
3183,1843546590,1,False,3.98,Follow the yellow brick road!\nDorothy thinks ...,Hardcover,Atlantic,304.0,9781843546597,2008,7739742,7,The Wizard of Oz. L. Frank Baum,follow the yellow brick road!\ndorothy thinks ...,the wizard of oz. l. frank baum
3184,1905716524,7,False,3.98,Follow the yellow brick road!\nDorothy thinks ...,Hardcover,Collector's Library,184.0,9781905716524,2010,7739741,60,The Wizard of Oz,follow the yellow brick road!\ndorothy thinks ...,the wizard of oz
5849,0907785751,2,False,3.98,Follow the yellow brick road!\nDorothy thinks ...,not defined,Robert Frederick,0.0,9780907785750,2001,2889190,27,The Wizard of Oz,follow the yellow brick road!\ndorothy thinks ...,the wizard of oz
8906,not defined,3,False,3.98,Follow the yellow brick road!\nDorothy thinks ...,Paperback,Dalmatian Press,180.0,9781453064603,2013,18631990,19,The Wonderful of World Oz,follow the yellow brick road!\ndorothy thinks ...,the wonderful of world oz
10679,1843913909,13,False,3.98,Follow the yellow brick road!\nDorothy thinks ...,Paperback,Hesperus Press,144.0,9781843913900,2013,17729145,92,"The Wonderful Wizard of Oz (Oz, #1)",follow the yellow brick road!\ndorothy thinks ...,"the wonderful wizard of oz (oz, #1)"
14382,0395618045,4,False,3.98,Follow the yellow brick road!\nDorothy thinks ...,Paperback,not defined,0.0,9780395618042,1992,2603731,13,The Wonderful Wizard of Oz,follow the yellow brick road!\ndorothy thinks ...,the wonderful wizard of oz
22104,0141305460,5,False,3.98,Follow the yellow brick road!\nDorothy thinks ...,Paperback,Puffin,189.0,9780141305462,1999,3309639,45,The Wizard of Oz,follow the yellow brick road!\ndorothy thinks ...,the wizard of oz
22340,0517500868,5,False,3.98,Follow the yellow brick road!\nDorothy thinks ...,Hardcover,Clarkson Potter Publishers,384.0,9780517500866,1973,719973,13,The Annotated Wizard of Oz: The Wonderful Wiza...,follow the yellow brick road!\ndorothy thinks ...,the annotated wizard of oz: the wonderful wiza...
22573,1435139739,44,False,3.98,Follow the yellow brick road!\nDorothy thinks ...,Hardcover,Barnes & Noble,224.0,9781435139732,2012,15945612,256,The Wonderful Wizard of Oz,follow the yellow brick road!\ndorothy thinks ...,the wonderful wizard of oz


In [139]:
books[(books.descriptiondetect.map(books.descriptiondetect.value_counts()) >= 2) & (books.titledetect.map(books.titledetect.value_counts())>= 2)].sort_values('titledetect', ascending=False)

Unnamed: 0,isbn,text_reviews_count,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title,descriptiondetect,titledetect
65347,0802727913,2,False,3.35,When night has come and the moon shines bright...,Hardcover,Walker Childrens,32.0,9780802727916,2012,13167169,7,"Zoom, Rocket, Zoom!",when night has come and the moon shines bright...,"zoom, rocket, zoom!"
64625,0802727905,38,False,3.35,When night has come and the moon shines bright...,Hardcover,Walker Childrens,32.0,9780802727909,2012,13167167,185,"Zoom, Rocket, Zoom!",when night has come and the moon shines bright...,"zoom, rocket, zoom!"
13998,1442412720,50,False,4.11,A Simon & Schuster eBook,Hardcover,Beach Lane Books,40.0,9781442412729,2010,7866583,223,ZooBorns!: Zoo Babies from Around the World,a simon & schuster ebook,zooborns!: zoo babies from around the world
10575,074596270X,2,False,3.47,A strikingly illustrated story with a heartwar...,Paperback,Lion Hudson,32.0,9780745962702,2013,13162256,8,Zoo Girl,a strikingly illustrated story with a heartwar...,zoo girl
14757,0745963234,20,False,3.47,A strikingly illustrated story with a heartwar...,Hardcover,Lion Hudson,32.0,9780745963235,2012,13198862,120,Zoo Girl,a strikingly illustrated story with a heartwar...,zoo girl
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72619,1442499281,5,False,4.06,A Simon & Schuster eBook,Board Book,Little Simon,36.0,9781442499287,2014,18668444,31,1-2-3 Peas,a simon & schuster ebook,1-2-3 peas
3002,0399230130,99,False,3.90,"Joyously colored animals, riding on a train to...",Board book,Philomel Books,20.0,9780399230134,1996,532249,1203,"1, 2, 3 to the Zoo","joyously colored animals, riding on a train to...","1, 2, 3 to the zoo"
12177,039961172X,14,False,3.90,"Joyously colored animals, riding on a train to...",Hardcover,Philomel Books,34.0,9780399611728,1982,3004699,66,"1, 2, 3 to the Zoo","joyously colored animals, riding on a train to...","1, 2, 3 to the zoo"
71531,0547531176,3,True,4.36,"Once upon a time, children imagined St. Nichol...",ebook,Houghton Mifflin,0.0,9780547531175,2005,8846499,15,'Twas the Night Before Christmas,"once upon a time, children imagined st. nichol...",'twas the night before christmas
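
The filter above maps every value to its frequency and keeps rows where that frequency is at least 2, checking each column independently. A minimal sketch on toy data (the values below are hypothetical, not taken from the dataset) shows the pattern next to `duplicated(keep=False)`, which flags the same rows here. One caveat: `value_counts()` skips missing values by default while `duplicated` treats them as equal, so the two masks can differ when a column contains NaN.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only).
toy = pd.DataFrame({
    "titledetect": ["zoo girl", "zoo girl", "frazzle", "zucchini", "zucchini"],
    "descriptiondetect": ["a story", "a story", "other", "first blurb", "second blurb"],
})

# Notebook pattern: map each value to its frequency, keep values seen 2+ times.
title_counts = toy.titledetect.map(toy.titledetect.value_counts())
desc_counts = toy.descriptiondetect.map(toy.descriptiondetect.value_counts())
mask_counts = (title_counts >= 2) & (desc_counts >= 2)

# Alternative: duplicated(keep=False) marks every member of a repeated group.
mask_dup = (
    toy.titledetect.duplicated(keep=False)
    & toy.descriptiondetect.duplicated(keep=False)
)

# Only the two "zoo girl" rows repeat in both columns.
both_repeated = toy[mask_counts]
print(both_repeated)
```

The "zucchini" rows share a title but not a description, so neither mask selects them.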


In [140]:
# Sort by review count, highest first, so keep='first' below retains the most-reviewed edition
books1 = books.sort_values('text_reviews_count', ascending=False)

In [141]:
books1

Unnamed: 0,isbn,text_reviews_count,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title,descriptiondetect,titledetect
71712,0385732554,49850,False,4.12,Twelve-year-old Jonas lives in a seemingly ide...,Paperback,Ember,208.0,9780385732550,2006,3636,1311422,"The Giver (The Giver, #1)",twelve-year-old jonas lives in a seemingly ide...,"the giver (the giver, #1)"
20152,0375869026,31536,False,4.43,I won't describe what I look like. Whatever yo...,Hardcover,Knopf,316.0,9780375869020,2012,11387515,255461,Wonder (Wonder #1),i won't describe what i look like. whatever yo...,wonder (wonder #1)
34990,043965548X,28561,False,4.53,Harry Potter's third year at Hogwarts is full ...,Mass Market Paperback,Scholastic Inc.,435.0,9780439655484,2004,5,1876252,Harry Potter and the Prisoner of Azkaban (Harr...,harry potter's third year at hogwarts is full ...,harry potter and the prisoner of azkaban (harr...
13148,0156012197,16639,False,4.28,"Moral allegory and spiritual autobiography, Th...",Paperback,"Harcourt, Inc.",93.0,9780156012195,2000,157993,763309,The Little Prince,"moral allegory and spiritual autobiography, th...",the little prince
70661,0439244196,14851,False,3.93,Stanley tries to dig up the truth in this inve...,Paperback,Scholastic,233.0,9780439244190,2000,38709,766680,"Holes (Holes, #1)",stanley tries to dig up the truth in this inve...,"holes (holes, #1)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28737,0613504968,0,False,3.98,To explain the markings on their faces and tai...,Hardcover,Turtleback Books,40.0,9780613504966,2001,9277714,7,"Sagwa, the Chinese Siamese Cat",to explain the markings on their faces and tai...,"sagwa, the chinese siamese cat"
37196,1556614071,0,False,3.43,An exciting series of chapter books for reader...,Paperback,Bethany House Publishers,59.0,9781556614071,1994,341579,46,The Mystery of the Wrong Dog,an exciting series of chapter books for reader...,the mystery of the wrong dog
62302,3150091608,0,False,4.00,"Source of legend and lyric, reference and conj...",Paperback,Reclam Philipp Jun.,148.0,9783150091609,1984,853469,38,Alice's Adventures in Wonderland,"source of legend and lyric, reference and conj...",alice's adventures in wonderland
21328,0824918452,0,False,3.77,A classic bestseller presented in a new size w...,Board Book,WorthyKids/ideals,22.0,9780824918453,2010,9791536,28,The Story of Christmas,a classic bestseller presented in a new size w...,the story of christmas


In [142]:
books1.shape

(80645, 15)

In [143]:
# Drop rows that repeat both description and title; the first (most-reviewed) row is kept
books2 = books1.drop_duplicates(subset=['descriptiondetect', 'titledetect'], keep='first')

In [144]:
books2.shape

(76213, 15)
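
Because `books1` is sorted by `text_reviews_count` in descending order, `keep='first'` retains the edition with the most text reviews in each duplicate group. A minimal sketch of that interaction, using a toy frame with hypothetical values:

```python
import pandas as pd

# Toy editions (hypothetical values): book_id 1 and 2 share title and description.
toy = pd.DataFrame({
    "book_id": [1, 2, 3],
    "text_reviews_count": [2, 38, 5],
    "titledetect": ["zoom, rocket, zoom!", "zoom, rocket, zoom!", "frazzle"],
    "descriptiondetect": ["when night has come", "when night has come", "a frazzled day"],
})

# Most-reviewed rows first, then keep the first row of every duplicate group.
ranked = toy.sort_values("text_reviews_count", ascending=False)
deduped = ranked.drop_duplicates(
    subset=["descriptiondetect", "titledetect"], keep="first"
)

# book_id 2 (38 reviews) survives; book_id 1 (2 reviews) is dropped.
print(deduped[["book_id", "text_reviews_count"]])
```

Without the sort, `keep='first'` would simply keep whichever edition appears first in the original row order.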

In [145]:
books2[(books2.titledetect.map(books2.titledetect.value_counts()) >= 2)].sort_values('titledetect', ascending=False)

Unnamed: 0,isbn,text_reviews_count,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title,descriptiondetect,titledetect
2736,055315608X,1,False,3.68,A painfully shy young boy befriends a homeless...,Paperback,Skylark,160.0,9780553156089,1984,652287,8,Zucchini,a painfully shy young boy befriends a homeless...,zucchini
66272,0440414024,7,False,3.68,Zucchini knows there's more to life than his c...,Paperback,Yearling,160.0,9780440414025,1984,2865870,59,Zucchini,zucchini knows there's more to life than his c...,zucchini
32192,1101002697,1,True,4.05,Are You a Believer in Fanciful Things? In Pira...,ebook,Razorbill,0.0,9781101002698,2008,8910936,2,Zorgamazoo,are you a believer in fanciful things? in pira...,zorgamazoo
29781,1595142959,43,False,4.05,Are You a Believer in Fanciful Things? In Pira...,Paperback,Razorbill,288.0,9781595142955,2010,7911670,110,Zorgamazoo,are you a believer in fanciful things? in pira...,zorgamazoo
19641,037586847X,18,False,3.75,Winter weather is keeping children from visiti...,Hardcover,Knopf Books for Young Readers,40.0,9780375868474,2011,11331530,73,ZooZical,winter weather is keeping children from visiti...,zoozical
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67217,0688156347,8,False,3.81,"Here are\nLittle Guy,\nLittle Pumpkin,\nand\nL...",Board Book,Greenwillow Books,0.0,9780688156343,1997,1296580,97,"""More More More,"" Said the Baby","here are\nlittle guy,\nlittle pumpkin,\nand\nl...","""more more more,"" said the baby"
22419,0688091741,5,False,3.81,"Here are Little Guy, Little Pumpkin, and Littl...",Hardcover,Greenwillow Books,32.0,9780688091743,1990,3875332,11,"""More More More,"" Said the Baby","here are little guy, little pumpkin, and littl...","""more more more,"" said the baby"
15002,0688091733,7,False,3.81,"Here are Little Guy, Little Pumpkin,and Little...",Hardcover,Greenwillow Books,40.0,9780688091736,1990,2599431,60,"""More More More,"" Said the Baby","here are little guy, little pumpkin,and little...","""more more more,"" said the baby"
77729,0763614521,103,False,4.11,The Barnes & Noble Review\nA little girl's dre...,Hardcover,Candlewick,32.0,9780763614522,2001,1068302,411,"""Let's Get a Pup!"" Said Kate",the barnes & noble review\na little girl's dre...,"""let's get a pup!"" said kate"


In [146]:
# Same title, different description: keep only the most-reviewed row per title
books3 = books2.drop_duplicates(subset=['titledetect'], keep='first')

In [147]:
books3.shape

(64123, 15)

In [148]:
books3.titledetect.value_counts()

dinosaurs: a visual encyclopedia                             1
stella batts needs a new name (stella batts, #1)             1
something beginning with blue. nick sharratt, sally symes    1
brigid's cloak: an ancient irish story                       1
kindergarten cat                                             1
                                                            ..
princess grace and the little lost kitten                    1
frazzle                                                      1
the blizzard on blue mountain (cabin creek mysteries #5)     1
rhymes & reasons                                             1
there was a tree                                             1
Name: titledetect, Length: 64123, dtype: int64

In [149]:
books3[books3.descriptiondetect.map(books3.descriptiondetect.value_counts())>= 2].sort_values('descriptiondetect', ascending=False)

Unnamed: 0,isbn,text_reviews_count,is_ebook,average_rating,description,format,publisher,num_pages,isbn13,publication_year,book_id,ratings_count,title,descriptiondetect,titledetect
68837,0152063897,123,False,4.21,"Zooom! Wooeeee . . . ! ""Make way!""The big city...",Hardcover,HMH Books for Young Readers,40.0,9780152063894,2009,6399395,2022,Little Blue Truck Leads the Way,"zooom! wooeeee . . . ! ""make way!""the big city...",little blue truck leads the way
52228,0547850603,1,False,4.21,"Zooom! Wooeeee . . . ! ""Make way!""The big city...",Big Book,HMH Books for Young Readers,40.0,9780547722146,2012,13356695,1,Little Blue Truck Leads the Way big book,"zooom! wooeeee . . . ! ""make way!""the big city...",little blue truck leads the way big book
10216,1848778570,1,False,3.60,Zookeeper Mr. Peek causes pandemonium again!\n...,Paperback,Templar Books,40.0,9781848778573,2013,17594073,6,Pandamonium at Peek Zoo. Kevin Waldron,zookeeper mr. peek causes pandemonium again!\n...,pandamonium at peek zoo. kevin waldron
18469,0763666580,22,False,3.60,Zookeeper Mr. Peek causes pandemonium again!\n...,Hardcover,Templar,40.0,9780763666583,2014,18209409,75,Panda-monium at Peek Zoo,zookeeper mr. peek causes pandemonium again!\n...,panda-monium at peek zoo
30817,0694016608,68,False,3.43,Your world. My world. I can swing right over t...,Paperback,HarperCollins,32.0,9780694016600,2004,237337,554,My World: A Companion to Goodnight Moon,your world. my world. i can swing right over t...,my world: a companion to goodnight moon
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25840,9584504452,1,False,3.84,"""Beware a mermaid's wrath!""\nThe mermaid Soop ...",Hardcover,not defined,0.0,9789584504456,2007,2983697,13,El Refugio De Las Hadas Y La Busqueda De La Va...,"""beware a mermaid's wrath!""\nthe mermaid soop ...",el refugio de las hadas y la busqueda de la va...
19156,0749732105,115,False,4.32,"""A land at the top of a tree!"" said Connie. ""I...",Paperback,Egmont,185.0,9780749732103,1997,473623,11899,The Folk of the Faraway Tree (The Faraway Tree...,"""a land at the top of a tree!"" said connie. ""i...",the folk of the faraway tree (the faraway tree...
73860,1405230576,15,False,4.32,"""A land at the top of a tree!"" said Connie. ""I...",Paperback,Egmont,192.0,9781405230575,2007,319977,171,The Folk of the Faraway Tree,"""a land at the top of a tree!"" said connie. ""i...",the folk of the faraway tree
73351,0340042443,1,False,3.80,"""A car is stolen-- and Peter and Janet witness...",Paperback,Knight Books,128.0,9780340042441,1970,2657284,10,"Good Work, Secret Seven","""a car is stolen-- and peter and janet witness...","good work, secret seven"


In [150]:
# In the end we decided to drop duplicate descriptions as well
books4 = books3.drop_duplicates(subset=['descriptiondetect'], keep='first')

In [151]:
books4.shape

(62395, 15)
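
To see how much each deduplication pass removed, the row counts reported by the `.shape` calls in this section can be compared directly. The sketch below only does arithmetic on those four numbers; the dictionary and variable names are illustrative.

```python
# Row counts copied from the .shape outputs above.
shapes = {"books1": 80645, "books2": 76213, "books3": 64123, "books4": 62395}

# Rows removed by each consecutive step.
names = list(shapes)
removed = {
    f"{before} -> {after}": shapes[before] - shapes[after]
    for before, after in zip(names, names[1:])
}

total_removed = shapes["books1"] - shapes["books4"]
share_removed = total_removed / shapes["books1"]

for step, count in removed.items():
    print(f"{step}: {count} rows removed")
print(f"total: {total_removed} rows ({share_removed:.1%} of books1)")
```

The title-only pass accounts for the largest drop (12090 rows), compared with 4432 for the combined title and description pass and 1728 for the description-only pass.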

In [152]:
books4.columns

Index(['isbn', 'text_reviews_count', 'is_ebook', 'average_rating',
       'description', 'format', 'publisher', 'num_pages', 'isbn13',
       'publication_year', 'book_id', 'ratings_count', 'title',
       'descriptiondetect', 'titledetect'],
      dtype='object')

In [153]:
# Save the deduplicated frame (helper columns included) without the index
file_name = 'descriptionsdfclean.csv'
books4.to_csv(file_name, index=False)
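
One thing to watch when this CSV is loaded again: identifier columns such as `isbn` are stored as text with leading zeros, and `read_csv` may parse all-digit values as integers unless told otherwise. A minimal sketch, assuming the identifiers should stay strings; it round-trips through an in-memory buffer instead of the real file, and the two rows reuse ISBNs shown earlier in this section.

```python
import io

import pandas as pd

# Two rows with ISBNs that appear in the tables above.
toy = pd.DataFrame({
    "isbn": ["0517500868", "1435139739"],
    "isbn13": ["9780517500866", "9781435139732"],
    "title": ["The Annotated Wizard of Oz", "The Wonderful Wizard of Oz"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default type inference turns the all-digit ISBNs into integers.
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Passing dtype keeps them as text, leading zero included.
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"isbn": str, "isbn13": str})

print(inferred["isbn"].tolist())
print(as_text["isbn"].tolist())
```

In the full dataset some ISBNs end in `X`, so inference may behave differently there; passing `dtype` explicitly removes the ambiguity.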