You have to work on the files:
*  [Books](https://github.com/gdv/foundationsCS/raw/master/progetti/2021/Books.csv.gz)
*  [Book ratings](https://github.com/gdv/foundationsCS/raw/master/progetti/2021/Book-Ratings.csv.gz)
*  [Users](https://github.com/gdv/foundationsCS/raw/master/progetti/2021/Users.csv.gz)
*  [Goodbooks books](https://github.com/gdv/foundationsCS/raw/master/progetti/2021/goodbooks.csv.gz)
*  [Goodbooks ratings](https://github.com/gdv/foundationsCS/raw/master/progetti/2021/goodbooks-ratings.csv.gz)

## Notes

1.    It is mandatory to use GitHub for developing the project.
1.    The project must be a Jupyter notebook.
1.    There is no restriction on the libraries that can be used, nor on the Python version.
1.    To read those files, you need to use the `encoding = 'latin-1'` option.
1.    All questions on the project **must** be asked in a public channel on [Zulip](https://focs.zulipchat.com), otherwise no answer will be given.

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

## Importing libraries and datasets

For this project we use the `pandas` and `NumPy` libraries.

In [2]:
import pandas as pd
import numpy as np

To import the dataframes we use the pandas `read_csv` function, specifying among its parameters:
 - the file compression type (`gzip`)
 - the encoding (`latin-1`)
 - the separator (`,` or `;`)
 - where needed, the escape character (`\\`)
 - the attribute types

We specify the attribute types for some datasets in order to speed up loading and to correct wrong type inference.
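As a rough illustration of these parameters, the sketch below builds a tiny gzip-compressed CSV in memory and reads it back with the same options. The sample content and the names `raw`, `buffer` and `sample` are made up for the example; only the `read_csv` options mirror the ones used in this notebook.

```python
import gzip
import io

import pandas as pd

# Hypothetical sample standing in for one of the project files:
# ';' separator, a backslash-escaped quote, latin-1 text, one missing age.
raw = (
    'User-ID;Location;Age\n'
    '1;"m\\"unchen, bayern, germany";25\n'
    '2;"orléans, centre, france";\n'
)
buffer = io.BytesIO(gzip.compress(raw.encode('latin-1')))

sample = pd.read_csv(
    buffer,
    compression='gzip',      # the data is gzip-compressed
    encoding='latin-1',      # decode bytes as latin-1
    sep=';',                 # field separator
    escapechar='\\',         # a backslash escapes the following character
    dtype={'Age': 'Int64'},  # nullable integer, so missing ages stay integers
)
print(sample)
print(sample.dtypes)
```

The nullable `Int64` type is what lets the `Age` column hold missing values without being converted to floats.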

In [3]:
users = pd.read_csv('https://github.com/gdv/foundationsCS/raw/master/progetti/2021/Users.csv.gz', compression = 'gzip', escapechar = "\\", encoding = 'latin-1', sep = ';', dtype = {'Age': 'Int64'})
users.shape

(278858, 3)

In [4]:
books = pd.read_csv('https://github.com/gdv/foundationsCS/raw/master/progetti/2021/Books.csv.gz', compression = 'gzip', escapechar = "\\", encoding = 'latin-1', sep = ';')
books.shape

(271359, 8)

In [5]:
goodbooks = pd.read_csv('https://github.com/gdv/foundationsCS/raw/master/progetti/2021/goodbooks.csv.gz', compression = 'gzip', encoding = 'latin-1', sep = ',', dtype = {'isbn13': 'object', 'original_publication_year': 'int64',})
goodbooks.shape

(99, 23)

In [6]:
book_ratings = pd.read_csv('https://github.com/gdv/foundationsCS/raw/master/progetti/2021/Book-Ratings.csv.gz', compression = 'gzip', escapechar = "\\", encoding = 'latin-1', sep = ';')
book_ratings.shape

(1149780, 3)

In [7]:
goodbooks_ratings = pd.read_csv('https://github.com/gdv/foundationsCS/raw/master/progetti/2021/goodbooks-ratings.csv.gz', compression = 'gzip', encoding = 'latin-1', sep = ',')
goodbooks_ratings.shape

(99, 3)

## Dataset checks

### Checking the attribute types of the datasets

We check that the attribute types match the expected ones, fixing any anomalies at import time.

In [8]:
users.dtypes

User-ID      int64
Location    object
Age          Int64
dtype: object

In [9]:
books.dtypes

ISBN                   object
Book-Title             object
Book-Author            object
Year-Of-Publication     int64
Publisher              object
Image-URL-S            object
Image-URL-M            object
Image-URL-L            object
dtype: object

In [10]:
goodbooks.dtypes

book_id                        int64
goodreads_book_id              int64
best_book_id                   int64
work_id                        int64
books_count                    int64
isbn                          object
isbn13                        object
authors                       object
original_publication_year      int64
original_title                object
title                         object
language_code                 object
average_rating               float64
ratings_count                  int64
work_ratings_count             int64
work_text_reviews_count        int64
ratings_1                      int64
ratings_2                      int64
ratings_3                      int64
ratings_4                      int64
ratings_5                      int64
image_url                     object
small_image_url               object
dtype: object

In [11]:
book_ratings.dtypes

User-ID         int64
ISBN           object
Book-Rating     int64
dtype: object

In [12]:
goodbooks_ratings.dtypes

user_id    int64
book_id    int64
rating     int64
dtype: object

### Checking for anomalous values

Since merging on ISBN is required several times, we check this attribute for special characters, that is, any character other than letters and digits (`[a-zA-Z0-9]`). Note that the `\W` pattern used below also treats the underscore as a word character.
We run the check on every dataset that contains an ISBN attribute, using the `str.contains` function.

In [13]:
book_ratings[book_ratings['ISBN'].str.contains(r"\W+", regex = True)]

Unnamed: 0,User-ID,ISBN,Book-Rating
535,276929,2.02.032126.2,0
536,276929,2.264.03602.8,0
8918,278491,01420.01740,10
9745,183,100940/86,9
9746,183,10622/86,0
...,...,...,...
1145175,275414,"""8888809228""",5
1146052,275891,384220/2/52,0
1146054,275891,400/33/72,0
1147650,276009,01400.77022,0


In [14]:
books[books['ISBN'].str.contains(r"\W+", regex = True)]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
111808,0486404242\t,War in Kind: And Other Poems (Dover Thrift Edi...,Stephen Crane,1998,Dover Publications,http://images.amazon.com/images/P/0486404242.0...,http://images.amazon.com/images/P/0486404242.0...,http://images.amazon.com/images/P/0486404242.0...
171206,3518365479<90,"Suhrkamp TaschenbÃ?ÃÂ¼cher, Nr.47, Frost",Thomas Bernhard,1972,Suhrkamp,http://images.amazon.com/images/P/3518365479.0...,http://images.amazon.com/images/P/3518365479.0...,http://images.amazon.com/images/P/3518365479.0...
251423,3442248027 3,Diamond Age. Die Grenzwelt.,Neal Stephenson,2000,Goldmann,http://images.amazon.com/images/P/3442248027.0...,http://images.amazon.com/images/P/3442248027.0...,http://images.amazon.com/images/P/3442248027.0...
251648,0385722206 0,Balzac and the Little Chinese Seamstress : A N...,DAI SIJIE,2002,Anchor,http://images.amazon.com/images/P/0385722206.0...,http://images.amazon.com/images/P/0385722206.0...,http://images.amazon.com/images/P/0385722206.0...


In [15]:
goodbooks[goodbooks['isbn'].str.contains(r"\W+", regex = True)]

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url


We notice that `book_ratings` and `books` contain anomalous values (many in `book_ratings`, a few in `books`), but we decide not to handle them, since doing so would require more in-depth techniques.
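Although the notebook leaves these values untouched, a first cleaning pass could look roughly like the sketch below. It is only an assumption about one possible approach, not something used in the rest of the notebook: the toy values are modelled on the anomalies listed above, and the names `isbn`, `cleaned` and `looks_like_isbn10` are hypothetical.

```python
import pandas as pd

# Toy ISBNs modelled on the anomalies found above (dots, quotes, a trailing tab).
isbn = pd.Series(['2.02.032126.2', '"8888809228"', '0486404242\t', '006250746X'])

# Drop every character that is not an ASCII letter or digit, then upper-case.
cleaned = isbn.str.replace(r'[^0-9A-Za-z]', '', regex=True).str.upper()

# Flag values shaped like a 10-character ISBN: nine digits plus a digit or 'X'.
looks_like_isbn10 = cleaned.str.match(r'^\d{9}[\dX]$')

print(pd.DataFrame({'raw': isbn, 'cleaned': cleaned, 'ok': looks_like_isbn10}))
```

A shape check like this does not validate the ISBN checksum, and values such as `100940/86` would still fail it after cleaning.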

## 1. Normalize the location field of *Users* dataset, splitting into city, region, country.

We split the `Location` string into three columns at the commas, using the pandas string-splitting functions. To handle locations with more than two commas we use the `rsplit` variant with at most two splits: it splits from the right, so it gives priority to the `Country` field and any extra commas stay in `City`. With `expand = True` the pieces are returned as separate columns, and strings with fewer than two commas get missing values in the remaining columns instead of raising an error.

In [16]:
users[['City', 'Region', 'Country']] = users['Location'].str.rsplit(',', n=2, expand=True)
users.head(10)

Unnamed: 0,User-ID,Location,Age,City,Region,Country
0,1,"nyc, new york, usa",,nyc,new york,usa
1,2,"stockton, california, usa",18.0,stockton,california,usa
2,3,"moscow, yukon territory, russia",,moscow,yukon territory,russia
3,4,"porto, v.n.gaia, portugal",17.0,porto,v.n.gaia,portugal
4,5,"farnborough, hants, united kingdom",,farnborough,hants,united kingdom
5,6,"santa monica, california, usa",61.0,santa monica,california,usa
6,7,"washington, dc, usa",,washington,dc,usa
7,8,"timmins, ontario, canada",,timmins,ontario,canada
8,9,"germantown, tennessee, usa",,germantown,tennessee,usa
9,10,"albacete, wisconsin, spain",26.0,albacete,wisconsin,spain
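To see how `rsplit` behaves on locations with more or fewer than two commas, here is a small sketch on made-up strings. The sample locations and the extra `strip` step are assumptions added for illustration.

```python
import pandas as pd

locations = pd.Series([
    'nyc, new york, usa',            # exactly two commas
    'st. louis, mo, missouri, usa',  # three commas
    'n/a, spain',                    # only one comma
])

# Split from the right, at most twice; expand=True returns one column per piece.
parts = locations.str.rsplit(',', n=2, expand=True)
parts.columns = ['City', 'Region', 'Country']

# Remove the blank left after each comma (missing pieces stay missing).
parts = parts.apply(lambda col: col.str.strip())
print(parts)
```

In the three-comma case the extra comma stays inside `City`; in the one-comma case the columns are filled from the left, so the country name lands in `Region` and `Country` is missing.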


Grouping by `Country` reveals anomalous values in some fields, as well as some encoding problems; fixing them, however, is beyond the scope of the main request.

In [17]:
users.groupby('Country').count().head(10)

Unnamed: 0_level_0,User-ID,Location,Age,City,Region
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
,4588,4588,2000,4588,4588
"""n/a""",2,2,1,2,2
&#20013;&#22269;,1,1,1,1,1
&#32654;&#22269;,1,1,1,1,1
*,1,1,0,1,1
-,1,1,0,1,1
-------,1,1,0,1,1
.,1,1,0,1,1
01776,1,1,0,1,1
02458,1,1,0,1,1


## 2. For each book in the *Books* dataset, compute its average rating.

First we do a left join of the `books` and `book_ratings` tables with the pandas `merge` function. Then we group by `Book-Title` and `ISBN` and take the mean of the `Book-Rating` field. Because of the left join, books without any rating are kept, with a missing average.

In [18]:
books_avg = pd.merge(books, book_ratings, on = 'ISBN', how = 'left').groupby(['Book-Title', 'ISBN'], as_index=False).agg(Book_Average_Rating = ('Book-Rating', 'mean'))
books_avg.head(10)

Unnamed: 0,Book-Title,ISBN,Book_Average_Rating
0,A Light in the Storm: The Civil War Diary of ...,0590567330,2.25
1,Always Have Popsicles,0964147726,0.0
2,Apple Magic (The Collector's series),0942320093,0.0
3,"Ask Lily (Young Women of Faith: Lily Series, ...",0310232546,8.0
4,Beyond IBM: Leadership Marketing and Finance ...,0962295701,0.0
5,Clifford Visita El Hospital (Clifford El Gran...,0439188970,0.0
6,Dark Justice,0399151788,10.0
7,Deceived,0786000015,0.0
8,Earth Prayers From around the World: 365 Pray...,006250746X,5.0
9,Final Fantasy Anthology: Official Strategy Gu...,1566869250,5.0
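The same join-then-aggregate pattern can be checked on a toy example. The frames `books_toy` and `ratings_toy` below are invented for illustration; only the column names follow the real datasets.

```python
import pandas as pd

books_toy = pd.DataFrame({
    'ISBN': ['A', 'B', 'C'],
    'Book-Title': ['Alpha', 'Beta', 'Gamma'],
})
ratings_toy = pd.DataFrame({
    'ISBN': ['A', 'A', 'B'],
    'Book-Rating': [10, 0, 7],
})

avg_toy = (
    pd.merge(books_toy, ratings_toy, on='ISBN', how='left')  # keep unrated books
    .groupby(['Book-Title', 'ISBN'], as_index=False)
    .agg(Book_Average_Rating=('Book-Rating', 'mean'))
)
print(avg_toy)
```

Book `C` has no ratings, so it survives the left join and ends up with a missing average rather than disappearing from the result.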


## 3. For each book in the *GoodBooks* dataset, compute its average rating.

To answer this question we perform a left join between `goodbooks` and `goodbooks_ratings` on the `book_id` attribute. We then group the resulting dataframe by `isbn` and `original_title`, computing the mean of the `rating` attribute.

In [19]:
pd.merge(goodbooks, goodbooks_ratings, on = 'book_id', how = 'left').groupby(['isbn', 'original_title'], as_index = False).agg(average_rating = ('rating', 'mean')).head()

Unnamed: 0,isbn,original_title,average_rating
0,014038572X,The Outsiders,
1,014241493X,Paper Towns,
2,030734813X,Jurassic Park,4.0
3,031606792X,Breaking Dawn,
4,043965548X,Harry Potter and the Prisoner of Azkaban,5.0


This question can also be answered using the `average_rating` column already present in the `goodbooks` dataset. We will use the values from this second solution in the following points.

In [20]:
goodbooks[['isbn', 'original_title', 'average_rating']].head()

Unnamed: 0,isbn,original_title,average_rating
0,439023483,The Hunger Games,4.34
1,439554934,Harry Potter and the Philosopher's Stone,4.44
2,316015849,Twilight,3.57
3,61120081,To Kill a Mockingbird,4.25
4,743273567,The Great Gatsby,3.89


## 4. Merge together all rows sharing the same book title, author and publisher. We will call the resulting dataset `merged books`. The books that have not been merged together will not appear in `merged books`.

First we create the `merging_books` dataset, which we also reuse in the next point. It contains all rows that are repeated at least once with respect to the `Book-Title`, `Book-Author` and `Publisher` fields. To build it we use the `duplicated` function, which returns a boolean Series, passing `keep = False` so that every duplicated row is kept.

We then create the `merged_books` dataset by grouping the previous dataset by `Book-Title`, `Book-Author` and `Publisher`, obtaining the `Count` column.

In [35]:
merging_books = books[books.duplicated(subset = ['Book-Title', 'Book-Author', 'Publisher'], keep = False)]  # keep = False keeps every duplicated row, so they can be counted
merged_books = merging_books.groupby(['Book-Title', 'Book-Author', 'Publisher'], as_index=False).agg(Count=('ISBN', 'count'))
merged_books.head(10)

Unnamed: 0,Book-Title,Book-Author,Publisher,Count
0,!%@ (A Nutshell handbook),Donnalyn Frey,O'Reilly,2
1,'A Hell of a Place to Lose a Cow': An American...,Tim Brookes,National Geographic,2
2,"10,000 dreams interpreted: A dictionary of dreams",Gustavus Hindman Miller,Barnes &amp; Nobles Books,2
3,101 Famous Poems,Roy J. Cook,McGraw-Hill/Contemporary Books,3
4,15 Houseplants Even You Can't Kill,Joe Elder,Berkley Pub Group,2
5,158 POUND MARRIAGE,John Irving,Pocket,2
6,1700: Scenes from London Life,Maureen Waller,Four Walls Eight Windows,2
7,1921 : The Great Novel of the Irish Civil War ...,Morgan Llywelyn,Forge Books,2
8,2001. Odyssee im Weltraum.,Arthur C. Clarke,Heyne,2
9,2010: Odyssey Two,Arthur C. Clarke,Del Rey Books,2
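The effect of `keep = False` is easiest to see on a handful of rows. The sketch below uses an invented frame `toy`; the column names match the Books dataset, everything else is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'ISBN': ['1', '2', '3', '4'],
    'Book-Title': ['Dune', 'Dune', 'Emma', 'Dune'],
    'Book-Author': ['Frank Herbert', 'Frank Herbert', 'Jane Austen', 'Frank Herbert'],
    'Publisher': ['Ace', 'Ace', 'Penguin', 'Chilton'],
})
keys = ['Book-Title', 'Book-Author', 'Publisher']

mask_all = toy.duplicated(subset=keys, keep=False)  # marks every member of a duplicated group
mask_first = toy.duplicated(subset=keys)            # default keep='first': first occurrence is not marked

merged_toy = toy[mask_all].groupby(keys, as_index=False).agg(Count=('ISBN', 'count'))
print(merged_toy)
```

With the default `keep='first'` the count for the Ace edition of Dune would be 1 instead of 2, because the first occurrence would be filtered out before grouping.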


## 5. For each book in `merged books` compute its average rating.

We compute `Average_Rating` for each edition (ISBN) of the books in `book_ratings`, then merge the result with the dataset created in the previous point and average across editions.

In [34]:
temp = book_ratings.groupby('ISBN').agg(Average_Rating = ('Book-Rating', 'mean')) # mean rating per ISBN
merging_book_ratings = pd.merge(merging_books, temp, on = 'ISBN', how = 'left')
merging_book_ratings.groupby(['Book-Title', 'Book-Author', 'Publisher'], as_index=False)[['Average_Rating']].mean().head(10) # mean across editions

Unnamed: 0,Book-Title,Book-Author,Publisher,Average_Rating
0,!%@ (A Nutshell handbook),Donnalyn Frey,O'Reilly,3.0
1,'A Hell of a Place to Lose a Cow': An American...,Tim Brookes,National Geographic,1.7
2,"10,000 dreams interpreted: A dictionary of dreams",Gustavus Hindman Miller,Barnes &amp; Nobles Books,6.958333
3,101 Famous Poems,Roy J. Cook,McGraw-Hill/Contemporary Books,3.111111
4,15 Houseplants Even You Can't Kill,Joe Elder,Berkley Pub Group,0.0
5,158 POUND MARRIAGE,John Irving,Pocket,2.416667
6,1700: Scenes from London Life,Maureen Waller,Four Walls Eight Windows,6.25
7,1921 : The Great Novel of the Irish Civil War ...,Morgan Llywelyn,Forge Books,1.0
8,2001. Odyssee im Weltraum.,Arthur C. Clarke,Heyne,4.5
9,2010: Odyssey Two,Arthur C. Clarke,Del Rey Books,1.360759
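Averaging the per-edition means gives every edition the same weight, regardless of how many ratings it received. The hypothetical example below contrasts this with a pooled mean over all ratings; which of the two the assignment intends is a matter of interpretation, and the data is invented.

```python
import pandas as pd

editions = pd.DataFrame({'ISBN': ['1', '2'], 'Book-Title': ['Dune', 'Dune']})
ratings = pd.DataFrame({
    'ISBN': ['1', '1', '1', '2'],
    'Book-Rating': [8, 8, 8, 0],
})

# Approach used above: mean per edition, then mean of those means.
per_edition = ratings.groupby('ISBN')['Book-Rating'].mean()
mean_of_means = per_edition.mean()

# Alternative: pool all ratings of all editions, then take a single mean.
pooled = pd.merge(editions, ratings, on='ISBN')['Book-Rating'].mean()

print(mean_of_means, pooled)
```

Here the single rating of the second edition pulls the mean of means down to 4.0, while the pooled mean is 6.0.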


## 6. For each book in `merged books` compute the minimum and maximum of the average ratings over all corresponding books in the `books` dataset.



In [23]:
merging_book_ratings.groupby(['Book-Title', 'Book-Author', 'Publisher'], as_index=False).agg({'Average_Rating' : ['min', 'max']}).head(10)

Unnamed: 0_level_0,Book-Title,Book-Author,Publisher,Average_Rating,Average_Rating
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,min,max
0,!%@ (A Nutshell handbook),Donnalyn Frey,O'Reilly,0.0,6.0
1,'A Hell of a Place to Lose a Cow': An American...,Tim Brookes,National Geographic,0.0,3.4
2,"10,000 dreams interpreted: A dictionary of dreams",Gustavus Hindman Miller,Barnes &amp; Nobles Books,6.666667,7.25
3,101 Famous Poems,Roy J. Cook,McGraw-Hill/Contemporary Books,0.0,5.0
4,15 Houseplants Even You Can't Kill,Joe Elder,Berkley Pub Group,0.0,0.0
5,158 POUND MARRIAGE,John Irving,Pocket,1.333333,3.5
6,1700: Scenes from London Life,Maureen Waller,Four Walls Eight Windows,4.5,8.0
7,1921 : The Great Novel of the Irish Civil War ...,Morgan Llywelyn,Forge Books,0.0,2.0
8,2001. Odyssee im Weltraum.,Arthur C. Clarke,Heyne,0.0,9.0
9,2010: Odyssey Two,Arthur C. Clarke,Del Rey Books,0.0,2.721519


Using the dataset from the previous point, we group by title, author and publisher and compute the minimum and maximum of the per-edition average rating with the `agg` function.

## 7. For each book in `goodbooks`, compute the list of its authors. Assuming that the number of reviews with a text (column `work_text_reviews_count`) is split equally among all authors, find for each author the total number of reviews with a text. We will call this quantity the *shared number of reviews with a text*.

We create a new column called `authors_splitted`, obtained with the `split` function, which contains the authors of each book as a list.
We then add the column `shared_number_of_reviews_with_a_text`, computed by dividing `work_text_reviews_count` by the number of authors of each book (the length of the `authors_splitted` list).

Next we build a new dataset with one row for each book-author pair in the `authors_splitted` list. Finally we group by author and sum to obtain the `shared_number_of_reviews_with_a_text` of each writer.

In [24]:
goodbooks['authors_splitted'] = goodbooks['authors'].str.split(',')
goodbooks['shared_number_of_reviews_with_a_text'] = round(goodbooks['work_text_reviews_count'] / goodbooks['authors_splitted'].apply(len), 2)

goodbooks_exploded = goodbooks.explode('authors_splitted')
goodbooks_exploded['authors_splitted'] = goodbooks_exploded['authors_splitted'].str.strip() # strip whitespace around author names
goodbooks_exploded.groupby(['authors_splitted'], as_index=False)['shared_number_of_reviews_with_a_text'].sum().head(10)

Unnamed: 0,authors_splitted,shared_number_of_reviews_with_a_text
0,Alan R. Clarke,27890.5
1,Aldous Huxley,20095.0
2,Alice Sebold,36642.0
3,Anne Frank,6941.67
4,Antoine de Saint-ExupÃ©ry,6134.25
5,Arthur Golden,25605.0
6,Audrey Niffenegger,43382.0
7,B.M. Mooyaart-Doubleday,6941.67
8,Bernard Knox,1620.2
9,Bram Stoker,5754.33
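The split, share and explode steps can be followed on a two-book toy frame. All values and the short column names `author_list` and `shared` are made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Book A', 'Book B'],
    'authors': ['Ann Lee, Bob Roy', 'Ann Lee'],
    'work_text_reviews_count': [100, 30],
})

# One list of authors per book, and an equal share of the reviews per author.
toy['author_list'] = toy['authors'].str.split(',')
toy['shared'] = toy['work_text_reviews_count'] / toy['author_list'].str.len()

# One row per (book, author) pair; the share is repeated on each row.
exploded = toy.explode('author_list')
exploded['author_list'] = exploded['author_list'].str.strip()

totals = exploded.groupby('author_list')['shared'].sum()
print(totals)
```

Ann Lee receives 50 reviews from the co-authored book plus all 30 from her own, for a total of 80; Bob Roy receives 50.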


## 8. For each year of publication, determine the author that has the largest value of the shared number of reviews with a text.

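We first sum `shared_number_of_reviews_with_a_text` for each pair of publication year and author. For each year we then keep the rows whose value equals the yearly maximum, so authors tied for first place are all retained.

Finally we create another temporary dataset called `authors_highest_reviews`, so that `set_index` gives a more compact view for the years with tied authors.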

In [25]:
temp = goodbooks_exploded.groupby(['original_publication_year', 'authors_splitted'], as_index=False)['shared_number_of_reviews_with_a_text'].sum()
idx = temp.groupby(['original_publication_year'])['shared_number_of_reviews_with_a_text'].transform('max') == temp['shared_number_of_reviews_with_a_text']

authors_highest_reviews = temp[idx] [['original_publication_year', 'shared_number_of_reviews_with_a_text', 'authors_splitted']]
authors_highest_reviews.set_index(['original_publication_year', 'shared_number_of_reviews_with_a_text', 'authors_splitted'])

original_publication_year,shared_number_of_reviews_with_a_text,authors_splitted
-720,1620.2,Bernard Knox
-720,1620.2,E.V. Rieu
-720,1620.2,FrÃ©dÃ©ric Mugler
-720,1620.2,Homer
-720,1620.2,Robert Fagles
...,...,...
2009,88538.0,Suzanne Collins
2010,96274.0,Suzanne Collins
2011,103489.0,E.L. James
2012,140739.0,John Green
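The comparison against a group-wise `transform('max')` keeps ties, which a plain `idxmax` would not. A minimal sketch on invented data, with hypothetical column names `year`, `author` and `shared`:

```python
import pandas as pd

toy = pd.DataFrame({
    'year': [2000, 2000, 2000, 2001],
    'author': ['A', 'B', 'C', 'A'],
    'shared': [10.0, 25.0, 25.0, 5.0],
})

# transform('max') broadcasts each year's maximum back onto every row of that year.
is_top = toy['shared'] == toy.groupby('year')['shared'].transform('max')
top_with_ties = toy[is_top]

# idxmax returns a single row label per year, so only one of the tied authors is kept.
top_single = toy.loc[toy.groupby('year')['shared'].idxmax()]

print(top_with_ties)
print(top_single)
```

For the year 2000 the first approach returns both `B` and `C`, while `idxmax` returns only the first of them.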


## 9. Assuming that there are no errors in the ISBN fields, find the books in both datasets, and compute the difference of average rating according to the ratings and the goodratings datasets

We rescale the ratings from the Books dataset because they are on a 0-10 scale, whereas the Goodbooks ratings go up to 5.

In [26]:
print('book_ratings:   ', book_ratings['Book-Rating'].max(), book_ratings['Book-Rating'].min(), book_ratings['Book-Rating'].count())
print('goodbooks_rating:', goodbooks_ratings['rating'].max(), goodbooks_ratings['rating'].min(), goodbooks_ratings['rating'].count())

book_ratings:    10 0 1149780
goodbooks_rating: 5 2 99


In [27]:
books_goodbooks = pd.merge(books_avg, goodbooks, left_on = 'ISBN', right_on = 'isbn')

books_goodbooks['Book_Average_Rating'] = books_goodbooks['Book_Average_Rating'] / 2
books_goodbooks['Difference'] = books_goodbooks['Book_Average_Rating'] - books_goodbooks['average_rating']

books_goodbooks.rename(columns={'average_rating': 'Goodbook_Average_Rating'}, inplace=True)
books_goodbooks[['ISBN', 'Book-Title', 'Book_Average_Rating', 'Goodbook_Average_Rating', 'Difference']]

Unnamed: 0,ISBN,Book-Title,Book_Average_Rating,Goodbook_Average_Rating,Difference
0,014028009X,Bridget Jones's Diary,1.875926,3.75,-1.874074
1,043965548X,Harry Potter and the Prisoner of Azkaban (Harr...,1.766667,4.53,-2.763333
2,1400032717,The Curious Incident of the Dog in the Night-T...,2.406593,3.85,-1.443407
3,1594480001,The Kite Runner,1.2,4.26,-3.06
4,014038572X,The Outsiders (Now in Speak!),2.230337,4.06,-1.829663


## 10. Split the users dataset according to the age. One dataset contains the users with unknown age, one with age 0-14, one with age 15-24, one with age 25-34, and so on.

We start by looking at how the ages of the users in the `users` dataset are distributed, grouping them into 10-year age intervals. To do so we use `groupby` on the `Age` field, suitably divided into intervals with the `cut` function.
The bin edges were obtained with the NumPy `arange` function, which returns an array of evenly spaced values.
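As a small illustration of how `pd.cut` assigns values to these right-closed intervals, the sketch below bins a few made-up ages. The `include_lowest=True` variant is an assumption shown only for comparison; it is not used in the notebook.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0, 14, 15, 24, 25, 104, np.nan])
edges = np.append(0, np.arange(14, 105, 10))  # 0, 14, 24, ..., 104

# Default: intervals are open on the left and closed on the right, e.g. (0, 14].
default_bins = pd.cut(ages, edges)

# include_lowest=True also places the lowest edge (age 0) in the first interval.
with_lowest = pd.cut(ages, edges, include_lowest=True)

print(pd.DataFrame({'age': ages, 'default': default_bins, 'with_lowest': with_lowest}))
```

With the default settings an age of exactly 0 falls outside every bin and is reported as missing, just like an unknown age.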

In [28]:
interval = np.arange(14, 180, 10)
users['Age_bin'] = pd.cut(users['Age'], np.append(0, interval))

distrib_eta = users.groupby('Age_bin').count() [['User-ID']]
distrib_eta.rename(columns={'User-ID':'Count'}, inplace=True)
distrib_eta

Unnamed: 0_level_0,Count
Age_bin,Unnamed: 1_level_1
"(0, 14]",3897
"(14, 24]",40001
"(24, 34]",50767
"(34, 44]",32690
"(44, 54]",23152
"(54, 64]",12493
"(64, 74]",3596
"(74, 84]",615
"(84, 94]",83
"(94, 104]",278


Since the counts for older ages decrease up to the `(84, 94]` interval, we take this as the last reliable interval. We therefore create a dictionary of dataframes, one for each age bracket from 0 to 94. Users whose age is missing or falls outside these brackets (for example above 94) go into a dataframe called `altre_eta`.

In [29]:
interval = np.arange(14, 95, 10)
users['Age_bin'] = pd.cut(users['Age'], np.append(0, interval)).astype(str)
users['Age_bin'] = users['Age_bin'].replace('nan', 'altre_eta')
eta = users.groupby('Age_bin')

users_dict = {}

for bin_eta in eta.groups.keys():
    users_dict[bin_eta] = eta.get_group(bin_eta)

users_dict['(14, 24]'].set_index('User-ID')

Unnamed: 0_level_0,Location,Age,City,Region,Country,Age_bin
User-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2,"stockton, california, usa",18,stockton,california,usa,"(14, 24]"
4,"porto, v.n.gaia, portugal",17,porto,v.n.gaia,portugal,"(14, 24]"
20,"langhorne, pennsylvania, usa",19,langhorne,pennsylvania,usa,"(14, 24]"
24,"cologne, nrw, germany",19,cologne,nrw,germany,"(14, 24]"
28,"freiburg, baden-wuerttemberg, germany",24,freiburg,baden-wuerttemberg,germany,"(14, 24]"
...,...,...,...,...,...,...
278835,"karachi, sindh, pakistan",18,karachi,sindh,pakistan,"(14, 24]"
278838,"massillon, ohio, usa",15,massillon,ohio,usa,"(14, 24]"
278846,"toronto, ontario, canada",23,toronto,ontario,canada,"(14, 24]"
278849,"georgetown, ontario, canada",23,georgetown,ontario,canada,"(14, 24]"


We check that the number of users in the `users` dataframe matches the number of users in `users_dict`.

In [30]:
print(pd.concat(users_dict).shape)
print(users.shape)

(278858, 7)
(278858, 7)


## 11. Find the books that appear only in the goodbooks datasets.

In [31]:
goodbooks[~goodbooks['isbn'].isin(books_goodbooks['ISBN'])].head(10) [['isbn', 'original_title']]

Unnamed: 0,isbn,original_title
0,439023483,The Hunger Games
1,439554934,Harry Potter and the Philosopher's Stone
2,316015849,Twilight
3,61120081,To Kill a Mockingbird
4,743273567,The Great Gatsby
5,525478817,The Fault in Our Stars
6,618260307,The Hobbit or There and Back Again
7,316769177,The Catcher in the Rye
8,1416524797,Angels & Demons
9,679783261,Pride and Prejudice
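Several `isbn` values in the output above have fewer than ten characters. This suggests (an assumption, not something verified in this notebook) that a leading zero may have been lost, in which case they could not match the ten-character `ISBN` values of the Books dataset. A hedged sketch of left-padding such values before comparing:

```python
import pandas as pd

# Values copied from the outputs above; `padded` is a hypothetical helper column.
isbn = pd.Series(['439023483', '61120081', '014038572X', '1416524797'])

# Left-pad with zeros up to ten characters; longer or complete values are unchanged.
padded = isbn.str.zfill(10)
print(padded.tolist())
```

Whether padding is appropriate depends on how the source file was produced, so the result should be checked against known ISBNs before relying on it.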


## 12. Assuming that each pair (author, title) identifies a book, for each book find the number of times it appears in the books dataset. Which books appear the most times?

In [32]:
books_grouped = books.groupby(['Book-Title', 'Book-Author'], as_index=False).agg(Count = ('ISBN', 'count'))
books_grouped[books_grouped['Count'] == books_grouped['Count'].max()]

Unnamed: 0,Book-Title,Book-Author,Count
114024,Little Women,Louisa May Alcott,21


## 13. Find the author with the highest average rating according to the goodbooks datasets.

In [33]:
temp = goodbooks_exploded.groupby('authors_splitted', as_index=False)['average_rating'].mean()
temp[temp['average_rating'] == temp['average_rating'].max()]

Unnamed: 0,authors_splitted,average_rating
91,Rufus Beck,4.53
