# Python notebook for dataset collection and pre-processing from multiple sources

----

## Initial idea

Many media sites combine a social network of users with a record of each user's transactions (books, music, and so on).
To predict the next "purchase" an account will make, it is useful to treat that account's history as a sequence and fit a sequential model to it.

However, many existing models ignore the interactions and dependencies encoded in the social network structure. With coupled hidden Markov models (HMMs), we can model the dependencies between the hidden states of different users.
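
As a rough sketch of what "coupled" means here, assume $U$ users, each with a hidden state $s_t^{(u)}$ and an observed transaction $o_t^{(u)}$ at step $t$. The notation, and the choice to condition on every user's previous state rather than only on a user's neighbours in the network, are assumptions made for illustration and are not fixed anywhere in this notebook:

```latex
P(S, O) =
\prod_{u=1}^{U} P\big(s_1^{(u)}\big)\, P\big(o_1^{(u)} \mid s_1^{(u)}\big)
\prod_{t=2}^{T} \prod_{u=1}^{U}
P\big(s_t^{(u)} \mid s_{t-1}^{(1)}, \dots, s_{t-1}^{(U)}\big)\,
P\big(o_t^{(u)} \mid s_t^{(u)}\big)
```

A separate, ordinary HMM per user is the special case in which each transition depends only on that user's own previous state $s_{t-1}^{(u)}$.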

## Step 1. Dataset gathering

### a. Source 1 : Goodreads

In [1]:
from goodreads import client
gc = client.GoodreadsClient('SlgLqMjiphasqekm1RDPw', 'yMERrlbPngTd7B6BXNluEBjUCkZ3o7fNeB8omF0cMx8')
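
Hard-coding the key and secret in a shared notebook exposes them to anyone who can read it. One common alternative, sketched below, is to read them from environment variables. The variable names `GOODREADS_KEY` and `GOODREADS_SECRET` and the helper `load_credentials` are hypothetical choices for this sketch, not part of the `goodreads` package.

```python
import os


def load_credentials(env=None):
    """Return (key, secret) from a mapping, defaulting to os.environ.

    The variable names below are hypothetical; pick whatever suits your setup.
    """
    env = os.environ if env is None else env
    try:
        return env['GOODREADS_KEY'], env['GOODREADS_SECRET']
    except KeyError as missing:
        # Report which variable is absent without echoing any secret value
        raise RuntimeError(
            'Missing environment variable: %s' % missing.args[0]
        ) from None


# Usage with the client from the cell above (requires both variables to be set):
# gc = client.GoodreadsClient(*load_credentials())
```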

### Testing the API

In [10]:
# Accessing books on the website
book = gc.book(1)
print(dir(book))
print(book.title)
print(book.authors)
print(book.average_rating)

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_book_dict', '_client', 'authors', 'average_rating', 'description', 'edition_information', 'format', 'gid', 'image_url', 'is_ebook', 'isbn', 'isbn13', 'language_code', 'link', 'num_pages', 'popular_shelves', 'publication_date', 'publisher', 'rating_dist', 'ratings_count', 'reviews_widget', 'series_works', 'similar_books', 'small_image_url', 'text_reviews_count', 'title', 'work']
Harry Potter and the Half-Blood Prince (Harry Potter, #6)
[J.K. Rowling, Mary GrandPré]
4.55


In [11]:
print(book.popular_shelves)

[to-read, fantasy, favorites, currently-reading, young-adult, fiction, harry-potter, books-i-own, owned, ya, series, favourites, magic, childrens, owned-books, re-read, adventure, children, j-k-rowling, children-s, childhood, sci-fi-fantasy, all-time-favorites, audiobook, my-books, default, classics, audiobooks, reread, 5-stars, middle-grade, i-own, children-s-books, favorite-books, novels, favorite, kids, fantasy-sci-fi, my-library, ya-fantasy, paranormal, read-more-than-once, teen, english, urban-fantasy, books, british, witches, jk-rowling, audio, re-reads, library, read-in-2016, mystery, ya-fiction, read-in-2017, my-favorites, supernatural, own-it, novel, harry-potter-series, childrens-books, faves, young-adult-fiction, 2005, scifi-fantasy, kindle, favorite-series, made-me-cry, wizards, bookshelf, read-in-2018, my-bookshelf, juvenile, youth, all-time-favourites, read-in-2015, romance, favourite, to-buy, rereads, 5-star, childhood-favorites, ebook, shelfari-favorites, read-in-englis

In [12]:
print(book.rating_dist)

5:1325556|4:499853|3:145551|2:22837|1:8089|total:2001886
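
The `rating_dist` value is a single string, so it has to be split before it can be used as a feature. A minimal sketch, assuming the `stars:count|...|total:count` layout shown in the output above holds for every book (the helper names `parse_rating_dist` and `mean_rating` are made up for this example):

```python
def parse_rating_dist(rating_dist):
    """Split a 'stars:count|...|total:count' string into (counts, total)."""
    counts = {}
    total = None
    for field in rating_dist.split('|'):
        label, value = field.split(':')
        if label == 'total':
            total = int(value)
        else:
            counts[int(label)] = int(value)
    return counts, total


def mean_rating(counts):
    """Weighted mean of the star ratings."""
    n = sum(counts.values())
    return sum(stars * count for stars, count in counts.items()) / n


# The string printed by book.rating_dist above
dist = '5:1325556|4:499853|3:145551|2:22837|1:8089|total:2001886'
counts, total = parse_rating_dist(dist)
print(counts[5], total)               # 1325556 2001886
print(round(mean_rating(counts), 2))  # 4.55
```

The per-star counts sum to the reported total, and the weighted mean rounds to 4.55, which agrees with `book.average_rating` above.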


In [3]:
# Accessing users on the website
user = gc.user(5)
print(user.user_name)
print(dir(user))
print(user.list_groups())

elizabeth
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_client', '_user_dict', 'gid', 'image_url', 'link', 'list_groups', 'name', 'owned_books', 'reviews', 'shelves', 'small_image_url', 'user_name']
[Quotable Quotes, Southern California Events, Goodreads Author Outreach Project, Goodreads Feedback, Journalists Top Reads, Goodreads Librarians Group, Elizabeth's Child Development Reading Group, Books Of Your Life with Elizabeth, Classical music lovers, What's the Name of That Book???, Great African Reads, Stanford Book Club, Santa Monica Book Club, Silverlake Classics, 2008 Political Reading Checklist, Icelandophiles, Founders Bookclub: Elizabeth's Bookclub, Books I Loathed, Pride & Prejudice 2005 is a disgrace t

In [4]:
#print(user.owned_books())

In [5]:
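# Probe user IDs 1..49 and record which ones the API resolves
valid_user_names = []

for user_id in range(1, 50):
    try:
        # gc.user() returns a user object; it prints as the user name (see output)
        user_name = gc.user(user_id)
    except Exception:
        # The lookup failed for this ID, so mark it as invalid
        user_name = 'Invalid'

    valid_user_names.append(user_name)
    print('Added user with ID %d : %s' % (user_id, user_name))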

Added user with ID 1 : otis
Added user with ID 2 : 2
Added user with ID 3 : sickpea
Added user with ID 4 : 4
Added user with ID 5 : elizabeth
Added user with ID 6 : 6
Added user with ID 7 : sundeep
Added user with ID 8 : evanpon
Added user with ID 9 : 9
Added user with ID 10 : 10
Added user with ID 11 : Invalid
Added user with ID 12 : 12
Added user with ID 13 : 13
Added user with ID 14 : grebmorb
Added user with ID 15 : 15
Added user with ID 16 : keaka
Added user with ID 17 : Invalid
Added user with ID 18 : 18
Added user with ID 19 : benh
Added user with ID 20 : 20
Added user with ID 21 : zack415
Added user with ID 22 : 22
Added user with ID 23 : 23
Added user with ID 24 : 24
Added user with ID 25 : 25
Added user with ID 26 : douglasl9
Added user with ID 27 : 27
Added user with ID 28 : 28
Added user with ID 29 : 29
Added user with ID 30 : 30
Added user with ID 31 : 31
Added user with ID 32 : 32
Added user with ID 33 : 33
Added user with ID 34 : gregveen
Added user with ID 35 : msiliski

## Achieved

1. We can access a given user by numerical ID, for example `gc.user(1)`.
2. We can list the groups a user belongs to with `list_groups()`.
3. We can iterate over a range of user IDs and collect the results in a list.
4. IDs that fail to resolve are recorded as 'Invalid', so they can be skipped when accessing the groups of the remaining users.
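
The enumeration step can also be written so that failed IDs are kept apart from the users that resolved, which avoids mixing the string 'Invalid' with user objects in one list. This is a sketch rather than the notebook's own code: `collect_users` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `gc.user` so the example runs without network access.

```python
def collect_users(fetch_user, user_ids):
    """Return (found, failed): a dict of id -> user and a list of failed ids."""
    found, failed = {}, []
    for user_id in user_ids:
        try:
            found[user_id] = fetch_user(user_id)
        except Exception:
            # Any lookup error is treated as "no such user" in this sketch
            failed.append(user_id)
    return found, failed


def fake_fetch(user_id):
    """Stand-in for gc.user; fails for the two IDs that failed in the run above."""
    if user_id in (11, 17):
        raise LookupError('no such user')
    return 'user-%d' % user_id


found, failed = collect_users(fake_fetch, range(10, 19))
print(sorted(found))  # [10, 12, 13, 14, 15, 16, 18]
print(failed)         # [11, 17]
```

With the real client, the call would be `collect_users(gc.user, range(1, 50))`.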

## Next: get the books a user has read or owned

In [8]:
user = gc.user(1)
print(user.owned_books())

AttributeError: 'GoodreadsClient' object has no attribute 'session'

In [9]:
print(dir(gc))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'auth_user', 'authenticate', 'author', 'base_url', 'book', 'book_review_stats', 'client_key', 'client_secret', 'find_author', 'find_groups', 'group', 'list_comments', 'list_events', 'owned_book', 'query_dict', 'recent_reviews', 'request', 'request_oauth', 'review', 'search_books', 'user']
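
The traceback suggests that `owned_books()` goes through an OAuth-signed request, which expects a `session` attribute on the client. That attribute is absent from the `dir(gc)` listing above, while `authenticate` and `request_oauth` are present, so a plausible fix is to call `gc.authenticate()` before requesting owned books. The sketch below guards that call. The helper name `ensure_session` is hypothetical, and the assumptions that `authenticate()` sets `session` and accepts `access_token` and `access_token_secret` keyword arguments should be checked against the installed version of the `goodreads` package.

```python
def ensure_session(client, access_token=None, access_token_secret=None):
    """Call client.authenticate() once if no OAuth session exists yet."""
    if not hasattr(client, 'session'):
        # Assumed signature; the real authenticate() may prompt interactively
        client.authenticate(access_token=access_token,
                            access_token_secret=access_token_secret)
    return client


class StubClient:
    """Stand-in for GoodreadsClient so the sketch runs offline."""

    def __init__(self):
        self.calls = 0

    def authenticate(self, access_token=None, access_token_secret=None):
        self.calls += 1
        self.session = object()


# Calling the guard twice authenticates only once
stub = ensure_session(ensure_session(StubClient()))
print(stub.calls)  # 1
```

With the real client this would be `ensure_session(gc)` followed by `gc.user(1).owned_books()`, assuming the OAuth flow completes.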
