# Assignment

In this assignment we want to get comfortable with loading and manipulating data in Python. While future assignments will focus more on structured data that we can load into a `DataFrame` using `pandas`, this assignment focuses on semi-structured data: how we can "flatten" it and then load it into other formats. The objective is to see how data flows in Python from one object to another, and what advantages and disadvantages each object offers.

Let's read the `books.json` data set and display the first item in it.

In [None]:
import json
with open('../../data/books.json', encoding = 'utf-8') as f:
    books_dict = json.load(f)

from pprint import pprint
pprint(books_dict[0]) # print information for the first book

1. Write a program that goes through the entire data and extracts the following information:  <span style="color:red" float:right>[4 point]</span>

  - title of the book
  - name of the first author
  - name of the second author (if book has more than one author)
  - number of authors
  - ISBN
  - if the word "data" is in the book's description
  - the number of words in the book's description
  - the year the book was published

  Of course because JSON data doesn't necessarily enforce any sort of schema, we can't be sure that the information we are trying to extract exists for every book. For example, if the book only has one author, then there is no second author. So use `try` and `except` as you loop through every book and skip to the next item every time some information is missing.

  Store the extracted data in a list named `rows` whose elements are tuples, one tuple per book. For example, the first element of `rows` stores the tuple for the first book and should look like this:

        ('Unlocking Android', 'W. Frank Ableson', 'Charlie Collins', 3, '1933988673', True, 252, 2009)
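
As a minimal sketch of the missing-field handling described above, the snippet below flattens two made-up records (they are not taken from `books.json`). The helper name `flatten_record` and the sample values are hypothetical. It only illustrates handling each field separately, so that one missing value does not discard the rest of the record.

```python
# Hypothetical records, shaped loosely like the books data; not real entries.
sample_books = [
    {"title": "Book A", "authors": ["Ann Lee", "Bo Chen"], "isbn": "111"},
    {"title": "Book B", "authors": ["Cy Diaz"]},  # no ISBN, no second author
]


def flatten_record(book):
    """Return (title, first author, second author, author count, isbn)."""
    authors = book.get("authors", [])
    try:
        second_author = authors[1]
    except IndexError:  # fewer than two authors
        second_author = None
    try:
        isbn = book["isbn"]
    except KeyError:  # field absent from this record
        isbn = None
    first_author = authors[0] if authors else None
    return (book.get("title"), first_author, second_author, len(authors), isbn)


sample_rows = [flatten_record(book) for book in sample_books]
for row in sample_rows:
    print(row)
```

Whether a missing field should become `None` or cause the whole book to be skipped is a design choice; this sketch keeps the record and fills the gap with `None`.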

In [2]:
## I functionalized my approach to better drill down for potential (and actual) errors

from datetime import datetime

def _description_extract(long_desc:str):
    # case-sensitive substring check on the description
    data_in_desc = 'data' in long_desc
    len_of_desc = len(long_desc.split())
    return data_in_desc, len_of_desc

def _year_pub_extract(pub_dict:dict):
    try:
        # parse the full timestamp rather than assuming a midnight time
        date_pub = datetime.strptime(pub_dict['$date'],"%Y-%m-%dT%H:%M:%S.%f%z")
        year_pub = date_pub.year
    except (KeyError, ValueError):
        # missing '$date' key or an unparseable date string
        year_pub = None
    return year_pub

def _author_extractions(auth_list):
    if len(auth_list) > 1:
        auth01 = auth_list[0]
        auth02 = auth_list[1]
        auth_c = len(auth_list)
    elif len(auth_list) == 1:
        auth01 = auth_list[0]
        auth02 = None
        auth_c = len(auth_list)
    else:
        auth01, auth02, auth_c = None, None, None
    return auth01, auth02, auth_c

def extract_details(books_dict):
    rows = list()
    for book in books_dict:
        # re-init book details
        details = dict(title=None, auth01=None, auth02=None, auth_count=None, isbn=None, data_exists=None, word_count=None, year_pub=None)
        
        # generally guaranteed
        details['title'] = book['title']
        
        # isbn
        try:
            details['isbn'] = book['isbn']
        except KeyError:
            details['isbn'] = None
        
        # date
        try:
            details['year_pub'] = _year_pub_extract(book['publishedDate'])
        except KeyError:
            details['year_pub'] = None
        
        # description
        try:
            details['data_exists'], details['word_count'] = _description_extract(book['longDescription'])
        except KeyError:
            details['data_exists'] = False
            details['word_count'] = 0
        
        # authors
        try:
            details['auth01'], details['auth02'], details['auth_count'] = _author_extractions(book['authors'])
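        except KeyError:
            # unpacking needs three values; a bare None would raise TypeError
            details['auth01'], details['auth02'], details['auth_count'] = None, None, None
        
        # dicts preserve insertion order, so the values match the column order above
        rows.append(tuple(details.values()))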

    return rows
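
The date handling above can be checked in isolation. The sketch below assumes timestamps shaped like the sample string, which is inferred from the format string in the solution rather than copied from `books.json`; the helper name `year_or_none` is hypothetical.

```python
from datetime import datetime

# Assumed timestamp shape, inferred from the format string used above.
sample_stamp = "2009-04-01T12:30:00.000-0700"


def year_or_none(stamp):
    """Return the year of an ISO-like timestamp, or None if it cannot be parsed."""
    try:
        return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f%z").year
    except (TypeError, ValueError):  # not a string, or not in the expected format
        return None


print(year_or_none(sample_stamp))
print(year_or_none("not a date"))
```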

In [None]:
rows = extract_details(books_dict)
print(f"Extracted {len(rows)} rows from JSON data\n---")
for row in rows[:5]:
    print(row)

2. Save the content of `rows` in a SQL-like table using `sqlite3`, and choose the appropriate column types. <span style="color:red" float:right>[2 point]</span> 

  As your column names use the following:

  - `title`
  - `author_1`
  - `author_2`
  - `num_authors`
  - `isbn`
  - `has_data`
  - `desc_len`
  - `year_published`

In [4]:
import sqlite3

db = sqlite3.connect('book_extractions.db')

sql = r"""
CREATE TABLE IF NOT EXISTS book_details (`title` TEXT, `author_1` TEXT, `author_2` TEXT, `num_authors` INT, `isbn` TEXT, `has_data` INT, `desc_len` INT, `year_published` INT)
"""

crs = db.execute(sql)
crs.executemany("INSERT INTO book_details VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
db.commit()
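
One thing to watch with the cell above: each rerun appends the same rows again, because `CREATE TABLE IF NOT EXISTS` keeps the existing table and the database file persists between runs. The sketch below shows one possible way to make the load repeatable. It uses an in-memory database and made-up rows; the table name `demo_books`, the helper `load_rows`, and the data are all hypothetical.

```python
import sqlite3

# Hypothetical rows in the same 8-column layout as `rows`.
demo_rows = [
    ("Book A", "Ann Lee", "Bo Chen", 2, "111", True, 120, 2010),
    ("Book B", "Cy Diaz", None, 1, None, False, 0, None),
]

conn = sqlite3.connect(":memory:")  # in-memory database, discarded on close


def load_rows(connection, rows):
    """Recreate the table and insert rows, so reruns do not duplicate data."""
    with connection:  # commits on success, rolls back on error
        connection.execute("DROP TABLE IF EXISTS demo_books")
        connection.execute(
            "CREATE TABLE demo_books ("
            "title TEXT, author_1 TEXT, author_2 TEXT, num_authors INTEGER, "
            "isbn TEXT, has_data INTEGER, desc_len INTEGER, year_published INTEGER)"
        )
        connection.executemany(
            "INSERT INTO demo_books VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows
        )


load_rows(conn, demo_rows)
load_rows(conn, demo_rows)  # second run replaces the rows rather than appending
row_count = conn.execute("SELECT COUNT(*) FROM demo_books").fetchone()[0]
conn.close()
print(row_count)
```

Dropping and recreating the table is only appropriate when the table is fully derived from `rows`; for data that must be kept, a uniqueness constraint would be a safer choice.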


3. Write a SQL query against the table to show all books that (1) contain the word "data" and (2) have more than 3 authors. Store the result of the query in an object called `books_table`, then close the connection. <span style="color:red" float:right>[2 point]</span>

In [None]:
# SQLite stores booleans as integers, so compare against 1 (the TRUE keyword needs SQLite 3.23+)
sql = r"""SELECT * FROM book_details WHERE `has_data` = 1 AND `num_authors` > 3"""
crs.execute(sql)
books_table = crs.fetchall()
db.close()

print(f"Returned {len(books_table)} rows from SQLite DB\n---")
for row in books_table[:5]:
    print(row)
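
SQLite has no separate Boolean type: a Python `True` is stored as the integer `1`, which is what comes back from a query. The sketch below illustrates this, together with a `?` placeholder for the author threshold, on a small in-memory table. The table `demo_books` and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_books (title TEXT, num_authors INTEGER, has_data INTEGER)"
)
conn.executemany(
    "INSERT INTO demo_books VALUES (?, ?, ?)",
    [("Book A", 4, True), ("Book B", 5, False), ("Book C", 2, True)],
)

min_authors = 3  # threshold passed as a bound parameter, not formatted into the SQL
cursor = conn.execute(
    "SELECT title, num_authors, has_data FROM demo_books "
    "WHERE has_data = 1 AND num_authors > ?",
    (min_authors,),
)
matches = cursor.fetchall()
conn.close()
print(matches)
```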

SQL tables are not the only way, and definitely not the most straightforward way, to store and manipulate data in Python. A more popular approach among data scientists is to use the `pandas` library to create a `DataFrame`. The library provides a lot of functionality that simplifies common data-analysis tasks.

4. Read the data from the above query into a `DataFrame` and call it `books_df`. HINT: Use `pd.DataFrame` and specify meaningful column names to use for the columns. <span style="color:red" float:right>[1 point]</span>

In [6]:
import pandas as pd

In [None]:
books_df = pd.DataFrame(books_table,columns=['title','author_01','author_02','number of authors','isbn','has data in description', 'description word count','year published'])
books_df
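
As an alternative sketch, `pandas` can run the query itself through `pd.read_sql_query`, which takes the column names from the table. This needs an open connection, so in the assignment it would have to happen before `db.close()`. The table `demo_books` and its rows below are hypothetical.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_books (title TEXT, num_authors INTEGER, has_data INTEGER)"
)
conn.executemany(
    "INSERT INTO demo_books VALUES (?, ?, ?)",
    [("Book A", 4, 1), ("Book B", 5, 0)],
)

# Column names come from the table, so no `columns=` argument is needed.
demo_df = pd.read_sql_query(
    "SELECT * FROM demo_books WHERE has_data = 1 AND num_authors > 3", conn
)
conn.close()
print(demo_df)
```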

5. Display the first few rows of a `DataFrame` by calling its `head` method. <span style="color:red" float:right>[1 point]</span>

In [None]:
books_df.head()

Remember how earlier we said that a `DataFrame` is built on top of `numpy` arrays? Another way of saying it is that a `DataFrame` is an **abstraction** on top of `numpy` arrays: i.e. a `DataFrame` is a more **high-level** object than a `numpy` array. 

6. Use the `values` attribute of your `DataFrame` to convert it into a `numpy` array and display the first 3 elements of the array. <span style="color:red" float:right>[1 point]</span>

In [None]:
print(f"The `values` object of a `DataFrame` is already a : {type(books_df.values)}\n---\nAnd the first 3 elements are:\n{books_df.values[:3]}")

Now you can judge which object is more "user-friendly". That's one of the things that abstractions allow us to do: build more user-friendly (abstract) objects from less user-friendly (but more fundamental) objects.
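
To make the comparison concrete, here is a small sketch with a hypothetical three-column `DataFrame`. With the `DataFrame`, a column is selected by name; with the underlying array, you have to remember its position, and mixed column types collapse into a generic `object` array.

```python
import pandas as pd

# Hypothetical data, not taken from the books table.
demo_df = pd.DataFrame(
    [("Book A", 4, 2010), ("Book B", 5, 2012)],
    columns=["title", "num_authors", "year_published"],
)
demo_array = demo_df.values  # mixed column types give an object-dtype array

titles_from_df = demo_df["title"].tolist()      # select by column name
titles_from_array = demo_array[:, 0].tolist()   # select by column position
print(demo_array.dtype)
print(titles_from_df == titles_from_array)
```

The `pandas` documentation recommends `DataFrame.to_numpy()` over `values` for new code; both return a `numpy` array here.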

# End of assignment