# The nature of data



## Review of Python / Jupyter basics

Before we begin our Python tutorial, here's one important note about code cells in Jupyter notebooks: if the last line of a code cell is an expression, Jupyter displays its value automatically, so we don't need to call `print`.

In [1]:
a = 1
a # this line is the reason the cell prints 1 in the output

1

But if more than one line should generate output, we need to use `print`; otherwise only the **last** line will produce output.

In [2]:
a, b = 1, 2
a # will not produce output
b # will produce output

2

In [3]:
a, b = 1, 2
print(a) # will produce output
print(b) # will produce output

1
2


If you want to produce output for something that is running inside of a loop, you also need to use `print`.

In [4]:
for i in range(5):
    print(i)

0
1
2
3
4


## Review of basic data types in Python

Let's begin by learning about the basic building blocks of Python data structures: **integers, floats, booleans, and strings**.

In [5]:
30 + 10

40

In [6]:
type(40)

int

In [7]:
4.5

4.5

In [8]:
type(40.0)

float

In [9]:
int(40.0)

40

In [10]:
float('nan')

nan

In [11]:
float('inf')

inf

In [12]:
1 == 0

False

In [13]:
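1 is 1 # identity check; newer Python versions emit a SyntaxWarning here, so prefer `==` for comparing values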


True

In [14]:
1 in [1, 0]

True

In [15]:
not(1 == 0)

True

In [16]:
type(1 == 0)

bool

In [17]:
1.0 + False

1.0

In [18]:
1 + 1.0

2.0

In [19]:
3 - True # this is the kind of thing that you shouldn't do, even if you can

2



## Start Here

### Exercise (1 minute)

Verify that `float('inf')` indeed behaves like infinity. Mathematically, what should happen if you multiply infinity by 2? What about dividing infinity by itself?

In [20]:
float('inf') * 2

inf
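One way to check both questions at once is sketched below. The variable names are only illustrative; `math.isinf` and `math.isnan` are standard-library helpers for testing the results.

```python
import math

inf = float('inf')  # illustrative name for positive infinity

doubled = inf * 2   # infinity times a positive number is still infinity
ratio = inf / inf   # mathematically undefined, so Python returns nan

print(doubled, math.isinf(doubled))
print(ratio, math.isnan(ratio))
```

Note that `nan` is never equal to anything, including itself, which is why `math.isnan` is used instead of `==`.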

### End of exercise

## Python has dynamic typing

What sets dynamically typed languages like Python apart from **statically typed** languages like Java is that a variable can change type at execution time. In statically typed languages we declare the type of each variable ahead of time, and the compiler enforces it. In Python we can do things like this:

In [21]:
a = 3 # a is an integer
print("a is", a, "and is of type", type(a))
a = a/3
print("a is", a, "and is of type", type(a))

a is 3 and is of type <class 'int'>
a is 1.0 and is of type <class 'float'>


## Python conditional statements

We can sometimes write conditional statements in a concise way. But be careful doing this, because as popular as **one-liners** are in Python, it's important to write code that is also easy to understand, either to you six months from now, or to someone else reading your code. Here's an example:

In [22]:
statement = (2 > 4) # See what happens if you flip ">"

print(statement * 5 + (1 - statement)*3) # hard to understand

if statement == True: # easier to understand but many lines
    print(5)
else:
    print(3)

if statement: # turns out we don't need the `== True` part, it's implied
    print(5)
else:
    print(3)

if not statement: # `not` flips the condition, so the two branches swap places
    print(3)
else:
    print(5)

print(5 if statement else 3) # both concise and easy to understand, but a little risky

3
3
3
3
3


Dynamic typing is both a convenience and an inconvenience. It is a convenience because we can write very short code to do what we need. In data science, where we want to quickly analyze data, create summaries and visualizations, train and test models, and so on, this convenience really pays off. The inconvenience is that we are more likely to introduce bugs that we do not catch until later, so we need to be careful and develop best practices when we code in Python.
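As a small illustration of a bug that dynamic typing lets through silently, consider the sketch below. The function `double` is hypothetical and exists only for this example.

```python
def double(x):
    # No type is declared, so Python accepts any object that supports `*`
    return x * 2

print(double(21))    # a number is doubled: 42
print(double("21"))  # a string is repeated instead: 2121
```

Neither call raises an error, so if the string `"21"` arrived by accident (for example, from a text file), the mistake would only surface later in the analysis.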

## Re-running code in Jupyter cells

In [23]:
s = "abc"

One such bug can occur inside a Jupyter notebook, where we are not always careful and might execute a cell twice by accident. Each extra run of the cell below appends "def" to `s` again. To avoid this kind of problem, we should combine the above and below cells into one.

In [24]:
s = s + "def"
s

'abcdef'
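The difference between the two layouts can be simulated in plain Python. In this sketch each loop iteration stands in for one run of a cell; the variable names are illustrative.

```python
# Separate cells: initialize once, then run the update cell twice
split_result = "abc"
for _ in range(2):
    split_result = split_result + "def"
print(split_result)     # abcdefdef

# Combined cell: initialization and update are re-run together
for _ in range(2):
    combined_result = "abc"
    combined_result = combined_result + "def"
print(combined_result)  # abcdef
```

With the combined layout, the result is the same no matter how many times the cell is executed.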

## Operator overloading

In a way we can say that dynamic languages **try to be smart when we are being lazy** by guessing what we mean when we type some code. Of course there is a limit to that and sometimes we need to be a little more explicit, such as here when we need to coerce a string into an integer or vice versa for the operation to make sense.

In [25]:
int('1') + 2

3

In [26]:
str(1) + '2'

'12'
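To see where Python stops guessing, the sketch below applies `+` to several types. The same operator means addition for numbers and concatenation for strings and lists, but mixing a string with an integer raises a `TypeError`.

```python
print(1 + 2)      # numeric addition: 3
print("1" + "2")  # string concatenation: 12
print([1] + [2])  # list concatenation: [1, 2]

try:
    "1" + 2       # ambiguous, so Python refuses to guess
except TypeError as err:
    print("TypeError:", err)
```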

## Python's Built-in Data Structures

From these basic building blocks, we create the next set of fundamental objects: **lists, tuples, sets and dictionaries**.

In [27]:
fruits = ["apple", "pear", "pitaya", "zapote"] # lists
print(fruits[0]) # first element
print(fruits[-1]) # last element
print(fruits[1:3]) # 2nd and 3rd element

apple
zapote
['pear', 'pitaya']


Lists are **mutable** objects, meaning that we can change their elements, add new ones, or drop elements from them.

In [28]:
fruits[-1] = "nispero" # lists are mutable
print(fruits)

['apple', 'pear', 'pitaya', 'nispero']


In [29]:
fruits = fruits + ["guanabana", "papaya"] # adding something to a list is easy
print(fruits)

['apple', 'pear', 'pitaya', 'nispero', 'guanabana', 'papaya']


In [30]:
fruits = fruits + ["papaya"]
print(fruits)

['apple', 'pear', 'pitaya', 'nispero', 'guanabana', 'papaya', 'papaya']


In [31]:
del fruits[:2] # dropping something from a list is easy
print(fruits)

['pitaya', 'nispero', 'guanabana', 'papaya', 'papaya']


We can have lists of booleans, strings, or even lists of mixed data types. Basically, lists are very **flexible** objects, and we use them when we need flexibility. Just know that there's no such thing as a free lunch: when you have more flexibility, you usually have less efficiency.

Notice what happens when we convert a string into a `list`:

In [32]:
list("abc")

['a', 'b', 'c']

We can join things back using `join`:

In [33]:
"-".join(list("abc")) # we can use join to join a list of strings

'a-b-c'

With a list of booleans, we can use `any` and `all` to check conditions. We later encounter similar versions of these functions that work on a `DataFrame`. They have the same name but slightly different functionality.

In [34]:
any([True, False, False])

True

In [35]:
all([True, False, False])

False

Finally, lists can be **nested** too, meaning we can have lists of lists.

In [36]:
nested_list = [[-1, 1, 9], [-2, 3, 4]]
print(nested_list)

[[-1, 1, 9], [-2, 3, 4]]


In [37]:
nested_list[1][0]

-2

There is a really neat way to create lists with Python using what's called **list comprehension**. It's really just a shortcut for creating a list very quickly and in a way that makes the code look easy to follow.

In [38]:
# Without List comprehension:
some_list = [] # initialize empty list
for i in range(10):
    if i % 3 == 1: # if i divided by 3 leaves a remainder of 1
        some_list += [i] # add i to the list, same as some_list = some_list + [i]

print(some_list)

[1, 4, 7]


The above snippet is valid, but there's a much easier way of doing it using a list comprehension:

In [39]:
# With List comprehension:
print([i for i in range(10) if i % 3 == 1])

[1, 4, 7]


Tuples look somewhat similar to lists, but they are much more rigid. Tuples are **immutable**, meaning that once they're created they can't be changed (unless you recreate them).

In [40]:
tup = tuple([2, 4, 12, "bla"])
print(tup)
tup = (2, 4, 6, "blabla")
print(tup)

(2, 4, 12, 'bla')
(2, 4, 6, 'blabla')


### Exercise (1 minute)

Try changing the tuple above by changing one of its elements and see what happens. What happens when trying to add tuples to each other? Compare that with lists.

In [41]:
# Add code here

### End of exercise

Tuples are handy when you know what information goes in each place and don't need the flexibility to change it later. Library functions often return tuples, so we need to know what they are and how they behave. For instance, the elements of a tuple cannot be reassigned, but mutable objects stored inside it can still change.

In [42]:
my_list = ['a', 'b']
my_tuple = (my_list, 3)
print('my_tuple is:', my_tuple)
my_list.append('X')
print('my_tuple changed to:', my_tuple)

my_tuple is: (['a', 'b'], 3)
my_tuple changed to: (['a', 'b', 'X'], 3)
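Since functions often return tuples, it helps to know how to unpack them. The sketch below uses the built-in `divmod`, which returns a `(quotient, remainder)` tuple; the variable names are illustrative.

```python
result = divmod(17, 5)         # built-in that returns a tuple
print(type(result), result)    # <class 'tuple'> (3, 2)

quotient, remainder = result   # unpack the tuple into two variables
print(quotient, remainder)     # 3 2
```

Unpacking relies on knowing what sits in each position, which is exactly the situation where a tuple is a good fit.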


We can also have **sets** in Python, which are similar to mathematical sets: elements in a set are unique and the order doesn't matter. Sets have certain **methods** that sound familiar if you remember sets in math, such as `difference`.

In [43]:
my_set = set([2, 2, "hello", "hello", 5])
print(my_set)
my_set = {2, 2, "hello", "hello", 5}
print(my_set)

{2, 5, 'hello'}
{2, 5, 'hello'}


In [44]:
my_set.difference(set([2, 4, "hello"]))

{5}
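The other familiar set operations work the same way. The sketch below uses two illustrative sets, `a` and `b`, with the same values as above; sets are unordered, so the printed order of elements may vary.

```python
a = {2, 5, "hello"}
b = {2, 4, "hello"}

print(a.difference(b))            # elements in a but not in b: {5}
print(b.difference(a))            # order matters for difference: {4}
print(a.intersection(b))          # elements in both sets
print(a.union(b))                 # elements in either set
print(a.symmetric_difference(b))  # elements in exactly one of the sets
```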

If you're not familiar with the term **method**, it refers to a function that belongs to an object and that you can call once the object is created. For example, a set has a method called `difference` so that you can take the difference of one set from another set. To call the method, you type the name of the object itself, followed by a period, and the name of the method. Tab completion can help you find out which methods an object has. Similarly, in more sophisticated editors like Visual Studio Code, we can use IntelliSense to quickly view the methods and attributes of an object and even pull up their definitions. Finally, we can also call the `dir` function in Python for this. In the example below, notice how we drop methods that start with a double underscore (called "dunder" by Pythonistas), as these are internal methods and we should generally avoid calling them directly. A deeper discussion of this is out of the scope of this lesson.

In [45]:
[s for s in dir(my_set) if s[:2] != '__']

['add',
 'clear',
 'copy',
 'difference',
 'difference_update',
 'discard',
 'intersection',
 'intersection_update',
 'isdisjoint',
 'issubset',
 'issuperset',
 'pop',
 'remove',
 'symmetric_difference',
 'symmetric_difference_update',
 'union',
 'update']

### Exercise (2 minutes)

Explore the methods and attributes associated with a list and a tuple. Just create a list, then type its name, followed by a dot, and press `TAB` to let auto-complete show you the methods. Then try the same thing for a tuple. Can you write code to see which methods or attributes (if any) the two objects have in common?

In [47]:
# Add code here
[o for o in dir(my_list) if o[:2] != '__']

['append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

### End of exercise

A Python dictionary is similar to a list, but instead of elements being ordered by their position in the list, the elements are **values** that are paired with **keys**, which makes a dictionary a collection of **key-value pairs**.

In [48]:
my_dict = {"a": 12, "b": "hello", "c": [4, 9], "d": {"one": 1, "two": 2}}
print(my_dict)

{'a': 12, 'b': 'hello', 'c': [4, 9], 'd': {'one': 1, 'two': 2}}


In [49]:
print('keys:', my_dict.keys())

keys: dict_keys(['a', 'b', 'c', 'd'])


In [50]:
print('values:', my_dict.values())
print("value for 'c':", my_dict['c'])

values: dict_values([12, 'hello', [4, 9], {'one': 1, 'two': 2}])
value for 'c': [4, 9]


In [51]:
del my_dict['a']
print(my_dict)

{'b': 'hello', 'c': [4, 9], 'd': {'one': 1, 'two': 2}}


In [52]:
# Here is a dictionary that contains an integer and an anonymous function
MyDict = {'a': 3, 'b': lambda x : x + 2}
# We can use the anonymous function on the value of 'a':
MyDict['b'](MyDict['a'])

5
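Two dictionary idioms come up often with semi-structured data: looking up a key that may be missing, and looping over key-value pairs. The sketch below uses a hypothetical record named `book`; `get` and `items` are standard dictionary methods.

```python
book = {"title": "Unlocking Android", "authors": ["W. Frank Ableson"]}  # hypothetical record

print("isbn" in book)               # membership test on keys: False
print(book.get("isbn"))             # missing key returns None instead of raising KeyError
print(book.get("isbn", "unknown"))  # or a default value that we choose

for key, value in book.items():     # iterate over key-value pairs
    print(key, "->", value)
```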

There is one more data type we briefly cover here, although it is not a **native** (a.k.a. **built-in**) data type, in the sense that we need to load a library for it: the `numpy` library. The `numpy` library is used for numerical computing, including linear algebra calculations. The most basic `numpy` object is an `array`.

In [53]:
import numpy as np
arr = np.array([2, 4, 1, 9])
print(arr)

[2 4 1 9]


In [54]:
type(arr)

numpy.ndarray

In [55]:
arr.mean()

np.float64(4.0)

In [56]:
np.mean(arr)

np.float64(4.0)

The above array looks like a Python list, but a `numpy` array and a list are very different. Let's look at some examples to see what makes them different.

### Exercise (2 minutes)

In the following cell, we create a Python list and a `numpy` array. Then we perform the same operation on the list and on the array, but get very different results. Explain what happens in each case.

In [57]:
lst = [2, 4, 9, 7, 1, 9] # this is a list
arr = np.array(lst) # this is an array

print(lst + [1])
print(arr + [1])

[2, 4, 9, 7, 1, 9, 1]
[ 3  5 10  8  2 10]


Return to the above code and change one of the elements of the list from an integer to a string, then report what happens.

### End of exercise

As we saw in the last exercise, `numpy` arrays do not hold elements of different types: when given mixed input, `numpy` converts every element to one common type. This common type is the array's component type (its `dtype`), and each array has exactly one. We say that the values within an array are **uniform** in type. This makes `numpy` arrays ideal for storing columns in tables, because in a table, different columns can be of different types, but for a given column all elements (rows) are expected to be of the same type.
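The component type can be inspected through the array's `dtype` attribute. In the sketch below the array names are illustrative, and the exact integer width reported depends on the platform.

```python
import numpy as np

ints = np.array([2, 4, 9])
mixed = np.array([2, "4", 9])  # one string among the integers

print(ints.dtype)   # an integer dtype (width depends on the platform)
print(mixed.dtype)  # a Unicode string dtype
print(mixed)        # every element has been converted to a string
```

Because every element of `mixed` is now a string, numeric operations such as adding 1 no longer make sense on it.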

The `numpy` library is meant to make it easy to work with arrays. We saw for example how we can add 1 to every element of the array **without having to write a loop or list comprehension**. Let's look at another example: say we want to filter an array to only keep elements greater than 5. Compare the list implementation and the array implementation:

In [58]:
lst = [2, 4, 9, 7, 1, 9] # this is a list

[i for i in lst if i > 5] # filtering a list

[9, 7, 9]

In [59]:
arr = np.array(lst) # this is an array

arr[arr > 5] # filtering an array

array([9, 7, 9])
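It can help to look at the intermediate step. The comparison `arr > 5` produces a boolean array (often called a mask), and indexing with that mask keeps the positions where it is `True`. The name `mask` below is illustrative.

```python
import numpy as np

arr = np.array([2, 4, 9, 7, 1, 9])

mask = arr > 5     # element-wise comparison returns a boolean array
print(mask)
print(mask.sum())  # True counts as 1, so this is the number of matches
print(arr[mask])   # keep only the elements where the mask is True
```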

There's a lot more we could be talking about, but this is a data science class, not a class on Python programming. We **strongly** encourage you to take an introductory Python class and get very comfortable with the constructs we covered here. 

## Other useful functions

For the remainder of the notebook, we cover a few more things that have relevance to data science and then try to bring these examples home by looking at a practical example.

We often have to print some summaries about data. The `print` function along with the `format` method can be very helpful.

In [60]:
dec, bignum, pct = 3.1415, 4.234**5, 0.89 # you can assign multiple variables at once with this shortcut
print("The three numbers are {}, {}, and {}.".format(dec, bignum, pct))
print("The decimal rounded to 2 digits is {:.2f}".format(dec))
print("The bignum with thousand separators and 2 decimals is {:,.2f}".format(bignum))
print("The percentage number is {:2.0f}%".format(pct * 100))

The three numbers are 3.1415, 1360.6745706140914, and 0.89.
The decimal rounded to 2 digits is 3.14
The bignum with thousand separators and 2 decimals is 1,360.67
The percentage number is 89%


Sometimes we want to run some code, and in case it fails, run some other code. This can help us **gracefully** (that's a technical term) handle errors in our code. To do that, we use the `try` and `except` statements.

In [61]:
try:
    new_var += 1 # new_var doesn't exist
    print("Found and incremented new_var. The new value is {}.".format(new_var))
except NameError: # catch only the error we expect: new_var is not defined yet
    new_var = 45
    print("Initialized new_var to 45")

Initialized new_var to 45


Try running the above cell multiple times to see what happens. The `try` statement can be very helpful when we're traversing data that is not well structured and we expect to run into errors but don't want the errors to stop us mid-stream.
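As a sketch of what traversing messy data can look like, the loop below converts strings to numbers and records `None` for anything that cannot be converted. The input list `raw_values` is made up for this example; `ValueError` is the exception `float` raises for text it cannot parse.

```python
raw_values = ["3.5", "7", "n/a", "12.25"]  # hypothetical messy input

cleaned = []
for raw in raw_values:
    try:
        cleaned.append(float(raw))  # works for well-formed numbers
    except ValueError:
        cleaned.append(None)        # keep going instead of stopping mid-stream

print(cleaned)  # [3.5, 7.0, None, 12.25]
```

Naming the specific exception means that unrelated errors, such as a typo in a variable name, are not silently swallowed.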

Writing functions in Python is relatively easy:

In [62]:
def my_function(n, m = 1): # name and arguments for the function
    return n - m

print(my_function(3))
print(my_function(n = 3))

2
2


In [63]:
print(my_function(7, 4)) # you can match arguments by position, here order matters
print(my_function(m = 4, n = 7)) # if you match arguments by name, order doesn't matter
# You might want to try switching the order of the arguments

3
3


### Exercise (3 minutes)

In the list below, one of the elements is a string by accident. Write a program that multiplies each element of the list by itself. Use `try` and `except` to leave the element as-is when the element is not a number.

There are different ways to solve this. Here we propose to write a function and then use a list comprehension to apply it to each element of the list.

In [64]:
my_list = [2, 4, 8, "3", 5]

def convert(n):
    ## write code here
    return n

[convert(i) for i in my_list]

[2, 4, 8, '3', 5]

### End of exercise

One final note about functions. The above function was created in the current Python session, so we can call it in the same session. But what about calling a function from one of the libraries we load? For example, let's say we want to use the `mean` function in the `numpy` library. We already saw that we first have to load `numpy` and then call `numpy.mean`, or `np.mean` if we alias `numpy` as `np`.

In [65]:
import numpy as np
np.mean([4, 8, 3])

np.float64(5.0)

However, if we don't like to preface the function name with the library name, we can load the function like this:

In [66]:
from numpy import mean
mean([4, 8, 3])

np.float64(5.0)

However, we should be careful about doing this because we might overwrite an existing function with the same name. This topic deals with what we call "scope" in programming. When Python is looking for some variable, it first looks for it in the local scope, then the enclosing, then the global scope and finally the built-in scope (leading to the LEGB acronym). Scoping is out of the scope of this notebook (pun intended), but suffice it to say that in general we should alias functions to avoid name conflicts.

In [67]:
from numpy import mean as average
average([4, 8, 3])

np.float64(5.0)
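The LEGB lookup order can be seen in a few lines. All function and variable names in this sketch are hypothetical; the last function deliberately reuses the name of the built-in `len` to show how a name conflict hides the original.

```python
x = "global"

def outer():
    x = "enclosing"
    def inner():
        return x  # no local x, so Python finds the one in the enclosing function
    return inner()

def uses_global():
    return x      # no local or enclosing x, so Python finds the global one

def shadows_builtin():
    len = lambda obj: "shadowed"  # a local name hides the built-in inside this function
    return len([1, 2, 3])

print(outer())            # enclosing
print(uses_global())      # global
print(shadows_builtin())  # shadowed
print(len([1, 2, 3]))     # the built-in is untouched outside the function: 3
```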

So this is it with our short Python tutorial, but before we finish, let's give you some motivation for everything we learned.

At this point you might be wondering: am I here to learn data science, or am I here to learn to solve little programming challenges? In other words, how relevant is all of this to doing data science? The answer is very, very, very relevant, because knowing the basics well can help you write clear and concise code to manipulate data or train models. We'll look at some examples here, but the truth is it takes time and practice to come to this realization.

## Reading semi-structured data

Let's show an example of how the things we learned can be applied to a data science situation. We will go and read in some data from a JSON file, which is an example of what we call **semi-structured data**. 

Before reading the data, go and open it in an editor (the file name is `books.json`). A JSON file is not a Python object, but does it look **similar to** any of the Python objects we've encountered so far?

Let's now go and read the file into Python. After reading it, we will print the first element of it, to see what kind of data is there. When you try to print objects that can have nested information, it's helpful to pretty-print it, using the `pprint` library so the information is more presentable.

In [69]:
import json
with open('../../data/books.json', encoding = 'utf-8') as f:
    books_dict = json.load(f)

from pprint import pprint
pprint(books_dict[0]) # print information for the first book

{'_id': 1,
 'authors': ['W. Frank Ableson', 'Charlie Collins', 'Robi Sen'],
 'categories': ['Open Source', 'Mobile'],
 'isbn': '1933988673',
 'longDescription': 'Android is an open source mobile phone platform based on '
                    'the Linux operating system and developed by the Open '
                    'Handset Alliance, a consortium of over 30 hardware, '
                    'software and telecom companies that focus on open '
                    'standards for mobile devices. Led by search giant, '
                    'Google, Android is designed to deliver a better and more '
                    'open and cost effective mobile experience.    Unlocking '
                    "Android: A Developer's Guide provides concise, hands-on "
                    'instruction for the Android operating system and '
                    'development tools. This book teaches important '
                    'architectural concepts in a straightforward writing style '
                    

We are now going to extract particular pieces of information from the first book: the title, author, category, and ISBN of the book. However, we run into a problem: books can have multiple authors and multiple categories, and for reasons that will become clear soon we don't want to allow that. Instead we'll do this:

- When there are multiple authors, we will replicate the information once for each author.
- When there are multiple categories, we will just take the first one and ignore the rest.

In [70]:
elem = books_dict[0] # pull out the first element
num_authors = len(elem['authors'])
print("The first book has {} authors.".format(num_authors))

The first book has 3 authors.


Also for reasons that will become clear soon, we create a **tuple** for storing the information, so here because we have three authors, we create three tuples called `row_1`, `row_2` and `row_3`.

In [71]:
row_1 = (elem['title'], 
         elem['authors'][0], # first author
         elem['categories'][0], # first category
         elem['isbn'])

row_2 = (elem['title'], 
         elem['authors'][1], # second author
         elem['categories'][0], # first category
         elem['isbn'])

row_3 = (elem['title'], 
         elem['authors'][2], # third author
         elem['categories'][0], # first category
         elem['isbn'])

In [72]:
print(row_1)
print(row_2)
print(row_3)

('Unlocking Android', 'W. Frank Ableson', 'Open Source', '1933988673')
('Unlocking Android', 'Charlie Collins', 'Open Source', '1933988673')
('Unlocking Android', 'Robi Sen', 'Open Source', '1933988673')
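Writing one tuple per author by hand does not scale to books with an arbitrary number of authors. A list comprehension can build the same rows in one step. In the sketch below, `elem` is a hand-written stand-in for `books_dict[0]` with values copied from the output above, and `author_rows` is an illustrative name.

```python
# Stand-in for books_dict[0], limited to the fields we use
elem = {
    'title': 'Unlocking Android',
    'authors': ['W. Frank Ableson', 'Charlie Collins', 'Robi Sen'],
    'categories': ['Open Source', 'Mobile'],
    'isbn': '1933988673',
}

# One tuple per author; only the first category is kept
author_rows = [
    (elem['title'], author, elem['categories'][0], elem['isbn'])
    for author in elem['authors']
]

for row in author_rows:
    print(row)
```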


Now it's time to take the content from above and place it in a tabular data format. We will use the `sqlite3` library, which gives us access to a light-weight SQL database in Python. Note that we are only doing this to illustrate how data can flow from one format into another. We will learn later that there are other options if we do not need a SQL database.

We don't actually need to have SQLite installed on our machine. Instead we use `sqlite3.connect(':memory:')` to connect to a "database" in memory and pretend it's a physical database somewhere.

In [73]:
import sqlite3

connection = sqlite3.connect(':memory:') 
cursor = connection.cursor()

cursor.execute('''CREATE TABLE books_long
             (title text, author text, category text, isbn text)''')

rows = [row_1, row_2, row_3]
cursor.executemany('INSERT INTO books_long VALUES (?,?,?,?)', rows)

connection.commit() # save the changes

Notice how in the above snippet, we first create a SQL table with column names and column types matching what we extracted from the JSON file. We then group `row_1`, `row_2` and `row_3` into a list and insert them into the table using `INSERT INTO`.

### Exercise (3 minutes)

Based on what we learned about lists and tuples, see if you can answer the following questions:

1. What is the type of `row_1`? Provide a justification for this choice.
1. What is the type of `rows`? Provide a justification for this choice.

### End of exercise

How do we check that it all worked? We can simply run a `SELECT *` on the data, and use `fetchall()` to grab it from the database and bring it back into Python.

In [74]:
books_table = cursor.execute("SELECT * FROM books_long").fetchall()
books_table

[('Unlocking Android', 'W. Frank Ableson', 'Open Source', '1933988673'),
 ('Unlocking Android', 'Charlie Collins', 'Open Source', '1933988673'),
 ('Unlocking Android', 'Robi Sen', 'Open Source', '1933988673')]
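The same pattern works for more selective queries. The self-contained sketch below rebuilds a small version of the table, filters it with a `WHERE` clause and a `?` placeholder, and reads the column names from the cursor's `description` attribute. The two sample rows are copied from the output above.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute('CREATE TABLE books_long (title text, author text, category text, isbn text)')
cursor.executemany('INSERT INTO books_long VALUES (?,?,?,?)', [
    ('Unlocking Android', 'W. Frank Ableson', 'Open Source', '1933988673'),
    ('Unlocking Android', 'Robi Sen', 'Open Source', '1933988673'),
])

# The ? placeholder is filled in safely from the tuple of parameters
result = cursor.execute('SELECT title, author FROM books_long WHERE author = ?', ('Robi Sen',))
column_names = [col[0] for col in result.description]  # first item of each entry is the column name
matches = result.fetchall()
connection.close()

print(column_names)  # ['title', 'author']
print(matches)       # [('Unlocking Android', 'Robi Sen')]
```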

The above is not really a table. It is a list of tuples. A real table has row indices and meaningful column headers, and it is displayed in a nice grid. In the assignment you will create a real table.

So let's summarize what we accomplished:

1. We read data from a JSON file into a Python list of dictionaries.
1. We extracted some of the data out of the first dictionary and placed it into a list of tuples.
1. We took the list of tuples and dumped its content into a SQL table.
1. We read the content back from the SQL table and into Python as a list of tuples.

As data flows from one format to another, it's important to think about the right object for representing the data. The right choice depends on many factors, such as

- The size of the data and other efficiency factors
- Whether the data is flat or hierarchical
- Whether the data is structured, unstructured, or semi-structured
- Our preference for how to "query" the data

Working with data, especially at scale, is such an important topic in data science that a relatively new role, called **data engineer**, was created to deal with it. Among other things, data engineers build efficient data pipelines that reduce redundancy and let the right tools do the job.

# Assignment

In this assignment we want to get comfortable with loading and manipulating data in Python. While future assignments will focus more on structured data, which we can load into a `DataFrame` using `pandas`, this assignment is focused on semi-structured data and how we can "flatten" it and then load it into other formats. The objective is to see how data flows in Python from one object to another and what advantages and disadvantages each offers.

Let's read the `books.json` data set and display the first item in it.

In [76]:
import json
with open('../../data/books.json', encoding = 'utf-8') as f:
    books_dict = json.load(f)

from pprint import pprint
pprint(books_dict[0]) # print information for the first book

{'_id': 1,
 'authors': ['W. Frank Ableson', 'Charlie Collins', 'Robi Sen'],
 'categories': ['Open Source', 'Mobile'],
 'isbn': '1933988673',
 'longDescription': 'Android is an open source mobile phone platform based on '
                    'the Linux operating system and developed by the Open '
                    'Handset Alliance, a consortium of over 30 hardware, '
                    'software and telecom companies that focus on open '
                    'standards for mobile devices. Led by search giant, '
                    'Google, Android is designed to deliver a better and more '
                    'open and cost effective mobile experience.    Unlocking '
                    "Android: A Developer's Guide provides concise, hands-on "
                    'instruction for the Android operating system and '
                    'development tools. This book teaches important '
                    'architectural concepts in a straightforward writing style '
                    

1. Write a program that goes through the entire data and extracts the following information:  <span style="color:red" float:right>[4 point]</span>

  - title of the book
  - name of the first author
  - name of the second author (if book has more than one author)
  - number of authors
  - ISBN
  - if the word "data" is in the book's description
  - the number of words in the book's description
  - the year the book was published

  Of course because JSON data doesn't necessarily enforce any sort of schema, we can't be sure that the information we are trying to extract exists for every book. For example, if the book only has one author, then there is no second author. So use `try` and `except` as you loop through every book and skip to the next item every time some information is missing.

  Store the extracted data in a list named `rows` whose elements are tuples, one tuple per book. For example, the first element of `rows` stores the tuple for the first book and should look like this:

        ('Unlocking Android', 'W. Frank Ableson', 'Charlie Collins', 3, '1933988673', True, 252, 2009)

2. Save the content of `rows` in a SQL-like table using `sqlite3`, and choose the appropriate column types. <span style="color:red" float:right>[2 point]</span> 

  As your column names use the following:

  - `title`
  - `author_1`
  - `author_2`
  - `num_authors`
  - `isbn`
  - `has_data`
  - `desc_len`
  - `year_published`

3. Write a SQL query against the table to show all books that (1) contain the word "data" and (2) have more than 3 authors. Store the result of the query in an object called `books_table`, then close the connection. <span style="color:red" float:right>[2 point]</span>

SQL tables are not the only way, and definitely not the most straightforward way, to store and manipulate data in Python. A format that's more popular with data scientists is the `DataFrame` from the `pandas` library. This library has a lot of functionality that makes it easy to run the common tasks data scientists do with data.

4. Read the data from the above query into a `DataFrame` and call it `books_df`. HINT: Use `pd.DataFrame` and specify meaningful column names to use for the columns. <span style="color:red" float:right>[1 point]</span>

In [None]:
import pandas as pd

5. Display the first few rows of the `DataFrame` by calling its `head` method. <span style="color:red" float:right>[1 point]</span>

Remember how earlier we said that a `DataFrame` is built on top of `numpy` arrays? Another way of saying it is that a `DataFrame` is an **abstraction** on top of `numpy` arrays: i.e. a `DataFrame` is a more **high-level** object than a `numpy` array. 

6. Call the `values` attribute of your `DataFrame` to convert it into a numpy array and display the first 3 elements of the array. <span style="color:red" float:right>[1 point]</span>

Now you can judge which object is more "user-friendly". That's one of the things that abstractions allow us to do: build more user-friendly (abstract) objects from less user-friendly (but more fundamental) objects.

# End of assignment