# Chapter 23. Databases and SQL

In [289]:
from __future__ import division
import math, random, re
from collections import defaultdict

The data that you will be working with will often live in [databases](https://en.wikipedia.org/wiki/Database), systems designed for efficiently storing and querying data.  
The bulk of these are [relational databases](https://en.wikipedia.org/wiki/Relational_database) such as [Oracle](https://en.wikipedia.org/wiki/Oracle_Database), [MySQL](https://en.wikipedia.org/wiki/MySQL), and [SQL Server](https://en.wikipedia.org/wiki/Microsoft_SQL_Server), which store data in [tables](https://en.wikipedia.org/wiki/Table_%28database%29) and are typically queried using [Structured Query Language (SQL)](https://en.wikipedia.org/wiki/SQL), a [declarative language](https://en.wikipedia.org/wiki/Declarative_programming) for manipulating data.  
SQL is an essential part of a data scientist's toolkit.  
In this chapter, we'll create NotQuiteABase, a Python implementation of something that's not quite a database.  
We'll also cover the basics of SQL while demonstrating how those principles work in NotQuiteABase.  
Hopefully, solving problems in NotQuiteABase will give you a good sense of how you might solve the same problems using SQL.

## CREATE TABLE and INSERT

A relational database is a collection of tables and the relationships among those tables.  
A table is simply a collection of rows, not unlike the matrices we've been working with.  
However, a table also has a fixed [database schema](https://en.wikipedia.org/wiki/Database_schema) consisting of column names and column types.  
For example, imagine a `users` data set containing, for each user, her `user_id`, `name`, and `num_friends`.

In SQL, we might create this table with:
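One way to write and run such a statement is sketched here through Python's built-in `sqlite3` module. The in-memory database and the names `conn` and `columns` are assumptions for illustration, and SQLite stands in for whatever database you actually use:

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a real database server.
conn = sqlite3.connect(":memory:")

# user_id and num_friends are integers, user_id may not be NULL,
# and name is declared as a string of at most 200 characters.
conn.execute("""
    CREATE TABLE users (
        user_id INT NOT NULL,
        name VARCHAR(200),
        num_friends INT);
""")

# Ask SQLite what it recorded: (column name, declared type, NOT NULL flag).
columns = [(row[1], row[2], row[3])
           for row in conn.execute("PRAGMA table_info(users)")]
```

SQLite records the declared types but, unlike stricter databases, does not enforce the 200-character limit on `name`.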

Notice that we specified that the `user_id` and `num_friends` must be integers (and that `user_id` can't be NULL, which indicates a missing value and is sort of like Python's `None`) and that the name should be a string of length 200 or less.  
NotQuiteABase won't take types into account, but we'll behave as if it did.  
Also, SQL doesn't usually care about case (you don't have to capitalize SELECT or GROUP BY) or indentation, so the style used here will probably differ from styles you encounter elsewhere.

You can insert the rows with INSERT statements:
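As a hedged sketch, the first few rows of the `users` data could be inserted like this, again using Python's built-in `sqlite3` module. The in-memory database and the names `conn` and `rows` are assumptions for illustration:

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a real database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INT NOT NULL, name VARCHAR(200), num_friends INT);")

# One INSERT statement per row; values follow the order of the listed columns.
conn.execute("INSERT INTO users (user_id, name, num_friends) VALUES (0, 'Hero', 0);")
conn.execute("INSERT INTO users (user_id, name, num_friends) VALUES (1, 'Dunn', 2);")
conn.execute("INSERT INTO users (user_id, name, num_friends) VALUES (2, 'Sue', 3);")

# Read the rows back to confirm that they were stored.
rows = conn.execute(
    "SELECT user_id, name, num_friends FROM users ORDER BY user_id"
).fetchall()
```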

Notice also that SQL statements need to end with semicolons, and that SQL requires single quotes for its strings.  
In NotQuiteABase, you'll create a `Table` simply by specifying the names of its columns.  
To insert a row, you'll use the table's `insert()` method, which takes a `list` of row values that need to be in the same order as the table's column names.  
Behind the scenes, we'll store each row as a `dict` from column names to values.  
A real database would never use such a space-wasting representation, but doing so will make NotQuiteABase much easier to work with:

In [290]:
class Table:
    
    def __init__(self, columns):
        self.columns = columns
        self.rows = []
        
    def __repr__(self):
        """ pretty representation of the table: first columns then rows """
        return str(self.columns) + "\n" + "\n".join(map(str, self.rows))
    
    def __getitem__(self, i):
        """ return the row at position i: users[i] """
        return self.rows[i]
    
    def insert(self, row_values):
        if len(row_values) != len(self.columns):
            raise TypeError("Wrong Number of Elements")
        row_dict = dict(zip(self.columns, row_values))
        self.rows.append(row_dict)
        
    def update(self, updates, predicate):
        for row in self.rows:
            if predicate(row):
                for column, new_value in updates.items():
                    row[column] = new_value

For example, we could set up:

In [291]:
users = Table(["user_id", "name", "num_friends"])
users.insert([0, "Hero", 0])
users.insert([1, "Dunn", 2])
users.insert([2, "Sue", 3])
users.insert([3, "Chi", 3])
users.insert([4, "Thor", 3])
users.insert([5, "Clive", 2])
users.insert([6, "Hicks", 3])
users.insert([7, "Devin", 2])
users.insert([8, "Kate", 2])
users.insert([9, "Klein", 3])
users.insert([10, "Jen", 1])

In [292]:
print(users)

['user_id', 'name', 'num_friends']
{'user_id': 0, 'name': 'Hero', 'num_friends': 0}
{'user_id': 1, 'name': 'Dunn', 'num_friends': 2}
{'user_id': 2, 'name': 'Sue', 'num_friends': 3}
{'user_id': 3, 'name': 'Chi', 'num_friends': 3}
{'user_id': 4, 'name': 'Thor', 'num_friends': 3}
{'user_id': 5, 'name': 'Clive', 'num_friends': 2}
{'user_id': 6, 'name': 'Hicks', 'num_friends': 3}
{'user_id': 7, 'name': 'Devin', 'num_friends': 2}
{'user_id': 8, 'name': 'Kate', 'num_friends': 2}
{'user_id': 9, 'name': 'Klein', 'num_friends': 3}
{'user_id': 10, 'name': 'Jen', 'num_friends': 1}


## UPDATE

Sometimes you need to update the data that's already in the database.  
For instance, if Dunn acquires another friend, you might need to do this:
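A sketch of such an update, run through Python's built-in `sqlite3` module. The in-memory database, the two sample rows, and the names `conn` and `dunn` are assumptions for illustration:

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a real database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INT NOT NULL, name VARCHAR(200), num_friends INT);")
conn.execute("INSERT INTO users (user_id, name, num_friends) VALUES (1, 'Dunn', 2);")
conn.execute("INSERT INTO users (user_id, name, num_friends) VALUES (2, 'Sue', 3);")

# Set num_friends to 3 in the rows where user_id is 1.
conn.execute("""
    UPDATE users
    SET num_friends = 3
    WHERE user_id = 1;
""")

# Read Dunn's row back to check the new value.
dunn = conn.execute(
    "SELECT name, num_friends FROM users WHERE user_id = 1"
).fetchone()
```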

The key features are:
- What table to update 
- Which rows to update
- Which fields to update
- What their new values should be  

We'll add a similar `update` method to NotQuiteABase.  
Its first argument will be a `dict` whose keys are the columns to update and whose values are the new values for those fields.  
Its second argument is a [predicate](https://en.wikipedia.org/wiki/Predicate_%28mathematical_logic%29) that returns `True` for rows that should be updated, `False` otherwise:

In [293]:
# method of Table (it belongs inside the class body)
def update(self, updates, predicate):
    for row in self.rows:
        if predicate(row):
            # overwrite each listed column with its new value
            for column, new_value in updates.items():
                row[column] = new_value

Now when Dunn makes a new friend and we want to update his information, we can do this:

In [294]:
# set num_friends = 3 in rows where user_id == 1
users.update({'num_friends' : 3 }, lambda row: row['user_id'] == 1)
users[1]

{'name': 'Dunn', 'num_friends': 3, 'user_id': 1}
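Because the predicate is an arbitrary function of the row, a single call can change several rows at once. The following is a minimal, self-contained sketch: `MiniTable` is a hypothetical, trimmed-down stand-in for the chapter's `Table`, and the update itself is invented purely for illustration:

```python
class MiniTable:
    """Hypothetical trimmed-down stand-in for the chapter's Table class."""

    def __init__(self, columns):
        self.columns = columns
        self.rows = []

    def insert(self, row_values):
        if len(row_values) != len(self.columns):
            raise TypeError("Wrong Number of Elements")
        self.rows.append(dict(zip(self.columns, row_values)))

    def update(self, updates, predicate):
        for row in self.rows:
            if predicate(row):
                for column, new_value in updates.items():
                    row[column] = new_value


users = MiniTable(["user_id", "name", "num_friends"])
users.insert([0, "Hero", 0])
users.insert([1, "Dunn", 2])
users.insert([10, "Jen", 1])

# The predicate matches every row with fewer than two friends (Hero and Jen),
# so both rows are updated by this one call.
users.update({"num_friends": 2}, lambda row: row["num_friends"] < 2)
friend_counts = [row["num_friends"] for row in users.rows]
```

A predicate that matches no rows leaves the table unchanged, much as an SQL UPDATE whose WHERE clause matches nothing.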

## DELETE