# Building a Database for Crime Reports

In this project, we will build a database for storing data related to crimes that occurred in Boston. The dataset is available in the file boston.csv.

We will create a database named crime_db with a table boston_crimes loaded from the CSV file. The table will live in a schema named crimes. We will also create readonly and readwrite groups and add a user to each of them.

## Creating the Crime Database

In [2]:
import psycopg2

# For creating the crime_db database
conn = psycopg2.connect(dbname='dq',user='dq')
conn.autocommit = True # Required to create database
cur = conn.cursor()
cur.execute('CREATE DATABASE crime_db;')
conn.close()

# For creating the crimes schema
conn = psycopg2.connect(dbname="crime_db", user="dq")
conn.autocommit = True
cur = conn.cursor()
cur.execute("CREATE SCHEMA crimes;")

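## Obtaining the Column Names and Sample

We read the header row and the first data row of boston.csv to get the column names and a sample of the values.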



In [3]:
import csv
with open('boston.csv') as file:
    reader = csv.reader(file)
    col_headers = next(reader)
    first_row = next(reader)

## Creating an Auxiliary Function

Before we create a table for storing the crime data, we need to identify the proper datatypes for the columns. To help with that, we'll create a function, get_col_set(), that takes the name of a CSV file and a column index (starting at 0) and returns a Python set of all distinct values in that column. We then print the number of distinct values in each column.

In [4]:
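def get_col_set(csv_file, col_index):
    """Return the set of distinct values in column col_index of csv_file."""
    values = set()
    with open(csv_file, 'r', newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            values.add(row[col_index])
    return values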

for i in range(len(col_headers)):
    values = get_col_set('boston.csv', i)
    print(col_headers[i], len(values), sep='\t')

incident_number	298329
offense_code	219
description	239
date	1177
day_of_the_week	7
lat	18177
long	18177
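The distinct-value counts say how varied each column is, but not which values parse as numbers or dates. As a rough complement, here is a minimal sketch of a heuristic type guesser. The function name guess_type() and the returned type labels are our own choices for illustration, and the in-memory sample holds only the first data row of boston.csv; on the real file the sets returned by get_col_set() could be passed in instead.

```python
import csv
import io
from datetime import date

def guess_type(values):
    """Guess a type label for a set of strings (heuristic, not exhaustive)."""
    def all_parse(parser):
        # True if every value can be converted by parser without a ValueError
        try:
            for v in values:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "INTEGER"
    if all_parse(float):
        return "DECIMAL"
    if all_parse(date.fromisoformat):
        return "DATE"
    return "TEXT"

# In-memory stand-in for boston.csv: the header and the first data row only
sample = io.StringIO(
    "incident_number,offense_code,description,date,day_of_the_week,lat,long\n"
    "1,619,LARCENY ALL OTHERS,2018-09-02,Sunday,42.35779134,-71.13937053\n"
)
reader = csv.reader(sample)
headers = next(reader)
columns = list(zip(*reader))  # transpose rows into columns
guessed = {h: guess_type(set(col)) for h, col in zip(headers, columns)}

for name, type_label in guessed.items():
    print(name, type_label, sep='\t')
```

This is only a starting point: empty strings fall through to TEXT, and float() also accepts strings such as nan, so the result should be checked against the data before it is used in a table definition.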


## Finding the Maximum Length

We'll look at the longest value in a text column to choose an appropriate length for that field. In particular, we will find the maximum length of the values in the description column.

In [7]:
print(col_headers)

['incident_number', 'offense_code', 'description', 'date', 'day_of_the_week', 'lat', 'long']


In [6]:
descriptions = get_col_set("boston.csv", 2) # description is at index number 2

max_len = 0
for description in descriptions:
    max_len = max(max_len, len(description))
    
print(max_len)

58
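The same idea extends to every column at once. The sketch below reads the file a single time and records the longest value per column. The helper max_lengths() is a hypothetical name, and the sample again contains only the first data row, so the numbers it produces are not those of the full dataset.

```python
import csv
import io

def max_lengths(csv_stream):
    """Return {column name: length of its longest value} for a CSV stream."""
    reader = csv.reader(csv_stream)
    headers = next(reader)
    longest = [0] * len(headers)
    for row in reader:
        for i, value in enumerate(row):
            longest[i] = max(longest[i], len(value))
    return dict(zip(headers, longest))

# In-memory stand-in for boston.csv: the header and the first data row only
sample = io.StringIO(
    "incident_number,offense_code,description,date,day_of_the_week,lat,long\n"
    "1,619,LARCENY ALL OTHERS,2018-09-02,Sunday,42.35779134,-71.13937053\n"
)
lengths = max_lengths(sample)
print(lengths)
```

With the real file, `with open('boston.csv', newline='') as f: max_lengths(f)` would give the figure needed for each VARCHAR column in one pass.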


## Creating the Table

We will be creating a table for storing the Boston crime data in this next part.

In [8]:
print(col_headers)
print(first_row)

['incident_number', 'offense_code', 'description', 'date', 'day_of_the_week', 'lat', 'long']
['1', '619', 'LARCENY ALL OTHERS', '2018-09-02', 'Sunday', '42.35779134', '-71.13937053']


In [10]:
# Create an enumerated type for the days of the week
cur.execute(
"""
    CREATE TYPE weekday AS ENUM ('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday');
""")

# create the table
cur.execute("""
    CREATE TABLE crimes.boston_crimes (
        incident_number INTEGER PRIMARY KEY,
        offense_code INTEGER,
        description VARCHAR(100),
        date DATE,
        day_of_the_week weekday,
        lat decimal,
        long decimal
    );
""")

DuplicateObject: type "weekday" already exists


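incident_number is the natural choice for the primary key because it identifies each report. The other column types follow from the values in first_row and from the checks above: offense_code is an integer, description fits comfortably in VARCHAR(100) since its longest value has 58 characters, date is a DATE, day_of_the_week has only seven distinct values and so uses the weekday enum, and lat and long are decimals.

The DuplicateObject error in the output means that the weekday type already existed when the cell ran, for example because the cell was executed a second time. PostgreSQL's CREATE TYPE has no IF NOT EXISTS clause, so on a re-run the type has to be dropped first or the statement skipped.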
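One way to avoid the DuplicateObject error when the cell is re-run is to wrap CREATE TYPE in a PL/pgSQL DO block that ignores the duplicate_object condition. The sketch below only builds the SQL text. The helper create_enum_sql() is a hypothetical name, it assumes the type name is a trusted identifier (it is inserted unquoted), and the final execute call is left commented out because it needs the notebook's open cursor.

```python
def create_enum_sql(type_name, labels):
    """Build a DO block that creates an enum and ignores 'already exists'."""
    # Quote each label as a SQL string literal, doubling any single quotes
    quoted = ", ".join("'" + label.replace("'", "''") + "'" for label in labels)
    return (
        "DO $$\n"
        "BEGIN\n"
        f"    CREATE TYPE {type_name} AS ENUM ({quoted});\n"
        "EXCEPTION\n"
        "    WHEN duplicate_object THEN NULL;\n"
        "END\n"
        "$$;"
    )

days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday',
        'Thursday', 'Friday', 'Saturday']
weekday_sql = create_enum_sql("weekday", days)
print(weekday_sql)

# With the notebook's cursor (not run here):
# cur.execute(weekday_sql)
```

For the table itself, PostgreSQL does support CREATE TABLE IF NOT EXISTS, so only the type needs this workaround.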

## Loading the Data

We will use the cursor's copy_expert() method to load the rows of the CSV file into the table.

In [11]:
with open("boston.csv") as f:
    cur.copy_expert("COPY crimes.boston_crimes FROM STDIN WITH CSV HEADER;", f)
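A simple sanity check after loading is to compare the number of data rows in the CSV file with the number of rows in the table. The sketch below shows the file-side count on an in-memory sample holding only the first data row. The helper count_data_rows() is a hypothetical name, and the database-side comparison is left as comments because it needs the notebook's open cursor.

```python
import csv
import io

def count_data_rows(csv_stream):
    """Count the rows of a CSV stream, excluding the header row."""
    reader = csv.reader(csv_stream)
    next(reader, None)  # skip the header row if there is one
    return sum(1 for _ in reader)

# In-memory stand-in for boston.csv: the header and the first data row only
sample = io.StringIO(
    "incident_number,offense_code,description,date,day_of_the_week,lat,long\n"
    "1,619,LARCENY ALL OTHERS,2018-09-02,Sunday,42.35779134,-71.13937053\n"
)
expected_rows = count_data_rows(sample)
print(expected_rows)

# With the notebook's cursor (not run here):
# cur.execute("SELECT COUNT(*) FROM crimes.boston_crimes;")
# assert cur.fetchone()[0] == expected_rows
```

Counting with csv.reader rather than counting lines keeps the result correct even if a quoted field contains a line break.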

## Revoking Public Privileges

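Now, let's handle permissions. We start by revoking all privileges that public holds on the public schema and on the crime_db database, so that access comes only from the groups we create next.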

In [12]:
cur.execute("REVOKE ALL ON SCHEMA public FROM public;")
cur.execute("REVOKE ALL ON DATABASE crime_db FROM public;")

## Creating User Groups

Now, we will create the readonly and readwrite groups. We will grant each group the privileges it needs so that its users can run the appropriate commands: SELECT only for readonly, and SELECT, INSERT, UPDATE and DELETE for readwrite.

In [13]:
# For readonly group
cur.execute("CREATE GROUP readonly NOLOGIN;")
cur.execute("GRANT CONNECT ON DATABASE crime_db TO readonly;")
cur.execute("GRANT USAGE ON SCHEMA crimes TO readonly;")
cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA crimes TO readonly;")

# For readwrite group
cur.execute("CREATE GROUP readwrite NOLOGIN;")
cur.execute("GRANT CONNECT ON DATABASE crime_db TO readwrite;")
cur.execute("GRANT USAGE ON SCHEMA crimes TO readwrite;")
cur.execute("GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA crimes TO readwrite;")
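The two groups follow the same pattern, so the statements can be generated from a small mapping. The sketch below does that and also adds an ALTER DEFAULT PRIVILEGES statement, which is our addition and not part of the cell above: GRANT ... ON ALL TABLES only covers tables that exist at the time it runs, while default privileges apply to tables created later by the role that issues the command. The helper group_statements() is a hypothetical name, and the names are assumed to be trusted identifiers.

```python
def group_statements(group, privileges, database="crime_db", schema="crimes"):
    """Return the SQL statements that set up one NOLOGIN group."""
    privs = ", ".join(privileges)
    return [
        f"CREATE GROUP {group} NOLOGIN;",
        f"GRANT CONNECT ON DATABASE {database} TO {group};",
        f"GRANT USAGE ON SCHEMA {schema} TO {group};",
        f"GRANT {privs} ON ALL TABLES IN SCHEMA {schema} TO {group};",
        # Assumption: also cover tables created later by the current role
        f"ALTER DEFAULT PRIVILEGES IN SCHEMA {schema} "
        f"GRANT {privs} ON TABLES TO {group};",
    ]

groups = {
    "readonly": ["SELECT"],
    "readwrite": ["SELECT", "INSERT", "UPDATE", "DELETE"],
}
statements = [s for g, p in groups.items() for s in group_statements(g, p)]
for statement in statements:
    print(statement)

# With the notebook's cursor (not run here):
# for statement in statements:
#     cur.execute(statement)
```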

## Creating Users

We will create a test user in each group below.

In [14]:
cur.execute("CREATE USER data_analyst WITH PASSWORD 'secret1';")
cur.execute("GRANT readonly TO data_analyst;")

cur.execute("CREATE USER data_scientist WITH PASSWORD 'secret2';")
cur.execute("GRANT readwrite TO data_scientist;")
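The passwords above are hard-coded for the sake of the exercise. For anything beyond a test setup, a randomly generated password is a safer default. Here is a minimal sketch using Python's secrets module. The helper generate_password() and the length of 20 are our own choices, and the execute call is left commented out because it needs the notebook's open cursor. It relies on psycopg2 substituting parameters on the client side, which we assume lets the password be passed as a query parameter.

```python
import secrets
import string

def generate_password(length=20):
    """Return a random password made of ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

analyst_password = generate_password()

# With the notebook's cursor (not run here):
# cur.execute("CREATE USER data_analyst WITH PASSWORD %s;", (analyst_password,))
```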

## Testing

We will check to see that everything is now configured as expected.

In [15]:
conn.close() # Close old connection

# Reconnecting
conn = psycopg2.connect(dbname="crime_db", user="dq")
cur = conn.cursor()

# Checking the users and groups
cur.execute(
"""
    SELECT rolname, rolsuper, rolcreaterole, rolcreatedb, rolcanlogin FROM pg_roles
    WHERE rolname IN ('readonly', 'readwrite', 'data_analyst', 'data_scientist');
""")

for user in cur:
    print(user)

# Checking privileges
cur.execute(
"""
    SELECT grantee, privilege_type
    FROM information_schema.table_privileges
    WHERE grantee IN ('readonly', 'readwrite');
""")

for user in cur:
    print(user)

conn.close()

('readonly', False, False, False, False)
('readwrite', False, False, False, False)
('data_analyst', False, False, False, True)
('data_scientist', False, False, False, True)
('readonly', 'SELECT')
('readwrite', 'INSERT')
('readwrite', 'SELECT')
('readwrite', 'UPDATE')
('readwrite', 'DELETE')
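Rather than checking the printed rows by eye, the privilege rows can be grouped by grantee and compared with what we expect. The sketch below uses the rows from the output above as literal data. The helper privileges_by_grantee() is a hypothetical name. In the notebook the rows would come from cur.fetchall() after the second query, before the connection is closed.

```python
from collections import defaultdict

def privileges_by_grantee(rows):
    """Group (grantee, privilege_type) rows into {grantee: set of privileges}."""
    grouped = defaultdict(set)
    for grantee, privilege in rows:
        grouped[grantee].add(privilege)
    return dict(grouped)

# Rows copied from the output of the privilege query above
rows = [
    ('readonly', 'SELECT'),
    ('readwrite', 'INSERT'),
    ('readwrite', 'SELECT'),
    ('readwrite', 'UPDATE'),
    ('readwrite', 'DELETE'),
]

expected = {
    "readonly": {"SELECT"},
    "readwrite": {"SELECT", "INSERT", "UPDATE", "DELETE"},
}
actual = privileges_by_grantee(rows)
print(actual == expected)
```

Comparing sets ignores the order in which PostgreSQL returns the rows, so the check does not depend on the order of the output.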
