# Building A Database For Crime Reports

In this notebook, we will use Postgres to build a database for storing data about crimes that occurred in Boston. The dataset is available in the file boston.csv.

We will create a database named crime_db, with a schema named crimes, and a table named boston_crimes holding the data from the boston.csv file. We will also create readonly and readwrite groups with appropriate privileges, as well as an example user for each of these groups.

## Creating the Crime Database

We will start by creating the database for storing our crime data, as well as the schema.

In [4]:
import psycopg2
import csv

In [1]:
# crime_db does not exist yet, so we connect to the default postgres database and create it from there.
conn = psycopg2.connect(dbname="postgres", user="postgres")
# CREATE DATABASE cannot run inside a transaction block, so enable autocommit.
conn.autocommit = True
cursor = conn.cursor()
cursor.execute("CREATE DATABASE crime_db;")
conn.close()
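
Re-running the cell above fails once crime_db exists, because CREATE DATABASE raises an error for a duplicate name. A minimal sketch of a guard follows, assuming a cursor on an autocommit connection like the one above. The helper name ensure_database is hypothetical and not part of psycopg2; the lookup uses the standard pg_database catalog.

```python
def ensure_database(cursor, name):
    """Create database `name` unless pg_database already lists it.

    Hypothetical helper: `cursor` is assumed to be a DB-API cursor on an
    autocommit connection. Returns True if the database was created.
    """
    # CREATE DATABASE cannot take a bound parameter, so the name is
    # interpolated below; reject anything that is not a plain identifier.
    # Unquoted names are folded to lowercase, so use lowercase names only.
    if not name.isidentifier():
        raise ValueError(f"unsafe database name: {name!r}")

    cursor.execute("SELECT 1 FROM pg_database WHERE datname = %s;", (name,))
    if cursor.fetchone() is not None:
        return False  # Already there; nothing to do.

    cursor.execute(f"CREATE DATABASE {name};")
    return True
```

With the cursor from the cell above, the call would be ensure_database(cursor, "crime_db") in place of the bare CREATE DATABASE statement.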

In [3]:
# Connect to crime_db and create a schema named crimes.
conn = psycopg2.connect(dbname="crime_db", user="postgres")
cursor = conn.cursor()
cursor.execute("CREATE SCHEMA crimes;")
conn.commit()  # Persist the schema; this connection is not in autocommit mode.

## Obtaining the Column Names and Sample

Before we load data into our table, let's first understand the crime dataset so we can choose the right datatypes to use in our table.

In [9]:
with open("boston.csv", "r") as file:
    row_list = list(csv.reader(file))
    
col_headers = row_list[0]
first_row = row_list[1]

print("Column Headers: ", col_headers)
print("First Data Row: ", first_row)

Column Headers:  ['incident_number', 'offense_code', 'description', 'date', 'day_of_the_week', 'lat', 'long']
First Data Row:  ['1', '619', 'LARCENY ALL OTHERS', '2018-09-02', 'Sunday', '42.35779134', '-71.13937053']
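
As a rough first pass at choosing datatypes, each value in the sample row can be tested against a few Python parsers. The sketch below is only a heuristic, and guess_kind is a hypothetical helper: a single row cannot prove that a whole column is numeric, so the results are hints to check against the full file.

```python
from datetime import date


def guess_kind(value):
    """Return a rough category for a CSV string: integer, decimal, date, or text."""
    parsers = (
        ("integer", int),
        ("decimal", float),
        ("date", date.fromisoformat),
    )
    for kind, parser in parsers:
        try:
            parser(value)
        except ValueError:
            continue  # Not this kind; try the next parser.
        return kind
    return "text"


# Header and first data row as printed above.
col_headers = ['incident_number', 'offense_code', 'description', 'date',
               'day_of_the_week', 'lat', 'long']
first_row = ['1', '619', 'LARCENY ALL OTHERS', '2018-09-02', 'Sunday',
             '42.35779134', '-71.13937053']

kinds = [guess_kind(value) for value in first_row]
for header, kind in zip(col_headers, kinds):
    print(f"{header}: {kind}")
```

The integer and decimal hints still leave the choice of a specific Postgres type open, which is where the distinct-value counts and maximum lengths computed later come in.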


## Creating an Auxiliary Function

Let's write a function, get_col_set(), that will help us identify proper datatypes for the columns. This function returns the set of all distinct values in one column of a CSV file. That lets us see whether a column is a good candidate for an enumerated type, and makes it easy to calculate the maximum length of the values in a column so we can set an appropriate VARCHAR size.

We will start by using this function to count the distinct values in each column of the boston.csv file.

In [18]:
def get_col_set(csv_filename, col_index):
    """Return the set of distinct values in column col_index of a CSV file."""
    with open(csv_filename, 'r', newline='') as file:
        reader = csv.reader(file)
        next(reader, None)  # Skip the header row.
        return {row[col_index] for row in reader}

In [17]:
num_distinct_values = []
for index in range(0, len(col_headers)):
    num_distinct_values.append(len(get_col_set("boston.csv", index)))
    
print(num_distinct_values)

[298329, 219, 239, 1177, 7, 18177, 18177]
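
The same idea extends to the maximum value length mentioned earlier, which is what a VARCHAR size would be based on. The sketch below is one possible way to compute it for every column in a single pass. The helper name column_max_lengths is hypothetical, and the two sample rows are illustrative rather than taken from boston.csv as a whole.

```python
import csv
import io


def column_max_lengths(csv_file):
    """Map each header of an open CSV file to the length of its longest value."""
    reader = csv.reader(csv_file)
    headers = next(reader)
    longest = dict.fromkeys(headers, 0)
    for row in reader:
        for header, value in zip(headers, row):
            longest[header] = max(longest[header], len(value))
    return longest


# Small in-memory sample so the sketch runs without boston.csv.
sample = io.StringIO(
    "description,day_of_the_week\n"
    "LARCENY ALL OTHERS,Sunday\n"
    "VANDALISM,Wednesday\n"
)
lengths = column_max_lengths(sample)
print(lengths)  # {'description': 18, 'day_of_the_week': 9}
```

For a single column, the notebook's own function gives the same number with max(len(value) for value in get_col_set("boston.csv", 2)). Columns with very few distinct values, such as the one with 7, are the natural candidates for an enumerated type.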
