Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
1 contributor

Users who have contributed to this file

254 lines (192 sloc) 9.99 KB

Homework 3: Joins and Query Optimization

This homework is divided into two parts.

Part 1 is due: Wednesday, 10/16/2019, 11:59 PM.

Part 2 is due: Wednesday, 10/23/2019, 11:59 PM.

Part 2 does not require Part 1 to be completed first, but you will need to make sure any code you write for Part 1 does not throw an exception before starting Part 2. See the Grading section at the bottom of this document for notes on how your score will be computed.

You will not be able to use slip minutes for the Part 1 deadline. The standard late penalty (33%/day, counted in days not minutes) will apply after the Part 1 deadline. Slip minutes may be used for the Part 2 deadline, and the late penalty for the two parts are independent.

For example, if you submit Part 1 at 5:30 AM two days after it is due, and Part 2 at 6:00 PM the day after it is due, you will recieve:

  • 66% penalty on your Part 1 submission
  • No penalty on your Part 2 submission
  • A total of 1080 slip minutes consumed

Overview

In this assignment, you will implement some join algorithms and a limited version of the Selinger optimizer.

Prerequisites

You should watch the Iterators and Joins lectures before working on Part 1 of this homework.

You should watch both the Query Optimization I: Plan Space and Query Optimization II: Cost and Search lectures before working on Part 2 of this homework.

Getting Started

The test cases for Part 1 are located in src/test/java/edu/berkeley/cs186/database/query (TestJoinOperator and TestSortOperator).

The test cases for Part 2 are located in src/test/java/edu/berkeley/cs186/database/table/stats (TestHistogram) and src/test/java/edu/berkeley/cs186/database/query (TestBasicQuery, TestOptimization2, TestOptimizationJoins, TestSingleAccess, and TestSingleAccessJoins).

To build and test your code in the container, run the following:

mvn clean test -D HW=3

There should be 21 failures, 14 errors, and 39 tests run.

To run just Part 1 tests, run:

mvn clean test -D HW=3Part1

There should be 12 failures, 0 errors, and 12 tests run.

To run just Part 2 tests, run:

mvn clean test -D HW=3Part2

There should be 9 failures, 14 errors, and 27 tests run.

Part 0: Understand the Skeleton Code

common

The common/iterator directory contains an interface called a BacktrackingIterator. Iterators that implement this will be able to mark a point during iteration, and reset back to that mark. For example, here we have a backtracking iterator that just returns 1, 2, and 3, but can backtrack:

BackTrackingIterator<Integer> iter = new BackTrackingIteratorImplementation();
iter.next(); //returns 1
iter.next(); //returns 2
iter.markPrev(); // marks the previously returned value, 2
iter.next(); //returns 3
iter.hasNext(); //returns false
iter.reset(); // reset to the marked value - next() will return the marked value (2)
iter.hasNext(); // returns true
iter.next(); //returns 2
iter.markNext(); // mark the value to be returned next, 3
iter.next(); // returns 3
iter.hasNext(); // returns false
iter.reset(); // reset to the marked value - next() will return the marked value (3)
iter.hasNext(); // returns true
iter.next(); // returns 3

ArrayBacktrackingIterator implements this interface. It takes in an array and returns a backtracking iterator over the values in that array.

query

The query directory contains what are called query operators. A single query to the database may be expressed as a composition of these operators. The sequential scan and index scan operators fetch data from a single table, while the remaining operators take as input the output of one or two operators, transform or combine the input (for example: projecting away columns, or joining records from two input operators together), and return a collection of records.

This is the volcano model, where the operators are layered atop one another, and each operator requests tuples from the input operator(s) as it needs to generate its next output tuple. Note that each operator only fetches tuples from its input operator(s) as needed, rather than all at once!

image of volcano model

A query plan is a composition of query operators, and it describes how a query is executed. Recall that SQL is a declarative language - the user does not specify how a query is run, and only what the query should return. Therefore, there are often many possible query plans for a given query.

The various *Operator classes are the query operators. All operators extend the QueryOperator class. JoinOperator in particular is the base class of all the join operators, which various join implementations extend. You should take a look at the methods that are implemented for you in JoinOperator: it provides methods you may need to deal with tables and the current transaction. You should not be dealing directly with Table objects nor TransactionContext objects while implementing join algorithms in Part 1 (aside from passing them into methods that require them).

The QueryPlan class represents a query. Users of the database create queries using the public methods (such as join(), select(), etc.) and then call execute to generate a query plan for the query and get back an iterator over the resulting data set (which is not fully materialized: the iterator generates each tuple as requested). The current implementation of execute simply calls executeNaive, which joins tables in the order given; your task in Part 2 will be to generate better query plans.

Interface for querying

You should read through the Database.java section of the main README and browse through examples in src/test/java/edu/berkeley/cs186/database/TestDatabase.java to familiarize yourself with how queries are written in our database.

After execute() has been called on a QueryPlan object, you can print the final query plan:

Iterator<Record> result = query.execute();
QueryOperator finalOperator = query.getFinalOperator();
System.out.println(finalOperator.toString());
type: BNLJ
leftColumn: S.sid
rightColumn: E.sid
    (left)
    type: WHERE
    column: E.cid
    predicate: EQUALS
    value: CS 186
        type: SEQSCAN
        table: E

    (right)
    type: SEQSCAN
    table: S

Part 1: Join Algorithms

Part 1 README

Part 2: Cost Estimation and Query Optimization

Part 2 README

Submitting the Assignment

See the main readme for submission instructions. The homework numbers for this homework are hw3_part1 and hw3_part2.

You may not modify the signature of any methods or classes that we provide to youi, but you're free to add helper methods.

You should make sure that all code you modify belongs to files with HW3 todo comments in them (e.g. don't add helper methods to DataBox). A full list of files that you may modify follows:

  • query/BNLJOperator.java
  • query/SortOperator.java
  • query/SortMergeOperator.java
  • table/stats/Histogram.java (Part 2 only)
  • query/QueryPlan.java (Part 2 only)

Make sure that your code does not use any static (non-final) variables - this may cause odd behavior when running with maven vs. in your IDE (tests run through the IDE often run with a new instance of Java for each test, so the static variables get reset, but multiple tests per Java instance may be run when using maven, where static variables do not get reset).

Testing

We strongly encourage testing your code yourself. The given tests for this project are not comprehensive tests: it is possible to write incorrect code that passes them all (but not get full score).

Things that you might consider testing for include: anything that we specify in the comments or in this document that a method should do that you don't see a test already testing for, and any edge cases that you can think of. Think of what valid inputs might break your code and cause it not to perform as intended, and add a test to make sure things are working.

To help you get started, here is one case that is not in the given tests (and will be included in the hidden tests): joining an empty table with another table should result in an iterator that returns no records (hasNext() should return false immediately).

To add a unit test, open up the appropriate test file and simply add a new method to the file with a @Test annotation, for example:

@Test
public void testEmptyBNLJ() {
    // your test code here
}

Many test classes have some setup code done for you already: take a look at other tests in the file for an idea of how to write the test code.

Grading

  • 20% of your overall score will come from your submission for Part 1. We will only be running released Part 1 tests (database.query.*) on your Part 1 submission.
  • The rest of your score will come from your submission for Part 2 (testing both Part 1 and Part 2 code).
  • 60% of your overall score will be made up of the tests released in this homework (the tests that we provided in database.query.* and database.table.stats.*).
  • 40% of your overall score will be made up of hidden, unreleased tests that we will run on your submission after the deadline.
  • Part 1 is worth approximately 33.33% of your HW3 grade and Part 2 is worth approximately 66.67% of your HW3 grade.

For example, if:

  • your part 1 submission passes 44% of the part 1 public tests, and
  • your part 2 submission passes
    • 79% of the part 1 public tests,
    • 65% of the part 1 hidden tests,
    • 55% of the part 2 public tests, and
    • 43% of the part 2 hidden tests,

then,

  • your score for part 1 public tests (20% of grade) would be 0.44/1.00,
  • your score for part 1 hidden tests (13.33% of grade) would be 0.65/1.00,
  • your score for part 2 public tests (40% of grade) would be 0.55/1.00,
  • your score for part 2 hideen tests (26.67% of grade) would be 0.43/1.00,

and your overall score would be: (0.20 * 0.44 + 0.1333 * 0.65) + (0.4 * 0.55 + 0.2667 * 0.43) or about 51%.

You can’t perform that action at this time.