# SQL Assignment

SQL is the language of relational databases. Knowing how to query and manipulate structured data will help you when you build and use your own database. In this project, we'll practice writing SQL queries against an IMDb database.

This notebook contains the prompts, and for each one you will write an SQL query that returns the desired data. At the end, there is a section for you to see if the queries make sense or not.

## Getting started

Confirm that you can see 'SQL_Assignment.ipynb' and 'database.db' in the JupyterLab sidebar. If so, you're ready to go.

## **SQL Queries**

This assignment consists of a set of prompts. For each one, write an SQL query that returns the appropriate results from "database.db". The assignment has three parts, and the queries increase in complexity from one part to the next.

We have written a helper function for you that conveniently returns results as a [pandas dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). Make sure that you run the following cell before trying to run any subsequent cells, as otherwise the environment will not register this helper function:

In [None]:
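import sqlite3
from contextlib import closing
import pandas as pd

def query_sql(query):
    """Run a SQL query against database.db and return a pandas DataFrame."""
    # str.isspace() is False for "", so strip() is used to also catch an empty string.
    if not query.strip():
        return "An empty query won't find anything!"
    # closing() releases the database connection once the query has run.
    with closing(sqlite3.connect('database.db')) as conn:
        return pd.read_sql(query, conn)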

You can pass an SQL query to this helper function as a string, and it will return a dataframe. For example, the following query returns the `CREATE TABLE` statement of each table in the database, which shows the table's name and columns:

In [None]:
pd.set_option('display.max_colwidth', None)
query_sql("SELECT sql FROM sqlite_master WHERE type='table';")
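
The `sql` column of `sqlite_master` holds each table's `CREATE TABLE` statement. If only the table names or a single table's columns are of interest, the `name` column and `PRAGMA table_info` are one way to get them. A minimal sketch, using an in-memory database with a hypothetical `demo` table so that it runs without `database.db`:

```python
import sqlite3

# In-memory database with a hypothetical table, standing in for database.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, label TEXT)")

# The name column of sqlite_master holds each table's name.
table_names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;")]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(demo);")]

print(table_names)
print(columns)
conn.close()
```

The same two statements can also be passed as strings to `query_sql` to inspect the tables in `database.db`.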

There are 5 relevant tables in this assignment. Their table and column names are as follows:
- `names` : a table containing each director's name and details
    - `nconst` : a text value representing the unique id of each director
    - `primary_name` : a text value holding the primary name of the director
    - `birth_year` : an integer representing the year of birth of the director
    - `death_year` : an integer representing the year of death of the director
- `ratings` : a table containing the ratings of the movies
    - `tconst` : a text value representing the unique id of each movie
    - `avg_rating` : a number representing the average rating of the movie
    - `num_votes` : an integer representing the number of votes cast for the movie
- `crews` : a table showing which movie is directed by which director
    - `id` : the auto-incrementing id for each movie-director pairing
    - `tconst` : a text value representing the unique id of each movie (foreign key referencing `tconst` in `titles`)
    - `director` : a text value representing the unique id of each director (foreign key referencing `nconst` in `names`)
- `titles` : a table containing the title and details of each movie
    - `tconst` : a text value representing the unique id of each movie
    - `type` : the type of media (all are "movie" in this case)
    - `primary_title` : a text value holding the title of the movie
    - `original_title` : a text value holding the "original" title of the movie (in the case of a foreign movie)
    - `start_year` : an integer representing the year the movie was released in theatres
    - `runtime_minutes` : an integer representing the running time of the movie in minutes
- `title_genres` : a table showing the genre of each movie
    - `tconst` : a text value representing the unique id of each movie
    - `genre` : a text value holding the genre of the movie

Here is another query that retrieves all of the rows in the `titles` table:

In [None]:
query_sql("SELECT * FROM titles")

You're now ready to start the assignment!

## **Part 1:**

1. Write a SQL query to find the `primary_title` of the top 10 movies that have received an `avg_rating` of at least 8.5 or have been voted on by at least 1000000 people. Your results should be sorted by the average rating in descending order.

In [None]:
## Answer for Part 1 Question 1
p1q1 = """
"""
# For testing
query_sql(p1q1)

2. Write a SQL query that gets the director’s primary_name for the movie 'Poeta'.

In [None]:
## Answer for Part 1 Question 2
p1q2 = """

"""

# For testing
query_sql(p1q2)

3. Write a SQL query that returns the primary_name of each director and the count of titles directed by that director. Results should be ordered by the count of titles in descending order.

In [None]:
## Answer for Part 1 Question 3
p1q3 = """

"""

# For testing
query_sql(p1q3)

4. Write a SQL query that returns each distinct genre and the average runtime of the movies in that genre.

In [None]:
## Answer for Part 1 Question 4
p1q4 = """

"""

# For testing
query_sql(p1q4)

5. Write a SQL query that finds primary_name of directors who have directed movies that received an average rating less than or equal to 5.

In [None]:
## Answer for Part 1 Question 5
p1q5 = """

"""

# For testing
query_sql(p1q5)

## **Part 2:**

1. Write a SQL query that returns the primary_title and start_year of movies whose titles begin with 'The' and that were released in a leap year.

In [None]:
## Answer for Part 2 Question 1
p2q1 = """

"""

# For testing
query_sql(p2q1)

2. Write a SQL query that returns the primary_title and start_year of movies that were released 19 years after the birth year of Walt Disney. (Hint: the CAST function may come in handy)
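
As a general illustration of the hint, `CAST(expr AS type)` converts a value to another type before it is used. The sketch below runs on a hypothetical `events` table in an in-memory database (neither is part of `database.db`), with the year deliberately stored as text, and uses SQLite's `typeof()` to show the type before and after the cast:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for illustration; the year is deliberately stored as TEXT.
conn.execute("CREATE TABLE events (label TEXT, year_text TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("launch", "1990"), ("landing", "2004")])

# CAST converts the stored text to an INTEGER value;
# typeof() reports the storage type before and after the conversion.
rows = conn.execute("""
    SELECT label,
           typeof(year_text)                      AS stored_type,
           CAST(year_text AS INTEGER)             AS year_int,
           typeof(CAST(year_text AS INTEGER))     AS cast_type
    FROM events
    ORDER BY label;
""").fetchall()

print(rows)
conn.close()
```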

In [None]:
## Answer for Part 2 Question 2
p2q2 = """

"""

# For testing
query_sql(p2q2)

3. Write a SQL query that finds the primary_title, start_year and runtime_minutes of all movies whose runtime_minutes exceeds the average runtime_minutes of movies released in the same year. Results should be ordered by start_year ascending and runtime_minutes descending.

In [None]:
## Answer for Part 2 Question 3
p2q3 = """

"""
# For testing
query_sql(p2q3)

## **Part 3:**

1. Write a SQL query that finds the primary_title, avg_rating and review of 20 movies. The review depends on the average rating. A movie with a rating less than or equal to 3 should be reviewed as ‘poor’. A movie with a rating greater than 3 and less than or equal to 6 should be reviewed as ‘okay’, and a movie with a rating greater than 6 should be reviewed as ‘good’. The result should be sorted by title in descending order. (Hint: the CASE expression may come in handy)
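
As a general illustration of the hint, a `CASE` expression evaluates its `WHEN` conditions from top to bottom and returns the value of the first one that matches, falling back to `ELSE`. The sketch below uses a hypothetical `readings` table in an in-memory database, unrelated to `database.db`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for illustration only.
conn.execute("CREATE TABLE readings (city TEXT, temp_c REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("Oslo", -2.0), ("Lima", 19.5), ("Cairo", 34.0)])

# WHEN branches are tested in order; the first match wins, ELSE is the fallback.
rows = conn.execute("""
    SELECT city,
           CASE
               WHEN temp_c < 0  THEN 'freezing'
               WHEN temp_c < 25 THEN 'mild'
               ELSE 'hot'
           END AS label
    FROM readings
    ORDER BY city;
""").fetchall()

print(rows)
conn.close()
```

Because the branches are checked in order, the second condition does not need to repeat the lower bound already excluded by the first.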

In [None]:
## Answer for Part 3 Question 1
p3q1 = """

"""

# For testing
query_sql(p3q1)

2. Write a SQL query that returns the decades and percentage of titles which were released in the corresponding decade.

In [None]:
## Answer for Part 3 Question 2
p3q2 = """

"""

# For testing
query_sql(p3q2)

3. Write a SQL query that returns the name of each director and a boolean value indicating whether the director has directed 5 or more movies. The boolean value should be 1 if true, otherwise 0.

In [None]:
## Answer for Part 3 Question 3
p3q3 = """

"""

# For testing
query_sql(p3q3)

## **Submission**

To submit this assignment, run the cell below. It will create a zip file called `submit-me.zip` in this assignment's root directory. This zip contains all the queries you created in this notebook.

Then, submit this `submit-me.zip` file to Gradescope under "Assignment 2: SQL" (using the "Upload" option, not submitting through GitHub).

In [None]:
import os, shutil
result = [p1q1, p1q2, p1q3, p1q4, p1q5, p2q1, p2q2, p2q3, p3q1, p3q2, p3q3]
filenames = ["1_1", "1_2", "1_3", "1_4", "1_5", "2_1", "2_2", "2_3", "3_1", "3_2", "3_3"]
working_dir = "temp"
zip_path = "../submit-me.zip"
if os.path.exists(zip_path):
    os.remove(zip_path)
os.makedirs(working_dir, exist_ok=True)

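# Write each query to its own .sql file inside the temporary directory.
for name, query in zip(filenames, result):
    with open(os.path.join(working_dir, name + '.sql'), 'w') as file:
        file.write(query)

# -j stores the files in the archive without their directory path.
!zip -j {zip_path} {working_dir}/*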
shutil.rmtree(working_dir)
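
The cell above relies on a `zip` command-line tool being installed. If it is not available on your machine, a similar archive could be built with Python's standard `zipfile` module. The sketch below is an assumption about what an equivalent might look like, not the official submission code; the helper name `write_submission` and the placeholder queries are hypothetical.

```python
import os
import tempfile
import zipfile


def write_submission(queries, names, zip_path):
    """Store each query as <name>.sql at the top level of a zip archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, query in zip(names, queries):
            # writestr stores the text directly, so no temporary directory is needed.
            archive.writestr(name + ".sql", query)


# Demonstration with placeholder queries, written to a temporary location.
demo_path = os.path.join(tempfile.mkdtemp(), "submit-me.zip")
write_submission(["SELECT 1;", "SELECT 2;"], ["1_1", "1_2"], demo_path)

with zipfile.ZipFile(demo_path) as archive:
    archived = sorted(archive.namelist())
    first_query = archive.read("1_1.sql").decode()

print(archived)
```

In the notebook, the call would presumably take `result`, `filenames` and `zip_path` from the cell above in place of the placeholder values.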
        

## **END OF THE ASSIGNMENT**

If you pass everything, congrats! You've learned how to write SQL queries ranging from basic to complex, which is essential for working with databases.