In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib notebook

# Introduction to Data Storage

##### Version 0.1

***

By AA Miller (Northwestern/CIERA)  
15 July 2022

[Session 15](https://github.com/LSSTC-DSFP/LSSTC-DSFP-Sessions/tree/main/Sessions/Session15) is primarily concerned with handling our data with efficiency.

Ideally, for any and every task we want to devise solutions that operate *faster*. 

This can be accomplished in many different ways:

$~~~~~~$build algorithms that execute faster

$~~~~~~$spread calculations over many different computers simultaneously

$~~~~~~$find a compact storage solution for the data so it can be accessed more quickly

In our introduction to this session we will start with data storage, and discuss fast algorithms as a challenge problem. 

Random notes: Scorsese is in the data, Ace Ventura is not, Godfather is not.

## Problem 1) IMDb Data

Throughout the session we will use information from the [Internet Movie Database (IMDb)](https://www.imdb.com/) to illustrate various principles regarding data storage.

For this notebook, we will use a [Google Sheets](https://docs.google.com/spreadsheets/d/1B-C7uJFrVNGpAXsGE6_xymfFVSKhwnIsI_RewkkmGa0/edit?usp=sharing) spreadsheet to examine this data (later in the week we will explore the same data stored in a database). 

A quick note on the provenance of this data. The files we have used to populate this data set are from [this website](https://relational.fit.cvut.cz/dataset/IMDb) and it may not be a list of every single movie on IMDb.

**Problem 1a**

Open the Google Sheet.$^\dagger$ What information is stored in the `movies` sheet of this file? 

How many movies are listed? 

$^\dagger$Note: the link points to a "view" of the data; you may find it useful to copy the file so that you have write access. 

*Write your answer here*

The `movies` sheet contains 4 columns: `id`, `name`, `year`, and `rank`. `id` appears to be a running index counting the movies, `name` is the title of the movie (and the file appears to be organized alphabetically by name), `year` is the release year for the film, and `rank` is the user score (10 = really good) for the film on IMDb. 

There are 388269 movies in the file. This can (with an investment of time) be determined by scrolling all the way to the end of the file and reading off the final index. We can be more efficient and use the built-in count function to determine the number of movies: `=COUNTA(B2:B10000000)`.
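The same count can be sketched in pandas. The `movies` DataFrame below is a hypothetical four-row stand-in for the sheet (not the real data); `Series.count()` counts non-null entries, which is roughly what `COUNTA` does for non-empty cells.

```python
import pandas as pd

# Hypothetical miniature stand-in for the `movies` sheet (NOT the real data).
movies = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "name": ["Film A", "Film B", "Film C", "Film D"],
    "year": [1990, 2001, 1985, 2010],
    "rank": [6.1, None, 7.4, None],  # some films have no user score
})

# Equivalent of =COUNTA(B2:B...): number of non-empty entries in `name`.
n_movies = int(movies["name"].count())

# Counting a sparsely filled column gives a smaller number, so the choice
# of column matters when counting rows this way.
n_ranked = int(movies["rank"].count())

print(f"{n_movies} movies, {n_ranked} with a rank")
```

Counting on a column that is always filled (here `name`) is what makes the result equal to the number of rows.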

**Problem 1b**

What information is stored in the `directors` sheet? 

How many directors are there? 

*Write your answer here*

The `directors` sheet includes 3 columns: id, First Name, Last Name. `id` is a running index, and the name columns are self-explanatory. 

There are 86880 directors total in this sheet. 

**Problem 1c**

What information is stored in the `movies_directors` sheet?

How many rows are there? Does this make sense? Why?

*Write your answer here*

The `movies_directors` sheet includes 2 columns: `movie_id` and `director_id`. `movie_id` corresponds to the `id` column in the `movies` sheet, while `director_id` corresponds to the `id` in the `directors` sheet.

There are 371180 total rows in this sheet. This answer does not really make sense – we know there are 388269 movies in total, so I would expect just as many entries here. Perhaps there are a bunch of movies without directors? 
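One way to test the "movies without directors" guess is sketched below, assuming the sheets have been loaded as pandas DataFrames. The tiny tables here are hypothetical stand-ins, not the real data. The sketch also counts movies with more than one director, since those add rows to the link table.

```python
import pandas as pd

# Hypothetical stand-ins for the `movies` and `movies_directors` sheets.
movies = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "name": ["Film A", "Film B", "Film C", "Film D"],
})
movies_directors = pd.DataFrame({
    "movie_id": [0, 0, 2],        # movie 0 has two directors
    "director_id": [10, 11, 12],  # movies 1 and 3 have none
})

# Movies whose id never appears in the link table have no listed director.
has_director = movies["id"].isin(movies_directors["movie_id"])
n_without_director = int((~has_director).sum())

# Movies that appear more than once have multiple directors.
directors_per_movie = movies_directors.groupby("movie_id")["director_id"].nunique()
n_multiple_directors = int((directors_per_movie > 1).sum())

print(n_without_director, n_multiple_directors)
```

Because missing directors remove rows and co-directed movies add them, the row count of the link table alone cannot say how many movies fall in each group.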

**Problem 1d**

Confirm your column identifications in **1c** by finding your favorite movie and making sure `movies_directors` correctly matches it with the proper director.

*Write your answer here*

My favorite movie is "Wayne's World" (at least it's top 5), and that has movie id = 360290. The director was Penelope Spheeris, who has director id = 75368. In the `movies_directors` sheet, this `movie_id` and `director_id` match up, confirming my answer from **1c**. 
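The manual lookup above amounts to following `movie_id` and `director_id` across three tables. A minimal sketch with `DataFrame.merge` follows; the data, the film titles, and the column names `first_name` and `last_name` are all hypothetical placeholders rather than values from the sheet.

```python
import pandas as pd

# Hypothetical stand-ins for the three sheets (NOT the real data).
movies = pd.DataFrame({"id": [1, 2], "name": ["Film A", "Film B"]})
directors = pd.DataFrame({
    "id": [10, 20],
    "first_name": ["Ann", "Bob"],
    "last_name": ["Lee", "Ray"],
})
movies_directors = pd.DataFrame({"movie_id": [1, 2], "director_id": [20, 10]})

# Follow movie_id -> movies.id, then director_id -> directors.id.
joined = (
    movies_directors
    .merge(movies, left_on="movie_id", right_on="id")
    .drop(columns="id")
    .merge(directors, left_on="director_id", right_on="id")
    .drop(columns="id")
)

# Look up the director(s) of one title.
match = joined.loc[joined["name"] == "Film A", ["first_name", "last_name"]]
print(match)
```

If the column identifications from **1c** were wrong, the name returned for a known movie would not be its actual director.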

**Problem 1e**

What information is stored in the `movies_genres` sheet?

How many rows are there? Does this make sense? Why?

*Write your answer here*

The `movies_genres` sheet includes 2 columns: `movie_id` and `genre`. `movie_id` corresponds to the `id` column in the `movies` sheet, while `genre` is one of several potential genres for the movie.

There are 395119 total rows in this sheet. A quick glance at the sheet reveals two things: (i) there are several movies that do not have an identified genre (e.g., movie_id = 3), and (ii) there are movies that have multiple genres (e.g., movie_id = 8). 

It is hard to know how many rows to expect given these two facts, but given that many films fall into multiple genres it makes sense this sheet has more rows than there are total movies.
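Both effects can be quantified rather than glanced at. The sketch below counts genres per movie on hypothetical toy tables (not the real data), including movies that never appear in `movies_genres`.

```python
import pandas as pd

# Hypothetical stand-ins for the `movies` and `movies_genres` sheets.
movies = pd.DataFrame({"id": [0, 1, 2, 3]})
movies_genres = pd.DataFrame({
    "movie_id": [0, 0, 2, 2, 2],
    "genre": ["Comedy", "Drama", "Short", "Comedy", "Documentary"],
})

# Number of genre rows per movie, for movies that have at least one genre.
genres_per_movie = movies_genres.groupby("movie_id")["genre"].count()

# Add the movies with no genre rows at all, with a count of zero.
genres_per_movie = genres_per_movie.reindex(movies["id"], fill_value=0)

n_no_genre = int((genres_per_movie == 0).sum())
n_multi_genre = int((genres_per_movie > 1).sum())
total_rows = int(genres_per_movie.sum())

print(n_no_genre, n_multi_genre, total_rows)
```

In this toy case two of the four movies have no genre, yet the table still has five rows, more than the number of movies, because the remaining films carry several genres each.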

## Problem 2) Connections