# Jupyter notebook

## What is Jupyter notebook?

* Interactive app, opens in browser
* Code, output, and text together
* Supports different programming languages

<div class="alert alert-block alert-info">
    Try it out: Open this notebook on your own device!
</div>

## Overview of a notebook

* Edit mode: edit cells of the notebook
* Command mode: use keyboard shortcuts (useful example: h = help)
* Press Esc to exit edit mode and return to command mode
* Execute a cell by pressing the "Run" button (top menu bar) or hitting Ctrl-Enter

This is a markdown cell. You can switch from a markdown to a code cell using the drop-down in the menu bar.

The next cell below is a code cell. You can set variables, define functions, and more. Any output is printed directly below the cell when it is executed.

In [4]:
x = 1
y = x + 1
print(y)

2


Variables defined in already-executed cells are available in other cells!

In [5]:
print(x)

1


<div class="alert alert-block alert-info">
Try it out: create a python notebook, execute some code, and describe it in a markdown block.
</div>

For more information about Jupyter notebooks, [read the docs](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html).

# NumPy
The NumPy package provides the "ndarray" object along with many efficient functions for working with arrays. The NumPy array is used to contain data of uniform type with an arbitrary number of dimensions. NumPy then provides basic mathematical and array methods to lay down the foundation for the entire SciPy ecosystem. The following import statement is the generally accepted convention for NumPy.

In [26]:
import numpy as np

## Generating evenly spaced data

In this workshop, we will mainly use NumPy's `arange` function, which generates evenly spaced numerical data within an interval.

In [29]:
np.arange(0,1,.1)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

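Note that the stop value `1` is not included in the array: `arange` generates values in the half-open interval `[start, stop)`. To include `1`, extend the stop value by one step (here, from `1` to `1.1`):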

In [31]:
np.arange(0, 1.1, .1)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
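Because floating-point steps can accumulate rounding error, `arange` with a non-integer step may occasionally include or drop the last value. One possible alternative, not used elsewhere in this workshop, is `np.linspace`, which takes the number of points rather than a step size and includes the endpoint by default. The variable name `points` below is only for illustration.

```python
import numpy as np

# 11 evenly spaced values from 0 to 1; the endpoint is included by default
points = np.linspace(0, 1, 11)
print(points)
print(len(points))
```

Choosing the number of points up front avoids having to pad the stop value by hand.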

NumPy also has vectorized versions of common mathematical functions, for example, `sin` and `cos`. If you ever get stuck with a Python function, the first thing to try is the `help()` function:

In [33]:
help(np.sin)

Help on ufunc object:

sin = class ufunc(builtins.object)
 |  Functions that operate element by element on whole arrays.
 |  
 |  To see the documentation for a specific ufunc, use `info`.  For
 |  example, ``np.info(np.sin)``.  Because ufuncs are written in C
 |  (for speed) and linked into Python with NumPy's ufunc facility,
 |  Python's help() function finds this page whenever help() is called
 |  on a ufunc.
 |  
 |  A detailed explanation of ufuncs can be found in the docs for :ref:`ufuncs`.
 |  
 |  Calling ufuncs:
 |  
 |  op(*x[, out], where=True, **kwargs)
 |  Apply `op` to the arguments `*x` elementwise, broadcasting the arguments.
 |  
 |  The broadcasting rules are:
 |  
 |  * Dimensions of length 1 may be prepended to either array.
 |  * Arrays may be repeated along dimensions of length 1.
 |  
 |  Parameters
 |  ----------
 |  *x : array_like
 |      Input arrays.
 |  out : ndarray, None, or tuple of ndarray and None, optional
 |      Alternate array object(s) in which to
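As a small illustration of what "vectorized" means here, the sketch below applies `np.sin` to a whole array in a single call, with no explicit loop. The names `angles` and `sines` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

# Four angles: 0, pi/2, pi, and 3*pi/2
angles = np.arange(4) * np.pi / 2

# np.sin is applied element by element and returns an array of the same shape
sines = np.sin(angles)
print(np.round(sines, 3))
```

As the help text itself suggests, `np.info(np.sin)` shows the documentation specific to `sin`.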

## Bonus: Array Creation
There are several ways to make NumPy arrays. An array has three particular attributes that can be queried: shape, size and the number of dimensions.

In [None]:
a = np.array([1, 2, 3])
print(a.shape)
print(a.size)
print(a.ndim)

In [None]:
x = np.arange(100)
print(x.shape)
print(x.size)
print(x.ndim)

In [None]:
y = np.random.rand(5, 80)
print(y.shape)
print(y.size)
print(y.ndim)

## Bonus: Array Manipulation
Assigning to the `shape` attribute changes the shape of an array in place, without copying the data.

In [None]:
x.shape = (20, 5)
print(x)

NumPy can even infer the size of at most one dimension for you: pass `-1` for that dimension.

In [None]:
y.shape = (4, 20, -1)
print(y.shape)
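A related sketch, assuming a freshly created (contiguous) array: the `reshape` method also accepts `-1` for one dimension, and in this case returns a view that shares memory with the original instead of a copy. The names `flat` and `grid` are illustrative.

```python
import numpy as np

flat = np.arange(12)

# Ask for 3 rows and let NumPy infer the number of columns
grid = flat.reshape(3, -1)
print(grid.shape)

# The reshaped array is a view: both names refer to the same data
print(np.shares_memory(flat, grid))

# Writing through the view is therefore visible in the original array
grid[0, 0] = 99
print(flat[0])
```

Unlike assigning to `shape`, `reshape` leaves the original array's shape untouched.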

## Bonus: Array Indexing

In [None]:
# Scalar Indexing
print(x[2])

In [None]:
# Slicing
print(x[2:5])

In [None]:
# Advanced slicing
print("First 5 rows\n", x[:5])
print("Row 18 to the end\n", x[18:])
print("Last 5 rows\n", x[-5:])
print("Reverse the rows\n", x[::-1])

In [None]:
# Boolean Indexing
print(x[(x % 2) == 0])

In [None]:
# Fancy Indexing -- Note the use of a list, not tuple!
print(x[[1, 3, 8, 9, 2]])
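To see why the list-versus-tuple distinction matters, the sketch below indexes a 2-D array both ways. It rebuilds an array with the same `(20, 5)` shape as `x` under the illustrative name `grid` so that it runs on its own.

```python
import numpy as np

grid = np.arange(100).reshape(20, 5)

# A list selects several entries along the first axis: rows 1 and 3
rows = grid[[1, 3]]
print(rows.shape)  # (2, 5)

# Comma-separated indices (a tuple) give one index per axis: row 1, column 3
element = grid[1, 3]
print(element)  # 8
```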

## Bonus: Broadcasting
Broadcasting is a very useful feature of NumPy that will let arrays with differing shapes still be used together. In most cases, broadcasting is faster, and it is more memory efficient than the equivalent full array operation.

In [None]:
print("Shape of X:", x.shape)
print("Shape of Y:", y.shape)

Now, here are three equivalent additions. The first takes full advantage of broadcasting by letting NumPy automatically add a new dimension on the *left*. The second adds that dimension explicitly with the special NumPy alias `np.newaxis`. Both of these create a singleton dimension without creating any new arrays. That singleton dimension is then implicitly tiled to match the other operand of the addition, much like the explicit tiling in the third example. Unlike the third example, however, broadcasting merely reuses the existing data in memory.

In [None]:
a = x + y
print(a.shape)
b = x[np.newaxis, :, :] + y
print(b.shape)
c = np.tile(x, (4, 1, 1)) + y
print(c.shape)
print("Are a and b identical?", np.all(a == b))
print("Are a and c identical?", np.all(a == c))
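One way to check the memory claim is with `np.broadcast_to`, which returns a read-only broadcast view of an array. This is a minimal sketch rather than a description of what NumPy does internally during `x + y`; the names `base`, `view`, and `tiled` are illustrative.

```python
import numpy as np

base = np.arange(100).reshape(20, 5)

# Broadcast to a new leading dimension of length 4 without copying any data
view = np.broadcast_to(base, (4, 20, 5))
print(view.shape)
print(view.strides[0])               # stride of 0: the same block is reused 4 times
print(np.shares_memory(base, view))  # the view points at the original data

# np.tile, by contrast, really allocates four copies
tiled = np.tile(base, (4, 1, 1))
print(tiled.nbytes // base.nbytes)
```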

Another example of broadcasting two 1-D arrays to make a 2-D array.

In [None]:
x = np.arange(-5, 5, 0.1)
y = np.arange(-8, 8, 0.25)
print(x.shape, y.shape)
z = x[np.newaxis, :] * y[:, np.newaxis]
print(z.shape)

In [None]:
# More concisely
y, x = np.ogrid[-8:8:0.25, -5:5:0.1]
print(x.shape, y.shape)
z = x * y
print(z.shape)

# Pandas

Pandas is a popular tool for data science. Unlike NumPy arrays, whose elements must all have the same data type (usually numerical), pandas has a data structure called a dataframe, whose columns may have different data types. For example, we may have a dataframe about books, with "Author" and "Title" columns containing strings, a "Copyright date" column containing a datetime, and a "Number of pages" column containing an integer.

In [7]:
import pandas as pd
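The books dataframe described above could be sketched as follows. The rows are placeholders invented purely for illustration, not real bibliographic data, and the variable name `books` is likewise an assumption.

```python
import pandas as pd

# Placeholder rows for illustration only; not real bibliographic data
books = pd.DataFrame({
    "Author": ["Author A", "Author B"],
    "Title": ["Title A", "Title B"],
    "Copyright date": pd.to_datetime(["2001-01-01", "2010-06-15"]),
    "Number of pages": [250, 320],
})

# Each column keeps its own data type
print(books.dtypes)
```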

Pandas excels at reading CSV files, including ones you find online.

In [8]:
df = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-22/horror_movies.csv")

Print out the first few rows of your dataframe:

In [9]:
df.head()

Unnamed: 0,title,genres,release_date,release_country,movie_rating,review_rating,movie_run_time,plot,cast,language,filming_locations,budget
0,Gut (2012),Drama| Horror| Thriller,26-Oct-12,USA,,3.9,91 min,"Directed by Elias. With Jason Vail, Nicholas W...",Jason Vail|Nicholas Wilder|Sarah Schoofs|Kirst...,English,"New York, USA",
1,The Haunting of Mia Moss (2017),Horror,13-Jan-17,USA,,,,"Directed by Jake Zelch. With Nicola Fiore, Bri...",Nicola Fiore|Brinke Stevens|Curtis Carnahan|Ja...,English,,"$30,000"
2,Sleepwalking (2017),Horror,21-Oct-17,Canada,,,,"Directed by David Briggs. With Alysia Topol, A...",Alysia Topol|Anthony Makela|Kelsi Ashley|Patri...,English,"Sudbury, Ontario, Canada",
3,Treasure Chest of Horrors II (2013),Comedy| Horror| Thriller,23-Apr-13,USA,NOT RATED,3.7,82 min,"Directed by M. Kelley, Shawn C. Phillips, Alex...",Veronica Ricci|Nicholas Adam Clark|James Culle...,English,"Baltimore, Maryland, USA",
4,Infidus (2015),Crime| Drama| Horror,10-Apr-15,USA,,5.8,80 min,"Directed by Giulio De Santi. With Bonini Mino,...",Bonini Mino|Massimo Caratelli|Maurizio Zaffino...,Italian,,


## Investigating your dataframe

Get column names:

In [10]:
df.columns

Index(['title', 'genres', 'release_date', 'release_country', 'movie_rating',
       'review_rating', 'movie_run_time', 'plot', 'cast', 'language',
       'filming_locations', 'budget'],
      dtype='object')

Basic description of numerical columns:

In [11]:
df.describe()

|       | review_rating |
|-------|---------------|
| count | 3076.0        |
| mean  | 5.077016      |
| std   | 1.474272      |
| min   | 1.0           |
| 25%   | 4.0           |
| 50%   | 5.0           |
| 75%   | 6.1           |
| max   | 9.8           |


What are the datatypes of the other columns?

In [12]:
df.dtypes

title                 object
genres                object
release_date          object
release_country       object
movie_rating          object
review_rating        float64
movie_run_time        object
plot                  object
cast                  object
language              object
filming_locations     object
budget                object
dtype: object

Hmm... why would budget be an object instead of a number? Let's take a look at the first 10 unique values in this column:

In [16]:
df['budget'].unique()[:10]

array([nan, '$30,000', '$3,400,000', '$7,000', 'INR\xa06,000,000',
       '$150,000', '£100,000', '$1,023', '$3,000,000', '£25,000'],
      dtype=object)

So it looks like budget is an object (in this case a string) because it is formatted like a currency.
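If numeric budgets are needed, one possible cleanup is to strip everything except digits and convert the result with `pd.to_numeric`. This is a rough sketch under a big assumption: it ignores the currency entirely, so dollar, pound, and rupee amounts end up on the same scale. The names `budget` and `amounts` are illustrative.

```python
import numpy as np
import pandas as pd

# A few budget strings copied from the unique values shown above
budget = pd.Series([np.nan, "$30,000", "$3,400,000", "INR\xa06,000,000", "£100,000"])

# Remove everything except digits and the decimal point, then convert to numbers
amounts = pd.to_numeric(budget.str.replace(r"[^\d.]", "", regex=True))
print(amounts)
```

Missing budgets stay missing, so the column can still be summarized with `describe()` afterwards.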

## Filtering based on column values

Filtering based on the value of columns is straightforward. Put your filter inside of square brackets: `df[<filter>]`. This will return a filtered dataframe. You can save it as a new dataframe if you'd like.

In [19]:
df_filtered = df[df['release_country'] == 'Canada']
df_filtered.head()

Unnamed: 0,title,genres,release_date,release_country,movie_rating,review_rating,movie_run_time,plot,cast,language,filming_locations,budget
2,Sleepwalking (2017),Horror,21-Oct-17,Canada,,,,"Directed by David Briggs. With Alysia Topol, A...",Alysia Topol|Anthony Makela|Kelsi Ashley|Patri...,English,"Sudbury, Ontario, Canada",
178,Secret Santa (2015),Horror,28-Nov-15,Canada,,6.1,,Directed by Mike McMurran. With Annette Woznia...,Annette Wozniak|Geoff Almond|Keegan Chambers|B...,English,"Cambridge, Ontario, Canada","CAD 6,000"
234,There Are Monsters (2013),Horror,13-Sep-13,Canada,,5.1,90 min,"Directed by Jay Dahl. With Matthew Amyotte, Ja...",Matthew Amyotte|Jason Daley|Michael Ray Fox|Gu...,English,,
272,The Door (2014),Horror,14-Oct-14,Canada,,4.5,,Directed by Patrick McBrearty. With Alys Crock...,Alys Crocker|Sam Kantor|Matt O'Connor|Winny Cl...,English,,
280,Black Forest (2015),Horror,31-Jan-15,Canada,,4.4,,Directed by David Briggs. With Marie-Josee Dio...,Marie-Josee Dionne|France Huot|Jayson Stewart|...,English,"Sudbury, Ontario, Canada","CAD 200,000"
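Filters can also be combined. The sketch below uses a small hand-built dataframe named `sample` (an illustrative name) holding a few rows from the outputs above, so that it runs without downloading the CSV. Each condition is wrapped in parentheses and joined with `&` (and) or `|` (or).

```python
import numpy as np
import pandas as pd

# A few rows taken from the outputs shown earlier in this notebook
sample = pd.DataFrame({
    "title": ["Gut (2012)", "Sleepwalking (2017)", "Secret Santa (2015)", "Infidus (2015)"],
    "release_country": ["USA", "Canada", "Canada", "USA"],
    "review_rating": [3.9, np.nan, 6.1, 5.8],
})

# Keep Canadian releases with a review rating above 5
canada_well_rated = sample[
    (sample["release_country"] == "Canada") & (sample["review_rating"] > 5)
]
print(canada_well_rated)
```

Rows with a missing rating fail the `> 5` comparison, so they are dropped by this filter.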


We can also create new columns in our dataframe. Accessing or creating columns uses syntax similar to that of Python dictionaries. Use the `map` method to run an operation on every value in a column.

In [24]:
df['from_canada'] = df['release_country'].map(lambda x: x == 'Canada')
df.columns

Index(['title', 'genres', 'release_date', 'release_country', 'movie_rating',
       'review_rating', 'movie_run_time', 'plot', 'cast', 'language',
       'filming_locations', 'budget', 'from_canada'],
      dtype='object')
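For a simple comparison like this one, the same column can also be built without `map`, by comparing the whole column at once. A minimal sketch on a small illustrative dataframe (the name `sample` and both new column names are assumptions):

```python
import pandas as pd

sample = pd.DataFrame({"release_country": ["USA", "Canada", "Canada", "USA"]})

# Element-by-element version using map, as in the cell above
sample["from_canada_map"] = sample["release_country"].map(lambda country: country == "Canada")

# Vectorized version: compare the whole column in one operation
sample["from_canada_vec"] = sample["release_country"] == "Canada"
print(sample)
```

The `map` approach remains useful when the per-value operation is more involved than a single comparison.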