In [1]:
import numpy as np
from datascience import *

In [2]:
# the table we'll be using throughout these notes
imdb = Table.read_table("data8assets/materials/fa16/lab/lab03/imdb.csv")
imdb

Votes,Rating,Title,Year,Decade
88355,8.4,M (1931),1931,1930
132823,8.3,Singin' in the Rain (1952),1952,1950
74178,8.3,All About Eve (1950),1950,1950
635139,8.6,Léon (1994),1994,1990
145514,8.2,The Elephant Man (1980),1980,1980
425461,8.3,Full Metal Jacket (1987),1987,1980
441174,8.1,Gone Girl (2014),2014,2010
850601,8.3,Batman Begins (2005),2005,2000
37664,8.2,Judgment at Nuremberg (1961),1961,1960
46987,8.0,Relatos salvajes (2014),2014,2010


# Using functions
Everything I'm about to say applies to every function you might use, whether it's `np.arange()`, `Table.join()`, `max()`, `make_array()`, whatever. They all work the same way!

At its most basic level, a function **takes in** arguments, does some stuff to them, then **returns** some value. There are some functions that just perform operations and change their argument without returning anything, but for this class we're only going to work with functions that return some value. Take note of data types whenever you deal with functions: what type goes in, and what type comes out.
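To make the returns-versus-changes distinction concrete, here is a minimal sketch in plain Python (no `datascience` needed). The list `votes` holds a few vote counts copied from the preview above; `sorted()` returns a new list, while the list method `.sort()` changes the list in place and returns `None`.

```python
# A few vote counts from the table preview, as a plain Python list.
votes = [88355, 132823, 74178]

# sorted() RETURNS a new list and leaves the original alone.
in_order = sorted(votes)
print(in_order)     # [74178, 88355, 132823]
print(votes)        # [88355, 132823, 74178]

# list.sort() CHANGES the list in place and returns None.
result = votes.sort()
print(result)       # None
print(votes)        # [74178, 88355, 132823]
```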

**_Every function call_** will look something like this: `function_name(argument1, argument2, ...)` depending on the number of arguments you need to pass in.

In [25]:
make_array(1, 2, 3, 4, 5)               # Takes in integers and returns an array of integers.
make_array('a', 'b', 'c', 'd')          # Takes in strings and returns an array of strings.

int('234')                              # Takes in a string and returns an int (type conversion)

sum(make_array(1, 3, 4, 5, 6, 3))       # Takes in an array and returns a single int

type('10000')                           # Takes in anything and returns what data type it is
                                        # (This one's useful for debugging!)

str

There are also functions that fit a slightly different pattern, but really are in the same format. When we use numpy (`np`) functions or table methods, we have to say where the function lives by putting a name and a dot in front of it. For numpy, that name is the **package** or **module**, which is basically just a group of related functions and objects that we can use. For table methods, it's the table itself (like `imdb`). The ones we use most often in this class are `numpy` (`np`) and `Table`, although you have also seen the `math` module. Here's what they look like:

In [26]:
imdb.column('Votes')                    # Takes in a column name (string) and returns an array of integers.
imdb.where('Rating', are.above(8.5))    # Takes in a column name (string) and an are.above predicate, and
                                        #     returns a table.

np.arange(0, 10, 2)                     # Takes in start, stop, and step values and returns an array of integers.

array([0, 2, 4, 6, 8])

In [27]:
# Some clarification:
imdb.num_rows   # number of rows/cols is not a function, it's a characteristic of the table, so no parens needed

250
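The same attribute-versus-method split shows up all over Python, not just on tables. As a small illustration, here is a built-in complex number (chosen only as a stand-in, since it needs no imports): `.real` is an attribute, so no parentheses, while `.conjugate()` is a method, so it needs them.

```python
z = 3 + 4j

real_part = z.real            # attribute: no parentheses
flipped = z.conjugate()       # method: parentheses make the call happen

print(real_part)              # 3.0
print(flipped)                # (3-4j)

# Forgetting the parentheses does not raise an error; you just get the
# method itself back instead of its result.
not_called = z.conjugate
print(callable(not_called))   # True
```

If a cell ever prints something that starts with `<bound method ...`, missing parentheses are a likely cause.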

## Stringing functions together

**Obnoxious Question:** I want the average year of release (as a string) of movies where the rating is above 8.2 and the number of votes is less than 100,000. And I want you to write it all on one line. (Obviously this isn't a realistic demand on a homework/lab/project, but let's just do it for funsies)

Let's break it up into steps that make sense in order:
* start with table imdb
* filter movies where votes is below 100,000
* filter movies where rating is above 8.2
* get the years (probably as an array) of those movies after filtering
* take the average year
* convert that average to a string

In [28]:
# 1. start with the table
imdb

# 2. get the movies that have fewer than 100,000 votes
imdb.where('Votes', are.below(100000))       # Note this returns a table. That's why we can do this next step:

# 3. from the above, also get where the rating is above 8.2
imdb.where('Votes', are.below(100000)).where('Rating', are.above(8.2))

# 4. get the years from the table
imdb.where('Votes', are.below(100000)).where('Rating', are.above(8.2)).column('Year')

# 5. take the average of those years
np.average(imdb.where('Votes', are.below(100000)).where('Rating', are.above(8.2)).column('Year'))

# 6. turn all of that into a string
str(np.average(imdb.where('Votes', are.below(100000)).where('Rating', are.above(8.2)).column('Year')))

'1949.46153846'
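If the one-liner is hard to read, the same steps can be written one per line with a named result for each. The sketch below is only a stand-in: it uses plain Python lists and just the ten preview rows shown at the top of these notes (not the full 250-row table), so its answer differs from the `'1949.46153846'` above. The name `preview` is made up for this example.

```python
# (votes, rating, year) for the ten preview rows only.
preview = [
    (88355, 8.4, 1931), (132823, 8.3, 1952), (74178, 8.3, 1950),
    (635139, 8.6, 1994), (145514, 8.2, 1980), (425461, 8.3, 1987),
    (441174, 8.1, 2014), (850601, 8.3, 2005), (37664, 8.2, 1961),
    (46987, 8.0, 2014),
]

few_votes = [row for row in preview if row[0] < 100000]     # step 2: filter on votes
well_rated = [row for row in few_votes if row[1] > 8.2]     # step 3: filter on rating
years = [row[2] for row in well_rated]                      # step 4: pull out the years
average_year = sum(years) / len(years)                      # step 5: average them
answer = str(average_year)                                  # step 6: convert to a string

print(years)    # [1931, 1950]
print(answer)   # 1940.5
```

Naming each intermediate result also makes it easy to print any one of them and check it before moving on.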

**Obnoxious Question 2:** I want the length of the title of the highest-rated movie that was released in 1980 or earlier and that has more than 90,000 votes. (You should get 20. Also, don't worry about stripping the (1950) or whatever year off the title.)

_Tip:_ Check after every step that the output is what you're expecting/makes sense (e.g. that you don't get an empty table if you use imdb.where()). It doesn't hurt to run cells over and over again!

In [None]:
# Your answer here

In [29]:
# Bonus: what's the significance of this line of code?

imdb.where('Year', are.above_or_equal_to(1990)).sort('Title').where('Votes', are.below(80000)).num_rows

8

# Writing functions

### Function structure
Before we talk about writing functions, we'll go over what they're made of. Functions are all structured pretty much the same:
* a `def` statement, where we define the name of the function and arguments that we pass in
* a body, where we define what calculations we do with the argument(s)
* a `return` statement, where we return some value that we've calculated

Take a look at this simple function that returns the first element of an array that we pass in.

In [6]:
def first_element(an_array):             # def, our function name, (arguments we need), colon
    return_element = an_array.item(0)    # body
    return return_element                # return

test_array = make_array('wow', 'cool', 'kljslkjd')    # the array we're going to try our function on
first_element(test_array)

'wow'

Looks like it worked! Now let's write some functions on our own.

Say a question asks you to write a function. Here are some tips on how to approach it:

1. What do I need to pass in, and what do I need to return? (Are they ints, arrays, strings, etc?)
2. Try writing your function for one specific case and then making it more general.

### Question: Write a function that returns the maximum of a column when we pass in a column name.

Answering the above questions:
1. We're going to take in a column name (as a string) and return a number (probably an integer, based on the table above).
2. Let's try writing the function for a specific case! Below, I've written a way to get the maximum number of votes out of the votes column:

In [7]:
votes = imdb.column('Votes')        # Get all of the votes as an array
np.max(votes)                       # use the np.max() function to find the maximum number in the array

1498733

There are a lot of other ways to do this! For example, we could sort `imdb` by the `'Votes'` column with `descending=True` and then take the first item of the `'Votes'` column.
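As a rough sketch of that sort-then-take-the-first idea, here it is on a plain Python list of the ten preview vote counts (so the maximum here is the preview's, not the full table's 1498733). The commented line at the end is what the table version might look like; it is not run here.

```python
# Vote counts from the ten preview rows only.
preview_votes = [88355, 132823, 74178, 635139, 145514,
                 425461, 441174, 850601, 37664, 46987]

# Sort from largest to smallest, then take the first item.
largest_first = sorted(preview_votes, reverse=True)
maximum = largest_first[0]

print(maximum)                          # 850601
print(maximum == max(preview_votes))    # True

# Table version (assumed, not run here):
# imdb.sort('Votes', descending=True).column('Votes').item(0)
```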

But let's continue. Here's a skeleton of the function that we want to write:

In [8]:
def find_max_of_column(column_name):
    ### body
    return maximum

Super basic--all I gave you there is the name of the function, what we're passing in, and that we're returning a maximum. 

Now, to write the function, we can literally just copy/paste the specific code we've already written for `'Votes'`, and replace each place where we mention votes with something more generic. In this case, since we're passing in a `column_name`, we can replace `'Votes'` with `column_name`.

In [9]:
def find_max_of_column(column_name):
    column = imdb.column(column_name)
    maximum = np.max(column)
    return maximum

Easy-peasy. Compare this function to what we wrote to find the max of the `'Votes'` column, and note that it's almost exactly the same code! This process has always worked really well for me whenever I need to write functions.
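To see the generalized function handle more than one column, here is a self-contained sketch that swaps the `imdb` table for a plain dictionary of lists built from the ten preview rows. The names `preview_columns` and `find_max_of_preview_column` are hypothetical, and the built-in `max()` stands in for `np.max()`.

```python
# Two columns from the ten preview rows only (not the full table).
preview_columns = {
    'Votes':  [88355, 132823, 74178, 635139, 145514,
               425461, 441174, 850601, 37664, 46987],
    'Rating': [8.4, 8.3, 8.3, 8.6, 8.2, 8.3, 8.1, 8.3, 8.2, 8.0],
}

def find_max_of_preview_column(column_name):
    column = preview_columns[column_name]   # look the column up by name
    maximum = max(column)                   # largest value in that column
    return maximum

print(find_max_of_preview_column('Votes'))    # 850601
print(find_max_of_preview_column('Rating'))   # 8.6
```

Note that the `'Rating'` result is a float, so the type that comes back depends on which column you ask for.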

**As practice, try writing a function that, given some decade, returns a table of all of the movies released in that decade.** (e.g. if I give the function the decade 1950, it should return a table of all of the movies that were released in the 1950s)

In [None]:
# Your answer here