# Pandas assignment

The problems in this notebook are adapted from Brandon Rhodes's PyCon `pandas` tutorial.

The first few problems are identical to those in the tutorial.

## Loading the data

This notebook loads two datasets, `titles` and `cast`.  Both are loaded from the course website and contain extracts of the IMDB movie data. The second is a fragment of a much larger DataFrame.

Everything you need to complete this assignment is included in the
**fragment loaded in this section**, in the cells below.

For those who want to have the entire IMDB-derived dataset, here are some pointers.

Loading the entire IMDB data set used for these tutorial exercises is best done by
visiting [Brandon Rhodes' github repo](https://github.com/brandon-rhodes/pycon-pandas-tutorial)
and following the instructions.  You can either download a big zip file or, if you have `git` (the
version control tool that GitHub is built around) installed on your machine, run a `git clone` command.
In addition to that,
you will need to download 4 compressed files by ftp that Rhodes
provides links to.  You can then run code
from Rhodes' cloned repo to create the non-truncated versions of the CSV files used in these exercises.
If you do that, you may use the complete dataset to complete these exercises.  Some but not
all of your answers will be different.

The statistics
in the Part B answers will not be correct until you load the complete dataset.  But
you can complete the assignment on the fragment, since I will be evaluating your code,
not the accuracy of your statistics.

In [7]:
# The Python modules you need for this assignment.
import pandas as pd
import os.path
import urllib.request
from matplotlib import pyplot as plt

url_dir = 'https://gawron.sdsu.edu/python_for_ss/course_core/data'

Loading and applying the style sheets in the next two cells customizes the styles of your notebook output,
in particular for how pandas `DataFrame`s are printed.

This is optional, but it's interesting if you know anything about CSS files.

In [8]:
target_url1 = os.path.join(url_dir,'style-notebook.css')
target_url2 = os.path.join(url_dir,'style-table.css')

with urllib.request.urlopen(target_url1) as fh1:
    css1 = fh1.read().decode('utf8')
with urllib.request.urlopen(target_url2) as fh2:
    css2 = fh2.read().decode('utf8')
css = css1 + css2

In [9]:
from IPython.display import HTML
#css = open('style-table.css').read() + open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))

The next cell loads the `titles` DataFrame, the first of two used in this assignment.  There are only
two columns, `'title'`  and `'year'`.

In [10]:
titles = pd.read_csv(os.path.join(url_dir,'titles.csv'))
titles.head()

Unnamed: 0,title,year
0,The Patriarchs,2009
1,Angels in the Attic,1998
2,The Rapture,1991
3,Star na si Van Damme Stallone,2016
4,Sweet Talk,2004


This is a simple `DataFrame` with two columns, containing the title and year of a film.

If a film is remade and given the same title, the title shows up twice:

In [10]:
titles[titles['title']  == 'Around the World in 80 Days']

Unnamed: 0,title,year
91875,Around the World in 80 Days,2004
121800,Around the World in 80 Days,1956


The next cell loads the `cast` DataFrame, the second of two used in this assignment.  This file is
large, so loading it will take a while.

In [11]:
#cast = pd.read_csv('data/cast.csv')
cast = pd.read_csv(os.path.join(url_dir,'truncated_cast.csv'),index_col=0)
cast.head()

Unnamed: 0,title,year,name,type,character,n
0,In the Land of the Head Hunters,1914,Paddy 'Malid,actor,Kenada,5.0
1,The Colour of Darkness,2016,Ketan Daraji -Gohel,actor,Chhagan,
2,London Betty,2009,Isaiah Entsua -Mensah,actor,Camera Man,
3,Candelabra,2014,Groovin .,actor,Lt. Dick Sims,5.0
4,Bad Ideas,2012,Hamid .,actor,The Diner,


Please have a look at the columns of the `cast` `DataFrame` and make sure you understand what information it contains. A row uniquely identifies a role in a film; it contains the film title, the year the film was made, the name of the actor or actress playing the role, and the name of the character they played.  The number in the `'n'` column represents the importance of the part, with the lead role receiving a 1 and all less important roles receiving higher numbers.

Sometimes extras are included; with extras included, a big Hollywood production can easily cast over 1000 actors and will therefore populate over 1000 rows of `cast`.  The character name for the extras will be "Extra".

In [18]:
len(cast[(cast['title']  == 'Around the World in 80 Days')&(cast['year']  == 1956)])

1299

In [5]:
import numpy as np
# Any equality comparison with NaN is False (np.NaN was removed in NumPy 2.0; use np.nan).
np.nan  == 4

False

In [23]:
cast[(cast['title']  == 'Around the World in 80 Days')&(cast['year']  == 1956)].iloc[:5]

Unnamed: 0,title,year,name,type,character,n
784,Around the World in 80 Days,1956,Ronald Adam,actor,Club Steward,47.0
7147,Around the World in 80 Days,1956,Ray Arnett,actor,Extra,
25665,Around the World in 80 Days,1956,Charles Boyer,actor,Monsieur Gasse - Thomas Cook Paris Clerk,5.0
40928,Around the World in 80 Days,1956,Suey Chan,actor,Extra,
44164,Around the World in 80 Days,1956,Martin Cichy,actor,Bartender,


As shown above, there are a number of null or `NaN` values in the `'n'` column.  There are also some very
high numbers in the `'n'` column, but not very many of them.

In [205]:
len(cast[cast['n']>500])

232

A single actor may play more than one role in a film, and both those roles can be the starring role.

In [206]:
cast[(cast['title']=='Around the World in 80 Days')&(cast['year']==2004)&(cast['n']==1)]

Unnamed: 0,title,year,name,type,character,n
351948,Around the World in 80 Days,2004,Jackie Chan,actor,Passepartout,1.0
351949,Around the World in 80 Days,2004,Jackie Chan,actor,Lau Xing,1.0


## Part A:  Questions involving selecting and sorting subsets of the rows in the Dataset

### How many movies are listed in the titles dataframe?

As an example, we have put the answer in the next cell.  Be sure to execute the cell containing your solution, so that the answer to the question is displayed in the output as it is in this example.

For each question, first give some thought to which of the two DataFrames loaded in the section entitled "Loading the Data" is best suited to provide an answer.

In [24]:
### What are the earliest two films listed in the titles dataframe?

Hint:  This will require using `.sort_values(...)`.
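
To show the general shape of a `.sort_values(...)` call, here is a minimal sketch on a tiny made-up DataFrame. The `toy` frame and its rows are hypothetical and are not drawn from `titles`; the point is only how sorting by a column and taking the first rows fit together.

```python
import pandas as pd

# Hypothetical toy data, not taken from the assignment's datasets.
toy = pd.DataFrame({
    'title': ['C', 'A', 'B', 'D'],
    'year': [2001, 1950, 1975, 1930],
})

# Sort ascending by 'year' (the default order) and keep the first two rows.
earliest_two = toy.sort_values('year').head(2)
print(earliest_two)
```

Passing `ascending=False` to `.sort_values(...)` reverses the order.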

### How many movies have the title "Hamlet"?

Hint: one approach is to use a Boolean mask.
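
In case the term is unfamiliar, here is a small sketch of a Boolean mask on made-up data. The `toy` frame and its titles are hypothetical; the same pattern was used above to select the "Around the World in 80 Days" rows.

```python
import pandas as pd

# Hypothetical toy data, not taken from the assignment's datasets.
toy = pd.DataFrame({
    'title': ['Macbeth', 'Othello', 'Macbeth', 'King Lear'],
    'year': [1948, 1951, 1971, 1987],
})

# Comparing a column to a value yields a Boolean Series (the mask).
mask = toy['title'] == 'Macbeth'

# Indexing with the mask keeps only the rows where it is True.
matches = toy[mask]
print(len(matches))

# Because True counts as 1, summing the mask gives the same count.
print(mask.sum())
```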

### How many movies are titled "North by Northwest"?

### When was the first movie titled "Hamlet" made?

### List all of the "Treasure Island" movies from earliest to most recent.

### How many movies were made in the year 1950?

### How many movies were made in the year 1960?

### How many movies were made from 1950 through 1959?

### In what years has a movie titled "Batman" been released?

### How many roles were there in the movie "Inception"?

### But how many roles in the movie "Inception" did receive an "n" value?

The idea is that some of the rows have `NaN` in the `'n'` column.  A `NaN` marks a
missing value, not a number.  To eliminate such rows, use the `.notnull()` method.
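
The following sketch, on a hypothetical `toy` frame rather than on `cast`, suggests why `.notnull()` is needed: a `NaN` never compares equal to anything, so an equality test cannot pick out the missing rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the assignment's datasets.
toy = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy', 'Dee'],
    'n': [2.0, np.nan, 1.0, np.nan],
})

# NaN never compares equal to anything, itself included,
# so this mask is False everywhere.
print((toy['n'] == np.nan).sum())

# .notnull() marks the rows that hold a real value.
ranked = toy[toy['n'].notnull()]
print(len(ranked))
```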

### Display the cast of "North by Northwest" in their correct "n"-value order, ignoring roles that did not earn a numeric "n" value.

### Display the entire cast, in "n"-order, of the 1972 film "Sleuth".

### Now display the entire cast, in "n"-order, of the 2007 version of "Sleuth".

### How many roles were credited in the silent 1921 version of Hamlet?

### How many roles were credited in Branagh’s 1996 Hamlet?

### How many "Hamlet" roles have been listed in all film credits through history?

### How many people have played "James Bond" as a leading role?

Yes, I'm thinking of the Ian Fleming character that has given rise to a whole film franchise,
but I'm going to allow for a little noise.  It
turns out that characters named "James Bond" have come up many times in film history.  Adding
the qualification "as a leading role" helps with that.
This will still leave in a couple of non-Ian Fleming Bonds, but don't worry.
If you just answer the question literally, your
answer should be very close to a list of the actors who've played the
Ian Fleming character.  By the way,  David Niven will be missing from
that list, because the name of the character
listed in the credits of his 1967 parody is not "James Bond" but "Sir James Bond".
It's okay to leave out David Niven.  It was a parody, not a real Bond film.

Note: Relative to this DB (which is dated), the historically correct answer to the question of how many actors have played the Ian Fleming character is 6. The literal answer to the question in this DB is
greater than 6 because of a few superfluous Bonds, but if your answer is a lot more than 6
(greater than 10), then there's an issue with your code.  There is a technical challenge in this question;
it is tricky to get exactly the right list of actors.

### How many people have played a role called "The Dude"?

### How many people have played a role called "The Stranger"?

### How many roles has Sidney Poitier played throughout his career?

### How many roles has Judi Dench played?

### List the supporting roles (having n=2) played by Cary Grant in the 1940s, in order by year.

### List the leading roles that Cary Grant played in the 1940s in order by year.

### How many roles were available for actors in the 1920s?

### How many roles were available for actresses in the 1920s?

## Part B:  Questions needing value_counts, pivot_tables, or cross tabulation

### Of the films made before 1939, which had the largest cast?

After you've made a DataFrame containing the set of rows in `cast` that you're interested in,
you need to do a computation that counts cast members by film.  The question to think about
is what you want to count.  You **don't** want to do this:

```
pd.crosstab(cast_pre_1939['title'],cast_pre_1939['name'])
```

since after a **very** long computation, this will end up telling you
the number of characters each actor/actress played in each film
(which is usually 1).

So think about the structure of the data.  This is actually pretty simple.

Once you get a reasonable looking answer, you may want to do the same computation
for the entire `cast` DataFrame, to find out what film had the largest cast of all time.

In [30]:
cast_pre_1939 = cast[cast['year'] < 1939]
# Count the non-null entries in every column for each (title, year) pair.
# Indexing on both keys keeps a film and its remake in separate rows.
pt0 = pd.pivot_table(cast,index=['title','year'],aggfunc='count')

In [31]:
pt0['character'].loc['Around the World in 80 Days']

year
1956    1299
2004      65
Name: character, dtype: int64

In [34]:
pt0['character'].loc[('Around the World in 80 Days',1956)]

1299

In [28]:
print(len(cast[cast['title']=='Around the World in 80 Days']))
print(len(cast[(cast['title']=='Around the World in 80 Days')& (cast['year']==1956)]))
print(len(cast[(cast['title']=='Around the World in 80 Days')& (cast['year']==2004)]))

1364
1299
65
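
Counting rows per film can also be written with `groupby`. This is only a sketch on a hypothetical `toy_cast` frame that borrows the column names of `cast`; it is one possible route, not the required one.

```python
import pandas as pd

# Hypothetical toy data with the same column names as `cast`.
toy_cast = pd.DataFrame({
    'title': ['Film A', 'Film A', 'Film A', 'Film B', 'Film B'],
    'year': [1930, 1930, 1930, 1935, 1935],
    'name': ['P1', 'P2', 'P3', 'P4', 'P5'],
})

# Each row is one role, so counting rows per (title, year) gives the cast size.
sizes = toy_cast.groupby(['title', 'year']).size()
print(sizes)
```

Unlike `aggfunc='count'`, which counts the non-null entries column by column, `.size()` counts rows regardless of missing values.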


### Of the films made in either 1939 or 1966, what films had a cast size of 90 or more but fewer than 100?

This is a follow-up to the previous question which takes you a little deeper.

After you've made a DataFrame containing the set of rows in `cast` that you're interested in,
you again need to do a computation that counts cast members by film.   As noted
above there is a simple computation that seems to give you the cast sizes of all films.

But to answer this question correctly,  you will need to
come up with a slightly more complicated answer. The issue to
think about is remakes.  In fact in 1966 there were some remakes of
great 1939 films.  How should that affect your answer to this
problem?
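
To make the remake issue concrete, here is a sketch on made-up data. The `toy_cast` frame is hypothetical; in it, "Film A" exists both as a 1939 original and as a 1966 remake.

```python
import pandas as pd

# Hypothetical toy data: 'Film A' was made in 1939 and remade in 1966.
toy_cast = pd.DataFrame({
    'title': ['Film A', 'Film A', 'Film A', 'Film B'],
    'year': [1939, 1939, 1966, 1966],
    'name': ['P1', 'P2', 'P3', 'P4'],
})

# Grouping by title alone merges the original and the remake into one count.
by_title = toy_cast.groupby('title').size()
print(by_title.loc['Film A'])

# Grouping by title and year keeps the two films apart.
by_film = toy_cast.groupby(['title', 'year']).size()
print(by_film.loc[('Film A', 1939)], by_film.loc[('Film A', 1966)])
```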


### How many films made before 1939 have a cast of size 1?

### How many movies have had remakes?

Film buffs will know this is much harder than it might seem.  To make this
doable, let's look only at remakes that have the same title, and let's assume (falsely)
that when two movies have the same title, the later one is a remake of the earlier one.
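
Under that simplifying assumption, the question turns on how often each title occurs. The sketch below shows `value_counts` doing that tally on a hypothetical `toy` frame; how you turn the tally into an answer is up to you.

```python
import pandas as pd

# Hypothetical toy data, not taken from the assignment's datasets.
toy = pd.DataFrame({
    'title': ['Film A', 'Film B', 'Film A', 'Film C', 'Film A', 'Film C'],
    'year': [1930, 1940, 1960, 1970, 1990, 2000],
})

# value_counts tallies how often each title occurs.
counts = toy['title'].value_counts()

# Titles that occur more than once are shared by several films.
repeated = counts[counts > 1]
print(repeated)
```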

### Plot the number of male and female roles year by year (up through 2017)

So you want years on the x-axis and two lines, one
tracking the number of male roles and another tracking the number of female roles.
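
One way to get a table with years as rows and one column per role type is a cross tabulation. This is a sketch on a hypothetical `toy_cast` frame; the value `'actress'` in the `'type'` column is an assumption, since only `'actor'` rows appear in the output shown earlier.

```python
import pandas as pd

# Hypothetical toy data; the 'actress' label is an assumption.
toy_cast = pd.DataFrame({
    'year': [1950, 1950, 1950, 1951, 1951, 1952],
    'type': ['actor', 'actress', 'actor', 'actress', 'actress', 'actor'],
})

# Rows are years, columns are role types, cells are role counts.
# Combinations that never occur are filled with 0.
counts = pd.crosstab(toy_cast['year'], toy_cast['type'])
print(counts)
```

In a notebook, calling `counts.plot()` on a table of this shape draws one line per column, with the index (here the years) on the x-axis.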

Note this plot will not be realistic because of the way our data has been sampled,
so I'll just be evaluating the code, not the accuracy.

### Plot the percentages of all roles that are female year by year for the century from 1917 through 2017

You are continuing your study of the imbalance of male and female roles.  Years on the x-axis,
one line tracking the percentage of female roles.
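
Percentages can be derived from a cross tabulation by normalizing each row. The sketch below uses a hypothetical `toy_cast` frame, and the `'actress'` label in the `'type'` column is again an assumption.

```python
import pandas as pd

# Hypothetical toy data; the 'actress' label is an assumption.
toy_cast = pd.DataFrame({
    'year': [1950, 1950, 1950, 1950, 1951, 1951],
    'type': ['actor', 'actor', 'actor', 'actress', 'actor', 'actress'],
})

# normalize='index' turns each row of counts into fractions that sum to 1.
shares = pd.crosstab(toy_cast['year'], toy_cast['type'], normalize='index')

# Convert the 'actress' fraction to a percentage.
pct_female = 100 * shares['actress']
print(pct_female)
```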

Comment: This is a very interesting plot, which is begging for a story to explain it.
On the truncated sample, the plot has some serious flaws, but the general pattern you
see is correct.