## A Quick Review from W1

Before pulling any data, we've gotta import all the packages we need

In [81]:
import pandas as pd
import numpy as np

pd.set_option("display.max_rows", 6)

Now we can read in the data from a link OR from a file in the same directory

In [None]:
movies = pd.read_csv('https://raw.githubusercontent.com/dt3zjy/node/master/week-2/workshop/imdb.csv')
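
`pd.read_csv` accepts a URL, a local file path, or any file-like object. As a minimal sketch that runs without a network connection, the example below reads from an in-memory buffer; the `csv_text` string is a two-row stand-in written for illustration, not the real file.

```python
import io
import pandas as pd

# Stand-in CSV contents (assumption: real files have many more rows and columns)
csv_text = "movie_title,imdb_score\nAvatar,7.9\nSpectre,6.8\n"

# read_csv takes a path, a URL, or a file-like object such as this buffer
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.shape)
```

For a file in the same directory, the call would look something like `pd.read_csv('imdb.csv')`, assuming the file was saved under that name.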

Take a quick look at the data

In [83]:
movies.head(3)

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
0,Avatar,7.9,PG-13,237000000.0,760505847.0,178.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Joel David Moore,Wes Studi,James Cameron,English,USA,1.78,2009.0
1,Pirates of the Caribbean: At World's End,7.1,PG-13,300000000.0,309404152.0,169.0,Action|Adventure|Fantasy,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski,English,USA,2.35,2007.0
2,Spectre,6.8,PG-13,245000000.0,200074175.0,148.0,Action|Adventure|Thriller,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes,English,UK,2.35,2015.0


How many rows are there? Columns?

In [84]:
movies.shape

(5043, 15)

There are a lot of columns. Could we get a full list?

In [85]:
movies.columns.values.tolist()

['movie_title',
 'imdb_score',
 'content_rating',
 'budget',
 'gross',
 'duration',
 'genres',
 'actor_1_name',
 'actor_2_name',
 'actor_3_name',
 'director_name',
 'language',
 'country',
 'aspect_ratio',
 'title_year']

# Pandas Foundations

## Dataframe who?

Think of dataframes as little "spreadsheets" that hold our data.
<br>Each row represents an observation, and each column represents some feature about that data.

<br> With little bits of code, we can modify what that spreadsheet looks like.
<br> For example, our raw data `movies` can be transformed to show some metric about some feature for some subset of the data

In [86]:
# For example:
movies[movies.country.isin(['USA','France','UK'])].groupby('country')[['gross', 'budget']].agg(['median','max'])

country,gross (median),gross (max),budget (median),budget (max)
France,4291965.0,145000989.0,19715000.0,390000000.0
UK,13401683.0,362645141.0,15000000.0,250000000.0
USA,32178777.0,760505847.0,20000000.0,300000000.0


## Series vs. Dataframes 

So far, we've been working with *2-dimensional* tables, which are `pd.DataFrame` objects.
<br>An individual column (or row) is *1-dimensional*. We call this a `pd.Series` object

In [87]:
type(movies)

pandas.core.frame.DataFrame

We can access a column as a series (always 1-dimensional) using either: `df.column` OR `df['column']`

In [88]:
print(type(movies.imdb_score))
print(movies.imdb_score)

<class 'pandas.core.series.Series'>
0       7.9
1       7.1
2       6.8
       ... 
5040    6.3
5041    6.3
5042    6.6
Name: imdb_score, Length: 5043, dtype: float64


In [89]:
print(type(movies['imdb_score']))
print(movies['imdb_score'])

<class 'pandas.core.series.Series'>
0       7.9
1       7.1
2       6.8
       ... 
5040    6.3
5041    6.3
5042    6.6
Name: imdb_score, Length: 5043, dtype: float64


To select a **row** as a series, we can use `df.iloc[]` and a specific row number (its index)

In [90]:
print(type(movies.iloc[3]))
movies.iloc[3]

<class 'pandas.core.series.Series'>


movie_title       The Dark Knight Rises 
imdb_score                           8.5
content_rating                     PG-13
                           ...          
country                              USA
aspect_ratio                        2.35
title_year                          2012
Name: 3, Length: 15, dtype: object
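
To make the dimensions concrete, here is a small sketch on a hand-made table. The `toy` frame and its values are invented for illustration and are not part of `movies`.

```python
import pandas as pd

# Hypothetical toy data, not taken from the movies dataset
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'score': [7.5, 6.0, 8.2],
})

col = toy['score']    # one column -> Series, indexed by row label
row = toy.iloc[1]     # one row -> Series, indexed by column name

print(toy.ndim, col.ndim, row.ndim)   # number of dimensions of each object
print(row.index.tolist())             # the row's index is the list of column names
```

Because a row mixes text and numbers, its series usually ends up with the generic `object` dtype, as in the output above.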

## Subsetting Columns

Subsetting and filtering is one of the most important, yet confusing topics when getting started.
<br>As we go along, feel free to run the code part by part to see what's going on in each step. `type()` is also a great tool here

Let's take a look at a smaller set of columns, say the movie title, actors & director name

In [91]:
movies[['movie_title','actor_1_name','actor_2_name','actor_3_name','director_name']] # Note TWO square brackets

Unnamed: 0,movie_title,actor_1_name,actor_2_name,actor_3_name,director_name
0,Avatar,CCH Pounder,Joel David Moore,Wes Studi,James Cameron
1,Pirates of the Caribbean: At World's End,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski
2,Spectre,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes
...,...,...,...,...,...
5040,A Plague So Pleasant,Eva Boehnke,Maxwell Moody,David Chandler,Benjamin Roberds
5041,Shanghai Calling,Alan Ruck,Daniel Henney,Eliza Coupe,Daniel Hsia
5042,My Date with Drew,John August,Brian Herzlinger,Jon Gunn,Jon Gunn


Side Note: Why did we use two square brackets? The inner pair builds a Python `list` of column names, and the outer pair is the selection operation that receives that list.
<br> Pandas `DataFrame` objects know that whenever we place `[]` after them, we're looking to select some part of the data (columns or rows)

In [92]:
actor_cols = ['actor_1_name','actor_2_name','actor_3_name']
movies[actor_cols]

Unnamed: 0,actor_1_name,actor_2_name,actor_3_name
0,CCH Pounder,Joel David Moore,Wes Studi
1,Johnny Depp,Orlando Bloom,Jack Davenport
2,Christoph Waltz,Rory Kinnear,Stephanie Sigman
...,...,...,...
5040,Eva Boehnke,Maxwell Moody,David Chandler
5041,Alan Ruck,Daniel Henney,Eliza Coupe
5042,John August,Brian Herzlinger,Jon Gunn


Another quick point of confusion: Check out the difference between `df['column']` vs `df[['column']]` (try it out below)

In [93]:
movies['content_rating']

0       PG-13
1       PG-13
2       PG-13
        ...  
5040      NaN
5041    PG-13
5042       PG
Name: content_rating, Length: 5043, dtype: object

In [94]:
movies[['content_rating']]

Unnamed: 0,content_rating
0,PG-13
1,PG-13
2,PG-13
...,...
5040,
5041,PG-13
5042,PG


The former creates a *series*, since the input is just one *string*. The latter creates a *dataframe*, since the input is a *list* of columns.
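
A quick way to see the difference is to compare shapes. This is a sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Made-up values for illustration
toy = pd.DataFrame({'rating': ['PG', 'R', 'G'], 'score': [7.5, 6.0, 8.2]})

as_series = toy['rating']     # string key -> Series
as_frame = toy[['rating']]    # list key -> DataFrame with a single column

print(as_series.shape)   # 1-dimensional: (rows,)
print(as_frame.shape)    # 2-dimensional: (rows, columns)
```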

## Filtering Rows With Conditions

Another common task is to filter rows based upon some criteria we have. We could:
1. Compare floats
2. Match strings
3. Check against multiple elements

In [95]:
movies[movies.imdb_score > 8]

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
3,The Dark Knight Rises,8.5,PG-13,250000000.0,448130642.0,164.0,Action|Thriller,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan,English,USA,2.35,2012.0
17,The Avengers,8.1,PG-13,220000000.0,623279547.0,173.0,Action|Adventure|Sci-Fi,Chris Hemsworth,Robert Downey Jr.,Scarlett Johansson,Joss Whedon,English,USA,1.85,2012.0
27,Captain America: Civil War,8.2,PG-13,250000000.0,407197282.0,147.0,Action|Adventure|Sci-Fi,Robert Downey Jr.,Scarlett Johansson,Chris Evans,Anthony Russo,English,USA,2.35,2016.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4945,The Brain That Sings,8.2,,125000.0,,62.0,Documentary|Family,,,,Amal Al-Agroobi,Arabic,United Arab Emirates,,2013.0
4972,"Peace, Propaganda & the Promised Land",8.3,,70000.0,,80.0,Documentary,Noam Chomsky,Seth Ackerman,Arik Ascherman,Sut Jhally,English,USA,,2004.0
5001,The Last Waltz,8.2,PG,,321952.0,117.0,Documentary|Music,Ringo Starr,Levon Helm,Bob Dylan,Martin Scorsese,English,USA,1.85,1978.0


In [96]:
movies[movies.content_rating == 'G']

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
35,Monsters University,7.3,G,200000000.0,268488329.0,104.0,Adventure|Animation|Comedy|Family|Fantasy,Steve Buscemi,Tyler Labine,Sean Hayes,Dan Scanlon,English,USA,1.85,2013.0
41,Cars 2,6.3,G,200000000.0,191450875.0,106.0,Adventure|Animation|Comedy|Family|Sport,Joe Mantegna,Thomas Kretschmann,Eddie Izzard,John Lasseter,English,USA,2.35,2011.0
43,Toy Story 3,8.3,G,200000000.0,414984497.0,103.0,Adventure|Animation|Comedy|Family|Fantasy,Tom Hanks,John Ratzenberger,Don Rickles,Lee Unkrich,English,USA,1.85,2010.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4725,Benji,6.1,G,500000.0,39552600.0,86.0,Adventure|Family|Romance,Frances Bavier,Peter Breck,Edgar Buchanan,Joe Camp,English,USA,1.85,1974.0
4787,Rise of the Entrepreneur: The Search for a Bet...,8.2,G,450000.0,,52.0,Documentary,Bob Proctor,Jack Canfield,Eric Worre,Joe Kenemore,English,USA,,2014.0
4874,Sunday School Musical,2.5,G,,,93.0,Drama|Musical,Dustin Fitzsimons,Mark Hengst,Debra Lynn Hull,Rachel Goldenberg,English,USA,1.85,2008.0


What if we wanted to find movies that were either `G`, `PG`, or `PG-13`? 
We'd have to type out something pretty annoying like: 
```python
movies[(movies.content_rating == 'G') | (movies.content_rating == 'PG') | (movies.content_rating == 'PG-13')]
```

<br>Instead, we'll use the `.isin()` method to match against a group of values. Remember, `.isin()` expects a list-like collection (a `list`, for example); passing a single string raises an error

In [97]:
movies[movies.content_rating.isin(['G','PG', 'PG-13'])]

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
0,Avatar,7.9,PG-13,237000000.0,760505847.0,178.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Joel David Moore,Wes Studi,James Cameron,English,USA,1.78,2009.0
1,Pirates of the Caribbean: At World's End,7.1,PG-13,300000000.0,309404152.0,169.0,Action|Adventure|Fantasy,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski,English,USA,2.35,2007.0
2,Spectre,6.8,PG-13,245000000.0,200074175.0,148.0,Action|Adventure|Thriller,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes,English,UK,2.35,2015.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5036,The Mongol King,7.8,PG-13,3250.0,,84.0,Crime|Drama,Richard Jewell,John Considine,Sara Stepnicka,Anthony Vallone,English,USA,,2005.0
5041,Shanghai Calling,6.3,PG-13,,10443.0,100.0,Comedy|Drama|Romance,Alan Ruck,Daniel Henney,Eliza Coupe,Daniel Hsia,English,USA,2.35,2012.0
5042,My Date with Drew,6.6,PG,1100.0,85222.0,90.0,Documentary,John August,Brian Herzlinger,Jon Gunn,Jon Gunn,English,USA,1.85,2004.0
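
As a small sketch of how `.isin()` behaves, the example below uses a `set` instead of a list and flips the result with `~` to get the rows that do *not* match. The `ratings` series is made up for illustration.

```python
import pandas as pd

# Made-up values, including one missing rating
ratings = pd.Series(['G', 'PG', 'R', None, 'PG-13'])

family = {'G', 'PG', 'PG-13'}       # a set works as well as a list
is_family = ratings.isin(family)    # boolean Series, one value per row
not_family = ~is_family             # ~ flips True and False

print(ratings[is_family].tolist())
print(ratings[not_family].tolist())   # the missing value lands here too
```

Note that a missing rating never matches, so it ends up in the "not family" group; whether that is what we want depends on the question being asked.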


### Masking

What's going on under the hood? When we say something like this,
```python 
movies.content_rating == 'G'
```
We're actually asking pandas to check, **for each** row in the series, whether the given statement is True or False. The result is a series of booleans, which we call a *mask*.


In [98]:
movies.content_rating == 'G'

0       False
1       False
2       False
        ...  
5040    False
5041    False
5042    False
Name: content_rating, Length: 5043, dtype: bool

In [99]:
movies[  movies.content_rating == 'G'  ]

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
35,Monsters University,7.3,G,200000000.0,268488329.0,104.0,Adventure|Animation|Comedy|Family|Fantasy,Steve Buscemi,Tyler Labine,Sean Hayes,Dan Scanlon,English,USA,1.85,2013.0
41,Cars 2,6.3,G,200000000.0,191450875.0,106.0,Adventure|Animation|Comedy|Family|Sport,Joe Mantegna,Thomas Kretschmann,Eddie Izzard,John Lasseter,English,USA,2.35,2011.0
43,Toy Story 3,8.3,G,200000000.0,414984497.0,103.0,Adventure|Animation|Comedy|Family|Fantasy,Tom Hanks,John Ratzenberger,Don Rickles,Lee Unkrich,English,USA,1.85,2010.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4725,Benji,6.1,G,500000.0,39552600.0,86.0,Adventure|Family|Romance,Frances Bavier,Peter Breck,Edgar Buchanan,Joe Camp,English,USA,1.85,1974.0
4787,Rise of the Entrepreneur: The Search for a Bet...,8.2,G,450000.0,,52.0,Documentary,Bob Proctor,Jack Canfield,Eric Worre,Joe Kenemore,English,USA,,2014.0
4874,Sunday School Musical,2.5,G,,,93.0,Drama|Musical,Dustin Fitzsimons,Mark Hengst,Debra Lynn Hull,Rachel Goldenberg,English,USA,1.85,2008.0
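
Since a mask is just a series of booleans, we can store it, count it, and combine it with other masks. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Made-up values for illustration
toy = pd.DataFrame({
    'rating': ['G', 'PG', 'G', 'R'],
    'score': [8.3, 6.1, 5.9, 8.8],
})

is_g = toy['rating'] == 'G'    # boolean Series (the mask)
is_good = toy['score'] > 8

print(is_g.sum())              # True counts as 1, so this counts the matching rows
print(toy[is_g & is_good])     # & keeps rows where both conditions hold
print(toy[is_g | is_good])     # | keeps rows where either condition holds
```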


## Basic functions

There are a couple of super helpful functions for working with our data.

For sorting, we can use `.sort_values(by= )`, and pass in either a single string, or multiple strings in a list.
<br>To flip the default order, change the `ascending= ` parameter to `False`

In [100]:
movies.sort_values(by=['imdb_score','gross'], ascending=False).head()

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
2765,Towering Inferno,9.5,,,,65.0,Comedy,Martin Short,Andrea Martin,Joe Flaherty,John Blanchard,English,Canada,1.33,
1937,The Shawshank Redemption,9.3,R,25000000.0,28341469.0,142.0,Crime|Drama,Morgan Freeman,Jeffrey DeMunn,Bob Gunton,Frank Darabont,English,USA,1.85,1994.0
3466,The Godfather,9.2,R,6000000.0,134821952.0,175.0,Crime|Drama,Al Pacino,Marlon Brando,Robert Duvall,Francis Ford Coppola,English,USA,1.85,1972.0
2824,Dekalog,9.1,TV-MA,,447093.0,55.0,Drama,Krystyna Janda,Olaf Lubaszenko,Olgierd Lukaszewicz,,Polish,Poland,1.33,
3207,Dekalog,9.1,TV-MA,,447093.0,55.0,Drama,Krystyna Janda,Olaf Lubaszenko,Olgierd Lukaszewicz,,Polish,Poland,1.33,
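
When sorting by several columns, `ascending=` can also take a list with one value per column. A sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Made-up values for illustration
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'score': [8.0, 9.0, 8.0, 9.0],
    'gross': [50, 10, 20, 30],
})

# Highest score first; within the same score, lowest gross first
ranked = toy.sort_values(by=['score', 'gross'], ascending=[False, True])
print(ranked)
```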


We could also use the `.value_counts()` function to count how many observations there are for each category.
<br>If we wanted relative frequencies (i.e. proportions) instead of absolute counts, we could change the `normalize=` parameter to `True`

In [101]:
movies.country.value_counts(normalize=True).head()

USA        0.755657
UK         0.088924
France     0.030568
Canada     0.025010
Germany    0.019254
Name: country, dtype: float64
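
One detail worth knowing: by default `.value_counts()` skips missing values, so the proportions are shares of the non-missing rows. A sketch on a made-up series:

```python
import pandas as pd

# Made-up values, including one missing country
country = pd.Series(['USA', 'USA', 'UK', None, 'USA'])

counts = country.value_counts()                     # missing values are skipped by default
shares = country.value_counts(normalize=True)       # proportions of the non-missing rows
with_missing = country.value_counts(dropna=False)   # count the missing value as its own category

print(counts)
print(shares)
print(with_missing)
```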

Finally, we can take basic summary measures, like the `.mean()` or `.sum()` of a column.

In [102]:
movies.budget.mean()

39752620.436387606

## Chaining

Chaining helps make our code more concise and readable.

For example, if we wanted to combine some subsets with functions:

In [103]:
df1 = movies[movies.imdb_score > 8.5]
df2 = df1[df1.content_rating == 'PG']
df3 = df2.sort_values(by='duration',ascending=False)
df3.head()

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
2051,Star Wars: Episode V - The Empire Strikes Back,8.8,PG,18000000.0,290158751.0,127.0,Action|Adventure|Fantasy|Sci-Fi,Harrison Ford,Kenny Baker,Anthony Daniels,Irvin Kershner,English,USA,2.35,1980.0
2373,Spirited Away,8.6,PG,19000000.0,10049886.0,125.0,Adventure|Animation|Family|Fantasy,Bunta Sugawara,Ryûnosuke Kamiki,Miyu Irino,Hayao Miyazaki,Japanese,Japan,1.85,2001.0
3024,Star Wars: Episode IV - A New Hope,8.7,PG,11000000.0,460935665.0,125.0,Action|Adventure|Fantasy|Sci-Fi,Harrison Ford,Peter Cushing,Kenny Baker,George Lucas,English,USA,2.35,1977.0
4049,It's a Wonderful Life,8.6,PG,3180000.0,,118.0,Drama|Family|Fantasy|Romance,Donna Reed,Lionel Barrymore,Thomas Mitchell,Frank Capra,English,USA,1.37,1946.0
4526,Casablanca,8.6,PG,950000.0,,82.0,Drama|Romance|War,Humphrey Bogart,Claude Rains,Conrad Veidt,Michael Curtiz,English,USA,1.37,1942.0


We could instead write the above as:

In [104]:
movies[movies.imdb_score > 8.5].loc[movies.content_rating == 'PG'].sort_values(by='duration',ascending=False).head()

Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
2051,Star Wars: Episode V - The Empire Strikes Back,8.8,PG,18000000.0,290158751.0,127.0,Action|Adventure|Fantasy|Sci-Fi,Harrison Ford,Kenny Baker,Anthony Daniels,Irvin Kershner,English,USA,2.35,1980.0
2373,Spirited Away,8.6,PG,19000000.0,10049886.0,125.0,Adventure|Animation|Family|Fantasy,Bunta Sugawara,Ryûnosuke Kamiki,Miyu Irino,Hayao Miyazaki,Japanese,Japan,1.85,2001.0
3024,Star Wars: Episode IV - A New Hope,8.7,PG,11000000.0,460935665.0,125.0,Action|Adventure|Fantasy|Sci-Fi,Harrison Ford,Peter Cushing,Kenny Baker,George Lucas,English,USA,2.35,1977.0
4049,It's a Wonderful Life,8.6,PG,3180000.0,,118.0,Drama|Family|Fantasy|Romance,Donna Reed,Lionel Barrymore,Thomas Mitchell,Frank Capra,English,USA,1.37,1946.0
4526,Casablanca,8.6,PG,950000.0,,82.0,Drama|Romance|War,Humphrey Bogart,Claude Rains,Conrad Veidt,Michael Curtiz,English,USA,1.37,1942.0


This works because each step returns a dataframe, so we can keep tacking on methods instead of saving each step into a temporary variable.
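
One caveat about the chained version: `.loc[movies.content_rating == 'PG']` builds its mask from the full `movies` table rather than from the already-filtered rows, and relies on pandas lining the two up by index. Below is a sketch of two alternatives that avoid this, shown on made-up data (the `toy` frame is hypothetical): combine both conditions into one mask, or pass a function to `.loc` so the condition is evaluated on the result of the previous step.

```python
import pandas as pd

# Made-up values for illustration
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'score': [8.8, 8.6, 7.0, 9.0],
    'rating': ['PG', 'PG', 'PG', 'R'],
    'duration': [127, 125, 90, 142],
})

# Option 1: combine both conditions into a single mask
combined = toy[(toy.score > 8.5) & (toy.rating == 'PG')].sort_values(by='duration', ascending=False)

# Option 2: pass a function to .loc; it receives the frame produced by the previous step
chained = (
    toy[toy.score > 8.5]
    .loc[lambda d: d.rating == 'PG']
    .sort_values(by='duration', ascending=False)
)
print(chained)
```

Wrapping the chain in parentheses lets us put one step per line, which keeps long chains readable.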

**Try it out:** In one line, see if you can find the value counts of `content_rating` for movies with a gross revenue (`gross`) over `200000000` ($200 million)

In [105]:
movies[movies.gross > 200000000].content_rating.value_counts()

PG-13    97
PG       49
R        11
G        10
Name: content_rating, dtype: int64

# Practice w/ UFOs

Data sampled from the National UFO Reporting Center (NUFORC)
<br>With your breakout groups, open up `ufo.csv` and answer the following questions:

1. Among the West Coast states (California, Oregon, and Washington), how long (on average) did the fireball encounters last?
2. Which state saw the most encounters that lasted between 5 minutes and 1 hour?
3. There was one particularly interesting encounter on `2/11/2004 00:00` in West Palm Beach, Florida. What happened?

<br>Hint: Break down each question into parts, and chain them back together. There's no particular 'right' way
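<br>Note: the shape column in this file is named `shape_type`, so `ufo.shape_type` works. If it were named `shape`, we'd have to write `ufo['shape']` instead of `ufo.shape`, since `.shape` is already a built-in dataframe attribute (the row and column counts)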

In [106]:
ufo = pd.read_csv('https://raw.githubusercontent.com/dt3zjy/node/master/week-2/workshop/ufo.csv')
ufo.head()

Unnamed: 0,datetime,city,state,country,shape_type,duration_sec,duration_hrs,comments,latitude,longitude
0,1/25/2014 22:00,tewksbury,ma,us,light,4.0,00:04,Green and red falling light over walgreens res...,42.610556,-71.234722
1,8/20/2004 22:27,lake in the hills,il,us,triangle,180.0,2-3 minutes,On August 20&#442004 at exactly 10:25 to 10:27...,42.181667,-88.330278
2,1/23/2009 19:00,slidell,la,us,sphere,7200.0,2 hours,Bright red&#44 green &#44 white and blue spher...,30.275,-89.781111
3,12/15/1994 21:00,alliance,oh,us,formation,1800.0,30 min +,Brightly Colored Orbs Over Portage County,40.915278,-81.106111
4,7/31/2011 21:00,milford,ct,us,oval,15.0,10-15 seconds,Bright orange&#44 silent orb&#44 moving steadi...,41.222222,-73.056944


In [107]:
# 1
ufo.loc[ufo.state.isin(['ca','or','wa'])].loc[ufo.shape_type=='fireball'].duration_sec.mean()

238.88888888888889

In [108]:
# 2
ufo.loc[(ufo.duration_sec>=5*60) & (ufo.duration_sec<=60*60)].state.value_counts().head(1)

ca    392
Name: state, dtype: int64

In [109]:
# 3
ufo.loc[ufo.datetime == '2/11/2004 00:00'].loc[ufo.city=='west palm beach'].comments.values[0]

'BLINDING LIGHT LIFTED MY DOG AND TOOK OFF INTO SPACE'

# Groupby Objects

Earlier, we used `value_counts()` to get the number of movies for each content rating.

In [110]:
movies.content_rating.value_counts()

R        2118
PG-13    1461
PG        701
         ... 
M           5
TV-Y7       1
TV-Y        1
Name: content_rating, Length: 18, dtype: int64

This is a good summary statistic to examine the distribution of our dataset. But...

<br>What if we want to know how movie performance differs by rating?
<br>We need to apply some function (e.g. take the mean of revenue) **for each** content rating

In [111]:
movies_byRating = movies.groupby('content_rating')
type(movies_byRating)

pandas.core.groupby.groupby.DataFrameGroupBy

This creates a special **GroupBy object**. 

<br>For now, let's think of it like a *collection* of dataframes, separated by each unique value of content rating (one group for `R`, one for `PG-13`, etc.).
<br>We can't easily render what the entire GroupBy object looks like, but we can pull out a particular group

In [112]:
movies_byRating

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x118839898>

In [113]:
print(type(movies_byRating.get_group('PG')))
movies_byRating.get_group('PG').head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,movie_title,imdb_score,content_rating,budget,gross,duration,genres,actor_1_name,actor_2_name,actor_3_name,director_name,language,country,aspect_ratio,title_year
7,Tangled,7.8,PG,260000000.0,200807262.0,100.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,Brad Garrett,Donna Murphy,M.C. Gainey,Nathan Greno,English,USA,1.85,2010.0
9,Harry Potter and the Half-Blood Prince,7.5,PG,250000000.0,301956980.0,153.0,Adventure|Family|Fantasy|Mystery,Alan Rickman,Daniel Radcliffe,Rupert Grint,David Yates,English,UK,2.35,2009.0
16,The Chronicles of Narnia: Prince Caspian,6.6,PG,225000000.0,141614023.0,150.0,Action|Adventure|Family|Fantasy,Peter Dinklage,Pierfrancesco Favino,Damián Alcázar,Andrew Adamson,English,USA,2.35,2008.0
33,Alice in Wonderland,6.5,PG,200000000.0,334185206.0,108.0,Adventure|Family|Fantasy,Johnny Depp,Alan Rickman,Anne Hathaway,Tim Burton,English,USA,1.85,2010.0
38,Oz the Great and Powerful,6.4,PG,215000000.0,234903076.0,130.0,Adventure|Family|Fantasy,Tim Holmes,Mila Kunis,James Franco,Sam Raimi,English,USA,2.35,2013.0
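
To peek inside a GroupBy object, we can also loop over it: each step yields the group's name and an ordinary dataframe holding that group's rows. A sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Made-up values for illustration
toy = pd.DataFrame({
    'rating': ['PG', 'R', 'PG', 'G'],
    'score': [7.8, 8.5, 6.6, 7.3],
})

by_rating = toy.groupby('rating')

print(by_rating.ngroups)         # number of distinct groups
for name, group in by_rating:    # each group is an ordinary DataFrame
    print(name, len(group))
```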


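When we apply an **aggregation** function such as `mean` to a `GroupBy` object, we get back that summary for each numeric column in the dataframe, **broken down** by content rating. (On recent pandas versions, averaging text columns raises an error, so we restrict the calculation to numeric columns.)

In [114]:
movies_byRating.mean(numeric_only=True)  # same result as .agg('mean') on the numeric columns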

content_rating,imdb_score,budget,gross,duration,aspect_ratio,title_year
Approved,7.325455,4.142475e+06,4.814586e+07,116.218182,1.735818,1954.781818
G,6.529464,4.499913e+07,8.245516e+07,98.973214,1.962523,1995.892857
GP,6.916667,5.550000e+06,4.380000e+07,110.833333,2.024000,1970.666667
...,...,...,...,...,...,...
TV-Y7,7.200000,,,30.000000,4.000000,
Unrated,6.920968,5.092971e+06,4.302599e+06,102.790323,2.110370,1990.354839
X,6.500000,3.338462e+06,1.865881e+07,92.615385,1.928462,1981.307692


If we were just to apply `.mean()` to the entire dataframe, we'd only get back one row with summaries for the entire dataset

In [115]:
pd.DataFrame(movies.mean(numeric_only=True), columns=['Total']).T  # numeric_only skips the text columns

Unnamed: 0,imdb_score,budget,gross,duration,aspect_ratio,title_year
Total,6.442138,39752620.0,48468410.0,107.201074,2.220403,2002.470517


We've got other ways to aggregate the data too.

Here, we're showing the mean, max, min, and count of `imdb_score` by passing in multiple strings in a list to `.agg()`

In [116]:
movies_byRating['imdb_score'].agg(['mean', 'max', 'min', 'count'])

content_rating,mean,max,min,count
Approved,7.325455,8.9,4.1,55
G,6.529464,8.6,1.6,112
GP,6.916667,8.0,6.2,6
...,...,...,...,...
TV-Y7,7.200000,7.2,7.2,1
Unrated,6.920968,8.7,3.2,62
X,6.500000,7.9,4.0,13
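
If we want different functions for different columns, one option is *named aggregation*, where each output column is written as `new_name=(source_column, function)`. This is a sketch on made-up data (the `toy` frame is hypothetical) and assumes pandas 0.25 or later.

```python
import pandas as pd

# Made-up values for illustration
toy = pd.DataFrame({
    'rating': ['PG', 'R', 'PG', 'R'],
    'score': [7.0, 8.0, 6.0, 9.0],
    'gross': [100, 40, 300, 60],
})

summary = toy.groupby('rating').agg(
    mean_score=('score', 'mean'),   # new column name = (source column, function)
    max_gross=('gross', 'max'),
    n_movies=('score', 'count'),
)
print(summary)
```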


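The results of these groupby aggregations are ordinary pandas objects: a dataframe in the examples above, or a series when we aggregate a single column with a single function. Check it out with `type()`.
<br>This means we can start chaining together dataframe functions, for example `sort_values()`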

**Try it out:** Break down median revenues by country, and sort them highest to lowest

In [117]:
# Median of every numeric column per country, sorted by gross revenue (highest first)
movies.groupby(by='country').median(numeric_only=True).sort_values(by='gross', ascending=False).head()

country,imdb_score,budget,gross,duration,aspect_ratio,title_year
Taiwan,7.15,15000000.0,64340682.0,112.5,1.86,2007.5
Peru,5.4,45000000.0,57362581.0,110.0,1.85,1994.0
South Africa,6.25,17500000.0,45089048.0,100.5,1.85,2011.5
USA,6.5,20000000.0,32178777.0,103.0,1.85,2005.0
New Zealand,7.3,25000000.0,30465398.0,108.0,2.35,2005.0


### Multilevel GroupBy

We can also group on multiple columns to get *all unique combinations* of those columns. 

<br> For example, we can see if the relationship between `content_rating` and `imdb_score` differs across countries by grouping on both `country` and `content_rating`

In [119]:
movies_byCountryRating = movies.groupby(['country', 'content_rating'])

In [120]:
movies_byCountryRating['imdb_score'].agg('mean').sort_values(ascending=False).head()

country     content_rating
Poland      TV-MA             9.1
Italy       Approved          8.9
Kyrgyzstan  PG-13             8.7
Italy       TV-MA             8.7
            PG-13             8.6
Name: imdb_score, dtype: float64
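
The result of a multilevel groupby has a two-level index, which can be awkward to work with at first. A sketch of three common ways to handle it, on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Made-up values for illustration
toy = pd.DataFrame({
    'country': ['USA', 'USA', 'UK', 'UK', 'USA'],
    'rating': ['PG', 'R', 'PG', 'PG', 'PG'],
    'score': [7.0, 8.0, 6.0, 8.0, 9.0],
})

means = toy.groupby(['country', 'rating'])['score'].mean()   # Series with a two-level index

print(means.loc[('USA', 'PG')])   # look up one combination with a tuple
print(means.unstack())            # ratings become columns; missing combinations are NaN
print(means.reset_index())        # back to a flat DataFrame with regular columns
```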

# Summary

That's it for now!

<br>Today you learned how to:
- **Import** a dataset as a pandas object
- Check out quick features, like `.head()`, `.shape`, and `.value_counts()`
- Tell `pd.Series` and `pd.DataFrame` objects apart
- **Filter** rows based on some condition
- **Subset** columns to those we want

We also used special `GroupBy` objects to get specific drill-down insights by:
<br>(1) first breaking out, or **grouping** a dataset based on some category, then
<br>(2) **aggregating** information from each observation in that category

<br> In practice, if we wanted to get a mean score, broken down by every value in a given column, we would select the score column first and then aggregate:
<br>`df.groupby('group_column')['score_column'].agg('mean')`