## A Quick Review from W1

Before pulling any data, we've gotta import all the packages we need

In [None]:
import pandas as pd
import numpy as np

pd.set_option("display.max_rows", 6)

Now we can read in the data from a link OR from a file in the same directory

URL: `https://raw.githubusercontent.com/dt3zjy/node/master/week-2/workshop/imdb.csv`

Take a quick look at the data

How many rows are there? Columns?

There are a lot of columns. Could we get a full list?
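Here's a rough sketch of those first steps. To keep it self-contained, it reads a tiny made-up CSV from a string instead of the URL above (the rows are invented for illustration). `pd.read_csv` accepts a URL or a local file path in the same way.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for imdb.csv (values are illustrative only)
csv_text = """movie_title,content_rating,imdb_score
Movie A,PG,7.1
Movie B,R,8.2
Movie C,PG-13,6.5
"""

# pd.read_csv takes a URL string, a file path, or a file-like object
movies = pd.read_csv(io.StringIO(csv_text))

print(movies.head())            # quick look at the first rows
print(movies.shape)             # (number of rows, number of columns)
print(movies.columns.tolist())  # full list of column names
```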

# Pandas Foundations

## Dataframe who?

Think of dataframes as little "spreadsheets" that hold our data.
<br>Each row represents an observation, and each column represents some feature about that data.

<br> With little bits of code, we can modify what that spreadsheet looks like.
<br> For example, our raw data `movies` can be transformed to show some metric about some feature for some subset of the data

In [None]:
# For example:
movies[movies.country.isin(['USA','France','UK'])].groupby('country')[['gross', 'budget']].agg(['median','max'])

## Series vs. Dataframes 

We've been referring to *2 dimensional* tables using the `pd.DataFrame` object
<br>An individual column (or row) is *1 dimensional*. We call this a `pd.Series` object

We can access a series (always 1-dimensional) using either: `df.column` OR `df['column']`

To select a **row** as a series, we can use `df.iloc[]` and a specific row number (its index)
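As a small sketch of both kinds of access, here is a made-up three-row stand-in for `movies` (the titles and scores are invented):

```python
import pandas as pd

# Made-up stand-in for the movies dataframe
movies = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie C"],
    "imdb_score": [7.1, 8.2, 6.5],
})

col_attr = movies.imdb_score     # column via attribute access
col_key = movies["imdb_score"]   # same column via bracket access
first_row = movies.iloc[0]       # row at position 0, also a Series

print(type(col_key))
print(type(first_row))
```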

## Subsetting Columns

Subsetting and filtering is one of the most important, yet confusing topics when getting started.
<br>As we go along, feel free to run the code part by part to see what's going on in each step. `type()` is also a great tool here

Let's take a look at a smaller set of columns, say the movie title, actor names, and director name

In [None]:
movies[['movie_title','actor_1_name','actor_2_name','actor_3_name','director_name']] # Note TWO square brackets

Side Note: Why did we use two square brackets? The outer `[]` is the subsetting operation, and the inner `[]` is a `list` of column names that we pass to it.
<br> Whenever we place `[]` after a Pandas `DataFrame`, Pandas knows we're looking to do some sort of selection or filtering operation

In [None]:
actor_cols = ['actor_1_name','actor_2_name','actor_3_name']
movies[actor_cols]

Another quick point of confusion: Check out the difference between `df['column']` vs `df[['column']]` (try it out below)

In [None]:
movies

The former creates a *series*, since the input is just one *string*. The latter creates a *dataframe*, since the input is a *list* of columns.
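One way to see this for yourself, sketched on a made-up stand-in for `movies`, is to compare the type and shape of each result:

```python
import pandas as pd

# Made-up stand-in for the movies dataframe
movies = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie C"],
    "director_name": ["Director X", "Director Y", "Director Z"],
})

as_series = movies["movie_title"]    # a string in -> Series out
as_frame = movies[["movie_title"]]   # a list in -> DataFrame out

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```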

## Filtering Rows With Conditions

Another common task is to filter rows based upon some criteria we have. We could:
1. Compare floats
2. Match strings
3. Check against multiple elements

What if we wanted to find movies that were either `G`, `PG`, or `PG-13`? 
We'd have to type out something pretty annoying like: 
```python
movies[(movies.content_rating == 'G') | (movies.content_rating == 'PG') | (movies.content_rating == 'PG-13')]
```

<br>Instead, we'll use the `.isin()` method to match against a group. Remember, `.isin()` expects a collection of values such as a `list`, not a single string
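A minimal sketch of the `.isin()` version, using a made-up stand-in for `movies` and checking that it matches the long form:

```python
import pandas as pd

# Made-up stand-in for the movies dataframe
movies = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "content_rating": ["G", "PG", "PG-13", "R"],
})

family_ratings = ["G", "PG", "PG-13"]
family_movies = movies[movies.content_rating.isin(family_ratings)]

# The long-hand version with | gives the same rows
long_form = movies[
    (movies.content_rating == "G")
    | (movies.content_rating == "PG")
    | (movies.content_rating == "PG-13")
]

print(family_movies)
```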

### Masking

What's going on under the hood? When we say something like this,
```python 
movies.content_rating == 'G'
```
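We're actually asking Pandas to check, **for each** row in the series, whether the given statement is True or False. The result is a series of True/False values (a *mask*), and putting that mask inside `[]` keeps only the rows marked True.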
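To make this concrete, here's a small sketch on a made-up stand-in for `movies`. The variable name `mask` is just for illustration:

```python
import pandas as pd

# Made-up stand-in for the movies dataframe
movies = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "content_rating": ["G", "PG", "R", "G"],
})

mask = movies.content_rating == "G"   # one True/False per row
print(mask)

g_movies = movies[mask]               # keeps only the rows where mask is True
print(g_movies)

print(mask.sum())                     # True counts as 1, so this counts matches
```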


## Basic functions

There are a couple of super helpful functions for working with our data.

For sorting, we can use `.sort_values(by= )`, and pass in either a single string, or multiple strings in a list.
<br>To flip the default order, change the `ascending= ` parameter to `False`

We could also use the `.value_counts()` function to count how many observations there are for each category.
<br>If we wanted relative frequencies (i.e. proportions) instead of absolute counts, we could change the `normalize=` parameter to `True`

Finally, we can take basic summary measures, like the `.mean()` or `.sum()` of a column.
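Here's a quick sketch of all three on a made-up stand-in for `movies` (the values are invented for illustration):

```python
import pandas as pd

# Made-up stand-in for the movies dataframe
movies = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "content_rating": ["PG", "R", "PG", "PG-13"],
    "duration": [100, 120, 90, 110],
})

# Sort by one column, longest first
longest_first = movies.sort_values(by="duration", ascending=False)

# Sort by several columns by passing a list
by_rating_then_duration = movies.sort_values(by=["content_rating", "duration"])

# Absolute counts and proportions per category
counts = movies.content_rating.value_counts()
proportions = movies.content_rating.value_counts(normalize=True)

# Basic summary measures on a single column
mean_duration = movies.duration.mean()
total_duration = movies.duration.sum()

print(longest_first)
print(counts)
print(proportions)
print(mean_duration, total_duration)
```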

## Chaining

Chaining helps make our code more concise and readable.

For example, if we wanted to combine some subsets with functions:

In [None]:
df1 = movies[movies.imdb_score > 8.5]
df2 = df1[df1.content_rating == 'PG']
df3 = df2.sort_values(by='duration',ascending=False)
df3.head()

We could instead write the above as:
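One possible chained version is sketched below on a made-up stand-in for `movies`. Combining the two conditions with `&` is just one option. The name `top_pg` is only for illustration.

```python
import pandas as pd

# Made-up stand-in for the movies dataframe
movies = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "imdb_score": [8.6, 8.9, 7.0, 8.7],
    "content_rating": ["PG", "PG", "PG", "R"],
    "duration": [100, 130, 95, 120],
})

# Filter, sort, and preview in one expression, with no temporary variables
top_pg = (
    movies[(movies.imdb_score > 8.5) & (movies.content_rating == "PG")]
    .sort_values(by="duration", ascending=False)
    .head()
)

print(top_pg)
```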

This works because each part of our code `returns` a dataframe, so we can keep tagging along functions instead of saving each step into a temporary variable.

**Try it out:** In one line, see if you can find the value counts of `content_rating` for movies with a gross revenue (`gross`) over `200000000` ($ 200 million)

# Practice w/ UFOs

Data sampled from the National UFO Reporting Center (NUFORC)
<br>With your breakout groups, open up `ufo.csv` and answer the following questions:

1. Among the West Coast states (California, Oregon, and Washington), how long (on average) did the fireball encounters last?
2. Which state saw the most encounters that lasted between 5 minutes and 1 hour?
3. There was one particularly interesting encounter on `2/11/2004 00:00` in West Palm Beach, Florida. What happened?

<br>Hint: Break down each question into parts, and chain them back together. There's no particular 'right' way
<br>To refer to the `shape` column, use `ufo['shape']` instead of `ufo.shape`, since the latter is the dataframe's built-in attribute that holds its dimensions

<br> URL = `https://raw.githubusercontent.com/dt3zjy/node/master/week-2/workshop/ufo.csv`

In [None]:
# Read data

In [None]:
# 1

In [None]:
# 2

In [None]:
# 3

# Groupby Objects

Earlier, we used `value_counts()` to get the number of movies for each content rating

This is a good summary statistic to examine the distribution of our dataset. But...

<br>What if we want to know how movie performance differs by rating?
<br>We need to apply some function (e.g. take the mean of revenue) **for each** content rating

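Grouping the dataframe with `.groupby()` creates a special **GroupBy object**. The code below assumes it was built with `movies.groupby('content_rating')` and saved as `movies_byRating`.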

<br>For now, let's think of it like a *collection* of dataframes, separated by each unique value from content rating (one group for `R`, one for `PG-13`, etc).
<br>We can't easily render what the entire GroupBy object looks like, but we can pull out a particular group

In [None]:
print(type(movies_byRating.get_group('PG')))
movies_byRating.get_group('PG').head()

When we apply an **aggregation** function like `.mean()` to a `GroupBy` object, we get back averages for each numeric column in the dataframe, **broken down** by content rating
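As a rough sketch, here is a grouped mean on a made-up stand-in for `movies`. Passing `numeric_only=True` is my assumption for handling the text column, since newer pandas versions raise an error when asked to average strings.

```python
import pandas as pd

# Made-up stand-in for the movies dataframe
movies = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "content_rating": ["PG", "PG", "R", "R"],
    "imdb_score": [7.0, 8.0, 6.0, 9.0],
    "gross": [100.0, 300.0, 50.0, 150.0],
})

movies_byRating = movies.groupby("content_rating")

# One row per content rating, one column per numeric feature
mean_by_rating = movies_byRating.mean(numeric_only=True)
print(mean_by_rating)
```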

If we were just to apply `.mean()` to the entire dataframe, we'd only get back one row with summaries for the entire dataset

In [None]:
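# numeric_only=True skips text columns, which newer pandas versions refuse to average
pd.DataFrame(movies.mean(numeric_only=True), columns=['Total']).T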

We've got other ways to aggregate the data too.

Here, we're showing the mean, max, and min values of `imdb_score` by passing in multiple strings in a list to `.agg()`
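A minimal sketch of that call, assuming a tiny made-up stand-in for `movies` and the name `score_summary` for the result:

```python
import pandas as pd

# Made-up stand-in for the movies dataframe
movies = pd.DataFrame({
    "content_rating": ["PG", "PG", "R", "R"],
    "imdb_score": [7.0, 8.0, 6.0, 9.0],
})

# Select the column first, then ask for several summaries at once
score_summary = movies.groupby("content_rating")["imdb_score"].agg(["mean", "max", "min"])
print(score_summary)
```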

The results of these groupby operations are dataframes (or series, if we select a single column and apply a single function). Check it out with the `type()` function.
<br>This means we can start chaining together dataframe functions, for example `sort_values()`

**Try it out:** Break down median revenues by country, then sort them highest to lowest

In [None]:
# 1

### Multilevel GroupBy

We can also group on multiple columns to get *all unique combinations* of those columns. 

<br> For example, we can see if the relationship between `content_rating` and `imdb_score` differs across countries by using both `country` and `content_rating` as keys
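Here's a small sketch of a two-key groupby on a made-up stand-in for `movies`. Note that only the combinations that actually appear in the data show up in the result.

```python
import pandas as pd

# Made-up stand-in for the movies dataframe
movies = pd.DataFrame({
    "country": ["USA", "USA", "USA", "UK"],
    "content_rating": ["PG", "PG", "R", "PG"],
    "imdb_score": [7.0, 8.0, 6.0, 9.0],
})

# Passing a list of keys gives one group per (country, content_rating) pair
by_country_rating = movies.groupby(["country", "content_rating"])["imdb_score"].mean()
print(by_country_rating)
```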

# Summary

That's it for now!

<br>Today you learned how to:
- **Import** a dataset as a pandas object
- Check out quick features, like `.head()`, `.shape`, and `.value_counts()`
- Distinguish between `pd.Series` and `pd.DataFrame` objects
- **Filter** rows based on some condition
- **Subset** columns to those we want

We also used special `GroupBy` objects to get specific drill-down insights by:
<br>(1) first breaking out, or **grouping** a dataset based on some category, then
<br>(2) **aggregating** information from each observation in that category

<br> In practice, if we wanted to get a mean score, broken down by every value in a given column, we would do:
<br>`df.groupby(by='group_column')['score_column'].agg('mean')`