# Analyze data

In this notebook, we'll explore how to use the [_pandas_](https://pandas.pydata.org/) data analysis library to perform some basic analysis on a CSV file assembled from a bunch of inconsistently formatted spreadsheets with data from the FAA on [laser pointer incidents](https://www.faa.gov/about/initiatives/lasers/laws) from 2010 to mid-2022.

(If you're interested in how those files were downloaded and cleaned, you can check out the [`Download data.ipynb`](Download%20data.ipynb) and [`Combine and clean spreadsheets.ipynb`](Combine%20and%20clean%20spreadsheets.ipynb) notebooks, though they're a bit more complicated and we probably won't have time to get into them today.)

Also: Please feel free to keep these two cheat sheets open in your browser: a rundown of [basic Python syntax](../cheat-sheets/0%20-%20Python%20101.ipynb) and a [pandas reference](../cheat-sheets/1%20-%20Pandas%20cheat%20sheet.ipynb) that includes some fun bonus material!

Here's a quick outline of today:

- [Importing pandas](#Importing-pandas)
- [Loading and inspecting data](#Loading-and-inspecting-data)
- [Checking out individual columns](#Checking-out-individual-columns)
- [Sorting](#Sorting)
- [Filtering](#Filtering)
- [Method chaining](#Method-chaining)
- [Grouping](#Grouping)
- [Exporting](#Exporting)
- [Let's practice](#Let's-practice)
- [Bored? Here's some extra credit](#Bored?-Here's-some-extra-credit)

Note: We probably won't have time to get into joining data sets, which is another super handy thing you can do in pandas, but there's an overview [in the "Joining data" section of our pandas cheat sheet notebook](../cheat-sheets/1%20-%20Pandas%20cheat%20sheet.ipynb#Joining-data).

### Importing pandas

Before you can use the functionality of the _pandas_ library, which we installed prior to the class, you have to import it into your script. The convention for doing so is to use an alias to import pandas _as_ `pd` -- just saves you a few keystrokes every time you need to use one of its tools.

In [1]:
import pandas as pd

### Loading and inspecting data

Before you can use pandas to analyze your data, you first need to load the data into a _data frame_, which is a sort of virtual spreadsheet with columns and rows.

You can load many different types of data files into a data frame, including CSVs and other delimited text files, Excel files, JSON and more. ([Here's a quick reference notebook](../cheat-sheets/Importing%20data%20into%20pandas.ipynb) demonstrating how to import some common data files, including data pulled directly from the Internet.)

For today, we'll focus on importing the laser pointer data using a pandas method called [`read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). There are a ton of options you can pass in when you read in the data file, but at minimum, you need to tell the method _where_ the file lives, which means you need to supply the path to the data file as a Python string (some text enclosed in single or double quotes).

The file is called `faa-laser-incidents.csv`, and it is located in the same directory as this notebook file, so we don't need to specify a longer path.

Additionally -- something I found out once I started to explore the data -- we need to tell pandas to import empty values as an empty string rather than a specific kind of null value. (The details aren't important here, but this will save us from having to convert between data types later -- my Google was "[pandas empty strings instead of nan import](https://www.google.com/search?client=firefox-b-1-d&q=pandas+empty+strings+instead+of+nan+import)" because although I have done this operation dozens of times, I literally never remember this, which led me to [our good friend StackOverflow](https://stackoverflow.com/a/43832955) and _boomtown_ we're in business.)

As we import the data, we'll also _assign_ the results of the loading operation to a new variable called `df`, short for data frame -- easy to type, plus you'll see this convention a lot when Googling around for help. You can see more information on variable assignment in the Python syntax notebook.

In [62]:
# this is a comment that will be ignored -- they're here for you, not for python

# you can ignore the warning on import or specify the low-memory keyword argument, as suggested -- dealer's choice
df = pd.read_csv(
    'faa-laser-incidents.csv',
    keep_default_na=False
)


Nothing happened, which usually is good news! (Python will always yell at you if you make mistakes -- learning how to debug errors is a major part of the process of learning to code.)
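
If you're curious what `keep_default_na=False` actually changes, here's a minimal sketch using a tiny made-up CSV (not the real FAA file) fed to `read_csv()` from memory. The `csv_text`, `default_df` and `strings_df` names are just for illustration.

```python
import io

import pandas as pd

# A tiny made-up CSV with one empty cell, standing in for the real file
csv_text = "Flight ID,City\nAIR1,Lexington\nAAL850,\n"

# Default behavior: the empty cell is read in as NaN, a null value
default_df = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False, the empty cell stays an empty string
strings_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_df['City'].isna().sum())  # count of null cities
print(strings_df['City'].tolist())      # the same column, as plain strings
```

The tradeoff: every text column stays text, but a numeric column with blanks in it gets read in as text, too.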

Next, we'll want to take a look at the imported data using a couple of handy tools in the _pandas_ toolbox.

Pretty much the first thing I do after importing is use the [`.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) method to check out the first five rows of data.

In [63]:
df.head()

Unnamed: 0,Incident Date,Incident Time,Flight ID,Aircraft,Altitude,Airport,Laser Color,Injury,City,State,datetime_utc,year,injury_clean,state_clean,colors_clean
0,2010-01-01,102.0,AIR1,HELO,2000,LEX,green,NO,Lexington,KENTUCKY,2010-01-01 01:02:00,2010,NO,KENTUCKY,green
1,2010-01-01,403.0,ASA513,B737,10000,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 04:03:00,2010,NO,CALIFORNIA,green
2,2010-01-01,246.0,EGF3002,E135,7000,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 02:46:00,2010,NO,CALIFORNIA,green
3,2010-01-01,157.0,EGF3086,E135,7000,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 01:57:00,2010,NO,CALIFORNIA,green
4,2010-01-01,258.0,JBU300,A320,12500,SLI,green,NO,Los Alamitos,CALIFORNIA,2010-01-01 02:58:00,2010,NO,CALIFORNIA,green


If you'd like to see more than five rows, just pass a number into the `head()` method:

In [64]:
df.head(10)

Unnamed: 0,Incident Date,Incident Time,Flight ID,Aircraft,Altitude,Airport,Laser Color,Injury,City,State,datetime_utc,year,injury_clean,state_clean,colors_clean
0,2010-01-01,102.0,AIR1,HELO,2000,LEX,green,NO,Lexington,KENTUCKY,2010-01-01 01:02:00,2010,NO,KENTUCKY,green
1,2010-01-01,403.0,ASA513,B737,10000,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 04:03:00,2010,NO,CALIFORNIA,green
2,2010-01-01,246.0,EGF3002,E135,7000,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 02:46:00,2010,NO,CALIFORNIA,green
3,2010-01-01,157.0,EGF3086,E135,7000,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 01:57:00,2010,NO,CALIFORNIA,green
4,2010-01-01,258.0,JBU300,A320,12500,SLI,green,NO,Los Alamitos,CALIFORNIA,2010-01-01 02:58:00,2010,NO,CALIFORNIA,green
5,2010-01-01,255.0,JBU303,A320,2000,LGB,green,NO,Los Angeles,CALIFORNIA,2010-01-01 02:55:00,2010,NO,CALIFORNIA,green
6,2010-01-01,857.0,PD2,E120,700,FAT,green,NO,Fresno,CALIFORNIA,2010-01-01 08:57:00,2010,NO,CALIFORNIA,green
7,2010-01-01,44.0,SKW6083,CRJ2,6800,LNK,green,NO,Lincoln,NEBRASKA,2010-01-01 00:44:00,2010,NO,NEBRASKA,green
8,2010-01-01,155.0,SKW6318,E120,1000,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 01:55:00,2010,NO,CALIFORNIA,green
9,2010-01-01,730.0,SKW6433,CRJ2,1000,FAT,green,NO,Fresno,CALIFORNIA,2010-01-01 07:30:00,2010,NO,CALIFORNIA,green


Other ways to check out the data frame:

- [`.tail()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html) will get you the _last_ 5 rows of data
- [`.info()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html) will let us know if any columns have null values in them, and what kind of data type pandas has inferred for them ("object" being a kind of catchall) -- this is helpful information when you're figuring out what kinds of questions you want to ask the data
- [`.sample(5)`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html) will give you a random sample of the data (pass in the number of rows you'd like to see; the default is 1)
- [`.describe()`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.describe.html) will compute summary stats for every _numeric_ column
- [`.columns`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html) will list the column names
- [`.dtypes`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html) will list the data types of each column
- [`.shape`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.shape.html) will give you `(number of rows, number of columns)`

In [65]:
df.tail()

Unnamed: 0,Incident Date,Incident Time,Flight ID,Aircraft,Altitude,Airport,Laser Color,Injury,City,State,datetime_utc,year,injury_clean,state_clean,colors_clean
71204,2022-05-31,416,ABX152,B763,14500,ZID,green,NO,Indianapolis,INDIANA,2022-05-31 04:16:00,2022,NO,INDIANA,green
71205,2022-05-31,416,DAL1140,B737,32000,ZME,green,NO,Memphis,TENNESSEE,2022-05-31 04:16:00,2022,NO,TENNESSEE,green
71206,2022-05-31,522,SWA1781,B738,35000,ZID,green,NO,Indianapolis,INDIANA,2022-05-31 05:22:00,2022,NO,INDIANA,green
71207,2022-05-31,530,N3055B,PA28,3500,JAX,green,NO,Jacksonville,FLORIDA,2022-05-31 05:30:00,2022,NO,FLORIDA,green
71208,2022-05-31,853,PHI53,AAS50,7000,ABQ,blue,NO,Albuquerque,NEW MEXICO,2022-05-31 08:53:00,2022,NO,NEW MEXICO,blue


In [66]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71209 entries, 0 to 71208
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Incident Date  71209 non-null  object
 1   Incident Time  71209 non-null  object
 2   Flight ID      71209 non-null  object
 3   Aircraft       71209 non-null  object
 4   Altitude       71209 non-null  object
 5   Airport        71209 non-null  object
 6   Laser Color    71209 non-null  object
 7   Injury         71209 non-null  object
 8   City           71209 non-null  object
 9   State          71209 non-null  object
 10  datetime_utc   71209 non-null  object
 11  year           71209 non-null  int64 
 12  injury_clean   71209 non-null  object
 13  state_clean    71209 non-null  object
 14  colors_clean   71209 non-null  object
dtypes: int64(1), object(14)
memory usage: 8.1+ MB


Ope! The `Altitude` column is an object, but we want it to be a number. (Later, we might want to figure out what the average altitude was, or something.) There are a couple of ways to solve this -- here's one: Use the top-level [`to_numeric()`](https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html) function to coerce the values to a numeric type.

Please note two things here:
- With variable assignment, you're typically assigning the results of an operation or statement on the right-hand side of an equals sign to a variable name on the left. Here, we're assigning the results of the operation on the right-hand side -- the conversion of the values in the `Altitude` column to numbers -- back to the `Altitude` column.
- What happens when the parser hits a value that can't be converted to a number -- like the string `UNKN` in this data? The default behavior is to raise an error, but then we're stuck, so instead let's coerce those unparseable values to nulls and later exclude them if needed. To do this, you can specify the keyword argument `errors='coerce'`. (See [the function's documentation](https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html) for other options.)

In [67]:
df['Altitude'] = pd.to_numeric(df['Altitude'], errors='coerce')
print(df.dtypes)

Incident Date     object
Incident Time     object
Flight ID         object
Aircraft          object
Altitude         float64
Airport           object
Laser Color       object
Injury            object
City              object
State             object
datetime_utc      object
year               int64
injury_clean      object
state_clean       object
colors_clean      object
dtype: object
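
To see what the coercion is doing, here's a small sketch on a made-up handful of altitude strings (the `raw_altitudes` values are invented for illustration, apart from `UNKN`, which does show up in the data):

```python
import pandas as pd

# Made-up altitude values as strings, including two that can't be parsed
raw_altitudes = pd.Series(['2000', '10000', 'UNKN', ''])

# errors='coerce' turns anything unparseable into NaN instead of raising
altitudes = pd.to_numeric(raw_altitudes, errors='coerce')

print(altitudes.dtype)         # the column is now a float type
print(altitudes.isna().sum())  # how many values ended up as nulls
```

The nulls are why the column lands on a float type rather than an integer type: `NaN` is itself a float.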


In [68]:
df.sample(10)

Unnamed: 0,Incident Date,Incident Time,Flight ID,Aircraft,Altitude,Airport,Laser Color,Injury,City,State,datetime_utc,year,injury_clean,state_clean,colors_clean
2621,2010-12-10,503.0,DAL175,B752,1000.0,MCO,green,NO,Orlando,FLORIDA,2010-12-10 05:03:00,2010,NO,FLORIDA,green
37226,2017-10-11,330.0,DAL1897,MD90,1500.0,CMH,green,NO,Columbus,OHIO,2017-10-11 03:30:00,2017,NO,OHIO,green
66645,2021-11-30,255.0,ASH6374,E175,17000.0,ZHU,green,NO,Houston,TEXAS,2021-11-30 02:55:00,2021,NO,TEXAS,green
36165,2017-08-18,445.0,CF23,HELO,4500.0,ELP,green,NO,El Paso,TEXAS,2017-08-18 04:45:00,2017,NO,TEXAS,green
26480,2016-03-07,839.0,N136JP,LJ35,2000.0,FLL,green,NO,Fort Lauderdale,FLORIDA,2016-03-07 08:39:00,2016,NO,FLORIDA,green
13260,2013-11-08,227.0,JENA208,C182,3500.0,PDX,blue,NO,Portland,OREGON,2013-11-08 02:27:00,2013,NO,OREGON,blue
54621,2020-08-24,345.0,SWA454,B738,4000.0,PHL,green,NO,Philadelphia,PENNSYLVANIA,2020-08-24 03:45:00,2020,NO,PENNSYLVANIA,green
56987,2020-11-28,118.0,N242ML,C525,26000.0,COU,blue,NO,Columbia,MISSOURI,2020-11-28 01:18:00,2020,NO,MISSOURI,blue
36514,2017-09-08,445.0,ATN3753,B763,1800.0,,green,NO,Chicago-Rockford,ILLINOIS,2017-09-08 04:45:00,2017,NO,ILLINOIS,green
16437,2014-09-24,38.0,SWA644,B737,4500.0,HLG,green,NO,Wheeling,WEST VIRGINIA,2014-09-24 00:38:00,2014,NO,WEST VIRGINIA,green


In [69]:
df.describe()

Unnamed: 0,Altitude,year
count,69759.0,71209.0
mean,7112.128,2016.827746
std,8943.957,3.400831
min,0.0,2010.0
25%,2500.0,2015.0
50%,5000.0,2017.0
75%,9000.0,2020.0
max,1350020.0,2022.0


In [70]:
df.columns

Index(['Incident Date', 'Incident Time', 'Flight ID', 'Aircraft', 'Altitude',
       'Airport', 'Laser Color', 'Injury', 'City', 'State', 'datetime_utc',
       'year', 'injury_clean', 'state_clean', 'colors_clean'],
      dtype='object')

In [71]:
df.dtypes

Incident Date     object
Incident Time     object
Flight ID         object
Aircraft          object
Altitude         float64
Airport           object
Laser Color       object
Injury            object
City              object
State             object
datetime_utc      object
year               int64
injury_clean      object
state_clean       object
colors_clean      object
dtype: object

In [72]:
df.shape

(71209, 15)

In [73]:
# bonus! you can also use the base python function len() to
# check the length of a data frame -- in other words, a records count
# https://realpython.com/len-python-function/
len(df)

71209

At this step, I'm still just kind of poking at the data and looking for outliers, dirty/missing data, just trying to get my arms around what I'm looking at.

This is usually where I start inspecting and running integrity checks on data in individual columns ...

### Checking out individual columns

Just as you might interrogate your data in another program (Excel, SQL, whatever), you also likely will want to do some integrity checks on any column that's interesting to you: checking for outliers, examining the range of values to see if they pass the common-sense test, etc.

In pandas, you can select one (or more) columns to examine individually using either "dot notation" or "bracket notation." These are equivalent operations:
- `df.state_clean`
- `df['state_clean']`

One note: If your column name has spaces [or other characters that you couldn't use for a variable name](https://realpython.com/python-variables/#variable-names), you _must_ use bracket notation.
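
Here's a quick sketch of that rule on a two-row stand-in data frame (the `toy` frame is made up, though the column names come from the laser data):

```python
import pandas as pd

# A two-row stand-in using column names from the laser data
toy = pd.DataFrame({
    'state_clean': ['KENTUCKY', 'CALIFORNIA'],
    'Laser Color': ['green', 'blue'],
})

# No spaces in the name: both notations hand back the same Series
same = toy.state_clean.equals(toy['state_clean'])
print(same)

# A space in the name: only bracket notation works
colors = toy['Laser Color']
# toy.Laser Color  <- this would be a SyntaxError
print(colors.tolist())
```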

In [74]:
df.state_clean

0          KENTUCKY
1        CALIFORNIA
2        CALIFORNIA
3        CALIFORNIA
4        CALIFORNIA
            ...    
71204       INDIANA
71205     TENNESSEE
71206       INDIANA
71207       FLORIDA
71208    NEW MEXICO
Name: state_clean, Length: 71209, dtype: object

In [75]:
df['state_clean']

0          KENTUCKY
1        CALIFORNIA
2        CALIFORNIA
3        CALIFORNIA
4        CALIFORNIA
            ...    
71204       INDIANA
71205     TENNESSEE
71206       INDIANA
71207       FLORIDA
71208    NEW MEXICO
Name: state_clean, Length: 71209, dtype: object

In [76]:
df[['state_clean', 'injury_clean']]

Unnamed: 0,state_clean,injury_clean
0,KENTUCKY,NO
1,CALIFORNIA,NO
2,CALIFORNIA,NO
3,CALIFORNIA,NO
4,CALIFORNIA,NO
...,...,...
71204,INDIANA,NO
71205,TENNESSEE,NO
71206,INDIANA,NO
71207,FLORIDA,NO


**✍️ Try it yourself**

In the cells below, try selecting specific columns or groups of columns.

We'll cover sorting here in a bit, but there are a couple of handy pandas operations you can use to check out the data in a particular `Series` (a single column):
- `.max()` will give you the largest/latest/alphabetically latest value in the column
- `.min()` will give you the smallest/soonest/alphabetically earliest value in the column
- `.mean()` will give the arithmetic mean (average) of a numeric column
- `.median()` will give the median of a numeric column
- `.sum()` will give the sum of a numeric column

In [81]:
df.state_clean.min()

''

In [82]:
df.state_clean.max()

'WYOMING'

In [83]:
df.year.min()

2010

In [84]:
df.year.max()

2022

In [86]:
df.Altitude.mean()

7112.12752476383

In [87]:
df.Altitude.median()

5000.0

In [88]:
df.Altitude.max()

1350020.0

In [89]:
df.Altitude.min()

0.0
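
A maximum altitude of 1,350,020 is the kind of value worth a second look. One way to eyeball the extremes, sketched here on made-up numbers rather than the real column, is the `Series` method `nlargest()`, which returns the top _n_ values:

```python
import pandas as pd

# Made-up altitudes with one implausible value, like the maximum above
altitudes = pd.Series([2000.0, 7000.0, 12500.0, 1350020.0])

# The three largest values, biggest first
top_three = altitudes.nlargest(3)
print(top_three.tolist())

# One extreme value drags the mean way up but barely touches the median
print(altitudes.mean())
print(altitudes.median())
```

This is one reason the median often makes a safer "typical value" than the mean when you suspect dirty data.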

#### Value counts FTW

A common integrity check is to get a frequency count of values in a particular column, kind of like doing a pivot table with counts in a spreadsheet program. Often, the question you're trying to answer is: What's the most common value?

This is a common enough operation that pandas has a method for that -- [`value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) -- that you can apply to a column.

In [90]:
df.state_clean.value_counts()

CALIFORNIA             13375
TEXAS                   7215
FLORIDA                 5269
ARIZONA                 3464
ILLINOIS                2155
                       ...  
MICRONESIA                 1
U.S. VIRGIN ISLANDS        1
MEXICO                     1
MIAMI                      1
NORTH HAMPSHIRE            1
Name: state_clean, Length: 63, dtype: int64
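
If you'd rather see shares than raw counts, `value_counts()` accepts a `normalize=True` argument. A minimal sketch on a made-up column:

```python
import pandas as pd

# A made-up column of state names, one of them misspelled
states = pd.Series(['TEXAS', 'CALIFORNIA', 'TEXAS', 'TEAS', 'TEXAS'])

# Counts, with the most common value first
counts = states.value_counts()
print(counts)

# Proportions of the total instead of counts
shares = states.value_counts(normalize=True)
print(shares)
```

Note how the misspelled `TEAS` gets its own row -- which is exactly why this makes a good integrity check.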

Another way to answer the question "What values are in this column?" is to use the [`unique()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html) method.

In [91]:
df.State.unique()

array(['KENTUCKY', 'CALIFORNIA', 'NEBRASKA', 'LOUISIANA', 'ARIZONA',
       'PENNSYLVANIA', 'FLORIDA', 'ILLINOIS', 'TEXAS', 'NEW YORK',
       'PUERTO RICO', 'NORTH CAROLINA', 'HAWAII', 'TENNESSEE', 'COLORADO',
       'SOUTH CAROLINA', 'ALABAMA', 'NEVADA', 'ARKANSAS', 'WASHINGTON',
       'MISSOURI', 'UTAH', 'NEW JERSEY', 'KANSAS', 'OREGON',
       'MASSACHUSETTS', 'GUAM', 'MARYLAND', 'CONNECTICUT', 'IDAHO',
       'GEORGIA', 'DISTRICT OF COLUMBIA', 'RHODE ISLAND', 'WEST VIRGINIA',
       'OHIO', 'MINNESOTA', 'VERMONT', 'MAINE', 'INDIANA', 'NEW MEXICO',
       'MICHIGAN', 'IOWA', 'WISCONSIN', 'OKLAHOMA', 'VIRGINIA', 'CHICAGO',
       'DELAWARE', 'SOUTH DAKOTA', 'WYOMING', 'MISSISSIPPI', 'ALASKA',
       'NEW HAMPSHIRE', 'MONTANA', 'NORTH DAKOTA', 'UNKN', 'CALIFIORNIA',
       '', 'ST CROIX', 'VIRGIN ISLANDS', 'NORTHERN MARIANA IS',
       'PITTSBURGH', 'NORTHERN MARIANA ISLAND', 'MICRONESIA', 'OHO',
       'NORTHERN MARIANAS IS', 'U.S. VIRGIN ISLANDS', 'DC',
       'MASSACHUSSETS', 'D.

A more friendly view is to look at a sorted version of this, and you can use a base Python function called `sorted` to accomplish this:

In [92]:
sorted(df.State.unique())

['',
 'ALABAMA',
 'ALASKA',
 'ARIZONA',
 'ARKANSAS',
 'CALIFIORNIA',
 'CALIFORNIA',
 'CHICAGO',
 'COLORADO',
 'CONNECTICUT',
 'D.C.',
 'DC',
 'DELAWARE',
 'DISTRICT OF COLUMBIA',
 'FLORIDA',
 'GEORGIA',
 'GUAM',
 'HAWAII',
 'IDAHO',
 'ILLINOIS',
 'INDIANA',
 'IOWA',
 'KANSAS',
 'KENTUCKY',
 'LOUISIANA',
 'MAINE',
 'MARIANA ISLANDS',
 'MARINA ISLANDS',
 'MARYLAND',
 'MASSACHUSETTS',
 'MASSACHUSSETS',
 'MEXICO',
 'MIAMI',
 'MICHIGAN',
 'MICRONESIA',
 'MINNESOTA',
 'MISSISSIPPI',
 'MISSOURI',
 'MONTANA',
 'NEBRASKA',
 'NEVADA',
 'NEW HAMPSHIRE',
 'NEW JERSEY',
 'NEW MEXICO',
 'NEW YORK',
 'NORTH CAROLINA',
 'NORTH DAKOTA',
 'NORTH HAMPSHIRE',
 'NORTHERN MARIANA IS',
 'NORTHERN MARIANA ISLAND',
 'NORTHERN MARIANAS IS',
 'NORTHERN MARINA ISLANDS',
 'OHIO',
 'OHO',
 'OKLAHOMA',
 'OREGON',
 'PENNSYLVANIA',
 'PITTSBURGH',
 'PUERTO RICO',
 'RHODE ISLAND',
 'SOUTH CAROLINA',
 'SOUTH DAKOTA',
 'ST CROIX',
 'TEAS',
 'TENNESSEE',
 'TEXAS',
 'U.S. VIRGIN ISLANDS',
 'UNKN',
 'UTAH',
 'VA',
 'VERMONT'

This is super useful when you're filtering your data to focus on specific rows, so you need to know how your target value is represented in the data, or if you're trying to suss out how dirty your data is to clean it.

You can see this approach in action (with a little sauce on top) in the [`Combine and clean spreadsheets.ipynb`](Combine%20and%20clean%20spreadsheets.ipynb) notebook.
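
As a rough idea of what that cleanup can look like (this is a sketch, not necessarily how the cleaning notebook does it), you could pass a dictionary of misspellings to the `Series` method `replace()`. The `fixes` lookup below is hypothetical and covers only a few of the typos in the list above:

```python
import pandas as pd

# A few values from the sorted list above, typos included
states = pd.Series(['TEXAS', 'TEAS', 'OHO', 'OHIO', 'CALIFIORNIA'])

# Hypothetical lookup: misspelling on the left, corrected name on the right
fixes = {'TEAS': 'TEXAS', 'OHO': 'OHIO', 'CALIFIORNIA': 'CALIFORNIA'}

# replace() swaps whole values that match a key and leaves the rest alone
cleaned = states.replace(fixes)
print(sorted(cleaned.unique()))
```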

### Sorting

To sort the rows in a DataFrame, use the [`sort_values()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) method. At a minimum, you need to tell it which column to sort on.

In [93]:
df.sort_values('datetime_utc')

Unnamed: 0,Incident Date,Incident Time,Flight ID,Aircraft,Altitude,Airport,Laser Color,Injury,City,State,datetime_utc,year,injury_clean,state_clean,colors_clean
7,2010-01-01,44.0,SKW6083,CRJ2,6800.0,LNK,green,NO,Lincoln,NEBRASKA,2010-01-01 00:44:00,2010,NO,NEBRASKA,green
0,2010-01-01,102.0,AIR1,HELO,2000.0,LEX,green,NO,Lexington,KENTUCKY,2010-01-01 01:02:00,2010,NO,KENTUCKY,green
8,2010-01-01,155.0,SKW6318,E120,1000.0,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 01:55:00,2010,NO,CALIFORNIA,green
3,2010-01-01,157.0,EGF3086,E135,7000.0,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 01:57:00,2010,NO,CALIFORNIA,green
2,2010-01-01,246.0,EGF3002,E135,7000.0,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 02:46:00,2010,NO,CALIFORNIA,green
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71204,2022-05-31,416,ABX152,B763,14500.0,ZID,green,NO,Indianapolis,INDIANA,2022-05-31 04:16:00,2022,NO,INDIANA,green
71205,2022-05-31,416,DAL1140,B737,32000.0,ZME,green,NO,Memphis,TENNESSEE,2022-05-31 04:16:00,2022,NO,TENNESSEE,green
71206,2022-05-31,522,SWA1781,B738,35000.0,ZID,green,NO,Indianapolis,INDIANA,2022-05-31 05:22:00,2022,NO,INDIANA,green
71207,2022-05-31,530,N3055B,PA28,3500.0,JAX,green,NO,Jacksonville,FLORIDA,2022-05-31 05:30:00,2022,NO,FLORIDA,green


The default sort is ascending. To sort descending, you need to pass in another argument to the `sort_values()` method: `ascending=False`.

Note that the boolean value is _not_ a string, so it's not contained in quotes, and only the initial letter is capitalized.

For any function or method, multiple arguments are separated by a comma.

In [94]:
df.sort_values('datetime_utc', ascending=False)

Unnamed: 0,Incident Date,Incident Time,Flight ID,Aircraft,Altitude,Airport,Laser Color,Injury,City,State,datetime_utc,year,injury_clean,state_clean,colors_clean
71208,2022-05-31,853,PHI53,AAS50,7000.0,ABQ,blue,NO,Albuquerque,NEW MEXICO,2022-05-31 08:53:00,2022,NO,NEW MEXICO,blue
71207,2022-05-31,530,N3055B,PA28,3500.0,JAX,green,NO,Jacksonville,FLORIDA,2022-05-31 05:30:00,2022,NO,FLORIDA,green
71206,2022-05-31,522,SWA1781,B738,35000.0,ZID,green,NO,Indianapolis,INDIANA,2022-05-31 05:22:00,2022,NO,INDIANA,green
71205,2022-05-31,416,DAL1140,B737,32000.0,ZME,green,NO,Memphis,TENNESSEE,2022-05-31 04:16:00,2022,NO,TENNESSEE,green
71204,2022-05-31,416,ABX152,B763,14500.0,ZID,green,NO,Indianapolis,INDIANA,2022-05-31 04:16:00,2022,NO,INDIANA,green
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2,2010-01-01,246.0,EGF3002,E135,7000.0,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 02:46:00,2010,NO,CALIFORNIA,green
3,2010-01-01,157.0,EGF3086,E135,7000.0,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 01:57:00,2010,NO,CALIFORNIA,green
8,2010-01-01,155.0,SKW6318,E120,1000.0,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 01:55:00,2010,NO,CALIFORNIA,green
0,2010-01-01,102.0,AIR1,HELO,2000.0,LEX,green,NO,Lexington,KENTUCKY,2010-01-01 01:02:00,2010,NO,KENTUCKY,green


You can sort by multiple columns by passing in a _list_ of column names rather than the name of a single column. A list is a collection of items enclosed within square brackets `[]`.

In [95]:
df.sort_values(['state_clean', 'datetime_utc'])

Unnamed: 0,Incident Date,Incident Time,Flight ID,Aircraft,Altitude,Airport,Laser Color,Injury,City,State,datetime_utc,year,injury_clean,state_clean,colors_clean
2480,2010-11-19,145.0,AAL1709,B752,,,green,NO,,,2010-11-19 01:45:00,2010,NO,,green
2538,2010-12-27,317.0,AAL850,B737,2500.0,DCA,green,,,,2010-12-27 03:17:00,2010,,,green
12113,2013-07-31,123.0,N72664,C172,4500.0,HDI,green,NO,,,2013-07-31 01:23:00,2013,NO,,green
12328,2013-08-16,150.0,ASQ4344,E145,6000.0,IAH,green,NO,Houston,,2013-08-16 01:50:00,2013,NO,,green
12844,2013-10-02,236.0,WiLLY01,C172,5500.0,PMD,green,NO,Palmdale,,2013-10-02 02:36:00,2013,NO,,green
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57225,2020-12-07,358.0,SKW4672,E75L,12000.0,CPR,green,NO,Casper,WYOMING,2020-12-07 03:58:00,2020,NO,WYOMING,green
57823,2020-12-31,138.0,SKW4269,CRJ2,7000.0,CPR,green,NO,Casper,WYOMING,2020-12-31 01:38:00,2020,NO,WYOMING,green
57973,2021-01-07,300,SKW4672,E75L,14500.0,CPR,green,NO,Casper,WYOMING,2021-01-07 03:00:00,2021,NO,WYOMING,green
69033,2022-02-28,146,SKW4269,CRJ2,7000.0,CPR,blue,NO,Casper,WYOMING,2022-02-28 01:46:00,2022,NO,WYOMING,blue


You can specify the sort order (descending vs. ascending) for each sort column by passing another list to the `ascending` keyword with `True` and `False` items corresponding to the position of the columns in the first list. 

For example, to sort by `state_clean` descending, then by `datetime_utc` ascending:

In [96]:
df.sort_values(['state_clean', 'datetime_utc'], ascending=[False, True])

Unnamed: 0,Incident Date,Incident Time,Flight ID,Aircraft,Altitude,Airport,Laser Color,Injury,City,State,datetime_utc,year,injury_clean,state_clean,colors_clean
816,2010-05-28,327.0,SKW45B,CRJ7,,JAC,green,NO,Jackson,WYOMING,2010-05-28 03:27:00,2010,NO,WYOMING,green
983,2010-06-25,305.0,PETON35,C130,6400.0,CYS,green,NO,Cheyenne,WYOMING,2010-06-25 03:05:00,2010,NO,WYOMING,green
1908,2010-09-20,304.0,GRIT59,UH60,6200.0,CYS,red,NO,Cheyenne,WYOMING,2010-09-20 03:04:00,2010,NO,WYOMING,red
2017,2010-10-01,256.0,EGF2996,E135,10000.0,CYS,green,NO,Cheyenne,WYOMING,2010-10-01 02:56:00,2010,NO,WYOMING,green
2078,2010-10-08,335.0,GLA715,B190,7700.0,CYS,green,NO,Cheyenne,WYOMING,2010-10-08 03:35:00,2010,NO,WYOMING,green
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36242,2017-08-22,420,N3475E,C172 (CESSNA - 172),5500.0,,green,,,,2017-08-22 04:20:00,2017,,,green
67361,2021-12-23,422,N462JB,C182,1900.0,N/NC,green,NO,,,2021-12-23 04:22:00,2021,NO,,green
69527,2022-03-18,1111,RAMBO13,UH1,100.0,GUM,green,NO,,,2022-03-18 11:11:00,2022,NO,,green
69972,2022-04-07,701,SWA193,B738,2000.0,SOSC,green,NO,,,2022-04-07 07:01:00,2022,NO,,green


The `False` goes with `state_clean` and the `True` goes with `datetime_utc` because they're in the same position in their respective lists.
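
The pairing is easier to see on a frame small enough to read in full. A sketch with four made-up rows (the `toy` name and values are for illustration only):

```python
import pandas as pd

toy = pd.DataFrame({
    'state_clean': ['TEXAS', 'WYOMING', 'TEXAS', 'WYOMING'],
    'datetime_utc': [
        '2021-03-01 02:00:00',
        '2020-12-07 03:58:00',
        '2010-06-25 03:05:00',
        '2010-05-28 03:27:00',
    ],
})

# state_clean descending (WYOMING first), then datetime_utc ascending
result = toy.sort_values(
    ['state_clean', 'datetime_utc'],
    ascending=[False, True]
)
print(result)
```

Within each state, the oldest incident comes first, even though the states themselves run Z to A.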

One other note: Despite all of this sorting we've been doing, the original `df` data frame is unchanged:

In [97]:
df.head()

Unnamed: 0,Incident Date,Incident Time,Flight ID,Aircraft,Altitude,Airport,Laser Color,Injury,City,State,datetime_utc,year,injury_clean,state_clean,colors_clean
0,2010-01-01,102.0,AIR1,HELO,2000.0,LEX,green,NO,Lexington,KENTUCKY,2010-01-01 01:02:00,2010,NO,KENTUCKY,green
1,2010-01-01,403.0,ASA513,B737,10000.0,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 04:03:00,2010,NO,CALIFORNIA,green
2,2010-01-01,246.0,EGF3002,E135,7000.0,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 02:46:00,2010,NO,CALIFORNIA,green
3,2010-01-01,157.0,EGF3086,E135,7000.0,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 01:57:00,2010,NO,CALIFORNIA,green
4,2010-01-01,258.0,JBU300,A320,12500.0,SLI,green,NO,Los Alamitos,CALIFORNIA,2010-01-01 02:58:00,2010,NO,CALIFORNIA,green


That's because we haven't "saved" the results of those sorts by assigning them to a new variable. Typically, if you want to preserve a sort (or any other kind of manipulation), you'd assign the results to a new variable:

In [98]:
sorted_by_datetime = df.sort_values('datetime_utc')

Equivalently, you could pass in the `inplace=True` argument to the `sort_values()` method call to sort the data in place, without assigning it to a new variable:

In [99]:
df.sort_values('datetime_utc', inplace=True)

In [100]:
df.head()

Unnamed: 0,Incident Date,Incident Time,Flight ID,Aircraft,Altitude,Airport,Laser Color,Injury,City,State,datetime_utc,year,injury_clean,state_clean,colors_clean
7,2010-01-01,44.0,SKW6083,CRJ2,6800.0,LNK,green,NO,Lincoln,NEBRASKA,2010-01-01 00:44:00,2010,NO,NEBRASKA,green
0,2010-01-01,102.0,AIR1,HELO,2000.0,LEX,green,NO,Lexington,KENTUCKY,2010-01-01 01:02:00,2010,NO,KENTUCKY,green
8,2010-01-01,155.0,SKW6318,E120,1000.0,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 01:55:00,2010,NO,CALIFORNIA,green
3,2010-01-01,157.0,EGF3086,E135,7000.0,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 01:57:00,2010,NO,CALIFORNIA,green
2,2010-01-01,246.0,EGF3002,E135,7000.0,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 02:46:00,2010,NO,CALIFORNIA,green


Which is better: Mutating your data in place or saving the results to a new variable? Follow your heart, but I generally like to assign to a new variable for a couple reasons:
- I like to think about the series of transformations I apply to the data as the process of creating specific views or slices of the data that might be useful to refer to later, so it makes sense to "save" them to a variable
- It makes my intentions explicit to anyone reading the code over my shoulder, including my future self 

I still use `inplace` pretty frequently for quick-and-dirty scripts to accomplish simple tasks. Anything that's complicated or needs to be run more than once, though, and I'll assign to a new variable.
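
Here's the difference between the two approaches side by side, sketched on a made-up three-row frame. One gotcha worth knowing: with `inplace=True`, the method returns `None`, so don't assign its result to anything.

```python
import pandas as pd

toy = pd.DataFrame({'year': [2022, 2010, 2017]})

# Assigning to a new variable leaves the original untouched
by_year = toy.sort_values('year')
original_order = toy['year'].tolist()
print(original_order)            # still in the order we typed it
print(by_year['year'].tolist())  # the sorted copy

# inplace=True changes toy itself and returns None
returned = toy.sort_values('year', inplace=True)
print(returned)
print(toy['year'].tolist())      # toy is now sorted, too
```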

**✍️ Try it yourself**

In the cells below, practice sorting by single and multiple columns.

### Filtering

You can think of the column selection we were doing earlier as filtering your data table horizontally to grab certain columns (like a `SELECT` statement in SQL, if that's your thing).

Row filtering involves looking at a subset of your data that matches some criteria, like the criteria following a `WHERE` clause in SQL. (For instance, "Show me all records in my data frame where the value in the `state_clean` column is 'COLORADO'.")

To make things maximally confusing, pandas _also_ uses bracket notation for row-wise filtering. (I know!) Except in this case, instead of dropping the name of a column (or a list of column names) into the brackets, you hand it a _condition_ of some sort -- something that would resolve to the boolean of `True` or `False`.

Let's see about those Colorado laser pointer incidents:

In [101]:
df[df.state_clean == 'COLORADO']

Unnamed: 0,Incident Date,Incident Time,Flight ID,Aircraft,Altitude,Airport,Laser Color,Injury,City,State,datetime_utc,year,injury_clean,state_clean,colors_clean
54,2010-01-13,254.0,SKW6699,CRJ2,13000.0,DEN,green,NO,Denver,COLORADO,2010-01-13 02:54:00,2010,NO,COLORADO,green
90,2010-01-21,515.0,UPS805,A306,,DEN,green,NO,Denver,COLORADO,2010-01-21 05:15:00,2010,NO,COLORADO,green
142,2010-01-31,530.0,LN1,AS50,6200.0,BJC,green,NO,Denver,COLORADO,2010-01-31 05:30:00,2010,NO,COLORADO,green
284,2010-03-05,256.0,SKW5917,CRJ2,9000.0,DEN,green,NO,Denver,COLORADO,2010-03-05 02:56:00,2010,NO,COLORADO,green
311,2010-03-12,344.0,GLA121,E120,10000.0,DEN,green,NO,Denver,COLORADO,2010-03-12 03:44:00,2010,NO,COLORADO,green
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71116,2022-05-28,313,N1887S,TB20,10500.0,D01,green,NO,Denver,COLORADO,2022-05-28 03:13:00,2022,NO,COLORADO,green
71129,2022-05-28,735,LN854AL,PC12,500.0,ZDV,blue,NO,Longmont,COLORADO,2022-05-28 07:35:00,2022,NO,COLORADO,blue
71193,2022-05-31,254,SKW532E,CRJ2,8000.0,COS,green,NO,Colorado Springs,COLORADO,2022-05-31 02:54:00,2022,NO,COLORADO,green
71198,2022-05-31,337,UAL1114,,14000.0,D01,green,NO,Denver,COLORADO,2022-05-31 03:37:00,2022,NO,COLORADO,green


Again, this is just showing us a view of the data -- it's not changing the underlying data frame. To "save" the results, assign the results of this operation to a new variable:

In [102]:
colorado_incidents = df[df.state_clean == 'COLORADO']

In [103]:
colorado_incidents.head()

Unnamed: 0,Incident Date,Incident Time,Flight ID,Aircraft,Altitude,Airport,Laser Color,Injury,City,State,datetime_utc,year,injury_clean,state_clean,colors_clean
54,2010-01-13,254.0,SKW6699,CRJ2,13000.0,DEN,green,NO,Denver,COLORADO,2010-01-13 02:54:00,2010,NO,COLORADO,green
90,2010-01-21,515.0,UPS805,A306,,DEN,green,NO,Denver,COLORADO,2010-01-21 05:15:00,2010,NO,COLORADO,green
142,2010-01-31,530.0,LN1,AS50,6200.0,BJC,green,NO,Denver,COLORADO,2010-01-31 05:30:00,2010,NO,COLORADO,green
284,2010-03-05,256.0,SKW5917,CRJ2,9000.0,DEN,green,NO,Denver,COLORADO,2010-03-05 02:56:00,2010,NO,COLORADO,green
311,2010-03-12,344.0,GLA121,E120,10000.0,DEN,green,NO,Denver,COLORADO,2010-03-12 03:44:00,2010,NO,COLORADO,green


Python's main comparison operators:
- `>` greater than
- `>=` greater than or equal to
- `<` less than
- `<=` less than or equal to
- `==` equal to
- `!=` not equal to

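Plain Python also has the membership operators `in` and `not in`, which test whether a value is in a list, whether one string is inside another and so on. They don't work row by row on a pandas column, though -- for that, reach for the `.isin()` and `.str.contains()` methods and filter as usual.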
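
When the thing you're testing is a whole pandas column, membership checks usually go through `.isin()` and substring searches through `.str.contains()`. A quick sketch, again on a made-up `toy` table rather than the real data:

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'state_clean': ['COLORADO', 'TEXAS', 'UTAH', 'COLORADO'],
    'colors_clean': ['green', 'blue', 'green, blue', 'red'],
})

# Membership in a list: .isin() checks each row's value against the list
mountain = toy[toy.state_clean.isin(['COLORADO', 'UTAH'])]

# "Not in": flip the True/False values with the ~ operator
not_mountain = toy[~toy.state_clean.isin(['COLORADO', 'UTAH'])]

# Substring search: .str.contains() looks inside each string
has_blue = toy[toy.colors_clean.str.contains('blue')]

print(len(mountain), len(not_mountain), len(has_blue))  # 3 1 2
```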

In [104]:
df[df.Altitude >= 1000]

Unnamed: 0,Incident Date,Incident Time,Flight ID,Aircraft,Altitude,Airport,Laser Color,Injury,City,State,datetime_utc,year,injury_clean,state_clean,colors_clean
7,2010-01-01,44.0,SKW6083,CRJ2,6800.0,LNK,green,NO,Lincoln,NEBRASKA,2010-01-01 00:44:00,2010,NO,NEBRASKA,green
0,2010-01-01,102.0,AIR1,HELO,2000.0,LEX,green,NO,Lexington,KENTUCKY,2010-01-01 01:02:00,2010,NO,KENTUCKY,green
8,2010-01-01,155.0,SKW6318,E120,1000.0,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 01:55:00,2010,NO,CALIFORNIA,green
3,2010-01-01,157.0,EGF3086,E135,7000.0,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 01:57:00,2010,NO,CALIFORNIA,green
2,2010-01-01,246.0,EGF3002,E135,7000.0,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 02:46:00,2010,NO,CALIFORNIA,green
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71204,2022-05-31,416,ABX152,B763,14500.0,ZID,green,NO,Indianapolis,INDIANA,2022-05-31 04:16:00,2022,NO,INDIANA,green
71205,2022-05-31,416,DAL1140,B737,32000.0,ZME,green,NO,Memphis,TENNESSEE,2022-05-31 04:16:00,2022,NO,TENNESSEE,green
71206,2022-05-31,522,SWA1781,B738,35000.0,ZID,green,NO,Indianapolis,INDIANA,2022-05-31 05:22:00,2022,NO,INDIANA,green
71207,2022-05-31,530,N3055B,PA28,3500.0,JAX,green,NO,Jacksonville,FLORIDA,2022-05-31 05:30:00,2022,NO,FLORIDA,green


#### Multiple filtering conditions?!

What if you have multiple filtering conditions? There is a way to do it all in one shot but I don't recommend it. Often it makes more sense to _save_ the results of each filtering operation by assigning the results to a new variable, then filter _that_ data frame again instead of the original data frame. This also makes it much easier for your colleagues and your future self to think about and debug your code.

For example, if you wanted to look at incidents in Pennsylvania that happened below 1,000 feet, you could do this:

In [105]:
penn_incidents = df[df.state_clean == 'PENNSYLVANIA']
penn_under1k = penn_incidents[penn_incidents.Altitude < 1000]

In [106]:
penn_under1k.head()

Unnamed: 0,Incident Date,Incident Time,Flight ID,Aircraft,Altitude,Airport,Laser Color,Injury,City,State,datetime_utc,year,injury_clean,state_clean,colors_clean
952,2010-06-20,230.0,EMD9,UH1,500.0,74PN,green,NO,Altoona,PENNSYLVANIA,2010-06-20 02:30:00,2010,NO,PENNSYLVANIA,green
1072,2010-07-07,105.0,EJA819,C560,500.0,PNE,green,NO,Philadelphia,PENNSYLVANIA,2010-07-07 01:05:00,2010,NO,PENNSYLVANIA,green
1080,2010-07-08,105.0,EXJ819,C560,500.0,PHL,green,NO,Philadelphia,PENNSYLVANIA,2010-07-08 01:05:00,2010,NO,PENNSYLVANIA,green
1351,2010-08-04,209.0,N1NJ,S76,800.0,PNE,green,NO,Philadelphia,PENNSYLVANIA,2010-08-04 02:09:00,2010,NO,PENNSYLVANIA,green
1555,2010-08-22,124.0,AWE1840,B737,200.0,PIT,green,NO,Pittsburgh,PENNSYLVANIA,2010-08-22 01:24:00,2010,NO,PENNSYLVANIA,green


This could be a journalism question! "How many laser pointer incidents in Pennsylvania were reported when the plane was under 1,000 feet?" Answer:

In [107]:
len(penn_under1k)

87
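
For reference, the "all in one shot" version mentioned above usually looks something like the sketch below: each condition gets its own parentheses and they're joined with `&` (or `|` for "or"). The `toy` table is made up for illustration, and the step-by-step approach is still the easier one to debug.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'state_clean': ['PENNSYLVANIA', 'PENNSYLVANIA', 'OHIO', 'PENNSYLVANIA'],
    'Altitude': [500, 4000, 800, 200],
})

# Step-by-step version, as above
penn = toy[toy.state_clean == 'PENNSYLVANIA']
penn_low = penn[penn.Altitude < 1000]

# One-shot version: wrap each condition in parentheses and join with &
# (Python's plain `and` / `or` keywords won't work on columns)
one_shot = toy[(toy.state_clean == 'PENNSYLVANIA') & (toy.Altitude < 1000)]

print(len(penn_low), len(one_shot))  # 2 2
```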

🤯 BEAST MODE: Let's compute a percentage and use a Python formatting thing called "f-strings" to write a sentence in our story.

Journalism question: What percentage of laser pointer incidents involved an injury?

Let's take it one step at a time:
- Filter the data frame to get just the rows with an injury
- Count the number of rows in your filtered data frame
- Count the number of rows in your main data frame
- Divide the count of the filtered data frame by the count of the main data frame, then multiply by 100

What were the values in the `injury_clean` column, again?

In [108]:
df.injury_clean.unique()

array(['NO', '', 'YES', 'UNKNOWN'], dtype=object)

In [109]:
# filter for the 'YES' values
injuries = df[df.injury_clean == 'YES']

In [110]:
# grab the length of our new data frame and save to a variable
injury_count = len(injuries)
print(injury_count)

254


In [111]:
# get the total number
total_count = len(df)
print(total_count)

71209


In [112]:
pct_injuries = (injury_count / total_count) * 100
print(pct_injuries)

0.3566964849948743


In [113]:
# gross let's round that
pct_injuries_rounded = round(pct_injuries, 2)
print(pct_injuries_rounded)

0.36


In [114]:
# honestly tho looking at this number, i wouldn't be mad about
# pct_injuries_rounded = 'less than 1 percent'

In [115]:
# oh and what was the first year of this data
first_year = df.year.min()

In [117]:
# use an f-string to write a data sentence
# https://realpython.com/python-f-strings/
print(f'Since {first_year}, {injury_count} laser pointer incidents resulted in an injury to someone, about {pct_injuries_rounded} percent of all incidents, according to the FAA.')

Since 2010, 254 laser pointer incidents resulted in an injury to someone, about 0.36 percent of all incidents, according to the FAA.
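
A small aside on f-strings: they can also handle the rounding and number formatting at print time, using a format spec after a colon. A minimal sketch, reusing the two counts from above as plain numbers:

```python
# Counts copied from the cells above
injury_count = 254
total_count = 71209
pct = injury_count / total_count * 100

# :.2f rounds to two decimal places for display only; pct itself is unchanged
print(f'{pct:.2f} percent')          # 0.36 percent

# :, adds thousands separators
print(f'{total_count:,} incidents')  # 71,209 incidents
```

Whether you round first with `round()` or format at print time is mostly a matter of taste; the format spec just saves you a variable.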


**✍️ Try it yourself**

In the cells below, try filtering by single and multiple criteria.

Here are a few questions to get you started:
- How many laser pointer incidents in California involved an injury?
- What was the earliest laser pointer incident recorded in Maine?

### Method chaining

You can use a process called "method chaining" to perform multiple operations in one line. If, for instance, we wanted to sort our data frame by date ascending and then inspect the first five records returned, we could write:

In [118]:
df.sort_values('datetime_utc').head()

Unnamed: 0,Incident Date,Incident Time,Flight ID,Aircraft,Altitude,Airport,Laser Color,Injury,City,State,datetime_utc,year,injury_clean,state_clean,colors_clean
7,2010-01-01,44.0,SKW6083,CRJ2,6800.0,LNK,green,NO,Lincoln,NEBRASKA,2010-01-01 00:44:00,2010,NO,NEBRASKA,green
0,2010-01-01,102.0,AIR1,HELO,2000.0,LEX,green,NO,Lexington,KENTUCKY,2010-01-01 01:02:00,2010,NO,KENTUCKY,green
8,2010-01-01,155.0,SKW6318,E120,1000.0,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 01:55:00,2010,NO,CALIFORNIA,green
3,2010-01-01,157.0,EGF3086,E135,7000.0,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 01:57:00,2010,NO,CALIFORNIA,green
2,2010-01-01,246.0,EGF3002,E135,7000.0,LAX,green,NO,Los Angeles,CALIFORNIA,2010-01-01 02:46:00,2010,NO,CALIFORNIA,green


"Show me the earliest five laser pointer incidents in Florida":

In [120]:
df[df.state_clean == 'FLORIDA'].sort_values('datetime_utc').head(5)

Unnamed: 0,Incident Date,Incident Time,Flight ID,Aircraft,Altitude,Airport,Laser Color,Injury,City,State,datetime_utc,year,injury_clean,state_clean,colors_clean
15,2010-01-03,0.0,EGF2771,E135,,VPS,green,NO,Eglin AFB,FLORIDA,2010-01-03 00:00:00,2010,NO,FLORIDA,green
33,2010-01-07,257.0,AAL443,B738,800.0,MIA,green,NO,Miami,FLORIDA,2010-01-07 02:57:00,2010,NO,FLORIDA,green
37,2010-01-08,323.0,NWA2405,A320,3200.0,PBI,green,NO,West Palm Beach,FLORIDA,2010-01-08 03:23:00,2010,NO,FLORIDA,green
58,2010-01-14,2333.0,JENA231,UNKN,2500.0,TMB,green,NO,Tamiami,FLORIDA,2010-01-14 23:33:00,2010,NO,FLORIDA,green
68,2010-01-16,100.0,JBU425,A306,2000.0,PBI,green,NO,West Palm Beach,FLORIDA,2010-01-16 01:00:00,2010,NO,FLORIDA,green
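
Long chains get hard to read on one line. One common habit (a style choice, not a requirement) is to wrap the whole chain in parentheses and put each step on its own line. A sketch on a made-up `toy` table:

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'state_clean': ['FLORIDA', 'TEXAS', 'FLORIDA', 'FLORIDA'],
    'datetime_utc': pd.to_datetime([
        '2010-01-08 03:23', '2010-01-02 01:00',
        '2010-01-03 00:00', '2010-01-07 02:57',
    ]),
})

# The outer parentheses let the chain span several lines
earliest_florida = (
    toy[toy.state_clean == 'FLORIDA']   # filter
    .sort_values('datetime_utc')        # sort, earliest first
    .head(2)                            # keep the first two rows
)

print(earliest_florida)
```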


**✍️ Try it yourself**

In the cells below, try chaining some methods together.

Here are a few questions to get you started:
- Show me all incidents that happened before noon, sorted by airport (hint: the `Incident Time` column is a number from 0 to 2400 representing military time, so noon would be `1200`)
- Show me the _last_ five laser pointer incidents in 2014

### Grouping

Data frames have a `groupby` method for grouping and aggregating data, similar to what you might do in a pivot table or a `GROUP BY` statement in SQL. (They also have a [`pivot_table` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html) that's pretty boss.)
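
To give a rough feel for how the two relate, here's a sketch that computes the same per-state average both ways. The `toy` table is made up for illustration; the column names just mirror the laser data.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'state_clean': ['COLORADO', 'COLORADO', 'TEXAS', 'TEXAS'],
    'Altitude': [12000, 14000, 3000, 5000],
})

# groupby: group on a column, pick the column to summarize, aggregate
by_group = toy.groupby('state_clean')['Altitude'].mean()

# pivot_table: same idea, spelled out with keyword arguments
pivoted = toy.pivot_table(index='state_clean', values='Altitude', aggfunc='mean')

print(by_group)
print(pivoted)
```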

Let's say we're working on a really weirdly specific story and it is _vital_ that we find the states where the laser pointer incidents occurred at the highest altitudes.

To see the data broken out by state and average altitude of laser pointer incidents:
- Group the data by the `state_clean` column: `groupby()`
- Get the mean value of the `Altitude` value for each state group: `mean()`
- Sort the results by `Altitude` descending: `sort_values()`
- Take only the top five results: `head()`

Calling the `groupby()` method without telling it what to do with the grouped records isn't super helpful:

In [121]:
df.groupby('state_clean')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x120e38d60>

It is _indignant_ that you put your data into groups but then just left them sitting there. Let's see what happens when we tack on a `mean()` method call:

In [122]:
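# numeric_only=True averages just the number columns
# (recent pandas versions raise an error if text columns are included)
df.groupby('state_clean').mean(numeric_only=True)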

Unnamed: 0_level_0,Altitude,year
state_clean,Unnamed: 1_level_1,Unnamed: 2_level_1
,4084.615385,2015.928571
ALABAMA,7900.789826,2017.077428
ALASKA,5484.710744,2017.888889
ARIZONA,6379.427690,2016.404157
ARKANSAS,9847.146727,2017.784922
...,...,...
VIRGINIA,7621.416583,2017.251451
WASHINGTON,7030.004929,2018.010091
WEST VIRGINIA,9639.400000,2016.678571
WISCONSIN,6213.679012,2016.169492


It's taking the `mean()` of the numeric columns -- `Altitude` and `year` -- but we only care about altitude, so let's amend our code to first select just the two columns of interest:

In [123]:
df[['state_clean', 'Altitude']].groupby('state_clean').mean()

Unnamed: 0_level_0,Altitude
state_clean,Unnamed: 1_level_1
,4084.615385
ALABAMA,7900.789826
ALASKA,5484.710744
ARIZONA,6379.427690
ARKANSAS,9847.146727
...,...
VIRGINIA,7621.416583
WASHINGTON,7030.004929
WEST VIRGINIA,9639.400000
WISCONSIN,6213.679012


OK, you're going to have to trust me on this one, but to close the circle we're also going to want to use a method called [`reset_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) to turn this back into a DataFrame that behaves as you'd expect. (tl;dr: the way it is now, the `state_clean` column _is_ the index for the table -- usually the index is just consecutive numbers.)

In [124]:
df[['state_clean', 'Altitude']].groupby('state_clean').mean().reset_index()

Unnamed: 0,state_clean,Altitude
0,,4084.615385
1,ALABAMA,7900.789826
2,ALASKA,5484.710744
3,ARIZONA,6379.427690
4,ARKANSAS,9847.146727
...,...,...
58,VIRGINIA,7621.416583
59,WASHINGTON,7030.004929
60,WEST VIRGINIA,9639.400000
61,WISCONSIN,6213.679012
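
If you'd rather not take that on trust, here's a small sketch of what `reset_index()` changes, using a made-up `toy` table: before the call, the state names live in the index; afterward they're a regular column again.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'state_clean': ['COLORADO', 'COLORADO', 'TEXAS'],
    'Altitude': [12000, 14000, 3000],
})

grouped = toy.groupby('state_clean').mean()
print(grouped.index.tolist())    # ['COLORADO', 'TEXAS']
print(grouped.columns.tolist())  # ['Altitude']

flat = grouped.reset_index()
print(flat.index.tolist())       # [0, 1]
print(flat.columns.tolist())     # ['state_clean', 'Altitude']
```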


OK now to answer the question posed earlier: On average, which states had the highest-altitude laser pointer incidents?

In [125]:
# Science!
df[['state_clean', 'Altitude']].groupby('state_clean').mean().reset_index().sort_values('Altitude', ascending=False).head()

Unnamed: 0,state_clean,Altitude
40,NORTH HAMPSHIRE,15000.0
62,WYOMING,14697.435897
6,COLORADO,12385.912685
36,NEW MEXICO,11376.053086
11,GEORGIA,11183.330077


**✍️ Try it yourself**

In the cells below, try grouping your data. A few suggested questions to get you started:
- Show me a breakdown of the mean altitude of incidents broken down by year
- Show me a breakdown of the median altitude of incidents broken down by airport
- What else?

### Exporting

You can dump your data to CSV, Excel and other formats -- I usually default to CSV files, and the `to_csv()` method does exactly what you'd think it does. At minimum, you need to hand this method the file name of the output CSV (or its full file path, if you don't want it to land in the current directory).

And unless you want a column of row numbers -- the index attached to the DataFrame -- you'll also want to specify `index=False`.

Remember earlier, when we filtered to get the incidents in Pennsylvania and saved them to a variable called `penn_incidents`? Let's dump that data frame to a file called `penn-laser-incidents.csv`:

In [59]:
penn_incidents.to_csv('penn-laser-incidents.csv', index=False)
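
To see what `index=False` actually buys you, here's a sketch that writes a made-up `toy` table both ways and reads each file back. The file names and the temporary folder are just for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'state_clean': ['PENNSYLVANIA', 'PENNSYLVANIA'],
    'Altitude': [500, 800],
})

# Write into a throwaway folder so this sketch doesn't clutter anything
out_dir = tempfile.mkdtemp()
with_index = os.path.join(out_dir, 'with-index.csv')
without_index = os.path.join(out_dir, 'without-index.csv')

toy.to_csv(with_index)                  # default: the index is written too
toy.to_csv(without_index, index=False)  # just the data columns

print(pd.read_csv(with_index).columns.tolist())     # ['Unnamed: 0', 'state_clean', 'Altitude']
print(pd.read_csv(without_index).columns.tolist())  # ['state_clean', 'Altitude']
```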

### Let's practice

Pick a dataset to explore -- there are a couple in the `data` folder, or you can read a CSV directly from a URL on the Internet.

Options in the `/data` folder:
- `pep.csv`: FOIA'd data on Florida's "python elimination program" that's a couple years old
- `mlb.csv`: Some Major League Baseball player salary data from a few years ago
- `house-gift-travel.csv`: Congressional junkets!

Or maybe you'd rather connect to a CSV on [Colorado's open data portal](https://data.colorado.gov/browse), such as [this list of early childcare providers in the state](https://data.colorado.gov/Early-childhood/Colorado-Licensed-Child-Care-Facilities-Report/a9rr-k8mu). To get the link, click "Export" and copy the link to the CSV export. Here's how that would work for the childcare providers data set:

In [60]:
df_earlycc = pd.read_csv('https://data.colorado.gov/api/views/a9rr-k8mu/rows.csv?accessType=DOWNLOAD')

In [61]:
df_earlycc.head()

Unnamed: 0,PROVIDER ID,PROVIDER NAME,PROVIDER SERVICE TYPE,STREET ADDRESS,CITY,STATE,ZIP,COUNTY,COMMUNITY,ECC,...,LICENSE FEE DISCOUNT,LONG-LAT,OPERATING STATUS (Self-Report),OPERATING STATUS REPORT DATE,Licensed Home Capacity,Licensed Infant Capacity,Licensed Toddler Capacity,Licensed Preschool Capacity,Licensed School Age Capacity,Licensed Preschool and School Age Capacity
0,1701308,BIRD CONSERVANCY OF THE ROCKIES,Resident Camp,1306 BUSINESS HIGHWAY 7,ALLENSPARK,CO,80510,Boulder,,Early Childhood Council of Boulder County,...,,,Open,2022-02-08,,,,,,
1,1684119,Knights of Heroes Camp,Resident Camp,5987 Gold Camp RD,Victor,CO,80813,Teller,,Teller/Park Early Childhood Council,...,,,Open,2020-06-26,,,,,,
2,1623752,Glacier View Ranch,Resident Camp,8748 Overland RD,Ward,CO,80481,Boulder,,Early Childhood Council of Boulder County,...,,,Closed,2020-05-06,,,,,,
3,1742705,Tseganesh Tesega,Family Child Care Home,2632 S Halifax Ct,Aurora,CO,80013,Arapahoe,,Arapahoe County Early Childhood Council,...,,,Open,2021-01-15,6.0,,,,,
4,1503461,GLACIER PEAK BASE PROGRAM,School-Age Child Care Center,12060 Jasmine ST,Brighton,CO,80602,Adams,,Early Childhood Partnership of Adams County,...,,,Open,2022-03-31,,,,,100.0,


Your assignment:
- Load data into a data frame and inspect it. Fix any problems you know how to fix after this shaky introduction to pandas syntax
- Come up with some questions to ask and figure out the operation(s) you need to make it happen
- Write the code to make it happen!

### Bored? Here's some extra credit

What's going on with the laser color(s) reported? What's the most common color sighted? The colors in the `colors_clean` column are lowercased, comma-separated lists of the colors mentioned in the report. Solving this problem will likely involve splitting that into a Python list and counting them. I would start by Googling around the phrases "pandas AND split values on delimiter AND count" or somesuch.
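
If you get stuck, here's one possible shape for a solution, sketched on a made-up `toy` column rather than the real data. It assumes the colors are separated by commas with optional spaces; the real column may need extra cleanup (empty values, for instance).

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'colors_clean': ['green', 'green, blue', 'blue,red', 'green'],
})

color_counts = (
    toy.colors_clean
    .str.split(',')   # each cell becomes a list of colors
    .explode()        # one row per color
    .str.strip()      # trim stray spaces around each color
    .value_counts()   # count how often each color appears
)

print(color_counts)
```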