<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Grouping Data with Pandas - Lab

---

You are going to investigate UFO sightings across the US. This lab gives you practice using `groupby` to split data along multiple dimensions and compare subsets of the data with basic aggregations.
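As a warm-up, here is a minimal sketch of what "splitting along multiple dimensions" can look like. The `sightings` frame and its values are hypothetical toy data, not rows from the UFO file.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only).
sightings = pd.DataFrame({
    'State': ['NY', 'NY', 'WA', 'WA', 'WA'],
    'Shape': ['DISK', 'LIGHT', 'DISK', 'DISK', 'OVAL'],
})

# Split the rows by one key and count the rows in each group.
per_state = sightings.groupby('State').size()

# Split along two dimensions at once; the result has a (State, Shape) MultiIndex.
per_state_shape = sightings.groupby(['State', 'Shape']).size()

print(per_state)
print(per_state_shape)
```

Passing a list of column names to `groupby` is what produces one group per combination of keys.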

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(font_scale=1.5)

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

#### 1. Load and print the header for the UFO data.

In [2]:
ufo_csv = '../../../../resource-datasets/ufo_sightings/ufo.csv'

In [3]:
# A:
ufo = pd.read_csv(ufo_csv)
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


#### 2. How many null values exist per column?

In [4]:
# A:
ufo.isnull().sum()

City                  47
Colors Reported    63509
Shape Reported      8402
State                  0
Time                   0
dtype: int64

#### 3. Which city has the most observations?

In [5]:
# A:
# count() tallies non-null values per column; 'Time' has no nulls, so it equals the row count.
grouped_by_city = ufo.groupby('City', as_index=False)
grouped_by_city.count().sort_values('Time', ascending=False).head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
10778,Seattle,147,521,646,646
8277,New York City,77,567,612,612
9330,Phoenix,106,473,533,533
6412,Las Vegas,60,387,442,442
9647,Portland,94,379,438,438


In [40]:
ufo.groupby('City').size().sort_values(ascending=False).index[0]

'Seattle'
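For comparison, `value_counts` should give the same per-city tallies as `groupby(...).size()`, since both skip missing keys by default. The sketch below uses a hypothetical `cities` Series rather than the real data.

```python
import pandas as pd

# Hypothetical toy column with one missing city (an assumption for illustration).
cities = pd.Series(
    ['Seattle', 'Phoenix', 'Seattle', None, 'Portland', 'Seattle'],
    name='City',
)

# value_counts drops missing values by default and sorts by count, descending.
counts = cities.value_counts()

# idxmax returns the label (here, the city) with the largest count.
top_city = counts.idxmax()

# The groupby route yields the same counts, indexed by city.
by_group = cities.to_frame().groupby('City').size()

print(top_city)
```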

#### 4. How many observations were there by shape?

In [7]:
# A:
ufo.groupby('Shape Reported').size().sort_values(ascending=False)

Shape Reported
LIGHT        16332
TRIANGLE      7816
CIRCLE        7725
FIREBALL      6249
OTHER         5506
SPHERE        5231
DISK          5226
OVAL          3721
FORMATION     2405
CIGAR         1983
VARIOUS       1957
FLASH         1329
RECTANGLE     1295
CYLINDER      1252
DIAMOND       1152
CHEVRON        940
EGG            733
TEARDROP       723
CONE           310
CROSS          241
DELTA            7
CRESCENT         2
ROUND            2
DOME             1
HEXAGON          1
PYRAMID          1
FLARE            1
dtype: int64

#### 5. Create a subset of the data that is the top 5 cities and the top 5 shapes.

In [22]:
# A:
# Rank cities by sighting count ('Time' has no nulls) and keep the five busiest.
top5cities = ufo.groupby('City', as_index=False).count().sort_values('Time', ascending=False).head(5)
top5cities

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
10778,Seattle,147,521,646,646
8277,New York City,77,567,612,612
9330,Phoenix,106,473,533,533
6412,Las Vegas,60,387,442,442
9647,Portland,94,379,438,438


In [29]:
ufo_top5cities_subset = ufo[ufo['City'].isin(top5cities['City'])]
ufo_top5cities_subset.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
41,New York City,,DISK,NY,1/8/1946 2:00
68,Seattle,,OVAL,WA,7/4/1947 17:30
76,Las Vegas,,DISK,NV,7/15/1947 10:00
96,New York City,,CIRCLE,NY,8/1/1948 2:00
99,Seattle,,DISK,WA,4/10/1949 15:00


In [13]:
# Rank shapes by sighting count and keep the five most common.
top5shapes = ufo.groupby('Shape Reported', as_index=False).count().sort_values('Time', ascending=False).head(5)
top5shapes

Unnamed: 0,Shape Reported,City,Colors Reported,State,Time
17,LIGHT,16325,3727,16332,16332
25,TRIANGLE,7809,1360,7816,7816
2,CIRCLE,7720,2110,7725,7725
12,FIREBALL,6246,2067,6249,6249
18,OTHER,5505,869,5506,5506


In [43]:
ufo_top5_subset = ufo[ufo['Shape Reported'].isin(top5shapes['Shape Reported'])
                          & ufo['City'].isin(top5cities['City'])]
ufo_top5_subset.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
96,New York City,,CIRCLE,NY,8/1/1948 2:00
208,New York City,,OTHER,NY,7/15/1952 19:00
313,New York City,,LIGHT,NY,2/15/1955 19:00
367,New York City,,LIGHT,NY,6/15/1956 21:00
568,New York City,,OTHER,NY,10/1/1959 16:00


#### 6. What is the number of observations per city?

In [32]:
# A:
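# Index by city so each count is labeled with its city name.
top5cities.set_index('City')['Time']

City
Seattle          646
New York City    612
Phoenix          533
Las Vegas        442
Portland         438
Name: Time, dtype: int64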

#### 7. What is the number of observations of each shape per city?

In [45]:
# A:
# Name the count column explicitly instead of leaving it as the default 0.
df_b = ufo_top5_subset.groupby(['City', 'Shape Reported']).size().reset_index(name='count')
df_b

Unnamed: 0,City,Shape Reported,count
0,Las Vegas,CIRCLE,42
1,Las Vegas,FIREBALL,22
2,Las Vegas,LIGHT,85
3,Las Vegas,OTHER,36
4,Las Vegas,TRIANGLE,31
5,New York City,CIRCLE,56
6,New York City,FIREBALL,35
7,New York City,LIGHT,105
8,New York City,OTHER,42
9,New York City,TRIANGLE,40


#### 8. With the subset, find the fraction of each shape seen by city.

Hint: You could merge the last two dataframes on city.

Plot the fractions per city and shape as a bar chart.

In [39]:
# A:
# Count sightings per (city, shape) and per city within the subset, then merge on city.
shape_counts = ufo_top5_subset.groupby(['City', 'Shape Reported']).size().reset_index(name='count')
city_totals = ufo_top5_subset.groupby('City').size().reset_index(name='total')
df_merged = pd.merge(shape_counts, city_totals, on='City')

In [37]:
# Each shape's share of its city's sightings in the subset (shares sum to 1 per city).
df_merged['fraction'] = df_merged['count'] / df_merged['total']
sns.barplot(data=df_merged, x='City', y='fraction', hue='Shape Reported')
plt.show()

#### Challenge: Obtain the fractions of shapes per city using groupby.

If you would like to do this with groupby, you need consecutive groupbys. Once you have grouped by city, group by shape within each city and divide each subgroup's size by the total number of observations across all shape groups in that city.
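One possible reading of this, sketched on hypothetical toy data rather than the UFO file: count by city and shape first, then group those counts by city to get each city's total. The `toy` frame and the variable names are assumptions for illustration, not the lab's reference solution.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only).
toy = pd.DataFrame({
    'City': ['Seattle', 'Seattle', 'Seattle', 'Phoenix', 'Phoenix'],
    'Shape Reported': ['LIGHT', 'LIGHT', 'DISK', 'DISK', 'OVAL'],
})

# First groupby: number of sightings for each (city, shape) pair.
counts = toy.groupby(['City', 'Shape Reported']).size()

# Second groupby: broadcast each city's total back onto its (city, shape) rows.
city_totals = counts.groupby(level='City').transform('sum')

# Fraction of each city's sightings that had each shape.
fractions = counts / city_totals

print(fractions)
```

Using `transform('sum')` keeps the totals aligned with the original `(City, Shape Reported)` index, so the division needs no merge.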

In [None]:
# A: