# Today: EDA GROUPS!

Choose a team, then spend some time looking at the data. We want you to explore it using the techniques we have learned so far, including:

- Grouping / subsetting / segmentation
- Summary statistics
    - Histograms
    - Plotting
- Slicing
- Cleaning data
    - assessing proper types
    - expected values
    - object conversion
   

At the end of our exploratory analysis, each group will give a 10-minute presentation on its findings to the rest of the class.
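
As a warm-up, here is a minimal sketch of how the techniques listed above fit together on a toy table. The `team` and `score` columns and all values are hypothetical, chosen only to show a number column that arrives as strings.

```python
import pandas as pd

# Hypothetical toy data: scores stored as strings, one of them blank.
raw = pd.DataFrame({
    "team": ["alpha", "popcorn", "alpha", "titanic"],
    "score": ["6", "12", "", "3"],
})

# 1. Assess types: "score" shows up as object (string), not a number.
print(raw.dtypes)

# 2. Object conversion: anything that cannot be parsed becomes NaN.
clean = raw.assign(score=pd.to_numeric(raw["score"], errors="coerce"))

# 3. Expected values: in this toy example scores should never be negative.
assert (clean["score"].dropna() >= 0).all()

# 4. Group, then summarize each group.
summary = clean.groupby("team")["score"].agg(["count", "sum", "mean"])
print(summary)
```

Note that `count` skips the missing value, so it can differ from the number of rows in a group.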


In [2]:
import pandas as pd, numpy as np, seaborn as sns

%matplotlib inline

## Team Alpha Drone

Since the API from `api.dronestre.am` provides data on drone strikes in near real time, it **might** be useful for holding President Obama accountable to his promise of reducing drone strikes. Your mission is to explore the drone strike data, do any accompanying research your analysis needs, and report back any good summary statistics.

Also, we would like to know:
 - Is this a good source of data?
     - Why / why not?
     
*Politics aside -- let's keep it to what is measurable in our dataset.  This isn't meant to prove or disprove anything.  It's a **fun** dataset to look at, more so than a motivator of political discourse.*

In [15]:
# First, fetch the data from the API with the Python requests library.
# Read more about requests:
# http://docs.python-requests.org/en/master/user/quickstart/

import requests

response = requests.get("http://api.dronestre.am/data")
response.raise_for_status()  # stop on an HTTP error instead of parsing an error page
json_data = response.json()
# The strike records sit under the 'strike' key of the JSON payload
drone_df = pd.DataFrame(json_data['strike'])

In [20]:
drone_df.head(1)

| | _id | articles | bij_link | bij_summary_short | bureau_id | children | civilians | country | date | deaths | ... | injuries | lat | location | lon | names | narrative | number | target | town | tweet_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 55c79e711cbee48856a30886 | [] | http://www.thebureauinvestigates.com/2012/03/2... | In the first known US targeted assassination u... | YEM001 | | 0 | Yemen | 2002-11-03T00:00:00.000Z | 6 | ... | | 15.47467 | Marib Province | 45.322755 | [Qa'id Salim Sinan al-Harithi, Abu Ahmad al-Hi... | In the first known US targeted assassination u... | 1 | | | 278544689483890688 |
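
One way to start judging whether this is a good source of data is to convert the raw fields to proper types and count how often a value is missing. The sketch below assumes the API returns `date` and `deaths` as strings. Only the first row is copied from the `head(1)` output above; the other rows, and the `summarize_strikes` helper itself, are hypothetical and exist only to exercise the conversion.

```python
import pandas as pd

# First row copied from the head(1) output; the other two are made up.
sample = pd.DataFrame({
    "country": ["Yemen", "Pakistan", "Pakistan"],
    "date": [
        "2002-11-03T00:00:00.000Z",
        "2004-06-17T00:00:00.000Z",
        "2004-11-01T00:00:00.000Z",
    ],
    "deaths": ["6", "Unknown", ""],
})

def summarize_strikes(df):
    """Count strikes and deaths per country and year (hypothetical helper)."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"], utc=True)
    # Anything that is not a plain number becomes NaN instead of raising.
    out["deaths"] = pd.to_numeric(out["deaths"], errors="coerce")
    out["deaths_missing"] = out["deaths"].isna()
    out["year"] = out["date"].dt.year
    return out.groupby(["country", "year"]).agg(
        strikes=("date", "size"),
        deaths=("deaths", "sum"),
        deaths_missing=("deaths_missing", "sum"),
    )

result = summarize_strikes(sample)
print(result)
```

A high `deaths_missing` count relative to `strikes` would be one measurable argument about data quality, since the summed `deaths` silently treats missing values as zero.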


## Team Popcorn

You're a force to be reckoned with when you `read_csv` into your `movie_df` dataframe.  You are team "Popcorn".  It would be nice to know:

 - Which movies remained in the top 10 the longest?
 - Which movies were good investments?
 
 Bonus:
 - Do any holidays impact sales performance or position?


_[There's a data dictionary available!](http://www.amstat.org/publications/jse/v17n1/datasets.mclaren.html)_

In [43]:
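# Load the weekend box-office data before inspecting it
movie_df = pd.read_csv("../assets/data/movie_weekend.csv")
movie_df.head(5)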

# movie_df['WEEKEND_DATE'] = pd.to_datetime(movie_df['WEEKEND_DATE'])

| | NUMBER | MOVIE | WEEK_NUM | WEEKEND_PER_THEATER | WEEKEND_DATE |
|---|---|---|---|---|---|
| 0 | 1.0 | A Beautiful Mind | 1.0 | 701.0 | 2001-12-21 |
| 1 | 1.0 | A Beautiful Mind | 2.0 | 14820.0 | 2001-12-28 |
| 2 | 1.0 | A Beautiful Mind | 3.0 | 8940.0 | 2002-01-04 |
| 3 | 1.0 | A Beautiful Mind | 4.0 | 6850.0 | 2002-01-11 |
| 4 | 1.0 | A Beautiful Mind | 5.0 | 5280.0 | 2002-01-18 |
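
For the holiday bonus question, a rough first pass is to parse `WEEKEND_DATE` and compare holiday weekends against the rest. The sketch below uses the five rows shown above. The holiday window (December 20 through January 2) and the `SEASON` column are assumptions made for illustration, not part of the dataset.

```python
import pandas as pd

# Rows copied from the head(5) output above.
movies = pd.DataFrame({
    "MOVIE": ["A Beautiful Mind"] * 5,
    "WEEK_NUM": [1.0, 2.0, 3.0, 4.0, 5.0],
    "WEEKEND_PER_THEATER": [701.0, 14820.0, 8940.0, 6850.0, 5280.0],
    "WEEKEND_DATE": ["2001-12-21", "2001-12-28", "2002-01-04",
                     "2002-01-11", "2002-01-18"],
})
movies["WEEKEND_DATE"] = pd.to_datetime(movies["WEEKEND_DATE"])

# Assumption: a "holiday" weekend falls between Dec 20 and Jan 2 inclusive.
month = movies["WEEKEND_DATE"].dt.month
day = movies["WEEKEND_DATE"].dt.day
is_holiday = ((month == 12) & (day >= 20)) | ((month == 1) & (day <= 2))
movies["SEASON"] = is_holiday.map({True: "holiday", False: "regular"})

# Average per-theater weekend take inside and outside the window.
season_means = movies.groupby("SEASON")["WEEKEND_PER_THEATER"].mean()
print(season_means)
```

With a single movie this mostly reflects its release pattern, so on the full `movie_df` it would make sense to compare within the same `WEEK_NUM` before reading anything into the gap.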


In [112]:
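# Count the weekend rows recorded for each movie
x = movie_df.groupby('MOVIE')['WEEKEND_DATE'].count()

# Series.sort() was removed from pandas; sort_values returns a sorted copy
x = x.sort_values(ascending = False)
x.head()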


MOVIE
ET                         52
Raiders of the Lost Ark    43
Return of the Jedi         42
Forrest Gump               42
Titanic                    41
Name: WEEKEND_DATE, dtype: int64

In [145]:
movie_df = pd.read_csv("../assets/data/movie_weekend.csv")

movie_df.head()

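# Longest run: the highest week number each movie reached
top_ten = movie_df.groupby('MOVIE')['WEEK_NUM'].max()

top_ten = pd.DataFrame(top_ten)
# DataFrame.sort() was removed from pandas; use sort_values instead
top_ten.reset_index().sort_values('WEEK_NUM', ascending = False).head(10)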
# sns.set(style = 'whitegrid', rc={'figure.figsize':(12, 6)})
# sns.barplot(x = 'MOVIE', y = 'WEEK_NUM', data = top_ten.head(10))



| | MOVIE | WEEK_NUM |
|---|---|---|
| 7 | ET | 52.0 |
| 31 | Raiders of the Lost Ark | 43.0 |
| 32 | Return of the Jedi | 42.0 |
| 9 | Forrest Gump | 42.0 |
| 50 | Titanic | 41.0 |
| 1 | American Beauty | 38.0 |
| 4 | Chicago | 36.0 |
| 35 | Shakespeare in Love | 33.0 |
| 3 | Beverly Hills Cop | 33.0 |
| 11 | Gladiator | 33.0 |


In [110]:
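# Sum WEEKEND_PER_THEATER over each movie's run
top_invest = movie_df.groupby('MOVIE')['WEEKEND_PER_THEATER'].sum()

# Series.sort() was removed from pandas; sort_values returns a sorted copy
top_invest = top_invest.sort_values(ascending = False)

top_invest.head(10)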


MOVIE
Star Wars                   228181.0
ET                          201257.0
Empire Strikes Back, The    178013.0
American Beauty             165891.0
Titanic                     165701.0
Return of the Jedi          163572.0
Million Dollar Baby         154115.0
Chicago                     146062.0
Raiders of the Lost Ark     144778.0
Forrest Gump                128534.0
Name: WEEKEND_PER_THEATER, dtype: float64
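
A plain sum rewards movies that simply stayed on the chart for many weeks, so it may help to look at the total, the number of weeks, and the weekly average side by side. The sketch below reuses the five "A Beautiful Mind" rows from earlier; "Hypothetical Film" and its numbers are made up purely to show how the two rankings can disagree.

```python
import pandas as pd

# Five real rows from the head(5) output plus two made-up rows.
movies = pd.DataFrame({
    "MOVIE": ["A Beautiful Mind"] * 5 + ["Hypothetical Film"] * 2,
    "WEEKEND_PER_THEATER": [701.0, 14820.0, 8940.0, 6850.0, 5280.0,
                            20000.0, 15000.0],
})

# Named aggregation keeps the three views in one table.
per_movie = (
    movies.groupby("MOVIE")["WEEKEND_PER_THEATER"]
    .agg(total="sum", weeks="count", mean_per_week="mean")
    .sort_values("mean_per_week", ascending=False)
)
print(per_movie)
```

Here the longer run wins on `total` while the shorter run wins on `mean_per_week`, which is worth keeping in mind before calling either one the better investment. Neither measure accounts for production budget, which this dataset sample does not show.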

## Team Titanic

Known for its honesty, the Titanic dataset is a very common dataset for classification, typically predicting fatalities. For our challenge, why don't we try to focus on the latent characteristics?

For the record, this is how much we know:

![](http://www.glencoe.com/sec/math/studytools/books/0-07-829631-5/images/IQ02-003W-8228662.gif)

Certainly there is a better story to tell.

**Bonus**
 - Can you pull out titles (e.g., Mr., Miss, Mrs.) from the feature "Name" and assign them to a new variable? We think there could be something interesting to look at in aggregate based on titles!

In [61]:
titanic_df = pd.read_csv("../assets/data/titanic.csv")
titanic_df.head(3)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
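
For the title bonus, one possible approach is a regular expression over `Name`. The sketch below assumes every name follows the "Surname, Title. Given names" pattern seen in the three rows above, which may not hold for the whole file; the `Title` column name is our own choice.

```python
import pandas as pd

# Names and outcomes copied from the head(3) output above; the second
# name is truncated exactly as it was displayed.
titanic = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
    ],
})

# Assumption: the title is the text between the comma and the next period.
titanic["Title"] = (
    titanic["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
)

# Aggregate by title: how many passengers, and what share survived.
by_title = titanic.groupby("Title")["Survived"].agg(["count", "mean"])
print(by_title)
```

On the full `titanic_df`, running `value_counts()` on the new column is a quick way to spot rare or unexpected titles before aggregating.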
