
# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling and Storytelling

Taming data from its raw form into informative insights and stories.

## Data Wrangling

In this Sprint Challenge you will first "wrangle" some data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the first part of this sprint challenge.
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

### Part 0. Load data

You don't need to add or change anything here. Just run this cell to load the data into three dataframes.

In [0]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

### Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)

In [0]:
df = pd.merge(cell_phones, population, how='inner', on=['geo', 'time'])

Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)
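
The join described above can be sketched on toy data as follows. The frames `phones_pop` and `codes` and all of their values are made up for illustration; only the column names mirror the real dataframes.

```python
import pandas as pd

# Toy stand-ins for the merged phone/population data and the country codes.
# All values here are invented for illustration.
phones_pop = pd.DataFrame({
    'geo': ['usa', 'usa', 'chn'],
    'time': [2016, 2017, 2017],
    'cell_phones_total': [10.0, 12.0, 30.0],
    'population_total': [9, 10, 28],
})
codes = pd.DataFrame({
    'geo': ['chn', 'usa', 'zzz'],
    'country': ['China', 'United States', 'Unused'],
    'extra': [1, 2, 3],
})

# Select only the key and the name, then join on the shared 'geo' key.
merged = pd.merge(phones_pop, codes[['geo', 'country']], how='inner', on='geo')
print(merged.shape)  # (3, 5)
```

A merge matches rows by the value of `geo`. Assigning columns from one dataframe to another directly would instead align rows by index label, which can pair a country name with another country's data.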

In [0]:
df[['geo', 'country']] = geo_country_codes[['geo', 'country']]

In [4]:
df.head()


Unnamed: 0,geo,time,cell_phones_total,population_total,country
0,abkh,1960,0.0,8996351,Abkhazia
1,abw,1965,0.0,9938414,Aruba
2,afg,1970,0.0,11126123,Afghanistan
3,ago,1975,0.0,12590286,Angola
4,aia,1976,0.0,12840299,Anguilla


In [5]:
df.shape

(8590, 5)

***Optional bonus for Part 1: Take initiative to join more data.***

### Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)

In [0]:
df['cellphones per person'] = df['cell_phones_total'] / df['population_total']

In [7]:
df.shape

(8590, 6)

In [0]:
df = df.dropna(axis=0, how='any')

In [9]:
df.head()

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cellphones per person
0,abkh,1960,0.0,8996351,Abkhazia,0.0
1,abw,1965,0.0,9938414,Aruba,0.0
2,afg,1970,0.0,11126123,Afghanistan,0.0
3,ago,1975,0.0,12590286,Angola,0.0
4,aia,1976,0.0,12840299,Anguilla,0.0


Modify the `geo` column to make the geo codes uppercase instead of lowercase.

In [0]:
df['geo'] = df['geo'].str.upper()

In [11]:
df.head()

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cellphones per person
0,ABKH,1960,0.0,8996351,Abkhazia,0.0
1,ABW,1965,0.0,9938414,Aruba,0.0
2,AFG,1970,0.0,11126123,Afghanistan,0.0
3,AGO,1975,0.0,12590286,Angola,0.0
4,AIA,1976,0.0,12840299,Anguilla,0.0


***Optional bonus for Part 2: Take initiative to make more features.***

### Part 3. Process data

Use the `describe` function to summarize your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)
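
One way to get the two summaries separately is to pass `include` and `exclude` to `describe`. This is a minimal sketch on a made-up frame named `toy`; the values are invented and only the column names resemble the real data.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    'geo': ['USA', 'USA', 'CHN'],
    'time': [2016, 2017, 2017],
    'cell_phones_total': [10.0, 12.0, 30.0],
})

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = toy.describe(include=[np.number])

# Non-numeric columns: count, unique, top, freq.
text_summary = toy.describe(exclude=[np.number])

print(numeric_summary)
print(text_summary)
```

The `unique` row of the non-numeric summary is where the number of distinct countries would show up on the real data.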

In [0]:
import numpy as np 

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 273 entries, 0 to 272
Data columns (total 6 columns):
geo                      273 non-null object
time                     273 non-null int64
cell_phones_total        273 non-null float64
population_total         273 non-null int64
country                  273 non-null object
cellphones per person    273 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 14.9+ KB


In [14]:
df.drop_duplicates()

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cellphones per person
0,ABKH,1960,0.0,8996351,Abkhazia,0.000000
1,ABW,1965,0.0,9938414,Aruba,0.000000
2,AFG,1970,0.0,11126123,Afghanistan,0.000000
3,AGO,1975,0.0,12590286,Angola,0.000000
4,AIA,1976,0.0,12840299,Anguilla,0.000000
5,AKR_A_DHE,1977,0.0,13067538,Akrotiri and Dhekelia,0.000000
6,ALA,1978,0.0,13237734,Åland,0.000000
7,ALB,1979,0.0,13306695,Albania,0.000000
8,AND,1980,0.0,13248370,Andorra,0.000000
9,ANT,1981,0.0,13053954,Netherlands Antilles,0.000000


In [15]:
df.describe()

Unnamed: 0,time,cell_phones_total,population_total,cellphones per person
count,273.0,273.0,273.0,273.0
mean,1994.131868,4610215.0,12748690.0,0.295601
std,14.298197,12205580.0,12968470.0,0.4808
min,1960.0,0.0,13411.0,0.0
25%,1983.0,0.0,2525065.0,0.0
50%,1995.0,7924.0,8672475.0,0.002844
75%,2006.0,2428071.0,22283390.0,0.496965
max,2017.0,67361520.0,44271040.0,2.147349


In [16]:
df.describe(include='all')

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cellphones per person
count,273,273.0,273.0,273.0,273,273.0
unique,273,,,,273,
top,NIC,,,,Western Sahara,
freq,1,,,,1,
mean,,1994.131868,4610215.0,12748690.0,,0.295601
std,,14.298197,12205580.0,12968470.0,,0.4808
min,,1960.0,0.0,13411.0,,0.0
25%,,1983.0,0.0,2525065.0,,0.0
50%,,1995.0,7924.0,8672475.0,,0.002844
75%,,2006.0,2428071.0,22283390.0,,0.496965


In [0]:
df = df.sort_values(by='cell_phones_total', ascending=False)

In [18]:
df.head()

In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |



In [0]:
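# Restrict to 2017, then take the 5 rows with the largest totals
df[df['time'] == 2017].nlargest(5, 'cell_phones_total')[['country', 'cell_phones_total']]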

In [0]:
# This optional code formats float numbers with comma separators
pd.options.display.float_format = '{:,}'.format

In [0]:
# Keep only the USA rows, ordered by year
us_cell = df[df['geo'] == 'USA'].sort_values('time')

In [0]:
us_cell.head()

2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?

In [0]:
2007
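
A question like this can also be answered in code rather than by reading the table. The sketch below uses a made-up frame named `toy` with invented numbers, so the year it finds says nothing about the real data.

```python
import pandas as pd

# Invented numbers for illustration only.
toy = pd.DataFrame({
    'geo': ['USA'] * 4,
    'time': [2001, 2002, 2003, 2004],
    'cell_phones_total': [80.0, 95.0, 105.0, 120.0],
    'population_total': [100, 100, 100, 100],
})
toy['cellphones per person'] = toy['cell_phones_total'] / toy['population_total']

# Filter to one country, keep years where the ratio exceeds 1,
# and take the earliest of those years.
usa = toy[toy['geo'] == 'USA']
first_year = usa.loc[usa['cellphones per person'] > 1, 'time'].min()
print(first_year)  # 2003
```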

***Optional bonus for Part 3: Take initiative to do more exploratory data analysis.***

### (OPTIONAL) Part 4. Reshape data

*This part is not needed to pass the sprint challenge, only to get a 3! Only work on this after completing the other sections.*

Create a pivot table:
- Columns: Years 2007—2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

Sort these 5 countries, by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?
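
The reshaping steps above could look roughly like this. The frame `toy` holds invented totals for three countries and two years, so the shape and the differences are illustrative only.

```python
import pandas as pd

# Invented totals for illustration only.
toy = pd.DataFrame({
    'country': ['China', 'China', 'India', 'India', 'Brazil', 'Brazil'],
    'time': [2007, 2017, 2007, 2017, 2007, 2017],
    'cell_phones_total': [500.0, 1400.0, 200.0, 1000.0, 120.0, 230.0],
})

countries = ['China', 'India', 'Brazil']
subset = toy[toy['country'].isin(countries) & toy['time'].between(2007, 2017)]

# Rows are countries, columns are years, cells are cell phone totals.
table = subset.pivot_table(index='country', columns='time',
                           values='cell_phones_total')

# Sort by the change between the first and last year.
table['increase'] = table[2017] - table[2007]
table = table.sort_values('increase', ascending=False)
print(table)
```

On the real data the same pattern would use the five countries listed above and would yield eleven year columns before the `increase` column is added.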

If you have the time and curiosity, what other questions can you ask and answer with this data?

## Data Storytelling

In this part of the sprint challenge you'll work with a dataset from **FiveThirtyEight's article, [Every Guest Jon Stewart Ever Had On ‘The Daily Show’](https://fivethirtyeight.com/features/every-guest-jon-stewart-ever-had-on-the-daily-show/)**!

### Part 0 — Run this starter code

You don't need to add or change anything here. Just run this cell to load the data into a dataframe named `df`.

(You can explore the data if you want, but it's not required to pass the Sprint Challenge.)

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/daily-show-guests/daily_show_guests.csv'
df = pd.read_csv(url).rename(columns={'YEAR': 'Year', 'Raw_Guest_List': 'Guest'})

def get_occupation(group):
    if group in ['Acting', 'Comedy', 'Musician']:
        return 'Acting, Comedy & Music'
    elif group in ['Media', 'media']:
        return 'Media'
    elif group in ['Government', 'Politician', 'Political Aide']:
        return 'Government and Politics'
    else:
        return 'Other'
      
df['Occupation'] = df['Group'].apply(get_occupation)

### Part 1 — What's the breakdown of guests’ occupations per year?

For example, in 1999, what percentage of guests were actors, comedians, or musicians? What percentage were in the media? What percentage were in politics? What percentage were from another occupation?

Then, what about in 2000? In 2001? And so on, up through 2015.

So, **for each year of _The Daily Show_, calculate the percentage of guests from each occupation:**
- Acting, Comedy & Music
- Government and Politics
- Media
- Other

#### Hints:
You can make a crosstab. (See pandas documentation for examples, explanation, and parameters.)

You'll know you've calculated correctly when the percentage of "Acting, Comedy & Music" guests is 90.36% in 1999, and 45% in 2015.
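
As a minimal sketch of the crosstab hint, `normalize='index'` makes each row (year) sum to 1, and multiplying by 100 gives percentages. The frame `toy` is made up; only the column names match the real dataframe.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    'Year': [1999, 1999, 1999, 1999, 2000, 2000],
    'Occupation': ['Acting, Comedy & Music', 'Acting, Comedy & Music',
                   'Acting, Comedy & Music', 'Media', 'Media', 'Other'],
})

# One row per year, one column per occupation, values in percent.
pct = pd.crosstab(toy['Year'], toy['Occupation'], normalize='index') * 100
print(pct)
```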

**Optional Bonus Challenge:** Do additional insightful data exploration.

In [25]:
df.head()

Unnamed: 0,Year,GoogleKnowlege_Occupation,Show,Group,Guest,Occupation
0,1999,actor,1/11/99,Acting,Michael J. Fox,"Acting, Comedy & Music"
1,1999,Comedian,1/12/99,Comedy,Sandra Bernhard,"Acting, Comedy & Music"
2,1999,television actress,1/13/99,Acting,Tracey Ullman,"Acting, Comedy & Music"
3,1999,film actress,1/14/99,Acting,Gillian Anderson,"Acting, Comedy & Music"
4,1999,actor,1/18/99,Acting,David Alan Grier,"Acting, Comedy & Music"


In [31]:
pd.crosstab(df['Guest'], df['Year'], margins=True).head()

Guest \ Year,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,All
(None),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1
(no guest),0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,2
Aaron Brown,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
Aaron Eckhart,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,3
Aaron Sorkin,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1


In [19]:
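# Percentage of guests in each occupation group, by year
df_cross = pd.crosstab(df['Year'], df['Occupation'], normalize='index') * 100
df_cross.head()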

### Part 2 — Recreate this explanatory visualization:

In [0]:
from IPython.display import display, Image
png = 'https://fivethirtyeight.com/wp-content/uploads/2015/08/hickey-datalab-dailyshow.png'
example = Image(png, width=500)
display(example)

**Hints:**
- You can choose any Python visualization library you want. I've verified the plot can be reproduced with matplotlib, pandas plot, or seaborn. I assume other libraries like altair or plotly would work too.
- If you choose to use seaborn, you may want to upgrade the version to 0.9.0.

**Expectations:** Your plot should include:
- 3 lines visualizing "occupation of guests, by year." The shapes of the lines should look roughly identical to 538's example. Each line should be a different color. (But you don't need to use the _same_ colors as 538.)
- Legend or labels for the lines. (But you don't need each label positioned next to its line or colored like 538.)
- Title in the upper left: _"Who Got To Be On 'The Daily Show'?"_ with more visual emphasis than the subtitle. (Bolder and/or larger font.)
- Subtitle underneath the title: _"Occupation of guests, by year"_
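
The expectations above could be met with plain matplotlib along these lines. The percentages in `pct` are invented placeholders, not the real yearly values, and the text positions and font sizes are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder percentages by year, invented for illustration only.
pct = pd.DataFrame(
    {'Acting, Comedy & Music': [90.0, 60.0, 45.0],
     'Government and Politics': [1.0, 15.0, 17.0],
     'Media': [7.0, 20.0, 24.0]},
    index=pd.Index([1999, 2007, 2015], name='Year'),
)

fig, ax = plt.subplots(figsize=(8, 5))

# One line per occupation group; matplotlib cycles colors automatically.
for column in pct.columns:
    ax.plot(pct.index, pct[column], label=column)
ax.legend()
ax.set_ylabel('Percent of guests')

# Title and subtitle in figure coordinates, anchored at the upper left.
fig.text(0.05, 0.97, "Who Got To Be On 'The Daily Show'?",
         fontsize=16, fontweight='bold', ha='left', va='top')
fig.text(0.05, 0.91, 'Occupation of guests, by year',
         fontsize=12, ha='left', va='top')

# Leave room above the axes for the two text lines.
fig.subplots_adjust(top=0.85)
```

Using `fig.text` rather than `ax.set_title` is one way to left-align the title and give the subtitle a lighter weight. Other approaches work as well.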

**Optional Bonus Challenge:**
- Give your plot polished aesthetics, with improved resemblance to the 538 example.
- Any visual element not specifically mentioned in the expectations is an optional bonus.

### (OPTIONAL) Part 3 — Who were the top 10 guests on _The Daily Show_?

*This part is not needed to pass the sprint challenge, only to get a 3! Only work on this after completing the other sections.*

**Make a plot** that shows their names and number of appearances.

**Add a title** of your choice.

**Expectations:** It's ok to make a simple, quick plot: exploratory, instead of explanatory. 

**Optional Bonus Challenge:** You can change aesthetics and add more annotation. For example, in a relevant location, could you add the text "19" to show that Fareed Zakaria appeared 19 times on _The Daily Show_? (And so on, for each of the top 10 guests.)
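
A quick exploratory version might use `value_counts` and a horizontal bar chart, with each count written next to its bar. The guest names in `toy` are hypothetical placeholders, not real guests.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical guest names for illustration only.
toy = pd.DataFrame({'Guest': ['Guest A', 'Guest B', 'Guest A',
                              'Guest C', 'Guest A', 'Guest B']})

# Appearance counts, largest first, limited to the top 10.
top = toy['Guest'].value_counts().head(10)

fig, ax = plt.subplots()

# barh draws the first row at the bottom, so sort ascending
# to put the most frequent guest at the top of the chart.
ordered = top.sort_values()
ax.barh(ordered.index, ordered.values)
ax.set_title('Most frequent guests (toy data)')

# Annotate each bar with its count at the end of the bar.
for position, count in enumerate(ordered.values):
    ax.text(count, position, str(count), va='center')
```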