<a href="https://colab.research.google.com/github/HadenMoore/DS-Unit-1-Sprint-2-Data-Wrangling-and-Storytelling/blob/master/DS_Unit_1_Sprint_Challenge_2_Data_Wrangling_and_Storytelling_(2).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling and Storytelling

Taming data from its raw form into informative insights and stories.

## Data Wrangling

In this Sprint Challenge you will first "wrangle" some data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the first part of this sprint challenge.
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

### Part 0. Load data

You don't need to add or change anything here. Just run this cell and it loads the data for you, into three dataframes.

In [0]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

### Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)
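As a rough illustration of what an inner join on two keys does, here is a minimal sketch on toy frames. The frame names `phones` and `people` and all values are made up for the example; only the column names mirror the Gapminder data.

```python
import pandas as pd

# Toy stand-ins for cell_phones and population (hypothetical values).
phones = pd.DataFrame({
    "geo": ["aaa", "aaa", "bbb"],
    "time": [2000, 2001, 2000],
    "cell_phones_total": [10.0, 20.0, 5.0],
})
people = pd.DataFrame({
    "geo": ["aaa", "aaa", "bbb", "bbb"],
    "time": [2000, 2001, 2001, 2002],
    "population_total": [100, 110, 50, 55],
})

# An inner join keeps only the (geo, time) pairs present in both frames.
# validate="one_to_one" raises an error if either side repeats a key pair.
joined = pd.merge(phones, people, how="inner", on=["geo", "time"],
                  validate="one_to_one")
print(joined.shape)
```

Only the two `aaa` rows survive here, because `bbb` has no year in common between the two toy frames. The same logic explains why the real join returns fewer rows than either input.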

In [87]:
# Looking at its head 
cell_phones.head()

|   | geo | time | cell_phones_total |
|---|-----|------|-------------------|
| 0 | abw | 1960 | 0.0 |
| 1 | abw | 1965 | 0.0 |
| 2 | abw | 1970 | 0.0 |
| 3 | abw | 1975 | 0.0 |
| 4 | abw | 1976 | 0.0 |


In [88]:
# Looking at the shape of the dataframe 
cell_phones.shape

(9215, 3)

In [89]:
#Checking for NaNs
cell_phones.isnull().sum()

geo                  0
time                 0
cell_phones_total    0
dtype: int64

In [90]:
# Looking at its head 
population.head()

|   | geo | time | population_total |
|---|-----|------|------------------|
| 0 | afg | 1800 | 3280000 |
| 1 | afg | 1801 | 3280000 |
| 2 | afg | 1802 | 3280000 |
| 3 | afg | 1803 | 3280000 |
| 4 | afg | 1804 | 3280000 |


In [91]:
# Looking at the shape of the dataframe 
population.shape

(59297, 3)

In [92]:
#Checking for NaNs
population.isnull().sum()

geo                 0
time                0
population_total    0
dtype: int64

In [93]:
#inner joining on Geo and Time 
merge1 = pd.merge(cell_phones, population, how='inner', on=['geo','time'])
print(merge1)

      geo  time  cell_phones_total  population_total
0     afg  1960                0.0           8996351
1     afg  1965                0.0           9938414
2     afg  1970                0.0          11126123
3     afg  1975                0.0          12590286
4     afg  1976                0.0          12840299
5     afg  1977                0.0          13067538
6     afg  1978                0.0          13237734
7     afg  1979                0.0          13306695
8     afg  1980                0.0          13248370
9     afg  1981                0.0          13053954
10    afg  1982                0.0          12749645
11    afg  1983                0.0          12389269
12    afg  1984                0.0          12047115
13    afg  1985                0.0          11783050
14    afg  1986                0.0          11601041
15    afg  1987                0.0          11502761
16    afg  1988                0.0          11540888
17    afg  1989                0.0          11

In [94]:
# Check the final shape after the inner merge
merge1.shape

(8590, 4)

Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)
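One way to confirm that the row count should stay the same is to check that every `geo` code finds a country name. The sketch below uses toy frames (`data`, `codes`, and their values are hypothetical) with a left join plus pandas' `indicator` flag.

```python
import pandas as pd

# Toy stand-ins (hypothetical values): the joined data and a lookup table
# that carries extra columns we do not need.
data = pd.DataFrame({"geo": ["aaa", "aaa", "bbb"], "time": [2000, 2001, 2000]})
codes = pd.DataFrame({
    "geo": ["aaa", "bbb", "ccc"],
    "country": ["Country A", "Country B", "Country C"],
    "landlocked": ["coastline", "landlocked", "coastline"],
})

# Keep only the two columns needed for the lookup.
lookup = codes[["geo", "country"]]

# A left join keeps every row of `data`; indicator=True adds a `_merge`
# column that shows whether each row found a match in `lookup`.
checked = data.merge(lookup, how="left", on="geo",
                     validate="many_to_one", indicator=True)
unmatched = checked[checked["_merge"] != "both"]
print(checked.shape, len(unmatched))
```

If `unmatched` is empty, an inner join and a left join give the same rows, so the row count of the data is preserved.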

In [95]:
#Looking at columns 
geo_country_codes.head()

Unnamed: 0,geo,alt_5,alternative_1,alternative_2,alternative_3,alternative_4_cdiac,arb1,arb2,arb3,arb4,arb5,arb6,g77_and_oecd_countries,gapminder_list,god_id,gwid,income_groups,is--country,iso3166_1_alpha2,iso3166_1_alpha3,iso3166_1_numeric,iso3166_2,landlocked,latitude,longitude,main_religion_2008,country,pandg,un_state,unicode_region_subtag,upper_case_name,world_4region,world_6region
0,abkh,,,,,,,,,,,,others,Abkhazia,GE-AB,i0,,True,,,,,,,,,Abkhazia,,False,,,europe,europe_central_asia
1,abw,,,,,Aruba,,,,,,,others,Aruba,AW,i12,high_income,True,AW,ABW,533.0,,coastline,12.5,-69.96667,christian,Aruba,,False,AW,ARUBA,americas,america
2,afg,,Islamic Republic of Afghanistan,,,Afghanistan,,,,,,,g77,Afghanistan,AF,i1,low_income,True,AF,AFG,4.0,,landlocked,33.0,66.0,muslim,Afghanistan,AFGHANISTAN,True,AF,AFGHANISTAN,asia,south_asia
3,ago,,,,,Angola,,,,,,,g77,Angola,AO,i7,upper_middle_income,True,AO,AGO,24.0,,coastline,-12.5,18.5,christian,Angola,ANGOLA,True,AO,ANGOLA,africa,sub_saharan_africa
4,aia,,,,,,,,,,,,others,Anguilla,AI,i8,,True,AI,AIA,660.0,,coastline,18.21667,-63.05,christian,Anguilla,,False,AI,ANGUILLA,americas,america


In [96]:
# Checking Shape of DataFrame 
geo_country_codes.shape

(273, 33)

In [97]:
#Checking Value Counts of Country Column
geo_country_codes['country'].value_counts()

Netherlands                       1
Guernsey                          1
Chad                              1
Heard and McDonald Islands        1
Morocco                           1
St. Martin                        1
Wallis et Futuna                  1
Niue                              1
Italy                             1
Somalia                           1
Aruba                             1
Brazil                            1
Ukraine                           1
Bahamas                           1
Holy See                          1
Papua New Guinea                  1
St.-Pierre-et-Miquelon            1
Switzerland                       1
Central African Republic          1
Slovak Republic                   1
Qatar                             1
Argentina                         1
Bolivia                           1
India                             1
South Ossetia                     1
El Salvador                       1
Norway                            1
Egypt                       

In [0]:
#Creating variable to hold country and geo columns
column = ['geo', 'country']

In [99]:
final = pd.merge(merge1, geo_country_codes[column], how='inner', on='geo')
print(final)

      geo  time  cell_phones_total  population_total      country
0     afg  1960                0.0           8996351  Afghanistan
1     afg  1965                0.0           9938414  Afghanistan
2     afg  1970                0.0          11126123  Afghanistan
3     afg  1975                0.0          12590286  Afghanistan
4     afg  1976                0.0          12840299  Afghanistan
5     afg  1977                0.0          13067538  Afghanistan
6     afg  1978                0.0          13237734  Afghanistan
7     afg  1979                0.0          13306695  Afghanistan
8     afg  1980                0.0          13248370  Afghanistan
9     afg  1981                0.0          13053954  Afghanistan
10    afg  1982                0.0          12749645  Afghanistan
11    afg  1983                0.0          12389269  Afghanistan
12    afg  1984                0.0          12047115  Afghanistan
13    afg  1985                0.0          11783050  Afghanistan
14    afg 

In [100]:
#Answer 
final.shape

(8590, 5)

### Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)
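The check in parentheses can be reproduced by hand. This is a minimal sketch that uses the 2017 United States totals shown in this notebook's own output (395,881,000 cell phones and 324,459,463 people); the variable names are illustrative only.

```python
# 2017 United States totals, copied from the notebook output.
cell_phones_total = 395_881_000
population_total = 324_459_463

# Phones per person is simply one total divided by the other.
phones_per_person = cell_phones_total / population_total
print(f"{phones_per_person:.3f}")
```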

In [101]:
# Divide total cell phones by total population to get phones per person
phones_per_person = final['cell_phones_total'] / final['population_total']
print(phones_per_person)

0                        0.0
1                        0.0
2                        0.0
3                        0.0
4                        0.0
5                        0.0
6                        0.0
7                        0.0
8                        0.0
9                        0.0
10                       0.0
11                       0.0
12                       0.0
13                       0.0
14                       0.0
15                       0.0
16                       0.0
17                       0.0
18                       0.0
19                       0.0
20                       0.0
21                       0.0
22                       0.0
23                       0.0
24                       0.0
25                       0.0
26                       0.0
27                       0.0
28                       0.0
29                       0.0
                ...         
8560                     0.0
8561                     0.0
8562                     0.0
8563          

In [102]:
final['phones_per_person'] = phones_per_person
print(final)

      geo  time  ...      country     phones_per_person
0     afg  1960  ...  Afghanistan                   0.0
1     afg  1965  ...  Afghanistan                   0.0
2     afg  1970  ...  Afghanistan                   0.0
3     afg  1975  ...  Afghanistan                   0.0
4     afg  1976  ...  Afghanistan                   0.0
5     afg  1977  ...  Afghanistan                   0.0
6     afg  1978  ...  Afghanistan                   0.0
7     afg  1979  ...  Afghanistan                   0.0
8     afg  1980  ...  Afghanistan                   0.0
9     afg  1981  ...  Afghanistan                   0.0
10    afg  1982  ...  Afghanistan                   0.0
11    afg  1983  ...  Afghanistan                   0.0
12    afg  1984  ...  Afghanistan                   0.0
13    afg  1985  ...  Afghanistan                   0.0
14    afg  1986  ...  Afghanistan                   0.0
15    afg  1987  ...  Afghanistan                   0.0
16    afg  1988  ...  Afghanistan               

In [103]:
usa = final[final.country=='United States']
usa.head()

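|      | geo | time | cell_phones_total | population_total | country | phones_per_person |
|------|-----|------|-------------------|------------------|---------|-------------------|
| 8092 | usa | 1960 | 0.0 | 186808228 | United States | 0.0 |
| 8093 | usa | 1965 | 0.0 | 199815540 | United States | 0.0 |
| 8094 | usa | 1970 | 0.0 | 209588150 | United States | 0.0 |
| 8095 | usa | 1975 | 0.0 | 219205296 | United States | 0.0 |
| 8096 | usa | 1976 | 0.0 | 221239215 | United States | 0.0 |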


In [104]:
answer = usa[usa.time.isin([2017])]
print(answer)

      geo  time  ...        country  phones_per_person
8134  usa  2017  ...  United States 1.2201246847283354

[1 rows x 6 columns]


Modify the `geo` column to make the geo codes uppercase instead of lowercase.

In [105]:
final['geo'] = final['geo'].str.upper()
final.head()

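|   | geo | time | cell_phones_total | population_total | country | phones_per_person |
|---|-----|------|-------------------|------------------|---------|-------------------|
| 0 | AFG | 1960 | 0.0 | 8996351 | Afghanistan | 0.0 |
| 1 | AFG | 1965 | 0.0 | 9938414 | Afghanistan | 0.0 |
| 2 | AFG | 1970 | 0.0 | 11126123 | Afghanistan | 0.0 |
| 3 | AFG | 1975 | 0.0 | 12590286 | Afghanistan | 0.0 |
| 4 | AFG | 1976 | 0.0 | 12840299 | Afghanistan | 0.0 |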


### Part 3. Process data

Use the `describe` function to summarize your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)
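The two summaries can also be produced separately. Here is a minimal sketch on a toy frame (the frame `toy` and its values are hypothetical), using `describe()` for numeric columns and `describe(exclude=[np.number])` for everything else.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and two text columns (hypothetical values).
toy = pd.DataFrame({
    "geo": ["AAA", "AAA", "BBB"],
    "time": [2000, 2001, 2000],
    "country": ["Country A", "Country A", "Country B"],
})

# By default, describe() summarizes numeric columns only (count, mean, min, ...).
numeric_summary = toy.describe()

# Excluding numeric dtypes summarizes the remaining columns
# (count, unique, top, freq).
text_summary = toy.describe(exclude=[np.number])

print(numeric_summary)
print(text_summary)
```

The `min` and `max` rows of the numeric summary give the time range, and the `unique` row of the non-numeric summary gives the number of distinct countries.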

In [106]:
# include='all' summarizes numeric and non-numeric columns in a single table
final.describe(include='all')

|        | geo | time | cell_phones_total | population_total | country | phones_per_person |
|--------|-----|------|-------------------|------------------|---------|-------------------|
| count  | 8590 | 8590.0 | 8590.0 | 8590.0 | 8590 | 8590.0 |
| unique | 195 |  |  |  | 195 |  |
| top    | BEL |  |  |  | Bulgaria |  |
| freq   | 46 |  |  |  | 46 |  |
| mean   |  | 1994.1934807916184 | 9004949.642905472 | 29838230.581722934 |  | 0.2796385558059151 |
| std    |  | 14.257974607310302 | 55734084.87217964 | 116128377.474773 |  | 0.454246656214052 |
| min    |  | 1960.0 | 0.0 | 4433.0 |  | 0.0 |
| 25%    |  | 1983.0 | 0.0 | 1456148.0 |  | 0.0 |
| 50%    |  | 1995.0 | 6200.0 | 5725062.5 |  | 0.0015636266438163 |
| 75%    |  | 2006.0 | 1697652.0 | 18105812.0 |  | 0.4611491855201403 |


In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |



In [0]:
# This optional code formats float numbers with comma separators
pd.options.display.float_format = '{:,}'.format

In [108]:
final[final['time'] == 2017].sort_values('cell_phones_total', ascending=False).head(5)

|      | geo | time | cell_phones_total | population_total | country | phones_per_person |
|------|-----|------|-------------------|------------------|---------|-------------------|
| 1496 | CHN | 2017 | 1474097000.0 | 1409517397 | China | 1.0458168186766978 |
| 3595 | IND | 2017 | 1168902277.0 | 1339180127 | India | 0.8728491809526382 |
| 3549 | IDN | 2017 | 458923202.0 | 263991379 | Indonesia | 1.738402230172827 |
| 8134 | USA | 2017 | 395881000.0 | 324459463 | United States | 1.2201246847283354 |
| 1084 | BRA | 2017 | 236488548.0 | 209288278 | Brazil | 1.1299655683535224 |


2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?

In [109]:
usa = final[final.country=='United States']
usa[['time','cell_phones_total','population_total']]

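|      | time | cell_phones_total | population_total |
|------|------|-------------------|------------------|
| 8092 | 1960 | 0.0 | 186808228 |
| 8093 | 1965 | 0.0 | 199815540 |
| 8094 | 1970 | 0.0 | 209588150 |
| 8095 | 1975 | 0.0 | 219205296 |
| 8096 | 1976 | 0.0 | 221239215 |
| 8097 | 1977 | 0.0 | 223324042 |
| 8098 | 1978 | 0.0 | 225449657 |
| 8099 | 1979 | 0.0 | 227599878 |
| 8100 | 1980 | 0.0 | 229763052 |
| 8101 | 1984 | 91600.0 | 238573861 |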


In [0]:
#Answer is 2014
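Rather than reading the year off the table by eye, the answer can be computed. This is a minimal sketch on a toy frame; the years and totals in `toy` are hypothetical and are not the United States figures.

```python
import pandas as pd

# Toy yearly totals for a single country (hypothetical values).
toy = pd.DataFrame({
    "time": [2001, 2002, 2003, 2004],
    "cell_phones_total": [90.0, 98.0, 103.0, 110.0],
    "population_total": [100, 101, 102, 103],
})

# Keep only the years where phones outnumber people, then take the earliest.
more_phones = toy[toy["cell_phones_total"] > toy["population_total"]]
first_year = more_phones["time"].min()
print(first_year)
```

Applied to the `usa` dataframe, the same filter and `min()` return the first year directly.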

### Part 4. Reshape data

*This part is not needed to pass the sprint challenge, only to get a 3! Only work on this after completing the other sections.*

Create a pivot table:
- Columns: Years 2007—2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

In [0]:
import seaborn as sns
import matplotlib.pyplot as plt

In [0]:
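countries = ['China', 'India', 'United States', 'Indonesia', 'Brazil']
years = list(range(2007, 2018))  # `time` holds integers, so compare against ints, not strings

In [113]:
# Filter to the five countries and eleven years first, then pivot on real column names
subset = final[final['country'].isin(countries) & final['time'].isin(years)]
subset.pivot_table(index='country', columns='time', values='cell_phones_total')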

Sort these 5 countries, by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?
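Once the data is in a country-by-year pivot table, the increase is a column subtraction. The sketch below uses a toy frame with two placeholder countries and hypothetical totals.

```python
import pandas as pd

# Toy long-format data (hypothetical countries and values).
toy = pd.DataFrame({
    "country": ["Country A", "Country A", "Country B", "Country B"],
    "time": [2007, 2017, 2007, 2017],
    "cell_phones_total": [100.0, 250.0, 300.0, 350.0],
})

# Rows are countries, columns are years.
pivot = toy.pivot_table(index="country", columns="time",
                        values="cell_phones_total")

# Subtract the 2007 column from the 2017 column, largest increase first.
increase = (pivot[2017] - pivot[2007]).sort_values(ascending=False)
print(increase)
```

The first entry of `increase` is the country with the biggest gain, so on the real pivot table it answers both the sorting question and the 935,282,277 question.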

If you have the time and curiosity, what other questions can you ask and answer with this data?

## Data Storytelling

In this part of the sprint challenge you'll work with a dataset from **FiveThirtyEight's article, [Every Guest Jon Stewart Ever Had On ‘The Daily Show’](https://fivethirtyeight.com/features/every-guest-jon-stewart-ever-had-on-the-daily-show/)**!

### Part 0 — Run this starter code

You don't need to add or change anything here. Just run this cell and it loads the data for you, into a dataframe named `df`.

(You can explore the data if you want, but it's not required to pass the Sprint Challenge.)

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/daily-show-guests/daily_show_guests.csv'
df = pd.read_csv(url).rename(columns={'YEAR': 'Year', 'Raw_Guest_List': 'Guest'})

def get_occupation(group):
    if group in ['Acting', 'Comedy', 'Musician']:
        return 'Acting, Comedy & Music'
    elif group in ['Media', 'media']:
        return 'Media'
    elif group in ['Government', 'Politician', 'Political Aide']:
        return 'Government and Politics'
    else:
        return 'Other'
      
df['Occupation'] = df['Group'].apply(get_occupation)

### Part 1 — What's the breakdown of guests’ occupations per year?

For example, in 1999, what percentage of guests were actors, comedians, or musicians? What percentage were in the media? What percentage were in politics? What percentage were from another occupation?

Then, what about in 2000? In 2001? And so on, up through 2015.

So, **for each year of _The Daily Show_, calculate the percentage of guests from each occupation:**
- Acting, Comedy & Music
- Government and Politics
- Media
- Other

#### Hints:
You can make a crosstab. (See pandas documentation for examples, explanation, and parameters.)

You'll know you've calculated correctly when the percentage of "Acting, Comedy & Music" guests is 90.36% in 1999, and 45% in 2015.
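As a small illustration of the crosstab hint, the sketch below computes per-year percentages on a toy frame (the rows in `toy` are hypothetical). `normalize="index"` makes each row sum to 1, and multiplying by 100 turns the fractions into percentages.

```python
import pandas as pd

# Toy guest list (hypothetical rows), one row per guest appearance.
toy = pd.DataFrame({
    "Year": [1999, 1999, 1999, 1999, 2000, 2000],
    "Occupation": ["Media", "Other", "Media", "Media", "Other", "Media"],
})

# Count guests per (Year, Occupation), normalize within each year,
# then scale the fractions to percentages.
percentages = pd.crosstab(toy["Year"], toy["Occupation"],
                          normalize="index") * 100
print(percentages)
```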

In [0]:
df.head()

In [0]:
# Creating a crosstab to view the data better
pd.crosstab(df['Occupation'], df['Year'])

In [0]:
pd.crosstab(df['Year'], df['Occupation'], normalize='index')

### Part 2 — Recreate this explanatory visualization:

In [0]:
from IPython.display import display, Image
png = 'https://fivethirtyeight.com/wp-content/uploads/2015/08/hickey-datalab-dailyshow.png'
example = Image(png, width=500)
display(example)

**Hints:**
- You can choose any Python visualization library you want. I've verified the plot can be reproduced with matplotlib, pandas plot, or seaborn. I assume other libraries like altair or plotly would work too.
- If you choose to use seaborn, you may want to upgrade the version to 0.9.0.

**Expectations:** Your plot should include:
- 3 lines visualizing "occupation of guests, by year." The shapes of the lines should look roughly identical to 538's example. Each line should be a different color. (But you don't need to use the _same_ colors as 538.)
- Legend or labels for the lines. (But you don't need each label positioned next to its line or colored like 538.)
- Title in the upper left: _"Who Got To Be On 'The Daily Show'?"_ with more visual emphasis than the subtitle. (Bolder and/or larger font.)
- Subtitle underneath the title: _"Occupation of guests, by year"_

**Optional Bonus Challenge:**
- Give your plot polished aesthetics, with improved resemblance to the 538 example.
- Any visual element not specifically mentioned in the expectations is an optional bonus.
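One possible way to meet the expectations with pandas plot and matplotlib is sketched below. The percentages in `toy` are hypothetical, not the real Daily Show numbers, and the font sizes and title position are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical percentages by year, one column per occupation group.
toy = pd.DataFrame(
    {
        "Acting, Comedy & Music": [80.0, 60.0, 50.0],
        "Government and Politics": [5.0, 15.0, 20.0],
        "Media": [10.0, 20.0, 25.0],
    },
    index=pd.Index([2000, 2005, 2010], name="Year"),
)

fig, ax = plt.subplots(figsize=(8, 5))

# pandas draws one line per column, each in a different color, with a legend.
toy.plot(ax=ax, kind="line", legend=True)

# Title: bold and larger, placed at the upper left of the figure.
fig.suptitle("Who Got To Be On 'The Daily Show'?", x=0.125, y=0.98,
             ha="left", fontsize=14, fontweight="bold")
# Subtitle: smaller, left-aligned just above the axes.
ax.set_title("Occupation of guests, by year", loc="left", fontsize=11)
ax.set_xlabel("")
ax.set_ylabel("Percent of guests")
```

With the real data, the same calls can be applied to the crosstab after dropping the "Other" column and multiplying by 100.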

In [0]:
ct = pd.crosstab(df['Year'], df['Occupation'], normalize='index')
ct

In [0]:
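# The expectations call for three lines, so drop 'Other' and draw a line plot
ct.drop(columns='Other').plot(kind='line', legend=True);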