# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling

In this Sprint Challenge you will use data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the Sprint Challenge!
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

## Part 0. Load data

You don't need to add or change anything here. Just run this cell; it loads the data into three dataframes.

In [0]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

## Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)

In [0]:

cell_pop = pd.merge(cell_phones, population, how='inner', on=['geo', 'time'])

In [5]:
cell_pop.shape

(8590, 4)
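
One way to sanity-check an inner join like the one above is to let pandas verify that each key pair is unique on both sides. The sketch below is a minimal illustration on made-up frames, not the Gapminder data; the names `toy_phones`, `toy_pop`, and `toy_cell_pop` and all of their values are assumptions for the example.

```python
import pandas as pd

# Toy stand-ins for cell_phones and population (made-up numbers).
toy_phones = pd.DataFrame({
    'geo': ['afg', 'afg', 'usa'],
    'time': [1960, 1965, 2017],
    'cell_phones_total': [0.0, 0.0, 100.0],
})
toy_pop = pd.DataFrame({
    'geo': ['afg', 'afg', 'usa', 'usa'],
    'time': [1960, 1965, 2016, 2017],
    'population_total': [10, 11, 80, 82],
})

# validate='one_to_one' raises a MergeError if a (geo, time) pair repeats
# on either side, which would otherwise silently inflate the row count.
toy_cell_pop = pd.merge(toy_phones, toy_pop, how='inner',
                        on=['geo', 'time'], validate='one_to_one')
print(toy_cell_pop.shape)
```

The unmatched `('usa', 2016)` population row is dropped by the inner join, so only the three rows with phone data remain.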

In [0]:
cell_phones.head()

In [0]:
cell_pop.head()

Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)

In [8]:
cols = ['geo', 'country']
final = pd.merge(cell_pop, geo_country_codes[cols])
final.shape

(8590, 5)
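
The merge above relies on pandas choosing the only shared column name (`geo`) as the join key. A slightly more explicit variant, sketched here on made-up data, names the key and asks pandas to confirm that each `geo` code appears only once in the lookup table. The frames `toy_cell_pop` and `toy_codes` and the `extra` column are hypothetical.

```python
import pandas as pd

toy_cell_pop = pd.DataFrame({
    'geo': ['afg', 'afg', 'usa'],
    'time': [1960, 1965, 2017],
    'cell_phones_total': [0.0, 0.0, 100.0],
    'population_total': [10, 11, 82],
})
toy_codes = pd.DataFrame({
    'geo': ['afg', 'usa', 'bra'],
    'country': ['Afghanistan', 'United States', 'Brazil'],
    'extra': ['x', 'y', 'z'],  # stands in for the columns we do not need
})

# Select only the two needed columns, then join many rows to one country name.
toy_final = pd.merge(toy_cell_pop, toy_codes[['geo', 'country']],
                     how='inner', on='geo', validate='many_to_one')
print(toy_final.shape)
```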

## Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)

In [9]:
final.head()

Unnamed: 0,geo,time,cell_phones_total,population_total,country
0,afg,1960,0.0,8996351,Afghanistan
1,afg,1965,0.0,9938414,Afghanistan
2,afg,1970,0.0,11126123,Afghanistan
3,afg,1975,0.0,12590286,Afghanistan
4,afg,1976,0.0,12840299,Afghanistan


In [19]:
final['cell_phones_per_person'] = final.cell_phones_total / final.population_total
final[final.country == 'United States'].tail(1)

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phones_per_person
8134,usa,2017,395881000.0,324459463,United States,1.220125
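
The check suggested in the prompt can also be written as code rather than read off the output. The sketch below rebuilds only the United States 2017 row, using the figures shown above, and recomputes the ratio; the frame name `usa_2017` is an assumption for the example.

```python
import pandas as pd

# The United States 2017 row, copied from the output above.
usa_2017 = pd.DataFrame({
    'geo': ['usa'],
    'time': [2017],
    'cell_phones_total': [395881000.0],
    'population_total': [324459463],
})
usa_2017['cell_phones_per_person'] = (
    usa_2017['cell_phones_total'] / usa_2017['population_total']
)

ratio = usa_2017['cell_phones_per_person'].iloc[0]
print(round(ratio, 3))  # 1.22
```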


Modify the `geo` column to make the geo codes uppercase instead of lowercase.

In [22]:
final.head()

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phones_per_person
0,afg,1960,0.0,8996351,Afghanistan,0.0
1,afg,1965,0.0,9938414,Afghanistan,0.0
2,afg,1970,0.0,11126123,Afghanistan,0.0
3,afg,1975,0.0,12590286,Afghanistan,0.0
4,afg,1976,0.0,12840299,Afghanistan,0.0


In [24]:
final['geo'] = final.geo.str.upper()
final.head()

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phones_per_person
0,AFG,1960,0.0,8996351,Afghanistan,0.0
1,AFG,1965,0.0,9938414,Afghanistan,0.0
2,AFG,1970,0.0,11126123,Afghanistan,0.0
3,AFG,1975,0.0,12590286,Afghanistan,0.0
4,AFG,1976,0.0,12840299,Afghanistan,0.0


## Part 3. Process data

Use the `describe` function to summarize your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)

In [25]:
final.describe()

Unnamed: 0,time,cell_phones_total,population_total,cell_phones_per_person
count,8590.0,8590.0,8590.0,8590.0
mean,1994.193481,9004950.0,29838230.0,0.279639
std,14.257975,55734080.0,116128400.0,0.454247
min,1960.0,0.0,4433.0,0.0
25%,1983.0,0.0,1456148.0,0.0
50%,1995.0,6200.0,5725062.0,0.001564
75%,2006.0,1697652.0,18105810.0,0.461149
max,2017.0,1474097000.0,1409517000.0,2.490243


In [26]:
final.describe(include='all')

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phones_per_person
count,8590,8590.0,8590.0,8590.0,8590,8590.0
unique,195,,,,195,
top,UGA,,,,Romania,
freq,46,,,,46,
mean,,1994.193481,9004950.0,29838230.0,,0.279639
std,,14.257975,55734080.0,116128400.0,,0.454247
min,,1960.0,0.0,4433.0,,0.0
25%,,1983.0,0.0,1456148.0,,0.0
50%,,1995.0,6200.0,5725062.0,,0.001564
75%,,2006.0,1697652.0,18105810.0,,0.461149
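
`include='all'` mixes numeric and non-numeric summaries in one table, which is why so many cells are empty. If only the non-numeric columns are wanted, one option is to exclude the numeric dtypes instead. The sketch below uses a small made-up frame named `toy`.

```python
import pandas as pd

toy = pd.DataFrame({
    'geo': ['AFG', 'AFG', 'USA'],
    'time': [1960, 1965, 2017],
    'country': ['Afghanistan', 'Afghanistan', 'United States'],
})

numeric_summary = toy.describe()                  # default: numeric columns only
text_summary = toy.describe(exclude=['number'])   # everything that is not numeric
print(text_summary)
```

The second table has only the `count`, `unique`, `top`, and `freq` rows, so the number of unique countries can be read directly.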


In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |



In [0]:
# This optional code formats float numbers with comma separators
pd.options.display.float_format = '{:,}'.format

In [46]:
final[final.time == 2017].pivot_table(index='country', values='cell_phones_total').sort_values(by='cell_phones_total',ascending=False).head()

country,cell_phones_total
China,1474097000.0
India,1168902277.0
Indonesia,458923202.0
United States,395881000.0
Brazil,236488548.0
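
Because each country has a single row per year, the pivot table is not strictly needed for this question. As an alternative sketch, `nlargest` picks the top rows directly. The frame below reuses the five 2017 totals shown above plus one made-up row (`Exampleland`) so that the cut-off at five is visible.

```python
import pandas as pd

# 2017 totals copied from the output above, plus one made-up smaller row.
rows_2017 = pd.DataFrame({
    'country': ['Brazil', 'China', 'India', 'Indonesia',
                'United States', 'Exampleland'],
    'time': [2017] * 6,
    'cell_phones_total': [236488548.0, 1474097000.0, 1168902277.0,
                          458923202.0, 395881000.0, 1000.0],
})

# Filter to the year, then keep the five largest totals in descending order.
top5 = (rows_2017[rows_2017['time'] == 2017]
        .nlargest(5, 'cell_phones_total')[['country', 'cell_phones_total']])
print(top5)
```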


2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?

In [40]:
final[(final.cell_phones_per_person > 1) & (final.geo == 'USA')]

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phones_per_person
8131,USA,2014,355500000.0,317718779,United States,1.118914031833164
8132,USA,2015,382307000.0,319929162,United States,1.1949739048796058
8133,USA,2016,395881000.0,322179605,United States,1.228758722948959
8134,USA,2017,395881000.0,324459463,United States,1.2201246847283354


In [42]:
# Looks like it was 2014, but let's double-check.

final[(final.cell_phones_total > final.population_total) & (final.geo == 'USA')]

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phones_per_person
8131,USA,2014,355500000.0,317718779,United States,1.118914031833164
8132,USA,2015,382307000.0,319929162,United States,1.1949739048796058
8133,USA,2016,395881000.0,322179605,United States,1.228758722948959
8134,USA,2017,395881000.0,324459463,United States,1.2201246847283354


In [45]:
# one more time

final[(final.geo == 'USA')].sort_values(by='time', ascending=False).head()

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phones_per_person
8134,USA,2017,395881000.0,324459463,United States,1.2201246847283354
8133,USA,2016,395881000.0,322179605,United States,1.228758722948959
8132,USA,2015,382307000.0,319929162,United States,1.1949739048796058
8131,USA,2014,355500000.0,317718779,United States,1.118914031833164
8130,USA,2013,310698000.0,315536676,United States,0.9846652501340288
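
The three checks above all read the answer off a table. The same answer can be computed directly by filtering and taking the earliest year. This sketch rebuilds only the 2013 to 2017 United States rows shown above, so it finds the first such year within those rows; the names `usa`, `more_phones`, and `first_year` are assumptions for the example.

```python
import pandas as pd

# United States rows for 2013-2017, copied from the output above.
usa = pd.DataFrame({
    'time': [2013, 2014, 2015, 2016, 2017],
    'cell_phones_total': [310698000.0, 355500000.0, 382307000.0,
                          395881000.0, 395881000.0],
    'population_total': [315536676, 317718779, 319929162,
                         322179605, 324459463],
})

# Keep the years with more phones than people, then take the earliest one.
more_phones = usa[usa['cell_phones_total'] > usa['population_total']]
first_year = int(more_phones['time'].min())
print(first_year)  # 2014
```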


## Part 4. Reshape data

Create a pivot table:
- Columns: Years 2007—2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

In [0]:
countries = ['China', 'India', 'United States', 'Indonesia', 'Brazil']

pivot = final[(final.time >= 2007) & (final.country.isin(countries))].pivot_table(index='country', values='cell_phones_total', columns='time')

In [52]:
pivot

country,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Brazil,120980103.0,150641403.0,169385584.0,196929978.0,234357507.0,248323703.0,271099799.0,280728796.0,257814274.0,244067356.0,236488548.0
China,547306000.0,641245000.0,747214000.0,859003000.0,986253000.0,1112155000.0,1229113000.0,1286093000.0,1291984200.0,1364934000.0,1474097000.0
India,233620000.0,346890000.0,525090000.0,752190000.0,893862478.0,864720917.0,886304245.0,944008677.0,1001056000.0,1127809000.0,1168902277.0
Indonesia,93386881.0,140578243.0,163676961.0,211290235.0,249805619.0,281963665.0,313226914.0,325582819.0,338948340.0,385573398.0,458923202.0
United States,249300000.0,261300000.0,274283000.0,285118000.0,297404000.0,304838000.0,310698000.0,355500000.0,382307000.0,395881000.0,395881000.0


In [53]:
pivot.shape

(5, 11)
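
`pivot_table` aggregates with the mean by default, which is harmless here because each country and year pair occurs once. When that uniqueness is expected, `DataFrame.pivot` is a possible alternative: it reshapes without aggregating and raises an error on duplicate pairs. The sketch below uses a reduced two-country, two-year frame with values taken from the table above; the names `long` and `wide` are assumptions.

```python
import pandas as pd

long = pd.DataFrame({
    'country': ['China', 'China', 'India', 'India'],
    'time': [2007, 2017, 2007, 2017],
    'cell_phones_total': [547306000.0, 1474097000.0,
                          233620000.0, 1168902277.0],
})

# One row per country, one column per year, no aggregation step.
wide = long.pivot(index='country', columns='time', values='cell_phones_total')
print(wide.shape)
```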

#### OPTIONAL BONUS QUESTION!

Sort these 5 countries, by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?

In [0]:
pivot_ts = pivot.T
pivot_ts

In [68]:
# India added the most cell phones from 2007 to 2017: 935,282,277
pivot['phones_added_from_2007-2017'] = pivot[2017] - pivot[2007]
pivot.sort_values(by = 'phones_added_from_2007-2017', ascending=False)

country,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,phones added from 2007-2017,phones_added_from_2007-2017
India,233620000.0,346890000.0,525090000.0,752190000.0,893862478.0,864720917.0,886304245.0,944008677.0,1001056000.0,1127809000.0,1168902277.0,935282277.0,935282277.0
China,547306000.0,641245000.0,747214000.0,859003000.0,986253000.0,1112155000.0,1229113000.0,1286093000.0,1291984200.0,1364934000.0,1474097000.0,926791000.0,926791000.0
Indonesia,93386881.0,140578243.0,163676961.0,211290235.0,249805619.0,281963665.0,313226914.0,325582819.0,338948340.0,385573398.0,458923202.0,365536321.0,365536321.0
United States,249300000.0,261300000.0,274283000.0,285118000.0,297404000.0,304838000.0,310698000.0,355500000.0,382307000.0,395881000.0,395881000.0,146581000.0,146581000.0
Brazil,120980103.0,150641403.0,169385584.0,196929978.0,234357507.0,248323703.0,271099799.0,280728796.0,257814274.0,244067356.0,236488548.0,115508445.0,115508445.0
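
Adding the difference as a new column works, but it mixes a derived value in with the year columns. A sketch of an alternative keeps the increase in its own Series and asks for the largest entry with `idxmax`. The 2007 and 2017 values are copied from the table above; the names `wide` and `increase` are assumptions.

```python
import pandas as pd

wide = pd.DataFrame(
    {2007: [120980103.0, 547306000.0, 233620000.0, 93386881.0, 249300000.0],
     2017: [236488548.0, 1474097000.0, 1168902277.0, 458923202.0, 395881000.0]},
    index=pd.Index(['Brazil', 'China', 'India', 'Indonesia', 'United States'],
                   name='country'),
)

# Increase per country, largest first; the year columns stay untouched.
increase = (wide[2017] - wide[2007]).sort_values(ascending=False)
print(increase.idxmax(), int(increase.max()))
```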


If you have the time and curiosity, what other questions can you ask and answer with this data?

In [0]:
pivot['2008_growth_rate_in_phones'] = round((pivot[2008] / pivot[2007]) - 1, 3)
pivot

In [72]:
for year in range(2008, 2018):
    pivot[f'{year}_growth_rate_in_phones'] = round((pivot[year] / pivot[year - 1]) - 1, 3)
pivot

country,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,...,2008_growth_rate_in_phones,2009_growth_rate_in_phones,2010_growth_rate_in_phones,2011_growth_rate_in_phones,2012_growth_rate_in_phones,2013_growth_rate_in_phones,2014_growth_rate_in_phones,2015_growth_rate_in_phones,2016_growth_rate_in_phones,2017_growth_rate_in_phones
Brazil,120980103.0,150641403.0,169385584.0,196929978.0,234357507.0,248323703.0,271099799.0,280728796.0,257814274.0,244067356.0,...,0.245,0.124,0.163,0.19,0.06,0.092,0.036,-0.082,-0.053,-0.031
China,547306000.0,641245000.0,747214000.0,859003000.0,986253000.0,1112155000.0,1229113000.0,1286093000.0,1291984200.0,1364934000.0,...,0.172,0.165,0.15,0.148,0.128,0.105,0.046,0.005,0.056,0.08
India,233620000.0,346890000.0,525090000.0,752190000.0,893862478.0,864720917.0,886304245.0,944008677.0,1001056000.0,1127809000.0,...,0.485,0.514,0.432,0.188,-0.033,0.025,0.065,0.06,0.127,0.036
Indonesia,93386881.0,140578243.0,163676961.0,211290235.0,249805619.0,281963665.0,313226914.0,325582819.0,338948340.0,385573398.0,...,0.505,0.164,0.291,0.182,0.129,0.111,0.039,0.041,0.138,0.19
United States,249300000.0,261300000.0,274283000.0,285118000.0,297404000.0,304838000.0,310698000.0,355500000.0,382307000.0,395881000.0,...,0.048,0.05,0.04,0.043,0.025,0.019,0.144,0.075,0.036,0.0
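
The year-over-year growth rates computed in the loop above can also be produced in one step with `pct_change` along the columns, which returns a separate table instead of appending columns to the pivot. This is a sketch on a reduced frame (two countries, three years) with values taken from the table above; the names `wide` and `growth` are assumptions.

```python
import pandas as pd

wide = pd.DataFrame(
    {2007: [120980103.0, 249300000.0],
     2008: [150641403.0, 261300000.0],
     2009: [169385584.0, 274283000.0]},
    index=pd.Index(['Brazil', 'United States'], name='country'),
)

# Each cell becomes (this year / previous year) - 1; the first year has no
# predecessor, so its column is NaN.
growth = wide.pct_change(axis='columns').round(3)
print(growth)
```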


In [0]:
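# Drop the duplicate column (spaces instead of underscores), apparently left over
# from an earlier run; errors='ignore' keeps this cell rerunnable if it is absent.
pivot = pivot.drop(labels='phones added from 2007-2017', axis='columns', errors='ignore')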
pivot.columns

In [76]:
pivot2 = final[(final.time >= 2007) & (final.country.isin(countries))].pivot_table(index='country', values='cell_phones_per_person', columns='time')
pivot2

country,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Brazil,0.6333153580042348,0.7806102237150339,0.869107562373934,1.000679428531239,1.1795330092774006,1.2381456217733038,1.3393687626918995,1.3746853195773652,1.2517558521007175,1.1753623336716303,1.1299655683535224
China,0.409414865975522,0.4769694564014336,0.5526452439590929,0.6317336105130495,0.7212191838989495,0.8087231797896388,0.8888624772913624,0.925173289187736,0.9248087286588194,0.9725213003418064,1.0458168186766978
India,0.198036547735587,0.2897639364571018,0.4324326080022529,0.6110493897259677,0.7166746768185286,0.6846206123225949,0.6932038505029893,0.7296069065451255,0.7647171280133154,0.8517092569576913,0.8728491809526382
Indonesia,0.4008207446886977,0.5952687752989215,0.6838666086394296,0.8712132730812926,1.0166788063715315,1.1329154749967243,1.2428048309037325,1.2761392028716716,1.3129282839422685,1.4766395061654258,1.738402230172827
United States,0.8293546295279024,0.8613129084629373,0.8961260458264333,0.9237840688710478,0.9561250192584748,0.9728807457559626,0.9846652501340288,1.118914031833164,1.1949739048796058,1.228758722948959,1.2201246847283354


In [77]:
for year in range(2008, 2018):
    pivot2[f'{year}_growth_rate_in_phones_per_person'] = round((pivot2[year] / pivot2[year - 1]) - 1, 3)
pivot2

country,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,...,2008_growth_rate_in_phones_per_person,2009_growth_rate_in_phones_per_person,2010_growth_rate_in_phones_per_person,2011_growth_rate_in_phones_per_person,2012_growth_rate_in_phones_per_person,2013_growth_rate_in_phones_per_person,2014_growth_rate_in_phones_per_person,2015_growth_rate_in_phones_per_person,2016_growth_rate_in_phones_per_person,2017_growth_rate_in_phones_per_person
Brazil,0.6333153580042348,0.7806102237150339,0.869107562373934,1.000679428531239,1.1795330092774006,1.2381456217733038,1.3393687626918995,1.3746853195773652,1.2517558521007175,1.1753623336716303,...,0.233,0.113,0.151,0.179,0.05,0.082,0.026,-0.089,-0.061,-0.039
China,0.409414865975522,0.4769694564014336,0.5526452439590929,0.6317336105130495,0.7212191838989495,0.8087231797896388,0.8888624772913624,0.925173289187736,0.9248087286588194,0.9725213003418064,...,0.165,0.159,0.143,0.142,0.121,0.099,0.041,-0.0,0.052,0.075
India,0.198036547735587,0.2897639364571018,0.4324326080022529,0.6110493897259677,0.7166746768185286,0.6846206123225949,0.6932038505029893,0.7296069065451255,0.7647171280133154,0.8517092569576913,...,0.463,0.492,0.413,0.173,-0.045,0.013,0.053,0.048,0.114,0.025
Indonesia,0.4008207446886977,0.5952687752989215,0.6838666086394296,0.8712132730812926,1.0166788063715315,1.1329154749967243,1.2428048309037325,1.2761392028716716,1.3129282839422685,1.4766395061654258,...,0.485,0.149,0.274,0.167,0.114,0.097,0.027,0.029,0.125,0.177
United States,0.8293546295279024,0.8613129084629373,0.8961260458264333,0.9237840688710478,0.9561250192584748,0.9728807457559626,0.9846652501340288,1.118914031833164,1.1949739048796058,1.228758722948959,...,0.039,0.04,0.031,0.035,0.018,0.012,0.136,0.068,0.028,-0.007
