<a href="https://colab.research.google.com/github/DylanGraves/DS-Unit-1-Sprint-2-Data-Wrangling/blob/master/DS_Unit_1_Sprint_Challenge_2_Data_Wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling

In this Sprint Challenge you will use data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the Sprint Challenge!
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

## Part 0. Load data

You don't need to add or change anything here. Just run this cell and it loads the data for you, into three dataframes.

In [0]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

## Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)
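
As a minimal sketch of what an inner join on two keys does, the example below uses small made-up frames (`phones_demo` and `population_demo` are hypothetical stand-ins, not the Gapminder data). Only `(geo, time)` pairs that appear in both frames survive the join.

```python
import pandas as pd

# Toy stand-ins for cell_phones and population (all values are made up).
phones_demo = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb'],
    'time': [2000, 2001, 2000],
    'cell_phones_total': [10.0, 20.0, 5.0],
})
population_demo = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb', 'bbb'],
    'time': [1999, 2000, 2000, 2001],
    'population_total': [100, 110, 50, 55],
})

# An inner join keeps only the (geo, time) pairs present in both frames.
joined_demo = pd.merge(phones_demo, population_demo,
                       how='inner', on=['geo', 'time'])

print(joined_demo.shape)
```

Checking `.shape` after the merge, as this challenge asks, is a quick way to confirm that no rows were unexpectedly dropped or duplicated.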

In [2]:
cell_phones.shape

(9215, 3)

In [3]:
population.shape

(59297, 3)

In [4]:
cell_phones.head()

Unnamed: 0,geo,time,cell_phones_total
0,abw,1960,0.0
1,abw,1965,0.0
2,abw,1970,0.0
3,abw,1975,0.0
4,abw,1976,0.0


In [5]:
population.head()

Unnamed: 0,geo,time,population_total
0,afg,1800,3280000
1,afg,1801,3280000
2,afg,1802,3280000
3,afg,1803,3280000
4,afg,1804,3280000


In [6]:
new_data = pd.merge(cell_phones, population, how='inner', on=['geo', 'time'])

new_data.shape

# The shape matches the expected (8590, 4), so the join worked.

(8590, 4)

In [7]:
new_data.head()

Unnamed: 0,geo,time,cell_phones_total,population_total
0,afg,1960,0.0,8996351
1,afg,1965,0.0,9938414
2,afg,1970,0.0,11126123
3,afg,1975,0.0,12590286
4,afg,1976,0.0,12840299


In [9]:
new_data.sort_values(by=['time'])

Unnamed: 0,geo,time,cell_phones_total,population_total
0,afg,1960,0.0,8996351
4083,kgz,1960,0.0,2170093
4129,khm,1960,0.0,5722370
535,bel,1960,0.0,9167365
8407,yem,1960,0.0,5172135
7165,svk,1960,0.0,4140129
4175,kir,1960,0.0,41233
4221,kna,1960,0.0,51195
4037,ken,1960,0.0,8105440
4262,kor,1960,0.0,25340918


Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)
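
A small sketch of this second step, again with made-up data: select just the two needed columns from a lookup table, then merge on `geo`. The frames `merged_demo` and `codes_demo` are hypothetical. The `validate='many_to_one'` argument is optional; it asks pandas to raise an error if a `geo` code appears more than once in the lookup table.

```python
import pandas as pd

# Toy result of the first join (values are made up).
merged_demo = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb'],
    'time': [2000, 2001, 2000],
    'cell_phones_total': [10.0, 20.0, 5.0],
    'population_total': [110, 115, 50],
})

# Toy lookup table with an extra column that is not needed.
codes_demo = pd.DataFrame({
    'geo': ['aaa', 'bbb', 'ccc'],
    'country': ['Aland', 'Bland', 'Cland'],
    'world_4region': ['x', 'y', 'z'],
})

# Keep only geo and country, then attach the country name to each row.
with_names_demo = pd.merge(merged_demo, codes_demo[['geo', 'country']],
                           how='inner', on='geo', validate='many_to_one')

print(with_names_demo.shape)
```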

In [10]:
geo_country_codes.head()

Unnamed: 0,geo,alt_5,alternative_1,alternative_2,alternative_3,alternative_4_cdiac,arb1,arb2,arb3,arb4,...,latitude,longitude,main_religion_2008,country,pandg,un_state,unicode_region_subtag,upper_case_name,world_4region,world_6region
0,abkh,,,,,,,,,,...,,,,Abkhazia,,False,,,europe,europe_central_asia
1,abw,,,,,Aruba,,,,,...,12.5,-69.96667,christian,Aruba,,False,AW,ARUBA,americas,america
2,afg,,Islamic Republic of Afghanistan,,,Afghanistan,,,,,...,33.0,66.0,muslim,Afghanistan,AFGHANISTAN,True,AF,AFGHANISTAN,asia,south_asia
3,ago,,,,,Angola,,,,,...,-12.5,18.5,christian,Angola,ANGOLA,True,AO,ANGOLA,africa,sub_saharan_africa
4,aia,,,,,,,,,,...,18.21667,-63.05,christian,Anguilla,,False,AI,ANGUILLA,americas,america


In [14]:
subset = geo_country_codes[['geo', 'country']]

final = pd.merge(new_data, subset)

final.shape

(8590, 5)

In [15]:
final.head()

Unnamed: 0,geo,time,cell_phones_total,population_total,country
0,afg,1960,0.0,8996351,Afghanistan
1,afg,1965,0.0,9938414,Afghanistan
2,afg,1970,0.0,11126123,Afghanistan
3,afg,1975,0.0,12590286,Afghanistan
4,afg,1976,0.0,12840299,Afghanistan


In [17]:
final.sort_values(by=['cell_phones_total'], ascending=False)

# All the data I have seen so far has had cell_phones_total as 0, so I wanted 
# to make sure that I did not mess up the data by deleting it all or something,
# so I did a quick sort to make sure there were actually non-zero values.

Unnamed: 0,geo,time,cell_phones_total,population_total,country
1496,chn,2017,1.474097e+09,1409517397,China
1495,chn,2016,1.364934e+09,1403500365,China
1494,chn,2015,1.291984e+09,1397028553,China
1493,chn,2014,1.286093e+09,1390110388,China
1492,chn,2013,1.229113e+09,1382793212,China
3595,ind,2017,1.168902e+09,1339180127,India
3594,ind,2016,1.127809e+09,1324171354,India
1491,chn,2012,1.112155e+09,1375198619,China
3593,ind,2015,1.001056e+09,1309053980,India
1490,chn,2011,9.862530e+08,1367480264,China


## Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)
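
The feature is a plain element-wise division of two columns. A minimal sketch on made-up numbers (the frame `ratio_demo` and its values are hypothetical), followed by a spot check of one country and year in the same spirit as the United States check above:

```python
import pandas as pd

# Toy data (values are made up).
ratio_demo = pd.DataFrame({
    'country': ['Aland', 'Bland'],
    'time': [2017, 2017],
    'cell_phones_total': [150.0, 30.0],
    'population_total': [100, 60],
})

# Dividing two columns works row by row.
ratio_demo['cell_phones_per_person'] = (
    ratio_demo['cell_phones_total'] / ratio_demo['population_total']
)

# Spot check a single country and year.
is_target = (ratio_demo['country'] == 'Aland') & (ratio_demo['time'] == 2017)
print(ratio_demo.loc[is_target, 'cell_phones_per_person'].round(3))
```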

In [19]:
final['cell_phones_per_person'] = (final['cell_phones_total'] / final['population_total'])

final.head()

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phones_per_person
0,afg,1960,0.0,8996351,Afghanistan,0.0
1,afg,1965,0.0,9938414,Afghanistan,0.0
2,afg,1970,0.0,11126123,Afghanistan,0.0
3,afg,1975,0.0,12590286,Afghanistan,0.0
4,afg,1976,0.0,12840299,Afghanistan,0.0


In [22]:
final.tail()

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phones_per_person
8585,zwe,2013,13633167.0,15054506,Zimbabwe,0.905587
8586,zwe,2014,11798652.0,15411675,Zimbabwe,0.765566
8587,zwe,2015,12757410.0,15777451,Zimbabwe,0.808585
8588,zwe,2016,12878926.0,16150362,Zimbabwe,0.797439
8589,zwe,2017,14092104.0,16529904,Zimbabwe,0.852522


In [23]:
final.sort_values(by=['cell_phones_per_person'], ascending=False)

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phones_per_person
3319,hkg,2017,18340347.0,7364883,"Hong Kong, China",2.490243
3318,hkg,2016,17584969.0,7302843,"Hong Kong, China",2.407962
3315,hkg,2013,16973133.0,7148571,"Hong Kong, China",2.374339
3316,hkg,2014,16959455.0,7194563,"Hong Kong, China",2.357260
3317,hkg,2015,16724440.0,7245701,"Hong Kong, China",2.308188
3314,hkg,2012,16387536.0,7106399,"Hong Kong, China",2.306025
3313,hkg,2011,15292924.0,7065815,"Hong Kong, China",2.164354
218,are,2016,19905093.0,9269612,United Arab Emirates,2.147349
219,are,2017,19826224.0,9400145,United Arab Emirates,2.109140
809,bhr,2016,2994865.0,1425171,Bahrain,2.101407


In [24]:
final.loc[final['country'] == 'United States']

# The 2017 row at the bottom of the full output shows about 1.220 cell phones
# per person, which matches the expected value.

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phones_per_person
8092,usa,1960,0.0,186808228,United States,0.0
8093,usa,1965,0.0,199815540,United States,0.0
8094,usa,1970,0.0,209588150,United States,0.0
8095,usa,1975,0.0,219205296,United States,0.0
8096,usa,1976,0.0,221239215,United States,0.0
8097,usa,1977,0.0,223324042,United States,0.0
8098,usa,1978,0.0,225449657,United States,0.0
8099,usa,1979,0.0,227599878,United States,0.0
8100,usa,1980,0.0,229763052,United States,0.0
8101,usa,1984,91600.0,238573861,United States,0.000384


Modify the `geo` column to make the geo codes uppercase instead of lowercase.
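
One way to do this is the vectorized string accessor `.str.upper()`, sketched here on a made-up column (`geo_demo` is a hypothetical name). Note that Python's `str.title()` is a different operation: it capitalizes only the first letter of each word, so `'afg'` becomes `'Afg'` rather than `'AFG'`.

```python
import pandas as pd

geo_demo = pd.DataFrame({'geo': ['afg', 'usa', 'zwe']})

# .str.upper() uppercases every character of every value in the column.
geo_demo['geo'] = geo_demo['geo'].str.upper()

print(geo_demo['geo'].tolist())
```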

In [25]:
final['geo'].isna().sum()

# No null values, so that's nice.

0

In [0]:
def capitalize(string):
  return string.title()

In [0]:
final['geo'] = final['geo'].apply(capitalize)

In [28]:
final.head()

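# Note: .title() capitalizes only the first letter ('Afg'). The task asks for
# fully uppercase codes ('AFG'), which string.upper() would produce instead.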

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phones_per_person
0,Afg,1960,0.0,8996351,Afghanistan,0.0
1,Afg,1965,0.0,9938414,Afghanistan,0.0
2,Afg,1970,0.0,11126123,Afghanistan,0.0
3,Afg,1975,0.0,12590286,Afghanistan,0.0
4,Afg,1976,0.0,12840299,Afghanistan,0.0


## Part 3. Process data

Use the `describe` function to summarize your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)
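
A minimal sketch of the two calls on a made-up frame (`describe_demo` is hypothetical). Passing `include=['number']` selects the numeric columns, and `exclude=['number']` selects everything else, so the two summaries together cover every column.

```python
import pandas as pd

# Toy data (values are made up).
describe_demo = pd.DataFrame({
    'geo': ['AAA', 'AAA', 'BBB'],
    'country': ['Aland', 'Aland', 'Bland'],
    'time': [2000, 2001, 2000],
    'cell_phones_total': [10.0, 20.0, 5.0],
})

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = describe_demo.describe(include=['number'])

# Non-numeric columns: count, unique, top, freq.
text_summary = describe_demo.describe(exclude=['number'])

print(numeric_summary)
print(text_summary)
```

The `min` and `max` rows of the numeric summary give the time range, and the `unique` row of the non-numeric summary gives the number of distinct countries.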

In [30]:
import numpy as np

final.describe(include = [np.number])

# All numeric columns.

Unnamed: 0,time,cell_phones_total,population_total,cell_phones_per_person
count,8590.0,8590.0,8590.0,8590.0
mean,1994.193481,9004950.0,29838230.0,0.279639
std,14.257975,55734080.0,116128400.0,0.454247
min,1960.0,0.0,4433.0,0.0
25%,1983.0,0.0,1456148.0,0.0
50%,1995.0,6200.0,5725062.0,0.001564
75%,2006.0,1697652.0,18105810.0,0.461149
max,2017.0,1474097000.0,1409517000.0,2.490243


In [31]:
final.describe()

# Although the default provides the same thing.

Unnamed: 0,time,cell_phones_total,population_total,cell_phones_per_person
count,8590.0,8590.0,8590.0,8590.0
mean,1994.193481,9004950.0,29838230.0,0.279639
std,14.257975,55734080.0,116128400.0,0.454247
min,1960.0,0.0,4433.0,0.0
25%,1983.0,0.0,1456148.0,0.0
50%,1995.0,6200.0,5725062.0,0.001564
75%,2006.0,1697652.0,18105810.0,0.461149
max,2017.0,1474097000.0,1409517000.0,2.490243


In [32]:
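# np.object was removed in NumPy 1.24; excluding the numeric dtypes selects
# the same non-numeric columns.
final.describe(exclude=[np.number])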

# All non-numeric columns.

Unnamed: 0,geo,country
count,8590,8590
unique,195,195
top,Nzl,United Kingdom
freq,46,46


In [33]:
final.describe(include='all')

# Making sure I didn't miss any columns.

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phones_per_person
count,8590,8590.0,8590.0,8590.0,8590,8590.0
unique,195,,,,195,
top,Nzl,,,,United Kingdom,
freq,46,,,,46,
mean,,1994.193481,9004950.0,29838230.0,,0.279639
std,,14.257975,55734080.0,116128400.0,,0.454247
min,,1960.0,0.0,4433.0,,0.0
25%,,1983.0,0.0,1456148.0,,0.0
50%,,1995.0,6200.0,5725062.0,,0.001564
75%,,2006.0,1697652.0,18105810.0,,0.461149


In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |
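
A sketch of one possible approach, on made-up data: filter to a single year first, then take the largest values with `nlargest`. The frame `top_demo_source` and its numbers are hypothetical. The extra 2016 row shows why the year filter has to come first.

```python
import pandas as pd

# Toy data (values are made up). The 2016 row has the largest value overall.
top_demo_source = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E', 'F', 'A'],
    'time': [2017, 2017, 2017, 2017, 2017, 2017, 2016],
    'cell_phones_total': [60.0, 50.0, 40.0, 30.0, 20.0, 10.0, 99.0],
})

# Restrict to one year, then keep the five largest totals.
only_2017 = top_demo_source[top_demo_source['time'] == 2017]
top_demo = only_2017.nlargest(5, 'cell_phones_total')[
    ['country', 'cell_phones_total']
]

print(top_demo)
```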



In [0]:
# This optional code formats float numbers with comma separators
pd.options.display.float_format = '{:,}'.format

In [35]:
final.sort_values(by=['cell_phones_total'], ascending=False)

# Same code that I did before.

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phones_per_person
1496,Chn,2017,1474097000.0,1409517397,China,1.0458168186766978
1495,Chn,2016,1364934000.0,1403500365,China,0.9725213003418065
1494,Chn,2015,1291984200.0,1397028553,China,0.9248087286588194
1493,Chn,2014,1286093000.0,1390110388,China,0.9251732891877361
1492,Chn,2013,1229113000.0,1382793212,China,0.8888624772913624
3595,Ind,2017,1168902277.0,1339180127,India,0.8728491809526382
3594,Ind,2016,1127809000.0,1324171354,India,0.8517092569576913
1491,Chn,2012,1112155000.0,1375198619,China,0.8087231797896388
3593,Ind,2015,1001056000.0,1309053980,India,0.7647171280133154
1490,Chn,2011,986253000.0,1367480264,China,0.7212191838989495


In [40]:
country_phone_subset = final[['country', 'cell_phones_total', 'time']]

country_phone_subset.sort_values(by=['cell_phones_total'], ascending=False)

# Same data, narrowed to the country, the total number of cell phones, and the year so it is easier to read.

Unnamed: 0,country,cell_phones_total,time
1496,China,1474097000.0,2017
1495,China,1364934000.0,2016
1494,China,1291984200.0,2015
1493,China,1286093000.0,2014
1492,China,1229113000.0,2013
3595,India,1168902277.0,2017
3594,India,1127809000.0,2016
1491,China,1112155000.0,2012
3593,India,1001056000.0,2015
1490,China,986253000.0,2011


2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?
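
One way to answer this kind of question without reading the year off a sorted table is to keep only the rows where phones exceed people and take the earliest year per country. A minimal sketch on made-up data (`crossover_demo` and its values are hypothetical):

```python
import pandas as pd

# Toy data (values are made up).
crossover_demo = pd.DataFrame({
    'country': ['Aland', 'Aland', 'Aland', 'Aland', 'Bland', 'Bland'],
    'time': [2012, 2013, 2014, 2015, 2014, 2015],
    'cell_phones_total': [90.0, 99.0, 105.0, 120.0, 10.0, 30.0],
    'population_total': [100, 101, 102, 103, 20, 21],
})

# Keep only rows with more cell phones than people.
more_phones = crossover_demo[
    crossover_demo['cell_phones_total'] > crossover_demo['population_total']
]

# The earliest such year for each country.
first_year = more_phones.groupby('country')['time'].min()

print(first_year)
```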

In [42]:
final.loc[final['country'] == 'United States'].sort_values(by=['cell_phones_total'], ascending=False)

# Doing a quick sort you can see that 2014 was the first year that the USA had 
# more cell phones than people, but I think I can do better.

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phones_per_person
8134,Usa,2017,395881000.0,324459463,United States,1.2201246847283354
8133,Usa,2016,395881000.0,322179605,United States,1.228758722948959
8132,Usa,2015,382307000.0,319929162,United States,1.1949739048796058
8131,Usa,2014,355500000.0,317718779,United States,1.118914031833164
8130,Usa,2013,310698000.0,315536676,United States,0.9846652501340288
8129,Usa,2012,304838000.0,313335423,United States,0.9728807457559626
8128,Usa,2011,297404000.0,311051373,United States,0.9561250192584748
8127,Usa,2010,285118000.0,308641391,United States,0.9237840688710478
8126,Usa,2009,274283000.0,306076362,United States,0.8961260458264333
8125,Usa,2008,261300000.0,303374067,United States,0.8613129084629373


In [47]:
condition = ((final.country=='United States') & (final.cell_phones_total > final.population_total))

columns = ['country',
          'cell_phones_total',
          'population_total',
          'time']

subset2 = final.loc[condition, columns]

subset2.head()

# A subset of all the years in which there were more cell phones than people in the United States.

Unnamed: 0,country,cell_phones_total,population_total,time
8131,United States,355500000.0,317718779,2014
8132,United States,382307000.0,319929162,2015
8133,United States,395881000.0,322179605,2016
8134,United States,395881000.0,324459463,2017


## Part 4. Reshape data

Create a pivot table:
- Columns: Years 2007—2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)
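
A sketch of one possible approach, on made-up data: build a boolean row mask with `isin` (for the countries) and `between` (for the years), then pivot only the filtered rows. The names `pivot_source_demo` and `wanted` are hypothetical, and the toy example uses two countries and two years instead of five and eleven.

```python
import pandas as pd

# Toy data (values are made up): three countries, three years.
pivot_source_demo = pd.DataFrame({
    'country': ['Aland'] * 3 + ['Bland'] * 3 + ['Cland'] * 3,
    'time': [2006, 2007, 2008] * 3,
    'cell_phones_total': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})

wanted = ['Aland', 'Bland']

# isin() tests membership in a list; between() is inclusive on both ends.
rows = (pivot_source_demo['country'].isin(wanted)
        & pivot_source_demo['time'].between(2007, 2008))

pivot_demo = pivot_source_demo[rows].pivot_table(
    index='country', columns='time', values='cell_phones_total'
)

print(pivot_demo.shape)
```

Filtering before pivoting keeps the result at exactly the requested rows and columns, so the shape check is meaningful.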

In [54]:
# pivot_subset = final[final.country != ['United States', 'Brazil', 'Indonesia', 'China', 'India']]

# pivot_subset = final.ix[~(final['country'] == ['United States', 'Brazil', 'Indonesia', 'China', 'India'])]

# pivot_subset = ~final.country.isin(['United States', 'Brazil', 'Indonesia', 'China', 'India'])

# pivot_subset.head()

0    True
1    True
2    True
3    True
4    True
Name: country, dtype: bool

In [0]:
# print(final[pivot_subset])

In [0]:
# rows = final['country'] != ['United States', 'Brazil', 'Indonesia', 'China', 'India']

# final.set_index('country').drop(rows).reset_index()

# final.head()

In [0]:
# df = df[df.line_race != 0]

final_pivot = final.pivot(index='country', columns='time', values='cell_phones_total')

In [51]:
final_pivot.shape

# The pivot table is built, but it covers all 195 countries and all 46 years.
# The rows still need to be filtered to reach the expected (5, 11) shape.

(195, 46)

In [57]:
final_pivot.head()

country,1960,1965,1970,1975,1976,1977,1978,1979,1980,1981,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7898909.0,10500000.0,10215840.0,13797879.0,15340115.0,16807156.0,18407168.0,19709038.0,21602982.0,23929713.0
Albania,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1859632.0,2463741.0,2692372.0,3100000.0,3500000.0,3685983.0,3359654.0,3400955.0,3369756.0,3497950.0
Algeria,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,27031472.0,32729824.0,32780165.0,35615926.0,37527703.0,39517045.0,43298174.0,43227643.0,47041321.0,49873389.0
Andorra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,64202.0,64549.0,65495.0,65044.0,63865.0,63931.0,66241.0,71336.0,76132.0,80337.0
Angola,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6773356.0,8109421.0,9403365.0,12073218.0,12785109.0,13285198.0,14052558.0,13884532.0,13001124.0,13323952.0


In [97]:
# final_pivot.drop('1960')

final_pivot.head()

country,1960,1965,1970,1975,1976,1977,1978,1979,1980,1981,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7898909.0,10500000.0,10215840.0,13797879.0,15340115.0,16807156.0,18407168.0,19709038.0,21602982.0,23929713.0
Albania,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1859632.0,2463741.0,2692372.0,3100000.0,3500000.0,3685983.0,3359654.0,3400955.0,3369756.0,3497950.0
Algeria,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,27031472.0,32729824.0,32780165.0,35615926.0,37527703.0,39517045.0,43298174.0,43227643.0,47041321.0,49873389.0
Andorra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,64202.0,64549.0,65495.0,65044.0,63865.0,63931.0,66241.0,71336.0,76132.0,80337.0
Angola,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6773356.0,8109421.0,9403365.0,12073218.0,12785109.0,13285198.0,14052558.0,13884532.0,13001124.0,13323952.0


In [93]:
# final['country'] = final.country['United States', 'Brazil', 'Indonesia', 'China', 'India']

# final.head()

ValueError: ignored

In [94]:
countries = ['United States', 'Brazil', 'India', 'China', 'Indonesia']

# Python's `or` cannot combine boolean Series (it raises a ValueError).
# Use isin() for the country test and & for the element-wise AND.
condition = final.country.isin(countries) & (final.time > 2006)

# columns = ['country',
#           'cell_phones_total',
#           'population_total',
#           'time']

# subset3 = final.loc[condition, columns]

# subset3.head()

In [0]:
# ['2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017']

In [0]:
# isin() tests each row's country against the list and returns a boolean Series.
condition = final.country.isin(['United States', 'Brazil', 'Indonesia', 'China', 'India'])

In [0]:
condition2 = final.time > 2006

In [82]:
# pivoted = pd.merge(condition, condition2, how='inner')

# pivoted.head()


# columns = ['user_id', 
#            'order_id', 
#            'order_number', 
#            'order_dow', 
#            'order_hour_of_day']

# Both conditions are boolean row masks, so combine them with & and pass the
# result as the row indexer.
subset3 = final.loc[condition & condition2]

In [66]:
# placeholder = (final['time'] != ['2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017'])

# df = final.drop(final[final.time < 2007].index)

# df.head(15)

# final_pivot.drop(columns=[placeholder])

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phones_per_person
35,Afg,2007,4668096.0,26616792,Afghanistan,0.1753816162368477
36,Afg,2008,7898909.0,27294031,Afghanistan,0.2894006019118246
37,Afg,2009,10500000.0,28004331,Afghanistan,0.3749420045063744
38,Afg,2010,10215840.0,28803167,Afghanistan,0.3546776644387751
39,Afg,2011,13797879.0,29708599,Afghanistan,0.464440581664588
40,Afg,2012,15340115.0,30696958,Afghanistan,0.4997275300047646
41,Afg,2013,16807156.0,31731688,Afghanistan,0.5296647313562393
42,Afg,2014,18407168.0,32758020,Afghanistan,0.5619133268738464
43,Afg,2015,19709038.0,33736494,Afghanistan,0.5842052822679203
44,Afg,2016,21602982.0,34656032,Afghanistan,0.6233541681863636


In [69]:
# df = df.drop(df[df.country] != ['United States', 'Brazil', 'Indonesia', 'China', 'India'], axis=1)

KeyError: ignored

#### OPTIONAL BONUS QUESTION!

Sort these 5 countries, by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?
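
A sketch of one way to approach the bonus, assuming a pivot table with countries as rows and years as columns: subtract the first year's column from the last and sort the result. The table `bonus_pivot_demo` and its values are made up.

```python
import pandas as pd

# Toy pivot table (values are made up): rows are countries, columns are years.
bonus_pivot_demo = pd.DataFrame(
    {2007: [10.0, 40.0, 5.0], 2017: [25.0, 45.0, 50.0]},
    index=pd.Index(['Aland', 'Bland', 'Cland'], name='country'),
)

# Column subtraction is aligned by country; sort from biggest increase down.
increase = (bonus_pivot_demo[2017] - bonus_pivot_demo[2007]).sort_values(
    ascending=False
)

print(increase)
```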

If you have the time and curiosity, what other questions can you ask and answer with this data?