<a href="https://colab.research.google.com/github/coding-ss/DS-Unit-1-Sprint-2-Data-Wrangling/blob/master/DS_Unit_1_Sprint_Challenge_2_Data_Wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling

In this Sprint Challenge you will use data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the Sprint Challenge!
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html]) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

## Part 0. Load data

You don't need to add or change anything here. Just run this cell and it loads the data for you, into three dataframes.

In [0]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

## Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)

In [2]:
cell_phones.head()

Unnamed: 0,geo,time,cell_phones_total
0,abw,1960,0.0
1,abw,1965,0.0
2,abw,1970,0.0
3,abw,1975,0.0
4,abw,1976,0.0


In [3]:
population.head()

Unnamed: 0,geo,time,population_total
0,afg,1800,3280000
1,afg,1801,3280000
2,afg,1802,3280000
3,afg,1803,3280000
4,afg,1804,3280000


In [0]:
merged = pd.merge(population,cell_phones,how='inner', on=('geo','time'))

In [7]:
merged.shape

(8590, 4)

Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)

In [8]:
geo_country_codes.head()

Unnamed: 0,geo,alt_5,alternative_1,alternative_2,alternative_3,alternative_4_cdiac,arb1,arb2,arb3,arb4,...,latitude,longitude,main_religion_2008,country,pandg,un_state,unicode_region_subtag,upper_case_name,world_4region,world_6region
0,abkh,,,,,,,,,,...,,,,Abkhazia,,False,,,europe,europe_central_asia
1,abw,,,,,Aruba,,,,,...,12.5,-69.96667,christian,Aruba,,False,AW,ARUBA,americas,america
2,afg,,Islamic Republic of Afghanistan,,,Afghanistan,,,,,...,33.0,66.0,muslim,Afghanistan,AFGHANISTAN,True,AF,AFGHANISTAN,asia,south_asia
3,ago,,,,,Angola,,,,,...,-12.5,18.5,christian,Angola,ANGOLA,True,AO,ANGOLA,africa,sub_saharan_africa
4,aia,,,,,,,,,,...,18.21667,-63.05,christian,Anguilla,,False,AI,ANGUILLA,americas,america


In [0]:
final = pd.merge(merged,geo_country_codes[['geo','country']])

In [13]:
final.shape

(8590, 5)

In [18]:
final

Unnamed: 0,geo,time,population_total,cell_phones_total,country
0,afg,1960,8996351,0.0,Afghanistan
1,afg,1965,9938414,0.0,Afghanistan
2,afg,1970,11126123,0.0,Afghanistan
3,afg,1975,12590286,0.0,Afghanistan
4,afg,1976,12840299,0.0,Afghanistan
5,afg,1977,13067538,0.0,Afghanistan
6,afg,1978,13237734,0.0,Afghanistan
7,afg,1979,13306695,0.0,Afghanistan
8,afg,1980,13248370,0.0,Afghanistan
9,afg,1981,13053954,0.0,Afghanistan


In [31]:
final['country'].value_counts().head(500)

Togo                    46
Panama                  46
Rwanda                  46
Hong Kong, China        46
Sri Lanka               46
Malawi                  46
Azerbaijan              46
Uganda                  46
Georgia                 46
Philippines             46
Finland                 46
Fiji                    46
Bulgaria                46
Slovak Republic         46
Niger                   46
Trinidad and Tobago     46
Germany                 46
Honduras                46
Cyprus                  46
Cape Verde              46
Zimbabwe                46
Hungary                 46
Switzerland             46
Seychelles              46
Guyana                  46
Syria                   46
China                   46
Angola                  46
France                  46
Mozambique              46
                        ..
North Korea             45
Dominica                45
South Korea             44
Tuvalu                  44
Guinea                  44
Tunisia                 44
S

## Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)

In [17]:
cell_phones_per_person = len('cell_phones_total')/len('population_total')
cell_phones_per_person

1.0625

In [46]:
final.loc[(final['country'] == 'United States') & (final['time'] == 2017)]


Unnamed: 0,geo,time,population_total,cell_phones_total,country
8134,USA,2017,324459463,395881000.0,United States


In [47]:
395881000.0/324459463

1.2201246847283354

Modify the `geo` column to make the geo codes uppercase instead of lowercase.

In [26]:
final['geo'] = final['geo'].str.upper()
final['geo'].head()

0    AFG
1    AFG
2    AFG
3    AFG
4    AFG
Name: geo, dtype: object

## Part 3. Process data

Use the describe function, to describe your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)

In [32]:
final.describe()

Unnamed: 0,time,population_total,cell_phones_total
count,8590.0,8590.0,8590.0
mean,1994.193481,29838230.0,9004950.0
std,14.257975,116128400.0,55734080.0
min,1960.0,4433.0,0.0
25%,1983.0,1456148.0,0.0
50%,1995.0,5725062.0,6200.0
75%,2006.0,18105810.0,1697652.0
max,2017.0,1409517000.0,1474097000.0


In [36]:
# To show all numeric columns

import numpy as np 

final.describe(include=[np.number])

Unnamed: 0,time,population_total,cell_phones_total
count,8590.0,8590.0,8590.0
mean,1994.193481,29838230.0,9004950.0
std,14.257975,116128400.0,55734080.0
min,1960.0,4433.0,0.0
25%,1983.0,1456148.0,0.0
50%,1995.0,5725062.0,6200.0
75%,2006.0,18105810.0,1697652.0
max,2017.0,1409517000.0,1474097000.0


In [37]:
# To show all non-numeric columns

final.describe(include=[np.object])

Unnamed: 0,geo,country
count,8590,8590
unique,195,195
top,ZMB,Togo
freq,46,46


In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |



In [152]:
f2 = final.sort_values(by='time').tail(168)

Unnamed: 0,geo,time,population_total,cell_phones_total,country
8543,ZMB,2017,17094130,13438539.0,Zambia
311,ARM,2017,2930450,3488524.0,Armenia
8497,ZAF,2017,56717156,91878275.0,South Africa
219,ARE,2017,9400145,19826224.0,United Arab Emirates
183,AND,2017,76965,80337.0,Andorra
137,ALB,2017,2930187,3497950.0,Albania
91,AGO,2017,29784193,13323952.0,Angola
396,AUS,2017,24450561,27553000.0,Australia
45,AFG,2017,35530081,23929713.0,Afghanistan
265,ARG,2017,44271041,61897379.0,Argentina


In [0]:
# This optional code formats float numbers with comma separators
pd.options.display.float_format = '{:,}'.format

2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?

In [0]:

f = final.query('cell_phones_total > population_total')

In [141]:
# In 2014, US had more cell phones than people

f.tail(50)

Unnamed: 0,geo,time,population_total,cell_phones_total,country
8039,UKR,2011,45576307,55576481.0,Ukraine
8040,UKR,2012,45349333,59343693.0,Ukraine
8041,UKR,2013,45115785,62458800.0,Ukraine
8042,UKR,2014,44883426,61170229.0,Ukraine
8043,UKR,2015,44657704,60720073.0,Ukraine
8044,UKR,2016,44438625,56717856.0,Ukraine
8045,UKR,2017,44222947,55714733.0,Ukraine
8082,URY,2008,3350824,3507816.0,Uruguay
8083,URY,2009,3362755,4111560.0,Uruguay
8084,URY,2010,3374415,4437158.0,Uruguay


## Part 4. Reshape data

Create a pivot table:
- Columns: Years 2007—2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

In [0]:

time1= final.loc[(final['time'] >= 2007) & (final['time'] <= 2017)]
time1

In [81]:

final['country'].value_counts().head(100).index

Index(['Togo', 'Panama', 'Rwanda', 'Hong Kong, China', 'Sri Lanka', 'Malawi',
       'Azerbaijan', 'Uganda', 'Georgia', 'Philippines', 'Finland', 'Fiji',
       'Bulgaria', 'Slovak Republic', 'Niger', 'Trinidad and Tobago',
       'Germany', 'Honduras', 'Cyprus', 'Cape Verde', 'Zimbabwe', 'Hungary',
       'Switzerland', 'Seychelles', 'Guyana', 'Syria', 'China', 'Angola',
       'France', 'Mozambique', 'Suriname', 'Moldova', 'Mexico', 'Iceland',
       'Nepal', 'Mauritius', 'Argentina', 'Poland', 'Ireland', 'Denmark',
       'Albania', 'Australia', 'Bosnia and Herzegovina', 'Guinea-Bissau',
       'Ecuador', 'Kuwait', 'Vietnam', 'Portugal', 'Pakistan', 'Malta',
       'Bolivia', 'Chile', 'Sao Tome and Principe', 'Paraguay', 'Peru',
       'Andorra', 'Lao', 'Burundi', 'Tanzania', 'Sweden', 'Indonesia',
       'Guatemala', 'Austria', 'Thailand', 'Solomon Islands', 'Iraq',
       'Costa Rica', 'Norway', 'Canada', 'Vanuatu', 'Comoros', 'Gabon',
       'Bangladesh', 'India', 'Mauritania', '

In [0]:

final['country']=final['country'].str.strip()

In [0]:
country1 = final.groupby(['country']).groups.keys() 

In [113]:
                   
table = final.pivot_table(values = 'cell_phones_total',
                  index=['country'],
                  columns=['time'])

table.head()

time,1960,1965,1970,1975,1976,1977,1978,1979,1980,1981,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7898909.0,10500000.0,10215840.0,13797879.0,15340115.0,16807156.0,18407168.0,19709038.0,21602982.0,23929713.0
Albania,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1859632.0,2463741.0,2692372.0,3100000.0,3500000.0,3685983.0,3359654.0,3400955.0,3369756.0,3497950.0
Algeria,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,27031472.0,32729824.0,32780165.0,35615926.0,37527703.0,39517045.0,43298174.0,43227643.0,47041321.0,49873389.0
Andorra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,64202.0,64549.0,65495.0,65044.0,63865.0,63931.0,66241.0,71336.0,76132.0,80337.0
Angola,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6773356.0,8109421.0,9403365.0,12073218.0,12785109.0,13285198.0,14052558.0,13884532.0,13001124.0,13323952.0


In [118]:
table1 = table.groupby(['country'])
table1.head()

time,1960,1965,1970,1975,1976,1977,1978,1979,1980,1981,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7898909.0,10500000.0,10215840.0,13797879.0,15340115.0,16807156.0,18407168.0,19709038.0,21602982.0,23929713.0
Albania,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1859632.0,2463741.0,2692372.0,3100000.0,3500000.0,3685983.0,3359654.0,3400955.0,3369756.0,3497950.0
Algeria,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,27031472.0,32729824.0,32780165.0,35615926.0,37527703.0,39517045.0,43298174.0,43227643.0,47041321.0,49873389.0
Andorra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,64202.0,64549.0,65495.0,65044.0,63865.0,63931.0,66241.0,71336.0,76132.0,80337.0
Angola,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6773356.0,8109421.0,9403365.0,12073218.0,12785109.0,13285198.0,14052558.0,13884532.0,13001124.0,13323952.0
Antigua and Barbuda,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,136592.0,134925.0,167970.0,176008.0,127381.0,114358.0,120041.0,176000.0,180000.0,
Argentina,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,46508774.0,52482780.0,57082298.0,60722729.0,64327647.0,67361515.0,61234216.0,61842011.0,63723692.0,61897379.0
Armenia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1442000.0,2191500.0,3865354.0,3211215.0,3322837.0,3346275.0,3459137.0,3464490.0,3434567.0,3488524.0
Australia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,22120000.0,22200000.0,22500000.0,23789000.0,24338000.0,24940000.0,25060000.0,25770000.0,26551000.0,27553000.0
Austria,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10816000.0,11434000.0,12241000.0,13022578.0,13588000.0,13272000.0,12952605.0,13470623.0,14270000.0,14924340.0


In [119]:
table1.loc[(table1['country'] == 'China') & (df['country'] == 'Brazil')]

AttributeError: ignored

In [120]:
table_df = pd.DataFrame(table)
table_df

time,1960,1965,1970,1975,1976,1977,1978,1979,1980,1981,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7898909.0,10500000.0,10215840.0,13797879.0,15340115.0,16807156.0,18407168.0,19709038.0,21602982.0,23929713.0
Albania,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1859632.0,2463741.0,2692372.0,3100000.0,3500000.0,3685983.0,3359654.0,3400955.0,3369756.0,3497950.0
Algeria,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,27031472.0,32729824.0,32780165.0,35615926.0,37527703.0,39517045.0,43298174.0,43227643.0,47041321.0,49873389.0
Andorra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,64202.0,64549.0,65495.0,65044.0,63865.0,63931.0,66241.0,71336.0,76132.0,80337.0
Angola,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6773356.0,8109421.0,9403365.0,12073218.0,12785109.0,13285198.0,14052558.0,13884532.0,13001124.0,13323952.0
Antigua and Barbuda,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,136592.0,134925.0,167970.0,176008.0,127381.0,114358.0,120041.0,176000.0,180000.0,
Argentina,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,46508774.0,52482780.0,57082298.0,60722729.0,64327647.0,67361515.0,61234216.0,61842011.0,63723692.0,61897379.0
Armenia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1442000.0,2191500.0,3865354.0,3211215.0,3322837.0,3346275.0,3459137.0,3464490.0,3434567.0,3488524.0
Australia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,22120000.0,22200000.0,22500000.0,23789000.0,24338000.0,24940000.0,25060000.0,25770000.0,26551000.0,27553000.0
Austria,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10816000.0,11434000.0,12241000.0,13022578.0,13588000.0,13272000.0,12952605.0,13470623.0,14270000.0,14924340.0


#### OPTIONAL BONUS QUESTION!

Sort these 5 countries, by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?

If you have the time and curiosity, what other questions can you ask and answer with this data?