<a href="https://colab.research.google.com/github/wel51x/DS-Unit-1-Sprint-2-Data-Wrangling/blob/master/DS_Unit_1_Sprint_Challenge_2_Data_Wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling

In this Sprint Challenge you will use data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the Sprint Challenge!
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

## Part 0. Load data

You don't need to add or change anything here. Just run this cell to load the data into three dataframes.

In [0]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

## Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)

In [111]:
print(cell_phones.shape)
cell_phones.sample(8)

(9215, 3)


Unnamed: 0,geo,time,cell_phones_total
6397,nru,1985,0.0
1828,col,2009,42159613.0
4394,ken,2008,16303573.0
2924,fsm,2017,23114.0
5610,mlt,1980,0.0
6124,ner,1995,0.0
2431,ecu,1977,0.0
6255,nld,1989,56000.0


In [112]:
print(population.shape)
population.sample(8)

(59297, 3)


Unnamed: 0,geo,time,population_total
25216,isl,2033,371060
8820,can,1891,4957140
8940,can,2011,34538622
40153,nru,1920,2618
55408,ukr,1824,11757787
28173,kir,1980,59339
53991,tur,1912,15022873
27117,ken,1827,2574000


In [113]:
print(geo_country_codes.shape)
geo_country_codes.sample(8)

(273, 33)


Unnamed: 0,geo,alt_5,alternative_1,alternative_2,alternative_3,alternative_4_cdiac,arb1,arb2,arb3,arb4,...,latitude,longitude,main_religion_2008,country,pandg,un_state,unicode_region_subtag,upper_case_name,world_4region,world_6region
214,slb,,Solomon Isl.,,,Solomon Islands,,,,,...,-8.0,159.0,christian,Solomon Islands,,True,SB,SOLOMON ISLANDS,asia,east_asia_pacific
94,gmb,,"Gambia, The",The Gambia,Gambia The,Gambia,,,,,...,13.5,-15.5,muslim,Gambia,GAMBIA,True,GM,GAMBIA,africa,sub_saharan_africa
218,som,,,,,Somalia,,,,,...,6.0,48.0,muslim,Somalia,SOMALIA,True,SO,SOMALIA,africa,sub_saharan_africa
257,uzb,,,,,Uzbekistan,,,,,...,41.66667,63.83333,muslim,Uzbekistan,UZBEKISTAN,True,UZ,UZBEKISTAN,asia,europe_central_asia
159,mng,,,,,Mongoli,,,,,...,46.0,105.0,,Mongolia,MONGOLIA,True,MN,MONGOLIA,asia,east_asia_pacific
137,lby,,Libyan Arab Jamahiriya,,,Libyan Arab Jamahiriyah,,,,,...,28.0,17.0,muslim,Libya,LIBYAN ARAB JAMAHIRIYA,True,LY,LIBYAN ARAB JAMAHIRIYA,africa,middle_east_north_africa
181,nru,,,,,Nauru,,,,,...,-0.517,166.933,christian,Nauru,,True,NR,NAURU,asia,east_asia_pacific
149,mda,"Moldova, Republic of","Moldova, Rep. of",Republic of Moldova,Moldova (Republic of),Republic Of Moldov,,,,,...,47.25,28.58333,christian,Moldova,MOLDOVA,True,MD,"MOLDOVA, REPUBLIC OF",europe,europe_central_asia


Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)

In [114]:
# Inner join on the two shared key columns
cell_pop_df = pd.merge(cell_phones, population, how='inner', on=['geo', 'time'])

cell_pop_df.shape

(8590, 4)

In [115]:
cell_pop_df.sample(8)

Unnamed: 0,geo,time,cell_phones_total,population_total
8219,vct,2011,131809.0,109341
6060,pak,1965,0.0,50845221
1662,cog,2003,330000.0,3502519
3851,ita,2015,87691238.0,59504212
1,afg,1965,0.0,9938414
6330,pol,2000,6747000.0,38550495
8174,uzb,2011,25441789.0,29068224
5569,mys,1975,0.0,12162369
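
The second join described above (adding `country` to reach five columns) is not shown in the cells so far. A minimal sketch of that step follows, using small made-up dataframes in place of the real ones; `cell_pop_toy`, `geo_codes_toy`, and all their values are illustrative assumptions, not Gapminder data.

```python
import pandas as pd

# Toy stand-ins for cell_pop_df and geo_country_codes (illustrative values only)
cell_pop_toy = pd.DataFrame({
    'geo': ['usa', 'usa', 'chn'],
    'time': [2016, 2017, 2017],
    'cell_phones_total': [10.0, 12.0, 30.0],
    'population_total': [9, 10, 28],
})
geo_codes_toy = pd.DataFrame({
    'geo': ['usa', 'chn', 'ind'],
    'country': ['United States', 'China', 'India'],
    'world_4region': ['americas', 'asia', 'asia'],  # extra column, not needed
})

# Select only 'geo' and 'country', then inner-join on the shared 'geo' key
merged_toy = pd.merge(cell_pop_toy, geo_codes_toy[['geo', 'country']],
                      how='inner', on='geo')

print(merged_toy.shape)
```

On the real data the same pattern would presumably be applied to `cell_pop_df` and `geo_country_codes`. It should run while `geo` is still lowercase in both frames, since the join key must match exactly.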


## Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)

In [116]:
cell_pop_df['cell_phones_per_person'] = cell_pop_df.cell_phones_total / cell_pop_df.population_total

print(cell_pop_df.dtypes)

cell_pop_df.sample(8)

geo                        object
time                        int64
cell_phones_total         float64
population_total            int64
cell_phones_per_person    float64
dtype: object


Unnamed: 0,geo,time,cell_phones_total,population_total,cell_phones_per_person
4178,kir,1975,0.0,55169,0.0
1435,chl,2002,6244310.0,15623635,0.3996707552371775
7634,tkm,1984,0.0,3166221,0.0
5275,mlt,2017,560010.0,430835,1.2998247588984182
434,aut,2009,11434000.0,8370038,1.3660630931424684
6022,omn,1981,0.0,1220587,0.0
3046,gnq,1965,0.0,277396,0.0
2429,esp,1996,2997645.0,40009324,0.0749236602947852


In [117]:
cell_pop_df.loc[(cell_pop_df['geo'] == 'usa') & (cell_pop_df['time'] == 2017)]

Unnamed: 0,geo,time,cell_phones_total,population_total,cell_phones_per_person
8134,usa,2017,395881000.0,324459463,1.2201246847283354


Modify the `geo` column to make the geo codes uppercase instead of lowercase.

In [118]:
cell_pop_df['geo'] = cell_pop_df['geo'].str.upper()

cell_pop_df.sample(8)

Unnamed: 0,geo,time,cell_phones_total,population_total,cell_phones_per_person
8149,UZB,1986,0.0,18565477,0.0
4651,LSO,1980,0.0,1310118,0.0
1576,CMR,2005,2252508.0,17420795,0.129299954450988
436,AUT,2011,13022578.0,8459864,1.539336566167021
4012,KAZ,1993,0.0,16370000,0.0
4321,KWT,1987,14300.0,1942810,0.0073604727173527
7464,TCD,1996,0.0,7241134,0.0
52,AGO,1978,0.0,8376147,0.0


## Part 3. Process data

Use the `describe` function to summarize your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)

In [119]:
cell_pop_df.describe()

Unnamed: 0,time,cell_phones_total,population_total,cell_phones_per_person
count,8590.0,8590.0,8590.0,8590.0
mean,1994.1934807916184,9004949.642905472,29838230.581722934,0.2796385558059153
std,14.257974607310278,55734084.872176506,116128377.47477296,0.4542466562140471
min,1960.0,0.0,4433.0,0.0
25%,1983.0,0.0,1456148.0,0.0
50%,1995.0,6200.0,5725062.5,0.0015636266438163
75%,2006.0,1697652.0,18105812.0,0.4611491855201403
max,2017.0,1474097000.0,1409517397.0,2.490242818521353
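
The cell above covers only the numeric columns. One way to describe the non-numeric columns is to pass `exclude` to `describe`; the sketch below uses a made-up dataframe (`toy` and its values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Toy dataframe with one non-numeric column (illustrative values only)
toy = pd.DataFrame({
    'geo': ['USA', 'USA', 'CHN'],
    'time': [2016, 2017, 2017],
    'cell_phones_total': [10.0, 12.0, 30.0],
})

# Excluding numeric dtypes leaves only the non-numeric columns (here, 'geo');
# the summary rows are count, unique, top, and freq
non_numeric_summary = toy.describe(exclude=[np.number])

print(non_numeric_summary)
```

The `unique` row of this summary is where the number of distinct geo codes would appear on the real data.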


In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |



In [120]:
# This optional code formats float numbers with comma separators
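# Use the full option name; the bare 'precision' pattern can match several options in newer pandas
pd.set_option('display.precision', 0)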
pd.options.display.float_format = '{:,}'.format

cell_pop_df.loc[(cell_pop_df['time'] == 2017)]\
           .nlargest(5, 'cell_phones_total')\
           .drop(['time', 'population_total', 'cell_phones_per_person'], axis=1)

Unnamed: 0,geo,cell_phones_total
1496,CHN,1474097000.0
3595,IND,1168902277.0
3549,IDN,458923202.0
8134,USA,395881000.0
1084,BRA,236488548.0


2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?

In [121]:
cell_pop_df_usa = cell_pop_df[cell_pop_df['geo'] == 'USA']

# Build the boolean mask from the same (already filtered) dataframe it is applied to
cell_pop_df_usa = cell_pop_df_usa[cell_pop_df_usa['cell_phones_total'] > cell_pop_df_usa['population_total']]

cell_pop_df_usa = cell_pop_df_usa.nsmallest(1, 'time')

print("first time number of cell phones exceeded population in the US was:", list(cell_pop_df_usa['time'].reset_index(drop=True)))

first time number of cell phones exceeded population in the US was: [2014]
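
The same question can be asked for every country at once by filtering first and then taking the earliest year per group. This is a sketch on made-up data (`toy`, the codes `AAA` and `BBB`, and all values are hypothetical), not the notebook's own method.

```python
import pandas as pd

# Toy data: AAA passes one phone per person in 2016, BBB never does
toy = pd.DataFrame({
    'geo': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
    'time': [2015, 2016, 2017, 2016, 2017],
    'cell_phones_total': [8.0, 11.0, 12.0, 5.0, 6.0],
    'population_total': [10, 10, 10, 7, 7],
})

# Keep only rows where phones outnumber people, then take the earliest year per country
more_phones = toy[toy['cell_phones_total'] > toy['population_total']]
first_year = more_phones.groupby('geo')['time'].min()

print(first_year)
```

Countries that never cross the threshold simply do not appear in the result.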




## Part 4. Reshape data

Create a pivot table:
- Columns: Years 2007—2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

In [122]:
# Create my_geo_country_codes df of geo & country
country_list = ['China', 'India', 'United States', 'Indonesia', 'Brazil']

my_geo_country_codes = geo_country_codes[geo_country_codes['country'].isin(country_list)]
my_geo_country_codes = my_geo_country_codes[['geo', 'country']]

my_geo_country_codes

Unnamed: 0,geo,country
33,bra,Brazil
45,chn,China
111,idn,Indonesia
112,ind,India
254,usa,United States


In [123]:
# Create cell_phone subset for my_geo_country_codes only
my_cell_phones = cell_phones.loc[(cell_phones['time'] >= 2007) & (cell_phones['time'] <= 2017)]

my_cell_phone_table = pd.merge(my_cell_phones[['time', 'cell_phones_total', 'geo']], 
                               my_geo_country_codes[['geo', 'country']])

my_cell_phone_table = my_cell_phone_table.drop(['geo'], axis = 1)

my_cell_phone_table.sample(8)

Unnamed: 0,time,cell_phones_total,country
35,2009,525090000.0,India
8,2015,257814274.0,Brazil
41,2015,1001056000.0,India
16,2012,1112155000.0,China
28,2013,313226914.0,Indonesia
40,2014,944008677.0,India
51,2014,355500000.0,United States
27,2012,281963665.0,Indonesia


In [124]:
# Create pivot table
cell_phone_pivot_table = pd.pivot_table(my_cell_phone_table, index=['country'], columns=['time'])

print(cell_phone_pivot_table.shape)

cell_phone_pivot_table

(5, 11)


cell_phones_total by country (rows) and time (columns)
country,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Brazil,120980103.0,150641403.0,169385584.0,196929978.0,234357507.0,248323703.0,271099799.0,280728796.0,257814274.0,244067356.0,236488548.0
China,547306000.0,641245000.0,747214000.0,859003000.0,986253000.0,1112155000.0,1229113000.0,1286093000.0,1291984200.0,1364934000.0,1474097000.0
India,233620000.0,346890000.0,525090000.0,752190000.0,893862478.0,864720917.0,886304245.0,944008677.0,1001056000.0,1127809000.0,1168902277.0
Indonesia,93386881.0,140578243.0,163676961.0,211290235.0,249805619.0,281963665.0,313226914.0,325582819.0,338948340.0,385573398.0,458923202.0
United States,249300000.0,261300000.0,274283000.0,285118000.0,297404000.0,304838000.0,310698000.0,355500000.0,382307000.0,395881000.0,395881000.0


#### OPTIONAL BONUS QUESTION!

Sort these 5 countries by their increase in cell phones from 2007 to 2017, largest first.

Which country had 935,282,277 more cell phones in 2017 versus 2007?
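
One possible approach is to subtract the 2007 column from the 2017 column and sort the result. The sketch below rebuilds just those two columns by hand from the pivot table output shown earlier; the names `totals` and `increase` are illustrative, and on the real pivot table the column labels would be the multi-level `('cell_phones_total', year)` pairs instead.

```python
import pandas as pd

# 2007 and 2017 totals copied from the pivot table output in Part 4
totals = pd.DataFrame(
    {
        2007: [120980103, 547306000, 233620000, 93386881, 249300000],
        2017: [236488548, 1474097000, 1168902277, 458923202, 395881000],
    },
    index=pd.Index(
        ['Brazil', 'China', 'India', 'Indonesia', 'United States'],
        name='country',
    ),
)

# Increase per country, sorted from largest to smallest
increase = (totals[2017] - totals[2007]).sort_values(ascending=False)

print(increase)
```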

If you have the time and curiosity, what other questions can you ask and answer with this data?