# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling

In this Sprint Challenge you will use data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the Sprint Challenge!
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

## Part 0. Load data

You don't need to add or change anything here. Just run this cell and it loads the data for you, into three dataframes.

In [3]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

In [4]:
cell_phones.head()

Unnamed: 0,geo,time,cell_phones_total
0,abw,1960,0.0
1,abw,1965,0.0
2,abw,1970,0.0
3,abw,1975,0.0
4,abw,1976,0.0


In [5]:
population.head()

Unnamed: 0,geo,time,population_total
0,afg,1800,3280000
1,afg,1801,3280000
2,afg,1802,3280000
3,afg,1803,3280000
4,afg,1804,3280000


In [6]:
geo_country_codes.head()

Unnamed: 0,geo,alt_5,alternative_1,alternative_2,alternative_3,alternative_4_cdiac,arb1,arb2,arb3,arb4,...,latitude,longitude,main_religion_2008,country,pandg,un_state,unicode_region_subtag,upper_case_name,world_4region,world_6region
0,abkh,,,,,,,,,,...,,,,Abkhazia,,False,,,europe,europe_central_asia
1,abw,,,,,Aruba,,,,,...,12.5,-69.96667,christian,Aruba,,False,AW,ARUBA,americas,america
2,afg,,Islamic Republic of Afghanistan,,,Afghanistan,,,,,...,33.0,66.0,muslim,Afghanistan,AFGHANISTAN,True,AF,AFGHANISTAN,asia,south_asia
3,ago,,,,,Angola,,,,,...,-12.5,18.5,christian,Angola,ANGOLA,True,AO,ANGOLA,africa,sub_saharan_africa
4,aia,,,,,,,,,,...,18.21667,-63.05,christian,Anguilla,,False,AI,ANGUILLA,americas,america



## Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)

In [9]:
cell_pop_join = pd.merge(cell_phones, population,
                                  on=['geo','time'],
                                  how='inner')

cell_pop_join.head()

Unnamed: 0,geo,time,cell_phones_total,population_total
0,afg,1960,0.0,8996351
1,afg,1965,0.0,9938414
2,afg,1970,0.0,11126123
3,afg,1975,0.0,12590286
4,afg,1976,0.0,12840299


In [10]:
cell_pop_join.shape

(8590, 4)
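
As a minimal sketch of the same kind of join on toy data: the frames `phones` and `people` and all their values are made up for illustration, not taken from Gapminder. The optional `validate='one_to_one'` argument asks pandas to raise an error if either side repeats a `(geo, time)` pair, which can catch accidental row duplication before you check the shape.

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical values).
phones = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb'],
    'time': [2000, 2001, 2000],
    'cell_phones_total': [10.0, 20.0, 5.0],
})
people = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb', 'bbb'],
    'time': [2000, 2001, 2000, 2001],
    'population_total': [100, 110, 50, 55],
})

# An inner join keeps only the (geo, time) pairs present in both tables.
# validate='one_to_one' raises MergeError if a key pair repeats on either side.
joined = pd.merge(phones, people, on=['geo', 'time'], how='inner',
                  validate='one_to_one')
print(joined.shape)
```

The pair `('bbb', 2001)` exists only in `people`, so it is dropped from the result.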

Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)

In [11]:
geo_country = ['geo', 'country']

geo_cell_pop = pd.merge(cell_pop_join, geo_country_codes[geo_country],
                        how='inner', on='geo')

geo_cell_pop.shape

(8590, 5)
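
If the shape after a join like this is ever smaller than expected, one way to find the cause is a left join with `indicator=True`, which labels each row by where its key was found. The sketch below uses invented frames (`facts`, `codes`) and invented country names, so treat it as an illustration rather than a check on the real data.

```python
import pandas as pd

# Hypothetical data: 'zzz' has no entry in the lookup table.
facts = pd.DataFrame({'geo': ['aaa', 'bbb', 'zzz'], 'value': [1, 2, 3]})
codes = pd.DataFrame({
    'geo': ['aaa', 'bbb', 'ccc'],
    'country': ['Aland', 'Bland', 'Cland'],
    'extra': [0, 0, 0],
})

# Select only the lookup columns we need, then left-join so no fact row is lost.
# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join.
check = pd.merge(facts, codes[['geo', 'country']], on='geo', how='left',
                 indicator=True)

# Codes that an inner join would have silently dropped.
unmatched = check.loc[check['_merge'] == 'left_only', 'geo'].tolist()
print(unmatched)
```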

## Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)

In [12]:
geo_cell_pop.head()

Unnamed: 0,geo,time,cell_phones_total,population_total,country
0,afg,1960,0.0,8996351,Afghanistan
1,afg,1965,0.0,9938414,Afghanistan
2,afg,1970,0.0,11126123,Afghanistan
3,afg,1975,0.0,12590286,Afghanistan
4,afg,1976,0.0,12840299,Afghanistan


In [14]:
phone_sum = geo_cell_pop['cell_phones_total'].sum()
phone_sum

77352517432.558

In [15]:
# pop_total = geo_cell_pop['population_total']

In [17]:
geo_cell_pop['country'].value_counts().head(20)

Sudan               46
Djibouti            46
Kuwait              46
Burundi             46
Niger               46
Monaco              46
China               46
Estonia             46
El Salvador         46
Hong Kong, China    46
Burkina Faso        46
Switzerland         46
Ecuador             46
Kazakhstan          46
Guyana              46
Cuba                46
Cameroon            46
Germany             46
Mauritius           46
Honduras            46
Name: country, dtype: int64

In [19]:
geo_cell_pop.describe()

Unnamed: 0,time,cell_phones_total,population_total
count,8590.0,8590.0,8590.0
mean,1994.193481,9004950.0,29838230.0
std,14.257975,55734080.0,116128400.0
min,1960.0,0.0,4433.0
25%,1983.0,0.0,1456148.0
50%,1995.0,6200.0,5725062.0
75%,2006.0,1697652.0,18105810.0
max,2017.0,1474097000.0,1409517000.0


In [21]:
# Cell phones per person: total phones divided by total population, row by row
geo_cell_pop['cell_phone_average'] = geo_cell_pop['cell_phones_total'].div(geo_cell_pop['population_total'])


In [26]:
geo_cell_pop['cell_phone_average'].describe()

count    8590.000000
mean        0.279639
std         0.454247
min         0.000000
25%         0.000000
50%         0.001564
75%         0.461149
max         2.490243
Name: cell_phone_average, dtype: float64
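
To confirm the calculation against the hint above, you can look up a single country and year. This sketch rebuilds two 2017 rows from values that appear in this notebook's output; the column name `phones_per_person` is just an illustrative choice (the notebook itself calls the column `cell_phone_average`).

```python
import pandas as pd

# Two 2017 rows, with totals copied from this notebook's output.
df = pd.DataFrame({
    'geo': ['usa', 'chn'],
    'time': [2017, 2017],
    'cell_phones_total': [395881000.0, 1474097000.0],
    'population_total': [324459463, 1409517397],
})

# Element-wise division gives one ratio per (geo, time) row.
df['phones_per_person'] = df['cell_phones_total'] / df['population_total']

# Pick out the United States in 2017 and compare with the expected 1.220.
is_usa_2017 = (df['geo'] == 'usa') & (df['time'] == 2017)
usa_2017 = df.loc[is_usa_2017, 'phones_per_person'].iloc[0]
print(round(usa_2017, 3))
```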

Modify the `geo` column to make the geo codes uppercase instead of lowercase.

In [29]:
def uppercase(x):
    return x.upper()

### upper

In [30]:
geo_cell_pop['geo'] = geo_cell_pop['geo'].apply(uppercase)

In [31]:
geo_cell_pop['geo'].head()

0    AFG
1    AFG
2    AFG
3    AFG
4    AFG
Name: geo, dtype: object
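
An alternative worth knowing is the vectorized `.str` accessor described in the "Working with Text Data" page linked at the top. The sketch below uses a small made-up Series; unlike `apply` with a plain Python function, `.str.upper()` passes missing values through instead of raising an error.

```python
import pandas as pd

# Hypothetical codes, including one missing value.
geo = pd.Series(['afg', 'abw', None], name='geo')

# .str.upper() uppercases each string and leaves missing entries missing.
upper = geo.str.upper()
print(upper.tolist())
```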

## Part 3. Process data

Use the `describe` function to summarize your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)

In [33]:
geo_cell_pop.describe(include='all')

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phone_average
count,8590,8590.0,8590.0,8590.0,8590,8590.0
unique,195,,,,195,
top,PAN,,,,Sudan,
freq,46,,,,46,
mean,,1994.193481,9004950.0,29838230.0,,0.279639
std,,14.257975,55734080.0,116128400.0,,0.454247
min,,1960.0,0.0,4433.0,,0.0
25%,,1983.0,0.0,1456148.0,,0.0
50%,,1995.0,6200.0,5725062.0,,0.001564
75%,,2006.0,1697652.0,18105810.0,,0.461149
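
The prompt asks for the numeric and non-numeric summaries separately, whereas `include='all'` merges them into one table padded with missing values. A minimal sketch of the two-call version, on a made-up frame (the names and numbers are placeholders):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'geo': ['AAA', 'AAA', 'BBB'],
    'country': ['Aland', 'Aland', 'Bland'],
    'time': [2016, 2017, 2017],
    'cell_phones_total': [10.0, 20.0, 5.0],
})

# Default describe() covers numeric columns only: count, mean, std, quartiles.
numeric_summary = df.describe()

# Excluding numeric dtypes leaves the text columns: count, unique, top, freq.
text_summary = df.describe(exclude=[np.number])

print(numeric_summary.loc[['min', 'max'], 'time'])
print(text_summary.loc['unique'])
```

In the real dataframe, the `min` and `max` of `time` give the year range, and the `unique` row gives the number of distinct countries.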


In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |
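
One possible approach is to filter to a single year and then ask for the largest values with `nlargest`. The sketch below uses invented countries and counts, and asks for the top 2 rather than the top 5, so it shows the pattern without answering the question.

```python
import pandas as pd

# Hypothetical long-format data: one row per country and year.
df = pd.DataFrame({
    'country': ['Aland', 'Bland', 'Cland', 'Aland', 'Bland', 'Cland'],
    'time': [2016, 2016, 2016, 2017, 2017, 2017],
    'cell_phones_total': [9.0, 30.0, 1.0, 10.0, 25.0, 40.0],
})

# Keep one year, then take the rows with the largest totals (already sorted).
top = (df[df['time'] == 2017]
       .nlargest(2, 'cell_phones_total')[['country', 'cell_phones_total']])
print(top)
```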



In [34]:
# This optional code formats float numbers with comma separators
pd.options.display.float_format = '{:,}'.format

In [37]:
geo_cell_pop.describe(include='all')

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phone_average
count,8590,8590.0,8590.0,8590.0,8590,8590.0
unique,195,,,,195,
top,PAN,,,,Sudan,
freq,46,,,,46,
mean,,1994.1934807916184,9004949.642905472,29838230.581722934,,0.2796385558059151
std,,14.257974607310302,55734084.87217964,116128377.474773,,0.454246656214052
min,,1960.0,0.0,4433.0,,0.0
25%,,1983.0,0.0,1456148.0,,0.0
50%,,1995.0,6200.0,5725062.5,,0.0015636266438163
75%,,2006.0,1697652.0,18105812.0,,0.4611491855201403


In [45]:
grouped = geo_cell_pop.groupby(['country', 'time'])['cell_phones_total'].max()
grouped

country      time
Afghanistan  1960            0.0
             1965            0.0
             1970            0.0
             1975            0.0
             1976            0.0
             1977            0.0
             1978            0.0
             1979            0.0
             1980            0.0
             1981            0.0
             1982            0.0
             1983            0.0
             1984            0.0
             1985            0.0
             1986            0.0
             1987            0.0
             1988            0.0
             1989            0.0
             1990            0.0
             1991            0.0
             1992            0.0
             1993            0.0
             1994            0.0
             1995            0.0
             1996            0.0
             1997            0.0
             1998            0.0
             1999            0.0
             2000            0.0
             2001        

In [106]:
locate = geo_cell_pop.loc[geo_cell_pop['cell_phones_total'].idxmax()]
locate

geo                                 CHN
time                               2017
cell_phones_total       1,474,097,000.0
population_total             1409517397
country                           China
cell_phone_average   1.0458168186766978
Name: 1496, dtype: object

2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?
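
A question of this form can be answered by keeping the rows where phones exceed population and taking the smallest year. The helper `first_year_above_parity` below is a hypothetical name. The sample holds only the five China rows (2013 to 2017) that appear in this notebook's output, so it demonstrates the method without answering the USA question.

```python
import pandas as pd

# China, 2013 to 2017, copied from this notebook's output.
df = pd.DataFrame({
    'geo': ['CHN'] * 5,
    'time': [2017, 2016, 2015, 2014, 2013],
    'cell_phones_total': [1474097000.0, 1364934000.0, 1291984200.0,
                          1286093000.0, 1229113000.0],
    'population_total': [1409517397, 1403500365, 1397028553,
                         1390110388, 1382793212],
})

def first_year_above_parity(frame, geo_code):
    """Earliest year in which phones outnumber people, or None if never."""
    rows = frame[(frame['geo'] == geo_code) &
                 (frame['cell_phones_total'] > frame['population_total'])]
    return int(rows['time'].min()) if not rows.empty else None

print(first_year_above_parity(df, 'CHN'))
```

Applied to the full dataframe with the `USA` code, the same function returns the year the question asks for.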

In [111]:
# geo_cell_pop.groupby('time').apply(lambda geo_cell_pop:geo_cell_pop.iloc(geo_cell_pop['cell_phones_total'].idxmax()))
geo_cell_pop = geo_cell_pop.sort_values(by='cell_phones_total', ascending=False)

geo_cell_pop.loc[geo_cell_pop['time']==2017].head()


Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell_phone_average
1496,CHN,2017,1474097000.0,1409517397,China,1.0458168186766978
3595,IND,2017,1168902277.0,1339180127,India,0.8728491809526382
3549,IDN,2017,458923202.0,263991379,Indonesia,1.738402230172827
8134,USA,2017,395881000.0,324459463,United States,1.2201246847283354
1084,BRA,2017,236488548.0,209288278,Brazil,1.1299655683535224


In [159]:
# Filter rows by country name. Passing the names to .loc[[...]] would look them
# up in the index, which holds row labels here rather than country names.
locating = geo_cell_pop[geo_cell_pop['country'].isin(
    ['China', 'India', 'United States', 'Indonesia', 'Brazil'])]


In [164]:
geo_cell_pop = geo_cell_pop.reset_index()

In [166]:
geo_cell_pop.head()

Unnamed: 0,country,geo,time,cell_phones_total,population_total,cell_phone_average
0,China,CHN,2017,1474097000.0,1409517397,1.0458168186766978
1,China,CHN,2016,1364934000.0,1403500365,0.9725213003418064
2,China,CHN,2015,1291984200.0,1397028553,0.9248087286588194
3,China,CHN,2014,1286093000.0,1390110388,0.925173289187736
4,China,CHN,2013,1229113000.0,1382793212,0.8888624772913624


In [167]:
countries = ['China', 'India', 'United States', 'Indonesia', 'Brazil']

In [168]:
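cols = ['country', 'cell_phones_total']
# Boolean mask: rows for 2017 whose country is one of the five names above.
# (Comparing the column to the list with == raises a ValueError because the lengths differ.)
rows = geo_cell_pop['country'].isin(countries) & (geo_cell_pop['time'] == 2017)
geo_cell_pop.loc[rows, cols]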

## Part 4. Reshape data

Create a pivot table:
- Columns: Years 2007—2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

In [None]:
geo_cell_pop.set_index('country', inplace=True)

geo_cell_pop.head()

In [128]:
mask = (geo_cell_pop['time'] > 2006) & (geo_cell_pop['time'] <= 2017)
pref_years = geo_cell_pop['time'].loc[mask]

In [None]:
##having a lot of trouble with this one.. please help :) thanks

In [142]:
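# Work on a copy with 'country' as a regular column, keep the five countries
# and the years 2007 to 2017, then pivot the years into columns.
top5 = ['China', 'India', 'United States', 'Indonesia', 'Brazil']
subset = geo_cell_pop.reset_index()
subset = subset[subset['country'].isin(top5) & subset['time'].between(2007, 2017)]
country_piv = subset.pivot_table(index='country', columns='time',
                                 values='cell_phones_total')

country_piv.shape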

In [110]:
# for country in geo_cell_pop.groupby(['country']):
#     geo_cell_pop.loc['China', 'India', 'United States', 'Indonesia', 'Brazil']

#### OPTIONAL BONUS QUESTION!

Sort these 5 countries, by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?
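
One way to approach the bonus is to pivot to a wide table (countries as rows, years as columns), subtract the 2007 column from the 2017 column, and sort. The sketch below uses two invented countries with invented totals, so the pattern is visible without giving away the answer.

```python
import pandas as pd

# Hypothetical long-format data for two made-up countries.
long_df = pd.DataFrame({
    'country': ['Aland', 'Aland', 'Bland', 'Bland'],
    'time': [2007, 2017, 2007, 2017],
    'cell_phones_total': [100.0, 400.0, 50.0, 500.0],
})

# Rows become countries, columns become years.
wide = long_df.pivot_table(index='country', columns='time',
                           values='cell_phones_total')

# Column-wise subtraction gives the increase per country; sort largest first.
increase = (wide[2017] - wide[2007]).sort_values(ascending=False)
print(increase)
```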

If you have the time and curiosity, what other questions can you ask and answer with this data?