
# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling

In this Sprint Challenge you will use data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the Sprint Challenge!
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

## Part 0. Load data

You don't need to add or change anything here. Just run this cell and it loads the data for you, into three dataframes.

In [0]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

## Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)

In [0]:
df = pd.merge(cell_phones, population, how='inner', on=['geo', 'time'])
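
As a minimal sketch of what an inner join on two key columns does, here is the same `pd.merge` call on toy data. The `phones` and `people` frames and their values are made up for illustration; `validate='one_to_one'` is an optional extra check, not something the challenge requires.

```python
import pandas as pd

# Toy stand-ins for the real dataframes (values are invented for illustration)
phones = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb'],
    'time': [2016, 2017, 2017],
    'cell_phones_total': [10.0, 12.0, 5.0],
})
people = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb', 'bbb'],
    'time': [2016, 2017, 2016, 2017],
    'population_total': [100, 101, 50, 51],
})

# An inner join keeps only the (geo, time) pairs present in both frames.
# validate='one_to_one' raises MergeError if either side repeats a key pair.
merged = pd.merge(phones, people, how='inner', on=['geo', 'time'],
                  validate='one_to_one')
print(merged.shape)  # (3, 4)
```

The (bbb, 2016) row exists only in `people`, so it is dropped, which is why the result has three rows rather than four.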

Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)

In [142]:
geo_country_codes.head()

Unnamed: 0,geo,alt_5,alternative_1,alternative_2,alternative_3,alternative_4_cdiac,arb1,arb2,arb3,arb4,...,latitude,longitude,main_religion_2008,country,pandg,un_state,unicode_region_subtag,upper_case_name,world_4region,world_6region
0,abkh,,,,,,,,,,...,,,,Abkhazia,,False,,,europe,europe_central_asia
1,abw,,,,,Aruba,,,,,...,12.5,-69.96667,christian,Aruba,,False,AW,ARUBA,americas,america
2,afg,,Islamic Republic of Afghanistan,,,Afghanistan,,,,,...,33.0,66.0,muslim,Afghanistan,AFGHANISTAN,True,AF,AFGHANISTAN,asia,south_asia
3,ago,,,,,Angola,,,,,...,-12.5,18.5,christian,Angola,ANGOLA,True,AO,ANGOLA,africa,sub_saharan_africa
4,aia,,,,,,,,,,...,18.21667,-63.05,christian,Anguilla,,False,AI,ANGUILLA,americas,america


In [143]:
df.head()

Unnamed: 0,geo,time,cell_phones_total,population_total
0,afg,1960,0.0,8996351
1,afg,1965,0.0,9938414
2,afg,1970,0.0,11126123
3,afg,1975,0.0,12590286
4,afg,1976,0.0,12840299


In [0]:
df = pd.merge(df, geo_country_codes[['geo', 'country']], how='inner', on='geo')
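
One way to check that a lookup join like this loses no rows is to run it as a left join with `indicator=True` and look for unmatched keys. The sketch below uses hypothetical toy frames (`data`, `codes`) and invented country names; it illustrates the check and is not part of the required solution.

```python
import pandas as pd

# Toy data: 'zzz' has no entry in the lookup table (all values are invented)
data = pd.DataFrame({'geo': ['aaa', 'bbb', 'zzz'], 'time': [2017, 2017, 2017]})
codes = pd.DataFrame({'geo': ['aaa', 'bbb'], 'country': ['Aland', 'Bland']})

# A left join keeps every row of `data`; indicator=True adds a `_merge` column
# that records whether each row found a match in `codes`.
checked = pd.merge(data, codes[['geo', 'country']], how='left', on='geo',
                   validate='many_to_one', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'geo'].tolist()
print(unmatched)  # ['zzz']
```

If `unmatched` is empty, an inner join and a left join return the same rows, which is consistent with the row count staying at 8590 here.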

In [145]:
df.head()

Unnamed: 0,geo,time,cell_phones_total,population_total,country
0,afg,1960,0.0,8996351,Afghanistan
1,afg,1965,0.0,9938414,Afghanistan
2,afg,1970,0.0,11126123,Afghanistan
3,afg,1975,0.0,12590286,Afghanistan
4,afg,1976,0.0,12840299,Afghanistan


Stretch Goal Below:

In [0]:
# it might be useful to know more geographic data about these countries so I will add the world_4region and world_6region columns
df = pd.merge(df, geo_country_codes[['geo','world_4region', 'world_6region']], how='inner', on='geo')

In [147]:
df.head()

Unnamed: 0,geo,time,cell_phones_total,population_total,country,world_4region,world_6region
0,afg,1960,0.0,8996351,Afghanistan,asia,south_asia
1,afg,1965,0.0,9938414,Afghanistan,asia,south_asia
2,afg,1970,0.0,11126123,Afghanistan,asia,south_asia
3,afg,1975,0.0,12590286,Afghanistan,asia,south_asia
4,afg,1976,0.0,12840299,Afghanistan,asia,south_asia


In [207]:
# this gives us the opportunity to organize our data by region like this:

df.groupby('world_4region')['cell_phones_total'].mean()

world_4region
africa     3,366,468.9443069305
americas    8,397,025.482018927
asia       16,334,830.667424548
europe      7,295,256.256683206
Name: cell_phones_total, dtype: float64

## Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)

In [69]:
df.tail()

Unnamed: 0,geo,time,cell_phones_total,population_total,country
8585,zwe,2013,13633167.0,15054506,Zimbabwe
8586,zwe,2014,11798652.0,15411675,Zimbabwe
8587,zwe,2015,12757410.0,15777451,Zimbabwe
8588,zwe,2016,12878926.0,16150362,Zimbabwe
8589,zwe,2017,14092104.0,16529904,Zimbabwe


In [0]:
df['cell phones per person'] = (df['cell_phones_total']) / (df['population_total'])
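
A quick sanity check of the ratio, using the United States 2017 figures that appear in this notebook's output. The one-row `usa_2017` frame and the underscore column name are just for illustration.

```python
import pandas as pd

# United States, 2017, as printed in this notebook's output
usa_2017 = pd.DataFrame({
    'country': ['United States'],
    'time': [2017],
    'cell_phones_total': [395881000.0],
    'population_total': [324459463],
})

# Element-wise division of the two columns gives phones per person
usa_2017['cell_phones_per_person'] = (
    usa_2017['cell_phones_total'] / usa_2017['population_total']
)
ratio = usa_2017.loc[0, 'cell_phones_per_person']
print(round(ratio, 3))  # 1.22
```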

In [72]:
df.tail()

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell phones per person
8585,zwe,2013,13633167.0,15054506,Zimbabwe,0.905587
8586,zwe,2014,11798652.0,15411675,Zimbabwe,0.765566
8587,zwe,2015,12757410.0,15777451,Zimbabwe,0.808585
8588,zwe,2016,12878926.0,16150362,Zimbabwe,0.797439
8589,zwe,2017,14092104.0,16529904,Zimbabwe,0.852522


In [0]:
# df.country.unique()

In [79]:
df[(df['country'] == 'United States') & (df['time'] == 2017)]


Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell phones per person
8134,USA,2017,395881000.0,324459463,United States,1.220125


Modify the `geo` column to make the geo codes uppercase instead of lowercase.

In [0]:
df['geo'] = df['geo'].str.upper()

In [210]:
df.head()

Unnamed: 0,geo,time,cell_phones_total,population_total,country,world_4region,world_6region
0,AFG,1960,0.0,8996351,Afghanistan,asia,south_asia
1,AFG,1965,0.0,9938414,Afghanistan,asia,south_asia
2,AFG,1970,0.0,11126123,Afghanistan,asia,south_asia
3,AFG,1975,0.0,12590286,Afghanistan,asia,south_asia
4,AFG,1976,0.0,12840299,Afghanistan,asia,south_asia


In [217]:
# an interesting feature to add would be cell phones per region

df.groupby(['world_4region', 'world_6region', 'time'])['cell_phones_total'].sum()

# not sure how to put this in however

world_4region  world_6region             time
africa         middle_east_north_africa  1960                 0.0
                                         1965                 0.0
                                         1970                 0.0
                                         1975                 0.0
                                         1976                 0.0
                                         1977                 0.0
                                         1978                 0.0
                                         1979                 0.0
                                         1980                 0.0
                                         1981                 0.0
                                         1982                 0.0
                                         1983                 0.0
                                         1984                 0.0
                                         1985                 0.0
                              

In [213]:
df.head()

Unnamed: 0,geo,time,cell_phones_total,population_total,country,world_4region,world_6region,regional_cell_phone_total
0,AFG,1960,0.0,8996351,Afghanistan,asia,south_asia,
1,AFG,1965,0.0,9938414,Afghanistan,asia,south_asia,
2,AFG,1970,0.0,11126123,Afghanistan,asia,south_asia,
3,AFG,1975,0.0,12590286,Afghanistan,asia,south_asia,
4,AFG,1976,0.0,12840299,Afghanistan,asia,south_asia,
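
One possible way to fill a `regional_cell_phone_total` column is `groupby(...).transform('sum')`, which returns a result aligned to the original rows instead of a collapsed summary. This is a sketch on invented toy data, not the notebook's actual values.

```python
import pandas as pd

# Toy data: two regions, two years (values are invented)
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'A', 'B', 'C'],
    'world_4region': ['asia', 'asia', 'europe', 'asia', 'asia', 'europe'],
    'time': [2016, 2016, 2016, 2017, 2017, 2017],
    'cell_phones_total': [10.0, 20.0, 5.0, 15.0, 25.0, 7.0],
})

# transform('sum') broadcasts each group's total back to every row in the
# group, so the result has the same index as `toy` and can be assigned directly.
toy['regional_cell_phone_total'] = (
    toy.groupby(['world_4region', 'time'])['cell_phones_total'].transform('sum')
)
print(toy['regional_cell_phone_total'].tolist())
# [30.0, 30.0, 5.0, 40.0, 40.0, 7.0]
```

Assigning the plain `groupby(...).sum()` result instead would give NaN everywhere, because its index is the group keys rather than the row labels.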


## Part 3. Process data

Use the `describe` function to describe your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)

In [83]:
df.describe(include='number')

Unnamed: 0,time,cell_phones_total,population_total,cell phones per person
count,8590.0,8590.0,8590.0,8590.0
mean,1994.193481,9004950.0,29838230.0,0.279639
std,14.257975,55734080.0,116128400.0,0.454247
min,1960.0,0.0,4433.0,0.0
25%,1983.0,0.0,1456148.0,0.0
50%,1995.0,6200.0,5725062.0,0.001564
75%,2006.0,1697652.0,18105810.0,0.461149
max,2017.0,1474097000.0,1409517000.0,2.490243


In [84]:
df.describe(exclude='number')

Unnamed: 0,geo,country
count,8590,8590
unique,195,195
top,BGR,Croatia
freq,46,46


In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |



In [0]:
# This optional code formats float numbers with comma separators
pd.options.display.float_format = '{:,}'.format

In [110]:
df[df['time']==2017].set_index('country')['cell_phones_total'].sort_values(ascending=False).head()
# Below are the 5 countries with the most total cell phones in 2017 

country
China           1,474,097,000.0
India           1,168,902,277.0
Indonesia         458,923,202.0
United States     395,881,000.0
Brazil            236,488,548.0
Name: cell_phones_total, dtype: float64
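
An alternative to `sort_values(...).head()` is `DataFrame.nlargest`. The sketch below uses the 2017 totals printed in this notebook (the top five plus Zimbabwe, deliberately shuffled); the `phones_2017` frame is a stand-in for the filtered dataframe.

```python
import pandas as pd

# 2017 totals taken from this notebook's output, in arbitrary order
phones_2017 = pd.DataFrame({
    'country': ['Zimbabwe', 'Brazil', 'United States', 'China', 'Indonesia', 'India'],
    'cell_phones_total': [14092104.0, 236488548.0, 395881000.0,
                          1474097000.0, 458923202.0, 1168902277.0],
})

# nlargest returns the n rows with the largest values, already sorted descending
top5 = phones_2017.nlargest(5, 'cell_phones_total')
print(top5['country'].tolist())
# ['China', 'India', 'Indonesia', 'United States', 'Brazil']
```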

2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?

In [111]:
df.head()

Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell phones per person
0,AFG,1960,0.0,8996351,Afghanistan,0.0
1,AFG,1965,0.0,9938414,Afghanistan,0.0
2,AFG,1970,0.0,11126123,Afghanistan,0.0
3,AFG,1975,0.0,12590286,Afghanistan,0.0
4,AFG,1976,0.0,12840299,Afghanistan,0.0


In [112]:
condition = df['cell_phones_total'] > df['population_total']

df[(df['country'] == 'United States') & condition]
# looks like 2014 was the first year, but let's check


Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell phones per person
8131,USA,2014,355500000.0,317718779,United States,1.118914031833164
8132,USA,2015,382307000.0,319929162,United States,1.1949739048796058
8133,USA,2016,395881000.0,322179605,United States,1.228758722948959
8134,USA,2017,395881000.0,324459463,United States,1.2201246847283354


In [114]:
time_frame = df['time'] > 2010
df[(df['country'] == 'United States') & time_frame]

  


Unnamed: 0,geo,time,cell_phones_total,population_total,country,cell phones per person
8128,USA,2011,297404000.0,311051373,United States,0.9561250192584748
8129,USA,2012,304838000.0,313335423,United States,0.9728807457559626
8130,USA,2013,310698000.0,315536676,United States,0.9846652501340288
8131,USA,2014,355500000.0,317718779,United States,1.118914031833164
8132,USA,2015,382307000.0,319929162,United States,1.1949739048796058
8133,USA,2016,395881000.0,322179605,United States,1.228758722948959
8134,USA,2017,395881000.0,324459463,United States,1.2201246847283354


In [0]:
# yes, looks like 2014 was the first year that cell phones outnumbered people in the USA
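
The same answer can be computed rather than read off the table. This sketch rebuilds the United States rows for 2011 to 2017 from the output above and takes the earliest year where phones exceed population; the `usa` frame is a stand-in for the filtered dataframe.

```python
import pandas as pd

# United States rows for 2011 to 2017, copied from the output above
usa = pd.DataFrame({
    'time': [2011, 2012, 2013, 2014, 2015, 2016, 2017],
    'cell_phones_total': [297404000.0, 304838000.0, 310698000.0, 355500000.0,
                          382307000.0, 395881000.0, 395881000.0],
    'population_total': [311051373, 313335423, 315536676, 317718779,
                         319929162, 322179605, 324459463],
})

# Boolean mask of years with more phones than people, then the earliest such year
more_phones = usa['cell_phones_total'] > usa['population_total']
first_year = usa.loc[more_phones, 'time'].min()
print(first_year)  # 2014
```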

## Part 4. Reshape data

Create a pivot table:
- Columns: Years 2007—2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

In [0]:
# Reference pivot pattern, commented out because `tidy` is not defined in this notebook:
# tidy.set_index('name').pivot(columns='trt', values='result').rename_axis(None).rename(columns={'a': 'treatmenta', 'b': 'treatmentb'})

In [133]:
time_cond = df['time'] > 2006
countries = df['country'].isin(['China', 'India', 'United States', 'Indonesia', 'Brazil'])

# Combine both masks with & in a single indexing step to avoid the reindexing warning
df[time_cond & countries].pivot(index='country', columns='time', values='cell_phones_total').shape
# ok got it!


(5, 11)
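
The filter-then-pivot pattern is easier to see on a tiny example. The toy frame below is invented (three countries, three years); the point is that one combined boolean mask selects the rows and `pivot` then spreads years across columns.

```python
import pandas as pd

# Toy long-format data: one row per (country, year); values are invented
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'time': [2006, 2007, 2008] * 3,
    'cell_phones_total': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})

# One combined mask: years after 2006 AND country in the chosen list
mask = (toy['time'] > 2006) & toy['country'].isin(['A', 'B'])

# pivot turns the long table into one row per country and one column per year
wide = toy[mask].pivot(index='country', columns='time', values='cell_phones_total')
print(wide.shape)  # (2, 2)
```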

#### OPTIONAL BONUS QUESTION!

Sort these 5 countries, by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?

In [161]:
# I'm making a subset dataframe that contains the years and countries we're interested in
# (.copy() so that adding a column later does not modify a view of df)
subset = df[time_cond & countries].copy()
subset.head()


Unnamed: 0,geo,time,cell_phones_total,population_total,country,world_4region,world_6region
1074,bra,2007,120980103.0,191026637,Brazil,americas,america
1075,bra,2008,150641403.0,192979029,Brazil,americas,america
1076,bra,2009,169385584.0,194895996,Brazil,americas,america
1077,bra,2010,196929978.0,196796269,Brazil,americas,america
1078,bra,2011,234357507.0,198686688,Brazil,americas,america


In [0]:
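# I'm adding a column with the difference in cell phones from 10 rows earlier.
# Each country has 11 rows here (2007 to 2017), so only the 2017 rows compare a country
# with its own 2007 value; the other rows are NaN or mix in the previous country's numbers.
subset['10 year growth'] = subset['cell_phones_total'].diff(periods=10)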

In [199]:
subset['10 year growth'].describe()

count                  45.0
mean           24,197,366.6
std     705,939,746.4912602
min        -1,088,523,602.0
25%          -633,310,677.0
50%           146,581,000.0
75%           611,898,813.0
max         1,128,445,452.0
Name: 10 year growth, dtype: float64

In [200]:
subset.pivot(index = 'country', columns = 'time', values = '10 year growth')

country,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Brazil,,,,,,,,,,,115508445.0
China,396664597.0,471859416.0,550284022.0,624645493.0,737929297.0,841055201.0,948384204.0,1028278726.0,1047916844.0,1128445452.0,926791000.0
India,93041757.0,183213039.0,313799765.0,502384381.0,611898813.0,551494003.0,560721426.0,605060337.0,615482602.0,668885798.0,935282277.0
Indonesia,-547858119.0,-606635757.0,-695326039.0,-774962765.0,-862349381.0,-947149335.0,-972866086.0,-966401381.0,-1025985660.0,-1088523602.0,365536321.0
United States,-97590000.0,-263790000.0,-477907000.0,-608744478.0,-567316917.0,-581466245.0,-633310677.0,-645556000.0,-745502000.0,-773021277.0,146581000.0


In [0]:
# Ok cool! Reading the 2017 column, India had a 10 year increase in cell phones of 935,282,277!

# Whoo hoo!
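
A more direct route to the bonus answer is to pivot to one column per year and subtract the 2007 column from the 2017 column. In this sketch the 2017 totals are the ones printed earlier in the notebook; the 2007 totals are back-computed from the differences the notebook prints (Brazil's 2007 value appears directly in the `subset.head()` output), so treat the `totals` frame as a reconstruction rather than the raw data.

```python
import pandas as pd

# 2007 and 2017 totals for the five countries (2007 values reconstructed
# from the differences printed in this notebook)
totals = pd.DataFrame({
    'country': ['Brazil', 'China', 'Indonesia', 'India', 'United States'] * 2,
    'time': [2007] * 5 + [2017] * 5,
    'cell_phones_total': [
        120980103.0, 547306000.0, 93386881.0, 233620000.0, 249300000.0,
        236488548.0, 1474097000.0, 458923202.0, 1168902277.0, 395881000.0,
    ],
})

# One row per country, one column per year, then a column-wise difference
wide = totals.pivot(index='country', columns='time', values='cell_phones_total')
growth = (wide[2017] - wide[2007]).sort_values(ascending=False)
print(growth.index.tolist())
# ['India', 'China', 'Indonesia', 'United States', 'Brazil']
```

Because the subtraction happens between columns of the pivoted table, each country is only ever compared with itself, which avoids the row-offset problem of calling `diff` on the long table.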

Below here I tried some different techniques that didn't work out

In [194]:
# actually, that isn't necessary to answer the question, all I need is the difference between 2007 and 2017

for x in (subset['cell_phones_total'][subset['time']==2017]):
  for y in (subset['cell_phones_total'][subset['time']==2007]):
    print(x-y)

# ok so this gives me sort of the right output, but how to append it to the dataframe?

def year_growth(target):
  for y in (subset['cell_phones_total'][subset['time']==2007]):
    return float(target-y)

#   maybe this function will be useful, I can see some of the data I want, but not sure how to put it into the dataframe at the right index point

115508445.0
-310817452.0
143101667.0
2868548.0
-12811452.0
1353116897.0
926791000.0
1380710119.0
1240477000.0
1224797000.0
337943099.0
-88382798.0
365536321.0
225303202.0
209623202.0
1047922174.0
621596277.0
1075515396.0
935282277.0
919602277.0
274900897.0
-151425000.0
302494119.0
162261000.0
146581000.0


In [195]:

# df.int_rate = df['int_rate'].apply(remove_percent)

# subset['10 year phone growth'] = 
subset[subset['time']==2017]['cell_phones_total'].apply(year_growth)

1084     115,508,445.0
1496   1,353,116,897.0
3549     337,943,099.0
3595   1,047,922,174.0
8134     274,900,897.0
Name: cell_phones_total, dtype: float64

In [160]:
# now I'll try to sort by these columns
subset.groupby('yearly phone growth').country.agg('count')

yearly phone growth
-1,380,710,119.0    1
-919,602,277.0      1
-225,303,202.0      1
-29,141,561.0       1
-22,914,522.0       1
-13,746,918.0       1
-7,578,808.0        1
0.0                 1
5,860,000.0         1
5,891,200.0         1
7,434,000.0         1
9,628,997.0         1
10,835,000.0        1
12,000,000.0        1
12,286,000.0        1
12,355,905.0        1
12,983,000.0        1
13,365,521.0        1
13,574,000.0        1
13,966,196.0        1
18,744,181.0        1
21,583,328.0        1
22,776,096.0        1
23,098,718.0        1
26,807,000.0        1
27,544,394.0        1
29,661,300.0        1
31,263,249.0        1
32,158,046.0        1
37,427,529.0        1
38,515,384.0        1
41,093,277.0        1
44,802,000.0        1
46,625,058.0        1
47,191,362.0        1
47,613,274.0        1
56,980,000.0        1
57,047,323.0        1
57,704,432.0        1
72,949,800.0        1
73,349,804.0        1
93,939,000.0        1
105,969,000.0       1
109,163,000.0       1
111,789,000.

If you have the time and curiosity, what other questions can you ask and answer with this data?