<a href="https://colab.research.google.com/github/LilySu/DS-Unit-1-Sprint-2-Data-Wrangling/blob/master/Lily_Su_Sprint2_DS_Unit_1_Sprint_Challenge_2_Data_Wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling

In this Sprint Challenge you will use data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the Sprint Challenge!
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

## Part 0. Load data

You don't need to add or change anything here. Just run this cell and it loads the data for you, into three dataframes.

In [0]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

## Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)

In [2]:
cell_phones.shape, population.shape, geo_country_codes.shape

((9215, 3), (59297, 3), (273, 33))

In [3]:
cell_phones.head(3)

Unnamed: 0,geo,time,cell_phones_total
0,abw,1960,0.0
1,abw,1965,0.0
2,abw,1970,0.0


In [4]:
population.head(3)

Unnamed: 0,geo,time,population_total
0,afg,1800,3280000
1,afg,1801,3280000
2,afg,1802,3280000


In [5]:
cell_phones_by_pop = pd.merge(cell_phones, population, how='outer', on=['geo', 'time'])
cell_phones_by_pop.head(3)

Unnamed: 0,geo,time,cell_phones_total,population_total
0,abw,1960,0.0,
1,abw,1965,0.0,
2,abw,1970,0.0,


In [6]:
cell_phones_by_pop.shape

(59922, 4)

In [7]:
cell_phones_by_pop_nona = cell_phones_by_pop.dropna()
cell_phones_by_pop_nona.shape

(8590, 4)
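The outer join followed by `dropna()` reaches the inner-join result indirectly. A minimal sketch on toy rows (a handful of rows chosen for illustration, not the full Gapminder files) suggests that `how='inner'` gives the same rows in one step, assuming the source columns have no missing values of their own:

```python
import pandas as pd

# Toy stand-ins for the real dataframes (a few illustrative rows only).
cell_phones_toy = pd.DataFrame({
    'geo': ['abw', 'afg', 'afg'],
    'time': [1960, 1960, 1965],
    'cell_phones_total': [0.0, 0.0, 0.0],
})
population_toy = pd.DataFrame({
    'geo': ['afg', 'afg', 'afg'],
    'time': [1800, 1960, 1965],
    'population_total': [3280000, 8996351, 9938414],
})

# Inner join: keep only (geo, time) pairs present in both dataframes.
inner = pd.merge(cell_phones_toy, population_toy, how='inner', on=['geo', 'time'])

# Outer join + dropna: keep every pair, then drop rows with a missing side.
outer_nona = pd.merge(cell_phones_toy, population_toy,
                      how='outer', on=['geo', 'time']).dropna()

print(inner.shape)       # (2, 4)
print(outer_nona.shape)  # (2, 4)
```

One visible difference: the outer join introduces `NaN`, which converts `population_total` to float, while the inner join keeps the integer dtype.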

Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)

In [8]:
geo_country_codes.head(3)

Unnamed: 0,geo,alt_5,alternative_1,alternative_2,alternative_3,alternative_4_cdiac,arb1,arb2,arb3,arb4,...,latitude,longitude,main_religion_2008,country,pandg,un_state,unicode_region_subtag,upper_case_name,world_4region,world_6region
0,abkh,,,,,,,,,,...,,,,Abkhazia,,False,,,europe,europe_central_asia
1,abw,,,,,Aruba,,,,,...,12.5,-69.96667,christian,Aruba,,False,AW,ARUBA,americas,america
2,afg,,Islamic Republic of Afghanistan,,,Afghanistan,,,,,...,33.0,66.0,muslim,Afghanistan,AFGHANISTAN,True,AF,AFGHANISTAN,asia,south_asia


In [9]:
geo_country_codes.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,263,264,265,266,267,268,269,270,271,272
geo,abkh,abw,afg,ago,aia,akr_a_dhe,ala,alb,and,ant,...,vut,wlf,wsm,yem,yem_north,yem_south,yug,zaf,zmb,zwe
alt_5,,,,,,,,,,,...,,,,,,,,,,
alternative_1,,,Islamic Republic of Afghanistan,,,,√Öland,,,Neth. Antilles,...,,Wallis and Futuna,Samoa (Western),Yemen Republic,,,Yugoslav SFR,South Africa Republic,,
alternative_2,,,,,,,,,,,...,,Wallis and Futuna Islands,,,,,,,,
alternative_3,,,,,,,,,,,...,,,,,,,,,,
alternative_4_cdiac,,Aruba,Afghanistan,Angola,,,,Albania,,Netherland Antilles,...,Vanuatu,,Samoa,Yemen,,,Yugoslavia,South Africa,Zambia,Zimbabwe
arb1,,,,,,,,,,,...,,,,,,,,South_Africa,,
arb2,,,,,,,,,,,...,,,,,,,,,,
arb3,,,,,,,,,,,...,,,,,,,,,,
arb4,,,,,,,,,,,...,,,,,,,,,,


In [10]:
geo_country_codes_just_geo_country = geo_country_codes[['geo', 'country']]
geo_country_codes_just_geo_country.head(3)


Unnamed: 0,geo,country
0,abkh,Abkhazia
1,abw,Aruba
2,afg,Afghanistan


In [11]:
geo_country_codes_just_geo_country.shape, cell_phones_by_pop_nona.shape

((273, 2), (8590, 4))

In [12]:
cell_phones_by_pop_nona.head(3)

Unnamed: 0,geo,time,cell_phones_total,population_total
41,afg,1960,0.0,8996351.0
42,afg,1965,0.0,9938414.0
43,afg,1970,0.0,11126123.0


In [61]:
cell_phones_by_pop_updated = pd.merge(cell_phones_by_pop_nona, geo_country_codes_just_geo_country, how='left', on='geo')
cell_phones_by_pop_updated.shape

(8590, 5)
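Selecting the two columns by name is less fragile than relying on column positions, and `merge` can check that each `geo` code maps to a single country. A small sketch with toy rows in place of the real dataframes (`validate='many_to_one'` is an optional safeguard added here for illustration, not something the challenge requires):

```python
import pandas as pd

# Toy stand-ins (a few illustrative rows only).
data_toy = pd.DataFrame({
    'geo': ['afg', 'afg', 'usa'],
    'time': [1960, 1965, 2017],
    'cell_phones_total': [0.0, 0.0, 395881000.0],
})
codes_toy = pd.DataFrame({
    'geo': ['abw', 'afg', 'usa'],
    'country': ['Aruba', 'Afghanistan', 'United States'],
    'world_4region': ['americas', 'asia', 'americas'],
})

# Select by column name rather than by position.
geo_country = codes_toy[['geo', 'country']]

# A left join keeps every data row; validate raises MergeError if a geo code
# appears more than once on the right-hand side.
merged = pd.merge(data_toy, geo_country, how='left', on='geo',
                  validate='many_to_one')
print(merged.shape)  # (3, 4)
```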

## Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)

In [14]:
cell_phones_by_pop_updated['cell_phones_total'].nunique()

4491

In [15]:
(cell_phones_by_pop_updated.groupby('cell_phones_total')
              .apply(lambda x: x.groupby('geo').first().sum())
              .reset_index(name='unique_users'))

TypeError: ignored

In [0]:
# cell_phones_by_pop_updated.drop_duplicates(['geo']).groupby('country').agg({'cell_phones_total':'sum'})

In [0]:
# aggregation_functions = {'cell_phones_total': 'sum'}
# cell_phones_by_pop_updated_new = cell_phones_by_pop_updated.groupby(cell_phones_by_pop_updated['country']).aggregate(aggregation_functions)
# cell_phones_by_pop_updated_new.head(3)

In [16]:
# cell_phones_by_pop_updated_new.dtypes

NameError: ignored

In [17]:
cell_phones_by_pop_updated.head(3)

Unnamed: 0,geo,time,cell_phones_total,population_total,country
0,afg,1960,0.0,8996351.0,Afghanistan
1,afg,1965,0.0,9938414.0,Afghanistan
2,afg,1970,0.0,11126123.0,Afghanistan


In [18]:
cell_phones_by_pop_updated.dtypes

geo                   object
time                   int64
cell_phones_total    float64
population_total     float64
country               object
dtype: object

In [0]:
cell_phones_by_pop_updated['population_total']=cell_phones_by_pop_updated['population_total'].astype(float)

In [20]:
cell_phones_by_pop_updated.dtypes

geo                   object
time                   int64
cell_phones_total    float64
population_total     float64
country               object
dtype: object

In [0]:
# cell_phones_by_pop_updated_condensed = cell_phones_by_pop_updated.groupby(['country']).mean()

In [22]:
cell_phones_by_pop_updated.head(3)

Unnamed: 0,geo,time,cell_phones_total,population_total,country
0,afg,1960,0.0,8996351.0,Afghanistan
1,afg,1965,0.0,9938414.0,Afghanistan
2,afg,1970,0.0,11126123.0,Afghanistan


In [0]:
pd.set_option('display.max_rows', 1000)

In [67]:
# Cell phones per person: total cell phones divided by total population.
# (Dividing the other way round yields inf wherever cell_phones_total is 0.)
pd.set_option('display.max_rows', 1000)
cell_phones_by_pop_updated['mean'] = cell_phones_by_pop_updated['cell_phones_total'] / cell_phones_by_pop_updated['population_total']
cell_phones_by_pop_updated

Unnamed: 0,geo,time,cell_phones_total,population_total,country,mean
0,afg,1960,0.0,8996351.0,Afghanistan,0.0
1,afg,1965,0.0,9938414.0,Afghanistan,0.0
2,afg,1970,0.0,11126123.0,Afghanistan,0.0
3,afg,1975,0.0,12590286.0,Afghanistan,0.0
4,afg,1976,0.0,12840299.0,Afghanistan,0.0
5,afg,1977,0.0,13067538.0,Afghanistan,0.0
6,afg,1978,0.0,13237734.0,Afghanistan,0.0
7,afg,1979,0.0,13306695.0,Afghanistan,0.0
8,afg,1980,0.0,13248370.0,Afghanistan,0.0
9,afg,1981,0.0,13053954.0,Afghanistan,0.0


In [68]:
cell_phones_by_pop_updated.head(3)

Unnamed: 0,geo,time,cell_phones_total,population_total,country,mean
0,afg,1960,0.0,8996351.0,Afghanistan,0.0
1,afg,1965,0.0,9938414.0,Afghanistan,0.0
2,afg,1970,0.0,11126123.0,Afghanistan,0.0



In [69]:
see_mean_value = cell_phones_by_pop_updated.loc[cell_phones_by_pop_updated['country'] == 'United States'].sort_values(by='time', ascending=False)
see_mean_value.head(5)

Unnamed: 0,geo,time,cell_phones_total,population_total,country,mean
8134,usa,2017,395881000.0,324459463.0,United States,1.2201246847283354
8133,usa,2016,395881000.0,322179605.0,United States,1.228758722948959
8132,usa,2015,382307000.0,319929162.0,United States,1.1949739048796058
8131,usa,2014,355500000.0,317718779.0,United States,1.118914031833164
8130,usa,2013,310698000.0,315536676.0,United States,0.9846652501340288
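The direction of the division matters for this feature. A minimal sketch using two rows copied from the outputs in this notebook (the column name `cell_phones_per_person` is an illustrative choice, not the name used in the cells here):

```python
import pandas as pd

rows = pd.DataFrame({
    'geo': ['afg', 'usa'],
    'time': [1960, 2017],
    'cell_phones_total': [0.0, 395881000.0],
    'population_total': [8996351.0, 324459463.0],
})

# Cell phones per person: phones in the numerator, people in the denominator.
rows['cell_phones_per_person'] = rows['cell_phones_total'] / rows['population_total']
print(rows['cell_phones_per_person'].round(3).tolist())  # [0.0, 1.22]

# The inverse ratio (people per cell phone) divides by zero wherever a
# country has no cell phones, which pandas reports as inf.
people_per_phone = rows['population_total'] / rows['cell_phones_total']
print(people_per_phone.iloc[0])  # inf
```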


Modify the `geo` column to make the geo codes uppercase instead of lowercase.

In [0]:
# cell_phones_by_pop_nona.dtypes

In [0]:
# cell_phones_by_pop_nona['geo'] = cell_phones_by_pop_nona.loc[:0].astype('|S')


In [27]:
cell_phones_by_pop_updated['geo'] = cell_phones_by_pop_updated['geo'].str.upper()
# cell_phones_by_pop_nona['geo'] = cell_phones_by_pop_nona['geo'].str.upper()
# cell_phones_by_pop_nona.head(3)
cell_phones_by_pop_updated.head(3)

Unnamed: 0,geo,time,cell_phones_total,population_total,country,mean
0,AFG,1960,0.0,8996351.0,Afghanistan,0.0
1,AFG,1965,0.0,9938414.0,Afghanistan,0.0
2,AFG,1970,0.0,11126123.0,Afghanistan,0.0
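The `.str` accessor applies a string method to every element of a column. A tiny sketch on a toy column of geo codes:

```python
import pandas as pd

codes = pd.DataFrame({'geo': ['afg', 'usa', 'chn']})

# .str.upper() uppercases each value; assign back to replace the column.
codes['geo'] = codes['geo'].str.upper()
print(codes['geo'].tolist())  # ['AFG', 'USA', 'CHN']
```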


In [0]:
# cell_phones_by_pop_nona.dtypes

## Part 3. Process data

Use the `describe` function to describe your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)

In [29]:
cell_phones_by_pop_updated.describe()

Unnamed: 0,time,cell_phones_total,population_total,mean
count,8590.0,8590.0,8590.0,8590.0
mean,1994.193481,9004950.0,29838230.0,inf
std,14.257975,55734080.0,116128400.0,
min,1960.0,0.0,4433.0,0.401567
25%,1983.0,0.0,1456148.0,2.168496
50%,1995.0,6200.0,5725062.0,639.538864
75%,2006.0,1697652.0,18105810.0,inf
max,2017.0,1474097000.0,1409517000.0,inf


In [36]:
cell_phones_by_pop_updated.nunique()

geo                   195
time                   46
cell_phones_total    4491
population_total     8589
country               195
mean                 4836
dtype: int64
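The cell above counts distinct values with `nunique()`. `describe` can also summarize the non-numeric columns directly when asked to exclude numeric dtypes. A minimal sketch on toy rows:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'geo': ['AFG', 'AFG', 'USA'],
    'time': [1960, 1965, 2017],
    'country': ['Afghanistan', 'Afghanistan', 'United States'],
})

# Default: summary statistics for numeric columns only.
numeric_summary = toy.describe()

# Excluding numeric dtypes leaves the text columns, summarized with
# count, unique, top, and freq.
text_summary = toy.describe(exclude=[np.number])

print(numeric_summary.loc[['min', 'max'], 'time'].tolist())  # [1960.0, 2017.0]
print(text_summary.loc['unique', 'country'])                 # 2
```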

In [30]:
cell_phones_by_pop_updated.head(3)

Unnamed: 0,geo,time,cell_phones_total,population_total,country,mean
0,AFG,1960,0.0,8996351.0,Afghanistan,0.0
1,AFG,1965,0.0,9938414.0,Afghanistan,0.0
2,AFG,1970,0.0,11126123.0,Afghanistan,0.0


In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |



In [49]:
just_2017 = cell_phones_by_pop_updated.loc[cell_phones_by_pop_updated['time'] == 2017].sort_values(by='cell_phones_total', ascending=False)
just_2017.head(5)

Unnamed: 0,geo,time,cell_phones_total,population_total,country,mean
1496,CHN,2017,1474097000.0,1409517397.0,China,0.9561903979181832
3595,IND,2017,1168902277.0,1339180127.0,India,1.1456732982307296
3549,IDN,2017,458923202.0,263991379.0,Indonesia,0.5752408635029091
8134,USA,2017,395881000.0,324459463.0,United States,0.8195883687269659
1084,BRA,2017,236488548.0,209288278.0,Brazil,0.8849827180637939
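`nlargest` is a compact alternative to sorting and taking the head. A sketch using 2017 totals copied from the outputs in this notebook (seven countries only, so it illustrates the call rather than reproducing the ranking over all 195):

```python
import pandas as pd

totals_2017 = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Brazil', 'China',
                'India', 'Indonesia', 'United States'],
    'time': [2017] * 7,
    'cell_phones_total': [23929713.0, 3497950.0, 236488548.0, 1474097000.0,
                          1168902277.0, 458923202.0, 395881000.0],
})

# Filter to the year of interest, then keep the 5 largest totals.
top5 = (totals_2017[totals_2017['time'] == 2017]
        .nlargest(5, 'cell_phones_total'))
print(top5['country'].tolist())
# ['China', 'India', 'Indonesia', 'United States', 'Brazil']
```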


2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?

In [0]:
# Rows (for any country) with strictly more cell phones than people.
first_year_USA_got_ahead = cell_phones_by_pop_updated.loc[cell_phones_by_pop_updated['cell_phones_total'] > cell_phones_by_pop_updated['population_total']]
first_year_USA_got_ahead

In [53]:
first_year_USA_got_ahead_find_year = first_year_USA_got_ahead.loc[first_year_USA_got_ahead['country'] == 'United States'].sort_values(by='time', ascending=True)
first_year_USA_got_ahead_find_year.head(5)

Unnamed: 0,geo,time,cell_phones_total,population_total,country,mean
8131,USA,2014,355500000.0,317718779.0,United States,1.118914031833164
8132,USA,2015,382307000.0,319929162.0,United States,1.1949739048796058
8133,USA,2016,395881000.0,322179605.0,United States,1.228758722948959
8134,USA,2017,395881000.0,324459463.0,United States,1.2201246847283354


2014 was the first year that the USA had more cell phones than people.
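The same answer can be reached without reading the table by eye. A sketch using the United States rows for 2013 to 2017, copied from the outputs in this notebook:

```python
import pandas as pd

usa = pd.DataFrame({
    'time': [2013, 2014, 2015, 2016, 2017],
    'cell_phones_total': [310698000.0, 355500000.0, 382307000.0,
                          395881000.0, 395881000.0],
    'population_total': [315536676.0, 317718779.0, 319929162.0,
                         322179605.0, 324459463.0],
})

# Keep the years with strictly more phones than people, then take the earliest.
more_phones = usa[usa['cell_phones_total'] > usa['population_total']]
first_year = more_phones['time'].min()
print(first_year)  # 2014
```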

## Part 4. Reshape data

Create a pivot table:
- Columns: Years 2007—2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

In [54]:
cell_phones_by_pop_updated.head(3)

Unnamed: 0,geo,time,cell_phones_total,population_total,country,mean
0,AFG,1960,0.0,8996351.0,Afghanistan,0.0
1,AFG,1965,0.0,9938414.0,Afghanistan,0.0
2,AFG,1970,0.0,11126123.0,Afghanistan,0.0


In [70]:
cell_phones_by_pop_updated_pivot = cell_phones_by_pop_updated.pivot_table(index='country',columns='time',values='cell_phones_total')
cell_phones_by_pop_updated_pivot.head(5)

country,1960,1965,1970,1975,1976,1977,1978,1979,1980,1981,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7898909.0,10500000.0,10215840.0,13797879.0,15340115.0,16807156.0,18407168.0,19709038.0,21602982.0,23929713.0
Albania,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1859632.0,2463741.0,2692372.0,3100000.0,3500000.0,3685983.0,3359654.0,3400955.0,3369756.0,3497950.0
Algeria,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,27031472.0,32729824.0,32780165.0,35615926.0,37527703.0,39517045.0,43298174.0,43227643.0,47041321.0,49873389.0
Andorra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,64202.0,64549.0,65495.0,65044.0,63865.0,63931.0,66241.0,71336.0,76132.0,80337.0
Angola,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6773356.0,8109421.0,9403365.0,12073218.0,12785109.0,13285198.0,14052558.0,13884532.0,13001124.0,13323952.0


In [87]:

cell_phones_by_pop_updated_pivot_selection = cell_phones_by_pop_updated_pivot.loc[:,2007:2017]
cell_phones_by_pop_updated_pivot_selection.shape

(195, 11)

In [0]:
cell_phones_by_pop_updated_pivot.columns.tolist()

In [89]:
only_countries_to_see = ['China', 'India', 'United States', 'Indonesia', 'Brazil']
cell_phones_by_pop_updated_pivot_segmented = cell_phones_by_pop_updated_pivot_selection.loc[only_countries_to_see]
cell_phones_by_pop_updated_pivot_segmented

country,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
China,547306000.0,641245000.0,747214000.0,859003000.0,986253000.0,1112155000.0,1229113000.0,1286093000.0,1291984200.0,1364934000.0,1474097000.0
India,233620000.0,346890000.0,525090000.0,752190000.0,893862478.0,864720917.0,886304245.0,944008677.0,1001056000.0,1127809000.0,1168902277.0
United States,249300000.0,261300000.0,274283000.0,285118000.0,297404000.0,304838000.0,310698000.0,355500000.0,382307000.0,395881000.0,395881000.0
Indonesia,93386881.0,140578243.0,163676961.0,211290235.0,249805619.0,281963665.0,313226914.0,325582819.0,338948340.0,385573398.0,458923202.0
Brazil,120980103.0,150641403.0,169385584.0,196929978.0,234357507.0,248323703.0,271099799.0,280728796.0,257814274.0,244067356.0,236488548.0


In [90]:
cell_phones_by_pop_updated_pivot_segmented.shape

(5, 11)
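An alternative is to filter the long-format rows first and pivot afterwards, so the pivot table only ever holds the rows of interest. A sketch with a few values copied from the outputs in this notebook (two countries and two years rather than the full 5 by 11 table):

```python
import pandas as pd

long_rows = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'China', 'China', 'India', 'India'],
    'time': [1960, 2017, 2007, 2017, 2007, 2017],
    'cell_phones_total': [0.0, 3497950.0, 547306000.0, 1474097000.0,
                          233620000.0, 1168902277.0],
})

countries = ['China', 'India']

# Keep only the wanted countries and years (between is inclusive on both ends).
subset = long_rows[long_rows['country'].isin(countries)
                   & long_rows['time'].between(2007, 2017)]

# One row per country, one column per year.
pivot = subset.pivot_table(index='country', columns='time',
                           values='cell_phones_total')
print(pivot.shape)  # (2, 2)
```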

#### OPTIONAL BONUS QUESTION!

Sort these 5 countries, by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?
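One way to approach the bonus question is to subtract the 2007 column from the 2017 column and sort the result. A sketch using the 2007 and 2017 totals copied from the pivot table output in Part 4 (the name `increase` is an illustrative choice):

```python
import pandas as pd

# 2007 and 2017 totals copied from the pivot table output in Part 4.
pivot = pd.DataFrame(
    {2007: [547306000.0, 233620000.0, 249300000.0, 93386881.0, 120980103.0],
     2017: [1474097000.0, 1168902277.0, 395881000.0, 458923202.0, 236488548.0]},
    index=['China', 'India', 'United States', 'Indonesia', 'Brazil'],
)

# Increase per country, largest first.
increase = (pivot[2017] - pivot[2007]).sort_values(ascending=False)
print(increase.index.tolist())
# ['India', 'China', 'Indonesia', 'United States', 'Brazil']
print(int(increase.iloc[0]))  # 935282277
```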

If you have the time and curiosity, what other questions can you ask and answer with this data?