<a href="https://colab.research.google.com/github/connorpheraty/DS-Unit-1-Sprint-2-Data-Wrangling/blob/master/Connor_Heraty_DS_Unit_1_Sprint_Challenge_2_Data_Wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 1 Sprint Challenge 2

## Data Wrangling

In this Sprint Challenge you will use data from [Gapminder](https://www.gapminder.org/about-gapminder/), a Swedish non-profit co-founded by Hans Rosling. "Gapminder produces free teaching resources making the world understandable based on reliable statistics."
- [Cell phones (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv)
- [Population (total), by country and year](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)
- [Geo country codes](https://github.com/open-numbers/ddf--gapminder--systema_globalis/blob/master/ddf--entities--geo--country.csv)

These two links have everything you need to successfully complete the Sprint Challenge!
- [Pandas documentation: Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html) (one question)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) (everything else)

## Part 0. Load data

You don't need to add or change anything here. Just run this cell and it loads the data for you, into three dataframes.

In [0]:
import pandas as pd

cell_phones = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--cell_phones_total--by--geo--time.csv')

population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')

geo_country_codes = (pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')
                       .rename(columns={'country': 'geo', 'name': 'country'}))

## Part 1. Join data

First, join the `cell_phones` and `population` dataframes (with an inner join on `geo` and `time`).

The resulting dataframe's shape should be: (8590, 4)

In [345]:
cell_phones.head(1)

Unnamed: 0,geo,time,cell_phones_total
0,abw,1960,0.0


In [0]:
# No-op: this assigns the column to itself, so the dataframe is unchanged
cell_phones['cell_phones_total'] = cell_phones['cell_phones_total']

In [347]:
cell_phones.head(1)

Unnamed: 0,geo,time,cell_phones_total
0,abw,1960,0.0


In [348]:
population.head(1)

Unnamed: 0,geo,time,population_total
0,afg,1800,3280000


In [349]:
geo_country_codes.head(1)

Unnamed: 0,geo,alt_5,alternative_1,alternative_2,alternative_3,alternative_4_cdiac,arb1,arb2,arb3,arb4,...,latitude,longitude,main_religion_2008,country,pandg,un_state,unicode_region_subtag,upper_case_name,world_4region,world_6region
0,abkh,,,,,,,,,,...,,,,Abkhazia,,False,,,europe,europe_central_asia


In [0]:
cell_df = pd.merge(cell_phones, population, on=['geo','time'])

In [351]:
cell_df.head()

Unnamed: 0,geo,time,cell_phones_total,population_total
0,afg,1960,0.0,8996351
1,afg,1965,0.0,9938414
2,afg,1970,0.0,11126123
3,afg,1975,0.0,12590286
4,afg,1976,0.0,12840299


In [352]:
cell_df.shape

(8590, 4)
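
As a sanity check on how the inner join behaves, here is a minimal sketch on toy data. The frames `phones` and `people` and all of their values are hypothetical; only the column names mirror the real dataframes. An inner join keeps a row only when the `(geo, time)` pair appears in both inputs.

```python
import pandas as pd

# Toy frames with hypothetical values; column names mirror the real data
phones = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb'],
    'time': [2000, 2001, 2000],
    'cell_phones_total': [10.0, 20.0, 5.0],
})
people = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'ccc'],
    'time': [2000, 2002, 2000],
    'population_total': [100, 110, 50],
})

# how='inner' is the default for pd.merge; spelling it out documents the intent
joined = pd.merge(phones, people, on=['geo', 'time'], how='inner')

# Only ('aaa', 2000) exists in both frames, so a single row survives
print(joined)
```

This is why the joined dataframe starts in 1960 even though `population` goes back to 1800: years without cell phone data have no match and are dropped.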

Then, select the `geo` and `country` columns from the `geo_country_codes` dataframe, and join with your population and cell phone data.

The resulting dataframe's shape should be: (8590, 5)

I also kept the **`world_4region`** and **`world_6region`** columns to explore the relationship between region/subregion and cell phone adoption, so my dataframe has two more columns than the expected shape.

In [0]:
cell_df = cell_df.merge(geo_country_codes[['geo','country','world_4region','world_6region']])

In [0]:
cell_df = cell_df.rename({'world_4region':'region','world_6region':'subregion'},axis=1)

In [355]:
# Shape is (8590, 7) rather than (8590, 5) because the region and subregion features were added
cell_df.shape

(8590, 7)
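
One way to guard a lookup-table join like this is the `validate` argument of `merge`. The sketch below uses hypothetical toy frames (`data`, `codes`) and made-up country names; it assumes each `geo` code should appear only once in the lookup table.

```python
import pandas as pd

# Toy frames with hypothetical values
data = pd.DataFrame({
    'geo': ['aaa', 'aaa', 'bbb'],
    'time': [2000, 2001, 2000],
    'cell_phones_total': [10.0, 20.0, 5.0],
})
codes = pd.DataFrame({
    'geo': ['aaa', 'bbb'],
    'country': ['Aland', 'Bland'],
    'unused_column': [1, 2],
})

# Select only the needed columns, name the join key explicitly, and let
# validate='many_to_one' raise a MergeError if a geo code is duplicated in codes
labeled = data.merge(codes[['geo', 'country']], on='geo', how='inner',
                     validate='many_to_one')
print(labeled)
```

Because each code matches exactly one country, the row count is unchanged and only the `country` column is added.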

## Part 2. Make features

Calculate the number of cell phones per person, and add this column onto your dataframe.

(You've calculated correctly if you get 1.220 cell phones per person in the United States in 2017.)

In [0]:
# Create cell_phones_per_person column using column division
cell_df['cell_phones_per_person'] = cell_df['cell_phones_total'] / cell_df['population_total']

In [0]:
# Round the calculation to 4 decimal places
cell_df['cell_phones_per_person'] = cell_df['cell_phones_per_person'].round(decimals=4)

In [358]:
# Locate row containing 2017 United States data
cell_df.loc[cell_df['country'] == 'United States'][::-1].head(1)

Unnamed: 0,geo,time,cell_phones_total,population_total,country,region,subregion,cell_phones_per_person
8134,usa,2017,395881000.0,324459463,United States,americas,america,1.2201
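
The row lookup above relies on 2017 being the last United States row. A more explicit alternative is to filter on both country and year. This is a minimal sketch using the 2016 and 2017 United States figures shown in this notebook; the frame name `us` is only for illustration.

```python
import pandas as pd

# United States rows for 2016 and 2017, values taken from the outputs in this notebook
us = pd.DataFrame({
    'country': ['United States', 'United States'],
    'time': [2016, 2017],
    'cell_phones_total': [395881000.0, 395881000.0],
    'population_total': [322179605, 324459463],
})

# Element-wise column division gives cell phones per person
us['cell_phones_per_person'] = us['cell_phones_total'] / us['population_total']

# Combine two conditions with & (each wrapped in parentheses) to pick one row
mask = (us['country'] == 'United States') & (us['time'] == 2017)
us_2017 = us.loc[mask, 'cell_phones_per_person'].round(4).iloc[0]
print(us_2017)
```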


Modify the `geo` column to make the geo codes uppercase instead of lowercase.

In [0]:
# Use the .str.upper() method to uppercase the `geo` column
cell_df['geo'] = cell_df['geo'].str.upper()

In [360]:
cell_df['geo'].head()

0    AFG
1    AFG
2    AFG
3    AFG
4    AFG
Name: geo, dtype: object
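
For reference, a small sketch of how the `.str` accessor behaves, on a hypothetical Series that also contains a missing value. The accessor applies the string method element by element and leaves missing entries missing instead of raising an error.

```python
import pandas as pd

# Hypothetical geo codes, including one missing value
geo = pd.Series(['afg', 'usa', None], name='geo')

# .str.upper() is vectorized over the Series; the missing entry stays missing
upper = geo.str.upper()
print(upper)
```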

## Part 3. Process data

Use the describe function to summarize your dataframe's numeric columns, and then its non-numeric columns.

(You'll see the time period ranges from 1960 to 2017, and there are 195 unique countries represented.)

In [361]:
# Excluding non-numeric columns
cell_df.describe(exclude='object')

Unnamed: 0,time,cell_phones_total,population_total,cell_phones_per_person
count,8590.0,8590.0,8590.0,8590.0
mean,1994.1934807916184,9004949.642905472,29838230.581722934,0.2796383119906869
std,14.257974607310302,55734084.87217964,116128377.474773,0.4542467913882122
min,1960.0,0.0,4433.0,0.0
25%,1983.0,0.0,1456148.0,0.0
50%,1995.0,6200.0,5725062.5,0.0016
75%,2006.0,1697652.0,18105812.0,0.461125
max,2017.0,1474097000.0,1409517397.0,2.4902


In [362]:
# Excluding numeric columns
cell_df.describe(exclude='number')

Unnamed: 0,geo,country,region,subregion
count,8590,8590,8590,8590
unique,195,195,4,6
top,NOR,Uzbekistan,asia,europe_central_asia
freq,46,46,2485,2324
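
The same split can be written with `include` for the numeric half, which some readers find easier to follow than two `exclude` calls. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Tiny frame with one text column and two numeric columns
df = pd.DataFrame({
    'geo': ['AFG', 'AFG', 'USA'],
    'time': [1960, 1965, 2017],
    'cell_phones_total': [0.0, 0.0, 395881000.0],
})

# Numeric columns: count, mean, std, min, quartiles, max
numeric_summary = df.describe(include=['number'])

# Everything else: count, unique, top, freq
text_summary = df.describe(exclude=['number'])

print(numeric_summary)
print(text_summary)
```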


In 2017, what were the top 5 countries with the most cell phones total?

Your list of countries should have these totals:

| country | cell phones total |
|:-------:|:-----------------:|
|    ?    |     1,474,097,000 |
|    ?    |     1,168,902,277 |
|    ?    |       458,923,202 |
|    ?    |       395,881,000 |
|    ?    |       236,488,548 |



In [0]:
# This optional code formats float numbers with comma separators
pd.options.display.float_format = '{:,}'.format

In [0]:
# Create sub dataframe that includes only the year 2017
cell_df_2017 = cell_df.loc[cell_df['time'] == 2017]

In [0]:
# Use the groupby operation that sums the cell phone yearly totals for each country
country_grouped = (cell_df_2017
            .groupby(['country'])
            .cell_phones_total.agg(['sum'])
            .rename(columns={'sum':'cell phones total'})
            .reset_index())

In [366]:
# Use the sort values function to sort by cell phones total 
country_grouped.sort_values(by=['cell phones total'], ascending=False).head(5)

Unnamed: 0,country,cell phones total
31,China,1474097000.0
67,India,1168902277.0
68,Indonesia,458923202.0
160,United States,395881000.0
21,Brazil,236488548.0
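
Since the 2017 subset has one row per country, the groupby is not strictly needed. As an alternative sketch, `DataFrame.nlargest` sorts and truncates in one step. The frame below holds the five 2017 totals from the output above, and the name `totals_2017` is only for illustration.

```python
import pandas as pd

# 2017 totals copied from the output above
totals_2017 = pd.DataFrame({
    'country': ['Brazil', 'China', 'India', 'Indonesia', 'United States'],
    'cell_phones_total': [236488548.0, 1474097000.0, 1168902277.0,
                          458923202.0, 395881000.0],
})

# nlargest(n, column) returns the n rows with the largest values, sorted descending
top3 = totals_2017.nlargest(3, 'cell_phones_total')
print(top3)
```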


2017 was the first year that China had more cell phones than people.

What was the first year that the USA had more cell phones than people?

In [0]:
# Create sub dataframe that includes only rows where the country equals the United States
us_cell_df = cell_df.loc[cell_df['country'] == 'United States']

2014 was the first year that the USA had more cell phones than people!

In [368]:
# Use .loc to keep only the rows where the cell phones total is greater than the population total
us_cell_df.loc[us_cell_df['cell_phones_total'] > us_cell_df['population_total']]

Unnamed: 0,geo,time,cell_phones_total,population_total,country,region,subregion,cell_phones_per_person
8131,USA,2014,355500000.0,317718779,United States,americas,america,1.1189
8132,USA,2015,382307000.0,319929162,United States,americas,america,1.195
8133,USA,2016,395881000.0,322179605,United States,americas,america,1.2288
8134,USA,2017,395881000.0,324459463,United States,americas,america,1.2201
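
To get the year directly instead of reading it off the table, the filter can be followed by `.min()` on the `time` column. This sketch uses a handful of United States rows whose values appear elsewhere in this notebook; the frame name `us` is only for illustration.

```python
import pandas as pd

# United States rows with values taken from the outputs in this notebook
us = pd.DataFrame({
    'time': [2007, 2014, 2015, 2016, 2017],
    'cell_phones_total': [249300000.0, 355500000.0, 382307000.0,
                          395881000.0, 395881000.0],
    'population_total': [300595175, 317718779, 319929162,
                         322179605, 324459463],
})

# Build the mask from the same frame it filters, then take the earliest year
more_phones = us['cell_phones_total'] > us['population_total']
first_year = us.loc[more_phones, 'time'].min()
print(first_year)
```

Building the mask from the frame being filtered avoids relying on index alignment between two different dataframes.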


## Part 4. Reshape data

Create a pivot table:
- Columns: Years 2007—2017
- Rows: China, India, United States, Indonesia, Brazil (order doesn't matter)
- Values: Cell Phones Total

The table's shape should be: (5, 11)

In [0]:
# Create country names list and use as a condition to create a subset dataframe that only includes rows with the relevant countries

country_names = ['China','India','United States','Indonesia','Brazil']
condition = cell_df['country'].isin(country_names)
subset_test = cell_df[condition]

In [0]:
# Use .loc to keep only the years 2007-2017 (the data ends in 2017)
subset_final = subset_test.loc[subset_test['time'] > 2006]

In [371]:
# Create pivot table with required elements
country_2007_2017_pt = subset_final.pivot_table(index='country', columns='time', values='cell_phones_total')
country_2007_2017_pt

country,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Brazil,120980103.0,150641403.0,169385584.0,196929978.0,234357507.0,248323703.0,271099799.0,280728796.0,257814274.0,244067356.0,236488548.0
China,547306000.0,641245000.0,747214000.0,859003000.0,986253000.0,1112155000.0,1229113000.0,1286093000.0,1291984200.0,1364934000.0,1474097000.0
India,233620000.0,346890000.0,525090000.0,752190000.0,893862478.0,864720917.0,886304245.0,944008677.0,1001056000.0,1127809000.0,1168902277.0
Indonesia,93386881.0,140578243.0,163676961.0,211290235.0,249805619.0,281963665.0,313226914.0,325582819.0,338948340.0,385573398.0,458923202.0
United States,249300000.0,261300000.0,274283000.0,285118000.0,297404000.0,304838000.0,310698000.0,355500000.0,382307000.0,395881000.0,395881000.0


In [372]:
# Check to ensure proper shape
country_2007_2017_pt.shape

(5, 11)
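
For readers new to reshaping, here is a minimal sketch of what `pivot_table` does, using the 2007 and 2017 totals for China and India from the table above. The frame name `long_df` is only for illustration.

```python
import pandas as pd

# Long format: one row per (country, year), values copied from the table above
long_df = pd.DataFrame({
    'country': ['China', 'China', 'India', 'India'],
    'time': [2007, 2017, 2007, 2017],
    'cell_phones_total': [547306000.0, 1474097000.0,
                          233620000.0, 1168902277.0],
})

# Wide format: one row per country, one column per year.
# pivot_table aggregates with the mean by default, which changes nothing here
# because each (country, year) pair occurs exactly once.
wide = long_df.pivot_table(index='country', columns='time',
                           values='cell_phones_total')
print(wide)
```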

In [373]:
# Line graph plotting the relationship between time and total cell phones in each country
import altair as alt

alt.Chart(subset_final,height=400,width=700).mark_line().encode(
    x='time',
    y='cell_phones_total',
    color='country'
)

#### OPTIONAL BONUS QUESTION!

Sort these 5 countries, by biggest increase in cell phones from 2007 to 2017.

Which country had 935,282,277 more cell phones in 2017 versus 2007?

In [0]:
# Create sub-dataframes including only the years 2007 and 2017
_2007_test = subset_test.loc[subset_test['time'] == 2007]
_2017_test = subset_test.loc[subset_test['time'] == 2017]

In [0]:
# Rename columns to avoid confusion
_2007_test = _2007_test.rename({'cell_phones_total': '2007 Cell Phone Total'}, axis=1)
_2017_test = _2017_test.rename({'cell_phones_total': '2017 Cell Phone Total'}, axis=1)

In [376]:
_2007_test

Unnamed: 0,geo,time,2007 Cell Phone Total,population_total,country,region,subregion,cell_phones_per_person
1074,BRA,2007,120980103.0,191026637,Brazil,americas,america,0.6333
1486,CHN,2007,547306000.0,1336800506,China,asia,east_asia_pacific,0.4094
3539,IDN,2007,93386881.0,232989141,Indonesia,asia,east_asia_pacific,0.4008
3585,IND,2007,233620000.0,1179681239,India,asia,south_asia,0.198
8124,USA,2007,249300000.0,300595175,United States,americas,america,0.8294


In [0]:
# Reduce sub-dataframes to only include relevant information
_2007_test = _2007_test[['country', '2007 Cell Phone Total']]
_2017_test = _2017_test[['country', '2017 Cell Phone Total']]

In [0]:
# Merge sub-dataframes together
bonus_merge = _2007_test.merge(_2017_test)

In [379]:
# Create Cell Phone Total Increase column to see which country had the largest nominal increase in cell phones
bonus_merge['Cell Phone Total Increase'] = bonus_merge['2017 Cell Phone Total'] - bonus_merge['2007 Cell Phone Total']
bonus_merge

Unnamed: 0,country,2007 Cell Phone Total,2017 Cell Phone Total,Cell Phone Total Increase
0,Brazil,120980103.0,236488548.0,115508445.0
1,China,547306000.0,1474097000.0,926791000.0
2,Indonesia,93386881.0,458923202.0,365536321.0
3,India,233620000.0,1168902277.0,935282277.0
4,United States,249300000.0,395881000.0,146581000.0


**India** had the largest absolute increase in cell phones from 2007 to 2017, with 935,282,277 more, which answers the second question!
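
A shorter route to the same ranking is to subtract the year columns of a wide table, which skips the separate filter, rename, and merge steps. This is a sketch using the 2007 and 2017 totals from the outputs above; the frame name `totals` is only for illustration.

```python
import pandas as pd

# 2007 and 2017 totals from the outputs above, indexed by country
totals = pd.DataFrame(
    {2007: [120980103, 547306000, 233620000, 93386881, 249300000],
     2017: [236488548, 1474097000, 1168902277, 458923202, 395881000]},
    index=['Brazil', 'China', 'India', 'Indonesia', 'United States'],
)

# Column subtraction aligns on the country index; sort largest increase first
increase = (totals[2017] - totals[2007]).sort_values(ascending=False)
print(increase)
```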

In [380]:
# Create Cell Phone Percentage Increase column (stored as a fraction, so 4.0 means 400%) to see which country had the largest relative increase in cell phones
bonus_merge['Cell Phone Percentage Increase'] = ((bonus_merge['2017 Cell Phone Total'] - bonus_merge['2007 Cell Phone Total']) / bonus_merge['2007 Cell Phone Total'])
bonus_merge['Cell Phone Percentage Increase'] = bonus_merge['Cell Phone Percentage Increase'].round(decimals=4)
bonus_merge

Unnamed: 0,country,2007 Cell Phone Total,2017 Cell Phone Total,Cell Phone Total Increase,Cell Phone Percentage Increase
0,Brazil,120980103.0,236488548.0,115508445.0,0.9548
1,China,547306000.0,1474097000.0,926791000.0,1.6934
2,Indonesia,93386881.0,458923202.0,365536321.0,3.9142
3,India,233620000.0,1168902277.0,935282277.0,4.0034
4,United States,249300000.0,395881000.0,146581000.0,0.588


**India** also had the largest percentage increase in cell phones from 2007 to 2017 with an increase of 400%!
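
Note that the column above holds a fraction rather than a percentage. A small sketch of the arithmetic for India, with Python's `%` format specifier doing the conversion (the variable names are only for illustration):

```python
# India's totals from the table above
old_total = 233620000
new_total = 1168902277

# Relative increase as a fraction: (new - old) / old
fraction = (new_total - old_total) / old_total

# The '%' format type multiplies by 100 and appends a percent sign
label = '{:.1%}'.format(fraction)
print(round(fraction, 4), label)
```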

If you have the time and curiosity, what other questions can you ask and answer with this data?

Using the features I added earlier, I am going to plot total cell phones over time for each subregion in our dataset.
- We can infer that around 2005-2010 the cell phone market in developed markets had begun to mature.
- Growth in the East Asia and Pacific subregion (`east_asia_pacific`) exploded around the mid-2000s and has continued at a high rate year over year.
- Around 2015, the total number of cell phones starts to decrease in every subregion outside East and South Asia.

In [0]:
sub_region_df = (cell_df
            .groupby(['subregion','time'])
            .cell_phones_total.agg(['sum'])
            .rename(columns={'sum':'Total Cell Phones'})
            .reset_index())

In [382]:
alt.Chart(sub_region_df,height=400,width=700).mark_line().encode(
    x='time',
    y='Total Cell Phones',
    color='subregion'
)
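
The aggregation feeding this chart can be checked on toy data. The sketch below uses hypothetical subregion names and values, and passes `as_index=False` so the group keys stay as ordinary columns, which stands in for the separate `agg`, `rename`, and `reset_index` steps.

```python
import pandas as pd

# Toy rows with hypothetical values: two countries in 'a' for 2000, one in 'b' for two years
df = pd.DataFrame({
    'subregion': ['a', 'a', 'b', 'b'],
    'time': [2000, 2000, 2000, 2001],
    'cell_phones_total': [1.0, 2.0, 3.0, 4.0],
})

# One row per (subregion, year) holding the summed cell phone totals
totals = (df
          .groupby(['subregion', 'time'], as_index=False)['cell_phones_total']
          .sum())
print(totals)
```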