# Grouping your data


In [1]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

import matplotlib
matplotlib.rcParams['axes.grid'] = True # show gridlines by default
%matplotlib inline

import pandas as pd
import pandas_datareader as pdr

In last week's modules, you saw how to merge two datasets containing a common column to create a
single, combined dataset. Combining datasets allows us to make comparisons across
datasets, as you discovered when looking for correlations between GDP and life
expectancy.
In this week's modules, you’ll learn how to go the other way, separating out distinct ‘subsets’ or groups
of data, before summarising them individually.
As well as splitting out different groups of data, row and column values can be rearranged
to reshape a dataset and allow the creation of a wide range of pivot table style reports
from a single data table.

In this week’s tasks, you’ll learn how a single line of code can be used to generate a
wide variety of pivot table style reports of your own.
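
As a small taste of what such a report looks like, here is a minimal sketch using the pandas `pivot_table()` function. The table, its column names and its values are made up purely for illustration.

```python
import pandas as pd

# Made-up illustrative data: the names and values are assumptions
trade = pd.DataFrame({
    'partner': ['A', 'A', 'B', 'B', 'B'],
    'flow': ['Imports', 'Exports', 'Imports', 'Imports', 'Exports'],
    'value': [10, 20, 30, 40, 50],
})

# One line: rows are partners, columns are trade flows, cells are totals
report = pd.pivot_table(trade, index='partner', columns='flow',
                        values='value', aggfunc='sum')
print(report)
```

Changing the `index`, `columns` or `aggfunc` arguments produces a differently shaped report from the same table.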

One of the ways you were shown last week for loading World Bank data into the notebook
was to use the **download()** function.
One way to find out for yourself what sorts of argument a function expects is to ask it.
Running a code cell containing a question mark (?) followed by a function name should
pop up a help area at the bottom of the notebook window. (Close it using the x in the top
right-hand corner of the panel.)
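
The `?` shortcut is a notebook feature. In plain Python, calling the built-in `help()` function on a function prints its full documentation, and the standard library's `inspect` module can report just the arguments. A minimal sketch, using `pd.read_csv` purely as an example function:

```python
import inspect
import pandas as pd

# The arguments the function expects (only the first five are printed here)
signature = inspect.signature(pd.read_csv)
print(list(signature.parameters)[:5])

# The first line of the function's documentation
first_line = inspect.getdoc(pd.read_csv).splitlines()[0]
print(first_line)
```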

In [2]:
if pd.__version__.startswith('0.23'):
    # this solves an incompatibility between pandas 0.23 and datareader 0.6
    # taken from https://stackoverflow.com/questions/50394873/
    pd.core.common.is_list_like = pd.api.types.is_list_like

from pandas_datareader.wb import download

In [3]:
?download

The function documentation tells you that you can enter a list of one or more country
names using standard country codes as well as a date range. You can also calculate a
date range from a single date to show the **N** years of data leading up to a particular year.

In [4]:
YEAR = 2013
GDP_INDICATOR = 'NY.GDP.MKTP.CD'
gdp = download(indicator=GDP_INDICATOR, country=['GB','CN'],
               start=YEAR-5, end=YEAR)
gdp = gdp.reset_index()
gdp

Unnamed: 0,country,year,NY.GDP.MKTP.CD
0,China,2013,9570406000000.0
1,China,2012,8532230000000.0
2,China,2011,7551500000000.0
3,China,2010,6087164000000.0
4,China,2009,5101703000000.0
5,China,2008,4594307000000.0
6,United Kingdom,2013,2783251000000.0
7,United Kingdom,2012,2704017000000.0
8,United Kingdom,2011,2659882000000.0
9,United Kingdom,2010,2481580000000.0


Although many datasets that you are likely to work with are published in the form of a
single data table, such as a single CSV file or spreadsheet worksheet, it is often possible
to regard the dataset as being made up from several distinct subsets of data.
In the above example, you will probably notice that each country name appears in several
rows, as does each year. This suggests that we can make different sorts of comparisons
between different groupings of data using just this dataset. For example, compare the
total GDP of each country calculated over the six years 2008 to 2013 using just a single
line of code:

In [5]:
gdp.groupby('country')['NY.GDP.MKTP.CD'].aggregate(sum)

country
China             4.143731e+13
United Kingdom    1.596255e+13
Name: NY.GDP.MKTP.CD, dtype: float64

Essentially what this does is to say ‘for each country, find the total GDP’.
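
That reading can be spelled out as an explicit loop. The sketch below is only an illustration of the idea, not how pandas works internally. To stay self-contained, it rebuilds a small dataframe (the name `gdp_sample` is an assumption) from four rows of the table shown earlier.

```python
import pandas as pd

# Four rows copied from the GDP table, to keep the example small
gdp_sample = pd.DataFrame({
    'country': ['China', 'China', 'United Kingdom', 'United Kingdom'],
    'year': [2013, 2012, 2013, 2012],
    'NY.GDP.MKTP.CD': [9570406000000.0, 8532230000000.0,
                       2783251000000.0, 2704017000000.0],
})

# Split the rows into one group per country, then total each group in turn
totals = {}
for country, rows in gdp_sample.groupby('country'):
    totals[country] = rows['NY.GDP.MKTP.CD'].sum()
print(totals)

# The same result from the single-line form
one_line = gdp_sample.groupby('country')['NY.GDP.MKTP.CD'].sum()
print(one_line)
```
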
The total combined GDP for those two countries in each year could be found by making
just one slight tweak to our code (can you see below where I made the change?):

In [6]:
gdp.groupby('year')['NY.GDP.MKTP.CD'].aggregate(sum)

year
2008    7.515739e+12
2009    7.514093e+12
2010    8.568743e+12
2011    1.021138e+13
2012    1.123625e+13
2013    1.235366e+13
Name: NY.GDP.MKTP.CD, dtype: float64

That second calculation probably doesn’t make much sense in this particular case, but
what if there was another column saying which region of the world each country was in?
Then, by taking the data for all the countries in the world, the total GDP could be found for
each region by grouping on both the year and the region.
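
Grouping on two columns at once can be sketched as follows. The countries, regions and GDP values here are made up for illustration, and the dataframe name `world_gdp` is an assumption.

```python
import pandas as pd

# Made-up illustrative data: countries, regions and values are assumptions
world_gdp = pd.DataFrame({
    'country': ['A', 'B', 'C', 'A', 'B', 'C'],
    'region': ['North', 'North', 'South', 'North', 'North', 'South'],
    'year': [2012, 2012, 2012, 2013, 2013, 2013],
    'gdp': [1.0, 2.0, 4.0, 1.5, 2.5, 5.0],
})

# Passing a list of column names groups on each combination of their values
regional = world_gdp.groupby(['year', 'region'])['gdp'].sum()
print(regional)
```

The result has one row for each (year, region) pair, so `regional[(2013, 'North')]` gives the North region's total for 2013.
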
Next, you will consider ways of grouping data.

## Ways of grouping data

Think back to the weather dataset you used in an earlier week. How might you group that data
into several distinct groups? What sorts of comparisons could you make by grouping just
the elements of that dataset? Or how might you group and compare the GDP data?

One thing newspapers love to report is weather ‘records’, such as the ‘hottest June
ever’ or the wettest location in a particular year as measured by total annual rainfall, or
highest average monthly rainfall. How easy is it to find that information out from the data?
Or with the GDP data, if countries were assigned to economic groupings such as the
European Union, or regional groupings such as Africa, or South America, how would you
generate information such as lowest GDP in the EU or highest GDP in South America?

You will learn how to split data into groups based on particular features of the
data, and then generate information about each separate group, across all of the groups,
at the same time.
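
As a rough sketch of the idea, the following groups some made-up rainfall readings by location, summarises every group in one step, and then picks out the 'record'. The column names and values are illustrative assumptions, not real weather data.

```python
import pandas as pd

# Made-up illustrative data: locations and rainfall values are assumptions
rain = pd.DataFrame({
    'location': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'month': ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar'],
    'rainfall_mm': [50, 70, 30, 80, 20, 40],
})

# Total and highest monthly rainfall for every location at the same time
summary = rain.groupby('location')['rainfall_mm'].agg(['sum', 'max'])
print(summary)

# The 'record': the location with the largest total rainfall
wettest = summary['sum'].idxmax()
print(wettest)
```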

**Activity: Grouping data**
    
Based on the data you have seen so far, or some other datasets you may be aware of,
what other ways of grouping data can you think of, and why might grouping data that
way be useful?

## Data that describes the world of trade

Let's look at what sorts of things different
countries actually export to the UK.
For example, it might surprise you that India was the world’s largest exporter by value of
unset diamonds in 2014 (24 billion US dollars worth), or that Germany was the biggest
importer of chocolate (over $2.5 billion worth) in that same year.
National governments all tend to publish their own trade figures, but the UN also collects
data from across the world. In particular, the UN’s global trade database, Comtrade,
contains data about import and export trade flows between countries for a wide range of
goods and services.

So if you’ve ever wondered where your country imports most of its T-shirts from, or
exports most of its municipal waste to, **Comtrade** is likely to have the data.
In the next section, you will find out about the Comtrade data.

## Getting Comtrade data into your notebook

In this exercise, you will practise loading data from Comtrade into a pandas dataframe and getting it into a form where you can start to work with it.

The following steps and code are an example. Your task for this exercise is stated at the end, after the example.

The data is obtained from the [United Nations Comtrade](http://comtrade.un.org/data/) website, by selecting the following configuration:

- Type of Product: goods
- Frequency: monthly 
- Periods: all of 2014
- Reporter: United Kingdom
- Partners: all
- Flows: imports and exports
- HS (as reported) commodity codes: 0401 (Milk and cream, neither concentrated nor sweetened) and 0402 (Milk and cream, concentrated or sweetened)

Clicking on 'Preview' results in a message that the data exceeds 500 rows. Data was downloaded using the *Download CSV* button and the download file renamed appropriately.

In [7]:
LOCATION = 'comtrade_milk_uk_monthly_14.csv'

A URL for downloading all the data as a CSV file can also be obtained via "View API Link".
It must be modified so that it returns up to 5000 records (set `max=5000`) in the CSV format (`&fmt=csv`).
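
Rather than editing the link by hand, the query string can be assembled from its parts. The following is a sketch using the standard library's `urlencode()` function, with the parameter values taken from the link used in this notebook. It only builds the address and does not check that the service responds.

```python
from urllib.parse import urlencode

# Query parameters copied from the API link; max and fmt are the two changes
params = {
    'max': 5000,          # return up to 5000 records
    'type': 'C', 'freq': 'M', 'px': 'HS', 'ps': 2014,
    'r': 826, 'p': 'all',
    'rg': '1,2',          # trade flow codes (1 = imports, 2 = exports)
    'cc': '0401,0402',    # commodity codes
    'fmt': 'csv',         # CSV output
}
url = 'http://comtrade.un.org/api/get?' + urlencode(params)
print(url)
```

`urlencode()` escapes the commas as `%2C`, which is why they appear in that form in the link.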

In [8]:
# LOCATION = 'http://comtrade.un.org/api/get?max=5000&type=C&freq=M&px=HS&ps=2014&r=826&p=all&rg=1%2C2&cc=0401%2C0402&fmt=csv'

Load the data in from the specified location, ensuring that the various codes are read as strings. Preview the first few rows of the dataset.

In [9]:
milk = pd.read_csv(LOCATION, dtype={'Commodity Code':str, 'Reporter Code':str})
milk.head(3)

Unnamed: 0,Classification,Year,Period,Period Desc.,Aggregate Level,Is Leaf Code,Trade Flow Code,Trade Flow,Reporter Code,Reporter,...,Qty,Alt Qty Unit Code,Alt Qty Unit,Alt Qty,Netweight (kg),Gross weight (kg),Trade Value (US$),CIF Trade Value (US$),FOB Trade Value (US$),Flag
0,HS,2014,201401,January 2014,4,0,1,Imports,826,United Kingdom,...,,,,,22404316,,21950747,,,0
1,HS,2014,201401,January 2014,4,0,2,Exports,826,United Kingdom,...,,,,,60497363,,46923551,,,0
2,HS,2014,201401,January 2014,4,0,2,Exports,826,United Kingdom,...,,,,,2520,,3410,,,0


Limit the columns to make the dataframe easier to work with by selecting just a subset of them.

In [10]:
COLUMNS = ['Year', 'Period','Trade Flow','Reporter', 'Partner', 'Commodity','Commodity Code','Trade Value (US$)']
milk = milk[COLUMNS]

Derive two new dataframes that separate out the 'World' partner data and the data for individual partner countries.

In [11]:
milk_world = milk[milk['Partner'] == 'World']
milk_countries = milk[milk['Partner'] != 'World']

You may wish to store a local copy as a CSV file, for example:

In [12]:
milk_countries.to_csv('countrymilk.csv', index=False)

To load the data back in:

In [13]:
load_test = pd.read_csv('countrymilk.csv', dtype={'Commodity Code':str, 'Reporter Code':str})
load_test.head(2)

Unnamed: 0,Year,Period,Trade Flow,Reporter,Partner,Commodity,Commodity Code,Trade Value (US$)
0,2014,201401,Exports,United Kingdom,Afghanistan,"Milk and cream, neither concentrated nor sweet...",401,3410
1,2014,201401,Exports,United Kingdom,Austria,"Milk and cream, neither concentrated nor sweet...",401,316


In [14]:
load_test=pd.read_csv('countrymilk.csv', dtype={'Commodity Code':str}, encoding = "ISO-8859-1")
load_test.head()

Unnamed: 0,Year,Period,Trade Flow,Reporter,Partner,Commodity,Commodity Code,Trade Value (US$)
0,2014,201401,Exports,United Kingdom,Afghanistan,"Milk and cream, neither concentrated nor sweet...",401,3410
1,2014,201401,Exports,United Kingdom,Austria,"Milk and cream, neither concentrated nor sweet...",401,316
2,2014,201401,Imports,United Kingdom,Belgium,"Milk and cream, neither concentrated nor sweet...",401,4472349
3,2014,201401,Exports,United Kingdom,Belgium,"Milk and cream, neither concentrated nor sweet...",401,5663128
4,2014,201401,Exports,United Kingdom,Br. Virgin Isds,"Milk and cream, neither concentrated nor sweet...",401,34566


If you are on a Windows computer, data files may sometimes be saved using a different file encoding, such as *Latin-1*. pandas may not recognise this by default, in which case you will see a `UnicodeDecodeError`.

In such cases, opening the file with `read_excel()` or `read_csv()` and the parameter `encoding="ISO-8859-1"` or `encoding="Latin-1"` should fix the problem. For example, edit the previous command to read:

`load_test = pd.read_csv('countrymilk.csv', dtype={'Commodity Code':str}, encoding="ISO-8859-1")`
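
If you do not know in advance how a file was saved, one possible approach is to try the default encoding first and fall back to Latin-1 only when the error occurs. The sketch below writes its own tiny Latin-1 file so that it can run anywhere. The file name and the trade value in it are made up for illustration.

```python
import os
import tempfile
import pandas as pd

# Write a tiny Latin-1 encoded file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), 'latin1_example.csv')
with open(path, 'w', encoding='ISO-8859-1') as f:
    f.write("Partner,Trade Value (US$)\nCôte d'Ivoire,100\n")

try:
    # pandas assumes UTF-8 unless told otherwise
    df = pd.read_csv(path)
except UnicodeDecodeError:
    # Fall back to Latin-1 if the file is not valid UTF-8
    df = pd.read_csv(path, encoding='ISO-8859-1')
print(df)
```
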

### Subsetting your data

For large or heterogeneous datasets, it is often convenient to create subsets of the data. To further separate out the imports:


In [15]:
milk_imports = milk[milk['Trade Flow'] == 'Imports']
milk_countries_imports = milk_countries[milk_countries['Trade Flow'] == 'Imports']
milk_world_imports = milk_world[milk_world['Trade Flow'] == 'Imports']

### Sorting the data

Having loaded in the data, find the most valuable partners in terms of import trade flow during a particular month by sorting the data by *decreasing* trade value and then selecting the top few rows.

In [16]:
milkImportsInJanuary2014 = milk_countries_imports[milk_countries_imports['Period'] == 201401]
milkImportsInJanuary2014.sort_values('Trade Value (US$)',ascending=False).head(10)

Unnamed: 0,Year,Period,Trade Flow,Reporter,Partner,Commodity,Commodity Code,Trade Value (US$)
23,2014,201401,Imports,United Kingdom,Ireland,"Milk and cream, neither concentrated nor sweet...",401,10676138
626,2014,201401,Imports,United Kingdom,France,"Milk and cream, concentrated or sweetened",402,8020014
637,2014,201401,Imports,United Kingdom,Ireland,"Milk and cream, concentrated or sweetened",402,5966962
650,2014,201401,Imports,United Kingdom,Netherlands,"Milk and cream, concentrated or sweetened",402,4650774
629,2014,201401,Imports,United Kingdom,Germany,"Milk and cream, concentrated or sweetened",402,4545873
4,2014,201401,Imports,United Kingdom,Belgium,"Milk and cream, neither concentrated nor sweet...",401,4472349
612,2014,201401,Imports,United Kingdom,Belgium,"Milk and cream, concentrated or sweetened",402,3584038
10,2014,201401,Imports,United Kingdom,Denmark,"Milk and cream, neither concentrated nor sweet...",401,2233438
667,2014,201401,Imports,United Kingdom,Spain,"Milk and cream, concentrated or sweetened",402,1850097
15,2014,201401,Imports,United Kingdom,France,"Milk and cream, neither concentrated nor sweet...",401,1522872


### Task

To complete these tasks you could copy this notebook and amend the code or create a new notebook to do the analysis for your chosen data.

Using the [Comtrade Data website](http://comtrade.un.org/data/), identify a dataset that describes the import and export trade flows for a particular service or form of goods between your country (as reporter) and all ('All') the other countries in the world. Get the monthly data for all months in 2014.

Download the data as a CSV file and add the file to the same folder as the one containing this notebook. Load the data in from the file into a pandas dataframe. Create an easier to work with dataframe that excludes data associated with the 'World' partner. Sort this data to see which countries are the biggest partners in terms of import and export trade flow.

Task completed in Notebook 25, using data for all months in 2020 because the data for 2014 is currently unavailable.