# Data Manipulation & Analysis With Pandas

## Introduction

**pandas** is a Python library that provides high-performance, easy-to-use data structures and data analysis tools. In many ways pandas provides a Python equivalent of the data analysis and manipulation tools available in the R programming language. Full details of pandas are available at [pandas.pydata.org](http://pandas.pydata.org/).

pandas fits very nicely into the IPython Notebook environment, alongside libraries such as NumPy, scikit-learn, and matplotlib.

The pandas website describes the highlights of the pandas library as:

* A fast and efficient DataFrame object for data manipulation with integrated indexing;
* Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
* Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
* Flexible reshaping and pivoting of data sets;
* Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
* Columns can be inserted and deleted from data structures for size mutability;
* Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
* High performance merging and joining of data sets;
* Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
* Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
* Highly optimized for performance, with critical code paths written in Cython or C.
* Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.




## Pandas Data Structures

To get started we import the **pandas** library (and the **numpy** library as much of pandas is based on this).

In [None]:
from IPython.display import display, HTML
import pandas as pd
import numpy as np

pandas offers two key data structures that are optimised for data analysis and manipulation: **Series** and **DataFrame**. The key distinction between these and basic Python data structures is that they make it easy to associate labels with data, for example row and column names.

### Series

In pandas, a Series is a one-dimensional array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The difference between this and a basic Python list or tuple is that the elements in the array can have a custom label, or **index**; that is, they are not limited to a simple numeric index like the basic Python data structures.

#### Creation

To create a pandas Series we call the **Series** constructor. The simplest way is to pass it a list of values, in which case pandas assigns a default numeric index (not a very exciting way to use a pandas Series!).

In [None]:
my_series = pd.Series([45, 232, 45, 67, 1, 88, 99, 65])
print(my_series)

We can also explicitly pass a list of index values to the Series constructor so as to use a more interesting index. For example:

In [None]:
populations = pd.Series([1357000000, 1252000000, 321068000, 249900000, 200400000, 191854000], 
                        ["China", "India", "United States", "Indonesia", "Brazil", "Pakistan"])
print(populations)

This is very similar to a Python dictionary. In fact, we can create a pandas Series directly from a Python dictionary:

In [None]:
populations = pd.Series({"China":1357000000, "India":1252000000, "United States":321068000, "Indonesia":249900000, "Brazil":200400000, "Pakistan":191854000})
print(populations)
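One place where a Series goes beyond a dictionary is arithmetic: operations are applied element by element, and two Series are matched up by label rather than by position. The following is a minimal sketch using toy values (the names `a`, `b`, `total` and `doubled` are made up for illustration):

```python
import pandas as pd

# Two small Series with partly overlapping labels (toy values for illustration)
a = pd.Series({"x": 1, "y": 2, "z": 3})
b = pd.Series({"y": 10, "z": 20, "w": 30})

# Addition matches values by label, not by position;
# labels present in only one Series produce NaN (missing)
total = a + b
print(total)

# Arithmetic with a scalar is applied to every element
doubled = a * 2
print(doubled)
```

Labels that appear in only one of the two Series end up as missing values in the result, which is the label-based alignment mentioned in the introduction.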

#### Operations

The pandas Series object, however, supports interesting data analysis functionality. For example, we can perform a range of simple analysis tasks on this Series object:

In [None]:
populations.min()

In [None]:
populations.max()

In [None]:
populations.median()

In [None]:
populations.mean()

In [None]:
populations.std()

The **describe** method generates a nice summary set of descriptive statistics:

In [None]:
populations.describe()

#### Access & Manipulation

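A pandas Series offers a number of choices for accessing elements. We can use integer positions through the **iloc** indexer (with a labelled index like ours, a plain **populations[0]** is deprecated in recent pandas versions because it is ambiguous between label and position):

In [None]:
populations.iloc[0]

In [None]:
populations.iloc[5]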

We can use slicing through the **:** operator (remember that **start:end** includes the elements from index start up to but not including index end):

In [None]:
populations[2:5]

In [None]:
populations[:5]

In [None]:
populations[2:]

We can also provide Boolean expressions for conditional data access. For example, to select the countries with populations greater than 1 billion we use:

In [None]:
populations[populations > 1000000000]
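It can help to look at the intermediate step. The comparison produces a Boolean Series, and that Series is what selects the rows. A minimal sketch, rebuilding the same `populations` Series so that it runs on its own (the names `mask` and `large` are illustrative):

```python
import pandas as pd

populations = pd.Series({"China": 1357000000, "India": 1252000000,
                         "United States": 321068000, "Indonesia": 249900000,
                         "Brazil": 200400000, "Pakistan": 191854000})

# The comparison itself returns a Boolean Series with the same index
mask = populations > 1000000000
print(mask)

# Indexing with the mask keeps only the elements where it is True
large = populations[mask]
print(large)

# True counts as 1, so sum() gives the number of matching elements
n_large = mask.sum()
print(n_large)
```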

We can also access elements using the index defined at creation (like we would with a dictionary):

In [None]:
populations

In [None]:
populations["Brazil"]

Using the index is also the easiest way to change elements in a pandas Series:

In [None]:
populations["China"] = 1374730000
print(populations)

Changes can also be made by integer position, using the **iloc** indexer:

In [None]:
populations.iloc[2] = 1236344631
print(populations)
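The distinction between labels and positions matters most when the labels are themselves integers. The sketch below uses a toy Series (not part of the original data) whose integer labels are deliberately out of order, and the explicit `loc` (label) and `iloc` (position) indexers:

```python
import pandas as pd

# A Series whose labels are integers in a different order from the positions
s = pd.Series(["a", "b", "c"], index=[2, 0, 1])

by_label = s.loc[0]      # the element labelled 0
by_position = s.iloc[0]  # the first element
print(by_label, by_position)

# Assignment works the same way through either indexer
s.loc[1] = "C"    # label 1 is the last element
s.iloc[0] = "A"   # position 0 is the first element
print(s.tolist())
```

Using `loc` or `iloc` explicitly avoids having to remember which interpretation plain square brackets would choose.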

### DataFrame

A pandas **DataFrame** is a 2-dimensional labelled data structure with columns of data that can be of different types. The DataFrame is essentially equivalent to a spreadsheet, an SQL table, an R data frame, or a SAS dataset. The DataFrame is the most commonly used pandas object and is usually the reason we use pandas. Rows and columns in a DataFrame can be labelled, which allows for easy data access.


#### Creation

The easiest way to create a pandas DataFrame is to pass the **DataFrame** constructor a dictionary of lists (each list becomes a **column** in the DataFrame):

In [None]:
countries = pd.DataFrame({"Country":["China", "India", "United States", "Indonesia", "Brazil", "Pakistan"],
                          "Population":[1357000000, 1252000000, 321068000, 249900000, 200400000, 191854000],
                          "GDP":[11384760, 2182580, 17968200, 888648, 1799610, 246849],
                          "Life Expectancy":[75.41, 68.13, 79.68, 72.45, 73.53, 67.39]})
display(countries)

DataFrames can also be created from a list of dictionaries, each of which describes one row using the same keys:

In [None]:
countries = pd.DataFrame([{"Country":"China", "Population":1357000000, "GDP":11384760, "Life Expectancy":75.41},
                         {"Country":"India", "Population":1252000000, "GDP":2182580, "Life Expectancy":68.13},
                         {"Country":"United States", "Population":321068000, "GDP":17968200, "Life Expectancy":79.68},
                         {"Country":"Indonesia", "Population":249900000, "GDP":888648, "Life Expectancy":72.45},
                         {"Country":"Brazil", "Population":200400000, "Life Expectancy":73.53},
                         {"Country":"Pakistan", "Population":191854000, "GDP":246849, "Life Expectancy":67.39}])
display(countries)

Note that the DataFrame gracefully handles the missing GDP value for Brazil: the empty cell is filled with NaN ("not a number"), the marker pandas uses for missing data. This is one of the advantages of a pandas DataFrame.
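A cut-down version of the example shows what happens to the missing entry. This is a minimal sketch with only two of the rows; the `filled` variable and the choice of 0 as a replacement value are purely illustrative:

```python
import pandas as pd

countries = pd.DataFrame([
    {"Country": "Indonesia", "Population": 249900000, "GDP": 888648},
    {"Country": "Brazil", "Population": 200400000},  # no GDP entry
])

# The missing entry is stored as NaN, which makes the GDP column floating point
print(countries["GDP"])

# isna() flags the missing entries
missing = countries["GDP"].isna()
print(missing)

# Statistics such as mean() skip NaN by default
gdp_mean = countries["GDP"].mean()
print(gdp_mean)

# fillna() returns a copy with the gaps replaced by a chosen value
filled = countries["GDP"].fillna(0)
print(filled)
```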

Another nice thing about pandas is that we can load a DataFrame directly from a .csv file using the **read_csv** function. In this example we read data about a longer list of countries:

In [None]:
extended_countries = pd.read_csv('FMLPDA_Table_5_ex_3.csv')
display(extended_countries)
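If the CSV file is not to hand, the behaviour of `read_csv` can be tried out on text held in memory. The sketch below assumes a three-row excerpt with a subset of the columns; the `csv_text` string is a stand-in for a file on disk, not the contents of the real file:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Country ID,Life Exp.,School Years
Afghanistan,59.61,0.4
Haiti,45.0,3.4
Nigeria,51.3,4.1
"""

# read_csv accepts any file-like object as well as a file name
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.dtypes)   # column types are inferred from the data
print(sample.shape)

# index_col uses one of the columns as the row labels
indexed = pd.read_csv(io.StringIO(csv_text), index_col="Country ID")
print(indexed.loc["Haiti", "School Years"])
```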

The **head** and **tail** methods can be used to show the first or last few rows of a DataFrame (five by default):

In [None]:
extended_countries.head()

In [None]:
extended_countries.tail(8)

#### Operations

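pandas offers a range of easy-to-use analysis operations for DataFrames. For example, the basic statistics operators return a value for each column in the DataFrame (in recent pandas versions, pass **numeric_only=True** to skip text columns such as the country name):

In [None]:
extended_countries.mean(numeric_only=True)

In [None]:
extended_countries.median(numeric_only=True)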

In [None]:
extended_countries.max()

In [None]:
extended_countries.describe()

#### Filtering Columns

Accessing *columns* in a DataFrame is simply a matter of using the name of the column (similar to dictionary selection) to give a single column Series:

In [None]:
extended_countries["School Years"]

In [None]:
extended_countries.columns

We can easily select multiple columns by passing a list of column names:

In [None]:
school_details = extended_countries[["Country ID", "School Years"]]
display(school_details)

In [None]:
display(extended_countries)

We can access rows using either row labels or row positions, through the **loc** and **iloc** indexers respectively. For a single row both return a Series (in our example the labels and positions coincide, so the two give the same result):

In [None]:
extended_countries.loc[1]

In [None]:
extended_countries.iloc[1]
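The two indexers stop agreeing as soon as the row labels no longer run 0, 1, 2, ... in order, for instance after some rows have been filtered out. A minimal sketch on a four-row excerpt (the names `df` and `subset` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Country ID": ["Afghanistan", "Haiti", "Nigeria", "Egypt"],
                   "School Years": [0.4, 3.4, 4.1, 5.3]})

# Keep only rows with more than 3 school years; the original row labels survive
subset = df[df["School Years"] > 3]
print(subset.index.tolist())

# loc looks up the row labelled 1; iloc looks up the second row by position
by_label = subset.loc[1, "Country ID"]
by_position = subset.iloc[1]["Country ID"]
print(by_label, by_position)
```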

Columns in a DataFrame are easily removed using the **del** statement:

In [None]:
display(extended_countries.head())
del extended_countries["CPI"]
display(extended_countries.head())

#### Filtering Rows

We can also easily slice by rows to get an extract from a DataFrame:

In [None]:
extended_countries[4:9]

One very useful way to slice a DataFrame is using a condition. We can pass a list of Boolean values to a DataFrame indicating which rows should be retained (True) and which should be filtered out (False). A suitable list is easily generated using a simple Boolean expression on a column from the DataFrame:

In [None]:
# The CPI column was deleted above, so we filter on another column here
extended_countries[extended_countries["School Years"] > 6]

In [None]:
mean_life_exp = extended_countries["Life Exp."].mean()
extended_countries[extended_countries["Life Exp."] < mean_life_exp]

The filtered result is itself a DataFrame, so it can be stored in a variable for later use:

In [None]:
military_countries = extended_countries[extended_countries["Mil. Spend"] > 2]
display(military_countries)

In [None]:
# Conditions can be combined with & (and) and | (or);
# each condition must be wrapped in its own parentheses
s_and_s = extended_countries[(extended_countries["School Years"] < 10) |
                             (extended_countries["Mil. Spend"] > 2)]
display(s_and_s)
s_and_s.describe()
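The parentheses around each condition are needed because `&` and `|` bind more tightly than comparison operators such as `>` in Python. Naming the individual conditions, as in the sketch below, often makes combined filters easier to read. The four rows are an excerpt of the country data and the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Country ID": ["Afghanistan", "Argentina", "Israel", "Ireland"],
                   "Mil. Spend": [4.44, 0.76, 6.77, 0.6],
                   "School Years": [0.4, 10.1, 12.5, 11.5]})

# Each condition is a Boolean Series in its own right
high_school = df["School Years"] > 10
high_mil = df["Mil. Spend"] > 2

both = df[high_school & high_mil]           # and
either = df[high_school | high_mil]         # or
only_school = df[high_school & ~high_mil]   # ~ negates a condition

print(both["Country ID"].tolist())
print(either["Country ID"].tolist())
print(only_school["Country ID"].tolist())
```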

We can also delete rows using the **drop** method.

In [None]:
display(extended_countries)

In [None]:
extended_countries = extended_countries.drop(extended_countries.index[[6, 12]])
display(extended_countries)
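Two details of `drop` are worth seeing on a small example: it returns a new DataFrame rather than changing the original, and the surviving rows keep their old labels, leaving gaps in the index. A minimal sketch on a four-row excerpt (`reset_index` is shown as one optional way to renumber the rows):

```python
import pandas as pd

df = pd.DataFrame({"Country ID": ["Afghanistan", "Haiti", "Nigeria", "Egypt"],
                   "School Years": [0.4, 3.4, 4.1, 5.3]})

# drop returns a new DataFrame; df itself is unchanged
trimmed = df.drop([1, 3])
print(len(df), len(trimmed))
print(trimmed.index.tolist())

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels
renumbered = trimmed.reset_index(drop=True)
print(renumbered.index.tolist())
```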

#### Filtering Columns & Rows

We can combine row selection and column selection using the **loc** indexer. We pass it the row selection first, followed by a list of column labels. For example:

In [None]:
extended_countries.loc[(extended_countries["School Years"] > 10), ["Country ID", "Top-10 Income"]]

We can do the same thing using integer positions for rows and columns rather than labels, using the **iloc** indexer:

In [None]:
extended_countries.iloc[1:5, 2:4]

#### Deriving New Fields

This simple DataFrame contains randomly generated ages, heights and weights for 100 people:

In [None]:
num_people = 100
weights = np.random.normal(75, 10, num_people)
heights = np.random.normal(170, 10, num_people)
ages =  np.random.gamma(5.5, 10, num_people)

people = pd.DataFrame({'weight':weights,'age':ages, 'height':heights})
display(people)

In [None]:
age_thresh = 65
people['old'] = people['age'] > age_thresh
display(people)

In [None]:
# BMI = weight / height squared; dividing by 100 assumes heights are in cm
people['BMI'] = people['weight'] / (people['height'] / 100) ** 2
display(people)
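Because the data above is random, the BMI calculation is easier to check on a couple of fixed values. The sketch below assumes weights in kilograms and heights in centimetres (the units are not stated in the original, but they match the division by 100); the two people are hypothetical:

```python
import pandas as pd

# Two hypothetical people; heights in cm, weights in kg (assumed units)
sample = pd.DataFrame({"weight": [75.0, 60.0], "height": [170.0, 160.0]})

# BMI = weight in kg divided by the square of the height in metres
sample["BMI"] = sample["weight"] / (sample["height"] / 100) ** 2
print(sample)
```

For the first person this works out as 75 / (1.70 × 1.70) ≈ 25.95. The expression is evaluated for every row at once, with no explicit loop.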

In [None]:
# The Boolean 'old' column can be used directly as a filter; ~ negates it
old_people = people[people['old']]
young_people = people[~people['old']]

display(old_people.head())
display(young_people.head())


In [None]:
old_people.mean()

In [None]:
young_people.mean()

In [None]:
%matplotlib inline
young_people['BMI'].hist()

In [None]:
old_people['BMI'].hist()

In [None]:
print("Mean old BMI: ", old_people['BMI'].mean())
print("Mean young BMI: ", young_people['BMI'].mean())

In the example above, the BMI column was derived directly from the weight and height columns. More generally, we can add a new column to a DataFrame simply by assigning to a new column name. For example:

In [None]:
display(extended_countries)

In [None]:
extended_countries["High Education"] = True
extended_countries.head()

Or more interestingly using other columns in the DataFrame:

In [None]:
extended_countries["High Education"] = extended_countries["School Years"] > 10
extended_countries.head()

In [None]:
extended_countries["Mil School Ratio"] = \
    extended_countries["Mil. Spend"] / extended_countries["School Years"]
extended_countries.head()

New rows can be added to a DataFrame using the **concat** function (the older **append** method was removed in pandas 2.0). For example, to append the first 5 rows of the DataFrame again at the end:

In [None]:
extra_rows = extended_countries[0:5]
# ignore_index tells pandas not to repeat row indices
pd.concat([extended_countries, extra_rows], ignore_index=True)
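The effect of `ignore_index` is easiest to see on a tiny frame. The sketch below uses `pd.concat`, which works across pandas versions (`DataFrame.append` is no longer available from pandas 2.0 onwards); the two-row frame and variable names are illustrative:

```python
import pandas as pd

first = pd.DataFrame({"Country ID": ["Afghanistan", "Haiti"],
                      "School Years": [0.4, 3.4]})
extra = first.iloc[:1]  # the first row again

# Without ignore_index the original row labels are kept, so 0 appears twice
kept = pd.concat([first, extra])
print(kept.index.tolist())

# With ignore_index=True the rows are renumbered from 0
renumbered = pd.concat([first, extra], ignore_index=True)
print(renumbered.index.tolist())
```

In both cases the result is a new DataFrame; the inputs are left unchanged, so the result has to be assigned to a name if it is needed later.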

The **describe** method gives a nice set of summary descriptive statistics for each column:

In [None]:
extended_countries.describe()

#### Merging DataFrames

We can also *merge* DataFrames in SQL-style join operations using the **merge** function. For example:

In [90]:
country_populations = pd.read_csv('FMLPDA_Table_5_ex_3a.csv')

In [91]:
display(country_populations)

|  | Country ID | Population |
|---|---|---|
| 0 | Afghanistan | 27101365 |
| 1 | Nigeria | 186988000 |
| 2 | Nicaragua | 6198154 |
| 3 | Egypt | 90369500 |
| 4 | Argentina | 43590400 |
| 5 | China | 1357000000 |
| 6 | Brazil | 200400000 |
| 7 | U.S.A | 321068000 |
| 8 | Ireland | 4635400 |
| 9 | U.K. | 65097000 |


In [92]:
display(extended_countries)

|  | Country ID | Life Exp. | Top-10 Income | Infant Mort. | Mil. Spend | School Years | High Education | Mil School Ratio |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 59.61 | 23.21 | 74.3 | 4.44 | 0.4 | False | 11.1 |
| 1 | Haiti | 45.0 | 47.67 | 73.1 | 0.09 | 3.4 | False | 0.026471 |
| 2 | Nigeria | 51.3 | 38.23 | 82.6 | 1.07 | 4.1 | False | 0.260976 |
| 3 | Egypt | 70.48 | 26.58 | 19.6 | 1.86 | 5.3 | False | 0.350943 |
| 4 | Argentina | 75.77 | 32.3 | 13.3 | 0.76 | 10.1 | True | 0.075248 |
| 5 | China | 74.87 | 29.98 | 13.7 | 1.95 | 6.4 | False | 0.304688 |
| 7 | Israel | 81.3 | 28.8 | 3.6 | 6.77 | 12.5 | True | 0.5416 |
| 8 | U.S.A | 78.51 | 29.85 | 6.3 | 4.72 | 13.7 | True | 0.344526 |
| 9 | Ireland | 80.15 | 27.23 | 3.5 | 0.6 | 11.5 | True | 0.052174 |
| 10 | U.K. | 80.09 | 28.49 | 4.4 | 2.59 | 13.0 | True | 0.199231 |


In [93]:
extended_countries_with_pop = pd.merge(extended_countries,
                                       country_populations,
                                       on="Country ID")
display(extended_countries_with_pop)

|  | Country ID | Life Exp. | Top-10 Income | Infant Mort. | Mil. Spend | School Years | High Education | Mil School Ratio | Population |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 59.61 | 23.21 | 74.3 | 4.44 | 0.4 | False | 11.1 | 27101365 |
| 1 | Nigeria | 51.3 | 38.23 | 82.6 | 1.07 | 4.1 | False | 0.260976 | 186988000 |
| 2 | Egypt | 70.48 | 26.58 | 19.6 | 1.86 | 5.3 | False | 0.350943 | 90369500 |
| 3 | Argentina | 75.77 | 32.3 | 13.3 | 0.76 | 10.1 | True | 0.075248 | 43590400 |
| 4 | China | 74.87 | 29.98 | 13.7 | 1.95 | 6.4 | False | 0.304688 | 1357000000 |
| 5 | U.S.A | 78.51 | 29.85 | 6.3 | 4.72 | 13.7 | True | 0.344526 | 321068000 |
| 6 | Ireland | 80.15 | 27.23 | 3.5 | 0.6 | 11.5 | True | 0.052174 | 4635400 |
| 7 | U.K. | 80.09 | 28.49 | 4.4 | 2.59 | 13.0 | True | 0.199231 | 65097000 |
| 8 | Germany | 80.24 | 22.07 | 3.5 | 1.31 | 12.0 | True | 0.109167 | 81459000 |
| 9 | Australia | 82.09 | 25.4 | 4.2 | 1.86 | 11.5 | True | 0.161739 | 23992700 |


In [94]:
extended_countries.shape

(14, 8)

In [95]:
country_populations.shape

(16, 2)

In [96]:
extended_countries_with_pop.shape

(12, 9)

The **how** parameter of the merge function determines the type of join that is performed. The main options are 'left', 'right', 'outer' and 'inner' (the default is 'inner').

In [97]:
extended_countries_with_pop = pd.merge(extended_countries,
                                       country_populations,
                                       on="Country ID",
                                       how='left')
display(extended_countries_with_pop)

|  | Country ID | Life Exp. | Top-10 Income | Infant Mort. | Mil. Spend | School Years | High Education | Mil School Ratio | Population |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 59.61 | 23.21 | 74.3 | 4.44 | 0.4 | False | 11.1 | 27101360.0 |
| 1 | Haiti | 45.0 | 47.67 | 73.1 | 0.09 | 3.4 | False | 0.026471 | NaN |
| 2 | Nigeria | 51.3 | 38.23 | 82.6 | 1.07 | 4.1 | False | 0.260976 | 186988000.0 |
| 3 | Egypt | 70.48 | 26.58 | 19.6 | 1.86 | 5.3 | False | 0.350943 | 90369500.0 |
| 4 | Argentina | 75.77 | 32.3 | 13.3 | 0.76 | 10.1 | True | 0.075248 | 43590400.0 |
| 5 | China | 74.87 | 29.98 | 13.7 | 1.95 | 6.4 | False | 0.304688 | 1357000000.0 |
| 6 | Israel | 81.3 | 28.8 | 3.6 | 6.77 | 12.5 | True | 0.5416 | NaN |
| 7 | U.S.A | 78.51 | 29.85 | 6.3 | 4.72 | 13.7 | True | 0.344526 | 321068000.0 |
| 8 | Ireland | 80.15 | 27.23 | 3.5 | 0.6 | 11.5 | True | 0.052174 | 4635400.0 |
| 9 | U.K. | 80.09 | 28.49 | 4.4 | 2.59 | 13.0 | True | 0.199231 | 65097000.0 |


In [98]:
print(extended_countries.shape)
print(country_populations.shape)
print(extended_countries_with_pop.shape)


(14, 8)
(16, 2)
(14, 9)


In [99]:
extended_countries_with_pop = pd.merge(extended_countries, country_populations, on="Country ID", how = 'right')
display(extended_countries_with_pop)

|  | Country ID | Life Exp. | Top-10 Income | Infant Mort. | Mil. Spend | School Years | High Education | Mil School Ratio | Population |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 59.61 | 23.21 | 74.3 | 4.44 | 0.4 | False | 11.1 | 27101365 |
| 1 | Nigeria | 51.3 | 38.23 | 82.6 | 1.07 | 4.1 | False | 0.260976 | 186988000 |
| 2 | Egypt | 70.48 | 26.58 | 19.6 | 1.86 | 5.3 | False | 0.350943 | 90369500 |
| 3 | Argentina | 75.77 | 32.3 | 13.3 | 0.76 | 10.1 | True | 0.075248 | 43590400 |
| 4 | China | 74.87 | 29.98 | 13.7 | 1.95 | 6.4 | False | 0.304688 | 1357000000 |
| 5 | U.S.A | 78.51 | 29.85 | 6.3 | 4.72 | 13.7 | True | 0.344526 | 321068000 |
| 6 | Ireland | 80.15 | 27.23 | 3.5 | 0.6 | 11.5 | True | 0.052174 | 4635400 |
| 7 | U.K. | 80.09 | 28.49 | 4.4 | 2.59 | 13.0 | True | 0.199231 | 65097000 |
| 8 | Germany | 80.24 | 22.07 | 3.5 | 1.31 | 12.0 | True | 0.109167 | 81459000 |
| 9 | Australia | 82.09 | 25.4 | 4.2 | 1.86 | 11.5 | True | 0.161739 | 23992700 |


In [100]:
extended_countries_with_pop.shape

(16, 9)

In [101]:
extended_countries_with_pop = pd.merge(extended_countries, country_populations, on="Country ID", how = 'outer')
display(extended_countries_with_pop)

|  | Country ID | Life Exp. | Top-10 Income | Infant Mort. | Mil. Spend | School Years | High Education | Mil School Ratio | Population |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 59.61 | 23.21 | 74.3 | 4.44 | 0.4 | False | 11.1 | 27101360.0 |
| 1 | Haiti | 45.0 | 47.67 | 73.1 | 0.09 | 3.4 | False | 0.026471 | NaN |
| 2 | Nigeria | 51.3 | 38.23 | 82.6 | 1.07 | 4.1 | False | 0.260976 | 186988000.0 |
| 3 | Egypt | 70.48 | 26.58 | 19.6 | 1.86 | 5.3 | False | 0.350943 | 90369500.0 |
| 4 | Argentina | 75.77 | 32.3 | 13.3 | 0.76 | 10.1 | True | 0.075248 | 43590400.0 |
| 5 | China | 74.87 | 29.98 | 13.7 | 1.95 | 6.4 | False | 0.304688 | 1357000000.0 |
| 6 | Israel | 81.3 | 28.8 | 3.6 | 6.77 | 12.5 | True | 0.5416 | NaN |
| 7 | U.S.A | 78.51 | 29.85 | 6.3 | 4.72 | 13.7 | True | 0.344526 | 321068000.0 |
| 8 | Ireland | 80.15 | 27.23 | 3.5 | 0.6 | 11.5 | True | 0.052174 | 4635400.0 |
| 9 | U.K. | 80.09 | 28.49 | 4.4 | 2.59 | 13.0 | True | 0.199231 | 65097000.0 |


In [102]:
extended_countries_with_pop.shape

(18, 9)
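The four join types are easier to compare side by side on two very small DataFrames. The sketch below uses a three-row excerpt on each side with two countries in common; the names `left`, `right`, `shapes` and `flags` are illustrative. The `indicator=True` option of `pd.merge` adds a `_merge` column recording where each row came from:

```python
import pandas as pd

left = pd.DataFrame({"Country ID": ["Haiti", "China", "Ireland"],
                     "School Years": [3.4, 6.4, 11.5]})
right = pd.DataFrame({"Country ID": ["China", "Ireland", "Brazil"],
                      "Population": [1357000000, 4635400, 200400000]})

# Compare the size of the result for each join type
shapes = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, on="Country ID", how=how)
    shapes[how] = merged.shape
    print(how, merged.shape)

# indicator=True records whether each row was found on the left, the right, or both
outer = pd.merge(left, right, on="Country ID", how="outer", indicator=True)
flags = dict(zip(outer["Country ID"], outer["_merge"].astype(str)))
print(flags)
```

An inner join keeps only the countries present on both sides, left and right joins keep everything from one side, and an outer join keeps everything, filling the gaps with NaN.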

#### Aggregating DataFrames

If there are categorical variables in a dataset we can use them to define groups. Once groups are defined, it is possible to perform analysis based on these groups.

Here we read in a dataset that contains the continent to which each country belongs and add it to the country details dataset.

In [103]:
country_continents = pd.read_csv('FMLPDA_Table_5_ex_3b.csv')
display(country_continents.head())

|  | Country ID | Continent |
|---|---|---|
| 0 | Afghanistan | Asia |
| 1 | Haiti | North America |
| 2 | Nigeria | Africa |
| 3 | Egypt | Africa |
| 4 | Argentina | South America |


In [104]:
display(extended_countries.head())

|  | Country ID | Life Exp. | Top-10 Income | Infant Mort. | Mil. Spend | School Years | High Education | Mil School Ratio |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 59.61 | 23.21 | 74.3 | 4.44 | 0.4 | False | 11.1 |
| 1 | Haiti | 45.0 | 47.67 | 73.1 | 0.09 | 3.4 | False | 0.026471 |
| 2 | Nigeria | 51.3 | 38.23 | 82.6 | 1.07 | 4.1 | False | 0.260976 |
| 3 | Egypt | 70.48 | 26.58 | 19.6 | 1.86 | 5.3 | False | 0.350943 |
| 4 | Argentina | 75.77 | 32.3 | 13.3 | 0.76 | 10.1 | True | 0.075248 |


In [105]:
extended_countries_with_contnts = pd.merge(extended_countries,
                                           country_continents,
                                           on="Country ID", how='inner')
display(extended_countries_with_contnts)

|  | Country ID | Life Exp. | Top-10 Income | Infant Mort. | Mil. Spend | School Years | High Education | Mil School Ratio | Continent |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 59.61 | 23.21 | 74.3 | 4.44 | 0.4 | False | 11.1 | Asia |
| 1 | Haiti | 45.0 | 47.67 | 73.1 | 0.09 | 3.4 | False | 0.026471 | North America |
| 2 | Nigeria | 51.3 | 38.23 | 82.6 | 1.07 | 4.1 | False | 0.260976 | Africa |
| 3 | Egypt | 70.48 | 26.58 | 19.6 | 1.86 | 5.3 | False | 0.350943 | Africa |
| 4 | Argentina | 75.77 | 32.3 | 13.3 | 0.76 | 10.1 | True | 0.075248 | South America |
| 5 | China | 74.87 | 29.98 | 13.7 | 1.95 | 6.4 | False | 0.304688 | Asia |
| 6 | Israel | 81.3 | 28.8 | 3.6 | 6.77 | 12.5 | True | 0.5416 | Asia |
| 7 | U.S.A | 78.51 | 29.85 | 6.3 | 4.72 | 13.7 | True | 0.344526 | North America |
| 8 | Ireland | 80.15 | 27.23 | 3.5 | 0.6 | 11.5 | True | 0.052174 | Europe |
| 9 | U.K. | 80.09 | 28.49 | 4.4 | 2.59 | 13.0 | True | 0.199231 | Europe |


To define groups within a DataFrame we use the **groupby** method, passing it the name of the column we would like to group by. We can then use the grouped data to perform grouped analysis.

In [107]:
grouped_data = extended_countries_with_contnts.groupby('Continent')

We can save the grouping object if we want to perform multiple analyses.
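What the grouping object holds can be seen by iterating over it: each step yields a group label together with the rows that belong to that group. The sketch below uses a five-row excerpt of the country data; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Country ID": ["Nigeria", "Egypt", "Ireland", "U.K.", "Argentina"],
    "Continent": ["Africa", "Africa", "Europe", "Europe", "South America"],
    "Life Exp.": [51.3, 70.48, 80.15, 80.09, 75.77],
})
grouped = df.groupby("Continent")

# Split: each group is an ordinary DataFrame holding that continent's rows
for continent, rows in grouped:
    print(continent, rows["Country ID"].tolist())

# Apply and combine: one result per group, indexed by the group label
sizes = grouped.size()
means = grouped["Life Exp."].mean()
print(sizes)
print(means)
```

This is the split-apply-combine pattern mentioned in the introduction: the data is split by continent, a function is applied to each piece, and the results are combined into a single object.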

In [109]:
# numeric_only=True skips text columns such as Country ID and Continent
extended_countries_with_contnts.mean(numeric_only=True)

Life Exp.           72.965000
Top-10 Income       29.271429
Infant Mort.        22.100000
Mil. Spend           2.172857
School Years         9.214286
High Education       0.642857
Mil School Ratio     0.979846
dtype: float64

In [108]:
# numeric_only=True skips the text column Country ID
grouped_data.mean(numeric_only=True)

| Continent | Life Exp. | Top-10 Income | Infant Mort. | Mil. Spend | School Years | High Education | Mil School Ratio |
|---|---|---|---|---|---|---|---|
| Africa | 60.89 | 32.405 | 51.1 | 1.465 | 4.7 | 0.0 | 0.30596 |
| Asia | 71.926667 | 27.33 | 30.533333 | 4.386667 | 6.433333 | 0.333333 | 3.982096 |
| Europe | 80.4775 | 24.9925 | 3.45 | 1.4425 | 12.325 | 1.0 | 0.114948 |
| North America | 61.755 | 38.76 | 39.7 | 2.405 | 8.55 | 0.5 | 0.185498 |
| Oceania | 81.38 | 26.605 | 4.55 | 1.495 | 11.9 | 1.0 | 0.126805 |
| South America | 75.77 | 32.3 | 13.3 | 0.76 | 10.1 | 1.0 | 0.075248 |


In [110]:
display(grouped_data.max())
display(grouped_data.min())

| Continent | Country ID | Life Exp. | Top-10 Income | Infant Mort. | Mil. Spend | School Years | High Education | Mil School Ratio |
|---|---|---|---|---|---|---|---|---|
| Africa | Nigeria | 70.48 | 38.23 | 82.6 | 1.86 | 5.3 | False | 0.350943 |
| Asia | Israel | 81.3 | 29.98 | 74.3 | 6.77 | 12.5 | True | 11.1 |
| Europe | U.K. | 81.43 | 28.49 | 4.4 | 2.59 | 13.0 | True | 0.199231 |
| North America | U.S.A | 78.51 | 47.67 | 73.1 | 4.72 | 13.7 | True | 0.344526 |
| Oceania | New Zealand | 82.09 | 27.81 | 4.9 | 1.86 | 12.3 | True | 0.161739 |
| South America | Argentina | 75.77 | 32.3 | 13.3 | 0.76 | 10.1 | True | 0.075248 |


| Continent | Country ID | Life Exp. | Top-10 Income | Infant Mort. | Mil. Spend | School Years | High Education | Mil School Ratio |
|---|---|---|---|---|---|---|---|---|
| Africa | Egypt | 51.3 | 26.58 | 19.6 | 1.07 | 4.1 | False | 0.260976 |
| Asia | Afghanistan | 59.61 | 23.21 | 3.6 | 1.95 | 0.4 | False | 0.304688 |
| Europe | Germany | 80.09 | 22.07 | 2.4 | 0.6 | 11.5 | True | 0.052174 |
| North America | Haiti | 45.0 | 29.85 | 6.3 | 0.09 | 3.4 | False | 0.026471 |
| Oceania | Australia | 80.67 | 25.4 | 4.2 | 1.13 | 11.5 | True | 0.09187 |
| South America | Argentina | 75.77 | 32.3 | 13.3 | 0.76 | 10.1 | True | 0.075248 |


We can use column selection on the grouped data object to only see details of certain columns. 

In [112]:
display(grouped_data['Mil. Spend'].describe())

| Continent | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Africa | 2.0 | 1.465 | 0.558614 | 1.07 | 1.2675 | 1.465 | 1.6625 | 1.86 |
| Asia | 3.0 | 4.386667 | 2.410443 | 1.95 | 3.195 | 4.44 | 5.605 | 6.77 |
| Europe | 4.0 | 1.4425 | 0.83144 | 0.6 | 1.1025 | 1.29 | 1.63 | 2.59 |
| North America | 2.0 | 2.405 | 3.273904 | 0.09 | 1.2475 | 2.405 | 3.5625 | 4.72 |
| Oceania | 2.0 | 1.495 | 0.516188 | 1.13 | 1.3125 | 1.495 | 1.6775 | 1.86 |
| South America | 1.0 | 0.76 | NaN | 0.76 | 0.76 | 0.76 | 0.76 | 0.76 |


Or to look at multiple columns:

In [113]:
display(grouped_data[['Life Exp.', 'Infant Mort.']].describe())

**Life Exp.**

| Continent | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Africa | 2.0 | 60.89 | 13.562308 | 51.3 | 56.095 | 60.89 | 65.685 | 70.48 |
| Asia | 3.0 | 71.926667 | 11.140531 | 59.61 | 67.24 | 74.87 | 78.085 | 81.3 |
| Europe | 4.0 | 80.4775 | 0.637985 | 80.09 | 80.135 | 80.195 | 80.5375 | 81.43 |
| North America | 2.0 | 61.755 | 23.695148 | 45.0 | 53.3775 | 61.755 | 70.1325 | 78.51 |
| Oceania | 2.0 | 81.38 | 1.004092 | 80.67 | 81.025 | 81.38 | 81.735 | 82.09 |
| South America | 1.0 | 75.77 | NaN | 75.77 | 75.77 | 75.77 | 75.77 | 75.77 |

**Infant Mort.**

| Continent | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Africa | 2.0 | 51.1 | 44.547727 | 19.6 | 35.35 | 51.1 | 66.85 | 82.6 |
| Asia | 3.0 | 30.533333 | 38.237983 | 3.6 | 8.65 | 13.7 | 44.0 | 74.3 |
| Europe | 4.0 | 3.45 | 0.818535 | 2.4 | 3.225 | 3.5 | 3.725 | 4.4 |
| North America | 2.0 | 39.7 | 47.234733 | 6.3 | 23.0 | 39.7 | 56.4 | 73.1 |
| Oceania | 2.0 | 4.55 | 0.494975 | 4.2 | 4.375 | 4.55 | 4.725 | 4.9 |
| South America | 1.0 | 13.3 | NaN | 13.3 | 13.3 | 13.3 | 13.3 | 13.3 |


Using groups we can also perform **data aggregation**: rolling up multiple rows of data into a single row that summarises them. To do this we use the **agg** method in connection with grouped data. For example, to create a dataset containing the mean life expectancy of each continent we could use:

In [None]:
# Function names are passed as strings, the form recommended by recent pandas versions
grouped_data['Life Exp.'].agg(['mean'])

We can add multiple measures to this aggregation, for example including max and min as well as mean:

In [None]:
grouped_data['Life Exp.'].agg(['mean', 'min', 'max'])

We can do this for multiple columns from the original dataset to be even more expressive.

In [None]:
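grouped_data[['Life Exp.', 'Infant Mort.']].agg(['mean', 'min', 'max'])

In [None]:
# Aggregate every numeric column; text columns such as Country ID cannot be averaged
numeric_columns = list(extended_countries_with_contnts.select_dtypes('number').columns)
grouped_data[numeric_columns].agg(['mean'])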
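When different columns need different summaries, pandas also supports named aggregation, where each output column is given as `name=(column, function)`. The following is a minimal sketch on a four-row excerpt; the output column names `mean_life_exp`, `max_infant_mort` and `n_countries` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Country ID": ["Nigeria", "Egypt", "Ireland", "U.K."],
    "Continent": ["Africa", "Africa", "Europe", "Europe"],
    "Life Exp.": [51.3, 70.48, 80.15, 80.09],
    "Infant Mort.": [82.6, 19.6, 3.5, 4.4],
})

# Each keyword names an output column and gives (input column, aggregation)
summary = df.groupby("Continent").agg(
    mean_life_exp=("Life Exp.", "mean"),
    max_infant_mort=("Infant Mort.", "max"),
    n_countries=("Country ID", "count"),
)
print(summary)
```

The result has one row per continent and flat, descriptive column names, which can be easier to work with than the two-level column headers produced by passing a list of functions.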

### Simple Analysis

Remember that we can use simple analysis functions to start analysing data. Two of the most useful are **describe** and **value_counts**.

In [None]:
extended_countries_with_pop["School Years"].describe()

In [None]:
extended_countries_with_contnts["Continent"].describe()

In [None]:
extended_countries_with_contnts["Continent"].value_counts()
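For a text column, `describe` reports counts rather than means, and `value_counts` gives the full frequency table. The sketch below uses the ten continent labels from the first ten rows of the merged table shown earlier, so the numbers apply to that excerpt only; `normalize=True` is an optional argument that returns proportions instead of counts:

```python
import pandas as pd

continents = pd.Series(["Asia", "North America", "Africa", "Africa", "South America",
                        "Asia", "Asia", "North America", "Europe", "Europe"])

# For text data, describe reports count, number of unique values,
# the most frequent value (top) and how often it occurs (freq)
summary = continents.describe()
print(summary)

# value_counts lists every distinct value with its frequency, most frequent first
counts = continents.value_counts()
print(counts)

# normalize=True returns proportions instead of counts
shares = continents.value_counts(normalize=True)
print(shares)
```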