# Project 1: Deaths by tuberculosis

by Michel Wermelinger, 14 July 2015, with minor edits on 5 April 2016

additional edits by George Kaimakis, 10 May 2017

This is the project notebook for Week 1 of The Open University's [_Learn to code for Data Analysis_](http://futurelearn.com/courses/learn-to-code) course.

In 2000, the United Nations set eight Millennium Development Goals (MDGs) to reduce poverty and diseases, improve gender equality and environmental sustainability, etc. Each goal is quantified and time-bound, to be achieved by the end of 2015. Goal 6 is to have halted and started reversing the spread of HIV, malaria and tuberculosis (TB).
TB doesn't make headlines like Ebola, SARS (severe acute respiratory syndrome) and other epidemics, but is far deadlier. For more information, see the World Health Organisation (WHO) page <http://www.who.int/gho/tb/en/>.

Given the population and number of deaths due to TB in some countries during one year, the following questions will be answered: 

- What is the total, maximum, minimum and average number of deaths in that year?
- Which countries have the most and the least deaths?
- What is the death rate (deaths per 100,000 inhabitants) for each country?
- Which countries have the lowest and highest death rate?

The death rate allows for a better comparison of countries with widely different population sizes.

## The data

The data consists of total population and total number of deaths due to TB (excluding HIV) in 2013 for all countries covered by the WHO data.

The data was taken in July 2015 from <http://apps.who.int/gho/data/node.main.POP107?lang=en> (population) and <http://apps.who.int/gho/data/node.main.593?lang=en> (deaths). The uncertainty bounds of the number of deaths were ignored.

The data was collected into an Excel file. This file must be at a path accessible from this notebook's location (see below).
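
Before loading, it can help to confirm that the relative path really points at the file. The following is a minimal sketch, assuming the notebook is run from its own folder so that the relative path is resolved from there; the variable name `data_file` is illustrative and not part of the original notebook.

```python
from pathlib import Path

# Relative path copied from the loading cell of this notebook; it is resolved
# against the current working directory (assumed to be the notebook's folder).
data_file = Path('../Datasets/Datasets_From_Course/WHO POP TB all.xls')

if data_file.exists():
    print('Found data file at', data_file.resolve())
else:
    print('Data file not found; looked for', data_file.resolve())
```

If the second message is printed, either move the file or adjust the path before reading it.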

In [1]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

import pandas as pd
pd.set_option('display.max_rows', 30)   # change this value (30) to display more or fewer rows of the table
data = pd.read_excel('../Datasets/Datasets_From_Course/WHO POP TB all.xls') # relative path to the data file location
data

Unnamed: 0,Country,Population (1000s),TB deaths
0,Afghanistan,30552,13000.00
1,Albania,3173,20.00
2,Algeria,39208,5100.00
3,Andorra,79,0.26
4,Angola,21472,6900.00
5,Antigua and Barbuda,90,1.20
6,Argentina,41446,570.00
7,Armenia,2977,170.00
8,Australia,23343,45.00
9,Austria,8495,29.00


## The range of the problem

The column of interest is the last one.

In [2]:
tbColumn = data['TB deaths']

The total number of deaths in 2013 is:

In [3]:
tbColumn.sum()  # the 'tbColumn' variable is evaluated with several 'methods' - sum(), max(), etc

1072677.97

The largest and smallest number of deaths in a single country are:

In [4]:
tbColumn.max()

240000.0

In [5]:
tbColumn.min()

0.0

From close to zero to almost a quarter of a million deaths is a huge range. The average number of deaths, over all countries in the data, can give a better idea of the seriousness of the problem in each country.
The average can be computed as the mean or the median. Given the wide range of deaths, the median is probably a more sensible average measure.

In [6]:
tbColumn.mean()

5529.267886597938

In [7]:
tbColumn.median()

315.0

The median is far lower than the mean. This indicates that some of the countries had a very high number of TB deaths in 2013, pushing the value of the mean up.
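
The effect of a few large values on the mean can be seen on a small scale. The sketch below uses only the ten TB death counts displayed in the first table of this notebook (Afghanistan to Austria), not the full dataset; the variable name `sample` is illustrative.

```python
import pandas as pd

# TB deaths for the ten countries shown in the first table of this notebook
sample = pd.Series([13000, 20, 5100, 0.26, 6900, 1.2, 570, 170, 45, 29])

print(sample.mean())    # pulled upwards by the few large values
print(sample.median())  # middle of the sorted values, barely affected by them
```

For these ten rows the mean is roughly 2,584 while the median is 107.5, the same pattern as in the full table.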

## Total deaths per country due to TB

To see the most affected countries, the table is sorted in descending order by the last column, which puts those countries in the first rows.

In [8]:
data.sort_values('TB deaths', ascending=False)    # the keyword argument ascending=False gives a descending sort - highest at the top

Unnamed: 0,Country,Population (1000s),TB deaths
77,India,1252140,240000.00
124,Nigeria,173615,160000.00
13,Bangladesh,156595,80000.00
78,Indonesia,249866,64000.00
128,Pakistan,182143,49000.00
47,Democratic Republic of the Congo,67514,46000.00
36,China,1393337,41000.00
58,Ethiopia,94101,30000.00
134,Philippines,98394,27000.00
116,Myanmar,53259,26000.00
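
Sorting also makes it easy to read off both ends of the ranking. The following is a small sketch on four rows copied from the tables above, rather than on the full dataset; the names `sample` and `ranked` are illustrative.

```python
import pandas as pd

# Four rows copied from the tables above
sample = pd.DataFrame({
    'Country': ['Afghanistan', 'Albania', 'India', 'Nigeria'],
    'TB deaths': [13000.0, 20.0, 240000.0, 160000.0],
})

# Descending sort: the countries with the most deaths come first
ranked = sample.sort_values('TB deaths', ascending=False)

print(ranked.head(2))   # the two rows with the most deaths
print(ranked.tail(1))   # the row with the fewest deaths
```

The same `head()` and `tail()` calls could be applied to the sorted full table to answer the question about the most and least affected countries.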


The table raises the possibility that a large number of deaths may be partly due to a large population. To compare the countries on an equal footing, the death rate per 100,000 inhabitants is computed.

The death rate per 100,000 inhabitants is shown in descending order in the last column, which puts the countries with the highest death rates in the first rows.

In [9]:
populationColumn = data['Population (1000s)']
data['TB deaths (per 100,000)'] = tbColumn * 100 / populationColumn # a new column of data is created from the evaluation
data.sort_values('TB deaths (per 100,000)', ascending=False)   # the table is sorted in descending order based on the new column

Unnamed: 0,Country,Population (1000s),TB deaths,"TB deaths (per 100,000)"
49,Djibouti,873,870.00,99.656357
124,Nigeria,173615,160000.00,92.157936
165,Swaziland,1250,1100.00,88.000000
172,Timor-Leste,1133,990.00,87.378641
158,Somalia,10496,7700.00,73.361280
71,Guinea-Bissau,1704,1200.00,70.422535
115,Mozambique,25834,18000.00,69.675621
47,Democratic Republic of the Congo,67514,46000.00,68.134017
30,Cambodia,15135,10000.00,66.072019
117,Namibia,2303,1300.00,56.448111
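
The factor of 100 in the rate calculation follows from the population being given in thousands. As a worked check, the sketch below recomputes the rate for Djibouti from the figures in the table above, once the long way and once with the shortcut; the variable names are illustrative.

```python
# Figures for Djibouti, copied from the table above
deaths = 870.0
population_thousands = 873

# Long way: convert the population to people, then scale to a rate per 100,000
population = population_thousands * 1000
rate_long = deaths / population * 100000

# Shortcut: the factors 100000 / 1000 collapse to 100
rate_short = deaths * 100 / population_thousands

print(round(rate_long, 6), round(rate_short, 6))
```

Both expressions give the value shown for Djibouti in the table, about 99.66 deaths per 100,000 inhabitants.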


## Conclusions

Worldwide, there were about 1.07 million deaths due to TB in 2013. The median shows that half of the countries had fewer than 315 deaths. The much higher mean (over 5,500) indicates that some countries had a very high number. The least affected were San Marino, Niue and Monaco, with less than 0.1 deaths, and the most affected were India and Nigeria with 240 thousand and 160 thousand deaths in a single year respectively. However, taking the population size into account, the least affected were San Marino, Monaco and Norway with less than 0.1 deaths per 100 thousand inhabitants, and the most affected were Djibouti and Nigeria with over 92 deaths per 100,000 inhabitants.

One should not forget that most values are estimates. Nevertheless, they convey the message that TB is still a major cause of fatalities, and that there is a huge disparity between countries, with several countries being highly affected.