## Malaria Dataset: Cases and Deaths from Different Countries (2000-2017)

Malaria is a life-threatening disease caused by Plasmodium parasites, most often Plasmodium falciparum. It is transmitted through the bites of infected female Anopheles mosquitoes.

Worldwide, malaria causes over 400,000 deaths annually. The good news is that it is preventable and curable.

The WHO African Region carries a disproportionately high share of the global malaria burden and records more deaths than other regions.

In this notebook, I analyse the reported malaria cases and deaths from the WHO Malaria Report for the years 2000-2017. I also answer questions about the geographical prevalence of the disease and how cases and deaths have progressed over time.

In [1]:
# importing libraries

import pandas as pd                  # for data manipulation

import numpy as np                   # for statistical analysis

import matplotlib.pyplot as plt      # for plotting graphs
 
%matplotlib inline
# "%matplotlib inline" displays each plot directly below the cell that produces it

In [2]:
#loading the dataset

reported = pd.read_csv('reported_numbers.csv')


# previewing the data

reported.head()                     

,Country,Year,No. of cases,No. of deaths,WHO Region
0,Afghanistan,2017,161778.0,10.0,Eastern Mediterranean
1,Algeria,2017,0.0,0.0,Africa
2,Angola,2017,3874892.0,13967.0,Africa
3,Argentina,2017,0.0,1.0,Americas
4,Armenia,2017,0.0,,Europe


In [3]:
reported.shape                         # returns the no. of rows and columns

(1944, 5)

##### There are 5 columns and 1944 rows

In [4]:
reported.describe(include = 'all')     # This is used to view basic statistical details like percentile, mean, std etc.

,Country,Year,No. of cases,No. of deaths,WHO Region
count,1944,1944.0,1710.0,1675.0,1944
unique,108,,,,6
top,Bolivia (Plurinational State of),,,,Africa
freq,18,,,,792
mean,,2008.5,389730.3,1289.413731,
std,,5.189462,1270270.0,4290.739997,
min,,2000.0,0.0,0.0,
25%,,2004.0,593.75,1.0,
50%,,2008.5,14792.0,30.0,
75%,,2013.0,117097.8,669.5,


##### From the output of `.describe()`, we see that the data covers 108 unique countries. The most frequent country value appears 18 times (freq), which matches one report per year for the 18 years from 2000 to 2017.
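The reading of `.describe()` above can be checked directly by counting rows per country. The following is a minimal sketch on a small made-up frame (`toy` is hypothetical and stands in for `reported`); it assumes the same `Country` and `Year` column names.

```python
import pandas as pd

# Hypothetical stand-in for `reported`: 2 countries x 3 years
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [2015, 2016, 2017, 2015, 2016, 2017],
})

n_countries = toy["Country"].nunique()             # distinct countries
n_years = toy["Year"].nunique()                    # distinct years
rows_per_country = toy["Country"].value_counts()   # rows (years) per country

# The panel is complete if every country has one row per year
balanced = (rows_per_country == n_years).all()
print(n_countries, n_years, balanced)
```

Run against `reported`, the same three lines would be expected to confirm 108 countries with 18 rows each, given that 108 x 18 = 1944 matches the shape reported earlier.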

In [7]:
# Checking to see if any feature has empty/missing values

reported.isnull().sum()

Country            0
Year               0
No. of cases     234
No. of deaths    269
WHO Region         0
dtype: int64

##### The 'No. of cases' and 'No. of deaths' columns both have null values. Since the values differ greatly from country to country, imputing them with the mean or mode would be impractical.
##### Instead, the rows with missing values will be dropped, since we don't have sufficient information for those rows.

In [8]:
# dropping rows with null values

reported.dropna(axis=0, how='any', inplace=True)   # recent pandas versions reject passing both `how` and `thresh`
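Dropping with `how='any'` removes a country-year as soon as either figure is missing, so some countries end up with fewer years than others. The sketch below illustrates this on made-up data (`toy` and its values are hypothetical, not taken from the WHO file).

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each numeric column
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2016, 2017, 2016, 2017],
    "No. of cases": [100.0, np.nan, 5.0, 7.0],
    "No. of deaths": [1.0, 2.0, np.nan, 0.0],
})

# Keep only rows where every column is populated
cleaned = toy.dropna(axis=0, how="any")

# Count how many years each country retains after cleaning
rows_kept = cleaned.groupby("Country").size()
print(rows_kept)
```

Checking `rows_kept` on the real data before summing totals would show which countries are under-represented in the later comparisons.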

In [9]:
# Let's confirm that there are no more null values

reported.isnull().sum()

Country          0
Year             0
No. of cases     0
No. of deaths    0
WHO Region       0
dtype: int64

##### Neat! No more null values; the data is clean. We can now start the analysis.

In [10]:
# Which 10 countries have the highest no. of cases over the years?

danger_zones = reported[['Country', 'No. of cases', 'No. of deaths']].groupby('Country').sum().sort_values(by = 'No. of cases', ascending = False)
danger_zones.head(10)

Country,No. of cases,No. of deaths
Democratic Republic of the Congo,74842893.0,328552.0
Uganda,41993628.0,70941.0
Burkina Faso,41655606.0,89211.0
Burundi,41264306.0,33484.0
Mozambique,40725992.0,34697.0
United Republic of Tanzania,33559165.0,183120.0
Ghana,28405332.0,37305.0
India,27013448.0,15218.0
Angola,26006152.0,125364.0
Malawi,19445640.0,31815.0


##### The Democratic Republic of the Congo recorded the highest no. of cases and deaths in this dataset. That's serious.

In [11]:
# Which 10 countries have the lowest no. of cases over the years?

safe_countries = reported[['Country', 'No. of cases', 'No. of deaths']].groupby('Country').sum().sort_values(by = 'No. of cases', ascending = True)
safe_countries.head(10)

Country,No. of cases,No. of deaths
Egypt,0.0,10.0
Syrian Arab Republic,0.0,13.0
Morocco,3.0,9.0
Oman,19.0,4.0
Turkmenistan,62.0,0.0
Iraq,76.0,0.0
Armenia,355.0,0.0
Uzbekistan,632.0,1.0
Algeria,1044.0,4.0
Cabo Verde,1143.0,32.0


##### Of the ten countries listed, three (Turkmenistan, Iraq and Armenia) have no recorded deaths due to malaria in the retained data. Impressive!
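The number of countries with no recorded deaths can also be counted directly instead of being read off the first ten rows. This is a minimal sketch on made-up totals (`toy` is hypothetical); on the real data, the same filter would be applied to the per-country sums computed from `reported`.

```python
import pandas as pd

# Hypothetical rows; country "A" has two years, "B" and "C" one each
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "C"],
    "No. of cases": [10.0, 5.0, 3.0, 0.0],
    "No. of deaths": [0.0, 0.0, 1.0, 0.0],
})

# Total cases and deaths per country
totals = toy.groupby("Country")[["No. of cases", "No. of deaths"]].sum()

# Keep only countries whose summed deaths are exactly zero
zero_deaths = totals[totals["No. of deaths"] == 0]
print(len(zero_deaths), list(zero_deaths.index))
```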

##### Let's see the stats for my country, Nigeria.

In [12]:
nigeria = reported[reported.Country == 'Nigeria']
nigeria

,Country,Year,No. of cases,No. of deaths,WHO Region
394,Nigeria,2014,7826954.0,6082.0,Africa
826,Nigeria,2010,551187.0,4238.0,Africa
934,Nigeria,2009,479845.0,7522.0,Africa
1042,Nigeria,2008,143079.0,8677.0,Africa


##### Sigh! We still have a long way to go as a nation. Of the 18 years, Nigeria has just 4 years of complete data, which is too bad.
##### The no. of reported cases has greatly increased, although the death rate (deaths relative to reported cases) has been falling. There isn't enough data to draw firm conclusions.
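To make the comparison of cases and deaths concrete, deaths can be expressed per 1,000 reported cases. The sketch below re-enters the four Nigerian rows shown above by hand; the name `nigeria_rows` and the per-1,000 scaling are choices made here for illustration.

```python
import pandas as pd

# The four complete Nigerian records from the output above
nigeria_rows = pd.DataFrame({
    "Year": [2008, 2009, 2010, 2014],
    "No. of cases": [143079.0, 479845.0, 551187.0, 7826954.0],
    "No. of deaths": [8677.0, 7522.0, 4238.0, 6082.0],
}).set_index("Year")

# Deaths per 1,000 reported cases, year by year
deaths_per_1000 = 1000 * nigeria_rows["No. of deaths"] / nigeria_rows["No. of cases"]
print(deaths_per_1000.round(2))
```

The ratio falls in every successive year, but part of that fall may simply reflect more complete case reporting, so it is best read as a rough indicator only.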

##### Let's see how malaria events have progressed over the years.

In [14]:
# Progression of cases and deaths over the years

annual = reported.groupby('Year').sum(numeric_only=True)   # sum only the numeric columns
annual

Year,No. of cases,No. of deaths
2000,5279182.0,21419.0
2001,5534764.0,26162.0
2002,5335247.0,70683.0
2003,8243454.0,91247.0
2004,9389638.0,87926.0
2005,11170319.0,76842.0
2006,11898896.0,78995.0
2007,13365529.0,76904.0
2008,13395349.0,87024.0
2009,17454477.0,115694.0


In [17]:
# In which years were the most deaths reported?

tough_years = reported.groupby('Year').sum(numeric_only=True).sort_values(by='No. of deaths', ascending=False)
tough_years.head()

Year,No. of cases,No. of deaths
2010,25484864.0,135936.0
2015,85379317.0,120335.0
2009,17454477.0,115694.0
2013,49175016.0,107591.0
2016,108212723.0,105929.0


##### 2010 has the highest number of deaths.
##### There seems to be some improvement though: the no. of cases reported in 2016 is the highest so far, yet fewer people died from the disease than in 2010. This may be due to technological advances such as rapid diagnostic test kits, which make testing easier and produce faster results.
##### This improvement may also be attributed to increased research funding. The mode of action of these parasites has been extensively studied and more antimalarial drugs have been produced.
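Because rows with missing values were dropped earlier, yearly totals can also move simply because more or fewer countries have complete records in a given year. One way to check this, sketched here on made-up data (`toy` and the column names `countries`, `cases` and `cases_per_country` are hypothetical), is to count the reporting countries alongside each yearly sum.

```python
import pandas as pd

# Hypothetical data: two countries report in 2010, three in 2016
toy = pd.DataFrame({
    "Country": ["A", "B", "A", "B", "C"],
    "Year": [2010, 2010, 2016, 2016, 2016],
    "No. of cases": [10.0, 20.0, 15.0, 25.0, 40.0],
})

# Per year: how many countries contributed, and their summed cases
coverage = toy.groupby("Year").agg(
    countries=("Country", "nunique"),
    cases=("No. of cases", "sum"),
)

# Average cases per reporting country, to separate coverage from burden
coverage["cases_per_country"] = coverage["cases"] / coverage["countries"]
print(coverage)
```

If the number of reporting countries differs a lot between years, comparisons such as 2010 versus 2016 should be treated with caution.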

In [20]:
# Are some regions doing better than others?

region = reported.groupby('WHO Region')[['No. of cases', 'No. of deaths']].sum().sort_values(by='No. of deaths', ascending=False)
region

WHO Region,No. of cases,No. of deaths
Africa,545111852.0,1480850.0
South-East Asia,38305249.0,49802.0
Eastern Mediterranean,15841260.0,26764.0
Western Pacific,6709491.0,18330.0
Americas,13433321.0,11039.0
Europe,112675.0,25.0


##### Africa remains the region with the highest number of cases and deaths.
##### Europe has the fewest cases and deaths; some European countries have also been able to eliminate the disease.
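The regional gap is easier to judge as a share of all recorded deaths. The sketch below re-enters the regional death totals from the output above by hand (the names `region_deaths` and `share` are introduced here for illustration) and converts them to percentages.

```python
import pandas as pd

# Regional death totals copied from the grouped output above
region_deaths = pd.Series({
    "Africa": 1480850.0,
    "South-East Asia": 49802.0,
    "Eastern Mediterranean": 26764.0,
    "Western Pacific": 18330.0,
    "Americas": 11039.0,
    "Europe": 25.0,
}, name="No. of deaths")

# Each region's percentage of all deaths in the cleaned data
share = (100 * region_deaths / region_deaths.sum()).round(1)
print(share)
```

On these figures Africa accounts for roughly 93% of the deaths in the cleaned data.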

##### Interactive data visualizations were also created using Power BI.

### Conclusion
##### The African Region needs all the help it can get, in the form of funding, research support and materials.
##### Awareness should also be created about the effects of drug misuse, since this has contributed to resistance to once-effective antimalarial drugs.


In [None]:
reported.to_csv('cleaned_reported.csv', index=False)