In [1]:
# Import a module directly (e.g., `import sys`), or import it under a
# shorter alias with `as` (e.g., `import numpy as np`).

import sys 
import numpy as np
import pandas as pd
import statsmodels.api as sm
import sklearn
import scipy as sp
%matplotlib inline 
# This is to enable plotting in Jupyter directly.
import matplotlib.pyplot as plt

In [2]:
# Read the data as a Pandas data frame.

df_who = pd.read_csv("WHO.csv")

# Take a look at the first few observations of the data set.
df_who.head()

Unnamed: 0,Country,Region,Population,Under15,Over60,FertilityRate,LifeExpectancy,ChildMortality,CellularSubscribers,LiteracyRate,GNI,PrimarySchoolEnrollmentMale,PrimarySchoolEnrollmentFemale
0,Afghanistan,Eastern Mediterranean,29825,47.42,3.82,5.4,60,98.5,54.26,,1140.0,,
1,Albania,Europe,3162,21.33,14.93,1.75,74,16.7,96.39,,8820.0,,
2,Algeria,Africa,38482,27.42,7.17,2.83,73,20.0,98.99,,8310.0,98.2,96.4
3,Andorra,Europe,78,15.2,22.86,,82,3.2,75.49,,,78.4,79.4
4,Angola,Africa,20821,47.58,3.84,6.1,51,163.5,48.38,70.1,5230.0,93.1,78.2


In [4]:
# Output the shape of the data frame.

df_who.shape

(194, 13)

### Remark:

You may wonder whether we should first remove all missing (``NA``) values with ``df_who.dropna()``. For this problem set, I **DO NOT** suggest this approach: dropping every observation that has a missing value in **any** variable can discard observations that still carry relevant information for the focal variables of interest. The questions in this problem set each concern only a few variables, so missing values in the other columns should not cost us observations. Later in this course, we will discuss in more detail how to pre-process data and conduct feature engineering before building prediction models.
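
To make the over-removal concrete, here is a minimal sketch on a hypothetical toy data frame (the values are made up for illustration and are not the WHO data). It compares ``dropna()`` on all columns with ``dropna(subset=...)`` restricted to the focal variable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT the WHO data set): every row has a missing
# value somewhere, but only country C is missing the focal variable.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "FertilityRate": [5.4, 1.75, np.nan],
    "LiteracyRate": [np.nan, np.nan, 97.9],
})

# Dropping rows with a missing value in ANY column removes every row here.
dropped_all = toy.dropna()

# Dropping only rows where the focal variable is missing keeps A and B.
dropped_focal = toy.dropna(subset=["FertilityRate"])

print(len(dropped_all), len(dropped_focal))
```

In this toy example the blanket ``dropna()`` leaves nothing to analyze, while the column-specific version keeps both countries with a known fertility rate.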

### (A) Which variables have at least THREE missing (i.e., NA) values?

In [5]:
df_who.isna().sum()

Country                           0
Region                            0
Population                        0
Under15                           0
Over60                            0
FertilityRate                    11
LifeExpectancy                    0
ChildMortality                    0
CellularSubscribers              10
LiteracyRate                     91
GNI                              32
PrimarySchoolEnrollmentMale      93
PrimarySchoolEnrollmentFemale    93
dtype: int64

Therefore, the variables with at least 3 missing values are:

* ``FertilityRate``

* ``CellularSubscribers``

* ``LiteracyRate``

* ``GNI``

* ``PrimarySchoolEnrollmentMale``

* ``PrimarySchoolEnrollmentFemale``
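
Instead of reading the list off the output by eye, the missing-value counts can be filtered directly. The sketch below uses a hypothetical toy data frame (not the WHO data) and the threshold of 3 from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT the WHO data set).
toy = pd.DataFrame({
    "Population": [10, 20, 30, 40],
    "GNI": [np.nan, np.nan, np.nan, 5000.0],
    "LiteracyRate": [np.nan, 90.0, np.nan, 80.0],
})

# Count missing values per column, then keep the columns with at least 3.
na_counts = toy.isna().sum()
at_least_three = na_counts[na_counts >= 3].index.tolist()

print(at_least_three)
```

Applied to ``df_who``, the same two lines should return the six variable names listed above.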

### (B) Which countries have the highest and lowest fertility rates?

In [6]:
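# In Python, the "equal to" comparison is written as "==".
# Series.max() skips missing (NaN) values; the built-in max() can return a
# wrong result when NaN is present.
df_who[df_who['FertilityRate'] == df_who['FertilityRate'].max()]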

Unnamed: 0,Country,Region,Population,Under15,Over60,FertilityRate,LifeExpectancy,ChildMortality,CellularSubscribers,LiteracyRate,GNI,PrimarySchoolEnrollmentMale,PrimarySchoolEnrollmentFemale
123,Niger,Africa,17157,49.99,4.26,7.58,56,113.5,29.52,,720.0,64.2,52.0


In [7]:
df_who[df_who['FertilityRate'] == df_who['FertilityRate'].min()]

Unnamed: 0,Country,Region,Population,Under15,Over60,FertilityRate,LifeExpectancy,ChildMortality,CellularSubscribers,LiteracyRate,GNI,PrimarySchoolEnrollmentMale,PrimarySchoolEnrollmentFemale
21,Bosnia and Herzegovina,Europe,3834,16.35,20.52,1.26,76,6.7,84.52,97.9,9190.0,86.5,88.4


Therefore, the country with the highest fertility rate is **Niger** and the country with the lowest fertility rate is **Bosnia and Herzegovina**.
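
An alternative to filtering on the extreme value is to look up the row label of the extreme with ``idxmax()`` and ``idxmin()``, which skip missing values by default. The sketch below uses a hypothetical toy data frame (not the WHO data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT the WHO data set); country A has no
# fertility rate on record.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "FertilityRate": [np.nan, 7.58, 1.26, 2.83],
})

# idxmax()/idxmin() return the index label of the largest/smallest
# non-missing value; .loc then picks the country name in that row.
highest = toy.loc[toy["FertilityRate"].idxmax(), "Country"]
lowest = toy.loc[toy["FertilityRate"].idxmin(), "Country"]

print(highest, lowest)
```

Note that ``idxmax()`` and ``idxmin()`` return only the first matching row, so the filtering approach is preferable when ties are possible.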

### (C) Which region has the minimum variation (measured by standard deviation) in Gross National Income (GNI)? What is the standard deviation of GNI in this region?

In [8]:
# Select the two variables of interest.

df_who[['Region','GNI']]

Unnamed: 0,Region,GNI
0,Eastern Mediterranean,1140.0
1,Europe,8820.0
2,Africa,8310.0
3,Europe,
4,Africa,5230.0
...,...,...
189,Americas,12430.0
190,Western Pacific,3250.0
191,Eastern Mediterranean,2170.0
192,Africa,1490.0


In [10]:
# Use groupby and the aggregation function std() to evaluate the variability.

df_who[['Region','GNI']].groupby(['Region']).std()

Region,GNI
Africa,5933.619545
Americas,10062.508234
Eastern Mediterranean,24755.985472
Europe,15389.025378
South-East Asia,2477.339803
Western Pacific,15839.972247


**South-East Asia** has the lowest variation in GNI, with a standard deviation of **2477.34**.
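
The region with the smallest standard deviation can also be extracted programmatically rather than read off the table. The sketch below uses a hypothetical toy data frame with made-up regions ``X`` and ``Y`` (not the WHO data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT the WHO data set).
toy = pd.DataFrame({
    "Region": ["X", "X", "X", "Y", "Y", "Y"],
    "GNI": [1000.0, 2000.0, 3000.0, 1000.0, 1100.0, np.nan],
})

# Group by region and compute the sample standard deviation (ddof=1) of
# GNI; missing values are skipped within each group.
gni_std = toy.groupby("Region")["GNI"].std()

# idxmin() returns the group label with the smallest standard deviation.
least_variable = gni_std.idxmin()

print(least_variable, round(gni_std[least_variable], 2))
```

Selecting the ``GNI`` column after ``groupby`` returns a Series indexed by region, which makes ``idxmin()`` and ``min()`` directly usable.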


### (D) We define a country to be a rich country if its GNI exceeds 20,000. What is the mean child mortality of the rich countries?

In [11]:
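# The function numpy.average() computes the average value of an array.
# Note: ">=" also counts a GNI of exactly 20,000 as rich; "exceeds" strictly
# means ">". The two agree unless some country has a GNI of exactly 20,000.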

np.average(df_who[df_who['GNI'] >= 20000]['ChildMortality'])

7.448648648648649

The average child mortality of the rich countries is **7.45**.
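
The same calculation can be written with ``.loc`` and the ``mean()`` method. The sketch below, on a hypothetical toy data frame (not the WHO data), also shows how a strict threshold (``>``) and an inclusive one (``>=``) can give different answers when a value sits exactly on the boundary.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT the WHO data set); the second country has a
# GNI of exactly 20,000 and the fourth has no GNI on record.
toy = pd.DataFrame({
    "GNI": [25000.0, 20000.0, 40000.0, np.nan, 5000.0],
    "ChildMortality": [4.0, 10.0, 6.0, 3.0, 90.0],
})

# Rows with a missing GNI compare as False and are excluded from both masks.
rich_strict = toy.loc[toy["GNI"] > 20000, "ChildMortality"].mean()
rich_inclusive = toy.loc[toy["GNI"] >= 20000, "ChildMortality"].mean()

print(rich_strict, rich_inclusive)
```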

### (E) Demonstrate the relationship between income level and life expectancy by calculating their correlation.

In [12]:
# Select the two variables GNI and LifeExpectancy and compute their correlation matrix using the function corr().

df_who[['GNI','LifeExpectancy']].corr()

Unnamed: 0,GNI,LifeExpectancy
GNI,1.0,0.665786
LifeExpectancy,0.665786,1.0


The correlation between GNI and life expectancy is 0.6658, so the two variables have a **moderately strong positive relationship**.
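
When only a single coefficient is needed, ``Series.corr()`` returns it as a scalar instead of a matrix. The sketch below uses a hypothetical toy data frame (not the WHO data) and checks that both routes agree; in both cases, pairs with a missing value are left out of the calculation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT the WHO data set); the third country has no
# GNI on record, so it is excluded from the correlation.
toy = pd.DataFrame({
    "GNI": [1000.0, 2000.0, np.nan, 4000.0, 8000.0],
    "LifeExpectancy": [50.0, 60.0, 70.0, 65.0, 80.0],
})

# Pearson correlation as a single number.
r_pair = toy["GNI"].corr(toy["LifeExpectancy"])

# The same coefficient read from the off-diagonal of the correlation matrix.
r_matrix = toy[["GNI", "LifeExpectancy"]].corr().loc["GNI", "LifeExpectancy"]

print(round(r_pair, 4), round(r_matrix, 4))
```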