# Exercise 4: Hypertension patients

We want to investigate hypertension patients based on the NCD download data. 

Questions we want to investigate: 

* What is the gender breakdown of Hypertension patients
* Is there a gender difference in the number of controlled hypertension patients
* Want to look at the geographic distribution of the fraction of controlled hypertension patients
* (Bonus) Build a linear regression model to investigate which factors impact BP in hypertension patients

This exercise is using Pandas to read in data downloaded from the website. 

In [3]:
import pandas as pd

We import the data from an excel file

In [4]:
data = pd.read_excel("../../../non_communicable_diseases.xlsx")

We get all the hypertension patients as follows: 

In [5]:
hypertension_data = data[data["Common Name"] == "Hypertension"]

In [6]:
# We calculate controlled patients as patients with systolic BP less than 150
hypertension_data["controlled"] = (hypertension_data["Systolic BP"] < 150).apply(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Now you can try to investigate the gender distribution of controlled and not controlled patients

In [4]:
hypertension_data["Gender"].value_counts()

female    47098
male      30326
Name: Gender, dtype: int64

One useful function is to create a pivot table as in the example below:

In [6]:
hypertension_data.pivot_table(index=["Gender","controlled"], aggfunc="count")["Clinic"]

Gender  controlled
female  0             42243
        1              4855
male    0             27091
        1              3235
Name: Clinic, dtype: int64

If we want to calculate the fraction of controlled patients for each gender the easiest thing is to iterate over each group as below. 

In [7]:
for name, group in hypertension_data.groupby("Gender"):
    # Calculate and pring the fraction of controlled patients for each Gender

female 0.529661556754
male 0.532381454857


You can now try the same with using the Region variable instead of Gender. To see the geographic distribution

In [15]:
# Create a pivot table for Region and Controlled

In [16]:
# Using groupby calculate the fraction of controlled patients in each Region

As a bonus we can try to create a linear regression model to estimate a patients BP from other factors. To this we use the statsmodel packages as below

In [12]:
import statsmodels.formula.api as smf

In [10]:
# To use this package properly we need to make sure all the column names are a bit easier to deal with 
from slugify import slugify
smf_data =hypertension_data.rename(columns=lambda x: slugify(x).replace("-", "_"))
smf_data.columns

Index(['visit_type', 'visit_date', 'epi_week', 'vist_day', 'vist_month',
       'visit_year', 'region', 'district', 'clinic', 'status',
       ...
       'dose_of_medicine_3_prescribed',
       'availability_of_medicine_3_prescribed',
       'name_of_medicine_4_prescribed', 'dose_of_medicine_4_prescribed',
       'availability_of_medicine_4_prescribed',
       'name_of_medicine_5_prescribed', 'dose_of_medicine_5_prescribed',
       'availability_of_medicine_5_prescribed', 'uuid', 'controlled'],
      dtype='object', length=129)

We can now create our first regression model of modelling the age and gender effect on BP. We specify our model using a formula language `systolic_bp ~ age_years + C(gender)`. The C(gender), means that gender is treated as a categorical factor. We can see the result of this regression analysis below

In [13]:
smf.ols("systolic_bp ~ age_years + C(gender)", data=smf_data).fit().summary()

0,1,2,3
Dep. Variable:,systolic_bp,R-squared:,0.002
Model:,OLS,Adj. R-squared:,0.002
Method:,Least Squares,F-statistic:,51.82
Date:,"Fri, 22 Sep 2017",Prob (F-statistic):,3.2999999999999996e-23
Time:,10:49:25,Log-Likelihood:,-238270.0
No. Observations:,55148,AIC:,476500.0
Df Residuals:,55145,BIC:,476600.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,132.5812,0.391,338.927,0.000,131.815,133.348
C(gender)[T.male],0.2940,0.159,1.852,0.064,-0.017,0.605
age_years,0.0637,0.006,9.944,0.000,0.051,0.076

0,1,2,3
Omnibus:,4387.054,Durbin-Watson:,1.997
Prob(Omnibus):,0.0,Jarque-Bera (JB):,7684.979
Skew:,0.584,Prob(JB):,0.0
Kurtosis:,4.408,Cond. No.,306.0


You can now extedn this model to try to understand what the impact of for example Region, BMI or visit type has on the BP of the patients

In [17]:
# Create and fit a larger linear model as above