<a href="https://colab.research.google.com/github/araldi/HS21---Big-Data-Analysis-in-Biomedical-Research-376-1723-00L-/blob/main/Week2_homework_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 2 homework: NumPy and Pandas

In this homework, you will analyze a real dataset that contains information on ~500,000 individuals. You will calculate the mean and standard deviation of the BMI and of the waist-to-hip ratio for the males and females in that population.

### Download the following files:

**Data**

https://github.com/araldi/HS21---Big-Data-Analysis-in-Biomedical-Research-376-1723-00L-/blob/main/numpy/Week2_homework_data.csv?raw=true

(It describes some characteristics of ~500,000 individuals.)

**Dictionary**

https://raw.githubusercontent.com/araldi/HS21---Big-Data-Analysis-in-Biomedical-Research-376-1723-00L-/main/numpy/Week2_homework_dictionary.csv

(It maps each column code of the DataFrame to a human-readable description.)





### Analyze the data:
* *Create a new column with the BMI of the individuals* : BMI = weight [kg] / (height [m] * height [m]). Careful: the height used for the BMI has to be in meters, not centimeters.

* *Create a new column with the waist-to-hip ratio of the individuals* : waist-to-hip = waist / hip

* *Find the mean and standard deviation of the BMI of females and males* [Females = 0, Males = 1]

* *Find the mean and standard deviation of the waist-to-hip ratio of females and males* [Females = 0, Males = 1]

## Solution

### Import the data

In [None]:
# always initialize your environment!!
# you will need Pandas to deal with the DataFrame and NumPy to do math operations on the columns
import pandas as pd
import numpy as np

In [None]:
data = pd.read_csv('https://github.com/araldi/HS21---Big-Data-Analysis-in-Biomedical-Research-376-1723-00L-/blob/main/numpy/Week2_homework_data.csv?raw=true')

In [None]:
dictionary = pd.read_csv('https://raw.githubusercontent.com/araldi/HS21---Big-Data-Analysis-in-Biomedical-Research-376-1723-00L-/main/numpy/Week2_homework_dictionary.csv')

### Have a look at your data

It is always a good idea to look at the files before proceeding.

In [None]:
data.head(5)

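|   | Unnamed: 0 | 31-0.0 | 48-0.0 | 49-0.0 | 50-0.0 | 21002-0.0 |
|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 80.0 | 103.0 | 169.0 | 68.6 |
| 1 | 1 | 0.0 | 80.0 | 96.0 | 185.0 | 70.2 |
| 2 | 2 | 1.0 | 89.0 | 97.0 | 164.0 | 71.5 |
| 3 | 3 | 0.0 | 101.0 | 108.0 | 159.0 | 82.9 |
| 4 | 4 | 1.0 | 97.0 | 107.0 | 186.0 | 94.0 |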


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 499214 entries, 0 to 499213
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Unnamed: 0  499214 non-null  int64  
 1   31-0.0      499214 non-null  float64
 2   48-0.0      499214 non-null  float64
 3   49-0.0      499214 non-null  float64
 4   50-0.0      499214 non-null  float64
 5   21002-0.0   499214 non-null  float64
dtypes: float64(5), int64(1)
memory usage: 22.9 MB
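
An `Unnamed: 0` column like the one listed above is what pandas creates when the first header cell of a CSV is empty, which usually happens when a DataFrame was saved together with its row index. Whether that is how the homework file was produced is an assumption. The sketch below uses a small made-up CSV string (not the homework file) to show how `index_col=0` reads such a column as the index instead:

```python
import io
import pandas as pd

# Made-up CSV text with an empty first header cell (illustration only)
csv_text = ",31-0.0,50-0.0\n0,0.0,169.0\n1,1.0,164.0\n"

# Default behaviour: the unnamed first column becomes 'Unnamed: 0'
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))     # ['Unnamed: 0', '31-0.0', '50-0.0']
print(list(without_extra.columns))  # ['31-0.0', '50-0.0']
```

The extra column is harmless for this homework, so the solution simply leaves it in place.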


In [None]:
dictionary.head()

|   | Unnamed: 0 | Description | Code |
|---|---|---|---|
| 0 | 0 | Gender | 31-0.0 |
| 1 | 1 | Height [cm] | 50-0.0 |
| 2 | 2 | Weight [Kg] | 21002-0.0 |
| 3 | 3 | Waist circumference [cm] | 48-0.0 |
| 4 | 4 | Hip circumference [cm] | 49-0.0 |


In [None]:
dictionary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   5 non-null      int64 
 1   Description  5 non-null      object
 2   Code         5 non-null      object
dtypes: int64(1), object(2)
memory usage: 248.0+ bytes


### Prepare the dataset for analysis

We will relabel the columns to make the DataFrame easier to read. This step is not necessary (and you have not seen it before), but it makes the analysis more user-friendly.




In [None]:
# Create a dictionary of the codes and descriptions
dictionary_columns = {}
for index, value in enumerate(dictionary['Code']):
  dictionary_columns[value] = dictionary.loc[index, 'Description']

In [None]:
dictionary_columns

{'21002-0.0': 'Weight [Kg]',
 '31-0.0': 'Gender',
 '48-0.0': 'Waist circumference [cm]',
 '49-0.0': 'Hip circumference [cm]',
 '50-0.0': 'Height [cm]'}
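
The loop above works because the dictionary DataFrame has the default index 0, 1, 2, ..., so the position returned by `enumerate` is also a valid row label for `.loc`. As a possible alternative, the same mapping can be built by pairing the two columns directly with `zip`. The sketch below uses a shortened stand-in for the dictionary file to compare both approaches:

```python
import pandas as pd

# Shortened stand-in for the homework dictionary (same column names, fewer rows)
dictionary = pd.DataFrame({
    'Description': ['Gender', 'Height [cm]', 'Weight [Kg]'],
    'Code': ['31-0.0', '50-0.0', '21002-0.0'],
})

# Explicit loop, as in the solution cell
loop_mapping = {}
for index, value in enumerate(dictionary['Code']):
    loop_mapping[value] = dictionary.loc[index, 'Description']

# One-line alternative: pair each code with its description
zip_mapping = dict(zip(dictionary['Code'], dictionary['Description']))

print(zip_mapping == loop_mapping)  # True
```

The `zip` version does not depend on the row labels, so it also works if the DataFrame has been filtered or re-indexed.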

In [None]:
# re-label the columns of the dataframe
data = data.rename(columns = dictionary_columns)
data

|   | Unnamed: 0 | Gender | Waist circumference [cm] | Hip circumference [cm] | Height [cm] | Weight [Kg] |
|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 80.0 | 103.0 | 169.0 | 68.6 |
| 1 | 1 | 0.0 | 80.0 | 96.0 | 185.0 | 70.2 |
| 2 | 2 | 1.0 | 89.0 | 97.0 | 164.0 | 71.5 |
| 3 | 3 | 0.0 | 101.0 | 108.0 | 159.0 | 82.9 |
| 4 | 4 | 1.0 | 97.0 | 107.0 | 186.0 | 94.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 499209 | 502456 | 0.0 | 103.0 | 118.0 | 168.0 | 88.6 |
| 499210 | 502457 | 1.0 | 85.0 | 104.0 | 177.0 | 80.5 |
| 499211 | 502458 | 0.0 | 72.0 | 93.0 | 158.0 | 59.3 |
| 499212 | 502459 | 0.0 | 83.0 | 107.0 | 167.0 | 77.2 |


In [None]:
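# alternatively, you can skip the renaming step and use variables that hold the column codes
# (or use the codes directly, although that is less readable).
# Note: this cell must be run on the DataFrame with its original column codes, i.e. before the rename above;
# after the rename, data['48-0.0'] raises a KeyError.
waist = '48-0.0'
data[waist]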

0          80.0
1          80.0
2          89.0
3         101.0
4          97.0
          ...  
499209    103.0
499210     85.0
499211     72.0
499212     83.0
499213     86.0
Name: 48-0.0, Length: 499214, dtype: float64

### Create a new column for the BMI

BMI formula: weight [kg] / ( height [m] * height [m]) 

In [None]:
# transform the height from cm to m
height_m = np.divide(data['Height [cm]'], 100)
# equivalent to:
# height_m = data['Height [cm]'] / 100

# square the height
height_squared = np.power(height_m, 2)
# equivalent to:
# height_squared = height_m * height_m

# calculate the BMI
data['BMI'] = np.divide(data['Weight [Kg]'], height_squared)
# equivalent to:
# data['BMI'] = data['Weight [Kg]'] / height_squared

In [None]:
data['BMI'].describe()

count    499214.000000
mean         27.431592
std           4.801223
min          12.121212
25%          24.138910
50%          26.743199
75%          29.907407
max          74.683737
Name: BMI, dtype: float64

In [None]:
data.head()

|   | Unnamed: 0 | Gender | Waist circumference [cm] | Hip circumference [cm] | Height [cm] | Weight [Kg] | BMI |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 80.0 | 103.0 | 169.0 | 68.6 | 24.018767 |
| 1 | 1 | 0.0 | 80.0 | 96.0 | 185.0 | 70.2 | 20.511322 |
| 2 | 2 | 1.0 | 89.0 | 97.0 | 164.0 | 71.5 | 26.583879 |
| 3 | 3 | 0.0 | 101.0 | 108.0 | 159.0 | 82.9 | 32.791424 |
| 4 | 4 | 1.0 | 97.0 | 107.0 | 186.0 | 94.0 | 27.170771 |
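
As a quick sanity check, the BMI values in the table can be reproduced by hand from the height and weight columns. The sketch below rebuilds the first three rows in a small stand-in DataFrame and also shows what happens if the conversion from centimeters to meters is forgotten:

```python
import numpy as np
import pandas as pd

# First three rows of the table above (heights and weights only)
sample = pd.DataFrame({
    'Height [cm]': [169.0, 185.0, 164.0],
    'Weight [Kg]': [68.6, 70.2, 71.5],
})

# Correct calculation: convert cm to m before squaring
height_m = sample['Height [cm]'] / 100
sample['BMI'] = sample['Weight [Kg]'] / height_m ** 2

# Forgetting the conversion gives values 10,000 times too small
wrong_bmi = sample['Weight [Kg]'] / sample['Height [cm]'] ** 2

print(sample['BMI'].round(6).tolist())
print((sample['BMI'] / wrong_bmi).round(0).tolist())
```

The first printed list should agree with the `BMI` column above, and the second shows the factor of 100 * 100 introduced by the unit error.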


### Create a new column for waist-to-hip ratio

In [None]:
data['waist-to-hip ratio'] = np.divide(data['Waist circumference [cm]'], data['Hip circumference [cm]'])
# equivalent to:
# data['waist-to-hip ratio'] = data['Waist circumference [cm]'] / data['Hip circumference [cm]']


In [None]:
data['waist-to-hip ratio'].describe()

count    499214.000000
mean          0.871649
std           0.089779
min           0.200000
25%           0.803571
50%           0.872727
75%           0.936364
max           2.128205
Name: waist-to-hip ratio, dtype: float64

In [None]:
data.head()

|   | Unnamed: 0 | Gender | Waist circumference [cm] | Hip circumference [cm] | Height [cm] | Weight [Kg] | BMI | waist-to-hip ratio |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 80.0 | 103.0 | 169.0 | 68.6 | 24.018767 | 0.776699 |
| 1 | 1 | 0.0 | 80.0 | 96.0 | 185.0 | 70.2 | 20.511322 | 0.833333 |
| 2 | 2 | 1.0 | 89.0 | 97.0 | 164.0 | 71.5 | 26.583879 | 0.917526 |
| 3 | 3 | 0.0 | 101.0 | 108.0 | 159.0 | 82.9 | 32.791424 | 0.935185 |
| 4 | 4 | 1.0 | 97.0 | 107.0 | 186.0 | 94.0 | 27.170771 | 0.906542 |


### Find the mean and standard deviation of the BMI of females and males

Females: Gender == 0

Males: Gender == 1
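
The selection by gender relies on a Boolean mask: comparing a column with a value returns a Series of `True`/`False`, which can then be used to keep only the matching rows. A minimal sketch on a made-up three-row DataFrame (the values are for illustration only; `.loc` is used here as one possible way to select rows and column in a single step):

```python
import pandas as pd

# Made-up miniature dataset with the same column names as the homework
data = pd.DataFrame({'Gender': [0.0, 1.0, 0.0], 'BMI': [24.0, 26.6, 32.8]})

# Boolean Series: True where the row belongs to a female
mask = data['Gender'] == 0

# Keep only the rows where the mask is True, and only the BMI column
females_bmi = data.loc[mask, 'BMI']

print(mask.tolist())         # [True, False, True]
print(females_bmi.tolist())  # [24.0, 32.8]
```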

In [None]:
gender_dic = {0 : 'Females', 1:'Males'}

for gender in gender_dic:

  mask = data['Gender'] == gender
  mean_BMI = np.mean(data[mask]['BMI'])
  std_BMI = np.std(data[mask]['BMI'])

  print('The mean BMI in {} is '.format(gender_dic[gender]), mean_BMI)
  print('The standard deviation of BMI in {} is '.format(gender_dic[gender]), std_BMI)

The mean BMI in Females is  27.0914964589137
The standard deviation of BMI in Females is  5.1954646653226755
The mean BMI in Males is  27.83811796618345
The standard deviation of BMI in Males is  4.2470209608327565
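
One detail to keep in mind: `np.std` computes the population standard deviation by default (`ddof=0`, dividing by N), whereas the pandas `.std()` method defaults to the sample standard deviation (`ddof=1`, dividing by N - 1). With hundreds of thousands of rows the two are practically identical, but on small data they differ visibly. A short sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])

# NumPy default: population standard deviation (ddof=0)
population_std = np.std(values.to_numpy())

# pandas default: sample standard deviation (ddof=1)
sample_std = values.std()

# pandas can be told to match NumPy explicitly
matched_std = values.std(ddof=0)

print(population_std, sample_std, matched_std)
```

Either convention is acceptable for this homework as long as it is applied consistently.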


### Find the mean and standard deviation of the waist-to-hip ratio of females and males


In [None]:
gender_dic = {0 : 'Females', 1:'Males'}
for gender in gender_dic:

  mask = data['Gender'] == gender
  mean_whr = np.mean(data[mask]['waist-to-hip ratio'])
  std_whr = np.std(data[mask]['waist-to-hip ratio'])

  print('The mean waist-to-hip ratio in {} is '.format(gender_dic[gender]), mean_whr)
  print('The standard deviation of waist-to-hip ratio in {} is '.format(gender_dic[gender]), std_whr)

The mean waist-to-hip ratio in Females is  0.8179575339770495
The standard deviation of waist-to-hip ratio in Females is  0.0701035927293748
The mean waist-to-hip ratio in Males is  0.9358285534665348
The standard deviation of waist-to-hip ratio in Males is  0.06523575861845037


### Same as above, but using a function

In [None]:
def print_mean_and_std(parameter):
  # print the mean and standard deviation of the column `parameter`, separately for females and males
  gender_dic = {0 : 'Females', 1:'Males'}
  for gender in gender_dic:

    mask = data['Gender'] == gender
    mean_value = np.mean(data[mask][parameter])
    std_value = np.std(data[mask][parameter])

    print('The mean {} in {} is '.format(parameter, gender_dic[gender]), mean_value)
    print('The standard deviation of {} in {} is '.format(parameter, gender_dic[gender]), std_value, '\n')

In [None]:
print_mean_and_std('BMI')

The mean BMI in Females is  27.0914964589137
The standard deviation of BMI in Females is  5.1954646653226755 

The mean BMI in Males is  27.83811796618345
The standard deviation of BMI in Males is  4.2470209608327565 
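
For comparison, pandas can compute per-group statistics without an explicit loop by using `groupby`. This is not the approach taken in the solution, only a possible alternative. The sketch below uses a made-up four-row DataFrame; only the column names follow the homework:

```python
import pandas as pd

# Made-up miniature dataset (values are for illustration only)
data = pd.DataFrame({
    'Gender': [0.0, 0.0, 1.0, 1.0],
    'BMI': [22.0, 26.0, 25.0, 29.0],
    'waist-to-hip ratio': [0.78, 0.82, 0.90, 0.96],
})

summary = (
    data.groupby('Gender')[['BMI', 'waist-to-hip ratio']]
        .agg(['mean', 'std'])   # 'std' here is the pandas sample SD (ddof=1)
        .rename(index={0.0: 'Females', 1.0: 'Males'})
)

print(summary)
```

Note that the `'std'` aggregation uses the pandas default (`ddof=1`), so on the real data its results would differ very slightly from the `np.std` values printed above.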



In [None]:
data.columns

Index(['Unnamed: 0', 'Gender', 'Waist circumference [cm]',
       'Hip circumference [cm]', 'Height [cm]', 'Weight [Kg]', 'BMI',
       'waist-to-hip ratio'],
      dtype='object')

In [None]:
data.columns[2:]

Index(['Waist circumference [cm]', 'Hip circumference [cm]', 'Height [cm]',
       'Weight [Kg]', 'BMI', 'waist-to-hip ratio'],
      dtype='object')
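
Slicing with `[2:]` depends on the two columns to skip being the first two. A possible alternative that does not depend on column order is to drop them by name with `Index.drop`. The sketch below rebuilds the column index shown above and checks that both selections agree:

```python
import pandas as pd

# Column labels as shown in the output above
columns = pd.Index(['Unnamed: 0', 'Gender', 'Waist circumference [cm]',
                    'Hip circumference [cm]', 'Height [cm]', 'Weight [Kg]',
                    'BMI', 'waist-to-hip ratio'])

# Selection by position, as in the solution
by_position = columns[2:]

# Selection by name: remove the columns that are not measurements
by_name = columns.drop(['Unnamed: 0', 'Gender'])

print(by_name.equals(by_position))  # True
```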

In [None]:
for column in data.columns[2:]:
  print_mean_and_std(column)

The mean Waist circumference [cm] in Females is  84.72162179562625
The standard deviation of Waist circumference [cm] in Females is  12.547417381504973 

The mean Waist circumference [cm] in Males is  96.95652692635815
The standard deviation of Waist circumference [cm] in Males is  11.34510964036734 

The mean Hip circumference [cm] in Females is  103.36148063395774
The standard deviation of Hip circumference [cm] in Females is  10.392256219236845 

The mean Hip circumference [cm] in Males is  103.43745481490562
The standard deviation of Hip circumference [cm] in Males is  7.622580398623613 

The mean Height [cm] in Females is  162.44412028725202
The standard deviation of Height [cm] in Females is  6.309008447064729 

The mean Height [cm] in Males is  175.6196158277547
The standard deviation of Height [cm] in Males is  6.842008449989608 

The mean Weight [Kg] in Females is  71.45942622950827
The standard deviation of Weight [Kg] in Females is  14.093511511401559 

The mean Weight [Kg] 