We have a 3-diet comparison dataset, which records people's weight-loss results on 3 diets. To find out which diet is more effective (or whether they all have the same effect), I will use ANOVA.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import plotly as py
import plotly.graph_objects as go
from plotly.offline import iplot
import plotly.express as px
import cufflinks
cufflinks.go_offline()
# Set the global theme
cufflinks.set_config_file(world_readable=True, theme='pearl', offline=True)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/3-diet-comparison/Diet_R.csv


Importing data

There are 7 columns in the dataset:
* Person - the person's ID code;
* gender - the person's gender;
* Age - how old the person is;
* Height - the person's height in centimeters;
* pre.weight - the person's weight before starting the diet;
* Diet - the diet code;
* weight6weeks - the person's weight after finishing the diet.

In [2]:
data = pd.read_csv("../input/3-diet-comparison/Diet_R.csv")
data.head(5)

Unnamed: 0,Person,gender,Age,Height,pre.weight,Diet,weight6weeks
0,1,0,22,159,58,1,54.2
1,2,0,46,192,60,1,54.0
2,3,0,55,170,64,1,63.3
3,4,0,33,171,64,1,61.1
4,5,0,50,170,65,1,62.2


### EDA

Getting info about the dataset. There are 90 rows with no null values; every column is an integer except weight6weeks, which is a float.

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Person        90 non-null     int64  
 1   gender        90 non-null     int64  
 2   Age           90 non-null     int64  
 3   Height        90 non-null     int64  
 4   pre.weight    90 non-null     int64  
 5   Diet          90 non-null     int64  
 6   weight6weeks  90 non-null     float64
dtypes: float64(1), int64(6)
memory usage: 5.0 KB


There is an equal number of people (30) in the group for each diet

In [4]:
data.groupby("Diet").Person.count()

Diet
1    30
2    30
3    30
Name: Person, dtype: int64

Checking outliers with boxplots

In [5]:
fig = go.Figure()

fig.add_trace(go.Box(y=data.Age.values, name = "Age"))
fig.add_trace(go.Box(y=data.Height.values, name = "Height"))
fig.add_trace(go.Box(y=data["pre.weight"].values, name = "Start weight"))
fig.add_trace(go.Box(y=data["weight6weeks"].values, name = "Finish weight"))

fig.show()

There are no outliers in the Age, pre.weight and weight6weeks columns, and their values look reasonable.
But there are outliers in the Height column: the sample includes people taller than 190 cm.
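
A box plot usually flags a point as an outlier when it lies more than 1.5 × IQR beyond the quartiles. The sketch below applies that rule with the standard library only. The helper name `iqr_outliers` and the sample heights are made up for illustration, and quartile conventions differ between libraries, so the fences may not match Plotly's exactly.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    # "inclusive" interpolates between the observed data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Hypothetical heights in centimeters, not taken from the dataset
heights = [160, 165, 170, 172, 175, 178, 180, 201]
lower, upper, outliers = iqr_outliers(heights)
print(lower, upper, outliers)
```

With the real data, the same helper could be called on `data.Height.to_list()`.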

In [6]:
data[data.Height > 185]

Unnamed: 0,Person,gender,Age,Height,pre.weight,Diet,weight6weeks
1,2,0,46,192,60,1,54.0
5,6,0,50,201,66,1,64.0
23,24,1,40,190,88,1,84.5
46,47,1,51,191,71,2,66.8
47,48,1,38,199,75,2,72.6
48,49,1,54,196,75,2,69.2
49,50,1,33,190,76,2,72.5
51,52,1,37,194,78,2,76.3
54,55,1,37,198,79,2,71.1
57,58,0,55,191,71,2,68.1


The majority of the tallest people were on the second diet. I will not delete these rows, because the analysis of the second diet's results depends on them and because these values may be real.

Gender is coded as 0 and 1, but we don't know which code means female and which means male. Maybe we can make a guess after looking at the height-to-weight graph.

In [7]:
print(data.gender.unique())

[0 1]


In [8]:
fig = px.scatter(data, x = "Height", y = "pre.weight", color = 'gender', title = "Height to weight ratio by gender")
fig.show()

The yellow points (gender 1) sit higher than the blue ones (gender 0), and since men usually weigh more than women, it is reasonable to guess that 1 is for male and 0 is for female.

It is inappropriate to compare diets by the number of kilograms lost, since this number depends on the initial weight (usually, the higher the starting weight, the more kilograms are lost). Therefore, I will compare body mass index values, which take a person's height into account: BMI before starting the diet, BMI after finishing it, and the change between them.

BMI = weight (kg) / height (m)^2
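
As a worked example of the formula, here is a minimal sketch that computes BMI for the first person in the dataset (58 kg before the diet, 54.2 kg after, 159 cm tall). The function name `bmi` is illustrative and not part of the notebook.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kilograms divided by squared height in meters."""
    height_m = height_cm / 100  # the dataset stores height in centimeters
    return weight_kg / height_m ** 2

# First row of the dataset: 58 kg before, 54.2 kg after, 159 cm tall
bmi_start = bmi(58, 159)
bmi_end = bmi(54.2, 159)
print(round(bmi_start, 2), round(bmi_end, 2), round(bmi_start - bmi_end, 2))
```

The BMI change is positive when the person lost weight.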

In [9]:
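# Height is in centimeters, so divide by 100 to get meters
data["BMI_start"] = data["pre.weight"] / (data["Height"] / 100) ** 2
data["BMI_end"] = data["weight6weeks"] / (data["Height"] / 100) ** 2
# Positive values mean the BMI went down over the diet
data["BMI_change"] = data["BMI_start"] - data["BMI_end"]
data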

Unnamed: 0,Person,gender,Age,Height,pre.weight,Diet,weight6weeks,BMI_start,BMI_end,BMI_change
0,1,0,22,159,58,1,54.2,22.942130,21.439025,1.503105
1,2,0,46,192,60,1,54.0,16.276042,14.648438,1.627604
2,3,0,55,170,64,1,63.3,22.145329,21.903114,0.242215
3,4,0,33,171,64,1,61.1,21.887076,20.895318,0.991758
4,5,0,50,170,65,1,62.2,22.491349,21.522491,0.968858
...,...,...,...,...,...,...,...,...,...,...
85,86,1,40,167,87,3,77.8,31.195095,27.896303,3.298792
86,87,1,51,175,88,3,81.9,28.734694,26.742857,1.991837
87,88,1,25,155,74,3,68.5,30.801249,28.511967,2.289282
88,89,1,36,168,81,3,76.6,28.698980,27.140023,1.558957


Each diet group covers approximately the same range of heights and starting weights

In [10]:
fig = px.scatter(data, x = "Height", y = "pre.weight", color = 'Diet', title = "Range of people with different parameters for diets")
fig.show()

Checking BMI change results on different diets. It seems like the third diet is more effective (its box sits higher than those of diets 1 and 2). I will check it with ANOVA.
So, H0: Diet 1, Diet 2 and Diet 3 affect weight loss equally.
    H1: at least one of Diet 1, Diet 2 and Diet 3 affects weight loss differently.

In [11]:
fig = go.Figure()

fig.add_trace(go.Box(y=data[data.Diet == 1].BMI_change.values, name = "Diet 1"))
fig.add_trace(go.Box(y=data[data.Diet == 2].BMI_change.values, name = "Diet 2"))
fig.add_trace(go.Box(y=data[data.Diet == 3].BMI_change.values, name = "Diet 3"))

fig.update_layout(title='BMI change results by diets')

fig.show()

Creating a separate list of BMI changes for each diet

In [12]:
diet_1 = data[data["Diet"] == 1]["BMI_change"].to_list()
diet_2 = data[data["Diet"] == 2]["BMI_change"].to_list()
diet_3 = data[data["Diet"] == 3]["BMI_change"].to_list()

Finding the overall (grand) mean BMI change across all diets

In [13]:
mean = data.BMI_change.mean()
mean

1.3413400888102005

Finding total sum of squares

In [14]:
diets = [diet_1, diet_2, diet_3]
sst = 0
for diet in diets:
    for change in diet:
        sst += (change - mean) ** 2

print(sst)

70.29681852815399


Degrees of freedom between groups (number of groups minus 1)

In [15]:
df_1 = 3 - 1

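Calculating the variation within groups (I write a function for this)

In [16]:
def ssw(values):
    """Return the group mean and the sum of squared deviations from it."""
    group_mean = sum(values) / len(values)
    squared_dev = 0
    for value in values:
        squared_dev += (value - group_mean) ** 2
    return group_mean, squared_dev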

In [17]:
ssw_1 = ssw(diet_1)
ssw_2 = ssw(diet_2)
ssw_3 = ssw(diet_3)

Calculating a sum of variations within groups

In [18]:
ssw_full = ssw_1[1] + ssw_2[1] + ssw_3[1]
ssw_full

55.60724318088374

Degrees of freedom within groups (number of observations minus number of groups)

In [19]:
df_2 = 90 - 3

Calculating variation between groups

In [20]:
# 30 is the number of people in each diet group; ssw_N[0] is that group's mean
ssb = 30 * ((ssw_1[0] - mean) ** 2 + (ssw_2[0] - mean) ** 2 + (ssw_3[0] - mean) ** 2)
ssb

14.689575347270234
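
As a sanity check, the total sum of squares should equal the within-group plus the between-group sums of squares. The sketch below simply plugs in the three values printed by the cells above. The `_reported` names are illustrative, chosen so they do not overwrite the notebook's variables.

```python
import math

# Values printed by the cells above
sst_reported = 70.29681852815399   # total sum of squares
ssw_reported = 55.60724318088374   # within-group sum of squares
ssb_reported = 14.689575347270234  # between-group sum of squares

# One-way ANOVA partitions the total variation: SST = SSW + SSB
decomposition_holds = math.isclose(
    sst_reported, ssw_reported + ssb_reported, rel_tol=1e-9
)
print(decomposition_holds)
```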

Calculating the F-statistic (mean square between groups divided by mean square within groups)

In [21]:
(ssb / df_1) / (ssw_full / df_2)

11.491246302710522
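
The manual steps above can be collected into one function. This is a minimal pure-Python sketch, not the notebook's own code: the name `one_way_anova` and the toy samples are assumptions made for illustration, and unlike the cells above it also handles groups of unequal size.

```python
def one_way_anova(groups):
    """Return (ssb, ssw, df_between, df_within, f_stat) for a list of samples."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    ssb = 0.0  # between-group sum of squares
    ssw = 0.0  # within-group sum of squares
    for g in groups:
        group_mean = sum(g) / len(g)
        ssb += len(g) * (group_mean - grand_mean) ** 2
        ssw += sum((x - group_mean) ** 2 for x in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ssb / df_between) / (ssw / df_within)
    return ssb, ssw, df_between, df_within, f_stat

# Toy samples, not taken from the dataset
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
toy_ssb, toy_ssw, toy_df1, toy_df2, toy_f = one_way_anova(toy_groups)
print(toy_ssb, toy_ssw, toy_df1, toy_df2, toy_f)
```

In the notebook, the equivalent call would be `one_way_anova([diet_1, diet_2, diet_3])`.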

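F-critical = 3.10129576 for α = 0.05 with 2 and 87 degrees of freedom ([F-critical calculator](http://danielsoper.com/statcalc/calculator.aspx?id=4))
F observed (11.49) > F critical, so there is a statistically significant difference between the diets. ANOVA by itself does not tell which diet differs; the boxplot suggests that the third diet is more effective, but confirming this would require a post-hoc pairwise comparison.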
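
The critical value can also be computed locally instead of with an online calculator. When the numerator has exactly 2 degrees of freedom, as here, the upper tail of the F distribution has the closed form P(F > x) = (1 + 2x/df2)^(-df2/2), which can be inverted by hand. The helper name `f_critical_df1_2` is made up for this sketch, and the shortcut is valid only for df1 = 2; in other cases a library routine such as `scipy.stats.f.ppf` is the usual choice.

```python
def f_critical_df1_2(alpha, df2):
    """Critical value of F(2, df2) at significance level alpha.

    Valid only for 2 numerator degrees of freedom, where
    P(F > x) = (1 + 2 * x / df2) ** (-df2 / 2).
    """
    return (df2 / 2) * (alpha ** (-2 / df2) - 1)

# 3 diets -> df1 = 2; 90 people -> df2 = 90 - 3 = 87
f_crit = f_critical_df1_2(alpha=0.05, df2=87)
print(round(f_crit, 4))
```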

The same F-statistic can be obtained with f_oneway from scipy

In [22]:
from scipy.stats import f_oneway

f_oneway(diet_1, diet_2, diet_3)

F_onewayResult(statistic=11.491246302710522, pvalue=3.728315951810503e-05)

The p-value, i.e. the probability that the F-statistic is at least 11.49 if the null hypothesis is true, is about 0.0000373, or 0.0037%. Since this value is below the significance level α = 5%, the null hypothesis is rejected.
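
The p-value can be cross-checked without scipy. With 2 numerator degrees of freedom, the upper tail of the F distribution is P(F > x) = (1 + 2x/df2)^(-df2/2). The sketch below applies it to the F-statistic obtained above; the helper name `f_survival_df1_2` is an assumption for illustration, and the formula does not hold for other values of df1.

```python
def f_survival_df1_2(f_stat, df2):
    """P(F > f_stat) for an F(2, df2) distribution (closed form for df1 = 2 only)."""
    return (1 + 2 * f_stat / df2) ** (-df2 / 2)

alpha = 0.05  # significance level used in the text
p_value = f_survival_df1_2(11.491246302710522, df2=87)

# Reject the null hypothesis when the p-value is below alpha
print(p_value, p_value < alpha)
```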