<a href="https://colab.research.google.com/github/callmesukhi/BFVM19PROG1/blob/main/assessments/Assignment_week_05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Assignment week 05: Sleeping habits

Welcome to **week five** of the Programming 1 course. You will learn to analyse data with pandas and numpy and to visualize it with bokeh. Concretely, you will preprocess the Sleep Study data into an appropriate format so that you can conduct statistical and visual analysis. Learning outcomes:


## About the data

The data is collected from a survey-based study of the sleeping habits of individuals within the US. 

Below is a description of each of the variables contained within the dataset.

- Enough = Do you think that you get enough sleep?
- Hours = On average, how many hours of sleep do you get on a weeknight?
- PhoneReach = Do you sleep with your phone within arm's reach?
- PhoneTime = Do you use your phone within 30 minutes of falling asleep?
- Tired = On a scale from 1 to 5, how tired are you throughout the day? (1 being not tired, 5 being very tired)
- Breakfast = Do you typically eat breakfast?

The two research questions you should answer in this assignment are:
1. Is there a difference in hours of sleep (Hours) caused by having breakfast (yes, no)?
2. Is there a difference in hours of sleep (Hours) caused by having breakfast and by tiredness (score)?


The assignment consists of 6 parts:

- [part 1: load the data](#0)
- [part 2: data inspection](#1)
- [part 3: check assumptions](#2)
   - [check normality 3.1](#ex-31)
   - [check equal variance 3.2](#ex-32)
- [part 4: prepare the data](#3)
- [part 5: answer the research question](#4)
- [part 6: enhanced plotting](#5)

Parts 1 to 5 are mandatory; part 6 is optional (bonus).
To pass the assignment you need a score of 60%.


**NOTE: If your project data is suitable, you can use that data instead of the given data.**

## ANOVA

Analysis of variance (ANOVA) compares the variance between groups with the variance within groups. It determines whether the differences between groups are larger than the differences within a group (the noise).
A graph illustrating this can be found here: https://link.springer.com/article/10.1007/s00424-019-02300-4/figures/2


In ANOVA, the dependent variable must be measured at a continuous (interval or ratio) level, for instance a glucose level. The independent variables must be categorical (nominal or ordinal), for instance trial category, time of day (AM versus PM) or time of trial (different categories). Like the t-test, ANOVA is a parametric test and makes some assumptions: the data are normally distributed, the variance among the groups is approximately equal (homogeneity of variance), and the observations are independent of each other.

A one-way ANOVA has just one independent variable. A two-way ANOVA (also called factorial ANOVA) uses two independent variables. For research question 1 we can use a one-way ANOVA; for research question 2 we can use a two-way ANOVA. But first we need to check the assumptions.
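
To make the between-versus-within comparison concrete, the sketch below computes the one-way F statistic by hand and compares it with `scipy.stats.f_oneway`. The three groups are made-up numbers chosen for illustration; they are not taken from the sleep data.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups (illustrative values only)
groups = [
    np.array([6.0, 7.0, 8.0, 7.0]),
    np.array([5.0, 6.0, 6.0, 7.0]),
    np.array([8.0, 9.0, 7.0, 8.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = all_values.size    # total number of observations

# Between-group sum of squares: spread of the group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

f_scipy, p_value = stats.f_oneway(*groups)
print(f"F (manual) = {f_manual:.3f}, F (scipy) = {f_scipy:.3f}, p = {p_value:.4f}")
```

A large F means the group means differ by more than the within-group noise would suggest; the p-value is only meaningful if the assumptions hold.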


---

<a name='0'></a>
## Part 1: Load the data (10 pt)

Load the `sleep.csv` data. Familiarize yourself with the data and answer the following questions.

1. What is the percentage of missing data?
2. Considering the research questions, what is the dependent variable and what are the independent variables? Are they of the correct datatype?

In [1]:
# importing useful libraries 
import pandas as pd
import numpy as np
import yaml

In [2]:
# creating a color class for printing things in different colors 
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'

In [3]:
# reading the config file
def get_config():
    """
    Reads the YAML config file and returns its contents as a dict.
    :return: dict, the parsed configuration
    """
    with open("config.yaml", "r") as file:
        # keep the file handle and the parsed content in separate names
        config = yaml.safe_load(file)
    return config

config = get_config()

In [4]:
# reading the file into pandas DataFrame
def read_data(file_path):
    """
    Reads the data into a pandas DataFrame.
    :param file_path: str, path to the CSV file
    :return: DataFrame, returns a pandas DataFrame
    """
    df = pd.read_csv(file_path)
    return df 

df = read_data(config['datadir'] + 'sleep.csv')

In [7]:
#code printing percentage missing data
def calculate_missing_data(df):
    """
    Prints the percentage of missing values in the entire DataFrame and in each column.
    :param df: DataFrame, holds a pandas DataFrame
    :return: DataFrame, the unchanged input so that the notebook displays it
    """
    # share of missing cells over all cells; summing the per-column
    # percentages would overstate the overall figure
    missing_total_data = round(df.isnull().to_numpy().mean() * 100, 2)
    print(color.BOLD + color.UNDERLINE + "Percentage of missing values in entire DataFrame is : " + color.END + color.DARKCYAN + f"{missing_total_data}%" + color.END)
    print('\n')
    missing_data_each_column = df.isnull().mean()
    print(color.BOLD + color.UNDERLINE + "Percentage of missing values in each column is:\n" + color.END + color.DARKCYAN + f"{round((missing_data_each_column * 100), 2)}" + color.END)
    print('\n')
    print(color.BOLD + color.UNDERLINE + color.PURPLE + "Actual Data:" + color.END)
    return df

calculate_missing_data(df)

Percentage of missing values in entire DataFrame is : 0.32%


Percentage of missing values in each column is:
Enough        0.00
Hours         1.92
PhoneReach    0.00
PhoneTime     0.00
Tired         0.00
Breakfast     0.00
dtype: float64


Actual Data:


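| | Enough | Hours | PhoneReach | PhoneTime | Tired | Breakfast |
|---|---|---|---|---|---|---|
| 0 | Yes | 8.0 | Yes | Yes | 3 | Yes |
| 1 | No | 6.0 | Yes | Yes | 3 | No |
| 2 | Yes | 6.0 | Yes | Yes | 2 | Yes |
| 3 | No | 7.0 | Yes | Yes | 4 | No |
| 4 | No | 7.0 | Yes | Yes | 2 | Yes |
| ... | ... | ... | ... | ... | ... | ... |
| 99 | No | 7.0 | Yes | Yes | 2 | Yes |
| 100 | No | 7.0 | No | Yes | 3 | Yes |
| 101 | Yes | 8.0 | Yes | Yes | 3 | Yes |
| 102 | Yes | 7.0 | Yes | Yes | 2 | Yes |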


In [None]:
#code printing answer dependent and independent variables
def dependent_independent_var(df):
    """
    Prints the dependent and independent variables for the two research questions.
    :param df: DataFrame, holds a pandas DataFrame
    :return: none
    """
    # both research questions ask about differences in Hours, so Hours is the outcome;
    # Breakfast (questions 1 and 2) and Tired (question 2) are the grouping factors
    dependent = 'Hours'
    independent = ('Breakfast', 'Tired')
    missing = [col for col in (dependent, *independent) if col not in df.columns]
    if missing:
        raise KeyError(f'Columns not found in DataFrame: {missing}')
    print(color.BOLD + color.UNDERLINE + 'Dependent variable is: ' + color.END + color.PURPLE + f'{dependent}' + color.END)
    print('\n')
    print(color.BOLD + color.UNDERLINE + 'Independent variables are: ' + color.END + color.RED + f'{independent}' + color.END)

dependent_independent_var(df)

Dependent variable is: Hours


Independent variables are: ('Breakfast', 'Tired')


In [None]:
#code printing answer about datatypes
def find_convert_datatypes(df):
    """
    Prints the current datatypes, converts the Yes/No columns to 'category'
    and prints the datatypes again.
    :param df: DataFrame, holds a pandas DataFrame
    :return: none
    """
    print(color.BOLD + color.UNDERLINE + 'Current DataTypes are:\n' + color.END + f'{df.dtypes}')

    columns = ['Enough', 'PhoneReach', 'PhoneTime', 'Breakfast']

    # the Yes/No answers are nominal, so 'category' fits better than 'object'
    for col in columns:
        df[col] = df[col].astype('category')

    print('\n')
    print(color.BOLD + color.UNDERLINE + 'Changed DataTypes:\n' + color.END + f'{df.dtypes}')

find_convert_datatypes(df)

Current DataTypes are:
Enough         object
Hours         float64
PhoneReach     object
PhoneTime      object
Tired           int64
Breakfast      object
dtype: object


Changed DataTypes:
Enough        category
Hours          float64
PhoneReach    category
PhoneTime     category
Tired            int64
Breakfast     category
dtype: object


---

<a name='1'></a>
## Part 2: Inspect the data (30 pt)

Inspect the data practically. Get an idea of how well the variable categories are balanced. Are the values of a variable equally divided? What is the mean value of the dependent variable? Are there correlations among the variables?


<ul>
<li>Create some meaningful overviews such as variable value counts</li>
<li>Create a scatter plot of the relation between being tired and hours of sleep, with different colors for Breakfast</li>
    <li>Print some basic statistics about the target (mean, standard deviation)</li>
    <li>Create a heatmap to check for correlations among variables. </li>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
    <ul><li>the gitbook has a bokeh heatmap example</li></ul>
</details>
</ul>

In [None]:
#code your answer to the value counts and distribution plots here
def value_count(df):
    """
    Prints the number of non-missing values in each column.
    :param df: DataFrame, holds a pandas DataFrame
    :return: none
    """
    print(color.BOLD + color.UNDERLINE + "Number of non-missing values in each column is as follows:\n" + color.END + color.PURPLE + f"{df.count()}" + color.END)

value_count(df)

Number of non-missing values in each column is as follows:
Enough        104
Hours         102
PhoneReach    104
PhoneTime     104
Tired         104
Breakfast     104
dtype: int64
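
`DataFrame.count()` reports the number of non-missing entries per column. To judge how balanced the categories are, the number of rows per category is more informative. A minimal sketch on a small made-up frame follows; `toy` and its values are hypothetical, and in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the sleep data (illustrative values only)
toy = pd.DataFrame({
    "Breakfast": ["Yes", "No", "Yes", "Yes", "No", "Yes"],
    "Tired": [3, 3, 2, 4, 2, 3],
})

# Absolute and relative frequency of each category
breakfast_counts = toy["Breakfast"].value_counts()
breakfast_share = toy["Breakfast"].value_counts(normalize=True)

# Cross-tabulation shows how the two factors combine
balance = pd.crosstab(toy["Breakfast"], toy["Tired"])

print(breakfast_counts)
print(breakfast_share.round(2))
print(balance)
```

Empty or nearly empty cells in the cross-tabulation are worth noting, because they limit what a two-way ANOVA can estimate.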


In [None]:
#code for the scatter plot here
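
One possible shape for this plot, sketched with bokeh on a small made-up frame: each Breakfast group is drawn with its own color and legend entry. The frame `toy`, the color choices and the marker size are assumptions for illustration.

```python
import pandas as pd
from bokeh.plotting import figure

# Hypothetical stand-in for the sleep data (illustrative values only)
toy = pd.DataFrame({
    "Hours": [8.0, 6.0, 6.0, 7.0, 7.0, 5.0],
    "Tired": [3, 3, 2, 4, 2, 5],
    "Breakfast": ["Yes", "No", "Yes", "No", "Yes", "No"],
})

colors = {"Yes": "seagreen", "No": "firebrick"}

p = figure(title="Tiredness versus hours of sleep",
           x_axis_label="Hours", y_axis_label="Tired")

# One scatter renderer per Breakfast answer, so each gets its own legend entry
for answer, subset in toy.groupby("Breakfast"):
    p.scatter(subset["Hours"].to_numpy(), subset["Tired"].to_numpy(),
              size=10, alpha=0.6, color=colors[answer],
              legend_label=f"Breakfast: {answer}")

# In a notebook: from bokeh.io import output_notebook, show; output_notebook(); show(p)
```

Because both variables take only a few distinct values, many points overlap; the `alpha` setting makes that overlap visible.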

In [None]:
#code your answer to the target statistics here
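
A minimal sketch of the target statistics, overall and per Breakfast group, on made-up values. The frame `toy` is hypothetical and includes one missing value to show that pandas skips it by default.

```python
import pandas as pd

# Hypothetical stand-in for the sleep data (illustrative values only)
toy = pd.DataFrame({
    "Hours": [8.0, 6.0, None, 7.0, 7.0, 5.0],
    "Breakfast": ["Yes", "No", "Yes", "No", "Yes", "No"],
})

# Missing values are skipped; std() uses the sample formula (ddof=1)
overall = toy["Hours"].agg(["mean", "std", "count"])
per_group = toy.groupby("Breakfast")["Hours"].agg(["mean", "std", "count"])

print(overall.round(2))
print(per_group.round(2))
```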

In [None]:
#code your answer for the heatmap here and briefly state your finding
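
Before a heatmap can be drawn, the Yes/No columns need a numeric encoding and the correlation matrix has to be brought into long format (one row per cell), which is the shape a bokeh `rect` glyph typically consumes. The sketch below covers only that preparation step; `toy` and the 0/1 encoding are assumptions, and Pearson correlations on binary or ordinal columns should be read as a rough guide.

```python
import pandas as pd

# Hypothetical stand-in for the sleep data (illustrative values only)
toy = pd.DataFrame({
    "Enough": ["Yes", "No", "Yes", "No", "No", "Yes"],
    "Hours": [8.0, 6.0, 6.0, 7.0, 7.0, 8.0],
    "Tired": [3, 3, 2, 4, 2, 1],
    "Breakfast": ["Yes", "No", "Yes", "No", "Yes", "Yes"],
})

# Encode the Yes/No answers as 1/0 so they can enter the correlation matrix
encoded = toy.copy()
for col in ["Enough", "Breakfast"]:
    encoded[col] = encoded[col].map({"No": 0, "Yes": 1})

corr = encoded.corr()

# Long format: one row per (x, y) cell of the matrix
corr_long = (
    corr.stack()
    .rename("correlation")
    .reset_index()
    .rename(columns={"level_0": "x", "level_1": "y"})
)
print(corr_long.head())
```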

---

<a name='2'></a>
## Part 3: Check Assumptions

Before we answer the research question with ANOVA we need to check the following assumptions:

1. ANOVA assumes that the dependent variable is normally distributed
2. ANOVA also assumes homogeneity of variance
3. ANOVA also assumes that the observations are independent of each other. Most of the time we need domain knowledge and experiment setup descriptions to estimate this assumption

We are going to do this graphically and statistically. 

<a name='ex-31'></a>
### Check normality (10 pt)

<ul><li>
Plot the distribution of the dependent variable. Add a vertical line at the position of the average. Add a vertical line for the robust estimate. Add the normal distribution line to the plot. Comment on the normality of the data. Do you want the full points? Plot with bokeh!</li>

<li>Use a Shapiro-Wilk Test or an Anderson-Darling test to check statistically</li></ul>


<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
    <ul><li>check the code of lesson 1 DS1 bayesian statistics</li>
        <li>heart_failure case of gitbook uses bokeh histograms</li>
</ul>
</details>

In [None]:
# your code to plot here

In [None]:
# briefly summarize your findings
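
For the statistical part of the normality check, a minimal sketch with `scipy.stats.shapiro` is shown below, together with the mean and a robust estimate (median and scaled MAD) that could mark the vertical lines in the plot. The `hours` array is simulated, not the survey data; it is rounded to whole hours to mimic how the answers are recorded.

```python
import numpy as np
from scipy import stats

# Simulated stand-in for the Hours column (assumption, not the real data)
rng = np.random.default_rng(seed=0)
hours = rng.normal(loc=7.0, scale=1.0, size=100).round()

mean = hours.mean()
median = np.median(hours)  # robust estimate of location
# Median absolute deviation, scaled to be comparable with a standard deviation
mad = stats.median_abs_deviation(hours, scale="normal")

# Null hypothesis of Shapiro-Wilk: the sample comes from a normal distribution
stat, p_value = stats.shapiro(hours)
print(f"mean={mean:.2f}, median={median:.2f}, MAD={mad:.2f}")
print(f"Shapiro-Wilk W={stat:.3f}, p={p_value:.4f}")
```

A small p-value is evidence against normality. Values recorded in whole hours produce many ties, which can lower the p-value even when the underlying quantity is roughly normal.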

<a name='ex-32'></a>
### Check homogeneity of variance (20 pt)

<ul><li>
Use boxplots to check the homogeneity of variance. Do you want the full points? Plot with bokeh!</li>

<li>Use Levene's and Bartlett's tests of equality (homogeneity) of variance to test equal variance statistically</li></ul>

In [None]:
# your code to plot here

In [None]:
# your code for the statistical test here
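
A minimal sketch of both tests with `scipy.stats.levene` and `scipy.stats.bartlett`. The two samples are simulated stand-ins for the hours of sleep with and without breakfast; their sizes and parameters are assumptions.

```python
import numpy as np
from scipy import stats

# Simulated stand-ins for the two Breakfast groups (assumption, not the real data)
rng = np.random.default_rng(seed=1)
hours_breakfast = rng.normal(loc=7.0, scale=1.0, size=40)
hours_no_breakfast = rng.normal(loc=6.5, scale=1.0, size=30)

# Levene's test with the median as center is less sensitive to non-normal data
levene_stat, levene_p = stats.levene(hours_breakfast, hours_no_breakfast, center="median")

# Bartlett's test assumes normally distributed groups
bartlett_stat, bartlett_p = stats.bartlett(hours_breakfast, hours_no_breakfast)

print(f"Levene:   W={levene_stat:.3f}, p={levene_p:.4f}")
print(f"Bartlett: T={bartlett_stat:.3f}, p={bartlett_p:.4f}")
```

For both tests the null hypothesis is that the group variances are equal, so a small p-value speaks against homogeneity of variance.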

In [None]:
# briefly summarize your findings

---

<a name='3'></a>
## Part 4: Prepare your data (10 pt)

Create a dataframe with equal sample sizes. Make three categories for tiredness: 1-2 = no, 3 = maybe, 4-5 = yes.

In [None]:
#your solution here
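
One way this could be approached, sketched on made-up values: `pd.cut` bins the tiredness score into the three categories, and each Breakfast group is downsampled to the size of the smallest group. Balancing on Breakfast alone, the column name `TiredCat` and the random seed are assumptions; the assignment may intend a different grouping.

```python
import pandas as pd

# Hypothetical stand-in for the sleep data (illustrative values only)
toy = pd.DataFrame({
    "Hours": [8.0, 6.0, 6.0, 7.0, 7.0, 5.0, 7.0, 6.0, 8.0, 4.0],
    "Tired": [3, 3, 2, 4, 2, 5, 1, 3, 1, 4],
    "Breakfast": ["Yes", "No", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes"],
})

# Bins are right-inclusive: (0, 2] -> no, (2, 3] -> maybe, (3, 5] -> yes
toy["TiredCat"] = pd.cut(toy["Tired"], bins=[0, 2, 3, 5], labels=["no", "maybe", "yes"])

# Downsample every Breakfast group to the size of the smallest group
n_min = toy["Breakfast"].value_counts().min()
balanced = toy.groupby("Breakfast").sample(n=n_min, random_state=42)

print(balanced["Breakfast"].value_counts())
```

Downsampling discards data, so the fixed `random_state` keeps the selection reproducible between runs.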

---

<a name='4'></a>
## Part 5: Answer the research questions (20 pt)

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
    <ul><li>use one-way ANOVA for research question 1</li>
    <li>Use two-way ANOVA for research question 2</li>
    <li>https://reneshbedre.github.io/blog/anova.html</li>
</ul>
</details>

In [None]:
#Your solution here

---

<a name='5'></a>
## Part 6: Enhanced plotting (20 pt)

Create a panel with 1) your dataframe with equal sample sizes, 2) a picture of a sleeping beauty, 3) the scatter plot of tired / hours of sleep with different colors for Breakfast from part 2, and 4) the boxplots, with the p-value of the ANOVA outcome in the title.

In [None]:
#your solution here