## Final Assignment


Before working on this assignment, please read the instructions fully. Use Blackboard to submit a link to your repository. Upload a rendered document (HTML/PDF) as well as the original code. Please familiarize yourself with the criteria before beginning the assignment.

You should define a research question yourself based on at least two data sources that can be merged into a tidy dataset. The research question should be life science related and causal in nature. For instance: How do independent variables X influence the dependent variable Y? The research question should be answered with an interactive visual and, if possible, tested for significance.
If you use code snippets from others, you should refer to the original author; otherwise you will be accused of plagiarism. Please be prepared to explain your code in a verbal exam.



Assessment criteria

Conditional
- No data or API-key information is stored in the repository.
- No hard-coded data paths are used; data paths are provided in a config file.
- At least two data sets are merged into one tidy dataframe.

Graded
- (5 pt) The research question is stated. 
- (5 pt) Links to sources are provided, with a short description of the data.
- (20 pt) Data quality and data quantity are inspected and reported. Appropriate transformations are applied.
- (20 pt) Assumptions and presuppositions are made explicit (chosen data storage method, chosen analysis method, chosen design). An argumentative approach is used explaining steps, taking into account data quality and quantity. Explanation is provided either with comments in the code or in a separate document.
- (10 pt) Interactive visualization is extracted from correct analysis of (incomplete) data.
- (10 pt) The design supports the research question. The data is informative in relation to the topic. The visualization is functional and attractive. Figures contain X and Y labels, title and captions.
- (20 pt) Code is efficiently written, follows the coding style, is free of code smells and is easy to read. Code is demonstrated to be robust and flexible.
- (10 pt) All the code is stored in a repository with a README including the most relevant information to implement the code. Used software is suitably licensed and documented.


### About the data

You can either choose 
- a dataset combination provided on Blackboard
- two datasets on the web from two different sources which can be used to answer a research question
- the data from your project

You are welcome to choose datasets at your discretion, but keep in mind they will be shared with others, so choose appropriate datasets. You are welcome to use datasets of your own as well, but at least two datasets should come from the web and/or APIs.

Also, you are welcome to preserve data in its original language, but for the purposes of grading you should provide English translations in your visualization.

### Instructions:

Define a research question, select data and code your data acquisition, data processing, data analysis and visualization. Use a repository with a commit strategy and write a readme file. Make sure that you document your choices. 

# The relationship between Psycho-social workload and smoking

## Research question

Are there any differences in the effects of psycho-social workload on smokers and non-smokers?
Smoking has been linked to an increased risk of developing various physical and mental health conditions. But what about the effects of psycho-social workload on smokers and non-smokers? This research explores the differences between the two groups in their response to workload, and how this could affect their overall wellbeing. Two main data sets are used: (1) the lifestyle of the Dutch population in private households, mainly the smoking rate per year and age group; (2) the psychosocial workload (PSA) of Dutch employees aged 15 to 75, by age and gender.


### Loading needed libraries

In [None]:
import pandas as pd
import yaml
import numpy as np
import matplotlib.pyplot as plt
from bokeh.plotting import figure, show
import bokeh_catplot
import bokeh.io
bokeh.io.output_notebook()
import panel as pn
import seaborn as sns
from panel.interact import interact
from bokeh.models import FactorRange, ColumnDataSource
pn.extension()
from bokeh.transform import dodge


### 1. Data loading
To run this analysis, download the following files: [workload.csv](https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS&tableId=83049NED&_theme=175) and [Smoking.csv](https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS&tableId=85457NED&_theme=162). Their locations are read from `config.yaml` (keys `workload` and `smokingData`).

In [None]:
with open("config.yaml","r") as stream:
    config = yaml.safe_load(stream)

file_smoking_file = config['smokingData']
file_workload = config['workload']
raw_df_workload = pd.read_csv(file_workload,sep=';')
raw_df_smoking = pd.read_csv(file_smoking_file,sep=';')
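
Because the data paths come from a config file, a missing key or a wrong path only shows up when `pd.read_csv` fails. The sketch below checks the loaded config first. The helper name `resolve_data_paths` is hypothetical (it is not part of the notebook); only the two keys `smokingData` and `workload` are taken from the cell above.

```python
from pathlib import Path

# Keys used in the loading cell above
REQUIRED_KEYS = ("smokingData", "workload")


def resolve_data_paths(config, required_keys=REQUIRED_KEYS):
    """Return {key: Path} for every required key, failing early with a clear message.

    Hypothetical helper: `config` is the dict returned by yaml.safe_load.
    """
    missing = [key for key in required_keys if key not in config]
    if missing:
        raise KeyError(f"config.yaml is missing keys: {missing}")
    paths = {key: Path(config[key]).expanduser() for key in required_keys}
    not_found = [str(path) for path in paths.values() if not path.is_file()]
    if not_found:
        raise FileNotFoundError(f"Data files not found: {not_found}")
    return paths
```

With this in place, `paths = resolve_data_paths(config)` could be called right after `yaml.safe_load`, and `paths['workload']` passed to `pd.read_csv`.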

### 2. Data inspection
The data is inspected for structure and format: the number of rows and columns, the datatypes, and the missing data of each dataframe.

#### workload dataframe

In [None]:
raw_df_workload.head(10)

#### Smoking dataframe

In [None]:
raw_df_smoking.head(10)

In [None]:
raw_df_smoking.info()
raw_df_workload.info()

##### Observations:

1. Most of the values are percentages derived from the scores of survey questions.
2. Several columns hold special codes that need to be converted to readable values.
3. Several columns of the smoking dataset need to be converted from object to float.
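
The inspection above relies on reading the output of `head()` and `info()` by eye. As a minimal sketch, the same checks can be collected in one table per dataframe. The function name `inspect_frame` and the toy values are assumptions made for illustration; they do not come from the CBS files.

```python
import pandas as pd


def inspect_frame(df):
    """Return one row per column: dtype, missing count and missing percentage."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })


# Toy frame standing in for a raw export (values are made up)
toy = pd.DataFrame({
    "Periods": ["2014JJ00", "2015JJ00"],
    "Smokers_1": ["25.7", None],
})
print(toy.shape)           # number of rows and columns
print(inspect_frame(toy))  # dtype and missing data per column
```

Calling `inspect_frame(raw_df_smoking)` and `inspect_frame(raw_df_workload)` would give a report that can be quoted directly in the observations.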

### 3. Data Wrangling
#### 3a Data cleaning
In this step, missing data is removed and datatypes are converted to the correct types. Special code values are also converted to readable values.

#### Workload dataset

In [None]:
#displaying columns names
raw_df_workload.columns

In [None]:
#changing the column names to appropriate names
raw_df_workload.rename(columns={'Working Very Fast_1':'Fast Working','Doing A Lot Of Work_2':'Do lot of work','Periods':'Year',
                                'Emotionally DifficultWork Situations_10':'Emotionally Difficult Work Situation','OfManagersOr Colleagues_14':'Victimization at work',
                                'ExhaustedDue To Work_26':'ExhaustedByWork'},inplace = True)

In [None]:
# Drop the ID column
df_workload = raw_df_workload.drop('ID', axis=1)
# The dataframe stores CBS codes that are converted to readable values.
# NOTE: this maps code 3000 to 'F' and every other code to 'M';
# check the mapping against the CBS Gender code list.
df_workload['Gender'] = np.where(df_workload['Gender'] == 3000, 'F', 'M')
df_workload['Year'] = df_workload['Year'].str.replace('JJ00', '')
df_workload['Age'] = df_workload['Age'].replace([53050, 53500, 53700, 53800, 53900, 53925],
                                                ['15-25', '25-35', '35-45', '45-55',
                                                 '55-65', '65-75'])
# Convert the columns to appropriate dtypes
df_workload['Year'] = df_workload['Year'].astype('int64')
workload_columns = ['Fast Working', 'Do lot of work', 'Emotionally Difficult Work Situation',
                    'Victimization at work', 'ExhaustedByWork']
for column in workload_columns:
    # Values that cannot be parsed as numbers become NaN
    df_workload[column] = pd.to_numeric(df_workload[column], errors='coerce')
# Remove rows with missing values
df_workload.dropna(axis=0, inplace=True)

In [None]:
df_workload

In [None]:
df_workload.dtypes
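
Both datasets need the same two steps: strip the `JJ00` suffix from the period code and coerce the value columns to numbers. A minimal sketch of a reusable helper is shown below. The name `clean_cbs_frame` and the toy values (including the `.` placeholder) are assumptions for illustration only.

```python
import pandas as pd


def clean_cbs_frame(df, value_columns, period_column="Year"):
    """Strip the 'JJ00' suffix from period codes and coerce value columns to floats."""
    out = df.copy()  # leave the raw dataframe untouched so the cell can be re-run
    out[period_column] = (
        out[period_column].astype(str).str.replace("JJ00", "", regex=False).astype("int64")
    )
    for column in value_columns:
        # Values that cannot be parsed as numbers become NaN instead of raising
        out[column] = pd.to_numeric(out[column], errors="coerce")
    # Drop rows where any value column is missing
    return out.dropna(subset=value_columns).reset_index(drop=True)


# Toy frame with made-up values; '.' stands in for a non-numeric placeholder
toy = pd.DataFrame({
    "Year": ["2014JJ00", "2015JJ00", "2016JJ00"],
    "Fast Working": ["35.1", "       .", "36.4"],
})
cleaned = clean_cbs_frame(toy, ["Fast Working"])
print(cleaned)
```

Working on a copy means the raw dataframe keeps its original values, so running the cleaning cell twice does not fail on an already converted column.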

#### Smoking dataset

In [None]:
# Display the column names
raw_df_smoking.columns

In [None]:
#changing the column names to appropriate names
raw_df_smoking.rename(columns={ 'Personal Attributes':'Age','Periods':'Year', 'Smokers_1':'Smoker',
                                'ExSmokers_2':'Exsmoker', 'NeverSmokers_3':'Neversmoker'},inplace = True)
# The dataframe stores CBS codes that are converted to readable values.
# NOTE: codes 53900 and 53925 are labelled '65-75' and '75-85' here but '55-65' and '65-75'
# in the workload dataset; verify both mappings against the CBS code lists before merging.
raw_df_smoking['Year'] = raw_df_smoking['Year'].str.replace('JJ00', '')
raw_df_smoking['Age'] = raw_df_smoking['Age'].replace([53080, 60300,60400,60500,71100,53900,53925],
                                                        ['15-25', '25-35', '35-45', '45-55', '55-65',
                                                         '65-75','75-85'])
#drop unwanted columns
smoking_df= raw_df_smoking.drop(['ID','Margins','DailySmokersAmongPopulation_4',
                                 'DailySmokersAmongSmokers_5','SmokingYearsCurrentCigaretteSmokers_10',
                                 'CigarettesPerDayPerSmoker_9'], axis=1)
#converting datatype
smoking_df['Year']=smoking_df['Year'].astype('int64')

In [None]:
smoking_df

In [None]:
smoking_df.dtypes

In [None]:
# Before merging the two dataframes, the smoking dataset is grouped by age and year
# in order to avoid duplicated rows in the merge
df_smoking = smoking_df.groupby(['Age','Year'])[['Smoker','Exsmoker','Neversmoker']].mean().reset_index()
df_smoking.head()

#### 3b Data Combining 
Here the smoking and workload dataframes are merged on the Age and Year columns.

In [None]:
df_final = pd.merge(df_workload,df_smoking, on =["Age",'Year'],how ='inner')
df_final.describe()
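
An inner merge silently drops every row whose Age and Year have no partner in the other dataframe. The sketch below, on made-up toy frames, shows one way to make that visible: `validate="many_to_one"` raises if the smoking side still has duplicate keys, and `indicator=True` marks rows without a match. The toy values are assumptions; `pd.merge` and both parameters are standard pandas.

```python
import pandas as pd

# Toy frames with made-up values, shaped like df_workload and df_smoking
workload = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Age": ["15-25", "15-25", "25-35"],
    "Year": [2014, 2014, 2014],
    "ExhaustedByWork": [10.0, 12.0, 15.0],
})
smoking = pd.DataFrame({
    "Age": ["15-25", "35-45"],
    "Year": [2014, 2014],
    "Smoker": [25.0, 27.0],
})

merged = pd.merge(
    workload, smoking,
    on=["Age", "Year"],
    how="left",
    validate="many_to_one",  # each (Age, Year) may appear only once in the smoking frame
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
unmatched = merged[merged["_merge"] == "left_only"]
tidy = merged[merged["_merge"] == "both"].drop(columns="_merge")

print(f"{len(unmatched)} workload row(s) had no smoking data")
print(tidy)
```

Here `tidy` holds the same rows an inner merge would return, while `unmatched` lists what was lost, for example age groups that are labelled differently in the two sources.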

### 4. Data Exploration

To learn more about the merged dataframe, some of its information is displayed graphically.

In [None]:
df_final.head(20).style.background_gradient(cmap = "flare")

### Pie chart 

In [None]:
def pie_chart():
    Gender1 = (df_final['Gender']=='M').sum()
    Gender0 = (df_final['Gender']=='F').sum()
    print(f'There are total of {Gender1} Males and {Gender0} Females')
    print(f"There are total of {len(df_final.groupby('Age'))} Age Categories{df_final['Age'].unique()}")
    colors_age_group = ['red','orange','yellow','green','blue','indigo']
    colors_gender = ['fuchsia','royalblue']
    plt.figure(figsize=(13,13))
    plt.subplot(121)
    df_final['Age'].value_counts().plot.pie(explode=[0.1, 0.0, 0.1, 0.0, 0.1, 0.1],
            autopct='%1.2f%%',fontsize=12,colors=colors_age_group,shadow = True, startangle = 70,
            wedgeprops= {"edgecolor":"black",
                         'linewidth': 1,
                         'antialiased': True})
    plt.subplot(122)
    df_final['Gender'].value_counts().plot.pie(explode=[0,0.05],autopct='%1.f%%',
                                               fontsize=15,colors=colors_gender,shadow = True, startangle = 90,
            wedgeprops= {"edgecolor":"black",
                         'linewidth': 1,
                         'antialiased': True})

piechart = pn.interact(pie_chart)
pn.extension()
pn.Row(piechart)

#### Observations:
1. There are six age groups in total, and 65-75 is the smallest group compared with the others.
2. The dataset contains an almost equal number of men and women, so comparisons between genders are not skewed by group size.

### 1. Bar graph

In [None]:
def barplot(Age, Year):
    # Keep only the rows for the selected age group and year
    df = df_final[(df_final['Age'] == Age) & (df_final['Year'] == Year)]
    # Positions 3 to 7 of df_final are taken to be the workload indicator columns
    workload = list(df_final.columns[3:8])
    workload_mean = [df[x].mean() for x in workload]
    color_code = ['red', 'orange', 'violet', 'green', 'blue']
    p = figure(title='Workload per age group and year',
               x_range=FactorRange(factors=workload), width=500, height=500)
    p.xaxis.axis_label = 'Workload'
    p.yaxis.axis_label = 'Percentage (%)'
    p.xaxis.major_label_orientation = 1
    p.vbar(x=workload, top=workload_mean, width=0.5, color=color_code, fill_alpha=0.7, line_color='white')
    return p

In [None]:
Ages= list(df_final.Age.unique())
value = pn.widgets.IntSlider(name ='Year',start = 2014,end = 2021,step=1)
barplot = pn.interact(barplot,Age = Ages,Year=value)
pn.Row(barplot)

#### Observations:
1. Age groups 25-35 and 35-45 have the highest work rate over the years; the work rate is lower in the older age groups.
2. Work pressure affects the 25-35 age group the most. More experienced employees appear to report less difficulty.
3. Overall, the percentages rise and fall from year to year, so no specific trend is visible.
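
Whether a series shows a trend can be checked with more than the eye. As a minimal sketch, the least-squares slope of an indicator against the year gives the average change in percentage points per year. The helper name `yearly_slope` and the numbers are made up for illustration.

```python
import numpy as np


def yearly_slope(years, values):
    """Least-squares slope of values against years (percentage points per year)."""
    slope, _intercept = np.polyfit(
        np.asarray(years, dtype=float), np.asarray(values, dtype=float), deg=1
    )
    return float(slope)


# Made-up yearly means for one indicator in one age group
years = [2014, 2015, 2016, 2017]
rising = [10.0, 12.0, 14.0, 16.0]  # steady increase
zigzag = [10.0, 14.0, 10.0, 14.0]  # up and down

print(yearly_slope(years, rising))
print(yearly_slope(years, zigzag))
```

A slope close to zero relative to the year-to-year swings would support the statement that no specific trend is present.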

### 2. Bar graph

In [None]:
def barplot_smoking(year):
    
    df = df_final[df_final['Year'] == year]
    
    source = ColumnDataSource(df)

    p = figure(x_range=df_final['Age'].unique(), y_range=(0, 100), title="Smoking habit by Age group and Year",
           width= 500, height= 500, toolbar_location=None, tools="")

    p.vbar(x=dodge('Age', -0.27, range=p.x_range), top='Smoker', source=source,
       width=0.2, color="#340701", legend_label="Smoker")

    p.vbar(x=dodge('Age',  0.0,  range=p.x_range), top='Exsmoker', source=source,
       width=0.2, color="#9c1a04", legend_label="Exsmoker")

    p.vbar(x=dodge('Age',  0.27, range=p.x_range), top='Neversmoker', source=source,
       width=0.2, color="#db8758", legend_label="Neversmoker")

    p.x_range.range_padding = 0.1
    p.xgrid.grid_line_color = None
    p.legend.location = "top_right"
    p.xaxis.axis_label="Age Group"
    p.yaxis.axis_label ="Percentage (%)"
    p.legend.orientation = "horizontal"
    p.legend.click_policy = "hide"
    return(p)

In [None]:
list_of_year =  list(df_final['Year'].unique()) 
years = pn.widgets.RadioButtonGroup(
        name='Year',
        options=list_of_year,
        button_type='success')

barplot_smoking = pn.interact(barplot_smoking, year = years)
pn.extension()
pn.Row(barplot_smoking)

#### Observations:
1. The bar graph clearly shows that the 25-35 age group has the highest smoking rate over the years 2014-2021.
2. The percentage of ex-smokers increases with age group.
3. The percentage of never-smokers decreases with age group.

### Scatter plot

In [None]:
def scatter_plot(workload,smoking):
    
    source = ColumnDataSource(df_final)

    # Setting plot parameters
    p = figure(width=800, height=600, title="Scatterplot for Working condition and smoking status")
    # Label each axis with the column currently selected in the widgets
    p.xaxis.axis_label = f"{smoking} (%)"
    p.yaxis.axis_label = f"{workload} (%)"
    
    # Scatter plot
    p.circle(smoking,workload, 
                 size=9, color="red", alpha=0.8, 
                 legend_label = 'workload v/s smoking', source=source)
    
    # Customizing legend
    p.legend.location = "top_left"
    p.legend.orientation = "horizontal"
    p.legend.background_fill_color = "grey"
    p.legend.background_fill_alpha = 0.15
    p.legend.label_text_font_size = "10.5pt"
    p.legend.label_text_font_style = "italic"
    
    return p


In [None]:
work_list =  list(df_final.columns[3:8]) 
smoke_list =  list(df_final.columns[8:11]) 

workloads = pn.widgets.RadioButtonGroup(name='workload',options=work_list,button_type='success')
smokers = pn.widgets.RadioButtonGroup(name='Smoker',options=smoke_list,button_type='success')

In [None]:
scatter_plot = pn.interact(scatter_plot, workload = workloads, smoking=smokers)
pn.Row(scatter_plot)

#### Observations:
In general, there seems to be a positive correlation between workload and smoking rate: as the workload increases, the smoking rate increases as well. This relationship seems to be stronger in the younger age groups over the years. All smoking statuses can be compared in order to check whether each of them is related to work pressure.
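
The statement that the relationship differs between age groups can be checked by computing the correlation separately per group. The sketch below does this on a toy frame with made-up values. The helper name `correlation_by_group` is hypothetical, and the column names only mirror those of `df_final`.

```python
import pandas as pd


def correlation_by_group(df, x, y, group="Age"):
    """Pearson correlation between columns x and y, computed separately per group."""
    result = {name: part[x].corr(part[y]) for name, part in df.groupby(group)}
    return pd.Series(result, name=f"r({x}, {y})")


# Toy frame with made-up values
toy = pd.DataFrame({
    "Age": ["15-25"] * 3 + ["25-35"] * 3,
    "Smoker": [20.0, 22.0, 24.0, 30.0, 28.0, 26.0],
    "ExhaustedByWork": [10.0, 11.0, 12.0, 14.0, 15.0, 16.0],
})
by_age = correlation_by_group(toy, "Smoker", "ExhaustedByWork")
print(by_age)
```

With only a handful of yearly values per age group, such coefficients are unstable and should be read as an indication rather than as evidence.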

### Correlation matrix heatmap

In [None]:
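def heat_map():
    fig, ax = plt.subplots()
    plt.title('Correlation matrix')
    fig.set_size_inches((10,10))
    # Restrict to numeric columns; Gender and Age hold strings and cannot be correlated
    sns.heatmap(df_final.corr(numeric_only=True), square = True, annot = True)
heatmap = heat_map()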

#### Observations:
1. What is meant by positive correlation?
A positive correlation is a relationship between two variables that move in tandem, that is, in the same direction.
2. What is meant by negative correlation?
A negative correlation is a relationship between two variables in which one variable increases as the other decreases, and vice versa.
3. Here the heatmap shows a positive correlation between smoking and workload, whereas Year and smoking show a negative correlation.
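
The heatmap shows the size of each correlation but not whether it could have arisen by chance. One option, sketched here with NumPy only, is a permutation test: shuffle one variable many times and count how often the shuffled correlation is at least as strong as the observed one. The function name `permutation_pvalue` and its parameters are assumptions made for this sketch; it is not the method used in the notebook.

```python
import numpy as np


def permutation_pvalue(x, y, n_permutations=5000, seed=0):
    """Two-sided permutation test for the Pearson correlation of x and y.

    Returns (observed_r, p_value).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(x, y)[0, 1]
    count = 0
    for _ in range(n_permutations):
        # Shuffling y breaks any real link between the two variables
        shuffled = rng.permutation(y)
        if abs(np.corrcoef(x, shuffled)[0, 1]) >= abs(observed):
            count += 1
    # The add-one correction keeps the estimate away from an impossible p = 0
    return observed, (count + 1) / (n_permutations + 1)
```

It could be applied as `permutation_pvalue(df_final['Smoker'], df_final['ExhaustedByWork'])`. Because the rows are aggregated percentages per age group, gender and year, they are not independent observations, so the resulting p-value should be interpreted with caution.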


In [None]:
import panel as pn
import numpy as np
import holoviews as hv

pn.extension()

template = pn.template.FastListTemplate(
    theme = 'default',
    accent_base_color="#5F4B8BFF",
    header_background="#97BC62FF",
    title = 'The relationship between Psycho-social workload and smoking',
    sidebar =[pn.pane.Markdown("## Are there any differences in the effects of psycho-social workload on smokers and non-smokers?"),
              pn.pane.Markdown("Smoking is a major public health concern that has been linked to an increased risk of various diseases, including cancer and cardiovascular diseases. But what are the underlying factors that make people more likely to smoke? Recent studies suggest that psycho-social workload can be one of the key contributors to smoking among people. "),
              pn.pane.Markdown("Data sources: [Lifestyle & personal characteristics](https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS&tableId=85457NED&_theme=162) [psychosocial workload (PSA)](https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS&tableId=83049NED&_theme=175)")],
    main = [pn.Column(pn.Row(barplot_smoking)),
            pn.Column(pn.Row(barplot)),
            pn.Column(pn.Row(scatter_plot)),
           ])
template.show()

### References
#### Data sources:
1. The [Lifestyle and personal characteristics](https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS&tableId=85457NED&_theme=162) dataset was retrieved from CBS open data StatLine (2014-2021).
2. The [psychosocial workload (PSA)](https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS&tableId=83049NED&_theme=175) dataset of Dutch employees was retrieved from CBS open data StatLine (2014-2021).

[CBS open data statline](https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS)