# Programming for Data Analysis Project

***

Project Brief:

*Create a data set by simulating a real-world phenomenon of your choosing.*

 - *Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.*
 - *Investigate the types of variables involved, their likely distributions, and their relationships with each other.*
 - *Synthesise/simulate a data set as closely matching their properties as possible.*
 - *Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.*
 
 ***

### Introduction

<br>

Within this Jupyter notebook, I will simulate a data set that represents the distribution of salaries across a population.

Some key variables which influence the salary of an individual are age, gender, education, industry, and continuity of service.

To begin, I will research these variables and their trends within the workforce, as well as how they may relate to one another. I will then investigate how these variables impact the annual salary of an individual.

Following this investigative work, I will conclude by using the results to simulate a data set that mimics the real-world data.

<br>

```python
# for numerical arrays and distributions
import numpy as np
# for plotting
import matplotlib.pyplot as plt
# creating dataframes
import pandas as pd
# another format of plotting
import seaborn as sns
# nice plot layout in notebook
%matplotlib inline
# defining the seed so my results don't vary on each execution
rng = np.random.default_rng(seed=0)
```
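As a quick check of what seeding buys, the sketch below builds separate generators and compares their output. The names `rng_a`, `rng_b` and `rng_c` are illustrative only and are not used elsewhere in this notebook.

```python
import numpy as np

# Two generators built from the same seed produce identical streams
rng_a = np.random.default_rng(seed=0)
rng_b = np.random.default_rng(seed=0)

draws_a = rng_a.normal(size=5)
draws_b = rng_b.normal(size=5)
same_seed_match = np.array_equal(draws_a, draws_b)

# A generator built from a different seed produces a different stream
rng_c = np.random.default_rng(seed=1)
draws_c = rng_c.normal(size=5)
different_seed_match = np.array_equal(draws_a, draws_c)

print(same_seed_match, different_seed_match)
```

Note that the generator's state advances with every draw, so the results are only repeatable when the notebook is run from the top; re-running a single cell on its own will give new values.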

### Variable 1: Gender

For the purpose of this investigation, I will consider male and female as the two possible genders within the workforce, as the proportion of non-binary individuals recorded in widely available statistical data is negligible. As such, gender in the workforce can be modelled as a binomial distribution with a single trial per person (a Bernoulli trial).

According to 2019 data from the Central Statistics Office (CSO), women account for 45.9% of the workforce in Ireland, and men make up the remaining 54.1% [1].

For this analysis, a "success" will correspond to Male, so I will set p = 0.541.

```python
# one trial for each execution
n = 1
# probability of success
p = 0.541
# repeat 1000 times
x = 1000
gender = rng.binomial(n, p, x)
```
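To sanity-check the simulated gender variable, the sketch below redraws it in a self-contained way and compares the observed share of "successes" with p. The names `p_male`, `n_people` and `gender_labels` are my own additions for illustration, and the text labels are an assumption about how the 0/1 values might later be displayed.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

p_male = 0.541    # share of men in the workforce, from the CSO figure cited above
n_people = 1000   # sample size used in this notebook

# With n=1, each draw is a single Bernoulli trial: 1 = Male, 0 = Female
gender = rng.binomial(1, p_male, n_people)

# Hypothetical label mapping, added for readability only
gender_labels = np.where(gender == 1, "Male", "Female")

# The sample proportion should be close to p_male, but will not match it exactly
observed_male_share = gender.mean()
print(f"Observed male share: {observed_male_share:.3f}")
```

Because each value is 0 or 1, the mean of the array is simply the proportion of simulated individuals who are male.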

### Variable 2: Age

The 2019 data from the CSO also provides a breakdown of the male and female workforce across age bands [1]. As seen in Table 5.4 and Figure 5.5 of that release, the age distribution for both genders peaks in the 35-44 age group and loosely follows a normal distribution.

Taking 40, the approximate midpoint of the peak age band, as the mean, the remaining parameter to be considered is the standard deviation. The empirical rule [2] states that about 99.7% of normally distributed values lie within three standard deviations of the mean. Given that the significant age range falls within roughly +/- 25 years of the mean, a standard deviation of 8 years (3 x 8 = 24) describes the spread of the data well.



```python
# setting mean, standard deviation, and size of output
loc = 40
scale = 8
size = 1000
# creating the distribution
age = rng.normal(loc, scale, size)
# converting the age values to integers (astype(int) drops the fractional part)
age = age.astype(int)
```
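The choice of standard deviation can be checked against the empirical rule directly. The sketch below is a self-contained illustration rather than part of the simulation itself: it redraws the ages with the same parameters and measures the share of values within one, two and three standard deviations of the mean. The dictionary name `within` is my own.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

loc, scale, size = 40, 8, 1000
age = rng.normal(loc, scale, size)

# Share of simulated ages within 1, 2 and 3 standard deviations of the mean
within = {k: np.mean(np.abs(age - loc) <= k * scale) for k in (1, 2, 3)}

for k, share in within.items():
    low, high = loc - k * scale, loc + k * scale
    print(f"within {k} sd ({low} to {high} years): {share:.1%}")
```

For a normal distribution these shares should come out near 68%, 95% and 99.7%, so almost all simulated ages should fall between 16 and 64.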

##### References:

[1] Women and Men in Ireland 2019; CSO; https://www.cso.ie/en/releasesandpublications/ep/p-wamii/womenandmeninireland2019/work/

[2] Empirical Rule; Adam Hayes; https://www.investopedia.com/terms/e/empirical-rule.asp

https://www.rte.ie/news/business/2020/1109/1177032-gender-pay-gap/

https://www.payscale.com/research/IE/Industry

https://www.cso.ie/en/releasesandpublications/ep/p-hes/hes2015/aiw/

https://www.cso.ie/en/statistics/earnings/earningsandlabourcosts/

https://www.google.com/search?rlz=1C1GCEU_enIE821IE821&sxsrf=ALeKk01Bv_bNXtiywTLDmdBgpTRtykAeGA%3A1607443657998&ei=yaTPX5yyPKXsxgPds66gBg&q=how+does+salary+change+based+on+length+of+service&oq=how+does+salary+change+based+on+length+of+ser&gs_lcp=CgZwc3ktYWIQAxgAMgUIIRCgATIFCCEQoAE6BAgjECc6DgguEMcBEK8BEMkDEJECOgUIABCRAjoICAAQsQMQgwE6CwguELEDEMcBEKMCOg4ILhCxAxCDARDHARCjAjoICAAQyQMQkQI6BAgAEEM6BwgAEBQQhwI6AggAOgUILhCxAzoFCAAQsQM6AgguOgUIABDJAzoJCAAQyQMQFhAeOgYIABAWEB46CAghEBYQHRAeOgQIIRAVOgcIIRAKEKABUMfaAViAnwJgz6oCaABwAXgAgAGfAYgBmSWSAQUxNy4yOJgBAKABAaoBB2d3cy13aXrAAQE&sclient=psy-ab

https://www.statista.com/statistics/377005/employment-by-economic-sector-in-ireland/#:~:text=The%20statistic%20shows%20the%20distribution,percent%20in%20the%20service%20sector.

https://www.cso.ie/en/statistics/labourmarket/



***

# End

***