# Programming for Data Analysis Project

***

Project Brief:

*Create a data set by simulating a real-world phenomenon of your choosing.*

 - *Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.*
 - *Investigate the types of variables involved, their likely distributions, and their relationships with each other.*
 - *Synthesise/simulate a data set as closely matching their properties as possible.*
 - *Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.*
 
 ***

### Introduction

<br>

Within this Jupyter notebook, I will simulate a data set that represents the distribution of salaries across a population.

Some key variables which influence the salary of an individual are age, gender, education, and industry. Of these, industry, education and gender are categorical variables, which can be modelled with binomial or multinomial distributions. As age is continuous numerical data, further investigation will be needed to identify its distribution.

To begin, I will research these variables and their trends within the workforce, as well as how they may relate to one another. I will then investigate how these variables impact the annual salary of an individual.

Following this investigative work, I will conclude by using these results to simulate a data set which mimics the real-world analysis.

<br>

In [1]:
# for numerical arrays and distributions
import numpy as np
# for plotting
import matplotlib.pyplot as plt
# creating dataframes
import pandas as pd
# another format of plotting
import seaborn as sns
# display plots inline in the notebook
%matplotlib inline
# defining the seed so my results don't vary on each execution
rng = np.random.default_rng(seed=0)

### Variable 1: Gender

For the purpose of this investigation, I will consider male and female as the two possible genders within the workforce, as the proportion of non-binary individuals included within widely available statistical data is negligible. As such, gender in the workforce can be modelled with a binomial distribution with a single trial per person.

According to 2019 data from the Central Statistics Office (CSO), women account for 45.9% of the workforce in Ireland, and men make up the remaining 54.1% [1].

For this analysis, a "success" will correspond to male, so I will set p=0.541.

In [2]:
# one trial for each execution
n=1
# probability of success
p=0.541
# repeat 1000 times
x=1000
gender = rng.binomial(n,p,x)

In [3]:
# calculate a count value to identify the male/female ratio in the data
# will be useful for other variables which depend on gender
# males are coded as 1, so summing the array counts them
count = int(gender.sum())
# print result to screen for reference
print(count)

508
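As a rough sanity check of the binomial draw above, the sketch below repeats the same kind of single-trial draw with its own generator and compares the simulated share of males with the target of 0.541. The names `check_rng`, `draws`, `share_male` and `labels`, and the label strings, are illustrative assumptions and are not part of the notebook's own cells.

```python
import numpy as np

# separate generator so this check does not disturb the notebook's rng
check_rng = np.random.default_rng(seed=0)

p_male = 0.541      # CSO 2019 share of men in the workforce [1]
n_people = 1000     # same sample size as the simulation above

# one trial per person: 1 = male, 0 = female
draws = check_rng.binomial(1, p_male, n_people)

# vectorised count of males (equivalent to looping and counting the ones)
n_male = int(draws.sum())
share_male = n_male / n_people

# map the 0/1 codes to readable labels (label strings are illustrative)
labels = np.where(draws == 1, 'Male', 'Female')

print(n_male, round(share_male, 3))
```

With only 1000 draws, the simulated share can differ from 0.541 by a few percentage points, so a count such as the one printed above is within the expected sampling variation.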


### Variable 2: Age

The 2019 data from the CSO also provides a breakdown of the male and female workforce distribution across age bands [1]. As seen in table 5.4 and figure 5.5, the distribution across age for both genders peaks in the 35-44 age group, and loosely follows a normal distribution. As such, I believe the distribution of age within the workforce is largely independent of a person's gender, and will be treated as such in this simulation.

Taking the median value of 40 as the mean, the remaining parameter to be considered is the standard deviation. By the empirical rule [2], about 99.7% of normally distributed values lie within three standard deviations of the mean. Given that the significant age range falls within roughly +/- 25 years of the mean, a standard deviation of about 25/3 ≈ 8 years describes the spread of the data well.
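The reasoning above can be checked numerically. The following is a minimal sketch, assuming a mean of 40 and a standard deviation of 8 as chosen in the text; the generator `check_rng`, its seed and the sample size are arbitrary choices made for illustration only.

```python
import numpy as np

mean_age = 40   # mean chosen in the text above
sd_age = 8      # standard deviation chosen in the text above

# age bounds implied by 1, 2 and 3 standard deviations either side of the mean
bounds = {k: (mean_age - k * sd_age, mean_age + k * sd_age) for k in (1, 2, 3)}

# large sample from a separate generator (seed chosen arbitrarily)
check_rng = np.random.default_rng(seed=1)
sample = check_rng.normal(mean_age, sd_age, 100_000)

# share of simulated ages falling inside each band
shares = {k: float(np.mean((sample >= lo) & (sample <= hi)))
          for k, (lo, hi) in bounds.items()}

for k in (1, 2, 3):
    print(k, bounds[k], round(shares[k], 3))
```

The three-standard-deviation band runs from 16 to 64, which is close to the +/- 25 year range described above, and the shares should land near the 68%, 95% and 99.7% figures of the empirical rule.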



In [4]:
# setting mean, standard deviation, and size of output
loc = 40
scale = 8
size = 1000
# creating the distribution
age = rng.normal(loc,scale,size)
# converting the age values to integers (astype(int) drops the decimal part)
age = age.astype(int)

In [5]:
# initiating the dataframe 
df = pd.DataFrame({'gender':gender, 'age':age})

### Variable 3: Industry

Another variable which is likely to have an impact on an individual's salary is the industry in which they work. For the purpose of this simulation, I have selected 13 categories from the 2016 census data [3], which provides the number of individuals employed in each of the categories, broken down by gender.

The categories in scope are as follows:

 - Wholesale and retail trade;
 
 - Human health and social work;
 
 - Manufacturing;
 
 - Education;
 
 - Accommodation and food service;
 
 - Professional, scientific and technical;
 
 - Construction;
 
 - Financial and insurance;
 
 - Information and communication;
 
 - Agriculture, forestry and fishing;
 
 - Transportation and storage;
 
 - Administrative and support service;
 
 - Arts, entertainment and recreation.
 
The distribution of the workforce across these categories varies greatly between males and females, so a different distribution will be needed for each gender. In both cases, a multinomial distribution is most appropriate to simulate this data, but the probabilities applied to each category will differ. Given that the data used as my source represents the entire population, I will use it to identify the appropriate probabilities to assign to each category [3].

In [6]:
# adding categories to an array
categories = ['Wholesale and retail trade','Human health and social work','Manufacturing','Education',
              'Accommodation and food service','Professional, scientific and technical','Construction','Financial and insurance',
              'Information and communication','Agriculture, forestry and fishing','Transportation and storage',
              'Administrative and support service','Arts, entertainment and recreation']
# probabilities to be used for males
p_men = [0.1580,0.0527,0.1586,0.0490,0.0623,0.0684,0.1068,0.0486,0.0687,0.0888,0.0722,0.0447,0.0211]
# probabilities to be used for females
p_women = [0.1642,0.2295,0.0789,0.1730,0.0801,0.0686,0.0093,0.0620,0.0379,0.0136,0.0222,0.0408,0.0201]
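Because the probabilities above are rounded to four decimal places, they sum to approximately, rather than exactly, 1. The sketch below checks the totals and rescales each list so that it sums to 1; the `_norm` names are assumptions introduced here for illustration. The notebook's `rng.multinomial` calls run with the rounded values as given, so this step is optional there, but some other sampling functions are stricter.

```python
import numpy as np

# probabilities copied from the cell above
p_men = np.array([0.1580, 0.0527, 0.1586, 0.0490, 0.0623, 0.0684, 0.1068,
                  0.0486, 0.0687, 0.0888, 0.0722, 0.0447, 0.0211])
p_women = np.array([0.1642, 0.2295, 0.0789, 0.1730, 0.0801, 0.0686, 0.0093,
                    0.0620, 0.0379, 0.0136, 0.0222, 0.0408, 0.0201])

# totals are close to, but not exactly, 1 because of rounding
print(round(p_men.sum(), 4), round(p_women.sum(), 4))

# rescale so each set of probabilities sums to 1
p_men_norm = p_men / p_men.sum()
p_women_norm = p_women / p_women.sum()
```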

In [7]:
# the count variable is the count of males in the first array created
# output will be count number of arrays with 13 elements each
# each array will have 12 zeros and 1 one, the index of the one will correspond to the industry in the categories array
industry_men = rng.multinomial(1,p_men,size=count)
# empty list which will be used to hold the category generated for each event
intermediate_list = []
# loop through each event in the simulation
for i in industry_men:
    # look at each value in the output array for the specific event
    for j in range(len(i)):
        # this branch runs when the selected category has been identified
        if i[j] == 1:
            # add the relevant category to the list
            intermediate_list.append(categories[j])

# convert the list to a dataframe, so it can be added to the overall dataframe
intermediate_dataframe = pd.DataFrame(intermediate_list,columns=['industry'])
# identifies the indices in the overall dataframe which correspond to men
men_index = np.array(df.loc[df.gender==1].index)
# adds the column of these indices to the results for men
intermediate_dataframe['men_index'] = men_index
# sets this column as the index for this dataframe, so it will match up to the correct values in the overall dataframe
intermediate_dataframe.set_index('men_index',inplace=True, drop=True)
# add column of results to overall dataframe
df.loc[df.gender==1,'industry'] = intermediate_dataframe

In [8]:
# here the size value is set to the total amount minus the values already set for men
# gender value also set to 0 instead of 1
# all other comments from the previous cell also apply here
industry_women = rng.multinomial(1,p_women,size=1000-count)
intermediate_list1 = []
for i in industry_women:
    for j in range(len(i)):
        if i[j] == 1:
            intermediate_list1.append(categories[j])

intermediate_dataframe1 = pd.DataFrame(intermediate_list1,columns=['industry'])
women_index = np.array(df.loc[df.gender==0].index)
intermediate_dataframe1['women_index'] = women_index
intermediate_dataframe1.set_index('women_index',inplace=True, drop=True)
df.loc[df.gender==0,'industry'] = intermediate_dataframe1
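The two cells above decode one-hot multinomial draws back into category names. An alternative is to draw the category names directly with `Generator.choice`, which gives the same kind of result in fewer steps. This is a sketch of that swapped-in technique, not the method used in this notebook; `demo_rng`, `demo` and `is_male` are hypothetical names, and the data frame is rebuilt here so the example runs on its own.

```python
import numpy as np
import pandas as pd

demo_rng = np.random.default_rng(seed=2)  # separate generator, arbitrary seed

categories = ['Wholesale and retail trade', 'Human health and social work',
              'Manufacturing', 'Education', 'Accommodation and food service',
              'Professional, scientific and technical', 'Construction',
              'Financial and insurance', 'Information and communication',
              'Agriculture, forestry and fishing', 'Transportation and storage',
              'Administrative and support service',
              'Arts, entertainment and recreation']
p_men = np.array([0.1580, 0.0527, 0.1586, 0.0490, 0.0623, 0.0684, 0.1068,
                  0.0486, 0.0687, 0.0888, 0.0722, 0.0447, 0.0211])
p_women = np.array([0.1642, 0.2295, 0.0789, 0.1730, 0.0801, 0.0686, 0.0093,
                    0.0620, 0.0379, 0.0136, 0.0222, 0.0408, 0.0201])

# Generator.choice requires probabilities that sum to 1, so rescale them
p_men = p_men / p_men.sum()
p_women = p_women / p_women.sum()

# stand-in for the notebook's dataframe: 1 = male, 0 = female
demo = pd.DataFrame({'gender': demo_rng.binomial(1, 0.541, 1000)})
is_male = (demo['gender'] == 1).to_numpy()

# draw an industry name for each person using the probabilities for their gender
industry = np.empty(len(demo), dtype=object)
industry[is_male] = demo_rng.choice(categories, size=int(is_male.sum()), p=p_men)
industry[~is_male] = demo_rng.choice(categories, size=int((~is_male).sum()), p=p_women)
demo['industry'] = industry

print(demo.groupby('gender')['industry'].value_counts().head())
```

Because the boolean mask places each draw in the matching row, no intermediate data frame or index bookkeeping is needed.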

### Variable 4: Education

The education level of an individual is another variable which I believe will have an impact on their salary, with those who reach a higher level of qualification likely to earn a larger salary. Through inspection of the 2016 census data [4], it is clear from figure 2.3 that the distribution of education levels varies significantly between industries.

With regard to gender, the distribution also varies between males and females [5] when the following four groupings of education level are considered:

 - Primary;

 - Secondary;

 - Certificate/Apprenticeship;

 - Third Level +.

However, the variation which occurs here is likely an echo of the variation due to industry and of how frequently each industry occurs for each gender. For example, health care and social work, the most popular industry for females, has a very high proportion of third-level-educated individuals. In contrast, farming and agriculture has a much higher proportion of males than females, and this category has a much higher incidence of primary and secondary education.

As such, I feel that treating both gender and industry as significant variables in the distribution of education, when these two variables are themselves interdependent, would excessively skew the proportions in the distribution. I will use the population data from table EA007 of the 2016 census [5] to calculate the probabilities for each of the four education levels listed above for both males and females.
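Turning population counts into probabilities is a matter of dividing each count by its column total. The sketch below shows the arithmetic only: the counts are placeholder values invented for illustration and are not figures from table EA007, and the names `counts` and `probs` are assumptions.

```python
import pandas as pd

levels = ['Primary', 'Secondary', 'Certificate/Apprenticeship', 'Third Level +']

# PLACEHOLDER counts for illustration only, NOT the CSO figures from table EA007
counts = pd.DataFrame({'male': [100, 400, 200, 300],
                       'female': [80, 350, 150, 420]},
                      index=levels)

# divide each column by its total so each gender's probabilities sum to 1
probs = counts / counts.sum()

print(probs.round(4))
```

Each column of `probs` could then be passed as the probabilities of a multinomial draw for that gender, in the same way as for industry.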

##### References:

[1] Women and Men in Ireland 2019; CSO; https://www.cso.ie/en/releasesandpublications/ep/p-wamii/womenandmeninireland2019/work/

[2] Empirical Rule; Adam Hayes; https://www.investopedia.com/terms/e/empirical-rule.asp

[3] Census 2016 Table EB030; CSO; https://data.cso.ie/table/EB030

[4] Census 2016 Education and Economic Status; CSO; https://www.cso.ie/en/releasesandpublications/ep/p-cp10esil/p10esil/ees/

[5] Census 2016 Table EA007; CSO; https://data.cso.ie/table/EA007

https://www.rte.ie/news/business/2020/1109/1177032-gender-pay-gap/

https://www.payscale.com/research/IE/Industry

https://www.cso.ie/en/releasesandpublications/ep/p-hes/hes2015/aiw/

https://www.cso.ie/en/statistics/earnings/earningsandlabourcosts/

https://www.statista.com/statistics/377005/employment-by-economic-sector-in-ireland/#:~:text=The%20statistic%20shows%20the%20distribution,percent%20in%20the%20service%20sector.

https://www.cso.ie/en/statistics/labourmarket/



***

# End

***