## Problem statement
For this project you must create a data set by simulating a real-world phenomenon of your choosing. You may pick any phenomenon you wish – you might pick one that is of interest to you in your personal or professional life. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the numpy.random package for this purpose. Specifically, in this project you should:
* Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
* Investigate the types of variables involved, their likely distributions, and their relationships with each other.
* Synthesise/simulate a data set as closely matching their properties as possible.
* Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

Note that this project is about simulation – you must synthesise a data set. Some students may already have some real-world data sets in their own files. It is okay to base your synthesised data set on these should you wish (please reference it if you do), but the main task in this project is to create a synthesised data set.


Having spent some time trying to think of a phenomenon that could be convincingly modelled by using one's only own judgement, and having considered, for example, creating a dataset similar to Verizon's data breach investigation report datasets [1] (https://enterprise.verizon.com/resources/reports/dbir/), I concluded that it would be best to lessen as much as possible the severity of the suspension of disbelief required to take the dataset seriously. In this vein, I decided to model a very uncomplicated phenomenon, namely, student satisfaction for a module in such a course as GMIT's Higher Diploma in Data Analytics (at least I think that's what it's still called - it might have been changed to 'Higher Diploma in Data Fabrication' considering the nature of this assessment).

To create the dataset, I imagined that a thousand students had completed a survey with seven questions:

1. What is your age?

2. How satisfied were you with your lecturer's engagement with you and the course?

3. How satisfied were you with the structure and pace of the module?

4. Did you find the module-content interesting?

5. How would you rate the difficulty of the module?

6. Were the assessments of the module appropriate to the module content?

7. How satisfied were you with the module overall?

In the case of the first question, the options available to the respondent were:
* (20-25) 
* (25-30) 
* (30-35) 
* (35-40) 
* (40-50) 
* (50-65). 

For the other six questions, the respondent was given five possible answers to choose from, namely: 
* (very unsatisfied/easy/unappropriate) 
* (unsatisfied/easy/unappropriate) 
* (neutral) 
* (satisfied/difficult/appropriate) 
* (very satisfied/difficult/appropriate).


Because we are in a sense 'reverse engineering' a datasest, the easiest way to create the dataset would likely be to first determine the distribution for the variable that could be said to be the target variable, i.e. a variable that is more determined than determining in relation to the other variables in the dataset. In this case, that target variable would of course be the answer to the seventh question, 'how satisfied were you with the module overall'. I assumed a distribution approximate to the normal distribution for this variable. I say 'approximate' here because of course all the variables in this dataset are discrete, and the actual normal distribution is continuous. Thus, to create the datapoints were the 'over all satisfaction' variable, I used numpy.random.normal() with a mean of three and standard deviation of 1, rounding each result to the nearest integer, and replacing any values less than zero to zero, and any greater than five to five. The fact that we are removing the tails of the distribution of course takes away from the 'normal-ness' of the distribution, but that is acceptable for our purposes, as this is a discrete variable in any case. 

In [191]:
import numpy as np
import pandas as pd
from collections import Counter

ageOptions = [(20, 25), (25, 30), (30, 35), (35, 40), (40, 50), (50, 65)]
ages = [(option[0] + option[1]) / 2 for option in ageOptions]
ageProbs = [0.4,0.3,0.15,0.1, 0.03, 0.02]
age = np.random.choice(ages, 1000, ageProbs)
# good overview of list comprehensions here: https://appdividend.com/2020/05/13/python-list-replace-replace-string-integer-in-list/
satisfaction = [1 if x < 1 else x for x in [5 if x > 5 else x for x in [int(x.round()) for x in np.random.normal(3, 1, 1000)]]]
#removeOverFives = [5 if x > 5 else x for x in overallSatisfaction]
#removeUnderZeros = [1 if x < 1 else x for x in overallSatisfaction]

difficulty = []
agesDiffMeans = {ages[0]:3, ages[1]:3, ages[2]:3, ages[3]:3.5, ages[4]:3.75, ages[5]:4}
for y in age:
    difficulty.append([1 if x < 1 else x for x in [5 if x > 5 else x for x in [int(x.round()) for x in np.random.normal(agesDiffMeans[y], 1, 1)]]][0])
    # create a dictionary to store the question names (other than age, and overallSatisfaction), students answers, and the expected standard deviations from the overall satisfaction value for each question
data = {}
data.update({"engagement":{"data":np.array([]), "std": 0.8}})
data.update({"structure":{"data":np.array([]), "std": 1}})
data.update({"content":{"data":np.array([]), "std": 0.3}})
data.update({"assessment":{"data":np.array([]), "std": 0.6}})


for key, value in data.items():
    value['data'] = np.array([1 if x < 1 else x for x in [5 if x > 5 else x for x in [int(x.round()) for x in [x + np.random.normal(0, value['std'], 1) for x in satisfaction]]]])
    data[key] = [x for x in value['data']]

# we don't want the difficulty rating to be related to the others
data.update({"difficulty":difficulty})
data.update({"age":age})
data.update({"satisfaction":satisfaction})

for key, value in data.items():
    print(f"{key} counts are: {Counter(data[key])}")

df = pd.DataFrame.from_dict(data)
# display at least 50 rows
pd.set_option('display.min_rows', 50)
df


engagement counts are: Counter({3: 309, 4: 240, 2: 206, 1: 133, 5: 112})
structure counts are: Counter({3: 271, 2: 228, 4: 198, 5: 158, 1: 145})
content counts are: Counter({3: 364, 2: 246, 4: 230, 1: 85, 5: 75})
assessment counts are: Counter({3: 345, 2: 226, 4: 220, 1: 106, 5: 103})
difficulty counts are: Counter({3: 331, 4: 301, 2: 187, 5: 145, 1: 36})
age counts are: Counter({22.5: 187, 45.0: 178, 32.5: 163, 57.5: 162, 37.5: 158, 27.5: 152})
satisfaction counts are: Counter({3: 390, 2: 243, 4: 224, 1: 74, 5: 69})


Unnamed: 0,engagement,structure,content,assessment,difficulty,age,satisfaction
0,1,3,1,1,2,22.5,1
1,1,3,2,2,4,27.5,2
2,2,4,3,3,4,45.0,3
3,2,1,1,2,4,57.5,1
4,4,3,4,5,3,57.5,4
5,5,5,5,5,5,32.5,5
6,4,4,3,3,4,57.5,3
7,2,3,2,3,4,22.5,2
8,5,5,5,5,4,45.0,5
9,4,4,4,4,5,57.5,4
