# Problem statement

For this project you must create a data set by simulating a real-world phenomenon of your choosing. You may pick any phenomenon you wish – you might pick one that is of interest to you in your personal or professional life. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the numpy.random package for this purpose. Specifically, in this project you should:

* Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
* Investigate the types of variables involved, their likely distributions, and their relationships with each other.
* Synthesise/simulate a data set as closely matching their properties as possible.
* Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

Note that this project is about simulation – you must synthesise a data set. Some students may already have some real-world data sets in their own files. It is okay to base your synthesised data set on these should you wish (please reference it if you do), but the main task in this project is to create a synthesised data set.

# My simulation of the real-world phenomenon

My project is inspired by the Height and Shoe Size dataset published online (McLaren 2012). In this project, I will model and synthesise a dataset of the height and shoe size of the students attending this course, rather than collecting the data by asking each student for their input. The original dataset has only three variables: shoe size, height and sex. I decided to add two more variables: age and ID number.

The variables involved:


* **Shoe_size:** the student's shoe size; the sizes are between 36 and 50.

* **Height:** Height of the student in centimetres.

* **Sex:** whether the student is Male or Female.

* **Age:** the age of the student, between 18 and 49.

* **ID:** identification number of the student; the values are between 0000 and 9999.


The variables shoe_size, height, age and ID are numerical; the variable sex is categorical. The dataset will consist of 150 observations.
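The variable types described above can be illustrated with a minimal sketch of how such a table might be laid out in pandas. Everything here is a placeholder and not the simulation itself: the seed, the uniform draws for age and ID, the normal distribution and its parameters for height, and the column names are assumptions made only to show the numerical and categorical types. The shoe size range follows the Shoe_size section.

```python
import numpy as np
import pandas as pd

n = 150  # number of observations stated above
rng = np.random.default_rng(0)  # arbitrary seed, used only for this sketch

# Placeholder draws: the distributions below are assumptions for illustration.
df = pd.DataFrame({
    "ID": rng.integers(0, 10000, size=n),              # integers 0..9999
    "Sex": pd.Categorical(rng.choice(["Male", "Female"], size=n)),
    "Age": rng.integers(18, 50, size=n),               # integers 18..49
    "Height": rng.normal(170, 10, size=n).round(1),    # hypothetical mean and spread, in cm
    "Shoe_size": rng.integers(36, 51, size=n),         # integers 36..50
})

# Four numerical columns and one categorical column.
print(df.dtypes)
print(df.head())
```

Storing sex as a pandas `category` keeps it distinct from the numerical columns when summarising or plotting the data.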

In [1]:
# Show plots in notebook
%matplotlib inline

# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Shoe_size

I will generate random values for the variable shoe_size. The values should be integers between 36 and 50 inclusive.

In [2]:
# Seed the random number generator (seed 5) so the results are reproducible.
np.random.seed(5)
# Draw 150 random integers between 36 (inclusive) and 51 (exclusive).
shoe_size = np.random.randint(36, 51, 150)
shoe_size

array([39, 50, 49, 42, 42, 36, 45, 44, 40, 43, 50, 47, 36, 50, 36, 43, 48,
       37, 41, 43, 36, 47, 48, 49, 47, 37, 50, 40, 42, 49, 38, 50, 45, 45,
       46, 45, 45, 50, 37, 38, 43, 36, 50, 41, 46, 36, 36, 40, 50, 40, 45,
       39, 47, 38, 40, 42, 45, 39, 39, 38, 37, 41, 43, 40, 48, 49, 50, 50,
       47, 47, 39, 48, 37, 43, 39, 47, 50, 47, 49, 37, 45, 41, 43, 36, 50,
       50, 46, 45, 42, 36, 41, 38, 49, 47, 44, 49, 42, 47, 44, 36, 48, 48,
       41, 50, 50, 38, 36, 43, 43, 42, 47, 36, 36, 44, 46, 48, 50, 41, 50,
       47, 41, 45, 42, 40, 41, 38, 48, 47, 49, 44, 44, 37, 48, 42, 39, 40,
       37, 44, 36, 38, 49, 48, 47, 38, 40, 37, 42, 46, 39, 40])
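`np.random.randint` draws from a discrete uniform distribution, so each size from 36 to 50 is equally likely on every draw. A minimal sketch of how the generated values could be checked is shown here: it repeats the same seeded call, confirms the bounds, and counts how often each size occurs. The last lines show the newer `Generator` interface as an alternative; this is an assumption about what one might prefer, not part of the original notebook, and it produces a different sequence from the legacy seeded call.

```python
import numpy as np

# Same legacy seeding and call as in the cell above.
np.random.seed(5)
shoe_size = np.random.randint(36, 51, 150)

# Confirm the number of values and their bounds.
print(len(shoe_size), shoe_size.min(), shoe_size.max())

# Count how often each size occurs in the sample.
sizes, counts = np.unique(shoe_size, return_counts=True)
for size, count in zip(sizes, counts):
    print(size, count)

# Alternative sketch with the newer Generator API (different values for the same seed).
rng = np.random.default_rng(5)
shoe_size_rng = rng.integers(36, 51, size=150)  # upper bound is exclusive by default
```

With only 150 draws spread over 15 possible sizes, the counts will vary around 10 per size rather than being exactly equal.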

# References

1. MCLAREN, C.H., 2012. 'Using the Height and Shoe Size Data to Introduce Correlation and Regression'. *Journal of Statistics Education*, Volume 20, Number 3 [Online]. Available from: http://jse.amstat.org/v20n3/mclaren.pdf [viewed 13 December 2019].