# David Markham
# Creating a Synthetic Data-Set
# Project for Programming Data Analysis 2019

## Project Guidelines.

- For this project you will create a data set by simulating a real-world phenomenon of your choosing.

- You might pick one that is of interest to you in your personal or professional life.

- Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python.

- We will be using the numpy.random package, a sub-package of the NumPy library for Python, for this purpose.

### In this project you should:

- Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.

- Investigate the types of variables involved, their likely distributions, and their relationships with each other.

- Synthesise/simulate a data set as closely matching their properties as possible.

- Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

### The main task in this project is to create a synthesised data set

![Title](Images/SyntheticDataImage.png) Source: (ArXiv Vanity, 2018)

<a href="https://searchcio.techtarget.com/definition/synthetic-data">Synthetic</a> data is information that's artificially manufactured rather than generated by real-world events. Synthetic data is created algorithmically, and it is used as a stand-in for test datasets of production or operational data, to validate mathematical models and, increasingly, to train machine learning models (Search CIO, 2019).

# Phenomenon
## Investigating the negative factors which affect a group of runners

For this project, the real-world phenomenon I will be using is running, and in particular the negative factors which affect runners. Many factors can have a negative effect when a person exercises regularly (running, in this case), depending on lifestyle, body weight, whether the person smokes, and so on. The variables I will be using for this synthetic data set are stretching, training, smoker or non-smoker, cramps, dehydration, and environment (flat/hills). It is very important for a runner to understand these factors in order to perform at the best of their ability. If you understand the problems which affect runners, you will be better able to prevent injuries from occurring.

### The Variables

- Stretching
- Hydration
- Smoker/Non Smoker
- Training
- Environment
- Cramps
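
As a rough sketch of how these six variables might be simulated with numpy.random, the example below draws 100 runners. Every distribution, parameter, column name and coefficient here is an illustrative assumption rather than a value taken from research. Cramps are made to depend on the other variables through a simple logistic model, which is one possible way to encode a relationship between variables.

```python
import numpy as np
import pandas as pd

# All distributions and parameters below are illustrative assumptions,
# not values taken from research on runners.
rng = np.random.default_rng(42)  # seeded so the output is repeatable
n = 100  # the brief asks for at least one hundred data points

runners = pd.DataFrame({
    # minutes of stretching before a run, clipped so it cannot be negative
    "stretching_min": rng.normal(loc=10, scale=4, size=n).clip(min=0).round(1),
    # litres of water drunk on the day of the run
    "hydration_l": rng.normal(loc=2.0, scale=0.5, size=n).clip(min=0.2).round(2),
    # roughly 15% smokers (assumed proportion)
    "smoker": rng.choice(["yes", "no"], size=n, p=[0.15, 0.85]),
    # training sessions per week, between 1 and 6 inclusive
    "training_per_week": rng.integers(1, 7, size=n),
    # running environment
    "environment": rng.choice(["flat", "hills"], size=n, p=[0.6, 0.4]),
})

# Cramps depend on the other variables: a logistic model turns a risk
# score into a probability, then one Bernoulli draw is made per runner.
score = (
    0.5
    - 0.10 * runners["stretching_min"]
    - 0.80 * runners["hydration_l"]
    + 0.90 * (runners["smoker"] == "yes")
    + 0.60 * (runners["environment"] == "hills")
)
p_cramp = 1 / (1 + np.exp(-score))
runners["cramps"] = rng.random(n) < p_cramp

print(runners.head())
```

With these assumed coefficients, more stretching and better hydration lower the chance of cramps, while smoking and hilly terrain raise it. The signs and sizes would need to be replaced with values supported by research.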

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [40]:
df = pd.DataFrame(np.random.randn(20, 4), columns=list('ABCD'))

In [41]:
df

,A,B,C,D
0,-0.536891,-1.492703,-1.14438,1.747015
1,0.145056,-0.530127,0.228514,1.299708
2,-0.051874,0.487297,-0.540172,-0.349517
3,1.176473,-0.555405,0.09736,-0.920533
4,0.722094,2.182179,0.172099,-0.058076
5,-0.030017,0.497603,-1.599405,2.355742
6,-1.703544,0.958724,-1.410313,-0.512252
7,0.43486,0.467804,0.668528,1.449095
8,-1.109617,-1.239573,-0.731155,0.185951
9,-0.292987,1.704239,0.821365,0.423547
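
The frame above draws from a standard normal distribution, so its values change on every run. A minimal sketch of the same idea made repeatable with a seeded generator follows. The seed value and the use of `np.random.default_rng` in place of `np.random.randn` are choices made for this example, not part of the original cell.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2019)  # arbitrary seed, assumed for this example

# 20 rows and 4 columns of standard normal values (mean 0, std 1),
# the same shape as the frame created with np.random.randn(20, 4)
df_seeded = pd.DataFrame(rng.standard_normal((20, 4)), columns=list("ABCD"))

# Summary statistics give a quick check that the draws look as expected
summary = df_seeded.describe().loc[["mean", "std", "min", "max"]]
print(summary)
```

Re-running this cell with the same seed should give the same numbers, which makes the notebook easier to check and to discuss.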


### Pydbgen

**Pydbgen** is a lightweight, pure-Python library that generates random but useful entries (e.g. name, address, credit card number, date, time, company name, job title, license plate number) and saves them as a Pandas DataFrame object, as a SQLite table in a database file, or as an MS Excel file. Documentation: https://pydbgen.readthedocs.io/en/latest/

You will need to install it first from your terminal using the following command: **pip install pydbgen**

In [20]:
from pydbgen import pydbgen  # import the pydbgen module from the pydbgen package
myDB = pydbgen.pydb()  # create a generator object

In [4]:
generator = pydbgen.pydb()

In [21]:
generator.license_plate()

'ESQ-845'

Generate a list of **random names** for our 15 runners.

In [23]:
generator.gen_data_series(num=15,data_type='name')

0            Charles Mays
1     Christopher Allison
2             Eric Chavez
3              Austin Cox
4           Monica Deleon
5          Laura Thompson
6      Christopher Taylor
7        Heather Jennings
8          Savannah Smith
9              Eric Perry
10           Lance Garner
11        Roberto Carroll
12         Jennifer Cowan
13            Kelly Price
14           Rebecca Beck
dtype: object
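
One way the generated names might be used is as labels for the simulated runners. The sketch below hard-codes the first five names from the output above so that it runs without pydbgen installed. The `weekly_km` column and its distribution are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# First five names copied from the output above; in the notebook this
# Series would come from generator.gen_data_series(num=15, data_type='name')
names = pd.Series(
    ["Charles Mays", "Christopher Allison", "Eric Chavez",
     "Austin Cox", "Monica Deleon"],
    name="name",
)

rng = np.random.default_rng(1)  # arbitrary seed
# Hypothetical variable: kilometres run per week, never negative
weekly_km = pd.Series(
    rng.normal(loc=30, scale=8, size=len(names)).clip(min=0).round(1),
    name="weekly_km",
)

# Place the two Series side by side; both share the default 0..4 index
named_runners = pd.concat([names, weekly_km], axis=1)
print(named_runners)
```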

Generate random dates. No `num` argument is passed here, and the output below contains ten entries.

In [24]:
se=myDB.gen_data_series(data_type='date')
print(se)

0    1998-01-27
1    1990-11-13
2    1990-07-08
3    1975-07-07
4    2016-07-17
5    1983-05-21
6    1996-05-18
7    2010-04-29
8    1981-03-29
9    1972-02-27
dtype: object


In [30]:
myDB.city_real()

'West Bay Shore'

In [37]:
testdf=myDB.gen_dataframe(5,['name','city','phone','date'])
testdf

,name,city,phone-number,date
0,Dorothy Contreras,Arista,461-194-3860,1998-12-25
1,Mary Ray,Williston,701-618-2413,1985-05-21
2,Kelly Scott,Lajitas,676-595-5208,1991-04-10
3,Shirley Ford,Lime City,487-163-8859,1983-11-27
4,Joshua Ortega,Benwood,615-382-8591,2005-10-19


# Bibliography 

Search CIO, 2019, Synthetic Data, viewed on 2019/11/22, available online at: https://searchcio.techtarget.com/definition/synthetic-data 

ArXiv Vanity, 2018, Learning to Generate Synthetic Datasets, viewed on 2019/11/22, available online at: https://www.arxiv-vanity.com/papers/1904.11621/

GitHub, 2018, Tirthajyoti/Machine-Learning-with-Python, viewed on 2019/11/02, available online at: https://github.com/tirthajyoti/Machine-Learning-with-Python/blob/master/Synthetic_data_generation/Synthetic-Data-Generation.ipynb

