# Programming for Data Analysis Project
## Eimear Butler, November 2018, Semester 2

### Problem statement

For this project you must create a data set by simulating a real-world phenomenon of your choosing. 

You may pick any phenomenon you wish. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python.

We suggest you use the numpy.random package for this purpose.

Specifically, in this project you should:

• Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.

• Investigate the types of variables involved, their likely distributions, and their relationships with each other.

• Synthesise/simulate a data set as closely matching their properties as possible.

• Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

## Section 1

**Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.**

For this project, I have chosen to simulate data regarding the sale of property in Dublin broadly based on asking prices currently advertised on Daft.ie. The information from Daft.ie will be used as a guideline only and all data points will be simulated according to the relationships between the variables. 

**Investigate the types of variables involved, their likely distributions, and their relationships with each other.**

**Synthesise/simulate a data set as closely matching their properties as possible.**

The variables I am choosing to use are as follows: 

**Distance to City**

The random numbers generated for this variable need to be floats, and I have set the range to between 0.2 and 10 km (the limits used in the code below).

From my previous assessment in this module<sup>1</sup>, the most suitable numpy random number generator function would appear to be: 

`(b - a) * np.random.random_sample((y, x)) + a`

Let's test it out...

In [449]:
import numpy as np   # import numpy functionality
import pandas as pd  # import pandas, needed for the Series and DataFrame output below

a = 0.2 # Lowest value
b = 10 # Highest value
x = 100 # Array size (x axis)
# y axis is not needed as we only want a 1-dimensional array to feed into a pandas DataFrame

distance = np.around((b - a) * np.random.random_sample(x) + a, decimals=2)
# Requests an array of 100 random numbers from 0.2 (inclusive) up to, but not including, 10, then rounds them to two decimal places

pd.Series(distance) # show the output as a pandas Series

0     6.36
1     0.82
2     8.48
3     0.46
4     8.04
5     4.15
6     4.33
7     0.99
8     3.68
9     5.33
10    9.33
11    3.56
12    9.51
13    1.37
14    4.38
15    7.55
16    4.83
17    6.89
18    1.01
19    9.36
20    3.90
21    4.65
22    4.77
23    6.82
24    1.03
25    8.43
26    5.87
27    9.43
28    0.44
29    4.60
      ... 
70    4.53
71    6.35
72    5.76
73    8.56
74    8.02
75    4.49
76    1.24
77    7.30
78    6.35
79    1.27
80    7.65
81    3.41
82    5.42
83    5.16
84    1.96
85    5.42
86    5.25
87    7.32
88    5.33
89    6.03
90    2.49
91    8.01
92    7.83
93    9.39
94    6.26
95    2.19
96    9.77
97    4.80
98    6.44
99    2.65
Length: 100, dtype: float64

We now have a list of distances we can use. 
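As a cross-check on the scaling formula above, the same kind of draw can be sketched with NumPy's newer `Generator` interface, where `uniform(low, high)` performs the `(b - a) * sample + a` scaling internally. The names `rng`, `low`, `high`, `n` and `distance_alt` are illustrative only, and the seed is an assumption added purely so the sketch is reproducible; this is not the method used in the rest of the notebook.

```python
import numpy as np

# Assumption: a fixed seed (42) is used only to make this sketch reproducible
rng = np.random.default_rng(42)

low, high, n = 0.2, 10, 100  # same bounds and sample size as the cell above

# uniform() draws from the half-open interval [low, high)
distance_alt = np.around(rng.uniform(low, high, size=n), decimals=2)

print(distance_alt.min(), distance_alt.max())
```

Either approach should give values spread roughly evenly across the range, since both sample from a uniform distribution.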

**Property Type**

Next, I want to assign each of the 100 data points one of the following property types:

- Detached
- Semi-detached
- Terrace
- Apartment

Looking at Daft.ie, there are in general more apartments and terrace houses for sale in the centre of the city than in the suburbs, so I intend to assign a higher probability to those types for distances within 2.5 km of the city centre.

Below I use `np.random.choice(a, size, p=[])` together with an if/else statement to generate the next array of data.

In [450]:
# First I establish the property types I want to sort the data by
property_type = ['Detached', 'Semi-detached', 'Terrace', 'Apartment']

# At this stage, I do have the option to just generate equally likely property types and associate them with a distance in a pandas DataFrame using the code below...
random_property_type = np.random.choice(property_type, 100, p=[0.25, 0.25, 0.25, 0.25])
f = pd.DataFrame({'Distance': distance, 'Property Type': random_property_type})

# I can also print out the number of times each property type is used and can see that they are fairly evenly distributed between the 4 options
# (an exact match is used so that 'Detached' can never be confused with 'Semi-detached')
len1 = (f['Property Type'] == 'Detached').sum()
len2 = (f['Property Type'] == 'Semi-detached').sum()
len3 = (f['Property Type'] == 'Terrace').sum()
len4 = (f['Property Type'] == 'Apartment').sum()

print(len1, len2, len3, len4)

25 21 28 26


In [451]:
# However, as I said above, there are likely to be more apartments/terrace houses within 2.5km of the city centre and fewer outside it, so I want to reflect that in my data
# I instead split the distance figures into those within 2.5km (True) and outside of 2.5km (False)
w = distance <= 2.5

# Taking this array, I now instruct numpy not just to generate a random property type for each of the 100 data points but to apply preferences based on the probability weighting (p=[])
z = [] # I create an empty list to populate with the output from the loop below

for i in w:              # for every item in w, which now consists of just True/False values
    if i:                # where i is True i.e. the property is closer to the city.....
        dist_weighted = np.random.choice(property_type, p=[0.05, 0.1, 0.25, 0.6]) # pick one property type at random, with the probability increased for Apartments [order: 'Detached', 'Semi-detached', 'Terrace', 'Apartment']
        z.append(dist_weighted) # add the chosen property type to the list 'z'
    else:                # where i is False i.e. the property is further from the city.....
        dist_weighted = np.random.choice(property_type, p=[0.25, 0.55, 0.1, 0.1]) # here, semi-detached houses will be more common (approx. 55%)
        z.append(dist_weighted)   # add the chosen property type to the list 'z'

# we now have a list of property types called "z" that is still random but will reflect a more "real life" data set due to my
# instruction to numpy to alter the probability of one result above another based on the information in the first column (i.e. the distance)

print(z)
len(z)

['Detached', 'Semi-detached', 'Detached', 'Semi-detached', 'Semi-detached', 'Detached', 'Detached', 'Apartment', 'Semi-detached', 'Semi-detached', 'Semi-detached', 'Semi-detached', 'Detached', 'Semi-detached', 'Semi-detached', 'Semi-detached', 'Apartment', 'Semi-detached', 'Detached', 'Semi-detached', 'Apartment', 'Semi-detached', 'Semi-detached', 'Semi-detached', 'Terrace', 'Detached', 'Semi-detached', 'Semi-detached', 'Apartment', 'Detached', 'Apartment', 'Detached', 'Detached', 'Terrace', 'Detached', 'Apartment', 'Apartment', 'Terrace', 'Apartment', 'Semi-detached', 'Apartment', 'Semi-detached', 'Semi-detached', 'Apartment', 'Detached', 'Semi-detached', 'Detached', 'Semi-detached', 'Semi-detached', 'Detached', 'Semi-detached', 'Terrace', 'Semi-detached', 'Semi-detached', 'Terrace', 'Apartment', 'Apartment', 'Terrace', 'Semi-detached', 'Detached', 'Semi-detached', 'Terrace', 'Terrace', 'Detached', 'Semi-detached', 'Terrace', 'Detached', 'Apartment', 'Apartment', 'Apartment', 'Detache

100
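The loop above can also be sketched without an explicit `for` statement: draw one candidate per row from each weighting, then let `np.where` pick between them using the True/False distance mask. This is only an illustrative alternative; `rng`, `near`, `near_draw`, `far_draw` and `z_alt` are hypothetical names, the distances are a stand-in sample rather than the ones generated earlier, and the seed is an assumption for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only

property_type = ['Detached', 'Semi-detached', 'Terrace', 'Apartment']
distance = np.around(rng.uniform(0.2, 10, size=100), 2)  # stand-in distances
near = distance <= 2.5  # True where the property is within 2.5km of the centre

# One weighted draw per row for each scenario (same weights as the loop above)
near_draw = rng.choice(property_type, size=distance.size, p=[0.05, 0.1, 0.25, 0.6])
far_draw = rng.choice(property_type, size=distance.size, p=[0.25, 0.55, 0.1, 0.1])

# Keep the "near" draw where the mask is True, otherwise the "far" draw
z_alt = np.where(near, near_draw, far_draw)

print(z_alt[:10])
```

The result follows the same weighting as the loop, at the cost of generating twice as many draws as are finally kept.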

In [461]:
# let's add the property types into the data frame so I can call up the rows that are within 2.5km of the city to see if our weighted probability has worked

df = pd.DataFrame({'Distance': distance, 'Property Type': z, 'Within 2.5km of Centre': w})

#I can isolate the true rows to see if there is a preference for apartments/terrace houses
df_true = df.loc[df['Within 2.5km of Centre'] == True]
df_true

Unnamed: 0,Distance,Property Type,Within 2.5km of Centre
1,0.82,Semi-detached,True
3,0.46,Semi-detached,True
7,0.99,Apartment,True
13,1.37,Semi-detached,True
18,1.01,Detached,True
24,1.03,Terrace,True
28,0.44,Apartment,True
35,2.48,Apartment,True
38,1.75,Apartment,True
40,1.43,Apartment,True


In [456]:
#I can also print out the number of times each property type is used to see if there is a bias for apartments/terrace houses in the city centre
len_df_true = len(df_true)
len_df_false = (len(df)-len_df_true)
len5 = len(df_true[df_true['Property Type'].str.contains('Detached')])
len6 = len(df_true[df_true['Property Type'].str.contains('Semi-detached')])
len7 = len(df_true[df_true['Property Type'].str.contains('Terrace')])
len8 = len(df_true[df_true['Property Type'].str.contains('Apartment')])

print("So %d of the 100 data points are within 2.5km of the city centre. Total number of Apartments are %d and Terrace Houses are %d overall for properties within 2.5km showing a clear preference for them within this criteria." % (len_df_true, len8, len7))

So 22 of the 100 data points are within 2.5km of the city centre. Total number of Apartments are 11 and Terrace Houses are 7 overall for properties within 2.5km showing a clear preference for them within this criteria.
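A tally like the one above can also be produced in a single call with `value_counts()`. The sketch below uses only the ten "within 2.5km" rows displayed in the earlier output table, typed in by hand, so it illustrates the technique rather than reproducing the 22-row result; `near_centre`, `counts` and `shares` are illustrative names.

```python
import pandas as pd

# Property types of the ten rows displayed in the table above
near_centre = pd.Series(
    ['Semi-detached', 'Semi-detached', 'Apartment', 'Semi-detached', 'Detached',
     'Terrace', 'Apartment', 'Apartment', 'Apartment', 'Apartment'],
    name='Property Type')

counts = near_centre.value_counts()                # rows per property type
shares = near_centre.value_counts(normalize=True)  # same tallies as proportions

print(counts)
print(shares.round(2))
```

In these ten rows, Apartments make up half of the properties, which is in line with the 0.6 weighting they were given.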


**Condition**

Next we want to assign each of the 100 data points a rating from 1 to 10 for the condition of the property. In theory any of the properties could be in very good or very bad condition, so we will use `np.random.randint()` to produce random integers.

In [459]:
condition = np.random.randint(1, 11, size = 100) #generates 100 integers from 1 to 10 inclusive (the upper bound, 11, is excluded)

#let's take a look at the overall dataframe before moving on

df = pd.DataFrame({'Distance': distance,'Property Type': z, 'Condition': condition})
df

Unnamed: 0,Distance,Property Type,Condition
0,6.36,Detached,8
1,0.82,Semi-detached,9
2,8.48,Detached,3
3,0.46,Semi-detached,10
4,8.04,Semi-detached,3
5,4.15,Detached,3
6,4.33,Detached,7
7,0.99,Apartment,3
8,3.68,Semi-detached,7
9,5.33,Semi-detached,5


**Number of Bedrooms**

Next we want to determine how many bedrooms each property has. Again, in theory an apartment could have 4 bedrooms and a house 1 but most properties will have either 2 or 3 bedrooms. `np.random.choice()` is therefore useful again to help us create weighted results.  

In [462]:
# First I establish the number of bedrooms we want to include
bed = range(1,5)

# Now I can use np.random.choice to generate random bedroom numbers, with a preference for 2 and 3 bedrooms, again based on common sense
bedrooms = np.random.choice(bed, 100, p=[0.1, 0.4, 0.4, 0.1])
#bedrooms    #remove hashtag to view full list 
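One way to check that the weights behave as intended is to compare the observed counts with the expected counts (weight multiplied by sample size). The sketch below is a standalone illustration under assumptions: it draws its own sample with a seeded generator, and `rng`, `weights`, `sample` and `expected` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(3)  # assumed seed, for reproducibility only

bed = np.arange(1, 5)            # 1, 2, 3 and 4 bedrooms
weights = [0.1, 0.4, 0.4, 0.1]   # same weights as the cell above

sample = rng.choice(bed, size=100, p=weights)

# Compare the observed counts with the expected counts (weight * sample size)
values, counts = np.unique(sample, return_counts=True)
expected = {int(v): w * sample.size for v, w in zip(bed, weights)}
for v, c in zip(values, counts):
    print(f"{v} bedrooms: observed {c}, expected about {expected[int(v)]:.0f}")
```

With only 100 draws the observed counts will wander around the expected 10/40/40/10 split rather than match it exactly.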

In [463]:
#Again let's add them into the data frame

df = pd.DataFrame({'Distance': distance,'Property Type': z, 'Number of Bedrooms': bedrooms, 'Condition': condition})
df    

Unnamed: 0,Distance,Property Type,Number of Bedrooms,Condition
0,6.36,Detached,4,8
1,0.82,Semi-detached,1,9
2,8.48,Detached,3,3
3,0.46,Semi-detached,3,10
4,8.04,Semi-detached,3,3
5,4.15,Detached,3,3
6,4.33,Detached,3,7
7,0.99,Apartment,2,3
8,3.68,Semi-detached,3,7
9,5.33,Semi-detached,2,5


**Square Metre Price**

Lastly we want to see if we can generate a random price per square metre for each property based on the attributes we have already established. Again, common sense and a quick look at Daft.ie suggest that the following will affect the overall square metre price:
- being closer to the city = higher price
- the more detached the property = higher price
- the more bedrooms = higher price
- the better the condition of the property = higher price

So let's generate random numbers that account for these attributes.

In [None]:
# First I will establish a range of prices that NumPy can use as upper and lower limits to assign a price to each attribute
# The intention is to take the average of each assigned attribute price to get an overall randomly generated "Square Metre Price" for each of the 100 properties

range_1a, range_1b = 5500, 6500 # create 4 ranges of prices that will be assigned based on attributes in the next cells
range_2a, range_2b = 4000, 5500 # prices decrease from one range to the next
range_3a, range_3b = 3000, 4000
range_4a, range_4b = 2500, 3000

In [443]:
d = [] #create an empty list for the distance pricing
    
for i in distance:     # use Python if/elif statements to decide which price range each distance value falls into
    if i < 2:          # cut-off points are 2, 5 and 7 km; anything from 7 km upwards falls into the last range
        price_range = np.random.randint(range_1a, range_1b) # the numbers generated are still random but fall within the appropriate range
        d.append(price_range) # add the resulting random number to the d list until the whole distance column has been assigned a number
    elif i < 5:        #repeat
        price_range = np.random.randint(range_2a, range_2b)
        d.append(price_range)
    elif i < 7:
        price_range = np.random.randint(range_3a, range_3b)
        d.append(price_range)
    else:
        price_range = np.random.randint(range_4a, range_4b)
        d.append(price_range)
        
print(len(d)) #confirm we have generated 100 new random numbers
#print(d)  #remove hashtag here to show the newly generated list. 

100


In [442]:
p = [] #create an empty list for the property type pricing

for i in z:                    # use Python if/elif statements to decide which price range each property type falls into
    if i == 'Detached':        # each of the 4 types is assigned a range
        price_range = np.random.randint(range_1a, range_1b) # the numbers generated are still random but fall within the appropriate range
        p.append(price_range) # add the resulting random number to the p list until the whole property column has been assigned a number
    elif i == 'Semi-detached':       # repeat
        price_range = np.random.randint(range_2a, range_2b)
        p.append(price_range)
    elif i == 'Terrace':
        price_range = np.random.randint(range_3a, range_3b)
        p.append(price_range)
    elif i == 'Apartment':
        price_range = np.random.randint(range_4a, range_4b)
        p.append(price_range)

print(len(p))  #confirm we have generated 100 new random numbers
#print(p)  #remove hashtag here to show the newly generated list. 

100


In [445]:
b = [] #create an empty list for the bedroom pricing

for i in bedrooms:            # use Python if/elif statements to decide which price range each bedroom count falls into
    if i == 4:                # each of the 4 bedroom counts is assigned a range
        price_range = np.random.randint(range_1a, range_1b) # the numbers generated are still random but fall within the appropriate range
        b.append(price_range) # add the resulting random number to the b list until the whole bedroom column has been assigned a number
    elif i == 3:              # repeat
        price_range = np.random.randint(range_2a, range_2b)
        b.append(price_range)
    elif i == 2:
        price_range = np.random.randint(range_3a, range_3b)
        b.append(price_range)
    elif i == 1:
        price_range = np.random.randint(range_4a, range_4b)
        b.append(price_range)

print(len(b)) #confirm we have generated 100 new random numbers
#print(b)  #remove hashtag here to show the newly generated list. 

100


In [446]:
c = [] #create an empty list for the condition pricing

for i in condition:               # use Python if/elif statements to decide which price range each condition value falls into
    if i >= 7:                    # cut-off points are 7, 5 and 3; scores of 1 and 2 fall into the last range
        price_range = np.random.randint(range_1a, range_1b) # the numbers generated are still random but fall within the appropriate range
        c.append(price_range)     # add the resulting random number to the c list until the whole condition column has been assigned a number
    elif i >= 5:                  # repeat
        price_range = np.random.randint(range_2a, range_2b)
        c.append(price_range)
    elif i >= 3:
        price_range = np.random.randint(range_3a, range_3b)
        c.append(price_range)
    else:
        price_range = np.random.randint(range_4a, range_4b)
        c.append(price_range)

print(len(c)) #confirm we have generated 100 new random numbers
#print(c)  #remove hashtag here to show the newly generated list. 

100
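The four loops above share one pattern: map an attribute value to one of four price bands, then draw a random integer from that band. A possible way to express that pattern once is sketched below. It is a refactoring idea rather than the code used in this notebook: every name (`PRICE_BANDS`, `band_price`, `distance_band`, `condition_band`, `TYPE_BAND`, `BEDROOM_BAND`, `square_metre_price`) is hypothetical, and the seed is an assumption for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(7)  # assumed seed, for reproducibility only

# The four price ranges from above, ordered from most to least expensive
PRICE_BANDS = [(5500, 6500), (4000, 5500), (3000, 4000), (2500, 3000)]

def band_price(band):
    """Draw a random integer price from band 0 (dearest) to 3 (cheapest)."""
    low, high = PRICE_BANDS[band]
    return int(rng.integers(low, high))  # upper bound excluded, as with randint

def distance_band(km):
    """Cut-offs at 2, 5 and 7 km, as in the distance loop."""
    return 0 if km < 2 else 1 if km < 5 else 2 if km < 7 else 3

def condition_band(score):
    """Scores 7-10, 5-6, 3-4 and 1-2 map to bands 0 to 3."""
    return 0 if score >= 7 else 1 if score >= 5 else 2 if score >= 3 else 3

TYPE_BAND = {'Detached': 0, 'Semi-detached': 1, 'Terrace': 2, 'Apartment': 3}
BEDROOM_BAND = {4: 0, 3: 1, 2: 2, 1: 3}

def square_metre_price(km, prop_type, bedrooms, score):
    """Average the four attribute prices for one property."""
    bands = [distance_band(km), TYPE_BAND[prop_type],
             BEDROOM_BAND[bedrooms], condition_band(score)]
    return float(np.mean([band_price(b) for b in bands]))

print(square_metre_price(0.82, 'Semi-detached', 1, 9))
```

Keeping the cut-offs and bands in one place would make it easier to adjust the pricing assumptions later without editing four separate loops.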


In [448]:
# Reviewing all the pricing is easiest in a pandas DataFrame
df_pricing = pd.DataFrame({'Distance Pricing': d,'Property Type Pricing': p, 'Bedroom Pricing': b, 'Condition Pricing': c})

# here we can also take the mean of each row of prices to generate an overall average price, which we will use as the Square Metre Price
square_m = df_pricing.mean(axis=1)
df_pricing['Mean Pricing'] = square_m # add the row means to the existing DataFrame as a new column
df_pricing

Unnamed: 0,Distance Pricing,Property Type Pricing,Bedroom Pricing,Condition Pricing,Mean Pricing
0,2573,2526,2835,6196,3532.50
1,5829,4409,4513,6189,5235.00
2,4618,3123,4653,5487,4470.25
3,3575,5021,3498,3328,3855.50
4,3186,2797,4804,2982,3442.25
5,4407,2675,6363,4816,4565.25
6,4051,5244,4234,2913,4110.50
7,5402,5123,3626,3057,4302.00
8,2757,6303,4197,5719,4744.00
9,4917,6124,2758,5534,4833.25
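To make the averaging step concrete, the first row of the table above can be worked through by hand: (2573 + 2526 + 2835 + 6196) / 4 = 14130 / 4 = 3532.50. The short sketch below repeats that calculation with `mean(axis=1)`; the `row` and `mean_price` names are illustrative.

```python
import pandas as pd

# First row of the pricing table above, typed in by hand
row = pd.DataFrame({
    'Distance Pricing': [2573],
    'Property Type Pricing': [2526],
    'Bedroom Pricing': [2835],
    'Condition Pricing': [6196],
})

# axis=1 averages across the columns of each row
mean_price = row.mean(axis=1)

print(mean_price.iloc[0])  # 3532.5
```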


In [426]:
#adding the Square Meter Price to the overall generated dataframe, results in the following: 
df = pd.DataFrame({'Distance': distance,'Property Type': z, 'Number of Bedrooms': bedrooms, 'Condition': condition, 'Square Metre Price': square_m})
df

Unnamed: 0,Distance,Property Type,Number of Bedrooms,Condition,Square Metre Price
0,7.51,Apartment,1,4,2860.75
1,0.48,Semi-detached,3,6,4770.50
2,3.27,Terrace,3,5,4155.00
3,5.06,Semi-detached,2,9,4352.50
4,6.05,Apartment,3,1,3199.75
5,3.01,Apartment,4,5,4322.75
6,3.51,Semi-detached,3,4,4584.25
7,4.80,Semi-detached,2,1,4071.25
8,9.45,Detached,3,1,4160.00
9,4.48,Detached,1,9,4885.00
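A quick way to see whether an attribute has carried through to the final price is to group by that attribute and compare the means. The sketch below does this for property type using only the ten rows displayed above, typed in by hand, so it is an illustration on a very small sample rather than an analysis of the full data set; `sample` and `mean_by_type` are illustrative names.

```python
import pandas as pd

# The ten rows displayed above (property type and simulated price only)
sample = pd.DataFrame({
    'Property Type': ['Apartment', 'Semi-detached', 'Terrace', 'Semi-detached', 'Apartment',
                      'Apartment', 'Semi-detached', 'Semi-detached', 'Detached', 'Detached'],
    'Square Metre Price': [2860.75, 4770.50, 4155.00, 4352.50, 3199.75,
                           4322.75, 4584.25, 4071.25, 4160.00, 4885.00],
})

# Mean simulated price per property type, highest first
mean_by_type = (sample.groupby('Property Type')['Square Metre Price']
                .mean()
                .sort_values(ascending=False))

print(mean_by_type)
```

In these ten rows the ordering comes out as Detached, Semi-detached, Terrace, Apartment, which matches the direction of the pricing rule, although ten rows are too few to draw firm conclusions.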


## Plotting Graphs - Visual Analysis 

In [None]:
import seaborn as sns


## Machine Learning

## Conclusion

In conclusion, using a combination of functions from the `numpy.random` package and if/elif statements, I was able to generate data that is reasonably reflective of real life, based on the general assumptions anyone can make when reviewing asking prices on Daft.ie.

## References

1. https://github.com/eimearbutler7/Programming4DA/blob/master/P4DA_Assignment.ipynb