# Instructions

1. This assignment is worth 5% of the final grade.
2. Insert cells (code or markdown, as appropriate) below each question and fill in your answers there.
3. You are required to work on this individually. Any form of plagiarism will result in 0.
4. Please submit your notebook file (name it ``IND5003_A1_<Your_Name>.ipynb``) through Canvas before **17th Sep 2023 23:59hrs**.

# Question 1

The file `resale_flat_prices.csv` contains resale flat prices in Singapore from January 2017 onwards. The file `town_type_region.xlsx` contains the classification of each town as *mature* or *non-mature* and each town's geographical region (*north*, *north-east*, *east*, *west*, or *central*). Use these two files to answer the following questions:

1. In the period January 2022 to June 2022, was the mean resale price **per square metre per year of remaining lease** of 4 ROOM flats the same for all geographical regions?
2. In the period July 2022 to December 2022, was there any difference in the distribution of geographical regions for each resale flat type? (Please **omit** 1 ROOM and MULTI-GENERATION flats from this analysis.)

It is up to you to choose the appropriate hypothesis test, and to perform the five steps for each question.

### 1.
#### Step 1


In [4]:
import pandas as pd
import numpy as np
from scipy import stats

# read files
flat_price = pd.read_csv('resale_flat_prices.csv')
town_type = pd.read_excel('town_type_region.xlsx')

#### Step 2: Null and Alternative Hypotheses
##### Null Hypothesis: The mean resale price per square metre per year of remaining lease of 4 ROOM flats is the same for all geographical regions.
##### Alternative Hypothesis: The mean resale price per square metre per year of remaining lease of 4 ROOM flats differs for at least one geographical region.


#### Steps 3 & 4: Compute test statistic and $p$-value

In [5]:
# get January 2022 to June 2022, 4 ROOM flats
jan2June4Room = flat_price[(flat_price['month'] >= '2022-01') & (flat_price['month'] <= '2022-06') & (flat_price['flat_type'] == '4 ROOM')]

# attach each town's type and region to the filtered transactions
resale_price = jan2June4Room.merge(town_type, how='left', on='town')

# extract the whole years from remaining_lease as an int (any months are dropped)
resale_price['remaining_lease_years'] = resale_price['remaining_lease'].str.extract(r'(\d+)', expand=False).astype(int)

# calculate resale price per square metre per year of remaining lease
resale_price['resale_price_per_sq_per_y'] = (resale_price['resale_price'] / resale_price['floor_area_sqm']) / resale_price['remaining_lease_years']

# calculate mean groupby each region
average_prices = resale_price.groupby('region')['resale_price_per_sq_per_y'].mean()
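
The lease conversion above keeps only the whole years. If the `remaining_lease` strings also carry months, a version that keeps the fractional year might look like the sketch below. The format `61 years 04 months` is an assumption about the data, and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample values; the real column is assumed to follow this pattern.
lease = pd.Series(["61 years 04 months", "94 years", "55 years 11 months"])

# Capture the years, plus the months when they are present.
parts = lease.str.extract(
    r"(?P<years>\d+)\s*years?(?:\s*(?P<months>\d+)\s*months?)?"
)
years = parts["years"].astype(int)
months = parts["months"].fillna("0").astype(int)  # a missing months part counts as 0

# Remaining lease expressed in fractional years.
remaining_lease_years = years + months / 12
print(remaining_lease_years.round(2).tolist())
```

Whether the extra precision matters depends on how "year of remaining lease" is meant to be read in the question. Truncating to whole years slightly inflates the per-year price for every flat.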

In [6]:
# ANOVA test
f_statistic, f_p_value = stats.f_oneway(*[group['resale_price_per_sq_per_y'] for name, group in resale_price.groupby('region')])

print("F-statistic:", f_statistic)
print("F-P-value:", f_p_value)

F-statistic: 1743.465746542492
F-P-value: 0.0
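
To make the F-statistic less of a black box, the sketch below recomputes a one-way ANOVA by hand on synthetic data and compares it with `stats.f_oneway`. The three groups are randomly generated stand-ins for the per-region samples, not the resale data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic groups standing in for per-region samples (illustrative only).
groups = [rng.normal(loc=m, scale=1.0, size=30) for m in (10.0, 10.5, 12.0)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Between-group and within-group sums of squares.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within.
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```

A printed p-value of exactly `0.0` most likely means the true value is too small for a float to represent, not that it is literally zero.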


In [7]:
# kruskal test
h_statistic, h_p_value = stats.kruskal(*[group['resale_price_per_sq_per_y'] for name, group in resale_price.groupby('region')])
print("H-statistic:", h_statistic)
print("H-P-value:", h_p_value)


H-statistic: 2577.6400758012273
H-P-value: 0.0


#### Step 5: Conclusion 
Draw a conclusion from the two tests.

In [8]:
alpha = 0.05
if (f_p_value < alpha) and (h_p_value < alpha):
    print('Reject null hypothesis')
    print('Mean resale price is not the same for all regions')
else:
    print('Insufficient evidence to reject the null hypothesis')
    print('Cannot conclude that the mean resale price differs between regions')

Reject null hypothesis
Mean resale price is not the same for all regions


### 2.
#### Step 1


#### Step 2: Null and Alternative Hypotheses
##### Null Hypothesis: Flat type and region are independent.
##### Alternative Hypothesis: Flat type and region are not independent.

#### Steps 3 & 4: Compute test statistic and $p$-value

In [9]:
# July 2022 to December 2022, without 1 ROOM and MULTI-GENERATION flats
jul2Dec = flat_price[(flat_price['month'] >= '2022-07') & (flat_price['month'] <= '2022-12') & (flat_price['flat_type'] != '1 ROOM') & (flat_price['flat_type'] != 'MULTI-GENERATION')]
# attach each town's type and region to the filtered transactions
jul2Dec_resale_price = jul2Dec.merge(town_type, how='left', on='town')

# Contingency Tables
contingency_table = pd.crosstab(index=jul2Dec_resale_price['region'], columns=jul2Dec_resale_price['flat_type'])
chi2_statistic, p_value, dof, expected = stats.chi2_contingency(contingency_table)

In [10]:
# print result
print('chi2_statistic:', chi2_statistic)
print('p_value:', p_value)
print('degree of freedom', dof)
print('expected frequencies', expected)

chi2_statistic: 734.1655738392807
p_value: 6.874279549042278e-146
degree of freedom 16
expected frequencies [[  48.25486726  618.26548673 1078.57168142  631.08318584  179.82477876]
 [  36.90855457  472.89085546  824.96386431  482.69469027  137.5420354 ]
 [  48.51917404  621.6519174  1084.47935103  634.53982301  180.80973451]
 [  66.54867257  852.65486726 1487.46681416  870.33185841  247.99778761]
 [  55.76873156  714.53687316 1246.51828909  729.35044248  207.82566372]]
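
The expected frequencies and degrees of freedom reported by `stats.chi2_contingency` can be reproduced from the table margins. The sketch below does this on a small made-up table (the counts are illustrative, not the resale data) and compares the result with SciPy.

```python
import numpy as np
from scipy import stats

# Made-up 3x3 table of counts (rows: one category, columns: another).
observed = np.array([[20, 30, 50],
                     [30, 30, 40],
                     [50, 40, 10]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Under independence: expected count = row total * column total / grand total.
expected_manual = row_totals * col_totals / n
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(chi2_manual, chi2)
print(dof_manual, dof)
```

By the same formula, the 5 x 5 table in this question gives (5 - 1)(5 - 1) = 16 degrees of freedom, matching the output above. Every expected frequency there is well above 5, the usual rule of thumb for the chi-square approximation.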


#### Step 5: Conclusion 
Draw a conclusion from the test.

In [136]:
alpha = 0.05
if p_value < alpha:
    print('Reject null hypothesis')
    print('flat type and region are not independent')
else:
    print('Insufficient evidence to reject the null hypothesis')
    print('Cannot conclude that flat type and region are dependent')

Reject null hypothesis
flat type and region are not independent


# Question 2

The secretary problem *in its simplest form* has the following features.

1. There is one secretarial position available.
2. The number $n$ of applicants is known.
3. The applicants are interviewed sequentially in random order, each order being equally likely.
4. It is assumed that you can rank all the applicants from best to worst without ties. The decision to accept or reject an applicant must be based only on the relative ranks of those applicants interviewed so far.
5. An applicant once rejected cannot later be recalled.
6. You are very particular and will be satisfied with nothing but the very best.

This basic problem has a remarkably simple solution. First, one shows that attention can be restricted to the class of rules that for some integer $r \ge 1$ rejects the first $r - 1$ applicants, and then chooses the next applicant who is best in the relative ranking of the observed applicants. For such a rule, the probability, $p_n(r)$, of selecting the best applicant is $1/n$ for $r = 1$, and for $r > 1$,
\begin{equation}
  p_n(r) = \frac{r - 1}{n} \sum_{i=r}^n \frac{1}{i - 1}.
\end{equation}
The optimal $r$ is the one that maximises this probability. For small values of $n$, the optimal $r$ can easily be computed. For example, when $n = 11$, the function $p_n(r)$ takes on its maximum value when $r = 5$.
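
As a quick check of the claim for $n = 11$, the sketch below evaluates $p_n(r)$ exactly with rational arithmetic and picks the maximising $r$. The helper name `p_best` is introduced here for illustration only.

```python
from fractions import Fraction

def p_best(n, r):
    """Probability of selecting the best applicant when the first r - 1 are rejected."""
    if r == 1:
        return Fraction(1, n)
    return Fraction(r - 1, n) * sum(Fraction(1, i - 1) for i in range(r, n + 1))

n = 11
probs = {r: p_best(n, r) for r in range(1, n + 1)}
best_r = max(probs, key=probs.get)

for r, p in probs.items():
    print(r, float(p))
print("optimal r:", best_r, "with probability", float(probs[best_r]))
```

Using `Fraction` avoids floating-point ties when comparing neighbouring values of $r$.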

When $n = 11$ and using the optimal solution outlined above, use **simulation** to answer the following questions:

1. What is the probability that you **could not find an acceptable applicant?**
2. How many applicants do you expect to interview **by the time you accept an applicant?**

In [72]:
import random

In [117]:
def probability_of_selecting_best(n, r):
    # p_n(1) = 1/n: with r = 1 the first applicant is always accepted
    if r == 1:
        return 1 / n
    else:
        total = 0
        for i in range(r, n + 1):
            total += 1 / (i - 1)
        return total * (r - 1) / n

def find_best_r(n):
    best_r = 1
    max_probability = 0
    for r in range(1, n + 1):
        probability = probability_of_selecting_best(n, r)
        if probability > max_probability:
            max_probability = probability
            best_r = r
    return best_r

# simulate interviews
def secretary_simulation(n, iterations):

    best_r = find_best_r(n)
    interviews_until_found = 0
    no_acceptable_applicant_count = 0

    for _ in range(iterations):

        # applicants are ranked 1..n, with n the best; interview them in random order
        applicants = list(range(1, n + 1))
        random.shuffle(applicants)

        # best score among the first best_r - 1 applicants, who are always rejected
        highest_benchmark_score = max(applicants[:best_r - 1], default=0)
        interviews_until_found += best_r - 1

        is_found = False
        best_applicant = 0
        for i in range(best_r - 1, n):
            interviews_until_found += 1

            # accept the first applicant who beats the benchmark
            if highest_benchmark_score < applicants[i]:
                is_found = True
                best_applicant = applicants[i]
                break

        # failure: nobody was accepted (all n were interviewed),
        # or the accepted applicant is not the overall best
        if (not is_found) or (best_applicant != n):
            no_acceptable_applicant_count += 1

    return no_acceptable_applicant_count / iterations, interviews_until_found / iterations

n = 11  # number of applicants
iterations = 100000
no_acceptable_prob, expected_interviews = secretary_simulation(n, iterations)
print(f"1. Probability of not finding an acceptable applicant: {no_acceptable_prob*100:.4f}%")
print(f"2. Expected number of interviews before accepting an applicant: {expected_interviews:.2f}")

1. Probability of not finding an acceptable applicant: 60.0090%
2. Expected number of interviews before accepting an applicant: 8.38
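
The simulated figures can be compared with exact values. The sketch below uses the same conventions as the simulation: "not acceptable" means the very best applicant was not chosen, and all $n$ interviews are counted when nobody is accepted. Under this rule, the first acceptance happens at position $k \ge r$ with probability $(r-1)/((k-1)k)$, and nobody is accepted with probability $(r-1)/n$.

```python
from fractions import Fraction

n, r = 11, 5

# P(the first acceptance happens at position k), for k = r..n.
p_stop = {k: Fraction(r - 1, (k - 1) * k) for k in range(r, n + 1)}
# P(nobody beats the benchmark), i.e. the overall best is among the first r - 1.
p_none = Fraction(r - 1, n)
assert sum(p_stop.values()) + p_none == 1

# P(the accepted applicant is the overall best): this is p_n(r).
p_success = sum(Fraction(r - 1, n * (k - 1)) for k in range(r, n + 1))

# Expected number of interviews, counting all n when nobody is accepted.
expected_interviews = sum(k * p for k, p in p_stop.items()) + n * p_none

print("P(not selecting the best):", float(1 - p_success))
print("Expected interviews:", float(expected_interviews))
```

These exact values come out to roughly 0.6016 and 8.38, so the simulated 60.0090% and 8.38 are consistent with them to within simulation noise.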
