# Confidence Interval

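### Before we look at confidence intervals, let us recall the Central Limit Theorem
- If we repeatedly draw samples of the same size from a population and record each sample's **mean**, the distribution of those **means** approaches a **normal distribution** as the sample size grows, even when the population itself is not normal.
- The **confidence interval** builds on this result: because sample means are approximately normally distributed, we can use the normal distribution to say how far a sample mean is likely to be from the population mean.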
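
The behaviour described above can be checked with a small simulation. This is only a sketch under stated assumptions: the exponential "population", the sample size `n` and the number of samples are hypothetical choices for illustration and are not part of the class data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed "population": exponential values, clearly not normal.
population = rng.exponential(scale=1.0, size=100_000)

n = 50             # size of each sample
n_samples = 5_000  # how many samples we draw

# Draw many samples (with replacement) and record the mean of each one.
samples = rng.choice(population, size=(n_samples, n), replace=True)
sample_means = samples.mean(axis=1)

pop_mean = population.mean()
pop_std = population.std()

print("population mean:         ", pop_mean)
print("mean of sample means:    ", sample_means.mean())
print("std of sample means:     ", sample_means.std())
print("population std / sqrt(n):", pop_std / np.sqrt(n))
```

The sample means cluster around the population mean, and their spread is close to the population standard deviation divided by the square root of `n`, which is the quantity the next section calls the standard error.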

### Standard Error
The term **standard error** refers to the standard deviation of a sample statistic, such as the mean or median. For example, the "standard error of the mean" is the standard deviation of the distribution of sample means taken from a population. **The smaller the standard error, the more precisely the sample statistic estimates the corresponding population value.**

$$ SE = \frac{s}{\sqrt{n}} $$

where $s$ is the sample standard deviation and $n$ is the sample size.
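
As a minimal sketch of this formula, the standard error of the mean can be computed by hand and compared with `scipy.stats.sem`. The five values below are hypothetical and chosen only for illustration. Note that `np.std` divides by `n` by default (`ddof=0`); passing `ddof=1` gives the sample standard deviation $s$.

```python
import numpy as np
import scipy.stats as st

# Hypothetical sample of five prices (illustrative values only).
sample = np.array([900_000, 1_030_000, 1_250_000, 5_350_000, 6_800_000])

n = len(sample)
s = np.std(sample, ddof=1)   # sample standard deviation (divides by n - 1)
se_manual = s / np.sqrt(n)   # SE = s / sqrt(n)

# scipy.stats.sem computes the same quantity (it also uses ddof=1 by default).
se_scipy = st.sem(sample)

print(se_manual, se_scipy)
```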

Short video explaining Standard Error
- https://www.investopedia.com/terms/s/standard-error.asp

### The Confidence Interval !!!
- A confidence interval is a range of values, calculated from a sample, that is expected to contain a population parameter in a stated proportion of repeated samples (the confidence level, for example 95%).
- In our case, we use a sample to calculate a range of values within which we are confident the mean of the population lies.


**Confidence Interval = Estimate ± z-score × Standard Error**

For our case:

**Confidence Interval = Sample Mean ± z-score × Standard Error**

$$\text{CI} = \bar{x} \pm z \cdot SE$$


**Note:** we do not have to calculate the z-score ourselves; a function from `scipy.stats` computes the interval for us.
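
For reference, here is a short sketch of where the z-score comes from. For a two-sided interval with confidence level $c$, $z$ is the $(1 + c)/2$ quantile of the standard normal distribution, which `scipy.stats.norm.ppf` returns. The variable names are illustrative.

```python
import scipy.stats as st

# For a two-sided interval with confidence level c, the z-score is the
# (1 + c) / 2 quantile of the standard normal distribution.
for confidence in (0.90, 0.95, 0.99):
    z = st.norm.ppf((1 + confidence) / 2)
    print(f"{confidence:.0%} confidence -> z = {z:.3f}")

# 90% confidence -> z = 1.645
# 95% confidence -> z = 1.960
# 99% confidence -> z = 2.576

z_95 = st.norm.ppf(0.975)
```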

In [1]:
import pandas as pd 
import numpy as np
import random
import scipy.stats as st

#### The data used in this class is from Kaggle: [Property Listing KL](https://www.kaggle.com/dragonduck/property-listings-in-kuala-lumpur)


In [2]:
kl = pd.read_csv("./data/property_kl.csv")

In [3]:
len(kl)

53883

In [4]:
kl.head()

Unnamed: 0,Location,Price,Rooms,Bathrooms,Car Parks,Property Type,Size,Furnishing
0,"KLCC, Kuala Lumpur","RM 1,250,000",2+1,3.0,2.0,Serviced Residence,"Built-up : 1,335 sq. ft.",Fully Furnished
1,"Damansara Heights, Kuala Lumpur","RM 6,800,000",6,7.0,,Bungalow,Land area : 6900 sq. ft.,Partly Furnished
2,"Dutamas, Kuala Lumpur","RM 1,030,000",3,4.0,2.0,Condominium (Corner),"Built-up : 1,875 sq. ft.",Partly Furnished
3,"Cheras, Kuala Lumpur",,,,,,,
4,"Bukit Jalil, Kuala Lumpur","RM 900,000",4+1,3.0,2.0,Condominium (Corner),"Built-up : 1,513 sq. ft.",Partly Furnished


In [5]:
kl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53883 entries, 0 to 53882
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Location       53883 non-null  object 
 1   Price          53635 non-null  object 
 2   Rooms          52177 non-null  object 
 3   Bathrooms      51870 non-null  float64
 4   Car Parks      36316 non-null  float64
 5   Property Type  53858 non-null  object 
 6   Size           52820 non-null  object 
 7   Furnishing     46953 non-null  object 
dtypes: float64(2), object(6)
memory usage: 3.3+ MB


### For our purpose today, we are only going to look at the column `Price`

#### Some Cleaning

In [7]:
price_clean = kl['Price'].dropna()

In [8]:
price_clean

0        RM 1,250,000
1        RM 6,800,000
2        RM 1,030,000
4          RM 900,000
5        RM 5,350,000
             ...     
53878    RM 5,100,000
53879    RM 5,000,000
53880    RM 5,500,000
53881      RM 480,000
53882      RM 540,000
Name: Price, Length: 53635, dtype: object

In [9]:
# checking data type
type(price_clean[0])

str

In [10]:
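# Remove everything that is not a letter or digit (spaces and commas), then drop the 'RM' prefix.
# regex is set explicitly because the default of str.replace differs between pandas versions.
price_clean = price_clean.str.replace('[^A-Za-z0-9]+', '', regex=True)
price_clean = price_clean.str.replace('RM', '', regex=False)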

In [11]:
price_clean = price_clean.apply(int)
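
The cleaning steps above can also be tried on a tiny example. This is a sketch, not the notebook's own method: the four-element series is hypothetical, and it strips all non-digit characters in a single pass before converting with `pd.to_numeric`.

```python
import pandas as pd

# Hypothetical mini-series in the same format as the `Price` column.
raw = pd.Series(["RM 1,250,000", "RM 6,800,000", None, "RM 900,000"])

# Drop missing values, strip every non-digit character, then convert to integers.
digits = raw.dropna().str.replace(r"\D+", "", regex=True)
prices = pd.to_numeric(digits)

print(prices.tolist())
# [1250000, 6800000, 900000]
```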

In [12]:
price_clean

0        1250000
1        6800000
2        1030000
4         900000
5        5350000
          ...   
53878    5100000
53879    5000000
53880    5500000
53881     480000
53882     540000
Name: Price, Length: 53635, dtype: int64

In [13]:
# convert the cleaned prices to a plain Python list
price_list = list(price_clean.values)

#### Done Cleaning

### Now let us take samples of our data 

In [14]:
# Randomly choose 100 prices (without replacement) from the cleaned price list

sample = random.sample(price_list, 100)


In [15]:
# now we will take the mean of the sample
np.mean(sample)

2484319.77

### Let's obtain the standard error for our data
$$ SE = \frac{s}{\sqrt{n}} $$

In [16]:
# standard deviation / square root of sample size

In [17]:
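# standard deviation (s)
# Note: np.std defaults to ddof=0 (divides by n); np.std(sample, ddof=1) gives the
# sample standard deviation. For n = 100 the two differ by about 0.5%.
std = np.std(sample)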

In [20]:
# sample size (n)
n = len(sample)
n

100

In [21]:
# standard error (SE)
se = std/(np.sqrt(n))
se

236399.63499045424

**Confidence Interval = Sample Mean ± z-score × Standard Error**

$$\text{CI} = \bar{x} \pm z \cdot SE$$

In [22]:
# For this, we actually need the z-score, but we can just use the function from the scipy.stats library.

In [23]:
# 0.95 is the confidence level (95%). It is passed positionally because newer SciPy versions
# renamed this argument from `alpha` to `confidence`. The result is the 95% confidence interval.
st.norm.interval(0.95, loc=np.mean(sample), scale=se)

(2020984.9994602948, 2947654.540539705)
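
As a check on the arithmetic, the interval above can be reproduced by hand from the sample mean and standard error reported in the earlier cells. In this sketch the two numbers are copied from those outputs, and the names `lower` and `upper` are illustrative.

```python
import scipy.stats as st

# Values reported in the cells above.
sample_mean = 2484319.77
se = 236399.63499045424

z = st.norm.ppf(0.975)        # about 1.96 for a 95% interval
lower = sample_mean - z * se  # sample mean - z * SE
upper = sample_mean + z * se  # sample mean + z * SE

print(lower, upper)

# The same interval from the library function.
print(st.norm.interval(0.95, loc=sample_mean, scale=se))
```

Both lines print approximately `2020985.0` and `2947654.5`, matching the output above.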

### Interpreting the results... 
- The 2 values obtained are the **lower** and **upper** limits of our **confidence interval**. 
- For the sample above:
    - Sample mean - RM 2,484,319.77
    - Confidence interval (95%) - RM 2,020,985.00, RM 2,947,654.54
    - What the confidence interval is saying is that:
        - From the sample that we have taken, we are 95% confident that the **mean of the population** (average price of the houses in KL) is between 
        - **RM 2,020,985.00** and  **RM 2,947,654.54**

In [24]:
# If we check the actual mean of the full dataset, we get...
np.mean(price_list)

2091946.8565116061

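The mean price of houses in our dataset is **RM 2,091,946.86**, which lies within our confidence interval. This is consistent with the theory, although a single interval cannot prove it: about 95% of intervals built this way are expected to contain the population mean.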
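
What "95% confident" means can be illustrated by repeating the sampling many times and counting how often the interval contains the population mean. This is a sketch under stated assumptions: the lognormal "population" is synthetic and only stands in for skewed price data, and the sample size and number of trials are arbitrary choices.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(42)

# Hypothetical right-skewed population (lognormal), standing in for prices.
population = rng.lognormal(mean=14.0, sigma=0.5, size=50_000)
true_mean = population.mean()

n = 100         # sample size
trials = 2_000  # number of repeated samples
hits = 0

for _ in range(trials):
    sample = rng.choice(population, size=n, replace=False)
    se = np.std(sample, ddof=1) / np.sqrt(n)
    lower, upper = st.norm.interval(0.95, loc=sample.mean(), scale=se)
    if lower <= true_mean <= upper:
        hits += 1

coverage = hits / trials
print(f"{coverage:.1%} of the intervals contain the population mean")
```

The printed proportion should be close to 95%. Any single interval either contains the population mean or it does not; the confidence level describes how often the procedure succeeds over many samples.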

### Exercise 1. 
- Find the confidence interval for a sample size n = 50000. What difference do you see?
- Find the confidence interval for a sample size n = 50. What difference do you see?
- Change the confidence level to 99% (0.99). What difference do you see?

In [38]:
# sample size 50000
sample_5k = random.sample(price_list, 50000)
std_5k = np.std(sample_5k)
n_5k = len(sample_5k)


In [39]:
se_5k = std_5k/(np.sqrt(n_5k))
st.norm.interval(0.95, loc=np.mean(sample_5k), scale=se_5k)

(1970549.7770693835, 2205321.1821706165)

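With n = 50,000 the interval is much narrower (about RM 235,000 wide, compared with about RM 927,000 for n = 100), and it still contains the dataset mean of RM 2,091,946.86. A larger sample gives a smaller standard error and therefore a more precise estimate.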

In [40]:
# sample size 50
sample_50 = random.sample(price_list, 50)
std_50 = np.std(sample_50)
n_50 = len(sample_50)
se_50 = std_50/(np.sqrt(n_50))
st.norm.interval(0.95, loc=np.mean(sample_50), scale=se_50)

(1144862.2330500414, 1872100.3669499587)

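This interval misses the dataset mean of RM 2,091,946.86. With only 50 observations, the sample mean and standard deviation can change a lot from one sample to the next, so the interval is less reliable. Even under ideal conditions, about 5% of 95% intervals miss the population mean.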

Confidence level 99%

In [41]:
# sample size 50000
st.norm.interval(0.99, loc=np.mean(sample_5k), scale=se_5k)

(1933664.5160550706, 2242206.44318493)

In [42]:
# sample size 50
st.norm.interval(0.99, loc=np.mean(sample_50), scale=se_50)

(1030604.8413677205, 1986357.7586322795)
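
The pattern in this exercise can be summarised with a short sketch: the interval narrows as the sample size grows and widens as the confidence level rises. The helper `mean_ci` is hypothetical, and the data are synthetic lognormal values rather than the Kaggle prices, so the numbers will not match the outputs above.

```python
import numpy as np
import scipy.stats as st

def mean_ci(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean (hypothetical helper)."""
    sample = np.asarray(sample, dtype=float)
    se = np.std(sample, ddof=1) / np.sqrt(len(sample))
    return st.norm.interval(confidence, loc=sample.mean(), scale=se)

rng = np.random.default_rng(1)
# Hypothetical synthetic prices; not the Kaggle data.
population = rng.lognormal(mean=14.0, sigma=0.5, size=60_000)

widths = {}
for n in (50, 500, 5_000):
    sample = rng.choice(population, size=n, replace=False)
    for confidence in (0.95, 0.99):
        lower, upper = mean_ci(sample, confidence)
        widths[(n, confidence)] = upper - lower
        print(f"n={n:>5}, {confidence:.0%}: width = {upper - lower:,.0f}")
```

Because the standard error is $s / \sqrt{n}$, multiplying the sample size by 10 shrinks the width by a factor of roughly $\sqrt{10} \approx 3.2$, while moving from 95% to 99% replaces $z \approx 1.96$ with $z \approx 2.58$.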

### Application in Stocks (Maybe)
- [Difference between confidence level and confidence interval value risk var](https://www.investopedia.com/ask/answers/041615/whats-difference-between-confidence-level-and-confidence-interval-value-risk-var.asp)

### Videos Explaining Confidence Intervals
- https://www.youtube.com/watch?v=tFWsuO9f74o
- https://www.youtube.com/watch?v=s4SRdaTycaw

### Reference
- [Investopedia Confidence Interval](https://www.investopedia.com/terms/c/confidenceinterval.asp)