In [26]:
import pandas as pd
import numpy as np
from scipy.stats import t, norm


**Los_Angeles_Long_Beach_Anaheim_2 Dataset**

we chose our dataset Los_Angeles_Long_Beach_Anaheim_2

The data is taken form this [link](https://docs.google.com/spreadsheets/d/1Bk7CHkG_vT-cJH57r24nCqoIqxhJ8TmRwYEO-JWh6W8/edit?usp=sharing) 

Before the take we will looked up the dataset and modify the data type of the column

In [27]:
df= pd.read_csv("Los_Angeles_Long_Beach_Anaheim_2.csv")

In [28]:
# print df
def print_dataframe(df):
    pd.set_option('display.max_columns', None)
    pd.set_option('display.expand_frame_repr', False)
    pd.set_option('display.max_rows', None)
    pd.set_option('display.width', None)
    pd.set_option('display.max_colwidth', None)
    print(df)


In [29]:
# covert to datatype to DateTime 

df['DATE'] = pd.to_datetime(df['DATE'])
print_dataframe (df)

          DATE  CPI_all_items  Rent_of_Primary_residence  Monthly_Housing_Cost  CPI_Energy  US_Dollar_Purchasing_power
0   1997-01-01        159.400                    158.400               155.400     113.800                        62.8
1   1997-02-01        159.700                    158.700               155.600     114.700                        62.6
2   1997-03-01        159.800                    158.500               155.500     117.400                        62.5
3   1997-04-01        159.900                    158.800               155.200     120.200                        62.4
4   1997-05-01        159.900                    159.100               156.100     121.200                        62.5
5   1997-06-01        160.200                    159.300               156.400     120.600                        62.4
6   1997-07-01        160.400                    159.500               156.900     119.700                        62.3
7   1997-08-01        160.800                   

**1. Outlier Detection**

By using the Tukey’s rule ,we detect the outlier and delete it.  

Tukey’s rule ([Reference](https://www.linkedin.com/pulse/outlier-treatment-tukeys-method-r-swanand-marathe/)
) :

Divide the data set into Quartiles:

* Q1 (the 1st quartile): 25% of the data are less than or equal to this value

* Q3 (the 3rd quartile): 25% of the data are greater than or equal to this value

* IQR (the interquartile range): the distance between Q3 – Q1, it contains the middle 50% of the data

Lower Range = Q1 – (Alpha * IQR)

Upper Range = Q3 + (Alpha * IQR)


**Note**

1. Since there were zero outliers, we reduced the value of alpha from 1.5 to 1.0.(As instruted on [Piazza](https://piazza.com/class/lcuo750nulb47q/post/141))
2.I thought deleting rows with outliers would cause a lot of data loss, so I deleted only the elements with outliers.







In [30]:
# Task 3.1 
# reduce aplha 1.5 => 1

def find_outliers(df, column_name, alpha=1):

     column_df = df[column_name]
     # caculate IQR and range by using Tukey’s rule
     Q1 = column_df.quantile(0.25)
     Q3 = column_df.quantile(0.75)
     IQR  = Q3 -Q1 
    
     low_range = Q1 - IQR * alpha
     upper_range = Q3 +  IQR * alpha

     # get a outliers by using the range
     outliers = pd.DataFrame(columns=df.columns)
     print ()
     for _, r in df.iterrows():
        if r[column_name] < low_range or r[column_name] > upper_range:
            outliers = pd.concat([outliers, r.to_frame().T])

    # delete outliers 
     temp_df = df[[column_name]].copy()
     temp_df.drop(outliers.index, inplace=True)
     df[column_name] = temp_df[column_name]
     select_df = outliers[['DATE', column_name]]

   # print result
     print(f"Column_name:{column_name}")
     print(f"Q1:{Q1},Q3:{Q3},IQR:{IQR}")
     print(f"Low_range:{low_range}, Upper_range:{upper_range}")
     print(f"The number of outliers which is deleted :  {len(outliers)}")
     print(f"The outliers which is deleted:")
     print()
     if len(outliers) != 0:
         print(select_df)
     print()

     return  outliers


column_list= ['DATE','CPI_all_items', 'Rent_of_Primary_residence',
       'Monthly_Housing_Cost', 'CPI_Energy', 'US_Dollar_Purchasing_power']

for column in column_list:
    outliers = find_outliers(df, column)                           



Column_name:DATE
Q1:2003-06-23 12:00:00,Q3:2016-06-08 12:00:00,IQR:4734 days 00:00:00
Low_range:1990-07-07 12:00:00, Upper_range:2029-05-25 12:00:00
The number of outliers which is deleted :  0
The outliers which is deleted:



Column_name:CPI_all_items
Q1:183.85,Q3:240.13125,IQR:56.28125
Low_range:127.56875, Upper_range:296.4125
The number of outliers which is deleted :  4
The outliers which is deleted:

          DATE CPI_all_items
308 2022-09-01       296.539
309 2022-10-01       297.987
310 2022-11-01       298.598
311 2022-12-01        298.99


Column_name:Rent_of_Primary_residence
Q1:206.0,Q3:327.213,IQR:121.21300000000002
Low_range:84.78699999999998, Upper_range:448.42600000000004
The number of outliers which is deleted :  1
The outliers which is deleted:

          DATE Rent_of_Primary_residence
286 2020-11-01                      40.0


Column_name:Monthly_Housing_Cost
Q1:193.025,Q3:275.6565,IQR:82.63149999999999
Low_range:110.39350000000002, Upper_range:358.288
The number of

**2. Perform linear interpolation of missing data-points** 

By using linear interpolation ,we replace the  missing data-points.  

Linear interpolation:

1. slope = (y2 − y1)/(x2 − x1) 
2. missing value = y1 + slope ∗ (missing value x − x1) 
3. then replace the missing value 




**Note**

1. In linear interpolation, if the first or last row of a column is missing, there can be issues in calculating the interpolation. Therefore, if a missing value is found in the first or last row of the DataFrame, the default values for the first and last row of the column given in the project guidelines are updated.
2.During the linear interpolation process, if there are consecutive missing indices, the index is incremented until there are no missing values, then x2 is obtained, and the slope is calculated. After that, the missing values are updated using the formula from step 2 and the obtained values.






In [31]:
print("print Missing  Dataset")
print_dataframe(df)

print Missing  Dataset
          DATE  CPI_all_items  Rent_of_Primary_residence  Monthly_Housing_Cost  CPI_Energy  US_Dollar_Purchasing_power
0   1997-01-01        159.400                    158.400               155.400     113.800                        62.8
1   1997-02-01        159.700                    158.700               155.600     114.700                        62.6
2   1997-03-01        159.800                    158.500               155.500     117.400                        62.5
3   1997-04-01        159.900                    158.800               155.200     120.200                        62.4
4   1997-05-01        159.900                    159.100               156.100     121.200                        62.5
5   1997-06-01        160.200                    159.300               156.400     120.600                        62.4
6   1997-07-01        160.400                    159.500               156.900     119.700                        62.3
7   1997-08-01        160

This code is used to update the first or last row of a specific column with the default value if it is missing, as mentioned above 


In [32]:

def update_row(df, first_values, last_values):
    # get last idx 
    last_idx = len(df) - 1
    for i, column in enumerate(df.columns):
        # Check type of data column 
        if np.issubdtype(df[column].dtype, np.number):
            # check is the first row of the column  is NaN
            if np.isnan(df.loc[0, column]):
                # Update the Nan value as the default first value
                df.loc[0, column] = first_values[i]
                print(f"We find that the first row for column '{column}' has missing value and updated the  with value {first_values[i]}")
            # check is the last row of the column  is NaN 
            if np.isnan(df.loc[last_idx, column]):
                # Update the Nan value as the default last value
                df.loc[last_idx, column] = last_values[i]
                print(f"We find that the last row for column '{column}' has missing value and updated the  with value {last_values[i]}")
    

first_values= ['1997-01-01', 159.400 , 164.400, 155.100 , 115.200 , 62.8]
last_values = ['2022-12-01', 298.900, 385.649, 310.725, 287.176, 33.7]

update_row(df, first_values, last_values)

We find that the last row for column 'CPI_all_items' has missing value and updated the  with value 298.9


This is coder for the linear interpolation

In [33]:
# Task 3.2 

def linear_interpolation(df):
    missing_values = df.isna()
    for column in missing_values.columns:
        print(f"Column '{column}':")

        # find the missing values
        missing_rows = missing_values[column][missing_values[column]]
        for index in missing_rows.index:

            # If the index is alread updated skip the the column
            if not np.isnan(df.loc[index, column]):
                continue

             # find x1 get for the formula , x2 is not decided because they may be nan
            left_index  = index -1
            left_val= df.loc[left_index, column]

            right_index =index+1
            right_val =df.loc[right_index, column]
            roop = True

            # find x2 get for the formula 
            while roop:
                if not np.isnan(right_val): 
                    roop = False
                else : 
                    right_index =right_index+1
                    right_val =df.loc[right_index, column]
           # step1 for the formula  
            slope = (right_val - left_val) / (right_index - left_index)

            # the missing values are updated using the formula from step 2 and the obtained values.
            while index <  right_index:
                # step2 for the formula  
                missing_value = left_val + slope * (index - left_index)
                df.loc[index, column] = missing_value
                print(f"    - Find Missing value at index {index} and Replace the Missing value to {missing_value}")
                index+= 1
                
            
linear_interpolation(df)


print()
print("print Completed  Dataset")
print_dataframe(df)



Column 'DATE':
Column 'CPI_all_items':
    - Find Missing value at index 308 and Replace the Missing value to 296.215
    - Find Missing value at index 309 and Replace the Missing value to 297.11
    - Find Missing value at index 310 and Replace the Missing value to 298.005
Column 'Rent_of_Primary_residence':
    - Find Missing value at index 257 and Replace the Missing value to 358.4205
    - Find Missing value at index 269 and Replace the Missing value to 370.46900000000005
    - Find Missing value at index 273 and Replace the Missing value to 374.8265
    - Find Missing value at index 286 and Replace the Missing value to 384.03499999999997
Column 'Monthly_Housing_Cost':
    - Find Missing value at index 258 and Replace the Missing value to 299.297
    - Find Missing value at index 262 and Replace the Missing value to 300.81399999999996
Column 'CPI_Energy':
    - Find Missing value at index 268 and Replace the Missing value to 257.8195
    - Find Missing value at index 273 and Replac

**3. Performing Wald’s test, Z-test and t-test** 

By using the Wald’s test, Z-test and t-test,We check whether the null hypothesis is true or not.



### Wald's Test ([Reference](https://www3.cs.stonybrook.edu/~anshul/courses/cse544_s23/lec15.pptx))
Wald's test statistic is calculated as follows:

W = (θ̂ - θ₀) / sqrt(Var(θ̂))

where θ̂ is the estimator, θ₀ is the null hypothesis value, and Var(θ̂) is the variance of the estimator.

### Z-Test([Reference](https://www3.cs.stonybrook.edu/~anshul/courses/cse544_s23/lec17.pptx))
The Z-test statistic is calculated as follows:

Z = (X̄ - μ) / (σ / sqrt(n))

where X̄ is the sample mean, μ is the population mean, σ is the population standard deviation, and n is the sample size.

### T-Test([Reference](https://www3.cs.stonybrook.edu/~anshul/courses/cse544_s23/lec17.pptx))

The t-test statistic is calculated as follows:

t = (X̄ - μ) / (s / sqrt(n))

where X̄ is the sample mean, μ is the population mean, s is the sample standard deviation, and n is the sample size.

**Use MLE assume for the estimators purposes that daily data is Poisson distributed**

### Wald's Test for Poisson Distribution (MLE)
Wald's test statistic is calculated as follows:

W = (λ̂₁ - λ̂₂) / sqrt(λ̂₁/n₁ + λ̂₂/n₂)

where λ̂₁ and λ̂₂ are the MLE estimates of the Poisson parameters for samples 1 and 2, and n₁ and n₂ are the sample sizes.

### Z-Test for Poisson Distribution (MLE)
The Z-test statistic is calculated as follows:

Z = (λ̂₁ - λ̂₂) / sqrt(λ̂₁/n₁ + λ̂₂/n₂)

where λ̂₁ and λ̂₂ are the MLE estimates of the Poisson parameters for samples 1 and 2, and n₁ and n₂ are the sample sizes.

### T-Test for Poisson Distribution (MLE)
The t-test statistic is calculated as follows:

t = (λ̂₁ - λ̂₂) / sqrt(λ̂₁/n₁ + λ̂₂/n₂)

where λ̂₁ and λ̂₂ are the MLE estimates of the Poisson parameters for samples 1 and 2, and n₁ and n₂ are the sample sizes.









**Note**

1. We will use MLE for all test; assume for the estimators purposes that daily data is Poisson distributed. 
(As instruted on [Piazza](https://piazza.com/class/lcuo750nulb47q/post/153))
2.To comply with the guidelines, we first extracted the year and month using the DATE column, and then created data frames for the monthly averages of the "Rent_of_Primary_residence" and "CPI_Energy" column values for 2020 and 2021.
3.Using the data frames created above, we performed hypothesis testing to determine whether the monthly averages of "Rent_of_Primary_residence" and "CPI_Energy" for 2020 and 2021 were the same or not.
4.We will perform both one-sided and two-sided tests for the null hypothesis.
5.MLE of Poisson = mean of sample data = lambda   
6.Variance of MLE of Poisson =  lambda
7.we only use from scipy.stats import t, norm libary to get cdf
8.We didn't have a predefined alpha value for the null hypothesis, so we used the commonly accepted value of 0.05.
9.We calculate the p_value by considering the computed estimators, the alpha value, and the type of test (one-sided or two-sided).
10.When the p_value is greater than alpha, we fail to reject the null hypothesis, and when the p_value is less than or equal to alpha, we reject the null hypothesis.

Before conducting the one-sided or two-sided test, we need a function to create data frames for the monthly mean of a specific feature between 2021 and 2022 (Note 2).

In [34]:
def get_monthly_mean(df,year, column_name):
    # By using the df['DATE'], extracted the year and month column

    df['year'] = df['DATE'].dt.year
    df['month'] = df['DATE'].dt.month

    # Filter data for year
    df_year = df[df['year'] == year]

    # Calculate monthly mean for each year by using the .groupby
    monthly_mean_year = df_year.groupby('month')[column_name].mean()

    return monthly_mean_year



  



Conducted one side test the Hypothesis for specific columns:
* Null Hypothesis (H0): Mean of column in year1 <= Mean of column in year2
* Alternative Hypothesis (H1): Mean of column in year1 > Mean of column in year2

In [35]:
#step 3.3

def compare_means_one_sided(df, column_name, year1, year2,alpha = 0.05 ):

    # Null Hypothesis (H0): Mean of column in year1 <= Mean of column in year2
    # Alternative Hypothesis (H1): Mean of column in year1 > Mean of column in year2 
    print("This is one side test")
    print(f"Here are the results for the comparison of {column_name}'s mean between {year1} and {year2}")
    print(f"Null Hypothesis (H0): Mean of {column_name} in {year1} <= Mean of {column_name} in {year2}")
    print(f"Alternative Hypothesis (H1): Mean of {column_name} in {year1} > Mean of {column_name} in {year2}") 
    print()

    # Calculate monthly mean for each year
    monthly_mean_year1 = get_monthly_mean(df,year1,column_name)
    monthly_mean_year2 = get_monthly_mean(df,year2,column_name)

    # According to the assumption, we know that the sample data follows a Poisson distribution.
    # As a above note, we know MLE of Poisson = mean of sample data = lambda_hat_data, so
    lambda_hat_x , lambda_hat_y  = monthly_mean_year1.mean(), monthly_mean_year2.mean() 
  
    # Get a number of sample data of two sample 
    n1, n2 = len(monthly_mean_year1), len(monthly_mean_year2)

    # Wald's Test: pls check above formula
    # As a above note, we know variance of MLE of Poisson =  lambda ,so  

    wald_statistic = (lambda_hat_x - lambda_hat_y) / np.sqrt(lambda_hat_x/n1 + lambda_hat_y/n2)
    wald_p_value = 1 - norm.cdf(wald_statistic) 
    print(f"Statistic of Wald's Test:{wald_statistic}")

    # Z-test Test:pls check above formula
    # We realized that  z_statistic is same as wald_p_value

    z_statistic = (lambda_hat_x - lambda_hat_y) / np.sqrt(lambda_hat_x/n1 + lambda_hat_y/n2)
    z_p_value = 1 - norm.cdf(z_statistic)
    print(f"Statistic of Z-test: {z_statistic}")

    # T-test :pls check above formula
    t_statistic = (lambda_hat_x - lambda_hat_y) / np.sqrt(lambda_hat_x/n1 + lambda_hat_y/n2)
    # To use a t.cdf, we cacaluate degrees_of_freedom, below is the code 

    degrees_of_freedom = (((lambda_hat_x / n1) + (lambda_hat_y / n2)) ** 2) / ((((lambda_hat_x) ** 2) / ((n1 ** 2) * (n1 - 1))) + (((lambda_hat_y**2) ** 2)/ ((n2 ** 2) * (n2 - 1))))
    t_p_value = 1 - t.cdf(np.abs(t_statistic), degrees_of_freedom)
    print(f"Statistic of T-test = {t_statistic}")

    print()

    test_names = ["Wald's Test", "Z-test", "T-test"]
    p_values = [wald_p_value, z_p_value, abs(t_p_value)]

    for test_name, p_value in zip(test_names, p_values):
        if p_value > alpha:
            print(f"For {test_name}, P-value of the {test_name}: {p_value} > alpha: {alpha}),so the H0 is not rejected.")
        else:
            print(f"For {test_name}, P-value of the {test_name}: {p_value} <= alpha: {alpha}),so the H0 is rejected.")
        print()

compare_means_one_sided(df, "Rent_of_Primary_residence", 2020, 2021)
compare_means_one_sided(df, "CPI_Energy", 2020, 2021)


This is one side test
Here are the results for the comparison of Rent_of_Primary_residence's mean between 2020 and 2021
Null Hypothesis (H0): Mean of Rent_of_Primary_residence in 2020 <= Mean of Rent_of_Primary_residence in 2021
Alternative Hypothesis (H1): Mean of Rent_of_Primary_residence in 2020 > Mean of Rent_of_Primary_residence in 2021

Statistic of Wald's Test:-0.9567867619485085
Statistic of Z-test: -0.9567867619485085
Statistic of T-test = -0.9567867619485085

For Wald's Test, P-value of the Wald's Test: 0.8306625523350959 > alpha: 0.05),so the H0 is not rejected.

For Z-test, P-value of the Z-test: 0.8306625523350959 > alpha: 0.05),so the H0 is not rejected.

For T-test, P-value of the T-test: 0.4993266351847243 > alpha: 0.05),so the H0 is not rejected.

This is one side test
Here are the results for the comparison of CPI_Energy's mean between 2020 and 2021
Null Hypothesis (H0): Mean of CPI_Energy in 2020 <= Mean of CPI_Energy in 2021
Alternative Hypothesis (H1): Mean of CPI_

Conducted two side test the Hypothesis for specific columns:
* Null Hypothesis (H0): Mean of column in year1 = Mean of column in year2
* Alternative Hypothesis (H1): Mean of column in year1 != Mean of column in year2

In [36]:
#step 3.3

def compare_means_two_sided(df, column_name, year1, year2,alpha = 0.05 ):

    # Null Hypothesis (H0): Mean of column in year1 = Mean of column in year2
    # Alternative Hypothesis (H1): Mean of column in year1 != Mean of column in year2 

    print("This is two side test")
    print(f"Here are the results for the comparison of {column_name}'s mean between {year1} and {year2}")
    print(f"Null Hypothesis (H0): Mean of {column_name} in {year1} = Mean of {column_name} in {year2}")
    print(f"Alternative Hypothesis (H1): Mean of {column_name} in {year1} != Mean of {column_name} in {year2}") 
    print()

    # Calculate monthly mean for each year
    monthly_mean_year1 = get_monthly_mean(df,year1,column_name)
    monthly_mean_year2 = get_monthly_mean(df,year2,column_name)

    # According to the assumption, we know that the sample data follows a Poisson distribution.
    # As a above note, we know MLE of Poisson = mean of sample data = lambda_hat_data, so
    lambda_hat_x , lambda_hat_y  = monthly_mean_year1.mean(), monthly_mean_year2.mean() 
  
    # Get a number of sample data of two sample 
    n1, n2 = len(monthly_mean_year1), len(monthly_mean_year2)

    # Wald's Test: pls check above formula
    # As a above note, we know variance of MLE of Poisson =  lambda ,so  

    wald_statistic = (lambda_hat_x - lambda_hat_y) / np.sqrt(lambda_hat_x/n1 + lambda_hat_y/n2)
    wald_p_value = 2*(1 - norm.cdf(wald_statistic))
    print(f"Statistic of Wald's Test:{wald_statistic}")

    # Z-test Test:pls check above formula
    # We realized that  z_statistic is same as wald_p_value

    z_statistic = (lambda_hat_x - lambda_hat_y) / np.sqrt(lambda_hat_x/n1 + lambda_hat_y/n2)
    z_p_value = 2*(1 - norm.cdf(z_statistic))
    print(f"Statistic of Z-test: {z_statistic}")

    # T-test :pls check above formula
    t_statistic = (lambda_hat_x - lambda_hat_y) / np.sqrt(lambda_hat_x/n1 + lambda_hat_y/n2)
    # To use a t.cdf, we cacaluate degrees_of_freedom, below is the code 
    degrees_of_freedom = (((lambda_hat_x / n1) + (lambda_hat_y / n2)) ** 2) / ((((lambda_hat_x) ** 2) / ((n1 ** 2) * (n1 - 1))) + (((lambda_hat_y**2) ** 2)/ ((n2 ** 2) * (n2 - 1))))
    t_p_value = 2*(1 - t.cdf(np.abs(t_statistic), degrees_of_freedom))
    print(f"Statistic of T-test = {t_statistic}")

    print()

    test_names = ["Wald's Test", "Z-test", "T-test"]
    p_values = [wald_p_value, z_p_value, abs(t_p_value)]

    for test_name, p_value in zip(test_names, p_values):
        if p_value > alpha:
            print(f"For {test_name}, P-value of the {test_name}: {p_value} > alpha: {alpha}),so the H0 is not rejected.")
        else:
            print(f"For {test_name}, P-value of the {test_name}: {p_value} <= alpha: {alpha}),so the H0 is rejected.")
        print()

compare_means_two_sided(df, "Rent_of_Primary_residence", 2020, 2021)
compare_means_two_sided(df, "CPI_Energy", 2020, 2021)


This is two side test
Here are the results for the comparison of Rent_of_Primary_residence's mean between 2020 and 2021
Null Hypothesis (H0): Mean of Rent_of_Primary_residence in 2020 = Mean of Rent_of_Primary_residence in 2021
Alternative Hypothesis (H1): Mean of Rent_of_Primary_residence in 2020 != Mean of Rent_of_Primary_residence in 2021

Statistic of Wald's Test:-0.9567867619485085
Statistic of Z-test: -0.9567867619485085
Statistic of T-test = -0.9567867619485085

For Wald's Test, P-value of the Wald's Test: 1.6613251046701918 > alpha: 0.05),so the H0 is not rejected.

For Z-test, P-value of the Z-test: 1.6613251046701918 > alpha: 0.05),so the H0 is not rejected.

For T-test, P-value of the T-test: 0.9986532703694486 > alpha: 0.05),so the H0 is not rejected.

This is two side test
Here are the results for the comparison of CPI_Energy's mean between 2020 and 2021
Null Hypothesis (H0): Mean of CPI_Energy in 2020 = Mean of CPI_Energy in 2021
Alternative Hypothesis (H1): Mean of CPI_E