<font size="5">**ABHISHEK KUMAR SINGH**</font>

<font size="5">**2K19/CO/021**</font>

**<font size="8"><center>EXPERIMENT - 5</center></font>**

**AIM:** To implement the t-test and the Chi-Square Test in Python.

**THEORY**

**P-value** 

The P value, or calculated probability, is the probability of obtaining the observed results, or more extreme ones, when the null hypothesis (H0) of a study question is true. What counts as ‘extreme’ depends on how the hypothesis is being tested (one-sided or two-sided).
If the P value is less than the chosen significance level, we reject the null hypothesis, i.e. we accept that our sample gives reasonable evidence to support the alternative hypothesis. This does NOT imply a “meaningful” or “important” difference; that is for us to decide when considering the real-world relevance of our result.
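As a minimal sketch of how a two-sided P value follows from a test statistic, the snippet below converts a t statistic into a P value using the Student t distribution from SciPy. The statistic (t = 3.337) and degrees of freedom (119) are the ones reported by the paired t-test in this experiment; the variable names and the significance level `alpha = 0.05` are illustrative assumptions.

```python
import scipy.stats as stats

t_stat = 3.3371870510833657  # t statistic reported by the paired t-test in this experiment
dof = 119                    # degrees of freedom: n - 1 for 120 paired observations
alpha = 0.05                 # assumed significance level

# Two-sided P value: probability of observing |t| at least this large when H0 is true
p_value = 2 * stats.t.sf(abs(t_stat), dof)

# Decision rule: reject H0 when the P value falls below the significance level
reject_h0 = p_value < alpha
print(f"p = {p_value:.6f}, reject H0: {reject_h0}")
```

The survival function `stats.t.sf` gives the upper-tail probability; doubling it accounts for both tails of a two-sided test.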

**T-Test**

A t-test is an inferential statistical test used to determine whether there is a significant difference between the means of two groups, which may be related in certain features. It is mostly used when the data are approximately normally distributed (such as the number of heads recorded over repeated runs of 100 coin flips) and the population variances are unknown. The t-test is a hypothesis testing tool, which allows testing of an assumption applicable to a population.

**One sample t-test :-** The One Sample t Test determines whether the sample mean is statistically different from a known or hypothesised population mean. The One Sample t Test is a parametric test.
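The one sample t statistic can be sketched by hand as t = (sample mean − μ0) / (s / √n) and compared with `stats.ttest_1samp`. The data below are synthetic (generated with an assumed seed, mean and spread), and `mu0 = 2050` is used only for illustration.

```python
import numpy as np
import scipy.stats as stats

# Synthetic sample; seed, mean, spread and size are illustrative assumptions
rng = np.random.default_rng(0)
sample = rng.normal(loc=2000, scale=900, size=200)
mu0 = 2050  # hypothesised population mean

n = sample.size
# t = (sample mean - hypothesised mean) / standard error of the mean
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
# Two-sided P value with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), n - 1)

# SciPy's built-in test should agree with the manual computation
t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```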

**Two sample t-test :-** The Independent Samples t Test or 2-sample t-test compares the means of two independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. The Independent Samples t Test is a parametric test. This test is also known as the Independent t Test.
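A minimal sketch of the independent samples t statistic, assuming equal population variances (the default behaviour of `stats.ttest_ind`). The two groups are synthetic and their sizes, means and spreads are illustrative assumptions.

```python
import numpy as np
import scipy.stats as stats

# Two synthetic independent groups (all parameters are illustrative assumptions)
rng = np.random.default_rng(1)
group_a = rng.normal(loc=50, scale=10, size=40)
group_b = rng.normal(loc=55, scale=10, size=60)

n_a, n_b = group_a.size, group_b.size
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)

# t = difference in means / standard error of the difference
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
p_manual = 2 * stats.t.sf(abs(t_manual), n_a + n_b - 2)

# Built-in equivalent (equal_var=True is the default)
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)
print(t_manual, t_scipy)
```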

**Paired sample t-test :-** The paired sample t-test is also called the dependent sample t-test. It is a univariate test that tests for a significant difference between 2 related variables. An example of this is if we were to collect the blood pressure for an individual before and after some treatment, condition, or time point.
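The paired t-test is equivalent to a one sample t-test on the per-subject differences against a mean of zero. The sketch below illustrates this on synthetic before/after measurements; the seed, sample size and effect size are assumptions made only for the example.

```python
import numpy as np
import scipy.stats as stats

# Synthetic paired measurements (values are illustrative assumptions)
rng = np.random.default_rng(2)
before = rng.normal(loc=150, scale=12, size=30)
after = before - 4 + rng.normal(loc=0, scale=8, size=30)

# Paired t-test on the two related samples
t_paired, p_paired = stats.ttest_rel(before, after)

# Same test expressed as a one sample t-test on the differences
differences = before - after
t_diff, p_diff = stats.ttest_1samp(differences, 0)

print(t_paired, t_diff)
print(p_paired, p_diff)
```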

**Chi-Square Test** 

The test is applied when we have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables.
For example, in an election survey, voters might be classified by gender (male or female) and voting preference (Democrat, Republican, or Independent). We could use a chi-square test for independence to determine whether gender is related to voting preference.

A chi-square test is used in statistics to test the independence of two events. Given the data of two variables, we can get the observed count O and the expected count E. Chi-Square measures how much the expected count E and the observed count O deviate from each other.

![Chi](Chi.png)

Let’s consider a scenario where we need to determine the relationship between an independent categorical feature (predictor) and a dependent categorical feature (response). In feature selection, we aim to select the features which are highly dependent on the response.

When two features are independent, the observed count is close to the expected count, so the Chi-Square value is small. A high Chi-Square value therefore indicates that the hypothesis of independence is incorrect. In simple words, the higher the Chi-Square value, the more dependent the feature is on the response, and the better a candidate it is for model training.
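As a sketch of the computation described above, the expected counts and the statistic χ² = Σ (O − E)² / E can be derived directly from a contingency table. The observed counts below are the bathrooms-by-floors counts that appear in the code section of this experiment; the variable names are illustrative.

```python
import numpy as np
import scipy.stats as stats

# Observed counts: rows = bathrooms (1.5, 1.75, 2.0), columns = floors (1.0, 1.5, 2.0)
observed = np.array([[518, 111, 120],
                     [1375, 128, 111],
                     [694, 159, 159]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected count under independence: (row total * column total) / grand total
expected = row_totals * col_totals / grand_total

# Chi-square statistic: sum of squared deviations scaled by the expected counts
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

print(chi2_manual, dof, p_manual)
```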

**CODE AND OUTPUT:**

**Importing Libraries**

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats

**Loading Dataset**

In [2]:
df1 = pd.read_csv('house_data_preprocessed.csv')

df2 = pd.read_csv("blood_pressure.csv")

In [3]:
df1.head()

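| | sqft_living | bedrooms | bathrooms | floors | waterfront | condition | yr_built | yr_renovated | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1180 | 3.0 | 1.0 | 1.0 | 0 | 3 | 1955 | 0 | 221900 |
| 1 | 2570 | 3.0 | 2.25 | 2.0 | 0 | 3 | 1951 | 1991 | 538000 |
| 2 | 770 | 2.0 | 1.0 | 1.0 | 0 | 3 | 1933 | 0 | 180000 |
| 3 | 1960 | 4.0 | 3.0 | 1.0 | 0 | 5 | 1965 | 0 | 604000 |
| 4 | 1680 | 3.0 | 2.0 | 1.0 | 0 | 3 | 1987 | 0 | 510000 |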


In [4]:
df2.head()

| | patient | sex | agegrp | bp_before | bp_after |
|---|---|---|---|---|---|
| 0 | 1 | Male | 30-45 | 143 | 153 |
| 1 | 2 | Male | 30-45 | 163 | 170 |
| 2 | 3 | Male | 30-45 | 153 | 168 |
| 3 | 4 | Male | 30-45 | 153 | 142 |
| 4 | 5 | Male | 30-45 | 146 | 141 |


**T-Test**

1. **One Sample t-test**

In [5]:
stats.ttest_1samp(df1['sqft_living'], 2050)

Ttest_1sampResult(statistic=0.054997819686604936, pvalue=0.9561412177329551)

The p value obtained from the one sample t-test is not significant (p > 0.05). We therefore fail to reject the null hypothesis: the data give no evidence that the average sqft_living of the houses differs from 2050 sq ft. This is not proof that the mean is exactly 2050 sq ft.

2. **Two sample t-test (unpaired or independent t-test)**

In [6]:
stats.ttest_ind(df1[df1['waterfront'] == 1]['price'],
                df1[df1['waterfront'] == 0]['price'])

Ttest_indResult(statistic=32.519473397582, pvalue=2.133372922509434e-221)

There is a statistically significant difference in the average price between houses with waterfront and houses without waterfront, t = 32.519, p = 2.133e-221.

In [7]:
stats.ttest_ind(df1[df1['yr_renovated'] != 0]['price'],
                df1[df1['yr_renovated'] == 0]['price'])

Ttest_indResult(statistic=15.303226675820525, pvalue=2.5697679087349873e-52)

There is a statistically significant difference in the average price between houses that were renovated and houses that were not renovated, t = 15.303, p = 2.570e-52.
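By default `stats.ttest_ind` assumes that both groups have the same variance. When the groups differ greatly in size or spread, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below compares the two variants on synthetic data; the group sizes, means and spreads are assumptions chosen only to create unequal variances and are not taken from the housing dataset.

```python
import numpy as np
import scipy.stats as stats

# Synthetic groups with very different sizes and spreads (illustrative assumptions)
rng = np.random.default_rng(3)
small_group = rng.normal(loc=120, scale=40, size=75)
large_group = rng.normal(loc=100, scale=15, size=5000)

# Student's t-test: assumes equal variances (SciPy default)
student = stats.ttest_ind(small_group, large_group)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(small_group, large_group, equal_var=False)

# Welch statistic by hand: each group contributes its own variance term
se = np.sqrt(small_group.var(ddof=1) / small_group.size
             + large_group.var(ddof=1) / large_group.size)
t_welch_manual = (small_group.mean() - large_group.mean()) / se

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```

If the two variants give noticeably different results, the equal-variance assumption is worth checking before reporting the Student version.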

3. **Paired t-test (dependent t-test)**

In [8]:
df2[['bp_before','bp_after']].describe()

| | bp_before | bp_after |
|---|---|---|
| count | 120.0 | 120.0 |
| mean | 156.45 | 151.358333 |
| std | 11.389845 | 14.177622 |
| min | 138.0 | 125.0 |
| 25% | 147.0 | 140.75 |
| 50% | 154.5 | 149.5 |
| 75% | 164.0 | 161.0 |
| max | 185.0 | 185.0 |


In [9]:
stats.shapiro(df2['bp_before'])

ShapiroResult(statistic=0.9547787308692932, pvalue=0.0004928423441015184)

In [10]:
stats.shapiro(df2['bp_after'])

ShapiroResult(statistic=0.9740639328956604, pvalue=0.020227791741490364)

The Shapiro-Wilk test rejects normality for both variables at the 0.05 level (p = 0.0005 for bp_before and p = 0.020 for bp_after). Strictly, the paired t-test assumes that the paired differences (bp_before − bp_after) are normally distributed, so that is the quantity that should be checked. If normality does not hold, an appropriate alternative is the Wilcoxon signed-rank test. However, for demonstration purposes, I shall continue with the paired sample t-test.

Note: The findings from this analysis should be treated with caution because the normality assumption has not been shown to hold.

In [11]:
stats.ttest_rel(df2['bp_before'], df2['bp_after'])

Ttest_relResult(statistic=3.3371870510833657, pvalue=0.0011297914644840823)

The blood pressure before the intervention was higher (156.45 ± 11.39 units, mean ± standard deviation) than the blood pressure after the intervention (151.36 ± 14.18 units); there was a statistically significant decrease in blood pressure of 5.09 units (t(119) = 3.34, p = 0.0011).

Note: Assumption of normality violated, results should not be trusted. Data should be analyzed using Wilcoxon signed-rank Test.
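A hedged sketch of the alternative analysis: check normality of the paired differences with the Shapiro-Wilk test, then run the Wilcoxon signed-rank test with `stats.wilcoxon`. The data here are synthetic stand-ins (seed, means and spreads are assumptions); with the real dataset, `df2['bp_before']` and `df2['bp_after']` would take the place of `before` and `after`.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-in for the before/after blood pressure columns (assumed values)
rng = np.random.default_rng(42)
before = rng.normal(loc=156, scale=11, size=120)
after = before - 5 + rng.normal(loc=0, scale=15, size=120)

# The paired t-test assumes the differences are normally distributed
differences = before - after
shapiro_stat, shapiro_p = stats.shapiro(differences)

# Non-parametric alternative that does not require normal differences
wilcoxon_stat, wilcoxon_p = stats.wilcoxon(before, after)

print(f"Shapiro-Wilk on differences: W = {shapiro_stat:.4f}, p = {shapiro_p:.4f}")
print(f"Wilcoxon signed-rank: W = {wilcoxon_stat:.1f}, p = {wilcoxon_p:.4f}")
```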

**Chi-Square Test** 

In [12]:
crosstab1 = pd.crosstab(index=df1['bathrooms'], columns=df1['floors'])
crosstab1 = crosstab1.loc[1.50:2.00, 1.0:2.0]
crosstab1

| bathrooms \ floors | 1.0 | 1.5 | 2.0 |
|---|---|---|---|
| 1.5 | 518 | 111 | 120 |
| 1.75 | 1375 | 128 | 111 |
| 2.0 | 694 | 159 | 159 |


In [13]:
stats.chi2_contingency(crosstab1)

(127.48628089717965,
 1.3424545944700994e-26,
 4,
 array([[ 574.12237037,   88.32651852,   86.55111111],
        [1237.16088889,  190.33244444,  186.50666667],
        [ 775.71674074,  119.34103704,  116.94222222]]))

There is a statistically significant association between the number of bathrooms and the number of floors for the selected categories (χ² = 127.49, df = 4, p = 1.342e-26).

In [14]:
crosstab2 = pd.crosstab(index=df1['waterfront'], columns=df1['floors'])
crosstab2 = crosstab2.loc[:, 1.0:2.0]
crosstab2

| waterfront \ floors | 1.0 | 1.5 | 2.0 |
|---|---|---|---|
| 0 | 5816 | 1016 | 3634 |
| 1 | 26 | 13 | 36 |


In [15]:
stats.chi2_contingency(crosstab2)

(14.078073607255998,
 0.0008769708507269439,
 2,
 array([[5800.43373494, 1021.67858837, 3643.88767669],
        [  41.56626506,    7.32141163,   26.11232331]]))

There is a statistically significant association between waterfront status and the number of floors for the selected categories (χ² = 14.08, df = 2, p = 0.0009).
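With more than 10,000 observations, even a weak association can produce a very small p value. One common way to gauge the strength of an association is Cramér's V, defined as √(χ² / (n · (min(rows, columns) − 1))). The sketch below applies it to the waterfront-by-floors counts shown above; this effect size measure is an addition for illustration and not part of the original analysis.

```python
import numpy as np
import scipy.stats as stats

# Observed counts: rows = waterfront (0, 1), columns = floors (1.0, 1.5, 2.0)
observed = np.array([[5816, 1016, 3634],
                     [26, 13, 36]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Cramér's V: chi-square scaled by sample size and table dimensions (0 = none, 1 = perfect)
n = observed.sum()
min_dim = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))

# Rule of thumb for the chi-square approximation: all expected counts at least 5
min_expected = expected.min()

print(f"chi2 = {chi2:.3f}, p = {p_value:.6f}, dof = {dof}")
print(f"Cramér's V = {cramers_v:.4f}, smallest expected count = {min_expected:.2f}")
```

A value of V close to 0 indicates a weak association even when the p value is far below the significance level.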

**LEARNING OUTCOMES**

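We learnt how to perform the one sample, two sample and paired t-tests and the Chi-Square test of independence in Python using scipy.stats, and how to interpret the resulting p values.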