#### Inferential statistics are used to draw conclusions about a population from a sample of it. Random samples are drawn from the population, and those samples are then used to describe the population and to make inferences and predictions about it.

In this blog, the following topics will be explored:

Z Scores & Z-Test, t-Tests, F-test, Correlation Coefficients, Chi-Square

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#### Z Scores, Z-Test
Z-scores express how many standard deviations a score lies from the mean of its distribution. Under a normal distribution, this lets us calculate the probability of a score occurring, and it lets us compare scores drawn from two or more different normal distributions.

First, let's create a small dummy dataset.

In [2]:
df= pd.DataFrame({ 
    'Student': ['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10','c11','c12','c13','c14'],
    'Marks':[56,62,63,67,70,75,72,72,71,76,78,80,83,86]
                })

In [3]:
df

Unnamed: 0,Student,Marks
0,c1,56
1,c2,62
2,c3,63
3,c4,67
4,c5,70
5,c6,75
6,c7,72
7,c8,72
8,c9,71
9,c10,76


In [4]:
student_marks_mean = df['Marks'].mean()  # mean of student marks
student_marks_std = df['Marks'].std(ddof=0)  # population std; use ddof=1 for the sample std

In [5]:
df['z_score'] = (df['Marks'] - student_marks_mean) / student_marks_std  # z = (x - mean) / std

In [6]:
df

Unnamed: 0,Student,Marks,z_score
0,c1,56,-2.012952
1,c2,62,-1.268071
2,c3,63,-1.143925
3,c4,67,-0.647337
4,c5,70,-0.274897
5,c6,75,0.345838
6,c7,72,-0.026603
7,c8,72,-0.026603
8,c9,71,-0.15075
9,c10,76,0.469984
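
As a cross-check on the manual calculation, the sketch below recomputes the z-scores with NumPy and compares them with `scipy.stats.zscore`, which also uses the population standard deviation (`ddof=0`) by default. The array `marks` simply repeats the dummy marks from above, and the variable names are chosen only for this illustration.

```python
import numpy as np
from scipy import stats

# Same marks as in the dummy dataset above.
marks = np.array([56, 62, 63, 67, 70, 75, 72, 72, 71, 76, 78, 80, 83, 86])

# Manual z-score using the population standard deviation (ddof=0).
manual_z = (marks - marks.mean()) / marks.std(ddof=0)

# scipy.stats.zscore uses ddof=0 by default, so both should agree.
scipy_z = stats.zscore(marks)

print(np.allclose(manual_z, scipy_z))
print(manual_z[:3].round(6))
```

Passing `ddof=1` to both `std` and `zscore` would give the sample-based version instead.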


### Finding Percentage / Area Under the Curve

We can estimate the percentage of students who scored above 72. Assuming the marks are normally distributed, we use the mean and standard deviation calculated above to define the curve, and then take the area under that curve to the right of 72.

In [7]:
import scipy
from scipy import stats


In [8]:
area_for72 = 1-(scipy.stats.norm(student_marks_mean,student_marks_std).cdf(72))

In [9]:
area_for72   ### share of the area under the curve that lies above 72 marks

0.5106117683571852
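
The same area can be obtained with the survival function `norm.sf`, which evaluates `1 - cdf` directly. The sketch below also counts the share of students who actually scored above 72, to show that the normal curve is only an approximation for a dataset this small. It rebuilds the marks as a NumPy array so that it runs on its own; `observed_share` is a name introduced here for illustration.

```python
import numpy as np
from scipy import stats

marks = np.array([56, 62, 63, 67, 70, 75, 72, 72, 71, 76, 78, 80, 83, 86])
mean, std = marks.mean(), marks.std(ddof=0)

# Survival function: P(X > 72) under a normal curve with this mean and std.
area_above_72 = stats.norm(mean, std).sf(72)

# Share of students in the data who scored strictly above 72.
observed_share = (marks > 72).mean()

print(area_above_72)
print(observed_share)
```

Six of the fourteen students scored strictly above 72, so the observed share is lower than the area under the fitted curve. With only fourteen values the two are not expected to match closely.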

### Z Test
A Z-test is used to test whether the means of two datasets are similar or not.

Z-tests are statistical calculations that can be used to compare a sample mean to a population mean. Whereas a z-score tells you how far, in standard deviations, a single data point is from the mean of a data set, a Z-test compares a sample to a defined population. It is typically used for problems involving large samples (n > 30) and is most useful when the population standard deviation is known. Z-tests are also helpful when we want to test a hypothesis. An example is shown below.

In [10]:
df=pd.read_csv('/home/vinay/Downloads/titanic.csv')
dd= df[['Sex','Age']]

In [11]:
dd = dd.dropna()  # reassign instead of inplace=True, which triggers SettingWithCopyWarning on a slice


In [12]:
dd.isnull().sum()

Sex    0
Age    0
dtype: int64

In [13]:
dd

Unnamed: 0,Sex,Age
0,male,22.0
1,female,38.0
2,female,26.0
3,female,35.0
4,male,35.0
...,...,...
885,female,39.0
886,male,27.0
887,female,19.0
889,male,26.0


In [14]:
age_sample= dd['Age'].sample(30,random_state=11)
age_sample.mean()

28.483333333333334

In [15]:
pop_mean = dd['Age'].mean()
pop_mean

29.69911764705882

#### Since the sample is drawn from the population itself, we expect a Z-test to indicate that the difference between their means is statistically insignificant, especially when the sample size is about 30 or more. To confirm this, we perform a Z-test.

In [16]:
from statsmodels.stats.weightstats import ztest

In [17]:
x=np.array(age_sample) # convert the sample to a NumPy array

In [18]:
y=np.array(dd['Age']) # convert the population ages to a NumPy array

In [19]:
ztest(x,y,pop_mean)

(-11.463448135080279, 2.0133624643057705e-30)

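In the output, the Z statistic is -11.463448135080279 and the p-value is 2.0133624643057705e-30. As discussed in Z scores, Z test and Probability Distribution, the null hypothesis of a two-sample Z-test is that the two means differ by a hypothesised amount, which is normally 0 (the means are equal). At a 5% significance level we fail to reject the null hypothesis when the p-value is above 0.05 and reject it when the p-value is below 0.05. Here the p-value is far below 0.05, so the null hypothesis is rejected, but this does not show that the sample and the population differ. The third positional argument of ztest is value, the hypothesised difference between the two means, so ztest(x,y,pop_mean) tests whether the sample mean exceeds the population mean by about 29.7, which it clearly does not. The observed difference is only 28.48 - 29.70 ≈ -1.22, and the standard error implied by the output above is about 2.70, so testing for equal means with ztest(x,y) would give a Z statistic of roughly -0.45, well inside ±1.96. That is the statistically insignificant difference we expected. Alternatively, ztest(x,value=pop_mean) runs a one-sample test of the sample against the population mean.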
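
To make the role of the hypothesised difference concrete, here is a minimal hand-rolled two-sample Z-test on synthetic data. It is a sketch, not the statsmodels implementation: the helper `two_sample_z`, the pooled-variance standard error, and the synthetic "ages" are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def two_sample_z(x, y, value=0.0):
    """Two-sample z-test with pooled variance; H0: mean(x) - mean(y) == value."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    z = (x.mean() - y.mean() - value) / se
    p = 2 * stats.norm.sf(abs(z))  # two-sided p-value
    return z, p

rng = np.random.default_rng(0)
population = rng.normal(loc=30, scale=14, size=700)      # synthetic stand-in for the ages
sample = rng.choice(population, size=30, replace=False)  # sample drawn from that population

# H0: the two means are equal (difference of 0).
z_equal, p_equal = two_sample_z(sample, population)

# H0: the sample mean exceeds the population mean by the population mean itself.
z_shift, p_shift = two_sample_z(sample, population, value=population.mean())

print(z_equal, p_equal)
print(z_shift, p_shift)
```

The second call shifts the numerator by roughly 30, which is what produces a very large negative Z statistic even though the sample comes from the population.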

## t-Test

A t-test is used to see whether the means of two groups are similar or not. A Z-test serves the same purpose; the usual rule of thumb is that the Z-test is used when the sample size is greater than 30 and the population standard deviation is known, whereas the t-test is used when the sample size is less than 30 or the standard deviation has to be estimated from the sample.
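
One way to see why the sample size matters is to compare the two-sided 5% critical values of the t distribution with that of the normal distribution. The short sketch below prints them for a few sample sizes picked only for illustration.

```python
from scipy import stats

# Two-sided 5% critical value of the standard normal distribution.
z_crit = stats.norm.ppf(0.975)

for n in (5, 10, 30, 100, 1000):
    # A one-sample t-test on n observations has n - 1 degrees of freedom.
    t_crit = stats.t.ppf(0.975, df=n - 1)
    print(f"n={n:5d}  t critical={t_crit:.3f}  z critical={z_crit:.3f}")
```

For small samples the t critical value is noticeably larger, so the t-test demands stronger evidence. As n grows, the two values converge and the choice between the tests matters less.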

### Two-sided One-Sample t-Test

In [20]:
df= pd.read_csv('/home/vinay/Downloads/dataset tricks/Untitled Folder/Gender_Height_Weight.csv')

In [21]:
df

Unnamed: 0,Gender,Height,Weight,Index
0,Male,174,96,4
1,Male,189,87,2
2,Female,185,110,4
3,Female,195,104,3
4,Male,149,61,3
...,...,...,...,...
495,Female,150,153,5
496,Female,184,121,4
497,Female,141,136,5
498,Male,150,95,5


In [22]:
df['Weight'].mean(skipna=True)  ### let's first calculate the mean weight in the dataset

106.0

We find that the mean of the variable 'Weight' is 106.0, which is greater than 0.5. However, we need to run a t-test to find out whether the difference is statistically significant.

In [31]:
stats.ttest_1samp(df['Weight'],0.5)

Ttest_1sampResult(statistic=72.84934415715804, pvalue=4.5747623308549124e-268)

Our p-value comes out to be 4.5747623308549124e-268, which is less than 0.05 (5% significance level). Therefore, we reject the null hypothesis that the mean weight is equal to 0.5 and conclude that the mean differs significantly from 0.5 and that the difference is not due to random chance.
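
The statistic reported by `stats.ttest_1samp` can be reproduced by hand as t = (sample mean - hypothesised mean) / (s / sqrt(n)). The sketch below does this on the small marks dataset from the Z-score section rather than on the weight data, and the hypothesised mean `mu0 = 70` is an arbitrary value chosen only for illustration.

```python
import numpy as np
from scipy import stats

marks = np.array([56, 62, 63, 67, 70, 75, 72, 72, 71, 76, 78, 80, 83, 86], dtype=float)
mu0 = 70.0  # hypothesised mean, chosen only for illustration

n = len(marks)
standard_error = marks.std(ddof=1) / np.sqrt(n)   # sample std (ddof=1) over sqrt(n)
t_manual = (marks.mean() - mu0) / standard_error
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)  # two-sided p-value

result = stats.ttest_1samp(marks, mu0)

print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```

Both lines should print the same pair of numbers, which confirms that the null hypothesis being tested is "the mean equals `mu0`".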

### Independent t-Test

The Independent Samples t Test compares the means of two independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different.

In [48]:
df=pd.read_csv('/home/vinay/Downloads/titanic.csv')
dd= df[['Sex','Age']]

In [49]:
dd = dd.dropna()  # reassign instead of inplace=True, which triggers SettingWithCopyWarning on a slice


In [50]:
dd

Unnamed: 0,Sex,Age
0,male,22.0
1,female,38.0
2,female,26.0
3,female,35.0
4,male,35.0
...,...,...
885,female,39.0
886,male,27.0
887,female,19.0
889,male,26.0


In [57]:
female_age = dd[(dd['Sex']=='female')].Age

In [58]:
male_age = dd[(dd['Sex']=='male')].Age

In [59]:
female_age.mean()

27.915708812260537

In [60]:
male_age.mean()

30.72664459161148

We find that the mean age of females (27.915708812260537) is different from the mean age of males (30.72664459161148). However, we need to perform an independent t-test to find out whether the difference is statistically significant.

In [68]:
stats.ttest_ind(female_age,male_age,equal_var=False)

Ttest_indResult(statistic=-2.5258975171938896, pvalue=0.011814913211889735)

Our null hypothesis is that the two groups have the same mean age. Here, the p-value is less than 0.05, so we reject the null hypothesis and conclude that the mean ages of females and males are significantly different. Because we passed equal_var=False, this is Welch's t-test, which does not assume that the two groups have equal variance.
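
For readers who want to see what `equal_var=False` does, the sketch below computes Welch's t statistic and its Welch-Satterthwaite degrees of freedom by hand and compares them with `stats.ttest_ind`. The two groups are synthetic; their sizes, means and spreads are assumptions picked for illustration and are not the Titanic data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(28, 14, size=260)  # synthetic group A
group_b = rng.normal(31, 15, size=450)  # synthetic group B

var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)
n_a, n_b = len(group_a), len(group_b)

# Welch's standard error keeps the two variances separate (no pooling).
se = np.sqrt(var_a / n_a + var_b / n_b)
t_manual = (group_a.mean() - group_b.mean()) / se

# Welch-Satterthwaite approximation of the degrees of freedom.
dof = (var_a / n_a + var_b / n_b) ** 2 / (
    (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
)
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(t_manual, p_manual)
print(welch.statistic, welch.pvalue)
```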

### Paired T-test

We create a hypothetical dataset that holds the marks of students in two different tests. Here we presume that it is the same test taken at two points in time.

In [75]:
df = pd.DataFrame({
    'Student': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
    'Test A': [9,10,12,16,16,17,9,9,13,13,15,14,9,11,12,14,12,15,11,13],
    'Test B': [17,24,17,19,15,16,15,25,15,25,21,21,16,25,25,19,24,20,22,24]
})

In [76]:
df

Unnamed: 0,Student,Test A,Test B
0,1,9,17
1,2,10,24
2,3,12,17
3,4,16,19
4,5,16,15
5,6,17,16
6,7,9,15
7,8,9,25
8,9,13,15
9,10,13,25


We extract the scores of both tests. Test A is presumed to hold the scores before a certain teaching program, and Test B the marks of the same students on the same test taken after the program.

In [78]:
before = df['Test A']
after = df['Test B']

In [80]:
stats.ttest_rel(before,after)

Ttest_relResult(statistic=-6.970438606669267, pvalue=1.2167687282184405e-06)

Here our null hypothesis is that the mean difference between the two test scores is zero. The p-value comes out to be 1.2167687282184405e-06, which is far below 0.05 (5% significance level). Therefore, we reject the null hypothesis, i.e. these test scores are significantly different from each other.
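
A paired t-test is the same thing as a one-sample t-test on the per-student differences, tested against zero. The sketch below checks this on the Test A and Test B marks from above, rebuilt as NumPy arrays so that the block runs on its own.

```python
import numpy as np
from scipy import stats

before = np.array([9, 10, 12, 16, 16, 17, 9, 9, 13, 13, 15, 14, 9, 11, 12, 14, 12, 15, 11, 13])
after = np.array([17, 24, 17, 19, 15, 16, 15, 25, 15, 25, 21, 21, 16, 25, 25, 19, 24, 20, 22, 24])

# Paired test on the two columns.
paired = stats.ttest_rel(before, after)

# One-sample test on the differences, with a hypothesised mean difference of 0.
on_differences = stats.ttest_1samp(before - after, 0)

print(paired.statistic, paired.pvalue)
print(on_differences.statistic, on_differences.pvalue)
```

This is why the paired test needs the two columns to line up row by row: each student contributes one difference.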

### F-test

One-Way ANOVA is a type of F-test. We can reuse the Age data from the independent t-test and check whether the ages of males and females are statistically significantly different from each other. For this, we run a one-way ANOVA using a function called stats.f_oneway.
Remember that an ANOVA test needs one numerical column and one categorical column. It is normally used when there are more than two categories; with exactly two categories it is equivalent to the equal-variance t-test (the F statistic is the square of the t statistic), so the t-test is usually preferred. We use two categories here only to illustrate the function.

In [83]:
df=pd.read_csv('/home/vinay/Downloads/titanic.csv')
dd= df[['Sex','Age']]

In [84]:
dd = dd.dropna()  # reassign instead of inplace=True, which triggers SettingWithCopyWarning on a slice


In [85]:
dd

Unnamed: 0,Sex,Age
0,male,22.0
1,female,38.0
2,female,26.0
3,female,35.0
4,male,35.0
...,...,...
885,female,39.0
886,male,27.0
887,female,19.0
889,male,26.0


In [86]:
## we can use the same dataset for the ANOVA test

In [88]:
female_age = dd[(dd['Sex']=='female')].Age

In [89]:
male_age = dd[(dd['Sex']=='male')].Age

In [91]:
stats.f_oneway(female_age,male_age)

F_onewayResult(statistic=6.246032404476692, pvalue=0.012671296797014266)

Here too, the p-value is less than 0.05, so we reject the null hypothesis that the mean ages of the two groups are equal.
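
With only two groups, one-way ANOVA and the equal-variance t-test are the same test: the F statistic equals the square of the t statistic and the p-values coincide. The sketch below checks this on synthetic groups; the group sizes, means and spread are assumptions chosen for illustration and are not the Titanic ages.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(28, 14, size=260)  # synthetic group A
group_b = rng.normal(31, 14, size=450)  # synthetic group B

# One-way ANOVA on two groups.
f_stat, f_p = stats.f_oneway(group_a, group_b)

# Student's t-test with pooled variance (equal_var=True).
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=True)

print(f_stat, t_stat ** 2)
print(f_p, t_p)
```

The equality holds only for the pooled-variance t-test. With `equal_var=False` (Welch's test) the squared t statistic is generally a little different from F.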

### Correlation Coefficients