### Use Chi-squared tests on vessel temporal analysis and establish believable velocity distributions (by type) 

As the data scientist and product owner, we would like to run Chi-squared tests (at the 95% confidence level) to establish the stochastic distribution (random or unique) of vessel speed by vessel type. Try different candidate distributions until one gives a good fit (i.e. passes the test) on the test data.
Then check the result against a different test set (e.g. from a later time) to verify that the hypothesis still holds.
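
The candidate-distribution search described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's method: the speeds are synthetic, and the helper name `chi2_goodness_of_fit`, the two candidate distributions and the choice of ten equal-probability bins are all illustrative.

```python
import numpy as np
from scipy import stats


def chi2_goodness_of_fit(speeds, dist, n_bins=10, alpha=0.05):
    """Chi-squared goodness-of-fit of `speeds` against a fitted scipy distribution.

    Hypothetical helper: the bins are equal-probability bins under the fitted
    distribution, so every bin has the same expected count.
    """
    speeds = np.asarray(speeds, dtype=float)
    params = dist.fit(speeds)  # maximum-likelihood parameter estimates
    # Interior bin edges at the 10%, 20%, ... quantiles of the fitted distribution
    inner_edges = dist.ppf(np.linspace(0, 1, n_bins + 1)[1:-1], *params)
    # Observed count per bin
    observed = np.bincount(np.searchsorted(inner_edges, speeds), minlength=n_bins)
    expected = np.full(n_bins, len(speeds) / n_bins)
    # ddof removes one degree of freedom per fitted parameter
    stat, p_value = stats.chisquare(observed, expected, ddof=len(params))
    return stat, p_value, bool(p_value > alpha)


# Synthetic stand-in for the speeds of one vessel type (assumption, not AIS data)
rng = np.random.default_rng(0)
tanker_speeds = rng.normal(loc=12.0, scale=2.0, size=2000)

candidates = {"norm": stats.norm, "uniform": stats.uniform}
results = {name: chi2_goodness_of_fit(tanker_speeds, dist)
           for name, dist in candidates.items()}
for name, (stat, p_value, passed) in results.items():
    print(f"{name}: chi2={stat:.1f}, p={p_value:.4f}, passes={passed}")
```

A candidate "passes" when the test fails to reject it at the chosen significance level; the same call would then be repeated per vessel type and on the later test set.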

This assumes repeating the test for each vessel type, so that each type could potentially have its own distribution parameters or its own distribution type.
The learning set is the English Channel set.

Establishing that the distribution is not invariant over time would be a valid outcome for the story.
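
One way to check invariance over time is a chi-squared test of homogeneity between the binned speeds of two periods. The sketch below is illustrative only: the data are synthetic, and the helper `same_distribution_over_time` and the 2-knot bin edges are assumptions.

```python
import numpy as np
from scipy import stats


def same_distribution_over_time(early, late, bin_edges, alpha=0.05):
    """Chi-squared test of homogeneity between two periods (hypothetical helper)."""
    early_counts, _ = np.histogram(early, bins=bin_edges)
    late_counts, _ = np.histogram(late, bins=bin_edges)
    table = np.vstack([early_counts, late_counts])
    # Drop bins that are empty in both periods (their expected count would be 0)
    table = table[:, table.sum(axis=0) > 0]
    stat, p_value, dof, _ = stats.chi2_contingency(table)
    return stat, p_value, dof, bool(p_value > alpha)


rng = np.random.default_rng(42)
bin_edges = np.arange(0, 42, 2)  # 2-knot bands from 0 to 40 knots (assumed)

# Synthetic speeds for one vessel type in two periods
week_1 = rng.normal(12.0, 2.0, size=3000)
week_2_same = rng.normal(12.0, 2.0, size=3000)
week_2_shifted = rng.normal(14.0, 2.0, size=3000)

same = same_distribution_over_time(week_1, week_2_same, bin_edges)
shifted = same_distribution_over_time(week_1, week_2_shifted, bin_edges)
print("unchanged period:", same)
print("shifted period:  ", shifted)
```

A rejection for the later period would be the "not invariant over time" outcome mentioned above.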


## Chi-Square Test
The test is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables. Here the variables are "Speed Over the Ground (SOG)", the speed of the vessel relative to the surface of the earth, and "vessel_type" (Cargo, High Speed Craft, Law Enforcement, Passenger, Search And Rescue, Tanker, Tug, Vessel, and Wing In Ground-effect). SOG is a continuous measurement; in this notebook each distinct reported value is treated as its own category.

The Chi-Square test is a statistical test that measures the difference between the observed and the expected data. It can also be used to check for an association between categorical variables in our data. The purpose of the test is to determine whether the difference between two categorical variables is due to chance or due to a relationship between them.
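
As a small illustration of the test of independence, the sketch below runs `scipy.stats.chi2_contingency` on a toy table. The counts and the speed-band labels are made up for the example and do not come from the AIS data.

```python
import pandas as pd
from scipy import stats

# Hypothetical counts: rows are speed bands, columns are vessel types
observed = pd.DataFrame(
    {"Cargo": [30, 60, 10], "Tanker": [35, 55, 10], "Passenger": [5, 25, 70]},
    index=["slow", "medium", "fast"],
)

stat, p_value, dof, expected = stats.chi2_contingency(observed)

# Expected count for a cell = row total * column total / grand total
print("degrees of freedom:", dof)
print("expected count for slow/Cargo:", expected[0, 0])
print("reject independence at 5%:", p_value <= 0.05)
```

The degrees of freedom are (rows - 1) * (columns - 1), here (3 - 1) * (3 - 1).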

In [22]:
import pandas as pd
import scipy.stats as stats
from scipy.stats import chi2

In [23]:
#importing the required data
ais = pd.read_csv('ais_reporting_rates.csv')

In [24]:
#checking the first 5 records in the data
ais.head()

Unnamed: 0.1,Unnamed: 0,message_id,mmsi,message_time_stamp,lat,lon,heading,sog,status,destination,vessel_type
0,0,9428086,11223344,2020-03-12 17:21:26,50.423552,-0.580688,254.0,12.7,Under way using engine,,
1,1,9429453,11223344,2020-03-12 17:26:44,50.419567,-0.609708,260.0,12.6,Under way using engine,,
2,2,9430776,11223344,2020-03-12 17:31:22,50.416072,-0.635027,260.0,13.1,Under way using engine,,
3,3,9432046,11223344,2020-03-12 17:36:45,50.41195,-0.664525,260.0,13.0,Under way using engine,,
4,4,9433162,11223344,2020-03-12 17:42:02,50.407778,-0.693447,260.0,13.0,Under way using engine,,


In [25]:
#checking the last 5 records in the data
ais.tail()

Unnamed: 0.1,Unnamed: 0,message_id,mmsi,message_time_stamp,lat,lon,heading,sog,status,destination,vessel_type
368944,368944,11479752,710032130,2020-03-20 20:42:00,50.306265,-1.157915,257.0,13.0,Under way sailing,ANGRA DOS REIS_BRA,Tanker
368945,368945,11480956,710032130,2020-03-20 20:56:41,50.294537,-1.23987,257.0,13.2,Under way sailing,ANGRA DOS REIS_BRA,Tanker
368946,368946,11481903,710032130,2020-03-20 21:00:41,50.291395,-1.262475,257.0,13.5,Under way sailing,ANGRA DOS REIS_BRA,Tanker
368947,368947,11482991,710032130,2020-03-20 21:06:40,50.286498,-1.296945,257.0,13.7,Under way sailing,ANGRA DOS REIS_BRA,Tanker
368948,368948,11484116,710032130,2020-03-20 21:12:03,50.282047,-1.328098,257.0,13.8,Under way sailing,ANGRA DOS REIS_BRA,Tanker


In [26]:
#Removing the records with a NaN vessel_type
ais.dropna(subset = ["vessel_type"], inplace=True)

In [27]:
#checking the top 5 records again
ais.head()

Unnamed: 0.1,Unnamed: 0,message_id,mmsi,message_time_stamp,lat,lon,heading,sog,status,destination,vessel_type
649,649,11420222,111232533,2020-03-20 16:00:43,50.436197,-0.818347,76.3,62.0,,,Search And Rescue
650,650,11421615,111232533,2020-03-20 16:07:22,50.444643,-0.746983,232.2,10.0,,,Search And Rescue
651,651,11422809,111232533,2020-03-20 16:12:03,50.442142,-0.766278,256.4,9.0,,,Search And Rescue
652,652,11424033,111232533,2020-03-20 16:17:13,50.439643,-0.788652,262.1,10.0,,,Search And Rescue
653,653,11425149,111232533,2020-03-20 16:21:51,50.43724,-0.808268,252.9,10.0,,,Search And Rescue


In [28]:
# create contingency table
ais_crosstab = pd.crosstab(ais['sog'],
                            ais['vessel_type'],
                           margins=True, margins_name="Total")
ais_crosstab

vessel_type,Cargo,High Speed Craft,Law Enforcement,Passenger,Search And Rescue,Tanker,Tug,Vessel,Wing In Ground-effect,Total
sog,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0.0,1117,0,0,2,185,1249,1,828,0,3382
0.1,645,0,0,5,205,1102,0,68,0,2025
0.2,267,0,0,4,76,612,0,70,0,1029
0.3,206,0,0,2,38,422,0,97,0,765
0.4,218,0,0,5,25,359,0,117,0,724
...,...,...,...,...,...,...,...,...,...,...
30.1,0,1,0,0,0,0,0,0,0,1
30.3,0,1,0,0,0,0,0,0,0,1
35.0,1,0,0,0,0,0,0,0,0,1
62.0,0,0,0,0,1,0,0,0,0,1
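
The table above has one row per distinct SOG reading, so many cells hold counts of 0 or 1, while the chi-squared approximation is usually considered reliable only when expected counts are not too small (a common rule of thumb is at least 5). One option is to group SOG into bands before building the table. The sketch below uses a synthetic frame with the same `sog` and `vessel_type` column names; the band edges are an assumption.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-in for the AIS frame; only `sog` and `vessel_type` are modelled
demo = pd.DataFrame(
    {"vessel_type": rng.choice(["Cargo", "Tanker", "Passenger"], size=3000)}
)
mean_speed = demo["vessel_type"].map({"Cargo": 12.0, "Tanker": 11.0, "Passenger": 18.0})
demo["sog"] = np.clip(rng.normal(mean_speed, 3.0), 0.0, None).round(1)

# Group speeds into bands (assumed edges, in knots); intervals are closed on the left
edges = [0, 5, 10, 15, 20, np.inf]
demo["sog_band"] = pd.cut(demo["sog"], bins=edges, right=False)

# Contingency table without margins, so the degrees of freedom are not inflated
binned = pd.crosstab(demo["sog_band"], demo["vessel_type"])
stat, p_value, dof, expected = stats.chi2_contingency(binned)

print(binned)
print("smallest expected count:", expected.min())
print("chi2:", stat, "dof:", dof, "p-value:", p_value)
```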


In [29]:
ais_crosstab.values

array([[  1117,      0,      0, ...,    828,      0,   3382],
       [   645,      0,      0, ...,     68,      0,   2025],
       [   267,      0,      0, ...,     70,      0,   1029],
       ...,
       [     1,      0,      0, ...,      0,      0,      1],
       [     0,      0,      0, ...,      0,      0,      1],
       [197475,     78,     90, ...,  43235,     69, 351544]], dtype=int64)

In [30]:
#Raw records as an array (the observed counts used by the test are in ais_crosstab)
ais_observed = ais.values 
ais_observed

array([[649, 11420222, 111232533, ..., nan, nan, 'Search And Rescue'],
       [650, 11421615, 111232533, ..., nan, nan, 'Search And Rescue'],
       [651, 11422809, 111232533, ..., nan, nan, 'Search And Rescue'],
       ...,
       [368946, 11481903, 710032130, ..., 'Under way sailing',
        'ANGRA DOS REIS_BRA', 'Tanker'],
       [368947, 11482991, 710032130, ..., 'Under way sailing',
        'ANGRA DOS REIS_BRA', 'Tanker'],
       [368948, 11484116, 710032130, ..., 'Under way sailing',
        'ANGRA DOS REIS_BRA', 'Tanker']], dtype=object)

In [31]:
ais.values

array([[649, 11420222, 111232533, ..., nan, nan, 'Search And Rescue'],
       [650, 11421615, 111232533, ..., nan, nan, 'Search And Rescue'],
       [651, 11422809, 111232533, ..., nan, nan, 'Search And Rescue'],
       ...,
       [368946, 11481903, 710032130, ..., 'Under way sailing',
        'ANGRA DOS REIS_BRA', 'Tanker'],
       [368947, 11482991, 710032130, ..., 'Under way sailing',
        'ANGRA DOS REIS_BRA', 'Tanker'],
       [368948, 11484116, 710032130, ..., 'Under way sailing',
        'ANGRA DOS REIS_BRA', 'Tanker']], dtype=object)

In [32]:
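#getting the chi-square statistic, p-value, degrees of freedom and expected counts
#note: ais_crosstab includes the "Total" margins, so the degrees of freedom reported below
#(2394 = 266 * 9) are inflated; without margins they are (266-1)*(9-1) = 2120.
#The statistic itself is not affected by the margins.
ais_chi=stats.chi2_contingency(ais_crosstab)
ais_chi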


(213462.43455674342,
 0.0,
 2394,
 array([[1.89979192e+03, 7.50392554e-01, 8.65837562e-01, ...,
         4.15938745e+02, 6.63808798e-01, 3.38200000e+03],
        [1.13751586e+03, 4.49303643e-01, 5.18427281e-01, ...,
         2.49046705e+02, 3.97460915e-01, 2.02500000e+03],
        [5.78026577e+02, 2.28312814e-01, 2.63437863e-01, ...,
         1.26552622e+02, 2.01969028e-01, 1.02900000e+03],
        ...,
        [5.61736226e-01, 2.21878342e-04, 2.56013472e-04, ...,
         1.22986027e-01, 1.96276995e-04, 1.00000000e+00],
        [5.61736226e-01, 2.21878342e-04, 2.56013472e-04, ...,
         1.22986027e-01, 1.96276995e-04, 1.00000000e+00],
        [1.97475000e+05, 7.80000000e+01, 9.00000000e+01, ...,
         4.32350000e+04, 6.90000000e+01, 3.51544000e+05]]))

In [33]:
#getting the expected values
ais_expected=ais_chi[3]

In [34]:
#defining the degrees of freedom (df): (rows - 1) * (columns - 1),
#not counting the "Total" margin row and column of the crosstab
ais_rows=ais_crosstab.shape[0]-1
ais_columns=ais_crosstab.shape[1]-1
df=(ais_rows-1)*(ais_columns-1)
df

2120

In [35]:
alpha = 0.05

In [36]:
# Calculation of the Chi-square test statistic
chi_square = 0
rows = ais['sog'].unique()
columns = ais['vessel_type'].unique()
for i in columns:
    for j in rows:
        O = ais_crosstab[i][j]
        E = ais_crosstab[i]['Total'] * ais_crosstab['Total'][j] / ais_crosstab['Total']['Total']
        chi_square += (O-E)**2/E
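
The nested loop above sums (O - E)^2 / E over every cell, with E = row total * column total / grand total. The same quantity can be computed in vectorised form. The sketch below is an illustrative alternative on a hypothetical table without margins; `chi_square_statistic` is a made-up helper name, and the result is checked against `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy import stats


def chi_square_statistic(observed):
    """Pearson chi-square statistic and expected counts for a table without margins."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected count per cell under independence
    expected = row_totals * col_totals / observed.sum()
    return ((observed - expected) ** 2 / expected).sum(), expected


counts = np.array([[20, 30, 50], [30, 30, 40]])  # hypothetical counts
stat, expected = chi_square_statistic(counts)

# Cross-check against SciPy (no continuity correction, to match the plain formula)
ref_stat, ref_p, ref_dof, ref_expected = stats.chi2_contingency(counts, correction=False)
print(stat, ref_stat)
```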

In [37]:
# The p-value approach
# (the p-value is the upper tail of the chi-square distribution, not of the normal distribution)
print("Approach 1: The p-value approach to hypothesis testing in the decision rule")
p_value = stats.chi2.sf(chi_square, (len(rows)-1)*(len(columns)-1))
conclusion = "Failed to reject the null hypothesis."
if p_value <= alpha:
    conclusion = "Null Hypothesis is rejected."
        
print("chisquare-score is:", chi_square, " and p value is:", p_value)
print(conclusion)

Approach 1: The p-value approach to hypothesis testing in the decision rule
chisquare-score is: 213462.43455674336  and p value is: 0.0
Null Hypothesis is rejected.


In [38]:
# The critical value approach
print("\n--------------------------------------------------------------------------------------")
print("Approach 2: The critical value approach to hypothesis testing in the decision rule")
critical_value = stats.chi2.ppf(1-alpha, (len(rows)-1)*(len(columns)-1))
conclusion = "Failed to reject the null hypothesis."
if chi_square > critical_value:
    conclusion = "Null Hypothesis is rejected."
        
print("chisquare-score is:", chi_square, " and critical value is:", critical_value)
print(conclusion)


--------------------------------------------------------------------------------------
Approach 2: The critical value approach to hypothesis testing in the decision rule
chisquare-score is: 213462.43455674336  and critical value is: 2228.2300267343776
Null Hypothesis is rejected.
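
The two decision rules are equivalent: the statistic exceeds the critical value exactly when the p-value falls below alpha. A small sketch with a hypothetical statistic and hypothetical degrees of freedom (neither taken from the AIS data):

```python
from scipy import stats

alpha = 0.05
dof = 4            # hypothetical degrees of freedom
chi_square = 11.3  # hypothetical test statistic

# Upper-tail probability of the chi-square distribution
p_value = stats.chi2.sf(chi_square, dof)
# Value that a chi-square variable exceeds with probability alpha
critical_value = stats.chi2.ppf(1 - alpha, dof)

reject_by_p = p_value <= alpha
reject_by_critical = chi_square >= critical_value
print("p-value:", p_value, "critical value:", critical_value)
print("same decision:", reject_by_p == reject_by_critical)
```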


In [39]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise chi-square tests of independence between the categorical columns.
# An earlier version called sklearn.feature_selection.chi2 on every pair of columns
# (IDs, timestamps and coordinates included). One-hot encoding a target with
# 351544 distinct values raised:
# MemoryError: Unable to allocate 921. GiB for an array with shape (351544, 351544) and data type int64
categorical_cols = ['status', 'destination', 'vessel_type']  # assumed choice of columns

# Resultant Dataframe will be a dataframe where the column names and Index will be the same
# This is a matrix similar to correlation matrix which we get after df.corr()
# Initialize the values in this matrix with 0
resultant = pd.DataFrame(0.0, index=categorical_cols, columns=categorical_cols)

# Finding p_value for all pairs of columns and putting them in the resultant matrix
for i in categorical_cols:
    for j in categorical_cols:
        if i != j:
            table = pd.crosstab(ais[i], ais[j])
            chi2_val, p_val, dof, expected = stats.chi2_contingency(table)
            resultant.loc[i, j] = p_val
print(resultant)

In [None]:
# Plotting a heatmap
fig = plt.figure(figsize=(6,6))
sns.heatmap(resultant, annot=True, cmap='Blues')
plt.title('Chi-Square Test Results')
plt.show()

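## Conclusion

The null hypothesis is rejected at the 5% significance level (p-value of 0.0): the distribution of SOG is not independent of vessel type, so the speed distribution has to be established separately for each vessel type. This test of independence does not by itself show whether the speeds follow a particular distribution such as the normal distribution; that requires testing each candidate distribution against the data.
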


## References


### Chi-Square Distribution
When the null hypothesis is true, the sampling distribution of the test statistic is called the chi-squared distribution. The chi-squared test helps to determine whether there is a notable difference between the expected frequencies and the observed frequencies in one or more classes or categories. It gives the probability of seeing a difference at least this large if the variables are independent.

Note: the Chi-squared test is applicable only to categorical data, i.e. observations that fall into categories such as gender or, in this analysis, vessel type. Continuous quantities such as age, height or speed have to be grouped into bins first.


### Finding P-Value
P stands for probability here. The p-value is calculated from the chi-square test statistic: it is the probability of obtaining a statistic at least as large as the observed one if the null hypothesis is true. The value of p determines how the hypothesis is interpreted, as given below:

#### P ≤ 0.05: null hypothesis rejected
#### P > 0.05: null hypothesis not rejected
Probability is all about chance or risk or uncertainty. It is the possibility of the outcome of the sample or the occurrence of an event. But when we talk about statistics, it is more about how we handle various data using different techniques. It helps to represent complicated data or bulk data in a very easy and understandable way. It describes the collection, analysis, interpretation, presentation, and organization of data. The concept of both probability and statistics is related to the chi-squared test.