## Statistical Testing For Independent, Continuous Management Data With Work Time As Response Variable

### Getting Management Data From A CSV File
The read.table function reads the CSV file containing the HR Management Data and stores the results in a data frame.

In [18]:
df <- read.table("C:/Kal/Stat-Work/Stat-Code/SciKit/Data/Management_DataSet_Classification.csv", 
                 sep=",",
                 skip=1,
                 col.names=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V"),
                 nrows=1000)
message("DataFrame is: ", df)

DataFrame is: 1:1000c(7633, 7215, 7110, 1626, 6404, 9570, 8552, 1434, 9545, 7899, 2696, 8123, 1278, 4195, 5897, 9250, 4625, 9079, 1032, 9136, 9569, 3058, 338, 8629, 1863, 5815, 3232, 72, 269, 3466, 2521, 9982, 8550, 3127, 2199, 5439, 8477, 1020, 8582, 9915, 4760, 1568, 8699, 9854, 9324, 2202, 5173, 9671, 56, 181, 1121, 2803, 6568, 4624, 174, 7285, 9355, 5812, 3784, 8548, 3339, 5607, 8882, 4612, 38, 975, 111, 1041, 6854, 8828, 5333, 1235, 5668, 383, 2716, 663, 7456, 8938, 9238, 3212, 222, 9762, 6365, 9895, 9849, 9476, 
6609, 6903, 5132, 8339, 1221, 4241, 8871, 182, 5063, 8143, 4715, 274, 7871, 3162, 6403, 2376, 1542, 4738, 946, 1035, 3765, 3335, 7788, 6544, 5387, 731, 1357, 8484, 5760, 6700, 5253, 6827, 8962, 1577, 3935, 2096, 7053, 7031, 4150, 8417, 6777, 7414, 2531, 2533, 6153, 3591, 3185, 4553, 9962, 5964, 7434, 4137, 9674, 7039, 1773, 3989, 4562, 4805, 1954, 6543, 4517, 7576, 9678, 243, 8080, 7252, 2276, 8077, 3196, 3877, 1249, 8797, 6825, 1329, 1070, 7218, 8122, 5447, 1051, 5524, 7
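
Passing a data frame to message() pastes every column into one long string, which is why the output above is hard to read. A minimal sketch of a more readable inspection follows. It assumes a small synthetic data frame written to a temporary file as a stand-in for the management CSV, and the names `demo`, `csv_path` and `df_demo` are hypothetical.

```r
# Hypothetical stand-in for the management CSV: 22 integer columns, 50 rows.
set.seed(1)
cols <- LETTERS[1:22]                         # "A" through "V"
demo <- as.data.frame(matrix(sample(1:10000, 50 * 22, replace = TRUE), ncol = 22))
csv_path <- tempfile(fileext = ".csv")
write.csv(demo, csv_path, row.names = FALSE)

# Same read.table arguments as above, pointed at the temporary file.
df_demo <- read.table(csv_path, sep = ",", skip = 1, col.names = cols, nrows = 1000)

str(df_demo)                      # column types and the first few values of each column
head(df_demo, 3)                  # first three rows
summary(df_demo[, c("B", "C")])   # quartiles and mean of the two response columns
```

str(), head() and summary() print one column or row at a time, so they stay readable for a data frame of this width.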

### Simple Random Sampling based Observational Study for Measuring Work Time
Columns "B" and "C" are the Response Variables in the data set, and columns "D" through "T" are the Explanatory variables. "B" and "C" are independent samples obtained from Cross-Sectional Studies that measure the Work Time of each corporate resource.


### Testing If The Two Data Vectors Come From The Same Distribution
The two-sample Kolmogorov-Smirnov test is used to test whether two data vectors come from the same distribution.

In [19]:
ks.test(df[["B"]], df[["C"]])

"p-value will be approximate in the presence of ties"


	Two-sample Kolmogorov-Smirnov test

data:  df[["B"]] and df[["C"]]
D = 0.046, p-value = 0.2406
alternative hypothesis: two-sided


Since the p-value (0.2406) is greater than alpha = 0.05 (= 1 - 0.95), where alpha is the level of significance for the test, the NULL hypothesis is not rejected. There is no evidence that the two data vectors, B and C, come from different distributions.
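
The statistic D reported by ks.test is the largest vertical gap between the two empirical distribution functions. The following sketch recomputes it by hand. It assumes two simulated samples, `x` and `y`, as hypothetical stand-ins for columns B and C.

```r
set.seed(42)
x <- rnorm(200, mean = 50, sd = 10)   # hypothetical sample standing in for column B
y <- rnorm(200, mean = 50, sd = 10)   # hypothetical sample standing in for column C

Fx <- ecdf(x)                         # empirical distribution function of x
Fy <- ecdf(y)                         # empirical distribution function of y
grid <- sort(c(x, y))                 # the gap can only change at observed values
D_manual <- max(abs(Fx(grid) - Fy(grid)))

res <- ks.test(x, y)
c(manual = D_manual, ks.test = unname(res$statistic))
res$p.value > 0.05                    # TRUE means the NULL hypothesis is not rejected at alpha = 0.05
```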

### Two-Sample Wilcoxon Test For Hypothesis Testing Of Population Medians 
Since the normality assumption for the two distributions fails, a nonparametric procedure is needed. The Two-sample Wilcoxon Test (Wilcoxon rank sum test) tests the equality of two population medians, Mx and My, using independent samples. Here the two samples are independent and each has 1000 observations, so the test applies.

Step #1: Null hypothesis: H0: Mx = My.
         Alternate hypothesis: H1: Mx is not equal to My (two-tailed test, the default of wilcox.test) (Population median Mx differs from Population median My)
         
Step #2: The level of significance is alpha = 0.05 or 5%. 
 
Step #3: Pool the two samples and rank all observations from smallest to largest. Handle ties by assigning the mean of the ranks to the tied values.
  Let S be the sum of the ranks of the sample observations from population X, n1 the size of the sample from population X, and n2 the size of the sample from population Y. The test statistic is computed from S, n1 and n2.
  
Step #4: The level of significance, alpha, is used to determine the critical value. Here, alpha = 0.05 and n1 = n2 = 1000, so both samples are large (n1 > 20 and n2 > 20).
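
Steps #3 and #4 can be sketched by hand as follows. This is an illustration under stated assumptions, not the exact computation inside wilcox.test: the samples `x` and `y` are simulated stand-ins, and the z statistic uses the plain large-sample normal approximation without the tie and continuity corrections that R applies.

```r
set.seed(7)
x <- round(rnorm(40, mean = 100, sd = 15))   # hypothetical sample from population X
y <- round(rnorm(45, mean = 105, sd = 15))   # hypothetical sample from population Y
n1 <- length(x)
n2 <- length(y)

r <- rank(c(x, y))           # pooled ranks; ties receive the mean of their ranks
S <- sum(r[seq_len(n1)])     # rank sum for the sample from population X

# Large-sample normal approximation for S (no tie or continuity correction).
mu_S    <- n1 * (n1 + n2 + 1) / 2
sigma_S <- sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z <- (S - mu_S) / sigma_S
p_two_sided <- 2 * pnorm(-abs(z))

# wilcox.test reports W, which is S shifted by its smallest possible value.
W <- S - n1 * (n1 + 1) / 2
res <- wilcox.test(x, y, exact = FALSE)
c(W_manual = W, W_wilcox = unname(res$statistic))
c(p_manual = p_two_sided, p_wilcox = res$p.value)
```

The two W values agree exactly. The p-values differ slightly because of the corrections mentioned above.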

In [20]:
wilcox.test(df[["B"]], df[["C"]])



	Wilcoxon rank sum test with continuity correction

data:  df[["B"]] and df[["C"]]
W = 524990, p-value = 0.05297
alternative hypothesis: true location shift is not equal to 0


### Interpretation
Here is an explanation of the output. The test function first shows the test statistic (W = 524990) and the p-value for the statistic (0.05297). Since the p-value is greater than alpha = 0.05, the difference between the two groups is not statistically significant at the 5% level of significance (though it barely misses) and the NULL hypothesis is not rejected.

There is no evidence that the median of population df[["B"]] differs from the median of population df[["C"]].
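
wilcox.test runs a two-sided test unless told otherwise, as the output line "alternative hypothesis: true location shift is not equal to 0" shows. When a directional alternative such as Mx < My is of interest, the `alternative` argument selects it, and `conf.int = TRUE` adds an estimate of the location shift with a confidence interval. The sketch below assumes simulated samples `x` and `y` as hypothetical stand-ins for the two columns.

```r
set.seed(3)
x <- rnorm(60, mean = 10, sd = 2)   # hypothetical sample from population X
y <- rnorm(60, mean = 11, sd = 2)   # hypothetical sample from population Y

two_sided <- wilcox.test(x, y)                         # H1: location shift is not 0
left      <- wilcox.test(x, y, alternative = "less")   # H1: X is shifted to the left of Y
with_ci   <- wilcox.test(x, y, conf.int = TRUE)        # adds a shift estimate and interval

c(two_sided = two_sided$p.value, left = left$p.value)
with_ci$estimate    # estimated difference in location
with_ci$conf.int    # 95% confidence interval for that difference
```

The test statistic W is the same in all three calls. Only the tail used for the p-value changes.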

## Statistical Testing For Dependent, Continuous Management Data With Work Time As Response Variable

### Getting Management Data From A CSV File
The read.table function reads the CSV file containing the HR Management Data and stores the results in a data frame.

In [21]:
df <- read.table("C:/Kal/Stat-Work/Stat-Code/SciKit/Data/Management_DataSet_Classification.csv", 
                 sep=",",
                 skip=1,
                 col.names=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V"),
                 nrows=1000)
message("DataFrame is: ", df)

DataFrame is: 1:1000c(7633, 7215, 7110, 1626, 6404, 9570, 8552, 1434, 9545, 7899, 2696, 8123, 1278, 4195, 5897, 9250, 4625, 9079, 1032, 9136, 9569, 3058, 338, 8629, 1863, 5815, 3232, 72, 269, 3466, 2521, 9982, 8550, 3127, 2199, 5439, 8477, 1020, 8582, 9915, 4760, 1568, 8699, 9854, 9324, 2202, 5173, 9671, 56, 181, 1121, 2803, 6568, 4624, 174, 7285, 9355, 5812, 3784, 8548, 3339, 5607, 8882, 4612, 38, 975, 111, 1041, 6854, 8828, 5333, 1235, 5668, 383, 2716, 663, 7456, 8938, 9238, 3212, 222, 9762, 6365, 9895, 9849, 9476, 
6609, 6903, 5132, 8339, 1221, 4241, 8871, 182, 5063, 8143, 4715, 274, 7871, 3162, 6403, 2376, 1542, 4738, 946, 1035, 3765, 3335, 7788, 6544, 5387, 731, 1357, 8484, 5760, 6700, 5253, 6827, 8962, 1577, 3935, 2096, 7053, 7031, 4150, 8417, 6777, 7414, 2531, 2533, 6153, 3591, 3185, 4553, 9962, 5964, 7434, 4137, 9674, 7039, 1773, 3989, 4562, 4805, 1954, 6543, 4517, 7576, 9678, 243, 8080, 7252, 2276, 8077, 3196, 3877, 1249, 8797, 6825, 1329, 1070, 7218, 8122, 5447, 1051, 5524, 7

### Simple Random Sampling based Observational Study for Measuring Work Time
Columns "B" and "C" are the Response Variables in the data set, and columns "D" through "T" are the Explanatory variables. "B" and "C" are matched-pairs samples obtained from Cross-Sectional Studies that measure the Work Time of each corporate resource.

### Testing If The Two Data Vectors Come From The Same Distribution
The two-sample Kolmogorov-Smirnov test is used to test whether two data vectors come from the same distribution.

In [22]:
ks.test(df[["B"]], df[["C"]])

"p-value will be approximate in the presence of ties"


	Two-sample Kolmogorov-Smirnov test

data:  df[["B"]] and df[["C"]]
D = 0.046, p-value = 0.2406
alternative hypothesis: two-sided


Since the p-value (0.2406) is greater than alpha = 0.05 (= 1 - 0.95), where alpha is the level of significance for the test, the NULL hypothesis is not rejected. There is no evidence that the two data vectors, B and C, come from different distributions.

### Wilcoxon Matched-Pairs Signed-Ranks Test For Hypothesis Testing Of Population Medians
Since the normality assumption for the two distributions fails, a nonparametric procedure is needed. The Wilcoxon Matched-Pairs Signed-Ranks Test tests the equality of two population medians under dependent sampling. Here the two samples are dependent and there are 1000 matched pairs, so we can use this test to check whether the median of the paired differences, Md, is 0.

Step #1: Null hypothesis: H0: Md = 0.  
         Alternate hypothesis: H1: Md is not equal to 0 (two-tailed test, the default of wilcox.test) (Population Median Mx differs from Population Median My)  
         
Step #2: The level of significance is alpha = 0.05 or 5%.   
 
Step #3: Compute the differences in the matched-pairs observations. Rank the absolute value of all sample differences from smallest to largest after discarding those differences that equal 0. Handle ties by finding the mean of the ranks for tied values. Assign negative values to the ranks where the differences are negative and positive values to the ranks where the differences are positive. Find the sum of the positive ranks, T+ and the sum of the negative ranks, T-.  

Compute the test statistic using T+ and T- for the large-sample case, since n > 30 (here n = 1000).   
  
Step #4: The level of significance, alpha, is used to determine the critical value. Here, alpha = 0.05 or 5%.   
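
Steps #3 and #4 can be sketched by hand as follows. This is an illustration under stated assumptions: `before` and `after` are simulated paired measurements, not columns of the management data, and the z statistic uses the plain large-sample normal approximation without the tie and continuity corrections that R applies.

```r
set.seed(11)
before <- round(rnorm(50, mean = 100, sd = 10))          # hypothetical first measurement
after  <- before + round(rnorm(50, mean = 1, sd = 5))    # hypothetical paired second measurement

d <- before - after
d <- d[d != 0]               # discard differences that equal 0
n <- length(d)
r <- rank(abs(d))            # ranks of |d|; ties receive the mean of their ranks

T_plus  <- sum(r[d > 0])     # sum of the ranks of the positive differences
T_minus <- sum(r[d < 0])     # sum of the ranks of the negative differences

# Large-sample normal approximation for T+ (no tie or continuity correction).
mu_T    <- n * (n + 1) / 4
sigma_T <- sqrt(n * (n + 1) * (2 * n + 1) / 24)
z <- (T_plus - mu_T) / sigma_T

# wilcox.test with paired = TRUE reports V, which is T+.
res <- wilcox.test(before, after, paired = TRUE, exact = FALSE)
c(T_plus = T_plus, T_minus = T_minus, V = unname(res$statistic), z = z)
```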

In [23]:
wilcox.test(df[["B"]], df[["C"]], paired=TRUE)


	Wilcoxon signed rank test with continuity correction

data:  df[["B"]] and df[["C"]]
V = 265160, p-value = 0.1028
alternative hypothesis: true location shift is not equal to 0


### Interpretation
Here is an explanation of the output. The test function first shows the test statistic (V = 265160) and the p-value for the statistic (0.1028). Since the p-value is greater than alpha = 0.05, the Wilcoxon matched-pairs signed-ranks test finds that the difference between the two groups is not statistically significant at the 5% level of significance and the NULL hypothesis is not rejected.

There is no evidence that the median of population df[["B"]] differs from the median of population df[["C"]].
