# Statistical Testing

### 1. Is there a statistically significant difference between the median GPA of the group that graduates and the group that does not?

Null Hypothesis: There is no difference in the medians of GPA between the group that graduates and the group that does not.

Alternative Hypothesis: The median GPA of the group that graduates is different from the median GPA of the group that does not.

We will use the Wilcoxon Rank-Sum test to find the p-value.

In [151]:
# Import numpy and pandas
import numpy as np
import pandas as pd
# Import scipy.stats to access test functions
from scipy import stats

In [152]:
# Read the collegeData.csv file
df_college = pd.read_csv('collegeData.csv')
df_college.head()

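|   | SexCode | MaritalCode | PrevEdCode | DDVeteran | DaysEnrollToStart | AgeAtStart | AgeAtGrad | GPA | MinutesAttended | HoursAttempt | HoursEarned | HoursReq | MinutesAbsent | TransferCredits | TransferGPA | MinEFC | MaxENTEntranceScore | gradFlag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | M | BACH | 0 | 55 | 24 | 27 | 3.22 | 145953 | 2925.0 | 2550.0 | 2565 | 3475 | 19.0 | 2.55 | 0.0 | 81.0 | 1 |
| 1 | F | M | BACH | 0 | 143 | 22 | 25 | 3.02 | 129045 | 2640.0 | 2565.0 | 2565 | 11840 | 12.0 | NaN | 0.0 | 89.5 | 1 |
| 2 | F | S | BACH | 0 | 98 | 30 | 33 | 3.47 | 111385 | 2559.0 | 2514.0 | 2565 | 935 | 37.67 | 2.84 | 0.0 | NaN | 1 |
| 3 | F | UN | BACH | 0 | 101 | 24 | 27 | 3.19 | 135401 | 2520.0 | 2520.0 | 2565 | 4549 | 6.0 | NaN | 0.0 | 87.5 | 1 |
| 4 | M | NaN | SOMECOLL | 0 | 61 | 19 | 22 | 3.84 | 115660 | 2520.0 | 2520.0 | 2565 | 1340 | 22.0 | NaN | 3141.0 | NaN | 1 |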


In [153]:
# Checking collegeData.csv types
df_college.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2784 entries, 0 to 2783
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   SexCode              2784 non-null   object 
 1   MaritalCode          1796 non-null   object 
 2   PrevEdCode           2784 non-null   object 
 3   DDVeteran            2784 non-null   int64  
 4   DaysEnrollToStart    2784 non-null   int64  
 5   AgeAtStart           2784 non-null   int64  
 6   AgeAtGrad            2784 non-null   int64  
 7   GPA                  2784 non-null   float64
 8   MinutesAttended      2784 non-null   int64  
 9   HoursAttempt         2784 non-null   float64
 10  HoursEarned          2784 non-null   float64
 11  HoursReq             2784 non-null   int64  
 12  MinutesAbsent        2784 non-null   int64  
 13  TransferCredits      2066 non-null   float64
 14  TransferGPA          1500 non-null   float64
 15  MinEFC               2743 non-null   f

In [154]:
# Create variable containing grad students GPA
grad_GPA = df_college.loc[df_college['gradFlag'] == 1, 'GPA'].dropna()
grad_GPA.head()

0    3.22
1    3.02
2    3.47
3    3.19
4    3.84
Name: GPA, dtype: float64

In [155]:
# Create variable containing nongrad students GPA
nongrad_GPA = df_college.loc[df_college['gradFlag'] == 0, 'GPA'].dropna()
nongrad_GPA

7       2.64
10      0.00
12      2.51
16      1.90
20      3.35
        ... 
2773    2.67
2775    2.82
2777    2.85
2780    2.50
2782    3.19
Name: GPA, Length: 944, dtype: float64

In [156]:
# Use Wilcoxon Rank-Sum test to compare medians
statistic_GPA, p_value_GPA = stats.ranksums(grad_GPA, nongrad_GPA)
print(statistic_GPA, p_value_GPA)

38.12823902794451 0.0


P-value < 0.05 (it prints as 0.0 because it is smaller than floating point can represent). At the 5% significance level, we reject the null hypothesis and conclude that there is sufficient evidence to suggest that the median GPAs of the two groups are different.
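The printed p-value of 0.0 is a floating-point artifact rather than an exact zero. As a rough sketch, assuming `stats.ranksums` derives its two-sided p-value from the normal approximation of the reported z statistic, the p-value can be recovered on a log scale with `scipy.stats.norm.logsf`. The variable names below are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy import stats

# z statistic copied from the stats.ranksums output above
z = 38.12823902794451

# Two-sided p-value on the natural-log scale: log(2) + log(P(Z > |z|))
log_p = np.log(2) + stats.norm.logsf(abs(z))

# Convert to base 10 to read off the order of magnitude
log10_p = log_p / np.log(10)
print(f"log10(p) is approximately {log10_p:.1f}")
```

Working on the log scale avoids the underflow, so the order of magnitude can be reported instead of 0.0.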

### 2. Is there a statistically significant difference between the mean Age At Start of the group that graduates and the group that does not?

Null Hypothesis: There is no difference in the means of Age At Start between the group that graduates and the group that does not.

Alternative Hypothesis: The mean Age At Start of the group that graduates is different from the mean Age At Start of the group that does not.

We will use the independent t-test to find the p-value.

In [162]:
# Create variable containing grad students age at start
grad_age = df_college.loc[df_college['gradFlag'] == 1, 'AgeAtStart'].dropna()
grad_age.head()

0    24
1    22
2    30
3    24
4    19
Name: AgeAtStart, dtype: int64

In [163]:
# Create variable containing nongrad students age at start
nongrad_age = df_college.loc[df_college['gradFlag'] == 0, 'AgeAtStart'].dropna()
nongrad_age.head()

7     20
10    23
12    23
16    18
20    27
Name: AgeAtStart, dtype: int64

In [164]:
# Use independent t-tests to compare the means
stats.ttest_ind(grad_age, nongrad_age, equal_var=False)

TtestResult(statistic=1.2551304992860344, pvalue=0.20960262234043067, df=1712.65295996541)

P-value > 0.05. At the 5% significance level, we do not reject the null hypothesis: there is insufficient evidence to suggest that the mean Age At Start differs between the two groups.
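The call above passes `equal_var=False`, which selects Welch's t-test rather than the pooled-variance version. The sketch below reproduces the Welch statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with SciPy. It uses synthetic data, not the values in `collegeData.csv`, and all variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic ages with unequal spread; NOT the collegeData.csv values
rng = np.random.default_rng(0)
a = rng.normal(loc=25, scale=6, size=200)
b = rng.normal(loc=25, scale=9, size=120)

# Sample variance (ddof=1) of each group divided by its sample size
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)
df_manual = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# Two-sided p-value from the t distribution with df_manual degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, p_manual, df_manual)
print(t_scipy, p_scipy)
```

The non-integer degrees of freedom explain why the notebook output reports a fractional `df` value.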

### 3. Is there a statistically significant difference between the median Transfer GPA of the group that graduates and the group that does not?

Null Hypothesis: There is no difference in the medians of Transfer GPA between the group that graduates and the group that does not.

Alternative Hypothesis: The median Transfer GPA of the group that graduates is different from the median Transfer GPA of the group that does not.

We will use the Wilcoxon Rank-Sum test to find the p-value.

In [170]:
# Create variable containing grad students transfer GPA
grad_tGPA = df_college.loc[df_college['gradFlag'] == 1, 'TransferGPA'].dropna()
grad_tGPA.head()

0     2.55
2     2.84
14    2.09
15    3.46
33    3.03
Name: TransferGPA, dtype: float64

In [171]:
# Create variable containing nongrad students transfer GPA
nongrad_tGPA = df_college.loc[df_college['gradFlag'] == 0, 'TransferGPA'].dropna()
nongrad_tGPA.head()

7     3.00
10    2.36
20    2.66
22    2.41
31    2.08
Name: TransferGPA, dtype: float64

In [172]:
# Use Wilcoxon Rank-Sum test to compare medians
statistic_tGPA, p_value_tGPA = stats.ranksums(grad_tGPA, nongrad_tGPA)
print(statistic_tGPA, p_value_tGPA)

7.572647069553644 3.656948357918701e-14


P-value < 0.05. At the 5% significance level, we reject the null hypothesis and conclude that there is sufficient evidence to suggest that the median Transfer GPAs of the two groups are different.
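The first number printed by `stats.ranksums` is a z-score built from the ranks of the pooled sample. A minimal sketch of that calculation follows, assuming the standard large-sample formula without a tie correction, which appears to be what `stats.ranksums` implements. The data are synthetic and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic GPA-like samples; NOT the collegeData.csv values
rng = np.random.default_rng(1)
x = rng.normal(3.0, 0.5, size=60)
y = rng.normal(2.7, 0.6, size=40)
n1, n2 = x.size, y.size

# Rank the pooled data and sum the ranks belonging to the first sample
ranks = stats.rankdata(np.concatenate([x, y]))
w = ranks[:n1].sum()

# Mean and standard deviation of the rank sum under the null hypothesis
expected_w = n1 * (n1 + n2 + 1) / 2
sd_w = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

# Standardized statistic and two-sided p-value from the normal approximation
z_manual = (w - expected_w) / sd_w
p_manual = 2 * stats.norm.sf(abs(z_manual))

z_scipy, p_scipy = stats.ranksums(x, y)
print(z_manual, p_manual)
print(z_scipy, p_scipy)
```

A positive z-score means the first sample tends to hold the higher ranks, which gives the direction of the difference as well as its significance.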

### 4. Is there a statistically significant difference between the mean Transfer Credits of the group that graduates and the group that does not?

Null Hypothesis: There is no difference in the means of Transfer Credits between the group that graduates and the group that does not.

Alternative Hypothesis: The mean Transfer Credits of the group that graduates is different from the mean Transfer Credits of the group that does not.

We will use the independent t-test to find the p-value.

In [178]:
# Create variable containing grad students Transfer Credits
grad_tCredits = df_college.loc[df_college['gradFlag'] == 1, 'TransferCredits'].dropna()
grad_tCredits.head()

0    19.00
1    12.00
2    37.67
3     6.00
4    22.00
Name: TransferCredits, dtype: float64

In [179]:
# Create variable containing nongrad students Transfer Credits
nongrad_tCredits = df_college.loc[df_college['gradFlag'] == 0, 'TransferCredits'].dropna()
nongrad_tCredits.head()

7      9.00
10    26.00
12     3.00
20    42.00
22    20.01
Name: TransferCredits, dtype: float64

In [180]:
# Use independent t-tests to compare the means
stats.ttest_ind(grad_tCredits, nongrad_tCredits, equal_var=False)

TtestResult(statistic=8.034523946203917, pvalue=2.4435242045889162e-15, df=1080.8508021909931)

P-value < 0.05. At the 5% significance level, we reject the null hypothesis and conclude that there is sufficient evidence to suggest that the mean Transfer Credits of the two groups are different.
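A p-value says that the means differ but not by how much. One possible complement, sketched here on synthetic data rather than the notebook's values, is a 95% confidence interval for the difference in means built from the same Welch standard error and degrees of freedom. All names below are illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic transfer-credit-like samples; NOT the collegeData.csv values
rng = np.random.default_rng(2)
a = rng.gamma(shape=2.0, scale=12.0, size=300)
b = rng.gamma(shape=2.0, scale=9.0, size=150)

# Difference in sample means and its Welch standard error
diff = a.mean() - b.mean()
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size
se = np.sqrt(va + vb)

# Welch-Satterthwaite degrees of freedom
df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# 95% confidence interval for the difference in means
t_crit = stats.t.ppf(0.975, df)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

_, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"difference = {diff:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The interval excludes zero exactly when the two-sided Welch test rejects at the 5% level, so the two summaries agree on the decision.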

### 5. Determine if there is any association between successful graduation and gender.

Null Hypothesis: Successful graduation and gender are not associated.

Alternative Hypothesis: Successful graduation and gender are associated.

We use the Chi-Square test to find the p-value.

In [186]:
# Create a contingency table for successful graduation and gender
contingency_table_g = pd.crosstab(df_college['gradFlag'] == 1, df_college['SexCode'])
print(contingency_table_g)
# Use the Chi-Square test function to find the p-value
stats.chi2_contingency(contingency_table_g)

SexCode      F    M
gradFlag           
False      655  289
True      1418  422


Chi2ContingencyResult(statistic=18.94778173057208, pvalue=1.3434539080515045e-05, dof=1, expected_freq=array([[ 702.9137931,  241.0862069],
       [1370.0862069,  469.9137931]]))

P-value < 0.05. At the 5% significance level, we reject the null hypothesis and conclude that there is sufficient evidence to suggest that successful graduation and gender are associated.
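The expected frequencies and the test statistic in the output above can be reproduced by hand from the printed contingency table. For a 2x2 table, `stats.chi2_contingency` applies Yates' continuity correction by default, which is why the statistic is slightly smaller than the uncorrected Pearson value. The variable names in this sketch are illustrative.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table printed above
# rows: did not graduate, graduated; columns: F, M
observed = np.array([[655, 289],
                     [1418, 422]])

# Expected count for each cell = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson statistic without and with Yates' continuity correction
chi2_plain = ((observed - expected) ** 2 / expected).sum()
chi2_yates = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

# p-value with (rows - 1) * (columns - 1) = 1 degree of freedom
p_yates = stats.chi2.sf(chi2_yates, df=1)

print(expected)
print(chi2_plain, chi2_yates, p_yates)
```

Passing `correction=False` to `stats.chi2_contingency` would return the uncorrected statistic instead.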

### 6. Determine if there is any association between successful graduation and marital status.

Null Hypothesis: Successful graduation and marital status are not associated.

Alternative Hypothesis: Successful graduation and marital status are associated.

We use the Chi-Square test to find the p-value.

In [192]:
# Create a contingency table for successful graduation and marital status
contingency_table_m = pd.crosstab(df_college['gradFlag'] == 1, df_college['MaritalCode'])
print(contingency_table_m)
# Use the Chi-Square test function to find the p-value
stats.chi2_contingency(contingency_table_m)

MaritalCode   D    M   P    S   UN
gradFlag                          
False         9   38   3  171  321
True         33  183  14  554  470


Chi2ContingencyResult(statistic=76.0543852268735, pvalue=1.1922501823455327e-15, dof=4, expected_freq=array([[ 12.67483296,  66.69376392,   5.13028953, 218.79175947,
        238.70935412],
       [ 29.32516704, 154.30623608,  11.86971047, 506.20824053,
        552.29064588]]))

P-value < 0.05. At the 5% significance level, we reject the null hypothesis and conclude that there is sufficient evidence to suggest that successful graduation and marital status are associated.
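The counts in the marital status table sum to 1796, matching the non-null count for `MaritalCode` in `df_college.info()`, so the rows with a missing code are left out of this test. The sketch below shows, on a small hypothetical frame (`toy` is not the real dataset), how `pd.crosstab` drops missing values and how they could be kept as their own category if that were wanted.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NOT the collegeData.csv values
toy = pd.DataFrame({
    "gradFlag": [1, 1, 0, 0, 1, 0],
    "MaritalCode": ["M", np.nan, "S", np.nan, "S", "UN"],
})

# Default behaviour: rows with a missing MaritalCode are excluded
dropped = pd.crosstab(toy["gradFlag"] == 1, toy["MaritalCode"])

# Alternative: label missing codes explicitly so every row is counted
kept = pd.crosstab(toy["gradFlag"] == 1, toy["MaritalCode"].fillna("Missing"))

print(dropped)
print(kept)
```

Whether a separate "Missing" category is appropriate depends on why the codes are absent, which the notebook does not say.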

### 7. Determine if there is any association between successful graduation and previous education.

Null Hypothesis: Successful graduation and previous education are not associated.

Alternative Hypothesis: Successful graduation and previous education are associated.

We use the Chi-Square test to find the p-value.

In [198]:
# Create a contingency table for successful graduation and previous education
contingency_table_p = pd.crosstab(df_college['gradFlag'] == 1, df_college['PrevEdCode'])
print(contingency_table_p)
# Use the Chi-Square test function to find the p-value
stats.chi2_contingency(contingency_table_p)

PrevEdCode  ASSOC  BACH  GED   HS  MAST  POSTHS  SOMECOLL  UN
gradFlag                                                     
False         103    42    4  138     2       2       647   6
True          260   432    1   94    15       0      1036   2


Chi2ContingencyResult(statistic=239.1952510794302, pvalue=5.51109634720097e-48, dof=7, expected_freq=array([[1.23086207e+02, 1.60724138e+02, 1.69540230e+00, 7.86666667e+01,
        5.76436782e+00, 6.78160920e-01, 5.70672414e+02, 2.71264368e+00],
       [2.39913793e+02, 3.13275862e+02, 3.30459770e+00, 1.53333333e+02,
        1.12356322e+01, 1.32183908e+00, 1.11232759e+03, 5.28735632e+00]]))

P-value < 0.05. At the 5% significance level, we reject the null hypothesis and conclude that there is sufficient evidence to suggest that successful graduation and previous education are associated.
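The chi-square approximation is usually considered reliable when expected counts are not too small, with 5 per cell being a common rule of thumb. Several expected frequencies in the output above fall below that. The sketch below counts them using the observed table printed above. The threshold and variable names are illustrative choices, not part of the original analysis.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the previous education table printed above
# columns: ASSOC, BACH, GED, HS, MAST, POSTHS, SOMECOLL, UN
observed = np.array([
    [103, 42, 4, 138, 2, 2, 647, 6],     # did not graduate
    [260, 432, 1, 94, 15, 0, 1036, 2],   # graduated
])

chi2, p, dof, expected = stats.chi2_contingency(observed)

# Flag cells whose expected count is below the rule-of-thumb threshold
small = expected < 5
print(f"{int(small.sum())} of {expected.size} cells have expected count < 5")
print("columns affected:", np.where(small.any(axis=0))[0])
```

With an effect this large the conclusion is unlikely to change, but merging sparse categories such as GED, POSTHS, and UN before testing is one way to check that.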