## Student Score dataset

### Objectives:
1. Carry out descriptive analytics on the dataset.
2. Check whether female and male students scored the same marks on average.
3. Check whether test preparation helps students score higher.

### Importing relevant libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

### Reading the dataset

In [2]:
df = pd.read_csv('Data/Week_8_Q&A_dataset - Sheet1.csv')
df.head()

Unnamed: 0,Gender,Test preparation,Total Marks
0,male,none,14
1,female,none,28
2,female,none,18
3,female,none,48
4,female,none,21


### 1. Performing descriptive analytics

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28 entries, 0 to 27
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Gender            28 non-null     object
 1   Test preparation  28 non-null     object
 2   Total Marks       28 non-null     int64 
dtypes: int64(1), object(2)
memory usage: 800.0+ bytes


In [4]:
df.describe()

Unnamed: 0,Total Marks
count,28.0
mean,32.321429
std,12.45452
min,12.0
25%,22.5
50%,33.0
75%,43.25
max,50.0


#### Insights:
- The dataset contains 28 rows and 3 columns
- The dataset has no null values
- The average mark in the sample is 32.32, with a standard deviation of 12.45
- Marks range from 12 to 50

### 2. Testing male and female marks 

In [5]:
df_male = df.loc[df['Gender'] == 'male']
df_male.head()

Unnamed: 0,Gender,Test preparation,Total Marks
0,male,none,14
6,male,none,30
8,male,none,18
9,male,none,24
10,male,completed,44


In [6]:
df_male = df_male.reset_index(drop=True)

In [7]:
df_male.head()

Unnamed: 0,Gender,Test preparation,Total Marks
0,male,none,14
1,male,none,30
2,male,none,18
3,male,none,24
4,male,completed,44


In [8]:
df_female = df.loc[df['Gender'] == 'female'].reset_index(drop=True)
df_female.head()

Unnamed: 0,Gender,Test preparation,Total Marks
0,female,none,28
1,female,none,18
2,female,none,48
3,female,none,21
4,female,completed,40


In [9]:
df_female.describe()

Unnamed: 0,Total Marks
count,16.0
mean,33.75
std,12.865976
min,12.0
25%,25.5
50%,39.0
75%,43.5
max,50.0


In [10]:
df_male.describe()

Unnamed: 0,Total Marks
count,12.0
mean,30.416667
std,12.16895
min,14.0
25%,21.75
50%,29.0
75%,43.25
max,47.0


#### Insights:
- The dataset is split based on gender into df_male and df_female
- There are 16 female and 12 male students in the dataset
- The average mark of female students is 33.75, with a standard deviation of 12.87
- The average mark of male students is 30.42, with a standard deviation of 12.17
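The same per-group summary can also be obtained without creating separate DataFrames, using a single `groupby`. This is a minimal sketch, assuming the column names shown in `df.head()`. The `sample` frame holds only the five rows displayed by `df.head()`, not the full 28-row dataset, so its numbers differ from the tables above.

```python
import pandas as pd

# Only the five rows shown by df.head(); the full CSV is not reproduced here.
sample = pd.DataFrame({
    "Gender": ["male", "female", "female", "female", "female"],
    "Test preparation": ["none"] * 5,
    "Total Marks": [14, 28, 18, 48, 21],
})

# One row of summary statistics per gender, computed in a single pass.
by_gender = sample.groupby("Gender")["Total Marks"].agg(["count", "mean", "std"])
print(by_gender)
```

On the real data, replacing `sample` with `df` should give the counts, means, and standard deviations reported by the two `describe()` calls.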

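 *Because both groups are small (n < 30) and the population variance is unknown, we use an independent two-sample t-test. Note that `stats.ttest_ind` assumes equal variances in the two groups by default (`equal_var=True`).*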

In [11]:
# Significance level used for both hypothesis tests
alpha = 0.05

#### Hypothesis:
- $H_{0}$ : The average score of female and male students is equal
- $H_{a}$ : The average scores are not equal
- $\alpha$ : 0.05

We perform a two-tailed independent-samples t-test to evaluate this hypothesis.
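For reference, here is the statistic being computed, assuming the default equal-variance (pooled) form of the test. The subscripts $f$ and $m$ (female, male) are notation introduced here for illustration; $\bar{x}$, $s$ and $n$ denote each group's mean, standard deviation and size.

$$
s_p^2 = \frac{(n_f - 1)\,s_f^2 + (n_m - 1)\,s_m^2}{n_f + n_m - 2},
\qquad
t = \frac{\bar{x}_f - \bar{x}_m}{s_p \sqrt{\dfrac{1}{n_f} + \dfrac{1}{n_m}}}
$$

Plugging in the values from the `describe()` tables ($n_f = 16$, $\bar{x}_f = 33.75$, $s_f = 12.87$; $n_m = 12$, $\bar{x}_m = 30.42$, $s_m = 12.17$) gives $s_p \approx 12.58$ and $t \approx 3.33 / 4.80 \approx 0.694$, with $n_f + n_m - 2 = 26$ degrees of freedom.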

In [12]:
t_value, p_value = stats.ttest_ind(df_female['Total Marks'], df_male['Total Marks'], alternative='two-sided')

In [13]:
print("The t value of means is: ", round(t_value, 3))
print("The p value based on the t score is: ", round(p_value, 3))

The t value of means is:  0.694
The p value based on the t score is:  0.494


#### Insights:
- For a t-value of 0.694, we get a p-value of 0.494
- The p-value is well above $\alpha$ = 0.05
- So we fail to reject the null hypothesis: the data show no statistically significant difference between the average scores of female and male students
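The result can be cross-checked from the summary statistics alone. The sketch below uses `stats.ttest_ind_from_stats` with the means, standard deviations, and counts copied from the `describe()` outputs, then applies the decision rule explicitly. The variable names `t_stat`, `p_val`, and `decision` are illustrative and not part of the original notebook.

```python
from scipy import stats

alpha = 0.05

# Summary statistics copied from the describe() outputs above.
t_stat, p_val = stats.ttest_ind_from_stats(
    mean1=33.75, std1=12.865976, nobs1=16,      # female students
    mean2=30.416667, std2=12.16895, nobs2=12,   # male students
    equal_var=True,                             # pooled-variance (Student's) t-test
    alternative="two-sided",
)

# Decision rule: reject H0 only when the p-value falls below alpha.
decision = "reject H0" if p_val < alpha else "fail to reject H0"
print(round(t_stat, 3), round(p_val, 3), decision)
```

Because the inputs are rounded summary statistics, the output should agree with the values computed from the raw data up to rounding.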

### 3. Usefulness of test preparation

In [14]:
df_course_yes = df.loc[df['Test preparation'] == 'completed'].reset_index(drop=True)
df_course_yes

Unnamed: 0,Gender,Test preparation,Total Marks
0,female,completed,40
1,male,completed,44
2,male,completed,43
3,female,completed,48
4,male,completed,44
5,female,completed,50
6,female,completed,43


In [15]:
df_course_no = df.loc[df['Test preparation'] == 'none'].reset_index(drop=True)
df_course_no.head()

Unnamed: 0,Gender,Test preparation,Total Marks
0,male,none,14
1,female,none,28
2,female,none,18
3,female,none,48
4,female,none,21


In [16]:
df_course_yes.describe()

Unnamed: 0,Total Marks
count,7.0
mean,44.571429
std,3.359422
min,40.0
25%,43.0
50%,44.0
75%,46.0
max,50.0


In [17]:
df_course_no.describe()

Unnamed: 0,Total Marks
count,21.0
mean,28.238095
std,11.661495
min,12.0
25%,18.0
50%,27.0
75%,38.0
max,48.0


#### Insights:
- Seven students completed the test preparation course; 21 did not
- The average mark of students who completed the course is 44.57, with a standard deviation of 3.36
- The average mark of students who did not complete the course is 28.24, with a standard deviation of 11.66

#### Hypothesis:
- $H_{0}$ : The average scores of students who completed the course and of those who did not are equal
- $H_{a}$ : The average score of students who completed the course is higher
- Significance level $\alpha$ : 0.05

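We perform a one-tailed (right-tailed) independent-samples t-test, because $H_{a}$ states that students who completed the course score higher.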

In [18]:
t_value_2, p_value_2 = stats.ttest_ind(df_course_yes['Total Marks'], df_course_no['Total Marks'], alternative='greater')

In [19]:
print("The t value is: ", round(t_value_2, 3))
print("The p value is: ", round(p_value_2, 5))

The t value is:  3.614
The p value is:  0.00063


#### Insights:
- For a t-value of 3.614, we get a p-value of 0.00063
- The p-value is well below $\alpha$ = 0.05
- So we reject the null hypothesis that the average scores are equal
- Students who completed the course have a significantly higher average score
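The two groups have quite different spreads (standard deviation 3.36 versus 11.66), while the pooled t-test assumes equal variances. As a robustness check that is not part of the original analysis, Welch's t-test can be run from the same summary statistics. With the raw data, the equivalent would be passing `equal_var=False` to `stats.ttest_ind`. The names `t_welch` and `p_welch` are illustrative.

```python
from scipy import stats

alpha = 0.05

# Summary statistics copied from the describe() outputs above.
t_welch, p_welch = stats.ttest_ind_from_stats(
    mean1=44.571429, std1=3.359422, nobs1=7,     # completed the course
    mean2=28.238095, std2=11.661495, nobs2=21,   # did not complete the course
    equal_var=False,                             # Welch's test: no equal-variance assumption
    alternative="greater",                       # H_a: completers score higher
)

print("Welch t:", round(t_welch, 3))
print("Reject H0 at alpha = 0.05:", p_welch < alpha)
```

If Welch's test and the pooled test lead to the same decision, the conclusion does not hinge on the equal-variance assumption.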

#### Conclusion: 
> The test preparation course is useful: students who completed it scored significantly higher on average in this sample.