# Statistical Analysis of Scores

Goal:

Determine whether the difference in test scores between two groups of students is statistically significant.

The T-test is well known in the field of statistics. It is used to test a hypothesis about a population using a sample of data drawn from that population.
To perform the two-sample T-test, the sample size of each group, the mean (average) of each group, and the pooled standard deviation are all required.

The statistical variance of a dataset measures how far the values in the dataset spread out from their mean. It is the average of the squared differences between each value and the mean.

The pooled standard deviation is the square root of the pooled variance, which combines the variances of the two groups into a single estimate. It measures how much the values in each group typically vary from the mean of their own group.
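As a minimal sketch of these two definitions, the snippet below computes the sample variance of two small made-up arrays and combines them into a pooled standard deviation. The arrays `a` and `b` are illustrative values, not taken from `scores.csv`.

```python
import numpy as np

# Two small made-up samples (illustrative only, not from scores.csv)
a = np.array([2.0, 4.0, 6.0])
b = np.array([1.0, 3.0, 5.0, 7.0])

# Sample variance: sum of squared deviations from the mean, divided by n - 1
var_a = np.var(a, ddof=1)
var_b = np.var(b, ddof=1)

# Pooled variance: average of the two variances, weighted by (size - 1)
pooled_var = ((a.size - 1) * var_a + (b.size - 1) * var_b) / (a.size + b.size - 2)

# Pooled standard deviation: square root of the pooled variance
pooled_sd = np.sqrt(pooled_var)

print(var_a, var_b, pooled_var, pooled_sd)
```

Passing `ddof=1` matters here: NumPy divides by `n` by default, while the sample variance used in the T-test divides by `n - 1`.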

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import os

In [2]:
df = pd.read_csv("scores.csv")
df.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


In [3]:
df.size

8000

In [4]:
df.shape

(1000, 8)

In [5]:
df.columns

Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='object')

In [6]:
df.isna().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

## Task 1:

Analyze the T-test problem using the data from the CSV.

### Null Hypothesis:

There is no difference in math test scores between males and females.

In [7]:
# filter the DataFrame to the female rows, keeping the gender and math score columns (the index is kept too, since the result is still a DataFrame)
female_scores = df[df['gender'] == 'female'][['gender','math score']]
female_scores

Unnamed: 0,gender,math score
0,female,72
1,female,69
2,female,90
5,female,71
6,female,88
...,...,...
993,female,62
995,female,88
997,female,59
998,female,68


In [8]:
# filtering the dataframe to get only male math scores and make it into a list
male_scores = df[df['gender'] == 'male']['math score'].values.tolist()
male_scores

[47,
 76,
 40,
 64,
 58,
 40,
 78,
 88,
 46,
 66,
 44,
 74,
 73,
 69,
 70,
 40,
 97,
 81,
 57,
 55,
 59,
 65,
 82,
 53,
 77,
 53,
 88,
 52,
 58,
 79,
 39,
 62,
 67,
 45,
 61,
 63,
 61,
 49,
 44,
 30,
 80,
 49,
 50,
 72,
 42,
 27,
 71,
 43,
 78,
 65,
 79,
 68,
 60,
 98,
 66,
 62,
 54,
 84,
 91,
 63,
 83,
 72,
 65,
 82,
 89,
 53,
 87,
 74,
 58,
 51,
 70,
 71,
 57,
 88,
 88,
 73,
 100,
 62,
 77,
 54,
 62,
 60,
 66,
 82,
 49,
 52,
 53,
 72,
 94,
 62,
 45,
 65,
 80,
 62,
 48,
 76,
 77,
 61,
 59,
 55,
 69,
 59,
 74,
 82,
 81,
 80,
 35,
 60,
 87,
 84,
 66,
 61,
 87,
 86,
 57,
 68,
 76,
 46,
 92,
 83,
 80,
 63,
 54,
 84,
 73,
 59,
 75,
 85,
 89,
 68,
 47,
 80,
 54,
 78,
 79,
 76,
 59,
 69,
 58,
 88,
 83,
 73,
 53,
 45,
 81,
 97,
 88,
 77,
 76,
 86,
 63,
 78,
 67,
 46,
 71,
 40,
 90,
 81,
 56,
 80,
 69,
 99,
 51,
 66,
 67,
 71,
 83,
 63,
 61,
 28,
 82,
 71,
 47,
 62,
 90,
 76,
 49,
 58,
 67,
 79,
 62,
 75,
 87,
 66,
 63,
 59,
 85,
 59,
 49,
 69,
 61,
 84,
 74,
 46,
 66,
 87,
 79,
 73,
 73,
 76,

In [9]:
# this will become a np array
fm_sc = df.loc[df['gender'] == 'female', 'math score'].values
fm_sc

array([ 72,  69,  90,  71,  88,  38,  65,  50,  69,  18,  54,  65,  69,
        67,  62,  69,  63,  56,  74,  50,  75,  58,  53,  50,  55,  66,
        57,  71,  33,  82,   0,  69,  59,  60,  39,  58,  41,  61,  62,
        47,  73,  76,  71,  58,  73,  65,  79,  63,  58,  65,  85,  58,
        87,  52,  70,  77,  51,  99,  75,  78,  51,  55,  79,  88,  87,
        51,  75,  59,  76,  59,  42,  22,  68,  59,  70,  66,  61,  75,
        81,  96,  58,  68,  67,  79,  63,  43,  81,  46,  71,  52,  97,
        46,  50,  65,  77,  66,  62,  69,  45,  78,  67,  65,  57,  74,
        58,  42,  83,  34,  56,  55,  52,  45,  72,  88,  67,  64,  80,
        56,  58,  65,  71,  60,  62,  64,  70,  65,  64,  44,  99,  63,
        69,  88,  71,  47,  65,  85,  59,  65,  73,  70,  37,  67,  65,
        67,  74,  53,  49,  73,  68,  59,  77,  56,  67,  75,  71,  43,
        41,  82,  41,  83,  61,  24,  35,  61,  69,  72,  77,  52,  63,
        46,  59,  61,  42,  80,  58,  52,  27,  44,  73,  45,  8

In [10]:
type(fm_sc)

numpy.ndarray

In [11]:
mm_sc = df.loc[df['gender'] == 'male', 'math score'].values
mm_sc

array([ 47,  76,  40,  64,  58,  40,  78,  88,  46,  66,  44,  74,  73,
        69,  70,  40,  97,  81,  57,  55,  59,  65,  82,  53,  77,  53,
        88,  52,  58,  79,  39,  62,  67,  45,  61,  63,  61,  49,  44,
        30,  80,  49,  50,  72,  42,  27,  71,  43,  78,  65,  79,  68,
        60,  98,  66,  62,  54,  84,  91,  63,  83,  72,  65,  82,  89,
        53,  87,  74,  58,  51,  70,  71,  57,  88,  88,  73, 100,  62,
        77,  54,  62,  60,  66,  82,  49,  52,  53,  72,  94,  62,  45,
        65,  80,  62,  48,  76,  77,  61,  59,  55,  69,  59,  74,  82,
        81,  80,  35,  60,  87,  84,  66,  61,  87,  86,  57,  68,  76,
        46,  92,  83,  80,  63,  54,  84,  73,  59,  75,  85,  89,  68,
        47,  80,  54,  78,  79,  76,  59,  69,  58,  88,  83,  73,  53,
        45,  81,  97,  88,  77,  76,  86,  63,  78,  67,  46,  71,  40,
        90,  81,  56,  80,  69,  99,  51,  66,  67,  71,  83,  63,  61,
        28,  82,  71,  47,  62,  90,  76,  49,  58,  67,  79,  6

We will now compute the pooled variance of the 2 arrays from the standard deviation of each array.

It is called the pooled variance because the 2 arrays have different sizes, so each array's variance is weighted by its size minus 1 before the two are combined.

I will take the square root of the pooled variance to get the pooled standard deviation, and use it to calculate the standard error that we need in order to compute the 2-sample T-test.

I will then divide the difference between the means by the standard error to compute the T-value.

I will compute the degrees of freedom and then compute the P value using the stats package.

I will also explain what the degrees of freedom mean and what the standard error means.
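Written out, the steps above correspond to the standard equal-variance (pooled) two-sample T-test. The symbols are notation introduced here for illustration: $n_1, n_2$ are the array sizes, $\bar{x}_1, \bar{x}_2$ the means, and $s_1, s_2$ the sample standard deviations.

```latex
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
\qquad
SE = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}
\qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{SE}
\qquad
\mathrm{df} = n_1 + n_2 - 2
```

The two-sided P value is then the probability that a T distribution with $\mathrm{df}$ degrees of freedom produces a value at least as far from zero as $|t|$.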

In [18]:
def compute_T_and_P_values(arra1, arra2):
    n1, n2 = arra1.size, arra2.size

    mean1, mean2 = np.mean(arra1), np.mean(arra2)

    # sample standard deviations (ddof=1 divides by n - 1 instead of n)
    std1, std2 = np.std(arra1, ddof=1), np.std(arra2, ddof=1)

    # pooled variance: the whole weighted sum is divided by (n1 + n2 - 2)
    var = ((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / (n1 + n2 - 2)

    # pooled standard deviation
    sp = np.sqrt(var)

    # standard error of the difference between the two means
    ste = sp * np.sqrt(1/n1 + 1/n2)

    # t value
    t = (mean1 - mean2) / ste

    # degrees of freedom
    # the degrees of freedom is the number of values that are free to vary once the mean is known
    # for example, if an array has 100 values and we know its mean, then once 99 of the values are given
    # the last one is fixed, because the total divided by 100 has to equal the mean
    # so for one array the degrees of freedom is size - 1, and because we have 2 arrays we subtract 2
    dof = n1 + n2 - 2

    # two-sided p value: use |t| so the result is valid for negative t as well
    p = 2 * stats.t.sf(abs(t), dof)

    return t, p

In [19]:
compute_T_and_P_values(mm_sc,fm_sc)
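One way to sanity-check a hand-rolled computation like this is to compare it against `scipy.stats.ttest_ind`, which performs the pooled two-sample test when `equal_var=True`. The sketch below uses synthetic arrays (`group_a` and `group_b`, generated with a fixed seed) rather than the scores data, and `pooled_t_test` is a hypothetical helper written only for this comparison.

```python
import numpy as np
from scipy import stats

def pooled_t_test(x, y):
    """Two-sample t-test assuming equal variances (hypothetical helper)."""
    n1, n2 = x.size, y.size
    # ddof=1 gives the sample variance (divides by n - 1)
    pooled_var = ((n1 - 1) * np.var(x, ddof=1)
                  + (n2 - 1) * np.var(y, ddof=1)) / (n1 + n2 - 2)
    ste = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (np.mean(x) - np.mean(y)) / ste
    dof = n1 + n2 - 2
    # Two-sided p value from the survival function of the t distribution
    p = 2 * stats.t.sf(abs(t), dof)
    return t, p

# Synthetic data, not the scores from the CSV
rng = np.random.default_rng(0)
group_a = rng.normal(loc=68, scale=14, size=48)
group_b = rng.normal(loc=64, scale=15, size=52)

t_manual, p_manual = pooled_t_test(group_a, group_b)
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=True)

# The two computations should agree up to floating-point error
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

If the two results disagree, the usual suspects are a misplaced parenthesis in the pooled variance, the population standard deviation (`ddof=0`) being used in place of the sample one, or a one-sided P value.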

In [20]:
def remove_outliers(arr):
    q3, q1 = np.percentile(arr,[75,25])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    arr = arr[(arr >= lower) & (arr <= upper)]


    return arr
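Boolean masking returns a new array and leaves the input untouched, so the result of a filter like this has to be assigned to a name to have any effect. A minimal sketch with a made-up array follows; `sample` and `iqr_filter` are illustrative names, and the function simply mirrors the 1.5 × IQR rule above.

```python
import numpy as np

def iqr_filter(arr):
    """Keep only values within 1.5 * IQR of the quartiles (illustrative)."""
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    return arr[(arr >= q1 - 1.5 * iqr) & (arr <= q3 + 1.5 * iqr)]

# Made-up values; 0 and 100 sit far from the rest
sample = np.array([0, 50, 52, 54, 56, 58, 60, 100])

# The filtered values live in the returned array, not in `sample`
filtered = iqr_filter(sample)

print(sample.size)    # the original array is unchanged
print(filtered.size)  # the filtered copy is shorter
```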

In [21]:
# remove_outliers returns a new array, so keep the results and pass them to the T-test
mm_clean = remove_outliers(mm_sc)
fm_clean = remove_outliers(fm_sc)
compute_T_and_P_values(mm_clean, fm_clean)

Which code snippet retrieves all of the Ford resale data from a DataFrame and stores it in a raw NumPy array?

- `df.loc[df['vehicle'] == 'ford', 'resale'].values` selects the `resale` column for the rows where `vehicle` is `'ford'` and returns it as a raw NumPy array.
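A minimal sketch of that selection on a made-up DataFrame is shown here. The `cars` frame and its values are hypothetical; only the column names follow the question.

```python
import pandas as pd

# Hypothetical data; the column names follow the question above
cars = pd.DataFrame({
    "vehicle": ["ford", "toyota", "ford", "honda"],
    "resale": [12000, 15000, 9500, 14000],
})

# The boolean mask selects the rows, the column label selects 'resale',
# and .values converts the resulting Series to a NumPy array
ford_resale = cars.loc[cars["vehicle"] == "ford", "resale"].values

print(type(ford_resale), ford_resale)
```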


What are the degrees of freedom in a dataset of size 100 and why?

- The degrees of freedom in a dataset of size 100 is 99. Once the mean is known, only 99 of the values are free to vary; the last value is fixed because all 100 values have to average to that mean.
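The same idea can be seen on a smaller scale. In this sketch, with made-up numbers, four of five values are chosen freely and the fifth is forced by the known mean.

```python
import numpy as np

# Made-up example: 5 values with a known mean of 10
n = 5
known_mean = 10.0
free_values = np.array([8.0, 12.0, 9.0, 11.0])  # any 4 values can be chosen freely

# The last value is forced: all n values must sum to n * known_mean
last_value = n * known_mean - free_values.sum()

full = np.append(free_values, last_value)
print(last_value, full.mean())
```

With `n` values and a fixed mean, `n - 1` of them carry independent information, which is why each array contributes its size minus 1 to the degrees of freedom.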

What does the pooled variance of two arrays represent?

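- The pooled variance is a weighted average of the variances of the two arrays, with each variance weighted by its array's size minus 1. It estimates the variance that the two groups share, that is, how far values vary from the mean of their own group.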
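The pooled variance is worth comparing with the variance of the two arrays concatenated together. The following sketch computes both for two made-up arrays whose means are far apart; `a`, `b`, and their values are illustrative only.

```python
import numpy as np

# Made-up arrays with the same spread but very different means
a = np.array([1.0, 2.0, 3.0])
b = np.array([11.0, 12.0, 13.0])

# Pooled variance: weighted average of the within-array variances
pooled_var = ((a.size - 1) * np.var(a, ddof=1)
              + (b.size - 1) * np.var(b, ddof=1)) / (a.size + b.size - 2)

# Variance of the concatenated data, which also picks up the gap between the means
combined_var = np.var(np.concatenate([a, b]), ddof=1)

print(pooled_var, combined_var)
```

The pooled figure reflects only the spread inside each array, while the concatenated figure grows with the distance between the two means.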

How should outliers be removed from a dataset?

- Write code to remove them during execution. For example, filter out the values that lie more than 1.5 × IQR below the first quartile or above the third quartile, as remove_outliers does above.
