Name: Yash Satra  Roll No: 1811109

### Importing packages

In [1]:
import pandas as pd
import numpy as np

### Importing dataset

In [2]:
df = pd.read_csv('./datasets/StudentsPerformance.csv')

In [3]:
df.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


### Dataset Description

There are no missing values in this dataset. It has the following attributes:

1. gender - gender of student
2. race/ethnicity - categorized into 5 groups based on nationality, culture, religion, linguistics, etc
3. parental level of education - tells how much student's parents have studied
4. lunch - whether student takes standard lunch or free/reduced lunch
5. test preparation course - whether student has completed any test preparation course
6. math score - math score out of 100
7. reading score - reading score out of 100
8. writing score - writing score out of 100

### Finding the correlation among the scores

In [4]:
scores = df.loc[:,'math score':'writing score']

In [5]:
scores.corr()

Unnamed: 0,math score,reading score,writing score
math score,1.0,0.81758,0.802642
reading score,0.81758,1.0,0.954598
writing score,0.802642,0.954598,1.0


### Interpretation:

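All three scores are strongly and positively correlated. Reading and writing are the most closely related (r ≈ 0.95), while math correlates somewhat less with reading (r ≈ 0.82) and writing (r ≈ 0.80).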

### Finding covariance among scores

In [6]:
scores.cov()

Unnamed: 0,math score,reading score,writing score
math score,229.918998,180.998958,184.939133
reading score,180.998958,213.165605,211.786661
writing score,184.939133,211.786661,230.907992


### Interpretation

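All pairwise covariances are positive, so the scores tend to rise and fall together. Because covariance depends on the scale of the variables, its magnitude is harder to interpret than the correlation coefficients above.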
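
The two tables are linked by the identity corr(X, Y) = cov(X, Y) / (std(X) * std(Y)), where the variances sit on the diagonal of the covariance matrix. A minimal sketch, using only the values printed in the covariance table above (the dictionary layout and the helper name `corr_from_cov` are illustrative and not part of the notebook):

```python
import math

# Covariance values copied from the scores.cov() output above
cov = {
    ("math", "math"): 229.918998,
    ("reading", "reading"): 213.165605,
    ("writing", "writing"): 230.907992,
    ("math", "reading"): 180.998958,
    ("math", "writing"): 184.939133,
    ("reading", "writing"): 211.786661,
}

def corr_from_cov(a, b):
    """Pearson correlation recovered from a covariance and the two variances."""
    return cov[(a, b)] / math.sqrt(cov[(a, a)] * cov[(b, b)])

for a, b in [("math", "reading"), ("math", "writing"), ("reading", "writing")]:
    print(a, b, round(corr_from_cov(a, b), 4))
```

The printed values should agree with the `scores.corr()` table up to rounding.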

### Finding relationship between all categorical attributes

In [8]:
from scipy.stats import chi2_contingency

In [9]:
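def chisquaretest(col1, col2, confidence_level):
    """Chi-square test of independence between two categorical columns of df."""
    # Contingency table of observed counts for each pair of categories
    crosstab = pd.crosstab(df[col1], df[col2])
    # Returns the statistic, p-value, degrees of freedom and expected counts
    chi, p, dof, exp = chi2_contingency(crosstab)
    alpha = 1 - confidence_level  # significance level
    print('chi-square val: ', chi)
    print('p-val: ',p)
    # Null hypothesis: the two attributes are independent.
    # p > alpha means we fail to reject it ("accepted" below is shorthand for that).
    if alpha < p:
        print('Hypothesis is accepted. The attributes ',col1,' and ',col2,' are independent')
    else:
        print('Hypothesis rejected. The attributes ',col1,' and ',col2,' are dependent')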

In [10]:
cols = df.columns
cols = cols[:5]
print(cols)

Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course'],
      dtype='object')
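
The ten unordered pairs among these five columns can also be enumerated with `itertools.combinations` instead of index arithmetic. A minimal sketch, with the column names copied from the printed `Index` above (the variable name `pairs` is illustrative):

```python
from itertools import combinations

# Column names copied from the printed Index above
cols = ['gender', 'race/ethnicity', 'parental level of education',
        'lunch', 'test preparation course']

# Every unordered pair, in the same order a nested i < j index loop would visit them
pairs = list(combinations(cols, 2))

for col1, col2 in pairs:
    print(col1, '|', col2)
```

Each `(col1, col2)` tuple could then be passed to a test function such as `chisquaretest`.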


In [11]:
for i in range(4):
    for j in range(i+1,5):
        chisquaretest(cols[i],cols[j],0.95)
        print()
    

chi-square val:  9.02738626908596
p-val:  0.06041858784847785
Hypothesis is accepted. The attributes  gender  and  race/ethnicity  are independent

chi-square val:  3.384904766004173
p-val:  0.6408699721807456
Hypothesis is accepted. The attributes  gender  and  parental level of education  are independent

chi-square val:  0.37173802316040705
p-val:  0.5420584175146086
Hypothesis is accepted. The attributes  gender  and  lunch  are independent

chi-square val:  0.015529201882465888
p-val:  0.9008273880804724
Hypothesis is accepted. The attributes  gender  and  test preparation course  are independent

chi-square val:  29.45866151909779
p-val:  0.07911304840592065
Hypothesis is accepted. The attributes  race/ethnicity  and  parental level of education  are independent

chi-square val:  3.4423502326273185
p-val:  0.48669808284196503
Hypothesis is accepted. The attributes  race/ethnicity  and  lunch  are independent

chi-square val:  5.4875148857070695
p-val:  0.24082911295018397
Hypothe

### Interpretation: 

Thus, at the 5% significance level, there is no significant relationship between any pair of categorical attributes.
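
To make the statistic less of a black box, the sketch below computes the chi-square value by hand for a small contingency table. The counts are made up for illustration and do not come from this dataset, and `chi_square_statistic` is a hypothetical helper. Note that for 2x2 tables `chi2_contingency` applies Yates' continuity correction by default, so its statistic would differ slightly from this uncorrected value.

```python
def chi_square_statistic(observed):
    """Uncorrected Pearson chi-square statistic and degrees of freedom."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof

# Hypothetical counts, NOT taken from StudentsPerformance.csv
table = [[10, 20],
         [30, 40]]
chi2, dof = chi_square_statistic(table)
print(chi2, dof)
```

A larger statistic for the same degrees of freedom gives a smaller p-value, which is what the comparison against `alpha` in `chisquaretest` relies on.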

### Principal component analysis

All three scores are highly correlated, so we can replace these three columns with a single column: their first principal component.

In [17]:
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
score = pca.fit_transform(scores)

In [13]:
df = df.drop(columns = ['math score', 'reading score', 'writing score'])

In [14]:
df['score'] = score

In [15]:
df.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,score
0,female,group B,bachelor's degree,standard,none,-8.488375
1,female,group C,some college,standard,completed,-25.461441
2,female,group B,master's degree,standard,none,-43.121753
3,male,group A,associate's degree,free/reduced,none,32.036284
4,male,group C,some college,standard,none,-14.777792
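
Two points are worth checking when collapsing three columns into one. First, the sign of a principal component is arbitrary, which is why students with high raw scores receive negative values here. Second, the share of total variance retained by the component indicates how much information the single column keeps. In scikit-learn this is exposed as `pca.explained_variance_ratio_`; the sketch below instead estimates it from the covariance matrix printed earlier, using plain power iteration (an illustrative substitute with a hypothetical helper name, not what `PCA` does internally).

```python
import math

# Covariance matrix copied from the scores.cov() output
# (row/column order: math, reading, writing)
cov = [
    [229.918998, 180.998958, 184.939133],
    [180.998958, 213.165605, 211.786661],
    [184.939133, 211.786661, 230.907992],
]

def dominant_eigenpair(matrix, iterations=200):
    """Largest eigenvalue and unit eigenvector via power iteration."""
    v = [1.0] * len(matrix)
    for _ in range(iterations):
        # Multiply by the matrix, then renormalise to unit length
        w = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v for the converged unit vector
    av = [sum(a * x for a, x in zip(row, v)) for row in matrix]
    eigenvalue = sum(x * y for x, y in zip(v, av))
    return eigenvalue, v

eigenvalue, direction = dominant_eigenpair(cov)
total_variance = sum(cov[i][i] for i in range(3))
ratio = eigenvalue / total_variance
print(round(ratio, 3))
```

A ratio close to 1 supports keeping only one component; a noticeably lower value would mean the single `score` column discards real variation between the subjects.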
