<a href="https://colab.research.google.com/github/cccaaannn/machine_learning_colab/blob/master/feature_selection/data_mining_hw1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Questions**

- Find out whether attributes are categorical or numeric.

- Understand the meaning of each attribute.

- Look at some summary statistics of each attribute (such as mean, min, max, etc.)

- Find out the correlations between numerical attributes. Which of them are highly correlated? Is this correlation meaningful? Which of the attributes are highly correlated with the class attribute? You can use boxplots to explore the relations between categorical attributes and the class attribute.

- Find out the percentage of missing data for each attribute.

- Find out duplicate data if there are any.

- Plot histograms and box plots of the attributes. These will give an idea about the distribution of the attribute values.

- Do you think there are any errors or some strange data?

- Report some interesting findings if you found any.


**Attributes**

1 school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)

2 sex - student's sex (binary: 'F' - female or 'M' - male)

3 age - student's age (numeric: from 15 to 22)

4 address - student's home address type (binary: 'U' - urban or 'R' - rural)

5 famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)

6 Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)

7 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

8 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

9 Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')

10 Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')

11 reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')

12 guardian - student's guardian (nominal: 'mother', 'father' or 'other')

13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)

16 schoolsup - extra educational support (binary: yes or no)

17 famsup - family educational support (binary: yes or no)

18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)

19 activities - extra-curricular activities (binary: yes or no)

20 nursery - attended nursery school (binary: yes or no)

21 higher - wants to take higher education (binary: yes or no)

22 internet - Internet access at home (binary: yes or no)

23 romantic - with a romantic relationship (binary: yes or no)

24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)

26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)

27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)

28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

29 health - current health status (numeric: from 1 - very bad to 5 - very good)

30 absences - number of school absences (numeric: from 0 to 93)

These grades are related to the course subject, Math or Portuguese:

31 G1 - first period grade (numeric: from 0 to 20)

32 G2 - second period grade (numeric: from 0 to 20)

33 G3 - final grade (numeric: from 0 to 20, output target)


download-unzip data

In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip
!unzip student.zip

imports

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_regression

load data

In [3]:
# Semicolon-separated file; "school" is the first attribute, so it is not used as the index.
df = pd.read_csv("student-mat.csv", sep=";")

print head

In [None]:
df.head(5)

summary statistics

In [None]:
df.describe()
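
The first question asks which attributes are categorical and which are numeric. The following is a minimal sketch of a first pass based on pandas dtypes; `split_by_dtype` and the toy frame are hypothetical and not part of the original notebook. Attributes such as Medu are stored as integers but encode ordered categories, so the dtype alone does not settle the question.

```python
import pandas as pd


def split_by_dtype(frame):
    """Return (numeric_columns, non_numeric_columns) as lists of names."""
    numeric = list(frame.select_dtypes(include="number").columns)
    categorical = list(frame.select_dtypes(exclude="number").columns)
    return numeric, categorical


# Toy frame with hypothetical values, shaped like a few dataset columns.
toy = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "age": [15, 17, 16],
    "Medu": [4, 1, 3],  # integer-coded, but ordinal in meaning
})

numeric_cols, categorical_cols = split_by_dtype(toy)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

On the real data the same call would be `split_by_dtype(df)`.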

check missing data

In [None]:
df.info()
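
`df.info()` reports non-null counts, but the question asks for the percentage of missing data per attribute. A minimal sketch of that computation follows; `missing_percentage` and the toy frame are hypothetical names and values used only for illustration.

```python
import numpy as np
import pandas as pd


def missing_percentage(frame):
    """Return the percentage of missing values per column, largest first."""
    return (frame.isna().mean() * 100).sort_values(ascending=False)


# Toy frame with hypothetical values; one gap in "age" and one in "sex".
toy = pd.DataFrame({
    "age": [15, 16, np.nan, 18],
    "sex": ["F", "M", "F", None],
    "G3": [10, 12, 9, 14],
})

missing = missing_percentage(toy)
print(missing)
```

On the real data the same call would be `missing_percentage(df)`.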

check duplicates

In [None]:
df[df.duplicated()]
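
Displaying the duplicated rows is easier to interpret alongside a count. The sketch below returns both; `duplicate_summary` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def duplicate_summary(frame):
    """Return (number of duplicated rows, the duplicated rows themselves).

    A row counts as a duplicate if an identical row appears earlier.
    """
    dup_mask = frame.duplicated(keep="first")
    return int(dup_mask.sum()), frame[dup_mask]


# Toy frame with hypothetical values; the third row repeats the first.
toy = pd.DataFrame({
    "age": [15, 16, 15],
    "sex": ["F", "M", "F"],
    "G3": [10, 12, 10],
})

n_dups, dup_rows = duplicate_summary(toy)
print(n_dups)
print(dup_rows)
```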

correlation between numeric attributes

In [None]:
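# Pairwise Pearson correlations, restricted to the numeric columns.
numeric_df = df.select_dtypes(include=np.number)
for col in numeric_df.columns:
    correlation = numeric_df.corrwith(numeric_df[col])
    # Order by absolute correlation, strongest first.
    order = correlation.abs().sort_values(ascending=False).index
    print(correlation.loc[order], end="\n\n")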

correlation between final grade and other numeric attributes

In [None]:
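# Correlation of each numeric attribute (excluding the grades) with the final grade G3.
numeric_df = df.select_dtypes(include=np.number)
corelation = numeric_df.drop(columns=['G1', 'G2', 'G3']).corrwith(df['G3'])
# Order by absolute correlation, strongest first.
correlated = corelation.loc[corelation.abs().sort_values(ascending=False).index]
correlated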

with SelectKBest

In [None]:
x = df.select_dtypes(include=np.number)
x = x.drop(["G3","G2","G1"], axis=1)
y = df.loc[:,'G3']
x = x.fillna(x.mean())

selector = SelectKBest(f_regression, k=5)
selector.fit(x, y)

x.columns[selector.get_support(indices=True)]
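
With `f_regression`, each feature is scored on its own, and the F-statistic is a monotone function of the squared Pearson correlation with the target: F = r² / (1 − r²) · (n − 2). The top-k features should therefore match a ranking by absolute correlation. The sketch below reproduces that formula with pandas only; `f_scores_from_correlation` and the toy data are hypothetical, and the formula is stated as my understanding of the univariate F-test rather than taken from the notebook.

```python
import pandas as pd


def f_scores_from_correlation(features, target):
    """Compute univariate F-statistics from Pearson correlations."""
    n = len(target)
    r = features.corrwith(target)
    f = r ** 2 / (1 - r ** 2) * (n - 2)
    return pd.DataFrame({"r": r, "F": f}).sort_values("F", ascending=False)


# Toy data with hypothetical values, not taken from the dataset.
toy_x = pd.DataFrame({
    "failures": [0, 0, 1, 2, 3, 0],
    "absences": [2, 10, 4, 6, 0, 8],
})
toy_y = pd.Series([15, 12, 10, 8, 5, 14], name="G3")

scores = f_scores_from_correlation(toy_x, toy_y)
print(scores)
```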

scatter plots

In [None]:
# Scatter plots of G3 against the five attributes most correlated with it.
for col in correlated.index[:5]:
    fig, ax = plt.subplots()
    ax.scatter(df[col], df['G3'])
    ax.set_xlabel(col)
    ax.set_ylabel('G3')

histograms

In [None]:
# Histograms of the five attributes most correlated with G3.
for col in correlated.index[:5]:
    fig, ax = plt.subplots()
    ax.hist(df[col])
    ax.set_title(col)

box plots for categoricals

In [None]:
# Non-numeric columns are treated as the categorical attributes; column order is preserved.
categorical_columns = df.select_dtypes(exclude=np.number).columns

for categorical_column in categorical_columns:
    fig, ax = plt.subplots()
    sns.boxplot(x=categorical_column, y="G3", data=df, ax=ax)
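
For the question about errors or strange data, the rule commonly used for box plot whiskers (values beyond 1.5 × IQR from the quartiles) can also be applied numerically. This is a minimal sketch, not the notebook author's method; `iqr_outliers` and the toy values are hypothetical.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Toy absences column with hypothetical values and one extreme entry.
toy_absences = pd.Series([0, 2, 4, 6, 8, 93], name="absences")

outliers = iqr_outliers(toy_absences)
print(outliers)
```

On the real data this could be applied per numeric column, for example `iqr_outliers(df["absences"])`. A flagged value is only a candidate for inspection, not necessarily an error.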