# Project - Guide schools toward activities that improve G3 grades

![Data Science Workflow](img/ds-workflow.png)

## Goal of Project
- Explore the dataset from lesson further
- Follow the Data Science process to understand it better
- Your task is to identify possible activities to improve G3 grades
- NOTE: Our skills are still limited, so we must limit the ambitions of our analysis

## Step 1: Acquire
- Explore problem
- Identify data
- Import data

### Step 1.a: Import libraries
- Execute the cell below (SHIFT + ENTER)

In [1]:
import pandas as pd

### Step 1.b: Read the data
- Use ```pd.read_csv()``` to read the file `files/student-mat.csv`
- NOTE: Remember to assign the result to a variable (e.g., ```data```)

In [2]:
data = pd.read_csv('files/student-mat.csv')

### Step 1.c: Inspect the data
- Call ```.head()``` on the data to check that everything looks as expected

In [3]:
data.head()

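| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |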


### Step 1.d: Check length of data
- Call ```len(...)``` on the data
- Result: There should be 395 rows of data

In [4]:
len(data)

395

## Step 2: Prepare
- Explore data
- Visualize ideas
- Cleaning data

### Notice
- We will not cover visualization in this lecture
- We also know that the data is clean, but we will validate it here anyway

### Step 2.a: Check the data types
- This step tells you whether any numeric column is stored as a non-numeric type.
- Get the data types with ```.dtypes```

In [5]:
data.dtypes

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G1             int64
G2             int64
G3             int64
dtype: object
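
As a quick way to split the columns by type, the sketch below uses `select_dtypes`. It runs on a small frame built from a few columns of the `head()` rows shown earlier (the name `sample` is only for illustration); with the real data, `data` would take its place.

```python
import pandas as pd

# Rows copied from the data.head() output (subset of columns only).
sample = pd.DataFrame({
    "school": ["GP", "GP", "GP", "GP", "GP"],
    "sex": ["F", "F", "F", "F", "F"],
    "age": [18, 17, 15, 15, 16],
    "absences": [6, 4, 10, 2, 4],
    "G3": [6, 6, 10, 15, 10],
})

# Columns pandas stores as numbers vs. everything else.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```

If a column you expect to be numeric lands in the second list, it probably contains text values that need cleaning.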

### Step 2.b: Check for null (missing) values
- Data is often missing entries, and there can be many reasons for this
- We need to deal with that (we will do so later in the course)
- Use ```.isnull().any()```

In [6]:
data.isnull().any()

school        False
sex           False
age           False
address       False
famsize       False
Pstatus       False
Medu          False
Fedu          False
Mjob          False
Fjob          False
reason        False
guardian      False
traveltime    False
studytime     False
failures      False
schoolsup     False
famsup        False
paid          False
activities    False
nursery       False
higher        False
internet      False
romantic      False
famrel        False
freetime      False
goout         False
Dalc          False
Walc          False
health        False
absences      False
G1            False
G2            False
G3            False
dtype: bool
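
Since the real dataset has no gaps, the check above only ever prints `False`. The sketch below uses a hypothetical frame (`demo`, with made-up values and two deliberate gaps) to show what the check reports when data is missing, and how `.isnull().sum()` counts the gaps per column.

```python
import pandas as pd

# Hypothetical frame with two gaps, only to show what the check reports.
demo = pd.DataFrame({
    "age": [18, 17, None, 15],
    "G3": [6, None, 10, 15],
    "school": ["GP", "GP", "GP", "GP"],
})

missing_per_column = demo.isnull().sum()  # number of missing entries per column
has_missing = demo.isnull().any()         # True where a column has at least one gap

print(missing_per_column)
print(has_missing)
```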

## Step 3: Analyze
- Feature selection
- Model selection
- Analyze data

### Description
- We want to find 3 features to use in our report
- The 3 features should be selected based on
    - Actionable insights
    - Convey credibility in report
    - What is realistic within possibilities (including a budget)

### Note
- This step is where you can explore
- You know how to use the following:
    - **corr()** to see correlations
    - **groupby()** with **mean()**, **count()**, or **std()**
- This should be used for step 4: Report

### Step 3.a: Investigate correlation
- Correlation is an easy measure to find insights that are actionable.
- Use **corr()** and only show **G3**, as that is the column we are interested in.
    - Notice: **G1** and **G2** are highly correlated with **G3**, but they are grades themselves, so they are not intended to be used as features

In [7]:
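# numeric_only=True skips the text columns (needed in pandas 2.0+, available since 1.5)
data.corr(numeric_only=True)['G3']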

age          -0.161579
Medu          0.217147
Fedu          0.152457
traveltime   -0.117142
studytime     0.097820
failures     -0.360415
famrel        0.051363
freetime      0.011307
goout        -0.132791
Dalc         -0.054660
Walc         -0.051939
health       -0.061335
absences      0.034247
G1            0.801468
G2            0.904868
G3            1.000000
Name: G3, dtype: float64
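
To make the list easier to read, one option is to drop the grade columns and sort the rest by strength regardless of sign. The sketch below does this on the values copied from the output above; the names `corr_g3`, `candidates`, and `ranked` are only illustrative.

```python
import pandas as pd

# Correlations with G3, copied from the output above.
corr_g3 = pd.Series({
    "age": -0.161579, "Medu": 0.217147, "Fedu": 0.152457,
    "traveltime": -0.117142, "studytime": 0.097820, "failures": -0.360415,
    "famrel": 0.051363, "freetime": 0.011307, "goout": -0.132791,
    "Dalc": -0.054660, "Walc": -0.051939, "health": -0.061335,
    "absences": 0.034247, "G1": 0.801468, "G2": 0.904868, "G3": 1.000000,
})

# Drop the grades themselves, then order by absolute value (strongest first).
candidates = corr_g3.drop(["G1", "G2", "G3"])
ranked = candidates.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked.head(5))
```

A strong correlation is not automatically actionable: `failures` ranks first here, but a school cannot directly change a student's past failures.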

### Step 3.b: Get the Feature names
- This step can help you understand features better.
- All the features are available with **.columns** applied to the **DataFrame**

In [8]:
data.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
      dtype='object')

### Step 3.c: Investigate features
**Repeat this step** (possibly for all features)
- Select a feature
- Calculate the **groupby(...)** **mean()** on **G3**
    - HINT: This was done in the lesson
- Calculate the **groupby(...)** **count()** on **G3**
- Calculate the **groupby(...)** **std()** on **G3**

In [9]:
data.groupby('traveltime')['G3'].mean()

traveltime
1    10.782101
2     9.906542
3     9.260870
4     8.750000
Name: G3, dtype: float64

In [10]:
data.groupby('traveltime')['G3'].count()

traveltime
1    257
2    107
3     23
4      8
Name: G3, dtype: int64

In [11]:
data.groupby('traveltime')['G3'].std()

traveltime
1    4.523289
2    4.600108
3    5.074154
4    3.918819
Name: G3, dtype: float64
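
The three calls above can also be combined into a single table with `agg`. The sketch below runs on a hypothetical mini-dataset (`demo`, with made-up values) so that it is self-contained; with the real data, `data` would take the place of `demo`.

```python
import pandas as pd

# Hypothetical mini-dataset; with the real data, replace `demo` with `data`.
demo = pd.DataFrame({
    "traveltime": [1, 1, 1, 2, 2, 3],
    "G3": [10, 12, 14, 8, 10, 9],
})

# One table with mean, count, and std of G3 for each travel-time group.
summary = demo.groupby("traveltime")["G3"].agg(["mean", "count", "std"])
print(summary)
```

A group with a single row has no defined standard deviation (it shows as `NaN`), which is one reason to look at the count next to the mean.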

In [12]:
data.groupby('studytime')['G3'].mean()

studytime
1    10.047619
2    10.171717
3    11.400000
4    11.259259
Name: G3, dtype: float64

In [13]:
data.groupby('studytime')['G3'].count()

studytime
1    105
2    198
3     65
4     27
Name: G3, dtype: int64

In [14]:
data.groupby('studytime')['G3'].std()

studytime
1    4.956311
2    4.217537
3    4.639504
4    5.281263
Name: G3, dtype: float64

In [15]:
data.groupby('schoolsup')['G3'].mean()

schoolsup
no     10.561047
yes     9.431373
Name: G3, dtype: float64

In [16]:
data.groupby('schoolsup')['G3'].count()

schoolsup
no     344
yes     51
Name: G3, dtype: int64

In [17]:
data.groupby('schoolsup')['G3'].std()

schoolsup
no     4.769533
yes    2.865344
Name: G3, dtype: float64

In [18]:
data.groupby('famsup')['G3'].mean()

famsup
no     10.640523
yes    10.272727
Name: G3, dtype: float64

In [19]:
data.groupby('famsup')['G3'].count()

famsup
no     153
yes    242
Name: G3, dtype: int64

In [20]:
data.groupby('famsup')['G3'].std()

famsup
no     4.636262
yes    4.550318
Name: G3, dtype: float64

In [21]:
data.groupby('paid')['G3'].mean()

paid
no      9.985981
yes    10.922652
Name: G3, dtype: float64

In [22]:
data.groupby('paid')['G3'].count()

paid
no     214
yes    181
Name: G3, dtype: int64

In [23]:
data.groupby('paid')['G3'].std()

paid
no     5.126090
yes    3.791011
Name: G3, dtype: float64

In [24]:
data.groupby('activities')['G3'].mean()

activities
no     10.340206
yes    10.487562
Name: G3, dtype: float64

In [25]:
data.groupby('activities')['G3'].count()

activities
no     194
yes    201
Name: G3, dtype: int64

In [26]:
data.groupby('activities')['G3'].std()

activities
no     4.488065
yes    4.679861
Name: G3, dtype: float64

In [28]:
data.groupby('internet')['G3'].mean()

internet
no      9.409091
yes    10.617021
Name: G3, dtype: float64

In [29]:
data.groupby('internet')['G3'].count()

internet
no      66
yes    329
Name: G3, dtype: int64

In [30]:
data.groupby('internet')['G3'].std()

internet
no     4.485797
yes    4.580494
Name: G3, dtype: float64

In [32]:
data.groupby('famrel')['G3'].mean()

famrel
1    10.625000
2     9.888889
3    10.044118
4    10.358974
5    10.830189
Name: G3, dtype: float64

In [33]:
data.groupby('famrel')['G3'].count()

famrel
1      8
2     18
3     68
4    195
5    106
Name: G3, dtype: int64

In [34]:
data.groupby('famrel')['G3'].std()

famrel
1    4.838462
2    5.550717
3    4.647046
4    4.395916
5    4.733813
Name: G3, dtype: float64

In [35]:
data.groupby('Dalc')['G3'].mean()

Dalc
1    10.731884
2     9.253333
3    10.500000
4     9.888889
5    10.666667
Name: G3, dtype: float64

In [36]:
data.groupby('Dalc')['G3'].count()

Dalc
1    276
2     75
3     26
4      9
5      9
Name: G3, dtype: int64

In [37]:
data.groupby('Dalc')['G3'].std()

Dalc
1    4.676502
2    4.812970
3    3.443835
4    2.619372
5    2.692582
Name: G3, dtype: float64

In [38]:
data.groupby('Walc')['G3'].mean()

Walc
1    10.735099
2    10.082353
3    10.725000
4     9.686275
5    10.142857
Name: G3, dtype: float64

In [39]:
data.groupby('Walc')['G3'].count()

Walc
1    151
2     85
3     80
4     51
5     28
Name: G3, dtype: int64

In [40]:
data.groupby('Walc')['G3'].std()

Walc
1    5.133812
2    4.950257
3    3.700753
4    3.619338
5    4.125030
Name: G3, dtype: float64

In [41]:
data.groupby('health')['G3'].mean()

health
1    11.872340
2    10.222222
3    10.010989
4    10.106061
5    10.397260
Name: G3, dtype: float64

In [42]:
data.groupby('health')['G3'].count()

health
1     47
2     45
3     91
4     66
5    146
Name: G3, dtype: int64

In [43]:
data.groupby('health')['G3'].std()

health
1    4.351996
2    5.497474
3    4.183286
4    4.871041
5    4.417020
Name: G3, dtype: float64

### Step 3.d: Select 3 features
- Decide on 3 features to use in the report
- The decision should be based on
    - Actionable insights
    - Convey credibility in report
    - What is realistic within possibilities (including a budget)

- **Travel Time**
- **School Support**
- **Paid Classes**


## Step 4: Report
- Present findings
- Visualize results
- Credibility counts

### Description
- Create a presentation using the 3 features from Step 3
- As we have not learned visualization yet, keep it simple
- Remember that credibility counts

### Notice
- At this stage it is not supposed to be perfect.
- Present the findings here in the Notebook

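1. Students with shorter travel times have higher average final grades: mean G3 falls from 10.78 (travel time 1) to 8.75 (travel time 4), although only 8 students are in the longest category.
2. Students who do not receive extra school support have a higher average G3 (10.56, 344 students) than those who do (9.43, 51 students).
3. Students who take extra paid classes have a higher average G3 (10.92) than those who do not (9.99).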

## Step 5: Actions
- Use insights
- Measure impact
- Main goal

### Description
- What actions should the schools take?
- How can they evaluate the impact?
- Remember, this is the main goal.

**Actions to be taken:**

1. Build a dormitory for students who live far from the school.
2. Temporarily suspend the school support program.
3. Make these paid classes part of the regular curriculum.

**Impact Evaluation**

1. Conduct another survey after these actions have been implemented.
2. Compare the students' grades before and after the implementation of these actions.
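
The before/after comparison could be as simple as the difference in mean G3 between the two survey rounds. The sketch below uses placeholder grades that are not real results; `before` and `after` are hypothetical names for the G3 columns of the two surveys.

```python
import pandas as pd

# Placeholder grades, NOT real results: two hypothetical survey rounds.
before = pd.Series([8, 10, 9, 11, 7], name="G3_before")
after = pd.Series([9, 11, 9, 12, 9], name="G3_after")

# Positive change means the average final grade went up after the actions.
change = after.mean() - before.mean()
print(f"mean before: {before.mean():.2f}, mean after: {after.mean():.2f}, change: {change:+.2f}")
```

As in Step 3, it makes sense to report the count and standard deviation next to each mean so that readers can judge how reliable the change is.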