# Project - Guide school to activities to improve G3 grades

![Data Science Workflow](img/ds-workflow.png)

## Goal of Project
- Explore the dataset from lesson further
- Follow the Data Science process to understand it better
- Your task is to identify possible activities that could improve G3 grades
- NOTE: Our skills are still limited at this point, so we keep the ambitions of the analysis modest

## Step 1: Acquire
- Explore problem
- Identify data
- Import data

### Step 1.a: Import libraries
- Execute the cell below (SHIFT + ENTER)

In [1]:
import pandas as pd

### Step 1.b: Read the data
- Use ```pd.read_csv()``` to read the file `files/student-mat.csv`
- NOTE: Remember to assign the result to a variable (e.g., ```data```)

In [2]:
data = pd.read_csv('files/student-mat.csv')

### Step 1.c: Inspect the data
- Call ```.head()``` on the data to check that everything looks as expected

In [3]:
data.head()

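| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |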


In [6]:
data.describe()

| | age | Medu | Fedu | traveltime | studytime | failures | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 |
| mean | 16.696203 | 2.749367 | 2.521519 | 1.448101 | 2.035443 | 0.334177 | 3.944304 | 3.235443 | 3.108861 | 1.481013 | 2.291139 | 3.55443 | 5.708861 | 10.908861 | 10.713924 | 10.41519 |
| std | 1.276043 | 1.094735 | 1.088201 | 0.697505 | 0.83924 | 0.743651 | 0.896659 | 0.998862 | 1.113278 | 0.890741 | 1.287897 | 1.390303 | 8.003096 | 3.319195 | 3.761505 | 4.581443 |
| min | 15.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| 25% | 16.0 | 2.0 | 2.0 | 1.0 | 1.0 | 0.0 | 4.0 | 3.0 | 2.0 | 1.0 | 1.0 | 3.0 | 0.0 | 8.0 | 9.0 | 8.0 |
| 50% | 17.0 | 3.0 | 2.0 | 1.0 | 2.0 | 0.0 | 4.0 | 3.0 | 3.0 | 1.0 | 2.0 | 4.0 | 4.0 | 11.0 | 11.0 | 11.0 |
| 75% | 18.0 | 4.0 | 3.0 | 2.0 | 2.0 | 0.0 | 5.0 | 4.0 | 4.0 | 2.0 | 3.0 | 5.0 | 8.0 | 13.0 | 13.0 | 14.0 |
| max | 22.0 | 4.0 | 4.0 | 4.0 | 4.0 | 3.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 75.0 | 19.0 | 19.0 | 20.0 |


### Step 1.d: Check length of data
- Call ```len(...)``` on the data
- Result: There should be 395 rows of data

In [7]:
len(data)

395
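
As a small aside, `.shape` reports the number of rows and columns in one call. The sketch below uses a made-up three-row frame named `toy` (an assumption for illustration, not the course data) so it can run without the CSV file.

```python
import pandas as pd

# Toy frame standing in for the student data (values are made up).
toy = pd.DataFrame({
    "age": [18, 17, 15],
    "G3": [6, 6, 10],
})

# .shape is a (rows, columns) tuple; len() returns only the row count.
n_rows, n_cols = toy.shape
print(f"rows: {n_rows}, columns: {n_cols}")
print(len(toy) == n_rows)
```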

## Step 2: Prepare
- Explore data
- Visualize ideas
- Cleaning data

### Notice
- We will not cover visualization in this lecture
- We also know that the data is clean, but we will validate it here anyway

### Step 2.a: Check the data types
- This step tells you whether a numeric column is stored as a non-numeric type (e.g., as text).
- Get the data types by ```.dtypes```

In [8]:
data.dtypes

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G1             int64
G2             int64
G3             int64
dtype: object
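
To list the numeric and non-numeric columns without reading the full dtype listing, `select_dtypes` can help. This is a minimal sketch on a made-up frame named `toy` (an assumption, not the course data).

```python
import pandas as pd

# Toy frame standing in for the student data (values are made up).
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [18, 17, 15],
    "G3": [6, 6, 10],
})

# Split the column names by whether pandas treats them as numbers.
numeric_columns = toy.select_dtypes(include="number").columns.tolist()
text_columns = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_columns)
print("non-numeric:", text_columns)
```

If a column you expect to be numeric shows up in the non-numeric list, that is a sign it needs cleaning before analysis.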

### Step 2.b: Check for null (missing) values
- Data is often missing entries, and there can be many reasons for this
- We need to deal with missing values (covered later in the course)
- Use ```.isnull().any()```

In [10]:
data.isnull().any()

school        False
sex           False
age           False
address       False
famsize       False
Pstatus       False
Medu          False
Fedu          False
Mjob          False
Fjob          False
reason        False
guardian      False
traveltime    False
studytime     False
failures      False
schoolsup     False
famsup        False
paid          False
activities    False
nursery       False
higher        False
internet      False
romantic      False
famrel        False
freetime      False
goout         False
Dalc          False
Walc          False
health        False
absences      False
G1            False
G2            False
G3            False
dtype: bool
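
When values are missing, it is often useful to know how many are missing per column, not only whether any are. A minimal sketch, assuming a made-up frame `toy` with one deliberately missing entry:

```python
import pandas as pd

# Toy frame with one missing age (values are made up).
toy = pd.DataFrame({
    "age": [18, None, 15],
    "G3": [6, 6, 10],
})

# .isnull() marks each missing cell; summing counts them per column.
missing_per_column = toy.isnull().sum()
# A single True/False answer for the whole frame.
any_missing = bool(toy.isnull().any().any())

print(missing_per_column)
print("any missing values:", any_missing)
```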

## Step 3: Analyze
- Feature selection
- Model selection
- Analyze data

### Description
- We want to find 3 features to use in our report
- The 3 features should be selected based on
    - Actionable insights
    - Convey credibility in report
    - What is realistic within possibilities (including a budget)

### Note
- This step is where you can explore
- You know how to use the following:
    - **corr()** to see correlations
    - **groupby()** with **mean()**, **count()**, or **std()**
- This should be used for step 4: Report

### Step 3.a: Investigate correlation
- Correlation is an easy measure to find insights that are actionable.
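- Use **corr()** and show only the **G3** column, as that is the one we are interested in.
    - Notice: The cell below selects **G1**; replace it with **G3** to see the correlations with the final grade.
    - Notice: **G1** and **G2** are highly correlated with **G3**, but they are not intended to be used as features.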

In [39]:
data.corr()['G1']

age          -0.064081
Medu          0.205341
Fedu          0.190270
traveltime   -0.093040
studytime     0.160612
failures     -0.354718
famrel        0.022168
freetime      0.012613
goout        -0.149104
Dalc         -0.094159
Walc         -0.126179
health       -0.073172
absences     -0.031003
G1            1.000000
G2            0.852118
G3            0.801468
Name: G1, dtype: float64
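
The step description asks for the **G3** column, so a sketch of that selection may help. It uses a made-up frame `toy` (an assumption, not the course data), restricts the frame to numeric columns first, since depending on the pandas version `corr()` may fail when text columns are present, and drops **G1**, **G2**, and **G3** itself from the result.

```python
import pandas as pd

# Toy frame standing in for the student data (values are made up).
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F"],
    "studytime": [1, 2, 3, 4, 2],
    "failures": [3, 1, 0, 0, 2],
    "G1": [5, 9, 14, 15, 8],
    "G2": [6, 10, 14, 16, 8],
    "G3": [6, 10, 15, 16, 7],
})

# Keep numeric columns only, then take the correlations with G3.
numeric = toy.select_dtypes(include="number")
g3_corr = numeric.corr()["G3"].drop(["G1", "G2", "G3"])

# Order by strength of correlation, regardless of sign.
order = g3_corr.abs().sort_values(ascending=False).index
ranked = g3_corr.reindex(order)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations (such as `failures` in the output above) near the top rather than at the bottom.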

### Step 3.b: Get the Feature names
- This step can help you understand features better.
- All the features are available with **.columns** applied on the **DataFrame**

In [14]:
data.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
      dtype='object')

### Step 3.c: Investigate features
**Repeat this step** (possibly for all features)
- Select a feature
- Calculate the **groupby(...)** **mean()** on **G3**
    - HINT: This was done in the lesson
- Calculate the **groupby(...)** **count()** on **G3**
- Calculate the **groupby(...)** **std()** on **G3**

In [34]:
data.groupby('studytime')['G3'].mean()

studytime
1    10.047619
2    10.171717
3    11.400000
4    11.259259
Name: G3, dtype: float64

In [35]:
data.groupby('studytime')['G3'].count()

studytime
1    105
2    198
3     65
4     27
Name: G3, dtype: int64

In [36]:
data.groupby('studytime')['G3'].std()

studytime
1    4.956311
2    4.217537
3    4.639504
4    5.281263
Name: G3, dtype: float64
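
The three calculations above can also be combined into one table with `.agg(...)`, which makes it easier to read the mean next to the group size. A minimal sketch on a made-up frame `toy` (an assumption, not the course data):

```python
import pandas as pd

# Toy frame standing in for the student data (values are made up).
toy = pd.DataFrame({
    "studytime": [1, 1, 2, 2, 2, 3],
    "G3": [8, 10, 9, 11, 13, 14],
})

# One row per studytime group, one column per statistic of G3.
summary = toy.groupby("studytime")["G3"].agg(["mean", "count", "std"])
print(summary)
```

Reading the columns together matters for credibility: a mean based on a small group (in the real data, only 27 students have `studytime` 4) is less reliable than one based on a large group.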

### Step 3.d: Select 3 features
- Decide on 3 features to use in the report
- The decision should be based on
    - Actionable insights
    - Convey credibility in report
    - What is realistic within possibilities (including a budget)

In [37]:
'higher', 'sex', 'studytime'

('higher', 'sex', 'studytime')
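
One possible way to compare candidate features side by side is to look at the gap between the highest and lowest group mean of **G3** for each feature. This is only a rough sketch: the helper `mean_gap` and the frame `toy` are hypothetical, the values are made up, and a large gap says nothing by itself about group sizes or about whether the feature is actionable.

```python
import pandas as pd

# Toy frame standing in for the student data (values are made up).
toy = pd.DataFrame({
    "higher": ["yes", "yes", "no", "yes", "no", "yes"],
    "sex": ["F", "M", "F", "M", "F", "M"],
    "studytime": [1, 2, 1, 3, 2, 3],
    "G3": [10, 12, 6, 15, 8, 14],
})


def mean_gap(df, feature, target="G3"):
    """Difference between the highest and lowest group mean (hypothetical helper)."""
    group_means = df.groupby(feature)[target].mean()
    return group_means.max() - group_means.min()


candidates = ["higher", "sex", "studytime"]
gaps = {feature: mean_gap(toy, feature) for feature in candidates}

for feature, gap in gaps.items():
    print(f"{feature}: gap in mean G3 = {gap:.2f}")
```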

## Step 4: Report
- Present findings
- Visualize results
- Credibility counts

### Description
- With the 3 features from step 3 create a presentation
- As we have not learned visualization yet, keep it simple
- Remember that credibility counts

### Notice
- At this stage it is not supposed to be perfect.
- Present the findings here in the Notebook

## Step 5: Actions
- Use insights
- Measure impact
- Main goal

### Description
- What actions should the schools take?
- How can they evaluate the impact?
- Remember, this is the main goal.
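
One simple way a school might measure impact is to compare the mean **G3** before and after an action is introduced. The sketch below assumes two hypothetical sets of grades, `before` and `after`, with made-up values; it is an illustration, not a complete evaluation.

```python
import pandas as pd

# Hypothetical G3 grades before and after an action (values are made up).
before = pd.Series([8, 10, 11, 9, 12], name="G3")
after = pd.Series([10, 11, 12, 9, 13], name="G3")

# Change in the average final grade between the two periods.
change_in_mean = after.mean() - before.mean()

print(f"Mean G3 before: {before.mean():.2f}")
print(f"Mean G3 after:  {after.mean():.2f}")
print(f"Change:         {change_in_mean:+.2f}")
```

A change in the mean alone does not show that the action caused it; group sizes and the spread of the grades (as in Step 3.c) should be reported alongside it.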

### Recommendations

- Run surveys more frequently
- Target a survey at girls, since they seem to score lower

