# Project - Guide school to activities to improve G3 grades

![Data Science Workflow](img/ds-workflow.png)

## Goal of Project
- Explore the dataset from lesson further
- Follow the Data Science process to understand it better
- It will be your task to identify possible activities to improve G3 grades
- NOTE: Our skills are still limited at this point, so we must limit the ambitions of our analysis

## Step 1: Acquire
- Explore problem
- Identify data
- Import data

### Step 1.a: Import libraries
- Execute the cell below (SHIFT + ENTER)

In [2]:
import pandas as pd
pd.__version__

'2.0.0'

### Step 1.b: Read the data
- Use ```pd.read_csv()``` to read the file `files/student-mat.csv`
- NOTE: Remember to assign the result to a variable (e.g., ```data```)

In [3]:
data = pd.read_csv('files/student-mat.csv')

### Step 1.c: Inspect the data
- Call ```.head()``` on the data to check that everything looks as expected

In [4]:
data.head()

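|   | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |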


### Step 1.d: Check length of data
- Call ```len(...)``` on the data
- Result: There should be 395 rows of data

In [5]:
len(data)

395

## Step 2: Prepare
- Explore data
- Visualize ideas
- Cleaning data

### Notice
- We will not cover visualization in this lecture
- We also know that the data is clean, but we will validate it here anyway

### Step 2.a: Check the data types
- This step tells you whether any numeric column is stored as a non-numeric type.
- Get the data types by ```.dtypes```

In [6]:
data.dtypes

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G1             int64
G2             int64
G3             int64
dtype: object
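
With 33 columns, reading the `.dtypes` listing by eye gets tedious. One possible shortcut is `select_dtypes()`, which splits the columns into numeric and non-numeric ones. The sketch below uses a small made-up DataFrame (`toy`, not the real file) so that it runs on its own; the same calls should work on `data`.

```python
import pandas as pd

# Toy stand-in for the student data (values are made up for illustration).
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [18, 17, 15],
    "G3": [6, 6, 10],
})

# Columns pandas treats as numeric vs. everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```

If a column you expect to be numeric (for example a grade) shows up in the non-numeric list, that is a sign it needs cleaning.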

### Step 2.b: Check for null (missing) values
- Data is often missing entries, and there can be many reasons for this
- We need to deal with that (we will do so later in the course)
- Use ```.isnull().any()```

In [7]:
data.isnull().any()

school        False
sex           False
age           False
address       False
famsize       False
Pstatus       False
Medu          False
Fedu          False
Mjob          False
Fjob          False
reason        False
guardian      False
traveltime    False
studytime     False
failures      False
schoolsup     False
famsup        False
paid          False
activities    False
nursery       False
higher        False
internet      False
romantic      False
famrel        False
freetime      False
goout         False
Dalc          False
Walc          False
health        False
absences      False
G1            False
G2            False
G3            False
dtype: bool
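
`.isnull().any()` only answers yes or no per column. If a column did have missing values, `.isnull().sum()` would show how many. A minimal sketch on a made-up DataFrame (`toy` is an assumption for illustration, since the real data has no missing values):

```python
import pandas as pd

# Toy data with some missing entries (made up for illustration).
toy = pd.DataFrame({
    "studytime": [1, 2, None, 4],
    "G3": [10, None, None, 15],
    "school": ["GP", "GP", "MS", "MS"],
})

# True/False per column: is anything missing?
print(toy.isnull().any())

# Number of missing entries per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```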

## Step 3: Analyze
- Feature selection
- Model selection
- Analyze data

### Description
- Want to find 3 features to use in our report
- The 3 features should be selected based on
    - Actionable insights
    - Convey credibility in report
    - What is realistic within possibilities (including a budget)

### Note
- This step is where you can explore
- You know how to use the following:
    - **corr()** to see correlation
    - **groupby()** with **mean()**, **count()**, or **std()**
- This should be used for step 4: Report

### Step 3.a: Investigate correlation
- Correlation is an easy measure to find insights that are actionable.
- Use **corr()** and only show **G3**, as that is the column we are interested in.
    - Notice: **G1** and **G2** are highly correlated with **G3**, but they are not intended to be used as features

In [8]:
data.corr(numeric_only=True)['G3']

age          -0.161579
Medu          0.217147
Fedu          0.152457
traveltime   -0.117142
studytime     0.097820
failures     -0.360415
famrel        0.051363
freetime      0.011307
goout        -0.132791
Dalc         -0.054660
Walc         -0.051939
health       -0.061335
absences      0.034247
G1            0.801468
G2            0.904868
G3            1.000000
Name: G3, dtype: float64
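
To compare features more easily, it can help to drop the grade columns and sort the rest by the size of the correlation, ignoring the sign. The sketch below rebuilds the correlations printed above as a Series (`corr_g3` is a name chosen here) so it runs without the file; on the real data you could start from `data.corr(numeric_only=True)['G3']` instead.

```python
import pandas as pd

# Correlations with G3, copied from the output above.
corr_g3 = pd.Series({
    "age": -0.161579, "Medu": 0.217147, "Fedu": 0.152457,
    "traveltime": -0.117142, "studytime": 0.097820, "failures": -0.360415,
    "famrel": 0.051363, "freetime": 0.011307, "goout": -0.132791,
    "Dalc": -0.054660, "Walc": -0.051939, "health": -0.061335,
    "absences": 0.034247, "G1": 0.801468, "G2": 0.904868, "G3": 1.000000,
})

# Drop the grades themselves, then sort by absolute value (largest first).
candidates = corr_g3.drop(["G1", "G2", "G3"])
ranked = candidates.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked.head(5))
```

Sorting by absolute value matters because a strong negative correlation (such as `failures`) is just as informative as a strong positive one.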

### Step 3.b: Get the Feature names
- This step can help you understand features better.
- All the features are available with **.columns** applied to the **DataFrame**

In [9]:
data.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
      dtype='object')

In [10]:
data.head()

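|   | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |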


### Step 3.c: Investigate features
**Repeat this step** (possibly for all features)
- Select a feature
- Calculate the **groupby(...)** **mean()** on **G3**
    - HINT: This was done in the lesson
- Calculate the **groupby(...)** **count()** on **G3**
- Calculate the **groupby(...)** **std()** on **G3**

In [11]:
data.groupby('school')['G3'].mean()

school
GP    10.489971
MS     9.847826
Name: G3, dtype: float64

In [12]:
data.groupby('school')['G3'].count()

school
GP    349
MS     46
Name: G3, dtype: int64

In [13]:
data.groupby('school')['G3'].std()

school
GP    4.625397
MS    4.237229
Name: G3, dtype: float64

### Step 3.d: Select 3 features
- Decide on 3 features to use in the report
- The decision should be based on
    - Actionable insights
    - Convey credibility in report
    - What is realistic within possibilities (including a budget)

In [31]:
data.groupby('studytime')['G3'].count()

studytime
1    105
2    198
3     65
4     27
Name: G3, dtype: int64

In [32]:
data.groupby('studytime')['G3'].mean()

studytime
1    10.047619
2    10.171717
3    11.400000
4    11.259259
Name: G3, dtype: float64

In [25]:
data.groupby('schoolsup')['G3'].count()

schoolsup
no     344
yes     51
Name: G3, dtype: int64

In [37]:
# 51 students have schoolsup == 'yes' (see the count above), so this is the share WITH support
students_with_schoolsup = (51 / len(data)) * 100
students_with_schoolsup

12.91139240506329
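
Typing the count by hand works, but `value_counts(normalize=True)` can compute the shares directly. The sketch below rebuilds the `schoolsup` column from the counts shown above (344 "no", 51 "yes") so that it runs without the file; on the real data you could call it on `data['schoolsup']`.

```python
import pandas as pd

# Rebuild the column from the counts shown above (344 "no", 51 "yes").
schoolsup = pd.Series(["no"] * 344 + ["yes"] * 51, name="schoolsup")

# Share of each answer in percent.
share_pct = schoolsup.value_counts(normalize=True) * 100
print(share_pct.round(2))
```

This gives about 12.91% for "yes" and 87.09% for "no", which matches the manual calculation.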

In [29]:
data.groupby('failures')['G3'].mean()

failures
0    11.253205
1     8.120000
2     6.235294
3     5.687500
Name: G3, dtype: float64

In [30]:
data.groupby('health')['G3'].mean()

health
1    11.872340
2    10.222222
3    10.010989
4    10.106061
5    10.397260
Name: G3, dtype: float64

## Step 4: Report
- Present findings
- Visualize results
- Credibility counts

### Description
- With the 3 features from step 3 create a presentation
- As we have not learned visualization yet, keep it simple
- Remember that credibility counts

### Notice
- At this stage it is not supposed to be perfect.
- Present the findings here in the Notebook

### Analysis of three characteristics related to students' G3 grades

#### Introduction:
In this presentation, we'll look at three features that might be related to the G3 scores of students in our dataframe. Through this analysis, we can identify potential areas for improvement and propose solutions that can help students improve their grades.

- Feature 1: Weekly Study Time
One of the most obvious features to consider is the students' weekly study time. The "studytime" column is recorded as a level from 1 to 4, and most students (303 of 395) are at level 1 or 2. Mean G3 is about 10.0 to 10.2 at levels 1 and 2 and about 11.3 to 11.4 at levels 3 and 4, so more study time goes together with somewhat higher grades, although the overall correlation is weak (0.098) and levels 3 and 4 contain only 65 and 27 students. By encouraging students to study more, whether through educational support programs, tutoring, or additional resources, we may be able to help them improve their grades.

- Feature 2: Educational support
The "schoolsup" column indicates whether students receive educational support from the school. Only about 13% of students (51 of 395) receive it, yet this feature could be important for improving G3 scores. Students who receive educational support may be learning skills and strategies that help them succeed in school, and this could be reflected in their grades. Providing educational support to more students, either through after-school programs or in the classroom itself, could be an effective way to improve G3 scores.

- Feature 3: Mother's education (Medu)
The mother's level of education has a weak positive correlation with G3 scores (0.217), the strongest positive correlation among the features other than G1 and G2. This suggests that mothers with higher levels of education may be able to provide more support and resources to their children to help them succeed in school.

#### Conclusion:
By analyzing these three features, we can identify potential areas for improvement and propose solutions to help students improve their G3 scores. By encouraging students to study more, providing educational support, and working with parents to support learning at home, we can help them succeed in school and beyond.

## Step 5: Actions
- Use insights
- Measure impact
- Main goal

### Description
- What actions should the schools take?
- How can they evaluate the impact?
- Remember, this is the main goal.

- Provide one-on-one tutoring: Some students may be struggling with certain subjects or topics and need one-on-one attention. The "paid" column indicates whether students have paid for tutoring in the past, suggesting that some may be willing to invest in one-on-one tutoring. It would be important to work with the school's teachers to identify students who need this type of help and provide them with tutors who can work with them on an individualized study plan.

- Encourage participation in extracurricular activities: The "activities" column indicates whether or not students participate in extracurricular activities. Encouraging students to engage in these activities could help them develop important skills like time management and problem solving, and it could also help them find a passion that motivates them to try harder in their studies.

- Offer educational support programs: The "schoolsup" column indicates whether or not students receive educational support from the school. Offering educational support programs after school could help students improve their skills in specific areas and provide them with the extra help they need to succeed in their studies.

- Provide access to technology and online resources: The "internet" column indicates whether students have access to the internet at home. Providing access to online educational resources, such as tutorials and educational videos, could help students learn at their own pace and complement what they are learning in school.

- Work with parents and guardians: The "famsize" column indicates the family size, and the "Medu" and "Fedu" columns indicate the educational level of the parents. Working with parents and guardians to ensure that students have a calm and supportive study environment at home, as well as providing them with additional resources, such as books or study materials, could help students improve their grades.
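
To evaluate the impact of an action, one simple approach is to record grades before and after it and compare the average change for students who took part with the change for those who did not. The sketch below is only an illustration: the DataFrame `followup` and its columns `got_support`, `grade_before`, and `grade_after` are hypothetical and do not exist in `student-mat.csv`.

```python
import pandas as pd

# Hypothetical follow-up data: none of these values come from student-mat.csv.
followup = pd.DataFrame({
    "got_support": ["yes", "yes", "yes", "no", "no", "no"],
    "grade_before": [8, 9, 10, 9, 10, 11],
    "grade_after": [10, 11, 11, 9, 11, 11],
})

# Change in grade per student.
followup["change"] = followup["grade_after"] - followup["grade_before"]

# Average change and group size for students with and without the action.
impact = followup.groupby("got_support")["change"].agg(["mean", "count"])
print(impact)
```

A larger average change in the "yes" group would be a first hint that the action helps, but with small groups the difference can easily be due to chance, so the group sizes should be reported alongside the means.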