# **Data Handling**
---

## Introduction

In this notebook, you will apply data exploration, cleaning, and visualization techniques. Take time to understand the data before you start transforming it.


## **About the data** 
---
The data set consists of 116,658 observations and 10 columns. It contains data on fifth-grade students, including their final Math exam grade.

* Student ID: uniquely identifies each student. **Note that no two students have the same ID.**
* Gender
* School group: **There are only three school groups (A, B and C).**
* Effort regulation (effort)
* Family stress-level (stress)
* Help-seeking behavior (feedback)
* Regularity patterns of a student throughout the course (regularity)
* Critical-thinking skills (critical)
* Time in minutes taken to complete the final Math exam (minutes). **Should be numerical.**
* Final Math exam grade (grade) 


**The data set is available in the `data` folder.**

In [None]:
# Your libraries here
# YOUR CODE HERE
raise NotImplementedError()

## **0 Load the data**
---
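As a rough illustration of what to check right after loading, the sketch below reads a tiny in-memory CSV with `pandas.read_csv`. The file name inside the `data` folder is not stated here, so no real path is used, and the column names and values in `toy_csv` are made up for the example.

```python
import io

import pandas as pd

# Stand-in for the real file: the actual file name is not given in this
# notebook, so a small in-memory CSV with hypothetical columns is used.
toy_csv = io.StringIO(
    "student_id,gender,school_group,minutes,grade\n"
    "1,F,A,45,5.0\n"
    "2,M,B,-,4.5\n"
    "3,F,C,60,\n"
)

toy_df = pd.read_csv(toy_csv)

# Shape and dtypes are a quick first sanity check: a column that should be
# numerical but is read as text usually contains non-numeric entries.
print(toy_df.shape)
print(toy_df.dtypes)
```

In the notebook, the path to the file in the `data` folder would take the place of `toy_csv`.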

In [None]:
### 0.1
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Let's see what the dataframe looks like
print("length of the dataframe:", len(df))
print("first rows of the dataframe:\n")
df.head()

In [None]:
# TEST CELL, DONT MODIFY
assert len(df) == 116658

<a id="section1"></a>
## **1 Data Exploration** 
---

As mentioned in class, it is good practice to report the percentage of missing values per feature together with the features' descriptive statistics. 

In order to understand the data better, in this exercise, you should:

1. Create a function that takes as input a DataFrame and returns a DataFrame with meaningful descriptive statistics and the percentage of missing values for numerical and categorical (object type) features. The process of data cleaning requires multiple iterations of data exploration. This function should be helpful for the later data cleaning exercises. 

2. Justify the choice of each descriptive statistic. What does each say about the data? Can you identify some irregularities? 

3. In a single figure, choose an appropriate type of graph for each feature and plot each feature individually.  

4. Explain your observations. How are the features distributed (Poisson, exponential, Gaussian, etc.)? Can you visually identify any outliers?



### 1.1 
Create a function that takes as input a DataFrame and returns meaningful descriptive statistics and the percentage of missing values for numerical and categorical (object type) features.
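One possible starting point, shown on toy data, is to combine `DataFrame.describe(include="all")` with the share of missing entries per column. The helper name `summarize_features` and the toy columns are hypothetical; which statistics count as meaningful is left for you to decide and justify.

```python
import numpy as np
import pandas as pd


def summarize_features(frame):
    """Return one row per column: describe() output plus the missing percentage."""
    # include="all" covers numerical and object columns; statistics that do
    # not apply to a column (e.g. the mean of a text column) show up as NaN.
    summary = frame.describe(include="all").T
    # isna().mean() is the fraction of missing entries in each column.
    summary["missing_pct"] = frame.isna().mean() * 100
    return summary


toy = pd.DataFrame({
    "grade": [4.0, 5.5, np.nan, 3.0],
    "gender": ["F", "M", "F", None],
})
toy_stats = summarize_features(toy)
print(toy_stats)
```

This only detects values that pandas already treats as missing. Placeholders stored as text would not be counted.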



In [None]:
#### GRADED CELL ####
### 1.1
def get_feature_stats(df):
    """
    Obtains descriptive statistics for all features and percentage of missing 
    values
    
    Parameters
    ----------
    df : DataFrame
         Containing all data

    Returns
    -------
    stats : DataFrame
            Containing the statistics for all features.
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return stats

In [None]:
stats = get_feature_stats(df)
stats

In [None]:
# TEST CELL, DONT MODIFY


In [None]:
# TEST CELL, DONT MODIFY


### 1.2
Justify the choice of each descriptive statistic. What do they say about the data? Can you identify some irregularities? 

YOUR ANSWER HERE

### 1.3
In a single figure, choose an appropriate type of graph for each feature and plot each feature individually.
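A minimal sketch of the idea on toy data: one subplot per column, with a histogram for numerical columns and a bar chart of counts for everything else. It uses `pyplot.subplots` rather than `pyplot.figure`, and the helper name `plot_each_column`, the toy columns, and the chart choices are assumptions made for illustration only.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_each_column(frame):
    """Draw every column of `frame` in its own subplot of a single figure."""
    cols = list(frame.columns)
    fig, axes = plt.subplots(1, len(cols), figsize=(4 * len(cols), 3), squeeze=False)
    for ax, col in zip(axes[0], cols):
        if pd.api.types.is_numeric_dtype(frame[col]):
            # Histogram for numerical features
            ax.hist(frame[col].dropna(), bins=10)
        else:
            # Bar chart of category counts for non-numerical features
            frame[col].value_counts().plot.bar(ax=ax)
        ax.set_title(col)
    fig.tight_layout()
    return fig


toy = pd.DataFrame({
    "grade": [4.0, 5.5, 3.0, 4.5, 5.0],
    "gender": ["F", "M", "F", "F", "M"],
})
fig = plot_each_column(toy)
```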

In [None]:
#### GRADED CELL ####
### 1.3
def plot_features(df):
    """
    Plots all features individually in the same figure
    
    Parameters
    ----------
    df : DataFrame
         Containing all data
         
    Hint
    ------
    To have multiple plots in a single figure see pyplot.figure

    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
sample = df.sample(n=1000)
plot_features(sample)

### 1.4
Explain your observations. How are the features distributed (Poisson, exponential, Gaussian, etc.)? Can you visually identify outliers? 

YOUR ANSWER HERE

<a id="section2"></a>
## **2 Data Cleaning** 
---

Using your findings from the previous section, carefully continue to explore the data set and do the following:

1. Create a function to handle the missing values
2. Justify your decisions to treat the missing values
3. Create a function to handle the inconsistent data
4. Justify your decisions to treat the inconsistent data


### 2.1
Create a function to handle the missing values
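The following sketch shows one way such a function could be structured, on toy data only. It assumes that some missing values are stored as text placeholders and that columns that are mostly empty are dropped before incomplete rows are removed. The placeholder list, the 50% threshold, the helper name `drop_missing`, and the toy columns are all assumptions. The real data may call for different choices.

```python
import numpy as np
import pandas as pd


def drop_missing(frame, placeholders=("-", "?", ""), max_missing_frac=0.5):
    """Turn placeholder strings into NaN, drop sparse columns, then drop rows."""
    # Step 1: make hidden missing values visible to pandas.
    cleaned = frame.replace(list(placeholders), np.nan)
    # Step 2: keep only columns whose missing fraction is below the threshold.
    keep = cleaned.columns[cleaned.isna().mean() <= max_missing_frac]
    # Step 3: drop any remaining row that still has a missing value.
    return cleaned[keep].dropna().reset_index(drop=True)


toy = pd.DataFrame({
    "stress": [np.nan, np.nan, np.nan, 1.0],   # mostly missing
    "minutes": ["45", "-", "60", "50"],        # "-" is a hypothetical placeholder
    "grade": [5.0, 4.0, 3.5, np.nan],
})
cleaned_toy = drop_missing(toy)
print(cleaned_toy)
```

Dropping a sparse column first keeps more rows than dropping rows first, which is why the order of the steps matters here.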

In [None]:
#### GRADED CELL ####
### 2.1
def handle_missing_values(df):
    """
    Identifies and removes all missing values

    Parameters
    ----------
    df : DataFrame
      Containing missing values

    Returns
    -------
    df : DataFrame
      Without missing values

    Hint:
    -----
    Try to understand the pattern in the missing values    
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return df


In [None]:
# TEST CELL, DONT MODIFY
assert len(handle_missing_values(df).columns) == 9


In [None]:
# TEST CELL, DONT MODIFY


In [None]:
df = handle_missing_values(df)
print("length of the dataframe:", len(df))
df.head()

In [None]:
# Take a look at the new dataframe stats and compare them with the original
get_feature_stats(df)

### 2.2 
Justify your decisions to treat the missing values. Are there missing values? If so, how are the missing values encoded? Why are there missing values? Is there a pattern in the missing values?
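To look for a pattern, it can help to count missing values per row and to check which columns tend to be missing together. The sketch below does this on toy data. The column values are invented, and the co-occurrence matrix is just one possible diagnostic.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "effort": [1.0, np.nan, 3.0, np.nan],
    "stress": [2.0, np.nan, 1.0, np.nan],
    "grade": [5.0, 4.0, np.nan, 3.0],
})

is_missing = toy.isna()

# Number of missing entries in each row
missing_per_row = is_missing.sum(axis=1)

# Entry (a, b) counts the rows in which columns a and b are both missing;
# the diagonal holds the plain missing count of each column.
co_missing = is_missing.astype(int).T @ is_missing.astype(int)

print(missing_per_row.value_counts())
print(co_missing)
```

Large off-diagonal counts suggest that the values are not missing independently of each other.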


YOUR ANSWER HERE

### 2.3 
Create a function to handle the inconsistent data
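As a hedged illustration on toy data, `pandas.to_numeric` with `errors="coerce"` converts a text column to numbers and marks unparseable entries as NaN, while string methods can normalize category labels. The toy values and the specific inconsistencies shown are made up. The real columns may be inconsistent in other ways.

```python
import pandas as pd

toy = pd.DataFrame({
    "minutes": ["45", "60.5", "fifty", "30"],
    "school_group": ["A", " b", "C ", "a"],
})

fixed = toy.copy()

# Entries that cannot be parsed as numbers become NaN instead of raising.
fixed["minutes"] = pd.to_numeric(fixed["minutes"], errors="coerce")

# Strip stray whitespace and unify the case of category labels.
fixed["school_group"] = fixed["school_group"].str.strip().str.upper()

print(fixed.dtypes)
print(fixed)
```

Coercion creates new missing values, so the rows it affects still need a decision afterwards.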

In [None]:
#### GRADED CELL ####
### 2.3
def handle_inconsistent_data(df):
    """
    Identifies features with inconsistent data types and transforms features
    to the correct data type (numerical, object). 

    Parameters
    ----------
    df : DataFrame
      Containing inconsistent data

    Returns
    -------
    df : DataFrame
       With consistent data. All columns must be either numerical or categorical

    Hint:
    -----
    Don't forget to convert the features into the correct data type 
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return df

In [None]:
# TEST CELL, DONT MODIFY


In [None]:
# TEST CELL, DONT MODIFY


In [None]:
df = handle_inconsistent_data(df)
print(len(df))
print(df.head())
print(get_feature_stats(df))

### 2.4 
Justify your decisions to treat the inconsistent data. Were there columns with inconsistent data types? How did you identify them? 
 

YOUR ANSWER HERE

<a id="section3"></a>
## **3 Visualization** 
---

After cleaning the data, we can try to understand or extract insights from it. To do so, in this last section, you will do the following:
1. Create a function to show the relationship between numerical features.
2. Interpret your findings. What is correlation useful for? What insights can you get from it? 
3. Select an appropriate type of graph to explore the relationship between grade, school group, and any other meaningful feature.
4. Interpret your findings. What are some factors that seem to influence the grade of the students? Which features do not seem to affect the outcome?


### 3.1 
Create a function to show the linear correlation between features.
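The sketch below shows how an upper-triangular Pearson heatmap could be built with `seaborn.heatmap` on toy data. The helper name `upper_triangle_heatmap`, the toy columns, and the title are assumptions. The diagonal is kept visible here, which is a choice rather than a requirement stated above.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def upper_triangle_heatmap(frame):
    """Plot Pearson correlations of numerical columns, hiding the lower triangle."""
    corr = frame.select_dtypes(include="number").corr(method="pearson")
    # True marks cells to hide: everything strictly below the diagonal.
    hide = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    fig, ax = plt.subplots(figsize=(5, 4))
    sns.heatmap(
        corr,
        mask=hide,
        annot=True,
        fmt=".3g",        # three significant figures
        cmap="coolwarm",  # blue for negative, red for positive
        vmin=-1,
        vmax=1,
        ax=ax,
    )
    ax.set_title("Pearson correlation (toy data)")
    return corr, hide, ax


toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
    "label": ["a", "b", "a", "b", "a"],
})
corr, hide, ax = upper_triangle_heatmap(toy)
```

Fixing `vmin=-1` and `vmax=1` keeps the color scale symmetric around zero, so the same color always means the same correlation strength.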

In [None]:
#### GRADED CELL ####
### 3.1
import seaborn as sns
def plot_correlation(df):
    """
    Builds upper triangular heatmap with pearson correlation between numerical variables

    Instructions
    ------------
    The plot must have:
    - An appropriate title
    - Only upper triangular elements
    - Annotated values of correlation coefficients rounded to three significant 
    figures
    - Negative correlation must be blue and positive correlation red. 

    Parameters
    ----------
    df : DataFrame with data


    """
    # YOUR CODE HERE
    raise NotImplementedError()


In [None]:
plot_correlation(df)

### 3.2
Interpret your findings. What is correlation useful for? What insights can you get from it? 


YOUR ANSWER HERE

### 3.3
Select an appropriate type of graph to explore the relationship between grade, school group, and any other meaningful feature.
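One simple option, sketched on toy data, is a grouped bar chart of the mean grade per school group, split by a second feature. Gender is used here purely as an example of the other feature, and the toy values are invented. A box plot per group would be an alternative that also shows the spread.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame({
    "school_group": ["A", "A", "B", "B", "C", "C"],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "grade": [5.0, 4.0, 3.5, 4.5, 5.5, 5.0],
})

# Rows: school group, columns: gender, values: mean grade
mean_grades = (
    toy.groupby(["school_group", "gender"])["grade"].mean().unstack("gender")
)

fig, ax = plt.subplots(figsize=(5, 3))
mean_grades.plot.bar(ax=ax)
ax.set_xlabel("School group")
ax.set_ylabel("Mean grade")
ax.set_title("Mean grade by school group and gender (toy data)")
fig.tight_layout()
```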


In [None]:
#### GRADED CELL ####
### 3.3
def plot_grades(df):
    """
    Visualizes the relationship between grade, school group, and another
    meaningful feature

    Parameters
    ----------
    df : DataFrame with data

    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
plot_grades(df)

### 3.4
Interpret your findings. What are some factors that seem to influence the grade of the students? Which features do not seem to affect the outcome?

YOUR ANSWER HERE