# Storing Tidy Data

`students_cleaned.csv` is the tidy data that was generated from Week 11's exercise. Below is a sample of the dataset:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

students_df = pd.read_csv('students_cleaned.csv')
students_df.sample(10)

#### Here is a description of each column:

- `student_id` - student's identification number
- `name` - full name of the student
- `quarter` - quarter of observation
- `grade` - grade of student that quarter
- `year_entered` - year student entered the college
- `major` - student's major
- `specialization` - student's specialization

## Exercise 1: Split `students_df` into two DataFrames:

1. `students_information_df`
    Fields:
    - `student_id`
    - `name`
    - `year_entered`
    - `major`
    - `specialization`

2. `students_grades_df`
    Fields:
    - `student_id`
    - `quarter`
    - `grade`
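
As a hint for this kind of split, here is a minimal sketch on a hypothetical toy table (the `toy_df` frame and its `pet_*` columns are made up for illustration and are not part of `students_cleaned.csv`). It assumes the usual approach: select the columns that describe the entity and drop the repeated rows, then select the key plus the columns that change per observation.

```python
import pandas as pd

# Hypothetical toy data, not the real students_cleaned.csv
toy_df = pd.DataFrame({
    "pet_id": [1, 1, 2, 2],
    "pet_name": ["Rex", "Rex", "Mia", "Mia"],
    "visit": ["V1", "V2", "V1", "V2"],
    "weight": [10.5, 11.0, 4.2, 4.4],
})

# One row per entity: keep the entity-level columns, then drop the repeats
pets_df = toy_df[["pet_id", "pet_name"]].drop_duplicates().reset_index(drop=True)

# One row per observation: keep the key plus the observation-level columns
visits_df = toy_df[["pet_id", "visit", "weight"]]

print(pets_df.shape)    # (2, 2)
print(visits_df.shape)  # (4, 3)
```

Without `drop_duplicates()`, the entity table would still contain one row per observation, so its row count would not match the number of distinct entities.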

In [None]:
# Split student df here


## Exercise 2: Evaluate the best performing quarter for the students

To finish this exercise, we need another way of subsetting data in `pandas` called boolean indexing. In boolean indexing, we keep or drop rows based on `True` or `False` values. For example, suppose we have a DataFrame `A`:

In [None]:
x = np.arange(1,7)
y = x*2
table = {
    'x': x,
    'y': y,
}
A = pd.DataFrame(table)
A

We can get a boolean `Series` that marks the rows where the value in column `x` is less than or equal to 3 with:

In [None]:
A['x'] <= 3

The Series is made up of `True` and `False` values, one per row, based on our boolean expression. We can use it to extract the rows of `A` whose `x` value has a corresponding `True` by placing the boolean expression inside square brackets after `A`:

In [None]:
A[ A['x'] <= 3 ] 
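
To make the two steps explicit, the sketch below rebuilds the same `A` and stores the boolean Series in a variable before using it as a filter. The names `mask` and `subset` are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# Rebuild the same small DataFrame used above
x = np.arange(1, 7)
A = pd.DataFrame({"x": x, "y": x * 2})

mask = A["x"] <= 3   # boolean Series, one True/False per row
subset = A[mask]     # keeps only the rows where mask is True

print(mask.tolist())         # [True, True, True, False, False, False]
print(subset["y"].tolist())  # [2, 4, 6]
print(int(mask.sum()))       # 3 rows satisfy the condition
```

Because `True` counts as 1 when summed, `mask.sum()` is a quick way to check how many rows a condition will keep before filtering.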

In our `students_df` DataFrame, we can extract all the rows having the quarter `Q1` by doing:

In [None]:
q1_df = students_df[ students_df['quarter'] == 'Q1' ]
q1_df

### Using the examples above, create DataFrames divided by quarter and put them into the following variables respectively: `q2_df`, `q3_df`, `q4_df`

In [None]:
# code here


### Create a 2x2 grid of subplots. Each subplot should contain a histogram showing the frequency of the students' grades for one quarter
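
As a hint for the layout only, here is a minimal sketch of a 2x2 grid of histograms built with `plt.subplots`. It uses synthetic scores for four hypothetical groups rather than the students' grades, and the `Agg` backend line is an assumption for running outside a notebook; it can be dropped inside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic scores for four hypothetical groups (not the students' grades)
rng = np.random.default_rng(0)
groups = {name: rng.normal(loc=80, scale=5, size=100) for name in ["A", "B", "C", "D"]}

# Shared axes make the four panels directly comparable
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)

# axes is a 2x2 array; flatten it to pair each panel with one group
for ax, (name, scores) in zip(axes.flat, groups.items()):
    ax.hist(scores, bins=10)
    ax.set_title(f"Group {name}")
    ax.set_xlabel("score")
    ax.set_ylabel("frequency")

fig.tight_layout()
```

Sharing the x and y axes is a design choice: it keeps the bins and counts on the same scale, which makes it easier to compare the distributions by eye.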

In [None]:
# code here


### Based on the histograms above, in which quarter did the majority of the students perform best? Explain your answer in a Python comment below:

In [None]:
# Answer here

## Assertion tests
Run the code below to check whether anything is missing in your implementation. It checks for:
- shape of `students_information_df` should be 100x5
- shape of `students_grades_df` should be 400x3
- existence of `student_id`, `name`, `year_entered`, `major` and `specialization` inside the `students_information_df` dataframe
- existence of `student_id`, `quarter`, and `grade` inside the `students_grades_df` dataframe

In [None]:
assert 'student_id' in students_information_df.columns, "student_id column does not exist in students_information_df"
assert 'name' in students_information_df.columns, "name column does not exist in students_information_df"
assert 'year_entered' in students_information_df.columns, "year_entered column does not exist in students_information_df"
assert 'major' in students_information_df.columns, "major column does not exist in students_information_df"
assert 'specialization' in students_information_df.columns, "specialization column does not exist in students_information_df"

assert 'student_id' in students_grades_df.columns, "student_id column does not exist in students_grades_df"
assert 'quarter' in students_grades_df.columns, "quarter column does not exist in students_grades_df"
assert 'grade' in students_grades_df.columns, "grade column does not exist in students_grades_df"

assert students_information_df.shape == (100, 5), "students_information_df should have shape (100, 5)"
assert students_grades_df.shape == (400, 3), "students_grades_df should have shape (400, 3)"
