# `DSML_WS_03` - Introduction to Pandas

Please work on the following tasks **before** the third workshop session.

## 1. Working with student grades in NumPy

Last week, you familiarized yourself with NumPy. Let's check your NumPy knowledge with a small case study.

Imagine you are the teacher of a class of 15 students. During the year, the class has written 3 tests, each with a maximum of 100 points. You want to summarize the students' performances using NumPy.

1. Simulate the described case by creating a two-dimensional NumPy array with each row representing a student and each column representing a test. Generate random scores for each student and test between 0 and 100, and assign the array to a variable called `student_scores`.
2. Oops! You completely forgot Thomas, who joined the class during the school year after the first test. Thomas scored 87 on the second test and 93 on the third test.
    - Since Thomas does not have a score for the first test, you want to simply use the average first-test score of all other students. Calculate this and assign it to a variable called `avg_first_test`. (Hint: array slicing and the function `np.mean()` might be helpful here.)
    - Add Thomas using `avg_first_test` as his first test score and his actual second and third test scores to `student_scores`.
3. You want to generate the sum of the scores from all three tests for each student. Do this using a matrix multiplication and save the resulting array to a variable called `student_totals`.
4. Finally, you want to convert the total scores in `student_totals` into percentages of the maximum available points. Assign this array to a variable called `student_pct`.

In [4]:
# your code here
import numpy as np

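# Set a seed so the results are reproducible, then create the random score array.
# Note: this generates 5 columns per student; only the first 3 are used as the three tests.
np.random.seed(0)
student_scores = np.random.randint(low=0, high=101, size=(15, 5))

# Calculate the average of the first test (column 0) over the original 15 students
avg_first_test = np.mean(student_scores[:, :1])

# Create an array with Thomas' three test scores
thomas = np.array([avg_first_test, 87, 93])

# Draw two extra random values so Thomas' row matches the 5-column array
random_thomas = np.random.randint(1, 101, size=2)

# astype(int) truncates the averaged first score to an integer
thomas = np.hstack((thomas, random_thomas)).astype(int)

# Stack Thomas' row below the other students' scores
student_scores = np.vstack((student_scores, thomas))

# Calculate each student's total over the three tests via matrix multiplication
student_totals = student_scores[:, :3] @ np.ones((3, 1))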

# Calculate each student's percentage of the maximum available points
max_points = 300  # 3 tests with a maximum of 100 points each
student_pct = (student_totals / max_points) * 100

print("Student Scores:\n", student_scores)

print("\nStudent Totals:\n", student_totals.T.astype(int))

# Round to one decimal place for display only
print("\nStudent Percentages:\n", np.round(student_pct.T, 1))


Student Scores:
 [[44 47 64 67 67]
 [ 9 83 21 36 87]
 [70 88 88 12 58]
 [65 39 87 46 88]
 [81 37 25 77 72]
 [ 9 20 80 69 79]
 [47 64 82 99 88]
 [49 29 19 19 14]
 [39 32 65  9 57]
 [32 31 74 23 35]
 [75 55 28 34  0]
 [ 0 36 53  5 38]
 [17 79  4 42 58]
 [31  1 65 41 57]
 [35 11 46 82 91]
 [40 87 93  1 15]]

Student Totals:
 [[155 113 246 191 143 109 193  97 136 137 158  89 100  97  92 220]]

Student Percentages:
 [[51.7 37.7 82.  63.7 47.7 36.3 64.3 32.3 45.3 45.7 52.7 29.7 33.3 32.3
  30.7 73.3]]
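
Step 3 asks for the totals via a matrix multiplication. As a minimal sketch of why that works, the block below uses a small hypothetical score table (`scores` and its values are made up for illustration and are not the data above) and checks that multiplying by a column of ones gives the same totals as summing along each row.

```python
import numpy as np

# Hypothetical 3 x 3 score table: rows are students, columns are tests
scores = np.array([[50, 60, 70],
                   [80, 90, 100],
                   [10, 20, 30]])

# Multiplying by a column of ones adds up the entries of each row
ones = np.ones((scores.shape[1], 1))
totals_matmul = scores @ ones        # shape (3, 1), float dtype

# The same totals via a reduction along the column axis
totals_sum = scores.sum(axis=1)      # shape (3,), integer dtype

# Percentage of the maximum: 3 tests with 100 points each, as in the task
pct = totals_sum / (scores.shape[1] * 100) * 100

print(totals_matmul.ravel())
print(totals_sum)
print(pct)
```

The matrix product returns a column vector of floats, whereas `sum(axis=1)` returns a flat integer array, which is why the sketch flattens the former with `ravel()` before comparing.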


## 2. Getting started with Pandas

This week, we will be exploring Pandas - a core package for working with data in Python. You can think of Pandas objects as enhanced versions of NumPy arrays. Let's see why.

As always, we first have to import Pandas to use its functionality within this Jupyter notebook. Pandas is commonly abbreviated as `pd`.

In [1]:
import pandas as pd

The Pandas equivalent of a one-dimensional array is a `Series` object, which you create just like an array, using `pd.Series` instead of `np.array`. Let's stick with the student grade example from Task 1, but focus on only five students: Helena, Tom, Nina, Sam and Kim, who are 15, 15, 16, 17 and 16 years old and scored 75, 69, 87, 88 and 54 points on the first test, respectively. Create three Pandas `Series` objects called `names`, `ages` and `scores` to store the respective data about our five students. How do Pandas `Series` objects differ from NumPy arrays?
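
As a hint, here is a minimal sketch of creating a `Series` next to a plain array. The fruit prices, the labels and the variable names (`prices`, `prices_array`) are hypothetical and deliberately unrelated to the student data.

```python
import numpy as np
import pandas as pd

# Hypothetical data, unrelated to the exercise
prices_array = np.array([1.2, 0.5, 3.0])
prices = pd.Series([1.2, 0.5, 3.0],
                   index=["apple", "banana", "cherry"],
                   name="price")

print(prices_array[1])        # an array is addressed by position only
print(prices["banana"])       # a Series can be addressed by label
print(prices.iloc[1])         # ... and still by position via .iloc
print(prices.index.tolist())  # the labels are stored in the index
```

If you omit `index`, Pandas assigns the default integer labels 0, 1, 2, and so on.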

In [None]:
# your code here


At the heart of Pandas are dataframes, the equivalent of two-dimensional arrays. Combine the three `Series` objects into one dataframe using `pd.DataFrame({'name_1': series_1, 'name_2': series_2, ...})` and assign it to a variable called `students`. Use the column names `name`, `age` and `score`, which the tasks below refer to. How does the dataframe differ from a two-dimensional array?
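
The dictionary pattern can be sketched as follows. The data and the names `fruit`, `price`, `stock` and `inventory` are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical Series objects, unrelated to the exercise
fruit = pd.Series(["apple", "banana", "cherry"])
price = pd.Series([1.2, 0.5, 3.0])
stock = pd.Series([10, 0, 25])

# Each dictionary key becomes a column name, each Series a column
inventory = pd.DataFrame({"fruit": fruit, "price": price, "stock": stock})

print(inventory)
print(inventory.dtypes)  # every column keeps its own data type
print(inventory.shape)   # (rows, columns)
```

Printing `dtypes` is a quick way to see one difference from a two-dimensional array: the columns do not have to share a single data type.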

In [None]:
# your code here


You can select specific information from your dataframe using the `.loc[row_label, column_label]` indexer. Return all rows but only the `age` column using `.loc`.
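
A minimal sketch of label-based selection, assuming a hypothetical dataframe `inventory` whose row labels are fruit names (none of this data comes from the exercise):

```python
import pandas as pd

# Hypothetical dataframe with string row labels
inventory = pd.DataFrame(
    {"price": [1.2, 0.5, 3.0], "stock": [10, 0, 25]},
    index=["apple", "banana", "cherry"],
)

# A single cell: one row label, one column label
one_value = inventory.loc["banana", "price"]

# A whole row: a colon selects every column
one_row = inventory.loc["cherry", :]

# A label slice includes its end label, unlike positional slicing
label_slice = inventory.loc["apple":"banana", "stock"]

print(one_value)
print(one_row)
print(label_slice)
```

The colon works the same way in the row position, where it selects every row.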

In [None]:
# your code here


We can also use `.loc` to filter rows based on conditions. For example, to return only Helena's test score, I could write `students.loc[students.name == 'Helena', 'score']`. Return all information on students with a score higher than 80.
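
The condition inside `.loc` is simply a boolean `Series` with one `True` or `False` per row. A minimal sketch with hypothetical data (`inventory`, `in_stock` and the other names are illustrative only):

```python
import pandas as pd

# Hypothetical dataframe, unrelated to the exercise
inventory = pd.DataFrame({
    "fruit": ["apple", "banana", "cherry"],
    "price": [1.2, 0.5, 3.0],
    "stock": [10, 0, 25],
})

# The comparison yields one boolean per row
in_stock = inventory["stock"] > 0

# Rows where the mask is True, all columns
available = inventory.loc[in_stock, :]

# Conditions can be combined with & (and) or | (or); wrap each in parentheses
cheap_names = inventory.loc[(inventory["price"] < 2) & in_stock, "fruit"]

print(in_stock.tolist())
print(available)
print(cheap_names.tolist())
```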

In [None]:
# your code here
