# Big Data Real-Time Analytics with Python and Spark

## Data Manipulation in Python with Pandas

### Case Study Challenge 2

**Problem Definition and Data Source**

In this case study, the objective is to carry out a detailed analysis of school data by cross-referencing, comparing, and summarizing different types of information.

We will answer 25 business questions in total, each requiring us to look at the data from a different perspective. Pandas is the only tool used.

The data is fictitious, but it is designed to resemble real data. It was generated with Mockaroo, a realistic data generator, available at the address below:

https://www.mockaroo.com

In [1]:
import pandas as pd

#### Loading the Data

In [2]:
schools_data = pd.read_csv("dataset/dataset_escolas.csv")

In [3]:
schools_data.shape

(15, 5)

In [4]:
schools_data.head()

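|   | ID_Escola | Nome_Escola | Tipo_Escola | Numero_Alunos | Orcamento_Anual |
|---|-----------|-------------|-------------|---------------|-----------------|
| 0 | 0 | Escola A | Publica | 2917 | 1910635 |
| 1 | 1 | Escola B | Publica | 2949 | 1884411 |
| 2 | 2 | Escola C | Particular | 1761 | 1056600 |
| 3 | 3 | Escola D | Publica | 4635 | 3022020 |
| 4 | 4 | Escola E | Particular | 1468 | 917500 |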


In [5]:
student_data = pd.read_csv("dataset/dataset_estudantes.csv")

In [6]:
student_data.shape

(39160, 7)

In [7]:
student_data.head()

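|   | ID_Estudante | Nome_Estudante | Genero | Serie | Nome_Escola | Nota_Redacao | Nota_Matematica |
|---|--------------|----------------|--------|-------|-------------|--------------|-----------------|
| 0 | 0 | Kevin Bradley | M | 6 | Escola A | 66 | 79 |
| 1 | 1 | Paul Smith | M | 9 | Escola A | 94 | 61 |
| 2 | 2 | John Rodriguez | M | 9 | Escola A | 90 | 60 |
| 3 | 3 | Oliver Scott | M | 9 | Escola A | 67 | 58 |
| 4 | 4 | William Ray | F | 6 | Escola A | 97 | 84 |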


In [8]:
# Left join: keep every student row and attach the matching school's attributes
full_data = pd.merge(student_data, schools_data, how = "left", on = "Nome_Escola")

In [9]:
full_data.shape

(39160, 11)
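The row count is unchanged after the left join, which is what we expect when each school name appears only once in `schools_data`. A minimal sketch of how that assumption could be checked explicitly is shown below. The two small frames are made up for illustration (only the column names follow the case study), and the use of `validate` and `indicator` is a suggestion, not part of the original notebook.

```python
import pandas as pd

# Tiny illustrative frames; the values are made up, only the column names follow the case study.
students = pd.DataFrame({
    "ID_Estudante": [0, 1, 2],
    "Nome_Escola": ["Escola A", "Escola A", "Escola B"],
})
schools = pd.DataFrame({
    "Nome_Escola": ["Escola A", "Escola B"],
    "Tipo_Escola": ["Publica", "Particular"],
})

# validate="many_to_one" raises MergeError if a school name is duplicated in `schools`.
# indicator=True adds a `_merge` column telling whether each student row found a match.
merged = pd.merge(
    students,
    schools,
    how="left",
    on="Nome_Escola",
    validate="many_to_one",
    indicator=True,
)

# Students whose school name has no counterpart in `schools`
unmatched = int((merged["_merge"] == "left_only").sum())
print(merged.shape, unmatched)
```

If `unmatched` is greater than zero, some students refer to a school that is missing from the schools table, and their school columns will be filled with `NaN`.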

In [10]:
full_data.head()

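|   | ID_Estudante | Nome_Estudante | Genero | Serie | Nome_Escola | Nota_Redacao | Nota_Matematica | ID_Escola | Tipo_Escola | Numero_Alunos | Orcamento_Anual |
|---|--------------|----------------|--------|-------|-------------|--------------|-----------------|-----------|-------------|---------------|-----------------|
| 0 | 0 | Kevin Bradley | M | 6 | Escola A | 66 | 79 | 0 | Publica | 2917 | 1910635 |
| 1 | 1 | Paul Smith | M | 9 | Escola A | 94 | 61 | 0 | Publica | 2917 | 1910635 |
| 2 | 2 | John Rodriguez | M | 9 | Escola A | 90 | 60 | 0 | Publica | 2917 | 1910635 |
| 3 | 3 | Oliver Scott | M | 9 | Escola A | 67 | 58 | 0 | Publica | 2917 | 1910635 |
| 4 | 4 | William Ray | F | 6 | Escola A | 97 | 84 | 0 | Publica | 2917 | 1910635 |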


In [11]:
full_data["Serie"].unique()

array([6, 9, 8, 7], dtype=int64)

In [12]:
full_data["Genero"].unique()

array(['M', 'F'], dtype=object)
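`unique()` lists the distinct values but not how many rows carry each one. As a complementary check, `value_counts()` can show the distribution. The sketch below uses a small hand-made sample (mirroring the five rows shown by `head()`), so the counts are illustrative only and say nothing about the full dataset.

```python
import pandas as pd

# Small illustrative sample; it mirrors the Serie and Genero values of the first five rows.
sample = pd.DataFrame({
    "Serie": [6, 9, 9, 9, 6],
    "Genero": ["M", "M", "M", "M", "F"],
})

# Absolute number of rows per grade level, ordered by grade level
serie_counts = sample["Serie"].value_counts().sort_index()

# Share of rows per gender (normalize=True returns fractions that sum to 1)
genero_share = sample["Genero"].value_counts(normalize=True)

print(serie_counts)
print(genero_share)
```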

#### Data Analysis Challenge


Answer the questions below.

> **1- How many schools do we have data for?**

In [34]:
number_schools = full_data['Nome_Escola'].nunique()
number_schools

15

In [35]:
print("We have data from", number_schools, "schools")

We have data from 15 schools


> **2- What is the total number of student records in the database?**

In [15]:
number_students = full_data['ID_Estudante'].count()
number_students

39160

In [16]:
print("We have a total of", number_students, "student records")

We have a total of 39160 student records


> **3- What is the total budget considering all schools?**

In [17]:
sum_annual_budget = schools_data['Orcamento_Anual'].sum()
sum_annual_budget

24649428

In [18]:
print("The total budget value considering all schools is: ${:.2f}".format(sum_annual_budget))

The total budget value considering all schools is: $24649428.00
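Large amounts are easier to read with a thousands separator. As a small sketch, Python's format specification accepts a comma option for this; the literal below simply repeats the total computed above.

```python
# Total annual budget reported above, repeated here so the snippet is self-contained
total_budget = 24649428

# The "," option inserts thousands separators; ".2f" keeps two decimal places
formatted_budget = "${:,.2f}".format(total_budget)
print(formatted_budget)  # $24,649,428.00
```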


> **4- What is the average grade of students in Writing?**

In [19]:
average_writing_grade = full_data['Nota_Redacao'].mean()
average_writing_grade

81.87574055158325

In [20]:
print("The average grade of students in Writing is", average_writing_grade)

The average grade of students in Writing is 81.87574055158325


> **5- What is the average grade of students in Mathematics?**

In [21]:
average_math_grade = full_data['Nota_Matematica'].mean()
average_math_grade

78.98493360572012

In [22]:
print("The average grade of students in Mathematics is", average_math_grade)

The average grade of students in Mathematics is 78.98493360572012


> **6- Considering that the passing grade is 70, how many students passed Writing? (Report the result as an absolute count and as a percentage)**

In [23]:
approved_students_writing = full_data[full_data['Nota_Redacao'] >= 70].shape[0]
approved_students_writing

33600

In [24]:
total_students = full_data.shape[0]
percentage_approved_writing = (approved_students_writing / total_students) * 100
percentage_approved_writing

85.80183861082737

In [25]:
print("Number of students approved in Writing:", approved_students_writing)
print("Percentage of students approved in Writing: {:.2f}%".format(percentage_approved_writing))

Number of students approved in Writing: 33600
Percentage of students approved in Writing: 85.80%


> **7- Considering that the passing grade is 70, how many students passed Mathematics? (Report the result as an absolute count and as a percentage)**

In [26]:
approved_students_math = full_data[full_data['Nota_Matematica'] >= 70].shape[0]
approved_students_math

29360

In [27]:
percentage_approved_math = (approved_students_math / total_students) * 100
percentage_approved_math

74.97446373850867

In [28]:
print("Number of students approved in Mathematics:", approved_students_math)
print("Percentage of students approved in Mathematics: {:.2f}%".format(percentage_approved_math))

Number of students approved in Mathematics: 29360
Percentage of students approved in Mathematics: 74.97%


> **8- Considering that the passing grade is 70, how many students passed both Mathematics and Writing? (Report the result as an absolute count and as a percentage)**

In [29]:
approved_students = full_data[(full_data['Nota_Matematica'] >= 70) & (full_data['Nota_Redacao'] >= 70)].shape[0]
approved_students

25518

In [30]:
percentage_approved = (approved_students / total_students) * 100
percentage_approved

65.16343207354444

In [31]:
print("Number of students approved in Mathematics and Writing:", approved_students)
print("Percentage of students approved in Mathematics and Writing: {:.2f}%".format(percentage_approved))

Number of students approved in Mathematics and Writing: 25518
Percentage of students approved in Mathematics and Writing: 65.16%
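Questions 6 to 8 follow the same pattern: build a boolean mask, count the `True` values, and divide by the number of rows. An alternative sketch is shown below, relying on the fact that the sum of a boolean Series is the count and its mean is the proportion. The five rows are the ones displayed by `head()` earlier, so the figures here are illustrative and differ from the full-dataset results; `PASSING_GRADE` is a name introduced for this example.

```python
import pandas as pd

# The five rows shown by head(); not the full dataset
grades = pd.DataFrame({
    "Nota_Redacao": [66, 94, 90, 67, 97],
    "Nota_Matematica": [79, 61, 60, 58, 84],
})

PASSING_GRADE = 70  # hypothetical constant name; the threshold comes from the questions

# Boolean masks: True where the student reached the passing grade
passed_writing = grades["Nota_Redacao"] >= PASSING_GRADE
passed_math = grades["Nota_Matematica"] >= PASSING_GRADE
passed_both = passed_writing & passed_math

# sum() counts the True values, mean() gives their proportion
for label, mask in [("Writing", passed_writing),
                    ("Mathematics", passed_math),
                    ("Both", passed_both)]:
    print("{}: {} students ({:.2f}%)".format(label, int(mask.sum()), mask.mean() * 100))
```

Keeping the masks as named variables avoids repeating the filter expression and makes the combined condition in question 8 a single `&` operation.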


> **9- Create a dataframe with the results of questions 1 to 8 that you calculated above. (Tip: create a dictionary and then convert it to a Pandas dataframe)**

In [32]:
results = pd.DataFrame ({"Number of Schools": [number_schools],
                            "Number of Students": [number_students],
                            "Total Annual Budget for Schools":[sum_annual_budget],
                            "Average Grade in Writing":[average_writing_grade],
                            "Average Grade in Mathematics":[average_math_grade],
                            "Number of students approved in Writing (absolute value)":[approved_students_writing],
                            "Number of students approved in Writing (percentage value)":[percentage_approved_writing],
                            "Number of students approved in Mathematics (absolute value)":[approved_students_math],
                            "Number of students approved in Mathematics (percentage value)":[percentage_approved_math],
                            "Number of students approved in Mathematics and Writing (absolute value)":[approved_students],
                            "Number of students approved in Mathematics and Writing (percentage value)":[percentage_approved]})

results

|   | Number of Schools | Number of Students | Total Annual Budget for Schools | Average Grade in Writing | Average Grade in Mathematics | Number of students approved in Writing (absolute value) | Number of students approved in Writing (percentage value) | Number of students approved in Mathematics (absolute value) | Number of students approved in Mathematics (percentage value) | Number of students approved in Mathematics and Writing (absolute value) | Number of students approved in Mathematics and Writing (percentage value) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 39160 | 24649428 | 81.875741 | 78.984934 | 33600 | 85.801839 | 29360 | 74.974464 | 25518 | 65.163432 |
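The tip in question 9 suggests creating a dictionary first and then converting it. A minimal sketch of that two-step approach follows, using three of the values computed above as plain literals. The `wide` and `long` names are introduced here for illustration; the long layout is only a suggestion for cases where a single very wide row is hard to read.

```python
import pandas as pd

# Step 1: collect the results in a plain dictionary (values copied from the answers above)
summary = {
    "Number of Schools": 15,
    "Number of Students": 39160,
    "Total Annual Budget for Schools": 24649428,
}

# Step 2a: one row, one column per metric (same layout as the table above)
wide = pd.DataFrame([summary])

# Step 2b: one row per metric, which is easier to read when there are many metrics
long = pd.Series(summary, name="Value").to_frame()

print(wide)
print(long)
```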


> **10- Format the "Number of Students" and "Total Annual Budget for Schools" columns by adjusting the decimal places.**

In [33]:
results_2 = pd.DataFrame ({"Number of Schools": [number_schools],
                            "Number of Students":  ["{:.2f}".format(number_students)],
                            "Total Annual Budget for Schools": ["{:.2f}".format(sum_annual_budget)],
                            "Average Grade in Writing":[average_writing_grade],
                            "Average Grade in Mathematics":[average_math_grade],
                            "Number of students approved in Writing (absolute value)":[approved_students_writing],
                            "Number of students approved in Writing (percentage value)":[percentage_approved_writing],
                            "Number of students approved in Mathematics (absolute value)":[approved_students_math],
                            "Number of students approved in Mathematics (percentage value)":[percentage_approved_math],
                            "Number of students approved in Mathematics and Writing (absolute value)":[approved_students],
                            "Number of students approved in Mathematics and Writing (percentage value)":[percentage_approved]})

results_2

|   | Number of Schools | Number of Students | Total Annual Budget for Schools | Average Grade in Writing | Average Grade in Mathematics | Number of students approved in Writing (absolute value) | Number of students approved in Writing (percentage value) | Number of students approved in Mathematics (absolute value) | Number of students approved in Mathematics (percentage value) | Number of students approved in Mathematics and Writing (absolute value) | Number of students approved in Mathematics and Writing (percentage value) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 39160.00 | 24649428.00 | 81.875741 | 78.984934 | 33600 | 85.801839 | 29360 | 74.974464 | 25518 | 65.163432 |
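Note that `"{:.2f}".format(...)` returns a string, so the two formatted columns can no longer be used in arithmetic. One possible alternative, sketched below under the assumption that the formatting is needed only for presentation, is to keep the numeric frame untouched and build a separate display copy. The frame here contains just the two columns in question, with the values computed above, and the `formatted` name is introduced for this example.

```python
import pandas as pd

# Numeric results (only the two columns being formatted, values from the answers above)
results = pd.DataFrame({
    "Number of Students": [39160],
    "Total Annual Budget for Schools": [24649428],
})

# Build a display copy; the original frame keeps its numeric dtypes
formatted = results.copy()
formatted["Number of Students"] = results["Number of Students"].map("{:,}".format)
formatted["Total Annual Budget for Schools"] = (
    results["Total Annual Budget for Schools"].map("${:,.2f}".format)
)

print(formatted)
print(results.dtypes)
```

With this split, `results` can still feed later calculations, while `formatted` is used only for display.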
