# Big Data Real-Time Analytics with Python and Spark

## Chapter 3 - Case Study 2 - Data Manipulation in Python with Pandas

- Documentation: https://pandas.pydata.org/
- Data generated in: https://www.mockaroo.com/

![Case Study 2DSA](images/CaseStudy2.png "Case Study DSA")

In [1]:
# Python version
from platform import python_version
print('The version used in this notebook is: ', python_version())

The version used in this notebook is:  3.8.8


In [2]:
# Install watermark package
!pip install -q -U watermark

In [3]:
# Import pandas
import pandas as pd

In [4]:
# Package versions used in this notebook
%reload_ext watermark
%watermark -a "Bianca Amorim" --iversions

Author: Bianca Amorim

pandas: 1.5.0



## Loading the dataset

In [5]:
# Loading the schools dataset
dataset_schools = pd.read_csv("datasets/dataset_schools.csv")

In [6]:
# Shape
dataset_schools.shape

(15, 5)

In [7]:
# View
dataset_schools.head()

| | ID_Escola | Nome_Escola | Tipo_Escola | Numero_Alunos | Orcamento_Anual |
|---|---|---|---|---|---|
| 0 | 0 | Escola A | Publica | 2917 | 1910635 |
| 1 | 1 | Escola B | Publica | 2949 | 1884411 |
| 2 | 2 | Escola C | Particular | 1761 | 1056600 |
| 3 | 3 | Escola D | Publica | 4635 | 3022020 |
| 4 | 4 | Escola E | Particular | 1468 | 917500 |


In [8]:
# Loading the students dataset
dataset_students = pd.read_csv("datasets/dataset_students.csv")

In [9]:
# Shape
dataset_students.shape

(39160, 7)

In [10]:
# View
dataset_students.head()

| | ID_Estudante | Nome_Estudante | Genero | Serie | Nome_Escola | Nota_Redacao | Nota_Matematica |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Kevin Bradley | M | 6 | Escola A | 66 | 79 |
| 1 | 1 | Paul Smith | M | 9 | Escola A | 94 | 61 |
| 2 | 2 | John Rodriguez | M | 9 | Escola A | 90 | 60 |
| 3 | 3 | Oliver Scott | M | 9 | Escola A | 67 | 58 |
| 4 | 4 | William Ray | F | 6 | Escola A | 97 | 84 |


In [11]:
# Merge datasets
dataset_full = pd.merge(dataset_students, dataset_schools, how="left", on="Nome_Escola")

In [12]:
# Shape
dataset_full.shape

(39160, 11)

In [13]:
# View
dataset_full.head()

| | ID_Estudante | Nome_Estudante | Genero | Serie | Nome_Escola | Nota_Redacao | Nota_Matematica | ID_Escola | Tipo_Escola | Numero_Alunos | Orcamento_Anual |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Kevin Bradley | M | 6 | Escola A | 66 | 79 | 0 | Publica | 2917 | 1910635 |
| 1 | 1 | Paul Smith | M | 9 | Escola A | 94 | 61 | 0 | Publica | 2917 | 1910635 |
| 2 | 2 | John Rodriguez | M | 9 | Escola A | 90 | 60 | 0 | Publica | 2917 | 1910635 |
| 3 | 3 | Oliver Scott | M | 9 | Escola A | 67 | 58 | 0 | Publica | 2917 | 1910635 |
| 4 | 4 | William Ray | F | 6 | Escola A | 97 | 84 | 0 | Publica | 2917 | 1910635 |
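
A left merge keeps every row of the left table, but it silently multiplies rows if the key repeats on the right side, and it fills unmatched rows with missing values. The sketch below uses made-up toy frames (not the notebook's CSV files) to show how the `validate` and `indicator` parameters of `pd.merge` can guard against both cases.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
students = pd.DataFrame({
    "Nome_Estudante": ["Ana", "Bruno", "Carla"],
    "Nome_Escola": ["Escola A", "Escola B", "Escola Z"],
})
schools = pd.DataFrame({
    "Nome_Escola": ["Escola A", "Escola B"],
    "Tipo_Escola": ["Publica", "Particular"],
})

# validate="many_to_one" raises an error if a school name repeats in `schools`;
# indicator=True adds a "_merge" column telling where each row found a match.
merged = pd.merge(
    students,
    schools,
    how="left",
    on="Nome_Escola",
    validate="many_to_one",
    indicator=True,
)

# Students whose school has no entry in `schools` are flagged "left_only"
unmatched = merged[merged["_merge"] == "left_only"]
print(merged.shape)
print(unmatched["Nome_Escola"].tolist())
```

If the row count after the merge equals the row count of the left table, as it does in this notebook (39160 before and after), no student row was duplicated by the join.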


In [14]:
# Which grade levels (Serie) appear in the data
dataset_full["Serie"].unique()

array([6, 9, 8, 7])

In [15]:
# Which gender values appear in the data
dataset_full["Genero"].unique()

array(['M', 'F'], dtype=object)
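
`unique()` lists the distinct values but not how often each one occurs. As a small sketch on a hypothetical series (the values are invented for illustration), `value_counts()` returns both at once:

```python
import pandas as pd

# Hypothetical grade levels, for illustration only
serie = pd.Series([6, 9, 9, 9, 6, 8], name="Serie")

# Number of rows per distinct value, ordered by grade level
counts = serie.value_counts().sort_index()
print(counts)
```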

## Data analytics challenge
Answer the following 10 questions.

> **1. How many schools do we have data for?**

In [16]:
# The instructor uses len(dataset_full["Nome_Escola"].unique()); nunique() is a shorter alternative
# (we could also count the rows of the schools dataset)
nschools = dataset_full["Nome_Escola"].nunique()
nschools

15
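
The two ways of counting distinct values agree only when the column has no missing entries. A minimal sketch on a hypothetical series (values invented for illustration) shows the difference: `unique()` keeps `NaN` as a value, while `nunique()` drops it by default.

```python
import numpy as np
import pandas as pd

# Hypothetical school names with one missing entry
names = pd.Series(["Escola A", "Escola B", np.nan, "Escola A"])

n_from_unique = len(names.unique())        # NaN is counted as a distinct value
n_from_nunique = names.nunique()           # NaN is ignored by default
n_keep_nan = names.nunique(dropna=False)   # NaN is counted again

print(n_from_unique, n_from_nunique, n_keep_nan)
```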

> **2. What is the total number of student records in the database?**

In [17]:
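# It would be better to count an ID column (ID_Estudante), since it uniquely identifies each record; names can repeat
# count() returns the number of non-null values, not the number of distinct ones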
nstudents_records = dataset_full["Nome_Estudante"].count()
nstudents_records

39160

> **3. What is the total budget considering all schools?**

In [18]:
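# The instructor sums the budget in the schools dataset only, to avoid the school rows duplicated by the merge
# (calling unique() on the merged data would drop a school whose budget equals another school's)
total_budget = dataset_schools["Orcamento_Anual"].sum()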
total_budget

24649428
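
One caveat when summing a per-school value from merged, student-level rows: deduplicating on the value itself loses a school whenever two schools happen to share the same budget. The sketch below uses hypothetical data (school names and amounts are invented) to contrast that with deduplicating on the school key.

```python
import pandas as pd

# Hypothetical merged rows: Escola X appears twice (two students),
# and Escola X and Escola Y share the same budget
merged = pd.DataFrame({
    "Nome_Escola": ["Escola X", "Escola X", "Escola Y", "Escola Z"],
    "Orcamento_Anual": [1000, 1000, 1000, 2500],
})

# Deduplicating on the value merges the budgets of Escola X and Escola Y
sum_of_unique_values = merged["Orcamento_Anual"].unique().sum()

# Deduplicating on the school name keeps one row per school
sum_per_school = merged.drop_duplicates(subset="Nome_Escola")["Orcamento_Anual"].sum()

print(sum_of_unique_values, sum_per_school)
```

Summing directly in the table that has one row per school avoids the problem altogether.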

> **4. What is the students' average grade in writing?**

In [19]:
mean_grade_writ = dataset_full["Nota_Redacao"].mean()
mean_grade_writ

81.87574055158325

> **5. What is the students' average grade in math?**

In [20]:
mean_grade_math = dataset_full["Nota_Matematica"].mean()
mean_grade_math

78.98493360572012

> **6. Considering that the passing grade is 70, how many students passed in writing? (Give the result in absolute value and percentage)**

In [21]:
# The instructor filters the rows with a boolean mask and then counts one column:
# -> dataset_full[dataset_full["Nota_Redacao"] >= 70].count()["Nome_Estudante"]
# Here we sum the boolean mask instead; ge(70) means "greater than or equal to 70"
students_pass_writing = dataset_full["Nota_Redacao"].ge(70).sum()
students_pass_writing

33600

In [22]:
# In Python 3 the / operator always performs true division, so no float() cast is needed
# (in Python 2, dividing two integers truncated the result, which lost precision)
perc_students_pass_writing = (students_pass_writing / nstudents_records) * 100
perc_students_pass_writing

85.80183861082737
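
Because `True` counts as 1 and `False` as 0, the mean of a boolean mask is already the pass rate, so the count and the percentage can come from the same mask. A minimal sketch on hypothetical grades (invented for illustration), assuming the column has no missing values:

```python
import pandas as pd

# Hypothetical writing grades, for illustration only
grades = pd.Series([66, 94, 70, 69, 97], name="Nota_Redacao")

passed = grades.ge(70)            # boolean mask: True where grade >= 70
n_passed = int(passed.sum())      # number of True values
pct_passed = passed.mean() * 100  # share of True values, as a percentage

print(n_passed, pct_passed)
```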

> **7. Considering that the passing grade is 70, how many students passed in math? (Give the result in absolute value and percentage)**

In [23]:
students_pass_math = dataset_full["Nota_Matematica"].ge(70).sum()
students_pass_math

29360

In [24]:
perc_students_pass_math = (students_pass_math / nstudents_records) * 100
perc_students_pass_math

74.97446373850867

> **8. Considering that the passing grade is 70, how many students passed in both math and writing? (Give the result in absolute value and percentage)**

In [25]:
# The questions above could also be answered this way, by filtering rows with a boolean mask
# We select ["Nome_Estudante"] at the end because count() returns one value per column and we only need one
students_pass_math_writing = dataset_full[(dataset_full["Nota_Redacao"] >= 70)
                                         & (dataset_full["Nota_Matematica"] >= 70)].count()["Nome_Estudante"]
students_pass_math_writing

25518

In [26]:
perc_students_pass_math_writing = (students_pass_math_writing / nstudents_records) * 100
perc_students_pass_math_writing

65.16343207354444
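
Questions 6 to 8 share the same structure, so the three counts and percentages can also be derived together from a small table of pass flags. The sketch below is one possible arrangement on hypothetical grades; the `PASSING_GRADE` constant and the `flags` frame are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical grades, for illustration only
df = pd.DataFrame({
    "Nota_Redacao": [66, 94, 70, 85],
    "Nota_Matematica": [79, 61, 70, 90],
})

PASSING_GRADE = 70  # assumed threshold, as stated in the questions

# One boolean column per criterion
flags = pd.DataFrame({
    "writing": df["Nota_Redacao"] >= PASSING_GRADE,
    "math": df["Nota_Matematica"] >= PASSING_GRADE,
})
# A student passes both only if both flags are True
flags["both"] = flags["writing"] & flags["math"]

counts = flags.sum()        # absolute number of students per criterion
rates = flags.mean() * 100  # percentage of students per criterion

print(counts)
print(rates)
```

The number of students passing both subjects can never exceed the number passing either one, which is a quick sanity check on the results.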

> **9. Create a dataframe with the results of questions 1 to 8 calculated above. (Tip: create a dictionary, then convert it to a pandas dataframe.)**

In [36]:
df_results = pd.DataFrame({'Number_Schools': [nschools],
               'Number_Students_Record': [nstudents_records],
               'Total_Budget': [total_budget],
               'Mean_Students_Writing':[mean_grade_writ], 
               'Mean_Students_Math': [mean_grade_math],
               'Students_Pass_Writing':[students_pass_writing],
               '%Students_Pass_Writing': [perc_students_pass_writing],
               'Students_Pass_Math': [students_pass_math],
               '%Students_Pass_Math': [perc_students_pass_math],
               'Students_Pass_Writing_Math': [students_pass_math_writing],
               '%Students_Pass_Writing_Math': [perc_students_pass_math_writing]})

In [37]:
df_results

| | Number_Schools | Number_Students_Record | Total_Budget | Mean_Students_Writing | Mean_Students_Math | Students_Pass_Writing | %Students_Pass_Writing | Students_Pass_Math | %Students_Pass_Math | Students_Pass_Writing_Math | %Students_Pass_Writing_Math |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 39160 | 24649428 | 81.875741 | 78.984934 | 33600 | 85.801839 | 29360 | 74.974464 | 25518 | 65.163432 |


In [38]:
type(df_results)

pandas.core.frame.DataFrame

> **10. Format the "Total Students" (`Number_Students_Record`) and "Total Budget" (`Total_Budget`) columns, adjusting the decimal places.**

In [39]:
df_results["Number_Students_Record"] = df_results["Number_Students_Record"].map("{:,}".format)
df_results["Total_Budget"] = df_results["Total_Budget"].map("${:,.2f}".format)

In [40]:
df_results

| | Number_Schools | Number_Students_Record | Total_Budget | Mean_Students_Writing | Mean_Students_Math | Students_Pass_Writing | %Students_Pass_Writing | Students_Pass_Math | %Students_Pass_Math | Students_Pass_Writing_Math | %Students_Pass_Writing_Math |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 39160 | $24,649,428.00 | 81.875741 | 78.984934 | 33600 | 85.801839 | 29360 | 74.974464 | 25518 | 65.163432 |
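
Formatting with `map` and a format string turns the numbers into text, so the formatted columns can no longer be summed or averaged. One option, sketched below, is to keep the numeric frame and write the formatted strings to a copy used only for display. The `results` and `display_df` names are introduced here for illustration; the two values are the ones computed above.

```python
import pandas as pd

# Numeric results (values taken from the answers to questions 2 and 3)
results = pd.DataFrame({
    "Number_Students_Record": [39160],
    "Total_Budget": [24649428],
})

# Format a copy, leaving the numeric frame untouched
display_df = results.copy()
display_df["Number_Students_Record"] = results["Number_Students_Record"].map("{:,}".format)
display_df["Total_Budget"] = results["Total_Budget"].map("${:,.2f}".format)

print(display_df)
print(results["Total_Budget"].sum())  # still a number
```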


# The End