<a href="https://colab.research.google.com/github/fmejias/CienciasDeLosDatosTEC/blob/master/BigData/Tareas/Tarea1/TP1_BigData_FelipeMejias.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Big Data
# Practical Assignment 1

- Professor: Luis Chavarría.

- Student:  
    - Felipe Alberto Mejías Loría, Instituto Tecnológico de Costa Rica. 

- November 28th, 2019

## **1-) Installing PySpark and Optimus**

In [0]:
# Install necessary libraries
!pip3 install pyspark
!pip install -q findspark
!pip install optimuspyspark

# Needed to install Spark in Google Colab
# Older Spark releases are kept on archive.apache.org
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz


# **2-) Setting the environment variables needed to run Spark in Google Colab**

In [0]:
# Set the environment variables required to use Apache Spark in Google Colab
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"
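A quick sanity check can catch a wrong path before Spark fails to start. The sketch below is an illustration and not part of the original notebook: `find_missing_dirs` is a hypothetical helper that lists the environment variables that are unset or point to a directory that does not exist.

```python
import os


def find_missing_dirs(variable_names, environ=None):
    """Return the variables that are unset or point to a missing directory."""
    # Hypothetical helper: defaults to the real process environment.
    environ = os.environ if environ is None else environ
    missing = []
    for name in variable_names:
        path = environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing


# An empty list means both directories were found.
print(find_missing_dirs(["JAVA_HOME", "SPARK_HOME"]))
```

Running it right after the assignments above would flag, for example, a Spark archive that was never downloaded or extracted.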

# **3-) Importing the libraries needed to run TP1**

In [0]:
# Necessary Imports for the execution of the TP1
import pandas as pd
import findspark
from datetime import datetime
from pyspark.sql import SparkSession, Row, dataframe
from pyspark.sql.functions import col, date_format, udf, array
from pyspark.sql.types import DateType
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
from optimus import Optimus
from urllib.error import HTTPError

# Set SPARK_HOME. Needed to initialize Apache Spark.
findspark.init("spark-2.4.4-bin-hadoop2.7")

# **4-) Utility functions for building DataFrames and extracting specific values from them**

In [0]:
# CSV Files Path
STUDENTS_CSV_PATH = "https://raw.githubusercontent.com/fmejias/CienciasDeLosDatosTEC/master/BigData/Tareas/Tarea1/estudiante.csv"
COURSE_CSV_PATH = "https://raw.githubusercontent.com/fmejias/CienciasDeLosDatosTEC/master/BigData/Tareas/Tarea1/curso.csv"
GRADES_CSV_PATH = "https://raw.githubusercontent.com/fmejias/CienciasDeLosDatosTEC/master/BigData/Tareas/Tarea1/nota.csv"

def create_spark_session():
  """
  This function builds a Spark Session (or reuses the active one)
  return the SparkSession, the entry point for creating Spark DataFrames
  """
  spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("Basic JDBC pipeline") \
    .getOrCreate()
  return spark

def show_complete_spark_data_frame(spark_data_frame):
  """
  This function shows the complete spark_data_frame
  """
  spark_data_frame.show(spark_data_frame.count(), False)

def create_spark_data_frame_from_csv_file(csv_file):
  """
  This function loads a Web CSV file into a Spark DataFrame using Optimus
  csv_file: Web CSV File
  return the Spark DataFrame from the CSV
  """
  try:
    op = Optimus()
    spark_data_frame = op.load.csv(csv_file)
    show_complete_spark_data_frame(spark_data_frame)
    return spark_data_frame
  except HTTPError as csv_ex:
    raise RuntimeError("El URL del archivo CSV especificado no existe: {}".format(
                csv_file)) from csv_ex

def get_column_values_to_list(data_frame, column_name):
  """
  This function returns the values of a column into a list
  data_frame: Spark DataFrame
  column_name: Column Name to get the values from
  """
  return data_frame.select(column_name).rdd.flatMap(lambda x: x).collect()

def join_spark_data_frames(data_frame_1, data_frame_2,
                           using_column_data_frame_1,
                           using_column_data_frame_2):
  """
  This function joins two Spark Data Frames
  data_frame_1: Spark DataFrame 1
  data_frame_2: Spark DataFrame 2
  using_column_data_frame_1: Column from DataFrame 1 to compare
  using_column_data_frame_2: Column from DataFrame 2 to compare
  return the Spark DataFrame from the JOIN
  """
  using_columns_statement = using_column_data_frame_1 == using_column_data_frame_2
  joint_data_frame = data_frame_1.join(data_frame_2, using_columns_statement)

  # To remove duplicated columns
  joint_data_frame = joint_data_frame.drop(using_column_data_frame_1)

  show_complete_spark_data_frame(joint_data_frame)
  return joint_data_frame

def create_data_frame_with_grades_by_student(joint_data_frame, student_carnet):
  """
  This function builds a DataFrame of grades by specified student
  joint_data_frame: Joint DataFrame with Grades, Students and Courses
  student_carnet: Student Carnet
  return the Spark DataFrame with the grades of the specified student
  """
  filter_statement = joint_data_frame.Carnet == student_carnet
  grades_by_student_data_frame = joint_data_frame.filter(filter_statement)
  show_complete_spark_data_frame(grades_by_student_data_frame)
  return grades_by_student_data_frame

def add_column_grades_times_credits_by_student(grades_by_student_data_frame):
  """
  This function adds a column with the product of credits and grade to the
  filtered DataFrame that contains the grades of the student
  grades_by_student_data_frame: Filtered DataFrame with the grades of the student
  return the Spark DataFrame with the grades of the specified student and an
         additional column with the calculation of grade times credits
  """
  grades_times_credits_op = grades_by_student_data_frame['Creditos']*grades_by_student_data_frame['Nota']
  grades_by_student_df = grades_by_student_data_frame.withColumn('CreditosxNotas',
                                                                 grades_times_credits_op)
  show_complete_spark_data_frame(grades_by_student_df)
  return grades_by_student_df

def create_weighted_average_row(student_data_frame):
  """
  This function creates a weighted average row for a specific student
  student_data_frame: DataFrame with the grades and grades times credits of a student
  return the Spark Row with the weighted average of a student
  """
  sum_of_credits = sum(get_column_values_to_list(student_data_frame,
                                                 'Creditos'))
  list_of_weighted_averages = get_column_values_to_list(student_data_frame,
                                                        'CreditosxNotas')
  weighted_average = sum(list_of_weighted_averages)/sum_of_credits
  student_name = set(get_column_values_to_list(student_data_frame,
                                               'NombreCompleto')).pop()
  career_name = set(get_column_values_to_list(student_data_frame,
                                              'Carrera')).pop()
  weighted_average_row = Row("NombreCompleto", "Carrera", "PromedioPonderado")
  return weighted_average_row(student_name, career_name, weighted_average)
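The weighted average computed in `create_weighted_average_row` is the sum of credits times grade divided by the sum of credits. The following plain-Python sketch reproduces that arithmetic without Spark; `weighted_average` is a hypothetical helper introduced only for illustration, and the numbers are the grades of student 2000 in the sample CSV files.

```python
def weighted_average(credits, grades):
    """Credit-weighted mean: sum(credit * grade) / sum(credit)."""
    if len(credits) != len(grades):
        raise ValueError("credits and grades must have the same length")
    total_credits = sum(credits)
    if total_credits == 0:
        raise ValueError("total credits must be positive")
    weighted_sum = sum(c * g for c, g in zip(credits, grades))
    return weighted_sum / total_credits


# Student 2000: courses worth 4, 4 and 3 credits with grades 95, 90 and 80.
# (4*95 + 4*90 + 3*80) / (4 + 4 + 3) = 980 / 11
print(weighted_average([4, 4, 3], [95, 90, 80]))
```

The explicit check on the total also makes visible a case the notebook does not handle: a student with no credits would cause a division by zero.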


# **5-) Main functions of the program and the main() function that finds the two best students per career**

In [15]:
def create_data_frame_of_weighted_averages(joint_data_frame):
  """
  This function creates the data frame of the students weighted averages
  joint_data_frame: DataFrame with the grades, courses and students info
  return the weighted averages Spark DataFrame
  """

  # Extract all carnets from joint data frame
  student_carnet_set = set(get_column_values_to_list(joint_data_frame,
                                                     'Carnet'))

  # Iterate through each of the students and create a data frame with the 
  # results of the student
  students_rows = []
  for student_carnet in student_carnet_set:
    print("Ahora se muestra el DataFrame con las notas del estudiante con carnet:",
          student_carnet,"\n")
    student_data_frame = create_data_frame_with_grades_by_student(joint_data_frame,
                                                                  student_carnet)
    
    print("Ahora se muestra el DataFrame con las notas y los ponderados por credito",
          "del estudiante con carnet:", student_carnet,"\n")
    student_data_frame = add_column_grades_times_credits_by_student(student_data_frame)

    # Create the weighted average row of the student
    student_weighted_average_row = create_weighted_average_row(student_data_frame)
    students_rows.append(student_weighted_average_row)
  
  # Create Weighted Averages DataFrame
  spark = create_spark_session()
  weighted_averages_data_frame = spark.createDataFrame(students_rows,
                                                       ['NombreCompleto',
                                                        'Carrera',
                                                        'PromedioPonderado'])

  # Show weighted_averages data frame
  print("Los promedios ponderados de los estudiantes son los siguientes:", "\n")
  show_complete_spark_data_frame(weighted_averages_data_frame)
  return weighted_averages_data_frame
  

def create_joint_spark_data_frames(student_data_frame, course_data_frame,
                                   grades_data_frame):
  """
  This function creates the data frame of the join of the three datasets
  student_data_frame: DataFrame with the students info
  course_data_frame: DataFrame with the courses info
  grades_data_frame: DataFrame with the grades info
  return the joint Spark DataFrame
  """

  print("\nLa unión de los datos de entrada de los cursos y las notas da el",
        "siguiente DataFrame: \n")
  joint_grades_and_course_df = join_spark_data_frames(course_data_frame,
                                                      grades_data_frame,
                                                      course_data_frame.CodigoCurso,
                                                      grades_data_frame.CodigoCurso)

  print("\nLa unión de los datos de entrada de los cursos y las notas, junto",
        "con los datos de los estudiantes da el siguiente DataFrame: \n")
  joint_students_grades_and_course_df = join_spark_data_frames(student_data_frame,
                                                               joint_grades_and_course_df,
                                                               student_data_frame.Carnet,
                                                               joint_grades_and_course_df.Carnet).drop(joint_grades_and_course_df.Carrera)
  return joint_students_grades_and_course_df

def select_best_n_students_per_career(weighted_averages_data_frame, n=2):
  """
  This function selects the best N students per career
  weighted_averages_data_frame: DataFrame with the weighted averages info
  n: number of students to select
  """
  # Extract all careers from weighted averages data frame
  careers_set = set(get_column_values_to_list(weighted_averages_data_frame,
                                              'Carrera'))
  for career in careers_set:
    filter_statement = weighted_averages_data_frame.Carrera == career
    filter_weighted_averages_df = weighted_averages_data_frame.filter(filter_statement)

    # Order by weighted average, highest first
    filter_weighted_averages_df = filter_weighted_averages_df.orderBy(filter_weighted_averages_df.PromedioPonderado.desc())
    
    # Select first N rows
    select_best_n_students_data_frame = filter_weighted_averages_df.limit(n)

    # Show best N students
    print("Los mejores ", n, "estudiantes de la carrera: ", career, "\n")
    show_complete_spark_data_frame(select_best_n_students_data_frame)


def main():
  """
  This function calculates the best weighted averages of N students per career
  """

  # Create Spark Data Frames from CSV
  print("\nLos datos de entrada de los estudiantes son los siguientes: \n")
  student_data_frame = create_spark_data_frame_from_csv_file(STUDENTS_CSV_PATH)

  print("\nLos datos de entrada de los cursos son los siguientes: \n")
  course_data_frame  = create_spark_data_frame_from_csv_file(COURSE_CSV_PATH)

  print("\nLos datos de entrada de las notas son los siguientes: \n")
  grades_data_frame  = create_spark_data_frame_from_csv_file(GRADES_CSV_PATH)

  # Joint Spark Data Frames
  joint_data_frame   = create_joint_spark_data_frames(student_data_frame,
                                                      course_data_frame,
                                                      grades_data_frame)
  
  # Create Weighted Averages Spark Data Frame
  weighted_averages_data_frame = create_data_frame_of_weighted_averages(joint_data_frame)

  # Select best two students per career
  select_best_n_students_per_career(weighted_averages_data_frame, n=2)

# Execute main program
main()


Los datos de entrada de los estudiantes son los siguientes: 

+------+----------------+--------------------------+
|Carnet|NombreCompleto  |Carrera                   |
+------+----------------+--------------------------+
|2000  |Felipe Mejias   |Ingenieria en Computadores|
|2001  |Daniel Canessa  |Ingenieria en Computadores|
|2002  |Daniel Chacon   |Ingenieria en Computadores|
|2003  |Edgar Campos    |Ingenieria Electronica    |
|2004  |Roberto Bolanos |Ingenieria Electronica    |
|2005  |Esteban Ferarios|Ingenieria Electronica    |
+------+----------------+--------------------------+


Los datos de entrada de los cursos son los siguientes: 

+-----------+--------+--------------------------+
|CodigoCurso|Creditos|Carrera                   |
+-----------+--------+--------------------------+
|1          |4       |Ingenieria en Computadores|
|2          |3       |Ingenieria Electronica    |
|3          |3       |Ingenieria Electronica    |
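|4          |2       |Ingenieria Electronica    |
|5          |4       |Ingenieria en Computadores|
|6          |3       |Ingenieria en Computadores|
+-----------+--------+--------------------------+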
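The selection step in `select_best_n_students_per_career` filters by career, sorts by weighted average in descending order and keeps the first N rows. The same logic can be sketched in plain Python to check the expected result by hand. `best_n_per_career` is a hypothetical helper, and the averages are the values this notebook computes for the sample CSV files.

```python
from collections import defaultdict


def best_n_per_career(rows, n=2):
    """Group (name, career, average) rows by career and keep the n highest."""
    by_career = defaultdict(list)
    for name, career, average in rows:
        by_career[career].append((name, average))
    return {
        career: sorted(students, key=lambda s: s[1], reverse=True)[:n]
        for career, students in by_career.items()
    }


rows = [
    ("Felipe Mejias", "Ingenieria en Computadores", 89.0909090909091),
    ("Daniel Canessa", "Ingenieria en Computadores", 78.63636363636364),
    ("Daniel Chacon", "Ingenieria en Computadores", 85.9090909090909),
    ("Edgar Campos", "Ingenieria Electronica", 86.25),
    ("Roberto Bolanos", "Ingenieria Electronica", 89.375),
    ("Esteban Ferarios", "Ingenieria Electronica", 76.875),
]

for career, students in best_n_per_career(rows, n=2).items():
    print(career, [name for name, _ in students])
```

With real Spark data the grouping and ranking would normally stay inside Spark rather than being collected into Python; this sketch only mirrors the ordering rule.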

# **6-) Unit tests with Pytest**

**6.1) Installing Pytest in Google Colab**

In [0]:
!pip install ipytest
!pip install pytest

**6.2) Importing Pytest and the magic commands needed to run Pytest in Google Colab**

In [0]:
import ipytest.magics
import pytest
import sys

# This is needed in order to fix the __file__ issue that Google Colab throws
__file__ = sys.argv[0]

**6.3) Helper data for the unit tests**

In [0]:
import pandas as pd

# Dictionary with students data information
students_data_1 = {
    'Carnet' : [1000, 1001, 1002, 1003,
                1004, 1005, 1006, 1007,
                1008],
    'NombreCompleto' : ["Felipe", "Daniel", "Luis Daniel", "Melvin",
                        "Roberto", "Esteban", "Andres", "Edgar",
                        "Thomas"],
    'Carrera' : ['Computadores', 'Computadores', 'Computadores', 'Computadores',
                 'Electronica', 'Electronica', 'Electronica', 'Electronica',
                 'ATI']
}

# Dictionary with courses data information
courses_data_1 = {
    'CodigoCurso' : [1, 2, 3,
                     4, 5,
                     6],
    'Creditos' : [4, 3, 2,
                  4, 3,
                  4],
    'Carrera' : ['Computadores', 'Computadores', 'Computadores',
                 'Electronica', 'Electronica',
                 'ATI']
}

# Dictionary with grades data information
# This dictionary represents the case where a student from ATI
# has not enrolled in any courses yet.
grades_data_1 = {
    'Carnet' : [1000, 1000, 1000,
                1001, 1001, 1001,
                1002, 1002, 1002,
                1003, 1003, 1003,
                1004, 1004,
                1005, 1005,
                1006, 1006,
                1007, 1007],
    'CodigoCurso' : [1, 3, 2,
                     1, 3, 2,
                     1, 3, 2,
                     1, 3, 2,
                     4, 5,
                     4, 5,
                     4, 5,
                     4, 5],
    'Nota' : [90, 95, 70,
              75, 85, 80,
              85, 85, 85,
              70, 95, 95,
              90, 95,
              80, 85,
              90, 85,
              70, 85]
}

# Expected join of the courses and grades CSV files, as a dictionary
joint_between_courses_and_grades = {
    'Creditos': [4, 4, 3, 4, 4, 3, 4, 4, 3,
                 3, 3, 2, 3, 3, 2, 3, 3, 2],
    'Carrera': ["Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria Electronica",
                "Ingenieria Electronica",
                "Ingenieria Electronica",
                "Ingenieria Electronica",
                "Ingenieria Electronica",
                "Ingenieria Electronica",
                "Ingenieria Electronica",
                "Ingenieria Electronica",
                "Ingenieria Electronica"],
    'Carnet': [2000, 2000, 2000,
               2001, 2001, 2001,
               2002, 2002, 2002,
               2003, 2003, 2003,
               2004, 2004, 2004,
               2005, 2005, 2005],
    'CodigoCurso': [1, 5, 6, 1, 5, 6, 1, 5, 6, 2, 3, 4, 2, 3, 4, 2, 3, 4],
    'Nota': [95, 90, 80, 90, 70, 75, 85, 95, 75, 85, 95, 75, 80, 95, 95,
             70, 85, 75]
}

# Convert expected dictionary to Pandas DataFrame
joint_between_courses_and_grades_pandas_df = pd.DataFrame.from_dict(joint_between_courses_and_grades)

# Convert Pandas DataFrame to Spark DataFrame
spark = create_spark_session()
schema = StructType([
    StructField("Creditos", IntegerType()),
    StructField("Carrera", StringType()),
    StructField("Carnet", IntegerType()),
    StructField("CodigoCurso", IntegerType()),
    StructField("Nota", IntegerType())])
expected_joint_between_courses_and_grades_spark_df = spark.createDataFrame(joint_between_courses_and_grades_pandas_df, schema)

# Expected join of the students, courses and grades CSV files, as a dictionary
final_joint = {
    'NombreCompleto': ["Felipe Mejias",
                       "Felipe Mejias",
                       "Felipe Mejias",
                       "Daniel Canessa",
                       "Daniel Canessa",
                       "Daniel Canessa",
                       "Daniel Chacon",
                       "Daniel Chacon",
                       "Daniel Chacon",
                       "Edgar Campos",
                       "Edgar Campos",
                       "Edgar Campos",
                       "Roberto Bolanos",
                       "Roberto Bolanos",
                       "Roberto Bolanos",
                       "Esteban Ferarios",
                       "Esteban Ferarios",
                       "Esteban Ferarios"],
    'Carrera': ["Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria Electronica",
                "Ingenieria Electronica",
                "Ingenieria Electronica",
                "Ingenieria Electronica",
                "Ingenieria Electronica",
                "Ingenieria Electronica",
                "Ingenieria Electronica",
                "Ingenieria Electronica",
                "Ingenieria Electronica"],
    'Creditos': [4, 4, 3, 4, 4, 3, 4, 4, 3,
                 3, 3, 2, 3, 3, 2, 3, 3, 2],
    'Carnet': [2000, 2000, 2000,
               2001, 2001, 2001,
               2002, 2002, 2002,
               2003, 2003, 2003,
               2004, 2004, 2004,
               2005, 2005, 2005],
    'CodigoCurso': [1, 5, 6, 1, 5, 6, 1, 5, 6, 2, 3, 4, 2, 3, 4, 2, 3, 4],
    'Nota': [95, 90, 80, 90, 70, 75, 85, 95, 75, 85, 95, 75, 80, 95, 95,
             70, 85, 75]
}

# Convert expected dictionary to Pandas DataFrame
final_joint_pandas_df = pd.DataFrame.from_dict(final_joint)

# Convert Pandas DataFrame to Spark DataFrame
final_joint_spark_df = spark.createDataFrame(final_joint_pandas_df)

# Expected tables per student
expected_student_1_notes = {
    'NombreCompleto': ["Felipe Mejias",
                       "Felipe Mejias",
                       "Felipe Mejias"],
    'Carrera': ["Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores"],
    'Creditos': [4, 4, 3],
    'Carnet': [2000, 2000, 2000],
    'CodigoCurso': [1, 5, 6],
    'Nota': [95, 90, 80]
}

# Convert expected dictionary to Pandas DataFrame
expected_student_1_notes_pandas_df = pd.DataFrame.from_dict(expected_student_1_notes)

# Convert Pandas DataFrame to Spark DataFrame
expected_student_1_notes_spark_df = spark.createDataFrame(expected_student_1_notes_pandas_df)

expected_student_1_notes_and_notes_times_credits = {
    'NombreCompleto': ["Felipe Mejias",
                       "Felipe Mejias",
                       "Felipe Mejias"],
    'Carrera': ["Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores"],
    'Creditos': [4, 4, 3],
    'Carnet': [2000, 2000, 2000],
    'CodigoCurso': [1, 5, 6],
    'Nota': [95, 90, 80],
    'CreditosxNotas': [380, 360, 240]
}

# Convert expected dictionary to Pandas DataFrame
expected_student_1_notes_and_notes_times_credits_pandas_df = pd.DataFrame.from_dict(expected_student_1_notes_and_notes_times_credits)

# Convert Pandas DataFrame to Spark DataFrame
expected_student_1_notes_and_notes_times_credits_spark_df = spark.createDataFrame(expected_student_1_notes_and_notes_times_credits_pandas_df)

expected_student_2_notes = {
    'NombreCompleto': ["Daniel Canessa",
                       "Daniel Canessa",
                       "Daniel Canessa"],
    'Carrera': ["Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores"],
    'Creditos': [4, 4, 3],
    'Carnet': [2001, 2001, 2001],
    'CodigoCurso': [1, 5, 6],
    'Nota': [90, 70, 75]
}

expected_student_3_notes = {
    'NombreCompleto': ["Daniel Chacon",
                       "Daniel Chacon",
                       "Daniel Chacon"],
    'Carrera': ["Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores"],
    'Creditos': [4, 4, 3],
    'Carnet': [2002, 2002, 2002],
    'CodigoCurso': [1, 5, 6],
    'Nota': [85, 95, 75]
}

# Expected weighted averages table
expected_weighted_averages = {
    'NombreCompleto': ["Felipe Mejias",
                       "Daniel Canessa",
                       "Daniel Chacon",
                       "Edgar Campos",
                       "Roberto Bolanos",
                       "Esteban Ferarios"],
    'Carrera': ["Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria en Computadores",
                "Ingenieria Electronica",
                "Ingenieria Electronica",
                "Ingenieria Electronica"],
    'PromedioPonderado': [89.0909090909091,
                          78.63636363636364,
                          85.9090909090909,
                          86.25,
                          89.375,
                          76.875]
}

# Convert expected dictionary to Pandas DataFrame
expected_weighted_averages_pandas_df = pd.DataFrame.from_dict(expected_weighted_averages)

# Convert Pandas DataFrame to Spark DataFrame
expected_weighted_averages_spark_df = spark.createDataFrame(expected_weighted_averages_pandas_df)

expected_weighted_averages_per_career_1 = {
    'NombreCompleto': ["Felipe Mejias",
                       "Daniel Chacon"],
    'Carrera': ["Ingenieria en Computadores",
                "Ingenieria en Computadores"],
    'PromedioPonderado': [89.0909090909091,
                          85.9090909090909]
}

# Convert expected dictionary to Pandas DataFrame
expected_weighted_averages_per_career_1_pandas_df = pd.DataFrame.from_dict(expected_weighted_averages_per_career_1)

# Convert Pandas DataFrame to Spark DataFrame
expected_weighted_averages_per_career_1_spark_df = spark.createDataFrame(expected_weighted_averages_per_career_1_pandas_df)

expected_weighted_averages_per_career_2 = {
    'NombreCompleto': ["Roberto Bolanos",
                       "Edgar Campos"],
    'Carrera': ["Ingenieria Electronica",
                "Ingenieria Electronica"],
    'PromedioPonderado': [89.375,
                          86.25]
}

# Convert expected dictionary to Pandas DataFrame
expected_weighted_averages_per_career_2_pandas_df = pd.DataFrame.from_dict(expected_weighted_averages_per_career_2)

# Convert Pandas DataFrame to Spark DataFrame
expected_weighted_averages_per_career_2_spark_df = spark.createDataFrame(expected_weighted_averages_per_career_2_pandas_df)
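Hand-written column dictionaries like these are easy to get wrong: Python concatenates adjacent string literals, so a missing comma silently shortens a list instead of raising an error. A small check on column lengths catches this before the data reaches pandas or Spark. `find_ragged_columns` is a hypothetical helper, sketched here only as an illustration.

```python
def find_ragged_columns(data):
    """Return {column: length} for columns whose length differs from the first."""
    lengths = {name: len(values) for name, values in data.items()}
    if not lengths:
        return {}
    expected = next(iter(lengths.values()))
    return {name: size for name, size in lengths.items() if size != expected}


# The missing comma after "Edgar" merges two names into "EdgarThomas".
sample = {
    "Carnet": [1006, 1007, 1008],
    "NombreCompleto": ["Andres", "Edgar"
                       "Thomas"],
}
print(find_ragged_columns(sample))
```

An empty result means every column has the same number of values, which is also what `pd.DataFrame.from_dict` requires for dictionaries of lists.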

**6.4) Unit tests for the data joins**

In [58]:
%%run_pytest[clean] -qq
# The cell magic above must be the first line of the cell; it is needed to run the UTs in Google Colab

def test_create_succesful_spark_session():
    assert create_spark_session() is not None

def test_create_spark_data_frame_from_none_csv_file_path():
    non_existent_csv_url_path = "https://raw.githubusercontent.com/fmejias/CienciasDeLosDatosTEC/master/BigData/Tareas/Tarea1/estudiante2.csv"
    with pytest.raises((HTTPError, Exception)):
      create_spark_data_frame_from_csv_file(non_existent_csv_url_path)

def test_create_spark_data_frame_from_students_csv_file_path():
    student_spark_data_frame = create_spark_data_frame_from_csv_file(STUDENTS_CSV_PATH)
    assert student_spark_data_frame is not None
    assert isinstance(student_spark_data_frame, dataframe.DataFrame)

def test_create_spark_data_frame_from_courses_csv_file_path():
    courses_spark_data_frame = create_spark_data_frame_from_csv_file(COURSE_CSV_PATH)
    assert courses_spark_data_frame is not None
    assert isinstance(courses_spark_data_frame, dataframe.DataFrame)

def test_create_spark_data_frame_from_grades_csv_file_path():
    grades_spark_data_frame = create_spark_data_frame_from_csv_file(GRADES_CSV_PATH)
    assert grades_spark_data_frame is not None
    assert isinstance(grades_spark_data_frame, dataframe.DataFrame)

def test_joint_between_two_spark_data_frames():
    courses_spark_data_frame = create_spark_data_frame_from_csv_file(COURSE_CSV_PATH)
    grades_spark_data_frame = create_spark_data_frame_from_csv_file(GRADES_CSV_PATH)
    joint_grades_and_course_df = join_spark_data_frames(courses_spark_data_frame,
                                                        grades_spark_data_frame,
                                                        courses_spark_data_frame.CodigoCurso,
                                                        grades_spark_data_frame.CodigoCurso)

    # Get a DataFrame with the rows that are in joint_grades_and_course_df
    # but not in expected_joint_between_courses_and_grades_spark_df
    dataframes_difference = joint_grades_and_course_df.exceptAll(expected_joint_between_courses_and_grades_spark_df)
    assert dataframes_difference.count() == 0

def test_final_joint_from_three_spark_data_frames():
    student_spark_data_frame = create_spark_data_frame_from_csv_file(STUDENTS_CSV_PATH)
    courses_spark_data_frame = create_spark_data_frame_from_csv_file(COURSE_CSV_PATH)
    grades_spark_data_frame = create_spark_data_frame_from_csv_file(GRADES_CSV_PATH)

    # Create Joint Data Frame
    joint_data_frame = create_joint_spark_data_frames(student_spark_data_frame,
                                                      courses_spark_data_frame,
                                                      grades_spark_data_frame)
    
    dataframes_difference = joint_data_frame.exceptAll(final_joint_spark_df)
    assert dataframes_difference.count() == 0

# Execute these UTs
ipytest.run_tests()

unittest.case.FunctionTestCase (test_create_spark_data_frame_from_courses_csv_file_path) ... 

+-----------+--------+--------------------------+
|CodigoCurso|Creditos|Carrera                   |
+-----------+--------+--------------------------+
|1          |4       |Ingenieria en Computadores|
|2          |3       |Ingenieria Electronica    |
|3          |3       |Ingenieria Electronica    |
|4          |2       |Ingenieria Electronica    |
|5          |4       |Ingenieria en Computadores|
|6          |3       |Ingenieria en Computadores|
+-----------+--------+--------------------------+



ok
unittest.case.FunctionTestCase (test_create_spark_data_frame_from_grades_csv_file_path) ... 

+------+-----------+----+
|Carnet|CodigoCurso|Nota|
+------+-----------+----+
|2000  |1          |95  |
|2000  |5          |90  |
|2000  |6          |80  |
|2001  |1          |90  |
|2001  |5          |70  |
|2001  |6          |75  |
|2002  |1          |85  |
|2002  |5          |95  |
|2002  |6          |75  |
|2003  |2          |85  |
|2003  |3          |95  |
|2003  |4          |75  |
|2004  |2          |80  |
|2004  |3          |95  |
|2004  |4          |95  |
|2005  |2          |70  |
|2005  |3          |85  |
|2005  |4          |75  |
+------+-----------+----+



ok
unittest.case.FunctionTestCase (test_create_spark_data_frame_from_none_csv_file_path) ... ok
unittest.case.FunctionTestCase (test_create_spark_data_frame_from_students_csv_file_path) ... 

+------+----------------+--------------------------+
|Carnet|NombreCompleto  |Carrera                   |
+------+----------------+--------------------------+
|2000  |Felipe Mejias   |Ingenieria en Computadores|
|2001  |Daniel Canessa  |Ingenieria en Computadores|
|2002  |Daniel Chacon   |Ingenieria en Computadores|
|2003  |Edgar Campos    |Ingenieria Electronica    |
|2004  |Roberto Bolanos |Ingenieria Electronica    |
|2005  |Esteban Ferarios|Ingenieria Electronica    |
+------+----------------+--------------------------+



ok
unittest.case.FunctionTestCase (test_create_succesful_spark_session) ... ok
unittest.case.FunctionTestCase (test_final_joint_from_three_spark_data_frames) ... 

+------+----------------+--------------------------+
|Carnet|NombreCompleto  |Carrera                   |
+------+----------------+--------------------------+
|2000  |Felipe Mejias   |Ingenieria en Computadores|
|2001  |Daniel Canessa  |Ingenieria en Computadores|
|2002  |Daniel Chacon   |Ingenieria en Computadores|
|2003  |Edgar Campos    |Ingenieria Electronica    |
|2004  |Roberto Bolanos |Ingenieria Electronica    |
|2005  |Esteban Ferarios|Ingenieria Electronica    |
+------+----------------+--------------------------+

+-----------+--------+--------------------------+
|CodigoCurso|Creditos|Carrera                   |
+-----------+--------+--------------------------+
|1          |4       |Ingenieria en Computadores|
|2          |3       |Ingenieria Electronica    |
|3          |3       |Ingenieria Electronica    |
|4          |2       |Ingenieria Electronica    |
|5          |4       |Ingenieria en Computadores|
|6          |3       |Ingenieria en Computadores|
+-----------+--------+--------------------------+

ok
unittest.case.FunctionTestCase (test_joint_between_two_spark_data_frames) ... 

+-----------+--------+--------------------------+
|CodigoCurso|Creditos|Carrera                   |
+-----------+--------+--------------------------+
|1          |4       |Ingenieria en Computadores|
|2          |3       |Ingenieria Electronica    |
|3          |3       |Ingenieria Electronica    |
|4          |2       |Ingenieria Electronica    |
|5          |4       |Ingenieria en Computadores|
|6          |3       |Ingenieria en Computadores|
+-----------+--------+--------------------------+

+------+-----------+----+
|Carnet|CodigoCurso|Nota|
+------+-----------+----+
|2000  |1          |95  |
|2000  |5          |90  |
|2000  |6          |80  |
|2001  |1          |90  |
|2001  |5          |70  |
|2001  |6          |75  |
|2002  |1          |85  |
|2002  |5          |95  |
|2002  |6          |75  |
|2003  |2          |85  |
|2003  |3          |95  |
|2003  |4          |75  |
|2004  |2          |80  |
|2004  |3          |95  |
|2004  |4          |95  |
|2005  |2          |70  |
|2005  |2          |70  |
|2005  |3          |85  |
|2005  |4          |75  |
+------+-----------+----+

ok

----------------------------------------------------------------------
Ran 7 tests in 5.436s

OK


.......                                                                  [100%]
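The join tests compute `exceptAll` in one direction only (actual minus expected), so a result that is missing rows would still pass. The plain-Python sketch below illustrates the idea of checking both directions, using `collections.Counter` as a stand-in for Spark's multiset difference. The helper names and the sample rows are assumptions made for this example; in PySpark the analogous check would presumably call `exceptAll` once in each direction.

```python
from collections import Counter


def multiset_difference(left_rows, right_rows):
    """Rows of left_rows that have no match in right_rows, keeping duplicates."""
    return list((Counter(left_rows) - Counter(right_rows)).elements())


def same_rows(actual_rows, expected_rows):
    """True only when neither side has unmatched rows."""
    return (not multiset_difference(actual_rows, expected_rows)
            and not multiset_difference(expected_rows, actual_rows))


# The actual result is missing one expected row.
actual = [(2000, 1, 95), (2000, 5, 90)]
expected = [(2000, 1, 95), (2000, 5, 90), (2000, 6, 80)]

# The one-directional difference is empty, so a test based on it alone passes.
print(multiset_difference(actual, expected))
# The two-directional comparison reports the mismatch.
print(same_rows(actual, expected))
```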
