# <font color="AA00FF">Exam Scores</font>

## <font color="AA00FF">Project Participants</font>

Stakeholder(s): Me

Project Manager(s): Me

Data Analyst(s): erm.... Me

## <font color="AA00FF">Problem Definition</font>

The purpose of this project is to gather insights from a collection of exam scores, with the following main goals:

* The average score for each gender
* The average score for each group

## <font color="AA00FF">Import Required Libraries</font>

The libraries below are required by this notebook.

In [1]:
import pandas as pd
import numpy as np
import plotly as ply
import plotly.express as px
import plotly.graph_objs as go

from platform import python_version

Show the versions of the libraries that will be used.

In [2]:
print(f"Numpy Version:  {np.__version__}")
print(f"Pandas Version: {pd.__version__}")
print(f"Plotly Version: {ply.__version__}")
print(f"Python Version: {python_version()}")

Numpy Version:  1.23.3
Pandas Version: 1.4.4
Plotly Version: 5.9.0
Python Version: 3.10.6


## <font color="AA00FF">Data Processing</font>

### <font color="AA0088">Data Source(s)</font>

The data for this project was acquired from the following location(s):

### <font color="AA0088">Data Overview</font>

Give an overview of what the data represents as a whole.

### <font color="AA0088">Data Dictionary</font>

No data dictionary has been provided by the supplier of the data. 

From initial investigation of the features, a data dictionary has been created and can be found [here](data/raw-data-dictionary.xlsx).

**<font color="AA0088">NOTE</font>**: The description given for each feature is an assumption, but a logical one based on the name and the values of that feature.
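A skeleton for such a data dictionary can be generated from the dataframe itself and then completed by hand. The sketch below is only an illustration: `sample_df` and `build_data_dictionary` are hypothetical names that do not appear in this notebook, and the sample rows are made up to resemble the raw data.

```python
import pandas as pd

# Hypothetical miniature of the raw data (the notebook reads the real data from a csv file).
sample_df = pd.DataFrame({
    "gender": ["male", "female", "male"],
    "lunch": ["standard", "free/reduced", "standard"],
    "math score": [67, 40, 59],
})


def build_data_dictionary(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise each feature: dtype, number of distinct values and a few example values."""
    rows = []
    for column in df.columns:
        rows.append({
            "feature": column,
            "dtype": str(df[column].dtype),
            "unique_values": df[column].nunique(),
            "examples": ", ".join(map(str, df[column].unique()[:3])),
            "description": "",  # assumption: descriptions are written manually afterwards
        })
    return pd.DataFrame(rows)


data_dictionary_df = build_data_dictionary(sample_df)
print(data_dictionary_df)
```

The empty `description` column is where the assumed meaning of each feature would be recorded.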

### <font color="AA0088">Import The Raw Data</font>

The first step is to import the data from its source into a pandas dataframe. In this case, the source data is a csv file.

In [3]:
raw_data_df = pd.read_csv("./data/raw_data.csv")

### <font color="AA0088">Details About The Raw Data Dataframe</font>

Let us have a quick look at the first five rows of the dataframe.

In [4]:
raw_data_df.head(n = 5)

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | male | group A | high school | standard | completed | 67 | 67 | 63 |
| 1 | female | group D | some high school | free/reduced | none | 40 | 59 | 55 |
| 2 | male | group E | some college | free/reduced | none | 59 | 60 | 50 |
| 3 | male | group B | high school | standard | none | 77 | 78 | 68 |
| 4 | male | group E | associate's degree | standard | completed | 78 | 73 | 68 |


Now let's have a look at the information about the raw_data_df.

In [5]:
raw_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


Initial observations:

* 8 features:
  * 5 strings (object)
  * 3 integers (int64)
* 1000 rows
* No null values, although this will be verified during data cleaning

Some of the features have spaces or / in their names, so these characters should be replaced with _ to make querying simpler.
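To illustrate why underscored names are easier to work with, here is a minimal sketch on a made-up sample (`sample_df` and `renamed_df` are hypothetical names, not part of this notebook). With the original names only bracket access works, whereas the renamed columns can also be used in `query()` expressions and as attributes.

```python
import pandas as pd

# Hypothetical sample with the same style of column names as the raw data.
sample_df = pd.DataFrame({
    "race/ethnicity": ["group A", "group D", "group E"],
    "math score": [67, 40, 59],
})

# With spaces in the name, filtering needs bracket access:
high_scores_before = sample_df[sample_df["math score"] > 50]

# Replace spaces and slashes with underscores in every column name:
renamed_df = sample_df.rename(columns=lambda name: name.replace(" ", "_").replace("/", "_"))

# The renamed columns work in query() and as attributes:
high_scores_after = renamed_df.query("math_score > 50")
groups = renamed_df.race_ethnicity.tolist()

print(high_scores_after)
print(groups)
```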

### <font color="AA0088">Cleaning Up The Data</font>

Before performing any data cleaning, every space or / in a feature name needs to be replaced with an _.

First, make a copy of the raw_data_df so that it stays intact in case it is needed later on.

In [6]:
exam_scores_df = raw_data_df.copy()
exam_scores_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


Now let's change the names of the features to remove the spaces and /'s. 

Once done, the columns will then have any uppercase values converted to lowercase.

In [7]:
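# --- Replace spaces and slashes as literal text (regex = False avoids regex interpretation):
exam_scores_df.columns = exam_scores_df.columns.str.replace(" ", "_", regex = False)
exam_scores_df.columns = exam_scores_df.columns.str.replace("/", "_", regex = False)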

# --- Convert any uppercase characters to lowercase:
exam_scores_df.columns = exam_scores_df.columns.str.lower()

In [8]:
exam_scores_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race_ethnicity               1000 non-null   object
 2   parental_level_of_education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test_preparation_course      1000 non-null   object
 5   math_score                   1000 non-null   int64 
 6   reading_score                1000 non-null   int64 
 7   writing_score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


Now that the column names are in a more preferred format, let's have a look to see if there are any null values.

In [9]:
exam_scores_df.isna().sum()

gender                         0
race_ethnicity                 0
parental_level_of_education    0
lunch                          0
test_preparation_course        0
math_score                     0
reading_score                  0
writing_score                  0
dtype: int64

Ok, there are no null values in any of the features.
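Since the real data contains no nulls, the check above only ever shows zeros. As a small sketch of what the same check reports when values are missing, the example below uses a made-up dataframe (`sample_df` is a hypothetical name and its rows are invented for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical sample with deliberately missing values.
sample_df = pd.DataFrame({
    "gender": ["male", None, "female"],
    "math_score": [67.0, 40.0, np.nan],
    "reading_score": [67, 59, 60],
})

# Number of missing values per feature:
null_counts = sample_df.isna().sum()

# A single yes / no answer for the whole dataframe:
has_nulls = bool(sample_df.isna().any().any())

print(null_counts)
print(f"Any nulls: {has_nulls}")
```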

### <font color="AA0088">Export The Working Set Dataframe</font>

Before going any further, let's make a backup of the current progress of the dataframe to a csv file.

In [10]:
exam_scores_df.to_csv(path_or_buf = "./data/exam_scores.csv", 
                      index = False)

Now let's move onto the exploratory data analysis phase.

## <font color="AA00FF">Exploratory Data Analysis</font>

First, let's define some custom classes and functions to create the plots / charts that will be used.

### <font color="AA0088">Required Custom Constants And Variables</font>

In [11]:
# --- Set the colour theme for the charts and figures:
CHART_COLOR_THEME = px.colors.sequential.Agsunset

### <font color="AA0088">Required Custom Classes And Functions</font>

In [12]:
def create_pie_chart(df_source:pd.DataFrame, 
                     feature_name:str, 
                     fig_title:str):
    """
    Generate a pie chart for a single feature of a dataframe.
    
    Args:
        df_source (pd.DataFrame): The dataframe that you wish to use.
        feature_name (str): The name of the feature (column) to use from the dataframe.
        fig_title (str): The title that will be shown on the pie chart.

    Returns:
        plotly graph (figure): A pie chart showing the share of each value of the feature.
    """
    
    
    # --- Count the occurrences of each value once and reuse the result:
    value_counts = df_source[feature_name].value_counts()
    
    
    fig_pie = px.pie(df_source,
                     values = value_counts,
                     names = value_counts.index,
                     title = fig_title.title(),
                     width = 500,
                     color_discrete_sequence = CHART_COLOR_THEME,
                     )


    fig_pie.update_traces(textposition = "inside", 
                          textinfo = "label+percent",
                          textfont_size = 14,
                          textfont_color = "white",
                          hovertemplate = "<b>%{label}</b><br><br>Total: %{value}<br>Percent: %{percent}",
                          hoverlabel = dict(font = dict(color = "white"))
                          )
    
    
    fig_pie.update_layout({"title_font_size": 24,
                           "title_x": 0.50,
                           "legend_title": "Legend"})
    
    
    return fig_pie

In [13]:
# --- List all of the figure functions available in this notebook:
list_figure_options = [var_name for var_name in dir() if var_name.startswith("create")]

### <font color="AA0088">Total Examinees By Gender</font>

In [14]:
fig_pie_gender = create_pie_chart(df_source = exam_scores_df, 
                                 feature_name = "gender", 
                                 fig_title = "Total examinees by gender")


fig_pie_gender.show();

Initial findings show that there are more male examinees than there are female examinees in this dataset.

### <font color="AA0088">Total Examinees By Race / Ethnicity</font>

In [15]:
fig_pie_race = create_pie_chart(df_source = exam_scores_df, 
                               feature_name = "race_ethnicity", 
                               fig_title = "Total examinees by race / ethnicity")


fig_pie_race.show();

Initial findings show that group C has the most examinees, followed by groups D, B, E and A.

### <font color="AA0088">Total Examinees By Parental Education</font>

In [16]:
fig_pie_education = create_pie_chart(df_source = exam_scores_df, 
                                     feature_name = "parental_level_of_education", 
                                     fig_title = "Total examinees by parental education")


fig_pie_education.show();

### <font color="AA0088">Total Examinees By Lunch Fee</font>

In [17]:
fig_pie_lunch = create_pie_chart(df_source = exam_scores_df, 
                                 feature_name = "lunch", 
                                 fig_title = "Total examinees by lunch fee")


fig_pie_lunch.show();

Conclusion: Most examinees (65.2%) pay the standard price for their lunches with 34.8% paying a reduced rate or getting a free lunch.
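Percentages like these can also be computed directly rather than read off the chart. The sketch below uses a made-up series (`sample_lunch` is a hypothetical name and its values are not the real data), so the resulting figures differ from the ones quoted above.

```python
import pandas as pd

# Hypothetical sample of the lunch feature.
sample_lunch = pd.Series(["standard", "standard", "free/reduced", "standard"], name="lunch")

# normalize = True returns proportions instead of counts; multiply by 100 for percentages:
lunch_share = (sample_lunch.value_counts(normalize=True) * 100).round(1)

print(lunch_share)
```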

### <font color="AA0088">Total Examinees That Completed The Prep Courses</font>

In [18]:
fig_pie_prep_course = create_pie_chart(df_source = exam_scores_df, 
                                       feature_name = "test_preparation_course", 
                                       fig_title = "Examinees that completed<br>prep courses")


fig_pie_prep_course.show();

Conclusion: Only one third of all examinees completed the prep course(s) for the exams.

### <font color="AA0088">Average Scores By Gender</font>

In [19]:
# --- Compute the average of each score per gender once and reuse the result:
avg_scores_by_gender = exam_scores_df.groupby("gender")[["math_score", "reading_score", "writing_score"]].mean()

fig_bar_avg_gender_score = go.Figure(data = [
     go.Bar(name = "Maths",
            x = avg_scores_by_gender.index, 
            y = avg_scores_by_gender["math_score"].values,
            text = np.round(avg_scores_by_gender["math_score"].values, 2),
            marker_color = CHART_COLOR_THEME[0]),
     go.Bar(name = "Reading", 
            x = avg_scores_by_gender.index, 
            y = avg_scores_by_gender["reading_score"].values,
            text = np.round(avg_scores_by_gender["reading_score"].values, 2),
            marker_color = CHART_COLOR_THEME[1]),
     go.Bar(name = "Writing", 
            x = avg_scores_by_gender.index, 
            y = avg_scores_by_gender["writing_score"].values,
            text = np.round(avg_scores_by_gender["writing_score"].values, 2),
            marker_color = CHART_COLOR_THEME[2]),
])

fig_bar_avg_gender_score.update_traces(textposition="inside", 
                                       textfont_size = 14,
                                       textfont_color = "white",
                                       hovertemplate = "<b>%{fullData.name}</b><br><br>Avg Score: %{value}<extra></extra>",
                                       hoverlabel = dict(font = dict(color = "white"))
                                       )


fig_bar_avg_gender_score.update_layout({"title_text": "Average Scores By Gender",
                                        "title_font_size": 24,
                                        "title_x": 0.50,
                                        "legend_title": "Legend",
                                        "barmode":'group'
                                        })

fig_bar_avg_gender_score.show()
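The two main goals of this project (the average score for each gender and for each group) can also be answered as plain tables alongside the charts. The sketch below is a minimal example on made-up rows; `sample_df`, `score_columns` and the two result names are hypothetical, and the numbers are not taken from the real data.

```python
import pandas as pd

# Hypothetical sample using the cleaned column names.
sample_df = pd.DataFrame({
    "gender": ["male", "female", "male", "female"],
    "race_ethnicity": ["group A", "group A", "group B", "group B"],
    "math_score": [67, 40, 59, 80],
    "reading_score": [67, 59, 60, 81],
    "writing_score": [63, 55, 50, 85],
})

score_columns = ["math_score", "reading_score", "writing_score"]

# Average of each score per gender and per race / ethnicity group:
avg_scores_by_gender = sample_df.groupby("gender")[score_columns].mean().round(2)
avg_scores_by_group = sample_df.groupby("race_ethnicity")[score_columns].mean().round(2)

print(avg_scores_by_gender)
print(avg_scores_by_group)
```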

### <font color="AA0088">Average Scores By Race / Ethnicity Group</font>

In [92]:
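import math


def create_grouped_bar_chart(df_source:pd.DataFrame, 
                             x_feature_names:list, 
                             y_feature_names:list,
                             y_feature_function:str,
                             fig_title:str,
                             y_axis_range:tuple = (0, 100)
                             ):
    """
    Generate a grouped bar chart with one set of bars per y feature.

    Args:
        df_source (pd.DataFrame): The dataframe that you wish to use.
        x_feature_names (list): The feature(s) to group the data by.
        y_feature_names (list): The numeric features to aggregate.
        y_feature_function (str): One of "mean", "min", "max", "median" or "count".
        fig_title (str): The title that will be shown on the chart.
        y_axis_range (tuple): The start and end of the y-axis. The default of (0, 100)
            lets the end be rounded up to the next multiple of 100.

    Returns:
        plotly graph (figure): A grouped bar chart of the aggregated values.
    """
       
    # --- Create an empty list that will contain the payload for the figure:              
    fig_data = list()
    
    # --- Track the largest aggregated value across all of the y features:
    max_y_value = 0
    
    # --- Build one bar trace per y feature:
    for idx, y_feature in enumerate(y_feature_names):
        grouped_by_x_feature_df = df_source.groupby(x_feature_names)[y_feature].agg(["mean", "min", "max", "median", "count"]).reset_index()
        max_y_value = max(max_y_value, grouped_by_x_feature_df[y_feature_function].max())
        
        fig_data.append(go.Bar(name = y_feature.replace("_", " ").title(),
                               x = [grouped_by_x_feature_df[x_feature] for x_feature in x_feature_names],
                               y = grouped_by_x_feature_df[y_feature_function].values,
                               text = np.round(grouped_by_x_feature_df[y_feature_function].values, 2),
                               marker_color = CHART_COLOR_THEME[idx]
                               )
                        )
                     
    # --- Create the figure from the list of bar traces:
    fig_grouped_bar_chart = go.Figure(data = fig_data)

    # --- Label the hover text with the aggregation that was actually used:
    fig_grouped_bar_chart.update_traces(textposition = "inside", 
                                        textfont_size = 14,
                                        textfont_color = "white",
                                        hovertemplate = f"<b>%{{fullData.name}}</b><br><br>{y_feature_function.title()} Score: %{{value}}<extra></extra>",
                                        hoverlabel = dict(font = dict(color = "white"))
                                        )

    # --- Set the y-axis range:
    y_axis_start_from = y_axis_range[0]
    y_axis_end_at = y_axis_range[1]
    
    # --- With the default range, round the end up to the next multiple of 100:
    if y_axis_start_from == 0 and y_axis_end_at == 100:
        y_axis_end_at = int(math.ceil(max_y_value / 100)) * 100

    # --- Modify the layout of the figure as needed:
    fig_grouped_bar_chart.update_layout({"title_text": fig_title,
                                        "title_font_size": 24,
                                        "title_x": 0.50,
                                        "legend_title": "Legend",
                                        "barmode": "group"
                                        },
                                        yaxis_range = [y_axis_start_from, 
                                                       y_axis_end_at]
                                        )

    # --- Return the figure payload to the calling variable:
    return fig_grouped_bar_chart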

In [104]:
fig_bar_avg_race_gender_score = create_grouped_bar_chart(df_source = exam_scores_df,
                                                         x_feature_names = ["race_ethnicity", "gender"], 
                                                         y_feature_names = ["math_score", "reading_score", "writing_score"],
                                                         y_feature_function = "mean",
                                                         #y_axis_range = [0, 90],
                                                         fig_title = "Average Scores By Gender By Race / Ethnicity Group")


fig_bar_avg_race_gender_score.show()

In [70]:
import math
a = 48

int(math.ceil(a / 100)) * 100

100

In [46]:
list_figure_options

['create_pie_chart']

In [23]:
from pptx import Presentation

prs = Presentation()

title_slide_layout = prs.slide_layouts[0]
slide = prs.slides.add_slide(title_slide_layout)
title = slide.shapes.title
subtitle = slide.placeholders[1]

title.text = "Hello, World!"
subtitle.text = "python-pptx was here!"

#prs.save('test.pptx')