### This notebook creates a table of the unit's training status from raw .csv exports.

The goal is a semi-automated process that works from .csv files the unit training manager can download. The analyst drops the files into the appropriate folder weekly (or on any other schedule) and runs this notebook. The resulting table is then linked to Tableau for visualization.

Note: All names used are fictitious.

### Step 1: Loading our libraries

In [2]:
#pip install xlrd
#pip install findspark
#pip install pyjanitor

In [3]:
import pandas as pd
import janitor
import glob

### Step 2: Reading the data

For our example, we have data from six different .csv files. We will read in all of the .csv files in our folder and combine them into a single pandas DataFrame.

In [9]:
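import os
pwd = os.getcwd()

# Folder that holds the raw .csv exports, relative to this notebook
path = os.path.join(pwd, "raw_data")

# Collect every .csv file in that folder
all_files = glob.glob(os.path.join(path, "*.csv"))

# Read each file and stack the rows into one DataFrame with a fresh index
df_from_each_file = (pd.read_csv(f) for f in all_files)
union = pd.concat(df_from_each_file, ignore_index=True)

union.info()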

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017 entries, 0 to 1016
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Id                 1017 non-null   int64  
 1   Rank               980 non-null    object 
 2   Name               1017 non-null   object 
 3   Course Name        1017 non-null   object 
 4   Enrollment Date    1017 non-null   object 
 5   Grade              136 non-null    float64
 6   Date Completed     1006 non-null   object 
 7   Completion Status  1017 non-null   object 
dtypes: float64(1), int64(1), object(6)
memory usage: 63.7+ KB
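
The read-and-combine step above can also be wrapped in a small helper. The following is only a sketch: the name `load_exports`, the `source_file` column, and the two tiny sample files are made up for illustration and are not part of the original workflow. It fails early when the folder is empty and records which export each row came from.

```python
import glob
import os
import tempfile

import pandas as pd


def load_exports(folder):
    """Read every .csv in `folder` into one DataFrame (hypothetical helper)."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if not paths:
        # pd.concat raises a ValueError when it is given nothing to concatenate,
        # so fail early with a message that names the folder instead.
        raise FileNotFoundError(f"No .csv files found in {folder}")
    frames = [
        # source_file is an added, hypothetical column for traceability
        pd.read_csv(p).assign(source_file=os.path.basename(p))
        for p in paths
    ]
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with two made-up exports in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Id": [1, 2], "Name": ["Doe Jane", "Roe Sam"]}).to_csv(
        os.path.join(tmp, "a.csv"), index=False
    )
    pd.DataFrame({"Id": [3], "Name": ["Poe Alex"]}).to_csv(
        os.path.join(tmp, "b.csv"), index=False
    )
    demo = load_exports(tmp)

print(demo)
```

Sorting the paths makes the row order repeatable between runs, which may or may not matter for the dashboard.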


Convert all columns containing dates to the proper "datetime" format:

In [10]:
# pd.to_datetime parses the date strings into datetime64[ns] columns
union['Enrollment Date'] = pd.to_datetime(union['Enrollment Date'])
union['Date Completed'] = pd.to_datetime(union['Date Completed'])
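
If an export ever contains a value that is not a date, the conversion will raise an error. One possible way to handle that, sketched here on made-up sample values rather than the real exports, is to coerce bad values to `NaT` and then list what failed to parse.

```python
import pandas as pd

# Made-up sample: one valid date, one bad entry, one genuinely missing value.
raw = pd.DataFrame({"Date Completed": ["2021-03-22", "not recorded", None]})

# errors="coerce" turns anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(raw["Date Completed"], errors="coerce")

# Values that were present in the file but could not be parsed as dates.
unparsed = raw.loc[parsed.isna() & raw["Date Completed"].notna(), "Date Completed"]

print(parsed)
print(unparsed.tolist())
```

Reviewing `unparsed` before continuing helps keep a typo in an export from silently becoming a missing completion date.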

In [11]:
print(union)

            Id  Rank               Name  \
0       391572   Maj  Williamson Ashley   
1      1835295   Maj  Williamson Ashley   
2      3232826   Maj  Williamson Ashley   
3      3302406   Maj  Williamson Ashley   
4      4154595   Maj  Williamson Ashley   
...        ...   ...                ...   
1012   8470477  TSgt        Pham Joshua   
1013   9075150  TSgt        Pham Joshua   
1014   9135968  TSgt        Pham Joshua   
1015  10917837  TSgt        Pham Joshua   
1016  11365022  TSgt        Pham Joshua   

                                            Course Name Enrollment Date  \
0            *Cyber Awareness Challenge 2021 (ZZ133098)      2021-03-22   
1            *Cyber Awareness Challenge (ZZ133098)-ADLS      2021-04-23   
2                     *Force Protection (ZZ133079)-ADLS      2021-04-25   
3                            SECDEF OPSEC Campaign-ADLS      2021-04-25   
4           *Religious Freedom Training (ZZ133109)-ADLS      2021-04-27   
...                              

### Step 3: Preparing and cleaning the data

First, we want to add columns for 1, 2, and 3 year due dates:

In [12]:
import datetime

In [13]:
from datetime import timedelta

In [14]:
union['Due Date 1 yr'] = union['Date Completed']+timedelta(days=365)
union['Due Date 2 yr'] = union['Date Completed']+timedelta(days=365*2)
union['Due Date 3 yr'] = union['Date Completed']+timedelta(days=365*3)

print(union)

            Id  Rank               Name  \
0       391572   Maj  Williamson Ashley   
1      1835295   Maj  Williamson Ashley   
2      3232826   Maj  Williamson Ashley   
3      3302406   Maj  Williamson Ashley   
4      4154595   Maj  Williamson Ashley   
...        ...   ...                ...   
1012   8470477  TSgt        Pham Joshua   
1013   9075150  TSgt        Pham Joshua   
1014   9135968  TSgt        Pham Joshua   
1015  10917837  TSgt        Pham Joshua   
1016  11365022  TSgt        Pham Joshua   

                                            Course Name Enrollment Date  \
0            *Cyber Awareness Challenge 2021 (ZZ133098)      2021-03-22   
1            *Cyber Awareness Challenge (ZZ133098)-ADLS      2021-04-23   
2                     *Force Protection (ZZ133079)-ADLS      2021-04-25   
3                            SECDEF OPSEC Campaign-ADLS      2021-04-25   
4           *Religious Freedom Training (ZZ133109)-ADLS      2021-04-27   
...                              
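
Note that `timedelta(days=365)` counts a fixed number of days, so a due date can land one day early when the interval spans a leap day. If the requirement is "same calendar date next year", `pd.DateOffset` is one alternative. This is a sketch on made-up dates, and whether the fixed or calendar interpretation is correct depends on the unit's policy, which the notebook does not state.

```python
import pandas as pd

# Made-up completion dates, including a leap day and a missing value.
completed = pd.Series(pd.to_datetime(["2019-06-01", "2020-02-29", None]))

# Fixed-length interval, as used in the notebook.
fixed_days = completed + pd.Timedelta(days=365)

# Calendar-aware interval: same month and day one year later.
calendar_year = completed + pd.DateOffset(years=1)

print(pd.DataFrame({
    "completed": completed,
    "plus_365_days": fixed_days,
    "plus_1_calendar_year": calendar_year,
}))
```

For the first row the two approaches differ by one day (2020-05-31 versus 2020-06-01) because 2020 is a leap year.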

Swap first and last name order:

In [15]:
union.Name = union.Name.str.split(' ').map(lambda x : ' '.join(x[::-1]))
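
The line above reverses every space-separated token, which works for two-part names. For names with more than two parts the result may not be what is wanted. The sketch below compares it with a variant that moves only the last token to the front. The helper `move_last_token_first` and the sample names are made up, and the variant assumes the given name is always the single final token, which the exports may not guarantee.

```python
import pandas as pd

# Made-up names in "Last First" order, one with a multi-part last name.
names = pd.Series(["Williamson Ashley", "Pham Joshua", "De La Cruz Maria"])

# Approach used in the notebook: reverse all tokens.
reversed_all = names.str.split(" ").map(lambda parts: " ".join(parts[::-1]))


def move_last_token_first(name):
    """Hypothetical variant: treat the final token as the given name."""
    parts = name.split()
    return " ".join(parts[-1:] + parts[:-1])


moved = names.map(move_last_token_first)

print(reversed_all.tolist())
print(moved.tolist())
```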

We need an entry for every combination of name and course, so that anyone who hasn't started a course still shows up as incomplete in our visualization:

In [16]:
union.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017 entries, 0 to 1016
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Id                 1017 non-null   int64         
 1   Rank               980 non-null    object        
 2   Name               1017 non-null   object        
 3   Course Name        1017 non-null   object        
 4   Enrollment Date    1017 non-null   datetime64[ns]
 5   Grade              136 non-null    float64       
 6   Date Completed     1006 non-null   datetime64[ns]
 7   Completion Status  1017 non-null   object        
 8   Due Date 1 yr      1006 non-null   datetime64[ns]
 9   Due Date 2 yr      1006 non-null   datetime64[ns]
 10  Due Date 3 yr      1006 non-null   datetime64[ns]
dtypes: datetime64[ns](5), float64(1), int64(1), object(4)
memory usage: 87.5+ KB


In [17]:
union = union.complete('Name', 'Course Name')

# .fillna(0) can be chained onto this call if desired
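
To make clear what this step produces, here is a rough plain-pandas equivalent on a made-up three-row table. It is a sketch of the idea (build every name/course pair, then left-join the real records onto it), not a description of how pyjanitor implements `complete`.

```python
import pandas as pd

# Made-up records: Joshua Pham has no row for "Force Protection".
records = pd.DataFrame({
    "Name": ["Ashley Williamson", "Ashley Williamson", "Joshua Pham"],
    "Course Name": ["Force Protection", "Cyber Awareness", "Cyber Awareness"],
    "Completion Status": ["Completed", "Completed", "Completed"],
})

# Every possible (Name, Course Name) pair.
all_pairs = pd.MultiIndex.from_product(
    [records["Name"].unique(), records["Course Name"].unique()],
    names=["Name", "Course Name"],
).to_frame(index=False)

# Left join: pairs with no matching record get NaN in the other columns.
completed = all_pairs.merge(records, on=["Name", "Course Name"], how="left")

print(completed)
```

The rows with a missing `Completion Status` are the courses a person has not started.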

In [18]:
union.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5313 entries, 0 to 5312
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Name               5313 non-null   object        
 1   Course Name        5313 non-null   object        
 2   Id                 1017 non-null   float64       
 3   Rank               980 non-null    object        
 4   Enrollment Date    1017 non-null   datetime64[ns]
 5   Grade              136 non-null    float64       
 6   Date Completed     1006 non-null   datetime64[ns]
 7   Completion Status  1017 non-null   object        
 8   Due Date 1 yr      1006 non-null   datetime64[ns]
 9   Due Date 2 yr      1006 non-null   datetime64[ns]
 10  Due Date 3 yr      1006 non-null   datetime64[ns]
dtypes: datetime64[ns](5), float64(2), object(4)
memory usage: 456.7+ KB


Sanity-check a few of the values:

In [19]:
n = len(pd.unique(union['Name']))
print("Number of unique names:", n)

Number of unique names: 64


In [20]:
m = len(pd.unique(union['Course Name']))
print("Number of unique courses:", m)

Number of unique courses: 83
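
A further check is to compare the row count with the number of possible pairs. In the outputs above, 64 names × 83 courses gives 5,312 pairs, while the table has 5,313 rows. One possible explanation is a repeated name/course pair in the raw exports, but the outputs shown here do not confirm that. The sketch below uses a made-up table and a hypothetical helper named `grid_summary` to show how such a difference could be investigated.

```python
import pandas as pd


def grid_summary(df, person_col, course_col):
    """Hypothetical helper: compare row count with the full person/course grid."""
    n_people = df[person_col].nunique()
    n_courses = df[course_col].nunique()
    # Rows whose (person, course) pair already appeared earlier in the table.
    duplicate_rows = int(df.duplicated(subset=[person_col, course_col]).sum())
    return {
        "expected_pairs": n_people * n_courses,
        "rows": len(df),
        "duplicate_rows": duplicate_rows,
    }


# Made-up table: person A appears twice for course X.
demo = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B"],
    "Course Name": ["X", "X", "Y", "X", "Y"],
})

summary = grid_summary(demo, "Name", "Course Name")
print(summary)
```

Running the same helper on `union` with `'Name'` and `'Course Name'` would show whether duplicate pairs account for the extra row.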


Header cleaning:

In [24]:
# Column headers: lower case, replace spaces with underscores,
# and remove the following characters: ,;{}()=?

union.columns = union.columns.str.lower()
union.columns = union.columns.str.replace(' ', '_')

problematic_chars = ',;{}()=?'
for c in problematic_chars:
    # regex=False treats each character literally (several are regex metacharacters)
    union.columns = union.columns.str.replace(c, '', regex=False)


In [25]:
union.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5313 entries, 0 to 5312
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   name               5313 non-null   object        
 1   course_name        5313 non-null   object        
 2   id                 1017 non-null   float64       
 3   rank               980 non-null    object        
 4   enrollment_date    1017 non-null   datetime64[ns]
 5   grade              136 non-null    float64       
 6   date_completed     1006 non-null   datetime64[ns]
 7   completion_status  1017 non-null   object        
 8   due_date_1_yr      1006 non-null   datetime64[ns]
 9   due_date_2_yr      1006 non-null   datetime64[ns]
 10  due_date_3_yr      1006 non-null   datetime64[ns]
dtypes: datetime64[ns](5), float64(2), object(4)
memory usage: 456.7+ KB
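
The same header cleaning can be written as a single function applied with `rename`. This is a sketch: `clean_header` and the sample column names are made up for illustration.

```python
import re

import pandas as pd


def clean_header(name):
    """Hypothetical helper: lower-case, underscores for spaces, drop ,;{}()=? characters."""
    name = name.strip().lower().replace(" ", "_")
    # Inside a character class these symbols are matched literally.
    return re.sub(r"[,;{}()=?]", "", name)


# Made-up headers, including characters the notebook removes.
sample = pd.DataFrame(columns=["Course Name", "Due Date (1 yr)", "Grade?"])
sample = sample.rename(columns=clean_header)

print(list(sample.columns))
```

Keeping the rules in one function makes it easier to reuse them if another export format is added later.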


### Step 4: Creating a .csv

This is what we will use as our data source for Tableau.

In [29]:
outname = 'training_table.csv'

# Export folder next to this notebook; create it if it does not exist yet
outdir = os.path.join(os.getcwd(), "export_data")
os.makedirs(outdir, exist_ok=True)

fullname = os.path.join(outdir, outname)

union.to_csv(fullname)
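
Before pointing Tableau at the file, it can help to read the export back and confirm the columns and dates survived the round trip. The sketch below does this on a made-up two-row table in a temporary folder. It also passes `index=False`, an optional change from the cell above, so the row index is not written as an extra unnamed column.

```python
import os
import tempfile

import pandas as pd

# Made-up table standing in for the real results, with one missing date.
table = pd.DataFrame({
    "name": ["Ashley Williamson", "Joshua Pham"],
    "date_completed": pd.to_datetime(["2021-03-22", None]),
})

with tempfile.TemporaryDirectory() as tmp:
    outdir = os.path.join(tmp, "export_data")
    os.makedirs(outdir, exist_ok=True)
    fullname = os.path.join(outdir, "training_table.csv")

    # index=False leaves the DataFrame index out of the file.
    table.to_csv(fullname, index=False)

    # Read the file back, parsing the date column again.
    round_trip = pd.read_csv(fullname, parse_dates=["date_completed"])

print(round_trip.dtypes)
```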

### We now have our completed results table.

For this project, I'll use Tableau to create a dashboard.