<a class="anchor" id="top"></a>

# Topic Modeling with LDA to Generate Target Risk Topics
Author: Ainesh Pandey

In this notebook, we will perform topic modeling using [LDA (Latent Dirichlet Allocation)](https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2) on the _Lesson(s) Learned_ column of the `lessons_learned.csv` dataframe. The purpose of this analysis is to attempt to organize the different projects into risk topics. Assuming the LDA model produces reasonable topics, we can use the topic classification as ground truth in a modeling objective.

## Table of Contents
[Step 1: Import Packages and Data](#step-1) <br>
[Step 2: Exploratory Data Analysis](#step-2) <br>
[Step 3: Data Preparation](#step-3) <br>

<a class="anchor" id="step-1"></a>

## Import Packages and Data

### Packages

We start by importing the required packages for this analysis.

In [1]:
# basic data science packages
import pandas as pd
import numpy as np
np.random.seed(5)


### Data
We import `lessons_learned.csv` and keep every column for now. After some exploratory data analysis, we will choose which features to keep as inputs.

In [3]:
df_lessons = pd.read_csv('../Risky Space Business Challenge Files/lessons_learned.csv')

display(df_lessons.shape)
df_lessons.head()

(2101, 17)

| | Lesson ID | Title | Abstract | Lesson(s) Learned | Recommendation(s) | Organization | Date Lesson Occurred | Driving Event | Evidence | Project / Program | The related NASA policy(s), standard(s), handbook(s), procedure(s) or other rules | NASA Mission Directorate(s) | Sensitivity | From what phase of the program or project was this lesson learned captured? | Where (other lessons, presentations, publications, etc.)? | Publish Date | Topics |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30004 | Relationship of Government and Contractor Risk... | The purpose of this lesson is to highlight the... | Approach 1 made it difficult to understand the... | Projects should consider RBI's risk management... | LaRC | 04/05/2018 | Throughout the project, it was repeatedly dete... |  | Radiation Budget Instrument | Langley Management System Center Procedure LMS... | Aeronautics Research, Human Exploration and Op... | Public | Implementation | LaRC Institutional Knowledge Management (IKM) ... | 07/23/2021 | Procurement, Small Business & Industrial Relat... |
| 1 | 30101 | Cable Harness Wiring and Connector Anomalies C... | Early indications show that the commercial spa... | As a result of many years of expensive lessons... | As commercial vehicles and other NASA vehicles... | NESC | 02/28/2021 | NASA has found that the commercial spacecraft ... |  | Space Shuttle Program, Commercial Crewed Space... |  | Human Exploration and Operations | Public | Implementation - Phase E |  | 07/23/2021 | Flight Equipment, Ground Operations, Hardware,... |
| 2 | 29801 | Best Practices for the Elemental Profiling of ... | Trace contaminants in high-purity hydrazine (H... | There was an unexpectedly wide variation in el... | The recommendations to prevent this lesson fro... | NESC | 12/14/2020 | Hypergolic propellants (e.g., hydrazine (N2H4)... |  | All NASA missions using high purity hydrazine ... |  | Human Exploration and Operations, Science, Spa... | Public | Not Applicable |  | 06/23/2021 | Ground Operations, Launch Vehicle, Parts, Mate... |
| 3 | 29702 | Integration and Dependency Between Different A... | During the Radiological Control Center (RADCC)... | If possible, the design phase of both systems ... | Design phase should incorporate both design te... | KSC | 03/15/2020 | The RADCC AV system controls and routes variou... |  | Radiological Control Center (RADCC) | NPR 7120.7A NASA Information Technology Progra... | Human Exploration and Operations | Public | Not Specified |  | 06/01/2021 | Engineering Design, Integration and Testing, S... |
| 4 | 29103 | Copper Tube Pinch Failure | While pinching copper tubes is a standard prac... | The pinch was initially visually inspected and... | Have pinch tool operator perform several pract... | KSC | 10/17/2020 | A copper tube was pinched as a test for the Ma... | Inverted metallography image of separated pinc... | Mass Spectrometer observing lunar operations (... |  | Human Exploration and Operations, Space Techno... | Public | Implementation - Phase D |  | 12/10/2020 | Engineering Design, Integration and Testing, M... |


Next, we will import `risk_classifications.csv`, which we created in [GenerateRiskTarget.ipynb](GenerateRiskTarget.ipynb).

In [4]:
df_riskclasses = pd.read_csv('../data/risk_classifications.csv')

display(df_riskclasses.shape)
df_riskclasses.head()

(2086, 2)

| | Lesson ID | Risk Class |
|---|---|---|
| 0 | 30004 | 4 |
| 1 | 30101 | 4 |
| 2 | 29801 | 4 |
| 3 | 29702 | 4 |
| 4 | 29103 | 4 |


We will combine both dataframes into a master dataframe.

In [5]:
df_master = df_lessons.merge(df_riskclasses, on='Lesson ID', how='inner')

display(df_master.shape)
df_master.head()

(2086, 18)

| | Lesson ID | Title | Abstract | Lesson(s) Learned | Recommendation(s) | Organization | Date Lesson Occurred | Driving Event | Evidence | Project / Program | The related NASA policy(s), standard(s), handbook(s), procedure(s) or other rules | NASA Mission Directorate(s) | Sensitivity | From what phase of the program or project was this lesson learned captured? | Where (other lessons, presentations, publications, etc.)? | Publish Date | Topics | Risk Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30004 | Relationship of Government and Contractor Risk... | The purpose of this lesson is to highlight the... | Approach 1 made it difficult to understand the... | Projects should consider RBI's risk management... | LaRC | 04/05/2018 | Throughout the project, it was repeatedly dete... |  | Radiation Budget Instrument | Langley Management System Center Procedure LMS... | Aeronautics Research, Human Exploration and Op... | Public | Implementation | LaRC Institutional Knowledge Management (IKM) ... | 07/23/2021 | Procurement, Small Business & Industrial Relat... | 4 |
| 1 | 30101 | Cable Harness Wiring and Connector Anomalies C... | Early indications show that the commercial spa... | As a result of many years of expensive lessons... | As commercial vehicles and other NASA vehicles... | NESC | 02/28/2021 | NASA has found that the commercial spacecraft ... |  | Space Shuttle Program, Commercial Crewed Space... |  | Human Exploration and Operations | Public | Implementation - Phase E |  | 07/23/2021 | Flight Equipment, Ground Operations, Hardware,... | 4 |
| 2 | 29801 | Best Practices for the Elemental Profiling of ... | Trace contaminants in high-purity hydrazine (H... | There was an unexpectedly wide variation in el... | The recommendations to prevent this lesson fro... | NESC | 12/14/2020 | Hypergolic propellants (e.g., hydrazine (N2H4)... |  | All NASA missions using high purity hydrazine ... |  | Human Exploration and Operations, Science, Spa... | Public | Not Applicable |  | 06/23/2021 | Ground Operations, Launch Vehicle, Parts, Mate... | 4 |
| 3 | 29702 | Integration and Dependency Between Different A... | During the Radiological Control Center (RADCC)... | If possible, the design phase of both systems ... | Design phase should incorporate both design te... | KSC | 03/15/2020 | The RADCC AV system controls and routes variou... |  | Radiological Control Center (RADCC) | NPR 7120.7A NASA Information Technology Progra... | Human Exploration and Operations | Public | Not Specified |  | 06/01/2021 | Engineering Design, Integration and Testing, S... | 4 |
| 4 | 29103 | Copper Tube Pinch Failure | While pinching copper tubes is a standard prac... | The pinch was initially visually inspected and... | Have pinch tool operator perform several pract... | KSC | 10/17/2020 | A copper tube was pinched as a test for the Ma... | Inverted metallography image of separated pinc... | Mass Spectrometer observing lunar operations (... |  | Human Exploration and Operations, Space Techno... | Public | Implementation - Phase D |  | 12/10/2020 | Engineering Design, Integration and Testing, M... | 4 |
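
The inner join returns 2,086 rows from 2,101 lessons, so some lessons apparently have no risk class. A minimal sketch of one way to audit such a join, using toy data instead of the real files: an outer merge with `indicator=True` tags each row by where its key was found. The frame names `lessons`, `risk_classes`, and `audit`, and all values in them, are hypothetical.

```python
import pandas as pd

# Toy stand-ins for df_lessons and df_riskclasses (hypothetical values).
lessons = pd.DataFrame({"Lesson ID": [101, 102, 103, 104],
                        "Title": ["a", "b", "c", "d"]})
risk_classes = pd.DataFrame({"Lesson ID": [101, 102, 104],
                             "Risk Class": [4, 2, 0]})

# An outer merge with indicator=True labels each row by where its key was found.
# validate="one_to_one" raises if a Lesson ID is duplicated on either side.
audit = lessons.merge(risk_classes, on="Lesson ID", how="outer",
                      indicator=True, validate="one_to_one")

# Lessons that an inner join would silently drop.
unmatched = audit.loc[audit["_merge"] == "left_only", "Lesson ID"].tolist()
print(audit["_merge"].value_counts())
print(unmatched)
```

If the unmatched list is not what you expect, it is worth inspecting those lessons before continuing with the merged frame.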


<a class="anchor" id="step-2"></a>

## Exploratory Data Analysis

The problem at hand is a classification task: we want to use the data available about a project before it officially launches to predict which class of risk it falls under. To that end, we will generate a text-based description of each project, apply basic NLP transformations to convert the descriptions into tabular data, and train several multi-class classification algorithms to predict each project's risk class.

To generate the text-based description, we need to decide which of the available features can add value beyond the _Title_ and _Abstract_ features of the dataframe.
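
As a rough illustration of this plan, the sketch below joins a few text columns into one description and converts it to numeric features with TF-IDF. The use of TF-IDF via scikit-learn's `TfidfVectorizer`, the `description` column name, and the inclusion of _Organization_ are all assumptions for illustration, not the notebook's confirmed method; the rows are toy data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy rows standing in for df_master (hypothetical values).
toy = pd.DataFrame({
    "Title": ["Copper Tube Pinch Failure", "Cable Harness Wiring Anomalies"],
    "Abstract": ["Pinching copper tubes is a standard practice.", None],
    "Organization": ["KSC", "NESC"],
})

# Join the chosen text columns into one description; missing values become empty.
text_columns = ["Title", "Abstract", "Organization"]  # assumed selection
toy["description"] = toy[text_columns].fillna("").agg(" ".join, axis=1).str.strip()

# TF-IDF turns each description into a row of numeric term weights.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(toy["description"])
print(features.shape)
```

The resulting sparse matrix has one row per project and one column per term, which is the tabular form a multi-class classifier expects.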

### Remove Irrelevant Features

First, we remove any columns containing data unavailable at the beginning of the project.
- _Lesson(s) Learned_ was used to generate the risk classifications, so using it as an input would leak the target. It is also information that only comes to light during the course of the project. Therefore, this column will be dropped.
- The other columns that either relate to _Lesson(s) Learned_ or are post-facto include:
    - _Recommendation(s)_
    - _Date Lesson Occurred_
    - _Driving Event_
    - _Evidence_
    - _The related NASA policy(s), standard(s), handbook(s), procedure(s) or other rules_
    - _From what phase of the program or project was this lesson learned captured?_
    - _Where (other lessons, presentations, publications, etc.)?_
- _Publish Date_ is also a post-facto column. Furthermore, we don't expect the publication date to affect the risk classification.

In [8]:
if 'Lesson(s) Learned' in df_master.columns:
    df_master.drop(['Lesson(s) Learned', 'Recommendation(s)', 'Date Lesson Occurred', 'Driving Event', 'Evidence',
                    'The related NASA policy(s), standard(s), handbook(s), procedure(s) or other rules',
                    'From what phase of the program or project was this lesson learned captured?',
                    'Where (other lessons, presentations, publications, etc.)?', 'Publish Date'], inplace=True, axis=1)

display(df_master.shape)
df_master.head()

(2086, 9)

| | Lesson ID | Title | Abstract | Organization | Project / Program | NASA Mission Directorate(s) | Sensitivity | Topics | Risk Class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 30004 | Relationship of Government and Contractor Risk... | The purpose of this lesson is to highlight the... | LaRC | Radiation Budget Instrument | Aeronautics Research, Human Exploration and Op... | Public | Procurement, Small Business & Industrial Relat... | 4 |
| 1 | 30101 | Cable Harness Wiring and Connector Anomalies C... | Early indications show that the commercial spa... | NESC | Space Shuttle Program, Commercial Crewed Space... | Human Exploration and Operations | Public | Flight Equipment, Ground Operations, Hardware,... | 4 |
| 2 | 29801 | Best Practices for the Elemental Profiling of ... | Trace contaminants in high-purity hydrazine (H... | NESC | All NASA missions using high purity hydrazine ... | Human Exploration and Operations, Science, Spa... | Public | Ground Operations, Launch Vehicle, Parts, Mate... | 4 |
| 3 | 29702 | Integration and Dependency Between Different A... | During the Radiological Control Center (RADCC)... | KSC | Radiological Control Center (RADCC) | Human Exploration and Operations | Public | Engineering Design, Integration and Testing, S... | 4 |
| 4 | 29103 | Copper Tube Pinch Failure | While pinching copper tubes is a standard prac... | KSC | Mass Spectrometer observing lunar operations (... | Human Exploration and Operations, Space Techno... | Public | Engineering Design, Integration and Testing, M... | 4 |
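
The `if` guard in the cell above keeps it safe to re-run after the columns are gone. An alternative sketch, assuming a recent pandas, is to pass `errors="ignore"` to `drop`, which skips labels that no longer exist. The toy frame and the shortened `post_facto` list are hypothetical.

```python
import pandas as pd

# Toy frame with one input column and two post-facto columns (hypothetical values).
toy = pd.DataFrame({"Title": ["a"], "Evidence": ["x"], "Publish Date": ["12/10/2020"]})
post_facto = ["Evidence", "Publish Date"]

# errors="ignore" skips labels that are already gone, so re-running is harmless.
toy = toy.drop(columns=post_facto, errors="ignore")
toy = toy.drop(columns=post_facto, errors="ignore")  # second run is a no-op
print(list(toy.columns))
```

Unlike the guard, this variant also drops whichever listed columns remain if only some of them were removed earlier.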


### Categorical Feature Analysis

Next, we will try to identify which categorical features other than _Title_ and _Abstract_ (if any) add value to the text-based description of the project. We start by checking how many unique values exist for each categorical feature.

In [15]:
df_master.iloc[:,3:].nunique()

Organization                    15
Project / Program              202
NASA Mission Directorate(s)     25
Sensitivity                      1
Topics                         447
Risk Class                       5
dtype: int64
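
Two follow-up checks can help when reading these counts; the sketch below runs them on toy data. It assumes that multi-valued fields such as _Topics_ hold labels separated by `", "`, as the truncated output above suggests, and the frame and its values are hypothetical.

```python
import pandas as pd

# Toy rows standing in for df_master (hypothetical values).
toy = pd.DataFrame({
    "Sensitivity": ["Public", "Public", "Public", "Public"],
    "Topics": ["Hardware, Parts", "Parts, Software", "Hardware", "Hardware, Software"],
})

# Columns with a single unique value cannot help separate the risk classes.
counts = toy.nunique()
constant_columns = counts[counts == 1].index.tolist()

# Split a comma-separated column into individual labels before counting them.
topic_labels = toy["Topics"].str.split(", ").explode()
print(constant_columns)
print(toy["Topics"].nunique(), topic_labels.nunique())
```

Here the four distinct _Topics_ strings reduce to three underlying labels, which shows how a high unique-value count can overstate the true variety of a multi-valued column.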

- _Project / Program_ doesn't seem to have a standard format for its values. That, 