# Homework 5: Bayesian Graphical Models

- Name: Congxin (David) Xu
- Computing ID: cx2rx

### Honor Pledge: 
I have neither given nor received aid on this assignment.

### Problem 1
(30) This problem explores the use of variational approximation in Latent
Dirichlet Allocation (LDA). We will use the implementation in sklearn of
the variational approximation algorithm in [1].

#### Part a

Use the notation in the diagram in Figure 1 [2] to write the target
posterior distribution of the latent variables and parameters for the
general LDA method. Why do we use variational approximation
rather than conjugate priors or sampling to obtain this posterior
distribution?

**Answer**

The posterior distribution of the latent variables is 
$$ p(\theta, z | w, \alpha, \beta) = \frac{p(\theta, z, w| \alpha, \beta)}{p(w | \alpha, \beta)}  $$

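We use variational approximation because the exact posterior is intractable: the normalizing constant $p(w | \alpha, \beta)$ requires integrating over $\theta$ and summing over every possible topic assignment $z$, and the coupling between $\theta$ and $\beta$ inside that integral rules out a closed form. The Dirichlet prior is conjugate to the multinomial over topic assignments, but that conjugacy does not carry over to the full model once $z$ is marginalized out, so conjugate priors alone do not give the posterior. Sampling methods such as MCMC can approximate it, but they are slow to converge on large corpora, whereas variational approximation turns inference into an optimization problem that scales to large data sets.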
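To make the intractability concrete, the denominator can be written out for a single document. The form below follows the standard LDA derivation (Blei, Ng, and Jordan, 2003). The indices $k$ (number of topics), $N$ (words in the document), and $V$ (vocabulary size), and the variational parameters $\gamma$ and $\phi$, are assumptions that may not match the labels used in Figure 1.

$$ p(w | \alpha, \beta) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta $$

The sum over topics sits inside the product over words, so $\theta$ and $\beta$ cannot be separated and the integral has no closed form. A variational approach replaces the posterior with a factorized family and chooses $\gamma$ and $\phi$ to minimize the KL divergence to the true posterior:

$$ q(\theta, z | \gamma, \phi) = q(\theta | \gamma) \prod_{n=1}^{N} q(z_n | \phi_n) $$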

#### Part b

Accident reports provide a good use-case for LDA since the narrative
information in these reports is frequently overlooked in safety analysis. LDA allows us to capture elements (topics) in this narrative
data and use them to better understand unsafe conditions. For this
use-case, modify the LDA class for Wikipedia in the `LDA Examples Wikipedia and Trains` jupyter notebook to perform LDA on
the accident narratives. About 10 years of these narratives are in
the json file, `TrainNarratives.txt`. Use this class to obtain 10
topics from the accident narratives.

In [1]:
import numpy as np
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import nltk
from nltk.corpus import stopwords
import json

# Set stop words
stopWords = set(stopwords.words('english'))

In [2]:
class LDA_trains:
    """Latent Dirichlet Allocation on narratives from train accident reports
    Input:
        reports = list of narratives from accident reports
        N_topics = number of topics for LDA to produce
        N_words = the number of words to show in a topic
    Methods:
        Topics = fit the LDA model and output the top words for each topic
        Predict_Topics = show the predicted topic probabilities for new accident narratives
            Input: list of new narratives (call Topics first to fit the model)
            """
    def __init__(self, reports, N_topics=3, N_words = 10):
        # the narrative reports
        self.reports = reports
        # initialize variables
        self.N_topics = N_topics
        self.N_words = N_words
        
        # Get the word counts in the reports
        self.countVectorizer = CountVectorizer(stop_words='english')
        self.termFrequency = self.countVectorizer.fit_transform(self.reports)
        # get_feature_names() was removed in scikit-learn 1.2; use the _out variant
        self.Words = self.countVectorizer.get_feature_names_out()
        
    def Topics(self):
                
        # Obtain the estimates for the LDA model 
        self.lda = LatentDirichletAllocation(n_components=self.N_topics)
        self.lda.fit(self.termFrequency)
        
        # For each of the topics in the model, add the top N_words
        # (largest weights first) to the list of topics
        topics = list()
        for topic in self.lda.components_:
            topics.append([self.Words[i] for i in topic.argsort()[:-self.N_words - 1:-1]])
            
        # Create column names for the output matrix
        cols = list()
        for i in range(self.N_words):
            cols.append("Word " + (str(i)))
            
        # Create a dataframe with the topic no. and the words in each topic 
        # output this dataframe 
        Topics_df = pd.DataFrame(topics, columns = cols)
        Topics_df.index.name = "Topics"
        return Topics_df
    
    def Predict_Topics(self, new_reports):
        self.new_reports = new_reports
        
        # Get the list of new accident report narratives
        # and the number of new narratives
        N_new_reports = len(self.new_reports)
        
        
        # For each of the new narratives 
        # obtain the estimated probabilities for each of the topics
        # in each of the new narratives as estimated by the LDA results
        # on the training set 
        new_report_topics = list()
        for report in self.new_reports:
            # Convert the narrative to word counts, then infer its topic probabilities
            counts = self.countVectorizer.transform([report])
            new_report_topics.append(self.lda.transform(counts))
        
        # Recast the list of probabilities for topics as an array 
        # of size no. of new reports X no. of topics
        new_report_topics = np.array(new_report_topics).\
            reshape(N_new_reports, self.N_topics)
        
        # Create column names for the output dataframe
        cols = list()
        ### Your code here        
        for i in range(self.N_topics):
            cols.append("Topic "+(str(i)))
            
        # Create the dataframe whose rows contain topic probabilities for 
        # the specified narratives/reports
        New_Reports_df = pd.DataFrame(new_report_topics, columns = cols)        
        New_Reports_df.insert(0, 'Reports', self.new_reports)
        
        return New_Reports_df
                

In [3]:
# Train accident narratives are in a json file
# Read the JSON file with the narratives and convert to a list for the LDA analysis

with open('TrainNarratives.txt') as json_file:  
    Narrative_dict = json.load(json_file)
    
train_reports = list(Narrative_dict.values())
    
train_reports[0:3]

['UNITS 231-281(BACK TO BACK)  WERE COMING INTO UP DEISEL SHOP  WHEN THE LEFT WHEEL OF 281 RODE OVER RECENTLY REPAIRED SWITCH PLATE AND DERAILED. THE CAUSE WAS DETERMINED TO BE THE TRACK TELEMETRY IN THAT IT WAS TOO SHARP OF A CURVE.',
 'ENGINE 286 CAUGHT FIRE AT THE SPRINGFIELD, MA STATION DUE TO BEARINGS IN MAIN GENERATOR LET GO.',
 'TRAIN NO.#4 WITH ENGS 83/11/90/44 AND 11 CARS DERAILED 2 DEADHEAD CARS, C/44834 AND C/9639, WHILE MAKING A SHOVING MOVE ONTO TRACK 28.  THE DERAILMENT WAS DUE TO HIGH BUFF FORCES CAUSED JACKKNIFING OFDEADHEADING AMFLEET CAR 44834 LOCATED DIRECTLY BEHIND ENGINES DUE TO EXCESSIVE AMPERAGE GENERATED BY FOUR P42 LOCOMOTIVES SHOVING TRAIN AGAINST AN APPROXIMATELY 15-POUND BRAKE REDUCTION.']
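
The loading step above assumes the file holds a single JSON object whose values are the narrative strings. A minimal sketch of that assumed layout, using a hypothetical two-entry string in place of the real file (the keys and narratives below are made up for illustration):

```python
import json

# Hypothetical stand-in for the contents of TrainNarratives.txt;
# the real keys and narratives are not reproduced here.
sample_text = '{"0": "ENGINE DERAILED AT SWITCH.", "1": "CAR ROLLED OUT OF TRACK 4."}'

narrative_dict = json.loads(sample_text)

# Dictionaries keep insertion order, so the list follows the order in the file
sample_reports = list(narrative_dict.values())
print(sample_reports)
```

If the real file were a JSON array rather than an object, `json.load` would return a list directly and the `.values()` call would not be needed.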

In [4]:
lda_train = LDA_trains(reports = train_reports, N_topics = 10, N_words = 10)
lda_train.Topics()

| Topics | Word 0 | Word 1 | Word 2 | Word 3 | Word 4 | Word 5 | Word 6 | Word 7 | Word 8 | Word 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | train | derailed | cars | car | main | emergency | track | rail | traveling | mp |
| 1 | train | pantograph | damage | struck | wire | engine | vehicle | catenary | operating | equipment |
| 2 | car | cars | lead | auto | truck | pullback | derailed | 25 | hump | engine |
| 3 | cars | track | switch | train | car | crew | derailed | conductor | engineer | rail |
| 4 | track | cars | derailed | switch | yard | crew | released | hazardous | materials | shoving |
| 5 | switch | derailed | track | train | end | lead | cars | point | east | crew |
| 6 | derailed | cars | track | pulling | loads | shoving | rail | ns | empties | head |
| 7 | cars | car | bowl | track | humping | derailed | hump | pulling | retarder | derailment |
| 8 | car | track | switch | damage | causing | train | derailed | equipment | bnsf | derailment |
| 9 | car | cars | track | cut | rolled | lead | switching | kicked | end | causing |
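
The top words in each row come from sorting a topic's word weights. The sketch below repeats that step on a made-up weight matrix (the vocabulary and numbers are hypothetical, not taken from the fitted model). It shows how the slice `[:-n - 1:-1]` picks the `n` largest entries in descending order, and how dividing a row by its sum turns the weights into word probabilities for that topic.

```python
import numpy as np

# Hypothetical vocabulary and topic-word weights (2 topics x 5 words),
# standing in for self.Words and self.lda.components_
vocab = ["track", "switch", "derailed", "wire", "pantograph"]
weights = np.array([
    [5.0, 3.0, 8.0, 0.5, 0.1],
    [0.2, 0.4, 1.0, 6.0, 9.0],
])

n_top = 3
# argsort is ascending, so walk backwards from the end to get the largest first
top_words = [[vocab[i] for i in row.argsort()[:-n_top - 1:-1]] for row in weights]

# Normalize each row so it sums to 1 and can be read as p(word | topic)
word_probs = weights / weights.sum(axis=1, keepdims=True)

print(top_words)
```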


#### Part c

Use the class you developed for Problem 1b to obtain the probabilities
for each of the topics in the first 10 narratives in the `TrainNarratives.txt`
data set. What is the notation in Figure 1 that represents these
probabilities?

**Answer**

The notation in Figure 1 that represents these probabilities is $\theta$, the distribution over topics for each document.

In [5]:
lda_train.Predict_Topics(train_reports[:3])

| | Reports | Topic 0 | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 | Topic 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | UNITS 231-281(BACK TO BACK) WERE COMING INTO ... | 0.004546 | 0.004546 | 0.004546 | 0.004547 | 0.187757 | 0.004546 | 0.072772 | 0.004546 | 0.707647 | 0.004546 |
| 1 | ENGINE 286 CAUGHT FIRE AT THE SPRINGFIELD, MA ... | 0.009094 | 0.326304 | 0.099534 | 0.009092 | 0.009093 | 0.359446 | 0.009093 | 0.009091 | 0.009093 | 0.160161 |
| 2 | TRAIN NO.#4 WITH ENGS 83/11/90/44 AND 11 CARS ... | 0.467596 | 0.148066 | 0.002326 | 0.002326 | 0.368055 | 0.002326 | 0.002326 | 0.002326 | 0.002326 | 0.002326 |
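
One way to read these rows is to pick the topic with the largest probability for each report. A minimal sketch, with the three rows of probabilities above copied in by hand (the variable names are illustrative):

```python
import numpy as np

# Topic probabilities for the three reports, copied from the output above
theta = np.array([
    [0.004546, 0.004546, 0.004546, 0.004547, 0.187757,
     0.004546, 0.072772, 0.004546, 0.707647, 0.004546],
    [0.009094, 0.326304, 0.099534, 0.009092, 0.009093,
     0.359446, 0.009093, 0.009091, 0.009093, 0.160161],
    [0.467596, 0.148066, 0.002326, 0.002326, 0.368055,
     0.002326, 0.002326, 0.002326, 0.002326, 0.002326],
])

# Index of the most probable topic in each row
dominant_topic = theta.argmax(axis=1)
print(dominant_topic.tolist())
```

The dominant topic is only a summary. The second report, for example, splits its probability between Topics 5 and 1, so the full row is worth keeping alongside the argmax.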


#### Part d

Briefly explain how a safety engineer at Federal Railroad Administration could use the results you obtain in Problem 1c to improve safety for trains.

**Answer**

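A safety engineer at the Federal Railroad Administration could look up the topic probabilities for a given accident report to see which topics dominate it. Reviewing the top words in those topics (for example, pantograph and catenary wire, humping operations, or hazardous materials) points to the equipment and operating conditions involved, so the engineer can pay special attention to those areas when conducting safety checks.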
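
As a sketch of how this could scale from one report to many, the dominant topic of each report can be tallied to see which topics account for the most accidents. The topic labels below are hypothetical placeholders, not output from the model.

```python
from collections import Counter

# Hypothetical dominant-topic label for each of ten reports
dominant_topics = [8, 5, 0, 8, 3, 8, 0, 5, 8, 1]

# Count how many reports fall under each topic, most frequent first
topic_counts = Counter(dominant_topics)
ranked = topic_counts.most_common()

for topic, count in ranked:
    print(f"Topic {topic}: {count} reports")
```

Topics at the top of such a ranking would be natural candidates for targeted inspections, under the assumption that each topic's top words describe a recognizable failure mode.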
