# OIL & GAS PRODUCTION - Text Mining Accident Reports 

### This project analyzes worker accidents using the summary reports published by the Occupational Safety and Health Administration (OSHA) and classified according to the NAICS system.



## Outline:
The following steps are required to analyze the textual corpus of accident reports.
Each step contains multiple tasks, which are discussed in detail in its own section.
### 1. Data Acquisition
### 2. Text Preprocessing 
### 3. Text Structuring
### 4. Text Mining 
### 5. Advanced Analytics
___

## 1. Data Acquisition 
The datasets are obtained from the osha.gov website. Each accident report contains the **Summary** along with the **Report ID**, **Event Date** and **Establishment Name**.
As the reports are not available in a downloadable format, they will be scraped from the website for a selected time period.

For this demo, text from a single report is used to show the steps involved in the process. For the full project, text from all reports will be downloaded into a single CSV file with the **Report ID** as the primary key.
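
As a minimal sketch of how the single CSV file keyed by **Report ID** might be written and read back, the example below uses only the standard library. The column names, the placeholder rows and the idea of skipping repeated IDs are my assumptions for illustration; the in-memory buffer stands in for a real file on disk.

```python
import csv
import io

# Column names are assumptions based on the fields listed above.
FIELDS = ["report_id", "event_date", "establishment_name", "summary"]


def write_reports(rows, handle):
    """Write report dicts to CSV, keeping only the first row per report_id."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    seen = set()
    for row in rows:
        if row["report_id"] in seen:
            continue  # a primary key must be unique, so skip repeats
        seen.add(row["report_id"])
        writer.writerow(row)
    return len(seen)


# Placeholder rows for illustration, not real OSHA records.
sample_rows = [
    {"report_id": "A1", "event_date": "2018-01-22",
     "establishment_name": "Example Co", "summary": "Gas ignited on the rig."},
    {"report_id": "A1", "event_date": "2018-01-22",
     "establishment_name": "Example Co", "summary": "Gas ignited on the rig."},
    {"report_id": "B2", "event_date": "2018-02-01",
     "establishment_name": "Sample Inc", "summary": "Worker fell from a ladder."},
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
n_written = write_reports(sample_rows, buffer)

# Read the data back into a dict keyed by report_id.
buffer.seek(0)
reports = {row["report_id"]: row for row in csv.DictReader(buffer)}
print(n_written, sorted(reports))
```

Keying the loaded rows by `report_id` makes it easy to join the summary text with the other fields later on.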

### Importing Data into Python Environment

In [15]:
raw_rpt = "At 8:37 a.m. on January 22, 2018, five employees were working on an oil rig, tripping out the wellbore to change a bit. During work, the gas surfaced as the pipe/bit was removed from the well, causing the well to burp. Once the initial burp subsided, another burp occurred and gas fumes found an ignition source, causing a series of explosions on and around the rig. Five employees were in the doghouse, unable to escape and were over come by smoke and excessive heat. Three employees were killed."
print(raw_rpt)

At 8:37 a.m. on January 22, 2018, five employees were working on an oil rig, tripping out the wellbore to change a bit. During work, the gas surfaced as the pipe/bit was removed from the well, causing the well to burp. Once the initial burp subsided, another burp occurred and gas fumes found an ignition source, causing a series of explosions on and around the rig. Five employees were in the doghouse, unable to escape and were over come by smoke and excessive heat. Three employees were killed.


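Text from a random report is assigned to a string variable in Python. We can look at some basic features such as the length of the report. The "line" count below is a rough sentence count obtained by counting occurrences of `'. '`, and the word count is the number of spaces plus one. Both are approximations: abbreviations such as "a.m." also match `'. '`, while the final sentence, which has no trailing space, does not.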

In [11]:
print("Number of lines:", raw_rpt.count('. '))
print("Number of words:", raw_rpt.count(' ')+1)

Number of lines: 5
Number of words: 88


With a large corpus we cannot browse individual documents, so it is useful to look at features like these to find out whether some files contain no data or contain corrupted text, for example in an abnormal format or with special characters.
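
The check described above could be sketched roughly as follows. The function name `report_stats`, the mini-corpus and the 0.1 threshold are assumptions made for illustration. The "odd character" test is deliberately crude: it treats any non-ASCII character as suspicious, which would also flag legitimate accented names.

```python
import re


def report_stats(text):
    """Return simple health indicators for one report's text."""
    words = text.split()
    # Approximate sentence split: ., ! or ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?](?:\s+|$)", text) if s.strip()]
    # Count characters that are non-ASCII or non-printable control characters.
    odd = sum(
        1 for ch in text
        if not (ch.isascii() and (ch.isprintable() or ch.isspace()))
    )
    return {
        "n_words": len(words),
        "n_sentences": len(sentences),
        "is_empty": not words,
        "odd_char_ratio": odd / len(text) if text else 0.0,
    }


# Hypothetical mini-corpus keyed by report id.
corpus = {
    "ok": "Gas ignited. Three employees were killed.",
    "empty": "",
    "garbled": "caf\ufffd \x00rig",
}
stats = {rid: report_stats(text) for rid, text in corpus.items()}

# Flag reports that are empty or contain a high share of odd characters.
suspect = [rid for rid, s in stats.items()
           if s["is_empty"] or s["odd_char_ratio"] > 0.1]
print(suspect)
```

Reports that end up in `suspect` can then be inspected by hand before the rest of the pipeline runs.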
___

## 2. Text Preprocessing
Preprocessing is an important yet time-consuming part of analyzing textual data. It is vital because we need to convert human-readable text into a machine-readable format, which involves several steps and multiple tools or packages. Rather than importing all packages at the beginning, I will import each one in the cell where it is first used.

### Text Normalization
We will start with Text Normalization which includes:
- **converting all letters to lower or upper case**
- **converting numbers into words or removing numbers**
- **removing punctuation, accent marks and other diacritics**
- **removing extra whitespace**
- **expanding abbreviations**
- **removing stop words, sparse terms, and particular words**

Depending on our needs and the nature of the data, some of these steps might be redundant. We can remove or add steps as we see fit so that we do not lose important details in the data.
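
One way to keep the steps easy to add or remove is to treat each one as a small function and pass the chosen steps as a list. This is only a sketch of that idea: the function names and the `FULL` and `KEEP_NUMBERS` step lists are hypothetical, while the regular expressions mirror the ones used in the cells of this section.

```python
import re
import string


def to_lower(text):
    return text.lower()


def drop_digits(text):
    return re.sub(r"\d+", "", text)


def drop_punctuation(text):
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)


def squeeze_spaces(text):
    return re.sub(" +", " ", text.strip())


def normalize(text, steps):
    """Apply each normalization step to the text, in order."""
    for step in steps:
        text = step(text)
    return text


# Two hypothetical configurations: one drops numbers, one keeps them.
FULL = [to_lower, drop_digits, drop_punctuation, squeeze_spaces]
KEEP_NUMBERS = [to_lower, drop_punctuation, squeeze_spaces]

sample = "At 8:37 a.m., five employees were working."
full_result = normalize(sample, FULL)
keep_result = normalize(sample, KEEP_NUMBERS)
print(full_result)
print(keep_result)
```

Dropping or reordering a step then only means editing the list, not the functions themselves.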


#### Convert text to lowercase

In [16]:
report = raw_rpt.lower()
print(report)

at 8:37 a.m. on january 22, 2018, five employees were working on an oil rig, tripping out the wellbore to change a bit. during work, the gas surfaced as the pipe/bit was removed from the well, causing the well to burp. once the initial burp subsided, another burp occurred and gas fumes found an ignition source, causing a series of explosions on and around the rig. five employees were in the doghouse, unable to escape and were over come by smoke and excessive heat. three employees were killed.


#### Remove numbers
Numbers are generally not relevant for text analysis, as it is harder to extract context or meaning from numbers than from words. In our report, the numbers indicate the time and date, which are already provided in a separate field if we need them. The last sentence of the report gives the vital information that "three employees were killed" in word form, so we are not discarding any useful data.

In [18]:
import re  # regular expression library
report = re.sub(r"\d+", "", report)  # raw string avoids an invalid-escape warning for \d
print(report)

at : a.m. on january , , five employees were working on an oil rig, tripping out the wellbore to change a bit. during work, the gas surfaced as the pipe/bit was removed from the well, causing the well to burp. once the initial burp subsided, another burp occurred and gas fumes found an ignition source, causing a series of explosions on and around the rig. five employees were in the doghouse, unable to escape and were over come by smoke and excessive heat. three employees were killed.


#### Remove Punctuation
We can use the `string` library to remove symbols such as `` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``:

In [31]:
import string  # Along with importing string, we are also using the 're' Regex library
report = re.sub('[%s]' % re.escape(string.punctuation), '', report)
print(report)

at  am on january   five employees were working on an oil rig tripping out the wellbore to change a bit during work the gas surfaced as the pipebit was removed from the well causing the well to burp once the initial burp subsided another burp occurred and gas fumes found an ignition source causing a series of explosions on and around the rig five employees were in the doghouse unable to escape and were over come by smoke and excessive heat three employees were killed
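
Notice that deleting punctuation outright fuses "pipe/bit" into the single token "pipebit". If that is not wanted, one alternative is to replace each punctuation character with a space instead. The sketch below is not the approach used in this notebook; it assumes `str.translate` with a translation table, and the name `punctuation_to_space` is hypothetical.

```python
import string

# Map every punctuation character to a space instead of deleting it.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def punctuation_to_space(text):
    """Replace each punctuation character with a single space."""
    return text.translate(PUNCT_TO_SPACE)


sample = "the pipe/bit was removed from the well, causing the well to burp."
cleaned = punctuation_to_space(sample)
print(cleaned.split())
```

The trade-off is that abbreviations such as "a.m." are split into separate letters, and the extra spaces still need to be collapsed in the whitespace step.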


#### Remove whitespaces
We will use the `strip()` method to remove any leading or trailing whitespace and the `re.sub` function to collapse repeated spaces between words.

In [39]:
report = report.strip()
report = re.sub(' +', ' ',report)
print(report)

at am on january five employees were working on an oil rig tripping out the wellbore to change a bit during work the gas surfaced as the pipebit was removed from the well causing the well to burp once the initial burp subsided another burp occurred and gas fumes found an ignition source causing a series of explosions on and around the rig five employees were in the doghouse unable to escape and were over come by smoke and excessive heat three employees were killed
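
The pattern `' +'` only matches space characters. Scraped text may also contain tabs or newlines, so a slightly broader variant, sketched here under that assumption, matches any whitespace with `\s+`. The function name and the sample string are made up for illustration.

```python
import re


def collapse_whitespace(text):
    """Collapse any run of spaces, tabs or newlines into a single space."""
    return re.sub(r"\s+", " ", text).strip()


messy = "  five employees\twere in\n the  doghouse \n"
broad = collapse_whitespace(messy)

# For comparison: the space-only pattern leaves the tab and newline in place.
narrow = re.sub(" +", " ", messy.strip())

print(repr(broad))
print(repr(narrow))
```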


#### Tokenize and remove stopwords
Tokenization is the process of splitting text into smaller pieces called tokens. Stop words (commonly occurring words such as "the" or "and") should be removed because they carry little meaning on their own and only add uninformative columns to the word matrix we build later. We can either create a list of stop words ourselves or use a predefined one. Here we'll use the `nltk` library, which includes a list of common English stop words.
Apart from these, we can also remove specific words by creating our own word set.


In [47]:
import nltk
# If Python raises an error about missing NLTK resources,
# running these lines may help:
# nltk.download('stopwords')
# nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# tokenize
tokens = word_tokenize(report)
print(tokens)

['at', 'am', 'on', 'january', 'five', 'employees', 'were', 'working', 'on', 'an', 'oil', 'rig', 'tripping', 'out', 'the', 'wellbore', 'to', 'change', 'a', 'bit', 'during', 'work', 'the', 'gas', 'surfaced', 'as', 'the', 'pipebit', 'was', 'removed', 'from', 'the', 'well', 'causing', 'the', 'well', 'to', 'burp', 'once', 'the', 'initial', 'burp', 'subsided', 'another', 'burp', 'occurred', 'and', 'gas', 'fumes', 'found', 'an', 'ignition', 'source', 'causing', 'a', 'series', 'of', 'explosions', 'on', 'and', 'around', 'the', 'rig', 'five', 'employees', 'were', 'in', 'the', 'doghouse', 'unable', 'to', 'escape', 'and', 'were', 'over', 'come', 'by', 'smoke', 'and', 'excessive', 'heat', 'three', 'employees', 'were', 'killed']


In [49]:
# remove stopwords
stop_words = set(stopwords.words('english'))
wrd_tkns = [i for i in tokens if i not in stop_words]
print(wrd_tkns)

['january', 'five', 'employees', 'working', 'oil', 'rig', 'tripping', 'wellbore', 'change', 'bit', 'work', 'gas', 'surfaced', 'pipebit', 'removed', 'well', 'causing', 'well', 'burp', 'initial', 'burp', 'subsided', 'another', 'burp', 'occurred', 'gas', 'fumes', 'found', 'ignition', 'source', 'causing', 'series', 'explosions', 'around', 'rig', 'five', 'employees', 'doghouse', 'unable', 'escape', 'come', 'smoke', 'excessive', 'heat', 'three', 'employees', 'killed']


We can see that the string is now broken into a list of word strings. When we have multiple reports, the words will be separated into columns in a dataframe.
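
The project-specific word set mentioned earlier could be applied in the same way as the stop word list. In this sketch the contents of `custom_stop_words` are purely hypothetical; in practice I would choose them after looking at term frequencies across the whole corpus.

```python
# A shortened copy of the token list produced above.
tokens = ["january", "five", "employees", "working", "oil", "rig",
          "gas", "fumes", "found", "ignition", "source"]

# Hypothetical words that add little for this corpus (assumption).
custom_stop_words = {"january", "employees", "employee"}


def remove_words(tokens, unwanted):
    """Return the tokens that are not in the unwanted set, keeping their order."""
    return [tok for tok in tokens if tok not in unwanted]


filtered = remove_words(tokens, custom_stop_words)
print(filtered)
```

Using a `set` keeps each membership test fast even when the word list grows.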

### Stemming
Stemming is the process of reducing words to their stem, base or root form (for example, books → book, looked → look). The two main algorithms are the **Porter** stemming algorithm (which removes common morphological and inflectional endings from words) and the **Lancaster** stemming algorithm (a more aggressive stemmer).

We will try both the Porter and Lancaster stemmers, again from the `nltk` library, and see which produces better results.

In [51]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stm_tkns = [stemmer.stem(word) for word in wrd_tkns]
print(stm_tkns)

['januari', 'five', 'employe', 'work', 'oil', 'rig', 'trip', 'wellbor', 'chang', 'bit', 'work', 'ga', 'surfac', 'pipebit', 'remov', 'well', 'caus', 'well', 'burp', 'initi', 'burp', 'subsid', 'anoth', 'burp', 'occur', 'ga', 'fume', 'found', 'ignit', 'sourc', 'caus', 'seri', 'explos', 'around', 'rig', 'five', 'employe', 'doghous', 'unabl', 'escap', 'come', 'smoke', 'excess', 'heat', 'three', 'employe', 'kill']


In [68]:
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()
stm_tkns = [stemmer.stem(word) for word in wrd_tkns]
print(stm_tkns)

['janu', 'fiv', 'employ', 'work', 'oil', 'rig', 'trip', 'wellb', 'chang', 'bit', 'work', 'gas', 'surfac', 'pipebit', 'remov', 'wel', 'caus', 'wel', 'burp', 'init', 'burp', 'subsid', 'anoth', 'burp', 'occur', 'gas', 'fum', 'found', 'ignit', 'sourc', 'caus', 'sery', 'explod', 'around', 'rig', 'fiv', 'employ', 'dogh', 'un', 'escap', 'com', 'smok', 'excess', 'heat', 'three', 'employ', 'kil']


As we can see, the results from both stemmers are not quite satisfactory. For many words the final characters are cut off, and the resulting stems are neither meaningful nor actual English words.
Hence we will try lemmatization, a process similar to stemming. For lemmatization to work effectively, however, we need the part of speech of each word, so we will first tag our words with their parts of speech.

### Part of speech (POS) Tagging
Part-of-speech tagging aims to assign parts of speech to each word of a given text (such as nouns, verbs, adjectives, and others) based on its definition and its context. 

In [94]:
# We will use the nltk package method to create tags 
pos_tags = nltk.pos_tag(wrd_tkns)
print(pos_tags)

[('january', 'JJ'), ('five', 'CD'), ('employees', 'NNS'), ('working', 'VBG'), ('oil', 'NN'), ('rig', 'NN'), ('tripping', 'VBG'), ('wellbore', 'JJR'), ('change', 'NN'), ('bit', 'NN'), ('work', 'NN'), ('gas', 'NN'), ('surfaced', 'VBD'), ('pipebit', 'NN'), ('removed', 'VBN'), ('well', 'RB'), ('causing', 'VBG'), ('well', 'RB'), ('burp', 'RB'), ('initial', 'JJ'), ('burp', 'NN'), ('subsided', 'VBD'), ('another', 'DT'), ('burp', 'NN'), ('occurred', 'VBD'), ('gas', 'NN'), ('fumes', 'NNS'), ('found', 'VBN'), ('ignition', 'NN'), ('source', 'NN'), ('causing', 'VBG'), ('series', 'NN'), ('explosions', 'NNS'), ('around', 'IN'), ('rig', 'NN'), ('five', 'CD'), ('employees', 'NNS'), ('doghouse', 'VBP'), ('unable', 'JJ'), ('escape', 'NN'), ('come', 'VB'), ('smoke', 'NN'), ('excessive', 'JJ'), ('heat', 'NN'), ('three', 'CD'), ('employees', 'NNS'), ('killed', 'VBD')]


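We obtained POS tags for all of our words, but they are in what is called the Penn Treebank format. For `nltk`'s lemmatizer to work, we need the tags in WordNet format. Therefore I created a small mapper function below to convert Penn tags to WordNet tags. Note that some of the tags look wrong (for example, 'doghouse' is tagged as a verb and 'january' as an adjective), probably because the tagger was given tokens with stop words and punctuation already removed and so has little sentence context to work with.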

In [97]:
# part is a dictionary containing the Penn to Wordnet mapping as key, value pairs
part = {
    'N' : 'n',
    'V' : 'v',
    'J' : 'a',
    'S' : 's',
    'R' : 'r'
}


def convert_tag(penn_tag):
    """
    convert_tag() accepts the **first letter** of a Penn part-of-speech tag,
    then uses a dict lookup to convert it to the appropriate WordNet tag.
    """
    # other parts of speech will be tagged as nouns
    return part.get(penn_tag, 'n')

In [98]:
wrdnet_tags = [(word, convert_tag(tag[0])) for word, tag in pos_tags]
print(wrdnet_tags)

[('january', 'a'), ('five', 'n'), ('employees', 'n'), ('working', 'v'), ('oil', 'n'), ('rig', 'n'), ('tripping', 'v'), ('wellbore', 'a'), ('change', 'n'), ('bit', 'n'), ('work', 'n'), ('gas', 'n'), ('surfaced', 'v'), ('pipebit', 'n'), ('removed', 'v'), ('well', 'r'), ('causing', 'v'), ('well', 'r'), ('burp', 'r'), ('initial', 'a'), ('burp', 'n'), ('subsided', 'v'), ('another', 'n'), ('burp', 'n'), ('occurred', 'v'), ('gas', 'n'), ('fumes', 'n'), ('found', 'v'), ('ignition', 'n'), ('source', 'n'), ('causing', 'v'), ('series', 'n'), ('explosions', 'n'), ('around', 'n'), ('rig', 'n'), ('five', 'n'), ('employees', 'n'), ('doghouse', 'v'), ('unable', 'a'), ('escape', 'n'), ('come', 'v'), ('smoke', 'n'), ('excessive', 'a'), ('heat', 'n'), ('three', 'n'), ('employees', 'n'), ('killed', 'v')]


The POS tags are now in WordNet format, and we can pass them to the lemmatizer.
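
A slightly more defensive variant of the mapper could accept the full Penn tag and take the first letter itself, which also copes with an empty tag. This is a sketch, not a replacement for the function above. The names are hypothetical, and it leaves out the `'S'` entry on the assumption that the only Penn tag starting with S is `SYM` (symbol), which is not an adjective.

```python
# First letter of a Penn Treebank tag mapped to a WordNet POS tag.
PENN_PREFIX_TO_WORDNET = {"N": "n", "V": "v", "J": "a", "R": "r"}


def penn_to_wordnet(penn_tag, default="n"):
    """Convert a full Penn tag such as 'VBG' to a WordNet tag such as 'v'."""
    # Slicing with [:1] is safe even when penn_tag is an empty string.
    return PENN_PREFIX_TO_WORDNET.get(penn_tag[:1], default)


tagged = [("employees", "NNS"), ("working", "VBG"), ("initial", "JJ"),
          ("well", "RB"), ("five", "CD")]
converted = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(converted)
```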

### Lemmatization
The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. As opposed to stemming, lemmatization does not simply chop off inflections. Instead it uses lexical knowledge bases to get the correct base forms of words.

In [56]:
# download if 'wordnet' not in nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Astron\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [99]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# each item in wrdnet_tags is a (word, wordnet_tag) pair
lmt_tkns = [lemmatizer.lemmatize(word, tag) for word, tag in wrdnet_tags]
print(lmt_tkns)

['january', 'five', 'employee', 'work', 'oil', 'rig', 'trip', 'wellbore', 'change', 'bit', 'work', 'gas', 'surface', 'pipebit', 'remove', 'well', 'cause', 'well', 'burp', 'initial', 'burp', 'subside', 'another', 'burp', 'occur', 'gas', 'fume', 'find', 'ignition', 'source', 'cause', 'series', 'explosion', 'around', 'rig', 'five', 'employee', 'doghouse', 'unable', 'escape', 'come', 'smoke', 'excessive', 'heat', 'three', 'employee', 'kill']


The words are now reduced to their base forms at a sufficient level. So, when we import the data for all reports into a dataframe, we can skip stemming and use the lemmatizer to trim the words.
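
To see why the part of speech matters, here is a toy stand-in for a lexicon-based lemmatizer. It is not how WordNet's lookup is implemented; the tiny `TOY_LEXICON` is hand-written for illustration and only shows that the same word can map to different base forms depending on its tag.

```python
# Hand-written (word, pos) -> lemma entries, for illustration only.
TOY_LEXICON = {
    ("found", "v"): "find",
    ("employees", "n"): "employee",
    ("surfaced", "v"): "surface",
    ("killed", "v"): "kill",
}


def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); return the word unchanged if unknown."""
    return TOY_LEXICON.get((word, pos), word)


as_verb = toy_lemmatize("found", "v")
as_noun = toy_lemmatize("found", "n")
print(as_verb, as_noun)
```

With the verb tag, "found" is reduced to "find"; with the noun tag it is left as it is, which is why getting the POS tags right matters before lemmatizing.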

This concludes text preprocessing for now. These are the majority of the steps involved in the process, and a few more may be added as we scale up.