# Capstone Project: Criminal Case Database

### Overall Contents:
- Background
- Webscraping Lawnet
- Webscraping Singapore Statutes
- [Natural Language Processing](#4.-Natural-Language-Processing) **(In this notebook)**
- Search Function
- Flask and Google App Engine
- Conclusion and Recommendation

## Datasets

For the Natural Language Processing of the datasets, I will use the datasets which I created previously. These will be run through the processor to obtain the final datasets which will form the basis of my database.  

The datasets that I will use are as follows:- 

* subordinatecourt.csv
* subordinatecourt_compiled.csv 
* supremecourt.csv 
* supremecourt_compiled.csv
* statutes_crimes.csv

The datasets that I will create are as follows:-

* database.csv
* database_temp.csv  


The information extracted will be presented in the database in the following format:  

|Name|Type|Dataset|Description|
|:---|:---|:---|:---|
|**case_name**|*object*|database.csv|Case name|
|**tribunal/court**|*object*|database.csv|Court of judgment|
|**decision_date**|*object*|database.csv|Decision date of the judgment|
|**possible_offences**|*object*|database.csv|Possible offences discussed in the judgment|
|**possible_statutes**|*object*|database.csv|Possible statutes discussed in the judgment|
|**citations**|*object*|database.csv|Other cases discussed or cited in the judgment|
|**mitigation_discussed**|*object*|database.csv|Whether mitigating circumstances were discussed in the judgment|
|**aggravation_discussed**|*object*|database.csv|Whether aggravating circumstances were discussed in the judgment|


## 4. Natural Language Processing

In this notebook, I will be exploring the use of Natural Language Processing (NLP) to identify key information from each judgment to create a database of information for the judgments that I have archived.  

This database can be used as a starting point for legal research by quickly giving statistical summaries of recent cases, and the case links on Lawnet.  

The information I am looking to extract:  

* Case name  
* Court of judgment  
* Decision date  
* Possible offences discussed in the judgment 
* Possible statutes discussed in the judgment  
* Cases cited in the judgment  
* Whether mitigating circumstances were discussed  
* Whether aggravating circumstances were discussed  

The above information may help a lawyer to quickly decide whether the judgment is relevant to his/her case and whether it is worth researching further. It will also help him/her to quickly identify further cases cited in order to expand the research.  
To achieve this, there are two possible methods of information extraction which I am aware of:  
1. Rule-based Information Extraction (RBIE)  
2. Named Entity Recognition (NER)  

In RBIE, a set of rules (or multiple sets of rules) is used for the identification of patterns which match the rules in order to extract the information. It is a more transparent method as the rules are clearly defined, and it can be maintained easily.

In NER, unstructured text is processed through machine learning to identify named entities and classify them into predefined categories, which for this project would be categories such as `case_name` or `statutes`. NER can be advantageous because the trained model can capture and classify named entities which fixed rules may miss. However, as with most machine learning models, it may overfit or underfit, and maintaining it would require retraining the model.  

For my project, I chose to work with RBIE given the limited amount of time I have. I did not want to use a pre-trained NER model as a 'plug-and-play' method, and training an NER model from scratch to extract the information I want would require manually tagging the named entities in the training data. Given the sheer number of words in each judgment, and the limited data that I have (fewer than 200 judgments in total), I was not confident that I had sufficient data and time to train the NER model.  
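
To make the rule-based approach concrete, the sketch below shows the general shape of RBIE: each field is paired with one regular expression, and extraction simply applies every rule to the text. The field names, the two patterns, and the sample sentence (including the invented case name) are illustrative assumptions and are much simpler than the rule sets used later in this notebook.

```python
import re

# Hypothetical rule set: one regular expression per field to extract
RULES = {
    'citation_year': re.compile(r'\[(\d{4})\]'),
    'section': re.compile(r'\b[Ss](?:ection)? (\d+)'),
}


def extract(text):
    """Apply every rule to the text and keep the first match for each field."""
    extracted = {}
    for field, pattern in RULES.items():
        match = pattern.search(text)
        # A rule that does not match yields None for its field
        extracted[field] = match.group(1) if match else None
    return extracted


# Invented sample sentence, for illustration only
sample = 'Public Prosecutor v Tan Ah Kow [2021] SGHC 1 concerned an offence under s 5.'
result = extract(sample)
print(result)
```

Because every rule is an explicit pattern, a wrong extraction can be traced back to the rule that produced it, which is the transparency advantage described above.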

### 4.1 Judgment Processing

Once again, I created a custom class `Database` which has methods for the following:  
1. Calculate the number of rows for each `court` class in the final database, and in their respective court judgment databases  
2. Identify the number of new rows to set a start and end point for the NLP  
3. Extract the various information from the judgment based on sets of rules   
4. Update and save the final database with the information from new judgments  

![example of a judgment](../images/judgment.png)  

An example of a judgment from Lawnet [1]:  
- The blue box indicates the `case_name`  
- The red box indicates the `decision_date` and `tribunal/court`  
- The green boxes show examples of `possible_statutes`, which consist of section numbers and statute names  
- The orange boxes show examples of other case `citations`  

By going through the html code of the judgment, I was able to pick out various html tags where the information required can be found. Thus the basis of my RBIE rule sets was formed.

### 4.1.1 Database class creation  

The first step once again is to create the class Database, and initialize it.  

<details> 
    <summary> <b> Click here for code </b></summary>
    
```python
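# Imports used by the methods of the class
# (CourtNameError is assumed to be defined elsewhere in the same module)
import codecs
import itertools
import re
from datetime import datetime

import pandas as pd
from bs4 import BeautifulSoup
from IPython.display import clear_output

# Create the Database class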
class Database:

    def __init__(self):
        """
        Initializes the class and loads the datasets.
        """
        # Load the .csv files as pandas dataframes
        self.supremecourt_df = pd.read_csv('../data/supremecourt_compiled.csv')
        self.subordinatecourt_df = pd.read_csv('../data/subordinatecourt_compiled.csv')
        self.database_df = pd.read_csv('../data/database.csv')
        self.statutes_df = pd.read_csv('../data/statutes_crimes.csv')
        
```  
</details>  

During initialization, the datasets to be used are loaded as pandas dataframes.

### 4.1.2 Number of rows  

As mentioned above, I created a private method which calculates the number of rows in the two court datasets `supremecourt_compiled.csv` and `subordinatecourt_compiled.csv`, as well as the difference in number of rows for each of these court categories in the final database `database.csv`.  

The purpose of this method is to find out if there are any new entries to be processed (determined by a difference in number of rows between the court datasets and the final database) and to find the start and end indices for the NLP.  
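
The index arithmetic can be sketched without pandas as follows. This is a simplified illustration rather than the class method itself: the function name `new_row_range` and the sample counts are assumptions, and it relies on the same premise as the method, namely that judgments are appended to the court dataset in order, so the new ones are always at the end.

```python
def new_row_range(court_rows, database_tags, court_tag):
    """Return the (start, end) indices of judgments not yet in the database."""
    # Rows already in the final database for this court
    already_processed = sum(1 for tag in database_tags if tag == court_tag)

    # The difference is the number of new judgments to process
    new_rows = court_rows - already_processed

    # New judgments sit at the end of the court dataset
    start = court_rows - new_rows
    end = court_rows
    return start, end


# Hypothetical state: the database holds 3 supreme and 2 subordinate court rows
database_tags = ['supreme', 'supreme', 'supreme', 'subordinate', 'subordinate']

# The supreme court dataset now has 5 rows, so rows 3 and 4 are new
supreme_range = new_row_range(5, database_tags, 'supreme')

# The subordinate court dataset still has 2 rows, so the range is empty
subordinate_range = new_row_range(2, database_tags, 'subordinate')
```

An empty range (start equal to end) is the signal that there is nothing new to process.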

<details>
    <summary> <b> Click here for code </b></summary>
    
```python
    def __get_num_rows(self):
        """
        Calculates the number of rows there are in each dataset.
        """
        # Find the number of rows in each court's dataframe
        self.__supremecourt_rows = len(self.supremecourt_df)
        self.__subordinatecourt_rows = len(self.subordinatecourt_df)
        
        # Compare the number of rows for each court in the database to find out how many new rows there are
        self.__new_supremecourt_rows = self.__supremecourt_rows - len(self.database_df[self.database_df['court_tag'] == 'supreme'])
        self.__new_subordinatecourt_rows = self.__subordinatecourt_rows - len(self.database_df[self.database_df['court_tag'] == 'subordinate'])
   
    
``` 
</details>  

### 4.1.3 Case name  

As mentioned above, based on the html tags, I was able to identify that the `case_name` was nested within the `h2` tag. Hence the method to extract the case name searches only within this tag. It takes the info as text and returns a dictionary for the `case_name`.  

Further, it creates a temporary variable, `temp_case_name`, which holds the case name without its legal citation notation (for example, "[2021] SGCA 3" in the image above). This will be used to remove the `case_name` of the judgment being processed from the other case `citations` down the pipeline.  
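
As a standalone illustration of this step, the sketch below applies the same case-name pattern to a heading string. The helper name `strip_citation` is hypothetical, and the heading is an illustrative string put together for this example, not necessarily the real citation of that case.

```python
import re

# The case-name pattern used in this notebook, written as a raw string
CASE_NAME_PATTERN = (
    r'(([A-Z][a-z]*)(([A-Z][a-z]*)|(s\/o| |bte|bin|and|another|anr|binti|de|the|for|other|matters))*'
    r' v (([A-Z][a-z]*)|(s\/o| |bte|bin|and|another|anr|binti|de|the|for|other|matters))*(?=|))'
)


def strip_citation(heading):
    """Return the 'A v B' part of a heading, or None if no case name is found."""
    match = re.search(CASE_NAME_PATTERN, heading)
    return match.group(0).strip() if match else None


# Illustrative heading only
heading = 'Syed Suhail bin Syed Zin v Public Prosecutor [2021] SGCA 3'
temp_case_name = strip_citation(heading)
print(temp_case_name)
```

The match stops at the opening square bracket because `[` is neither a capitalised word nor one of the listed name connectors, which is what drops the citation notation.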

<details>
    <summary> <b> Click here for code </b></summary>
    
```python
    def __get_case_name(self):
        """
        Takes out the case name from the judgment
        """
        # Search for the case name in the judgment
        self.__case_name = {'case_name': self.__search_results.find('h2').text.strip()}
        
        # Create a temp variable for the case name without the case citation notation by looking for patterns of Capitalized words with name terms which are followed by v and further capitalized words with name terms as these suggest that it is a case name
        self.__temp_case_name = re.search(r'(([A-Z][a-z]*)(([A-Z][a-z]*)|(s\/o| |bte|bin|and|another|anr|binti|de|the|for|other|matters))* v (([A-Z][a-z]*)|(s\/o| |bte|bin|and|another|anr|binti|de|the|for|other|matters))*(?=|))', self.__case_name['case_name']).group(0).strip()
        
        # Return case name as a dictionary
        return self.__case_name   
    
``` 
</details>  

### 4.1.4 Court name and Decision Date  

The `tribunal/court` and `decision_date` information are found within the table with the id `info-table` in the html. Hence their methods simply search within this table to match the patterns `Tribunal/Court :` and `Decision Date :` and capture the information that follows.  

The information is then split into two parts at the colon `:` to return the final information dictionaries.  
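
The capture-then-split step can be sketched on a plain string as below. The helper `split_field` and the sample table text are assumptions made for illustration; the real info table text may be laid out differently.

```python
import re


def split_field(label, stop_word, table_text):
    """Capture 'Label : value' up to the next label and return (key, value)."""
    match = re.search(rf'{label} : (\w* )*(?={stop_word})', table_text)
    if match is None:
        return None
    # Split 'Label : value' at the colon into its two parts
    key, value = match.group(0).strip().split(' : ')
    return key, value


# Invented info-table text, for illustration only
table_text = 'Decision Date : 12 January 2021 Tribunal/Court : Court of Appeal Coram : Judge A'

court_key, court_value = split_field('Tribunal/Court', 'Coram', table_text)
_, date_value = split_field('Decision Date', 'Tribunal', table_text)

court = {court_key.lower(): court_value}
decision_date = {'decision_date': date_value}
print(court, decision_date)
```

The lookahead (`(?=Coram)` or `(?=Tribunal)`) is what ends the capture: the value runs until the next label in the table begins.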

<details>
    <summary> <b> Click here for __get_court code </b></summary>
    
```python
    def __get_court(self):
        """
        Takes out the court name from the judgment
        """
        # Search for the info table in the html
        self.__temp_results1 = self.__search_results.find('table', {'id': 'info-table'})
        
        # Picks out the court info in the search results as string
        self.__temp_court = re.search(r'Tribunal/Court : (\w* )*(?=Coram)', self.__temp_results1.text).group(0).strip()
        
        # Split the court info string to the key and value
        self.__temp_list = self.__temp_court.split(" : ")
        
        # Set court info in a dictionary
        self.__court = {str.lower(self.__temp_list[0]): self.__temp_list[1]}
        
        # Return court info as a dictionary
        return self.__court   
    
``` 
</details>  

<details>
    <summary> <b> Click here for __get_date code </b></summary>
    
```python
    def __get_date(self):
        """
        Takes out the decision date from the judgment
        """
        # Search for the info table in the html
        self.__temp_results1 = self.__search_results.find('table', {'id': 'info-table'})
        
        # Picks out the decision date in the search results as string
        self.__temp_date = re.search(r'Decision Date : (\w* )*(?=Tribunal)', self.__temp_results1.text).group(0).strip()
        
        # Split the decision date to the key and value
        self.__temp_list = self.__temp_date.split(" : ")
        
        # Set decision date in a dictionary
        self.__decision_date = {'decision_date': self.__temp_list[1]}
        
        # Return decision date as a dictionary
        return self.__decision_date   
    
``` 
</details>  

### 4.1.5 Possible Statutes and Offences  

Identifying the possible statutes and offences required a more complex method. They could either be found within the judgment's header with html tags `span` within `p class:'txt-body'` or within the judgment text itself.  

The pattern to identify possible statutes starts with a section reference in one of the following formats:  
* `Section` `space` `digits`, where "section" can be plural and can start with an upper- or lower-case "s", and there is at least 1 digit (e.g. Section 33)  
* `space` `S` `space` `digits`, where the abbreviation "s" is preceded by a space, can be plural ("ss") and can be upper- or lower-case, and there is at least 1 digit (e.g. s 33)  

This is followed by the name of the statute:  
* `Word` `space` `of (optional)` `Act`, where there can be more than one word which may have `of` in between, followed by Act in title case (e.g. Misuse of Drugs Act)  
* `Word` `space` `of (optional)` `Code`, where there can be more than one word which may have `of` in between, followed by Code in title case (e.g. Penal Code)

This results in a `regex` pattern of `( [Ss](ection|)(s|) \d+)` to find the sections and `((([A-Z][a-z]*)|(Corruption, Drug Trafficking and Other Serious Crimes \(Confiscation of Benefits\)|and|of| )){2,}(Act|Code))` to find the names of the statutes.  

I hard-coded the Corruption, Drug Trafficking and Other Serious Crimes (Confiscation of Benefits) Act into the pattern because its name, with a comma and brackets, is too complex for the general rule, and not many statutes follow the same format.

The method first tries to find the possible statutes in the header, failing which, it will look within the judgment text.  

The sections and statute names are then extracted and joined to form a final pattern such as `5 Misuse of Drugs Act` in the image above. This pattern is matched against the `statutes_crimes` dataset which I created previously in order to try to find a possible title for the offence.  

Where multiple sections and statute names are found, the method pairs every section with every statute name (a Cartesian product) and looks for a match for each pair.

All the information is passed through sets at some point in the method to prevent duplicates.

If an offence title is found, the possible offence and possible statute are added to a list, and all `possible_offences` and `possible_statutes` are returned as comma-separated strings in a dictionary.  

If no offence title is found, `Not in database` is returned for the possible offence while the `possible_statutes` are still returned.  
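
The fallback search over the judgment text can be sketched as follows. This is a reduced version under stated assumptions: the statute pattern omits the hard-coded Act, `SAMPLE_LOOKUP` is a hypothetical one-entry stand-in for the `statutes_crimes` dataset, the function name is made up, and the sample sentence is invented. It uses the same `Not in database` marker as the class code.

```python
import itertools
import re

SECTION_PATTERN = r'( [Ss](ection|)(s|) \d+)'
# Simplified: the hard-coded long Act name is left out of this sketch
STATUTE_PATTERN = r'((([A-Z][a-z]*)|(and|of| )){2,}(Act|Code))'

# Hypothetical stand-in for the statutes dataset: 'section statute' -> offence title
SAMPLE_LOOKUP = {'5 Misuse of Drugs Act': 'Trafficking in controlled drugs'}


def find_possible_offences(text, lookup):
    """Pair every section number with every statute name and look each pair up."""
    # Sets remove duplicate sections and statute names
    sections = {re.findall(r'\d+', m[0])[0] for m in re.findall(SECTION_PATTERN, text)}
    statutes = {m[0].strip() for m in re.findall(STATUTE_PATTERN, text)}

    results = []
    # Sorting only makes the output order deterministic
    for section, statute in itertools.product(sorted(sections), sorted(statutes)):
        key = f'{section} {statute}'
        results.append((lookup.get(key, 'Not in database'), key))
    return results


# Invented sentence, for illustration only
text = 'He was charged under s 5 of the Misuse of Drugs Act, read with section 33 of the same Act.'
matches = find_possible_offences(text, SAMPLE_LOOKUP)
print(matches)
```

Pairing every section with every statute over-generates candidates, which is why the results are described as *possible* offences and statutes rather than definitive ones.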

<details>
    <summary> <b> Click here for code (Warning: very long code)</b></summary>
    
```python
    def __get_statute(self):
        """
        Identifies criminal offences and statutes mentioned in the judgment based on its header as a first choice, and text as a second choice.
        """
        # Create an empty list of offences
        self.__offences = []
        
        # First try to identify the crimes based on the header of the judgment.
        try:
            # Search for the header in the html
            self.__temp_results1 = self.__search_results.find('p', {'class': 'txt-body'})
            self.__temp_results2 = self.__temp_results1.find_all('span')
            
            # Iterate through the results and try to find the offences
            for result in self.__temp_results2:
                try:
                    # Search for "Section(s)" or "s(s)" (abbreviated sections) followed by digits, after replacing non-breaking spaces
                    section = re.search(r'([Ss](ection|)(s|) \d+)', result.text.replace('\xa0',' ')).group(0).strip()
                    
                    # Pick out only the section numbers
                    section_num = re.sub(r'([Ss](ection|)(s|) )', "", section)
                    
                    # Pick out patterns which end in Act or Code as these refer to statutes
                    statute = re.search(r'((([A-Z][a-z]*)|(Corruption, Drug Trafficking and Other Serious Crimes \(Confiscation of Benefits\)|and|of| )){2,}(Act|Code))', result.text.replace('\xa0',' ')).group(0).strip()
                    
                    # Combine section numbers and statute
                    section_statute = section_num + " " + statute
                    
                    # Checks the database of statutes I created to find a possible offence if it exists within and adds it to the list of offences for this judgment
                    if section_statute in self.statutes_df['section_statute'].values:
                        index = self.statutes_df[self.statutes_df['section_statute'] == section_statute].index
                        offence = self.statutes_df.iloc[index].values
                        self.__offences.append([offence[0][1], section_statute])
                        
                    # If not found within the database of statutes, adds the section number and statute but lists the offence as "Not in database"
                    else:
                        self.__offences.append(['Not in database', section_statute])
                        
                except:
                    pass
        except:
            pass
        
        # If the judgment header does not contain the section and statute, identify it through the text
        if len(self.__offences) == 0:
            print('No offences found in header, checking document text..')
            
            # Instantiate empty lists
            offence_b = []
            sections_found = []
            statutes_found = []
            
            # Search for all the document text
            self.__search_results2 = self.__search_results.text.replace("\xa0"," ")
            
            # Instantiate empty lists
            section_list = []
            statute_list = []
            try:
                # Try to find the sections as above
                sections = re.findall(r'( [Ss](ection|)(s|) \d+)', self.__search_results2)

                # Iterate through the sections and add their digits to the section list
                for s in sections:
                    section_list.append(re.findall(r'\d+', s[0]))

                # Add the section numbers found to the list
                for ss in section_list:
                    sections_found.append(ss[0])

                # Try to find the statutes as above
                statutes = re.findall(r'((([A-Z][a-z]*)|(Corruption, Drug Trafficking and Other Serious Crimes \(Confiscation of Benefits\)|and|of| )){2,}(Act|Code))', self.__search_results2)

                # Iterate through the statutes and add their names to the statute list
                for s in statutes:
                    statute_list.append(s[0].strip())

                # Add the statutes found to the list
                for ss in statute_list:
                    statutes_found.append(ss)
            except Exception:
                pass
            
            if statutes_found != []:
                # Convert the sections and statutes found to sets to remove duplicates
                sections2 = set(sections_found)
                statutes2 = set(statutes_found)
                
                # Permutate through the sections and statutes to find all possible combinations of the two
                combinations = list(itertools.product(list(sections2),list(statutes2)))
                
                # Check if combinations is blank
                if combinations == []:
                    combinations = list(statutes_found)
                
                # Instantiate list of possible offences
                possible_offences = []
                
                # If combinations holds plain strings (statute names only), use them directly as possible_offences
                if isinstance(combinations[0], str):
                    possible_offences = combinations
                
                else:
                    # Add each possible offence from the combinations to the list of possible offences
                    for offence in combinations:
                        possible_offences.append(' '.join(offence))
                    
                # Check the database of statutes I created to find a possible offence if it exists within and adds it to the list of offences for this judgment
                for value in possible_offences:
                    if value in self.statutes_df['section_statute'].values:
                        index = self.statutes_df[self.statutes_df['section_statute'] == value].index
                        offence = self.statutes_df.iloc[index].values
                        offence_b.append([offence[0][1], value])
                        
                    # If not found within the database of statutes, adds the section number and statute but lists the offence as "Not in database"
                    else:
                        offence_b.append(['Not in database', value])

                    # Add any new entries to the list of offences, excluding duplicates
                    for x in offence_b:
                        if x not in self.__offences:
                            self.__offences.append(x)
            
        # Instantiate empty lists and dictionaries
        self.__title = []
        self.__temp_title = []
        self.__statute = []
        self.__title_statute = {}
        
        # Iterate through self.__offences
        for item in self.__offences:
        # Add each offence into temp_title and statute
            self.__temp_title.append(item[0])
            self.__statute.append(item[1])
            
        # converts temp_title to a set to remove duplicates
        self.__temp_title = set(self.__temp_title)
        
        # Adds each temp_title to title
        for titles in self.__temp_title:
            self.__title.append(str(titles))
            
        # Joins all the titles in title list as a full string
        self.__title = ",".join(self.__title)
        
        # Joins all the statutes in statute list as a full string
        self.__statute = ",".join(self.__statute)
        
        # Returns title and statute as a dictionary
        self.__title_statute = {'possible_titles': self.__title, 'possible_statutes': self.__statute}
        return self.__title_statute
        
    
``` 
</details>  

### 4.1.6 Other cases cited  

For the other cases cited in the judgment, the following pattern was used:  
* `Title Case Word` `space` `special name connectors (optional)` or `suffix (optional)` `space` `v` `space` `Title Case Word` `space` `special name connectors (optional)` or `suffix (optional)` (e.g. Syed Suhail bin Syed Zin v Public Prosecutor)  

This results in a `regex` expression of `(([A-Z][a-z]*)(([A-Z][a-z]*)|(s\/o| |bte|bin|and|another|anr|binti|de|the|for|other|matters))* v (([A-Z][a-z]*)|(s\/o| |bte|bin|and|another|anr|binti|de|the|for|other|matters))*(?=|))`.  

A `findall` was used to find all possible matches, and all the matches are iterated through and appended in a list if they are not duplicates.  

The `temp_case_name` created previously is now used to remove the judgment's `case_name` from the list of other case `citations`.  

Finally, a few outlier captured words are removed and the list is converted to a string to be stored in a dictionary which is returned.  
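
The citation steps (find all matches, drop duplicates, remove the judgment's own name, join into a string) can be sketched on a short text as below. The function name and the case names in the sample text are invented for illustration, and the final outlier-word clean-up is left out.

```python
import re

# The case-name pattern used in this notebook, written as a raw string
CASE_NAME_PATTERN = (
    r'(([A-Z][a-z]*)(([A-Z][a-z]*)|(s\/o| |bte|bin|and|another|anr|binti|de|the|for|other|matters))*'
    r' v (([A-Z][a-z]*)|(s\/o| |bte|bin|and|another|anr|binti|de|the|for|other|matters))*(?=|))'
)


def find_citations(judgment_text, own_case_name):
    """Return the other cases cited in the text as a comma-separated string."""
    cases = []
    for groups in re.findall(CASE_NAME_PATTERN, judgment_text):
        # groups[0] is the full match; drop an "In " captured from a sentence start
        case = groups[0].strip().replace('In ', '')
        if case not in cases:
            cases.append(case)

    # A judgment cannot be its own citation
    if own_case_name in cases:
        cases.remove(own_case_name)
    return ','.join(cases)


# Invented text and case names, for illustration only
judgment_text = (
    'Tan Ah Kow v Public Prosecutor [2021] SGHC 1. '
    'In Lim Beng Huat v Public Prosecutor, the court set out a framework. '
    'The framework in Lim Beng Huat v Public Prosecutor was applied in '
    'Public Prosecutor v Ong Siew Ling.'
)
citations = find_citations(judgment_text, 'Tan Ah Kow v Public Prosecutor')
print(citations)
```

Checking that the judgment's own name is in the list before removing it avoids a `ValueError` when the name was not captured by the pattern.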

<details>
    <summary> <b> Click here for code </b></summary>
    
```python
    def __get_citations(self):
        """
        Searches the document text for case citations which are in the format of `____ v ____`.
        """
        # Replaces weird characters from html and return text
        self.__judgment_text = self.__search_results.text.replace('\xa0','')
        
        # Searches through text for patterns of Capitalized words with name terms which are followed by v and further capitalized words with name terms as these suggest that it is a case name
        self.__case_search = re.findall(r'(([A-Z][a-z]*)(([A-Z][a-z]*)|(s\/o| |bte|bin|and|another|anr|binti|de|the|for|other|matters))* v (([A-Z][a-z]*)|(s\/o| |bte|bin|and|another|anr|binti|de|the|for|other|matters))*(?=|))', self.__judgment_text)
        
        # Instantiate an empty list
        self.__cases = []
        
        # Iterate through the results to append the case name (the full match is the first group) into a list, excluding duplicates
        for item in self.__case_search:
            case = item[0].strip().replace("In ","")
            if case not in self.__cases:
                self.__cases.append(case)
            
        # Remove this judgment's name from the list, if present, as it cannot be its own citation
        if self.__temp_case_name in self.__cases:
            self.__cases.remove(self.__temp_case_name)
        
        # Change the list to a string
        self.__cases = ",".join(self.__cases)
        
        # Try removing a few wrong words which are captured
        try:
            self.__cases = self.__cases.replace('Antecedents','').replace('Untraced','').strip()
        except:
            pass
        
        # Returns the citations as a dictionary
        self.__citations = {'citations': self.__cases}
        return self.__citations   
    
``` 
</details>  

### 4.1.7 Mitigating and Aggravating circumstances  

To identify whether mitigating and aggravating circumstances were discussed in the judgment, I simply did a search for the words `mitigation`, `mitigating`, `aggravating`, `aggravated` (with the first letter in either upper or lower case) in the judgment text and returned `1` if they were found and `0` if they were not.  

The purpose of returning a binary value is so that I can quickly calculate the proportion of judgments for the statistics by summing the values and dividing by the number of rows.  
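
A small sketch of the flagging and of the statistic it enables is shown below. The function name and the four sample sentences are made up for illustration.

```python
import re


def discussion_flags(judgment_text):
    """Return 1/0 flags for whether mitigation and aggravation are mentioned."""
    mitigation = 1 if re.search(r'[mM]itigation|[mM]itigating', judgment_text) else 0
    aggravation = 1 if re.search(r'[aA]ggravating|[aA]ggravated', judgment_text) else 0
    return {'mitigation_discussed': mitigation, 'aggravation_discussed': aggravation}


# Invented one-sentence "judgments", for illustration only
samples = [
    'The mitigating factors were outweighed by the aggravating factors.',
    'Counsel made a plea in mitigation.',
    'The appeal against conviction was dismissed.',
    'The offence was aggravated by the breach of trust.',
]
flags = [discussion_flags(text) for text in samples]

# With binary flags, the mean is the share of judgments that discuss each topic
mitigation_rate = sum(f['mitigation_discussed'] for f in flags) / len(flags)
aggravation_rate = sum(f['aggravation_discussed'] for f in flags) / len(flags)
print(mitigation_rate, aggravation_rate)
```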

<details>
    <summary> <b> Click here for code </b></summary>
    
```python
    def __get_miscellaneous(self):
        """
        Searches the document text to identify if mitigating factors were discussed, and if aggravating factors were discussed
        """
        # Searches the judgment text to see if mitigation or mitigating is mentioned and returns 1 for yes, 0 for no
        if re.search(r'[mM]itigation|[mM]itigating',self.__judgment_text):
            self.__mitigation_discussed = 1
        else:
            self.__mitigation_discussed = 0
            
        # Searches the judgment text to see if aggravating or aggravated is mentioned and returns 1 for yes, 0 for no
        if re.search(r'[aA]ggravating|[aA]ggravated',self.__judgment_text):
            self.__aggravated_discussed = 1
        else:
            self.__aggravated_discussed = 0   
            
        # Returns the results as a dictionary
        self.__miscellaneous = {'mitigation_discussed': self.__mitigation_discussed, 'aggravation_discussed': self.__aggravated_discussed}
        return self.__miscellaneous   
    
``` 
</details>  


### 4.1.8 Database creation and export  

For the actual processing of the judgments, the code first identifies which is the relevant court tag, to load the correct dataset of judgments, and sets the relevant start and end indices.  

The method then iterates from the start index to the end index, loading each archived html file and parsing it through `BeautifulSoup`. It then calls the various methods above to get the `case_name`, `tribunal/court`, `decision_date`, `possible_offences`, `possible_statutes`, `citations`, `mitigation_discussed`, and `aggravation_discussed`.  

At the same time, it also adds a `court_tag`, which identifies whether it is from the `subordinate` or `supreme` court, and `link` from the court links database. These are to allow easy identification of new entries through the number of rows, and linking to the original judgment online as this project will not host any judgments.  

The extracted information is merged into a dictionary for each judgment and added to a list of dictionaries, which is finally used to create or merge into the final database.  

The final database is then saved as a csv file.  
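
The merge-and-save step can be sketched with the standard library alone. This is an illustration under assumptions, not the class code: the helper `merge_record`, the sample values, and the placeholder link are invented, and the csv is written to an in-memory buffer instead of `database.csv`.

```python
import csv
import io


def merge_record(*parts):
    """Merge the per-field dictionaries of one judgment into a single row."""
    record = {}
    for part in parts:
        record.update(part)
    return record


# Invented values, for illustration only
record = merge_record(
    {'case_name': 'Sample v Public Prosecutor [2021] SGHC 1'},
    {'tribunal/court': 'High Court'},
    {'decision_date': '12 January 2021'},
    {'court_tag': 'supreme'},
    {'link': 'https://example.com/judgment'},
)

# One dictionary per judgment; the list becomes the rows of the database
rows = [record]

# Write the rows as csv into an in-memory buffer instead of a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping one dictionary per judgment means a new field only needs one more dictionary in the merge, and the column appears in the output automatically.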

<details>
    <summary> <b> Click here for database creation code (Warning: long code) </b></summary>
    
```python
    def __process_judgments(self, court):
        """
        Loads each new html judgment and performs the NLP steps to extract the key information.
        """
        # Instantiate a list for the dictionary outputs of the above functions
        self.dictionaries_list = []
        
        # Set a default start value
        self.__start = 0
        
        # Check which court is being processed, and find the number of new rows to set the start and end indices
        if court == 'supreme':
            self.__start = self.__supremecourt_rows - self.__new_supremecourt_rows
            self.__end = self.__supremecourt_rows
        elif court == 'subordinate':
            self.__start = self.__subordinatecourt_rows - self.__new_subordinatecourt_rows
            self.__end = self.__subordinatecourt_rows
            
        # Raise an error if an invalid court is set
        else:
            raise CourtNameError("There is only the Subordinate (State) or Supreme Court!")
            
        # Load the dataset based on which court is given
        self.dataset = pd.read_csv(f'../data/{court}court_compiled.csv')
        
        # Check if there are any new entries
        if self.__start < self.__end:
            # Create index range for new entries and iterate through the range of indices
            for index in range(self.__start, self.__end):               
                # Set the case_link to be the link for the current index
                self.__case_link = self.dataset.loc[index]['link']
                
                # Print the current progress
                print(f'Current progress: {index+1}/{self.__end}.')
                
                # Load the judgment html for the current index and parse it in BeautifulSoup (the file is closed once it has been read)
                with codecs.open(f'../judgments/{court}_court/{court}court_{index}.html', 'r', 'utf-8') as load_judgment:
                    print('Judgment loaded')
                    self.document = BeautifulSoup(load_judgment.read())
                print('BeautifulSoup initialized')
                
                # Create __search_results which are the contents of the html
                self.__search_results = self.document.find('div', {'class': 'contentsOfFile'})
                print('Judgment text identified')
                
                # Call the functions above to extract the information required
                self.__get_case_name()    
                print(f'Case name extracted: {self.__case_name}')
                self.__get_court()
                print(f'Court extracted: {self.__court}')
                self.__get_date()       
                print(f'Decision date extracted: {self.__decision_date}')
                self.__get_statute()   
                print(f'Statutes extracted*: {self.__title_statute}')
                self.__get_citations()      
                print(f'Citations extracted*: {self.__citations}')
                self.__get_miscellaneous()
                print(f'Miscellaneous extracted*: {self.__miscellaneous}')
                
                # Add a court_tag column which specifies whether it is subordinate or supreme court (for the identification of new entries)
                self.__court_column = {'court_tag': court}
                print('tag set')
                
                # Add the case url
                self.__add_link = {'link': self.__case_link}
                print('case link set')
                
                # Create and merge a dictionary which merges all the information extracted for each judgment
                self.__dictionaries_merged = self.__case_name.copy()
                self.__dictionaries_merged.update(self.__court)
                self.__dictionaries_merged.update(self.__decision_date)
                self.__dictionaries_merged.update(self.__title_statute)
                self.__dictionaries_merged.update(self.__citations)
                self.__dictionaries_merged.update(self.__miscellaneous)
                self.__dictionaries_merged.update(self.__court_column) 
                self.__dictionaries_merged.update(self.__add_link)
                
                # Add the dictionary for each judgment into a list of dictionaries
                self.dictionaries_list.append(self.__dictionaries_merged)

                # Clear cell output
                clear_output(wait=True)
            
            # Create a Dataframe out of the list of dictionaries
            self.database = pd.DataFrame(self.dictionaries_list)
            
            # Print current progress
            print(f'Current progress: DataFrame created.')
            
            # Merge the Dataframe into the full database
            self.database_df = self.database_df.merge(self.database, how='outer')
            # reset_index returns a new DataFrame, so assign the result back
            self.database_df = self.database_df.reset_index(drop=True)
            
        # Print 'No new entries' if there are no new entries.
        else:
            self.database = pd.DataFrame()
            print('No new entries')   
    
``` 
</details>  
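
The merging steps in the code above can be sketched in isolation. This is a minimal illustration with hypothetical case names and only three of the extracted fields, not the class's actual data. It shows the per-judgment dictionaries being folded into one record, followed by an outer merge into an existing DataFrame.

```python
import pandas as pd

# Hypothetical per-judgment fragments, standing in for the private attributes above
case_name = {'case_name': 'Public Prosecutor v A [2021] SGHC 1'}
court = {'tribunal/court': 'High Court'}
court_column = {'court_tag': 'supreme'}

# Copy the first dictionary so the original is not mutated, then fold in the rest
merged = case_name.copy()
for part in (court, court_column):
    merged.update(part)

new_entries = pd.DataFrame([merged])

# Hypothetical existing database holding one older judgment
existing = pd.DataFrame([{
    'case_name': 'Public Prosecutor v B [2020] SGHC 2',
    'tribunal/court': 'High Court',
    'court_tag': 'supreme',
}])

# With no `on` argument, merge joins on all shared columns;
# how='outer' keeps the rows from both frames
combined = existing.merge(new_entries, how='outer')

# reset_index returns a new DataFrame, so the result must be assigned
combined = combined.reset_index(drop=True)
print(combined)
```

Because the outer merge joins on every shared column, a judgment that already exists with identical values is not duplicated, while any difference in a field produces a second row.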
    
    
<details>
    <summary> <b> Click here for database export code </b></summary>
    
```python
    def __export_database(self):
        """
        Exports dataframes to the respective .csv files
        """
        # Save the temporary database and updated full database to .csv files
        self.database.to_csv(path_or_buf='../data/database_temp.csv', index=False)
        self.database_df.to_csv(path_or_buf='../data/database.csv', index=False)
    
``` 
</details>  

### 4.1.9 Public method to perform NLP  

Finally, the public method `create_database` calls the private methods in order to create or update the database, then appends the latest database update date to a log file.  


<details>
    <summary> <b> Click here for code </b></summary>
    
```python
    def create_database(self, court):
        """
        Process new judgments for the given court and export the results to the .csv database
        """
        # Call the functions to create / update the database
        self.__get_num_rows()
        self.__process_judgments(court)
        self.__export_database()
        
        # Update the log file for the latest database update date
        # (the with-block closes the file once the entry is written)
        with open('../logs/database_log.txt', 'a', encoding='utf_8') as log_file:
            log_file.write(f'database last updated on: {datetime.today()}; \n')
        
        # Print current progress
        print('Current progress: Completed judgment processing and export.')
    
``` 
</details>  
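
The logging step can also be tried on its own. The sketch below is an assumption-laden illustration: `append_update_log` is a hypothetical helper that is not part of the `Database` class, and it writes to a temporary directory instead of `../logs`. It shows that append mode adds one line per update and leaves earlier entries in place.

```python
from datetime import datetime
from pathlib import Path
import tempfile


def append_update_log(log_path):
    """Append a timestamped line to the log file and return the line written."""
    line = f'database last updated on: {datetime.today()}; \n'
    # The with-block closes the file even if write() raises
    with open(log_path, 'a', encoding='utf_8') as log_file:
        log_file.write(line)
    return line


# Demonstrate against a temporary directory rather than ../logs
with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = Path(tmp_dir) / 'database_log.txt'
    append_update_log(log_path)
    append_update_log(log_path)
    logged_lines = log_path.read_text(encoding='utf_8').splitlines()

# Two updates leave two entries in the log
print(len(logged_lines))
```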

## 4.2 Testing the custom class

### 4.2.1 Libraries Import  

I will import the custom class `Database` and pandas to explore the results of the code.

In [1]:
# Import libraries
import pandas as pd
from criminalcasedatabase import Court, Database

### 4.2.2 Create an instance of the class and run it

In [5]:
database = Database()

In [3]:
database.create_database('supreme')

Current progress: DataFrame created.
Current progress: Completed judgment processing and export.


In [4]:
database.create_database('subordinate')

Current progress: DataFrame created.
Current progress: Completed judgment processing and export.


In [12]:
database.database_df.head()

| |case_name|tribunal/court|decision_date|possible_titles|possible_statutes|citations|mitigation_discussed|aggravation_discussed|court_tag|link|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|Chander Kumar a/l Jayagaran v Public Prosecuto...|Court of Appeal|18 January 2021|Punishment for offences,Not in database,Traffi...|394 Misuse of Drugs Act,394 Criminal Procedure...|Perumal v Public Prosecutor and another,Syed S...|1|0|supreme|https://www.lawnet.sg/lawnet/web/lawnet/free-r...|
|1|Public Prosecutor v Teo Ghim Heng [2021] SGHC 13|General Division of the High Court|22 January 2021|Murder,Punishment for culpable homicide not am...|300 Mauritius Dangerous Drugs Act,300 Criminal...|Public Prosecutor v BNO,Osman bin Ali v Public...|0|0|supreme|https://www.lawnet.sg/lawnet/web/lawnet/free-r...|
|2|GCM v Public Prosecutor and another appeal [20...|High Court|25 January 2021|Sale of obscene books, etc.,Not in database,As...|376 Criminal Law Reform Act,376 Films Act,376 ...|Public Prosecutor v GCM,AQW v Public Prosecuto...|1|1|supreme|https://www.lawnet.sg/lawnet/web/lawnet/free-r...|
|3|Public Prosecutor v Salzawiyah bte Latib and o...|General Division of the High Court|26 January 2021|Punishment for offences,Possession and consump...|33 Misuse of Drugs Act,33 Criminal Procedure C...|Joseph v Public Prosecutor,Public Prosecutor v...|1|1|supreme|https://www.lawnet.sg/lawnet/web/lawnet/free-r...|
|4|Public Prosecutor v Salzawiyah bte Latib and o...|General Division of the High Court|26 January 2021|Effacing any writing from a substance bearing ...|261 Misuse of Drugs Act,261 Evidence Act,261 C...|Chai Chien Wei Kelvin v Public Prosecutor,Publ...|0|0|supreme|https://www.lawnet.sg/lawnet/web/lawnet/free-r...|


In [6]:
database.database_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126 entries, 0 to 125
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   case_name              126 non-null    object
 1   tribunal/court         126 non-null    object
 2   decision_date          126 non-null    object
 3   possible_titles        125 non-null    object
 4   possible_statutes      125 non-null    object
 5   citations              121 non-null    object
 6   mitigation_discussed   126 non-null    int64 
 7   aggravation_discussed  126 non-null    int64 
 8   court_tag              126 non-null    object
 9   link                   126 non-null    object
dtypes: int64(2), object(8)
memory usage: 10.0+ KB


In [7]:
database.database_df[database.database_df['possible_titles'].isnull()]

| |case_name|tribunal/court|decision_date|possible_titles|possible_statutes|citations|mitigation_discussed|aggravation_discussed|court_tag|link|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|100|Public Prosecutor v Yak Eng Hwee [2021] SGDC 79|District Court|22 April 2021| | | |0|0|subordinate|https://www.lawnet.sg/lawnet/web/lawnet/free-r...|
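
The null check above looks at a single column. As a minimal sketch, the same idea can be widened to count missing values per column and to list every judgment with at least one missing extracted field. The three-row DataFrame below is hypothetical and only mimics the column names of the real database.

```python
import pandas as pd

# Hypothetical three-row frame mimicking the columns shown above
df = pd.DataFrame({
    'case_name': ['Case A', 'Case B', 'Case C'],
    'possible_titles': ['Murder', None, 'Theft'],
    'possible_statutes': ['300 Penal Code', None, '379 Penal Code'],
    'citations': ['Case X', None, None],
})

# Count missing values per column
null_counts = df.isnull().sum()

# Select rows with at least one missing value among the extracted columns
extracted = ['possible_titles', 'possible_statutes', 'citations']
incomplete = df[df[extracted].isnull().any(axis=1)]

print(null_counts)
print(incomplete['case_name'].tolist())
```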


## 4.3 Summary and observations

Overall, the RBIE system that I created ran without any errors. It was able to extract information for most judgments accurately, with only a few missing values: of the 126 judgments, five have no `citations`, and one has neither `possible_offences` (stored as `possible_titles` in the DataFrame) nor `possible_statutes`.  

However, I checked the judgments with missing `citations` and `possible_statutes` and found that they mention only their own case name, with no other case citations or statutes, so the missing values are not a fault of the NLP. 

I manually checked a random sample of judgments and found that the accuracy rate for information extraction is close to 100%, although the permutations method causes many non-existent statutes to be listed as `possible_statutes`. This is acceptable because it is more important to reduce type II errors (false negatives), where statutes which are present are missed.
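
To illustrate why this approach over-generates, here is a minimal sketch that assumes every section number found in a judgment is paired with every statute name found in the same judgment (strictly a Cartesian product). The function `pair_sections_with_statutes` and the sample inputs are hypothetical and are not the class's actual code.

```python
from itertools import product


def pair_sections_with_statutes(sections, statutes):
    """Return every 'section statute' combination (a Cartesian product)."""
    return [f'{section} {statute}' for section, statute in product(sections, statutes)]


# Hypothetical extraction results for one judgment
sections = ['300', '33']
statutes = ['Penal Code', 'Misuse of Drugs Act']

candidates = pair_sections_with_statutes(sections, statutes)
print(candidates)
```

With two section numbers and two statute names, four candidates are produced even if the judgment only refers to two real provisions. The real provisions are always among the candidates, which is why false negatives stay low while false positives grow.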

There are many instances where the `possible_offences` are `Not in database`. This is likely because the NLP uses permutations for statutes, which are sometimes found separately from the section number, and because the `statutes_crimes` dataset is very limited. Expanding that dataset would be an improvement to the project.
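
A minimal sketch of how such a fallback can arise, assuming the offence titles are looked up by an exact "section statute" key. The dictionary below is a hypothetical slice, not the real contents of `statutes_crimes.csv`, and `lookup_offences` is an illustrative helper.

```python
# Hypothetical slice of a statute-to-offence lookup table
statutes_crimes = {
    '300 Penal Code': 'Murder',
    '33 Misuse of Drugs Act': 'Punishment for offences',
}


def lookup_offences(candidates, table, fallback='Not in database'):
    """Map each candidate statute to an offence title, or the fallback if unknown."""
    return [table.get(candidate, fallback) for candidate in candidates]


# One real provision and one spurious pairing
candidates = ['300 Penal Code', '300 Misuse of Drugs Act']
offences = lookup_offences(candidates, statutes_crimes)
print(offences)
```

Under this assumption, every spurious pairing and every genuine provision missing from the lookup table yields the same fallback value, so the two causes cannot be told apart from the output alone.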

This still suggests, however, that RBIE is not perfect: it relies on a brute-force method that causes some wrong data to be captured.   

I will not be dropping any rows despite the null values, as the judgments can still be found by name.

## References

[1] LawNet, a service provided by the Singapore Academy of Law, *"Chander Kumar a/l Jayagaran v Public Prosecutor
[2021] SGCA 3,"* 2021. [Online]. Available: [https://www.lawnet.sg/lawnet/web/lawnet/free-resources?p_p_id=freeresources_WAR_lawnet3baseportlet&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_pos=2&p_p_col_count=3&_freeresources_WAR_lawnet3baseportlet_action=openContentPage&_freeresources_WAR_lawnet3baseportlet_docId=/Judgment/25538-SSP.xml](https://www.lawnet.sg/lawnet/web/lawnet/free-resources?p_p_id=freeresources_WAR_lawnet3baseportlet&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_pos=2&p_p_col_count=3&_freeresources_WAR_lawnet3baseportlet_action=openContentPage&_freeresources_WAR_lawnet3baseportlet_docId=/Judgment/25538-SSP.xml) [Accessed: April 6, 2021].