<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Natural Language Processing - Classifying Definitions of Nouns and Adjectives

---
## Part 1: Background, Overview & Data Scraping
---

## Contents
---
- [Problem Statement](#Problem-Statement)
- [Overview of Process](#Overview-of-Process)
- [Introduction to NLP Topic](#Introduction-to-NLP-Topic)
- [Considerations for Data Scraping](#Considerations-for-Data-Scraping)
- [Data Scraping](#Data-Scraping)

---
## Problem Statement
---

Constant exposure to a language through conversation, reading and writing is often the best long-term strategy for gaining mastery. However, formal curricula and teaching methods are required to be explicit and objective-driven, which leads to a structured, rules-based approach that is often seen as stifling or boring. To combat this, models can be built to mimic video games, where learners ‘compete’ against models. This may help learners view learning more positively by incentivising them to work harder in order to win. This project represents the starting point: it aims to use Natural Language Processing (NLP) to develop a model that can classify a given definition as defining either a noun or an adjective. Educators will be able to input word definitions and pit learners against the model to see who achieves the higher accuracy rate. In addition, much like how video games feature different levels of difficulty, the model can be tuned to the desired level of accuracy.

---
## Overview of Process
---

#### Introduction & Data Scraping

This segment explores the objective of this project, how it compares to other NLP projects, as well as potential limitations and considerations. 

Following this, data scraping is carried out. Rationales for the websites used, as well as conditions put in place while scraping, are elaborated upon in the relevant sections of this document.


#### Exploratory Data Analysis (EDA)

Preliminary EDA is carried out in 3 main parts:
- length of definitions;
- prominent parts-of-speech (POS) by word class;
- prominent features by word class.
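
As a rough illustration of the first item, the length of a definition can be measured as its whitespace-separated word count and then summarised per word class. The sample rows and the helper name `mean_length_by_class` below are made up for illustration; the actual analysis lives in the EDA notebook and may differ.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows: (word_class, definition).
sample = [
    ('noun', 'a building in which people live'),
    ('noun', 'a large natural stream of water'),
    ('adjective', 'having a high degree of heat'),
    ('adjective', 'of great size or extent'),
]


def mean_length_by_class(rows):
    """Return the mean word count of the definitions in each word class."""
    lengths = defaultdict(list)
    for word_class, definition in rows:
        lengths[word_class].append(len(definition.split()))
    return {word_class: mean(counts) for word_class, counts in lengths.items()}


length_summary = mean_length_by_class(sample)
print(length_summary)
```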


#### Modelling

Based on a mixture of the EDA, logical reasoning and domain knowledge, the data will be processed and the models tuned accordingly.

A total of 4 models will be explored: 
- Naive Bayes (NB) (part of the requirements for this project)
- Logistic Regression (LR)
- K-Nearest Neighbours (KNN)
- Random Forest (RF)

Prior to modelling, a brief overview of each model will be given, including potential advantages and limitations in the context of this project.

The models will then be assessed based on the following sets of scores:

| Metric | Applicable Dataset |
| --- | --- |
| cross-validation (CV) score | training |
| accuracy | training & test |
| specificity | test |
| recall | test |
| precision | test |
| f1 | test |
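
For reference, the test-set metrics in the table can all be derived from the four cells of a binary confusion matrix. The sketch below is a plain-Python illustration, not the code used in the modelling notebooks; treating 'noun' as the positive class, the function name `binary_metrics` and the sample labels are assumptions made for the example.

```python
def binary_metrics(y_true, y_pred, positive='noun'):
    """Compute accuracy, specificity, recall, precision and F1 from labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(numerator, denominator):
        # Guard against empty denominators (e.g. no positive predictions).
        return numerator / denominator if denominator else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        'accuracy': ratio(tp + tn, len(pairs)),
        'specificity': ratio(tn, tn + fp),
        'recall': recall,
        'precision': precision,
        'f1': ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical labels, purely for illustration.
y_true = ['noun', 'noun', 'noun', 'adjective', 'adjective', 'adjective']
y_pred = ['noun', 'noun', 'adjective', 'adjective', 'adjective', 'noun']
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Specificity is the recall of the negative class, so with 'noun' as the positive class it measures how many adjective definitions are correctly identified as adjectives.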


#### Explanatory Data Analysis (ExDA)

After the final models are selected, explanatory data analysis will be carried out to explain how the models work and how some important differences may lead them to different conclusions.


#### Applications, Suggestions & Recommendations

This final segment covers possible applications of the models, as well as potential extensions, future versions, etc.

---
## Introduction to NLP Topic
---

#### Objective

This NLP project aims to take in definitions of English words, specifically those of nouns and adjectives, and classify each as defining either a noun or an adjective. All definitions are taken from the same [source](https://www.dictionary.com). 


#### Comparison with other NLP projects

Compared to other NLP projects that scrape data from forums, reviews, etc., this project is more contained and structured, since it uses formal definitions. On one hand, this could be a boon: professionally written definitions are likely to follow certain structures and rules, which should make them easier for a model to learn from. On the other hand, the scraped data is a lot more ‘robotic’ than posts written by human beings, which could make modelling more challenging: lengths are unlikely to differ much, ‘tone’ is irrelevant, etc.


#### Considerations & limitations

In terms of data, this project is both aided and limited by the consistency of the data. As it is a small-scale project, care has been taken to ensure that the models can work with the data: the words scraped are largely common, the definitions all come from the same website, and certain conditions have been put in place to eliminate largely useless information, such as overly short definitions of fewer than 5 words. This makes the data more reliable, but also limits its use. We will explore this further in the final notebook.



---
## Considerations for Data Scraping
---

#### Data sources

The words come from two different lists, one of commonly used nouns and one of commonly used adjectives. While users are not required to understand definitions to be able to classify them, definitions of uncommon words are likely to be harder to understand as well. Common words are therefore the better choice. 

The definitions come from dictionary.com. This is an often-used and reputable dictionary. Cambridge would have been a better choice, but the website disallows scraping. 
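
One way to check whether a site permits scraping is to consult its `robots.txt`. The sketch below uses Python's standard `urllib.robotparser` on a made-up rule set so that it runs offline; the rules, the `example-dictionary.test` domain and the paths are hypothetical and do not reflect the actual policy of any site mentioned here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
robots_txt = """
User-agent: *
Disallow: /private/
Allow: /browse/
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

# can_fetch() reports whether the given user agent may request the URL.
allowed = parser.can_fetch('*', 'https://example-dictionary.test/browse/apple')
blocked = parser.can_fetch('*', 'https://example-dictionary.test/private/data')
print(allowed, blocked)
```

To check a live site instead, `parser.set_url(...)` followed by `parser.read()` fetches and parses the real file. A site's terms of service may impose further restrictions that `robots.txt` does not capture.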


#### Conditions put in place during the scraping process

Definitions that fulfilled the following conditions were scraped:
- Consist of at least 5 words: overly succinct definitions are often not definitions, but rather synonyms;
- Are formal definitions: dictionary.com occasionally provides informal definitions, likely for the sake of completeness, but sticking to standard English is better, especially if the model is intended for further uses;
- Are still in use: some definitions are marked as archaic by dictionary.com and should not be scraped.
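
The three conditions can be expressed as a single predicate. The sketch below mirrors the checks used by the scraper further down (a word count of at least 5, and no leading 'informal' or 'archaic' label), but the function name `keep_definition` and the sample strings are hypothetical.

```python
def keep_definition(text, min_words=5, excluded_labels=('informal', 'archaic')):
    """Return True if a definition passes the three scraping conditions."""
    lowered = text.lower()
    long_enough = len(lowered.split()) >= min_words
    # str.startswith accepts a tuple and matches any of its prefixes.
    unlabelled = not lowered.startswith(excluded_labels)
    return long_enough and unlabelled


candidates = [
    'very happy',                                   # too short: likely a synonym
    'Informal. a person who is very enthusiastic',  # informal usage
    'Archaic. a dwelling place or home',            # no longer in use
    'a building in which people live',              # kept
]
kept = [text for text in candidates if keep_definition(text)]
print(kept)
```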


#### 'Re'-scraping

Words tend to span multiple classes. As such, it is possible to use the nouns list to scrape adjective definitions, and vice versa. This is necessary because fewer adjectives made the cut under the conditions above. 


#### Data cleaning

Scraping two rounds from both lists inevitably results in some duplicated entries that have to be removed. 
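
To make the effect concrete, the sketch below removes repeated definitions from a small, made-up list of records while keeping the first occurrence of each. This is the same outcome that pandas `drop_duplicates(subset='definition')` produces with its default `keep='first'`; the records and the helper name `drop_repeated_definitions` are hypothetical.

```python
def drop_repeated_definitions(records):
    """Keep only the first record seen for each distinct definition."""
    seen = set()
    unique = []
    for record in records:
        if record['definition'] not in seen:
            seen.add(record['definition'])
            unique.append(record)
    return unique


# Hypothetical records: 'light' appears in both word lists, so the same
# adjective definition is scraped once in each round.
records = [
    {'word': 'light', 'word_class': 'adjective', 'definition': 'of little weight; not heavy'},
    {'word': 'river', 'word_class': 'noun', 'definition': 'a large natural stream of water'},
    {'word': 'light', 'word_class': 'adjective', 'definition': 'of little weight; not heavy'},
]
deduplicated = drop_repeated_definitions(records)
print(len(records), len(deduplicated))
```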

In [1]:
import pandas as pd

import requests

from bs4 import BeautifulSoup

In [2]:
# Class that, for a given word class (noun or adjective):
# 1. scrapes words from a given URL, then
# 2. scrapes their definitions from dictionary.com.

class Scrape():
    
    
    def __init__(self, 
                 url=None, 
                 class_word=None, 
                 class_def=None, 
                 limit=False):
        
        '''
        Parameters
        ----------
        url : str
            URL to scrape words from.
        class_word : str
            Accepts 'noun' or 'adjective'. 
            Used to differentiate between which scraping code should be used.
        class_def : str
            Accepts 'noun' or 'adjective'.
            Used to indicate which type of definition should be scraped.
        limit : int or bool
            If set to an integer, only the first `limit` scraped words are looked up on dictionary.com.
            Defaults to False (no limit).
        '''
        
        self.url = url
        self.class_word = class_word
        self.class_def = class_def
        self.limit = limit
        
        self.get_words()
        self.get_definitions()
        self.create_df()
     
    
    # Function to scrape words from a given URL.
    def get_words(self):
        
        # List to store words.
        self.words = []

        response = requests.get(self.url)
        
        html = response.text
        
        soup = BeautifulSoup(html, features='lxml')
        
        # Scrape words using the parsing logic for the specified word list.
        # Note that the following code is specific to the webpage.
        if self.class_word == 'noun':
            
            # Each 'tr' tag corresponds to a row of 3 words in this webpage.
            for row in soup.findAll('tr'):
                
                # Each 'td' tag corresponds to 1 of the 3 words, as well as its number.
                for word_and_num in row.findAll('td'):
                    
                    # Eliminates the word's number.
                    for word in word_and_num.findAll('a'):
                        
                        try:
                            
                            # Words are preceded by \r & \n.
                            # .split() is applied to scrape only the actual word.
                            self.words.append(word.text.lower().split('\r\n')[1])
                            
                        except IndexError:
                            
                            pass

        # Scrape words using the parsing logic for the specified word list.
        # Note that the following code is specific to the webpage.
        elif self.class_word == 'adjective':
            
            # Each 'tbody' tag corresponds to a table of words.
            # Each table corresponds to a letter of the alphabet.
            for table in soup.findAll('tbody'):
                
                # Each 'td' tag corresponds to 1 word in the table.
                for word in table.findAll('td'):
                    
                    try:
                        
                        # Words are preceded and succeeded by a whitespace.
                        # .split() is applied to scrape only the actual word.
                        self.words.append(word.text.lower().split(' ')[1])
                        
                    except IndexError:
                        
                        pass

        return self.words
    
    
    # Function to scrape definitions from dictionary.com.
    # To avoid confusion, some for-loops are labelled (e.g. Loop [1]) and referenced later on.
    def get_definitions(self):
        
        # Dictionary that contains the following key-value pairs:
        # Key: word.
        # Value: list, where the 1st value is the specific dictionary.com URL for the word.
        # The remaining list elements will be added later on.
        self.info = {word: [f'https://www.dictionary.com/browse/{word}']
                     for word in self.words}
        
        if self.limit:
            
            for word in list(self.info.keys())[self.limit:]:
                del self.info[word]
        
        for word, lst in self.info.items():
            
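            # Skip the word if the request itself fails (e.g. a connection error).
            # A missing page does not raise here; it simply yields no matching sections below.
            try:

                response = requests.get(lst[0])

            except requests.exceptions.RequestException:

                continue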

            html = response.text

            soup = BeautifulSoup(html, features='lxml')

            # Loop [1]
            # Each dictionary.com page is separated into sections.
            # Each section contains definitions for a given word class.
            # Word classes are repeated as there are 2 major segments for each page...
            # ...American & British English.
            # For example, if the word can be a noun, there would be two 'noun' sections.
            for section in soup.findAll('section', 
                                        attrs={'class': 'css-109x55k e1hk9ate4'}):

                # Loop [2]
                # Each 'span' tag, with the specified attribute, corresponds to a word class.
                for w_class in section.findAll('span', 
                                               attrs={'class': 'luna-pos'}):

                    # Word classes are expressed in a variety of ways in dictionary.com...
                    # ...but they always start with the actual word class itself.
                    if self.class_def in w_class.text:

                        # Loop [3]
                        # Each 'div' tag, with the specified attribute, corresponds to either:
                        # 1) an expandable section, or
                        # 2) a non-expandable section.
                        # These sections contain definition(s) of the word for the specified word class.
                        for content in section.findAll('div', 
                                                       attrs={'class': 'css-10n3ydx e1hk9ate0'}):

                            # For the expandable sections.
                            # Expandable sections are defined as having a 'see more' option that...
                            # ...can be clicked and collapsed.
                            try:

                                # If an exception is raised here, the section is non-expandable.
                                assert content.findAll('div', 
                                                       attrs={'class': 'default-content'})

                                # Each 'div' tag, with the specified attribute, corresponds to the...
                                # ...default section, ie. what's already shown on the page.
                                # Note: this loop doesn't require breaking, since there's only 1.
                                for definitions in content.findAll('div', 
                                                                    attrs={'class': 'default-content'}):

                                    # Loop [4a]
                                    # Each 'div' tag corresponds to 1 definition.
                                    for definition in definitions.findAll('div'):
                                        
                                        
                                        # Only scrape definitions that fulfill the following requirements:
                                        # 1. at least 5 words, to avoid scraping mere synonyms;
                                        # 2. are not informal definitions, to avoid non-standard English;
                                        # 3. are still in use (not archaic).
                                        if len(definition.text.split()) >= 5 and \
                                        not definition.text.lower().startswith('informal') and \
                                        not definition.text.lower().startswith('archaic'):

                                            # As mentioned above, the list stored for each word gains additional elements.
                                            # The 2nd element is the word class of the word.
                                            self.info[word].append(self.class_def)

                                            # The 3rd element is a definition of the word.
                                            self.info[word].append(definition.text.lower())

                                            # Breaks loop [4a].
                                            # Ie. only the 1st relevant definition is scraped.
                                            break
                                            
                                        else:
                                            
                                            continue

                            # For the non-expandable sections.
                            # Non-expandable sections are defined as not having the 'see more' option.
                            # Ie. all definitions are already shown on the page.
                            except:

                                # Loop [4b]
                                # Each 'div' tag corresponds to 1 definition.
                                for definition in content.findAll('div'):
                                    
                                    # Only scrape definitions that fulfill the following requirements:
                                    # 1. at least 5 words, to avoid scraping mere synonyms;
                                    # 2. are not informal definitions, to avoid non-standard English;
                                    # 3. are still in use (not archaic).
                                    if len(definition.text.split()) >= 5 and \
                                    not definition.text.lower().startswith('informal') and \
                                    not definition.text.lower().startswith('archaic'):

                                        # As mentioned above, the list stored for each word gains additional elements.
                                        # The 2nd element is the word class of the word.
                                        self.info[word].append(self.class_def)

                                        # The 3rd element is a definition of the word.
                                        self.info[word].append(definition.text.lower())

                                        # Breaks loop [4b].
                                        # Ie. only the 1st relevant definition is scraped.
                                        break
                                        
                                    else:
                                        
                                        continue

                            # Length of list should be 3 (URL, word class, definition)...
                            # ...if scraping was successful.
                            if len(self.info[word]) == 3:

                                # Breaks loop [3].
                                # Avoids moving on to the next section.
                                break

                    if len(self.info[word]) == 3:

                        # Breaks loop [2].
                        # Avoids looking for word classes in other sections.
                        break
                        
                if len(self.info[word]) == 3:

                    # Breaks loop [1].
                    # Avoids looking for word classes in other sections.
                    break

        return self.info
    
    
    # Creates a DataFrame to store scraped information.
    def create_df(self):
        
        self.df = pd.DataFrame.from_dict(self.info, 
                                         orient='index', 
                                         columns=['url', 'word_class', 'definition'])
        
        self.df.reset_index(inplace=True)
        
        self.df.rename({'index': 'word'}, axis=1, inplace=True)
        
        # Drop words that have no definitions for the specified word class.
        self.df.dropna(inplace=True)
        
        # Unlikely to happen, but good to have as a precaution to ensure integrity of data.
        self.df.drop_duplicates(subset=['definition'], inplace=True)
        
        return self.df

In [3]:
# URLs of webpages containing the nouns and adjectives for scraping.

url_nouns = 'https://www.syllablecount.com/syllables/words/nouns.aspx'
url_adjectives = 'https://www.wordscoach.com/blog/common-adjectives/'

In [4]:
# Using the Scrape() class, create a DataFrame of scraped nouns and their definitions.

df_nouns_1 = Scrape(url=url_nouns, 
                    class_word='noun', 
                    class_def='noun', 
                    limit=False).df

In [5]:
# Using the Scrape() class, create a DataFrame of scraped adjectives and their definitions.

df_adjectives_1 = Scrape(url=url_adjectives, 
                         class_word='adjective', 
                         class_def='adjective', 
                         limit=False).df

In [6]:
# Work out how many more words to look up to reach 2000 nouns, plus a buffer of 200.

shortage_nouns = 2000 - df_nouns_1.shape[0] + 200

In [7]:
# Using the Scrape() class, create a 2nd DataFrame of scraped nouns and their definitions.

df_nouns_2 = Scrape(url=url_adjectives, 
                    class_word='adjective', 
                    class_def='noun', 
                    limit=shortage_nouns).df

In [8]:
# Using the Scrape() class, create a 2nd DataFrame of scraped adjectives and their definitions.

df_adjectives_2 = Scrape(url=url_nouns, 
                         class_word='noun', 
                         class_def='adjective').df

In [9]:
# Check size of scraped data.

df_nouns_1.shape[0] + df_nouns_2.shape[0], df_adjectives_1.shape[0] + df_adjectives_2.shape[0]

(2048, 1966)

In [10]:
# Merge DataFrames together.

df = pd.concat([df_nouns_1, df_nouns_2, df_adjectives_1, df_adjectives_2])

In [11]:
# Drop null values, if any.

df = df.dropna()

In [12]:
# Drop repeated definitions.

df = df.drop_duplicates(subset='definition')

In [13]:
# Check that word classes are reasonably balanced.

df.word_class.value_counts(normalize=True)

noun         0.515712
adjective    0.484288
Name: word_class, dtype: float64

In [14]:
# Reset index to ensure that numbers are sequential.

df = df.reset_index(drop=True)

In [15]:
# Save DataFrame into csv file for use in other notebooks.

df.to_csv('../data/nouns_and_adjectives.csv')