

This assignment introduces a natural language processing task that requires classification into more than two classes, through the use of multinomial logistic regression. That task is well-known in NLP research as **named entity recognition (NER)**, which identifies and classifies named entities mentioned in unstructured text into predefined classes (e.g., person, organization, location, medicine, etc.).

A complete NER task comprises taking as input an unannotated *raw* text, such as

> Harry Belafonte, the popular American singer, actor, and civili rights activist, was born Harold George Bellanfanti Jr. in 1927, at Lying-in Hospital in Harlem, New York.

and producing an annotated text that identifies the names and categories of the mentioned entities:

> [Harry Belafonte](Type: Person), the popular American singer, actor, and civili rights activist, was born [Harold George Bellanfanti Jr.](Type: Person) in 1927, at [Lying-in Hospital](Type: Location) in [Harlem](Type: Location), [New York](Type: Location).


# Dataset

You will use the data introduced by the Language-Independent Named Entity Recognition tasks, through the following body of work:

* Erik F. Tjong Kim Sang and Fien De Meulder. 2003. [Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition](https://aclanthology.org/W03-0419/). In *Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003*, pages 142--147.
* Erik F. Tjong Kim Sang. 2002. [Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition](https://aclanthology.org/W02-2024/). In *COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)*.

This assignment, however, is restricted to NER in the English language only, and the dataset consists of three files:

1. `eng.train`, for training
2. `eng.testa`, as the development set
3. `eng.testb`, as the final test set

These files can be downloaded as a single `.zip` [here](https://drive.google.com/file/d/15YEXQlDk8wvqAFOE1chaS_PYvMaLGUGX/view?usp=sharing)

To avoid any complications, you should take advantage of the fact that the total amount of data is much smaller than the previous assignment, and store the entire dataset in your own Google Drive. To do this, connect your Drive to your Colab notebook:

In [None]:
from google.colab import drive
drive.mount ('/content/drive', force_remount=True)

Mounted at /content/drive


Then, unzip the dataset (**remember to change the path to where you have stored it in your own Google drive**):

In [None]:
!unzip /content/drive/MyDrive/courses/cse354/eng-ner-dataset.zip

Archive:  /content/drive/MyDrive/courses/cse354/eng-ner-dataset.zip
   creating: eng-ner-dataset/
  inflating: eng-ner-dataset/eng.testa  
   creating: __MACOSX/
   creating: __MACOSX/eng-ner-dataset/
  inflating: __MACOSX/eng-ner-dataset/._eng.testa  
  inflating: eng-ner-dataset/eng.train  
  inflating: __MACOSX/eng-ner-dataset/._eng.train  
  inflating: eng-ner-dataset/eng.testb  
  inflating: __MACOSX/eng-ner-dataset/._eng.testb  


At this point, you have the unzipped corpus (with the three files) as the `eng-ner-dataset` folder accessible to your Colab notebook. The format of this data is probably new to you, so the first thing to do is to use the `head` command and see what the data looks like ([see this man page](https://www.gnu.org/software/coreutils/manual/html_node/head-invocation.html) for the details of its syntax). For example, you can view the top 20 lines of the `eng.train` file as follows:

In [None]:
!head -n 20 eng-ner-dataset/eng.train

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
to TO I-VP O
boycott VB I-VP O
British JJ I-NP I-MISC
lamb NN I-NP O
. . O O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP I-NP I-LOC
1996-08-22 CD I-NP O

The DT I-NP O
European NNP I-NP I-ORG
Commission NNP I-NP I-ORG
said VBD I-VP O


The format you see is known as the [IOB format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), popularly used in many NLP tasks since the CoNLL 2003 NER tasks. The file format requires

- each token has to be on a separate line
- there must be an empty line after each sentence
- a line must contain at least two columns: first, the token itself; and the last, the named entity

It doesn't matter if there are extra columns in between (perhaps containing part-of-speech tag or other information), as long as the named entity information is given in the IOB format (either IOB or IOB2).

**Note:** There is a slight difference between the original IOB and IOB2 formats, and you may need to convert the training and test data to IOB2 (if you spot that some instances are using IOB while others are using IOB2).

# Task Overview

The programming involves three stages:

1. converting the text data into feature vectors, so that it can readily be used by supervised machine learning algorithms,
2. implement your own logistic regression classifier to identify whether or not a token is part of a person's name, and
3. implement your own multinomial logistic regression classifier to develop a complete NER system.

Throughout this assignment, remember to use type annotations in your Python code. Even if you are unable to do this for variables whose data types are dependent on external libraries that are allowed in this assignment (specified later), don't forget the type annotations for the core data types. These annotations are already provided to you in the method signatures from this point onward (to illustrate how to do this, as well as to specify the method signatures required by this assignment).

* Feel free to import additional types as needed (see the line below, where a few data types are already imported for such type annotations: `from typing import ...`).

#### 1.1 Importing required libraries
- You may import modules from core Python
- You may use any modules from `numpy` and `pandas` as long as it does not involve any 'outsourcing' of machine learning algorithms to these modules.

**Do not add the following dependencies:**
- Any module from NLTK
- Any module from SciPy
- Any module from scikit-learn (i.e., `sklearn`) unless it is already provided to you in this Colab notebook
- Any library/module that performs optimizations (minimization or maximization of a function) for you. Purely numeric calculations that arise from mathematics (outside the topics in this assignment) can be done by calling numpy functions, but you must implement the stochastic gradient descent algorithm on your own.
  - For example, computing a dot product can be done using numpy, but logits, sigmoid, softmax, etc. must be your own implementation.

**What about additional methods, variables, data structures, etc.?**

Throughout this assignment, you may add any number of helper methods, as you feel the need to do so. Similarly, you may use additional variables and/or data structures as the need arises. For example, if your implementation of the classifier requires you to add a class attribute, you can certainly do that.

However, please keep in mind three things:

1. Any external user should remain oblivious to any such additional function or variable (i.e., they should not have to assume or figure out things beyond what is already given, in order to run your code).
2. You must update the docstring of a class if you are introducing any additional attribute.
3. Any additional method that you write (say, a helper method) must also have a proper docstring and type hint/annotation for its signature (i.e., what data types it expects as parameters, and what data type it returns).

# Data Preparation [20 points]

The first step is to make sure that your code is able to read the data one sentence at a time. Given the number of sentences, and that you may have to do analyze or process each sentence in computationally complex ways, it is always prudent in this kind of work to write your code in a ways that avoids loading the entire training set. In this assignment, it may be possible, but the better option is to use the *generator* idea in Python. In this approach, the sentences are generated one at a time in a *lazy* manner (if you are more familiar with Java, think `Stream` insted of `List`).

In [None]:
import re, numpy, pandas
from typing import Dict, Iterator, List, TextIO, Tuple
from pathlib import Path

UNKNOWN_TOKEN = 'UNK'  # This will be needed at times, so let's just declare it as a global constant right away
#Used in create dataframe

def load_instances(iob_file: TextIO, sep: str = '\n') -> Iterator[str]:
    """
    Load instances (which are sentences) from an input file stream.

    This function reads an input file stream (`iob_file`), where tokenized sentences are provided in the IOB or IOB2
    format, which requires each token to be on a separate line, and that there is an empty line after each sentence.
    This empty line acts as the default separator (`sep`). Each yielded instance is a single annotated sentence (in the
    IOB or IOB2 format, as given in the input file).

    Parameters:
        iob_file (TextIO): An input file stream containing annotated text data.
        sep (str, optional): The separator used to separate instances. Defaults to '\\n'.

    Yields:
        str: A string representing a single (tokenized and annotated) sentence.
    """

    #Don't understand generators too well need to research how to use them, they use lazy loading and avoids whole data set
    #Had some problems loading files this way

    instancesToLoad = []
    count = 0
    for line in iob_file:
      line = line.strip()
      if line:
        count+=1
        instancesToLoad.append(line) #Add sentence
      else:
        if instancesToLoad: #Load chunk
          yield sep.join(instancesToLoad)
          instancesToLoad = []
    if instancesToLoad: #Last chunck to load
      yield sep.join(instancesToLoad)
      #No need to clear all parsed

    print('this is the load', count)



Building the feature vectors will require the use of tokens seen in the training data, as well as other properties such as the part-of-speech (POS) tags of these tokens. Thus, it is imperative that all such tokens and POS tags are properly collected and tracked. The next method should help you do just that.

> ---
> **Optional features: phrasal information**
>
> You may have already noticed that the annotated data also contains information about the phrase containing a token (again, in IOB or IOB2 format). You are welcome to build features out of this information as well, although this is not mandated by the assignment. If you want to do this, we strongly suggest that you investigate this only *after* finishing everything else.
>
> *Phrasal information* is encoded as whether a token is a part of a noun phrase (NP), verb phrase (VP), prepositional phrase (PP), adjective/adverb phrases (ADJP/ADVP), verb particles (PRT), interjections (INTJ), and clauses introduced by a subordinating conjunction (SBAR).
>
> * If you are interested, you can read more about it [here](https://aclanthology.org/W00-0726.pdf).
>
> If you want to include features based on this phrase-level annotation, and want to add a third dictionary to the return type of the following function, please mention it very clearly in the docstring, and also modify the docstring to reflect this updated use of the `get_vocabulary` method.
>
> ---

In [None]:
def get_vocabulary(training_file: str) -> Tuple[Dict[str, int], Dict[str, int]]:
    """
    Create a vocabulary of lowercase tokens and part-of-speech (POS) tags from a training file, associating each token
    and each POS with a unique index.

    This function reads the specified training file, extracts tokens and POS tags from each sentence. It then converts
    the tokens to lowercase, and associates each lowercase token with a unique index. It also associates each POS with
    a unique index. The result is returned as a pair of dictionaries.

    Parameters:
        training_file (str): The path to the training file containing annotated text data.

    Returns:
        Tuple[Dict[str, int], Dict[str, int]]: A pair of dictionaries where the first dictionary consists of keys that
        are lowercase tokens and values are their corresponding indices; and the second dictionary consists of keys that
        are part-of-speech tags and values are their corresponding indices.

    Raises:
        IOError: If the specified training file cannot be opened or read.
    """

    tokenDict = {}
    POSDict = {}

    try:
      with open(training_file, 'r') as file:
        for line in file:
          if line is not None:
            line = line.split() #Split sentence into parts to seperate token and tags
            for i, tokAndPOS in enumerate(line):
              if i == 0 and tokAndPOS is not None: #First part is the token
                if tokAndPOS not in tokenDict:
                  tokenDict[tokAndPOS.lower()] = len(tokenDict) #Add to token dict
              elif i==1 and tokAndPOS is not None: #Second part is the POS
                if tokAndPOS not in POSDict:
                  POSDict[tokAndPOS] = len(POSDict) #Add to POS dict
              else:
                break

        #print(len(tokenDict))
        return (tokenDict, POSDict)
    except IOError as e:
      raise IOError(f"File couldn't be open or read: {e}")

## Feature Selection and Data Frames

The most important question to ask at this point is about the features. *What are the type of features likely to be important in the identification of various kinds of named entities?*

Unsurprisingly, the token itself and its part of speech are the most important indicators. For example, a conjunction is probably not the name of a person; an adjective is probably not a part of the name of a place (assuming that the greatness of "Great Britain" or the length of "Long Island" are correctly tagged as nouns). A few other features that research in NER detection has found to be helpful are the orthographic properties of a token, which involve the patterns of capitalization (e.g., is the word in all capital letters? is it starting with a capital letter?), the POS tags of surroundings tokens, the surrounding tokens themselves, and the orthographic properties of the surrounding tokens.

You are by no means restricted to use only these properties. They are provided to you as a minimal set to explore (i.e., a starting point from where you can/should explore incorporating better features).

This work is what people often call **feature engineering**. It is, in some ways, "old school" NLP. Nevertheless, it is a relatively recent phase of NLP research, and going through this help you gain hands-on knowledge of various programming tools/approaches in NLP. It will (hopefully) also help you appreciate the complexity and utility of neural networks where such feature engineering is rarely needed.

It is, of course, important to represent the training, development, and test instances using the same set of features. And for the supervised classification, we want to store the training, development, and test sets are data frames (essentially, vectors with class labels). Your next task is to complete the following method to do this.

First, let's define an enumerable type so that only a fixed set of the "kinds of data frames" are allowed.

In [None]:
from enum import Enum

class ActionType(Enum):
    TRAIN = 'train'
    TEST = 'test'
    DEV = 'dev'

Use this `ActionType` below:

In [None]:
from collections import OrderedDict #Imported ordered dict so dict stays consistent when doing one hot encodings, should be allowed since from collections library
#import pdb used for breakpoints in colab

def create_dataframe(actiontype: ActionType, output_file_name: str) -> None:
    """
    Generate a pandas DataFrame from text files containing sentences (tokenized and annotated, in IOB or IOB2 format)
    and class labels, and write the DataFrame to a CSV file.

    Parameters:
        actiontype (ActionType): The type of action, either ActionType.TRAIN, ActionType.TEST, or ActionType.DEV.
        output_file_name (str): The name of the output CSV file.

    Returns:
        None

    Raises:
        FileNotFoundError: If the input training or test file is not found.
    """

    #Given action type we can choose what data set to generate a dataframe for, if wrong action or unknown action error
    if actiontype == ActionType.TRAIN:
      fileName = '/content/eng-ner-dataset/eng.train'
    elif actiontype == ActionType.TEST:
      fileName = '/content/eng-ner-dataset/eng.testb'
    elif actiontype == ActionType.DEV:
      fileName = '/content/eng-ner-dataset/eng.testa'
    else:
      raise FileNotFoundError(f"Input file not found due to invalid action type or missing file")

    sentences = []
    currentSentence = []

    try:
      tokenDict, POSDict = get_vocabulary('/content/eng-ner-dataset/eng.train')  #Get vocab from training set always

      vocab = set(tokenDict.keys())
      vocab.add(UNKNOWN_TOKEN) #Add unk token for when test or dev data set has vocab not in training vocab

      #All possible keys to be used for one hot encoding
      allPOS = set(POSDict.keys())
      allCapitalization = {'allCaps', 'mixedCaps', 'noCaps'}
      allNER = {'O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'}

      #Creating a dict to store all of the features which will later turn into an one hot encoding
      #Features used: POS, Capitalization, PREV NER, NEXT NER
      #Also extracting NER to be used for true class labels
      vocabFeatureClassDict = OrderedDict((word, {'POS': set(), 'Capitalization': set(), 'NER' : set(), 'PREV_NER' : set(), 'NEXT_NER': set()}) for word in vocab)

      with open(fileName, 'r') as file:
        prevLine = None
        for line in file:
          if line is not None:
            line = line.split()
            curToken = None
            for j, tokAndPOS in enumerate(line):
              if j == 0 and tokAndPOS is not None:
                curToken = tokAndPOS.lower()
                if curToken not in vocab: #If test or dev has token not in vocab make it an unknown token
                  curToken = UNKNOWN_TOKEN
                if tokAndPOS.isupper(): #This is where I find capital features
                  vocabFeatureClassDict[curToken]['Capitalization'].add('allCaps')
                elif any(char.isupper() for char in tokAndPOS):
                  vocabFeatureClassDict[curToken]['Capitalization'].add('mixedCaps')
                else:
                  vocabFeatureClassDict[curToken]['Capitalization'].add('noCaps')
              elif j==1 and tokAndPOS is not None: #This is where I find POS features
                if curToken in vocabFeatureClassDict:
                    vocabFeatureClassDict[curToken]['POS'].add(tokAndPOS)
              elif j == 3 and tokAndPOS is not None: #This is where I find NER to use for classes
                if curToken in vocabFeatureClassDict:
                  vocabFeatureClassDict[curToken]['NER'].add(tokAndPOS)

            #This is where I find PREV NER features
            if prevLine is not None:
              line = prevLine
              #line = line.split()
              if len(line) > 3 and curToken in vocabFeatureClassDict:
                vocabFeatureClassDict[curToken]['PREV_NER'].add(line[3])

            #This is where I find NEXT NER features
            nextLine = next(file, None)
            if nextLine is not None:
              line = nextLine
              line = line.split()
              if len(line) > 3 and curToken in vocabFeatureClassDict:
                vocabFeatureClassDict[curToken]['NEXT_NER'].add(line[3])

            prevLine = line



        # print(len(allPOS))
        # print(vocabFeatureClassDict)

        tokenFeatures = list(vocabFeatureClassDict.keys())

        #Begin creating our pandas dataframe with tokens in first collumn features in middle and true classes at end
        df = pandas.DataFrame({'Token': tokenFeatures})

        #Added new one hot encoding can be done by following same steps for all below
        #-------------------------------------------------------------------POS----------------------------------------------------------------------------------------
        #Doing one hot encoding for POS
        POSFeaturesList = []
        #print(f'length of tokenf: {len(tokenFeatures)}')

        #Had trouble with library so doing it manually and concating at end
        for token in tokenFeatures:
          aggEncode = numpy.zeros(len(allPOS))
          POSTags = list(vocabFeatureClassDict[token]['POS'])

          #Do one hot encoding

          for POSTag in POSTags:
            POSi = list(allPOS).index(POSTag)
            aggEncode[POSi] = 1  #If has tag set to 1

          #Store one hot encoding on pandas df arry
          POSdf = pandas.DataFrame([aggEncode], columns=list(allPOS))
          POSFeaturesList.append(POSdf)


        POSdfConcat = pandas.concat(POSFeaturesList, ignore_index=True)

        #Combine to full df
        df = pandas.concat([df, POSdfConcat], axis=1)

        #-----------------------------------------------------------Capitalization----------------------------------------------------------------------------------------
        #Doing one hot encoding for Capitalization
        capitalizationFeaturesList = []

        for token in tokenFeatures:
          aggEncode = numpy.zeros(len(allCapitalization))
          capitalizationTags = list(vocabFeatureClassDict[token]['Capitalization'])

          #Do one hot encoding
          for capTag in capitalizationTags:
            capi = list(allCapitalization).index(capTag)
            aggEncode[capi] = 1 #If has tag set to 1

          #Store one hot encoding on pandas df arry
          capitalizationdf = pandas.DataFrame([aggEncode], columns=list(allCapitalization))
          capitalizationFeaturesList.append(capitalizationdf)


        capitalizationdfConcat = pandas.concat(capitalizationFeaturesList, ignore_index=True)

        #Combine to full df
        df = pandas.concat([df, capitalizationdfConcat], axis=1)


        #-----------------------------------------------------------PREV and NEXT NER----------------------------------------------------------------------------------------
        #Doing one hot encoding for PREV and NEXT NER
        prevNERFeaturesList = []
        nextNERFeaturesList = []

        for token in tokenFeatures:
          #PREV NER feature
          aggEncodingPrevNER = numpy.zeros(len(allNER))
          prevNERTags = list(vocabFeatureClassDict[token]['PREV_NER'])

          #Do one hot encoding for PREV NER
          for NERTag in prevNERTags:
            NERi = list(allNER).index(NERTag)
            aggEncodingPrevNER[NERi] = 1 #If has tag set to 1

          #Store one hot encoding on pandas df arry
          prevNERdf = pandas.DataFrame([aggEncodingPrevNER], columns=list(allNER))
          prevNERFeaturesList.append(prevNERdf)

          #NEXT NER feature
          aggEncodingNextNER = numpy.zeros(len(allNER))
          nextNERTags = list(vocabFeatureClassDict[token]['NEXT_NER'])

          #Do one hot encoding for PREV NER
          for NERTag in nextNERTags:
            NERi = list(allNER).index(NERTag)
            aggEncodingNextNER[NERi]= 1 #If has tag set to 1

          #Store one hot encoding on pandas df arry
          nextNERdf = pandas.DataFrame([aggEncodingNextNER], columns=list(allNER))
          nextNERFeaturesList.append(nextNERdf)

        #Combine to full df
        prevNERdfConcat = pandas.concat(prevNERFeaturesList, ignore_index=True)
        df = pandas.concat([df, prevNERdfConcat], axis=1)

        #Combine to full df
        nextNERdfConcat = pandas.concat(nextNERFeaturesList, ignore_index=True)
        df = pandas.concat([df, nextNERdfConcat], axis=1)


      #-----------------------------------------------------------NER for true classes----------------------------------------------------------------------------------------
        NERValues = []

        for token in tokenFeatures:
          NERTags = vocabFeatureClassDict[token]['NER']
          if NERTags:
            NERValues.append(NERTags)
          else: #Should never happen I think but just in case a sentence has no NER tag
            NERValues.append('O')

        #print("Length of NERValues:", len(NERValues))
        #print("Length of DataFrame:", len(df))

        #May need to change but alter NER tags if multiple and combines PER tags to just PER
        #Work around
        updatedNERValues = []
        for tags in NERValues:
          # if 'B-PER' in tags or 'I-PER' in tags:
          #     updatedNERValues.append('PER')
          # el
          if len(tags) > 1: #Multitags assign O? Could work or just use first
            if isinstance(tags, set):
              updatedNERValues.append(tags.pop()) #Sometimes tags are in sets
            else:
              updatedNERValues.append(tags) #Sometimes tags are strings (i think when using UNK)
          else:
            if isinstance(tags, set):
              updatedNERValues.append(tags.pop()) #Sometimes tags are in sets
            else:
              updatedNERValues.append(tags) #Sometimes tags are strings (i think when using UNK)

        #Put NER values (class values) in df
        df['NER'] = updatedNERValues

        #Write df to file to be used later
        df.to_csv(output_file_name, index=False)
        print('SUCCESS')

    except FileNotFoundError:
      raise FileNotFoundError(f"Input file not found due to invalid action type or missing file")




In [None]:
create_dataframe(ActionType.TRAIN, 'train_set.csv')

SUCCESS


# Binary Logistic Regression Classifier [30 points]

Now that your data frames are built, it is time to build your binary logistic regression classifier to identify if a token is a part of a person's name. To do this, you can effectively treat the labels `I-PER` and `B-PER` together as a single label, `PER`, and treat all the other labels simply as *other*, denoted by `O` (how you denote it internally in your code is entirely up to you).

Complete the class `BinaryLogisticRegression` below.

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

class LogisticRegressionClassifier:
    """
    A binary logistic regression classifier.

    Attributes:
        learning_rate (float): The learning rate for gradient descent.
        learning_rate_decay (float, optional): The factor by which learning rate decays with each iteration.
        num_iterations (int): The number of iterations for gradient descent.
        weights (ndarray): The weights for the features.
        bias (float): The bias term.
        training_data (pandas.DataFrame): The training data as a pandas DataFrame (to be read from a valid CSV file)
    """

    def __init__(self, training_data_csv: str, learning_rate=0.01, learning_rate_decay=1.0, num_iterations=1000):
        """
        Initialize the logistic regression classifier.

        Parameters:
            training_data_csv (str): The file path to the CSV file containing training data.
            learning_rate (float, optional): The learning rate for gradient descent.
            learning_rate_decay (float, optional): The factor by which learning rate decays with each iteration.
            num_iterations (int, optional): The number of iterations for gradient descent.
        """
        try:
          self.training_data = pandas.read_csv(training_data_csv)
        except FileNotFoundError:
          raise FileNotFoundError("File not found")
        self.learning_rate = learning_rate
        self.learning_rate_decay = learning_rate_decay
        self.num_iterations = num_iterations

        #May remove these but should be fine
        self.bias = None
        self.weights = None



    @staticmethod
    def sigmoid(z: float) -> float:
        """
        Compute the sigmoid function.

        Parameters:
            z (float): The input to the sigmoid function.

        Returns:
            float: The output of the sigmoid function.
        """
        return 1 / (1 + numpy.exp(-z))

    def __to_feature_matrix(self, df: pandas.DataFrame) -> numpy.ndarray: #Changed to ndarray so I can use to transform part of df to ndarray to be used in learn and predict which take ndarrays not df
        """
        A private method to extract the feature matrix from the data frame.

        Parameters:
            df (pandas.DataFrame): The given data frame.

        Returns:
            The matrix of features as a pandas DataFrame
        """
        #First col is tokens and last col is classes so just use col 1 to (n-1)
        features = df.iloc[:, 1:-1].values

        return features

    def __to_class_labels(self, df: pandas.DataFrame) -> numpy.ndarray: #Changed to ndarray so I can use to transform part of df to ndarray to be used in learn and predict which take ndarrays not df
        """
        A private method to extract the class labels from the data frame.

        Parameters:
            df (pandas.DataFrame): The given data frame.

        Returns:
            The vector of class labels as a pandas DataFrame.
        """

        #Use last col for classes
        classes = df.iloc[:, -1].values

        #Turn into 0 and 1 for binary classification, need to do I believe
        classesBinary = [1 if (label == 'B-PER') or (label == 'I-PER') else 0 for label in classes]

        return classesBinary

    def learn(self, feature_matrix, y):
        """
        Learn the weight vector to obtain the best decision boundary that separates the two classes in the training set.

        It initializes the model parameters (the weights are initialized to zeros, and the bias is also initially set to
        zero). It then performs gradient descent to optimize the parameters based on the training data. The optimization
        is done by minimizing the cross-entropy loss or the logistic loss. The learning rate determines the step size
        taken during gradient descent. If the decay factor is less than 1, the step size reduces with each iteration.

        Parameters:
            feature_matrix (ndarray): The feature matrix.
            y (ndarray): The target class labels.

        Returns:
            None
        """
        self.weights = numpy.zeros(feature_matrix.shape[1])
        self.bias = 0.0

        #Gradient Descent with n num iterations
        for i in range(self.num_iterations):
          #Find probabilties through sigmoid of logits
          logits = numpy.dot(feature_matrix, self.weights) + self.bias
          predictions = self.sigmoid(logits)

          #Get new learning rate with given decay need to add regularization as well
          currentLearningRate = (self.learning_rate / ((self.learning_rate_decay * i) + 1))

          #Find gradient and bias by calculating the between pred and true labels
          miss = predictions - y
          gradientBias = numpy.sum(miss)
          gradient = numpy.dot(feature_matrix.T, miss) #Need to transpose to do dot product should be fine

          #Use L1 regularization, I think it works better here for removing uneeded features
          L1 = numpy.sign(self.weights)
          L1Lambda = 0.01 #Starting point
          gradient +=  L1Lambda * L1


          #Update weights and bias using new learning rate
          self.weights = self.weights - (gradient * currentLearningRate)
          self.bias = self.bias - (gradientBias * currentLearningRate )

        print('SUCCESS')



    def predict(self, feature_matrix) -> List[int]:
        """
        Predict the target labels for new/test data.

        Parameters:
            feature_matrix (ndarray): The feature matrix of new/test data.

        Returns:
            list: The predicted target labels.
        """
        #Find probabilties through sigmoid of logits
        logits = numpy.dot(feature_matrix, self.weights) + self.bias
        probabilities = self.sigmoid(logits)

        #Change threshold to alter what counts for 0 and 1
        threshold = 0.50 #Dont need to alter for mine
        predictions = numpy.where(probabilities < threshold, 0, 1)

        return list(predictions)

    def report(self, feature_matrix: numpy.ndarray, y_true: numpy.ndarray) -> Tuple[float, float, float]:
        """
        Compute the precision, recall, and F-1 scores for new/test data.

        Parameters:
            feature_matrix (ndarray): The feature matrix of new/test data.
            y_true (ndarray): The true class labels.

        Returns:
            Tuple[float, float, float]: A tuple containing three values: the positive class' precision, recall, and F-1
            scores (in this order)
        """
        # TODO
        #
        # Note 1: For binary classification, we only need these three values for the positive class (instead of micro-
        # or macro-averaging). The PER class is considered as the positive class for this component of the assignment.
        #
        # Note 2: You should aim for an F-1 measure of at least 0.7 on the final test set (eng.testb)
        #F-1 is at least 0.7

        #Use predict to get predictions again. May want to just save them so don't have to recompute but doesnt take long
        predictions = self.predict(feature_matrix)

        #Use sklearn.metrics to get precision recall and f1
        precision = precision_score(y_true, predictions, average='binary')
        recall = recall_score(y_true, predictions, average='binary')
        f1 = f1_score(y_true, predictions, average='binary')

        return (precision, recall, f1)

In [None]:
#Creating all dfs
create_dataframe(ActionType.TRAIN, 'train_set.csv')
create_dataframe(ActionType.TEST, 'test_set.csv')
create_dataframe(ActionType.DEV, 'dev_set.csv')

SUCCESS
SUCCESS
SUCCESS


In [None]:
#Traing models with training set
model = LogisticRegressionClassifier(learning_rate=0.001, training_data_csv='train_set.csv', num_iterations=1000)

featureMatrix = model._LogisticRegressionClassifier__to_feature_matrix(model.training_data) #Private method can be called like this
y = model._LogisticRegressionClassifier__to_class_labels(model.training_data)
model.learn(featureMatrix, y)

SUCCESS


In [None]:
#Model tests and stats reports, using another 'model' to extract information out from test, not sure if best option but works. Using DEV set
testModel = LogisticRegressionClassifier(learning_rate=0.05, training_data_csv='dev_set.csv')
testFeatureMatrix = testModel._LogisticRegressionClassifier__to_feature_matrix(testModel.training_data)
testy = testModel._LogisticRegressionClassifier__to_class_labels(testModel.training_data)

predictions = model.predict(testFeatureMatrix)
model.report(testFeatureMatrix, testy)

(0.9461538461538461, 0.8848920863309353, 0.9144981412639406)

In [None]:
#Model tests and stats reports, using another 'model' to extract information out from test, not sure if best option but works. Using TEST set
testModel = LogisticRegressionClassifier(learning_rate=0.05, training_data_csv='test_set.csv')
testFeatureMatrix = testModel._LogisticRegressionClassifier__to_feature_matrix(testModel.training_data)
testy = testModel._LogisticRegressionClassifier__to_class_labels(testModel.training_data)

predictions = model.predict(testFeatureMatrix)
model.report(testFeatureMatrix, testy)

(0.8658892128279884, 0.8865671641791045, 0.8761061946902655)

# Multinomial Logistic Regression Classifier [30 points]

This is also known as **Softmax Regression** or **Maxent Classifier**. It is a popular tool for multi-class classification, which we will use here for NER.

Note that the classification is happening on a *per-token* basis, and the class labels are `B-ORG`, `I-ORG`, etc.

Generalizing the binary classification task, you should complete the `MultinomialLogisticRegression` class, whose skeleton is provided to you next. For this portion, you may (optionally) import the `OneHotEncoder` or `LabelEncoder` from scikit-learn. For example, you may add this line at the beginning of the next cell:

```
from sklearn.preprocessing import LabelEncoder
```


In [None]:
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import classification_report #Imported for generate_report() as instructed to wrap it

class MultinomialLogisticRegression:
    """
    A multinomial logistic regression classifier.

    Attributes:
        learning_rate (float): The learning rate for gradient descent.
        learning_rate_decay (float): The factor by which learning rate decays with each iteration.
        weights (ndarray): The weights for the features.
        bias (float): The bias term.
        training_data (pandas.DataFrame): The training data as a pandas DataFrame (to be read from a valid CSV file)

        ADDED ATTRIBUTES:

        epochs (int): The number of training cycles (I think this one was forgotten in skeleton code)

        trueLabels (ndarry): The true class labels for the data set

        predictions (ndarry): The predicted class labels for the data set

    """

    def __init__(self, learning_rate: float, learning_rate_decay: float, epochs: int, training_data_csv: str):
        """
        Initialize the multinomial logistic regression classifier.

        Parameters:
            learning_rate (float): The learning rate for gradient descent.
            learning_rate_decay (float): The factor by which learning rate decays with each iteration.
            epochs (int): The number of training epochs.
            training_data_csv (str): The file path to the CSV file containing training data.
        """

        try:
          self.training_data = pandas.read_csv(training_data_csv)
        except FileNotFoundError:
          raise FileNotFoundError("File not found")
        self.learning_rate = learning_rate
        self.learning_rate_decay = learning_rate_decay
        self.epochs = epochs
        self.bias = None
        self.weights = None
        self.trueLabels = None
        self.predictions = None


    @staticmethod
    def softmax(logits):
        """
        Compute the softmax function.

        Parameters:
            z (numpy.ndarray): The input to the softmax function.

        Returns:
            numpy.ndarray: The output of the softmax function.
        """
        exp_logits = numpy.exp(logits - numpy.max(logits))
        return exp_logits / numpy.sum(exp_logits)

    def learn(self) -> None:
        """
        Train the multinomial logistic regression model.

        This method trains the multinomial logistic regression model using stochastic gradient descent. It begins by
        encoding the target labels into one-hot encoded vectors and initializes the weights (to zeros) for the model.
        During each training epoch, it iterates through the training data, computing the softmax probabilities for each
        class and updating the weights based on the gradient of the cross-entropy loss.

        Returns:
            None
        """
        # TODO
        # Note: Remember to use regularization, but think about what type of regularization might suit this task.
        #Going to use L1
        #Doing regular grad descent, confused on implementing stochastic random choice of vectors?

        training_data = self.training_data
        numTokens = len(training_data)
        numFeatures = len(training_data.columns) - 2
        allClassesTotal = self.training_data.iloc[:, -1]
        numClasses = len(allClassesTotal.unique()) #All classes should be in training data

        #Use weights to help with class imbalance
        classCounts = allClassesTotal.value_counts()
        classWeights = len(self.training_data) / (len(classCounts) * classCounts)
        classWeightsValues = classWeights.values #Take out values cant use directly

        #Init weights to 0s
        self.weights = numpy.zeros((numFeatures, numClasses))

        features = self.training_data.iloc[:,1:-1].values

        #Map NER classes to ints, think I need this for multi logistic regression to work (may also need to fix PER since is combined right now)
        classMap = {className: i for i, className in enumerate(allClassesTotal.unique())}
        self.training_data['NERint'] = self.training_data.iloc[:, -1].map(classMap)
        classes = self.training_data['NERint'].values
        #print(classes) #Should be numbers 1- 7 8 ish

        for epoch in range(self.epochs):
          #Find probs but logged due to size constraints
          logits = numpy.dot(features, self.weights)

          logProbs = numpy.log(self.softmax(logits))

          #Compute labels using one hot, using numpy since library not working for me
          yHot = numpy.zeros_like(logProbs)
          yHot[numpy.arange(numTokens),classes] = 1
          gradient = numpy.dot(features.T, (numpy.exp(logProbs)- yHot))/numTokens

          #Doing weighted classes before L1 but might need to do after if not working right
          gradient *= classWeightsValues

          #Use L1 regularization, I think it works better here for removing uneeded features
          L1 = numpy.sign(self.weights)
          L1Lambda = 0.01 #Starting point may change to parameter
          gradient +=  L1Lambda * L1


          self.weights -= self.learning_rate * gradient
          ceLoss = -(numpy.sum(logProbs* yHot) / len(logits))
          print(f"Loss: {ceLoss}")

        print('SUCCESS')


    def predict(self, test_data_csv: str) -> numpy.ndarray:
        """
        Predict class labels for test data using the trained multinomial logistic regression model.

        Loads the test data from a CSV file into a pandas DataFrame. Then, computes the softmax probabilities for each
        class using the dot product of the feature matrix and the model weights. The class label for each instances is
        predicted based on the highest probability using argmax.

        Parameters:
            test_data_csv (str): File path to the CSV file containing test data.

        Returns:
            numpy.ndarray: Predicted class labels for the test data.
        """
        test_data = pandas.read_csv(test_data_csv)
        features = test_data.iloc[:,1:-1].values
        allClassesTotal = self.training_data.iloc[:, -1]

        #Map NER classes to ints, think I need this for multi logistic regression to work (may also need to fix PER since is combined right now)
        uniqueClassNamesSorted = sorted(allClassesTotal.unique())
        classMap = {className: i for i, className in enumerate(uniqueClassNamesSorted)}
        self.training_data['NERint'] = allClassesTotal.map(classMap)
        classes = self.training_data['NERint'].values

        self.trueLabels = classes

        #Find probs
        logits = numpy.dot(features, self.weights)
        prob = self.softmax(logits)

        #Argmax to get highest prob for prediction
        pred = numpy.argmax(prob, axis=1) #Take largest prob for prediction Maybe store so can be used in report
        self.predictions = pred
        return pred

    def generate_report(self) -> None:
      """
      Generate a classification report using scikit learn, base the logic off the library verison
      This fucntion acts as a wrapper for classification report as instructed

          Parameters:
              None

          Returns:
              None
      """
      if self.trueLabels is None or self.predictions is None:
        raise ValueError("Cannot generate report before making predictions")

      trueLabels = self.trueLabels
      predLabels = self.predictions

      uniqueClassNamesSorted = sorted(self.training_data.iloc[:, -2].unique(), reverse=True) #Should give back classes in right order, use -2 now since new col adde

      print(uniqueClassNamesSorted)
      print(classification_report(trueLabels, predLabels, target_names=uniqueClassNamesSorted))

In [None]:
create_dataframe(ActionType.TRAIN, 'train_set.csv')
create_dataframe(ActionType.TEST, 'test_set.csv')
create_dataframe(ActionType.DEV, 'dev_set.csv')

SUCCESS
SUCCESS
SUCCESS


In [None]:
# Model training
model = MultinomialLogisticRegression(learning_rate=0.01, epochs=10, learning_rate_decay=0, training_data_csv='train_set.csv')
model.learn()

Loss: 12.032242930768886
Loss: 12.031874806636484
Loss: 12.03163689386428
Loss: 12.031335646248479
Loss: 12.031095234099414
Loss: 12.030803578690593
Loss: 12.03057437010865
Loss: 12.030286259927358
Loss: 12.030063488498174
Loss: 12.029804744184313
SUCCESS


In [None]:
#Model testing for DEV
predictions: numpy.ndarray = model.predict(test_data_csv='dev_set.csv')

In [None]:
#Model Report for DEV
model.generate_report()

['O', 'I-PER', 'I-ORG', 'I-MISC', 'I-LOC', 'B-ORG', 'B-MISC', 'B-LOC']
              precision    recall  f1-score   support

           O       0.78      0.84      0.80     16094
       I-PER       0.21      0.00      0.01      2458
       I-ORG       0.00      0.00      0.00      1298
      I-MISC       0.00      0.00      0.00       798
       I-LOC       0.00      0.00      0.00       353
       B-ORG       0.00      0.00      0.00         4
      B-MISC       0.00      0.60      0.00         5
       B-LOC       0.00      0.00      0.00         1

    accuracy                           0.64     21011
   macro avg       0.12      0.18      0.10     21011
weighted avg       0.62      0.64      0.62     21011



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
#Model testing for TEST
predictions: numpy.ndarray = model.predict(test_data_csv='test_set.csv')

In [None]:
#Model Report for TEST
model.generate_report()

['O', 'I-PER', 'I-ORG', 'I-MISC', 'I-LOC', 'B-ORG', 'B-MISC', 'B-LOC']
              precision    recall  f1-score   support

           O       0.77      0.85      0.81     16094
       I-PER       0.11      0.00      0.00      2458
       I-ORG       0.00      0.00      0.00      1298
      I-MISC       0.00      0.00      0.00       798
       I-LOC       0.00      0.00      0.00       353
       B-ORG       0.00      0.00      0.00         4
      B-MISC       0.00      0.60      0.00         5
       B-LOC       0.00      0.00      0.00         1

    accuracy                           0.65     21011
   macro avg       0.11      0.18      0.10     21011
weighted avg       0.60      0.65      0.62     21011



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Reporting the results of a multiclass classification is a little more complicated than binary classification. For this part, [please read the documentation of scikit-learn's `classification_report`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).

Then, add an instance method called `generate_report` to the above class. This method should take no arguments (other than the default `self`), and print the report for your classification results in the same format as shown in the above documentation. That is, for a 3-class classification, it should print (shown with dummy results):

```
                precision    recall  f1-score   support

     class 0       0.50      1.00      0.67         1
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.67      0.80         3

    accuracy                           0.60         5
   macro avg       0.50      0.56      0.49         5
   micro avg       1.00      0.67      0.80         5
```

In your generated report, the class names must be the actual labels (e.g., `I-PER`) and not just numbers or indices. As you may have realized, your method will simply be a wrapper around scikit-learn's `classification_report`, but you will have to carefully think about the parameter values to use.

**Note:** You are not responsible for generating a report if a user calls this method before testing (by a call to the `predict` function). If a user does invoke `generate_report` without invoking `predict` first, it is acceptable for your code to raise an error.

# Your insights [20 points]

## 1. Test set vs Dev set: Binary classification [4 points]

For the binary classification task, how much does the performance (in terms of each metric) differ between the dev set and the final test set?

Between the dev set and the final test set it seems that our dev set performs much better overall. Our precision is about 0.08 better in the dev set, recall is about 0.01 better and f1-score is about 0.04 better.

What do you think are the causes behind these differences?

I believe that a cause for these differences is that our vocabulary obtained from the training set has more in common to our dev set then our test set. This leads to many of our tokens being defined as UNK giving us a worse performance and words that have been seen are easier to correctly identify in the correct class. Also we try to maximize the results of our dev set so in doing so we should almost always expect our dev set result to be higher because of this.

Suggest one or two experiments that you should design and conduct, in order to test your hypothesis (i.e., in order to test whether your answer to the above question is, indeed, correct).

I am going to get the overlap in vocabulary between the training set and dev set and training set and test set and compare them. After testing them below I have determined that this is in fact true. The dev set contains 6147 unique overlaping words and the test set contains only 5281 unique overlaping words meaning that the dev set has nearly 1000 more overlapping words making it possibly easier to predict seen words. This could also mean that our test and dev set are more related (discussing similar topics) and our test set may be different from the two. Aditionally, if we tried to maximize the results of our test set before we looked at thse dev set we may be able to get a higher test set than dev set.


## 2. Stochastic Gradient Descent [10 points]

What exactly is an epoch?

An epoch is a complete pass through of all the training data one time. So 100 epochs would be a complete pass through of all of our data 100 times.

Why is it important to optimize over multiple epochs, when in each epoch, the training is happening over the same data?

When training over multiple epochs we are continually changing our weights using our training data to attempt to minimize our loss function. Doing multiple epochs will let our model converge to a point where it can't minimize the loss function anymore.

For the binary classification task, what regularization did you choose when optimizing? Why did you choose this, and not any other?

For binary classification task I used L1 regularization or Lasso regression when optimizing. I chose this and not L2 regularization since after doing further research L1 regularization helps with feature selection by shrinking unnessarcy features to 0. I thought this would be helpful for this problem since in my eyes only some of the features I used would contribute more to classification and some may not contribute at all and by doing this it may make our model simpler.

For the multiclass classification task, what regularization did you choose when optimizing? Why did you choose this, and not any other?

For multiclass classification task I used L1 regularization or Lasso regression when optimizing just like binary classification. I thought this would be helpful for this problem just like binary classification since in my eyes only some of the features I used would contribute more to classification and some of the features I used would just further complicate the model. By using L1 I am helping the model remove unnessarcy features.

What types of learning rate decay were included in your experiments (as discussed in the lecture before Spring break)? Did the dev set play an important role in these experiments? Briefly explain how. Also briefly explain what made you fix the type of learning rate decay when testing on the final test set.

* Note that this question is about the *type* of decay, not the value of the learning rate or the value of the decay parameter.

The type of learning rate decay that were used was time based learnning decay where I limit the learning after each iteration. The dev set played in important role in these experiements because it allowed me to mess around with different parameters to try to maximize my results without over fitting the test set since I hadn't looked at it yet. On the final test I choose to not alter decay because I didnt want to overfit my model on the test data.

## 3. Multiclass classification [6 points]

Were there any classes that were particularly hard to detect?

Most of the classes other than O were hard for my model to dectect, my model detected some of the I-PER aswell but not the best.

Why do you think these classes were comparatively more difficult to identify correctly?

These classes were comparatively more difficult to identify correctly because we can see the training set had less data to train on for those classes. For B-LOC, B-MISC, and B-ORG there where only 3, 2 and 4 samples respectively in the training set. Even trying to weight the classes to help with the imbalance didn't really help because there are barley any samples to learn off of.

What experiments would you design and conduct to try and improve the performance on these difficult categories? Support your answer with technical reasoning (in this context, "technical" means either based on mathematical reasoning, linguistic insights, or statistical insights drawn from data).

Some things to try would be random under sampling of the majority class to try to help with the class imbalance and would therefore help our model learn each class equally and we could probably expect better results. Additionally, it would help even more if our training data had more data of our under represented classes so our model could learn more about each of them. Right now it is nearly impossible to learn any thing from just a few (1-5) samples of a class, our model needs more information if it is going to learn what features lead to what class.


In [None]:
#After reading piazza post it seems like I don't need to implement experiments but will keep this one here anyway since already tried it

#Experiment for insight 1
trainTokenDict, trainPOSDict = get_vocabulary('/content/eng-ner-dataset/eng.train')
devTokenDict, devPOSDict = get_vocabulary('/content/eng-ner-dataset/eng.testa')
testTokenDict, testPOSDict = get_vocabulary('/content/eng-ner-dataset/eng.testb')

#Get same vocab between train dev and train test

commonTrainDev = set(trainTokenDict.keys()) & set(devTokenDict.keys())

commonTrainTest = set(trainTokenDict.keys()) & set(testTokenDict.keys())

print(f"Ammount of unique vocab in dev: {len(set(devTokenDict.keys()))}")
print(f"Ammount of unique vocab in test: {len(set(testTokenDict.keys()))}")
print(f"Amount of unique vocab in common between train and dev: {len(commonTrainDev)}")
print(f"Amount of unique vocab in common between train and dev: {len(commonTrainTest)}")

Ammount of unique vocab in dev: 9003
Ammount of unique vocab in test: 8549
Amount of unique vocab in common between train and dev: 6147
Amount of unique vocab in common between train and dev: 5281
