# Simple Vectorization and Classification with spaCy Word Vectors
This notebook shows a simple solution to the Rhetorical Roles (RR) task of LegalEval (https://sites.google.com/view/legaleval/home?pli=1):
1. Represent each sentence with spaCy word vectors
2. Train a Logistic Regression (LR) classifier to classify the sentences into 13 categories
3. Evaluate the LR classifier on the dev data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# train and dev files
## Change the following paths to your paths:
train_file = "/content/drive/MyDrive/Colab Notebooks/semEval/legalEval/taskA-RR/data/train.csv"
dev_file = "/content/drive/MyDrive/Colab Notebooks/semEval/legalEval/taskA-RR/data/dev.csv"

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Read in the train data
train = pd.read_csv(train_file)
train.shape

(28986, 9)

## Map the labels to numbers

In [None]:
lab2id = {"['PREAMBLE']":1, "['NONE']":2, "['FAC']":3, "['ARG_RESPONDENT']":4,
       "['RLC']":5, "['ARG_PETITIONER']":6, "['ANALYSIS']":7, "['PRE_RELIED']":8,
       "['RATIO']":9, "['RPC']":10, "['ISSUE']":11, "['STA']":12,
       "['PRE_NOT_RELIED']":13}

In [None]:
id2lab = {1:"['PREAMBLE']", 2:"['NONE']", 3:"['FAC']", 4:"['ARG_RESPONDENT']",
       5:"['RLC']", 6:"['ARG_PETITIONER']", 7:"['ANALYSIS']", 8:"['PRE_RELIED']",
       9:"['RATIO']", 10:"['RPC']", 11:"['ISSUE']", 12:"['STA']",
       13:"['PRE_NOT_RELIED']"}
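
Since `id2lab` is simply the inverse of `lab2id`, it can also be derived instead of typed out by hand, which keeps the two maps in sync. A minimal sketch, using a shortened, illustrative copy of the label map (the names ending in `_demo` are not part of the notebook):

```python
# Shortened, illustrative copy of the label map (the notebook's map has 13 entries)
lab2id_demo = {"['PREAMBLE']": 1, "['NONE']": 2, "['FAC']": 3}

# Invert the mapping: ids become keys and label strings become values
id2lab_demo = {idx: lab for lab, idx in lab2id_demo.items()}

# The inversion is only lossless when every id is unique
assert len(id2lab_demo) == len(lab2id_demo)
print(id2lab_demo[3])
```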

## Extract Training sentences and labels

In [None]:
sents = train["value.text"].str.replace('\n', "").apply(lambda x: x.lower())
sents

0              in the high court of karnataka,         ...
1              beforethe hon'ble mr.justice anand byrar...
2        this criminal appeal is filed under section 37...
3               this appeal coming on for hearing this ...
4               heard the learned counsel for the appel...
                               ...                        
28981     so section 132 of the evidence act sufficient...
28982     for the reasons aforesaid, the appeal is allo...
28983    the judgment and order dated april 27, 1987 pa...
28984                                               r.s.s.
28985                                      appeal allowed.
Name: value.text, Length: 28986, dtype: object
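
Note that removing `\n` outright glues together words that were separated only by a line break (for example `beforethe` in row 1 of the output above). One possible alternative, shown here on a made-up two-row Series and not on the real data, replaces each run of whitespace with a single space:

```python
import pandas as pd

# Made-up examples; the notebook reads the real text from train["value.text"]
raw = pd.Series(["BEFORE\nTHE HON'BLE   COURT", "  Appeal\nallowed. "])

# Collapse newlines and repeated spaces into one space, trim, then lowercase
clean = (
    raw.str.replace(r"\s+", " ", regex=True)
       .str.strip()
       .str.lower()
)
print(clean.tolist())
```

Switching to this normalization would change the token sequences and therefore the scores reported later in the notebook.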

In [None]:
y = train['value.labels'].map(lab2id)
y

0         1
1         1
2         1
3         1
4         2
         ..
28981     9
28982    10
28983    10
28984     2
28985    10
Name: value.labels, Length: 28986, dtype: int64
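
`Series.map` returns `NaN` for any label missing from the dictionary, and a `NaN` in `y` would later make `fit` fail. The `int64` dtype above indicates that every label was mapped here. For other data, a quick sanity check could look like the following sketch, where `labels_demo` and the shortened map are illustrative stand-ins for `train['value.labels']` and `lab2id`:

```python
import pandas as pd

# Illustrative stand-ins for lab2id and train['value.labels']
lab2id_demo = {"['PREAMBLE']": 1, "['NONE']": 2}
labels_demo = pd.Series(["['PREAMBLE']", "['NONE']", "['UNKNOWN']"])

y_demo = labels_demo.map(lab2id_demo)

# Labels absent from the dictionary map to NaN; list them for inspection
unmapped = labels_demo[y_demo.isna()].unique().tolist()
print(unmapped)
```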

## Convert the sentences to vectors with spaCy word vectors

In [None]:
import spacy

In [None]:
!python -m spacy download en_core_web_md

2022-11-30 16:18:11.150891: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-md==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.4.1/en_core_web_md-3.4.1-py3-none-any.whl (42.8 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.4.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [None]:
# Load the spacy model that you have installed
nlp = spacy.load('en_core_web_md')

In [None]:
# A test sentence
asent = 'This is some text that I am processing with Spacy []'

# Test to process a sentence using the model
doc = nlp(asent)

# It's that simple - all of the vectors and words are assigned after this point
# Get the vector for 'text':
doc[3].vector

array([ 1.8153e+00, -3.0974e+00,  7.8781e+00,  1.7159e+00,  1.3492e+00,
       -4.6307e+00,  3.6709e+00, -8.5784e-02, -4.9755e+00, -8.4094e-01,
        1.0642e+01,  6.8609e+00, -9.2319e+00, -1.5872e-01, -3.8155e-01,
       -1.9255e-01,  3.3571e+00,  3.7723e+00,  1.3672e+00,  6.5571e+00,
       -6.5411e+00, -3.9489e-01, -5.2012e-01,  5.5753e-01, -3.4513e+00,
       -4.5028e+00, -1.5902e+00, -3.7582e+00, -4.8479e+00,  2.5768e+00,
       -7.2187e+00, -4.7998e+00, -1.8594e+00, -4.9777e-01, -2.4411e-01,
       -4.1268e+00, -3.4901e+00, -4.8338e+00,  4.3046e+00,  2.6234e+00,
       -4.4230e-02, -1.3608e-02, -8.8456e+00,  3.7733e+00,  2.6316e+00,
        3.4657e+00,  4.3546e+00,  1.1333e+00, -3.7832e+00, -5.7349e+00,
       -3.3476e+00, -1.0848e+00,  3.8662e+00, -1.7437e+00, -9.9700e-01,
        4.1109e+00,  1.0865e+00,  3.2447e+00,  1.9290e+00, -4.9990e+00,
        6.1250e+00,  3.9852e+00, -5.0349e+00,  2.2019e+00, -1.2268e+00,
        1.2217e+01, -1.9911e-01, -6.9239e+00, -1.4570e-01,  2.51

In [None]:
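# Convert a sentence to a vector by averaging the vectors of all its tokens
def sent2vec(asent, nlp):
    doc = nlp(asent)
    # An empty sentence has no tokens; return a zero vector instead of failing on doc[0]
    if len(doc) == 0:
        return np.zeros(nlp.vocab.vectors_length, dtype=np.float32)
    # Element-wise mean over the token vectors
    return np.mean([token.vector for token in doc], axis=0)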

In [None]:
sent2vec(asent, nlp)

array([ 2.03711152e+00, -1.51326597e+00,  2.02612591e+00, -1.93055809e+00,
       -1.17620003e+00,  1.35419145e-01,  9.98580754e-01,  2.76163888e+00,
       -5.72324896e+00,  9.86769021e-01,  5.58257294e+00,  2.59669036e-01,
       -2.19843388e-01, -3.55576038e-01,  2.34821296e+00,  2.28833413e+00,
        1.81841612e-01,  8.29836547e-01, -1.51111412e+00,  3.50373030e-01,
        2.30216670e+00,  2.78213573e+00, -2.39745593e+00, -3.09082007e+00,
       -3.40377688e+00, -1.40545905e+00, -3.33626658e-01,  1.67857483e-01,
       -2.56542516e+00, -1.09479415e+00, -6.33201838e-01, -5.74387968e-01,
       -1.31133652e+00,  4.87605006e-01, -1.64612997e+00, -1.02813327e+00,
       -4.14576441e-01,  9.03148234e-01,  6.28048706e+00,  3.40591073e+00,
       -9.81136024e-01,  1.20753026e+00, -2.47438264e+00,  9.25227463e-01,
       -5.81508458e-01, -1.06582916e+00,  3.93569827e+00, -1.85011244e+00,
       -7.90956676e-01, -7.38145888e-01, -4.79655117e-01, -4.37842578e-01,
        1.17497838e+00, -

### Converting all sentences to vectors takes a long time: about 6 minutes

In [None]:
features = sents.apply(lambda x: sent2vec(x, nlp))
features.shape

(28986,)

In [None]:
features[0].shape

(300,)

In [None]:
features_arr = np.array(features.to_list())

In [None]:
features_arr.shape

(28986, 300)
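
Calling `nlp` once per sentence is likely a large part of the conversion cost. spaCy's `nlp.pipe` processes texts in batches, and `Doc.vector` defaults to the average of the token vectors, so a batched variant might look like the sketch below. The helper name `sents2matrix` and the `batch_size` value are assumptions for illustration, and the speed-up has not been measured on this data.

```python
import numpy as np

def sents2matrix(texts, nlp, batch_size=256):
    """Return one averaged vector per text as a 2-D array (hypothetical helper)."""
    # nlp.pipe streams the texts through the pipeline in batches
    docs = nlp.pipe(texts, batch_size=batch_size)
    # Doc.vector defaults to the mean of the token vectors
    return np.array([doc.vector for doc in docs])

# Possible usage in this notebook (not executed here):
# features_arr = sents2matrix(sents.tolist(), nlp)
```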

## Logistic Regression Classification

In [1]:
from sklearn.linear_model import LogisticRegression

In [None]:
lrcls = LogisticRegression()

In [None]:
lrcls.fit(features_arr, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()
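
The warning above means the default solver stopped at its iteration limit before converging. Two remedies, both suggested by the warning itself, are to scale the features and to raise `max_iter`. The sketch below combines them in a pipeline on synthetic data from `make_classification`; the value `max_iter=1000` is an assumption and not a tuned setting, and the scores reported in this notebook were produced without these changes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for features_arr and y
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

# Standardize each feature, then fit LR with a higher iteration limit
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_demo, y_demo)
print(clf.predict(X_demo[:5]))
```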

In [None]:
predicts_train = lrcls.predict(features_arr)

In [None]:
from sklearn.metrics import precision_recall_fscore_support

In [None]:
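# Note: scikit-learn's signature is (y_true, y_pred). The predictions are passed first here,
# so the weighted averages use the predicted-label counts as class weights.
# To score against the gold labels, pass (y, predicts_train) instead.
evals = precision_recall_fscore_support(predicts_train, y, average='weighted')
evals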

(0.6881341508631901, 0.5594424894776788, 0.6020199722990451, None)

In [None]:
print('weighted precision on training: {}'.format(evals[0]))
print('weighted recall on training: {}'.format(evals[1]))
print('weighted f1score on training: {}'.format(evals[2]))

weighted precision on training: 0.6881341508631901
weighted recall on training: 0.5594424894776788
weighted f1score on training: 0.6020199722990451
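
scikit-learn documents the signature as `precision_recall_fscore_support(y_true, y_pred, ...)`. With `average='weighted'`, each class is weighted by its count in the first argument, so swapping the two arrays changes the precision (and generally the F1 score), while weighted recall equals accuracy either way. A toy example with made-up labels illustrates the difference:

```python
from sklearn.metrics import precision_recall_fscore_support

# Made-up gold labels and predictions
y_true_demo = [1, 1, 2, 2, 3]
y_pred_demo = [1, 2, 2, 2, 3]

# Conventional order: gold labels first, predictions second
p, r, f, _ = precision_recall_fscore_support(
    y_true_demo, y_pred_demo, average='weighted'
)

# Swapped order: classes are weighted by how often they were predicted
p_sw, r_sw, f_sw, _ = precision_recall_fscore_support(
    y_pred_demo, y_true_demo, average='weighted'
)

print(round(p, 4), round(r, 4))        # 0.8667 0.8
print(round(p_sw, 4), round(r_sw, 4))  # 0.9 0.8
```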


## Evaluate the dev data

In [None]:
# Read in the dev data
dev = pd.read_csv(dev_file)
dev.shape

(2890, 9)

In [None]:
sents_dev = dev["value.text"].str.replace('\n', "").apply(lambda x: x.lower())
sents_dev

0       petitioner:the commissioner of income-taxnew d...
1              date of judgment:05/05/1961bench:das, s.k.
2       bench:das, s.k.hidayatullah, m.shah, j.c.citat...
3       itentered into transactions in the nature of f...
4       the assessee claimed deduction of theselosses ...
                              ...                        
2885                           the petitions are allowed.
2886    the impugned orders are set aside with directi...
2887     the respondent having challenged the judgment...
2888    therefore, having regard to the law laid down ...
2889                                        sd/- judge nv
Name: value.text, Length: 2890, dtype: object

In [None]:
y_dev = dev['value.labels'].map(lab2id)
y_dev

0        1
1        1
2        1
3        1
4        1
        ..
2885    10
2886    10
2887    10
2888    10
2889     2
Name: value.labels, Length: 2890, dtype: int64

In [None]:
features_dev = sents_dev.apply(lambda x: sent2vec(x, nlp))
features_dev.shape

(2890,)

In [None]:
features_dev_arr = np.array(features_dev.to_list())
features_dev_arr.shape

(2890, 300)

In [None]:
predicts = lrcls.predict(features_dev_arr)

In [None]:
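# Note: scikit-learn's signature is (y_true, y_pred). The predictions are passed first here,
# so the weighted averages use the predicted-label counts as class weights.
# To score against the gold labels, pass (y_dev, predicts) instead.
evals_dev = precision_recall_fscore_support(predicts, y_dev, average='weighted')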

  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
print('weighted precision on dev: {}'.format(evals_dev[0]))
print('weighted recall on dev: {}'.format(evals_dev[1]))
print('weighted f1score on dev: {}'.format(evals_dev[2]))

weighted precision on dev: 0.6708300666455441
weighted recall on dev: 0.5480968858131487
weighted f1score on dev: 0.5852034172397143
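
A single weighted average can hide classes that the classifier never predicts, which is also one common trigger for the `_warn_prf` warning shown earlier. A per-class breakdown makes this visible. The sketch below uses made-up labels and the `zero_division=0` option, so that an undefined precision is reported as 0 without a warning:

```python
from sklearn.metrics import precision_recall_fscore_support

# Made-up gold labels and predictions
y_true_demo = [1, 1, 2, 3]
y_pred_demo = [1, 1, 2, 2]   # class 3 is never predicted

# average=None returns one value per class, in the order given by labels
prec, rec, f1, support = precision_recall_fscore_support(
    y_true_demo, y_pred_demo, labels=[1, 2, 3], average=None, zero_division=0
)

for lab, p, r, s in zip([1, 2, 3], prec, rec, support):
    print(lab, round(float(p), 2), round(float(r), 2), int(s))
```

In this notebook the same call could be made with `y_dev`, `predicts`, and `labels=sorted(id2lab)` to see which of the 13 categories drive the dev scores.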
