### Address Extraction Classifier

#### Prem Shah

Addresses were generated randomly with https://www.randomlists.com/random-addresses. Legal documents were taken from the open dataset https://archive.ics.uci.edu/ml/datasets/Legal+Case+Reports.


The dataset has two columns: the text (`Text`) and its label (`Tag`).



### Import libraries and data

In [193]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

In [400]:
data = pd.read_csv("Data.csv")
test_data = pd.read_excel("testdata.xlsx")

In [284]:
data.tail()

Unnamed: 0,Text,Tag
11244,9.6 Governing law and jurisdiction (a) The Sc...,
11245,"<sentence id=""s532"">(b) The Scheme Creditors h...",
11246,"<sentence id=""s533"">(c) The Scheme Creditors i...",
11247,"<sentence id=""s534"">(d) Nothing in this clause...",
11248,"<sentence id=""s535"">(e) Notwithstanding the pr...",


### Clean data and convert data types

In [128]:
# Drop rows with missing values, e.g. the untagged rows shown by data.tail()
data = data.dropna()
data["Tag"] = data["Tag"].astype('category')

In [130]:
def remove_xml_tags(series):
    """
    Removing XML tags from training data
    """
    series = series.replace('<[^>]+>', '', regex=True)
    return series
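
To see what the regular expression strips, here is a minimal check on two made-up strings. The sample rows are illustrative only and are not taken from `Data.csv`; the function body is the same as above.

```python
import pandas as pd

def remove_xml_tags(series):
    """Strip anything that looks like an XML tag from each string."""
    return series.replace('<[^>]+>', '', regex=True)

# Hypothetical sample rows, not taken from Data.csv
sample = pd.Series([
    '<sentence id="s1">(a) The Scheme applies.</sentence>',
    '255 W. Bedford Lane',
])

cleaned = remove_xml_tags(sample)
print(cleaned.tolist())
```

Text without tags passes through unchanged, so the function is safe to apply to the address rows as well.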

In [164]:
def remove_short_texts(df,series):
    """
    Takes a dataframe and one of its text columns, returns a dataframe
    Removes all rows whose text has fewer than 5 characters
    """
    series = series.astype(str)
    df = df[series.apply(lambda x: len(x) >= 5)]
    return df
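
A small sketch of the length filter on invented rows (the strings are assumptions chosen to sit on both sides of the 5-character threshold). It suggests that a text of exactly 5 characters is kept, since the comparison is `>=`.

```python
import pandas as pd

def remove_short_texts(df, series):
    """Keep only rows whose text has at least 5 characters."""
    series = series.astype(str)
    return df[series.apply(lambda x: len(x) >= 5)]

# Hypothetical rows: 5, 3, 12 and 0 characters long
sample_df = pd.DataFrame({'Text': ['FACTS', '(a)', '283 1st Ave.', '']})

kept = remove_short_texts(sample_df, sample_df['Text'])
print(kept['Text'].tolist())
```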

In [386]:
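try:
    # Replace the category labels with their integer codes
    data["Tag"] = data["Tag"].cat.codes
except AttributeError:
    # .cat is unavailable once the column already holds integer codes
    print("Already Categorical")
#0 = Address, 1 = Statement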

Already Categorical
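
The `0 = Address, 1 = Statement` mapping follows from how pandas orders categories. The sketch below assumes the raw `Tag` column holds the strings `'Address'` and `'Statement'` (the notebook does not show the raw values), and shows one way to recover the mapping instead of hard-coding it.

```python
import pandas as pd

# Assumption: the raw Tag column holds the strings 'Address' and 'Statement'
tags = pd.Series(['Statement', 'Address', 'Address', 'Statement']).astype('category')

# Categories are sorted, so 'Address' gets code 0 and 'Statement' gets code 1
code_to_label = dict(enumerate(tags.cat.categories))
codes = tags.cat.codes.tolist()

print(code_to_label)
print(codes)
```

Building `code_to_label` before overwriting the column with `.cat.codes` keeps the mapping available for decoding predictions later.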


In [141]:
data['Text'] = remove_xml_tags(data['Text'])
data = remove_short_texts(data,data['Text'])

### Model building and training

In [281]:
def dataset_split(X,y,train_size):
    """
    Training and testing split
    """
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1-train_size, random_state=42)
    return X_train, X_test, y_train, y_test

In [30]:
X_train, X_test, y_train, y_test = dataset_split(data['Text'],data['Tag'],0.7)

In [361]:
vectorizer = TfidfVectorizer(lowercase=False, ngram_range=(1,1))
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
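
A minimal sketch of what fitting on the training split and only transforming the test split implies, using made-up texts. With `lowercase=False` the vocabulary is case-sensitive, and tokens that were not seen during fitting are ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical texts standing in for the training and test splits
train_texts = ['167 Sierra St.', 'The letter states inter alia:']
test_texts = ['247 Wayne Ave.', 'the letter']

vec = TfidfVectorizer(lowercase=False, ngram_range=(1, 1))
train_matrix = vec.fit_transform(train_texts)   # learns the vocabulary
test_matrix = vec.transform(test_texts)         # reuses it, learns nothing new

print(sorted(vec.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
# First test row has no known tokens; second matches only 'letter'
print(test_matrix[0].nnz, test_matrix[1].nnz)
```

Both matrices share the same number of columns, which is what lets a model trained on one be applied to the other.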


In [362]:
def metrics(predicted_values, test_values):
    '''
    Prints accuracy, precision, recall and F1 score (as percentages),
    followed by a confusion matrix.

    predicted_values: list/array of labels predicted by the model
    test_values: list/array of actual labels

    Label 1 (Statement) is treated as the positive class.
    The function can be modified to return the values instead of printing them.
    '''
    predicted_values = np.array(predicted_values)
    test_values = np.array(test_values)
    true_positives,true_negatives,false_positives,false_negatives = 0,0,0,0
    
    for i in range(0,len(predicted_values)):
        if (predicted_values[i] == 0 and test_values[i] == 0):
            true_negatives = true_negatives + 1
        elif (predicted_values[i] == 1 and test_values[i] == 1):
            true_positives = true_positives + 1
        elif (predicted_values[i] == 1 and test_values[i] == 0):
            false_positives = false_positives + 1
        else:
            false_negatives = false_negatives + 1
        
    
    accuracy = 100*(true_positives + true_negatives) / (true_positives + true_negatives + false_positives + false_negatives) 
    precision = 100*true_positives / (true_positives + false_positives)
    recall = 100*true_positives / (true_positives + false_negatives)
    f1_score = 2*precision*recall / (precision+recall)
    
    print("Results:\n----------------------------")
    print("Accuracy:", round(accuracy,2), "%")
    print("Precision:", round(precision,2), "%")
    print("Recall:", round(recall,2), "%")
    print("F1 Score:", round(f1_score,2), "%")   
    print("\n")
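    # Arguments are passed as (predicted, actual), so rows are predicted labels
    # and columns are actual labels (sklearn's usual order is (y_true, y_pred))
    print(confusion_matrix(predicted_values,test_values))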
    
    return None
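
As a cross-check on the hand-written counters, the same quantities can be computed with scikit-learn's own metric functions. The labels below are invented for illustration. Note that the function above passes `(predicted, actual)` to `confusion_matrix`, so its printed matrix is the transpose of scikit-learn's usual `(y_true, y_pred)` layout.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 0 = Address, 1 = Statement (positive class)
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, pos_label=1)
recall = recall_score(y_true, y_pred, pos_label=1)
f1 = f1_score(y_true, y_pred, pos_label=1)

# Usual layout: rows = actual labels, columns = predicted labels
cm_true_first = confusion_matrix(y_true, y_pred)
# Layout printed by metrics(): rows = predicted labels, columns = actual labels
cm_pred_first = confusion_matrix(y_pred, y_true)

print(accuracy, precision, recall, f1)
print(cm_true_first)
print(cm_pred_first)
```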

In [363]:
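def evaluate_model(X_train,y_train,X_test,y_test,model):
    """Fit the model, print metrics on the test split, return predictions and model"""
    # toarray() gives a plain ndarray; todense() returns np.matrix, which
    # recent scikit-learn versions reject (GaussianNB needs dense input)
    model.fit(X_train.toarray(),y_train)
    results = model.predict(X_test.toarray())
    metrics(results,y_test)
    return results,model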

In [364]:
def test_model(X_test,model):
    """Predict labels for already-vectorized texts with a fitted model"""
    results = model.predict(X_test.toarray())
    return results
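
One alternative way to organize the same steps, shown only as a sketch and not as the approach this notebook takes, is to bundle the vectorizer and the classifier in a scikit-learn `Pipeline` so raw strings can be classified directly. `MultinomialNB` is used here because it accepts sparse input without a dense conversion. The training texts are a toy sample loosely modelled on strings that appear in this notebook's outputs, so the predictions themselves carry no weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data for illustration only: 0 = Address, 1 = Statement
texts = [
    '255 W. Bedford Lane',
    '283 1st Ave.',
    '167 Sierra St.',
    'Jacobson J dismissed that application.',
    'The letter states inter alia:',
    'This last proposition requires some elaboration.',
]
labels = [0, 0, 0, 1, 1, 1]

# Same vectorizer settings as above, followed by the classifier
pipeline = make_pipeline(
    TfidfVectorizer(lowercase=False, ngram_range=(1, 1)),
    MultinomialNB(),
)
pipeline.fit(texts, labels)

# Raw strings go in; the pipeline vectorizes them with the fitted vocabulary
predictions = pipeline.predict(['247 Wayne Ave.', 'They provide:'])
print(predictions)
```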

### Model Results 

In [365]:
results_lr,model_lr = evaluate_model(X_train_vectorized,y_train,X_test_vectorized,y_test,LogisticRegression())

Results:
----------------------------
Accuracy: 98.5 %
Precision: 99.86 %
Recall: 95.23 %
F1 Score: 97.49 %


[[1752   37]
 [   1  739]]
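
The four percentages can be recomputed from the matrix above as a sanity check. The sketch reads the matrix with rows as predicted labels and columns as actual labels, which is the order in which `metrics` passes its arguments, and treats label 1 as the positive class.

```python
# Counts read from the logistic regression matrix
# (rows = predicted label, columns = actual label)
true_negatives, false_negatives = 1752, 37
false_positives, true_positives = 1, 739

total = true_positives + true_negatives + false_positives + false_negatives
accuracy = 100 * (true_positives + true_negatives) / total
precision = 100 * true_positives / (true_positives + false_positives)
recall = 100 * true_positives / (true_positives + false_negatives)
f1_score = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1_score, 2))
```

The rounded values match the printed results: 98.5, 99.86, 95.23 and 97.49.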


In [366]:
results_gnb,model_gnb = evaluate_model(X_train_vectorized,y_train,X_test_vectorized,y_test,GaussianNB())

Results:
----------------------------
Accuracy: 99.8 %
Precision: 99.36 %
Recall: 100.0 %
F1 Score: 99.68 %


[[1748    0]
 [   5  776]]


In [367]:
results_mnb,model_mnb = evaluate_model(X_train_vectorized,y_train,X_test_vectorized,y_test,MultinomialNB())

Results:
----------------------------
Accuracy: 99.6 %
Precision: 100.0 %
Recall: 98.71 %
F1 Score: 99.35 %


[[1753   10]
 [   0  766]]


### Misclassifications

In [393]:
X_test[results_gnb!=y_test]

4060    930 n.e high street #f-223 issaql ah washingto...
5558                                 255 W. Bedford Lane 
5665                                        283 1st Ave. 
5378                                      247 Wayne Ave. 
4690                                      167 Sierra St. 
Name: Text, dtype: object

In [369]:
X_test[results_mnb!=y_test]

7695                                              (TS 20)
7589                                      (-) citalopram.
6854                         SERIOUS QUESTION TO BE TRIED
6981     MANAGING ALLEGATIONS OF MISCONDUCT OR SERIOUS...
7058                                           REDUNDANCY
7190                                      &#8226; Fashion
7046                                       Superannuation
7186                                     &#8226; Trekking
7573                                                FACTS
7185                                       &#8226; Rugged
Name: Text, dtype: object

In [370]:
X_test[results_lr!=y_test]

6408                                        SECTION 3D(2)
6742    As the High Court observed in Alexandra Privat...
7254    18 In August 1992 Edgarlodge opened a retail s...
6581               Jacobson J dismissed that application.
7681                               (b) a letter enclosing
7472           A defence under s 122(1)(f) was abandoned.
7184                                     &#8226; American
7695                                              (TS 20)
7589                                      (-) citalopram.
7611                        The letter states inter alia:
7041                                Fixed Term Employment
6307                                        They provide:
6895    BREACH OF AN IMPLIED TERM OF TRUST AND CONVENI...
8382    4 In 1988 Ms Pelka purchased a two bedroom apa...
7292     This last proposition requires some elaboration.
8517                                       (Ground 4(d)).
7373    They are as varied as plastic bags, bodybags, ...
7957          

### Testing on Real Data

In [401]:
test_data = remove_short_texts(test_data,test_data['Text'])

In [402]:
test_data_vectorized = vectorizer.transform(test_data['Text'])


In [403]:
test_data_vectorized

<173x10539 sparse matrix of type '<class 'numpy.float64'>'
	with 962 stored elements in Compressed Sparse Row format>

In [404]:
results = test_model(test_data_vectorized,model_gnb)

In [405]:
test_data['Tag'] = results

In [406]:
test_data['Tag'] = np.where(test_data['Tag']==0, 'Address', 'Statement')

In [408]:
test_data[test_data['Tag'] == 'Address']

Unnamed: 0,Text,Tag
31,2120 S.W. 337TH PLACE #209,Address
32,"FEDERAL WAY, Washington 98023",Address
45,"i Ft Bo 4 901 Fifth Avenue, Suite 800",Address
46,"* ama SEATTLE, WASHINGTON 98164",Address
47,TELEPHONE: (206) 386-4800,Address
48,FACSIMILE: (206) 233-8166,Address
90,andy Redford/WSBA No. 21529,Address
94,"900 Fourth Avenue, Suite 1400",Address
95,"SEATTLE, WASHINGTON 98164",Address
96,TELEPHONE; (206) 386-4800,Address


### Writing to file 

In [409]:
test_data.to_csv("test_output.csv")