### Address Extraction Classifier

#### Prem Shah

Addresses were generated randomly with https://www.randomlists.com/random-addresses. Legal text was taken from the open Legal Case Reports dataset at https://archive.ics.uci.edu/ml/datasets/Legal+Case+Reports.


The dataset has two columns: Text (the raw string) and Tag (its label, either Address or Statement).



1. Classifier Explanation

2. Keras

3. File parsing/results

Import libraries and data

In [2]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

In [3]:
data = pd.read_csv("Data.csv")

In [4]:
data.head()

Unnamed: 0,Text,Tag
0,"7710 Tower Street Hamden, CT 06514",Address
1,"932 Bishop St. Billerica, MA 01821",Address
2,"7906 Silver Spear Road Davison, MI 48423",Address
3,"9739 Acacia Lane Beaver Falls, PA 15010",Address
4,"712 Oak Valley Street Temple Hills, MD 20748",Address


Clean data and convert data types

In [5]:
data = data.dropna()
data["Tag"] = data["Tag"].astype('category')

In [6]:
data.tail()

Unnamed: 0,Text,Tag
6150,"<sentence id=""s198"">In the absence of any rele...",Statement
6151,"<sentence id=""s199"">70 In my view, whether or ...",Statement
6152,"<sentence id=""s200"">71 Given the lack of co-op...",Statement
6153,"<sentence id=""s201"">Accordingly, ground (d) of...",Statement
6154,"<sentence id=""s202"">APPLICATION OF SECTION 33A...",Statement


In [7]:
data['Text'] = data['Text'].replace('<[^>]+>', '', regex=True)
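
The pattern `<[^>]+>` used above matches any run that starts with `<`, ends with `>`, and has no `>` in between, so it removes markup such as the `<sentence id="...">` wrappers. A minimal standalone sketch of the same substitution with the standard `re` module follows; the helper name `strip_tags` and the closing `</sentence>` tag in the sample are assumptions for illustration.

```python
import re

# Same pattern as the pandas call above: "<", one or more non-">" characters, ">".
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_tags(text):
    """Remove XML/HTML-style tags and keep the text between them (hypothetical helper)."""
    return TAG_PATTERN.sub("", text)

# Sample modeled on the last row shown above; the closing tag is assumed.
sample = '<sentence id="s202">APPLICATION OF SECTION 33A TO THE FACTS</sentence>'
cleaned = strip_tags(sample)
print(cleaned)  # APPLICATION OF SECTION 33A TO THE FACTS
```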

In [8]:
data.tail(5)

Unnamed: 0,Text,Tag
6150,In the absence of any relevant information bef...,Statement
6151,"70 In my view, whether or not the appellant wo...",Statement
6152,71 Given the lack of co-operation of the appel...,Statement
6153,"Accordingly, ground (d) of the appeal is rejec...",Statement
6154,APPLICATION OF SECTION 33A TO THE FACTS,Statement


In [9]:
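try:
    # Replace the category labels with their integer codes.
    data["Tag"] = data["Tag"].cat.codes
except AttributeError:
    # Raised when this cell is re-run: the column already holds integer
    # codes and no longer has a .cat accessor.
    print("Tag is already encoded as integer codes")
# 0 = Address, 1 = Statement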
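
The comment above records the code-to-label mapping by hand. As a hedged sketch, the mapping can also be read from the categorical column itself, provided this is done before the column is overwritten with codes. pandas orders string categories lexically by default, which is why Address gets code 0. The toy Series and the name `code_to_label` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for data["Tag"] before it is converted to codes.
tags = pd.Series(["Statement", "Address", "Address"]).astype("category")

# Categories are sorted lexically, so enumerate() yields the code for each label.
code_to_label = dict(enumerate(tags.cat.categories))
codes = tags.cat.codes.tolist()

print(code_to_label)  # {0: 'Address', 1: 'Statement'}
print(codes)          # [1, 0, 0]
```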

In [10]:
def dataset_split(X, y, train_size):
    """Split X and y into train and test sets; train_size is the training fraction."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1 - train_size, random_state=42)
    return X_train, X_test, y_train, y_test

In [11]:
X_train, X_test, y_train, y_test = dataset_split(data['Text'],data['Tag'],0.7)

In [12]:
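vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training text only.
X = vectorizer.fit_transform(X_train)
# Reuse the fitted vocabulary for the test text so no test data leaks into training.
X_t = vectorizer.transform(X_test)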


In [13]:
X

<4308x8550 sparse matrix of type '<class 'numpy.float64'>'
	with 49278 stored elements in Compressed Sparse Row format>

In [14]:
X_train

4739    This prohibition may well mean that Mr Hopper'...
5494    By this time, the Colorado side of Edgarlodge'...
620                  Old Bridge, NJ 08857328 Pawnee Ave. 
6127    In this respect I adopt the statement in CCH A...
4254    In his oral evidence and in submissions, Fairb...
3807                     86 Hilltop Ave. Auburn, NY 13021
5926    12 The statement of affairs as signed by the a...
2232             Christiansburg, VA 2407333 Anderson Dr. 
5943    19 The appellant gave evidence both before the...
644        Hopewell Junction, NY 125339819 Blackburn Rd. 
5436    12 Three Colorado stores were opened in July 1...
4606    In particular, when ss 22(1) , (3), (5) and (7...
5224    Under the AWA, suspension with or without pay ...
1061                       371 Bay Road Astoria, NY 11102
4877    Certain provisions of the HIA (ss 16A, 16B and...
5449    By this time the mark "Colorado" was used prin...
120              117 Center Street Delray Beach, FL 33445
1777          

In [15]:
model = LogisticRegression()

In [16]:
model.fit(X,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [17]:
results = model.predict(X_t)
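
The vectorize, fit, and predict steps above can also be bundled into a single scikit-learn `Pipeline`, which keeps the fitted vocabulary and the model together when classifying new text. The sketch below is illustrative only: it trains on a handful of toy strings (three addresses from the table above, one heading from the dataset, and two hypothetical statements), not on Data.csv, so its predictions say nothing about the real model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (illustrative only); 0 = Address, 1 = Statement.
texts = [
    "7710 Tower Street Hamden, CT 06514",
    "932 Bishop St. Billerica, MA 01821",
    "9739 Acacia Lane Beaver Falls, PA 15010",
    "APPLICATION OF SECTION 33A TO THE FACTS",
    "Accordingly, the appeal is rejected.",              # hypothetical statement
    "The appellant gave evidence before the tribunal.",  # hypothetical statement
]
labels = [0, 0, 0, 1, 1, 1]

# The pipeline fits the vectorizer and the classifier in one call.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# New strings are vectorized with the fitted vocabulary before prediction.
new_texts = ["712 Oak Valley Street Temple Hills, MD 20748", "SUBMISSIONS"]
predictions = clf.predict(new_texts)
print(list(predictions))
```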

In [18]:
# scikit-learn's convention is (y_true, y_pred); accuracy is the same either way.
accuracy_score(y_test, results)

0.9723876556578235

In [19]:
# Called as (predicted, true): rows are predicted labels, columns are true labels.
confusion_matrix(results,y_test)

array([[1229,   51],
       [   0,  567]], dtype=int64)
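
Because the call passes the predictions first, the rows of this matrix are predicted labels and the columns are true labels. Under that reading, and assuming 0 = Address and 1 = Statement as noted earlier, per-class precision and recall follow directly from the four counts. The sketch below is plain arithmetic on the reported matrix.

```python
# Confusion matrix as reported above; rows = predicted label, columns = true label.
cm = [[1229, 51],
      [0, 567]]
labels = ["Address", "Statement"]  # assumes 0 = Address, 1 = Statement

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

metrics = {}
for i, name in enumerate(labels):
    predicted_i = sum(cm[i])            # row sum: everything predicted as class i
    true_i = sum(row[i] for row in cm)  # column sum: everything truly in class i
    metrics[name] = (cm[i][i] / predicted_i, cm[i][i] / true_i)
    print(f"{name}: precision={metrics[name][0]:.4f}, recall={metrics[name][1]:.4f}")
print(f"accuracy={accuracy:.4f}")
# Address: precision=0.9602, recall=1.0000
# Statement: precision=1.0000, recall=0.9175
# accuracy=0.9724
```

All 51 errors are statements predicted as addresses; no address was predicted as a statement.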

In [20]:
# Test texts whose predicted label differs from the true label.
X_test[results!=y_test]

4371                                                   3.
4977                                                    '
4385                                                   4.
5007                                                   5.
5816                                                  ...
5966                                                     
5811                                                  ...
4641                                        SECTION 3D(2)
4906                                                  ...
5221    DVC or PVC including providing a recommendatio...
5823                                                     
5310                                                     
5838                                                  ...
5804                                                    )
4681                                          SUBMISSIONS
4671                      381-382 [69]-[71] and 384 [78].
5246                                                     
5342          
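
Many of the misclassified rows are very short fragments. One possible explanation, offered as a hypothesis rather than a confirmed cause: `TfidfVectorizer` by default lowercases text and keeps only tokens of two or more word characters, so inputs such as `3.` or `)` produce no tokens and map to an all-zero feature row, leaving the prediction to the model's intercept alone. The sketch below approximates that default tokenization with the standard `re` module; the helper name `tokens` is an assumption.

```python
import re

# Default token_pattern of scikit-learn's TfidfVectorizer: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokens(text):
    """Approximate the vectorizer's default tokenization (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())

# Fragments copied from the misclassified output above.
fragments = ["3.", "'", "4.", ")", "SUBMISSIONS", "SECTION 3D(2)"]

# Fragments with no tokens would become all-zero TF-IDF rows.
empty = [f for f in fragments if not tokens(f)]
print(empty)                    # ['3.', "'", '4.', ')']
print(tokens("SECTION 3D(2)"))  # ['section', '3d']
```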

In [21]:
# Removed in scikit-learn 1.2; on newer versions use get_feature_names_out().
vectorizer.get_feature_names()

['00',
 '000',
 '01020',
 '01040',
 '01085',
 '01089',
 '0108986',
 '01201',
 '0120127',
 '01453',
 '014537211',
 '01545',
 '01545473',
 '01604',
 '01701',
 '01752',
 '01760',
 '01801',
 '01803',
 '01810',
 '01821',
 '01824',
 '018249328',
 '01826',
 '018267371',
 '01841',
 '01844',
 '0184493',
 '01845',
 '01851',
 '01867',
 '01867301',
 '01876',
 '01880',
 '01886',
 '01887',
 '01887564',
 '01902',
 '0190244',
 '01906',
 '01915',
 '01923',
 '019239481',
 '01930',
 '01960',
 '02026',
 '02026770',
 '02038',
 '02048',
 '02062',
 '02072',
 '020725',
 '02124',
 '0212446',
 '02125',
 '02127',
 '02130',
 '02131',
 '02132',
 '02135',
 '02136',
 '021367',
 '02138',
 '021387858',
 '02148',
 '02149',
 '02150',
 '02151',
 '02155',
 '02155722',
 '02176',
 '02184',
 '02186',
 '02301',
 '02446',
 '02446832',
 '02453',
 '02472',
 '02474',
 '02478',
 '02478396',
 '02703',
 '02720',
 '0272096',
 '02740',
 '027409298',
 '02760421',
 '02780',
 '02816',
 '0281648',
 '02860',
 '02864',
 '02886',
 '02893',
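
Several features in this list, such as `0108986` and `014537211`, are longer than a five-digit ZIP code. They appear to come from rows like `Old Bridge, NJ 08857328 Pawnee Ave.` in the training sample, where a ZIP code and a street number are fused into one run of digits and therefore tokenized as a single word. The sketch below reproduces that tokenization and shows one hypothetical cleanup that splits such runs after the fifth digit. The helper `split_fused_zip` and the assumption that the first five digits are always the ZIP code are illustrative, not part of the original notebook.

```python
import re

# Default token_pattern of scikit-learn's TfidfVectorizer: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# A digit run longer than five digits: assumed to be ZIP code + street number.
FUSED_ZIP = re.compile(r"\b(\d{5})(\d+)\b")

def split_fused_zip(text):
    """Insert a space after the first five digits of a long digit run (hypothetical)."""
    return FUSED_ZIP.sub(r"\1 \2", text)

row = "Old Bridge, NJ 08857328 Pawnee Ave. "
before = TOKEN_PATTERN.findall(row.lower())
after = TOKEN_PATTERN.findall(split_fused_zip(row).lower())

print(before)  # ['old', 'bridge', 'nj', '08857328', 'pawnee', 'ave']
print(after)   # ['old', 'bridge', 'nj', '08857', '328', 'pawnee', 'ave']
```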
 

In [None]:
pd.read_excel("test_data")