# Using `spaCy` for text classification

Before running this code, install `spaCy` on your machine and download an English model. This notebook loads `en_core_web_md`, which can be downloaded with `python -m spacy download en_core_web_md`.

*Note:* I'd like to use sklearn from inside the virtual environment, but `jupyter` can't import it there. I've tried the suggestions in the following thread, so far without success. Will solve this later.  
* https://stackoverflow.com/questions/42449814/running-jupyter-notebook-in-a-virtualenv-installed-sklearn-module-not-available  

In [1]:
import spacy
import pandas as pd
from spacy.tokens import Doc
from spacy.vocab import Vocab

In [2]:
# Load the pre-defined English model:
nlp = spacy.load('en_core_web_md')

### Example using Pandas dataframe

In [19]:
# Read in a CSV file with a column of text abstracts. Keep only the columns we need.
df = pd.read_csv('resources/fedreg_18-05-22-14-45.csv')
df = df[['agency', 'abstract', 'type']]
df['abstract']=df['abstract'].astype(str) # Make sure all values are strings. There were some floats in here.
df['abstract-utf8']=df['abstract'].apply(lambda x: x.decode('utf-8')) # Convert to unicode
df = df.dropna(how='any') # get rid of missing data.
df.shape

(2000, 4)

In [20]:
# Let's do a little recoding (fold 'Proposed Rule' into 'Rule'), and then take a look at the label column.
df.loc[df['type']=='Proposed Rule', 'type']='Rule'
df['type'].value_counts(dropna=False)

Notice                   1537
Rule                      430
Presidential Document      33
Name: type, dtype: int64

In [21]:
# Split into training and testing data. Keeping it small so the example doesn't take so long.
df_train=df.head(100)
df_test=df.tail(100)

In [22]:
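# Build a (text, annotations) tuple for each row; the annotations dictionary holds the labels under "cats".
# Note that anything that is not a 'Rule' (including 'Presidential Document') is labeled as a Notice.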
def dictionize_me(row):
    if row['type']=='Rule':
        return (row['abstract-utf8'], {"cats": {"Notice": 0, "Rule": 1}})
    else:
        return (row['abstract-utf8'], {"cats": {"Notice": 1, "Rule": 0}})
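
The same `(text, {"cats": {...}})` shape can be produced for any list of labels. The following is a minimal sketch, not part of the original notebook: `make_example` is a hypothetical helper, the second sample text is invented for illustration, and (unlike `dictionize_me`) it rejects unknown labels instead of folding them into "Notice".

```python
def make_example(text, label, labels=("Notice", "Rule")):
    """Return a (text, annotations) tuple with a 0/1 score for every label."""
    if label not in labels:
        raise ValueError("unknown label: %r" % (label,))
    # Exactly one label gets 1; all the others get 0.
    cats = {name: int(name == label) for name in labels}
    return (text, {"cats": cats})


# Hypothetical rows standing in for (abstract, type) pairs from the dataframe.
rows = [
    ("We are superseding Airworthiness Directive ...", "Rule"),
    ("The agency announces a public meeting ...", "Notice"),
]
train_examples = [make_example(text, label) for text, label in rows]
print(train_examples[0][1])
```

Raising on an unknown label is a design choice for the sketch: it makes a stray category such as 'Presidential Document' visible rather than silently counting it as a Notice.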

In [23]:
# Apply this function to the pandas dataframe.
df_train['newcol'] = df_train.apply(dictionize_me, axis=1)
df_train.head(2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


| | agency | abstract | type | abstract-utf8 | newcol |
|---|---|---|---|---|---|
| 1 | Transportation Department | We are superseding Airworthiness Directive (AD... | Rule | We are superseding Airworthiness Directive (AD... | (We are superseding Airworthiness Directive (A... |
| 2 | Commodity Futures Trading Commission | The Commodity Futures Trading Commission (Comm... | Rule | The Commodity Futures Trading Commission (Comm... | (The Commodity Futures Trading Commission (Com... |


In [24]:
# Convert this into a list
train_data=list(df_train['newcol'])
train_data[0]

(u'We are superseding Airworthiness Directive (AD) 2017-11-03 for DG Flugzeugbau GmbH Model DG-500MB gliders that are equipped with a Solo 2625 02 engine modified with a fuel injection system following the instructions of Solo Kleinmoteren GmbH Technische Mitteilung 4600-3 and identified as Solo 2625 02i. This AD results from mandatory continuing airworthiness information (MCAI) issued by an aviation authority of another country to identify and correct an unsafe condition on an aviation product. The MCAI describes the unsafe condition as failure of the connecting rod bearing resulting from too much load on the rod bearings from the engine control unit. This AD adds a model to the applicability. We are issuing this AD to require actions to address the unsafe condition on these products.',
 {'cats': {'Notice': 0, 'Rule': 1}})

In [29]:
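# Create a pipeline component of type "textcat" and append it to the pipeline.
# Re-running this cell raises E007 (shown below) because a component named 'textcat' is already in the pipeline.
mytextcat2 = nlp.create_pipe('textcat') # Note that the variable is named 'mytextcat2' to distinguish it from an earlier example.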
nlp.add_pipe(mytextcat2, last=True)

ValueError: [E007] 'textcat' already exists in pipeline. Existing names: [u'tagger', u'parser', u'ner', u'textcat']

In [26]:
# Add labels to the text classifier. The predicted scores for these labels are read from the `.cats` attribute, below.
mytextcat2.add_label('Notice')
mytextcat2.add_label('Rule')

1

In [27]:
# Begin training. 
optimizer = nlp.begin_training()

In [28]:
# Make 10 passes over the training data, updating the model one example at a time.
for itn in range(10):
    for doc, gold in train_data:
        nlp.update([doc], [gold], sgd=optimizer)
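
The loop above feeds the examples one at a time and in the same order on every pass. A common variation is to shuffle the data each pass and update on small batches. The sketch below shows only the batching logic in plain Python; `minibatches`, `toy_data`, and the batch size of 4 are assumptions for illustration, and the actual `nlp.update` call is left as a comment.

```python
import random


def minibatches(examples, batch_size, rng):
    """Shuffle a copy of `examples` and yield lists of at most `batch_size` items."""
    shuffled = list(examples)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


# Toy stand-in for `train_data`: ten (text, annotations) tuples.
toy_data = [
    ("text %d" % i, {"cats": {"Notice": i % 2, "Rule": 1 - i % 2}})
    for i in range(10)
]

rng = random.Random(0)  # fixed seed so the shuffles are reproducible
for epoch in range(2):
    for batch in minibatches(toy_data, batch_size=4, rng=rng):
        texts, annotations = zip(*batch)
        # In the notebook, a call such as
        # nlp.update(list(texts), list(annotations), sgd=optimizer)
        # would go here.
        print(epoch, len(texts))
```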

In [None]:
# Provide a new text, and classify it. The predicted score for each label is read from the `.cats` attribute.
doc = nlp(u'TEST TEST TEST We are TEST superseding TEST Airworthiness Directive (AD) 2017-11-03 for DG TEST Flugzeugbau GmbH Model DG-500MB gliders that are equipped with a Solo 2625 02 engine modified with a fuel injection system following the instructions of Solo Kleinmoteren GmbH Technische Mitteilung 4600-3 and identified as Solo 2625 02i. This AD results from mandatory continuing airworthiness information (MCAI) issued by an aviation authority of another country to identify and correct an unsafe condition on an aviation product. The MCAI describes the unsafe condition as failure of the connecting rod bearing resulting from too much load on the rod bearings from the engine control unit. This AD adds a model to the applicability. TEST TEST TEST We are issuing this AD to require actions to address the unsafe condition on these products. TEST TEST TEST TEST')
print(doc.cats)
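
`doc.cats` is a dictionary that maps each label to a score. To turn it into a single predicted label, one option is to take the highest-scoring label and require a minimum score. This is a minimal sketch: `predicted_label`, the 0.5 threshold, and the example scores are all assumptions, not output from the model above.

```python
def predicted_label(cats, threshold=0.5):
    """Return the highest-scoring label, or None if its score is below `threshold`."""
    label = max(cats, key=cats.get)
    return label if cats[label] >= threshold else None


# Made-up scores in the same shape as `doc.cats`.
example_cats = {"Notice": 0.12, "Rule": 0.88}
print(predicted_label(example_cats))
```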

In [None]:
# Classify the testing dataset.
def classify_testing_data(text):
    # Return the predicted score for the 'Notice' label.
    x = text.decode('utf-8')  # Convert to unicode (Python 2 byte string)
    doc = nlp(x)
    return doc.cats['Notice']

In [None]:
df_test['abstract'][1302]

In [None]:
classify_testing_data(df_test['abstract'][1302])

In [None]:
df_test['prob_notice'] = df_test['abstract'].map(classify_testing_data)
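
Once every test abstract has a 'Notice' score, the scores can be compared with the true labels. The sketch below computes plain accuracy without sklearn; `accuracy`, the 0.5 threshold, and the sample values are assumptions for illustration. It expects the true labels to be already reduced to 'Notice' or 'Rule', the same way the training labels were.

```python
def accuracy(true_labels, notice_probs, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the true label."""
    if len(true_labels) != len(notice_probs):
        raise ValueError("true_labels and notice_probs must have the same length")
    if not true_labels:
        raise ValueError("need at least one example")
    predicted = ["Notice" if p >= threshold else "Rule" for p in notice_probs]
    correct = sum(1 for t, p in zip(true_labels, predicted) if t == p)
    return correct / float(len(true_labels))


# Made-up values standing in for df_test['type'] and df_test['prob_notice'].
sample_labels = ["Notice", "Rule", "Notice", "Rule"]
sample_probs = [0.9, 0.2, 0.4, 0.7]
print(accuracy(sample_labels, sample_probs))
```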