# NLP Case
In this case we will solve two NLP tasks. 
In the first one we need to find sentences that discuss maintenance. 
In the second task we will classify documents.

We start by reading the provided dataset.

In [1]:
import pandas as pd
df=pd.read_json("data.json")
df.head()

Unnamed: 0,file_name,pagenum,content,el_number,category
0,00206BA4E8F5200610123622.pdf,0,K3G800-PW07-01 EC centrifugal module...,0,36
1,00206BA4E8F5200610123622.pdf,0,"backward-curved, single-intake\n ...",1,36
2,00206BA4E8F5200610123622.pdf,0,ebm-papst Mulfingen GmbH & Co. KG\n ...,2,36
3,00206BA4E8F5200610123622.pdf,0,Nominal data,3,36
4,00206BA4E8F5200610123622.pdf,0,Type K3G800-PW07-01\n Motor M3G200-QA,4,36


## Task 1
In this task we will find sentences that discuss maintenance. The suggested approach is to look for the following keywords:

 - "kontroll", "vedlikehold", "tilsyn", "sikkerhetskontroll", "inspiser", "inspeksjon", "fjerning" (could be a suffix, e.g. "støvfjerning"), "fjerne", "service", "støvsuging", "prøves", "skiftes", "ettertrekking", "ettersyn"

in the same sentence as any of these phrases:

 - årlig, månedlig,   daglig, per år / pr år, per måned / pr måned,  halvårlig, hvert halvår, ukentlig, per uke / pr uke, hver x. til y uke (x, y = integers), hver x. til y. måned (x, y = integers), driftstime, time, en gang i året, en gang i uken, en gang i måneden.
 
The first step is to split paragraphs into sentences. The naive solution is to split strings on '.' characters, but that breaks on abbreviations, decimal numbers and ordinals such as "5.". It's better to use a library, here NLTK's Punkt tokenizer with its Norwegian model:

In [2]:
import nltk
#nltk.download('punkt')
tokenizer = nltk.data.load('tokenizers/punkt/norwegian.pickle')

In [3]:
def flatten(l):
    """
    Flatten a list of lists
    """
    return [item for sublist in l for item in sublist]

def make_sentences(content_string):
    """
    Take a string as an input and return a list of sentences
    """
    lines = content_string.splitlines()
    sentences = [tokenizer.tokenize(l) for l in lines]
    return flatten(sentences)

Let's test it:

In [4]:
content_string = "Jeg spiser i dag. Jeg spiser hver 5. minutt. Ny linje:\nHei hei"
make_sentences(content_string)

['Jeg spiser i dag.', 'Jeg spiser hver 5. minutt.', 'Ny linje:', 'Hei hei']
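
For comparison, here is a minimal sketch of what the naive approach does with the same test string. It uses only the standard library; the variable names are illustrative.

```python
text = "Jeg spiser i dag. Jeg spiser hver 5. minutt. Ny linje:\nHei hei"

# Naive approach: cut at every '.' and drop empty fragments.
naive = [part.strip() for part in text.split(".") if part.strip()]

for part in naive:
    print(repr(part))
```

The ordinal "5." is treated as a sentence end, so "Jeg spiser hver 5. minutt." is cut in two, and the line break after "Ny linje:" is not treated as a boundary at all.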

Apply this function to the content column of our dataset. Note that `x[0]` keeps only the first sentence of each paragraph:

In [5]:
sentences = (
    df.content
    .apply(make_sentences)
    .apply(lambda x: x[0])
)
sentences.head()

0    K3G800-PW07-01           EC centrifugal module...
1                       backward-curved, single-intake
2                       ebm-papst Mulfingen GmbH & Co.
3                                         Nominal data
4                                 Type  K3G800-PW07-01
Name: content, dtype: object
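
The cell above keeps only the first sentence of every paragraph (`x[0]`), so maintenance sentences later in a paragraph are never searched, and `x[0]` would raise an `IndexError` for a paragraph that yields no sentences. Below is a small standard-library sketch of the difference between "first sentence only" and "all sentences". `split_sentences` is a hypothetical regex-based stand-in for `make_sentences`, not the Punkt tokenizer, and the example paragraphs are made up.

```python
import re

def split_sentences(paragraph):
    # Hypothetical stand-in for make_sentences: split after '.', '!' or '?'
    # when followed by whitespace. This is NOT the Punkt tokenizer.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]

# Made-up paragraphs for illustration.
paragraphs = [
    "Filter skiftes årlig. Viften kontrolleres månedlig.",
    "Nominal data",
]

# What the notebook cell does: keep the first sentence of each paragraph.
first_only = [split_sentences(p)[0] for p in paragraphs]

# Alternative: keep every sentence, paired with the index of its paragraph.
all_sentences = [
    (i, s) for i, p in enumerate(paragraphs) for s in split_sentences(p)
]

print(first_only)
print(all_sentences)
```

In pandas, the same effect can probably be achieved by replacing the `lambda` step with `Series.explode()`, which turns each list element into its own row while repeating the original index.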

Finally, search for sentences that contain both a keyword and a frequency expression:

In [6]:
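# Frequency expressions, written as regular expressions. Raw strings keep "\d"
# intact, "\d+" allows multi-digit numbers and "\." matches a literal dot.
candidates = ["årlig", "månedlig", "daglig", "per år", "pr år", "per måned", "pr måned", 
              "halvårlig", "hvert halvår", "ukentlig", "per uke", "pr uke", "driftstime", 
              "time", "en gang i året", "en gang i uken", "en gang i måneden", 
              r"hver \d+\. til \d+\.? uke", r"hver \d+\. til \d+\. måned"]
# Maintenance keywords; substring matching also catches compounds such as "støvfjerning".
keywords = ["kontroll", "vedlikehold", "tilsyn", "sikkerhetskontroll", 
             "inspiser", "inspeksjon", "fjerning", "fjerne", "service", 
             "støvsuging", "prøves", "skiftes", "ettertrekking", "ettersyn"]
# A sentence qualifies only if it contains both a frequency expression and a keyword.
filter1 = sentences.str.contains('|'.join(candidates))
filter2 = sentences.str.contains('|'.join(keywords))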

sentences_maintenance = sentences[filter1 & filter2]
sentences_maintenance.head()

15560    Sikkerhetsrapport årlig kontroll brannalarmanl...
15570    Sikkerhetsrapport årlig kontroll brannalarmanl...
15572      Kontrollseddel årlig kontroll brannalarmanlegg:
15579      Kontrollseddel årlig kontroll brannalarmanlegg:
15581     Kontrollrapport årlig kontroll brannalarmanlegg:
Name: content, dtype: object
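
The search above is case-sensitive, so a sentence written in capitals would be missed. The sketch below shows the same two-filter idea with the standard `re` module and case-insensitive matching. It uses only a subset of the keyword and frequency lists, and the example sentences are made up for illustration.

```python
import re

# Subset of the lists from the task description, for brevity.
keywords = ["kontroll", "vedlikehold", "fjerning", "service", "skiftes"]
frequencies = ["årlig", "månedlig", r"hver \d+\. til \d+\.? uke"]

keyword_re = re.compile("|".join(keywords), re.IGNORECASE)
frequency_re = re.compile("|".join(frequencies), re.IGNORECASE)

def is_maintenance(sentence):
    # True only if the sentence has both a keyword and a frequency expression.
    return bool(keyword_re.search(sentence)) and bool(frequency_re.search(sentence))

# Made-up example sentences.
examples = [
    "Vedlikehold utføres hver 2. til 4. uke.",
    "Service hver 10. til 12 uke.",
    "Støvfjerning utføres årlig.",
    "FILTER SKIFTES ÅRLIG.",
    "Nominal data",
]
results = [is_maintenance(s) for s in examples]
print(results)
```

With pandas, passing `case=False` to `Series.str.contains` should give the same case-insensitive behaviour.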

## Task 2

In this task we will classify documents. 
The first step is to merge paragraphs into documents.

Some documents are labelled with more than one category. This could be a labelling error, or different sections of a document may belong to different classes. Since we don't know which, we simply use the category of the document's first paragraph. 

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn import metrics

documents = []
categories = []

for f in df.file_name.unique():
    paragraph_list = df.loc[df.file_name == f, :].sort_values(by='el_number')
    doc = ' '.join(paragraph_list.content)
    cat = paragraph_list.category.iloc[0]
    
    if (ncat:=paragraph_list.category.nunique()) != 1:
        print(f"Document '{f}' has {ncat} categories")
    
    documents.append(doc)
    categories.append(cat)

Document '15092006 Samsvarserklæring på Høyttalerjobb i bg14.pdf' has 2 categories
Document '171016.sk bla A51 anbefalt oppgradering tekniske anlegg i forbindelse med nye leietakere.pdf' has 2 categories
Document '1_Ferdigattest.pdf' has 2 categories
Document '20_02714-15GNR 137 BNR 3 - Brynsengfaret 4 - Sikring mot tilbakestrømming - Tilbakeslagsventiler - Bilde av tilkobling til fettu.pdf' has 2 categories
Document '220506,samsvarserklæring på høyttalerjobb bg14.pdf' has 2 categories
Document '254.02 Fugemasse SB Flex (245.02).pdf' has 2 categories
Document 'AX flow reduksjonsventil 000300.pdf' has 2 categories
Document 'Adresseliste Akersgata 51.pdf' has 2 categories
Document 'EV220B 6-12 engelsk.pdf' has 2 categories
Document 'FDV XMS 0-4 Oslomiks med Oldroyd og 10mm VT-filt_Blomstertak AS.pdf' has 2 categories
Document 'FDV-dokumentasjon.pdf' has 2 categories
Document 'FDV.pdf' has 2 categories
Document 'Ferdigattest - gnr-bnr 24-547 - solenergianlegg.PDF' has 2 categories
[output truncated]

In [8]:
categories = pd.Series(categories)
categories.nunique()

62

In [9]:
sorted(categories.unique())

['10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '18',
 '19',
 '20',
 '21',
 '210',
 '22',
 '23',
 '233',
 '24',
 '25',
 '26',
 '27',
 '28',
 '29',
 '30',
 '31',
 '32',
 '33',
 '34',
 '35',
 '36',
 '37',
 '40',
 '41',
 '42',
 '43',
 '44',
 '45',
 '46',
 '47',
 '49',
 '51',
 '52',
 '53',
 '54',
 '542',
 '546',
 '547',
 '548',
 '55',
 '552',
 '556',
 '56',
 '57',
 '58',
 '61',
 '62',
 '620',
 '65',
 '66',
 '67',
 '71',
 '74',
 '76',
 '77']

In [10]:
categories.value_counts(dropna = False).head()

11    1364
43     924
40     888
36     752
24     607
dtype: int64

In [11]:
categories.value_counts(dropna = False).tail(20)

21     16
66     13
65     11
51     10
77     10
10      9
547     8
552     7
548     5
16      5
34      3
49      2
210     2
233     2
74      2
42      2
18      2
71      1
620     1
76      1
dtype: int64

It would be a good idea to understand those categories. Can some categories be merged? Some have many instances, while others have very few (e.g. '76' has only 1 instance).

I will map all classes with fewer than 10 instances to an "other" class, labelled '0'.

In [12]:
category_counts = categories.value_counts()
other_categories = category_counts[category_counts < 10].index.values
other_index = categories.isin(other_categories)
categories[other_index] = '0'

Split into training and test:

In [13]:
X_train, X_test, y_train, y_test = train_test_split(documents, categories, test_size=0.33, random_state=42, stratify=categories)

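We will use a classical text classification pipeline: TF-IDF features and a linear SVM (`SGDClassifier` with hinge loss, i.e. a linear SVM trained with stochastic gradient descent):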

In [14]:
vectorizer = TfidfVectorizer()
X_vect_train = vectorizer.fit_transform(X_train)
X_vect_test = vectorizer.transform(X_test)
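
To make the TF-IDF weighting concrete, here is a small pure-Python sketch on made-up, pre-tokenized documents. It assumes the smoothed IDF formula and L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults; it does not reproduce the vectorizer's tokenization.

```python
import math
from collections import Counter

# Made-up, already tokenized documents.
docs = [
    ["årlig", "kontroll", "brannalarmanlegg"],
    ["årlig", "service", "ventilasjon"],
    ["nominal", "data"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term occur?
df_counts = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

def tfidf(doc):
    tf = Counter(doc)
    weights = {t: tf[t] * idf(t) for t in tf}
    # L2-normalise so every document vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(docs[0])
print(vec)
```

The term "årlig" occurs in two of the three documents and therefore gets a lower weight than "kontroll", which occurs in only one.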

In [15]:
model = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, tol=None).fit(X_vect_train, y_train)
y_train_hat = model.predict(X_vect_train)
y_test_hat = model.predict(X_vect_test)

Measure performance on the training and test sets:

In [16]:
print(metrics.classification_report(y_train, y_train_hat))

              precision    recall  f1-score   support

           0       1.00      0.80      0.89        35
          11       0.99      0.99      0.99       914
          12       0.96      0.79      0.87        29
          13       0.82      1.00      0.90        45
          14       0.94      0.74      0.83        23
          15       0.70      0.96      0.81        70
          19       0.94      0.88      0.91       138
          20       0.96      0.77      0.86        62
          21       1.00      0.91      0.95        11
          22       1.00      0.88      0.94        17
          23       0.97      0.90      0.94       189
          24       0.94      0.97      0.96       407
          25       0.95      0.95      0.95       225
          26       1.00      0.94      0.97        62
          27       1.00      0.94      0.97       173
          28       0.91      0.93      0.92        54
          29       0.93      0.98      0.95       157
          30       1.00    

In [17]:
print(metrics.classification_report(y_test, y_test_hat))

              precision    recall  f1-score   support

           0       1.00      0.12      0.21        17
          11       0.96      0.99      0.97       450
          12       0.80      0.53      0.64        15
          13       0.63      0.83      0.72        23
          14       0.75      0.25      0.38        12
          15       0.62      0.88      0.73        34
          19       0.74      0.63      0.68        68
          20       0.75      0.29      0.42        31
          21       0.00      0.00      0.00         5
          22       0.83      0.56      0.67         9
          23       0.70      0.53      0.61        94
          24       0.70      0.80      0.74       200
          25       0.77      0.77      0.77       111
          26       1.00      0.50      0.67        30
          27       0.85      0.66      0.74        85
          28       0.86      0.73      0.79        26
          29       0.73      0.94      0.82        77
          30       0.00    


Performance varies from class to class. When a class has few instances in the test set, its performance estimate is unreliable: class '21', for example, has only 5 test documents. Cross-validation would have given a more robust estimate.
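
As a rough illustration of how much the test-set size matters, the sketch below computes a 95% Wilson score interval for recall. The counts are approximations inferred from the rounded values in the test report (class '12': recall 0.53 on 15 documents; class '11': recall 0.99 on 450 documents), and the function name is illustrative.

```python
import math

def wilson_interval(successes, n, z=1.96):
    # Wilson score interval for a binomial proportion (z=1.96 for about 95%).
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Counts inferred from the rounded report values, so only approximate.
small = wilson_interval(8, 15)     # class '12': about 8 of 15 recalled
large = wilson_interval(446, 450)  # class '11': about 446 of 450 recalled

print("class 12:", small)
print("class 11:", large)
```

The interval for the small class spans tens of percentage points, while the interval for the large class is only a couple of points wide.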

TF-IDF vectorization has many parameters that could be tuned. One option I would have liked to explore is character-level n-grams, which tend to work well in languages like Norwegian where compound words are common.
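
A small sketch of why character n-grams help with compounds: as whole words, "støvfjerning" and "fjerning" are unrelated tokens, but they share all the character 4-grams of "fjerning". The helper below is illustrative and not part of any library.

```python
def char_ngrams(word, n=4):
    # All contiguous character substrings of length n.
    return {word[i:i + n] for i in range(len(word) - n + 1)}

compound = char_ngrams("støvfjerning")
base = char_ngrams("fjerning")
shared = compound & base

print(sorted(shared))
```

In scikit-learn, character n-grams can be enabled through the `analyzer` (`'char'` or `'char_wb'`) and `ngram_range` parameters of `TfidfVectorizer`; suitable values would have to be found by experiment.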

The dataset is quite imbalanced, as we have seen: the largest class has more than a thousand documents, while many others have fewer than twenty. There are methods to deal with this issue. For example, `SGDClassifier` has a `class_weight` parameter that can be used for this purpose. We leave this for future work as well.
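
For reference, this is a pure-Python sketch of the heuristic that scikit-learn documents for `class_weight='balanced'`: each class gets the weight `n_samples / (n_classes * count)`. The label list is a toy example, not the real class distribution.

```python
from collections import Counter

def balanced_weights(labels):
    # weight(c) = n_samples / (n_classes * count(c))
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Toy labels: one frequent and one rare class.
labels = ["11"] * 90 + ["21"] * 10
weights = balanced_weights(labels)

print(weights)
```

Rare classes receive larger weights, so errors on them cost more during training. Whether this improves the scores on this dataset would need to be checked.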