Benchmarking Language Detection for NLP
===

Four Python tools for identifying the language of a text, with a speed and accuracy test

- https://mc.ai/benchmarking-language-detection-for-nlp/

## langdetect

langdetect is a Python port of Google’s language-detection library, which was originally written in Java. Pass your text to the imported detect function and it returns the two-letter ISO 639-1 code of the language to which the model assigned the highest confidence score. (See the ISO 639-1 standard for the full list of codes and their languages.) If you use detect_langs instead, it returns a list of the top candidate languages along with their probabilities.

In [4]:
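from langdetect import DetectorFactory, detect, detect_langs
text = "My lubimy mleko i chleb."
# langdetect is non-deterministic unless DetectorFactory.seed is set,
# so the probabilities below vary from run to run.
detect(text)  # 'cs'
detect_langs(text)  # [cs:0.7142840957132709, pl:0.14285810606233737, sk:0.14285779665739756]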

[cs:0.5714266061070243, pl:0.28571505741259656, sk:0.14285832166773346]

In [7]:
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException 
DetectorFactory.seed = 0
def is_english(text):
    try:
        if detect(text) != "en":
            return False
    except LangDetectException:
        return False
    return True

In [8]:
text = "As time progresses, software gets more tailored to solving specific problems in different domains"

is_english(text)

True
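
The speed and accuracy test mentioned in the subtitle could be sketched as a small harness that times any detector callable over labelled samples. This is a minimal sketch, not the benchmark from the linked article: `benchmark`, `samples`, and the stand-in `toy_detect` are hypothetical names introduced here. In practice you would pass `langdetect.detect` in place of `toy_detect`.

```python
import time

def benchmark(detect_fn, samples):
    """Return (accuracy, seconds_per_text) for a detector over (text, expected_code) pairs."""
    correct = 0
    start = time.perf_counter()
    for text, expected in samples:
        try:
            predicted = detect_fn(text)
        except Exception:
            # A failed detection counts as a miss instead of aborting the run.
            predicted = None
        if predicted == expected:
            correct += 1
    elapsed = time.perf_counter() - start
    return correct / len(samples), elapsed / len(samples)

# Hypothetical stand-in detector so the sketch runs without any model installed.
def toy_detect(text):
    return "pl" if "mleko" in text else "en"

samples = [
    ("My lubimy mleko i chleb.", "pl"),
    ("As time progresses, software gets more tailored.", "en"),
]
accuracy, per_text = benchmark(toy_detect, samples)
print(f"accuracy={accuracy:.2f}, seconds per text={per_text:.6f}")
```

Because the harness only needs a callable that maps text to a language code, the same loop can compare several detectors on identical samples.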

## spaCy language detector

If you use spaCy for your NLP needs, you can add a custom language detection component to your existing spaCy pipeline, which will enable you to set an extension attribute called .language on the Doc object. This attribute can then be accessed via Doc._.language, which will return the predicted language along with its probability.

In [12]:
import spacy
from spacy_langdetect import LanguageDetector
text2 = 'In 1793, Alexander Hamilton recruited Webster to move to New York City and become an editor for a Federalist Party newspaper.'
# The model must be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
# spaCy 2.x style: pass the component instance to add_pipe
nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)
doc = nlp(text2)
doc._.language  # {'language': 'en', 'score': 0.9999978351575265}

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

In [1]:
import glob
import codecs
import numpy
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, f1_score


pd.set_option('display.max_rows', 20, 
              'display.max_columns', 100, 
              'display.float_format', '{:,.2f}'.format)

In [2]:
# Connect to the database
import pyodbc
conn = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
                      r"Server=YO-PC\SQLEXPRESS2016;"  # raw string keeps the backslash literal
                      "Database=ECS_DATA2;"
                      "Trusted_Connection=yes;")

cursor = conn.cursor()

In [3]:
df = pd.read_sql_query("SELECT * FROM Deci_InvoiceD", conn)
print(df.shape)
df.sample(5)

(23608, 121)


Unnamed: 0,BranchCode,RefNO,InvNO,ItemNO,DecItemNO,TariffCode,TariffSeq,StatCode,ImportTariff,PrivilegeCode,AHTN,DepositReson,IsFreeOfChage,UNDG,GroupCode,PdtCode,PdtSubCode,PdtDescriptionT,PdtDescriptionE,RTCProductCode,ProductAttribute1,ProductAttribute2,ProductYear,BrandName,DRemark,DOriginCountry,NetWeight,GrossWeight,DWeightUnit,TariffQty,TariffUnit,SalesCur,DegreeAmt,SalesRate,SalesPrice,SalesPriceTHB,SalesPackUnit,SalesPackQty,SalesTotalPrice,DSalesNetPriceTHB,IncreasedPrice,IncreasedCur,IncreasedRate,IncreasedPriceTHB,DSalesCIFPrice,DSalesCIFPriceTHB,CIFValueAssess,DShippingMark,PackAmount,PackUnit,...,ExpFin,ExpLnd,ExpOth,DExpShippingTHB,DExpFreightTHB,DExpInsuranceTHB,DExpPkgTHB,DExpFinTHB,DExpLndTHB,DExpOthTHB,SubPack,RawCode,ExpOth2,ExpOth3,ExpOth4,ExpOth5,DExpOthTHB2,DExpOthTHB3,DExpOthTHB4,DExpOthTHB5,AHTN2,AHTN3,AHTN4,AHTN5,RTCAHTN,ImpTaxIncentive,ArgTariffCode,ArgTariffSeq,ArgPrivilegeCode,OriginCer,ArgReasonCode,CerExportNo,ACDDDeductedAmt,ACDDProCode,ACDDValueCode,ACDDGrossW,ModelNumber,ModelVersion,ModelCmpTax,ArgValueRate,ArgSpcRate,ArgSpcCode,BOILicenseNo,Royalty,DetailRoyalty,OtherRemark,SubQTY,SortItemNo,SortDecItemNo,DetailOther
11070,0,DSYPI00000871,WXJH18082201,2,2,84509020,60037,0,,0,,,0,,IMPORT,5214FA1146G,IM0001,ส่วนประกอบเครื่องซักผ้า,HOSE ASSEMBLY\r\n5214FA1146G,,,,2018,NO BRAND,,CN,3483.0,0.0,KGM,16200.0,C62,USD,0.0,33.11,0.64,21.29,EA,16200.0,10416.6,344914.46,0.0,USD,33.11,0.0,10843.81,359060.24,0.0,,0.0,CT,...,0.0,0.0,0.0,0.0,10696.83,3448.95,0.0,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,0.0,,,0.0,,,,0.0,0.0,,2358/ร./2557,0,0,,,0,0,
22613,0,DSYPI00001937,M20181108,4,4,85423900,60084,0,,0,,,0,,IMPORT,IC,IM0001,ไอซี,IC,,,,2017,STMICROELECTRONICS,การลดอัตราอากรและการยกเว้นอากรศุลกากร ตามมาตรา...,CN,72.84,0.0,KGM,50000.0,C62,USD,0.0,32.83,0.11,3.74,C62,50000.0,5700.0,187134.42,0.0,USD,32.83,0.0,5700.0,187134.42,0.0,,0.0,CT,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,0.0,,,0.0,,,,0.0,0.0,,1227/ร./2553 ลว. 5 มี.ค. 2553,0,0,,,0,0,
9448,0,DSYP000000701,SVCHQ20180730043,71,71,85299091,60051,0,,0,,,0,,IMPORT,EAJ63968101,IM0001,อุปกรณ์ประกอบทีวี,"LCD,MODULE-TFT\r\nEAJ63968101",,,,2018,LG,,KR,11.73,0.0,KGM,2.0,C62,USD,0.0,33.57,133.08,4466.9,EA,2.0,266.16,8933.79,0.0,USD,33.57,0.0,266.16,8933.79,0.0,LG,0.0,CT,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,0.0,,,0.0,,,,0.0,0.0,,,0,0,,,0,0,
13606,0,DSYPI00001085,HQDG400866669-1,16,16,85412900,60046,0,,0,,,0,,IMPORT,0TR107009AG,IM0001,ตัวควบคุมกระแสไฟฟ้า มีหน้าที่ในการคอนโทรลการไห...,"TR,BIPOLAR\r\n0TR107009AG",,,,2018,"""NO BRAND""",,KR,7.0,0.0,KGM,30000.0,C62,USD,0.0,33.11,0.01,0.24,EA,30000.0,213.3,7062.79,0.0,USD,33.11,0.0,213.3,7062.79,0.0,LG,0.0,CT,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,0.0,,,0.0,,,,0.0,0.0,,,0,0,,,0,0,
4455,0,DSYP000000325,300618-20/SO-LGETH,2,2,73181590,60040,0,,0,,,0,,IMPORT,1ATF0402808,IM0001,สกรู,SCREW\r\n1ATF0402808,,,,2018,NO BRAND,,VN,314.0,0.0,KGM,200000.0,C62,USD,0.0,32.85,0.01,0.2,EA,200000.0,1240.0,40728.05,0.0,USD,32.85,0.0,1276.77,41935.77,0.0,P/O NO:\r\nPART NO\r\nCUST NO\r\n-AS PER ATTAC...,0.0,PX,...,0.0,0.0,0.0,0.0,800.44,407.28,0.0,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,0.0,,,0.0,,,,0.0,0.0,,2358/ร./2557,0,0,,,0,0,


In [4]:
print('Number of tariff codes : ', len(df['TariffCode'].unique()))
print('Number of product codes : ', len(df['PdtCode'].unique()))

Number of tariff codes :  239
Number of product codes :  5525


In [5]:
df = df[['TariffCode', 'PdtDescriptionE']]
df

Unnamed: 0,TariffCode,PdtDescriptionE
0,000085322900,CAPACITOR
1,000085322900,CAPACITOR
2,000085322900,CAPACITOR
3,000085369099,SOCKET
4,000085369099,SOCKET
...,...,...
23603,000040101900,GATE CONVEYOR
23604,000084145999,COOLING BUFFER
23605,000084791010,BUFFER CONVEYOR(NG)
23606,000084289090,MAGAZINE UNLOADER


In [6]:
# Build one {description, tariff code} record per invoice line
db = []
for i in range(len(df)):
    db.append({'descript': df.PdtDescriptionE[i], 'class': df.TariffCode[i]})

db

[{'descript': 'CAPACITOR', 'class': '000085322900'},
 {'descript': 'CAPACITOR', 'class': '000085322900'},
 {'descript': 'CAPACITOR', 'class': '000085322900'},
 {'descript': 'SOCKET', 'class': '000085369099'},
 {'descript': 'SOCKET', 'class': '000085369099'},
 {'descript': 'SOCKET', 'class': '000085369099'},
 {'descript': 'SOCKET', 'class': '000085369099'},
 {'descript': 'PUMP-DRAIN [DC31-00030A]', 'class': '000084714910'},
 {'descript': 'PUMP-DRAIN [DC31-00030A]', 'class': '000084714910'},
 {'descript': 'PUMP-DRAIN [DC31-00030A]', 'class': '84149091'},
 {'descript': 'PUMP-DRAIN [DC31-00030A]', 'class': '84149091'},
 {'descript': 'PUMP-DRAIN [DC31-00030A]', 'class': '84149091'},
 {'descript': 'PUMP-DRAIN [DC31-00030A]', 'class': '84149091'},
 {'descript': 'PUMP-DRAIN [DC31-00030A]', 'class': '84149091'},
 {'descript': 'LED MODULE\r\nLED MODULE', 'class': '000085371091'},
 {'descript': 'LED MODULE                                        \r\nLED MODULE',
  'class': '000085371091'},
 {'desc

In [7]:
# Convert to a DataFrame

data = pd.DataFrame(db)
print(len(data))
data

23608


Unnamed: 0,descript,class
0,CAPACITOR,000085322900
1,CAPACITOR,000085322900
2,CAPACITOR,000085322900
3,SOCKET,000085369099
4,SOCKET,000085369099
...,...,...
23603,GATE CONVEYOR,000040101900
23604,COOLING BUFFER,000084145999
23605,BUFFER CONVEYOR(NG),000084791010
23606,MAGAZINE UNLOADER,000084289090


In [8]:
desc = data['descript'].values

# CountVectorizer handles the tokenization

vect = CountVectorizer(stop_words='english', lowercase=True)
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [9]:
# Tokenize: learn the vocabulary from the descriptions

count_train = vect.fit(desc)
count_train

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [10]:
# Inspect vocabulary_

vect.vocabulary_

{'capacitor': 2282,
 'socket': 6003,
 'pump': 5813,
 'drain': 2555,
 'dc31': 2527,
 '00030a': 3,
 'led': 4914,
 'module': 5619,
 'oled': 5700,
 'eaj63950701': 2886,
 'pcb': 5729,
 'assembly': 2160,
 'main': 5003,
 '6871er1081f': 1119,
 'sub': 6044,
 'lcd': 4908,
 'panel': 5722,
 'tft': 6188,
 'eaj64108001': 2901,
 'harness': 4801,
 'ead62154304': 2663,
 '6871a10056l': 1082,
 'ber': 2223,
 'total': 6207,
 'ebu64071876': 4558,
 'ebu63486601': 4475,
 'ebu62391902': 4411,
 'ebu63172004': 4449,
 'ebu62925601': 4435,
 'ebu62464701': 4426,
 'ebu63478301': 4474,
 'ebu63942551': 4544,
 'ebu63584101': 4485,
 'ebu63493501': 4476,
 'ebu63264701': 4464,
 'ebu63584801': 4486,
 'ebu63740201': 4515,
 'ebu63814701': 4526,
 'ebu63541901': 4480,
 'ebu62622901': 4432,
 'ebu64036001': 4551,
 'ebu63646101': 4506,
 'ebu64101901': 4565,
 'ebu63933051': 4540,
 'ebu62445701': 4422,
 'ebu62153001': 4394,
 'ebu64151201': 4571,
 'ebu63496001': 4478,
 'ebu62373402': 4408,
 'chassis': 2306,
 'ebt64584508': 4306,
 'e

In [11]:
# Transform the descriptions into a document-term matrix (similar to one-hot encoding, but holding counts)

transformer = vect.transform(desc)
transformer

<23608x6336 sparse matrix of type '<class 'numpy.int64'>'
	with 75818 stored elements in Compressed Sparse Row format>

In [12]:
# Inspect the resulting matrix: each row is one description; an entry is non-zero if the word occurs in that description and 0 if it does not

print(transformer.toarray())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
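
To make the vocabulary and the matrix less opaque, here is a rough pure-Python sketch of the bag-of-words step. It reuses the `token_pattern` shown in the `CountVectorizer` repr above and assigns column indices in alphabetical order of the terms. It is an illustration under simplifying assumptions, not scikit-learn's implementation: English stop-word removal is left out, and `tokenize`, `fit_vocabulary`, and `transform` are hypothetical helper names.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # tokens of two or more word characters

def tokenize(text):
    # Lowercase first, then extract the tokens.
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(docs):
    # Column indices follow the alphabetical order of the terms.
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}

def transform(docs, vocabulary):
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for term, n in counts.items():
            if term in vocabulary:  # terms not seen during fitting are dropped
                row[vocabulary[term]] = n
        rows.append(row)
    return rows

docs = ["CAPACITOR", "PUMP-DRAIN [DC31-00030A]", "LED MODULE\r\nLED MODULE"]
vocabulary = fit_vocabulary(docs)
matrix = transform(docs, vocabulary)
print(vocabulary)
print(matrix)
```

The third description yields a count of 2 for both `led` and `module`, which shows that the matrix stores occurrence counts and not just 0/1 flags.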


In [13]:
# Prepare tf-idf

from sklearn.feature_extraction.text import TfidfTransformer
# Configure tf-idf

tfidf_transformer = TfidfTransformer(smooth_idf=False).fit(transformer)

# Show the idf values

tfidf_transformer.idf_

# tf(t, d) = occurrences of t in document d / total number of words in d
# smooth_idf=False: idf(t) = ln(total number of documents / number of documents containing t) + 1
# smooth_idf=True:  idf(t) = ln[(1 + total number of documents) / (1 + number of documents containing t)] + 1
# Example with 5 documents and a term found in 1 of them:
# smooth_idf=True  => ln((1+5)/(1+1)) + 1 = 2.09861229
# smooth_idf=False => ln(5) + 1 = 2.60943791
# Some definitions use log(), others use ln()

# Prepare naive Bayes

array([10.37619374,  9.68304656, 11.06934092, ...,  8.07360864,
        9.68304656,  8.29675219])
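
The two idf formulas in the comments can be checked with a few lines of plain Python. This is only a worked example of the arithmetic; the helper name `idf` and its parameters are made up for the illustration.

```python
import math

def idf(n_documents, document_frequency, smooth=True):
    """Inverse document frequency: natural log of the ratio, plus 1."""
    if smooth:
        return math.log((1 + n_documents) / (1 + document_frequency)) + 1
    return math.log(n_documents / document_frequency) + 1

# Toy case from the comments: 5 documents, term present in 1 of them.
print(round(idf(5, 1, smooth=True), 8))   # 2.09861229
print(round(idf(5, 1, smooth=False), 8))  # 2.60943791

# The notebook's corpus: 23608 descriptions, smooth_idf=False.
rare_once = idf(23608, 1, smooth=False)   # term found in a single description
rare_twice = idf(23608, 2, smooth=False)  # term found in two descriptions
print(rare_once, rare_twice)
```

The last two values line up with 11.06934092 and 10.37619374 in the `idf_` array above, which suggests those entries belong to terms that occur in only one or two descriptions.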

In [14]:
from sklearn.naive_bayes import MultinomialNB

# Prepare the class labels as an array, in the same form as the descriptions (desc)

arr_class = data['class'].values
arr_class

array(['000085322900', '000085322900', '000085322900', ...,
       '000084791010', '000084289090', '000063079090'], dtype=object)

In [15]:
# Prepare the training features from the count matrix

messages_tfidf = tfidf_transformer.transform(transformer)
print(messages_tfidf.shape)

(23608, 6336)


In [16]:
messages_tfidf.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [17]:
# Build the model

from sklearn.model_selection import train_test_split

# Hold out 40% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(messages_tfidf, arr_class, test_size=0.4, random_state=0)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

# Fit on the training split only
detect_model = MultinomialNB().fit(x_train, y_train)
# Alternative: fit on the full dataset
#detect_model = MultinomialNB().fit(messages_tfidf,arr_class)

(14164, 6336) (9444, 6336) (14164,) (9444,)


In [18]:
print(detect_model.score(x_test, y_test))

0.7922490470139771
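
For intuition about what `MultinomialNB` is fitting, the following is a minimal pure-Python sketch of multinomial naive Bayes. It is not scikit-learn's code: it works on raw word counts instead of tf-idf weights, it assumes additive smoothing with `alpha=1.0`, and `train_nb` and `predict_nb` are hypothetical names. The toy rows reuse descriptions and tariff codes that appear in the data above.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model on tokenized docs (lists of words)."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        # Additive smoothing keeps unseen words from zeroing the likelihood.
        log_likelihood = {
            w: math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unknown words are ignored."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(log_likelihood[w] for w in doc if w in log_likelihood)
    return max(model, key=score)

train_docs = [["capacitor"], ["capacitor"], ["socket"], ["pump", "drain"]]
train_labels = ["000085322900", "000085322900", "000085369099", "84149091"]
model = train_nb(train_docs, train_labels)
print(predict_nb(model, ["capacitor"]))
print(predict_nb(model, ["drain", "pump"]))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.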


In [19]:
# Prediction
# Create a test sample
# Test with the first description in the dataset

desc[0]

'CAPACITOR'

In [20]:
# Convert it to a count vector
t0 = vect.transform([desc[0]])

t0.toarray()


array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [21]:
# Run the prediction

print('Predicted: ', detect_model.predict(messages_tfidf[0]))

Predicted:  ['000085322900']


In [22]:
# The prediction matches the expected tariff code for the first row
# Now predict on the full dataset; rows seen during training should mostly match their labels

all_predictions = detect_model.predict(messages_tfidf)
print(all_predictions)

['000085322900' '000085322900' '000085322900' ... '000085371019'
 '000085371019' '000085371019']


In [23]:
all_predictions[-1]

'000085371019'

In [24]:
# Shortcut: predict the tariff code for a single description

def model_predict(txt):
    vectmp = vect.transform([txt])
    tftmp = tfidf_transformer.transform(vectmp)
    return detect_model.predict(tftmp)

In [25]:
# Evaluate performance

from sklearn.metrics import classification_report
print(classification_report(arr_class, all_predictions))


              precision    recall  f1-score   support

 '0000852990       0.00      0.00      0.00         1
000022041000       0.00      0.00      0.00         2
000022042111       0.00      0.00      0.00         4
000022051010       0.00      0.00      0.00         2
000022082050       0.00      0.00      0.00         1
000022083000       0.00      0.00      0.00        33
000022084000       0.00      0.00      0.00         1
000022085000       0.00      0.00      0.00         2
000022086000       0.00      0.00      0.00         6
000022087090       0.00      0.00      0.00         2
000024022090       0.00      0.00      0.00        37
000027101943       0.00      0.00      0.00        48
000027101990       0.00      0.00      0.00         5
000028112290       0.00      0.00      0.00         1
000032089090       0.00      0.00      0.00         3
000034022015       0.00      0.00      0.00         2
000035069100       0.00      0.00      0.00         2
000038247800       0.00    
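
The rows of zeros in the report are what you would expect for tariff codes the classifier never predicts. A small pure-Python sketch of the per-class arithmetic makes this visible. The function name `per_class_report` is hypothetical, the toy labels are invented for the illustration, and reporting an undefined precision as 0.0 is an assumption that mirrors the zeros shown above.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)}; undefined ratios become 0.0."""
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)
    support = Counter(y_true)
    report = {}
    for label in sorted(support):
        tp = true_pos[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / support[label]
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = (precision, recall, f1, support[label])
    return report

# Toy labels: the rare class "000022041000" is never predicted.
y_true = ["000085322900"] * 4 + ["000022041000"] * 2
y_pred = ["000085322900"] * 6
report = per_class_report(y_true, y_pred)
for label, (p, r, f1, n) in report.items():
    print(f"{label}  {p:.2f}  {r:.2f}  {f1:.2f}  {n}")
```

With 239 tariff codes and many of them backed by only a handful of rows, a frequency-driven model tends to favour the large classes, so overall accuracy can look reasonable while most rare classes score zero.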

Multi Class Text Classification With Deep Learning Using BERT
===

- https://towardsdatascience.com/multi-class-text-classification-with-deep-learning-using-bert-b59ca2f5c613

In [None]:
!pip install torch 

In [1]:
import pandas as pd
import torch
from tqdm.notebook import tqdm

from transformers import BertTokenizer
from torch.utils.data import TensorDataset

from transformers import BertForSequenceClassification

df = pd.read_csv('data/title_conference.csv')
df.head()

ModuleNotFoundError: No module named 'torch'
