# Creating Standard Training Data

In this notebook we will do the following :
  1. Build a basic text-cleaning pipeline and apply it to the documents
  2. Convert the data to the training data format, i.e. `__label__tag1 __label__tag2 text`
  3. Split the dataset into train, dev and test sets
  4. Divide the dataset into groups/classes

     a. Divide the entire dataset (~1M records) into 14 groups/classes.

     b. Check the label distribution in each of the 14 groups/classes.

  5. Create the corpus & label dictionary : Flair Corpus

**HOT TIP** : *Save them as pickle files so they can be reloaded quickly for experiments*



In [0]:
# First let's check what has Google given us ! Thank you Google for the GPU

!nvidia-smi

In [0]:
# Let's mount our G-Drive. Hey !! Because for GPU you now give your data to Google 

from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
# Install necessary packages and restart the environment

! pip install tiny-tokenizer
! pip install  flair

Collecting flair
  Downloading https://files.pythonhosted.org/packages/16/22/8fc8e5978ec05b710216735ca47415700e83f304dec7e4281d61cefb6831/flair-0.4.4-py3-none-any.whl (193kB)
Collecting transformers>=2.0.0
  Downloading https://files.pythonhosted.org/packages/d1/08/4a6768ca1a7a4fa37e5ee08077c5d02b8d83876bd36caa5fc24d98992ac2/transformers-2.2.2-py3-none-any.whl (387kB)
Collecting langdetect
  Downloading https://files.pythonhosted.org/packages/59/59/4bc44158a767a6d66de18c4136c8aa90491d56cc951c10b74dd1e13213c9/langdetect-1.0.7.zip (998kB)
Collecting bpemb>=0.2.9
  Downloading https://files.pythonhosted.org/packages/bc/70/468a9652095b370f797ed37ff77e742b11565c6fd79eaeca5f2e50b164a7/bpemb-0.3.0-py3-none-any.whl
Collecting sqlitedict>=1.6.0
  Downloading https://files.pythonhosted.org/packages/0f/1c/c7

In [0]:
# Let's import our packages !

import html
import json
import pickle
import re

import pandas as pd
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split
from torch.optim.adam import Adam
from tqdm import tqdm

# Making Corpus

from flair.data import Corpus
from flair.datasets import ClassificationCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

In [0]:
## Where our data is located on G-Drive. Make sure to adjust the paths to your own setup

path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/'
tag_group = '/content/drive/My Drive/ICDMAI_Tutorial/stack-overflow-tag-network/stack_network_nodes.csv'
data ='filtered_data/question_tag_text_mapping.pkl'

In [0]:
# Let's see the main Data-Set

question_tag = pd.read_pickle(path+data)
question_tag.head()

| | Id | OwnerUserId | CreationDate | ClosedDate | Score | Title | Body | CreationMonth | CreationYear | Tag |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 120 | 83.0 | 2008-08-01 15:50:08 | | 21 | ASP.NET Site Maps | `<p>Has anyone got experience creating <strong>...` | 8 | 2008 | [asp.net, sql] |
| 1 | 260 | 91.0 | 2008-08-01 23:22:08 | | 49 | Adding scripting functionality to .NET applica... | `<p>I have a little game written in C#. It uses...` | 8 | 2008 | [c#, .net] |
| 2 | 330 | 63.0 | 2008-08-02 02:51:36 | | 29 | Should I use nested classes in this case? | `<p>I am working on a collection of classes use...` | 8 | 2008 | [c++] |
| 3 | 470 | 71.0 | 2008-08-02 15:11:47 | 2016-03-26T05:23:29Z | 13 | Homegrown consumption of web services | `<p>I've been writing a few web services for a ...` | 8 | 2008 | [web-services, .net] |
| 4 | 580 | 91.0 | 2008-08-02 23:30:59 | | 21 | Deploying SQL Server Databases from Test to Live | `<p>I wonder how you guys manage deployment of ...` | 8 | 2008 | [sql-server] |


In [0]:
# Let's see the Groups/Classes

tag_group = pd.read_csv(tag_group)
tag_group.head()

| | name | group | nodesize |
|---|---|---|---|
| 0 | html | 6 | 272.45 |
| 1 | css | 6 | 341.17 |
| 2 | hibernate | 8 | 29.83 |
| 3 | spring | 8 | 52.84 |
| 4 | ruby | 3 | 70.14 |


## 1. Text Pre-processing Pipeline

Each try-except block below can be factored out into its own function and invoked from preprocess_text(). Together they form a pipeline of the text-cleaning steps your dataset might require.

In [0]:
clean = re.compile('<.*?>')

def preprocess_text(text) :
  try :
    # strip HTML tags with a regex, then decode entities such as &amp;
    # soup = BeautifulSoup(text, "html.parser")
    # text = soup.get_text()
    text = re.sub(clean, '', text)
    text = html.unescape(text)
  except Exception :
    # on failure the text is left unchanged and passed to the next step
    print("Error in HTML Processing ...")
    print(text)

  try :
    # replace newlines, tabs and carriage returns with spaces
    # (often present in really noisy text)
    text = text.translate(str.maketrans("\n\t\r", "   "))
  except Exception :
    print("Error in removing extra lines ...")
    print(text)

  try :
    # collapse repeated spaces and trim both ends
    text = re.sub(' +', ' ', text)
    text = text.strip()
  except Exception :
    print("Error in extra whitespace removal ...")
    print(text)

  return text
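
The modular variant mentioned above could look roughly like the sketch below. The helper names (`strip_html`, `flatten_lines`, `squeeze_spaces`, `preprocess_text_modular`) and the `CLEANING_STEPS` list are hypothetical and not part of the original notebook; the cleaning logic itself mirrors preprocess_text().

```python
import html
import re

TAG_RE = re.compile(r"<.*?>")

def strip_html(text):
    """Drop anything that looks like an HTML tag, then decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub("", text))

def flatten_lines(text):
    """Replace newlines, tabs and carriage returns with spaces."""
    return text.translate(str.maketrans("\n\t\r", "   "))

def squeeze_spaces(text):
    """Collapse runs of spaces and trim both ends."""
    return re.sub(" +", " ", text).strip()

# Hypothetical step list; reorder or extend it for your own dataset.
CLEANING_STEPS = [strip_html, flatten_lines, squeeze_spaces]

def preprocess_text_modular(text):
    for step in CLEANING_STEPS:
        try:
            text = step(text)
        except Exception as exc:
            # Same policy as preprocess_text(): report the failure, keep the text as-is.
            print("Error in {} : {}".format(step.__name__, exc))
    return text

print(preprocess_text_modular("<p>Hello &amp;\n  world</p>"))
```

Keeping the steps in a list makes it easy to switch one off or to add a dataset-specific step without touching the others.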

## 2. Create Training Data Format

Here we iterate over the dataset dataframe and build the format that Flair accepts. It is the same format used by fastText, Facebook's text classification library.

***Format*** : `__label__tag1 __label__tag2 text`

Each document has to sit on a single line, which the preprocess_text() method takes care of.
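
As a small illustration of the format, the sketch below builds one training line from a list of tags and a text. The function name `to_fasttext_line` and the sample question are made up for this example; the tab between labels and text matches the `sep='\t'` used later when the splits are written to disk.

```python
def to_fasttext_line(tags, text):
    """Build one training line: all labels first, then the single-line text."""
    labels = " ".join("__label__" + tag for tag in tags)
    return labels + "\t" + text

# Hypothetical question, for illustration only.
print(to_fasttext_line(["python", "django"], "How do I sort a queryset?"))
```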

In [0]:
def create_training_format(question_tag) :

  print("Preparing training data format ...")
  # training_df = pd.DataFrame("columns")
  labels = list()
  texts = list()
  for index in tqdm(question_tag.index) :
    tags = question_tag.loc[index,'Tag']
    text_label = ''
    for tag in tags :
      label = '__label__'+tag
      text_label = text_label + ' ' + label
    
    text_label = text_label.strip()
    # text =  html.unescape(question_tag.loc[index,'Body'])
    text =  question_tag.loc[index,'Title'].strip() + '. ' + question_tag.loc[index,'Body'].strip()

    # if len(text.split()) < 5 :
    #   continue 

    labels.append(text_label)
    texts.append(text)


  df = pd.DataFrame(list(zip(labels[:], texts[:])), columns =['label', 'text']) 
  # df.head()
  print("Cleaning Text ....")
  df['text'] = df['text'].apply(preprocess_text)
  print("Cleaned Data Size : {}".format(df.shape))

  return df






## 3. Create Training Splits

Here we create standard random splits of the dataset :
  1. training set : 70 % of the data
  2. dev/validation set : 15 % of the data
  3. test set : 15 % of the data

#### TO DO : Experiments :
  1. Stratified Sampling
  2. Does a different ratio, e.g. 90-5-5 instead of 70-15-15, make any difference when you have 1M records ?
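
For the stratified sampling experiment, one possible starting point is sketched below in plain Python. It is an assumption-laden simplification: it stratifies on a single key per row (here the full label string), which is not true multi-label stratification, and the function name `stratified_split` is hypothetical. For single-label targets, the `stratify` argument of scikit-learn's `train_test_split` does the same job.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, test_fraction=0.30, seed=42):
    """Split rows so each key value keeps roughly the same share in both parts."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[key(row)].append(row)

    rng = random.Random(seed)
    train, test = [], []
    # Iterate in sorted key order so the result is reproducible.
    for _, bucket in sorted(buckets.items()):
        rng.shuffle(bucket)
        n_test = round(len(bucket) * test_fraction)
        test.extend(bucket[:n_test])
        train.extend(bucket[n_test:])
    return train, test

# Toy (label, text) rows with an 80/20 class imbalance.
rows = [("__label__python", "q{}".format(i)) for i in range(80)]
rows += [("__label__r", "q{}".format(i)) for i in range(20)]
train, test = stratified_split(rows, key=lambda row: row[0])
print(len(train), len(test))
```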

In [0]:
def create_splits(df,path,group_id = ''):

  print("Splitting Training Data ... ")
  train_df , test_df = train_test_split(df,random_state=42,test_size=0.30)
  dev_df ,test_df = train_test_split(test_df,random_state=42,test_size=0.5)
  print("Training Dataset : {}".format(train_df.shape[0]))
  print("Validation Dataset : {}".format(dev_df.shape[0]))
  print("Test Dataset : {}".format(test_df.shape[0]))

  print("Path  : {} ".format(path+'training_data/group/'+ str(group_id) + '/train.txt'))
  train_df.to_csv(path+'training_data/group/'+ str(group_id) + '/train.txt',sep='\t',index=False,header=False)
  dev_df.to_csv(path+'training_data/group/'+ str(group_id) + '/dev.txt',sep='\t',index=False,header=False)
  test_df.to_csv(path+'training_data/group/'+ str(group_id) + '/test.txt',sep='\t',index=False,header=False)

  return train_df,dev_df,test_df
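
To make the resulting ratio concrete: holding out 30 % and then halving the held-out part gives a 70-15-15 split. The sketch below reproduces that arithmetic, assuming the held-out part is rounded up at each stage (an assumption about scikit-learn's rounding, though it is consistent with the sizes logged further down). The function name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_rows, holdout_fraction=0.30, test_share_of_holdout=0.5):
    """Return (train, dev, test) sizes for the two-stage split used above."""
    holdout = math.ceil(n_rows * holdout_fraction)       # rows kept out of training
    test = math.ceil(holdout * test_share_of_holdout)    # second split, rounded up
    dev = holdout - test
    train = n_rows - holdout
    return train, dev, test

# 196254 is the row count logged for group 2 below.
print(split_sizes(196254))   # (137377, 29438, 29439)
```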


## 4. Divide the dataset into Groups/Classes

Here we iterate over the entire dataset to create group-level datasets in the following steps :
  1. Iterate over the groups and read the full dataset every time
  2. Get all the tags in the group from the **tag_group** lookup
  3. Iterate over training examples and see if the labels fall in the same group
  4. Remove training examples which don't belong to the group
  5. Create the training data format of the remaining dataset
  6. Split & Save the dataset
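
Steps 2 to 4 can be sketched on toy data as follows. This is only an illustration in plain Python, with rows as dictionaries instead of a dataframe; the function name `filter_by_group` is hypothetical. Using a set for the group's tags keeps each membership test cheap, which matters with ~1M rows.

```python
def filter_by_group(questions, group_tags):
    """Keep only in-group tags per question; drop questions left with none."""
    allowed = set(group_tags)
    kept = []
    for question in questions:
        tags = [tag for tag in question["Tag"] if tag in allowed]
        if tags:
            # Copy the row so the input list is not modified.
            kept.append({**question, "Tag": tags})
    return kept

# Toy rows shaped like the dataframe above (only the columns needed here).
questions = [
    {"Id": 120, "Tag": ["asp.net", "sql"]},
    {"Id": 330, "Tag": ["c++"]},
]
print(filter_by_group(questions, ["sql", "sql-server", "asp.net"]))
```

Because the input is left untouched, a variant like this would not need to re-read the full dataset for every group.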

### TO DO :
  1. Make a single corpus for the entire dataset.


In [0]:
for grp_id in range(1,15):
  
  ## 1. Iterate over the groups and read the full dataset every time
  print("=================================================================")
  print("Group ID being Processed : {}".format(grp_id))
  print("=================================================================")
  print("Reading Pickle File ...")

  question_tag = pd.read_pickle(path+data)

  ## 2. Get all the tags in the group from the **tag_group** lookup  

  group =  tag_group[tag_group.group == grp_id]
  labels = list(set(group['name']))

  ## 3. Iterate over training examples and see if the labels fall in the same group
  for index in tqdm(question_tag.index):
    tags = question_tag.loc[index,'Tag']
    group_tags = list()
    for tag in tags :
      if tag in labels :
        group_tags.append(tag)
    question_tag.at[index,'Tag'] =  group_tags
  print("Before Removal of Blank Data : {} ".format(question_tag.shape))

  ## 4. Remove training examples which don't belong to the group
  question_tag = question_tag[question_tag['Tag'].map(lambda d: len(d)) > 0]
  print("Final Data for Group ID  : {} is {}".format(grp_id,question_tag.shape))

  ## 5. Create the training data format of the remaining dataset
  training_data_format = create_training_format(question_tag)

  ## 6. Split & Save the dataset
  train_df,dev_df,test_df = create_splits(training_data_format,path,group_id=grp_id)

  print("=================================================================")



Reading Pickle File ...


  0%|          | 1873/1051992 [00:00<00:56, 18726.22it/s]

Current Group ID : 2 


100%|██████████| 1051992/1051992 [00:37<00:00, 27773.86it/s]


Before Removal of Blank Data : (1051992, 10) 


  2%|▏         | 3787/196254 [00:00<00:05, 37869.58it/s]

Final Data for Group ID  : 2 is (196254, 10)
Preparing training data format ...


100%|██████████| 196254/196254 [00:04<00:00, 40483.37it/s]


Cleaning Text ....
Cleaned Data Size : (196254, 2)
Splitting Training Data ... 
Training Dataset : 137377
Validation Dataset : 29438
Test Dataset : 29439
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/2/train.txt 
Reading Pickle File ...


  0%|          | 1911/1051992 [00:00<00:54, 19107.69it/s]

Current Group ID : 3 


100%|██████████| 1051992/1051992 [00:37<00:00, 27904.62it/s]


Before Removal of Blank Data : (1051992, 10) 


  6%|▋         | 4126/65453 [00:00<00:01, 41259.35it/s]

Final Data for Group ID  : 3 is (65453, 10)
Preparing training data format ...


100%|██████████| 65453/65453 [00:01<00:00, 40072.83it/s]


Cleaning Text ....
Cleaned Data Size : (65453, 2)
Splitting Training Data ... 
Training Dataset : 45817
Validation Dataset : 9818
Test Dataset : 9818
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/3/train.txt 
Reading Pickle File ...


  0%|          | 2233/1051992 [00:00<00:47, 22326.24it/s]

Current Group ID : 4 


100%|██████████| 1051992/1051992 [00:38<00:00, 27463.32it/s]


Before Removal of Blank Data : (1051992, 10) 


  2%|▏         | 4160/169599 [00:00<00:03, 41593.20it/s]

Final Data for Group ID  : 4 is (169599, 10)
Preparing training data format ...


100%|██████████| 169599/169599 [00:04<00:00, 39921.11it/s]


Cleaning Text ....
Cleaned Data Size : (169599, 2)
Splitting Training Data ... 
Training Dataset : 118719
Validation Dataset : 25440
Test Dataset : 25440
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/4/train.txt 
Reading Pickle File ...


  0%|          | 2035/1051992 [00:00<00:51, 20346.58it/s]

Current Group ID : 5 


100%|██████████| 1051992/1051992 [00:37<00:00, 27976.69it/s]


Before Removal of Blank Data : (1051992, 10) 


  8%|▊         | 4180/54356 [00:00<00:01, 41793.36it/s]

Final Data for Group ID  : 5 is (54356, 10)
Preparing training data format ...


100%|██████████| 54356/54356 [00:01<00:00, 40572.70it/s]


Cleaning Text ....
Cleaned Data Size : (54356, 2)
Splitting Training Data ... 
Training Dataset : 38049
Validation Dataset : 8153
Test Dataset : 8154
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/5/train.txt 
Reading Pickle File ...


  0%|          | 1911/1051992 [00:00<00:54, 19109.79it/s]

Current Group ID : 6 


100%|██████████| 1051992/1051992 [00:37<00:00, 27710.50it/s]


Before Removal of Blank Data : (1051992, 10) 


  1%|          | 3883/356875 [00:00<00:09, 38826.43it/s]

Final Data for Group ID  : 6 is (356875, 10)
Preparing training data format ...


100%|██████████| 356875/356875 [00:08<00:00, 39733.52it/s]


Cleaning Text ....
Cleaned Data Size : (356875, 2)
Splitting Training Data ... 
Training Dataset : 249812
Validation Dataset : 53531
Test Dataset : 53532
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/6/train.txt 
Reading Pickle File ...


  0%|          | 1845/1051992 [00:00<00:56, 18437.80it/s]

Current Group ID : 7 


100%|██████████| 1051992/1051992 [00:36<00:00, 28654.94it/s]


Before Removal of Blank Data : (1051992, 10) 


100%|██████████| 3460/3460 [00:00<00:00, 35486.89it/s]

Final Data for Group ID  : 7 is (3460, 10)
Preparing training data format ...
Cleaning Text ....





Cleaned Data Size : (3460, 2)
Splitting Training Data ... 
Training Dataset : 2422
Validation Dataset : 519
Test Dataset : 519
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/7/train.txt 
Reading Pickle File ...


  0%|          | 2146/1051992 [00:00<00:48, 21455.72it/s]

Current Group ID : 8 


100%|██████████| 1051992/1051992 [00:37<00:00, 28387.56it/s]


Before Removal of Blank Data : (1051992, 10) 


  3%|▎         | 4045/143543 [00:00<00:03, 40447.82it/s]

Final Data for Group ID  : 8 is (143543, 10)
Preparing training data format ...


100%|██████████| 143543/143543 [00:03<00:00, 40256.02it/s]


Cleaning Text ....
Cleaned Data Size : (143543, 2)
Splitting Training Data ... 
Training Dataset : 100480
Validation Dataset : 21531
Test Dataset : 21532
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/8/train.txt 
Reading Pickle File ...


  0%|          | 1604/1051992 [00:00<01:05, 16039.17it/s]

Current Group ID : 9 


100%|██████████| 1051992/1051992 [00:38<00:00, 27611.78it/s]


Before Removal of Blank Data : (1051992, 10) 


 46%|████▋     | 4103/8850 [00:00<00:00, 41026.71it/s]

Final Data for Group ID  : 9 is (8850, 10)
Preparing training data format ...


100%|██████████| 8850/8850 [00:00<00:00, 39520.67it/s]


Cleaning Text ....
Cleaned Data Size : (8850, 2)
Splitting Training Data ... 
Training Dataset : 6195
Validation Dataset : 1327
Test Dataset : 1328
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/9/train.txt 
Reading Pickle File ...


  0%|          | 2316/1051992 [00:00<00:45, 23158.14it/s]

Current Group ID : 10 


100%|██████████| 1051992/1051992 [00:36<00:00, 28819.57it/s]


Before Removal of Blank Data : (1051992, 10) 


 33%|███▎      | 4231/12979 [00:00<00:00, 42303.69it/s]

Final Data for Group ID  : 10 is (12979, 10)
Preparing training data format ...


100%|██████████| 12979/12979 [00:00<00:00, 40248.98it/s]


Cleaning Text ....
Cleaned Data Size : (12979, 2)
Splitting Training Data ... 
Training Dataset : 9085
Validation Dataset : 1947
Test Dataset : 1947
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/10/train.txt 
Reading Pickle File ...


  0%|          | 2249/1051992 [00:00<00:46, 22484.18it/s]

Current Group ID : 11 


100%|██████████| 1051992/1051992 [00:37<00:00, 28330.50it/s]


Before Removal of Blank Data : (1051992, 10) 


100%|██████████| 5703/5703 [00:00<00:00, 40276.34it/s]

Final Data for Group ID  : 11 is (5703, 10)
Preparing training data format ...
Cleaning Text ....





Cleaned Data Size : (5703, 2)
Splitting Training Data ... 
Training Dataset : 3992
Validation Dataset : 855
Test Dataset : 856
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/11/train.txt 
Reading Pickle File ...


  0%|          | 1889/1051992 [00:00<00:55, 18886.46it/s]

Current Group ID : 12 


100%|██████████| 1051992/1051992 [00:37<00:00, 27885.86it/s]


Before Removal of Blank Data : (1051992, 10) 


100%|██████████| 570/570 [00:00<00:00, 36411.11it/s]

Final Data for Group ID  : 12 is (570, 10)
Preparing training data format ...
Cleaning Text ....
Cleaned Data Size : (570, 2)
Splitting Training Data ... 
Training Dataset : 399
Validation Dataset : 85
Test Dataset : 86
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/12/train.txt 
Reading Pickle File ...



  0%|          | 2043/1051992 [00:00<00:51, 20429.39it/s]

Current Group ID : 13 


100%|██████████| 1051992/1051992 [00:37<00:00, 27844.14it/s]


Before Removal of Blank Data : (1051992, 10) 


 22%|██▏       | 4421/19933 [00:00<00:00, 44203.40it/s]

Final Data for Group ID  : 13 is (19933, 10)
Preparing training data format ...


100%|██████████| 19933/19933 [00:00<00:00, 39517.02it/s]


Cleaning Text ....
Cleaned Data Size : (19933, 2)
Splitting Training Data ... 
Training Dataset : 13953
Validation Dataset : 2990
Test Dataset : 2990
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/13/train.txt 
Reading Pickle File ...


  0%|          | 1807/1051992 [00:00<00:58, 18064.59it/s]

Current Group ID : 14 


100%|██████████| 1051992/1051992 [00:38<00:00, 27594.32it/s]


Before Removal of Blank Data : (1051992, 10) 


 27%|██▋       | 3885/14291 [00:00<00:00, 38846.05it/s]

Final Data for Group ID  : 14 is (14291, 10)
Preparing training data format ...


100%|██████████| 14291/14291 [00:00<00:00, 36915.84it/s]


Cleaning Text ....
Cleaned Data Size : (14291, 2)
Splitting Training Data ... 
Training Dataset : 10003
Validation Dataset : 2144
Test Dataset : 2144
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/14/train.txt 
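
The label distribution check from the outline (step 4b) is not shown in the cells above. A minimal sketch of one way to do it is given below: it counts the `__label__` tokens in lines of the training format. The function name `label_distribution` and the sample lines are made up for illustration.

```python
from collections import Counter

def label_distribution(lines):
    """Count how often each __label__ token occurs across training lines."""
    prefix = "__label__"
    counts = Counter()
    for line in lines:
        # Labels sit before the first tab; the text comes after it.
        label_part = line.split("\t", 1)[0]
        for token in label_part.split():
            if token.startswith(prefix):
                counts[token[len(prefix):]] += 1
    return counts

# Hypothetical sample lines in the training format.
sample = [
    "__label__regex __label__perl\tHow do I match a date?",
    "__label__regex\tGreedy vs lazy quantifiers.",
]
print(label_distribution(sample).most_common())
```

Since an open file is an iterable of lines, the same function could be applied to a group's train.txt by passing the file object directly.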


## 5. Create Corpus & Label Dictionary : Flair Corpus

For each group's training splits created above, we create a Flair corpus and a label dictionary, so that a separate model can be trained per group.

In [0]:
path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/'

for grp in range(1,15):
  
  print("=================================================================")
  print("Group ID being Processed : {}".format(grp))
  print("=================================================================")
  
  # this is the folder in which train, test and dev files reside
  data_folder =path+str(grp)+'/'
  print(data_folder)

  print("Creating Corpus ...")
  # load corpus containing training, test and dev data
  corpus: Corpus = ClassificationCorpus(data_folder,
                                        test_file='test.txt',
                                        dev_file='dev.txt',
                                        train_file='train.txt')

  # 2. create the label dictionary
  label_dict = corpus.make_label_dictionary()

  print("Obtaining Corpus Statistics...")
  stats = corpus.obtain_statistics()
  json_stats = json.loads(stats)

  print(json_stats)
  
  with open(data_folder+'corpus_statistics.json', 'w') as f:
    json.dump(json_stats, f)

  print("Creating Dumps ... ")
  with open(data_folder+'classification_corpus.pkl',mode='wb') as f :
    pickle.dump(corpus,f)

  with open(data_folder + 'classification_corpus_label_dict.pkl',mode='wb') as f:
    pickle.dump(label_dict,f)

/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/1/
Creating Corpus ...
2019-12-20 07:36:28,399 Reading data from /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/1
2019-12-20 07:36:28,402 Train: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/1/train.txt
2019-12-20 07:36:28,404 Dev: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/1/dev.txt
2019-12-20 07:36:28,406 Test: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/1/test.txt
2019-12-20 07:36:37,861 Computing label dictionary. Progress:


100%|██████████| 114121/114121 [05:25<00:00, 350.70it/s]

2019-12-20 07:42:03,422 [b'python', b'django', b'r', b'c', b'c++', b'matlab', b'qt', b'embedded', b'machine-learning', b'flask']
Creating Dumps ... 
/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/2/
Creating Corpus ...
2019-12-20 07:42:03,438 Reading data from /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/2
2019-12-20 07:42:03,439 Train: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/2/train.txt
2019-12-20 07:42:03,440 Dev: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/2/dev.txt
2019-12-20 07:42:03,441 Test: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/2/test.txt





2019-12-20 07:42:12,136 Computing label dictionary. Progress:


100%|██████████| 137377/137377 [06:14<00:00, 366.66it/s]

2019-12-20 07:48:26,966 [b'sql-server', b'sql', b'wpf', b'.net', b'c#', b'visual-studio', b'wcf', b'unity3d', b'asp.net', b'vb.net', b'asp.net-web-api', b'oracle', b'entity-framework', b'azure', b'linq', b'xamarin', b'plsql']
Creating Dumps ... 
/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/3/
Creating Corpus ...
2019-12-20 07:48:27,001 Reading data from /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/3
2019-12-20 07:48:27,001 Train: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/3/train.txt
2019-12-20 07:48:27,003 Dev: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/3/dev.txt
2019-12-20 07:48:27,004 Test: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/3/test.txt





2019-12-20 07:48:33,096 Computing label dictionary. Progress:


100%|██████████| 45817/45817 [02:04<00:00, 366.94it/s]

2019-12-20 07:50:38,081 [b'node.js', b'ruby-on-rails', b'mongodb', b'ruby', b'express', b'elasticsearch', b'reactjs', b'redux', b'postgresql', b'redis', b'react-native']





Creating Dumps ... 
/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/4/
Creating Corpus ...
2019-12-20 07:50:38,092 Reading data from /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/4
2019-12-20 07:50:38,093 Train: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/4/train.txt
2019-12-20 07:50:38,094 Dev: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/4/dev.txt
2019-12-20 07:50:38,096 Test: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/4/test.txt
2019-12-20 07:50:45,712 Computing label dictionary. Progress:


100%|██████████| 118719/118719 [05:51<00:00, 337.60it/s]

2019-12-20 07:56:37,485 [b'iphone', b'ios', b'android', b'objective-c', b'xcode', b'osx', b'swift', b'android-studio']
Creating Dumps ... 
/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/5/
Creating Corpus ...
2019-12-20 07:56:37,503 Reading data from /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/5





2019-12-20 07:56:37,504 Train: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/5/train.txt
2019-12-20 07:56:37,506 Dev: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/5/dev.txt
2019-12-20 07:56:37,508 Test: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/5/test.txt
2019-12-20 07:56:42,386 Computing label dictionary. Progress:


100%|██████████| 38049/38049 [01:39<00:00, 380.54it/s]

2019-12-20 07:58:22,489 [b'unix', b'git', b'windows', b'apache', b'shell', b'bash', b'linux', b'github', b'ubuntu', b'powershell', b'nginx']





Creating Dumps ... 
/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/6/
Creating Corpus ...
2019-12-20 07:58:22,508 Reading data from /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/6
2019-12-20 07:58:22,509 Train: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/6/train.txt
2019-12-20 07:58:22,511 Dev: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/6/dev.txt
2019-12-20 07:58:22,512 Test: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/6/test.txt
2019-12-20 07:58:34,837 Computing label dictionary. Progress:


100%|██████████| 249812/249812 [11:16<00:00, 369.52it/s]

2019-12-20 08:09:50,995 [b'php', b'mysql', b'json', b'html', b'javascript', b'jquery', b'angularjs', b'twitter-bootstrap-3', b'css', b'twitter-bootstrap', b'ajax', b'xml', b'laravel', b'wordpress', b'less', b'codeigniter', b'drupal', b'html5', b'vue.js', b'sass', b'ionic-framework', b'photoshop']
Creating Dumps ... 
/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/7/
Creating Corpus ...
2019-12-20 08:09:51,024 Reading data from /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/7
2019-12-20 08:09:51,025 Train: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/7/train.txt
2019-12-20 08:09:51,027 Dev: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/7/dev.txt
2019-12-20 08:09:51,028 Test: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/7/test.txt





2019-12-20 08:09:53,125 Computing label dictionary. Progress:


100%|██████████| 2422/2422 [00:06<00:00, 368.61it/s]

2019-12-20 08:09:59,820 [b'typescript', b'angular2']





Creating Dumps ... 
/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/8/
Creating Corpus ...
2019-12-20 08:09:59,840 Reading data from /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/8
2019-12-20 08:09:59,841 Train: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/8/train.txt
2019-12-20 08:09:59,844 Dev: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/8/dev.txt
2019-12-20 08:09:59,848 Test: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/8/test.txt
2019-12-20 08:10:06,881 Computing label dictionary. Progress:


100%|██████████| 100480/100480 [05:05<00:00, 329.12it/s]

2019-12-20 08:15:12,303 [b'java', b'spring', b'rest', b'web-services', b'hibernate', b'eclipse', b'api', b'jsp', b'maven', b'java-ee', b'spring-boot', b'spring-mvc']
Creating Dumps ... 
/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/9/
Creating Corpus ...
2019-12-20 08:15:12,323 Reading data from /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/9
2019-12-20 08:15:12,325 Train: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/9/train.txt
2019-12-20 08:15:12,327 Dev: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/9/dev.txt
2019-12-20 08:15:12,333 Test: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/9/test.txt





2019-12-20 08:15:15,482 Computing label dictionary. Progress:


100%|██████████| 6195/6195 [00:17<00:00, 355.34it/s]

2019-12-20 08:15:33,039 [b'jenkins', b'amazon-web-services', b'docker', b'cloud', b'go', b'devops']
Creating Dumps ... 
/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/10/
Creating Corpus ...
2019-12-20 08:15:33,052 Reading data from /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/10
2019-12-20 08:15:33,053 Train: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/10/train.txt
2019-12-20 08:15:33,055 Dev: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/10/dev.txt
2019-12-20 08:15:33,056 Test: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/10/test.txt





2019-12-20 08:15:35,809 Computing label dictionary. Progress:


100%|██████████| 9085/9085 [00:26<00:00, 341.64it/s]

2019-12-20 08:16:02,513 [b'hadoop', b'scala', b'haskell', b'apache-spark']
Creating Dumps ... 
/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/11/
Creating Corpus ...
2019-12-20 08:16:02,528 Reading data from /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/11
2019-12-20 08:16:02,529 Train: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/11/train.txt
2019-12-20 08:16:02,531 Dev: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/11/dev.txt
2019-12-20 08:16:02,533 Test: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/11/test.txt





2019-12-20 08:16:05,235 Computing label dictionary. Progress:


100%|██████████| 3992/3992 [00:09<00:00, 411.15it/s]

2019-12-20 08:16:15,068 [b'selenium', b'testing']
Creating Dumps ... 
/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/12/
Creating Corpus ...
2019-12-20 08:16:15,078 Reading data from /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/12
2019-12-20 08:16:15,079 Train: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/12/train.txt
2019-12-20 08:16:15,081 Dev: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/12/dev.txt
2019-12-20 08:16:15,082 Test: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/12/test.txt





2019-12-20 08:16:17,769 Computing label dictionary. Progress:


100%|██████████| 399/399 [00:01<00:00, 318.56it/s]

2019-12-20 08:16:19,153 [b'agile', b'tdd']
Creating Dumps ... 
/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/13/
Creating Corpus ...
2019-12-20 08:16:19,166 Reading data from /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/13
2019-12-20 08:16:19,167 Train: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/13/train.txt
2019-12-20 08:16:19,170 Dev: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/13/dev.txt
2019-12-20 08:16:19,171 Test: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/13/test.txt





2019-12-20 08:16:25,533 Computing label dictionary. Progress:


100%|██████████| 13953/13953 [00:29<00:00, 471.11it/s]

2019-12-20 08:16:55,248 [b'regex', b'perl']
Creating Dumps ... 
/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/14/
Creating Corpus ...
2019-12-20 08:16:55,261 Reading data from /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/14
2019-12-20 08:16:55,262 Train: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/14/train.txt
2019-12-20 08:16:55,264 Dev: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/14/dev.txt
2019-12-20 08:16:55,265 Test: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/group/14/test.txt





2019-12-20 08:17:01,750 Computing label dictionary. Progress:


100%|██████████| 10003/10003 [00:29<00:00, 335.86it/s]

2019-12-20 08:17:31,656 [b'excel', b'vba', b'excel-vba']
Creating Dumps ... 
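
To reuse the dumps in a later experiment, they can be read back with `pickle.load`. The helper below is a small sketch (the name `load_pickle` is hypothetical). It assumes Flair is installed in the loading environment, ideally in the same version that wrote the files, since pickle needs the original classes to rebuild the objects.

```python
import pickle

def load_pickle(file_path):
    """Read back an object that was written with pickle.dump."""
    with open(file_path, mode="rb") as f:
        return pickle.load(f)

# Hypothetical usage, reusing the file names written by the loop above:
# corpus = load_pickle(data_folder + "classification_corpus.pkl")
# label_dict = load_pickle(data_folder + "classification_corpus_label_dict.pkl")
```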



