# Creating Standard Training Data

In this notebook we will do the following:
  1. Build and run a basic text-cleaning pipeline on the documents
  2. Convert the data into the training data format, i.e. `__label__tag1 __label__tag2 text`
  3. Split the dataset into train, dev and test sets  
  4. Divide the dataset into groups/classes  
    a. Divide the entire dataset (~1M questions) into 14 groups/classes.  
    b. Check the label distribution in each of the 14 groups/classes.  
  5. Follow standard practices to create a corpus and label dictionary using **Flair**

**HOT TIP** : *Save the corpus and label dictionary as pickle files so they can be reloaded quickly for experiments*
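
To make the pickle tip concrete, here is a minimal sketch of a save/load round trip. The helper names `save_pickle` and `load_pickle` and the toy `label_counts` object are assumptions made for illustration; they are not part of this notebook or of Flair.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(obj, file_path):
    """Serialize obj to file_path with pickle (hypothetical helper)."""
    with open(file_path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(file_path):
    """Load a pickled object back. Only unpickle files you trust."""
    with open(file_path, "rb") as f:
        return pickle.load(f)


# Toy stand-in for an object that is expensive to rebuild.
label_counts = {"python": 3, "django": 1}

with tempfile.TemporaryDirectory() as tmp_dir:
    dump_path = Path(tmp_dir) / "label_counts.pkl"
    save_pickle(label_counts, dump_path)
    restored = load_pickle(dump_path)

print(restored)
```

A pickle is usually only readable with compatible library versions, so it is best treated as a cache and not as a long-term storage format.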



In [1]:
# First, let's check what Google has given us! Thank you Google for the GPU

!nvidia-smi

Thu Jan  9 14:02:24 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [10]:
# Let's mount our Google Drive. Hey!! In exchange for the GPU, you now give your data to Google

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Install necessary packages and restart the environment

! pip install tiny-tokenizer
! pip install  flair

Collecting tiny-tokenizer
  Downloading https://files.pythonhosted.org/packages/8d/0f/aa52c227c5af69914be05723b3deaf221805a4ccbce87643194ef2cdde43/tiny_tokenizer-3.1.0.tar.gz
Building wheels for collected packages: tiny-tokenizer
  Building wheel for tiny-tokenizer (setup.py) ... done
  Created wheel for tiny-tokenizer: filename=tiny_tokenizer-3.1.0-cp36-none-any.whl size=10550 sha256=33a3ed6bd1d88a53ee9020f40c6384bc83d41ec63e79bbda122706d4e738250d
  Stored in directory: /root/.cache/pip/wheels/d1/c8/36/334497a689fab90128232e86b5829b800dd271a3d5d5959c53
Successfully built tiny-tokenizer
Installing collected packages: tiny-tokenizer
Successfully installed tiny-tokenizer-3.1.0
Collecting flair
  Downloading https://files.pythonhosted.org/packages/16/22/8fc8e5978ec05b710216735ca47415700e83f304dec7e4281d61cefb6831/flair-0.4.4-py3-none-any.whl (193kB)
     |████████████████████████████████| 194kB 2.8MB/s 
Collecting sqlitedict>=1.6.0
  Downloading https://files.pythonho

In [0]:
# Let's import our packages !

import html
import json
import pickle
import re

import pandas as pd
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split
from torch.optim.adam import Adam
from tqdm import tqdm

# Making Corpus
from flair.data import Corpus
from flair.datasets import ClassificationCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

In [0]:
# Where our data lives on Google Drive. Make sure to adjust these paths to your own setup

path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/'
tag_group = '/content/drive/My Drive/ICDMAI_Tutorial/stack-overflow-tag-network/stack_network_nodes.csv'
data ='filtered_data/question_tag_text_mapping.pkl'

In [3]:
# Let's see the main Data-Set

question_tag = pd.read_pickle(path+data)
question_tag.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body,Tag
0,120,83.0,2008-08-01T15:50:08Z,,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...,"[sql, asp.net]"
1,260,91.0,2008-08-01T23:22:08Z,,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...,"[c#, .net]"
2,330,63.0,2008-08-02T02:51:36Z,,29,Should I use nested classes in this case?,<p>I am working on a collection of classes use...,[c++]
3,470,71.0,2008-08-02T15:11:47Z,2016-03-26T05:23:29Z,13,Homegrown consumption of web services,<p>I've been writing a few web services for a ...,"[web-services, .net]"
4,580,91.0,2008-08-02T23:30:59Z,,21,Deploying SQL Server Databases from Test to Live,<p>I wonder how you guys manage deployment of ...,[sql-server]


In [4]:
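# Let's see the Groups/Classes
# Note: this rebinds tag_group from the CSV path to the loaded DataFrame,
# so re-run the path cell above before running this cell a second time.
tag_group = pd.read_csv(tag_group)
tag_group.head()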

Unnamed: 0,name,group,nodesize,group_name
0,c,1,189.83,Programming
1,c++,1,268.11,Programming
2,django,1,40.91,Programming
3,python,1,438.67,Programming
4,flask,1,9.39,Programming


## 1. Text Pre-processing Pipeline

Each try-except block below could be factored out into its own small function and invoked from preprocess_text(). Together they form a pipeline of the text-cleaning steps your dataset may need.

In [0]:
# Matches anything that looks like an HTML tag, e.g. <p> or </strong>
clean = re.compile('<.*?>')

def preprocess_text(text) :
  """Strip HTML tags, unescape entities and collapse whitespace so the document fits on one line."""
  try :
    # soup = BeautifulSoup(text, "html.parser")
    # text = soup.get_text()
    text = re.sub(clean, '', text)
    text = html.unescape(text)
  except Exception :
    # On failure, report the offending text and carry on with it unchanged
    print("Error in HTML Processing ...")
    print(text)

  try :
    # replace newlines, tabs and carriage returns with spaces
    text = text.translate(str.maketrans("\n\t\r", "   "))
  except Exception :
    print("Error in removing extra lines ...")
    print(text)

  try :
    # collapse runs of spaces and trim both ends
    text = re.sub(' +', ' ', text)
    text = text.strip()
  except Exception :
    print("Error in extra whitespace removal ...")
    print(text)

  return text
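
As a sketch of the modular layout suggested above, each cleaning step can live in its own function and be run from a list. The names `strip_html`, `flatten_lines`, `squeeze_spaces`, `CLEANING_STEPS` and `run_pipeline` are hypothetical and chosen only for this illustration; the steps mirror what preprocess_text() does.

```python
import html
import re

TAG_RE = re.compile(r"<.*?>")
SPACE_RE = re.compile(r" +")


def strip_html(text):
    """Drop HTML tags, then turn entities such as &amp; into characters."""
    return html.unescape(TAG_RE.sub("", text))


def flatten_lines(text):
    """Replace newlines, tabs and carriage returns with spaces."""
    return text.translate(str.maketrans("\n\t\r", "   "))


def squeeze_spaces(text):
    """Collapse runs of spaces and trim both ends."""
    return SPACE_RE.sub(" ", text).strip()


# The order matters: whitespace is squeezed last.
CLEANING_STEPS = [strip_html, flatten_lines, squeeze_spaces]


def run_pipeline(text, steps=CLEANING_STEPS):
    """Apply each step in turn; a failing step is reported and skipped."""
    for step in steps:
        try:
            text = step(text)
        except Exception as err:
            print("Error in {} : {}".format(step.__name__, err))
    return text


sample = "<p>Has anyone used <strong>site maps</strong>?\n\tIt &amp; it works.</p>"
cleaned = run_pipeline(sample)
print(cleaned)
```

With this layout, adding or removing a cleaning step for a new dataset means editing one list, and each step can be tested on its own.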

## 2. Create Training Data Format

Here we iterate over the dataset DataFrame and build the format that Flair accepts. It is the same labelled-line format used by fastText, Facebook's text classification library.

***Format***  : `__label__tag1 __label__tag2 text`

The text of each document has to sit on a single line, which is handled by the preprocess_text() function.
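
To illustrate the format on a single example, the sketch below builds one training line and parses it back. The helpers `to_fasttext_line` and `parse_fasttext_line` are hypothetical names used only here. The tab between the labels and the text is an assumption that mirrors the `sep='\t'` used when the split files are written later in this notebook.

```python
LABEL_PREFIX = "__label__"


def to_fasttext_line(tags, text):
    """Build one training line: prefixed labels, a tab, then the text."""
    labels = " ".join(LABEL_PREFIX + tag for tag in tags)
    return labels + "\t" + text


def parse_fasttext_line(line):
    """Split a training line back into (tags, text)."""
    tokens = line.split()
    tags = []
    i = 0
    # Leading tokens that carry the prefix are labels; the rest is the text.
    while i < len(tokens) and tokens[i].startswith(LABEL_PREFIX):
        tags.append(tokens[i][len(LABEL_PREFIX):])
        i += 1
    return tags, " ".join(tokens[i:])


line = to_fasttext_line(["sql", "asp.net"], "ASP.NET Site Maps. Has anyone got experience?")
tags, text = parse_fasttext_line(line)
print(line)
print(tags, text)
```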

In [0]:
def create_training_format(question_tag) :
  """Return a DataFrame with a 'label' column (__label__ prefixed tags) and a cleaned 'text' column."""

  print("Preparing training data format ...")
  labels = list()
  texts = list()
  for index in tqdm(question_tag.index) :
    tags = question_tag.loc[index,'Tag']
    # e.g. ['sql', 'asp.net'] -> '__label__sql __label__asp.net'
    text_label = ' '.join('__label__' + tag for tag in tags)

    # The document is the question title followed by its body
    text = question_tag.loc[index,'Title'].strip() + '. ' + question_tag.loc[index,'Body'].strip()

    # if len(text.split()) < 5 :
    #   continue 

    labels.append(text_label)
    texts.append(text)

  df = pd.DataFrame(list(zip(labels, texts)), columns=['label', 'text'])
  print("Cleaning Text ....")
  df['text'] = df['text'].apply(preprocess_text)
  print("Cleaned Data Size : {}".format(df.shape))

  return df


## 3. Create Training Splits

Here we create standard random splits of the dataset:
  1. training set : 70% of the data
  2. dev/validation set : 15% of the data
  3. test set : 15% of the data

#### TO DO : Experiments :
  1. Stratified sampling of documents based on tags
  2. Does a 70-15-15 split versus a 90-5-5 split make any difference when you have 1M records?
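
As a starting point for the second experiment, here is a minimal sketch of a three-way split with configurable ratios. It uses only the standard library and a plain list in place of the DataFrame, so it is not the `train_test_split` call used in this notebook; `split_three_way` and its parameters are hypothetical names.

```python
import random


def split_three_way(rows, train_frac=0.70, dev_frac=0.15, seed=42):
    """Shuffle rows and cut them into train/dev/test; test gets the remainder."""
    if not 0 < train_frac < 1 or not 0 <= dev_frac < 1 or train_frac + dev_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for the test set")
    shuffled = list(rows)                  # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


rows = list(range(1000))
for train_frac, dev_frac in [(0.70, 0.15), (0.90, 0.05)]:
    train, dev, test = split_three_way(rows, train_frac, dev_frac)
    print(train_frac, dev_frac, len(train), len(dev), len(test))
```

Keep in mind that the splits in this notebook are made per group, not on the full dataset. For the smallest group (570 questions), a 5% test set would hold fewer than 30 documents, so the answer may differ from group to group.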

In [0]:
import os

def create_splits(df,path,group_id = ''):

  print("Splitting Training Data ... ")
  # 70% train; the remaining 30% is halved into dev and test (15% each)
  train_df , test_df = train_test_split(df,random_state=42,test_size=0.30)
  dev_df ,test_df = train_test_split(test_df,random_state=42,test_size=0.5)
  print("Training Dataset : {}".format(train_df.shape[0]))
  print("Validation Dataset : {}".format(dev_df.shape[0]))
  print("Test Dataset : {}".format(test_df.shape[0]))

  # Make sure the output folder exists, otherwise to_csv() fails
  out_dir = path+'training_data/standard/group/'+ str(group_id) + '/'
  os.makedirs(out_dir, exist_ok=True)

  print("Path  : {} ".format(out_dir + 'train.txt'))
  # One example per line: the label column, a tab, then the text; no header or index
  train_df.to_csv(out_dir + 'train.txt',sep='\t',index=False,header=False)
  dev_df.to_csv(out_dir + 'dev.txt',sep='\t',index=False,header=False)
  test_df.to_csv(out_dir + 'test.txt',sep='\t',index=False,header=False)

  return train_df,dev_df,test_df


## 4. Divide the dataset into Groups/Classes

Here we iterate over the entire dataset to create group-level datasets in the following steps:
  1. Iterate over the groups and read the full dataset every time (step 3 overwrites the Tag column in place)
  2. Get all the tags in the group from the **tag_group** lookup
  3. Iterate over the training examples and keep only the labels that fall in the group
  4. Remove training examples that have no labels left, i.e. that don't belong to the group
  5. Create the training data format for the remaining dataset
  6. Split & save the dataset

  ### TO DO :
    1. Make a single corpus for the entire dataset.
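
Steps 2 to 4 above can be sketched on a toy example without pandas. The function `filter_to_group` is a hypothetical helper; the records and the tag list reuse values shown in the `head()` outputs earlier in this notebook. Because the sketch builds new records and leaves the input alone, it would not need the full dataset to be re-read for every group.

```python
def filter_to_group(records, group_tags):
    """Keep only tags that belong to the group; drop records left with none."""
    group_tags = set(group_tags)  # set lookup is constant-time per tag
    kept = []
    for record in records:
        tags_in_group = [tag for tag in record["Tag"] if tag in group_tags]
        if tags_in_group:
            # Copy the record so the original list of records is not mutated.
            kept.append({**record, "Tag": tags_in_group})
    return kept


records = [
    {"Id": 120, "Tag": ["sql", "asp.net"]},
    {"Id": 260, "Tag": ["c#", ".net"]},
    {"Id": 330, "Tag": ["c++"]},
]
programming_tags = ["c", "c++", "django", "python", "flask"]

group_records = filter_to_group(records, programming_tags)
print(group_records)
```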


In [11]:
for grp_id in range(1,15):
  
  ## 1. Iterate over the groups and read the full dataset every time
  print("=================================================================")
  print("Group ID being Processed : {}".format(grp_id))
  print("=================================================================")
  print("Reading Pickle File ...")

  question_tag = pd.read_pickle(path+data)

  ## 2. Get all the tags in the group from the **tag_group** lookup  

  group =  tag_group[tag_group.group == grp_id]
  # A set makes the `tag in labels` checks below constant-time
  labels = set(group['name'])

  ## 3. Iterate over training examples and see if the labels fall in the same group
  for index in tqdm(question_tag.index):
    tags = question_tag.loc[index,'Tag']
    group_tags = list()
    for tag in tags :
      if tag in labels :
        group_tags.append(tag)
    question_tag.at[index,'Tag'] =  group_tags
  print("Before Removal of Blank Data : {} ".format(question_tag.shape))

  ## 4. Remove training examples which don't belong to the group
  question_tag = question_tag[question_tag['Tag'].map(lambda d: len(d)) > 0]
  print("Final Data for Group ID  : {} is {}".format(grp_id,question_tag.shape))

  ## 5. Create the training data format of the remaining dataset
  training_data_format = create_training_format(question_tag)

  ## 6. Split & Save the dataset
  train_df,dev_df,test_df = create_splits(training_data_format,path,group_id=grp_id)

  print("=================================================================")



Group ID being Processed : 1
Reading Pickle File ...


100%|██████████| 1051992/1051992 [00:44<00:00, 23823.81it/s]


Before Removal of Blank Data : (1051992, 8) 


  2%|▏         | 3191/163030 [00:00<00:05, 31901.89it/s]

Final Data for Group ID  : 1 is (163030, 8)
Preparing training data format ...


100%|██████████| 163030/163030 [00:04<00:00, 33146.80it/s]


Cleaning Text ....
Cleaned Data Size : (163030, 2)
Splitting Training Data ... 
Training Dataset : 114121
Validation Dataset : 24454
Test Dataset : 24455
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/1/train.txt 
Group ID being Processed : 2
Reading Pickle File ...


100%|██████████| 1051992/1051992 [00:43<00:00, 24228.84it/s]


Before Removal of Blank Data : (1051992, 8) 


  2%|▏         | 3065/196254 [00:00<00:06, 30648.57it/s]

Final Data for Group ID  : 2 is (196254, 8)
Preparing training data format ...


100%|██████████| 196254/196254 [00:06<00:00, 32375.24it/s]


Cleaning Text ....
Cleaned Data Size : (196254, 2)
Splitting Training Data ... 
Training Dataset : 137377
Validation Dataset : 29438
Test Dataset : 29439
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/2/train.txt 
Group ID being Processed : 3
Reading Pickle File ...


100%|██████████| 1051992/1051992 [00:43<00:00, 24008.42it/s]


Before Removal of Blank Data : (1051992, 8) 


  5%|▍         | 3219/65453 [00:00<00:01, 32185.66it/s]

Final Data for Group ID  : 3 is (65453, 8)
Preparing training data format ...


100%|██████████| 65453/65453 [00:02<00:00, 32067.82it/s]


Cleaning Text ....
Cleaned Data Size : (65453, 2)
Splitting Training Data ... 
Training Dataset : 45817
Validation Dataset : 9818
Test Dataset : 9818
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/3/train.txt 
Group ID being Processed : 4
Reading Pickle File ...


100%|██████████| 1051992/1051992 [00:43<00:00, 24420.07it/s]


Before Removal of Blank Data : (1051992, 8) 


  2%|▏         | 3046/169599 [00:00<00:05, 30454.00it/s]

Final Data for Group ID  : 4 is (169599, 8)
Preparing training data format ...


100%|██████████| 169599/169599 [00:05<00:00, 32466.04it/s]


Cleaning Text ....
Cleaned Data Size : (169599, 2)
Splitting Training Data ... 
Training Dataset : 118719
Validation Dataset : 25440
Test Dataset : 25440
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/4/train.txt 
Group ID being Processed : 5
Reading Pickle File ...


100%|██████████| 1051992/1051992 [00:44<00:00, 23892.67it/s]


Before Removal of Blank Data : (1051992, 8) 


  6%|▌         | 3006/54356 [00:00<00:01, 30053.01it/s]

Final Data for Group ID  : 5 is (54356, 8)
Preparing training data format ...


100%|██████████| 54356/54356 [00:01<00:00, 32328.72it/s]


Cleaning Text ....
Cleaned Data Size : (54356, 2)
Splitting Training Data ... 
Training Dataset : 38049
Validation Dataset : 8153
Test Dataset : 8154
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/5/train.txt 
Group ID being Processed : 6
Reading Pickle File ...


100%|██████████| 1051992/1051992 [00:44<00:00, 23714.83it/s]


Before Removal of Blank Data : (1051992, 8) 


  1%|          | 2734/356875 [00:00<00:12, 27336.38it/s]

Final Data for Group ID  : 6 is (356875, 8)
Preparing training data format ...


100%|██████████| 356875/356875 [00:11<00:00, 32225.08it/s]


Cleaning Text ....
Cleaned Data Size : (356875, 2)
Splitting Training Data ... 
Training Dataset : 249812
Validation Dataset : 53531
Test Dataset : 53532
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/6/train.txt 
Group ID being Processed : 7
Reading Pickle File ...


100%|██████████| 1051992/1051992 [00:43<00:00, 24275.77it/s]


Before Removal of Blank Data : (1051992, 8) 


100%|██████████| 3460/3460 [00:00<00:00, 33465.37it/s]

Final Data for Group ID  : 7 is (3460, 8)
Preparing training data format ...
Cleaning Text ....





Cleaned Data Size : (3460, 2)
Splitting Training Data ... 
Training Dataset : 2422
Validation Dataset : 519
Test Dataset : 519
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/7/train.txt 
Group ID being Processed : 8
Reading Pickle File ...


100%|██████████| 1051992/1051992 [00:42<00:00, 24540.76it/s]


Before Removal of Blank Data : (1051992, 8) 


  2%|▏         | 3146/143543 [00:00<00:04, 31454.93it/s]

Final Data for Group ID  : 8 is (143543, 8)
Preparing training data format ...


100%|██████████| 143543/143543 [00:04<00:00, 32565.70it/s]


Cleaning Text ....
Cleaned Data Size : (143543, 2)
Splitting Training Data ... 
Training Dataset : 100480
Validation Dataset : 21531
Test Dataset : 21532
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/8/train.txt 
Group ID being Processed : 9
Reading Pickle File ...


100%|██████████| 1051992/1051992 [00:44<00:00, 23447.54it/s]


Before Removal of Blank Data : (1051992, 8) 


 36%|███▌      | 3184/8850 [00:00<00:00, 31835.25it/s]

Final Data for Group ID  : 9 is (8850, 8)
Preparing training data format ...


100%|██████████| 8850/8850 [00:00<00:00, 31610.16it/s]


Cleaning Text ....
Cleaned Data Size : (8850, 2)
Splitting Training Data ... 
Training Dataset : 6195
Validation Dataset : 1327
Test Dataset : 1328
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/9/train.txt 
Group ID being Processed : 10
Reading Pickle File ...


100%|██████████| 1051992/1051992 [00:45<00:00, 23251.60it/s]


Before Removal of Blank Data : (1051992, 8) 


 25%|██▌       | 3270/12979 [00:00<00:00, 32695.35it/s]

Final Data for Group ID  : 10 is (12979, 8)
Preparing training data format ...


100%|██████████| 12979/12979 [00:00<00:00, 31265.18it/s]


Cleaning Text ....
Cleaned Data Size : (12979, 2)
Splitting Training Data ... 
Training Dataset : 9085
Validation Dataset : 1947
Test Dataset : 1947
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/10/train.txt 
Group ID being Processed : 11
Reading Pickle File ...


100%|██████████| 1051992/1051992 [00:43<00:00, 24290.52it/s]


Before Removal of Blank Data : (1051992, 8) 


100%|██████████| 5703/5703 [00:00<00:00, 31418.48it/s]

Final Data for Group ID  : 11 is (5703, 8)
Preparing training data format ...





Cleaning Text ....
Cleaned Data Size : (5703, 2)
Splitting Training Data ... 
Training Dataset : 3992
Validation Dataset : 855
Test Dataset : 856
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/11/train.txt 
Group ID being Processed : 12
Reading Pickle File ...


100%|██████████| 1051992/1051992 [00:45<00:00, 23314.04it/s]


Before Removal of Blank Data : (1051992, 8) 


100%|██████████| 570/570 [00:00<00:00, 26618.35it/s]

Final Data for Group ID  : 12 is (570, 8)
Preparing training data format ...
Cleaning Text ....
Cleaned Data Size : (570, 2)
Splitting Training Data ... 
Training Dataset : 399
Validation Dataset : 85
Test Dataset : 86
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/12/train.txt 





Group ID being Processed : 13
Reading Pickle File ...


100%|██████████| 1051992/1051992 [00:45<00:00, 23203.04it/s]


Before Removal of Blank Data : (1051992, 8) 


 17%|█▋        | 3371/19933 [00:00<00:00, 33703.44it/s]

Final Data for Group ID  : 13 is (19933, 8)
Preparing training data format ...


100%|██████████| 19933/19933 [00:00<00:00, 31861.69it/s]


Cleaning Text ....
Cleaned Data Size : (19933, 2)
Splitting Training Data ... 
Training Dataset : 13953
Validation Dataset : 2990
Test Dataset : 2990
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/13/train.txt 
Group ID being Processed : 14
Reading Pickle File ...


100%|██████████| 1051992/1051992 [00:44<00:00, 23495.44it/s]


Before Removal of Blank Data : (1051992, 8) 


 23%|██▎       | 3324/14291 [00:00<00:00, 33233.46it/s]

Final Data for Group ID  : 14 is (14291, 8)
Preparing training data format ...


100%|██████████| 14291/14291 [00:00<00:00, 31517.53it/s]


Cleaning Text ....
Cleaned Data Size : (14291, 2)
Splitting Training Data ... 
Training Dataset : 10003
Validation Dataset : 2144
Test Dataset : 2144
Path  : /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/14/train.txt 


## 5. Create Corpus & Label Dictionary : Flair Corpus

For each group's training splits created above, we create a corpus and a label dictionary, so that a separate model can be trained per group.
  1. First we load the train, dev and test files and create the corpus using ClassificationCorpus
  2. The label dictionary is created using make_label_dictionary
  3. The corpus and the dictionary are saved as pickle dumps for easy retrieval
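
Before building the corpus, it can help to sanity-check the label distribution straight from the split files. The sketch below counts labels in lines of the `__label__` format using only the standard library. It is not how Flair computes its statistics, and `count_labels` and the sample lines are made up for illustration.

```python
from collections import Counter

LABEL_PREFIX = "__label__"


def count_labels(lines):
    """Count how many documents carry each label."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            if not token.startswith(LABEL_PREFIX):
                break  # labels come first; the rest of the line is text
            counts[token[len(LABEL_PREFIX):]] += 1
    return counts


sample_lines = [
    "__label__python __label__django\tHow do I write a view?",
    "__label__python\tWhy is my loop slow?",
    "__label__c++\tWhat does this template error mean?",
]

label_counts = count_labels(sample_lines)
print(label_counts.most_common())
```

On a real split, passing an open file handle for `train.txt` in place of `sample_lines` should give numbers comparable to `number_of_documents_per_class` in the corpus statistics printed below.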

In [14]:
path = '/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/'

for grp in range(1,15):
  
  print("=================================================================")
  print("Group ID being Processed : {}".format(grp))
  print("=================================================================")
  
  # this is the folder in which train, test and dev files reside
  data_folder =path+str(grp)+'/'
  print(data_folder)

  print("Creating Corpus ...")
  # load corpus containing training, test and dev data
  corpus: Corpus = ClassificationCorpus(data_folder,
                                        test_file='test.txt',
                                        dev_file='dev.txt',
                                        train_file='train.txt')

  # 2. create the label dictionary
  label_dict = corpus.make_label_dictionary()

  print("Obtaining Corpus Statisitics...")
  stats  = corpus.obtain_statistics()
  json_stats = json.loads(stats)

  print(json_stats)
  
  with open(data_folder+'corpus_statistics.json', 'w') as f:
    json.dump(json_stats, f)

  print("Creating Dumps ... ")
  with open(data_folder+'classification_corpus.pkl',mode='wb') as f :
    pickle.dump(corpus,f)

  with open(data_folder + 'classification_corpus_label_dict.pkl',mode='wb') as f:
    pickle.dump(label_dict,f)

Group ID being Processed : 1
/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/1/
Creating Corpus ...
2020-01-09 14:56:43,336 Reading data from /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/1
2020-01-09 14:56:43,343 Train: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/1/train.txt
2020-01-09 14:56:43,346 Dev: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/1/dev.txt
2020-01-09 14:56:43,349 Test: /content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/1/test.txt
2020-01-09 14:56:45,511 Computing label dictionary. Progress:


100%|██████████| 114121/114121 [05:56<00:00, 319.89it/s]

2020-01-09 15:02:42,757 [b'python', b'django', b'r', b'c', b'c++', b'matlab', b'qt', b'embedded', b'machine-learning', b'flask']





Obtaining Corpus Statisitics...
{'TRAIN': {'dataset': 'TRAIN', 'total_number_of_documents': 114121, 'number_of_documents_per_class': {'python': 45294, 'django': 8957, 'r': 11005, 'c': 16255, 'c++': 33272, 'matlab': 4553, 'qt': 3733, 'embedded': 421, 'machine-learning': 870, 'flask': 993}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 19629187, 'min': 9, 'max': 5701, 'avg': 172.00328598592722}}, 'TEST': {'dataset': 'TEST', 'total_number_of_documents': 24455, 'number_of_documents_per_class': {'r': 2425, 'c++': 7070, 'c': 3482, 'python': 9620, 'django': 1929, 'matlab': 990, 'machine-learning': 176, 'qt': 843, 'flask': 201, 'embedded': 96}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 4211200, 'min': 9, 'max': 3518, 'avg': 172.20200368022898}}, 'DEV': {'dataset': 'DEV', 'total_number_of_documents': 24454, 'number_of_documents_per_class': {'r': 2271, 'c': 3501, 'python': 9687, 'c++': 7249, 'django': 1932, 'matlab': 952, 'qt': 808, 'flask': 213, 'machine-lear

100%|██████████| 137377/137377 [06:42<00:00, 341.55it/s]

2020-01-09 15:26:32,991 [b'sql', b'sql-server', b'wpf', b'.net', b'c#', b'visual-studio', b'wcf', b'unity3d', b'asp.net', b'vb.net', b'asp.net-web-api', b'oracle', b'entity-framework', b'azure', b'linq', b'xamarin', b'plsql']





Obtaining Corpus Statisitics...
{'TRAIN': {'dataset': 'TRAIN', 'total_number_of_documents': 137377, 'number_of_documents_per_class': {'sql': 25022, 'sql-server': 12726, 'wpf': 8738, '.net': 16884, 'c#': 70995, 'visual-studio': 4296, 'wcf': 3120, 'unity3d': 1378, 'asp.net': 20960, 'vb.net': 7162, 'asp.net-web-api': 1370, 'oracle': 5277, 'entity-framework': 4266, 'azure': 2443, 'linq': 4125, 'xamarin': 1198, 'plsql': 958}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 22224504, 'min': 8, 'max': 4001, 'avg': 161.77747366735335}}, 'TEST': {'dataset': 'TEST', 'total_number_of_documents': 29439, 'number_of_documents_per_class': {'azure': 511, 'asp.net': 4485, '.net': 3590, 'oracle': 1192, 'c#': 15030, 'wpf': 1845, 'sql': 5354, 'linq': 942, 'sql-server': 2791, 'entity-framework': 912, 'wcf': 708, 'vb.net': 1575, 'asp.net-web-api': 321, 'visual-studio': 940, 'unity3d': 308, 'xamarin': 259, 'plsql': 219}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 4753046, 'mi

100%|██████████| 45817/45817 [02:15<00:00, 339.09it/s]

2020-01-09 15:48:38,748 [b'node.js', b'ruby-on-rails', b'mongodb', b'ruby', b'express', b'elasticsearch', b'reactjs', b'redux', b'postgresql', b'redis', b'react-native']





Obtaining Corpus Statisitics...
{'TRAIN': {'dataset': 'TRAIN', 'total_number_of_documents': 45817, 'number_of_documents_per_class': {'node.js': 10053, 'ruby-on-rails': 18170, 'mongodb': 5001, 'ruby': 12027, 'express': 1844, 'elasticsearch': 1481, 'reactjs': 1782, 'redux': 276, 'postgresql': 4112, 'redis': 705, 'react-native': 554}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 7366580, 'min': 8, 'max': 4819, 'avg': 160.78267891830544}}, 'TEST': {'dataset': 'TEST', 'total_number_of_documents': 9818, 'number_of_documents_per_class': {'ruby': 2512, 'elasticsearch': 329, 'ruby-on-rails': 3821, 'node.js': 2227, 'reactjs': 379, 'postgresql': 919, 'mongodb': 1030, 'react-native': 120, 'express': 415, 'redis': 139, 'redux': 52}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 1586708, 'min': 15, 'max': 2428, 'avg': 161.61214096557345}}, 'DEV': {'dataset': 'DEV', 'total_number_of_documents': 9818, 'number_of_documents_per_class': {'ruby-on-rails': 3798, 'postgresql

100%|██████████| 118719/118719 [06:12<00:00, 319.02it/s]

2020-01-09 16:01:36,564 [b'iphone', b'ios', b'android', b'objective-c', b'xcode', b'osx', b'swift', b'android-studio']





Obtaining Corpus Statisitics...
{'TRAIN': {'dataset': 'TRAIN', 'total_number_of_documents': 118719, 'number_of_documents_per_class': {'iphone': 15101, 'ios': 32761, 'android': 63447, 'objective-c': 18776, 'xcode': 7440, 'osx': 5081, 'swift': 8287, 'android-studio': 2190}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 20278283, 'min': 8, 'max': 3665, 'avg': 170.80907858051364}}, 'TEST': {'dataset': 'TEST', 'total_number_of_documents': 25440, 'number_of_documents_per_class': {'swift': 1749, 'ios': 7080, 'iphone': 3204, 'objective-c': 4056, 'android': 13609, 'osx': 1070, 'android-studio': 476, 'xcode': 1564}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 4341168, 'min': 11, 'max': 2572, 'avg': 170.6433962264151}}, 'DEV': {'dataset': 'DEV', 'total_number_of_documents': 25440, 'number_of_documents_per_class': {'xcode': 1629, 'ios': 7168, 'osx': 1092, 'android': 13603, 'iphone': 3234, 'objective-c': 4090, 'swift': 1796, 'android-studio': 488}, 'number_of_token

100%|██████████| 38049/38049 [01:45<00:00, 360.84it/s]

2020-01-09 16:21:31,025 [b'unix', b'git', b'windows', b'apache', b'shell', b'bash', b'linux', b'github', b'ubuntu', b'powershell', b'nginx']





Obtaining Corpus Statisitics...
{'TRAIN': {'dataset': 'TRAIN', 'total_number_of_documents': 38049, 'number_of_documents_per_class': {'unix': 2243, 'git': 5310, 'windows': 7108, 'apache': 4693, 'shell': 3524, 'bash': 5196, 'linux': 9332, 'github': 1526, 'ubuntu': 2147, 'powershell': 2724, 'nginx': 1493}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 5695655, 'min': 10, 'max': 3502, 'avg': 149.6926331835265}}, 'TEST': {'dataset': 'TEST', 'total_number_of_documents': 8154, 'number_of_documents_per_class': {'linux': 2052, 'ubuntu': 493, 'bash': 1152, 'apache': 965, 'git': 1163, 'powershell': 527, 'windows': 1533, 'shell': 723, 'nginx': 324, 'unix': 437, 'github': 318}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 1196924, 'min': 10, 'max': 3215, 'avg': 146.78979641893548}}, 'DEV': {'dataset': 'DEV', 'total_number_of_documents': 8153, 'number_of_documents_per_class': {'linux': 2011, 'unix': 475, 'git': 1153, 'ubuntu': 489, 'windows': 1501, 'powershell': 603,

100%|██████████| 249812/249812 [12:11<00:00, 341.39it/s]

2020-01-09 16:39:13,622 [b'php', b'mysql', b'json', b'html', b'jquery', b'javascript', b'angularjs', b'css', b'twitter-bootstrap-3', b'twitter-bootstrap', b'ajax', b'xml', b'laravel', b'wordpress', b'less', b'codeigniter', b'drupal', b'html5', b'vue.js', b'sass', b'ionic-framework', b'photoshop']





Obtaining Corpus Statisitics...
{'TRAIN': {'dataset': 'TRAIN', 'total_number_of_documents': 249812, 'number_of_documents_per_class': {'php': 69172, 'mysql': 29751, 'json': 12281, 'html': 41206, 'jquery': 55075, 'javascript': 87098, 'angularjs': 14200, 'css': 29721, 'twitter-bootstrap-3': 1153, 'twitter-bootstrap': 5052, 'ajax': 10957, 'xml': 10277, 'laravel': 3409, 'wordpress': 6969, 'less': 390, 'codeigniter': 3412, 'drupal': 1187, 'html5': 6635, 'vue.js': 151, 'sass': 659, 'ionic-framework': 1071, 'photoshop': 124}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 40194666, 'min': 6, 'max': 4819, 'avg': 160.89966054472964}}, 'TEST': {'dataset': 'TEST', 'total_number_of_documents': 53532, 'number_of_documents_per_class': {'javascript': 18551, 'css': 6328, 'jquery': 11839, 'twitter-bootstrap': 1072, 'html': 8899, 'php': 14782, 'mysql': 6273, 'angularjs': 3086, 'laravel': 776, 'xml': 2267, 'html5': 1396, 'json': 2665, 'ajax': 2381, 'codeigniter': 726, 'wordpress': 1486, 'ph

100%|██████████| 2422/2422 [00:07<00:00, 327.82it/s]

2020-01-09 17:15:48,970 [b'typescript', b'angular2']





Obtaining Corpus Statisitics...
{'TRAIN': {'dataset': 'TRAIN', 'total_number_of_documents': 2422, 'number_of_documents_per_class': {'typescript': 1210, 'angular2': 1598}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 397679, 'min': 18, 'max': 2423, 'avg': 164.19446738232867}}, 'TEST': {'dataset': 'TEST', 'total_number_of_documents': 519, 'number_of_documents_per_class': {'typescript': 283, 'angular2': 321}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 88052, 'min': 14, 'max': 1526, 'avg': 169.65703275529864}}, 'DEV': {'dataset': 'DEV', 'total_number_of_documents': 519, 'number_of_documents_per_class': {'typescript': 270, 'angular2': 340}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 88473, 'min': 15, 'max': 1104, 'avg': 170.46820809248555}}}
Creating Dumps ... 
Group ID being Processed : 8
/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/8/
Creating Corpus ...
2020-01-09 17:16:11,179 Reading data from /conte

100%|██████████| 100480/100480 [05:39<00:00, 295.86it/s]

2020-01-09 17:21:53,127 [b'java', b'spring', b'rest', b'web-services', b'hibernate', b'eclipse', b'api', b'jsp', b'maven', b'java-ee', b'spring-boot', b'spring-mvc']





Obtaining Corpus Statisitics...
{'TRAIN': {'dataset': 'TRAIN', 'total_number_of_documents': 100480, 'number_of_documents_per_class': {'java': 80609, 'spring': 6941, 'rest': 3432, 'web-services': 3754, 'hibernate': 4081, 'eclipse': 7040, 'api': 3395, 'jsp': 3009, 'maven': 3313, 'java-ee': 1878, 'spring-boot': 924, 'spring-mvc': 2656}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 18549783, 'min': 7, 'max': 10740, 'avg': 184.61169386942674}}, 'TEST': {'dataset': 'TEST', 'total_number_of_documents': 21532, 'number_of_documents_per_class': {'eclipse': 1428, 'java': 17306, 'api': 720, 'rest': 755, 'web-services': 818, 'maven': 678, 'hibernate': 918, 'spring': 1478, 'jsp': 660, 'spring-mvc': 571, 'spring-boot': 208, 'java-ee': 376}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 3982212, 'min': 8, 'max': 4137, 'avg': 184.94389745495076}}, 'DEV': {'dataset': 'DEV', 'total_number_of_documents': 21531, 'number_of_documents_per_class': {'java': 17297, 'api': 788, '

100%|██████████| 6195/6195 [00:19<00:00, 310.77it/s]

2020-01-09 17:38:20,268 [b'jenkins', b'amazon-web-services', b'docker', b'cloud', b'go', b'devops']





Obtaining Corpus Statisitics...
{'TRAIN': {'dataset': 'TRAIN', 'total_number_of_documents': 6195, 'number_of_documents_per_class': {'jenkins': 1289, 'amazon-web-services': 2231, 'docker': 1123, 'cloud': 291, 'go': 1323, 'devops': 47}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 955937, 'min': 18, 'max': 2267, 'avg': 154.30782889426956}}, 'TEST': {'dataset': 'TEST', 'total_number_of_documents': 1328, 'number_of_documents_per_class': {'jenkins': 270, 'amazon-web-services': 478, 'go': 265, 'docker': 252, 'devops': 12, 'cloud': 76}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 205314, 'min': 12, 'max': 1219, 'avg': 154.6039156626506}}, 'DEV': {'dataset': 'DEV', 'total_number_of_documents': 1327, 'number_of_documents_per_class': {'amazon-web-services': 481, 'go': 270, 'docker': 247, 'jenkins': 283, 'cloud': 58, 'devops': 3}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 212073, 'min': 21, 'max': 2549, 'avg': 159.81386586284853}}}
Creating D

100%|██████████| 9085/9085 [00:26<00:00, 339.39it/s]

2020-01-09 17:39:41,511 [b'hadoop', b'scala', b'haskell', b'apache-spark']

Obtaining Corpus Statistics...
{'TRAIN': {'dataset': 'TRAIN', 'total_number_of_documents': 9085, 'number_of_documents_per_class': {'hadoop': 2145, 'scala': 4011, 'haskell': 2120, 'apache-spark': 1389}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 1499702, 'min': 11, 'max': 3107, 'avg': 165.07451843698405}}, 'TEST': {'dataset': 'TEST', 'total_number_of_documents': 1947, 'number_of_documents_per_class': {'scala': 856, 'haskell': 433, 'hadoop': 485, 'apache-spark': 285}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 330290, 'min': 17, 'max': 3163, 'avg': 169.64047252182846}}, 'DEV': {'dataset': 'DEV', 'total_number_of_documents': 1947, 'number_of_documents_per_class': {'scala': 809, 'apache-spark': 302, 'haskell': 490, 'hadoop': 469}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 325000, 'min': 20, 'max': 4048, 'avg': 166.92347200821777}}}
Creating Dumps ... 
Group ID being Processed : 11
/content/drive/My Drive/ICDMAI_Tutorial/notebook/tr

100%|██████████| 3992/3992 [00:14<00:00, 279.08it/s]

2020-01-09 17:41:18,554 [b'selenium', b'testing']

Obtaining Corpus Statistics...
{'TRAIN': {'dataset': 'TRAIN', 'total_number_of_documents': 3992, 'number_of_documents_per_class': {'selenium': 2388, 'testing': 1708}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 587267, 'min': 15, 'max': 1736, 'avg': 147.11097194388776}}, 'TEST': {'dataset': 'TEST', 'total_number_of_documents': 856, 'number_of_documents_per_class': {'testing': 392, 'selenium': 486}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 127064, 'min': 19, 'max': 1918, 'avg': 148.4392523364486}}, 'DEV': {'dataset': 'DEV', 'total_number_of_documents': 855, 'number_of_documents_per_class': {'selenium': 499, 'testing': 371}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 124647, 'min': 11, 'max': 930, 'avg': 145.7859649122807}}}
Creating Dumps ... 
Group ID being Processed : 12
/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/12/
Creating Corpus ...
2020-01-09 17:41:53,979 Reading data from /content/drive

100%|██████████| 399/399 [00:01<00:00, 323.40it/s]

2020-01-09 17:41:55,742 [b'agile', b'tdd']

Obtaining Corpus Statistics...
{'TRAIN': {'dataset': 'TRAIN', 'total_number_of_documents': 399, 'number_of_documents_per_class': {'agile': 90, 'tdd': 311}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 63426, 'min': 14, 'max': 901, 'avg': 158.9624060150376}}, 'TEST': {'dataset': 'TEST', 'total_number_of_documents': 86, 'number_of_documents_per_class': {'tdd': 68, 'agile': 19}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 12223, 'min': 28, 'max': 422, 'avg': 142.12790697674419}}, 'DEV': {'dataset': 'DEV', 'total_number_of_documents': 85, 'number_of_documents_per_class': {'tdd': 68, 'agile': 18}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 12426, 'min': 29, 'max': 412, 'avg': 146.18823529411765}}}
Creating Dumps ... 
Group ID being Processed : 13
/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/13/
Creating Corpus ...
2020-01-09 17:42:00,107 Reading data from /content/drive/My Drive/ICDMAI_Tutorial/notebook/

100%|██████████| 13953/13953 [00:36<00:00, 380.28it/s]

2020-01-09 17:42:37,646 [b'regex', b'perl']

Obtaining Corpus Statistics...
{'TRAIN': {'dataset': 'TRAIN', 'total_number_of_documents': 13953, 'number_of_documents_per_class': {'regex': 10732, 'perl': 3657}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 1784337, 'min': 10, 'max': 2886, 'avg': 127.8819608686304}}, 'TEST': {'dataset': 'TEST', 'total_number_of_documents': 2990, 'number_of_documents_per_class': {'regex': 2277, 'perl': 800}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 381849, 'min': 19, 'max': 2374, 'avg': 127.70869565217392}}, 'DEV': {'dataset': 'DEV', 'total_number_of_documents': 2990, 'number_of_documents_per_class': {'regex': 2340, 'perl': 749}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 373761, 'min': 7, 'max': 2168, 'avg': 125.00367892976588}}}
Creating Dumps ... 
Group ID being Processed : 14
/content/drive/My Drive/ICDMAI_Tutorial/notebook/training_data/standard/group/14/
Creating Corpus ...
2020-01-09 17:44:27,708 Reading data from /content/drive/My Drive/

100%|██████████| 10003/10003 [00:33<00:00, 302.18it/s]

2020-01-09 17:45:01,509 [b'excel', b'vba', b'excel-vba']

Obtaining Corpus Statistics...
{'TRAIN': {'dataset': 'TRAIN', 'total_number_of_documents': 10003, 'number_of_documents_per_class': {'excel': 7261, 'vba': 4773, 'excel-vba': 3682}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 1741069, 'min': 10, 'max': 2934, 'avg': 174.05468359492153}}, 'TEST': {'dataset': 'TEST', 'total_number_of_documents': 2144, 'number_of_documents_per_class': {'excel': 1542, 'vba': 1087, 'excel-vba': 778}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 369777, 'min': 19, 'max': 2135, 'avg': 172.47061567164178}}, 'DEV': {'dataset': 'DEV', 'total_number_of_documents': 2144, 'number_of_documents_per_class': {'vba': 1032, 'excel-vba': 743, 'excel': 1581}, 'number_of_tokens_per_tag': {}, 'number_of_tokens': {'total': 368862, 'min': 14, 'max': 4458, 'avg': 172.04384328358208}}}
Creating Dumps ... 
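
Two things stand out in the statistics printed above. First, the per-class document counts in a split add up to more than `total_number_of_documents` (for example 2388 + 1708 = 4096 against 3992 training documents in the selenium/testing group), presumably because a document can carry more than one tag. Second, some classes are very thin: `devops` has only 3 documents in its DEV split. Below is a minimal sketch for checking both from a statistics dictionary of the shape shown above. The helper names `label_cardinality` and `rare_classes` are hypothetical and are not part of Flair; the numbers are copied from the output.

```python
from typing import Dict, List


def label_cardinality(split_stats: Dict) -> float:
    """Average number of labels per document in one split.

    A value above 1.0 means at least some documents carry several labels.
    """
    per_class = split_stats["number_of_documents_per_class"]
    return sum(per_class.values()) / split_stats["total_number_of_documents"]


def rare_classes(split_stats: Dict, min_docs: int) -> List[str]:
    """Labels with fewer than `min_docs` documents in the split, sorted by name."""
    per_class = split_stats["number_of_documents_per_class"]
    return sorted(label for label, count in per_class.items() if count < min_docs)


# TRAIN split of the selenium/testing group (values copied from the output above).
selenium_train = {
    "total_number_of_documents": 3992,
    "number_of_documents_per_class": {"selenium": 2388, "testing": 1708},
}

# DEV split of the group containing `devops` (values copied from the output above).
devops_dev = {
    "total_number_of_documents": 1327,
    "number_of_documents_per_class": {
        "amazon-web-services": 481,
        "go": 270,
        "docker": 247,
        "jenkins": 283,
        "cloud": 58,
        "devops": 3,
    },
}

print(f"labels per document: {label_cardinality(selenium_train):.3f}")
print("classes with fewer than 10 DEV documents:", rare_classes(devops_dev, 10))
```

The threshold passed to `rare_classes` is an arbitrary choice for illustration. With so few DEV documents for a class, its per-class scores will be noisy, so it may be worth checking these numbers before training on a group.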
