Environment - Jupyter Notebook, Python 3

## Introduction
### In this notebook we preprocess news articles for a later modelling task.
### Executing this notebook can take a long time, so there are checkpoints along the way where a pickle file can be loaded to resume execution.


In [1]:
# Importing libraries
import pickle
import re
from itertools import chain

import nltk
import numpy as np
import pandas as pd
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import wordnet
from nltk.probability import FreqDist
from nltk.tokenize import MWETokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, KFold, StratifiedShuffleSplit, cross_val_score
from sklearn.naive_bayes import MultinomialNB

from TextPreprocess import Preprocess


### Loading the data

In [2]:
# Read each file inside a context manager so the handle is closed afterwards
with open('../Data/original/training_docs.txt','r',encoding='UTF-8') as file1:
    raw_train_data = file1.readlines()
with open('../Data/original/training_labels_final.txt','r',encoding='UTF-8') as file2:
    raw_train_label = file2.readlines()
with open('../Data/original/testing_docs.txt','r',encoding='UTF-8') as file3:
    raw_test_data = file3.readlines()

In [3]:
preprocess = Preprocess()

### Take a look at the training file

In [4]:
raw_train_data[0:10]

['ID tr_doc_1\n',
 'TEXT Two German tourists have been found safe and well after spending almost six hours lost in rugged rainforest at Finch Hatton Gorge, west of Mackay, last night. It is the same area a young Mackay man fell or jumped to his death last week. Sergeant Jon Purcell says rescuers located the missing pair just before midnight AEST.\n',
 'EOD\n',
 '\n',
 'ID tr_doc_2\n',
 'TEXT ACT police have seized a rare drug during a raid of a Florey home. Police found a number of syringes filled with the drug Ox-Blood, which is a form of amphetamine. They also found a number of bags believed to contain crystal methamphetamine. A 29-year-old woman has been charged with a number of offences and has faced court this morning. Acting Sergeant Matt Varley says it is only the third time the drug has been found in the territory. "It\'s actually a bi-product of the amphetamine manufacturing process whereby normal powders and crystals are produced," he said. "It\'s a liquid methamphetamine and

### Take a look at test data

In [5]:
raw_test_data[0:10]

['ID te_doc_1\n',
 "TEXT The Police Royal Commission in Western Australia is hearing evidence from the first serving officer to testify about corrupt activities in the service. The officer, code-named L-8, has been in the service for more than 30 years. He reached the rank of Inspector. He has testified that his improper behaviour began in 1979 when he and another detective took $200 from an armed robbery suspect. He has told the commission that he subsequently assaulted suspects, and made up a statement that was put before a court in an armed robbery case. L-8 is the eighth officer to 'roll over' since the Royal Commission started mid last year.\n",
 'EOD\n',
 '\n',
 'ID te_doc_2\n',
 "TEXT The Northern Territory Government says it is the Queensland Government's responsibility to explain why it denied a transfer to Alice Springs prisoner Tommy Neale. Neale is mounting a Supreme Court challenge over the decision. Tommy Neale was convicted of murder in Mount Isa in 1981. He later reques

### Take a look at the labels file

In [6]:
raw_train_label[0:10]

['tr_doc_1 C1\n',
 'tr_doc_2 C1\n',
 'tr_doc_3 C1\n',
 'tr_doc_4 C1\n',
 'tr_doc_5 C1\n',
 'tr_doc_6 C1\n',
 'tr_doc_7 C1\n',
 'tr_doc_8 C1\n',
 'tr_doc_9 C1\n',
 'tr_doc_10 C1\n']

### Cleaning up labels

In [7]:
# labels for the training data, in document order
labels = [each.strip('\n').split()[1] for each in raw_train_label]

In [8]:
labels[1:9]

['C1', 'C1', 'C1', 'C1', 'C1', 'C1', 'C1', 'C1']

### Cleaning up text data in training and test sets

In [9]:
# Remove the '\n' items
raw_train_data = [i for i in raw_train_data if i != '\n']
raw_test_data = [i for i in raw_test_data if i != '\n']

In [10]:
raw_train_data[0:4]

['ID tr_doc_1\n',
 'TEXT Two German tourists have been found safe and well after spending almost six hours lost in rugged rainforest at Finch Hatton Gorge, west of Mackay, last night. It is the same area a young Mackay man fell or jumped to his death last week. Sergeant Jon Purcell says rescuers located the missing pair just before midnight AEST.\n',
 'EOD\n',
 'ID tr_doc_2\n']

### Converting train and test data into paragraph form

In [11]:
paragraphs_train = preprocess.get_in_paragraph(raw_train_data)
paragraphs_test = preprocess.get_in_paragraph(raw_test_data)

In [12]:
# Checking the result
paragraphs_train[0]

'Two German tourists have been found safe and well after spending almost six hours lost in rugged rainforest at Finch Hatton Gorge, west of Mackay, last night. It is the same area a young Mackay man fell or jumped to his death last week. Sergeant Jon Purcell says rescuers located the missing pair just before midnight AEST.'
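
`get_in_paragraph` comes from the custom `TextPreprocess` module, which is not shown in this notebook. A minimal sketch of what such a step might do, assuming it simply keeps the body of every `TEXT` line; the name `records_to_paragraphs` is hypothetical:

```python
def records_to_paragraphs(lines):
    """Collect the body of every 'TEXT ...' line, dropping ID and EOD markers."""
    paragraphs = []
    for line in lines:
        if line.startswith('TEXT'):
            # Drop the leading 'TEXT' marker and the surrounding whitespace.
            paragraphs.append(line[len('TEXT'):].strip())
    return paragraphs


sample = ['ID tr_doc_1\n', 'TEXT Two German tourists have been found.\n', 'EOD\n']
print(records_to_paragraphs(sample))
```

A document whose `TEXT` line has no body would come out as an empty string under this sketch, which is one way very short entries could arise.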

### Checking max and min length of data

In [13]:
max_len = max([len(i) for i in paragraphs_train])
print('training data max length',max_len)
min_len =min([len(i) for i in paragraphs_train])
print('training data min length',min_len)
# number of documents longer than 1000 characters (len of a string counts characters)
large_len =len([i for i in paragraphs_train if len(i)>1000])
print('number of news article with length greater than 1000 = ',large_len)
# number of documents shorter than 10 characters
very_small_len = len([i for i in paragraphs_train if len(i)<10])
print('number of news article with length smaller than 10 = ',very_small_len)
small_len = len([i for i in paragraphs_train if len(i)<20])
print('number of news article with length smaller than 20 = ',small_len)

training data max length 93206
training data min length 1
number of news article with length greater than 1000 =  49913
number of news article with length smaller than 10 =  1287
number of news article with length smaller than 20 =  1438
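
Since `len()` on a string counts characters, the figures above are character lengths. If word counts are wanted instead, a rough sketch follows; splitting on whitespace is an assumption made only for this illustration, and `length_summary` is a hypothetical helper:

```python
def length_summary(paragraphs, unit='chars'):
    """Return (max, min) length in characters or in whitespace-separated words."""
    if unit == 'words':
        lengths = [len(p.split()) for p in paragraphs]
    else:
        lengths = [len(p) for p in paragraphs]
    return max(lengths), min(lengths)


docs = ['Two German tourists have been found safe.', 'ok']
print(length_summary(docs))           # character-based
print(length_summary(docs, 'words'))  # word-based
```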


### Converting data to sentences for POS tagging

In [14]:
# loading punkt sentence segmenter
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
# split every paragraph into a list of sentences
sentences_train = [sent_detector.tokenize(p.strip()) for p in paragraphs_train]
sentences_test = [sent_detector.tokenize(p.strip()) for p in paragraphs_test]

sentences_train[0]

['Two German tourists have been found safe and well after spending almost six hours lost in rugged rainforest at Finch Hatton Gorge, west of Mackay, last night.',
 'It is the same area a young Mackay man fell or jumped to his death last week.',
 'Sergeant Jon Purcell says rescuers located the missing pair just before midnight AEST.']

### Adding POS

In [24]:
preprocess.add_POS(sentences_train)
preprocess.add_POS(sentences_test)

In [25]:
sentences_train[0]

[[('Two', 'CD'),
  ('German', 'JJ'),
  ('tourists', 'NNS'),
  ('have', 'VBP'),
  ('been', 'VBN'),
  ('found', 'VBN'),
  ('safe', 'JJ'),
  ('and', 'CC'),
  ('well', 'RB'),
  ('after', 'IN'),
  ('spending', 'VBG'),
  ('almost', 'RB'),
  ('six', 'CD'),
  ('hours', 'NNS'),
  ('lost', 'VBN'),
  ('in', 'IN'),
  ('rugged', 'JJ'),
  ('rainforest', 'NN'),
  ('at', 'IN'),
  ('Finch', 'NNP'),
  ('Hatton', 'NNP'),
  ('Gorge', 'NNP'),
  (',', ','),
  ('west', 'NN'),
  ('of', 'IN'),
  ('Mackay', 'NNP'),
  (',', ','),
  ('last', 'JJ'),
  ('night', 'NN'),
  ('.', '.')],
 [('It', 'PRP'),
  ('is', 'VBZ'),
  ('the', 'DT'),
  ('same', 'JJ'),
  ('area', 'NN'),
  ('a', 'DT'),
  ('young', 'JJ'),
  ('Mackay', 'NNP'),
  ('man', 'NN'),
  ('fell', 'VBD'),
  ('or', 'CC'),
  ('jumped', 'VBD'),
  ('to', 'TO'),
  ('his', 'PRP$'),
  ('death', 'NN'),
  ('last', 'JJ'),
  ('week', 'NN'),
  ('.', '.')],
 [('Sergeant', 'JJ'),
  ('Jon', 'NNP'),
  ('Purcell', 'NNP'),
  ('says', 'VBZ'),
  ('rescuers', 'NNS'),
  ('located', '

### Lemmatize training and test data

In [26]:
lemmazied_docs_train = preprocess.lemmatization(sentences_train)
lemmazied_docs_test = preprocess.lemmatization(sentences_test)
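
The internals of `preprocess.lemmatization` are not shown. A common pattern, sketched here purely as an assumption about what it does, is to map each Penn Treebank tag to a WordNet part-of-speech letter before lemmatizing. The letters are the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.ADV` and `wordnet.NOUN`; `penn_to_wordnet` is a hypothetical helper name:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return 'n'      # noun, also used as the fallback


tagged = [('tourists', 'NNS'), ('found', 'VBN'), ('rugged', 'JJ'), ('well', 'RB')]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

The resulting letter can be passed as the `pos` argument of `nltk.stem.WordNetLemmatizer().lemmatize(word, pos)`.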


### Checkpoint 1

### Since POS tagging and lemmatization take a long time, we save the result in a pickle file so that execution can continue from here later

In [30]:
pickle.dump( lemmazied_docs_train, open( "lemmazied_docs_train_lemmatized_unfiltered_byPOS.p", "wb" ) )
pickle.dump( lemmazied_docs_test, open( "lemmazied_docs_test_lemmatized_unfiltered_byPOS.p", "wb" ) )

### Importing lemmatized training and test documents

In [15]:
lemmazied_docs_train = pickle.load( open( "lemmazied_docs_train_lemmatized_unfiltered_byPOS.p", "rb" ) )
lemmazied_docs_test = pickle.load( open( "lemmazied_docs_test_lemmatized_unfiltered_byPOS.p", "rb" ) )
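
The `pickle.dump(obj, open(...))` and `pickle.load(open(...))` calls above leave it to the interpreter to close the file. A slightly safer pattern, sketched with hypothetical helper names `save_checkpoint` and `load_checkpoint`, wraps each call in a context manager:

```python
import os
import pickle
import tempfile


def save_checkpoint(obj, path):
    """Pickle obj to path, closing the file even if dumping fails."""
    with open(path, 'wb') as fh:
        pickle.dump(obj, fh)


def load_checkpoint(path):
    """Load a pickled object from path."""
    with open(path, 'rb') as fh:
        return pickle.load(fh)


# Round trip through a throwaway file so the example does not touch real checkpoints.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_checkpoint.p')
save_checkpoint([['german', 'tourist'], ['police', 'seize']], demo_path)
restored = load_checkpoint(demo_path)
print(restored)
```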

### Converting one-word articles to the UKNWN token to reduce features

In [16]:
new_test_docs = preprocess.change_oneworddoc_to(lemmazied_docs_test,"UKNWN")
new_train_docs = preprocess.change_oneworddoc_to(lemmazied_docs_train,"UKNWN")

length of token, 1
index - 4553
length of token, 1
index - 4583
length of token, 1
index - 10221
length of token, 1
index - 10237
length of token, 1
index - 10471
length of token, 1
index - 11138
length of token, 1
index - 11591
length of token, 1
index - 13465
length of token, 1
index - 18280
length of token, 1
index - 18851
total number  15
no of one word docs 15
percentage of one word docs 0.0005636978579481398
some of the indexes of 1 word docs 
 [4553, 4583, 10221, 10237, 10471, 11138, 11591, 13465, 18280, 18851]
length of token, 1
index - 4235
length of token, 1
index - 4238
length of token, 1
index - 4265
length of token, 1
index - 8749
length of token, 1
index - 8756
length of token, 1
index - 8818
length of token, 1
index - 18393
length of token, 1
index - 18427
length of token, 1
index - 19020
length of token, 1
index - 25067
total number  56
no of one word docs 56
percentage of one word docs 0.0005260932876133214
some of the indexes of 1 word docs 
 [4235, 4238, 4265, 8749, 
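
`change_oneworddoc_to` is part of the custom module. Assuming each document is a list of tokens, the replacement step could look roughly like this; `replace_one_word_docs` is a hypothetical name and the sketch omits the progress printing seen above:

```python
def replace_one_word_docs(docs, placeholder='UKNWN'):
    """Return a copy of docs where every single-token document becomes [placeholder]."""
    replaced = []
    one_word_indexes = []
    for index, tokens in enumerate(docs):
        if len(tokens) == 1:
            one_word_indexes.append(index)
            replaced.append([placeholder])
        else:
            replaced.append(list(tokens))
    return replaced, one_word_indexes


docs = [['police', 'seize', 'drug'], ['sport'], ['brisbane', 'man']]
new_docs, indexes = replace_one_word_docs(docs)
print(new_docs)
print(indexes)
```

Mapping all one-word documents to a single shared token keeps their otherwise unique words out of the vocabulary.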

### Converting to lower case

In [17]:
lc_new_train_docs = preprocess.convert_lowercase(new_train_docs)
lc_new_test_docs = preprocess.convert_lowercase(new_test_docs)

## Adding trigram and bi-gram collocations

In [18]:
# bigram and trigram collocations found in the training corpus
ngramlist2 = preprocess.get_bi_tri_collocations_filtered_freq_bmi_perc(
    tokens=lc_new_train_docs,
    trigram=True,
    N_best_bigram=7000,
    N_best_trigram=5000,
    min_corpus_bigram_freq=3,
    min_corpus_trigram_freq=3,
    remove_stopwords=True,
    remove_symbols=True)

starting to get all tokens
Got all tokens
getting all bigram collocations
got all bigram collocations
getting all trigram collocations
got all trigram collocations
remving stopwords from all bigram collocations
remved stopwords from all bigram collocations
remving stopwords from all trigram collocations
remved stopwords from all trigram collocations
remving symbols from all collocations
remved symbols from all collocations
[('sakineh', 'mohammadi', 'ashtiani'), ('sutopo', 'purwo', 'nugroho'), ('ku', 'klux', 'klan'), ('inverness', 'caledonian', 'thistle'), ('fathur', 'rohman', 'al-ghozi'), ('se', 'og', 'hoer'), ('taur', 'matan', 'ruak'), ('abd', 'al-rahim', 'al-nashiri'), ('khagendra', 'thapa', 'magar'), ('tuilaepa', 'sailele', 'malielegaoi'), ('gro', 'harlem', 'brundtland'), ('kwa', 'zulu', 'natal'), ('kyodo', 'senpaku', 'kaisha'), ('elmer', 'funke', 'kupper'), ('bovine', 'spongiform', 'encephalopathy'), ('masai', 'moses', 'ndiema'), ('bran', 'nue', 'dae'), ('kiri', 'te', 'kanawa'), ('
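
How `get_bi_tri_collocations_filtered_freq_bmi_perc` scores candidates is not visible here. Assuming the `bmi` in its name refers to pointwise mutual information (PMI), a simplified, dependency-free sketch of frequency-filtered PMI ranking for bigrams is shown below; `top_bigrams_by_pmi` is a hypothetical function, not the author's implementation:

```python
import math
from collections import Counter


def top_bigrams_by_pmi(docs, min_freq=3, n_best=5):
    """Rank adjacent token pairs by PMI, ignoring pairs seen fewer than min_freq times."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())
    scored = []
    for (first, second), count in bigrams.items():
        if count < min_freq:
            continue
        # PMI = log2( P(first, second) / (P(first) * P(second)) ),
        # with all probabilities estimated against the total token count.
        pmi = math.log2(count * total / (unigrams[first] * unigrams[second]))
        scored.append(((first, second), pmi))
    scored.sort(key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in scored[:n_best]]


docs = [
    ['ku', 'klux', 'klan', 'rally'],
    ['ku', 'klux', 'klan', 'march'],
    ['ku', 'klux', 'klan', 'rally'],
    ['the', 'rally', 'the', 'march'],
]
print(top_bigrams_by_pmi(docs, min_freq=3))
```

The minimum-frequency filter matters because PMI on its own favours pairs of very rare words.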

### Removing stopwords to reduce features

In [19]:
preprocess.remove_stopwords(lc_new_train_docs)
preprocess.remove_stopwords(lc_new_test_docs)

### Removing punctuation

In [20]:
# train data
preprocess.remove_punctuation(lc_new_train_docs)
# test data
preprocess.remove_punctuation(lc_new_test_docs)

### Adding the ngrams into the documents

In [21]:
preprocess.introduce_n_grams_in_docs(lc_new_train_docs,ngramlist2)
preprocess.introduce_n_grams_in_docs(lc_new_test_docs,ngramlist2)
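
`introduce_n_grams_in_docs` is a custom method. Later outputs contain tokens such as `a-f-l_footballer_liam`, which suggests that collocations are merged into single underscore-joined tokens, much as `nltk.tokenize.MWETokenizer` does with its default separator. A greedy sketch of that idea, with the hypothetical name `merge_ngrams`:

```python
def merge_ngrams(tokens, ngrams, separator='_'):
    """Greedily replace known n-grams in tokens with a single joined token."""
    ngram_set = set(ngrams)
    longest = max((len(n) for n in ngram_set), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so trigrams win over bigrams.
        for size in range(min(longest, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in ngram_set:
                merged.append(separator.join(candidate))
                i += size
                break
        else:
            # No n-gram starts at this position: keep the token as it is.
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ['ku', 'klux', 'klan', 'rally', 'in', 'new', 'york']
ngrams = [('ku', 'klux', 'klan'), ('new', 'york')]
print(merge_ngrams(tokens, ngrams))
```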

### Removing words which occur fewer than 5 times

In [22]:
req_filt_words = preprocess.filtered_words_by_perc_occur_corpus_more_less_than(list_tokens=lc_new_train_docs,perc_value=5,percent=False,more = False)
latest_ngramlist = preprocess.remove_words_from_ngramlist(ngramlist2,req_filt_words)

num_of_doc_threshold 5
<FreqDist with 149955 samples and 9074448 outcomes>
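
The printed `num_of_doc_threshold 5` suggests the threshold may be applied to the number of documents a word appears in, although the method body is not shown. Under that assumption, a sketch of a document-frequency filter with the hypothetical name `words_below_doc_threshold`:

```python
from collections import Counter


def words_below_doc_threshold(docs, threshold=5):
    """Return the words that appear in fewer than threshold documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return {word for word, count in doc_freq.items() if count < threshold}


docs = [['police', 'drug'], ['police', 'court'], ['police', 'police', 'syrinx']]
print(sorted(words_below_doc_threshold(docs, threshold=2)))
```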


In [23]:
#Removing filter words from Training docs
lc_new_train_docs = preprocess.remove_words_tuples_corpus(lc_new_train_docs,latest_ngramlist,req_filt_words)
print('finishing removing word from train')


total lenth of doclist =  106445
total words in filter list 105970
--- 2.0369150638580322 seconds for removal ---
--- 6.797841548919678 seconds for MWE ---
finishing removing word from train


In [24]:
#Removing filter words from test docs
lc_new_test_docs = preprocess.remove_words_tuples_corpus(lc_new_test_docs,latest_ngramlist,req_filt_words)
print('finishing removing word from testlist')

total lenth of doclist =  26610
total words in filter list 105970
--- 0.5163042545318604 seconds for removal ---
--- 1.4766912460327148 seconds for MWE ---
finishing removing word from testlist


In [25]:
# first find those high frequency words among all the text
all_tokens = []
for each in lc_new_train_docs:
    all_tokens += list(set(each))
all_frequency = dict(nltk.FreqDist(all_tokens))
might_remove = list(dict(nltk.FreqDist(all_tokens).most_common(2000)).keys())

### Removing words whose distribution over the classes has low skewness

In [26]:
a_dict = {}

for i in range(len(lc_new_train_docs)):
    word_set = list(set(lc_new_train_docs[i]))
    for each in word_set:
        if labels[i] in a_dict:
            a_dict[labels[i]].append(each)
        else:
            a_dict[labels[i]] = [each]
    
frequent_words = []
for k,v in a_dict.items():
    v = np.random.permutation(v)
    frequent_word = set(v[:2000])
    frequent_words += list(frequent_word)
    
a = dict(nltk.FreqDist(frequent_words).most_common(1000))

# if a word is frequent across the whole corpus and also appears in the sampled word sets of at least 22 classes, then remove
to_remove = set()
for each in might_remove:
    if each in a:
        if a[each] >=22:
            to_remove.add(each)

In [27]:
# per-class weights that would even out the class distribution (ratio is not used again in this notebook)

label_count = dict(nltk.FreqDist(labels))

ratio = [0]*23
for k,v in label_count.items():
    k = int(k.replace(' ','')[1:]) -1
    ratio[k] = 5520/v
ratio = np.array(ratio)

In [28]:
skewness_matrix = {}
for k,v in a_dict.items():
    k = int(k.replace(' ','')[1:]) -1
    for each in v:
        if each in skewness_matrix:
            skewness_matrix[each][k] +=1
        else:
            skewness_matrix[each] = [0]*23
            skewness_matrix[each][k] +=1

In [29]:
skewness_dict = {}
for k,v in skewness_matrix.items():
    v = np.array(v)
    # coefficient of variation of the per-class counts, used here as the "skewness" score
    average = np.std(v)/np.mean(v)
    skewness_dict[k] = average

In [30]:
# df_dict_skewness = pd.read_table('skewness_dict.txt',delim_whitespace =True,header=None)
# Name both columns explicitly; assigning a set to .columns does not guarantee their order
df_dict_skewness = pd.Series(skewness_dict).rename_axis('word').reset_index(name='skewness')
df_dict_skewness.head()


                    word  skewness
0                  a-b-c  4.690416
1                  a-c-t  4.690416
2                a-c-t-u  4.690416
3                  a-f-l  4.690416
4  a-f-l_footballer_liam  4.690416
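
The score stored in `skewness_dict` is `np.std(v) / np.mean(v)`, that is, the coefficient of variation of a word's per-class counts rather than skewness in the third-moment sense. A word that occurs in only one of the 23 classes always scores sqrt(22) ≈ 4.690416, whatever its count, which is the value seen in the table above. A small standard-library check, using `statistics.pstdev` because `np.std` defaults to the population standard deviation:

```python
import math
import statistics


def coefficient_of_variation(counts):
    """Population standard deviation divided by the mean, like np.std(v) / np.mean(v)."""
    return statistics.pstdev(counts) / statistics.mean(counts)


n_classes = 23
one_class_only = [7] + [0] * (n_classes - 1)  # word seen in a single class
evenly_spread = [7] * n_classes               # word seen equally in every class

print(round(coefficient_of_variation(one_class_only), 6))
print(coefficient_of_variation(evenly_spread))
```

Words spread evenly over the classes score close to 0, so the `skewness < 0.6` cut-off removes words that carry little class information.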


In [31]:
to_rmv = df_dict_skewness[df_dict_skewness.skewness<0.6].word.reset_index().word
to_rmv.head()

0     abandon
1       abide
2     ability
3        able
4    absolute
Name: word, dtype: object

In [33]:
to_rmv_set = set(list(to_rmv)+list(to_remove))

In [34]:
# Removing the low skewness words
final_train_doclist = preprocess.remove_words_tuples_corpus(lc_new_train_docs,ngramlist2,to_rmv_set)

total lenth of doclist =  106445
total words in filter list 874
--- 1.44642972946167 seconds for removal ---
--- 5.749646902084351 seconds for MWE ---


In [35]:
final_test_doclist = preprocess.remove_words_tuples_corpus(lc_new_test_docs,ngramlist2,to_rmv_set)

total lenth of doclist =  26610
total words in filter list 874
--- 0.4021146297454834 seconds for removal ---
--- 2.850888252258301 seconds for MWE ---


### Checking for empty documents

In [36]:
final_train_doclist1 = list(final_train_doclist) 
less_than_10_idx = preprocess.too_short_item(final_train_doclist, 1,-1,1000000)
final_train_doclist = [final_train_doclist[i] for i in range(len(final_train_doclist1)) if i not in less_than_10_idx]

length of token, 0
index - 15311
length of token, 0
index - 70829
total number  2


In [37]:
labels1 = list(labels)
train_labels_new = [labels[i] for i in range(len(labels1)) if i not in less_than_10_idx]

In [39]:
less_than_10_idx_test = preprocess.too_short_item(final_test_doclist, 1,-1,1000000)

total number  0
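
The documents and their labels have to be filtered with the same indexes, otherwise they drift out of alignment. A compact sketch that does both in one pass; `drop_empty_docs` is a hypothetical helper and not part of the `Preprocess` class:

```python
def drop_empty_docs(docs, labels):
    """Remove documents with no tokens, together with their labels."""
    kept = [(tokens, label) for tokens, label in zip(docs, labels) if len(tokens) > 0]
    kept_docs = [tokens for tokens, _ in kept]
    kept_labels = [label for _, label in kept]
    return kept_docs, kept_labels


docs = [['police', 'drug'], [], ['brisbane', 'man'], []]
labels = ['C1', 'C2', 'C1', 'C3']
kept_docs, kept_labels = drop_empty_docs(docs, labels)
print(kept_docs)
print(kept_labels)
```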


In [41]:
vocab_dict_final,revocab_dict_final = preprocess.create_vocab_revocab_dict(final_train_doclist)

In [42]:
# Checking length
len(vocab_dict_final)

48799
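
`create_vocab_revocab_dict` is defined in the custom module, so its exact id scheme is unknown. One plausible shape, assumed here for illustration only, is a token-to-id dictionary plus its inverse; `build_vocab` is a hypothetical name:

```python
def build_vocab(docs):
    """Map each distinct token to an integer id and build the inverse mapping."""
    vocab = {}
    for tokens in docs:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)  # ids follow first-seen order
    reverse_vocab = {index: token for token, index in vocab.items()}
    return vocab, reverse_vocab


docs = [['police', 'seize', 'drug'], ['police', 'court']]
vocab, reverse_vocab = build_vocab(docs)
print(len(vocab))
print(reverse_vocab[vocab['court']])
```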

## Converting each document's token list back into a single string for modelling

In [43]:
training_doclist_concated =  preprocess.finalizing_document_list(final_train_doclist)
test_doclist_concated =  preprocess.finalizing_document_list(final_test_doclist)

In [44]:
training_doclist_concated[0:3]

['german tourist safe hour lose rugged rainforest finch hatton gorge west mackay night area young mackay man fell jump death sergeant jon purcell rescuer locate miss pair midnight aest',
 'police seize rare drug raid florey home police syrinx drug form amphetamine bag contain crystal methamphetamine woman charge offence face court acting sergeant matt varley third drug territory amphetamine manufacturing whereby powder crystal produce liquid methamphetamine contain iodine colour',
 'brisbane man charge fraud allegedly pose taxi driver police man car meter computer man brisbane magistrates court']

In [45]:
len(training_doclist_concated)

106443

### Checkpoint 2 
### Saving final result of pre-processing for modelling task

In [46]:
pickle.dump( training_doclist_concated, open( "training_doclist_concated.p", "wb" ) )
pickle.dump( test_doclist_concated, open( "test_doclist_concated.p", "wb" ) )

In [47]:
pickle.dump( train_labels_new, open( "train_labels_new.p", "wb" ) )

### Importing final result of preprocessing for modelling tasks

In [None]:
# training_doclist_concated = pickle.load( open( "training_doclist_concated.p", "rb" ) )
# test_doclist_concated = pickle.load( open( "test_doclist_concated.p", "rb" ) )
# train_labels_new = pickle.load( open( "train_labels_new.p", "rb" ) )

### Splitting the training data into training and validation sets and writing them to txt files

In [48]:
idx1 = np.random.permutation(len(training_doclist_concated))
x_train1 = [training_doclist_concated[i] for i in idx1]
y_train1 = [train_labels_new[i] for i in idx1]
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=321)
for train_index, test_index in sss.split(x_train1, y_train1):
    X_train, X_test = [x_train1[ind] for ind in train_index ], [x_train1[ind] for ind in test_index ]
    y_train, y_test = [y_train1[ind] for ind in train_index ], [y_train1[ind] for ind in test_index ]
    break

complete_doc = X_train+X_test

# writing to file for keras in R to read
# (the with block closes each file, so no explicit close() is needed)
with open('../Data/processed/complete_data1.txt','w+') as o_fh:
    for doc in complete_doc:
        o_fh.write('{}\n'.format(doc))

with open('../Data/processed/train_data1.txt','w+') as o_fh:
    for doc in X_train:
        o_fh.write('{}\n'.format(doc))

with open('../Data/processed/validation_data1.txt','w+') as o_fh:
    for doc in X_test:
        o_fh.write('{}\n'.format(doc))

label_df_train=pd.DataFrame(y_train,columns=['y'])
label_df_validation=pd.DataFrame(y_test,columns=['y'])
label_df_train.to_csv('../Data/processed/train_labels1.csv')
label_df_validation.to_csv('../Data/processed/validation_labels1.csv')
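
Because the loop breaks after its first iteration, only one of the five splits from `StratifiedShuffleSplit` is used; scikit-learn's `train_test_split(..., stratify=y)` produces the same kind of single stratified split. To illustrate what stratification does, here is a dependency-free sketch (it is not scikit-learn's algorithm, and `stratified_split` is a hypothetical helper):

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=321):
    """Return (train_indexes, test_indexes) keeping class proportions roughly equal."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indexes in by_class.values():
        rng.shuffle(indexes)
        # Hold out the same fraction of every class.
        n_test = round(len(indexes) * test_size)
        test_idx.extend(indexes[:n_test])
        train_idx.extend(indexes[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = ['C1'] * 10 + ['C2'] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Each class contributes the same fraction of its documents to the validation set, so rare classes are not accidentally left out.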

In [49]:
# Saving processed test data as well in form of a txt file
with open('../Data/processed/test_data1.txt','w+') as o_fh:
    for doc in test_doclist_concated:
        o_fh.write('{}\n'.format(doc))

## Also creating a set of processed data without n-gram collocations for evaluation purposes


In [50]:
# Importing data from pickle saved at checkpoint 1
lemmazied_docs_train_eval = pickle.load( open( "lemmazied_docs_train_lemmatized_unfiltered_byPOS.p", "rb" ) )
lemmazied_docs_test_eval = pickle.load( open( "lemmazied_docs_test_lemmatized_unfiltered_byPOS.p", "rb" ) )

In [51]:
# Changing one-word documents to UKNWN
new_test_docs_eval = preprocess.change_oneworddoc_to(lemmazied_docs_test_eval,"UKNWN")
new_train_docs_eval = preprocess.change_oneworddoc_to(lemmazied_docs_train_eval,"UKNWN")

length of token, 1
index - 4553
length of token, 1
index - 4583
length of token, 1
index - 10221
length of token, 1
index - 10237
length of token, 1
index - 10471
length of token, 1
index - 11138
length of token, 1
index - 11591
length of token, 1
index - 13465
length of token, 1
index - 18280
length of token, 1
index - 18851
total number  15
no of one word docs 15
percentage of one word docs 0.0005636978579481398
some of the indexes of 1 word docs 
 [4553, 4583, 10221, 10237, 10471, 11138, 11591, 13465, 18280, 18851]
length of token, 1
index - 4235
length of token, 1
index - 4238
length of token, 1
index - 4265
length of token, 1
index - 8749
length of token, 1
index - 8756
length of token, 1
index - 8818
length of token, 1
index - 18393
length of token, 1
index - 18427
length of token, 1
index - 19020
length of token, 1
index - 25067
total number  56
no of one word docs 56
percentage of one word docs 0.0005260932876133214
some of the indexes of 1 word docs 
 [4235, 4238, 4265, 8749, 

In [52]:
new_train_docs_eval = preprocess.convert_lowercase(new_train_docs_eval)
new_test_docs_eval = preprocess.convert_lowercase(new_test_docs_eval)
# removing stopwords and punctuations
preprocess.remove_stopwords(new_test_docs_eval)
preprocess.remove_stopwords(new_train_docs_eval)
# test data
preprocess.remove_punctuation(new_test_docs_eval)
# train data
preprocess.remove_punctuation(new_train_docs_eval)

### Removing words which occur fewer than 5 times

In [53]:
ngramlist_eval=[]
req_filt_words_eval = preprocess.filtered_words_by_perc_occur_corpus_more_less_than(list_tokens=new_train_docs_eval,perc_value=5,percent=False,more = False)


num_of_doc_threshold 5
<FreqDist with 141844 samples and 9146696 outcomes>


In [54]:
#Removing filter words from Training docs
new_train_docs_eval = preprocess.remove_words_tuples_corpus(new_train_docs_eval,ngramlist_eval,req_filt_words_eval)
print('finishing removing word from train')


total lenth of doclist =  106445
total words in filter list 99053
--- 11.630670547485352 seconds for removal ---
--- 7.522034645080566 seconds for MWE ---
finishing removing word from train


In [55]:
#Removing filter words from test docs
new_test_docs_eval = preprocess.remove_words_tuples_corpus(new_test_docs_eval,ngramlist_eval,req_filt_words_eval)
print('finishing removing word from testlist')

total lenth of doclist =  26610
total words in filter list 99053
--- 0.7798235416412354 seconds for removal ---
--- 1.4874277114868164 seconds for MWE ---
finishing removing word from testlist


In [56]:
# first find those high frequency words among all the text
all_tokens = []
for each in new_train_docs_eval:
    all_tokens += list(set(each))
all_frequency = dict(nltk.FreqDist(all_tokens))
might_remove = list(dict(nltk.FreqDist(all_tokens).most_common(2000)).keys())

In [57]:
# Removing low skewness (classwise) words 
a_dict = {}

for i in range(len(new_train_docs_eval)):
    word_set = list(set(new_train_docs_eval[i]))
    for each in word_set:
        if labels[i] in a_dict:
            a_dict[labels[i]].append(each)
        else:
            a_dict[labels[i]] = [each]
    
frequent_words = []
for k,v in a_dict.items():
    v = np.random.permutation(v)
    frequent_word = set(v[:2000])
    frequent_words += list(frequent_word)
    
a = dict(nltk.FreqDist(frequent_words).most_common(1000))

# if a word is frequent across the whole corpus and also appears in the sampled word sets of at least 22 classes, then remove
to_remove = set()
for each in might_remove:
    if each in a:
        if a[each] >=22:
            to_remove.add(each)
# per-class weights that would even out the class distribution (ratio is not used again in this notebook)

label_count = dict(nltk.FreqDist(labels))

ratio = [0]*23
for k,v in label_count.items():
    k = int(k.replace(' ','')[1:]) -1
    ratio[k] = 5520/v
ratio = np.array(ratio)

skewness_matrix = {}
for k,v in a_dict.items():
    k = int(k.replace(' ','')[1:]) -1
    for each in v:
        if each in skewness_matrix:
            skewness_matrix[each][k] +=1
        else:
            skewness_matrix[each] = [0]*23
            skewness_matrix[each][k] +=1
            
skewness_dict = {}
for k,v in skewness_matrix.items():
    v = np.array(v)
    # coefficient of variation of the per-class counts, used here as the "skewness" score
    average = np.std(v)/np.mean(v)
    skewness_dict[k] = average
    
# df_dict_skewness = pd.read_table('skewness_dict.txt',delim_whitespace =True,header=None)
# Name both columns explicitly; assigning a set to .columns does not guarantee their order
df_dict_skewness = pd.Series(skewness_dict).rename_axis('word').reset_index(name='skewness')
df_dict_skewness.head()

to_rmv = df_dict_skewness[df_dict_skewness.skewness<0.6].word.reset_index().word
to_rmv.head()

to_rmv_set = set(list(to_rmv)+list(to_remove))

ngramlist3 = []
# Removing the low skewness words
final_train_doclist_eval = preprocess.remove_words_tuples_corpus(new_train_docs_eval,ngramlist3,to_rmv_set)

final_test_doclist_eval = preprocess.remove_words_tuples_corpus(new_test_docs_eval,ngramlist3,to_rmv_set)

total lenth of doclist =  106445
total words in filter list 875
--- 8.6093008518219 seconds for removal ---
--- 4.351987361907959 seconds for MWE ---
total lenth of doclist =  26610
total words in filter list 875
--- 0.5055673122406006 seconds for removal ---
--- 1.1956005096435547 seconds for MWE ---


In [58]:
# removing empty training documents
final_train_doclist1_eval = list(final_train_doclist_eval) 
less_than_10_idx_eval = preprocess.too_short_item(final_train_doclist_eval, 1,-1,1000000)
final_train_doclist_eval = [final_train_doclist_eval[i] for i in range(len(final_train_doclist1_eval)) if i not in less_than_10_idx_eval]
labels_eval = list(labels)
labels1_eval = list(labels)
train_labels_new_eval = [labels_eval[i] for i in range(len(labels1_eval)) if i not in less_than_10_idx_eval]

length of token, 0
index - 15311
length of token, 0
index - 70829
total number  2


In [59]:
training_doclist_concated_eval =  preprocess.finalizing_document_list(final_train_doclist_eval)
test_doclist_concated_eval =  preprocess.finalizing_document_list(final_test_doclist_eval)

In [60]:
# Saving in txt file
idx1 = np.random.permutation(len(training_doclist_concated_eval))
x_train1 = [training_doclist_concated_eval[i] for i in idx1]
y_train1 = [train_labels_new_eval[i] for i in idx1]
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=321)
for train_index, test_index in sss.split(x_train1, y_train1):
    X_train, X_test = [x_train1[ind] for ind in train_index ], [x_train1[ind] for ind in test_index ]
    y_train, y_test = [y_train1[ind] for ind in train_index ], [y_train1[ind] for ind in test_index ]
    break

complete_doc = X_train+X_test

       
    
# the with block closes each file, so no explicit close() is needed
with open('../Data/processed/complete_data_no_ngram1.txt','w+') as o_fh:
    for doc in complete_doc:
        o_fh.write('{}\n'.format(doc))

with open('../Data/processed/train_data_no_ngram1.txt','w+') as o_fh:
    for doc in X_train:
        o_fh.write('{}\n'.format(doc))

with open('../Data/processed/validation_data_no_ngram1.txt','w+') as o_fh:
    for doc in X_test:
        o_fh.write('{}\n'.format(doc))

label_df_train=pd.DataFrame(y_train,columns=['y'])
label_df_validation=pd.DataFrame(y_test,columns=['y'])
label_df_train.to_csv('../Data/processed/train_labels_no_ngram1.csv')
label_df_validation.to_csv('../Data/processed/validation_labels_no_ngram1.csv')


# Saving processed test data as well in form of a txt file
with open('../Data/processed/test_data_no_ngram1.txt','w+') as o_fh:
    for doc in test_doclist_concated_eval:
        o_fh.write('{}\n'.format(doc))

# Summary
## The news articles have now been preprocessed and cleaned, and can be used for modelling in the next task