<h4 style="color:purple">What is the CRISP-DM Methodology?</h4>
    
    The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model with six phases that naturally describes the data science life cycle.
    
    1- Business understanding – What does the business need?
    2- Data understanding – What data do we have / need? Is it clean?
    3- Data preparation – How do we organize the data for modeling?
    4- Modeling – What modeling techniques should we apply?
    5- Evaluation – Which model best meets the business objectives?
    6- Deployment – How do stakeholders access the results?
    
<img src="assets/img/tutorial/crisp.jpg" width="400" style="float:right">

<br><br><br>
<h4 style="color:darkred">CRISP: 1-Business Understanding</h4>

    In corpus linguistics, part-of-speech (POS) tagging, also called grammatical tagging, is the process of marking up each word in a text as corresponding to a particular part of speech, based on both its definition and its context.
    
    Ultimately, we want to label each word that is given to us as input within a sentence.
    
<img src="assets/img/tutorial/postagger.png" width="300" style="float:right;">
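
To make "definition and context" concrete, here is a minimal toy sketch. It is not the tagger built in this notebook: the lexicon, the tag names, and the single context rule are all hypothetical and chosen only for illustration.

```python
# Hypothetical toy lexicon: each word lists the tags it may take ("definition").
LEXICON = {
    "i": ["PRON"], "to": ["PART"], "the": ["DET"], "a": ["DET"],
    "want": ["VERB"], "read": ["VERB"],
    "book": ["NOUN", "VERB"],  # ambiguous without context
}

def tag(sentence):
    """Tag each word, using the previous tag ("context") to resolve ambiguity."""
    words = sentence.split()
    tags = []
    for word in words:
        candidates = LEXICON.get(word.lower(), ["NOUN"])  # unknown words default to NOUN
        if len(candidates) == 1:
            choice = candidates[0]
        else:
            previous = tags[-1] if tags else None
            # Illustrative rule: after the particle "to", prefer the verb reading.
            choice = "VERB" if previous == "PART" and "VERB" in candidates else candidates[0]
        tags.append(choice)
    return list(zip(words, tags))

print(tag("I want to book a flight"))
print(tag("I read the book"))
```

The same word "book" receives a different label in each sentence, which is the kind of context dependence the features later in this notebook try to capture.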

In [6]:
# import libraries
import codecs
import pandas as pd
import re
import csv
import numpy as np
import time
import gc
from PersianStemmer import PersianStemmer
ps = PersianStemmer()

<br><br><h2 style="text-align:center; color: darkred">CRISP: 2-Data Understanding Phase</h2>


<h4 style="color:purple">What is our data?</h4>
    
    Our data source is the Dr. BijanKhan corpus, from which we want to build a dataset.

<br><br>
<i>BijanKhan Corpus</i>
<img src="assets/img/output/maincorpus2.jpg">

<br><br><br>
<b>Rewrite the main corpus, making the necessary changes with regular expressions</b>

In [9]:
def prepair_txt_file():
    with codecs.open('assets/maincorpus.txt', "r", "utf-8") as myfile:
        data = myfile.read()

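        # Replace every run of three or more spaces (presumably the gap between a word and its tag) with a '$' delimiter
        newData = re.sub('   +', '$', data)
        # Normalize Arabic yeh (ي) and kaf (ك) to their Persian forms (ی, ک)
        newData = re.sub('ي', 'ی', newData)
        newData = re.sub('ك', 'ک', newData)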
        #newData = re.sub('‌', ' ', data)

    with codecs.open('output/improvedcorpus.txt', "w", "utf-8") as myfiles:
        myfiles.write(newData)
prepair_txt_file()

<i>improved corpus</i>
<img src="assets/img/output/improvedcorpus.jpg">
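
The effect of these substitutions is easier to see on a single line. The sketch below applies the same three rules to an in-memory string; the helper name `normalize_line` and the sample line with its `TAG` label are made up for the example and are not part of the corpus.

```python
import re

# Code points written explicitly so the two look-alike letters cannot be confused.
ARABIC_YEH, PERSIAN_YEH = "\u064a", "\u06cc"
ARABIC_KAF, PERSIAN_KAF = "\u0643", "\u06a9"

def normalize_line(line):
    """Apply the same three clean-up rules as prepair_txt_file() to one line."""
    line = re.sub(" {3,}", "$", line)             # 3+ spaces -> '$' delimiter
    line = line.replace(ARABIC_YEH, PERSIAN_YEH)  # Arabic yeh -> Persian yeh
    line = line.replace(ARABIC_KAF, PERSIAN_KAF)  # Arabic kaf -> Persian kaf
    return line

# Hypothetical sample: a word spelled with Arabic kaf, several spaces, then a tag.
sample = "\u0643\u062a\u0627\u0628     TAG"
print(normalize_line(sample))
```

Note that one or two consecutive spaces are left untouched, so spaces inside a multi-word token should survive the rewrite.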

<br><br><br>
<b>Convert the corpus from a .txt file to a .csv file</b>

In [10]:
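def prepair_txt_file():
    # Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0;
    # on newer versions pass on_bad_lines='warn' (or 'skip') instead.
    file = pd.read_csv("output/improvedcorpus.txt", error_bad_lines=False, encoding='utf-8', delimiter = '$', quoting=csv.QUOTE_NONE)
    file.to_csv('output/table.csv',encoding='utf-8-sig', index = None)
prepair_txt_file()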

b'Skipping line 406250: expected 2 fields, saw 3\nSkipping line 406267: expected 2 fields, saw 3\n'


<i>output</i>
<img src="assets/img/output/table.jpg">
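
The "expected 2 fields, saw 3" warnings above mean that a few lines contain an extra `$` and are dropped. As a rough sketch of that behaviour using only the standard library (this is not how pandas is implemented, and the function name and sample rows are assumptions for the example), one can keep rows with exactly two fields and record the line numbers of the rest:

```python
import csv
import io

def read_two_column_rows(text, delimiter="$"):
    """Keep rows with exactly two fields; return them plus the skipped line numbers."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quoting=csv.QUOTE_NONE)
    good, skipped = [], []
    for line_no, row in enumerate(reader, start=1):
        if len(row) == 2:
            good.append(row)
        else:
            skipped.append(line_no)  # malformed line, e.g. an extra delimiter
    return good, skipped

# Hypothetical sample: the second line has three fields and is skipped.
good, skipped = read_two_column_rows("w1$N\nw2$V$extra\nw3$ADJ\n")
print(good)
print(skipped)
```

Recording the skipped line numbers makes it possible to inspect those lines in the corpus afterwards instead of losing them silently.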

<br><br><br>
<b>Preparing the DataFrame for processing and saving it to a new .csv file</b>

In [11]:
def column_array():
    array = ['input', 'label', 'word len' , 'stem len', 'prefix len', 'suffix len'
             , 'pre pos 1', 'pre pos 2', 'pre pos 3'
             , 'nxt pos 1', 'nxt pos 2', 'nxt pos 3'
            , 'suffix', 'suffix zamir', 'suffix exception', 'mokasar']
    return array

t = time.time()

def create_table(): 
    table = pd.read_csv("output/table.csv", encoding='utf-8')
    table.rename(columns={'#':'input',table.columns[1]:'label'}, inplace=True)
    array = column_array()
    for item in array:
        if item != 'input' and item != 'label':
            table[item] = np.nan
    table.to_csv('output/datasetstructure.csv',encoding='utf-8-sig', index = None)


create_table()

elapsed = time.time() - t
print(elapsed)

#tiktok 97.60322308540344

33.15774726867676


<i>Dataset structure output</i>
<img src="assets/img/output/datasetstructure.jpg">

<br><br><br>
<h4>The word_analyse() function extracts features from the data</h4>
<h4 style="color:purple">Our suggested features</h4>
    
    word len: word length.
    stem len: stem length.
    prefix len: prefix length.
    suffix len: suffix length.
    pre pos 1: label of the first previous word.
    pre pos 2: label of the second previous word.
    pre pos 3: label of the third previous word.
    nxt pos 1: label of the first next word.
    nxt pos 2: label of the second next word.
    nxt pos 3: label of the third next word.
    suffix: such as "سار", "سان", "لاخ", "مند", "دار"
    suffix zamir: pronominal suffixes such as "م", "ت", "ش"
    suffix exception: such as "تر", "ترین", "ام", "ات"
    mokasar: broken (irregular) plurals such as "کتب", "کسورات"
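
As a small illustration of the two length features, the sketch below computes prefix and suffix length from a word and its stem. It is a simplified variant, not the notebook's `word_analyse()`: the function name is hypothetical, it uses length arithmetic instead of `split`, and the English word/stem pairs stand in for what a stemmer such as `ps.run(word)` would return.

```python
def affix_lengths(word, stem):
    """Return (prefix_len, suffix_len) of `word` around `stem`.

    Simplified: if the stem is neither at the start nor at the end of the
    word (or is empty / equal to the word), both lengths are 0.
    """
    if not stem or stem == word:
        return 0, 0
    if word.startswith(stem):
        return 0, len(word) - len(stem)   # everything after the stem is suffix
    if word.endswith(stem):
        return len(word) - len(stem), 0   # everything before the stem is prefix
    return 0, 0

print(affix_lengths("books", "book"))
print(affix_lengths("unhappy", "happy"))
print(affix_lengths("book", "book"))
```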

In [227]:
def word_analyse(word, operate):
    if operate == 'word len':
        return len(word)
    if operate == 'stem len':
        return len(ps.run(word))
    if operate == 'prefix len':
        stem = ps.run(word)
        if len(stem) != len(word):
            if word.endswith(stem) and len(stem) > 0:
                split = word.split(stem)
                return len(split[0])
            else:
                return 0
        else:
            return 0
    if operate == 'suffix len':
        stem = ps.run(word)
        if len(stem) != len(word):
            if word.startswith(stem) and len(stem) > 0:
                split = word.split(stem)
                return len(split[1])
            else:
                return 0
        else:
            return 0
    if operate == 'suffix':
        stem = ps.run(word)
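        # Spelled with Persian kaf (ک): the corpus was normalized from Arabic kaf (ك), so entries written with ك could never match
        suffix_arr = ["کار", "ناک", "وار", "آسا", "آگین", "بار", "بان", "دان", "زار", "سار", "سان", "لاخ", "مند", "دار", "مرد", "کننده", "گرا", "نما", "متر"]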
        if word_analyse(word, 'suffix len') > 0  and len(stem) > 0:
            suffix = word.split(stem)[1]
            suffix = re.sub('   +', '', suffix)
            if suffix in suffix_arr:
                return 1
            else: 
                return 0
        return 0
    if operate == 'suffix zamir':
        stem = ps.run(word)
        suffixZamir = ["م", "ت", "ش"]
        if word_analyse(word, 'suffix len') > 0  and len(stem) > 0:
            suffix = word.split(stem)[1]
            suffix = re.sub('   +', '', suffix)
            if suffix in suffixZamir:
                return 1
            else: 
                return 0
        return 0
    if operate == 'suffix exception':
        stem = ps.run(word)
        suffixException = ["ها", "تر", "ترین", "ام", "ات", "اش"]
        if word_analyse(word, 'suffix len') > 0  and len(stem) > 0:
            suffix = word.split(stem)[1]
            suffix = re.sub('   +', '', suffix)
            if suffix in suffixException:
                return 1
            else: 
                return 0
        return 0
    if operate == 'mokasar':
        stem = ps.run(word)
        if len(stem) != len(word) and stem not in word:
            return 1
        else:
            return 0
    

def column_array():
    array = ['input', 'label', 'word len' , 'stem len', 'prefix len', 'suffix len'
             , 'pre pos 1', 'pre pos 2', 'pre pos 3'
             , 'nxt pos 1', 'nxt pos 2', 'nxt pos 3'
            , 'suffix', 'suffix zamir', 'suffix exception', 'mokasar']
    return array
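
The three suffix flags and the `mokasar` flag all follow the same two patterns, which can be sketched in isolation. The helper names below are hypothetical, slicing is used instead of `split`, and the stems are assumed values passed in by hand rather than real output of `PersianStemmer`.

```python
def has_listed_suffix(word, stem, suffixes):
    """Return 1 if `word` is `stem` followed by one of `suffixes`, else 0."""
    if stem and word != stem and word.startswith(stem):
        return int(word[len(stem):] in suffixes)
    return 0

def looks_like_broken_plural(word, stem):
    """Return 1 if the stem differs in length and does not occur inside the word."""
    return int(len(stem) != len(word) and stem not in word)

# Assumed stems, for illustration only.
print(has_listed_suffix("هنرمند", "هنر", {"مند", "دار"}))  # stem + listed suffix
print(has_listed_suffix("هنر", "هنر", {"مند", "دار"}))     # no suffix at all
print(looks_like_broken_plural("کتب", "کتاب"))             # stem is not a substring
print(looks_like_broken_plural("کتابها", "کتاب"))          # regular: stem is inside
```

Passing the suffix list as a parameter means one helper can serve the `suffix`, `suffix zamir` and `suffix exception` columns with three different lists.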

<br><br><br>
<b>Build the features DataFrame chunk by chunk ;)</b>

<b>Save each chunk to a CSV file, to be combined into the final integrated dataset</b>

In [229]:
t = time.time()

header = True
#row.name = 0
ChunkSize = 10000
chunkCount = 0
for chunk in pd.read_csv("output/datasetstructure.csv", encoding='utf-8', chunksize = ChunkSize, low_memory=False):
###### for loop creating chunks - starts
    # Only chunks with chunkCount > 10 are processed; the first 11 chunks are counted but skipped
    if chunkCount > 10:
    ###### for loop entire chunks starts
        table = chunk
        counter = 0
        for index, row in table.iterrows():
            #prepair vars
            table_rows_num = len(table.index)
            row_num = counter
            col_num = len(table.columns)

            #word_len
            word = row['input']
            table.iloc[[row_num],[2]] = word_analyse(word, 'word len')

            #stem_len
            table.iloc[[row_num],[3]] = word_analyse(word, 'stem len')

            #prefix len
            table.iloc[[row_num],[4]] = word_analyse(word, 'prefix len')

            #suffix len
            table.iloc[[row_num],[5]] = word_analyse(word, 'suffix len')                                        


            #pre pos 1
            if row_num-1 >= 0:
                pos = table.iloc[[row_num-1],[1]].values[0][0]
                table.iloc[[row_num],[6]] = pos
                #break
            else:
                table.iloc[[row_num],[6]] = 0

            #pre pos 2
            if row_num-2 >= 0:
                pos = table.iloc[[row_num-2],[1]].values[0][0]
                table.iloc[[row_num],[7]] = pos
                #break
            else:
                table.iloc[[row_num],[7]] = 0

            #pre pos 3
            if row_num-3 >= 0:
                pos = table.iloc[[row_num-3],[1]].values[0][0]
                table.iloc[[row_num],[8]] = pos
                #break
            else:
                table.iloc[[row_num],[8]] = 0

            #nxt pos 1
            if row_num+1 < table_rows_num:
                pos = table.iloc[[row_num+1],[1]].values[0][0]
                table.iloc[[row_num],[9]] = pos
                #break
            else:
                table.iloc[[row_num],[9]] = 0

            #nxt pos 2
            if row_num+2 < table_rows_num:
                pos = table.iloc[[row_num+2],[1]].values[0][0]
                table.iloc[[row_num],[10]] = pos
                #break
            else:
                table.iloc[[row_num],[10]] = 0

            #nxt pos 3
            if row_num+3 < table_rows_num:
                pos = table.iloc[[row_num+3],[1]].values[0][0]
                table.iloc[[row_num],[11]] = pos
                #break
            else:
                table.iloc[[row_num],[11]] = 0

            #suffix
            table.iloc[[row_num],[12]] = word_analyse(word, 'suffix')

            #suffix zamir
            table.iloc[[row_num],[13]] = word_analyse(word, 'suffix zamir')

            #suffix exception
            table.iloc[[row_num],[14]] = word_analyse(word, 'suffix exception')

            #mokasar
            table.iloc[[row_num],[15]] = word_analyse(word, 'mokasar')

            counter = counter+1
            if counter == ChunkSize:
                counter = 1

    
    
        ###### for loop entire chunks ends

        array = column_array()
        chunk.columns = array
        chunk.to_csv('output/datatrain/datatrain'+str(chunkCount)+'.csv',encoding='utf-8-sig', header=header, mode='a')
        gc.collect()
        header = False

#table.to_csv('table.csv',encoding='utf-8-sig', index = None)
    chunkCount = chunkCount+1
    chunkTotal = 2000000 / ChunkSize
    remains = chunkTotal - chunkCount
    print(str(chunkCount) + " / " + str(chunkTotal) + "------ ChunkRemains: " + str(remains))
    elapsed = time.time() - t
    print(elapsed)

1 / 200.0------ ChunkRemains: 199.0
0.02992081642150879
2 / 200.0------ ChunkRemains: 198.0
0.048868417739868164
3 / 200.0------ ChunkRemains: 197.0
0.06781935691833496
4 / 200.0------ ChunkRemains: 196.0
0.09275078773498535
5 / 200.0------ ChunkRemains: 195.0
0.10970640182495117
6 / 200.0------ ChunkRemains: 194.0
0.12765836715698242
7 / 200.0------ ChunkRemains: 193.0
0.1486034393310547
8 / 200.0------ ChunkRemains: 192.0
0.1715536117553711
9 / 200.0------ ChunkRemains: 191.0
0.1874988079071045
10 / 200.0------ ChunkRemains: 190.0
0.20545029640197754
11 / 200.0------ ChunkRemains: 189.0
0.23337531089782715
12 / 200.0------ ChunkRemains: 188.0
83.44049620628357
13 / 200.0------ ChunkRemains: 187.0
171.97193932533264
14 / 200.0------ ChunkRemains: 186.0
254.09863805770874
15 / 200.0------ ChunkRemains: 185.0
335.3263463973999
16 / 200.0------ ChunkRemains: 184.0
421.12051916122437
17 / 200.0------ ChunkRemains: 183.0
523.7453479766846
18 / 200.0------ ChunkRemains: 182.0
628.0317368507

149 / 200.0------ ChunkRemains: 51.0
11864.729574203491
150 / 200.0------ ChunkRemains: 50.0
11946.464850902557
151 / 200.0------ ChunkRemains: 49.0
12031.100084543228
152 / 200.0------ ChunkRemains: 48.0
12110.05702495575
153 / 200.0------ ChunkRemains: 47.0
12197.779732465744
154 / 200.0------ ChunkRemains: 46.0
12281.565051317215
155 / 200.0------ ChunkRemains: 45.0
12367.021774053574
156 / 200.0------ ChunkRemains: 44.0
12450.54115653038
157 / 200.0------ ChunkRemains: 43.0
12529.272244215012
158 / 200.0------ ChunkRemains: 42.0
12611.635533571243
159 / 200.0------ ChunkRemains: 41.0
12692.970117807388
160 / 200.0------ ChunkRemains: 40.0
12772.999831199646
161 / 200.0------ ChunkRemains: 39.0
12857.715972185135
162 / 200.0------ ChunkRemains: 38.0
12942.62639284134
163 / 200.0------ ChunkRemains: 37.0
13027.125674009323
164 / 200.0------ ChunkRemains: 36.0
13106.395667552948
165 / 200.0------ ChunkRemains: 35.0
13184.779160499573
166 / 200.0------ ChunkRemains: 34.0
13269.76384472
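
The six context columns (`pre pos 1..3`, `nxt pos 1..3`) are filled from the labels of neighbouring rows, with 0 where no neighbour exists. A minimal sketch of that lookup on a plain list is shown below; the function name and the toy label sequence are assumptions made for the example.

```python
def context_labels(labels, i, window=3, pad=0):
    """Return (previous, following) labels around position i, nearest first.

    Positions that fall outside the sequence are filled with `pad`.
    """
    previous = [labels[i - k] if i - k >= 0 else pad for k in range(1, window + 1)]
    following = [labels[i + k] if i + k < len(labels) else pad for k in range(1, window + 1)]
    return previous, following

# Toy label sequence, for illustration only.
labels = ["N", "V", "ADJ", "P", "N"]
print(context_labels(labels, 0))
print(context_labels(labels, 3))
```

Because the loop above looks up neighbours by position inside the current chunk, the first and last three rows of every chunk get 0 for some context columns, even though neighbouring words exist in the adjacent chunk.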

<br><br><br>
<b>Combine the chunks into one CSV file</b>

In [None]:
import os
import glob

path = "."
os.chdir(path)
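results = pd.DataFrame()

# sep="|" is assumed not to occur in the chunk files, so each line is read as a
# single column and the files are simply stacked (header=None keeps every line as data)
for counter, current_file in enumerate(glob.glob("*.csv")):
    namedf = pd.read_csv(current_file, header=None, sep="|")
    results = pd.concat([results, namedf])

# Uncomment to write the combined file
#results.to_csv('output/datatrain/Combined.csv',encoding='utf-8-sig', index=None, header=None, sep="|")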

print('done')

<i>output chunk files</i>
<img src="assets/img/output/chunkfiles.jpg">
<br><br><br>
<i>final version of DataSet</i>
<img src="assets/img/output/dataset.jpg">
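
One thing to watch when combining the chunk files is their order: `glob` returns names in arbitrary order, and a plain alphabetical sort would put `datatrain100.csv` before `datatrain12.csv`. The sketch below is one possible alternative using only the standard library. The function names, the two-column header, and the temporary demo files are all assumptions for the example, not the notebook's actual files.

```python
import csv
import glob
import os
import re
import tempfile

def chunk_number(path):
    """Extract the trailing number from names like 'datatrain12.csv'."""
    match = re.search(r"(\d+)\.csv$", os.path.basename(path))
    return int(match.group(1)) if match else -1

def combine_csv_files(pattern, out_path, header_row):
    """Concatenate matching files in numeric order, writing the header once.

    out_path should not match `pattern`, or a rerun would read the old output.
    """
    paths = sorted(glob.glob(pattern), key=chunk_number)
    with open(out_path, "w", newline="", encoding="utf-8-sig") as out:
        writer = csv.writer(out)
        writer.writerow(header_row)
        for path in paths:
            with open(path, newline="", encoding="utf-8-sig") as src:
                for row in csv.reader(src):
                    if row and row != header_row:  # drop blank lines and repeated headers
                        writer.writerow(row)
    return paths

# Demo on two tiny made-up chunk files; only the first chunk carries a header.
with tempfile.TemporaryDirectory() as tmp:
    demo = {12: [["w3", "V"]], 11: [["input", "label"], ["w1", "N"], ["w2", "P"]]}
    for number, rows in demo.items():
        name = os.path.join(tmp, "datatrain" + str(number) + ".csv")
        with open(name, "w", newline="", encoding="utf-8-sig") as f:
            csv.writer(f).writerows(rows)
    out_path = os.path.join(tmp, "combined.csv")
    combine_csv_files(os.path.join(tmp, "datatrain*.csv"), out_path, ["input", "label"])
    with open(out_path, newline="", encoding="utf-8-sig") as f:
        combined_rows = list(csv.reader(f))

print(combined_rows)
```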

<br><br><h2 style="text-align:center; color: darkred">End of Data Understanding Phase</h2>