## Text Segmentation


Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. When segment boundaries are well defined, segmentation is simple. In unstructured data, however, there are no distinct boundaries, and segmentation becomes a tedious task.

In this notebook we extract segments such as the abstract, methodology, and conclusion from a research paper.




## Reading a PDF file

Segmentation needs preprocessed text as input, so we first extract the text from the PDF file.

In [2]:
import re
import string

In [20]:
file_path='researchpaper.pdf'


In [21]:
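import PyPDF2

# Read every page and concatenate the extracted text.
# PdfReader and .pages replace the older PdfFileReader / numPages / getPage names.
with open(file_path, 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    text = ''
    for page in pdf_reader.pages:
        # Fall back to '' in case a page yields no text
        text += page.extract_text() or ''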


In [22]:
def text_cleaning(text):
    text = text.lower()
    return text

In [23]:
cleantext= text_cleaning(text)

In [24]:
headings = ['Introduction', 'Abstract', 'Methodology', 'Conclusion', 'References', 'Proposed System', 'Conclusion and Future Work']


We need to find how many times each heading occurs in the paper. If a heading occurs multiple times, we need to find which occurrence is the actual start of the section. For example, if **introduction** appears more than once, the extra occurrences are probably mentions inside other segments of the paper, such as the abstract or the methodology.

In [25]:
heading_occurance = [(heading, cleantext.count(heading.lower())) for heading in headings]
heading_occurance

[('Introduction', 1),
 ('Abstract', 1),
 ('Methodology', 0),
 ('Conclusion', 1),
 ('References', 1),
 ('Proposed System', 3),
 ('Conclusion and Future Work', 1)]
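
Note that `str.count` counts substring matches, so a short heading is also counted inside any longer heading that contains it. This is the likely reason 'Conclusion' and 'Conclusion and Future Work' both report 1 above. A minimal sketch on a made-up string (the `sample` text is an assumption for illustration, not taken from the paper):

```python
# Hypothetical sample text, lowercased like `cleantext` in the notebook.
sample = "abstract\nwe present a system.\nconclusion and future work\nwe plan more tests.\n"

headings = ["Conclusion", "Conclusion and Future Work"]

# str.count counts non-overlapping substring matches, so the shorter heading
# is also found inside the longer one.
counts = [(heading, sample.count(heading.lower())) for heading in headings]
print(counts)
```

Both headings are counted once here even though the text contains only one heading line.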

The `find_occurance` function returns the indices at which a given heading occurs in the text.

In [26]:
def find_occurance(heading):
    # re.escape makes the heading match literally, even if it contains regex metacharacters
    occurance_indices = [word.start() for word in re.finditer(re.escape(heading), cleantext)]
    return occurance_indices
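
As a quick check of what `re.finditer` returns, the sketch below runs the same idea on a made-up string. The `sample` text and the `find_positions` helper are assumptions for illustration; `re.escape` is used so that a heading containing regex metacharacters would still be matched literally.

```python
import re

def find_positions(heading, text):
    """Return the start index of every literal occurrence of `heading` in `text`."""
    return [match.start() for match in re.finditer(re.escape(heading), text)]

# Hypothetical text: one heading line followed by an in-sentence mention.
sample = "proposed system\nthe proposed system has two parts.\n"
positions = find_positions("proposed system", sample)
print(positions)
```

Both the heading line and the in-sentence mention are returned, which is why a further check is needed to tell them apart.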


To decide whether an occurrence of a heading term is the start of a segment, we look at the few characters that follow it.
If no newline follows, or if punctuation such as `.?,;` appears after the heading term, the occurrence cannot be the start of a segment. To be identified as a heading, the term must be the only word or phrase on its line.

In [27]:
def prob_next_word(index,heading):
    return cleantext[index:index+len(heading)+10]

In [28]:
def available_topics(headings):
    available_headings = []
    start_indices = []
    headings = [heading.lower() for heading in headings]

    for heading in headings:
        for index in find_occurance(heading):
            next_words_seq = prob_next_word(index, heading)
            # Text that follows the heading term inside the look-ahead window
            after_heading = next_words_seq[len(heading):]
            # A heading must end its line: a newline has to follow, and nothing
            # between the heading and that newline may be punctuation.
            rest_of_line = after_heading.split("\n")[0]
            has_punctuation = any(char in string.punctuation for char in rest_of_line)
            is_start = ("\n" in after_heading) and not has_punctuation
            if is_start:
                start_indices.append(index)
                available_headings.append(heading)
    return available_headings, start_indices

available_headings, start_indices = available_topics(headings)
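
To see how the look-ahead test separates a real heading from an in-sentence mention, here is a minimal, self-contained sketch on a made-up string. The `sample` text, the `is_heading_start` helper, and the window size of 10 characters are assumptions for illustration. The sketch mirrors the rule described above and does not reproduce the notebook cell exactly.

```python
import re
import string

def is_heading_start(text, index, heading, window=10):
    """Return True if `heading` at `index` is alone on the rest of its line."""
    after_heading = text[index + len(heading):index + len(heading) + window]
    if "\n" not in after_heading:
        return False  # the line continues, so this is an in-sentence mention
    rest_of_line = after_heading.split("\n")[0]
    return not any(char in string.punctuation for char in rest_of_line)

# Hypothetical text: an in-sentence mention first, then the real heading line.
sample = "in the introduction, we outline the task.\nintroduction\nthe arabic language is rich.\n"
starts = [m.start() for m in re.finditer("introduction", sample)
          if is_heading_start(sample, m.start(), "introduction")]
print(starts)
```

Only the occurrence that sits alone on its line is kept. The mention followed by a comma is rejected.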

The start of one segment is the end of the previous one. To find where a segment ends, we take the smallest starting index, among all segment starting points, that is greater than the current segment's starting index. The last segment ends at the end of the text.

In [29]:
def nearest_higher_index(heading_index,start_indices):
    sorted_indices =  sorted(start_indices)
    end_index = sorted_indices.index(heading_index)+1
    if end_index == len(sorted_indices):
        end_index_val = len(cleantext)
    else:
        end_index_val =  sorted_indices[end_index]
    return end_index_val
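
The same "next larger start index" lookup can also be sketched with the standard-library `bisect` module. This is an alternative illustration, not the notebook's implementation; the `segment_end` helper and the numbers are made up.

```python
from bisect import bisect_right

def segment_end(start, all_starts, text_length):
    """Return the smallest start index greater than `start`, or `text_length` if none."""
    ordered = sorted(all_starts)
    position = bisect_right(ordered, start)
    return ordered[position] if position < len(ordered) else text_length

starts = [120, 30, 560]   # hypothetical segment start indices, in detection order
ends = [segment_end(s, starts, text_length=900) for s in starts]
print(list(zip(starts, ends)))
```

The segment with the largest start index has no successor, so it runs to the end of the text.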
    

To extract a section such as the introduction or the abstract, we need both its starting point and its ending point.

In [30]:
def start_and_end_index(headings, start_indices):
    heading_prob_points = []
    # Pair each detected heading with its own start index
    for heading, start_index_val in zip(headings, start_indices):
        end_index_val = nearest_higher_index(start_index_val, start_indices)
        print((heading, start_index_val, end_index_val))
        heading_prob_points.append((heading, start_index_val, end_index_val))
    return heading_prob_points

In [31]:
heading_prob_points = start_and_end_index(available_headings, start_indices)

('introduction', 1249, 5584)
('abstract', 330, 1249)
('references', 20127, 23531)
('proposed system', 5584, 19391)
('conclusion and future work', 19391, 20127)
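
The (heading, start, end) triples above are enough to slice each segment out of the text. A minimal sketch on a made-up string follows; the `sample` text and the `extract_segments` helper are assumptions for illustration, not part of the notebook.

```python
def extract_segments(text, prob_points):
    """Map each heading to the text between the heading term and the next segment's start."""
    return {heading: text[start + len(heading):end].strip()
            for heading, start, end in prob_points}

# Hypothetical two-section document.
sample = "abstract\nshort summary.\nintroduction\nsome background.\n"
intro_start = sample.index("introduction\n")
prob_points = [("abstract", 0, intro_start),
               ("introduction", intro_start, len(sample))]
segments = extract_segments(sample, prob_points)
print(segments["abstract"])
```

Skipping `len(heading)` characters at the start drops the heading term itself, so only the body of the segment remains.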


In [32]:
common_headings = ["introduction","abstract","conclusion"]

In [33]:
for heading, start_index_val, end_index_val in heading_prob_points:
    # Skip the heading term itself so only the body of the segment is kept
    segment_text = cleantext[start_index_val + len(heading):end_index_val]
    if heading in common_headings:
        print(heading.upper())
        print(segment_text)


INTRODUCTION

the arabic language is a highly inflected natural
language that has an enormous number of possible
words (othman et al., 2003). and although it
is the native language of over 300 million people,
it suffers from the lack of useful resources as opposed
to other languages, specially english and
until now there are no systems that cover the wide
range of possible spelling errors. fortunately the
qalb corpus (zaghouani et al., 2014) will help
enrich the resources for arabic language generally
and the spelling correction specifically by providing
an annotated corpus with corrected sentences
from user comments, native student essays, non-native
data and machine translation data. in this
work, we are trying to use this corpus to build an
error correction system that can cover a range of
spelling errors.
this paper is a system description paper that is
submitted in the emnlp 2014 conference shared
task "automatic arabic error correction" (mohit
et al., 2014) in the arabic nlp