# Introduction

Text from Wikipedia on two topics is gathered and combined into a composite document. The scripts from the Bayesian Unsupervised Topic Segmentation paper are downloaded and modified to run on the composite document. All the scripts, together with the composite document, are then pushed to a GitHub repository.


> Importing Wikipedia API to gather text data.



In [1]:
!pip install wikipedia
import wikipedia

Collecting wikipedia
  Downloading https://files.pythonhosted.org/packages/67/35/25e68fbc99e672127cc6fbb14b8ec1ba3dfef035bf1e4c90f78f24a80b7d/wikipedia-1.4.0.tar.gz
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-cp36-none-any.whl size=11686 sha256=c82e7d768b1676b221773a425a7d612e7dcdf39c6737da0af9ace08b533a944f
  Stored in directory: /root/.cache/pip/wheels/87/2a/18/4e471fd96d12114d16fe4a446d00c3b38fb9efcb744bd31f4a
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


## 1) Topic 1 - Atlassian

Gather data for the first topic.

In [2]:
page1 = wikipedia.page("Atlassian")                #fetch the page once and reuse it
topic1 = page1.content                             #get the entire page content of the topic
related_links1 = page1.links                       #get all the related links in the wikipedia page of the topic

for i in related_links1[:4]:                       #only the first few links are considered for attaining the required word count
  try:
    related_content1 = wikipedia.page(i).content   #get the entire page content of the related links
    topic1 = topic1 + related_content1             #combine the original content of the topic with the content of the related links
  except Exception:                                #skip links that are broken or ambiguous instead of stopping the loop
    pass

len(topic1.split())                                #display word count (whitespace-separated tokens)

28162
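The gathering step above can be wrapped in a small reusable function. The following is only a sketch: `gather_topic`, `get_content` and `get_links` are hypothetical names, and the fetchers are passed in as arguments so that the example runs offline with stand-in data. With the real package one would presumably pass `lambda t: wikipedia.page(t).content` and `lambda t: wikipedia.page(t).links`.

```python
def gather_topic(title, max_links, get_content, get_links):
    """Concatenate a page's text with the text of its first `max_links` links."""
    text = get_content(title)
    for link in get_links(title)[:max_links]:
        try:
            text += get_content(link)
        except Exception:
            # Broken or ambiguous links are skipped, as in the cell above.
            continue
    return text


# Stand-in data (not real Wikipedia content) so the sketch runs without network access.
fake_pages = {"A": "alpha beta ", "B": "gamma ", "C": "delta epsilon "}
fake_links = {"A": ["B", "missing", "C"]}

combined = gather_topic("A", 3, fake_pages.__getitem__, fake_links.__getitem__)
print(len(combined.split()))  # word count of the combined text
```

The link named "missing" raises a `KeyError` in the stand-in fetcher and is skipped, which mirrors how broken links are ignored in the notebook.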

In [3]:
print(topic1)

Atlassian Corporation Plc () is an Australian enterprise software company that develops products for software developers, project managers, and content management. It is best known for its issue tracking application, Jira, and its team collaboration and wiki product, Confluence. Atlassian serves over 135,000 customers.


== History ==
Mike Cannon-Brookes and Scott Farquhar founded Atlassian in 2002. The pair met while studying at the University of New South Wales in Sydney. They bootstrapped the company for several years, financing the startup with a $10,000 credit card debt.The name derives from the Titan Atlas from Greek mythology who had been punished to hold up the Heavens after the Greek gods had overthrown the Titans. This was reflected in the company's logo used from 2011 through to the 2017 re-branding through a blue X-shaped figure holding up what is shown to be the bottom of the sky.Atlassian released its flagship product, Jira – a project and issue tracker, in 2002. In 2004,

## 2) Topic 2 - Cochlear Limited

Gather data for the second topic.

In [4]:
page2 = wikipedia.page("Cochlear Limited")                #fetch the page once and reuse it
topic2 = page2.content                                    #get the entire page content of the topic
related_links2 = page2.links                              #get all the related links in the wikipedia page of the topic

for j in related_links2[:11]:                             #only the first few links are considered for attaining the required word count
  try:
    related_content2 = wikipedia.page(j).content          #get the entire page content of the related links
    topic2 = topic2 + related_content2                    #combine the original content of the topic with the content of the related links
  except Exception:                                       #skip links that are broken or ambiguous instead of stopping the loop
    pass

len(topic2.split())                                       #display word count (whitespace-separated tokens)

26335

In [5]:
print(topic2)

Cochlear (ASX: COH) is a medical device company that designs, manufactures and supplies the Nucleus cochlear implant, the Hybrid electro-acoustic implant and the Baha bone conduction implant.Based in Sydney, Cochlear was formed in 1981 with finance from the Australian government to commercialise the implants pioneered by Dr Graeme Clark. Today, the company holds over two-thirds of the worldwide hearing implant market, with more than 250,000 people receiving one of Cochlear's implants since 1982.Cochlear was named Australia's most innovative company in 2002 and 2003, and one of the world's most innovative companies by Forbes in 2011.


== Products ==
Cochlear produces three implants for different medical situations.
Nucleus is a system combining an electrical simulation device that is surgically implanted behind a patient's ear, a processor that captures sounds, and an electrode array that relays the sounds to the brain. It is a direct descendant of the original cochlear implants, also 

## 3) Choi Notation

The Wikipedia content gathered for both topics already contains segment headings, for example == Blockchain Technology ==, === Australia's capital markets ===, ==== Other ====. These headings have to be replaced by '==========' for the scripts from the Bayesian Unsupervised Topic Segmentation paper to run successfully on these texts. The line of ten '=' signs ('==========') is the Choi notation for a boundary between segments.

The regex library (`re`) is used to make these changes in the texts. In addition to '==========', a dummy '-,' marker is appended to each boundary. It acts as a separator so that the segments of each topic can be split into the rows of a dataframe when creating the composite document.




In [0]:
import re

#replacing segment headings with choi notation in first topic
#(deepest heading level first, so that '== .+ ==' does not match inside a '===' or '====' heading)

topic1_replaced = re.sub('==== .+ ====', '==========-,',topic1)
topic1_replaced = re.sub('=== .+ ===', '==========-,',topic1_replaced)
topic1_replaced = re.sub('== .+ ==','==========-,',topic1_replaced)

#replacing segment headings with choi notation in second topic

topic2_replaced = re.sub('==== .+ ====', '==========-,',topic2)
topic2_replaced = re.sub('=== .+ ===', '==========-,',topic2_replaced)
topic2_replaced = re.sub('== .+ ==','==========-,',topic2_replaced)
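As an alternative to three ordered substitutions per topic, the heading levels can be handled by one pattern anchored to whole lines. This is a sketch under the assumption that every heading sits on its own line in the form `== Title ==`; `to_choi`, `HEADING_RE` and the sample text are illustrative names and data, not part of the original notebook.

```python
import re

CHOI_BOUNDARY = "=========="   # ten '=' signs
SEPARATOR = "-,"               # dummy separator used later for splitting

# Two to four '=' on each side, anchored to a full line via re.MULTILINE.
HEADING_RE = re.compile(r"^={2,4} .+ ={2,4}$", flags=re.MULTILINE)


def to_choi(text):
    """Replace every heading line with the Choi boundary plus the separator."""
    return HEADING_RE.sub(CHOI_BOUNDARY + SEPARATOR, text)


sample = (
    "Intro text.\n\n\n"
    "== History ==\nFounded in 2002.\n\n\n"
    "=== Funding ===\nRaised money.\n"
)
converted = to_choi(sample)
print(converted)
```

Because the pattern matches the whole line, the order in which heading levels are processed no longer matters.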

In [7]:
print(topic1_replaced)

Atlassian Corporation Plc () is an Australian enterprise software company that develops products for software developers, project managers, and content management. It is best known for its issue tracking application, Jira, and its team collaboration and wiki product, Confluence. Atlassian serves over 135,000 customers.


Mike Cannon-Brookes and Scott Farquhar founded Atlassian in 2002. The pair met while studying at the University of New South Wales in Sydney. They bootstrapped the company for several years, financing the startup with a $10,000 credit card debt.The name derives from the Titan Atlas from Greek mythology who had been punished to hold up the Heavens after the Greek gods had overthrown the Titans. This was reflected in the company's logo used from 2011 through to the 2017 re-branding through a blue X-shaped figure holding up what is shown to be the bottom of the sky.Atlassian released its flagship product, Jira – a project and issue tracker, in 2002. In 2004, it released C

In [8]:
print(topic2_replaced)

Cochlear (ASX: COH) is a medical device company that designs, manufactures and supplies the Nucleus cochlear implant, the Hybrid electro-acoustic implant and the Baha bone conduction implant.Based in Sydney, Cochlear was formed in 1981 with finance from the Australian government to commercialise the implants pioneered by Dr Graeme Clark. Today, the company holds over two-thirds of the worldwide hearing implant market, with more than 250,000 people receiving one of Cochlear's implants since 1982.Cochlear was named Australia's most innovative company in 2002 and 2003, and one of the world's most innovative companies by Forbes in 2011.


Cochlear produces three implants for different medical situations.
Nucleus is a system combining an electrical simulation device that is surgically implanted behind a patient's ear, a processor that captures sounds, and an electrode array that relays the sounds to the brain. It is a direct descendant of the original cochlear implants, also known as Nucleu

## 4) Strings to Dataframes

Create a dataframe for each topic, with one segment per row, by splitting on the dummy '-,' separator that was appended to each Choi boundary '=========='. These two dataframes are used to build the composite document.

In [9]:
import pandas as pd
topic1_df = pd.DataFrame({'segments' : topic1_replaced.split('-,')}) #create dataframe with segments as the single column and separator '-,'
topic1_df = topic1_df[topic1_df['segments'] != '\n\n\n==========']   #remove blank segments, if any
topic1_df = topic1_df.reset_index(drop=True)                         #resetting the index after changes for iterations, if any
topic1_df

Unnamed: 0,segments
0,Atlassian Corporation Plc () is an Australian ...
1,\nMike Cannon-Brookes and Scott Farquhar found...
2,\nAtlassian does not have a traditional sales ...
3,"\nIn 2010, Atlassian acquired Bitbucket, a hos..."
4,"\nIn March 2011, the company raised $1 million..."
5,"\nOfficial websiteAccel, formerly known as Acc..."
6,"\nIn 1983, Accel was founded by Arthur Patters..."
7,\nAccel is a venture capital firm that concent...
8,"\nAccel works with seed, early and growth-stag..."
9,\nAccel's US fund is headquartered in Palo Alt...


In [10]:
import pandas as pd
topic2_df = pd.DataFrame({'segments' : topic2_replaced.split('-,')}) #create dataframe with segments as the single column and separator '-,'
topic2_df = topic2_df[topic2_df['segments'] != '\n\n\n==========']   #remove blank segments, if any
topic2_df = topic2_df.reset_index(drop=True)                         #resetting the index after changes for iterations, if any
topic2_df

Unnamed: 0,segments
0,Cochlear (ASX: COH) is a medical device compan...
1,\nCochlear produces three implants for differe...
2,\nCochlear manufactures principally in Sweden ...
3,\nCochlear Bone Anchored Solutions ABAGL Energ...
4,\n\nThe Australian Gas Light Company was forme...
5,\nAGL has a diverse power generation portfolio...
6,\nIn 2015 the EPA ordered the suspension of AG...
7,"\nIn August 2017, it was announced that the Co..."
8,"\nIn May 2017, it was announced that construct..."
9,"\nIn June 2017, AGL announced the development ..."
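The blank-segment filter above compares each row against one exact string. A slightly more tolerant variant, sketched here in plain Python, drops any piece that contains nothing but the boundary and whitespace. `split_segments` is a hypothetical helper; its result could be passed to `pd.DataFrame({'segments': ...})` as in the cells above.

```python
def split_segments(text, separator="-,", boundary="=========="):
    """Split on the dummy separator and drop pieces with no real content."""
    segments = []
    for piece in text.split(separator):
        # A piece is blank if only whitespace remains once the boundary is removed.
        if piece.replace(boundary, "").strip():
            segments.append(piece)
    return segments


sample = "A.\n\n\n==========-,\n\n\n==========-,\nB.\n"
segments = split_segments(sample)
print(len(segments))  # number of non-blank segments
```

Each kept segment still carries its trailing '==========' line, so the boundaries survive when the segments are later concatenated.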


## 5) Composite String

Create a composite document from the two dataframes in the form of

topic1_segment1,  topic2_segment1,  topic1_segment2,  topic2_segment2, ... topic1_segmentN,  topic2_segmentN.

In [11]:
c_string = ""                                    #new string to store text from the 2 dataframes
n_rows = min(len(topic1_df), len(topic2_df))     #the least number of rows out of the 2 dataframes

for i in range(n_rows):                          #alternate between the segments of the two topics
  c_string = c_string + topic1_df['segments'][i]
  c_string = c_string + topic2_df['segments'][i]

print(c_string)

Atlassian Corporation Plc () is an Australian enterprise software company that develops products for software developers, project managers, and content management. It is best known for its issue tracking application, Jira, and its team collaboration and wiki product, Confluence. Atlassian serves over 135,000 customers.




Mike Cannon-Brookes and Scott Farquhar founded Atlassian in 2002. The pair met while studying at the University of New South Wales in Sydney. They bootstrapped the company for several years, financing the startup with a $10,000 credit card debt.The name derives from the Titan Atlas from Greek mythology who had been punished to hold up the Heavens after the Greek gods had overthrown the Titans. This was reflected in the company's logo used from 2011 through to the 2017 re-branding through a blue X-shaped figure holding up what is shown to be the bottom of the sky.Atlassian released its flagship product, Jira – a project and issue tracker, in 2002. In 2004, it released

In [12]:
len(c_string.split()) #display word count excluding blank spaces  

50163
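The interleaving pattern topic1_segment1, topic2_segment1, topic1_segment2, ... can also be written with `zip`, which stops at the shorter of the two inputs without an explicit row count. This is a minimal sketch on plain lists; `interleave` is an illustrative name, and a dataframe column could be supplied via `topic1_df['segments'].tolist()`.

```python
def interleave(segments_a, segments_b):
    """Alternate segments from two lists, stopping at the shorter list."""
    parts = []
    for a, b in zip(segments_a, segments_b):
        parts.append(a)
        parts.append(b)
    return "".join(parts)


composite = interleave(["a1 ", "a2 ", "a3 "], ["b1 ", "b2 "])
print(composite)
```

Building a list and joining once avoids repeatedly copying a growing string, which can matter for documents of tens of thousands of words.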

## 6) Saving Documents

Save the two topic strings and the composite string as .txt files.

In [0]:
with open('atlassian.txt','w') as f:
  f.write(topic1_replaced)
  
with open('cochlear.txt','w') as f:
  f.write(topic2_replaced)

with open('composite.txt','w') as f:
  f.write(c_string)
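Since the Wikipedia text contains non-ASCII characters (for example the dash in "Jira – a project and issue tracker"), it may be safer to state the encoding explicitly when writing. The sketch below assumes UTF-8 is acceptable to the downstream scripts, which the source does not confirm; `save_documents` is a hypothetical helper and the demo writes to a temporary directory.

```python
import tempfile
from pathlib import Path


def save_documents(documents, out_dir):
    """Write each {filename: text} pair to out_dir as UTF-8 and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, text in documents.items():
        path = out / name
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    written = save_documents({"composite.txt": "Jira \u2013 a tracker\n"}, tmp)
    round_trip = written[0].read_text(encoding="utf-8")

print(round_trip)
```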

## 7) Get the Code

Download the scripts from the link provided in the Bayesian Unsupervised Topic Segmentation paper.

In [14]:
!wget http://groups.csail.mit.edu/rbg/code/bayesseg/bayesseg.tar.gz #download from the link

!tar zxvf bayesseg.tar.gz #unzip the file

--2019-09-10 10:04:40--  http://groups.csail.mit.edu/rbg/code/bayesseg/bayesseg.tar.gz
Resolving groups.csail.mit.edu (groups.csail.mit.edu)... 128.30.2.44
Connecting to groups.csail.mit.edu (groups.csail.mit.edu)|128.30.2.44|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4687672 (4.5M) [application/x-gzip]
Saving to: ‘bayesseg.tar.gz’


2019-09-10 10:04:41 (6.64 MB/s) - ‘bayesseg.tar.gz’ saved [4687672/4687672]

bayesseg/baselines/
bayesseg/baselines/textseg-1.211/
bayesseg/baselines/textseg-1.211/data/
bayesseg/baselines/textseg-1.211/data/comp/
bayesseg/baselines/textseg-1.211/data/org/
bayesseg/baselines/textseg-1.211/data/t/
bayesseg/baselines/textseg-1.211/doc/
bayesseg/baselines/textseg-1.211/doc/eng/
bayesseg/classes/
bayesseg/classes/edu/
bayesseg/classes/edu/mit/
bayesseg/classes/edu/mit/multimodal/
bayesseg/classes/edu/mit/multimodal/motifs/
bayesseg/classes/edu/mit/nlp/
bayesseg/classes/edu/mit/nlp/segmenter/
bayesseg/classes/edu/mit/nlp/segmenter/

## 8) Modify the Code

Move the composite file into a new folder under 'data' so that the scripts can run on it.

In [15]:
!mkdir /content/bayesseg/data/assignment/                                                #create a directory under 'data' folder

import shutil
shutil.move("/content/composite.txt", "/content/bayesseg/data/assignment/composite.txt") #move the composite file under 'data' folder

'/content/bayesseg/data/assignment/composite.txt'
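The `!mkdir` call above fails if the cell is run a second time, because the directory already exists. A rerunnable variant can create the directory from Python instead. This is a sketch: `move_into` is an illustrative helper, and the demo uses a temporary directory rather than the `/content` paths of the notebook.

```python
import shutil
import tempfile
from pathlib import Path


def move_into(src, dest_dir):
    """Create dest_dir if needed, then move src into it under the same file name."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)   # no error if it already exists
    return Path(shutil.move(str(src), str(dest_dir / Path(src).name)))


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "composite.txt"
    src.write_text("demo", encoding="utf-8")
    moved = move_into(src, Path(tmp) / "bayesseg" / "data" / "assignment")
    moved_exists = moved.exists()
    moved_name = moved.name
    src_gone = not src.exists()

print(moved_name, moved_exists)
```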

The data directory and the file suffix in the 'eval' script must be changed for it to run on the composite document. The 'eval' script is used for evaluating the segments in the text files.

In [16]:
%cd /content/bayesseg/
!rm eval #delete existing eval script

#create new eval script with changes to the directory and suffix of the file
!echo 'CLASSPATH="classes:lib/colt.jar:lib/lingpipe-3.4.0.jar:lib/MinCutSeg.jar:lib/mtj.jar:lib/options.jar:lib/log4j-1.2.14.jar"' >> eval
!echo 'java -cp ${CLASSPATH} edu.mit.nlp.segmenter.SegTester -config $1 -dir data/assignment -suff txt' >> eval

/content/bayesseg
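The two `echo` lines above can also be generated from Python, which makes the data directory and suffix easy to change. The sketch below only builds the script text from the classpath entries and Java class shown in the cell above; `build_eval_script` is a hypothetical helper, and writing the result to the `eval` file is left to the caller.

```python
def build_eval_script(data_dir="data/assignment", suffix="txt"):
    """Return the text of an eval script pointing at data_dir and suffix."""
    classpath = ":".join([
        "classes",
        "lib/colt.jar",
        "lib/lingpipe-3.4.0.jar",
        "lib/MinCutSeg.jar",
        "lib/mtj.jar",
        "lib/options.jar",
        "lib/log4j-1.2.14.jar",
    ])
    lines = [
        f'CLASSPATH="{classpath}"',
        # ${CLASSPATH} and $1 are left for the shell to expand at run time.
        "java -cp ${CLASSPATH} edu.mit.nlp.segmenter.SegTester "
        f"-config $1 -dir {data_dir} -suff {suffix}",
    ]
    return "\n".join(lines) + "\n"


script = build_eval_script()
print(script)
```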


## 9) Push to GitHub

All the downloaded files, along with the composite document and the new 'eval' script, are pushed to a GitHub repository so that the scripts can be run on the composite document in a virtual machine.

In [0]:
!git config --global user.email "you@example.com"
!git config --global user.name "Your Name"

In [18]:
!git init 
!git add *
!git commit -m "first change"
!git remote add origin https://USERNAME:TOKEN@github.com/featgautham/My-Bayessian-Segmentation.git #replace USERNAME:TOKEN with your own credentials; do not save real ones in the notebook
!git push -u origin master

Initialized empty Git repository in /content/bayesseg/.git/
[master (root-commit) 6b3399d] first change
 541 files changed, 79007 insertions(+)
 create mode 100644 README
 create mode 100644 baselines/textseg-1.211.tar
 create mode 100644 baselines/textseg-1.211/COPYING
 create mode 100644 baselines/textseg-1.211/ChangeLog
 create mode 100644 baselines/textseg-1.211/ESeg
 create mode 100644 baselines/textseg-1.211/Experiments
 create mode 100644 baselines/textseg-1.211/Install-guide
 create mode 100644 baselines/textseg-1.211/JSeg
 create mode 100644 baselines/textseg-1.211/Makefile
 create mode 100644 baselines/textseg-1.211/PStemmer.class
 create mode 100644 baselines/textseg-1.211/PStemmer.java
 create mode 100644 baselines/textseg-1.211/README
 create mode 100644 baselines/textseg-1.211/README.ja
 create mode 100644 baselines/textseg-1.211/Seg
 create mode 100644 baselines/textseg-1.211/cstemmer
 create mode 100644 baselines/textseg-1.211/cstemmer.pl
 create mode 100644 baselines/t
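
Credentials written directly into a remote URL end up in the saved notebook. One option, sketched here, is to read a token from an environment variable when building the URL. The variable name `GITHUB_TOKEN` and the helper `build_remote_url` are assumptions for illustration; the token would still be stored in `.git/config` of the working copy, so this only keeps it out of the notebook itself.

```python
import os
from urllib.parse import quote


def build_remote_url(user, repo, token_var="GITHUB_TOKEN", env=None):
    """Build an HTTPS remote URL using a token read from the environment."""
    env = os.environ if env is None else env
    token = env.get(token_var)
    if not token:
        raise RuntimeError(f"environment variable {token_var} is not set")
    # Percent-encode both parts so special characters cannot break the URL.
    return (
        f"https://{quote(user, safe='')}:{quote(token, safe='')}"
        f"@github.com/{user}/{repo}.git"
    )


# Demo with a dummy mapping instead of the real environment.
url = build_remote_url(
    "featgautham", "My-Bayessian-Segmentation", env={"GITHUB_TOKEN": "dummy"}
)
print(url)
```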