# Article Spinning with a Markov Model

## Importing the Libraries

In this case, we will also be importing some new libraries:

* `textwrap`: a built-in Python module that provides functions for wrapping, filling, and formatting plain text

* `TreebankWordDetokenizer`: detokenizes a list of tokens back into a single string
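
As a quick illustration of what `textwrap` does, the sketch below wraps a made-up sentence to a narrow width. The sample string and the width of 40 are arbitrary choices for this demo, not values used elsewhere in the notebook.

```python
import textwrap

# Made-up sample text; any long string works.
sample = "Markov models can be used to spin an article by swapping words for likely alternatives."

# fill() returns a single string with newlines inserted so that
# no line is longer than `width` characters.
wrapped = textwrap.fill(sample, width=40)
print(wrapped)
```

Later in the notebook the same function is called without `width`, so the default of 70 characters applies.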

In [None]:
import numpy as np
import pandas as pd
import textwrap
import nltk
from nltk import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Importing Data and Preprocessing

In [None]:
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

--2023-12-04 13:50:42--  https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv
Resolving lazyprogrammer.me (lazyprogrammer.me)... 104.21.23.210, 172.67.213.166, 2606:4700:3031::6815:17d2, ...
Connecting to lazyprogrammer.me (lazyprogrammer.me)|104.21.23.210|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5085081 (4.8M) [text/csv]
Saving to: ‘bbc_text_cls.csv’


2023-12-04 13:50:43 (11.8 MB/s) - ‘bbc_text_cls.csv’ saved [5085081/5085081]



In [None]:
df = pd.read_csv('bbc_text_cls.csv')
df.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [None]:
labels = set(df['labels'])
labels

{'business', 'entertainment', 'politics', 'sport', 'tech'}

Let's say we are working with the business articles.

In [None]:
label = 'business'
texts = df[df['labels'] == label]['text']
texts.head()

0    Ad sales boost Time Warner profit\n\nQuarterly...
1    Dollar gains on Greenspan speech\n\nThe dollar...
2    Yukos unit buyer faces loan claim\n\nThe owner...
3    High fuel prices hit BA's profits\n\nBritish A...
4    Pernod takeover talk lifts Domecq\n\nShares in...
Name: text, dtype: object

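### Building the Model

We slide a window of three tokens over every line and count how often each middle word `t1` appears between the outer pair `(t0, t2)`. Normalizing these counts then gives the probability of each middle word given its two neighbours.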

In [None]:
probs = {}

for doc in texts:
  lines = doc.split("\n")
  for line in lines:
    tokens = word_tokenize(line)
    for i in range(len(tokens) - 2):
      t0 = tokens[i]
      t1 = tokens[i+1]
      t2 = tokens[i+2]
      key = (t0, t2)

      if key not in probs:
        probs[key] = {}

      if t1 not in probs[key]:
        probs[key][t1] = 1
      else:
        probs[key][t1] += 1

In [None]:
for key, d in probs.items():
  total = sum(d.values())
  for k, v in d.items():
    d[k] = v / total
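
To make the counting and normalizing steps concrete, here is a minimal sketch on a made-up three-line corpus. It assumes `str.split()` as a stand-in for `word_tokenize`, and the names `toy_lines`, `counts` and `toy_probs` are hypothetical, introduced only for this example.

```python
from collections import defaultdict

# Toy corpus (made up); str.split() stands in for word_tokenize here.
toy_lines = [
    "the US media giant grew",
    "the US oil giant grew",
    "the US oil giant fell",
]

# counts[(previous, next)][middle] = number of times this middle word was seen
counts = defaultdict(lambda: defaultdict(int))
for line in toy_lines:
    tokens = line.split()
    for i in range(len(tokens) - 2):
        t0, t1, t2 = tokens[i], tokens[i + 1], tokens[i + 2]
        counts[(t0, t2)][t1] += 1

# Normalize each count table into a probability distribution.
toy_probs = {
    key: {word: c / sum(d.values()) for word, c in d.items()}
    for key, d in counts.items()
}

print(toy_probs[("US", "giant")])
```

Between `US` and `giant`, the word `oil` appears twice and `media` once, so `oil` ends up with twice the probability of `media`.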

In [None]:
probs[('US', 'giant')]

{'media': 0.1,
 'telecoms': 0.1,
 'banking': 0.2,
 'foods': 0.1,
 'retail': 0.1,
 'oil': 0.2,
 'mortgage': 0.1,
 'agrochemical': 0.1}

In [None]:
texts.iloc[0].split("\n")

['Ad sales boost Time Warner profit',
 '',
 'Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.',
 '',
 'The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.',
 '',
 "Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers a

### Using the Detokenizer

In [None]:
detokenizer = TreebankWordDetokenizer()

In [None]:
texts.iloc[0].split("\n")[2]

'Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.'

In [None]:
detokenizer.detokenize(word_tokenize(texts.iloc[0].split("\n")[2]))

'Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.'

In [None]:
print((texts.iloc[0].split("\n")[2]) == (detokenizer.detokenize(word_tokenize(texts.iloc[0].split("\n")[2]))))

True


In [None]:
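def sample_word(d):
  # Draw a word from d (word -> probability) by walking the cumulative sum
  p0 = np.random.random()
  cumulative = 0

  for t, p in d.items():
    cumulative += p
    if p0 < cumulative:
      return t
  assert(False)  # should not be reached when the probabilities sum to 1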
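
One way to check that this kind of cumulative-sum sampling behaves as intended is to draw many samples and compare the observed frequencies with the input probabilities. The sketch below is a stdlib-only variation, not the notebook's function: it uses `random.Random` instead of NumPy, a hypothetical helper name `sample_from`, a made-up distribution, and a fallback return in place of the assertion.

```python
import random
from collections import Counter

def sample_from(dist, rng):
    """Draw one key from `dist` (a dict of word -> probability)."""
    threshold = rng.random()      # uniform in [0, 1)
    cumulative = 0.0
    for word, p in dist.items():
        cumulative += p
        if threshold < cumulative:
            return word
    # Floating-point rounding can leave the running sum slightly below 1;
    # fall back to the last word in that case.
    return word

rng = random.Random(0)
dist = {"banking": 0.2, "oil": 0.2, "media": 0.6}
draws = Counter(sample_from(dist, rng) for _ in range(10_000))
freqs = {w: n / 10_000 for w, n in draws.items()}
print(freqs)
```

With enough draws, each observed frequency should land close to the probability assigned to that word.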

To spin a line, we follow these steps:

* Tokenize the line.
* Start the output with the first token, since no word precedes it.
* Run a `while` loop over the index `i` for as long as `i < len(tokens) - 2`, i.e. while a full triple of tokens is still available.
  * Inside the loop, define three tokens: `t0 = tokens[i]`, `t1 = tokens[i+1]` and `t2 = tokens[i+2]`.

  * Build the key `key = (t0, t2)`, a `tuple` of the two outer words, which we use to predict a word for the middle position.

  * Use this key to **look up** the **probabilities**. If there is more than one candidate for the middle word, sample one and append three items: `t1`, the sampled `middle` word (wrapped in `< >` so the suggestion is visible next to the original) and `t2`, then advance `i` by 2. Otherwise **simply append** `t1` and advance `i` by 1.

* After the loop, if `i` points at the second-to-last token, the last token has not been written yet, so we append `tokens[-1]`.

* The function returns the detokenized line.

In [None]:
def spin_line(line):
  tokens = word_tokenize(line)
  i = 0
  output = [tokens[0]]
  while i < (len(tokens) - 2):
    t0 = tokens[i]
    t1 = tokens[i+1]
    t2 = tokens[i+2]

    key = (t0, t2)
    p_dist = probs[key]

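    # np.random.random() returns a value in [0, 1), so `< 3` is always true:
    # every position with more than one candidate gets a suggestion.
    # A threshold below 1 (e.g. 0.3) would spin only some of them.
    if len(p_dist) > 1 and np.random.random() < 3: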
      middle = sample_word(p_dist)
      output.append(t1)
      output.append(f'< {middle} >')
      output.append(t2)

      i += 2

    else:
      output.append(t1)
      i += 1

  if i == len(tokens) - 2:
    output.append(tokens[-1])

  return detokenizer.detokenize(output)
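
To see how the index moves through a line, here is a small stdlib-only sketch of the same loop working on a token list. It is a variation under stated assumptions, not the notebook's function: `spin_tokens` and `toy_table` are hypothetical names, `str.split()` replaces `word_tokenize`, `random.Random.choices` replaces `sample_word`, and unseen keys are treated as having no candidates via `dict.get`.

```python
import random

def spin_tokens(tokens, table, rng):
    """Stdlib-only variant of the spinning loop, working on a token list."""
    output = [tokens[0]]
    i = 0
    while i < len(tokens) - 2:
        t0, t1, t2 = tokens[i], tokens[i + 1], tokens[i + 2]
        choices = table.get((t0, t2), {})
        output.append(t1)
        if len(choices) > 1:
            # Pick an alternative middle word, weighted by probability.
            words = list(choices)
            middle = rng.choices(words, weights=[choices[w] for w in words])[0]
            output.append(f"<{middle}>")
            output.append(t2)
            i += 2   # t2 is already in the output, so skip past it
        else:
            i += 1
    if i == len(tokens) - 2:
        output.append(tokens[-1])   # last token not yet emitted
    return output

# Hypothetical probability table for the demo.
toy_table = {("US", "giant"): {"media": 0.5, "oil": 0.5}}
tokens = "the US media giant grew".split()
spun = spin_tokens(tokens, toy_table, random.Random(0))
print(" ".join(spun))

# Dropping the bracketed suggestions recovers the original tokens.
stripped = [t for t in spun if not t.startswith("<")]
print(stripped == tokens)
```

Because the original middle word is always kept and the suggestion is only added beside it, removing the `<...>` items gives back the input line.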

In the `spin_document` function:

* We split the document on `\n`. Because paragraphs are separated by `\n\n`, the split yields empty strings between the lines of text.

* Iterating over the resulting list, we therefore meet two kinds of line:

  * a line with content, e.g. `'Hello World'`
  * an empty line, `''`

* If the line has content, we spin it.

* Otherwise we keep it unchanged, so the blank lines between paragraphs are preserved.

* Finally, we join the lines back together and `return '\n'.join(output)`.

In [None]:
def spin_document(doc):
  lines = doc.split('\n')
  output = []

  for line in lines:
    if line:
      new_line = spin_line(line)
    else:
      new_line = line
    output.append(new_line)

  return "\n".join(output)
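
The split-and-join behaviour can be checked on a small made-up document. In this sketch, upper-casing is a stand-in for `spin_line`, chosen only so the example runs without NLTK.

```python
doc = "Headline\n\nFirst paragraph.\n\nSecond paragraph."

lines = doc.split("\n")
print(lines)
# ['Headline', '', 'First paragraph.', '', 'Second paragraph.']

# Transform only non-empty lines (upper-casing stands in for spin_line).
output = [line.upper() if line else line for line in lines]
result = "\n".join(output)
print(result)
```

The empty strings pass through untouched, so the joined result keeps the same paragraph breaks as the input.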

In [None]:
np.random.seed(1234)

In [None]:
i = np.random.choice(texts.shape[0])
doc = texts.iloc[i]
new_doc = spin_document(doc)

In [None]:
print(textwrap.fill(new_doc, replace_whitespace=False, fix_sentence_endings=True))

Bombardier chief to leave <discuss> company

Shares in train <October>
and plane-making <drink> giant Bombardier <Bombardier> have fallen
<continued> to a 10-year <five-year> low following <following> the
departure <meeting> of its <Japan> chief executive and two members
<thirds> of the <the> board.

Paul Tellier <Sheard>, who <which> was
also Bombardier's president <prediction>, left <"> the company amid
<has> an ongoing <ongoing> restructuring . Laurent Beaudoin, part
<part> of the <his> family that controls <unless> the Montreal-based
<little-known> firm, will take <begin> on the role <order> of CEO
<just> under a newly created management structure <fee>. Analysts
<Traders> said the resignations seem to have stemmed <moved> from a
boardroom <tax> dispute . Under <"> Mr Tellier <Schroeder>'s tenure
<loss> at the company <prosecutor>, which <also> began in January
<November> 2003, plans <starts> to cut <strengthen> the worldwide
workforce of 75,000 <redundancy> by almost <allowing> a 