<a href="https://colab.research.google.com/github/ayoubbensakhria/finance_algo/blob/master/Rautor_rewriter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Install required packages and do necessary imports


## 1.0. Objectives
* Build a **SOLID and EFFECTIVE** text generator/rewriter based on human- and machine-generated sequences
* Process: split each sentence into pillars (nouns, spelled `pilar` in the code) and the interstitial sequences between them, then replace the interstitial sequences while preserving the pillars so that the sense of the sentence stays intact
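
The process can be illustrated on a toy sentence. This is a minimal sketch, not the notebook's implementation: it works on whitespace tokens, takes the pillars as a given set instead of tagging them with NLTK, and `rewrite` and `replacements` are hypothetical names introduced for the example.

```python
def rewrite(tokens, pillars, replacements):
    """Keep pillars in place and swap the tokens between them when a replacement exists."""
    out, gap, last_pillar = [], [], None
    for tok in tokens:
        if tok in pillars:
            if last_pillar is not None:
                # The gap between two pillars is the interstitial sequence.
                gap = replacements.get((last_pillar, tok), gap)
            out.extend(gap)
            out.append(tok)
            gap, last_pillar = [], tok
        else:
            gap.append(tok)
    # Tokens after the last pillar are kept unchanged.
    out.extend(gap)
    return out


tokens = "marketing has an effect on sales".split()
pillars = {"marketing", "effect", "sales"}
# Hypothetical lookup: (begin pillar, end pillar) -> replacement sequence.
replacements = {("marketing", "effect"): ["shows", "a", "measurable"]}

result = rewrite(tokens, pillars, replacements)
print(" ".join(result))  # marketing shows a measurable effect on sales
```

Only the sequence between `marketing` and `effect` changes; the pair `effect`/`sales` has no replacement, so its original sequence is kept.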

- **input**: **kw**: the effectiveness of marketing on boosting sales, **mode**: academic, **max_length**: 2000, **variants**: 5, **media_placeholders**: True

### 1.0.1. Stages
*Stage 1*
- Input text pillars -> output the original text intact when the same reference is specified.

*Stage 2*
- Input text pillars -> output two concatenated texts when two references are specified.

*Stage 3*
- Input text pillars -> output multiple concatenated texts according to the references specified at input.

*Stage 4*
- Pillar graph builder using an LSTM model
- Machine sequence builder (semantic regression ML sentence generator)
- Semantic classifier (ML) to choose the best-matching semantic sequence

*Stage 5*

Fresh content generator:

* Download and learn from fresh content
* Generate an optimized pillar graph
* Generate high-quality, unique, fresh content
In [None]:
!pip install --user -U nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nltk
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
Collecting regex>=2021.8.3
  Downloading regex-2022.6.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (749 kB)
Installing collected packages: regex, nltk
Successfully installed nltk-3.7 regex-2022.6.2


In [None]:
import pandas as pd
import numpy as np
import nltk
import json
import time
import random
from google.colab import drive
from lxml import html
from lxml.html.clean import clean_html, Cleaner
from nltk.tokenize import word_tokenize, sent_tokenize
from pandas import json_normalize
import re

drive.mount('/content/drive/')

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

Mounted at /content/drive/


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
# file name
filename = "learn_content_export"

# first word 
FIRST = 'FST '
# last word 
LAST = ' LST'

## 1.1. Analyse a text
 
* Get sentences
* For each sentence
  * if the sentence doesn't begin with a pillar, create one (the `FIRST` marker)
  * if the sentence doesn't end with a pillar, create one (the `LAST` marker)
  * get the pillars
  * get the sequence (vector) that runs from each pillar to the next one
  * add the pillars to the pillar collection (text)

----
### Result: 

Each beginning pillar is stored with its sequences (SEQ) and its ending pillar.

Database layout: 
PILLAR 1 - SEQ - PILLAR 2 - SEQ - PILLAR 3

| Begin pillar | SEQ_H | SEQ_M | End pillar |
|---|---|---|---|
| PILLAR 1 | HUMAN SEQ LIST | MACHINE SEQ LIST | PILLAR 2 |

Variables: 

* *b_pilar*: beginning pillar
* *e_pilar*: end pillar
* *h_seq*: human sequence
* *m_seq*: machine-generated sequence
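
As a rough illustration of the layout above, the sketch below turns one tokenized sentence into `b_pilar` / `h_seq` / `e_pilar` rows. It assumes whitespace tokens and a hand-picked pillar set rather than NLTK tagging, it simplifies the sentinel markers to `FST` and `LST` without padding spaces, and `to_rows` is a hypothetical helper name.

```python
FIRST, LAST = "FST", "LST"  # simplified sentinel pillars (assumption: no padding spaces)


def to_rows(tokens, pillars):
    """Turn one tokenized sentence into rows of begin pillar, sequence, end pillar."""
    # Pad with sentinels so that every sequence sits between two pillars.
    if tokens[0] not in pillars:
        tokens = [FIRST] + tokens
    if tokens[-1] not in pillars:
        tokens = tokens + [LAST]
    marks = set(pillars) | {FIRST, LAST}
    rows, b_pilar, seq = [], None, []
    for tok in tokens:
        if tok in marks:
            if b_pilar is not None:
                rows.append({"b_pilar": b_pilar, "h_seq": " ".join(seq), "e_pilar": tok})
            b_pilar, seq = tok, []
        else:
            seq.append(tok)
    return rows


rows = to_rows(
    "the future of tv is in the hands of viewers".split(),
    {"future", "tv", "hands", "viewers"},
)
for row in rows:
    print(row)
```

The sentence starts with `the`, which is not a pillar, so the first row begins at the `FST` sentinel; it already ends on a pillar, so no `LST` sentinel is added.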


# 2. Classes and Functions

## 2.1. Classes

In [None]:
class Sequence (object):

  def __init__(self, *args):

    self.sequence = args[0]['sequence'] if args else ''
    self.longReference = args[0]['longReference'] if args else 'web'
    self.shortReference = args[0]['shortReference'] if args  else 'web'
    self.verified = args[0]['verified'] if args else False

  def to_json(self):
    return {
        'sequence': self.sequence,
        'longReference': self.longReference,
        'shortReference': self.shortReference,
        'verified': self.verified,
    }
   
# Example usage (kept disabled):
# payload = json.loads('{"sequence": "", "longReference": "web", "shortReference": "web", "verified": false}')
# print(type(payload))
# seq = Sequence(payload)
# print(type(seq.verified))

## 2.2 Functions

In [None]:
def get_pilars(text):
  words = word_tokenize(text)
  pilars = []
  for pilar in nltk.pos_tag(words):
    if pilar[1] in ['NN', 'NNS', 'NNP']:
      pilars.append(pilar[0])
  return pilars 

def is_pilar(word):
  pilar = nltk.pos_tag(word_tokenize(word))
  return pilar[0][1] in ['NN', 'NNS', 'NNP']

def get_adjectives(text):
  words = word_tokenize(text)
  adjectives = []
  for adj in nltk.pos_tag(words):
    if adj[1] == 'JJ':
      adjectives.append(adj[0])
  return adjectives 

def get_sequence_between(s1, s2, text):
  # escape both delimiters so that characters such as '(' or '.' are matched literally
  result = re.search('{s1}(.*){s2}'.format(s1=re.escape(s1), s2=re.escape(s2)), text)
  # re.search returns None when the delimiters are not found
  return result.group(1) if result else None

def get_iterstitial_seqs(pilars, text):
  # TODO
  # fix data loss problem
  # fix one char '(' ')' seq problem
  _text = text
  for pilar in pilars:
    _text = _text.replace(pilar, ',')
  if _text and ',' in _text:
    _results = _text.split(',')
    _results.pop()
    _results.pop(0)
    return _results
  else:
    return []

def get_suggestions(b_pilar, e_pilar, dataframe):
  # return the h_seq lists stored for this (b_pilar, e_pilar) pair as a plain list,
  # so that callers can test it for emptiness and index it by position
  mask = (dataframe['b_pilar']==b_pilar) & (dataframe['e_pilar']==e_pilar)
  return dataframe.loc[mask, 'h_seq'].tolist()

def get_suggestions_mono(b_pilar, dataframe):
  suggestions = []
  if (b_pilar in dataframe['b_pilar'].values):
    for suggestion in dataframe.loc[(dataframe['b_pilar']==b_pilar)]['h_seq'].values: 
      suggestions.append(suggestion)
  return suggestions

def text_to_parts(text, longReference, shortReference, dataframe):
  df = dataframe
  sentences = sent_tokenize(text)
  for sent in sentences:
    sentence = sent.strip()
    sentence_words = word_tokenize(sentence)
    # skip sentences shorter than 3 tokens
    if (len(sentence_words)<3):
      continue
    # check pilars at the beginning and end
    if not is_pilar(sentence_words[0]):
      sentence = FIRST + sentence
      # case 'pilar is not doing' 
    if sentence_words [-1] != '.' and not is_pilar(sentence_words[-1]):
      sentence = sentence + LAST + '.'
      # case 'pilar is not doing.' 
    if sentence_words [-1] == '.' and not is_pilar(sentence_words[-2]):
      sentence = sentence.replace('.', LAST) + '.'

    # get pilars  
    pilars = get_pilars(sentence)
    # get sequences 
    sequences = get_iterstitial_seqs(pilars, sentence)
    # assert n_pilars = n_sequence + 1
    if(len(pilars) != len(sequences)+1):
      continue

    for i in range (len(pilars)-1):
      # sequence found between pilars[i] and pilars[i+1]
      sequence = Sequence({
        'sequence': sequences[i],
        'longReference': longReference,
        'shortReference': shortReference,
        'verified': True
      })
      # look for an existing row holding this exact (b_pilar, e_pilar) pair
      mask = (df['b_pilar']==pilars[i]) & (df['e_pilar']==pilars[i+1])
      match = df.loc[mask].index
      if len(match) > 0:
        # h_seq is a list of dicts; skip duplicates
        h_seq = df.at[match[0], 'h_seq']
        if sequence.to_json() not in h_seq:
          h_seq.append(sequence.to_json())
      else:
        row = {
            'b_pilar': pilars[i],
            'e_pilar': pilars[i+1],
            'h_seq': [sequence.to_json()],
            'm_seq': []
        }
        # DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions
        df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
  return df
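
The TODO in `get_iterstitial_seqs` mentions a data-loss problem. One likely cause is the use of `','` as a placeholder: a sentence that already contains a comma yields an extra split, so the count check in `text_to_parts` discards it. The sketch below is one possible alternative, not the notebook's method; it splits on the pilars themselves with `re.split`. The name `interstitial_seqs` is hypothetical, and the `\b` word boundaries assume that pilars start and end with word characters.

```python
import re


def interstitial_seqs(pilars, text):
    """Return the text found between consecutive pilars, punctuation included."""
    if not pilars:
        return []
    # Longest alternatives first so that a longer pilar wins over a shorter prefix.
    ordered = sorted(set(pilars), key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"
    # With a non-capturing group, re.split returns only the text around the matches.
    parts = re.split(pattern, text)
    # Drop whatever precedes the first pilar and follows the last one.
    return parts[1:-1]


seqs = interstitial_seqs(
    ["viewers", "home", "theatre"],
    "viewers, with home theatre systems",
)
print(seqs)  # [', with ', ' ']
```

Three pilars give two sequences here, so the `n_pilars = n_sequences + 1` check still holds even though the sentence contains a comma.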

In [None]:
dataframe = pd.DataFrame(columns=['b_pilar', 'e_pilar', 'h_seq','m_seq'])
df = pd.read_csv('https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv')
df_learn = df[:500]
df_test = df[500:]  # start at 500 so that no row is skipped between the two splits
print(len(df))

2225


# 3. Compile

In [None]:
for index, row in df_learn.iterrows():
  dataframe = text_to_parts(row['text'], 'BBC News 2022 Long', 'BBC News', dataframe)
dataframe.to_csv('/content/drive/MyDrive/data/{filename}.csv'.format(filename=filename))
print(len(dataframe))
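
The loop above calls `text_to_parts` once per article, and that function adds rows to the DataFrame one at a time. One alternative, sketched here under assumptions, collects the sequences in a dictionary keyed by `(b_pilar, e_pilar)` and builds the table once at the end. `add_sequence` and `store` are hypothetical names that do not appear in the notebook, and the sample sequences are shortened for the example.

```python
from collections import defaultdict


def add_sequence(store, b_pilar, e_pilar, sequence):
    """Record a sequence for a pilar pair, skipping exact duplicates."""
    bucket = store[(b_pilar, e_pilar)]
    if sequence not in bucket:
        bucket.append(sequence)


store = defaultdict(list)
add_sequence(store, "future", "hands", {"sequence": " in the ", "shortReference": "BBC News"})
# Same pair and same sequence: ignored as a duplicate.
add_sequence(store, "future", "hands", {"sequence": " in the ", "shortReference": "BBC News"})
add_sequence(store, "hands", "viewers", {"sequence": " of ", "shortReference": "BBC News"})

# Flatten into records with the notebook's column names;
# a list like this can be passed to pd.DataFrame(records) in one call.
records = [
    {"b_pilar": b, "e_pilar": e, "h_seq": seqs, "m_seq": []}
    for (b, e), seqs in store.items()
]
print(len(records))  # 2
```

A dictionary lookup finds an existing pair directly, whereas the row-by-row approach scans the `b_pilar` and `e_pilar` columns for every sequence.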

In [None]:
dataframe.head()

| | b_pilar | e_pilar | h_seq | m_seq |
|---|---|---|---|---|
| 0 | tv | future | `[{'sequence': ' ', 'longReference': 'BBC News ...` | `[]` |
| 1 | future | hands | `[{'sequence': ' in the ', 'longReference': 'BB...` | `[]` |
| 2 | hands | viewers | `[{'sequence': ' of ', 'longReference': 'BBC Ne...` | `[]` |
| 3 | viewers | home | `[{'sequence': ' with ', 'longReference': 'BBC ...` | `[]` |
| 4 | home | theatre | `[{'sequence': ' ', 'longReference': 'BBC News ...` | `[]` |


In [None]:
suggestions = get_suggestions_mono('people', dataframe)
for suggestion in suggestions:
  print(suggestion[0]['sequence'])

 the 
 the 
 and if you would rather watch live 
 they already have 
 it said 


# 4. Rewriter class


In [None]:
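class Rewriter(object):
  def __init__(self, text, longReference, shortReference):
    self.text = text
    self.longReference = longReference
    self.shortReference = shortReference

  def rewrite(self):
    # the processing loop lives in a method: at class level `self` is undefined (NameError)
    rewritten = []
    for sent in sent_tokenize(self.text):
      sentence = sent.strip()
      sentence_words = word_tokenize(sentence)
      # skip sentences shorter than 3 tokens
      if len(sentence_words) < 3:
        continue
      # check pilars at the beginning and end
      if not is_pilar(sentence_words[0]):
        sentence = FIRST + sentence
      # case 'pilar is not doing'
      if sentence_words[-1] != '.' and not is_pilar(sentence_words[-1]):
        sentence = sentence + LAST + '.'
      # case 'pilar is not doing.'
      if sentence_words[-1] == '.' and not is_pilar(sentence_words[-2]):
        sentence = sentence.replace('.', LAST) + '.'

      # get pilars and the sequences between them
      pilars = get_pilars(sentence)
      sequences = get_iterstitial_seqs(pilars, sentence)
      # same sanity check as text_to_parts; it also guards sequences[i] below
      if len(pilars) == len(sequences) + 1:
        for i in range(len(pilars)-1):
          b_pilar = pilars[i]
          e_pilar = pilars[i+1]
          suggestions = get_suggestions(b_pilar, e_pilar, dataframe)
          # each suggestion is an h_seq list of sequence dicts
          candidates = [s['sequence'] for h_seq in suggestions for s in h_seq]
          if not candidates:
            continue
          ####################################
          # choose the longest (experimental)#
          ####################################
          best = max(candidates, key=len)
          # sequences[i] already carries its surrounding whitespace
          sequence_toreplace = b_pilar + sequences[i] + e_pilar
          # only the replaced sequence is highlighted, so e_pilar stays matchable as the next b_pilar
          new_sequence = '{b_pilar}<span style="color: green">{sequence}</span>{e_pilar}'.format(b_pilar=b_pilar, sequence=best, e_pilar=e_pilar)
          # str.replace returns a new string, so the result must be reassigned
          sentence = sentence.replace(sequence_toreplace, new_sequence, 1)
      rewritten.append(sentence)
    return ' '.join(rewritten)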