<a href="https://colab.research.google.com/github/amaslov455/nlp_project/blob/main/sst_tocsv.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pytreebank

Collecting pytreebank
  Downloading https://files.pythonhosted.org/packages/e0/12/626ead6f6c0a0a9617396796b965961e9dfa5e78b36c17a81ea4c43554b1/pytreebank-0.2.7.tar.gz
Building wheels for collected packages: pytreebank
  Building wheel for pytreebank (setup.py) ... done
  Created wheel for pytreebank: filename=pytreebank-0.2.7-cp36-none-any.whl size=37070 sha256=3494e67500ee6135798c2a68abc80f2e0995ea2004adccbf1ee5d06238669fa8
  Stored in directory: /root/.cache/pip/wheels/e0/b6/91/e9edcdbf464f623628d5c3aa9de28888c726e270b9a29f2368
Successfully built pytreebank
Installing collected packages: pytreebank
Successfully installed pytreebank-0.2.7


In [2]:
import pytreebank
import pandas as pd

In [3]:
dataset = pytreebank.load_sst()

In [4]:
dataset.keys()

dict_keys(['train', 'test', 'dev'])

In [5]:
dataset['train'][0]

<pytreebank.labeled_trees.LabeledTree at 0x7fd4c7ac1a58>

In [6]:
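def create_df_from_treebank(input_dataset):
  # SST fine-grained labels are integers 0..4; map them to readable names.
  sentiment_names = ["very_negative", "negative", "neutral", "positive", "very_positive"]

  # The column key 'santiment' (sic) is kept as-is because later cells use it.
  rows = {'sentence': [], 'santiment': []}

  for tree in input_dataset:
    # to_labeled_lines() returns (label, text) pairs; the first pair is the full sentence.
    label, sentence = tree.to_labeled_lines()[0]

    rows['sentence'].append(sentence)
    rows['santiment'].append(sentiment_names[label])

  return pd.DataFrame.from_dict(rows)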

In [7]:
df_train = create_df_from_treebank(dataset['train'])
df_test = create_df_from_treebank(dataset['test'])

In [8]:
df_train

Unnamed: 0,sentence,santiment
0,The Rock is destined to be the 21st Century 's...,positive
1,The gorgeously elaborate continuation of `` Th...,very_positive
2,Singer/composer Bryan Adams contributes a slew...,positive
3,You 'd think by now America would have had eno...,neutral
4,Yet the act is still charming here .,positive
...,...,...
8539,A real snooze .,very_negative
8540,No surprises .,negative
8541,We 've seen the hippie-turned-yuppie plot befo...,positive
8542,Her fans walked out muttering words like `` ho...,very_negative


In [None]:
df_train.to_csv('/content/drive/MyDrive/diplom_project/train.csv', index=False)
df_test.to_csv('/content/drive/MyDrive/diplom_project/test.csv', index=False)

In [9]:
import nltk
nltk.download('punkt')

joined_sen = ' '.join(df_train['sentence'])

tokens = nltk.word_tokenize(joined_sen)
print('count of all tokens: ', len(tokens))

unique_tokens = list(set(tokens))
print('count of unique tokens: ', len(unique_tokens))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
count of all tokens:  163642
count of unique tokens:  18270
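
The counts above come from `nltk.word_tokenize`. Since SST sentences already appear to be space-separated (for example `Century 's`), a rough cross-check is possible with plain whitespace splitting. The sketch below is only an approximation under that assumption: `token_stats` is a hypothetical helper, the sample sentences are copied from the table above, and the numbers it gives on the full training set may differ from the NLTK figures.

```python
from collections import Counter

def token_stats(sentences):
    """Count tokens by whitespace splitting (an approximation of word_tokenize)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())          # number of token occurrences
    unique = len(counts)                  # number of distinct token types
    ratio = unique / total if total else 0.0
    return total, unique, ratio, counts

# Tiny sample taken from the rows shown above, not the full dataset.
sample = [
    "A real snooze .",
    "No surprises .",
    "Yet the act is still charming here .",
]

total, unique, ratio, counts = token_stats(sample)
print("count of all tokens: ", total)
print("count of unique tokens: ", unique)
print("type/token ratio: ", round(ratio, 3))
print("most common: ", counts.most_common(1))
```

On the real data this could be called as `token_stats(df_train['sentence'])`. The type/token ratio gives a quick sense of how much a subword vocabulary might compress the word-level one.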


In [10]:
def write_sents_to_txt(list_of_sents, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        for text in list_of_sents:
            f.write(text + "\n")
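
SentencePiece training reads a raw text file with one sentence per line, so it can be worth checking that the file round-trips before training on it. The following is a minimal sketch that writes to a temporary directory instead of Google Drive. `read_sents_from_txt` is a hypothetical helper that is not part of the original notebook, and the check assumes no sentence contains an embedded newline.

```python
import os
import tempfile

def write_sents_to_txt(list_of_sents, filename):
    # Same behaviour as the notebook's function: one sentence per line.
    with open(filename, 'w', encoding='utf-8') as f:
        for text in list_of_sents:
            f.write(text + "\n")

def read_sents_from_txt(filename):
    # Hypothetical inverse helper: read the lines back without the trailing newline.
    with open(filename, encoding='utf-8') as f:
        return [line.rstrip("\n") for line in f]

sents = ["A real snooze .", "No surprises ."]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sents.txt")
    write_sents_to_txt(sents, path)
    restored = read_sents_from_txt(path)

print(restored == sents)
```

If a sentence did contain a newline, it would be split into two lines in the file and the comparison would fail, which is exactly the case this check is meant to surface.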

In [12]:
DIR_TXT_FILE = '/content/drive/MyDrive/diplom_project/train_sents1.txt'

write_sents_to_txt(list(df_train.sentence.values), DIR_TXT_FILE)

In [13]:
!pip install sentencepiece
import sentencepiece as spm

spm.SentencePieceTrainer.train('--input={} --model_prefix=m --vocab_size=10000'.format(DIR_TXT_FILE))

sp = spm.SentencePieceProcessor()
sp.load('m.model')

Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/e5/2d/6d4ca4bef9a67070fa1cac508606328329152b1df10bdf31fb6e4e727894/sentencepiece-0.1.94-cp36-cp36m-manylinux2014_x86_64.whl (1.1MB)

True

In [14]:
df_train['joined_nltk'] = df_train['sentence'].apply(lambda x: ' '.join(nltk.word_tokenize(x)))
df_train['joined_sentencepiece'] = df_train['sentence'].apply(lambda x: ' '.join(sp.encode_as_pieces(x)))

In [15]:
df_train

Unnamed: 0,sentence,santiment,joined_nltk,joined_sentencepiece
0,The Rock is destined to be the 21st Century 's...,positive,The Rock is destined to be the 21st Century 's...,▁The ▁Rock ▁is ▁destin ed ▁to ▁be ▁the ▁21 s t...
1,The gorgeously elaborate continuation of `` Th...,very_positive,The gorgeously elaborate continuation of `` Th...,▁The ▁gorgeous ly ▁ e laborat e ▁continu ation...
2,Singer/composer Bryan Adams contributes a slew...,positive,Singer/composer Bryan Adams contributes a slew...,▁S ing er / compos er ▁Br yan ▁Adam s ▁contrib...
3,You 'd think by now America would have had eno...,neutral,You 'd think by now America would have had eno...,▁You ▁' d ▁think ▁by ▁now ▁America ▁would ▁ ha...
4,Yet the act is still charming here .,positive,Yet the act is still charming here .,▁Ye t ▁the ▁act ▁is ▁still ▁charm ing ▁here ▁.
...,...,...,...,...
8539,A real snooze .,very_negative,A real snooze .,▁A ▁real ▁snooze ▁.
8540,No surprises .,negative,No surprises .,▁No ▁surprise s ▁.
8541,We 've seen the hippie-turned-yuppie plot befo...,positive,We 've seen the hippie-turned-yuppie plot befo...,▁We ▁' ve ▁see n ▁the ▁hippie - turned - y upp...
8542,Her fans walked out muttering words like `` ho...,very_negative,Her fans walked out muttering words like `` ho...,▁Her ▁fan s ▁walk ed ▁out ▁mut tering ▁word s ...
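
The `▁` character (U+2581) in the `joined_sentencepiece` column is the meta symbol SentencePiece uses to mark whitespace. A space-joined piece string can therefore be turned back into text by concatenating the pieces and replacing the marker with a space. The sketch below uses only the standard library. `pieces_to_text` is a hypothetical helper, the inputs are copied from the rows above, and an exact round trip is assumed only for text that SentencePiece's normalization leaves unchanged.

```python
WS_MARKER = "\u2581"  # '▁', SentencePiece's whitespace meta symbol

def pieces_to_text(joined_pieces):
    """Rebuild a sentence from a space-joined string of SentencePiece pieces."""
    pieces = joined_pieces.split(" ")
    # Concatenate the pieces, then turn each whitespace marker back into a space.
    return "".join(pieces).replace(WS_MARKER, " ").strip()

examples = [
    "▁A ▁real ▁snooze ▁.",
    "▁No ▁surprise s ▁.",
    "▁Ye t ▁the ▁act ▁is ▁still ▁charm ing ▁here ▁.",
]

restored = [pieces_to_text(e) for e in examples]
for text in restored:
    print(text)
```

With the trained model loaded, `sp.decode_pieces(...)` on the piece list should give an equivalent result. The manual version is shown here only to make the role of the marker explicit.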
