<a href="https://colab.research.google.com/github/componavt/sns4human/blob/main/src/vk/voyant_tools/post_tokens_with_group.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. This script reads the CSV files with VK group post texts, one file per group (created with the vk_group_all_posts.ipynb notebook).

2. Tokenizes each post (splits it into words) and keeps the periods that end sentences.

3. Combines the posts from several VK groups into one CSV file, adds a group field with the name of the VK group, and sorts all posts by the date field.
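
The three steps above can be sketched without any network access. The `simple_tokenize` function below is a hypothetical stand-in for `text_preprocessing.process_text(..., period=True)`, whose actual behavior (lemmatization, stop-word removal, and so on) is not shown in this notebook. The group names, dates, and texts are made-up sample data.

```python
import re

# Hypothetical tokenizer: runs of letters (optionally hyphenated) or a single period.
# It only illustrates "split into words, keep periods"; the real process_text may differ.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*|\.")

def simple_tokenize(text: str) -> str:
    """Return lowercase tokens joined by spaces, with periods kept as separate tokens."""
    return " ".join(TOKEN_RE.findall(text.lower()))

# Made-up posts from two groups (assumption: dates are sortable ISO strings).
posts = [
    {"group": "group_b", "date": "2024-02-01", "text": "Второй пост. Новый текст!"},
    {"group": "group_a", "date": "2024-01-15", "text": "Первый пост, Karjalan kieli."},
]

# Tokenize every post, keep the group name, and sort all rows by date.
rows = sorted(
    (
        {"tokens": simple_tokenize(p["text"]), "date": p["date"], "group": p["group"]}
        for p in posts
    ),
    key=lambda r: r["date"],
)

for r in rows:
    print(r)
```

Punctuation other than the period is dropped by this sketch, so sentence boundaries survive while commas and exclamation marks do not.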

The resulting CSV file is ready to be uploaded to Voyant Tools. The text (the tokens field) is enclosed in quotation marks specifically for this purpose.
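
To illustrate the quoting described above, here is a minimal sketch that writes two sample rows with the same `csv.writer` settings the notebook uses and parses them back. The `quote_field` helper is a hypothetical addition, not part of the notebook: it doubles embedded quotation marks so that a token string containing `"` still parses as one field. The sample rows are made up.

```python
import csv
import io

def quote_field(text: str) -> str:
    """Wrap text in double quotes, doubling embedded quotes (hypothetical helper)."""
    return '"' + text.replace('"', '""') + '"'

buffer = io.StringIO(newline='')
# QUOTE_NONE with quotechar=None: the writer leaves the manually added quotes untouched.
writer = csv.writer(buffer, delimiter=',', quoting=csv.QUOTE_NONE,
                    quotechar=None, escapechar='\\')
writer.writerow(['tokens', 'date', 'group'])
writer.writerow([quote_field('первый пост .'), '2024-01-15', 'group_a'])
writer.writerow([quote_field('слово "цитата" .'), '2024-02-01', 'group_b'])

csv_text = buffer.getvalue()
print(csv_text)

# A standard CSV reader sees the quotes as field delimiters and removes them.
parsed = list(csv.reader(io.StringIO(csv_text, newline='')))
print(parsed[1])
```

Only the tokens column is quoted in the output, while date and group are written as plain values.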


In [4]:
# not religion
# domains = ['club221681617','concerto','club151359929','pravoslav_karelia']# smallest groups for tests smallest_tokens.csv
domains = ['aparfenchikov','minnazrk', 'mincultrk', 'uoknrk']# state_tokens.csv
# domains = ['rk_nationalmuseum','olonmus','museum_ptz','echo_association', 'domderevnivoknavolok', 'vepsmuseum', 'club226126304']# museum_tokens.csv
# domains = ['satasanaa','speechvepkar','desyatiletieyazykovkarelia','karjalan_kieli']# language_tokens.csv
# domains = ['omapajo','kyykkakarjala','tastykarjala','senofest','olongames','kalevala_fest']# festival_tokens.csv
# domains = ['karjalankielenkodi','mediacenter_periodika','public111906776','karjalanrahvahanliitto','club2562309', 'club_dk_padany', 'melnikpryazha']# multifunctional_tokens.csv

# religion
# domains = ['mitropolit_manuil','club57656949','popeshenie','ekaterinahram','nevsoborptz','sortavala_chram',
#           'pravmk','club18647865','dpcentr','krest_sobor','stupeniorthodox','club103835710','club151359929','club221681617',
#           'pravkarelia','svirskiymonastery','club2975745','club19347481','kemskoepodvorie']# orthodoxy.csv
# domains = ['islamrk','halal_ptz','siogroups']# islam.csv
# domains = ['infoinkeri','kemskij_prihod','club18959947','kirkko','kareliandiocese','concerto']# lutheran.csv
# domains = ['hve10','church_of_christ_ptz','glorygod_ptz']# baptists_and_evangelists.csv
result_file = 'state_tokens.csv'

#religion = '/religion'
religion = ''           # not religion

# Whether to also gzip the resulting CSV file (creates <result_file>.gz): 1 = yes, 0 = no
b_gzip = 1
#b_gzip = 0
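
As an alternative to commenting and uncommenting the `domains` lines, the settings could be kept in one lookup table. This is only a sketch: the `CORPORA` dictionary and the `posts_url` helper are hypothetical names, and only two of the corpora listed above are included. The URL pattern is the one used when the posts are loaded further down.

```python
# Hypothetical registry: result file -> (list of VK domains, posts subdirectory).
CORPORA = {
    'state_tokens.csv': (['aparfenchikov', 'minnazrk', 'mincultrk', 'uoknrk'], ''),
    'islam.csv': (['islamrk', 'halal_ptz', 'siogroups'], '/religion'),
}

def posts_url(domain: str, religion: str) -> str:
    """Build the raw GitHub URL of one group's posts CSV."""
    return ('https://raw.githubusercontent.com/componavt/sns4human/'
            f'refs/heads/main/data/vk/posts{religion}/{domain}.csv')

# Choosing a corpus then takes a single line.
result_file = 'state_tokens.csv'
domains, religion = CORPORA[result_file]

for domain in domains:
    print(posts_url(domain, religion))
```

With this layout, `result_file`, `domains`, and `religion` cannot get out of sync with each other.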

In [5]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
!pip install -U pymorphy3
import pymorphy3
import requests
import csv
import gzip
import shutil

!pip install emoji
import emoji

from io import StringIO
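# Download the shared preprocessing module from the repository and import it
filename = 'text_preprocessing.py'
response = requests.get(f'https://raw.githubusercontent.com/componavt/sns4human/refs/heads/main/src/vk/nlp/{filename}')
response.raise_for_status()  # fail early instead of saving an HTTP error page as a module
with open(filename, 'w', encoding='utf-8') as f:
    f.write(response.text)
import text_preprocessing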

nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

from nltk.corpus import stopwords
stop_words = stopwords.words("russian")
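stop_words_url = 'https://raw.githubusercontent.com/componavt/sns4human/refs/heads/main/src/vk/nlp/RussianStopWords.txt'
# splitlines() handles both \n and \r\n; blank lines are skipped so '' is not added as a stop word
stop_words += [w.strip() for w in requests.get(stop_words_url).text.splitlines() if w.strip()]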

alphabet  = set('абвгдеёжзийклмнопрстуфхцчшщъыьэюя')                    # Russian alphabet
alphabet |= set('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ') # Add Latin letters (A-Z, a-z)
alphabet |= set('äöåšžüÄÖÅŠŽÜ')  # ä, ö, å (Finnish); š, ž (Karelian); ü (Veps)
alphabet_dash = alphabet | {'-'} # Optionally allow hyphen (dash) as part of words

morph = pymorphy3.MorphAnalyzer(lang='ru')



In [6]:
temp_df = []
for domain in domains:
    t = pd.read_csv(f'https://raw.githubusercontent.com/componavt/sns4human/refs/heads/main/data/vk/posts{religion}/{domain}.csv',
                    usecols=['text', 'date'])
    t['group'] = domain
    temp_df.append(t)

df = pd.concat(temp_df, ignore_index=True).sort_values('date')
df = df[df['text'].notna() & (df['text'].apply(lambda x: isinstance(x, str))) & (df['text'] != '')]

df['tokens'] = df['text'].apply(lambda x: text_preprocessing.process_text(x, filter_fio=False, period=True))

# Drop rows whose 'tokens' value is missing, not a string, or blank after stripping whitespace
df = df[df['tokens'].apply(lambda x: isinstance(x, str) and x.strip() != '')]
df_tokens = df[['tokens', 'date', 'group']]

# Save the CSV with quotes around the 'tokens' field only; 'date' and 'group' stay unquoted.
# QUOTE_NONE with quotechar=None lets the manually added quotes pass through unescaped.
with open(result_file, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter=',', quoting=csv.QUOTE_NONE, quotechar=None, escapechar='\\')
    writer.writerow(['tokens', 'date', 'group'])
    for _, row in df_tokens.iterrows():
        writer.writerow([f'"{row["tokens"]}"', row['date'], row['group']])

# If b_gzip == 1, create a gzip archive
if b_gzip == 1:
    gzip_file = result_file + ".gz"
    with open(result_file, 'rb') as f_in, gzip.open(gzip_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    print(f"Archived: {gzip_file}")

Archived: state_tokens.csv.gz
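
Before uploading, the archive can be checked by reading it back. The sketch below uses only the standard library. The `read_tokens_gz` helper is a hypothetical name, and the code first writes a tiny made-up sample archive to a temporary directory so that it runs on its own. In the notebook, the path would be `result_file + ".gz"` instead.

```python
import csv
import gzip
import os
import tempfile

def read_tokens_gz(path: str) -> list:
    """Read a gzipped tokens CSV and return its rows as dictionaries."""
    with gzip.open(path, 'rt', encoding='utf-8', newline='') as f:
        return list(csv.DictReader(f))

# Made-up sample in the same layout as the notebook's output file.
sample = 'tokens,date,group\r\n"первый пост .",2024-01-15,group_a\r\n'
tmp_path = os.path.join(tempfile.mkdtemp(), 'sample_tokens.csv.gz')
with gzip.open(tmp_path, 'wt', encoding='utf-8', newline='') as f:
    f.write(sample)

rows = read_tokens_gz(tmp_path)
print(len(rows), rows[0]['tokens'], rows[0]['group'])
```

A quick look at the row count and the first few `tokens` values is usually enough to confirm that the quoting and the compression both worked.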
