# Segmenting text files

Segmenting text files is an optional preprocessing step for Topic Modeling. <br>
Splitting long text files into chunks yields more files of more uniform size, which is an advantage for Topic Modeling. <br><br>
In this notebook, you only need to change the path variables. After that, you can run all cells at once. 

## Loading & sorting files

In [1]:
from pathlib import Path
import os 
import re
import sys

In [2]:
# Path variables
data = 'Y:/data/projekte/dispecs/TopicModeling' 
language = 'es'
path_to_corpus = Path(data, 'dispecs_'+language+'_lemmatized')
output_dir = data + '/dispecs_'+language+'_paragr'
os.makedirs(output_dir, exist_ok=True)  # create the output folder if it does not exist yet

In [3]:
filenames = [os.path.join(path_to_corpus, fn) for fn in sorted(os.listdir(path_to_corpus))]
filenames

['Y:\\data\\projekte\\dispecs\\TopicModeling\\dispecs_es_lemmatized\\1735-1736_El-Duende-Crítico_Frai-Manuel-de-San-Josef_Vol-1_Nr-001_112-821.txt',
 'Y:\\data\\projekte\\dispecs\\TopicModeling\\dispecs_es_lemmatized\\1735-1736_El-Duende-Crítico_Frai-Manuel-de-San-Josef_Vol-1_Nr-002_112-822.txt',
 'Y:\\data\\projekte\\dispecs\\TopicModeling\\dispecs_es_lemmatized\\1735-1736_El-Duende-Crítico_Frai-Manuel-de-San-Josef_Vol-1_Nr-003_112-840.txt',
 'Y:\\data\\projekte\\dispecs\\TopicModeling\\dispecs_es_lemmatized\\1735-1736_El-Duende-Crítico_Frai-Manuel-de-San-Josef_Vol-1_Nr-004_112-823.txt',
 'Y:\\data\\projekte\\dispecs\\TopicModeling\\dispecs_es_lemmatized\\1735-1736_El-Duende-Crítico_Frai-Manuel-de-San-Josef_Vol-1_Nr-005_112-824.txt',
 'Y:\\data\\projekte\\dispecs\\TopicModeling\\dispecs_es_lemmatized\\1735-1736_El-Duende-Crítico_Frai-Manuel-de-San-Josef_Vol-1_Nr-006_112-825.txt',
 'Y:\\data\\projekte\\dispecs\\TopicModeling\\dispecs_es_lemmatized\\1735-1736_El-Duende-Crítico_Frai-Manu

## Segmenting into paragraphs
Separate the texts into paragraph-length chunks and save them as plain text files. 

In [4]:
def split_to_paragraphs(filename, n_words, max_len):
    """Split a text into chunks of approximately `n_words` words.

    A chunk is closed at a '###' marker once it holds at least `n_words` words,
    or at a sentence end ('.', '!', '?') once it holds more than `max_len` words.
    """
    with open(filename, 'r', encoding="utf-8") as f:
        text = f.read().strip()
    # remove punctuation that should not count as a token, then drop empty strings
    words = list(filter(None, re.sub(r'[,";:()\-]', '', text).split(' ')))
    chunks = []
    current_chunk_words = []
    current_chunk_word_count = 0
    for word in words:
        current_chunk_words.append(word)
        if word not in ['.','!','?','###']:
            current_chunk_word_count += 1
        if (current_chunk_word_count >= n_words and word == "###") or (current_chunk_word_count > max_len and word in ['.','!','?']):
            chunks.append(' '.join(current_chunk_words))
            current_chunk_words = []
            current_chunk_word_count = 0

    # keep the remaining words; skip an empty trailing chunk unless the file produced no chunk at all
    if current_chunk_words or not chunks:
        chunks.append(' '.join(current_chunk_words))
    return chunks
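
To see how the two split conditions interact, here is a minimal sketch on a toy token list. `chunk_words` is a hypothetical in-memory stand-in for `split_to_paragraphs` (it takes a list of tokens instead of a file name), and the tiny values for `n_words` and `max_len` are chosen only for illustration.

```python
def chunk_words(words, n_words, max_len):
    """Toy version of the splitting rule; hypothetical helper, not part of the notebook."""
    chunks, current, count = [], [], 0
    for word in words:
        current.append(word)
        # sentence-final punctuation and the '###' marker do not count as words
        if word not in ('.', '!', '?', '###'):
            count += 1
        closes_at_marker = count >= n_words and word == '###'
        closes_at_sentence = count > max_len and word in ('.', '!', '?')
        if closes_at_marker or closes_at_sentence:
            chunks.append(' '.join(current))
            current, count = [], 0
    if current:
        chunks.append(' '.join(current))
    return chunks


tokens = 'a b c . ### d e . f g h i . j ### k .'.split(' ')
demo = chunk_words(tokens, n_words=3, max_len=5)
for chunk in demo:
    print(chunk)
```

The first chunk closes at the `###` marker because it already holds 3 words. The second chunk never reaches a marker, so it closes at the first sentence end after exceeding `max_len`. The rest of the tokens form a short final chunk.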

In [5]:
# filepath=str(path_to_corpus)+"/1727_Il-Filosofo-alla-Moda_Cesare-Frasponi_Vol-2_Nr-101_096-282.txt"
# input = open(filepath, 'r', encoding="utf-8")
# l = re.sub(',|\"|\;|\:|\(|\)|\-','',input.read().strip()).split(' ')
# words = list(filter(None, l))
# input.close()
# for word in words:
#     if "\n" in word:
#         print(word)
# print(l)

In [6]:
#filenames.sort()

In [7]:
chunk_length = 500
max_len = 600
chunks = []

for filename in filenames:
    chunk_counter = 0
    texts = split_to_paragraphs(filename, chunk_length, max_len)
    for text in texts:
        chunk = {'text': text, 'number': chunk_counter, 'filename': filename} # make dictionary with file content and information
        chunks.append(chunk)
        chunk_counter += 1
        

Original number of files:

In [8]:
len(filenames)

690

Number of chunks we generated:

In [9]:
len(chunks)

3571

In [10]:
#example
chunks[10:20]

[{'text': 'Despues que estubieron iá todo junto en lo salar donde tener su Excelencia Bufete silléta i cama se repetir lo oracion de lo semana pasar i el Obispo comisàrio Gobernador i Tetrárca discurrío comer uno gilguero hablar mas que uno Urraca Devidiose en parecer lo gran Junta Patiñana lo uno querer guerra otro por lo paz clamar pero ni en Güerra ni en Paz adelantar palábra Reyes sin mirar à Ustaríz pro poner se levantáran ocho nuebos Regimientos de Dragones parir Itália Mesurose Matéo Pablo riendose Mesa i Cuádra Ibañez mui Jesuita con culto Latiniparla pro poner coser mui bueno segun díjo Maturána que dar pues de uno glória pàtri con mèdia cabeza gachó hacer los seña de Amen en do òtras cabezada Los otro hablar todo el mismo que si no hablar pro poner dis parátes i lo que lo aprobar al mismo tiémpo decir que ser de opinion contrária Prebalecio lo opinion del que lo hacer de nada Èra èsta uno ciencia medio ni bien gordo ni bien magro uno diptongar GuerriPaz boda de Mercurio i Pal

If a file has, for example, 510 words, it can produce 2 chunks: <br>
1) one with length 500 <br>
2) one with length 10. <br>
We want to append such short chunks to their previous sibling, that is, the preceding chunk of the same file. 

In [11]:
min_length = 200
i = 0
for index, chunk in enumerate(chunks):
    l_chunk = len(chunk['text'].split(' '))
    if l_chunk < min_length and chunk['number'] != 0:
        i+=1
        # append the short chunk (at `index`) to its previous sibling (at `index-1`)
        chunks[index-1]['text'] = chunks[index-1]['text'] + ' ' + chunk['text']
        # note: the message names the receiving chunk first and the short, appended chunk second
        print('Chunk '+ str(chunk['number']-1) +' of file ' + chunk['filename'] + ' appended to chunk ' + str(chunk['number']) + ' on index ' + str(index))
        
print('Number of appended chunks: ' + str(i))

Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1735-1736_El-Duende-Crítico_Frai-Manuel-de-San-Josef_Vol-1_Nr-010_112-841.txt appended to chunk 2 on index 13
Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1735-1736_El-Duende-Crítico_Frai-Manuel-de-San-Josef_Vol-1_Nr-018_112-835.txt appended to chunk 2 on index 26
Chunk 0 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1735-1736_El-Duende-Crítico_Frai-Manuel-de-San-Josef_Vol-1_Nr-019_112-836.txt appended to chunk 1 on index 28
Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1735-1736_El-Duende-Crítico_Frai-Manuel-de-San-Josef_Vol-1_Nr-020_112-837.txt appended to chunk 2 on index 31
Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1735-1736_El-Duende-Crítico_Frai-Manuel-de-San-Josef_Vol-1_Nr-022_112-842.txt appended to chunk 2 on index 35
Chunk 7 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es

Chunk 2 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1781_El-Censor_Anonym-(García-de-Cañuelo,-Luis+-Pereira,-Luis-Marcelino)_Vol-1_Nr-02_096-323.txt appended to chunk 3 on index 1256
Chunk 2 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1781_El-Censor_Anonym-(García-de-Cañuelo,-Luis+-Pereira,-Luis-Marcelino)_Vol-1_Nr-05_096-326.txt appended to chunk 3 on index 1268
Chunk 2 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1781_El-Censor_Anonym-(García-de-Cañuelo,-Luis+-Pereira,-Luis-Marcelino)_Vol-1_Nr-06_096-327.txt appended to chunk 3 on index 1272
Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1781_El-Censor_Anonym-(García-de-Cañuelo,-Luis+-Pereira,-Luis-Marcelino)_Vol-1_Nr-07_096-328.txt appended to chunk 2 on index 1275
Chunk 3 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1781_El-Censor_Anonym-(García-de-Cañuelo,-Luis+-Pereira,-Luis-Marcelino)_Vol-1_Nr-08_096-32

Chunk 5 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1787-bzw.1788_El-Duende-de-Madrid_Pedro-Pablo-Trullench_Vol-1_Nr-2_097-349.txt appended to chunk 6 on index 2238
Chunk 5 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1787-bzw.1788_El-Duende-de-Madrid_Pedro-Pablo-Trullench_Vol-1_Nr-3_091-82.txt appended to chunk 6 on index 2245
Chunk 5 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1787-bzw.1788_El-Duende-de-Madrid_Pedro-Pablo-Trullench_Vol-1_Nr-4_091-83.txt appended to chunk 6 on index 2252
Chunk 6 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1787-bzw.1788_El-Duende-de-Madrid_Pedro-Pablo-Trullench_Vol-1_Nr-5_091-84.txt appended to chunk 7 on index 2260
Chunk 2 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1787_El-Censor_Anonym-(García-de-Cañuelo,-Luis+-Pereira,-Luis-Marcelino)_Vol-7_Nr-138_105-599.txt appended to chunk 3 on index 2282
Chunk 3 of file Y:\data\projekte\d

Chunk 4 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1803_El-Regañón-general_Anónimo-(Ventura-Ferrer)_Vol-1_Nr-30_2758.txt appended to chunk 5 on index 2977
Chunk 4 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1803_El-Regañón-general_Anónimo-(Ventura-Ferrer)_Vol-1_Nr-31_2759.txt appended to chunk 5 on index 2983
Chunk 3 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1803_El-Regañón-general_Anónimo-(Ventura-Ferrer)_Vol-1_Nr-33_2907.txt appended to chunk 4 on index 2993
Chunk 4 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1803_El-Regañón-general_Anónimo-(Ventura-Ferrer)_Vol-1_Nr-34_2908.txt appended to chunk 5 on index 2999
Chunk 3 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1803_El-Regañón-general_Anónimo-(Ventura-Ferrer)_Vol-1_Nr-36_2910.txt appended to chunk 4 on index 3009
Chunk 4 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1803_El-Regañón-ge

Chunk 3 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1813_El-Pensador-Mexicano_José-Joaquín-Fernández-de-Lizardi_Vol-1_Nr-012_8084.txt appended to chunk 4 on index 3564
Number of appended chunks: 381


Now delete the chunks that we already copied to their previous siblings. <br>
Optional: You can also delete very short chunks that have no previous sibling (that is, short original files). To do so, remove the condition "chunk['number'] != 0". 

In [12]:
i = 0
kept = []  # build a new list instead of removing items from `chunks` while iterating over it
for chunk in chunks:
    l_chunk = len(chunk['text'].split(' '))
    if l_chunk < min_length and chunk['number'] != 0:
        i+=1
        # len(kept) is the position the chunk would have had in the shrinking list
        print('Chunk '+ str(chunk['number']) +' of file ' + chunk['filename'] + ' on index ' + str(len(kept)) + ' deleted.')
    else:
        kept.append(chunk)
chunks = kept
        
print('Number of deleted chunks: ' + str(i))

Chunk 2 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1735-1736_El-Duende-Crítico_Frai-Manuel-de-San-Josef_Vol-1_Nr-010_112-841.txt on index 13 deleted.
Chunk 2 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1735-1736_El-Duende-Crítico_Frai-Manuel-de-San-Josef_Vol-1_Nr-018_112-835.txt on index 25 deleted.
Chunk 1 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1735-1736_El-Duende-Crítico_Frai-Manuel-de-San-Josef_Vol-1_Nr-019_112-836.txt on index 26 deleted.
Chunk 2 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1735-1736_El-Duende-Crítico_Frai-Manuel-de-San-Josef_Vol-1_Nr-020_112-837.txt on index 28 deleted.
Chunk 2 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1735-1736_El-Duende-Crítico_Frai-Manuel-de-San-Josef_Vol-1_Nr-022_112-842.txt on index 31 deleted.
Chunk 8 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1761-06-13_El-Duende-especulativo-sobre-la-

Chunk 3 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1784_El-Censor_Anonym-(García-de-Cañuelo,-Luis+-Pereira,-Luis-Marcelino)_Vol-3_Nr-057_106-673.txt on index 1361 deleted.
Chunk 4 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1784_El-Censor_Anonym-(García-de-Cañuelo,-Luis+-Pereira,-Luis-Marcelino)_Vol-3_Nr-058_103-458.txt on index 1365 deleted.
Chunk 4 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1784_El-Censor_Anonym-(García-de-Cañuelo,-Luis+-Pereira,-Luis-Marcelino)_Vol-3_Nr-059_103-459.txt on index 1369 deleted.
Chunk 3 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1784_El-Censor_Anonym-(García-de-Cañuelo,-Luis+-Pereira,-Luis-Marcelino)_Vol-3_Nr-061_104-546.txt on index 1376 deleted.
Chunk 3 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1784_El-Censor_Anonym-(García-de-Cañuelo,-Luis+-Pereira,-Luis-Marcelino)_Vol-3_Nr-065_104-550.txt on index 1393 deleted.
Chunk 3 of

Chunk 3 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1788_El-Filósofo-á-la-Moda_Anónimo_Vol-2_Nr-005_111-805.txt on index 2297 deleted.
Chunk 4 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1788_El-Filósofo-á-la-Moda_Anónimo_Vol-2_Nr-007_111-807.txt on index 2305 deleted.
Chunk 3 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1788_El-Filósofo-á-la-Moda_Anónimo_Vol-2_Nr-011_111-810.txt on index 2319 deleted.
Chunk 3 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1788_El-Filósofo-á-la-Moda_Anónimo_Vol-2_Nr-012_111-811.txt on index 2322 deleted.
Chunk 3 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1788_El-Filósofo-á-la-Moda_Anónimo_Vol-2_Nr-014_111-812.txt on index 2325 deleted.
Chunk 3 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1788_El-Filósofo-á-la-Moda_Anónimo_Vol-2_Nr-016_111-816.txt on index 2331 deleted.
Chunk 2 of file Y:\data\projekte\dispecs

Chunk 3 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1804_El-Regañón-general_Anónimo-(Ventura-Ferrer)_Vol-2_Nr-39_7888.txt on index 2997 deleted.
Chunk 3 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1804_El-Regañón-general_Anónimo-(Ventura-Ferrer)_Vol-2_Nr-41_7890.txt on index 3004 deleted.
Chunk 5 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1804_El-Regañón-general_Anónimo-(Ventura-Ferrer)_Vol-2_Nr-43_7897.txt on index 3015 deleted.
Chunk 4 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1804_El-Regañón-general_Anónimo-(Ventura-Ferrer)_Vol-2_Nr-44_7898.txt on index 3019 deleted.
Chunk 5 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1804_El-Regañón-general_Anónimo-(Ventura-Ferrer)_Vol-2_Nr-45_7899.txt on index 3024 deleted.
Chunk 5 of file Y:\data\projekte\dispecs\TopicModeling\dispecs_es_lemmatized\1804_El-Regañón-general_Anónimo-(Ventura-Ferrer)_Vol-2_Nr-46_7900.txt on 

In [13]:
print('Remaining chunks: ' + str(len(chunks)))

Remaining chunks: 3190
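
The append and delete steps can also be combined into a single pass that builds a new list. The following is only a sketch under assumptions: `merge_short_chunks` is a hypothetical helper, the chunks are dictionaries with the keys `text`, `number` and `filename` as above, they are ordered file by file, and "previous sibling" is detected by comparing file names rather than by `chunk['number'] != 0`.

```python
def merge_short_chunks(chunks, min_length):
    """Return a new list in which short chunks are merged into their previous sibling."""
    merged = []
    for chunk in chunks:
        n_tokens = len(chunk['text'].split(' '))
        # a previous sibling exists if the last kept chunk comes from the same file
        has_sibling = bool(merged) and merged[-1]['filename'] == chunk['filename']
        if n_tokens < min_length and has_sibling:
            merged[-1]['text'] += ' ' + chunk['text']
        else:
            merged.append(dict(chunk))  # copy, so the input list stays untouched
    return merged


demo_chunks = [
    {'text': 'w w w w w', 'number': 0, 'filename': 'a.txt'},  # long enough
    {'text': 'x x', 'number': 1, 'filename': 'a.txt'},        # short, has a sibling
    {'text': 'y', 'number': 0, 'filename': 'b.txt'},          # short, but first chunk of its file
]
result = merge_short_chunks(demo_chunks, min_length=3)
for chunk in result:
    print(chunk['filename'], chunk['number'], repr(chunk['text']))
```

Because nothing is removed from a list while it is being iterated, no chunk can be skipped, and a run of several short chunks in a row ends up in the same preceding chunk.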


## Saving chunks to text files

In [14]:
for chunk in chunks:
    basename = os.path.basename(chunk['filename'])
    fn_base, fn_ext = os.path.splitext(basename)
    fn = os.path.join(output_dir, "{}_{:04d}{}".format(fn_base, chunk['number'], fn_ext)) 
    fn = fn.replace(',','').replace('N°', '') # replace characters in file names that can cause trouble while saving the file
    with open(fn, 'w', encoding='utf-8') as f:
        f.write(chunk['text'])
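
The output file names are built from the original file name plus a zero-padded chunk number. A minimal sketch of that naming scheme, using a made-up source path and the hypothetical helper `chunk_filename`; unlike the cell above, it cleans only the file name, not the whole output path:

```python
import os


def chunk_filename(source_path, number, output_dir):
    """Build the output path for one chunk (hypothetical helper for illustration)."""
    basename = os.path.basename(source_path)
    fn_base, fn_ext = os.path.splitext(basename)
    # e.g. 'name.txt' and chunk 3 become 'name_0003.txt'
    name = "{}_{:04d}{}".format(fn_base, number, fn_ext)
    # drop characters that can cause trouble while saving the file
    name = name.replace(',', '').replace('N°', '')
    return os.path.join(output_dir, name)


example = chunk_filename('corpus/1735_Example, Title_Nr-001.txt', 3, 'out')
print(example)
```

The four-digit padding keeps the chunks of one source file in the right order when the output folder is listed with `sorted`.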

## Check document lengths

The following code only gives you insight into how long or short your files are. Even after segmenting, some files can still be very short (when the original text is short, so there were no further chunks to combine) and others very long (when a single paragraph is very long).

In [15]:
filenames = [os.path.join(output_dir, fn) for fn in sorted(os.listdir(output_dir))]

filenames

['Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\\1711-1712_Le-Misantrope_Justus-Van-Effen_Vol-1_Nr-001_2948_0000.txt',
 'Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\\1711-1712_Le-Misantrope_Justus-Van-Effen_Vol-1_Nr-002_2949_0000.txt',
 'Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\\1711-1712_Le-Misantrope_Justus-Van-Effen_Vol-1_Nr-002_2949_0001.txt',
 'Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\\1711-1712_Le-Misantrope_Justus-Van-Effen_Vol-1_Nr-002_2949_0002.txt',
 'Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\\1711-1712_Le-Misantrope_Justus-Van-Effen_Vol-1_Nr-002_2949_0003.txt',
 'Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\\1711-1712_Le-Misantrope_Justus-Van-Effen_Vol-1_Nr-003_2950_0000.txt',
 'Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\\1711-1712_Le-Misantrope_Justus-Van-Effen_Vol-1_Nr-003_2950_0001.txt',
 'Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\\1711-1712_Le-Misantrope_Justus

In [16]:
## Count tokens per document

def count_words(filename):
    """Count number of words for a file."""
    with open(filename, 'r', encoding="utf-8") as f:
        text = f.read()
    # remove special characters and '###' markers; split() without arguments normalizes whitespace
    words = re.sub(r'[,.;:()\-?!]|###', '', text).split()
    return len(words)

In [17]:
word_lens = []
for filename in filenames:
    #print(filename)
    word_len = count_words(filename)
    len_file = {'filename': filename, 'tokens': word_len} 
    word_lens.append(len_file)

In [18]:
from termcolor import colored
sorted_lens = sorted(word_lens, key = lambda i: i['tokens'])
for file in sorted_lens:
    print(colored(file['tokens'], 'red'), file['filename'])

[31m90[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1723-1725_Le-Nouveau-Spectateur-français_Justus-Van-Effen_Vol-3_Nr-000_3391_0000.txt
[31m158[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1711-1712_Le-Misantrope_Justus-Van-Effen_Vol-1_Nr-001_2948_0000.txt
[31m172[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1760-1761_Le-Monde_Jean-François-de-Bastide_Vol-3_Nr-004_6951_0003.txt
[31m175[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1760-1761_Le-Monde_Jean-François-de-Bastide_Vol-4_Nr-002_6969_0008.txt
[31m178[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1760_Le-Monde-comme-il-est_Jean-François-de-Bastide_Vol-2_Nr-017_4276_0003.txt
[31m180[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1750_La-Bigarure_Anonyme-(Joseph-Marie-Durey-de-Morsan)_Vol-6_Nr-003_7216_0003.txt
[31m181[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1750_La-Bigarure_Anonyme-(Joseph-Marie-Durey-de-Morsan

[31m493[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1715--1714_Le-Censeur-ou-Caractères-des-Mœurs-de-la-Haye_Anonym-(Jean-Rousset-de-Missy-+-Nicolas-de-Guedeville)_Vol-1_Nr-001_6408_0000.txt
[31m493[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1716_Le-Spectateur-ou-le-Socrate-moderne_Anonym_Vol-2_Nr-040_400_0000.txt
[31m493[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1720_Le-Spectateur-ou-le-Socrate-moderne_Anonym_Vol-4_Nr-026_2296_0001.txt
[31m493[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1723-1725_Le-Nouveau-Spectateur-français_Justus-Van-Effen_Vol-1_Nr-020_3143_0000.txt
[31m493[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1726_Le-Spectateur-ou-le-Socrate-moderne_Anonym_Vol-6_Nr-045_2904_0001.txt
[31m493[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1726_Le-Spectateur-ou-le-Socrate-moderne_Anonym_Vol-6_Nr-053_2920_0000.txt
[31m493[0m Y:/data/projekte/dispecs/TopicModeling/dispe

[31m513[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1751--1749-1751_La-Spectatrice-Ouvrage-traduit-de-l-anglois_Anonym-(Eliza-Haywood)_Vol-4_Nr-001_7557_0002.txt
[31m513[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1751--1749-1751_La-Spectatrice-Ouvrage-traduit-de-l-anglois_Anonym-(Eliza-Haywood)_Vol-4_Nr-001_7557_0016.txt
[31m513[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1751--1749-1751_La-Spectatrice-Ouvrage-traduit-de-l-anglois_Anonym-(Eliza-Haywood)_Vol-4_Nr-004_7560_0024.txt
[31m513[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1752_Le-Cabinet-du-Philosophe_Pierre-Carlet-de-Marivaux_Vol-1_Nr-009_122-1369_0003.txt
[31m513[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1752_Le-Spectateur-françois_Pierre-Carlet-de-Marivaux_Vol-1_Nr-018_122-1353_0001.txt
[31m513[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1758_Le-Nouveau-Spectateur_Jean-François-de-Bastide_Vol-2_Nr-012_3349_0008.txt

[31m538[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1735_Le-Philosophe-Nouvelliste_Armand-de-Boisbeleau-de-La-Chapelle_Vol-2_Nr-028_7934_0000.txt
[31m538[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1748_La-Spectatrice-danoise-ou-l-Aspasie-moderne_Laurent-Angliviel-de-la-Beaumelle_Vol-1_Nr-027_6606_0001.txt
[31m538[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1750--1749-1751_La-Spectatrice-Ouvrage-traduit-de-l-anglois_Anonym-(Eliza-Haywood)_Vol-1_Nr-002_4521_0025.txt
[31m538[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1750--1749-1751_La-Spectatrice-Ouvrage-traduit-de-l-anglois_Anonym-(Eliza-Haywood)_Vol-1_Nr-003_4525_0011.txt
[31m538[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1750--1749-1751_La-Spectatrice-Ouvrage-traduit-de-l-anglois_Anonym-(Eliza-Haywood)_Vol-3_Nr-003_6401_0010.txt
[31m538[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1750--1749-1751_La-Spectatrice-Ouvrage-traduit

[31m575[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1711-1712_Le-Misantrope_Justus-Van-Effen_Vol-2_Nr-026_3053_0000.txt
[31m575[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1715--1714_Le-Censeur-ou-Caractères-des-Mœurs-de-la-Haye_Anonym-(Jean-Rousset-de-Missy-+-Nicolas-de-Guedeville)_Vol-1_Nr-005_6412_0001.txt
[31m575[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1720_Le-Spectateur-ou-le-Socrate-moderne_Anonym_Vol-4_Nr-045_2315_0000.txt
[31m575[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1723_Le-Mentor-moderne_Justus-Van-Effen-(Joseph-Addison-Richard-Steele)_Vol-1_Nr-042_6478_0001.txt
[31m575[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1723_Le-Mentor-moderne_Justus-Van-Effen-(Joseph-Addison-Richard-Steele)_Vol-3_Nr-115_6860_0000.txt
[31m575[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1723_Le-Mentor-moderne_Justus-Van-Effen-(Joseph-Addison-Richard-Steele)_Vol-3_Nr-145_6890_0001.txt


[31m610[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1723_Le-Mentor-moderne_Justus-Van-Effen-(Joseph-Addison-Richard-Steele)_Vol-3_Nr-125_6870_0003.txt
[31m610[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1723_Le-Spectateur-ou-le-Socrate-moderne_Anonym_Vol-5_Nr-006_2370_0001.txt
[31m610[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1723_Le-Spectateur-ou-le-Socrate-moderne_Anonym_Vol-5_Nr-013_2427_0002.txt
[31m610[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1723_Le-Spectateur-ou-le-Socrate-moderne_Anonym_Vol-5_Nr-041_2458_0001.txt
[31m610[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1723_Le-Spectateur-ou-le-Socrate-moderne_Anonym_Vol-5_Nr-049_2466_0001.txt
[31m610[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1726_Le-Spectateur-ou-le-Socrate-moderne_Anonym_Vol-6_Nr-012_2714_0000.txt
[31m610[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1728_La-Spectatrice_Anonym_Vol-1_Nr-

[31m673[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1751_La-Bigarure_Anonyme-(Claude-de-Crébillon)_Vol-10_Nr-013_7977_0001.txt
[31m673[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1752_Le-Spectateur-françois_Pierre-Carlet-de-Marivaux_Vol-1_Nr-024_122-1359_0002.txt
[31m673[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1759_Le-Nouveau-Spectateur_Jean-François-de-Bastide_Vol-5_Nr-009_3704_0012.txt
[31m674[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1716_Le-Spectateur-ou-le-Socrate-moderne_Anonym_Vol-3_Nr-014_11C-1273_0001.txt
[31m674[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1723_Le-Mentor-moderne_Justus-Van-Effen-(Joseph-Addison-Richard-Steele)_Vol-2_Nr-069_6700_0002.txt
[31m674[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1750_La-Bigarure_Anonyme-(Joseph-Marie-Durey-de-Morsan)_Vol-7_Nr-016_7312_0003.txt
[31m674[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1759_Le-No

In [19]:
# shortest and longest document, in tokens
token_counts = [item['tokens'] for item in word_lens]
lo, hi = min(token_counts), max(token_counts)

print(lo)

print(hi)

90
1301


In [20]:
len_sum = 0
for file in (item['tokens'] for item in word_lens):
    len_sum += int(file)

len_sum/len(word_lens)

559.5870853080569
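
The minimum, maximum and mean above can also be collected in one summary with the standard library. A small sketch with made-up token counts (`word_lens_demo` is a hypothetical stand-in for `word_lens`); the median is added because it is less sensitive to a few very long files than the mean:

```python
import statistics

# made-up example data in the same shape as `word_lens`
word_lens_demo = [
    {'filename': 'a_0000.txt', 'tokens': 90},
    {'filename': 'a_0001.txt', 'tokens': 500},
    {'filename': 'b_0000.txt', 'tokens': 610},
]

counts = [item['tokens'] for item in word_lens_demo]
summary = {
    'min': min(counts),
    'max': max(counts),
    'mean': statistics.mean(counts),
    'median': statistics.median(counts),
}
print(summary)
```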

In [21]:
# short files
i = 0
for file in sorted_lens:
    if file['tokens'] < 200:
        print(colored(file['tokens'], 'red'), file['filename'])
        i+=1
print('Total number of short files: ', i)

[31m90[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1723-1725_Le-Nouveau-Spectateur-français_Justus-Van-Effen_Vol-3_Nr-000_3391_0000.txt
[31m158[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1711-1712_Le-Misantrope_Justus-Van-Effen_Vol-1_Nr-001_2948_0000.txt
[31m172[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1760-1761_Le-Monde_Jean-François-de-Bastide_Vol-3_Nr-004_6951_0003.txt
[31m175[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1760-1761_Le-Monde_Jean-François-de-Bastide_Vol-4_Nr-002_6969_0008.txt
[31m178[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1760_Le-Monde-comme-il-est_Jean-François-de-Bastide_Vol-2_Nr-017_4276_0003.txt
[31m180[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1750_La-Bigarure_Anonyme-(Joseph-Marie-Durey-de-Morsan)_Vol-6_Nr-003_7216_0003.txt
[31m181[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1750_La-Bigarure_Anonyme-(Joseph-Marie-Durey-de-Morsan

In [22]:
# long files
i = 0
for file in sorted_lens:
    if file['tokens'] > 1000:
        print(colored(file['tokens'], 'red'), file['filename'])
        i+=1
print('Total number of long files: ', i)


[31m1001[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1759_Le-Nouveau-Spectateur_Jean-François-de-Bastide_Vol-4_Nr-007_3662_0002.txt
[31m1007[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1795_Le-Spectateur-français-avant-la-Révolution_Jacques-Vincent-Delacroix_Vol-1_Nr-043_6560_0002.txt
[31m1011[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1758_Le-Nouveau-Spectateur_Jean-François-de-Bastide_Vol-1_Nr-004_3149_0020.txt
[31m1021[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1723_Le-Mentor-moderne_Justus-Van-Effen-(Joseph-Addison-Richard-Steele)_Vol-2_Nr-078_6711_0002.txt
[31m1029[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1716_Le-Spectateur-ou-le-Socrate-moderne_Anonym_Vol-2_Nr-036_396_0001.txt
[31m1036[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1728_La-Spectatrice_Anonym_Vol-1_Nr-014_127-1394_0003.txt
[31m1037[0m Y:/data/projekte/dispecs/TopicModeling/dispecs_fr_paragr\1716_Le-Spect