# Analysis

This notebook analyzes the text of the original books and of the generated texts. The results are then visualized to draw insights.

In [1]:
%load_ext autoreload
%autoreload 2

from src import *
import nltk
nltk.download('all')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\piotr\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\piotr\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\piotr\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     C:\Users\piotr\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\piotr\AppData\Roaming\nltk_data...
[

True

In [2]:
settings = Settings()

The original texts are cleaned unless a cleaned version already exists. Cleaning applies the following steps.

#### Generated texts

- Discard texts that are too small
- Discard texts with repeated substrings at the end of the response
- Remove emoji codes
- Remove `@` characters
- Remove HTML tags

#### Books

- Remove italic formatting 
- Remove dividers
- Remove illustration annotations
- Remove note annotations
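
As an illustration of the generated-text steps listed above, here is a minimal sketch using only the standard library. It is not the project's `Cleaner`: the function name `clean_generated_text`, the regular expressions, and the `MIN_WORDS` threshold are all assumptions made for this example, and the repeated-substring filter is left out.

```python
import re
from typing import Optional

# Hypothetical patterns and threshold; the project's Cleaner may use different rules.
EMOJI_CODE = re.compile(r":[a-z_][a-z0-9_+\-]*:")  # emoji codes such as ":smile:"
HTML_TAG = re.compile(r"<[^>]+>")                   # tags such as "<br>" or "</p>"
MIN_WORDS = 5                                       # assumed cutoff for "too small"


def clean_generated_text(text: str, min_words: int = MIN_WORDS) -> Optional[str]:
    """Return the cleaned text, or None if the text should be discarded."""
    text = HTML_TAG.sub("", text)      # remove HTML tags
    text = EMOJI_CODE.sub("", text)    # remove emoji codes
    text = text.replace("@", "")       # remove `@` characters
    text = re.sub(r"[ \t]+", " ", text).strip()  # collapse leftover spaces
    if len(text.split()) < min_words:  # discard texts that are too small
        return None
    return text
```

For example, `clean_generated_text("<p>Hello @world :smile: this is a fine day</p>")` yields a plain sentence, while a two-word response is dropped by returning `None`.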

In [3]:
cleaner = Cleaner(settings)
if not cleaner.cleaned_generated_corpus_exists():
    print("Cleaning generated corpus...")
    cleaner.clean_generated_corpus()
if not cleaner.cleaned_books_corpus_exists():
    print("Cleaning books corpus...")
    cleaner.clean_books_corpus()

Read the collections of books and generated texts from files into memory.

In [4]:
print("Reading authors' collections...")
authors = []
for author_name in FileUtils.read_authors(settings.paths.selected_authors_filepath):
    author = Author(
        settings=settings,
        name=author_name
    )
    author.read_selected_books_collection()
    author.read_generated_texts()
    authors.append(author)

Reading authors' collections...


During preprocessing, the texts are split into chunks. Only a subset of the chunks is used for the analysis; its size is set by `settings.configuration.analysis_number_of_words`. To avoid selection bias, the chunks are shuffled before the subset is picked. The selected chunks are then split into sentences and words.
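
The chunk, shuffle, and pick procedure can be sketched roughly as follows. This is an assumption-laden illustration, not the project's `Preprocessing` class: `chunk_words`, `sample_chunks`, the fixed chunk size, and the seed are hypothetical, and sentence splitting is omitted.

```python
import random


def chunk_words(text: str, chunk_size: int) -> list[list[str]]:
    """Split a text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def sample_chunks(chunks: list[list[str]], number_of_words: int, seed: int = 0) -> list[list[str]]:
    """Shuffle the chunks, then pick them until the word budget is reached."""
    rng = random.Random(seed)   # seeded so the selection is reproducible
    shuffled = chunks[:]        # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    selected, total = [], 0
    for chunk in shuffled:
        if total >= number_of_words:
            break
        selected.append(chunk)
        total += len(chunk)
    return selected
```

Because whole chunks are picked, the sample may slightly exceed the requested number of words; shuffling first keeps the selection from favoring the beginning of any text.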

In [5]:
print("Preprocessing collections...")
preprocessing_results = Preprocessing(settings, authors).preprocess()

Preprocessing collections...


In [6]:
print(f"Analysing collections using sample of {settings.configuration.analysis_number_of_words} words per collection...")
analysis_results = Analysis(
    settings=settings,
    preprocessing_results=preprocessing_results,
    read_from_file=settings.configuration.read_analysis_from_file
).get_analysis(authors)
print(f"Text removed while cleaning: {analysis_results.metadata.percentage_of_removed_text}%.")

Analysing collections using sample of 200000 words per collection...
Text removed while cleaning: 0.9765146365788469%.


In [7]:
print(f"Top 10 features of PC1: {analysis_results.full.pca.top_features['PC1']}")
print(f"Top 10 features of PC2: {analysis_results.full.pca.top_features['PC2']}")

Top 10 features of PC1: ['of', 'yules_characteristic_k', 'simpsons_index', 'the', 'our', '-', 'he', 'was', 'its', '"']
Top 10 features of PC2: ['to', '*', 'and', 'in', 'as', 'that', 'average_sentence_length', 'a', ',', 'gunning_fog_index']
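
Two of the PC1 features, `yules_characteristic_k` and `simpsons_index`, are named after standard lexical-diversity measures. The sketch below uses their common textbook definitions; whether the project computes them in exactly this way is an assumption, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def yules_k(words: Iterable[str]) -> float:
    """Yule's K = 10^4 * (sum of f^2 over word types - N) / N^2, N = token count."""
    freqs = Counter(words)
    n = sum(freqs.values())
    if n == 0:
        return 0.0
    sum_sq = sum(f * f for f in freqs.values())
    return 1e4 * (sum_sq - n) / (n * n)


def simpsons_index(words: Iterable[str]) -> float:
    """Simpson's D = sum of f * (f - 1) over word types / (N * (N - 1))."""
    freqs = Counter(words)
    n = sum(freqs.values())
    if n < 2:
        return 0.0
    return sum(f * (f - 1) for f in freqs.values()) / (n * (n - 1))
```

Both measures grow when a few words are repeated often and are zero when every word occurs exactly once, so higher values indicate a less varied vocabulary.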


In [10]:
AnalysisVisualization(settings).visualize(analysis_results)