# Analysis

This notebook reads the list of selected authors from `res/selected/author_list` and loads their text data from the book and generated resources available in the `res` directory.
Displayed data:
- Number of words for each source-author pair
- Analysis of a chunk of size `configuration.analysis_size`
    - Average word length
    - Average sentence length
    - Unique words count
    - Top 10 function words

Further work on this notebook is planned.

*Note: the list of function words is taken from the [functionwords](https://pypi.org/project/functionwords/) package.*
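
The metrics listed above can be illustrated with a minimal, standalone sketch. It does not reproduce the `Analysis` class used in this notebook: the `chunk_metrics` helper, its regex-based tokenization, and the small `FUNCTION_WORDS` set are all assumptions made for illustration (the notebook itself takes its function-word list from the functionwords package).

```python
import re
from collections import Counter

# Hypothetical, hand-picked subset for illustration only; the notebook
# takes its function-word list from the functionwords package instead.
FUNCTION_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def chunk_metrics(text: str, top_n: int = 10) -> dict:
    """Compute simple stylometric metrics for one chunk of text.

    Tokenization is a naive regex split, which is an assumption and
    may differ from the tokenization used by the project code.
    """
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased alphabetic tokens (apostrophes kept inside words).
    words = re.findall(r"[A-Za-z']+", text.lower())

    if not words or not sentences:
        return {
            "average_word_length": 0.0,
            "average_sentence_length": 0.0,
            "unique_word_count": 0,
            "top_function_words": [],
        }

    function_counts = Counter(w for w in words if w in FUNCTION_WORDS)
    return {
        "average_word_length": sum(len(w) for w in words) / len(words),
        # Sentence length is measured in words per sentence.
        "average_sentence_length": len(words) / len(sentences),
        "unique_word_count": len(set(words)),
        "top_function_words": function_counts.most_common(top_n),
    }


sample = "The cat sat on the mat. It is a cat."
metrics = chunk_metrics(sample)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Measuring sentence length in words rather than characters is a design choice of this sketch; the project code may define it differently.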

In [None]:
%load_ext autoreload
%autoreload 2

from src import *

In [None]:
settings = Settings()

In [None]:
cleaner = Cleaner(settings)
if not cleaner.cleaned_generated_corpus_exists():
    print("Cleaning generated corpus...")
    cleaner.clean_generated_corpus()
if not cleaner.cleaned_books_corpus_exists():
    print("Cleaning books corpus...")
    cleaner.clean_books_corpus()

In [None]:
print("Reading authors' collections...")
authors = []
for author_name in FileUtils.read_authors(settings.paths.selected_authors_filepath):
    author = Author(
        settings=settings,
        name=author_name
    )
    author.read_selected_books_collection()
    author.read_generated_texts()
    authors.append(author)

In [None]:
analysis = Analysis(
    settings=settings,
    authors=authors
)
raw_words_count, cleaned_words_count = analysis.get_count_words()

print("Raw words count: ", raw_words_count)
print("Cleaned words count: ", cleaned_words_count)
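# Guard against division by zero when the raw corpus is empty.
removed_percent = 100 * (raw_words_count - cleaned_words_count) / raw_words_count if raw_words_count else 0.0
print("Removed words: ", removed_percent, "%")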

In [None]:
analysis_data = analysis.analyze(authors)

In [None]:
AnalysisVisualization().visualize(analysis_data)