# Analysis

The notebook reads the list of selected authors from `res/selected/author_list` and loads their text data from the book and generated resources available in the `res` directory.
Displayed data:
- Number of words for each source-author pair
- Analysis of a sample of `configuration.analysis_size` words per collection
    - Average word length
    - Average sentence length
    - Unique word count
    - Top 10 function words
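
The metrics listed above can be sketched in a few lines of plain Python. This is only an illustration, not the code behind the `Analysis` class: the function name `analyse_chunk`, the regex-based tokenisation, the sentence splitting on `.`, `!` and `?`, and the small `FUNCTION_WORDS` set (a stand-in for the list shipped with the functionwords package) are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stand-in for the function word list; the notebook itself
# takes the list from the functionwords package.
FUNCTION_WORDS = {"the", "a", "an", "and", "of", "to", "in", "it", "is", "was", "that", "on"}


def analyse_chunk(text, analysis_size=10000, top_n=10):
    """Compute simple stylometric metrics on the first `analysis_size` words."""
    sentences = []
    remaining = analysis_size
    # Naive sentence split on terminal punctuation (an assumption of this sketch).
    for raw_sentence in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", raw_sentence.lower())
        if not tokens:
            continue
        # Truncate the last sentence so the chunk never exceeds analysis_size.
        tokens = tokens[:remaining]
        sentences.append(tokens)
        remaining -= len(tokens)
        if remaining == 0:
            break

    words = [word for sentence in sentences for word in sentence]
    if not words:
        raise ValueError("text contains no words")

    function_counts = Counter(word for word in words if word in FUNCTION_WORDS)
    return {
        "word_count": len(words),
        "average_word_length": sum(len(word) for word in words) / len(words),
        "average_sentence_length": len(words) / len(sentences),
        "unique_word_count": len(set(words)),
        "top_function_words": function_counts.most_common(top_n),
    }


if __name__ == "__main__":
    print(analyse_chunk("The cat sat on the mat. It was a good day."))
```

Truncating at a fixed word count keeps collections of different sizes comparable, at the cost of possibly cutting the final sentence short.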

Further work on this notebook is planned.

*Note: the list of function words is taken from the [functionwords](https://pypi.org/project/functionwords/) package.*

In [10]:
%load_ext autoreload
%autoreload 2

from src import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [11]:
settings = Settings()

In [12]:
cleaner = Cleaner(settings)
if not cleaner.cleaned_generated_corpus_exists():
    print("Cleaning generated corpus...")
    cleaner.clean_generated_corpus()
if not cleaner.cleaned_books_corpus_exists():
    print("Cleaning books corpus...")
    cleaner.clean_books_corpus()

In [13]:
print("Reading authors' collections...")
authors = []
for author_name in FileUtils.read_authors(settings.paths.selected_authors_filepath):
    author = Author(
        settings=settings,
        name=author_name
    )
    author.read_selected_books_collection()
    author.read_generated_texts()
    authors.append(author)

Reading authors' collections...


In [14]:
print("Preprocessing collections...")
preprocessing_data = Preprocessing(settings, authors).preprocess()

Preprocessing collections...


In [15]:
analysis = Analysis(
    settings=settings,
    authors=authors,
    preprocessing_data=preprocessing_data,
    read_from_file=settings.configuration.read_analysis_from_file
)

In [24]:
print(f"Analysing collections using sample of {settings.configuration.analysis_size} words per collection...")
analysis_data = analysis.get_analysis(authors)
print(f"Text removed while cleaning: {analysis_data.percentage_of_removed_text}%.")

Analysing collections using sample of 10000 words per collection...
Text removed while cleaning: 0.9401511228114402%.
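
How `percentage_of_removed_text` is computed is not shown in this notebook. One plausible definition, assuming the figure compares character counts before and after cleaning, is sketched below; the helper name `percentage_removed` is hypothetical.

```python
def percentage_removed(original, cleaned):
    """Share of characters dropped by cleaning, as a percentage.

    Assumes a character-count definition; the notebook's own metric
    may be based on words or bytes instead.
    """
    if not original:
        return 0.0
    return 100.0 * (len(original) - len(cleaned)) / len(original)


if __name__ == "__main__":
    print(percentage_removed("a" * 200, "a" * 198))
```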


In [50]:
visualization = AnalysisVisualization(settings)
visualization.visualize(analysis_data)