# Analysis

This notebook reads the list of selected authors from `res/selected/author_list` and loads their text data from the book and generated resources available in the `res` directory.
Displayed data:
- The number of words for each source-author pair.
- Analysis of a chunk of `configuration.analysis_size` size:
    - Average word length
    - Average sentence length
    - Unique words count
    - Top 10 function words
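
The four chunk-level metrics listed above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: `describe_chunk` and `FUNCTION_WORDS` are hypothetical names, the tokenization uses simple regular expressions rather than the project's tokenizer, and the function-word set is a tiny hand-picked sample rather than the list from the `functionwords` package.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative set, NOT the list from the functionwords package.
FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "that", "it", "was", "he"}


def describe_chunk(text, top_n=10):
    """Compute simple stylometric metrics for one chunk of text."""
    # Rough sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Rough word tokenization: lowercase runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        raise ValueError("the chunk contains no words")
    function_counts = Counter(w for w in words if w in FUNCTION_WORDS)
    return {
        "average_word_length": sum(len(w) for w in words) / len(words),
        "average_sentence_length": len(words) / len(sentences),
        "unique_words_count": len(set(words)),
        "top_function_words": function_counts.most_common(top_n),
    }


stats = describe_chunk("The cat sat. The dog ran to the cat.")
print(stats)
```

Here average sentence length is measured in words per sentence and average word length in characters per word; the notebook may define either differently.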

Further work on this notebook is planned.

*Note: the list of function words is taken from the [functionwords](https://pypi.org/project/functionwords/) package.*

In [1]:
%load_ext autoreload
%autoreload 2

from src import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\piotr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package cmudict to
[nltk_data]     C:\Users\piotr\AppData\Roaming\nltk_data...
[nltk_data]   Package cmudict is already up-to-date!


In [2]:
settings = Settings()

In [3]:
cleaner = Cleaner(settings)
if not cleaner.cleaned_generated_corpus_exists():
    print("Cleaning generated corpus...")
    cleaner.clean_generated_corpus()
if not cleaner.cleaned_books_corpus_exists():
    print("Cleaning books corpus...")
    cleaner.clean_books_corpus()

In [4]:
print("Reading authors' collections...")
authors = []
for author_name in FileUtils.read_authors(settings.paths.selected_authors_filepath):
    author = Author(
        settings=settings,
        name=author_name
    )
    author.read_selected_books_collection()
    author.read_generated_texts()
    authors.append(author)

Reading authors' collections...


In [5]:
print("Preprocessing collections...")
preprocessing_data = Preprocessing(settings, authors).preprocess()

Preprocessing collections...


In [6]:
analysis = Analysis(
    settings=settings,
    preprocessing_data=preprocessing_data,
    read_from_file=settings.configuration.read_analysis_from_file
)

In [8]:
print(f"Analysing collections using sample of {settings.configuration.analysis_number_of_words} words per collection...")
analysis_data = analysis.get_analysis(authors)
print(f"Text removed while cleaning: {analysis_data.metadata.percentage_of_removed_text}%.")

Analysing collections using sample of 200000 words per collection...


TypeError: MetadataAnalysis.get_percentage_of_removed_text() missing 1 required positional argument: 'authors'
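
The message above indicates that `get_percentage_of_removed_text` was invoked without its `authors` argument. The sketch below reproduces that kind of error with a hypothetical stand-in class. It is not the project's `MetadataAnalysis`: the constructor, the stored fields and the percentage formula are all assumptions made for illustration.

```python
class MetadataAnalysis:
    """Hypothetical stand-in, NOT the project's real class."""

    def __init__(self, removed_by_author, total_by_author):
        # Assumed storage: character counts keyed by author name.
        self.removed_by_author = removed_by_author
        self.total_by_author = total_by_author

    def get_percentage_of_removed_text(self, authors):
        # `authors` is a required positional argument, as in the traceback.
        removed = sum(self.removed_by_author[a] for a in authors)
        total = sum(self.total_by_author[a] for a in authors)
        return 100 * removed / total


metadata = MetadataAnalysis({"austen": 50}, {"austen": 1000})

try:
    # Calling without `authors` raises the same kind of TypeError.
    metadata.get_percentage_of_removed_text()
except TypeError as error:
    message = str(error)
    print(message)

# Passing the argument explicitly avoids the error.
percentage = metadata.get_percentage_of_removed_text(["austen"])
print(percentage)
```

If the real method is reached through a property or a cached attribute, the caller has no way to supply `authors`, so the fix would belong in `src` rather than in this cell.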

In [11]:
# Single quotes inside the f-string keep this valid on Python versions before 3.12.
print(f"Top 10 features of PC1: {analysis_data.pca.top_features['PC1']}")
print(f"Top 10 features of PC2: {analysis_data.pca.top_features['PC2']}")

Top 10 features of PC1: ['of', 'yules_characteristic_k', 'simpsons_index', 'the', 'our', 'its', 'flesch_reading_ease', 'he', 'was', '-']
Top 10 features of PC2: ['to', '*', 'and', 'in', 'as', 'that', 'a', 'average_sentence_length', ',', '.']
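
Two of the PC1 features, `yules_characteristic_k` and `simpsons_index`, are lexical-richness measures. The sketch below uses their common textbook definitions, $K = 10^4 \cdot (\sum_m m^2 V_m - N) / N^2$ and $D = \sum_i n_i (n_i - 1) / (N (N - 1))$, where $N$ is the number of tokens, $n_i$ the count of word $i$, and $V_m$ the number of distinct words occurring exactly $m$ times. The function names are illustrative, and the project's own implementation may differ in scaling or tokenization.

```python
from collections import Counter


def yules_k(words):
    """Yule's characteristic K for a list of tokens (textbook form)."""
    n = len(words)
    frequencies = Counter(words)
    # Frequency spectrum: V_m = number of distinct words seen exactly m times.
    spectrum = Counter(frequencies.values())
    second_moment = sum(m * m * v_m for m, v_m in spectrum.items())
    return 10_000 * (second_moment - n) / (n * n)


def simpsons_index(words):
    """Probability that two tokens drawn without replacement are the same word."""
    n = len(words)
    frequencies = Counter(words)
    return sum(c * (c - 1) for c in frequencies.values()) / (n * (n - 1))


words = "the cat sat the dog ran to the cat".split()
print(yules_k(words), simpsons_index(words))
```

Both values grow as the vocabulary becomes more repetitive, which may explain why they load on the same component.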


In [17]:
visualization = AnalysisVisualization(settings)
visualization.visualize(analysis_data)