# Analysis

This notebook reads the list of selected authors from `res/selected/author_list` and loads their text data from the book and generated resources available in the `res` directory.
Displayed data:
- The number of words for each source-author pair.
- Analysis of a chunk of size `configuration.analysis_size`:
    - Average word length
    - Average sentence length
    - Unique words count
    - Top 10 function words
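
As a rough illustration of the metrics listed above, the sketch below computes them for a single string using only the standard library. It is not the code used by the `Analysis` class: the regex tokenization, the `basic_metrics` helper, and the tiny `FUNCTION_WORDS` set are assumptions made for this example (the notebook takes its list from the functionwords package and relies on NLTK).

```python
import re
from collections import Counter

# Hypothetical stand-in list, for illustration only; the notebook uses
# the list shipped with the functionwords package.
FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "that", "it", "was", "he"}


def basic_metrics(text, top_n=10):
    """Compute simple stylometric metrics for one chunk of text."""
    # Naive sentence split on terminal punctuation (assumption: no abbreviations).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive word tokenization: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    function_counts = Counter(w for w in words if w in FUNCTION_WORDS)
    return {
        "average_word_length": sum(len(w) for w in words) / len(words),
        "average_sentence_length": len(words) / len(sentences),
        "unique_word_count": len(set(words)),
        "top_function_words": function_counts.most_common(top_n),
    }


sample = "The cat sat on the mat. It was the best of times."
metrics = basic_metrics(sample)
print(metrics)
```

On a real corpus, a proper sentence and word tokenizer would give more reliable averages than these regular expressions.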

Further work on this notebook is planned.

*Note: the list of function words is taken from the [functionwords](https://pypi.org/project/functionwords/) package.*

In [6]:
%load_ext autoreload
%autoreload 2

from src import *
import nltk
nltk.download('all')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\piotr\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\piotr\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\piotr\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     C:\Users\piotr\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\piotr\AppData\Roaming\nltk_data...

True

In [7]:
settings = Settings()

In [8]:
cleaner = Cleaner(settings)
if not cleaner.cleaned_generated_corpus_exists():
    print("Cleaning generated corpus...")
    cleaner.clean_generated_corpus()
if not cleaner.cleaned_books_corpus_exists():
    print("Cleaning books corpus...")
    cleaner.clean_books_corpus()

In [9]:
print("Reading authors' collections...")
authors = []
for author_name in FileUtils.read_authors(settings.paths.selected_authors_filepath):
    author = Author(
        settings=settings,
        name=author_name
    )
    author.read_selected_books_collection()
    author.read_generated_texts()
    authors.append(author)

Reading authors' collections...


In [10]:
print("Preprocessing collections...")
preprocessing_data = Preprocessing(settings, authors).preprocess()

Preprocessing collections...


In [11]:
analysis = Analysis(
    settings=settings,
    preprocessing_data=preprocessing_data,
    read_from_file=settings.configuration.read_analysis_from_file
)

In [12]:
print(f"Analysing collections using sample of {settings.configuration.analysis_number_of_words} words per collection...")
analysis_data = analysis.get_analysis(authors)
print(f"Text removed while cleaning: {analysis_data.metadata.percentage_of_removed_text}%.")

Analysing collections using sample of 200000 words per collection...
Text removed while cleaning: 1.0128367970170837%.


In [13]:
# Single quotes inside the f-string keep this valid on Python versions before 3.12.
print(f"Top 10 features of PC1: {analysis_data.pca.top_features['PC1']}")
print(f"Top 10 features of PC2: {analysis_data.pca.top_features['PC2']}")

Top 10 features of PC1: ['of', 'yules_characteristic_k', 'simpsons_index', 'the', 'our', 'its', '-', 'was', 'he', 'flesch_reading_ease']
Top 10 features of PC2: ['to', '*', 'and', 'in', 'as', 'that', 'a', 'average_sentence_length', ',', 'gunning_fog_index']
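
Two of the features named above, `yules_characteristic_k` and `simpsons_index`, are vocabulary-richness measures. The sketch below shows their textbook definitions, Yule's K as 10^4 · (Σ i²·V_i − N) / N² and Simpson's D as Σ n(n−1) / (N(N−1)), where N is the token count, n is the frequency of a word type, and V_i is the number of types occurring exactly i times. It assumes the notebook follows these standard formulas; the function names and the toy token list are chosen for this example and may not match the project's implementation.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's characteristic K: higher values mean more repetitive vocabulary."""
    n = len(tokens)
    if n == 0:
        raise ValueError("tokens must not be empty")
    type_freqs = Counter(tokens)              # word type -> frequency
    spectrum = Counter(type_freqs.values())   # frequency i -> number of types V_i
    s2 = sum(i * i * v_i for i, v_i in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


def simpsons_index(tokens):
    """Simpson's D: probability that two tokens drawn without replacement match."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    type_freqs = Counter(tokens)
    return sum(c * (c - 1) for c in type_freqs.values()) / (n * (n - 1))


tokens = ["a", "a", "b", "c"]  # toy input, not taken from the corpus
k = yules_k(tokens)
d = simpsons_index(tokens)
print(k, d)
```

Both measures depend only on the frequency distribution of word types, so they tend to move together, which may be one reason they appear side by side among the PC1 features.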


In [14]:
visualization = AnalysisVisualization(settings)
visualization.visualize(analysis_data)