# Analysis

This notebook analyzes the text of the original books and the generated texts. The results are then visualized to draw insights.

In [17]:
%load_ext autoreload
%autoreload 2

from src import *
import nltk
nltk.download('all')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\piotr\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\piotr\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\piotr\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     C:\Users\piotr\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\piotr\AppData\Roaming\nltk_data...

True

In [2]:
settings = Settings()

The raw texts are cleaned only if they have not been cleaned before. The cleaning steps for each corpus are listed below.

#### Generated texts

- Discard texts that are too small
- Discard texts with repeated substrings at the end of the response
- Remove emoji codes
- Remove `@` characters
- Remove HTML tags
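
The steps above could be sketched roughly as follows. This is not the project's `Cleaner` implementation: the threshold `MIN_CHARS`, the repetition heuristic, the `:shortcode:` interpretation of "emoji codes", and every function name here are assumptions made for illustration.

```python
import re

# Hypothetical values and patterns, chosen only for this sketch.
MIN_CHARS = 50                                    # assumed minimum usable length
EMOJI_CODE = re.compile(r":[a-z][a-z0-9_+\-]*:")  # assumed form, e.g. ":smile:"
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")       # e.g. "<br>", "</p>"


def ends_with_repetition(text, min_unit=5, max_unit=100, repeats=3):
    """Return True if the text ends with the same chunk repeated `repeats` times."""
    stripped = text.rstrip()
    for size in range(min_unit, max_unit + 1):
        if size * repeats > len(stripped):
            break
        unit = stripped[-size:]
        if stripped.endswith(unit * repeats):
            return True
    return False


def clean_generated_text(text):
    """Return the cleaned text, or None if the text should be discarded."""
    if len(text) < MIN_CHARS or ends_with_repetition(text):
        return None
    text = EMOJI_CODE.sub("", text)
    text = HTML_TAG.sub("", text)
    return text.replace("@", "")


sample = "<p>Hello @reader :smile: this is a sufficiently long generated paragraph.</p>"
cleaned = clean_generated_text(sample)
```

Discarding happens before any substitution, so a text is judged on what the model actually produced rather than on its cleaned form.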

#### Books

- Remove italic formatting 
- Remove dividers
- Remove illustration annotations
- Remove note annotations
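
A minimal sketch of the book cleaning steps, assuming the books are plain text with Project Gutenberg style markup: `_word_` for italics, lines such as `* * *` as dividers, and bracketed `[Illustration: ...]` and `[Note: ...]` annotations. These conventions and the function name `clean_book_text` are assumptions; the actual `Cleaner` may target different markers.

```python
import re

# Assumed markup conventions; adjust the patterns to the real source files.
ITALIC = re.compile(r"_([^_\n]+)_")                                    # "_word_" -> "word"
DIVIDER = re.compile(r"^[ \t]*(?:[*\-_=~][ \t]*){3,}$", re.MULTILINE)  # "* * *", "-----"
ILLUSTRATION = re.compile(r"\[Illustration[^\]]*\]", re.IGNORECASE)
NOTE = re.compile(r"\[(?:Foot|Side)?note[^\]]*\]", re.IGNORECASE)


def clean_book_text(text):
    """Strip annotations, dividers, and italic markers from a book's text."""
    text = ILLUSTRATION.sub("", text)
    text = NOTE.sub("", text)
    # Dividers are removed before italics so a line like "___" is not misread.
    text = DIVIDER.sub("", text)
    text = ITALIC.sub(r"\1", text)
    # Collapse the runs of blank lines left behind by removed blocks.
    return re.sub(r"\n{3,}", "\n\n", text).strip()


raw = (
    "It was _very_ late.\n\n* * *\n\n"
    "[Illustration: A dark house]\n\n"
    "She left.[Note: see chapter 2]"
)
book_text = clean_book_text(raw)
```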

In [3]:
cleaner = Cleaner(settings)
if not cleaner.cleaned_generated_corpus_exists():
    print("Cleaning generated corpus...")
    cleaner.clean_generated_corpus()    
if not cleaner.cleaned_books_corpus_exists():
    print("Cleaning books corpus...")
    cleaner.clean_books_corpus()

Read the book collections and the generated texts from files into memory.

In [4]:
print("Reading authors' collections...")
authors = []
for author_name in FileUtils.read_authors(settings.paths.selected_authors_filepath):
    author = Author(
        settings=settings,
        name=author_name
    )
    author.read_selected_books_collection()
    author.read_generated_texts()
    authors.append(author)

Reading authors' collections...


In [5]:
print(f"Text removed while cleaning: {MetadataAnalysis.get_percentage_of_removed_text(authors)}%.")

Text removed while cleaning: 0.9765146365788469%.
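
For reference, a percentage like the one above can be computed from the sizes of the raw and cleaned texts. The sketch below assumes the measure is character based, which the notebook does not state; `percentage_removed` and the sample lengths are hypothetical.

```python
def percentage_removed(raw_lengths, cleaned_lengths):
    """Share of text (in percent) that cleaning removed, summed over all texts."""
    total_raw = sum(raw_lengths)
    if total_raw == 0:
        return 0.0  # nothing to clean, so nothing was removed
    removed = total_raw - sum(cleaned_lengths)
    return 100.0 * removed / total_raw


# Made-up lengths: 40 of 4000 characters removed.
pct = percentage_removed([1000, 3000], [990, 2970])
print(f"Text removed while cleaning: {pct:.2f}%.")
```

Formatting with `:.2f` keeps the printed value readable without changing the underlying number.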


In [6]:
print("Preprocessing collections...")
preprocessing_results = Preprocessing(settings).preprocess(authors)

Preprocessing collections...


In [7]:
print(f"Analysing collections' metrics using sample of {settings.configuration.analysis_number_of_words} words per author-collection pair...")
metrics_analysis_results = MetricsAnalysis(settings).analyze(preprocessing_results)

Analysing collections' metrics using sample of 200000 words per author-collection pair...
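
Using the same number of words for every author-collection pair keeps length-sensitive metrics comparable. The sketch below illustrates the idea with two simple metrics; the sampling strategy (one contiguous slice at a seeded random offset), the metric choice, and both function names are assumptions and may differ from what `MetricsAnalysis` does.

```python
import random


def sample_words(words, n, seed=0):
    """Return a contiguous slice of n words at a seeded random offset."""
    if len(words) <= n:
        return list(words)
    start = random.Random(seed).randrange(len(words) - n + 1)
    return words[start:start + n]


def basic_metrics(words):
    """Compute two illustrative metrics on an equal-sized sample."""
    if not words:
        return {"type_token_ratio": 0.0, "average_word_length": 0.0}
    return {
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
        "average_word_length": sum(len(w) for w in words) / len(words),
    }


words = "the cat sat on the mat".split() * 10  # 60 toy words
sample = sample_words(words, 12)
metrics = basic_metrics(sample)
```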


In [8]:
print("Performing PCA analyses for given metrics data...")
pca_analysis_results = PCAAnalysis(settings).get_pca_analysis(metrics_analysis_results=metrics_analysis_results)

Performing PCA analyses for given metrics data...
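
The internals of `PCAAnalysis` are not shown here. As a rough sketch of the technique, a metrics matrix (rows are samples, columns are metrics) can be standardized and projected onto two principal components with scikit-learn. The data is random placeholder data, and the standardization step is an assumption about the pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder metrics matrix: 40 samples, 5 metrics (not real project data).
metrics_matrix = rng.normal(size=(40, 5))
# Make two columns strongly correlated and on very different scales.
metrics_matrix[:, 1] = 100 * metrics_matrix[:, 0] + rng.normal(size=40)

# Standardize first so large-scale metrics do not dominate the components.
scaled = StandardScaler().fit_transform(metrics_matrix)

pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

print(components.shape)
print(pca.explained_variance_ratio_)
```

`explained_variance_ratio_` shows how much of the total variance each component retains, which helps judge whether two components are enough for plotting and classification.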


In [9]:
MetricsAnalysisVisualization(settings).visualize(
    metrics_analysis_results=metrics_analysis_results
)

In [10]:
PCAAnalysisVisualization(pca_analysis_results=pca_analysis_results).run(port=8000)

# Classification

In [45]:
print("Performing logistic regression classification...")
logistic_regression_results = LogisticRegressionClassification(settings).fit_and_predict_on_pca_for_two_classes(
    pca_analysis_results=pca_analysis_results
)
print(f"Cross-validation accuracy: {logistic_regression_results.cross_validation_accuracy}")
print(f"Accuracy on all training data: {logistic_regression_results.final_accuracy}")
print(f"Report:\n{logistic_regression_results.report}")

Performing logistic regression classification...
Cross-validation accuracy: 0.98375
Accuracy on all training data: 0.9916666666666667
Report:
              precision    recall  f1-score   support

       human       0.96      0.99      0.97        72
         llm       1.00      0.99      1.00       408

    accuracy                           0.99       480
   macro avg       0.98      0.99      0.98       480
weighted avg       0.99      0.99      0.99       480
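
The classes in the report are imbalanced (72 human samples against 408 LLM samples), so a classifier that always predicts `llm` would already reach 408 / 480 = 0.85 accuracy. The sketch below shows one way to put the scores in context: stratified cross-validation reporting both plain and balanced accuracy. It runs on synthetic two-dimensional data that only mirrors the class sizes; the fold count, `class_weight="balanced"`, and the data itself are assumptions, not the settings used by `LogisticRegressionClassification`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for PCA coordinates; only the class sizes match the report.
human = rng.normal(loc=-2.0, scale=1.0, size=(72, 2))
llm = rng.normal(loc=2.0, scale=1.0, size=(408, 2))
X = np.vstack([human, llm])
y = np.array(["human"] * 72 + ["llm"] * 408)

# Accuracy of always predicting the larger class.
majority_baseline = max(np.mean(y == "human"), np.mean(y == "llm"))

# Stratified folds keep the 72:408 ratio in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(class_weight="balanced")

accuracy = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
balanced = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")

print(f"Majority baseline: {majority_baseline:.2f}")
print(f"Cross-validation accuracy: {accuracy.mean():.3f}")
print(f"Cross-validation balanced accuracy: {balanced.mean():.3f}")
```

Balanced accuracy averages the recall of each class, so it is not inflated by the larger `llm` class. Scores measured on the training data itself tend to be optimistic compared with the cross-validated ones.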



In [49]:
ClassificationVisualization.visualize(logistic_regression_results)


X does not have valid feature names, but LogisticRegression was fitted with feature names

