# About

In this Jupyter Notebook, I "analyze" Volume 1 of Leo Tolstoy's classic novel "War and Peace" in the original Russian. I use Python's re (regular expression) module and the nltk (Natural Language Toolkit) package to extract and analyze the text of the novel.

For the most part, I was practicing regex in this notebook, but I decided to do a bit more than just write regex to make things more interesting.

*P.S. I might add more analysis and regex stuff, stay tuned*

## Load the data

In [8]:
with open("data/voyna-i-mir-tom-1.txt", "r") as file:
    book = file.read()

## What are the most commonly used words?

In [9]:
# import package to use regex
import re

In [18]:
# match runs of lowercase Cyrillic/Latin letters and apostrophes
pattern = re.compile("[а-яa-z']+")
matches = pattern.findall(book.lower())

# count how often each word occurs
most_used_words = {}

for word in matches:
    most_used_words[word] = most_used_words.get(word, 0) + 1

# build (count, word) pairs and sort them by count, highest first
sorted_words = [(value, key) for (key, value) in most_used_words.items()]
sorted_words.sort(reverse=True)

sorted_words[:20]


[(468, 'а'),
 (453, 'по'),
 (423, 'вс'),
 (403, 'из'),
 (392, 'ему'),
 (384, 'от'),
 (374, 'вы'),
 (368, 'был'),
 (366, 'же'),
 (358, 'ее'),
 (329, 'у'),
 (323, 'бы'),
 (320, 'о'),
 (303, 'только'),
 (294, 'андрей'),
 (291, 'еще'),
 (234, 'мне'),
 (231, 'все'),
 (229, 'ты'),
 (228, 'ростов'),
 (217, 'него'),
 (217, 'для'),
 (216, 'de'),
 (213, 'пьер'),
 (208, 'да'),
 (207, 'уже'),
 (207, 'была'),
 (201, 'они'),
 (199, 'себя'),
 (197, 'когда'),
 (195, 'сказала'),
 (191, 'говорил'),
 (184, 'теперь'),
 (183, 'очень'),
 (181, 'вот'),
 (180, 'чтобы'),
 (178, 'ну'),
 (176, 'vous'),
 (174, 'будто'),
 (173, 'ни'),
 (173, 'быть'),
 (172, 'их'),
 (172, 'были'),
 (171, 'время'),
 (168, 'нет'),
 (163, 'меня'),
 (162, 'который'),
 (161, 'ничего'),
 (158, 'или'),
 (158, 'до'),
 (156, 'князя'),
 (154, 'этого'),
 (152, 'анна'),
 (151, 'того'),
 (150, 'опять'),
 (149, 'вас'),
 (145, 'a'),
 (142, 'княжна'),
 (140, 'всех'),
 (138, 'под'),
 (138, 'глаза'),
 (137, 'чем'),
 (136, 'лицо'),
 (136, 'будет'),
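
The 'вс' entry in the output looks like a truncated "всё": the range а-я covers U+0430 to U+044F, while ё is U+0451, so the pattern cuts words at every ё. Below is a minimal sketch of one possible fix, adding ё to the character class and counting with collections.Counter. The sample string is a hypothetical stand-in for the book text, and I have not re-run this on the full novel, so the counts above would change.

```python
import re
from collections import Counter

# Hypothetical sample text; the notebook itself uses the full `book` string.
sample = "Всё ещё впереди. Всё было хорошо."

# а-я spans U+0430..U+044F; ё (U+0451) lies outside it and must be listed explicitly.
narrow = re.compile(r"[а-яa-z']+")
wide = re.compile(r"[а-яёa-z']+")

narrow_counts = Counter(narrow.findall(sample.lower()))
wide_counts = Counter(wide.findall(sample.lower()))

print(narrow_counts.most_common(3))  # words containing ё are split here
print(wide_counts.most_common(3))    # words containing ё stay whole
```

Counter.most_common(n) returns (word, count) pairs, so it could replace the manual dictionary and sort if the tuple order is not important.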
 

### Use NLTK to remove "stopwords" from the list

In [17]:
import nltk
from nltk.corpus import stopwords

# download the NLTK stopwords corpus
nltk.download("stopwords")

# get stop words for the Russian language
ru_stopwords = stopwords.words("russian")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Acer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [16]:
# create new word list without stopwords
filtered_words = []

for num, word in sorted_words:
    if word not in ru_stopwords:
        filtered_words.append((num, word))

filtered_words[:20]

[(625, 'князь'),
 (584, 'это'),
 (568, 'сказал'),
 (423, 'вс'),
 (294, 'андрей'),
 (228, 'ростов'),
 (216, 'de'),
 (213, 'пьер'),
 (195, 'сказала'),
 (191, 'говорил'),
 (183, 'очень'),
 (176, 'vous'),
 (171, 'время'),
 (162, 'который'),
 (156, 'князя'),
 (152, 'анна'),
 (145, 'a'),
 (142, 'княжна'),
 (138, 'глаза'),
 (136, 'лицо')]
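
The filtered list still contains words such as 'это', 'de' and 'vous', which the Russian stopword list does not cover. A minimal sketch of one way to handle this is to extend the stopwords with a hand-picked set. Everything here is an assumption for illustration: ru_stopwords is a tiny stand-in for the NLTK list, sorted_words holds a few pairs copied from the outputs above, and extra_stopwords is a hypothetical name.

```python
# Stand-ins for the notebook variables (assumptions, not the real data).
ru_stopwords = ["а", "по", "из"]
sorted_words = [(625, "князь"), (584, "это"), (468, "а"), (216, "de"), (176, "vous")]

# Hand-picked additions; which words belong here is a judgment call.
extra_stopwords = {"это", "de", "vous"}

# A set gives fast membership checks compared to scanning a list.
stopword_set = set(ru_stopwords) | extra_stopwords

filtered_words = [(num, word) for num, word in sorted_words if word not in stopword_set]
print(filtered_words)
```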

## How many chapters are in the book?

In [13]:
# [\n\n] is a character class, so it matches a single newline, same as \n
pattern = re.compile("[ixv]+\n")
matches = re.findall(pattern, book.lower())

len(matches)

68
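
This pattern matches any run of i, x or v right before a newline, so a line that merely ends in those letters (French "voix", for example) is counted too. A stricter sketch, assuming each chapter heading is a Roman numeral alone on its own line, anchors the pattern to whole lines with re.MULTILINE. The sample string is hypothetical, and I have not checked how this changes the count for the real file.

```python
import re

# Hypothetical excerpt; the notebook would use the full `book` string.
sample = "ЧАСТЬ ПЕРВАЯ\n\nI\n\nEh bien, mon prince.\nvoix\n\nII\n\nтекст\n"

# Loose: letters i/x/v followed by a newline, anywhere in a line.
loose = re.compile("[ixv]+\n")

# Strict: the whole line must consist of i/v/x only.
strict = re.compile(r"^[ivx]+$", re.MULTILINE)

loose_count = len(loose.findall(sample.lower()))
strict_count = len(strict.findall(sample.lower()))

print(loose_count)   # also counts the "ix" at the end of "voix"
print(strict_count)  # counts only the lines "i" and "ii"
```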

## What's the overall mood of the book?

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download("vader_lexicon")
analyzer = SentimentIntensityAnalyzer()
points = analyzer.polarity_scores(book)
points

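In the points variable, pos is the proportion of the text rated positive, neg is the proportion rated negative, and compound is an overall score normalized to the range from -1 to 1. The compound value is negative, which suggests the text's sentiment leans negative. Keep in mind that VADER's lexicon is English, so for a Russian text this score is only a rough indication.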

I also plan to analyze the most frequently mentioned character names, but I need to figure out the best way to do it.
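
One possible starting point, sketched here under assumptions rather than as a finished solution, is to count every word that begins with a hand-picked name stem, so that case forms like "Пьер" and "Пьера" land in the same bucket. The sample text and the name_stems mapping are hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample; the notebook would pass the full `book` string instead.
sample = "Пьер вошёл. Князь Андрей посмотрел на Пьера. Андрею было скучно."

# Hypothetical stems chosen by hand; each should cover the case endings of one name.
name_stems = {"Пьер": "пьер", "Андрей": "андре"}

name_counts = Counter()
for name, stem in name_stems.items():
    # \b and \w are Unicode-aware for str patterns in Python 3.
    pattern = re.compile(rf"\b{stem}\w*")
    name_counts[name] = len(pattern.findall(sample.lower()))

print(name_counts.most_common())
```

A stem can overmatch (the stem for "Андрей" would also catch the patronymic "Андреевич"), so the stems would need checking against the actual text.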