# Ukrainian Telegram messages research
#### Vynokury

---

## Intro

This notebook requires these modules to be installed:
- pandas
- matplotlib
- geopandas
- nltk
- pymorphy3
- pymorphy3-dicts-uk
- wordcloud

Project contributors:
- Andrii Kryvyi
- Nikita Lenyk
- Ostap Kostiuk
- Luka Konovalov

In [1]:
%pip install pandas matplotlib geopandas pymorphy3 pymorphy3-dicts-uk nltk wordcloud


## Initializing

In this part we initialize the required modules, read the dataset and clean it, examining how dirty it is along the way.

### Setting up Jupyter Notebook

In [2]:
%config InlineBackend.figure_formats = ['svg']

### Import modules

In [3]:
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import geopandas as gpd
import nltk
import re
import pymorphy3
import wordcloud

### Initialize NLP modules

In [4]:
nltk.download("punkt")

morph = pymorphy3.MorphAnalyzer(lang="uk")

[nltk_data] Downloading package punkt to /var/home/andriy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Read map of Ukraine

In [5]:
iso_regions = {
    "UA-65": "Херсонська",
    "UA-07": "Волинська",
    "UA-56": "Рівненська",
    "UA-18": "Житомирська",
    "UA-32": "Київська",
    "UA-74": "Чернігівська",
    "UA-59": "Сумська",
    "UA-63": "Харківська",
    "UA-09": "Луганська",
    "UA-14": "Донецька",
    "UA-23": "Запорізька",
    "UA-46": "Львівська",
    "UA-26": "Івано-Франківська",
    "UA-21": "Закарпатська",
    "UA-61": "Тернопільська",
    "UA-77": "Чернівецька",
    "UA-51": "Одеська",
    "UA-48": "Миколаївська",
    "UA-43": "Автономна Республіка Крим",
    "UA-05": "Вінницька",
    "UA-68": "Хмельницька",
    "UA-71": "Черкаська",
    "UA-53": "Полтавська",
    "UA-12": "Дніпропетровська",
    "UA-35": "Кіровоградська",
    "UA-30": "Київ",
    "UA-40": "Севастополь",
}

map_ = gpd.read_file("ukrainian_map.geojson")
map_["region"] = map_.shapeISO.map(iso_regions)
map_ = map_.drop(
    columns=["shapeName", "shapeISO", "shapeID", "shapeGroup", "shapeType"]
)
map_

Unnamed: 0,geometry,region
0,"POLYGON ((35.23342 45.79173, 35.22632 45.81739...",Херсонська
1,"POLYGON ((25.11276 50.28727, 25.11147 50.29428...",Волинська
2,"POLYGON ((25.11276 50.28727, 25.11291 50.28489...",Рівненська
3,"POLYGON ((27.19595 50.56224, 27.19661 50.55239...",Житомирська
4,"MULTIPOLYGON (((30.34907 50.48887, 30.34605 50...",Київська
5,"POLYGON ((32.14266 50.34881, 32.2184 50.35759,...",Чернігівська
6,"POLYGON ((33.06618 50.51985, 33.06913 50.51844...",Сумська
7,"POLYGON ((34.94082 50.15259, 34.93279 50.13471...",Харківська
8,"POLYGON ((37.87444 49.23388, 37.87342 49.22317...",Луганська
9,"POLYGON ((36.73891 48.62595, 36.74531 48.59654...",Донецька


### Read dataset

The dataset is stored in the `message-weather.csv` file in CSV format.

In [6]:
dataset = pd.read_csv("message-weather.csv")
dataset

Unnamed: 0,city,date_weather,latitude_decimal,longitude_decimal,max_temperature,min_temperature,region_x,temperature,wind_direction,wind_speed,weather_description,date_hour_x,tg_message
0,Суми,2022-12-02 12:32,50.911944,34.803333,-0.2,-3.3,Сумська,-0.2,343,31.2,Overcast,2022-12-02 12:00:00,💥 Хотінь (Сумська обл.)\nЗагроза артилерійсько...
1,Марганець,2022-12-02 11:33,47.644722,34.604167,2.7,-0.1,Дніпропетровська,2.7,3,28.5,Overcast,2022-12-02 11:00:00,💥 Марганець (Дніпропетровська обл.)\nЗагроза а...
2,Дніпро,2022-12-02 11:33,48.466111,35.025278,1.7,-1.6,Дніпропетровська,1.7,357,29.9,Partly cloudy,2022-12-02 11:00:00,💥 Марганець (Дніпропетровська обл.)\nЗагроза а...
3,Нікополь,2022-12-02 11:33,47.577222,34.357500,2.4,0.1,Дніпропетровська,2.4,7,25.4,Overcast,2022-12-02 11:00:00,💥 Марганець (Дніпропетровська обл.)\nЗагроза а...
4,Дніпро,2022-12-02 11:33,48.466111,35.025278,1.7,-1.6,Дніпропетровська,1.7,357,29.9,Partly cloudy,2022-12-02 11:00:00,💥 Марганець (Дніпропетровська обл.)\nЗагроза а...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8798,Херсон,2023-04-01 09:06,46.640000,32.614444,18.1,10.2,Херсонська,16.9,248,45.4,Mainly clear,2023-04-01 09:00:00,Херсонська область. Інформація щодо ворожих об...
8799,Херсон,2023-03-30 16:20,46.640000,32.614444,16.2,3.1,Херсонська,10.7,245,27.7,Mainly clear,2023-03-30 16:00:00,⚡️Отримав осколкове поранення під час бомбар...
8800,Херсон,2023-03-10 08:43,46.640000,32.614444,-1.5,-6.3,Херсонська,-2.7,65,36.1,Overcast,2023-03-10 08:00:00,Херсонська область. Інформація щодо ворожих об...
8801,Херсон,2023-03-10 08:43,46.640000,32.614444,-1.5,-6.3,Херсонська,-2.7,65,36.1,Overcast,2023-03-10 08:00:00,Херсонська область. Інформація щодо ворожих об...


### Clean columns

First of all, we rename the columns `tg_message` to `message_text`, `latitude_decimal` to `latitude`, `longitude_decimal` to `longitude` and `region_x` to `region`, so that the names are clear and easy to use.

In [7]:
dataset = dataset.rename(
    columns={
        "tg_message": "message_text",
        "latitude_decimal": "latitude",
        "longitude_decimal": "longitude",
        "region_x": "region",
    }
)

The `date_weather` and `date_hour_x` columns are also read as strings, although they hold dates, so we convert them:

In [8]:
dataset.date_weather = pd.to_datetime(dataset.date_weather, format="%Y-%m-%d %H:%M")
dataset.date_hour_x = pd.to_datetime(dataset.date_hour_x, format="%Y-%m-%d %H:%M:%S")
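
With an explicit `format`, `pd.to_datetime` raises an error as soon as it meets a value that does not match. A minimal sketch, on made-up sample strings rather than the real dataset, of how `errors="coerce"` could be used to count such values instead of failing (the variable names `raw`, `parsed` and `bad_rows` are hypothetical):

```python
import pandas as pd

# Hypothetical sample values; the last one is deliberately malformed.
raw = pd.Series(["2022-12-02 12:32", "2022-12-02 11:33", "not a date"])

# errors="coerce" turns values that do not match the format into NaT
# instead of raising an exception.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M", errors="coerce")

bad_rows = parsed.isna().sum()
print(f"unparseable timestamps: {bad_rows}")
```

If `bad_rows` is zero, the strict conversion used in the cell above is safe to keep as is.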

Now the dataset looks like this:

In [9]:
dataset

Unnamed: 0,city,date_weather,latitude,longitude,max_temperature,min_temperature,region,temperature,wind_direction,wind_speed,weather_description,date_hour_x,message_text
0,Суми,2022-12-02 12:32:00,50.911944,34.803333,-0.2,-3.3,Сумська,-0.2,343,31.2,Overcast,2022-12-02 12:00:00,💥 Хотінь (Сумська обл.)\nЗагроза артилерійсько...
1,Марганець,2022-12-02 11:33:00,47.644722,34.604167,2.7,-0.1,Дніпропетровська,2.7,3,28.5,Overcast,2022-12-02 11:00:00,💥 Марганець (Дніпропетровська обл.)\nЗагроза а...
2,Дніпро,2022-12-02 11:33:00,48.466111,35.025278,1.7,-1.6,Дніпропетровська,1.7,357,29.9,Partly cloudy,2022-12-02 11:00:00,💥 Марганець (Дніпропетровська обл.)\nЗагроза а...
3,Нікополь,2022-12-02 11:33:00,47.577222,34.357500,2.4,0.1,Дніпропетровська,2.4,7,25.4,Overcast,2022-12-02 11:00:00,💥 Марганець (Дніпропетровська обл.)\nЗагроза а...
4,Дніпро,2022-12-02 11:33:00,48.466111,35.025278,1.7,-1.6,Дніпропетровська,1.7,357,29.9,Partly cloudy,2022-12-02 11:00:00,💥 Марганець (Дніпропетровська обл.)\nЗагроза а...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8798,Херсон,2023-04-01 09:06:00,46.640000,32.614444,18.1,10.2,Херсонська,16.9,248,45.4,Mainly clear,2023-04-01 09:00:00,Херсонська область. Інформація щодо ворожих об...
8799,Херсон,2023-03-30 16:20:00,46.640000,32.614444,16.2,3.1,Херсонська,10.7,245,27.7,Mainly clear,2023-03-30 16:00:00,⚡️Отримав осколкове поранення під час бомбар...
8800,Херсон,2023-03-10 08:43:00,46.640000,32.614444,-1.5,-6.3,Херсонська,-2.7,65,36.1,Overcast,2023-03-10 08:00:00,Херсонська область. Інформація щодо ворожих об...
8801,Херсон,2023-03-10 08:43:00,46.640000,32.614444,-1.5,-6.3,Херсонська,-2.7,65,36.1,Overcast,2023-03-10 08:00:00,Херсонська область. Інформація щодо ворожих об...


## Examination

Before further cleanup, we examine the data.

### Dates

We can see that `date_weather` has minute precision, while `date_hour_x` has only hour precision. Let's examine how the two columns relate in the dataset:

In [None]:
diff = dataset["date_weather"] - dataset["date_hour_x"]
dataset.groupby(diff).count().plot.line(y="message_text", legend=False)
diff.describe()


count                         8803
mean     0 days 00:30:25.239123026
std      0 days 00:17:11.615431718
min                0 days 00:00:00
25%                0 days 00:15:00
50%                0 days 00:32:00
75%                0 days 00:45:00
max                0 days 00:59:00
dtype: object

Since the difference is always between 0 and 59 minutes, `date_weather` always lies between `date_hour_x` and `date_hour_x + 1h`. So we can suppose that `date_hour_x` is simply `date_weather` with the minutes dropped. Let's check this:

In [None]:
(dataset["date_weather"].dt.floor("h") == dataset["date_hour_x"]).describe()

The result has only one unique value (`True`), which means that the equality holds for every record in the dataset.

This suggests that the column names are misleading. Weather is usually measured on the hour, while messages are sent at arbitrary times, so `date_hour_x` is probably the time when the weather was measured and `date_weather` the time when the message was sent. This would also explain why `date_hour_x` equals `date_weather` with the minutes dropped: when the data was collected, each message was matched with the most recent hourly weather record.
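
The floor-to-the-hour relation checked above can be illustrated on toy data. This is a minimal sketch with made-up timestamps; the names `precise` and `hourly` are hypothetical stand-ins for the two date columns:

```python
import pandas as pd

# Hypothetical timestamps with minute precision.
precise = pd.Series(pd.to_datetime(["2022-12-02 12:32", "2023-03-10 08:43"]))

# floor("h") drops minutes and seconds, keeping the start of the hour.
hourly = precise.dt.floor("h")

# The offset between the two is therefore always in [0 min, 60 min).
offset = precise - hourly
print(hourly.tolist())
print(offset.max())
```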

To avoid confusion later, we rename `date_hour_x` to `weather_time` and `date_weather` to `message_time`.

In [None]:
dataset = dataset.rename(
    columns={
        "date_hour_x": "weather_time",
        "date_weather": "message_time",
    }
)

Finally, we have normalized the column names and obtained the following table:

In [None]:
dataset

### Duplicates
Let's check whether there are duplicate rows in the table.

In [None]:
dataset.duplicated().sum()

And in fact, there are a lot of them, so we remove them:

In [None]:
old_size = len(dataset.index)
dataset.drop_duplicates(inplace=True)
new_size = len(dataset.index)

print(f"{old_size} -> {new_size}, {(1 - new_size/old_size) * 100:.2f}%")

We can see that the dataset shrank from 8803 records to 4345 (by 50.64%), meaning that about half of the rows were exact duplicates of other rows.
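
Before dropping duplicates it can help to look at what exactly is repeated. A minimal sketch on a made-up toy table (the frame `df` and its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy table with one repeated row.
df = pd.DataFrame(
    {
        "city": ["Суми", "Дніпро", "Дніпро", "Херсон"],
        "temperature": [-0.2, 1.7, 1.7, 16.9],
    }
)

# keep=False marks every member of a duplicate group, not only the repeats,
# which makes it easier to inspect what exactly is duplicated.
all_copies = df[df.duplicated(keep=False)]

deduplicated = df.drop_duplicates()
removed_share = 1 - len(deduplicated) / len(df)
print(all_copies)
print(f"{len(df)} -> {len(deduplicated)}, {removed_share * 100:.2f}%")
```

Unlike the cell above, this variant does not use `inplace=True`, so the original frame stays available for comparison.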

### Regions
Let's look at how many mentions there are for each region.

In [None]:
messages_per_region = dataset.groupby("region").count()["message_text"]
messages_per_region.plot.bar(xlabel="Область", ylabel="К-сть згадувань", legend=False)
# Colour the city of Kyiv on the map with the count of Kyiv oblast
messages_per_region["Київ"] = messages_per_region["Київська"]
messages_map = map_.join(messages_per_region, on="region")

messages_map.plot("message_text", cmap="Blues")

So we can see that almost all the data comes from the eastern and southern regions of Ukraine. From now on we focus only on those regions, so we keep only their records.

In [None]:
eastern_regions = [
    "Сумська",
    "Харківська",
    "Дніпропетровська",
    "Донецька",
    "Луганська",
    "Запорізька",
    "Херсонська",
    "Миколаївська",
]
dataset = dataset[dataset.region.isin(eastern_regions)]
dataset

So we are left with 3865 records from the eastern and southern regions.
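
The same kind of filter can be used to see what is being discarded. A minimal sketch on made-up records (the frame `df` and the list `kept_regions` are hypothetical; region names follow the spelling used in the dataset):

```python
import pandas as pd

# Hypothetical toy records.
df = pd.DataFrame(
    {
        "region": ["Сумська", "Львівська", "Херсонська", "Сумська"],
        "message_text": ["a", "b", "c", "d"],
    }
)
kept_regions = ["Сумська", "Херсонська"]

# isin builds a boolean mask; ~mask selects the rows that the filter drops.
mask = df["region"].isin(kept_regions)
kept, dropped = df[mask], df[~mask]
print(f"kept {len(kept)} of {len(df)} rows ({mask.mean():.0%})")
print(dropped["region"].value_counts())
```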

### Unique messages

There might also be other kinds of duplicates. Since the key columns of the dataset are the Telegram message text and the time when it was sent, we examine how many unique messages there are.

Note: we assume that the probability of two identical messages being sent at the same time and both appearing in this dataset is very low.

In [None]:
(dataset.groupby(["message_text", "message_time"]).size() > 0).sum()

This means that out of 3865 records there are only 1618 unique messages. Since the repeated rows are not exact duplicates, we can look at how they differ.

In [None]:
# Take the group of the maximum size
group = max(
    dataset.groupby(["message_text", "message_time"]).groups.values(),
    key=lambda item: len(item),
)
dataset.loc[group[:7]]

And the message was:

In [None]:
print(dataset.loc[group[0], "message_text"])

We can see that the rows differ in the city name (and hence in the weather). All of the cities in the group are mentioned in the message, so the repeated rows probably correspond to the different towns the message refers to. We can test this hypothesis by checking whether any `(message_time, message_text, city)` triple occurs more than once:

In [None]:
dataset.duplicated(["message_time", "message_text", "city"]).any()

In fact, there are no duplicates of that form, so the dataset actually looks like this:

In [None]:
dataset.set_index(["message_time", "message_text", "city"]).head(5)

However, we won't change the index of the dataset, since denormalized data is easier to work with in pandas. Still, it is useful to have a separate copy of the dataset that contains only those 1618 unique messages.

In [None]:
dataset_messages = dataset[["message_time", "message_text"]].drop_duplicates()
dataset_messages
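
The number of unique messages can be obtained in several ways that agree as long as the key columns contain no missing values (`groupby` skips missing keys by default, `drop_duplicates` does not). A minimal sketch on made-up rows; the frame `df` is hypothetical:

```python
import pandas as pd

# Hypothetical toy data: one message repeated for two cities.
df = pd.DataFrame(
    {
        "message_time": pd.to_datetime(
            ["2022-12-02 11:33", "2022-12-02 11:33", "2022-12-02 12:32"]
        ),
        "message_text": ["alert A", "alert A", "alert B"],
        "city": ["Марганець", "Нікополь", "Суми"],
    }
)

keys = ["message_text", "message_time"]
# Three ways to count distinct (text, time) pairs.
by_size = (df.groupby(keys).size() > 0).sum()
by_ngroups = df.groupby(keys).ngroups
by_dedup = len(df[keys].drop_duplicates())
print(by_size, by_ngroups, by_dedup)
```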

### Finalization

This completes the data cleaning. We also plot the number of messages received per week (timestamps are rounded to 7-day bins).

In [None]:
dataset_messages.groupby(dataset_messages.message_time.dt.round("7d").dt.date).count().plot.bar(
    y="message_text", legend=False, xlabel="Дата", ylabel="К-сть повідомлень"
)
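
Rounding to 7 days snaps each timestamp to the nearest multiple of 7 days counted from 1970-01-01, which was a Thursday, so a bin label can fall after the message itself. A minimal sketch on two made-up timestamps that compares rounding with flooring on the same grid (`times`, `rounded` and `floored` are hypothetical names):

```python
import pandas as pd

# Hypothetical timestamps: a Friday and the following Monday.
times = pd.Series(pd.to_datetime(["2022-12-02 12:32", "2022-12-05 10:00"]))

# round("7D") picks the nearest Thursday-aligned bin label,
# which may lie after the message date.
rounded = times.dt.round("7D")

# floor("7D") uses the same Thursday grid but always snaps backwards,
# so each label is the start of the bin that contains the message.
floored = times.dt.floor("7D")

print(pd.DataFrame({"time": times, "round": rounded, "floor": floored}))
```

Either choice gives weekly counts; flooring just makes the labels easier to read as "week starting on".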

## Linguistics analysis

In this part we will analyse texts of the messages given in the dataset.

### Word usage frequencies

We count how often words are used, tokenizing with the nltk library and lemmatizing with the Ukrainian pymorphy3 dictionary. This creates a new column `words` that contains a `Counter`, where each key is a lemma and each value is the number of times it occurs in the message.

In [None]:
def tokenize_and_lemmatize(text):
    tokens = nltk.word_tokenize(text.lower())
    lemmas = Counter()
    for token in tokens:
        if not token.isalpha():
            continue

        lemma = morph.parse(token)[0].normal_form
        lemmas[lemma] += 1

    return lemmas

dataset_messages["words"] = dataset_messages.message_text.apply(tokenize_and_lemmatize)
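
To see the counting step in isolation, here is a minimal sketch that swaps nltk for a plain regex tokenizer and skips lemmatization entirely, so it counts lowercased word forms rather than lemmas. The function `count_tokens` and the sample sentence are hypothetical:

```python
import re
from collections import Counter

def count_tokens(text):
    # \w+ matches runs of letters, digits and underscores (Unicode-aware
    # for str patterns); isalpha() then drops tokens with digits or "_".
    tokens = re.findall(r"\w+", text.lower())
    return Counter(token for token in tokens if token.isalpha())

sample = "Загроза обстрілу! Загроза триває 2 години."
counts = count_tokens(sample)
print(counts.most_common(2))
```

Without lemmatization, inflected forms of the same word are counted separately, which is why the notebook normalizes each token with `morph.parse(token)[0].normal_form`.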

We also join `dataset_messages` back to `dataset`, so that the word counts can later be used together with region information.

In [None]:
dataset = dataset.join(dataset_messages.set_index(["message_time", "message_text"]), on=["message_time", "message_text"])

Now we can plot the most frequently used words:

In [None]:
words = pd.Series(dataset_messages.words.sum())
words = words.sort_values()
words.iloc[-25:].plot.barh()

We can see that many of the most frequent words in these messages are related to the war.
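
The totals plotted above come from adding the per-message counters together. A minimal sketch of the same aggregation with `Counter.update`, on made-up counters standing in for the `words` column (`per_message` and `total` are hypothetical names):

```python
from collections import Counter

# Hypothetical per-message counters.
per_message = [
    Counter({"загроза": 1, "обстріл": 2}),
    Counter({"загроза": 1, "вибух": 1}),
]

# update() adds counts in place, avoiding the intermediate Counter
# objects that repeated "+" creates.
total = Counter()
for counter in per_message:
    total.update(counter)

print(total.most_common(2))
```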

#### WordCloud

We also create a wordcloud to see the general mood the messages reflect.

In [None]:
text_data = " ".join(dataset_messages['message_text'])

picture = wordcloud.WordCloud(width=1920, height=1080, background_color="white").generate(text_data)

plt.imshow(picture)
plt.axis("off")
plt.show()

#### Threats counting

We create a set of tokens, that mean some type of threat, and then count how many messages are about threats.

In [None]:
threats_lemmas = {"обстрілюваний", "атакований", "обстріллючи", "артобстріл", "атакувати", "обстрілювати", "атака", "вибух", "обстріляти", "загроза", "обстріл"}
dataset_messages.words.apply(lambda lemmas: bool(set(lemmas.keys()) & threats_lemmas)).value_counts()

We can see that only 104 messages contain none of those lemmas.
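
The membership test behind this count can be sketched on toy data. The lemma set below is a shortened, hypothetical version of `threats_lemmas`, and the two counters are made up:

```python
from collections import Counter

# Hypothetical, shortened lemma set and per-message counters.
threat_lemmas = {"обстріл", "вибух", "загроза"}
messages = [
    Counter({"загроза": 1, "артилерійський": 1}),
    Counter({"погода": 1, "ясний": 1}),
]

# isdisjoint() is False as soon as one key is a threat lemma, so it
# avoids building the full intersection set.
has_threat = [not threat_lemmas.isdisjoint(lemmas) for lemmas in messages]
print(has_threat)
```

This is equivalent to the `bool(set(...) & threats_lemmas)` check in the cell above.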

### Injured people and casualties counting

We want to count injured people and casualties in the messages:

In [None]:
dataset_regions = dataset[["message_time", "message_text", "region"]].drop_duplicates()

def extract_injured_count(text):
    match = re.search(r"\b(\d+)\s*(?:поранені|поранених)\b", str(text), re.IGNORECASE)
    return int(match.group(1)) if match else 0

dataset_regions = dataset_regions.assign(injured_count = dataset_regions.message_text.apply(extract_injured_count))
injured_by_region = dataset_regions.groupby("region")["injured_count"].sum()
total_injured = injured_by_region.sum()
injured_by_region.sort_values().plot.bar()
total_injured

We can see that the most injured people are reported in Luhansk and Dnipropetrovsk oblasts.
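
Note that `re.search` returns only the first match, so a message that reports several counts contributes just one of them. A minimal sketch comparing this with summing every match via `findall`; the helper names and the sample sentence are hypothetical, and summing all matches may double-count when a message repeats the same figure:

```python
import re

# Same pattern as in the cell above.
INJURED = re.compile(r"\b(\d+)\s*(?:поранені|поранених)\b", re.IGNORECASE)

def first_injured_count(text):
    match = INJURED.search(text)
    return int(match.group(1)) if match else 0

def all_injured_counts(text):
    # findall returns the captured number of every match in the text.
    return sum(int(number) for number in INJURED.findall(text))

# Hypothetical message mentioning two separate counts.
sample = "У місті 3 поранених, у селі ще 2 поранені."
print(first_injured_count(sample), all_injured_counts(sample))
```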

In [None]:
dataset_regions = dataset[["message_time", "message_text", "region"]].drop_duplicates()

def extract_casualties_count(text):
    match = re.search(r"\b(\d+)\s*(?:людей\s*)?(?:загинули|загинуло)|(?:загинули|загинуло)\s*(\d+)\b", str(text), re.IGNORECASE)
    if match:
        return int(match.group(1) or match.group(2))
    return 0

dataset_regions = dataset_regions.assign(casualties_count = dataset_regions.message_text.apply(extract_casualties_count))
casualties_by_region = dataset_regions.groupby("region")["casualties_count"].sum()
total_casualties = casualties_by_region.sum()
casualties_by_region.sort_values().plot.bar()
total_casualties

We can see that the most casualties are reported in Zaporizhzhia, Donetsk and Kherson oblasts.

## Weather analysis

We create a dataset for unique weather records for each region at each time period.

In [None]:
weather_dataset = dataset.drop_duplicates(['weather_time', 'region'])
weather_dataset

So we actually have only 1313 weather records. Let's look at the wind rose:

In [None]:
wind_data = weather_dataset['wind_direction'].round(-1).value_counts().sort_index()
ax = plt.axes(projection="polar")
ax.set_theta_zero_location("N")
ax.set_theta_direction(-1)
plt.plot(np.deg2rad(wind_data.index), wind_data.values)

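We can see that most wind directions fall in the south-west quadrant (180° to 270°). If the dataset follows the usual meteorological convention, where the direction is the bearing the wind blows from, these are south-westerly winds blowing toward the north-east. Let's see what wind speeds occur in this quadrant.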
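
Before turning to speed, the quadrant reading can be cross-checked numerically. A minimal sketch, on made-up directions rather than the real `weather_dataset`, that bins degrees into eight compass sectors; the sector labels and the convention of 0° = north, increasing clockwise, are assumptions of this sketch:

```python
import numpy as np
import pandas as pd

# Hypothetical wind directions in degrees (0 = north, clockwise).
directions = pd.Series([343, 3, 357, 7, 248, 245, 65, 200])

# Shift by half a sector (22.5 degrees) so that north covers 337.5..22.5,
# then integer-divide into eight 45-degree sectors.
labels = np.array(["N", "NE", "E", "SE", "S", "SW", "W", "NW"])
sector_index = (((directions + 22.5) % 360) // 45).astype(int)
sector_counts = pd.Series(labels[sector_index.to_numpy()]).value_counts()
print(sector_counts)
```

The modulo step also merges 0° and 360° into the same sector, which the 10° rounding used for the plot keeps apart.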

In [None]:
north_east_wind_speed = weather_dataset[(180<weather_dataset['wind_direction']) & (weather_dataset['wind_direction'] <= 270)].groupby("weather_time").wind_speed.mean()
north_east_wind_speed.plot.line()
north_east_wind_speed.mean()

The plot shows gaps in the weather records in December and March. The mean wind speed is about 25, and individual values range from 0 to 50. We assume that the unit is km/h, because a wind of 50 m/s would be hurricane-force, which is very unlikely here.
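
The unit argument can be checked with a quick conversion. For reference, hurricane force on the Beaufort scale starts at roughly 118 km/h (about 32.7 m/s). A minimal sketch; the helper names are hypothetical:

```python
# Convert the wind speeds discussed above between km/h and m/s.
KMH_PER_MS = 3.6  # 1 m/s = 3.6 km/h

def kmh_to_ms(speed_kmh):
    return speed_kmh / KMH_PER_MS

def ms_to_kmh(speed_ms):
    return speed_ms * KMH_PER_MS

print(f"25 km/h = {kmh_to_ms(25):.1f} m/s")
print(f"50 km/h = {kmh_to_ms(50):.1f} m/s")
print(f"50 m/s  = {ms_to_kmh(50):.0f} km/h")
```

Read as km/h, the maximum of 50 is a strong but ordinary wind; read as m/s, it would be 180 km/h, well above the hurricane threshold.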

We also plot the temperature for eastern Ukraine:

In [None]:
weather_dataset.assign(date=weather_dataset['message_time'].dt.date).groupby('date')['temperature'].mean().plot(kind='line', marker='o', xlabel="Дата", ylabel='Середня температура (°C)')

temperature_per_region = dataset.groupby("region")["temperature"].mean()
temperature_map = map_.join(temperature_per_region, on="region")
temperature_map.plot("temperature", cmap="Blues", legend=True)