# <center>**Selecting the most reading city and region in Russia**</center>

## 1. Problem and background description

For this project, we assume that the Ministry of Culture of Russia is planning a campaign to stimulate the population's interest in reading. Several measures are being designed, and one of them is to create a rating of the "most reading" cities and regions (the idea follows the success of other ratings such as "100 best places in Russia" or "Name of Russia", the latter devoted to emblematic figures of Russian history).

Russia has the world's biggest territory of 17 098 246 km² divided into 85 regions. There are 146 745 098 inhabitants in the country, which makes Russia only the 9th most populated country in the world. Approximately 75% of the population of Russia is concentrated in its cities.

## 2. Methodology description and gathering of the data needed

**Preliminary note**: if the project were actually carried out on behalf of the ministry, it would use other data: more official sources for cities, regions and their population, and more local sources for bookstores and libraries (e.g. Yandex Maps).

For the present project, we will use the following data:
* list of 1117 Russian cities and their population available at: https://ru.wikipedia.org/wiki/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D0%B3%D0%BE%D1%80%D0%BE%D0%B4%D0%BE%D0%B2_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8
* list of 85 Russian regions and their population available at: https://ru.wikipedia.org/wiki/%D0%A1%D1%83%D0%B1%D1%8A%D0%B5%D0%BA%D1%82%D1%8B_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B9%D1%81%D0%BA%D0%BE%D0%B9_%D0%A4%D0%B5%D0%B4%D0%B5%D1%80%D0%B0%D1%86%D0%B8%D0%B8
* data on the location of bookstores and libraries, provided by the Foursquare API.
* geolocation data for Russian cities and regions: https://on55.ru/articles/2

The downside of these data is that they cover only physical stores, while the number of people buying printed books and e-books online is steadily increasing. To take this into account, we will try to add to our analysis data on visits to the sites of the top Russian book retailers, available at: https://radar.yandex.ru/top_list?thematic=retail%2Cbooks.

In this project we aim to analyse the concentration of bookshops and libraries in each city relative to its number of inhabitants, as well as the inhabitants' interest in the most popular online bookstores in Russia. We will then compare the cities with one another using k-means clustering and map the resulting clusters.
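
The approach described above can be sketched as follows. This is only an illustration under stated assumptions: the populations come from the cities table, but the venue counts are invented, the feature names are hypothetical, and a small hand-written Lloyd's algorithm stands in for a library implementation such as scikit-learn's `KMeans`.

```python
import random

# Populations come from the cities table; venue counts are INVENTED for illustration.
cities = {
    "Абаза": {"population": 17111, "bookstores": 1, "libraries": 2},
    "Абакан": {"population": 165183, "bookstores": 14, "libraries": 9},
    "Абдулино": {"population": 20663, "bookstores": 1, "libraries": 1},
    "Абинск": {"population": 34926, "bookstores": 3, "libraries": 2},
    "Агидель": {"population": 16365, "bookstores": 0, "libraries": 1},
}


def per_100k(count, population):
    """Number of venues per 100,000 inhabitants."""
    return count * 100_000 / population


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid
        labels = [
            min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])),
            )
            for pt in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


# One feature vector per city: (bookstores per 100k, libraries per 100k)
features = [
    (per_100k(c["bookstores"], c["population"]), per_100k(c["libraries"], c["population"]))
    for c in cities.values()
]
labels = kmeans(features, k=2)
for name, label in zip(cities, labels):
    print(name, label)
```

In a real run the features would probably need to be standardized before clustering, and the number of clusters chosen with a criterion such as the elbow method; neither step is shown here.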

To establish the most reading region of Russia, we will use the data obtained for the cities located in each region, and we will also need to account for the region's rural population. We will do so either by using data on connections to the online bookstores or by making some assumptions.
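
One possible way to roll city figures up to region level is sketched below. All names and numbers are invented for illustration, and treating "region population minus the population of its listed cities" as the rural population is an assumption, since that remainder also includes settlements that are not classified as cities.

```python
import pandas as pd

# Toy data: every figure below is INVENTED for illustration.
cities = pd.DataFrame({
    "city": ["A1", "A2", "B1"],
    "region": ["Region A", "Region A", "Region B"],
    "population": [150_000, 50_000, 300_000],
    "venues": [12, 3, 30],
})
regions = pd.DataFrame({
    "region": ["Region A", "Region B"],
    "population": [260_000, 400_000],
})

# Sum the city-level figures within each region
urban = (
    cities.groupby("region", as_index=False)[["population", "venues"]]
    .sum()
    .rename(columns={"population": "city_population"})
)

summary = regions.merge(urban, on="region", how="left")
# Population living outside the listed cities (rough proxy for rural population)
summary["non_city_population"] = summary["population"] - summary["city_population"]
# City venues per 100,000 inhabitants of the whole region
summary["venues_per_100k"] = summary["venues"] * 100_000 / summary["population"]
print(summary)
```

Dividing by the whole region's population implicitly assumes that rural inhabitants use the city venues; a different assumption would change the denominator.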

Let's start by scraping the Wikipedia pages and converting the region and city data to dataframe format.

In [37]:
# Collecting data on the cities
import pandas as pd
# Reading the content of the web page (result = list of dataframes)
WebPage_cities = pd.read_html("https://ru.wikipedia.org/wiki/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA_%D0%B3%D0%BE%D1%80%D0%BE%D0%B4%D0%BE%D0%B2_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B8")
# Selecting from the list the item corresponding to the cities table and storing it as df
RU_cities = WebPage_cities[1]
RU_cities.reset_index(drop=True, inplace=True)
RU_cities.columns

MultiIndex(levels=[['Города Российской Федерации'], ['Герб', 'Город', 'Население', 'Основание илипервое упоминание', 'Регион', 'Статус города[5]', 'Федеральный округ', '№']],
           codes=[[0, 0, 0, 0, 0, 0, 0, 0], [7, 0, 1, 4, 6, 2, 3, 5]])

In [41]:
# The resulting dataframe has multi-level columns. There is no reason to keep the outer level.
RU_cities.columns = [col[1] for col in RU_cities.columns]
RU_cities

Unnamed: 0,№,Герб,Город,Регион,Федеральный округ,Население,Основание илипервое упоминание,Статус города[5]
0,1,,Абаза,Хакасия,Сибирский,17111,1867,
1,2,,Абакан,Хакасия,Сибирский,165183,1734,1931.0
2,3,,Абдулино,Оренбургская область,Приволжский,20663,1795,
3,4,,Абинск,Краснодарский край,Южный,34926,1863,
4,5,,Агидель,Башкортостан,Приволжский,16365,1980,1991.0
5,6,,Агрыз,Татарстан,Приволжский,19299,1646,
6,7,,Адыгейск,Адыгея,Южный,12689,1969,1976.0
7,8,,Азнакаево,Татарстан,Приволжский,34859,1762,
8,9,,Азов,Ростовская область,Южный,82882,1067,
9,10,,Ак-Довурак,Тыва,Сибирский,13469,1964,
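
As an aside, the same flattening can be written with `MultiIndex.droplevel` instead of a list comprehension. The sketch below uses a small hand-built frame (an assumption made so that the example does not depend on the live Wikipedia page) that mimics the two-level header returned by `read_html`.

```python
import pandas as pd

# Toy frame mimicking the two-level header returned by read_html
columns = pd.MultiIndex.from_product(
    [["Города Российской Федерации"], ["Город", "Регион", "Население"]]
)
df = pd.DataFrame([["Абаза", "Хакасия", 17111]], columns=columns)

# Drop the outer level, keeping only the inner column names
df.columns = df.columns.droplevel(0)
print(list(df.columns))
```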


In [43]:
# We only need to keep information on "Город" (name of the city), "Регион" (region), "Федеральный округ" (macro region) and "Население" (population).
# Dropping all the other columns
RU_cities = RU_cities[["Город","Регион","Федеральный округ","Население"]]
RU_cities.head()

Unnamed: 0,Город,Регион,Федеральный округ,Население
0,Абаза,Хакасия,Сибирский,17111
1,Абакан,Хакасия,Сибирский,165183
2,Абдулино,Оренбургская область,Приволжский,20663
3,Абинск,Краснодарский край,Южный,34926
4,Агидель,Башкортостан,Приволжский,16365


In [44]:
RU_cities.shape

(1117, 4)

This looks better: we have kept only the information of interest, while retaining all 1117 cities. Let's repeat the process to retrieve the regions and their population.

In [58]:
# Reading the content of the web page (result = list of dataframes)
WebPage_regions = pd.read_html("https://ru.wikipedia.org/wiki/%D0%A1%D1%83%D0%B1%D1%8A%D0%B5%D0%BA%D1%82%D1%8B_%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B9%D1%81%D0%BA%D0%BE%D0%B9_%D0%A4%D0%B5%D0%B4%D0%B5%D1%80%D0%B0%D1%86%D0%B8%D0%B8")
# Selecting from the list the item corresponding to the regions table and storing it as df
RU_regions = WebPage_regions[4]
RU_regions

Unnamed: 0,№,Субъект Российской Федерации,Флаг,Герб,Терри-тория (км²),Население1.01.2020[14],Административный центр/столица,Административно-территориальное деление(согласно ОКАТО),Код ОКАТО,Муниципальные образования,Административно-территориальное деление (подробно)
0,,Республики,,,,,,,,,
1,1.0,Республика Адыгея,,,7792.0,↗463 453,Майкоп,7 районов и 2 города,79,"7 муниципальных районов, 2 городских округа",подробнее
2,2.0,Республика Алтай,,,92903.0,↗220 140,Горно-Алтайск,10 районов и 1 город,84,"10 муниципальных районов, 1 городской округ",подробнее
3,3.0,Республика Башкортостан,,,142947.0,↘4 037 811,Уфа,54 района и 21 город,80,"54 муниципальных района, 9 городских округов",подробнее
4,4.0,Республика Бурятия,,,351334.0,↗986 109,Улан-Удэ,21 район и 2 города,81,"21 муниципальный район, 2 городских округа",подробнее
5,5.0,Республика Дагестан,,,50270.0,↗3 111 353,Махачкала,41 район и 10 городов,82,"42 муниципальных района, 10 городских округов",подробнее
6,6.0,Республика Ингушетия,,,3123.0,↗506 688,Магас,4 района и 4 города,26,"4 муниципальных района, 4 городских округа",подробнее
7,7.0,Кабардино-Балкарская Республика,,,12470.0,↗868 174,Нальчик,10 районов и 3 города,83,"10 муниципальных районов, 3 городских округа",подробнее
8,8.0,Республика Калмыкия,,,74731.0,↘271 035,Элиста,13 районов и 1 город,85,"13 муниципальных районов, 1 городской округ",подробнее
9,9.0,Карачаево-Черкесская Республика,,,14277.0,↗465 669,Черкесск,10 районов и 2 города,91,"10 муниципальных районов, 2 городских округа",подробнее


In [61]:
# Keeping only the interesting information
RU_regions = RU_regions[["Субъект Российской Федерации","Население1.01.2020[14]","Административный центр/столица"]]
RU_regions

Unnamed: 0,Субъект Российской Федерации,Население1.01.2020[14],Административный центр/столица
0,Республики,,
1,Республика Адыгея,↗463 453,Майкоп
2,Республика Алтай,↗220 140,Горно-Алтайск
3,Республика Башкортостан,↘4 037 811,Уфа
4,Республика Бурятия,↗986 109,Улан-Удэ
5,Республика Дагестан,↗3 111 353,Махачкала
6,Республика Ингушетия,↗506 688,Магас
7,Кабардино-Балкарская Республика,↗868 174,Нальчик
8,Республика Калмыкия,↘271 035,Элиста
9,Карачаево-Черкесская Республика,↗465 669,Черкесск


There is a problem with this data: we have 92 rows, but we know from the scraped page that there are only 85 regions in Russia. Indeed, some rows are just category names and contain "NaN". We need to drop these, as well as the last row, which is the total (we might use this information later, so we'll store it in a separate dataframe).

In [74]:
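# Store the total (last row) separately, then drop the category rows (NaN) and the total row
RUS_total = RU_regions.tail(1)
RU_regions = RU_regions.dropna().iloc[:-1]
RUS_total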

Unnamed: 0,Субъект Российской Федерации,Население1.01.2020[14],Административный центр/столица
91,Российская Федерация[13],↘146 745 098,Москва
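
The population column still holds strings such as `↗463 453` rather than numbers, so it would need cleaning before any per-capita calculation. A minimal sketch follows; the helper name `parse_population` is hypothetical and not part of the original notebook.

```python
import re


def parse_population(raw):
    """Convert a string such as '↗463 453' to the integer 463453."""
    without_notes = re.sub(r"\[.*?\]", "", raw)  # remove footnote marks like [14]
    digits = re.sub(r"\D", "", without_notes)    # remove arrows and any kind of space
    return int(digits)


print(parse_population("↘146 745 098"))  # → 146745098
```

Once rows with missing values have been removed, the helper could be applied to the whole column with `RU_regions["Население1.01.2020[14]"].map(parse_population)`.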


Last preparation step: let's obtain latitude and longitude data, using a built-in function of IBM Watson Studio.

In [72]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,Город,Регион,Федеральный округ,lat,lng
0,Абаза,Хакасия,Сибирский,52.651657,90.088572
1,Абакан,Хакасия,Сибирский,53.721152,91.442387
2,Абдулино,Оренбургская область,Приволжский,53.677839,53.647263
3,Абинск,Краснодарский край,Южный,44.866256,38.151163
4,Агидель,Башкортостан,Приволжский,55.899835,53.92204


## 3. Analysis

## 4. Conclusions