This is one of the three problems from the RuCode festival.

The All-Russian educational festival on artificial intelligence and programming RuCode has already been held twice in the spring and autumn of 2020 and has gathered more than 20 thousand participants. It was organized by 15 leading universities and public organizations from all over Russia. The festival was supported by the Ministry of Science and Higher Education of the Russian Federation, and the industrial partners of the event were MegaFon, Yandex, Beber and Gazprombank. 

The task in this competition is to predict whether a regulatory legal act will be adopted or not.


In [1]:
# import libraries
import pandas as pd
import numpy as np

from sklearn.ensemble import GradientBoostingRegressor
from sklearn import linear_model, metrics, tree
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, accuracy_score, roc_auc_score
from catboost import CatBoostClassifier
from sklearn.preprocessing import OneHotEncoder
import xgboost as xgb
import matplotlib
from matplotlib import pyplot as plt
matplotlib.style.use('ggplot')
%matplotlib inline

from datetime import datetime

In [2]:
# import our data
all_data = pd.read_csv('regulations.csv')
train_answer = pd.read_csv('train_answer.csv')
test_data = pd.read_csv('sample_submission.csv')

## 1. Exploring the Data

The data comes as several CSV tables and folders: regulations.csv contains general information about the acts; regulations_texts.csv contains descriptions of the texts; sample_submission.csv shows the structure that the CSV file with predictions for the test data should have; train_answer.csv contains the values of the target variable for the training part of the dataset; the ria_reports folder contains summary reports on the regulatory impact assessment of the projects; ria_reports_structures contains descriptions of the ria_reports.

In [3]:
all_data.head(5)

Unnamed: 0,id,act_title,publication_date,developer,okved_list,views_num,comments_num,likes_num,dislikes_num,regulatory_impact,added_by,responsible,is_regionally_signigicant,act_changes_controlling_activities,mineco_solution,problem_addressed,act_objectives,persons_affected_by_act,relations_regulated_by_act,act_significance
0,5038,Об утверждении тарифов на услуги по транспорти...,2013-09-11,ФСТ России,,376.0,0.0,0.0,0.0,Низкая,Митина Ольга Викторовна,Митина Ольга Викторовна,False,False,Не определено,,,,,
1,5039,О внесении изменений в отдельные законодательн...,2013-06-11,Минтруд России,Здравоохранение; Предоставление социальных услуг,504.0,0.0,0.0,0.0,Низкая,Рахов Виталий Сергеевич,Павлова Зоя Ивановна,False,False,Не определено,,,,,
2,5040,Об утверждении Положения об уведомлении лиц об...,2013-04-29,Росфинмониторинг,Финансовая деятельность,428.0,0.0,0.0,0.0,Низкая,Тимофеева Алёна Игоревна,Лях Валерий Владимирович,False,False,Не определено,,,,,
3,5041,О внесении изменений в Положение о Министерств...,2013-10-21,Минобрнауки России,Образование,376.0,0.0,0.0,0.0,Низкая,Вотоновская Ирина Вячеславовна,Михайлова Ирина Вячеславовна,False,False,Не определено,,,,,
4,5042,О внесении изменений в Правила подготовки и пр...,2014-02-24,Минприроды России,,499.0,0.0,0.0,0.0,Низкая,Соболева Светлана Юрьевна,Соболева Светлана Юрьевна,False,False,Не определено,предоставление водного объекта в пользование п...,Пунктом 12 части 2 статьи 11 Водного кодекса Р...,неопределенный круг лиц,необходимость корреляции Правил подготовки и п...,Проект постановления Правительства Российской ...


In [4]:
all_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85006 entries, 0 to 85005
Data columns (total 20 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   id                                  85006 non-null  int64  
 1   act_title                           85005 non-null  object 
 2   publication_date                    84417 non-null  object 
 3   developer                           84221 non-null  object 
 4   okved_list                          73097 non-null  object 
 5   views_num                           84417 non-null  float64
 6   comments_num                        84417 non-null  float64
 7   likes_num                           84417 non-null  float64
 8   dislikes_num                        84417 non-null  float64
 9   regulatory_impact                   84417 non-null  object 
 10  added_by                            84417 non-null  object 
 11  responsible                         81684

## 2. Clean the data

We can see that there are null values in our table. I delete some rows and fill the missing values in others.
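Before deciding which rows to drop and which values to fill, it can help to count the missing values per column. The snippet below is a minimal sketch on a made-up toy frame (`toy` and `null_report` are hypothetical names, not part of the competition data); the same two calls should work on `all_data`.

```python
import pandas as pd

# Toy frame standing in for regulations.csv (values are made up)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "act_title": ["A", None, "C", "D"],
    "okved_list": [None, None, "Образование", "Транспорт"],
})

# Number and share of missing values per column, worst column first
null_report = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share_missing": toy.isna().mean(),
}).sort_values("n_missing", ascending=False)

print(null_report)
```

Columns with a high share of missing values are candidates for dropping, while columns with only a few gaps are usually easier to fill.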

In [5]:
# some acts were deleted: their title is 'Проект удален' ("Project deleted")
# drop these regulations
new_all_data = all_data.drop(all_data[all_data['act_title'] == 'Проект удален'].index)

In [6]:
# save the deleted acts to a separate table so that they can also be removed from train_answer
delete_acts = all_data[all_data['act_title'] == 'Проект удален']

In [7]:
# look at our main table
new_all_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 84417 entries, 0 to 85005
Data columns (total 20 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   id                                  84417 non-null  int64  
 1   act_title                           84416 non-null  object 
 2   publication_date                    84417 non-null  object 
 3   developer                           84221 non-null  object 
 4   okved_list                          73097 non-null  object 
 5   views_num                           84417 non-null  float64
 6   comments_num                        84417 non-null  float64
 7   likes_num                           84417 non-null  float64
 8   dislikes_num                        84417 non-null  float64
 9   regulatory_impact                   84417 non-null  object 
 10  added_by                            84417 non-null  object 
 11  responsible                         81684

In [8]:
# one act has NaN in 'act_title'
# fill it with 'Неизвестно' ("Unknown")
new_all_data['act_title'] = new_all_data['act_title'].fillna('Неизвестно')

Let's look at our data. There are a lot of categorical columns, and they are difficult to work with. The easiest way to deal with categorical variables is simply to remove them from the dataset, but this only works well if the columns do not contain useful information. I don't think that is the case here.
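A quick way to judge a categorical column is to count its distinct values, since each distinct value becomes one column under one-hot encoding. This is a small sketch on a made-up frame (`toy`, `categorical` and `cardinality` are hypothetical names); it treats every non-numeric column as categorical, which is an assumption.

```python
import pandas as pd

# Toy frame with two text columns and one numeric column (values are made up)
toy = pd.DataFrame({
    "developer": ["Минфин России", "Минтруд России", "Минфин России"],
    "mineco_solution": ["Не определено", "Положительное", "Не определено"],
    "views_num": [376.0, 504.0, 428.0],
})

# Assumption: every non-numeric column is a categorical candidate
categorical = toy.select_dtypes(exclude="number").columns

# Distinct values per column, a rough guide to the number of one-hot columns
cardinality = toy[categorical].nunique().sort_values(ascending=False)
print(cardinality)
```

Low-cardinality columns are cheap to encode, while columns with many distinct values may need grouping of rare values first.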



In [9]:
new_all_data['developer'].value_counts()

Минобрнауки России                                            6258
Минтруд России                                                6254
Минтранс России                                               5062
Минфин России                                                 5053
Минэкономразвития России                                      4757
                                                              ... 
Федеральная служба по оборонному заказу                         20
Рособоронпоставка                                               18
Коллегия Военно-промышленной комиссии Российской Федерации      15
СВР России                                                       2
Департамент энергетики Правительства Российской Федерации        1
Name: developer, Length: 98, dtype: int64

In [10]:
new_all_data['okved_list'].value_counts()

Государственное управление    18691
Финансовая деятельность        8502
Образование                    7322
Транспорт

In [11]:
new_all_data['regulatory_impact'].value_counts()

Не определена    57404
Низкая           18105
Средняя           6633
Высокая           2275
Name: regulatory_impact, dtype: int64

In [12]:
new_all_data['is_regionally_signigicant'].value_counts()

False    83785
True       632
Name: is_regionally_signigicant, dtype: int64

In [13]:
new_all_data['act_changes_controlling_activities'].value_counts()

False    83360
True      1057
Name: act_changes_controlling_activities, dtype: int64

In [14]:
new_all_data['mineco_solution'].value_counts()

Не определено    76156
Положительное     5860
Отрицательное     2401
Name: mineco_solution, dtype: int64

In [15]:
# drop free-text columns and columns with many null values
new_all_data_clean = new_all_data.drop(['developer', 'act_title', 'okved_list', 'responsible', 'act_significance', 
                          'relations_regulated_by_act', 'persons_affected_by_act', 
                          'act_objectives', 'problem_addressed', 'publication_date', 'added_by'], axis=1)

In [16]:
#look at our data again 
new_all_data_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 84417 entries, 0 to 85005
Data columns (total 9 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   id                                  84417 non-null  int64  
 1   views_num                           84417 non-null  float64
 2   comments_num                        84417 non-null  float64
 3   likes_num                           84417 non-null  float64
 4   dislikes_num                        84417 non-null  float64
 5   regulatory_impact                   84417 non-null  object 
 6   is_regionally_signigicant           84417 non-null  object 
 7   act_changes_controlling_activities  84417 non-null  object 
 8   mineco_solution                     84417 non-null  object 
dtypes: float64(4), int64(1), object(4)
memory usage: 6.4+ MB


## 3. Split data to train and test

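The training and test sets are not given as separate tables. The ids of the acts we need to predict are listed in sample_submission.csv, and the ids of the training acts are in train_answer.csv. Let's use these ids to split the main table.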

In [17]:
train_id = train_answer['id'].to_list()
test_id = test_data['id'].to_list()
delete_id = delete_acts['id'].to_list()

In [18]:
# keep only the training ids that do not belong to deleted acts
delete_id_set = set(delete_id)
id_list = [i for i in train_id if i not in delete_id_set]

In [19]:
train_data = new_all_data_clean[new_all_data_clean['id'].isin(id_list)]
test_data = new_all_data_clean[new_all_data_clean['id'].isin(test_id)]
train_answer = train_answer[train_answer['id'].isin(id_list)]

In [20]:
y = train_answer['passed']
X = train_data
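The cells above filter the features and the labels by id separately, so the two tables line up only if their rows come in the same id order. I have not checked that this holds for the competition files. A hedged alternative is to join the tables on `id`, which pairs every row with its own label regardless of order. The sketch uses made-up toy tables (`features`, `labels`, `merged`, `X_toy`, `y_toy` are hypothetical names).

```python
import pandas as pd

# Toy stand-ins for the feature table and train_answer.csv (values are made up)
features = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "views_num": [376.0, 504.0, 428.0, 499.0],
})
labels = pd.DataFrame({"id": [13, 10, 12], "passed": [1, 0, 1]})

# Inner join keeps only ids present in both tables;
# validate raises an error if an id is repeated on either side
merged = features.merge(labels, on="id", how="inner", validate="one_to_one")

# Assumption of this sketch: id is an identifier and is left out of the features
X_toy = merged.drop(columns=["id", "passed"])
y_toy = merged["passed"]
print(merged)
```

Dropping `id` from the features is a choice made for this sketch only; the notebook itself keeps it.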

In [21]:
# list all categorical columns of the cleaned table
all_categorical_columns = [x for x in new_all_data_clean.columns if new_all_data_clean[x].dtype == 'object']
all_categorical_columns

['regulatory_impact',
 'is_regionally_signigicant',
 'act_changes_controlling_activities',
 'mineco_solution']

In [22]:
# Encode categorical features as a one-hot numeric array
# (in scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_enc_train = pd.DataFrame(OH_encoder.fit_transform(train_data[all_categorical_columns]))
OH_enc_test = pd.DataFrame(OH_encoder.transform(test_data[all_categorical_columns]))

In [23]:
num_data_train = train_data.drop(all_categorical_columns, axis=1)
num_data_test = test_data.drop(all_categorical_columns, axis=1)

In [24]:
# Add one-hot encoded columns to numerical features
OH_train = pd.DataFrame(np.hstack([num_data_train, OH_enc_train]))
OH_test = pd.DataFrame(np.hstack([num_data_test, OH_enc_test]))

In [25]:
train_data_split, test_data_split, train_labels_split, test_labels_split = train_test_split(OH_train, y,  
                                                                                     test_size = 0.3)

## 4. Make prediction
Try different models to find the best. 

In [26]:
linear_regressor = linear_model.LinearRegression()
linear_regressor.fit(train_data_split, train_labels_split)
linear_predictions = linear_regressor.predict(test_data_split)

In [27]:
roc_auc_score(test_labels_split, linear_predictions)

0.5903142253633753

In [28]:
log_regressor = linear_model.LogisticRegression(solver='lbfgs', max_iter=1000)
log_regressor.fit(train_data_split, train_labels_split)
lr_predictions = log_regressor.predict_proba(test_data_split)[:, 1]

In [29]:
roc_auc_score(test_labels_split, lr_predictions)

0.5804749094333153

In [30]:
des_tree = tree.DecisionTreeClassifier(random_state=1)
des_tree.fit(train_data_split, train_labels_split)
des_tree_predictions = des_tree.predict(test_data_split)

In [31]:
roc_auc_score(test_labels_split, des_tree_predictions)

0.5578572584456929
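The tree above is scored on hard 0/1 labels from `predict`, which gives ROC AUC only one threshold to work with. Class probabilities from `predict_proba` give it a ranking instead. The sketch below compares the two on synthetic data from `make_classification`; the data, the `max_depth=3` setting and all variable names are assumptions for illustration, not the competition setup.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the competition data)
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tr, y_tr)

# AUC computed from hard 0/1 labels
auc_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC computed from the predicted probability of class 1
auc_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(auc_labels, auc_scores)
```

The two numbers can differ noticeably, so it is worth checking both before ruling a classifier out.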

In [32]:
gbx_model = GradientBoostingRegressor(n_estimators = 3, max_depth=3, learning_rate=0.9)
gbx_model.fit(train_data_split, train_labels_split)
gbx_model_predictions = gbx_model.predict(test_data_split)

In [33]:
roc_auc_score(test_labels_split, gbx_model_predictions)

0.6323012389292695

## 5. Try to improve our prediction
GradientBoostingRegressor has the best score (a ROC AUC of about 0.63), but it is still not good. We can try to improve it by handling the data more carefully. I dropped a lot of columns above; now I'll keep the 'developer' and 'okved_list' columns and encode them instead.
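In the table preview some 'okved_list' values hold several activities separated by "; ", so plain one-hot encoding treats every combination as its own category. One possible alternative, shown here only as a sketch, is a multi-hot encoding with one column per activity. The separator "; " is inferred from the preview, and `okved` and `multi_hot` are hypothetical names.

```python
import pandas as pd

# Toy values in the same shape as the okved_list column
okved = pd.Series([
    "Здравоохранение; Предоставление социальных услуг",
    "Финансовая деятельность",
    None,
    "Образование",
])

# Fill gaps with a placeholder, then create one indicator column per activity
multi_hot = okved.fillna("Неизвестно").str.get_dummies(sep="; ")
print(multi_hot)
```

With this encoding the first row is marked in two columns, one per listed activity.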

In [34]:
new_data_clean = new_all_data.drop(['act_title', 'responsible', 'act_significance', 
                          'relations_regulated_by_act', 'persons_affected_by_act', 
                          'act_objectives', 'problem_addressed', 'publication_date', 'added_by'], axis=1)

In [35]:
new_data_clean['developer'] = new_data_clean['developer'].fillna('Неизвестно')

In [None]:
new_data_clean['okved_list'] = new_data_clean['okved_list'].fillna('Неизвестно')

In [36]:
#look at our data again 
new_data_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 84417 entries, 0 to 85005
Data columns (total 11 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   id                                  84417 non-null  int64  
 1   developer                           84417 non-null  object 
 2   okved_list                          73097 non-null  object 
 3   views_num                           84417 non-null  float64
 4   comments_num                        84417 non-null  float64
 5   likes_num                           84417 non-null  float64
 6   dislikes_num                        84417 non-null  float64
 7   regulatory_impact                   84417 non-null  object 
 8   is_regionally_signigicant           84417 non-null  object 
 9   act_changes_controlling_activities  84417 non-null  object 
 10  mineco_solution                     84417 non-null  object 
dtypes: float64(4), int64(1), object(6)
memory

In [37]:
new_train_data = new_data_clean[new_data_clean['id'].isin(id_list)]
new_test_data = new_data_clean[new_data_clean['id'].isin(test_id)]
new_train_answer = train_answer[train_answer['id'].isin(id_list)]

In [38]:
new_y = new_train_answer['passed']
new_X = new_train_data

In [47]:
new_all_categorical_columns = [x for x in new_data_clean.columns if new_data_clean[x].dtype == 'object']
new_all_categorical_columns

['developer',
 'okved_list',
 'regulatory_impact',
 'is_regionally_signigicant',
 'act_changes_controlling_activities',
 'mineco_solution']

In [48]:
# Encode categorical features as a one-hot numeric array
# (in scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`)
new_OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
new_OH_enc_train = pd.DataFrame(new_OH_encoder.fit_transform(new_train_data[new_all_categorical_columns]))
new_OH_enc_test = pd.DataFrame(new_OH_encoder.transform(new_test_data[new_all_categorical_columns]))

In [49]:
new_num_data_train = new_train_data.drop(new_all_categorical_columns, axis=1)
new_num_data_test = new_test_data.drop(new_all_categorical_columns, axis=1)


In [50]:
# Add one-hot encoded columns to numerical features
new_OH_train = pd.DataFrame(np.hstack([new_num_data_train, new_OH_enc_train]))
new_OH_test = pd.DataFrame(np.hstack([new_num_data_test, new_OH_enc_test]))

In [51]:
new_train_data_split, new_test_data_split, new_train_labels_split, new_test_labels_split = train_test_split(
    new_OH_train, new_y, test_size = 0.3)

## 6. Make new prediction
Try different models to find the best.

In [53]:
new_linear_regressor = linear_model.LinearRegression()
new_linear_regressor.fit(new_train_data_split, new_train_labels_split)
new_linear_predictions = new_linear_regressor.predict(new_test_data_split)


In [54]:
roc_auc_score(new_test_labels_split, new_linear_predictions)

0.7950517444868533

In [55]:
new_log_regressor = linear_model.LogisticRegression(solver='lbfgs', max_iter=1000)
new_log_regressor.fit(new_train_data_split, new_train_labels_split)
new_log_predictions = new_log_regressor.predict_proba(new_test_data_split)[:, 1]

In [56]:
roc_auc_score(new_test_labels_split, new_log_predictions)

0.5857173436727234

In [57]:
new_des_tree = tree.DecisionTreeClassifier(random_state=1)
new_des_tree.fit(new_train_data_split, new_train_labels_split)
new_des_tree_predictions = new_des_tree.predict(new_test_data_split)

In [58]:
roc_auc_score(new_test_labels_split, new_des_tree_predictions)

0.6897043530741223

In [59]:
new_gbx_model = GradientBoostingRegressor(n_estimators = 3, max_depth=3, learning_rate=0.9)
new_gbx_model.fit(new_train_data_split, new_train_labels_split)
new_gbx_model_predictions = new_gbx_model.predict(new_test_data_split)

In [60]:
roc_auc_score(new_test_labels_split, new_gbx_model_predictions)

0.691121971470083

Now the best model is LinearRegression with a ROC AUC of about 0.795, which is better than the 0.63 reached before.
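All scores above come from a single random split without a fixed seed, so the ranking of the models may change from run to run. A hedged way to compare models more steadily is k-fold cross-validation with ROC AUC as the metric. The sketch uses synthetic data and two classifiers chosen for illustration (not the exact models from this notebook); all names in it are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (not the competition data)
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

# Fixed seed so that every model is evaluated on the same folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "boosting": GradientBoostingClassifier(n_estimators=20, random_state=0),
}

# Mean and standard deviation of ROC AUC over the folds for each model
cv_auc = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="roc_auc")
    cv_auc[name] = (scores.mean(), scores.std())

print(cv_auc)
```

If the gap between two models is smaller than the spread across folds, the single-split result is probably not enough to pick a winner.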

In [65]:
# use LinearRegression to solve our problem - determine whether a regulatory legal act will be adopted or not
linreg = linear_model.LinearRegression()
linreg.fit(new_OH_train, new_y)
linreg_predictions = linreg.predict(new_OH_test)

In [66]:
output = pd.DataFrame({'id': test_data.id, 'Passed': linreg_predictions})
output.to_csv('submission.csv', index=False)