# Prepare the Dataset

To prevent data leakage when partitioning the data into training and testing sets, follow these steps:

1. Shuffle the data: randomize the row order before splitting, so that any ordering in the source file (by date, by label, by source) does not bias either set.
2. Divide the data: separate the dataset into two disjoint sets, one for training and one for testing. Common train/test ratios are 70/30 and 80/20, although the best ratio depends on the dataset size and the task.
3. Preprocess without leaking test information: compute any statistics needed for preprocessing (e.g., the mean and standard deviation used for normalization) on the training data only, then apply the same fitted transformation to both the training and testing sets. Fitting on the test set, or on the full dataset before splitting, lets test information leak into training.
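
As a minimal sketch of the preprocessing step, the snippet below computes the mean and standard deviation on the training values only and reuses them for the test values. The toy numbers and the helper name `standardize` are illustrative assumptions and do not come from the notebook.

```python
from statistics import mean, pstdev

# Toy numeric feature (hypothetical values), already split into train and test.
train_values = [12.0, 33.0, 55.0, 87.0]
test_values = [24.0, 99.0]

# Fit: the statistics come from the training set only.
train_mean = mean(train_values)
train_std = pstdev(train_values)


def standardize(values, mu, sigma):
    """Apply a z-score transform using precomputed statistics."""
    return [(v - mu) / sigma for v in values]


# Transform: both sets reuse the training statistics.
train_scaled = standardize(train_values, train_mean, train_std)
test_scaled = standardize(test_values, train_mean, train_std)
```

The scaled test values will generally not have zero mean or unit variance. That is expected, because the test set never contributed to the statistics.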

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 150308 entries, 0 to 150582
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   text       150308 non-null  object
 1   sentiment  150308 non-null  object
 2   words      150308 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 8.6+ MB


In [None]:
data.describe()

Unnamed: 0,words
count,150308.0
mean,66.038567
std,43.477705
min,12.0
25%,33.0
50%,55.0
75%,87.0
max,289.0


## Shuffle Data


In [None]:
data = data.sample(frac=1)

In [None]:
data

Unnamed: 0,text,sentiment,words
77230,Вашингтон ввел экспортные ограничения против с...,negative,24
137414,Новые санкции Европейского союза в связи с рос...,negative,99
18506,На международной выставке вооружений Internati...,neutral,74
89441,Новые предложения ООН по возобновлению зерново...,neutral,79
106290,Министр финансов США Уолли Адейемо прибыл в Нь...,neutral,68
...,...,...,...
93604,Власти Канады приняли решение о запрете исполь...,neutral,40
22402,Поставки смартфонов ценовой категории до 10 ты...,negative,94
55684,Крупнейший российский банк Сбербанк вынужден з...,negative,20
87806,Российское отделение глобальной экологической ...,neutral,34
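
The shuffle above produces a different order on every run. If a repeatable order is wanted, one option is to pass `random_state` to `DataFrame.sample`. The sketch below uses a small hypothetical frame named `toy` and an arbitrary seed rather than the notebook's `data`.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data` frame.
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e"],
    "sentiment": ["negative", "neutral", "positive", "neutral", "negative"],
})

SEED = 42  # arbitrary illustrative seed

# frac=1 returns every row in random order; a fixed random_state makes it repeatable.
shuffled_once = toy.sample(frac=1, random_state=SEED)
shuffled_again = toy.sample(frac=1, random_state=SEED)

# Both runs give the same order, and the original index labels are preserved.
print(shuffled_once.index.tolist() == shuffled_again.index.tolist())
```

Appending `.reset_index(drop=True)` would discard the original index labels if a clean 0..n-1 index is preferred.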


## Convert Sentiments From Strings to Integer Ids

In [None]:
data["sentiment"] = data["sentiment"].astype('category')
data.dtypes

text           object
sentiment    category
words           int64
dtype: object

In [None]:
data["sentiment_id"] = data["sentiment"].cat.codes
data.tail()

Unnamed: 0,text,sentiment,words,sentiment_id
93604,Власти Канады приняли решение о запрете исполь...,neutral,40,1
22402,Поставки смартфонов ценовой категории до 10 ты...,negative,94,0
55684,Крупнейший российский банк Сбербанк вынужден з...,negative,20,0
87806,Российское отделение глобальной экологической ...,neutral,34,1
79520,"Компании , продающие российские товары , перев...",negative,22,0
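
The integer ids are not arbitrary: for string labels, pandas infers the categories in sorted order, and each code is the position of the label in that list. The sketch below illustrates this on a toy series (the variable names are illustrative, not from the notebook).

```python
import pandas as pd

labels = pd.Series(["neutral", "negative", "positive", "negative"])  # toy labels
as_category = labels.astype("category")

# For string labels, the categories are inferred in sorted order.
categories = list(as_category.cat.categories)

# Each code is the position of the label within the category list.
codes = as_category.cat.codes.tolist()

# The id-to-label mapping can be read directly from the categories.
id_to_label = dict(enumerate(categories))
print(categories, codes, id_to_label)
```

Because the order is alphabetical, `negative`, `neutral`, and `positive` map to 0, 1, and 2, which matches the `sentiment_id` column shown above.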


## Build a Dictionary Mapping Ids to Sentiment Labels

In [None]:
# Read the mapping from the category list instead of building a 150k-row Series first
id_to_sentiment = dict(enumerate(data["sentiment"].cat.categories))
id_to_sentiment

{0: 'negative', 1: 'neutral', 2: 'positive'}

## Build the Inverse Dictionary from Sentiment Labels to Ids

In [None]:
sentiment_to_id = {v:k for k,v in id_to_sentiment.items()}
sentiment_to_id

{'negative': 0, 'neutral': 1, 'positive': 2}
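
A quick way to confirm that the two dictionaries are consistent is a round-trip check: converting an id to its label and back should return the same id. The sketch below rebuilds both dictionaries from the values printed above so that it runs on its own.

```python
id_to_sentiment = {0: "negative", 1: "neutral", 2: "positive"}
sentiment_to_id = {label: idx for idx, label in id_to_sentiment.items()}

# Inversion is only lossless when every label is unique.
labels_are_unique = len(sentiment_to_id) == len(id_to_sentiment)

# Converting id -> label -> id must return the original id.
round_trip_ok = all(
    sentiment_to_id[id_to_sentiment[idx]] == idx for idx in id_to_sentiment
)
print(labels_are_unique, round_trip_ok)
```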

### Save the Mapping and Check That It Reloads

In [None]:
!ls

data_preprocessing.ipynb  models	       sums_lang_for_news.xlsx
EDA.ipynb		  prepare_text.ipynb   страны_партнеры_с_тональностью_общий_271123.xlsx
id_to_sentiment.pkl	  sentiment_full.xlsx
model.ipynb		  stopwords-ru.txt


In [None]:
import pickle
with open('id_to_sentiment.pkl', 'wb') as fp:
    pickle.dump(id_to_sentiment, fp)

In [None]:
with open('id_to_sentiment.pkl', 'rb') as fp:
    id_to_sentiment_loaded = pickle.load(fp)

In [None]:
id_to_sentiment_loaded

{0: 'negative', 1: 'neutral', 2: 'positive'}
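
The save and reload steps can also be verified programmatically instead of by eye. The sketch below writes to a temporary directory (an assumption made so that the example leaves no files behind) and compares the restored dictionary with the original.

```python
import pickle
import tempfile
from pathlib import Path

id_to_sentiment = {0: "negative", 1: "neutral", 2: "positive"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "id_to_sentiment.pkl"
    # Serialize the mapping to disk.
    with open(path, "wb") as fp:
        pickle.dump(id_to_sentiment, fp)
    # Read it back into a new object.
    with open(path, "rb") as fp:
        restored = pickle.load(fp)

# Pickle keeps the integer keys as integers.
keys_are_ints = all(isinstance(key, int) for key in restored)
print(restored == id_to_sentiment, keys_are_ints)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.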

In [None]:
number_of_sentiment = len(sentiment_to_id)
print(f"number of sentiments: {number_of_sentiment}")

number of sentiments: 3


## Finally check the dataset columns and rows of the modified data frame:

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 150308 entries, 77230 to 79520
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype   
---  ------        --------------   -----   
 0   text          150308 non-null  object  
 1   sentiment     150308 non-null  category
 2   words         150308 non-null  int64   
 3   sentiment_id  150308 non-null  int8    
dtypes: category(1), int64(1), int8(1), object(1)
memory usage: 3.7+ MB


## Split the Raw Dataset into Train, Validation, and Test Datasets


### Split Train & Test Datasets

In [None]:
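from sklearn.model_selection import train_test_split

# Inputs are the raw texts; targets are the integer sentiment ids
features, targets = data['text'], data['sentiment_id']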

all_train_features, test_features, all_train_targets, test_targets = train_test_split(
        features, targets,
        train_size=0.8,
        test_size=0.2,
        random_state=int(''.join(hex(ord(c))[2:] for c in "NLP"), 16),
        shuffle = True,
        stratify=targets
)
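
The `random_state` expression looks cryptic, but it only turns the string "NLP" into a fixed integer by concatenating the hexadecimal codes of its characters. The sketch below evaluates it and compares it with an equivalent formulation based on `int.from_bytes`; the variable names are illustrative.

```python
# The expression used in the notebook: hex codes of 'N', 'L', 'P' joined and parsed.
seed_from_hex = int("".join(hex(ord(c))[2:] for c in "NLP"), 16)

# Equivalent for ASCII text: interpret the bytes as one big-endian integer.
seed_from_bytes = int.from_bytes("NLP".encode("ascii"), "big")

print(seed_from_hex, seed_from_bytes)
```

Both forms give 5131344 (0x4E4C50). Any fixed integer would serve equally well as a seed; what matters is that the same value is reused, so the splits are reproducible.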

### Split Train & Validation Datasets

In [None]:
# In this article we keep 95% of the training portion (not of the full dataset); the other 5% is discarded

reduce_ratio = 0.95

reduced_train_features, _, reduced_train_targets, _ = train_test_split(
        all_train_features, all_train_targets,
        train_size=reduce_ratio,
        random_state=int(''.join(hex(ord(c))[2:] for c in "NLP"), 16),
        shuffle = True,
        stratify=all_train_targets
)

In [None]:
train_features, val_features, train_targets, val_targets = train_test_split(
        reduced_train_features, reduced_train_targets,
        train_size=0.9,
        random_state=int(''.join(hex(ord(c))[2:] for c in "NLP"), 16),
        shuffle = True,
        stratify=reduced_train_targets
)

In [None]:
print("Train Data Set size: ",len(train_features))
print("Validation Data Set size: ",len(val_features))
print("Test Data Set size: ",len(test_features))

Train Data Set size:  102809
Validation Data Set size:  11424
Test Data Set size:  30062



### Split Train & Validation Datasets

In [None]:
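# Variant without the 95% reduction step: split the full training portion directly
train_features, val_features, train_targets, val_targets = train_test_split(
        all_train_features, all_train_targets,
        train_size=0.9,
        random_state=int(''.join(hex(ord(c))[2:] for c in "NLP"), 16),
        shuffle = True,
        stratify=all_train_targets  # must be the same targets that are being split
)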