# **Analysis**

## Configuration:

Import entities:

In [1]:
import os
import sys

sys.path.append(
    os.path.abspath(
        os.path.join(
            os.getcwd(),
            os.pardir,
        ),
    ),
)

from warnings import filterwarnings
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from pandas import (
    Series,
    DataFrame,

    read_csv,
    read_pickle,
)

from src.utilities import top_similar_vectors

Ignore warnings:

In [2]:
filterwarnings("ignore", )
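
A blanket `"ignore"` filter also hides deprecation notices from the libraries imported above. As a minimal sketch that is not part of the original notebook, the filter can be narrowed to a single warning category. The helper name `visible_warning_categories` is hypothetical and only demonstrates which warnings still get through:

```python
import warnings


def visible_warning_categories() -> list[str]:
    """Return the categories that still get through a narrowed filter."""
    with warnings.catch_warnings(record=True) as caught:
        # Report every warning by default inside this block...
        warnings.simplefilter("always")
        # ...then silence only FutureWarning instead of everything.
        warnings.filterwarnings("ignore", category=FutureWarning)

        warnings.warn("will change in a future release", FutureWarning)
        warnings.warn("something looks off", UserWarning)

    return [warning.category.__name__ for warning in caught]


print(visible_warning_categories())
```

Whether a narrower filter is worth it depends on how noisy the training cells are; the original cell keeps the simpler global setting.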

## Preprocessing:

Create a dictionary of parameters for the `read_csv()` calls:

In [3]:
read_csv_params: dict[str, str] = {
    "texts_file": "texts.csv",
    "target_file": "target.csv",

    "files_path": "../data/datasets/raw/",
}

Create a dictionary of parameters for the `read_pickle()` call:

In [4]:
read_pickle_params: dict[str, str] = {
    "file": "features.pkl",

    "files_path": "../data/datasets/processed/",
}
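
The cells that follow build file paths by concatenating `files_path` with a file name, which only works while `files_path` ends with a slash. A minimal sketch of a more forgiving alternative, assuming POSIX-style relative paths as in the dictionaries above (the helper `build_path` is hypothetical, not part of the notebook):

```python
from pathlib import PurePosixPath

read_csv_params: dict[str, str] = {
    "texts_file": "texts.csv",
    "target_file": "target.csv",
    "files_path": "../data/datasets/raw/",
}


def build_path(files_path: str, file_name: str) -> str:
    """Join a directory and a file name regardless of a trailing slash."""
    return str(PurePosixPath(files_path) / file_name)


target_path = build_path(
    read_csv_params["files_path"],
    read_csv_params["target_file"],
)
print(target_path)
```

The returned string can be passed to `read_csv()` in place of the concatenated path.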

Read the `target.csv` data:

In [5]:
y: DataFrame = read_csv(
    read_csv_params["files_path"] + read_csv_params["target_file"],
    index_col=0,
)

Check `y` variable:

In [6]:
y

| | tweet_label |
|---|---|
| 0 | negative |
| 1 | negative |
| 2 | negative |
| 3 | negative |
| 4 | negative |
| ... | ... |
| 2711 | positive |
| 2712 | positive |
| 2713 | positive |
| 2714 | positive |
2714,positive


Check target distribution:

In [7]:
y.value_counts()

tweet_label
positive       999
neutral        883
negative       834
Name: count, dtype: int64
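
The three classes are fairly balanced, which gives a reference point for the accuracy values reported later. As a small sketch using only the counts printed above, the class shares and the accuracy of always predicting the most frequent class can be computed like this:

```python
# Class counts copied from the `value_counts()` output above.
class_counts: dict[str, int] = {
    "positive": 999,
    "neutral": 883,
    "negative": 834,
}

total: int = sum(class_counts.values())

# Share of each class in the whole dataset.
class_shares: dict[str, float] = {
    label: count / total for label, count in class_counts.items()
}

# Accuracy of a trivial model that always predicts the largest class.
majority_label: str = max(class_counts, key=class_counts.get)
majority_baseline: float = class_counts[majority_label] / total

print(total, majority_label, round(majority_baseline, 3))
```

Any model evaluated on this data should score well above this baseline to be considered useful.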

Read the `texts.csv` data:

In [8]:
texts: DataFrame = read_csv(
    read_csv_params["files_path"] + read_csv_params["texts_file"],
    index_col=0,
)

Check `texts` variable:

In [9]:
texts

| | tweet |
|---|---|
| 0 | How unhappy some dogs like it though,talking ... |
| 1 | I miss going to gigs in Liverpool unhappy |
| 2 | There isnt a new Riverdale tonight ? unhappy |
| 3 | it's that A*dy guy from pop Asia and then the ... |
| 4 | Who's that chair you're sitting in? Is this ho... |
| ... | ... |
| 2711 | Thanks for the recent follow Happy to connect ... |
| 2712 | - top engaged members this week happy |
| 2713 | ngam to weeks left for cadet pilot exam cryin... |
| 2714 | Great! You're welcome Josh happy ^Adam |


Read the `features.pkl` data:

In [10]:
X: list[DataFrame] = read_pickle(
    read_pickle_params["files_path"] + read_pickle_params["file"],
)

Check `X` variable:

In [11]:
X

[      aa  aah  aam  aamby  aand  aap  aaree  aatein  abbeydale  abbreviation  \
 0      0    0    0      0     0    0      0       0          0             0   
 1      0    0    0      0     0    0      0       0          0             0   
 2      0    0    0      0     0    0      0       0          0             0   
 3      0    0    0      0     0    0      0       0          0             0   
 4      0    0    0      0     0    0      0       0          0             0   
 ...   ..  ...  ...    ...   ...  ...    ...     ...        ...           ...   
 2711   0    0    0      0     0    0      0       0          0             0   
 2712   0    0    0      0     0    0      0       0          0             0   
 2713   0    0    0      0     0    0      0       0          0             0   
 2714   0    0    0      0     0    0      0       0          0             0   
 2715   0    0    0      0     0    0      0       0          0             0   
 
       ...  yrs  yummy  yu

## Analytics:

Print the top `10` most similar pairs of texts for every dataframe:

In [12]:
df_idx: int = 0

for df in X:
    case_idx: int = 1

    print(f"CASE № {(df_idx // 4) + 1}.{(df_idx % 4) + 1}", )

    for texts_pair in top_similar_vectors(df, ):
        print(f"\tTEXTS PAIR № {case_idx}:", )

        print(f"\t\tTEXT INDEX: {texts_pair[0]}", )
        print(f"\t\tTEXT: {texts.iloc[texts_pair[0]].values[0]}", )

        print(f"\t\tTEXT INDEX: {texts_pair[1]}", )
        print(f"\t\tTEXT: {texts.iloc[texts_pair[1]].values[0]}\n", )

        case_idx += 1

    df_idx += 1

CASE № 1.1
	TEXTS PAIR № 1:
		TEXT INDEX: 1931
		TEXT: Exaseetly what's happening in TN Politisees. This is not to get BJP at all to TN. Game is different. Again, you'll be see
		TEXT INDEX: 2133
		TEXT: Exaseetly what's happening in TN Politisees. This is not to get BJP at all to TN. Game is different. Again, you'll be see

	TEXTS PAIR № 2:
		TEXT INDEX: 43
		TEXT: Koalas are dying of thirst  and it's all because of us unhappy 
		TEXT INDEX: 124
		TEXT: Koalas are dying of thirst  and it's all because of us unhappy 

	TEXTS PAIR № 3:
		TEXT INDEX: 43
		TEXT: Koalas are dying of thirst  and it's all because of us unhappy 
		TEXT INDEX: 213
		TEXT: Koalas are dying of thirst  and it's all because of us unhappy 

	TEXTS PAIR № 4:
		TEXT INDEX: 43
		TEXT: Koalas are dying of thirst  and it's all because of us unhappy 
		TEXT INDEX: 295
		TEXT: Koalas are dying of thirst  and it's all because of us unhappy 

	TEXTS PAIR № 5:
		TEXT INDEX: 43
		TEXT: Koalas are dying of thirst  and it's all
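
The implementation of `top_similar_vectors` lives in `src.utilities` and is not shown in this notebook. The sketch below is only an assumption about what such a helper could do: it ranks all row pairs by cosine similarity and returns the index pairs with the highest scores. The names `cosine_similarity` and `top_similar_pairs` are hypothetical.

```python
from heapq import nlargest
from itertools import combinations
from math import sqrt


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_similar_pairs(
    vectors: list[list[float]],
    top_n: int = 10,
) -> list[tuple[int, int]]:
    """Return the `top_n` index pairs (i, j), i < j, with the highest similarity."""
    scored = (
        (cosine_similarity(vectors[i], vectors[j]), i, j)
        for i, j in combinations(range(len(vectors)), 2)
    )
    return [(i, j) for _, i, j in nlargest(top_n, scored)]


# Toy count vectors: rows 0 and 1 are identical, like duplicated tweets.
rows = [
    [1, 0, 2],
    [1, 0, 2],
    [0, 3, 0],
    [1, 1, 2],
]
print(top_similar_pairs(rows, top_n=1))
```

Identical rows reach a similarity of 1, which is consistent with the output above, where the top pairs are exact duplicates of the same tweet.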

## Models:

Check *Logistic Regression* *accuracy* metric values for datasets:

In [13]:
df_idx: int = 0

for df in X:
    X_train, X_test, y_train, y_test = train_test_split(
        df,
        y,
        test_size=0.2,
        random_state=21,
    )
    log_reg_model: LogisticRegression = LogisticRegression(
        C=1.5,
        solver="saga",
        max_iter=750,
        multi_class="multinomial",
    )

    log_reg_model.fit(X_train, y_train, )

    print(f"CASE № {(df_idx // 4) + 1}.{(df_idx % 4) + 1}:", )

    print(
        "\tLogistic regression train accuracy:",
        round(accuracy_score(log_reg_model.predict(X_train, ), y_train, ), 3, ),
    )
    print(
        "\tLogistic regression test accuracy:",
        round(accuracy_score(log_reg_model.predict(X_test, ), y_test, ), 3, ),
    )

    print()

    df_idx += 1

CASE № 1.1:
	Logistic regression train accuracy: 0.999
	Logistic regression test accuracy: 0.936

CASE № 1.2:
	Logistic regression train accuracy: 0.999
	Logistic regression test accuracy: 0.925

CASE № 1.3:
	Logistic regression train accuracy: 0.994
	Logistic regression test accuracy: 0.932

CASE № 1.4:
	Logistic regression train accuracy: 0.966
	Logistic regression test accuracy: 0.949

CASE № 2.1:
	Logistic regression train accuracy: 0.998
	Logistic regression test accuracy: 0.936

CASE № 2.2:
	Logistic regression train accuracy: 0.998
	Logistic regression test accuracy: 0.928

CASE № 2.3:
	Logistic regression train accuracy: 0.993
	Logistic regression test accuracy: 0.928

CASE № 2.4:
	Logistic regression train accuracy: 0.968
	Logistic regression test accuracy: 0.928

CASE № 3.1:
	Logistic regression train accuracy: 0.998
	Logistic regression test accuracy: 0.93

CASE № 3.2:
	Logistic regression train accuracy: 0.998
	Logistic regression test accuracy: 0.93

CASE № 3.3:
	Logistic regression train accu
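
The `CASE № a.b` labels in these loops come from the integer quotient and remainder of the dataframe index divided by 4. A short sketch of the same mapping with `divmod` (the function name `case_label` and the parameter `variants_per_case` are hypothetical, chosen here for illustration):

```python
def case_label(df_idx: int, variants_per_case: int = 4) -> str:
    """Map a zero-based dataframe index to a one-based 'case.variant' label."""
    case, variant = divmod(df_idx, variants_per_case)
    return f"{case + 1}.{variant + 1}"


# The first six indexes cover case 1 completely and start case 2.
print([case_label(idx) for idx in range(6)])
```

Combined with `enumerate(X)`, a helper like this would remove the need to maintain `df_idx` by hand in each cell.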

Check *Decision Tree* *accuracy* metric values for datasets:

In [14]:
df_idx: int = 0

for df in X:
    X_train, X_test, y_train, y_test = train_test_split(
        df,
        y,
        test_size=0.2,
        random_state=21,
    )
    tree_model: DecisionTreeClassifier = DecisionTreeClassifier(random_state=21, )

    tree_model.fit(X_train, y_train, )

    print(f"CASE № {(df_idx // 4) + 1}.{(df_idx % 4) + 1}:", )

    print(
        "\tDecision tree train accuracy:",
        round(accuracy_score(tree_model.predict(X_train, ), y_train, ), 3, ),
    )
    print(
        "\tDecision tree test accuracy:",
        round(accuracy_score(tree_model.predict(X_test, ), y_test, ), 3, ),
    )

    print()

    df_idx += 1

CASE № 1.1:
	Decision tree train accuracy: 1.0
	Decision tree test accuracy: 0.91

CASE № 1.2:
	Decision tree train accuracy: 1.0
	Decision tree test accuracy: 0.906

CASE № 1.3:
	Decision tree train accuracy: 1.0
	Decision tree test accuracy: 0.903

CASE № 1.4:
	Decision tree train accuracy: 1.0
	Decision tree test accuracy: 0.792

CASE № 2.1:
	Decision tree train accuracy: 1.0
	Decision tree test accuracy: 0.912

CASE № 2.2:
	Decision tree train accuracy: 1.0
	Decision tree test accuracy: 0.91

CASE № 2.3:
	Decision tree train accuracy: 1.0
	Decision tree test accuracy: 0.904

CASE № 2.4:
	Decision tree train accuracy: 1.0
	Decision tree test accuracy: 0.774

CASE № 3.1:
	Decision tree train accuracy: 1.0
	Decision tree test accuracy: 0.904

CASE № 3.2:
	Decision tree train accuracy: 1.0
	Decision tree test accuracy: 0.904

CASE № 3.3:
	Decision tree train accuracy: 1.0
	Decision tree test accuracy: 0.89

CASE № 3.4:
	Decision tree train accuracy: 1.0
	Decision tree test accuracy: 0.
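
Every decision tree reaches a train accuracy of 1.0, so the difference between train and test accuracy is a quick indicator of how much each dataset variant overfits. The sketch below uses only the values printed above; case 3.4 is left out because its test value is cut off in the output.

```python
# (train accuracy, test accuracy) per case, copied from the output above.
tree_scores: dict[str, tuple[float, float]] = {
    "1.1": (1.0, 0.91),
    "1.2": (1.0, 0.906),
    "1.3": (1.0, 0.903),
    "1.4": (1.0, 0.792),
    "2.1": (1.0, 0.912),
    "2.2": (1.0, 0.91),
    "2.3": (1.0, 0.904),
    "2.4": (1.0, 0.774),
    "3.1": (1.0, 0.904),
    "3.2": (1.0, 0.904),
    "3.3": (1.0, 0.89),
}

# Gap between train and test accuracy; a larger gap means stronger overfitting.
gaps: dict[str, float] = {
    case: round(train - test, 3) for case, (train, test) in tree_scores.items()
}

worst_case: str = max(gaps, key=gaps.get)
print(worst_case, gaps[worst_case])
```

Among the complete results, the `.4` variants of cases 1 and 2 show the largest gaps, while the remaining variants stay close to 0.1.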

Check *Random Forest* *accuracy* metric values for datasets:

In [15]:
df_idx: int = 0

for df in X:
    X_train, X_test, y_train, y_test = train_test_split(
        df,
        y,
        test_size=0.2,
        random_state=21,
    )
    forest_model: RandomForestClassifier = RandomForestClassifier(
        n_jobs=-1,
        random_state=21,
        n_estimators=250,
    )

    forest_model.fit(X_train, y_train, )

    print(f"CASE № {(df_idx // 4) + 1}.{(df_idx % 4) + 1}:", )

    print(
        "\tRandom Forest Tree train accuracy:",
        round(accuracy_score(forest_model.predict(X_train, ), y_train, ), 3, ),
    )
    print(
        "\tRandom Forest Tree test accuracy:",
        round(accuracy_score(forest_model.predict(X_test, ), y_test, ), 3, ),
    )

    print()

    df_idx += 1

CASE № 1.1:
	Random Forest Tree train accuracy: 1.0
	Random Forest Tree test accuracy: 0.932

CASE № 1.2:
	Random Forest Tree train accuracy: 1.0
	Random Forest Tree test accuracy: 0.928

CASE № 1.3:
	Random Forest Tree train accuracy: 1.0
	Random Forest Tree test accuracy: 0.938

CASE № 1.4:
	Random Forest Tree train accuracy: 1.0
	Random Forest Tree test accuracy: 0.928

CASE № 2.1:
	Random Forest Tree train accuracy: 1.0
	Random Forest Tree test accuracy: 0.925

CASE № 2.2:
	Random Forest Tree train accuracy: 1.0
	Random Forest Tree test accuracy: 0.926

CASE № 2.3:
	Random Forest Tree train accuracy: 1.0
	Random Forest Tree test accuracy: 0.928

CASE № 2.4:
	Random Forest Tree train accuracy: 1.0
	Random Forest Tree test accuracy: 0.915

CASE № 3.1:
	Random Forest Tree train accuracy: 1.0
	Random Forest Tree test accuracy: 0.923

CASE № 3.2:
	Random Forest Tree train accuracy: 1.0
	Random Forest Tree test accuracy: 0.926

CASE № 3.3:
	Random Forest Tree train accuracy: 1.0
	Random 