# **Preprocessing**

## Configuration:

Import the required modules and classes:

In [1]:
import os
import re
import sys

sys.path.append(
    os.path.abspath(
        os.path.join(
            os.getcwd(),
            os.pardir,
        ),
    ),
)

from pandas import (
    DataFrame,

    concat,
    to_pickle,
)

from src.models import TextPreprocessor, TextToFeaturesConverter

## Preprocessing:

Create a text preprocessing tool:

In [2]:
text_preprocessor: TextPreprocessor = TextPreprocessor()

Create a text features extractor tool:

In [3]:
features_extractor: TextToFeaturesConverter = TextToFeaturesConverter()

Create a dictionary of initialization parameters for the text preprocessing tool:

In [4]:
text_preprocessor_init_params: dict[str, str] = {
    "file": "word_frequency_en.txt",

    "file_path": "../data/txt/",
}

Initialize the `TextPreprocessor` instance:

In [5]:
text_preprocessor.initialize_tools(
    lang_dict_file=text_preprocessor_init_params["file"],
    lang_dict_file_path=text_preprocessor_init_params["file_path"],
)

Initialize the `TextToFeaturesConverter` instance:

In [6]:
features_extractor.initialize_tools()

Create a dictionary of parameters for the `open()` function calls:

In [7]:
open_params: dict[str, str] = {
    "neutral_tweets_file": "neutral_tweets.csv",
    "positive_tweets_file": "positive_tweets.csv",
    "negative_tweets_file": "negative_tweets.csv",

    "files_path": "../data/datasets/raw/",
}

Read the contents of the file `negative_tweets.csv`:

In [8]:
with open(
    open_params["files_path"] + open_params["negative_tweets_file"],
    'r',
    encoding="utf-8",
) as file:
    file_content: str = file.read()

Parse negative tweets data:

In [9]:
neg_tweets: list[str] = file_content.split(" ,", )

Check `neg_tweets` list:

In [10]:
neg_tweets

["How unhappy  some dogs like it though,talking to my over driver about where I'm goinghe said he'd love to go to New York too but since Trump it's probably not,Does anybody know if the Rand's likely to fall against the dollar? I got some money  I need to change into R but it keeps getting stronger unhappy",
 'I miss going to gigs in Liverpool unhappy',
 'There isnt a new Riverdale tonight ? unhappy',
 "it's that A*dy guy from pop Asia and then the translator so they'll probs go with them around Aus unhappy",
 "Who's that chair you're sitting in? Is this how I find out. Everyone knows now. You've shamed me in pu,don't like how jittery caffeine makes me sad",
 "My area's not on the list unhappy  think I'll go LibDems anyway,I want fun plans this weekend unhappy",
 'When can you notice me.  unhappy  what? ',
 'Ahhhhh! You recognized LOGAN!!! Cinemax shows have a BAD track record for getting cancelled unhappy',
 "Errr dude.... They're gone unhappy  Asked other league memeber to check  the

Create a *Pandas* dataframe of tweets:

In [11]:
tweets_df: DataFrame = DataFrame({
    "tweet": neg_tweets,
    "tweet_label": ["negative", ] * len(neg_tweets, ),
}, )

Check `tweets_df` *Pandas* dataframe:

In [12]:
tweets_df

Unnamed: 0,tweet,tweet_label
0,"How unhappy some dogs like it though,talking ...",negative
1,I miss going to gigs in Liverpool unhappy,negative
2,There isnt a new Riverdale tonight ? unhappy,negative
3,it's that A*dy guy from pop Asia and then the ...,negative
4,Who's that chair you're sitting in? Is this ho...,negative
...,...,...
829,YG should have sent them to MCD. I want to see...,negative
830,wish knock out lang talaga for the new school ...,negative
831,i miss so much unhappy,negative
832,Same unhappy,negative


Read the contents of the file `neutral_tweets.csv`:

In [13]:
with open(
    open_params["files_path"] + open_params["neutral_tweets_file"],
    'r',
    encoding="utf-8",
) as file:
    file_content: str = file.read()

Parse neutral tweets data:

In [14]:
neutral_tweets: list[str] = file_content.split(" ,", )

Check `neutral_tweets` list:

In [15]:
neutral_tweets

['Pak PM survives removal scare, but court orders further probe into corruption charge.',
 "Supreme Court quashes criminal complaint against cricketer for allegedly depicting himself as on magazine cover.,Art of Living's fights back over Yamuna floodplain damage, livid.",
 'FCRA slap on NGO for lobbying...But was it doing so as part of govt campaign?',
 'Why doctors, pharma companies are opposing names on',
 'Why a bicycle and not a CM asked. His officer learnt ground reality -- and  a dip in a river.',
 "It's 2017. making law to ban And MHA is sitting on draft.",
 'Rivals govts unite to act against sex-determination tests.',
 'Haryana peasants demand justice for right to cattle trade.',
 'Why schools in Calcutta (and elsewhere) are stunned by imposition plan.',
 'Why renamed places in',
 'Oooh! the shame, the trauma... of driving without the red, flashing light. (And paying for parking).',
 'Now have to learn to live without that flashing red light.',
 'BJP leaders in the dock, lose t

Add `neutral_tweets` to `tweets_df` *Pandas* dataframe:

In [16]:
tweets_df = concat(
    [
        tweets_df,
        DataFrame({
            "tweet": neutral_tweets,
            "tweet_label": ["neutral", ] * len(neutral_tweets, ),
        }, ),
    ],
    ignore_index=True,
)

Check updated `tweets_df` *Pandas* dataframe:

In [17]:
tweets_df

Unnamed: 0,tweet,tweet_label
0,"How unhappy some dogs like it though,talking ...",negative
1,I miss going to gigs in Liverpool unhappy,negative
2,There isnt a new Riverdale tonight ? unhappy,negative
3,it's that A*dy guy from pop Asia and then the ...,negative
4,Who's that chair you're sitting in? Is this ho...,negative
...,...,...
1712,"dept crackdown on hoarders; issues 87 notices,...",neutral
1713,Payments Bank to offer 4.5% on deposits up to ...,neutral
1714,"Historian Ram Guha, IDFC official Vikram Limay...",neutral
1715,Supreme Court names former CAG as head of 4-me...,neutral


Read the contents of the file `positive_tweets.csv`:

In [18]:
with open(
    open_params["files_path"] + open_params["positive_tweets_file"],
    'r',
    encoding="utf-8",
) as file:
    file_content: str = file.read()

Parse positive tweets data:

In [19]:
pos_tweets: list[str] = re.split(r",(?!\s)", file_content, )
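
The negative and neutral files are split with `str.split(" ,")`, while the positive file is split with `re.split(r",(?!\s)", ...)`. The following minimal sketch compares the two strategies on a made-up string (`raw` is a hypothetical example, not data from the files). It may help explain why some entries produced by `split(" ,")` still contain several comma-joined tweets.

```python
import re

# Hypothetical sample: four "tweets" joined with inconsistent separators.
raw = "first tweet ,second, with comma ,third,fourth"

# str.split(" ,") cuts only where a space comes right before the comma.
by_space_comma = raw.split(" ,")

# re.split(r",(?!\s)") cuts at every comma that is NOT followed by whitespace
# (negative lookahead), so commas inside a sentence ("second, with") survive.
by_regex = re.split(r",(?!\s)", raw)

print(by_space_comma)  # ['first tweet', 'second, with comma', 'third,fourth']
print(by_regex)        # ['first tweet ', 'second, with comma ', 'third', 'fourth']
```

Note that the regex variant keeps the trailing space of each piece, so a later cleaning step has to strip it.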

Check `pos_tweets` list:

In [20]:
pos_tweets

['An inspiration in all aspects: Fashion, fitness, beauty and personality. :)KISSES TheFashionIcon',
 'Apka Apna Awam Ka Channel Frankline Tv Aam Admi Production Please Visit Or Likes  Share :)Fb Page :...',
 "Beautiful album from  the greatest unsung guitar genius of our time - and I've met the great backstage",
 'Good luck to Rich riding for great project in this Sunday. Can you donate?',
 'Omg he... kissed... him crying with joy',
 'happy anniv ming and papi!!!!! love love happy',
 'thanks happy',
 "C'mon Tweeps, Join  vote for the singer! Do spread the word. :D",
 'Thanks for the great review! smile',
 'Yay another art raffle! Everything you need to know is in the picture :D',
 'Hello I hope you visit Luxor its amazing city in Egypt pleas check',
 "We got a Vive tracker in the office and our intern, went to work.Don't get too excited, this isn't",
 'Take a look at favourites.io You can do this and more happy',
 'Go back to school for music! I think I will in time happy',
 'Sixth sp

Add `pos_tweets` to `tweets_df` *Pandas* dataframe:

In [21]:
tweets_df = concat(
    [
        tweets_df,
        DataFrame({
            "tweet": pos_tweets,
            "tweet_label": ["positive", ] * len(pos_tweets, ),
        }, ),
    ],
    ignore_index=True,
)

Check updated `tweets_df` *Pandas* dataframe:

In [22]:
tweets_df

Unnamed: 0,tweet,tweet_label
0,"How unhappy some dogs like it though,talking ...",negative
1,I miss going to gigs in Liverpool unhappy,negative
2,There isnt a new Riverdale tonight ? unhappy,negative
3,it's that A*dy guy from pop Asia and then the ...,negative
4,Who's that chair you're sitting in? Is this ho...,negative
...,...,...
2711,Thanks for the recent follow Happy to connect ...,positive
2712,- top engaged members this week happy,positive
2713,ngam to weeks left for cadet pilot exam cryin...,positive
2714,Great! You're welcome Josh happy ^Adam,positive


Create a *Pandas* dataframe column of cleaned tweets:

In [23]:
tweets_df["clean_tweet"] = tweets_df["tweet"].apply(
    TextPreprocessor.clean_text,
)
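
The implementation of `TextPreprocessor.clean_text` is not shown in this notebook. Judging only from the outputs (lowercased text, punctuation removed, apostrophes kept, repeated spaces collapsed), a rough approximation could look like the sketch below. The function name `clean_text_sketch` and the exact character rules, including keeping digits, are assumptions, not the project's code.

```python
import re


def clean_text_sketch(text: str) -> str:
    """Hypothetical stand-in for a tweet cleaning step."""
    lowered = text.lower()
    # Assumption: drop everything except letters, digits, apostrophes and whitespace.
    kept = re.sub(r"[^a-z0-9'\s]", "", lowered)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", kept).strip()


print(clean_text_sketch("Great! You're welcome Josh happy ^Adam"))
# great you're welcome josh happy adam
```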

Create a *Pandas* dataframe column of tokenized tweets:

In [24]:
tweets_df["tokenized_tweet"] = tweets_df["clean_tweet"].apply(
    text_preprocessor.tokenize_text,
)

Create a *Pandas* dataframe column of stemmed tweets:

In [25]:
tweets_df["stemmed_tweet"] = tweets_df["clean_tweet"].apply(
    text_preprocessor.stem_text,
)

Create a *Pandas* dataframe column of lemmatized tweets:

In [26]:
tweets_df["lemmatized_tweet"] = tweets_df["clean_tweet"].apply(
    text_preprocessor.lemmatize_text,
)

Create a *Pandas* dataframe column of spell-corrected tweets:

In [27]:
tweets_df["correct_tweet"] = tweets_df["clean_tweet"].apply(
    text_preprocessor.correct_text,
)
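
How `correct_text` works internally is not shown here; the only hint is that the preprocessor is initialized with a word-frequency dictionary. As an illustration of the general idea, the sketch below implements a small Norvig-style corrector: it generates all strings one edit away from a word and picks the most frequent known candidate. The frequency table `WORD_FREQ` and the function names are hypothetical, and the real tool may use a different algorithm.

```python
from collections import Counter

# Hypothetical toy frequency dictionary (word -> corpus count).
WORD_FREQ = Counter({"the": 100, "they": 40, "any": 30, "guy": 10})
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word: str) -> set[str]:
    """All strings that are one delete, transpose, replace or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct_word(word: str) -> str:
    """Return the word itself if known, else the most frequent 1-edit candidate."""
    if word in WORD_FREQ:
        return word
    candidates = [w for w in edits1(word) if w in WORD_FREQ]
    if not candidates:
        return word  # nothing better found, keep the original token
    return max(candidates, key=WORD_FREQ.__getitem__)


print(correct_word("teh"))  # the
print(correct_word("ady"))  # any
```

A frequency-driven corrector of this kind can also change valid but rare tokens such as names or slang, so its output is worth checking against the original text.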

Create a *Pandas* dataframe column of stemmed spell-corrected tweets:

In [28]:
tweets_df["stemmed_correct_tweet"] = tweets_df["correct_tweet"].apply(
    text_preprocessor.stem_text,
)

Create a *Pandas* dataframe column of lemmatized spell-corrected tweets:

In [29]:
tweets_df["lemmatized_correct_tweet"] = tweets_df["correct_tweet"].apply(
    text_preprocessor.lemmatize_text,
)

Create a *Pandas* dataframe column of spell-corrected tweets that are lemmatized and then stemmed:

In [30]:
tweets_df["lemm_stemm_correct_tweet"] = tweets_df[
    "lemmatized_correct_tweet"
].apply(
    text_preprocessor.stem_text,
)

Check updated `tweets_df` *Pandas* dataframe:

In [31]:
tweets_df

Unnamed: 0,tweet,tweet_label,clean_tweet,tokenized_tweet,stemmed_tweet,lemmatized_tweet,correct_tweet,stemmed_correct_tweet,lemmatized_correct_tweet,lemm_stemm_correct_tweet
0,"How unhappy some dogs like it though,talking ...",negative,how unhappy some dogs like it thoughtalking to...,how unhappy some dogs like it thoughtalking to...,how unhappi some dog like it thoughtalk to my ...,how unhappy some dog like it thoughtalke to my...,how unhappy some dogs like it though talking t...,how unhappi some dog like it though talk to my...,how unhappy some dog like it though talk to my...,how unhappi some dog like it though talk to my...
1,I miss going to gigs in Liverpool unhappy,negative,i miss going to gigs in liverpool unhappy,i miss going to gigs in liverpool unhappy,i miss go to gig in liverpool unhappi,I miss go to gig in liverpool unhappy,i miss going to gigs in liverpool unhappy,i miss go to gig in liverpool unhappi,I miss go to gig in liverpool unhappy,i miss go to gig in liverpool unhappi
2,There isnt a new Riverdale tonight ? unhappy,negative,there isnt a new riverdale tonight unhappy,there is nt a new riverdale tonight unhappy,there isnt a new riverdal tonight unhappi,there be not a new riverdale tonight unhappy,there int a new river dale tonight unhappy,there int a new river dale tonight unhappi,there int a new river dale tonight unhappy,there int a new river dale tonight unhappi
3,it's that A*dy guy from pop Asia and then the ...,negative,it's that ady guy from pop asia and then the t...,it 's that ady guy from pop asia and then the ...,it' that adi guy from pop asia and then the tr...,it be that ady guy from pop asia and then the ...,it's that any guy from pop asia and then they ...,it' that ani guy from pop asia and then they t...,it be that any guy from pop asia and then they...,it be that ani guy from pop asia and then they...
4,Who's that chair you're sitting in? Is this ho...,negative,who's that chair you're sitting in is this how...,who 's that chair you 're sitting in is this h...,who' that chair you'r sit in is thi how i find...,who be that chair you be sit in be this how I ...,who's that chair you're sitting in is this how...,who' that chair you'r sit in is thi how i find...,who be that chair you be sit in be this how I ...,who be that chair you be sit in be thi how i f...
...,...,...,...,...,...,...,...,...,...,...
2711,Thanks for the recent follow Happy to connect ...,positive,thanks for the recent follow happy to connect ...,thanks for the recent follow happy to connect ...,thank for the recent follow happi to connect h...,thank for the recent follow happy to connect h...,thanks for they recent follow happy to connect...,thank for they recent follow happi to connect ...,thank for they recent follow happy to connect ...,thank for they recent follow happi to connect ...
2712,- top engaged members this week happy,positive,top engaged members this week happy,top engaged members this week happy,top engag member thi week happi,top engage member this week happy,top engaged members this week happy,top engag member thi week happi,top engage member this week happy,top engag member thi week happi
2713,ngam to weeks left for cadet pilot exam cryin...,positive,ngam to weeks left for cadet pilot exam crying...,ngam to weeks left for cadet pilot exam crying...,ngam to week left for cadet pilot exam cri wit...,ngam to week leave for cadet pilot exam cry wi...,nam to weeks left for cadet pilot exam crying ...,nam to week left for cadet pilot exam cri with...,nam to week leave for cadet pilot exam cry wit...,nam to week leav for cadet pilot exam cri with...
2714,Great! You're welcome Josh happy ^Adam,positive,great you're welcome josh happy adam,great you 're welcome josh happy adam,great you'r welcom josh happi adam,great you be welcome josh happy adam,great you're welcome josh happy adam,great you'r welcom josh happi adam,great you be welcome josh happy adam,great you be welcom josh happi adam


### Case № 1.1: *tokenization* + one-hot word encoding

Create a `one_one_X` *Pandas* dataframe:

In [32]:
one_one_X: DataFrame = features_extractor.one_hot_texts_encoding(
    tweets_df["tokenized_tweet"],
)
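
For readers unfamiliar with the encoding, here is a minimal, dependency-free sketch of one-hot (binary presence) encoding over whitespace tokens. It is not the code behind `one_hot_texts_encoding`; the function name `one_hot_encode` and the sample texts are made up for illustration.

```python
def one_hot_encode(texts: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return a sorted vocabulary and one binary row per text."""
    tokenized = [text.split() for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        present = set(tokens)
        # 1 if the word occurs in the text at least once, otherwise 0.
        rows.append([1 if word in present else 0 for word in vocabulary])
    return vocabulary, rows


vocab, matrix = one_hot_encode(["i miss you", "i miss miss gigs"])
print(vocab)   # ['gigs', 'i', 'miss', 'you']
print(matrix)  # [[0, 1, 1, 1], [1, 1, 1, 0]]
```

Because each tweet uses only a handful of the vocabulary's words, almost every cell is zero, which is why the preview that follows shows only zeros in its visible corner.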

Check `one_one_X` features *Pandas* dataframe:

In [33]:
one_one_X

Unnamed: 0,aa,aah,aam,aamby,aand,aap,aaree,aatein,abbeydale,abbreviation,...,yrs,yummy,yura,yuri,zabardast,zac,zcc,zoo,zoos,zplus
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2712,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2713,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2714,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Case № 1.2: *tokenization* + word counts

Create a `one_two_X` *Pandas* dataframe:

In [34]:
one_two_X: DataFrame = features_extractor.word_count_texts_encoding(
    tweets_df["tokenized_tweet"],
)
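
Word-count encoding differs from one-hot encoding only in that each cell stores how many times the word occurs. A minimal sketch with `collections.Counter` follows; the function name `count_encode` and the sample texts are hypothetical and not taken from `word_count_texts_encoding`.

```python
from collections import Counter


def count_encode(texts: list[str]) -> tuple[list[str], list[list[int]]]:
    """Return a sorted vocabulary and one row of word counts per text."""
    tokenized = [text.split() for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words that are missing from the text.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


count_vocab, count_matrix = count_encode(["i miss you", "i miss miss gigs"])
print(count_vocab)   # ['gigs', 'i', 'miss', 'you']
print(count_matrix)  # [[0, 1, 1, 1], [1, 1, 2, 0]]
```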

Check `one_two_X` features *Pandas* dataframe:

In [35]:
one_two_X

Unnamed: 0,aa,aah,aam,aamby,aand,aap,aaree,aatein,abbeydale,abbreviation,...,yrs,yummy,yura,yuri,zabardast,zac,zcc,zoo,zoos,zplus
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2712,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2713,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2714,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Case № 1.3: *tokenization* + *TFIDF*

Create a `one_three_X` *Pandas* dataframe:

In [36]:
one_three_X: DataFrame = features_extractor.tfidf_texts_encoding(
    tweets_df["tokenized_tweet"],
)
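
As a reference point, the sketch below computes the textbook TF-IDF weight, tf(w, d) · ln(N / df(w)), in plain Python. The formula variant is an assumption: the notebook does not show which one `tfidf_texts_encoding` uses, and libraries such as scikit-learn apply idf smoothing and L2 normalization by default, so their values will differ. The function name `tfidf_encode` and the sample texts are hypothetical.

```python
import math
from collections import Counter


def tfidf_encode(texts: list[str]) -> tuple[list[str], list[list[float]]]:
    """Textbook TF-IDF: relative term frequency times ln(N / document frequency)."""
    tokenized = [text.split() for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        rows.append([
            (counts[word] / total) * math.log(n_docs / doc_freq[word]) if total else 0.0
            for word in vocabulary
        ])
    return vocabulary, rows


tfidf_vocab, tfidf_matrix = tfidf_encode(["happy happy day", "sad day"])
print(tfidf_vocab)  # ['day', 'happy', 'sad']
```

In this toy input, `day` occurs in every text, so its weight is zero in both rows; words that appear everywhere carry no discriminative information under this variant.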

Check `one_three_X` features *Pandas* dataframe:

In [37]:
one_three_X

Unnamed: 0,aa,aah,aam,aamby,aand,aap,aaree,aatein,abbeydale,abbreviation,...,yrs,yummy,yura,yuri,zabardast,zac,zcc,zoo,zoos,zplus
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2712,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2713,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Case № 1.4: *tokenization* + *Word2Vec*

Create a `one_four_X` *Pandas* dataframe:

In [38]:
one_four_X: DataFrame = features_extractor.vectorize_texts(
    tweets_df["tokenized_tweet"],
)
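
`vectorize_texts` returns one fixed-length vector per tweet. A common way to get such a vector from word embeddings is to average the vectors of the words in the text, sketched below. Whether the project pools this way is an assumption, and the two-dimensional `toy_embeddings` table is invented for the example (a trained Word2Vec model would supply the real vectors).

```python
def mean_vector(
    tokens: list[str],
    embeddings: dict[str, list[float]],
    size: int,
) -> list[float]:
    """Average the embeddings of known tokens; zeros if none are known."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        return [0.0] * size
    return [sum(vector[i] for vector in known) / len(known) for i in range(size)]


# Hypothetical 2-dimensional embeddings for illustration only.
toy_embeddings = {"happy": [1.0, 0.0], "day": [0.0, 1.0]}

doc_vector = mean_vector(["happy", "day", "unknownword"], toy_embeddings, size=2)
print(doc_vector)  # [0.5, 0.5]
```

Out-of-vocabulary tokens are skipped here, so they do not pull the average toward zero.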

Check `one_four_X` features *Pandas* dataframe:

In [39]:
one_four_X

Unnamed: 0,dimension_0,dimension_1,dimension_2,dimension_3,dimension_4,dimension_5,dimension_6,dimension_7,dimension_8,dimension_9,...,dimension_240,dimension_241,dimension_242,dimension_243,dimension_244,dimension_245,dimension_246,dimension_247,dimension_248,dimension_249
0,-0.061767,-0.122467,0.300482,0.115267,-0.303182,0.017469,0.167554,0.520933,-0.175822,0.162333,...,0.119785,-0.238796,-0.226898,0.339356,0.114046,0.157465,-0.206771,-0.208556,-0.144915,0.117434
1,-0.134083,-0.078439,0.335571,-0.175300,-0.127151,0.211416,0.000985,0.496592,-0.178489,-0.041985,...,0.159959,0.070245,-0.185904,0.346785,-0.059531,0.050251,-0.286910,-0.045060,-0.182499,-0.058027
2,-0.023282,-0.103393,0.420455,0.240858,-0.310392,-0.130594,0.068090,0.365690,-0.065331,-0.056268,...,0.025511,-0.231293,0.090603,0.349733,0.177972,-0.041384,-0.502523,-0.232789,0.074859,0.087928
3,-0.125613,-0.137377,0.203219,-0.032459,-0.196844,-0.039268,-0.033831,0.391309,-0.178239,0.076454,...,0.163098,-0.193279,-0.054824,0.135720,-0.178860,-0.091143,-0.178470,-0.335222,-0.227865,0.197706
4,0.078919,-0.089494,0.221988,0.084062,-0.208204,0.007334,-0.209732,0.499130,-0.106748,-0.062078,...,-0.023611,-0.326033,-0.187465,0.374848,-0.100698,0.009577,-0.331047,-0.164686,0.011378,0.215837
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0.066765,-0.004433,0.197034,0.136130,-0.306570,-0.425096,0.137893,0.667307,-0.214141,-0.003280,...,0.207779,-0.253599,0.127879,0.364308,0.109542,-0.186394,-0.156605,-0.139887,-0.176057,0.413698
2712,0.414234,0.086750,0.485559,0.298917,-0.380568,-0.232517,0.034228,0.766327,-0.450644,0.020289,...,0.079562,-0.477127,-0.008500,0.443839,0.165756,0.153278,-0.402060,0.095470,0.164943,0.255702
2713,0.142955,0.152104,0.313865,0.130010,0.016866,-0.140609,0.073849,0.841080,-0.196955,0.362229,...,0.042612,0.003830,0.050807,0.373064,-0.094443,-0.179905,-0.212822,-0.282937,-0.316601,0.055774
2714,-0.089700,-0.139854,0.217621,-0.055435,-0.319571,-0.085019,-0.105720,0.575298,-0.189676,-0.023705,...,-0.046609,-0.242340,-0.095550,0.260596,-0.105095,0.109452,-0.110255,-0.081533,0.050351,0.318043


### Case № 2.1: *stemming* + one-hot word encoding

Create a `two_one_X` *Pandas* dataframe:

In [40]:
two_one_X: DataFrame = features_extractor.one_hot_texts_encoding(
    tweets_df["stemmed_tweet"],
)

Check `two_one_X` features *Pandas* dataframe:

In [41]:
two_one_X

Unnamed: 0,aa,aah,aam,aambi,aand,aap,aare,aatein,abbeydal,abbrevi,...,yrold,yummi,yura,yuri,zabardast,zac,zcc,zero,zoo,zplu
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2712,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2713,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2714,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Case № 2.2: *stemming* + word counts

Create a `two_two_X` *Pandas* dataframe:

In [42]:
two_two_X: DataFrame = features_extractor.word_count_texts_encoding(
    tweets_df["stemmed_tweet"],
)

Check `two_two_X` features *Pandas* dataframe:

In [43]:
two_two_X

Unnamed: 0,aa,aah,aam,aambi,aand,aap,aare,aatein,abbeydal,abbrevi,...,yrold,yummi,yura,yuri,zabardast,zac,zcc,zero,zoo,zplu
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2712,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2713,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2714,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Case № 2.3: *stemming* + *TFIDF*

Create a `two_three_X` *Pandas* dataframe:

In [44]:
two_three_X: DataFrame = features_extractor.tfidf_texts_encoding(
    tweets_df["stemmed_tweet"],
)

Check `two_three_X` features *Pandas* dataframe:

In [45]:
two_three_X

Unnamed: 0,aa,aah,aam,aambi,aand,aap,aare,aatein,abbeydal,abbrevi,...,yrold,yummi,yura,yuri,zabardast,zac,zcc,zero,zoo,zplu
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2712,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2713,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Case № 2.4: *stemming* + *Word2Vec*

Create a `two_four_X` *Pandas* dataframe:

In [46]:
two_four_X: DataFrame = features_extractor.vectorize_texts(
    tweets_df["stemmed_tweet"],
)

Check `two_four_X` features *Pandas* dataframe:

In [47]:
two_four_X

Unnamed: 0,dimension_0,dimension_1,dimension_2,dimension_3,dimension_4,dimension_5,dimension_6,dimension_7,dimension_8,dimension_9,...,dimension_240,dimension_241,dimension_242,dimension_243,dimension_244,dimension_245,dimension_246,dimension_247,dimension_248,dimension_249
0,0.007371,-0.085470,0.236070,0.155977,-0.168675,-0.284488,0.275337,0.438373,-0.241257,0.112027,...,-0.008026,-0.195773,-0.052042,0.695786,0.114877,-0.092523,-0.262793,-0.228381,0.086128,0.055692
1,-0.391601,-0.075652,0.227152,0.016913,-0.302496,-0.367725,0.088900,0.192588,-0.485205,-0.145853,...,0.158838,-0.061492,-0.002628,0.403645,-0.054337,-0.053898,-0.332101,-0.145703,0.081106,0.079052
2,0.119211,0.087575,0.376863,-0.109246,-0.225153,-0.380305,0.249609,0.331963,-0.263438,0.339153,...,0.237542,0.009354,0.194919,0.358382,-0.137499,0.056158,-0.291287,-0.253453,-0.002387,0.243125
3,-0.057977,-0.167968,0.123735,0.178510,-0.308903,-0.186987,0.089537,0.549404,-0.337233,0.186393,...,-0.062461,-0.149335,0.092723,0.657991,-0.043155,-0.011003,-0.360469,-0.174854,0.048784,0.083889
4,0.033642,-0.053385,0.381596,0.127415,0.091244,-0.372803,-0.001501,0.367281,-0.241985,0.116003,...,0.074357,-0.324612,-0.139472,0.349342,0.017981,-0.216648,-0.159461,-0.210256,-0.043262,0.144848
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0.150164,-0.176155,0.227357,0.013217,-0.189874,-0.116120,0.241157,0.536817,-0.204420,0.036558,...,0.104508,-0.150083,-0.097600,0.677180,-0.049715,-0.019152,-0.139606,-0.216275,0.079587,0.349817
2712,0.648642,0.051286,0.054825,0.254072,0.064896,-0.134701,0.093122,0.324043,-0.031901,-0.171741,...,-0.098902,-0.291058,-0.219070,0.442312,-0.039814,-0.008915,-0.295867,-0.132462,0.182028,0.099058
2713,0.300705,-0.143432,0.179449,0.068929,-0.212265,-0.395035,-0.008465,0.218681,-0.039697,0.281688,...,0.087581,-0.165768,-0.061890,0.567710,0.062975,-0.065133,-0.103726,-0.256378,0.057903,-0.075661
2714,0.264570,-0.189800,0.092798,0.111379,-0.245650,0.059020,-0.011596,0.508932,-0.258937,-0.007942,...,0.238563,-0.294874,-0.174598,0.358291,-0.152969,0.057080,-0.311616,-0.133364,0.168500,0.283640


### Case № 3.1: *lemmatization* + one-hot word encoding

Create a `three_one_X` *Pandas* dataframe:

In [48]:
three_one_X: DataFrame = features_extractor.one_hot_texts_encoding(
    tweets_df["lemmatized_tweet"],
)

Check `three_one_X` features *Pandas* dataframe:

In [49]:
three_one_X

Unnamed: 0,aa,aah,aam,aamby,aand,aap,aaree,aatein,abbeydale,abbreviation,...,yrs,yummy,yura,yuri,zabardast,zac,zcc,zero,zoo,zplus
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2712,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2713,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2714,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Case № 3.2: *lemmatization* + word counts

Create a `three_two_X` *Pandas* dataframe:

In [50]:
three_two_X: DataFrame = features_extractor.word_count_texts_encoding(
    tweets_df["lemmatized_tweet"],
)

Check `three_two_X` features *Pandas* dataframe:

In [51]:
three_two_X

Unnamed: 0,aa,aah,aam,aamby,aand,aap,aaree,aatein,abbeydale,abbreviation,...,yrs,yummy,yura,yuri,zabardast,zac,zcc,zero,zoo,zplus
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2712,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2713,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2714,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Case № 3.3: *lemmatization* + *TFIDF*

Create a `three_three_X` *Pandas* dataframe:

In [52]:
three_three_X: DataFrame = features_extractor.tfidf_texts_encoding(
    tweets_df["lemmatized_tweet"],
)

Check `three_three_X` features *Pandas* dataframe:

In [53]:
three_three_X

Unnamed: 0,aa,aah,aam,aamby,aand,aap,aaree,aatein,abbeydale,abbreviation,...,yrs,yummy,yura,yuri,zabardast,zac,zcc,zero,zoo,zplus
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2712,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2713,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Case № 3.4: *lemmatization* + *Word2Vec*

Create a `three_four_X` *Pandas* dataframe:

In [54]:
three_four_X: DataFrame = features_extractor.vectorize_texts(
    tweets_df["lemmatized_tweet"],
)

Check `three_four_X` features *Pandas* dataframe:

In [55]:
three_four_X

Unnamed: 0,dimension_0,dimension_1,dimension_2,dimension_3,dimension_4,dimension_5,dimension_6,dimension_7,dimension_8,dimension_9,...,dimension_240,dimension_241,dimension_242,dimension_243,dimension_244,dimension_245,dimension_246,dimension_247,dimension_248,dimension_249
0,0.038704,-0.112328,-0.025798,0.233627,-0.392081,-0.131821,0.154124,0.546077,-0.315891,0.008830,...,0.118994,-0.100012,0.062877,0.376420,0.207360,0.042215,0.032110,-0.130029,-0.013559,0.017186
1,-0.044653,0.027982,0.041566,0.123163,-0.585888,-0.051363,0.191326,0.380328,-0.219200,0.025554,...,0.095256,-0.168751,0.047749,0.301466,0.266683,-0.086764,-0.181943,-0.196085,0.078450,0.073279
2,0.218548,-0.153565,0.151471,0.222589,-0.395482,-0.161680,0.293205,0.281407,-0.225107,-0.158923,...,0.063444,-0.119497,0.143630,0.171524,0.114426,-0.074181,-0.097516,-0.318782,0.009866,-0.162398
3,-0.077549,-0.143141,-0.032672,0.196796,-0.347374,-0.171031,0.005976,0.449855,-0.315995,-0.037393,...,0.254728,-0.210875,0.040427,0.187226,0.257362,-0.099821,-0.001914,-0.019356,-0.064082,-0.047987
4,0.134185,-0.038812,0.070896,0.248045,-0.239581,-0.202576,0.294571,0.589893,-0.308370,0.007291,...,0.075010,-0.053010,0.074603,0.482202,0.172830,-0.078999,0.050462,-0.110391,0.039740,-0.047807
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0.156194,0.130419,0.207399,0.021782,-0.102476,-0.134323,0.126175,0.530689,-0.359598,-0.114455,...,0.039294,-0.239449,0.043111,0.173033,0.055853,-0.106947,0.068909,0.112883,0.146479,0.234748
2712,0.072073,-0.002961,0.413089,-0.070277,-0.416020,-0.151069,0.057561,0.276155,-0.495450,0.126096,...,-0.252758,-0.356070,0.108292,0.135098,0.072268,0.003202,-0.067966,-0.043079,0.302974,0.323739
2713,0.341742,0.132485,0.096627,0.001022,-0.171352,-0.135961,0.113901,0.004896,-0.332527,0.122229,...,0.122860,-0.439927,0.104689,0.314026,-0.018684,-0.082604,-0.082245,-0.131466,-0.049321,0.106019
2714,0.037118,0.062026,0.063738,-0.117307,-0.192628,0.024351,0.347455,0.520548,-0.414385,-0.143796,...,0.079612,-0.156151,-0.018002,0.160139,0.148415,0.100957,-0.081162,-0.161900,0.143983,0.191308


### Case № 4.1: *stemming* + *misspelling correction* + one-hot word encoding

Create a `four_one_X` *Pandas* dataframe:

In [56]:
four_one_X: DataFrame = features_extractor.one_hot_texts_encoding(
    tweets_df["stemmed_correct_tweet"],
)

Check `four_one_X` features *Pandas* dataframe:

In [57]:
four_one_X

Unnamed: 0,aah,aba,abacu,abb,abbey,abbrevi,abdul,abil,abl,abolish,...,youth,youtub,yr,yuck,yuk,yummi,yuri,zero,zit,zoo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2712,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2713,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2714,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Case № 4.2: *stemming* + *misspelling correction* + word counts

Create a `four_two_X` *Pandas* dataframe:

In [58]:
four_two_X: DataFrame = features_extractor.word_count_texts_encoding(
    tweets_df["stemmed_correct_tweet"],
)

Check `four_two_X` features *Pandas* dataframe:

In [59]:
four_two_X

Unnamed: 0,aah,aba,abacu,abb,abbey,abbrevi,abdul,abil,abl,abolish,...,youth,youtub,yr,yuck,yuk,yummi,yuri,zero,zit,zoo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2712,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2713,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2714,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Case № 4.3: *stemming* + *misspelling correction* + *TFIDF*

Create a `four_three_X` *Pandas* dataframe:

In [60]:
four_three_X: DataFrame = features_extractor.tfidf_texts_encoding(
    tweets_df["stemmed_correct_tweet"],
)

Check `four_three_X` features *Pandas* dataframe:

In [61]:
four_three_X

Unnamed: 0,aah,aba,abacu,abb,abbey,abbrevi,abdul,abil,abl,abolish,...,youth,youtub,yr,yuck,yuk,yummi,yuri,zero,zit,zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2712,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2713,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
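
To illustrate how TF-IDF weights relate to the raw counts, here is a minimal sketch of the textbook formula `tf * log(N / df)`. It is not necessarily the formula used by `tfidf_texts_encoding`: libraries often apply smoothed and normalized variants, so the actual values may differ. The helper `tfidf_encode` and the toy texts are hypothetical.

```python
import math
from collections import Counter


def tfidf_encode(texts: list[str]) -> tuple[list[str], list[list[float]]]:
    """Return the sorted vocabulary and one row of TF-IDF weights per text."""
    # Assumption: tokens are separated by whitespace.
    tokenized = [text.split() for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)

    # Document frequency: the number of texts that contain each word.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty texts
        rows.append([
            (counts[word] / total) * math.log(n_docs / doc_freq[word])
            for word in vocabulary
        ])
    return vocabulary, rows


toy_texts = ["happi dog happi", "sad dog"]
tfidf_vocabulary, tfidf_rows = tfidf_encode(toy_texts)
```

In this toy corpus `"dog"` occurs in every text, so its weight is zero, while words specific to one text receive positive weights.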


### Case № 4.4: *stemming* + *misspelling* + *Word2Vec*

Create a `four_four_X` *Pandas* dataframe:

In [62]:
four_four_X: DataFrame = features_extractor.vectorize_texts(
    tweets_df["stemmed_correct_tweet"],
)

Check `four_four_X` features *Pandas* dataframe:

In [63]:
four_four_X

Unnamed: 0,dimension_0,dimension_1,dimension_2,dimension_3,dimension_4,dimension_5,dimension_6,dimension_7,dimension_8,dimension_9,...,dimension_240,dimension_241,dimension_242,dimension_243,dimension_244,dimension_245,dimension_246,dimension_247,dimension_248,dimension_249
0,-0.162476,-0.194091,0.335169,0.143437,-0.588507,-0.247566,-0.233377,0.239815,-0.257635,-0.017461,...,-0.036192,-0.228887,-0.009539,0.230115,-0.283369,0.348551,-0.085193,-0.191582,-0.067620,0.226611
1,-0.031624,-0.177385,0.206018,-0.007090,-0.440385,-0.260499,-0.227845,0.139961,-0.198470,0.204221,...,-0.001599,-0.319729,0.031838,0.034238,-0.076299,0.183001,-0.119065,0.065834,-0.100457,0.304954
2,-0.331145,0.040748,0.347002,-0.162450,-0.458903,-0.237373,-0.410471,0.300278,0.092020,-0.062379,...,-0.047151,-0.474848,-0.000471,-0.105235,-0.040227,0.261341,-0.134749,0.083114,0.133014,0.433739
3,-0.022656,-0.161444,0.345800,0.097267,-0.568974,-0.291655,-0.202658,0.234336,-0.177197,0.022286,...,-0.109555,-0.412490,-0.247934,0.050345,-0.134376,0.263614,-0.182416,-0.109744,-0.071136,0.024110
4,0.019949,-0.385882,0.179616,0.097124,-0.225836,-0.146915,-0.270495,0.148015,-0.216197,0.071952,...,0.085054,-0.322955,0.015083,0.275901,0.050868,0.184796,-0.114525,0.056256,-0.171089,0.107932
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0.056671,-0.005310,0.335454,0.082531,-0.198613,-0.232593,-0.181941,0.527217,-0.197734,-0.001245,...,0.052345,-0.067845,0.024690,0.247971,0.006248,-0.076053,-0.105791,-0.196382,-0.214628,0.344528
2712,0.222690,-0.265681,0.295385,0.138828,-0.419833,-0.305284,0.053335,0.100132,-0.091036,0.058695,...,-0.145672,-0.329974,0.103858,0.074635,0.052480,-0.089449,-0.000296,-0.286248,-0.286036,0.301091
2713,0.014180,-0.182193,0.383550,0.102780,-0.655337,-0.388702,-0.144063,0.380744,-0.109167,0.131153,...,-0.363147,-0.099617,-0.012555,0.260641,-0.253481,-0.095104,-0.280262,-0.114229,-0.140941,0.040897
2714,0.187862,-0.297265,0.159643,-0.113890,-0.066808,-0.171475,-0.187411,0.335857,-0.210900,0.209595,...,0.104987,-0.112360,0.199657,0.278640,0.165910,-0.073006,-0.057076,0.096683,-0.172961,0.261731
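
Each tweet is represented here by one dense 250-dimensional vector (`dimension_0` to `dimension_249`). How `vectorize_texts` aggregates word vectors is not shown in this notebook; averaging the vectors of a tweet's words is one common approach, sketched below under that assumption. The helper `average_word_vectors` and the 3-dimensional toy embeddings are hypothetical.

```python
def average_word_vectors(
    text: str,
    embeddings: dict[str, list[float]],
    size: int,
) -> list[float]:
    """Average the embeddings of the known words in a text."""
    # Words missing from the embedding table are skipped.
    vectors = [embeddings[token] for token in text.split() if token in embeddings]
    if not vectors:
        # No known words: fall back to a zero vector of the right length.
        return [0.0] * size
    return [sum(component) / len(vectors) for component in zip(*vectors)]


toy_embeddings = {
    "happi": [1.0, 0.0, 2.0],
    "dog": [0.0, 1.0, 0.0],
}
doc_vector = average_word_vectors("happi dog unknown", toy_embeddings, size=3)
empty_vector = average_word_vectors("unknown", toy_embeddings, size=3)
```

Unlike the three bag-of-words encodings, the length of the result depends on the embedding size rather than on the vocabulary size.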


### Case № 5.1: *lemmatization* + *misspelling* + one-hot word encoding

Create a `five_one_X` *Pandas* dataframe:

In [64]:
five_one_X: DataFrame = features_extractor.one_hot_texts_encoding(
    tweets_df["lemmatized_correct_tweet"],
)

Check `five_one_X` features *Pandas* dataframe:

In [65]:
five_one_X

Unnamed: 0,aah,aba,abacus,abb,abbey,abbreviation,abdul,ability,able,abolish,...,youth,youtube,yrs,yuck,yuk,yummy,yuri,zero,zit,zoo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2712,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2713,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2714,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Case № 5.2: *lemmatization* + *misspelling* + word counts

Create a `five_two_X` *Pandas* dataframe:

In [66]:
five_two_X: DataFrame = features_extractor.word_count_texts_encoding(
    tweets_df["lemmatized_correct_tweet"],
)

Check `five_two_X` features *Pandas* dataframe:

In [67]:
five_two_X

Unnamed: 0,aah,aba,abacus,abb,abbey,abbreviation,abdul,ability,able,abolish,...,youth,youtube,yrs,yuck,yuk,yummy,yuri,zero,zit,zoo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2712,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2713,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2714,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Case № 5.3: *lemmatization* + *misspelling* + *TFIDF*

Create a `five_three_X` *Pandas* dataframe:

In [68]:
five_three_X: DataFrame = features_extractor.tfidf_texts_encoding(
    tweets_df["lemmatized_correct_tweet"],
)

Check `five_three_X` features *Pandas* dataframe:

In [69]:
five_three_X

Unnamed: 0,aah,aba,abacus,abb,abbey,abbreviation,abdul,ability,able,abolish,...,youth,youtube,yrs,yuck,yuk,yummy,yuri,zero,zit,zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2712,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2713,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Case № 5.4: *lemmatization* + *misspelling* + *Word2Vec*

Create a `five_four_X` *Pandas* dataframe:

In [70]:
five_four_X: DataFrame = features_extractor.vectorize_texts(
    tweets_df["lemmatized_correct_tweet"],
)

Check `five_four_X` features *Pandas* dataframe:

In [71]:
five_four_X

Unnamed: 0,dimension_0,dimension_1,dimension_2,dimension_3,dimension_4,dimension_5,dimension_6,dimension_7,dimension_8,dimension_9,...,dimension_240,dimension_241,dimension_242,dimension_243,dimension_244,dimension_245,dimension_246,dimension_247,dimension_248,dimension_249
0,-0.018053,-0.016279,0.057575,0.112256,-0.327474,0.012242,0.185735,0.157532,-0.189161,-0.106480,...,0.087008,-0.277312,-0.063090,0.441163,0.103859,-0.053000,-0.158805,0.060010,0.032491,0.044363
1,0.081701,0.132216,0.143102,0.169117,-0.280074,-0.159592,0.203107,0.136328,-0.205358,-0.031638,...,-0.072767,-0.253792,-0.140147,0.474967,0.175684,-0.157541,-0.343145,0.248563,0.067823,0.210059
2,0.152251,-0.086496,0.218267,0.066760,-0.263602,-0.000570,0.072416,0.140005,-0.154260,-0.155220,...,0.072875,0.105646,0.022758,0.576317,0.248606,-0.085606,0.163423,-0.170099,0.053728,0.174531
3,0.114298,-0.139698,-0.018334,0.148686,-0.152567,0.004993,0.243839,0.029320,-0.255074,0.009711,...,0.082266,-0.051576,0.064414,0.445034,0.246621,-0.169040,0.008304,-0.030564,0.009899,0.057274
4,0.071350,0.079069,0.127595,0.208275,-0.383567,0.001132,0.289007,0.396918,-0.130785,-0.166354,...,0.075115,-0.175006,-0.168919,0.469753,0.108412,-0.066194,0.008599,0.086783,0.203974,0.140814
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0.094143,0.121030,0.252455,0.006931,-0.022882,0.011973,0.259360,0.482440,-0.207756,-0.060945,...,0.312982,-0.268140,0.114624,0.330606,0.088114,-0.127563,-0.052228,0.151088,-0.077215,0.081717
2712,0.122801,-0.182374,0.094358,0.059427,-0.218031,-0.030407,0.376148,0.552203,-0.270842,0.347519,...,0.420137,-0.486543,0.197607,0.546291,-0.062744,0.207924,0.109035,0.039536,0.291047,0.075078
2713,0.004791,-0.160167,0.042394,-0.021770,-0.066995,-0.173233,0.250857,0.342969,-0.303371,0.155700,...,0.194208,-0.270978,0.356018,0.461030,-0.090645,-0.218212,-0.220552,-0.040794,-0.103120,0.065196
2714,0.200782,0.035730,0.153905,0.124389,-0.048563,0.052753,0.232641,0.551770,-0.175158,-0.235083,...,0.244287,-0.084117,-0.006416,0.503867,0.287464,0.045109,0.012400,0.008506,0.076308,0.196940


### Case № 6.1: *misspelling* + *lemmatization* + *stemming* + one-hot word encoding

Create a `six_one_X` *Pandas* dataframe:

In [72]:
six_one_X: DataFrame = features_extractor.one_hot_texts_encoding(
    tweets_df["lemm_stemm_correct_tweet"],
)

Check `six_one_X` features *Pandas* dataframe:

In [73]:
six_one_X

Unnamed: 0,aah,aba,abacu,abb,abbey,abbrevi,abdul,abil,abl,abolish,...,youth,youtub,yr,yuck,yuk,yummi,yuri,zero,zit,zoo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2712,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2713,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2714,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Case № 6.2: *misspelling* + *lemmatization* + *stemming* + word counts

Create a `six_two_X` *Pandas* dataframe:

In [74]:
six_two_X: DataFrame = features_extractor.word_count_texts_encoding(
    tweets_df["lemm_stemm_correct_tweet"],
)

Check `six_two_X` features *Pandas* dataframe:

In [75]:
six_two_X

Unnamed: 0,aah,aba,abacu,abb,abbey,abbrevi,abdul,abil,abl,abolish,...,youth,youtub,yr,yuck,yuk,yummi,yuri,zero,zit,zoo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2712,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2713,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2714,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Case № 6.3: *misspelling* + *lemmatization* + *stemming* + *TFIDF*

Create a `six_three_X` *Pandas* dataframe:

In [76]:
six_three_X: DataFrame = features_extractor.tfidf_texts_encoding(
    tweets_df["lemm_stemm_correct_tweet"],
)

Check `six_three_X` features *Pandas* dataframe:

In [77]:
six_three_X

Unnamed: 0,aah,aba,abacu,abb,abbey,abbrevi,abdul,abil,abl,abolish,...,youth,youtub,yr,yuck,yuk,yummi,yuri,zero,zit,zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2712,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2713,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Case № 6.4: *misspelling* + *lemmatization* + *stemming* + *Word2Vec*

Create a `six_four_X` *Pandas* dataframe:

In [78]:
six_four_X: DataFrame = features_extractor.vectorize_texts(
    tweets_df["lemm_stemm_correct_tweet"],
)

Check `six_four_X` features *Pandas* dataframe:

In [79]:
six_four_X

Unnamed: 0,dimension_0,dimension_1,dimension_2,dimension_3,dimension_4,dimension_5,dimension_6,dimension_7,dimension_8,dimension_9,...,dimension_240,dimension_241,dimension_242,dimension_243,dimension_244,dimension_245,dimension_246,dimension_247,dimension_248,dimension_249
0,0.075623,-0.053573,0.048276,0.158851,-0.435287,0.036035,0.076583,0.477610,-0.159406,0.004982,...,0.169894,-0.196760,-0.025702,0.322409,0.038325,-0.167791,0.069093,-0.227788,0.045708,-0.025613
1,-0.286654,0.205551,0.023496,0.122787,-0.488694,-0.175087,0.110520,0.261982,-0.139379,0.079901,...,0.021428,-0.313149,0.204978,0.360924,0.071416,-0.363158,-0.140681,-0.030414,0.120179,0.177560
2,0.178179,0.023283,0.027334,0.046935,-0.305808,0.077488,0.142202,0.429714,-0.329555,0.114736,...,0.222631,-0.306034,0.027065,0.226805,0.110873,-0.194974,0.015451,-0.173793,-0.050488,0.025409
3,0.048241,0.106163,0.067108,0.224597,-0.273883,-0.011376,0.063097,0.476041,-0.158223,0.112210,...,-0.111203,-0.148045,0.061340,0.183386,0.146527,-0.338070,0.149569,-0.167992,-0.071339,0.002086
4,0.023565,0.081978,0.105434,0.150774,-0.331002,-0.019933,0.164921,0.367725,-0.036313,-0.090679,...,0.003924,-0.138574,-0.055408,0.306338,0.131306,-0.146905,-0.042798,-0.110156,0.090330,-0.005352
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,0.212906,-0.096060,0.377995,-0.035244,-0.030764,-0.019777,-0.056901,0.446691,-0.061690,0.196482,...,0.198938,-0.406966,0.168049,0.168423,-0.213310,-0.291436,0.010886,-0.028548,0.063877,0.008243
2712,0.142034,-0.211142,0.369832,0.003968,0.090582,0.058259,-0.250913,0.430139,0.072233,0.087886,...,0.170410,-0.195520,0.403132,0.148676,-0.040021,-0.015708,-0.213357,-0.002442,0.033454,-0.069370
2713,0.032787,-0.267921,0.255933,0.133437,-0.167732,0.133439,-0.083933,0.253139,0.059078,0.084206,...,0.085826,-0.253374,-0.026229,0.136999,0.002855,0.026949,-0.008894,-0.228312,-0.125285,0.005153
2714,0.218502,0.058609,0.368104,0.147220,-0.036475,-0.107740,0.051077,0.571327,-0.174996,0.130581,...,0.069327,-0.102500,-0.021250,0.286999,-0.209717,-0.080766,-0.012929,-0.124525,0.047091,0.135164


## Saving the data:

Create a dictionary of parameters for the `to_csv()` method calls:

In [80]:
to_csv_params: dict[str, str] = {
    "texts_file": "texts.csv",
    "target_file": "target.csv",

    "file_path": "../data/datasets/raw/",
}

Create a dictionary of parameters for the `to_pickle()` function call:

In [81]:
to_pickle_params: dict[str, str] = {
    "file": "features.pkl",

    "file_path": "../data/datasets/processed/",
}

Collect all feature dataframes into a single `features` list:

In [82]:
features: list[DataFrame] = [
    one_one_X, one_two_X, one_three_X, one_four_X,
    two_one_X, two_two_X, two_three_X, two_four_X,
    three_one_X, three_two_X, three_three_X, three_four_X,
    four_one_X, four_two_X, four_three_X, four_four_X,
    five_one_X, five_two_X, five_three_X, five_four_X,
    six_one_X, six_two_X, six_three_X, six_four_X,
]
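
The list is positional, so downstream code has to remember that index 0 is Case № 1.1 and index 23 is Case № 6.4. As a sketch of one way to keep the names alongside the data, the snippet below regenerates the 24 variable names in the same order as the list above. The names `preprocessing_cases`, `encoding_cases`, and `feature_names` are hypothetical.

```python
from itertools import product

# First part of each variable name: the preprocessing case (1 to 6).
preprocessing_cases = ["one", "two", "three", "four", "five", "six"]
# Second part: the encoding (one-hot, word counts, TFIDF, Word2Vec).
encoding_cases = ["one", "two", "three", "four"]

# product() varies the last argument fastest, which matches the row-by-row
# order in which the dataframes are listed.
feature_names = [
    f"{preprocessing}_{encoding}_X"
    for preprocessing, encoding in product(preprocessing_cases, encoding_cases)
]
```

In the notebook, `dict(zip(feature_names, features))` would then give a name-to-dataframe mapping.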

Check `features` list:

In [83]:
features

[      aa  aah  aam  aamby  aand  aap  aaree  aatein  abbeydale  abbreviation  \
 0      0    0    0      0     0    0      0       0          0             0   
 1      0    0    0      0     0    0      0       0          0             0   
 2      0    0    0      0     0    0      0       0          0             0   
 3      0    0    0      0     0    0      0       0          0             0   
 4      0    0    0      0     0    0      0       0          0             0   
 ...   ..  ...  ...    ...   ...  ...    ...     ...        ...           ...   
 2711   0    0    0      0     0    0      0       0          0             0   
 2712   0    0    0      0     0    0      0       0          0             0   
 2713   0    0    0      0     0    0      0       0          0             0   
 2714   0    0    0      0     0    0      0       0          0             0   
 2715   0    0    0      0     0    0      0       0          0             0   
 
       ...  yrs  yummy  yu

Save the target variable:

In [84]:
tweets_df["tweet_label"].to_csv(
    to_csv_params["file_path"] + to_csv_params["target_file"],
)

Save the tweet texts:

In [85]:
tweets_df["tweet"].to_csv(
    to_csv_params["file_path"] + to_csv_params["texts_file"],
)
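
Both `to_csv()` calls build the target path by string concatenation, which works only because `file_path` ends with a slash. A minimal sketch of a more forgiving alternative uses `pathlib` from the standard library; the helper `build_path` is hypothetical.

```python
from pathlib import Path


def build_path(directory: str, file_name: str) -> Path:
    """Join a directory and a file name, with or without a trailing slash."""
    return Path(directory) / file_name


with_slash = build_path("../data/datasets/raw/", "target.csv")
without_slash = build_path("../data/datasets/raw", "target.csv")
```

Both calls yield the same path, and `to_csv()` accepts path objects as well as strings.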

Save the `features` list:

In [86]:
to_pickle(
    features,
    to_pickle_params["file_path"] + to_pickle_params["file"],
)
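
A later notebook can restore the pickled list with `pandas.read_pickle`. To show the round trip without extra dependencies, the sketch below uses the standard-library `pickle` module, with plain nested lists standing in for the dataframes. The toy data and the temporary location are assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-ins for two feature tables; the real list holds 24 dataframes.
toy_features = [
    [[0, 1], [1, 0]],
    [[0.0, 0.5], [0.3, 0.0]],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / "features.pkl"

    # Serialize the whole list into a single file.
    with open(pickle_path, "wb") as file:
        pickle.dump(toy_features, file)

    # Read it back; element order and contents are preserved.
    with open(pickle_path, "rb") as file:
        restored_features = pickle.load(file)
```

Pickle files can execute arbitrary code when loaded, so they should be read only from trusted sources.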