# Tweets

In [1]:
!pip install -r requirements.txt


[notice] A new release of pip is available: 23.2.1 -> 24.2
[notice] To update, run: pip install --upgrade pip


In [2]:
%load_ext autoreload
%autoreload 2

## Imports

In [3]:
import re

import pandas as pd

import nltk
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem import SnowballStemmer

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

import text_cleaninig
import text_processing
import machine_learning
import w2v_ml

import warnings
warnings.filterwarnings('ignore')

import cosine_similarity

In [4]:
'''
In case of problems with SSL in nltk.download
https://github.com/gunthercox/ChatterBot/issues/930#issuecomment-322111087
'''
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

# nltk.download()

In [5]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /Users/dnb/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/dnb/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/dnb/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/dnb/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## Obtaining data

There are three datasets of positive, negative, and neutral tweets, stored in CSV files.
Let's combine them into a single dataframe. Negative tweets get a type of -1, positive tweets a type of 1, and neutral tweets a type of 0. Duplicate tweets are removed within each dataset.


In [6]:
with open('data/processedNegative.csv') as file:
    data = file.read()
split_pattern = r',(?=[^ ])'
data = re.sub(split_pattern, '||', data)
tweets = data.split('||')
negative = pd.DataFrame(tweets, columns=['tweets'])
negative['type'] = -1
negative.drop_duplicates(inplace=True)
negative.head()

Unnamed: 0,tweets,type
0,How unhappy some dogs like it though,-1
1,talking to my over driver about where I'm goin...,-1
2,Does anybody know if the Rand's likely to fall...,-1
3,I miss going to gigs in Liverpool unhappy,-1
4,There isnt a new Riverdale tonight ? unhappy,-1
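The split pattern `,(?=[^ ])` used in these loading cells splits on a comma only when the next character is not a space, so commas inside a tweet (which are normally followed by a space) are kept. A minimal sketch on a made-up string; `split_tweets` and the sample text are hypothetical and not part of the notebook's modules:

```python
import re

# Same pattern as in the loading cells: a comma NOT followed by a space.
SPLIT_PATTERN = r',(?=[^ ])'

def split_tweets(raw):
    """Split a raw comma-joined string into individual tweets."""
    return re.split(SPLIT_PATTERN, raw)

# Hypothetical sample: three tweets, two of which contain an inner comma.
sample = "I miss gigs, really unhappy,No new episode tonight ? unhappy,Well, ok"
tweets = split_tweets(sample)
for tweet in tweets:
    print(repr(tweet))
```

A tweet that contains a comma directly followed by a non-space character would still be split in two, so this is a heuristic rather than a full CSV parser.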


In [7]:
with open('data/processedNeutral.csv') as file:
    data = file.read()
split_pattern = r',(?=[^ ])'
data = re.sub(split_pattern, '||', data)
tweets = data.split('||')
neutral = pd.DataFrame(tweets, columns=['tweets'])
neutral['type'] = 0
neutral.drop_duplicates(inplace=True)
neutral.head()

Unnamed: 0,tweets,type
0,"Pak PM survives removal scare, but court order...",0
1,Supreme Court quashes criminal complaint again...,0
2,Art of Living's fights back over Yamuna floodp...,0
3,FCRA slap on NGO for lobbying...But was it doi...,0
4,"Why doctors, pharma companies are opposing nam...",0


In [8]:
with open('data/processedPositive.csv') as file:
    data = file.read()
split_pattern = r',(?=[^ ])'
data = re.sub(split_pattern, '||', data)
tweets = data.split('||')
positive = pd.DataFrame(tweets, columns=['tweets'])
positive['type'] = 1
positive.drop_duplicates(inplace=True)
positive.head()

Unnamed: 0,tweets,type
0,"An inspiration in all aspects: Fashion, fitnes...",1
1,Apka Apna Awam Ka Channel Frankline Tv Aam Adm...,1
2,Beautiful album from the greatest unsung guit...,1
3,Good luck to Rich riding for great project in ...,1
4,Omg he... kissed... him crying with joy,1


In [9]:
frames = [positive, negative, neutral]
df = pd.concat(frames)
df.head()

Unnamed: 0,tweets,type
0,"An inspiration in all aspects: Fashion, fitnes...",1
1,Apka Apna Awam Ka Channel Frankline Tv Aam Adm...,1
2,Beautiful album from the greatest unsung guit...,1
3,Good luck to Rich riding for great project in ...,1
4,Omg he... kissed... him crying with joy,1


### Let's create a dataframe to hold the results for each combination of preprocessing method and vectorization method. The Snowball stemmer is used for _stemming_ and _stemming(snow) + misspellings_; the Lancaster stemmer is used for _stemming(lanc) + misspellings_ (the task's "Other ideas of preprocessing"). The initial values of the dataframe cells are NaN.

In [10]:
preprocessing = ['just tokenization', 'stemming', 'lemmatization', 'stemming(snow) + misspellings',
                 'lemmatization + misspellings', 'stemming(lanc) + misspellings']
vectorizers = ['0 or 1, if the word exists', 'word counts', 'TFIDF']
df_df = pd.DataFrame(columns=vectorizers, index=preprocessing)
df_df

Unnamed: 0,"0 or 1, if the word exists",word counts,TFIDF
just tokenization,,,
stemming,,,
lemmatization,,,
stemming(snow) + misspellings,,,
lemmatization + misspellings,,,
stemming(lanc) + misspellings,,,


## Data Preparation

### Functions

##### Text cleaning function - basic + optional

- expand contractions to full form
- replace emoticons with text
- remove ticks and next symbol
- remove URLs (http*)
- remove hashtags (#)
- remove mentions (@)
- remove numbers
- convert to lowercase
- remove punctuation
- remove extra spaces
- remove stop words (optional)
- correct misspellings (optional)
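To make a few of these steps concrete, here is a minimal sketch using only the standard library. It is not the notebook's `text_cleaninig.clean`; the function name `clean_sketch` is hypothetical, and it assumes that mentions are dropped entirely while only the `#` symbol of a hashtag is removed. Contractions, emoticons, stop words, and misspellings are not handled here.

```python
import re
import string

# Map every punctuation character to a space (this also drops the '#' symbol).
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_sketch(text):
    """Apply a subset of the basic cleaning steps to one tweet."""
    text = re.sub(r"http\S+", " ", text)    # remove URLs
    text = re.sub(r"@\w+", " ", text)       # remove mentions (assumption: whole token)
    text = re.sub(r"\d+", " ", text)        # remove numbers
    text = text.lower()                     # convert to lowercase
    text = text.translate(_PUNCT_TO_SPACE)  # remove punctuation
    return " ".join(text.split())           # remove extra spaces

cleaned = clean_sketch("Check THIS out http://t.co/abc @user #Cool 123 times!!")
print(cleaned)
```

The order matters: URLs and mentions are removed before punctuation, because stripping `:`, `/`, and `@` first would leave fragments that no longer match the patterns.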


### Just Tokenization

Tokenization is the process of breaking text into smaller units called tokens. In natural language processing (NLP) tasks, tokenization plays a crucial role for several reasons:

1. **Simplification of Analysis**: Text, consisting of a continuous stream of characters, is difficult to analyze. Tokenization allows breaking down the text into words, phrases, or sentences, making further processing easier.

2. **Standardization**: Tokenization gives every text the same representation, a sequence of tokens, which is especially important for machine learning models. Reducing different forms of a word (like "run," "running," "ran") to a single form is a separate step, performed on the tokens by stemming or lemmatization.

3. **Noise Removal**: During tokenization, unnecessary characters (like punctuation) can be removed, allowing a focus on the meaningful parts of the text.

4. **Dictionary Creation**: Many NLP algorithms require the creation of a token dictionary to represent the text in vector form. This is particularly important for tasks like text classification or sentiment analysis.

5. **Data Preparation**: Tokenization is the first stage in preparing data for model training. Proper tokenization can significantly impact the model's performance.
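The points above can be sketched in a few lines: tokenize each text, build a token dictionary, and turn each text into a word-count vector and a 0/1 word-presence vector. This is an illustrative sketch with a hypothetical `tokenize` helper and made-up documents, not the implementation behind `text_processing.word_exists` or `text_processing.word_count`.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic runs as tokens."""
    return re.findall(r"[a-z]+", text.lower())

docs = ["Good luck to Rich!", "good good day"]
tokens = [tokenize(doc) for doc in docs]

# Token dictionary: every distinct token in the corpus, in sorted order.
vocab = sorted({token for doc in tokens for token in doc})

# One row per document, one column per vocabulary entry.
counts = []
for doc in tokens:
    freq = Counter(doc)
    counts.append([freq[word] for word in vocab])
binary = [[int(c > 0) for c in row] for row in counts]

print(vocab)
print(counts)
print(binary)
```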


In [11]:
df_token = df.copy(deep=True)

#### Applying the Text Cleaning Function (Basic Set)

In [12]:
df_token['tweets'] = df_token.apply(lambda item: text_cleaninig.clean(item.tweets), axis=1)

In [13]:
df_token.head()

Unnamed: 0,tweets,type
0,an inspiration in all aspects fashion fitness ...,1
1,apka apna awam ka channel frankline tv aam adm...,1
2,beautiful album from the greatest unsung guita...,1
3,good luck to rich riding for great project in ...,1
4,omg he kissed him crying with joy,1


> ### "0 or 1, if the word exists" for tweets and words (document-term matrix)

In [14]:
df_token_exist = text_processing.word_exists(df_token, 'tweets')
df_df['0 or 1, if the word exists'][0] = df_token_exist


In [15]:
df_token_exist.head()

tweets,aa,aah,aam,aamby,aando,aap,aaree,abbeydale,abbreviation,abc,...,yr,yummy,yura,yuri,zabardast,zac,zcc,zero,zoo,zoos
an inspiration in all aspects fashion fitness beauty and personality happy face or smiley kisses thefashionicon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
apka apna awam ka channel frankline tv aam admi production please visit or likes share happy face or smiley fb page,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
beautiful album from the greatest unsung guitar genius of our time and i have met the great backstage,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
good luck to rich riding for great project in this sunday can you donate,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
omg he kissed him crying with joy,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


> ### Word count

In [16]:
df_token_count = text_processing.word_count(df_token, 'tweets')
df_token_count.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2715 entries, an inspiration in all aspects fashion fitness beauty and personality happy face or smiley kisses thefashionicon to amulya patnaik has been appointed new delhi police commissioner patnaik is a agmut cadre ips officer
Columns: 6157 entries, aa to zoos
dtypes: int64(6157)
memory usage: 127.6+ MB


In [17]:
df_token_count.head()

tweets,aa,aah,aam,aamby,aando,aap,aaree,abbeydale,abbreviation,abc,...,yr,yummy,yura,yuri,zabardast,zac,zcc,zero,zoo,zoos
an inspiration in all aspects fashion fitness beauty and personality happy face or smiley kisses thefashionicon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
apka apna awam ka channel frankline tv aam admi production please visit or likes share happy face or smiley fb page,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
beautiful album from the greatest unsung guitar genius of our time and i have met the great backstage,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
good luck to rich riding for great project in this sunday can you donate,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
omg he kissed him crying with joy,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
df_df['word counts'][0] = df_token_count

> ### tfidf

**TF-IDF** (Term Frequency-Inverse Document Frequency) is a statistical method used to evaluate the importance of a word in a document relative to an entire corpus of texts. It is widely applied in information retrieval, text analytics, and machine learning. The method consists of two main components: **TF** and **IDF**.

## 1. Term Frequency (TF)
**Term Frequency** (TF) measures how often a word appears in a document. The formula for calculating TF is as follows:
$$TF(t, d) = \frac{N_{t,d}}{N_{d}}$$
where:
- $N_{t,d}$ is the number of times the term $t$ appears in document $d$,
- $N_{d}$ is the total number of words in document $d$.

The higher the term frequency, the more significant it is considered in the context of that document.

## 2. Inverse Document Frequency (IDF)
**Inverse Document Frequency** (IDF) helps to reduce the weight of common words that may not carry significant information (e.g., prepositions or general terms). The formula for calculating IDF is as follows:
$$IDF(t, D) = \log\left(N_{D} / N_{d,t} + 1\right)$$
where:
- $N_{D}$ is the total number of documents in the corpus,
- $N_{d,t}$ is the number of documents containing the term $t$.

Thus, if a term occurs in many documents, its IDF will be low, indicating its lesser significance.

## 3. Combining TF and IDF
The final TF-IDF value for term t in document d is calculated as the product of TF and IDF:
$$\text{TF-IDF}(t, d, D) = TF(t, d) \times IDF(t, D)$$

This value indicates how important the term is in the document compared to other documents in the corpus. A high TF-IDF value suggests that the term appears frequently in this document but rarely in others.
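The three formulas can be checked by hand on a tiny corpus. The sketch below is only an illustration: the corpus is made up, the function names are hypothetical, and the IDF is read as log(N_D / N_{d,t} + 1) with the +1 added after the division. Library implementations may use different smoothing and normalization, so these numbers are not expected to match the output of `text_processing.tfidf`.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Share of the document's tokens that are `term` (document must be non-empty)."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    """log(N_D / N_{d,t} + 1), reading the +1 as added after the division."""
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        raise ValueError(f"{term!r} does not occur in the corpus")
    return math.log(len(corpus) / docs_with_term + 1)

def tf_idf(term, doc_tokens, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical corpus of three already-tokenized tweets.
corpus = [
    ["happy", "thanks", "happy"],
    ["thanks", "for", "follow"],
    ["unhappy", "day"],
]
for term in sorted(set(corpus[0])):
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

In the first document "happy" scores higher than "thanks": it is more frequent in that document and appears in fewer documents overall.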

## Applications of TF-IDF
- **Information Retrieval**: Used for ranking documents by relevance to a search query.
- **Text Classification**: Helps in creating vector representations of documents for machine learning algorithms.
- **Keyword Extraction**: Allows for identifying the most significant words from a text.

In [19]:
df_token_tfidf = text_processing.tfidf(df_token, 'tweets')
df_token_tfidf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2715 entries, an inspiration in all aspects fashion fitness beauty and personality happy face or smiley kisses thefashionicon to amulya patnaik has been appointed new delhi police commissioner patnaik is a agmut cadre ips officer
Columns: 6157 entries, aa to zoos
dtypes: float64(6157)
memory usage: 127.6+ MB


In [20]:
df_token_tfidf.head()

tweets,aa,aah,aam,aamby,aando,aap,aaree,abbeydale,abbreviation,abc,...,yr,yummy,yura,yuri,zabardast,zac,zcc,zero,zoo,zoos
an inspiration in all aspects fashion fitness beauty and personality happy face or smiley kisses thefashionicon,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
apka apna awam ka channel frankline tv aam admi production please visit or likes share happy face or smiley fb page,0.0,0.0,0.25964,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
beautiful album from the greatest unsung guitar genius of our time and i have met the great backstage,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
good luck to rich riding for great project in this sunday can you donate,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
omg he kissed him crying with joy,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
df_df['TFIDF'][0] = df_token_tfidf

### Stemming

A stemming algorithm is a computational procedure which reduces all words with the same root… to a common form, usually by stripping each word of its derivational and inflectional suffixes. Stemming is aimed at mapping related words together for retrieval purposes, so the stem need not be a linguistically correct lemma or root.

**Errors in Stemming**:
- _Over-Stemming_: It occurs when two or more unrelated words result in the same stem.
- _Under-Stemming_: It occurs when two or more related words result in different stems.
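Both kinds of error can be reproduced with a toy suffix stripper. The sketch below is deliberately simplistic and is not the Porter, Snowball, or Lancaster algorithm; `toy_stem` and its suffix list are hypothetical and chosen only to make the two errors visible.

```python
def toy_stem(word, suffixes=("ity", "al", "e", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Over-stemming: words with different meanings collapse to one stem.
over = {word: toy_stem(word) for word in ("universal", "university", "universe")}

# Under-stemming: two forms of the same word keep different stems.
under = {word: toy_stem(word) for word in ("alumnus", "alumni")}

print(over)
print(under)
```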

#### Porter

**Porter Stemmer**: One of the most commonly used stemmers, developed by M.F. Porter in 1980. Porter's stemmer consists of five phases that are applied sequentially. Within each phase, there are conventions for selecting which rule to apply. The entire Porter algorithm is small and thus fast and simple. Its drawbacks are that it supports only the English language and that the stem obtained may or may not be linguistically correct.

In [22]:
porter = PorterStemmer()
df_stemmed_porter = df.copy(deep=True)
df_stemmed_porter['tweets'] = df_stemmed_porter.apply(lambda item: text_processing.stem_text(item.tweets, porter), axis=1)
df_stemmed_porter.head()

Unnamed: 0,tweets,type
0,an inspir in all aspect fashion fit beauti and...,1
1,apka apna awam ka channel franklin tv aam admi...,1
2,beauti album from the greatest unsung guitar g...,1
3,good luck to rich ride for great project in th...,1
4,omg he kiss him cri with joy,1


#### Snowball

**Snowball Stemmer**: Snowball is a small string-processing language for writing stemming algorithms, developed by Martin Porter. Its English stemmer is a revised version of the Porter Stemmer. It works by removing affixes from the word to find its most basic form. Snowball stemmers exist for multiple languages, and the English one is often described as faster than the original Porter Stemmer.

In [23]:
snowball = SnowballStemmer('english')
df_stemmed_snow = df.copy(deep=True)
df_stemmed_snow['tweets'] = df_stemmed_snow.apply(lambda item: text_processing.stem_text(item.tweets, snowball), axis=1)
df_stemmed_snow.head()

Unnamed: 0,tweets,type
0,an inspir in all aspect fashion fit beauti and...,1
1,apka apna awam ka channel franklin tv aam admi...,1
2,beauti album from the greatest unsung guitar g...,1
3,good luck to rich ride for great project in th...,1
4,omg he kiss him cri with joy,1


#### Lancaster

**Lancaster Stemmer**: Also known as the "Paice/Husk Stemmer", it is more aggressive than both the Porter Stemmer and the Snowball Stemmer and removes as many affixes as it can. Its stems are therefore shorter and over-stemming is more likely: in the output below, "fashion" becomes "fash" and "great" becomes "gre".

In [24]:
lancaster = LancasterStemmer()
df_stemmed_lanc = df.copy(deep=True)
df_stemmed_lanc['tweets'] = df_stemmed_lanc.apply(lambda item: text_processing.stem_text(item.tweets, lancaster), axis=1)
df_stemmed_lanc.head()

Unnamed: 0,tweets,type
0,an inspir in al aspect fash fit beauty and per...,1
1,apk apn awam ka channel franklin tv aam adm pr...,1
2,beauty alb from the greatest unsung guit geni ...,1
3,good luck to rich rid for gre project in thi s...,1
4,omg he kiss him cry with joy,1


> ### Applying '0 or 1, if the word exists' for Snowball stemmer

In [25]:
df_stem_exist = text_processing.word_exists(df_stemmed_snow, 'tweets')
df_stem_exist.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2715 entries, an inspir in all aspect fashion fit beauti and person happi face or smiley kiss thefashionicon to amulya patnaik has been appoint new delhi polic commission patnaik is a agmut cadr ip offic
Columns: 4926 entries, aa to zoo
dtypes: int64(4926)
memory usage: 102.1+ MB


In [26]:
df_df['0 or 1, if the word exists'][1] = df_stem_exist

> ### Applying 'word count' for Snowball stemmer

In [27]:
df_stem_count = text_processing.word_count(df_stemmed_snow, 'tweets')
df_stem_count.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2715 entries, an inspir in all aspect fashion fit beauti and person happi face or smiley kiss thefashionicon to amulya patnaik has been appoint new delhi polic commission patnaik is a agmut cadr ip offic
Columns: 4926 entries, aa to zoo
dtypes: int64(4926)
memory usage: 102.1+ MB


In [28]:
df_df['word counts'][1] = df_stem_count

> ### Applying 'tfidf' for Snowball stemmer

In [29]:
df_stem_tfidf = text_processing.tfidf(df_stemmed_snow, 'tweets')
df_stem_tfidf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2715 entries, an inspir in all aspect fashion fit beauti and person happi face or smiley kiss thefashionicon to amulya patnaik has been appoint new delhi polic commission patnaik is a agmut cadr ip offic
Columns: 4926 entries, aa to zoo
dtypes: float64(4926)
memory usage: 102.1+ MB


In [30]:
df_df['TFIDF'][1] = df_stem_tfidf

### Lemmatization

**Lemmatization** is the process of converting a word into its base or dictionary form, known as a **lemma**. This method is widely used in natural language processing (NLP) to enhance text analysis and information retrieval. Lemmatization helps reduce the number of unique words in texts, making them easier to analyze.

## 1. Difference from Stemming
Lemmatization is often confused with stemming, but these are two different processes:
- **Stemming**: Removes word endings by rule to obtain a stem, which is not always a valid word. For example, "running" may be reduced to "run," but "better" is never mapped to "good," because no suffix rule connects the two words.
- **Lemmatization**: Returns a valid dictionary form of the word. For example, "better" is transformed into "good" when it is treated as an adjective.

## 2. Lemmatization Process
The lemmatization process involves several steps:
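1. **Part of Speech Determination**: To correctly convert a word to its lemma, it is essential to know its part of speech (noun, verb, adjective, etc.). Without it the result can be wrong: in the outputs of this notebook "has" becomes "ha", which is consistent with a verb being lemmatized as if it were a noun.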
2. **Lexical Analysis**: Based on the part of speech and the context of the word, the algorithm determines its lemma.
3. **Use of Dictionaries**: Lemmatizers often use dictionaries and morphological rules for word transformation.

## 3. Examples of Lemmatization
- **Verbs**:
  - "running" → "run"
  - "was" → "be"
- **Nouns**:
  - "geese" → "goose"
  - "children" → "child"
- **Adjectives**:
  - "better" → "good"
  - "best" → "good"
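The examples above can be sketched as a dictionary lookup keyed by word and part of speech. This is a toy illustration with a hand-written table and a hypothetical `lemmatize` function; it is not how `text_processing.lem_text` or any real lemmatizer is implemented, but it shows why the part of speech matters.

```python
# Hand-written lemma table for the examples listed above (illustrative only).
LEMMA_TABLE = {
    ("running", "verb"): "run",
    ("was", "verb"): "be",
    ("geese", "noun"): "goose",
    ("children", "noun"): "child",
    ("better", "adj"): "good",
    ("best", "adj"): "good",
}

def lemmatize(word, pos):
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(lemmatize("better", "adj"))    # looked up as an adjective
print(lemmatize("better", "noun"))   # no noun entry, so the word is kept
print(lemmatize("Geese", "noun"))
```

With the wrong part of speech the lookup misses and the word is returned unchanged, which mirrors what happens when a real lemmatizer is run with a default tag.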

## 4. Applications of Lemmatization
Lemmatization is applied in various fields:
- **Information Retrieval**: Simplifies searches by allowing the retrieval of documents containing different forms of a word.
- **Sentiment Analysis**: Helps determine the emotional tone of the text using the base forms of words.
- **Text Classification**: Reduces data dimensionality, improving classification quality.

In [31]:
df_lemmatized = df.copy(deep=True)
df_lemmatized['tweets'] = df_lemmatized.apply(lambda item: text_processing.lem_text(item.tweets), axis=1)
df_lemmatized.head()

Unnamed: 0,tweets,type
0,an inspiration in all aspect fashion fitness b...,1
1,apka apna awam ka channel frankline tv aam adm...,1
2,beautiful album from the greatest unsung guita...,1
3,good luck to rich riding for great project in ...,1
4,omg he kissed him cry with joy,1


> ### Applying '0 or 1, if the word exists' for Lemmatization

In [32]:
df_lem_exist = text_processing.word_exists(df_lemmatized, 'tweets')
df_lem_exist.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2715 entries, an inspiration in all aspect fashion fitness beauty and personality happy face or smiley kiss thefashionicon to amulya patnaik ha been appointed new delhi police commissioner patnaik is a agmut cadre ip officer
Columns: 5631 entries, aa to zoo
dtypes: int64(5631)
memory usage: 116.7+ MB


In [33]:
df_df['0 or 1, if the word exists'][2] = df_lem_exist

> ### Applying 'word count' for Lemmatization

In [34]:
df_lem_count = text_processing.word_count(df_lemmatized, 'tweets')
df_lem_count.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2715 entries, an inspiration in all aspect fashion fitness beauty and personality happy face or smiley kiss thefashionicon to amulya patnaik ha been appointed new delhi police commissioner patnaik is a agmut cadre ip officer
Columns: 5631 entries, aa to zoo
dtypes: int64(5631)
memory usage: 116.7+ MB


In [35]:
df_df['word counts'][2] = df_lem_count

> ### Applying 'tfidf' for Lemmatization

In [36]:
df_lem_tfidf = text_processing.tfidf(df_lemmatized, 'tweets')
df_lem_tfidf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2715 entries, an inspiration in all aspect fashion fitness beauty and personality happy face or smiley kiss thefashionicon to amulya patnaik ha been appointed new delhi police commissioner patnaik is a agmut cadre ip officer
Columns: 5631 entries, aa to zoo
dtypes: float64(5631)
memory usage: 116.7+ MB


In [37]:
df_df['TFIDF'][2] = df_lem_tfidf

### Stemming(snowball) + misspellings

In [38]:
df_stem_spell_snow = df.copy(deep=True)
df_stem_spell_snow['tweets'] = df_stem_spell_snow.apply(lambda item: text_processing.stem_text(item.tweets, snowball, misspelling=True), axis=1)
df_stem_spell_snow.head()

Unnamed: 0,tweets,type
0,an inspir in all aspect fashion fit beauti and...,1
1,aka anna away ka channel franklin to am admit ...,1
2,beauti album from the greatest unsung guitar g...,1
3,good luck to rich ride for great project in th...,1
4,om he kiss him cri with joy,1


> ### Applying '0 or 1, if the word exists' for 'Stemming(snowball) + misspellings'

In [39]:
df_stem_spell_exist = text_processing.word_exists(df_stem_spell_snow, 'tweets')
df_stem_spell_exist.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2715 entries, an inspir in all aspect fashion fit beauti and person happi face or smiley kiss to amulet has been appoint new deli polic commission is a gamut cadr is offic
Columns: 3857 entries, aah to zoo
dtypes: int64(3857)
memory usage: 79.9+ MB


In [40]:
df_df['0 or 1, if the word exists'][3] = df_stem_spell_exist

> ### Applying 'word count' for 'Stemming(snowball) + misspellings'

In [41]:
df_stem_spell_count = text_processing.word_count(df_stem_spell_snow, 'tweets')
df_stem_spell_count.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2715 entries, an inspir in all aspect fashion fit beauti and person happi face or smiley kiss to amulet has been appoint new deli polic commission is a gamut cadr is offic
Columns: 3857 entries, aah to zoo
dtypes: int64(3857)
memory usage: 79.9+ MB


In [42]:
df_df['word counts'][3] = df_stem_spell_count

> ### Applying 'tfidf' for 'Stemming(snowball) + misspellings'

In [43]:
df_stem_spell_tfidf = text_processing.tfidf(df_stem_spell_snow, 'tweets')
df_stem_spell_tfidf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2715 entries, an inspir in all aspect fashion fit beauti and person happi face or smiley kiss to amulet has been appoint new deli polic commission is a gamut cadr is offic
Columns: 3857 entries, aah to zoo
dtypes: float64(3857)
memory usage: 79.9+ MB


In [44]:
df_df['TFIDF'][3] = df_stem_spell_tfidf

### Lemmatization + misspellings

In [45]:
df_lemmatized = df.copy(deep=True)
df_lemmatized['tweets'] = df_lemmatized.apply(lambda item: text_processing.lem_text(item.tweets, misspelling=True), axis=1)
df_lemmatized.head()

Unnamed: 0,tweets,type
0,an inspiration in all aspect fashion fitness b...,1
1,aka anna away ka channel franklin to am admit ...,1
2,beautiful album from the greatest unsung guita...,1
3,good luck to rich riding for great project in ...,1
4,om he kissed him cry with joy,1


> ### Applying '0 or 1, if the word exists' for 'Lemmatization + misspellings'

In [46]:
df_lem_spell_exist = text_processing.word_exists(df_lemmatized, 'tweets')
df_lem_spell_exist.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2715 entries, an inspiration in all aspect fashion fitness beauty and personality happy face or smiley kiss to amulet ha been appointed new deli police commissioner is a gamut cadre is officer
Columns: 4588 entries, aah to zoo
dtypes: int64(4588)
memory usage: 95.1+ MB


In [47]:
df_df['0 or 1, if the word exists'][4] = df_lem_spell_exist

> ### Applying 'word count' for 'Lemmatization + misspellings'

In [48]:
df_lem_spell_count = text_processing.word_count(df_lemmatized, 'tweets')
df_lem_spell_count.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2715 entries, an inspiration in all aspect fashion fitness beauty and personality happy face or smiley kiss to amulet ha been appointed new deli police commissioner is a gamut cadre is officer
Columns: 4588 entries, aah to zoo
dtypes: int64(4588)
memory usage: 95.1+ MB


In [49]:
df_df['word counts'][4] = df_lem_spell_count

> ### Applying 'tfidf' for 'Lemmatization + misspellings'

In [50]:
df_lem_spell_tfidf = text_processing.tfidf(df_lemmatized, 'tweets')
df_lem_spell_tfidf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2715 entries, an inspiration in all aspect fashion fitness beauty and personality happy face or smiley kiss to amulet ha been appointed new deli police commissioner is a gamut cadre is officer
Columns: 4588 entries, aah to zoo
dtypes: float64(4588)
memory usage: 95.1+ MB


In [51]:
df_df['TFIDF'][4] = df_lem_spell_tfidf

### Other ideas of preprocessing - Stemming(lancaster) + misspellings

In [52]:
df_stem_spell_lanc = df.copy(deep=True)
df_stem_spell_lanc['tweets'] = df_stem_spell_lanc.apply(lambda item: text_processing.stem_text(item.tweets, lancaster, misspelling=True), axis=1)
df_stem_spell_lanc.head()

Unnamed: 0,tweets,type
0,an inspir in al aspect fash fit beauty and per...,1
1,ak ann away ka channel franklin to am admit pr...,1
2,beauty alb from the greatest unsung guit geni ...,1
3,good luck to rich rid for gre project in thi s...,1
4,om he kiss him cry with joy,1


> ### Applying '0 or 1, if the word exists' for 'Stemming(lancaster) + misspellings'

In [53]:
df_other_exist = text_processing.word_exists(df_stem_spell_lanc, 'tweets')
df_other_exist.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2715 entries, an inspir in al aspect fash fit beauty and person happy fac or smiley kiss to amulet has been appoint new del pol commit is a gamut cadr is off
Columns: 3382 entries, aah to zoo
dtypes: int64(3382)
memory usage: 70.1+ MB


In [54]:
df_df['0 or 1, if the word exists'][5] = df_other_exist

> ### Applying 'word count' for 'Stemming(lancaster) + misspellings'

In [55]:
df_other_count = text_processing.word_count(df_stem_spell_lanc, 'tweets')
df_other_count.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2715 entries, an inspir in al aspect fash fit beauty and person happy fac or smiley kiss to amulet has been appoint new del pol commit is a gamut cadr is off
Columns: 3382 entries, aah to zoo
dtypes: int64(3382)
memory usage: 70.1+ MB


In [56]:
df_df['word counts'][5] = df_other_count

> ### Applying 'tfidf' for 'Stemming(lancaster) + misspellings'

In [57]:
df_other_tfidf = text_processing.tfidf(df_stem_spell_lanc, 'tweets')
df_other_tfidf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2715 entries, an inspir in al aspect fash fit beauty and person happy fac or smiley kiss to amulet has been appoint new del pol commit is a gamut cadr is off
Columns: 3382 entries, aah to zoo
dtypes: float64(3382)
memory usage: 70.1+ MB


In [58]:
df_df['TFIDF'][5] = df_other_tfidf

## Similarity

**Cosine Similarity** is a measure used to evaluate the similarity between two vectors in space, based on the cosine of the angle between them. It is often applied in natural language processing and machine learning to compare texts, documents, or other objects represented as vectors.

### Formula

Cosine similarity is defined by the following formula:
$$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
where:
- $A$ and $B$ are vectors,
- $A \cdot B$ is the dot product of the vectors,
- $\|A\|$ and $\|B\|$ are the norms (lengths) of the vectors $A$ and $B$ respectively.

### Meaning

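Cosine similarity takes values from -1 to 1:
- 1 means the vectors point in the same direction (angle 0°),
- 0 means the vectors are orthogonal (angle 90°),
- -1 means the vectors point in opposite directions (angle 180°).

For vectors with no negative components, such as the word-presence, word-count, and TF-IDF vectors used here, the value lies between 0 and 1.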
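A minimal pure-Python sketch of the definition, useful for checking the three reference values by hand. The function name `cosine_sim` is hypothetical (chosen to avoid clashing with the `cosine_similarity` module imported earlier), and returning 0.0 for a zero vector is an assumption of this sketch, since the formula is undefined in that case.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

same_direction = cosine_sim([1, 2, 0], [2, 4, 0])
orthogonal = cosine_sim([1, 0], [0, 1])
opposite = cosine_sim([1, 0], [-1, 0])
print(same_direction, orthogonal, opposite)
```

Note that the first pair scores as maximally similar even though the vectors differ in length: only the direction matters.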

### Applications

1. **Text Processing**: Used to determine the similarity between documents or queries.
2. **Recommendation Systems**: Helps find similar products or movies.
3. **Clustering and Classification**: Used in machine learning algorithms for grouping similar objects or assigning an object to the class of its nearest neighbours.
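For the first application, finding the most similar documents amounts to scoring every pair of row vectors and keeping the best ones. The sketch below is a generic illustration on a made-up document-term matrix; `top_similar_pairs` is a hypothetical helper and not the function in the notebook's `cosine_similarity` module.

```python
import math
from itertools import combinations

def _cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def top_similar_pairs(vectors, k=3):
    """Return the k highest-scoring (score, i, j) tuples over all row pairs."""
    scored = [
        (_cosine(vectors[i], vectors[j]), i, j)
        for i, j in combinations(range(len(vectors)), 2)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

# Hypothetical document-term matrix: 4 documents, 3 vocabulary words.
rows = [[1, 1, 0], [1, 1, 0], [0, 1, 1], [1, 0, 0]]
best = top_similar_pairs(rows, k=2)
for score, i, j in best:
    print(f"{i} - {j}: {score:.3f}")
```

Scoring all pairs grows quadratically with the number of documents, which is acceptable for a few thousand tweets but would need a vectorized or approximate method for much larger corpora.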

In [59]:
df_res = cosine_similarity.cosine_similarity(df, df_df)
df_res

Unnamed: 0,"0 or 1, if the word exists + just tokenization","0 or 1, if the word exists + stemming","0 or 1, if the word exists + lemmatization","0 or 1, if the word exists + stemming(snow) + misspellings","0 or 1, if the word exists + lemmatization + misspellings","0 or 1, if the word exists + stemming(lanc) + misspellings",word counts + just tokenization,word counts + stemming,word counts + lemmatization,word counts + stemming(snow) + misspellings,word counts + lemmatization + misspellings,word counts + stemming(lanc) + misspellings,TFIDF + just tokenization,TFIDF + stemming,TFIDF + lemmatization,TFIDF + stemming(snow) + misspellings,TFIDF + lemmatization + misspellings,TFIDF + stemming(lanc) + misspellings
1,6. thanks happy - 316. thanks b happy\n6. than...,6. thank happi - 458. thank happi\n6. thanks h...,60. thanks for the recent follow happy to conn...,16. share the love thank for be top new follow...,6. thanks happy - 316. thanks b happy\n6. than...,16. shar the lov thank for being top new follo...,6. thanks happy - 316. thanks b happy\n6. than...,6. thank happi - 458. thank happi\n6. thanks h...,60. thanks for the recent follow happy to conn...,16. share the love thank for be top new follow...,6. thanks happy - 316. thanks b happy\n6. than...,16. shar the lov thank for being top new follo...,60. thanks for the recent follow happy to conn...,6. thank happi - 260. thank happi\n6. thanks h...,60. thanks for the recent follow happy to conn...,6. thank happi - 406. thank happi\n6. thanks h...,60. thanks for the recent follow happy to conn...,6. thank happy - 406. thank happy\n6. thanks h...
2,6. thanks happy - 458. thanks happy\n6. thanks...,60. thank for the recent follow happi to conne...,6. thanks happy - 406. thanks happy\n6. thanks...,260. thank happi - 406. thank happi\n260. Than...,46. thanks for the recent follow happy to conn...,260. thank happy - 406. thank happy\n260. Than...,6. thanks happy - 458. thanks happy\n6. thanks...,60. thank for the recent follow happi to conne...,6. thanks happy - 406. thanks happy\n6. thanks...,260. thank happi - 406. thank happi\n260. Than...,46. thanks for the recent follow happy to conn...,260. thank happy - 406. thank happy\n260. Than...,60. thanks for the recent follow happy to conn...,60. thank for the recent follow happi to conne...,221. thanks for the recent follow happy to con...,6. thank happi - 260. thank happi\n6. thanks h...,221. thanks for the recent follow happy to con...,60. thank for the rec follow happy to connect ...
3,6. thanks happy - 260. thanks happy\n6. thanks...,26. hey thank for be top new follow this week ...,16. share the love thanks for being top new fo...,46. thank for the recent follow happi to conne...,260. thanks happy - 458. thanks happy\n260. Th...,260. thank happy - 458. thank happy\n260. Than...,6. thanks happy - 260. thanks happy\n6. thanks...,26. hey thank for be top new follow this week ...,16. share the love thanks for being top new fo...,46. thank for the recent follow happi to conne...,260. thanks happy - 458. thanks happy\n260. Th...,260. thank happy - 458. thank happy\n260. Than...,221. thanks for the recent follow happy to con...,6. thank happi - 316. thank b happi\n6. thanks...,60. thanks for the recent follow happy to conn...,260. thank happi - 458. thank happi\n260. Than...,60. thanks for the recent follow happy to conn...,6. thank happy - 260. thank happy\n6. thanks h...
4,16. share the love thanks for being top new fo...,60. thank for the recent follow happi to conne...,60. thanks for the recent follow happy to conn...,6. thank happi - 458. thank happi\n6. thanks h...,6. thanks happy - 260. thanks happy\n6. thanks...,6. thank happy - 458. thank happy\n6. thanks h...,16. share the love thanks for being top new fo...,60. thank for the recent follow happi to conne...,60. thanks for the recent follow happy to conn...,6. thank happi - 458. thank happi\n6. thanks h...,6. thanks happy - 260. thanks happy\n6. thanks...,6. thank happy - 458. thank happy\n6. thanks h...,60. thanks for the recent follow happy to conn...,60. thank for the recent follow happi to conne...,60. thanks for the recent follow happy to conn...,26. hey thank for be top new follow this week ...,221. thanks for the recent follow happy to con...,46. thank for the rec follow happy to connect ...
5,60. thanks for the recent follow happy to conn...,6. thank happi - 260. thank happi\n6. thanks h...,60. thanks for the recent follow happy to conn...,6. thank happi - 260. thank happi\n6. thanks h...,16. share the love thanks for being top new fo...,6. thank happy - 260. thank happy\n6. thanks h...,60. thanks for the recent follow happy to conn...,6. thank happi - 260. thank happi\n6. thanks h...,60. thanks for the recent follow happy to conn...,6. thank happi - 260. thank happi\n6. thanks h...,16. share the love thanks for being top new fo...,6. thank happy - 260. thank happy\n6. thanks h...,221. thanks for the recent follow happy to con...,60. thank for the recent follow happi to conne...,16. share the love thanks for being top new fo...,6. thank happi - 316. thank b happi\n6. thanks...,60. thanks for the recent follow happy to conn...,60. thank for the rec follow happy to connect ...
6,6. thanks happy - 406. thanks happy\n6. thanks...,60. thank for the recent follow happi to conne...,60. thanks for the recent follow happy to conn...,260. thank happi - 316. thank b happi\n260. Th...,6. thanks happy - 406. thanks happy\n6. thanks...,6. thank happy - 406. thank happy\n6. thanks h...,6. thanks happy - 406. thanks happy\n6. thanks...,60. thank for the recent follow happi to conne...,60. thanks for the recent follow happy to conn...,260. thank happi - 316. thank b happi\n260. Th...,6. thanks happy - 406. thanks happy\n6. thanks...,6. thank happy - 406. thank happy\n6. thanks h...,60. thanks for the recent follow happy to conn...,60. thank for the recent follow happi to conne...,221. thanks for the recent follow happy to con...,6. thank happi - 458. thank happi\n6. thanks h...,60. thanks for the recent follow happy to conn...,6. thank happy - 316. thank b happy\n6. thanks...
7,60. thanks for the recent follow happy to conn...,60. thank for the recent follow happi to conne...,6. thanks happy - 458. thanks happy\n6. thanks...,6. thank happi - 406. thank happi\n6. thanks h...,26. hey thanks for being top new follower this...,46. thank for the rec follow happy to connect ...,60. thanks for the recent follow happy to conn...,60. thank for the recent follow happi to conne...,6. thanks happy - 458. thanks happy\n6. thanks...,6. thank happi - 406. thank happi\n6. thanks h...,26. hey thanks for being top new follower this...,46. thank for the rec follow happy to connect ...,221. thanks for the recent follow happy to con...,60. thank for the recent follow happi to conne...,60. thanks for the recent follow happy to conn...,93. happi - 150. happi\n93. Goodevening happy ...,221. thanks for the recent follow happy to con...,60. thank for the rec follow happy to connect ...
8,60. thanks for the recent follow happy to conn...,16. share the love thank for be top new follow...,26. hey thanks for being top new follower this...,260. thank happi - 458. thank happi\n260. Than...,260. thanks happy - 316. thanks b happy\n260. ...,260. thank happy - 316. thank b happy\n260. Th...,60. thanks for the recent follow happy to conn...,16. share the love thank for be top new follow...,26. hey thanks for being top new follower this...,260. thank happi - 458. thank happi\n260. Than...,260. thanks happy - 316. thanks b happy\n260. ...,260. thank happy - 316. thank b happy\n260. Th...,16. share the love thanks for being top new fo...,6. thank happi - 458. thank happi\n6. thanks h...,60. thanks for the recent follow happy to conn...,260. thank happi - 406. thank happi\n260. Than...,93. happy - 150. happy\n93. Goodevening happy ...,26. hey thank for being top new follow thi wee...
9,60. thanks for the recent follow happy to conn...,6. thank happi - 316. thank b happi\n6. thanks...,6. thanks happy - 316. thanks b happy\n6. than...,26. hey thank for be top new follow this week ...,260. thanks happy - 406. thanks happy\n260. Th...,26. hey thank for being top new follow thi wee...,60. thanks for the recent follow happy to conn...,6. thank happi - 316. thank b happi\n6. thanks...,6. thanks happy - 316. thanks b happy\n6. than...,26. hey thank for be top new follow this week ...,260. thanks happy - 406. thanks happy\n260. Th...,26. hey thank for being top new follow thi wee...,60. thanks for the recent follow happy to conn...,6. thank happi - 406. thank happi\n6. thanks h...,221. thanks for the recent follow happy to con...,316. thank b happi - 406. thank happi\n316. th...,60. thanks for the recent follow happy to conn...,16. shar the lov thank for being top new follo...
10,26. hey thanks for being top new followers thi...,6. thank happi - 406. thank happi\n6. thanks h...,6. thanks happy - 260. thanks happy\n6. thanks...,6. thank happi - 316. thank b happi\n6. thanks...,6. thanks happy - 458. thanks happy\n6. thanks...,6. thank happy - 316. thank b happy\n6. thanks...,26. hey thanks for being top new followers thi...,6. thank happi - 406. thank happi\n6. thanks h...,6. thanks happy - 260. thanks happy\n6. thanks...,6. thank happi - 316. thank b happi\n6. thanks...,6. thanks happy - 458. thanks happy\n6. thanks...,6. thank happy - 316. thank b happy\n6. thanks...,221. thanks for the recent follow happy to con...,16. share the love thank for be top new follow...,26. hey thanks for being top new follower this...,260. thank happi - 316. thank b happi\n260. Th...,60. thanks for the recent follow happy to conn...,6. thank happy - 458. thank happy\n6. thanks h...


In [60]:
df_res.to_csv('res/cos_sim.csv')

### The top 10 most similar pairs of tweets for different vectorizers and preprocessors

> ### vectorizer + preprocessor: each pair is shown after preprocessing, then in its original form

In [61]:
cosine_similarity.print_cossim(df_res)

0 or 1, if the word exists + just tokenization
	1.
6. thanks happy - 316. thanks b happy
6. thanks happy - 316. thanks b happy
	2.
6. thanks happy - 458. thanks happy
6. thanks happy - 458. thanks! happy
	3.
6. thanks happy - 260. thanks happy
6. thanks happy - 260. Thanks happy
	4.
16. share the love thanks for being top new followers this week happy want this - 621. share the love thanks for being top new followers this week happy want this
16. Share the love: thanks for being top new followers this week happy  Want this? - 621. Share the love:thanks for being top new followers this week happy   Want this
	5.
60. thanks for the recent follow happy to connect happy have a great thursday want this - 221. thanks for the recent follow happy to connect happy have a great thursday want this
60. Thanks for the recent follow Happy to connect happy  have a great Thursday. (Want this? - 221. Thanks for the recent follow Happy to connect happy  have a great Thursday. Want this
	6.
6. thanks hap

## Machine learning

### Gaussian Naive Bayes - base classification

**Gaussian Naive Bayes** is a classification algorithm based on Bayes' theorem, which assumes that features (or attributes) are conditionally independent of each other given the class. This method is particularly effective for classification tasks with continuous features, as it uses a normal (Gaussian) distribution to model the feature values.

#### Key Principles
1. **Bayes' Theorem**:
   $$P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)}$$
   where:
   - P(C|X) is the posterior probability of class C given features X,
   - P(X|C) is the likelihood of features X given class C,
   - P(C) is the prior probability of class C,
   - P(X) is the overall probability of features X.

2. **Independence Assumption**:
   The algorithm assumes that all features are conditionally independent given the class, which simplifies the calculations. Thus, for a multidimensional feature space, we can write:
   $$P(X|C) = P(X_1|C) \cdot P(X_2|C) \cdot \ldots \cdot P(X_n|C)$$

3. **Gaussian Distribution**:
   For each feature, the algorithm assumes that its values within a class follow a normal (Gaussian) distribution. For each class C and each feature X_i, the mean $\mu$ and standard deviation $\sigma$ are computed from the training data, giving:
   $$P(X_i|C) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(X_i - \mu)^2}{2\sigma^2}\right)$$

#### Training Process
1. **Data Collection**: Training data with known class labels is collected.
2. **Parameter Calculation**:
   - For each class C, the mean and standard deviation for each feature are calculated.
3. **Prior Probabilities**: Prior probabilities for each class are computed based on the frequency of classes in the training set.

#### Classification Process
1. For a new instance X, the probability of belonging to each class C is calculated:
   $$P(C|X) \propto P(C) \cdot P(X|C)$$
2. The class with the highest probability is selected as the predicted class.
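
The training and classification steps above can be sketched in plain Python. This is an illustrative simplification, not scikit-learn's `GaussianNB`: the function names, the toy data, and the constant `var_smoothing` term (added only to avoid division by zero) are assumptions. Log-probabilities are summed instead of multiplying probabilities, which avoids numerical underflow with many features.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate the prior, per-feature means and per-feature variances of each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [
            sum((value - mean) ** 2 for value in col) / n + var_smoothing
            for col, mean in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def log_gaussian(x, mean, var):
    """Logarithm of the Gaussian density with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict(model, x):
    """Return the class maximizing log P(C) + sum_i log P(x_i | C)."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        scores[label] = math.log(prior) + sum(
            log_gaussian(xi, m, v) for xi, m, v in zip(x, means, variances)
        )
    return max(scores, key=scores.get)


# Hypothetical two-feature training set with labels 1 (positive) and -1 (negative).
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.5], [3.2, 0.4], [2.9, 0.7]]
y = [1, 1, 1, -1, -1, -1]
model = fit_gaussian_nb(X, y)
prediction_near_pos = predict(model, [1.1, 2.0])
prediction_near_neg = predict(model, [3.1, 0.5])
print(prediction_near_pos, prediction_near_neg)  # 1 -1
```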

#### Advantages and Disadvantages
**Advantages**:
- Simplicity of implementation and interpretation.
- Fast operation, especially on large datasets.
- Works well with high-dimensional data.

**Disadvantages**:
- The independence assumption may not always hold, which can reduce accuracy.
- Sensitive to outliers and noise in the data.

In [62]:
clf = GaussianNB()

In [63]:
df_res = machine_learning.model_preprocessing(clf, df, df_df)

In [64]:
df_res

Unnamed: 0,"0 or 1, if the word exists",word counts,TFIDF
just tokenization,0.734807,0.734807,0.707182
stemming,0.73849,0.73849,0.720074
lemmatization,0.73849,0.73849,0.709024
stemming(snow) + misspellings,0.734807,0.734807,0.710866
lemmatization + misspellings,0.729282,0.729282,0.697974
stemming(lanc) + misspellings,0.767956,0.767956,0.718232


### Logistic Regression

**Logistic Regression** is a statistical method for classification, most often described for the binary case. It models the probability that an object belongs to a particular class based on one or more predictors (features). The binary formulation below applies when the outcome takes two values (e.g., "yes" or "no," "success" or "failure"); scikit-learn's `LogisticRegression` extends it to multiclass targets such as the three tweet types used here.

#### Key Principles
1. **Model**:
   Logistic regression uses the logistic function (or sigmoid function) to transform a linear combination of input features into a probability. The formula for the logistic function is as follows:
   $$P(Y=1|X) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_n X_n)}}$$
   where:
   - P(Y=1|X) is the probability that the target variable Y equals 1 (belongs to the positive class),
   - b_0 is the intercept,
   - b_1, b_2, ..., b_n are the coefficients (model parameters),
   - X_1, X_2, ..., X_n are the predictors.

2. **Model Training**:
   To train the logistic regression model, the maximum likelihood estimation method is used. The goal is to find values for the coefficients b that maximize the probability of the observed data.

3. **Prediction**:
   After training the model, to predict the class of a new object, the probability is calculated, and if it exceeds a given threshold (usually 0.5), the object is classified as the positive class.
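
To make the three principles concrete, here is a small sketch of the sigmoid, the thresholded prediction, and the log-likelihood that maximum likelihood estimation maximizes. The coefficients `b0` and `b` and the two data points are made-up values for illustration, not parameters fitted on the tweets.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, intercept, coefficients):
    """P(Y=1|X) for one sample x."""
    z = intercept + sum(b_i * x_i for b_i, x_i in zip(coefficients, x))
    return sigmoid(z)


def predict(x, intercept, coefficients, threshold=0.5):
    """Positive class (1) if the probability exceeds the threshold, else 0."""
    return 1 if predict_proba(x, intercept, coefficients) > threshold else 0


def log_likelihood(X, y, intercept, coefficients):
    """Log-probability of the observed labels; training searches for the b that maximize it."""
    total = 0.0
    for x, label in zip(X, y):
        p = predict_proba(x, intercept, coefficients)
        total += label * math.log(p) + (1 - label) * math.log(1 - p)
    return total


# Hypothetical coefficients and samples.
b0, b = -1.0, [2.0, -0.5]
X = [[1.0, 1.0], [0.0, 2.0]]
y = [1, 0]

print(predict(X[0], b0, b))  # z = 0.5, probability above 0.5 -> 1
print(predict(X[1], b0, b))  # z = -2.0, probability below 0.5 -> 0
```

With these values, `log_likelihood(X, y, b0, b)` is higher than the log-likelihood of the all-zero model, which is the sense in which these coefficients explain the two labels better.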

#### Advantages and Disadvantages
**Advantages**:
- Simplicity of interpretation of results.
- Effectiveness when there is a linear relationship between features and the target variable.
- Speed of computation and training.

**Disadvantages**:
- The assumption of a linear relationship between predictors and the log-odds may not always hold true.
- Sensitivity to outliers and multicollinearity (high correlation between independent variables).

In [65]:
clf = LogisticRegression()

#### Searching for best Logistic Regression params with Gridsearch

In [66]:
parameters = {
    'penalty' : ['l1','l2'], 
    'C'       : [1., 10., 100.],
    'solver'  : ['newton-cg', 'lbfgs', 'liblinear'],
}
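
A grid search tries every combination in such a dictionary. The sketch below only enumerates the combinations; it does not reproduce `machine_learning.grid_search_all`, whose internals are not shown here. The `SUPPORTED_PENALTIES` table is an illustrative helper reflecting the scikit-learn documentation for these three solvers, restricted to the penalties in this grid.

```python
from itertools import product

parameters = {
    'penalty': ['l1', 'l2'],
    'C': [1., 10., 100.],
    'solver': ['newton-cg', 'lbfgs', 'liblinear'],
}

# Of the two penalties in the grid, 'l1' is accepted only by 'liblinear'.
SUPPORTED_PENALTIES = {
    'newton-cg': {'l2'},
    'lbfgs': {'l2'},
    'liblinear': {'l1', 'l2'},
}

names = list(parameters)
all_combinations = [
    dict(zip(names, values)) for values in product(*parameters.values())
]
valid_combinations = [
    combo for combo in all_combinations
    if combo['penalty'] in SUPPORTED_PENALTIES[combo['solver']]
]
print(len(all_combinations), len(valid_combinations))  # 18 12
```

Six of the 18 combinations pair `'l1'` with a solver that does not support it. If the helper relies on scikit-learn's `GridSearchCV` (an assumption), such fits would fail and could not be selected as the best model, which is consistent with every reported winner using `'l2'`.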

In [67]:
machine_learning.grid_search_all(clf, df, df_df, parameters)

just tokenization, 0 or 1, if the word exists:
accuracy: 0.934438,
params = {'C': 1.0, 'penalty': 'l2', 'solver': 'liblinear'}
just tokenization, word counts:
accuracy: 0.934438,
params = {'C': 1.0, 'penalty': 'l2', 'solver': 'liblinear'}
just tokenization, TFIDF:
accuracy: 0.935175,
params = {'C': 10.0, 'penalty': 'l2', 'solver': 'liblinear'}
stemming, 0 or 1, if the word exists:
accuracy: 0.934070,
params = {'C': 1.0, 'penalty': 'l2', 'solver': 'newton-cg'}
stemming, word counts:
accuracy: 0.934070,
params = {'C': 1.0, 'penalty': 'l2', 'solver': 'newton-cg'}
stemming, TFIDF:
accuracy: 0.935175,
params = {'C': 100.0, 'penalty': 'l2', 'solver': 'liblinear'}
lemmatization, 0 or 1, if the word exists:
accuracy: 0.932965,
params = {'C': 1.0, 'penalty': 'l2', 'solver': 'liblinear'}
lemmatization, word counts:
accuracy: 0.932965,
params = {'C': 1.0, 'penalty': 'l2', 'solver': 'liblinear'}
lemmatization, TFIDF:
accuracy: 0.934807,
params = {'C': 10.0, 'penalty': 'l2', 'solver': 'liblinear'}


### Decision tree

**Decision Tree** is a machine learning method used for both classification and regression. It represents a model that makes decisions based on a sequence of questions asked about the features of the input data. Each "branch" of the tree corresponds to a choice based on the value of a feature, and each "leaf" (terminal node) represents a prediction.

#### Key Principles
1. **Tree Structure**:
   - **Root**: The starting node of the tree that contains all the data.
   - **Internal Nodes**: Nodes representing questions about features that split the data into subsets.
   - **Leaves**: Terminal nodes representing predictions (classes or values).

2. **Building Algorithm**:
   - **Feature Selection**: For each node in the tree, a feature must be chosen that best splits the data. Various criteria can be used for this, such as:
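     - **Gini Index**: A measure of impurity in the node: the probability of misclassifying a randomly chosen sample if it were labeled according to the node's class distribution. Splits are chosen to minimize it.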
     - **Entropy Criterion**: Measures the uncertainty in the node. The lower the entropy, the more homogeneous the data.
     - **Mean Squared Error**: Used in regression tasks to minimize the deviations of predicted values from actual values.

3. **Pruning the Tree**:
   - To prevent overfitting, the tree may be pruned. This is done by removing some nodes that do not significantly improve predictions, making the model more general.
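
The two classification criteria can be written in a few lines. This is a sketch for intuition rather than scikit-learn's tree builder; the function names and the toy label lists (using the tweet types -1, 0, 1) are assumptions.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class proportions, in bits."""
    n = len(labels)
    return -sum(
        (count / n) * math.log2(count / n) for count in Counter(labels).values()
    )


def weighted_impurity(left, right, criterion):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * criterion(left) + len(right) / n * criterion(right)


# Two hypothetical ways to split the same six labels.
good_split = weighted_impurity([1, 1, 1], [0, 0, -1], gini)
poor_split = weighted_impurity([1, 0, -1], [1, 1, 0], gini)
print(good_split < poor_split)  # True
```

A tree builder evaluates candidate splits this way and keeps the one with the lowest weighted impurity; passing `entropy` instead of `gini` corresponds to the `'criterion'` option searched in the grid below.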


#### Advantages and Disadvantages
**Advantages**:
- Simplicity of interpretation and visualization.
- Does not require preprocessing of data, such as normalization or standardization.
- Ability to work with both numerical and categorical data.

**Disadvantages**:
- Prone to overfitting, especially with deep trees.
- Sensitive to small changes in the data, which can lead to significant changes in the structure of the tree.
- Limited ability to model complex dependencies.

In [68]:
clf = tree.DecisionTreeClassifier()

#### Searching for best Decision Tree params with Gridsearch

In [69]:
max_depths = [2, 4, 8, 16, 32]

parameters = {
    'criterion': ['gini', 'entropy'],
    'max_depth': max_depths,
}

In [70]:
machine_learning.grid_search_all(clf, df, df_df, parameters)

just tokenization, 0 or 1, if the word exists:
accuracy: 0.925599,
params = {'criterion': 'entropy', 'max_depth': 16}
just tokenization, word counts:
accuracy: 0.925967,
params = {'criterion': 'gini', 'max_depth': 16}
just tokenization, TFIDF:
accuracy: 0.929282,
params = {'criterion': 'gini', 'max_depth': 16}
stemming, 0 or 1, if the word exists:
accuracy: 0.924862,
params = {'criterion': 'gini', 'max_depth': 32}
stemming, word counts:
accuracy: 0.925230,
params = {'criterion': 'gini', 'max_depth': 16}
stemming, TFIDF:
accuracy: 0.926335,
params = {'criterion': 'gini', 'max_depth': 16}
lemmatization, 0 or 1, if the word exists:
accuracy: 0.925599,
params = {'criterion': 'gini', 'max_depth': 16}
lemmatization, word counts:
accuracy: 0.925230,
params = {'criterion': 'gini', 'max_depth': 16}
lemmatization, TFIDF:
accuracy: 0.926335,
params = {'criterion': 'entropy', 'max_depth': 16}
stemming(snow) + misspellings, 0 or 1, if the word exists:
accuracy: 0.923757,
params = {'criterion': 'gin

### Random Forest

**Random Forest** is an ensemble machine learning method used for classification and regression. It builds multiple decision trees and combines their results to improve accuracy and reduce overfitting. The method rests on the idea that aggregating many diverse models gives a more reliable and stable outcome than any single one.

#### Key Principles
1. **Structure**:
   - A Random Forest consists of many decision trees, each trained on a random subset of data. Each tree makes a decision, and the final result is obtained by voting (for classification) or averaging (for regression) the predictions of all trees.

2. **Random Sampling**:
   - To create each tree, the bootstrap method is used: a sample is drawn from the original dataset at random with replacement, so some rows repeat and others are left out. This allows each tree to be trained on different data, increasing model diversity.

3. **Random Feature Selection**:
   - At each node split, a random subset of features is selected. This helps prevent overfitting and makes the model more robust to noise in the data.

4. **Combining Results**:
   - For classification, the final prediction is determined by voting, where the class that receives the most votes from the trees is considered the final prediction. For regression, the results are averaged.
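
The three sources of randomness and aggregation described above can be sketched without building real trees. The helper names, the seed, and the list of tree votes are hypothetical; a full forest would fit one decision tree per bootstrap sample.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider at one node split."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(predictions):
    """Class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))                       # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)         # same size, duplicates allowed
features = random_feature_subset(n_features=16, k=4, rng=rng)

tree_votes = [1, 1, 0, 1, -1]                # hypothetical votes of five trees
final_class = majority_vote(tree_votes)
print(final_class)  # 1
```

Here `k=4` is the square root of the 16 features, a common choice for classification forests.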

#### Advantages and Disadvantages
**Advantages**:
- High accuracy and robustness against overfitting.
- Ability to handle large datasets with many features.
- Resilience to outliers and noise in the data.
- Ease of interpreting feature importance.

**Disadvantages**:
- High computational complexity during model training due to the large number of trees.
- Lower interpretability compared to single decision trees.
- Individual trees can still overfit noisy data if they are grown too deep; adding more trees does not in itself cause overfitting, but it increases training time and memory use.

In [71]:
clf = RandomForestClassifier(max_depth=2, random_state=0)

#### Searching for best Random Forest params with Gridsearch

In [72]:
max_depths = [2, 4, 8, 16, 32]
n_estimators = [1000]

parameters = {
    'criterion': ['gini', 'entropy'],
    'max_depth': max_depths,
    'n_estimators': n_estimators
}

In [73]:
machine_learning.grid_search_all(clf, df, df_df, parameters)

just tokenization, 0 or 1, if the word exists:
accuracy: 0.934438,
params = {'criterion': 'gini', 'max_depth': 32, 'n_estimators': 1000}
just tokenization, word counts:
accuracy: 0.934438,
params = {'criterion': 'gini', 'max_depth': 32, 'n_estimators': 1000}
just tokenization, TFIDF:
accuracy: 0.933702,
params = {'criterion': 'gini', 'max_depth': 32, 'n_estimators': 1000}
stemming, 0 or 1, if the word exists:
accuracy: 0.937753,
params = {'criterion': 'entropy', 'max_depth': 32, 'n_estimators': 1000}
stemming, word counts:
accuracy: 0.937753,
params = {'criterion': 'entropy', 'max_depth': 32, 'n_estimators': 1000}
stemming, TFIDF:
accuracy: 0.936648,
params = {'criterion': 'entropy', 'max_depth': 32, 'n_estimators': 1000}
lemmatization, 0 or 1, if the word exists:
accuracy: 0.937753,
params = {'criterion': 'entropy', 'max_depth': 32, 'n_estimators': 1000}
lemmatization, word counts:
accuracy: 0.937753,
params = {'criterion': 'entropy', 'max_depth': 32, 'n_estimators': 1000}
lemmatizati

### Choosing the best model and calculating final accuracies for all preprocessing and vectorization methods

#### Logistic regression with params for best model according to gridsearch

In [74]:
clf = LogisticRegression()
params_best = {'C': 10.0, 'penalty': 'l2', 'solver': 'liblinear'}
clf.set_params(**params_best)
df_res_upd = machine_learning.model_preprocessing(clf, df, df_df)
df_res_upd

Unnamed: 0,"0 or 1, if the word exists",word counts,TFIDF
just tokenization,0.93186,0.93186,0.922652
stemming,0.93186,0.93186,0.930018
lemmatization,0.93186,0.93186,0.922652
stemming(snow) + misspellings,0.933702,0.933702,0.930018
lemmatization + misspellings,0.93186,0.93186,0.926335
stemming(lanc) + misspellings,0.935543,0.935543,0.928177


#### Decision Tree with params for best model according to gridsearch

In [75]:
clf = tree.DecisionTreeClassifier()
params_best = {'criterion': 'gini', 'max_depth': 16}
clf.set_params(**params_best)
df_res_upd = machine_learning.model_preprocessing(clf, df, df_df)
df_res_upd

Unnamed: 0,"0 or 1, if the word exists",word counts,TFIDF
just tokenization,0.92081,0.922652,0.92081
stemming,0.913444,0.917127,0.913444
lemmatization,0.92081,0.917127,0.92081
stemming(snow) + misspellings,0.911602,0.918969,0.911602
lemmatization + misspellings,0.917127,0.917127,0.917127
stemming(lanc) + misspellings,0.913444,0.909761,0.917127


#### Random Forest with params for best model according to gridsearch

In [76]:
clf = RandomForestClassifier(random_state=0)
params_best = {'criterion': 'entropy', 'max_depth': 32, 'n_estimators': 1000}
clf.set_params(**params_best)
df_res_upd = machine_learning.model_preprocessing(clf, df, df_df)
df_res_upd

Unnamed: 0,"0 or 1, if the word exists",word counts,TFIDF
just tokenization,0.928177,0.928177,0.924494
stemming,0.935543,0.935543,0.928177
lemmatization,0.935543,0.935543,0.930018
stemming(snow) + misspellings,0.937385,0.937385,0.928177
lemmatization + misspellings,0.930018,0.930018,0.922652
stemming(lanc) + misspellings,0.941068,0.941068,0.93186


----
### Word2Vec

**Word2Vec** is a method for representing words as vectors, allowing machine learning models to effectively process textual data. Word2Vec uses neural networks to learn vector representations of words that capture semantic and syntactic relationships between them.

#### Key Principles
1. **Vector Representations**:
   - Each word is represented as a multi-dimensional vector. These vectors allow models to compare words based on their meanings and context.
2. **Algorithms**:
   Word2Vec uses two main algorithms for training:
   - **Continuous Bag of Words (CBOW)**: This method predicts the current word based on its context (the surrounding words). For example, given the sentence "the cat sits on the mat," the model would try to predict the word "sits" from the context words "the," "cat," "on," and "the" (a window of two words on each side).
   - **Skip-gram**: This method, on the other hand, uses the current word to predict the context. So, if we have the word "sits," the model will try to predict the words that surround it.

3. **Training**:
   - The model is trained on large volumes of textual data. During the training process, it optimizes the vectors so that words with similar meanings are located closer together in the vector space.

4. **Semantic Relationships**:
   - Word2Vec can capture semantic relationships between words. For instance, the vectors for "king" and "queen" will be close, as will the vectors for "man" and "woman." This allows for vector arithmetic such as "king - man + woman ≈ queen," where the result is the word whose vector lies closest to the computed one.
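
The difference between the two training algorithms is easiest to see in the training examples each one generates. The sketch below only builds those examples for the sentence used above; it is not gensim's implementation, and the function names and the window size of 2 are assumptions.

```python
def cbow_pairs(tokens, window=2):
    """(context words, target word) examples: the context predicts the target."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, window=2):
    """(target word, context word) examples: the target predicts each neighbor."""
    return [
        (target, neighbor)
        for context, target in cbow_pairs(tokens, window)
        for neighbor in context
    ]


tokens = "the cat sits on the mat".split()
cbow = cbow_pairs(tokens)
skipgram = skipgram_pairs(tokens)

print(cbow[2])        # (['the', 'cat', 'on', 'the'], 'sits')
print(skipgram[5:9])  # [('sits', 'the'), ('sits', 'cat'), ('sits', 'on'), ('sits', 'the')]
```

In gensim's `Word2Vec`, the `sg` parameter selects between the two: `sg=0` (the default) is CBOW and `sg=1` is skip-gram.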

#### Advantages and Disadvantages
**Advantages**:
- Efficiency: Allows for quick and effective processing of large volumes of textual data.
- Semantic Representation: Captures the meanings of words and their relationships.
- Ease of Use: Can be integrated into various NLP tasks.

**Disadvantages**:
- Limited Contextuality: Word2Vec does not account for the order of words, which can be critical in some tasks.
- Need for Large Datasets: Achieving good results requires a substantial amount of training text.
- Inability to Handle Polysemy: Each word gets a single vector, so the different meanings of a polysemous word (e.g., "bank") are merged into one representation.

In [77]:
preprocessing = ['original', 'just tokenization', 'stemming', 'lemmatization',
                 'stemming(snow) + misspellings', 'lemmatization + misspellings',
                 'stemming(lanc) + misspellings']
df_df_prep = pd.DataFrame(columns=preprocessing)

In [78]:
for el in preprocessing:
    if el == 'original':
        df_df_prep[el] = df['tweets'].iloc[:]
    else:
        df_df_prep[el] = df_df.iloc[:, 0][el].iloc[:, 0].index
df_df_prep

Unnamed: 0,original,just tokenization,stemming,lemmatization,stemming(snow) + misspellings,lemmatization + misspellings,stemming(lanc) + misspellings
0,"An inspiration in all aspects: Fashion, fitnes...",an inspiration in all aspects fashion fitness ...,an inspir in all aspect fashion fit beauti and...,an inspiration in all aspect fashion fitness b...,an inspir in all aspect fashion fit beauti and...,an inspiration in all aspect fashion fitness b...,an inspir in al aspect fash fit beauty and per...
1,Apka Apna Awam Ka Channel Frankline Tv Aam Adm...,apka apna awam ka channel frankline tv aam adm...,apka apna awam ka channel franklin tv aam admi...,apka apna awam ka channel frankline tv aam adm...,aka anna away ka channel franklin to am admit ...,aka anna away ka channel franklin to am admit ...,ak ann away ka channel franklin to am admit pr...
2,Beautiful album from the greatest unsung guit...,beautiful album from the greatest unsung guita...,beauti album from the greatest unsung guitar g...,beautiful album from the greatest unsung guita...,beauti album from the greatest unsung guitar g...,beautiful album from the greatest unsung guita...,beauty alb from the greatest unsung guit geni ...
3,Good luck to Rich riding for great project in ...,good luck to rich riding for great project in ...,good luck to rich ride for great project in th...,good luck to rich riding for great project in ...,good luck to rich ride for great project in th...,good luck to rich riding for great project in ...,good luck to rich rid for gre project in thi s...
4,Omg he... kissed... him crying with joy,omg he kissed him crying with joy,omg he kiss him cri with joy,omg he kissed him cry with joy,om he kiss him cri with joy,om he kissed him cry with joy,om he kiss him cry with joy
...,...,...,...,...,...,...,...
1019,Supreme Court shoots down govt bid to put spor...,supreme court shoots down govt bid to put spor...,suprem court shoot down govt bid to put sport ...,supreme court shoot down govt bid to put sport...,suprem court shoot down got bid to put sport m...,supreme court shoot down got bid to put sport ...,suprem court shoot down got bid to put sport m...
1020,"Historian Ram Guha, IDFC official Vikram Limay...",historian ram guha idfc official vikram limaye...,historian ram guha idfc offici vikram limay fo...,historian ram guha idfc official vikram limaye...,historian ram gula if offici viral imag former...,historian ram gula if official viral image for...,hist ram gul if off vir im form captain lian a...
1021,Supreme Court names former CAG as head of 4-me...,supreme court names former cag as head of memb...,suprem court name former cag as head of member...,supreme court name former cag a head of member...,suprem court name former can as head of member...,supreme court name former can a head of member...,suprem court nam form can as head of memb pane...
1022,Court summons CM suspended BJP MP as accused i...,court summons cm suspended bjp mp as accused i...,court summon cm suspend bjp mp as accus in cri...,court summons cm suspended bjp mp a accused in...,court summon am suspend bop my as accus in cri...,court summons am suspended bop my a accused in...,court summon am suspend bop my as accus in cri...


In [79]:
df_df_prep.to_csv('res/df_df_prep.csv')

#### Word2Vec examples

In [80]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from gensim.models.word2vec import Word2Vec

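# Second-to-last column: 'lemmatization + misspellings'
df_df_ = df_df_prep.iloc[:, -2].astype(str).tolist()
tweets_lists = [text_to_word_sequence(tw) for tw in df_df_]
w2v_model = Word2Vec(sentences=tweets_lists, vector_size=100,
                     window=5, min_count=1)
# Indexing the KeyedVectors replaces the deprecated word_vec() in gensim 4
w2v_model.wv['happy']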

array([-0.09505568,  0.3937941 ,  0.04392887, -0.33067578,  0.10746714,
       -0.91409445,  0.730753  ,  1.1085308 , -0.33851627, -0.50586295,
       -0.15899393, -0.93298036, -0.14942716, -0.01470097,  0.13183391,
       -0.29866737, -0.02443575, -0.3835428 , -0.06543013, -0.93508136,
        0.3376409 ,  0.15310737,  0.49957466, -0.30281562, -0.1575916 ,
        0.04224298, -0.1382828 , -0.34044975, -0.24866405, -0.03870746,
        0.5451863 , -0.14133126,  0.30855674, -0.56088084, -0.1796257 ,
        0.8269225 ,  0.2414299 , -0.05726096, -0.3344391 , -0.55864376,
        0.26846048, -0.45957068, -0.23192333,  0.00735891,  0.42723098,
       -0.34936285, -0.40866336, -0.2465668 ,  0.3900131 ,  0.22368641,
        0.02124124, -0.4222775 ,  0.2863718 ,  0.03347119, -0.32226196,
        0.29391056,  0.03812635, -0.12255003, -0.2826103 ,  0.10834589,
       -0.04517619, -0.12301224,  0.33927235, -0.09603848, -0.41030037,
        0.683605  ,  0.18159644,  0.5405732 , -0.61853415,  0.34

In [81]:
w2v_model.wv.most_similar('fitness')

[('pick', 0.9152530431747437),
 ('fly', 0.9139841794967651),
 ('kind', 0.9126733541488647),
 ('bad', 0.9119454026222229),
 ('start', 0.9114742875099182),
 ('gone', 0.9113501906394958),
 ('shot', 0.9105678200721741),
 ('free', 0.9102700352668762),
 ('reading', 0.9102481007575989),
 ('already', 0.9101448655128479)]

In [82]:
w2v_model.wv.doesnt_match("girl guitar man woman".split())

'guitar'

### Using pretrained word vector representation models

> #### glove-twitter-25

**GloVe-Twitter-25** is a pre-trained word vector representation model designed for processing texts from Twitter. It is based on the GloVe (Global Vectors for Word Representation) method, which creates vector representations of words by considering word statistics in a large text corpus. The GloVe-Twitter-25 model was trained on a large dataset of tweets and has a vector size of 25.

#### Key Principles
1. **GloVe Method**:
   - GloVe uses a word co-occurrence matrix to study how often words appear together in text. This allows capturing contextual relationships between words based on their frequency of occurrence in various contexts.
   - The model learns vectors whose dot products approximate the logarithm of the words' co-occurrence counts.

2. **Training on Twitter**:
   - GloVe-Twitter-25 was trained on about 2 billion tweets (roughly 27 billion tokens, lowercased), allowing it to capture the specific nuances of Twitter language, such as slang, abbreviations, and emojis.
   - Training on such a corpus makes the model particularly useful for tasks related to social media analysis and natural language processing in the context of Twitter.

3. **Vector Size**:
   - The vectors have a dimension of 25, making them compact and convenient for use in various applications where processing speed and memory efficiency are critical.
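
The co-occurrence idea behind GloVe can be illustrated with a minimal sketch. The toy corpus, the `cooccurrence_counts` helper, and the window size are assumptions made up for this example; the original GloVe implementation also weights each pair by its distance, which is omitted here.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Toy corpus, invented for illustration only.
toy_sentences = [
    "i love my gym workout".split(),
    "i love my yoga workout".split(),
]
counts = cooccurrence_counts(toy_sentences, window=2)

# GloVe fits vectors so that dot(w_i, w_j) + biases is close to log(count).
log_target = math.log(counts[("love", "my")])
print(counts[("love", "my")], counts[("gym", "yoga")], round(log_target, 3))
```

Pairs that never share a window (here `gym` and `yoga`) have a count of zero and are skipped by the GloVe objective, since the logarithm of zero is undefined.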

#### Applications
- **Sentiment Analysis**: Used to determine the emotional tone of tweets, which is useful for marketing research and public opinion monitoring.
- **Text Classification**: Applied for automatic categorization of tweets based on their content.
- **Recommendation Systems**: Helps in creating recommendations based on textual content in social networks.

#### Advantages and Disadvantages
**Advantages**:
- Specificity: Training on Twitter data allows the model to better handle the language and style characteristic of this platform.
- Compactness: The small vector size (25) makes them easy to integrate into applications with limited resources.

**Disadvantages**:
- Limited Dimensionality: The smaller vector size may limit the model's ability to capture more complex semantic relationships compared to models with larger dimensions.
- Language Specificity: The model may not be suitable for tasks requiring processing of texts outside the Twitter context or in more formal language settings.
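
The compactness trade-off can be made concrete with a rough memory estimate. This is a back-of-the-envelope sketch: the vocabulary size of 1.2 million words is an assumption (approximately what the GloVe Twitter release reports), the 4 bytes per value follow from the `float32` dtype shown in the outputs, and `embedding_megabytes` is a hypothetical helper.

```python
def embedding_megabytes(vocab_size, dim, bytes_per_value=4):
    """Approximate size of a dense embedding matrix in MiB."""
    return vocab_size * dim * bytes_per_value / 1024 ** 2


VOCAB_SIZE = 1_200_000  # assumed, approximate vocabulary size

for dim in (25, 100, 200):
    print(f"{dim:>3} dimensions: {embedding_megabytes(VOCAB_SIZE, dim):.1f} MiB")
```

The estimate covers only the raw matrix; the index that maps words to rows adds further overhead.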

In [83]:
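import gensim.downloader as api

# Downloads the vectors on first use, then loads them from the local cache.
# Despite the variable name, these are GloVe vectors, not Word2Vec.
w2v_corpus = api.load('glove-twitter-25')
w2v_corpus['happy']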



array([-1.2304 ,  0.48312,  0.14102, -0.0295 , -0.65253, -0.18554,
        2.1033 ,  1.7516 , -1.3001 , -0.32113, -0.84774,  0.41995,
       -3.8823 ,  0.19638, -0.72865, -0.85273,  0.23174, -1.0763 ,
       -0.83023,  0.10815, -0.51015,  0.27691, -1.1895 ,  0.98094,
       -0.13955], dtype=float32)

In [84]:
w2v_corpus.most_similar('fitness')

[('lifestyle', 0.877960205078125),
 ('wellness', 0.8659428954124451),
 ('production', 0.8604754209518433),
 ('skills', 0.8599393963813782),
 ('training', 0.8574686646461487),
 ('professional', 0.853054940700531),
 ('crossfit', 0.8482766151428223),
 ('nutrition', 0.8449391722679138),
 ('yoga', 0.8430234789848328),
 ('coaching', 0.8291710019111633)]

In [85]:
w2v_corpus.doesnt_match("girl apple man woman".split())

'apple'
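
To make the two queries above less of a black box, here is a pure-Python sketch of what they compute, as far as gensim's behavior is understood here: `most_similar` ranks words by cosine similarity, and `doesnt_match` returns the word whose unit vector is least similar to the mean of the group. The 2-dimensional `toy_vectors` are invented for illustration, and the functions are simplified stand-ins, not gensim's code.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity: dot product of the two unit vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


def doesnt_match(words, vectors):
    """Return the word least similar to the mean of the group's unit vectors."""
    units = [unit(vectors[w]) for w in words]
    mean = unit([sum(col) / len(units) for col in zip(*units)])
    sims = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[sims.index(min(sims))]


# Hypothetical 2-dimensional vectors, for illustration only.
toy_vectors = {
    "girl": [0.9, 0.1],
    "man": [0.8, 0.2],
    "woman": [0.85, 0.15],
    "apple": [0.1, 0.9],
}

print(most_similar("girl", toy_vectors, topn=2))
print(doesnt_match("girl apple man woman".split(), toy_vectors))  # apple
```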

> #### glove-twitter-100

**GloVe-Twitter-100** is a pre-trained word vector representation model designed for processing texts from Twitter. It is based on the GloVe (Global Vectors for Word Representation) method and was trained on a large dataset of tweets. The vector size in this model is 100, allowing it to capture more complex semantic relationships between words compared to lower-dimensional models.

In [86]:
w2v_corpus = api.load('glove-twitter-100')
w2v_corpus['happy']



array([ 0.023098 , -0.11098  ,  0.079839 ,  0.26566  ,  0.23083  ,
       -0.14683  ,  0.29009  ,  0.24811  , -0.38742  ,  0.11899  ,
       -0.81393  , -0.69197  , -4.0274   , -0.096299 , -0.49273  ,
        0.71179  ,  0.043593 ,  0.048169 , -0.90247  ,  0.23704  ,
        0.20754  , -0.10822  , -0.69071  , -0.33782  ,  0.83584  ,
       -0.75044  ,  0.21905  ,  0.28662  ,  0.63882  , -1.0862   ,
       -0.76783  , -0.4843   ,  0.34029  ,  0.65897  ,  0.50015  ,
        0.52957  ,  0.39435  , -0.38319  ,  0.11514  , -0.1388   ,
       -1.3666   ,  0.1397   ,  0.18929  ,  0.93266  , -0.47246  ,
       -0.19455  , -0.03649  , -0.98943  , -0.27461  ,  0.24763  ,
       -0.45024  , -0.71812  ,  0.61547  , -0.90039  ,  0.92341  ,
        0.2597   ,  0.058149 ,  0.30903  ,  0.26106  , -0.087882 ,
       -0.18843  , -0.85732  ,  0.065188 ,  0.035417 , -0.1342   ,
        0.06486  , -1.07     , -0.37303  , -0.79469  ,  0.23944  ,
        0.37891  , -0.36431  ,  0.19694  ,  0.39264  ,  0.6587

In [87]:
w2v_corpus.most_similar('fitness')

[('workout', 0.7857903838157654),
 ('crossfit', 0.7615430355072021),
 ('training', 0.7413853406906128),
 ('bodybuilding', 0.7236645817756653),
 ('wellness', 0.7222222089767456),
 ('cardio', 0.7207351922988892),
 ('workouts', 0.7059327363967896),
 ('nutrition', 0.7046872973442078),
 ('gym', 0.7045153975486755),
 ('exercise', 0.7042489051818848)]

In [88]:
w2v_corpus.doesnt_match("girl apple man woman".split())

'apple'

> #### glove-twitter-200

**GloVe-Twitter-200** is a pre-trained word vector representation model designed for processing texts from Twitter. It is based on the GloVe (Global Vectors for Word Representation) method and was trained on a large dataset of tweets. The vector size in this model is 200, allowing it to capture more complex semantic relationships between words compared to lower-dimensional models.

In [89]:
w2v_corpus = api.load('glove-twitter-200')
w2v_corpus['happy']



array([ 3.4055e-01, -3.5341e-02,  1.7932e-01,  2.2748e-01,  5.1800e-01,
        3.5620e-01,  4.7427e-01, -3.3973e-01, -1.1459e-01, -8.5816e-02,
       -4.8371e-01, -1.5185e-01,  4.9683e-02,  2.3031e-01, -1.2894e-02,
        4.2952e-01, -2.2993e-01,  4.0219e-01, -4.6905e-01, -3.4270e-01,
        2.3068e-01,  5.4987e-02, -6.3739e-01, -1.7282e-01,  1.2480e-01,
        4.7597e-01, -3.1538e-01, -2.3897e-01,  6.3453e-01,  6.9128e-02,
        4.4254e-02, -1.7599e-01,  2.4331e-01,  8.8688e-01, -2.1671e-02,
        3.5471e-01,  5.1198e-01,  3.7152e-01, -3.1553e-01,  1.9369e-01,
       -2.3263e-01, -8.4731e-02,  3.2064e-01,  4.1194e-01, -7.6711e-01,
       -2.3819e-01, -2.2367e-02, -4.8029e-01, -2.1130e-01,  3.7667e-01,
       -3.8449e-01, -4.6203e-01,  2.0125e-01, -8.7098e-01,  6.3067e-01,
        3.8002e-01,  1.0009e-01,  2.2057e-01,  1.2709e-01, -2.7291e-01,
       -3.8695e-01, -1.7037e-01, -5.2444e-01,  1.6979e-01, -6.3892e-02,
       -4.1493e-01, -3.0092e-01, -8.8667e-02,  2.7442e-02, -3.32

In [90]:
w2v_corpus.most_similar('fitness')

[('workout', 0.7339531183242798),
 ('crossfit', 0.6933854818344116),
 ('gym', 0.6854565143585205),
 ('training', 0.6848605871200562),
 ('nutrition', 0.673112690448761),
 ('wellness', 0.6691936254501343),
 ('exercise', 0.6544373631477356),
 ('cardio', 0.6528776288032532),
 ('bodybuilding', 0.6506840586662292),
 ('pilates', 0.6403564810752869)]

In [91]:
w2v_corpus.doesnt_match("girl apple man woman".split())

'apple'

#### Word2Vec and pretrained word vector representation models with Logistic Regression Classifier

In [92]:
clf = LogisticRegression()
params_best = {'C': 10.0, 'penalty': 'l2', 'solver': 'liblinear'}
clf.set_params(**params_best)
res = w2v_ml.model_preprocessing(clf, df, df_df_prep)
res.iloc[1:, :]

| Preprocessing | w2v | glove-25 | glove-100 | glove-200 |
|---|---|---|---|---|
| just tokenization | 0.701657 | 0.85267 | 0.898711 | 0.909761 |
| stemming | 0.690608 | 0.85267 | 0.898711 | 0.909761 |
| lemmatization | 0.709024 | 0.85267 | 0.898711 | 0.909761 |
| stemming(snow) + misspellings | 0.699816 | 0.85267 | 0.898711 | 0.909761 |
| lemmatization + misspellings | 0.690608 | 0.85267 | 0.898711 | 0.909761 |
| stemming(lanc) + misspellings | 0.697974 | 0.85267 | 0.898711 | 0.909761 |
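
A classifier needs one fixed-length feature vector per tweet, while the embedding models give one vector per word. How `w2v_ml.model_preprocessing` bridges that gap is not shown in this notebook; a common approach, sketched below purely as an assumption, is to average the vectors of the tokens the model knows. The `tweet_vector` helper and `toy_embeddings` are hypothetical.

```python
def tweet_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # No known token: fall back to a zero vector of the right size.
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Hypothetical 2-dimensional embeddings, for illustration only.
toy_embeddings = {"happy": [1.0, 0.0], "gym": [0.0, 1.0]}

features = tweet_vector(["happy", "gym", "zzz"], toy_embeddings, dim=2)
print(features)  # [0.5, 0.5]
```

The same function should also work with a gensim `KeyedVectors` object in place of the dictionary, since it supports both `in` checks and lookup by word.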


#### Word2Vec and pretrained word vector representation models with Decision Tree

In [93]:
clf = tree.DecisionTreeClassifier()
params_best = {'criterion': 'gini', 'max_depth': 16}
clf.set_params(**params_best)
res = w2v_ml.model_preprocessing(clf, df, df_df_prep)
res.iloc[1:, :]

| Preprocessing | w2v | glove-25 | glove-100 | glove-200 |
|---|---|---|---|---|
| just tokenization | 0.552486 | 0.732965 | 0.769797 | 0.755064 |
| stemming | 0.574586 | 0.742173 | 0.755064 | 0.767956 |
| lemmatization | 0.550645 | 0.740331 | 0.777164 | 0.73849 |
| stemming(snow) + misspellings | 0.561694 | 0.71639 | 0.769797 | 0.736648 |
| lemmatization + misspellings | 0.565378 | 0.720074 | 0.755064 | 0.744015 |
| stemming(lanc) + misspellings | 0.561694 | 0.731123 | 0.769797 | 0.742173 |


#### Word2Vec and pretrained word vector representation models with Random Forest Classifier

In [94]:
clf = RandomForestClassifier(random_state=0)
params_best = {'criterion': 'gini', 'max_depth': 32, 'n_estimators': 1000}
clf.set_params(**params_best)
res = w2v_ml.model_preprocessing(clf, df, df_df_prep)
res.iloc[1:, :]

| Preprocessing | w2v | glove-25 | glove-100 | glove-200 |
|---|---|---|---|---|
| just tokenization | 0.74954 | 0.850829 | 0.904236 | 0.896869 |
| stemming | 0.760589 | 0.850829 | 0.904236 | 0.896869 |
| lemmatization | 0.755064 | 0.850829 | 0.904236 | 0.896869 |
| stemming(snow) + misspellings | 0.758748 | 0.850829 | 0.904236 | 0.896869 |
| lemmatization + misspellings | 0.755064 | 0.850829 | 0.904236 | 0.896869 |
| stemming(lanc) + misspellings | 0.762431 | 0.850829 | 0.904236 | 0.896869 |


-----

### Some useful links

- https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
- Sentiment analysis https://www.analyticsvidhya.com/blog/2021/09/sentiment-classification-using-nlp-with-text-analytics/
- https://becominghuman.ai/nlp-classifying-positive-and-negative-restaurant-reviews-bag-of-words-model-31e9abfd7286
- Comments classification https://github.com/msahamed/yelp_comments_classification_nlp/blob/master/word_embeddings.ipynb
- Tokenization https://towardsdatascience.com/5-simple-ways-to-tokenize-text-in-python-92c6804edfc4
- Stemming and lemmatization in Python https://pythobyte.com/stemming-and-lemmatization-82464/
- Fundamentals of Bag Of Words and TF-IDF https://medium.com/analytics-vidhya/fundamentals-of-bag-of-words-and-tf-idf-9846d301ff22
- How to Vectorize Text in DataFrames for NLP Tasks https://towardsdatascience.com/how-to-vectorize-text-in-dataframes-for-nlp-tasks-3-simple-techniques-82925a5600db
- Stemming: Porter Vs. Snowball Vs. Lancaster https://towardsai.net/p/l/stemming-porter-vs-snowball-vs-lancaster
- https://www.bigdataschool.ru/blog/pyspark-vectorization.html
- https://towardsdatascience.com/benchmarking-python-nlp-tokenizers-3ac4735100c5
- https://stackoverflow.com/questions/45312377/how-to-one-hot-encode-from-a-pandas-column-containing-a-list
- Different stemmers https://machinelearningknowledge.ai/beginners-guide-to-stemming-in-python-nltk/
- Preprocessing https://dataaspirant.com/nlp-text-preprocessing-techniques-implementation-python/#t-1600081660724