<a href="https://colab.research.google.com/github/allakoala/data_science/blob/main/colab_notebooks/HW_Classification_(part_2).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#HW - https://docs.google.com/document/d/1L7p-GnVDEUal3JkgR9E2Kqp62yT06TXKWYwXTnVOhuk/edit

Data - This is a dataset for binary sentiment classification. It contains 50,000 highly polar movie reviews for training and testing.
https://drive.google.com/file/d/14TaFIFoslOAAljV2uj5rU9biAuocKScX/view?usp=sharing

To do - Take the provided dataset and solve the binary classification task.

Target – sentiment pos/neg

Evaluation - Metric AUC-ROC with visualisation




In [None]:
!pip install word2number unidecode contractions

In [None]:
#basics
import pandas as pd
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import string
import sklearn
import nltk #pip install nltk
nltk.download('omw-1.4')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
import re
from collections import Counter

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

from bs4 import BeautifulSoup
import spacy
import unidecode #!pip install unidecode
from word2number import w2n #!pip install word2number
import contractions #!pip install contractions

import en_core_web_sm

!python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

#Text preprocessing with explanations of all steps:
The function clean_text() takes a list of documents, applies a series of preprocessing steps to each document, and returns the processed documents as a list of strings.

The function performs the following steps, in order:

- Remove HTML tags with BeautifulSoup, replacing them with a space separator.
- Transliterate accented characters to ASCII with unidecode (for example, "café" becomes "cafe").
- Expand contractions with the contractions library.
- Convert number words to numeric form with word2number.
- Remove standalone numbers with a regular expression.
- Remove words containing consecutively repeated letters (currently commented out).
- Strip leading and trailing spaces and collapse repeated whitespace.
- Remove punctuation with str.translate.
- Convert all text to lowercase.
- Tokenize the text with nltk and count the number of words in the corpus.
- Remove stop words using the stop_words set, which is NLTK's English stop word list with "no" and "not" kept.
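The steps above can be tried on a single string without installing anything. The following is a minimal sketch that uses only the standard library: a regular expression stands in for BeautifulSoup, `unicodedata` stands in for unidecode, and `STOP_WORDS` is a small hypothetical set rather than the NLTK list. Contraction expansion and number-word conversion are left out, so it is not a replacement for clean_text().

```python
import re
import string
import unicodedata

# Hypothetical, reduced stop word set for illustration only.
# The notebook uses NLTK's English list with "no" and "not" kept.
STOP_WORDS = {"the", "a", "is", "it", "was", "and"}

def clean_one(doc: str) -> str:
    # Crude tag stripping (the notebook uses BeautifulSoup instead)
    doc = re.sub(r"<[^>]+>", " ", doc)
    # Fold accented characters to ASCII (stand-in for unidecode)
    doc = unicodedata.normalize("NFKD", doc).encode("ascii", "ignore").decode("ascii")
    # Drop standalone numbers
    doc = re.sub(r"\b\d+\b", "", doc)
    # Remove punctuation and convert to lowercase
    doc = doc.translate(str.maketrans("", "", string.punctuation)).lower()
    # Whitespace tokenization followed by stop word removal
    tokens = [token for token in doc.split() if token not in STOP_WORDS]
    return " ".join(tokens)

cleaned = clean_one("The café was <br />NOT good, 10 out of 10!")
print(cleaned)  # cafe not good out of
```

Note that "not" survives the stop word filter here, which mirrors the notebook's choice to keep negation words.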

#Conclusion:
1. Stemming reduces words to a stem by stripping suffixes, and the stem is not always a dictionary word. Lemmatization reduces words to their base form, which is usually a dictionary word. Here both techniques normalize the reviews, but with different results. For example, "comfortable" is stemmed to "comfort" but lemmatized to "comfortable", and "mentioned" is stemmed to "mention" but lemmatized to "mentioned". The lemmatizer leaves such words unchanged because WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed.

Both techniques are useful for text normalization. Stemming is the simpler and more aggressive approach, while lemmatization is more complex and more accurate. The choice between them depends on the needs and goals of the text processing task.

Starting from the pair "Index: 0 Normalization1: happened | Normalization2: goe", the printed word pairs are no longer aligned. This does not necessarily mean that stemming removes words: the comparison zips two independently sorted set differences, so the pairs drift apart as soon as the two sides sort differently or contain different numbers of words.
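The misaligned pairs can be reproduced without NLTK. The sketch below uses hand-written token lists (illustrative values, not actual notebook output) to contrast zipping two sorted set differences with comparing the two lists position by position. It assumes both lists come from the same tokenization and therefore have the same length.

```python
# Hand-written tokens for illustration only; not actual notebook output.
lemmas = ["happened", "happening", "movie", "good"]
stems = ["happen", "happen", "movi", "good"]

# Approach 1: zip two independently sorted set differences.
# Two lemmas collapse to one stem, so the sides have different lengths.
only_lemmas = sorted(set(lemmas) - set(stems))
only_stems = sorted(set(stems) - set(lemmas))
set_pairs = list(zip(only_lemmas, only_stems))

# Approach 2: compare by position and keep only the pairs that differ.
aligned_pairs = [(lemma, stem) for lemma, stem in zip(lemmas, stems) if lemma != stem]

print(set_pairs)      # [('happened', 'happen'), ('happening', 'movi')]
print(aligned_pairs)  # [('happened', 'happen'), ('happening', 'happen'), ('movie', 'movi')]
```

The set-based pairing matches "happening" with "movi" and silently drops "movie", while the positional pairing keeps every word next to its own stem.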

In [None]:
#path of the file to read
url = "/content/drive/MyDrive/Colab Notebooks/data_class_2/LargeMovieReviewDataset.csv"

#read the file into a variable
data = pd.read_csv(url, sep=',')

#examine the data
data.head()

In [None]:
stop_words = {word for word in stopwords.words('english') if word not in {'no', 'not'}}

def clean_text(documents):
    # Remove HTML tags
    documents = [BeautifulSoup(doc, "html.parser").get_text(separator=" ") for doc in documents]

    # Remove accented characters from text, e.g. café
    documents = [unidecode.unidecode(doc) for doc in documents]

    # Expand contractions
    documents = [contractions.fix(doc) for doc in documents]

    # Convert number words to numeric form
    new_documents = []
    for doc in documents:
        words = []
        for word in doc.split():
            if word.isalpha():
                try:
                    num = w2n.word_to_num(word)
                    words.append(str(num))
                except ValueError:
                    words.append(word)
            else:
                words.append(word)
        new_documents.append(' '.join(words))
    documents = new_documents

    # Remove numbers
    documents = [re.sub(r'\b\d+\b', '', doc) for doc in documents]

    # Remove words containing consecutively repeated letters (disabled)
    # documents = [re.sub(r'\b\w*(\w)\1{1,}\w*\b', '', doc) for doc in documents]

    # Strip leading and trailing spaces and collapse repeated whitespace
    documents = [" ".join(doc.strip().split()) for doc in documents]

    # Remove punctuation
    documents = [doc.translate(str.maketrans('', '', string.punctuation)) for doc in documents]

    # Convert to lowercase
    documents = [doc.lower() for doc in documents]

    # Tokenize text and count the number of words in the corpus
    tokens = [nltk.word_tokenize(doc) for doc in documents]
    corpus_size_words = np.sum([len(d) for d in tokens])

    # Remove stop words
    filtered_tokens = [[token for token in doc_tokens if token not in stop_words] for doc_tokens in tokens]

    # Print descriptive statistics
    corpus_size_docs = len(documents)
    sentiment_distr = Counter(data['sentiment'])
    print('Corpus Size (Number of Documents): {}'.format(corpus_size_docs))
    print('Corpus Size (Number of Words): {}'.format(corpus_size_words))
    print('Sentiment Distribution: {}'.format(sentiment_distr))

    return [' '.join(filtered_doc_tokens) for filtered_doc_tokens in filtered_tokens]

data['review_cleaned'] = clean_text(data['review'])
data['review_cleaned']

In [None]:
# Normalization 1 - lemmatization
lemmatizer = WordNetLemmatizer()

def normalize_text1(text):
    # Tokenize words
    text = nltk.word_tokenize(text)
    # Lemmatize words
    text = [lemmatizer.lemmatize(word) for word in text]
    return text

data['review_normalized1'] = data['review_cleaned'].apply(normalize_text1)

# concatenate the lists of normalized words for all reviews
normalized1 = []
for review in data['review_normalized1']:
    normalized1.extend(review)

print(normalized1[:7]) # print first 7 words

In [None]:
# Normalization 2 - stemming
stemmer = PorterStemmer()

def normalize_text2(text):
    # Tokenize words
    text = nltk.word_tokenize(text)
    # Stem words
    text = [stemmer.stem(word) for word in text]
    return text

data['review_normalized2'] = data['review_cleaned'].apply(normalize_text2)

# concatenate the lists of normalized words for all reviews
normalized2 = []
for review in data['review_normalized2']:
    normalized2.extend(review)

print(normalized2[:7]) # print first 7 words

In [None]:
#the list of the word pairs that are different after Normalization 1 and Normalization 2
for i in range(min(len(data), 10)):
    normalized1 = set(data['review_normalized1'][i])
    normalized2 = set(data['review_normalized2'][i])
    diff = normalized1.symmetric_difference(normalized2)
    if diff:
        for word1, word2 in zip(sorted(normalized1 - normalized2), sorted(normalized2 - normalized1)):
            print("Index:", i, "Normalization1:", word1, "| Normalization2:", word2)

In [None]:
#convert sentiment to binary (0 or 1)
data['sentiment'] = data['sentiment'].apply(lambda x: 1 if x=='positive' else 0)
#data[['review_normalized1','review_normalized2','sentiment']].head()
data[['review_normalized2','sentiment']]

In [None]:
data[['review_normalized1','sentiment']]

#Word importance
See the word clouds for both sentiment classes.


In [None]:
from collections import defaultdict
from operator import itemgetter
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a dictionary to count the frequency of each word in each sentiment class
freq_dict = defaultdict(lambda: defaultdict(int))
for i, row in data[['review_normalized1','sentiment']].iterrows():
    sentiment = row['sentiment']
    words = " ".join(row['review_normalized1'])  # join the list of words with a space separator
    for word in words.split():
        freq_dict[sentiment][word] += 1

# Sort the dictionary by frequency and print out the top 50 words for each sentiment class
for sentiment in freq_dict:
    print(f"Sentiment {sentiment}:")
    top_words = sorted(freq_dict[sentiment].items(), key=itemgetter(1), reverse=True)[:50]
    for word, freq in top_words:
        print(f"{word}: {freq}")
    print()

In [None]:
from collections import defaultdict
from operator import itemgetter
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a dictionary to count the frequency of each word in each sentiment class
freq_dict = defaultdict(lambda: defaultdict(int))
for i, row in data[['review_normalized1','sentiment']].iterrows():
    sentiment = row['sentiment']
    words = " ".join(row['review_normalized1'])  # join the list of words with a space separator
    for word in words.split():
        freq_dict[sentiment][word] += 1

# Check for same words in both lists and delete the word with the lowest frequency count
for word in freq_dict[1].keys() & freq_dict[0].keys():
    if freq_dict[1][word] < freq_dict[0][word]:
        del freq_dict[1][word]
    else:
        del freq_dict[0][word]

# Sort the dictionary by frequency and print out the top 50 words for each sentiment class
for sentiment in freq_dict:
    print(f"Sentiment {sentiment}:")
    top_words = sorted(freq_dict[sentiment].items(), key=itemgetter(1), reverse=True)[:50]
    for word, freq in top_words:
        print(f"{word}: {freq}")
    print()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Define colors for each sentiment class
colors = {1: 'Greens', 0: 'Reds'}

for sentiment in freq_dict:
    # Generate word cloud for each sentiment
    wordcloud = WordCloud(width=600,
                          height=400,
                          random_state=2,
                          max_font_size=100,
                          background_color='white',
                          colormap=colors[sentiment]).generate_from_frequencies(freq_dict[sentiment])

    # Plot the word cloud
    plt.figure(figsize=(10, 7))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f"Sentiment {sentiment}")
    plt.show()

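#Hyperparameter tuning
Grid search with 5-fold cross-validation, scored by accuracy, was run on the first 1,000 training rows. Accuracy is measured on the first 500 test rows.

SGD best params: {'alpha': 0.001}

SVM best params: {'C': 1, 'kernel': 'linear'}

NB best params: {'alpha': 0.1}

SGD accuracy: 0.808

SVM accuracy: 0.808

NB accuracy: 0.778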

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data[['review_normalized1', 'sentiment']], test_size=0.33, random_state=42)

#Sentiment Distribution for Train and Test
print("Sentiment Distribution for Train:", Counter(train['sentiment']))
print("Sentiment Distribution for Test:", Counter(test['sentiment']))

X_train = [' '.join(words) for words in train['review_normalized1']]
X_test = [' '.join(words) for words in test['review_normalized1']]
y_train = train['sentiment']
y_test = test['sentiment']

In [None]:
#Text Vectorization
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tfidf_vec = TfidfVectorizer(min_df = 10, token_pattern = r'[a-zA-Z]+')
X_train_bow = tfidf_vec.fit_transform(X_train) # fit train
X_test_bow = tfidf_vec.transform(X_test) # transform test
print(X_train_bow.shape)
print(X_test_bow.shape)

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from scipy.stats import uniform
import numpy as np

# Limit the size of the dataset
X_train_bow_reduced = X_train_bow[:1000]
y_train_reduced = y_train[:1000]
X_test_bow_reduced = X_test_bow[:500]
y_test_reduced = y_test[:500]

# Define the parameter grid for each model
sgd_params = {
    'alpha': [0.0001, 0.001, 0.01, 0.1, 1],
}

svm_params = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
}

nb_params = {
    'alpha': [0.0001, 0.001, 0.01, 0.1, 1],
}

# Initialize the GridSearchCV objects for each model
sgd_gs = GridSearchCV(
    SGDClassifier(),
    sgd_params,
    cv=5,
    n_jobs=-1,
    scoring='accuracy',
    verbose=1
)

svm_gs = GridSearchCV(
    SVC(),
    svm_params,
    cv=5,
    n_jobs=-1,
    scoring='accuracy',
    verbose=1
)

nb_gs = GridSearchCV(
    MultinomialNB(),
    nb_params,
    cv=5,
    n_jobs=-1,
    scoring='accuracy',
    verbose=1
)

# Fit the models to the training data
sgd_gs.fit(X_train_bow_reduced, y_train_reduced)
svm_gs.fit(X_train_bow_reduced, y_train_reduced)
nb_gs.fit(X_train_bow_reduced, y_train_reduced)

# Evaluate the models on the test data
sgd_score = sgd_gs.score(X_test_bow_reduced, y_test_reduced)
svm_score = svm_gs.score(X_test_bow_reduced, y_test_reduced)
nb_score = nb_gs.score(X_test_bow_reduced, y_test_reduced)

print("SGD best params:", sgd_gs.best_params_)
print("SVM best params:", svm_gs.best_params_)
print("NB best params:", nb_gs.best_params_)
print("SGD accuracy:", sgd_score)
print("SVM accuracy:", svm_score)
print("NB accuracy:", nb_score)

#Compare performance of models
Based on the evaluation results, the best model is SGDClassifier, with an AUC-ROC score of 0.8215260851719264.
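AUC-ROC is a ranking metric, so its value depends on whether it receives continuous scores or hard 0/1 labels. The notebook's comparison passes the output of `predict`, which gives a ROC curve with a single threshold point. The sketch below is a standard-library illustration with hypothetical scores, where `rank_auc` is a helper written for this example and not a scikit-learn function. In scikit-learn, continuous scores would typically come from `decision_function` (SGDClassifier, SVC) or `predict_proba(...)[:, 1]` (MultinomialNB).

```python
def rank_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.7, 0.3, 0.8, 0.9]          # hypothetical continuous scores
labels = [1 if s >= 0.5 else 0 for s in scores]  # hard predictions at a 0.5 threshold

auc_scores = rank_auc(y_true, scores)
auc_labels = rank_auc(y_true, labels)

print(round(auc_scores, 3))  # 0.889
print(round(auc_labels, 3))  # 0.667
```

With hard labels the result equals balanced accuracy, the mean of the true positive rate and the true negative rate (2/3 each in this example), so it can understate how well a model ranks the reviews.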

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

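# Compare the performance of the models on the full test set.
# Note: predict() returns hard 0/1 labels, so each ROC curve below has a single
# threshold point and the AUC equals balanced accuracy. Passing decision_function()
# or predict_proba() scores instead would trace the full curve.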
sgd_pred = sgd_gs.predict(X_test_bow)
svm_pred = svm_gs.predict(X_test_bow)
nb_pred = nb_gs.predict(X_test_bow)

sgd_auc = roc_auc_score(y_test, sgd_pred)
svm_auc = roc_auc_score(y_test, svm_pred)
nb_auc = roc_auc_score(y_test, nb_pred)

print("AUC-ROC score for SGDClassifier:", sgd_auc)
print("AUC-ROC score for SVM:", svm_auc)
print("AUC-ROC score for Naive Bayes:", nb_auc)

# Visualizing AUC-ROC scores
fpr_sgd, tpr_sgd, _ = roc_curve(y_test, sgd_pred)
fpr_svm, tpr_svm, _ = roc_curve(y_test, svm_pred)
fpr_nb, tpr_nb, _ = roc_curve(y_test, nb_pred)

plt.figure(figsize=(8, 6))
plt.plot(fpr_sgd, tpr_sgd, label="SGDClassifier (AUC = {:.2f})".format(sgd_auc))
plt.plot(fpr_svm, tpr_svm, label="SVM (AUC = {:.2f})".format(svm_auc))
plt.plot(fpr_nb, tpr_nb, label="Naive Bayes (AUC = {:.2f})".format(nb_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

#Feature importance with the best model and its best parameters
One way to evaluate the importance of the features is to look at their corresponding coefficients in the linear model.

Positive weights imply a positive contribution of the feature to the prediction; negative weights imply a negative contribution.

The absolute values of the weights indicate the effect sizes of the features.
See the table below.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Create pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', SGDClassifier(alpha=0.001))
])

# Fit pipeline on training data
pipeline.fit(X_train, y_train)

# Extract the coefficients of the model from the pipeline
importances = pipeline.named_steps['clf'].coef_.flatten()

# Select top 10 positive/negative weights
top_indices_pos = np.argsort(importances)[::-1][:10]
top_indices_neg = np.argsort(importances)[:10]

# Get featnames from tfidfvectorizer
feature_names = np.array(pipeline.named_steps['tfidf'].get_feature_names_out())
feature_importance_df = pd.DataFrame({
    'FEATURE': feature_names[np.concatenate((top_indices_pos, top_indices_neg))],
    'IMPORTANCE': importances[np.concatenate((top_indices_pos, top_indices_neg))],
    'SENTIMENT': ['pos' for _ in range(len(top_indices_pos))] + ['neg' for _ in range(len(top_indices_neg))]
})
print(feature_importance_df)