# NAME: Emre Koç

# Installing Necessary Libraries

In [1]:
# I ran into bugs in some Python environments because of gensim and scipy, but these versions work well on Google Colab.

! pip install requests
! pip install beautifulsoup4
! pip install pandas
! pip install gensim
! pip install spacy
! python -m spacy download en_core_web_sm
! pip install scikit-learn
! pip install numpy
! pip install scipy==1.12


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting scipy==1.12
  Downloading scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.4 MB)
Installing collected packages: scipy
  Attempting uninstall: scipy
    Found existing installation: s

# Scraping Data

* I used Beautiful Soup, a well-known scraping library. Since the website is built for scraping practice, the data was well organized and easy to gather.

* Labeling the quotes, on the other hand, was troublesome. Some quotes had many tags, and there were only around 100 quotes in total. I could have tried multi-label classification, but it would not have been the best choice because the dataset is very small even for a basic classification task.

* So I tried an approach that ensures each quote has a single tag. I listed all tags with their frequencies, then assigned each quote the one of its own tags that has the highest overall frequency. This keeps the number of labels low. Even so, the result is not ideal: I still end up with 41 unique labels, whereas in an ideal scenario we would have at least 100 quotes for each of them.

* Nevertheless, I have well-organized data at the end.

In [7]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


baseUrl = 'https://quotes.toscrape.com/page/'

quotes_data = []
tag_frequency = {}

# Scrape pages 1 through 10.
for page in range(1, 11):
    url = baseUrl + str(page) + '/'
    response = requests.get(url)
    response.raise_for_status()  # stop early if a page fails to load
    soup = BeautifulSoup(response.content, 'html.parser')
    quote_blocks = soup.find_all('div', class_='quote')

    for quote_block in quote_blocks:
        text = quote_block.find('span', class_='text').get_text()
        tags = [tag.get_text() for tag in quote_block.find_all('a', class_='tag')]
        quotes_data.append({"quote": text, "tags": tags})
        # Count how often each tag appears across all quotes.
        for tag in tags:
            tag_frequency[tag] = tag_frequency.get(tag, 0) + 1


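# Keep one tag per quote: the tag that is most frequent across the whole dataset.
# Quotes without any tags are skipped.
updated_quotes_data = []
for quote in quotes_data:
    if quote["tags"]:
        max_tag = max(quote["tags"], key=lambda tag: tag_frequency.get(tag, 0))
        updated_quotes_data.append({"quote": quote["quote"], "tag": max_tag})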


data = pd.DataFrame(updated_quotes_data)

print(f"Number of quotes: {len(data)}")
print(f"Number of unique labels: {len(data['tag'].unique())}")



Number of quotes: 97
Number of unique labels: 41
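
A quick way to see how unevenly the labels are spread is to count the quotes per label. The sketch below uses a small hypothetical DataFrame (`toy`) in place of the scraped `data`, so the tag names and counts are illustrative assumptions only.

```python
import pandas as pd

# Hypothetical toy data standing in for the scraped DataFrame.
toy = pd.DataFrame({
    "quote": ["q1", "q2", "q3", "q4", "q5", "q6"],
    "tag": ["love", "love", "love", "life", "life", "humor"],
})

# Count how many quotes fall under each label, most frequent first.
label_counts = toy["tag"].value_counts()
print(label_counts)

# A label with a single quote cannot appear in both the train and the test split.
singletons = label_counts[label_counts == 1].index.tolist()
print(f"Labels with only one quote: {singletons}")
```

Running the same two steps on `data["tag"]` would show which labels dominate and which have too few examples to learn from.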


# Converting Data into Numerical Form

* Once I have the data, I need to convert it into numerical form. I followed a systematic approach: first I tokenize and lemmatize the words and remove stop words and punctuation, then I apply the method I have chosen for converting the text into vectors. I also encode the labels.

* The possible options are BoW, TF-IDF and word embeddings.
    BoW: This is the simplest one. It tokenizes the words and builds a vector from the frequency of each word in each sentence.
    TF-IDF: I have mainly used this method for SEO keyword extraction, so I am experienced with it. It is also a frequency-based model: it assigns each word a statistical weight that signifies the importance of the word.
    Word Embedding: This is the most advanced method. It uses pretrained NLP models to create vectors. I also commonly use this method for QA retrieval tasks.

* I think the common approach would be to choose word embeddings among these three. For that, I have two options: using an already trained model, or training a model on my own data. Since we have very limited and not very well-labeled data, using a model already trained on a very large dataset is beneficial.

* We would also lose any semantic relationship between words if word embeddings were not used. The frequency-based models would be enough for part of the data, especially when the label word appears inside the quote; however, that is not the case for all quotes. The quotes are short and data

In [3]:
import gensim.downloader
from gensim.models import KeyedVectors
import numpy as np

embedding_model = gensim.downloader.load('word2vec-google-news-300')




In [4]:
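# For reference, this function is taken from a Medium article: https://medium.com/@reddyyashu20/build-text-classification-model-using-word2vec-nlp-part2-52aa2839e8f4
def vectorize_sentence(sentence):
    # Average the word2vec vectors of the tokens found in the embedding vocabulary.
    vector_size = embedding_model.vector_size
    sentence_vector = np.zeros(vector_size)
    # Starting at 1 avoids division by zero when no token is in the vocabulary;
    # note that the sum is therefore divided by (number of known tokens + 1).
    counter = 1
    for word in sentence:
        if word in embedding_model:
            counter += 1
            sentence_vector += embedding_model[word]
    sentence_vector = sentence_vector/counter
    return sentence_vector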
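
Because `counter` starts at 1, the function above divides by one more than the number of known tokens. As an alternative, the sketch below computes a plain mean and handles the empty case explicitly. It is not the notebook's method: `toy_embeddings` and `mean_vector` are hypothetical names, and the 3-dimensional vectors stand in for the real 300-dimensional word2vec model.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the word2vec model.
toy_embeddings = {
    "love": np.array([1.0, 0.0, 0.0]),
    "life": np.array([0.0, 1.0, 0.0]),
}

def mean_vector(tokens, embeddings, vector_size=3):
    """Return the plain average of the vectors of all known tokens."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        # No token is in the vocabulary: fall back to a zero vector.
        return np.zeros(vector_size)
    return np.mean(known, axis=0)

both_known = mean_vector(["love", "life", "unknownword"], toy_embeddings)
none_known = mean_vector(["unknownword"], toy_embeddings)
print(both_known)
print(none_known)
```

Switching to a plain mean changes the scale of the feature vectors, so the accuracy reported later in this notebook would not necessarily be reproduced with it.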

In [5]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import string

nlp = spacy.load("en_core_web_sm")

stop_words = STOP_WORDS
punctuations = string.punctuation

def tokenizer(sentence):
    # Lemmatize and lowercase each token, then drop stop words and punctuation.
    doc = nlp(sentence)
    tokens = [word.lemma_.lower().strip() for word in doc]
    tokens = [word for word in tokens if word not in stop_words and word not in punctuations]
    return tokens


In [8]:
from sklearn.preprocessing import LabelEncoder

data['quote_token'] = data['quote'].apply(tokenizer)
data['vector_embedding'] = data['quote_token'].apply(vectorize_sentence)

label_encoder = LabelEncoder()
data["label"] = label_encoder.fit_transform(data["tag"])




# Training Model

* For the model, I considered several options: GaussianNB, LinearRegression and MultinomialNB. LinearRegression predicts continuous values rather than class labels, so it does not fit this multi-class task, and MultinomialNB does not support negative values in the feature vectors. So GaussianNB is a reasonable choice.
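
The MultinomialNB restriction mentioned above can be checked directly. The sketch below fits both models on a tiny made-up feature matrix (`X_toy`, `y_toy`) that contains negative values, as averaged word embeddings typically do.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Hypothetical 2-dimensional features with negative entries.
X_toy = np.array([[0.2, -0.1], [0.3, -0.2], [-0.4, 0.5], [-0.3, 0.6]])
y_toy = np.array([0, 0, 1, 1])

# MultinomialNB expects count-like, non-negative features and raises otherwise.
try:
    MultinomialNB().fit(X_toy, y_toy)
    multinomial_failed = False
except ValueError as err:
    multinomial_failed = True
    print(f"MultinomialNB rejected the input: {err}")

# GaussianNB models each feature as a normal distribution, so sign does not matter.
gaussian = GaussianNB().fit(X_toy, y_toy)
predictions = gaussian.predict(X_toy)
print(predictions)
```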



In [21]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


X = data['vector_embedding'].to_list()
y = data['label'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = GaussianNB()
model.fit(X_train, y_train)



# Evaluating Model

In [24]:
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))



Accuracy: 35.00%
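
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below, which uses made-up imbalanced labels rather than the notebook's data, measures how well a classifier does when it always predicts the most frequent training label.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels; the features are irrelevant for this baseline.
y_train_toy = ["love"] * 6 + ["life"] * 3 + ["humor"] * 1
y_test_toy = ["love", "love", "life", "humor"]
X_train_toy = [[0]] * len(y_train_toy)
X_test_toy = [[0]] * len(y_test_toy)

# Always predict the label that is most common in the training set.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)
baseline_accuracy = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Fitting the same `DummyClassifier` on `X_train` and `y_train` would show how much of the model's score comes from the dominant labels alone.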


# Final Reports

* The main problem was the lack of data: we had many labels and too few quotes for them. With 5 or 6 labels and the same amount of data, we could have obtained better results, because each label would have roughly 16 to 20 quotes under a balanced distribution. In our case the distribution is not balanced at all; some labels dominate.

* Apart from the data, a different choice of model and training setup (hyperparameter tuning, etc.) could improve the overall performance and accuracy.

* I have worked on a similar problem before, where we were trying to classify user comments. There were two objectives: sentiment classification as positive or negative, and classification by category. I have also seen examples of binary classification for labeling emails as spam or not spam.
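
As an illustration of the hyperparameter tuning mentioned in the report, the sketch below searches over GaussianNB's `var_smoothing` parameter with cross-validation. It runs on synthetic data from `make_classification`, not on the quotes, and the grid values are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic 3-class data standing in for the quote embeddings.
X_syn, y_syn = make_classification(
    n_samples=120, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Try five smoothing values spread logarithmically between 1e-9 and 1e-1.
param_grid = {"var_smoothing": np.logspace(-9, -1, 5)}
search = GridSearchCV(GaussianNB(), param_grid, cv=3, scoring="accuracy")
search.fit(X_syn, y_syn)

print(f"Best var_smoothing: {search.best_params_['var_smoothing']}")
print(f"Mean cross-validated accuracy: {search.best_score_:.2f}")
```

With 41 labels and 97 quotes, many labels have too few examples for stratified cross-validation, so a search like this would become practical only after reducing the number of labels or collecting more data.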