<a href="https://colab.research.google.com/github/abbddos/AKDN_SEHAR_QR_Code-Generator/blob/main/Text_classifyer_for_small_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text classification on smaller datasets

In humanitarian contexts, text datasets are rarely large: they barely reach 200 records and take a very long time to build. It is therefore necessary to find out whether text classifiers can make accurate predictions from smaller datasets.

To test this, and since no MHPSS datasets are available at the moment, the dataset from the previous article-classification exercise is reused and cut down to only 100 records.

The same methodology as before was used to determine which algorithm makes the most accurate predictions.

In [1]:
#Basic data analysis libraries...
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk

#Required Natural Language libraries...
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
from wordcloud import WordCloud

#Required ML libraries and packages...
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import make_scorer, roc_curve, roc_auc_score
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB,MultinomialNB,BernoulliNB

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [4]:
# Importing dataset...

df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/BBC News Train.csv')
df.head()

| | ArticleId | Text | Category |
|---|---|---|---|
| 0 | 1833 | worldcom ex-boss launches defence lawyers defe... | business |
| 1 | 154 | german business confidence slides german busin... | business |
| 2 | 1101 | bbc poll indicates economic gloom citizens in ... | business |
| 3 | 1976 | lifestyle governs mobile choice faster bett... | tech |
| 4 | 917 | enron bosses in $168m payout eighteen former e... | business |


In [5]:
df.shape

(1490, 3)

In [67]:
# Shrinking dataset to 100 records.

dataset = df.sample(n=100, random_state=1)
dataset.head()

| | ArticleId | Text | Category |
|---|---|---|---|
| 91 | 1756 | 2d metal slug offers retro fun like some drill... | tech |
| 1103 | 1108 | blair stresses prosperity goals tony blair say... | politics |
| 909 | 1955 | weak dollar trims cadbury profits the world s ... | business |
| 683 | 63 | court rejects $280bn tobacco case a us governm... | business |
| 561 | 293 | christmas song formula unveiled a formula for... | entertainment |


In [7]:
dataset.shape

(100, 3)

In [68]:
dataset['Category'].value_counts()

sport            23
politics         21
business         20
tech             18
entertainment    18
Name: Category, dtype: int64
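
The random sample above happens to come out fairly balanced. With so few records, the class counts could instead be fixed explicitly. A minimal sketch, assuming a hypothetical toy frame `toy_df` in place of the BBC data:

```python
import pandas as pd

# Hypothetical toy data standing in for the BBC articles.
toy_df = pd.DataFrame({
    "Text": [f"article {i}" for i in range(50)],
    "Category": ["sport", "politics", "business", "tech", "entertainment"] * 10,
})

# Draw the same number of rows from every category
# (4 per class here; 20 per class would give a 100-record sample).
balanced = toy_df.groupby("Category").sample(n=4, random_state=1)

print(balanced["Category"].value_counts())
```

Whether equal class sizes are desirable depends on how the real humanitarian data is distributed; this only shows the mechanics.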

In [71]:
# Step 1. Assignment of numeric values to categories.

dataset['CategoryID'] = dataset['Category'].factorize()[0]
dataset.head()

| | ArticleId | Text | Category | CategoryID |
|---|---|---|---|---|
| 91 | 1756 | 2d metal slug offers retro fun like some drill... | tech | 0 |
| 1103 | 1108 | blair stresses prosperity goals tony blair say... | politics | 1 |
| 909 | 1955 | weak dollar trims cadbury profits the world s ... | business | 2 |
| 683 | 63 | court rejects $280bn tobacco case a us governm... | business | 2 |
| 561 | 293 | christmas song formula unveiled a formula for... | entertainment | 3 |


In [72]:
category = dataset[['Category', 'CategoryID']].drop_duplicates().sort_values('CategoryID')
category

| | Category | CategoryID |
|---|---|---|
| 91 | tech | 0 |
| 1103 | politics | 1 |
| 909 | business | 2 |
| 561 | entertainment | 3 |
| 980 | sport | 4 |
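
The numeric IDs follow the order in which each category first appears in the sample, which is why `tech` is 0 here. A small sketch of how `factorize` assigns codes and how a reverse lookup could be built from its second return value (the `categories` series and the `id_to_category` name are illustrative, not part of the notebook):

```python
import pandas as pd

# Illustrative series in the same order as the first rows of the sample.
categories = pd.Series(["tech", "politics", "business", "business", "entertainment"])

# factorize returns integer codes plus the unique labels in order of first appearance.
codes, uniques = pd.factorize(categories)

# Lookup table from numeric ID back to category name.
id_to_category = dict(enumerate(uniques))

print(codes)
print(id_to_category)
```

Building the lookup from `uniques` keeps the ID-to-name mapping correct even if a different sample changes the order of first appearance.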


In [13]:
# Step 2. Text cleanup.
# Step 2.1. Removing all tags.

def remove_tags(text):
  # The pattern was empty in the source; '<.*?>' (any HTML-style tag) is assumed here.
  remove = re.compile(r'<.*?>')
  return re.sub(remove, '', text)

# Step 2.2. Removing special characters.
def special_char(text):
  reviews = ''
  for x in text:
    if x.isalnum():
      reviews = reviews + x
    else:
      reviews = reviews + ' '
  return reviews

# Step 2.3. Converting text to lower case.
def convert_lower(text):
   return text.lower()

# Step 2.4. Removing all stop words.
def remove_stopwords(text):
  stop_words = set(stopwords.words('english'))
  words = word_tokenize(text)
  return [x for x in words if x not in stop_words]

# Step 2.5. Lemmatize text.
def lemmatize_word(text):
  wordnet = WordNetLemmatizer()
  return " ".join([wordnet.lemmatize(word) for word in text])

# Step 2.6. Apply all of the above steps to records.
dataset['Text'] = dataset['Text'].apply(remove_tags).apply(special_char).apply(convert_lower)
dataset['Text'] = dataset['Text'].apply(remove_stopwords).apply(lemmatize_word)

#dataset['Text']

91      2d metal slug offer retro fun like drill serge...
1103    blair stress prosperity goal tony blair say pa...
909     weak dollar trim cadbury profit world biggest ...
683     court reject 280bn tobacco case u government c...
561     christmas song formula unveiled formula ultima...
                              ...                        
1087    blue slam blackburn savage birmingham confirme...
1377    anti terror plan face first test plan allow ho...
694     time get tough friendly international manager ...
559     india unveils anti poverty budget india boost ...
528     u bank loses customer detail bank america reve...
Name: Text, Length: 100, dtype: object
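
To make the effect of the first three cleanup steps concrete, here is a worked example on a single string. It is a simplified sketch: the function name `basic_clean` is hypothetical, the tag pattern `<.*?>` is an assumption, and stopword removal and lemmatization are left out so that no NLTK downloads are needed.

```python
import re

def basic_clean(text):
    # Step 2.1 (assumed pattern): drop anything that looks like an HTML tag.
    text = re.sub(r"<.*?>", "", text)
    # Step 2.2: replace every non-alphanumeric character with a space.
    text = "".join(ch if ch.isalnum() else " " for ch in text)
    # Step 2.3: lower-case, then collapse repeated whitespace.
    return " ".join(text.lower().split())

print(basic_clean("<b>Court</b> rejects $280bn tobacco case!"))
```

Punctuation and currency symbols become spaces, so `$280bn` survives as the token `280bn`, as in the cleaned output above.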

In [14]:
# Step 3. Convert cleaned up texts into vectors using CountVectorizer...
# x holds one row of word counts per article; y holds the numeric category labels.
cv = CountVectorizer(max_features = 5000)
x = cv.fit_transform(dataset['Text']).toarray()
y = np.array(dataset['CategoryID'].values)
print("X.shape = ",x.shape)
print("y.shape = ",y.shape)

X.shape =  (100, 5000)
y.shape =  (100,)
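
Each of the 5000 columns is one vocabulary term and each cell is how often that term occurs in an article. A toy illustration on three made-up documents (`docs` and `toy_cv` are hypothetical names):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up, already cleaned documents.
docs = ["weak dollar trim profit", "court reject tobacco case", "dollar profit profit"]

toy_cv = CountVectorizer()
counts = toy_cv.fit_transform(docs).toarray()

# vocabulary_ maps each term to its column index.
terms = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(terms)
print(counts)
```

The third row has a 2 in the `profit` column because the word appears twice in that document.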


In [15]:
# Step 4. Split data into train and test subsets...
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0, shuffle = True)
print(len(x_train))
print(len(x_test))

70
30
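
With only 30 test records, a purely random split can leave some categories with very few test examples. A possible refinement, sketched on placeholder data with the same class counts as the 100-record sample (`toy_x` and `toy_y` are hypothetical), is to pass `stratify` to `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with the same class counts as the sample:
# tech 18, politics 21, business 20, entertainment 18, sport 23.
toy_y = np.repeat([0, 1, 2, 3, 4], [18, 21, 20, 18, 23])
toy_x = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder features

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.3, random_state=0, stratify=toy_y
)

# Class counts in each subset stay close to the 70/30 proportion.
print(np.bincount(y_tr))
print(np.bincount(y_te))
```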


In [43]:
# Step 5. Applying several classifiers and testing them...
#   ... in order to select the best model to apply prediction.

perform_list = []

def run_model(model_name, est_c, est_pnlty):
  # est_c and est_pnlty are accepted but not used by any of the models below.
  if model_name == 'Logistic Regression':
    mdl = LogisticRegression()
  elif model_name == 'Random Forest':
    mdl = RandomForestClassifier(n_estimators=100 ,criterion='entropy' , random_state=0)
  elif model_name == 'Multinomial Naive Bayes':
    mdl = MultinomialNB(alpha=0.5,fit_prior=True)
  elif model_name == 'Support Vector Classifer':
    mdl = SVC()
  elif model_name == 'Decision Tree Classifier':
    mdl = DecisionTreeClassifier()
  elif model_name == 'K Nearest Neighbour':
    mdl = KNeighborsClassifier(n_neighbors=10 , metric= 'minkowski' , p = 4)
  elif model_name == 'Gaussian Naive Bayes':
    mdl = GaussianNB()
  else:
    raise ValueError(f'Unknown model name: {model_name}')

  oneVsRest = OneVsRestClassifier(mdl)
  oneVsRest.fit(x_train, y_train)
  y_pred = oneVsRest.predict(x_test)

  accuracy = round(accuracy_score(y_test, y_pred) * 100, 2)

  precision, recall, f1score, support = score(y_test, y_pred, average='micro')

  print(f'Test Accuracy Score of Basic {model_name}: % {accuracy}')
  print(f'Precision : {precision}')
  print(f'Recall : {recall}')
  print(f'F1-score : {f1score}')

  perform_list.append(dict([
    ('Model', model_name),
    ('Test Accuracy', round(accuracy, 2)),
    ('Precision', round(precision, 2)),
    ('Recall', round(recall, 2)),
    ('F1', round(f1score, 2))
  ]))

In [44]:
mods = [
        'Random Forest',
        'Multinomial Naive Bayes',
        'Support Vector Classifer',
        'Decision Tree Classifier',
        'Gaussian Naive Bayes'
]

for mod in mods:
  run_model(mod, est_c=None, est_pnlty=None)
  

Test Accuracy Score of Basic Random Forest: % 70.0
Precision : 0.7
Recall : 0.7
F1-score : 0.7
Test Accuracy Score of Basic Multinomial Naive Bayes: % 96.67
Precision : 0.9666666666666667
Recall : 0.9666666666666667
F1-score : 0.9666666666666667
Test Accuracy Score of Basic Support Vector Classifer: % 60.0
Precision : 0.6
Recall : 0.6
F1-score : 0.6
Test Accuracy Score of Basic Decision Tree Classifier: % 23.33
Precision : 0.23333333333333334
Recall : 0.23333333333333334
F1-score : 0.23333333333333334
Test Accuracy Score of Basic Gaussian Naive Bayes: % 33.33
Precision : 0.3333333333333333
Recall : 0.3333333333333333
F1-score : 0.3333333333333333


In [45]:
model_performance = pd.DataFrame(data=perform_list)
model_performance = model_performance[['Model', 'Test Accuracy', 'Precision', 'Recall', 'F1']]
model_performance

| | Model | Test Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | Random Forest | 70.0 | 0.7 | 0.7 | 0.7 |
| 1 | Multinomial Naive Bayes | 96.67 | 0.97 | 0.97 | 0.97 |
| 2 | Support Vector Classifer | 60.0 | 0.6 | 0.6 | 0.6 |
| 3 | Decision Tree Classifier | 23.33 | 0.23 | 0.23 | 0.23 |
| 4 | Gaussian Naive Bayes | 33.33 | 0.33 | 0.33 | 0.33 |
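
Precision, recall and F1 match the accuracy in every row because `run_model` uses `average='micro'`: when each article has exactly one label, micro-averaged precision, recall and F1 all reduce to accuracy. A small sketch on made-up labels (`y_true` and `y_hat` are hypothetical) comparing micro with macro averaging, which weights every class equally:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up true labels and predictions for a three-class problem.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_hat = [0, 0, 0, 1, 1, 2, 2, 2, 2, 0]

acc = accuracy_score(y_true, y_hat)
micro = precision_recall_fscore_support(y_true, y_hat, average="micro")[:3]
macro = precision_recall_fscore_support(y_true, y_hat, average="macro")[:3]

print("accuracy:", acc)
print("micro (precision, recall, F1):", micro)
print("macro (precision, recall, F1):", macro)
```

Macro averages, or the per-class figures from `classification_report`, would show whether a model is weak on one particular category, which the micro figures cannot.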


The results above show that the **Multinomial Naive Bayes** classifier, at *alpha = 0.5*, was more accurate than the other classifiers on the 30-record test set (96.67%). With *alpha > 0.5*, accuracy was significantly lower and prediction errors appeared.
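
The observation about *alpha* rests on a single 70/30 split. One way to check it more systematically would be a cross-validated search over several values. The sketch below uses a made-up two-class mini corpus (`texts`, `labels`) purely to show the mechanics; on the real data the inputs would be the cleaned `dataset['Text']` and `dataset['CategoryID']`, and the grid values are arbitrary choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up mini corpus: six "sport" and six "politics" snippets.
texts = [
    "team win match goal", "player score goal season", "coach team cup final",
    "striker injury match club", "league title win player", "cup final goal coach",
    "minister vote election party", "government tax policy vote", "party leader election plan",
    "parliament bill minister law", "election campaign party leader", "policy law government bill",
]
labels = [0] * 6 + [1] * 6

# Putting the vectorizer in the pipeline refits the vocabulary inside every fold.
pipe = Pipeline([("vec", CountVectorizer()), ("nb", MultinomialNB())])
search = GridSearchCV(
    pipe,
    param_grid={"nb__alpha": [0.1, 0.5, 1.0, 2.0]},
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(texts, labels)

print(search.best_params_)
print(search.cv_results_["mean_test_score"])
```

Averaging over folds makes the comparison between *alpha* values less dependent on which 30 records happen to land in the test set.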

In [73]:
MNB = MultinomialNB(alpha=0.5,fit_prior=True)
MNB.fit(x_train, y_train)

# Display names keyed by CategoryID (see the Category/CategoryID table above).
labels = {0: "Tech", 1: "Politics", 2: "Business", 3: "Entertainment", 4: "Sports"}

def Get_prediction(article):
  # Vectorize the article with the CountVectorizer fitted on the training text.
  pred = cv.transform([article])
  score = MNB.predict(pred)
  # Unknown IDs fall back to an empty string.
  result = labels.get(int(score[0]), "")
  return (result, score)
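
Note that `Get_prediction` passes the raw article to `cv.transform`, whereas the training text went through the Step 2 cleanup first. `CountVectorizer` lower-cases and tokenizes by default, but it does not remove stopwords or lemmatize. One way to keep training and prediction consistent is to hand the cleanup function to the vectorizer through its `preprocessor` argument. A minimal sketch, where `clean` is a simplified, hypothetical stand-in for the notebook's cleanup steps:

```python
from sklearn.feature_extraction.text import CountVectorizer

def clean(text):
    # Simplified stand-in: non-alphanumerics become spaces, then lower-case.
    return "".join(ch if ch.isalnum() else " " for ch in text).lower()

# A custom preprocessor replaces the vectorizer's default lower-casing step,
# so the function has to lower-case the text itself.
vec = CountVectorizer(preprocessor=clean)
vec.fit(["The court rejects a $280bn tobacco case", "Weak dollar trims profits"])

# The same cleanup is now applied automatically at prediction time.
row = vec.transform(["THE Court: tobacco-case!"]).toarray()[0]
print({term: int(row[idx]) for term, idx in vec.vocabulary_.items() if row[idx]})
```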

In [60]:
article = """
In a phone call late on Thursday, the Russian president said such sanctions would be a "colossal mistake".

Mr Biden, meanwhile, told Mr Putin that the US and its allies would respond decisively to any invasion of Ukraine.

The call, requested by Russia, was the pair's second such conversation this month and lasted for almost an hour.

It marked the latest effort to defuse tensions over Ukraine's eastern border with Russia, where Ukrainian officials say more than 100,000 Russian troops have been sent.

The build-up has prompted concern in the West, with the US threatening Mr Putin with sanctions "like none he's ever seen" if Ukraine comes under attack.

Russia, however, denies it is planning to invade the country and says the troops are there for exercises. It says it is entitled to move its troops freely on its own soil.

Although the two sides exchanged warnings during the call, Russian foreign policy adviser Yuri Ushakov told reporters shortly after that Mr Putin was "pleased" with the conversation. He added that it had created a "good backdrop" for future talks.

A senior US official, who spoke on condition of anonymity, said the tone had been "serious and substantive."

"President Biden reiterated that substantive progress in these dialogues can occur only in an environment of de-escalation," White House Press Secretary Jen Psaki said.

"He made clear that the United States and its allies and partners will respond decisively if Russia further invades Ukraine," she added.

US and Russian officials are set to meet for in-person talks in Geneva next month, and the White House said Mr Biden urged his Russian counterpart to pursue a diplomatic solution.

In a holiday message before Thursday's call, Mr Putin told Mr Biden he was "convinced" the pair could work together based on "mutual respect and consideration of each other's national interests".

His spokesman, Dmitry Peskov, said Moscow was "in the mood for a conversation"."""


In [61]:
res = Get_prediction(article)
print(res)

('Politics', array([1]))


In [62]:
buss = """
The electric vehicle firm announced it was recalling 356,309 vehicles because of potential rear-view camera issues affecting 2017-2020 Model 3 Teslas.

A further 119,009 Model S vehicles will also be recalled because of potential problems with the front trunk, or boot.

The total recall figure is almost equivalent to the 500,000 cars Tesla delivered last year, Reuters reports.

The BBC has approached Tesla for comment.

A safety report, submitted this month, estimates that around 1% of recalled Model 3s may have a defective rear-view camera.

Over time "repeated opening and closing of the trunk lid" may cause excessive wear to a cable that provides the rear-view camera feed, says a Safety Recall report submitted by Tesla to the National Highway Traffic Safety Administration (NHTSA) in the US on the 21 December.

If the wear causes the core of the cable to separate "the rear-view camera feed is not visible on the centre display", the report notes.

The loss of the review camera display may "increase the risk of collision", it adds.

The Model S recall involves vehicles manufactured between 2014-2021, some of which may have a problem with a "secondary latch" on the front trunk, or boot.

In another Safety Recall report, also filed on 21 December, Tesla notes the fault could mean, if the primary latch is inadvertently released, the front trunk "may open without warning and obstruct the driver's visibility, increasing the risk of a crash".

Around 14% of recalled Model S's may have the defect, the report notes.

In both cases, the reports state that "Tesla is not aware of any crashes, injuries, or deaths" relating to the potential faults."""

res = Get_prediction(buss)
print(res)

('Tech', array([0]))


In [63]:
entertainment = """
Rupert Grint absolutely loves being a dad. The 33-year-old actor touched on how fatherhood has changed him while promoting season 3 of his Apple TV+ show, Servant. From M. Night Shyamalan, the thriller follows a mourning Philadelphia couple whose marriage is up in the air after an unspeakable tragedy. The rift then opens the door for a mysterious force to enter their home.

"[Fatherhood], it's definitely changed my perspective," Grint told ET's Lauren Zima, before touching on the show's theme of how far a person would go for their child. "Since becoming a dad, kind of midway through, just to really have a better sense of what that can do to a family, that kind of level of loss is unimaginable. And yeah, I mean, it's quite hard for me to kind of really completely face that directly. I just find it just incredible."

"It's a weird place to be, especially when Wednesday first came. I remember I brought her to the set this season," he continued. "She thought she was at Sesame Street, which was very far away from Sesame Street. But yeah, it's really interesting."""

res = Get_prediction(entertainment)
print(res)

('Entertainment', array([3]))


In [64]:
ent = """
Clearly a burgeoning talent, she took over the reins of the TV show in the early 1950s, and was subsequently nominated for her first Emmy Award for best TV actress in 1951. It was the new awards' first ever category to recognise the achievements of women on the box.

Next came the TV sitcom Life with Elizabeth - a show she launched herself, alongside George Tibbles. "He wrote and I produced," she explained. "I was one of the first women producers in Hollywood."

White kept her profile high with many chat show appearances, and met her third husband (and "love of my life") Allen Ludden on the game show Password in 1961. The couple were married from 1963 until Ludden's death in 1981."""

res = Get_prediction(ent)
print(res)

('Entertainment', array([3]))


In [65]:
sprt = """
Brown, whose previous best was 46 against the New York Knicks in October, scored 21 points in the fourth quarter.

It helped the Celtics come from 14 points down with four minutes and 20 seconds left in regulation time.

He started overtime with his fifth three-pointer of the game, and also had 11 rebounds and four assists.

"I was just trying to be aggressive the entire time," he said.

"My team-mates encouraged me to take the shots. I feel like I took some good looks and they went down tonight."

Brown's layup with 38 seconds left in regulation levelled the score at 98-98 and he scored again to put Boston up 100-98 with 30 seconds left, but Tim Frazier scored for Orlando to force overtime.

Both teams were missing their leading scorers with Orlando's Cole Anthony sidelined by a sprained ankle while Boston's Jayson Tatum missed a fourth straight game because of Covid-19 concerns.

Elsewhere, Dallas Mavericks star Luka Doncic marked his return from a 10-game absence with 14 points, 10 assists and nine rebounds in their 95-86 away victory over the Oklahoma City Thunder.

The 22-year-old had missed five games with an ankle injury and then five more because of Covid.

"My chest was burning," said the Slovenian, who played 31 minutes on his comeback. "It was a weird feeling, but happy. Very happy."

Thunder rookie Josh Giddey became the youngest NBA player to post a triple-double with 17 points, 14 assists and 13 rebounds.

Aged 19 years and 84 days, Giddey surpassed LaMelo Ball's record of 19 years and 140 days, and also came up with four steals in his first game back after missing the last three for Covid reasons.
"""

res = Get_prediction(sprt)
print(res)

('Sports', array([4]))


In [66]:
debt="""
The statement to the stock exchange did not give a reason for the trading halt.

Evergrande has more than $300bn (£222bn) of debt and is scrambling to raise cash by selling assets and shares to repay suppliers and creditors.

Last week, the company dialled back plans to repay investors in its wealth management products.

Evergrande said on Friday that each investor in its wealth management product could expect to receive $1,257 each month as principal payment for three months irrespective of when the investment matures.

The company had earlier not mentioned any amount and had agreed to repay 10% of the investment by the end of the month when the product matures.

Evergrande said in a statement posted on the wealth unit's website that the situation was not "ideal" and that it would "actively raise funds", and update the repayment plan in late March, without giving further details.

The announcement was seen as highlighting the deepening cash squeeze at the struggling property developer.

Last week, Evergrande did not make some interest payments on its offshore bonds.

Over the weekend, local media reported that a city government on the Chinese resort island of Hainan had ordered the company on 30 December to demolish its 39 residential buildings there within 10 days, as they were built illegally.

Evergrande has yet to comment on the reports.

The company's $19bn in international bonds were deemed to be in default by rating agencies after it missed a payment deadline last month.
"""
res = Get_prediction(debt)
print(res)

('Business', array([2]))


In [75]:
ttt = """
The Biden administration wants to extend the life of the International Space Station to 2030, keeping it aloft despite mounting tensions with Russia, its main partner on the orbiting laboratory.

The announcement by NASA on Friday comes a day after Russian President Vladimir Putin warned that any new sanctions stemming from the growing crisis in Ukraine could lead to “a complete rupture of relations.” And last month, Russia fired a missile that destroyed an inactive weather satellite and created a large field of more than 1,500 pieces of debris that threatened the space station as well as a host of other satellites.

While the act was condemned by the Biden administration, and NASA Administrator Bill Nelson called it “reckless and dangerous,” Nelson also said the attack was an act of the Russian military that surprised the Russian space agency.
"""

res = Get_prediction(ttt)
print(res)

# The text above was copied from a Tech article on the CNN news website. Although CNN filed it under "Tech",
# the classifier predicted "Business". On reading, the text contains no keywords associated with Tech at all;
# its expressions and terminology are a mix of ideas that are more political and business-related than technological.
# As such, the classifier works.

('Business', array([2]))
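
For borderline articles like this one, `predict` only reports the winning category. `MultinomialNB.predict_proba` returns a probability for every class, which shows how close the runner-up categories were. A sketch on a made-up mini corpus (`train_texts`, `train_labels` and the helper `ranked_predictions` are hypothetical, not the notebook's model):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training snippets, two per category.
train_texts = [
    "rocket satellite launch orbit", "software chip device launch",
    "minister sanction election government", "president government sanction talk",
    "profit market share bank", "debt bank investor market",
]
train_labels = ["Tech", "Tech", "Politics", "Politics", "Business", "Business"]

vec = CountVectorizer()
nb = MultinomialNB(alpha=0.5)
nb.fit(vec.fit_transform(train_texts), train_labels)

def ranked_predictions(article):
    """Return (label, probability) pairs, most likely first."""
    probs = nb.predict_proba(vec.transform([article]))[0]
    return sorted(zip(nb.classes_, probs), key=lambda pair: pair[1], reverse=True)

for label, p in ranked_predictions("government sanction threaten satellite orbit"):
    print(f"{label}: {p:.2f}")
```

Naive Bayes probabilities tend to be overconfident, so they are better read as a ranking than as calibrated likelihoods.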
