### CIS 9: Lab 4
Natural Language Processing: Multinomial NB

In [815]:
### Name: Arnav Kumar

In this lab you will train an ML model to categorize news articles.

In [816]:
# import modules
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import nltk

[BBC News](https://www.bbc.com/news) is a British news organization that reports on current events around the world. In this exercise you will train an NLP model to categorize news articles by topic. The model will determine whether a news article is about sports, politics, etc.

The training data are from BBC News and have been preprocessed for ML. The training input file is `news.csv` ([source](https://www.kaggle.com/datasets/dheemanthbhat/bbc-full-text-preprocessed?select=docs_stage_3_preprocessed.csv))

1. __Read data from _news.csv_ into a DataFrame__.<br>
Then __print the number of rows and columns of the DataFrame__<br>
and __print the first 5 rows__ to see what the data looks like.

In [817]:
df = pd.read_csv("news.csv")
display(df[:5])

Unnamed: 0,DocId,DocTextlen,DocText,ADJ,ADP,ADV,AUX,CCONJ,DET,NOUN,...,PUNCT,SCONJ,SYM,VERB,X,INTJ,DocType,FileSize,FilePath,DocCat
0,B_001,2553,ad sale boost time_warner profit quarterly pro...,31,61,15.0,15.0,13.0,28,114,...,55,3.0,9.0,53,0.0,0.0,Business,2560,../input/bbc-full-text-document-classification...,0
1,B_002,2248,dollar gain greenspan speech dollar hit high l...,33,54,15.0,21.0,9.0,44,99,...,43,5.0,2.0,43,0.0,0.0,Business,2252,../input/bbc-full-text-document-classification...,0
2,B_003,1547,yukos unit buyer face loan claim owner embattl...,11,32,3.0,15.0,4.0,25,71,...,26,3.0,4.0,42,0.0,0.0,Business,1552,../input/bbc-full-text-document-classification...,0
3,B_004,2395,high fuel price hit ba profit british_airways ...,36,53,16.0,17.0,8.0,26,114,...,62,8.0,10.0,45,0.0,0.0,Business,2412,../input/bbc-full-text-document-classification...,0
4,B_005,1565,pernod takeover talk lift domecq share uk drin...,15,32,5.0,13.0,8.0,14,68,...,35,5.0,3.0,26,0.0,0.0,Business,1570,../input/bbc-full-text-document-classification...,0


---

2. Data cleaning

2a. __Print all the column labels__.

In [818]:
print(df.columns.values)

['DocId' 'DocTextlen' 'DocText' 'ADJ' 'ADP' 'ADV' 'AUX' 'CCONJ' 'DET'
 'NOUN' 'NUM' 'PART' 'PRON' 'PROPN' 'PUNCT' 'SCONJ' 'SYM' 'VERB' 'X'
 'INTJ' 'DocType' 'FileSize' 'FilePath' 'DocCat']


2b. Since the data have been preprocessed, each row or news article has multiple features, some of which we don't need for our ML training purpose.

The column labels that are all uppercase, such as ADJ, ADV, NOUN..., denote the count of adjectives, adverbs, nouns... in the article. We can remove these columns because Parts of Speech are not used by the MultinomialNB model.

The columns we want to keep are:
- DocText: contains the news articles
- DocType: categories of the news articles, as strings
- DocCat: categories of the news articles, as numbers

Given that the Parts of Speech columns can be removed for the reason above, create a Raw NBConvert cell to __explain why the other columns can also be removed__, so that we keep only the 3 columns DocText, DocType, and DocCat.

2c. __Create a DataFrame with the 3 columns__ that you want to keep.<br>
Then __print the first 5 rows__ of the DataFrame.

In [819]:
# keep only the 3 columns needed for training
df = df[['DocText', 'DocType', 'DocCat']].copy()
display(df[:5])

Unnamed: 0,DocText,DocType,DocCat
0,ad sale boost time_warner profit quarterly pro...,Business,0
1,dollar gain greenspan speech dollar hit high l...,Business,0
2,yukos unit buyer face loan claim owner embattl...,Business,0
3,high fuel price hit ba profit british_airways ...,Business,0
4,pernod takeover talk lift domecq share uk drin...,Business,0


2d. __Shorten the column labels__ by removing the 'Doc' from each label and lowercase all letters.<br>
Then __print the first 5 rows__ of the DataFrame.

In [820]:
df = df.rename(columns={"DocText":"text", "DocType":"type", "DocCat":"cat"})
display(df[:5])

Unnamed: 0,text,type,cat
0,ad sale boost time_warner profit quarterly pro...,Business,0
1,dollar gain greenspan speech dollar hit high l...,Business,0
2,yukos unit buyer face loan claim owner embattl...,Business,0
3,high fuel price hit ba profit british_airways ...,Business,0
4,pernod takeover talk lift domecq share uk drin...,Business,0


2e. __Check and remove any NaN__.

In [821]:
df=df.dropna()

In [None]:
### show result of checking NaN             -1/2pt
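
One possible way to show the result of the NaN check, both before and after dropping rows, is sketched below. The `demo` DataFrame is a made-up stand-in for the lab's `df`, so its contents are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the lab DataFrame, with one missing article text.
demo = pd.DataFrame({
    "text": ["ad sale boost profit", np.nan, "dollar gain greenspan speech"],
    "type": ["Business", "Business", "Business"],
    "cat": [0, 0, 0],
})

# Count missing values per column before cleaning.
nan_before = demo.isna().sum()
print(nan_before)

# Drop rows that contain any NaN, then confirm that none are left.
demo = demo.dropna()
nan_after = demo.isna().sum()
print(nan_after)
print(demo.shape)
```

Printing the counts before and after makes it visible whether `dropna()` actually removed anything.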

---

3. Analyze data

3a. __Show the count of each DocType category__<br>
and then __show the count of each DocCat category__

In [822]:
print(df.groupby(['type']).size())
print(df.groupby(['cat']).size())

type
Business         510
Entertainment    381
Politics         413
Sport            506
Tech             395
dtype: int64
cat
0    510
1    381
2    413
3    506
4    395
dtype: int64


3b. The output seems to show that there is a one-to-one correspondence between the strings in DocType and the numbers in DocCat.

Write code to __print the proof that they correspond with each other__. This means showing that all "Business" DocType rows are 0 in DocCat, all "Sport" DocType rows are 3 in DocCat, etc.

_Challenge: write a loop to check and print the 5 results, instead of copy-paste code 5 times_

In [823]:
# for each (type, cat) pair, check that every row of that type has that cat
# (.all() is needed here: .any() would only show that at least one row matches)
pairs = [('Business', 0), ('Entertainment', 1), ('Politics', 2), ('Sport', 3), ('Tech', 4)]
for doc_type, doc_cat in pairs:
    print((df.loc[df['type'] == doc_type, 'cat'] == doc_cat).all())

True
True
True
True
True
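
A complementary way to check the correspondence is to count how many distinct values of one column appear for each value of the other. The sketch below uses a small made-up `demo` DataFrame (not the lab data) with the same `type` and `cat` column names, so the names and values are assumptions for illustration only.

```python
import pandas as pd

# Made-up stand-in for the lab DataFrame (hypothetical rows).
demo = pd.DataFrame({
    "type": ["Business", "Sport", "Tech", "Sport", "Business"],
    "cat": [0, 3, 4, 3, 0],
})

# Each type should map to exactly one cat, and each cat to exactly one type.
cats_per_type = demo.groupby("type")["cat"].nunique()
types_per_cat = demo.groupby("cat")["type"].nunique()
one_to_one = bool((cats_per_type == 1).all() and (types_per_cat == 1).all())
print(one_to_one)

# A cross-tabulation shows the same thing visually: one nonzero cell per row.
table = pd.crosstab(demo["type"], demo["cat"])
print(table)
```

If any type were paired with two different numbers, `cats_per_type` would contain a value greater than 1 and the cross-tabulation would show more than one nonzero cell in that row.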


3c. __Create a lookup table__ which is a dictionary where each unique DocCat value is the key, and the corresponding DocType string is the value.<br>
Then __print the lookup table__.

In [824]:
d = dict(zip(df['cat'],df["type"]))
print(d)

{0: 'Business', 1: 'Entertainment', 2: 'Politics', 3: 'Sport', 4: 'Tech'}


---

4. Preparing data for ML

4a. Now that you've proven that DocType and DocCat hold the same information, decide which column is less work to use with the ML model, then __remove the other column__. <br>
Then __show the first 5 rows__ of the DataFrame.

In [825]:
df = df.drop(columns=['type'])
display(df[:5])

Unnamed: 0,text,cat
0,ad sale boost time_warner profit quarterly pro...,0
1,dollar gain greenspan speech dollar hit high l...,0
2,yukos unit buyer face loan claim owner embattl...,0
3,high fuel price hit ba profit british_airways ...,0
4,pernod takeover talk lift domecq share uk drin...,0


4b. __Create the X and y datasets__ and __print the shape__ of each.

In [826]:
X=df.text
y=df.cat
print(X.shape)
print(y.shape)

(2205,)
(2205,)


4c. Since the training data are already preprocessed, we want to look at one sample news article to see whether further preprocessing is needed.

__Print the news article at row 0__ to inspect it.

In [827]:
print(df.text[0])

ad sale boost time_warner profit quarterly profit media giant timewarner jump 76 $ 1.13bn £ 600 m month december $ 639 m year early firm big investor google benefit sale high speed internet connection high advert sale timewarner say fourth quarter sale rise 2 $ 11.1bn $ 10.9bn profit buoy gain offset profit dip warner_bros user aol time_warner say friday own 8 search engine google internet business aol mixed fortune lose 464,000 subscriber fourth quarter profit low precede quarter company say aol underlie profit exceptional item rise 8 strong internet advertising revenue hope increase subscriber offer online service free timewarner internet customer try sign aol exist customer high speed broadband timewarner restate 2000 2003 result follow probe the_us_securities_exchange_commission sec close conclude time_warner's fourth quarter profit slightly well analyst expectation film division see profit slump 27 $ 284 m help box office flop alexander catwoman sharp contrast year early final fil

4d. The preprocessing steps that we've discussed in class are stop word removal and stemming.<br>

__Create a Raw NBConvert cell to explain__:
- Does it look like stop words have been removed? Give examples from the text.
- Does it look like stemming was applied? Give examples from the text.
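
A quick, informal way to support the stop word answer is to search the text for a few common stop words. The sketch below uses the opening words of the article printed in step 4c and a small hand-picked set of stop words; that set is an assumption for illustration and is much shorter than NLTK's full English list.

```python
# A small hand-picked sample of English stop words (an assumption,
# not NLTK's full list).
sample_stop_words = {"the", "a", "of", "in", "to", "and", "is", "for"}

# Opening words of the article printed in step 4c.
snippet = ("ad sale boost time_warner profit quarterly profit "
           "media giant timewarner jump")

# Collect any tokens that are in the stop word sample.
tokens = snippet.split()
found = [t for t in tokens if t in sample_stop_words]
print(found)
```

An empty list for a snippet this short is only a hint, not proof; checking the whole article against a full stop word list would be more convincing.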

4e. __Convert the preprocessed data to numbers__ so it's ready for the ML model.<br>
Then __print the shape of the X dataset__ that will be used with the model.

In [828]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(X)
X = vect.transform(X)
print(X[0])
X.shape

  (0, 1)	1
  (0, 63)	1
  (0, 65)	1
  (0, 91)	1
  (0, 148)	1
  (0, 365)	1
  (0, 372)	1
  (0, 377)	2
  (0, 379)	1
  (0, 500)	2
  (0, 519)	1
  (0, 551)	1
  (0, 629)	1
  (0, 691)	1
  (0, 729)	1
  (0, 771)	1
  (0, 860)	1
  (0, 883)	1
  (0, 971)	1
  (0, 1148)	1
  (0, 1593)	2
  (0, 1664)	1
  (0, 1726)	1
  (0, 1788)	1
  (0, 1792)	2
  :	:
  (0, 22802)	1
  (0, 22806)	2
  (0, 22815)	1
  (0, 22926)	1
  (0, 23132)	1
  (0, 23367)	1
  (0, 23401)	1
  (0, 23728)	2
  (0, 23922)	3
  (0, 24318)	2
  (0, 24396)	2
  (0, 26246)	1
  (0, 26513)	3
  (0, 26522)	7
  (0, 26942)	1
  (0, 26999)	1
  (0, 27177)	1
  (0, 27268)	1
  (0, 27666)	1
  (0, 27727)	1
  (0, 28185)	1
  (0, 28245)	1
  (0, 28325)	1
  (0, 28452)	1
  (0, 28841)	4


(2205, 28975)

In [None]:
### as discussed in class, no need to show the vector. It doesn't mean much
### to the reader            -1/4pt
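
Instead of printing the raw sparse vector, it can be more informative to print the shape and look at the vocabulary that `CountVectorizer` learned. The sketch below uses three made-up documents rather than the lab data; the names `docs`, `vect_demo`, and `X_demo` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents standing in for the preprocessed article text.
docs = ["profit rise profit", "match win goal", "profit goal"]

vect_demo = CountVectorizer()
X_demo = vect_demo.fit_transform(docs)   # fit and transform in one step

# Shape is (number of documents, vocabulary size).
print(X_demo.shape)

# Vocabulary words ordered by their column index in the matrix.
vocab = sorted(vect_demo.vocabulary_, key=vect_demo.vocabulary_.get)
print(vocab)

# For a tiny example the dense counts are readable; each row is one document.
print(X_demo.toarray())
```

For the full dataset only the shape is practical to print, since the matrix has tens of thousands of columns.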

---

5. Train and test the model

5a. __Create X and y training and testing datasets__.<br>
Then __print the shape of each dataset__.

In [829]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(1653, 28975) (552, 28975) (1653,) (552,)
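
Because `train_test_split` shuffles randomly, the shapes stay the same between runs but the rows in each set (and the accuracy later on) can change. A minimal sketch of a reproducible, class-balanced split follows; the toy data and the choice of `random_state=42` are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, two balanced classes (hypothetical values).
X_demo = [[i] for i in range(20)]
y_demo = [0] * 10 + [1] * 10

# random_state fixes the shuffle; stratify keeps class proportions similar
# in the training and testing sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)
print(len(X_tr), len(X_te))

# Running the same split again with the same random_state gives the same rows.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)
print(X_te == X_te2)
```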


5b. __Train and test the ML model__<br>
and then __print the accuracy measurements__.

_There is more than one accuracy measurement._

In [830]:
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
# accuracy
print(metrics.accuracy_score(y_test, y_pred))
metrics.confusion_matrix(y_test, y_pred, labels=[0,1])

0.9710144927536232


array([[130,   0],
       [  0,  85]], dtype=int64)

In [None]:
### confusion matrix should be 5x5           -1/2pt
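
A sketch of how a full 5x5 confusion matrix can be produced is shown below. It uses made-up true and predicted labels rather than the lab's `y_test` and `y_pred`, so the numbers are illustrative only; the point is that `labels` lists all five categories (or is omitted).

```python
from sklearn import metrics

# Made-up true and predicted labels covering all five categories (0-4).
y_true_demo = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y_pred_demo = [0, 0, 1, 4, 2, 2, 3, 3, 4, 0]

# List every category so the matrix is 5x5.
labels = [0, 1, 2, 3, 4]
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)   # rows are true classes, columns are predicted classes

# Overall accuracy plus per-class precision, recall, and F1.
print(metrics.accuracy_score(y_true_demo, y_pred_demo))
print(metrics.classification_report(y_true_demo, y_pred_demo, labels=labels))
```

Off-diagonal cells show which categories get confused with each other, which a single accuracy number cannot show.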

5c. Create a Raw NBConvert cell to __discuss whether the accuracy measurements agree with each other__.

In [None]:
### the confusion matrix looks like 100% accurate probably because 
### the rest of it wasn't shown and the error could be in that part
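
To see how a partial confusion matrix can look perfect while the overall accuracy is below 100%, the sketch below compares the two on a hypothetical 5x5 matrix. The counts are invented for illustration and are not the lab's results.

```python
import numpy as np

# Hypothetical 5x5 confusion matrix (rows: true class, columns: predicted).
cm_full = np.array([
    [48,  0,  1,  0,  1],
    [ 0, 38,  0,  0,  0],
    [ 2,  0, 39,  0,  0],
    [ 0,  0,  0, 50,  0],
    [ 1,  1,  0,  0, 38],
])

# Accuracy is the sum of the diagonal divided by the total count.
accuracy_from_cm = np.trace(cm_full) / cm_full.sum()
print(accuracy_from_cm)

# Looking only at classes 0 and 1 keeps just the top-left 2x2 block,
# which hides every error that involves classes 2, 3, or 4.
sub = cm_full[:2, :2]
apparent_accuracy = np.trace(sub) / sub.sum()
print(apparent_accuracy)
```

Here the 2x2 block suggests no errors at all, even though the full matrix contains six misclassified articles.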

---

6. Real life testing of the model you've trained.

6a. __Print the lookup table__ you created in step 3c, which shows the corresponding values of the DocType and DocCat columns.

In [831]:
print(d)

{0: 'Business', 1: 'Entertainment', 2: 'Politics', 3: 'Sport', 4: 'Tech'}


6b. One advantage of working in NLP is that it's easier to come up with testing data. 

1. Go to the [BBC News](https://www.bbc.com/news) website to find the different types of news categories.<br>
2. __Choose 3 of the news categories__ in the BBC web page header that match the categories that the ML model has learned.<br>
3. For each category, click on the category link to __find today's news articles in that category__.<br>
> - Then click to open an article and __copy the first 4-5 paragraphs of the article__.<br>
> - Create a Code cell and paste the paragraphs into a Python string.

_You should end up with 3 Code cells, each containing a Python string that holds the 4-5 paragraphs of a news article._

In [832]:
entertainment = "The film's production company Paramount Pictures said the injuries were non life-threatening and happened while shooting a planned stunt sequence. The crew members were all in stable condition and continue to receive treatment, the statement said. Earlier this week, the Sun reported there had been an explosion and six people went to hospital. It was terrifying - a huge ball of fire flew up and caught several crew members in its path. In years of filming I've never seen an accident so scary, a source told the newspaper. Everyone involved, from the lowliest runners to the star names, has been shaken up by this, they added. In a statement, a Paramount Pictures spokesperson said: The safety and full medical services teams on-site were able to act quickly so that those who were impacted immediately received necessary care. They said it has strict health and safety procedures in place on all our productions and would take all necessary precautions as we resume production. According to Variety, no cast members were injured but six people received treatment for burn injuries and four remain in hospital. Sir Ridley Scott, who directed the original 2000 historical drama film, is returning to direct the second instalment, which is scheduled to be released in November 2024. No title has yet been announced for the sequel, which stars Normal People actor Paul Mescal, Denzel Washington and Connie Nielson. The original film won five Oscars, including best actor for Russell Crowe, who played Roman general Maximus Decimus Meridius alongside Joaquin Phoenix as Emperor Commodus. The movie, set during the height of the Roman Empire, sees Maximus start out as a war hero before before being forced to become a gladiator. Gladiator made $457m (£355m) at the box office and revived the historical epic drama genre, which had been out of fashion for decades."

In [833]:
business = "Meta has shown staff plans for a text-based social network designed to compete with Twitter, sources have told the BBC. It could allow users to follow accounts they already follow on Instagram, Meta's image-sharing app. And it could potentially allow them to bring over followers from decentralised platforms such as Mastodon. A Meta spokesperson confirmed to the BBC that the platform was in development. We're exploring a standalone decentralised social network for sharing text updates, they said. We believe there's an opportunity for a separate space where creators and public figures can share timely updates about their interests. Meta's chief product officer Chris Cox said coding was under way on the platform. The tech giant aims to release it soon, although no date was given. There is some speculation that it could be as early as the end of June. Screenshots have appeared online which were shown internally to employees, potentially giving an idea of what the app will look like. Sources within the company have told the BBC that these leaked screenshots are genuine. If they are, the layout of this new platform will be familiar to anyone who has spent time on Twitter."

In [834]:
tech = "A trial under way at Aberdeen Royal Infirmary is exploring whether artificial intelligence (AI) can assist radiologists in reviewing thousands of mammograms a year. The pilot helped spot early-stage breast cancer for June - a healthcare assistant and participant in the trial - and she is now set to undergo surgery as a result. Mammograms are low level X-rays used in breast cancer screenings to monitor and detect changes too small to see or feel. According to the NHS, they help save about 1,300 lives each year in the UK. And while the number of women who attended a routine breast screening, after an invitation, increased in Scotland in the three-year period to 2022, the number of radiologists to review results is shrinking. What is AI? AI - technology which sees computers perform specific tasks that would typically require human intelligence - is already widely used across a range of industries. While high-profile experts' fears that AI could lead to the extinction of humanity have recently been making headlines, the tech's more practical realities are already being shown in healthcare. Its potential to speed up the process of drug and disease discovery means many scientists and doctors see AI as a powerful tool to work with, rather than replace, practitioners."

6c. __Create a DataFrame from the 3 Python strings__.<br>
Then __print the DataFrame__.

_An example DataFrame is shown below, from news articles on 6/3. Your text will be different._

In [835]:
d2 = pd.DataFrame(columns = ["text"],data = [[entertainment], [business], [tech]])
display(d2)

Unnamed: 0,text
0,The film's production company Paramount Pictur...
1,Meta has shown staff plans for a text-based so...
2,A trial under way at Aberdeen Royal Infirmary ...


6d. __Test the ML model__ that you've trained with your new data in the DataFrame.

This means:
- preprocess the new data (your answer in step 4d will determine how you preprocess the new data).
- convert the new data to numbers
- test the model with the data
- print the categories of news that the model predicted. Use the lookup table to convert the numeric result from the model into the category string.
Example:<br>`
Article 1 : Sport
Article 2 : Business
Article 3 : Tech
`

_You'll need 4 Code cells, one for each step above_.

In [836]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
#nltk.download('stopwords')
stop_words = set(stopwords.words("english"))
tokenizer = RegexpTokenizer('[a-z]+')

def preprocess(s):
    w = tokenizer.tokenize(s.lower())                     # separate into words and lowercase each word
    w = [word for word in w if word not in stop_words]    # remove stop words
    w = [stemmer.stem(word) for word in w]                # find stem of each word
    return ' '.join(w)                                    # join back into a string

# apply the same preprocessing to each of the 3 articles
d2['text'] = d2['text'].apply(preprocess)


In [837]:
v2 = CountVectorizer()
v2.fit(d2.text)
d2.text = vect.transform(d2.text)
d2.text.shape


(3, 28975)

In [None]:
### don't need a new vectorizer, use the same one as the training data
### -1/2pt
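
The idea of reusing the training vectorizer can be sketched as follows: fit once on the training text, then call only `transform()` on new text so that the columns line up with what the model was trained on. The documents and the names `vect_demo`, `train_docs`, and `new_docs` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training and new documents (hypothetical text).
train_docs = ["profit rise share", "match win goal", "film star award"]
new_docs = ["share profit drop", "goal goal win"]

# Learn the vocabulary once, from the training text only.
vect_demo = CountVectorizer()
X_train_demo = vect_demo.fit_transform(train_docs)

# transform() only: words not seen in training, such as "drop", are ignored,
# and the new matrix has the same number of columns as the training matrix.
X_new_demo = vect_demo.transform(new_docs)
print(X_train_demo.shape, X_new_demo.shape)
```

Keeping the result in its own variable, rather than assigning the sparse matrix back into a DataFrame column, also avoids the pandas warning that such an assignment can trigger.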

In [846]:
A = d2.drop(columns=['text'])
b=d2.text
A_train, A_test, b_train, b_test = train_test_split(A, b, test_size=0.25)
print(A_train.shape, A_test.shape, b_train.shape, b_test.shape)
classifier = MultinomialNB()
print(A_train)
classifier.fit(X_train, b_train)
y_pred = classifier.predict(X_test)

(2, 0) (1, 0) (2, 28975) (1, 28975)
Empty DataFrame
Columns: []
Index: [2, 1]


ValueError: at least one array or dtype is required

In [None]:
### why create training and testing set?  the 3 articles are all testing data    -1/2pt
### don't create a new classifier. Use the one that you trained earlier   -1/2pt
### there's no X_train and b_train to fit()    -1/2pt
### there's no X_test   -1/2pt
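
Putting the comments above together, testing with new articles needs no second split and no second classifier: transform the new text with the vectorizer fitted on the training data, call `predict()` on the already trained model, and map the numbers through the lookup table. The sketch below shows this flow end to end on tiny made-up data; all names and documents are hypothetical stand-ins for the lab's `vect`, `classifier`, `d`, and `d2`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: two short documents for each of three categories.
train_docs = [
    "profit share price rise", "bank profit market share",
    "team win match goal", "player score goal match",
    "software computer internet user", "phone internet software update",
]
train_cats = [0, 0, 3, 3, 4, 4]
lookup = {0: "Business", 3: "Sport", 4: "Tech"}

# Train once: one vectorizer and one classifier.
vect_demo = CountVectorizer()
clf_demo = MultinomialNB()
clf_demo.fit(vect_demo.fit_transform(train_docs), train_cats)

# New articles are testing data only: transform, then predict.
new_docs = ["share price profit", "team win match", "software computer internet"]
pred = clf_demo.predict(vect_demo.transform(new_docs))

# Convert each numeric prediction to its category string.
names = [lookup[cat] for cat in pred]
for n, name in enumerate(names, start=1):
    print("Article", n, ":", name)
```

With real articles there are no true labels stored in the DataFrame, so the predictions are judged by comparing them with the BBC category each article was copied from.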

In [None]:
print(y_pred)
# accuracy
print(metrics.accuracy_score(y_test, y_pred))
metrics.confusion_matrix(y_test, y_pred, labels=[0,1])

6e. Create a Raw NBConvert cell to __discuss the result of your test__.

In [None]:
### can't draw conclusion since the code above doesn't work so there was no prediction
### and therefore no result to compare           -1pt