In this assignment, you will vectorize the data that you collected in HW1. Because the goal is to
identify the public sentiment toward AI on social media, you need to think about what
vectorization options, regarding both what to count and how to count, would be the best for this
goal. Make sure to explain the decisions you made during the vectorization process, e.g., if you
removed stopwords and why.

1) Collected data by downloading a sample from a News Sentiment Analysis Corpus: http://mlg.ucd.ie/datasets/bbc.html

About the dataset
The dataset is a collection of news articles from BBC across 5 major categories, namely:

Business
Entertainment
Politics
Sport
Tech

There are 2225 articles in total, spread across the five categories above.

In [13]:
import pandas as pd
import numpy as np

# Load the dataset
bbc_news = pd.read_csv('C:\\Users\\User\\Desktop\\bbc_news_mixed.csv')
bbc_news.head()

Unnamed: 0,text,label
0,Cairn shares slump on oil setback\n\nShares in...,business
1,Egypt to sell off state-owned bank\n\nThe Egyp...,business
2,Cairn shares up on new oil find\n\nShares in C...,business
3,Low-cost airlines hit Eurotunnel\n\nChannel Tu...,business
4,"Parmalat to return to stockmarket\n\nParmalat,...",business


In [15]:
# print the first 200 characters of the first 2 articles
for art in bbc_news.text[:2]:
    print(art[:200])

Cairn shares slump on oil setback


The company said tests ha
Egypt to sell off state-owned bank

The Egyptian government is reportedly planning to privatise one of the country's big public banks.

An Investment Ministry official has told the Reuters news agency


In [16]:
# category-wise count
bbc_news.label.value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: label, dtype: int64

In [29]:
docs = bbc_news['text'].values
print(docs[0:2])


 'Egypt to sell off state-owned bank\n\nThe Egyptian government is reportedly planning to privatise one of the country\'s big public banks.\n\nAn Investment Ministry official has told the Reuters news agency that the Bank of Alexandria will be sold sometime in 2005. The move is seen as evidence of a new commitment by the government to reduce the size of public sector. The official said the government has not yet decided whether the sale will take the form of a public flotation. "The most important thing to decide now is the method - whether by selling shares to the public or to a strategic investor from abroad," he said.\n\nAnalysts say the public-sector banks have suited the government\'s monetary, credit and exchange policies. Nevertheless, the Egyptian government has spoken for years about privatising one of the big four state banks - Banque Misr, National Bank of Egypt, Banque du Caire and Bank of Alexandria. It had been expected one of the smallest of the four big public banks - B

In [30]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# several commonly used vectorizer settings

#  unigram boolean vectorizer, set minimum document frequency to 5
unigram_bool_vectorizer = CountVectorizer(encoding='latin-1', binary=True, min_df=5, stop_words='english')

#  unigram term frequency vectorizer, set minimum document frequency to 5
unigram_count_vectorizer = CountVectorizer(encoding='latin-1', binary=False, min_df=5, stop_words='english')

#  unigram and bigram term frequency vectorizer, set minimum document frequency to 5
gram12_count_vectorizer = CountVectorizer(encoding='latin-1', ngram_range=(1,2), min_df=5, stop_words='english')

#  unigram tfidf vectorizer, set minimum document frequency to 5
unigram_tfidf_vectorizer = TfidfVectorizer(encoding='latin-1', use_idf=True, min_df=5, stop_words='english')

In [31]:

# fit vocabulary in documents and transform the documents into vectors
vecs = unigram_count_vectorizer.fit_transform(docs)

# check the content of a document vector
print(vecs.shape)
print(vecs[0].toarray())

# check the size of the constructed vocabulary
print(len(unigram_count_vectorizer.vocabulary_))

# print out the first 10 items in the vocabulary
print(list(unigram_count_vectorizer.vocabulary_.items())[:10])

# check word index in vocabulary
print(unigram_count_vectorizer.vocabulary_.get('year'))

(2225, 9138)
[[0 3 0 ... 0 0 0]]
9138
[('shares', 7417), ('slump', 7590), ('oil', 5715), ('setback', 7382), ('energy', 2931), ('uk', 8563), ('firm', 3358), ('closed', 1701), ('18', 60), ('disappointing', 2560)]
9108
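
To make the `binary` and `min_df` settings more concrete, here is a small pure-Python sketch on a made-up three-document corpus. The corpus, the `tokenize` helper, and the threshold of 2 are illustrative assumptions; scikit-learn's own tokenizer and internals differ in detail, so this only mirrors the idea of "what to count" (vocabulary pruning) and "how to count" (frequency vs. presence).

```python
from collections import Counter

# Hypothetical toy corpus (not taken from the BBC data)
toy_docs = [
    "oil shares slump as oil prices fall",
    "bank shares rise",
    "oil find lifts shares",
]

def tokenize(text):
    # Simplified tokenizer: lowercase and split on whitespace
    return text.lower().split()

tokenized = [tokenize(d) for d in toy_docs]

# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Keep only terms that occur in at least `min_df` documents
min_df = 2
vocab = sorted(term for term, df in doc_freq.items() if df >= min_df)

# Term-frequency vectors (how often) and boolean vectors (present or not)
count_vecs = [[tokens.count(term) for term in vocab] for tokens in tokenized]
bool_vecs = [[int(c > 0) for c in row] for row in count_vecs]

print(vocab)
print(count_vecs)
print(bool_vecs)
```

Only "oil" and "shares" survive the document-frequency cut here, and the first document differs between the two encodings because "oil" occurs twice in it.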


In [85]:
from sklearn.feature_extraction.text import CountVectorizer

# initialize count vector
cvec = CountVectorizer(stop_words='english')
# create Bag of Words
bow = cvec.fit_transform(docs)
# shape of Bag of Words
print('shape of BOW:', bow.shape)
# number of words in the vocabulary
print('No. of words in vocabulary:', len(cvec.vocabulary_))

shape of BOW: (2225, 33913)
No. of words in vocabulary: 33913


** Stemming **

In [35]:
# importing simple_preprocess
from gensim.utils import simple_preprocess

# check the output of simple_preprocess on the first article
print(simple_preprocess(bbc_news.text[0])[:200])



In [36]:
# preprocess all the articles of the data set
words = bbc_news.text.apply(lambda x: simple_preprocess(x))
words[:10]

0    [cairn, shares, slump, on, oil, setback, share...
1    [egypt, to, sell, off, state, owned, bank, the...
2    [cairn, shares, up, on, new, oil, find, shares...
3    [low, cost, airlines, hit, eurotunnel, channel...
4    [parmalat, to, return, to, stockmarket, parmal...
5    [making, your, office, work, for, you, our, mi...
6    [mexican, in, us, send, bn, home, mexican, lab...
7    [asia, shares, defy, post, quake, gloom, indon...
8    [german, bidder, in, talks, with, lse, deutsch...
9    [bank, payout, to, pinochet, victims, us, bank...
Name: text, dtype: object

In [50]:
docs = bbc_news['text'].values
print(docs[0:1])



In [55]:

docs = bbc_news['text'].values

# We create a pandas dataframe as follows:
data = pd.DataFrame(data=bbc_news['text'].values, columns=['text'])

# We display the first 10 elements of the dataframe:
display(data.head(10))

Unnamed: 0,text
0,Cairn shares slump on oil setback\n\nShares in...
1,Egypt to sell off state-owned bank\n\nThe Egyp...
2,Cairn shares up on new oil find\n\nShares in C...
3,Low-cost airlines hit Eurotunnel\n\nChannel Tu...
4,"Parmalat to return to stockmarket\n\nParmalat,..."
5,Making your office work for you\n\nOur mission...
6,Mexican in US send $16bn home\n\nMexican labou...
7,Asia shares defy post-quake gloom\n\nIndonesia...
8,German bidder in talks with LSE\n\nDeutsche Bo...
9,Bank payout to Pinochet victims\n\nA US bank h...


In [60]:
# Number of stop words
from nltk.corpus import stopwords
stop = stopwords.words('english')

# count whitespace-separated tokens that appear in the stop-word list
data['stopwords'] = data['text'].apply(lambda x: len([w for w in x.split() if w in stop]))
data[['text','stopwords']].head()

Unnamed: 0,text,stopwords
0,Cairn shares slump on oil setback\n\nShares in...,160
1,Egypt to sell off state-owned bank\n\nThe Egyp...,108
2,Cairn shares up on new oil find\n\nShares in C...,108
3,Low-cost airlines hit Eurotunnel\n\nChannel Tu...,68
4,"Parmalat to return to stockmarket\n\nParmalat,...",100
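
The count above compares raw tokens against NLTK's lowercase stop-word list, so capitalised tokens such as "The" are not matched. A minimal sketch of the difference, using a small hypothetical stop-word set in place of the NLTK list (the `count_stopwords` helper is also made up for illustration):

```python
# Hypothetical miniature stop-word set standing in for stopwords.words('english')
toy_stop = {"the", "to", "of", "is", "one"}

text = "The Egyptian government is reportedly planning to privatise one of the banks."

def count_stopwords(text, stop_set, lowercase=False):
    # Count whitespace-separated tokens that appear in the stop-word set
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return sum(1 for t in tokens if t in stop_set)

case_sensitive = count_stopwords(text, toy_stop)
case_insensitive = count_stopwords(text, toy_stop, lowercase=True)
print(case_sensitive, case_insensitive)
```

Lowercasing before the lookup picks up the sentence-initial "The", which the case-sensitive version misses.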


In [61]:
# Remove punctuation
data['text'] = data['text'].str.replace(r'[^\w\s]', '', regex=True)
data['text'].head()

0    Cairn shares slump on oil setback\n\nShares in...
1    Egypt to sell off stateowned bank\n\nThe Egypt...
2    Cairn shares up on new oil find\n\nShares in C...
3    Lowcost airlines hit Eurotunnel\n\nChannel Tun...
4    Parmalat to return to stockmarket\n\nParmalat ...
Name: text, dtype: object
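
Note that deleting punctuation outright fuses hyphenated words, as in "stateowned" and "Lowcost" in the output above. If separate tokens are preferred, one option is to replace punctuation with a space instead. A small sketch with the standard `re` module (the two helper names are made up for illustration):

```python
import re

headline = "Egypt to sell off state-owned bank"

def strip_punct_delete(text):
    # Delete anything that is not a word character or whitespace
    return re.sub(r"[^\w\s]", "", text)

def strip_punct_space(text):
    # Replace punctuation with a space, then collapse runs of spaces
    spaced = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r" +", " ", spaced).strip()

print(strip_punct_delete(headline))  # hyphenated word is fused
print(strip_punct_space(headline))   # hyphenated word becomes two tokens
```

Which variant is better depends on whether "state-owned" should count as one feature or two; either choice changes the resulting vocabulary.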

In [62]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Program Files
[nltk_data]     (x86)\Microsoft Visual
[nltk_data]     Studio\Shared\Anaconda3_64\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [63]:
# Stemming: strip suffixes such as -ing, -ly, -s (preview on the first 5 articles; the result is not stored)

from nltk.stem import PorterStemmer
st = PorterStemmer()
data['text'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

0    cairn share slump on oil setback share in cair...
1    egypt to sell off stateown bank the egyptian g...
2    cairn share up on new oil find share in cairn ...
3    lowcost airlin hit eurotunnel channel tunnel o...
4    parmalat to return to stockmarket parmalat the...
Name: text, dtype: object

In [77]:
bbc_news = data['text']
bbc_news.head()

0    Cairn shares slump on oil setback\n\nShares in...
1    Egypt to sell off stateowned bank\n\nThe Egypt...
2    Cairn shares up on new oil find\n\nShares in C...
3    Lowcost airlines hit Eurotunnel\n\nChannel Tun...
4    Parmalat to return to stockmarket\n\nParmalat ...
Name: text, dtype: object

In [80]:
# Load the dataset
bbc_news = pd.read_csv('C:\\Users\\User\\Desktop\\bbc_news_mixed.csv')
bbc_news.head()

Unnamed: 0,text,label
0,Cairn shares slump on oil setback\n\nShares in...,business
1,Egypt to sell off state-owned bank\n\nThe Egyp...,business
2,Cairn shares up on new oil find\n\nShares in C...,business
3,Low-cost airlines hit Eurotunnel\n\nChannel Tu...,business
4,"Parmalat to return to stockmarket\n\nParmalat,...",business


In [81]:
# category-wise count
bbc_news.label.value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: label, dtype: int64

In [82]:
from sklearn.preprocessing import LabelEncoder

# initialize LabelEncoder
lencod = LabelEncoder()
# fit_transform() converts the text to numbers
bbc_news.label = lencod.fit_transform(bbc_news.label)
# label-wise count
bbc_news.label.value_counts()

3    511
0    510
2    417
4    401
1    386
Name: label, dtype: int64
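
The integer codes are easier to read next to the category names. `LabelEncoder` keeps the sorted unique labels in its `classes_` attribute, so the codes follow alphabetical order. The pure-Python sketch below reproduces that ordering for the five BBC categories; it is an illustration of the mapping, not the encoder's actual implementation.

```python
# The five category names from the dataset description
labels = ["business", "entertainment", "politics", "sport", "tech"]

# Sorted unique labels -> consecutive integer codes (mirrors the ordering in classes_)
classes = sorted(set(labels))
label_to_code = {name: code for code, name in enumerate(classes)}
code_to_label = {code: name for name, code in label_to_code.items()}

print(label_to_code)
```

This agrees with the counts above: code 3 (511 articles) is sport and code 1 (386 articles) is entertainment.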

In [84]:
from sklearn.feature_extraction.text import CountVectorizer

# initialize count vector
cvec = CountVectorizer(stop_words='english')
# create Bag of Words
bow = cvec.fit_transform(bbc_news.text)
# shape of Bag of Words
print('shape of BOW:', bow.shape)
# number of words in the vocabulary
print('No. of words in vocabulary:', len(cvec.vocabulary_))

shape of BOW: (2225, 29126)
No. of words in vocabulary: 29126


In [86]:
# create a sparse-backed dataframe from the BOW
# (requires pandas >= 0.25 and scikit-learn >= 1.0)
bow_df = pd.DataFrame.sparse.from_spmatrix(bow, columns=cvec.get_feature_names_out(), index=bbc_news.index)

# sample some data points
bow_df.iloc[:20, 5000:5050]

Unnamed: 0,bcc,bdb,bdo,bdos,beach,beaches,beachfront,beachgoer,beacon,beaded,...,beats,beattie,beatty,beattys,beaudoin,beaufort,beaumont,beautiful,beautifully,beauty
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


** TF-IDF Vectorization **

In [87]:
from sklearn.feature_extraction.text import TfidfVectorizer

# initialize TFIDF
vec = TfidfVectorizer(max_features=4000, stop_words='english')
# create TFIDF
tfidf = vec.fit_transform(bbc_news.text)
# shape of TFIDF
tfidf.shape

(2225, 4000)

In [88]:
# create a sparse-backed dataframe from the TFIDF
# (requires pandas >= 0.25 and scikit-learn >= 1.0)
tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf, columns=vec.get_feature_names_out(), index=bbc_news.index)

# sample some data points
tfidf_df.iloc[:20, 1000:1050]

Unnamed: 0,death,debate,debt,debts,debut,dec,decade,decades,december,decent,...,dem,demand,demanded,demands,democracy,democrat,democratic,democrats,dems,denied
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.043469,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.077338,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.105987,0.129173,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.099194,0.181343,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.085918,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032933,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
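
To show where weights like these come from, here is a hand-rolled sketch of TF-IDF on a made-up corpus. It assumes the smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row, which is what scikit-learn documents as the `TfidfVectorizer` defaults. The toy corpus and the `tfidf_row` helper are illustrative only.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized
toy = [
    ["debt", "bank", "debt"],
    ["bank", "shares"],
    ["shares", "rise"],
]

n_docs = len(toy)
vocab = sorted({t for doc in toy for t in doc})

# Document frequency: number of documents containing each term
df = Counter(t for doc in toy for t in set(doc))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    # Raw term frequency times IDF, then scale the row to unit L2 length
    tf = Counter(doc)
    weights = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

tfidf_rows = [tfidf_row(doc) for doc in toy]
print(vocab)
for row in tfidf_rows:
    print([round(w, 3) for w in row])
```

In the first toy document, "debt" gets a larger weight than "bank" both because it occurs twice and because it appears in fewer documents. The same logic explains why rarer terms score higher in the table above.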
