# <font color='light gray'>Kaggle Competition- Predict Stock Price Movement Based On News Headline using NLP
<font color='gray'>**Credit:**
* https://www.youtube.com/watch?v=h-LGjJ_oANs&list=PLZoTAELRMXVMdJ5sqbCK2LiM0HhQVWNzm&index=13
* https://www.kaggle.com/code/rohit0906/stock-sentiment-analysis-using-news-headlines
</font>

## <font color='light gray'>Stock Sentiment Analysis using News Headlines
<font color='gray'>**About the problem and the dataset used.**
* The data set in consideration is a combination of the world news and stock price shifts.
* The data ranges from 2008 to 2016; the data from 2000 to 2008 was scraped from Yahoo Finance.
* There are 25 columns of top news headlines for each day in the data frame.
* Class 1- the stock price increased.
* Class 0- the stock price stayed the same or decreased.

<font color='gray'><br>**About the approach.**
* Used **TF-IDF and Bag of Words** for extracting features from the headlines.
* Used **Random Forest Classifier, Multinomial Naive Bayes and Passive Aggressive Classifier** for analysis.</font>

# Load the raw dataset csv file from github

In [1]:
# url of raw data set from github
url="https://raw.githubusercontent.com/akdubey2k/NLP/main/Stock_Sentiment_Analysis/stock_news_headlines.csv"

# Read the dataset of csv file using "pandas dataframe"

**encoding : str, optional, default ‘utf-8’**<br>
Encoding to use when reading/writing (e.g. 'utf-8'). See the list of Python standard encodings.

In [2]:
import pandas as pd

df = pd.read_csv(filepath_or_buffer=url, encoding='ISO-8859-1')
# ISO-8859-1 (Latin-1) maps every possible byte to a character, so decoding
# cannot raise an error even when the file is not valid UTF-8. The trade-off is
# that a file written in another encoding would load silently with garbled
# characters instead of failing.
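
The choice of `ISO-8859-1` can be illustrated with a small, self-contained sketch. The byte string below is made up for the illustration and is not taken from the dataset; it only assumes that a headline may contain a byte that is not valid UTF-8.

```python
# Hypothetical headline bytes: 0xA3 is the pound sign in ISO-8859-1,
# but it is not a valid start byte in UTF-8.
raw = b"Oil hits \xa350 a barrel"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 maps every byte to exactly one character, so this never fails.
latin1_text = raw.decode("ISO-8859-1")

print(utf8_ok)
print(latin1_text)
```

If the default `utf-8` codec raises a `UnicodeDecodeError` in `read_csv`, switching to `ISO-8859-1` is one common workaround, at the cost of possibly mis-rendered characters.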

In [3]:
# display the first 2 rows (head() is deterministic, so the output is reproducible)
print('The shape of dataset: '.ljust(25, '.'), df.shape)
df.head(2)

The shape of dataset: ... (4101, 27)


Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2000-01-03,0,A 'hindrance to operations': extracts from the...,Scorecard,Hughes' instant hit buoys Blues,Jack gets his skates on at ice-cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl's successor drawn into scandal,The difference between men and women,"Sara Denver, nurse turned solicitor",Diana's landmine crusade put Tories in a panic,Yeltsin's resignation caught opposition flat-f...,Russian roulette,Sold out,Recovering a title
1,2000-01-04,0,Scorecard,The best lake scene,Leader: German sleaze inquiry,"Cheerio, boyo",The main recommendations,Has Cubie killed fees?,Has Cubie killed fees?,Has Cubie killed fees?,...,On the critical list,The timing of their lives,Dear doctor,Irish court halts IRA man's extradition to Nor...,Burundi peace initiative fades after rebels re...,PE points the way forward to the ECB,Campaigners keep up pressure on Nazi war crime...,Jane Ratcliffe,Yet more things you wouldn't know without the ...,Millennium bug fails to bite


# Draw random rows from the "pandas dataframe" using the 'sample' method
## <font color='light gray'>pandas.DataFrame.sample
<font color='gray'>**Return a random sample of items from an axis of object.**</font>
* **n : int, optional**
Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.
* **random_state : int, array-like, BitGenerator, np.random.RandomState, np.random.Generator, optional**
If int, array-like, or BitGenerator, seed for random number generator. If np.random.RandomState or np.random.Generator, use as given.
* **axis : {0 or ‘index’, 1 or ‘columns’, None}, default None**
Axis to sample. Accepts axis number or name. Default is stat axis for given data type. For Series this parameter is unused and defaults to None.
* **ignore_index : bool, default False**
If True, the resulting index will be labeled 0, 1, …, n - 1.
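
As a minimal sketch of how `random_state` and `ignore_index` interact, the toy frame below (made-up values, not the real dataset) is sampled twice with the same seed:

```python
import pandas as pd

# Toy frame standing in for the headline data (values are made up).
toy = pd.DataFrame({"Label": [0, 1, 0, 1, 1], "Top1": list("abcde")})

# Same seed, so both calls pick the same rows.
first = toy.sample(n=2, random_state=0, ignore_index=True)
second = toy.sample(n=2, random_state=0, ignore_index=True)

print(first.equals(second))   # identical samples
print(list(first.index))      # index relabelled from 0 because ignore_index=True
```

Without `random_state`, each call may return different rows, which is what the unseeded cells in this notebook show.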

In [4]:
# display only 2 rows with data reproducibility
# df.sample(axis='index', random_state=1, n=2, ignore_index=True)
    # axis='index', no use here
    # if ignore_index=True, then index will be labeled 0, 1, …, n - 1. forcefully.
# df.sample(random_state=0, n=2, ignore_index=True)

# display only 2 rows without data reproducibility
df.sample(n=2, ignore_index=True)

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2012-02-16,1,"Bill C-309, if passed, would make it a crime t...",Acta loses more support in Europe as Netherlan...,UK Parliament members concluded that the Inter...,An EU court rules that social networking sites...,"After revolution in Egypt, women's taste of eq...",Israel has threatened to tear down solar panel...,Greece Is on Pace for the Worst Recession in M...,Acta is sinking: EU Court of Justice rules out...,...,Dubai : 19 year old Indonesian girl kills her ...,French film ban raises autism issue,Saudi Arabia's religious police have arrested ...,"Facing a backlash, Ottawa moves to retool cybe...",Rare Look at China's Energy Machine - \r\nA ph...,Iran wants early resumption of nuclear talks: ...,"World's oldest hooker makes $80,000 a year fro...",Greece's Model Mayor - Reform Hero Takes on Co...,"Facing backlash, Canada Conservatives eye rewr...",Prosecutors Try to Lift German President's Imm...
1,2016-04-21,0,Declassified memo shows multiple Saudi connect...,Canada to introduce pot legalization legislati...,IS executes 250 women for refusing sexual jiha...,Referendum on abolishing monarchy must be held...,Islamic face veil to be banned in Latvia despi...,Watchdog says press freedom in decline in 'new...,Super rich who hide money in tax havens to be ...,Scientists resort to advertising to get Great ...,...,U.K. Issues Travel Warning About Antigay U.S. ...,"""China's President Xi Jinping has assumed a ne...","Failed N.K. missile launch damages launcher, c...",The Fallout From the Panama Papers in Hong Kon...,"Saudi government has vast network of PR, lobby...","VW to pay US customers $5,000 each to settle s...",Confirmed as Oldest in World: Message in bottl...,Colombian president: prohibitionist drug polic...,Philippine's presidential frontrunner tells US...,Hundreds of Palestinians march in support of J...


In [5]:
# display only 2 rows without data reproducibility
df.sample(n=2, ignore_index=True)

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2003-07-22,0,Sun still sinking slowly (update),Stepfather guilty of murdering Jenna Baldwin,US: Saddam's sons dead,Hip (hop) replacement,Profiles: Qusay and Uday Saddam Hussein,US forces may have killed Saddam's sons,Kirk Lightsey/Bobby Wellins,"Jesse Sykes, Borderline, London",...,BA staff take a swipe at new security system,Tories: roads reduce congestion,"Famous Belgian, and Native American",Blair denies role in naming Kelly,US unveils plan to end North Korean nuclear am...,Costing religion,TV channel threatens to delay Italian football...,13 injured in Spain bombings,Games sector under fire,Finnish outperform English children
1,2012-04-02,1,ACTA Could be Passed in 10 Weeks; Take Action ...,Tunisia rejects shariah in new constitution,"UK teachers no longer teaching, just 'training...","North Korea rocket test will cost $850m USD, e...","Tibetan immolations, largely unnoticed, among ...",How Canada's Green Credentials Fell Apart - \r...,Ikea to design 11-hectare neighbourhood. We ar...,Executive pay soars as bosses set each others'...,...,Euro unemployment spikes to record 10.8%,Hungary President Schmitt quits in plagiarism ...,Stephen Fry lends support to Greek calls to re...,"""Democracy champion Aung San Suu Kyi declared ...",Swiss Arrest Warrants Fuel Tax Row With German...,Swedish workers party hard during lunch breaks,Chinas Bloody Factories -- A Problem Bigger th...,"In Rich Europe, Growing Ranks of Working Poor ...",UK 'considers' internet surveillance network,Japanese experts warn of earthquakes that coul...


In [6]:
# print the names of the input features (columns) in the dataset
# df.columns.to_list()
df.columns

Index(['Date', 'Label', 'Top1', 'Top2', 'Top3', 'Top4', 'Top5', 'Top6', 'Top7',
       'Top8', 'Top9', 'Top10', 'Top11', 'Top12', 'Top13', 'Top14', 'Top15',
       'Top16', 'Top17', 'Top18', 'Top19', 'Top20', 'Top21', 'Top22', 'Top23',
       'Top24', 'Top25'],
      dtype='object')

In [7]:
# print the data type of each input feature (column) in a dataset
# df.dtypes.to_list()
df.dtypes

Date     object
Label     int64
Top1     object
Top2     object
Top3     object
Top4     object
Top5     object
Top6     object
Top7     object
Top8     object
Top9     object
Top10    object
Top11    object
Top12    object
Top13    object
Top14    object
Top15    object
Top16    object
Top17    object
Top18    object
Top19    object
Top20    object
Top21    object
Top22    object
Top23    object
Top24    object
Top25    object
dtype: object

## <font color='light gray'>Divide data into training and test set as per date-wise

In [8]:
# this won't work, as this will show only boolean values,
# and we need content of dataset as per datewise, please see below...
train = df['Date'] < '2008-01-01'
train

0        True
1        True
2        True
3        True
4        True
        ...  
4096    False
4097    False
4098    False
4099    False
4100    False
Name: Date, Length: 4101, dtype: bool

In [9]:
# load the content of dataset as per datewise, which are below the date...
train = df[df['Date'] < '2015-01-01']
print('Shape of train'.ljust(20, '.'), train.shape)
train.head(2)

Shape of train...... (3723, 27)


Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2000-01-03,0,A 'hindrance to operations': extracts from the...,Scorecard,Hughes' instant hit buoys Blues,Jack gets his skates on at ice-cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl's successor drawn into scandal,The difference between men and women,"Sara Denver, nurse turned solicitor",Diana's landmine crusade put Tories in a panic,Yeltsin's resignation caught opposition flat-f...,Russian roulette,Sold out,Recovering a title
1,2000-01-04,0,Scorecard,The best lake scene,Leader: German sleaze inquiry,"Cheerio, boyo",The main recommendations,Has Cubie killed fees?,Has Cubie killed fees?,Has Cubie killed fees?,...,On the critical list,The timing of their lives,Dear doctor,Irish court halts IRA man's extradition to Nor...,Burundi peace initiative fades after rebels re...,PE points the way forward to the ECB,Campaigners keep up pressure on Nazi war crime...,Jane Ratcliffe,Yet more things you wouldn't know without the ...,Millennium bug fails to bite


In [10]:
# load the content of dataset as per datewise, which are above the date...
# test = df[df.Date > '20141231']
test = df[df['Date'] > '2014-12-31']
print('Shape of test'.ljust(20, '.'), test.shape)
test.head(2)

Shape of test....... (378, 27)


Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
3723,2015-01-02,1,Most cases of cancer are the result of sheer b...,Iran dismissed United States efforts to fight ...,Poll: One in 8 Germans would join anti-Muslim ...,UK royal family's Prince Andrew named in US la...,Some 40 asylum-seekers refused to leave the bu...,Pakistani boat blows self up after India navy ...,Sweden hit by third mosque arson attack in a week,940 cars set alight during French New Year,...,Ukrainian minister threatens TV channel with c...,Palestinian President Mahmoud Abbas has entere...,Israeli security center publishes names of 50 ...,The year 2014 was the deadliest year yet in Sy...,A Secret underground complex built by the Nazi...,Restrictions on Web Freedom a Major Global Iss...,Austrian journalist Erich Mchel delivered a pr...,Thousands of Ukraine nationalists march in Kiev,Chinas New Years Resolution: No More Harvestin...,Authorities Pull Plug on Russia's Last Politic...
3724,2015-01-05,0,Moscow-&gt;Beijing high speed train will reduc...,Two ancient tombs were discovered in Egypt on ...,China complains to Pyongyang after N Korean so...,Scotland Headed Towards Being Fossil Fuel-Free...,Prime Minister Shinzo Abe said Monday he will ...,Sex slave at centre of Prince Andrew scandal f...,Gay relative of Hamas founder faces deportatio...,The number of female drug addicts in Iran has ...,...,The Islamic State has approved a 2015 budget o...,"Iceland To Withdraw EU Application, Lift Capit...",Blackfield Capital Founder Goes Missing: The v...,Rocket stage crashes back to Earth in rural Ch...,2 Dead as Aircraft Bombs Greek Tanker in Libya...,Belgian murderer Frank Van Den Bleeken to die ...,Czech President criticizes Ukrainian PM; says ...,3 Vietnamese jets join search for 16 missing F...,France seeks end to Russia sanctions over Ukraine,China scraps rare earths caps


<font color='dark gray'>***The rows form a time series, so the split is done by date; a random "train_test_split" would let headlines from later dates leak into the training set.***
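
A date-based split can be sketched on a toy frame as follows. The rows are made up, and parsing `Date` with `pd.to_datetime` is an assumption of this sketch; the notebook itself compares ISO-formatted strings, which also sort chronologically.

```python
import pandas as pd

# Toy rows (made up) spanning the cutoff used in this notebook.
toy = pd.DataFrame({
    "Date": ["2014-12-30", "2014-12-31", "2015-01-02", "2015-01-05"],
    "Label": [1, 0, 1, 0],
})

# Parse once, then split on a single cutoff so no row is lost or duplicated.
dates = pd.to_datetime(toy["Date"])
cutoff = pd.Timestamp("2015-01-01")

toy_train = toy[dates < cutoff]
toy_test = toy[dates >= cutoff]

print(toy_train.shape, toy_test.shape)
```

Every training row predates every test row, so the model is evaluated only on dates it has not seen.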

## <font color='light gray'>Data cleaning from training set, by removing 'Date' & 'Label' columns.

In [11]:
# all 3723 training rows, columns 2 to 26 (i.e. dropping the 'Date' & 'Label' columns).
# .copy() gives an independent frame, so the in-place edits below do not touch 'train'.
train_data = train.iloc[:, 2:].copy()
train_data.head(2)

Unnamed: 0,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,Top9,Top10,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,A 'hindrance to operations': extracts from the...,Scorecard,Hughes' instant hit buoys Blues,Jack gets his skates on at ice-cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,Derby raise a glass to Strupar's debut double,"Southgate strikes, Leeds pay the penalty",...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl's successor drawn into scandal,The difference between men and women,"Sara Denver, nurse turned solicitor",Diana's landmine crusade put Tories in a panic,Yeltsin's resignation caught opposition flat-f...,Russian roulette,Sold out,Recovering a title
1,Scorecard,The best lake scene,Leader: German sleaze inquiry,"Cheerio, boyo",The main recommendations,Has Cubie killed fees?,Has Cubie killed fees?,Has Cubie killed fees?,Hopkins 'furious' at Foster's lack of Hannibal...,Has Cubie killed fees?,...,On the critical list,The timing of their lives,Dear doctor,Irish court halts IRA man's extradition to Nor...,Burundi peace initiative fades after rebels re...,PE points the way forward to the ECB,Campaigners keep up pressure on Nazi war crime...,Jane Ratcliffe,Yet more things you wouldn't know without the ...,Millennium bug fails to bite


### <font color='light gray'>Replacing every non-alphabetic character (punctuation, digits) in the training set with a space, using a "regex".
### https://regex101.com/

In [12]:
# everything except alphabetic characters is replaced with a space
'''
[^a-zA-Z] : A character not in the range of a-z or A-Z
'''
train_data.replace('[^a-zA-Z]', ' ', regex=True, inplace=True)
train_data.head(n=2)

Unnamed: 0,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,Top9,Top10,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,A hindrance to operations extracts from the...,Scorecard,Hughes instant hit buoys Blues,Jack gets his skates on at ice cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,Derby raise a glass to Strupar s debut double,Southgate strikes Leeds pay the penalty,...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl s successor drawn into scandal,The difference between men and women,Sara Denver nurse turned solicitor,Diana s landmine crusade put Tories in a panic,Yeltsin s resignation caught opposition flat f...,Russian roulette,Sold out,Recovering a title
1,Scorecard,The best lake scene,Leader German sleaze inquiry,Cheerio boyo,The main recommendations,Has Cubie killed fees,Has Cubie killed fees,Has Cubie killed fees,Hopkins furious at Foster s lack of Hannibal...,Has Cubie killed fees,...,On the critical list,The timing of their lives,Dear doctor,Irish court halts IRA man s extradition to Nor...,Burundi peace initiative fades after rebels re...,PE points the way forward to the ECB,Campaigners keep up pressure on Nazi war crime...,Jane Ratcliffe,Yet more things you wouldn t know without the ...,Millennium bug fails to bite
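
The effect of the `[^a-zA-Z]` pattern can be checked on a tiny example. The two headlines below are made up for the illustration; the sketch uses the non-in-place form of `replace`, which returns a new frame.

```python
import pandas as pd

# Made-up headlines containing an apostrophe, a comma, an exclamation mark and digits.
toy = pd.DataFrame({
    "Top1": ["Kohl's successor, drawn into scandal!"],
    "Top2": ["Langer escapes to hit 167"],
})

# Each character that is not a-z or A-Z becomes one space.
cleaned = toy.replace("[^a-zA-Z]", " ", regex=True)

print(repr(cleaned.loc[0, "Top1"]))
print(repr(cleaned.loc[0, "Top2"]))
```

Each removed character becomes one space, which explains the runs of spaces in the joined headline strings later in the notebook. Digits are removed too, so numbers such as prices or years are lost.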


### <font color='light gray'>Renaming columns for ease of access

In [13]:
# print('Shape of train_data :'.ljust(40, '.'), train_data.shape[1])
# print('Columns of train_data', train_data.columns)

train_data.columns = [str(i) for i in range(train_data.shape[1]) ]
# print('Columns of train_data', train_data.columns)
train_data.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
0,A hindrance to operations extracts from the...,Scorecard,Hughes instant hit buoys Blues,Jack gets his skates on at ice cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,Derby raise a glass to Strupar s debut double,Southgate strikes Leeds pay the penalty,...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl s successor drawn into scandal,The difference between men and women,Sara Denver nurse turned solicitor,Diana s landmine crusade put Tories in a panic,Yeltsin s resignation caught opposition flat f...,Russian roulette,Sold out,Recovering a title
1,Scorecard,The best lake scene,Leader German sleaze inquiry,Cheerio boyo,The main recommendations,Has Cubie killed fees,Has Cubie killed fees,Has Cubie killed fees,Hopkins furious at Foster s lack of Hannibal...,Has Cubie killed fees,...,On the critical list,The timing of their lives,Dear doctor,Irish court halts IRA man s extradition to Nor...,Burundi peace initiative fades after rebels re...,PE points the way forward to the ECB,Campaigners keep up pressure on Nazi war crime...,Jane Ratcliffe,Yet more things you wouldn t know without the ...,Millennium bug fails to bite


### <font color='light gray'>Converting headlines to lower case

In [14]:
print('train_data.columns : ', train_data.columns)

for col in train_data.columns:
  train_data[col] = train_data[col].str.lower()

train_data.head(2)

train_data.columns :  Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
       '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24'],
      dtype='object')


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
0,a hindrance to operations extracts from the...,scorecard,hughes instant hit buoys blues,jack gets his skates on at ice cold alex,chaos as maracana builds up for united,depleted leicester prevail as elliott spoils e...,hungry spurs sense rich pickings,gunners so wide of an easy target,derby raise a glass to strupar s debut double,southgate strikes leeds pay the penalty,...,flintoff injury piles on woe for england,hunters threaten jospin with new battle of the...,kohl s successor drawn into scandal,the difference between men and women,sara denver nurse turned solicitor,diana s landmine crusade put tories in a panic,yeltsin s resignation caught opposition flat f...,russian roulette,sold out,recovering a title
1,scorecard,the best lake scene,leader german sleaze inquiry,cheerio boyo,the main recommendations,has cubie killed fees,has cubie killed fees,has cubie killed fees,hopkins furious at foster s lack of hannibal...,has cubie killed fees,...,on the critical list,the timing of their lives,dear doctor,irish court halts ira man s extradition to nor...,burundi peace initiative fades after rebels re...,pe points the way forward to the ecb,campaigners keep up pressure on nazi war crime...,jane ratcliffe,yet more things you wouldn t know without the ...,millennium bug fails to bite


In [15]:
# A list comprehension does not fit here: it returns a plain list of Series, not a DataFrame, so 'train_data.head()' would fail
# train_data = [train_data[col].str.lower() for col in train_data.columns]

### <font color='light gray'>Combining all columns of the first row into a "single line row string"

In [16]:
# zeroth row 0 and columns 0 to 24 (total 25 in count)
' '.join(str(x) for x in train_data.iloc[0, 0:25])

'a  hindrance to operations   extracts from the leaked reports scorecard hughes  instant hit buoys blues jack gets his skates on at ice cold alex chaos as maracana builds up for united depleted leicester prevail as elliott spoils everton s party hungry spurs sense rich pickings gunners so wide of an easy target derby raise a glass to strupar s debut double southgate strikes  leeds pay the penalty hammers hand robson a youthful lesson saints party like it s      wear wolves have turned into lambs stump mike catches testy gough s taunt langer escapes to hit     flintoff injury piles on woe for england hunters threaten jospin with new battle of the somme kohl s successor drawn into scandal the difference between men and women sara denver  nurse turned solicitor diana s landmine crusade put tories in a panic yeltsin s resignation caught opposition flat footed russian roulette sold out recovering a title'

### <font color='light gray'>Build the headlines list by joining the columns of every row into one string, as done above for the first row

In [17]:
headlines = []
# print('length', (train_data.shape[0]))
for row in range(0, train_data.shape[0]):
  headlines.append(' '.join(str(x) for x in train_data.iloc[row, 0:25]))

headlines[1256]
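
The row-by-row loop can also be written without an explicit loop. The sketch below uses a made-up two-column frame and `DataFrame.agg` with `axis=1`; it is an alternative, not the approach used in this notebook.

```python
import pandas as pd

# Made-up frame with string column names, like train_data after renaming.
toy = pd.DataFrame({
    "0": ["stocks rise", "oil falls"],
    "1": ["banks rally", "gold steady"],
})

# Loop version, mirroring the cell above.
loop_version = [" ".join(str(x) for x in toy.iloc[row, 0:2])
                for row in range(toy.shape[0])]

# Row-wise join: astype(str) guards against NaN, agg applies " ".join to each row.
vectorized = toy.astype(str).agg(" ".join, axis=1).tolist()

print(vectorized)
```

Both versions produce one string per row, which is the document format `CountVectorizer` expects.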



## <font color='light gray'>Implementation of Bag of Words (BOW)<br><br>sklearn.feature_extraction.text.CountVectorizer
**CountVectorizer** is a tool provided by the scikit-learn library in Python. It transforms text into a vector based on the frequency (count) of each word that occurs in the text.

# [What is CountVectorizer in NLP?](https://spotintelligence.com/2023/05/17/countvectorizer/)

**CountVectorizer** is a text preprocessing technique commonly used in "natural language processing" **(NLP)** tasks for converting a collection of text documents into a numerical representation. It is part of the scikit-learn library, a popular machine learning library in Python.
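
A minimal sketch of what this numerical representation looks like, using two made-up documents and the default unigram settings:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents.
docs = ["stocks rise", "stocks fall fall"]

cv_demo = CountVectorizer()
counts = cv_demo.fit_transform(docs)   # sparse matrix: one row per document

# Vocabulary terms ordered by their column index.
vocab = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)

print(vocab)              # ['fall', 'rise', 'stocks']
print(counts.toarray())   # [[0 1 1]
                          #  [2 0 1]]
```

Each column corresponds to one vocabulary term and each cell holds how often that term occurs in the document.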

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

In [19]:
cv = CountVectorizer(ngram_range=(2,2))
train_data = cv.fit_transform(headlines)
train_data

<3723x539932 sparse matrix of type '<class 'numpy.int64'>'
	with 968648 stored elements in Compressed Sparse Row format>

## <font color='light gray'>A random forest classifier<br><br>sklearn.ensemble.RandomForestClassifier
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.

* **n_estimators : int, default=100**
The number of trees in the forest.

Changed in version 0.22: The default value of n_estimators changed from 10 to 100 in 0.22.

* **criterion : {“gini”, “entropy”, “log_loss”}, default=”gini”**
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “log_loss” and “entropy” both for the Shannon information gain, see Mathematical formulation. Note: This parameter is tree-specific.

### Entropy:
Entropy quantifies the amount of surprise or uncertainty associated with the outcome of a random variable. A higher entropy indicates higher uncertainty or randomness, while lower entropy indicates lower uncertainty.

*For example, a fair coin toss has maximum entropy because there is equal probability for each outcome (heads or tails), while a biased coin with a higher probability of landing on heads would have lower entropy because there's less uncertainty about the outcome. Similarly, in data science and machine learning, entropy is often used as a measure of impurity in decision tree algorithms for classification tasks.*

In a decision tree, this is used to measure the impurity of a node. The goal is to minimize entropy, which means maximizing the homogeneity of the nodes. In other words, the algorithm tries to find splits that result in subsets of the data where the classes are as pure as possible.
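
The coin example can be worked through numerically with Shannon entropy, $H = -\sum_i p_i \log_2 p_i$. The helper name `entropy` below is introduced only for this sketch.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

fair = entropy([0.5, 0.5])     # fair coin: maximum uncertainty, 1 bit
biased = entropy([0.9, 0.1])   # biased coin: about 0.47 bits
pure = entropy([1.0, 0.0])     # a pure node: no uncertainty

print(fair, round(biased, 3), pure)
```

A split that moves a node from a 50/50 class mix towards a 90/10 mix therefore reduces entropy, which is the quantity `criterion='entropy'` asks each tree to minimise.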

In [20]:
rfc = RandomForestClassifier(n_estimators=200, criterion='entropy')
rfc.fit(train_data, train['Label']) # independent var., dependent var.

## <font color='light gray'>Prediction for the Test Dataset

In [21]:
test.head(2)

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
3723,2015-01-02,1,Most cases of cancer are the result of sheer b...,Iran dismissed United States efforts to fight ...,Poll: One in 8 Germans would join anti-Muslim ...,UK royal family's Prince Andrew named in US la...,Some 40 asylum-seekers refused to leave the bu...,Pakistani boat blows self up after India navy ...,Sweden hit by third mosque arson attack in a week,940 cars set alight during French New Year,...,Ukrainian minister threatens TV channel with c...,Palestinian President Mahmoud Abbas has entere...,Israeli security center publishes names of 50 ...,The year 2014 was the deadliest year yet in Sy...,A Secret underground complex built by the Nazi...,Restrictions on Web Freedom a Major Global Iss...,Austrian journalist Erich Mchel delivered a pr...,Thousands of Ukraine nationalists march in Kiev,Chinas New Years Resolution: No More Harvestin...,Authorities Pull Plug on Russia's Last Politic...
3724,2015-01-05,0,Moscow-&gt;Beijing high speed train will reduc...,Two ancient tombs were discovered in Egypt on ...,China complains to Pyongyang after N Korean so...,Scotland Headed Towards Being Fossil Fuel-Free...,Prime Minister Shinzo Abe said Monday he will ...,Sex slave at centre of Prince Andrew scandal f...,Gay relative of Hamas founder faces deportatio...,The number of female drug addicts in Iran has ...,...,The Islamic State has approved a 2015 budget o...,"Iceland To Withdraw EU Application, Lift Capit...",Blackfield Capital Founder Goes Missing: The v...,Rocket stage crashes back to Earth in rural Ch...,2 Dead as Aircraft Bombs Greek Tanker in Libya...,Belgian murderer Frank Van Den Bleeken to die ...,Czech President criticizes Ukrainian PM; says ...,3 Vietnamese jets join search for 16 missing F...,France seeks end to Russia sanctions over Ukraine,China scraps rare earths caps


In [22]:
test.shape
test.shape[0]
test.shape[1]

27

In [23]:
# for row in range(0, test.shape[0]):
#   test_headlines.append(' '.join(str(x) for x in test.iloc[row, 2:27]))

# build the list directly; calling append() inside a comprehension would collect only None values
test_headlines = [' '.join(str(x) for x in test.iloc[row, 2:27]) for row in range(0, test.shape[0])]
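
Once the test headlines are joined, they need to be vectorized with the vocabulary learned from the training set. The sketch below shows the general fit-on-train, transform-on-test pattern on made-up documents; the names `vec`, `clf` and the toy data are assumptions of this sketch, not cells from the notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training and test documents with made-up labels.
train_docs = ["stocks rally on earnings", "markets fall on war fears",
              "stocks rally again", "markets fall further"]
train_labels = [1, 0, 1, 0]
test_docs = ["stocks rally", "markets fall on unknown news"]

vec = CountVectorizer(ngram_range=(2, 2))
X_train = vec.fit_transform(train_docs)   # learns the bigram vocabulary
X_test = vec.transform(test_docs)         # reuses it; unseen bigrams are ignored

clf = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
clf.fit(X_train, train_labels)
pred = clf.predict(X_test)

print(X_train.shape[1] == X_test.shape[1])
print(pred)
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the resulting columns would no longer match what the classifier was trained on.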

In [24]:
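# No stop-word filtering is applied by default ('an' and 'the' are kept). The default token_pattern
# keeps only tokens of 2+ characters, so 'a' is dropped before n-gram extraction, hence "apple day".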
v = CountVectorizer(ngram_range=(1, 2))
print(v.fit(["an apple a day keeps the doctor away"]).vocabulary_)

{'an': 0, 'apple': 2, 'day': 5, 'keeps': 9, 'the': 11, 'doctor': 7, 'away': 4, 'an apple': 1, 'apple day': 3, 'day keeps': 6, 'keeps the': 10, 'the doctor': 12, 'doctor away': 8}


In [25]:
v = CountVectorizer(ngram_range=(2, 2))
print(v.fit(["an apple a day keeps the doctor away"]).vocabulary_)

{'an apple': 0, 'apple day': 1, 'day keeps': 2, 'keeps the': 4, 'the doctor': 5, 'doctor away': 3}
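
The bigram `'apple day'` appears because the default `token_pattern` of `CountVectorizer` only matches tokens of two or more word characters, so `'a'` is dropped before bigrams are formed. As a hedged illustration, the sketch below passes an alternative pattern (chosen for this example) that also keeps one-character tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["an apple a day keeps the doctor away"]

# Default token_pattern r"(?u)\b\w\w+\b" needs at least two word characters.
default_vocab = CountVectorizer(ngram_range=(2, 2)).fit(sentence).vocabulary_

# Alternative pattern (an assumption of this sketch) that accepts one-character tokens.
keep_short = CountVectorizer(ngram_range=(2, 2), token_pattern=r"(?u)\b\w+\b")
short_vocab = keep_short.fit(sentence).vocabulary_

print(sorted(default_vocab))
print(sorted(short_vocab))
```

With one-character tokens kept, `'apple day'` is replaced by the two bigrams `'apple a'` and `'a day'`.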
