# <font color='light gray'>Kaggle Competition- Predict Stock Price Movement Based On News Headline using NLP
<font color='gray'>**Credit:**
* https://www.youtube.com/watch?v=h-LGjJ_oANs&list=PLZoTAELRMXVMdJ5sqbCK2LiM0HhQVWNzm&index=13
* https://www.kaggle.com/code/rohit0906/stock-sentiment-analysis-using-news-headlines
* https://saturncloud.io/blog/understanding-the-ngramrange-argument-in-a-countvectorizer-in-sklearn/#:~:text=As%20a%20data%20scientist%20or,a%20given%20corpus%20of%20text.</font>

## <font color='light gray'>Stock Sentiment Analysis using News Headlines
<font color='gray'>**About the problem and the dataset used.**
* The data set in consideration is a combination of the world news and stock price shifts.
* Data ranges from 2008 to 2016 and the data from 2000 to 2008 was scrapped from Yahoo finance.
* There are 25 columns of top news headlines for each day in the data frame.
* Class 1- the stock price increased.
* Class 0- the stock price stayed the same or decreased.

<font color='gray'><br>**About the approach.**
* Used **TF-IDF and Bag of Words** for extracting featues from the headlines.
* Used **Random Forest Classifier, Multinational Naive Bayes and Passive Aggressive Classifier** for analysis.</font>

In [1]:
# url of raw data set from github
url="https://raw.githubusercontent.com/akdubey2k/NLP/main/Stock_Sentiment_Analysis/stock_news_headlines.csv"

**encoding : str, optional, default ‘utf-8’**<br>
Encoding to use for UTF when reading/writing (ex. 'utf-8'). List of Python standard encodings .

In [2]:
import pandas as pd

df = pd.read_csv(filepath_or_buffer=url, encoding='ISO-8859-1')
# this encoding increases the probability that a utf-8-sig encoding can be correctly
# guessed from the byte sequence. So here the BOM is not used to be able to determine
# the byte order used for generating the byte sequence, but as a signature that helps
# in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb,
# 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those
# three bytes if they appear as the first three bytes in the file. In UTF-8, the use
# of the BOM is discouraged and should generally be avoided.')

In [3]:
# display only 2 rows starting from index 0 with data reproducibility
print('The shape of dataset: '.ljust(25, '.'), df.shape)
df.head(2)

The shape of dataset: ... (4101, 27)


Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2000-01-03,0,A 'hindrance to operations': extracts from the...,Scorecard,Hughes' instant hit buoys Blues,Jack gets his skates on at ice-cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl's successor drawn into scandal,The difference between men and women,"Sara Denver, nurse turned solicitor",Diana's landmine crusade put Tories in a panic,Yeltsin's resignation caught opposition flat-f...,Russian roulette,Sold out,Recovering a title
1,2000-01-04,0,Scorecard,The best lake scene,Leader: German sleaze inquiry,"Cheerio, boyo",The main recommendations,Has Cubie killed fees?,Has Cubie killed fees?,Has Cubie killed fees?,...,On the critical list,The timing of their lives,Dear doctor,Irish court halts IRA man's extradition to Nor...,Burundi peace initiative fades after rebels re...,PE points the way forward to the ECB,Campaigners keep up pressure on Nazi war crime...,Jane Ratcliffe,Yet more things you wouldn't know without the ...,Millennium bug fails to bite


## <font color='light gray'>pandas.DataFrame.sample
<font color='gray'>**Return a random sample of items from an axis of object.**</font>
* **n : int, optional**
Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.
* **random_state : int, array-like, BitGenerator, np.random.RandomState, np.random.Generator, optional**
If int, array-like, or BitGenerator, seed for random number generator. If np.random.RandomState or np.random.Generator, use as given.
* **axis : {0 or ‘index’, 1 or ‘columns’, None}, default None**
Axis to sample. Accepts axis number or name. Default is stat axis for given data type. For Series this parameter is unused and defaults to None.
* **ignore_index : bool, default False**
If True, the resulting index will be labeled 0, 1, …, n - 1.

In [4]:
# display only 2 rows with data reproducibility

# df.sample(axis='index', random_state=1, n=2, ignore_index=True)
    # axis='index', no use here
    # if ignore_index=True, then index will be labeled 0, 1, …, n - 1. forcefully.
df.sample(random_state=1, n=2, )

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
3009,2012-03-01,1,Megaupload Founder Defeats US Govt Attempts To...,"End 'destructive' war on pot, panel urges Cana...",Australian philosophers have published an arti...,SOPA Ireland Signed Into Law: Ireland has pass...,A couple who tortured and killed 15-year-old K...,Britain was no longer safe territory for Murdo...,Anger Erupts Again Over China's Bear Bile Farm...,Only 19% of Israelis Support a Unilateral Stri...,...,Qatar crosses the Syrian Rubicon: 63m to buy w...,Iran recently offered to supply Pakistan with ...,In A Baghdad ER --- \r\n\r\n\r\nIraqi junior d...,Indonesia's Endangered Sumatran Tiger's Habita...,China drafts legal proposal to completely shut...,Pakistan rejects US pressure on Iran-Pakistan ...,The Warlord and the Basketball Star Dikembe M...,"A mind-boggling 40,000 trillion becquerels of ...",Hong Kong Airlines Ltd has threatened to cance...,"North Korea: What does 240,000 metric tons of ..."
2990,2012-02-02,0,Bulgarian ISPs Rise against ACTA (xpost r/evol...,Spain orders $500M in gold found under the Atl...,Kim Jong Un Looking at Things [pics],At least 40 dead in Egypt Football violence,Famous Egyptian actor sentenced to three month...,"Why Britain and Argentina are tussling, again,...",Seized cash triggers political furor in Mexico...,Paulo Coelho calls on readers to pirate books,...,Ecuador Creating Alternative to Neo-Liberal Mo...,"""In the chaotic evacuation of the Costa Concor...",Afghanistan: Leon Panetta signals end to US co...,Hundreds of Egyptians take to streets of Cairo...,EU antitrust authorities rejected plans of a t...,Egyptian parties refuse to commit to womens ri...,"Macedonia Muslims Riot, Burn Historic Church A...",The revelation that lawmakers for the Left Par...,Europe freeze: More deaths in Ukraine and Poland,British MP calls for pardon for 'hero' Turing


In [5]:
# display only 2 rows without data reproducibility
df.sample(n=2,)

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
416,2001-09-05,1,Richard Williams: Comment,Norway 3 - 2 Wales,Northern Ireland 3 - 0 Iceland,Empty seats wipe gloss off victory,Player by player: how they rated,Analysis: Alan Travis on ID cards,England 2 - 0 Albania,Belgium 2 - 0 Scotland,...,Private dental patients pay more,Prom 60 review: Czech Philharmonic,What Bank of England decision means for savers...,Consumers put paid to a rate reduction,Cabaret: Clive Rowe,Bank of England leaves rates at 5%,Richard Williams on Diego Maradona Jr,Brown remains upbeat about Scotland's chances,Bright hope McCartney debuts against Iceland,England Under-21 5 - 0 Albania Under-21
4010,2016-02-24,1,Beijing now has more billionaires than New York,Brazil is building $250m-worth undersea cable ...,MH17 report identifies Russian soldiers suspec...,Pope suggests contraceptives could be used to ...,"PSA Peugeot-Citroen, ninth largest car manufac...",A gay Iranian poet is seeking asylum in Israel...,Canadian Federal Court allows medical marijuan...,BBC TV host Jimmy Savile sexually abused 72 vi...,...,Serious failings at the BBC allowed Jimmy Savi...,"Russia, With Turkey In Mind, Announces Deal To...",ISIS Is Losing Its Capital,France seeks 1.6 billion euros in back taxes f...,US Passes Bill Banning Goods Produced by Child...,Russia seeks joint manned flight to Mars with ...,China Warns U.S. After Trump Wins Nevada Caucus,"Greece 'won't be Lebanon of Europe', says migr...",Reports of hostage taking in London restaurant,Germany deports Afghan refugees in effort to d...


In [6]:
# df.columns.to_list()
df.columns

Index(['Date', 'Label', 'Top1', 'Top2', 'Top3', 'Top4', 'Top5', 'Top6', 'Top7',
       'Top8', 'Top9', 'Top10', 'Top11', 'Top12', 'Top13', 'Top14', 'Top15',
       'Top16', 'Top17', 'Top18', 'Top19', 'Top20', 'Top21', 'Top22', 'Top23',
       'Top24', 'Top25'],
      dtype='object')

In [7]:
# df.dtypes.to_list()
df.dtypes

Date     object
Label     int64
Top1     object
Top2     object
Top3     object
Top4     object
Top5     object
Top6     object
Top7     object
Top8     object
Top9     object
Top10    object
Top11    object
Top12    object
Top13    object
Top14    object
Top15    object
Top16    object
Top17    object
Top18    object
Top19    object
Top20    object
Top21    object
Top22    object
Top23    object
Top24    object
Top25    object
dtype: object

## <font color='light gray'>Divide data into training and test set as per date-wise

In [8]:
# this won't work, as this will show only boolean values,
# and we need content of dataset as per datewise, please see below...
train = df['Date'] < '2008-01-01'
train

0        True
1        True
2        True
3        True
4        True
        ...  
4096    False
4097    False
4098    False
4099    False
4100    False
Name: Date, Length: 4101, dtype: bool

In [9]:
# load the content of dataset as per datewise, which are below the date...
train = df[df['Date'] < '2015-01-01']
print('Shape of train'.ljust(20, '.'), train.shape)
train.head(2)

Shape of train...... (3723, 27)


Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2000-01-03,0,A 'hindrance to operations': extracts from the...,Scorecard,Hughes' instant hit buoys Blues,Jack gets his skates on at ice-cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl's successor drawn into scandal,The difference between men and women,"Sara Denver, nurse turned solicitor",Diana's landmine crusade put Tories in a panic,Yeltsin's resignation caught opposition flat-f...,Russian roulette,Sold out,Recovering a title
1,2000-01-04,0,Scorecard,The best lake scene,Leader: German sleaze inquiry,"Cheerio, boyo",The main recommendations,Has Cubie killed fees?,Has Cubie killed fees?,Has Cubie killed fees?,...,On the critical list,The timing of their lives,Dear doctor,Irish court halts IRA man's extradition to Nor...,Burundi peace initiative fades after rebels re...,PE points the way forward to the ECB,Campaigners keep up pressure on Nazi war crime...,Jane Ratcliffe,Yet more things you wouldn't know without the ...,Millennium bug fails to bite


In [10]:
# load the content of dataset as per datewise, which are above the date...
test = df[df.Date > '2015-02-01']
print('Shape of test'.ljust(20, '.'), test.shape)
test.head(2)

Shape of test....... (358, 27)


Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
3743,2015-02-02,1,Westminster child abuse scandal: KGB and CIA k...,A new guard for Asgard: Iceland building first...,Pagan priest wants the theft of a statue of th...,'Suppressed' EU report could have banned pesti...,The 4 surviving copies of the 1215 Magna Carta...,Tech pioneer Phil Zimmermann calls David Camer...,Thousands march for democracy in Hong Kong,"ISIS getting 'desperate,' struggling to replen...",...,The pro-Russian separatist leader in eastern U...,"Obama: Greece needs growth, not more austerity",One of two Russian bombers carried a nuke whil...,SpaceX and Google form joint partnership to br...,Tens of thousands of longtime Palestinian refu...,"Fire Guts Major Russian Library, Destroying Mi...","8,700 people file lawsuit against Japanese new...",Michelangelo bronzes discovered.,US considers providing arms to Ukraine as rebe...,Syriza-led Greek parliament will never ratify ...
3744,2015-02-03,1,"NASA is planning a mission to Europa, one of t...",Two-year-old Indian boy's heart beats in that ...,Over 100 drugged and raped in Japan fake clini...,CCTV footage exposes slaughterhouse cruelty: S...,Worlds most expensive drug which costs up to ...,"President Obama: ""US deploying all available a...",Kim Jong-un Says N.Korean Poverty Keeps Him Up...,French troops kill around a dozen Islamist mil...,...,Canada sends robots to Kurdistan to help clear...,Police alerted to planned march against Jewifi...,Head of U.N. inquiry into Gaza conflict to qui...,Casualties in shopping centre explosion in Perth.,US increasingly concerned that Russia is inten...,An entire city is going to be wiped off the ma...,"Barack Obama proposes over $1 billion civil, m...","Head of UN Gaza inquiry quits, cites Israel's ...",Suicides in Greece surged by a third after the...,Indian government launches its own low cost me...


<font color='dark gray'>***this is not a classification problem, so we can not use "train_test_split" method***

## <font color='light gray'>Data cleaning from training set.

In [11]:
# all 4101 rows, and 2 to 26 columns included. Ideally removing 'Date' & 'Label' columns.
train_data = train.iloc[:, 2:]
train_data.head(2)

Unnamed: 0,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,Top9,Top10,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,A 'hindrance to operations': extracts from the...,Scorecard,Hughes' instant hit buoys Blues,Jack gets his skates on at ice-cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,Derby raise a glass to Strupar's debut double,"Southgate strikes, Leeds pay the penalty",...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl's successor drawn into scandal,The difference between men and women,"Sara Denver, nurse turned solicitor",Diana's landmine crusade put Tories in a panic,Yeltsin's resignation caught opposition flat-f...,Russian roulette,Sold out,Recovering a title
1,Scorecard,The best lake scene,Leader: German sleaze inquiry,"Cheerio, boyo",The main recommendations,Has Cubie killed fees?,Has Cubie killed fees?,Has Cubie killed fees?,Hopkins 'furious' at Foster's lack of Hannibal...,Has Cubie killed fees?,...,On the critical list,The timing of their lives,Dear doctor,Irish court halts IRA man's extradition to Nor...,Burundi peace initiative fades after rebels re...,PE points the way forward to the ECB,Campaigners keep up pressure on Nazi war crime...,Jane Ratcliffe,Yet more things you wouldn't know without the ...,Millennium bug fails to bite


### <font color='light gray'>Removing punctuations
### https://regex101.com/

In [12]:
# except alphabets everythinng has to be removed
'''
[^a-zA-Z] : A character not in the range of a-z or A-Z
'''
# train_data.replace(['^a-zA-Z'], '', regex=True, inplace=True) wrong assignment
train_data.replace('[^a-zA-Z]', ' ', regex=True, inplace=True)
train_data.head(2)

Unnamed: 0,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,Top9,Top10,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,A hindrance to operations extracts from the...,Scorecard,Hughes instant hit buoys Blues,Jack gets his skates on at ice cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,Derby raise a glass to Strupar s debut double,Southgate strikes Leeds pay the penalty,...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl s successor drawn into scandal,The difference between men and women,Sara Denver nurse turned solicitor,Diana s landmine crusade put Tories in a panic,Yeltsin s resignation caught opposition flat f...,Russian roulette,Sold out,Recovering a title
1,Scorecard,The best lake scene,Leader German sleaze inquiry,Cheerio boyo,The main recommendations,Has Cubie killed fees,Has Cubie killed fees,Has Cubie killed fees,Hopkins furious at Foster s lack of Hannibal...,Has Cubie killed fees,...,On the critical list,The timing of their lives,Dear doctor,Irish court halts IRA man s extradition to Nor...,Burundi peace initiative fades after rebels re...,PE points the way forward to the ECB,Campaigners keep up pressure on Nazi war crime...,Jane Ratcliffe,Yet more things you wouldn t know without the ...,Millennium bug fails to bite


### <font color='light gray'>Renaming column names

In [13]:
train.shape # number of rows & columns

(3723, 27)

In [14]:
train.shape[1] # number of columns

27

In [15]:
# Renaming column names for ease of accessness
# generating number from 0 to 24 (total 25 in count) and storing in list

# list1 = []
# for i in range(train.shape[1] - 2):
#     list1.append(str(i))
# print(list1)

# list comprehensive
new_index = [str(i) for i in range(train.shape[1] - 2)] # this is nothing but append operation
train_data.columns = new_index
train_data.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
0,A hindrance to operations extracts from the...,Scorecard,Hughes instant hit buoys Blues,Jack gets his skates on at ice cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,Derby raise a glass to Strupar s debut double,Southgate strikes Leeds pay the penalty,...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl s successor drawn into scandal,The difference between men and women,Sara Denver nurse turned solicitor,Diana s landmine crusade put Tories in a panic,Yeltsin s resignation caught opposition flat f...,Russian roulette,Sold out,Recovering a title
1,Scorecard,The best lake scene,Leader German sleaze inquiry,Cheerio boyo,The main recommendations,Has Cubie killed fees,Has Cubie killed fees,Has Cubie killed fees,Hopkins furious at Foster s lack of Hannibal...,Has Cubie killed fees,...,On the critical list,The timing of their lives,Dear doctor,Irish court halts IRA man s extradition to Nor...,Burundi peace initiative fades after rebels re...,PE points the way forward to the ECB,Campaigners keep up pressure on Nazi war crime...,Jane Ratcliffe,Yet more things you wouldn t know without the ...,Millennium bug fails to bite


### <font color='light gray'>Convertng headlines to lower case

In [16]:
# Convertng headlines to lower case
for i in new_index:
    train_data[i] = train_data[i].str.lower()

train_data.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
0,a hindrance to operations extracts from the...,scorecard,hughes instant hit buoys blues,jack gets his skates on at ice cold alex,chaos as maracana builds up for united,depleted leicester prevail as elliott spoils e...,hungry spurs sense rich pickings,gunners so wide of an easy target,derby raise a glass to strupar s debut double,southgate strikes leeds pay the penalty,...,flintoff injury piles on woe for england,hunters threaten jospin with new battle of the...,kohl s successor drawn into scandal,the difference between men and women,sara denver nurse turned solicitor,diana s landmine crusade put tories in a panic,yeltsin s resignation caught opposition flat f...,russian roulette,sold out,recovering a title
1,scorecard,the best lake scene,leader german sleaze inquiry,cheerio boyo,the main recommendations,has cubie killed fees,has cubie killed fees,has cubie killed fees,hopkins furious at foster s lack of hannibal...,has cubie killed fees,...,on the critical list,the timing of their lives,dear doctor,irish court halts ira man s extradition to nor...,burundi peace initiative fades after rebels re...,pe points the way forward to the ecb,campaigners keep up pressure on nazi war crime...,jane ratcliffe,yet more things you wouldn t know without the ...,millennium bug fails to bite


In [17]:
len(new_index)

25

In [18]:
# list comprehension of above code, somehow not able to resolve it
# train_data = [train_data[i].str.lower() for i in new_index]
# train_data = [train_data[i].str.lower() if i in new_index else train_data[i] for i in range(len(train_data))]
# train_data = [train_data[i].str.lower() if i in new_index else train_data[i] for i in range(len(new_index))]
# train_data

### <font color='light gray'>Combining all columns data into a "single line row string"

In [19]:
# zeroth row 0 and columns 0 to 24 (total 25 in count)
' '.join(str(x) for x in train_data.iloc[0, 0:25])

'a  hindrance to operations   extracts from the leaked reports scorecard hughes  instant hit buoys blues jack gets his skates on at ice cold alex chaos as maracana builds up for united depleted leicester prevail as elliott spoils everton s party hungry spurs sense rich pickings gunners so wide of an easy target derby raise a glass to strupar s debut double southgate strikes  leeds pay the penalty hammers hand robson a youthful lesson saints party like it s      wear wolves have turned into lambs stump mike catches testy gough s taunt langer escapes to hit     flintoff injury piles on woe for england hunters threaten jospin with new battle of the somme kohl s successor drawn into scandal the difference between men and women sara denver  nurse turned solicitor diana s landmine crusade put tories in a panic yeltsin s resignation caught opposition flat footed russian roulette sold out recovering a title'

### <font color='light gray'>Create headlines by combining all columns data into row-wise string, as above code for a "single line row string"

In [20]:
headlines = []
for row in range(0, train_data.shape[0]):
    headlines.append(' '.join(str(x) for x in train_data.iloc[row, 0:25]))
headlines[1297]
# print(len(train_data.index))
# train_data.shape[0]

'aldershot town       carlisle united stevenage borough       hereford united charlton athletic       everton rooney dropped over swearing west bromwich albion       arsenal mourinho happy with offer of up to       m a year two thirds of first time voters care about key issues  but still will not vote lib dem targeting of tory leaders is falling short  poll suggests ted wragg  electioneering damages debate read my lips  whatever happened to that promise  martin kettle  a relative defeat will in fact serve to strengthen blairism john brennan  manifestos hint greater policy priority for fe mg rover workers receive      m howard  the man with a plan  puts his faith in victory in face of defeat press review  the european papers weigh up the election campaign talking up  talking down in tight race switching sides  the myths and the reality tony woodley  an end to passivity   days to go going well over the top  the link between monty and mikey letters  choice at the ballot box home wind turb

## <font color='light gray'>Implementation of Bag of Words (BOW)<br><br>sklearn.feature_extraction.text.CountVectorizer
**CountVectorizer** is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

In [22]:
cv = CountVectorizer(ngram_range=(2,2))
train_data = cv.fit_transform(headlines)
train_data

<3723x539932 sparse matrix of type '<class 'numpy.int64'>'
	with 968648 stored elements in Compressed Sparse Row format>

## <font color='light gray'>A random forest classifier<br><br>sklearn.ensemble.RandomForestClassifier
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.

* **n_estimators : int, default=100**
The number of trees in the forest.

Changed in version 0.22: The default value of n_estimators changed from 10 to 100 in 0.22.

* **criterion : {“gini”, “entropy”, “log_loss”}, default=”gini”**
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “log_loss” and “entropy” both for the Shannon information gain, see Mathematical formulation. Note: This parameter is tree-specific.

In [23]:
rfc = RandomForestClassifier(n_estimators=200, criterion='entropy')
rfc.fit(train_data, train['Label']) # independent var., dependent var.

## <font color='light gray'>Prediction for the Test Dataset

In [24]:
test_headlines = []
for row in range(0, test.shape[0]):
    test_headlines.append(' '.join(str(x) for x in test.iloc[row, 0:25]))

test_data = cv.transform(test_headlines)
test_pred = rfc.predict(test_data)
test_headlines[2]

'2015-02-04 1 Draft of Arrest Warrant for Argentine President Found at Dead Prosecutors Home New allegations of Saudi involvement in 9/11 ISIS Burns Jordanian Pilot Alive Jordan executes two Iraqi militants in response to pilot\'s death. The US has lost control of 400 million dollars worth of weapons in Yemen Jets bomb Boko Haram in Nigeria\'s first major offensive Taiwan TransAsia plane crash-lands in Taipei river ISIS captors \'didn\'t even have the Quran,\' says former hostage Jordan to execute "within hours" jailed woman militant it had sought to swap for pilot killed by Islamic State Isis set up giant screens in Raqqa showing Jordanian pilot burning to death cheered on by crowds Putin asks Ukraine to repay a $3 billion loan because Russia needs the money to fight its financial crisis ISIS throws gay Syrian man seven stories; the man survives, but is then stoned to death The head of Sunni Islams most respected seat of learning has expressed his outrage over the purported burning to

## <font color='light gray'>Classification report

In [25]:
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
print(confusion_matrix(test['Label'], test_pred, ))
print(classification_report(test['Label'], test_pred))
print(accuracy_score(test['Label'], test_pred))

[[ 16 159]
 [ 11 172]]
              precision    recall  f1-score   support

           0       0.59      0.09      0.16       175
           1       0.52      0.94      0.67       183

    accuracy                           0.53       358
   macro avg       0.56      0.52      0.41       358
weighted avg       0.56      0.53      0.42       358

0.5251396648044693
