# Mandate-2
Pratyush Ranjan IMT2019065\
Archit Sangal IMT2019012

## Downloading and importing necessary packages

### Downloading nltk utilities

In [None]:
import nltk
nltk.download('all')

### Downloading contractions library for expanding contractions

In [None]:
!pip install contractions

Collecting contractions
  Downloading contractions-0.1.68-py2.py3-none-any.whl (8.1 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting anyascii
  Downloading anyascii-0.3.1-py3-none-any.whl (287 kB)

### Importing required libraries

In [None]:
import pandas as pd
import numpy as np
from textblob import TextBlob
import contractions
import re
from bs4 import BeautifulSoup
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
from gensim.models import Word2Vec
import multiprocessing
from time import time
from sklearn.cluster import KMeans

## Function to remove punctuation

This function uses TextBlob to tokenise the given sentence into words, which drops punctuation marks and symbols such as @, # and %.

Input: @nifty crashing heavily, by 20%, due to situation in UKR! #sell #bearish

Output: nifty crashing heavily by 20 due to situation in UKR sell bearish

In [None]:
def remove_punctuations(text):
    text_blob = TextBlob(text)
    return ' '.join(text_blob.words)

remove_punctuations("@nifty crashing heavily, by 20%, due to situation in UKR! #sell #bearish")

'nifty crashing heavily by 20 due to situation in UKR sell bearish'

## Function to remove stopwords

This function removes common English words such as a, an, the, by and after, using NLTK's English stopword list. Matching is case-insensitive.

Input: SBI shares fell by nearly 20% after government announced a new bill proposing increased land taxes.

Output: SBI shares fell nearly 20% government announced new bill proposing increased land taxes.

In [None]:
stopwords = nltk.corpus.stopwords.words('english')

def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word.lower() not in stopwords])

remove_stopwords("SBI shares fell by nearly 20% after government announced a new bill proposing increased land taxes.")

'SBI shares fell nearly 20% government announced new bill proposing increased land taxes.'
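One behaviour worth noting: because `remove_stopwords` splits on whitespace, a stopword with punctuation attached (for example `by,`) does not match the list and is kept. A minimal sketch, using a small hypothetical stopword set (`DEMO_STOPWORDS`) in place of NLTK's list so that it runs without any downloads:

```python
# Hypothetical stand-in for nltk.corpus.stopwords.words('english')
DEMO_STOPWORDS = {"a", "an", "the", "by", "after"}

def remove_stopwords_demo(text, stopwords=DEMO_STOPWORDS):
    # Same logic as remove_stopwords: whitespace split, case-insensitive match
    return ' '.join(word for word in text.split() if word.lower() not in stopwords)

plain = remove_stopwords_demo("Shares fell by 20% after the bill")
attached = remove_stopwords_demo("Shares fell by, and after, the bill")

print(plain)     # stopwords standing alone are removed
print(attached)  # "by," and "after," survive because of the trailing comma
```

In the pipeline further down, punctuation is removed before stopwords are filtered, which sidesteps this case.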

## Function to expand contractions

This function expands the contractions in a sentence, such as converting I'll to I will, We'd to We would and so on.

Input: If the govn **hadn't** released this surprise bill, **I'd** be sitting with my profits. But unfortunately **I'll** have to sell all of it now. **It's** totally disappointing.

Output: If the govn **had not** released this surprise bill, **I would** be sitting with my profits. But unfortunately **I will** have to sell all of it now. **It is** totally disappointing.



In [None]:
def expand_contractions(text):
    return ' '.join([contractions.fix(word) for word in text.split()])

expand_contractions("If the govn hadn't released this surprise bill, I'd be sitting with my profits. But unfortunately I'll\
 have to sell all of it now. It's totally disappointing.")

'If the govn had not released this surprise bill, I would be sitting with my profits. But unfortunately I will have to sell all of it now. It is totally disappointing.'
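Conceptually, contraction expansion is a lookup from contracted forms to expanded forms. A minimal sketch with a hypothetical four-entry mapping (`DEMO_CONTRACTIONS`); the `contractions` library ships a far larger table and handles more cases, so this only illustrates the idea and is not a description of its implementation:

```python
# Hypothetical mapping for illustration only
DEMO_CONTRACTIONS = {
    "hadn't": "had not",
    "i'd": "I would",
    "i'll": "I will",
    "it's": "it is",
}

def expand_demo(text):
    out = []
    for word in text.split():
        expanded = DEMO_CONTRACTIONS.get(word.lower())
        if expanded is None:
            out.append(word)  # not a known contraction, keep as-is
        elif word[0].isupper() and expanded[0].islower():
            # Preserve sentence-initial capitalisation
            out.append(expanded[0].upper() + expanded[1:])
        else:
            out.append(expanded)
    return ' '.join(out)

expanded_text = expand_demo("It's late and I'll sell if it hadn't risen")
print(expanded_text)
```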

## Function to lemmatise text
This function uses the WordNet lemmatiser to lemmatise a given text, treating every word as a verb (part-of-speech tag 'v').

Input: The stock market is going to rocket sky high today because the penny stocks are shooting up.

Output: The stock market be go to rocket sky high today because the penny stock be shoot up.

In [None]:
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()

def lemmatize(text):
    return ' '.join([lemmatizer.lemmatize(word, 'v') for word in text.split()])

lemmatize("The stock market is going to rocket sky high today because the penny stocks are shooting up.")

'The stock market be go to rocket sky high today because the penny stock be shoot up.'

## Function to clean text

This function cleans a given text in five steps - 

1.   Strips any HTML markup using BeautifulSoup.
2.   Removes any @ tags.
3.   Removes any web links.
4.   Removes characters other than letters, '.', '?', '!' and apostrophes.
5.   Collapses runs of spaces into a single space.

Input: What?    I am surprised!   Penny stocks are making people millionares in less than 24hrs.  See yourself! https://youtube.com/pennystocks

Output: What? I am surprised! Penny stocks are making people millionares in less than hrs. See yourself!



In [None]:
# 1. Strip HTML markup
# 2. Remove tags eg: @pratyush
# 3. Remove web links eg: https://www.yahoo.com/news/stocks
# 4. Remove characters other than letters, '.', apostrophes and expression symbols like ?, !
# 5. Collapse extra spaces
def clean(text):
    text = BeautifulSoup(text, 'lxml').get_text()
    text = re.sub(r"@[A-Za-z0-9]+", ' ', text)
    text = re.sub(r"https?://[A-Za-z0-9./]+", ' ', text)
    text = re.sub(r"[^a-zA-Z.!?']", ' ', text)
    text = re.sub(r" +", ' ', text)

    return text

clean("What?   I am surprised!   Penny stocks are making people millionares in less than 24hrs.  See yourself! https://youtube.com\
/pennystocks")

'What? I am surprised! Penny stocks are making people millionares in less than hrs. See yourself! '
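To see what each substitution contributes, the regular expressions from `clean` can be applied one at a time. A minimal sketch on a hypothetical input string; the BeautifulSoup step is left out so that the example has no third-party dependencies:

```python
import re

# Hypothetical input, chosen to exercise every substitution
text = "Hi @john, see https://example.com/a now!! 24hrs"

steps = [
    ("remove @ tags",         r"@[A-Za-z0-9]+"),
    ("remove web links",      r"https?://[A-Za-z0-9./]+"),
    ("drop other characters", r"[^a-zA-Z.!?']"),
    ("collapse spaces",       r" +"),
]

for name, pattern in steps:
    # Every pattern is replaced by a single space, as in clean()
    text = re.sub(pattern, ' ', text)
    print(f"{name:22} -> {text!r}")

cleaned = text
```

The comma and the digits disappear only at the third step, and the gaps they leave are closed by the final substitution.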

## Function for spelling correction using Jaccard Distance

Given a word, this function first looks it up in the list of correct words from nltk's words corpus. If no match is found, it goes through every corpus word that starts with the same letter and compares the two using the Jaccard distance between their sets of character bigrams (ngrams of size 2). The word with the smallest Jaccard distance is returned as the correction.

Input: I am **delightede** to **anoucnce** **thate** our **comapany** is now listed on Stock **mareket**.

Output: I am **delighted** to **announce** **that** our **company** is now listed on Stock **market**.

In [None]:
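# Spelling correction using Jaccard distance over character bigrams

correct_words = nltk.corpus.words.words()
correct_words_set = set(correct_words)  # a set gives fast membership checks

def correct_spellings_jaccard(text):
    word_list = text.split()

    for i, word in enumerate(word_list):
        word = word.lower()
        if word not in correct_words_set:
            scores = {}
            # Only compare against corpus words sharing the first letter
            for w in correct_words:
                if w[0] == word[0]:
                    try:
                        scores[w] = jaccard_distance(set(ngrams(word, 2)), set(ngrams(w, 2)))
                    except ZeroDivisionError:
                        # Both bigram sets are empty (single-character words)
                        continue
            # Keep the candidate with the smallest distance, if any was scored
            word_list[i] = min(scores, key=scores.get) if scores else word
    return ' '.join(word_list)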

correct_spellings_jaccard('I am delightede to anoucnce thate our comapany is now listed on Stock mareket')

'I am delighted to announce that our company is now listed on Stock market'
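The distance itself is easy to work through by hand. A minimal sketch in plain Python that represents bigrams as two-character strings (nltk's `ngrams` yields tuples, which behave the same way under set operations) and compares the misspelling "mareket" with two candidates:

```python
def char_bigrams(word):
    # Set of adjacent character pairs, e.g. "cat" -> {"ca", "at"}
    return {word[i:i + 2] for i in range(len(word) - 1)}

def jaccard(a, b):
    # Jaccard distance = 1 - |A intersect B| / |A union B|
    union = a | b
    return (len(union) - len(a & b)) / len(union)

misspelt = char_bigrams("mareket")
d_market = jaccard(misspelt, char_bigrams("market"))  # 4 shared bigrams out of 7
d_marker = jaccard(misspelt, char_bigrams("marker"))  # 3 shared bigrams out of 8

print(round(d_market, 3), round(d_marker, 3))
```

"market" shares more bigrams with the misspelling and therefore has the smaller distance, so it would be preferred over "marker".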

## Text Processing Pipeline

Pipeline function to process a given text. It has a total of 9 steps, of which step 5 (spelling correction) is currently commented out in the code - 


1.   Converting text to lowercase
2.   Cleaning text using clean() function
3.   Removing punctuation using remove_punctuations() function
4.   Expanding contractions using expand_contractions() function
5.   Correcting spellings using correct_spellings_jaccard() function (disabled)
6.   Removing stopwords using remove_stopwords() function
7.   Lemmatising text using lemmatize() function
8.   Cleaning text again using clean() to remove any extra spaces created
9.   Removing instances of 's in the text


In [None]:
def pipeline(text):
    text = text.lower()
    text = clean(text)
    text = remove_punctuations(text)
    text = expand_contractions(text)
    #text = correct_spellings_jaccard(text)
    text = remove_stopwords(text)
    text = lemmatize(text)
    text = clean(text)
    text = text.replace('\'s', '')

    return text
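The same chain can also be written as a list of steps applied in order, which makes it straightforward to switch a step such as spelling correction on or off. A minimal sketch; `run_pipeline` and the stand-in steps in `demo_steps` are hypothetical and are used only so that the example runs without NLTK or TextBlob:

```python
from functools import reduce

def run_pipeline(text, steps):
    # Feed the output of each step into the next one
    return reduce(lambda acc, step: step(acc), steps, text)

# Hypothetical stand-ins for the notebook's functions
demo_steps = [
    str.lower,
    lambda t: t.replace("'s", ''),   # drop possessive markers
    lambda t: ' '.join(t.split()),   # collapse repeated whitespace
]

result = run_pipeline("The  Company 's Loans", demo_steps)
print(result)
```

With the notebook's own functions, the list would contain `clean`, `remove_punctuations`, `expand_contractions` and so on, in the order given above.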

## Reading financial news headlines data

In [None]:
data = pd.read_csv('/content/drive/MyDrive/NLP/financial_data/all-data.csv', names = ['labels','messages'], encoding='ISO-8859-1').sample(frac=1)
data

| | labels | messages |
|---|---|---|
| 4288 | neutral | When cruising , the revs fall as less engine o... |
| 2075 | positive | According to M-real 's CEO , Mikko Helander , ... |
| 4815 | negative | Operating profit excluding non-recurring items... |
| 3421 | neutral | This corrensponds to 4.628 percent of Okmetic ... |
| 3915 | neutral | The total capacity of the factory will be appr... |
| ... | ... | ... |
| 3043 | neutral | Neste Oil will publish its third quarter 2008 ... |
| 2363 | neutral | A meeting of Glisten shareholders to vote on t... |
| 568 | positive | With this appointment Kaupthing Bank aims to f... |
| 4350 | neutral | The floor area of the Yliopistonrinne project ... |


## Data Analysis

### Checking for null entries

In [None]:
data.isnull().sum()

labels      0
messages    0
dtype: int64

### Analysing financial news headlines

In [None]:
data.messages

4288    When cruising , the revs fall as less engine o...
2075    According to M-real 's CEO , Mikko Helander , ...
4815    Operating profit excluding non-recurring items...
3421    This corrensponds to 4.628 percent of Okmetic ...
3915    The total capacity of the factory will be appr...
                              ...                        
3043    Neste Oil will publish its third quarter 2008 ...
2363    A meeting of Glisten shareholders to vote on t...
568     With this appointment Kaupthing Bank aims to f...
4350    The floor area of the Yliopistonrinne project ...
2822    At the end of March 2009 , the company 's loan...
Name: messages, Length: 4846, dtype: object

## Applying text processing pipeline

The text processing pipeline is applied to every financial news headline in the dataset. Each processed sentence is then split into a list of words, which is the input format needed to train word embeddings.

In [None]:
sentences = data.messages.map(lambda text: pipeline(text))
train_sentences = [sentence.split() for sentence in sentences]
train_sentences[0:3]

In [None]:
sentences

4288           cruise rev fall less engine output require
2075    accord real  ceo mikko helander transaction en...
4815    operate profit exclude non recur items decreas...
3421    corrensponds percent okmetic  share capital vo...
3915    total capacity factory approximately engines year
                              ...                        
3043    neste oil publish third quarter result friday ...
2363        meet glisten shareholders vote bid hold march
568     appointment kaupthing bank aim co ordinate cap...
4350    floor area yliopistonrinne project sq build  g...
2822                end march company  loan amount eur mn
Name: messages, Length: 4846, dtype: object

## Embedding Configuration and Training

Setting the configurations and training gensim's Word2Vec model on our custom financial news dataset.

In [None]:
# Train for 10000 iterations (iter) over the tokenised headlines
w2v_model = Word2Vec(train_sentences,
                     min_count=3,
                     window=4,
                     size=300,
                     sample=1e-5,
                     alpha=0.03,
                     min_alpha=0.0007,
                     negative=20,
                     iter=10000,
                     workers=multiprocessing.cpu_count()-1
                     )
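The keyword names `size` and `iter` used above belong to gensim 3.x; gensim 4.0 renamed them to `vector_size` and `epochs`. A minimal sketch of a helper (the name `w2v_kwargs` is hypothetical) that picks the right names from a version string, so that the same notebook can run on either release:

```python
def w2v_kwargs(gensim_version, dim=300, n_epochs=10000):
    # gensim 4.0 renamed `size` -> `vector_size` and `iter` -> `epochs`
    major = int(gensim_version.split('.')[0])
    if major >= 4:
        return {"vector_size": dim, "epochs": n_epochs}
    return {"size": dim, "iter": n_epochs}

old_style = w2v_kwargs("3.6.0")
new_style = w2v_kwargs("4.3.2")

print(old_style)
print(new_style)
```

The result could then be unpacked into the constructor call, for example `Word2Vec(train_sentences, min_count=3, window=4, **w2v_kwargs(gensim.__version__))`, after `import gensim`.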

## Assessing Word2Vec model

The words most similar to capital include share, disposal, exchange and stock.
The words most similar to oil include neste, gas and biodiesel. Neste is an oil refining company, which explains its high score.

In [None]:
w2v_model.wv.most_similar('capital')

[('share', 0.3540598750114441),
 ('disposal', 0.31121963262557983),
 ('exchange', 0.2964023947715759),
 ('hold', 0.2846112549304962),
 ('warrant', 0.2731533646583557),
 ('huhtamaki', 0.2716323733329773),
 ('investors', 0.2683177590370178),
 ('vote', 0.264710009098053),
 ('aggregate', 0.26155152916908264),
 ('stock', 0.26109758019447327)]

In [None]:
w2v_model.wv.most_similar('oil')

[('neste', 0.6567887663841248),
 ('gas', 0.44650447368621826),
 ('biodiesel', 0.3915223777294159),
 ('palm', 0.34086838364601135),
 ('nexbtl', 0.3350670635700226),
 ('tons', 0.31725549697875977),
 ('refine', 0.31282925605773926),
 ('fat', 0.31119444966316223),
 ('vessel', 0.3026620149612427),
 ('shale', 0.29222363233566284)]
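The scores returned by `most_similar` are cosine similarities between word vectors: values near 1 mean the vectors point in almost the same direction, values near 0 mean they are unrelated. A minimal sketch with toy three-dimensional vectors (hypothetical values, not taken from the trained model):

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, hypothetical values for illustration only
oil = [1.0, 2.0, 0.0]
gas = [2.0, 4.0, 0.0]    # same direction as `oil`, different length
bank = [0.0, 0.0, 3.0]   # orthogonal to `oil`

sim_same = cosine_similarity(oil, gas)
sim_orth = cosine_similarity(oil, bank)

print(sim_same, sim_orth)
```

Because the measure ignores vector length, `oil` and `gas` score as maximally similar even though one is twice as long as the other.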