# Mandate-4

Archit Sangal IMT2019012\
Pratyush Ranjan IMT2019065

## Downloading and importing necessary packages

### Downloading nltk utilities

In [1]:
import nltk
nltk.download('all')
import random
random.seed(0)

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /home/architsangal/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /home/architsangal/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /home/architsangal/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /home/architsangal/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /home/architsangal/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]

### Downloading contractions library for expanding contractions

In [2]:
!pip install contractions



### Importing required libraries

In [3]:
import pandas as pd
import numpy as np
from textblob import TextBlob
import contractions
import re
from bs4 import BeautifulSoup
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
from gensim.models import Word2Vec
from gensim.models import Doc2Vec
import multiprocessing
from time import time
from sklearn.cluster import KMeans

## Function to remove punctuations

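This function uses TextBlob to remove punctuation from a given sentence. TextBlob tokenises the text, and the resulting words, which exclude punctuation marks, are joined back with single spaces.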

Input: @nifty crashing heavily, by 20%, due to situation in UKR! #sell #bearish

Output: nifty crashing heavily by 20 due to situation in UKR sell bearish

In [4]:
def remove_punctuations(text):
    text_blob = TextBlob(text)
    return ' '.join(text_blob.words)

remove_punctuations("@nifty crashing heavily, by 20%, due to situation in UKR! #sell #bearish")

'nifty crashing heavily by 20 due to situation in UKR sell bearish'

## Function to remove stopwords

Common words such as a, an, the and by, taken from nltk's English stopword list, are removed by this function.

Input: SBI shares fell by nearly 20% after government announced a new bill proposing increased land taxes.

Output: SBI shares fell nearly 20% government announced new bill proposing increased land taxes.

In [5]:
stopwords = nltk.corpus.stopwords.words('english')

def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word.lower() not in stopwords])

remove_stopwords("SBI shares fell by nearly 20% after government announced a new bill proposing increased land taxes.")

'SBI shares fell nearly 20% government announced new bill proposing increased land taxes.'

## Function to expand contractions

This function expands the contractions in a sentence, such as converting I'll to I will, We'd to We would and so on.

Input: If the govn **hadn't** released this surprise bill, **I'd** be sitting with my profits. But unfortunately **I'll** have to sell all of it now. **It's** totally disappointing.

Output: If the govn **had not** released this surprise bill, **I would** be sitting with my profits. But unfortunately **I will** have to sell all of it now. **It is** totally disappointing.



In [6]:
def expand_contractions(text):
    return ' '.join([contractions.fix(word) for word in text.split()])

expand_contractions("If the govn hadn't released this surprise bill, I'd be sitting with my profits. But unfortunately I'll\
 have to sell all of it now. It's totally disappointing.")

'If the govn had not released this surprise bill, I would be sitting with my profits. But unfortunately I will have to sell all of it now. It is totally disappointing.'

## Function to lemmatise text
This function uses the WordNet lemmatiser to lemmatise a given text, treating every word as a verb (part-of-speech tag 'v').

Input: The stock market is going to rocket sky high today because the penny stocks are shooting up.

Output: The stock market be go to rocket sky high today because the penny stock be shoot up.

In [7]:
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()

def lemmatize(text):
    return ' '.join([lemmatizer.lemmatize(word, 'v') for word in text.split()])

lemmatize("The stock market is going to rocket sky high today because the penny stocks are shooting up.")

'The stock market be go to rocket sky high today because the penny stock be shoot up.'

## Function to clean text

This function cleans a given text in five steps - 


1.   Strips any HTML markup using BeautifulSoup.
2.   Removes any @ tags.
3.   Removes any web links.
4.   Removes characters other than letters, apostrophes, '.', '?' and '!'.
5.   Removes extra spaces.

Input: What?    I am surprised!   Penny stocks are making people millionares in less than 24hrs.  See yourself! https://youtube.com/pennystocks

Output: What? I am surprised! Penny stocks are making people millionares in less than hrs. See yourself!



In [8]:
# 1. Strip HTML markup
# 2. Remove tags eg: @pratyush
# 3. Remove web links eg: https://www.yahoo.com/news/stocks
# 4. Remove characters other than letters, apostrophes and the symbols ., ?, !
# 5. Remove extra spaces
def clean(text):
    # Name the parser explicitly so the result does not depend on what is installed
    text = BeautifulSoup(text, "html.parser").get_text()
    text = re.sub(r"@[A-Za-z0-9]+", ' ', text)
    text = re.sub(r"https?://[A-Za-z0-9./]+", ' ', text)
    text = re.sub(r"[^a-zA-Z.!?']", ' ', text)
    text = re.sub(r" +", ' ', text)

    return text

clean("What?   I am surprised!   Penny stocks are making people millionares in less than 24hrs.  See yourself! https://youtube.com\
/pennystocks")



'What? I am surprised! Penny stocks are making people millionares in less than hrs. See yourself! '

## Function for spelling correction using Jaccard Distance

This function looks up each word of the given text in the list of correct words from nltk's words corpus. If no match is found, it goes through all corpus words that start with the same letter and compares each of them with the given word using the Jaccard distance between their sets of character bigrams (ngrams=2). The corpus word with the smallest Jaccard distance is returned as the correct word.

Input: I am **delightede** to **anoucnce** **thate** our **comapany** is now listed on Stock **mareket**.

Output: I am **delighted** to **announce** **that** our **company** is now listed on Stock **market**.

In [9]:
# Spelling correction using Jaccard distance

correct_words = nltk.corpus.words.words()
correct_words_set = set(correct_words)  # a set makes membership tests fast

def correct_spellings_jaccard(text):
    word_list = text.split()

    for i, word in enumerate(word_list):
        word = word.lower()
        if word not in correct_words_set:
            scores = {}
            # Only compare against corpus words with the same first letter
            for w in correct_words:
                if w[0] == word[0]:
                    try:
                        scores[w] = jaccard_distance( set(ngrams(word, 2)), set(ngrams(w, 2)) )
                    except ZeroDivisionError:
                        # Both bigram sets are empty (one-letter words)
                        continue
            # Keep the closest candidate, or the word itself if nothing was scored
            word_list[i] = min(scores, key=scores.get) if scores else word
    return ' '.join(word_list)

correct_spellings_jaccard('I am delightede to anoucnce thate our comapany is now listed on Stock mareket')

'I am delighted to announce that our company is now listed on Stock market'
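
To make the scoring step concrete, here is a minimal pure-Python sketch of the bigram Jaccard distance for one pair of words from the example above. The helper names `char_bigrams` and `jaccard` are illustrative and are not part of nltk. Returning 0.0 for two empty sets is an assumption of this sketch; the function in the notebook skips that case instead.

```python
def char_bigrams(word):
    """Return the set of adjacent character pairs in `word`."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def jaccard(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        # Assumption of this sketch: two empty sets count as identical.
        return 0.0
    return (len(union) - len(a & b)) / len(union)


typo = char_bigrams("mareket")    # ma, ar, re, ek, ke, et
target = char_bigrams("market")   # ma, ar, rk, ke, et
distance = jaccard(typo, target)
print(distance)  # 4 shared bigrams out of 7 distinct ones, so 3/7
```

A smaller value means the two words share more of their bigrams, which is why the candidate with the lowest score is chosen.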

## Text Processing Pipeline

Pipeline function to process a given text. It lists a total of 9 steps; step 5 (spelling correction) is commented out in the code below, so it is skipped when the pipeline runs.


1.   Converting text to lowercase
2.   Cleaning text using clean() function
3.   Removing punctuation using remove_punctuations() function
4.   Expanding contractions using expand_contractions() function
5.   Correcting spellings using correct_spellings_jaccard() function
6.   Removing stopwords using remove_stopwords() function
7.   Lemmatising text using lemmatize() function
8.   Cleaning text again using clean() to remove any extra spaces created
9.   Removing instances of 's in the text


In [10]:
def pipeline(text):
    text = text.lower()
    text = clean(text)
    text = remove_punctuations(text)
    text = expand_contractions(text)
    #text = correct_spellings_jaccard(text)
    text = remove_stopwords(text)
    text = lemmatize(text)
    text = clean(text)
    text = text.replace('\'s', '')

    return text

## Reading financial news headlines data

In [11]:
data = pd.read_csv('company.csv')

## Data Analysis

### Checking for null entries

In [12]:
data.isnull().sum()

text    0
dtype: int64

### Analysing financial news headlines

In [13]:
print(type(data))

<class 'pandas.core.frame.DataFrame'>


## Applying text processing pipeline

Applying the text processing pipeline to every financial news headline in the dataset and converting each processed sentence to a list of words in order to train word embeddings on them.

In [14]:
sentences = data.text.map(lambda text: pipeline(text))
train_sentences = [sentence.split() for sentence in sentences]

## Embedding Configuration and Training

Setting the configurations and training gensim's Word2Vec model on our custom financial news dataset.

In [15]:
# Train for 1000 epochs
w2v_model = Word2Vec(train_sentences,
                     min_count=3,
                     window=4,
                     vector_size=100,
                     sample=1e-5,
                     alpha=0.03,
                     min_alpha=0.0007,
                     negative=20,
                     epochs=1000,
                     workers=multiprocessing.cpu_count()-1
                     )

## Sen2Vec

We have Word2Vec, but sen2vec has to be coded from scratch on top of the vectors it provides. For each sentence we look up the Word2Vec vector of every word that is in the model's vocabulary, sum these vectors and divide the sum by the number of words in the sentence. The resulting vector for every sentence is stored in the list 'vectors'.

In [16]:
vectors = []
for sentence in train_sentences:
    # Running sum of the vectors of in-vocabulary words
    vectors_sentence = [0]*100
    size_sen = 0
    for word in sentence:
        # key_to_index is a dict, so this membership test is fast
        if word in w2v_model.wv.key_to_index:
            vector_word = w2v_model.wv.get_vector(word)
            size_sen += 1
        else:
            continue
        vectors_sentence = [sum(x) for x in zip(vectors_sentence, vector_word)]
    if size_sen != 0:
        # Divides by the full sentence length, not by size_sen
        vectors_sentence = [x/len(sentence) for x in vectors_sentence]
    vectors.append(vectors_sentence)
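
The averaging step can be tried on toy data without a trained model. This is a minimal sketch, assuming a hypothetical `toy_vectors` dictionary in place of `w2v_model.wv` and a hypothetical helper `sentence_vector`. It contrasts dividing by the full sentence length, as the cell above does, with dividing by the number of in-vocabulary words.

```python
def sentence_vector(sentence, word_vectors, dim, divide_by_known=False):
    """Sum the vectors of known words, then divide to get one sentence vector."""
    total = [0.0] * dim
    known = 0
    for word in sentence:
        if word in word_vectors:
            total = [t + v for t, v in zip(total, word_vectors[word])]
            known += 1
    if known == 0:
        return total  # no known words: keep the zero vector
    divisor = known if divide_by_known else len(sentence)
    return [t / divisor for t in total]


# Hypothetical 2-dimensional embeddings, for illustration only
toy_vectors = {"stock": [1.0, 3.0], "rise": [3.0, 1.0]}
sentence = ["stock", "rise", "unknownword", "today"]

by_length = sentence_vector(sentence, toy_vectors, dim=2)
by_known = sentence_vector(sentence, toy_vectors, dim=2, divide_by_known=True)
print(by_length)  # [1.0, 1.0]
print(by_known)   # [2.0, 2.0]
```

Dividing by the sentence length shrinks the vector whenever some words are missing from the vocabulary, whereas dividing by the number of known words gives their plain mean.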

## KMEANS Clustering

We use k-means clustering and assume that there will be 3 clusters: one for positive tweets, one for negative and one for neutral.

Each cluster is mapped to the category that gives the best overall result. For example, if we have 3 clusters -

c1 - 3 pos, 1 neg, 1 neutral

c2 - 3 pos, 2 neg, 1 neutral

c3 - 3 pos, 1 neg, 2 neutral

The best one-to-one matching here is c1 - positive cluster, c2 - negative cluster, c3 - neutral cluster. This was done in the folder 'mapping of the class'.

Here is the mapping-

0 - neutral

1 - positive

2 - negative
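
One way to pick such a mapping is to try every one-to-one assignment of clusters to labels and keep the one with the highest agreement. The sketch below does this for the illustrative counts from the example above. It assumes a small labelled sample is available, and the helper `best_mapping` is hypothetical; the procedure actually used in the 'mapping of the class' folder is not shown here and may differ.

```python
from itertools import permutations

# counts[cluster][label] = number of labelled samples of `label` in `cluster`
# (illustrative numbers taken from the example above)
counts = {
    "c1": {"positive": 3, "negative": 1, "neutral": 1},
    "c2": {"positive": 3, "negative": 2, "neutral": 1},
    "c3": {"positive": 3, "negative": 1, "neutral": 2},
}


def best_mapping(counts):
    """Try every one-to-one cluster -> label assignment and keep the best."""
    clusters = list(counts)
    labels = list(next(iter(counts.values())))
    best, best_score = None, -1
    for perm in permutations(labels):
        # Agreement = samples whose label matches the label given to their cluster
        score = sum(counts[c][label] for c, label in zip(clusters, perm))
        if score > best_score:
            best, best_score = dict(zip(clusters, perm)), score
    return best, best_score


mapping, agreement = best_mapping(counts)
print(mapping, agreement)
```

With three clusters there are only six assignments to check, so brute force is sufficient.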

In [17]:
from sklearn.cluster import KMeans
import numpy as np

In [18]:
kmeans = KMeans(n_clusters=3, random_state=0).fit(vectors)

In [19]:
len(kmeans.cluster_centers_)

3

## Sentiment

sentiment = (count_positive-count_negative)/len(vectors)

The above sentiment lies in the range [-1, 1] for the collection of tweets.

1  -> all tweets were positive

-1 -> all tweets were negative

0  -> equal number of positive and negative tweets

0.4 -> moderately positive sentiment

-0.4 -> moderately negative sentiment

This is a scale from -1 to 1. A value of 0 means neutral, a positive value means positive sentiment and a negative value means negative sentiment. The strength of the sentiment grows linearly with the magnitude of the value received.

In [20]:
y_out = kmeans.predict(vectors)
count_positive = 0
count_negative = 0
for i in range(len(vectors)):
    if(y_out[i] == 1):
        count_positive+=1
    elif(y_out[i] == 2):
        count_negative+=1
print((count_positive-count_negative)/len(vectors))

0.41276376737004633
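
The same computation can be wrapped in a small reusable function. This is a sketch only: the name `sentiment_score` and its parameters are hypothetical, and the default cluster ids follow the mapping given earlier (1 = positive, 2 = negative).

```python
from collections import Counter


def sentiment_score(cluster_ids, positive_id=1, negative_id=2):
    """Return (positives - negatives) / total for a sequence of cluster ids."""
    total = len(cluster_ids)
    if total == 0:
        return 0.0  # assumption of this sketch: no data counts as neutral
    counts = Counter(cluster_ids)
    return (counts[positive_id] - counts[negative_id]) / total


# Toy example: three positive, one negative and one neutral prediction
score = sentiment_score([1, 1, 1, 2, 0])
print(score)  # (3 - 1) / 5
```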


## Saving the Model
Saving the model so that it can be reused later if required.

In [21]:
import pickle
with open("kmeans.pkl", "wb") as f:
    pickle.dump(kmeans, f)
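
To reuse the saved model later, it can be read back with `pickle.load`. The helper name `load_pickled` is hypothetical, and the commented usage assumes that `kmeans.pkl` was written by the cell above and that `vectors` has been built in the same way.

```python
import pickle


def load_pickled(path):
    """Load and return the object stored in a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Usage, assuming kmeans.pkl exists in the working directory:
# kmeans = load_pickled("kmeans.pkl")
# y_out = kmeans.predict(vectors)
```

Pickle files can execute arbitrary code when loaded, so only load files from a trusted source.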