# Data Cleaning

This notebook goes through a necessary step of any data science project - data cleaning. Data cleaning is a time consuming and unenjoyable task, yet it's a very important one. Keep in mind, "garbage in, garbage out". Feeding dirty data into a model will give us results that are meaningless.

Specifically, we'll be walking through:

1. **Getting the data - **in this case, we'll be scraping data from a website
2. **Cleaning the data - **we will walk through popular text pre-processing techniques
3. **Organizing the data - **we will organize the cleaned data into a way that is easy to input into other algorithms

The output of this notebook will be clean, organized data in two standard text formats:

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

In [7]:
# Web scraping, pickle imports
import requests
#from bs4 import BeautifulSoup
import pickle
import pandas as pd

# Web scraping stuff from Samyak
nifty = pd.read_csv("nifty50_news_stock_data_combined.csv")
# # Stock names
# stocks = []
print(nifty)
nifty.head()
nifty.iloc[2]["Content"]

                                                   Title  \
0      Sacked and stripped of bonuses: Srikrishna rep...   
1      Ex-CEO Chanda Kochhar may have to return over ...   
2      Dr Reddy's Laboratories Ltd appoints Axis Bank...   
3      Here's a timeline of ICICI Bank-Videocon loan ...   
4      How much money former ICICI Bank CEO Chanda Ko...   
...                                                  ...   
12947  L&T Finance, L&T Infra Credit and 5 other NBFC...   
12948  Larsen & Toubro Stocks Live Updates: Larsen & ...   
12949  Larsen & Toubro falls Monday, underperforms ma...   
12950  How L&T is Driving India's Green Hydrogen Boom...   
12951  Sunpure Signs Contract with L&T to Supply PV R...   

                                                 Content  \
0      an independent inquiry initiated by icici bank...   
1                                                    NaN   
2          you can now subscribe to our  you can now ...   
3      answer videocon – that is the re

'    you can now subscribe to our  you can now subscribe to our economic times whatsapp channel  new delhi pharma major dr reddys laboratories thursday said it has appointed axis bank s former md and ceo shikha sharma as companys independent additional director for five yearsshikha sanjaya sharma has been appointed as an additional director categorised as independent on the board of dr reddys laboratories ltd for a period of five years effective january   the company said in a filing to the bsesharma was the managing director and ceo of axis bank from june  up to december  it addedshe has more than three decades of experience in the financial sector having begun her career with icici bank ltd in  during her tenure with the icici group she was instrumental in setting up icici securities dr reddys saidsharma holds an mba from the indian institute of management ahmedabad and pgd in software technology from national centre for software technology mumbai it addedshares of dr reddys laborato

In [12]:
# # Pickle files for later use - not needed ... yet

# # Make a new directory to hold the text files
# !mkdir news

# for i, c in enumerate(stocks):
#     with open("news/" + c + ".txt", "wb") as file:
#         pickle.dump(transcripts[i], file)

In [13]:
# # Load pickled files
# data = {}
# for i, c in enumerate(stocks):
#     with open("news/" + c + ".txt", "rb") as file:
#         data[c] = pickle.load(file)

In [14]:
# # Double check to make sure data has been loaded properly
# data.keys()

## Cleaning The Data

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. So, we're going to follow the MVP (minimum viable product) approach - start simple and iterate. Here are a bunch of things you can do to clean your data. We're going to execute just the common cleaning steps here and the rest can be done at a later point to improve our results.

**Common data cleaning steps on all text:**
* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common non-sensical text (/n)
* Tokenize text
* Remove stop words

**More data cleaning steps after tokenization:**
* Stemming / lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...

In [15]:
# # Let's take a look at our data again
# next(iter(data.keys()))

In [16]:
# # Notice that our dictionary is currently in key: news, value: list of text format
# next(iter(data.values()))

In [17]:
# # We are going to change this to key: comedian, value: string format
# def combine_text(list_of_text):
#     '''Takes a list of text and combines them into one large chunk of text.'''
#     combined_text = ' '.join(list_of_text)
#     return combined_text

In [18]:
# # Combine it!
# data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

In [19]:
# # We can either keep it in dictionary format or put it into a pandas dataframe
# import pandas as pd
# pd.set_option('max_colwidth',150)

# data_df = pd.DataFrame.from_dict(data_combined).transpose()
# data_df.columns = ['news']
# data_df = data_df.sort_index()
# data_df

In [8]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = str(text)
    if(text is None or text.isnumeric()): 
        return ""
    text = text.lower()
    text = re.sub('\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\w*\d\w*', '', text)
    return text

round1 = lambda x: clean_text_round1(x)

In [9]:
# Let's take a look at the updated text
content_clean = pd.DataFrame(nifty.Content.apply(round1))
title_clean = pd.DataFrame(nifty.Title.apply(round1))

In [10]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and non-sensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', ' ', text)
    text = re.sub('\n', ' ', text)
    return text

round2 = lambda x: clean_text_round2(x)

In [11]:
# Let's take a look at the updated text
content_clean = pd.DataFrame(content_clean.Content.apply(round2))
title_clean = pd.DataFrame(title_clean.Title.apply(round2))
content_clean
title_clean
combined_data = pd.DataFrame({
    'Combined': title_clean['Title'] + " " + content_clean['Content']
})
combined_data

Unnamed: 0,Combined
0,sacked and stripped of bonuses srikrishna repo...
1,exceo chanda kochhar may have to return over r...
2,dr reddys laboratories ltd appoints axis bank ...
3,heres a timeline of icici bankvideocon loan ca...
4,how much money former icici bank ceo chanda ko...
...,...
12947,lt finance lt infra credit and other nbfcs su...
12948,larsen toubro stocks live updates larsen tou...
12949,larsen toubro falls monday underperforms mark...
12950,how lt is driving indias green hydrogen boom ...


In [12]:
# vadar

In [8]:
from nltk.sentiment import SentimentIntensityAnalyzer

# Initialize the VADER SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

# Define a function to calculate VADER sentiment scores for a given text
def get_vader_scores(text):
    return sia.polarity_scores(text)

# Apply the function to calculate sentiment on the 'Combined' column
combined_data['Vader_Scores'] = combined_data['Combined'].apply(get_vader_scores)

# If you want separate columns for each score (negative, neutral, positive, compound), you can do:
combined_data['Vader_Negative_Score'] = combined_data['Vader_Scores'].apply(lambda x: x['neg'])
combined_data['Vader_Neutral_Score'] = combined_data['Vader_Scores'].apply(lambda x: x['neu'])
combined_data['Vader_Positive_Score'] = combined_data['Vader_Scores'].apply(lambda x: x['pos'])
combined_data['Vader_Compound_Score'] = combined_data['Vader_Scores'].apply(lambda x: x['compound'])





In [9]:
updated_data = nifty.merge(combined_data[['Vader_Negative_Score', 'Vader_Neutral_Score', 'Vader_Positive_Score', 'Vader_Compound_Score']],
                                   left_index=True, right_index=True, how='left')

updated_data.to_csv('nifty50_news_stock_data_combined.csv', index=False)


In [10]:
# textblob

In [11]:
from textblob import TextBlob
import pandas as pd

# Function to calculate TextBlob sentiment
def get_textblob_sentiment(text):
    # Create a TextBlob object
    blob = TextBlob(text)
    # This will return a namedtuple of the form Sentiment(polarity, subjectivity)
    return blob.sentiment

# Read the existing CSV file into a DataFrame
df = pd.read_csv('nifty50_news_stock_data_combined.csv')

# Calculate sentiment using TextBlob
df['TextBlob_Sentiment'] = combined_data['Combined'].apply(lambda x: get_textblob_sentiment(x).polarity)
df['TextBlob_Subjectivity'] = combined_data['Combined'].apply(lambda x: get_textblob_sentiment(x).subjectivity)



In [12]:
df['TextBlob_Sentiment'] 

0        0.046822
1        0.000000
2        0.102683
3        0.024962
4        0.043339
           ...   
12947    0.082368
12948    0.136364
12949    0.000000
12950    0.055598
12951    0.142527
Name: TextBlob_Sentiment, Length: 12952, dtype: float64

In [13]:
df.to_csv('nifty50_news_stock_data_combined.csv', index=False)

In [14]:
df[1:10]

Unnamed: 0,Title,Content,Link,Date,Ticker,T-1,T+1,T+2,Vader_Negative_Score_x,Vader_Neutral_Score_x,...,Vader_Positive_Score_y,Vader_Compound_Score_y,TextBlob_Sentiment,TextBlob_Subjectivity,DistilBERT_Sentiment,DistilBERT_Score,Vader_Negative_Score,Vader_Neutral_Score,Vader_Positive_Score,Vader_Compound_Score
1,Ex-CEO Chanda Kochhar may have to return over ...,,https://www.business-standard.com/article/pti-...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.0,0.788,...,0.212,0.5423,0.0,0.0,NEGATIVE,0.998119,0.0,0.788,0.212,0.5423
2,Dr Reddy's Laboratories Ltd appoints Axis Bank...,you can now subscribe to our you can now ...,https://m.economictimes.com/industry/healthcar...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.0,0.972,...,0.028,0.6486,0.102683,0.258018,NEGATIVE,0.930891,0.0,0.972,0.028,0.6486
3,Here's a timeline of ICICI Bank-Videocon loan ...,answer videocon – that is the reason why video...,https://www.moneycontrol.com/news/business/her...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.061,0.851,...,0.087,0.9705,0.024962,0.301661,NEGATIVE,0.993598,0.061,0.851,0.087,0.9705
4,How much money former ICICI Bank CEO Chanda Ko...,money back guarantee deliberate act what now...,https://timesofindia.indiatimes.com/business/i...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.084,0.82,...,0.096,0.6635,0.043339,0.293273,NEGATIVE,0.99772,0.084,0.82,0.096,0.6635
5,"ICICI Bank fires boss: hurt and shocked, credi...",chanda kochhar who served with icici bank for ...,https://indianexpress.com/article/business/ban...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.09,0.794,...,0.116,0.9846,-0.029755,0.392264,NEGATIVE,0.998456,0.09,0.794,0.116,0.9846
6,ICICI Bank board under fire for giving clean c...,,https://www.business-standard.com/article/econ...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.137,0.571,...,0.291,0.4019,0.366667,0.7,NEGATIVE,0.987804,0.137,0.571,0.291,0.4019
7,Chanda Kochhar may have to repay Rs 353 crore ...,schemes and the code of conduct with all att...,https://www.deccanherald.com/business/chanda-k...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.0,0.852,...,0.148,0.8885,0.075,0.353125,NEGATIVE,0.996584,0.0,0.852,0.148,0.8885
8,Senior management at ICICI Bank to have no titles,you can now subscribe to our you can now ...,https://m.economictimes.com/banking/senior-man...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.005,0.952,...,0.043,0.7681,0.074747,0.385534,NEGATIVE,0.99593,0.005,0.952,0.043,0.7681
9,India’s top 10 distributors hold 18% of indust...,india s top mutual fund distributors now mana...,https://cafemutual.com/news/industry/15721-ind...,2019-02-21,ICICIBANK.NS,351.299988,355.600006,348.200012,0.0,0.895,...,0.105,0.9246,0.180556,0.382407,NEGATIVE,0.988775,0.0,0.895,0.105,0.9246


In [None]:
# DistilBert

In [14]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from transformers import pipeline


# Load the pre-trained DistilBERT tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

# Create a sentiment analysis pipeline
nlp = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

In [13]:
import pandas as pd
import torch

df = pd.read_csv('nifty50_news_stock_data_combined.csv')

def get_distilbert_sentiment(text):
    # Tokenize and truncate the sequence to be within the 512 token limit
    inputs = tokenizer.encode_plus(
        text, 
        add_special_tokens=True, 
        max_length=512, 
        truncation=True,
        padding='max_length', 
        return_tensors='pt'
    )
    
    # Get the predictions from the model
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Take the first result (batch size is 1)
    result = outputs.logits[0]
    
    # Convert to probabilities and take the argmax to get the most likely label
    probabilities = torch.nn.functional.softmax(result, dim=0)
    sentiment = model.config.id2label[probabilities.argmax().item()]
    score = probabilities.max().item()
    
    return sentiment, score

# Apply sentiment analysis using DistilBERT
df['DistilBERT_Sentiment'], df['DistilBERT_Score'] = zip(*combined_data['Combined'].map(get_distilbert_sentiment))

# Save the updated DataFrame back to the CSV file
df.to_csv('nifty50_news_stock_data_combined.csv', index=False)




In [14]:
df[1:10]

Unnamed: 0,Title,Content,Link,Date,Ticker,T-1,T+1,T+2,Vader_Negative_Score_x,Vader_Neutral_Score_x,Vader_Positive_Score_x,Vader_Compound_Score_x,Vader_Negative_Score_y,Vader_Neutral_Score_y,Vader_Positive_Score_y,Vader_Compound_Score_y,TextBlob_Sentiment,TextBlob_Subjectivity,DistilBERT_Sentiment,DistilBERT_Score
1,Ex-CEO Chanda Kochhar may have to return over ...,,https://www.business-standard.com/article/pti-...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.0,0.788,0.212,0.5423,0.0,0.788,0.212,0.5423,0.0,0.0,NEGATIVE,0.998119
2,Dr Reddy's Laboratories Ltd appoints Axis Bank...,you can now subscribe to our you can now ...,https://m.economictimes.com/industry/healthcar...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.0,0.972,0.028,0.6486,0.0,0.972,0.028,0.6486,0.102683,0.258018,NEGATIVE,0.930891
3,Here's a timeline of ICICI Bank-Videocon loan ...,answer videocon – that is the reason why video...,https://www.moneycontrol.com/news/business/her...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.061,0.851,0.087,0.9705,0.061,0.851,0.087,0.9705,0.024962,0.301661,NEGATIVE,0.993598
4,How much money former ICICI Bank CEO Chanda Ko...,money back guarantee deliberate act what now...,https://timesofindia.indiatimes.com/business/i...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.084,0.82,0.096,0.6635,0.084,0.82,0.096,0.6635,0.043339,0.293273,NEGATIVE,0.99772
5,"ICICI Bank fires boss: hurt and shocked, credi...",chanda kochhar who served with icici bank for ...,https://indianexpress.com/article/business/ban...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.09,0.794,0.116,0.9846,0.09,0.794,0.116,0.9846,-0.029755,0.392264,NEGATIVE,0.998456
6,ICICI Bank board under fire for giving clean c...,,https://www.business-standard.com/article/econ...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.137,0.571,0.291,0.4019,0.137,0.571,0.291,0.4019,0.366667,0.7,NEGATIVE,0.987804
7,Chanda Kochhar may have to repay Rs 353 crore ...,schemes and the code of conduct with all att...,https://www.deccanherald.com/business/chanda-k...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.0,0.852,0.148,0.8885,0.0,0.852,0.148,0.8885,0.075,0.353125,NEGATIVE,0.996584
8,Senior management at ICICI Bank to have no titles,you can now subscribe to our you can now ...,https://m.economictimes.com/banking/senior-man...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.005,0.952,0.043,0.7681,0.005,0.952,0.043,0.7681,0.074747,0.385534,NEGATIVE,0.99593
9,India’s top 10 distributors hold 18% of indust...,india s top mutual fund distributors now mana...,https://cafemutual.com/news/industry/15721-ind...,2019-02-21,ICICIBANK.NS,351.299988,355.600006,348.200012,0.0,0.895,0.105,0.9246,0.0,0.895,0.105,0.9246,0.180556,0.382407,NEGATIVE,0.988775


In [None]:
df = pd.read_csv('nifty50_news_stock_data_combined.csv')

In [83]:
df['DistilBERT_Sentiment']

0        NEGATIVE
1        NEGATIVE
2        NEGATIVE
3        NEGATIVE
4        NEGATIVE
           ...   
12947    NEGATIVE
12948    NEGATIVE
12949    NEGATIVE
12950    NEGATIVE
12951    POSITIVE
Name: DistilBERT_Sentiment, Length: 12952, dtype: object

In [1]:
#finbert
import torch
from transformers import BertTokenizer, BertForSequenceClassification
import pandas as pd
from tqdm.auto import tqdm

tqdm.pandas()

tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')
model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')
model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [2]:
def chunk_text(text, max_chunk_length=510):
    # Tokenize the text and add special tokens
    tokens = tokenizer.encode(text, add_special_tokens=False)
    
    # Initialize list to store tokenized chunks
    chunks = []

    # Split tokens into chunks of `max_chunk_length`
    for i in range(0, len(tokens), max_chunk_length):
        # Create a chunk with the `[CLS]` token at the start and `[SEP]` token at the end
        chunk = tokens[i:i + max_chunk_length]
        chunks.append(chunk)
    
    # Add special tokens to each chunk
    chunks = [([tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]) for chunk in chunks]
    
    return chunks

In [17]:

def get_sentiment(input_ids, attention_mask):
    # Move to GPU if available
    input_ids = input_ids.to('cuda' if torch.cuda.is_available() else 'cpu')
    attention_mask = attention_mask.to('cuda' if torch.cuda.is_available() else 'cpu')
    
    # Get model outputs and logits
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits
    sentiment = torch.argmax(logits, dim=1).item()  # Assuming binary classification (positive/negative)
    return sentiment


In [3]:
from torch.nn.functional import softmax

In [4]:
def process_text(text):
    # Split the text into chunks and prepare tensors
    chunks = chunk_text(text)
    
    # Process each chunk and aggregate the results
    probability_list = []
    for chunk in chunks:
        # Convert chunk to tensors
        input_ids = torch.tensor(chunk).unsqueeze(0)
        attention_mask = torch.tensor([1] * len(chunk)).unsqueeze(0)
        
        # Check and move to GPU if available
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        model.to(device)

        # Get model outputs and logits
        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            probabilities = softmax(logits, dim=1)  # Convert logits to probabilities
            probability_list.append(probabilities)

    # Combine the probabilities from each chunk and calculate the average
    combined_probabilities = torch.cat(probability_list, dim=0)
    average_probabilities = torch.mean(combined_probabilities, dim=0)

    # Extract the negative and positive scores, assuming indexes 0 and 2 respectively
    negative_sentiment_score = average_probabilities[0].item()
    neutral_sentiment_score = average_probabilities[1].item()
    positive_sentiment_score = average_probabilities[2].item()

    return negative_sentiment_score, neutral_sentiment_score, positive_sentiment_score


In [5]:
def aggregate_sentiments(sentiments):
    # Convert logits to probabilities
    probs = torch.nn.functional.softmax(sentiments, dim=1)
    # Take the mean of probabilities across chunks
    mean_probs = probs.mean(dim=0)
    # Get the sentiment with the highest probability
    sentiment = torch.argmax(mean_probs).item()
    return sentiment

In [13]:
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')
model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')
from tqdm.auto import tqdm
tqdm.pandas()



# Now apply the function to the DataFrame
df[['Negative_Score', 'Neutral_Score', 'Positive_Score']] = combined_data["Combined"][1:5].progress_apply(lambda x: pd.Series(process_text(x)))

  0%|          | 0/4 [00:00<?, ?it/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (887 > 512). Running this sequence through the model will result in indexing errors


NameError: name 'df' is not defined

In [23]:
df[['Negative_Score', 'Neutral_Score', 'Positive_Score']]

Unnamed: 0,Negative_Score,Neutral_Score,Positive_Score
0,,,
1,0.019017,0.364974,0.616009
2,0.091997,0.015717,0.892285
3,0.027189,0.319605,0.653207
4,0.021873,0.638321,0.339807
...,...,...,...
12947,,,
12948,,,
12949,,,
12950,,,


In [15]:


import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')
model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')
from tqdm.auto import tqdm
tqdm.pandas()


import pandas as pd

# Read the existing CSV file into a DataFrame
df = pd.read_csv('nifty50_news_stock_data_combined.csv')



df = pd.read_csv('nifty50_news_stock_data_combined.csv')
# Now apply the function to the DataFrame
df[['Negative_Score', 'Neutral_Score', 'Positive_Score']] = combined_data["Combined"].progress_apply(lambda x: pd.Series(process_text(x)))



# df['FinBert_Sentiments'] = combined_data['Combined'].progress_apply(process_text)

# df['FinBert_Sentiments'] = combined_data['Combined'].progress_apply(lambda x: [get_sentiment(chunk) for chunk in chunk_text(x)])

  0%|          | 0/12952 [00:00<?, ?it/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1069 > 512). Running this sequence through the model will result in indexing errors


In [16]:
df[['Negative_Score', 'Neutral_Score', 'Positive_Score']]

Unnamed: 0,Negative_Score,Neutral_Score,Positive_Score
0,0.024382,0.267857,0.707761
1,0.019017,0.364974,0.616009
2,0.091997,0.015717,0.892285
3,0.027189,0.319605,0.653207
4,0.021873,0.638321,0.339807
...,...,...,...
12947,0.034260,0.141118,0.824622
12948,0.026134,0.053237,0.920629
12949,0.135554,0.818574,0.045872
12950,0.275276,0.017267,0.707457


In [17]:
df.to_csv('nifty50_news_stock_data_combined.csv', index=False)

In [18]:
df[1:10]

Unnamed: 0,Title,Content,Link,Date,Ticker,T-1,T+1,T+2,Vader_Negative_Score_x,Vader_Neutral_Score_x,...,TextBlob_Subjectivity,DistilBERT_Sentiment,DistilBERT_Score,Vader_Negative_Score,Vader_Neutral_Score,Vader_Positive_Score,Vader_Compound_Score,Negative_Score,Neutral_Score,Positive_Score
1,Ex-CEO Chanda Kochhar may have to return over ...,,https://www.business-standard.com/article/pti-...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.0,0.788,...,0.0,NEGATIVE,0.998119,0.0,0.788,0.212,0.5423,0.019017,0.364974,0.616009
2,Dr Reddy's Laboratories Ltd appoints Axis Bank...,you can now subscribe to our you can now ...,https://m.economictimes.com/industry/healthcar...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.0,0.972,...,0.258018,NEGATIVE,0.930891,0.0,0.972,0.028,0.6486,0.091997,0.015717,0.892285
3,Here's a timeline of ICICI Bank-Videocon loan ...,answer videocon – that is the reason why video...,https://www.moneycontrol.com/news/business/her...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.061,0.851,...,0.301661,NEGATIVE,0.993598,0.061,0.851,0.087,0.9705,0.027189,0.319605,0.653207
4,How much money former ICICI Bank CEO Chanda Ko...,money back guarantee deliberate act what now...,https://timesofindia.indiatimes.com/business/i...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.084,0.82,...,0.293273,NEGATIVE,0.99772,0.084,0.82,0.096,0.6635,0.021873,0.638321,0.339807
5,"ICICI Bank fires boss: hurt and shocked, credi...",chanda kochhar who served with icici bank for ...,https://indianexpress.com/article/business/ban...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.09,0.794,...,0.392264,NEGATIVE,0.998456,0.09,0.794,0.116,0.9846,0.030248,0.653142,0.31661
6,ICICI Bank board under fire for giving clean c...,,https://www.business-standard.com/article/econ...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.137,0.571,...,0.7,NEGATIVE,0.987804,0.137,0.571,0.291,0.4019,0.045535,0.836474,0.117991
7,Chanda Kochhar may have to repay Rs 353 crore ...,schemes and the code of conduct with all att...,https://www.deccanherald.com/business/chanda-k...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.0,0.852,...,0.353125,NEGATIVE,0.996584,0.0,0.852,0.148,0.8885,0.015705,0.713097,0.271197
8,Senior management at ICICI Bank to have no titles,you can now subscribe to our you can now ...,https://m.economictimes.com/banking/senior-man...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.005,0.952,...,0.385534,NEGATIVE,0.99593,0.005,0.952,0.043,0.7681,0.040998,0.044907,0.914094
9,India’s top 10 distributors hold 18% of indust...,india s top mutual fund distributors now mana...,https://cafemutual.com/news/industry/15721-ind...,2019-02-21,ICICIBANK.NS,351.299988,355.600006,348.200012,0.0,0.895,...,0.382407,NEGATIVE,0.988775,0.0,0.895,0.105,0.9246,0.254697,0.015965,0.729338


In [19]:
import pandas as pd
import numpy as np

def highest_score_label(row):
    scores = {'negative': row['Negative_Score'], 'neutral': row['Neutral_Score'], 'positive': row['Positive_Score']}
    highest_label = max(scores, key=scores.get)
    highest_score = max(scores.values())
    return pd.Series([highest_label, highest_score])

df[['Highest_FinBERT_Sentiment', 'Highest_FINBERT_Score']] = df.apply(highest_score_label, axis=1)


In [20]:
df[['Highest_FinBERT_Sentiment', 'Highest_FINBERT_Score']][1:10]

Unnamed: 0,Highest_FinBERT_Sentiment,Highest_FINBERT_Score
1,positive,0.616009
2,positive,0.892285
3,positive,0.653207
4,neutral,0.638321
5,neutral,0.653142
6,neutral,0.836474
7,neutral,0.713097
8,positive,0.914094
9,positive,0.729338


In [21]:
df.to_csv('nifty50_news_stock_data_combined.csv', index=False)

In [22]:
df[1:4]

Unnamed: 0,Title,Content,Link,Date,Ticker,T-1,T+1,T+2,Vader_Negative_Score_x,Vader_Neutral_Score_x,...,DistilBERT_Score,Vader_Negative_Score,Vader_Neutral_Score,Vader_Positive_Score,Vader_Compound_Score,Negative_Score,Neutral_Score,Positive_Score,Highest_FinBERT_Sentiment,Highest_FINBERT_Score
1,Ex-CEO Chanda Kochhar may have to return over ...,,https://www.business-standard.com/article/pti-...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.0,0.788,...,0.998119,0.0,0.788,0.212,0.5423,0.019017,0.364974,0.616009,positive,0.616009
2,Dr Reddy's Laboratories Ltd appoints Axis Bank...,you can now subscribe to our you can now ...,https://m.economictimes.com/industry/healthcar...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.0,0.972,...,0.930891,0.0,0.972,0.028,0.6486,0.091997,0.015717,0.892285,positive,0.892285
3,Here's a timeline of ICICI Bank-Videocon loan ...,answer videocon – that is the reason why video...,https://www.moneycontrol.com/news/business/her...,2019-01-31,ICICIBANK.NS,364.450012,354.549988,352.649994,0.061,0.851,...,0.993598,0.061,0.851,0.087,0.9705,0.027189,0.319605,0.653207,positive,0.653207


In [66]:
import nltk
# nltk.download('punkt')

In [None]:
#tokenize
from nltk.tokenize import word_tokenize
def tokenize(text):
    if(text is None): 
        return ""
    return word_tokenize(text)
tokens = lambda x: tokenize(x)
tok = pd.DataFrame(data_clean.Content.apply(tokens))
tok

In [None]:
len(tok.iloc[0].Content)

848

## Organizing The Data

I mentioned earlier that the output of this notebook will be clean, organized data in two standard text formats:
1. **Corpus - **a collection of text
2. **Document-Term Matrix - **word counts in matrix format

### Corpus

We already created a corpus in an earlier step. The definition of a corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.

In [None]:
# # Let's take a look at our dataframe
# data_df

In [None]:
# # Let's add the tickers as well
# tickers = []

# data_df['tickers'] = ticker
# data_df

In [None]:
# # Let's pickle it for later use
# data_df.to_pickle("corpus.pkl")

### Document-Term Matrix

For many of the techniques we'll be using in future notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no additional meaning to text such as 'a', 'the', etc.

In [None]:
# # We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
# from sklearn.feature_extraction.text import CountVectorizer

# cv = CountVectorizer(stop_words='english')
# data_cv = cv.fit_transform(data_clean.transcript)
# data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
# data_dtm.index = data_clean.index
# data_dtm

In [None]:
# # Let's pickle it for later use
# data_dtm.to_pickle("dtm.pkl")

In [None]:
# # Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
# data_clean.to_pickle('data_clean.pkl')
# pickle.dump(cv, open("cv.pkl", "wb"))

## Sentiment Analysis using FinBert

In [5]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
import pandas as pd
from tqdm.auto import tqdm


In [6]:
tqdm.pandas()

In [7]:
tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')
model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')
model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [8]:
tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')
model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')
model.eval() 

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [27]:
def chunk_text(text, max_chunk_length=510):
    # Tokenize the text and add special tokens
    tokens = tokenizer.encode(text, add_special_tokens=False)
    
    # Initialize list to store tokenized chunks
    chunks = []

    # Split tokens into chunks of `max_chunk_length`
    for i in range(0, len(tokens), max_chunk_length):
        # Create a chunk with the `[CLS]` token at the start and `[SEP]` token at the end
        chunk = tokens[i:i + max_chunk_length]
        chunk = [tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]
        chunks.append(chunk)
    
    return chunks


In [92]:
def get_sentiment(chunk):
    # Assuming 'chunk' is already tokenized and ready for input
    inputs = tokenizer(chunk, return_tensors="pt", padding=True, truncation=True)
    outputs = model(**inputs)
    sentiment = torch.argmax(outputs.logits).item()  # Get the most likely sentiment label (index)
    score = torch.softmax(outputs.logits, dim=1).max().item()  # Get the maximum softmax score
    return sentiment, score  # Return a tuple of sentiment and score



In [93]:
def process_text(text):
    # Split the text into chunks and prepare tensors
    chunks = chunk_text(text)
    sentiments = []
    for chunk in chunks:
        # Add [CLS] and [SEP] tokens and create tensors
        input_ids = torch.tensor([tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id])
        attention_mask = torch.tensor([1] * len(input_ids))
        # Pad tensors to the required length
        padding_length = max_chunk_length + 2 - len(input_ids)
        input_ids = torch.cat([input_ids, torch.zeros(padding_length, dtype=torch.long)])
        attention_mask = torch.cat([attention_mask, torch.zeros(padding_length, dtype=torch.long)])
        # Get sentiment logits for the chunk
        logits = get_sentiment(input_ids.unsqueeze(0), attention_mask.unsqueeze(0))
        sentiments.append(logits)
    sentiments = torch.stack(sentiments)
    average_sentiment = sentiments.mean(dim=0)
    sentiment_label = torch.argmax(average_sentiment).item()
    return sentiment_label

In [94]:
def aggregate_sentiments(sentiments):
    # Convert logits to probabilities
    probs = torch.nn.functional.softmax(sentiments, dim=1)
    # Take the mean of probabilities across chunks
    mean_probs = probs.mean(dim=0)
    # Get the sentiment with the highest probability
    sentiment = torch.argmax(mean_probs).item()
    return sentiment

In [95]:
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')
model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')

In [97]:
from tqdm.auto import tqdm
tqdm.pandas()
nifty_finbert = pd.read_csv("nifty_sentiment_combined.csv")


df = pd.read_csv('nifty50_news_stock_data_combined.csv')


df['FinBert_Sentiments'] = combined_data['Combined'].progress_apply(lambda x: [get_sentiment(chunk) for chunk in chunk_text(x)])

  0%|          | 0/12952 [00:00<?, ?it/s]

ValueError: text input must be of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

  0%|          | 0/1 [00:00<?, ?it/s]

ValueError: Input 0        an independent inquiry initiated by icici bank...
1                                                      nan
2            you can now subscribe to our  you can now ...
3        answer videocon – that is the reason why video...
4        money back guarantee  deliberate act  what now...
                               ...                        
12947    lt finance lt infra credit and five other nonb...
12948                                                  nan
12949                                                  nan
12950    how lt is driving indias green hydrogen boom  ...
12951    sunpure a leading global supplier of pv roboti...
Name: Content, Length: 12952, dtype: object is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

In [None]:
def tokenize(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = str(text)
    return tokenizer.encode_plus(text, add_special_tokens=True, max_length=512, truncations=True, padding="max_length")
tokenized = lambda x: tokenize(x)
tokens = pd.DataFrame(data_clean.Content.apply(tokenized))
tokens

Keyword arguments {'truncations': True} not recognized.
Keyword arguments {'truncations': True} not recognized.
Keyword arguments {'truncations': True} not recognized.
Keyword arguments {'truncations': True} not recognized.
Keyword arguments {'truncations': True} not recognized.
Keyword arguments {'truncations': True} not recognized.
Keyword arguments {'truncations': True} not recognized.
Keyword arguments {'truncations': True} not recognized.
Keyword arguments {'truncations': True} not recognized.
Keyword arguments {'truncations': True} not recognized.
Keyword arguments {'truncations': True} not recognized.
Keyword arguments {'truncations': True} not recognized.
Keyword arguments {'truncations': True} not recognized.
Keyword arguments {'truncations': True} not recognized.
Keyword arguments {'truncations': True} not recognized.
Keyword arguments {'truncations': True} not recognized.
Keyword arguments {'truncations': True} not recognized.
Keyword arguments {'truncations': True} not reco

Unnamed: 0,Content
0,"[input_ids, token_type_ids, attention_mask]"
1,"[input_ids, token_type_ids, attention_mask]"
2,"[input_ids, token_type_ids, attention_mask]"
3,"[input_ids, token_type_ids, attention_mask]"
4,"[input_ids, token_type_ids, attention_mask]"
...,...
12947,"[input_ids, token_type_ids, attention_mask]"
12948,"[input_ids, token_type_ids, attention_mask]"
12949,"[input_ids, token_type_ids, attention_mask]"
12950,"[input_ids, token_type_ids, attention_mask]"


In [None]:
def tokenize(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = str(text)
    return tokenizer.encode_plus(text, add_special_tokens=False, return_tensors='pt')
tokenized = lambda x: tokenize(x)
tokens = pd.DataFrame(data_clean.Content.apply(tokenized))
tokens

Unnamed: 0,Content
0,"[input_ids, token_type_ids, attention_mask]"
1,"[input_ids, token_type_ids, attention_mask]"
2,"[input_ids, token_type_ids, attention_mask]"
3,"[input_ids, token_type_ids, attention_mask]"
4,"[input_ids, token_type_ids, attention_mask]"
...,...
12947,"[input_ids, token_type_ids, attention_mask]"
12948,"[input_ids, token_type_ids, attention_mask]"
12949,"[input_ids, token_type_ids, attention_mask]"
12950,"[input_ids, token_type_ids, attention_mask]"


In [None]:
tokens.iloc[0][0]

  tokens.iloc[0][0]


{'input_ids': tensor([[ 2019,  2981,  9934,  ...,  2000,  1053, 13245]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])}

In [None]:
def chunk(tok):
    input_id_chunks = tok['input_ids'][0].split(510)
    mask_chunks = tok['attention_mask'][0].split(510)

    for tensor in input_id_chunks:
        print(len(tensor))
        
    return (input_id_chunks, mask_chunks)
chunked = lambda x: chunk(x)
chunks = pd.DataFrame(tokens.Content.apply(chunked))
chunks
    

510
510
31
1
212
510
364
475
510
510
211
1
82
332
195
497
510
10
510
510
351
510
269
510
19
498
491
200
1
58
335
61
187
510
226
341
387
510
92
510
260
1
510
510
141
510
510
374
1
510
220
92
510
196
510
491
46
417
510
211
149
510
510
26
223
510
222
1
1
510
510
145
493
510
225
510
510
152
262
510
439
1
510
22
222
510
510
510
510
150
510
42
510
148
510
55
1
510
71
510
95
510
510
101
510
9
245
308
201
439
364
212
332
429
510
71
52
510
164
53
510
91
510
108
479
510
58
455
432
414
510
216
1
221
81
487
449
1
230
391
498
510
510
451
510
510
338
510
60
510
178
510
20
510
53
510
102
510
118
510
127
510
510
146
510
378
465
362
510
204
510
510
204
510
47
297
363
392
510
42
509
510
205
510
193
510
26
263
510
45
1
1
398
469
422
510
510
53
404
449
415
49
510
142
485
486
510
510
180
334
510
205
510
138
510
510
54
510
510
21
278
127
122
475
256
114
510
510
201
510
100
1
490
289
437
509
392
1
1
510
366
510
239
510
376
187
486
244
377
510
400
274
510
191
393
1
510
242
510
510
25
103
510
14
510
510
187
36

ValueError: not enough values to unpack (expected 2, got 1)

In [None]:
len(chunks)

12952

In [None]:
a = torch.arange(10)
a

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
torch.cat([torch.Tensor([101]), a, torch.Tensor([102])])

tensor([101.,   0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9., 102.])

In [None]:
chunksize = 512

for i in range(len(input_id_chunks)):
input_id_chunks[i] = torch.cat([
torch. Tensor ([101]), input_id_chunks[i], torch.Tensor ([102])
mask_chunks[i] = torch.cat([
torch. Tensor([1]), mask_chunks[i], torch.Tensor([1])
pad_len = chunksize - input_id_chunk[i].shape[0]|
if pad_len > 0:
input_id_chunks[i] = torch.cat ([
input_id_chunks[i],, torch. Tensor ([0] * pad_len)
mask_chunks[i] = torch.cat([
mask_chunks[i], torch.Tensor([0] * pad_len)
for chunk in input_id_chunks:
print(len(chunk))
chunk

## Sentiment Analysis using Textblob

So far, all of the analysis we've done has been pretty generic - looking at counts, creating scatter plots, etc. These techniques could be applied to numeric data as well.

When it comes to text data, there are a few popular techniques that we'll be going through in the next few notebooks, starting with sentiment analysis. A few key points to remember with sentiment analysis.

1. **TextBlob Module:** Linguistic researchers have labeled the sentiment of words based on their domain expertise. Sentiment of words can vary based on where it is in a sentence. The TextBlob module allows us to take advantage of these labels.
2. **Sentiment Labels:** Each word in a corpus is labeled in terms of polarity and subjectivity (there are more labels as well, but we're going to ignore them for now). A corpus' sentiment is the average of these.
   * **Polarity**: How positive or negative a word is. -1 is very negative. +1 is very positive.
   * **Subjectivity**: How subjective, or opinionated a word is. 0 is fact. +1 is very much an opinion.

For more info on how TextBlob coded up its [sentiment function](https://planspace.org/20150607-textblob_sentiment/).

Let's take a look at the sentiment of the various transcripts, both overall and throughout the comedy routine.

In [None]:
%pip install textblob
from textblob import TextBlob
import pandas as pd

Collecting textblob
  Downloading textblob-0.18.0.post0-py3-none-any.whl (626 kB)
[K     |████████████████████████████████| 626 kB 3.8 MB/s eta 0:00:01
[?25hCollecting nltk>=3.8
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
[K     |████████████████████████████████| 1.5 MB 40.5 MB/s eta 0:00:01
Installing collected packages: nltk, textblob
  Attempting uninstall: nltk
    Found existing installation: nltk 3.7
    Uninstalling nltk-3.7:
      Successfully uninstalled nltk-3.7
Successfully installed nltk-3.8.1 textblob-0.18.0.post0
Note: you may need to restart the kernel to use updated packages.


In [None]:
nifty['Polarity'] = nifty['Combined'].apply(lambda text: TextBlob(text).sentiment.polarity)
nifty['Subjectivity'] = nifty['Combined'].apply(lambda text: TextBlob(text).sentiment.subjectivity)

In [None]:
nifty[['Combined', 'Polarity', 'Subjectivity']].head()
nifty.to_csv('nifty_sentiment_textblob.csv', index=False)

In [None]:
# We'll start by reading in the corpus, which preserves word order
import pandas as pd

data = pd.read_pickle('corpus.pkl')
data

In [None]:
# Create quick lambda functions to find the polarity and subjectivity of each routine
# Terminal / Anaconda Navigator: conda install -c conda-forge textblob
from textblob import TextBlob

pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

data['polarity'] = data['news'].apply(pol)
data['subjectivity'] = data['news'].apply(sub)
data

In [None]:
# Let's plot the results
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 8]

for index, news in enumerate(data.index):
    x = data.polarity.loc[news]
    y = data.subjectivity.loc[news]
    plt.scatter(x, y, color='blue')
    plt.text(x+.001, y+.001, data['full_name'][index], fontsize=10)
    plt.xlim(-.01, .12) 
    
plt.title('Sentiment Analysis', fontsize=20)
plt.xlabel('<-- Negative -------- Positive -->', fontsize=15)
plt.ylabel('<-- Facts -------- Opinions -->', fontsize=15)

plt.show()

## Sentiment Analysis using Vader

In [51]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk

In [52]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/yuanxiaohan/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [53]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [54]:
sent = SentimentIntensityAnalyzer() # create an object / instance

In [55]:
sent.polarity_scores("   this scotch was awesome. I had a heavenly feeling by taking a sip")

{'neg': 0.0, 'neu': 0.422, 'pos': 0.578, 'compound': 0.8625}

In [56]:
# Web scraping stuff from Samyak
nifty = pd.read_csv("nifty50_news_combined.csv")
# # Stock names
# stocks = []
print(nifty)
nifty.head()
nifty.iloc[2]["Content"]

                                                   Title  \
0      Sacked and stripped of bonuses: Srikrishna rep...   
1      Ex-CEO Chanda Kochhar may have to return over ...   
2      Dr Reddy's Laboratories Ltd appoints Axis Bank...   
3      Here's a timeline of ICICI Bank-Videocon loan ...   
4      How much money former ICICI Bank CEO Chanda Ko...   
...                                                  ...   
12947  L&T Finance, L&T Infra Credit and 5 other NBFC...   
12948  Larsen & Toubro Stocks Live Updates: Larsen & ...   
12949  Larsen & Toubro falls Monday, underperforms ma...   
12950  How L&T is Driving India's Green Hydrogen Boom...   
12951  Sunpure Signs Contract with L&T to Supply PV R...   

                                                 Content  \
0      An independent inquiry initiated by ICICI Bank...   
1                                                    NaN   
2      \n\n\n\n(You can now subscribe to our\n\n(You ...   
3      Answer: Videocon – That is the r

'\n\n\n\n(You can now subscribe to our\n\n(You can now subscribe to our Economic Times WhatsApp channel\n\nNEW DELHI: Pharma major Dr Reddy\'s Laboratories Thursday said it has appointed Axis Bank \'s former MD and CEO Shikha Sharma as company\'s independent additional director for five years."Shikha Sanjaya Sharma has been appointed as an additional director, categorised as independent, on the Board of Dr Reddy\'s Laboratories Ltd for a period of five years, effective January 31, 2019," the company said in a filing to the BSE.Sharma was the Managing Director and CEO of Axis Bank from June 2009 up to December 2018, it added.She has more than three decades of experience in the financial sector, having begun her career with ICICI Bank Ltd in 1980. During her tenure with the ICICI group, she was instrumental in setting up ICICI Securities, Dr Reddy\'s said.Sharma holds an MBA from the Indian Institute of Management, Ahmedabad and PGD in Software Technology from National Centre for Softwar

In [57]:
nifty['Title'] = nifty['Title'].astype(str)
nifty['Content'] = nifty['Content'].astype(str)

In [58]:
def get_sentiment_score(text):
    return sent.polarity_scores(text)

In [59]:
nifty['Combined'] = nifty['Title'] + ". " + nifty['Content']
nifty['vader_scores'] = combined_data.apply(get_sentiment_score)
nifty['compound'] = nifty['vader_scores'].apply(lambda score_dict: score_dict['compound'])
nifty['positive'] = nifty['vader_scores'].apply(lambda score_dict: score_dict['pos'])
nifty['neutral'] = nifty['vader_scores'].apply(lambda score_dict: score_dict['neu'])
nifty['negative'] = nifty['vader_scores'].apply(lambda score_dict: score_dict['neg'])


AttributeError: 'Series' object has no attribute 'encode'

In [None]:
print(nifty[['Combined', 'compound', 'positive', 'neutral', 'negative']].head())

In [None]:
nifty.to_csv("nifty50_news_with_vader.csv", index=False)

In [56]:
%pip install transformers

Note: you may need to restart the kernel to use updated packages.


In [57]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from transformers import pipeline

# Load the pre-trained DistilBERT tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

# Create a sentiment analysis pipeline
nlp = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

In [59]:
def distilbert_sentiment(text):
    # Truncate the text to the maximum sequence length that DistilBERT can handle
    tokens = tokenizer.tokenize(text)[:tokenizer.model_max_length - 2]
    # Convert list of tokens back to string
    truncated_text = tokenizer.convert_tokens_to_string(tokens)
    # Use the truncated text for analysis
    sentence = Sentence(truncated_text)
    classifier.predict(sentence)
    sentiment_value = sentence.labels[0].to_dict()['value']  # 'POSITIVE' or 'NEGATIVE'
    confidence_score = sentence.labels[0].to_dict()['confidence']  # e.g. 0.9999
    
    return sentiment_value, confidence_score

# Assuming 'nifty' is your DataFrame and 'Combined' is the column with texts to analyze
nifty['DistilBERT_Sentiment'], nifty['DistilBERT_Confidence'] = zip(*nifty['Combined'].apply(distilbert_sentiment))

# Display the DataFrame to verify
print(nifty[['Combined', 'DistilBERT_Sentiment', 'DistilBERT_Confidence']].head())

                                            Combined DistilBERT_Sentiment  \
0  Sacked and stripped of bonuses: Srikrishna rep...             NEGATIVE   
1  Ex-CEO Chanda Kochhar may have to return over ...             NEGATIVE   
2  Dr Reddy's Laboratories Ltd appoints Axis Bank...             NEGATIVE   
3  Here's a timeline of ICICI Bank-Videocon loan ...             NEGATIVE   
4  How much money former ICICI Bank CEO Chanda Ko...             NEGATIVE   

   DistilBERT_Confidence  
0               0.999879  
1               0.997500  
2               0.950661  
3               0.999739  
4               0.999497  


In [None]:
nifty.to_csv('nifty_with_distilbert.csv', index=False)

In [None]:
# import SentimentIntensityAnalyzer class
# from vaderSentiment.vaderSentiment module.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
 
# sentiments of the sentence.
def sentiment_scores(sentence):
 
    # Create a SentimentIntensityAnalyzer object.
    sid_obj = SentimentIntensityAnalyzer()
 
    # polarity_scores method of SentimentIntensityAnalyzer
    # object gives a sentiment dictionary.
    # which contains pos, neg, neu, and compound scores.
    sentiment_dict = sid_obj.polarity_scores(sentence)
     
#     print("Overall sentiment dictionary is : ", sentiment_dict)
#     print("sentence was rated as ", sentiment_dict['neg']*100, "% Negative")
#     print("sentence was rated as ", sentiment_dict['neu']*100, "% Neutral")
#     print("sentence was rated as ", sentiment_dict['pos']*100, "% Positive")
 
#     print("Sentence Overall Rated As", end = " ")
 
#     # decide sentiment as positive, negative and neutral
#     if sentiment_dict['compound'] >= 0.05 :
#         print("Positive")
 
#     elif sentiment_dict['compound'] <= - 0.05 :
#         print("Negative")
 
#     else :
#         print("Neutral")
 

In [None]:
#for each sentence, add the sentiment into column of data frame
#might need to change starting dataframe to accomodate for sentences?

In [1]:
import pandas as pd



In [2]:
# Load the datasets
distilbert_df = pd.read_csv('nifty_with_distilbert.csv')
vader_df = pd.read_csv('nifty50_news_with_vader.csv')
textblob_df = pd.read_csv('nifty_sentiment_textblob.csv')

In [3]:
common_columns = ['Title', 'Content', 'Link', 'Date', 'Ticker', 'Combined']

In [5]:
# Merge DistilBERT and Vader on the common columns
merged_df = pd.merge(distilbert_df, vader_df, on=common_columns, how='inner')

# Now merge the above with TextBlob on the common columns
final_merged_df = pd.merge(merged_df, textblob_df[['Title', 'Polarity', 'Subjectivity']], on='Title', how='inner')

# Save the merged dataframe to a new CSV file
final_merged_df.to_csv('nifty_sentiment_combined.csv', index=False)

