<a href="https://colab.research.google.com/github/anjalinagel12/Google-colab-notebook/blob/master/news_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/50/10/aeefced99c8a59d828a92cc11d213e2743212d3641c87c82d61b035a7d5c/transformers-2.3.0-py3-none-any.whl (447kB)

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

## Importing the dataset
We'll use pandas to read the dataset and load it into a dataframe.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
cd /content/drive/My Drive/ZS/input/dataset/

/content/drive/My Drive/ZS/input/dataset


In [0]:
df = pd.read_csv('/content/drive/My Drive/ZS/input/dataset/train_file.csv', sep=",")
#df.tail(2)

# SentimentHeadline prediction

In [0]:
batch_total = df[['Headline', 'SentimentHeadline']]

In [0]:
batch_1 = batch_total[:2000]

## Loading the Pre-trained BERT model
Let's now load a pre-trained DistilBERT model; the cell also shows how to switch to full BERT.

In [0]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')
'''
# Transformers has a unified API
# for 10 transformer architectures and 30 pretrained weights.
#          Model          | Tokenizer          | Pretrained weights shortcut
MODELS = [(BertModel,       BertTokenizer,       'bert-base-uncased'),
          (OpenAIGPTModel,  OpenAIGPTTokenizer,  'openai-gpt'),
          (GPT2Model,       GPT2Tokenizer,       'gpt2'),
          (CTRLModel,       CTRLTokenizer,       'ctrl'),
          (TransfoXLModel,  TransfoXLTokenizer,  'transfo-xl-wt103'),
          (XLNetModel,      XLNetTokenizer,      'xlnet-base-cased'),
          (XLMModel,        XLMTokenizer,        'xlm-mlm-enfr-1024'),
          (DistilBertModel, DistilBertTokenizer, 'distilbert-base-uncased'),
          (RobertaModel,    RobertaTokenizer,    'roberta-base'),
          (XLMRobertaModel, XLMRobertaTokenizer, 'xlm-roberta-base'),
         ]

# To use TensorFlow 2.0 versions of the models, simply prefix the class names with 'TF', e.g. `TFRobertaModel` is the TF 2.0 counterpart of the PyTorch model `RobertaModel`

# Each architecture is provided with several class for fine-tuning on down-stream tasks, e.g.
BERT_MODEL_CLASSES = [BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
                      BertForSequenceClassification, BertForTokenClassification, BertForQuestionAnswering]

# All the classes for an architecture can be initiated from pretrained weights for this architecture
# Note that additional weights added for fine-tuning are only initialized
'''

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Right now, the variable `model` holds a pretrained DistilBERT model -- a version of BERT that is smaller, much faster, and requires a lot less memory.

## Model #1: Preparing the Dataset
Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.

### Tokenization
Our first step is to tokenize the sentences -- break them up into words and subwords in the format BERT is comfortable with.

## Tokenizer

In [0]:
tokenized = batch_1['Headline'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
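
To make "words and subwords" concrete, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece tokenizers such as BERT's apply to a single word. The tiny `TOY_VOCAB` and the function name `wordpiece_split` are made up for illustration; the real tokenizer works with a vocabulary of roughly 30,000 entries and also handles lowercasing, punctuation, and special tokens.

```python
# Hypothetical toy vocabulary; continuation pieces carry the "##" prefix.
TOY_VOCAB = {"[UNK]", "micro", "##soft", "release", "##s", "surface"}

def wordpiece_split(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first split of one lowercase word into pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece_split("microsoft"))  # ['micro', '##soft']
print(wordpiece_split("releases"))   # ['release', '##s']
print(wordpiece_split("giant"))      # ['[UNK]']
```

`tokenizer.encode` then maps each piece to its integer id and, with `add_special_tokens=True`, wraps the sequence in the `[CLS]` and `[SEP]` markers.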

In [0]:
# Length of the longest tokenized headline; every row is padded up to it.
max_len = max(len(ids) for ids in tokenized.values)

#max_len = max(max_len,66) # to make it more understandable as our images have tokens of length 66
# Right-pad with 0, the id of the [PAD] token in this vocabulary.
padded = np.array([ids + [0]*(max_len-len(ids)) for ids in tokenized.values])


Our dataset is now in the `padded` variable; we can view its dimensions below:

In [11]:
np.array(padded).shape

(2000, 121)
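
The padding step can be seen on a toy input. This is a dependency-free sketch: the short id lists are invented, and `pad_to_longest` is a hypothetical helper name, not part of `transformers`.

```python
def pad_to_longest(sequences, pad_id=0):
    """Right-pad every id list with pad_id so all rows share one length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Invented token ids; 101 and 102 stand in for [CLS] and [SEP].
toy = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
toy_padded = pad_to_longest(toy)
print(toy_padded)
# [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
```

Padding every row to the longest sequence is why the shape above is `(2000, 121)`: 121 is the token count of the longest headline among the 2,000 rows.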

In [0]:
df_test = pd.read_csv('/content/drive/My Drive/ZS/input/dataset/test_file.csv', sep=",")
#df_test.tail(2)

In [0]:
batch_total_test = df_test[['Headline','IDLink']]

In [15]:
len(batch_total_test)

37288

In [0]:
# .copy() so that adding a prediction column later does not write into a slice of df_test
batch_1_test = batch_total_test[:1000].copy()

In [0]:
tokenized_test = batch_1_test['Headline'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

In [0]:
# Length of the longest tokenized test headline; every row is padded up to it.
max_len = max(len(ids) for ids in tokenized_test.values)

#max_len = max(max_len,66) # to make it more understandable as our images have tokens of length 66
# Right-pad with 0, the id of the [PAD] token in this vocabulary.
padded_test = np.array([ids + [0]*(max_len-len(ids)) for ids in tokenized_test.values])


In [19]:
np.array(padded_test).shape

(1000, 156)

### Masking
If we sent `padded` to BERT as is, the model would also attend to the padding tokens. We need another variable that tells it to ignore (mask) the padding we've added when it processes its input. That's what `attention_mask` is:

In [20]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(2000, 121)

In [21]:
attention_mask_test = np.where(padded_test != 0, 1, 0)
attention_mask_test.shape

(1000, 156)
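
A toy version of the mask computation, written in plain Python rather than NumPy so it runs without dependencies. The ids are invented. The comparison against 0 relies on 0 being the padding id, which holds for the `distilbert-base-uncased` vocabulary used here but is worth checking (for example via `tokenizer.pad_token_id`) if the model is swapped.

```python
PAD_ID = 0  # assumption: 0 is the padding id, as in the BERT/DistilBERT uncased vocabularies

def build_attention_mask(padded_rows, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding, row by row."""
    return [[int(token_id != pad_id) for token_id in row] for row in padded_rows]

toy_padded = [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
toy_mask = build_attention_mask(toy_padded)
print(toy_mask)
# [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The mask always has the same shape as the padded input, which is what the two shape outputs above confirm.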

In [0]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

In [0]:
input_ids_test = torch.tensor(padded_test)  
attention_mask_test = torch.tensor(attention_mask_test)

with torch.no_grad():
    last_hidden_states_test = model(input_ids_test, attention_mask=attention_mask_test)

In [0]:
features = last_hidden_states[0][:,0,:].numpy()

In [0]:
features_test = last_hidden_states_test[0][:,0,:].numpy()
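
The slice `[:, 0, :]` used above keeps, for every sentence, only the hidden vector at position 0, which is the `[CLS]` token that `add_special_tokens=True` puts at the start of each sequence. A sketch with nested lists standing in for the `(batch, tokens, hidden)` tensor (the numbers and the tiny hidden size of 2 are invented; DistilBERT's hidden size is 768):

```python
# Shape (batch=2, tokens=3, hidden=2), written as nested lists.
toy_hidden_states = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],   # sentence 0
    [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]],   # sentence 1
]

# Equivalent of tensor[:, 0, :]: the first token's vector for each sentence.
cls_features = [sentence[0] for sentence in toy_hidden_states]
print(cls_features)  # [[0.1, 0.2], [1.1, 1.2]]
```

Each sentence is thus reduced to one fixed-length vector, which is what the downstream regressor receives as its feature row.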

The headline sentiment scores (continuous values, not positive/negative classes) now go into the `labels` variable.

In [0]:
labels = batch_1['SentimentHeadline']

## Model #2: Train/Test Split
Let's now split our dataset into a training set and a testing set (we're using only the first 2,000 headlines from `train_file.csv`).

In [0]:
# parameters = {'C': np.linspace(0.0001, 100, 20)}
# grid_search = GridSearchCV(LogisticRegression(), parameters)
# grid_search.fit(train_features, train_labels)

# print('best parameters: ', grid_search.best_params_)
# print('best scores: ', grid_search.best_score_)

The grid search above is left commented out: it tunes `C` for `LogisticRegression`, which is a classifier, whereas the sentiment scores here are continuous. The notebook therefore trains an XGBoost regressor instead.

## XGBoost

In [0]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error

In [0]:
X=features
y=labels

In [0]:
X_TEST=features_test


In [0]:
data_dmatrix = xgb.DMatrix(features,labels)



In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)


In [0]:
# 'reg:squarederror' is the current name of the objective formerly called 'reg:linear'.
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=5, alpha=10, n_estimators=10)

In [34]:
xg_reg.fit(X_train,y_train)

preds = xg_reg.predict(X_test)



In [35]:
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

RMSE: 0.239476
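
For reference, the metric printed above is the root mean squared error over the $n$ held-out rows, where $y_i$ is the true sentiment score and $\hat{y}_i$ the prediction (these symbols are introduced here, not in the notebook):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```

With `test_size=0.2` on 2,000 rows, $n = 400$.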


In [36]:
# Deeper trees and more boosting rounds than xg_reg; same squared-error objective.
xg_reg1 = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1,
                           max_depth=10, alpha=10, n_estimators=70)
xg_reg1.fit(X_train,y_train)

preds1 = xg_reg1.predict(X_test)
rmse1 = np.sqrt(mean_squared_error(y_test, preds1))
print("RMSE: %f" % (rmse1))

RMSE: 0.135742
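
An RMSE is easier to interpret next to a trivial baseline, such as always predicting the mean of the training labels. Below is a minimal sketch in plain Python with invented numbers (`rmse` and the toy lists are illustrative names, not from the notebook); on the real data one would pass `y_train` and `y_test` instead.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Invented sentiment scores standing in for y_train / y_test.
toy_train = [0.10, -0.20, 0.05, 0.00, -0.05]
toy_test = [0.15, -0.10, 0.00]

# Constant predictor: the mean of the training labels, repeated for every test row.
mean_label = sum(toy_train) / len(toy_train)
baseline_preds = [mean_label] * len(toy_test)
baseline_rmse = rmse(toy_test, baseline_preds)
print("baseline RMSE: %f" % baseline_rmse)
```

A regressor is only adding value if its RMSE on the same held-out rows is clearly below this baseline.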


## Submission file

In [0]:
batch_1_test['SentimentHeadline'] = xg_reg1.predict(X_TEST)
batch_1_test.to_csv('submission_headline.csv')

# SentimentTitle prediction

In [38]:
df = pd.read_csv('/content/drive/My Drive/ZS/input/dataset/train_file.csv', sep=",")
df.tail(2)

Unnamed: 0,IDLink,Title,Headline,Source,Topic,PublishDate,Facebook,GooglePlus,LinkedIn,SentimentTitle,SentimentHeadline
55930,P0EBiaSEjq,Microsoft finally releases giant Surface,Microsoft’s business customers are finally beg...,TechEye,microsoft,2016-03-29 01:38:00,0,0,0,0.0,-0.028296
55931,99MLvyAQTJ,Rays of sunshine in the US economy,AS WE all know from listening to the campaign ...,Washington Post,economy,2016-03-29 01:41:08,75,7,19,0.0,0.184444


In [0]:
batch_total = df[['Title', 'SentimentTitle']]

In [0]:
batch_1 = batch_total[:2000]

## Loading the Pre-trained BERT model
Let's now load a pre-trained DistilBERT model; the cell also shows how to switch to full BERT.

In [0]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')
'''
# Transformers has a unified API
# for 10 transformer architectures and 30 pretrained weights.
#          Model          | Tokenizer          | Pretrained weights shortcut
MODELS = [(BertModel,       BertTokenizer,       'bert-base-uncased'),
          (OpenAIGPTModel,  OpenAIGPTTokenizer,  'openai-gpt'),
          (GPT2Model,       GPT2Tokenizer,       'gpt2'),
          (CTRLModel,       CTRLTokenizer,       'ctrl'),
          (TransfoXLModel,  TransfoXLTokenizer,  'transfo-xl-wt103'),
          (XLNetModel,      XLNetTokenizer,      'xlnet-base-cased'),
          (XLMModel,        XLMTokenizer,        'xlm-mlm-enfr-1024'),
          (DistilBertModel, DistilBertTokenizer, 'distilbert-base-uncased'),
          (RobertaModel,    RobertaTokenizer,    'roberta-base'),
          (XLMRobertaModel, XLMRobertaTokenizer, 'xlm-roberta-base'),
         ]

# To use TensorFlow 2.0 versions of the models, simply prefix the class names with 'TF', e.g. `TFRobertaModel` is the TF 2.0 counterpart of the PyTorch model `RobertaModel`

# Each architecture is provided with several class for fine-tuning on down-stream tasks, e.g.
BERT_MODEL_CLASSES = [BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
                      BertForSequenceClassification, BertForTokenClassification, BertForQuestionAnswering]

# All the classes for an architecture can be initiated from pretrained weights for this architecture
# Note that additional weights added for fine-tuning are only initialized
'''

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Right now, the variable `model` holds a pretrained DistilBERT model -- a version of BERT that is smaller, much faster, and requires a lot less memory.

## Model #1: Preparing the Dataset
Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.

### Tokenization
Our first step is to tokenize the sentences -- break them up into words and subwords in the format BERT is comfortable with.

## Tokenizer

In [0]:
tokenized = batch_1['Title'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

In [0]:
# Length of the longest tokenized title; every row is padded up to it.
max_len = max(len(ids) for ids in tokenized.values)

#max_len = max(max_len,66) # to make it more understandable as our images have tokens of length 66
# Right-pad with 0, the id of the [PAD] token in this vocabulary.
padded = np.array([ids + [0]*(max_len-len(ids)) for ids in tokenized.values])


Our dataset is now in the `padded` variable; we can view its dimensions below:

In [44]:
np.array(padded).shape

(2000, 32)

In [0]:
batch_total_test2 = df_test[['Title','IDLink']]

In [0]:
# .copy() so that adding a prediction column later does not write into a slice of df_test
batch_1_test2 = batch_total_test2[:1000].copy()

In [0]:
tokenized_test = batch_1_test2['Title'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

In [0]:
# Length of the longest tokenized test title; every row is padded up to it.
max_len = max(len(ids) for ids in tokenized_test.values)

#max_len = max(max_len,66) # to make it more understandable as our images have tokens of length 66
# Right-pad with 0, the id of the [PAD] token in this vocabulary.
padded_test = np.array([ids + [0]*(max_len-len(ids)) for ids in tokenized_test.values])


In [50]:
np.array(padded_test).shape

(1000, 30)

### Masking
If we sent `padded` to BERT as is, the model would also attend to the padding tokens. We need another variable that tells it to ignore (mask) the padding we've added when it processes its input. That's what `attention_mask` is:

In [51]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(2000, 32)

In [52]:
attention_mask_test = np.where(padded_test != 0, 1, 0)
attention_mask_test.shape

(1000, 30)

In [0]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

In [0]:
input_ids_test = torch.tensor(padded_test)  
attention_mask_test = torch.tensor(attention_mask_test)

with torch.no_grad():
    last_hidden_states_test = model(input_ids_test, attention_mask=attention_mask_test)
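
The cells above push all rows through the model in a single call, which can exhaust memory on larger inputs. One common workaround is to process the rows in fixed-size chunks and concatenate the results. The generator below is a plain-Python sketch of the chunking only; `iter_batches` and the batch size of 3 are hypothetical, and the model call itself is left out.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of rows with at most batch_size items each."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

toy_rows = list(range(8))  # stand-in for the rows of `padded`
batches = list(iter_batches(toy_rows, batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6, 7]]
```

In this notebook's setting, each chunk of `padded` and the matching chunk of `attention_mask` would be converted with `torch.tensor` and passed to `model` under `torch.no_grad()`, with the per-chunk `[:, 0, :]` slices stacked afterwards.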

In [0]:
features = last_hidden_states[0][:,0,:].numpy()

In [0]:
features_test = last_hidden_states_test[0][:,0,:].numpy()

The title sentiment scores (continuous values, not positive/negative classes) now go into the `labels` variable.

In [0]:
labels = batch_1['SentimentTitle']

## Model #2: Train/Test Split
Let's now split our dataset into a training set and a testing set (we're using only the first 2,000 titles from `train_file.csv`).

In [0]:
# parameters = {'C': np.linspace(0.0001, 100, 20)}
# grid_search = GridSearchCV(LogisticRegression(), parameters)
# grid_search.fit(train_features, train_labels)

# print('best parameters: ', grid_search.best_params_)
# print('best scores: ', grid_search.best_score_)

The grid search above is left commented out: it tunes `C` for `LogisticRegression`, which is a classifier, whereas the sentiment scores here are continuous. The notebook therefore trains an XGBoost regressor instead.

## XGBoost

In [0]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error

In [0]:
X=features
y=labels

In [0]:
X_TEST=features_test


In [62]:
len(X_TEST)

1000

In [0]:
data_dmatrix = xgb.DMatrix(features,labels)



In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
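
What the split does can be sketched without scikit-learn: shuffle the row indices with a fixed seed, then hold out 20% of them. `holdout_split` is a hypothetical helper; it illustrates the idea and will not reproduce the exact rows that `train_test_split(..., random_state=123)` selects.

```python
import random

def holdout_split(n_rows, test_size=0.2, seed=123):
    """Return (train_idx, test_idx) after a seeded shuffle of range(n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(2000)
print(len(train_idx), len(test_idx))  # 1600 400
```

Fixing the seed matters here because both regressors are compared on the same held-out rows.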


In [0]:
# 'reg:squarederror' is the current name of the objective formerly called 'reg:linear'.
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=5, alpha=10, n_estimators=10)

In [66]:
xg_reg.fit(X_train,y_train)

preds = xg_reg.predict(X_test)



In [67]:
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

RMSE: 0.218313


In [68]:
# Deeper trees and more boosting rounds than xg_reg; same squared-error objective.
xg_reg1 = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1,
                           max_depth=10, alpha=10, n_estimators=70)
xg_reg1.fit(X_train,y_train)

preds1 = xg_reg1.predict(X_test)
rmse1 = np.sqrt(mean_squared_error(y_test, preds1))
print("RMSE: %f" % (rmse1))

RMSE: 0.118482


## Submission file

In [69]:
df_test = pd.read_csv('/content/drive/My Drive/ZS/input/dataset/sample_submission.csv', sep=",")
df_test.tail(2)

Unnamed: 0,IDLink,SentimentTitle,SentimentHeadline
3,lflGp3q2Fj,-0.073611,-0.417361
4,zDYG0SoovZ,0.047111,-0.213201


In [0]:
batch_1_test2['SentimentTitle'] = xg_reg1.predict(X_TEST)
batch_1_test2.to_csv('submissionTitle.csv')

## Merge two dataframes

In [0]:
df1 = pd.read_csv("submissionTitle.csv")
df2 = pd.read_csv("submission_headline.csv")



In [72]:
df1.tail()

Unnamed: 0.1,Unnamed: 0,Title,IDLink,SentimentTitle
995,995,Watch Microsoft's Seeing AI help a blind perso...,jInx470K9I,-0.001126
996,996,Microsoft aims at computing platforms of the f...,ERJGQkk9tI,0.00125
997,997,VIDEO: Hands on with the Microsoft Hololens,uBRBuNnyKF,0.018405
998,998,Hands-on (again) with the Microsoft HoloLens,AVnr6kSHwY,-0.002099
999,999,Consumer confidence in UK economy hit by 'Brex...,MVCl0HBBuE,-0.092407


In [73]:
df2.tail()

Unnamed: 0.1,Unnamed: 0,Headline,IDLink,SentimentHeadline
995,995,"In a span of two and a half hours, Microsoft p...",jInx470K9I,-0.032728
996,996,Microsoft made a pitch Wednesday to developers...,ERJGQkk9tI,-0.04316
997,997,Dave Lee examines the potential of Microsoft's...,uBRBuNnyKF,0.003347
998,998,"Earlier today, Microsoft announced that a free...",AVnr6kSHwY,0.03626
999,999,Consumers have seen their confidence in the UK...,MVCl0HBBuE,-0.059417


In [0]:
df = df1.merge(df2, on="IDLink")
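
`DataFrame.merge` defaults to an inner join, so only `IDLink` values present in both files survive, and a key that occurs more than once on either side yields one output row per pairing. Here is a dictionary-based sketch of the same join on invented rows (`inner_join` is a hypothetical helper written for this illustration, not a pandas function):

```python
def inner_join(left_rows, right_rows, key):
    """Inner-join two lists of dicts on `key`, keeping left-hand row order."""
    right_by_key = {}
    for row in right_rows:
        right_by_key.setdefault(row[key], []).append(row)
    joined = []
    for left in left_rows:
        # Keys missing on the right produce no output row at all.
        for right in right_by_key.get(left[key], []):
            joined.append({**left, **right})
    return joined

# Invented ids and scores.
titles = [{"IDLink": "a1", "SentimentTitle": 0.10},
          {"IDLink": "b2", "SentimentTitle": -0.05}]
headlines = [{"IDLink": "a1", "SentimentHeadline": 0.02},
             {"IDLink": "c3", "SentimentHeadline": 0.30}]

print(inner_join(titles, headlines, "IDLink"))
# [{'IDLink': 'a1', 'SentimentTitle': 0.1, 'SentimentHeadline': 0.02}]
```

Comparing the row count before and after the merge is a quick way to spot dropped or duplicated `IDLink` values.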

In [0]:
df=df[['IDLink','SentimentTitle','SentimentHeadline']]

In [76]:
df.tail()


Unnamed: 0,IDLink,SentimentTitle,SentimentHeadline
995,jInx470K9I,-0.001126,-0.032728
996,ERJGQkk9tI,0.00125,-0.04316
997,uBRBuNnyKF,0.018405,0.003347
998,AVnr6kSHwY,-0.002099,0.03626
999,MVCl0HBBuE,-0.092407,-0.059417


In [0]:
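# index=False keeps the columns identical to sample_submission.csv (no extra index column).
df.to_csv('submission.csv', index=False)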