In this notebook, I use a dataset of financial news headlines to build a model that predicts the sentiment of each headline.
The dataset has 2 columns and about 4.8k rows: the news headline and its sentiment label.

In text mining, data preprocessing is the most important part. Analysis on raw or nearly raw data can be misleading. Only after completing this step carefully do you have a good chance of building a successful model and making correct predictions.

Text preprocessing consists of two phases: standardization and data cleaning.
Their substeps are lowercasing, removing numbers and punctuation, lemmatization, and removing stop words.

Lowercasing is the first step of standardization and cleaning. Without it, the same word written with uppercase and lowercase letters is treated as two different words in text analysis. Converting all characters to lowercase prevents this.
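
As a small illustration of why case matters, the sketch below counts tokens before and after lowercasing. The headline strings are made up for the example and are not rows from the dataset.

```python
from collections import Counter

# Made-up headlines for illustration; not rows from the dataset.
headlines = ["Profit rose in the third quarter", "Operating profit rose sharply"]

def count_tokens(texts):
    """Count whitespace-separated tokens across all texts."""
    return Counter(token for text in texts for token in text.split())

raw_counts = count_tokens(headlines)
lower_counts = count_tokens(text.lower() for text in headlines)

print(raw_counts["Profit"], raw_counts["profit"])  # counted as two different words
print(lower_counts["profit"])                      # merged into a single entry
```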

Removing numbers and punctuation is the next step, but you should be careful with it. Depending on the question you are trying to answer, this step can be modified or skipped.
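
To make the trade-off concrete, here is a minimal sketch comparing two ways of dropping numbers and punctuation. Both helper functions and the sample sentence are hypothetical. The first keeps only whole tokens that pass `isalpha()`, and the second deletes individual characters with `str.translate` instead.

```python
import string

# Made-up sentence; note the punctuation attached to "profit," and "2010."
sample = "net profit, up 12 % , reached eur 4.5 mn in 2010."

def keep_alpha_tokens(text):
    """Keep only tokens made entirely of letters (drops 'profit,' as a whole)."""
    return " ".join(tok for tok in text.split() if tok.isalpha())

def strip_digits_and_punctuation(text):
    """Delete digit and punctuation characters, then keep the remaining non-empty tokens."""
    table = str.maketrans("", "", string.punctuation + string.digits)
    return " ".join(text.translate(table).split())

print(keep_alpha_tokens(sample))
print(strip_digits_and_punctuation(sample))
```

The first variant loses `profit` because of the attached comma, and both discard the figures themselves, which may or may not matter for the question being asked.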

Lemmatization is an important step in text analysis. It reduces words to their base forms.
Removing stop words is the last step. Like other kinds of data, text contains noise, in this case very common words that carry little information. After this step, the words that are more likely to carry valuable information remain.
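
The following sketch shows the effect of these two steps on a single sentence. It deliberately avoids any NLP library: the lemma lookup table and the stop-word set are tiny, hand-made stand-ins (assumptions for illustration only), whereas a real pipeline would take both from a library such as spaCy.

```python
# Toy stand-ins for a real lemmatizer and stop-word list (illustrative assumptions).
TOY_LEMMAS = {"shares": "share", "rose": "rise", "was": "be", "announced": "announce"}
TOY_STOPWORDS = {"the", "of", "after", "be", "a"}

def toy_lemmatize(tokens):
    """Map each token to its base form when the toy table knows it."""
    return [TOY_LEMMAS.get(tok, tok) for tok in tokens]

def toy_remove_stopwords(tokens):
    """Drop tokens that carry little meaning on their own."""
    return [tok for tok in tokens if tok not in TOY_STOPWORDS]

tokens = "the shares rose after the deal was announced".split()
cleaned = toy_remove_stopwords(toy_lemmatize(tokens))
print(cleaned)
```

Running lemmatization first matters in this toy setup: "was" only matches the stop-word set after it has been reduced to "be".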

After the text preprocessing steps, I try to build a classifier for sentiment prediction and review the model outputs at the end of this notebook.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Loading dependencies

In [None]:
# Import from the public tensorflow.keras API rather than the private tensorflow.python.keras modules
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, GRU, Embedding
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import string
from textblob import TextBlob
import spacy
nlp=spacy.load('en_core_web_sm')

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score,recall_score
import seaborn as sns

Defining functions for loading and preprocessing

In [None]:
def read_data():
    dataset=pd.read_csv('../input/sentiment-analysis-for-financial-news/all-data.csv',
                header=None,
                 names=['Sentiment','News'])
    return dataset

def convert_to_lower_case():
    def lower(input_text):
        return input_text.lower()
    dataset['News']=dataset['News'].apply(lower)
    
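def remove_punctuation():
    # Keeps only whitespace-separated tokens made entirely of letters.
    # This drops numbers and standalone punctuation, but also any word
    # with punctuation attached to it (e.g. "profit," is removed as a whole).
    def remove_punctuation_from_text(input_text):
        output_list=[word for word in input_text.split() if word.isalpha()]
        return ' '.join(output_list)
    dataset['News']=dataset['News'].apply(remove_punctuation_from_text)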
    
def correct_words():
    def correct_text(input_text):
        list_1=[str(TextBlob(word).correct()) for word in input_text.split()]
        output_text= ' '.join(list_1)
        return output_text
    dataset['News']=dataset['News'].apply(correct_text)
    
def lemmatize():
    def lemmatize_text(input_text):
        doc=nlp(input_text)
        lemmas=[token.lemma_ for token in doc]
        output_text=' '.join(lemmas)
        return output_text
    dataset['News']=dataset['News'].apply(lemmatize_text)
    
def remove_stopwords():
    def remove_stopwords_from_text(input_text):
        stopwords=spacy.lang.en.stop_words.STOP_WORDS
        output_list=[word for word in input_text.split() if word not in stopwords and not(word=='-PRON-') ]
        return ' '.join(output_list)
    dataset['News']=dataset['News'].apply(remove_stopwords_from_text)

def filter_the_neutral_news():
    return dataset[dataset['Sentiment']!='neutral']

def create_target_and_input():
    target=dataset['Sentiment'].values.tolist()
    target=[1 if sentiment=='positive' else 0 for sentiment in target]
    data=dataset['News'].values.tolist()
    return data,target

def split_train_test():
    x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42,stratify=target)
    y_train=np.array(y_train)
    y_test=np.array(y_test)
    return x_train, x_test, y_train, y_test 

Reading the data, preprocessing the text, and splitting the dataset into train and test sets

In [None]:
#reading the data
dataset=read_data()
#Preprocessing the text
convert_to_lower_case()
remove_punctuation()
lemmatize()
remove_stopwords()
#Preparing data for model
dataset=filter_the_neutral_news()
data, target=create_target_and_input()
x_train, x_test, y_train, y_test =split_train_test()

Preparing dataset for modelling.

In [None]:
#setting a threshold for the vocabulary size

num_words=1000 # vocabulary limit: only the most frequent words (token ids below num_words) are kept
tokenizer=Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(data)

#tokenization
x_train_tokens=tokenizer.texts_to_sequences(x_train)
x_test_tokens=tokenizer.texts_to_sequences(x_test)

#setting a threshold for the number of words in each text
num_tokens=[len(tokens) for tokens in x_train_tokens+x_test_tokens]
num_tokens=np.array(num_tokens)
max_tokens=np.mean(num_tokens)+2*np.std(num_tokens)
max_tokens=int(max_tokens)

#padding
x_train_pad=pad_sequences(x_train_tokens,
                              maxlen=max_tokens)
x_test_pad=pad_sequences(x_test_tokens,
                         maxlen=max_tokens)
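
For readers unfamiliar with these two helpers, the sketch below imitates in plain Python what the cell above relies on: words are ranked by frequency, mapped to integer ids starting at 1, and each sequence is then brought to a fixed length. It is a simplified stand-in, not the Keras implementation, and the function names are hypothetical. It follows the documented Keras defaults: only ids below `num_words` are kept, and `pad_sequences` pads and truncates at the front ("pre").

```python
from collections import Counter

def fit_vocab(texts):
    """Assign ids 1, 2, 3, ... by decreasing word frequency (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, vocab, num_words):
    """Replace words by ids, silently dropping unknown words and ids >= num_words."""
    return [[vocab[w] for w in text.split() if w in vocab and vocab[w] < num_words]
            for text in texts]

def pad_pre(sequences, maxlen):
    """Left-pad with zeros, and keep only the last maxlen ids of longer sequences."""
    return [[0] * (maxlen - len(seq[-maxlen:])) + seq[-maxlen:] for seq in sequences]

# Made-up, already preprocessed texts.
texts = ["profit rise", "profit fall sharply", "loss widen profit fall"]
vocab = fit_vocab(texts)
ids = texts_to_ids(texts, vocab, num_words=3)
print(pad_pre(ids, maxlen=3))
```

With `num_words=3`, only the two most frequent words survive, which is why a limit of 1000 in the notebook corresponds to at most 999 distinct tokens.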


After setting the maximum token threshold, let's compute the share of texts that are shorter than it and therefore do not need to be truncated.

In [None]:
print('{:.2f}'.format(np.sum(num_tokens<max_tokens)/len(num_tokens)))
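
The threshold rule used above, mean plus two standard deviations, is easy to check by hand on a toy list of lengths. The numbers below are invented. `statistics.pstdev` is the population standard deviation, which matches the default behaviour of `np.std`.

```python
import statistics

# Invented token counts per text; one unusually long text at the end.
lengths = [5, 7, 8, 10, 12, 9, 30]

mean = statistics.mean(lengths)
std = statistics.pstdev(lengths)      # population std, like np.std with default ddof=0
max_len = int(mean + 2 * std)         # int() truncates towards zero

# Share of texts strictly shorter than the threshold.
coverage = sum(n < max_len for n in lengths) / len(lengths)
print(max_len, "{:.2f}".format(coverage))
```

Here a single outlier is the only text that would be truncated, while the padded length stays well below the longest text.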

Building the model

In [None]:
#creating model
model=Sequential()
embedding_size=50  # each word is represented by a 50-dimensional vector
# The word vectors start out random and are updated at each optimization step.
model.add(Embedding(input_dim=num_words,
                    output_dim=embedding_size,
                    input_length=max_tokens,
                    name='embedding_layer')
) # the Embedding layer takes a sequence of token ids as input and outputs one vector per token

model.add(GRU(units=16, # number of units (size of the hidden state)
              return_sequences=True) # if True, the layer outputs its hidden state at every time step; the next GRU layer needs this full sequence as its input
)
model.add(GRU(units=8, return_sequences=True))
model.add(GRU(units=4))
model.add(Dense(1,activation='sigmoid')) # the sigmoid activation produces an output between 0 and 1
optimizer=Adam(learning_rate=1e-3) # 'lr' is a deprecated alias of 'learning_rate'

In [None]:
model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])
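
As a reminder of what this loss measures, here is a small, library-free sketch of the sigmoid function and the binary cross-entropy for a single prediction. The function names and the example logit values are illustrative. Keras computes the same quantity averaged over a batch, with additional numerical safeguards.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, p):
    """Loss for one example: small when p is close to the true label, large otherwise."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

for z in (-2.0, 0.0, 2.0):
    p = sigmoid(z)
    print("logit {:+.1f} -> p={:.3f}, loss if positive={:.3f}, loss if negative={:.3f}".format(
        z, p, binary_cross_entropy(1, p), binary_cross_entropy(0, p)))
```

A confident prediction in the wrong direction is penalized much more heavily than an uncertain one, which is what pushes the network's outputs towards the correct label during training.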

Model summary

In [None]:
model.summary()

Fitting the model

In [None]:
model.fit(x_train_pad,
          y_train,
          epochs=5)

Testing the model performance

In [None]:
result=model.evaluate(x_test_pad,
                      y_test)

In [None]:
#model success on the test dataset
y_test_pred=model.predict(x=x_test_pad)
y_test_pred=y_test_pred.T[0]
y_test_pred=np.array([1.0 if p>0.5 else 0.0 for p in y_test_pred])

precision_scr=precision_score(y_test, y_test_pred)
recall_scr=recall_score(y_test, y_test_pred)

print('Precision Score: {:.2f}'.format(precision_scr))
print('Recall Score: {:.2f}'.format(recall_scr))

# Model Results

With this model, I am able to predict the sentiment correctly for 82% of the headlines.

The precision is 89%, which means that out of 100 headlines the model predicts as positive, 89 are actually positive.

The recall is 84%, which means the model correctly identifies 84% of all positive cases.
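
To make these definitions concrete, the sketch below computes accuracy, precision and recall from scratch on a handful of made-up labels and probabilities (they are not outputs of this model), using a 0.5 decision threshold like the one applied to the predictions above.

```python
# Made-up ground truth and predicted probabilities; not results from the notebook.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.2, 0.6, 0.7, 0.1, 0.3]

# Turn probabilities into hard 0/1 predictions.
y_pred = [1 if p > 0.5 else 0 for p in y_prob]

# Count the four outcomes of a binary classifier.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)   # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted positives that are truly positive
recall = tp / (tp + fn)              # share of true positives that the model finds

print("accuracy={:.2f} precision={:.2f} recall={:.2f}".format(accuracy, precision, recall))
```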

If you want to improve these scores and get a better predictor, focus more on the preprocessing steps and on tuning the model parameters.