In this project, we develop a sentiment analysis system intended to gauge the sentiment associated with different stocks. Sentiment analysis helps in understanding public perception and can provide valuable insights. Using natural language processing and machine learning, we analyze text to determine whether the sentiment surrounding a particular stock is positive, negative, or neutral. The models in this notebook are trained on a labeled tweet dataset that contains two classes, negative and positive.

## List of Contents 
- Import libraries
- Data cleaning
- Preprocessing
- Model Evaluation

## Import Libraries

In [70]:
import pandas as pd
import re
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
import joblib

## Data Cleaning

In [2]:
main_df = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='ISO-8859-1', names=['target','id','date','query','username','text'])
main_df.head()

| | target | id | date | query | username | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


In [3]:
main_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column    Non-Null Count    Dtype 
---  ------    --------------    ----- 
 0   target    1600000 non-null  int64 
 1   id        1600000 non-null  int64 
 2   date      1600000 non-null  object
 3   query     1600000 non-null  object
 4   username  1600000 non-null  object
 5   text      1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


Let's take a look at the target variable.

In [4]:
main_df['target'].value_counts()

0    800000
4    800000
Name: target, dtype: int64

The two classes are perfectly balanced, but the positive class is encoded as 4. We remap it to 1 so that the labels are 0 (negative) and 1 (positive).

In [5]:
main_df['target'] = main_df['target'].replace(to_replace=4, value=1)

Preprocessing and modelling all 1.6 million rows would take a long time on my machine, so we work with a subset: the first 50,000 tweets of each class.

In [7]:
neg_sample = main_df[main_df['target']==0][:50_000]
pos_sample = main_df[main_df['target']==1][:50_000]
df = pd.concat([neg_sample, pos_sample])
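
The slices above keep the first 50,000 rows of each class. If the file is ordered by time, as the timestamps in the preview suggest, that subset covers only a narrow window of tweets. A minimal sketch of a random balanced sample follows; the helper name `balanced_sample`, the seed, and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def balanced_sample(frame, label_col, n_per_class, seed=42):
    """Draw n_per_class random rows from each class, then shuffle the result."""
    sampled = frame.groupby(label_col, group_keys=False).sample(
        n=n_per_class, random_state=seed
    )
    # Shuffle so the classes are interleaved rather than stacked
    return sampled.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy frame standing in for main_df (hypothetical data)
toy = pd.DataFrame({
    'target': [0] * 5 + [1] * 5,
    'text': [f'tweet {i}' for i in range(10)],
})

sample = balanced_sample(toy, 'target', n_per_class=3)
print(sample['target'].value_counts())
```

On the real data, the call would look like `balanced_sample(main_df, 'target', n_per_class=50_000)`.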

Now that we have taken a sample of the dataset, we can continue data cleaning.

In [8]:
df = df.reset_index(drop=True)
df.head(10)

| | target | id | date | query | username | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |
| 5 | 0 | 1467811372 | Mon Apr 06 22:20:00 PDT 2009 | NO_QUERY | joy_wolf | @Kwesidei not the whole crew |
| 6 | 0 | 1467811592 | Mon Apr 06 22:20:03 PDT 2009 | NO_QUERY | mybirch | Need a hug |
| 7 | 0 | 1467811594 | Mon Apr 06 22:20:03 PDT 2009 | NO_QUERY | coZZ | @LOLTrish hey long time no see! Yes.. Rains a... |
| 8 | 0 | 1467811795 | Mon Apr 06 22:20:05 PDT 2009 | NO_QUERY | 2Hood4Hollywood | @Tatiana_K nope they didn't have it |
| 9 | 0 | 1467812025 | Mon Apr 06 22:20:09 PDT 2009 | NO_QUERY | mimismo | @twittera que me muera ? |


## Preprocessing

It is essential to take out any usernames and hashtags from the text.

In [43]:
def use_regex(text):
    # remove username
    text = re.sub(r'@[\w_]+', '', text) 
    # remove hashtag
    text = re.sub(r'#[\w]+', '', text) 
    # remove extra spaces
    text = re.sub('  +', ' ', text)
    return text

In [44]:
df['text'] = df['text'].apply(use_regex)
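
To make the effect of these substitutions concrete, here is a small standalone sketch on a made-up tweet. The function name `strip_mentions_and_hashtags` and the example string are hypothetical. Unlike `use_regex`, this variant also strips leading and trailing whitespace, and it writes the mention pattern as `@\w+` because `\w` already matches the underscore.

```python
import re

def strip_mentions_and_hashtags(text):
    # Remove @mentions; \w covers letters, digits, and underscores
    text = re.sub(r'@\w+', '', text)
    # Remove #hashtags
    text = re.sub(r'#\w+', '', text)
    # Collapse runs of whitespace left behind by the removals
    text = re.sub(r'\s{2,}', ' ', text)
    return text.strip()

example = "@some_user loving the new #python release   today"
cleaned_example = strip_mentions_and_hashtags(example)
print(cleaned_example)  # loving the new release today
```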

In [45]:
df.head()

| | target | id | date | query | username | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | http://twitpic.com/2y1zl - Awww, that's a bum... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | I dived many times for the ball. Managed to s... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | no, it's not behaving at all. i'm mad. why am... |


Now we remove stop words, links, email addresses, numbers, and punctuation from the text, and reduce the remaining words to their lemmas.

In [46]:
nlp = spacy.load('en_core_web_sm')

In [56]:
def remove_unwanted(text):
    # Tokenize and lemmatize the text with spaCy
    doc = nlp(text)
    base_words = []
    for token in doc:
        # Skip stop words, punctuation, digits, email addresses, and URLs
        if token.is_stop or token.is_punct or token.is_digit or token.like_email or token.like_url:
            continue
        base_words.append(token.lemma_)
    # Join the remaining lemmas once, after the loop
    return ' '.join(base_words).lower().strip()

In [57]:
remove_unwanted(df['text'][0])

'awww bummer shoulda get david carr day ;d'

In [58]:
# Will take around 7 min.
df['text'] = df['text'].apply(remove_unwanted)
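
Calling `nlp` once per row is the slow part of this step. spaCy's `nlp.pipe` processes texts in batches and is usually faster. The sketch below applies the same token filters through `nlp.pipe`; the function name `clean_texts` and the batch size are assumptions, and the speed-up has not been measured on this dataset.

```python
import spacy

def clean_texts(texts, nlp, batch_size=1000):
    """Apply the same token filters as remove_unwanted, but in batches."""
    cleaned = []
    for doc in nlp.pipe(texts, batch_size=batch_size):
        kept = [
            token.lemma_
            for token in doc
            if not (token.is_stop or token.is_punct or token.is_digit
                    or token.like_email or token.like_url)
        ]
        cleaned.append(' '.join(kept).lower().strip())
    return cleaned

# Possible usage in the notebook (left commented out because it needs the
# en_core_web_sm model; disabling the parser and NER is an optional
# assumption that skips components the filters do not use):
# nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
# df['text'] = clean_texts(df['text'].tolist(), nlp)
```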

In [60]:
df.head()

| | target | id | date | query | username | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | awww bummer shoulda get david carr day ;d |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | upset update facebook texte cry result school ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | dive time ball manage save rest bound |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | body feel itchy like fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | behave mad |


Some tweets were so short that preprocessing removed all of their text. We drop those rows.

In [61]:
len(df[df['text'] == ''])

706

In [62]:
df = df[df['text'] != '']

In [63]:
print(f'Shape of the dataframe: {df.shape}')

Shape of the dataframe: (99294, 6)


## Model Evaluation

Now that the text has been preprocessed, we need to turn it into numeric features before training a model. There are several ways to do this:
- Count Vectorizer
- TfidfVectorizer
- N-grams Vectorization
- Word2vec

Let's try TfidfVectorizer and build a pipeline around it.
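
Two of the options listed above can be combined: `TfidfVectorizer` accepts an `ngram_range` parameter, so n-gram features can be produced with the same class. The sketch below compares the default unigram vocabulary with a unigram-plus-bigram vocabulary on two made-up documents. It only illustrates the option and is not a setting used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two tiny documents standing in for preprocessed tweets (hypothetical data)
docs = ["good day", "bad day"]

# Default: single words only
unigram = TfidfVectorizer()
unigram.fit(docs)

# Single words plus pairs of adjacent words
bigram = TfidfVectorizer(ngram_range=(1, 2))
bigram.fit(docs)

unigram_vocab = sorted(unigram.vocabulary_)
bigram_vocab = sorted(bigram.vocabulary_)
print(unigram_vocab)  # ['bad', 'day', 'good']
print(bigram_vocab)   # ['bad', 'bad day', 'day', 'good', 'good day']
```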

In [64]:
X = df['text']
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101, stratify=df['target'])

In [65]:
X_train.shape, X_test.shape

((79435,), (19859,))

### Model 1: Naive Bayes

In [67]:
clf_nb = Pipeline([
    ('tfid', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])

clf_nb.fit(X_train, y_train)
y_pred_nb = clf_nb.predict(X_test)

print('Classification Report - Model 1: Naive Bayes')
print(classification_report(y_test, y_pred_nb))

Classification Report - Model 1: Naive Bayes
              precision    recall  f1-score   support

           0       0.73      0.75      0.74      9938
           1       0.75      0.73      0.74      9921

    accuracy                           0.74     19859
   macro avg       0.74      0.74      0.74     19859
weighted avg       0.74      0.74      0.74     19859



### Model 2: KNN

In [68]:
clf_knn = Pipeline([
    ('tfid', TfidfVectorizer()),
    ('classifier', KNeighborsClassifier())
])

clf_knn.fit(X_train, y_train)
y_pred_knn = clf_knn.predict(X_test)
print('Classification Report - Model 2: KNN')
print(classification_report(y_test, y_pred_knn))

Classification Report - Model 2: KNN
              precision    recall  f1-score   support

           0       0.71      0.39      0.50      9938
           1       0.58      0.84      0.68      9921

    accuracy                           0.61     19859
   macro avg       0.64      0.61      0.59     19859
weighted avg       0.64      0.61      0.59     19859



### Model 3: Random Forest

In [69]:
clf_rfc = Pipeline([
    ('tfid', TfidfVectorizer()),
    ('classifier', RandomForestClassifier())
])

clf_rfc.fit(X_train, y_train)
y_pred_rfc = clf_rfc.predict(X_test)
print('Classification Report - Model 3: Random Forest')
print(classification_report(y_test, y_pred_rfc))

Classification Report - Model 3: Random Forest
              precision    recall  f1-score   support

           0       0.75      0.73      0.74      9938
           1       0.74      0.76      0.75      9921

    accuracy                           0.74     19859
   macro avg       0.74      0.74      0.74     19859
weighted avg       0.74      0.74      0.74     19859



Next, we compare the three models by their accuracy on the test set.

In [71]:
accuracy_scores = {
    'Model': ['Naive Bayes', 'KNN', 'Random Forest'],
    'Accuracy': [accuracy_score(y_test, y_pred_nb), accuracy_score(y_test, y_pred_knn), accuracy_score(y_test, y_pred_rfc)]
}

evaluation_df = pd.DataFrame(accuracy_scores)
evaluation_df

| | Model | Accuracy |
|---|---|---|
| 0 | Naive Bayes | 0.740168 |
| 1 | KNN | 0.613676 |
| 2 | Random Forest | 0.74475 |


In [76]:
evaluation_df['Accuracy'].max()

0.744750490961277

The Random Forest model achieved the highest accuracy, 0.745, narrowly ahead of Naive Bayes at 0.740; KNN trailed at 0.614. These figures come from a subset of roughly 100,000 of the 1.6 million available tweets, chosen because of computational limitations, so training on more data may improve performance.
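
The gap between Random Forest and Naive Bayes is about half a percentage point on a single train/test split, which may fall within the variation between splits. Cross-validation is one way to check. The sketch below runs `cross_val_score` on a pipeline with the same structure as `clf_nb`; the toy texts, labels, and fold count are assumptions chosen only so the example runs quickly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data standing in for the preprocessed tweets (hypothetical)
texts = [
    "love great day", "happy fun time", "awesome nice weekend", "good friend smile",
    "sad bad day", "hate awful time", "terrible boring weekend", "cry lonely tired",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', MultinomialNB()),
])

# For classifiers, an integer cv uses stratified folds, so each fold keeps
# the class balance; the vectorizer is refit inside every fold
scores = cross_val_score(pipeline, texts, labels, cv=2, scoring='accuracy')
print(f'mean accuracy: {scores.mean():.3f}, std: {scores.std():.3f}')
```

On the real data, `texts` and `labels` would be `X` and `y`, with a larger `cv` such as 5. Comparing the mean and spread across folds for each model gives a firmer basis for choosing between them.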

In [77]:
joblib.dump(clf_rfc, 'clf_rfc_model.joblib')

['clf_rfc_model.joblib']

#### Testing model

In [72]:
email = [
    "Hey, can we get together to watch football game tomorrow?",
    "Upto 20% discount on parking, exclusive offer just for you. Dont miss this!"
]

In [75]:
clf_rfc.predict(email)

array([1, 0], dtype=int64)

In [79]:
email = [
    "Hey, can we get together to watch football game tomorrow?",
    "Upto 20% discount on parking, exclusive offer just for you. Dont miss this!"
]

# load the model
loaded_model = joblib.load('clf_rfc_model.joblib')

# make a prediction
print(loaded_model.predict(email))

[1 0]
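
Note that the saved pipeline contains only the vectorizer and the classifier, so the raw strings passed to `predict` above skip the cleaning steps (`use_regex`, `remove_unwanted`) that were applied to the training text. One possible way to keep training and inference consistent is to put the cleaning inside the pipeline with scikit-learn's `FunctionTransformer`. The sketch below uses a simplified regex-only cleaner, `clean_batch`, as a stand-in for the notebook's full preprocessing, along with toy training data; both are assumptions for illustration.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def clean_batch(texts):
    """Simplified stand-in for the notebook's cleaning, applied to a batch."""
    cleaned = []
    for text in texts:
        # Remove @mentions and #hashtags
        text = re.sub(r'@\w+|#\w+', '', text)
        # Collapse leftover whitespace
        text = re.sub(r'\s{2,}', ' ', text)
        cleaned.append(text.lower().strip())
    return cleaned

# A named function (not a lambda) keeps the pipeline picklable with joblib
clf = Pipeline([
    ('clean', FunctionTransformer(clean_batch)),
    ('tfidf', TfidfVectorizer()),
    ('classifier', MultinomialNB()),
])

# Toy training data (hypothetical)
train_texts = ["@a love this #happy", "great day", "hate this @b", "awful day #sad"]
train_labels = [1, 1, 0, 0]

clf.fit(train_texts, train_labels)
predictions = clf.predict(["@c what a great day"])
print(predictions)
```

With this layout, raw text goes through the same cleaning at training and prediction time. When such a pipeline is reloaded with `joblib.load`, the module that defines `clean_batch` must be importable.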
