# Sentiment analysis with Naive Bayes

The aim of this notebook is to predict the sentiment of tweets using a naive Bayes classifier. The tweets are given as text, and each one is classified as either *positive* or *negative*.

## Dataset

The dataset we'll be using is *sentiment140*, a widely used collection of 1.6 million tweets extracted using the Twitter API. Half of the tweets are annotated as 'positive' and half as 'negative'. The annotation method detects tweets that contain certain emoticons, uses the corresponding emotion to categorize the tweet, and then removes the emoticon from the text.
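The emoticon-based labeling described above can be sketched roughly as follows. This is only an illustration, not the code used to build sentiment140: the emoticon lists, the function name `label_by_emoticon`, and the choice to skip tweets with mixed signals are assumptions made for this example.

```python
# Hypothetical emoticon lists for illustration; the lists used for the
# real dataset may differ.
POSITIVE_EMOTICONS = (":)", ":-)", ":D", "=)")
NEGATIVE_EMOTICONS = (":(", ":-(")


def label_by_emoticon(tweet):
    """Return (label, cleaned_text), or None if the tweet gives no clear signal."""
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    # Skip tweets with no emoticon at all or with both kinds (assumption).
    if has_pos == has_neg:
        return None
    label = "positive" if has_pos else "negative"
    # Strip the emoticons so a classifier cannot simply memorize them.
    for emoticon in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
        tweet = tweet.replace(emoticon, "")
    # Collapse the whitespace left behind by the removed emoticons.
    return label, " ".join(tweet.split())


print(label_by_emoticon("I love this :)"))
print(label_by_emoticon("So tired :("))
print(label_by_emoticon("no emoticon here"))
```

Because the labels come from a heuristic rather than from human annotators, they should be treated as noisy.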

The detailed approach can be found in the original paper: http://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf

Citation: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.

## Importing Packages

In [22]:
# utility
import pandas as pd
import numpy as np

# visualization
import matplotlib.pyplot as plt
import seaborn as sns

# nlp
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

# machine learning
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Burak.Oezkan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Burak.Oezkan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Burak.Oezkan\AppData\Roaming\nltk_data...


## Importing and Preprocessing Data

Because the naive Bayes classifier treats each word independently and does not take the surrounding context into account, it makes sense to remove as much noise from the text as possible in this step.
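To illustrate what ignoring context means, here is a minimal sketch using a plain word-count representation. The helper `bag_of_words` is a hypothetical name introduced for this example and is not part of the notebook's pipeline, which uses `TfidfVectorizer`.

```python
from collections import Counter


def bag_of_words(text):
    """Represent a text by its word counts only, discarding word order."""
    return Counter(text.lower().split())


# Two texts with the same words in a different order are indistinguishable.
a = bag_of_words("good not bad")
b = bag_of_words("bad not good")
print(a == b)

# Usernames and URLs become features of their own; they are almost always
# unique, so they enlarge the vocabulary without carrying sentiment.
noisy = bag_of_words("@alice http://example.com good")
print(sorted(noisy))
```

Since every token is treated as an independent feature, tokens such as usernames and links can only add noise, which is why they are removed below.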

In [59]:
# load dataset
df = pd.read_csv('datasets/sentiment140.csv', names=['sentiment', 'id', 'date', 'query', 'user', 'text'])
df = df.drop(['id', 'date', 'query', 'user'], axis=1)

# remove usernames
def remove_username(text):
    return ' '.join(word for word in text.split() if not word.startswith('@'))

# remove urls
def remove_url(text):
    # 'http' also covers 'https'; str.startswith accepts a tuple of prefixes
    return ' '.join(word for word in text.split() if not word.startswith(('http', 'www')))

# remove non-alphabetic characters
def remove_nonalphabet(text):
    allowed = set('abcdefghijklmnopqrstuvwxyz ')
    return ''.join(char for char in text if char in allowed)

# remove stopwords
STOP_WORDS = set(stopwords.words('english'))  # build the set once for fast lookups

def remove_stopwords(text):
    return ' '.join(word for word in text.split() if word not in STOP_WORDS)

# convert lowercase
def convert_lowercase(text):
    return text.lower()

# lemmatize
lemmatizer = WordNetLemmatizer()  # create the lemmatizer once and reuse it

def lemmatize(text):
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())


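def preprocess(text):
    text = convert_lowercase(text)
    text = remove_username(text)
    text = remove_url(text)
    # remove stopwords before lemmatizing: the default (noun) lemmatizer turns
    # stopwords such as 'was' into 'wa', which the stopword list no longer matches
    text = remove_stopwords(text)
    text = lemmatize(text)
    text = remove_nonalphabet(text)
    return text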
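As an alternative to the word-by-word helpers above, the username, URL and non-letter steps could also be written with regular expressions. The following is a sketch only: `clean_tweet` and the pattern names are hypothetical, and stopword removal and lemmatization are left out so that the example runs without NLTK data.

```python
import re

# Tokens that start with '@', 'http' or 'www', mirroring the prefix checks
# of the helper functions above.
NOISE_TOKEN_RE = re.compile(r"(?<!\S)(?:@|http|www)\S*")
# Anything that is neither a lowercase letter nor whitespace.
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Lowercase a tweet and strip usernames, URLs and non-letter characters."""
    text = text.lower()
    text = NOISE_TOKEN_RE.sub(" ", text)
    text = NON_LETTER_RE.sub("", text)
    # Collapse the runs of whitespace left behind by the removals.
    return " ".join(text.split())


print(clean_tweet("@alice Check THIS out: https://t.co/abc123 it's great!!! www.example.com"))
# prints: check this out its great
```

Compiling the patterns once avoids repeating that work for each of the 1.6 million tweets.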

In [None]:
# apply preprocessing and save new dataset
df_preprocessed = df.copy()
df_preprocessed['text'] = df_preprocessed['text'].apply(preprocess)
df_preprocessed.to_csv('datasets/sentiment140_preprocessed.csv')