# Sentiment Analysis with Logistic Regression

In this notebook we conduct sentiment analysis of airline tweets with a logistic regression model.

## Prepare data 

All of the data are labelled and stored in a CSV file, but the tweet text has not been preprocessed. We import NumPy and pandas for array operations and data processing.

In [3]:
import numpy as np
import pandas as pd
tweets = pd.read_csv("Tweets.csv")
print(tweets.columns.values)

['tweet_id' 'airline_sentiment' 'airline_sentiment_confidence'
 'negativereason' 'negativereason_confidence' 'airline'
 'airline_sentiment_gold' 'name' 'negativereason_gold' 'retweet_count'
 'text' 'tweet_coord' 'tweet_created' 'tweet_location' 'user_timezone']


Sentiments are labelled with strings, so we have to convert them into integers.

In [7]:
def sentiment2target(sentiment):
    return {
        'negative': 0,
        'neutral': 1,
        'positive': 2
    }[sentiment]
y = tweets.airline_sentiment.apply(sentiment2target)
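
The mapping above can also be kept as a plain dictionary, which makes it easy to build the inverse lookup needed to turn integer predictions back into sentiment strings. The following is a minimal sketch; the names `SENTIMENT_TO_TARGET`, `TARGET_TO_SENTIMENT` and `decode_predictions` are hypothetical and do not appear in the original notebook.

```python
# Same mapping as sentiment2target, kept as a plain dictionary.
SENTIMENT_TO_TARGET = {"negative": 0, "neutral": 1, "positive": 2}

# Inverse lookup (hypothetical name): integer target -> sentiment string.
TARGET_TO_SENTIMENT = {v: k for k, v in SENTIMENT_TO_TARGET.items()}


def decode_predictions(predictions):
    """Convert an iterable of integer targets back into sentiment strings."""
    return [TARGET_TO_SENTIMENT[int(p)] for p in predictions]


decoded = decode_predictions([0, 2, 1, 0])
print(decoded)
```

With pandas, the same dictionary could be applied with `tweets.airline_sentiment.map(SENTIMENT_TO_TARGET)`. One difference to keep in mind is that `map` yields `NaN` for a label missing from the dictionary, whereas the function above raises a `KeyError`.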

Vectorize the tweet texts as counts of unigrams and bigrams

In [8]:
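from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
# Count unigrams and bigrams in each tweet
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorized_data = count_vectorizer.fit_transform(tweets.text)
# Prepend the row index as the first column; note that the model
# also receives this column as a feature
row_index = np.arange(vectorized_data.shape[0])[:, None]
X = hstack((row_index, vectorized_data))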
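
To make the effect of `ngram_range=(1,2)` concrete, here is a small pure-Python sketch that counts unigrams and bigrams for a single made-up sentence. It is only an approximation of what `CountVectorizer` does: it lowercases the text and keeps tokens of two or more word characters, which mirrors the vectorizer's default settings, but it handles one text at a time and builds no shared vocabulary. The function name `ngram_counts` and the example sentence are assumptions for illustration.

```python
import re
from collections import Counter


def ngram_counts(text, ngram_range=(1, 2)):
    """Count word n-grams in one text (illustrative sketch, not scikit-learn)."""
    # Lowercase, then keep tokens made of at least two word characters.
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    min_n, max_n = ngram_range
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the token list.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


example = ngram_counts("the flight was late, the flight was cancelled")
for ngram, count in sorted(example.items()):
    print(count, ngram)
```

Each distinct n-gram becomes one column of the feature matrix, so adding bigrams lets the model tell "was late" apart from "was cancelled" at the cost of a much larger vocabulary.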

Train the logistic regression (LR) model

In [34]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the tweets as a test set
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# Candidate values for C, the inverse regularization strength
regular = {'C': [0.001, 1, 1000]}
lr = LogisticRegression()
# Cross-validated grid search, selecting the C with the best accuracy
gs = GridSearchCV(lr, regular, scoring='accuracy')
gs.fit(x_train, y_train)
print([gs.best_params_, gs.best_score_])

[{'C': 1000}, 0.78577527322404372]


In [39]:
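# best_estimator_ has already been refit on the full training set
# (GridSearchCV uses refit=True by default), so this fit is redundant
new_lr = gs.best_estimator_
new_lr.fit(x_train, y_train)
prediction = new_lr.predict(x_test)
# Mean accuracy on the held-out test set
score = new_lr.score(x_test, y_test)
print(score)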

0.803961748634


In [43]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Accuracy
print("Accuracy: {:.2f}%".format(accuracy_score(y_test, prediction) * 100))

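# Micro-averaged F1 score, scaled to a percentage
# (for single-label multiclass data this equals the accuracy)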
print("F1 Score: {:.2f}".format(f1_score(y_test, prediction, average='micro') * 100))

# Confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, prediction))

Accuracy: 80.40%
F1 Score: 80.40
Confusion Matrix:
 [[1734  104   32]
 [ 230  347   37]
 [  97   74  273]]
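
Because micro-averaged F1 coincides with accuracy here, the two numbers above carry the same information. The confusion matrix allows a per-class view instead. The sketch below recomputes accuracy and derives per-class precision, recall and F1 plus the macro-averaged F1 directly from the matrix printed above. It assumes the scikit-learn convention that rows are true labels and columns are predicted labels, in the order negative, neutral, positive; the variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Assumption: rows = true labels, columns = predicted labels,
# ordered negative (0), neutral (1), positive (2).
labels = ["negative", "neutral", "positive"]
cm = [
    [1734, 104, 32],
    [230, 347, 37],
    [97, 74, 273],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total

per_class = {}
for i, label in enumerate(labels):
    true_count = sum(cm[i])                  # tweets whose true label is i
    pred_count = sum(row[i] for row in cm)   # tweets predicted as i
    precision = cm[i][i] / pred_count
    recall = cm[i][i] / true_count
    f1 = 2 * precision * recall / (precision + recall)
    per_class[label] = {"precision": precision, "recall": recall, "f1": f1}

# Macro average: unweighted mean of the per-class F1 scores.
macro_f1 = sum(m["f1"] for m in per_class.values()) / len(per_class)

print("Accuracy: {:.2f}%".format(accuracy * 100))
for label, m in per_class.items():
    print("{:<9} precision={:.3f} recall={:.3f} f1={:.3f}".format(
        label, m["precision"], m["recall"], m["f1"]))
print("Macro F1: {:.3f}".format(macro_f1))
```

Read row by row, the matrix shows that the classes are imbalanced (1870 negative, 614 neutral and 444 positive test tweets) and that only 347 of the 614 neutral tweets are classified correctly, with 230 of them predicted as negative. The macro-averaged F1 weights all three classes equally, so it reflects this weakness more than the overall accuracy does.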
