# Carbon Tax Sentiment Analysis - Initial Results and Code

Building the sentiment analyzer takes four steps:

1. Import libraries and Dataset
2. Initial Data Cleaning
3. Split Data into Training and Test Sets
4. Prediction and Model Evaluation



## Step 1: Import libraries and Dataset

In [1]:
# import required libraries

import numpy as np
import pandas as pd
import re
import nltk
import matplotlib.pyplot as plt
# render plots inline (Jupyter magic; only valid inside a notebook)
%matplotlib inline

# import dataset

data_source_url = r"carbon_tax_tweets.csv"
carbon_tweets = pd.read_csv(data_source_url)

# ensure nltk stopword database is present
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aidan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True
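
Before cleaning, it can help to confirm that the loaded frame has the two columns the later steps rely on (`Tweet` and `Polarity`) and to look at how the labels are distributed. The following is a minimal sketch only: the helper name `summarize_labels` is hypothetical, and the rows are made up to stand in for `carbon_tax_tweets.csv`.

```python
import pandas as pd

def summarize_labels(df, text_col="Tweet", label_col="Polarity"):
    """Check that the expected columns exist, then return counts per label."""
    missing = [c for c in (text_col, label_col) if c not in df.columns]
    if missing:
        raise KeyError(f"missing expected column(s): {missing}")
    # sort_index() orders the counts by label value (-1, 0, 1)
    return df[label_col].value_counts().sort_index()

# Made-up rows for illustration; these are NOT taken from the real dataset
toy = pd.DataFrame({
    "Tweet": ["scrap the carbon tax", "carbon tax vote today",
              "carbon tax works", "no more taxes"],
    "Polarity": [-1, 0, 1, -1],
})

label_counts = summarize_labels(toy)
print(label_counts)
```

On the real data the same call would be `summarize_labels(carbon_tweets)`. A heavily skewed count is worth knowing about before splitting, since a rare class may end up with very few test examples.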

## Step 2: Initial Data Cleaning

In [2]:
# DATA CLEANING 

# remove all special characters
carbon_tweets['Tweet'] =  [re.sub(r'\W', ' ', str(x)) for x in carbon_tweets['Tweet']]

# remove all single characters (an isolated letter surrounded by whitespace)
carbon_tweets['Tweet'] =  [re.sub(r'\s+[a-zA-Z]\s+', ' ', str(x)) for x in carbon_tweets['Tweet']]

# remove single characters from the start
carbon_tweets['Tweet'] =  [re.sub(r'^[a-zA-Z]\s+', ' ', str(x)) for x in carbon_tweets['Tweet']]

# substituting multiple spaces with single space
carbon_tweets['Tweet'] =  [re.sub(r'\s+', ' ', str(x)) for x in carbon_tweets['Tweet']]

# removing prefixed 'b'
carbon_tweets['Tweet'] =  [re.sub(r'^b\s+', ' ', str(x)) for x in carbon_tweets['Tweet']]

# converting to lowercase
carbon_tweets['Tweet'] =  [x.lower() for x in carbon_tweets['Tweet']]
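
The cleaning passes above can also be collected into a single function, which makes them easy to try on one string at a time. This is a sketch under assumptions: `clean_tweet` is a hypothetical name, the single-character patterns are assumed to be `\s+[a-zA-Z]\s+` and `^[a-zA-Z]\s+`, and the final `strip()` is an addition that the cell above does not perform.

```python
import re

def clean_tweet(text):
    """Apply the cleaning passes in the same order as the cell above."""
    text = re.sub(r'\W', ' ', str(text))         # special characters -> space
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # isolated single letters
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)    # single letter at the start
    text = re.sub(r'\s+', ' ', text)             # collapse runs of whitespace
    text = re.sub(r'^b\s+', ' ', text)           # leftover bytes-literal prefix
    # strip() is an assumption added here; the original cell only lowercases
    return text.lower().strip()

example = clean_tweet("Carbon tax is a  BAD idea!! #cdnpoli")
print(example)  # carbon tax is bad idea cdnpoli
```

With such a function, the column could be cleaned in one pass, for example `carbon_tweets['Tweet'] = carbon_tweets['Tweet'].apply(clean_tweet)`.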

## Step 3: Training/Test Split

In [3]:
# divide data into training and test sets

# the tweet text is the input feature; the polarity label is the target
features = carbon_tweets['Tweet']
labels = carbon_tweets['Polarity']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=0)

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
training_data_transformed = vectorizer.fit_transform(X_train)
testing_data_transformed = vectorizer.transform(X_test)

from sklearn.ensemble import RandomForestClassifier

text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(training_data_transformed, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)
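
To make the vectorization step more concrete, the sketch below runs `CountVectorizer` on a tiny made-up corpus. The documents are invented for illustration, and stop words are left at the default (none) so the output stays easy to read. It shows that the vocabulary is learned from the training text only, so a word that appears only in the test text is dropped by `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration only
train_docs = ["carbon tax bad", "carbon tax good good"]
test_docs = ["carbon tax terrible"]

vec = CountVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary, then counts
test_matrix = vec.transform(test_docs)        # counts using the learned vocabulary

vocab = sorted(vec.vocabulary_)  # column order of the matrices is alphabetical
print(vocab)                     # ['bad', 'carbon', 'good', 'tax']
print(train_matrix.toarray())    # [[1 1 0 1], [0 1 2 1]]
print(test_matrix.toarray())     # [[0 1 0 1]]  ('terrible' was never seen)
```

This is why the notebook calls `fit_transform` on `X_train` but only `transform` on `X_test`: fitting on the test text would leak information about it into the features.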

## Step 4: Predictions/Model Evaluation

In [4]:
predictions = text_classifier.predict(testing_data_transformed)

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print("CONFUSION MATRIX:")
print(confusion_matrix(y_test, predictions))
print("CLASSIFICATION REPORT:")
print(classification_report(y_test, predictions))
print("ACCURACY SCORE:")
print(accuracy_score(y_test, predictions))


CONFUSION MATRIX:
[[10  1  0]
 [ 9  1  0]
 [ 1  0  0]]
CLASSIFICATION REPORT:
              precision    recall  f1-score   support

          -1       0.50      0.91      0.65        11
           0       0.50      0.10      0.17        10
           1       0.00      0.00      0.00         1

    accuracy                           0.50        22
   macro avg       0.33      0.34      0.27        22
weighted avg       0.48      0.50      0.40        22

ACCURACY SCORE:
0.5
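
The figures in the classification report can be re-derived by hand from the confusion matrix, which helps when interpreting them. The sketch below copies the matrix printed above (rows are true classes, columns are predicted classes, in the order -1, 0, 1) and recomputes precision, recall and accuracy with NumPy. It also computes a majority-class baseline, which is an addition for comparison and not part of the original notebook.

```python
import numpy as np

# Confusion matrix copied from the output above
labels = [-1, 0, 1]
cm = np.array([[10, 1, 0],
               [ 9, 1, 0],
               [ 1, 0, 0]])

true_pos = np.diag(cm)              # correctly classified count per class
predicted_totals = cm.sum(axis=0)   # column sums: how often each class was predicted
actual_totals = cm.sum(axis=1)      # row sums: support of each class

# Precision is undefined for a class that is never predicted; report 0.0 there
precision = np.divide(true_pos, predicted_totals,
                      out=np.zeros(len(labels)), where=predicted_totals > 0)
recall = true_pos / actual_totals
accuracy = true_pos.sum() / cm.sum()

# Baseline (added for comparison): always predict the most frequent true class
majority_baseline = actual_totals.max() / cm.sum()

for label, p, r in zip(labels, precision, recall):
    print(f"{label:>2}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.2f} majority_baseline={majority_baseline:.2f}")
```

On this test split the accuracy (0.50) equals the majority-class baseline (11 of 22), and class 1 is never predicted, so its precision is reported as 0.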




## Conclusions and Next Steps