<a href="https://colab.research.google.com/github/adithisirpa/sentiment-analysis/blob/main/sentiment_analysis_random_forest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis in Python using scikit-learn, Random Forest, and a TF-IDF Vectorizer

# Imports

In [None]:
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Extracting the dataset

In [None]:
dataset_url = "Tweets.csv"

airline_tweets = pd.read_csv(dataset_url)
airline_tweets.head()

# Changing the plot size from default

In [None]:
plot_size = plt.rcParams["figure.figsize"]
print(plot_size[0])
print(plot_size[1])

plot_size[0] = 8
plot_size[1] = 6
plt.rcParams["figure.figsize"] = plot_size

6.0
4.0


# Data Analysis

### Pie chart of the share of tweets per airline

In [None]:
airline_tweets.airline.value_counts().plot(kind = 'pie', autopct = '%1.0f%%')

### Distribution of sentiments across all tweets

In [None]:
airline_tweets.airline_sentiment.value_counts().plot(kind = 'pie', autopct = '%1.0f%%', colors = ['red','yellow','green'])

### Distribution of sentiment for all airlines

In [None]:
airline_sentiment = airline_tweets.groupby(['airline','airline_sentiment']).airline_sentiment.count().unstack()
airline_sentiment.plot(kind = 'bar')

### Using Seaborn to view the average confidence level per sentiment

In [None]:
import seaborn as sns
sns.barplot(x='airline_sentiment',y='airline_sentiment_confidence', data = airline_tweets)

# Dividing the dataset into features and label sets
The feature set contains the tweet text, and the label set contains the sentiment of each tweet, which is what we have to predict. To select them we can use the iloc method of the pandas DataFrame.

In [None]:
features = airline_tweets.iloc[:,10].values
labels = airline_tweets.iloc[:,1].values
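
Positional indexing depends on the column order of the CSV. As a hedged alternative, the same arrays can be selected by column name. The sketch below assumes the tweet text lives in a column called `text` (an assumption, since the notebook only selects it by position) and uses a tiny made-up frame, `toy_tweets`, in place of the real dataset.

```python
import pandas as pd

# Toy stand-in for airline_tweets; the rows are made up for illustration.
toy_tweets = pd.DataFrame({
    "airline_sentiment": ["negative", "positive"],
    "airline": ["United", "Delta"],
    "text": ["@united my flight was delayed", "@delta great crew today"],
})

# Select by name instead of position, so a reordered CSV cannot
# silently swap in the wrong column.
toy_features = toy_tweets["text"].values
toy_labels = toy_tweets["airline_sentiment"].values

# Positional selection gives the same arrays only while the order holds.
assert (toy_features == toy_tweets.iloc[:, 2].values).all()
assert (toy_labels == toy_tweets.iloc[:, 0].values).all()
```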

# Preprocessing the Data - Data Cleaning

In [None]:
# for data cleaning we use regular expressions
import re

In [None]:
processed_features = []
for tweet in features:
  # replace special (non-word) characters with spaces
  processed_feature = re.sub(r'\W', ' ', str(tweet))

  # remove single characters surrounded by whitespace
  processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

  # remove a single character at the start (an unescaped '^' anchors the match;
  # the escaped '\^' would only match a literal caret)
  processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)

  # substitute multiple spaces with a single space
  processed_feature = re.sub(r'\s+', ' ', processed_feature)

  # remove a prefixed 'b'
  processed_feature = re.sub(r'^b\s+', ' ', processed_feature)

  # convert to lower case
  processed_feature = processed_feature.lower()

  processed_features.append(processed_feature)

processed_features
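
To see what the cleaning steps do to a single tweet, here is a minimal sketch that wraps the main substitutions in a function. The helper name `clean_tweet` and the sample tweets are hypothetical. The sketch assumes the start-of-string pattern is meant as an anchor (`^`), and it adds a final `strip()` that the loop above does not perform.

```python
import re

def clean_tweet(text):
    """Apply the cleaning steps to one tweet (hypothetical helper)."""
    text = re.sub(r'\W', ' ', str(text))         # special characters -> spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # isolated single letters
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)    # single letter at the start
    text = re.sub(r'\s+', ' ', text)             # collapse whitespace runs
    return text.lower().strip()                  # strip() is an addition here

print(clean_tweet("@airline I didn't expect this!! :("))
# airline didn expect this
```

Note that the apostrophe in "didn't" is turned into a space, so the leftover "t" is removed as a single character and only "didn" survives.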

# Converting Text to Numerical Form
Statistical algorithms use mathematics to train models, and mathematics works on numbers, so the text has to be converted to numbers first.

Common approaches are bag of words and TF-IDF.

scikit-learn provides the TfidfVectorizer class, which converts text features into TF-IDF feature vectors.
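
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is what scikit-learn documents as its default behaviour. The helper name `tfidf_rows` and the toy corpus are made up, and tokens are split on whitespace rather than with the vectorizer's own tokenizer.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

toy_docs = ["flight delayed again", "great flight", "great crew"]
vocab, rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, rows):
    weights = {t: round(w, 2) for t, w in zip(vocab, row) if w > 0}
    print(doc, "->", weights)
```

In the first toy document, "again" and "delayed" receive a higher weight than "flight", because "flight" also appears in a second document and is therefore less distinctive.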

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Creating the Vectorizer
Here max_features = 2500 means the vectorizer keeps only the 2500 most frequently occurring words when building the feature vectors.

Words that occur very rarely are usually not useful for classification.

max_df = 0.8 means that only words occurring in at most 80% of the documents are used; words that appear in nearly every document carry little information.

min_df = 7 means that only words occurring in at least 7 documents are included.
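
The effect of the two document-frequency thresholds can be sketched without the vectorizer. This is an illustration only: the helper `filter_vocabulary` and the toy corpus are hypothetical, and the thresholds are scaled down to suit five documents. It treats an integer `min_df` as an absolute document count and a float `max_df` as a proportion, matching how the two values are passed in this notebook.

```python
from collections import Counter

def filter_vocabulary(docs, min_df, max_df):
    """Keep terms whose document frequency is within [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    # Count each term once per document it appears in.
    df = Counter(tok for doc in docs for tok in set(doc.split()))
    return sorted(t for t, c in df.items() if min_df <= c <= max_df * n_docs)

toy_docs = [
    "the flight was late",
    "the crew was great",
    "the flight was cancelled",
    "the bag was lost",
    "the flight crew helped",
]
# Keep terms found in at least 2 documents and in at most 80% of them.
kept = filter_vocabulary(toy_docs, min_df=2, max_df=0.8)
print(kept)  # ['crew', 'flight', 'was']
```

"the" is dropped because it appears in all five documents, and words such as "late" or "lost" are dropped because they appear in only one.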

In [None]:
vectorizer = TfidfVectorizer(max_features=2500, min_df = 7, max_df = 0.8, stop_words=stopwords.words('english'))

processed_features = vectorizer.fit_transform(processed_features).toarray()

# Dividing the data into training and test sets

Here test_size = 0.2 means that 20% of the data is used for testing and the remaining 80% for training.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train,X_test, y_train, y_test = train_test_split(processed_features, labels, test_size = 0.2,random_state=0)
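
Conceptually, a hold-out split shuffles the rows with a fixed seed and sets a share of them aside. The pure-Python sketch below illustrates that idea only; it is not scikit-learn's implementation, and `holdout_split` and the toy data are hypothetical names. It returns its four lists in the same order as train_test_split.

```python
import random

def holdout_split(items, labels, test_size, seed):
    """Shuffle indices with a fixed seed and hold out a share for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

toy_X = [f"tweet {i}" for i in range(10)]
toy_y = [i % 3 for i in range(10)]
toy_X_train, toy_X_test, toy_y_train, toy_y_test = holdout_split(
    toy_X, toy_y, test_size=0.2, seed=0)
print(len(toy_X_train), len(toy_X_test))  # 8 2
```

Fixing the seed plays the same role as random_state above: the same rows land in the test set on every run, so results stay comparable.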

# Training the model using Random Forest

We use a random forest because of its ability to work with non-normalized data.

The sklearn.ensemble module provides the RandomForestClassifier class. To train the model, we call the fit method on a RandomForestClassifier instance and pass the training features and labels as arguments.

In [None]:
from sklearn.ensemble import RandomForestClassifier

text_classifier = RandomForestClassifier(n_estimators = 200, random_state = 0)
text_classifier.fit(X_train,y_train)

RandomForestClassifier(n_estimators=200, random_state=0)

### Making predictions with the model

For this we use the predict method of the trained RandomForestClassifier.

In [None]:
predictions = text_classifier.predict(X_test)

### Evaluating the performance of the machine learning model

We can use classification metrics such as the confusion matrix, F1 measure, and accuracy.

To compute these metrics we can use classification_report, confusion_matrix, and accuracy_score from sklearn.metrics.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [None]:
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
print(accuracy_score(y_test,predictions))
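
To show what these numbers mean, here is a pure-Python sketch that derives accuracy, precision, recall, and F1 from pairs of true and predicted labels. The helper names and the toy labels are made up for illustration and are not results from this notebook.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))

def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall, and F1 for one label."""
    pairs = confusion_counts(y_true, y_pred)
    tp = pairs[(label, label)]
    predicted = sum(c for (t, p), c in pairs.items() if p == label)
    actual = sum(c for (t, p), c in pairs.items() if t == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

toy_true = ["negative", "negative", "negative", "neutral", "positive", "positive"]
toy_pred = ["negative", "negative", "neutral", "neutral", "positive", "negative"]

# Accuracy is the share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print("accuracy:", round(accuracy, 3))
for label in sorted(set(toy_true)):
    p, r, f = precision_recall_f1(toy_true, toy_pred, label)
    print(label, round(p, 3), round(r, 3), round(f, 3))
```

In this toy example the "positive" class has perfect precision but a recall of only 0.5, because one positive tweet was predicted as negative. Accuracy alone would hide that kind of imbalance.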