<a href="https://colab.research.google.com/github/aaditn18/sentiment_analysis/blob/main/sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install the Kaggle library

In [2]:
! pip install kaggle




Kaggle.json path configuration

In [3]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

cp: cannot stat 'kaggle.json': No such file or directory
chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory
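The `cp` and `chmod` errors above mean that `kaggle.json` was not in the working directory when the cell ran. A minimal Python sketch of the same three steps is given here, assuming the token file has already been uploaded to the working directory. `install_kaggle_token` is a hypothetical helper written for this notebook, not part of the Kaggle library; it fails with a clear message when the token is missing.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(source, kaggle_dir=None):
    """Copy a Kaggle API token into place and restrict it to the owner.

    Hypothetical helper: mirrors the mkdir/cp/chmod shell commands, but
    raises a clear error when the token file has not been uploaded yet.
    """
    source = Path(source)
    if not source.is_file():
        raise FileNotFoundError(
            f"{source} not found; upload kaggle.json before running this step"
        )
    kaggle_dir = Path(kaggle_dir) if kaggle_dir else Path.home() / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)   # mkdir -p ~/.kaggle
    target = kaggle_dir / "kaggle.json"
    shutil.copyfile(source, target)                 # cp kaggle.json ~/.kaggle/
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)       # chmod 600
    return target
```

Calling `install_kaggle_token('kaggle.json')` after uploading the file would place it in `~/.kaggle/` with owner-only permissions.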


Importing Twitter data: using the Kaggle API to fetch the Sentiment140 dataset

In [4]:
!kaggle datasets download -d kazanova/sentiment140

Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
Downloading sentiment140.zip to /content
100% 80.9M/80.9M [00:04<00:00, 26.7MB/s]
100% 80.9M/80.9M [00:04<00:00, 17.2MB/s]


Extracting the compressed dataset into the current Google Colab working directory

In [5]:
from zipfile import ZipFile
dataset = '/content/sentiment140.zip'
with ZipFile(dataset, 'r') as raw_file:
  raw_file.extractall()
  print('done')

done


Import Dependencies for project

In [6]:
import numpy as np       # numerical operations on arrays
import pandas as pd      # DataFrames for loading and manipulating the CSV data
import re                # regular expressions
from nltk.corpus import stopwords                            # NLTK is a Python library for natural-language text; stopwords lists words that add little meaning
from nltk.stem.porter import PorterStemmer                   # reduces words to their root (stem) form
from sklearn.feature_extraction.text import TfidfVectorizer  # scikit-learn (sklearn) handles the ML tasks; this class weights how important each word is
from sklearn.model_selection import train_test_split         # splits the dataset into a training set and a test set
from sklearn.linear_model import LogisticRegression          # logistic regression to classify tweets by sentiment
from sklearn.metrics import accuracy_score                   # accuracy rather than F1 score, since the dataset is balanced
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Data Processing

In [7]:
data = pd.read_csv('/content/training.1600000.processed.noemoticon.csv', encoding='ISO-8859-1')   # load the CSV data into a pandas DataFrame
data.head() # the file has no header row, so the first tweet was read as the column names; the next cell fixes this

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [8]:
data = pd.read_csv('/content/training.1600000.processed.noemoticon.csv', names = ['target', 'id', 'date', 'flag', 'user', 'text'], encoding='ISO-8859-1')
data.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [9]:
data.isnull().sum() # count the missing values in each column

target    0
id        0
date      0
flag      0
user      0
text      0
dtype: int64

In [10]:
data['target'].value_counts()  # count the rows per class to check the balance of the dataset and decide which evaluation metric to use
# the classes are balanced, so no upsampling/downsampling is needed

target
0    800000
4    800000
Name: count, dtype: int64

In [11]:
data.replace({'target':{4:1}}, inplace=True) # relabel the positive class from 4 to 1 for simpler values; inplace=True modifies 'data' itself
data['target'].value_counts() # 0 is negative, 1 is positive

target
0    800000
1    800000
Name: count, dtype: int64

In [12]:
port_stem = PorterStemmer()      # one stemmer instance, reused for every tweet
english_stopwords = set(stopwords.words('english'))  # built once; a set gives fast membership tests

def stem_single_review(textual_content):  # clean and stem a single tweet
  stemmed_content = re.sub('[^a-zA-Z]', ' ', textual_content)   # replace every non-letter character with a space
  stemmed_content = stemmed_content.lower()
  stemmed_content = stemmed_content.split() # split the text into a list of words
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]  # stem every word that is not a stopword
  return ' '.join(stemmed_content)  # rejoin the stems into a single string
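To make the cleaning steps concrete, here is a small sketch of the regex, lowercasing and stop-word filtering stages on their own. It deliberately leaves out stemming and uses a tiny hand-picked stop-word set (`DEMO_STOPWORDS`, an assumption for illustration) instead of NLTK's list, so it runs without any downloads. `clean_tweet` is a hypothetical name, not the function used in this notebook.

```python
import re

# Tiny illustrative stop-word set (an assumption); the notebook uses NLTK's
# much longer English list instead.
DEMO_STOPWORDS = {"is", "that", "he", "his", "by", "the", "and", "it", "to"}


def clean_tweet(text, stop_words=DEMO_STOPWORDS):
    """Keep letters only, lowercase, split, and drop stop words (no stemming)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return [token for token in tokens if token not in stop_words]


print(clean_tweet("@Kwesidei not the whole crew"))
# ['kwesidei', 'not', 'whole', 'crew']
```

Note that NLTK's English list also contains negations such as "no" and "not", so the notebook's version drops them; that may be worth revisiting for sentiment tasks.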



In [13]:
data['stemmed_content'] = data['text'].apply(stem_single_review) # clean and stem every tweet into a new column 'stemmed_content'

In [14]:
data.head()


Unnamed: 0,target,id,date,flag,user,text,stemmed_content
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",switchfoot http twitpic com zl awww bummer sho...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,upset updat facebook text might cri result sch...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,kenichan dive mani time ball manag save rest g...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,whole bodi feel itchi like fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",nationwideclass behav mad see


In [15]:
x = data['stemmed_content'].values     #obtaining all the feature data in one variable
y = data['target'].values              #obtaining all the target data in one variable
print(x)
print(y)

['switchfoot http twitpic com zl awww bummer shoulda got david carr third day'
 'upset updat facebook text might cri result school today also blah'
 'kenichan dive mani time ball manag save rest go bound' ...
 'readi mojo makeov ask detail'
 'happi th birthday boo alll time tupac amaru shakur'
 'happi charitytuesday thenspcc sparkschar speakinguph h']
[0 0 0 ... 1 1 1]


In [16]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify = y, random_state=1) # split the data into training and test sets
# test_size=0.2 sends 80% of the data to training and 20% to testing
# stratify = y keeps the proportion of each y value the same in the training and test sets
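The effect of stratification can be illustrated with a few lines of plain Python. This is only a sketch of the idea (split each class separately, in the same ratio) and is not how scikit-learn implements `stratify`. `stratified_split_indices` and the toy labels are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_split_indices(labels, test_size=0.2, seed=1):
    """Return (train_idx, test_idx) with each label split in the same ratio.

    Illustrative only: this is not how scikit-learn implements stratify.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 50 + [1] * 50          # toy balanced labels
train_idx, test_idx = stratified_split_indices(labels)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

With these toy labels, both splits keep the 50/50 balance: 40 samples per class for training and 10 per class for testing.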

Conversion of Textual data into numerical data

In [17]:
vectorizer = TfidfVectorizer() # converts each text into a numeric vector whose entries weight how important each term is to that text
# TF-IDF = term frequency (tf) * inverse document frequency (idf), computed for each term in each tweet
# tf: how often the term occurs in a single tweet (TfidfVectorizer uses the raw count by default)
# idf: how rare the term is across tweets; with default settings, ln((1 + number of tweets) / (1 + number of tweets containing the term)) + 1
# each tweet's vector is then scaled to unit length (L2 norm) by default
x_train = vectorizer.fit_transform(x_train) # learns the vocabulary and idf weights from the training data (fit), then transforms the training data into feature vectors
x_test = vectorizer.transform(x_test)       # transforms the test data with the vocabulary learned from the training data, without refitting
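The weighting can be worked through by hand on a few toy documents. The sketch below assumes the smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row, which are scikit-learn's documented defaults. It tokenises on whitespace only, so it is an illustration rather than a replacement for `TfidfVectorizer`. `tfidf_rows` and the three example strings are assumptions.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """TF-IDF with smoothed idf and L2-normalised rows, written out by hand."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)                      # raw term counts (tf)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


docs = ["happi birthday boo", "upset updat facebook", "happi time"]
for row in tfidf_rows(docs):
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the first row, "happi" receives a lower weight than "birthday" because it appears in two of the three documents and is therefore less distinctive.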

In [18]:
print(x_train)

  (0, 408994)	0.20327132937613104
  (0, 106816)	0.37134449965091226
  (0, 347931)	0.4123017384299885
  (0, 317916)	0.30350077343669557
  (0, 227941)	0.2912488589003264
  (0, 266342)	0.3358584555454906
  (0, 239610)	0.20930055750051507
  (0, 364818)	0.37627918401650134
  (0, 341505)	0.41911697481409166
  (1, 324566)	0.5346829742944466
  (1, 240357)	0.32757289154579433
  (1, 411415)	0.28912454600209647
  (1, 181329)	0.5247713457687215
  (1, 286004)	0.49782742921934126
  (2, 306993)	0.41244791725873703
  (2, 129982)	0.27792888797524024
  (2, 118467)	0.8675495656028976
  (3, 15081)	0.44595664917320677
  (3, 244939)	0.24365166869261953
  (3, 136560)	0.2030585102219242
  (3, 165909)	0.30014312059830567
  (3, 15061)	0.15048519407525865
  (3, 125919)	0.2975473828320111
  (3, 439081)	0.1893315187082751
  (3, 375838)	0.24345453849281473
  :	:
  (1279995, 162976)	0.5039996612991468
  (1279995, 44681)	0.3863978701646447
  (1279995, 411671)	0.372838323110258
  (1279995, 15061)	0.2448880314664291
  

Training ML Model using logistic regression (Classification Problem)

In [19]:
model = LogisticRegression(max_iter=10000)
model.fit(x_train, y_train) # train the model on the training data and its corresponding targets

In [20]:
x_train_prediction = model.predict(x_train)
training_data_accuracy = accuracy_score(y_train, x_train_prediction) # first argument: true labels; second argument: predicted labels
print('Accuracy score on training data : ', training_data_accuracy)  # used only to check whether the model has overfitted

Accuracy score on training data :  0.81001796875


In [21]:
x_test_prediction = model.predict(x_test)
test_data_accuracy = accuracy_score(y_test, x_test_prediction)
print('Accuracy score on test data : ', test_data_accuracy) # training accuracy much higher than test accuracy would indicate overfitting
# possible next steps: hyper-parameter tuning, or trying decision trees, random forests or gradient boosting

Accuracy score on test data :  0.778884375
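For reference, the accuracy metric is simply the share of predictions that match the true labels. The sketch below writes that out by hand on toy labels (not taken from the dataset) and then computes the gap between the two scores printed above. The `accuracy` function is a hypothetical stand-in for `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels, not taken from the dataset: 3 of 5 predictions are correct.
print(accuracy([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))  # 0.6

# Gap between the training and test accuracies printed in the outputs above.
train_acc, test_acc = 0.81001796875, 0.778884375
print(round(train_acc - test_acc, 4))  # 0.0311
```

A gap of roughly three percentage points suggests at most mild overfitting, although what counts as acceptable depends on the application.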


Saving the model for future use

In [23]:
import pickle
filename = 'sentiment_analysis_trialmodel.sav'
with open(filename, 'wb') as model_file:   # save the trained model; the file is closed automatically
  pickle.dump(model, model_file)
with open(filename, 'rb') as model_file:   # load it back; replace filename with the file's path if it lives elsewhere
  loaded_model = pickle.load(model_file)
# afterwards, call loaded_model.predict(<TF-IDF feature vectors>)
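One caveat: the saved model expects TF-IDF features, so predicting on new text in a later session also needs the fitted vectorizer. A possible approach, sketched here under that assumption, is to pickle both objects in one file. `save_bundle`, `load_bundle` and the stand-in dictionaries are hypothetical; in the notebook the real `vectorizer` and `model` would be passed instead.

```python
import os
import pickle
import tempfile


def save_bundle(path, vectorizer, model):
    """Pickle the fitted vectorizer and model together in one file."""
    with open(path, "wb") as handle:
        pickle.dump({"vectorizer": vectorizer, "model": model}, handle)


def load_bundle(path):
    """Return (vectorizer, model) from a file written by save_bundle."""
    with open(path, "rb") as handle:
        bundle = pickle.load(handle)
    return bundle["vectorizer"], bundle["model"]


# Stand-in objects so the sketch runs without the trained notebook state;
# in the notebook these would be the fitted `vectorizer` and `model`.
demo_vectorizer = {"vocabulary": {"happi": 0, "upset": 1}}
demo_model = {"coef": [1.5, -1.5]}

demo_path = os.path.join(tempfile.mkdtemp(), "sentiment_bundle_demo.sav")
save_bundle(demo_path, demo_vectorizer, demo_model)
restored_vectorizer, restored_model = load_bundle(demo_path)
print(restored_vectorizer == demo_vectorizer, restored_model == demo_model)
```

New text would then go through the same cleaning function, the restored vectorizer's `transform`, and finally the restored model's `predict`.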

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],  # Inverse of regularization strength (smaller C = stronger regularization)
    'solver': ['liblinear', 'saga'],  # Solver algorithms
    'max_iter': [100, 500, 1000]  # Maximum number of iterations
}

# Initialize the model
lr_model = LogisticRegression()

# Initialize Grid Search
grid_search = GridSearchCV(estimator=lr_model, param_grid=param_grid, cv=5, scoring='accuracy', verbose=2, n_jobs=-1)

# Fit Grid Search
grid_search.fit(x_train, y_train)

# Best parameters and score
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

# Use the best model found
best_model = grid_search.best_estimator_

Fitting 5 folds for each of 30 candidates, totalling 150 fits
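The figures in that log line follow directly from the grid: every combination of one value per parameter is a candidate, and each candidate is fitted once per cross-validation fold. The short sketch below reproduces the count with `itertools.product`. It only enumerates combinations and does not train anything.

```python
from itertools import product

param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "solver": ["liblinear", "saga"],
    "max_iter": [100, 500, 1000],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
print(len(candidates))             # 5 * 2 * 3 = 30
print(len(candidates) * cv_folds)  # 30 candidates * 5 folds = 150
```

Since each fit runs on a share of the 1.28 million training rows, trimming the grid or searching on a subsample may be worth considering if the search takes too long.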
