# Sentiment Analysis Using Logistic Regression

Kaggle dataset (Sentiment140): https://www.kaggle.com/datasets/kazanova/sentiment140/data

# Sentiment Analysis: Overview and Techniques

Sentiment analysis, also known as opinion mining, is the process of computationally identifying and categorizing opinions expressed in text data to determine the sentiment conveyed by the text. It's a valuable technique used in various applications such as social media monitoring, customer feedback analysis, and market research.

## Overview:
- **Objective**: Analyze and classify the sentiment expressed in text data as positive, negative, or neutral.
- **Techniques**: Sentiment analysis can be performed using various techniques, including rule-based methods, machine learning algorithms, and deep learning models.
- **Applications**: Used in social media monitoring, customer feedback analysis, brand reputation management, and market research to understand public opinion and sentiment trends.

## Techniques:
1. **Rule-Based Methods**: These methods rely on predefined rules or lexicons to determine the sentiment of text data. Lexicon-based approaches assign sentiment scores to words or phrases and aggregate them to determine the overall sentiment of a piece of text.

2. **Machine Learning**: Machine learning algorithms such as Naive Bayes, Support Vector Machines (SVM), and Random Forests can be trained on labeled datasets to classify text data into positive, negative, or neutral sentiments. Features can include word frequencies, n-grams, or word embeddings.

3. **Deep Learning**: Deep learning models, particularly Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Convolutional Neural Networks (CNNs), can learn complex patterns in text data and automatically extract features for sentiment classification.
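The lexicon-based approach from item 1 can be sketched in a few lines. This is only an illustration: `TOY_LEXICON` and `lexicon_sentiment` are hypothetical names, and the word scores are made up rather than taken from a real lexicon.

```python
import re

# Hypothetical toy lexicon (word -> score); real lexicons are far larger.
TOY_LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "awful": -2.0, "hate": -2.0,
}


def lexicon_sentiment(text, lexicon=TOY_LEXICON):
    """Sum the scores of known words and map the total to a label."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(lexicon.get(word, 0.0) for word in words)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return label, score


for sentence in ["I love this phone, the camera is great",
                 "Awful battery and I hate the screen",
                 "It arrived on Tuesday"]:
    print(sentence, "->", lexicon_sentiment(sentence))
```

A scorer this simple ignores negation, so "not good" would still count as positive. That limitation is one reason learned models are often preferred.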

## Evaluation Metrics:
- **Accuracy**: The percentage of correctly classified instances.
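- **Precision and Recall**: Precision is the fraction of predicted positives that are actually positive; recall is the fraction of actual positives that the classifier finds. Both are more informative than accuracy on imbalanced datasets.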
- **F1 Score**: The harmonic mean of precision and recall, providing a balanced measure of the classifier's performance.
- **Confusion Matrix**: A table used to evaluate the performance of a classification model, displaying the counts of true positive, false positive, true negative, and false negative predictions.
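As a rough sketch of how these metrics relate, the snippet below derives all of them from the four confusion-matrix counts. The function names and the toy labels are illustrative assumptions; in practice `sklearn.metrics` provides equivalent functions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))
print(classification_metrics(y_true, y_pred))
```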

## Implementation:
- Sentiment analysis can be implemented using various libraries and frameworks, including NLTK (Natural Language Toolkit), scikit-learn, TensorFlow, and PyTorch.
- Off-the-shelf tools such as VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon and rule-based analyzer, can be used for sentiment analysis without training a model.

## Challenges:
- **Context Understanding**: Understanding the context, sarcasm, and nuances of language can be challenging for sentiment analysis models.
- **Domain Specificity**: Sentiment analysis models may perform differently across different domains and industries due to domain-specific language and vocabulary.

# Logistic Regression: Overview and Formulas

Logistic regression is a popular statistical method used for binary classification tasks, where the goal is to predict the probability of an observation belonging to one of two classes. It's widely used in various fields such as healthcare, finance, marketing, and more.

## Overview:
- **Objective**: Predict the probability of an observation belonging to a certain class.
- **Model Type**: Supervised learning algorithm for binary classification.
- **Output**: Probability value between 0 and 1, which can be interpreted as the likelihood of belonging to the positive class.

## Key Components:
1. **Linear Combination**: Logistic regression models the relationship between the independent variables (features) and the log-odds of the dependent variable (class label) using a linear combination.

2. **Sigmoid Function (Logistic Function)**: The linear combination $z$ is passed through the sigmoid function $\sigma(z) = 1 / (1 + e^{-z})$, which maps any real number to the open interval (0, 1) so the output can be read as a probability.

3. **Decision Boundary**: The predicted probability is compared against a threshold (usually 0.5). Observations at or above the threshold are assigned to the positive class and those below it to the negative class; the inputs whose probability equals the threshold form the decision boundary.
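The three components above can be sketched as follows. The weights and bias are hypothetical values chosen for illustration, not coefficients learned from the tweet data.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_proba(features, weights, bias):
    """Linear combination of the features followed by the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_label(features, weights, bias, threshold=0.5):
    """Apply the decision threshold to the predicted probability."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical coefficients for a two-feature model.
weights = [1.5, -2.0]
bias = 0.25
print(predict_proba([1.0, 0.5], weights, bias))  # z = 0.75
print(predict_label([1.0, 0.5], weights, bias))
print(predict_label([0.0, 1.0], weights, bias))  # z = -1.75
```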

## Training Logistic Regression:
- **Objective Function**: The coefficients are found by maximizing the log-likelihood of the training labels or, equivalently, minimizing the cross-entropy loss, which penalizes predicted probabilities that are far from the actual labels.
- **Optimization Algorithm**: Common optimization algorithms include gradient descent and its variants, which iteratively update the coefficients to minimize the objective function.
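A minimal sketch of this training loop, using plain batch gradient descent on the mean cross-entropy loss, is shown below. The toy data, learning rate, and function names are assumptions for illustration; scikit-learn's `LogisticRegression` uses different solvers and adds regularization by default.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_proba(features, weights, bias):
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def cross_entropy(X, y, weights, bias):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    eps = 1e-12
    total = 0.0
    for features, label in zip(X, y):
        p = min(max(predict_proba(features, weights, bias), eps), 1.0 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(X)


def fit(X, y, learning_rate=0.5, epochs=500):
    """Batch gradient descent: the gradient of the loss is mean((p - y) * x)."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in zip(X, y):
            error = predict_proba(features, weights, bias) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


# Toy one-feature dataset: negative inputs are class 0, positive inputs class 1.
X_toy = [[-1.5], [-0.5], [0.5], [1.5]]
y_toy = [0, 0, 1, 1]
loss_before = cross_entropy(X_toy, y_toy, [0.0], 0.0)
weights, bias = fit(X_toy, y_toy)
loss_after = cross_entropy(X_toy, y_toy, weights, bias)
print("loss before:", loss_before, "loss after:", loss_after)
```

With all coefficients at zero every prediction is 0.5, so the starting loss is ln 2; each update then moves the coefficients in the direction that lowers it.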

In [1]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/chaos/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

## data processing

In [4]:
# No header is specified here, so pandas uses the first tweet as the column names
twitter_data = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='ISO-8859-1')

In [5]:
twitter_data.shape

(1599999, 6)

In [6]:
twitter_data.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


In [7]:
column_names = ['target', 'id', 'date', 'flag', 'user', 'text']
twitter_data = pd.read_csv('training.1600000.processed.noemoticon.csv', names=column_names, encoding='ISO-8859-1')

In [8]:
twitter_data.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [9]:
twitter_data.isnull().sum()

target    0
id        0
date      0
flag      0
user      0
text      0
dtype: int64

In [10]:
# Check the class distribution
twitter_data['target'].value_counts()

target
0    800000
4    800000
Name: count, dtype: int64

- The two classes are evenly distributed (800,000 tweets each), so no upsampling or downsampling is needed. With an imbalanced dataset, one of the two would be required.
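For a dataset that is not balanced, downsampling could look roughly like the sketch below. It works on plain Python lists rather than the notebook's DataFrame, and `downsample`, `label_of`, and the sample rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def downsample(rows, label_of, seed=0):
    """Randomly shrink every class to the size of the smallest class."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced sample: five negative tweets (0) and two positive (1).
rows = [("bad day", 0), ("so tired", 0), ("missed bus", 0), ("ugh", 0),
        ("headache", 0), ("great news", 1), ("love it", 1)]
balanced = downsample(rows, label_of=lambda row: row[1])
print(Counter(label for _, label in balanced))
```

Downsampling discards data from the larger class; upsampling instead repeats or synthesizes examples from the smaller one.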

### convert the target labels to 0 and 1 (label 4 becomes 1)

In [11]:
twitter_data.replace({'target':{4:1}}, inplace=True)

In [12]:
twitter_data['target'].value_counts()

target
0    800000
1    800000
Name: count, dtype: int64

In [13]:
# Initialize Porter Stemmer
port_stem = PorterStemmer()

In [14]:
# Build the stopword set once; calling stopwords.words('english') for every word is very slow
english_stopwords = set(stopwords.words('english'))

def stemming(content):
    # Replace non-alphabetic characters with spaces and convert to lowercase
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower()

    # Tokenize the text on whitespace
    stemmed_content = stemmed_content.split()

    # Drop stopwords, then stem the remaining words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]

    # Join the stemmed words back into a single string
    stemmed_content = ' '.join(stemmed_content)

    return stemmed_content

In [15]:
# The original run, which rebuilt the stopword list for every word, took 17min 41.2 seconds
twitter_data['stemmed_content'] = twitter_data['text'].apply(stemming)

In [16]:
twitter_data.head()

Unnamed: 0,target,id,date,flag,user,text,stemmed_content
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",switchfoot http twitpic com zl awww bummer sho...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,upset updat facebook text might cri result sch...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,kenichan dive mani time ball manag save rest g...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,whole bodi feel itchi like fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",nationwideclass behav mad see


## save this dataframe for future use

In [40]:
twitter_data.to_csv('stemmed_twitter_data.csv', index=False)

In [17]:
print(twitter_data['stemmed_content'])

0          switchfoot http twitpic com zl awww bummer sho...
1          upset updat facebook text might cri result sch...
2          kenichan dive mani time ball manag save rest g...
3                            whole bodi feel itchi like fire
4                              nationwideclass behav mad see
                                 ...                        
1599995                           woke school best feel ever
1599996    thewdb com cool hear old walt interview http b...
1599997                         readi mojo makeov ask detail
1599998    happi th birthday boo alll time tupac amaru sh...
1599999    happi charitytuesday thenspcc sparkschar speak...
Name: stemmed_content, Length: 1600000, dtype: object


In [18]:
print(twitter_data['target'])

0          0
1          0
2          0
3          0
4          0
          ..
1599995    1
1599996    1
1599997    1
1599998    1
1599999    1
Name: target, Length: 1600000, dtype: int64


## separating data and the target

In [19]:
X = twitter_data['stemmed_content'].values
Y = twitter_data['target'].values


In [20]:
print(X)

['switchfoot http twitpic com zl awww bummer shoulda got david carr third day'
 'upset updat facebook text might cri result school today also blah'
 'kenichan dive mani time ball manag save rest go bound' ...
 'readi mojo makeov ask detail'
 'happi th birthday boo alll time tupac amaru shakur'
 'happi charitytuesday thenspcc sparkschar speakinguph h']


In [21]:
print(Y)

[0 0 0 ... 1 1 1]


## split the train data and the test data

In [22]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

In [23]:
print(X.shape, X_train.shape, X_test.shape)

(1600000,) (1280000,) (320000,)


In [24]:
print(X_train)

['watch saw iv drink lil wine' 'hatermagazin'
 'even though favourit drink think vodka coke wipe mind time think im gonna find new drink'
 ... 'eager monday afternoon'
 'hope everyon mother great day wait hear guy store tomorrow'
 'love wake folger bad voic deeper']


In [25]:
print(X_test)

['mmangen fine much time chat twitter hubbi back summer amp tend domin free time'
 'ah may show w ruth kim amp geoffrey sanhueza'
 'ishatara mayb bay area thang dammit' ...
 'destini nevertheless hooray member wonder safe trip' 'feel well'
 'supersandro thank']


## converting the textual data to numerical data

In [26]:
vectorizer = TfidfVectorizer()

X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
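To make the TF-IDF step less opaque, here is a small pure-Python sketch of the weighting. It is meant to follow scikit-learn's documented defaults (raw term counts, smoothed idf, L2 normalization), but treat that correspondence as an assumption: it splits on whitespace instead of using the vectorizer's token pattern, and `fit_idf` and `tfidf_vector` are hypothetical helper names.

```python
import math
from collections import Counter


def fit_idf(documents):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def tfidf_vector(document, idf):
    """Term count times idf, then L2-normalized; unseen terms are dropped."""
    counts = Counter(term for term in document.split() if term in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in weights.values()))
    return {term: value / norm for term, value in weights.items()} if norm else {}


# Idf values are learned from the training documents only,
# mirroring fit_transform on X_train and transform on X_test.
train_docs = ["whole bodi feel itchi like fire", "feel well", "hatermagazin"]
idf = fit_idf(train_docs)
print(tfidf_vector("hatermagazin", idf))
print(tfidf_vector("feel well today", idf))
```

A one-word document normalizes to a single weight of 1.0, which is why the row for `'hatermagazin'` in the printed training matrix contains exactly one entry with value 1.0. The word `today` is ignored because it was not seen during fitting.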

In [27]:
print(X_train)

  (0, 443066)	0.4484755317023172
  (0, 235045)	0.41996827700291095
  (0, 109306)	0.3753708587402299
  (0, 185193)	0.5277679060576009
  (0, 354543)	0.3588091611460021
  (0, 436713)	0.27259876264838384
  (1, 160636)	1.0
  (2, 288470)	0.16786949597862733
  (2, 132311)	0.2028971570399794
  (2, 150715)	0.18803850583207948
  (2, 178061)	0.1619010109445149
  (2, 409143)	0.15169282335109835
  (2, 266729)	0.24123230668976975
  (2, 443430)	0.3348599670252845
  (2, 77929)	0.31284080750346344
  (2, 433560)	0.3296595898028565
  (2, 406399)	0.32105459490875526
  (2, 129411)	0.29074192727957143
  (2, 407301)	0.18709338684973031
  (2, 124484)	0.1892155960801415
  (2, 109306)	0.4591176413728317
  (3, 172421)	0.37464146922154384
  (3, 411528)	0.27089772444087873
  (3, 388626)	0.3940776331458846
  (3, 56476)	0.5200465453608686
  :	:
  (1279996, 390130)	0.22064742191076112
  (1279996, 434014)	0.2718945052332447
  (1279996, 318303)	0.21254698865277746
  (1279996, 237899)	0.2236567560099234
  (1279996, 2910

In [28]:
print(X_test)

  (0, 420984)	0.17915624523539803
  (0, 409143)	0.31430470598079707
  (0, 398906)	0.3491043873264267
  (0, 388348)	0.21985076072061738
  (0, 279082)	0.1782518010910344
  (0, 271016)	0.4535662391658828
  (0, 171378)	0.2805816206356073
  (0, 138164)	0.23688292264071403
  (0, 132364)	0.25525488955578596
  (0, 106069)	0.3655545001090455
  (0, 67828)	0.26800375270827315
  (0, 31168)	0.16247724180521766
  (0, 15110)	0.1719352837797837
  (1, 366203)	0.24595562404108307
  (1, 348135)	0.4739279595416274
  (1, 256777)	0.28751585696559306
  (1, 217562)	0.40288153995289894
  (1, 145393)	0.575262969264869
  (1, 15110)	0.211037449588008
  (1, 6463)	0.30733520460524466
  (2, 400621)	0.4317732461913093
  (2, 256834)	0.2564939661498776
  (2, 183312)	0.5892069252021465
  (2, 89448)	0.36340369428387626
  (2, 34401)	0.37916255084357414
  :	:
  (319994, 123278)	0.4530341382559843
  (319995, 444934)	0.3211092817599261
  (319995, 420984)	0.22631428606830145
  (319995, 416257)	0.23816465111736276
  (319995, 3

## training the machine learning model - logistic regression

In [29]:
model = LogisticRegression(max_iter=1500)

In [30]:
model.fit(X_train, Y_train)

## model evaluation

### accuracy score : training data

In [31]:
X_train_pred = model.predict(X_train)
training_data_accu = accuracy_score(Y_train, X_train_pred)

In [32]:
print('Accuracy score of the training data : ', training_data_accu)

Accuracy score of the training data :  0.79871953125


### accuracy score : test data

In [33]:
X_test_pred = model.predict(X_test)
test_data_accu = accuracy_score(Y_test, X_test_pred)

In [34]:
print('Accuracy score of the test data : ', test_data_accu)

Accuracy score of the test data :  0.77668125


In [35]:
print("Shape of Y_test:", Y_test.shape)
print("Shape of X_test_pred:", X_test_pred.shape)
print("Data type of Y_test:", type(Y_test))
print("Data type of X_test_pred:", type(X_test_pred))


Shape of Y_test: (320000,)
Shape of X_test_pred: (320000,)
Data type of Y_test: <class 'numpy.ndarray'>
Data type of X_test_pred: <class 'numpy.ndarray'>


## saving the trained model

In [36]:
import pickle

In [37]:
filename = 'LogisticRegression_trained_model.sav'
# Use a context manager so the file is flushed and closed after writing
with open(filename, 'wb') as f:
    pickle.dump(model, f)

## using the saved model for future predictions : 

In [38]:
# 'rb' opens the file for reading in binary mode
with open('LogisticRegression_trained_model.sav', 'rb') as f:
    loaded_model = pickle.load(f)

In [39]:
X_new = X_test[234]
print(Y_test[234])

# Predict with the model that was loaded from disk
prediction = loaded_model.predict(X_new)
if prediction[0] == 0:
    print("negative")
else:
    print("positive")

0
negative
