# Sentiment Analysis Project Using Natural Language Processing (NLP)

1. Objective: Analyze text data to determine whether the sentiment is positive, negative, or neutral.

2. Dataset: Use a dataset containing text data, such as reviews, tweets, or comments, with labeled sentiment (positive, negative, neutral).

3. Preprocessing: Clean the text by removing noise like punctuation, stop words, and unnecessary characters.

4. Text Tokenization: Break the text into smaller units (tokens), like words or phrases, to analyze them.

5. Vectorization: Convert the text into numerical features using methods like Bag of Words or TF-IDF so that a machine learning model can process it.

6. Algorithm: Apply machine learning algorithms such as Logistic Regression or Naive Bayes, or deep learning models, to predict sentiment.

7. Training: Train the model on a labeled dataset, teaching it to recognize patterns associated with different sentiments.

8. Testing: Evaluate the model on unseen data to check how accurately it can predict sentiment.

9. Evaluation Metrics: Use metrics like accuracy, precision, recall, and F1 score to assess the model's performance.

10. Deployment: Once trained, deploy the model to analyze real-time text data, such as customer reviews or social media posts, to understand public sentiment.
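The steps above can be sketched end to end in a few lines. This is a minimal illustration, not the notebook's own method: the toy corpus, the labels, and the name `sentiment_pipeline` are assumptions made for the example, and TF-IDF with Logistic Regression is just one of the combinations the list mentions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; 1 = positive, 0 = negative.
train_texts = [
    "great movie loved it",
    "wonderful acting and a great story",
    "terrible plot hated it",
    "awful acting and a boring story",
]
train_labels = [1, 1, 0, 0]

# The vectorizer handles tokenization and TF-IDF weighting (steps 4 and 5);
# the classifier covers steps 6 and 7.
sentiment_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
sentiment_pipeline.fit(train_texts, train_labels)

# Predict on text the model has not seen (step 8).
new_texts = ["loved the story", "boring and awful"]
predictions = sentiment_pipeline.predict(new_texts)
print(predictions)
```

A corpus this small only shows the mechanics; meaningful accuracy needs a labeled dataset of realistic size.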


In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [8]:
data = pd.read_csv('IMDB Dataset_Sentiment Analysis.csv')

In [9]:
data = data.sample(1000)  # keep a random subset of 1,000 reviews

In [10]:
data

Unnamed: 0,review,sentiment
721,I was watching the sci-fi channel when this st...,negative
25279,I went straight to the big screen to view this...,positive
6877,It is a pity that you cannot vote zero stars o...,negative
24422,In the Realm of the Senses is a beautifully fi...,positive
41470,Masters of Horror: The Screwfly Solution start...,negative
...,...,...
39721,this is best showing of what i think jesus rea...,positive
28266,I cannot for the life of me explain what the p...,negative
10027,I have just seen this movie and have not read ...,negative
282,I can't say much about this film. I think it s...,negative


In [11]:
data['review'].iloc[5]

"Although this small film kind of got lost in the wake of On The Waterfront, Edge Of The City can certainly hold its own with that star studded classic. It's another story about the docks and the code of silence that rules it.<br /><br />Next to the corrupt union that Lee J. Cobb ran in On The Waterfront, Jack Warden is really small time corruption. But he's real enough as the gang boss on one of the docks who intimidates the other workers by being handy with his fists and the bailing hook and he gets the rest to kickback part of their hard earned money. And it's all hard earned money in that job.<br /><br />One guy Warden can't intimidate is Sidney Poitier another gang boss and when he tries to intimidate newcomer John Cassavetes, Poitier takes him under his wing. The two develop quite the friendship and Poitier and his wife Ruby Dee even fix Cassavetes up with Kathleen Maguire.<br /><br />Warden is truly one loathsome creature and it's sad how by sheer force of personality and physic

In [12]:
data['sentiment'].unique()

array(['negative', 'positive'], dtype=object)

In [13]:
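# Assign the result back instead of using inplace=True on a selected column,
# which recent pandas versions warn about and may not apply to `data`.
data['sentiment'] = data['sentiment'].map({'positive': 1, 'negative': 0})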

In [14]:
data

Unnamed: 0,review,sentiment
721,I was watching the sci-fi channel when this st...,0
25279,I went straight to the big screen to view this...,1
6877,It is a pity that you cannot vote zero stars o...,0
24422,In the Realm of the Senses is a beautifully fi...,1
41470,Masters of Horror: The Screwfly Solution start...,0
...,...,...
39721,this is best showing of what i think jesus rea...,1
28266,I cannot for the life of me explain what the p...,0
10027,I have just seen this movie and have not read ...,0
282,I can't say much about this film. I think it s...,0


In [15]:
import re

In [16]:
def clean_html(text):
    # Strip HTML tags such as <br />; the non-greedy pattern matches one tag at a time.
    clean = re.compile(r'<.*?>')
    return re.sub(clean, '', text)

In [17]:
data['review']= data['review'].apply(clean_html)
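Replacing tags with an empty string joins sentences that were separated only by `<br />` tags. A possible variant, sketched here under the assumption that a single space is an acceptable substitute for a tag, replaces each tag with a space and then collapses repeated whitespace. The name `clean_html_spaced` is hypothetical and not part of the notebook.

```python
import re

TAG_RE = re.compile(r'<.*?>')

def clean_html_spaced(text):
    # Replace each tag with a space so neighbouring sentences stay separated.
    no_tags = TAG_RE.sub(' ', text)
    # Collapse the runs of whitespace left behind by consecutive tags.
    return re.sub(r'\s+', ' ', no_tags).strip()

sample = "rules it.<br /><br />Next to the corrupt union"
print(clean_html_spaced(sample))
# rules it. Next to the corrupt union
```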

In [18]:
data

Unnamed: 0,review,sentiment
721,I was watching the sci-fi channel when this st...,0
25279,I went straight to the big screen to view this...,1
6877,It is a pity that you cannot vote zero stars o...,0
24422,In the Realm of the Senses is a beautifully fi...,1
41470,Masters of Horror: The Screwfly Solution start...,0
...,...,...
39721,this is best showing of what i think jesus rea...,1
28266,I cannot for the life of me explain what the p...,0
10027,I have just seen this movie and have not read ...,0
282,I can't say much about this film. I think it s...,0


In [19]:
data['review'].iloc[5]

"Although this small film kind of got lost in the wake of On The Waterfront, Edge Of The City can certainly hold its own with that star studded classic. It's another story about the docks and the code of silence that rules it.Next to the corrupt union that Lee J. Cobb ran in On The Waterfront, Jack Warden is really small time corruption. But he's real enough as the gang boss on one of the docks who intimidates the other workers by being handy with his fists and the bailing hook and he gets the rest to kickback part of their hard earned money. And it's all hard earned money in that job.One guy Warden can't intimidate is Sidney Poitier another gang boss and when he tries to intimidate newcomer John Cassavetes, Poitier takes him under his wing. The two develop quite the friendship and Poitier and his wife Ruby Dee even fix Cassavetes up with Kathleen Maguire.Warden is truly one loathsome creature and it's sad how by sheer force of personality and physical prowess he cows almost everyone e

In [20]:
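# Example DataFrame
# Note: this reassigns `data`, so the cells below run on these three toy reviews
# instead of the IMDB sample loaded above.
data = pd.DataFrame({'review': ['I love this product!!!', 'Horrible experience :(', 'Good value for money.']})

# Define the function
def remove_special(text):
    # Replace every non-alphanumeric character with a space.
    x = ''
    for i in text:
        if i.isalnum():
            x = x + i
        else:
            x = x + ' '
    return x

In [21]:
# Apply the function to the 'review' column
data['review'] = data['review'].apply(remove_special)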
print(data)

                   review
0  I love this product   
1  Horrible experience   
2   Good value for money 


In [22]:
data

Unnamed: 0,review
0,I love this product
1,Horrible experience
2,Good value for money


In [23]:
data['review'].iloc[2]

'Good value for money '
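The trailing space in this output comes from replacing each special character with a space. The sketch below applies the same character rule and then normalizes whitespace. The function name `remove_special_compact` is an assumption for illustration. In this notebook the difference is mostly cosmetic, because the later stop-word step splits on whitespace anyway.

```python
def remove_special_compact(text):
    # Same character rule as the loop above: keep alphanumerics, blank out the rest.
    spaced = ''.join(ch if ch.isalnum() else ' ' for ch in text)
    # split() with no arguments drops leading, trailing, and repeated whitespace.
    return ' '.join(spaced.split())

for review in ['I love this product!!!', 'Horrible experience :(', 'Good value for money.']:
    print(repr(remove_special_compact(review)))
# 'I love this product'
# 'Horrible experience'
# 'Good value for money'
```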

In [24]:
def convert_low(text):
    return text.lower()

In [25]:
data['review']= data['review'].apply(convert_low)

In [26]:
data['review'].iloc[2]

'good value for money '

In [27]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\R S
[nltk_data]     Nithesh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [28]:
from nltk.corpus import stopwords

In [29]:
a = stopwords.words('english')

In [30]:
def remove_stop(text):
    # Keep only the tokens that are not in the stop-word list `a`.
    x = []
    for i in text.split():
        if i not in a:
            x.append(i)
    return x

In [31]:
data['review']= data['review'].apply(remove_stop)

In [32]:
data['review'].iloc[2]

['good', 'value', 'money']
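One point worth checking for sentiment work is that NLTK's English stop-word list includes negation words such as "not" and "no", so "not good" is reduced to "good". The sketch below shows one way to keep negations. To stay self-contained it uses a small stand-in set instead of `stopwords.words('english')`, and `STOP_WORDS`, `NEGATIONS`, and `remove_stop_keep_negations` are hypothetical names.

```python
# Small stand-in for stopwords.words('english'); the real NLTK list is longer.
STOP_WORDS = {"i", "this", "is", "a", "the", "was", "not", "no", "for"}

# Words to keep even though they appear in the stop-word list.
NEGATIONS = {"not", "no", "nor", "never"}

def remove_stop_keep_negations(text, stop_words=STOP_WORDS, keep=NEGATIONS):
    # A set gives constant-time membership tests, unlike a list.
    effective = set(stop_words) - set(keep)
    return [word for word in text.split() if word not in effective]

print(remove_stop_keep_negations("this movie was not good"))
# ['movie', 'not', 'good']
```

Whether keeping negations helps depends on the vectorizer; with single-word Bag of Words features the effect is limited, while n-gram features can capture "not good" as a unit.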

In [33]:
from nltk.stem.porter import PorterStemmer

In [34]:
ps = PorterStemmer()

In [35]:
def stem_words(text):
    x = []
    for i in text:
        x.append(ps.stem(i))
    return x

In [36]:
data['review']= data['review'].apply(stem_words)

In [37]:
data['review'].iloc[2]

['good', 'valu', 'money']

In [38]:
def join(list_input):
    return ' '.join(list_input)

In [39]:
data['review']= data['review'].apply(join)

In [40]:
data['review'].iloc[2]

'good valu money'

In [41]:
data

Unnamed: 0,review
0,love product
1,horribl experi
2,good valu money


In [42]:
X = data['review']

In [43]:
from sklearn.feature_extraction.text import CountVectorizer

In [44]:
cv = CountVectorizer()

In [45]:
cv.fit_transform(data['review'])

<3x7 sparse matrix of type '<class 'numpy.int64'>'
	with 7 stored elements in Compressed Sparse Row format>

In [46]:
X = cv.fit_transform(data['review']).toarray()

In [47]:
X

array([[0, 0, 0, 1, 0, 1, 0],
       [1, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 0, 1]], dtype=int64)

In [48]:
from sklearn.model_selection import train_test_split
import pandas as pd

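# Example data (replace with your actual dataset)
# Note: this reassigns `data`; the split below uses these toy numeric features,
# not the Bag of Words matrix built above.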
data = pd.DataFrame({'feature1': [1, 2, 3, 4, 5],
                     'feature2': [5, 4, 3, 2, 1],
                     'target': [0, 1, 0, 1, 0]})

# Define X (features) and y (target)
X = data.drop('target', axis=1)  # Features
y = data['target']  # Target labels

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training Features:", X_train)
print("Testing Features:", X_test)
print("Training Labels:", y_train)
print("Testing Labels:", y_test)


Training Features:    feature1  feature2
4         5         1
2         3         3
0         1         5
3         4         2
Testing Features:    feature1  feature2
1         2         4
Training Labels: 4    0
2    0
0    0
3    1
Name: target, dtype: int64
Testing Labels: 1    1
Name: target, dtype: int64
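The split above runs on toy numeric features. For text data, a common precaution is to split the raw reviews first and fit the vectorizer on the training part only, so that test vocabulary does not leak into training. The sketch below illustrates this with a hypothetical list of reviews and labels; `reviews`, `labels`, and `cv_split` are assumptions for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical preprocessed reviews; 1 = positive, 0 = negative.
reviews = [
    "love product", "horribl experi", "good valu money", "wast money",
    "great qualiti", "bad servic", "excel purchas", "poor qualiti",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# stratify keeps the class balance the same in both parts.
train_texts, test_texts, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, stratify=labels, random_state=42
)

cv_split = CountVectorizer()
X_train = cv_split.fit_transform(train_texts)  # learn the vocabulary from training text only
X_test = cv_split.transform(test_texts)        # reuse that vocabulary for the test text

print(X_train.shape, X_test.shape)
```

Words that appear only in the test reviews are ignored by `transform`, which mirrors what happens when the deployed model meets unseen vocabulary.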


In [49]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score,confusion_matrix

In [50]:
nb = GaussianNB()

In [51]:
nb.fit(X_train,y_train)

In [52]:
y_pred = nb.predict(X_test)

In [53]:
y_pred

array([0], dtype=int64)

In [54]:
y_test

1    1
Name: target, dtype: int64

In [55]:
cm = confusion_matrix(y_test,y_pred)

In [56]:
cm

array([[0, 0],
       [1, 0]], dtype=int64)
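A single test sample gives a confusion matrix with one non-zero cell, which says little about the model. To show how the metrics named in step 9 relate to the matrix, the sketch below uses hand-written labels; `y_true_demo` and `y_pred_demo` are hypothetical values, not results from this notebook.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels for six reviews; 1 = positive, 0 = negative.
y_true_demo = [1, 0, 1, 1, 0, 1]
y_pred_demo = [1, 0, 0, 1, 1, 1]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true_demo, y_pred_demo))
# [[1 1]
#  [1 3]]

print(accuracy_score(y_true_demo, y_pred_demo))   # 4 correct out of 6
print(precision_score(y_true_demo, y_pred_demo))  # TP / (TP + FP) = 3 / 4
print(recall_score(y_true_demo, y_pred_demo))     # TP / (TP + FN) = 3 / 4
print(f1_score(y_true_demo, y_pred_demo))         # harmonic mean of precision and recall
```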

In [57]:
%pip install --upgrade scikit-learn


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [58]:
import joblib

In [61]:
# Load the saved model
loaded_model = joblib.load('sentiment_analysis_model.pkl')
print("Model loaded successfully")

Model loaded successfully


In [63]:
# Save the trained model (the GaussianNB classifier `nb` fitted above)
joblib.dump(nb, 'sentiment_analysis_model.pkl')
print("Model saved as sentiment_analysis_model.pkl")

Model saved as sentiment_analysis_model.pkl
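For deployment on raw text, the saved file needs the fitted vectorizer as well as the classifier, because new reviews must be mapped to the same columns the model was trained on. One way to do this, sketched below, is to persist a single scikit-learn pipeline. The training texts, the file name `sentiment_pipeline_demo.pkl`, and the choice of `MultinomialNB` are assumptions for illustration, and the file is written to a temporary directory so nothing in the working folder is overwritten.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data; 1 = positive, 0 = negative.
texts = ["love product", "good valu money", "horribl experi", "wast money"]
labels = [1, 1, 0, 0]

# Vectorizer and classifier are fitted and stored together.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "sentiment_pipeline_demo.pkl")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored pipeline accepts raw strings and reproduces the original predictions.
sample = ["good valu", "horribl product"]
same_predictions = bool((model.predict(sample) == restored.predict(sample)).all())
print(same_predictions)
```

Files written by `joblib.dump` are pickles, so they should only be loaded from trusted sources and with a compatible scikit-learn version.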
