<a href="https://colab.research.google.com/github/abhishekmishra-bareilly/Deep-Learning/blob/main/Sentiment_analysis_using_word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
# Import the dependancy
import pandas as pd
import numpy as np

In [6]:
# Load the dataset
data1 = pd.read_csv('/content/drive/MyDrive/Copy of User Reviews.csv')

In [7]:
data1.head(5)

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
2,10 Best Foods for You,,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3


In [8]:
#Shape of the dataset
data1.shape

(64295, 5)

In [9]:
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64295 entries, 0 to 64294
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     64295 non-null  object 
 1   Translated_Review       37427 non-null  object 
 2   Sentiment               37432 non-null  object 
 3   Sentiment_Polarity      37432 non-null  float64
 4   Sentiment_Subjectivity  37432 non-null  float64
dtypes: float64(2), object(3)
memory usage: 2.5+ MB


In [7]:
# dropping column with all null values
newdata = data1.dropna()

In [8]:
newdata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37427 entries, 0 to 64230
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     37427 non-null  object 
 1   Translated_Review       37427 non-null  object 
 2   Sentiment               37427 non-null  object 
 3   Sentiment_Polarity      37427 non-null  float64
 4   Sentiment_Subjectivity  37427 non-null  float64
dtypes: float64(2), object(3)
memory usage: 1.7+ MB


In [9]:
# Put only two columns for analysis
data = newdata[['Translated_Review','Sentiment']]

In [10]:
data.head()

Unnamed: 0,Translated_Review,Sentiment
0,I like eat delicious food. That's I'm cooking ...,Positive
1,This help eating healthy exercise regular basis,Positive
3,Works great especially going grocery store,Positive
4,Best idea us,Positive
5,Best way,Positive


In [11]:
# Check for duplicate values
data.duplicated().sum()

9433

In [12]:
 # Drop duplicate values
 data.drop_duplicates(inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)


In [13]:
# Again check the shape
data.shape

(27994, 2)

In [14]:
# Again Check for duplicate values
data.duplicated().sum()

0

### Data preprocessing

#### remove all html tags

In [15]:
# Create function to remove all html tags
import re
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'', text)

In [16]:
data['Translated_Review'] = data['Translated_Review'].apply(remove_html_tags)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['Translated_Review'] = data['Translated_Review'].apply(remove_html_tags)


#### Convert into lowercase

In [17]:
# Convert into lowercase
data['Translated_Review'] = data['Translated_Review'].str.lower()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['Translated_Review'] = data['Translated_Review'].str.lower()


In [18]:
data['Translated_Review'].head()

0    i like eat delicious food. that's i'm cooking ...
1      this help eating healthy exercise regular basis
3           works great especially going grocery store
4                                         best idea us
5                                             best way
Name: Translated_Review, dtype: object

#### Remove stopwords

In [19]:
# Remove stopwords
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [20]:
def remove_stopwords(text):
    new_text = []
    
    for word in text.split():
        if word in stopwords.words('english'):
            new_text.append('')
        else:
            new_text.append(word)
    x = new_text[:]
    new_text.clear()
    return " ".join(x)

In [21]:
# Applying function
data['Translated_Review'] = data['Translated_Review'].apply(remove_stopwords)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['Translated_Review'] = data['Translated_Review'].apply(remove_stopwords)


In [22]:
data['Translated_Review'].head()

0     like eat delicious food. that's i'm cooking f...
1           help eating healthy exercise regular basis
3           works great especially going grocery store
4                                         best idea us
5                                             best way
Name: Translated_Review, dtype: object

### Using word to vec

In [23]:
import gensim


In [24]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [25]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess


In [26]:
story = []
for doc in data['Translated_Review']:
    raw_sent = sent_tokenize(doc)
    for sent in raw_sent:
        story.append(simple_preprocess(sent))

In [27]:
# Model for w2V
model = gensim.models.Word2Vec(
    window=10,
    min_count=2
)

In [28]:
# building vocabulary
model.build_vocab(story)

In [29]:
model.train(story, total_examples=model.corpus_count, epochs=model.epochs)

(1953018, 2155900)

In [32]:
len(model.wv.index_to_key)

10715

In [33]:
def document_vector(doc):
    # remove out-of-vocabulary words
    doc = [word for word in doc.split() if word in model.wv.index_to_key]
    return np.mean(model.wv[doc], axis=0)

In [36]:
document_vector(data['Translated_Review'].values[0])

array([-0.53989977,  0.40047565,  0.21287328,  0.17935337, -0.03295907,
       -0.59545875, -0.05362034,  0.91093034, -0.4554392 , -0.3491016 ,
       -0.29794982, -0.38965315,  0.04761775,  0.22763464,  0.54316986,
       -0.37456024,  0.35739288, -0.6135732 ,  0.03272069, -0.3866469 ,
        0.31291044, -0.10707211, -0.2144071 , -0.21612094, -0.25553662,
       -0.13041721, -0.41441157, -0.15097417, -0.2793827 ,  0.10187916,
        0.44631448,  0.09959579, -0.16975455,  0.01018923, -0.16727777,
        0.12887464,  0.1933792 , -0.21270308, -0.40074223, -0.6431204 ,
        0.05371315, -0.11555828, -0.23548219, -0.06866817,  0.16320895,
       -0.23205659, -0.37288666,  0.18848205, -0.02729459,  0.4170385 ,
        0.2321824 , -0.24652757, -0.4090524 , -0.03660602, -0.15829468,
        0.162782  ,  0.09625836,  0.03602264, -0.12498179, -0.406832  ,
       -0.08234693,  0.30576622, -0.4706959 ,  0.05584227, -0.5417421 ,
        0.3992642 ,  0.2612811 ,  0.2580472 , -0.50623727,  0.51

In [44]:
from tqdm import tqdm


In [42]:
tqdm(data['Translated_Review'].values)

  0%|          | 0/27994 [00:00<?, ?it/s]

<tqdm.std.tqdm at 0x7f8414c3f040>

In [53]:
from tqdm import tqdm
X = []
for doc in tqdm(data['Translated_Review'].values):
    X.append(document_vector(doc))


  0%|          | 26/27994 [00:00<01:04, 432.41it/s]


ValueError: ignored

In [54]:
X = np.array(X)

In [56]:
X.shape

(26, 100)

In [63]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

y = encoder.fit_transform(data['Sentiment'].head(26))

In [64]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

In [65]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [66]:
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
accuracy_score(y_test,y_pred)

0.8333333333333334

In [67]:
data['Sentiment'].head(26)

0     Positive
1     Positive
3     Positive
4     Positive
5     Positive
6     Positive
8      Neutral
9      Neutral
10    Positive
11    Positive
12    Positive
13    Positive
14    Positive
16    Positive
17    Positive
18    Positive
19    Positive
20    Positive
21    Positive
22     Neutral
23    Positive
24    Positive
25     Neutral
26    Positive
27    Positive
28    Positive
Name: Sentiment, dtype: object