### Dataset Link: https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format

Sentiment analysis helps us gauge the mood and emotions of a customer or reviewer and gather useful information about the context. It is the process of analyzing text and classifying it according to the needs of the research.

In [6]:
!python -m spacy download en_core_web_sm

Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)

In [7]:
import pandas as pd
from textblob import TextBlob
from nltk.tokenize.toktok import ToktokTokenizer
import re
tokenizer = ToktokTokenizer()
import spacy
nlp = spacy.load('en_core_web_sm', disable=['ner'])

In [8]:
TextBlob("he is very good boy").sentiment

Sentiment(polarity=0.9099999999999999, subjectivity=0.7800000000000001)

In [9]:
TextBlob("he is not a good boy").sentiment

Sentiment(polarity=-0.35, subjectivity=0.6000000000000001)

In [10]:
TextBlob("Everybody says this man is poor").sentiment

Sentiment(polarity=-0.4, subjectivity=0.6)

### Polarity and Subjectivity
Polarity is a float that indicates whether a sentence is positive or negative. Its value lies in the range [-1, 1], where 1 denotes a positive statement and -1 a negative one.

Subjectivity, on the other hand, measures how much a sentence expresses personal opinion, emotion, or judgment as opposed to factual information. It is also a float, in the range [0, 1]: the closer the value is to 1, the more likely the text is a personal opinion rather than a statement of fact.
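
As a rough illustration of how the two scores might be read together, the sketch below maps a (polarity, subjectivity) pair to coarse descriptions. The function name `describe_sentiment` and the cutoffs (0 for polarity, 0.5 for subjectivity) are assumptions made for this example; they are not part of TextBlob.

```python
def describe_sentiment(polarity, subjectivity):
    """Map TextBlob-style scores to coarse labels (illustrative cutoffs)."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if not 0.0 <= subjectivity <= 1.0:
        raise ValueError("subjectivity must lie in [0, 1]")
    if polarity > 0:
        tone = "positive"
    elif polarity < 0:
        tone = "negative"
    else:
        tone = "neutral"
    # Assumed cutoff: 0.5 splits "mostly opinion" from "mostly fact".
    kind = "subjective" if subjectivity >= 0.5 else "objective"
    return tone, kind

# Scores rounded from the TextBlob outputs shown earlier.
print(describe_sentiment(0.91, 0.78))
print(describe_sentiment(-0.35, 0.60))
```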

In [11]:
### Data Loading
train=pd.read_csv("Train.csv")
train

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1
...,...,...
39995,"""Western Union"" is something of a forgotten cl...",1
39996,This movie is an incredible piece of work. It ...,1
39997,My wife and I watched this movie because we pl...,0
39998,"When I first watched Flatliners, I was amazed....",1


In [12]:
label_0=train[train['label']==0].sample(n=5000)
label_1=train[train['label']==1].sample(n=5000)

In [13]:
train=pd.concat([label_1,label_0])
from sklearn.utils import shuffle
train = shuffle(train)

In [14]:
train

Unnamed: 0,text,label
874,"I've seen this movie when I was young, and I r...",1
31575,SPOILERS<br /><br />Tom and Jerry is a classic...,1
21366,I am a big fan of British films in general but...,0
10973,"This is one of the all-time great ""Our Gang"" s...",1
29051,"First off, the movie was not true to facts at ...",0
...,...,...
12867,This movie was not that good at all. Here is t...,0
15342,Walker Texas Ranger is one of the worst shows ...,0
36020,Like one of the other reviewers (might have be...,1
18196,For someone who remembers Jane in the Daily Mi...,1


The data has two labels, 0 and 1: 0 stands for "Negative" and 1 stands for "Positive".

### Data Preprocessing

In [15]:
train.isnull().sum()

text     0
label    0
dtype: int64

In [17]:
import numpy as np
train.replace(r'^\s*$', np.nan, regex=True,inplace=True)
train.dropna(axis = 0, how = 'any', inplace = True)

In [18]:
train.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["",""], regex=True, inplace=True)
print('escape seq removed')

escape seq removed


In [19]:
import numpy as np
train.replace(r'^\s*$', np.nan, regex=True,inplace=True)
train.dropna(axis = 0, how = 'any', inplace = True)

In [20]:
train

Unnamed: 0,text,label
874,"I've seen this movie when I was young, and I r...",1
31575,SPOILERS<br /><br />Tom and Jerry is a classic...,1
21366,I am a big fan of British films in general but...,0
10973,"This is one of the all-time great ""Our Gang"" s...",1
29051,"First off, the movie was not true to facts at ...",0
...,...,...
12867,This movie was not that good at all. Here is t...,0
15342,Walker Texas Ranger is one of the worst shows ...,0
36020,Like one of the other reviewers (might have be...,1
18196,For someone who remembers Jane in the Daily Mi...,1


In [21]:
train['text']=train['text'].str.encode('ascii', 'ignore').str.decode('ascii')
print('non-ascii data removed')

non-ascii data removed


In [22]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [23]:
def remove_punctuations(text):
    import string
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text
train['text']=train['text'].apply(remove_punctuations)
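
The loop above calls `str.replace` once per punctuation character. A possible alternative, sketched here under the hypothetical name `remove_punctuations_fast`, builds a single translation table with `str.maketrans` and deletes all of `string.punctuation` in one pass. It should produce the same result as the loop.

```python
import string

# Translation table that deletes every character in string.punctuation.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuations_fast(text):
    """Delete all ASCII punctuation characters from text in one pass."""
    return text.translate(_PUNCT_TABLE)

print(remove_punctuations_fast("I've seen this movie, and I loved it!"))
```

Either version also strips `<`, `>` and `/`, so a tag such as `<br />` is reduced to the bare token `br`.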

In [24]:
train

Unnamed: 0,text,label
874,Ive seen this movie when I was young and I rem...,1
31575,SPOILERSbr br Tom and Jerry is a classic carto...,1
21366,I am a big fan of British films in general but...,0
10973,This is one of the alltime great Our Gang shor...,1
29051,First off the movie was not true to facts at a...,0
...,...,...
12867,This movie was not that good at all Here is th...,0
15342,Walker Texas Ranger is one of the worst shows ...,0
36020,Like one of the other reviewers might have bee...,1
18196,For someone who remembers Jane in the Daily Mi...,1


In [25]:
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [26]:
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

In [27]:
def custom_remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text
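
To see why `no` and `not` are taken out of the stopword list before filtering, here is a small self-contained sketch. It uses a tiny hand-written stopword set and whitespace tokenization as stand-ins for NLTK's list and `ToktokTokenizer`; both are assumptions made so the example runs without extra downloads.

```python
# Tiny stand-in for NLTK's English stopword list (assumption for this sketch).
STOPWORDS = {"the", "was", "at", "all", "this", "is", "a", "no", "not"}
KEEP = {"no", "not"}  # negations carry sentiment, so they stay in the text

def remove_stopwords(text, stopwords=frozenset(STOPWORDS - KEEP)):
    """Drop stopwords (case-insensitively) but keep negation words."""
    tokens = text.split()
    return " ".join(t for t in tokens if t.lower() not in stopwords)

print(remove_stopwords("This movie was not good at all"))
```

If `not` were filtered as well, the sentence would collapse to `movie good`, which flips its apparent sentiment.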

In [28]:
train['text']=train['text'].apply(custom_remove_stopwords)

In [29]:
train

Unnamed: 0,text,label
874,Ive seen movie young remembered one first film...,1
31575,SPOILERSbr br Tom Jerry classic cartoon flawle...,1
21366,big fan British films general especially gangs...,0
10973,one alltime great Gang shorts Spanky cutest fu...,1
29051,First movie not true facts saw documentary day...,0
...,...,...
12867,movie not good first clue not gonna strong mov...,0
15342,Walker Texas Ranger one worst shows produced p...,0
36020,Like one reviewers might Amazon first introduc...,1
18196,someone remembers Jane Daily Mirror strip cart...,1


In [30]:
def remove_special_characters(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text
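
One detail worth checking in character classes like this is the letter range. In ASCII, the range `A-z` also covers the six characters between `Z` and `a` (`[`, `\`, `]`, `^`, `_` and the backtick), whereas `A-Z` covers uppercase letters only. The sample string below is made up for illustration.

```python
import re

sample = "good_movie [sic] ^_^ 10/10"

# A-z spans [ \ ] ^ _ ` as well, so those characters survive.
wide = re.sub(r'[^a-zA-z0-9\s]', '', sample)
# A-Z keeps only letters, digits and whitespace.
strict = re.sub(r'[^a-zA-Z0-9\s]', '', sample)

print(repr(wide))
print(repr(strict))
```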

In [31]:
train['text']=train['text'].apply(remove_special_characters)

In [32]:
def remove_html(text):
    import re
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r' ', text)

In [33]:
train['text']=train['text'].apply(remove_html)
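
A note on ordering: in the outputs above, `<br />` tags appear as stray `br` tokens (for example `SPOILERSbr br Tom`). Punctuation was stripped before this HTML step ran, so no `<...>` sequences are left for the pattern to match. The sketch below compares the two orderings on one review prefix taken from the data; the helper names `strip_html` and `strip_punct` are hypothetical.

```python
import re
import string

HTML_TAG = re.compile(r'<.*?>')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_html(text):
    """Replace each HTML tag with a single space."""
    return HTML_TAG.sub(' ', text)

def strip_punct(text):
    """Delete all ASCII punctuation characters."""
    return text.translate(PUNCT_TABLE)

raw = "SPOILERS<br /><br />Tom and Jerry is a classic"

punct_first = strip_html(strip_punct(raw))  # tags already mangled into 'br'
html_first = strip_punct(strip_html(raw))   # tags removed while still intact

print(repr(punct_first))
print(repr(html_first))
```

Running HTML and URL removal before punctuation removal is one way to avoid the leftover `br` tokens.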

In [34]:
def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r' ',text)

In [35]:
train['text']=train['text'].apply(remove_URL)

In [36]:
def remove_numbers(text):
    """ Removes integers """
    text = ''.join([i for i in text if not i.isdigit()])         
    return text

In [37]:
train['text']=train['text'].apply(remove_numbers)

In [38]:
# Matches any token that contains at least one digit.
DIGIT_TOKEN = re.compile(r'\D*\d')

def cleanse(word):
    """Return '' for a token containing a digit, else the token unchanged."""
    if DIGIT_TOKEN.match(word):
        return ''
    return word

def remove_alphanumeric(text):
    """Drop whitespace-separated tokens that contain a digit."""
    return ' '.join(filter(None, (cleanse(word) for word in text.split())))

In [39]:
train['text']=train['text'].apply(remove_alphanumeric)

In [40]:
train

Unnamed: 0,text,label
874,Ive seen movie young remembered one first film...,1
31575,SPOILERSbr br Tom Jerry classic cartoon flawle...,1
21366,big fan British films general especially gangs...,0
10973,one alltime great Gang shorts Spanky cutest fu...,1
29051,First movie not true facts saw documentary day...,0
...,...,...
12867,movie not good first clue not gonna strong mov...,0
15342,Walker Texas Ranger one worst shows produced p...,0
36020,Like one reviewers might Amazon first introduc...,1
18196,someone remembers Jane Daily Mirror strip cart...,1


In [41]:
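def lemmatize_text(text):
    """Replace each token with its spaCy lemma."""
    doc = nlp(text)
    # '-PRON-' was the pronoun lemma in spaCy v2; on v3 models this check is a no-op.
    return ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in doc])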

In [42]:
train['text']=train['text'].apply(lemmatize_text)

In [43]:
train['sentiment'] = train['text'].apply(lambda review: TextBlob(review).sentiment)

In [44]:
train

Unnamed: 0,text,label,sentiment
874,I ve see movie young remember one first film t...,1,"(0.0898809523809524, 0.5090136054421769)"
31575,spoilersbr br Tom Jerry classic cartoon flawle...,1,"(0.12678856231487814, 0.491311612364244)"
21366,big fan british film general especially gangst...,0,"(-0.08836206896551724, 0.5750000000000002)"
10973,one alltime great Gang short spanky cut funnie...,1,"(0.38416666666666666, 0.6666666666666666)"
29051,first movie not true fact see documentary day ...,0,"(0.1209733893557423, 0.44943977591036416)"
...,...,...,...
12867,movie not good first clue not go to strong mov...,0,"(0.035833333333333335, 0.5305555555555557)"
15342,Walker Texas Ranger one bad show produce past ...,0,"(-0.11249999999999996, 0.5374999999999999)"
36020,like one reviewer might Amazon first introduce...,1,"(0.08727497096399534, 0.5137381250186127)"
18196,someone remember Jane Daily Mirror strip carto...,1,"(0.07750000000000001, 0.365)"


In [45]:
sentiment_series = train['sentiment'].tolist()

In [46]:
columns = ['polarity', 'subjectivity']
df1 = pd.DataFrame(sentiment_series, columns=columns, index=train.index)

In [47]:
df1

Unnamed: 0,polarity,subjectivity
874,0.089881,0.509014
31575,0.126789,0.491312
21366,-0.088362,0.575000
10973,0.384167,0.666667
29051,0.120973,0.449440
...,...,...
12867,0.035833,0.530556
15342,-0.112500,0.537500
36020,0.087275,0.513738
18196,0.077500,0.365000


In [48]:
result = pd.concat([train,df1],axis=1)

In [49]:
result.drop(['sentiment'],axis=1,inplace=True)

In [50]:
result.loc[result['polarity']>=0.3, 'Sentiment'] = "Positive"
result.loc[result['polarity']<0.3, 'Sentiment'] = "Negative"

In [51]:
result

Unnamed: 0,text,label,polarity,subjectivity,Sentiment
874,I ve see movie young remember one first film t...,1,0.089881,0.509014,Negative
31575,spoilersbr br Tom Jerry classic cartoon flawle...,1,0.126789,0.491312,Negative
21366,big fan british film general especially gangst...,0,-0.088362,0.575000,Negative
10973,one alltime great Gang short spanky cut funnie...,1,0.384167,0.666667,Positive
29051,first movie not true fact see documentary day ...,0,0.120973,0.449440,Negative
...,...,...,...,...,...
12867,movie not good first clue not go to strong mov...,0,0.035833,0.530556,Negative
15342,Walker Texas Ranger one bad show produce past ...,0,-0.112500,0.537500,Negative
36020,like one reviewer might Amazon first introduce...,1,0.087275,0.513738,Negative
18196,someone remember Jane Daily Mirror strip carto...,1,0.077500,0.365000,Negative


In [52]:
result.loc[result['label']==1, 'Sentiment_label'] = 1
result.loc[result['label']==0, 'Sentiment_label'] = 0

In [53]:
result

Unnamed: 0,text,label,polarity,subjectivity,Sentiment,Sentiment_label
874,I ve see movie young remember one first film t...,1,0.089881,0.509014,Negative,1.0
31575,spoilersbr br Tom Jerry classic cartoon flawle...,1,0.126789,0.491312,Negative,1.0
21366,big fan british film general especially gangst...,0,-0.088362,0.575000,Negative,0.0
10973,one alltime great Gang short spanky cut funnie...,1,0.384167,0.666667,Positive,1.0
29051,first movie not true fact see documentary day ...,0,0.120973,0.449440,Negative,0.0
...,...,...,...,...,...,...
12867,movie not good first clue not go to strong mov...,0,0.035833,0.530556,Negative,0.0
15342,Walker Texas Ranger one bad show produce past ...,0,-0.112500,0.537500,Negative,0.0
36020,like one reviewer might Amazon first introduce...,1,0.087275,0.513738,Negative,1.0
18196,someone remember Jane Daily Mirror strip carto...,1,0.077500,0.365000,Negative,1.0
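
`Sentiment_label` as built above mirrors the ground-truth `label`, so judging the TextBlob predictions means comparing the polarity-based `Sentiment` against `label`. The sketch below does that for the nine rows visible in the output only. The helper `accuracy` is hypothetical, and the numbers say nothing about the full 10,000-row sample.

```python
# (label, polarity) pairs copied from the nine rows displayed above.
rows = [
    (1, 0.089881), (1, 0.126789), (0, -0.088362),
    (1, 0.384167), (0, 0.120973), (0, 0.035833),
    (0, -0.112500), (1, 0.087275), (1, 0.077500),
]

def accuracy(rows, threshold):
    """Share of rows where (polarity >= threshold) agrees with the label."""
    hits = sum(int(polarity >= threshold) == label for label, polarity in rows)
    return hits / len(rows)

print(accuracy(rows, 0.3))  # cutoff used in the notebook
print(accuracy(rows, 0.0))  # alternative cutoff, for comparison
```

On these nine rows the 0.3 cutoff agrees with the label 5 times and a cutoff of 0 agrees 7 times, which suggests the threshold is worth tuning on the full data.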
