### Sentiment Analysis

https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/

In [2]:
import pandas as pd

In [3]:
#Read the dataset

train = pd.read_csv(r'C:\Users\divyakamat\data\DataSets\nlp\train_tweets.csv')

In [4]:
train.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


### Basic feature extraction using text data

- Number of words
- Number of characters
- Average word length
- Number of stopwords
- Number of special characters
- Number of numerics
- Number of uppercase words
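
The cells in this notebook do not compute the special-character and numeric counts from the list. A minimal sketch of both, on made-up tweets and assuming that "special characters" means hashtag-prefixed tokens (the column names `hashtags` and `numerics` are illustrative, not from the original):

```python
import pandas as pd

# Toy tweets invented for illustration; they are not rows of train_tweets.csv
demo = pd.DataFrame({"tweet": ["#model i love u 2 much",
                               "call 911 now #help #sos",
                               "plain words only"]})

# Assumption: a "special character" feature counts tokens that start with '#'
demo["hashtags"] = demo["tweet"].apply(
    lambda t: sum(w.startswith("#") for w in t.split()))

# Tokens that consist only of digits
demo["numerics"] = demo["tweet"].apply(
    lambda t: sum(w.isdigit() for w in t.split()))

print(demo)
```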

### Count the number of words

Extract the number of words in each tweet. The intuition is that negative tweets tend to contain fewer words than positive ones.

In [5]:
# Count space-separated tokens in every tweet. Assigning the result of .head()
# here would fill only the first five rows and leave NaN everywhere else.
train['word_count'] = train['tweet'].apply(lambda x: len(str(x).split(" ")))
train[['tweet','word_count']].head()

Unnamed: 0,tweet,word_count
0,@user when a father is dysfunctional and is s...,21
1,@user @user thanks for #lyft credit i can't us...,22
2,bihday your majesty,5
3,#model i love u take with u all the time in ...,17
4,factsguide: society now #motivation,8
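
The count of 5 for "bihday your majesty" suggests that the raw tweet contains extra spaces, because `split(" ")` emits an empty string for every additional space. A small sketch of the difference, using a hypothetical string with two leading spaces:

```python
# Hypothetical string: the real tweet's exact whitespace is an assumption
text = "  bihday your majesty"

# Splitting on a single space keeps empty strings for consecutive spaces
by_space = text.split(" ")

# Splitting with no argument collapses any run of whitespace
by_whitespace = text.split()

print(len(by_space), by_space)
print(len(by_whitespace), by_whitespace)
```

If the feature is meant to count actual words, `split()` with no argument is the safer choice.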


In [6]:
train.describe(percentiles = [0.25,0.50,0.75,0.90,0.95]).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,90%,95%,max
id,31962.0,15981.5,9226.778988,1.0,7991.25,15981.5,23971.75,28765.9,30363.95,31962.0
label,31962.0,0.070146,0.255397,0.0,0.0,0.0,0.0,0.0,1.0,1.0
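word_count,5.0,14.6,7.700649,5.0,8.0,17.0,21.0,21.6,21.8,22.0

Note that the `word_count` row reports a count of 5, not 31962: this output comes from a run in which `word_count` was assigned through `.head()`, so only the first five rows held a value and the rest were NaN.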
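
One way to check the word-count intuition is to compare the mean count per class. A minimal sketch on invented rows, assuming that `label` separates the two sentiment classes:

```python
import pandas as pd

# Invented rows for illustration only; real results need the full dataset
demo = pd.DataFrame({
    "label": [0, 0, 1, 1],
    "tweet": ["such a lovely sunny day today",
              "thanks for the ride",
              "so angry",
              "awful"],
})

# Words per tweet, splitting on any whitespace
demo["word_count"] = demo["tweet"].str.split().str.len()

# Mean word count for each label value
mean_by_label = demo.groupby("label")["word_count"].mean()
print(mean_by_label)
```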


### Count the number of characters

Extract the number of characters in each tweet. The intuition is the same as for the word count: negative tweets tend to be shorter than positive ones.

The count includes spaces, which can be excluded if required.

In [7]:
train['char_count']= train['tweet'].str.len()
train[['tweet','char_count']].head()

Unnamed: 0,tweet,char_count
0,@user when a father is dysfunctional and is s...,102
1,@user @user thanks for #lyft credit i can't us...,122
2,bihday your majesty,21
3,#model i love u take with u all the time in ...,86
4,factsguide: society now #motivation,39
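
Excluding spaces from the character count can be sketched as follows. The sample strings are hypothetical, and stripping all whitespace (not only the space character) is an assumption about what "excluded" should mean:

```python
import pandas as pd

# Hypothetical sample strings
demo = pd.Series(["  bihday your majesty", "a b  c"])

# Length including every character
with_spaces = demo.str.len()

# Remove all whitespace first, then measure the length
without_spaces = demo.str.replace(r"\s+", "", regex=True).str.len()

print(pd.DataFrame({"with_spaces": with_spaces,
                    "without_spaces": without_spaces}))
```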


### Extract Average word length

Calculate the average word length of each tweet: take the sum of the lengths of all the words and divide it by the number of words in the tweet:

In [8]:
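def avg_word(sentence):
    # Mean character length of the whitespace-separated words in a tweet
    words = sentence.split()
    # Guard against empty or whitespace-only tweets to avoid ZeroDivisionError
    return sum(len(word) for word in words) / len(words) if words else 0.0

train['avg_word'] = train['tweet'].apply(avg_word)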
train[['tweet','avg_word']].head()

Unnamed: 0,tweet,avg_word
0,@user when a father is dysfunctional and is s...,4.555556
1,@user @user thanks for #lyft credit i can't us...,5.315789
2,bihday your majesty,5.666667
3,#model i love u take with u all the time in ...,4.928571
4,factsguide: society now #motivation,8.0
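
As a worked check of the formula, "bihday your majesty" has words of length 6, 4 and 7, so the average is 17 / 3, which matches the 5.666667 in the output above. The same arithmetic as a standalone sketch (the empty-input guard is an addition for safety):

```python
def avg_word(sentence):
    # Sum of word lengths divided by the number of words
    words = sentence.split()
    return sum(len(word) for word in words) / len(words) if words else 0.0

example = avg_word("bihday your majesty")
print(round(example, 6))
```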


### Number of Stop Words

In [9]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

train['stopwords'] = train['tweet'].apply(lambda x : len([x for x in x.split() if x in stop]))
train[['tweet','stopwords']].head()

Unnamed: 0,tweet,stopwords
0,@user when a father is dysfunctional and is s...,10
1,@user @user thanks for #lyft credit i can't us...,5
2,bihday your majesty,1
3,#model i love u take with u all the time in ...,5
4,factsguide: society now #motivation,1
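
Two details affect this count: membership tests against a list are linear in its length, and the comparison is case-sensitive. A minimal sketch that uses a set and lower-cases each token before the lookup; the tiny stop list here is hypothetical and stands in for the NLTK list:

```python
# Hypothetical mini stop list standing in for stopwords.words('english')
stop_set = {"a", "is", "and", "when", "your"}

def count_stopwords(text, stop_words):
    # Lower-case each token so that "When" matches "when"
    return sum(word.lower() in stop_words for word in text.split())

n = count_stopwords("When a father is dysfunctional", stop_set)
print(n)
```

With the real list, `set(stopwords.words('english'))` would play the role of `stop_set`.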


### Number of Uppercase words
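Anger or rage is often expressed by writing words in UPPERCASE, so counting such words can be a useful feature.

In [10]:
# str.isupper() tests the casing; str.upper() only returns a converted copy,
# which is truthy for any non-empty word and therefore matches every word.
train['upper'] = train['tweet'].apply(lambda x: len([w for w in x.split() if w.isupper()]))
train[['tweet','upper']].head()

Unnamed: 0,tweet,upper
0,@user when a father is dysfunctional and is s...,18
1,@user @user thanks for #lyft credit i can't us...,19
2,bihday your majesty,3
3,#model i love u take with u all the time in ...,14
4,factsguide: society now #motivation,4

These values equal the number of words in each tweet: the output was produced with `x.upper()` as the filter, which is truthy for every non-empty word. With `x.isupper()` only fully uppercase words are counted.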
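
The distinction between `str.upper()` and `str.isupper()` matters for this feature: the first returns a converted string, the second returns a boolean. A short sketch on a made-up sentence:

```python
# Made-up sentence for illustration
text = "I am SO ANGRY right now"

# upper() returns a non-empty string, which is always truthy: every word passes
kept_by_upper = [w for w in text.split() if w.upper()]

# isupper() is True only when all cased characters in the word are uppercase
kept_by_isupper = [w for w in text.split() if w.isupper()]

print(len(kept_by_upper), kept_by_upper)
print(len(kept_by_isupper), kept_by_isupper)
```

Single-letter words such as "I" also count as uppercase; adding a `len(w) > 1` condition would exclude them if that is not wanted.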


### Basic Pre-Processing

- Lower casing
- Punctuation removal
- Stopwords removal
- Frequent words removal
- Rare words removal
- Spelling correction
- Tokenization
- Stemming
- Lemmatization

### Lower Case

The first pre-processing step is to convert the tweets to lower case. This prevents the same word from being treated as several different tokens (for example, "Analytics" and "analytics").

In [11]:
train['tweet'] = train['tweet'].apply(lambda x : " ".join(x.lower() for x in x.split()))
train[['tweet']].head()

Unnamed: 0,tweet
0,@user when a father is dysfunctional and is so...
1,@user @user thanks for #lyft credit i can't us...
2,bihday your majesty
3,#model i love u take with u all the time in ur...
4,factsguide: society now #motivation
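
The split-and-join form used here also collapses repeated whitespace, which the vectorised `.str.lower()` does not. A small sketch of the difference on a hypothetical string:

```python
import pandas as pd

# Hypothetical tweet with irregular spacing and mixed case
demo = pd.Series(["  Bihday  Your MAJESTY "])

# Lower-case word by word and re-join with single spaces
joined = demo.apply(lambda x: " ".join(w.lower() for w in x.split()))

# Vectorised lower-casing leaves the whitespace untouched
vectorised = demo.str.lower()

print(repr(joined[0]))
print(repr(vectorised[0]))
```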


### Remove Punctuation

The next step is to remove punctuation, which rarely carries useful information for this kind of text processing. Removing it also reduces the size of the training data.

In [12]:
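# Raw string avoids invalid-escape warnings; regex=True is required in
# pandas 2.0+, where str.replace treats the pattern as a literal by default.
train['tweet'] = train['tweet'].str.replace(r'[^\w\s]', '', regex=True)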
train['tweet'].head()

0    user when a father is dysfunctional and is so ...
1    user user thanks for lyft credit i cant use ca...
2                                  bihday your majesty
3    model i love u take with u all the time in urð...
4                    factsguide society now motivation
Name: tweet, dtype: object
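
The pattern `[^\w\s]` matches any character that is neither a word character (letters, digits, underscore) nor whitespace. A minimal sketch on a made-up tweet shows what it removes and what it keeps:

```python
import pandas as pd

# Made-up tweet for illustration
demo = pd.Series(["@user can't wait, #lyft_credit!"])

# Drop everything that is not a word character or whitespace
cleaned = demo.str.replace(r"[^\w\s]", "", regex=True)

print(cleaned[0])
```

Note that the underscore survives because `\w` includes it, and that "can't" becomes "cant".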

### Removal of Stop Words

Remove stop words (very common words such as "the", "is" and "and") from the text data.

In [13]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

train['tweet'].head()

0    user father dysfunctional selfish drags kids d...
1    user user thanks lyft credit cant use cause do...
2                                       bihday majesty
3                model love u take u time urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object
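
The output above still contains words such as "cant" and "dont". A likely reason is ordering: once punctuation has been stripped, contractions no longer match stop list entries that contain an apostrophe. A sketch of the effect with a hypothetical three-word stop list:

```python
import re

# Hypothetical mini stop list; the NLTK list is much longer
stop_set = {"i", "for", "don't"}

def strip_punct(text):
    # Same pattern as the punctuation-removal step
    return re.sub(r"[^\w\s]", "", text)

def drop_stop(text, stop_words):
    return " ".join(w for w in text.split() if w not in stop_words)

text = "i don't care for it"

# Punctuation removed first: "don't" becomes "dont" and slips through
punct_first = drop_stop(strip_punct(text), stop_set)

# Stop words removed first: "don't" is still recognisable and is dropped
stop_first = strip_punct(drop_stop(text, stop_set))

print(punct_first)
print(stop_first)
```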

### Common word removal

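Remove the most frequent words in the corpus. They appear in a large share of the tweets, so they are unlikely to help distinguish one class from another.

In [14]:
# Get the 10 most frequently occurring words and assign them to freq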
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[:10]

In [15]:
freq = list(freq.index)

In [16]:
# Remove those 10 most frequently occurring words
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['tweet'].head()

0    father dysfunctional selfish drags kids dysfun...
1    thanks lyft credit cant use cause dont offer w...
2                                       bihday majesty
3                              model take urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object
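
The same steps on a three-tweet toy corpus, removing the top 2 words instead of the top 10 so the effect is visible. The data is invented, and converting the words to a set is an optional speed-up over a list:

```python
import pandas as pd

# Invented corpus for illustration
demo = pd.Series(["user love day", "user love", "user happy"])

# Frequency of every word across the whole corpus, most frequent first
counts = pd.Series(" ".join(demo).split()).value_counts()

# Keep the two most frequent words in a set for fast lookup
top = set(counts.index[:2])

cleaned = demo.apply(lambda t: " ".join(w for w in t.split() if w not in top))
print(cleaned.tolist())
```

A tweet made up entirely of frequent words ends up as an empty string, which may need handling before modelling.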

### Rare words removal

Remove the rarest words from the text data. They occur too seldom for a model to learn anything reliable from them.

In [17]:
# Get the 10 least frequently occurring words and assign them to freq
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[-10:]

In [18]:
freq = list(freq.index)
# Remove those 10 least frequently occurring words
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['tweet'].head()

0    father dysfunctional selfish drags kids dysfun...
1    thanks lyft credit cant use cause dont offer w...
2                                       bihday majesty
3                              model take urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object
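
The first five rows are unchanged here, which is expected: many words may occur only once, and taking the last 10 of them is an arbitrary pick among ties. An alternative, shown here as a variation and not as the notebook's method, is to drop every word below a frequency threshold:

```python
import pandas as pd

# Invented corpus for illustration
demo = pd.Series(["user love day", "user love", "user happy"])

counts = pd.Series(" ".join(demo).split()).value_counts()

# Assumption: treat any word seen fewer than 2 times as rare
min_count = 2
rare = set(counts[counts < min_count].index)

cleaned = demo.apply(lambda t: " ".join(w for w in t.split() if w not in rare))
print(sorted(rare))
print(cleaned.tolist())
```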

In [19]:
from textblob import TextBlob

# Create a TextBlob object by passing a string
# (the class name is TextBlob, with a capital B)
blob = TextBlob("Analytics Vidhya is a great platform to learn data science. \n It helps community through blogs, hackathons, discussions,etc.")

# Split the text into sentences
blob.sentences