# Sentiment Analysis of Twitter

## What is sentiment analysis?

Sentiment analysis examines a piece of text, such as a sentence, to determine the attitude it expresses (for example, positive or negative) and labels it accordingly.

### How to perform sentiment analysis?
- First, choose the data on which you want to perform the analysis.
- Split it into training and testing sets.
- Specify the classifier.
- Train the classifier on the training set.
- Use the trained classifier to predict the sentiment of new data.

That's it...
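
As a rough illustration, the steps listed above can be sketched end to end on a tiny made-up dataset. The texts, labels, and the `toy_` variable names are assumptions for this example only and are not part of the Kaggle data used later.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical toy data (not from the Kaggle dataset): 1 = positive, 0 = negative
toy_texts = [
    "i love this", "what a great day", "so happy today", "this is wonderful",
    "i hate this", "what a terrible day", "so sad today", "this is awful",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Steps 1-2: choose the data and split it into training and testing sets
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

# Text has to be turned into numbers before a classifier can use it
toy_vectorizer = TfidfVectorizer()
toy_train_vec = toy_vectorizer.fit_transform(toy_X_train)  # learn the vocabulary on training text only
toy_test_vec = toy_vectorizer.transform(toy_X_test)        # reuse the same vocabulary

# Steps 3-4: specify the classifier and train it
toy_clf = SVC(kernel="linear")
toy_clf.fit(toy_train_vec, toy_y_train)

# Step 5: predict labels for the unseen test texts
toy_pred = toy_clf.predict(toy_test_vec)
print(list(zip(toy_X_test, toy_pred)))
```

With only eight sentences the predictions are not meaningful; the point is the order of the steps.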


Now let's try to perform this analysis on a Twitter dataset.


## 1. Import libraries 

In [1]:
# importing libraries

import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

## 2. Load the datasets

Here, I took the dataset from <a href='https://www.kaggle.com/youben/tweets-sentiment-analysis'>Kaggle</a>. It contains tweets written by Twitter users, and each tweet is marked 0 or 1 according to the sentiment it expresses.

In [2]:
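# First, load the training and test sets (local paths; adjust them to where the CSV files are stored)

df = pd.read_csv('D:/Education/train.csv', encoding='ISO-8859-1')
df_test = pd.read_csv('D:/Education/test.csv', encoding='ISO-8859-1')
# Keep only the first 20000 rows of each file to reduce the run time
train = df.head(20000)
test = df_test.head(20000)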

Display the first 5 rows of the training dataset.

In [3]:
train.head()

Unnamed: 0,ItemID,Sentiment,SentimentText
0,1,0,is so sad for my APL frie...
1,2,0,I missed the New Moon trail...
2,3,1,omg its already 7:30 :O
3,4,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,5,0,i think mi bf is cheating on me!!! ...


## 3. Split data for training and testing

You may wonder why we split the training data when we already have a separate testing dataset. The reason is that we can only measure the model on rows whose true sentiment is known. By holding out part of the labeled training data, we can compare the predicted labels with the actual ones and compute the accuracy or a cross-validation score.

So, let's get started...

Here, we use the SentimentText column of the training data as X and the Sentiment column as y.
This is how we split the data:

In [5]:
learn_data, learn_test, sentimate_learn, sentimate_test = train_test_split(train['SentimentText'],train['Sentiment'],test_size=0.25)
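
The cross-validation score mentioned above can be computed with `cross_val_score`, which is imported in the first cell. The following is a minimal sketch on made-up data, assuming a pipeline that combines the vectorizer and the classifier; the texts, labels, and `cv_` names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data: 1 = positive, 0 = negative
cv_texts = [
    "i love this song", "what a great day", "so happy right now",
    "this is wonderful news", "best weekend ever", "feeling great and happy",
    "i hate this song", "what a terrible day", "so sad right now",
    "this is awful news", "worst weekend ever", "feeling sad and tired",
]
cv_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Keeping the vectorizer inside the pipeline means it is re-fitted on the
# training part of every fold, so no vocabulary leaks in from the test part
cv_pipeline = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))

# cv=3 splits the data into 3 folds; each fold is used once for testing
cv_scores = cross_val_score(cv_pipeline, cv_texts, cv_labels, cv=3)
print("Fold accuracies:", cv_scores)
print("Mean accuracy:", cv_scores.mean())
```

On the real data, the same call could be given train['SentimentText'] and train['Sentiment'] in place of the toy lists.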

## 4. Vectorize the data

#### What is a vectorizer?
A vectorizer converts text into numerical feature vectors, because a classifier cannot work on raw strings. Each tweet becomes a row of numbers, with one column per word in the vocabulary learned from the training text.

Vectorizers are part of scikit-learn (from sklearn.feature_extraction.text import TfidfVectorizer).

Two vectorizers from scikit-learn are imported in this notebook:
- **TfidfVectorizer**
- **CountVectorizer**

Both vectorizers convert a collection of strings into a matrix. The only difference is that CountVectorizer stores raw word counts, while TfidfVectorizer stores TF-IDF weights, which reduce the influence of words that appear in many documents.
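
To make the difference between the two matrices concrete, here is a small pure-Python sketch that builds both by hand for a made-up corpus. It assumes scikit-learn's default TF-IDF settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and a plain whitespace tokenizer; the notebook's own vectorizer also sets sublinear_tf, which this sketch leaves out.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenized on whitespace for simplicity
docs = ["good good movie", "bad movie", "good day"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({token for doc in tokenized for token in doc})

# CountVectorizer-style matrix: raw term counts, one row per document
count_matrix = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Document frequency: in how many documents each term appears
n_docs = len(tokenized)
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}

# Smoothed inverse document frequency: rarer terms get a larger weight
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

# TF-IDF matrix: counts scaled by IDF, then each row scaled to unit length
tfidf_matrix = []
for row in count_matrix:
    weighted = [count * idf[term] for count, term in zip(row, vocab)]
    norm = math.sqrt(sum(w * w for w in weighted))
    tfidf_matrix.append([w / norm for w in weighted])

print("Vocabulary:", vocab)
print("Counts:", count_matrix)
for row in tfidf_matrix:
    print("TF-IDF:", [round(w, 3) for w in row])
```

In the second document, "bad" ends up with a larger weight than "movie" because "movie" also occurs in another document.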

In [6]:
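# Create feature vectors
vectorizer = TfidfVectorizer(min_df = 5,           # ignore words found in fewer than 5 tweets
                             max_df = 0.8,         # ignore words found in more than 80% of tweets
                             sublinear_tf = True,  # use 1 + log(tf) instead of the raw term frequency
                             use_idf = True)       # apply inverse-document-frequency weighting
#vectorizer = CountVectorizer()

# Learn the vocabulary from the training split, then reuse it for the test set
train_vectors = vectorizer.fit_transform(learn_data)
test_vectors = vectorizer.transform(test['SentimentText'])
target = train['Sentiment'].values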

## 5. Create the classifier model

Since we are using an SVM classifier, we first create a variable that holds it. Here, we use a linear kernel. The next step is to fit the classifier, which needs two values from the training set: the sentiment text and the sentiment label. These come from the training split created earlier (learn_data, sentimate_learn).

However, the classifier cannot be fitted on raw text, so instead of learn_data we pass its vectorized form, stored in the variable train_vectors.

In [7]:
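# Linear-kernel SVM fitted on the vectorized training split
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(train_vectors,sentimate_learn)

# Predict on the same vectors the model was trained on
prediction_linear = pd.DataFrame(svm_classifier.predict(train_vectors))
prediction_linear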


Unnamed: 0,0
0,0
1,1
2,0
3,0
4,0
...,...
14995,1
14996,0
14997,0
14998,0
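
The cell above predicts on the same rows the classifier was trained on. To estimate how the model behaves on unseen tweets, the held-out split should be vectorized with the already fitted vectorizer and scored separately. The sketch below shows the idea on made-up data; the texts, labels, and `demo_` names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical toy data: 1 = positive, 0 = negative
demo_texts = [
    "i love this song", "what a great day", "so happy right now", "this is wonderful news",
    "best weekend ever", "feeling great and happy", "love my new phone", "great news today",
    "i hate this song", "what a terrible day", "so sad right now", "this is awful news",
    "worst weekend ever", "feeling sad and tired", "hate my new phone", "terrible news today",
]
demo_labels = [1] * 8 + [0] * 8

demo_train, demo_held_out, demo_y_train, demo_y_held_out = train_test_split(
    demo_texts, demo_labels, test_size=0.25, random_state=42, stratify=demo_labels
)

# Fit the vectorizer and the classifier on the training split only
demo_vectorizer = TfidfVectorizer()
demo_model = SVC(kernel="linear")
demo_model.fit(demo_vectorizer.fit_transform(demo_train), demo_y_train)

# Score the training split and the held-out split separately
train_acc = accuracy_score(
    demo_y_train, demo_model.predict(demo_vectorizer.transform(demo_train))
)
held_out_acc = accuracy_score(
    demo_y_held_out, demo_model.predict(demo_vectorizer.transform(demo_held_out))
)
print("Training accuracy:", train_acc)
print("Held-out accuracy:", held_out_acc)
```

In this notebook, the equivalent would be to call vectorizer.transform(learn_test) and compare the predictions with sentimate_test.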


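## 6. Calculate the score of the classifier

Here, we use the predicted values to calculate the score. The classification_report below shows an accuracy of about 83%. Keep in mind that these predictions were made on the same 15000 rows the classifier was trained on, so this is a training score; the held-out split (learn_test, sentimate_test) would give a fairer estimate. Note that we used only the first 20000 rows of the data to save run time.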

In [8]:
# Both functions expect the true labels first and the predicted labels second
print("Accuracy_Score => ",accuracy_score(sentimate_learn,prediction_linear))
print("Report => ", classification_report(sentimate_learn,prediction_linear))

Accuracy_Score =>  0.8333333333333334
Report =>                precision    recall  f1-score   support

           0       0.83      0.86      0.85      8023
           1       0.83      0.80      0.82      6977

    accuracy                           0.83     15000
   macro avg       0.83      0.83      0.83     15000
weighted avg       0.83      0.83      0.83     15000
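
To help read the columns of the report above, here is a pure-Python sketch that computes accuracy, precision, recall, F1, and support by hand for a short made-up list of labels. The label lists and the `binary_metrics` helper are hypothetical; scikit-learn computes the same quantities internally.

```python
# Hypothetical labels for illustration: 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1]


def binary_metrics(y_true, y_pred, positive):
    """Return (precision, recall, f1, support) for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correctly predicted
    fp = sum(t != positive and p == positive for t, p in pairs)  # predicted but wrong
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true rows of this class
    return precision, recall, f1, support


# Accuracy: share of rows where the prediction matches the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)

for label in (0, 1):
    precision, recall, f1, support = binary_metrics(y_true, y_pred, label)
    print(label, round(precision, 2), round(recall, 2), round(f1, 2), support)
```

Precision answers "of the rows predicted as this class, how many were right?", while recall answers "of the rows that truly belong to this class, how many were found?".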



## 7. Test out some strings



In [9]:
review= """omg its already 7:30"""
review_vector = vectorizer.transform([review])
svm_classifier.predict(review_vector)

array([0], dtype=int64)

In [10]:
review= """is so sad for my APL frie..."""
review_vector = vectorizer.transform([review])
svm_classifier.predict(review_vector)

array([0], dtype=int64)