<center><i> Big Data Project </i> </center>


# <center><b> Twitter sentiment analysis and classification of tweets </b> </center>

## Problem Statement and Introduction 

In this project, we build a model that analyses a stream of tweets from Twitter and displays the sentiment of each tweet. The model is trained and tested using the given CSV data. We renamed the training CSV file to trainingdata.csv to make it more readable. The data set includes 1,600,000 tweets. We used Apache Spark (the spark-2.4.7-bin-hadoop2.7 build) in a Jupyter notebook environment to build the model. The model is built using Spark's logistic regression library, and the output obtained is stored in an MS SQL database.

The project is divided into 3 main parts:

1) Building a classifier model using the given training and test data

2) Using the model built in Part 1 to classify the tweets

3) Storing the data in an MS SQL database using the Python module pyodbc, which simplifies access to ODBC databases

### Part 1
### Building the classifier model

<b>Loading the required packages </b>


In [2]:
import findspark
# Raw string so the backslashes in the Windows path are not read as escape sequences
findspark.init(r'C:\Users\samir\Desktop\spark\spark-2.4.7-bin-hadoop2.7')

In [3]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import HashingTF, Tokenizer, StopWordsRemover

<b>Creating a Spark session with the user-defined appName tweetsentimentanalysis</b>


In [4]:
spark = SparkSession.builder.appName('tweetsentimentanalysis').getOrCreate()
sc = spark.sparkContext


### Read the data file into a Spark DataFrame

The given training data is used to train the model. It consists of 6 fields: Polarity, the sentiment of the tweet (0 = negative, 2 = neutral, 4 = positive); ID, the unique identification number of the tweet; Date, the date the tweet was posted; Query (NO_QUERY if there is no query); User, the name of the user who posted the tweet; and Text, the actual tweet. The training data is in CSV format. In the code block below, it is loaded into a DataFrame and the schema is printed to show the name and type of each column.

In [5]:
tweettraindata = spark.read.csv('trainingdata.csv',inferSchema='true').toDF("Polarity", "ID", "Date", "Query","User","Text")

tweettraindata.printSchema()

root
 |-- Polarity: integer (nullable = true)
 |-- ID: long (nullable = true)
 |-- Date: string (nullable = true)
 |-- Query: string (nullable = true)
 |-- User: string (nullable = true)
 |-- Text: string (nullable = true)
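
The six fields and the polarity codes described above can also be sketched in plain Python, without Spark. This is only an illustration: the `Tweet` tuple, the `POLARITY_LABELS` mapping, the `parse_rows` helper and the sample row are hypothetical names and made-up data, not part of the notebook or the data set.

```python
import csv
import io
from collections import namedtuple

# Field order used in the notebook's toDF(...) call.
Tweet = namedtuple("Tweet", ["polarity", "id", "date", "query", "user", "text"])

# Polarity codes as described in the text above.
POLARITY_LABELS = {0: "negative", 2: "neutral", 4: "positive"}


def parse_rows(csv_text):
    """Parse header-less CSV text into Tweet records with numeric polarity and ID."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text)):
        polarity, tweet_id, date, query, user, text = fields
        rows.append(Tweet(int(polarity), int(tweet_id), date, query, user, text))
    return rows


# Made-up row for illustration only; it is not taken from the data set.
sample = '"4","1","Mon Jan 01 00:00:00 UTC 2001","NO_QUERY","example_user","what a nice, sunny day"\n'
tweets = parse_rows(sample)
print(tweets[0].text, "->", POLARITY_LABELS[tweets[0].polarity])
```

The quoted Text field keeps its embedded comma, which is why a CSV parser is used here instead of a plain `split(",")`.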



<b> A sample of 10 rows of data in the dataframe is shown below  </b>

In [6]:
tweettraindata.show(10)

+--------+----------+--------------------+--------+---------------+--------------------+
|Polarity|        ID|                Date|   Query|           User|                Text|
+--------+----------+--------------------+--------+---------------+--------------------+
|       0|1467810369|Mon Apr 06 22:19:...|NO_QUERY|_TheSpecialOne_|@switchfoot http:...|
|       0|1467810672|Mon Apr 06 22:19:...|NO_QUERY|  scotthamilton|is upset that he ...|
|       0|1467810917|Mon Apr 06 22:19:...|NO_QUERY|       mattycus|@Kenichan I dived...|
|       0|1467811184|Mon Apr 06 22:19:...|NO_QUERY|        ElleCTF|my whole body fee...|
|       0|1467811193|Mon Apr 06 22:19:...|NO_QUERY|         Karoli|@nationwideclass ...|
|       0|1467811372|Mon Apr 06 22:20:...|NO_QUERY|       joy_wolf|@Kwesidei not the...|
|       0|1467811592|Mon Apr 06 22:20:...|NO_QUERY|        mybirch|         Need a hug |
|       0|1467811594|Mon Apr 06 22:20:...|NO_QUERY|           coZZ|@LOLTrish hey  lo...|
|       0|1467811795|

### Data Extraction 

Out of the six fields present in the DataFrame, we need only 2 for our model: the tweet text and Polarity. These fields are selected and a sample of 5 rows is printed below.

In [7]:
selected_data = tweettraindata.select("Text", col("Polarity").cast("Int").alias("Polarity"))
selected_data.show(truncate=False, n=5)

+-------------------------------------------------------------------------------------------------------------------+--------+
|Text                                                                                                               |Polarity|
+-------------------------------------------------------------------------------------------------------------------+--------+
|@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D|0       |
|is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!    |0       |
|@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds                          |0       |
|my whole body feels itchy and like its on fire                                                                     |0       |
|@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.

## Train and Test data

The data set is divided into training and test data using the randomSplit function. About 70% of the rows go to the training data and the remaining 30% are used as test data. The number of rows in each set is shown below.

In [9]:
# Divide the data set into 70% training and 30% test data
dividedData = selected_data.randomSplit([0.70, 0.30])
trainingData = dividedData[0]  # index 0 = training data
testingData = dividedData[1]  # index 1 = test data
train_rows = trainingData.count()
test_rows = testingData.count()
print("Training data rows:", train_rows, "; Testing data rows:", test_rows)

('Training data rows:', 1120149, '; Testing data rows:', 479851)
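
The row counts above are close to, but not exactly, 70% and 30% of 1,600,000. That is what one would expect if each row is assigned to a split by an independent random draw. The following plain-Python sketch shows the idea under that assumption; `random_split` is a hypothetical helper written for illustration, not Spark's implementation.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split by an independent random draw."""
    total = float(sum(weights))
    # Cumulative boundaries, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train, test = random_split(range(10000), [0.70, 0.30], seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible across runs. PySpark's `randomSplit` also accepts an optional `seed` argument, which would serve the same purpose in the notebook.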


## Preprocessing data
### Tokenization
Tokenization is our first preprocessing step. It is the method of separating a string into its parts (tokens). The Text column in the training DataFrame consists of sentences. Each sentence is broken down into words, and the list of all the words of the tweet is stored in a new column named SeWords. This is done using the Tokenizer class, as shown in the code block below.


In [14]:
tokenizer = Tokenizer(inputCol="Text", outputCol="SeWords")
tokenizedTrain = tokenizer.transform(trainingData)
tokenizedTrain.show(truncate=False, n=5)

+---------------------------------------------------------------------------------------------+--------+-------------------------------------------------------------------------------------------------------------------+
|Text                                                                                         |Polarity|SeWords                                                                                                            |
+---------------------------------------------------------------------------------------------+--------+-------------------------------------------------------------------------------------------------------------------+
|       i really2 don't like this condition. sucksssssss                                      |0       |[, , , , , , , i, really2, don't, like, this, condition., sucksssssss]                                             |
|      My current headset is on its deathbed now!  My dad gave it to me just 3 weeks back!    |0       |[, , , , , ,
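
The empty strings at the start of each SeWords list are consistent with the Tokenizer lowercasing the text and splitting on every single whitespace character, so that each leading space yields one empty token. A minimal plain-Python sketch of that behavior, assuming this splitting rule (`tokenize` and `drop_empty` are hypothetical helpers, not Spark APIs):

```python
import re


def tokenize(text):
    """Lowercase the text and split on every single whitespace character."""
    tokens = re.split(r"\s", text.lower())
    # Drop trailing empty strings; leading and interior ones are kept.
    while tokens and tokens[-1] == "":
        tokens.pop()
    return tokens


def drop_empty(tokens):
    """Optional cleanup step: remove empty tokens produced by repeated spaces."""
    return [token for token in tokens if token]


tokens = tokenize("   Need a hug ")
print(tokens)
# → ['', '', '', 'need', 'a', 'hug']
print(drop_empty(tokens))
# → ['need', 'a', 'hug']
```

Trimming the text or filtering out empty tokens before feature extraction would keep these artifacts out of the later steps; the notebook itself does not do this.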

### Removing unwanted information

One of the main preprocessing steps is to filter out unnecessary data. The SeWords column contains the list of all the words in the tweet. For effective analysis of the sentiment of the tweet, we need to remove words that carry little meaning, known as stop words. We remove them using StopWordsRemover, a Transformer that takes an array of strings and returns the array with all the defined stop words removed.

In [15]:
swr = StopWordsRemover(inputCol=tokenizer.getOutputCol(), 
                       outputCol="MeaningfulWords")
SwRemovedTrain = swr.transform(tokenizedTrain)
SwRemovedTrain.show(truncate=False, n=5)

+---------------------------------------------------------------------------------------------+--------+-------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+
|Text                                                                                         |Polarity|SeWords                                                                                                            |MeaningfulWords                                                             |
+---------------------------------------------------------------------------------------------+--------+-------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+
|       i really2 don't like this condition. sucksssssss                                      |0       |[,
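
The filtering step can be illustrated in plain Python. This is a sketch only: `STOP_WORDS` is a tiny hand-picked list and `remove_stop_words` is a hypothetical helper, whereas StopWordsRemover uses its own, much longer default list.

```python
# Tiny illustrative stop list; it is an assumption, not Spark's default list.
STOP_WORDS = {"i", "this", "don't", "is", "on", "its", "my", "it", "to", "me", "just"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercase form is not in the stop list."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["", "i", "really2", "don't", "like", "this", "condition.", "sucksssssss"]
print(remove_stop_words(tokens))
# → ['', 'really2', 'like', 'condition.', 'sucksssssss']
```

Note that empty tokens and tokens with attached punctuation (such as `condition.`) pass through unchanged, which matches what the MeaningfulWords column shows.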

### Creating numerical features

Numerical features are created from the MeaningfulWords column using the code below. The HashingTF function, which uses Austin Appleby's MurmurHash 3 algorithm, maps each word to an index in a fixed-size vector and counts its occurrences. The top 3 rows of the output are displayed after running the code.

In [16]:
hashTF = HashingTF(inputCol=swr.getOutputCol(), outputCol="features")
numericTrainData = hashTF.transform(SwRemovedTrain).select(
    'Polarity', 'MeaningfulWords', 'features')
numericTrainData.show(truncate=False, n=3)

+--------+----------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|Polarity|MeaningfulWords                                                             |features                                                                                                                |
+--------+----------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|0       |[, , , , , , , really2, like, condition., sucksssssss]                      |(262144,[9346,20263,157492,208258,249180],[1.0,1.0,1.0,1.0,7.0])                                                        |
|0       |[, , , , , , current, headset, deathbed, now!, , dad, gave, 3, weeks, back!]|(262144,[89074,92854,107144,114629,132612,133824,153489,233502,233677,249180]
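
Each features value above is a sparse vector of the form (size, indices, counts). The hashing idea behind it can be sketched in plain Python. Two assumptions are made loudly here: `hashing_tf` is a hypothetical helper, and CRC32 stands in for MurmurHash 3, so the indices it produces will not match Spark's.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=262144):
    """Map tokens to a sparse (size, indices, counts) term-frequency vector."""
    counts = Counter()
    for token in tokens:
        # CRC32 is a stand-in hash for illustration; Spark's HashingTF
        # uses MurmurHash 3, so these indices differ from Spark's.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1.0
    indices = sorted(counts)
    return num_features, indices, [counts[i] for i in indices]


size, indices, values = hashing_tf(["", "", "", "like", "condition."])
print(size, indices, values)
```

Repeated tokens land in the same bucket and add up. In the sample rows, index 249180 carries a count equal to the number of empty tokens in the row (for example 7.0 for the first training row), so it appears to be the bucket of the empty string.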

### Training the model

To train our classifier model we use logistic regression. The LogisticRegression class is imported from pyspark, and the features and Polarity columns of the training DataFrame are passed as inputs, with the maximum number of iterations set to 10 and the regularization parameter set to 0.01.

In [17]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol="Polarity", featuresCol="features", 
                        maxIter=10, regParam=0.01)
model = lr.fit(numericTrainData)
print ("Training is done!")

Training is done!
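
To make the roles of the iteration limit and the regularization parameter concrete, here is a minimal sketch of L2-regularized logistic regression on toy data. It is not what Spark runs: plain gradient descent is swapped in for Spark's own optimizer, there is a single dense feature, the labels are 0 and 1, and all names (`train_logistic`, `predict`, `learning_rate`) are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, max_iter=200, learning_rate=0.5, reg_param=0.01):
    """Fit weight w and bias b for one feature with an L2 penalty on w."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(max_iter):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= learning_rate * (grad_w + reg_param * w)  # the penalty shrinks w
        b -= learning_rate * grad_b
    return w, b


def predict(w, b, x):
    """Return class 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Toy data: negative x values carry label 0, positive values label 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
w, b = train_logistic(xs, ys)
print([predict(w, b, x) for x in xs])
# → [0, 0, 1, 1]
```

A larger `reg_param` pulls the weight toward zero, and a smaller `max_iter` stops the fit earlier; both trade some fit on the training data for a simpler or faster model.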


### Prepare testing data

The test data is prepared in the same way as the training data. Each tweet text is split into a list of words and the stop words are removed. Features are then computed from the meaningful words using the same hashTF transformer. The top 2 rows of the test DataFrame are shown below.

In [18]:
tokenizedTest = tokenizer.transform(testingData)
SwRemovedTest = swr.transform(tokenizedTest)
numericTest = hashTF.transform(SwRemovedTest).select(
    'Polarity', 'MeaningfulWords', 'features')
numericTest.show(truncate=False, n=2)

+--------+---------------------------------------------------------------------+----------------------------------------------------------------------------------------+
|Polarity|MeaningfulWords                                                      |features                                                                                |
+--------+---------------------------------------------------------------------+----------------------------------------------------------------------------------------+
|0       |[, , , , , , , , , , fuck, you!]                                     |(262144,[164046,237111,249180],[1.0,1.0,10.0])                                          |
|0       |[, , , , , , , , , , want, ben&amp;jerrys, cake, batter, please, ugh]|(262144,[13007,56397,137422,190256,230921,249180,252290],[1.0,1.0,1.0,1.0,1.0,10.0,1.0])|
+--------+---------------------------------------------------------------------+----------------------------------------------------------------------

### Predict testing data and calculate the model accuracy

The test data is passed into the trained model, and the prediction made by the model is stored in the column named 'prediction'.
The accuracy of the model is calculated on the test data by comparing the prediction with the Polarity label of each tweet. Accuracy = (number of correct predictions made by the model) / (total number of predictions made by the model). We obtained an accuracy of 72.5%, as reported in the output of the cell below.

In [19]:
prediction = model.transform(numericTest)

predictionFinal = prediction.select(
    'MeaningfulWords', 'prediction', 'Polarity')
predictionFinal.show(n=10, truncate = False)

# Count the rows where the predicted class equals the true label
correctPrediction = predictionFinal.filter(
    predictionFinal['prediction'] == predictionFinal['Polarity']).count()
totalData = predictionFinal.count()
accuracy = float(correctPrediction) / float(totalData)
print('correct prediction:', correctPrediction, 'total data:', totalData,
      'accuracy:', accuracy)

+----------------------------------------------------------------------------------------------------+----------+--------+
|MeaningfulWords                                                                                     |prediction|Polarity|
+----------------------------------------------------------------------------------------------------+----------+--------+
|[, , , , , , , , , , fuck, you!]                                                                    |0.0       |0       |
|[, , , , , , , , , , want, ben&amp;jerrys, cake, batter, please, ugh]                               |0.0       |0       |
|[, , , , , , , , head, feels, like, bowling, ball]                                                  |0.0       |0       |
|[, , , , , #canucks]                                                                                |0.0       |0       |
|[, , , , , jb, isnt, showing, australia, more!]                                                     |0.0       |0       |
|[, , , , , fucc
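
The accuracy formula used above can be checked on a small example in plain Python. The values below are made up for illustration and `accuracy` is a hypothetical helper, not part of the notebook.

```python
def accuracy(predictions, labels):
    """Fraction of positions where the prediction equals the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / float(len(labels))


# Made-up values: predictions are floats and labels are integers,
# mirroring the types of the 'prediction' and 'Polarity' columns.
preds = [0.0, 4.0, 4.0, 0.0]
labels = [0, 4, 0, 0]
print(accuracy(preds, labels))
# → 0.75
```

Three of the four predictions match their labels, so the accuracy is 3 / 4 = 0.75.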

The trained model is saved as 'comodel' and is used in Part 2 of the project.

In [15]:
model.save('comodel')