# Step 1: Get tweets.

To understand and estimate how Twitter users feel about real news, we first need a lot of tweets. Our first task is to collect 10K tweets.

Instead of going straight for 10K tweets, I first collected a sample of 500 tweets. These were searched specifically for the hashtags "#modi", "#commonwealth", "#facebook", and "#music".

In [0]:
import tweepy
from tweepy import OAuthHandler

customer_key = "secret"
customer_secret = "secret"
access_token = "secret"
access_secret = "secret"

auth = OAuthHandler(customer_key, customer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

In the code block above, I supply tweepy with my Twitter credentials, and the variable 'api' is now ready to be used to grab new tweets. Next, we create a MongoDB database and a collection to store these tweets.

In [0]:
from pymongo import MongoClient
client = MongoClient()
db = client.twitterDb
posts = db.modiPosts

The next block grabs tweets with the hashtag #modi and stores them in the database.

In [0]:
import json
tweets = api.search(q = "#modi", count = 100)
for tweet in tweets:
    posts.insert_one(tweet._json)

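The next block of code grabs tweets with the hashtag #facebook.

In [0]:
# Note: this collection is named `music` even though it stores #facebook tweets.
posts = db.music
# The standard search API returns at most 100 tweets per request, even when count is larger.
tweets = api.search(q = "#facebook", count = 200)
for tweet in tweets:
    posts.insert_one(tweet._json)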

The queries can be combined into a single query; let's use this technique to get 10K tweets.
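As a minimal sketch of the combining step, assuming Twitter's search syntax accepts `OR` between terms, the hashtags could be joined into one query string. The helper name `build_query` is hypothetical and not part of tweepy.

```python
# Hypothetical helper: join several hashtags into one search query string.
def build_query(hashtags):
    # In Twitter's search syntax, "a OR b" matches tweets containing either term.
    return " OR ".join(hashtags)

hashtags = ["#modi", "#commonwealth", "#facebook", "#music"]
query = build_query(hashtags)
print(query)
```

The resulting string can be passed as `q` to `api.search`. Reaching 10K tweets would still take repeated requests, since each request returns a limited number of tweets.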

# Step 2: Clean up the data
Once we are done collecting the tweets, we need to clean the data. Tweets contain all kinds of languages, expressions, and slang, but the NLP library we use to analyse user sentiment is limited in its capabilities, so this step is necessary.

Let us retrieve the data from the raw collection of 10K tweets.




In [0]:
from pymongo import MongoClient
from bson.json_util import dumps
client = MongoClient()
db = client.twitterDb
modiPosts = db.zurkerbergposts
texts = []
postsJson = []
for post in modiPosts.find():
    texts.append(post['text'])
    print (post['text'])

Let us convert the text to lower case, as most of the libraries and word collections that we will be using are in lower case. This keeps later comparisons simple.


In [0]:
texts = [text.lower() for text in texts]


Special characters mean nothing to the target libraries, so we can remove them too. We also make sure that only lowercase letters, digits, and whitespace remain.

In [0]:
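import re

# Replace every run of non-word characters with a single space.
texts = [re.sub(r'\W+', ' ', text) for text in texts]
# Drop anything that is not a lowercase letter, a digit, or whitespace.
texts = [re.sub(r'[^a-z0-9\s]', '', text) for text in texts]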

To remove any words that don't make sense, we use a collection of all English words and remove the words not occurring in this collection.

In [0]:

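import re

# Load the word list into a set so that `in` tests whole words rather than substrings
# (this assumes eng_words.txt is whitespace-separated, e.g. one word per line).
with open('eng_words.txt', 'rt') as f:
    englishWords = set(f.read().lower().split())

# Tokens left over from retweet markers and t.co links.
noiseWords = {'rt', 'co', 't', 'c'}

cleanTexts = []
for text in texts:
    words = re.split(r'\W+', text)
    words = [word for word in words if word in englishWords and word not in noiseWords]
    text = ' '.join(words)
    #print(len(text))
    cleanTexts.append(text)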
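To see what the cleaning steps do to a single tweet, here is a small self-contained sketch. The word set is a tiny stand-in for eng_words.txt and the sample tweet is made up; `clean_tweet` is a hypothetical helper that chains the lower-casing, character filtering, and word filtering described in this step.

```python
import re

# Tiny stand-in for eng_words.txt; the real word list is assumed to be much larger.
english_words = {"india", "wins", "the", "match", "great"}
noise_words = {"rt", "co", "t", "c"}

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r'\W+', ' ', text)          # punctuation and symbols become spaces
    text = re.sub(r'[^a-z0-9\s]', '', text)   # keep only a-z, 0-9 and whitespace
    words = re.split(r'\W+', text)
    words = [w for w in words if w in english_words and w not in noise_words]
    return ' '.join(words)

sample = "RT @fan: India WINS the match!! great https://t.co/abc123 #KXIPvCSK"
print(clean_tweet(sample))  # india wins the match great
```

Mentions, links, and hashtags that are not dictionary words drop out, which is the intended effect before the text is handed to the sentiment library.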

## Sentiment analysis of tweets
Once this is done, the data is clean and can be analysed for sentiment. This is done with the sentiment property of TextBlob.

In [0]:
from textblob import TextBlob
for text in cleanTexts:
    sentiment = TextBlob(text).sentiment
    #print (len(text), text, sentiment)
len(cleanTexts)

These sentiment values are then attached to the corresponding clean tweets, keeping only tweets with a subjectivity above 0.2. We will create a new database with the clean values.

In [0]:
from bson.json_util import loads
from bson.json_util import dumps

# Destination for the cleaned tweets; the names match the collection read in Step 3.
collection = client.twitterCleanDb.tenKtweetsCleaned

# Raw tweets, in the same order in which cleanTexts was built.
results = list(modiPosts.find())

for i in range(len(cleanTexts)):
    if len(cleanTexts[i]) > 0:
        sentiment = TextBlob(cleanTexts[i]).sentiment
        # Keep only tweets that express some opinion.
        if sentiment.subjectivity > 0.2:
            result = results[i]
            jsonObj = loads(dumps(result))
            jsonObj['polarity'] = sentiment.polarity
            jsonObj['subjectivity'] = sentiment.subjectivity
            jsonObj['cleantest'] = cleanTexts[i]
            #print(jsonObj, '\n\n', cleanTexts[i], '\n\n\n')
            collection.insert_one(jsonObj)
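The filtering rule in the loop above can be sketched with plain dictionaries, without MongoDB or TextBlob. The helper `annotate` is hypothetical, and the sentiment numbers passed to it are made-up inputs rather than real TextBlob output.

```python
# Hypothetical helper mirroring the loop above with plain dicts.
def annotate(tweet, clean_text, polarity, subjectivity, min_subjectivity=0.2):
    # Skip empty texts and tweets that read as mostly objective.
    if not clean_text or subjectivity <= min_subjectivity:
        return None
    annotated = dict(tweet)  # shallow copy so the raw tweet is left untouched
    annotated['polarity'] = polarity
    annotated['subjectivity'] = subjectivity
    annotated['cleantest'] = clean_text
    return annotated

raw = {'id': 1, 'text': 'Great match!!'}
kept = annotate(raw, 'great match', 0.8, 0.75)      # subjective enough: kept
dropped = annotate(raw, 'match report', 0.0, 0.0)   # objective: dropped
print(kept)
print(dropped)
```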

# Step 3: Analysis and representation of results

As we have already calculated sentiments of each of the tweets, we can now head to representing the analysis and visualizing the results.

## Getting top news 
Let's get the top phrases that occur in cleaned data. 


In [0]:
from pymongo import MongoClient
import pandas as pd

client = MongoClient()
db = client.twitterCleanDb
tweetCollection = db.tenKtweetsCleaned

hashTagCollection = tweetCollection.find({'entities.hashtags.text': {'$ne': None}})
text = []
hashtags = []
for tweet in hashTagCollection:
    if len(tweet['entities']['hashtags']) > 0:
        hashtags.append(tweet['entities']['hashtags'][0]['text'])
        text.append(tweet['cleantest'])

df = pd.DataFrame({'tags': hashtags, 'text': text})
tags = df['tags'].value_counts().index.tolist()
text = df['text'].value_counts().index.tolist()

print(tags[:5])
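For readers without pandas, the same "most frequent hashtags" idea can be sketched with the standard library's `collections.Counter`. The hashtag list here is an illustrative stand-in, not the real data.

```python
from collections import Counter

# Illustrative stand-ins for the first hashtag of each tweet.
sample_hashtags = ['Syria', 'KXIPvCSK', 'Syria', 'Hearties', 'KXIPvCSK', 'KXIPvCSK']

counts = Counter(sample_hashtags)
# most_common(n) returns (tag, count) pairs ordered from most to least frequent.
top_two = [tag for tag, _ in counts.most_common(2)]
print(top_two)  # ['KXIPvCSK', 'Syria']
```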

The output is ['KXIPvCSK', 'TreCru', 'Syria', 'PremiosMTVMiaw', 'Hearties']. To get news related to these topics I used newsapi.org, and the script below does the job for us.

In [0]:
top_tags = tags[:5]

import urllib.request
for tag in top_tags:
    contents = urllib.request.urlopen("https://newsapi.org/v2/everything?q="+tag+"&from=2018-04-14&to=2018-04-16&apiKey=*****************mykey******").read()
 

These outputs are stored as JSON. However, for the hashtag "TreCru" the news source returned no related results. The date range runs from April 14 to April 16 because my data set was collected in that time frame.
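Building the request URL by string concatenation works for simple tags, but a tag containing spaces or symbols would need escaping. A minimal sketch using `urllib.parse.urlencode` follows; the helper name `news_url` and the placeholder key are assumptions, while the endpoint and parameter names are the ones used in the script above.

```python
from urllib.parse import urlencode

# Hypothetical helper that assembles the newsapi.org request URL.
def news_url(tag, api_key, start="2018-04-14", end="2018-04-16"):
    # urlencode escapes characters that are unsafe in a query string.
    params = {'q': tag, 'from': start, 'to': end, 'apiKey': api_key}
    return "https://newsapi.org/v2/everything?" + urlencode(params)

print(news_url("KXIPvCSK", "YOUR_API_KEY"))
```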

For each of the news items, we calculate sentiments. But first let's see the news. 

In [0]:
import json
import urllib.request

for tag in top_tags:
    contents = urllib.request.urlopen("https://newsapi.org/v2/everything?q="+tag+"&from=2018-04-14&to=2018-04-16&apiKey=*****************mykey******").read().decode('utf-8')
    jsn = json.loads(contents)
    # Iterate over the articles actually returned; totalResults can exceed one page of results.
    for article in jsn['articles']:
        print("Source: " + article['source']['name'])
        print("Title: " + article['title'])
        if article['description'] is not None:
            print("Body: " + article['description'] + "\n\n\n")


Source: Indianexpress.com
Title: MS Dhoni’s unbeaten 79 in vain: Who said what on Twitter
Body: MS Dhoni scored an unbeaten 79 off 44 balls that included six boundaries and five maximums.



Source: The Times of India
Title: IPL 2018: Aaron Finch Falls For Consecutive Ducks After Marriage, Internet Advises Him To Go On Honeymoon
Body: Australian cricketer Aaron Finch is famed for his ability with the willow and is quite the maverick when he gets going. The power hitter has been associated with the Indian Premier League (IPL) for quite some time now and has also the record of playing for se…



Source: Indianexpress.com
Title: IPL 2018, KXIP vs CSK: Twitterati LOVE this picture of Yuvraj Singh helping out MS Dhoni
Body: IPL 2018, KXIP vs CSK: Chennai Super Kings might have had lost to Kings XI Punjab, but MS Dhoni's knock will be remembered. People also remember his grit when he did not let pain get the better of him and Yuvraj Singh and his camaraderie.



...........



## News sentiment analysis
TextBlob can also be used to analyse the sentiment of the news; the code is below.

In [0]:
import json
from textblob import TextBlob

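with open('consolidatednews.json') as f:
    consol_news = json.load(f)

for temp in consol_news['articles']:
    # Some articles come without a description, and TextBlob needs a string.
    if temp['description'] is None:
        continue
    sentiment = TextBlob(temp['description']).sentiment
    print(sentiment.polarity, sentiment.subjectivity)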

The sentiment values for the IPL news (polarity followed by subjectivity) are as follows.

0.0 0.0
0.0 0.0
0.41666666666666663 0.5833333333333333
0.2532467532467532 0.3792207792207792
0.3333333333333333 0.6666666666666666
0.0 0.0
0.3277777777777778 0.6555555555555554
0.0125 0.025
0.8 1.0
-0.23333333333333328 0.2222222222222222
0.0 0.0
0.1875 0.6666666666666666
0.1875 0.6666666666666666

Similarly, these values are calculated for all the entries. An average is taken per topic, and these averages can be considered the news sentiment values for the topics.
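The averaging step can be sketched as follows, using the (polarity, subjectivity) pairs from the IPL output above. The helper name `mean_sentiment` is hypothetical.

```python
# (polarity, subjectivity) pairs copied from the IPL news output.
ipl_sentiments = [
    (0.0, 0.0),
    (0.0, 0.0),
    (0.41666666666666663, 0.5833333333333333),
    (0.2532467532467532, 0.3792207792207792),
    (0.3333333333333333, 0.6666666666666666),
    (0.0, 0.0),
    (0.3277777777777778, 0.6555555555555554),
    (0.0125, 0.025),
    (0.8, 1.0),
    (-0.23333333333333328, 0.2222222222222222),
    (0.0, 0.0),
    (0.1875, 0.6666666666666666),
    (0.1875, 0.6666666666666666),
]

def mean_sentiment(pairs):
    # Average polarity and subjectivity separately over all articles.
    n = len(pairs)
    return (sum(p for p, _ in pairs) / n, sum(s for _, s in pairs) / n)

avg_polarity, avg_subjectivity = mean_sentiment(ipl_sentiments)
print(round(avg_polarity, 4), round(avg_subjectivity, 4))  # 0.1758 0.3743
```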

The average polarity and subjectivity values for the five topics, in the order listed earlier, are thus found to be:
1. 0.17578393828393826 0.3742562992562993
2. 0 0
3. 0.062395334928229663 0.1741377591706539
4. 0.05 0.06666666666666667
5. -0.0375 0.325	

From this we can see that #KXIPvCSK has the most positive news coverage of the five topics (average polarity of about 0.18), while #Hearties is the only topic whose average polarity is negative (about -0.04).

# Twitter Cloud

To represent the dispersion of tweets globally over a map, we will be using the library mplleaflet. The points are plotted as an ordinary matplotlib figure with longitude and latitude as coordinates, and the library then renders that figure on an interactive Leaflet map.

In order to achieve this we need coordinates, but sadly not every tweet comes with coordinates. However, a few users supply their location information along with their tweets, and this information can be used to plot them. To find such tweets and turn their location text into coordinates, we can use geopy.

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from pymongo import MongoClient
from geopy.geocoders import Nominatim

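# Newer geopy versions require an explicit user_agent; the name used here is a placeholder.
locate = Nominatim(user_agent = "tweet-sentiment-map")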
client = MongoClient()
db = client.twitterCleanDb
tweetCollection = db.tenKtweetsCleaned

times = []
polarities = []
subjectivities = []
locations = []
locationPolarities = []
lats = []
longs = []
modiTweets = tweetCollection.find({'cleantest' :{'$regex': '.*a*.'}})
for tweet in modiTweets:
    times.append(tweet['created_at'])
    polarities.append(tweet['polarity'])
    subjectivities.append(tweet['subjectivity'])
    
    # Try the free-text profile location first.
    latlong = locate.geocode(tweet['user']['location'], timeout = 10)
    if latlong is None:
        # Fall back to the user's time zone, which often names a city or region.
        latlong = locate.geocode(tweet['user']['time_zone'], timeout = 10)
        if latlong is None:
            print('not located')
        else:
            locations.append([latlong.latitude, latlong.longitude])
            print('appended 1', latlong, latlong.latitude, latlong.longitude)
            locationPolarities.append(tweet['polarity'])
    else:
        locations.append([latlong.latitude, latlong.longitude])
        print('appended 2', latlong, latlong.latitude, latlong.longitude)
        # Keep the polarity of each located tweet aligned with its coordinates.
        locationPolarities.append(tweet['polarity'])

In the code above, we also record the polarity of every tweet that could be located, so coordinates and sentiment values stay aligned. These lists make a good basis for data frames to analyse further.
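Many users share the same location string, and the public Nominatim service asks for no more than one request per second, so caching lookups can save a lot of time. The sketch below is one possible approach, not the method used in this notebook: `make_cached_locator` is a hypothetical wrapper, and `fake_geocode` stands in for a real geocoder such as the `geocode` method of the Nominatim object.

```python
import time

def make_cached_locator(geocode, pause_seconds=1.0):
    """Wrap a geocode(place) callable with a cache and a pause between real lookups."""
    cache = {}

    def lookup(*candidates):
        # Try each candidate string in order and return the first successful result.
        for place in candidates:
            if not place:                  # skip None and empty strings
                continue
            if place not in cache:
                cache[place] = geocode(place)
                time.sleep(pause_seconds)  # be polite to the geocoding service
            if cache[place] is not None:
                return cache[place]
        return None

    return lookup

# Stand-in geocoder for illustration only.
fake_places = {'Chennai': (13.08, 80.27)}
calls = []

def fake_geocode(place):
    calls.append(place)
    return fake_places.get(place)

lookup = make_cached_locator(fake_geocode, pause_seconds=0)
first = lookup('', 'Chennai')          # empty profile location, falls back
second = lookup('Nowhere', 'Chennai')  # 'Chennai' is served from the cache
print(first, second, calls)
```

In this run the stand-in geocoder is called only twice, once for 'Chennai' and once for 'Nowhere', even though 'Chennai' is looked up in both calls.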

Now let us examine the sentiment value variations with respect to time. 


In [0]:
from dateutil import parser
df = pd.DataFrame({'time':times, 'polarity': polarities, 'subjectivity': subjectivities})
# Parse the timestamps before sorting; sorting the raw strings would order them alphabetically.
df['time'] = df['time'].apply(parser.parse)
df = df.sort_values(by=['time'])
df.dtypes

Once the time column is parsed into datetime objects and the rows are sorted by time, the data frame can be plotted.
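Why the parsing matters can be shown with the standard library alone. This sketch assumes the classic `created_at` layout of the form "Mon Apr 16 09:30:00 +0000 2018"; the three timestamps are made up for illustration.

```python
from datetime import datetime

# Assumed created_at layout, e.g. "Mon Apr 16 09:30:00 +0000 2018".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

raw_times = [
    "Mon Apr 16 09:30:00 +0000 2018",
    "Sat Apr 14 18:05:00 +0000 2018",
    "Sun Apr 15 07:45:00 +0000 2018",
]

# Sorting the strings orders them by weekday name, not by date.
by_string = sorted(raw_times)
# Sorting on the parsed datetime gives the chronological order.
by_time = sorted(raw_times, key=lambda s: datetime.strptime(s, TWITTER_TIME_FORMAT))

print(by_string[0])  # Mon Apr 16 09:30:00 +0000 2018
print(by_time[0])    # Sat Apr 14 18:05:00 +0000 2018
```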

In [0]:
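plt.figure(figsize = (12, 5))

plt.title('Sentiment Comparison')
# Label each line so the two series can be told apart in the legend.
plt.plot(df['time'], df['polarity'], label = 'polarity')
plt.plot(df['time'], df['subjectivity'], label = 'subjectivity')

plt.xlabel('Time')
plt.ylabel('Sentiment Value')
plt.legend()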

# Analysis

We can see from the graph that there are multiple gaps in the data. This might be because tweets follow trends, so that no one was tweeting about these topics during those intervals.

By aggregating the latitude and longitude values and plotting them using mplleaflet, we can see the geographic distribution of tweets.

In [0]:
import mplleaflet

lats = [location[0] for location in locations]
longi = [location[1] for location in locations]

print(lats, " +  ", longi)

# locationPolarities was filled alongside locations, so the three columns line up.
mapdf = pd.DataFrame({'lat': lats, 'longi': longi, 'polarity': locationPolarities})
plt.figure(figsize=(8, 8))
# Colour each point by the polarity of its tweet.
plt.scatter(mapdf['longi'], mapdf['lat'], c = mapdf['polarity'])
mplleaflet.display()

In the final version deployed on Heroku, more emphasis was placed on user experience, and a wider variety of plots was produced.