# Subject: Advanced Data Analysis

# Module: Geospatial Analysis

## Session 5 - Geographic Analysis of Social Network Data 

### Demo 1 - Sentiment analysis and geospatial analysis of Carles Puigdemont's tweets using Python

We need to install the following packages:

- NumPy: This is the fundamental package for scientific computing with Python. Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data.
- Pandas: This is an open source library providing high-performance, easy-to-use data structures and data analysis tools.
- Tweepy: This is an easy-to-use Python library for accessing the Twitter API.
- Matplotlib: This is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.
- Seaborn: This is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
- TextBlob: This is a Python library for processing textual data. It provides a simple API for common natural language processing (NLP) tasks.

https://textblob.readthedocs.io/en/dev/

$ pip install -U textblob

$ python -m textblob.download_corpora

https://github.com/tweepy/tweepy

$ pip install tweepy

## 1. Extracting Twitter data (tweepy + pandas)

### 1.1. Importing our libraries

In [None]:
# General:
import tweepy           # To consume Twitter's API
import pandas as pd     # To handle data
import numpy as np      # For number computing

# For plotting and visualization:
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### 1.2. Creating a Twitter App

To extract tweets for later analysis, we need to log in to our Twitter account and create an app. The website to do this is https://apps.twitter.com/

From this app that we're creating we will save the following information:

- Consumer Key (API Key)

- Consumer Secret (API Secret)

- Access Token

- Access Token Secret


In [None]:
# API's setup:
def twitter_setup():
    """
    Utility function to setup the Twitter's API
    with our access keys provided.
    """
    # Authentication and access using keys:
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    
    # Return API with authentication:
    api = tweepy.API(auth)
    return api
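
The function above expects the four keys as module-level constants. A minimal sketch of one way to supply them without writing them into the notebook, assuming they have been exported as environment variables. The variable names (TWITTER_CONSUMER_KEY and so on) and the helper load_twitter_credentials are hypothetical choices for this illustration, not part of Tweepy:

```python
import os

# Hypothetical environment variable names; pick whatever suits your setup.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)

def load_twitter_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

The returned values could then be assigned to CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN and ACCESS_SECRET before calling twitter_setup().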

### 1.3. Tweets extraction

Now that we have a function to set up the Twitter API, we can use it to create an "extractor" object. We then call Tweepy's extractor.user_timeline(screen_name, count) to retrieve the most recent count tweets from the user screen_name.

As mentioned in the title, we use @KRLS (Carles Puigdemont's account) as the user whose tweets we extract for later analysis. The extraction works as follows:

In [None]:
# We create an extractor object:
extractor = twitter_setup()

# We create a tweet list as follows:
tweets = extractor.user_timeline(screen_name="KRLS", count=200)
print("Number of tweets extracted: {}.\n".format(len(tweets)))

# We print the most recent 5 tweets:
print("5 recent tweets:\n")
for tweet in tweets[:5]:
    print(tweet.text)
    print()

We now have an extractor and the extracted data, stored in the tweets variable. Note that each element in that list is a Tweepy tweet object, not plain text.

### 1.4. Creating a (pandas) DataFrame

We now have enough information to build a pandas DataFrame, which makes the data much easier to manipulate.

In [None]:
# We create a pandas dataframe as follows:
data = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['Tweets'])

# We display the first 10 elements of the dataframe:
display(data.head(10))

An interesting detail is the number of attributes and methods that a tweet object has in Tweepy:

In [None]:
# Internal methods of a single tweet object:
print(dir(tweets[0]))

The interesting part here is the amount of metadata contained in a single tweet. If we want data such as the creation date or the source the tweet was posted from, we can read it from these attributes.

In [None]:
from textblob import TextBlob  # Text processing: POS tagging, noun phrases, sentiment analysis, ...
import re                      # Regular expression matching

def clean_tweet(tweet):
    '''
    Utility function to clean the text in a tweet by removing
    mentions, links and special characters using regex.
    '''
    # Raw string avoids invalid-escape warnings for \w and \S.
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", tweet).split())

def analize_sentiment(tweet):
    '''
    Utility function to classify the polarity of a tweet
    using textblob.
    '''
    analysis = TextBlob(clean_tweet(tweet))
    if analysis.sentiment.polarity > 0:
        return 1
    elif analysis.sentiment.polarity == 0:
        return 0
    else:
        return -1
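
To see what the cleaning step actually removes, here is a small self-contained sketch that repeats the same pattern and applies it to two made-up sample strings. The helper polarity_to_label is a hypothetical name that restates the thresholding rule used above without calling TextBlob. Two caveats are worth keeping in mind: the pattern keeps only ASCII letters, so accented characters are dropped, and TextBlob's default sentiment analyzer is built for English, so scores for Catalan or Spanish tweets should be read with caution.

```python
import re

# Same pattern as clean_tweet above: mentions, non-alphanumeric characters, URLs.
PATTERN = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)")

def clean_tweet(tweet):
    """Replace every match with a space, then collapse repeated whitespace."""
    return " ".join(PATTERN.sub(" ", tweet).split())

def polarity_to_label(polarity):
    """Map a polarity score in [-1, 1] to 1, 0 or -1 (same rule as above)."""
    if polarity > 0:
        return 1
    if polarity == 0:
        return 0
    return -1

sample = "@user Check this out: https://example.com #great!!"
print(clean_tweet(sample))                        # Check this out great

# "Què és això?" written with escapes; the accented letters are stripped.
print(clean_tweet("Qu\u00e8 \u00e9s aix\u00f2?"))  # Qu s aix
```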

https://link.springer.com/chapter/10.1007/978-3-319-47602-5_40

In [None]:
# We create a column with the result of the analysis:
data['SA'] = np.array([ analize_sentiment(tweet) for tweet in data['Tweets'] ])

# We display the updated dataframe with the new column:
display(data.head(10))

In [None]:
type(data)

In [None]:
data

### 2.2. Analyzing the results

#### 2.2.1. Calculation of the percentage of the classified tweets

In [None]:
# We construct lists with classified tweets:

pos_tweets = data.loc[data['SA'] > 0, 'Tweets'].tolist()
neu_tweets = data.loc[data['SA'] == 0, 'Tweets'].tolist()
neg_tweets = data.loc[data['SA'] < 0, 'Tweets'].tolist()

In [None]:
pos_tweets

In [None]:
neu_tweets

In [None]:
neg_tweets

Now that we have the lists, we just print the percentages:

In [None]:
# Twitter App access keys for @user
# Replace the placeholders with the keys of your own app; do not publish real keys.

# Consumer:
CONSUMER_KEY    = 'YOUR_CONSUMER_KEY'
CONSUMER_SECRET = 'YOUR_CONSUMER_SECRET'

# Access:
ACCESS_TOKEN  = 'YOUR_ACCESS_TOKEN'
ACCESS_SECRET = 'YOUR_ACCESS_SECRET'

In [None]:
data['timestamp'] = data['timestamp'].dt.strftime('%Y-%m-%d')
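
The format string '%Y-%m-%d' uses the standard strftime directives. A minimal sketch with the standard library, using a made-up timestamp, shows what the conversion produces. Note that the result is a plain string, so after this step the column no longer supports datetime accessors and comparisons such as == '2018-01-23' are string comparisons.

```python
from datetime import datetime

# Made-up creation time standing in for a tweet's created_at value.
created = datetime(2018, 1, 23, 14, 5, 9)

day = created.strftime('%Y-%m-%d')  # date part only, as a string
hour = created.hour                 # integer hour of the day

print(day)   # 2018-01-23
print(hour)  # 14
```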

## 2. Sentiment analysis

### 2.1. Importing textblob

Returning to the code, we add an extra column to our data that holds the sentiment analysis result, then inspect the DataFrame to see the update:

In [None]:
# Graph of the Polarity by dates (2017 and 2018)
df1.plot(kind='line', x='timestamp', y='Polarity', title='Polarity by date')
plt.xticks(rotation='vertical', fontsize=11)
plt.show()
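
Because several tweets can share the same date, a line drawn through the raw rows can zigzag within a single day. One possible refinement, sketched here on made-up toy data rather than the real df1, is to aggregate to one value per day before plotting. The column names match those used above; the numbers are invented for illustration.

```python
import pandas as pd

# Toy data standing in for df1; the values are made up for illustration.
toy = pd.DataFrame({
    "timestamp": ["2018-01-22", "2018-01-22", "2018-01-23"],
    "Polarity": [1, -1, 1],
})

# One row per day: mean polarity and number of tweets on that day.
daily = toy.groupby("timestamp")["Polarity"].agg(["mean", "count"])
print(daily)
```

The resulting frame could then be plotted with daily['mean'].plot(kind='line') in place of the per-tweet plot.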

In [None]:
# Graph of the Polarity by hour for a specific day
# New column with the hour of each tweet ('Date' holds full timestamps, so no '%H:%M' format is needed)
df1['Hour'] = pd.to_datetime(df1['Date']).dt.hour
df1.head()

In [None]:
df2 = df1[df1['timestamp'] == '2018-01-23']
df2.plot(kind='line', x='Hour', y='Polarity', title='Polarity on 23/01/2018')
plt.xticks(rotation='vertical', fontsize=11)
plt.show()

In [None]:
# Graph of the Likes by dates (2017 and 2018)
df1.plot(kind='line', x='timestamp', y='Likes', title='Likes by date')
plt.xticks(rotation='vertical', fontsize=11)
plt.show()

## 3. Export the dataframe

In [None]:
data.to_csv('data_twitter.csv')

## 4. QuantumGIS processing

### Task: 

Because Carles Puigdemont has not enabled Twitter's location sharing, we simulate tweet locations by creating 200 random points within the administrative boundaries of Brussels.

- In QuantumGIS, create a shapefile with 200 random points inside Brussels. Use the function "Random Points in layer bounds" (Menu Vector, Research tools).

- Add the longitude and latitude columns with the option "Export/Add geometry columns" (Menu Vector, Geometry tools).

- Load the exported CSV file ('data_twitter.csv') and join it with the random points layer.
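
For readers who want to prototype the first step outside QuantumGIS, the random-point generation can be sketched in plain Python. This is only an illustration under stated assumptions: the bounding box values below are rough, assumed coordinates around Brussels, not the official boundary, and random_points is a hypothetical helper. Like sampling within layer bounds, it draws uniformly inside a rectangle, so some points may fall outside the true administrative limits.

```python
import random

# Approximate bounding box around Brussels (assumed values, illustration only).
LON_MIN, LON_MAX = 4.24, 4.48
LAT_MIN, LAT_MAX = 50.76, 50.91

def random_points(n, seed=None):
    """Return n (longitude, latitude) pairs drawn uniformly inside the box."""
    rng = random.Random(seed)
    return [
        (rng.uniform(LON_MIN, LON_MAX), rng.uniform(LAT_MIN, LAT_MAX))
        for _ in range(n)
    ]

# One simulated location per extracted tweet.
points = random_points(200, seed=42)
print(len(points))
```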

## 5. Import the joined shapefile to Jupyter and create a geodataframe

In [None]:
# We print info from the first tweet:
print(tweets[0].id)
print(tweets[0].created_at)
print(tweets[0].source)
print(tweets[0].favorite_count)
print(tweets[0].retweet_count)
print(tweets[0].geo)
print(tweets[0].coordinates)
print(tweets[0].entities)

In [None]:
# We add relevant data:
data['len']  = np.array([len(tweet.text) for tweet in tweets])
data['ID']   = np.array([tweet.id for tweet in tweets])
data['Date'] = np.array([tweet.created_at for tweet in tweets])
data['Source'] = np.array([tweet.source for tweet in tweets])
data['Likes']  = np.array([tweet.favorite_count for tweet in tweets])
data['coordinates']    = np.array([tweet.coordinates for tweet in tweets])

In [None]:
# We convert the Date type:
# 'Date' already holds datetime values, so no epoch unit is needed
data['timestamp'] = pd.to_datetime(data['Date'])

In [None]:
# We print percentages:

print("Percentage of positive tweets: {}%".format(len(pos_tweets)*100/len(data['Tweets'])))
print("Percentage of neutral tweets: {}%".format(len(neu_tweets)*100/len(data['Tweets'])))
print("Percentage of negative tweets: {}%".format(len(neg_tweets)*100/len(data['Tweets'])))
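
The same percentages can also be computed directly from the list of labels, without building three intermediate lists. A minimal sketch using only the standard library; label_percentages is a hypothetical helper and the input list is made up:

```python
from collections import Counter

def label_percentages(labels):
    """Return the percentage of positive (1), neutral (0) and negative (-1) labels."""
    total = len(labels)
    if total == 0:
        # Avoid dividing by zero when no tweets were extracted.
        return {1: 0.0, 0: 0.0, -1: 0.0}
    counts = Counter(labels)
    return {label: 100 * counts[label] / total for label in (1, 0, -1)}

print(label_percentages([1, 1, 0, -1]))  # {1: 50.0, 0: 25.0, -1: 25.0}
```

With the DataFrame from this notebook, the call would be label_percentages(data['SA'].tolist()).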

#### 2.2.2. Infographics of the tweets

In [None]:
import geopandas as gpd
gdf = gpd.read_file('Random_points_twitter.shp')
gdf = gdf.to_crs(epsg=4326)  # Reproject to WGS84 (the {'init': ...} form is deprecated)

In [None]:
gdf

In [None]:
gdf.dtypes

In [None]:
# Convert the Polarity field (data_twi_8 in the joined shapefile) to string so it can be used as a map attribute
gdf['data_twi_8'] = gdf['data_twi_8'].astype(str)

## 6. Webmapping of Tweets with Folium

### 6.1. Installation:

https://github.com/python-visualization/folium

$ pip install folium

### 6.2. Creation of a webmap with twitter location by date

In [None]:
import folium

In [None]:
# Create a Twitter basemap specifying map center, zoom level, and using the CartoDB Positron tiles
Twitter_map = folium.Map([45.955263, 8.935129], tiles='cartodbpositron', zoom_start=5)
Twitter_map

https://deparkes.co.uk/2016/06/10/folium-map-tiles/

In [None]:
# Pandas DataFrame with the Sentiment Analysis results
data.head()

In [None]:
# New Pandas DataFrame with a new name of the field including the Sentiment Analysis results (SA)
df1 = data.rename(columns={'SA': 'Polarity'})
df1.head()