## twitter-hashtag-analysis #kabali
I was excited to do exploratory data analysis with PySpark on the #Kabali hashtag dataset, downloaded from Twitter over 7 days (21/07 to 28/07). I used the tweepy Python library to access the Twitter API and download the #kabali tweets. You can find the script [here](https://github.com/dmanojbabu/twitter-hashtag-analysis/blob/master/twitter-hashtag-scrapper.py).

In [2]:
# load the 7 days of scraped tweets
df = sqlContext.read.load('dbfs:///FileStore/tables/302cowo81470664461416/kabali_tweets_final.tsv', 
                          format='com.databricks.spark.csv', 
                          header='false', 
                          delimiter='\t',
                          inferSchema='true')

#### Defining a Function for plotting the results:

In [4]:
from spark_notebook_helpers import prepareSubplot, np, plt, cm

def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    
    #turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    
    #now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()
        
def plot_tweet_category(x,y,title):
  # horizontal bar chart: x holds the labels, y the counts
  # (remove_border is applied once the bars are drawn, further down)
  plt.style.use('ggplot')
  fig = plt.figure(figsize=(7, 5))
  pos = np.arange(len(y))
  plt.barh(pos, y)
  plt.title(title)
  
  #add the numbers to the side of each bar
  for p, c, ch in zip(pos, x, y):
    plt.annotate(str(ch), xy=(ch + 1, p + .5), va='center')
  
  #customize ticks
  ticks = plt.yticks(pos + .5, x)
  xt = plt.xticks()[0]
  plt.xticks(xt, [' '] * len(xt))
  remove_border(left=False, bottom=False)
  #set plot limits & layout
  plt.ylim(pos.max(), pos.min() - 1)
  plt.tight_layout()
  return fig

#### Data Cleansing:
Check whether null values exist in any of the columns:

In [6]:
df.printSchema()

In [7]:
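from pyspark.sql.functions import col, sum as sum_  # alias avoids shadowing the built-in sum

def count_null(col_name):
  # cast the isNull() flag to 0/1 and sum it to get the null count for the column
  return sum_(col(col_name).isNull().cast('integer')).alias(col_name)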

exprs = []
for col_name in df.columns:
  exprs.append(count_null(col_name))
  
df.agg(*exprs).show()
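
To make the idea behind `count_null` concrete: casting each `isNull()` flag to an integer and summing gives the number of nulls per column. The sketch below mimics that on plain Python dictionaries. The sample rows and the helper name `count_nulls_per_column` are made up for illustration and are not part of the dataset or the notebook.

```python
def count_nulls_per_column(rows, columns):
    """Count None values per column, mirroring sum(isNull().cast('integer'))."""
    counts = {}
    for name in columns:
        # int(True) == 1 and int(False) == 0, so the sum is the null count
        counts[name] = sum(int(row.get(name) is None) for row in rows)
    return counts

# hypothetical sample rows, not taken from the real tweets
sample_rows = [
    {'user_name': 'a', 'retweet_count': 3, 'favorite_count': None},
    {'user_name': 'b', 'retweet_count': None, 'favorite_count': None},
    {'user_name': 'c', 'retweet_count': 1, 'favorite_count': 2},
]

null_counts = count_nulls_per_column(
    sample_rows, ['user_name', 'retweet_count', 'favorite_count'])
print(null_counts)
```

A column with a non-zero count is a candidate for `na.fill`, which is what the next cell does for the two count columns.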

In [8]:
#fix null values with default value
df = df.na.fill({'retweet_count': 0,'favorite_count': 0})
df.agg(*exprs).show()

No null values remain, so we are good to proceed.

#### Summary statistics:
This gives a quick summary of the numeric columns of the tweets DataFrame. Let's try it here:

In [11]:
display(df.describe())

#### 1.	How many tweets contain the hashtag #kabali (over 7 days, starting the day before the film's release)?

In [13]:
#filterRetweets = (df.map(lambda line: line['tweet_text']).filter(lambda x:x.startswith('RT')))
filterRetweets = df[df['tweet_text'].startswith('RT')]
filtertweets = df[~df['tweet_text'].startswith('RT')]
# print(...) with a single argument works on both Python 2 and Python 3
print('Total Tweets #kabali from 21/07/2016 to 28/07/2016 : {0}'.format(df.count()))
print('Number of original tweets of #kabali             : {0}'.format(filtertweets.count()))
print('Number of retweets of #kabali                    : {0}'.format(filterRetweets.count()))
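
One caveat on the prefix test above: `startswith('RT')` also matches an original tweet that merely begins with the letters "RT". A stricter check, sketched below in plain Python, looks for the conventional `RT @user` prefix. The sample tweets and the `is_retweet` helper are illustrative assumptions, not part of the notebook.

```python
import re

# conventional retweet prefix: "RT", whitespace, then an @mention
RETWEET_PREFIX = re.compile(r'^RT\s+@\w+')

def is_retweet(tweet_text):
    """Return True when the text starts with the 'RT @user' convention."""
    return bool(RETWEET_PREFIX.match(tweet_text))

# made-up examples for illustration
sample_tweets = [
    'RT @someone: #kabali first day first show!',
    'RTing nothing today, just watched #kabali',
    'Loved #kabali',
]

retweets = [t for t in sample_tweets if is_retweet(t)]
originals = [t for t in sample_tweets if not is_retweet(t)]
print('retweets: {0}, originals: {1}'.format(len(retweets), len(originals)))
```

In Spark, the same idea could probably be expressed with a regular-expression filter such as `df['tweet_text'].rlike('^RT\\s+@\\w+')`, though that has not been run against this dataset.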

#### 2.	What is the trend of these tweets over time?

In [15]:
from pyspark.sql import functions as F

from_pattern = 'MMM d, yyyy h:mm:ss aa'
to_pattern = 'yyyy-MM-dd'

df = df.withColumn('tweet_date', F.from_unixtime(F.unix_timestamp(df['created_at'], from_pattern), to_pattern))
# count tweets per day, in chronological order
date_to_count_df = (df.groupBy('tweet_date').count().sort('tweet_date', ascending=True))
data = date_to_count_df.take(20)
display(data)
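
The two patterns above are Java `SimpleDateFormat` patterns. As a sanity check outside Spark, the same conversion can be sketched with Python's `datetime`. The `strptime` format below is an assumed equivalent of `from_pattern`, and the sample timestamp is invented.

```python
from datetime import datetime

# assumed Python equivalents of the SimpleDateFormat patterns in the cell above
PY_FROM_PATTERN = '%b %d, %Y %I:%M:%S %p'   # 'MMM d, yyyy h:mm:ss aa'
PY_TO_PATTERN = '%Y-%m-%d'                  # 'yyyy-MM-dd'

def to_tweet_date(created_at):
    """Parse a created_at string and keep only the calendar date."""
    parsed = datetime.strptime(created_at, PY_FROM_PATTERN)
    return parsed.strftime(PY_TO_PATTERN)

# hypothetical timestamp in the assumed input format
tweet_date = to_tweet_date('Jul 21, 2016 9:05:10 PM')
print(tweet_date)
```

Because the output pattern sorts lexicographically in date order, sorting on the resulting string column gives a chronological trend.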

#### 3.	What are the languages used for the tweets?

In [17]:
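# langcode_map is assumed to be a DataFrame defined elsewhere, with 'code' and 'language' columns
df_with_language = df.join(langcode_map, df['lang'] == langcode_map['code'])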
lang_to_count_df =(df_with_language.groupBy('language').count().sort('count',ascending=False))

data = lang_to_count_df.collect()
x, y = zip(*data)
fig = plot_tweet_category(x,y,'#kabali tweet languages used')
display(fig)
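
`langcode_map` is not defined in this notebook excerpt. The join only shows that it has a `code` and a `language` column. The sketch below imitates the join-then-count step in plain Python, with a hypothetical three-entry lookup and made-up `lang` values.

```python
from collections import Counter

# hypothetical subset of a language lookup (code -> language name)
LANG_NAMES = {'en': 'English', 'ta': 'Tamil', 'hi': 'Hindi'}

def count_languages(lang_codes, lang_names):
    """Inner-join codes against the lookup, then count tweets per language."""
    joined = [lang_names[code] for code in lang_codes if code in lang_names]
    # most_common() orders by count, descending, like sort('count', ascending=False)
    return Counter(joined).most_common()

# made-up lang values; 'und' has no match and is dropped, as in an inner join
sample_langs = ['en', 'ta', 'en', 'und', 'hi', 'en', 'ta']
language_counts = count_languages(sample_langs, LANG_NAMES)
print(language_counts)
```

Note that an inner join silently drops tweets whose code is missing from the lookup, so the per-language counts may add up to less than the total number of tweets.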

#### 4. Who is more active using the hashtag #kabali?

In [19]:
user_name_to_count_df =(df.groupBy('user_name').count().sort('count',ascending=False))
data = user_name_to_count_df.take(20)
x, y = zip(*data)
fig = plot_tweet_category(x,y,'#kabali Top 20 active users')
display(fig)

#### 5.	What are the top 10 most retweeted tweets?

In [21]:
user_name_to_count_df =(df.filter(~df['tweet_text'].startswith('RT')).sort('retweet_count',ascending=False))
data = user_name_to_count_df.select('user_name','tweet_text','retweet_count').take(10)
display(data)
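
Sorting in descending order and keeping the first rows is a top-N query. A minimal plain-Python sketch of the same idea follows. The `top_n` helper and the sample records are hypothetical and only illustrate the ranking step.

```python
import heapq

def top_n(tweets, key, n=10):
    """Return the n tweets with the largest value for `key`, highest first."""
    return heapq.nlargest(n, tweets, key=lambda tweet: tweet[key])

# made-up records for illustration
sample = [
    {'user_name': 'a', 'retweet_count': 5},
    {'user_name': 'b', 'retweet_count': 42},
    {'user_name': 'c', 'retweet_count': 17},
]
top_two = top_n(sample, 'retweet_count', n=2)
print([t['user_name'] for t in top_two])
```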

#### 6.	What are the top 10 most favorited tweets?

In [23]:
user_name_to_count_df =(df.filter(~df['tweet_text'].startswith('RT')).sort('favorite_count',ascending=False))
data = user_name_to_count_df.select('user_name','tweet_text','favorite_count').take(10)
display(data)

Cricketer Suresh Raina (@ImRaina) tops the chart!

#### 7.	What are the demographics of tweets involved in #kabali?

In [26]:
user_name_to_count_df =(df.groupBy('country').count().sort('count',ascending=False))
data = user_name_to_count_df.take(20)
x, y = zip(*data)
fig = plot_tweet_category(x,y,'#kabali demographics of tweets')
display(fig)

It seems most Twitter users have not filled in the country details in their profile.
###### We have some tweets from Pakistan, so let's check them out!

In [28]:
pak_tweets_df = (df.filter(df['country']=='Pakistan').select('user_name','tweet_date','tweet_text','country'))
display(pak_tweets_df)

#### 8. What are the most tweeted terms related to #kabali?
![Kabali word cloud](https://s10.postimg.org/oji5nlf0p/kabali_word_cloud.png)
![Kabali word mask](https://s9.postimg.org/bcfweyjjj/kabali_word_mask.png)

In [30]:
## Sample script used to generate the word cloud
import random
from os import path

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

def grey_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)

d = path.dirname(__file__)

# read the mask image
mask = np.array(Image.open(path.join(d, "storm-trooper.gif")))

# adding movie script specific stopwords
stopwords = set(STOPWORDS)
stopwords.add("int")
stopwords.add("ext")

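# df is assumed to be a pandas DataFrame here (sort_values is a pandas method, not a Spark one)
wd_ca = df.sort_values(by='retweet_count', ascending=False).head(4000)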
no_urls_no_tags = " ".join([word for word in " ".join(wd_ca['tweet_text']).split()
                                if 'http' not in word
                                    and not word.startswith('@')
                                    and word != 'RT'])

wc = WordCloud(max_words=1000, mask=mask, stopwords=stopwords, margin=10,
               random_state=1).generate(no_urls_no_tags)
# render the word cloud with its default colors
default_colors = wc.to_array()
plt.figure()
plt.title("#kabali")
plt.imshow(default_colors)
plt.axis("off")
plt.show()
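
The token filter that builds `no_urls_no_tags` can be tried on its own, without the `wordcloud` dependency. The helper name `clean_tweets` and the sample tweets below are assumptions for illustration.

```python
def clean_tweets(tweet_texts):
    """Join tweets and drop URLs, @mentions and the 'RT' marker."""
    words = ' '.join(tweet_texts).split()
    kept = [word for word in words
            if 'http' not in word
            and not word.startswith('@')
            and word != 'RT']
    return ' '.join(kept)

# made-up tweets for illustration
sample_texts = [
    'RT @fan: #kabali is here https://example.com/x',
    'Neruppu da! #kabali',
]
cleaned = clean_tweets(sample_texts)
print(cleaned)
```

Hashtags and punctuation are kept by this filter, so the hashtag itself is likely to dominate the cloud unless it is also added to the stopwords.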