# Analysis of Twitter Data

## Mining tweets
Our main goal here is to compare the popularity of programming languages used in **big data** and **data analytics**, and to retrieve tutorial links for those languages.

We will do this in 3 steps:

* Add tags to our tweets DataFrame so that the data is easier to manipulate.
* Target tweets that contain the "programming" or "tutorial" keywords.
* Extract links from the relevant tweets.

### Adding Python, Java, R, MatLab, SAS, Scala, Unix tags
First, we will create a function that checks if a specific keyword is present in a text. We will do this by using regular expressions. 

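Using Python's built-in "**re**" module, we will create a function called **word_in_text(word, text)**. The keyword is treated as a regular-expression pattern and matched case-insensitively. The function returns *1* if the keyword is found in the text, and *0* otherwise (including when the text is missing).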

In [None]:
import re
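def word_in_text(word, text):
    """Return 1 if the pattern `word` is found in `text` (case-insensitive), else 0."""
    try:
        # `word` is used as a regular-expression pattern, not a literal string
        match = re.search(word.lower(), text.lower())
    except (AttributeError, TypeError):
        # `text` is not a string, e.g. None for a tweet without a 'text' field
        return 0
    return 1 if match else 0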
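
Because `re.search` looks for the pattern anywhere in the text, a short keyword such as `sas` or `scala` also matches inside longer words. The sketch below is one possible refinement and is not part of the original analysis: a hypothetical helper, `whole_word_in_text`, escapes the keyword and wraps it in word boundaries (`\b`) so that only whole words count.

```python
import re

def whole_word_in_text(word, text):
    """Return 1 if `word` occurs in `text` as a whole word, else 0.

    Hypothetical variant of word_in_text, shown for illustration only.
    """
    # re.escape keeps characters such as '.' from acting as regex syntax
    pattern = r'\b' + re.escape(word) + r'\b'
    try:
        match = re.search(pattern, text, flags=re.IGNORECASE)
    except TypeError:
        # text is not a string, e.g. None for a tweet without a 'text' field
        return 0
    return 1 if match else 0

print(whole_word_in_text('sas', 'What a disaster'))         # 0: 'sas' is only part of a word
print(whole_word_in_text('sas', 'Learning SAS for stats'))  # 1: whole word, any case
print(whole_word_in_text('scala', 'A scalable pipeline'))   # 0
```

Switching to whole-word matching would change the counts reported later, so it is a design choice rather than a drop-in fix.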

Next, we will load the tweets and add one 0/1 column per language (7 columns in total) to our tweets DataFrame.

In [None]:
import sys
import json
import pandas as pd
import matplotlib.pyplot as plt

tweets_data_path = 'C:\\Program Files\\Anaconda2\\tweets_bigData_dataAnalytic.json'

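tweets_data = []
count = 0
# The with-statement closes the file even if an error occurs
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        count = count + 1
        try:
            tweets_data.append(json.loads(line))
        except ValueError as e:
            # Blank or malformed line: report it and move on
            print(e)
            continue
        # Progress indicator: one dot per 100 lines, 70 dots per row
        if count % 100 == 0:
            sys.stdout.write('.')
        if count % 7000 == 0:
            sys.stdout.write('\n')
print("\n%s lines read, %s tweets parsed." % (count, len(tweets_data)))
tweets = pd.DataFrame()
# A list comprehension works in both Python 2 and 3 (map() is lazy in Python 3)
tweets['text'] = [tweet.get('text', None) for tweet in tweets_data]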
tweets['python'] = tweets['text'].apply(lambda tweet: word_in_text('python', tweet))
tweets['java'] = tweets['text'].apply(lambda tweet: word_in_text('java', tweet))
tweets['r'] = tweets['text'].apply(lambda tweet: word_in_text(' r pack', tweet))
tweets['matlab'] = tweets['text'].apply(lambda tweet: word_in_text('matlab', tweet))
tweets['sas'] = tweets['text'].apply(lambda tweet: word_in_text('sas', tweet))
tweets['scala'] = tweets['text'].apply(lambda tweet: word_in_text('scala', tweet))
tweets['unix'] = tweets['text'].apply(lambda tweet: word_in_text('unix', tweet))
print(tweets.head())
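
A file with one JSON object per line can contain blank or truncated lines, which is why the loading loop skips lines that fail to parse. The sketch below shows the same read-and-skip idea on a small in-memory sample; `parse_tweet_lines` and the sample records are illustrative and not part of the original notebook.

```python
import io
import json

def parse_tweet_lines(lines):
    """Parse an iterable of JSON lines, skipping blank or malformed ones.

    Returns (tweets, skipped): the parsed objects and the number of skipped lines.
    """
    tweets, skipped = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            skipped += 1
            continue
        try:
            tweets.append(json.loads(line))
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            skipped += 1
    return tweets, skipped

# Made-up sample: one valid tweet, one blank line, one truncated line,
# and one record that has no 'text' field
sample = io.StringIO(
    '{"text": "Python tutorial for big data"}\n'
    '\n'
    '{"text": "truncated\n'
    '{"id": 42}\n'
)
parsed, skipped = parse_tweet_lines(sample)
texts = [tweet.get('text') for tweet in parsed]
print(len(parsed), skipped)
print(texts)
```

Records without a `text` field yield `None`, which is the case the keyword function has to tolerate.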

In [None]:
# Each column holds 0/1 flags, so its sum is the number of matching tweets.
# Unlike value_counts()[1], sum() does not fail when a keyword never appears.
print(tweets['python'].sum())
print(tweets['java'].sum())
print(tweets['r'].sum())
print(tweets['matlab'].sum())
print(tweets['sas'].sum())
print(tweets['scala'].sum())
print(tweets['unix'].sum())
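
Since every tag column holds only 0 or 1, summing a column is another way to count matching tweets, and it keeps working when a keyword never appears (indexing `value_counts()` with the label `1` raises a `KeyError` in that case). A minimal sketch on made-up data, assuming pandas is installed:

```python
import pandas as pd

# Toy data standing in for the tweets DataFrame (illustrative values only)
toy = pd.DataFrame({
    'python': [1, 0, 1, 1],
    'java':   [0, 1, 0, 0],
    'matlab': [0, 0, 0, 0],   # keyword that never appears
})

# Column-wise sum of the 0/1 flags gives one count per keyword
counts = toy[['python', 'java', 'matlab']].sum()
print(counts)
```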

We can then make a simple comparison chart by executing the following:

In [None]:
%matplotlib inline
keywords = ['python', 'java', 'r', 'matlab', 'sas', 'scala', 'unix']
# Each column holds 0/1 flags, so its sum is the number of matching tweets
tweets_by_keywords = [tweets[keyword].sum() for keyword in keywords]

x_pos = list(range(len(keywords)))
width = 0.6
fig, ax = plt.subplots()
# align='center' centres each bar on its tick, whatever the matplotlib default is
ax.bar(x_pos, tweets_by_keywords, width, alpha=1, color='g', align='center')

# Setting axis labels and ticks
ax.set_ylabel('Number of tweets', fontsize=15)
ax.set_title('Programming Languages used in Big Data Analytics',
             fontsize=10, fontweight='bold')
ax.set_xticks(x_pos)
ax.set_xticklabels(keywords)
ax.grid()

### Targeting relevant tweets
We are interested in targeting tweets that relate to *tutorial* or *programming* material on big data or data science. To do so, we will add columns to our tweets DataFrame that record this information.

In [None]:
tweets['tutorial'] = tweets['text'].apply(lambda tweet: word_in_text('tutorial', tweet))
# Tweets that mention both Python and a tutorial
s = tweets[(tweets['python']==1) & (tweets['tutorial']==1)]
print(len(s))

To filter the records in our **tweets** DataFrame easily, we can also add a column called "*relevant*" that takes the value 1 if the tweet contains both the "Python" and "Tutorial" keywords, and 0 otherwise.

In [None]:
tweets['relevant'] = tweets['text'].apply(lambda tweet: word_in_text('python', tweet)\
    and word_in_text('tutorial', tweet))
tweets[tweets['relevant']==1]

In the same way, we can add a column called "*programming*" that takes the value 1 if the tweet contains "Programming" together with either "Python" or "Java", and 0 otherwise.

In [None]:
tweets['programming'] = tweets['text'].apply(lambda tweet:\
    (word_in_text('python', tweet) or word_in_text('java', tweet))\
    and word_in_text('programming', tweet))
tweets[tweets['programming']==1]
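
The same conditions can also be expressed column-wise on flags that already exist, instead of scanning the text again. As a sketch on made-up data (the `prog` column is a hypothetical flag for the "programming" keyword alone), `&` and `|` act element-wise on the 0/1 columns:

```python
import pandas as pd

# Toy 0/1 flags standing in for columns of the tweets DataFrame
toy = pd.DataFrame({
    'python':   [1, 1, 0, 0],
    'java':     [0, 0, 1, 0],
    'tutorial': [1, 0, 0, 1],
    'prog':     [0, 1, 1, 1],   # hypothetical flag: tweet mentions "programming"
})

# Python AND tutorial
relevant = toy['python'] & toy['tutorial']
# (Python OR Java) AND programming
programming = (toy['python'] | toy['java']) & toy['prog']

print(relevant.tolist())
print(programming.tolist())
```

The parentheses matter: `&` binds more tightly than `|` in Python, so leaving them out would change the condition.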

Now we can count how many tweets about tutorials, and about Python and Java programming, were found in the Twitter data.

In [None]:
# sum() counts the 1s and, unlike value_counts()[1], does not fail on zero matches
print(tweets['tutorial'].sum())
print(tweets[tweets['tutorial']==1]['python'].sum())
print('')
print(tweets['programming'].sum())
print(tweets[tweets['programming']==1]['python'].sum())
print(tweets[tweets['programming']==1]['java'].sum())

We can make a comparison graph by executing the commands below:

In [None]:
x_labels = ['Python Tutorial', 'Python Prog', 'Java Prog']
# Sum the 0/1 flags within each filtered subset to count matching tweets
tweets_by = [tweets[tweets['tutorial'] == 1]['python'].sum(),
             tweets[tweets['programming'] == 1]['python'].sum(),
             tweets[tweets['programming'] == 1]['java'].sum()]
x_pos = list(range(len(x_labels)))
width = 0.8
fig, ax = plt.subplots()
# align='center' centres each bar on its tick, whatever the matplotlib default is
ax.bar(x_pos, tweets_by, width, alpha=1, color='g', align='center')
ax.set_ylabel('Number of tweets', fontsize=12)
ax.set_title('Number of tutorials and programming courses found', fontsize=10, fontweight='bold')
ax.set_xticks(x_pos)
ax.set_xticklabels(x_labels)
ax.grid()

From the graph, we can see that the tweets in our sample (i.e., Twitter users) currently talk about or offer a lot of Java-related material.

## Extracting links from relevant tweets
Now that we have identified the relevant tweets, we want to retrieve links to any tutorials concerning big data and data science. We will start by creating a function that uses a regular expression to find a link that starts with "http://", "https://" or "www." in a text. This function returns the first URL found; otherwise it returns an empty string.

In [None]:
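def extract_link(text):
    # Match http(s) URLs or bare www. addresses, stopping at whitespace, <, > or "
    regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
    try:
        match = re.search(regex, text)
    except TypeError:
        # text is not a string, e.g. None for a tweet without a 'text' field
        return ''
    if match:
        return match.group()
    return ''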
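
`re.search` returns only the first match, and the pattern keeps any punctuation that directly follows a URL. The sketch below is an illustrative variant, not the function used in this analysis: a hypothetical `extract_all_links` collects every match with `findall` and trims common trailing punctuation. The sample text and domains are made up.

```python
import re

# Same pattern as in extract_link, compiled once for reuse
URL_PATTERN = re.compile(r'https?://[^\s<>"]+|www\.[^\s<>"]+')

def extract_all_links(text):
    """Return every URL-like substring in `text`, without trailing punctuation."""
    try:
        candidates = URL_PATTERN.findall(text)
    except TypeError:
        # text is not a string, e.g. None for a tweet without a 'text' field
        return []
    # Sentence punctuation right after a URL is almost never part of it
    return [url.rstrip('.,;:!?)') for url in candidates]

sample = 'Read https://example.com/py-tutorial, then www.example.org.'
links = extract_all_links(sample)
print(links)
```

Trimming is a heuristic: a URL that legitimately ends in one of those characters would be shortened as well.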

Next, we will add a column called link to our tweets DataFrame. This column will contain the URL found in each tweet, or an empty string if there is none.

In [None]:
tweets['link'] = tweets['text'].apply(extract_link)

Next, we will create a new DataFrame called **tweets_relevant_with_link**. This DataFrame is a subset of the tweets DataFrame and contains all relevant tweets (those tagged as tutorials) that have a link.

In [None]:
tweets_relevant = tweets[tweets['tutorial'] == 1]
tweets_relevant_with_link = tweets_relevant[tweets_relevant['link'] != '']

We can now print out all links for Python tutorials by executing the commands below:

In [None]:
print(tweets_relevant_with_link[tweets_relevant_with_link['python'] == 1]['link'])