
                                              Python 3
                                            23-09-2016
Happiness of twitter users in USA by states
============

> Author: **Gabriel Piles González**


***


The porpouse of this notebook is to compare the happiness of the twitter users in the USA by states. The sample is obtain by the API of twitter. It is going to be follow the following points.

1. Load files: states list and words ratings
2. Helper functions
3. Execute analysis
4. Results: Numbers of twitts
5. Results: Happiness of twitter users in USA by states
6. Conclusions and observations


***


### 1. Load files: states list and words ratings ###

In [1]:
# To load the states from a file

states_text = open('USA_states.txt', 'r')
states_dictionary = {}
for each_state_text in states_text:
    tokens=each_state_text.split("\t")
    states_dictionary[tokens[1]] = tokens[0]

states_text = open('USA_states.txt', 'r')
states_by_code = {}
for each_state_text in states_text:
    tokens=each_state_text.split("\t")
    states_by_code[tokens[0]] = tokens[1]


In [2]:
# To load the word rating happiness

words_ratings = open('AFINN-111.txt', 'r')
words_ratings_dictionary = {}
for each_rating in words_ratings:
    tokens=each_rating.split("\t")
    words_ratings_dictionary[tokens[0]] = int(tokens[1])

for key in sorted(words_ratings_dictionary)[:5]:
    print(key, str(words_ratings_dictionary[key]))
print(words_ratings_dictionary["can't stand"])

abandon -2
abandoned -2
abandons -2
abducted -2
abduction -2
-3




***


### 2. Helper functions ###

It is going to be defined 6 functions in order to help to do the analysis.

1. **initialize_dictionary:** for initiate a states dictionary
2. **USA_twitt:** return true if it is a twitt from USA and false otherwise
3. **get_state_of_USA:** return the state of the twitt passed as a parameter
4. **twitt_evaluation:** return a number that reperesents the happiness of a twitt
5. **plot_happiness:** create a bar chart
6. **check_happiness:** this function loops through the lines of a file that contains twitts and store the number of twitts per state and the total happiness per state. It returns the number of twitts evaluated.


In [3]:
def initialize_dictionary(dictionary):
    dictionary_out = {}
    for each_key, each_value in dictionary.items():
        dictionary_out[each_value] = 0
    return dictionary_out

In [4]:
import json

def USA_twitt(twitt):
    place = twitt.get('place')
    if not place:
        return False
    countrie_code = twitt.get('place').get('country_code')
    if countrie_code == "US":
        return True
    return False


In [5]:
def get_state_of_USA(twitt):
    place_full_text = twitt.get('place').get('full_name')
    tokens = place_full_text.split(', ')
    if len(tokens) < 2:
        return None
    if tokens[1] == "USA":
        if tokens[0] in states_dictionary:
            return states_dictionary[tokens[0]]
        else:
            return None
    return tokens[1]

In [6]:
import re
def twitt_evaluation(twitt_text):
    rating = 0
    words = twitt_text.split(' ')
    for each_word in words:
        clean_word = re.sub('\W+','', each_word )
        if clean_word in words_ratings_dictionary:
            rating += words_ratings_dictionary[clean_word]
    return rating
twitt_evaluation('well! surprise bad man!!!')

-3

In [7]:
from bokeh.charts import Bar, output_notebook, show

def plot_bar_chart(name, data, values, labels):
        # It is going to be plot in the notebook
        output_notebook()

        fig = Bar(data, values=values, label=labels, title=name, plot_width=900, legend=False)
        
        show(fig)

In [8]:
import os.path

def check_happiness(file, happiness_by_state, number_twitts_by_state):
    twitts_data = open(file, 'r')
    number_twitts=0
    for each_twitt in twitts_data:
        number_twitts+=1
        twitt_json = json.loads(each_twitt)
        if not USA_twitt(twitt_json):
            continue
        rating = twitt_evaluation(twitt_json.get('text'))
        if rating == 0:
            continue
        state = get_state_of_USA(twitt_json)
        if not state:
            continue
        if state in happiness_by_state:
            happiness_by_state[state] += rating
            number_twitts_by_state[state] +=1
    return(number_twitts)


***

### 3. Execute analysis ###

The next section loops through a set of files containing lines of twitts and evaluate each file

In [9]:
happiness_by_state = initialize_dictionary(states_dictionary)
number_twitts_by_state = initialize_dictionary(states_dictionary)

number_twitts_evaluated = 0
number_files = 0
for i in range(209):
    file_name = "twitts_sample_" + str(i)
    if os.path.exists(file_name):
        number_files += 1
        number_twitts_evaluated += check_happiness(file_name, happiness_by_state, number_twitts_by_state)


***

### 4. Results: Numbers of twitts ###

The results obtained are the following.

In [10]:
valid_twitts = 0
for each_state in number_twitts_by_state:
    valid_twitts += number_twitts_by_state[each_state]

print('------------------------------------------------------------')
print()
print('Number of twitts evaluated: ' + str(number_twitts_evaluated))
print('Number of twitts with valid information: ' + str(valid_twitts))
print()
print('------------------------------------------------------------')
print()
print('Percentage of valid twitts: ' + "{0:.2f}".format(100 * valid_twitts/number_twitts_evaluated) + '%')
print()

states = []
number_twitts_per_state_array = []

for each_state in number_twitts_by_state:
    states.append(states_by_code[each_state])
    number_twitts_per_state_array.append(number_twitts_by_state[each_state])

plot_bar_chart('Number of valid twitts per state', {
    'values': number_twitts_per_state_array,
    'labels': states
}, 'values', 'labels')   

------------------------------------------------------------

Number of twitts evaluated: 241941
Number of twitts with valid information: 1107

------------------------------------------------------------

Percentage of valid twitts: 0.46%





***

### 5. Results: Happiness of twitter users in USA by states ###

The results obtained are the following.

In [13]:
happiness = initialize_dictionary(states_dictionary)
for each_state in happiness:
    if number_twitts_by_state[each_state] == 0:
        happiness[each_state] = 0
    else:     
        happiness[each_state] = happiness_by_state[each_state] / number_twitts_by_state[each_state]

states_sorted = []
happiness_sorted = []
i = 1
for k in sorted(happiness, key=happiness.get, reverse=True):
    if i < 10:
        states_sorted.append('0' + str(i) + " " + states_by_code[k])
    else:
        states_sorted.append(str(i) + " " + states_by_code[k])
    happiness_sorted.append(happiness[k])
    i+=1
    
plot_bar_chart('Happiness of twitter users in USA by states', {
    'state': states_sorted,
    'happiness': happiness_sorted
}, 'happiness', 'state')



***


### 6. Conclusions and observations ###

The porpouse of this work is to practice with python and it is not intented to actually analyze the happiness of the USA users of twitter. This analysis is not valid because the sample is too small and it is not uniformly distributed through the different states because some states are sampled in different hours.