**Repo** https://github.com/aparna-k/datasci_course_materials/tree/master/assignment1

**Step 1** - Create a Twitter app and generate an access token (OAuth 1).
http://docs.inboundnow.com/guide/create-twitter-application/

You will now copy four values into the file `twitterstream.py`: your "Consumer Key (API Key)", your "Consumer Secret (API Secret)", your "Access token" and your "Access token secret". All four should be visible on the "Keys and Access Tokens" page. (You may see "Consumer Key (API Key)" referred to as either "Consumer key" or "API Key" in some places in the code or on the web; all three are synonyms.) Open `twitterstream.py` and set the variables corresponding to the API key, API secret, access token, and access secret. You will see code like this:

```python
api_key = "<Enter api key>" 
api_secret = "<Enter api secret>" 
access_token_key = "<Enter your access token key here>" 
access_token_secret = "<Enter your access token secret here>"
```

Since I don't want my access tokens in a public GitHub repo, I export the following four environment variables and read them in the code.

```bash
export TWITTER_API_KEY="xxxx"
export TWITTER_API_SECRET="xxxx"
export TWITTER_ACCESS_TOKEN_KEY="xxx"
export TWITTER_ACCESS_TOKEN_SECRET="xxx"
```

Then in the file `twitterstream.py`:

```python
import os  # needed for os.environ

# Read the credentials from the environment instead of hard-coding them.
api_key = os.environ.get('TWITTER_API_KEY')
api_secret = os.environ.get('TWITTER_API_SECRET')
access_token_key = os.environ.get('TWITTER_ACCESS_TOKEN_KEY')
access_token_secret = os.environ.get('TWITTER_ACCESS_TOKEN_SECRET')
```
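
`os.environ.get` returns `None` when a variable is not set, so a forgotten export may only show up later in a less obvious way. A minimal sketch of a fail-fast check, assuming a hypothetical helper `require_env` that is not part of `twitterstream.py`:

```python
import os

# The four variable names exported above.
REQUIRED_VARS = [
    "TWITTER_API_KEY",
    "TWITTER_API_SECRET",
    "TWITTER_ACCESS_TOKEN_KEY",
    "TWITTER_ACCESS_TOKEN_SECRET",
]

def require_env(names, environ=None):
    """Return {name: value}, raising if any variable is missing or empty.

    `require_env` is a hypothetical helper, shown for illustration only.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Calling `require_env(REQUIRED_VARS)` at the top of the script would then stop immediately with the names of any variables that were not exported.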

```bash
python twitterstream.py > output.txt
```

This command redirects the output to a file. Let the program run for at least 3 minutes so that data can accumulate, then stop it with Ctrl-C.
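
Before moving on, it can help to check how much data was actually collected. A minimal sketch, assuming each line of `output.txt` holds one JSON object and using a hypothetical helper `count_tweets`; the two sample lines are made up and stand in for the real file:

```python
import json

def count_tweets(lines):
    """Count lines that parse as JSON objects with a 'text' field.

    Hypothetical helper for illustration; blank or malformed lines are skipped.
    """
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:  # raised when a line is not valid JSON
            continue
        if isinstance(record, dict) and "text" in record:
            total += 1
    return total

# Made-up sample: one record with text, one record without.
sample = [
    '{"text": "hello world"}',
    '{"delete": {"status": {"id": 1}}}',
]
print(count_tweets(sample))  # prints 1
```

To run it against the real data, pass the open file instead of the sample list, for example `count_tweets(open("output.txt"))`.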

<h3 style='color:blue'>Derive the sentiment of each tweet</h3>

Compute the sentiment of each tweet based on the sentiment scores of the terms in the tweet. 

The sentiment of a tweet is equivalent to the sum of the sentiment scores for each term in the tweet.

To score a word, we use the AFINN word list:
http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010

AFINN is a list of English words rated for valence with an integer
between minus five (negative) and plus five (positive). The words were
manually labeled by Finn Årup Nielsen in 2009-2011. The file
is tab-separated. There are two versions:

**AFINN-111**: Newest version with 2477 words and phrases.

**AFINN-96**: 1468 unique words and phrases on 1480 lines, as some words
are listed twice. The word list is not entirely in alphabetical order.

We will be using AFINN-111 to compute sentiment.

The file AFINN-111.txt contains a list of pre-computed sentiment scores. Each line in the file contains a word or phrase followed by a sentiment score. Each word or phrase that is found in a tweet but not found in AFINN-111.txt should be given a sentiment score of 0. 
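
As a quick illustration of this rule, here is a minimal sketch that scores one sentence against a four-entry excerpt instead of the full file. The helper name `sentence_score` and the example sentence are made up for illustration; the four scores are taken from the sample of the list printed in this write-up.

```python
# Small excerpt of AFINN-style scores, used here instead of the full file.
toy_scores = {"hate": -3, "sorry": -1, "pardon": 2, "desirable": 2}

def sentence_score(text, scores):
    """Sum the scores of the space-separated words; unknown words count as 0."""
    return sum(scores.get(word.lower(), 0) for word in text.split(" "))

print(sentence_score("Sorry I hate mondays", toy_scores))  # -1 + 0 + -3 + 0 = -4
```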

**Create a dict of words, scores from the AFINN-111**

```python
scores = {}  # maps each AFINN term to its integer sentiment score
with open("AFINN-111.txt") as afinnfile:  # the file is closed automatically
    for line in afinnfile:
        term, score = line.split("\t")  # The file is tab-delimited. "\t" means "tab character"
        scores[term] = int(score)  # Convert the score to an integer.
```

```python
# Print the first 20 entries (Python 2: dict.keys() returns a list, so it can be sliced)
for key in scores.keys()[0:20]:
    print key, scores[key]
```

```
limited -1
suicidal -2
pardon 2
desirable 2
protest -2
lurking -1
controversial -2
hating -3
ridiculous -3
hate -3
aggression -2
increase 1
regretted -2
violate -2
granting 1
attracted 1
poorest -2
scold -2
bailout -2
sorry -1
```


**Parse the output.txt file using `json`**

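```python
import json

tweets_score = []
with open('output.txt') as op:
    for line in op:
        tweet = json.loads(line)  # each line of the stream is one JSON object
        if 'text' in tweet:
            words = tweet['text'].split(' ')
            score = 0
            for word in words:
                # Words missing from the AFINN list contribute 0.
                score += scores.get(word.lower(), 0)
            tweets_score.append(score)
        else:
            # Lines without a 'text' field are not tweets; score them 0.
            tweets_score.append(0)
```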

The `tweets_score` list has a score corresponding to each tweet in the same order as the tweets appear in the file.

See `tweet_sentiment.py` for the program that takes an AFINN file and a Twitter stream output file and computes the sentiment of each line in the stream file:

```bash
python tweet_sentiment.py AFINN-111.txt output.txt
```

<h3 style='color:blue'>Derive the sentiment of new terms</h3>

In this part you will be creating a script that computes the sentiment for the terms that **do not appear** in the file AFINN-111.txt

Here's how you might think about the problem: We know we can use the sentiment-carrying words in AFINN-111.txt to deduce the overall sentiment of a tweet. Once you deduce the sentiment of a tweet, you can work backwards to deduce the sentiment of the words in it that do not appear in AFINN-111.txt. For example, if the word "soccer" always appears in proximity with positive words like "great" and "fun", then we can deduce that the term "soccer" itself carries a positive sentiment.

You are provided with a skeleton file term_sentiment.py which accepts the same two arguments as tweet_sentiment.py and can be executed using the following command:

```bash
python term_sentiment.py AFINN-111.txt output.txt
```

Your script should print output to stdout. Each line of output should contain a term, followed by a space, followed by the sentiment. That is, each line should be in the format `<term:string> <sentiment:float>`.

For example, if you have the pair ("foo", 103.256) in Python, it should appear in the output as:

```bash
foo 103.256
```
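
One way to produce lines in this format, sketched with a hypothetical helper `format_term` (the skeleton file may be organized differently):

```python
def format_term(term, sentiment):
    """Return '<term> <sentiment>' with the sentiment rendered as a float."""
    return "{} {}".format(term, float(sentiment))

print(format_term("foo", 103.256))  # foo 103.256
print(format_term("bar", 3))        # bar 3.0
```

Converting with `float()` keeps the output consistent even when a score happens to be an integer.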

```python
new_terms_scores = {}
with open('output.txt') as op:
    for line in op:
        tweet = json.loads(line)
        if 'text' in tweet:
            words = tweet['text'].split(' ')
            # Start both counters at 1 so the ratio below never divides by zero.
            pos_words = 1
            neg_words = 1
            new_terms = []
            for word in words:
                if word in scores:
                    if scores[word] > 0:
                        pos_words += 1
                    else:
                        neg_words += 1
                else:
                    new_terms.append(word)
            # Classify the tweet once all of its words have been counted.
            ratio = float(pos_words) / float(neg_words)
            if ratio >= 1:
                new_terms_score = pos_words
            else:
                new_terms_score = neg_words * -1
            for new_term in new_terms:
                new_terms_scores[new_term] = new_terms_score
```

```python
# Sample output (Python 2: dict.keys() returns a list, so it can be sliced)
for key in new_terms_scores.keys()[0:20]:
    print key, new_terms_scores[key]
```

```
 1
ورشة 1
https://t.co/uacdHg78ZR 1
これ、やってみたら自分はかなり少数派なものを選んでしまっていた。HAHAHA 1
better! 2
@Mhodc17 1
casa, 1
る 1
pide 1
ptm 1
PARTE 1
everybody 1
يقول 1
Buenos 1
tug-of-war -3
mansion 3
3m 1
otro 1
@wylogp 1
Guide 1
```


<h4 style="color:green">Explanation of the technique I've used:</h4>

For each valid tweet:
1. I count the number of positive words and the number of negative words.
    - I initialize `pos_words` and `neg_words` to 1 because I use their ratio to determine the general sentiment of the tweet, and starting at 1 avoids a division-by-zero error.

2. I compute the ratio of the number of positive terms to the number of negative terms (`pos_words / neg_words`).

3. If the ratio is greater than or equal to 1, I treat the tweet as positive overall; otherwise, I treat it as negative.

4. Each new term that was not found in the AFINN file is scored as `pos_words` if the tweet is positive, or as `neg_words * (-1)` if the tweet is negative. Both counters include the initial 1.
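
The four steps can be sketched as a single function and tried on toy input. The function name `score_new_terms`, the example sentences, and the four-entry score table are assumptions made for illustration; the table stands in for the full AFINN-111 dictionary.

```python
def score_new_terms(words, scores):
    """Apply the four steps to one tweet's words.

    Returns {new_term: score} for every word missing from `scores`.
    """
    pos_words = 1  # start at 1 so the ratio never divides by zero
    neg_words = 1
    new_terms = []
    for word in words:
        if word in scores:
            if scores[word] > 0:
                pos_words += 1
            else:
                neg_words += 1
        else:
            new_terms.append(word)
    ratio = float(pos_words) / float(neg_words)
    tweet_score = pos_words if ratio >= 1 else -neg_words
    return {term: tweet_score for term in new_terms}

# Toy score table standing in for the full AFINN-111 dictionary.
toy_scores = {"hate": -3, "ridiculous": -3, "pardon": 2, "desirable": 2}

positive = score_new_terms("soccer is desirable".split(" "), toy_scores)
negative = score_new_terms("traffic is ridiculous hate".split(" "), toy_scores)
print(positive == {"soccer": 2, "is": 2})      # True: one positive word, so pos_words = 2
print(negative == {"traffic": -3, "is": -3})   # True: two negative words, so neg_words = 3
```

In the first sentence the counters end at 2 and 1, so the ratio is 2 and both new terms get +2. In the second they end at 1 and 3, so the ratio is below 1 and both new terms get -3.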

**This is obviously a very simplistic solution: it does not account for non-English words or filter out non-textual tweets.**