# Twitter POS Tagging 
The goal of this tutorial is to introduce a a Part-of-Speech (POS) tagger developed for tweets which was released as part of the [TweetNLP](https://www.ark.cs.cmu.edu/TweetNLP/) toolkit. The code is written in Java and the python wrapper for the tokenization is from [this](https://github.com/myleott/ark-twokenize-py) github repository. This tutorial has code from the [TweetNLP](https://github.com/brendano/ark-tweet-nlp/) github repository as well as the python wrapper from [this](https://github.com/ianozsvald/ark-tweet-nlp-python) repository.

## POS tagging
- POS tagging involves identifying part-of-speech of tokens in a given text. This can be viewed as a task of labeling the sentence w_1, w_2, ....., w_n with pos tags, one for each word: t_1, t_2, ...., t_n.
- The 8 common parts of speech for english language are:
  1. Noun
  2. Verb
  3. Pronoun
  4. Preposition
  5. Adverb
  6. Conjuction
  7. Participle
  8. Article  
- Twitter data is different from standard language data in that there are tokens such as #, @, emoticons, URLs, etc. So the tagset for twitter needs to incorporate the tags for these new tokens. The tags that are used to annotate tweets are as follows:

<img src="pos_tags.png">

## Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments
- This tutorial covers how to accomplish the task of POS tagging for twitter data based on this paper: https://aclanthology.org/P11-2008.pdf
- The nature of twitter data poses challenges in using standard POS taggers. The paper develops the above tagset for twitter to include tags for words that are not commonly encountered in language outside of twitter. 
- Around 1,800 tweets were manually annotated with corresponding pos tags.
- Conditional Random Fields (CRFs) were used with features specific to twitter POS tagging. The features for the CRF are below (see paper for more details):
  - Twitter orthography - these features are rules that detect @, #, and URls.
  - Names - these features check for names from a dictionary of compiled tokens which are frequently capitalized.
  - Traditional Tag Dictionary - these are features for all tags that occur in PTB.
  - Distributional Similarity - these features are constructed from the successor and predecessor probabilities for the 10,000 most common terms.
  - Phonetic normalization - words are normalized to ignore alternate spellings of words using the Metaphone algorithm; e.x.{thangs, thanks, thanksss, thanx, thinks, thnx} are mapped to 0NKS.
- 1827 tweets that are annotated are divided into training set of 1000 tweets, dev set of 327 tweets, and test set of 500 tweets. The results of the tagger incorporating the above features are compared with the standard Stanford Tagger and using the above feature set for twitter data reduces error by about 25%.

## Instructions 
- You will need to download the POS tagger from https://code.google.com/archive/p/ark-tweet-nlp/downloads
- This requires Java 6. https://www.oracle.com/java/technologies/java-platform.html
- Place this ipython notebook that has python wrappers inside the ark-tweet-nlp-0.3.2 folder.

In [1]:
from __future__ import unicode_literals

import operator
import re
import sys
import os
import numpy as np

import subprocess
import shlex

try:
    from html.parser import HTMLParser
except ImportError:
    from HTMLParser import HTMLParser

try:
    import html
except ImportError:
    pass  

In [2]:
Contractions = re.compile(u"(?i)(\w+)(n['’′]t|['’′]ve|['’′]ll|['’′]d|['’′]re|['’′]s|['’′]m)$", re.UNICODE)
Whitespace = re.compile(u"[\s\u0020\u00a0\u1680\u180e\u202f\u205f\u3000\u2000-\u200a]+", re.UNICODE)
punctChars = r"['\"“”‘’.?!…,:;]"
punctSeq   = r"['\"“”‘’]+|[.?!,…]+|[:;]+"
entity     = r"&(?:amp|lt|gt|quot);" # see more here https://www.w3schools.com/html/html_entities.asp

In [3]:
def regex_or(*items):
    return '(?:' + '|'.join(items) + ')'

urlStart1  = r"(?:https?://|\bwww\.)"
commonTLDs = r"(?:com|org|edu|gov|net|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|pro|tel|travel|xxx)"
ccTLDs = r"(?:ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|" + \
r"bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|" + \
r"er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|" + \
r"hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|" + \
r"lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|" + \
r"nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|" + \
r"sl|sm|sn|so|sr|ss|st|su|sv|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|" + \
r"va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|za|zm|zw)"	#TODO: remove obscure country domains?
urlStart2  = r"\b(?:[A-Za-z\d-])+(?:\.[A-Za-z0-9]+){0,3}\." + regex_or(commonTLDs, ccTLDs) + r"(?:\."+ccTLDs+r")?(?=\W|$)"
urlBody    = r"(?:[^\.\s<>][^\s<>]*?)?"
urlExtraCrapBeforeEnd = regex_or(punctChars, entity) + "+?"
urlEnd     = r"(?:\.\.+|[<>]|\s|$)"
url        = regex_or(urlStart1, urlStart2) + urlBody + "(?=(?:"+urlExtraCrapBeforeEnd+")?"+urlEnd+")"

In [4]:
monetary = r"\$([0-9]+)?\.?([0-9]+)?"
timeLike   = r"\d+(?::\d+){1,2}"
numberWithCommas = r"(?:(?<!\d)\d{1,3},)+?\d{3}" + r"(?=(?:[^,\d]|$))"
numComb = u"[\u0024\u058f\u060b\u09f2\u09f3\u09fb\u0af1\u0bf9\u0e3f\u17db\ua838\ufdfc\ufe69\uff04\uffe0\uffe1\uffe5\uffe6\u00a2-\u00a5\u20a0-\u20b9]?"
boundaryNotDot = regex_or("$", r"\s", r"[“\"?!,:;]", entity)
aa1  = r"(?:[A-Za-z]\.){2,}(?=" + boundaryNotDot + ")"
aa2  = r"[^A-Za-z](?:[A-Za-z]\.){1,}[A-Za-z](?=" + boundaryNotDot + ")"
standardAbbreviations = r"\b(?:[Mm]r|[Mm]rs|[Mm]s|[Dd]r|[Ss]r|[Jj]r|[Rr]ep|[Ss]en|[Ss]t)\."
arbitraryAbbrev = regex_or(aa1, aa2, standardAbbreviations)
separators  = "(?:--+|―|—|~|–|=)"
decorations = u"(?:[♫♪]+|[★☆]+|[♥❤♡]+|[\u2639-\u263b]+|[\ue001-\uebbb]+)"
thingsThatSplitWords = r"[^\s\.,?\"]"
embeddedApostrophe = thingsThatSplitWords+r"+['’′]" + thingsThatSplitWords + "*"
normalEyes = "[:=]" # 8 and x are eyes but cause problems
wink = "[;]"
noseArea = "(?:|-|[^a-zA-Z0-9 ])" # doesn't get :'-(
happyMouths = r"[D\)\]\}]+"
sadMouths = r"[\(\[\{]+"
tongue = "[pPd3]+"
otherMouths = r"(?:[oO]+|[/\\]+|[vV]+|[Ss]+|[|]+)" # remove forward slash if http://'s aren't cleaned

# mouth repetition examples:
# @aliciakeys Put it in a love song :-))
# @hellocalyclops =))=))=)) Oh well

# myleott: try to be as case insensitive as possible, but still not perfect, e.g., o.O fails
#bfLeft = u"(♥|0|o|°|v|\\$|t|x|;|\u0ca0|@|ʘ|•|・|◕|\\^|¬|\\*)".encode('utf-8')
bfLeft = u"(♥|0|[oO]|°|[vV]|\\$|[tT]|[xX]|;|\u0ca0|@|ʘ|•|・|◕|\\^|¬|\\*)"
bfCenter = r"(?:[\.]|[_-]+)"
bfRight = r"\2"
s3 = r"(?:--['\"])"
s4 = r"(?:<|&lt;|>|&gt;)[\._-]+(?:<|&lt;|>|&gt;)"
s5 = "(?:[.][_]+[.])"
# myleott: in Python the (?i) flag affects the whole expression
#basicface = "(?:(?i)" +bfLeft+bfCenter+bfRight+ ")|" +s3+ "|" +s4+ "|" + s5
basicface = "(?:" +bfLeft+bfCenter+bfRight+ ")|" +s3+ "|" +s4+ "|" + s5

eeLeft = r"[＼\\ƪԄ\(（<>;ヽ\-=~\*]+"
eeRight= u"[\\-=\\);'\u0022<>ʃ）/／ノﾉ丿╯σっµ~\\*]+"
eeSymbol = r"[^A-Za-z0-9\s\(\)\*:=-]"
eastEmote = eeLeft + "(?:"+basicface+"|" +eeSymbol+")+" + eeRight

oOEmote = r"(?:[oO]" + bfCenter + r"[oO])"

emoticon = regex_or(
        # Standard version  :) :( :] :D :P
        "(?:>|&gt;)?" + regex_or(normalEyes, wink) + regex_or(noseArea,"[Oo]") + regex_or(tongue+r"(?=\W|$|RT|rt|Rt)", otherMouths+r"(?=\W|$|RT|rt|Rt)", sadMouths, happyMouths),

        # reversed version (: D:  use positive lookbehind to remove "(word):"
        # because eyes on the right side is more ambiguous with the standard usage of : ;
        regex_or("(?<=(?: ))", "(?<=(?:^))") + regex_or(sadMouths,happyMouths,otherMouths) + noseArea + regex_or(normalEyes, wink) + "(?:<|&lt;)?",

        #inspired by http://en.wikipedia.org/wiki/User:Scapler/emoticons#East_Asian_style
        eastEmote.replace("2", "1", 1), basicface,
        # iOS 'emoji' characters (some smileys, some symbols) [\ue001-\uebbb]
        # TODO should try a big precompiled lexicon from Wikipedia, Dan Ramage told me (BTO) he does this

        # myleott: o.O and O.o are two of the biggest sources of differences
        #          between this and the Java version. One little hack won't hurt...
        oOEmote
)

Hearts = "(?:<+/?3+)+" #the other hearts are in decorations

Arrows = regex_or(r"(?:<*[-―—=]*>+|<+[-―—=]*>*)", u"[\u2190-\u21ff]+")

Hashtag = "#[a-zA-Z0-9_]+"
AtMention = "[@＠][a-zA-Z0-9_]+"

Bound = r"(?:\W|^|$)"
Email = regex_or("(?<=(?:\W))", "(?<=(?:^))") + r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}(?=" +Bound+")"

In [5]:
# We will be tokenizing using these regexps as delimiters
# Additionally, these things are "protected", meaning they shouldn't be further split themselves.
Protected  = re.compile(
    regex_or(
        Hearts,
        url,
        Email,
        timeLike,
        monetary,
        numberWithCommas,
        numComb,
        emoticon,
        Arrows,
        entity,
        punctSeq,
        arbitraryAbbrev,
        separators,
        decorations,
        embeddedApostrophe,
        Hashtag,
        AtMention), re.UNICODE)

# Edge punctuation
# Want: 'foo' => ' foo '
# While also:   don't => don't
# the first is considered "edge punctuation".
# the second is word-internal punctuation -- don't want to mess with it.
# BTO (2011-06): the edgepunct system seems to be the #1 source of problems these days.
# I remember it causing lots of trouble in the past as well.  Would be good to revisit or eliminate.

# Note the 'smart quotes' (http://en.wikipedia.org/wiki/Smart_quotes)
#edgePunctChars    = r"'\"“”‘’«»{}\(\)\[\]\*&" #add \\p{So}? (symbols)
edgePunctChars    = u"'\"“”‘’«»{}\\(\\)\\[\\]\\*&" #add \\p{So}? (symbols)
edgePunct    = "[" + edgePunctChars + "]"
notEdgePunct = "[a-zA-Z0-9]" # content characters
offEdge = r"(^|$|:|;|\s|\.|,)"  # colon here gets "(hello):" ==> "( hello ):"
EdgePunctLeft  = re.compile(offEdge + "("+edgePunct+"+)("+notEdgePunct+")", re.UNICODE)
EdgePunctRight = re.compile("("+notEdgePunct+")("+edgePunct+"+)" + offEdge, re.UNICODE)


In [6]:
def splitEdgePunct(input):
    input = EdgePunctLeft.sub(r"\1\2 \3", input)
    input = EdgePunctRight.sub(r"\1 \2\3", input)
    return input

# The main work of tokenizing a tweet.
def simpleTokenize(text):

    # Do the no-brainers first
    splitPunctText = splitEdgePunct(text)

    textLength = len(splitPunctText)

    # BTO: the logic here got quite convoluted via the Scala porting detour
    # It would be good to switch back to a nice simple procedural style like in the Python version
    # ... Scala is such a pain.  Never again.

    # Find the matches for subsequences that should be protected,
    # e.g. URLs, 1.0, U.N.K.L.E., 12:53
    bads = []
    badSpans = []
    for match in Protected.finditer(splitPunctText):
        # The spans of the "bads" should not be split.
        if (match.start() != match.end()): #unnecessary?
            bads.append( [splitPunctText[match.start():match.end()]] )
            badSpans.append( (match.start(), match.end()) )

    # Create a list of indices to create the "goods", which can be
    # split. We are taking "bad" spans like
    #     List((2,5), (8,10))
    # to create
    #     List(0, 2, 5, 8, 10, 12)
    # where, e.g., "12" here would be the textLength
    # has an even length and no indices are the same
    indices = [0]
    for (first, second) in badSpans:
        indices.append(first)
        indices.append(second)
    indices.append(textLength)

    # Group the indices and map them to their respective portion of the string
    splitGoods = []
    for i in range(0, len(indices), 2):
        goodstr = splitPunctText[indices[i]:indices[i+1]]
        splitstr = goodstr.strip().split(" ")
        splitGoods.append(splitstr)

    #  Reinterpolate the 'good' and 'bad' Lists, ensuring that
    #  additonal tokens from last good item get included
    zippedStr = []
    for i in range(len(bads)):
        zippedStr = addAllnonempty(zippedStr, splitGoods[i])
        zippedStr = addAllnonempty(zippedStr, bads[i])
    zippedStr = addAllnonempty(zippedStr, splitGoods[len(bads)])

    # BTO: our POS tagger wants "ur" and "you're" to both be one token.
    # Uncomment to get "you 're"
    #splitStr = []
    #for tok in zippedStr:
    #    splitStr.extend(splitToken(tok))
    #zippedStr = splitStr

    return zippedStr


def addAllnonempty(master, smaller):
    for s in smaller:
        strim = s.strip()
        if (len(strim) > 0):
            master.append(strim)
    return master

# "foo   bar " => "foo bar"
def squeezeWhitespace(input):
    return Whitespace.sub(" ", input).strip()

# Final pass tokenization based on special patterns
def splitToken(token):
    m = Contractions.search(token)
    if m:
        return [m.group(1), m.group(2)]
    return [token]

# Assume 'text' has no HTML escaping.
def tokenize(text):
    return simpleTokenize(squeezeWhitespace(text))

# Twitter text comes HTML-escaped, so unescape it.
# We also first unescape &amp;'s, in case the text has been buggily double-escaped.
def normalizeTextForTagger(text):
    assert sys.version_info[0] >= 3 and sys.version_info[1] > 3, 'Python version >3.3 required'
    text = text.replace("&amp;", "&")
    text = html.unescape(text)
    return text

# This is intended for raw tweet text -- we do some HTML entity unescaping before running the tagger.
#
# This function normalizes the input text BEFORE calling the tokenizer.
# So the tokens you get back may not exactly correspond to
# substrings of the original text.
def tokenizeRawTweetText(text):
    tokens = tokenize(normalizeTextForTagger(text))
    return tokens

## Python Wrapper for POS Tagger
- The functions below call the runTagger.sh to get the POS tag predictions for the tokenized tweets. 
- runTagger.sh script should be invoked.

In [7]:
RUN_TAGGER_CMD = "java -XX:ParallelGCThreads=2 -Xmx500m -jar ark-tweet-nlp-0.3.2.jar"

def _split_results(rows):
    """Parse the tab-delimited returned lines, modified from: https://github.com/brendano/ark-tweet-nlp/blob/master/scripts/show.py"""
    for line in rows:
        line = line.strip()  # remove '\n'
        if len(line) > 0:
            if line.count('\t') == 2:
                parts = line.split('\t')
                tokens = parts[0]
                tags = parts[1]
                confidence = float(parts[2])
                yield tokens, tags, confidence
                
                
def _call_runtagger(tweets, run_tagger_cmd=RUN_TAGGER_CMD):
    """Call runTagger.sh using a named input file"""

    # remove carriage returns as they are tweet separators for the stdin
    # interface
    tweets_cleaned = [tw.replace('\n', ' ') for tw in tweets]
    message = "\n".join(tweets_cleaned)

    # force UTF-8 encoding (from internal unicode type) to avoid .communicate encoding error as per:
    # http://stackoverflow.com/questions/3040101/python-encoding-for-pipe-communicate
    message = message.encode('utf-8')

    # build a list of args
    args = shlex.split(run_tagger_cmd)
    args.append('--output-format')
    args.append('conll')
    po = subprocess.Popen(args, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # old call - made a direct call to runTagger.sh (not Windows friendly)
    #po = subprocess.Popen([run_tagger_cmd, '--output-format', 'conll'], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    result = po.communicate(message)
    # expect a tuple of 2 items like:
    # ('hello\t!\t0.9858\nthere\tR\t0.4168\n\n',
    # 'Listening on stdin for input.  (-h for help)\nDetected text input format\nTokenized and tagged 1 tweets (2 tokens) in 7.5 seconds: 0.1 tweets/sec, 0.3 tokens/sec\n')

    pos_result = result[0].decode('utf-8').strip('\n\n')  # get first line, remove final double carriage return
    pos_result = pos_result.split('\n\n')  # split messages by double carriage returns
    pos_results = [pr.split('\n') for pr in pos_result]  # split parts of message by each carriage return
    return pos_results


def runtagger_parse(tweets, run_tagger_cmd=RUN_TAGGER_CMD):
    """Call runTagger.sh on a list of tweets, parse the result, return lists of tuples of (term, type, confidence)"""
    pos_raw_results = _call_runtagger(tweets, run_tagger_cmd)
    pos_result = []
    for pos_raw_result in pos_raw_results:
        pos_result.append([x for x in _split_results(pos_raw_result)])
    return pos_result


def check_script_is_present(run_tagger_cmd=RUN_TAGGER_CMD):
    """Simple test to make sure we can see the script"""
    success = False
    try:
        args = shlex.split(run_tagger_cmd)
        args.append("--help")
        po = subprocess.Popen(args, stdout=subprocess.PIPE)
        # old call - made a direct call to runTagger.sh (not Windows friendly)
        #po = subprocess.Popen([run_tagger_cmd, '--help'], stdout=subprocess.PIPE)
        while not po.poll():
            lines = [l for l in po.stdout]
        # we expected the first line of --help to look like the following:
        assert "RunTagger [options]" in lines[0].decode('utf-8')
        success = True
    except OSError as err:
        print("Caught an OSError, have you specified the correct path to runTagger.sh? We are using \"%s\". Exception: %r" % (run_tagger_cmd, repr(err)))
    return success


## Read tokenized tweets
We will now load tweets that have the tokenized for POS tagging.

In [11]:
file = open("tweets_tokenized.txt", "r")
tweets_tokenized = file.readlines()
print(tweets_tokenized)

["I won't win a single game I bet on!! Got Mr. Cliff Lee, if he loses its on me U.S.A!\n", 'RT @eye_e: this poster-print costs $12.40 , which is 40% of the normal price! http://tl.gd/6meogh\n', 'I ❤ Biebs & want to hang out with him!!\n', '@thecamion I like monkeys, but I still hate COSTCO parking lots.. oO o.O #COSTCO 2:15 PM\n', 'Texas Rangers are in the World Series! Go Rangers!!!!!!!!! :> <3 ♥❤♡ http://fb.me/D2LsXBJx\n']


## Apply POS tagger
The output of the POS tagger is a tuple containing token, predicted output tag, and confidence

In [12]:
top_5k = open('top_5k_twitter_2015.txt')
top_5k_list = []
for line in top_5k:
    tokenized_top_5k = ' '.join(tokenizeRawTweetText(line))
    tokenized_top_5k_list = tokenized_top_5k.split(' ')
    top_5k_list.append(tokenized_top_5k_list[0])
print(top_5k_list)



In [17]:
inp_file = open('alcohol_tweets_4k.txt')
#inp_file = open('tweets.txt')

oup_file = open("t_t_p.txt", "w") 


open_class=['N','A','V','R']

lexical_density_ratio_list = []
n_slex_ratio_list = []
ttr_list = []
tokenized_tweet_list=[]
#oup_file = open("tweets_tokenized.txt", "w") 
for line in inp_file:
    #print("---")
    tokenized_tweet_list=[]
    tokenized_tweet = ' '.join(tokenizeRawTweetText(line))
    print(tokenized_tweet)
    #tokenized_tweet_list = tokenized_tweet.split(' ')
    #unique_tokenized_tweet_list = set(tokenized_tweet_list)
    
    
    out = runtagger_parse([tokenized_tweet])
    print(out)
    number_of_words =len(out[0])
    print(out[0])
    n_open_class = 0
    n_slex = 0
    for i in range(number_of_words):
        tokenized_tweet_list.append(out[0][i][0])
        if out[0][i][1] in open_class : 
            n_open_class += 1
    oup_file.write(line.strip()+'\t'+str(tokenized_tweet_list)+'\t'+str(out[0]) + '\n')
    for i in range(number_of_words):
        if out[0][i][0].lower() not in top_5k_list :
            #print(out[0][i][0])
            n_slex +=1
            
    unique_tokenized_tweet_list = set(tokenized_tweet_list)
    diversity = len(unique_tokenized_tweet_list)
    
    ratio = n_open_class/number_of_words
    n_slex_ratio = n_slex/number_of_words
    ttr = diversity/number_of_words
    lexical_density_ratio_list.append(ratio)
    n_slex_ratio_list.append(n_slex_ratio)
    ttr_list.append(ttr)
    #print(unique_tokenized_tweet_list, diversity, number_of_words)

    
    #density_file.write(line.strip() +'\t'+str(n_open_class)+ '/'+str(number_of_words)+ '\n')
    
    
inp_file.close()

oup_file.close()




Drinking a Smoglifter by @brashbrewingco @ Spoiled Rotten Grayton Beach, FL — http://t.co/h9LJZwCrm9
[[]]
[]


ZeroDivisionError: division by zero

In [11]:
def sort_index(list_in):

    sort_ = []
    for i in range (len(list_in)): 
        max_index = np.argmax(list_in)
        list_in[max_index] = list_in[max_index]-2 
        sort_.append(max_index)
    for i in range (len(list_in)): 
        list_in[i] =list_in[i] +2
    return sort_

def file_sorter(content,out_file, in_ratio, sort_list):
    for i in sort_list : 
        out_file.write(content[i].strip() + '\t' + str(in_ratio[i]) + '\n')
    return 


In [12]:
inp_file = open('alcohol_tweets_4k.txt')
#inp_file = open('tweets.txt')

density_file = open("density.txt", "w")
sophistication_file = open("sophistication.txt", "w")
diversity_file = open("diversity.txt", "w")

lexical_index_sort = sort_index(lexical_density_ratio_list)
n_slex_index_sort = sort_index(n_slex_ratio_list)
ttr_index_sort = sort_index(ttr_list)
content = inp_file.readlines()
file_sorter(content,density_file,lexical_density_ratio_list,lexical_index_sort)
file_sorter(content,sophistication_file,n_slex_ratio_list,n_slex_index_sort)
file_sorter(content,diversity_file,ttr_list,ttr_index_sort)

inp_file.close()

density_file.close()
sophistication_file.close()
diversity_file.close()



In [13]:
from scipy import stats
rll , pll = stats.pearsonr(lexical_density_ratio_list,lexical_density_ratio_list)
rln , pln = stats.pearsonr(lexical_density_ratio_list,n_slex_ratio_list)
rlt , plt = stats.pearsonr(lexical_density_ratio_list,ttr_list)

rnl , pnl = stats.pearsonr(n_slex_ratio_list,lexical_density_ratio_list)
rnn , pnn = stats.pearsonr(n_slex_ratio_list,n_slex_ratio_list)
rnt , pnt = stats.pearsonr(n_slex_ratio_list,ttr_list)

rtl , ptl = stats.pearsonr(ttr_list,lexical_density_ratio_list)
rtn , ptn = stats.pearsonr(ttr_list,n_slex_ratio_list)
rtt , ptt = stats.pearsonr(ttr_list,ttr_list)

In [14]:
correlations_file = open("correlations.txt", "w") 

correlations_file.write('\t' +'\t' +'\t' +'\t'+'\t'+'density' + '\t' + '\t'+ '\t'+'sophistication' + '\t' + '\t'+ '\t'+'diversity'+  '\n')
correlations_file.write('density' +'\t' + '\t'+'\t' +'\t'+str(rll) + '\t' +str(rln) + '\t' +str(rlt)+  '\n')
correlations_file.write('sophistication' +'\t' +'\t' +str(rnl) + '\t' +str(rnn) + '\t' +str(rnt)+  '\n')
correlations_file.write('diversity' +'\t' +'\t' + '\t'+str(rtl) + '\t' +str(rtn) + '\t' +str(rtt)+  '\n'+'\n'+'\n')
correlations_file.write("The values on the diagnal are all ~1 which is due to self corelation. The result of the table above shows that the linear corelation"+'\n')
correlations_file.write("between sophistication and dencity is higher than the other two. Diversity and density have the lowest corelation, because having Noun,Verb,adj "+'\n')
correlations_file.write("or adv does not change the diversity of the tweet. On the other hand, the results shoes that if we have more of the open-class words, the chance "+'\n')
correlations_file.write("of having more sophisticated words goes up.")
correlations_file.close()