# Inverted Indexes

For the Twitter data, we build two inverted indexes. The first uses the username as the key and the ids of that user's tweets as the values. The second is the word index for our tweets: each key is a word, and its value records the ids of the documents that contain that word, along with how many times the word occurs in each.

# Word Dictionary Inverted Index

The inverted index for the words in the corpus is built by going through each word in each document. It speeds up retrieval of relevant documents, because looking up a word returns the matching document ids directly instead of requiring a scan of every document.

In [17]:
def create_inverted_index(corpus):
    idx = {}
    
    for i, doc in enumerate(corpus):
        # Iterate through each whitespace-separated token in the document
        for word in doc.split():
            if word in idx:
                if i in idx[word]:
                    # Word already seen in this document: update its term frequency
                    idx[word][i] += 1
                else:
                    # First occurrence of the word in this document
                    idx[word][i] = 1
            else:
                # First occurrence of the word anywhere: add it to the inverted index
                idx[word] = {i: 1}
    
    return idx


In [18]:
test_users = ["vcu451", "chadfu", "SIX15"]
test_corpus = ["reading my kindle2  love it lee childs is good read", 
               "ok, first assesment of the kindle2 ...it fucking rocks", 
               "fuck this economy I hate aig and their non loan given asses"]

idx = create_inverted_index(test_corpus)

idx

{'...it': {1: 1},
 'I': {2: 1},
 'aig': {2: 1},
 'and': {2: 1},
 'asses': {2: 1},
 'assesment': {1: 1},
 'childs': {0: 1},
 'economy': {2: 1},
 'first': {1: 1},
 'fuck': {2: 1},
 'fucking': {1: 1},
 'given': {2: 1},
 'good': {0: 1},
 'hate': {2: 1},
 'is': {0: 1},
 'it': {0: 1},
 'kindle2': {0: 1, 1: 1},
 'lee': {0: 1},
 'loan': {2: 1},
 'love': {0: 1},
 'my': {0: 1},
 'non': {2: 1},
 'of': {1: 1},
 'ok,': {1: 1},
 'read': {0: 1},
 'reading': {0: 1},
 'rocks': {1: 1},
 'the': {1: 1},
 'their': {2: 1},
 'this': {2: 1}}
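
To illustrate how this index supports retrieval, here is a minimal sketch of a lookup that returns the documents containing every word of a query. The `search` helper and the `sample_idx` data are assumptions made for illustration, not part of the original notebook; `sample_idx` simply copies three entries from the output above.

```python
def search(idx, query):
    """Return the sorted ids of documents that contain every word in the query."""
    matches = None
    for word in query.split():
        # Ids of documents containing this word; empty if the word was never indexed
        docs = set(idx.get(word, {}))
        matches = docs if matches is None else matches & docs
    return sorted(matches) if matches else []


# Hypothetical sample: three entries in the same {word: {doc_id: count}} shape
sample_idx = {
    "kindle2": {0: 1, 1: 1},
    "love": {0: 1},
    "rocks": {1: 1},
}

print(search(sample_idx, "kindle2"))        # [0, 1]
print(search(sample_idx, "kindle2 rocks"))  # [1]
print(search(sample_idx, "kindle2 aig"))    # []
```

Because the index is keyed by the raw whitespace-separated token, a query word only matches if it appears in exactly that form; `'ok,'` and `'...it'` in the output above would not match `ok` or `it`.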

# User Tweet Index

The user tweet index allows all of the tweets (documents) that belong to a username to be retrieved quickly.

In [21]:
def create_user_index(users):
    
    user_tweets = {}
    
    # Go through each tweet's author; the position i is the document id
    for i, user in enumerate(users):
        # When the user already exists, add the document id to that user's tweet list
        if user in user_tweets:
            user_tweets[user].append(i)
        # Otherwise, create a new list with the document id
        else:
            user_tweets[user] = [i]
            
    return user_tweets

In [22]:
user_tweet_index = create_user_index(test_users)

user_tweet_index

{'SIX15': [2], 'chadfu': [1], 'vcu451': [0]}
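
Since the user index stores document ids rather than tweet text, retrieving a user's tweets takes one more step: mapping those ids back into the corpus. The sketch below shows one way this might look. The `tweets_for_user` helper and the sample data are hypothetical and not from the original notebook; the sample includes a user with two tweets, a case the test data above does not cover.

```python
def tweets_for_user(user_index, corpus, username):
    """Return the tweet texts for a username, or an empty list if the user is unknown."""
    return [corpus[doc_id] for doc_id in user_index.get(username, [])]


# Hypothetical sample data: "alice" wrote documents 0 and 2, "bob" wrote document 1
sample_corpus = ["first tweet", "second tweet", "third tweet"]
sample_user_index = {"alice": [0, 2], "bob": [1]}

print(tweets_for_user(sample_user_index, sample_corpus, "alice"))  # ['first tweet', 'third tweet']
print(tweets_for_user(sample_user_index, sample_corpus, "carol"))  # []
```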