In [None]:
from __future__ import absolute_import, division, print_function

![2017fiedler](http://www.engg.ksu.edu/images/2017fiedler.jpg)

## K-State Honor Code
>### "On my honor, as a student, I have neither given nor received unauthorized aid on this academic work."
>### "The assignment I am submitting contains my own words without borrowing the words of other people from the Internet or other sources (e.g., articles, lecture notes)."
>### Derek W. Christensen

## The Vector Space Model: Implementation
Created on Wed Dec 13 09:50:32 2017
@author: Derek Christensen
>### Vector Space Model, TF-IDF, Query, Ranked List
>### HW 2
>### CS 833 Information Retrieval and Text Mining
>### Cornelia Caragea
>### Department of Computer Science
>### Kansas State University
>### Fall 2017

## Your task is to implement a basic vector space retrieval system. You will use the Cranfield collection to develop and test your system.

#### The Cranfield collection is a standard IR text collection, consisting of 1400 documents from the aerodynamics field, in SGML format. The dataset, a list of queries, and the relevance judgments associated with these queries are available from Online K-State.

#### Tasks: To complete this assignment, you need to use the pre-processing tools implemented during assignment 1.
#### Note that you also need to eliminate the SGML tags (e.g., `<TITLE>`, `<DOC>`, `<TEXT>`, etc.) - you should only keep the actual title and text.
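
One possible way to drop the SGML tags and keep only the title and text is a pair of regular expressions. The following is a minimal sketch, not the approach used in the functions further down: the helper name `extract_title_and_text` is hypothetical, the sample document is made up, and its layout (one `<DOC>` holding `<DOCNO>`, `<TITLE>` and `<TEXT>`) is an assumption based on the tags named above.

```python
import re


def extract_title_and_text(sgml):
    """Return (title, text) with the SGML tags removed."""
    def grab(tag):
        # Non-greedy match of everything between <TAG> and </TAG>.
        pattern = r'<{0}>(.*?)</{0}>'.format(tag)
        match = re.search(pattern, sgml, re.DOTALL)
        # Collapse line breaks and runs of spaces into single spaces.
        return ' '.join(match.group(1).split()) if match else ''
    return grab('TITLE'), grab('TEXT')


# Made-up document; the layout is assumed, not copied from the collection.
sample = """<DOC>
<DOCNO>
1
</DOCNO>
<TITLE>
a made up title
spanning two lines .
</TITLE>
<TEXT>
a made up body of text .
</TEXT>
</DOC>"""

title, text = extract_title_and_text(sample)
print(title)
print(text)
```

A missing element yields an empty string rather than an error, which keeps the title and text lists aligned with the document list.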

1. Implement an indexing scheme based on the vector space model, as discussed in class. The
steps pointed out in class can be used as guidelines for the implementation. For the weighting
scheme, use and experiment with:

    • TF-IDF (do not divide TF by the maximum term frequency in a document).<br>
    <br>

2. For each of the ten queries in the queries.txt file, determine a ranked list of documents, in
descending order of their similarity with the query. The output of your retrieval should be
a list of (query_id, document_id) pairs.

Determine the average precision and recall for the ten queries, when you use:

    • top 10 documents in the ranking  
    • top 50 documents in the ranking  
    • top 100 documents in the ranking  
    • top 500 documents in the ranking  

Note: A list of relevant documents for each query is provided to you, so that you can determine
precision and recall.
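
The evaluation described above could be sketched as follows. This is a minimal illustration under stated assumptions: the function names `precision_recall_at_k` and `mean_precision_recall` are hypothetical, precision is divided by `k` itself (so each ranking is assumed to hold at least `k` documents), and the rankings and relevance judgments are toy data rather than Cranfield results.

```python
def precision_recall_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Precision and recall over the top k documents of one ranking."""
    relevant = set(relevant_doc_ids)
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant)
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall


def mean_precision_recall(rankings, judgments, k):
    """Average precision@k and recall@k over all queries in `rankings`."""
    results = [precision_recall_at_k(rankings[q], judgments.get(q, []), k)
               for q in rankings]
    n = float(len(results))
    return (sum(p for p, _ in results) / n,
            sum(r for _, r in results) / n)


# Toy data: query_id -> ranked document ids, query_id -> relevant ids.
rankings = {1: [3, 1, 7, 5], 2: [2, 9, 4, 8]}
judgments = {1: [1, 5, 6], 2: [2]}

mean_p, mean_r = mean_precision_recall(rankings, judgments, k=2)
print(mean_p, mean_r)
```

Calling `mean_precision_recall` once per cutoff (10, 50, 100, 500) would give the four pairs of averages the task asks for.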

Submission instructions:

1. write a README file including:<br>
    • a detailed note about the functionality of each of the above programs<br>
    • complete instructions on how to run them<br>
    • answers to the questions above<br>

2. make sure you include your name in each program and in the README file.
3. make sure all your programs run correctly on the CS machines. You will lose 40 points
if your code is not running on these machines. The path to the data should be an input
parameter, and not hardcoded.
4. submit your assignment through Online K-State.


<br>

# Vector Space Model: Implementation Steps<br>
  
>### Step 1: Preprocessing
>### Step 2: Indexing
>### Step 3: Retrieval  
>### Step 4: Ranking  

# Import Libraries

### Pretty Display of Variables

## Most performant - Python 2.7 and 3, dict comprehension:

### Imagine that you have:
##### >>> keys = ('name', 'age', 'food')
##### >>> values = ('Monty', 42, 'spam')
### What is the simplest way to produce the following dictionary?
##### >>> dict = {'name' : 'Monty', 'age' : 42, 'food' : 'spam'}

A possible improvement on using the dict constructor is to use the native syntax of a dict comprehension (not a list comprehension, as others have mistakenly put it):

##### >>> new_dict = {k: v for k, v in zip(keys, values)}

### In Python 2, `zip` returns a list; to avoid creating an unnecessary list, use `izip` instead
(aliasing it to `zip` reduces code changes when you move to Python 3).

## >>> from itertools import izip as zip
 
So that is still:

## >>> new_dict = {k: v for k, v in zip(keys, values)}

### Python 2, ideal for <= 2.6
 
izip from itertools becomes zip in Python 3. izip is better than zip for Python 2 (because it avoids the unnecessary list creation), and ideal for 2.6 or below:
##### >>> from itertools import izip
##### >>> new_dict = dict(izip(keys, values))

### Python 3
In Python 3, zip becomes the same function that was in the itertools module, so that is simply:
##### >>> new_dict = dict(zip(keys, values))
A dict comprehension would be more performant though (see the performance review below).

### Result for all cases:
In all cases:
##### >>> new_dict
{'age': 42, 'name': 'Monty', 'food': 'spam'}<br>

### Explanation:
If we look at the help on dict we see that it takes a variety of forms of arguments:
##### >>> help(dict)
class dict(object)<br>
 |  dict() -> new empty dictionary<br>
 |  dict(mapping) -> new dictionary initialized from a mapping object's<br>
 |      (key, value) pairs<br>
 |  dict(iterable) -> new dictionary initialized as if via:<br>
 |      d = {}<br>
 |      for k, v in iterable:<br>
 |          d[k] = v<br>
 |  dict(**kwargs) -> new dictionary initialized with the name=value pairs<br>
 |      in the keyword argument list.  For example:  dict(one=1, two=2)<br>
 
The optimal approach is to use an iterable while avoiding creating unnecessary data structures. In Python 2, zip creates an unnecessary list:

##### >>> zip(keys, values)
[('name', 'Monty'), ('age', 42), ('food', 'spam')]<br>
 
In Python 3, the equivalent would be:
##### >>> list(zip(keys, values))
[('name', 'Monty'), ('age', 42), ('food', 'spam')]<br>
 
and Python 3's zip merely creates an iterable object:
##### >>> zip(keys, values)
<zip object at 0x7f0e2ad029c8><br>
 
Since we want to avoid creating unnecessary data structures, we usually want to avoid Python 2's zip (since it creates an unnecessary list).

#### Less performant alternatives:
This is a generator expression being passed to the dict constructor:
##### >>> generator_expression = ((k, v) for k, v in zip(keys, values))
##### >>> dict(generator_expression)
or equivalently:
##### >>> dict((k, v) for k, v in zip(keys, values))
And this is a list comprehension being passed to the dict constructor:
##### >>> dict([(k, v) for k, v in zip(keys, values)])
In the first two cases, an extra layer of non-operative (thus unnecessary) computation is placed over the zip iterable, and in the case of the list comprehension, an extra list is unnecessarily created. I would expect all of them to be less performant, and certainly not more-so.

#### Performance review:
In 64 bit Python 3.4.3, on Ubuntu 14.04, ordered from fastest to slowest:
##### >>> min(timeit.repeat(lambda: {k: v for k, v in zip(keys, values)}))
0.7836067057214677<br>
##### >>> min(timeit.repeat(lambda: dict(zip(keys, values))))
1.0321204089559615<br>
##### >>> min(timeit.repeat(lambda: {keys[i]: values[i] for i in range(len(keys))}))
1.0714934510178864<br>
##### >>> min(timeit.repeat(lambda: dict([(k, v) for k, v in zip(keys, values)])))
1.6110592018812895<br>
##### >>> min(timeit.repeat(lambda: dict((k, v) for k, v in zip(keys, values))))
1.7361853648908436<br>

In [None]:
# from __future__ import absolute_import, division, print_function
# coding: utf-8

__author__ = 'Derek W. Christensen'
__email__ = 'cderekw@gmail.com'
__version__ = '0.0.0'

import sys
import functools
import math
import string
import random
from random import randrange
import pprint
import subprocess
import itertools
import hashlib

from math import pi
from bisect import bisect_left  

# Regular Expression
import re

import gc

from array import array

import os
from os import path

from operator import itemgetter, attrgetter

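# In Python 2, zip returns a list; to avoid creating an unnecessary list,
# use izip instead (aliasing it to zip reduces code changes when you move
# to Python 3). Python 3 has no itertools.izip because its built-in zip
# is already lazy, so the import is skipped there.
try:
    from itertools import izip as zip
except ImportError:
    pass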

# import timeit
# timeit.timeit('x=(1,2,3,4,5,6,7,8,9)', number=100000)

# to read and/or save text files or csv files, import csv
import csv

import collections
# Count words in list
from collections import Counter
from collections import defaultdict

# import nltk, which is a python package for natural language processing
import nltk
# to remove stopwords
from nltk.corpus import stopwords
# FreqDist, word_tokenize
from nltk import FreqDist, sent_tokenize, word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# from nltk.stem.snowball import EnglishStemmer
# Assuming we're working with English

# Data Visualization
import matplotlib.pyplot as plt
# import ipython
# % matplotlib inline
get_ipython().run_line_magic('matplotlib', 'inline')

# python package for text classification
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

#initialize countvectorizer
from sklearn.feature_extraction.text import CountVectorizer
#initialize TfidfTransformer
from sklearn.feature_extraction.text import TfidfTransformer

# support vector machine (another algorithm for classification)
from sklearn.svm import SVC

# Evaluating model performance
from sklearn import metrics

# Excel-like format
import pandas as pd
# to display max rows
pd.set_option('display.max_rows', 1000)
# to display max cols
pd.set_option('display.max_columns', 1000)
# to define width of cells
pd.set_option('display.max_colwidth', 1000)

# package for numbers
import numpy as np

from numpy import dot
from numpy.linalg import norm
#v1=doc_vec(doc1)
#v2=doc_vec(doc2)
#print "Similarity: %s" % float(dot(v1,v2) / (norm(v1) * norm(v2)))

from numpy import zeros
#def doc_vec(doc):
# v=zeros(len(key_idx)) # returns array([0,0,0....len(key_idx)])


# WordCloud
from wordcloud import WordCloud, STOPWORDS

# Count words in list
# from collections import Counter --> (see above)

# Pattern
from pattern.en import sentiment

# Seaborn
import seaborn

# IPython Display
from IPython.display import display, HTML
# You can include Youtube video in Ipython notebook
from IPython.display import YouTubeVideo
# Include images in ipython notebook
from IPython.display import Image
# Include webpages in ipython notebook
from IPython.core.display import HTML 

from bs4 import BeautifulSoup

# #Adding the following lines to a new script will clear all variables each time you rerun the script:
# from IPython import get_ipython
# get_ipython().magic('reset -sf')

# Functions

In [None]:
# def getStopwords(dir_path_stopwords)
# return(stopwords_from_file)


def getStopwords(dir_path_stopwords):
    
    files_stopwords = os.listdir(dir_path_stopwords)
    print('files_stopwords = ', files_stopwords)

    # union the stopwords from every file in the directory (one per line)
    stopwords_from_file = set()
    for fsw in files_stopwords:
        with open(os.path.join(dir_path_stopwords, os.path.basename(fsw)), 'r') as swfile:
            stopwords_from_file |= set(swfile.read().splitlines())

    print('stopwords_from_file = ', stopwords_from_file)
    
    return(stopwords_from_file)

In [None]:
# def getFiles(dir_path)
# return(files, file_names, file_idx, file_zip, file_dict, files_dict_enum)


def getFiles(dir_path):
    
#     print()
#     print('-----BEGIN getFiles METHOD-----')
#     print()

    files = os.listdir(dir_path)
    # reuse the same listing so both names refer to an identical file order
    file_names = list(files)
    print('files = ', files)
    print('file_names = ', file_names)
    print('len(file_names) = ', len(file_names))
    print()
    
    for i in range(len(file_names)):
        file_idx.append(i+1)
        print('file_idx[i] ', i, '= ', file_idx[i])
    print()
    
    print('file_idx = ', file_idx)
    print()
    
    # materialize the pairs: a lazy zip would be exhausted by dict() below
    file_zip = list(zip(file_idx, file_names))
    print('file_zip = ', file_zip)
    print()
    
    file_dict = dict(file_zip)
    print('file_dict = ', file_dict)
    print()
    
    files_dict_enum = {key:value for key, value in enumerate(file_names)}
    print('files_dict_enum = ', files_dict_enum)
    print()
    
#     print()
#     print('-----END getFiles METHOD-----')
#     print()
    
    return(files, file_names, file_idx, file_zip, file_dict, files_dict_enum)

In [None]:
# def getLines(files, dir_path)
# return(review, docnum, titles, texts)


def getLines(files, dir_path):
    
    print()
    print('-----BEGIN getLines METHOD-----')
    print()
    
    # accumulate the lines of the current TITLE or TEXT element in strtemp
    strtemp = ""

    for f in files:
        with open(dir_path+'/'+os.path.basename(f), 'r') as ipfile:
            i = 0
            for line in ipfile:
                line = line.strip()
                if i == 2:
                    docnum.append(line)
                    review.append(line)
                    i += 1
                elif i == 5:
                    strtemp += line
                    strtemp += " "
                    review.append(line)
                    i += 1
                    # read up to the closing tag; the loop also ends at
                    # end of file if the tag is missing
                    if line != '</TITLE>':
                        for line in ipfile:
                            line = line.strip()
                            review.append(line)
                            i += 1
                            if line == '</TITLE>':
                                break
                            strtemp += line
                            strtemp += " "
                    titles.append(strtemp)
                    strtemp = ""
                elif line == '<TEXT>':
                    review.append(line)
                    i += 1
                    # read up to the closing tag; the loop also ends at
                    # end of file if the tag is missing
                    for line in ipfile:
                        line = line.strip()
                        review.append(line)
                        i += 1
                        if line == '</TEXT>':
                            break
                        strtemp += line
                        strtemp += " "
                    texts.append(strtemp)
                    strtemp = ""
                else:
                    review.append(line)
                    i += 1
#            print('\nDone with file = ', ipfile, '\n')
#            print()

    print()
    print('-----END getLines METHOD-----')
    print()

    return(review, docnum, titles, texts)

In [None]:
# def getPerDocCorp(titles, texts)
# return(perDocCorp, corpus)


def getPerDocCorp(titles, texts):
    
    print()
    print('-----BEGIN getPerDocCorp METHOD-----')
    print()
    
    strtemp = ""
    corpustemp = ""
    
    for i in range(len(titles)):
        strtemp += titles[i]
        strtemp += texts[i]
        print('\ni = ', i)
        print('strtemp = ', strtemp)
        print()
        corpustemp += strtemp
        print('corpustemp = ', corpustemp, '\n')
        perDocCorp.append(strtemp)
        strtemp = ""
    
    corpus.append(corpustemp)
    print('corpus = ', corpus)
    
    print()
    print('-----END getPerDocCorp METHOD-----')
    print()
    
    return(perDocCorp, corpus)

In [None]:
# def getPerDocCorpClean(perDocCorp)
# return(perDocCorpClean, perDocLen, fdistPerDoc, fdistPerDocLen,
#       freq_word_PerDoc)

# for ea perDocCorp: tokenize, clean, stem, lem, stopwords, \
# shortwords, etc.


def getPerDocCorpClean(perDocCorp):

    print()
    print('-----BEGIN getPerDocCorpClean METHOD-----')
    print()
    
    i=0
    for doc in perDocCorp:
        
        # lenDocTokens = 0
        # fdist = {}
        # lenDocFdist = 0

        tokens = str(doc)
        print('tokens = str(doc)')
        print(len(tokens))
        print('\ntokens [', i, '] = ', tokens, '\n')

        # lowercase for content analysis; we assume, for example, that
        # LOVE is the same as love
        tokens = tokens.lower()
        print('tokens = tokens.lower()')
        print(len(tokens))
        print('\ntokens [', i, '] = ', tokens, '\n')

        # the dataset contains punctuation and other special characters
        # replace every character that is not a letter or a digit with a
        # space: a-zA-Z0-9 covers all English letters (lowercase &
        # uppercase) and digits, and the leading ^ negates the set
        tokens = re.sub("[^a-zA-Z0-9]", " ", tokens)
        print('tokens = re.sub("[^a-zA-Z0-9]", " ", tokens)')
        print(len(tokens))
        print('\ntokens [', i, '] = ', tokens, '\n')

        # tokenization or word split
        tokens = word_tokenize(tokens)
        print('tokens = word_tokenize(tokens)')
        print(len(tokens))
        print('\ntokens [', i, '] = ', tokens, '\n')

        # keep only purely alphabetic tokens (drops numbers and
        # mixed alphanumeric words)
        tokens = [word for word in tokens if word.isalpha()]

        # remove short words
        tokens = [word for word in tokens if len(word) > 2]

        # remove common words
        stoplist = stopwords.words('english')
        # if you want to remove additional words EXAMPLE
        # more = set(['much', 'even', 'time', 'story'])
        # more = set(['the'])
        # stoplist = set(stoplist) | more
        stoplist = set(stoplist) | stopwords_from_file
        stoplist = set(stoplist)
        tokens = [word for word in tokens if word not in stoplist]

        # stemming
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(word) for word in tokens]

# -----CLEANING COMPLETE-----

        perDocCorpClean.append(tokens)
        print('\nperDocCorpClean[', i, '] = ', tokens, '\n')

        lenDocTokens = len(tokens)
        print(lenDocTokens)
        perDocLen.append(lenDocTokens)
        print(perDocLen[i])
        print('\nperDocLen[', i, '] = ', perDocLen[i], '\n')

        fdist = nltk.FreqDist(tokens)
        fdistPerDoc.append(fdist)
        print(fdistPerDoc[i])
        print('\nfdistPerDoc[', i, '] = ', fdistPerDoc[i], '\n')
        print('\nfdistPerDoc[i].most_common(10) = \n',
              fdistPerDoc[i].most_common(10), '\n')

        lenDocFdist = len(fdist)
        print(lenDocFdist)
        fdistPerDocLen.append(lenDocFdist)
        print(fdistPerDocLen[i])
        print('\nfdistPerDocLen[', i, '] = ', fdistPerDocLen[i], '\n')

        # freq_word_PerDoc = []
        # prepare the results of word frequency on corpus data as a list

        freq_word = []

        # each fdist item is a (word, count) pair, giving two columns
        print()
        j = 0
        for k, v in fdist.items():
            freq_word.append([k, v])
            print('freq_word[', j, '] = ', freq_word[j])
            j += 1

        # make it like an Excel worksheet
        wordlist = pd.DataFrame(freq_word)

        # pd.set_option('display.max_rows', 1000)
        pd.set_option('display.max_rows', 10)
        
        wordlistSorted = wordlist.sort_values(by=[1, 0],
                                              ascending=[False, True])
        print(wordlistSorted)
        
        freq_word_PerDoc.append(wordlistSorted)
        print('\nfreq_word_PerDoc[', i, '] = ', freq_word_PerDoc[i], '\n')

        i += 1
    
    print()
    print('-----END getPerDocCorpClean METHOD-----')
    print()

    return(perDocCorpClean, perDocLen, fdistPerDoc, fdistPerDocLen,
           freq_word_PerDoc)

In [None]:
# def getCorpusClean(corpus)
# return(corpusClean, corpusLen, fdistCorpus, fdistCorpusLen,
#        freq_word_Corpus)

# for corpus: tokenize, clean, stem, lem, stopwords, \
# shortwords, etc.


def getCorpusClean(corpus):

    print()
    print('-----BEGIN getCorpusClean METHOD-----')
    print()

#    i = 0
#    for doc in perDocCorp:

    # lenDocTokens = 0
    # fdist = {}
    # lenDocFdist = 0

    tokens = str(corpus)
    print('tokens = str(corpus)')
    print(len(tokens))
    print('\ntokens = ', tokens, '\n')

    # lowercase for content analysis; we assume, for example, that
    # LOVE is the same as love
    tokens = tokens.lower()
    print('tokens = tokens.lower()')
    print(len(tokens))
    print('\ntokens = ', tokens, '\n')

    # the dataset contains punctuation and other special characters
    # replace every character that is not a letter or a digit with a
    # space: a-zA-Z0-9 covers all English letters (lowercase &
    # uppercase) and digits, and the leading ^ negates the set
    tokens = re.sub("[^a-zA-Z0-9]", " ", tokens)
    print('tokens = re.sub("[^a-zA-Z0-9]", " ", tokens)')
    print(len(tokens))
    print('\ntokens = ', tokens, '\n')

    # tokenization or word split
    tokens = word_tokenize(tokens)
    print('tokens = word_tokenize(tokens)')
    print(len(tokens))
    print('\ntokens = ', tokens, '\n')

    # keep only purely alphabetic tokens (drops numbers and
    # mixed alphanumeric words)
    tokens = [word for word in tokens if word.isalpha()]

    # remove short words
    tokens = [word for word in tokens if len(word) > 2]

    # remove common words
    stoplist = stopwords.words('english')
    # if you want to remove additional words EXAMPLE
    # more = set(['much', 'even', 'time', 'story'])
    # more = set(['the'])
    # stoplist = set(stoplist) | more
    stoplist = set(stoplist) | stopwords_from_file
    stoplist = set(stoplist)
    tokens = [word for word in tokens if word not in stoplist]

    # stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]
    # -----CLEANING COMPLETE-----

    corpusClean.append(tokens)
    print('\ncorpusClean = ', tokens, '\n')

    lenCorpusTokens = len(tokens)
    print('lenCorpusTokens = ', lenCorpusTokens)
    corpusLen.append(lenCorpusTokens)
    print(corpusLen)
    print('\ncorpusLen = ', corpusLen, '\n')

    fdist = nltk.FreqDist(tokens)
    fdistCorpus.append(fdist)
    print(fdistCorpus)
    print('\nfdistCorpus = ', fdistCorpus, '\n')
    print('\nfdistCorpus[0].most_common(10) = \n',
          fdistCorpus[0].most_common(10), '\n')

    lenCorpusFdist = len(fdist)
    print('lenCorpusFdist = ', lenCorpusFdist)

    fdistCorpusLen.append(lenCorpusFdist)
    print(fdistCorpusLen)
    print('\nfdistCorpusLen = ', fdistCorpusLen, '\n')

    # freq_word_PerDoc = []
    # prepare the results of word frequency on corpus data as a list

    freq_word = []

    # each fdist item is a (word, count) pair, giving two columns
    print()
    j = 0
    for k, v in fdist.items():
        freq_word.append([k, v])
        print('freq_word[', j, '] = ', freq_word[j])
        j += 1

    # make it like an Excel worksheet
    wordlist = pd.DataFrame(freq_word)

    # pd.set_option('display.max_rows', 1000)
    pd.set_option('display.max_rows', 10)
    wordlistSorted = wordlist.sort_values(by=[1, 0],
                                          ascending=[False, True])
#        print(wordlistSorted)
    freq_word_Corpus.append(wordlistSorted)
    print('\nfreq_word_Corpus = \n', freq_word_Corpus, '\n')

#    i += 1

    print()
    print('-----END getCorpusClean METHOD-----')
    print()

    return(corpusClean, corpusLen, fdistCorpus, fdistCorpusLen,
           freq_word_Corpus)

# Step 2: Indexing Functions

### Create Postings Dictionary Function
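
As a compact illustration of the data structure built here, a postings dictionary maps each term to the documents that contain it, together with the term frequency in each. The sketch below is an assumption-laden stand-in, not the function that follows: the name `build_postings` is hypothetical, it takes already-cleaned token lists directly, and the three toy documents are made up.

```python
from collections import Counter, defaultdict


def build_postings(tokenized_docs):
    """Map each term to {doc_id: term frequency in that document}."""
    postings = defaultdict(dict)
    for doc_id, tokens in enumerate(tokenized_docs):
        # Counter gives every distinct term with its count in one pass.
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    return postings


# Toy corpus of already tokenized, cleaned documents.
docs = [['wing', 'flow', 'wing'],
        ['flow', 'shock'],
        ['shock', 'wave', 'shock']]

postings = build_postings(docs)
print(dict(postings))
```

With this layout, the document frequency of a term is simply `len(postings[term])`.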

In [None]:
# create Postings dictionary function
# def getPostings(file_names, freq_word_PerDoc, perDocCorpClean)
# return(postings)


def getPostings(file_names, freq_word_PerDoc, perDocCorpClean):
    for docid in (range(len(file_names))):
        for word in freq_word_PerDoc[docid][0]:
            postings[word][docid] = perDocCorpClean[docid].count(word)
    return(postings)

### Create DF Dictionary Function

In [None]:
# create DF dictionary function
# getDF(file_names, freq_word_Corpus, postings)
# return(df)


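def getDF(file_names, freq_word_Corpus, postings):
    # document frequency = number of documents in a term's postings entry;
    # it does not depend on docid, so one pass over the vocabulary suffices
    for word in freq_word_Corpus[0][0]:
        df[word] = len(postings[word])
    return(df)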

### Create Inverted Index & docVecLen Functions
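
The functions in this part weight each term as `tf * idf` with `idf = log2(N / df)`, and take the length of a document vector as the square root of the sum of its squared weights. A minimal self-contained sketch of the same arithmetic is shown here; the name `tfidf_vector_lengths` and the toy postings are hypothetical, and the postings are assumed to map each term to `{doc_id: tf}`.

```python
import math


def tfidf_vector_lengths(postings, num_docs):
    """Return (idf per term, Euclidean length of each document vector)."""
    idf = {term: math.log(num_docs / float(len(docs)), 2)
           for term, docs in postings.items()}
    sum_squares = [0.0] * num_docs
    for term, docs in postings.items():
        for doc_id, tf in docs.items():
            # Raw term frequency times IDF, no max-TF normalization.
            sum_squares[doc_id] += (tf * idf[term]) ** 2
    return idf, [math.sqrt(total) for total in sum_squares]


# Toy postings: term -> {doc_id: term frequency}.
postings = {'wing': {0: 2}, 'flow': {0: 1, 1: 1}}
idf, lengths = tfidf_vector_lengths(postings, num_docs=2)
print(idf)
print(lengths)
```

Iterating over postings entries instead of every (term, document) pair skips the zero weights, which contribute nothing to the sum of squares. A term that occurs in every document gets an IDF of zero, so it contributes nothing either.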

In [None]:
# calculate IDF
# getIDF(word)
# return(idf)


def getIDF(word):
    print('      -----getIDF-----')
    print('      word = ', word)

    if word in fdistCorpus[0]:
        N = (len(file_names))
        print('      N = ', N)
        dfi = df[word]
        print('      dfi = df[word] = ', dfi)
        N_div_dfi = N / dfi
        print('      N_div_dfi = N / dfi = ', N_div_dfi)
        idf = math.log(N_div_dfi, 2)
        print('      idf = math.log(N / df[word], 2) = math.log(N / dfi, 2) = math.log(N_div_dfi, 2)')
        print('      idf = ', idf)
    else:
        idf = 0.0
        print('      idf = ', idf)
    print('      -----back to getWeight-----')
    return(idf)

In [None]:
# calculate weight
# getWeight(word, docid)
# return(weight)


def getWeight(word, docid):
    print('    -----getWeight-----')
    print('    word = ', word)
    print('    docid = ', docid)
    tf = 0
    idf = 0
    print()
    if docid in postings[word]:
        tf = postings[word][docid]
        print('    tf = postings[word][docid]')
        print('    tf = ', tf)
        
        idf = getIDF(word)
        print('    idf = ', idf)

        weight = tf * idf
        print('    weight = tf * idf = ', tf, '*', idf, '=', weight)
        print('    weight = ', weight)
    else:
        tf = 0
        print('    tf = ', tf)

        idf = getIDF(word)
        print('    idf = ', idf)

#         weight = 0.0
        weight = tf * idf
        print('    weight = tf * idf = ', tf, '*', idf, '=', weight)
        print('    weight = ', weight)
    print('    -----back to sumSquares-----')
    return(weight)

In [None]:
# create Inverted Index & docVecLen
# getDocVecLen(file_names, freq_word_Corpus)
# return(docVecLen)


def getDocVecLen(file_names, freq_word_Corpus):
    for docid in (range(len(file_names))):
        print('\n<<<<<<<<<<Calculate New docVecLen>>>>>>>>>>\n')
        sumSquares = 0
        for word in freq_word_Corpus[0][0]:
            print('  -----Calculate Update to sumSquares for next Word-----')
            print('  word = ', word)
            print('  docid = ', docid)

            # calculate weight
            # getWeight(word, docid)
            # return(weight)

            weight = getWeight(word, docid)
            print('  weight = ', weight)

            weight_sq = weight**2
            print('  weight_sq = weight**2 = ', weight_sq)
            print('  sumSquares = sumSquares + weight_sq = ', sumSquares, '+', weight_sq, '=')

            sumSquares += weight_sq
            print('  sumSquares = ', sumSquares)
            print()
        print('  docVecLen[docid=', docid,'] = math.sqrt(sumSquares) = math.sqrt(', sumSquares, ')')

        docVecLen[docid] = math.sqrt(sumSquares)
        print('  docVecLen[docid=', docid,'] = ', docVecLen[docid])
    return(docVecLen)

# Step 3: Retrieval Functions
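
Once a query has been cleaned into tokens, ranking by cosine similarity could look roughly like the sketch below. It is an illustration under assumptions, not the notebook's implementation: `rank_documents` is a hypothetical name, the query is weighted with the same `tf * idf` scheme as the documents, and the postings are toy data.

```python
import math
from collections import Counter


def rank_documents(query_tokens, postings, idf, doc_lengths):
    """Return (doc_id, cosine score) pairs, best match first."""
    query_tf = Counter(query_tokens)
    # Terms unseen in the collection have no IDF and are ignored.
    query_weights = {t: tf * idf[t] for t, tf in query_tf.items() if t in idf}
    query_length = math.sqrt(sum(w * w for w in query_weights.values()))
    dots = Counter()
    for term, q_weight in query_weights.items():
        for doc_id, tf in postings[term].items():
            dots[doc_id] += q_weight * tf * idf[term]
    ranked = []
    for doc_id, dot in dots.items():
        denom = query_length * doc_lengths[doc_id]
        if denom > 0:
            ranked.append((doc_id, dot / denom))
    # Highest score first; ties broken by the smaller document id.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


# Toy postings: term -> {doc_id: term frequency}.
postings = {'wing': {0: 2, 1: 1},
            'shock': {1: 1, 2: 3},
            'flow': {0: 1, 1: 1, 2: 1}}
num_docs = 3
idf = {t: math.log(num_docs / float(len(d)), 2) for t, d in postings.items()}
squares = [0.0] * num_docs
for t, d in postings.items():
    for doc_id, tf in d.items():
        squares[doc_id] += (tf * idf[t]) ** 2
doc_lengths = [math.sqrt(s) for s in squares]

ranking = rank_documents(['wing'], postings, idf, doc_lengths)
print(ranking)
```

Only documents that share at least one term with the query receive a score; if a full ranking of all documents is needed, the remaining ones can be appended with a score of zero.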

In [None]:
# def getQueries(dir_path_queries):
# return(queries_from_file)


def getQueries(dir_path_queries):
    files_queries = os.listdir(dir_path_queries)
    # gather the query lines from every file in the directory
    queries_from_file = []
    for fq in files_queries:
        with open(os.path.join(dir_path_queries, os.path.basename(fq)), 'r') as qfile:
            queries_from_file.extend(qfile.read().splitlines())
    return(queries_from_file)

In [None]:
# Get lines of Input
# def getQLines(q)
# return(qReview, qDocnum, qTexts)


def getQLines(q):
    
    print('\n-----BEGIN getQLines METHOD-----')
    
    # join the stripped lines of the query into one space-separated string
    strtemp = ""
    
    queryNum = 0
    qDocnum.append(queryNum)

    i = 0
    for line in q:
        line = line.strip()
        strtemp += line
        strtemp += " "
        
        qReview.append(line)
        i += 1
        
    qTexts.append(strtemp)
    strtemp = ""
    
    print('\nDone with query = ', q)

    print('\n-----END getQLines METHOD-----')

    return(qReview, qDocnum, qTexts)

In [None]:
# Generate Query corpus

# def getQCorp(qTexts)
# return(qCorp)


def getQCorp(qTexts):

    print('\n-----BEGIN getQCorp METHOD-----')
    
    strtemp = ""
    
    for i in range(len(qTexts)):
        strtemp += qTexts[i]
        print('\ni = ', i)
        print('strtemp = ', strtemp)
        qCorp.append(strtemp)
        strtemp = ""
        
    print('\nqCorp = ', qCorp)
    
    print('\n-----END getQCorp METHOD-----')
    
    return(qCorp)

In [None]:
# clean Query corpus

# def getQClean(qCorp):
# return(qClean, qLen, fdistQ, fdistQLen, 
#            freq_word_Q, freq_word_Qorpus)

# for ea q: tokenize, clean, stem, lem, stopwords, \
# shortwords, etc.


def getQClean(qCorp):

    print('\n-----BEGIN getQClean METHOD-----\n')
    
#     i = 0
#     for doc in qCorp:
        
#         # lenDocTokens = 0
#         # fdist = {}
#         # lenDocFdist = 0

#         tokens = str(doc)
    tokens = str(qCorp)
    print('tokens = str(qCorp)')
    print(len(tokens))
#     print('\ntokens [', i, '] = ', tokens, '\n')
    print('\ntokens = ', tokens, '\n')

    # lowercase for content analysis; we assume, for example, that
    # LOVE is the same as love
    tokens = tokens.lower()
    print('tokens = tokens.lower()')
    print(len(tokens))
#     print('\ntokens [', i, '] = ', tokens, '\n')
    print('\ntokens = ', tokens, '\n')
    
    # the dataset contains punctuation and other special characters
    # replace every character that is not a letter or a digit with a
    # space: a-zA-Z0-9 covers all English letters (lowercase &
    # uppercase) and digits, and the leading ^ negates the set
    tokens = re.sub("[^a-zA-Z0-9]", " ", tokens)
    print('tokens = re.sub("[^a-zA-Z0-9]", " ", tokens)')
    print(len(tokens))
#     print('\ntokens [', i, '] = ', tokens, '\n')
    print('\ntokens = ', tokens, '\n')

    # tokenization or word split
    tokens = word_tokenize(tokens)
    print('tokens = word_tokenize(tokens)')
    print(len(tokens))
#     print('\ntokens [', i, '] = ', tokens, '\n')
    print('\ntokens = ', tokens, '\n')

    # keep only purely alphabetic tokens (drops numbers and
    # mixed alphanumeric words)
    tokens = [word for word in tokens if word.isalpha()]
#    print('tokens = [word for word in tokens if word.isalpha()]')
#    print(len(tokens))
##        print('\ntokens [', i, '] = ', tokens, '\n')
#    print('\ntokens = ', tokens, '\n')

    # remove short words
    tokens = [word for word in tokens if len(word) > 2]
#    print('tokens = [word for word in tokens if len(word) > 2]')
#    print(len(tokens))
##        print('\ntokens [', i, '] = ', tokens, '\n')
#    print('\ntokens = ', tokens, '\n')

    # remove common words
    stoplist = stopwords.words('english')
    # if you want to remove additional words EXAMPLE
#        more = set(['much', 'even', 'time', 'story'])
    # more = set(['the'])
    # stoplist = set(stoplist) | more

    stoplist = set(stoplist) | stopwords_from_file
    stoplist = set(stoplist)

    tokens = [word for word in tokens if word not in stoplist]
#    print('stoplist = set(stoplist)')
#    print('tokens = [word for word in tokens if word not in stoplist]')
#    print(len(tokens))
#    print('\ntokens [', 0, '] = ', tokens[0], '\n')
#    print('\ntokens = ', tokens, '\n')

    # stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]
#    print('stemmer = PorterStemmer()')
#    print('tokens = [stemmer.stem(word) for word in tokens]')
#    print(len(tokens))
##        print('\ntokens [', i, '] = ', tokens, '\n')
#    print('\ntokens = ', tokens, '\n')

# -----CLEANING COMPLETE-----

    qClean.append(tokens)
#     print('\nqClean[', i, '] = ', tokens, '\n')
    print('\nqClean = ', tokens, '\n')

    lenQTokens = len(tokens)
    print('lenQTokens = ', lenQTokens)

    qLen.append(lenQTokens)
#    print('qLen[i] = ', qLen[i])
    print(qLen)
#    print('\nqLen[', i, '] = ', qLen[i], '\n')
    print('\nqLen = ', qLen, '\n')

    qfdist = nltk.FreqDist(tokens)
    print('qfdist = ', qfdist)
    print('type(qfdist) = ', type(qfdist))
    print('qfdist.items() = \n', qfdist.items())
    print()

    fdistQ.append(qfdist)
#     print('\nfdistQ[i] = ', fdistQ[i])
    print('fdistQ = ', fdistQ)
#     print('\nfdistQ[', i, '] = ', fdistQ[i], '\n')
    print('\nfdistQ = ', fdistQ, '\n')
    print('\nfdistQ[', 0, '] = ', fdistQ[0], '\n')

    print('\nfdistQ[0].most_common(10) = \n',
          fdistQ[0].most_common(10), '\n')

    lenQFdist = len(qfdist)
    print('lenQFdist = ', lenQFdist)
    
    fdistQLen.append(lenQFdist)
    print('fdistQLen = ', fdistQLen)
    print('fdistQLen[0] = ', fdistQLen[0])
#     print('\nfdistQLen[', i, '] = ', fdistQLen[i], '\n')
    print('\nfdistQLen = ', fdistQLen, '\n')

    # prepare the results of word frequency on corpus data as a list
#     freq_word_Q = []

    # two values or columns in fdist_a
    print()
    j = 0
    for k, v in qfdist.items():
        freq_word_Q.append([k, v])
        print('freq_word_Q[', j, '] = ', freq_word_Q[j])
        j += 1

    print('\n\n-----DONE WITH FOR K,V-----\n\n')
            
    # make it like an Excel worksheet
    # wordlist = pd.DataFrame(freq_word)
    qwordlist = pd.DataFrame(freq_word_Q)
    print('\nqwordlist = ', qwordlist)
    print()

    # pd.set_option('display.max_rows', 1000)
    pd.set_option('display.max_rows', 10)

#         wordlistSorted = wordlist.sort_values(by=[1, 0],
#                                               ascending=[False, True])
    qwordlistSorted = qwordlist.sort_values(by=[1, 0],
                                          ascending=[False, True])
    print('\nqwordlistSorted = \n', qwordlistSorted)

    freq_word_Qorpus.append(qwordlistSorted)
#     print('\nfreq_word_Q[', i, '] = ', freq_word_Qorpus[i], '\n')
    print('\nfreq_word_Qorpus[', 0, '] = \n', freq_word_Qorpus[0], '\n')
    print('\nfreq_word_Qorpus = \n', freq_word_Qorpus, '\n')

#         i += 1
    
    print('\n-----END getQClean METHOD-----\n')
    
    print('freq_word_Q[] = ', freq_word_Q)
    # slicing avoids an IndexError for queries with fewer than three terms
    print('freq_word_Q[:3] = ', freq_word_Q[:3])
    
    print('\nfreq_word_Q[0][0] = ', freq_word_Q[0][0])
    print()

    return(qClean, qLen, fdistQ, fdistQLen,
           freq_word_Q, freq_word_Qorpus)

In [None]:
# def getQTuples(freq_word_Q):
# return(q_tuple_words, q_tuple_freq_i)


def getQTuples(freq_word_Q):

    # freq_word_Q is a list of [word, frequency] pairs; split it column-wise
    q_tuple_words = tuple(word for word, _ in freq_word_Q)

    print('q_tuple_words = ', q_tuple_words)
    print('type(q_tuple_words) = ', type(q_tuple_words))

    q_tuple_freq_i = tuple(freq for _, freq in freq_word_Q)

    print('q_tuple_freq_i = ', q_tuple_freq_i)
    print('type(q_tuple_freq_i) = ', type(q_tuple_freq_i))

    return(q_tuple_words, q_tuple_freq_i)

In [None]:
# def intersection(post_word_keys):
#     return(docid_set)


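def intersection(post_word_keys):
    # NOTE: despite its name, this returns the UNION of the per-term document
    # sets, i.e. every document that contains at least one query term.
    sets = post_word_keys
    print('\nsets = ', sets)

#     docid_set = (reduce(set.intersection, [s for s in sets]))
    # set().union(*sets) also copes with an empty list, where reduce() raises
    docid_set = set().union(*sets)
    print('\ndocid_set = ', docid_set)

    return(docid_set)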

In [None]:
# def intersect(a, b):
#     return(c)

def intersect(a, b):
    if len(a) > len(b):
        a, b = b, a

    c = set()
    for x in a:
        if x in b:
            c.add(x)
    return(c)

In [None]:
# generate the set of retrieved (candidate) documents

# def getRetDoc(postings, q_tuple_words)
# return(retDoc)


def getRetDoc(postings, q_tuple_words):

    print('\npostings = ', postings)
    print('\nq_tuple_words = ', q_tuple_words)

    print('\ntype(postings) = ', type(postings))
    print('type(q_tuple_words) = ', type(q_tuple_words))
    print()

    print('postings.keys() = ', postings.keys())

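    # .get() avoids a KeyError (or a stray empty entry in a defaultdict)
    # for query terms that never occur in the collection
    post_word_keys = [set(postings.get(word, {}).keys())
                      for word in q_tuple_words]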
    print('\npost_word_keys = ', post_word_keys)

#     post_word_keys = ([(postings[word].keys()) for word in q_tuple_words])
#     print('\npost_word_keys = ', post_word_keys)

    # def intersection(post_word_keys):
    #     return(docid_set)

    docid_set = intersection(post_word_keys)
    print('\ndocid_set = ', docid_set)
    print()

    retDoc = docid_set
    print('\nretDoc = ', retDoc)
    print()

    return(retDoc)

In [None]:
# def getCosSim(docid, q_tuple_words,
#               q_tuple_freq_i, fdistCorpus):
# return(cosSim)

def getCosSim(docid, q_tuple_words,
              q_tuple_freq_i, fdistCorpus):
    similarity = 0.0
    cosSim = 0.0
    qTF = 0
    qIDF = 0.0
    qWeight = 0.0
    qWeightSquared = 0.0
    qSumWeightSquared = 0.0
    global qVecLen
    docWordWeight = 0.0
    x = 0
    for word in q_tuple_words:
#         qTF = q_tuple_freq_i[x]
        if word in fdistCorpus[0]:
            print('word = ', word)
            qTF = q_tuple_freq_i[x]
            print('qTF = ', qTF)
            qIDF = getIDF(word)
            print('qIDF = ', qIDF)
            qWeight = qTF * qIDF
            print('qWeight = ', qWeight)
            qWeightSquared = qWeight**2
            qSumWeightSquared += qWeightSquared

            docWordWeight = getWeight(word, docid)

            similarity += qWeight * docWordWeight
        x += 1

#     print('\ndocid = ', docid)
    
    qVecLen = math.sqrt(qSumWeightSquared)
#     print('\nqVecLen = ', qVecLen)

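    # guard against a zero-length query or document vector
    vecLenProduct = qVecLen * docVecLen[docid]
    cosSim = similarity / vecLenProduct if vecLenProduct else 0.0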
#     print('\ncosSim = ', cosSim)

    return(cosSim)

In [None]:
# def getCosSimScoresList(retDoc, q_tuple_words,
#                         q_tuple_freq_i, fdistCorpus):
# return(cosSimScoresList)

def getCosSimScoresList(retDoc, q_tuple_words,
                        q_tuple_freq_i, fdistCorpus):

    cosSimScoresList = [
        (docid+1, getCosSim(docid, q_tuple_words, q_tuple_freq_i, fdistCorpus)) 
        for docid in retDoc]

    return(cosSimScoresList)

# Step 4: Ranking Functions

In [None]:
# def getRankCosSimList(cosSimScoresList)
# return(rankCosSimList)


def getRankCosSimList(cosSimScoresList):
    rankCosSimList = sorted(cosSimScoresList, key=lambda l: l[1], reverse=True)
    print('\nrankCosSimList = \n', rankCosSimList)
    return(rankCosSimList)

In [None]:
#    def getRankListPerQ(qNum, queries_from_file,
#                        postings, fdistCorpus)
#    return(rankListPerQ)


def getRankListPerQ(qNum, queries_from_file,
                    postings, fdistCorpus):

    # def getQLines(q)
    # return(qReview, qDocnum, qTexts)
    global q
    q = []
    print('q = ', q)
    global qReview
    qReview = []
    print('qReview = ', qReview)
    global qDocnum
    qDocnum = []
    print('qDocnum = ', qDocnum)
    # qTitles = []
    global qTexts
    qTexts = []
    print('qTexts = ', qTexts)

    # def getQCorp(qTexts)
    # return(qCorp)
    global qCorp
    qCorp = []
    print('qCorp = ', qCorp)

    # def getQClean(qCorp):
    # return(qClean, qLen, fdistQ, fdistQLen,
    #            freq_word_Q, freq_word_Qorpus)
    global qClean
    qClean = []
    print('qClean = ', qClean)
    global qLen
    qLen = []
    print('qLen = ', qLen)
    global fdistQ
    fdistQ = []
    print('fdistQ = ', fdistQ)
    global fdistQLen
    fdistQLen = []
    print('fdistQLen = ', fdistQLen)
    global freq_word_Q
    freq_word_Q = []
    print('freq_word_Q = ', freq_word_Q)
    global freq_word_Qorpus
    freq_word_Qorpus = []
    print('freq_word_Qorpus = ', freq_word_Qorpus)

    # def getRetDoc(postings, q_tuple_words)
    # return(retDoc)
    global retDoc
    retDoc = []
    print('retDoc = ', retDoc)

    # def getCosSimScoresList(retDoc, q_tuple_words,
    #                         q_tuple_freq_i, fdistCorpus):
    # return(cosSimScoresList)
    cosSimScoresList = []  # getCosSimScoresList returns a list of tuples

    # def getRankCosSimList(cosSimScoresList)
    # return(rankCosSimList)
    global rankCosSimList
    rankCosSimList = []
    print('rankCosSimList = ', rankCosSimList)

    query = []
    print('query = ', query)

    # named queryText so the built-in input() is not shadowed
    queryText = queries_from_file[qNum]
    print('queryText = queries_from_file[', qNum, '] = ', queryText)

    query.append(queryText)
    q = query

    # def getQLines(q)
    # return(qReview, qDocnum, qTexts)

    qReview, qDocnum, qTexts = getQLines(q)

    print('\nqTexts = ', qTexts)
    print()
    print('DONE ASSIGNING DOCNUM TITLES AND TEXTS')
    print('\n-----END OF getQLines-----')

    # def getQCorp(qTexts)
    # return(qCorp)

    qCorp = getQCorp(qTexts)

    # for ea qCorp: tokenize, clean, stem, lem, stopwords,\
    # shortwords, etc.

    # def getQClean(qCorp):
    # return(qClean, qLen, fdistQ, fdistQLen,
    #            freq_word_Q, freq_word_Qorpus)

    qClean, qLen, fdistQ, fdistQLen, freq_word_Q, freq_word_Qorpus\
        = getQClean(qCorp)

    for x in range(len(qClean)):
        print()
#        print('qClean[', x, '] = ', qClean[x])
#        print('qLen[', x, '] = ', qLen[x])
        print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
        print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
        print('\nfdistQ[', x, '] = ', fdistQ[x])
        print('\nfdistQLen[', x, '] = ', fdistQLen[x])
        print('\nfreq_word_Q[', x, '] = \n', freq_word_Q[x])
        print('\nfreq_word_Qorpus[', x, '] = \n', freq_word_Qorpus[x])
        print()
        print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
        print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
        print()

    print('type(fdistQ) = ', type(fdistQ))
    print('type(freq_word_Q) = ', type(freq_word_Q))
    print('type(freq_word_Qorpus) = ', type(freq_word_Qorpus))
    print('\nfdistQ = ', fdistQ)
    print('\nfreq_word_Q = ', freq_word_Q)
    print('\nfreq_word_Qorpus = \n', freq_word_Qorpus)
    print()

    print('\nfreq_word_Q[:] = \n', freq_word_Q[:], '\n')
    print('\ntype(freq_word_Q) = ', type(freq_word_Q))

    qClean0 = str(qClean[0])

    print('\nqClean0 =\n', qClean0)
    print('type(qClean0) = ', type(qClean0))

    # def getQTuples(freq_word_Q):
    # return(q_tuple_words, q_tuple_freq_i)

    q_tuple_words, q_tuple_freq_i = getQTuples(freq_word_Q)

    # def getRetDoc(postings, q_tuple)
    # return(retDoc)

#    print('\npostings = ', postings)

    retDoc = getRetDoc(postings, q_tuple_words)

    print('\nq_tuple_words = ', q_tuple_words)
    print('\nq_tuple_freq_i = ', q_tuple_freq_i)
    print('\nfdistCorpus = ', fdistCorpus)

    cosSimScoresList = getCosSimScoresList(retDoc, q_tuple_words,
                                           q_tuple_freq_i, fdistCorpus)

    '''
    cosSimScoresList[ 0 ] =  (1, 0.016380446956997974)
    cosSimScoresList[ 1 ] =  (2, 0.18149296974189377)
    cosSimScoresList[ 2 ] =  (4, 0.130410097650556)
    cosSimScoresList[ 3 ] =  (5, 0.1191294611973798)
    cosSimScoresList[ 4 ] =  (6, 0.05303561515807138)
    '''

    print('\nqVecLen = ', qVecLen)
    print()

    print('\nOUTPUT of cosSimScoresList values')
    print()
    # range() works on Python 2 and 3; min() guards against short lists
    for z in range(min(5, len(cosSimScoresList))):
        print('cosSimScoresList[', z, '] = ', cosSimScoresList[z])

    # def getRankCosSimList(cosSimScoresList)
    # return(rankCosSimList)

    rankCosSimList = getRankCosSimList(cosSimScoresList)

    print('\ntype(rankCosSimList) = ', type(rankCosSimList))

    '''
    rankCosSimList[ 0 ] =  (323, 0.36102506311675114)
    rankCosSimList[ 1 ] =  (322, 0.3515479432650447)
    rankCosSimList[ 2 ] =  (1394, 0.3512276970366803)
    rankCosSimList[ 3 ] =  (628, 0.3464697753459441)
    rankCosSimList[ 4 ] =  (179, 0.310596542107325)

    VERSUS [(docid + 1) in cosSimScoresList]

    rankCosSimList[ 0 ] =  (324, 0.36102506311675114)
    rankCosSimList[ 1 ] =  (323, 0.3515479432650447)
    rankCosSimList[ 2 ] =  (1395, 0.3512276970366803)
    rankCosSimList[ 3 ] =  (629, 0.3464697753459441)
    rankCosSimList[ 4 ] =  (180, 0.310596542107325)
    '''

    print('len(rankCosSimList) = ', len(rankCosSimList))
#    print()

    print()
    for y in range(min(5, len(rankCosSimList))):
        print('rankCosSimList[', y, '] = ', rankCosSimList[y])

    rankListPerQ = rankCosSimList

    print()
    for z in range(min(5, len(rankListPerQ))):
        print('rankListPerQ[', z, '] = ', rankListPerQ[z])

    return(rankListPerQ)

In [None]:
# sendToOutputFolder(dir_path_output, output_qid_docid)

def sendToOutputFolder(dir_path_output, output_qid_docid):
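    # NOTE: every file already present in dir_path_output is overwritten
    # with the full list of (query_id, document_id) pairs.
    files_output = os.listdir(dir_path_output)
    for fo in files_output:
        out_path = os.path.join(dir_path_output, os.path.basename(fo))
        with open(out_path, 'w') as ofile:
            ofile.write('\n'.join('{} {}'.format(qiddocid[0], qiddocid[1]) for
                                  qiddocid in output_qid_docid))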
    return()

In [None]:
# def getRelevance(dir_path_relevance):
# return(relevance_from_file)

# relevance_from_file = getRelevance(dir_path_relevance)


def getRelevance(dir_path_relevance):
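    # start with an empty list so the function still returns a value when
    # the directory is empty; blank lines in the file are skipped
    relevance_from_file = []
    files_relevance = os.listdir(dir_path_relevance)
    for fr in files_relevance:
        with open(dir_path_relevance+'/'+os.path.basename(fr), 'r') as rfile:
#            relevance_from_file = (rfile.read().splitlines())
            relevance_from_file = [tuple(int(n) for n in line.split())
                                   for line in rfile if line.strip()]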
    return(relevance_from_file)

In [None]:
# def getQtyRelDocPerQ(relevance_from_file):
# return(qtyRelDocPerQ)

# qtyRelDocPerQ = getQtyRelDocPerQ(relevance_from_file)


def getQtyRelDocPerQ(relevance_from_file):
    qNum = 1
    qtyRelDoc = 0
    totRelDoc = len(relevance_from_file)
    print('totRelDoc = ', totRelDoc)
    for n in range(totRelDoc):
        print('\nn = ', n)
        # logical 'and' rather than the bitwise '&' operator
        if (n == totRelDoc - 1) and (relevance_from_file[n][0] == qNum):
            qtyRelDoc += 1
            print('qtyRelDoc = ', qtyRelDoc)
            print('A Q DOC = ', qNum, ' ', relevance_from_file[n][1])
            qtyRelDocPerQ.append(qtyRelDoc)
            print('\nqtyRelDocPerQ = ', qtyRelDocPerQ)
        elif relevance_from_file[n][0] == qNum:
            qtyRelDoc += 1
            print('qtyRelDoc = ', qtyRelDoc)
            print('B Q DOC = ', qNum, ' ', relevance_from_file[n][1])
        elif relevance_from_file[n][0] == qNum + 1:
            qtyRelDocPerQ.append(qtyRelDoc)
            print('\nqtyRelDocPerQ = ', qtyRelDocPerQ, '\n')
            qtyRelDoc = 1
            print('qtyRelDoc = ', qtyRelDoc)
            qNum += 1
            print('C Q DOC = ', qNum, ' ', relevance_from_file[n][1])
    return(qtyRelDocPerQ)
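
The per-query counts that `getQtyRelDocPerQ` builds with a running counter can also be obtained by tallying the query ids directly. The sketch below is a possible alternative, not the notebook's method: `count_relevant_per_query` is a hypothetical helper name, and the `pairs` data is made up for illustration rather than taken from the Cranfield relevance file.

```python
from collections import Counter


def count_relevant_per_query(relevance_pairs):
    """Return the number of relevant documents per query, ordered by query id.

    relevance_pairs is assumed to be an iterable of (query_id, doc_id)
    tuples, the same shape getRelevance() returns.
    """
    counts = Counter(query_id for query_id, _ in relevance_pairs)
    return [counts[query_id] for query_id in sorted(counts)]


# made-up relevance judgments, for illustration only
pairs = [(1, 184), (1, 29), (2, 12), (2, 15), (2, 184), (3, 5)]
print(count_relevant_per_query(pairs))
```

Because the tally does not depend on the order of the lines or on a module-level list, calling it twice does not append to earlier results.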

# Main

In [None]:
print()
#-----------------
# constants
############################################################
# ______path to cranfieldDocs directory_____

# dir_path = 'C:/Users/Derek Christensen/Dropbox/_cis833irtm/hw2//\
    # cranfieldDocs'
# dir_path = 'C:/Users/Derek Christensen/Dropbox/_cis833irtm/hw2/data'
# dir_path = 'C:/Users/Derek Christensen/Dropbox/_cis833irtm/hw2/data-temp'
# dir_path = 'C:/Users/Derek Christensen/Dropbox/_cis833irtm/hw2/data-misc'

dir_path = 'C:/Users/derekc/Dropbox/__cis833irtm/hw2/cranfieldDocs'

# dir_path = r'C:/Users/derekc/Dropbox/__cis833irtm/hw2/data-1'
# dir_path = r'C:/Users/derekc/Dropbox/__cis833irtm/hw2/data-fox'
# dir_path = r'C:/Users/derekc/Dropbox/__cis833irtm/hw2/data-fox2'
# dir_path = r'C:/Users/derekc/Dropbox/__cis833irtm/hw2/data'
# dir_path = r'C:/Users/derekc/Dropbox/__cis833irtm/hw2/data-15'
# dir_path = r'C:/Users/derekc/Dropbox/__cis833irtm/hw2/data-Q2'
# dir_path = r'C:/Users/derekc/Dropbox/__cis833irtm/hw2/data-Q2-2'
# dir_path = r'C:/Users/derekc/Dropbox/__cis833irtm/hw2/data-Q2-3'

dir_path_stopwords = r'C:/Users/derekc/Dropbox/__cis833irtm/hw2/stopwords'

dir_path_queries = r'C:/Users/derekc/Dropbox/__cis833irtm/hw2/queries'
# dir_path_queries = r'C:/Users/derekc/Dropbox/__cis833irtm/hw2/queries2'
# dir_path_queries = r'C:/Users/derekc/Dropbox/__cis833irtm/hw2/Q2'

dir_path_output = r'C:/Users/derekc/Dropbox/__cis833irtm/hw2/output'

dir_path_relevance = r'C:/Users/derekc/Dropbox/__cis833irtm/hw2/relevance'
# dir_path_relevance = r'C:/Users/derekc/Dropbox/__cis833irtm/hw2/rel-Q2'
# dir_path_relevance = r'C:/Users/derekc/Dropbox/__cis833irtm/hw2/rel-Q2-2'
# dir_path_relevance = r'C:/Users/derekc/Dropbox/__cis833irtm/hw2/rel-Q123'

In [None]:
print('\nInput for dir_path_queries folder: = ', dir_path_queries)

In [None]:
# declare arrays, variables

# def getFiles(dir_path)
# return(files, file_names, file_idx, file_zip, file_dict, file_dict_enum)

global stopwords_from_file
global queries_from_file

files = []
file_names = []

file_idx = []
file_zip = []
file_dict = {}
file_dict_enum = {}

# def getLines(files, dir_path)
# return(review, docnum, titles, texts) 
review = []
docnum = []
titles = []
texts = []
tf = []
j = 0

# def getPerDocCorp(titles, texts)
# return(perDocCorp, corpus)
perDocCorp = []
corpus = []

# def getPerDocCorpClean(perDocCorp)
# return(perDocCorpClean, perDocLen, fdistPerDoc, fdistPerDocLen,
#       freq_word_PerDoc)
perDocCorpClean = []
perDocLen = []
fdistPerDoc = []
fdistPerDocLen = []
freq_word_PerDoc = []

# def getCorpusClean(corpus)
# return(corpusClean, corpusLen, fdistCorpus, fdistCorpusLen,
#        freq_word_Corpus)
corpusClean = []
corpusLen = []
fdistCorpus = []
fdistCorpusLen = []
freq_word_Corpus = []

# def getPostings(file_names, freq_word_PerDoc, perDocCorpClean)
# return(postings)
postings = defaultdict(dict)

# getDF(file_names, freq_word_Corpus, postings)
# return(df)
df = defaultdict(int)

# getDocVecLen(file_names, freq_word_Corpus)
# return(docVecLen)

docVecLen = defaultdict(float)

# def getRankListPerQ(qNum, queries_from_file,
#                     postings, fdistCorpus)
# return(rankListPerQ)
rankListPerQ = []
global output_qid_docid
output_qid_docid = []

# def getQtyRelDocPerQ(relevance_from_file):
# return(qtyRelDocPerQ)

global qtyRelDocPerQ
qtyRelDocPerQ = []

# Step 1. Preprocessing  
Write a program that preprocesses the collection.<br>
This preprocessing stage should specifically include a function that tokenizes the text.<br>
In doing so, tokenize on whitespace and remove punctuation.<br>
#### Note that you also need to eliminate the SGML tags (e.g., `<TITLE>`, `<DOC>`, `<TEXT>`); keep only the actual title and text.

## • Input: Documents that are read one by one from the collection
### • Implement the preprocessing functions:  
>#### • For tokenization  
>#### • For stop word removal  
>#### • For stemming  
## • Output: Tokens to be added to the index  
>#### • No punctuation, no stop-words, stemmed
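
The input and output described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the notebook's own functions: `preprocess`, `TAG_RE` and `STOP_WORDS` are hypothetical names, the stop list is a tiny stand-in for the NLTK list plus the stop-word file, lowercasing is assumed, and stemming (done with NLTK's `PorterStemmer` in the notebook) is left out so the example has no external dependency.

```python
import re
import string

# Hypothetical mini stop list; the notebook combines NLTK's English list
# with a stop-word file instead.
STOP_WORDS = {'the', 'of', 'and', 'a', 'in', 'to', 'is', 'for'}

# Matches SGML tags such as <TITLE> or </DOC>
TAG_RE = re.compile(r'<[^>]+>')


def preprocess(raw_text, stop_words=STOP_WORDS):
    """Strip SGML tags, tokenize on whitespace, drop punctuation and stop words."""
    text = TAG_RE.sub(' ', raw_text).lower()
    tokens = []
    for token in text.split():
        token = token.strip(string.punctuation)
        # keep alphabetic tokens longer than two characters
        if token.isalpha() and len(token) > 2 and token not in stop_words:
            tokens.append(token)
    return tokens


sample = ('<DOC><TITLE>Flow of a gas.</TITLE>'
          '<TEXT>The boundary layer, in shear flow.</TEXT></DOC>')
print(preprocess(sample))
```

A stemming step would be applied to the returned tokens before they are added to the index.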

## • Input: Documents that are read one by one from the collection

In [None]:
# def getStopwords(dir_path_stopwords)
# return(stopwords_from_file)

stopwords_from_file = getStopwords(dir_path_stopwords)

In [None]:
# get all files inside the directory & process to arrays & dicts
# getFiles(dir_path, files, file_names, file_idx, file_zip, file_dict)
# print(files)
#
# def getFiles(dir_path)
# return(files, file_names, file_idx, file_zip, file_dict, files_dict_enum)

files, file_names, file_idx, file_zip, file_dict, file_dict_enum = \
    getFiles(dir_path)

print('files = ', files)
print('file_names = ', file_names)
print('len(file_names) = ', len(file_names))
print()
print('file_idx = ', file_idx)
print()
print('file_zip = ', file_zip)
print()
print('file_dict = ', file_dict)
print()
print('file_dict_enum = ', file_dict_enum)
print()

## • Implement the preprocessing functions:  
>### • For tokenization  
>### • For stop word removal  
>### • For stemming  

# Eliminate SGML tags & only keep TITLE & TEXT

# Reading data as list & eliminate SGML tags

In [None]:
# start processing the input files & break them into lines
#
# def getLines(files, dir_path):
# return(review, docnum, titles, texts)

review, docnum, titles, texts = getLines(files, dir_path)

print('REVIEW = \n', review)
print()

In [None]:
print('review = ', review)
print()
print('docnum = ', docnum)
print()
print('titles = ', titles)
print()
print('texts = ', texts)
print()
print('DONE ASSIGNING DOCNUM TITLES AND TEXTS')
print()
print('\n-----END OF getLines-----\n\n')

In [None]:
# print 1st 2 tokens in list review
print('review[:2] = ', review[:2])

In [None]:
print('review[0] = ', review[0])

In [None]:
print('review[0:3] = ', review[0:3])

In [None]:
print('Number of tokens = len(review) = ', len(review))

# Merge Titles and Texts

In [None]:
# merge TITLES & TEXTS into 1 STRING per DOC

# def getPerDocCorp(titles, texts)
# return(perDocCorp, corpus)

perDocCorp, corpus = getPerDocCorp(titles, texts)

print()
print('perDocCorp = \n', perDocCorp)
print()
print('corpus = \n', corpus)
print()

In [None]:
# //////////////////////////////////////////////////
#    PRINT EACH DOC'S CORP
# \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\

for i in range(len(perDocCorp)):
    print('\nperDocCorp[', i, '] = ', perDocCorp[i])

In [None]:
print()
print('perDocCorp[0:1] = \n', perDocCorp[0:1])
print()
print('perDocCorp[1:2] = \n', perDocCorp[1:2])
print()
print('perDocCorp[2:3] = \n', perDocCorp[2:3])
print()
print('perDocCorp[3:4] = \n', perDocCorp[3:4])
print()
print('perDocCorp[4:5] = \n', perDocCorp[4:5])

In [None]:
print()
print('perDocCorp[:1] = \n', perDocCorp[:1])
print()
print('perDocCorp[0:2] = \n', perDocCorp[0:2])
print()
print('perDocCorp[0:3] = \n', perDocCorp[0:3])
print()
print()
print('perDocCorp[2:4] = \n', perDocCorp[2:4])
print()

# Clean Corpus

## • Output: Tokens to be added to the index  
>### • No punctuation, no stop-words, stemmed

In [None]:
# for ea perDocCorp: tokenize, clean, stem, lem, stopwords, \
# shortwords, etc.

# def getPerDocCorpClean(perDocCorp)
# return(perDocCorpClean, perDocLen, fdistPerDoc, fdistPerDocLen,
#   freq_word_PerDoc)

perDocCorpClean, perDocLen, fdistPerDoc, fdistPerDocLen, freq_word_PerDoc \
    = getPerDocCorpClean(perDocCorp)

In [None]:
print(perDocCorpClean[0:])

In [None]:
print(perDocCorpClean)

In [None]:
print('\nperDocCorpClean =\n', perDocCorpClean, '\n')
for x in range(len(perDocCorpClean)):
    print('perDocCorpClean[', x, '] = ', perDocCorpClean[x])
    print('perDocLen[', x, '] = ', perDocLen[x])
    print('fdistPerDoc[', x, '] = ', fdistPerDoc[x])
    print('fdistPerDocLen[', x, '] = ', fdistPerDocLen[x])
    print('freq_word_PerDoc[', x, '] = \n', freq_word_PerDoc[x])
    print()

print('len(perDocCorpClean) = ', len(perDocCorpClean))
print()

In [None]:
fdistPerDoc

In [None]:
# //////////////////////////////////////////////////
#    fdistPerDoc[2]
# \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\

print('fdistPerDoc[ 2 ] = ', fdistPerDoc[ 2 ])

In [None]:
fdistPerDoc[ 2 ]

In [None]:
# //////////////////////////////////////////////////////////////////////
# doc156words = {k: fdistPerDoc[2][k] for k in fdistPerDoc[2].keys()[:]}
# \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\

# dict() copies the FreqDist; slicing .keys() only works on Python 2
doc156words = dict(fdistPerDoc[2])
print('len(doc156words) = ', len(doc156words))
doc156words

In [None]:
print('\ndoc156words =\n', doc156words)

In [None]:
corpus

In [None]:
# for corpus: tokenize, clean, stem, lem, stopwords, \
# shortwords, etc.

# def getCorpusClean(corpus)
# return(corpusClean, corpusLen, fdistCorpus, fdistCorpusLen,
#        freq_word_Corpus)

corpusClean, corpusLen, fdistCorpus, fdistCorpusLen, freq_word_Corpus\
    = getCorpusClean(corpus)

In [None]:
print('\ncorpusClean =\n', corpusClean, '\n')
for x in range(len(corpusClean)):
    print('corpusClean[', x, '] = ', corpusClean[x])
    print('corpusLen[', x, '] = ', corpusLen[x])
    print('fdistCorpus[', x, '] = ', fdistCorpus[x])
    print('fdistCorpusLen[', x, '] = ', fdistCorpusLen[x])
    print('freq_word_Corpus[', x, '] = \n', freq_word_Corpus[x])
    print()

print('len(corpusClean) = ', len(corpusClean))
print()

In [None]:
print('\nfdistCorpus = ', fdistCorpus)
print()

In [None]:
fdistCorpus

In [None]:
print('files = ', files)
print('file_names = ', file_names)
print('len(file_names) = ', len(file_names))
print()
print('file_idx = ', file_idx)
print()
print('file_zip = ', file_zip)
print()
# print('file_dict = ', file_dict)
# print()
# print('files_dict_enum = ', file_dict_enum)
# print()

# Step 2: Indexing  
1. Implement an indexing scheme based on the vector space model, as discussed in class. The
steps pointed out in class can be used as guidelines for the implementation. For the weighting
scheme, use and experiment with:  
• TF-IDF (do not divide TF by the maximum term frequency in a document).

### • Build an inverted index, with an entry for each word in the vocabulary
## • Input: Tokens obtained from the preprocessing module
## • Output: An inverted index for fast access
### • Many data structures are appropriate for fast access
>#### • We will use hashtables
        * Store tokens in hashtable, with token string as key and weight as value.
        * Table must fit in main memory.

### • We need:
>#### • One entry for each word in the vocabulary
>#### • For each such entry:
    * Keep a list of all the documents where it appears together with the corresponding frequency --> TF
    * Keep the total number of documents in which the corresponding word appears --> DF, from which IDF is derived
### • Constant time to find or update weight of a specific token.
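
The requirements above map naturally onto a dictionary of dictionaries. The sketch below shows one way to do it, assuming each document is already a list of cleaned tokens and that the list position serves as the document id; `build_index` is a hypothetical name and the toy documents are made up.

```python
from collections import Counter, defaultdict


def build_index(docs_tokens):
    """Build postings and document frequencies from per-document token lists."""
    postings = defaultdict(dict)   # term -> {doc_id: term frequency}
    for doc_id, tokens in enumerate(docs_tokens):
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    # document frequency = number of documents in each term's postings entry
    df = {term: len(doc_tfs) for term, doc_tfs in postings.items()}
    return postings, df


# toy documents, for illustration only
docs = [['one', 'fish', 'two', 'fish'],
        ['red', 'fish', 'blue', 'fish'],
        ['one', 'red', 'bird']]
postings, df = build_index(docs)
print(postings['fish'])
print(df['fish'], df['bird'])
```

Looking up or updating `postings[term][doc_id]` is a constant-time hash access on average, which is the property the last bullet asks for.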

<img src="images/Index_Terms.JPG">

## Inverted Index: TF-IDF

<img src="images/onefish-twofish.JPG">

## Indexing - How many passes through the data?

### • TF and IDF for each token can be computed in one pass
### • Cosine similarity also requires document lengths
### • Need a second pass to compute document vector lengths
>#### • Remember that the length of a document vector is the square-root of sum of the squares of the weights of its tokens.
>#### • Remember the weight of a token is: TF * IDF
>#### • Therefore, must wait until IDFs are known (and therefore until all documents are indexed) before document lengths can be determined.
### • Do a second pass over all documents: keep a list or hashtable with all document IDs, and for each document determine its length.
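
The second pass can be sketched as follows. The IDF form `log(N / df)` used here is an assumption (the notebook's `getIDF` is defined elsewhere and may use a different base or smoothing), and `idf` and `doc_vector_lengths` are hypothetical names; the postings data is made up.

```python
import math
from collections import defaultdict


def idf(term, df, n_docs):
    # Assumed IDF form: natural log of N / df
    return math.log(float(n_docs) / df[term])


def doc_vector_lengths(postings, df, n_docs):
    """Return {doc_id: Euclidean length of the document's TF-IDF vector}."""
    sum_sq = defaultdict(float)
    for term, doc_tfs in postings.items():
        term_idf = idf(term, df, n_docs)
        for doc_id, tf in doc_tfs.items():
            sum_sq[doc_id] += (tf * term_idf) ** 2
    return {doc_id: math.sqrt(total) for doc_id, total in sum_sq.items()}


# toy index: 'fish' occurs in both documents, 'bird' only in document 1
postings = {'fish': {0: 2, 1: 1}, 'bird': {1: 3}}
df = {'fish': 2, 'bird': 1}
lengths = doc_vector_lengths(postings, df, n_docs=2)
print(lengths)
```

In this toy index document 0 contains only a term that occurs in every document, so its IDF is 0 and the vector length is 0. This is why a cosine computation needs to guard against dividing by a zero length.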

## Time Complexity of Indexing

### • Complexity of creating vector and indexing a document of n tokens is O(n).
>#### • TF-IDF (do not divide TF by the maximum term frequency in a document).
### • So indexing m such documents is O(m n).
### • Computing token IDFs can be done during the same first pass
### • Computing vector lengths is also O(m n).
### • Complete process is O(m n), which is also the complexity of just reading in the corpus.

### Create Postings Dictionary

In [None]:
# create Postings dictionary

# def getPostings(file_names, freq_word_PerDoc, perDocCorpClean)
# return(postings)

# postings = defaultdict(dict)

postings = getPostings(file_names, freq_word_PerDoc, perDocCorpClean)
postings

In [None]:
# //////////////////////////////////////////////////////////////////////
# postings
# \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\

print('postings = \n', postings)

In [None]:
print('\nlen(postings) = ', len(postings))

In [None]:
print('\nfreq_word_Corpus[0] = \n', freq_word_Corpus[0])

### Create DF Dictionary

In [None]:
# create DF dictionary

# getDF(file_names, freq_word_Corpus, postings)
# return(df)

print('\n-----CALCULATE DF-----\n')

# df = defaultdict(int)

df = getDF(file_names, freq_word_Corpus, postings)
df

In [None]:
print('\ndf = ', df)
print()

In [None]:
print('\nlen(df) = ', len(df))

In [None]:
# list() makes the key view sliceable on Python 3 as well as Python 2
DFfirst20 = {k: df[k] for k in list(df.keys())[:20]}
print('\nDFfirst20 = ', DFfirst20)

print('\ntype(df) = ', type(df))
print('\nlen(df) = ', len(df))

## Create Inverted Index & docVecLen

In [None]:
# file_names, freq_word_Corpus
print('\nfile_names = ', file_names)
print('\nfreq_word_Corpus = ', freq_word_Corpus)

In [None]:
# create Inverted Index & docVecLen

# getDocVecLen(file_names, freq_word_Corpus)
# return(docVecLen)

# docVecLen = defaultdict(float)

docVecLen = getDocVecLen(file_names, freq_word_Corpus)
docVecLen

In [None]:
# docVecLen values
print('\nOUTPUT of docVecLen values\n')
for docid in (range(len(file_names))):
    print('docVecLen[docid] [', docid, '] = ', docVecLen[docid])
print()

# Step 3: Retrieval
The output of your retrieval should be
a list of (query id, document id) pairs.

## • Input: Query and Inverted Index (from Step 2)
## • Output: Similarity values between query and documents  
  
### • Tokens that are not in both the query and the document contribute nothing to the numerator (dot product) of the cosine similarity.
>#### • Product of token weights is zero and does not contribute to the dot product.
### • Usually the query is fairly short, and therefore its vector is extremely sparse.
### • Use the inverted index (from Step 2) to find the limited set of documents that contain at least one of the query words.
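
For reference, the similarity these points describe can be written out as follows, using TF * IDF as the token weight and the square root of the sum of squared weights as the vector length, as stated in Step 2. The symbols $w_{t,q}$ and $w_{t,d}$ are notation introduced here for the weight of token $t$ in the query and in the document.

```latex
\cos(q, d) =
  \frac{\sum_{t \in q \cap d} w_{t,q} \, w_{t,d}}
       {\sqrt{\sum_{t \in q} w_{t,q}^{2}} \; \sqrt{\sum_{t \in d} w_{t,d}^{2}}},
\qquad
w_{t,x} = \mathrm{tf}_{t,x} \cdot \mathrm{idf}_{t}
```

Only tokens shared by the query and the document appear in the numerator, which is why the candidate set can be limited to documents containing at least one query word.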

## Processing the Query

### • Incrementally compute cosine similarity of each indexed document as query words are processed one by one.
### • To accumulate a total score for each retrieved document:
>#### • store retrieved documents in a hashtable, 
>#### • where the document id is the key and the partial accumulated score is the value.
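
The accumulation scheme above can be sketched as a term-at-a-time loop over the postings. This is an illustration under assumptions, not the notebook's `getCosSim`: `rank` is a hypothetical name, `idf` and `doc_len` are assumed to be precomputed dictionaries, and all numbers are made up.

```python
import math
from collections import defaultdict


def rank(query_tf, postings, idf, doc_len):
    """Term-at-a-time scoring: return [(doc_id, cosine)] in descending order."""
    scores = defaultdict(float)        # doc_id -> partial dot product
    q_sum_sq = 0.0
    for term, q_tf in query_tf.items():
        if term not in postings:
            continue                   # term absent from corpus: no contribution
        q_weight = q_tf * idf[term]
        q_sum_sq += q_weight ** 2
        for doc_id, d_tf in postings[term].items():
            scores[doc_id] += q_weight * d_tf * idf[term]
    q_len = math.sqrt(q_sum_sq)
    ranked = []
    for doc_id, dot in scores.items():
        denom = q_len * doc_len[doc_id]
        # a zero-length vector gets similarity 0 instead of raising an error
        ranked.append((doc_id, dot / denom if denom else 0.0))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# toy index with made-up IDF values and matching document lengths
postings = {'shock': {0: 1, 2: 2}, 'heat': {1: 1, 2: 1}}
idf = {'shock': 1.0, 'heat': 2.0}
doc_len = {0: 1.0, 1: 2.0, 2: math.sqrt(8.0)}
ranked = rank({'shock': 1, 'heat': 1}, postings, idf, doc_len)
print(ranked)
```

Only documents that appear in some query term's postings ever receive an entry in `scores`, so the work done is proportional to the postings touched rather than to the size of the collection.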

## Inverted Query Retrieval Efficiency

## • Assume that, on average, a query word appears in B documents:  
  
  <img src="images/qWords-bDocs.JPG">  
  
## • Then retrieval time is O(|Q|B), which is typically much better than:
>#### • naïve retrieval that examines all |D| documents, O(|V||D|), 
>#### • because |Q| << |V| and B << |D|.

# Get Query from User

In [None]:
print('QUERY TEST TEXT TO ENTER:')

# print('Input for data-test folder:')
# print('Java ?& dog jUMp JUMPING 42 to also')

# print('Input for data folder:')
# print('Flow ?& shear hEAt heating 42 to well layer')

# print('\n***---TEST DOC\'s 0, 1, 2 & little 3 => [1]---***\n')
# print('the simple 42 ! situational PAST OF theoretical must least \
#      exactly accordingly specified')

In [None]:
print('\n"queries.txt" Input\n')

# print('Q1 = what investigations have been made of the wave system created \
#       by a static pressure distribution over a liquid surface .')
print('Q2 = has anyone investigated the effect of shock generated \
     vorticity on heat transfer to a blunt body .')
# print('Q3 = what is the heat transfer to a blunt body in the absence of \
#      vorticity .')
# print('Q4 = what are the general effects on flow fields when the reynolds \
#      number is small .')
# print('Q5 = find a calculation procedure applicable to all incompressible \
#      laminar boundary layer flow problems having good accuracy and \
#      reasonable computation time .')
# print('Q6 = papers applicable to this problem (calculation procedures \
#       for laminar incompressible flow with arbitrary pressure \
#       gradient) .')
# print('Q7 = has anyone investigated the shear buckling of stiffened \
#      plates .')
# print('Q8 = papers on shear buckling of unstiffened rectangular plates \
#      under shear .')
# print('Q9 = in practice, how close to reality are the assumptions that \
#      the flow in a hypersonic shock tube using nitrogen is non-viscous \
#      and in thermodynamic equilibrium .')
# print('Q10 = what design factors can be used to control lift-drag ratios \
#      at mach numbers above 5 .')

In [None]:
# query = []

# queryTempFirst = "Hi"
# queryTempLast = "By"
# input = ""

# input1 = input2 = input3 = input4 = input5 = ""
# input6 = input7 = input8 = input9 = input10 = ""

# query.append(queryTempFirst)

In [None]:
# print('Please enter your query:\n')
# input = (raw_input("Search query >> "))

# print('Input for data-test folder:')
# input = "Java ?& dog jUMp JUMPING 42 to also"

# print('Input for data folder:')
# input = "Flow ?& shear hEAt heating 42 to well layer"

# print('\nTEST DOC [1] Input for data folder:')
# input = "the simple 42 ! situational PAST OF theoretical must least \
# exactly accordingly specified"

In [None]:
print('\nInput for dir_path_queries folder: = ', dir_path_queries)

In [None]:
# print('Input for <data-15> folder:')
# input1 = "what investigations have been made of the wave system created \
# by a static pressure distribution over a liquid surface ."

# print('Input for <data-15> folder:')
# input2 = "has anyone investigated the effect of shock generated \
# vorticity on heat transfer to a blunt body ."

# print('Input for <data-15> folder:')
# input3 = "what is the heat transfer to a blunt body in the absence of \
# vorticity ."

# print('Input for <data-15> folder:')
# input4 = "what are the general effects on flow fields when the reynolds \
# number is small ."

# print('Input for <data-15> folder:')
# input5 = "find a calculation procedure applicable to all incompressible \
# laminar boundary layer flow problems having good accuracy and reasonable \
# computation time ."

# print('Input for <data-15> folder:')
# input6 = "papers applicable to this problem (calculation procedures for \
# laminar incompressible flow with arbitrary pressure gradient) ."

# print('Input for <data-15> folder:')
# input7 = "has anyone investigated the shear buckling of stiffened plates ."

# print('Input for <data-15> folder:')
# input8 = "papers on shear buckling of unstiffened rectangular plates \
# under shear ."

# print('Input for <data-15> folder:')
# input9 = "in practice, how close to reality are the assumptions that the \
# flow in a hypersonic shock tube using nitrogen is non-viscous and in \
# thermodynamic equilibrium ."

# print('Input for <data-15> folder:')
# input10 = "what design factors can be used to control lift-drag ratios at \
# mach numbers above 5 ."

In [None]:
# input = input2
# print('input = input2')
# print('input = ', input)

In [None]:
# query.append(input)
# query.append(queryTempLast)
# print('query = ', query)

In [None]:
# print('query = ', query)

In [None]:
# q = query
# print('q = ', q)

In [None]:
# print('\nq = ', q)

In [None]:
# print(' q[0] = ', q[0])

In [None]:
# def getQueries(dir_path_queries)
# return(queries_from_file)

queries_from_file = getQueries(dir_path_queries)

In [None]:
print('queries_from_file = ', queries_from_file)
print('type(queries_from_file) = ', type(queries_from_file))

In [None]:
# print('\nqueries_from_file[:] = ', queries_from_file[:])

In [None]:
print('\nqueries_from_file[0] = ', queries_from_file[0])

In [None]:
print('\nqueries_from_file[1:4] = ', queries_from_file[1:4])

In [None]:
print()
for x in range(len(queries_from_file)):
    print('queries_from_file[', x, '] = ', queries_from_file[x])
    print()

## • Implement the preprocessing functions:  
>### • For tokenization  
>### • For stop word removal  
>### • For stemming  
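
The three preprocessing steps listed above can be sketched as follows. This is a minimal illustration, not the assignment 1 code: the stop-word list is a tiny hypothetical one, the tokenizer assumes that only runs of letters are kept, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "is", "are", "has", "what"}


def tokenize(text):
    """Lowercase the text and keep only runs of letters (drops punctuation and digits)."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


tokens = preprocess("What is the heat transfer to a blunt body?")
print(tokens)
```

The same `preprocess` function would be applied to both documents and queries so that their tokens are comparable.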

# Reading Query as List

# Start Processing Q and Break Into Lines

In [None]:
# start processing q and break into lines

# def getQLines(q)
# return(qReview, qDocnum, qTexts)

# qReview, qDocnum, qTexts = getQLines(q)

# print('\nQREVIEW = \n', qReview)
# print()

In [None]:
# print('type(qReview) = ', type(qReview))

In [None]:
# print('qReview = ', qReview)
# print()
# print('qDocnum = ', qDocnum)
# print()
# # print('titles = ', titles)
# # print()
# print('qTexts = ', qTexts)
# print()
# print('DONE ASSIGNING DOCNUM TITLES AND TEXTS')
# print()
# print('\n-----END OF getQLines-----\n\n')

In [None]:
# # print 1st 2 tokens in list review
# print('qReview[:2] = ', qReview[:2])

In [None]:
# print('qReview[0] = ', qReview[0])

In [None]:
# print('qReview[0:3] = ', qReview[0:3])

In [None]:
# print('Number of tokens = len(qReview) = ', len(qReview))

In [None]:
# print('qTexts[0] = ', qTexts[0])

In [None]:
# test_Qtext = qTexts[0]
# test_Qtext

In [None]:
# print('test_Qtext = ', test_Qtext)

# Get Q Corpus

# Merge TITLES & TEXTS into 1 String per Query

In [None]:
# # merge TITLES & TEXTS into 1 STRING per Query

# # def getQCorp(qTexts)
# # return(qCorp)

# qCorp = getQCorp(qTexts)

# print('\nqCorp = \n', qCorp)

In [None]:
# print('type(qCorp) = ', type(qCorp))

In [None]:
# for i in range(len(qCorp)):
#     print('\nqCorp[', i, '] = ', qCorp[i])

In [None]:
# print('\nqCorp[0:1] = \n', qCorp[0:1])

# Clean Q Corpus

## • Output: Tokens to be added to the index  
>### • No punctuation, no stop-words, stemmed

# For Each qCorp: Tokenize, Clean, Stem, Lemmatize, Remove Stop Words and Short Words, etc.

In [None]:
# # for ea qCorp: tokenize, clean, stem, lem, stopwords,\
# # shortwords, etc.

# # def getQClean(qCorp):
# # return(qClean, qLen, fdistQ, fdistQLen, 
# #            freq_word_Q, freq_word_Qorpus)

# qClean, qLen, fdistQ, fdistQLen, freq_word_Q , freq_word_Qorpus\
#     = getQClean(qCorp)

In [None]:
# print('qClean[0:] = ', qClean[0:])
# print('qClean = ', qClean)
# print('fdistQ = ', fdistQ)
# print('freq_word_Q = ', freq_word_Q)
# print('qClean = ', qClean)

# qClean = str(qClean[0])

In [None]:
# print('\nqClean =\n', qClean, '\n')
# x = 0
# for x in range(len(qClean)):
#     print('qClean[', x, '] = ', qClean[x])
#     print('qLen[', x, '] = ', qLen[x])
#     print('fdistQ[', x, '] = ', fdistQ[x])
#     print('fdistQLen[', x, '] = ', fdistQLen[x])
#     print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
#     print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
#     print('\nfreq_word_Q[', x, '] = \n', freq_word_Q[x])
#     print('\nfreq_word_Qorpus[', x, '] = \n', freq_word_Qorpus[x])
#     print()
#     print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
#     print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
#     print()
#     x += 1

In [None]:
# print('type(fdistQ) = ', type(fdistQ))
# print('type(freq_word_Q) = ', type(freq_word_Q))
# print('type(freq_word_Qorpus) = ', type(freq_word_Qorpus))
# print('\nfdistQ = ', fdistQ)
# print('\nfreq_word_Q = ', freq_word_Q)
# print('\nfreq_word_Qorpus = \n', freq_word_Qorpus)
# print()

In [None]:
# print('\nfreq_word_Q[:] = \n', freq_word_Q[:], '\n')
# # print('\nfreq_word_Q[0:-1] = \n', freq_word_Q[0:-1], '\n')

# print('type(freq_word_Q) = ', type(freq_word_Q))

In [None]:
# ####    freqQtest = []
# ####    for i in range(len(freq_word_Q)):
# ####        freqQtest.append(freq_word_Q[i][0])

# ####    print('\nfreqQtest = ', freqQtest)
# ####    print('\ntype(freqQtest) = ', type(freqQtest))

In [None]:
# print()
# # print('type(qfdist) = ', type(qfdist))
# print('type(qClean) = ', type(qClean))
# print('type(fdistQ) = ', type(fdistQ))
# print('type(freq_word_Q) = ', type(freq_word_Q))
# print('type(freq_word_Qorpus) = ', type(freq_word_Qorpus))

In [None]:
#    print('\nlen(qClean) = ', len(qClean))
#    print()

In [None]:
# qClean0 = str(qClean[0])
# qClean0

In [None]:
# print('\nqClean0 =\n', qClean0, '\n')
# print(type(qClean0))

# # qSetClean = set((qClean0))

# # qSetClean = set(qClean[:])
# # qSetClean = set(freq_word_Q.items())

# # for item in freq_word_Q:
# #     qSetClean.add(item)

# # print('\nset of qSetClean = ', qClean0)

# Generate Tuples for Both the Query's Words & Freqs

In [None]:
# # generate tuples for both the Query's Words & Freqs
# #
# # def getQTuples(freq_word_Q):
# # return(q_tuple_words, q_tuple_freq_i)

# q_tuple_words, q_tuple_freq_i = getQTuples(freq_word_Q)

# Generate List of Relevant Documents

In [None]:
# # generate list of relevant documents

# # def getRetDoc(postings, q_tuple_words)
# # return(retDoc)


# retDoc = getRetDoc(postings, q_tuple_words)
# print('retDoc = ', retDoc)

In [None]:
# df

In [None]:
# for docid in retDoc:
#     print('retDoc[', docid, '] = ', docid)

In [None]:
# retDoc

In [None]:
# type(retDoc)

In [None]:
# for docid in retDoc:
#     print('retDoc[', docid, '] = ', docid)

# print('\nretDoc = ', retDoc)
# print('type(retDoc) = ', type(retDoc))
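
As a rough illustration of this retrieval step, the sketch below collects every document that contains at least one query term. The `postings` layout (term mapped to a dict of document id and term frequency) and the name `get_candidate_docs` are assumptions made for this example; the notebook's own `getRetDoc` may be organized differently.

```python
# Hypothetical postings: term -> {doc_id: term frequency}
postings = {
    "heat": {1: 2, 3: 1},
    "transfer": {3: 4},
    "blunt": {2: 1, 3: 1},
}


def get_candidate_docs(postings, query_terms):
    """Return the set of doc ids that contain at least one query term."""
    candidates = set()
    for term in query_terms:
        # terms missing from the index simply contribute no documents
        candidates.update(postings.get(term, {}))
    return candidates


candidates = get_candidate_docs(postings, ["heat", "blunt", "vorticity"])
print(sorted(candidates))
```

Only these candidate documents need a similarity score, since any other document shares no term with the query.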

# Calculate Cosine Similarity Scores Between Q & Each Doc

In [None]:
# # calculate CosSim Scores b/t q & ea. doc

# # def getCosSimScoresList(retDoc, q_tuple_words,
# #                         q_tuple_freq_i, fdistCorpus):
# # return(cosSimScoresList)

# #     def getCosSim(docid, q_tuple_words, q_tuple_freq_i):
# #     return(cosSim)

# # cosSimScoresList = defaultdict(float)
# # print('docid = ', docid)

# print('retDoc = ', retDoc)
# print('\nq_tuple_words = ', q_tuple_words)
# print('q_tuple_freq_i = ', q_tuple_freq_i)
# print('\nfdistCorpus = ', fdistCorpus)

# cosSimScoresList = getCosSimScoresList(retDoc, q_tuple_words,
#                                        q_tuple_freq_i, fdistCorpus)

In [None]:
# # cosSimScoresList values

# print('\nOUTPUT of cosSimScoresList values\n')
# for docid in (range(len(cosSimScoresList))):
#     print('cosSimScoresList[docid] [', docid, '] = ',
#           cosSimScoresList[docid])
# print()

In [None]:
# cosSimScoresList
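
For reference, cosine similarity between a query and a document can be sketched as below. The vectors are assumed to be sparse dicts mapping a term to its weight, and `cosine_similarity` is a hypothetical helper, not the notebook's `getCosSim`.

```python
import math


def cosine_similarity(query_weights, doc_weights):
    """Cosine of the angle between two sparse vectors stored as term -> weight dicts."""
    dot = sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0  # an empty vector is treated as having no similarity
    return dot / (q_norm * d_norm)


query = {"heat": 1.0, "blunt": 1.0}
doc_a = {"heat": 2.0, "blunt": 2.0}  # same direction as the query
doc_b = {"shear": 3.0}               # shares no term with the query

print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```

Because the score is normalized by both vector lengths, a long document is not favored merely for repeating terms more often.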

# Step 4: Ranking  
2. For each of the ten queries in the queries.txt file, determine a ranked list of documents, in descending order of their similarity with the query.  

Determine the average precision and recall for the ten queries, when you use:  
• top 10 documents in the ranking  
• top 50 documents in the ranking  
• top 100 documents in the ranking  
• top 500 documents in the ranking  
  
Note: A list of relevant documents for each query is provided to you, so that you can determine
precision and recall.
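
A minimal sketch of precision and recall at a cutoff follows. The function name and the sample ids are hypothetical, and precision is divided by the number of documents actually returned within the cutoff, which is one possible convention when fewer than `k` documents are retrieved.

```python
def precision_recall_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Precision and recall over the top-k documents of a ranking."""
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    precision = hits / float(len(top_k)) if top_k else 0.0
    recall = hits / float(len(relevant_doc_ids)) if relevant_doc_ids else 0.0
    return precision, recall


# Hypothetical ranking and relevance judgments for one query
ranked = [324, 323, 1395, 629, 180]
relevant = {323, 180, 999, 12}

for k in (2, 5):
    print(k, precision_recall_at_k(ranked, relevant, k))
```

Averaging the per-query values at each cutoff (10, 50, 100, 500) over the ten queries would give the figures the task asks for.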

### • Sort the hashtable including the retrieved documents based on the value of cosine similarity
### • Return the documents in descending order of their relevance  
  
## • Input: Similarity values between query and documents
## • Output: Ranked list of documents in descending order of their relevance
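
The sorting step can be sketched as follows, assuming the scores are held in a dict that maps a document id to its cosine similarity. The values here are made up for illustration.

```python
# Hypothetical similarity scores: doc_id -> cosine similarity with the query
scores = {323: 0.35, 324: 0.36, 1395: 0.34, 180: 0.31}

# sort the (doc_id, score) pairs by score, highest first
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for rank, (doc_id, score) in enumerate(ranked, start=1):
    print(rank, doc_id, score)
```

The result is a list of `(doc_id, score)` tuples, so `ranked[0][0]` is the id of the best-matching document.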

## Term Weights

### • Weights applied to both document terms and query terms
### • Direct impact on the final ranking
>#### • Direct impact on the results
>#### • Direct impact on the quality of IR system
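
A small sketch of TF-IDF weighting with raw term frequency (not divided by the maximum term frequency) is given below. The base-10 logarithm and the helper names are assumptions made for this example; the document frequencies are invented.

```python
import math


def tf_idf(tf, df, num_docs):
    """Raw term frequency times inverse document frequency (base-10 log assumed)."""
    return tf * math.log10(num_docs / float(df))


def weight_vector(term_freqs, doc_freqs, num_docs):
    """Weight every term of a document or query; terms unknown to the index are skipped."""
    return {
        term: tf_idf(tf, doc_freqs[term], num_docs)
        for term, tf in term_freqs.items()
        if term in doc_freqs
    }


# Hypothetical document frequencies in a 1400-document collection
doc_freqs = {"heat": 14, "flow": 1400}
num_docs = 1400

doc_weights = weight_vector({"heat": 3, "flow": 5}, doc_freqs, num_docs)
query_weights = weight_vector({"heat": 1, "vorticity": 1}, doc_freqs, num_docs)
print(doc_weights)
print(query_weights)
```

A term that occurs in every document gets weight zero, so it cannot influence the ranking, which is why the weighting scheme has a direct impact on the results.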

# Rank List of Relevant Documents

In [None]:
# # rank list of relevant documents

# # def getRankCosSimList(cosSimScoresList)
# # return(rankCosSimList)

# # rankCosSimList = []


# rankCosSimList = getRankCosSimList(cosSimScoresList)

In [None]:
# rankCosSimList

In [None]:
# print('\nrankCosSimList = ', rankCosSimList)

In [None]:
# print('\ntype(rankCosSimList) = ', type(rankCosSimList))

In [None]:
#    '''
#    rankCosSimList[ 0 ] =  (323, 0.36102506311675114)
#    rankCosSimList[ 1 ] =  (322, 0.3515479432650447)
#    rankCosSimList[ 2 ] =  (1394, 0.3512276970366803)
#    rankCosSimList[ 3 ] =  (628, 0.3464697753459441)
#    rankCosSimList[ 4 ] =  (179, 0.310596542107325)
#
#    VERSUS [(docid + 1) in cosSimScoresList]
#
#    rankCosSimList[ 0 ] =  (324, 0.36102506311675114)
#    rankCosSimList[ 1 ] =  (323, 0.3515479432650447)
#    rankCosSimList[ 2 ] =  (1395, 0.3512276970366803)
#    rankCosSimList[ 3 ] =  (629, 0.3464697753459441)
#    rankCosSimList[ 4 ] =  (180, 0.310596542107325)
#    '''

In [None]:
##    print('\nrankCosSimList = \n', rankCosSimList)
#    print('\nlen(rankCosSimList) = ', len(rankCosSimList))
#    print()

In [None]:
##    print()
##    for y in range(len(rankCosSimList)):
##        print('rankCosSimList[', y, '] = ', rankCosSimList[y])

In [None]:
#    print()
#    for y in xrange(5):
#        print('rankCosSimList[', y, '] = ', rankCosSimList[y])

In [None]:
##    print()
##    test = []
##    print('test = ', test)
##    test.append(rankCosSimList[0])
##    print('test.append(rankCosSimList[0]) = ', test)
##    test.append(rankCosSimList[1])
##    print('test.append(rankCosSimList[1]) = ', test)

In [None]:
# rank list of relevant documents per query

# def getRankListPerQ(qNum, queries_from_file,
#                    postings, fdistCorpus)
# return(rankListPerQ)

# rankListPerQ = []
# start from an empty list so re-running this cell does not duplicate pairs
output_qid_docid = []

for qNum in range(len(queries_from_file)):
    rankListPerQ = getRankListPerQ(qNum, queries_from_file,
                                   postings, fdistCorpus)

    for relvDocIdx in range(len(rankListPerQ)):
        output_qid_docid.append((qNum + 1, rankListPerQ[relvDocIdx][0]))
#         print('\n\noutput_qid_docid = ', output_qid_docid)

    print('\n\n//////////////////////////////////////////////////////////')
    print('output_qid_docid = ', output_qid_docid)
    print('//////////////////////////////////////////////////////////\n\n')

In [None]:
print('\n\n//////////////////////////////////////////////////////////////')
print('output_qid_docid = ', output_qid_docid)
print('//////////////////////////////////////////////////////////////\n\n')

In [None]:
# for n in range(len(output_qid_docid)):
#     if output_qid_docid[n][0] == 1:
#         print('output_qid_docid[', n, '] = ', output_qid_docid[n])
#         print('docid = ', output_qid_docid[n][1])

In [None]:
for n in xrange(5):
    if output_qid_docid[n][0] == 1:
        print('output_qid_docid[', n, '] = ', output_qid_docid[n])
        print('docid = ', output_qid_docid[n][1])

In [None]:
# def sendToOutputFolder(dir_path_output, output_qid_docid)
# return()

sendToOutputFolder(dir_path_output, output_qid_docid)

In [None]:
# def getRelevance(dir_path_relevance):
# return(relevance_from_file)

relevance_from_file = getRelevance(dir_path_relevance)

In [None]:
# print('\nqueries_from_file = ', queries_from_file)
print('\ntype(relevance_from_file) = ', type(relevance_from_file))

In [None]:
# print('\nqueries_from_file[:] = ', queries_from_file[:])
print('\nrelevance_from_file[0] = ', relevance_from_file[0])
print('\nrelevance_from_file[1:4] = ', relevance_from_file[1:4])

In [None]:
print()
for x in xrange(5):
    print('relevance_from_file[', x, '] = ', relevance_from_file[x])

In [None]:
print()
print('relevance_from_file[]')

In [None]:
print('\nrelevance_from_file[', 0, '] = ', relevance_from_file[0])
print()

In [None]:
# for n in range(len(output_qid_docid)):
#    if output_qid_docid[n][0] == 1:
#        print('output_qid_docid[', n, '] = ', output_qid_docid[n])
#        print('docid = ', output_qid_docid[n][1])

for n in xrange(5):
    if relevance_from_file[n][0] == 2:
        print('relevance_from_file[', n, '] = ', relevance_from_file[n])
        print('docid = ', relevance_from_file[n][1])

In [None]:
# for n in range(len(relevance_from_file)):
for r in xrange(5):
#    if output_qid_docid[n][0] == 1:
#    for tup in output_qid_docid[:]:
    print('relevance_from_file[', r, '] = ', relevance_from_file[r])
    print('qid = relevance_from_file[', r, '][0] = ',
          relevance_from_file[r][0])
    print('docid = relevance_from_file[', r, '][1] = ',
          relevance_from_file[r][1], '\n')

In [None]:
# def getQtyRelDocPerQ(relevance_from_file):
# return(qtyRelDocPerQ)

qtyRelDocPerQ = getQtyRelDocPerQ(relevance_from_file)

print('\nqtyRelDocPerQ = ', qtyRelDocPerQ)

# Action

### • Build an inverted index, with an entry for each word in the vocabulary
### • Input: Tokens obtained from the preprocessing module
### • Output: An inverted index for fast access
### • Many data structures are appropriate for fast access
>#### • We will use hashtables
        * Store tokens in hashtable, with token string as key and weight as value.
        * Table must fit in main memory.
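
One way to picture the hashtable-based index is the sketch below. It assumes each token maps to a dict of document id and raw term frequency, from which weights can be derived later; `build_inverted_index` and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: doc_id -> list of preprocessed tokens. Returns token -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index


# Hypothetical preprocessed documents
docs = {
    1: ["heat", "transfer", "heat"],
    2: ["blunt", "heat"],
}

index = build_inverted_index(docs)
print(dict(index))
print("document frequency of 'heat' =", len(index["heat"]))
```

Looking up a token is a single dict access, and the length of its entry gives the document frequency needed for IDF.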

### • Input: Documents that are read one by one from the collection
### • Implement the preprocessing functions:  
>#### • For tokenization  
>#### • For stop word removal  
>#### • For stemming  
### • Output: Tokens to be added to the index  
>#### • No punctuation, no stop-words, stemmed

> ## Blockquoted header
>
> This is blockquoted text.
>
> This is a second paragraph within the blockquoted text.

> ## Blockquoted header
>
    > ### This is blockquoted text.
>
        > #### This is a second paragraph within the blockquoted text.

+ One
+ Two
+ Three
    - Nested One
    - Nested Two
        * 3rd level one
        * 3rd level two

## + One
## + Two
## + Three
    ### - Nested One
    ### - Nested Two
       #### * 3rd level one
       #### * 3rd level two

# --------------------DELETE----------------------

In [None]:

idx = [1,2,3,4,5]
vls = ['a','b','c','d','e']

myzip = zip(idx,vls)
print(idx)
print(vls)
print(myzip)

In [None]:
dicttest = dict(myzip)
print(dicttest)

In [None]:
if 4 in dicttest: print(dicttest[4])

In [None]:
dicttest[6] = 'f'
print(dicttest)

In [None]:
for key in dicttest:
    print(key)

In [None]:
for key in dicttest:
    print(dicttest[key])

In [None]:
for key in dicttest.iterkeys():
    print(key)

In [None]:
for key in dicttest.iterkeys():
    print(dicttest[key])

In [None]:
for val in dicttest.itervalues():
    print(val)

In [None]:
for key in dicttest:
    print(dicttest[key])

In [None]:
print(dicttest)

In [None]:
dicttest[7] = 'g'
print(dicttest)

In [None]:
dicttest.update({1:'z'})
print(dicttest)

In [None]:
dicttest.update({2:'y', 8:'h'})
print(dicttest)

In [None]:
print(dicttest)

In [None]:
# Array Practice

In [None]:
x = np.arange(10)
x

In [None]:
x[2]

In [None]:
x[-2]

In [None]:
x.shape = (2,5)
x

In [None]:
x[1,3]

In [None]:
x[1,-1]

In [None]:
x[0]

In [None]:
x[0][2]

In [None]:
x[0,2]

In [None]:
x = np.arange(10)
x

In [None]:
x[2:5]

In [None]:
x[:-7]

In [None]:
x[1:7:2]

In [None]:
y = np.arange(35).reshape(5,7)
y

In [None]:
y[1:5:2,::3]

In [None]:
y[1:5:2,1::2]

In [None]:
x = np.arange(10,1,-1)
x

In [None]:
x[np.array([3,3,1,8])]

In [None]:
y = np.array(['a','be','cat','door'])
y

In [None]:
# Hash Practice

In [None]:
D = {}

In [None]:
D['a'] = 1
D['b'] = 2
D['c'] = 3
D

In [None]:
D['b']

In [None]:
for k in D.keys():
    print(D[k])

In [None]:
for k,v in D.items():
    print(k,':',v)

In [None]:
keys = ['d','e','f']
values = [4,5,6]
# named hash_table so the built-in hash() is not shadowed
hash_table = {k:v for k, v in zip(keys, values)}
hash_table

In [None]:
map(hash, [4, 5, 6])

In [None]:
map(hash, [7,8,9,10])

In [None]:
keys = ['at', 'be', 'cat', 'dog']
docsat = ['d1', 'd2']
docsbe = ['d2', 'd3', 'd5']
docscat = ['d1', 'd5']

freqval = [4, 2, 2, 4]
freqat = [2, 1]
freqbe = [3, 7, 4]
freqcat = [5, 8]

docfreqval = {d:f for d, f in zip(keys, freqval)}
print(docfreqval)

docfreqat = {d:f for d, f in zip(docsat, freqat)}
print(docfreqat)

docfreqbe = {d:f for d, f in zip(docsbe, freqbe)}
print(docfreqbe)

docfreqcat = {d:f for d, f in zip(docscat, freqcat)}
print(docfreqcat)

In [None]:
# numdocs = [2,3,3]

# hash = {k:n:d for k, n, d in zip(keys, numdocs, )}

numdocs = [2, 3]

# zip stops at the shorter input, so only the first two keys are paired
hash_table = {k:n for k, n in zip(keys, numdocs)}
print('hash_table = ', hash_table)

In [None]:
import os

In [None]:
def get_files(dir, suffix):
    """
    Returns all the files in a folder ending with suffix
    :param dir: path of the folder to scan
    :param suffix: file-name ending to match, e.g. '.txt'
    :return: the list of file names
    """
    print('dir = ',dir)
    print('suffix = ',suffix)
    files = []
    for file in os.listdir(dir):
        if file.endswith(suffix):
            files.append(file)
    # return after the loop so that every matching file is collected
    return files

dir = 'C:/Users/Derek Christensen/Dropbox/_cis833irtm/hw2/data/'
suffix = '.txt'

# keep the returned list; the function does not modify any outer variable
files = get_files(dir,suffix)
print(files[:])

In [None]:
# http://prooffreaderplus.blogspot.ca/2014/11/top-10-python-idioms-i-wished-id.html?m=1

In [None]:
# 1. Python 3-style printing in Python 2

In [None]:
# Enumerate a List

In [None]:
mylist = ["It's",'only','a','model.']

for index, item in enumerate(mylist):
    print(index, item)

In [None]:
mylist

In [None]:
mylist[2]

In [None]:
mynumber = 3

if 4 > mynumber > 2:
    print("Chained comparison operators work! \n" * 3)

In [None]:
from collections import Counter
from random import randrange

mycounter = Counter()
for i in range(100):
    random_number = randrange(10)
    print(random_number)
    mycounter[random_number] += 1
for i in range(10):
    print(i, mycounter[i])

In [None]:
mycounter[3]

In [None]:
# named total so the built-in sum() is not shadowed
total = 0
for i in range(10):
    total += mycounter[i]
print(total)

In [None]:
mycounter = Counter()

# print(random_number)
# print(mycounter[random_number])
print('hi')
for i in range(10):
    print('i = ', i)
    random_number = randrange(10)
    print(random_number)
    mycounter[random_number] += 1
    print(mycounter[random_number])

In [None]:
# Dict Comprehensions

In [None]:
my_phrase = ['No','one','expects','the','Spanish','Inquisition']
my_dict = {key:value for value, key in enumerate(my_phrase)}
print(my_dict)
rev_dict = {value:key for key, value in my_dict.items()}
print(rev_dict)

In [None]:
my_phrase2 = ['Search','for','the','Holy','Grail']
my_dict2 = {value:key for key, value in enumerate(my_phrase2)}
print(my_dict2)
rev_dict2 = {key:value for value, key in my_dict2.items()}
print(rev_dict2)

In [None]:
my_phrase3 = ['grad','in','msor']
my_dict3 = {key:value for key, value in enumerate(my_phrase3)}
print(my_dict3)

In [None]:
# Executing Shell Commands with *subprocess*

In [None]:
import subprocess
output = subprocess.check_output('dir', shell=True)
print(output)

In [None]:
# 7. *dict* *.get()* and *.iteritems()* Methods

In [None]:
my_dict = {'name': 'Lancelot', 'quest': 'Holy Grail', 'favourite_color': 'blue'}

print(my_dict.get('airspeed velocity of an unladen swallow', 'African or European?\n'))

for key, value in my_dict.iteritems():
    print(key, value, sep=": ")

In [None]:
# 8. *Tuple* unpacking for switching variables

In [None]:
a = 'Spam'
b = 'Eggs'

print(a, b)

a, b = b, a

print(a, b)

In [None]:
# 9. Introspection tools

In [None]:
my_dict = {'That': 'an ex-parrot!'}
    
help(my_dict)

In [None]:
# 10. PEP-8 compliant string chaining

In [None]:
my_long_text = ("We are no longer the knights who say Ni! "
                "We are now the knights who say ekki-ekki-"
                "ekki-p'tang-zoom-boing-z'nourrwringmm!")
print(my_long_text)

In [None]:
#!/usr/bin/env python3


# declaration and adding columns
cinema = []
for j in range(5):
    column = []
    for i in range(10):
        column.append(i)
    cinema.append(column)
cinema

In [None]:
# filling with data
cinema[2][2] = 1 # center
for i in range(1, 4): # fourth row
    cinema[i][3] = 1
for i in range(5): # the last row
    cinema[i][4] = 1
cinema

In [None]:
cinema[0]

In [None]:
cols = len(cinema)
rows = 0
if cols:
    rows = len(cinema[0])
for j in range(rows):
    for i in range(cols):
#         print(cinema[i][j])
        print(cinema[i][j], end = "")
    print()

In [None]:
#!/usr/bin/env python3


# declaration and adding columns
cinema = []
for j in range(5):
    row = []
    for i in range(10):
        row.append(0)
    cinema.append(row)
# filling with data
cinema[2][2] = 1 # center
for i in range(1, 4): # fourth row
    cinema[i][3] = 1
for i in range(5): # the last row
    cinema[i][4] = 1

rows = len(cinema)
cols = 0
if rows:
    cols = len(cinema[0])
for j in range(cols):
    for i in range(rows):
#         print(cinema[i][j])
        print(cinema[i][j], end = "")
    print()

In [None]:
cinema = []

for j in range(5):
    column = []
    for i in range(10):
        column.append(0)
    cinema.append(column)
cinema

In [None]:
print('cinema = ', cinema)

In [None]:
column[1]

In [None]:
cinema

In [None]:
cinema[2][2] = 9 # center
cinema

In [None]:
for i in range(1, 4): # fourth row
    cinema[i][3] = 7
cinema

In [None]:
for i in range(5): # the last row
    cinema[i][4] = 1
cinema

### SETS

In [None]:
xlist = 

### Define your data as a dictionary that maps each flag value (True, False) to a list of flag names (single-character strings). You then transform this data definition into an inverted dictionary that maps flag names to flag values. This can be done quite succinctly with a nested comprehension or with a generator function.

In [None]:
def invert_dict(inverted_dict):
    elements = inverted_dict.iteritems()
    print('type(elements) = ', type(elements))
    print('elements = ', elements)
#     print('elements[:] = ', elements[:])
#     print('elements[0] = ', elements[0])
#     print('elements.items() = ', elements.items())
#     print('elements.iteritems() = ', elements.iteritems())
    # viewitems() exists on the dict, not on the iterator returned by iteritems()
    print('inverted_dict.viewitems() = ', inverted_dict.viewitems())

    for flag_value, flag_names in elements:
        print('flag_value = ', flag_value)
        print('flag_names = ', flag_names)
        for flag_name in flag_names:
            print('flag_name = ', flag_name)
            yield flag_name, flag_value


In [None]:
flags = {True: ["a", "b", "c"], False: ["d", "e"]}
flags

In [None]:
print('flags = ', flags)

In [None]:
flags = dict(invert_dict(flags))
# >>> print flags
# {'a': True, 'c': True, 'b': True, 'e': False, 'd': False}

In [None]:
flags

In [None]:
print('flags = ', flags)
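
For comparison, the same inversion can be written as a single nested dict comprehension. This is a sketch that rebuilds the original mapping under its own name, `flags_by_value`, so it does not depend on the cells above.

```python
# Original layout: flag value -> list of flag names
flags_by_value = {True: ["a", "b", "c"], False: ["d", "e"]}

# Inverted layout: flag name -> flag value
flags_by_name = {
    name: value
    for value, names in flags_by_value.items()
    for name in names
}
print(flags_by_name)
```

The outer `for` walks the two flag values and the inner `for` walks the names under each, so every name ends up as a key.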

# List Comprehension
https://docs.python.org/2/tutorial/datastructures.html#list-comprehensions

In [None]:
# flatten a list using a listcomp with two 'for'
vec = [[1,2,3], [4,5,6], [7,8,9]]
[num for elem in vec for num in elem]
# [1, 2, 3, 4, 5, 6, 7, 8, 9]

In [None]:
vec = [[1,2,3], [4,5,6], [7,8,9]]
vec

In [None]:
[num for elem in vec for num in elem]

In [None]:
[num for numlist in vec for num in numlist]

In [None]:
vec = [[1,2,3], [4,5,6], [7,8,9]]
vec

In [None]:
[x for y in vec for x in y]

In [None]:
from math import pi

[str(round(pi, i)) for i in range(1, 6)]

# Practice Append list to List

In [None]:
a = []
a

In [None]:
a.append([2, 156])
a

In [None]:
b = []
b

In [None]:
b.append([1,11])
b

In [None]:
c = []
c

In [None]:
c.append([3,55])
c

In [None]:
b.append(c)
b

In [None]:
a.append(c)
a

In [None]:
a.append(b)
a

# Output Append

In [None]:
tempOutput = []

In [None]:
qNum = 1
qNum

In [None]:
rankList = [[324, 0.3610], [323, 0.3515], [1395, 0.3512]]
rankList

In [None]:
rankListTup = [(325, 0.3610), (322, 0.3515), (1394, 0.3512)]
rankListTup

In [None]:
rankList[0]

In [None]:
print('rankList[0] = ', rankList[0])

In [None]:
rankListTup[0]

In [None]:
print('rankListTup[0] = ', rankListTup[0])

In [None]:
for i in range(len(rankListTup)):
    tempOutput.append((qNum+1, rankListTup[i][0]))
    print('tempOutput = ', tempOutput)

In [None]:
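# tuples are immutable, so replace the whole pair instead of assigning to tempOutput[0][0]
tempOutput[0] = (qNum, tempOutput[0][1])
tempOutput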

# Max in a slice of an array

In [None]:
PA = []
PA

In [None]:
PA = [0, 0, .35, .25, .20, .17, .22, .33, .08]

In [None]:
PA

In [None]:
if PA[0] < max(PA[3:]):
    PA[0] = max(PA[3:])
PA[0]

In [None]:
PA

In [None]:
if PA[1] < max(PA[3:]):
    PA[1] = max(PA[3:])
PA[1]

In [None]:
PA

In [None]:
if PA[2] < max(PA[3:]):
    PA[2] = max(PA[3:])
PA[2]

In [None]:
if PA[3] < max(PA[4:]):
    PA[3] = max(PA[4:])
PA[3]

In [None]:
PA

In [None]:
pal = len(PA)
pal

In [None]:
if PA[4] < max(PA[5:]):
    PA[4] = max(PA[5:])
PA[4]

In [None]:
x = 5
if x < pal:
    if PA[4] < max(PA[x:]):
        PA[4] = max(PA[x:])
PA[4]

In [None]:
PA

In [None]:
PA[8] = .83

In [None]:
x = len(PA)-3
print(x)
if x < pal:
    if PA[4] < max(PA[x:]):
        PA[4] = max(PA[x:])
PA[4]

In [None]:
PA

# xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

In [None]:
SA = []
SA

In [None]:
# SA[0] = [11, 12, 13, 14]
SA.append([11, 12, 13, 14])

In [None]:
SA[0]

In [None]:
SA.append([21, 22, 23, 24])
SA[0]

In [None]:
SA[1]

In [None]:
SA.append([31, 32, 33, 34])
SA[0]

In [None]:
SA[1]

In [None]:
SA[2]

In [None]:
SA

In [None]:
SA[0][3]

In [None]:
SA[1][3]

In [None]:
SA