## Object Oriented Programming - Quick and dirty Introduction

In this session, we go through a toy example of object oriented programming in python and then look at the skeleton class of a document. Clearly, this topic is vast and requires a much more through treatment than being given here. But I find it more useful to introduce you to concepts because we need them rather than because they exist. 

#### Why will we use object oriented programming in this class?
Amongst the many many reasons that OOP makes sense, for us the most important one is:

1. __Reusability__
We want to be able to write all our code for every document once and then call it again and again. Having stand alone functinons not only increases the amount of _house keeping_ but will also require you to change code at different places everytime your requirements change.

2. __Encapsulation__
Remember that you will have hundreds of documents and will have to define properties for each one of them. For example, how long is every document or what are its tokens. OOP allows you to define all of these properties at one place and then the details of these implementations can be hidden.

#### Important Terminology

1. __Class__: Is _dna_ or _blueprint_ for an object that has to be modeled. Contains attributes that contins its properties and methods that characterize its behaviors.

2. __Instance__: An individual object of a class.

3. __Instantiation__: The creation of an instance of a class.

4. __Object__: A unique instance of a class. An object comprises both data members (class variables and instance variables) and methods.

5. __Class variable__: A variable that is shared by all instances of a class. They are defined within a class but outside any of the class's methods

6. __Instance variable__: A variable that is defined inside a method and belongs only to the current instance of a class.

7. __Data member__: A class variable or instance variable that holds data associated with a class and its objects.

8. __Method__ : A function that is defined in a class definition. 

So all the chit chat aside lets get down to it 

In [1]:
import numpy as np
from nltk.tokenize import wordpunct_tokenize
from nltk import PorterStemmer
import re

"""
This is a class sherlock. 
Notice how it is defined with the keyword `class` and a name that begins with a capital letter
"""
class Document():
    
    """ The Doc class rpresents a class of individul documents
    
    """
    
    def __init__(self, speech_year, speech_pres, speech_text):
        """
        The __init__ method is called everytime an object is instantiated.
        This is where you will define all the properties of the object that it must have
        when it is `born`.
        """
        
        #These are data members
        self.year = speech_year
        self.pres = speech_pres
        self.text = speech_text.lower()
        self.tokens = np.array(wordpunct_tokenize(self.text))
        
        
        
    def token_clean(self,length):

        """ 
        description: strip out non-alpha tokens and tokens of length > 'length'
        input: length: cut off length 
        """

        self.tokens = np.array([t for t in self.tokens if (t.isalpha() and len(t) > length)])


    def stopword_remove(self, stopwords):

        """
        description: Remove stopwords from tokens.
        input: stopwords: a suitable list of stopwords
        """
 
        self.tokens = np.array([t for t in self.tokens if t not in stopwords])


    def stem(self):

        """
        description: Stem tokens with Porter Stemmer.
        """
        
        self.tokens = np.array([PorterStemmer().stem(t) for t in self.tokens])
        
    def demo_self():
        print 'this will error out'


### Self
Notice the `self` keyword that is present all over the place in the class. `self` tells python which object it needs to work with. 

1. Therefore every method should have a `self` parameter specified.
2. Notice that when the method is called the reference to the object is passed implicitly.
3. Every data member of the class needs to be referred to using `self`. 

In [2]:
#Instantiating an object.  
speech1 = Document('1986', 'Rooster', 'this is a chicken')
print speech1

#Accessing data members
print speech1.tokens

try:
    speech1.demo_self()
except Exception as ex:
    print ex

<__main__.Document instance at 0x109836ea8>
['this' 'is' 'a' 'chicken']
demo_self() takes no arguments (1 given)


### Demo of usefulness of classes

Use this [link](http://www.codeskulptor.org/#user41_X9owzYCGupNiKYr.py) to see how useful classes can be. The link leads to a code I wrote two years ago that implements a very naive version of the game Asteroids. 

## A skeleton class structure for documents

In [100]:
import numpy as np
import codecs
import nltk
import re
import math 
from nltk.tokenize import wordpunct_tokenize
from nltk import PorterStemmer
from itertools import repeat


class Corpus():
    
    """ 
    The Corpus class represents a document collection
     
    """
    def __init__(self, doc_data, stopword_file, clean_length):
        """
        Notice that the __init__ method is invoked everytime an object of the class
        is instantiated
        """
        

        #Initialise documents by invoking the appropriate class
        self.docs = [Document(doc[0], doc[1], doc[2]) for doc in doc_data] 
        
        self.N = len(self.docs)
        self.clean_length = clean_length
        
        #get a list of stopwords
        self.create_stopwords(stopword_file, clean_length)
        
        #stopword removal, token cleaning and stemming to docs
        self.clean_docs(2)
        
        #create vocabulary
        self.corpus_tokens()
        
    def clean_docs(self, length):
        """ 
        Applies stopword removal, token cleaning and stemming to docs
        """
        for doc in self.docs:
            doc.token_clean(length)
            doc.stopword_remove(self.stopwords)
            doc.stem()        
    
    def create_stopwords(self, stopword_file, length):
        """
        description: parses a file of stowords, removes words of length 'length' and 
        stems it
        input: length: cutoff length for words
               stopword_file: stopwords file to parse
        """
        
        with codecs.open(stopword_file,'r','utf-8') as f: raw = f.read()
        
        self.stopwords = (np.array([PorterStemmer().stem(word) 
                                    for word in list(raw.splitlines()) if len(word) > length]))
        
     
    def corpus_tokens(self):
        """
        description: create a set of all all tokens or in other words a vocabulary
        """
        
        #initialise an empty set
        self.token_set = set()
        for doc in self.docs:
            self.token_set = self.token_set.union(doc.tokens) 
            
    def document_term_matrix(self):
        """
        description:  returns a D by V array of frequency counts
        """  
        # subroutine: computes the counts of each vocabulary in the document
        def counts(doc):
            # initialize a matrix
            term_mat = [0]*len(self.token_set)
            for token in doc.tokens:
                term_mat[list(self.token_set).index(token)] = term_mat[list(self.token_set).index(token)] + 1
            return term_mat;
            
        self.doc_term_matrix = []
        
        for doc in self.docs:
            self.doc_term_matrix.append([doc.pres + " " + doc.year, counts(doc)])


      
    def tf_idf(self):
        """
        description:  returns a D by V array of tf-idf scores
        """
        # Compute inverse document frequency 
        idf = [0]*len(self.token_set)
        for token in self.token_set:
            ind = 0
            for doc in self.docs:
                if token in doc.tokens:
                    ind += 1 
            idf[list(self.token_set).index(token)] = math.log(self.N/ind)
        
        # Create a subroutine that computes tf_idf for one document
        def tfidf(doc):
            term_mat = [0]*len(self.token_set)
            for token in doc.tokens:
                term_mat[list(self.token_set).index(token)] = term_mat[list(self.token_set).index(token)] + 1 
        
            for i,term in enumerate(term_mat):
                if term != 0:
                    term_mat[i] = (1 + math.log(term)) * idf[i]
            return term_mat;
        
        #tf_idf
        self.tf_idf_matrix = []
        for doc in self.docs:
            self.tf_idf_matrix.append([doc.pres + " " + doc.year, tfidf(doc)])
            
            
        
    def dict_rank(self, n, dictionary, token_repr):
        """
        description:  returns the top n documents based on a given dictionary and represenation of tokens
        """
        if token_repr == "tf-idf":
            self.tf_idf()
            representation = self.tf_idf_matrix
            
        if token_repr == "doc-term":
            self.document_term_matrix()
            representation = self.doc_term_matrix
            
        # Return top n docs based on dictionary given
        score = []
        x=self.token_set
        x=list(x)
        for token in x: 
            try:
                score.append(dictionary[token])
            except: 
                score.append(0)

        # get a vector with all the scores in order
        score=[int(x) for x in score]
        rank = {}
        elements=range(len(representation))
   
        for i in elements:
            rank[representation[i][0]] = np.dot(score,representation[i][1])
            
        # Get sorted view of the keys.
        s = sorted(rank, key=rank.get, reverse=True)[0:(n-1)]
        
        ranking = {}
        for key in s:
            ranking[key] =  rank[key]
        
        return ranking 



In [75]:
class Document():
    
    """ The Doc class rpresents a class of individul documents
    
    """
    
    def __init__(self, speech_year, speech_pres, speech_text):
        self.year = speech_year
        self.pres = speech_pres
        self.text = speech_text.lower()
        self.tokens = np.array(wordpunct_tokenize(self.text))
        
        
        
    def token_clean(self,length):

        """ 
        description: strip out non-alpha tokens and tokens of length > 'length'
        input: length: cut off length 
        """

        self.tokens = np.array([t for t in self.tokens if (t.isalpha() and len(t) > length)])


    def stopword_remove(self, stopwords):

        """
        description: Remove stopwords from tokens.
        input: stopwords: a suitable list of stopwords
        """

        
        self.tokens = np.array([t for t in self.tokens if t not in stopwords])


    def stem(self):

        """
        description: Stem tokens with Porter Stemmer.
        """
        
        self.tokens = np.array([PorterStemmer().stem(t) for t in self.tokens])


We will use the presedential speech dataset to demonstrate how the class works 

In [5]:
def parse_text(textraw, regex):
    """takes raw string and performs two operations
    1. Breaks text into a list of speech, president and speech
    2. breaks speech into paragraphs
    """
    prs_yr_spch_reg = re.compile(regex, re.MULTILINE|re.DOTALL)
    
    #Each tuple contains the year, last ane of the president and the speech text
    prs_yr_spch = prs_yr_spch_reg.findall(textraw)
    
    #convert immutabe tuple to mutable list
    prs_yr_spch = [list(tup) for tup in prs_yr_spch]
    
    for i in range(len(prs_yr_spch)):
        prs_yr_spch[i][2] = prs_yr_spch[i][2].replace('\n', '')
    
    #sort
    prs_yr_spch.sort()
    
    return(prs_yr_spch)

In [113]:
# Exercise 2: 

text = open("/Users/ainalopez/Downloads/text_mining-master/data/pres_speech/sou_all copy.txt", 'r').read()
#text = open("/home/yaroslav/Projects/text_mining/data/pres_speech/sou_all.txt", 'r').read()
regex = "_(\d{4}).*?_[a-zA-Z]+.*?_[a-zA-Z]+.*?_([a-zA-Z]+)_\*+(\\n{2}.*?)\\n{3}"
pres_speech_list = parse_text(text, regex)

#Instantite the corpus class
# corpus = Corpus(pres_speech_list, '/home/yaroslav/Projects/text_mining/data/stopwords/stopwords.txt', 2)
corpus = Corpus(pres_speech_list, '/Users/ainalopez/Downloads/text_mining-master-2/data/stopwords/stopwords.txt', 2)

# Import dictionary from excel file
import pandas as pd
#df = pd.read_excel("/home/yaroslav/Dropbox/BGSE/3rd Term/Text Mining/LoughranMcDonald_MasterDictionary_2014.xlsx", skiprows=0)
df = pd.read_excel("/Users/ainalopez/Downloads/dictionary1.xlsx", skiprows=0)
w = df['Word']
words = [str(x).lower() for x in df['Word']]
words=[PorterStemmer().stem(t) for t in words]
score = [str(x).lower() for x in df['Positive']] # or any other method
dictionary=dict(zip(words,score))

# Applying dict_rank function
X1 = corpus.dict_rank(corpus.N, dictionary,"doc-term")
X2 = corpus.dict_rank(corpus.N, dictionary,"tf-idf")

# Print the Ranking 
print sorted(X1, key=X1.get, reverse=True)
print sorted(X2, key=X2.get, reverse=True)

In [131]:
for key in sorted(X2, key=X2.get, reverse = True):
    print str(X2[key]) + ","


130559.963176,
108333.40409,
99350.8241101,
86602.0861312,
86568.5988541,
76838.4816609,
72258.1006625,
68536.8041623,
67600.6735489,
67060.9908068,
65986.7441174,
63267.8868743,
62706.1738366,
61872.5503152,
61069.4394329,
59891.2560132,
57982.2560063,
57462.4556797,
57043.4672759,
56723.5703056,
55825.1000214,
54808.5135905,
54403.3280192,
52811.8009693,
52260.7513738,
51802.5737956,
51552.5546315,
50626.0456879,
50190.1493574,
50059.3983806,
50038.7316494,
49200.220772,
48433.1102126,
48345.1171769,
48107.4919067,
47752.6207105,
46375.7957706,
46330.8337656,
46150.668827,
45737.3984,
45396.8732552,
44152.9397802,
43968.2424233,
43130.1418999,
43005.8169787,
42924.0107171,
42583.6845346,
41819.6722728,
41618.3358291,
41371.0819463,
40845.9146668,
40570.3829871,
39963.0980793,
39703.4325762,
39683.4071173,
39640.4245397,
39056.8617806,
38862.1572823,
38641.9109001,
38552.1862343,
38356.9875164,
38317.2002131,
37719.3045623,
37484.4760753,
37436.8799059,
37146.14019,
37089.2942283,
368

In [132]:

for key in sorted(X1, key=X1.get, reverse = True):
    print str(X1[key]) + ","

1106959,
1054725,
787528,
683060,
548457,
540421,
498232,
484169,
470106,
441980,
435953,
411845,
407827,
391755,
389746,
347557,
331485,
319431,
313404,
307377,
303359,
301350,
293314,
289296,
285278,
275233,
271215,
267197,
267197,
263179,
255143,
253134,
249116,
249116,
247107,
245098,
245098,
239071,
237062,
235053,
231035,
222999,
220990,
220990,
218981,
218981,
216972,
216972,
214963,
214963,
212954,
210945,
208936,
206927,
206927,
204918,
204918,
204918,
202909,
202909,
202909,
202909,
198891,
196882,
196882,
192864,
190855,
188846,
184828,
182819,
180810,
178801,
178801,
178801,
176792,
174783,
174783,
172774,
170765,
168756,
168756,
168756,
168756,
166747,
166747,
166747,
164738,
164738,
162729,
162729,
162729,
160720,
158711,
158711,
158711,
158711,
154693,
152684,
152684,
152684,
150675,
146657,
146657,
146657,
144648,
144648,
142639,
140630,
140630,
140630,
140630,
138621,
138621,
138621,
136612,
134603,
134603,
134603,
134603,
130585,
130585,
130585,
128576,
126567,
124558

In [122]:
for sortedKey in sorted(X1):
    print X1[sortedKeY]


NameError: name 'sortedKeY' is not defined

['Carter 1980', 'Carter 1981', 'Carter 1979', 'Nixon 1974', 'Truman 1946', 'Nixon 1972', 'Taft 1912', 'Taft 1910', 'Roosevelt 1907', 'Roosevelt 1901', 'Roosevelt 1906', 'Carter 1978', 'Roosevelt 1905', 'Taft 1911', 'McKinley 1899', 'Cleveland 1885', 'Polk 1848', 'McKinley 1900', 'Jackson 1830', 'Roosevelt 1908', 'Roosevelt 1904', 'Coolidge 1925', 'Eisenhower 1961', 'Eisenhower 1955', 'McKinley 1898', 'Cleveland 1896', 'Cleveland 1886', 'Buren 1837', 'Truman 1950', 'Tyler 1844', 'Hayes 1880', 'Taft 1909', 'Harrison 1891', 'Jackson 1834', 'Roosevelt 1903', 'Clinton 2000', 'Polk 1847', 'Buren 1839', 'Truman 1953', 'Polk 1846', 'Polk 1845', 'Hayes 1879', 'Hayes 1877', 'Cleveland 1888', 'Arthur 1881', 'Buchanan 1858', 'Roosevelt 1902', 'Harrison 1892', 'Buren 1838', 'Harrison 1889', 'Jackson 1829', 'Coolidge 1926', 'Jackson 1835', 'Cleveland 1895', 'Fillmore 1852', 'Truman 1948', 'Eisenhower 1953', 'Adams 1825', 'Eisenhower 1958', 'Fillmore 1851', 'Hoover 1929', 'Buchanan 1859', 'Cleveland 

In [105]:
# Exercise 2:

from pandas import DataFrame

# Get list of presidents
presidents = [presi.pres for presi in corpus.docs]
years = [ye.year for ye in corpus.docs]

pr1 = list(X1.keys())
pr2 = list(X2.keys())

pr1=[x[4:] for x in pr1]
pr2=[x[4:] for x in pr2]

pr1=[int(x) for x in pr1]
pr2=[int(x) for x in pr2]

pres1 = [presidents[i] for i in pr1]
pres2 = [presidents[i] for i in pr2]

year1 = [years[i] for i in pr1]
year2 = [years[i] for i in pr2]

scores1 = list(X1.values())
scores2 = list(X2.values())


df1 = DataFrame({
        'president': pres1,
        'year': year1,
        'score1': scores1
        })

df2 = DataFrame({
        'president': pres2,
        'year' : year2,
        'score1': scores2
        })


g1=df1.groupby(['president']).mean()
g2=df2.groupby(['president']).mean()
#g = df1.groupby('president').mean().sort_values(ascending=False)
#most_rated = lens.groupby('title').size().sort_values(ascending=False)[:25]

print df1
print df2

print g1
print g2


ValueError: invalid literal for int() with base 10: 'an 1983'

In [37]:
pr1 = list(X1.keys())
pr2 = list(X2.keys())

pr1=[x[4:] for x in pr1]
pr2=[x[4:] for x in pr2]

pr1=[int(x) for x in pr1]
pr2=[int(x) for x in pr2]

presidents = [presi.pres for presi in corpus.docs]
pres1 = [presidents[i] for i in pr1]
pres2 = [presidents[i] for i in pr2]

print pres1

['Clinton', 'Bush', 'Obama', 'Clinton', 'Obama', 'Clinton', 'Clinton', 'Clinton', 'Bush', 'Clinton', 'Bush', 'Obama', 'Clinton', 'Bush', 'Bush', 'Bush', 'Bush', 'Bush', 'Obama', 'Clinton', 'Obama', 'Obama', 'Bush', 'Bush', 'Bush', 'Reagan', 'Reagan', 'Reagan', 'Bush', 'Reagan', 'Reagan', 'Reagan']
