Questions:

- For POS tagging, should I use the filtered list or the raw tokens? (They give different results.)
        #self.pos_tags = self.parts_of_speech_tagging(self.filtered_text)
- Do I keep ANP? The text file is pretty bad. 
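
On the POS tagging question: a tagger relies on sentence context and capitalization, both of which the filtered list discards, which is a likely reason the two inputs give different results. One possible approach is to tag the raw tokens and filter afterwards. The sketch below assumes `tagged` has the `(token, tag)` shape that `nltk.pos_tag` returns; the helper `filter_tagged` and the tiny stop word set are hypothetical and only for illustration.

```python
# Hypothetical stop word set for illustration; the notebook itself uses
# stopwords.words('english') from NLTK.
STOP_WORDS = {"the", "will", "a", "of"}


def filter_tagged(tagged, stop_words=STOP_WORDS):
    """Keep (word, tag) pairs for alphabetic, non-stop-word tokens.

    The word is lower-cased, but the tag is the one assigned while the
    token was still in its original sentence context.
    """
    return [
        (word.lower(), tag)
        for word, tag in tagged
        if word.isalpha() and word.lower() not in stop_words
    ]


# Hand-written example in the (token, tag) shape that nltk.pos_tag returns.
tagged = [("The", "DT"), ("Government", "NNP"), ("will", "MD"),
          ("build", "VB"), ("schools", "NNS"), (".", ".")]
print(filter_tagged(tagged))
```

With this ordering, `self.pos_tags` could stay based on `self.tokens`, and a filtered view can be derived from it whenever needed.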

### Things done so far:

- Converted PDF and DOC files to TXT files
- Created a Manifesto class (object-oriented programming)
- Read each manifesto text file
- Created word tokens for each text file
- Created sentence tokens for each text file
- Computed word frequencies for each text file
- Ran part-of-speech tagging on each text file
- Created a 'stemmed list' (a set of stems) from the tokens in each text file
- Pre-processed tokens by removing stop words and converting to lower case, so that case variants are not counted separately
- Found the top X most frequent words in each text file
- Created a word cloud of any given text, with a configurable maximum number of words
- Summarized the text of any text file
- Created n-grams of any length and listed the X most common ones
- Calculated document similarity between any two manifestos
- Calculated sentence sentiment (polarity and subjectivity) and plotted a graph of each
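
The pre-processing, word-frequency and n-gram steps above can be sketched with the standard library alone. This is a simplified illustration, not the notebook's implementation: `str.split()` stands in for `nltk.word_tokenize`, the stop word set is a small hypothetical one, and the helper names `preprocess` and `top_ngrams` are made up for this example.

```python
from collections import Counter

# Hypothetical, tiny stop word set for illustration only; the notebook uses
# NLTK's English stop word list instead.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "for", "will"}


def preprocess(tokens):
    """Lower-case alphabetic tokens and drop stop words."""
    words = [t.lower() for t in tokens if t.isalpha()]
    return [w for w in words if w not in STOP_WORDS]


def top_ngrams(words, n, k):
    """Return the k most common n-grams as (tuple, count) pairs."""
    # zip over n shifted copies of the list yields consecutive n-token windows.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(grams).most_common(k)


# str.split() stands in for nltk.word_tokenize to keep the sketch dependency-free.
sample = "The party will provide jobs and the party will provide education"
words = preprocess(sample.split())
print(Counter(words).most_common(2))   # two most frequent words
print(top_ngrams(words, 2, 1))         # most frequent bigram
```

Note that n-grams are built from a list of word tokens; passing a raw string to an n-gram function would produce character n-grams instead.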


### Packages Used:
- NLTK
- Gensim
- WordCloud
- Matplotlib
- collections (standard library)
- spaCy
- TextBlob

In [134]:
import os
import nltk
from nltk.corpus import stopwords
from nltk.stem import RSLPStemmer
from nltk import pos_tag
from nltk.util import ngrams   

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
# gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4.0.
from gensim.summarization.summarizer import summarize
from collections import Counter
import spacy
from textblob import TextBlob



class Manifesto(object):
    '''
    Creates a Manifesto object.
    Can be used to assess different aspects of a text file
        including tokens, most common words, parts of speech.
    Can be used to create word clouds.
    '''
    
    
    
    def __init__(self, file_path, name):
        '''
        Initializes the object variables.
        '''
        self.name = name
        self.text = self.reading_file (file_path)
        self.tokens = self.tokenize (self.text)
        self.sentence_tokens = self.token_sentences(self.text)
        self.filtered_text = self.preprocessing (self.tokens)
        self.stemmed_list = self.stemmer(self.filtered_text)
        self.pos_tags = self.parts_of_speech_tagging(self.tokens)
        self.word_frequency = self.finding_word_frequency(self.filtered_text)
        
    
    def reading_file (self, file_path):
        '''
        Given a file path, checks if file exists there, reads it, closes it,
            and returns the text as a string.

        Input:
            file_path: string

        Return:
            text: string
        '''
        assert os.path.exists(file_path), "File not found at: "+str(file_path)
        # The context manager closes the file even if reading fails.
        with open(file_path, 'r') as f:
            text = f.read()
        return text
        
        
    def tokenize (self, text):
        '''
        Given some text, will return tokens of that text
        
        Input: 
            text: string
        Output:
            token: list of string
        '''
        tokens = nltk.word_tokenize(text)
        return tokens
    
    
    def token_sentences(self, text):
        '''
        Given some text, will return sentence tokens of that text
        
        Input:
            text: string
        
        Output:
            list of sentence tokens (string)
        '''
        sent_text = nltk.sent_tokenize(text)
        return sent_text

        
    def finding_word_frequency (self, filtered_text):
        '''
        Given a list, returns the word frequency
        
        Input:
            filtered_text: list of strings
        Output:
            word frequency: nltk.probability.FreqDist
        '''
        word_frequency = nltk.FreqDist(filtered_text)
        return word_frequency
        
        
    def parts_of_speech_tagging(self, tokens):
        '''
        Given tokens, assigns a part-of-speech tag to each token.

        Input:
            tokens: list of string
        Output:
            tagged: list of (token, tag) tuples
        '''
        tagged = nltk.pos_tag(tokens)
        return tagged      
        
        
    def stemmer (self, filtered_text):
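        '''
        Stems each token in a given list.

        Input:
            filtered_text: list of string
            
        Output:
            a set of stemmed words
        '''
        # RSLPStemmer targets Portuguese; PorterStemmer matches the English
        # stop word list used in preprocessing.
        st = nltk.stem.PorterStemmer()
        stemmed_list = set(st.stem(token) for token in filtered_text)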
        return stemmed_list
        
        
        
    def preprocessing (self, tokens):
        '''
        Keeps alphabetic tokens only, converts them to lower case,
            and removes English stop words.
        
        Input:
            tokens: list of string
            
        Output:
            filtered_text: list of string
        '''
        stop_words = set(stopwords.words('english'))
        words = [word.lower() for word in tokens if word.isalpha()]
        filtered_text = [w for w in words if w not in stop_words]
        return filtered_text

    
        
    def find_most_frequent_words(self, number):
        '''
        For a given manifesto object, returns the most common X number of words used
            along with the count
            
        Input:
            number: integer
            
        Output:
            mostcommon: list of (word, count) tuples
        '''
        # Reuse the frequency distribution computed in __init__.
        mostcommon = self.word_frequency.most_common(number)
        return mostcommon
        
        
        
    def __repr__(self):
        
        return (self.name + ' Manifesto ')
    
    
    
    
def create_wordcloud(text, title = None, maximum_words = 100):
    '''
    Creates a word cloud based on the text of the file.
    Removes stop words (which consists of conventional stop words  
            and words from my own list)
            
    Input:
        text (string)
        title (optional, string)
        maximum_words (integer, default = 100)
        

    Special thanks to the community at stackoverflow
    (https://stackoverflow.com/questions/16645799/how-to-create-a-word-cloud-from-a-corpus-in-python)
    for this one!
    '''
    stop_words = list(STOPWORDS)
    personal_list = ['pakistan', 'people', 'party', 'manifesto', 'government', 'per', 
                    'cent', 'will', 'Parliamentarians', 'ANP', 'MQM', 'PML', 'PTI',
                    'i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii',
                    'ensure', 'right', 'provide']
    stop_words_2 = set(stop_words + personal_list)


    wordcloud = WordCloud(
        background_color='white',
        stopwords=stop_words_2,
        max_words=maximum_words,
        scale=3,
        max_font_size=40
    ).generate(str(text))

    fig = plt.figure(1, figsize=(20, 20), dpi = 400)
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=30)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()
    
    
def summarize_text (text):
    '''
    Given some text, will summarize it and return the summary
        
    Input:
        text (string)
    Output:
        summary (string)
    '''
    return summarize(text)


def getting_ngrams(text, gram_type, length_of_list):
    '''
    Given a list of tokens, the size of the n-gram (e.g. 2 for bigrams, 3 for trigrams),
        and the length of the list, returns the X most common n-grams with their counts.
        
    Input:
        text (list of string): word tokens; a raw string would be split into characters
        gram_type (integer): number of tokens per n-gram
        length_of_list (integer): number of most common n-grams to return
    
    Output:
        ngram (list of (n-gram tuple, count) pairs)
        
    Thanks to the Stackoverflow community at 
    https://stackoverflow.com/questions/32441605/generating-ngrams-unigrams-bigrams-etc-from-a-large-corpus-of-txt-files-and-t
    for help with this!
    '''
    gram = ngrams(text, gram_type)
    ngram = Counter(gram).most_common(length_of_list)
    return ngram



def manifesto_similarities():
    '''
    Prints similarity score between the manifestos of different parties using manifesto text.
    Uses package 'spacy' to calculate similarity
    
    '''
    # Note: the 'en' shortcut was removed in spaCy 3.0; newer versions need a
    # full model name such as 'en_core_web_sm'.
    nlp = spacy.load('en')
    # Relies on the module-level Manifesto objects created below.
    list_of_parties = [ppp, pmln, mqm, pti, anp, ji]
    docs = [nlp(party.text) for party in list_of_parties]

    print ('{} {:>30} {:>40}'.format('PARTY 1', 'PARTY 2','SIMILARITY SCORES'))
    print ()
    # Compare every manifesto with every manifesto (including itself).
    for party1, doc1 in zip(list_of_parties, docs):
        for party2, doc2 in zip(list_of_parties, docs):
            print ('{:30} {:30} {:.5}'.format(party1.name, party2.name, doc1.similarity(doc2)))
        print ()
        
        
def sentiment_analysis (sentence_tokens, name):
    '''
    Given sentence tokens, calculates polarity and subjectivity of each sentence in the document,
        and plots graphs of polarity and subjectivity
        
    Input:
        sentence_tokens: list of strings
        name: the name of each party (string)
    
    '''
    polarity = []
    subjectivity = []
    for sentence in sentence_tokens:
        # TextBlob returns a (polarity, subjectivity) named tuple.
        sentiment = TextBlob(sentence).sentiment
        polarity.append(sentiment.polarity)
        subjectivity.append(sentiment.subjectivity)
        
    plt.plot(polarity)
    plt.xlabel('Sentences across manifesto')
    plt.ylabel('Polarity')
    plt.title(' Manifesto of '+ name)
    plt.show()


    plt.plot(subjectivity, 'C7')
    plt.xlabel('Sentences across manifesto')
    plt.ylabel('Subjectivity')
    plt.title(' Manifesto of '+ name)
    plt.show()

In [132]:
# CREATING MANIFESTO OBJECTS OF THE POLITICAL PARTIES

ppp = Manifesto ('/Users/kazi/Desktop/Manifesto Text Files/PPP_2013.txt', 'Pakistan Peoples Party')
pmln = Manifesto ('/Users/kazi/Desktop/Manifesto Text Files/PMLN_2013.txt', 'Pakistan Muslim League N')
mqm = Manifesto ('/Users/kazi/Desktop/Manifesto Text Files/MQM_2013.txt', 'Muttahida Qaumi Movement')
pti = Manifesto ('/Users/kazi/Desktop/Manifesto Text Files/PTI_2013.txt', 'Pakistan Tehreek-e-Insaaf')
anp = Manifesto ('/Users/kazi/Desktop/Manifesto Text Files/ANP_2013.txt', 'Awami National Party')
ji = Manifesto ('/Users/kazi/Desktop/Manifesto Text Files/JI_2013.txt', 'Jamaat-e-Islami')
