# Exercise Twelve: Texts, Three Ways
This week, you will try out the three methods we've explored (topic modeling, sentiment analysis, and Markov chain generation) on the same set of source texts.

- Collect and import ten documents (novels work best, but anything goes!)
- Using the topic modeling code as a starter, build a topic model of the documents
- Using the sentiment analysis code as a starter, run a sentiment analysis on sample fragments from the documents and compare
- Using the Markov chain code as a starter, generate a sentence using one of the documents
- Using the Markov chain code as a starter, generate a longer text fragment using all of the documents

As a bonus, try to extend this analysis to note other features of these documents using any of our previous exercises as a starting point.

(Karsdorp, Kestemont, and Riddell).


## Stage One: Collect and import ten documents (novels work best, but anything goes!)


In [41]:
import wikipedia
import os

path = "entries/"
os.makedirs(path, exist_ok=True)  # create the output folder if it is missing

directors = ["Steven Spielberg", "George Lucas", "Martin Scorsese", "Ridley Scott", "M. Night Shyamalan"]
for director in directors:
    page = wikipedia.page(director)  # fetch the Wikipedia entry
    print(page.title)
    filename = path + director.replace(" ", "_") + ".txt"
    # the with block closes the file automatically, so no f.close() is needed
    with open(filename, "w", encoding="utf-8", errors="ignore") as f:
        f.write(page.content)

In [42]:
authors = ["Terry Pratchett", "Annalee Newitz", "Charlie Jane Anders", "Octavia Butler", "N.K. Jemisin"]
for author in authors:
    page = wikipedia.page(author)  # fetch the Wikipedia entry
    print(page.title)
    filename = path + author.replace(" ", "_") + ".txt"
    # the with block closes the file automatically, so no f.close() is needed
    with open(filename, "w", encoding="utf-8", errors="ignore") as f:
        f.write(page.content)

AttributeError: module 'wikipedia' has no attribute 'page'
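
The `AttributeError` above is cell output. One common cause of this kind of error, though not necessarily the cause here, is that Python imported something other than the intended package, for example a local file named `wikipedia.py` that shadows the installed library. A minimal sketch of a check, using only the standard library; `describe_module` is a hypothetical helper, not part of any package:

```python
import importlib


def describe_module(name, attribute):
    """Report where a module was loaded from and whether it has an attribute.

    Hypothetical helper for debugging import problems.
    """
    module = importlib.import_module(name)
    origin = getattr(module, "__file__", None)  # None for built-in modules
    return {
        "module": name,
        "loaded_from": origin,
        "has_attribute": hasattr(module, attribute),
    }


# Demonstrated on a standard-library module so the sketch runs anywhere;
# in the notebook you would call describe_module("wikipedia", "page").
report = describe_module("json", "loads")
print(report)
```

If `loaded_from` points at a file in your own project folder rather than at the installed package, renaming that file is a reasonable first thing to try.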

## Stage Two: Using the topic modeling code as a starter, build a topic model of the documents


In [None]:
import pandas as pd
import os
import numpy as np

documents = []
path = 'entries/'  # the folder the Stage One cells write to

filenames=sorted([os.path.join(path, fn) for fn in os.listdir(path)])
print(len(filenames)) # count files in corpus
print(filenames[:10]) # print names of 1st ten files in corpus

In [None]:
import sklearn.feature_extraction.text as text

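vectorizer = text.CountVectorizer(input='filename', stop_words="english", min_df=1)
dtm = vectorizer.fit_transform(filenames).toarray()  # document-term matrix: one row per file, one column per word

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = np.array(vectorizer.get_feature_names_out())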

In [None]:
print(f'Shape of document-term matrix: {dtm.shape}. '
      f'Number of tokens {dtm.sum()}')
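
To make the shape printed above less abstract, here is a minimal standard-library sketch of what a document-term matrix holds: one row per document, one column per vocabulary word, each cell a count. The two toy documents and all `toy_` names are hypothetical. Unlike `CountVectorizer` as configured in the cell above, this sketch does not remove stop words.

```python
from collections import Counter

# Two tiny stand-in documents (hypothetical, not from the corpus)
toy_documents = [
    "the director made a film",
    "the author wrote a novel and a film script",
]

# Count the words in each document
toy_counts = [Counter(doc.split()) for doc in toy_documents]

# The vocabulary is every distinct word, sorted so columns have a fixed order
toy_vocab = sorted(set(word for counts in toy_counts for word in counts))

# One row per document, one column per vocabulary word
toy_dtm = [[counts[word] for word in toy_vocab] for counts in toy_counts]

print(toy_vocab)
for row in toy_dtm:
    print(row)
```

Summing every cell gives the total number of tokens, which is what `dtm.sum()` reports for the real matrix.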

In [None]:
import sklearn.decomposition as decomposition
model = decomposition.LatentDirichletAllocation(
    n_components=100, learning_method='online', random_state=1)
document_topic_distributions = model.fit_transform(dtm)
vocabulary = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
# (# topics, # vocabulary)
assert model.components_.shape == (100, len(vocabulary))
# (# documents, # topics)
assert document_topic_distributions.shape == (dtm.shape[0], 100)  

In [None]:
topic_names = [f'Topic {k}' for k in range(100)]
topic_word_distributions = pd.DataFrame(
    model.components_, columns=vocabulary, index=topic_names)
print(topic_word_distributions)

In [None]:
topic_word_distributions.loc['Topic 9'].sort_values(ascending=False).head(18)
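
The cell above inspects a topic through its top words. A complementary view is to ask which topic carries the most weight in each document, which means finding the largest entry in each row of `document_topic_distributions`. The sketch below does this on a small hypothetical table in plain Python; the `toy_` values and file names are invented for illustration.

```python
# Hypothetical stand-in for document_topic_distributions:
# one row per document, one column per topic, each row summing to 1
toy_distributions = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.60, 0.10],
]
toy_filenames = ["doc_a.txt", "doc_b.txt", "doc_c.txt"]  # hypothetical names


def dominant_topic(row):
    """Return (topic index, weight) for the largest weight in one row."""
    best = max(range(len(row)), key=lambda k: row[k])
    return best, row[best]


dominant = {name: dominant_topic(row)
            for name, row in zip(toy_filenames, toy_distributions)}
for name, (topic, weight) in dominant.items():
    print(f"{name}: Topic {topic} ({weight:.2f})")
```

On the real NumPy array, `document_topic_distributions.argmax(axis=1)` should give the same topic indices in one call.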

In [None]:
from matplotlib import pyplot as plt
from wordcloud import WordCloud

# Top 18 words for one topic, with their weights
words = topic_word_distributions.loc['Topic 2'].sort_values(ascending=False).head(18)
print(words)

# Create and generate a word cloud image, sized by each word's weight:
wordcloud = WordCloud().generate_from_frequencies(words.to_dict())

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

## Stage Three: Using the sentiment analysis code as a starter, run a sentiment analysis on sample fragments from the documents and compare


In [None]:
import nltk
nltk.download('vader_lexicon')
nltk.download('punkt')

from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

In [None]:
for filename in filenames:
    with open(filename, encoding="utf-8") as f:
        text = f.read()
        documents.append(text)
        scores = sid.polarity_scores(text[0:500])
    print(filename)
    for key in sorted(scores):
        print('{0}: {1}, '.format(key, scores[key]), end='')
    print(' ')
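
The loop above prints the scores for each file, but the stage also asks for a comparison. One simple way to compare, sketched here under the assumption that you collect each file's scores in a dictionary, is to rank the documents by their `compound` value. The scores and file names below are hypothetical; only the four keys match what `polarity_scores()` returns.

```python
# Hypothetical scores in the shape polarity_scores() returns:
# 'neg', 'neu', 'pos' and an overall 'compound' value between -1 and 1
toy_scores = {
    "doc_a.txt": {"neg": 0.05, "neu": 0.80, "pos": 0.15, "compound": 0.62},
    "doc_b.txt": {"neg": 0.20, "neu": 0.75, "pos": 0.05, "compound": -0.48},
    "doc_c.txt": {"neg": 0.02, "neu": 0.90, "pos": 0.08, "compound": 0.31},
}

# Sort from most positive to most negative by the compound score
ranked = sorted(toy_scores.items(),
                key=lambda item: item[1]["compound"],
                reverse=True)

for name, scores in ranked:
    print(f"{name}: compound = {scores['compound']:+.2f}")
```

In the notebook, the same ranking could be built by storing `scores` under `filename` inside the loop instead of only printing it.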

## Stage Four: Using the Markov chain code as a starter, generate a sentence using one of the documents


In [None]:
import markovify
import random
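# Join every document into one string, with a newline between documents
# so the last sentence of one does not run into the first of the next
generator_text = "\n".join(documents)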

In [None]:
# Build the model from a single document; any index into documents works
text_model = markovify.Text(documents[0])
print(text_model.make_sentence())
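
To show the idea behind the generator, here is a deliberately simplified, standard-library sketch of a word-level Markov chain: record which words follow each word, then walk those records at random. It is not how `markovify` is implemented, which does considerably more (sentence handling, for one); `build_chain`, `generate`, and the toy text are hypothetical.

```python
import random
from collections import defaultdict


def build_chain(source_text):
    """Map each word to the list of words that follow it in the text."""
    words = source_text.split()
    chain = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        chain[current_word].append(next_word)
    return chain


def generate(chain, start, max_words, rng):
    """Walk the chain from `start`, picking each next word at random."""
    output = [start]
    # Stop at max_words, or earlier if the last word has no recorded follower
    while len(output) < max_words and chain.get(output[-1]):
        output.append(rng.choice(chain[output[-1]]))
    return " ".join(output)


toy_text = "the cat sat on the mat and the dog sat on the rug"  # hypothetical
toy_chain = build_chain(toy_text)
rng = random.Random(1)  # fixed seed so the run is repeatable
toy_sentence = generate(toy_chain, "the", 8, rng)
print(toy_sentence)
```

Every adjacent pair of words in the output also appears somewhere in the source text, which is why generated sentences sound locally plausible even when they drift globally.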

## Stage Five: Using the Markov chain code as a starter, generate a longer text fragment using all of the documents


In [None]:
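# Build the model from all of the documents combined, not just one
corpus_model = markovify.Text(generator_text)

novel = ''
while len(novel.split()) < 500:
    for i in range(random.randrange(3, 9)):
        sentence = corpus_model.make_sentence()
        if sentence:  # make_sentence() returns None when it cannot build a sentence
            novel += sentence + " "
    novel += "\n\n"

print(novel)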

## Bonus Stage: Try to extend this analysis to note other features of these documents using any of our previous exercises as a starting point.


