# Exercise Fourteen: Project Design Starter
In this exercise, you'll plan out a complex project. You'll pull in some code, but focus on comments that describe your project structure. The sample document below will guide you through organizing and annotating your project design. The primary components you'll include are:

- Dependencies: What modules will your project need?
- Collection: Where is your data coming from?
- Processing: How will you format and process your data?
- Analysis: What techniques will you use to understand your data?
- Visualization: How will you visualize and explore your data?

Don't worry if you aren't exactly certain how you would implement everything - this should be a starting point for a larger research study, but it doesn't need to be a complete, functional workflow. Aim for a "good enough" starting point that you can reference and extend for future work.

Note where you have something working, and where it's broken or in progress.

Race After Technology: Chapter 5
Digital Humanities Coursebook: Coda

(Karsdorp, Kestemont, and Riddell).


"When people change how they speak or act in order to conform to dominant norms, we call it “code-switching”" (Benjamin 180). 

"Data, in short, do not speak for themselves and don’t always change hearts and minds or policy" (Benjamin 192). 

What helped push the technical philosophy further and keep the process moving was knowing that "An interface is a set of cognitive cues. It may look like a screen full of pictures of things inside the computer, but in fact, the interface mediates between an individual [and] the computational activity" (Drucker 176).

## Project Overview: NaNoGenMo
This sample project builds on our previous exercises inspired by National Novel Generation Month. It offers a framework for exploring text generation based upon children's literature, inspired by NaNoGenMo's call to think about different forms of procedural making. As such, it is guided by that project's rule: "Spend the month of November writing code that generates a novel of 50k+ words."

(Replace this text with a short description of what your envisioned project design will accomplish. Include your research question and goals for this analysis.)

## Stage One: Dependencies

Add the import code for every dependency of your project: for instance, if you are collecting data, you might import Tweepy or BeautifulSoup. If you're working with a folder of files, import os. Most projects will require Pandas, along with appropriate processing and visualization libraries. In the comments, explain briefly why you are including each library (as shown in the example below).




In [13]:
# Importing Pandas to handle collected Twitter data (example comment)
import pandas as pd

BeautifulSoup makes it possible to collect data without an API, which made this approach comfortable to work with. It also brings to life the advice to begin with "a problem or a question. If your problem or question is not well defined, develop or find one which is" (Karsdorp, Kestemont, Riddell 323).

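In [14]:
# Requests to download each page of reviews over HTTP
import requests
# BeautifulSoup to parse the downloaded HTML, since no API is used
from bs4 import BeautifulSoup
# Pandas to store the collected reviews in a DataFrame
import pandas as pd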
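
Before moving on, it can help to confirm that each dependency is actually installed. The following is a minimal sketch using only the standard library; the `missing_modules` helper and the list of names are illustrative assumptions, not part of the exercise template.

```python
from importlib.util import find_spec

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]

# These are import names, not pip package names: BeautifulSoup is imported as bs4
required = ["requests", "bs4", "pandas"]
print("Missing:", missing_modules(required))
```

An empty list means every import in this stage should succeed.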

## Stage Two: Collection
Describe your data collection scope and process briefly, and include an example of how you might collect your data drawing on our other projects. For example, if this workflow will collect Twitter data from a stream, you might revisit that demo, copy the stream, and adjust the hashtag.


In [15]:
# Collect data using a Tweepy stream (example annotation)
# (Copy and modify code from other exercises to prototype this goal)

In [16]:
# Python Package imports
import requests
from bs4 import BeautifulSoup
from dateutil.parser import parse
import concurrent.futures
import pandas as pd

Exploring the URL of interest, https://www.metacritic.com/game/playstation-3/the-walking-dead-a-telltale-games-series/user-reviews?page=, made it possible to “Consider many models. Different narratives are often compatible with the same set of observations” (Karsdorp, Kestemont, Riddell 324).

In [12]:
# One list per column; each review appends one value to every list
review_dict = {'name':[], 'date':[], 'rating':[], 'review':[]}
base_url = 'https://www.metacritic.com/game/playstation-3/the-walking-dead-a-telltale-games-series/user-reviews?page='
# Send a browser-like user agent with every request
user_agent = {'User-agent': 'Mozilla/5.0'}
for i in range(0, 50):
    response = requests.get(base_url + str(i), headers=user_agent)
    soup = BeautifulSoup(response.text, 'html.parser')
    for review in soup.find_all('div', class_='review_content'):
        # Stop reading this page at the first block without a reviewer name
        if review.find('div', class_='name') is None:
            break
        review_dict['name'].append(review.find('div', class_='name').find('a').text)
        review_dict['date'].append(review.find('div', class_='date').text)
        review_dict['rating'].append(review.find('div', class_='review_grade').find_all('div')[0].text)
        # Prefer the expanded text, then the short text, then a placeholder
        expanded = review.find('span', class_='blurb blurb_expanded')
        if expanded is not None:
            review_dict['review'].append(expanded.text)
        else:
            short = review.find('div', class_='review_body').find('span')
            review_dict['review'].append('No review text.' if short is None else short.text)
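
The parsing step in the loop above can be tried offline before sending any requests. This is a minimal sketch: the HTML snippet is made up to imitate the class names the loop looks for, and `parse_review` is a hypothetical helper, so the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that imitates the class names used in the scraping loop
sample_html = """
<div class="review_content">
  <div class="name"><a href="#">player_one</a></div>
  <div class="date">Nov 1, 2012</div>
  <div class="review_grade"><div>9</div></div>
  <div class="review_body"><span>Great story and characters.</span></div>
</div>
"""

def parse_review(block):
    """Pull name, date, rating, and text out of one review block."""
    body = block.find('div', class_='review_body')
    span = body.find('span') if body is not None else None
    return {
        'name': block.find('div', class_='name').find('a').text.strip(),
        'date': block.find('div', class_='date').text.strip(),
        'rating': block.find('div', class_='review_grade').find('div').text.strip(),
        'review': span.text.strip() if span is not None else 'No review text.',
    }

soup = BeautifulSoup(sample_html, 'html.parser')
parsed = [parse_review(b) for b in soup.find_all('div', class_='review_content')]
print(parsed)
```

Testing against a saved snippet like this makes it easier to tell a parsing mistake from a network or page-layout problem.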

## Stage Three: Processing
After your data has been collected or imported, store it in a format that works for your purposes. This can vary: for Twitter analysis, it might be a Pandas dataframe, while for text, you might build a document term matrix.


In [None]:
# Store Twitter data using Pandas with appropriate column names (example comment)
# (Copy and modify code from other exercises to prototype this goal)

Creating the Pandas DataFrame was a smooth process. It helped to "Account for variability in human judgments" (Karsdorp, Kestemont, Riddell 324) and gave the rest of the code a clearer organization.

In [None]:
ac_reviews = pd.DataFrame(review_dict)
print(ac_reviews)
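
Scraped values arrive as strings, so a common next step is converting columns to numeric and date types. The sketch below uses made-up rows in the same shape as `review_dict`; the date format `'%b %d, %Y'` is an assumption about how the dates are written and should be checked against the collected data.

```python
import pandas as pd

# Made-up rows in the same shape as review_dict
sample = {
    'name': ['player_one', 'player_two'],
    'date': ['Nov 1, 2012', 'Dec 15, 2012'],
    'rating': ['9', '4'],
    'review': ['Great story.', 'Too short.'],
}
reviews = pd.DataFrame(sample)

# Convert text columns to types that support sorting and arithmetic
reviews['rating'] = pd.to_numeric(reviews['rating'])
reviews['date'] = pd.to_datetime(reviews['date'], format='%b %d, %Y')

print(reviews.dtypes)
```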

## Stage Four: Analysis
Think across all of the methods we've tried this semester - what combination would be most helpful for your goals? Include code sections for each method you think is important. In most cases, a combination will be most revealing: for instance, you might employ several different textual analysis frameworks on a set of documents. Use at least two distinctly different methods of analysis.


In [None]:
# Compare several account outputs using PCA (example comment)
# (Copy and modify code from other exercises to prototype this goal)

The regular expressions below clean up the review text by removing URLs, mentions, hash symbols, and emoji. BeautifulSoup also handles a significant part of the cleaning setup, in the spirit of exploring "ideas from math and (Bayesian) statistics. Good ideas are found everywhere" (Karsdorp, Kestemont, Riddell 324).

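In [None]:
import re
# Patterns for URLs, @mentions, and the hash symbol, written as raw strings
# so the backslashes reach the regex engine unchanged.
# Because the scheme is optional, the first pattern also matches bare word.word text.
re_list = [r'(https?://)?(www\.)?(\w+\.)?(\w+)(\.\w+)(/.+)?', r'@[A-Za-z0-9_]+', '#']
combined_re = re.compile('|'.join(re_list))
# Character class covering common emoji ranges
regex_pattern = re.compile("["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+", flags=re.UNICODE)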

In [None]:
from nltk.tokenize import WordPunctTokenizer
token = WordPunctTokenizer()
def cleaning_reviews(t):
    """Strip markup, links, mentions, and emoji, then lowercase and drop short tokens."""
    # The 'lxml' parser requires the lxml package to be installed
    del_amp = BeautifulSoup(t, 'lxml')
    del_amp_text = del_amp.get_text()
    del_link_mentions = combined_re.sub('', del_amp_text)
    del_emoticons = regex_pattern.sub('', del_link_mentions)
    lower_case = del_emoticons.lower()
    words = token.tokenize(lower_case)
    # Keep only tokens longer than two characters
    result_words = [x for x in words if len(x) > 2]
    return " ".join(result_words).strip()

In [None]:
# Clean every review and keep the results in a list
cleaned_reviews = [cleaning_reviews(text) for text in ac_reviews['review']]
print(cleaned_reviews[0:5])
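
To see what this kind of cleaning removes, a small sample can be run through similar patterns. This is a sketch with a made-up sample string; the URL pattern is a simplified stand-in that requires an explicit scheme, and `strip_noise` is a hypothetical helper, not the function used above.

```python
import re

# Simplified stand-ins for the patterns used in this notebook
url_re = re.compile(r'https?://\S+')
mention_re = re.compile(r'@[A-Za-z0-9_]+')
emoji_re = re.compile('[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF]+')

def strip_noise(text):
    """Remove links, mentions, hash symbols, and emoji, then tidy the spacing."""
    text = url_re.sub('', text)
    text = mention_re.sub('', text)
    text = text.replace('#', '')
    text = emoji_re.sub('', text)
    return ' '.join(text.split())

sample = "Loved it 😀 thanks @telltale! See https://example.com/review #GOTY"
cleaned = strip_noise(sample)
print(cleaned)  # Loved it thanks ! See GOTY
```

Requiring the scheme keeps ordinary sentences such as "end.Next" from being treated as links, at the cost of missing bare domains.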

In [None]:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

stopwords = set(STOPWORDS)
stopwords.update(["walking","dead","telltale","kill"])

Varying the code, color, and size settings allowed for visual experimentation and expression, as "The creation of digital assets will then serve the project's overall design" (Drucker 193).

In [None]:
string = pd.Series(cleaned_reviews).str.cat(sep=' ')
wordcloud = WordCloud(width=1600, stopwords=stopwords,height=1000,max_font_size=250,max_words=100,collocations=False, background_color='blue').generate(string)
plt.figure(figsize=(40,30))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
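
A word cloud sizes words by how often they appear, so it can be useful to look at the counts directly. The sketch below uses the standard library only; the sample reviews and stop words are made up for illustration.

```python
from collections import Counter

# Made-up cleaned reviews and stop words, for illustration only
cleaned_sample = [
    "great story and great characters",
    "the story felt short",
]
stop_words = {"and", "the"}

# Split each review into words and skip the stop words
words = [w for text in cleaned_sample for w in text.split() if w not in stop_words]
counts = Counter(words)

print(counts.most_common(2))
```

A table of counts like this is one way to check that the largest words in the cloud reflect the data and not a cleaning artifact.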

## Stage Five: Visualization
Finally, think about the visualizations that would be most useful to sharing and exploring your data. Consider both static and dynamic approaches from the different libraries we've worked with this semester. Include at least two preliminary visualizations.


In [None]:
# Build a wordcloud of term distributions in each document (example comment)
# (Copy and modify code from other exercises to prototype this goal)

In [None]:
import numpy as np
from PIL import Image
import random

mask = np.array(Image.open('Walk.PNG'))

I took out the subplots and shifted toward varying the color. A higher-resolution image can be achieved by finding fixed color numbers and adjusting the range.

In [None]:
def green_color_func(word, font_size, position, orientation, random_state=None,
                    **kwargs):
    # Fixed green hue and saturation with a random lightness between 60% and 100%
    return "hsl(150, 50%%, %d%%)" % random.randint(60, 100)

wordcloud = WordCloud(width=2000, mask=mask, stopwords=stopwords, height=1000, max_font_size=250, max_words=100, collocations=False, background_color='violet').generate(string)
plt.figure(figsize=(40,30))
# Draw the mask image first; the word cloud is then drawn over it
plt.imshow(mask, cmap=plt.cm.gray, interpolation='bilinear')
plt.imshow(wordcloud.recolor(color_func=green_color_func, random_state=3),
           interpolation="bilinear")
plt.title('The Walking Dead', size=100)
plt.axis("off")
plt.show()

Stage Five: Import Bokeh and chart some aspect of the text: this could be the wordcount, topics, or sentiment analysis as demoed.

Sentiment analysis drew my interest. Charting differences in sentiment scores shows the range of judgments across the reviews, a reminder that “Knowledge of the existing tools and platforms for this aspect of research is important” (Drucker 199).

In [None]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# VADER needs its lexicon, which is downloaded once per environment
nltk.download('vader_lexicon')

# Next, we initialize VADER so we can use it within our Python script
sid = SentimentIntensityAnalyzer()

In [None]:
def calculate_sentiment(text):
    # Run VADER on the text
    scores = sid.polarity_scores(text)
    # Extract the compound score
    compound_score = scores['compound']
    # Return compound score
    return compound_score

In [None]:
ac_reviews['Sentiment Score'] = ac_reviews['review'].apply(calculate_sentiment)
ac_reviews.sort_values(by='Sentiment Score', ascending=False)[:15]
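
VADER's compound score ranges from -1 to 1, and it is often turned into a label for easier reading. The sketch below assumes the commonly used cutoffs of 0.05 and -0.05; `label_sentiment` is a hypothetical helper, and the threshold is a convention that can be adjusted.

```python
def label_sentiment(compound, threshold=0.05):
    """Turn a compound score into a positive, negative, or neutral label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Made-up compound scores, for illustration only
for score in (0.8, 0.0, -0.6):
    print(score, label_sentiment(score))
```

A helper like this could be applied to the 'Sentiment Score' column to compare labels against the numeric review ratings.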

NOTES: I am new to color palettes, so I tried the "jitter" transform and a palettes import. Review scores are whole numbers, so "jitter" is brought in to spread the points across the space around each score.
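
The idea behind jitter can be sketched in plain Python: each value gets a small random offset so that points sharing the same rating do not sit on top of each other. This illustrates the concept only and is not Bokeh's implementation; it assumes a uniform offset of at most half the width in either direction, and `jitter_values` is a hypothetical helper.

```python
import random

def jitter_values(values, width, seed=None):
    """Add a uniform random offset in [-width/2, width/2] to each value."""
    rng = random.Random(seed)
    half = width / 2
    return [v + rng.uniform(-half, half) for v in values]

# Made-up ratings with repeated values
ratings = [8, 8, 8, 10, 10]
jittered = jitter_values(ratings, width=0.5, seed=42)
print(jittered)
```

Under this assumption, a width below 1 keeps the points for one whole-number rating from overlapping the points for its neighbors.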

In [None]:
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show, output_file, save
from bokeh.io import output_notebook
from bokeh.palettes import Viridis256
from bokeh.models.tools import HoverTool
from bokeh.models.formatters import DatetimeTickFormatter
from bokeh.models import ColorBar
from bokeh.transform import linear_cmap
from bokeh.models.tools import WheelZoomTool
from bokeh.transform import jitter

#file for output
output_file(filename="ac.html", title="AC Reviews Visualization")

I experimented with the range and size of indicators, etc., as “A platform with a broad user community is more likely to last—and to provide help support in the form of list-servs and other venues” (Drucker 199).

In [None]:
ac_reviews['rating'] = ac_reviews['rating'].astype(int)
source = ColumnDataSource(ac_reviews)
# Map sentiment scores to colors; VADER compound scores fall between -1 and 1
mapper = linear_cmap(field_name='Sentiment Score', palette=Viridis256, low=-3, high=2)
# Newer Bokeh releases use width/height in place of plot_width/plot_height
p = figure(height=1000, width=1000, toolbar_location="below")
p.circle(x=jitter('rating', width=2, range=p.x_range), y='Sentiment Score', source=source, size=10, line_color=mapper, color=mapper, fill_alpha=1)
# Activate the wheel zoom tool that is already part of the figure's toolbar
p.toolbar.active_scroll = p.select_one(WheelZoomTool)
p.title.text = 'The Walking Dead Reviews'
p.xaxis.axis_label = 'Review Score'
p.yaxis.axis_label = 'Sentiment Score'
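
The choice of `low` and `high` decides how much of the palette the scores can reach. The sketch below shows the general idea of a linear color mapping in plain Python; it is an approximation for illustration, not Bokeh's own code, and `palette_index` is a hypothetical helper.

```python
def palette_index(value, low, high, n_colors):
    """Map a value to a palette position, clamping values outside [low, high]."""
    if value <= low:
        return 0
    if value >= high:
        return n_colors - 1
    fraction = (value - low) / (high - low)
    return min(int(fraction * n_colors), n_colors - 1)

# Compare the notebook's range with a range that matches VADER's -1 to 1 scores
for score in (-1.0, 0.0, 1.0):
    wide = palette_index(score, low=-3, high=2, n_colors=256)
    tight = palette_index(score, low=-1, high=1, n_colors=256)
    print(score, wide, tight)
```

Under these assumptions, scores between -1 and 1 use only about 40% of a 256-color palette when `low=-3` and `high=2`, while `low=-1` and `high=1` would spread them across the full range.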

Even though it is light on interactive modes, the HTML visual presentation does show the separation of sentiment scores through light and dark colors, as "a repository like GitHub ... is an invaluable resource for anyone working on digital project development" (Drucker 208).

In [None]:
from bokeh.models.tools import PanTool, WheelZoomTool

color_bar = ColorBar(color_mapper=mapper['transform'], width=8)
p.background_fill_color = "gray"
p.add_layout(color_bar, 'right')

hover = HoverTool()
hover.tooltips= """
<div style="width:200px;"><b>Review: </b>
@review
</div>
"""

p.add_tools(hover)

output_notebook()

show(p)