# Lab Assignment 1: Exploring Text Data
## by Avi Sinha

### 1. Business Understanding

All HTML files come from the IMDb archive of movie reviews. Each of the roughly 30,000 documents (27,886 files are read in Section 2.1) is a review, professionally written and posted to one of several online newsgroups. The data were collected by Bo Pang and Lillian Lee: http://www.cs.cornell.edu/people/pabo/movie-review-data/


#### Purpose of the Data and Analysis
Understanding customer sentiment is an important part of customer relationship management (CRM). Because people express opinions mainly in words, numeric ratings capture sentiment only to a limited extent, and they are not always available. A better approach is to infer general sentiment from the vocabulary of freely available reviews and posts.

This knowledge of sentiment is especially useful to movie distributors, who want a deeper understanding of what makes a movie successful before they spend millions distributing it through streaming or physical channels. Reviews can then guide which movies to pick up from studios. In the end, distributors profit from the more lucrative movies and consumers get what they want to watch.

#### Prediction Task
Sentiment can be extremely fine-grained, with implicit meanings such as intent, emotion, and subjectivity. This prediction task, however, is a basic polarity analysis: each review is labeled positive or negative, and the key words that drive the label are reported. That is enough to take good advantage of the wealth of data available.

#### Level of Accuracy
Success means classifying a review as positive or negative from the vocabulary it uses. How well this works depends on the length of the review and the complexity of its language. To be useful, the classifier would need an accuracy of roughly 90 percent or higher, because a misclassification could lead to a movie being passed over, or distributed in place of a better performer, causing large revenue losses through wasted production.
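
To make the accuracy target concrete, the sketch below computes accuracy together with the two kinds of error from a pair of label lists. The labels are made up for illustration and the function name `polarity_report` is hypothetical; nothing here is a result from this dataset.

```python
def polarity_report(y_true, y_pred, positive="pos"):
    """Return accuracy plus false-positive and false-negative counts."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    # predicted positive although the review was not positive
    false_pos = sum(t != positive and p == positive for t, p in pairs)
    # predicted not positive although the review was positive
    false_neg = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "false_pos": false_pos,
        "false_neg": false_neg,
    }


# Made-up labels for ten reviews (illustration only)
y_true = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
y_pred = ["pos", "pos", "neg", "pos", "pos", "neg", "pos", "neg", "neg", "neg"]
report = polarity_report(y_true, y_pred)
print(report)
```

Reporting the two error counts separately matters here because they have different costs: a false positive promotes a weak movie, while a false negative passes over a strong one.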



### 2. Data Encoding

#### 2.1 Read in raw text documents

In [60]:
# %load parse.py
import glob
import re
import string

def preprocess(text):
    text= re.sub(b"<.*?>", b" ", text)#no_tags
    text= re.sub(b"\n", b" ", text)#no_new_lines
    text= re.sub(b"\r", b" ", text)#no_returns
    #lowered with no punctuation
    text= text.translate(None, b'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~').lower()
    return text


documents = []
for filename in glob.glob('polarity_html/movie/*.html'):
    with open(filename, 'rb') as f:
        raw = f.read()
        cleaned = preprocess(raw)
        documents.append(cleaned)


print(len(documents))


27886
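
Because `preprocess` deletes punctuation outright, hyphenated words are fused into one token (for example `flashinthepan` in the sample printed in Section 2.2). One possible variant, sketched below, replaces punctuation with spaces instead. The name `preprocess_spaced` is hypothetical, and this is not the function that produced the outputs in this notebook.

```python
import re
import string

# Map every ASCII punctuation byte to a space instead of deleting it
_PUNCT_TO_SPACE = bytes.maketrans(
    string.punctuation.encode("ascii"),
    b" " * len(string.punctuation),
)


def preprocess_spaced(text):
    """Hypothetical variant of preprocess() that keeps word boundaries."""
    # drop tags, including tags that span several lines
    text = re.sub(rb"<.*?>", b" ", text, flags=re.DOTALL)
    # punctuation becomes a space, then everything is lowercased
    text = text.translate(_PUNCT_TO_SPACE).lower()
    # collapse newlines, carriage returns and repeated spaces
    return b" ".join(text.split())


sample = b"<p>A flash-in-the-pan novel;\r\nweak & sentimental.</p>"
cleaned_sample = preprocess_spaced(sample)
print(cleaned_sample)
```

The trade-off is that apostrophes also become spaces, so a possessive such as `ellis's` would split into `ellis` and `s`; single-character tokens like that are dropped by `CountVectorizer`'s default token pattern.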


#### 2.2 Verify Data Quality and Implement Stemming

In [61]:
documents[2]


b'     review for less than zero 1987              less than zero 1987   reviewed by  serdar yegulalp      less than zero 1987     a movie review by serdar yegulalp  copyright 1998 by serdar yegulalp    capsule bret easton elliss flashinthepan novel becomes a weak and  sentimental movie casting of robert downey jr is an asset though    i read less than zero when it was first published and several other times  since each time ive come back to it it seems to contain that much less  it wasnt that profound a book to begin with and what little insight it did  have los angeles is a terrible place for anyone to try to be a moral  person for one has not survived the transition to the big screen less  than zero the motion picture is even less absorbing than the book that  spawned it its just not a good movie despite the presence of three good  actors andrew mccarthy jami gertz and robert downey jr who are given  potentially interesting roles to play    mccarthyis a young la denizen clay who goe

This is one preprocessed review as it is held in memory: tags, punctuation, and line breaks have been removed and the text is lowercased.
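
Beyond inspecting a single review by eye, a few summary numbers can flag empty or suspiciously short files. The sketch below assumes a list of byte strings shaped like `documents`; the helper name `summarize_lengths` and the toy data are illustrative only.

```python
from statistics import median


def summarize_lengths(docs):
    """Summarize whitespace-token counts for a list of preprocessed byte strings."""
    lengths = [len(doc.split()) for doc in docs]
    return {
        "n_docs": len(lengths),
        "n_empty": sum(1 for n in lengths if n == 0),
        "min_tokens": min(lengths),
        "median_tokens": median(lengths),
        "max_tokens": max(lengths),
    }


# Toy stand-in for `documents`
toy_docs = [
    b"a weak and sentimental movie",
    b"",
    b"remarkably wellacted and poignant piece of work",
]
summary = summarize_lengths(toy_docs)
print(summary)
```

On the real data the same call would be `summarize_lengths(documents)`.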

#### 2.3 Convert data from raw text into sparse encoded bag-of-words

In [62]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text 



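# extend scikit-learn's English stop words with terms that occur in nearly every review
domain_specific_stop_words = ["author", "movies", "movie", "film", "reviews", "review"]
# converted to a list, which CountVectorizer accepts across scikit-learn versions
stop_words = list(text.ENGLISH_STOP_WORDS.union(domain_specific_stop_words))


count_vect = CountVectorizer(stop_words=stop_words,
                             decode_error='ignore')  # an object capable of counting words in a document!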

bag_words = count_vect.fit_transform(documents)

documents[0]

b'     review for vie r\xeav\xe9e des anges la 1998              vie r234v233e des anges la 1998   reviewed by  harvey s karten      the dreamlife of angels     reviewed by harvey karten phd   sony pictures classics   director  erick zonca   writer  erick zonca roger bohbot   cast elodie bouchez natacha regnier gregoire colin jo  prestia patrick mercado       when i was younger and wiser and dating girls who had  roommates i was regularly amazed by how often these pairs  of best friends would split up prematurely  the rate of  breakdowns of twentysomethings during the 1960s exceeded  even the current frequency of divorce  does living together  ruin friendships  in many cases this seems to be true  to  get more insight into the enigma take in erick zoncas  remarkably wellacted and poignant piece the dreamlife of  angels  youd not be at all surprised to know that this is the  work that closed last years new york film festival featured  not in the usual space but in the majestic and large

In [63]:
print(bag_words.shape) # this is a sparse matrix
print('=========')
print(bag_words[0])

(27886, 218758)
  (0, 99216)	1
  (0, 114956)	1
  (0, 161397)	1
  (0, 45901)	1
  (0, 93572)	1
  (0, 18788)	1
  (0, 141476)	1
  (0, 55293)	1
  (0, 114556)	1
  (0, 75132)	1
  (0, 101272)	1
  (0, 205317)	1
  (0, 31927)	1
  (0, 134175)	1
  (0, 161555)	1
  (0, 43779)	1
  (0, 55864)	1
  (0, 25041)	1
  (0, 46354)	1
  (0, 184382)	1
  (0, 203919)	1
  (0, 45777)	1
  (0, 62180)	1
  (0, 45581)	1
  (0, 162790)	2
  :	:
  (0, 30120)	2
  (0, 63418)	2
  (0, 36590)	1
  (0, 28914)	1
  (0, 165227)	1
  (0, 215539)	1
  (0, 218483)	2
  (0, 65568)	4
  (0, 55913)	1
  (0, 41208)	1
  (0, 148520)	2
  (0, 181134)	1
  (0, 147883)	1
  (0, 15555)	4
  (0, 59676)	4
  (0, 106455)	3
  (0, 87844)	3
  (0, 163379)	2
  (0, 157749)	1
  (0, 3078)	2
  (0, 110271)	2
  (0, 15581)	2
  (0, 53906)	2
  (0, 167166)	1
  (0, 207682)	2
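
Each line of the output above reads `(row, column)  count`: the document index, the column index of a vocabulary term, and how often that term occurs in the document. As a small sketch on a toy corpus (the sentences and variable names are made up), the column indices can be mapped back to words through `vocabulary_`:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["a weak and sentimental movie", "a poignant and well acted movie movie"]
toy_vect = CountVectorizer()
toy_bag = toy_vect.fit_transform(toy_docs)

# Invert the term -> column mapping so column indices can be read as words
index_to_term = {col: term for term, col in toy_vect.vocabulary_.items()}

row = toy_bag[1]  # sparse 1 x n_terms row for the second document
counts = {index_to_term[int(col)]: int(n) for col, n in zip(row.indices, row.data)}
print(counts)
```

The single-letter word `a` does not appear in the result because the default token pattern only keeps tokens of two or more characters.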


In [64]:
print(len(count_vect.vocabulary_))
#print(count_vect.vocabulary_)
count_vect.inverse_transform(bag_words[0])

218758


[array(['index', 'links', 'related', 'conversion', 'html', 'ascii',
        'original', 'differ', 'likely', 'formatting', 'inthe', 'urls',
        'broken', 'newsgroups', 'relevant', 'commentscriticisms', 'direct',
        'belongs', 'copyright', 'stated', 'unless', 'control', 'editorial',
        'contents', 'responsibility', 'accepts', 'database', 'internet',
        'german', 'derecfilmkritiken', 'newsgroup', 'recartsmoviesreviews',
        'posted', '1999', 'minutes', '113', 'time', 'running', 'rated',
        'sincerity', 'earthiness', 'accomplished', 'produces', 'closeness',
        'quandary', 'intimacy', 'feel', 'live', 'compelled', 'notes',
        'production', 'according', 'thesps', 'gifted', 'culls',
        'performances', 'natural', 'beauty', 'example', 'louisefor',
        'violentthelma', 'melodramatic', 'overly', 'tend', 'relationship',
        'exploring', 'films', 'hollywood', 'desperate', 'apart', 'fall',
        'begin', 'relationships', 'tenuous', 'moody', 'seldom

In [65]:
# now let's build a pandas DataFrame out of this (toarray() converts the sparse matrix to a dense one)
import pandas as pd

pd.options.display.max_columns = 999
df = pd.DataFrame(data=bag_words.toarray(),columns=count_vect.get_feature_names())
df

MemoryError: 

In [59]:
# print out 10 most common words in our data
df.sum().sort_values()[-10:]

good                    10
arizona                 12
raising                 12
scene                   13
steve                   13
like                    14
responsibility          16
recartsmoviesreviews    16
copyright               17
1987                    20
dtype: int64

#### 2.4 Convert the data into a sparse encoded tf-idf representation.

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer
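tfidf_vect = TfidfVectorizer(stop_words= stop_words, decode_error='ignore', 
                             max_df=0.01,  # ignore terms found in more than 1% of the documents
                             min_df=4)     # ignore terms found in fewer than 4 documents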

tfidf_mat = tfidf_vect.fit_transform(documents)
tfidf_mat

<27886x218758 sparse matrix of type '<class 'numpy.float64'>'
	with 8410424 stored elements in Compressed Sparse Row format>
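
For reference, scikit-learn documents the default `TfidfVectorizer` weighting as the raw term count multiplied by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document row. The sketch below reproduces that arithmetic in plain Python on a toy, already tokenized corpus; the function name `tfidf_rows` and the data are illustrative.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Smoothed tf-idf with L2-normalized rows on a list of token lists."""
    n = len(tokenized_docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    rows = []
    for doc in tokenized_docs:
        weights = {term: count * idf[term] for term, count in Counter(doc).items()}
        # fall back to 1.0 so an empty document does not divide by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


toy = [["good", "scene"], ["good", "plot"], ["weak", "plot", "plot"]]
rows = tfidf_rows(toy)
print(rows[0])
```

In the first toy document, `scene` receives a higher weight than `good` because it occurs in fewer documents.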

In [47]:
# convert to pandas to get better idea about the data
df = pd.DataFrame(data=tfidf_mat.toarray(),columns=tfidf_vect.get_feature_names())
df

MemoryError: 

In [74]:
# print out 10 words with max tfidf, normalized by document occurrence
df.max().sort_values()[-10:]

blind        0.320479
date         0.322304
palace       0.349054
greasers     0.349054
shermans     0.351081
hollywood    0.367310
cambodia     0.376265
march        0.421297
shuffle      0.438277
gibson       0.438717
dtype: float64

### 3. Data Visualization: Visualize statistical summaries of the text data

#### 3.1 Word frequencies and most relevant words: termite chart

#### 3.2 Most relevant words: cloud chart

In [14]:
from yellowbrick.text import TSNEVisualizer
from sklearn.feature_extraction.text import TfidfVectorizer

# TSNEVisualizer expects a numeric document-term matrix, not raw text,
# so vectorize the documents first
tfidf = TfidfVectorizer(decode_error='ignore')
docs = tfidf.fit_transform(documents)

# no class labels are available here, so all points are drawn as one group
tsne = TSNEVisualizer()
tsne.fit(docs)
tsne.poof()  # renamed show() in yellowbrick 1.0 and later