CS194-16 Introduction to Data Science

**NOTE** click near here to select this cell, esc-Enter will get you into cell edit mode, shift-Enter gets you back

**Name**: *Please put your name*

**Student ID**: *Please put your student ID*

This homework explores the use of synthetic (XML) and natural language parsing for data preparation. It comprises 3 parts:
1. Parse some data (Amazon product reviews) in XML to extract the text of the reviews. 
2. Parse the text reviews into Stanford dependencies using the Stanford Parser, with XML output.
3. Read the parsed sentences back into Python with the XML parser. 

We assume you have copied this HW notebook, the Stanford Parser archive, and the reviews archive into the same directory. Unpack the latter two:
<pre>
tar xvzf reviews.tar.gz
tar xvzf stanfordparser.tar.gz
</pre>

and then move the parser into /opt:

<pre>
sudo mv StanfordParser /opt
</pre>

Finally, if you haven't already done so, create a personal bin directory:
<pre>
mkdir ~/bin
</pre>

Scripts or links in that directory will then be in your path. This will be useful for using the Stanford parser (and other tools) later. The path is set in your login script, so to make it find the new bin directory you have to log out and log back in again. In the top right-hand corner of the VM window you will find a gear-shaped icon. Clicking it yields a drop-down menu with a logout option. Log out, and then log back in when you see the login screen.

We will be using Python's ElementTree API which you can read about here:

https://docs.python.org/2/library/xml.etree.elementtree.html

Start by loading some XML data.

In [4]:
from lxml import etree
parser = etree.XMLParser(recover=True)
tree = etree.parse('reviews/video/reviews.xml',parser)

By the way, the data is actually far-from-perfect XML. To see some of the defects, remove the argument "parser" from the last line, so that it tries instead to parse with a (default) strict parser. You will see it crash at an invalid character string somewhere in the file. You can fix this and find the next problem, but it's better to use an auto-recovering parser like the one above.
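
To get a feel for what a strict parser reports, here is a minimal sketch that uses the standard-library `xml.etree.ElementTree` on a small made-up string, so it runs without lxml or the data file. The string `bad_xml` and the helper `describe_error` are illustrative assumptions, not part of the reviews data; a bare `&` is just one kind of defect that a strict parser rejects.

```python
import xml.etree.ElementTree as ET

# Made-up snippet (an assumption): the bare "&" is not well-formed XML.
bad_xml = "<reviews><review><title>Tom & Jerry</title></review></reviews>"

def describe_error(xml_text):
    """Return None if xml_text parses strictly, else the (line, column) of the first error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return err.position  # (line, column) tuple reported by the parser
    return None

print(describe_error(bad_xml))
print(describe_error("<reviews><review/></reviews>"))
```

With lxml the strict parser should raise `etree.XMLSyntaxError` instead, and the message likewise names the line and column where parsing stopped.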


> TODO: What kinds of errors did you see in the file? (hit esc-Enter to edit this, then ctl-Enter to save)

Now let's look at the contents of this tree.

In [5]:
root=tree.getroot()
root

<Element reviews at 0x7f8f2d4576c8>

Nodes have a tag (a name), and possibly attributes (empty in this case).

In [6]:
root.tag

'reviews'

In [7]:
root.attrib

{}

The children of the root node are accessible using square bracket notation. They will be individual reviews. You can then examine each review node's children by adding additional square bracket fields. Do this now and explore the parse tree. Compare with the file contents (use a text editor to see it). 

In [8]:
root[2]

<Element review at 0x7f8f2d457f38>

You can also use the colon notation to retrieve all the children of a node:

In [9]:
root[22][:]

[<Element unique_id at 0x7f8f2c16e098>,
 <Element unique_id at 0x7f8f2c16e0e0>,
 <Element asin at 0x7f8f2c16e170>,
 <Element product_name at 0x7f8f2c16e1b8>,
 <Element product_type at 0x7f8f2c16e200>,
 <Element product_type at 0x7f8f2c16e248>,
 <Element helpful at 0x7f8f2c16e290>,
 <Element rating at 0x7f8f2c16e2d8>,
 <Element title at 0x7f8f2c16e320>,
 <Element date at 0x7f8f2c16e368>,
 <Element reviewer at 0x7f8f2c16e3b0>,
 <Element reviewer_location at 0x7f8f2c16e3f8>,
 <Element review_text at 0x7f8f2c16e440>]

Notice that same-named elements can occur multiple times, e.g. unique_id and product_type.
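
If you want to experiment with indexing and slicing without loading the whole file, the sketch below does the same operations on a tiny hand-written document. The contents of `sample` are made up for illustration, and the standard-library `xml.etree.ElementTree` is used here in place of lxml; both expose the same bracket and slice notation.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up miniature of the reviews file (an assumption for illustration).
sample = """<reviews>
<review>
<unique_id>
A1
</unique_id>
<unique_id>
42
</unique_id>
<rating>
5.0
</rating>
</review>
</reviews>"""

root = ET.fromstring(sample)
review = root[0]                           # first child of the root
tags = [child.tag for child in review[:]]  # the slice gives a list of child elements
tag_counts = Counter(tags)                 # how many times each tag occurs

print(root.tag, root.attrib, len(root))
print(tags)
print(tag_counts["unique_id"])
```

Counting the tags this way is a quick check for which child names repeat inside a single review.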

The "contents" of a node are usually held in its text field, which you access like this:

In [10]:
root[2][0]

<Element unique_id at 0x7f8f2c16e5f0>

In [11]:
root[2][0].text

'\n1569494088:ambrose_bierce_is_a_better_authority:charlotte_tellson_"the_keeper_of_bierce"\n'

Now we can look at the contents of the other "unique_id" node:

In [12]:
root[2][1].text

'\n10560\n'

The find() and findall() methods allow you to find one or (respectively) all the children of a node with a particular tag.

In [13]:
root[10].find('product_name').text

'\nThe Firm: Video: Tom Cruise,Jeanne Tripplehorn,Gene Hackman,Hal Holbrook,Terry Kinney,Wilford Brimley,Ed Harris,Holly Hunter,David Strathairn,Gary Busey,Steven Hill,Tobin Bell,Barbara Garrick,Jerry Hardin,Paul Calderon,Jerry Weintraub,Sullivan Walker,Karina Lombard,Margo Martindale,John Beal,Sydney Pollack\n'

In [14]:
root.findall('review')[:10]

[<Element review at 0x7f8f2d457200>,
 <Element review at 0x7f8f2d4571b8>,
 <Element review at 0x7f8f2d457f38>,
 <Element review at 0x7f8f2d4574d0>,
 <Element review at 0x7f8f2c16e488>,
 <Element review at 0x7f8f2c16e758>,
 <Element review at 0x7f8f2c16e710>,
 <Element review at 0x7f8f2c16e128>,
 <Element review at 0x7f8f2c16e7a0>,
 <Element review at 0x7f8f2c16e7e8>]

In [15]:
root[3].find('review_text').text

"\nthis is another story that I don't know if I like it or not because Amazon.com never sent it too me.  this has happened about 4/5 times.  While usually I have no problem getting movies, sometimes I do.  sending them e-mails doesn't seem to work.  If they didn't have the best prices for movies, I'd leave\n"
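
As the outputs above show, the text fields carry leading and trailing newlines, and `find()` returns `None` when a tag is missing. A small helper can take care of both. This is only a sketch: the name `child_text` and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample (an assumption): the second review has no review_text child.
sample = """<reviews>
<review><product_name>
The Firm
</product_name><review_text>
Great movie.
</review_text></review>
<review><product_name>
Other
</product_name></review>
</reviews>"""

root = ET.fromstring(sample)

def child_text(node, tag):
    """Stripped text of the first child with this tag, or None if it is absent or empty."""
    child = node.find(tag)
    if child is None or child.text is None:
        return None
    return child.text.strip()

names = [child_text(r, "product_name") for r in root.findall("review")]
texts = [child_text(r, "review_text") for r in root.findall("review")]
print(names)
print(texts)
```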

Use the ElementTree methods to construct a DataFrame containing 11 columns corresponding to the 11 distinct child node types of each review node. Each row should represent a single review. For nodes that may be repeated, like "unique_id", include a list of the node values in that field.

> TODO: What fraction of the XML review records have two "unique_id" nodes? What fraction have two "product_type" nodes?
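
One possible starting point, sketched under assumptions: turn each review node into a dictionary, keeping lists for the tags that may repeat, and then count how many records have exactly two values. The helper names `review_to_record` and `fraction_with_two` and the sample data are hypothetical, and the standard-library parser stands in for lxml.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample (an assumption): one review with two unique_id nodes, one with a single node.
sample = """<reviews>
<review>
<unique_id>
A1
</unique_id>
<unique_id>
42
</unique_id>
<rating>
5.0
</rating>
</review>
<review>
<unique_id>
B2
</unique_id>
<rating>
4.0
</rating>
</review>
</reviews>"""

def review_to_record(review, repeated=("unique_id", "product_type")):
    """Map each child tag to its stripped text; tags named in `repeated` become lists."""
    values = defaultdict(list)
    for child in review:
        values[child.tag].append((child.text or "").strip())
    return {tag: (items if tag in repeated else items[0])
            for tag, items in values.items()}

def fraction_with_two(records, tag):
    """Fraction of records whose list-valued field `tag` has exactly two entries."""
    hits = sum(1 for r in records if len(r.get(tag, [])) == 2)
    return float(hits) / len(records)

root = ET.fromstring(sample)
records = [review_to_record(r) for r in root.findall("review")]
print(fraction_with_two(records, "unique_id"))
```

If pandas is available, passing `records` to `pandas.DataFrame` should give one row per review, with a column for each tag that appears.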

Finally, save the DataFrame as a CSV file (you can use a Pandas builtin to do this).

For the review text, you should create one file with a unique name per review containing only the review text. The names should be review_text#####.txt where ##### is the number of the review.
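
A sketch of the file-naming step, assuming the review texts are already in a Python list and that numbering starts at 0 (so the first file is review_text00000.txt). The function name `write_review_texts` and the temporary output directory are illustrative choices only.

```python
import os
import tempfile

def write_review_texts(texts, out_dir):
    """Write each text to out_dir/review_textNNNNN.txt and return the file names."""
    names = []
    for i, text in enumerate(texts):
        name = "review_text%05d.txt" % i   # zero-padded to five digits
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
            f.write(text.strip() + "\n")
        names.append(name)
    return names

out_dir = tempfile.mkdtemp()   # throwaway directory for the demonstration
names = write_review_texts(["First review.", "\nSecond review.\n"], out_dir)
print(names)
```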

In the preamble for this HW, you put the Stanford Parser in the /opt directory, and you also created a ~/bin directory. You can use these to put Stanford Parser commands in your path without having to add several new directories to your $PATH variable. There are three commands we will need initially. 

Open a terminal window and create symlinks like this:

<pre>ln -s /opt/StanfordParser/lexparser.sh ~/bin/lexparser.sh

ln -s /opt/StanfordParser/lexparser-gui.sh ~/bin/lexparser-gui.sh

ln -s /opt/StanfordParser/dependencyviewer/dependencyviewer.sh ~/bin/dependencyviewer.sh</pre>

and then type:

<pre>
lexparser-gui.sh
</pre>

This brings up a GUI interface to the Stanford parser. To use it, click on "Load Parser" which brings up a file selection dialog. Navigate to 

<pre>/opt/StanfordParser/stanford-parser-3.4.1-models.jar</pre>

and open it.

Then you will see a list of parsers to use. Select 
<pre>englishPCFG.ser.gz</pre>

You're now ready to parse some text!

Click on "Load File" and navigate back to your HW2 directory (you'll have to go all the way up to "/", and down through "/home"). Load your review text file

<pre>review_text00000.txt</pre>

which will display the text with the first sentence highlighted. Now click on "Parse" which will bring up a graphical display of the parsed sentence. 

> TODO: Did the sentence parse correctly?

Parse the other sentences from this file. Notice that the yellow highlight is for standard sentences (broken at periods) but that some of these sentences are broken into sentence subparts. 

This parse tree shows a standard (constituency) tree. Usually we will want to work with dependency trees. To view a dependency tree for the sentences in this file, do 

<pre>
dependencyviewer.sh -in review_text00000.txt
</pre>

(note the extra "-in" option for this parser). This brings up a window with tabs for each of the sentences. Click through each sentence and contrast the dependency parse tree with the constituency tree in the other window.

Note: Both parsers consume quite a bit of memory so you may need to close the constituency tree viewer before starting the dependency viewer. 

> TODO: What are the root nodes for each sentence-like fragment in sentence 5?

The parser also contains scripts for parsing text into structured output. Now run

<pre>
lexparser.sh review_text00000.txt
</pre>

You will see both constituency and dependency tree output for each sentence. These formats are ad-hoc though, and not easy for a machine to work with. You can customize the parser startup script. In the main parser directory you will find a script:

<pre>
/opt/StanfordParser/lexparser.sh
</pre>

Make your own copy of this script in the same directory and call it, say:

<pre>
/opt/StanfordParser/dependencyparser.sh
</pre>

This file may not be executable, depending on how you copied it. To make sure it is, do:

<pre>
chmod 755 dependencyparser.sh
</pre>
 
in the Stanford Parser directory. Now open the script in an editor. It contains an invocation of the parser with the option 

<pre>-outputFormat "penn,typedDependencies"</pre>

We won't need the penn format output, so you can remove "penn" from the options. We need XML output instead of the standard output, however. To do that, add this option:

<pre>
-outputFormatOptions "xml"
</pre>

after the -outputFormat option (yes the names are confusing). Save the file. 

Now from a terminal prompt, create a new symlink from your ~/bin directory to the dependencyparser.sh script. You should now be able to change to the directory containing your sentences and type:

<pre>
dependencyparser.sh review_text00000.txt
</pre>

You will see some diagnostic messages, and the XML data. The parser actually sends the XML only to stdout and the diagnostics to stderr. To get just the XML in a file you can do:

<pre>
dependencyparser.sh review_text00000.txt > review_parsed00000.xml
</pre>

Now write a bash script (or do it in Python if you know how to invoke shell commands) to iterate over the input files and produce parsed copies, i.e. by replacing "00000" in the filenames above with a series of integer indices. HINT: the bash command for integer iteration is

<pre>
for i in `seq 0 xxx`
do
...
done
</pre> 

and to get a fixed-length integer string in a file name do:

<pre>
fname=`printf "review_text%05d.txt" $i`
</pre>
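
If you would rather drive the parser from Python, the loop can be sketched with the `subprocess` module. This assumes `dependencyparser.sh` is on your path as set up above and that you run it from the directory holding the text files; the function names are illustrative. Nothing is executed until you call `parse_reviews`.

```python
import subprocess

def text_name(i):
    """Input file name for review number i."""
    return "review_text%05d.txt" % i

def parsed_name(i):
    """Output file name for review number i."""
    return "review_parsed%05d.xml" % i

def parse_reviews(count, command=("dependencyparser.sh",)):
    """Run the parser command on the first `count` review files.

    stdout (the XML) goes to the output file; stderr (the diagnostics) is discarded.
    """
    for i in range(count):
        with open(parsed_name(i), "wb") as out:
            subprocess.check_call(list(command) + [text_name(i)],
                                  stdout=out, stderr=subprocess.DEVNULL)

print(text_name(7), parsed_name(7))
```

Calling `parse_reviews(100)` would then process the first 100 reviews. `check_call` raises an exception if the parser exits with a non-zero status, so a failure stops the loop rather than passing silently.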

NOTE: Parsing is very time-consuming. You don't have to parse all the reviews, but do at least the first 100.

> TODO: Give the total of file sizes (e.g. using "du" on the directory containing them) for the unparsed text files and the total for the XML parsed files. 
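
As a cross-check on "du", the byte totals can also be summed in Python. This sketch assumes the file name patterns used above and that you run it in the directory containing the files. Note that "du" by default reports disk usage in blocks, so its numbers may not match the byte sums exactly.

```python
import glob
import os

def total_bytes(pattern):
    """Sum the sizes in bytes of all files matching a glob pattern."""
    return sum(os.path.getsize(path) for path in glob.glob(pattern))

text_total = total_bytes("review_text*.txt")
xml_total = total_bytes("review_parsed*.xml")
print(text_total, xml_total)
```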

Use the ElementTree API to read an XML dependency parse tree from the files that you just created.
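
The sketch below shows one way to read such a file. It ASSUMES that each sentence is written as a `dependencies` element containing `dep` elements, each with a `type` attribute and `governor` and `dependent` children carrying an `idx` attribute. Check this against your own output files before relying on it. The sample text is hand-written, and `read_dependencies` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the ASSUMED layout; compare it with your own files.
sample = """<dependencies style="typed">
  <dep type="nsubj">
    <governor idx="2">works</governor>
    <dependent idx="1">It</dependent>
  </dep>
  <dep type="root">
    <governor idx="0">ROOT</governor>
    <dependent idx="2">works</dependent>
  </dep>
</dependencies>
<dependencies style="typed">
  <dep type="root">
    <governor idx="0">ROOT</governor>
    <dependent idx="1">Great</dependent>
  </dep>
</dependencies>"""

def read_dependencies(xml_text):
    """Return one list of (type, gov_idx, gov_word, dep_idx, dep_word) per sentence."""
    # Wrap in a synthetic root in case the file holds several top-level elements.
    doc = ET.fromstring("<parses>" + xml_text + "</parses>")
    sentences = []
    for deps in doc.iter("dependencies"):
        triples = []
        for dep in deps.findall("dep"):
            gov, child = dep.find("governor"), dep.find("dependent")
            triples.append((dep.get("type"), int(gov.get("idx")), gov.text,
                            int(child.get("idx")), child.text))
        sentences.append(triples)
    return sentences

sentences = read_dependencies(sample)
print(len(sentences), sentences[0][0])
```

For a real file you would pass in the file's contents as read with `open(...).read()`. The wrapping step is there because a file with one top-level element per sentence is not a single well-formed XML document.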

Write a function to recognize targets and associated sentiment. e.g. a simple pattern is to start at the root node of a dependency tree, look for an nsubj child (a target) and then look for sentiment words - adjectives that attach directly to the subject. Or, look for direct and indirect object words and any adjectives attached to them. 
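
Here is a rough sketch of the first pattern, operating on a sentence represented as a list of (relation, governor index, governor word, dependent index, dependent word) tuples. That representation, the function name `subject_adjectives`, and the hand-written dependencies for "The dull film bored me" are all assumptions for illustration, not parser output. Adjectives attached to a noun are taken to carry the `amod` relation.

```python
def subject_adjectives(triples):
    """Find (target, sentiment) pairs: the root's nsubj and the adjectives (amod) attached to it."""
    # Word indices of the sentence root(s).
    root_ids = [d_idx for (rel, _, _, d_idx, _) in triples if rel == "root"]
    results = []
    for (rel, g_idx, _, s_idx, s_word) in triples:
        if rel == "nsubj" and g_idx in root_ids:
            # Adjectives whose governor is the subject itself.
            adjectives = [w for (r, g, _, _, w) in triples
                          if r == "amod" and g == s_idx]
            if adjectives:
                results.append((s_word, " ".join(adjectives)))
    return results

# Hand-written dependencies (an assumption) for "The dull film bored me".
triples = [
    ("det", 3, "film", 1, "The"),
    ("amod", 3, "film", 2, "dull"),
    ("nsubj", 4, "bored", 3, "film"),
    ("root", 0, "ROOT", 4, "bored"),
    ("dobj", 4, "bored", 5, "me"),
]
pairs = subject_adjectives(triples)
print(pairs)
```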

More complicated patterns occur when the sentiment words are in a different subtree, rooted by a suitable verb, e.g.

"Lawrence put in a commanding performance..."

Here we would like to extract the subject, along with the object "performance" and its modifier "commanding". You will need to be careful about the verb however, to make sure that the two subtrees are related. 
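
A sketch of this second pattern follows, again on (relation, governor index, governor word, dependent index, dependent word) tuples. The dependencies for the example sentence are written by hand and may differ from what the parser actually produces, and `subject_object_phrases` is a hypothetical name. Requiring that the same verb governs both the subject and the object is one simple way to make sure the two subtrees are related.

```python
def subject_object_phrases(triples):
    """Pair each verb's nsubj with its dobj plus the dobj's amod modifiers."""
    results = []
    for (rel, verb_idx, _, _, subj_word) in triples:
        if rel != "nsubj":
            continue
        for (rel2, gov_idx, _, obj_idx, obj_word) in triples:
            if rel2 == "dobj" and gov_idx == verb_idx:  # same verb governs both
                mods = [w for (r, g, _, _, w) in triples
                        if r == "amod" and g == obj_idx]
                results.append((subj_word, " ".join(mods + [obj_word])))
    return results

# Hand-written dependencies (an assumption) for
# "Lawrence put in a commanding performance".
triples = [
    ("nsubj", 2, "put", 1, "Lawrence"),
    ("root", 0, "ROOT", 2, "put"),
    ("prt", 2, "put", 3, "in"),
    ("det", 6, "performance", 4, "a"),
    ("amod", 6, "performance", 5, "commanding"),
    ("dobj", 2, "put", 6, "performance"),
]
phrases = subject_object_phrases(triples)
print(phrases)
```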

For each pattern that is matched, the function should output the target as a string, and also the sentiment (phrase or list of words) as a string. That is, in general the output will be a list of (string, string) tuples. Run this function over all the trees from part 2 above.

Write one more function that finds a pattern of (target, sentiment words or phrases). This time, define your own pattern by looking through the dependency trees output from part 2. 

Apply these two functions to each parsed sentence, and concatenate their outputs. Finally concatenate the lists from all sentences. From the final list, construct a DataFrame with "target" and "sentiment" columns. In the space below, cut and paste the first 100 rows of this table (or fewer if you don't have 100 rows from all the sentences from part 2).

Save this notebook and submit it using glookup. 

> TODO: Put your analysis code here.

> TODO: Put <=100 rows of your target/sentiment table below: