# Project 3

### Posted: November 2, 2017
### Due: November 17, 2017

## This project focuses on utilizing the NLP skills we learned to do two basic text analysis tasks. Although the text below references NLTK, you are free to use any other NLP package instead (e.g., spaCy).

## Part 1: Review Classification
### This part uses the Amazon Fine Foods reviews dataset, which contains reviews for a collection of products from Amazon. You can find the full dataset at: https://snap.stanford.edu/data/web-FineFoods.html. The project directory contains a small portion of it (not a random sample) with 2000 reviews.

### Your task is to read in this file and construct a simple classifier to predict the review/score from the review/userId, review/profileName, review/time, review/summary, and review/text (but not the productId or helpfulness). You should figure out different types of features to use for this task, and should use a Naive Bayes classifier for the classification (you can use other methods if you'd like). You should use an off-the-shelf implementation.

## Here I use the code from the following Stack Overflow question to read the data and write it out in CSV format.
### https://stackoverflow.com/questions/23331480/converting-amazon-data-into-csv-format-in-python

In [70]:
import pandas as pd
import string
import re 

INPUT = "finefoods_training.txt"
OUTPUT = "Output.csv"

header = [
    "product/productId",
    "review/userId",
    "review/profileName",
    "review/helpfulness",
    "review/score",
    "review/time",
    "review/summary",
    "review/text"]

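with open(INPUT) as f, open(OUTPUT, "w") as outfile:
    # Write the header row
    outfile.write(",".join(header) + "\n")

    currentLine = []
    for line in f:
        line = line.strip()

        # A blank line marks the end of one review record
        if line == "":
            outfile.write(",".join(currentLine))
            outfile.write("\n")
            currentLine = []
            continue

        # Each field looks like "key: value"; keep only the value and
        # replace commas so they do not break the CSV columns
        parts = line.split(":", 1)
        parts[1] = parts[1].replace(',', ' ')
        currentLine.append(parts[1])

    # Write the last record if the file does not end with a blank line
    if currentLine != []:
        outfile.write(",".join(currentLine))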
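
Replacing every comma with a space keeps the columns aligned but also alters the review text. As an alternative sketch (not the code used in this notebook), the standard `csv` module can quote fields that contain commas instead. The helper name `parse_records` and the sample lines are my own; the sample is abbreviated from the records shown in this notebook, and the comma in the second summary was added by hand for illustration. Unlike the cell above, this version also strips the leading space from each value and skips empty records.

```python
import csv
import io


def parse_records(lines):
    """Yield one list of field values per blank-line-separated record."""
    current = []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                yield current
            current = []
            continue
        # Split on the first colon only; values may contain colons themselves
        _key, _sep, value = line.partition(":")
        current.append(value.strip())
    if current:
        yield current


# Abbreviated, partly made-up sample in the "key: value" layout
sample = [
    "product/productId: B001E4KFG0\n",
    "review/score: 5.0\n",
    "review/summary: Good Quality Dog Food\n",
    "\n",
    "product/productId: B00813GRG4\n",
    "review/score: 1.0\n",
    "review/summary: Not as Advertised, sadly\n",
]

rows = list(parse_records(sample))

# csv.writer quotes any field that contains a comma
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Reading the result back with `csv.reader` (or `pd.read_csv`) should return the comma intact inside the summary field.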

In [71]:
df = pd.read_csv('Output.csv', encoding='latin-1')
df.head()


Unnamed: 0,product/productId,review/userId,review/profileName,review/helpfulness,review/score,review/time,review/summary,review/text
0,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1/1,5.0,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned ...
1,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0/0,1.0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanu...
2,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1/1,4.0,1219017600,"""Delight"" says it all",This is a confection that has been around a f...
3,B000UA0QIQ,A395BORC6FGVXV,Karl,3/3,2.0,1307923200,Cough Medicine,If you are looking for the secret ingredient ...
4,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0/0,5.0,1350777600,Great taffy,Great taffy at a great price. There was a wi...


## Here I clean the review summary and review text columns by lowercasing all letters and replacing non-alphabetic characters with spaces. The data is then divided into an 80% training set and a 20% test set.

In [72]:
from sklearn.model_selection import train_test_split

import re
import string
import nltk

cleanup_re = re.compile('[^a-z]+')
def cleanup(sentence):
    sentence = sentence.lower()
    sentence = cleanup_re.sub(' ', sentence).strip()
    return sentence

df["Summary_Clean"] = df["review/summary"].apply(cleanup)
df["Text_Clean"] = df["review/text"].apply(cleanup)

train, test = train_test_split(df, test_size=0.2)
print("%d items in training data, %d in test data" % (len(train), len(test)))
df.head()

1600 items in training data, 400 in test data


Unnamed: 0,product/productId,review/userId,review/profileName,review/helpfulness,review/score,review/time,review/summary,review/text,Summary_Clean,Text_Clean
0,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1/1,5.0,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned ...,good quality dog food,i have bought several of the vitality canned d...
1,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0/0,1.0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanu...,not as advertised,product arrived labeled as jumbo salted peanut...
2,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1/1,4.0,1219017600,"""Delight"" says it all",This is a confection that has been around a f...,delight says it all,this is a confection that has been around a fe...
3,B000UA0QIQ,A395BORC6FGVXV,Karl,3/3,2.0,1307923200,Cough Medicine,If you are looking for the secret ingredient ...,cough medicine,if you are looking for the secret ingredient i...
4,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0/0,5.0,1350777600,Great taffy,Great taffy at a great price. There was a wi...,great taffy,great taffy at a great price there was a wide ...
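
To make the effect of the cleaning step concrete, here is a small standalone sketch that repeats the `cleanup` function from the cell above and applies it to two summaries from the table plus one made-up string. The made-up string is only there to show a side effect: digits and apostrophes are removed too, so contractions are split.

```python
import re

cleanup_re = re.compile('[^a-z]+')


def cleanup(sentence):
    # Lowercase first, then collapse every run of non-letters into one space
    sentence = sentence.lower()
    return cleanup_re.sub(' ', sentence).strip()


examples = [
    '"Delight" says it all',   # summary from the table above
    'Not as Advertised',       # summary from the table above
    "don't buy 2 packs!",      # made-up string with digits and an apostrophe
]
cleaned = [cleanup(s) for s in examples]
for before, after in zip(examples, cleaned):
    print(repr(before), '->', repr(after))
```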


## Here I use CountVectorizer and TfidfTransformer to turn the cleaned summary and cleaned text columns into TF-IDF-weighted counts of word n-grams (1 to 4 words long).

In [74]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_vect = CountVectorizer(min_df = 1, ngram_range = (1, 4))
X_train_summary_counts = count_vect.fit_transform(train["Summary_Clean"])
X_test_summary_counts = count_vect.transform(test["Summary_Clean"])

X_train_text_counts = count_vect.fit_transform(train["Text_Clean"])
X_test_text_counts = count_vect.transform(test["Text_Clean"])


tfidf_transformer = TfidfTransformer()
X_train_summary_tfidf = tfidf_transformer.fit_transform(X_train_summary_counts)
X_test_summary_tfidf = tfidf_transformer.transform(X_test_summary_counts)

X_train_text_tfidf = tfidf_transformer.fit_transform(X_train_text_counts)
X_test_text_tfidf = tfidf_transformer.transform(X_test_text_counts)



y_train = train["review/score"]
y_test = test["review/score"]
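
As a rough illustration of what `ngram_range = (1, 4)` means, the sketch below lists the word n-grams of one cleaned summary in plain Python. The function name `word_ngrams` is my own, and it is a simplification: it splits on whitespace, whereas CountVectorizer's default tokenizer also drops one-character tokens.

```python
def word_ngrams(text, n_min=1, n_max=4):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens over the text
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


grams = word_ngrams("good quality dog food")
print(len(grams))
print(grams)
```

A four-word summary already yields ten features, so the vocabulary grows quickly with longer reviews, and many of the longer n-grams are likely to appear in only one document.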



## Here the most useful features for predicting y are the transformed cleaned summary and the transformed cleaned text. I train 3 models (Multinomial Naive Bayes, Bernoulli Naive Bayes, and Logistic Regression) on the transformed cleaned text first. Then I use the transformed cleaned summary and see how it performs.

In [75]:
prediction = dict()
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB().fit(X_train_text_tfidf , y_train)
prediction['Multinomial'] = model.predict(X_test_text_tfidf)

In [76]:
from sklearn.naive_bayes import BernoulliNB
model = BernoulliNB().fit(X_train_text_tfidf, y_train)
prediction['Bernoulli'] = model.predict(X_test_text_tfidf)

In [77]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e5)
logreg_result = logreg.fit(X_train_text_tfidf, y_train)
prediction['Logistic'] = logreg.predict(X_test_text_tfidf)
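
The vectorizing cell reuses one `count_vect` and one `tfidf_transformer` for both columns, which works only because each test matrix is computed before the object is refit. A possible alternative, sketched here under my own assumptions, is to give every model its own scikit-learn pipeline. The toy sentences and scores are hypothetical and exist only to make the example runnable; `TfidfVectorizer` is the scikit-learn class that combines the counting and TF-IDF steps.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, not taken from the real training split
train_text = [
    "great taffy at a great price",
    "good quality dog food",
    "not as advertised",
    "tastes like cough medicine",
]
train_score = [5.0, 5.0, 1.0, 2.0]

# The pipeline owns its vectorizer, so its vocabulary cannot be
# overwritten by fitting another column
text_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 4)),
    MultinomialNB(),
)
text_model.fit(train_text, train_score)

predicted = text_model.predict(["great price", "not good"])
print(predicted)
```

A second pipeline built the same way could be fit on the summary column without touching the first one.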

## Here I look at the confusion matrix and accuracy score. They show that the models are not performing very well. The confusion matrices for the first and second models (Naive Bayes) show that every review in the test set is predicted to have a score of 5, so their accuracy of 0.6375 is simply the share of 5-star reviews (255 of 400). Logistic regression performs slightly better (0.655), but it still mispredicts most of the reviews with lower scores.

In [79]:
from sklearn.metrics import confusion_matrix, accuracy_score
confusion_matrix(y_test,prediction['Multinomial'] )

array([[  0,   0,   0,   0,  29],
       [  0,   0,   0,   0,  23],
       [  0,   0,   0,   0,  39],
       [  0,   0,   0,   0,  54],
       [  0,   0,   0,   0, 255]])

In [80]:
confusion_matrix(y_test,prediction['Bernoulli'] )

array([[  0,   0,   0,   0,  29],
       [  0,   0,   0,   0,  23],
       [  0,   0,   0,   0,  39],
       [  0,   0,   0,   0,  54],
       [  0,   0,   0,   0, 255]])

In [81]:
confusion_matrix(y_test,prediction['Logistic'] )

array([[  8,   0,   1,   1,  19],
       [  2,   0,   2,   2,  17],
       [  4,   1,   1,   4,  29],
       [  2,   0,   1,   3,  48],
       [  1,   0,   0,   4, 250]])

In [82]:
accuracy_score(y_test,prediction['Multinomial'] )

0.63749999999999996

In [83]:
accuracy_score(y_test,prediction['Bernoulli'] )

0.63749999999999996

In [84]:
accuracy_score(y_test,prediction['Logistic'] )

0.65500000000000003
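
The accuracy figures can be checked directly against the confusion matrices. The sketch below copies the Naive Bayes and logistic regression matrices printed above and compares each accuracy with the baseline of always predicting the most frequent score; the helper names `accuracy` and `majority_baseline` are my own.

```python
# Rows: true score 1..5, columns: predicted score 1..5 (copied from above)
naive_bayes = [
    [0, 0, 0, 0, 29],
    [0, 0, 0, 0, 23],
    [0, 0, 0, 0, 39],
    [0, 0, 0, 0, 54],
    [0, 0, 0, 0, 255],
]
logistic = [
    [8, 0, 1, 1, 19],
    [2, 0, 2, 2, 17],
    [4, 1, 1, 4, 29],
    [2, 0, 1, 3, 48],
    [1, 0, 0, 4, 250],
]


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def majority_baseline(matrix):
    """Accuracy of always predicting the most frequent true class."""
    row_totals = [sum(row) for row in matrix]
    return max(row_totals) / sum(row_totals)


print("Naive Bayes accuracy:", accuracy(naive_bayes))
print("Logistic accuracy:   ", accuracy(logistic))
print("Majority baseline:   ", majority_baseline(logistic))
```

Both Naive Bayes models match the baseline exactly, and logistic regression beats it by only 7 reviews out of 400.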

## Here I use the transformed cleaned summary to predict y. The results are still poor: the logistic regression accuracy (0.6375) is slightly lower than the one obtained with the review text (0.655).

In [85]:
prediction2 = dict()
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB().fit(X_train_summary_tfidf , y_train)
prediction2['Multinomial'] = model.predict(X_test_summary_tfidf)

from sklearn.naive_bayes import BernoulliNB
model = BernoulliNB().fit(X_train_summary_tfidf, y_train)
prediction2['Bernoulli'] = model.predict(X_test_summary_tfidf)

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e5)
logreg_result = logreg.fit(X_train_summary_tfidf, y_train)
prediction2['Logistic'] = logreg.predict(X_test_summary_tfidf)

In [86]:
confusion_matrix(y_test,prediction2['Multinomial'] )

array([[  1,   0,   0,   0,  28],
       [  0,   0,   0,   0,  23],
       [  0,   0,   0,   0,  39],
       [  0,   0,   0,   0,  54],
       [  0,   0,   0,   0, 255]])

In [87]:
confusion_matrix(y_test,prediction2['Bernoulli'] )

array([[  0,   0,   0,   0,  29],
       [  0,   0,   0,   0,  23],
       [  0,   0,   0,   0,  39],
       [  0,   0,   0,   0,  54],
       [  0,   0,   0,   0, 255]])

In [88]:
confusion_matrix(y_test,prediction2['Logistic'] )

array([[ 16,   2,   2,   1,   8],
       [  7,   2,   2,   0,  12],
       [  4,   3,   6,   1,  25],
       [  1,   1,   3,   5,  44],
       [  7,   3,   4,  15, 226]])

In [89]:
accuracy_score(y_test,prediction2['Logistic'] )

0.63749999999999996
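
Accuracy alone can hide how the errors are spread across the five scores. As a complement, and only as a sketch, the code below computes per-class recall from the two logistic regression confusion matrices printed in this notebook (text features and summary features). The helper names are my own, and macro-averaged recall is just one possible summary statistic.

```python
# Rows: true score 1..5, columns: predicted score 1..5 (copied from the outputs)
logistic_text = [
    [8, 0, 1, 1, 19],
    [2, 0, 2, 2, 17],
    [4, 1, 1, 4, 29],
    [2, 0, 1, 3, 48],
    [1, 0, 0, 4, 250],
]
logistic_summary = [
    [16, 2, 2, 1, 8],
    [7, 2, 2, 0, 12],
    [4, 3, 6, 1, 25],
    [1, 1, 3, 5, 44],
    [7, 3, 4, 15, 226],
]


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def macro_recall(matrix):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(matrix)
    return sum(recalls) / len(recalls)


for name, matrix in [("text", logistic_text), ("summary", logistic_summary)]:
    recalls = ", ".join(f"{r:.2f}" for r in per_class_recall(matrix))
    print(f"{name}: recall per score = [{recalls}], macro = {macro_recall(matrix):.3f}")
```

On these two matrices the summary model recovers more of the 1- to 4-star reviews than the text model even though its overall accuracy is lower, so the ranking of the two feature sets depends on the metric.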

# Part 2: Named Entity Recognition

## Here, we will use the Named Entity Recognition functionality of NLTK to extract entities and relationships from some recent news articles. Specifically, pick your favorite viewer sport (football, soccer, tennis, baseball, cricket, etc.), and download 10 recent news articles describing some games/matches.
## Use NLTK to write code to extract named entities from each of them. The final output should simply be a list of entities and their types, which would require understanding the structure of the output of the ne_chunk command, and traversing it to find just the named entities.
## Next, write a few regular expressions to extract information about which positions or roles different players serve in their team. 

## Here I define a function that takes a text, tokenizes it with word_tokenize, applies POS tagging, and then finds the named entities. The output is a list of named entities and their types. A simple example shows the format of the output.

In [93]:
def get_named_entity(document):
    """Return [entity text, entity type] pairs found by nltk.ne_chunk."""
    output = []
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(document))):
        # Named entities are subtrees with a label; plain tokens are (word, tag) tuples
        if hasattr(chunk, 'label'):
            output.append([' '.join(c[0] for c in chunk), chunk.label()])
    return output

In [94]:
get_named_entity("WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement.")

[['WASHINGTON', 'GPE'],
 ['New York', 'GPE'],
 ['Loretta E. Lynch', 'PERSON'],
 ['Brooklyn', 'GPE']]
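
To show the structure that the function traverses, the sketch below builds a small `nltk.Tree` by hand in the shape that `ne_chunk` returns: a sentence tree whose children are either `(word, tag)` tuples or labelled subtrees. The tree is a hand-made stand-in based on the example sentence, not real `ne_chunk` output, so it runs without downloading any NLTK models.

```python
from nltk import Tree

# Hand-built stand-in for the output of nltk.ne_chunk
chunked = Tree("S", [
    Tree("GPE", [("New", "NNP"), ("York", "NNP")]),
    ("police", "NN"),
    ("officers", "NNS"),
    Tree("PERSON", [("Loretta", "NNP"), ("E.", "NNP"), ("Lynch", "NNP")]),
])

entities = []
for child in chunked:
    # Only named entities are Tree instances; ordinary tokens are tuples
    if isinstance(child, Tree):
        words = [word for word, tag in child.leaves()]
        entities.append([" ".join(words), child.label()])

print(entities)
```

Checking `isinstance(child, Tree)` is equivalent in effect to the `hasattr(chunk, 'label')` test used above, since tuples have no `label` method.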

## Here I define a list of URLs for 10 different news articles about soccer. I use requests and BeautifulSoup to read just the title and body of each article, and then build a list of all the named entities and their types for each document.

In [95]:
import requests
from bs4 import BeautifulSoup

list_of_urls = ["https://www.npr.org/2017/11/14/564006451/italy-stuns-soccer-fans-fails-to-qualify-for-world-cup",
                "https://www.npr.org/2017/11/14/564163477/italy-misses-world-cup-qualifier-for-first-time-since-1958",
                "https://www.npr.org/2017/07/26/539450573/a-comeback-could-be-on-the-horizon-for-u-s-men-s-soccer",
                "https://www.npr.org/2017/10/11/557198136/after-devastating-loss-for-usmnt-what-comes-next",
                "https://www.npr.org/2017/01/25/511608910/u-s-mens-soccer-goes-back-to-the-future-with-new-coach-new-priorities",
                "https://www.npr.org/2016/09/14/493831738/why-abby-wambach-doesnt-want-to-be-known-just-as-a-soccer-player",
                "https://www.npr.org/2016/08/21/490819550/brazil-men-s-soccer-redeem-loss-to-germany-for-olympic-gold-in-penalty-kick-shoo",
                "https://www.npr.org/2016/06/21/482981926/u-s-soccer-team-faces-argentina-in-copa-semi-final",
                "https://www.npr.org/2016/06/23/483197692/iceland-eliminates-austria-in-european-soccer-tournament",
                "https://www.npr.org/2016/06/17/482420749/u-s-mens-soccer-tops-ecuador-to-make-copa-america-semifinals"]
list_of_named_entity=[]
for url_soccer in list_of_urls:
    r = requests.get(url_soccer)
    text = r.text

    soup = BeautifulSoup(text, "html.parser")

    # Collect the page title and every paragraph
    bodytext = soup.find_all(["p", "title"])

    # Keep only tags that contain a single string (.string is None otherwise)
    list_of_data = [p.string for p in bodytext if p.string is not None]

    # Concatenate the pieces into one document
    document = "".join(list_of_data)

    list_of_named_entity.append(get_named_entity(document))

list_of_named_entity    

[[['Italy', 'GPE'],
  ['Stuns Soccer Fans', 'PERSON'],
  ['Fails', 'PERSON'],
  ['Italy', 'GPE'],
  ['Sweden', 'PERSON'],
  ['David Greene', 'PERSON'],
  ['Christopher Livesay', 'PERSON'],
  ['HOST', 'ORGANIZATION'],
  ['Italy', 'GPE'],
  ['Sweden', 'GPE'],
  ['Italian', 'GPE'],
  ['Gigi Buffon', 'PERSON'],
  ['SOUNDBITE OF', 'ORGANIZATION'],
  ['ARCHIVED', 'ORGANIZATION'],
  ['GIGI', 'ORGANIZATION'],
  ['Italian', 'GPE'],
  ['Italian', 'GPE'],
  ['Journalist Christopher Livesay', 'PERSON'],
  ['Rome', 'GPE'],
  ['Chris', 'GPE'],
  ['Italy', 'GPE'],
  ['Italy', 'GPE'],
  ['Italy', 'PERSON'],
  ['Yankees', 'ORGANIZATION'],
  ['GREENE', 'GPE'],
  ['Yankees', 'ORGANIZATION'],
  ['Yankees', 'ORGANIZATION'],
  ['Buffon', 'PERSON'],
  ['Italians', 'GPE'],
  ['Gian Piero Ventura', 'PERSON'],
  ['Italy', 'GPE'],
  ['Italian', 'GPE'],
  ['Italian', 'GPE'],
  ['Serie As', 'PERSON'],
  ['non-Italian', 'GPE'],
  ['Italy', 'GPE'],
  ['Italian', 'GPE'],
  ['Chris Livesay', 'PERSON'],
  ['Rome', 'GPE
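
The raw list repeats the same entity many times and sometimes assigns it different types. A possible post-processing step, sketched here with the first ten entries of the output above, is to count each (entity, type) pair and to collect the set of types seen for every entity. The variable names are my own.

```python
from collections import Counter, defaultdict

# First ten entries of the output above
entities = [
    ['Italy', 'GPE'], ['Stuns Soccer Fans', 'PERSON'], ['Fails', 'PERSON'],
    ['Italy', 'GPE'], ['Sweden', 'PERSON'], ['David Greene', 'PERSON'],
    ['Christopher Livesay', 'PERSON'], ['HOST', 'ORGANIZATION'],
    ['Italy', 'GPE'], ['Sweden', 'GPE'],
]

# How often each (entity, type) pair occurs
counts = Counter((name, label) for name, label in entities)

# Which types were assigned to each entity
labels_by_name = defaultdict(set)
for name, label in entities:
    labels_by_name[name].add(label)

# Entities tagged with more than one type are likely tagging errors
ambiguous = sorted(name for name, labels in labels_by_name.items() if len(labels) > 1)

print(counts.most_common(3))
print(ambiguous)
```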

In [112]:
my_sent2= "ali is a goalkeeper and davide is striker"

In [113]:
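from nltk.sem import extract_rels, rtuple
# Group the role names so the leading and trailing .* apply to every alternative
IN = re.compile(r'.*\b(?:goalkeeper|defender|midfielder|forward|striker)\b.*')
# extract_rels works on a chunked sentence (a tree from ne_chunk), not on a raw string
chunked = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(my_sent2)))
rels = extract_rels('PER', 'PER', chunked, corpus='ace', pattern=IN)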
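
Since the task asks for a few regular expressions, a simpler route that does not depend on `extract_rels` is to match role words next to capitalized names directly. The sketch below is one possible way to do that and is not the approach taken in the cell above. The two patterns, the helper `extract_roles`, and the sample sentence are my own assumptions, and names must be capitalized for the patterns to match (so the lowercase "ali" and "davide" in `my_sent2` would not be found).

```python
import re

ROLES = r"goalkeeper|defender|midfielder|forward|striker"
# One or more capitalized words, e.g. "Gigi Buffon"
NAME = r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*"

# Pattern 1: "<role> <Name>", e.g. "goalkeeper Gigi Buffon"
ROLE_THEN_NAME = re.compile(rf"\b(?P<role>{ROLES})\s+(?P<name>{NAME})")
# Pattern 2: "<Name> is a/an/the <role>", e.g. "Davide is a striker"
NAME_IS_ROLE = re.compile(
    rf"\b(?P<name>{NAME})\s+is\s+(?:(?:a|an|the)\s+)?(?P<role>{ROLES})\b"
)


def extract_roles(text):
    """Return (player, role) pairs matched by either pattern."""
    pairs = []
    for pattern in (ROLE_THEN_NAME, NAME_IS_ROLE):
        for match in pattern.finditer(text):
            pairs.append((match.group("name"), match.group("role")))
    return pairs


# Made-up sample sentence, not taken from the downloaded articles
sample = "Italian goalkeeper Gigi Buffon spoke after the match. Davide is a striker."
pairs = extract_roles(sample)
print(pairs)
```

Patterns like these are brittle (they miss phrasings such as "Buffon, the goalkeeper"), so they are best treated as a starting point to extend with the constructions that actually occur in the articles.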
