## Project 7

In this project the goal was to use a part of some data and predict some aspect of it using older data. In this case we used movies reviews and new articles. With the movies review my goal was to separate the data into two parts. one part of it will be the testing data, while the latter would be the training data. Looking at the training data our goal will be to predict the type of review given in the testing data.  A similar thing will be done for new articles but instead of giving us a review, it would tell us what type of news article it is. The predictions for each of this topic will be made using some statistics. Once the predictions are made, I will use it to figure out how well they are working with my own random reviews. Lastly, I will check how will the predictions were made on some of the testing data.

Here we clean up the data and create a data frame with the most words used in positive and negative reviews.
This data will be used to make predictions in the testing data.


In [1]:
import pandas as pd
import re
from collections import Counter
from math import prod
import numpy as np

#open the data containing movie reviews file
movies = pd.read_csv("movie_reviews.zip")
#creating training and testing data 
train_frac = 0.5
train_df = movies.sample(frac=train_frac, random_state=0).copy()
test_df = movies.drop(train_df.index).copy()

#cleaning up text
def wc(text):
    words = re.findall("[a-z']+", text.lower().replace("<br />", ""))
    return Counter(words)

#creating new column in training data with data showing word counts of a particular review
train_df["word_counts"] = train_df["review"].map(wc)

#
counters = {}
for label in train_df["sentiment"].unique():
    counters[label] = Counter()

for i in train_df.index:
    counters[train_df.loc[i, "sentiment"]] += train_df.loc[i, "word_counts"]
    
word_counts = pd.DataFrame(counters).fillna(0) + 1

with open("stopwords.txt") as foo:
    stops = foo.read().split(",")
    
word_counts = word_counts.drop(stops, errors="ignore")
class_probs = train_df["sentiment"].value_counts()/len(train_df)

def word_prob(w):
    return word_counts.loc[w]/word_counts.sum()
word_counts

Unnamed: 0,negative,positive
whole,898.0,662.0
film,9374.0,10163.0
lasted,33.0,25.0
minutes,1109.0,421.0
maximum,13.0,10.0
...,...,...
subsection,1.0,2.0
matkondar,1.0,2.0
bhajpai,1.0,2.0
brightens,1.0,2.0


Here we print the tested data along with a new column called predicted sentiment giving us the predicted review type after all the computations were completed

In [2]:
a = test_df.head(100).copy()
for i in a.index:
    words = re.findall("[a-z']+", a['review'][i].lower().replace("<br />", ""))
    probs = sum([np.log10(word_prob(w)) for w in words if w in word_counts.index])*class_probs
    a['predicted_sentiment'] = probs.idxmax()


As we can see, there are a few places where it was predicted incorrectly, in order for me to see how well the algorithm is working I had the divided the amount of data with the incorrect predictions

In [3]:
predicted_wrong = a[a['sentiment'] != a['predicted_sentiment']]
print(f"Percentage of incorrect predictions : {len(predicted_wrong)/len(a)}")
print(f"Percentage of correct predictions : {1 - len(predicted_wrong)/len(a)}")

Percentage of incorrect predictions : 0.47
Percentage of correct predictions : 0.53


It seems the 53% of the reviews were predicted right. To see a few examples were problems occurred I had written my own small reviews and had run it through the program

In [4]:
review = 'This movie was not great, would recommend to any of my friends'
words = re.findall("[a-z']+", review.lower().replace("<br />", ""))
probs = sum([np.log10(word_prob(w)) for w in words if w in word_counts.index])*class_probs
print(f"Predicticed sentiment:  {probs.idxmax()}")

Predicticed sentiment:  positive


In [5]:
review = 'This movie was terrbile to watch'
words = re.findall("[a-z']+", review.lower().replace("<br />", ""))
probs = sum([np.log10(word_prob(w)) for w in words if w in word_counts.index])*class_probs
print(f"Predicticed sentiment:  {probs.idxmax()}")

Predicticed sentiment:  negative


The first review was supposed to be a negative review but because it had a key word that is mainly used in good reviews the algorithm didn’t give a correct response. While the second review had all the correct key words needed from the type of review it was, it made the correct predictions.

In the next part, we will be working with different data coming from news articles. With this we will be doing the same thing we did with movies except the data was given in a text file. So first i would have to parse my data and create it into a data frame as such with the movie reviews. Then I did I used the same method I had used for predicted movie reviews with testing data.

Opening text file with data.

In [6]:
from zipfile import ZipFile
with ZipFile("newsgroups.zip", 'r') as zipped:
    txt = zipped.read("newsgroups.txt").decode(encoding='utf8', errors='ignore')
    

Parsing my data and removing any empty not needed elements in the list

In [7]:
my_list = re.findall(r"(?s).*?(?=Newsgroup:)", txt)
while("" in my_list) :
    my_list.remove("")

Creating the dataframe

In [8]:
dictionary = []
for i in range(len(my_list)):
    listing = {}
    w = re.findall(r"Newsgroup: *.+",my_list[i])
    o = re.findall(r"From: *.+",my_list[i])
    r = re.findall(r"Subject: *.+",my_list[i])
    d = re.findall(r"(?<=\n{2}).*$", my_list[i], flags=re.S)
    listing['newsgroup'] = w[0][11:] 
    listing['from'] = o[0][6:] 
    if r == []:
        listing['subject'] = []
    else:
        listing['subject'] = r[0][9:] 
    listing['body'] = d[0][:]
    dictionary.append(listing)

    df = pd.DataFrame(dictionary)
df
    
    
    

Unnamed: 0,newsgroup,from,subject,body
0,rec.autos,gwm@spl1.spl.loral.com (Gary W. Mahan),Re: Are BMW's worth the price?,\n>sure sounds like they got a ringer. the 32...
1,sci.med,davec@ecst.csuchico.edu (Dave Childs),Dental Fillings question,\nI have been hearing bad thing about amalgam ...
2,alt.atheism,"""Robert Knowles"" <p00261@psilink.com>",Re: Islamic marriage?,"\n>DATE: Tue, 6 Apr 1993 00:11:49 GMT\n>FROM..."
3,rec.sport.baseball,sepinwal@mail.sas.upenn.edu (Alan Sepinwall),Re: WFAN,\nIn article <1993Apr16.174843.28111@cabell.vc...
4,talk.religion.misc,rwd4f@poe.acc.Virginia.EDU (Rob Dobson),Re: A Message for you Mr. President: How do yo...,\nIn article <visser.735284180@convex.convex.c...
...,...,...,...,...
7372,sci.electronics,randy@ve6bc.ampr.ab.ca (Randy J. Pointkoski),Needed 24 volt 4 circuit Flasher,\n\nI am looking for a source for a 4 circuit ...
7373,sci.med,ron.roth@rose.com (ron roth),Selective Placebo,\nL(> levin@bbn.com (Joel B Levin) writes:\nL...
7374,rec.autos,RZAA80@email.sps.mot.com (Jim Chott),Re: Toyota Land Cruiser worth it?,"\nIn article <1r3sbbINN8e0@hp-col.col.hp.com>,..."
7375,rec.sport.hockey,"""James J. Murawski"" <jjm+@andrew.cmu.edu>",This Year's vs. Next Year's Playoffs,"\n\nWell, since someone probably wanted to kno..."


Using the same method I used for movies

In [9]:
train_frac = 0.5
train_df = df.sample(frac=train_frac, random_state=0).copy()
test_df = df.drop(train_df.index).copy()


def wc(text):
    words = re.findall("[a-z']+", text.lower().replace("<br />", ""))
    return Counter(words)


train_df["word_counts"] = train_df["body"].map(wc)


counters = {}
for label in train_df["newsgroup"].unique():
    counters[label] = Counter()

for i in train_df.index:
    counters[train_df.loc[i, "newsgroup"]] += train_df.loc[i, "word_counts"]
    
word_counts = pd.DataFrame(counters).fillna(0) + 1

with open("stopwords.txt") as foo:
    stops = foo.read().split(",")
    
word_counts = word_counts.drop(stops, errors="ignore")
class_probs = train_df["newsgroup"].value_counts()/len(train_df)

def word_prob(w):
    return word_counts.loc[w]/word_counts.sum()
word_counts

Unnamed: 0,rec.sport.baseball,sci.electronics,alt.atheism,sci.med,rec.motorcycles,talk.religion.misc,rec.sport.hockey,rec.autos
went,38.0,13.0,15.0,29.0,37.0,14.0,71.0,37.0
dodgers,46.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
game,328.0,19.0,20.0,2.0,5.0,9.0,611.0,3.0
tonight,15.0,1.0,1.0,1.0,10.0,2.0,21.0,2.0
cap,5.0,7.0,2.0,4.0,7.0,1.0,6.0,3.0
...,...,...,...,...,...,...,...,...
luckly,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
estamate,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
stander,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
runing,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0


Above shows the words used in every type of news articles there are

In [10]:
a = test_df
for i in a.index:
    words = re.findall("[a-z']+", a['newsgroup'][i].lower().replace("<br />", ""))
    probs = sum([np.log10(word_prob(w)) for w in words if w in word_counts.index])*class_probs
    a['predicted_newsgroup'] = probs.idxmax()

a

Unnamed: 0,newsgroup,from,subject,body,predicted_newsgroup
0,rec.autos,gwm@spl1.spl.loral.com (Gary W. Mahan),Re: Are BMW's worth the price?,\n>sure sounds like they got a ringer. the 32...,rec.autos
2,alt.atheism,"""Robert Knowles"" <p00261@psilink.com>",Re: Islamic marriage?,"\n>DATE: Tue, 6 Apr 1993 00:11:49 GMT\n>FROM...",rec.autos
3,rec.sport.baseball,sepinwal@mail.sas.upenn.edu (Alan Sepinwall),Re: WFAN,\nIn article <1993Apr16.174843.28111@cabell.vc...,rec.autos
5,rec.sport.hockey,umturne4@ccu.umanitoba.ca (Daryl Turner),"Re: Jets/Canucks - Jets hold on, win 5-4",\nIn article <C6067p.Lsp@news.cso.uiuc.edu> ep...,rec.autos
7,rec.sport.baseball,rickc@krill.corp.sgi.com (Richard Casares),Re: Jim Lefebvre is an idiot.,"\nIn article <1993Apr5.190141.17623@bsu-ucs>, ...",rec.autos
...,...,...,...,...,...
7366,rec.sport.baseball,<RVESTERM@vma.cc.nd.edu>,Re: Jack Morris,\nIn article <1993Apr20.004746.13007@ramsey.cs...,rec.autos
7368,rec.sport.baseball,tedward@cs.cornell.edu (Edward [Ted] Fischer),Re: Ind. Source Picks Baerga Over Alomar: Case...,\nIn article <C5L6Dn.4uB@andy.bgsu.edu> klopfe...,rec.autos
7370,rec.motorcycles,ryan_cousineau@compdyn.questor.org (Ryan Cousi...,Re: more DoD paraphernali,\n\n\n\nJS>From: Stafford@Vax2.Winona.MSUS.Edu...,rec.autos
7371,sci.med,geb@cs.pitt.edu (Gordon Banks),Re: tuberculosis,\nIn article <1993Mar29.181406.11915@iscsvax.u...,rec.autos


In [11]:
predicted_wrong = a[a['newsgroup'] != a['predicted_newsgroup']]
print(f"Percentage of correct predictions : {len(predicted_wrong)/len(a)}")
print(f"Percentage of incorrect predictions : {1 - len(predicted_wrong)/len(a)}")

Percentage of correct predictions : 0.8631065329357549
Percentage of incorrect predictions : 0.1368934670642451


To see how well the predictions, we are being made I had used the same technique used for the movies review and found that 86% of them came back predicted