## Sentiment Analysis with scikit-learn

In [1]:
import pandas as pd
import numpy as np
import sklearn
import os 
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [2]:
# show full column contents (None disables truncation; -1 is deprecated since pandas 1.0)
pd.set_option('display.max_colwidth', None)

In [3]:
# get the absolute path of the current working directory and list its contents
cwd = os.getcwd()
mother = os.path.dirname(cwd)  # parent directory
print(cwd)
print(os.listdir(cwd))
print(mother)
# list contents of resources directory
resources = os.path.join(mother, 'resources')
print(os.listdir(resources))

C:\Users\alasseter\Documents\Projects\wings_of_armageddon\workbench_austin
['.ipynb_checkpoints', 'Exploration and Naive Bayes classifier.ipynb', 'Sample Script.ipynb']
C:\Users\alasseter\Documents\Projects\wings_of_armageddon
['beerpic.jpg', 'DLabs -Terms and Conditions.pdf', 'sample_submission.csv', 'test.csv', 'train.csv']


# Training dataset

In [4]:
df = pd.read_csv(os.path.join(resources, 'train.csv'))

### Exploration

In [5]:

df.head(3)

Unnamed: 0,id,labels,text
0,2592,0,"Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!"
1,18359,1,"This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is some merit in this view, but it's also true that no one forced Hindus and Muslims in the region to mistreat each other as they did around the time of partition. It seems more likely that the British simply saw the tensions between the religions and were clever enough to exploit them to their own ends.<br /><br />The result is that there is much cruelty and inhumanity in the situation and this is very unpleasant to remember and to see on the screen. But it is never painted as a black-and-white case. There is baseness and nobility on both sides, and also the hope for change in the younger generation.<br /><br />There is redemption of a sort, in the end, when Puro has to make a hard choice between a man who has ruined her life, but also truly loved her, and her family which has disowned her, then later come looking for her. But by that point, she has no option that is without great pain for her.<br /><br />This film carries the message that both Muslims and Hindus have their grave faults, and also that both can be dignified and caring people. The reality of partition makes that realisation all the more wrenching, since there can never be real reconciliation across the India/Pakistan border. In that sense, it is similar to ""Mr & Mrs Iyer"".<br /><br />In the end, we were glad to have seen the film, even though the resolution was heartbreaking. 
If the UK and US could deal with their own histories of racism with this kind of frankness, they would certainly be better off."
2,1040,0,"Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.<br /><br />Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt like I was watching a junior high video presentation. Have the directors, producers, etc. ever even seen a movie before? Halestorm is getting worse and worse with every new entry. The concept for this movie sounded so funny. How could you go wrong with Gary Coleman and a handful of somewhat legitimate actors. But trust me when I say this, things went wrong, VERY WRONG."


In [6]:
# What's the distribution of sentiment?
df['labels'].value_counts()

1    12500
0    12500
Name: labels, dtype: int64

In [7]:
# Are positive reviews longer?
df['length'] = df['text'].str.len()  # length in characters
posmean = df.loc[df['labels']==1, 'length'].mean()
negmean = df.loc[df['labels']==0, 'length'].mean()
print('Average length of positive reviews:', posmean)
print('Average length of negative reviews:', negmean)
# Yes: positive reviews are slightly longer on average, by about 44 characters.

Average length of positive reviews: 1347.42648
Average length of negative reviews: 1303.19936
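
The same comparison can be written as a single `groupby`, which also makes it easy to add the median next to the mean. This is a minimal sketch on a hypothetical four-row frame (`toy`), not on `train.csv`:

```python
import pandas as pd

# Hypothetical stand-in for df; only the column names match the notebook
toy = pd.DataFrame({
    "labels": [1, 1, 0, 0],
    "text": ["wonderful, moving film", "great", "bad", "terrible plot"],
})

# Character length of each review
toy["length"] = toy["text"].str.len()

# Mean and median length per sentiment label
summary = toy.groupby("labels")["length"].agg(["mean", "median"])
print(summary)
```

A median is less sensitive than a mean to a handful of very long reviews, so it is a useful second check before concluding that one class is longer.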


### Vectorize the text data

Note: TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer. See
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
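
The equivalence can be checked directly. The sketch below uses a hypothetical three-document corpus and default settings for all three classes; it compares the one-step and two-step results numerically:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Toy corpus (hypothetical, not taken from train.csv)
docs = ["a great film", "a dull film", "great acting and a great script"]

# One step: raw text -> TF-IDF matrix
one_step = TfidfVectorizer().fit_transform(docs)

# Two steps: raw text -> term counts -> TF-IDF matrix
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Both routes should give the same matrix when the settings match
same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```

The two routes only agree when the shared parameters (tokenization, n-gram range, normalization, IDF smoothing) are set the same way on both sides.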

In [8]:
# It turns out that the most frequent n-gram is "br br", left over from <br /> HTML line breaks, so we add "br" to the stopwords.
df.loc[df['text'].str.contains('<br /><br />')].tail(1)

Unnamed: 0,id,labels,text,length
24999,15795,1,"Abhay Deol meets the attractive Soha Ali Khan and greets her ""Hello Sister""!!!. This sets the tone for a remarkable debut film by Shivam Nair. Soha, a middle class girl has run away from her home in Nainital and come to Delhi to marry her lover, Shayan Munshi. But Shyan doesn't turn up leaving Soha heartbroken & alone in the big bad world. . Abhay, the lower class next door guy turns protective towards the vulnerable Soha and helps her get a job & shelter in an old age home. Slowly romance blooms and Soha agrees to marry Abhay. Then Shyan re-enters into Soha's life.<br /><br />A sensitively made film with a very unusual story, lovingly shot in Delhi, revolves around the delicate Soha. This well crafted film has moments which will forever remain etched in one's memory - the awkward first kiss & Abhay's swift apology; Abhay describing Soha as ""class wali ladki"" & hastily adding ""that he doesn't love her""; his gifting a churidar to Soha & asking her out for a date.<br /><br />The music is good & the background music excellent. In a scene where Soha rushes & embraces Abhay the sound track disappears. The stillness conveys both the awkwardness & tenderness of the relationship.<br /><br />The poignant ending makes for a bitter sweet film, the memories of which will linger for a long long time.<br /><br />A must see I will rate it 8.5/10",1353


In [9]:
# nltk.download('stopwords') is needed once if the corpus is not installed yet
stopset = nltk.corpus.stopwords.words('english')
stopset.append('br')
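
An alternative to the stopword route is to remove the tags from the text before vectorizing. The helper below (`strip_br`, a hypothetical name) is a minimal sketch that assumes the only markup in the reviews is the `<br />` line break seen above:

```python
import re

# Matches <br>, <br/> and <br /> in any letter case
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

def strip_br(text):
    """Replace each HTML line-break tag with a single space."""
    return BR_TAG.sub(" ", text)

sample = "Good music.<br /><br />A must see."
print(strip_br(sample))
```

It could be applied with something like `df['text'].apply(strip_br)` before fitting. Passing it as the vectorizer's `preprocessor` is also possible, but that argument replaces the built-in lowercasing and accent stripping, so cleaning the column first is the simpler choice.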

In [10]:
# Define the vectorizer (parameters left at their defaults are commented out)
tvec = TfidfVectorizer(
#                        input='content', 
#                        encoding='utf-8', 
#                        decode_error='strict', 
                       strip_accents='unicode', 
                       lowercase=True, 
#                        preprocessor=None, 
#                        tokenizer=None, 
#                        analyzer='word', 
                       stop_words=stopset, 
#                        token_pattern='(?u)\\b\\w\\w+\\b', 
                       ngram_range=(1, 3),   # unigrams, bigrams and trigrams
#                        max_df=1.0, 
#                        min_df=1, 
#                        max_features=None, 
#                        vocabulary=None, 
#                        binary=False, 
#                        dtype=np.float64,   # numpy.int64 in older scikit-learn releases
#                        norm='l2', 
#                        use_idf=True, 
#                        smooth_idf=True, 
#                        sublinear_tf=False
)

In [11]:
cleaned = tvec.fit_transform(df['text'])
cleaned

<25000x4663341 sparse matrix of type '<class 'numpy.float64'>'
	with 8341293 stored elements in Compressed Sparse Row format>
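
Most of the 4.6 million columns come from the bigrams and trigrams that `ngram_range=(1, 3)` adds. The sketch below, on a hypothetical three-sentence corpus, shows how the vocabulary grows with the n-gram range and how `min_df` prunes terms that occur in too few documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical); the real effect is far larger on 25,000 reviews
docs = [
    "the plot was slow but the acting was great",
    "the acting was great and the plot was great",
    "slow plot and weak acting",
]

def vocab_size(**kwargs):
    """Number of features a TfidfVectorizer learns from the toy corpus."""
    return len(TfidfVectorizer(**kwargs).fit(docs).vocabulary_)

unigrams = vocab_size(ngram_range=(1, 1))
up_to_trigrams = vocab_size(ngram_range=(1, 3))
pruned = vocab_size(ngram_range=(1, 3), min_df=2)  # keep terms found in >= 2 docs

print(unigrams, up_to_trigrams, pruned)
```

Whether pruning helps or hurts accuracy on the review data is not something this sketch can tell; it only illustrates the knobs that control matrix width.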

In [12]:
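# Summed TF-IDF weight of each term across all reviews (a weight, not a raw count)
word_counts = pd.DataFrame(cleaned.sum(axis=0),
                       columns=tvec.get_feature_names_out())  # get_feature_names() before scikit-learn 1.0
word_counts.head(3)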

Unnamed: 0,00,00 01,00 01 percent,00 90,00 90 minutes,00 air,00 air horror,00 alison,00 alison agrees,00 back,...,zzzzzzzzzzzz pop,zzzzzzzzzzzz pop vcr,zzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz ooops,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz ooops sorry,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz oh,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz oh um,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
0,3.604259,0.034543,0.034543,0.077252,0.077252,0.086255,0.086255,0.033209,0.033209,0.166734,...,0.147763,0.147763,0.073757,0.06942,0.06942,0.06942,0.06942,0.06942,0.06942,0.06942


In [13]:
print('Most common: \n \n', word_counts.T.sort_values(by=0,ascending=False).head(10))

Most common: 
 
                  0
movie   431.943057
film    367.661851
one     244.160509
like    208.857432
good    184.779504
story   156.233606
really  155.883391
time    154.175497
would   153.237895
even    151.384530


### Train-test split

In [14]:
X = df['text']
y = df['labels']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

In [15]:
# Fit the vocabulary and IDF weights on the training text only, then reuse them to transform the test text.
# Don't fit on the test data! (d'oh)
tvec_train = tvec.fit_transform(X_train)
tvec_test  = tvec.transform(X_test)
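
One way to make the fit/transform distinction hard to get wrong is to wrap the vectorizer and the classifier in a scikit-learn pipeline, which fits every step on the training data and only transforms at prediction time. A minimal sketch on hypothetical toy reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (hypothetical); 1 = positive, 0 = negative
train_text = ["loved it", "great film", "awful mess", "boring and awful"]
train_y = [1, 1, 0, 0]
test_text = ["great film, loved it", "awful and boring"]

# fit() calls fit_transform on the vectorizer, then fit on the classifier
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(train_text, train_y)

# predict() only calls transform on the vectorizer, so nothing is refit
preds = pipe.predict(test_text)
print(preds)
```

A pipeline also lets cross-validation refit the vectorizer inside each fold, which keeps held-out folds from influencing the vocabulary.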

### Naive Bayes Classifier

In [16]:
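# Grid search parameters. Every candidate is commented out, so the search below
# only cross-validates MultinomialNB with its default settings.
param_grid = {
#         'alpha': [0.1, 1, 2],                 # alpha=0 would disable smoothing
#         'class_prior': [None, [0.5, 0.5]],    # None or one prior per class
#         'fit_prior': [False, True],  
        }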

In [17]:
# conduct gridsearch
grid = GridSearchCV(MultinomialNB(), param_grid=param_grid, n_jobs = 1, cv=3)
grid.fit(tvec_train, y_train)
print(grid.best_params_)

{}
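
The empty dictionary above means nothing was actually searched. For comparison, here is a sketch of a search with a populated grid, run on hypothetical random count data rather than the review matrix; the candidate values are illustrative assumptions, not tuned choices:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Hypothetical non-negative count features and balanced binary labels
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 5, size=(60, 8))
y_toy = np.array([0, 1] * 30)

# 3 alpha values x 2 fit_prior values = 6 candidate settings
param_grid = {
    'alpha': [0.1, 0.5, 1.0],
    'fit_prior': [True, False],
}

search = GridSearchCV(MultinomialNB(), param_grid=param_grid, cv=3)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

On the real data the same pattern applies with `tvec_train` and `y_train` in place of the toy arrays.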


In [18]:
# best_estimator_ is the model already refit on the full training set with the best parameters
model = grid.best_estimator_
print(model)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)


In [19]:
# Predict on the test data
predictions=model.predict(tvec_test)
# Probabilities
probabilities = model.predict_proba(tvec_test)[:,1]
print(len(probabilities))
print(len(y_test))
print(len(predictions)) # confirm that y_test matches predictions & probs.

7500
7500
7500


### Score the model

In [20]:

print(metrics.classification_report(y_test, predictions))

             precision    recall  f1-score   support

          0       0.89      0.87      0.88      3750
          1       0.88      0.89      0.88      3750

avg / total       0.88      0.88      0.88      7500



In [21]:
def model_metrics(y_test, predictions):
    '''
    Calculate 6 standard model metrics
    Return a dictionary with the metrics
    '''
    f1 = metrics.f1_score(y_test, predictions)
    accuracy = metrics.accuracy_score(y_test, predictions)
    error = 1 - accuracy
    precision = metrics.precision_score(y_test, predictions)
    recall = metrics.recall_score(y_test, predictions)
    # Note: computed from hard 0/1 predictions, this equals balanced accuracy;
    # pass predicted probabilities instead to get the usual ROC-AUC.
    rocauc = metrics.roc_auc_score(y_test, predictions)
    return {'f1 score':f1, 'accuracy score': accuracy, 'error rate': error, 'precision score': precision, 'recall score': recall, 'ROC-AUC score': rocauc}


model_metrics(y_test, predictions)

{'ROC-AUC score': 0.8834666666666667,
 'accuracy score': 0.8834666666666666,
 'error rate': 0.11653333333333338,
 'f1 score': 0.8846661388229085,
 'precision score': 0.8756530825496343,
 'recall score': 0.8938666666666667}
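
The ROC-AUC above matches the accuracy almost exactly because it is computed from hard 0/1 predictions on a balanced test set. The sketch below uses six hypothetical examples to show the difference between scoring labels and scoring probabilities:

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.10, 0.40, 0.60, 0.45, 0.80, 0.90])

# Hard predictions at the usual 0.5 threshold
labels = (scores >= 0.5).astype(int)

auc_from_labels = metrics.roc_auc_score(y_true, labels)
auc_from_scores = metrics.roc_auc_score(y_true, scores)
balanced_acc = metrics.balanced_accuracy_score(y_true, labels)

print(auc_from_labels, balanced_acc, auc_from_scores)
```

With hard labels the ROC curve has a single operating point, so the area collapses to the mean of sensitivity and specificity. In the notebook, the `probabilities` array from the prediction cell would be the natural input for a threshold-free AUC.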

In [22]:
def model_metrics_table(y_test, predictions):
    '''
    Make a data frame of the model metrics
    '''
    metrics_df = pd.DataFrame.from_dict(model_metrics(y_test, predictions), orient='index')
    metrics_df.reset_index(level=0, inplace=True)
    metrics_df.columns=['evaluation metric name', 'score']
    metrics_df['score']=round(metrics_df['score'], 2)
    return metrics_df
model_metrics_table(y_test, predictions)

Unnamed: 0,evaluation metric name,score
0,f1 score,0.88
1,accuracy score,0.88
2,error rate,0.12
3,precision score,0.88
4,recall score,0.89
5,ROC-AUC score,0.88


### Where did the model predict incorrectly?

In [23]:
# Convert each component to a pandas DataFrame.
# X_test and y_test still carry the row labels of the original df, which have gaps after the split,
# so reset_index(drop=True) discards them and renumbers from 0 to line up all four pieces by position.
df_probs=pd.DataFrame(probabilities, columns=['probabilities'])
df_preds=pd.DataFrame(predictions, columns=['predictions'])
df_Xtest=pd.DataFrame(X_test).reset_index(drop=True)
df_ytest=pd.DataFrame(y_test).reset_index(drop=True)

final=pd.concat([df_Xtest, df_ytest, df_preds, df_probs], axis=1)
final.loc[final['labels']!=final['predictions']].head(2)

Unnamed: 0,text,labels,predictions,probabilities
0,"(There isn't much in the way of spoilers, since there isn't a plot to reveal, but still, I guess I describe some of what happens so...) This is it. This is THE most nonsensical film I've ever seen. There are simply no words to describe this movie, although ""bizarre"" ""ridiculous"" and ""ego trip"" are pretty close. the opening half hour or so are really, really weird music videos, with absolutely no plot or continuity, apart from that MJ falls into some from the previous. One of the highlights of this part of the ""film"" in the section where MJ is flying a merry-go-round aeroplane through what seem to be half-arsed bond intro rejects and sections cut from Yellow Submarine (dear lord you could not make this up).<br /><br />Then, with a little over an hour remaining, the ""film"" begins, with a lot of claymation (some of it really creepy) spotting our ""hero"" and chasing him looking for an autograph. Obviously, this leaves our as of yet mute (discounting songs) lead somewhat worried, and he manages to temporarily lose them. Fortunate for him, because it means he can witness a falling star and, and again, I'm not making this up, turn into a claymation rabbit. He uses this cunning disguise to try and sneak past them, but, for reasons I can't recall right now, they see through it (oh no!) and the creepy chase begins again. Cue another song (big shock there).<br /><br />Shortly after the end of the chase, MJ somehow brings the rabbit to life, until he is busted by a policeman (in the middle of the desert) because it is, apparently, illegal to dance there.<br /><br />The rest of the film is equally as strange, highlights including MJ cleaning up a bar to the tune of Smooth Criminal, including shooting a man with his finger, not only killing the guy, but burning his shadow into the wall, a la nuclear fission weapons. 
Another good moment is when MJ, seeing, Mr Big (Joe ""what the hell happened to his career at this point?"" Peschi) kidnap one of the children he was friends with, magically creates a tommy gun, and in another moment of violence that pepper this film seemingly at random, opens fire at everything that moves. A final moment I shall mention is when MJ, surrounded by Mr Big and his private army. Seriously, this guy has dozens of people working for him, and they're decked out more like commando units rather than mobsters, which I guess they are. How does he get out? Why, he turns into a robot, complete with weapons and shield. This is the third of four transformations he makes, almost always when backed into a corner and/or on the run.<br /><br />This film is quite, quite surreal, with little in the way of plot, and virtually no continuity.",1,0,0.426253
23,"There was a great film to be made about Steve Biko. Sadly this wasn't it. Denzel Washington - never the most flexible of actors - is totally unable to convey the great charisma that Biko had. Attenborough's big crowd scenes are laughable. The Soweto massacre wasn't like this, three neat lines of children ( some doing cartwheels!) marching happily into the guns of the soldiers. With Biko dead the film rapidly descends into farce. If the struggle against Apartheid was anything it was a black people's struggle yet somehow we are all supposed to be gripped by the escape of a white man and his family. I'm sure Donald Woods was a decent man and he would be the first to say that Biko was important while he wasn't. Penelope Wilton's accent is pure Hampshire and she seems completely unaware that she is in South Africa at all. at all. The Wood's family dog gets more lines than the black maid. As the family make their escape one the women I saw the film with - incidentally one of only about a dozen black people in a large, full cinema - whispered ""This is like the sound of music."" She had a point.<br /><br />Overall this is a film by a well-intentioned if somewhat inept white liberal about a radical black people's struggle. And really South Africa needs well-intentioned white liberals like it needs a hole in the head.",0,1,0.644676
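
Resetting the index makes it hard to trace a misclassified review back to its row in the original data. An alternative sketch keeps the original row labels by attaching them to the prediction arrays; all names ending in `_toy` are hypothetical stand-ins for the notebook's variables:

```python
import numpy as np
import pandas as pd

# Stand-ins for X_test / y_test: Series that kept their original row labels
X_test_toy = pd.Series(["fine film", "dull film", "odd film"],
                       index=[7, 2, 9], name="text")
y_test_toy = pd.Series([1, 0, 1], index=[7, 2, 9], name="labels")

# Stand-ins for the model output: plain arrays in the same row order
predictions_toy = np.array([1, 1, 0])
probabilities_toy = np.array([0.8, 0.6, 0.3])

# Give the arrays the same index so every column aligns by row label
final_toy = pd.DataFrame({
    "text": X_test_toy,
    "labels": y_test_toy,
    "predictions": pd.Series(predictions_toy, index=X_test_toy.index),
    "probabilities": pd.Series(probabilities_toy, index=X_test_toy.index),
})

errors = final_toy[final_toy["labels"] != final_toy["predictions"]]
print(errors.index.tolist())
```

This relies on the model returning predictions in the same order as the rows it was given, which holds for scikit-learn estimators.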


In [24]:
# Let's look at some that had intermediate probability
final[(final['probabilities']>0.48) & (final['probabilities']<0.52)]

Unnamed: 0,text,labels,predictions,probabilities
18,"I consider myself a huge movie buff. I was sick on the couch and popped in this film. Right from the opening to the end I watched in awe at these great actors, i'd never seen, say great word. The filming was beautiful. It was just what I needed. I hope that this message is heard over any bad comments written by others. The Director has a heart and it beats with his actors throughout. Thanku for making a film like this one. Just wonderfully awkward, beautiful kind characters who are flawed and graceful all at once. Just great. I can't submit this without 10 lines in total so I will simply go on to say that I wish for more from this director, more from all the actors in this film and more from the writer. I didn't want it to end. The end",1,1,0.514824
59,"Well what I can say about this movie is that it's great to see so many Asian faces. What I didn't like about the film was that it was full of stereotypes of what typical racial characters would do in their role. The Asian girl without confidence who has to play someone else to get ahead, the white guy infatuated with Asian culture and chooses to leave his white world behind for the land of yellow and the ""keeping it real"" black cab driver. Plus all the coke, shanghai tang and dunkin donuts product placement was a bit too obvious. The story plot itself was fun but pretty much how I thought the story would unravel. Then again when watching romantic comedies you can't expect much but then again I would have been wanted to just be surprised at least once. The parents are the best part of the flick.",0,1,0.511099
82,"Is there a movement more intolerant and more judgmental than the environmentalist movement? To a budding young socialist joining the circus must seem as intimidating as joining a real circus. Even though such people normally outsource their brain to Hollywood for these important issues, the teachings of Hollywood can often seem fragmented and confusing. Fortunately Ed is here to teach neo-hippies in the art of envirojudgementalism.<br /><br />Here you'll learn the art of wagging your finger in the face of anyone without losing your trademark smirk. You'll learn how to shrug off logic and science with powerful arguments of fear. You'll learn how to stop any human activity that does not interest you by labeling it as the gateway to planetary Armageddon.<br /><br />In addition to learning how to lie with a straight face you'll also learn how to shrug off accusations that are deflected your way no matter how much of a hypocrite you are. You'll be able to use as much energy as Al Gore yet while having people treat you as if you were Amish.<br /><br />In the second season was even more useful as we were able to visit other Hollywood Gods, holy be thy names, and audit - i.e. judge - their lifestyles. NOTE: This is the only time it's appropriate for an envirofascist to judge another because it allows the victim the chance to buy up all sorts of expensive and trendy eco-toys so that they can wag their finger in other people's faces.<br /><br />What does Ed have in store for us in season three? Maybe he'll teach us how to be judgmental while sleeping!",0,1,0.513359
131,"Season after season, the players or characters in this show appear to be people who you'd absolutely love to hate. Is this show rigged to be that or were they chosen for the same? Each episode vilifies one single person specifically and he ends up getting killed off. You enjoy seeing them get screwed although its totally wrong and sick. You enjoy seeing them screwing others, getting screwed themselves, playing dirty, getting it back, escaping and finally getting kicked out by Trump. The amount of tears also seems to be increasing by the season.<br /><br />The rewards which attempt to compensate for past humiliation and suffering are also heavily reduced. In the newer seasons, its like ""You get to meet xyx who'll lecture you about uvw""..like who freaking cares? The characters are so hateable, collectively and individually, that you wonder if they're paid actors? The only sane one gets to win.<br /><br />Watch with caution and maintain a conscience. Those are your fellow human beings in the firing line.",1,0,0.493193
137,"We know that firefighters and rescue workers are heroes: an idée reçue few would challenge. Friends and family of these and others who perished in the attacks on the World Trade Center might well be moved by this vapid play turned film. A sweet, earnest, though tongue-tied fireman recalls what he can of lost colleagues to a benumbed journalist who converts his fragments into a eulogy. They ponder the results. He mumbles some more, she composes another eulogy, etc., etc.<br /><br />The dreadful events that provoked the need for several thousand eulogies is overwhelmingly sad, but this plodding insipid dramatization is distressingly boring.",0,1,0.519684
142,"Seriously, I've read some of the reviews on this film, and I have to ask, were you people watching the same movie.<br /><br />Yes, I give the set directors a lot of credit for being able to recreate 1930 vintage Los Angeles, but so what? <br /><br />None of the characters are likable, the story seems aimless, Karen Black is simply not a very good actress. Donald Sutherland is just icky. (and his character ""Homer Simpson"" makes me wish for the animated version. D'oh!) Then you had the creepy child actor, the creepier Billy Barty, and so on.<br /><br />This is one of those films cinema buffs love and the rest of us look at each other and go, ""What the heck!""",0,1,0.500740
180,"What more could I say? The Americans totally hated it because the U.S. cut was so bad, although you could detect the underlying goodwill in it.<br /><br />Talking about the U.S. theatrical release(along with the newly released Blu-ray Disc version), it's faster and tighter than HK cut, the background musics were all changed from the dark, grim HK musics to Hip-hop musics; and there were a lot of gruesome scenes cut out. Though, the dubbing was a notable job given that they tried to capture the original actor's voice and tone. But, the problem is Hak Hap(Black Mask) the movie was designed and meant to be dark, grim, super-disturbing and totally gruesome. Very unfortunately the U.S. release just skimmed the cream they wanted, which in return completely changed the movie's undertone(HK release was rated 18+) to be even more comical and amateurish.<br /><br />Now let's talk about the original HK release. This movie is like a hidden gem, a prototype for the whole ""matrix"" tide and era. The fighting scenes are totally awesome even the camera works were a bit ""old-school"" among HK movies. However the style the movie created was a unique blend of Kungfu and pop culture. With all the leather, black costumes and decorations, this movie features a batman-like superhero in a black mask against a run-of-the-mill gang of multinational super-soldiers lead by a punk heavy metal rock star boss. Yes it sounds like imaginations of a retarded child, but it works. It's so impressive that the whole movie's gonna give you nightmares featuring foreigners fighting a bloodbath battle in leather coats. In year 2002 they made a sequel which had a PG-13 rating, but without Jet Li and Liu Qing Yun. And you know how bad that was because Li and Liu were the core characters in the movie and had strong personalities and an interesting friendship. And, did I happen to mention Francois Yip? 
Her roundhouse kick was totally cool, even cooler than the villain boss because she didn't use a stuntman for all the fighting. Did I mention she was also smoking hot? Anyway, there are a lot of things to like about the movie.<br /><br />However, the movie also suffered from a lot of problems. First off, it's a mediocre script made at its best potential, which means this production team deserved a better screen-writer. There are a lot climaxes in the entire 100 minutes but they often felt like far-fetched and don't totally make senses to the audiences(US version was even worse because all the character developments were cut). Anyway, you can't ask too much out of a comic-inspired action movie. Also, this movie is entirely improper for children. I won't recommend it to you if you are less than 20 years old. It's saturated with disturbing contents including blood, gore, sado-maso costumes, extreme brutal violence and so on. Along with the style of the movie, it can be called a wet dream for heavy-metal rock music fans and action fans. (the U.S. cut was milder, but if you want to see it, see the HK release for what it is.) 7/10. Status: inspiring, hidden, undervalued, adult.",1,0,0.483406
222,Allen and Julie move into a cabin in the mountains after their daughter is murdered one night. No one knows who killed the little girl but it's why they moved to the mountains. So the couple moves into this cabin and it's haunted by people who killed themselves there and no one in the nearby town wants to talk about it.<br /><br />This movie has a lot of creepiness to it and it has a lot of parts that made me jump. Some of the parts are predictable but once in a while there is a part I didn't expect. It was a pretty good movie that wasn't the scariest movie in the world but it was still scary enough to make it pretty good.<br /><br />I also liked the ending because it left the viewer to decide how it ends. It is also kind of a sad movie as well but a well done horror movie.,1,1,0.518806
235,"I had been long awaiting this movie ever since I saw the trailer, which made it look like a political drama, starring three of my favorite actors; Al Pacino, John Cusack, and Bridget Fonda. And even though it was directed by Harold Becker, who has done uneven work, he and Pacino did combine on SEA OF LOVE, which ranks among each of their best work. But interference on some level(for starters, several of the scenes in the original trailer don't appear in the movie) and changing of tone(subsequent trailers make it look like a thriller) make this, while watchable, nowhere near as it could have been.<br /><br />Which is too bad, because I really wanted to like this movie. There was great potential here to be a film about how government can still be worthwhile despite all the corruption, and to make a complex statement about that corruption, not the usual good guys vs. bad guys. And there is good acting here. Pacino and Cusack are both very good, and Danny Aiello gives one of the best performances of his career. But Fonda is wasted in her role, having nothing to do, and while there is merit in the central storyline, when it turns to a thriller, the movie loses its way, briefly recovers in the final scene between Cusack and Pacino, and then falls down completely in the end. I wish I could like this more, but no.",1,0,0.489267
241,"Hollow Point, though clumsy in places, manages to be an extremely endearing and amusing action movie.<br /><br />The primary entertainment value here is humor - everyone turns in clever performances that provide the film with a great deal of energy.<br /><br />Oh, by the way, advocates of gun safety will be horrified by the conduct of the characters in this movie...",1,1,0.510865
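
Instead of a fixed 0.48 to 0.52 window, the reviews can be ranked by how far their probability is from 0.5, which returns the least confident cases whatever the window would have been. A minimal sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical stand-in for the `final` frame
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "probabilities": [0.95, 0.51, 0.08, 0.47],
})

# Distance from the 0.5 decision threshold; smaller means less confident
toy["uncertainty"] = (toy["probabilities"] - 0.5).abs()

# The two reviews the model is least sure about
most_uncertain = toy.sort_values("uncertainty").head(2)
print(most_uncertain)
```

Naive Bayes probabilities tend to be pushed toward 0 and 1, so reviews near 0.5 are comparatively rare and worth reading individually.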


# Testing dataset

In [41]:
# read in the testing data
dftest = pd.read_csv(os.path.join(resources, 'test.csv'))
len(dftest)

25000

In [43]:
# vectorize the text (remember -- we already fit to the training vocabulary, so we only want 'transform'!)
tvec_test = tvec.transform(dftest['text'])

In [44]:
# predict using the same model we defined earlier
predictions=model.predict(tvec_test)

In [45]:
# create a new column with the predictions
dftest['preds'] = predictions

In [46]:
print(len(dftest))
dftest['preds'].value_counts()

25000


0    12952
1    12048
Name: preds, dtype: int64

In [50]:
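# Probability of the positive class (label 1)
dftest['probs'] = model.predict_proba(tvec_test)[:,1]
# The submission's 'labels' column holds these probabilities rather than the hard 0/1 predictions
dftest['labels']=dftest['probs']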

### Submission

In [51]:
dftest[['id','labels']].to_csv('predictions.csv', index=False)
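
Before uploading, a quick round-trip check can catch a malformed file. The sketch below writes a hypothetical three-row submission to an in-memory buffer and reads it back; the expected layout (an `id` column and a `labels` column holding probabilities) is an assumption based on the columns used above, since the contents of `sample_submission.csv` are not shown here:

```python
import io
import pandas as pd

# Hypothetical submission frame with the two columns written above
submission = pd.DataFrame({"id": [101, 102, 103],
                           "labels": [0.91, 0.07, 0.55]})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Basic sanity checks on the round-tripped file
columns_ok = list(check.columns) == ["id", "labels"]
ids_unique = check["id"].is_unique
probs_in_range = check["labels"].between(0, 1).all()
print(columns_ok, ids_unique, probs_in_range)
```

For the real file, `pd.read_csv('predictions.csv')` would take the place of the buffer.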