Dataset: http://ai.stanford.edu/~amaas/data/sentiment/  
Tutorial: https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184  

## Creating a DataFrame from individual text files

### Training set

In [1]:
# use glob to create lists of the positive and negative training filenames
import glob
train_pos_filenames = glob.glob('../data/aclImdb/train/pos/*.txt')
train_neg_filenames = glob.glob('../data/aclImdb/train/neg/*.txt')
print("train_pos_filenames:")
print(train_pos_filenames[:10])
print("\ntrain_neg_filenames:")
print(train_neg_filenames[:10])

train_pos_filenames:
['../data/aclImdb/train/pos\\0_9.txt', '../data/aclImdb/train/pos\\10000_8.txt', '../data/aclImdb/train/pos\\10001_10.txt', '../data/aclImdb/train/pos\\10002_7.txt', '../data/aclImdb/train/pos\\10003_8.txt', '../data/aclImdb/train/pos\\10004_8.txt', '../data/aclImdb/train/pos\\10005_7.txt', '../data/aclImdb/train/pos\\10006_7.txt', '../data/aclImdb/train/pos\\10007_7.txt', '../data/aclImdb/train/pos\\10008_7.txt']

train_neg_filenames:
['../data/aclImdb/train/neg\\0_3.txt', '../data/aclImdb/train/neg\\10000_4.txt', '../data/aclImdb/train/neg\\10001_4.txt', '../data/aclImdb/train/neg\\10002_1.txt', '../data/aclImdb/train/neg\\10003_1.txt', '../data/aclImdb/train/neg\\10004_3.txt', '../data/aclImdb/train/neg\\10005_3.txt', '../data/aclImdb/train/neg\\10006_4.txt', '../data/aclImdb/train/neg\\10007_1.txt', '../data/aclImdb/train/neg\\10008_2.txt']
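
`glob.glob` makes no promise about the order of its results, and the listing above mixes `/` and `\\` separators because it was produced on Windows. A minimal sketch, assuming the filenames follow the `<id>_<rating>.txt` pattern visible above, of how `pathlib` could give a stable numeric ordering; `list_review_files` is a hypothetical helper, not part of the original notebook, and the demo runs on a temporary directory rather than the real dataset:

```python
import tempfile
from pathlib import Path

def list_review_files(directory):
    """Return the .txt files in `directory`, sorted by the numeric id in the name."""
    # Assumption: names look like '<id>_<rating>.txt', e.g. '10000_8.txt'.
    return sorted(Path(directory).glob('*.txt'),
                  key=lambda p: int(p.stem.split('_')[0]))

# Self-contained demo on a temporary directory (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    for name in ['10000_8.txt', '0_9.txt', '2_7.txt']:
        (Path(tmp) / name).write_text('dummy review', encoding='utf-8')
    demo_names = [p.name for p in list_review_files(tmp)]

print(demo_names)  # ['0_9.txt', '2_7.txt', '10000_8.txt']
```

Sorting does not matter for the label construction used later, since every file in `pos` gets label 1 and every file in `neg` gets label 0, but it makes row order reproducible across machines.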


In [2]:
print(len(train_pos_filenames))
print(len(train_neg_filenames))

12500
12500


In [3]:
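%time
# note: `%time` on a line by itself times an empty statement, which is why the output reports 0 ns;
# the cell magic `%%time` (as the very first line of the cell) would time the whole cell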
# read the contents of the train_pos files into a list (each list element is one review)
train_pos_text = []
for filename in train_pos_filenames:
    with open(filename, encoding='utf-8') as f:
        train_pos_text.append(f.read())
print("train_pos_text:")
print(train_pos_text[0])


# read the contents of the train_neg files into a list (each list element is one review)
train_neg_text = []
for filename in train_neg_filenames:
    with open(filename, encoding='utf-8') as f:
        train_neg_text.append(f.read())
print("\ntrain_neg_text:")
print(train_neg_text[0])

Wall time: 0 ns
train_pos_text:
Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!

train_neg_text:
Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orche
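
The two loops above differ only in the list of filenames they read. As a sketch under that observation, the reading step could be factored into one function; `read_reviews` is a hypothetical name introduced here, and the demo writes two throwaway files instead of touching the dataset:

```python
import tempfile
from pathlib import Path

def read_reviews(filenames, encoding='utf-8'):
    """Return the contents of each file as one string per review."""
    return [Path(name).read_text(encoding=encoding) for name in filenames]

# Self-contained demo on two temporary files (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in [('0_9.txt', 'A great film.'), ('1_7.txt', 'Quite good.')]:
        path = Path(tmp) / name
        path.write_text(text, encoding='utf-8')
        paths.append(path)
    demo_text = read_reviews(paths)

print(demo_text)  # ['A great film.', 'Quite good.']
```

With such a helper, the cell would reduce to calls like `read_reviews(train_pos_filenames)` and `read_reviews(train_neg_filenames)`.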

In [4]:
# combine the pos and neg lists for the training set (pos first, then neg)
train_text = train_pos_text + train_neg_text
print(len(train_text))

# create a list of labels (pos=1, neg=0)
train_labels = [1]*len(train_pos_filenames) + [0]*len(train_neg_filenames)
print(len(train_labels))

25000
25000
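
The labels above rely on the folder a file came from. The filenames also carry a rating (`10000_8.txt` is review 10000 with rating 8), so the labels can be cross-checked. The sketch below assumes that positive reviews have a rating of 7 or more and negative reviews a rating of 4 or less, which matches the filenames listed earlier; `label_from_filename` is a hypothetical helper, and the regular expression is used so that both `/` and `\\` separators work:

```python
import re

# Assumption: every review file is named '<id>_<rating>.txt'.
_NAME_RE = re.compile(r'(\d+)_(\d+)\.txt$')

def label_from_filename(path):
    """Derive a 0/1 label from the rating encoded in the filename."""
    match = _NAME_RE.search(str(path))
    if match is None:
        raise ValueError(f'unexpected filename: {path}')
    rating = int(match.group(2))
    if rating >= 7:
        return 1
    if rating <= 4:
        return 0
    raise ValueError(f'unexpected rating {rating} in {path}')

demo_paths = [
    '../data/aclImdb/train/pos/10000_8.txt',
    '../data/aclImdb/train/neg\\10002_1.txt',  # Windows-style separator
]
demo_labels = [label_from_filename(p) for p in demo_paths]
print(demo_labels)  # [1, 0]
```

Comparing a list built this way from `train_pos_filenames + train_neg_filenames` against `train_labels` would catch a file that ended up in the wrong folder.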


In [5]:
import pandas as pd
# convert the lists into a DataFrame
train_df = pd.DataFrame({'label':train_labels, 'reviews':train_text})

In [6]:
pd.set_option('display.max_colwidth', None)

In [7]:
train_df.head()

Unnamed: 0,label,reviews
0,1,"Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as ""Teachers"". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is ""Teachers"". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!"
1,1,"Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. Most people think of the homeless as just a lost cause while worrying about things such as racism, the war on Iraq, pressuring kids to succeed, technology, the elections, inflation, or worrying if they'll be next to end up on the streets.<br /><br />But what if you were given a bet to live on the streets for a month without the luxuries you once had from a home, the entertainment sets, a bathroom, pictures on the wall, a computer, and everything you once treasure to see what it's like to be homeless? That is Goddard Bolt's lesson.<br /><br />Mel Brooks (who directs) who stars as Bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival (Jeffery Tambor) to see if he can live in the streets for thirty days without the luxuries; if Bolt succeeds, he can do what he wants with a future project of making more buildings. The bet's on where Bolt is thrown on the street with a bracelet on his leg to monitor his every move where he can't step off the sidewalk. He's given the nickname Pepto by a vagrant after it's written on his forehead where Bolt meets other characters including a woman by the name of Molly (Lesley Ann Warren) an ex-dancer who got divorce before losing her home, and her pals Sailor (Howard Morris) and Fumes (Teddy Wilson) who are already used to the streets. They're survivors. Bolt isn't. 
He's not used to reaching mutual agreements like he once did when being rich where it's fight or flight, kill or be killed.<br /><br />While the love connection between Molly and Bolt wasn't necessary to plot, I found ""Life Stinks"" to be one of Mel Brooks' observant films where prior to being a comedy, it shows a tender side compared to his slapstick work such as Blazing Saddles, Young Frankenstein, or Spaceballs for the matter, to show what it's like having something valuable before losing it the next day or on the other hand making a stupid bet like all rich people do when they don't know what to do with their money. Maybe they should give it to the homeless instead of using it like Monopoly money.<br /><br />Or maybe this film will inspire you to help others."
2,1,"Brilliant over-acting by Lesley Ann Warren. Best dramatic hobo lady I have ever seen, and love scenes in clothes warehouse are second to none. The corn on face is a classic, as good as anything in Blazing Saddles. The take on lawyers is also superb. After being accused of being a turncoat, selling out his boss, and being dishonest the lawyer of Pepto Bolt shrugs indifferently ""I'm a lawyer"" he says. Three funny words. Jeffrey Tambor, a favorite from the later Larry Sanders show, is fantastic here too as a mad millionaire who wants to crush the ghetto. His character is more malevolent than usual. The hospital scene, and the scene where the homeless invade a demolition site, are all-time classics. Look for the legs scene and the two big diggers fighting (one bleeds). This movie gets better each time I see it (which is quite often)."
3,1,"This is easily the most underrated film inn the Brooks cannon. Sure, its flawed. It does not give a realistic view of homelessness (unlike, say, how Citizen Kane gave a realistic view of lounge singers, or Titanic gave a realistic view of Italians YOU IDIOTS). Many of the jokes fall flat. But still, this film is very lovable in a way many comedies are not, and to pull that off in a story about some of the most traditionally reviled members of society is truly impressive. Its not The Fisher King, but its not crap, either. My only complaint is that Brooks should have cast someone else in the lead (I love Mel as a Director and Writer, not so much as a lead)."
4,1,"This is not the typical Mel Brooks film. It was much less slapstick than most of his movies and actually had a plot that was followable. Leslie Ann Warren made the movie, she is such a fantastic, under-rated actress. There were some moments that could have been fleshed out a bit more, and some scenes that could probably have been cut to make the room to do so, but all in all, this is worth the price to rent and see it. The acting was good overall, Brooks himself did a good job without his characteristic speaking to directly to the audience. Again, Warren was the best actor in the movie, but ""Fume"" and ""Sailor"" both played their parts well."


In [8]:
train_df.shape

(25000, 2)

### Test set

In [9]:
# use glob to create lists of the positive and negative test filenames
import glob
test_pos_filenames = glob.glob('../data/aclImdb/test/pos/*.txt')
test_neg_filenames = glob.glob('../data/aclImdb/test/neg/*.txt')
print("test_pos_filenames:")
print(test_pos_filenames[:10])
print("\ntest_neg_filenames:")
print(test_neg_filenames[:10])

test_pos_filenames:
['../data/aclImdb/test/pos\\0_10.txt', '../data/aclImdb/test/pos\\10000_7.txt', '../data/aclImdb/test/pos\\10001_9.txt', '../data/aclImdb/test/pos\\10002_8.txt', '../data/aclImdb/test/pos\\10003_8.txt', '../data/aclImdb/test/pos\\10004_9.txt', '../data/aclImdb/test/pos\\10005_8.txt', '../data/aclImdb/test/pos\\10006_7.txt', '../data/aclImdb/test/pos\\10007_10.txt', '../data/aclImdb/test/pos\\10008_8.txt']

test_neg_filenames:
['../data/aclImdb/test/neg\\0_2.txt', '../data/aclImdb/test/neg\\10000_4.txt', '../data/aclImdb/test/neg\\10001_1.txt', '../data/aclImdb/test/neg\\10002_3.txt', '../data/aclImdb/test/neg\\10003_3.txt', '../data/aclImdb/test/neg\\10004_2.txt', '../data/aclImdb/test/neg\\10005_2.txt', '../data/aclImdb/test/neg\\10006_2.txt', '../data/aclImdb/test/neg\\10007_4.txt', '../data/aclImdb/test/neg\\10008_4.txt']


In [10]:
print(len(test_pos_filenames))
print(len(test_neg_filenames))

12500
12500


In [11]:
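%time
# note: `%time` on a line by itself times an empty statement, which is why the output reports 0 ns;
# the cell magic `%%time` (as the very first line of the cell) would time the whole cell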
# read the contents of the test_pos files into a list (each list element is one review)
test_pos_text = []
for filename in test_pos_filenames:
    with open(filename, encoding='utf-8') as f:
        test_pos_text.append(f.read())
print("test_pos_text:")
print(test_pos_text[0])


# read the contents of the test_neg files into a list (each list element is one review)
test_neg_text = []
for filename in test_neg_filenames:
    with open(filename, encoding='utf-8') as f:
        test_neg_text.append(f.read())
print("\ntest_neg_text:")
print(test_neg_text[0])

Wall time: 0 ns
test_pos_text:
I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge.

test_neg_text:
Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just

In [12]:
# combine the pos and neg lists for the test set (pos first, then neg)
test_text = test_pos_text + test_neg_text
print(len(test_text))

# create a list of labels (pos=1, neg=0)
test_labels = [1]*len(test_pos_filenames) + [0]*len(test_neg_filenames)
print(len(test_labels))

25000
25000


In [13]:
import pandas as pd
# convert the lists into a DataFrame
test_df = pd.DataFrame({'label':test_labels, 'reviews':test_text})

In [14]:
test_df.head()

Unnamed: 0,label,reviews
0,1,"I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge."
1,1,"Actor turned director Bill Paxton follows up his promising debut, the Gothic-horror ""Frailty"", with this family friendly sports drama about the 1913 U.S. Open where a young American caddy rises from his humble background to play against his Bristish idol in what was dubbed as ""The Greatest Game Ever Played."" I'm no fan of golf, and these scrappy underdog sports flicks are a dime a dozen (most recently done to grand effect with ""Miracle"" and ""Cinderella Man""), but some how this film was enthralling all the same.<br /><br />The film starts with some creative opening credits (imagine a Disneyfied version of the animated opening credits of HBO's ""Carnivale"" and ""Rome""), but lumbers along slowly for its first by-the-numbers hour. Once the action moves to the U.S. Open things pick up very well. Paxton does a nice job and shows a knack for effective directorial flourishes (I loved the rain-soaked montage of the action on day two of the open) that propel the plot further or add some unexpected psychological depth to the proceedings. There's some compelling character development when the British Harry Vardon is haunted by images of the aristocrats in black suits and top hats who destroyed his family cottage as a child to make way for a golf course. He also does a good job of visually depicting what goes on in the players' heads under pressure. Golf, a painfully boring sport, is brought vividly alive here. Credit should also be given the set designers and costume department for creating an engaging period-piece atmosphere of London and Boston at the beginning of the twentieth century.<br /><br />You know how this is going to end not only because it's based on a true story but also because films in this genre follow the same template over and over, but Paxton puts on a better than average show and perhaps indicates more talent behind the camera than he ever had in front of it. 
Despite the formulaic nature, this is a nice and easy film to root for that deserves to find an audience."
2,1,"As a recreational golfer with some knowledge of the sport's history, I was pleased with Disney's sensitivity to the issues of class in golf in the early twentieth century. The movie depicted well the psychological battles that Harry Vardon fought within himself, from his childhood trauma of being evicted to his own inability to break that glass ceiling that prevents him from being accepted as an equal in English golf society. Likewise, the young Ouimet goes through his own class struggles, being a mere caddie in the eyes of the upper crust Americans who scoff at his attempts to rise above his standing. <br /><br />What I loved best, however, is how this theme of class is manifested in the characters of Ouimet's parents. His father is a working-class drone who sees the value of hard work but is intimidated by the upper class; his mother, however, recognizes her son's talent and desire and encourages him to pursue his dream of competing against those who think he is inferior.<br /><br />Finally, the golf scenes are well photographed. Although the course used in the movie was not the actual site of the historical tournament, the little liberties taken by Disney do not detract from the beauty of the film. There's one little Disney moment at the pool table; otherwise, the viewer does not really think Disney. The ending, as in ""Miracle,"" is not some Disney creation, but one that only human history could have written."
3,1,"I saw this film in a sneak preview, and it is delightful. The cinematography is unusually creative, the acting is good, and the story is fabulous. If this movie does not do well, it won't be because it doesn't deserve to. Before this film, I didn't realize how charming Shia Lebouf could be. He does a marvelous, self-contained, job as the lead. There's something incredibly sweet about him, and it makes the movie even better. The other actors do a good job as well, and the film contains moments of really high suspense, more than one might expect from a movie about golf. Sports movies are a dime a dozen, but this one stands out. <br /><br />This is one I'd recommend to anyone."
4,1,"Bill Paxton has taken the true story of the 1913 US golf open and made a film that is about much more than an extra-ordinary game of golf. The film also deals directly with the class tensions of the early twentieth century and touches upon the profound anti-Catholic prejudices of both the British and American establishments. But at heart the film is about that perennial favourite of triumph against the odds.<br /><br />The acting is exemplary throughout. Stephen Dillane is excellent as usual, but the revelation of the movie is Shia LaBoeuf who delivers a disciplined, dignified and highly sympathetic performance as a working class Franco-Irish kid fighting his way through the prejudices of the New England WASP establishment. For those who are only familiar with his slap-stick performances in ""Even Stevens"" this demonstration of his maturity is a delightful surprise. And Josh Flitter as the ten year old caddy threatens to steal every scene in which he appears.<br /><br />A old fashioned movie in the best sense of the word: fine acting, clear directing and a great story that grips to the end - the final scene an affectionate nod to Casablanca is just one of the many pleasures that fill a great movie."


In [15]:
test_df.shape

(25000, 2)
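
The training and test sections repeat the same steps with different folder names. One possible way to remove the duplication, sketched here with a hypothetical `load_split` function and exercised on a tiny temporary directory tree rather than the real dataset:

```python
import tempfile
from pathlib import Path

def load_split(split_dir):
    """Return (texts, labels) for a directory containing 'pos' and 'neg' subfolders."""
    texts, labels = [], []
    # pos=1, neg=0, the same convention as in the cells above
    for subfolder, label in (('pos', 1), ('neg', 0)):
        for path in sorted((Path(split_dir) / subfolder).glob('*.txt')):
            texts.append(path.read_text(encoding='utf-8'))
            labels.append(label)
    return texts, labels

# Demo on a temporary directory tree (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    for subfolder, name, text in [('pos', '0_9.txt', 'Loved it.'),
                                  ('neg', '0_3.txt', 'Hated it.')]:
        folder = Path(tmp) / subfolder
        folder.mkdir(exist_ok=True)
        (folder / name).write_text(text, encoding='utf-8')
    demo_texts, demo_labels = load_split(tmp)

print(demo_texts, demo_labels)  # ['Loved it.', 'Hated it.'] [1, 0]
```

On the real data this would be called as `texts, labels = load_split('../data/aclImdb/train')`, followed by the same `pd.DataFrame({'label': labels, 'reviews': texts})` construction used above.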

### Data Cleaning and Exporting

In [16]:
import re

In [17]:
train_df['reviews'] = train_df.reviews.apply(lambda s: re.sub(r'<br */>',' ', s))
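
The pattern `<br */>` matches both `<br />` and `<br/>` and replaces each tag with one space, so the `<br /><br />` pairs seen in the reviews above leave two consecutive spaces. The sketch below shows that behaviour on a made-up sample string and, as a hypothetical variant that the notebook does not use, a pattern that collapses a run of tags into a single space:

```python
import re

sample = 'Great start.<br /><br />Weak ending.<br/>Still fun.'

# Pattern used in the notebook: each tag becomes one space.
per_tag = re.sub(r'<br */>', ' ', sample)

# Hypothetical variant: a run of consecutive tags becomes one space.
per_run = re.sub(r'(?:<br */>)+', ' ', sample)

print(repr(per_tag))  # 'Great start.  Weak ending. Still fun.'
print(repr(per_run))  # 'Great start. Weak ending. Still fun.'
```

Most tokenizers ignore repeated whitespace, so either form is likely to work for a bag-of-words model.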

In [18]:
train_df.to_csv('train.csv')

In [19]:
test_df['reviews'] = test_df.reviews.apply(lambda s: re.sub(r'<br */>',' ', s))

In [20]:
test_df.to_csv('test.csv')
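
By default `to_csv` also writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `pd.read_csv`. A minimal sketch of the difference, using a toy DataFrame and in-memory buffers instead of the real `train.csv` and `test.csv`:

```python
import io
import pandas as pd

df = pd.DataFrame({'label': [1, 0], 'reviews': ['good film', 'bad film']})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the real columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'label', 'reviews']
print(cols_without_index)  # ['label', 'reviews']
```

If the extra column is not wanted downstream, `train_df.to_csv('train.csv', index=False)` avoids it; alternatively, `pd.read_csv('train.csv', index_col=0)` restores it as the index on load.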