Dataset: http://ai.stanford.edu/~amaas/data/sentiment/  
Tutorial: https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184  

## Creating a DataFrame from individual text files

### Training set

In [1]:
# use glob to create a list of train_pos_filenames
import glob
train_pos_filenames = glob.glob('../data/aclImdb/train/pos/*.txt')
train_neg_filenames = glob.glob('../data/aclImdb/train/neg/*.txt')
print("train_pos_text:")
print(train_pos_filenames[:10])
print("\ntrain_neg_text:")
print(train_neg_filenames[:10])

train_pos_text:
['../data/aclImdb/train/pos/4715_9.txt', '../data/aclImdb/train/pos/12390_8.txt', '../data/aclImdb/train/pos/8329_7.txt', '../data/aclImdb/train/pos/9063_8.txt', '../data/aclImdb/train/pos/3092_10.txt', '../data/aclImdb/train/pos/9865_8.txt', '../data/aclImdb/train/pos/6639_10.txt', '../data/aclImdb/train/pos/10460_10.txt', '../data/aclImdb/train/pos/10331_10.txt', '../data/aclImdb/train/pos/11606_10.txt']

train_neg_text:
['../data/aclImdb/train/neg/1821_4.txt', '../data/aclImdb/train/neg/10402_1.txt', '../data/aclImdb/train/neg/1062_4.txt', '../data/aclImdb/train/neg/9056_1.txt', '../data/aclImdb/train/neg/5392_3.txt', '../data/aclImdb/train/neg/2682_3.txt', '../data/aclImdb/train/neg/3351_4.txt', '../data/aclImdb/train/neg/399_2.txt', '../data/aclImdb/train/neg/10447_1.txt', '../data/aclImdb/train/neg/10096_1.txt']


In [2]:
print(len(train_pos_filenames))
print(len(train_neg_filenames))

12500
12500


In [3]:
%time
# read the contents of the train_pos files into a list (each list element is one review)
train_pos_text = []
for filename in train_pos_filenames:
    with open(filename) as f:
        train_pos_text.append(f.read())
print("train_pos_text:")
print(train_pos_text[0])


# read the contents of the train_pos files into a list (each list element is one review)
train_neg_text = []
for filename in train_neg_filenames:
    with open(filename) as f:
        train_neg_text.append(f.read())
print("\ntrain_neg_text:")
print(train_neg_text[0])

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 4.05 µs
train_pos_text:
For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan "The Skipper" Hale jr. as a police Sgt.

train_neg_text:
Working with one of the best Shakespeare sources, this film manages to be creditable to it's source, whilst still appealing to a wider audience.<br /><br />Branagh steals the film from under Fishburne's nose, and there's a talented cast on good form.


In [4]:
# combine the pos and neg for train lists
train_text = train_pos_text + train_neg_text
print(len(train_text))

# create a list of labels (pos=1, neg=0)
train_labels = [1]*len(train_pos_filenames) + [0]*len(train_neg_filenames)
print(len(train_labels))

25000
25000


In [5]:
import pandas as pd
# convert the lists into a DataFrame
train_df = pd.DataFrame({'label':train_labels, 'reviews':train_text})

In [6]:
pd.set_option('display.max_colwidth', -1)

In [7]:
train_df.head()

Unnamed: 0,label,reviews
0,1,"For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan ""The Skipper"" Hale jr. as a police Sgt."
1,1,"Bizarre horror movie filled with famous faces but stolen by Cristina Raines (later of TV's ""Flamingo Road"") as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her attempted suicides by guarding the Gateway to Hell! The scenes with Raines modeling are very well captured, the mood music is perfect, Deborah Raffin is charming as Cristina's pal, but when Raines moves into a creepy Brooklyn Heights brownstone (inhabited by a blind priest on the top floor), things really start cooking. The neighbors, including a fantastically wicked Burgess Meredith and kinky couple Sylvia Miles & Beverly D'Angelo, are a diabolical lot, and Eli Wallach is great fun as a wily police detective. The movie is nearly a cross-pollination of ""Rosemary's Baby"" and ""The Exorcist""--but what a combination! Based on the best-seller by Jeffrey Konvitz, ""The Sentinel"" is entertainingly spooky, full of shocks brought off well by director Michael Winner, who mounts a thoughtfully downbeat ending with skill. ***1/2 from ****"
2,1,"A solid, if unremarkable film. Matthau, as Einstein, was wonderful. My favorite part, and the only thing that would make me go out of my way to see this again, was the wonderful scene with the physicists playing badmitton, I loved the sweaters and the conversation while they waited for Robbins to retrieve the birdie."
3,1,"It's a strange feeling to sit alone in a theater occupied by parents and their rollicking kids. I felt like instead of a movie ticket, I should have been given a NAMBLA membership.<br /><br />Based upon Thomas Rockwell's respected Book, How To Eat Fried Worms starts like any children's story: moving to a new town. The new kid, fifth grader Billy Forrester was once popular, but has to start anew. Making friends is never easy, especially when the only prospect is Poindexter Adam. Or Erica, who at 4 1/2 feet, is a giant.<br /><br />Further complicating things is Joe the bully. His freckled face and sleeveless shirts are daunting. He antagonizes kids with the Death Ring: a Crackerjack ring that is rumored to kill you if you're punched with it. But not immediately. No, the death ring unleashes a poison that kills you in the eight grade.<br /><br />Joe and his axis of evil welcome Billy by smuggling a handful of slimy worms into his thermos. Once discovered, Billy plays it cool, swearing that he eats worms all the time. Then he throws them at Joe's face. Ewww! To win them over, Billy reluctantly bets that he can eat 10 worms. Fried, boiled, marinated in hot sauce, squashed and spread on a peanut butter sandwich. Each meal is dubbed an exotic name like the ""Radioactive Slime Delight,"" in which the kids finally live out their dream of microwaving a living organism.<br /><br />If you've ever met me, you'll know that I have an uncontrollably hearty laugh. I felt like a creep erupting at a toddler whining that his ""dilly dick"" hurts. But Fried Worms is wonderfully disgusting. Like a G-rated Farrelly brothers film, it is both vomitous and delightful.<br /><br />Writer/director Bob Dolman is also a savvy storyteller. To raise the stakes the worms must be consumed by 7 pm. In addition Billy holds a dark secret: he has an ultra-sensitive stomach.<br /><br />Dolman also has a keen sense of perspective. With such accuracy, he draws on children's insecurities and tendency to exaggerate mundane dilemmas.<br /><br />If you were to hyperbolize this movie the way kids do their quandaries, you will see that it is essentially about war. Freedom-fighter and freedom-hater use pubescent boys as pawns in proxy wars, only to learn a valuable lesson in unity. International leaders can learn a thing or two about global peacekeeping from Fried Worms.<br /><br />At the end of the film, I was comforted when two chaperoning mothers behind me, looked at each other with befuddlement and agreed, ""That was a great movie."" Great, now I won't have to register myself in any lawful databases."
4,1,"You probably all already know this by now, but 5 additional episodes never aired can be viewed on ABC.com I've watched a lot of television over the years and this is possibly my favorite show, ever. It's a crime that this beautifully written and acted show was canceled. The actors that played Laura, Whit, Carlos, Mae, Damian, Anya and omg, Steven Caseman - are all incredible and so natural in those roles. Even the kids are great. Wonderful show. So sad that it's gone. Of course I wonder about the reasons it was canceled. There is no way I'll let myself believe that Ms. Moynahan's pregnancy had anything to do with it. It was in the perfect time slot in this market. I've watched all the episodes again on ABC.com - I hope they all come out on DVD some day. Thanks for reading."


In [8]:
train_df.shape

(25000, 2)

### Test set

In [9]:
# use glob to create a list of test_pos_filenames
import glob
test_pos_filenames = glob.glob('../data/aclImdb/test/pos/*.txt')
test_neg_filenames = glob.glob('../data/aclImdb/test/neg/*.txt')
print("test_pos_text:")
print(test_pos_filenames[:10])
print("\ntest_neg_text:")
print(test_neg_filenames[:10])

test_pos_text:
['../data/aclImdb/test/pos/4715_9.txt', '../data/aclImdb/test/pos/1930_9.txt', '../data/aclImdb/test/pos/3205_9.txt', '../data/aclImdb/test/pos/10186_10.txt', '../data/aclImdb/test/pos/147_10.txt', '../data/aclImdb/test/pos/7511_7.txt', '../data/aclImdb/test/pos/616_10.txt', '../data/aclImdb/test/pos/10460_10.txt', '../data/aclImdb/test/pos/3240_9.txt', '../data/aclImdb/test/pos/1975_9.txt']

test_neg_text:
['../data/aclImdb/test/neg/1821_4.txt', '../data/aclImdb/test/neg/9487_1.txt', '../data/aclImdb/test/neg/4604_4.txt', '../data/aclImdb/test/neg/2828_2.txt', '../data/aclImdb/test/neg/10890_1.txt', '../data/aclImdb/test/neg/3351_4.txt', '../data/aclImdb/test/neg/8070_2.txt', '../data/aclImdb/test/neg/1027_4.txt', '../data/aclImdb/test/neg/8248_3.txt', '../data/aclImdb/test/neg/4290_4.txt']


In [10]:
print(len(test_pos_filenames))
print(len(test_neg_filenames))

12500
12500


In [11]:
%time
# read the contents of the test_pos files into a list (each list element is one review)
test_pos_text = []
for filename in test_pos_filenames:
    with open(filename) as f:
        test_pos_text.append(f.read())
print("test_pos_text:")
print(test_pos_text[0])


# read the contents of the test_neg files into a list (each list element is one review)
test_neg_text = []
for filename in test_neg_filenames:
    with open(filename) as f:
        test_neg_text.append(f.read())
print("\ntest_neg_text:")
print(test_neg_text[0])

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 7.87 µs
test_pos_text:
Based on an actual story, John Boorman shows the struggle of an American doctor, whose husband and son were murdered and she was continually plagued with her loss. A holiday to Burma with her sister seemed like a good idea to get away from it all, but when her passport was stolen in Rangoon, she could not leave the country with her sister, and was forced to stay back until she could get I.D. papers from the American embassy. To fill in a day before she could fly out, she took a trip into the countryside with a tour guide. "I tried finding something in those stone statues, but nothing stirred in me. I was stone myself." <br /><br />Suddenly all hell broke loose and she was caught in a political revolt. Just when it looked like she had escaped and safely boarded a train, she saw her tour guide get beaten and shot. In a split second she decided to jump from the moving train and try to rescue him, with no though

In [12]:
# combine the pos and neg for test lists
test_text = test_pos_text + test_neg_text
print(len(test_text))

# create a list of labels (pos=1, neg=0)
test_labels = [1]*len(test_pos_filenames) + [0]*len(test_neg_filenames)
print(len(test_labels))

25000
25000


In [13]:
import pandas as pd
# convert the lists into a DataFrame
test_df = pd.DataFrame({'label':test_labels, 'reviews':test_text})

In [14]:
test_df.head()

Unnamed: 0,label,reviews
0,1,"Based on an actual story, John Boorman shows the struggle of an American doctor, whose husband and son were murdered and she was continually plagued with her loss. A holiday to Burma with her sister seemed like a good idea to get away from it all, but when her passport was stolen in Rangoon, she could not leave the country with her sister, and was forced to stay back until she could get I.D. papers from the American embassy. To fill in a day before she could fly out, she took a trip into the countryside with a tour guide. ""I tried finding something in those stone statues, but nothing stirred in me. I was stone myself."" <br /><br />Suddenly all hell broke loose and she was caught in a political revolt. Just when it looked like she had escaped and safely boarded a train, she saw her tour guide get beaten and shot. In a split second she decided to jump from the moving train and try to rescue him, with no thought of herself. Continually her life was in danger. <br /><br />Here is a woman who demonstrated spontaneous, selfless charity, risking her life to save another. Patricia Arquette is beautiful, and not just to look at; she has a beautiful heart. This is an unforgettable story. <br /><br />""We are taught that suffering is the one promise that life always keeps."""
1,1,"This is a gem. As a Film Four production - the anticipated quality was indeed delivered. Shot with great style that reminded me some Errol Morris films, well arranged and simply gripping. It's long yet horrifying to the point it's excruciating. We know something bad happened (one can guess by the lack of participation of a person in the interviews) but we are compelled to see it, a bit like a car accident in slow motion. The story spans most conceivable aspects and unlike some documentaries did not try and refrain from showing the grimmer sides of the stories, as also dealing with the guilt of the people Don left behind him, wondering why they didn't stop him in time. It took me a few hours to get out of the melancholy that gripped me after seeing this very-well made documentary."
2,1,"I really like this show. It has drama, romance, and comedy all rolled into one. I am 28 and I am a married mother, so I can identify both with Lorelei's and Rory's experiences in the show. I have been watching mostly the repeats on the Family Channel lately, so I am not up-to-date on what is going on now. I think females would like this show more than males, but I know some men out there would enjoy it! I really like that is an hour long and not a half hour, as th hour seems to fly by when I am watching it! Give it a chance if you have never seen the show! I think Lorelei and Luke are my favorite characters on the show though, mainly because of the way they are with one another. How could you not see something was there (or take that long to see it I guess I should say)? <br /><br />Happy viewing!"
3,1,"This is the best 3-D experience Disney has at their themeparks. This is certainly better than their original 1960's acid-trip film that was in it's place, is leagues better than ""Honey I Shrunk The Audience"" (and far more fun), barely squeaks by the MuppetVision 3-D movie at Disney-MGM and can even beat the original 3-D ""Movie Experience"" Captain EO. This film relives some of Disney's greatest musical hits from Aladdin, The Little Mermaid, and others, and brought a smile to my face throughout the entire show. This is a totally kid-friendly movie too, unlike ""Honey..."" and has more effects than the spectacular ""MuppetVision"""
4,1,"Of the Korean movies I've seen, only three had really stuck with me. The first is the excellent horror A Tale of Two Sisters. The second and third - and now fourth too - have all been Park Chan Wook's movies, namely Oldboy, Sympathy for Lady Vengeance), and now Thirst. <br /><br />Park kinda reminds me of Quentin Tarantino with his irreverence towards convention. All his movies are shocking, but not in a gratuitous sense. It's more like he shows us what we don't expect to see - typically situations that go radically against society's morals, like incest or a libidinous, blood-sucking, yet devout priest. He's also quite artistically-inclined with regards to cinematography, and his movies are among the more gorgeous that I've seen.<br /><br />Thirst is all that - being about said priest and the repressed, conscience-less woman he falls for - and more. It's horror, drama, and even comedy, as Park disarms his audience with many inappropriate yet humorous situations. As such, this might be his best work for me yet, since his other two movies that I've seen were lacking the humor element that would've made them more palatable for repeat viewings."


In [15]:
test_df.shape

(25000, 2)

### Data Cleaning

In [16]:
import re

In [17]:
train_df['reviews'] = train_df.reviews.apply(lambda s: re.sub(r'<br */>',' ', s))

In [18]:
train_df.to_csv('train.csv')

In [19]:
test_df['reviews'] = test_df.reviews.apply(lambda s: re.sub(r'<br */>',' ', s))

In [20]:
test_df.to_csv('test.csv')