# Ch. 1 - Sentiment Analysis Nuts and Bolts

## How many positive and negative reviews are there?

As with other data science problems, a sensible first step in a sentiment analysis task is to explore the dataset in more detail.

You will work with a sample of the IMDB movie reviews. A dataset called `movies` has been created for you. It is a sample of the data we saw in the slides. Feel free to explore it in the IPython Shell, for example by calling the `.head()` method.

Be aware that this exercise uses real data, and as such there is always a risk that it may contain profanity or other offensive content (in this exercise, and any following exercises that also use real data).

### Instructions
* Find the number of positive and negative reviews in the `movies` dataset.
* Find the proportion (share of the total) of positive and negative reviews in the dataset.

In [None]:
# Find the number of positive and negative reviews
print('Number of positive and negative reviews: ', movies.label.value_counts())

# Find the proportion of positive and negative reviews
print('Proportion of positive and negative reviews: ', movies.label.value_counts() / len(movies))
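
As a rough sketch of the same counting logic, the snippet below builds a small hypothetical DataFrame (`toy_movies`, invented here and not the course's `movies` data) and uses `value_counts(normalize=True)`, which returns proportions directly and should give the same result as dividing the counts by the number of rows.

```python
import pandas as pd

# Hypothetical toy data standing in for the course's `movies` DataFrame
toy_movies = pd.DataFrame({
    "review": ["Loved it", "Terrible plot", "A masterpiece", "Great cast"],
    "label": [1, 0, 1, 1],
})

# Absolute number of reviews per label
counts = toy_movies["label"].value_counts()

# Share of reviews per label; normalize=True divides each count by the total
proportions = toy_movies["label"].value_counts(normalize=True)

print(counts)
print(proportions)
```

Multiplying `proportions` by 100 would turn the shares into percentages.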

## Longest and shortest reviews

In this exercise, you will continue to work with the `movies` dataset. You have already explored how many positive and negative reviews there are. Now your task is to explore the `review` column in more detail.

### Instructions

#### Section 1
* Use the `review` column of the `movies` dataset to find the length of the longest review.

#### Section 2
* Similarly, find the length of the shortest review.

In [None]:
# Section 1
length_reviews = movies.review.str.len()

# How long is the longest review
print(max(length_reviews))

# Section 2
length_reviews = movies.review.str.len()

# How long is the shortest review
print(min(length_reviews))
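
The sketch below repeats the length calculation on a hypothetical three-row DataFrame (`toy_movies`, not the course data) and additionally looks up which review is the longest and shortest. It uses the Series methods `.max()`, `.min()`, `.idxmax()`, and `.idxmin()` instead of the built-in `max` and `min`; the lengths are character counts, as in the exercise.

```python
import pandas as pd

# Hypothetical toy data, not the course's `movies` DataFrame
toy_movies = pd.DataFrame({
    "review": ["Loved it", "Terrible plot and worse acting", "Fine"],
})

# Number of characters in each review
length_reviews = toy_movies["review"].str.len()

longest = length_reviews.max()
shortest = length_reviews.min()

# idxmax/idxmin return the index label of the longest/shortest review
longest_review = toy_movies.loc[length_reviews.idxmax(), "review"]
shortest_review = toy_movies.loc[length_reviews.idxmin(), "review"]

print(longest, repr(longest_review))
print(shortest, repr(shortest_review))
```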

## Detecting the sentiment of A Tale of Two Cities

In the video we saw that one type of algorithm for detecting sentiment is based on a lexicon of predefined words and their corresponding polarity scores. Your task in this exercise is to detect the sentiment of a given string, including its polarity and subjectivity, using such a rule-based approach and the `textblob` library in Python.

You will work with the `two_cities` string. It contains the first sentence of Dickens's novel _A Tale of Two Cities_. Feel free to explore it in the Shell.

### Instructions
* Create a text blob object from the `two_cities` string.
* Print out the polarity and subjectivity.

In [None]:
# Import the required packages
from textblob import TextBlob

# Create a textblob object  
blob_two_cities = TextBlob(two_cities)

# Print out the sentiment 
print(blob_two_cities.sentiment)
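
To make the lexicon idea concrete, here is a deliberately simplified sketch: it averages the scores of the words it finds in a tiny hand-made lexicon. The lexicon, its scores, and the function `toy_polarity` are all hypothetical. This is not how `textblob` computes its scores; the real analyzer relies on a much larger lexicon and additional rules.

```python
import re

# Hypothetical miniature lexicon: word -> polarity score in [-1, 1]
TOY_LEXICON = {"best": 1.0, "good": 0.7, "worst": -1.0, "bad": -0.7}

def toy_polarity(text):
    """Average the lexicon scores of the known words in `text`; 0.0 if none match."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("It was the best of times, it was the worst of times"))
print(toy_polarity("A good film with the best ending"))
```

In this toy scheme, a sentence that mixes strongly positive and strongly negative words can average out to a neutral score, which is worth keeping in mind when reading the polarity of a real sentence.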

## Comparing the sentiment of two strings

In this exercise, you will compare the sentiment of two different strings. A string called `annak` has been defined for you and it contains the first sentence of Anna Karenina. A second string called `catcher` has been created and it contains the first sentence of The Catcher in the Rye. Feel free to explore both in the IPython Shell.

Your task is again to detect the sentiment of each string, both its polarity and its subjectivity. Which one has a higher sentiment score? Did you expect that to be the case?

### Instructions
* Import the required function from the appropriate package.
* Create a text blob object from the `annak` string.
* Create a text blob object from the `catcher` string as well.
* Print out the polarity and subjectivity of each of the created blobs.

In [None]:
# Import the required packages
from textblob import TextBlob

# Create a textblob object 
blob_annak = TextBlob(annak)
blob_catcher = TextBlob(catcher)

# Print out the sentiment   
print('Sentiment of annak: ', blob_annak.sentiment)
print('Sentiment of catcher: ', blob_catcher.sentiment)

## What is the sentiment of a movie review?

In a previous exercise, you detected the sentiment of the first sentence of Dickens's novel _A Tale of Two Cities_. Now you will continue to work with the movie reviews dataset. Do you remember how you found the longest and shortest reviews? One of the longest reviews has been imported for you. It is called `titanic` because it discusses the movie Titanic. Feel free to explore it in the Shell.

Can you calculate the polarity and subjectivity of the `titanic` string? This review is positive (i.e. has a label of 1). Is the polarity score also positive?

### Instructions
* Import the required functionality.
* Create a text blob object from the `titanic` string.
* Print out the result of its `sentiment` property.

In [None]:
# Import the required packages
from textblob import TextBlob

# Create a textblob object  
blob_titanic = TextBlob(titanic)

# Print out its sentiment  
print(blob_titanic.sentiment)
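
One way to compare a polarity score with a dataset label is to threshold it. The sketch below assumes, as in the exercise, that a label of 1 means positive and 0 means negative, and it treats any polarity above 0 as positive. The helper `polarity_to_label` and the cutoff of 0 are illustrative choices, not part of `textblob`, and the example scores are made up.

```python
def polarity_to_label(polarity, threshold=0.0):
    """Map a polarity score in [-1, 1] to 1 (positive) or 0 (negative)."""
    return 1 if polarity > threshold else 0

# Made-up (polarity, true_label) pairs for illustration only
examples = [(0.21, 1), (-0.35, 0), (0.05, 0)]

for polarity, true_label in examples:
    predicted = polarity_to_label(polarity)
    # The last column shows whether the thresholded score agrees with the label
    print(polarity, predicted, predicted == true_label)
```

The third pair shows that a weakly positive score can disagree with a negative label, so the choice of threshold matters.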

## Your first word cloud

We saw in the video that word clouds are an intuitive and fast way to get a first impression of what a piece of text is about.

In this exercise, you will build your first word cloud. A string `east_of_eden` has been defined for you. It contains one of the first sentences of John Steinbeck's novel _East of Eden_. You can inspect its contents in the IPython Shell.

The `matplotlib.pyplot` package has been imported for you as `plt`.

### Instructions

#### Section 1
* Import the required package to build a word cloud.
* Generate a word cloud using the `east_of_eden` string. The background color has been specified as white.

#### Section 2
* Create a figure from the word cloud object you generated in the previous step.
* Display the image.

In [None]:
from wordcloud import WordCloud

# Generate the word cloud from the east_of_eden string
cloud_east_of_eden = WordCloud(background_color="white").generate(east_of_eden)

# Create a figure of the generated cloud
plt.imshow(cloud_east_of_eden, interpolation='bilinear')  
plt.axis('off')
# Display the figure
plt.show()

## Word Cloud on movie reviews

You have been working with the movie reviews dataset. You have explored the distribution of the reviews and have seen how long the longest and the shortest reviews are. But what do positive and negative reviews talk about?

In this exercise, you will practice building a word cloud of the top 100 positive reviews.

What are the words that pop up? Do they make sense to you?

The string `descriptions` has been created for you by concatenating the descriptions of the top 100 positive reviews. A movie-specific set of stopwords is available as `my_stopwords`. Stopwords are very frequent words, such as "the", "a/an", and "and", which are not very informative and which we would like to exclude from the graph. Recall that the `interpolation` argument of `plt.imshow()` makes the displayed word cloud appear smoother.

### Instructions
* Import the `WordCloud` class from the `wordcloud` package.
* Generate a word cloud from the `descriptions` string. Set the background color to `'white'` and pass `my_stopwords` to the `stopwords` argument.
* Create a wordcloud image.
* Finally, do not forget to display the image.

In [None]:
# Import the WordCloud class
from wordcloud import WordCloud

# Create and generate a word cloud image 
my_cloud = WordCloud(background_color='white', stopwords=my_stopwords).generate(descriptions)

# Display the generated wordcloud image
plt.imshow(my_cloud, interpolation='bilinear') 
plt.axis("off")

# Don't forget to show the final image
plt.show()
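
A word cloud sizes each word by how often it occurs once stopwords are removed. The sketch below reproduces only that counting step with the standard library. Here `toy_text` and `toy_stopwords` are hypothetical stand-ins for `descriptions` and `my_stopwords`, and the tokenization is much cruder than what the `wordcloud` package does.

```python
import re
from collections import Counter

# Hypothetical stand-ins for `descriptions` and `my_stopwords`
toy_text = "The movie was great and the cast was great. A great movie!"
toy_stopwords = {"the", "a", "and", "was"}

# Lowercase the text and keep alphabetic tokens only
words = re.findall(r"[a-z]+", toy_text.lower())

# Count every token that is not a stopword
frequencies = Counter(w for w in words if w not in toy_stopwords)

print(frequencies.most_common(3))
```

Inspecting the most common words this way can help decide which additional words belong in a custom stopword set before drawing the cloud.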