1\. Let's predict the sentiment!
--------------------------------

00:00 - 00:07

In this final chapter, we will use a supervised learning model to predict the sentiment.

2\. Classification problems
---------------------------

00:07 - 00:42

Imagine we are working with the product reviews. A supervised learning task will try to classify any new review as either positive or negative based on already labeled reviews. This is what we call a classification problem. In the case of the product and movie reviews, we have two classes - positive and negative. This is a binary classification problem. The airline sentiment Twitter data has three categories of sentiment: positive, neutral and negative. This is a multi-class classification problem.

- Product and movie reviews: positive or negative sentiment (binary classification)
- Tweets about airline companies: positive, neutral and negative (multi-class classification)

3\. Linear and logistic regressions
-----------------------------------

00:42 - 01:12

One algorithm commonly applied in classification tasks is a logistic regression. You might be familiar with a linear regression, where we fit a straight line to approximate a relationship, shown in the graph on the left. With a logistic regression, instead of fitting a line, we are fitting an S-shaped curve, called a sigmoid function. A property of this function is that for any value of x, y will be between 0 and 1.

```
                Y=1 |----- • • • •.          Y=1 |------------ • • •
                    |           /                 |          ••'
                    |         /                   |        •'
                    |       /                     |      •'
                    |     /                       |    •'
                    |   /                         |  •'
                    | /                           |•'
                Y=0 |/___________________         |___________________
                    |           X-axis            |           X-axis
                    |                            |
            Linear regression              Logistic regression 
```

Figure: linear regression (left) fits a straight line to binary data points, while logistic regression (right) fits an S-shaped sigmoid curve to the same kind of data. The sigmoid is more appropriate for binary classification because it bounds the output between 0 and 1.

4\. Logistic function
---------------------

01:12 - 01:49

When performing linear regression, we are predicting a numeric outcome (say, the sale price of a house). With logistic regression, we estimate the probability that the outcome (sentiment) belongs to a particular category (positive or negative) given the review. Since we are estimating a probability and want an output between 0 and 1, we pass a linear combination of the X values through the sigmoid (logistic) function, as shown on the graph. For more details on logistic regression, refer to other courses on DataCamp.

Linear regression: numeric outcome

Logistic regression: probability:

Probability(sentiment = positive|review)
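
To make the sigmoid concrete, here is a minimal sketch in plain Python. The `sigmoid` helper and the example scores `z` are illustrative assumptions, not part of the course code; in a fitted model, `z` would be the intercept plus the weighted sum of a review's features.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear scores for five reviews (assumed values, for illustration only)
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"z = {z:+.1f} -> P(sentiment = positive | review) = {sigmoid(z):.3f}")
```

Very negative scores map close to 0, very positive scores close to 1, and a score of 0 maps to exactly 0.5.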

```
Y = 1 ▬ ▬ ▬ ▬ ▬ ▬ ▬ ▬ ▬
        •     •  •
        /‾‾
      /
Y = 0 _____•_•_•____________
           X-axis
     Logistic regression
```

5\. Logistic regression in Python
---------------------------------

01:49 - 02:31

In Python, we import LogisticRegression from the sklearn.linear_model module. Keep in mind that the sklearn API works only with numeric features. It also requires either a DataFrame or an array as arguments and cannot handle missing data. Therefore, all transformation of the data needs to be completed beforehand. We call the LogisticRegression constructor to create a logistic classifier object. We fit it by specifying the X matrix, which is a NumPy array of our features or a pandas DataFrame, and the vector of targets y.

```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression().fit(X, y)
```

6\. Measuring model performance
-------------------------------

02:31 - 03:15

How do we know if the model is any good? We look at the discrepancy between the predicted label and what was the real label for each instance (observation) in our dataset. One common metric to use is the accuracy score. Though not appropriate in all contexts, it is still useful. Accuracy gives us the fraction of predictions that our model got right. The higher and closer it is to 1, the better. One way we can calculate the accuracy score of a logistic regression model is by calling the score method on the logistic regression object. It takes as arguments the X matrix and y vector.

Accuracy: Fraction of predictions our model got right.

The higher and closer the accuracy is to 1, the better

```python
# Accuracy using score
score = log_reg.score(X, y)
print(score)
```
```
0.9009
```
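
As a rough sketch of what the accuracy represents, we can compute it by hand as the share of predictions that match the true labels. The `accuracy` helper and the toy label lists below are made up for illustration; they are not the course data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels: 1 = positive review, 0 = negative review
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

print(accuracy(y_true, y_pred))  # 8 of the 10 predictions match
```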

7\. Using accuracy score
------------------------

03:15 - 04:02

Alternatively, we can use the accuracy_score function from sklearn.metrics. There is a separate accuracy_score function in addition to the score method because different models have different default score metrics. Thus, the accuracy_score function always returns the accuracy, but the score method might return other metrics if we use it to evaluate other models. Here, we need to explicitly calculate the predictions of the model, by calling predict on the matrix of features. The accuracy_score function takes as arguments the vector of true labels and the predicted labels. We see that in the case of logistic regression, both score and accuracy_score return a value of 0.9009.

```python
# Accuracy using accuracy_score
from sklearn.metrics import accuracy_score

y_predicted = log_reg.predict(X)
accuracy = accuracy_score(y, y_predicted)
print(accuracy)
```
```
0.9009
```

8\. Let's practice!
-------------------

04:02 - 04:19

Can we trust such high accuracy? We should be careful about drawing strong conclusions just yet. In the next video, we will see how to check how robust the model performance is, but before that, let's solve some exercises!

Logistic regression of movie reviews
====================================

In the video we learned that logistic regression is a common way to model a classification task, such as classifying the sentiment as positive or negative. 

In this exercise, you will work with the `movies` reviews dataset. The `label` column stores the sentiment, which is `1` when the review is positive, and `0` when negative. The text review has been transformed, using BOW, to numeric columns.

Your task is to build a logistic regression model using the `movies` dataset and calculate its accuracy.

Instructions
------------

-   Import the logistic regression function.
-   Create and fit a logistic regression on the labels `y` and the features `X`.
-   Calculate the accuracy of the logistic regression model, using the default `.score()` method.

In [None]:
# Import the logistic regression
from sklearn.linear_model import LogisticRegression

# Define the vector of targets and matrix of features
y = movies.label
X = movies.drop('label', axis=1)

# Build a logistic regression model and calculate the accuracy
log_reg = LogisticRegression().fit(X, y)
print('Accuracy of logistic regression: ', log_reg.score(X, y))

Logistic regression using Twitter data
======================================

In this exercise, you will build a logistic regression model using the `tweets` dataset. The target is given by the `airline_sentiment`, which is `0` for negative tweets, `1` for neutral, and `2` for positive ones. So, in this case, you are given a multi-class classification task. Everything we learned about binary problems applies to multi-class classification problems as well. 

You will evaluate the accuracy of the model using the two different approaches from the slides. 

The logistic regression function and accuracy score have been imported for you.

Instructions
------------

-   Build and fit a logistic regression model using the defined `X` and `y` as arguments. 
-   Calculate the accuracy of the logistic regression model.
-   Predict the labels.
-   Calculate the *accuracy score* using the predicted and true labels.

In [None]:
# Define the vector of targets and matrix of features
y = tweets.airline_sentiment
X = tweets.drop('airline_sentiment', axis=1)

# Build a logistic regression model and calculate the accuracy
log_reg = LogisticRegression().fit(X, y)
print('Accuracy of logistic regression: ', log_reg.score(X,y))

# Create an array of predictions
y_predict = log_reg.predict(X)

# Print the accuracy using accuracy score
print('Accuracy of logistic regression: ', accuracy_score(y, y_predict))

1\. Did we really predict the sentiment well?
---------------------------------------------

00:00 - 00:17

In the previous video, we used all of the available data to build a logistic regression model and assess its accuracy. However, we want to make sure our machine learning model generalizes and performs well on unseen data. How do we do that?

2\. Train/test split
--------------------

00:17 - 00:54

To get an idea of how well a model will perform on unseen data, we randomly split the dataset into 2 parts: one used for training (building the model) and one for testing (evaluating the performance of the model). In some cases, when we want to tune the parameters of our algorithm, we might have 3 sets: training, testing and validation, but this is out of scope for our course. The training set is usually around 70 or 80% of the whole dataset, and the rest is used for testing.

```
[Training data]                                    [Test data]
                  |______________________________|
                  Total number of observations
```

- Training set: used to train the model (70-80% of the whole data)  
- Testing set: used to evaluate the performance of the model

3\. Train/test in Python
------------------------

00:54 - 01:59

In Python, we can perform a random train-test split using the train_test_split function from the sklearn.model_selection module. It takes as arguments arrays, lists, or DataFrames. Its outputs are the X_train and X_test matrices and the y_train and y_test vectors. The first arguments we provide are the features matrix X and labels vector y. We can specify the proportion of the data going to testing; here, it is equal to 0.2. Another parameter is the random state, which is the seed of the random number generator used to make the split. It ensures that every time you perform the train-test split on the same data, you will get the same instances in each set. We can also specify the stratify argument. If we want to ensure that the train and test sets have similar class proportions, we can do that by setting stratify equal to y.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, stratify=y)
```

- X : features
- y : labels  
- test_size: proportion of data used in testing
- random_state: seed of the random number generator used to make the split
- stratify: proportion of classes in the sample produced will be the same as the proportion of values provided to this parameter
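
To illustrate what a fixed seed and stratification achieve, here is a simplified sketch in plain Python. The `stratified_split` helper and the toy labels are our own assumptions; this is not how sklearn implements `train_test_split`, only a picture of the idea.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=123):
    """Return (train_idx, test_idx), keeping class proportions similar in both sets."""
    rng = random.Random(seed)  # fixed seed -> the same split on every run
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)  # take the same share of every class
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 80% negative (0) and 20% positive (1)
y = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(y, test_size=0.2, seed=123)

print("train:", Counter(y[i] for i in train_idx))
print("test: ", Counter(y[i] for i in test_idx))
```

Both sets keep the 80/20 class ratio, and calling the function again with the same seed returns the same indices.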

4\. Logistic regression with train/test split
---------------------------------------------

01:59 - 02:32

Let's revisit our logistic regression example, executed after a train-test split. We create the LogisticRegression object and fit it on the training set. We can calculate the accuracy on the training data, calling score on the logistic regression with arguments X_train and y_train. We can also calculate the accuracy score of the model using the test set - X_test and y_test. It is slightly lower than the accuracy on the training data, which is usually the case.

```python
log_reg = LogisticRegression().fit(X_train, y_train)

print('Accuracy on training data: ', log_reg.score(X_train, y_train))
```
```
0.76
```

```python  
print('Accuracy on testing data: ', log_reg.score(X_test, y_test))
```
```
0.73
```

5\. Accuracy score with train/test split
----------------------------------------

02:32 - 02:59

You may recall that another way to calculate the accuracy was to use the accuracy_score function from sklearn.metrics. After we have built the logistic regression model, we call predict on it, specifying X_test as an argument. In the last step, we call accuracy_score on the true and predicted labels. The value is identical to the accuracy produced by the score method.

```python
from sklearn.metrics import accuracy_score

log_reg = LogisticRegression().fit(X_train, y_train)

y_predicted = log_reg.predict(X_test)
print('Accuracy score on test data: ', accuracy_score(y_test, y_predicted))
```
```
0.73
```

6\. Confusion matrix
--------------------

02:59 - 03:33

The accuracy is a useful measure of a model's performance, but it's not always the most informative. We can instead use something called a 'confusion matrix'. It cross-tabulates the predicted and true classes, as displayed in the table. Each cell counts the observations with a given combination of true and predicted class, so we can say how many observations of each class we have predicted correctly. For more details of when we would want to optimize for the different cells, refer to other DataCamp courses.

| | Actual = 1 | Actual = 0 |
|---|---|---|
| Predicted = 1 | True positive | False positive |
| Predicted = 0 | False negative | True negative |

7\. Confusion matrix in Python
------------------------------

03:33 - 04:00

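In Python, we import confusion_matrix from the sklearn.metrics module. After we have built our logistic regression and predicted the test set labels, we call confusion_matrix, passing the true and predicted labels as arguments. Note that sklearn places the true classes in the rows and the predicted classes in the columns, both in sorted label order (0, then 1), so its layout differs from the table on the previous slide. We have divided the matrix by the length of the y_test vector in order to obtain proportions in the cells of the matrix.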

```python
from sklearn.metrics import confusion_matrix

log_reg = LogisticRegression().fit(X_train, y_train)
y_predicted = log_reg.predict(X_test)

print(confusion_matrix(y_test, y_predicted)/len(y_test))
```
```
[[0.3788 0.1224]
 [0.1352 0.3636]]
```
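
To see where the four cells come from, here is a small hand-computed sketch on made-up labels (not the course data). It arranges the proportions with true classes in rows and predicted classes in columns, in the order 0 then 1.

```python
# Toy labels: 1 = positive, 0 = negative (assumed values, for illustration only)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

n = len(y_true)
# Rows: true class 0, then 1. Columns: predicted class 0, then 1.
proportions = [[tn / n, fp / n],
               [fn / n, tp / n]]

for row in proportions:
    print(row)
print("Accuracy from the diagonal:", (tp + tn) / n)
```

The diagonal cells hold the correct predictions, so their sum is the accuracy.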

8\. Let's practice!
-------------------

04:00 - 04:04

Now let's solve some exercises!

Build and assess a model: movies reviews
========================================

In this problem, you will build a logistic regression model using the `movies` dataset. The sentiment is stored in the `label` column and is `1` when the review is positive, and `0` when negative. The text review has been transformed, using BOW, to numeric columns.

You have already built a classifier but evaluated it using the same data employed in the training step. Make sure you now assess the model using an unseen test dataset. How does the performance of the model change when evaluated on the test set?

Instructions
------------

-   Import the function required for a train/test split.
-   Perform the train/test split, specifying that 20% of the data should be used as a test set.
-   Train a logistic regression model.
-   Print out the accuracy of the model on the training and on the testing data.

In [None]:
# Import the required packages
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Define the vector of labels and matrix of features
y = movies.label
X = movies.drop('label', axis=1)

# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build a logistic regression model and print out the accuracy
log_reg = LogisticRegression().fit(X_train,y_train)
print('Accuracy on train set: ', log_reg.score(X_train, y_train))
print('Accuracy on test set: ', log_reg.score(X_test, y_test))

Performance metrics of Twitter data
===================================

You will train a logistic regression model that predicts the sentiment of tweets and evaluate its performance on the test set using different metrics. 

A matrix `X` has been created for you. It contains features created with a BOW on the `text` column.

The labels are stored in a vector called `y`. Vector `y` is `0` for negative tweets, `1` for neutral, and `2` for positive ones.

Note that although we have 3 classes, it is still a classification problem. The accuracy still measures the proportion of correctly predicted instances. The confusion matrix will now be of size 3x3: with sklearn's `confusion_matrix`, each row corresponds to a true class and each column to a predicted class, both ordered 0, 1, and 2.
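
As a sketch of how a 3x3 matrix is filled, the snippet below counts the cases by hand with rows for the true class and columns for the predicted class. The `confusion_counts` helper and the toy labels are hypothetical and only meant to illustrate the layout.

```python
def confusion_counts(y_true, y_pred, labels):
    """Count matrix: rows follow the true class, columns the predicted class."""
    position = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[position[true_label]][position[predicted_label]] += 1
    return matrix

# Toy labels: 0 = negative, 1 = neutral, 2 = positive
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 0, 1, 2, 2, 1]

counts = confusion_counts(y_true, y_pred, labels=[0, 1, 2])
for row in counts:
    print(row)

# Correct predictions sit on the diagonal
correct = sum(counts[i][i] for i in range(len(counts)))
print("Accuracy:", correct / len(y_true))
```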

All required packages have been imported for you.

Instructions
------------

-   Perform the train/test split, and stratify by `y`.
-   Train a logistic regression classifier.
-   Predict the labels of the test set.
-   Print the accuracy score and confusion matrix obtained on the test set.

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123, stratify=y)

# Train a logistic regression
log_reg = LogisticRegression().fit(X_train, y_train)

# Make predictions on the test set
y_predicted = log_reg.predict(X_test)

# Print the performance metrics
print('Accuracy score test set: ', accuracy_score(y_test, y_predicted))
print('Confusion matrix test set: \n', confusion_matrix(y_test, y_predicted)/len(y_test))

Build and assess a model: product reviews data
==============================================

In this exercise, you will build a logistic regression using the `reviews` dataset, containing customers' reviews of Amazon products. The array `y` contains the sentiment: `1` if positive and `0` otherwise. The array `X` contains all numeric features created using a BOW approach. Feel free to explore them in the IPython Shell.

Your task is to build a logistic regression model and calculate the accuracy and confusion matrix using the test dataset. 

The logistic regression and train/test splitting functions have been imported for you.

Instructions
------------

-   Import the accuracy score and confusion matrix functions.
-   Split the data into training and testing, using 30% of it as a test set and set the random seed to `42`. 
-   Train a logistic regression model.
-   Print out the accuracy score and confusion matrix using the test data.

In [None]:
# Import the accuracy and confusion matrix
from sklearn.metrics import accuracy_score, confusion_matrix

# Split the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3, random_state=42)

# Build a logistic regression
log_reg = LogisticRegression().fit(X_train,y_train)

# Predict the labels 
y_predict = log_reg.predict(X_test)

# Print the performance metrics
print('Accuracy score of test data: ', accuracy_score(y_test, y_predict))
print('Confusion matrix of test data: \n', confusion_matrix(y_test, y_predict)/len(y_test))

1\. Logistic regression: revisited
----------------------------------

00:00 - 00:14

Before we built a logistic regression using text features, we transformed the text fields to numeric columns. As a result, we might end up having hundreds or even thousands of features, which can make the model quite complex.

2\. Complex models and regularization
-------------------------------------

00:14 - 01:03

A complex model can occur in a few scenarios. If we use a very complicated function to explain the relationship of interest, we will inevitably fit the noise in the data. Such a model will not perform well when used to score unseen data. This is also called overfitting. A complex model could stem from including too many unnecessary features and parameters; especially with transformed text data, where we might create thousands of extra numeric columns. These two sources of complexity often go hand-in-hand. One way to artificially discourage complex models is by the use of regularization. When using regularization, we are penalizing, or restricting the function of the model.

Complex models:
- Complex model that captures the noise in the data (overfitting)
- Having a large number of features or parameters

Regularization:
- A way to simplify and ensure we have a less complex model

3\. Regularization in a logistic regression
-------------------------------------------

01:03 - 02:11

Regularization is applied by default in the logistic regression function from sklearn. It uses the so-called L2 penalty; the details of it are outside of the scope of this course, but intuitively it's good to know that the L2 penalty shrinks all the coefficients towards zero, effectively reducing the impact of each feature. The strength of regularization is controlled by the parameter C, which takes a default value of 1. Higher values of C correspond to less regularization; in other words, the model will try to fit the training data as well as possible. Small values of C correspond to high penalization (or regularization), meaning that the coefficients of the logistic regression will be closer to zero; the model will be less flexible because it will not fit the training data so well. How do we find the most appropriate value of C? Usually we need to test different values and see which one gives us the best performance on the test data.

```python
from sklearn.linear_model import LogisticRegression

# Regularization arguments
LogisticRegression(penalty='l2', C=1.0)
```

- L2: shrinks all coefficients towards zero
- High values of C: low penalization, model fits the training data well
- Low values of C: high penalization, model less flexible
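
The trade-off controlled by C can be sketched as an objective of the form "L2 penalty + C × data loss". This is a simplified illustration under our own assumptions (toy data, hand-picked weights, and the helper names `log_loss_sum` and `penalized_objective`); the exact scaling and the solver that sklearn uses may differ.

```python
import math

def log_loss_sum(weights, bias, X, y):
    """Sum of logistic losses over all observations (labels are 0 or 1)."""
    total = 0.0
    for features, label in zip(X, y):
        z = bias + sum(w * x for w, x in zip(weights, features))
        p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total

def penalized_objective(weights, bias, X, y, C=1.0):
    """L2 penalty plus C times the data loss; smaller values are better."""
    l2_penalty = 0.5 * sum(w * w for w in weights)
    return l2_penalty + C * log_loss_sum(weights, bias, X, y)

# Toy data: two hypothetical word-count features per review
X = [[2, 0], [1, 1], [0, 2], [0, 1]]
y = [1, 1, 0, 0]

large_weights = [3.0, -3.0]  # fit the toy data closely
small_weights = [0.5, -0.5]  # shrunk towards zero

for C in (1000, 0.001):
    large = penalized_objective(large_weights, 0.0, X, y, C=C)
    small = penalized_objective(small_weights, 0.0, X, y, C=C)
    preferred = "large" if large < small else "small"
    print(f"C={C}: the {preferred} weights give the lower objective")
```

With C=1000 the data loss dominates, so the larger weights that fit the data closely are preferred; with C=0.001 the penalty dominates, and the weights shrunk towards zero win.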

4\. Predicting a probability vs. predicting a class
---------------------------------------------------

02:11 - 02:39

You should recall that when we trained a logistic regression model, we applied the predict function to the test set to predict the labels. The predict function predicts a class: 0 or 1 if we are working with a binary classifier. However, instead of a class, we can predict a probability using the predict_proba function. We again pass as an argument the test dataset.

```python
log_reg = LogisticRegression().fit(X_train, y_train)

# Predict labels
y_predicted = log_reg.predict(X_test)

# Predict probability 
y_probab = log_reg.predict_proba(X_test)
```

5\. Predicting a probability vs. predicting a class
---------------------------------------------------

02:39 - 03:07

This returns an array of probabilities, ordered by the label of classes - first the class 0, then the class 1. The probabilities for each observation are displayed on a separate row. The first value is the probability that the instance is of class 0, and the second that it is of class 1. Therefore, it is common when predicting probabilities to extract only the probabilities of class 1.

```python
y_probab
array([[0.5002245, 0.4997755],
       [0.4900345, 0.5099655],
       ...,
       [0.7040499, 0.2959501]])

# Select the probabilities of class 1
y_probab = log_reg.predict_proba(X_test)[:, 1]

array([0.4997755, 0.5099655 ..., 0.2959501])
```

6\. Model metrics with predicted probabilities
----------------------------------------------

03:07 - 04:06

One important thing to know is that we cannot directly apply the accuracy score or confusion matrix to the predicted probabilities. If you do that in sklearn, you will get a ValueError. The reason is that the accuracy and the confusion matrix work directly with classes. If we have predicted probabilities, we need to encode them as classes. The default is that any probability greater than or equal to 0.5 is translated to class 1, and otherwise to class 0. However, you can change that threshold depending on your problem. Imagine only 1% of the reviews are positive and you have built a model to predict whether a new review is positive or negative. In that context, you don't want to require a predicted probability of at least 0.5 before assigning class 1; the threshold should be much lower.

- Accuracy score and confusion matrix raise a ValueError when applied to probabilities
- Accuracy score and confusion matrix work with classes

```python
# Default probability encoding:
# If probability >= 0.5, then class 1 Else class 0
```
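
A minimal sketch of this encoding step, with an adjustable threshold, could look as follows. The `to_classes` helper and the probabilities are hypothetical; in practice the input would be the class 1 column returned by `predict_proba`.

```python
def to_classes(probabilities, threshold=0.5):
    """Encode predicted probabilities of class 1 as class labels 0 or 1."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical predicted probabilities of class 1
y_probab = [0.50, 0.51, 0.30, 0.08, 0.02]

default_classes = to_classes(y_probab)                  # threshold of 0.5
lenient_classes = to_classes(y_probab, threshold=0.05)  # lower threshold for a rare positive class

print(default_classes)
print(lenient_classes)
```

The resulting lists contain classes rather than probabilities, so they can be passed to the accuracy score or the confusion matrix.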

7\. Let's practice!
-------------------

04:06 - 04:11

Let's apply what we've learned in the exercises!

Predict probabilities of movie reviews
======================================

In this problem, you will build a logistic regression using the `movies` dataset. The labels are stored in the array `y` and the features in `X`.

Train the model on the training data. Instead of predicting classes, predict the probabilities that each instance in the test set belongs to each of the two classes.

The logistic regression and train/test splitting functions have been imported for you.

Instructions
------------

-   Split the data into training and testing sets.
-   Train a logistic regression model.
-   Predict the probabilities for class 0 and for class 1 of the testing data. Class 0 is located as the first column in the predicted probabilities, and class 1 is the second one.

In [None]:
# Split into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=321)

# Train a logistic regression
log_reg = LogisticRegression().fit(X_train,y_train)

# Predict the probability of the 0 class
prob_0 = log_reg.predict_proba(X_test)[:, 0]
# Predict the probability of the 1 class
prob_1 = log_reg.predict_proba(X_test)[:, 1]

print("First 10 predicted probabilities of class 0: ", prob_0[:10])
print("First 10 predicted probabilities of class 1: ", prob_1[:10])
     

Product reviews with regularization
===================================

In this exercise, you will work once more with the `reviews` dataset of Amazon product reviews. A vector of labels `y` contains the sentiment: `1` if positive and `0` otherwise. The matrix `X` contains all numeric features created using a BOW approach.

You will need to train two logistic regression models with different levels of regularization and compare how they perform on the test data. Remember that regularization is a way to control the complexity of the model. The more regularized a model is, the less flexible it is, but the better it tends to generalize. Models with a higher level of regularization are often less accurate on the training data than non-regularized ones.

Instructions
------------

-   Split the data into training and test sets.
-   Train a logistic regression with regularization parameter of `1000`. Train a second logistic regression with regularization parameter equal to `0.001`.
-   Print the accuracy scores of both models on the test set.

In [None]:
# Split data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Train a logistic regression with C=1000 (weak regularization)
log_reg1 = LogisticRegression(C=1000).fit(X_train, y_train)
# Train a logistic regression with C=0.001 (strong regularization)
log_reg2 = LogisticRegression(C=0.001).fit(X_train, y_train)

# Print the accuracies
print('Accuracy of model 1: ', log_reg1.score(X_test, y_test))
print('Accuracy of model 2: ', log_reg2.score(X_test, y_test))

Regularizing models with Twitter data
=====================================

You will work with the Twitter data expressing customers' sentiment about airline companies. The `X` matrix of features and `y` vector of labels have been created for you. In addition, the training and testing split has been performed. You can work with the `X_train`, `X_test`, `y_train` and `y_test` arrays directly. 

You will train a regularized model and a more flexible one, and evaluate them using different model performance metrics.

All required packages have been imported for you.

Instructions
------------

-   Train two logistic regressions: one with regularization parameter of 100 and a second of 0.1.
-   Print the accuracy scores of both models.
-   Print the confusion matrix of each model.

In [None]:
# Build a logistic regression with regularization parameter of 100
log_reg1 = LogisticRegression(C=100).fit(X_train,y_train)
# Build a logistic regression with regularization parameter of 0.1
log_reg2 = LogisticRegression(C=0.1).fit(X_train,y_train)

# Predict the labels for each model
y_predict1 = log_reg1.predict(X_test)
y_predict2 = log_reg2.predict(X_test)

# Print performance metrics for each model
print('Accuracy of model 1: ', accuracy_score(y_test, y_predict1))
print('Accuracy of model 2: ', accuracy_score(y_test, y_predict2))
print('Confusion matrix of model 1: \n' , confusion_matrix(y_test, y_predict1)/len(y_test))
print('Confusion matrix of model 2: \n', confusion_matrix(y_test, y_predict2)/len(y_test))

1\. Bringing it all together
----------------------------

00:00 - 00:14

Welcome back! In this video we will put together all the steps we have applied in this course on sentiment analysis. I find myself applying these same steps in my work as a data scientist.

2\. The Sentiment Analysis problem
----------------------------------

00:14 - 00:55

We defined sentiment analysis as the process of understanding the opinion of an author about a subject. Throughout the course we worked with examples of movie and Amazon product reviews, Twitter airline sentiment data, and different emotionally charged literary examples. We went through various steps to transform the text column, which contained the review, to numeric features. We finished our analysis by training a logistic regression model and predicting the sentiment of a new review based on the words in the text. Let's go through these steps in more detail.

Sentiment analysis as the process of understanding the opinion of an author about a subject

- Movie reviews
- Amazon product reviews  
- Twitter airline sentiment
- Various emotionally charged literary examples

3\. Exploration of the reviews
------------------------------

00:55 - 01:28

We started with exploring the review column in the movies reviews dataset. We found which were the shortest and longest reviews. We also plotted word clouds from the movie reviews, which allowed us to quickly see which are the most frequently mentioned words in positive or negative reviews. Furthermore, we created features for the length of a review in terms of number of words and number of sentences, and we learned how to detect the language of a document.

- Basic information about size of reviews
- Word clouds  
- Features for the length of reviews: number of words, number of sentences
- Feature detecting the language of a review

4\. Numeric transformations of sentiment-carrying columns
---------------------------------------------------------

01:28 - 02:30

We continued with numeric transformations of the review features. We transformed the text using a bag-of-words approach and a Tfidf vectorizer. The bag-of-words created features corresponding to the frequency count of a word in a respective review or tweet (also called a document in NLP problems). The term frequency-inverse document frequency approach is similar to the bag-of-words, but it accounts for how frequently a word occurs in a document with respect to the rest of the documents. So, we can capture 'important' words, whereas words that occur frequently across documents have a lower tfidf score. We used the CountVectorizer and TfidfVectorizer from sklearn.feature_extraction.text to construct each of the vectors. As a reminder of the syntax, we created the vectorizer, fit it on the text column in our data, and then used it to transform that column.

- Bag-of-words
- TfIdf vectorization

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Vectorizer syntax
vect = CountVectorizer().fit(data.text_column)
X = vect.transform(data.text_column)
```
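
To make the difference between the two transformations concrete, the sketch below fits both vectorizers on a tiny corpus invented for illustration. It compares a word that appears in every document ('the') with a word that appears in only one ('late'), both with the same raw count in the first document.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus, invented for illustration
docs = [
    "the flight was late",
    "the crew was friendly",
    "the food was awful awful",
]

# Bag-of-words: raw counts per document
count_vect = CountVectorizer().fit(docs)
counts = count_vect.transform(docs).toarray()

# Tf-idf: counts reweighted by how rare each word is across documents
tfidf_vect = TfidfVectorizer().fit(docs)
weights = tfidf_vect.transform(docs).toarray()

# vocabulary_ maps each word to its column index
count_cols = count_vect.vocabulary_
tfidf_cols = tfidf_vect.vocabulary_

# Compare both words within the first document
the_count = counts[0, count_cols["the"]]
late_count = counts[0, count_cols["late"]]
the_weight = weights[0, tfidf_cols["the"]]
late_weight = weights[0, tfidf_cols["late"]]

print("counts :", the_count, late_count)
print("tf-idf :", round(the_weight, 3), round(late_weight, 3))
```

Both words have the same count, but the rarer word gets the larger tf-idf weight.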

5\. Arguments of the vectorizers
--------------------------------

02:30 - 03:29

There are many arguments we specified in the vectorizers. We dealt with stop words: frequently occurring, non-informative words. We had a video on n-grams, which allowed us to use phrases of different lengths instead of single words. We learned how to limit the size of the vocabulary with several parameters: max_features (the maximum number of features), and max_df and min_df (which tell the vectorizer to ignore terms whose document frequency is higher or lower than the specified threshold, respectively). We could capture only certain characters using the token_pattern argument. Last but not least, we learned about lemmas and stems and practiced lemmatizing and stemming tokens and strings. We could adjust all these arguments, with the exception of lemmas and stems, in both the CountVectorizer and the TfidfVectorizer.

- stop words: non-informative, frequently occurring words
- n-gram range: use phrases not only single words 
- control size of vocabulary: max_features, max_df, min_df
- capturing a pattern of tokens: remove digits or certain characters

Important but NOT arguments to the vectorizers:
- lemmas and stems
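
The effect of each argument is easiest to see on a small example. The sketch below uses an invented corpus and a hypothetical helper, `vocabulary`, which fits a `CountVectorizer` with one argument changed at a time and returns the resulting vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, invented for illustration
docs = [
    "good food good service",
    "bad food",
    "the service was good 24 7",
]

def vocabulary(**kwargs):
    """Fit a CountVectorizer with the given arguments and return its sorted vocabulary."""
    return sorted(CountVectorizer(**kwargs).fit(docs).vocabulary_)

# Defaults: every token of two or more word characters, including '24'
default_vocab = vocabulary()

# stop_words: drop frequent, non-informative words such as 'the' and 'was'
no_stop_words = vocabulary(stop_words="english")

# ngram_range: keep single words and two-word phrases
with_bigrams = vocabulary(ngram_range=(1, 2))

# max_features: keep only the most frequent terms
top_two = vocabulary(max_features=2)

# min_df: ignore terms that appear in fewer than two documents
in_two_docs = vocabulary(min_df=2)

# token_pattern: keep only tokens made of letters, so digits are dropped
letters_only = vocabulary(token_pattern=r"\b[^\d\W][^\d\W]+\b")

print(default_vocab)
print(in_two_docs)
print(letters_only)
```

The same keyword arguments can be passed to `TfidfVectorizer`.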

6\. Supervised learning model
-----------------------------

03:29 - 03:58

In the final step, we used a logistic regression to train a classifier predicting the sentiment. We evaluated the performance of the model using metrics such as accuracy score and a confusion matrix. Since the goal is for our model to perform well on unseen data, we randomly split the data into a training and testing set; we used the training set to build the model and the test set to evaluate its performance.

- Logistic regression classifier to predict the sentiment
- Evaluated with accuracy and confusion matrix
- Importance of train/test split
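
The modeling step can be sketched end to end on a toy dataset. The labeled reviews below are invented for illustration, and the split is stratified (an assumption made here so that both classes appear in the tiny test set); with so little data the reported accuracy is not meaningful in itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical labeled reviews (1 = positive, 0 = negative), invented for illustration
positive = [
    "great product", "works great", "love it", "really good quality",
    "excellent value", "good and reliable", "love the design",
    "excellent product", "great value", "very good",
]
negative = [
    "terrible product", "stopped working", "hate it", "really bad quality",
    "awful value", "bad and unreliable", "hate the design",
    "awful product", "terrible value", "very bad",
]
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

# Numeric transformation of the text
X = TfidfVectorizer().fit_transform(texts)

# Hold out 20% of the data to measure performance on unseen reviews
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=456, stratify=labels
)

# Train on the training set only, then predict the held-out labels
log_reg = LogisticRegression().fit(X_train, y_train)
y_predicted = log_reg.predict(X_test)

accuracy = accuracy_score(y_test, y_predicted)
# Rows are true classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(y_test, y_predicted, labels=[0, 1])

print("Accuracy on the test set:", accuracy)
print(cm)
```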

7\. Let's practice!
-------------------

03:58 - 04:09

These are all very valuable skills and essential in performing a sentiment analysis task. Let's perform some of these steps in the exercises.

Step 1: Word cloud and feature creation
=======================================

You will work with a sample of the `reviews` dataset throughout this exercise. It contains the `review` and `score` columns. Feel free to explore it in the IPython Shell.

In the first step, you will build a word cloud using only positive reviews. The string `positive_reviews` has been created for you by concatenating the top 100 positive reviews.

In the second step, you will create a new feature for the length of each review and add that new feature to the dataset.

All the functions needed to plot a word cloud have been imported for you, as well as the `word_tokenize` function from the `nltk` module.

Instructions 1/2
----------------

-   Create and generate a word cloud image from the `positive_reviews` string.
-   Display the generated image.

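```python
# Pre-imported in the exercise environment; repeated here so the cell runs on its own
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Create and generate a word cloud image
cloud_positives = WordCloud(background_color='white').generate(positive_reviews)

# Display the generated wordcloud image
plt.imshow(cloud_positives, interpolation='bilinear')
plt.axis("off")

# Don't forget to show the final image
plt.show()
```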

Instructions 2/2
----------------

-   Tokenize each item in the `review` column, using the word tokenizing function we have been working with.
-   Iterate over the created `word_tokens` list and find the length of each item in the list. Append that length to the empty `len_tokens` list.

```python
# Pre-imported in the exercise environment; repeated here so the cell runs on its own
from nltk import word_tokenize

# Tokenize each item in the review column
word_tokens = [word_tokenize(review) for review in reviews.review]

# Create an empty list to store the length of the reviews
len_tokens = []

# Iterate over the word_tokens list and determine the length of each item
for tokens in word_tokens:
    len_tokens.append(len(tokens))

# Create a new feature for the length of each review
reviews['n_words'] = len_tokens
```

Step 2: Building a vectorizer
=============================

In this exercise, you are asked to build a Tfidf transformation of the `review` column in the `reviews` dataset. You are asked to specify the n-grams, stop words, the pattern of tokens and the size of the vocabulary arguments.

This is the last step before we train a classifier to predict the sentiment of a review.

Make sure you specify the maximum number of features properly, as a very large vocabulary size could disconnect your session.

Instructions
------------

-   Import the Tfidf vectorizer and the default list of English stop words.
-   Build the Tfidf vectorizer, specifying the following arguments in this order: use the default list of English stop words as stop words; use uni- and bi-grams as n-grams; set the maximum number of features to 200; capture only words using the specified pattern.
-   Create a DataFrame using the Tfidf vectorizer.

```python
# Import pandas, the TfidfVectorizer and the default list of English stop words
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Build the vectorizer
# (stop words are passed as a list: newer scikit-learn releases validate this
# argument and may reject a frozenset)
vect = TfidfVectorizer(
    stop_words=list(ENGLISH_STOP_WORDS),
    ngram_range=(1, 2),
    max_features=200,
    token_pattern=r'\b[^\d\W][^\d\W]+\b',
).fit(reviews.review)

# Create sparse matrix from the vectorizer
X = vect.transform(reviews.review)

# Create a DataFrame
# (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
reviews_transformed = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
print('Top 5 rows of the DataFrame: \n', reviews_transformed.head())
```

Step 3: Building a classifier
=============================

This is the last step in the sentiment analysis prediction. We have explored and enriched our dataset with features related to the sentiment, and created numeric vectors from it. 

You will use the dataset that you built in the previous steps. It contains a feature for the length of reviews and 200 features created with the Tfidf vectorizer.

Your task is to train a logistic regression to predict the sentiment. The data has been imported for you and is called `reviews_transformed`. The target is called `score` and is binary: `1` when the product review is positive and `0` otherwise.

Train a logistic regression model and evaluate its performance on the test data. How well does the model do? 

All the required packages have been imported for you.

Instructions
------------

-   Perform the train/test split, allocating 20% of the data to testing and setting the random seed to `456`.
-   Train a logistic regression model.
-   Predict the class.
-   Print out the accuracy score and the confusion matrix on the test set.

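```python
# Pre-imported in the exercise environment; repeated here so the cell runs on its own
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Define X and y
y = reviews_transformed.score
X = reviews_transformed.drop('score', axis=1)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=456)

# Train a logistic regression
log_reg = LogisticRegression().fit(X_train, y_train)

# Predict the labels
y_predicted = log_reg.predict(X_test)

# Print accuracy score and confusion matrix on test set
# (dividing by len(y_test) expresses each cell as a proportion of the test set)
print('Accuracy on the test set: ', accuracy_score(y_test, y_predicted))
print(confusion_matrix(y_test, y_predicted)/len(y_test))
```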

1\. Wrap up
-----------

00:00 - 00:15

Well done! This is our final video! You've learned the basics of performing a sentiment analysis using Python. We covered a lot of material in a short period of time, so you should be proud of yourself.

2\. The Sentiment Analysis world
--------------------------------

00:15 - 00:26

We defined the three parts of a sentiment analysis system: the sentiment or emotion expressed, the subject being talked about, and the opinion holder.

- Sentiment/emotion
- Subject being talked about
- Opinion holder

3\. Sentiment analysis types
----------------------------

00:26 - 00:56

We classified sentiment analysis algorithms as rule-based or automated. Rule-based (or lexicon-based) methods rely on a predefined list of words, each with a positivity score; the algorithm matches the words from the lexicon to the words in the text. Automated systems are based on machine learning: using historical data with known sentiment, we predict the sentiment of a new piece of text. This is what we did in this course.

Lexicon/rule based <--> Automated/machine learning
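
To contrast the two families, here is a minimal sketch of the lexicon-based idea. The lexicon, its scores and the function names are all hypothetical and chosen only for illustration; real lexicons are much larger and also handle negation and intensifiers.

```python
# Hypothetical toy lexicon mapping words to a positivity score (invented for illustration)
LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "awful": -2.0, "hate": -2.0,
}

def lexicon_score(text, lexicon=LEXICON):
    """Sum the scores of all lexicon words found in the text."""
    words = [token.strip(".,!?;:") for token in text.lower().split()]
    return sum(lexicon.get(word, 0.0) for word in words)

def lexicon_label(text):
    """Map the total score to a sentiment label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for sentence in ["I love this, it is great!", "The food was awful.", "It arrived on Tuesday"]:
    print(sentence, "->", lexicon_label(sentence))
```

No training data is needed here, whereas the automated approach learns the word weights from labeled examples.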

4\. The automated sentiment analysis system
-------------------------------------------

00:56 - 01:16

In our sentiment analysis process, we followed three steps. First, we explored the data and built new features related to the sentiment-carrying column. Second, we transformed that column into numeric features. Finally, we built a machine learning model that classifies the sentiment.

Exploration, new features → Numeric transformation of review/sentiment column → A classifier predicting the sentiment
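
The last two steps can also be chained into a single object. The sketch below uses scikit-learn's `make_pipeline`, which the course itself did not necessarily use, on training reviews invented for illustration. The fitted pipeline accepts raw text and applies the same numeric transformation before predicting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled reviews (1 = positive, 0 = negative), invented for illustration
train_texts = [
    "great value and great quality", "love it", "excellent product", "works well",
    "awful quality", "hate it", "terrible product", "broke after a week",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Numeric transformation followed by the classifier
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(train_texts, train_labels)

# New, unseen reviews go through the same vectorizer automatically
new_reviews = ["great value, love it", "awful, broke after a week"]
predictions = model.predict(new_reviews)
probabilities = model.predict_proba(new_reviews)

print(predictions)
print(probabilities.round(2))
```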

5\. Congratulations!
--------------------

01:16 - 01:24

Thank you for the time and effort you dedicated to this course! I wish you the best of luck in your further learning.