# Lab 6 / Summary of Text Analytics Methods

In this lab, we will revisit each of the text analytics methods from the previous labs, and consider how these methods can be optimized and how data preprocessing affects their results.

Before we get started, we need to load the Python modules that are used in the code later on. Do this by running the next code cell.

In [None]:
import somialabs.lab6
%matplotlib inline
import warnings; warnings.simplefilter('ignore')

To complete this lab:
1. Follow the instructions, running the code when asked.
2. Discuss each question in your group.
3. Keep notes for your answers to the questions in a separate MS Word document (you can use [this template](Lab6_answers_template.docx)).
4. When completed, briefly discuss your answers with the Lecturer/Teaching Assistant attending your lab. You **do not** need to submit your answers to Studentportalen.

## Word Frequencies

In Lab 2, we looked at word frequencies (counting words), the distribution of these frequencies, and how various preprocessing steps affect them.
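As a reminder of the idea, here is a minimal sketch of counting word frequencies with and without stop-word removal. It uses only the Python standard library; the two example reviews, the stop-word list and the helper names (`tokenize`, `word_frequencies`) are made up for illustration and are not part of `somialabs`, which may work differently.

```python
import re
from collections import Counter

# Hypothetical mini-corpus and stop-word list, for illustration only.
REVIEWS = [
    "The film was great and the acting was great",
    "The plot was dull and the ending was worse",
]
STOP_WORDS = {"the", "was", "and"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(texts, stop_words=frozenset()):
    """Count tokens across all texts, skipping any token in stop_words."""
    counts = Counter()
    for text in texts:
        counts.update(t for t in tokenize(text) if t not in stop_words)
    return counts


raw_counts = word_frequencies(REVIEWS)
filtered_counts = word_frequencies(REVIEWS, STOP_WORDS)

# Compare the most frequent words before and after stop-word removal.
print(raw_counts.most_common(3))
print(filtered_counts.most_common(3))
```

Comparing the two printed lists gives a small-scale view of what the interactive plot below lets you explore on the real data.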

Run the following cell to interact with and explore a word frequency plot generated on some IMDb review data:

In [None]:
somialabs.lab6.interact_word_frequency_plot()

**Question 1.1.** How does stop-word removal affect the word frequency distribution?

**Question 1.2.** How does stemming affect the word frequency distribution?

**Question 1.3.** How does lemmatization affect the word frequency distribution?

**Question 1.4.** What other factors might affect the word frequency distribution?

**Question 1.5.** How useful are word frequencies for finding out the themes within the IMDb data?

## Text Classification

In Labs 3 and 4, we looked at text classification: first classifying consumer complaints by the product they relate to (Lab 3, using an SVM), and then classifying tweets as positive or negative in sentiment (Lab 4, using logistic regression and naive Bayes).
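To recall what cross-validation measures, the following is a minimal sketch of a naive Bayes sentiment classifier evaluated with k-fold cross-validation, written in plain Python. The labelled reviews and the function names (`train`, `predict`, `cross_validate`) are hypothetical and chosen for illustration; the classifier and scoring used by `somialabs` are not shown in this lab and may differ.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled reviews, for illustration only.
DATA = [
    ("great film loved it", "pos"),
    ("wonderful acting great story", "pos"),
    ("loved the wonderful cast", "pos"),
    ("great fun loved every minute", "pos"),
    ("terrible film hated it", "neg"),
    ("awful acting boring story", "neg"),
    ("hated the boring cast", "neg"),
    ("terrible waste awful plot", "neg"),
]


def train(examples):
    """Collect per-class document counts and per-class word counts."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    for text, label in examples:
        class_docs[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-probability (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(examples, k):
    """Hold out every k-th example in turn; return one accuracy per fold."""
    scores = []
    for fold in range(k):
        test = examples[fold::k]
        training = [ex for i, ex in enumerate(examples) if i % k != fold]
        model = train(training)
        correct = sum(predict(model, text) == label for text, label in test)
        scores.append(correct / len(test))
    return scores


scores = cross_validate(DATA, k=4)
print(scores, sum(scores) / len(scores))
```

Each score is the accuracy on a fold the model never saw during training, so the mean of the scores estimates how well the classifier generalises. Scores on a toy dataset like this one say little about real data, where preprocessing choices can shift them in either direction.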

Run the following cell to interact with and explore the cross-validation scores obtained when training a classifier on the IMDb review data:

In [None]:
somialabs.lab6.interact_classifier_cross_validation()

**Question 2.1.** How does stop-word removal affect the performance of text classification?

**Question 2.2.** How does stemming affect the performance of text classification?

**Question 2.3.** How does lemmatization affect the performance of text classification?

**Question 2.4.** How might classifying sentiment differ from other kinds of classification?

**Question 2.5.** Thinking about word frequencies from Question 1, can you use word frequencies to predict sentiment? If so, how might you do so?

## Topic Modelling

In Lab 5, we looked at topic modelling using latent Dirichlet allocation (LDA).

Run the following code cell to explore how preprocessing affects the topics generated by LDA:

In [None]:
somialabs.lab6.interact_lda_model_topics()

**Question 3.1.** How does stop-word removal affect the performance of topic modelling?

**Question 3.2.** How does stemming affect the performance of topic modelling?

**Question 3.3.** How does lemmatization affect the performance of topic modelling?

**Question 3.4.** How does the output of LDA compare to word frequencies? Which do you think is more useful and why?

## Applying Methods

**Question 4.** Assume you are given a new social media dataset to analyse. Which of the above methods would be most effective for finding the common themes in the new dataset? Discuss in your group the purpose, advantages and disadvantages of each method.

## Summary

These labs are intended to give you a taste of what you can do with text analytics and computational approaches to data.

You can extract some interesting information by simply counting word frequencies.

You can train machine learning models on existing labelled data to create computer programs that can make predictions on new data.

You can also create machine learning models that predict the sentiment (positive/negative emotion) of some text.

Finally, you can also computationally process some text to try to extract its themes (topics) using topic modelling.

All of these can be done on social media data, such as social network posts, blogs and emails. They are not intended to replace traditional (non-computational) research methods such as ethnography, but they might help steer your research or help you make sense of very big datasets that would otherwise take an extremely (inhumanly) long time to sift through by hand.