# Lab 2 / Exercise 3: Analyzing Bokmässan Tweets

## Preparatory task for the laboratory
 
Before you get started with the lab, you need to familiarize yourself with the Jupyter environment.  Take a look at the `Lab 2` exercises 1 and 2:

- [Lab 2 / Exercise 1: Notebook_Basics](Lab2_Ex1_Notebook_Basics.ipynb)
- [Lab 2 / Exercise 2: Running Code](Lab2_Ex2_Running_Code.ipynb)

You do not need to understand any code to complete this lab. However, if you are interested in how the underlying code works, you can check the lab Python package in the GitHub repository under the `somialabs` directory. To learn Python programming in more detail, you can also study chapters 1-9 of the free online book [Python for Everybody](https://www.py4e.com/html3/).

Once you are familiar with the Jupyter environment, you can try out a simple analysis of Twitter data with a focus on data cleansing.

Before we get started, we need to load the relevant Python modules that are used in the code later. Do this by running the next code cell.

In [None]:
import somialabs.lab2
%matplotlib inline

To complete this lab:
1. Follow the instructions, running the code when asked.
2. Discuss each question in your group.
3. Keep notes for your answers to the questions in a separate MS Word document (you can use [this template](Lab2_Ex3_answers_template.docx)).
4. When completed, submit your answers to Studentportalen under Lab 2: http://www.uu.se/goto/tedd015.

## Analysis of Twitter data from the book fair

You have been hired as consultants by a book publisher who wants you to find out which themes and books have generated attention on Twitter during the 2016 book fair in Gothenburg.

Your task is to find out if there is any topic that has been particularly hot on Twitter before the book fair and during the book fair and to present a proposal to the company on what themes seem to create debate. In this lab we focus on data preparation. In order to prepare data, it is important to understand data.

Often, the data to be analyzed must be cleansed before we can use it. Data cleansing can include tasks such as dealing with missing values or, as in our case, filtering out some parts of the raw text data. The data you have been provided with was collected from Twitter between May 2016 and September 2016 in connection with the "Book Fair" event.

Run the following code cell to view some tweets taken from the dataset:

In [None]:
somialabs.lab2.first_5_tweets()

**Question 1.** What do you think is distinctive about Twitter data, and how will this affect how we might want to pre-process the data?

## Data processing

Run the following code cell to view a preview of the full dataset:

In [None]:
somialabs.lab2.view_table()

**Question 2.** How many rows and columns are in the dataset?

Run the following cell to get some summary information about the dataset:

In [None]:
somialabs.lab2.view_table_info()

**Question 3.** How many tweets in the dataset are directed to another user?

*Hint: The count of non-null objects in the summary information about the dataset is the number of values present in a particular column. Null objects are where values are absent.*
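
If you are curious, the idea behind the hint can be illustrated with a small sketch in plain Python. The rows and column names below are made up for illustration and are not taken from the lab dataset; `None` stands for an absent value.

```python
# Hypothetical mini-dataset: each dict is one tweet, None marks a missing value.
rows = [
    {"text": "Hej Bokmaessan!", "in_reply_to": None},
    {"text": "@anna ses vi daer?", "in_reply_to": "anna"},
    {"text": "Ny bok ute nu", "in_reply_to": None},
]


def non_null_counts(rows):
    """Count, per column, how many rows hold a value that is not None."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, 0)
            if value is not None:
                counts[column] += 1
    return counts


counts = non_null_counts(rows)
print(counts)  # {'text': 3, 'in_reply_to': 1}
```

A column whose non-null count is lower than the number of rows has missing values in the remaining rows.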

**Question 4.** Inspect the columns and contents of the dataset. What part of data may be of interest for your analysis?

### Emojis

On Twitter it is common to use emojis 👍 ✨ 🐫 🎉 🚀 🤘.

When doing text analysis this can be useful because an emoji can carry a lot of information about what the author means and what tone the text has. However, emojis may be problematic during analysis since their encoding is not necessarily compatible with processing modules like NLTK.

Sometimes emojis create problems in text processing 😭 and therefore need to be filtered out from the raw data. However, doing this might affect the quality of the analysis.

Run the following cell that displays a sample of 5 tweets with emojis. Try applying a filter to the tweets to remove the emojis by using the tickbox:

In [None]:
somialabs.lab2.filter_emojis()

**Question 5.** How might removing emojis affect the quality of analysis? Explain your answer.
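
For those interested in the code, an emoji filter could be sketched roughly as follows. This is not necessarily how `somialabs.lab2` does it: the function name `strip_emojis` and the chosen Unicode ranges are assumptions, and the ranges cover only some common emoji blocks.

```python
import re

# Assumed emoji ranges: a few common Unicode blocks, not an exhaustive list.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that often follows an emoji
    "]+"
)


def strip_emojis(text):
    """Remove emoji characters and tidy up the whitespace left behind."""
    without = EMOJI_PATTERN.sub("", text)
    return " ".join(without.split())


cleaned = strip_emojis("Vilken dag 🎉 paa bokmaessan 👍✨")
print(cleaned)  # Vilken dag paa bokmaessan
```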

### Remove URLs
On Twitter, it is common to link to locations on the Web using URLs. Commonly occurring parts of URLs will often end up among the most frequent words. It is therefore important to filter them out.

Run the following code cell that displays a sample of 5 tweets, some with URLs. Try applying a filter to the tweets to remove URLs by using the tickbox:

In [None]:
somialabs.lab2.filter_urls()

**Question 6.** How might removing URLs affect the quality of analysis? Explain your answer.
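
As an optional illustration, a URL filter can be written with a regular expression. The sketch below assumes that a URL starts with `http://`, `https://` or `www.` and runs until the next whitespace; the name `strip_urls` and the example tweet are made up, and the lab package may work differently.

```python
import re

# Assumption: a URL starts with http://, https:// or www. and ends at whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def strip_urls(text):
    """Remove URLs and collapse the whitespace left behind."""
    without = URL_PATTERN.sub("", text)
    return " ".join(without.split())


no_urls = strip_urls("Laes mer https://t.co/abc123 om maessan")
print(no_urls)  # Laes mer om maessan
```

Without such a filter, fragments like `https` and `co` would be counted as words in every tweet that contains a link.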

### Function for most frequent words

We will look for the most frequent words several times during this lab, after each pre-processing step, in order to compare the effect of the pre-processing. Since we repeat the same operations several times, we will create a couple of functions to help us with our analysis.

#### What is a Term Document Matrix?

First, we create a term-document matrix (TDM), which can also be referred to as a document-term matrix (DTM). A TDM gives us a table of the number of instances of each word for each document in a corpus. You should start by creating a TDM that represents each tweet as a feature vector. The feature vector has one element for each word (unless the word is excluded in the pre-processing, see further below). Thus, each element in the feature vector represents a word contained in at least one of the tweets. In the TDM you create, each row corresponds to the text of a tweet, and the counts of all words that are not filtered out of the tweet are stored in the corresponding elements of the feature vector.

Run the following code cell to create a TDM for the first three documents in our tweets corpus. Using the slider, you can change the sample size, that is, the number of tweets used to create the TDM:

In [None]:
somialabs.lab2.make_tdm()

**Question 7.** How many columns are created for our small TDM above for sample sizes of 1, 3, and 5 tweets?
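
To make the structure of a TDM concrete, here is a minimal sketch that uses only the Python standard library. The function name `build_tdm` and the toy tweets are assumptions for illustration; the lab code may build its matrix in another way.

```python
import re
from collections import Counter


def build_tdm(documents):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    # A "word" is taken to be any run of letters.
    tokenized = [re.findall(r"[^\W\d_]+", doc) for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in this document.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


docs = ["bok bok maessa", "maessa i Goeteborg"]  # made-up toy tweets
vocabulary, matrix = build_tdm(docs)
print(vocabulary)  # ['Goeteborg', 'bok', 'i', 'maessa']
for row in matrix:
    print(row)
# [0, 2, 0, 1]
# [1, 0, 1, 1]
```

Each new word that appears in the sample adds a column, so the matrix grows wider as more tweets are included.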

To find the top words we will do a bit more work. We sum up each of the columns in the TDM and sort the word frequencies by counts to generate the top sorted words list. We can then plot these words in a nice bar chart.

Run the next code cell to create an interactive plot of the top words based on our generated TDM. You can use the sliders to control the number of words in the histogram plot (using `top_words`) and in the list of top words output below (using `num_word...`):

In [None]:
somialabs.lab2.plot_top_words()

**Question 8.** How many times must a word occur in your corpus for it to appear in the top words list output above?

**Question 9.** What are the top 5 occurring words in the corpus? Discuss how useful these words are to our analysis.
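
The column-sum-and-sort step described above can be sketched as follows. The helper name `top_words_from_tdm` and the small matrix are hypothetical and serve only to show the idea.

```python
def top_words_from_tdm(vocabulary, matrix, n):
    """Sum each TDM column and return the n most frequent (word, count) pairs."""
    # zip(*matrix) turns the rows into columns, one column per word.
    totals = [sum(column) for column in zip(*matrix)]
    ranked = sorted(zip(vocabulary, totals), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


vocabulary = ["bok", "i", "maessa"]
matrix = [
    [2, 0, 1],  # tweet 1
    [0, 1, 1],  # tweet 2
    [1, 1, 2],  # tweet 3
]
top = top_words_from_tdm(vocabulary, matrix, 2)
print(top)  # [('maessa', 4), ('bok', 3)]
```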

### Lowercase

The next step is to convert all the words to lowercase letters. You do this to avoid identifying the same word as different words when it is written in different cases. For example, before transforming the whole corpus into lowercase letters, the words `Bokmaessan` and `bokmaessan` may be identified as different words.

Run the following code cell that displays a sample of 5 tweets with mixed-case letters. Try applying a filter to the tweets to transform them to lowercase by using the tickbox:

In [None]:
somialabs.lab2.make_lowercase()
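
The effect of lowercasing on word counts can be seen in a short sketch with made-up tokens:

```python
from collections import Counter

tokens = ["Bokmaessan", "bokmaessan", "BOKMAESSAN", "bok"]  # made-up tokens

before = Counter(tokens)
after = Counter(token.lower() for token in tokens)

print(len(before), "distinct words before lowercasing")  # 4
print(len(after), "distinct words after lowercasing")    # 2
print(after["bokmaessan"])                               # 3
```

The three spellings are merged into a single word whose count is the sum of the individual counts.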

### Small words

Most small words are of limited importance, so let's strip those out. One simple way to do this is to keep only the words that are at least 3 letters long. We can define a "word" as being any string of letters.

Run the following code cell that displays a sample of 5 tweets with small words. Try applying a filter to the tweets to remove small words by using the tickbox:

In [None]:
somialabs.lab2.remove_small_words()
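
A filter of this kind can be expressed with a regular expression that matches runs of at least three letters. The sketch below is one possible way to do it; the name `keep_long_words` and the example sentence are assumptions, not the lab package's code.

```python
import re

# A "word" is a run of letters; {3,} keeps only runs of at least 3 letters.
LONG_WORD = re.compile(r"[^\W\d_]{3,}")


def keep_long_words(text):
    """Keep only the words that are at least 3 letters long."""
    return " ".join(LONG_WORD.findall(text))


kept = keep_long_words("jag var i en bok paa maessan")
print(kept)  # jag var bok paa maessan
```

Note that common three-letter words such as `jag` and `var` pass through this filter, so word length alone does not remove every uninformative word.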

### Stop Words

Stop words are words of limited importance and are therefore not so interesting for your analysis. We use stop words as a reference so that we can filter out words that we do not want to analyze, for example prepositions and conjunctions.

First, we create a list of stop words that we can use to filter words out of the most frequent word collection:

```
"och", "det", "att", "i", "en", "jag", "hon", "som", "han", "paa", "den", "med", "var", "sig", "foer", "saa", "till", "aer", "men", "ett", "om", "hade", "de", "av", "icke", "mig", "du", "henne", "daa", "sin", "nu", "har", "inte", "hans", "honom", "skulle", "hennes", "daer", "min", "man", "ej", "vid", "kunde", "naagot", "fraan", "ut", "naer", "efter", "upp", "vi", "dem", "vara", "vad", "oever", "aen", "dig", "kan", "sina", "haer", "ha", "mot", "alla", "under", "naagon", "eller", "allt", "mycket", "sedan", "ju", "denna", "sjaelv", "detta", "aat", "utan", "varit", "hur", "ingen", "mitt", "ni", "bli", "blev", "oss", "din", "dessa", "naagra", "deras", "blir", "mina", "samma", "vilken", "er", "saadan", "vaar", "blivit", "dess", "inom", "mellan", "saadant", "varfoer", "varje", "vilka", "ditt", "vem", "vilket", "sitta", "saadana", "vart", "dina", "vars", "vaart", "vaara", "ert", "era", "vilka"
```

Run the following code cell that displays a sample of 5 tweets that contains some stop words. Try applying a filter to the tweets to remove stop words by using the tickbox:

In [None]:
somialabs.lab2.remove_stop_words()
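
In code, stop word removal amounts to a membership test against the list. The sketch below uses a short excerpt of the list above; the function name `drop_stop_words` and the example sentence are hypothetical.

```python
# Short excerpt of the stop word list above, stored as a set for fast lookup.
STOP_WORDS = {"och", "det", "att", "i", "en", "jag", "paa"}


def drop_stop_words(text, stop_words=STOP_WORDS):
    """Remove every word whose lowercase form is in the stop word set."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


filtered = drop_stop_words("Jag laeser en bok paa maessan")
print(filtered)  # laeser bok maessan
```

Comparing the lowercase form means that `Jag` is removed even though the list only contains `jag`.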

Let's try applying these filters to the whole tweets corpus and plot the top words.

Run the following code cell to create an interactive plot of the top words based on our generated TDM. You can use the tickboxes to apply the various transforms and filters we have already discussed above.

In [None]:
somialabs.lab2.plot_top_words_with_filters()

**Question 10.** What do you observe in the data after plotting the lowercased tweets? Suggest some reasons for your observations.

**Question 11.** After removing small words only, how many times must a word occur in the corpus for it to appear in the top-10 words?

**Question 12.** After removing stop words only, how many times must a word occur in the corpus for it to appear in the top-10 words?

**Question 13.** Does pre-processing the data change the list of the 20 most frequent words? Provide reasons for your observations.

### Add your own stopwords

You can also choose to add your own stop words if you think there are words in the plot that are not very informative for determining what kind of topics were discussed at the book fair. For example, you could remove `years`, represented in the text as `aar`. Add your own *additional* stop words to the stop word list above as a comma-separated list and try to refine the analysis.

Run the following code cell to create an interactive plot of the top words based on our generated TDM. You can use the tickboxes to apply the various transforms and filters we have already discussed above *and* enter your own additional stop words to filter out:

In [None]:
somialabs.lab2.plot_top_words_with_custom_stopwords()

**Question 14.** What stop words did you add and why? Did you notice any further problems?
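
For reference, a comma-separated list of extra stop words can be turned into a set and merged with an existing list as sketched below. The name `parse_stop_words` and the example words are assumptions; how the lab widget handles the input is not shown here.

```python
def parse_stop_words(raw):
    """Split a comma-separated string into a set of lowercase stop words."""
    # Surrounding spaces are stripped and empty entries are ignored.
    return {word.strip().lower() for word in raw.split(",") if word.strip()}


base = {"och", "det", "att"}  # excerpt of the default list
extra = parse_stop_words("aar, idag ,, Ny")
all_stop_words = base | extra  # set union keeps each word once

print(sorted(extra))           # ['aar', 'idag', 'ny']
print(sorted(all_stop_words))  # ['aar', 'att', 'det', 'idag', 'ny', 'och']
```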

### Visualization of analysis and recommendation

Now you will create a visualization that will help you convince the company why they should focus on this particular topic. A common way of visualizing commonly used words in a text is by using a word cloud.

Run the following code cell to create an interactive word cloud of the top words based on our generated TDM. You can use the tickboxes to apply the various transforms and filters we have already discussed above and enter your own additional stop words to filter out:

In [None]:
somialabs.lab2.plot_wordcloud()

Run the following code cell again to create a second word cloud so you can compare the outputs of different filtered datasets (for example, compare with no filters):

In [None]:
somialabs.lab2.plot_wordcloud()

### Compare your word clouds

Create word clouds for at least two of your top words lists to compare how the pre-processing has affected the word clouds. You can also change the minimum frequency for a word to end up in the word cloud. If you think any words should be deleted, you can go back to an earlier step, add them to your stop word list, and re-run the cells that follow.
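
The role of the minimum frequency can be illustrated with a simplified sketch. It assumes that a word's font size grows linearly with its count, which is only one possible scaling; the function `cloud_sizes`, its parameters, and the counts are made up, and real word cloud tools use their own layout and sizing rules.

```python
def cloud_sizes(frequencies, min_frequency=2, min_size=10, max_size=60):
    """Drop rare words and map the remaining counts to font sizes."""
    kept = {w: c for w, c in frequencies.items() if c >= min_frequency}
    if not kept:
        return {}
    top = max(kept.values())
    # Assumed linear scaling: the most frequent word gets max_size.
    return {
        w: round(min_size + (max_size - min_size) * c / top)
        for w, c in kept.items()
    }


sizes = cloud_sizes({"bok": 10, "maessa": 5, "aar": 1})
print(sizes)  # {'bok': 60, 'maessa': 35}
```

Raising `min_frequency` removes rare words from the cloud, which leaves more room for the frequent ones.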

**Question 15.** Are there any less informative words that you removed to improve the visualization? Explain why you removed any additional words.

**Question 16.** What theme would you recommend that the book publisher target next year? Explain your answer.

**Question 17.** Now that you have explored some Twitter data, what do you think are the interesting characteristics of this kind of data? How do they affect how you must pre-process the data?