Jeopardy is a popular TV show in the US where participants answer trivia to win money. Participants are given a set of categories to choose from and a set of questions that increase in difficulty. As the questions get more difficult, the participant can earn more money for answering correctly.

In June 2019, contestant James Holzhauer ended a 32-game winning streak, just barely missing the record for highest winnings. He dedicated hours of effort to optimizing his play during a game to maximize how much money he earned. To achieve what he did, James had to learn and master the vast amount of trivia that Jeopardy can throw at contestants.

Let's say we want to compete on Jeopardy like James. As he did, we'll have to familiarize ourselves with an enormous amount of trivia to be competitive. Given the vastness of the task, is there a way to simplify our studies and prioritize topics that appear more often in Jeopardy?

In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

The dataset is named `jeopardy.csv` and contains a subset of `20000` rows from a much larger dataset of Jeopardy questions. We can download the dataset [here](http://data.world/dataquest/jeopardy). Here's the beginning of the file:

![image.png](attachment:image.png)


Each row represents a single question from a single episode of Jeopardy. Crucially, we can see the different question categories that appeared on a particular episode, the questions and answers themselves, and the value associated with each question.

Before we delve into the analysis, let's familiarize ourselves with the data. We'll write up our analysis in our own RMarkdown file.

1. Read the dataset into a tibble called `jeopardy` using `tidyverse`.
2. Print out the first `5` rows of `jeopardy`
3. Print out the columns of the `jeopardy` data
4. Perform some formatting on the column names for easier analysis later:
    * Replace the spaces with underscores in each column name.
    * Lowercase all of the column names.
    * We can do all this by assigning a new character vector to `colnames(jeopardy)`.
5. Make sure that we understand what type each column is before proceeding.
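
The steps above can be sketched as follows. This is a minimal illustration, not the official solution: the tibble built inline is a made-up stand-in for `read_csv("jeopardy.csv")`, and its column names and rows are assumptions about what the file contains.

```r
library(tibble)   # part of the tidyverse
library(stringr)  # part of the tidyverse

# Toy stand-in for read_csv("jeopardy.csv"); column names and rows are assumed.
jeopardy <- tibble(
  `Show Number` = c(4680, 4680),
  `Air Date` = c("2004-12-31", "2004-12-31"),
  Round = c("Jeopardy!", "Final Jeopardy!"),
  Category = c("HISTORY", "SCIENCE"),
  Value = c("$200", "None"),
  Question = c("A placeholder question", "Another placeholder question"),
  Answer = c("A placeholder answer", "Another placeholder answer")
)

# Inspect the first rows and the original column names
head(jeopardy, 5)
colnames(jeopardy)

# Replace spaces with underscores, then lowercase every column name
colnames(jeopardy) <- str_to_lower(str_replace_all(colnames(jeopardy), " ", "_"))

# Check the type of every column
sapply(jeopardy, class)
```

With the real file, only the construction of `jeopardy` would change; the renaming and type check stay the same.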

[Solution](https://github.com/dataquestio/solutions/blob/master/Mission443Solutions.Rmd)

One of the curious things we may have seen is that the `value` column has the `character` type! We would expect this to have a numerical value, so let's have a look at the `value` column to see why this might be the case:

![image.png](attachment:image.png)

It turns out that the `value` column actually incorporates a dollar sign and uses the value `None` in places where the question came from a Final Jeopardy, the last question of every episode. 

The presence of these factors causes R to convert this column to a `character` instead of a numerical one. For our later analysis, we'll need the `value` column to be numeric, so we should do this now.

1. Filter `jeopardy` so that we remove all of the "None" values from the dataset. We will be sacrificing some questions, but this will make analysis easier later.
2. Use a regular expression to remove all of the dollar signs and commas that appear in the `value` column.
     * We can call `str_replace_all()` multiple times to remove these troublesome characters.
3. Finally, convert the cleaned `value` column into a numeric column and make sure that we've done the conversion correctly.
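
One possible way to carry out these three steps is sketched below. The three-row tibble is hypothetical toy data, and a single character class `[$,]` is used here in place of two separate `str_replace_all()` calls; either approach should give the same result.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy data standing in for the real `value` column
jeopardy <- tibble(value = c("$200", "None", "$1,000"))

jeopardy <- jeopardy %>%
  filter(value != "None") %>%                      # drop Final Jeopardy rows
  mutate(
    value = str_replace_all(value, "[$,]", ""),    # strip dollar signs and commas
    value = as.numeric(value)                      # convert to numeric
  )

# Confirm the conversion: the column should be numeric with no NA values
class(jeopardy$value)
sum(is.na(jeopardy$value))
```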

One messy aspect about the Jeopardy dataset is that it contains text. Text can contain punctuation and different capitalization, which will make it hard for us to compare the text of an answer to the text of a question. 

We would like to make this process easier for ourselves, so we'll need to process the text data in this step. The process of cleaning text in data analysis is sometimes called **normalization**. More specifically, we want to ensure that we lowercase all of the words and remove any punctuation.

We remove punctuation so that the text consists purely of letters and numbers. Before normalization, the terms `Don't` and `don't` are considered to be different words, and we don't want this: lowercasing makes them identical, and removing punctuation reduces both to `dont`. For this step, normalize the `question`, `answer`, and `category` columns.

1. Take the `question`, `answer` and `category` columns and normalize them.
    * Lowercase all of the words in every question, answer, and category.
    * Remove all punctuation. A good way of thinking about this is establishing everything we want to keep and negating this to remove everything we don't want. In this case, we want to keep all letters and numbers.
    * `str_replace_all()` is our best friend here too.
2. As always, check our work to make sure that our data cleaning had its intended effect
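
A sketch of this normalization is shown below. The helper name `normalize_text` and the toy row are hypothetical, and the pattern keeps spaces in addition to letters and numbers so that words stay separated, which is an assumption beyond the instructions above.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical helper: lowercase, then drop everything except letters, digits, and spaces
normalize_text <- function(x) {
  str_replace_all(str_to_lower(x), "[^a-z0-9 ]", "")
}

# Toy row standing in for the real data
jeopardy <- tibble(
  category = "U.S. HISTORY",
  question = "Don't forget: in 1776, this document was signed!",
  answer = "The Declaration of Independence"
)

# Apply the same normalization to all three text columns
jeopardy <- jeopardy %>%
  mutate(across(c(question, answer, category), normalize_text))

jeopardy
```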

In addition to the cleaning steps above, we need to address the `air_date` column. Like `value`'s original type, `air_date` is a `character`. Ideally, we would separate this column into `year`, `month`, and `day` columns to make filtering easier in the future. We would also want each of these new date columns to be numeric to make comparisons easier.

1. Take the `air_date` column and split it into 3 new columns: `year`, `month` and `day`.
    * The `separate()` function can serve us well here.
2. Convert each of these new columns into numeric columns as well.
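
A minimal sketch of the split is shown below, assuming `air_date` is stored as `YYYY-MM-DD` text; the two dates are made-up examples.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy data; the YYYY-MM-DD format is an assumption about the real column
jeopardy <- tibble(air_date = c("2004-12-31", "1997-02-05"))

jeopardy <- jeopardy %>%
  separate(air_date, into = c("year", "month", "day"), sep = "-") %>%  # split on hyphens
  mutate(across(c(year, month, day), as.numeric))                      # make each part numeric

jeopardy
```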

We are now in a place where we can properly ask questions of the data and perform meaningful hypothesis tests on it. Given the near-infinite number of questions that can be asked in Jeopardy, we wonder if any particular subject area has increased relevance in the dataset.

Many people seem to think that science and history facts are the most common categories to appear in Jeopardy episodes. Others feel that Shakespeare questions get an awful lot of attention from Jeopardy.

With the chi-squared test, we can actually test these hypotheses! For this exercise, let's assess if science, history and Shakespeare have a higher prevalence in the data set.

First, we need to develop our null hypotheses. There are around 3369 unique categories in the Jeopardy data set after doing all of our cleaning. If we suppose that no category stood out, we would expect that the probability of picking a random category would be the same no matter what category we picked. This comes out to be $1/3369$. This would also mean that the probability of not picking a particular category would be $3368/3369$.

When we first learned the `chisq.test()` function while testing the number of `males` and `females` in the Census data, we assumed that their proportions would be equal, with a 50-50 split between them. `chisq.test()` automatically assumes this of the data we provide it, but we can also specify what these proportions should be using the `p` argument.

```r
n_questions <- nrow(jeopardy)
p_category_expected <- 1/3369
p_not_category_expected <- 3368/3369
p_expected <- c(p_category_expected, p_not_category_expected)
```

For each of the three categories we discussed (science, history, Shakespeare), conduct a hypothesis test to see if they are more likely to appear than other categories. The process can be broken down as follows:

1. For Science:
    * First, count how many times the word "science" appears in the `category` column. Use this information to count how many times "science" doesn't appear in the category.
    * After counting these values, use `chisq.test()` to conduct the hypothesis test.
    * After investigating the resulting test, make our conclusion about the null hypothesis. Write our conclusions below the tests we conduct.
    
2. Repeat the above for History and Shakespeare
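
The science test could be sketched like this. The `category` vector is invented toy data sized so that the expected counts are not tiny; the counts from the real dataset will differ, so the resulting p-value is only illustrative.

```r
library(stringr)

# Invented toy data: 40 "science" categories out of 20000 rows
category <- c(rep("science", 40), rep("other topic", 19960))

# Expected proportions under the null hypothesis (1 in 3369 categories)
p_expected <- c(1/3369, 3368/3369)

# Observed counts of rows that do and don't mention "science"
n_questions <- length(category)
n_science <- sum(str_detect(category, "science"))
n_not_science <- n_questions - n_science

# Goodness-of-fit test of the observed counts against p_expected
science_test <- chisq.test(c(n_science, n_not_science), p = p_expected)
science_test$p.value
```

Swapping `"science"` for `"history"` or `"shakespeare"` gives the other two tests. Note that `str_detect()` also counts categories that merely contain the word, such as a hypothetical "life science" category.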

We see p-values less than 0.05 for each of the hypothesis tests. From this, we would reject the null hypothesis that science is no more prevalent than other topics in the Jeopardy data. We would conclude the same for history and Shakespeare.

Let's say we want to investigate how often new questions are repeats of older ones. We're only working with about `10%` of the full Jeopardy question dataset, but we can at least start investigating this question. To start on this process, we can do the following:

1. Sort `jeopardy` in order of ascending air date.
2. Initialize an empty vector, `terms_used`, to store all the unique terms that are in the Jeopardy questions.
3. For each row, split the value for `question` into distinct words, remove any word shorter than `6` characters, and check if each word occurs in `terms_used`.
    * If it does not, add the word to the unique term vector
    
    
This vector of terms will enable us to check whether terms in future questions have been used previously. Only looking at words with at least 6 characters enables us to filter out stop words like `the` and `than`, which are commonly used but don't tell us a lot about a question. This vector will also help us set up for another hypothesis test after this screen.

1. Create an empty vector called `terms_used`.
2. Sort `jeopardy` by ascending air date.
3. Get the `question` column into its own character vector and iterate through each question:
    * split each question into another character vector based on individual words
    * See if any of the words are at least 6 letters long and are not currently in `terms_used`, and add them to the vector if they satisfy both criteria
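
The loop described above might look like the following sketch. The two questions are invented, and the data is assumed to already have numeric `year`, `month`, and `day` columns and normalized question text.

```r
library(dplyr)
library(stringr)
library(tibble)

# Invented toy rows, deliberately out of date order
jeopardy <- tibble(
  year = c(2005, 2004),
  month = c(1, 12),
  day = c(3, 31),
  question = c(
    "the physicist who discovered radium",
    "this famous physicist developed relativity"
  )
)

terms_used <- character(0)

# Sort by ascending air date, then pull out the questions
questions <- jeopardy %>%
  arrange(year, month, day) %>%
  pull(question)

for (q in questions) {
  words <- str_split(q, " ")[[1]]   # split the question into individual words
  for (word in words) {
    # Keep words of at least 6 characters that we have not stored yet
    if (nchar(word) >= 6 && !(word %in% terms_used)) {
      terms_used <- c(terms_used, word)
    }
  }
}

terms_used
```

Growing a vector inside a loop is slow for large inputs, but it mirrors the step-by-step outline and is adequate for a small subset.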

Let's say we only want to study terms that are associated with high values rather than low values. This optimization will help us earn more money when we're on Jeopardy while reducing the number of questions we have to study. To do this, we need to count how many high value and low value questions are associated with each term. For our exercise, we'll define low and high values as follows:

* Low value: Any row where `value` is less than `800`.
* High value: Any row where `value` is greater than or equal to `800`.

If we are not familiar with Jeopardy, below is an image of what the question board looks like at the start of every round:

![image.png](attachment:image.png)

Under this definition, we can see that each category has 3 low value questions for every 2 high value questions. Once we count the number of low and high value questions that appear for each term, we can use this information to our advantage.

If the number of high and low value questions is appreciably different from the 2:3 ratio, we would have reason to believe that a term would be more prevalent in either the low or high value questions. We can use the chi-squared test to test the null hypothesis that each term is not distributed more to either high or low value questions.

To do this, our code should follow the rough outline:

1. Create an empty dataset that we can add more rows to
2. Iterate through all the different terms in `terms_used`.
3. For each term:
    * Iterate through all of the questions in the dataset and see if the term is present in each question. Since we're iterating a lot here, it might be useful to test our code on only a few terms to make sure that our code works correctly before going through the entire list of terms.
    * If the term is present in the question, we then need to check if the question is high or low value
    * After iterating through all the questions, test the null hypothesis using the information we discussed on this screen.
    * Each term should be associated with a high value question count, a low value question count, and a p-value. Turn these values into a vector and append it to the empty dataset we created.
4. Finally, investigate the resulting dataset and see what terms are associated with the lowest p-values and the higher proportions of high value questions. Write our findings in our RMarkdown file.
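
The outline above can be sketched as follows. The questions, values, and terms are invented toy data, and the inner loop over questions is replaced with a vectorized `str_detect()` call, which is a simplification of the outline rather than the official solution.

```r
library(dplyr)
library(stringr)
library(tibble)

# Invented toy data: 20 questions per term with different high/low splits
jeopardy <- tibble(
  question = c(rep("name this physicist", 20), rep("this state capital", 20)),
  value = c(rep(1000, 15), rep(200, 5), rep(800, 8), rep(400, 12))
)
terms_used <- c("physicist", "capital")

# Expected proportions (high, low) from the 2:3 ratio on the board
p_expected <- c(2/5, 3/5)

results <- list()
for (term in terms_used) {
  has_term <- str_detect(jeopardy$question, fixed(term))
  n_high <- sum(has_term & jeopardy$value >= 800)
  n_low <- sum(has_term & jeopardy$value < 800)

  # chisq.test() raises an error if both counts are zero, which cannot
  # happen here because every term comes from at least one question
  test <- chisq.test(c(n_high, n_low), p = p_expected)

  results[[term]] <- tibble(
    term = term, n_high = n_high, n_low = n_low, p_value = test$p.value
  )
}

# Combine the per-term rows and sort by p-value
results <- bind_rows(results) %>% arrange(p_value)
results
```

Collecting one-row tibbles in a list and binding them once at the end avoids repeatedly copying a growing data frame.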


We can see from the output that some of the counts are less than 5. Recall that the chi-squared test is prone to errors when the counts in any of the cells are less than 5. We may need to discard these terms and only look at terms where both counts are at least 5.

From the 20 terms that we looked at, it seems that the term `"indian"` is more associated with high value questions. Interesting!

That's it for the project! We recommend exploring the data more on our own.

Here are some potential next steps:

* Our criterion for removing terms was a bit crude. It might be helpful to eliminate non-informative words in other ways than just removing words that are fewer than `6` characters long. Some ideas:
    * Manually create a list of words to remove, like `the`, `than`, etc.
    * Find a list of stopwords to remove and use this instead.
    * Remove words that occur in more than a certain percentage (like `5%`) of questions.
* Another way of analyzing the "value" of each term might be to take all the values associated with it and calculate the "average value" of a term. This would give us a more quantitative idea of what terms are more high value than others.
* Use the whole Jeopardy dataset (available [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file)) instead of the subset we used. Note that we'll need to vectorize our code to make sure that our solution doesn't run excessively long. The solution code uses for loops, which are slow for large amounts of data.
* Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.