# **Algorithmic Methods of Data Mining - Winter Semester 2023**

## **Research questions (RQs)**
1. [__RQ1__] *Exploratory Data Analysis (EDA)* - Before working on your research questions, you should provide meaningful statistical summaries through visualizations and tabular tools to understand your data.


2. [__RQ2__] *Let’s finally dig into this vast dataset, retrieving some vital information:*
    - Plot the number of books for each author in descending order.
    - Which book has the highest number of reviews?
    - Which are the ten best and the ten worst books by average score?
    - Explore the different languages in the books dataset, providing a proper chart summarizing how these languages are distributed throughout our virtual library.
    - How many books have more than 250 pages?
    - Plot the distribution of the fans count for the 50 most prolific authors (the ones who have written the most books).
3. [__RQ3__] *Let’s have a historical look at the dataset!*

    - Write a function that takes as input a year and returns as output the following information:

       - The number of books published that year.
   
       - The total number of pages written that year.
   
       - The most prolific month of that year.
   
       - The longest book written that year.
   
    - Use this function to build your data frame: the primary key will be a year, and the required information will be the attributes within the row. Finally, show the head and the tail of this new data frame considering the first ten years registered and the last ten years.
   
    - Ask **ChatGPT** or any other LLM chatbot tool to implement this function and compare your work with the one the bot gave you as an answer. Does the chatbot implementation work? Please test it out and verify the correctness of the implementation, explaining the process you followed to prove it. 
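
The yearly summary described above can be sketched as follows. This is a minimal, library-agnostic illustration, not a reference solution: it assumes each book is a dict with the keys `title`, `num_pages`, and `publication_date` (a string beginning with `YYYY-MM`). Those key names, the function name `year_summary`, and the sample records are assumptions made for the example and should be checked against the real schema.

```python
from collections import Counter

def year_summary(books, year):
    """Summarise the books published in `year`.

    `books` is an iterable of dicts; the keys 'title', 'num_pages' and
    'publication_date' are ASSUMED names, not confirmed by the assignment.
    """
    months = Counter()
    n_books = 0
    total_pages = 0
    longest = None  # (pages, title) of the longest book seen so far

    for book in books:
        parts = (book.get("publication_date") or "").split("-")
        # Skip records with a missing or malformed year, or another year.
        if not parts[0].isdigit() or int(parts[0]) != year:
            continue
        n_books += 1
        pages = book.get("num_pages") or 0
        total_pages += pages
        if len(parts) > 1 and parts[1].isdigit():
            months[int(parts[1])] += 1
        if longest is None or pages > longest[0]:
            longest = (pages, book.get("title"))

    return {
        "year": year,
        "n_books": n_books,
        "total_pages": total_pages,
        "top_month": months.most_common(1)[0][0] if months else None,
        "longest_book": longest[1] if longest else None,
    }

# Made-up sample records, for illustration only.
sample = [
    {"title": "A", "num_pages": 100, "publication_date": "2001-03-01"},
    {"title": "B", "num_pages": 300, "publication_date": "2001-03-15"},
    {"title": "C", "num_pages": 250, "publication_date": "2001-07"},
    {"title": "D", "num_pages": 900, "publication_date": "1999-01-01"},
]
summary = year_summary(sample, 2001)
```

Calling the function once per year and collecting the returned dicts gives the rows of the data frame, with `year` as the key. The same small sample is also a convenient way to check a chatbot-generated version against hand-computed answers.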

4. [__RQ4__] *Quirky questions about consistency*. In most cases, we will not have a consistent dataset, and the one we are dealing with is no exception. So, let's enhance our analysis.
     - You should be sure there are no **eponymous** authors (different authors who have precisely the same name) in the authors dataset. Is it true?
     -  Write a function that, given a list of `author_id` values, outputs a dictionary where each `author_id` is a key, and the related value is a list with the titles of all the books the author has written.
     -  What is the **longest book title** among the books of the top 20 authors regarding their average rating? Is it the longest book title overall?
     -  What is the shortest overall book title in the dataset? If you find something strange, provide a comment on what happened and an alternative answer.
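
One possible shape for the name check and the dictionary-building function is sketched below. It assumes authors are dicts with `id` and `name`, and books are dicts with `author_id` and `title`; these key names and both function names are hypothetical and only serve the illustration.

```python
from collections import defaultdict

def shared_names(authors):
    """Map each author name used by more than one id to the set of those ids."""
    ids_by_name = defaultdict(set)
    for author in authors:
        # 'name' and 'id' are assumed key names.
        ids_by_name[author["name"]].add(author["id"])
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}

def books_by_author(author_ids, books):
    """Return {author_id: [titles]} for the requested ids.

    Ids with no books map to an empty list, so every requested id is a key.
    """
    titles = {author_id: [] for author_id in author_ids}
    for book in books:
        # 'author_id' and 'title' are assumed key names.
        author_id = book.get("author_id")
        if author_id in titles:
            titles[author_id].append(book.get("title"))
    return titles

# Made-up records, for illustration only.
authors = [
    {"id": 1, "name": "Jane Doe"},
    {"id": 2, "name": "John Roe"},
    {"id": 3, "name": "Jane Doe"},
]
books = [
    {"title": "A", "author_id": 1},
    {"title": "B", "author_id": 1},
    {"title": "C", "author_id": 2},
]
duplicates = shared_names(authors)
lookup = books_by_author([1, 3], books)
```

Building the result dictionary up front keeps a single pass over the books, which matters when the books file is large.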
       
5. [__RQ5__] *We can consider the authors with the most fans to be influential. Let’s have a deeper look.*
   - Plot the top 10 most influential authors regarding their fan count and number of books. Who is the most influential author?
   - Have they published any series of books? If any, extract the longest series name among these authors.
   - How many of these authors have been published in different formats? Provide a meaningful chart on the distribution of the formats and comment on it. 
   - Provide information about the general response from readers (number of fans, average rating, number of reviews, etc.), divide the authors by gender, and comment about anything eventually related to “structural bias.” You may want to have a look at the following recommended readings:
     
     - https://bhm.scholasticahq.com/article/38021
     - https://priyanka-ddit.medium.com/how-to-deal-with-imbalanced-dataset-86de86c49
     - https://compass.onlinelibrary.wiley.com/doi/10.1111/soc4.12962

     You can even ask ChatGPT or any other LLM chatbot tool: try to formulate a prompt that provides helpful information about it. Put that information in your notebook and provide comments on what you found.
     
6. [__RQ6__] *For this question, consider the top 10 authors concerning the number of fans again.*
   - Provide the average time gap between two subsequent publications for a series of books and those not belonging to a series. What do you expect to see, and what is the actual answer to this question?
   - For each of the authors, give a convenient plot showing how many books the given author has published **UP TO** a given year. Are these authors contemporary with each other? Can you notice a range of years where their production rate was higher?
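
The two quantities asked for here, the average gap between subsequent publications and the number of books published up to a given year, can be computed from a plain list of publication years. The sketch below assumes the years have already been extracted per author (or per series); the function names and the sample years are hypothetical.

```python
from collections import Counter
from itertools import accumulate

def mean_gap(years):
    """Average gap, in years, between consecutive publications."""
    ordered = sorted(years)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps) if gaps else None

def cumulative_books(years):
    """Return (sorted distinct years, books published up to and including each)."""
    per_year = Counter(years)
    ordered = sorted(per_year)
    totals = list(accumulate(per_year[year] for year in ordered))
    return ordered, totals

# Made-up publication years for one author, for illustration only.
years = [2001, 1999, 2001, 2005]
gap = mean_gap(years)
xs, ys = cumulative_books(years)
```

The pair `xs`, `ys` can be passed to any plotting tool as a step or line chart, one curve per author, which makes overlapping careers and bursts of production easy to spot.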

7. [__RQ7__] *Estimating probabilities is a core skill for a data scientist: show us your best!*
   - Estimate the probability that a book has over 30% of the ratings above 4.
   - Estimate the probability that an author publishes a new book within two years of their last work.
   - In the file [*list.json*](https://www.kaggle.com/datasets/opalskies/large-books-metadata-dataset-50-mill-entries?select=list.json), you will find a peculiar list named **"The Worst Books of All Time."** Estimate the probability of a book being included in this list, knowing it has more than 700 pages.
   - Are the events *X=’Being Included in The Worst Books of All Time list’* and *Y=’Having more than 700 pages’* independent? Explain how you have obtained your answer.
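
For the last two points, the conditional probability is estimated as P(X | Y) = P(X and Y) / P(Y), and X and Y are independent exactly when P(X and Y) = P(X) P(Y). A minimal sketch with relative frequencies follows. It assumes the ids of the books in the list have already been collected into a set, and that each book is a dict with `id` and `num_pages`; these names, the function name, and the sample data are assumptions.

```python
def list_probabilities(books, worst_ids, page_threshold=700):
    """Estimate P(X), P(Y), P(X and Y) and P(X | Y) by relative frequency.

    X = 'book is in the list', Y = 'book has more than page_threshold pages'.
    Books without a page count are skipped. 'id' and 'num_pages' are ASSUMED keys.
    """
    n = n_x = n_y = n_xy = 0
    for book in books:
        pages = book.get("num_pages")
        if pages is None:
            continue
        n += 1
        in_list = book["id"] in worst_ids
        is_long = pages > page_threshold
        n_x += in_list
        n_y += is_long
        n_xy += in_list and is_long
    if n == 0:
        return None
    return {
        "p_x": n_x / n,
        "p_y": n_y / n,
        "p_xy": n_xy / n,
        "p_x_given_y": n_xy / n_y if n_y else None,
    }

# Made-up data: books 1-4 are long, books 5-8 are short.
books = [{"id": i, "num_pages": 800 if i <= 4 else 200} for i in range(1, 9)]
worst_ids = {1, 2, 3, 5}
probs = list_probabilities(books, worst_ids)
product = probs["p_x"] * probs["p_y"]
```

On real data the two sides of the independence identity will rarely match exactly, so the comparison of `probs["p_xy"]` with `product` is best backed by a statistical test, for example a chi-square test of independence on the 2x2 table of counts.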

8. [__RQ8__] *Charts, statistical tests, and analysis methods are splendid tools to illustrate your data-driven decisions to check whether a hypothesis is correct.*
   - Can you demonstrate that readers usually rate the longest books as the worst?
   - Compare the average rating distribution for English and non-English books with a proper statistical procedure. What can you conclude about those two groups?
   - About the two groups in the previous question, extract helpful statistics like mode, mean, median, and quartiles, explaining their role in a box plot.
   - It seems reasonable to assume that authors with more fans should have more reviews, but maybe their fans are a bit *lazy*. Confirm or reject this with a convenient statistical test or a predictive model.
   - Provide a short survey about helpful statistical tests in data analysis and mining: focus on hypothesis design and the difference between parametric and nonparametric tests, explaining the reasons behind the choice of one of these two tests.
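
As one illustration of a nonparametric procedure for comparing two groups, the sketch below runs a two-sided permutation test on the difference of means. It is only one of several reasonable choices (a t-test is the usual parametric counterpart, and the Mann-Whitney U test is a common rank-based alternative), and the ratings used here are made up.

```python
import random

def permutation_test(group_a, group_b, n_resamples=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Null hypothesis: both groups come from the same distribution, so the
    group labels are exchangeable. Returns an estimated p-value.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    size_a = len(group_a)
    size_b = len(group_b)
    observed = abs(sum(group_a) / size_a - sum(group_b) / size_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        mean_a = sum(pooled[:size_a]) / size_a
        mean_b = sum(pooled[size_a:]) / size_b
        if abs(mean_a - mean_b) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)

# Made-up average ratings for two groups of books.
english = [4.1, 4.3, 4.0, 4.2, 4.4, 4.5, 4.1, 4.3]
other = [3.1, 3.3, 3.0, 3.2, 3.4, 3.5, 3.1, 3.3]
p_value = permutation_test(english, other)
```

A small p-value suggests that the observed gap is unlikely under label shuffling. The test makes no normality assumption, which is the main reason to prefer a nonparametric procedure when rating distributions are skewed or bounded.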

### Bonus points

**1.**

- Select one alternative library to Pandas (e.g., Dask, Polars, Vaex, Datatable), load the [authors.json](https://www.kaggle.com/datasets/opalskies/large-books-metadata-dataset-50-mill-entries) dataset, and filter authors with at least 100 reviews. Do the same using Pandas and compare the performance in milliseconds.

- Select one alternative library to Pandas (e.g., Dask, Polars, Vaex, Datatable), load [books.json](https://www.kaggle.com/datasets/opalskies/large-books-metadata-dataset-50-mill-entries), and join it with [authors.json](https://www.kaggle.com/datasets/opalskies/large-books-metadata-dataset-50-mill-entries) on `author_id`. How many books don’t have a match for the author?
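
A small timing harness makes the millisecond comparison repeatable. The sketch below uses only the standard library and times a plain-Python stand-in for the filter; the same `time_ms` helper can wrap the Pandas call and the alternative library's call. The helper names, the field name `text_reviews_count`, and the sample records are assumptions.

```python
import time

def time_ms(func, *args, repeats=5):
    """Run func(*args) `repeats` times; return (best time in ms, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best, result

def filter_authors(authors, min_reviews=100):
    # 'text_reviews_count' is an ASSUMED field name for the number of reviews.
    return [a for a in authors if (a.get("text_reviews_count") or 0) >= min_reviews]

def count_unmatched(books, authors):
    """Count books whose author_id has no matching author id (an anti-join)."""
    author_ids = {a["id"] for a in authors}
    return sum(1 for b in books if b.get("author_id") not in author_ids)

# Made-up records, for illustration only.
authors = [{"id": i, "text_reviews_count": i * 10} for i in range(1, 21)]
books = [
    {"id": 1, "author_id": 3},
    {"id": 2, "author_id": 99},
    {"id": 3, "author_id": None},
]
elapsed_ms, kept = time_ms(filter_authors, authors)
unmatched = count_unmatched(books, authors)
```

Taking the best of several runs reduces noise from caching and background load. For a fair comparison, time loading and filtering separately, since lazy libraries may defer work until a result is actually requested.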

**2.** *Every book should have a field named description, and any author should have a field named description. Choose one of the two and perform a text-mining analysis:*

- If you choose to text-mine the **descriptions** in [**books.json**](https://www.kaggle.com/datasets/opalskies/large-books-metadata-dataset-50-mill-entries), try to find a way to group books into genres using whatever procedure you want, highlighting the words that trigger these choices.

- If you choose to text-mine the **about field** in [**authors.json**](https://www.kaggle.com/datasets/opalskies/large-books-metadata-dataset-50-mill-entries), try to find a way to group authors into genres using whatever procedure you want, highlighting the words that trigger these choices.

- If you feel comfortable and did **both** tasks, analyze the matching of the two procedures. You grouped books and authors in genres. Do these two procedures show correspondence?
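
One simple way to surface candidate trigger words is TF-IDF weighting, which scores a word highly when it is frequent in one text but rare across the collection. The sketch below is a bare-bones standard-library version with a naive tokenizer and no stop-word list; the function names and the three sample descriptions are made up, and a real analysis would normally follow this step with a clustering method.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters only (a naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def top_terms(docs, k=3):
    """Return the k highest TF-IDF terms of each document.

    Ties are broken alphabetically so the output is deterministic.
    """
    tokenized = [tokenize(doc) for doc in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(docs)
    result = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result.append(ranked[:k])
    return result

# Made-up descriptions, for illustration only.
docs = [
    "The dragon and the wizard",
    "The detective and the murder",
    "The dragon and the sword",
]
keywords = top_terms(docs, k=2)
```

Words that appear in every text, such as "the" and "and" here, get a score of zero and drop out of the ranking without an explicit stop-word list.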