In [1]:
import pandas as pd
from collections import Counter, defaultdict
import csv

# Introduction

This notebook provides several different approaches to solving the book recommendation dataset problem posed in the last class of Foundations of Data Science. The problem is reposted below.

**Problem**: Create a function that takes a number (an integer) as an argument and returns the books in the dataset that have been recommended that number of times or more.

There is no one "right" way to approach this problem. The beauty of programming in Python (or any other language) is that there are numerous tools at your disposal for building solutions. Therefore, the goal of this notebook is not to provide "correct" answers, but to expose you to different approaches to the problem. The more tools, packages, and approaches you are exposed to, the more successful you will be in solving diverse problems down the road. That said, while there are abundant ways to solve any one problem, you should always strive for code that is as readable and simple as possible (without sacrificing functionality or, in some cases, efficiency).

### When Confronted with Something New or Foreign...

There will likely be parts of the different solutions below that are new to you. This notebook will provide links to useful resources for reading up on some of the methods and functions, so be on the lookout for those. If you come across something new or something you don't entirely understand, your first instinct should not be to charge ahead, but to Google the new method, package, syntax, etc. and try to find illuminating examples. Also try to break the code down into smaller pieces to understand those blocks before tackling the code in its entirety. And it never hurts to test out a piece of the code for yourself, using your own numbers or variables.

For example, say you are looking at a code block that makes use of a dictionary object `{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}` called `my_dict` and uses `my_dict.items()`. You may not know what `my_dict.items()` gives you. There is no better way to find out than running it in a separate cell to see what the output looks like. See below for an example.

In [136]:
my_dict = {'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}

for item in my_dict.items():
    
    print(item)

('key1', 'value1')
('key2', 'value2')
('key3', 'value3')


If you're new to the concept of a dictionary object, you should take time to read this great [Dictionaries in Python](https://realpython.com/python-dicts/) article. It will walk you through the basics.

### Breaking Down the Problem

Before diving into this problem, you should first identify what we are trying to calculate and what features in the dataset are relevant/needed. The dataset does not have a column with recommendation counts, so we need to work with the existing columns to arrive at a recommendation count for the books.

Each instance in the dataset (a recommendation) has several key features: the title of the book being recommended, the author of the book, and the name of the recommender. The values of these features are found under the `Title`, `Author`, and `Recommender` columns, respectively. A book in the dataset can be uniquely identified by its title and author together (two different books may share the same title, but they will not also share the same author). Therefore, you need to calculate the number of distinct recommenders for each unique book (that is, each title and author pairing). A recommender is identified by name, so you are, in essence, counting the number of unique recommender names for each unique book.
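
To make the counting idea concrete before touching the real data, here is a minimal sketch in plain Python. The rows and names below (`rows`, `recommenders_by_book`) are made up for illustration and are not taken from the dataset. The sketch shows that a repeated recommendation is counted once, and that two books with the same title but different authors are counted separately.

```python
# Hypothetical (title, author, recommender) rows; not taken from the real dataset.
rows = [
    ("Book A", "Author X", "Alice"),
    ("Book A", "Author X", "Bob"),
    ("Book A", "Author X", "Alice"),  # a duplicate recommendation
    ("Book A", "Author Y", "Carol"),  # same title, different author
]

# Map each (title, author) pair to the set of distinct recommender names.
recommenders_by_book = {}
for title, author, recommender in rows:
    recommenders_by_book.setdefault((title, author), set()).add(recommender)

# The recommendation count for a book is the size of its set.
counts = {book: len(names) for book, names in recommenders_by_book.items()}
print(counts)
```

Because a set ignores repeated values, Alice's second recommendation of the first book does not inflate its count.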

# Quick Dataset Exploration

This section reads the book recommendations dataset into the notebook as a dataframe and presents a few quick ways to explore some of the dataset's characteristics.

In [2]:
#In this cell, I set the relative path for where the dataset is located on my machine.
#You should set this according to where the dataset is located on your own machine.
data_path = 'data/master_list.txt'

Below, you read in the dataset using Pandas' `.read_csv()` method. Remember to define the separator/delimiter as tab.

In [180]:
book_df = pd.read_csv(data_path, sep='\t')

Calling `.shape` on a dataframe will return a tuple, with the first number being the total number of records (or instances), and the second number being the number of columns/fields/features. Based on the output below, you see that the dataset has 1037 instances (not necessarily unique ones) and 12 columns.

In [138]:
book_df.shape

(1037, 12)

The `.head()` method for a dataframe is useful for getting a glimpse of what your data looks like. You can pass it an integer to return that number of records. If you do not pass anything, it defaults to 5 records. Note that calling `.tail()` on the dataframe will return the last 5 records in the dataset.

In [181]:
book_df.head()

Unnamed: 0,Title,Author,Recommender,Source,Amazon_Link,Description,Type,Genre,Length,Publish_Year,On_List,Review_Excerpt
0,1984,George Orwell,Paul Coelho,http://theweek.com/articles/514936/best-books--chosen-by-paulo-coelho,http://www.amazon.com/1984-Signet-Classics-George-Orwell/dp/0451524934,,,,,,,
1,1984,George Orwell,Steven Pinker,http://www.americanscientist.org/bookshelf/pub/steven-pinker,http://www.amazon.com/1984-Signet-Classics-George-Orwell/dp/0451524934,,,,,,,
2,The Accidental Superpower,Peter Zeihan,Fareed Zakaria,http://globalpublicsquare.blogs.cnn.com/2012/07/23/a-list-of-fareeds-gps-book-recommendations/,http://www.amazon.com/Accidental-Superpower-Generation-American-Preeminence/dp/1455583669,,,,,,,
3,Act of Congress,Robert Kaiser,Lawrence Lessig,https://www.ted.com/talks/lawrence_lessig_we_the_people_and_the_republic_we_must_reclaim/recommendations,http://www.amazon.com/Act-Congress-Americas-Essential-Institution/dp/0307744515,,,,,,,
4,"Age of Ambition: Chasing Fortune, Truth, and Faith in the New China",Evan Osnos,Fareed Zakaria,http://globalpublicsquare.blogs.cnn.com/2012/07/23/a-list-of-fareeds-gps-book-recommendations/,http://www.amazon.com/Age-Ambition-Chasing-Fortune-Truth/dp/0374535272,,,,,,,


Based on the output above, it looks like our dataset contains many null values, which appear as NaN in Pandas (NaN is the default missing value marker). The code below provides the number of null values in each column. If interested, see this [introductory article](https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/) on working with missing values in Pandas. It introduces other functions for detecting, removing, and replacing null values in a Pandas dataframe.

In [142]:
book_df.isnull().sum()

Title             0   
Author            0   
Recommender       0   
Source            0   
Amazon_Link       650 
Description       1031
Type              1037
Genre             1037
Length            1031
Publish_Year      1031
On_List           1031
Review_Excerpt    1031
dtype: int64

A quick note on null/missing values: most machine learning algorithms cannot handle missing values. Therefore, it is usually a good idea to explore your data and check whether it contains any. You will learn how to handle them later on in the program. For the problem at hand, however, you can ignore the null values because they fall within columns/features that you will not use. You are interested in the `Title`, `Author`, and `Recommender` columns, none of which has missing values.

The cell below identifies whether or not there are duplicate records in the dataset based on the `Title`, `Author`, and `Recommender` columns. It is usually important to identify duplicate records and deduplicate them before working on a dataset. In your case, if there are duplicate records, and you do not realize this, you may overestimate the count of unique recommendations for certain books.

The following cell makes use of Pandas' [`.duplicated` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html), which can be called on a dataframe object to identify duplicate records. This [article](https://thispointer.com/pandas-find-duplicate-rows-in-a-dataframe-based-on-all-or-selected-columns-using-dataframe-duplicated-in-python/) provides a solid walkthrough of how this method works. Below, you consider duplicate records based on the `Title`, `Author`, and `Recommender` columns. `book_df.duplicated(subset=['Title', 'Author', 'Recommender'])` returns a boolean series of True and False. By default, True marks a row that repeats an earlier row, while the first occurrence of each record is marked False. You can pass this series back into the original dataframe `book_df`, which will filter `book_df` to show only the rows corresponding with True: in other words, the duplicate records!

In [182]:
book_df[book_df.duplicated(subset=['Title', 'Author', 'Recommender'])]

Unnamed: 0,Title,Author,Recommender,Source,Amazon_Link,Description,Type,Genre,Length,Publish_Year,On_List,Review_Excerpt
406,An Intimate History of Humanity,Theodor Zeldin,Alain de Botton,http://alaindebotton.com/book-list/,,,,,,,,
439,"The Mantle of Command: FDR at War, 1941-1942",Nigel Hamilton,Fareed Zakaria,http://globalpublicsquare.blogs.cnn.com/2012/07/23/a-list-of-fareeds-gps-book-recommendations/,,,,,,,,
817,The Principles of Psychology,William James,Sam Harris,https://www.samharris.org/book_store/all_sams_recommendations/P144,,,,,,,,
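
Notice that the output above lists only the repeated rows, not the first occurrence of each record. The sketch below illustrates this on a tiny made-up dataframe (`toy_df` and its values are hypothetical). It compares the default behavior with `keep=False`, which marks every member of a duplicated group.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are the same recommendation.
toy_df = pd.DataFrame({
    "Title": ["Book A", "Book A", "Book B"],
    "Recommender": ["Alice", "Alice", "Bob"],
})

# Default (keep='first'): only the later repeat is marked True.
first_kept = toy_df.duplicated(subset=["Title", "Recommender"])
print(first_kept.tolist())  # [False, True, False]

# keep=False: every row that belongs to a duplicated group is marked True.
all_marked = toy_df.duplicated(subset=["Title", "Recommender"], keep=False)
print(all_marked.tolist())  # [True, True, False]
```

Using `keep=False` can be handy when you want to inspect both copies of a record side by side.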


Based on the output above, you can see that the dataset has three duplicate records, with one of these being for the book 'An Intimate History of Humanity.' If you filter the dataframe `book_df` for rows where the `Title` column is this book, you see that there are, indeed, two rows (with one duplicate record). 

Note that the code below uses two equal signs, not one. This is an operator that compares two operands (in this case, the values in the `Title` column against the title An Intimate History of Humanity) and results in True if they are equal and False otherwise.

In [183]:
book_df[book_df['Title'] == 'An Intimate History of Humanity']

Unnamed: 0,Title,Author,Recommender,Source,Amazon_Link,Description,Type,Genre,Length,Publish_Year,On_List,Review_Excerpt
405,An Intimate History of Humanity,Theodor Zeldin,Alain de Botton,http://alaindebotton.com/book-list/,,,,,,,,
406,An Intimate History of Humanity,Theodor Zeldin,Alain de Botton,http://alaindebotton.com/book-list/,,,,,,,,


In the cell below, the [`.drop_duplicates` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) is called on the dataframe to drop all duplicate records (except for the first instances). Setting the `inplace` parameter to True tells Pandas to modify the dataframe in place rather than return a modified copy. You can print the shape of the dataframe again to confirm that the 3 duplicate records were dropped.

In [184]:
book_df.drop_duplicates(subset=['Title', 'Author', 'Recommender'], inplace=True)
book_df.shape

(1034, 12)
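
If you would rather leave the original dataframe untouched, one alternative is to omit `inplace=True` and assign the result to a new variable. The sketch below shows this on a small made-up dataframe (`toy_df` and `deduped` are hypothetical names, not part of the notebook).

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are the same recommendation.
toy_df = pd.DataFrame({
    "Title": ["Book A", "Book A", "Book B"],
    "Recommender": ["Alice", "Alice", "Bob"],
})

# Without inplace=True, drop_duplicates returns a new dataframe.
deduped = toy_df.drop_duplicates(subset=["Title", "Recommender"])

print(toy_df.shape)            # (3, 2): the original is unchanged
print(deduped.shape)           # (2, 2): one duplicate row was removed
print(deduped.index.tolist())  # [0, 2]: surviving rows keep their original labels
```

Note that the surviving rows keep their original index labels, so the index may have gaps after deduplication.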

# Solutions

This section presents four different functions for solving the problem. After each function is defined, there will be a cell for testing out the function and viewing its output. You will notice that the output for each function differs, but satisfies the problem requirements.

The function below relies on Pandas' [`.groupby` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html), which can be very useful for grouping large amounts of data and computing operations or calculations over these groups. You pass the `Title` and `Author` columns to the `.groupby` to group based on title and author (i.e., distinct books).

It also makes use of [`.nunique()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.SeriesGroupBy.nunique.html), which will return the count of unique elements in each group of the `.groupby`. It is worth noting that because you have deduplicated your data in the steps above, you could simply replace `.nunique()` with `.count()`. If you hadn't deduplicated your data, then `.nunique()` would have still been successful in counting the number of unique recommenders for each book, while `.count()` would not have been. In the code below, `.nunique()` is called on the `Recommender` column of the grouped data, which counts the number of unique recommenders per group (or book). Calling `.reset_index()` on the grouped data returns a dataframe object.

The last line in the function below filters the dataframe for records with a recommender count greater than or equal to the integer you pass into the function. See [this article](https://thispointer.com/python-pandas-select-rows-in-dataframe-by-conditions-on-multiple-columns/) on filtering dataframes based on condition(s) for a better understanding of the conditional dataframe filtering syntax below.

You will also notice that the dataframe returned by the function is sorted using [`.sort_values()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html), which is used to sort the instances from highest to lowest recommender count, making the dataframe easier to read.

In [121]:
def recommender_counter(integer):
    
    recommender_counts = book_df.groupby(['Title', 'Author'])['Recommender'].nunique().reset_index(name='Count')
    
    return recommender_counts[recommender_counts['Count'] >= integer].sort_values(by=['Count'], ascending=False)

In [174]:
recommender_counter(4)

Unnamed: 0,Title,Author,Count
810,The Selfish Gene,Richard Dawkins,5
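
To see the difference between `.nunique()` and `.count()` described above, here is a small sketch on made-up data that has deliberately not been deduplicated (`toy_df` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical data: Alice's recommendation of Book A appears twice.
toy_df = pd.DataFrame({
    "Title": ["Book A", "Book A", "Book A", "Book B"],
    "Author": ["Author X", "Author X", "Author X", "Author Y"],
    "Recommender": ["Alice", "Alice", "Bob", "Carol"],
})

grouped = toy_df.groupby(["Title", "Author"])["Recommender"]

# nunique() counts distinct recommenders per book.
unique_counts = grouped.nunique().reset_index(name="Count")
print(unique_counts["Count"].tolist())  # [2, 1]

# count() counts rows per book, so the duplicate inflates Book A's total.
row_counts = grouped.count().reset_index(name="Count")
print(row_counts["Count"].tolist())  # [3, 1]
```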


The function below produces output different from the one above. It prints out the books' titles and authors according to their recommendation counts.

It first makes use of the built-in function [`zip()`](https://docs.python.org/3.3/library/functions.html#zip). This function, which you can read more about [here](https://data-flair.training/blogs/python-zip-function/), takes the values in the `Title` column and the values in the `Author` column and pairs them together. Order is retained such that the first item in each passed iterable (the series of Title values and the series of Author values) is paired together, then the second item in each is paired together, and so on. Each resulting tuple in the zip object identifies a book, since it contains both the title and author information.

The zip object is passed to a [`Counter`](https://docs.python.org/2/library/collections.html#collections.Counter) object, which you imported from Python's [`collections`](https://docs.python.org/2/library/collections.html) module in the first cell of this notebook. The Counter is useful for quickly counting elements passed to it. If you pass it a list of elements, it will store the elements as dictionary keys with dictionary values representing their occurrence counts. So when the zip object is passed to the Counter below, the Counter will return how many times each tuple (or book) appears. This is equivalent to counting how many times each book has been recommended, since the dataset has been deduplicated. See [this article](https://data-flair.training/blogs/python-counter/) for some more examples using Counter.
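
As a quick illustration of how `zip()` and `Counter` work together, here is a sketch using two short made-up lists in place of the dataframe columns (`titles` and `authors` are hypothetical).

```python
from collections import Counter

# Hypothetical stand-ins for the Title and Author columns.
titles = ["Book A", "Book B", "Book A"]
authors = ["Author X", "Author Y", "Author X"]

# zip pairs items by position; list() lets us look at the result.
pairs = list(zip(titles, authors))
print(pairs)

# Counter tallies how many times each (title, author) tuple appears.
book_counts = Counter(pairs)
print(book_counts[("Book A", "Author X")])  # 2

# most_common(n) returns the n most frequent items with their counts.
print(book_counts.most_common(1))
```

Tuples work as Counter keys because they are hashable, which is why the title and author are paired as a tuple rather than a list.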

The function below makes use of another dictionary subclass, [`defaultdict`](https://docs.python.org/2/library/collections.html#collections.defaultdict), from the `collections` module. A defaultdict is initialized with a function (the "default factory") that it calls to produce a value whenever a missing key is accessed. In the code below, it is initialized with `list`, so every new key starts out with an empty list. This is useful because you can append values for repeated keys to one value list without first checking whether the key exists. Take a look at [this article](https://towardsdatascience.com/python-pro-tip-start-using-python-defaultdict-and-counter-in-place-of-dictionary-d1922513f747) for some more insight into defaultdicts and Counters. In the cell below, defaultdict is used to group the books together based on their recommendation counts, with its keys representing the recommendation counts and its values being the lists of books corresponding with those counts.
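
The grouping step can be tried in isolation. The sketch below uses a small made-up dictionary of counts (`book_counts` is hypothetical) and inverts it so that each count maps to the list of books with that count.

```python
from collections import defaultdict

# Hypothetical recommendation counts per book.
book_counts = {"Book A": 2, "Book B": 1, "Book C": 2}

# Each new key automatically starts with an empty list.
books_by_count = defaultdict(list)
for book, count in book_counts.items():
    books_by_count[count].append(book)

print(dict(books_by_count))  # {2: ['Book A', 'Book C'], 1: ['Book B']}

# Looking up a missing key returns an empty list and also creates the key.
print(books_by_count[5])       # []
print(5 in books_by_count)     # True
```

The last two lines show a side effect worth knowing about: simply reading a missing key adds it to the defaultdict.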

The function then builds the sequence of counts to print. It passes the integer argument and the maximum recommendation count (plus 1, because `range()` excludes its stop value) to [`range()`](https://docs.python.org/3/library/functions.html#func-range), and wraps the result in [`reversed()`](https://docs.python.org/3/library/functions.html#reversed), producing an iterator. The for loop is therefore restricted to defaultdict keys running from the maximum count (which in this dataset is 5 recommendations) down to the integer argument, that is, counts greater than or equal to the passed integer, as desired.
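
Here is a short sketch of the `reversed(range(...))` pattern on its own, using made-up values for the two bounds (`minimum` and `max_count` are hypothetical names).

```python
# Hypothetical bounds: the integer argument and the largest count in the data.
minimum = 3
max_count = 5

# range() excludes its stop value, hence max_count + 1.
descending = list(reversed(range(minimum, max_count + 1)))
print(descending)  # [5, 4, 3]

# An equivalent way to count down, using a negative step.
same = list(range(max_count, minimum - 1, -1))
print(same)  # [5, 4, 3]

# If the minimum exceeds the maximum count, the range is empty and the loop never runs.
empty = list(reversed(range(6, max_count + 1)))
print(empty)  # []
```

The last case is worth keeping in mind: a function built on this pattern prints nothing at all when the requested minimum is larger than every count in the data.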

Finally, the function prints out the books with counts in this range in a readable format. To better understand the function below, you may also want to read up on Python's [f-strings](https://realpython.com/python-f-strings/) and read more on [`range()` and `reversed()`](https://realpython.com/python-range/).

In [199]:
def recommender_counter2(integer):
    
    title_author_tuples = zip(book_df['Title'], book_df['Author'])
    
    counter = Counter(title_author_tuples)
    
    counter_dict = defaultdict(list)

    for k,v in counter.items():
        
        counter_dict[v].append(k)

    max_count = max(counter_dict.keys())
    
    for k in reversed(range(integer, max_count + 1)):
        
        #if counter_dict[k] checks if the value for the key k is empty and runs only if the value is not empty
        if counter_dict[k]:
            print(f"\n{k} Recommendations:\n")
            
            for book in counter_dict[k]:
                
                print(f"{book[0]} by {book[1]}")
                            
        else:
            print(f"\n{k} Recommendations:\n\nNone!")

In [186]:
recommender_counter2(3)


5 Recommendations:

The Selfish Gene by Richard Dawkins

4 Recommendations:

None!

3 Recommendations:

The Blind Watchmaker: Why the Evidence of Evolution Reveals a Universe without Design by Richard Dawkins
The Singularity is Near: When Humans Transcend Biology by Ray Kurzweil
Tao Te Ching by Lao Tzu
Thinking, Fast and Slow by Daniel Kahneman
The Really Hard Problem: Meaning in a Material World by Owen Flanagan
The History of the Decline and Fall of the Roman Empire by Edward Gibbon
The Demon-Haunted World: Science as a Candle in the Dark by Carl Sagan
Why Evolution Is True by Jerry Coyne
The Character of Physical Law by Richard Feynman


Iteration plays an important role in the next two functions shown below. They are presented in this notebook because iteration is often more approachable for people who are newer to programming. However, be wary of iteration, especially with regard to dataframes. It is [generally slow and often unneeded](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758#55557758).

In the `recommender_counter3` function below, the code employs Pandas' [`.iterrows()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html) method, which lets you loop through a dataframe's rows. It returns an iterator that yields the index of each row together with the row's data as a series. The function loops through the dataframe's rows and counts (with the use of Counter) how many times each book appears, which is equivalent to its recommendation count, since the dataframe has been deduplicated. The function then uses Pandas' [`.from_dict()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html), which takes a dictionary object and transforms it into a dataframe, with the option of setting the dictionary's keys as the dataframe's columns or rows. Because the Counter dictionary object is passed to `.from_dict` with the argument `orient='index'`, the Counter's keys (books) become the dataframe's row labels.
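
To see what `orient='index'` does before reading the full function, here is a sketch with a tiny made-up Counter (`toy_counts` is hypothetical).

```python
from collections import Counter

import pandas as pd

# Hypothetical counts keyed by a "Title by Author" string.
toy_counts = Counter({"Book A by Author X": 2, "Book B by Author Y": 1})

# orient='index' turns each dictionary key into a row label.
counter_df = pd.DataFrame.from_dict(toy_counts, orient="index", columns=["Count"])
print(counter_df.index.tolist())

# reset_index moves the labels into a regular column named 'index',
# which is then renamed to something more descriptive.
counter_df = counter_df.reset_index().rename(columns={"index": "Book"})
print(counter_df.columns.tolist())  # ['Book', 'Count']
```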

In [200]:
#This cell customizes Pandas' display behavior by removing the limit on column width,
#so you can see the entire book title and author without them getting cut off.
#Passing None means "no limit"; the older value of -1 is deprecated in newer Pandas versions.
pd.set_option('display.max_colwidth', None)

In [201]:
def recommender_counter3(integer):
    
    count_dict = Counter()
    
    for index, row in book_df.iterrows():
                
        book = f"{row['Title']} by {row['Author']}"
        
        count_dict[book] += 1
        
    counter_df = pd.DataFrame.from_dict(count_dict, orient='index', columns=['Count'])
    
    counter_df.reset_index(inplace=True)
    
    counter_df.rename(columns={'index': 'Book'}, inplace=True)
    
    return counter_df[counter_df['Count'] >= integer].sort_values(by=['Count'], ascending=False)

In [202]:
recommender_counter3(3)

Unnamed: 0,Book,Count
261,The Selfish Gene by Richard Dawkins,5
28,The Blind Watchmaker: Why the Evidence of Evolution Reveals a Universe without Design by Richard Dawkins,3
266,The Singularity is Near: When Humans Transcend Biology by Ray Kurzweil,3
290,Tao Te Ching by Lao Tzu,3
294,"Thinking, Fast and Slow by Daniel Kahneman",3
432,The Really Hard Problem: Meaning in a Material World by Owen Flanagan,3
455,The History of the Decline and Fall of the Roman Empire by Edward Gibbon,3
511,The Demon-Haunted World: Science as a Candle in the Dark by Carl Sagan,3
529,Why Evolution Is True by Jerry Coyne,3
544,The Character of Physical Law by Richard Feynman,3
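
For comparison with the row-by-row loop above, here is one possible vectorized sketch that avoids `.iterrows()`. It is not part of the original solutions; the function name `count_books_vectorized` and the data in `toy_df` are hypothetical. Like `recommender_counter3`, it assumes the dataframe has already been deduplicated.

```python
import pandas as pd

# Hypothetical, already-deduplicated data.
toy_df = pd.DataFrame({
    "Title": ["Book A", "Book A", "Book B"],
    "Author": ["Author X", "Author X", "Author Y"],
})

def count_books_vectorized(df, minimum):
    # Build the "Title by Author" label for every row at once.
    books = df["Title"] + " by " + df["Author"]
    # value_counts tallies each label and sorts from highest to lowest count.
    counts = books.value_counts()
    # Keep only the books that meet the minimum count.
    return counts[counts >= minimum]

result = count_books_vectorized(toy_df, 2)
print(result.to_dict())  # {'Book A by Author X': 2}
```

The string concatenation and `value_counts()` each operate on whole columns, which is the usual way to sidestep explicit loops in Pandas.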


The last function in this notebook, `recommender_counter4`, moves away from Pandas and uses Python's [`csv`](https://docs.python.org/3/library/csv.html) module, which was imported in the first cell. It specifically reads the data in using the module's [`DictReader`](https://docs.python.org/3/library/csv.html#csv.DictReader), which maps the information in each row into a dictionary. The data in the row can be accessed using the column names in the dataset, passing them as keys.

Again, a Counter object is used to count the number of times each book appears in the dataset. However, there is a catch. Because you are reading the data in anew, the data will not be deduplicated. Therefore, an empty list `unique_recommendations` is established to store each unique book and recommender name pairing as you iterate through the data using `DictReader`. The conditional statement, `if book_recommendation not in unique_recommendations:`, tells Python to only increase the count for an encountered book and its recommender if they do not represent a duplicate record (checking to see if they had already been appended to the `unique_recommendations` list as a record).

The final line of the function utilizes [dictionary comprehension](https://www.datacamp.com/community/tutorials/python-dictionary-comprehension) to filter out books with recommendation counts less than the passed integer. It also uses the built-in [`sorted()`](https://docs.python.org/3/howto/sorting.html) function to return the books sorted from highest to lowest count.

In [206]:
def recommender_counter4(integer):
    
    #The csv module's documentation recommends opening files with newline=''
    with open(data_path, newline='') as tsv:

        reader = csv.DictReader(tsv, delimiter='\t')

        counter_dict = Counter()

        unique_recommendations = []

        for row in reader:

            book = f"{row['Title']} by {row['Author']}"

            book_recommendation = [book, row['Recommender']]

            if book_recommendation not in unique_recommendations:

                counter_dict[book] += 1

                unique_recommendations.append(book_recommendation)

    return {k:v for k,v in sorted(counter_dict.items(), key=lambda item: item[1], reverse=True) if v >= integer}   

In [207]:
recommender_counter4(3)

{'The Selfish Gene by Richard Dawkins': 5,
 'The Blind Watchmaker: Why the Evidence of Evolution Reveals a Universe without Design by Richard Dawkins': 3,
 'The Singularity is Near: When Humans Transcend Biology by Ray Kurzweil': 3,
 'Tao Te Ching by Lao Tzu': 3,
 'Thinking, Fast and Slow by Daniel Kahneman': 3,
 'The Really Hard Problem: Meaning in a Material World by Owen Flanagan': 3,
 'The History of the Decline and Fall of the Roman Empire by Edward Gibbon': 3,
 'The Demon-Haunted World: Science as a Candle in the Dark by Carl Sagan': 3,
 'Why Evolution Is True by Jerry Coyne': 3,
 'The Character of Physical Law by Richard Feynman': 3}
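
One design note on `recommender_counter4`: checking membership in a list gets slower as the list grows, because Python may have to scan every element. A possible variation, sketched below under stated assumptions, stores each pairing as a tuple in a set instead. The function name `count_unique_recommendations` and the in-memory data `toy_tsv` are hypothetical; `io.StringIO` stands in for the real file so the example runs on its own.

```python
import csv
import io
from collections import Counter

# Hypothetical tab-separated data standing in for the real file.
toy_tsv = (
    "Title\tAuthor\tRecommender\n"
    "Book A\tAuthor X\tAlice\n"
    "Book A\tAuthor X\tAlice\n"  # duplicate record
    "Book A\tAuthor X\tBob\n"
    "Book B\tAuthor Y\tCarol\n"
)

def count_unique_recommendations(tsv_file, minimum):
    reader = csv.DictReader(tsv_file, delimiter="\t")
    seen = set()        # (book, recommender) pairs already counted
    counts = Counter()
    for row in reader:
        book = f"{row['Title']} by {row['Author']}"
        key = (book, row["Recommender"])  # tuples are hashable, lists are not
        if key not in seen:
            seen.add(key)
            counts[book] += 1
    # most_common() yields (book, count) pairs from highest to lowest count.
    return {book: n for book, n in counts.most_common() if n >= minimum}

result = count_unique_recommendations(io.StringIO(toy_tsv), 2)
print(result)  # {'Book A by Author X': 2}
```

For a dataset of about a thousand rows the difference is negligible, but the set-based check scales better on larger files.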

Similar to dictionary comprehension, there is the concept of list comprehension. See this [article](https://realpython.com/list-comprehension-python/) for a great introduction to list comprehension.
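
As a small side-by-side sketch, the two kinds of comprehension can apply the same filter to a made-up dictionary of counts (`book_counts` is hypothetical): one keeps the counts, the other keeps only the book names.

```python
# Hypothetical recommendation counts per book.
book_counts = {"Book A": 5, "Book B": 1, "Book C": 3}

# Dictionary comprehension: keep key/value pairs that pass the filter.
popular_dict = {book: n for book, n in book_counts.items() if n >= 3}
print(popular_dict)  # {'Book A': 5, 'Book C': 3}

# List comprehension: same filter, but collect only the book names.
popular_list = [book for book, n in book_counts.items() if n >= 3]
print(popular_list)  # ['Book A', 'Book C']
```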