# Pandas lecture 2

In the previous lecture, we spent time understanding the data types `Series` and `DataFrame` and some basic operations on them. Among those operations, we learned how to add new columns and filter rows using the operators that `pandas` implements for the `Series` and `DataFrame` types.

But since Python is a general-purpose language, we should be able to define arbitrarily complex conditions and formulas for these operations, without limiting ourselves to the operators implemented by `pandas`.

The next few examples should make it clear what we are talking about.

In [1]:
import pandas as pd
import json

In [2]:
movies_df = pd.read_csv('movies.csv')
movies_df.head(5)

Unnamed: 0,adult,budget,genres,id,imdb_id,original_language,original_title,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,title,vote_average,vote_count
0,False,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,tt0114709,en,Toy Story,21.946943,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033,81,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Toy Story,7.7,5415
1,False,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,tt0113497,en,Jumanji,17.015539,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249,104,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Jumanji,6.9,2413
2,False,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,tt0113228,en,Grumpier Old Men,11.7129,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0,101,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Grumpier Old Men,6.5,92
3,False,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,tt0114885,en,Waiting to Exhale,3.859495,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156,127,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Waiting to Exhale,6.1,34
4,False,0,"[{'id': 35, 'name': 'Comedy'}]",11862,tt0113041,en,Father of the Bride Part II,8.387519,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911,106,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Father of the Bride Part II,5.7,173


In [3]:
len(movies_df)

45466

## Filtering by boolean indexing

### apply() method on Series

This dataset contains information about the movies present in the Internet Movies database. A lot of the information, like the genres of each movie, is stored in a **JSON**-like format (note the single quotes, which is why the code below swaps them for double quotes before parsing).

Let's say we wanted to create a filtered `DataFrame` containing only Comedy movies. The filtering approach we have already learnt will not work in this case, as there is no `pandas` operator or built-in function that checks for `name == Comedy` inside a custom JSON expression.

The approach that we will use allows us to write _any custom function_ that looks at the value of _one column_ in a row and returns `True` if we want to keep this row, and `False` otherwise. We have defined such a function below. It takes the genre JSON string and the genre to filter on, and returns `True` if the movie's genre list contains that genre.

In [4]:
def is_genre(json_str, genre):
    genres = []
    json_list = json.loads(json_str.replace("'", '"'))
    for genre_json in json_list:
        genres.append(genre_json['name'])
    
    return genre in genres
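
Before applying the function to the whole column, it can help to try it on a single value. The following is a small sketch that repeats the same logic so it runs on its own; `sample_genres` is a hand-written string in the same shape as the `genres` column, not a row taken from the dataset.

```python
import json

def is_genre(json_str, genre):
    # Same logic as the function above, repeated so this snippet is self-contained.
    json_list = json.loads(json_str.replace("'", '"'))
    return genre in [genre_json['name'] for genre_json in json_list]

# Hand-written sample in the same format as the `genres` column (illustrative only).
sample_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

is_comedy = is_genre(sample_genres, 'Comedy')
is_drama = is_genre(sample_genres, 'Drama')
print(is_comedy, is_drama)  # True False
```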

The next step is to create a new boolean column, which contains `True` if that row should be kept, and `False` otherwise. Given that we already have a function that returns that boolean value, we just need a way to create a new column that is filled with the return value of that function.

In `pandas`, the `Series` type has a method called `apply()` - which takes as input a function with one argument. When called on the `Series`, it returns a new `Series` that consists of the output of the function _applied_ to each value in the original `Series`.

We create a new column called `is_comedy` using this method.

In [5]:
movies_df['is_comedy'] = movies_df['genres'].apply(lambda x: is_genre(x, 'Comedy'))
movies_df['is_comedy'].head(5)

0     True
1    False
2     True
3     True
4     True
Name: is_comedy, dtype: bool

**Aside:** For those not familiar with the `lambda` syntax, this is a compact way in Python to write small functions without using the full function syntax. The syntax is `lambda <input variables>: <expression for return value>`. This is extremely useful for data analysis.
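
As a quick illustration of the aside, the two definitions below behave the same way. The names `square` and `square_lambda` are made up for this sketch; in practice a lambda is usually passed directly as an argument rather than assigned to a name.

```python
# A regular function definition...
def square(x):
    return x ** 2

# ...and the equivalent lambda, bound to a name only for comparison.
square_lambda = lambda x: x ** 2

results = (square(4), square_lambda(4))
print(results)  # (16, 16)
```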

As you can see in the code above, we call the `apply()` method on the `genres` column of the `DataFrame`, pass in a function which checks if the movie is a Comedy movie, and assign the resulting `Series` to a new column called `is_comedy`.

Now, we can simply use a regular filter and filter rows by checking if the value of column `is_comedy` is `True`. Note that in the code below, we omit `== True`. Why?

In [6]:
comedy_movies = movies_df[movies_df['is_comedy']]
comedy_movies.head(10)

Unnamed: 0,adult,budget,genres,id,imdb_id,original_language,original_title,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,title,vote_average,vote_count,is_comedy
0,False,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,tt0114709,en,Toy Story,21.946943,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033,81,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Toy Story,7.7,5415,True
2,False,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,tt0113228,en,Grumpier Old Men,11.7129,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0,101,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Grumpier Old Men,6.5,92,True
3,False,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,tt0114885,en,Waiting to Exhale,3.859495,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156,127,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Waiting to Exhale,6.1,34,True
4,False,0,"[{'id': 35, 'name': 'Comedy'}]",11862,tt0113041,en,Father of the Bride Part II,8.387519,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911,106,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Father of the Bride Part II,5.7,173,True
6,False,58000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...",11860,tt0114319,en,Sabrina,6.677277,"[{'name': 'Paramount Pictures', 'id': 4}, {'na...","[{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...",1995-12-15,0,127,"[{'iso_639_1': 'fr', 'name': 'Français'}, {'is...",Released,Sabrina,6.2,141,True
10,False,62000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",9087,tt0112346,en,The American President,6.318445,"[{'name': 'Columbia Pictures', 'id': 5}, {'nam...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-11-17,107879496,106,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The American President,6.5,199,True
11,False,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 27, 'nam...",12110,tt0112896,en,Dracula: Dead and Loving It,5.430331,"[{'name': 'Columbia Pictures', 'id': 5}, {'nam...","[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",1995-12-22,0,88,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Dracula: Dead and Loving It,5.7,210,True
17,False,4000000,"[{'id': 80, 'name': 'Crime'}, {'id': 35, 'name...",5,tt0113101,en,Four Rooms,9.026586,"[{'name': 'Miramax Films', 'id': 14}, {'name':...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-09,4300000,98,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Four Rooms,6.5,539,True
18,False,30000000,"[{'id': 80, 'name': 'Crime'}, {'id': 35, 'name...",9273,tt0112281,en,Ace Ventura: When Nature Calls,8.205448,"[{'name': 'O Entertainment', 'id': 5682}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-11-10,212385533,90,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Ace Ventura: When Nature Calls,6.1,1128,True
19,False,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 35, 'nam...",11517,tt0113845,en,Money Train,7.337906,"[{'name': 'Columbia Pictures', 'id': 5}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-11-21,35431113,103,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Money Train,5.4,224,True


In [7]:
len(comedy_movies)

13182

Just to recap, this method teaches us two things:

- How to use the `apply()` method on a `Series` to create a new column using any custom function. This is the crux of why this works.
- Using the new column to filter the rows.

This approach to filtering is called **Filtering by Boolean Indexing**.
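
The two steps in the recap can be sketched on a tiny, made-up table. The column names and values in `toy_df` are assumptions for illustration only and are unrelated to `movies.csv`.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'runtime': [81, 104, 101, 127],
})

# Step 1: build a boolean Series with apply() and any custom function.
is_long = toy_df['runtime'].apply(lambda minutes: minutes > 100)

# Step 2: use that Series as a mask; only rows where it is True are kept.
long_movies = toy_df[is_long]
print(long_movies['title'].tolist())  # ['B', 'C', 'D']
```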

**Exercise:** Find all successful movies made by `Columbia Pictures`. A successful movie is one whose revenue is greater than the budget.

### apply() method on DataFrame

If you've completed the exercise above, you will have noticed that you needed another filtering step to judge the success of the movie, after you've applied the boolean indexing filter. This was required because you couldn't include the values of other columns in your function.

In general, we would like to be able to get all the columns in a row to be able to compute the value of a new column that we want to create. Fortunately, the way to do that is similar to what we've already done: there is an `apply()` on the `DataFrame` class as well. Let's learn a bit more about it.

In [8]:
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['A', 'B'])
df

Unnamed: 0,A,B
0,1,2
1,3,4
2,5,6


In [9]:
# Series apply(). We get a new Series with the application of the function.
df['A'].apply(lambda x: x**2)

0     1
1     9
2    25
Name: A, dtype: int64

In [10]:
# DataFrame apply(). By default (axis=0) the function is called once per column,
# with that column as a Series; x**2 then squares every value in it.
df.apply(lambda x: x**2)

Unnamed: 0,A,B
0,1,4
1,9,16
2,25,36


But, to create a new column, this is not what we are looking for. We want that function to be applied to each row, and we want a `Series` as a return value.

To do that, the `apply()` method on `DataFrame` exposes another argument called `axis`. As in coordinate geometry, an axis defines a direction of movement. For example, in a 2-D plane, we have two axes which are perpendicular to each other.

![2d-axes](https://raw.githubusercontent.com/amangup/data-analysis-bootcamp/master/07-Pandas2/axes.jpeg)

Their meaning is very similar in `pandas`. Our dataset is also two-dimensional: it has rows and columns. The value of `axis` can be 0 or 1, where
- 0 is the direction that runs down through all the rows, so the function is called once per column, and
- 1 is the direction that runs across all the columns, so the function is called once per row.

Here is a diagram to show that:

![2d-axes](https://raw.githubusercontent.com/amangup/data-analysis-bootcamp/master/07-Pandas2/data-frame-axes.png)


Let's look at our small dataframe again. Let's say we want to compute the sum of all elements in each row. What we can do is:

In [11]:
df['sum'] = df.apply(sum, axis=1)
df

Unnamed: 0,A,B,sum
0,1,2,3
1,3,4,7
2,5,6,11


With axis=1, our function (`sum` in this case), is called once for each row. The argument to our function is a `Series` representing the values of that row.
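
Because each row arrives as a `Series`, the function can look up individual columns by name. Here is a minimal sketch on a fresh copy of the small dataframe; the name `toy_df` and the product computation are made up for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['A', 'B'])

# With axis=1 the function receives one row at a time as a Series,
# so individual columns can be looked up by name.
product = toy_df.apply(lambda row: row['A'] * row['B'], axis=1)
print(product.tolist())  # [2, 12, 30]
```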

For completeness, let's also see what happens when `axis=0`. This case can be useful for creating a summary of the data.

In [12]:
df.loc[3] = df.apply(sum, axis=0)
df

Unnamed: 0,A,B,sum
0,1,2,3
1,3,4,7
2,5,6,11
3,9,12,21


As you can see, with axis=0, the function is applied once for each column. The input to our function is a `Series` representing all the values in the column.

Now let's do our exercise again, using the `apply()` method on `DataFrame`. Let's also twist it a bit, and look at failures instead of successes.

In [13]:
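def studio_failure(row, studio):
    # Parse the studio list; rows with missing or malformed data are skipped.
    try:
        json_list = json.loads(row['production_companies'].replace("'", '"'))
    except (AttributeError, ValueError):
        return False

    if not isinstance(json_list, list):
        return False

    studios = [studio_json['name'] for studio_json in json_list]

    # Convert to numbers first so that both 0 and '0' count as missing data.
    try:
        revenue = float(row['revenue'])
        budget = float(row['budget'])
    except (TypeError, ValueError):
        return False

    if revenue == 0 or budget == 0:
        return False

    # A failure is a movie that earned less than it cost.
    return (studio in studios) and revenue < budget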

In [14]:
columbia_failure_df = movies_df[movies_df.apply(lambda x: studio_failure(x, 'Columbia Pictures'), axis=1)]
columbia_failure_df.head(10)

Unnamed: 0,adult,budget,genres,id,imdb_id,original_language,original_title,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,title,vote_average,vote_count,is_comedy
19,False,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 35, 'nam...",11517,tt0113845,en,Money Train,7.337906,"[{'name': 'Columbia Pictures', 'id': 5}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-11-21,35431113,103,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Money Train,5.4,224,True
407,False,34000000,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",10436,tt0106226,en,The Age of Innocence,8.013617,"[{'name': 'Columbia Pictures', 'id': 5}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1993-09-17,32255440,139,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,The Age of Innocence,7.0,172,False
425,False,13000000,"[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name...",117553,tt0106505,en,Calendar Girl,1.316741,"[{'name': 'Columbia Pictures', 'id': 5}, {'nam...","[{'iso_3166_1': 'US', 'name': 'United States o...",1993-09-03,2570145,90,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Calendar Girl,3.7,10,True
453,False,35000000,"[{'id': 36, 'name': 'History'}, {'id': 28, 'na...",35588,tt0107004,en,Geronimo: An American Legend,4.653462,"[{'name': 'Columbia Pictures', 'id': 5}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1993-12-10,18635620,115,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Geronimo: An American Legend,5.9,44,False
974,False,38000000,"[{'id': 18, 'name': 'Drama'}, {'id': 53, 'name...",11306,tt0116259,en,Extreme Measures,4.990506,"[{'name': 'Columbia Pictures', 'id': 5}, {'nam...","[{'iso_3166_1': 'US', 'name': 'United States o...",1996-09-27,17380126,118,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Extreme Measures,5.7,80,False
1575,False,36000000,"[{'id': 53, 'name': 'Thriller'}, {'id': 878, '...",782,tt0119177,en,Gattaca,12.89312,"[{'name': 'Columbia Pictures', 'id': 5}, {'nam...","[{'iso_3166_1': 'US', 'name': 'United States o...",1997-09-07,12532777,106,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Gattaca,7.5,1846,False
2225,False,65000000,"[{'id': 27, 'name': 'Horror'}, {'id': 9648, 'n...",3600,tt0130018,en,I Still Know What You Did Last Summer,11.023042,"[{'name': 'Columbia Pictures', 'id': 5}, {'nam...","[{'iso_3166_1': 'US', 'name': 'United States o...",1998-11-13,40002112,100,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,I Still Know What You Did Last Summer,5.1,381,False
2594,False,24000000,"[{'id': 14, 'name': 'Fantasy'}, {'id': 35, 'na...",10208,tt0158811,en,Muppets from Space,7.531668,"[{'name': 'Columbia Pictures', 'id': 5}, {'nam...","[{'iso_3166_1': 'US', 'name': 'United States o...",1999-07-14,16290976,87,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Muppets from Space,5.8,94,True
2840,False,17000000,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",31650,tt0094008,en,Someone to Watch Over Me,3.50111,"[{'name': 'Columbia Pictures', 'id': 5}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1987-10-09,10278549,106,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Someone to Watch Over Me,5.8,38,False
3379,False,43000000,"[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name...",54087,tt0104928,en,Mr. Saturday Night,0.987196,"[{'name': 'Columbia Pictures', 'id': 5}, {'nam...",[],1992-09-23,13300000,119,[],Released,Mr. Saturday Night,5.2,17,True


Note that

- Because the data is not completely clean, the `studio_failure()` function has a bunch of conditions to eliminate certain rows.
- We don't really need to create a new column for the output of `df.apply()`. The filter step just needs a boolean `Series` with one value per row of the dataframe, and that's what `df.apply()` returns when `axis=1`.
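
One plausible reason some rows fail to parse (an assumption, not something verified against this dataset) is a name containing an apostrophe, which breaks the quote-replacement trick. The sketch below uses a hypothetical value `raw` to show the failure, and shows that the standard library's `ast.literal_eval` can read such a string without any replacement.

```python
import ast
import json

# Hypothetical value, written the way Python would print a name containing an apostrophe.
raw = "[{'name': \"Bob's Studio\", 'id': 1}]"

try:
    json.loads(raw.replace("'", '"'))
    json_ok = True
except json.JSONDecodeError:
    # The apostrophe became a stray double quote, so the text is no longer valid JSON.
    json_ok = False

# ast.literal_eval parses Python literals directly, so no quote replacement is needed.
parsed = ast.literal_eval(raw)
names = [company['name'] for company in parsed]
print(json_ok, names)  # False ["Bob's Studio"]
```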

## Exercises

In this set of exercises, we are going to use a data set that contains a large list of wines, their description and reviews.

Let's start with a few exercises to get to know the data better:

1. How many wines are there?
2. How many US wines are there?
3. What is the average rating of a wine?
4. Let's look at the distribution of ratings. First, add the following import statement in your notebook:

```python
import matplotlib.pyplot as plt
```

and then run the following code in a cell:

```python
raw_df['points'].hist()
plt.show()
```

5. Do exercises 3 and 4 only for wines from the US, and only for wines from France.
6. Do exercises 3, 4 and 5, but this time look at the price of wines instead of their rating. Note that the default histogram is not very useful here. Create a histogram of wines priced up to $100 instead.
7. Find the Pearson correlation coefficient between the price of a wine and its rating. Hint: check out the `corr()` method in the `DataFrame` class.
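
To get a feel for the `corr()` hint, here is a small sketch on made-up numbers. The columns `x` and `y` are assumptions and have nothing to do with the wine data; `y` is exactly twice `x`, so the correlation should come out as 1.

```python
import pandas as pd

# Made-up numbers, only to show the call; y is exactly 2 * x.
toy_df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8]})

# corr() returns a matrix of pairwise correlations (Pearson by default).
corr_matrix = toy_df.corr()
xy_corr = corr_matrix.loc['x', 'y']
print(round(xy_corr, 6))  # 1.0
```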

### Wine search

Now that we are familiar with the data, let's work on a mini project: create a wine search that highlights the top 5 wines whose descriptions match the search keywords.

We will use a well known method for keyword based ranking called the **tf-idf algorithm**.

- Let's say we want to search wines which mention `oak` in their description. 
   - We would like to highlight wines whose description mentions the word `oak` the maximum number of times.
   - This count of mentions is called **Term Frequency**, or **tf**.
   - The higher the **tf** of a wine, the higher its ranking.
- Let's say the search term is `oak wine`. 
   - `wine` is a common term and ideally we would not like to give that term much weight.
   - We will count the _number of descriptions_ in which the word `wine` appears. This count is the word's document frequency, and the **Inverse Document Frequency**, or **idf**, is computed from it.
   - If a word appears in most of the documents, it is a common word, and thus we should more or less ignore it.
   - The precise definition of **idf** is `idf(word) = log(N_docs / N_docs_with_word)`, where `N_docs` is the total number of descriptions, and `N_docs_with_word` is the number of descriptions that contain the specific word. You can see that `idf` decreases as `N_docs_with_word` increases.
   
Let's understand this using an example. Here is a sample description:

"There's plenty of oak to this solid, peppery Merlot that's also a touch green. The cassis and cherry fruit that drives the palate is healthy and sturdy, while the finish features some tight-grained oak and firm enough tannins. Maybe too much oak given the fruit quality."

- This sentence contains the word `oak` three times. Thus, `tf = 3`
- The word 'oak' appears in 8,721 descriptions out of a total of 129,971 descriptions. Thus, `idf = log(129971/8721) = 2.7`. I've used base = `e` here.
- The ranking score of this wine would be `tf * idf = 3 * 2.7 = 8.1` given the search term `oak`.
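
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the tokenizer (lowercase the text and keep runs of letters) is an assumption, and the document counts are the ones quoted above rather than values computed from the data.

```python
import math
import re

description = (
    "There's plenty of oak to this solid, peppery Merlot that's also a touch green. "
    "The cassis and cherry fruit that drives the palate is healthy and sturdy, "
    "while the finish features some tight-grained oak and firm enough tannins. "
    "Maybe too much oak given the fruit quality."
)

# Simplistic tokenizer (an assumption): lowercase, keep runs of letters.
words = re.findall(r"[a-z]+", description.lower())
tf = words.count('oak')

# Document counts quoted in the text above.
n_docs, n_docs_with_word = 129971, 8721
idf = math.log(n_docs / n_docs_with_word)  # natural log

score = tf * idf
print(tf, round(idf, 1), round(score, 1))  # 3 2.7 8.1
```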

Let's build our wine search engine by following these steps:

1. Write a function which takes a text and returns a dictionary with the word count for each word.
2. Add a column to your dataframe called `tf` that stores the word count dictionary corresponding to its description.
3. Create another dictionary called `idf_dict` by going over all the descriptions for all wines, and map each word to its **idf** score.
4. Write a function which takes as argument the search keywords (as a list of strings) that:
   - calculates the score of each wine as: `tf(word_1) * idf(word_1) + ... + tf(word_k) * idf(word_k)`
   - sorts the wines in descending order of their score (Hint: use `sort_values()` method on `DataFrame`)
   - Returns a dataframe with details of the top 5 scoring wines.
   
5. Implement your own search ranking scheme that gives a certain priority to the rating of the wine as an augmentation of the ranking algorithm above. You can also implement a max price filter - to only search among wines priced lower than the max price.
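
For the `sort_values()` hint in step 4, the sorting and top-5 selection might look roughly like this. The `wine` and `score` columns and their values are made up for illustration; computing the real scores is left to the exercise.

```python
import pandas as pd

# Made-up scores, only to illustrate the sorting step.
toy_df = pd.DataFrame({
    'wine': ['a', 'b', 'c', 'd', 'e', 'f'],
    'score': [2.7, 8.1, 0.0, 5.4, 1.3, 6.0],
})

# Sort by score, highest first, and keep the first five rows.
top5 = toy_df.sort_values('score', ascending=False).head(5)
print(top5['wine'].tolist())  # ['b', 'f', 'd', 'a', 'e']
```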