# Jupyter Exercises
[Jupyter](https://jupyter.org/) is a great tool for interactively running code (most commonly Python) against data while keeping Markdown documentation alongside it. If you have a local Jupyter instance (or are interested in setting one up), feel free to use it. Otherwise, we recommend using [Try Jupyter](https://jupyter.org/try) with the **Try Classic Notebook** option. Be sure to export or back up your work frequently in case the instance is deprovisioned.

Alternatively, feel free to write standalone Python code that can be executed via the CLI. We think pretty highly of the Notebook-based approach, though, and believe that you'll enjoy that approach, too.

Here's a process that you can use with Try Jupyter to get up and running:

1. Go to [https://jupyter.org/try](https://jupyter.org/try)
1. Select ***Try Classic Notebook***
1. Once your Jupyter instance comes up, open the ***File => Open...*** menu option and upload the following files that we've included:
    * `Exercises.Jupyter.ipynb`
    * `show_titles.csv`
1. Once uploaded, open `Exercises.Jupyter.ipynb`. If you've since left the ***Files*** tab, you can return via ***File => Open...***
1. Select the ***Cell => Run All*** menu option to execute the current statements in the notebook
1. Read through the Notebook and follow the instructions

Once you're done with the exercises, use ***File => Download as => Notebook (.ipynb)*** to download a copy of your notebook. Include this downloaded file in your submissions back to HealthMine.

# Dataset
Included alongside this file is a CSV of a public Netflix show dataset that we'll be using for the following exercises. Feel free to profile the dataset using any tool of your choosing to become familiar with it.

Given that the exercises below will be using the [pandas](https://pandas.pydata.org/) Python package, we'll go ahead and load the file for you and output some basic dataset characteristics.

In [137]:
import pandas

df = pandas.read_csv('show_titles.csv', sep = '|')

In [138]:
print('Number of rows: %d' % len(df.axes[0]))
print('Number of columns: %d' % len(df.axes[1]))

Number of rows: 6239
Number of columns: 12
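
As a side note, the same two numbers are also available from `DataFrame.shape`. The sketch below is only an illustration: it reads a made-up three-row, pipe-delimited sample from memory (the column subset and the titles are assumptions, not rows from `show_titles.csv`) so that it runs without the real file.

```python
import io

import pandas

# Hypothetical pipe-delimited sample; NOT the contents of show_titles.csv
sample = io.StringIO(
    "show_id|type|title\n"
    "1|Movie|Example A\n"
    "2|TV Show|Example B\n"
    "3|Movie|Example C\n"
)

# sep='|' matches the delimiter used when loading the real file above
sample_df = pandas.read_csv(sample, sep="|")

# shape is a (rows, columns) tuple
n_rows, n_cols = sample_df.shape
print("Number of rows: %d" % n_rows)
print("Number of columns: %d" % n_cols)
```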


In [139]:
print(df.dtypes)

show_id          int64
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object


In [140]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...


# Exercises

Using the existing `df` DataFrame reference as a starting point, write Python statements that answer the following questions:

---
### When considering all of the columns, how many duplicated rows are there?

_The first occurrence of a duplicated row should not be counted. Stated another way, if there are only two rows that are identical in the entire dataset, then the answer would be 1._

In [141]:
# just compare the length of the original and the deduped list to find the number of duplicates
len(df) - len(df.drop_duplicates())

5

### Solution E1
5
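
For reference, the length-difference approach should agree with summing the boolean mask from `DataFrame.duplicated()`, since `duplicated()` leaves the first occurrence unmarked by default. A minimal sketch on a made-up frame (the `toy` data is an assumption for illustration only):

```python
import pandas

# Hypothetical toy data: the row ("A", 2016) appears three times
toy = pandas.DataFrame(
    {
        "title": ["A", "B", "A", "A"],
        "release_year": [2016, 2017, 2016, 2016],
    }
)

# Approach used in the cell above: compare lengths before and after deduping
dupes_by_length = len(toy) - len(toy.drop_duplicates())

# Alternative: duplicated() marks every repeat except the first occurrence
dupes_by_mask = int(toy.duplicated().sum())

print(dupes_by_length, dupes_by_mask)
```

Both counts exclude the first occurrence, which is what the exercise asks for.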

---

#### Out of curiosity...

In [142]:
# just because I'm curious, here are the indices of the dupes
df.duplicated().loc[lambda x: x]
# see that they're at the end.  were they possibly added for the sake of testing future employees? :D

6234    True
6235    True
6236    True
6237    True
6238    True
dtype: bool

In [143]:
len(df)

6239

In [144]:
df.tail()
# interesting choices...

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
6234,80000063,TV Show,Red vs. Blue,,"Burnie Burns, Jason Saldaña, Gustavo Sorola, G...",United States,,2015,NR,13 Seasons,"TV Action & Adventure, TV Comedies, TV Sci-Fi ...","This parody of first-person shooter games, mil..."
6235,70286564,TV Show,Maron,,"Marc Maron, Judd Hirsch, Josh Brener, Nora Zeh...",United States,,2016,TV-MA,4 Seasons,TV Comedies,"Marc Maron stars as Marc Maron, who interviews..."
6236,80116008,Movie,Little Baby Bum: Nursery Rhyme Friends,,,,,2016,,60 min,Movies,Nursery rhymes and original music for children...
6237,70281022,TV Show,A Young Doctor's Notebook and Other Stories,,"Daniel Radcliffe, Jon Hamm, Adam Godley, Chris...",United Kingdom,,2013,TV-MA,2 Seasons,"British TV Shows, TV Comedies, TV Dramas","Set during the Russian Revolution, this comic ..."
6238,70153404,TV Show,Friends,,"Jennifer Aniston, Courteney Cox, Lisa Kudrow, ...",United States,,2003,TV-14,10 Seasons,"Classic & Cult TV, TV Comedies",This hit sitcom follows the merry misadventure...


In [145]:
# drop duplicates to clean for the rest of the problems
df = df.drop_duplicates()

---
### How many unique TV shows were released in the UK in 2016?

_If a TV show was released in multiple countries, then each country should be considered independently._

In [146]:
# check that there's not different values like UK, Ireland, etc.
# noticed that there's some CSV stuff happening so split that into its values and put it in a set
# make it into a function to use for exercise 3 as well for some EDA

def clean_csv_column(df, col):
    # definitely a better way to do this using more pandas operations and actually cleaning up the dataframe
    # but since it's not a part of the exercise and is more just for my curiosity I'm going to leave as is for time purposes
    # QUICK AND DIRTY python native for the most part...
    unique_dirty = df.dropna()[col].unique()
    unique_cleaned = set()
    for uni in unique_dirty:
        split_list = uni.split(',')
        for split_val in split_list:
            unique_cleaned.add(split_val.strip())

    return unique_cleaned
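
The comments above mention that a more pandas-native version is possible. One way it might look is sketched below, using `Series.str.split` followed by `Series.explode`. The name `unique_values` and the toy frame are hypothetical. Note one behavioural difference: this sketch drops missing values only in the chosen column, whereas `df.dropna()[col]` drops any row that has a missing value in any column.

```python
import pandas


def unique_values(frame, col):
    """Return the set of distinct comma-separated values found in frame[col]."""
    values = (
        frame[col]
        .dropna()         # ignore rows where this column is missing
        .str.split(",")   # "a, b" -> ["a", " b"]
        .explode()        # one row per list element
        .str.strip()      # remove the padding left over from the split
    )
    return set(values)


# Hypothetical toy data for illustration only
toy = pandas.DataFrame(
    {"country": ["United States, India", None, "India, United Kingdom"]}
)
toy_countries = unique_values(toy, "country")
print(sorted(toy_countries))
```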
    

In [147]:
clean_csv_column(df, "country")

{'',
 'Afghanistan',
 'Albania',
 'Argentina',
 'Australia',
 'Austria',
 'Bangladesh',
 'Belgium',
 'Bermuda',
 'Brazil',
 'Bulgaria',
 'Cambodia',
 'Canada',
 'Cayman Islands',
 'Chile',
 'China',
 'Colombia',
 'Croatia',
 'Czech Republic',
 'Denmark',
 'Dominican Republic',
 'East Germany',
 'Ecuador',
 'Egypt',
 'Finland',
 'France',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Guatemala',
 'Hong Kong',
 'Hungary',
 'Iceland',
 'India',
 'Indonesia',
 'Iran',
 'Iraq',
 'Ireland',
 'Israel',
 'Italy',
 'Japan',
 'Jordan',
 'Kenya',
 'Latvia',
 'Lebanon',
 'Liechtenstein',
 'Luxembourg',
 'Malawi',
 'Malaysia',
 'Malta',
 'Mexico',
 'Montenegro',
 'Morocco',
 'Nepal',
 'Netherlands',
 'New Zealand',
 'Nicaragua',
 'Nigeria',
 'Norway',
 'Pakistan',
 'Panama',
 'Paraguay',
 'Peru',
 'Philippines',
 'Poland',
 'Portugal',
 'Qatar',
 'Romania',
 'Russia',
 'Saudi Arabia',
 'Senegal',
 'Serbia',
 'Singapore',
 'Slovakia',
 'Slovenia',
 'Somalia',
 'South Africa',
 'South Korea',
 'Sovi

In [133]:
# since the data set looks fine, we can just start filtering it.  
# United Kingdom should be the only thing that we have to look for, but make sure to check those CSVs that we found earlier

# tried using a Python "in" at first, since I wasn't familiar with .str.contains for a Series, but after messing around for a 
# bit, found some documentation... https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html

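# NOTE: there is no filter on 'type' here, so the result below includes movies as well as TV shows
e2 = df.loc[
    df['country'].str.contains(r"United Kingdom", na=False)
    & (df['release_year'] == 2016),
    'title'
].unique()

print(len(e2))
print(e2)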

92
['Jandino: Whatever it Takes' 'Bobby Sands: 66 Days' 'White Island'
 'Come and Find Me' 'The White Helmets' 'The Five' 'The Last Shaman'
 'Saudi Arabia Uncovered' 'A Plastic Ocean' 'Compulsion'
 'Into the Inferno' 'Free Fire' 'Horror Homes' 'Dark Crimes'
 'City in the Sky' 'I Am Bolt' "Nature's Great Race" 'PJ Masks'
 'When Two Worlds Collide' 'Sour Grapes' 'Paranoid' 'Kaleidoscope'
 'Forever Pure' 'Detour' 'Elizabeth at 90: A Family Tribute' 'Christine'
 'The Intent' 'Bridget Christie: Stand Up for Her' 'FirstBorn'
 'The Rolling Stones: Olé Olé Olé! A Trip Across Latin America'
 'Jimmy Carr: Funny Business' 'My Beautiful Broken Brain'
 'Notes on Blindness' 'Forces of Nature' 'Koko: The Gorilla Who Talks'
 'SuperNature: Wild Flyers' 'The Darkest Dawn' 'Crashing' 'Brahman Naman'
 'Mary Portas: Secret Shopper' 'The Exception' 'City of Tiny Lights'
 'Encounters with Evil' 'Sex Doll' 'Under the Shadow' 'Retribution'
 'The Rezort' 'Hostage to the Devil' 'The Pass' "Don't Knock Twice"
 'K

### Solution E2
92
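
The filter above matches on country and release year only, so titles such as 'Jandino: Whatever it Takes' (a `Movie` in `df.head()`) are part of the 92. A sketch of how a `type` condition could be combined into the same mask is shown below on made-up rows. The `toy` frame is an assumption for illustration, and the count this would give on the real dataset is not shown here.

```python
import pandas

# Hypothetical toy data; titles and values are made up
toy = pandas.DataFrame(
    {
        "type": ["TV Show", "Movie", "TV Show", "TV Show"],
        "title": ["Show A", "Movie B", "Show C", "Show D"],
        "country": [
            "United Kingdom, United States",
            "United Kingdom",
            None,
            "United Kingdom",
        ],
        "release_year": [2016, 2016, 2016, 2015],
    }
)

# Combine all three conditions with & so the filter is a single boolean mask;
# na=False treats a missing country as "no match"
mask = (
    (toy["type"] == "TV Show")
    & toy["country"].str.contains("United Kingdom", na=False, regex=False)
    & (toy["release_year"] == 2016)
)

uk_2016_shows = toy.loc[mask, "title"].unique()
print(len(uk_2016_shows))
print(uk_2016_shows)
```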

---
### List the top 5 directors that directed the most action movies in order of the number of movies directed.

_If more than one person directed a given movie, then they should be counted individually._

In [121]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autob..."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,When a prison ship crash unleashes hundreds of...
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,When nerdy high schooler Dani finally attracts...


In [96]:
# make sure there's not any hidden genres that are like action that we are missing
[actionish for actionish in clean_csv_column(df, 'listed_in') if 'action' in actionish.lower()]
# looks like we're just looking for Action & Adventure

['TV Action & Adventure', 'Action & Adventure']

In [130]:
# filter just the action movies
action_movies = df.loc[
    df['listed_in'].str.contains(r"Action & Adventure")
].loc[
    ~df['listed_in'].str.contains(r"TV Action & Adventure")
]

# run a value_counts on the director column after splitting it on commas
# NOTE: this counts each full list of co-directors as one key, so co-directors are not counted individually
action_movies['director'].str.split(',').value_counts()

[S.S. Rajamouli]       7
[Johnnie To]           5
[Don Michael Paul]     5
[Wilson Yip]           4
[Quentin Tarantino]

### Solution E3
1. S.S. Rajamouli
2. Johnnie To
3. Don Michael Paul
4. Wilson Yip
5. Quentin Tarantino
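
The exercise asks for co-directors to be counted individually, while `value_counts()` on the split lists treats each list as a single key. One possible way to get per-director counts is to `explode` the lists first. The sketch below uses made-up names in a hypothetical `toy` frame; whether the ranking above would change on the real data is not something this example shows.

```python
import pandas

# Hypothetical toy data; director names are placeholders
toy = pandas.DataFrame(
    {
        "type": ["Movie", "Movie", "Movie", "TV Show"],
        "director": ["Dir One, Dir Two", "Dir One", None, "Dir Three"],
        "listed_in": [
            "Action & Adventure, Comedies",
            "Action & Adventure",
            "Action & Adventure",
            "TV Action & Adventure",
        ],
    }
)

# Restrict to movies explicitly, then match the genre as a plain substring
is_action_movie = (toy["type"] == "Movie") & toy["listed_in"].str.contains(
    "Action & Adventure", na=False, regex=False
)

per_director = (
    toy.loc[is_action_movie, "director"]
    .dropna()         # movies with no listed director cannot be attributed
    .str.split(",")   # "Dir One, Dir Two" -> ["Dir One", " Dir Two"]
    .explode()        # one row per individual director
    .str.strip()
    .value_counts()   # sorted by count, descending
)

top_directors = per_director.head(5)
print(top_directors)
```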