In [19]:
import time

In [20]:
import numpy as np
import pandas as pd

In [21]:
np.random.seed(0)

In [22]:
pd.options.mode.copy_on_write = True

# String Manipulation

Python is well-known for its ease of handling strings and text. The string object’s built-in methods simplify most text operations, while complex pattern matching tasks can be handled with regular expressions.

To apply string methods and regular expressions concisely to entire arrays of data while handling missing values, pandas provides the `.str` accessor on Series. The accessor offers many vectorized string operations that skip over and propagate NA values, along with parameters that control regular expression behavior, case sensitivity, and the handling of missing data.
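As a minimal sketch of the NA handling described above (the sample values are made up for illustration), the snippet below shows a missing value propagating through a vectorized method, and the `case` and `na` parameters of `str.contains` controlling the result:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value
s = pd.Series(['Apple', np.nan, 'banana'])

# The missing value is skipped by the method and propagated to the result
upper = s.str.upper()

# case=False ignores case; na=False treats missing values as non-matches
has_a = s.str.contains('a', case=False, na=False)

print(upper)
print(has_a)
```

Without `na=False`, the second element of `has_a` would stay missing, which matters when the result is used as a boolean mask.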

The `.str` attribute is accessible only for text data types, and pandas offers two ways to store text data:

- **object** dtype NumPy array
- **StringDtype** extension type

It's recommended to use `StringDtype` for storing text data since the `object` dtype is a generic dtype inherited from NumPy arrays, used to store any Python object type.

In contrast, `StringDtype` comes from pandas' extension type system, which allows data types that NumPy does not support natively. These types are treated as first-class citizens and include specific optimizations that make them computationally efficient.
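The two storage options can be compared side by side. The sketch below uses made-up sample data and requests each dtype explicitly through the `dtype` argument, where `'string'` is the alias for `StringDtype`:

```python
import pandas as pd

# Hypothetical sample data; None marks a missing value
data = ['apple', 'banana', None]

s_object = pd.Series(data, dtype='object')   # generic Python-object storage
s_string = pd.Series(data, dtype='string')   # pandas StringDtype

print(s_object.dtype)
print(s_string.dtype)

# Both Series support the .str accessor; the missing value stays missing
lengths = s_string.str.len()
print(lengths)
```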

## Vectorized String Methods 

The `str` accessor implements most of the built-in string handling methods available in Python, making it as straightforward to work with as traditional Python strings.

The complete list of available methods can be found in the [official pandas docs](https://pandas.pydata.org/docs/user_guide/text.html#method-summary).

In [23]:
# Concatenate strings in the Series/Index with a given separator.

s1 = pd.Series(['arthur', 'bruna', 'camila'])
s2 = pd.Series(['costa', 'silva', 'oliveira'])

s1.str.cat(s2, sep='.') + '@email.com'

0       arthur.costa@email.com
1        bruna.silva@email.com
2    camila.oliveira@email.com
dtype: object

In [24]:
# Counts occurrences of a pattern in each string.

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str.count('a')

0    1
1    3
2    0
dtype: int64

In [25]:
# Computes length of each string in the Series/Index.

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str.len()

0    5
1    6
2    6
dtype: int64

In [26]:
# Checks if each string in the Series/Index contains a pattern.

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str.contains('a')

0     True
1     True
2    False
dtype: bool

In [27]:
# Check if each string starts with a match of a pattern.

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str.match('a')

0     True
1    False
2    False
dtype: bool

In [28]:
# Finds all occurrences of a pattern in each string.

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str.findall('a')

0          [a]
1    [a, a, a]
2           []
dtype: object

In [29]:
# Replaces occurrences of pattern/regex with another string.

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str.replace('a', 'o')

0     opple
1    bonono
2    cherry
dtype: object

In [30]:
# Joins lists contained as elements in the Series/Index with passed delimiter.

s1 = pd.Series([['a', 'b', 'c'], ['d', 'e', 'f']])
s1.str.join('-')

0    a-b-c
1    d-e-f
dtype: object

In [31]:
# Splits each string in the Series/Index by given delimiter.
#
# Tip: Elements in the split lists can be accessed using get or [] notation.
# Alternatively, use 'expand=True' to turn the split elements
# into columns.

s1 = pd.Series(['a-b-c', 'd-e-f'])
s1.str.split('-')

0    [a, b, c]
1    [d, e, f]
dtype: object
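The tip in the cell above can be sketched as follows. The variable names are illustrative only:

```python
import pandas as pd

s = pd.Series(['a-b-c', 'd-e-f'])
parts = s.str.split('-')

first = parts.str[0]        # [] notation on the split lists
second = parts.str.get(1)   # equivalent access with .get()

# expand=True returns a DataFrame with one column per split element
expanded = s.str.split('-', expand=True)

print(first)
print(second)
print(expanded)
```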

In [32]:
# Removes leading and trailing characters from each string.

s1 = pd.Series(['  apple  ', '  banana', 'cherry   '])
s1.str.strip()

0     apple
1    banana
2    cherry
dtype: object

In [33]:
# Slice the first 3 characters of each string

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str.slice(start=0, stop=3)

0    app
1    ban
2    che
dtype: object

In [34]:
# Extract dummy variables from string columns.
# For example if they are separated by a '|':

s1 = pd.Series(["a", "a|b", np.nan, "a|c"])
s1

0      a
1    a|b
2    NaN
3    a|c
dtype: object

In [35]:
s1.str.get_dummies(sep="|")

   a  b  c
0  1  0  0
1  1  1  0
2  0  0  0
3  1  0  1


## Regular Expressions

Most methods of the `str` accessor support regular expressions (regex). By using regex patterns, we can perform powerful and flexible string manipulations.

In [36]:
# Regex pattern to find 'a' followed by any single character and then 'n'.
# None of the strings below contain such a sequence, so every result is False.

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str.contains(r'a.n')

0    False
1    False
2    False
dtype: bool

In [37]:
# Counts occurrences of a pattern in each string.

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str.count(r'na')

0    0
1    2
2    0
dtype: int64

In [38]:
# Check if strings match pattern of a letter followed by a digit

s1 = pd.Series(['a1', 'bx', 'c3'])
s1.str.match(r'[a-z]\d')  

0     True
1    False
2     True
dtype: bool
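Regular expressions are also useful for pulling structured pieces out of text. As a rough sketch with made-up e-mail addresses, `str.extract` returns one column per capture group, and `regex=False` switches a method back to literal matching:

```python
import pandas as pd

# Hypothetical e-mail addresses; the last entry does not match the pattern
emails = pd.Series(['arthur.costa@email.com', 'bruna.silva@email.com', 'not-an-email'])

# Named groups become column names in the resulting DataFrame
pattern = r'(?P<first>\w+)\.(?P<last>\w+)@(?P<domain>[\w.]+)'
parts = emails.str.extract(pattern)
print(parts)

# regex=False treats the pattern as a literal string, so '.' is just a dot
literal_dot = emails.str.contains('.', regex=False)
print(literal_dot)
```

Rows that do not match the pattern come back as missing values in every extracted column.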

## Python Operators: Arithmetic and Boolean

As seen in the *Essential Basic Functionality* notebook, pandas supports using traditional arithmetic and boolean operators to handle text data on Series and DataFrames, enabling element-wise operations similar to those in traditional Python string manipulation.

The drawback is that the overloaded Python operators are as simple as they come, whereas `str` accessor methods offer many optional parameters that control their behavior and enable complex string manipulation, especially when handling missing data.
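To make the trade-off concrete, here is a small sketch with made-up names and one missing value. The `+` operator simply propagates the missing value, while `str.cat` exposes the `na_rep` parameter to substitute a placeholder:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing first name
first = pd.Series(['arthur', np.nan, 'camila'])
last = pd.Series(['costa', 'silva', 'oliveira'])

# Operator: the missing value propagates and cannot be customized
with_plus = first + '.' + last

# Accessor method: na_rep replaces missing values before concatenating
with_cat = first.str.cat(last, sep='.', na_rep='unknown')

print(with_plus)
print(with_cat)
```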

In [39]:
# Concatenate two Series element-wise or
# Series with a scalar string.

s1 = pd.Series(['arthur', 'bruna', 'camila'])
s2 = pd.Series(['costa', 'silva', 'oliveira'])

s1 + '.' + s2 + '@email.com'

0       arthur.costa@email.com
1        bruna.silva@email.com
2    camila.oliveira@email.com
dtype: object

In [40]:
# Repeats each string in the Series
# a specified number of times.

s1 = pd.Series(['a', 'b', 'c'])
s1 * 3

0    aaa
1    bbb
2    ccc
dtype: object

In [41]:
# Equality comparison
s1 = pd.Series(['apple', 'banana', 'cherry'])
s1 == 'apple'

0     True
1    False
2    False
dtype: bool

In [42]:
# Slicing

s1 = pd.Series(['apple', 'banana', 'cherry'])
s1.str[:3]

0    app
1    ban
2    che
dtype: object

# String Manipulation on Index and Columns

String methods on Index objects are especially useful for cleaning up or transforming DataFrame columns. For instance, if you have columns with leading or trailing whitespace, you can use the `.str` accessor on `df.columns` since it is an Index object.

In [43]:
df = pd.DataFrame({
    'FruiTs': ['apple', 'banana', 'cherry'],
    '   Color_  ': ['red', 'yellow', 'red'],
    }, index=['a', 'b', 'c']
)
df

   FruiTs     Color_  
a   apple          red
b  banana       yellow
c  cherry          red


In [44]:
# Clean up DataFrame column names

df.columns = df.columns.str.lower().str.strip().str.replace('_', '')
df

   fruits   color
a   apple     red
b  banana  yellow
c  cherry     red


In [45]:
df[df.index.str.contains('a')]

  fruits color
a  apple   red
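The same idea works for selecting columns by name pattern. The sketch below uses a hypothetical frame (`df_cols`) and builds a boolean mask from the columns Index:

```python
import pandas as pd

# Hypothetical frame with a shared prefix on some column names
df_cols = pd.DataFrame({
    'fruit_name': ['apple', 'banana'],
    'fruit_color': ['red', 'yellow'],
    'price': [1.5, 0.5],
})

# Boolean mask over the columns Index, used with .loc to select columns
fruit_cols = df_cols.loc[:, df_cols.columns.str.startswith('fruit_')]
print(fruit_cols)
```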


# Enhancing Performance

As mentioned earlier, it's recommended to use `StringDtype` for storing text data because the `object` dtype is a generic type inherited from NumPy arrays, used to store any Python object type.

In contrast, `StringDtype` comes from pandas' extension type system and includes specific optimizations to enhance computational efficiency. Indeed, `StringDtype` arrays generally use much less memory and are often more computationally efficient for operations on large datasets, especially when using the Apache Arrow memory layout.

To illustrate the difference between the default `object` dtype and `StringDtype` in string manipulation, this section includes a very simple and naive benchmark.

The benchmark creates two DataFrames, each with a single column named `'fruits'` containing 1 million rows drawn from five different fruit names. Three operations are timed (length calculation, replacement, and containment), and the memory usage of the column is compared as well.

The benchmark uses the pyarrow-backed storage (`pd.StringDtype('pyarrow')`), which requires the pyarrow package to be installed.
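An existing `object` column can also be converted with `astype`. The following sketch uses made-up data and falls back to the `'python'` storage when pyarrow is not available; the fallback is an assumption made only to keep the example runnable everywhere:

```python
import pandas as pd

try:
    import pyarrow  # noqa: F401  (only checked for availability)
    storage = 'pyarrow'
except ImportError:
    storage = 'python'

# Hypothetical object-dtype column
fruits = pd.Series(['apple', 'banana', 'cherry'] * 1000, dtype='object')

# Convert to StringDtype with the chosen storage backend
fruits_string = fruits.astype(pd.StringDtype(storage))

print(fruits_string.dtype)
print(fruits.memory_usage(deep=True), fruits_string.memory_usage(deep=True))
```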

In [53]:
def show_time_results(object_results: dict, string_results: dict):
    speedup = {op: object_results[op] / string_results[op] for op in object_results if op != 'memory_usage'}

    table = f"{'Operation':<15}{'Object Time (s)':<20}{'StringDtype Time (s)':<25}{'Speedup':<10}\n"
    table += "-" * 70 + "\n"
    for op in speedup:
        table += f"{op:<15}{object_results[op]:<20.5f}{string_results[op]:<25.5f}{speedup[op]:<10.2f}\n"

    print(table)


def show_saving_results(object_results: dict, string_results: dict):
    savings = object_results['memory_usage'] / string_results['memory_usage']

    object_mem_use_mb = object_results['memory_usage'] / (1024 ** 2)
    string_mem_use_mb = string_results['memory_usage'] / (1024 ** 2)

    table = f"{'Memory Usage (MB)':<20}{'Object Size (MB)':<20}{'StringDtype Size (MB)':<25}{'Savings (x)':<15}\n"
    table += "-" * 80 + "\n"
    table += f"{'Memory Usage':<20}{object_mem_use_mb:<20.2f}{string_mem_use_mb:<25.2f}{savings:<15.2f}\n"

    print(table)

In [54]:
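def run_benchmark(df: pd.DataFrame):
    # Time each operation with time.perf_counter(), a monotonic
    # high-resolution clock intended for measuring short durations.
    results = {}

    start_time = time.perf_counter()
    df['fruits'].str.len()
    results['length'] = time.perf_counter() - start_time

    start_time = time.perf_counter()
    df['fruits'].str.replace('a', 'o')
    results['replace'] = time.perf_counter() - start_time

    start_time = time.perf_counter()
    df['fruits'].str.contains('a')
    results['contains'] = time.perf_counter() - start_time

    # Memory footprint of the column in bytes, including string contents
    results['memory_usage'] = df['fruits'].memory_usage(deep=True)

    return results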
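A single timed run can be noisy. As one possible refinement (not part of the benchmark above), the standard-library `timeit` module can repeat a measurement and keep the best run. The helper name `time_op` and the sample size are illustrative assumptions:

```python
import timeit

import numpy as np
import pandas as pd


def time_op(func, repeat=3, number=1):
    """Return the fastest of `repeat` timings of `func`, in seconds."""
    return min(timeit.repeat(func, repeat=repeat, number=number))


# Hypothetical, smaller sample than the 1 million rows used in the benchmark
rng = np.random.default_rng(0)
s = pd.Series(rng.choice(['apple', 'banana', 'cherry'], 10_000))

elapsed = time_op(lambda: s.str.len())
print(f'str.len: {elapsed:.5f} s')
```

Taking the minimum reduces the influence of unrelated system activity on the reported time.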

In [55]:
n_rows = 1_000_000

data = np.random.choice(['apple', 'banana', 'cherry', 'dragon fruit', 'elderberry'], n_rows)

df_object = pd.DataFrame({'fruits': data})
df_string = pd.DataFrame({'fruits': data}, dtype=pd.StringDtype('pyarrow'))

In [56]:
print(f'df_object\n{df_object.dtypes}', end='\n\n')
print(f'df_string\n{df_string.dtypes}')

df_object
fruits    object
dtype: object

df_string
fruits    string[pyarrow]
dtype: object


In [57]:
object_results = run_benchmark(df_object)
string_results = run_benchmark(df_string)

In [58]:
show_time_results(object_results, string_results)

Operation      Object Time (s)     StringDtype Time (s)     Speedup   
----------------------------------------------------------------------
length         0.18496             0.01083                  17.07     
replace        0.12840             0.02321                  5.53      
contains       0.14556             0.03796                  3.83      



In [59]:
show_saving_results(object_results, string_results)

Memory Usage (MB)   Object Size (MB)    StringDtype Size (MB)    Savings (x)    
--------------------------------------------------------------------------------
Memory Usage        61.80               15.07                    4.10           



# References

- [Python for Data Analysis by Wes McKinney (3e)](https://wesmckinney.com/book/)
- [Pandas Official Documentation](https://pandas.pydata.org/docs/user_guide/10min.html)
- [Frequently Asked Questions (FAQ) on Pandas](https://pandas.pydata.org/docs/user_guide/gotchas.html)

# Exercises

To help you understand the concepts covered in this notebook, here are some practice problems.

These questions refer to a dataset containing information on the type, cast, director, description and name of Netflix titles over the years. The dataset is available on [Kaggle by Shivam Bansal](https://www.kaggle.com/datasets/shivamb/netflix-shows).

# Note
You may need to specify the dataset path explicitly if you are using Windows.

In [4]:
import numpy as np
import pandas as pd

In [5]:
df = pd.read_csv('datasets/netflix-movies-and-tv-shows/netflix_titles.csv')
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


1. Get all rows where the country is "Brazil" and the `listed_in` category contains "Dramas".

In [6]:
cond = (df['country'] == 'Brazil') & (df['listed_in'].str.contains('Dramas'))
df[cond]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
650,s651,Movie,O Vendedor de Sonhos,Jayme Monjardim,"César Troncoso, Dan Stulbach, Thiago Mendonça,...",Brazil,"June 22, 2021",2016,TV-14,96 min,"Dramas, International Movies",A disillusioned psychologist tries to commit s...
1339,s1340,TV Show,Invisible City,,"Marco Pigossi, Alessandra Negrini, Fábio Lago,...",Brazil,"February 5, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Dramas","After a family tragedy, a man discovers mythic..."
1403,s1404,Movie,Double Dad,Cris D'Amato,"Maisa Silva, Eduardo Moscovis, Marcelo Médici,...",Brazil,"January 15, 2021",2020,TV-G,105 min,"Children & Family Movies, Comedies, Dramas","While her mom is away, a teen sneaks out of th..."
2127,s2128,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
2230,s2231,TV Show,Kissing Game,,"Caio Horowicz, Iza Moreira, Michel Joelsas, De...",Brazil,"July 17, 2020",2020,TV-MA,1 Season,"International TV Shows, TV Dramas, TV Mysteries","At a high school in a rural, isolated ranching..."
2359,s2360,TV Show,Most Beautiful Thing,,"Maria Casadevall, Pathy Dejesus, Fernanda Vasc...",Brazil,"June 19, 2020",2020,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",A 1950s housewife goes to Rio de Janeiro to me...
2984,s2985,TV Show,Omniscient,,"Carla Salle, Sandra Corveloni, Jonathan Haagen...",Brazil,"January 29, 2020",2020,TV-MA,1 Season,"International TV Shows, TV Dramas, TV Sci-Fi &...","In a city where citizens are monitored 24/7, a..."
3164,s3165,Movie,Nothing to Lose 2,Alexandre Avancini,"Petrônio Gontijo, Day Mesquita, Beth Goulart, ...",Brazil,"December 7, 2019",2019,PG-13,97 min,"Dramas, Faith & Spirituality, International Mo...",As controversy surrounds the evangelical churc...
3174,s3175,TV Show,The Chosen One,Michel Tikhomiroff,"Paloma Bernardi, Renan Tenca, Gutto Szuster, P...",Brazil,"December 6, 2019",2019,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries",Determined to bring a Zika vaccine to the remo...
3243,s3244,TV Show,Nobody's Looking,,"Victor Lamoglia, Júlia Rabello, Kéfera Buchman...",Brazil,"November 22, 2019",2019,TV-MA,1 Season,"International TV Shows, TV Comedies, TV Dramas","A new guardian ""angelus"" uncovers a secret beh..."


2. Get all rows where the cast includes the actor "Leonardo DiCaprio"

In [7]:
df[df['cast'].str.contains('Leonardo DiCaprio', na=False)]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
329,s330,Movie,Catch Me If You Can,Steven Spielberg,"Leonardo DiCaprio, Tom Hanks, Christopher Walk...","United States, Canada","August 1, 2021",2002,PG-13,142 min,Dramas,An FBI agent makes it his mission to put cunni...
340,s341,Movie,Inception,Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellio...","United States, United Kingdom","August 1, 2021",2010,PG-13,148 min,"Action & Adventure, Sci-Fi & Fantasy, Thrillers",A troubled thief who extracts secrets from peo...
392,s393,Movie,Django Unchained,Quentin Tarantino,"Jamie Foxx, Christoph Waltz, Leonardo DiCaprio...",United States,"July 24, 2021",2012,R,165 min,"Action & Adventure, Dramas","Accompanied by a German bounty hunter, a freed..."
1358,s1359,Movie,Shutter Island,Martin Scorsese,"Leonardo DiCaprio, Mark Ruffalo, Ben Kingsley,...",United States,"February 1, 2021",2010,R,139 min,Thrillers,A U.S. marshal's troubling visions compromise ...
1469,s1470,Movie,What's Eating Gilbert Grape,Lasse Hallström,"Johnny Depp, Leonardo DiCaprio, Juliette Lewis...",United States,"January 1, 2021",1993,PG-13,118 min,"Classic Movies, Dramas, Independent Movies","In a backwater Iowa town, young Gilbert is tor..."
6272,s6273,Movie,Before the Flood,Fisher Stevens,Leonardo DiCaprio,United States,"February 1, 2018",2016,PG,97 min,Documentaries,Leonardo DiCaprio crisscrosses the globe to in...
6826,s6827,Movie,Gangs of New York,Martin Scorsese,"Leonardo DiCaprio, Daniel Day-Lewis, Cameron D...","United States, Italy","August 20, 2019",2002,R,167 min,Dramas,In the crime-ridden slums of New York in the 1...
7865,s7866,Movie,Revolutionary Road,Sam Mendes,"Leonardo DiCaprio, Kate Winslet, Kathy Bates, ...","United States, United Kingdom","November 1, 2019",2008,R,120 min,"Dramas, Romantic Movies",April and Frank's marriage unravels when a pla...
8272,s8273,Movie,The Departed,Martin Scorsese,"Leonardo DiCaprio, Matt Damon, Jack Nicholson,...","United States, Hong Kong","January 1, 2021",2006,R,151 min,"Dramas, Thrillers",Two rookie Boston cops are sent deep undercove...


3. Get all rows where the title is composed of more than 5 words

In [8]:
df[df['title'].str.split().str.len() > 5]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
6,s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,"September 24, 2021",2021,PG,91 min,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...
10,s11,TV Show,"Vendetta: Truth, Lies and The Mafia",,,,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, Docuseries, International TV S...","Sicily boasts a bold ""Anti-Mafia"" coalition. B..."
16,s17,Movie,Europe's Most Dangerous Man: Otto Skorzeny in ...,"Pedro de Echave García, Pablo Azorín Williams",,,"September 22, 2021",2020,TV-MA,67 min,"Documentaries, International Movies",Declassified documents reveal the post-WWII li...
20,s21,TV Show,Monsters Inside: The 24 Faces of Billy Milligan,Olivier Megaton,,,"September 22, 2021",2021,TV-14,1 Season,"Crime TV Shows, Docuseries, International TV S...","In the late 1970s, an accused serial rapist cl..."
23,s24,Movie,Go! Go! Cory Carson: Chrissy Takes the Wheel,"Alex Woo, Stanley Moore","Maisie Benson, Paul Killam, Kerry Gudjohnsen, ...",,"September 21, 2021",2021,TV-Y,61 min,Children & Family Movies,From arcade games to sled days and hiccup cure...
...,...,...,...,...,...,...,...,...,...,...,...,...
8735,s8736,Movie,Who's That Knocking at My Door?,Martin Scorsese,"Zina Bethune, Harvey Keitel, Anne Collette, Le...",United States,"July 1, 2019",1967,R,90 min,"Classic Movies, Dramas, Independent Movies",A woman's revelation that she was once raped s...
8737,s8738,Movie,Why Are We Getting So Fat?,"Milla Harrison-Hansley, Alicky Sussman",Giles Yeo,United Kingdom,"February 1, 2019",2016,TV-14,50 min,Documentaries,A Cambridge geneticist dispels misconceptions ...
8739,s8740,Movie,Why We Fight: The Battle of Russia,"Frank Capra, Anatole Litvak",,United States,"March 31, 2017",1943,TV-PG,82 min,Documentaries,This installment of Frank Capra's acclaimed do...
8745,s8746,Movie,Willy Wonka & the Chocolate Factory,Mel Stuart,"Gene Wilder, Jack Albertson, Peter Ostrum, Roy...","United States, East Germany, West Germany","January 1, 2020",1971,G,100 min,"Children & Family Movies, Classic Movies, Come...",Zany Willy Wonka causes a stir when he announc...


4. Get all rows where the description contains the word "power" or the word "love".

In [9]:
df[df['description'].str.contains(r'\b(?:power|love)\b', case=False, regex=True)]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
25,s26,TV Show,Love on the Spectrum,,Brooke Satchwell,Australia,"September 21, 2021",2021,TV-14,2 Seasons,"Docuseries, International TV Shows, Reality TV",Finding love can be hard for anyone. For young...
26,s27,Movie,Minsara Kanavu,Rajiv Menon,"Arvind Swamy, Kajol, Prabhu Deva, Nassar, S.P....",,"September 21, 2021",1997,TV-PG,147 min,"Comedies, International Movies, Music & Musicals",A tangled love triangle ensues when a man fall...
30,s31,Movie,Ankahi Kahaniya,"Ashwiny Iyer Tiwari, Abhishek Chaubey, Saket C...","Abhishek Banerjee, Rinku Rajguru, Delzad Hiwal...",,"September 17, 2021",2021,TV-14,111 min,"Dramas, Independent Movies, International Movies","As big city life buzzes around them, lonely so..."
40,s41,TV Show,He-Man and the Masters of the Universe,,"Yuri Lowenthal, Kimberly Brooks, Antony Del Ri...",United States,"September 16, 2021",2021,TV-Y7,1 Season,"Kids' TV, TV Sci-Fi & Fantasy",Mighty teen Adam and his heroic squad of misfi...
47,s48,TV Show,The Smart Money Woman,Bunmi Ajakaiye,"Osas Ighodaro, Ini Dima-Okojie, Kemi Lala Akin...",,"September 16, 2021",2020,TV-MA,1 Season,"International TV Shows, Romantic TV Shows, TV ...",Five glamorous millennials strive for success ...
...,...,...,...,...,...,...,...,...,...,...,...,...
8674,s8675,Movie,Viceroy's House,Gurinder Chadha,"Hugh Bonneville, Gillian Anderson, Manish Daya...","United Kingdom, India, Sweden","December 12, 2017",2017,NR,106 min,Dramas,As viceroy Lord Mountbatten arrives in Delhi t...
8675,s8676,Movie,Victor,Brandon Dickerson,"Patrick Davis, Lisa Vidal, Josh Pence, José Zú...",United States,"July 24, 2017",2015,PG-13,110 min,Dramas,"In 1962 Brooklyn, a Puerto Rican teen who join..."
8680,s8681,Movie,Viswasapoorvam Mansoor,P.T. Kunju Muhammad,"Roshan Mathew, Asha Sarath, Prayaga Martin, Za...",India,"July 1, 2018",2017,TV-14,125 min,"Dramas, International Movies",When a mother and her daughter arrive to stay ...
8705,s8706,Movie,We Need to Talk,David Serrano,"Hugo Silva, Michelle Jenner, Ernesto Sevilla, ...",Spain,"September 1, 2016",2016,TV-MA,91 min,"Comedies, International Movies, Romantic Movies","A happy woman's new love is going great, but s..."
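The alternation `power|love` matches descriptions that contain either word. If both words are required in the same description, one option is to build a mask per word and combine them with `&`. The sketch below uses made-up descriptions rather than the Netflix dataset:

```python
import pandas as pd

# Hypothetical descriptions, not taken from the Netflix dataset
desc = pd.Series([
    'A story of love and power.',
    'Love conquers all.',
    'A powerful empire falls.',
    'Power without love is empty.',
])

# \b restricts the match to whole words, so 'powerful' does not count
has_power = desc.str.contains(r'\bpower\b', case=False)
has_love = desc.str.contains(r'\blove\b', case=False)

both = desc[has_power & has_love]
print(both)
```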


5. Show the top 10 titles with the biggest cast size. Display only the title, director, cast, cast size, country, and release year columns.

In [10]:
df['cast_size'] = df['cast'].str.split(',').str.len()

sorted_df = df.sort_values('cast_size', ascending=False)

sorted_df[['title', 'director', 'cast', 'cast_size', 'country', 'release_year']].head(10)

Unnamed: 0,title,director,cast,cast_size,country,release_year
3774,Black Mirror,,"Jesse Plemons, Cristin Milioti, Jimmi Simpson,...",50.0,United Kingdom,2019
1854,Social Distance,,"Danielle Brooks, Oscar Nuñez, Mike Colter, Hea...",50.0,United States,2020
4220,COMEDIANS of the world,,"Neal Brennan, Chris D'Elia, Nicole Byer, Nick ...",47.0,United States,2019
1639,Heartbreak High,,"Callan Mulvey, Lara Cox, Emma Roche, Ada Nicod...",47.0,Australia,1999
3449,Creeped Out,,"Victoria Diamond, William Romain, Sydney Wade,...",47.0,"United Kingdom, Canada",2019
6186,Arthur Christmas,Sarah Smith,"James McAvoy, Hugh Laurie, Bill Nighy, Jim Bro...",44.0,"United Kingdom, United States",2011
5305,Narcos,,"Wagner Moura, Pedro Pascal, Boyd Holbrook, Dam...",42.0,"United States, Colombia, Mexico",2017
3238,Dolly Parton's Heartstrings,,"Dolly Parton, Julianne Hough, Kimberly William...",41.0,United States,2019
5613,"Michael Bolton's Big, Sexy Valentine's Day Spe...","Scott Aukerman, Akiva Schaffer","Michael Bolton, Andy Samberg, Will Forte, Kenn...",41.0,United States,2017
1701,American Horror Story,,"Evan Peters, Sarah Paulson, Jessica Lange, Den...",40.0,United States,2019


6. Show the top 10 titles with the most `listed_in` categories.

In [11]:
df['listed_in_count'] = df['listed_in'].str.split(',').str.len()

sorted_df = df.sort_values('listed_in_count', ascending=False)
sorted_df.head(10)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,cast_size,listed_in_count
8806,s8807,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...,8.0,3
6482,s6483,Movie,Chronicle of an Escape,Israel Adrián Caetano,"Rodrigo de la Serna, Pablo Echarri, Nazareno C...",Argentina,"June 15, 2018",2006,R,104 min,"Dramas, International Movies, Thrillers",Soccer goalie Claudio Tamburrini is kidnapped ...,9.0,3
6472,s6473,Movie,Chittagong,Bedabrata Pain,"Manoj Bajpayee, Barry John, Delzad Hiwale, Veg...","United States, India, Bangladesh","January 1, 2018",2012,NR,105 min,"Dramas, Independent Movies, International Movies",In the turbulent 1930s of British colonial Ind...,9.0,3
6473,s6474,Movie,Chitty Chitty Bang Bang,Ken Hughes,"Dick Van Dyke, Sally Ann Howes, Lionel Jeffrie...","United Kingdom, United States","January 1, 2020",1968,G,146 min,"Children & Family Movies, Classic Movies, Come...",Quirky inventor Caractacus Potts and his famil...,15.0,3
3164,s3165,Movie,Nothing to Lose 2,Alexandre Avancini,"Petrônio Gontijo, Day Mesquita, Beth Goulart, ...",Brazil,"December 7, 2019",2019,PG-13,97 min,"Dramas, Faith & Spirituality, International Mo...",As controversy surrounds the evangelical churc...,9.0,3
6476,s6477,Movie,Christian Mingle,Corbin Bernsen,"Lacey Chabert, Jonathan Patrick Moore, Saidah ...",United States,"June 6, 2019",2014,PG,99 min,"Comedies, Faith & Spirituality, Romantic Movies",A career woman who has everything but romance ...,8.0,3
6479,s6480,Movie,Christmas in the Smokies,Gary Wheeler,"Sarah Lancaster, Barry Corbin, Alan Powell, Ji...",United States,"March 1, 2019",2015,TV-G,88 min,"Children & Family Movies, Dramas, Romantic Movies","In the Smoky Mountains, an ambitious woman wor...",8.0,3
6481,s6482,Movie,Christopher Robin,Marc Forster,"Ewan McGregor, Hayley Atwell, Bronte Carmichae...",United States,"March 5, 2019",2018,PG,104 min,"Children & Family Movies, Comedies, Dramas","Now a careworn middle-aged man, Christopher Ro...",8.0,3
6485,s6486,Movie,Chupke Chupke,Hrishikesh Mukherjee,"Dharmendra, Sharmila Tagore, Amitabh Bachchan,...",India,"December 31, 2019",1975,TV-PG,127 min,"Classic Movies, Comedies, International Movies",Jealous of the high regard in which his new wi...,10.0,3
3104,s3105,Movie,Como caído del cielo,Pepe Bojórquez,"Omar Chaparro, Ana Claudia Talancón, Stephanie...",Mexico,"December 24, 2019",2019,TV-14,117 min,"Comedies, International Movies, Music & Musicals","To earn his place in heaven, legendary Mexican...",7.0,3


7. Show the top 10 TV shows with the most seasons.

In [12]:
tv_shows_df = df[df['type']=='TV Show'].copy()

# Raw string avoids an invalid-escape warning; expand=False returns a Series
tv_shows_df['n_seasons'] = tv_shows_df['duration'].str.extract(r'(\d+)', expand=False).astype('int')
tv_shows_df.sort_values('n_seasons', ascending=False).head(10)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,cast_size,listed_in_count,n_seasons
548,s549,TV Show,Grey's Anatomy,,"Ellen Pompeo, Sandra Oh, Katherine Heigl, Just...",United States,"July 3, 2021",2020,TV-14,17 Seasons,"Romantic TV Shows, TV Dramas",Intern (and eventual resident) Meredith Grey f...,15.0,2,17
2423,s2424,TV Show,Supernatural,Phil Sgriccia,"Jared Padalecki, Jensen Ackles, Mark Sheppard,...","United States, Canada","June 5, 2020",2019,TV-14,15 Seasons,"Classic & Cult TV, TV Action & Adventure, TV H...","Siblings Dean and Sam crisscross the country, ...",9.0,3,15
4798,s4799,TV Show,NCIS,,"Mark Harmon, Michael Weatherly, Pauley Perrett...",United States,"July 1, 2018",2017,TV-14,15 Seasons,"Crime TV Shows, TV Dramas, TV Mysteries",Follow the quirky agents of the NCIS – the Nav...,14.0,3,15
1354,s1355,TV Show,Heartland,,"Amber Marshall, Michelle Morgan, Graham Wardle...",Canada,"February 1, 2021",2019,TV-14,13 Seasons,TV Dramas,Spunky teenager Amy is reeling from the sudden...,8.0,1,13
4220,s4221,TV Show,COMEDIANS of the world,,"Neal Brennan, Chris D'Elia, Nicole Byer, Nick ...",United States,"January 1, 2019",2019,TV-MA,13 Seasons,"Stand-Up Comedy & Talk Shows, TV Comedies",This global stand-up comedy series features a ...,47.0,2,13
7847,s7848,TV Show,Red vs. Blue,,"Burnie Burns, Jason Saldaña, Gustavo Sorola, G...",United States,,2015,NR,13 Seasons,"TV Action & Adventure, TV Comedies, TV Sci-Fi ...","This parody of first-person shooter games, mil...",10.0,3,13
4964,s4965,TV Show,Trailer Park Boys,,"Mike Smith, John Paul Tremblay, Robb Wells, Jo...",Canada,"March 30, 2018",2018,TV-MA,12 Seasons,"Classic & Cult TV, Crime TV Shows, Internation...",Follow the booze-fueled misadventures of three...,15.0,3,12
5412,s5413,TV Show,Criminal Minds,,"Mandy Patinkin, Joe Mantegna, Thomas Gibson, S...","United States, Canada","June 30, 2017",2017,TV-14,12 Seasons,"Crime TV Shows, TV Dramas, TV Mysteries",This intense police procedural follows a group...,11.0,3,12
6795,s6796,TV Show,Frasier,,"Kelsey Grammer, Jane Leeves, David Hyde Pierce...",United States,,2003,TV-PG,11 Seasons,"Classic & Cult TV, TV Comedies",Frasier Crane is a snooty but lovable Seattle ...,6.0,2,11
6456,s6457,TV Show,Cheers,,"Ted Danson, Rhea Perlman, George Wendt, John R...",United States,"July 1, 2017",1992,TV-PG,11 Seasons,"Classic & Cult TV, TV Comedies","Sam Malone, an ex-baseball player turned bar o...",9.0,2,11


8. Count how many titles feature "Emma Stone" in the cast.

In [13]:
mask = df['cast'].str.contains('Emma Stone', na=False)
df.loc[mask, 'title']

# Count
# df.loc[df['cast'].str.contains('Emma Stone', na=False), 'title'].count()

3527    The Mind, Explained
4629                 Maniac
6662                 Easy A
7515               Movie 43
8126               Superbad
8258             The Croods
8341               The Help
8804             Zombieland
Name: title, dtype: object
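Since a boolean mask counts `True` as 1, another way to get the count is to sum the mask directly. A minimal sketch with a made-up `cast` column:

```python
import pandas as pd

# Hypothetical cast column with one missing value
cast = pd.Series([
    'Emma Stone, Jonah Hill',
    None,
    'Ryan Gosling, Emma Stone',
    'Tom Hanks',
])

mask = cast.str.contains('Emma Stone', na=False)
n_titles = int(mask.sum())   # True counts as 1, False as 0
print(n_titles)
```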

9. Count how many movies "Steven Spielberg" has directed with the actor "Harrison Ford".

In [14]:
# na=False treats missing director/cast values as non-matches
mask = (
    df['director'].str.contains('Steven Spielberg', na=False)
    & df['cast'].str.contains('Harrison Ford', na=False)
    & (df['type']=='Movie')
)
df[mask]

# Count
# df.loc[mask, 'title'].count()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,cast_size,listed_in_count
7070,s7071,Movie,Indiana Jones and the Kingdom of the Crystal S...,Steven Spielberg,"Harrison Ford, Cate Blanchett, Karen Allen, Ra...",United States,"January 1, 2019",2008,PG-13,123 min,"Action & Adventure, Children & Family Movies, ...",Indiana Jones is drawn into a Russian plot to ...,8.0,3
7071,s7072,Movie,Indiana Jones and the Last Crusade,Steven Spielberg,"Harrison Ford, Sean Connery, Denholm Elliott, ...",United States,"January 1, 2019",1989,PG-13,127 min,"Action & Adventure, Children & Family Movies, ...","Accompanied by his father, Indiana Jones sets ...",10.0,3
7072,s7073,Movie,Indiana Jones and the Raiders of the Lost Ark,Steven Spielberg,"Harrison Ford, Karen Allen, Paul Freeman, Rona...",United States,"January 1, 2019",1981,PG,116 min,"Action & Adventure, Children & Family Movies, ...",When Indiana Jones is hired by the government ...,10.0,3
7073,s7074,Movie,Indiana Jones and the Temple of Doom,Steven Spielberg,"Harrison Ford, Kate Capshaw, Amrish Puri, Rosh...",United States,"January 1, 2019",1984,PG,119 min,"Action & Adventure, Children & Family Movies, ...","Indiana Jones, his young sidekick and a spoile...",10.0,3


10. Replace the country 'United States' with 'no geography knowledge'.

In [15]:
df['country'].str.replace('United States', 'no geography knowledge')

0       no geography knowledge
1                 South Africa
2                          NaN
3                          NaN
4                        India
                 ...          
8802    no geography knowledge
8803                       NaN
8804    no geography knowledge
8805    no geography knowledge
8806                     India
Name: country, Length: 8807, dtype: object

11. (Plus) Find the top 5 most frequent genres

In [16]:
genres = df['listed_in'].str.split(', ').explode()
genres

0                  Documentaries
1         International TV Shows
1                      TV Dramas
1                   TV Mysteries
2                 Crime TV Shows
                  ...           
8805    Children & Family Movies
8805                    Comedies
8806                      Dramas
8806        International Movies
8806            Music & Musicals
Name: listed_in, Length: 19323, dtype: object

In [17]:
genres.value_counts().head(5)

listed_in
International Movies      2752
Dramas                    2427
Comedies                  1674
International TV Shows    1351
Documentaries              869
Name: count, dtype: int64