# Most Influential Women in Pop Culture

I thought a lot about the topic of my first blog post, as I wanted it to be something special that represents me well and shows a snippet of my skill set at the same time. With International Women's Day just around the corner, I thought it would be exciting to analyze what and who inspires women the most. I could not help involving my other field of interest: I am going to investigate pop culture's impact through movies and TV shows.

### Data preparation

When I found the open government [data source](https://www.ssa.gov/oact/babynames/limits.html) of names given to babies born in the US between 1880 and 2017, I thought: what could be a better marker of impact than naming your own child after one of your heroes?

The annual birth data was in separate files, so in the first step I read them into a single dataframe.

In [3]:
import os
import pandas as pd

def read_baby_names():
    # Each file is named yobYYYY.txt and holds name,gender,count rows for one year.
    frames = []
    # Sort the file names so that rows end up in chronological order.
    for file in sorted(os.listdir("data/names")):
        if file.endswith(".txt"):
            df = pd.read_csv(os.path.join("data/names", file),
                             header=None,
                             names=["name", "gender", "count"])
            df["year"] = int(file.replace("yob", "").replace(".txt", ""))
            frames.append(df)
    # Concatenate all years into a single dataframe.
    return pd.concat(frames)

baby_names = read_baby_names()
baby_names.head()

| | name | gender | count | year |
|---|---|---|---|---|
| 0 | Mary | F | 7065 | 1880 |
| 1 | Anna | F | 2604 | 1880 |
| 2 | Emma | F | 2003 | 1880 |
| 3 | Elizabeth | F | 1939 | 1880 |
| 4 | Minnie | F | 1746 | 1880 |


### Top rated movies

Given the birth data, I needed a set of influential characters' names as candidates. Although IMDb doesn't provide an open API, I found [TMDb](https://www.themoviedb.org/) as an alternative that can be queried iteratively. It provides a daily updated "most popular" list of movies and TV shows, but that list contains almost exclusively recent titles, which doesn't help an analysis of long-term impact. On the other hand, it also provides a list of top rated titles, and critically successful titles usually have greater pop culture relevance.

In [6]:
import http.client
import json

# API_KEY (a personal TMDb API key) must be defined before running this cell.

def get_top_rated(media_type):
    conn = http.client.HTTPSConnection("api.themoviedb.org")

    conn.request("GET", "/3/{}/top_rated?page=1&language=en-US&api_key={}".format(media_type, API_KEY))

    raw_data = conn.getresponse().read()
    data = json.loads(raw_data.decode("utf-8"))

    top_rated = pd.DataFrame(columns=["id", "type", "title", "release_date"])
    for i in range(len(data["results"])):
        m = data["results"][i]
        top_rated.loc[i] = [m["id"], media_type, m["title"] if media_type == "movie" else m["name"],
                            m["release_date"] if media_type == "movie" else m["first_air_date"]]
    return top_rated

top_rated_movies = get_top_rated("movie")
top_rated_tv = get_top_rated("tv")
top_rated_titles = pd.concat([top_rated_movies, top_rated_tv])

top_rated_titles.head()

With the top rated titles in hand, I used TMDb's API again to get the character names of the top 2 billed cast members of each gender. Although the final goal is an analysis of female names, I was interested in the intermediate results for both genders.

In [13]:
def get_top_characters(media, number_of_characters):
    conn = http.client.HTTPSConnection("api.themoviedb.org")
    conn.request("GET", "/3/{}/{}/credits?api_key={}".format(media["type"], media["id"], API_KEY))

    raw_data = conn.getresponse().read()
    cast = json.loads(raw_data.decode("utf-8"))["cast"]

    names = pd.DataFrame(columns=["title", "release_year", "first_name", "gender"])

    inserted = 0
    for i in range(len(cast)):
        c = cast[i]
        # TMDb gender codes: 1 = female, 2 = male; skip anything else (e.g. 0 = not specified)
        if c.get("gender") not in (1, 2): continue
        c_gender = "F" if c["gender"] == 1 else "M"
        if len(names[names.gender == c_gender]) <= (number_of_characters - 1) / 2:
            c_name = c["character"]
            if not c_name: continue
           
            names.loc[inserted] = [media["title"], 
                                   media["release_date"][:4], 
                                   c_name.split()[0],
                                   c_gender]
            inserted += 1
        if inserted == number_of_characters:
            break
    return names


# Collect the per-title results and concatenate them into one dataframe.
names_top_rated_titles = pd.concat(
    [get_top_characters(top_rated_titles.iloc[i], 4) for i in range(len(top_rated_titles))])

names_top_rated_titles.head()
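
To make the selection rule concrete, here is a minimal offline sketch that applies the same idea to a hand-written cast list instead of a live API response. The sample cast, the helper name `pick_first_names` and the `per_gender` parameter are assumptions made for illustration; only the gender codes (1 for female, 2 for male) follow TMDb's convention.

```python
# Hypothetical sample in the shape of a TMDb credits "cast" list (not real API output).
sample_cast = [
    {"character": "Alice Smith", "gender": 1},
    {"character": "Bob", "gender": 2},
    {"character": "", "gender": 2},          # empty character names are skipped
    {"character": "Carol Jones", "gender": 1},
    {"character": "Dave", "gender": 2},
    {"character": "Erik", "gender": 2},      # third male entry, over the limit
    {"character": "Robin", "gender": 0},     # gender not specified, skipped
]

GENDER_CODES = {1: "F", 2: "M"}

def pick_first_names(cast, per_gender):
    """Return (first_name, gender) pairs, at most per_gender per gender, in billing order."""
    picked = []
    counts = {"F": 0, "M": 0}
    for member in cast:
        gender = GENDER_CODES.get(member.get("gender"))
        character = (member.get("character") or "").strip()
        if gender is None or not character:
            continue
        if counts[gender] >= per_gender:
            continue
        # Keep only the first word of the character name as the first name.
        picked.append((character.split()[0], gender))
        counts[gender] += 1
    return picked

selected = pick_first_names(sample_cast, per_gender=2)
print(selected)
```

Taking the first word of the character name is a rough heuristic: it works for "Alice Smith" but would also pick up titles such as "Dr." or "The".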

### Visualization

There are several options for data visualization in Python: general-purpose libraries like [Matplotlib](https://matplotlib.org/) or [ggplot](https://pypi.org/project/ggplot/), and more specialized ones like [Geoplotlib](https://pypi.org/project/geoplotlib/). You can read a nice summary of them in [this](https://www.fusioncharts.com/blog/best-python-data-visualization-libraries/) article. In this project I used [Plotly](https://plot.ly/python/) because it is suitable not only for exporting static images but also offers powerful interactive features. At the end of this post you will find links to the online visualizations, where you can truly explore the data in detail. I really encourage you to visit them, preferably from a desktop browser.

In [30]:
import plotly.io as pio
import plotly.graph_objs as go

def plot_rising_names(names, baby_names, min_growth, following_years, title, output_file, x_range=None, y_range=None):
    plot_data = []
    names = names[names.first_name.isin(baby_names.name)]

    for i in range(len(names)):
        name = names.iloc[i]
        name_trend = baby_names[baby_names.name == name.first_name]
        name_trend = name_trend[name_trend.gender == name.gender]
        name_trend = name_trend[name_trend.year.isin(range(int(name.release_year), int(name.release_year) + following_years))]
        if name_trend.empty: continue
    
        growth = name_trend["count"].max() / name_trend["count"].iloc[0]
        if growth < min_growth: continue
        
        plot_data.append(
                go.Scatter(
                    x=name_trend['year'],
                    y=name_trend['count'],
                    mode='lines',
                    name="{} ({})".format(name.first_name, name.title)
                )
            )
        
    layout = go.Layout(
        title=title,
        paper_bgcolor='rgb(255,255,255)',
        plot_bgcolor='rgb(229,229,229)',
        xaxis=dict(
            title='Year',
            gridcolor='rgb(255,255,255)',
            showgrid=True,
            showline=False,
            showticklabels=True,
            tickcolor='rgb(127,127,127)',
            ticks='outside',
            tickmode='linear',
            zeroline=False
        ),
        yaxis=dict(
            title='Number of Babies Named',
            gridcolor='rgb(255,255,255)',
            showgrid=True,
            showline=False,
            showticklabels=True,
            tickcolor='rgb(127,127,127)',
            ticks='outside',
            zeroline=True
        )
    )
    
    if x_range is not None:
        layout.xaxis.range = x_range
    if y_range is not None:
        layout.yaxis.range = y_range

    fig = go.Figure(data=plot_data, layout=layout)
    pio.write_image(fig, output_file, height=800, width=1200)

plot_rising_names(names = names_top_rated_titles,
                  baby_names=baby_names, 
                  min_growth=2.0,
                  following_years=10,
                  title="Names with Growing Popularity in 10 Years After Movie/TV Release", 
                  output_file="images/rising_names_top_rated.png")

In the figure above you can see all the names that at least doubled in popularity within 10 years of the title's release. Each line starts at the year of release and shows the trend over the following 10 years. As the lines close to the x axis might look insignificant at this scale, I made a zoomed version to show them in more detail.
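
The growth criterion used for filtering is the highest yearly count within the window divided by the count in the first year of the window that has data. Here is a tiny worked sketch with made-up counts; the numbers and the helper name `growth_ratio` are illustrative assumptions, not values from the dataset.

```python
def growth_ratio(counts_by_year, release_year, following_years):
    """Peak count in the window divided by the count in the first year with data."""
    window = [counts_by_year[year]
              for year in range(release_year, release_year + following_years)
              if year in counts_by_year]
    if not window:
        return None  # the name does not appear in the window at all
    return max(window) / window[0]

# Made-up yearly counts for a single name.
toy_counts = {2000: 100, 2001: 150, 2002: 260, 2003: 240}

ratio = growth_ratio(toy_counts, release_year=2000, following_years=10)
print(ratio)  # 260 / 100, so this name would pass min_growth=2.0
```

Because the peak is taken over a window that includes the first year, the ratio can never fall below 1.0; a declining name simply scores 1.0.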

In [29]:
plot_rising_names(names = names_top_rated_titles,
                  baby_names=baby_names, 
                  min_growth=2.0,
                  following_years=10,
                  title="Names with Growing Popularity in 10 Years After Movie/TV Release - Zoomed", 
                  output_file="images/rising_names_top_rated_zoomed.png",
                  y_range = [0, 400])

### IMDb lists as candidates

As a first iteration, visualizing popular names from top rated movies worked well, but the average rating of a movie isn't necessarily the best predictor of pop culture significance. In the next step I exported the [Top Pop Culture Films](https://www.imdb.com/list/ls002272292/) and [Top Pop Culture TV Shows](https://www.imdb.com/list/ls022768086/) lists from IMDb. Although these lists are highly subjective, they include so many films and TV shows that most of the relevant titles seem to be covered.

In [4]:
# "ANSI" is a Windows-only codec alias; on other platforms an explicit code page such as "cp1252" may be needed.
pop_culture_films = pd.read_csv("data/imdb_top_100_pop_culture_films.csv", encoding='ANSI')
pop_culture_tv = pd.read_csv("data/imdb_pop_culture_tv_shows.csv", encoding='ANSI')

pop_culture_imdb_df = pd.concat([pop_culture_tv, pop_culture_films])
pop_culture_imdb_df.head()

| | Position | Const | Created | Modified | Description | Title | URL | Title Type | IMDb Rating | Runtime (mins) | Year | Genres | Num Votes | Release Date | Directors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | tt2788432 | 2018-06-15 | 2018-06-15 | | American Crime Story | https://www.imdb.com/title/tt2788432/ | tvSeries | 8.5 | 42.0 | 2016 | Biography, Crime, Drama | 64542 | 2015-12-07 | |
| 1 | 2 | tt4093826 | 2018-06-15 | 2018-06-15 | | Twin Peaks | https://www.imdb.com/title/tt4093826/ | tvSeries | 8.5 | 60.0 | 2017 | Crime, Drama, Fantasy, Mystery, Thriller | 41100 | 2017-05-19 | |
| 2 | 3 | tt3531824 | 2018-06-15 | 2018-06-15 | | Nerve | https://www.imdb.com/title/tt3531824/ | movie | 6.5 | 96.0 | 2016 | Action, Adventure, Crime, Drama, Mystery, Thri... | 100607 | 2016-07-12 | Henry Joost, Ariel Schulman |
| 3 | 4 | tt0288937 | 2018-06-15 | 2018-06-15 | | Degrassi: The Next Generation | https://www.imdb.com/title/tt0288937/ | tvSeries | 7.4 | 30.0 | 2001 | Drama, Romance | 10091 | 2001-10-14 | |
| 4 | 5 | tt4154858 | 2018-06-15 | 2018-06-15 | | Inhumans | https://www.imdb.com/title/tt4154858/ | tvMiniSeries | 5.1 | 43.0 | 2017 | Action, Adventure, Sci-Fi | 18439 | 2017-08-28 | |


After having the lists exported and read in, let's translate them to TMDb's format, so we can then iteratively search for their character names.

In [11]:
import time

def get_tmdb_data(imdb_df):
    conn = http.client.HTTPSConnection("api.themoviedb.org")
    tmdb_titles = pd.DataFrame(columns=["id", "type", "title", "release_date"])
    for i in range(len(imdb_df)):
        imdb_item = imdb_df.iloc[i]
        conn.request("GET",
                 "/3/find/{}?api_key={}&language=en-US&external_source=imdb_id".format(imdb_item["Const"], API_KEY))
        # TMDb API has a request rate limit 40 / 10 seconds
        time.sleep(0.25)
        
        raw_data = conn.getresponse().read()
        data = json.loads(raw_data.decode("utf-8"))
        
        if len(data["movie_results"]) > 0:
            media_type = "movie"
            m = data["movie_results"][0]
        elif len(data["tv_results"]) > 0:
            media_type = "tv"
            m = data["tv_results"][0]
        else: continue
        
        tmdb_titles.loc[i] = [m["id"], media_type, m["title"] if media_type == "movie" else m["name"],
                            m["release_date"] if media_type == "movie" else m["first_air_date"]]
    return tmdb_titles

pop_culture_titles = get_tmdb_data(pop_culture_imdb_df)
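
The fixed `time.sleep(0.25)` above always waits a quarter of a second, even when the request itself already took longer than that. A small throttle helper that only sleeps for the remaining time could look like the sketch below. The class name `Throttle` and its interface are hypothetical and not part of any TMDb client; the interval is derived from the 40 requests per 10 seconds limit quoted in the code comment.

```python
import time

class Throttle:
    """Blocks so that consecutive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

# 40 requests per 10 seconds translates to 0.25 s between requests.
throttle = Throttle(min_interval=10 / 40)
```

In the loop, `throttle.wait()` would be called right before each `conn.request(...)` instead of sleeping unconditionally afterwards.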

In [14]:
pop_culture_name_frames = []
for i in range(len(pop_culture_titles)):
    pop_culture_name_frames.append(get_top_characters(pop_culture_titles.iloc[i], 4))
    # Stay below TMDb's request rate limit.
    time.sleep(0.25)
pop_culture_names = pd.concat(pop_culture_name_frames)
pop_culture_names.head()

| | title | release_year | first_name | gender |
|---|---|---|---|---|
| 0 | American Crime Story | 2016 | Andrew | M |
| 1 | American Crime Story | 2016 | Gianni | M |
| 2 | American Crime Story | 2016 | Donatella | F |
| 0 | Nerve | 2016 | Vee | F |
| 1 | Nerve | 2016 | Ian | M |


In [31]:
plot_rising_names(names = pop_culture_names,
                  baby_names=baby_names, 
                  min_growth=2.0,
                  following_years=10,
                  title="Names with Growing Popularity (Candidates from IMDb Lists)", 
                  output_file="images/rising_names_imdb.png")

Well, it is nice to have a lot of data to investigate, but it is also somewhat crowded. If you're interested in all of them, look at the interactive version right [here](https://plot.ly/~george.katona/24/names-growing-popularity-from-imdb-lists/). Otherwise let's sort it by the popularity growth rate and show only the top 10 results.

### Names with the highest growth in popularity

To find the names with the highest growth in popularity, I introduced a new column, "max_growth". Computing it takes a while, but afterwards we can sort and query on it.

In [17]:
def get_names_with_max_growth(names, baby_names, following_years):
    names = names[names.first_name.isin(baby_names.name)]
    
    names_with_max_growth = pd.DataFrame(columns=["title", "release_year", "first_name", "gender", "max_growth"])
    
    inserted = 0
    for i in range(len(names)):
        name = names.iloc[i]
        if name.release_year == "": continue
        name_trend = baby_names[baby_names.name == name.first_name]
        name_trend = name_trend[name_trend.gender == name.gender]
        name_trend = name_trend[name_trend.year.isin(range(int(name.release_year), int(name.release_year) + following_years))]
        if name_trend.empty: continue
        
        max_growth = name_trend["count"].max() / name_trend["count"].iloc[0]
        names_with_max_growth.loc[inserted] = [name["title"],
                                      name["release_year"],
                                      name["first_name"],
                                      name["gender"],
                                      max_growth]
        inserted += 1
    return names_with_max_growth

        
pop_culture_names_with_max_growth = get_names_with_max_growth(names=pop_culture_names, 
                                            baby_names=baby_names,
                                            following_years=10)
pop_culture_names_with_max_growth.head()

| | title | release_year | first_name | gender | max_growth |
|---|---|---|---|---|---|
| 0 | American Crime Story | 2016 | Andrew | M | 1.0 |
| 1 | American Crime Story | 2016 | Gianni | M | 1.1875 |
| 2 | American Crime Story | 2016 | Donatella | F | 1.0 |
| 3 | Nerve | 2016 | Ian | M | 1.0 |
| 4 | Nerve | 2016 | Sydney | F | 1.0 |
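
The row-by-row loop above is easy to follow but slow on large inputs. As a possible alternative, the same ratio can be computed with a merge and a group-by. The sketch below runs on toy dataframes with made-up values; the function name `max_growth_vectorized` and the helper column `start` are assumptions introduced for illustration, and this is not the code that produced the table above.

```python
import pandas as pd

# Toy stand-ins for the two dataframes used above (values are made up).
toy_baby_names = pd.DataFrame({
    "name":   ["Ann", "Ann", "Ann", "Bo", "Bo"],
    "gender": ["F", "F", "F", "M", "M"],
    "count":  [10, 30, 20, 50, 40],
    "year":   [2000, 2001, 2002, 2000, 2001],
})
toy_names = pd.DataFrame({
    "title": ["Show A", "Show B"],
    "release_year": ["2000", "2000"],
    "first_name": ["Ann", "Bo"],
    "gender": ["F", "M"],
})

def max_growth_vectorized(names, baby_names, following_years):
    # Drop rows without a release year and keep the year as an integer.
    names = names[names["release_year"] != ""].copy()
    names["start"] = names["release_year"].astype(int)
    # Attach every yearly count of the matching name and gender to each character.
    merged = names.merge(baby_names,
                         left_on=["first_name", "gender"],
                         right_on=["name", "gender"])
    # Keep only the years inside the window [start, start + following_years).
    in_window = merged[(merged["year"] >= merged["start"]) &
                       (merged["year"] < merged["start"] + following_years)]
    # Sort by year so that first() picks the earliest year in the window.
    in_window = in_window.sort_values("year")
    grouped = in_window.groupby(["title", "release_year", "first_name", "gender"])["count"]
    growth = (grouped.max() / grouped.first()).rename("max_growth")
    return growth.reset_index()

result = max_growth_vectorized(toy_names, toy_baby_names, following_years=10)
print(result)
```

One difference from the loop: rows that share the same title, release year, first name and gender collapse into a single row in the grouped result.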


Now let's visualize the top 10 female names that gained the most popularity in the 10 years after release.

In [24]:
female_names = pop_culture_names_with_max_growth[pop_culture_names_with_max_growth.gender == "F"]

plot_rising_names(names = female_names.nlargest(10, "max_growth"),
                  baby_names=baby_names, 
                  min_growth=2.0,
                  following_years=10,
                  title="Top 10 Growing Female Names",
                  output_file="images/top_10_growing_female_names.png")

The figure above suggests that although there were far fewer popular TV shows and movies before 1999, some titles had a huge impact. On the other hand, it isn't very surprising that the 21st century has produced many shows and movies with great popularity, as it has become so much easier to access our favorites. Let's now focus on the titles released after 1999.

In [35]:
plot_rising_names(names = female_names[female_names.release_year > "1999"].nlargest(10, "max_growth"),
                  baby_names=baby_names, 
                  min_growth=2.0,
                  following_years=10,
                  title="Top 10 Growing Female Names in the 21st Century",
                  output_file="images/top_10_growing_female_names_2000.png")

Let's zoom in again on the lines close to the x axis.

In [34]:
plot_rising_names(names = female_names[female_names.release_year > "1999"].nlargest(10, "max_growth"),
                  baby_names=baby_names, 
                  min_growth=2.0,
                  following_years=10,
                  title="Top 10 Growing Female Names in the 21st Century - Zoomed",
                  output_file="images/top_10_growing_female_names_2000_zoomed.png",
                  y_range = [0, 500])

### Summary

The data shows us not only that we live in the best years so far to gain inspiration from female movie and TV characters, but also what a huge impact these shows have on our lives. The world has never been so deeply connected, and for this reason the movie industry has never had a greater responsibility to tell us stories that can really help us become, or raise, our heroes.

*Links to interactive plots:*

- [Names with Growing Popularity [Top Rated Titles]](https://plot.ly/~george.katona/26/names-with-growing-popularity-in-10-years-after-movietv-release/)
- [Names with Growing Popularity [IMDb Lists]](https://plot.ly/~george.katona/28/names-with-growing-popularity-candidates-from-imdb-lists/)
- [Top 10 Growing Female Names](https://plot.ly/~george.katona/30/top-10-growing-female-names/)
- [Top 10 Growing Female Names in the 21st century](https://plot.ly/~george.katona/32/top-10-growing-female-names-in-the-21st-century/)